# Introduction to Python for (Probability and) Statistics
*Paulo Serra (VU, Amsterdam), 2021, version 1.0.*

## Before you start

Before you start reading this guide, you should get acquainted with the user interface of Jupyter Notebook. 
Click the menu **[Help]** on the top right and select **[User interface tour]** and follow the instructions. 
I suggest that you also have a look at **[Help]** and then **[Keyboard Shortcuts]**. 
These are quite useful for performing tasks more quickly.

## Writing and running code and text on Jupyter Notebook
Code on Jupyter Notebook is organised into **Cells**. 
Each cell can contain several lines of content. 
Cells are a nice way of organising your code by splitting it into blocks. 

Jupyter Notebook has two modes: **Command mode** and **Edit mode**. 
You can go into edit mode on an active notebook by hitting the **[Enter]** key, and you can go into command mode by hitting the **[esc]** key. 
This is reflected by the colour of the cell where you can enter code being blue (while in command mode) or green (while in edit mode). 
You can also just click the content of a cell to go into edit mode. 

In edit mode, you can alter the content of a cell, while command mode is used when running the content of the cell.

### Markdown cells
Cells can contain things other than code such as text with instructions or comments on the code, conclusions that you draw based on outputs, to-do lists, etc. 
You can use Jupyter Notebook to essentially write up a report whose results can be reproduced and checked by the reader; *reproducibility* of results is crucial for good science. 

Markdown is a way of creating formatted text using plain text. 
You can change the type of a cell to markdown by clicking the menu **[Cell]** and choosing **[Cell type]** and then **[Markdown]**. 
You can also change the type of the cell with the drop down menu next to the icons above.
The first five cells in this notebook are of the markdown type, while the fifth (now empty) is of the Code type. 
The fourth cell looks different from the preceding ones because the markdown on it hasn't been run yet. 

Some examples of markdown code:
1. text sandwiched between \* gets outputted as italic; 
2. text sandwiched between \** gets outputted in bold;
3. adding \#, \##, etc., followed by a space at the start of a line gives you headers,  sub-headers, etc.;

You can easily find other markdown commands by searching online.

You can also add Latex code to a markdown cell by sandwiching it between \$\$. 
If you are unfamiliar with Latex, it is a markup language that easily allows you to type up nicely formatted math formulas and is often used in professionally typeset scientific documents.

HTML is another markup language that you can use in your notebooks to get extra functionality like centering, adding links to webpages, etc.

### Further reading
The Jupyter Notebook website contains notebooks that show you examples of many other things you can do within a Jupyter Notebook. 
Have a look at it to learn more. 
You can also easily find lots of resources about markdown and Latex online.


👉🏻 Try adding some markdown and Latex below the text that is already in the cell right below and running it:
1. add:
<p style="text-align: center;"><b>Example of some *Latex* code in a *Markdown* cell:</b></p>

2. add: 
<p style="text-align: center;"><b>\$\$e^{i\pi}+1=0.\$\$</b></p>

3. now run the code on that cell by hitting the **[> Run]** button above with the cell selected (green highlight to the left); pretty fancy, right?

#### An example of some *Latex* code in Jupyter Notebook:
Example of some *Latex* code in a *Markdown* cell:
$$e^{i\pi}+1=0$$

## Getting started and getting help
With this cell selected, click **[Insert]** and then **[Insert Cell Below]** to get a new cell below this one. 
By default, a new cell is a **Code** cell. 
Also by default, this new cell is now highlighted in blue (command mode).


👉🏻 Click on the newly created cell (or hit **[Enter]**) to go into edit mode, type 

<p style="text-align: center;"><b>print("Hello world")</b></p>

and run the cell.


The command **print** is followed by **(...)** which indicates that it is a function. The **...** are arguments or inputs that you pass to the function. 
The **print** function, in particular, can take a string (a sequence of characters, always sandwiched between **"..."** or **'...'**) as an argument. 
It then outputs that string. 

💡 You can get help on any function by running **?...**, e.g., **?print**. 
This will tell you what inputs the function expects and what outputs it produces.

You can see that the output of the code in your cell is reported below the cell and you can have several **print** commands in one cell if you want to report several outputs for that cell.

After running, you'll see **In [1]:** to the top left of the cell, and below the cell the output **Hello world**. 
Run the cell again; you'll now see **In [2]:**.

Each line of code represents an instruction, and running instructions in different orders leads to different outputs (or errors), so the numbers inside the **[ ]** are handy to keep track of the order in which you ran your blocks of code.

While a cell is still running the cell will be tagged as **[\*]**. 
You will only ever see this when running a cell that contains code that takes a long time to run. 
If a cell takes too long to run there might be a problem and you can interrupt the execution by clicking **[Kernel]** and **[Interrupt]**.

💡 **This gets out of the way all of the basics of using Jupyter Notebook.
From this point onwards all the content of the guide is about Python.**

## Simple data input and manipulation
You can use Python to do some arithmetic.

👉🏻 Run the code in the following cell:

In [None]:
print(3+1,3**2,2*3-1)

The results should be self-evident except maybe for the symbol **\*\*** which represents taking a power. 
It might be a good point to mention that a lot of things in Python can be done in multiple ways. 
Running the function **pow(3,2)** is the same as doing **3\*\*2**. 
Calling **?pow** by running the cell below will point this out to you.

👉🏻 Run the code in the following cell:

In [None]:
# how to get help:
?pow

First note that Python ignored the line starting with \#: this is how we add comments to code.

From the description of the function **pow** you can see that it expects arguments **base** and **exp** (we return later to what "mod=None" means when we talk about creating your own functions; for now just pretend it's not there.)

👉🏻 Run the code in the following cell:

In [None]:
print(pow(3,2))

Like with the **print** function, **()** indicates that **pow** is a function and takes in some arguments: 3 - the base - and 2 - the exponent; you can see this from the documentation.
When you run **pow(3,2)** it will assume that you want to set **base=3**, and **exp=2**. 

You can also just specify what input gets assigned to what argument. 
In fact, if you want to pass the inputs to the function in an order other than the one specified in the definition of the function, then you *need* to do this - Python can't guess what order you mean.

👉🏻 Run the code in the following cell to produce the same output as in the previous:

In [None]:
print(pow(exp=2, base=3))

Note that if you run **pow(3,2)** on its own, without wrapping it in **print**, the result is shown with an **Out [...]** label in red; this is what happens for most functions. 
Also, contrary to the print function, if you run more than one computation in a code block, only the last computation is reported. 

Adding **;** at the end of a command suppresses the output.

👉🏻 Run the code in the following two cells:

In [None]:
2**2
2**3

In [None]:
2**2
2**3;

Intermediate computations in a code block are, in some sense, lost so you can store intermediate computations in objects called variables.

👉🏻 Run the code in the following two cells:

In [None]:
x = 2
y = 3
w = z = 1

In [None]:
print(x,y,w,z)

The numbers **2** and **3** are now stored in variables **x** and **y** which are now stored in memory and can be reused in later cells. 
Note that if you had run the cell containing the **print** command before the one above you would get an error since the values **2** and **3** would not have been stored into the variables **x** and **y** yet.

The error that you would have gotten would have been **name 'x' is not defined** since Python does not know what **x** is (yet). 
It is important to read errors carefully to see what went wrong.

In the previous block you also saw that you can assign the same value to two (or more!) variables at the same time with **w = z = 1**.

### 📝 Solve the following exercises in the cell below:

In each of the following exercises I ask you to make a computation.
Store the results of each computation in variables **e1**, **e2**, ... and print them all at the end.
(Remember that you can ask for help with **?...**, and searching online is often helpful.)

**(1)** 
Compute the volume of a cube with edges of length $s=10$.

**(2)** 
Suppose that you put $c=5$€ in the bank.
You get an interest rate of $r=0.5\%$ per month so that after one month your $5$€ turn into $(5 + 5\times 0.5/100)$€.
How much money will you have after $m=6$ months?

**(3)** 
How much would you have earned **in interest** if instead you kept the money in the bank for $n=10$ years?


## Structures for storing information

### Lists

We often want to store and manipulate collections of objects. 
Lists can be used for this purpose and are created using **[...]**. 
You can create an empty list with **[]** or **list()**.

Note that different types of objects can be collected in a list.

👉🏻 Run the code in the following cell:

In [None]:
x = ["apple", 123, False]
print(x)

You can concatenate lists using the **+** sign.

👉🏻 Run the code in the following cell:

In [None]:
["apple"] + ["orange"]

Perhaps a bit confusingly, entries of a list can be accessed with **[...]**; don't confuse this with the list itself.

The first entry of a list corresponds to index 0 (not 1.)
This is quite common in programming languages.

The function **len** gives us the length of a list.

👉🏻 Run the code in the following cell:

In [None]:
x = ["apple", "cherry", "strawberry", "banana", "kiwi"]
print(x[0])
print(len(x))

So "apple" occupies the 0-th entry of the list x. 
Quite often we want to extract several entries of a list. 
The following are common ways of accessing some of the entries of a list.

👉🏻 Run the code in the following cell:

In [None]:
print(x[0:2])
print(x[-1])
print(x[-1] == x[len(x)-1])
print(x[-3:-1])
print(x[-3:])
print(x[:2])
print(x[0::2])
print(x[3:1:-1])
# print(x[5])

Let's unpack that:
- The operator **:** as in **0:2** gives us all indices from 0 to 2 *excluding* 2, so first index is *inclusive* and the second *exclusive*;
- The index -1 indicates the last element of the list; this is quite handy since you don't need to know the length of the list. (It might now be clearer why we start counting at 0; to be consistent the last entry would have to be -0 but that is equal to 0);
- **-3:-1** indexes the third entry from the end up to (but excluding) the last;
- **-3:** indexes the third entry from the end up to the end of the list;
- **:2** indexes all entries up to (but excluding) the one with index 2 (third entry);
- The format **a:b:c** gives every **c**-th index from **a** up to (but excluding) **b**, so **0::2** gives every second index from the first to the end; for our list these are the indices **[0, 2, 4]**. Note that **0::2** is the same as **::2**, and **0::1** is the same as **0:** or just **:** (or **::**) which means all.
- Negative steps are allowed, so **3:1:-1** starts at index **3** and goes down one by one to index **1** (exclusive). Doing **::-1**, for instance, reverses the list.
- if you try to access a non-existent index then you get the error "list index out of range";
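
The slicing rules above can be checked on a small list of numbers in which each entry equals its own index. This is only an illustrative sketch; the list **nums** and the other variable names are made up for this example and are not used elsewhere in the guide.

```python
# A list whose entries equal their own indices, so slices are easy to read.
nums = [0, 1, 2, 3, 4]

first_two = nums[0:2]       # indices 0 and 1 (index 2 is excluded)
last = nums[-1]             # the last entry
last_three = nums[-3:]      # third entry from the end up to the end
every_second = nums[::2]    # indices 0, 2, 4
backwards = nums[3:1:-1]    # indices 3 and 2 (index 1 is excluded)
reversed_nums = nums[::-1]  # the whole list in reverse order

print(first_two)      # [0, 1]
print(last)           # 4
print(last_three)     # [2, 3, 4]
print(every_second)   # [0, 2, 4]
print(backwards)      # [3, 2]
print(reversed_nums)  # [4, 3, 2, 1, 0]
```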

Another useful thing to know is that Python lets you index strings in much the same way as lists, treating a string as a sequence of characters, so you can also use the indexing from above to extract parts of a string.

👉🏻 Think about the outcomes for the code in the following cell and then run the cell:

In [None]:
print("banana"[::-1])
print("banana"[2:])
print("banana"[1::2])

Unfortunately, getting elements corresponding to arbitrary indices (i.e., indices that are not expressible in terms of the **::** operator) from a list is not so straightforward. 
(We will return to this after we discuss **for** loops.) 
That, plus the fact that large lists can be rather slow to work with, makes **arrays** a more attractive structure. (We discuss arrays later on when we talk about the NumPy library.) 
However, lists still have many uses and it is essential that you know how to work with them.

We now move on to so called **methods**. 
We can use methods to modify objects like lists. 
These are of the form **.method_name(arguments)** and you can think of them as a special function that either modifies what comes before it or uses what comes before it as an argument.

Methods often make code more readable since they can be stacked.
We could define a function **function(input, args)** that just returns **input.method(args)** so it may seem that methods are redundant.
However, imagine that you want to apply several methods sequentially;
while using methods **input.method1(args1).method2(args2).method3(args3)** is quite readable, the equivalent code using functions **function3(function2(function1(input,args1),args2),args3)** seems more challenging to parse visually...
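
As a small illustration of stacking methods, the sketch below cleans up a string with three standard string methods (**.strip**, **.lower**, and **.replace**) applied one after the other, and then does the same with nested function-style calls. The variable names are made up for this example.

```python
raw = "  Hello World  "

# Methods are applied left to right: remove outer spaces, lowercase, replace the space.
cleaned = raw.strip().lower().replace(" ", "_")
print(cleaned)  # hello_world

# The same steps written as nested function calls have to be read inside out.
cleaned_nested = str.replace(str.lower(str.strip(raw)), " ", "_")
print(cleaned_nested == cleaned)  # True
```

Stacking works here because each of these string methods returns a new string, to which the next method is then applied.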

👉🏻 Run the code in the following cell for some examples of methods to modify lists:

In [None]:
x = ["apple", "kiwi"]
x.insert(0,"pineapple")
print(x)

x.append("melon")
print(x)

x.extend(["cherry", "tomato"])
print(x)

x.append(["banana", "strawberry"])
print(x)

Let's unpack that:
- **.insert** adds exactly one element to a given position in the list;
- **.append** adds exactly one element to the end of the list;
- to append more than one element at a time, you use **.extend**;
- note that using append does not do what we want. Rather than adding the new elements at the end of the list, it adds a list at the end of the list: we now have a list in our list.

We messed up our list so it is also useful to learn about how to remove elements from a list.

👉🏻 Run the code in the following cell:

In [None]:
print(x)

x.pop()
print(x)

fruit = x.pop()
print(fruit)
print(x)

x.remove("kiwi")
print(x)

Let's unpack that:
- **.pop** removes the last element of the list, so we are back to a list of strings;
- **.pop** actually returns the value that got popped;
- we can also remove a specific entry of the list using **.remove**;

Other useful methods for lists are **.sort** and **.reverse**. Their use is self-evident.

👉🏻 Run the code in the following cell:

In [None]:
print(x)

x.sort()
print(x)

x.reverse()
print(x)

There are many other methods that can be applied to lists. 
If you want to see what methods can be applied to x just type **x.** and hit **[Tab]**; this brings up a list of allowed methods. 
(This works for other types of objects as well, not just a variable containing a list.)

👉🏻 Try looking up methods available for **x** in the cell below:

Among these, you may notice the method index. 
This method can be used to find the indices of specific elements of the list. 
Also useful is the operator **in** which checks if something is an element of a list.

👉🏻 Run the code in the following cell:

In [None]:
print(x)
print(x.index("melon"))
print("melon" in x)
print("banana" in x)

To close our discussion of lists, Python has an *interesting* quirk when it comes to assignments. 
This is not just for lists but for so called mutable objects.

👉🏻 Before running the following cell, think about what you expect the output to be. Then run the cell below.

In [None]:
x = ["a", "b"]
y = x
x[0] = "z"
print(x)
print(y)

You can see that even though we only changed the first element of **x** to **"z"**, the same change happened to **y**. 
This might seem confusing at first, but it has to do with what the assignment operator **=** is doing. 
When you run **y = x** you may think that you are assigning the content of **x** to **y**, but what you are actually doing is saying that the variables **x** and **y** now point to the same place in memory.

This means that when you change **x** you are also implicitly changing the content of **y** since **y** just points to the same place as **x**. 
This might be the intended behaviour but if you actually want to create a copy of **x** and then modify it, then the cell below shows you how you can do it.

👉🏻 Run the code in the following cell and compare the result with that of the cell from before:

In [None]:
x = ["a", "b"]
y = x.copy()
x[0] = "z"
print(x)
print(y)
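
One way to see the difference between the two cells above is Python's **is** operator, which checks whether two names refer to the very same object (as opposed to **==**, which compares contents). The sketch below is only an illustration; the names **alias** and **clone** are made up for this example.

```python
x = ["a", "b"]
alias = x         # a second name for the same list
clone = x.copy()  # a new list with the same entries

print(alias is x)  # True: both names refer to one object
print(clone is x)  # False: the copy is a separate object
print(clone == x)  # True: the entries are (still) equal

x[0] = "z"
print(alias)  # ['z', 'b']
print(clone)  # ['a', 'b']
```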

### Tuples
For the most part tuples work the same way as lists but are created with **(...)** rather than with **[...]**. 
You can create an empty tuple with **()** or **tuple()**.

The big difference is that tuples are immutable - once they are created they cannot be modified.
In fact, other than being immutable, tuples are quite interchangeable with lists.

👉🏻 Run the code in the following cell:

In [None]:
x = ("a", "b", "c")
print(x[0])

If you check what methods you can apply to x you'll see that the choice is limited as all methods that modify it are gone. 
If you try to modify **x** you'll get an error. 
Other than that, tuples work just like lists.
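
To see the error without stopping the cell, the sketch below wraps the assignment in a **try** / **except** block; this construct is standard Python but is not covered in this guide, so treat it as an aside. The variable names are made up for this example.

```python
t = ("a", "b", "c")

try:
    t[0] = "z"  # tuples do not support item assignment
except TypeError as error:
    message = str(error)
    print("TypeError:", message)

print(t)  # the tuple is unchanged: ('a', 'b', 'c')
```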

👉🏻 Type **x.** and hit **[Tab]** in the cell below:

### Sets
Sets can be created with **{...}** rather than with **[...]** or **(...)** like lists and tuples, respectively. 
You can create an empty set with **set()**. 
(Note that **{}** does *not* create an empty set.)
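
A quick check with the built-in **type** function shows what each expression creates. Empty braces give a dictionary, a different structure that is not discussed in this section; the variable names below are made up for this example.

```python
empty_braces = {}
empty_set = set()

print(type(empty_braces))  # <class 'dict'>
print(type(empty_set))     # <class 'set'>
```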

The difference with sets is that they do not store duplicates and the elements of the set have no order.

👉🏻 Run the code in the following cell:

In [None]:
x = {"a", "b", "c", "a", "b", "b"}
print(x)

The previous example pretty much tells the whole story: in sets, duplicates are ignored, and the elements in the set are not stored in any particular order. 

Sets are optimised for set operations such as checking for membership with **in**, taking the intersection or union, the difference, etc.

👉🏻 Run the code in the following cell:

In [None]:
print(x)
print("a" in x)
print({"a"}.issubset(x))

y = {"d", "a", "z"}
print(x.intersection(y))
print(x.union(y))
print(x.difference(y))

All of the outcomes should be self-explanatory.

### 📝 Solve the following exercises in the cell below:

**(4)**
Suppose that a librarian starts their round with an empty book trolly and goes around a library collecting abandoned books.
Start out by creating an empty list called **trolly** representing the empty trolly that they start with.
Use this list to keep track of the order of the books on the trolly.
Make sure that your code mimics the following steps:
1. The librarian collects the book **A** from a table and places it in the trolly;
2. Next they grab two copies of **B** from next to the coffee machine;
3. A child is done with their copy of **C** and hands it over to the librarian who places it on top of the pile;
4. The librarian trips on three books on the floor: **D** volumes 1, 2 and 3 which get placed on the trolly in that order;
5. Someone comes over to the librarian and asks if they know where they can get a copy of **A**.
The librarian remembers that that was the first book they picked up and they lift the other books so the person can take it;
6. While the librarian is still holding the books, another person shows up and asks if they can place their copy of **E** on the trolly; they do so and the librarian puts the books that they were holding on top;
7. While passing the return counter, the librarian grabs copies of **F**, **G**, and **H** and sorts them alphabetically before placing them on the pile with **F** now on top;
8. Another child comes and asks if they can take the copy of C that is on the pile.
The librarian doesn't know anymore what position that book occupies so they just let the child take it themselves;
9. The librarian arrives at a shelf and places the books in the pile (top to bottom) on the shelf (left to right);
10. What order (left to right) are the books in on the shelf? 

Print out your solution.

**(5)**
Consider two sets **A = {"a", "b", "d", "f"}** and **B = {"b", "d", "g"}**.
Compute the following:
1. the intersection between **A** and **B**;
2. the union of **A** and **B**;
3. all elements in **B** but not **A**;
4. all elements that are exclusive to **A** or exclusive to **B**;
5. which of the sets does "a" belong to?

Store the answers in variables **e1**, **e2**..., and print them out.

## Booleans and conditionals

A variable is a boolean if it can take the values **True** or **False**. 
Conditionals allow us to execute different code based on the value of a boolean.

### Logical operations
These should be familiar to you and include **and**, **or**, and **not**.

👉🏻 Run the code in the following cell:

In [None]:
print(True and True, True and False, False and True, False and False)

print(True or True, True or False, False or True, False or False)

print(not True, not False)

The above are just the rules for dealing with booleans.

👉🏻 Run the code in the following cell:

In [None]:
x = 2
print(x == 2)

The **==** looks similar to **=** but does something different. 
While **x = 2** assigns the number **2** to **x**, **x == 2** does a comparison and tells you if the left hand side and right hand side are equal. 
In this case it returns **True**.

👉🏻 Run the code in the following cell and check that the outputs make sense:

In [None]:
y = 3
print(y == 2, 1 == 1, 2 == "two", y == "three")

The comparison operator **==** is very useful since when writing code we often have to make checks to decide what to do next. 
Other comparison operators include **<**, **<=**, **>**, **>=**, **!=** for respectively less than, less than or equal to, greater than, greater than or equal to, and different from.

👉🏻 Run the code in the following cell and check that the outputs make sense:

In [None]:
print(1 < 1, 1 <= 1, 2 > -2, 0 >= 0, 1 != 0)

A final remark about the booleans **True** and **False** is that they can also double as the numbers **1** and **0** if involved in a computation. 
For instance, the following code makes use of this fact to compute the absolute value of **z**.

👉🏻 Run the code in the following cell trying out different values for **z**; check that indeed the output is **|z|**:

In [None]:
z = -4
print((z < 0) * (-z) + (z >= 0) * z)
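
The same fact can be used to count how many conditions hold, simply by adding the booleans. The sketch below is only an illustration; the variables **a**, **b**, **c**, and **count_positive** are made up for this example.

```python
print(True + True)  # 2
print(False * 10)   # 0

# Count how many of three numbers are positive by adding up the comparisons.
a = 5
b = -1
c = 8
count_positive = (a > 0) + (b > 0) + (c > 0)
print(count_positive)  # 2
```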

### Conditionals

#### **if** statements
The first conditional statement that we discuss is the **if** statement. 
An **if** statement executes a block of code if it gets passed a **True**.

👉🏻 Run the code in the following cell:

In [None]:
statement = True
if statement:
    print("statement is true")

Note that the print statement is indented. 
Python understands all indented text after the **:** to be a block of code that should be executed if what follows **if** evaluates to **True**. 
If you set **statement** to **False** above and run the block again, there will be no output.

We don't need to pass a variable to the **if** statement. 
Anything that evaluates to either **True** or **False** can be placed there.

👉🏻 Run the code in the following five cells and try to understand why it does what it does:

In [None]:
if True:
    print("This line of code was run!")

In [None]:
statement_1 = True
statement_2 = False
if statement_1 or statement_2:
    print("Either statement_1 is true or statement_2 is true (or both)")

In [None]:
statement_1 = True
statement_2 = True
if statement_1 and statement_2:
    print("statement_1 is true")
    print("statement_2 is true")

In [None]:
today = "Wednesday"
if today == "Thursday":
    print("Looks like today is Thursday!")

In [None]:
weekend_days = {"Saturday", "Sunday"}
if today not in weekend_days:
    print("Looks like today is a week day!")

#### **if** + **else**, and **if** +**elif** + **else** statements
It is often the case that we want to run code whether the **if** statement evaluates to **True** or not but which code we run depends on the evaluation. 
For this we can complement the **if** with an **else**.

👉🏻 Run the code in the following cell:

In [None]:
today = "Saturday"
if today not in weekend_days:
    print("Looks like today is a week day!")
else:
    print("Looks like today is a weekend day!")

If set up like this, one of the two commands will run. Which one runs depends on the value of the variable **today**.

It can also happen that you want to cover more than two cases. 
In that case, you can use **elif** to specify further alternatives that are sequentially checked until one that evaluates to **True** is found.

👉🏻 Run the code in the following cell:

In [None]:
today = "Sunday"
if today not in weekend_days:
    print("Looks like today is a week day!")
elif today == "Saturday":
    print("Looks like today is Saturday!")
else:
    print("Looks like it must be Sunday!")

You can add as many **elif** as you want but only one **if** and only one **else**.

👉🏻 Run the code in the following cell:

In [None]:
number_you_thought_of = 7

if number_you_thought_of == 1:
    print("You thought of the number 1.")
elif number_you_thought_of == 2:
    print("You thought of the number 2.")
elif number_you_thought_of == 3:
    print("You thought of the number 3.")
else:
    print("You thought of a number other than 1, 2, or 3.")

### 📝 Solve the following exercises in the cell below:

**(6)**
Define a variable **x** and assign a positive integer to it.
Write an *if* statement that checks if **x** is odd or even, and prints out a message with the appropriate conclusion.
(*Hint:* **a%b** returns the remainder of dividing a by b.)

**(7)**
Define a variable **y** and assign a string to it.
Write an *if* statement that checks if **y** is a palindrome, and prints out a message with the appropriate conclusion.
(*Hint:* a string can be indexed the same way as a list of characters.)

**(8)**
Define a variable **w** and assign a string to it.
Write an *if* statement that checks if **w** ends with an 'a', or ends with an 'o', or does not end with either an 'a' or an 'o'.

## Arrays with NumPy
NumPy is a Python library that allows us to create and manipulate arrays. 
Arrays are much more practical to work with (as well as computationally efficient) than lists. 
First you need to load this library which you can do with the command **import**.

👉🏻 Run the code in the following cell; if NumPy is not yet installed, first uncomment the **pip** command to actually install the library:

In [None]:
# !pip install numpy # Only needs to be run once to install the NumPy library
import numpy as np

The command above tells Python to load up the NumPy library and to name it **np** (it is up to you what you call it but **np** is almost always used). 
This means that every time that you want to call a function from the NumPy library you should preface it with **np.** (or whatever you imported the library as). 
We do this because you can have functions with the same name in different libraries; calling **np.function_name** tells Python that you want to specifically  call the function **function_name** from the NumPy package, as opposed to some other function also called **function_name**.
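
As an illustration of two libraries sharing a function name, both Python's standard **math** library and NumPy provide a function called **sqrt**. The sketch below is only meant to show how the prefix decides which one runs; the **math** library is not otherwise used in this guide.

```python
import math
import numpy as np

# Both libraries provide a function called sqrt; the prefix says which one runs.
root_of_number = math.sqrt(4)              # works on a single number
roots_of_array = np.sqrt(np.array([4, 9])) # works on a whole array at once

print(root_of_number)  # 2.0
print(roots_of_array)  # [2. 3.]
```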

NumPy contains functions to create and manipulate arrays. 
An array is just a d-dimensional structure, so that a 0-d array is a number, a 1-d array is a vector, a 2-d array is a matrix, etc. 
Arrays are much more efficient, flexible, and practical to work with than lists. 

### Creating and combining arrays

There are several ways of creating arrays. 
The following cells contain three different ways of creating a 2 by 3 array of 1's.

👉🏻 Run the code in the following cell:

In [None]:
x = np.array([[1,1,1],[1,1,1]])
print(x) # printing an array just shows you a list of lists
x # outputting x itself explicitly tells you that x is an array

In [None]:
y = np.full((2,3),1)
print(y)

In [None]:
z = np.ones((2,3))
print(z)

Let's unpack what we got. First, just to get it out of the way, note that **z** is full of **1.**'s, while for instance **x** is full of **1**'s; **1** represents the integer 1, while **1.** represents the floating-point (real) number 1.00....
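
The type of the entries of an array is stored in its **.dtype** attribute, which makes the integer versus floating-point distinction explicit. The sketch below recreates **x** and **z** from the cells above so that it can be run on its own.

```python
import numpy as np

x = np.array([[1, 1, 1], [1, 1, 1]])
z = np.ones((2, 3))

print(x.dtype)  # an integer type (the exact name can depend on your system)
print(z.dtype)  # float64
```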


The command **np.array(...)** technically takes a list (or more specifically a list of lists, i.e., a matrix) and converts it into an array with the same dimensions. 
Note that when you call **x** itself, it explicitly tells you that it is internally stored as an array structure, which you do not see if you just print it out. 
Note that you could also have passed a tuple of tuples as input for the function.

The command **np.full((2,3),1)** takes in a shape, **(2,3)** in our case, and fills it with 1's (the second argument.) 
Note that the shape could have been specified using a list **[2,3]** instead of using the tuple **(2,3)**; a set could not have been used since, sets being unordered, it would be ambiguous if you wanted a 2 by 3 matrix or a 3 by 2 matrix.

The command **np.ones((2,3))** creates a 2 by 3 array of ones. 
There is also a function **np.zeros** that creates arrays of zeros.


Sometimes you just want to create an evenly spaced sequence of numbers.
The function **np.linspace(...)** can be used for this.

👉🏻 Run the code in the following cell which creates 11 evenly spaced numbers between 0 and 5:

In [None]:
np.linspace(0,5,11)
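
Since both endpoints are included, 11 points leave 10 gaps between them, so the spacing is $(5-0)/(11-1) = 0.5$. The sketch below checks this; the variable names are made up for this example, and **np.diff** (which computes differences between consecutive entries) is a NumPy function not otherwise covered here.

```python
import numpy as np

start = 0
stop = 5
num = 11
points = np.linspace(start, stop, num)

# With both endpoints included there are num - 1 gaps between the points.
step = (stop - start) / (num - 1)
print(step)             # 0.5
print(points[0])        # 0.0
print(points[-1])       # 5.0
print(np.diff(points))  # all consecutive differences equal the step
```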

After you have created an array you can look up its number of dimensions and its shape using respectively **.ndim** and **.shape**.

👉🏻 Run the code in the following cell:

In [None]:
print(x.ndim)
print(x.shape)

This tells us that **x** has 2 dimensions (it is a matrix, so it has "rows" and "columns"), and its shape is **(2,3)** (so, more specifically, it is a matrix with 2 "rows", and 3 "columns").

Note that you can use the output of **.shape** with, for instance, **np.zeros**. 
The following command creates a matrix of zeros with the same shape as **z**.

👉🏻 Run the code in the following cell:

In [None]:
print(z)

w = np.zeros(z.shape)
print(w)

We can also use the command **.reshape(...)** to modify the shape of an array. 
(Technically speaking, **.reshape(...)** creates a new array with the specified shape, it does not modify the original array.) 
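
A minimal sketch of this point: after calling **.reshape(...)**, the original array still has its old shape. The names **a** and **b** are made up for this example.

```python
import numpy as np

a = np.array([0, 1, 2, 3, 4, 5])
b = a.reshape(2, 3)  # the result is stored under a new name

print(a.shape)  # (6,)   the original keeps its shape
print(b.shape)  # (2, 3)
```

If you want the name **a** itself to refer to the reshaped array, assign the result back to it, as in **a = a.reshape(2, 3)**.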


👉🏻 Run the code in the following cell:

In [None]:
v = np.array(range(0,8))
print(v.shape, "is the shape of v")
print(v), print() # I use an extra "print()" to get an empty line making the output more readable

v = v.reshape(2,4)
print(v.shape, "is now the shape of v")
print(v), print()

v = v.reshape(2,2,2)
print(v.shape, "is now the shape of v")
print(v)

So we started by creating a vector **v** with 8 elements, reshaped it into a 2 by 4 matrix, then reshaped it into a 2 by 2 by 2 array, our first example of a 3-d array. 
A 3-d array has "rows", "columns", and... "layers", say; each of the 2 "layers" of the array is a 2 by 2 matrix. 
You can define arrays with as many dimensions as you like. Also note that the result of the **v.reshape(2,4)** command makes it clear that the function "fills" the 2 by 4 matrix by rows: the first row is filled with 0, 1, 2, 3 and the second with 4, 5, 6, 7.

The only thing we need to be careful with is that the reshaped array has the same number of elements as the original array. 
So, for instance, we would have gotten an error if we tried to reshape **v** into a 3 by 3 matrix, since a 3 by 3 matrix has 9 entries and **v** only has 8 elements.
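
The sketch below triggers this error on purpose and catches it with a **try** / **except** block (standard Python, but not covered in this guide) so that the cell keeps running. It uses a separate array **u**, a name made up for this example, so that **v** is left untouched.

```python
import numpy as np

u = np.array(range(0, 8))

try:
    u.reshape(3, 3)  # 9 entries requested, but u has only 8
except ValueError as error:
    message = str(error)
    print("ValueError:", message)

print(u.shape)  # (8,)  u itself is unchanged
```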

We can combine arrays using the **np.stack** function. 
We exemplify this in the next cell where we first turn **v** back into a vector.

👉🏻 Run the code in the following cell:

In [None]:
v = v.reshape(8)

print(np.stack((v,v))), print()

print(np.stack((v,v), axis = 0)), print()
print(np.stack((v,v), axis = 1)), print()

print(np.hstack((v,v))), print()
print(np.vstack((v,v))), print()

print(np.repeat(v,2))

Let's unpack the results. As already mentioned, we first turn **v** back into a vector. Then,

- the **np.stack(...)** command takes a tuple (or list) of arrays (vectors in our case) and stacks them. In our case it stacks two copies of **v** on top of one another. By default the stacking is done by rows so we end up with a 2 by 8 matrix;

- the command **np.stack(..., axis = ...)** allows us to specify along which dimension we want to stack the copies of **v**. By default **np.stack((v,v))** does the same as **np.stack((v,v), axis = 0)** where the vectors are stacked along the 0th dimension (so along "rows"). Doing **np.stack((v,v), axis = 1)** instead stacks the copies of **v** as columns, so stacks them along the 1st dimension (or along "columns");

- there are two other stack-related functions, **np.hstack(...)** and **np.vstack(...)** that do the stacking horizontally and vertically, respectively. The results are self-explanatory;

- another often useful command is the **np.repeat(...)** function. This function repeats each element of the passed array a certain number of times.
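
A quick way to keep these commands apart is to compare the shapes of their outputs. The sketch below does this for a fresh vector **u** with the same content as **v** (the names **u** and **shapes** are just for illustration).

```python
import numpy as np

u = np.array(range(0, 8))  # same content as v above

# Collect the shape of the result of each command
shapes = {
    "stack, axis=0": np.stack((u, u), axis=0).shape,
    "stack, axis=1": np.stack((u, u), axis=1).shape,
    "hstack": np.hstack((u, u)).shape,
    "vstack": np.vstack((u, u)).shape,
    "repeat": np.repeat(u, 2).shape,
}
for name, shape in shapes.items():
    print(name, "->", shape)
```

Note that for vectors **np.hstack** and **np.repeat** both return a longer vector with 16 entries, but the order of the entries is different.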

### Operations on arrays

An advantage of arrays over lists or tuples is that you can more easily do operations involving arrays.

👉🏻 Run the code in the following cell:

In [None]:
print(np.array([1,2])+1)
print(np.array([1,2])+np.array([1,1]))
#[1,2]+1
[1,2]+[1,1]

If you were to run the commented-out command you would get an error. The last command just concatenates the lists, rather than actually adding them. This exemplifies how arrays are more amenable to computations, while lists (and tuples) are more appropriate for indexing. (We will turn to indexing of arrays shortly.)

Some more examples of operations follow.

👉🏻 Run the code in the following cell:

In [None]:
print(np.array([1,2])*2)
print(np.array([1,2])*[2,2])
print(np.array([1,2])**2) # remember that "**2" means "power 2"
#[1,2]**2
[1,2]*2

This again reinforces the different roles of arrays and lists. 
(The commented out line would give an error.)

Some other operations that you can do are taking sums, maxima, and minima.

👉🏻 Run the code in the following cell:

In [None]:
v = v.reshape(2,4)
print(v)

print("Sum, max, and min of all entries:")
print(np.sum(v))
print(np.max(v))
print(np.min(v)), print()

print("Sums, maxima, and minima along 1st dimension:")
print(np.sum(v, axis = 0))
print(np.max(v, axis = 0))
print(np.min(v, axis = 0)), print()

print("Sums, maxima, and minima along 2nd dimension:")
print(np.sum(v, axis = 1))
print(np.max(v, axis = 1))
print(np.min(v, axis = 1))

We exemplify above the use of the **np.sum**, **np.max**, **np.min** functions on a reshaped version of **v**. 
Applying these functions to **v** just does the obvious thing by returning, respectively, the sum, max, and min of _all_ entries of **v**.

Specifying the extra argument **axis=0** tells the functions (np.sum, np.max, np.min) to apply the respective operation (sum, max, min) along the 0th dimension (the "rows"). 
This then returns an array with as many entries as there are "columns" since the "row" entries have been sum/max/min'd-out. 
Specifying **axis=1** returns the "row" sums (as we add the "column" elements).

👉🏻 Run the code in the following cell:

In [None]:
print("Maxima for the 3-d array")
v = v.reshape(2,2,2)
print(v)

print("Maxima along 1st dimension:")
print(np.max(v, axis = 0))
print("Maxima along 2nd and 3rd dimensions:")
print(np.max(v, axis = (1,2)))

For higher dimensional arrays you have more options. 
In the cell above we exemplify the use of **np.max** with **axis=0** and **axis=(1,2)**. 
Setting **axis=0** means that you max over layers, so you get an array of the same dimension as each layer, where each entry is the maximum of that row-column combination across all layers. 
Setting **axis=(1,2)** means that you max over rows and columns, so you get a vector with the maximum of each of the two layers. 

In summary, after you specify the axes over which you want to apply the function, you get an object with a shape which is of the size of the remaining dimensions.
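
The rule is perhaps easiest to see when all dimensions have different sizes. The sketch below uses a made-up 2 by 3 by 4 array **a** and only prints the shapes of the results.

```python
import numpy as np

a = np.array(range(0, 24)).reshape(2, 3, 4)  # 2 "layers", 3 "rows", 4 "columns"

print(np.sum(a).shape)               # no axis given: a single number, so the shape is empty
print(np.sum(a, axis=0).shape)       # layers summed out: 3 by 4 remains
print(np.sum(a, axis=1).shape)       # rows summed out: 2 by 4 remains
print(np.sum(a, axis=(1, 2)).shape)  # rows and columns summed out: one number per layer
```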



Linear algebra operations can be easily done within NumPy. 
The following cell exemplifies a few simple concepts.

👉🏻 Run the code in the following cell:

In [None]:
I = np.identity(3)
x = np.array([1,2,3]).reshape(3,1)

print(np.matmul(I,x)), print()
print(np.matmul(np.transpose(x),x)), print()
print(np.matmul(x,np.transpose(x))), print()

print(np.linalg.det(I))
print(np.trace(I))

In the cell above we start by creating an identity matrix $I$ (a 3 by 3 array with 1's on the diagonal, 0's elsewhere), and a vector $x$. 
Note that we reshape the vector **np.array([1,2,3])** (technically a 1-d array) into a "column matrix" $x$ (a 2-d array with just one "column", i.e., a 3 by 1 matrix).

We then compute some matrix multiplications using **np.matmul**; 
we compute **$Ix$** (which is just $x$), 
**$x^Tx$** (which is the squared norm of $x$), and
**$xx^T$** (which is the outer product of $x$ with itself).

We also compute the determinant and trace of $I$, which are, of course, 1 and 3, respectively. 
Many other matrix operations are available and you can read about them by running **?np.linalg** in a code cell.
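
As a side note, NumPy arrays also support the operator **@** and the attribute **.T** as shorthands for **np.matmul** and **np.transpose**. The sketch below checks this on a small made-up matrix **A** and column vector **b**.

```python
import numpy as np

A = np.array([[2, 0], [1, 3]])
b = np.array([1, 2]).reshape(2, 1)  # a 2 by 1 "column matrix"

# Compare the shorthands with the functions used above
same_product = np.array_equal(A @ b, np.matmul(A, b))
same_transpose = np.array_equal(b.T, np.transpose(b))
print(same_product, same_transpose)

print(A @ b)
```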

### Indexing of arrays

Like lists and tuples, arrays are also indexed. 
This means that you can also use **[...]** to access and change sub-entries of an array. 
Indexing of arrays, though, is more flexible than indexing of lists and tuples. 

👉🏻 Run the code in the following cell after trying to guess what the outputs will be:

In [None]:
x = np.array(range(0,16)).reshape(4,4)
print(x), print()

print("First row of x:")
print(x[0,:]), print()

print("First column of x:")
print(x[:,0]), print()

print("First two rows and columns of x:")
print(x[0:2,0:2]), print()

print("Set the four central elements of x to 100:")
x[1:-1,1:-1] = 100
print(x)

Note that higher dimensional arrays are indexed in the same way, so for instance if **y** would be a 2 by 2 by 2 array then **y[0,0,0]** is the element that is in the first layer, first row, first column, **y[0,:,:]** gives the 2 by 2 array that makes up the 1st layer, etc.

This already makes arrays more flexible than lists and tuples but another attractive feature is that we can use booleans to index an array.

👉🏻 Run the code in the following cell:

In [None]:
y = np.array(range(0,6))
print(y)

print(y[[True,True,True,False,False,False]])

In the cell above we specify which entries of **y** we want to keep using a list of booleans (with the same shape as the array **y**). 
Typically, though, we don't use booleans directly but instead index the array with the outcome of a logical operation.

Note that when using logical operators to directly index we need to use **&**, **|** instead of **and**, **or**.
The rule is you should use **&**, **|**, etc. when comparing vectors.
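
To make the difference concrete, the sketch below applies **&** and **|** to a fresh array **z** (a name chosen here for illustration) and then catches the error that **and** produces.

```python
import numpy as np

z = np.array(range(0, 6))

# Element-wise "and" / "or": one boolean per entry of z
print((z > 1) & (z < 5))
print((z < 1) | (z > 4))

# The plain "and" asks for a single True/False for the whole array, which is ambiguous
try:
    (z > 1) and (z < 5)
    and_failed = False
except ValueError as err:
    and_failed = True
    print("Error:", err)
```

The parentheses around each comparison are needed because **&** and **|** bind more tightly than **>** and **<**.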

👉🏻 Run the code in the following cell:

In [None]:
print(y[(y>1)&(y<5)]) # note that "&" needs to be used to index directly; "and" would not work

print(y)
print((y>1)&(y<5))

You can also just use a list to indicate an arbitrary collection of indices to access. 
This is a huge advantage over the way in which lists are indexed.

For arrays of dimension 2 or higher you need to use tuples to index the entries; you can use tuples of tuples or tuples of lists, but the higher level object must be a tuple. 
Somewhat unintuitively, though, for 1-d arrays tuples are not allowed for indexing and you can only use lists...

👉🏻 Run the code in the following cell:

In [None]:
print(y[[-1,0,2]]), print()
# print(y[(-1,0,2)]) #this would give an error

x = y.reshape(2,3)
print(x), print()

print(x[([0,1,0],[0,1,2])])
# print(x[((0,1,0),(0,1,2))]) #[] -> (); this would give same as the command from previous line
# print(x[[[0,1,0],[0,1,2]]]) #this would give an error

### A final remark: copying arrays

We already saw before with lists that "copies" of mutable objects need to be handled with care. 
The cell below exemplifies how the command **y=x** does not assign to **y** the content of **x** but instead signifies that **y** now also points to the same place in memory as **x**. 
As such, changing **x** (resp. **y**) also changes **y** (resp. **x**). 
The proper way of assigning to **y** a copy of the content of **x** is to use **y = x.copy()**; in this way, changing **x** does not affect **y** and vice-versa.
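
If you are ever unsure whether two names refer to the same array, you can check directly. The sketch below uses **is** and **np.shares_memory** for this; the names **x1**, **alias** and **clone** are chosen here for illustration.

```python
import numpy as np

x1 = np.array([1, 2])
alias = x1          # a second name for the same array
clone = x1.copy()   # an independent array with the same content

print(alias is x1)                  # the two names refer to the same object
print(np.shares_memory(x1, clone))  # the copy has its own memory

x1[0] = 3
print(alias, clone)  # only the alias sees the change
```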

👉🏻 Run the code in the following cell:

In [None]:
print("Perhaps unexpected behaviour:")
x = np.array([1, 2])
y = x
x[0] = 3
print(x)
print(y), print()

print("Correct procedure:")
x = np.array([1, 2])
y = x.copy()
x[0] = 3
print(x)
print(y)

### 📝 Solve the following exercises in the cell below:

**(9)**
Create a 3x3x3 array and set all entries of the $i$-th layer to consecutive multiples of $i$, by column, for $i\in\{1,2,3\}$ (indices 0,1,2, resp.).
Print out the result.

**(10)**
For the array from **(9)**, create a list containing lists containing:
1. the sum of all entries of each layer;
2. the minimum of the middle row of each layer;
3. the maximum of the last column of each layer;

(*Hint:* **.tolist()** is a method that converts an array to a list.)


## Loops
When programming, you often have to repeat a similar set of instructions several times. Loops can be used for this purpose.

### **for** loops
A **for** loop is appropriate if you know ahead of time how many times you want to perform the instructions in the loop.

👉🏻 Run the code in the following cell:

In [None]:
for i in range(0,5):
    print(2*i+1)

Quite a few things to unpack here. 
Firstly, the code outputs the first 5 odd numbers.

The syntax is **for \<variable\> in \<collection\>:** where **\<collection\>** is some collection of values which can be a list, tuple, set, etc. 
What the **for** loop does is assign to the **\<variable\>** the values in the **\<collection\>** one by one, and for each one run the indented command, in this case just a **print** command. 
Note that you can use the current value of the **\<variable\>** in the instructions that are being looped.

The function **range** is particularly useful here; for all intents and purposes, **range(0,5)** is the same as **[0,1,2,3,4]** but just works as a stand-in for the list without creating it. 
Using **range(0,5)** rather than **[0,1,2,3,4]** is more efficient. 

There is also a step argument in the range function, so **range(a,b,c)** corresponds to the same indices as **a:b:c** would.

If you really want to create the list explicitly, you can run **list(range(...))**. 
(You can do the same thing for tuples, sets, or arrays.)
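
The correspondence between **range(a,b,c)** and the slice **a:b:c** can be checked on a small example. The sketch below picks the same entries from a made-up list **letters** in both ways.

```python
letters = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]

# Using a slice with start 1, stop 8 and step 3
by_slice = letters[1:8:3]

# Using a loop over the matching range
by_range = []
for i in range(1, 8, 3):
    by_range.append(letters[i])

print(by_slice)
print(by_range)
```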

👉🏻 Run the code in the following cell:

In [None]:
print(list(range(0,10,2)))
print(tuple(range(0,10,2)))
print(set(range(0,10,2)))

A single **for** loop can be used to iterate over more than one object at a time.
We need to make sure that the number of objects matches the size of the elements of the collection.

👉🏻 Run the code in the following cell:

In [None]:
for i,j in [[1,2],[3,4], [1,8]]: # each element of the collection needs to have two values: one for i and one for j
    print(i+j)

As promised, here is the (unfortunately elaborate) way in which you can access the entries of a list that occupy an arbitrary list of indices. (If you know of a more straightforward way of doing this, please let me know.)

👉🏻 Run the code in the following cell:

In [None]:
x = ["a", "b", "c", "d", "e"]
y = [x[i] for i in [1,0,4]]
# x[[1,0,4]] or x[1,0,4] would produce an error
print(y)

So **y** is made out of the entries of **x** corresponding to indices 1, 0, and 4, respectively. 
The code uses special syntax that to some extent mimics mathematical notation: $(x_i: i\in\{1,0,4\})$. 
It is unfortunate that **x[[1,0,4]]** is not a valid way of accessing entries of **x**. 

Fortunately, there are **arrays** which address these and other limitations but this way of creating lists is very useful in itself and we will use it later on.
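
For comparison, the sketch below picks the same entries once from a list (using the syntax above) and once from an array (using a list of indices). The names **x_list**, **x_array**, etc., are chosen here for illustration.

```python
import numpy as np

x_list = ["a", "b", "c", "d", "e"]
from_list = [x_list[i] for i in [1, 0, 4]]  # lists: build the result entry by entry

x_array = np.array(x_list)
from_array = x_array[[1, 0, 4]]             # arrays: index with a list directly

print(from_list)
print(from_array.tolist())
```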

### **while** loops
These loops keep iterating until a certain stopping condition is met. 
These are useful when you perhaps don't know how many times the loop has to execute before you are done.
Best to see how it works with an example, though.

👉🏻 Run the code in the following cell:

In [None]:
x = [-4, -3, 0, -1, 3, -4, 5, 1]
i = 0
while x[i] <= 0:
    i += 1
print(x[i])

The syntax is very simple and of the form **while \<condition\>:**; as long as **\<condition>** evaluates to **True** Python keeps running the indented command(s). 
You can see some code that you haven't seen before, namely **i += 1** which increments **i** by one, i.e., does the same as **i = i + 1**.

This code increments the value of the index **i** (starting from **0**) until the respective **x[i]** no longer satisfies **x[i] <= 0**, i.e., until **i** is such that **x[i] > 0**, at which point the loop terminates and we execute the line of code following the loop, in this case **print(x[i])**. 
In words: the loop searches one by one through the list **x** until it finds the first strictly positive number and then prints it out.

Note that if nothing happens inside the loop that eventually causes the **\<condition>** to evaluate to **False**, then the loop will never terminate. 
In such cases you need to go to the menu **[Kernel]** and select **[Interrupt]**.
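
A related pitfall: if the list above contained no strictly positive number, the index **i** would eventually run past the end of the list and Python would stop with an **IndexError**. One possible way of guarding against this is sketched below, using a made-up list **x_neg** and an extra condition on **i**.

```python
x_neg = [-4, -3, 0, -1]  # made-up list without any strictly positive entry

i = 0
# The extra check "i < len(x_neg)" stops the search at the end of the list;
# it comes first so that x_neg[i] is only evaluated for valid indices
while i < len(x_neg) and x_neg[i] <= 0:
    i += 1

if i < len(x_neg):
    result = x_neg[i]
else:
    result = None  # nothing was found
print(result)
```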

### A loop within a loop
Note that you can put whatever code you like inside a loop, so also another loop. 
Below you can find an example of this; maybe it can suggest a new dish to try and cook.

👉🏻 Run the code in the following cell:

In [None]:
x = ["pineapple", "chicken", "onion", "potato"]
y = ["salad", "roast", "soup"]
for i in range(len(x)):
    for j in range(len(y)):
        print(x[i], y[j])

You can of course also have **while** loops inside **for** loops, **while** loops inside **while** loops, etc.

It should be noted that while a loop within a loop is often useful, it is hardly ever the case that you need to make a loop within a loop within a loop. 
More often than not, there is a more elegant way of doing whatever it is you are trying to do.
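
As one example of such an alternative, the standard library function **itertools.product** runs through all combinations of two (or more) collections with a single loop. The sketch below uses shortened, made-up lists **foods** and **dishes**.

```python
from itertools import product

foods = ["pineapple", "chicken"]
dishes = ["salad", "roast", "soup"]

combos = []
# product(...) yields every pair (food, dish), in the same order as two nested loops would
for food, dish in product(foods, dishes):
    combos.append(food + " " + dish)
print(combos)
```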

### **break** and **continue**
There are two more useful commands to use in conjunction with loops, namely: **break** and **continue**. 
They both interrupt the execution of the code within a loop, but while **break** jumps out of the loop, **continue** just jumps to the next iteration of the loop. 
We start with an example of a use for **break**.

👉🏻 Run the code in the following cell:

In [None]:
for i in range(5):
    if i == 3: 
        break
print(i)

(Note that the **if** statement is inside the **for** loop but **print(i)** is outside.) 
This **for** loop runs through the numbers **0** through **4** and when **i** takes the value **3** the loop is terminated. 
Checking the value of **i** afterwards confirms that the loop indeed ended after **i** was set to **3**. 
Next follows an example with **continue**.

👉🏻 Run the code in the following cell:

In [None]:
for i in range(5):
    if i == 3: 
        continue
    print(i)

(Note that both the **if** statement and **print(i)** are inside the **for** loop.) 
Like the previous **for** loop, this one runs through the numbers **0** through **4** and prints out the current value of **i**. 
However, when **i** takes the value **3**, it first executes the **continue**. 
At this point the current iteration of the **for** loop - the fourth, where **i=3** - terminates without executing any more code (i.e., skipping the **print** command) and the next iteration of the loop - the fifth, where **i=4** - starts. 
The consequence is that we skip over the **print(i)** command when **i=3** so we don't print the number **3** out.

### 📝 Solve the following exercises in the cell below:

**(11)**
Write a loop that gives you every second even number of an array containing **A = [1, 2, 4, 6, 3, 1, 3, 3, 3, 8, 7, 10, 14, 9]**.

**(12)**
Solve **(11)** *without* using a loop, in one line of code.

## Functions
Beside using built-in functions, you can also create your own. 
Functions are useful whenever you have a batch of instructions that you would like to apply several times.

You can create a new function with the syntax **def \<function_name\>(arguments):** followed by an indented set of commands to be executed whenever the function is called.

👉🏻 Run the code in the following cell:

In [None]:
def do_nothing():
    pass

do_nothing()

This is how you create a new function and run it. 
This particular function is atypical because it has no arguments (which is allowed) and does nothing (that is the role of the **pass** statement inside the function). 
The following function takes an input and prints it.

👉🏻 Run the code in the following cell:

In [None]:
def print_new(to_print = "Nothing to print."):
    print(to_print)

print_new("Something to print.")

print_new()

print_new

This function has an input called **to_print** and all that the function does is print what it receives as input.
The **= "Nothing to print."** specifies a default value for **to_print** which is used in case no argument is provided to the function. 
You can have several inputs for a function - separated by commas - and specify default values for whichever inputs you like. 

When we call **print_new("Something to print.")** the string **"Something to print."** gets printed; when we call **print_new()** the default value for the input is used since we don't provide one; when you run **print_new**, Python just tells us that **print_new** is a function but it does not actually run it. 
The **()** are essential to run a function, even if no inputs are provided.

Often, we also want our functions to return some output. 
(Note that the function **print_new** does not have an output and instead just calls another function.) 
We can specify an output for a function using **return**.

👉🏻 Try to predict what the code in the following cell does and then run it:

In [None]:
def the_date(day = 1, month = "January", year = 2021):
    return "Today is " + month + " " + str(day) + ", " + str(year) + "."

print(the_date())
print(the_date(4, "February"))
print(the_date(year = 2025, day = "13", month = "February"))

Note that the call of the function produces a string which we then print.
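
The three calls above differ in how the inputs are passed: not at all, by position, or by name. The sketch below shows the same three styles on a smaller, hypothetical function **power** that returns a number rather than a string.

```python
def power(base, exponent = 2):
    # "exponent" has a default value; "base" does not and must always be given
    return base ** exponent

a = power(3)                       # exponent falls back to its default: 3**2
b = power(3, 3)                    # inputs passed by position, in order: 3**3
c = power(exponent = 3, base = 2)  # inputs passed by name, in any order: 2**3
print(a, b, c)
```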

### 📝 Solve the following exercises in the cell below:

**(13)** 
Write a function called **sum_pow** that takes in integers $n$ and $p$ and returns ${\sum}_{i=1}^n i^p$, the sum of the $p$-powers of the first $n$ integers;
make sure that the function uses the default value $1$ for both $n$ and $p$.
Use the function to compute ${\sum}_{i=12}^{28}i^2$.

**(14)**
Create your own sort function **my_sort** which takes a list of integers and returns the list sorted in increasing order.
Apply your function to the list **A = [5, 3, 2, 6, 7, 1, 4, 1, 9]** and check if the result is correct.

## Data analysis

With all of the generic Python basics out of the way, we can finally focus on using Python to do some statistics.

### Loading data
While NumPy makes data manipulation easier and more efficient, when it comes to data analysis there are more specialised packages built on top of NumPy that you can use. 
A popular one is **Pandas**.


👉🏻 Run the code in the following cell to load the pandas library:

In [None]:
# !pip install pandas # Only needs to be run once to install the Pandas library
import pandas as pd

Pandas has functions that allow you to read several different file formats. 
Here we focus on CSV files which are plain text files with different rows of data, each row corresponding to a vector with entries separated by commas. 
(Pandas has other functions to load other file types.)

We will use a file about the World Happiness Report 2021 to exemplify some common data manipulation tasks you can perform with Pandas. 
In the cell below we load the CSV file containing the data.

Although this is not necessary, we immediately use the **.drop(...)** function to drop a few columns of the data to somewhat simplify the dataset.
The commands **happy.columns** or **happy.columns.values** show us the names of the columns that we kept but note that while **happy.columns** returns a Pandas **Index** object, **happy.columns.values** returns a NumPy array of strings.

New columns can be created using **happy["New column name"] = ...** and a whole new empty dataframe can be created with **new_dataframe = pd.DataFrame(columns = ["Column 1 name", "Column 2 name", ...])**.
For instance, **happy_new = pd.DataFrame(columns = happy.columns)** would create a new, empty dataframe with the same columns as **happy**.

Different dataframes (with the same columns) can be concatenated using a command like **pd.concat([dataframe1, dataframe2])**; note that the dataframes are passed together as a list.
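
The sketch below tries these commands on two tiny, made-up dataframes **df1** and **df2** so that the results are easy to check by eye. The option **ignore_index = True** is an extra choice made here so that the rows of the result are numbered afresh.

```python
import pandas as pd

# Two tiny, made-up dataframes with the same columns
df1 = pd.DataFrame({"name": ["A", "B"], "score": [1.0, 2.0]})
df2 = pd.DataFrame({"name": ["C"], "score": [3.0]})

# pd.concat expects a list (or tuple) of dataframes
both = pd.concat([df1, df2], ignore_index = True)

# A new column computed from an existing one
both["double score"] = 2 * both["score"]
print(both)
```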

After loading we use the **.head(...)** function to look at the first 5 rows of the dataset. 
(The function **.tail(...)** would show us the bottom rows.)

👉🏻 Run the code in the following cell:

In [None]:
# data source: https://www.kaggle.com/ajaypalsinghlo/world-happiness-report-2021
# (It is always a good idea to mention the data source)
happy = pd.read_csv("whr2021.csv")
happy = happy.drop(columns = happy.columns.values[3:6])
happy = happy.drop(columns = happy.columns.values[10:])

print(happy.columns), print()

print(happy.columns.values)

happy.head(5)

You may notice that the data is presented in a slightly more appealing way than a usual NumPy array. 
Specifically, the rows are numbered (leftmost column) and the columns are named (top row). 
You can also see that each column can have a different data type. 
Internally, the variable **happy** is a so called **DataFrame**.

### Indexing dataframes

Dataframes differ from arrays in a few ways like, for instance, when it comes to indexing. 
If you index **happy** directly with a column name (or a list of column names), then you refer to the columns. 
The following cell shows how to select columns by name. (Be careful with slices: a command like **happy[0:2]** selects the first two *rows*, not the first two columns; selecting by position is done with **.iloc[...]**, which we discuss below.)

Note that when working with dataframes it is often a good idea to refer to the columns by their *name* (e.g., **"Country name"**) rather than by their *index* (e.g., resp. **0**). 
One reason for this is that it makes the code more readable. 
Another reason is that we may sometimes remove or reorder the columns so that the index of a specific column may be different at different points in the code.

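Do note that **happy[["Country name"]]** is not quite the same as **happy["Country name"]**: the former returns a dataframe with a single column, while the latter returns that column as a Pandas **Series**. Also, indexing is case-sensitive so **happy["country name"]** will give you an error.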
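
You can check what single and double brackets return on a tiny, made-up dataframe **toy** (the values below are invented and are not taken from the report).

```python
import pandas as pd

toy = pd.DataFrame({"Country name": ["A", "B"], "Ladder score": [7.1, 6.4]})  # made-up values

single = toy["Country name"]    # one pair of brackets with a column name
double = toy[["Country name"]]  # a list containing one column name

print(type(single).__name__, single.shape)
print(type(double).__name__, double.shape)
```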

👉🏻 Run the code in the following cell:

In [None]:
happy[["Country name", "Regional indicator"]]
# happy[0:2] # careful: a slice selects rows, so this gives the first two rows (not columns)
# happy[0:2,0:5] # this would produce an error

Since our dataframe is rather tall, only the first and last five rows are shown.

If you want to index a dataframe in the same way you would an array, then you should use **.iloc[...]**.
Running the command below gives you a sub-dataframe containing the first 5 rows and first 3 columns.

👉🏻 Run the code in the following cell:

In [None]:
happy.iloc[0:5,0:3]
# again, something like happy[0:5,0:3] would produce an error

So the takeaway is that if you use the **.iloc** property of a dataframe you can index it as you would an array but it might make sense to use the basic column indexing of dataframes.

There is an alternative command **.loc[...]** that allows you to use conditions.

👉🏻 Run the code in the following cell:

In [None]:
happy.loc[happy["Regional indicator"] == "North America and ANZ"]

The command above only keeps the rows of the table corresponding to the region "North America and ANZ".
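
Conditions can be combined with **&** and **|**, just as for arrays, and **.loc[...]** also accepts a list of column names after the comma. The sketch below illustrates both on a small, made-up dataframe **toy**.

```python
import pandas as pd

toy = pd.DataFrame({
    "Country name": ["A", "B", "C", "D"],
    "Regional indicator": ["North", "North", "South", "South"],
    "Ladder score": [7.0, 5.5, 6.2, 4.8],
})  # made-up values

# Rows: region "North" with a ladder score above 6; columns: chosen by name
subset = toy.loc[(toy["Regional indicator"] == "North") & (toy["Ladder score"] > 6),
                 ["Country name", "Ladder score"]]
print(subset)
```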

### Sorting

Another useful thing to know how to do is sorting.
Looking at **happy.head(5)** as we did above it might seem like the rows are sorted in descending order by "Ladder score" but maybe we want to look at what the dataframe would look like when sorted by another column(s), or maybe sorted in ascending (rather than descending) order.
To sort a dataframe, we can use the **.sort_values(...)** function.
(Note that this only outputs the sorted dataframe but does not alter the original dataframe.)

👉🏻 Run the code in the following cell:

In [None]:
happy.sort_values("Healthy life expectancy").head()

(We also used the **.head()** function to only show the first few rows of the dataframe.)
As you can see, by default **.sort_values(...)** sorted in ascending order the "Healthy life expectancy" column that we specified as an argument.
We can use the extra option **ascending = False** to sort in descending order instead.

👉🏻 Run the code in the following cell:

In [None]:
happy.sort_values("Healthy life expectancy", ascending = False).head()

We get to see what the dataframe with rows sorted in descending order in terms of "Healthy life expectancy" looks like.

In certain cases, we may want to sort using two or more criteria.
The command below shows what the dataframe looks like if we sort first by "Regional indicator", and then by "Ladder score" with the former being sorted in ascending order and the latter in descending order.

👉🏻 Run the code in the following cell:

In [None]:
happy.sort_values(["Regional indicator","Ladder score"], ascending = [1,0]) 
# [True, False] would give the same as [1,0]; ascending = False would sort *all* in descending order

The result is as expected.

Say that we prefer this sorting for the dataframe and we would like to keep it.
To do this we can just assign the output above to **happy** but note that the row indices still refer to the original sorting.
After assigning the sorted dataframe to **happy** we can use the command **.reset_index(drop=True)** to renumber the rows.
The optional argument **drop=True** tells the function not to save the old row indices as a new column, which it does by default.
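
The effect on the row indices is easy to see on a tiny, made-up dataframe **toy**; the sketch below prints the row indices before and after renumbering.

```python
import pandas as pd

toy = pd.DataFrame({"name": ["A", "B", "C"], "score": [2, 3, 1]})  # made-up values

sorted_toy = toy.sort_values("score")
print(list(sorted_toy.index))   # the old row indices travel with the rows

renumbered = sorted_toy.reset_index(drop = True)
print(list(renumbered.index))   # the rows are numbered from 0 again
print(list(toy.index))          # the original dataframe is untouched
```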

👉🏻 Run the code in the following cell:

In [None]:
happy = happy.sort_values(["Regional indicator","Ladder score"], ascending = [1,0])
happy = happy.reset_index(drop=True)
# we could have combined both commands with happy = happy.sort_values(...).reset_index(drop=True)

happy

### Saving dataframes

We may now want to save this new dataframe to a CSV file.
We can do this using the **.to_csv(...)** function.
(There exist other functions to save to other formats.)

The optional argument **index = False** tells the function not to save the column with the row indices.
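
To see what **index = False** changes, the sketch below writes a tiny, made-up dataframe **toy** to CSV text instead of to a file (calling **.to_csv** without a file name returns the text as a string). The use of **io.StringIO** to read the text back is a convenience chosen here for illustration.

```python
import io
import pandas as pd

toy = pd.DataFrame({"name": ["A", "B"], "score": [2.5, 3.5]})  # made-up values

with_index = toy.to_csv()                  # no file name: the CSV text is returned as a string
without_index = toy.to_csv(index = False)

print(with_index)     # each row starts with its row index
print(without_index)  # only the actual columns are written

# io.StringIO lets read_csv treat a string like a file
reloaded = pd.read_csv(io.StringIO(without_index))
print(reloaded)
```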

👉🏻 Run the code in the following cell:

In [None]:
happy.to_csv("happy_sorted.csv", index = False)

### Data summaries and statistics

A quick way to have a feeling for the data is to use the **.describe()** command.
This command returns several statistics (count, mean, standard deviation, minimum, quartiles, maximum) for the different *numerical* columns of the dataframe.

👉🏻 Run the code in the following cell:

In [None]:
happy.describe()

The **count** row may seem somewhat redundant but if there were missing values in some of the columns, then those would not be counted; in the case of our dataset, there are no missing values so all columns have the same count value.
(In NumPy, and by extension Pandas, **np.nan** represents a missing value.)
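
Since our dataset has no missing values, here is a sketch with a tiny, made-up dataframe **toy** that does have one, to show how it affects the count and the mean.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [1.0, np.nan, 3.0]})  # column "b" has one missing value

counts = toy.count()  # missing values are not counted
means = toy.mean()    # missing values are skipped by default
print(counts)
print(means)
```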

You can, of course, get these statistics (and others) by computing them from the columns (or the entire dataframe) directly.

👉🏻 Run the code in the following cell:

In [None]:
print(happy["Ladder score"].mean()), print()

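happy.mean(numeric_only = True) # numeric_only = True skips text columns (recent Pandas versions give an error otherwise)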

Pandas also offers a practical way of computing statistics over different groups of data using the **.groupby(...)** function.
For instance, the command in the cell below splits the data into different groups based on the value of **Regional indicator** and, for each group, computes the mean of all of the other numerical columns.

👉🏻 Run the code in the following cell:

In [None]:
happy.groupby(["Regional indicator"]).mean(numeric_only = True) # numeric_only = True skips text columns

We can also group by more than one criterion at a time.
For instance, in the cell below we use two criteria to construct groups, namely: **"Regional indicator"**, and whether the generosity score (**happy["Generosity"]**) is non-negative or not.
For each group - every possible combination of **Regional indicator** and **{True, False}** - we get the mean of each numerical column for the respective group.
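
Grouping is easier to follow on data that is small enough to check by hand. The sketch below uses a made-up dataframe **toy** with invented column names and values, and groups first by one and then by two criteria.

```python
import pandas as pd

toy = pd.DataFrame({
    "region": ["North", "North", "South", "South", "South"],
    "generosity": [0.2, -0.1, 0.3, 0.1, -0.4],
    "score": [7.0, 6.0, 5.0, 4.0, 3.0],
})  # made-up values

# Mean score per region
by_region = toy.groupby("region")["score"].mean()

# Mean score per combination of region and "generosity is non-negative"
by_both = toy.groupby(["region", toy["generosity"] >= 0])["score"].mean()

print(by_region)
print(by_both)
```

For instance, the two "South" rows with non-negative generosity have scores 5 and 4, so their group mean is 4.5.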

👉🏻 Run the code in the following cell:

In [None]:
happy.groupby(["Regional indicator", happy["Generosity"] >= 0]).mean(numeric_only = True)

The code in the cell below focuses only on the column **"Generosity"** and counts the number of elements in each group.
If we think of **happy["Generosity"] >= 0** indicating a country is "generous", then the table below lists for each regional indicator the number of countries that are "generous" (**True**) and the number of countries that are not "generous" (**False**).

👉🏻 Run the code in the following cell:

In [None]:
happy.groupby(["Regional indicator", happy["Generosity"] >= 0])["Generosity"].count()

### 📝 Solve the following exercises in the cell below:

**(15)**
Load the dataset **happy_sorted.csv** into a new variable called **happy_new**.
For this, compute the following:
1. print out the list of countries in **Southeast Asia** whose **Healthy life expectancy** is at least 68;
2. what is the mean logged GDP for those countries?
3. group the ladder score for the group of countries with **Perceptions of corruption** above and below 0.55, and then print out the ratio between the maximum and the minimum **Perceptions of corruption** for each of the groups.

## Plotting

Numerical summaries give us a lot of insight into data but plots also help in this regard.
There are several libraries that enable you to make plots in Python.

Some of these are very low level in that they give you a lot of freedom to customise your plots but often imply you having to write a lot of code even to get a simple plot.
Others allow you to create plots with very little code but may be a bit lacking in terms of customisation.
Here we go over **Matplotlib** which offers a good balance between these two extremes.

👉🏻 Run the code in the following cell to load the Matplotlib library:

In [None]:
# !pip install matplotlib # Only needs to be run once to install the Matplotlib library
import matplotlib.pyplot as plt

### Histograms

Histograms are useful to see the distribution of a sample and can be plotted with **plt.hist(...)**. 
All you need is to give the function the values that you want the histogram for.
In addition you should round things off with **plt.show()** which will suppress some unnecessary text output.

👉🏻 Run the code in the following cell:

In [None]:
plt.hist(happy["Ladder score"])
plt.show()

That looks about right but it also looks very bare.
You can change pretty much every aspect of the plot with the right commands so in the next cell you can find some commands that allow you to change a few typical plot parameters.
Note that most of these commands apply to other types of plots as well (i.e., not just to histograms).

👉🏻 Run the code in the following cell:

In [None]:
plt.figure(figsize=(12,4))

plt.title("Histogram of the Cantril Ladder Score 2021", fontsize=18, fontweight="bold")

ticks = np.linspace(2,8,7)
plt.xticks(ticks)

bins = np.linspace(2,8,13)
plt.hist(happy["Ladder score"], bins = bins, color = "#FF0000")

plt.xlabel("Ladder score")
plt.ylabel("Frequency")

plt.show()

Let's unpack all of the commands that we used:

- **plt.figure(figsize=...)** allows us to change the size of the plot;

- **plt.title(...)** allows us to set the title of the plot;

- **plt.xticks(...)** allows us to set the ticks that feature in the x-axis;

- the option **bins = ...** in **plt.hist(...)** lets us select the start/end points of the bins (note that these do not necessarily need to be evenly spaced);

- **plt.xlabel(...)** and **plt.ylabel(...)** let us set the labels for the x- and y-axis.
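
To get a feeling for what the **bins** option does, you can compute the counts without plotting anything: **plt.hist** relies on **np.histogram** for the counting. The sketch below uses a few made-up scores and the same bin edges as in the cell above.

```python
import numpy as np

scores = np.array([2.6, 3.1, 3.4, 5.0, 5.2, 5.3, 7.8])  # made-up values
bins = np.linspace(2, 8, 13)  # 13 edges, so 12 bins of width 0.5

counts, edges = np.histogram(scores, bins = bins)
print(edges)   # the start/end points of the bins
print(counts)  # the number of scores that fall in each bin
```

Note that 13 edges define 12 bins, and that a value equal to an edge (like 5.0 here) is counted in the bin that starts at that edge.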

As exemplified above, commands that add/change text can usually take in extra arguments to set font type, size, colour, etc.
Colours in particular can be supplied in a large number of different formats; e.g., **"#FF0000"** is the same as **"red"** is the same as **"r"**.
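
If you want to check that different ways of writing a colour really agree, one option is the function **to_rgba** from **matplotlib.colors**, which converts a colour in any accepted format into its red, green, blue, and opacity values.

```python
from matplotlib.colors import to_rgba

print(to_rgba("#FF0000"))
print(to_rgba("red"))
print(to_rgba("r"))

# All three formats describe the same colour
same_colour = to_rgba("#FF0000") == to_rgba("red") == to_rgba("r")
print(same_colour)
```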

The plotting function itself can also take extra arguments.
It is often a good idea to check the documentation (e.g., **?plt.hist**) to see exactly what can be changed.

### Box and whiskers plots

These are useful to summarise the range and quartiles of a sample, and to compare different samples.
They can be plotted next to one another by giving the function **plt.boxplot(...)** a list of lists or a list of arrays each containing a sample.

👉🏻 Run the code in the following cell:

In [None]:
life_west = happy.loc[happy["Regional indicator"] == "Western Europe"]["Healthy life expectancy"].to_list()
life_east = happy.loc[happy["Regional indicator"] == "Central and Eastern Europe"]["Healthy life expectancy"].to_list()

plt.figure(figsize=(8,4))
plt.title("Healthy life expectancy comparison", fontsize=18, fontweight="bold")
plt.xlabel("Region", fontweight="bold")
plt.ylabel("Age", fontweight="bold")
plt.boxplot([life_west, life_east])
plt.xticks([1,2], ["Western Europe","Central and Eastern Europe"])
plt.show()

Above we use the **.loc** property of the dataframe **happy** to find all entries relating to two specific regions, and for those we extract the values corresponding to **Healthy life expectancy**.
(We use **.to_list()** at the end since technically we are getting a pandas Series, and we want to get rid of the row indices that come with it.)
We then compare these two samples by plotting the two box and whiskers plots side-by-side.
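The selection can be broken into smaller steps. The sketch below does this on a tiny made-up table (the region names and numbers are invented, only the column names match the real data).

```python
import pandas as pd

# tiny made-up table, only to illustrate the selection (not the real data)
toy = pd.DataFrame({
    "Regional indicator": ["North", "South", "North", "South"],
    "Healthy life expectancy": [70.0, 65.0, 72.0, 66.5],
})

# step 1: a column of True/False values, one per row
mask = toy["Regional indicator"] == "North"

# step 2: keep the rows where the mask is True, then pick one column
selected = toy.loc[mask]["Healthy life expectancy"]
print(selected)            # still carries the row indices 0 and 2

# step 3: drop the row indices and keep only the values
print(selected.to_list())
```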

You can imagine that it would be tedious doing a similar plot for all regions by hand.
If you compare the first two lines of code, though, you can see they are nearly identical.
This suggests we can automate this by using a for loop.

The following example will show you a simple trick that is useful to achieve what we want: to create a list where each entry of the list corresponds to a unique region and is a vector containing all the **Healthy life expectancy**s for that region.

👉🏻 Run the code in the following cell:

In [None]:
[[x,x**2] for x in range(1,6)]

The code above creates a list of lists.
Each list contains a number **x** and its square, **x\*\*2**, where **x** runs over every element of the collection **range(1,6)**.
The result is a list of five lists, each of the five containing an integer and its square.
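If the syntax looks unfamiliar, it may help to compare it with an ordinary for loop that builds the same list step by step:

```python
# the one-line version from the cell above
squares_comprehension = [[x, x**2] for x in range(1, 6)]

# the same list built with an ordinary for loop
squares_loop = []
for x in range(1, 6):
    squares_loop.append([x, x**2])

print(squares_loop)
print(squares_comprehension == squares_loop)
```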

We now use the same trick to look up in the dataframe the list of all **Healthy life expectancy**s for each of the unique regions.


👉🏻 Run the code in the following cell:

In [None]:
## abbreviations
RI = "Regional indicator"
HLE = "Healthy life expectancy"

regions = happy[RI].unique()
samples = [happy.loc[happy[RI] == region][HLE].to_list() for region in regions]

plt.figure(figsize=(12,4))
plt.title("Healthy life expectancy comparison", fontsize=18, fontweight="bold")
plt.xlabel("Region", fontweight="bold")
plt.ylabel("Age", fontweight="bold")
plt.boxplot(samples)
plt.xticks(range(1,len(regions)+1), regions, rotation=90)
plt.show()

Note that the code is nearly identical to before, with some minor adjustments.

The first line of code in the cell above extracts all unique region names.
The second line indeed looks complicated but is of the form **[function_of_object for object in collection]** that we saw in the example just above.
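To see the pattern at work on something small, here is a sketch with a made-up table; the column names **group** and **value** are invented for this illustration.

```python
import pandas as pd

# made-up table, for illustration only
toy = pd.DataFrame({
    "group": ["a", "b", "a", "c", "b"],
    "value": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# unique labels, in order of first appearance
groups = toy["group"].unique()

# one list of values per label: [function_of_object for object in collection]
samples = [toy.loc[toy["group"] == g]["value"].to_list() for g in groups]

for g, s in zip(groups, samples):
    print(g, s)
```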

The only other new thing is that we use the option **rotation=90** in **plt.xticks(...)** to avoid the region labels overlapping.

### Scatterplots and line graphs

To create scatterplots and line graphs you use the function **plt.plot(...)**. In the following cell pay attention only to the line involving **plt.plot(...)** and the one preceding it, where we sort the data; the rest of the code just customises the plot.

👉🏻 Run the code in the following cell:

In [None]:
plt.figure(figsize=(12,4))
plt.title("Logged GDP per capita vs Ladder score", fontsize=18, fontweight="bold")
plt.xlabel("Logged GDP per capita", fontweight="bold")
plt.ylabel("Ladder score", fontweight="bold")

##abbreviations
LGDPPC = "Logged GDP per capita"
LS = "Ladder score"

sorted_happy = happy.sort_values(LGDPPC, ascending = True)[[LS,LGDPPC]]
plt.plot(sorted_happy[LGDPPC], sorted_happy[LS], ".")

plt.show()

The argument **"."** in **plt.plot(x,y,".")** tells matplotlib to draw each pair **(x,y)** as a point, without connecting the points with lines, which gives a scatterplot.
The picture tells us that we should expect the log-GDP to positively correlate with the ladder score.
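A visual impression like this can be checked numerically with the sample correlation coefficient, which **np.corrcoef(...)** computes. The sketch below uses made-up numbers; in the notebook you would pass the two columns of **sorted_happy** instead.

```python
import numpy as np

# made-up values, for illustration only (not the happiness data)
log_gdp = np.array([7.0, 8.1, 8.9, 9.5, 10.2, 10.9, 11.4])
ladder = np.array([3.9, 4.4, 5.1, 5.0, 6.0, 6.8, 7.2])

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is the correlation
r = np.corrcoef(log_gdp, ladder)[0, 1]
print(r)
```

A value close to 1 indicates a strong positive linear association, and a value close to 0 indicates little or no linear association.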

To create a line plot we just need to omit the **"."** argument.
Several lines can be added to the plot by calling **plt.plot(...)** several times.
We can use **plt.legend(...)** (after plotting) to add a legend to the plot.

👉🏻 Run the code in the following cell:

In [None]:
poc = "Perceptions of corruption"
x1 = happy.loc[happy[poc]>happy[poc].mean()]["Freedom to make life choices"].sort_values().tolist()
y1 = np.linspace(0,100,len(x1)+2)
x2 = happy.loc[happy[poc]<=happy[poc].mean()]["Freedom to make life choices"].sort_values().tolist()
y2 = np.linspace(0,100,len(x2)+2)

plt.figure(figsize=(12,4))
plt.title("Distribution of Freedom to make life choices", fontsize=18, fontweight="bold")
plt.xlabel("Freedom to make life choices", fontweight="bold")
plt.ylabel("Percentage of countries", fontweight="bold")

plt.plot([0]+x1+[1],y1, "r")
plt.plot([0]+x2+[1],y2, "b")

lab1 = "Countries with above average perceptions of corruption"
lab2 = "Countries with below average perceptions of corruption"
plt.legend([lab1, lab2], loc="upper left")

# plt.savefig("figure.pdf", dpi = 300)
plt.show()

The plot seems to suggest that the distribution of the perceived freedom to make life choices is different in countries with above-average and below-average perceptions of corruption.
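One way to read the curves is as follows: the values are sorted and paired with evenly spaced percentages between 0 and 100, so the height of a curve at a point is roughly the percentage of countries whose value is at most that point. The sketch below reproduces the construction on three made-up values; as in the cell above, the end points 0 and 1 are added by hand.

```python
import numpy as np

# made-up values between 0 and 1, for illustration only
values = sorted([0.9, 0.5, 0.7])

# add the end points 0 and 1 by hand, as in the cell above
xs = [0] + values + [1]

# evenly spaced percentages, one per point (including the two end points)
ys = np.linspace(0, 100, len(values) + 2)

for x, y in zip(xs, ys):
    print(x, y)
```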

### Saving plots
Plots can be saved using the command **plt.savefig(...)**, which you should call right before **plt.show()**, as exemplified (in commented-out form) in the cell above.
For instance, you can do **plt.savefig("figure.pdf")** to save as a PDF or **plt.savefig("figure.png")** to save as a PNG, and you can add extra options to change the quality of the output, such as **plt.savefig(..., dpi = 300)**.
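A minimal sketch of the whole sequence, with made-up data, could look as follows. The temporary folder is only used here so that the example does not leave files in your working directory; in practice you would simply pass a file name of your choice.

```python
import os
import tempfile
import matplotlib.pyplot as plt

# temporary folder, used only for this sketch
out_dir = tempfile.mkdtemp()
png_path = os.path.join(out_dir, "figure.png")

plt.figure(figsize=(4, 3))
plt.plot([1, 2, 3], [1, 4, 9])   # made-up data
plt.savefig(png_path, dpi=300)   # save first ...
plt.show()                       # ... then show

print(os.path.exists(png_path))
```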

### 📝 Solve the following exercises in the cell below:

**(16)**
The goal of this exercise is to do a bar plot by hand.
Carry out the following steps:
1. use a similar trick to what we did with boxplots to make a list containing, for each region, the number of countries with "Generosity" above 0.1 and the number with "Generosity" at most 0.1;

2. use ?plt.bar to read how to create simple bar plots;

3. for each region plot two bars, one for each of the two counts making sure that:

    a) the bars of the i-th region (index i-1) are centered around i;
    
    b) no bars overlap;
    
    c) there is no space between the two bars in each pair;
    
    d) there is a gap between pairs of bars;
    
    e) the left bar in each pair is red and the right one is blue;
    
4. add a title and labels to the x- and y-axes;

5. change the x ticks to display the region corresponding to each pair of bars;

6. adjust the figure size to **(12,4)**;

7. add a legend to the **"upper left"** corner of the plot.





# End

# Solutions to exercises

In [None]:
# (1)
s = 10
e1 = s**3 # or pow(10,3)

# (2)
c = 5
r = 0.5
m = 6
e2 = c*(1+r/100)**m

# (3)
n = 10
e3 = c*(1+r/100)**(n*12) - c

# print solutions
print(e1, e2, e3)

In [None]:
# (4)
trolly = []
trolly.append("A")
trolly.extend(["B","B"])
trolly.append("C")
trolly.extend(["D1","D2","D3"])
trolly = trolly[1:]
trolly.insert(0,"E")
trolly.extend(["H","G","F"])
trolly.remove("C")
trolly.reverse()

print(trolly) # this is the order on the shelf


# (5)
A = {"a", "b", "d", "f"}
B = {"b", "d", "g"}
e1 = A.intersection(B)
e2 = A.union(B)
e3 = B.difference(A)
e4 = A.difference(B).union(e3)
e5 = ["a" in e1, "a" in e2, "a" in e3, "a" in e4]

print(e1,e2,e3,e4,e5)

In [None]:
# (6)
x = 3
if x%2: #remember: 1 is interpreted as True, 0 as False; having "if x%2==1:" is equivalent
    print("x is odd")
else:
    print("x is even")
    
# (7)
y = "bob"
if y[::-1] == y:
    print("'" + y + "' is a palindrome")
else:
    print("'" + y + "' is not a palindrome")

# (8)
w = "cat"
if w[-1] == "a":
    print("'" + w + "' ends with an 'a'")
elif w[-1] == "o":
    print("'" + w + "' ends with an 'o'")
else:
    print("'" + w + "' does not end with an 'a' or an 'o'")


In [None]:
# (9)
e1 = np.full((3,3,3), np.nan)
template = np.array(range(1,10)).reshape(3,3).T
e1[0,:,:] = template
e1[1,:,:] = 2*template
e1[2,:,:] = 3*template
print(e1), print()

# (10)
e2 = [e1.sum(axis=(1,2)).tolist(), e1[:,1,:].min(axis = 1).tolist(), e1[:,:,-1].max(axis = 1).tolist()]
print(e2)

In [None]:
# (11)
A = np.array([11, 2, 4, 6, 3, 1, 9, 3, 31, 8, 7, 10, 14, 9])
e1 = []
keep = False
for i in range(0,len(A)):
    if A[i]%2==0:
        if keep:
            e1.append(A[i])
            keep = False
        else:
            keep = True
print(e1)

# (12)
e2 = A[A%2==0][1::2]
print(e2)

In [None]:
# (13)
def sum_pow(n=1, p=1):
    return sum(np.array(range(1,n+1))**p)
print(sum_pow(28,2)-sum_pow(11,2))

# (14)
def my_sort(my_list):
    i = 0; j = 1
    sorted_list = my_list.copy()
    while max(i,j)<len(sorted_list):
        if sorted_list[i]>sorted_list[j]:
            temp = sorted_list[i]
            sorted_list[i]=sorted_list[j]
            sorted_list[j] = temp
        j+=1
        if j>len(sorted_list)-1:
            i+=1
            j=i+1
    return(sorted_list)
    
A = [5, 3, 2, 6, 7, 1, 4, 1, 9, 2]
print(my_sort(A))

In [None]:
# (15)
happy_new = pd.read_csv("happy_sorted.csv")
## some abbreviations
RI = "Regional indicator"
HLE = "Healthy life expectancy"
LGDPPC = "Logged GDP per capita"
CN = "Country name"
POC = "Perceptions of corruption"
LS = "Ladder score"

print(happy_new.loc[(happy_new[RI]=="Southeast Asia")&(happy_new[HLE]>=68)][CN].tolist()), print()
print(happy_new.loc[(happy_new[RI]=="Southeast Asia")&(happy_new[HLE]>=68)][LGDPPC].mean()), print()
temp = happy_new.groupby(happy_new[POC]>0.55)[LS]
print(temp.max()/temp.min())

In [None]:
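# (16)
## abbreviations
RI = "Regional indicator"
HLE = "Healthy life expectancy"
G = "Generosity"
CN = "Country name"

regions = happy[RI].unique()
# per region: [number of countries with Generosity > 0.1, number with Generosity <= 0.1];
# counting each case explicitly fixes the order (so that it matches the legend below)
# and gives 0, rather than a missing entry, when a case does not occur in a region
generosity = [happy.loc[happy[RI] == region][G] for region in regions]
samples = [[(g > 0.1).sum(), (g <= 0.1).sum()] for g in generosity]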

# ?plt.bar

plt.figure(figsize=(12,4))
plt.title("Distribution of 'Generosity' above- vs below 0.1 per region", fontsize=18, fontweight="bold")
plt.xlabel("Regions", fontweight="bold")
plt.ylabel("Counts", fontweight="bold")

wt = 0.33
for i_reg in range(0,len(samples)):
    plt.bar(i_reg+1-wt/2, samples[i_reg][0], width = wt, color = "r")
    plt.bar(i_reg+1+wt/2, samples[i_reg][1], width = wt, color = "b")
plt.xticks(range(1,1+len(regions)), regions, rotation=90)
plt.legend(loc = "upper left", labels = ["'Generosity' > 0.1", "'Generosity' <=0.1"])
plt.show()