# Lab: Review of the Python programming language

## Basic commands

The ```print()``` function outputs a text representation of all of its arguments to the console.

In [None]:
print('fit a model with', 1011, 'variables')


The following command provides more information about the ```print()``` function.

In [None]:
print?

The four most important built-in collection types are lists, tuples, sets, and dictionaries.


Textual data is handled using *strings*. For instance, `"Allez"` and `"l'OM"` are strings. We can concatenate them as follows:

In [None]:
"Hello" + " " + "World"

_Lists_: Lists are one of the most flexible structures in Python. A list is an ordered grouping of values.
A list is created by writing its values separated by commas
and surrounded by square brackets (`[` and `]`).

Ask Python to join together the numbers 3, 4, and 5, and to save them as a list named `x`. Typing `x` displays its contents.

In [None]:
x = [3,4,5]
x

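Copying a list: Be careful. The equal sign does not copy a list; it only gives a second name to the same list, so a change made through one name is visible through the other. To get an independent copy, use the `list()` function.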



In [None]:
y = x  # y is a second name for the same list, not a copy
print(x)
print(y)

y[0] = 0
print(y)

In [None]:
print(x)

In [None]:
x = [3, 4, 5]
y = list(x)
y[0] = 0
print("x : ", x)

In [None]:
print("y : ", y)

In [None]:
a = ["Lea", "Tom", "Zoe"]
print(type(a))
b = list(a)
b[1] = "Bob"
print(a)
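
One way to check whether two names refer to the same list is the `is` operator. The sketch below uses hypothetical names (`original`, `alias`, `shallow`) that are not part of the lab, chosen so that they do not overwrite `x` and `y`.

```python
# Hypothetical names, used only for this illustration.
original = [3, 4, 5]
alias = original           # a second name for the same list object
shallow = list(original)   # a new list holding the same values

print(alias is original)   # True: both names point to one object
print(shallow is original) # False: the copy is a separate object

alias[0] = 0               # changes the shared list
shallow[1] = 99            # changes only the copy
print(original)            # [0, 4, 5]
print(shallow)             # [3, 99, 5]
```

Note that `list()` makes a shallow copy: if the list contained other lists, those inner lists would still be shared between the original and the copy.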

_Tuples_: Tuples are sequences of Python objects. To create a tuple, list the values separated by commas (usually inside parentheses). Unlike lists, tuples are immutable (i.e., they cannot be modified after they have been created).

In [None]:
x = (1, 4, 9, 16, 25)
print(x)
print(type(x))

In [None]:
x[2]
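
To see what "cannot be modified" means in practice, here is a minimal sketch. The names `squares` and `updated` are hypothetical and only serve the illustration.

```python
squares = (1, 4, 9, 16, 25)   # hypothetical name, same values as the lab's tuple

try:
    squares[2] = 0            # item assignment is not allowed on a tuple
except TypeError as err:
    print("TypeError:", err)

# To "change" a tuple, build a new one from pieces of the old one.
updated = squares[:2] + (0,) + squares[3:]
print(updated)                # (1, 4, 0, 16, 25)
print(squares)                # the original tuple is unchanged
```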

_Sets_: Sets are unordered collections of unique elements. A set can be modified after it is created (elements can be added or removed), but, unlike lists and tuples, sets are not
indexed.

In [None]:
new_set = {"Bordeaux", "Aix-en-Provence", "Nice", "Rennes"}
other_set = {"Nice", "Rennes", "Lyon"}
third_set = new_set | other_set  # union: elements found in either set

print(third_set)
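
Besides the union, sets support a few other common operations. The following sketch uses hypothetical sets (`cities`, `others`) to illustrate uniqueness, intersection, difference, and the absence of indexing.

```python
cities = {"Bordeaux", "Nice", "Rennes"}   # hypothetical example sets
others = {"Nice", "Rennes", "Lyon"}

cities.add("Nice")     # already present, so nothing changes
cities.add("Paris")    # a new element is added
print(len(cities))     # 4

common = cities & others       # intersection: elements in both sets
only_first = cities - others   # difference: elements only in cities
print(sorted(common))          # ['Nice', 'Rennes']
print(sorted(only_first))      # ['Bordeaux', 'Paris']

try:
    cities[0]                  # sets do not support indexing
except TypeError as err:
    print("TypeError:", err)
```

Because sets are unordered, `sorted()` is used here only to print the elements in a predictable order.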

_Dictionaries_: Python dictionaries store key-value pairs, and each value is accessed through its key.
Keys are often text; values can be of different types and structures. To create a dictionary, use braces (`{}`).

In [None]:
my_dict = {"nom": "Kyrie",
           "prenom": "John",
           "naissance": 1992,
           "equipes": ["Cleveland", "Boston"]}
print(my_dict)

In [None]:
print(my_dict["equipes"])

In [None]:
my_dict["equipes"] = ["Montclair Kimberley Academy",
                      "Cleveland Cavaliers", "Boston Celtics"]
print(my_dict)
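
A few other dictionary operations come up often: adding a key, looking up a key that may be missing, and looping over the pairs. The sketch below uses a hypothetical dictionary named `inventory` with made-up values.

```python
inventory = {"apples": 3, "pears": 5}   # hypothetical dictionary, made-up values

inventory["plums"] = 7      # assigning to a new key adds an entry
inventory["apples"] = 4     # assigning to an existing key replaces its value

print(inventory.get("kiwis"))      # None: get() avoids a KeyError for a missing key
print(inventory.get("kiwis", 0))   # 0: a default can be supplied
print("pears" in inventory)        # True: membership tests look at the keys

for fruit, count in inventory.items():   # loop over key-value pairs
    print(fruit, count)
```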

## Numerical Python or `numpy`

Import `numpy` to use it.

In [None]:
import numpy as np

We import the `numpy` package under the name `np` to make it easier to use.

`numpy` contains many functions for numerical computation in Python. One of them is `np.array()`, which we use to define the vectors `x` and `y`.

In [None]:
x = np.array([3, 4, 5])
y = np.array([4, 9, 7])
print(x, y)

In [None]:
x + y 

We can also create matrices (i.e., 2D arrays), using either `np.matrix()` or `np.array()`. Let's use `np.array()`. Pay attention to the syntax used to create it: a list of rows, each row itself a list.

In [None]:
x = np.array([[1,2], [3,4]]) 
x

`x` has several _attributes_ that can be accessed using `x.attribute`, replacing `attribute` with the name of the attribute we are interested in. For instance, we can check its number of dimensions, data type, and shape. For more, see the help for `np.array`.

In [None]:
x.ndim  # It's a 2D array

In [None]:
x.dtype  # x is composed of integers (typically 64-bit; 32-bit on some platforms)

In [None]:
x.shape  # x has two rows and two columns

We can access the individual values of the array using `x[i,j]`, with _i_ the row index and _j_ the column index. **Indexing starts at 0**, meaning that to access the first row and first column we have to type the following:

In [None]:
x[0,0]

We can apply mathematical functions to any array. For instance, `np.sqrt(x)` returns the square root of each element of `x`, and `x**2` squares each element.
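
As a small illustration of these element-wise operations, the sketch below applies them to a hypothetical array named `values`, so that the lab's `x` is left untouched.

```python
import numpy as np

values = np.array([[1, 4], [9, 16]])   # hypothetical array for illustration

roots = np.sqrt(values)     # element-wise square root, returned as floats
squared = values ** 2       # element-wise square, integer dtype is kept

print(roots)
print(squared)
```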

We are going to generate random data during this class. To do so, we will rely mostly on the `np.random.normal()` function, which creates a vector of random normal variables. The function takes three arguments: `normal(loc=0.0, scale=1.0, size=None)`. By default, it generates random normal variables with mean (`loc`) **0** and standard deviation (`scale`) **1**, and it returns a single value unless we change `size`.

We now generate 100 independent random variables from a *N*(0,1) distribution.

In [None]:
# Run it a few times: what do you observe?
x = np.random.normal(size=100)
x
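
To get a feel for the `loc`, `scale`, and `size` arguments, one could try something like the sketch below. The names `draws` and `grid` and the chosen values (mean 50, standard deviation 2) are assumptions made for the illustration.

```python
import numpy as np

# loc shifts the distribution, scale stretches it, size sets the output shape.
draws = np.random.normal(loc=50, scale=2, size=10_000)

print(draws.shape)    # (10000,)
print(draws.mean())   # close to 50, but not exactly 50
print(draws.std())    # close to 2

grid = np.random.normal(size=(2, 3))   # size can also be a tuple, giving a 2D array
print(grid.shape)     # (2, 3)
```

The sample mean and standard deviation change slightly on every run because no seed is set.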

In order to ensure that our code provides exactly the same results each time it is run, we can set a random seed using the `np.random.default_rng()` function. It returns a random number generator, here called `rng`. If we create the generator with a fixed seed before generating random data, then re-running our code will yield the same results. Hence, to generate normal data we use `rng.normal()`.

In [None]:
rng = np.random.default_rng(1303)
print(rng.normal(scale=5, size=2))

# A second generator created with the same seed reproduces the same draws
rng2 = np.random.default_rng(1303)
print(rng2.normal(scale=5, size=2))
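
Rather than comparing the printed numbers by eye, reproducibility can be checked programmatically. This is a minimal sketch; the names `first`, `second`, and `different` are hypothetical.

```python
import numpy as np

seed = 1303   # same seed value as in the lab
first = np.random.default_rng(seed).normal(scale=5, size=3)
second = np.random.default_rng(seed).normal(scale=5, size=3)
different = np.random.default_rng(seed + 1).normal(scale=5, size=3)

print(np.array_equal(first, second))      # True: same seed, same draws
print(np.array_equal(first, different))   # False: another seed gives other draws
```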

In [None]:
np.random?

## Graphics

It is common to use `matplotlib` for graphics. In `matplotlib`, a plot consists of a figure and one or more axes. The axes contain important information about each plot, such as its axis labels or title.

We first import the `subplots()` function from `matplotlib.pyplot`. The function returns a tuple of length two: a figure object as well as the relevant axes object. We will typically pass `figsize` as a keyword argument. Having created our axes, we attempt our first plot using its `plot()` method. To learn more about it, type `ax.plot?`.

In [None]:
from matplotlib.pyplot import subplots

In [None]:
# Create two random variables as an example
x = rng.standard_normal(100)
y = rng.standard_normal(100)
print(x)
print(y)

In [None]:
output = subplots(figsize=(8, 8))
fig = output[0]
ax = output[1]
print(type(output))
print(output)

To create a scatterplot, we add the argument `'o'` to `ax.plot()`, indicating that circles should be displayed.

In [None]:
fig, ax = subplots(figsize=(8,8))
ax.plot(x,y,'o')

An alternative is to use the `ax.scatter()` method.

In [None]:
fig, ax = subplots(figsize=(8,8))
ax.scatter(x,y, marker='o');

To label our plot, we can use the `set_xlabel()`, `set_ylabel()`, and `set_title()` methods of `ax`.

In [None]:
fig, ax = subplots(figsize=(10,10))
ax.scatter(x,y, marker='o')
ax.set_xlabel("this is the x-axis")
ax.set_ylabel("this is the y-axis")
ax.set_title("Plot of X vs Y");

We can create several plots within a figure by passing additional arguments to `subplots()`. Let's create a `3x2` grid of plots in a figure whose size is determined by the `figsize` argument. If we want the plots to share a common _x-axis_, we can add `sharex=True`.

In [None]:
fig, axes = subplots(nrows=3,
                     ncols=2,
                     figsize=(10, 10))

We now draw a scatter plot with `'o'` (circle) markers in the first row of the first column and a scatter plot with `'d'` (thin diamond) markers in the last row of the last column.

In [None]:
axes[0,0].plot(x,y, 'o')
axes[2,1].scatter(x,y,marker='d')
fig
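
The effect of `sharex=True` can be seen on a smaller grid. The sketch below is only an illustration: the seed, the data (`u`, `v`), and the names `fig_demo` and `axes_demo` are assumptions, chosen so that the lab's `fig` and `axes` are not overwritten.

```python
import numpy as np
from matplotlib.pyplot import subplots

rng_demo = np.random.default_rng(0)   # hypothetical seed
u = rng_demo.standard_normal(50)
v = rng_demo.standard_normal(50)

# Two stacked plots that share the same x-axis
fig_demo, axes_demo = subplots(nrows=2, ncols=1, figsize=(6, 6), sharex=True)
axes_demo[0].plot(u, v, 'o')
axes_demo[1].scatter(3 * u, v, marker='d')   # wider spread along x

print(axes_demo.shape)   # (2,): with a single column, the axes form a 1D array
print(axes_demo[0].get_xlim() == axes_demo[1].get_xlim())   # True: the x-limits are shared
```

With a single row or column, `subplots()` returns a one-dimensional array of axes, so they are indexed as `axes_demo[0]` rather than `axes_demo[0,0]`.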

To save the figure, we call `savefig()` and set the `dpi` (dots per inch) argument to determine the resolution.

In [None]:
fig.savefig("Figure.png", dpi=400)


## Indexing data

In [None]:
A = np.arange(16).reshape((4, 4))
A

`A[1,2]` shows the element in the second row and third column.

In [None]:
A[1,2]

### Indexing rows, columns, and submatrices

Selecting multiple rows at a time: `[1,3]` will show the second and fourth rows:

In [None]:
A[[1,3]]

To select the first and third columns, we pass `[0,2]` as the second argument in the square brackets and supply `:` as the first argument, which selects all rows.

In [None]:
A[:,[0,2]]

Now, suppose that we want to select the submatrix made up of the second and fourth rows as well as the first and third columns. An easy way to do it is to use `np.ix_()`, which builds an index for a submatrix from lists of rows and columns.

In [None]:
idx = np.ix_([1,3],[0,2])
A[idx]
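
It may help to see why a helper such as `np.ix_()` is useful here: passing two lists directly pairs them up element by element instead of forming a submatrix. The sketch below uses a hypothetical array `M` with the same values as `A`.

```python
import numpy as np

M = np.arange(16).reshape((4, 4))   # hypothetical name, same values as A

paired = M[[1, 3], [0, 2]]          # picks only M[1, 0] and M[3, 2]
sub = M[np.ix_([1, 3], [0, 2])]     # rows 1 and 3 crossed with columns 0 and 2
chained = M[[1, 3]][:, [0, 2]]      # same submatrix, selecting rows then columns

print(paired)    # [ 4 14]
print(sub)       # [[ 4  6]
                 #  [12 14]]
print(np.array_equal(sub, chained))   # True
```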

Another option is to subset the matrix using slices of the form `start:stop:step`:

In [None]:
A[1:4:2,0:3:2]

### Boolean indexing

A *boolean* array has elements which equal either `True` (=1) or `False` (=0). The next line creates a vector of `False` values (zeros) of length equal to the first dimension of `A`.

In [None]:
keep_rows = np.zeros(A.shape[0], bool)
keep_rows

We now set the second and fourth (last) elements to `True`:

In [None]:
keep_rows[[1,3]] = True
keep_rows
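
A Boolean vector like this one can then be used to pick rows. The sketch below rebuilds the objects under hypothetical names (`M`, `mask`) so that it runs on its own, and contrasts Boolean indexing with integer indexing.

```python
import numpy as np

M = np.arange(16).reshape((4, 4))   # hypothetical name, same values as A
mask = np.zeros(M.shape[0], bool)
mask[[1, 3]] = True                 # keep the second and fourth rows

print(M[mask])            # rows where mask is True
print(M[[1, 3]])          # the same rows, selected by integer position
print(M[[0, 1, 0, 1]])    # integers 0 and 1 are positions: rows 0 and 1, twice
```

Although `True` and `False` compare equal to 1 and 0, a Boolean array and an integer array of 0s and 1s do not select the same rows.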

## Loading data

The `pandas` library can be used to create and work with data frames.

### Reading a dataset

In [None]:
import pandas as pd # Import pandas

In [None]:
Auto = pd.read_csv('Data/Auto.csv')
Auto

We can take a closer look at a specific variable, here `horsepower`.

In [None]:
Auto['horsepower']

`Auto.shape` tells us the number of observations (rows) and variables (columns). 

In [None]:
Auto.shape

### Selecting rows and columns

In [None]:
Auto.columns

Select the first 4 rows of the dataset

In [None]:
Auto[:4]

Let's keep only the observations for which `year` is greater than 80:

In [None]:
idx_80 = Auto['year'] > 80
Auto[idx_80]

Have a look at a *subset* of columns. 

In [None]:
Auto[['mpg', 'year']]

In [None]:
Auto.index

As you can see, the first column on the left is the row index, here labeled from 0 to 392. Next, we will replace this index with the contents of the `name` column.

In [None]:
Auto_re = Auto.set_index('name')
Auto_re

We can now access rows of the data by `name`

In [None]:
rows = ['amc rebel sst', 'ford torino']
Auto_re.loc[rows]

We can extract the 4th and 5th rows, as well as the 1st, 3rd and 4th columns, using a single call to `iloc[]`:

In [None]:
Auto_re.iloc[[3,4],[0,2,3]]
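
The difference between `loc[]` (labels) and `iloc[]` (integer positions) can be seen on a tiny example. The data frame `toy` below is hypothetical and its values are made up for the illustration.

```python
import pandas as pd

# Hypothetical toy data frame with made-up values
toy = pd.DataFrame(
    {"height": [1.2, 3.4, 5.6], "weight": [10, 20, 30]},
    index=["a", "b", "c"],
)

by_label = toy.loc[["a", "c"], ["weight"]]   # select by index and column labels
by_position = toy.iloc[[0, 2], [1]]          # select by integer positions

print(by_label)
print(by_label.equals(by_position))          # True: both pick the same cells
```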


Suppose now that we want to create a data frame consisting of the `weight` and `origin` of the subset of cars with year greater than 80. We first create a Boolean array that indexes the rows. The `loc[]` method allows for Boolean entries as well as strings:

In [None]:
idx_80 = Auto_re['year'] > 80
Auto_re.loc[idx_80, ['weight', 'origin']]

An alternative is to use an anonymous function, called a `lambda` function. The code below creates a function that takes a single argument, `df`, and returns `df['year'] > 80`. Because it is passed inside `loc[]` for the data frame `Auto_re`, that data frame is the argument supplied to the function.

In [None]:
Auto_re.loc[lambda df: df['year'] > 80, ['weight', 'origin']]

The symbol `&` computes an element-wise *and* operation, and `|` an element-wise *or* operation. As another example, suppose that we want to retrieve all `Ford` and `Datsun` cars with `displacement` less than 300. We check whether each `name` entry contains either the string `ford` or `datsun` using the `str.contains()` method of the `index` attribute of the data frame:

In [None]:
Auto_re.loc[lambda df: (df['displacement'] < 300)
                       & (df.index.str.contains('ford')
                       | df.index.str.contains('datsun')),
            ['weight', 'origin']
           ]
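
To see how the Boolean pieces combine, it can help to build them one at a time on a small example. The data frame `toy` below, its `size` column, and its index labels are hypothetical and made up for the illustration.

```python
import pandas as pd

# Hypothetical toy data frame with made-up values
toy = pd.DataFrame(
    {"size": [100, 350, 120, 400]},
    index=["ford a", "ford b", "datsun c", "other d"],
)

small = toy["size"] < 300                       # Boolean Series
is_ford = toy.index.str.contains("ford")        # Boolean array
is_datsun = toy.index.str.contains("datsun")    # Boolean array

selected = toy.loc[small & (is_ford | is_datsun)]
print(selected.index.tolist())                  # ['ford a', 'datsun c']
```

The parentheses around each comparison matter: `&` and `|` bind more tightly than `<` or `>`, so an expression such as `toy["size"] < 300 & is_ford` would not be evaluated as intended.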

## Flow control

In [None]:
a = range(10)
a

In [None]:
list(a)
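
`range()` also accepts a start value and a step. A short sketch, with hypothetical variable names:

```python
evens = list(range(0, 10, 2))       # start, stop (excluded), step
countdown = list(range(5, 0, -1))   # a negative step counts down

print(evens)       # [0, 2, 4, 6, 8]
print(countdown)   # [5, 4, 3, 2, 1]
```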

### For loop

In [None]:
for k in a:
    print(k)

### While loop

In [None]:
k = 0
while k < 10:
    print(k)  # print the value of k for that iteration
    k += 1    # increment the value of k

### If else

In [None]:
k = 0
while k < 10:
    if k%2 == 0 and k%3 == 0: # k%2 is the remainder of k divided by 2
        print("%s is divisible by 2 and by 3" % k)
    elif k%2 == 0:
        print("%d is divisible by 2" % k)
    elif k%3 == 0:
        print("{} is divisible by 3".format(k))
    else:
        print(f"{k} is neither divisible by 2 nor by 3")
    k += 1

In [None]:
k = 0
while k < 10:
    if k%2 == 0 and k%3 == 0: # k%2 is the remainder of k divided by 2
        print("%s is divisible by 2 and by 3" % k)
    elif k%2 == 0:
        print("%d is divisible by 2" % k) #just the number divisible by 2
    #elif k%3 == 0:
    #    print("{} is divisible by 3".format(k))
    #else:
     #   print(f"{k} is neither divisible by 2 nor by 3")
    k += 1

### Enumerate
Here is how to write a _standard_ loop, in which you refer to an element of a list by its index, using the built-in `enumerate()` function.

In [None]:
a = ['python', 'is', 'so', 'easy', '!']
a2 = ['one', 'two', 'three', 'four', 'five']


In the first line of the loop, `enumerate(a)` goes through the elements of `a` (the first list) in order and also returns the index of each element (its position in the list). We use the same index to retrieve the matching element of `a2`.

In [None]:
for idx, el in enumerate(a):
    print(idx, el, a2[idx])
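
When two lists are walked through in parallel, a common alternative is to combine `enumerate()` with the built-in `zip()`, which pairs up the elements directly. The sketch below uses hypothetical names (`words`, `labels`, `pairs`) and starts counting at 1 instead of 0.

```python
words = ['python', 'is', 'so', 'easy', '!']      # hypothetical names, same values as a
labels = ['one', 'two', 'three', 'four', 'five'] # same values as a2

pairs = []
# zip() pairs the elements; start=1 makes the counter begin at 1
for position, (word, label) in enumerate(zip(words, labels), start=1):
    pairs.append((position, word, label))
    print(position, word, label)
```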

## Functions
Functions are useful tools if we plan to do a task many times.

In [None]:
def power_y(x, y = 2): # power_y = name of the function; x and y = its arguments (y defaults to 2)
    return x ** y # this is what the function does

In [None]:
power_y(4, y = 3)
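
Arguments with a default value can be omitted, passed by position, or passed by name. The function `power()` below is a hypothetical stand-in with the same behavior as `power_y()`, defined here so that the sketch runs on its own.

```python
def power(base, exponent=2):
    """Return base raised to exponent (the default squares base)."""
    return base ** exponent

print(power(4))                    # 16: exponent falls back to its default, 2
print(power(4, 3))                 # 64: arguments passed by position
print(power(exponent=3, base=4))   # 64: keyword arguments can come in any order
```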

In [None]:
def hello():
    print('Hello World')

In [None]:
hello()

In [None]:
n = power_y(10)  # y is not given, so the default y = 2 is used
n

## Exercise

1. Upload the Vahatra_defor_noNA database from Moodle.
2. Print it.
3. Display the list of all column names.
4. Shorten the column names which end with the suffix "-01-01_treecover_ha".
5. Compute the average deforestation rate over the period 2010-2020 and assign it to a new column.
6. Create a scatter plot of the deforestation rate and the initial forested area of the protected areas.
7. Create a scatter plot of the deforestation rate and the distance to the closest city.

In [None]:
#Type your answers in several cells here