# Introduction to the Main Python Modules for Data Science

In this notebook, we briefly introduce, through some basic examples, the four essential Python modules that you will use extensively in this course and, probably, in any data science or machine learning project:
- `numpy`, which provides a toolbox for arrays (vectors, matrices, ...),
- `pandas`, which provides a complete and powerful framework for dataframes and series,
- `matplotlib`, for different types of plots, and
- `scikit-learn` (or `sklearn`), which provides a complete framework and toolbox for the most well-known machine learning methods.

Below, you will also find links to the official documentation, which you can consult to learn more about these modules and to find exactly the method or tool you need in your future exercises.

Search engines are also very useful when you are stuck. The Python community is *very* active, so in most cases you will quickly find an answer to your "*how to ...*" questions!
Discussing with other students is also highly encouraged, to help each other out and learn in groups (as long as you don't share your final answers for graded material). And of course, don't hesitate to ask the TAs when you are stuck.

Official documentation (user guides and lists of all functions, classes, and methods):
- `pandas`:
    - User guide (tutorials): https://pandas.pydata.org/pandas-docs/stable/user_guide/index.html
    - API reference (list of all functions and methods): https://pandas.pydata.org/pandas-docs/stable/reference/index.html
- `scikit-learn`:
    - User guide (tutorials): https://scikit-learn.org/stable/user_guide.html
    - API reference (list of all functions and methods): https://scikit-learn.org/stable/modules/classes.html
- `numpy`:
    - API reference: https://numpy.org/devdocs/reference/index.html
- `matplotlib`:
    - Tutorials: https://matplotlib.org/3.3.4/tutorials/index.html

## Arrays with `Numpy`

### Creation, shapes and access

In [1]:
import numpy as np

Numpy allows creating arrays in any dimension and with any shape:

In [2]:
#1D array of shape (4,) (representing, for example, the coordinates of a point in a 4-dimensional space)
a1 = np.array([2.,5.6,7.23,7.])
a1

array([2.  , 5.6 , 7.23, 7.  ])

In [3]:
#Shape of a1
a1.shape

(4,)

In [4]:
#access values
a1[2]

7.23

In [5]:
a1[1:3] #careful: 1:3 returns the values at indices 1 and 2 but NOT at 3

array([5.6 , 7.23])

In [6]:
#2D array, of shape (2,2) (representing, for example a 2x2 matrix)
a2 = np.array([[2., 5.6],
               [7.23, 7.]])
a2

array([[2.  , 5.6 ],
       [7.23, 7.  ]])

In [7]:
#Shape of a2
a2.shape

(2, 2)

In [8]:
#access values
a2[0:2, 1]

array([5.6, 7. ])

As they contain the same data, `a2` could also have been obtained by reshaping `a1`:

In [9]:
a2_prime = a1.reshape(2,2)
a2_prime

array([[2.  , 5.6 ],
       [7.23, 7.  ]])
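
As a small complement to the reshape above, here is a sketch of two details worth knowing: passing `-1` lets numpy infer one dimension, and a reshape fails when the number of elements does not match. The variable name `a2_inferred` is only illustrative.

```python
import numpy as np

a1 = np.array([2., 5.6, 7.23, 7.])

# -1 asks numpy to infer that dimension from the total number of elements
a2_inferred = a1.reshape(2, -1)
print(a2_inferred.shape)  # (2, 2)

# The number of elements must stay the same: 4 values cannot fill a (3, 2) array
try:
    a1.reshape(3, 2)
except ValueError as err:
    print("reshape failed:", err)
```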

Here are some useful functions to create specific arrays:

In [10]:
range_array = np.arange(2,15)
range_array

array([ 2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])

In [11]:
np.arange(2,15,3) #values from 2 (included) to 15 (excluded), in steps of 3

array([ 2,  5,  8, 11, 14])

In [12]:
np.linspace(0, 2, 6) #6 equally spaced values between 0 and 2 (both endpoints included)

array([0. , 0.4, 0.8, 1.2, 1.6, 2. ])

In [None]:
np.ones((3,4)) # array of the desired shape, filled with ones

In [None]:
np.zeros((2,3)) # Same with zeros

In [None]:
#Two examples of 3D-arrays of shape (3,4,2) (1/2)
np.ones((3,4,2))

In [None]:
#Two examples of 3D-arrays of shape (3,4,2) (2/2)
np.arange(3*4*2).reshape(3,4,2)

### Basic operations

In [None]:
a = np.array([0, 1, 2, 3])
a

In [None]:
a*2 #multiplication by scalar

In [None]:
a**2 # a to the power of two

In [None]:
a**2 / 2  # (a^2)/ 2

In [None]:
np.sin(a) # numpy predefined sin function (many other functions exist: cos, log, exp, sqrt, ...)

For a list of available universal functions: https://numpy.org/devdocs/reference/ufuncs.html#available-ufuncs

In [None]:
a < 2 # boolean condition applied element-wise to the vector
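
The boolean array produced by such a condition can also be used as a mask to select values. A minimal sketch (the names `mask` and `selected` are only illustrative):

```python
import numpy as np

a = np.array([0, 1, 2, 3])

mask = a < 2        # boolean array with one entry per element of a
selected = a[mask]  # keeps only the entries where the mask is True
print(selected)     # [0 1]

# True counts as 1, so summing the mask counts the matching entries
print(mask.sum())   # 2
```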

In [None]:
# Two matrices A and B
A = np.array([[1, 2],
              [5, 3]])

B = np.array([[1, 3],
              [2, 8]])
A

In [None]:
#Element-wise product (NOT the matrix product)
A * B

In [None]:
#Matrix product
A @ B

In [None]:
#Alternatively: (does the same)
A.dot(B)
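
To make the difference between the two products concrete, here is a short sketch that computes both on the same matrices and compares them:

```python
import numpy as np

A = np.array([[1, 2],
              [5, 3]])
B = np.array([[1, 3],
              [2, 8]])

elementwise = A * B  # multiplies entries at the same position
matrix_prod = A @ B  # rows of A times columns of B

print(elementwise)   # [[ 1  6] [10 24]]
print(matrix_prod)   # [[ 5 19] [11 39]]

# A.dot(B) gives the same result as A @ B for 2D arrays
print(np.array_equal(matrix_prod, A.dot(B)))  # True
```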

In [None]:
#Sum over all elements in A
A.sum()

In [None]:
#Sum in A along specified axis 0 (same for min, max, etc...)
A.sum(axis=0)

In [None]:
#Sum in A along specified axis 1 (same for min, max, etc...)
A.sum(axis=1)

In [None]:
#A transposed
A.transpose()

In [None]:
#inverse of A
np.linalg.inv(A)
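
A quick way to check an inverse is to multiply it back with the original matrix. Because of floating-point rounding, the sketch below compares with `np.allclose` rather than exact equality:

```python
import numpy as np

A = np.array([[1, 2],
              [5, 3]])

A_inv = np.linalg.inv(A)

# A @ A_inv should be the 2x2 identity matrix, up to rounding errors
print(np.allclose(A @ A_inv, np.eye(2)))  # True
```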

### Random number generation

The submodule `numpy.random` is used to generate random values in an array of specified shape.

In projects that involve randomness, it is always a good idea to first set a seed for reproducibility.

In [None]:
np.random.seed(1)

Then for example:

In [None]:
#Gaussian realisations in an array of shape (2,3) (try running this cell multiple times)
np.random.randn(2,3)

In [None]:
# For the more general N(mu,sigma^2) distribution, one can do:
mu = 1
sigma = 2
mu + np.random.randn(2,3) * sigma

In [None]:
#Uniform(0,1) distribution
np.random.rand(3,4)
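
As an alternative to the global seed used above, numpy also offers generator objects created with `np.random.default_rng`. The sketch below draws the same kinds of samples with a generator; the variable names are only illustrative, and the drawn values differ from those of `np.random.seed(1)`.

```python
import numpy as np

rng = np.random.default_rng(1)  # generator with its own seed

gaussian = rng.normal(loc=1, scale=2, size=(2, 3))  # N(mu=1, sigma^2=4) samples
uniform = rng.uniform(low=0, high=1, size=(3, 4))   # Uniform(0,1) samples

# A second generator with the same seed reproduces the same first draw
rng_again = np.random.default_rng(1)
same = rng_again.normal(loc=1, scale=2, size=(2, 3))
print(np.array_equal(gaussian, same))  # True
```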

## Data Frames with `pandas`

In [None]:
import pandas as pd

In `pandas` you will mainly deal with two types of data structures, `pd.Series` and `pd.DataFrame`.

DataFrames are for tabular data, which you have probably already dealt with, for example in R with data.frames or tibbles. Each column represents a *variable* and each row an *observation*. Each variable has a single type (e.g. int, float, string, categorical (=factor in R), datetime).

Series only contain observations for a single variable. They can be seen as a single-column DataFrame, but it's important to distinguish the two, as they are not manipulated in exactly the same manner.

### Series

Here is an example of a Series of type int:

In [None]:
series_data = pd.Series([243, 12, 126, 101])
series_data

In [None]:
#one can also give a custom index
series_data2 = pd.Series([243, 12, 126, 101],
                         index=['John', 'Rebecca', 'Francis', 'Albert'])
series_data2

One can then access values through:

In [None]:
series_data2["Rebecca"]

Or convert the values to an array with:

In [None]:
series_data2.values

Perform operations:

In [None]:
series_data3 = pd.Series([3, 4, 3, 4], index=['John', 'Rebecca', 'Francis', 'Albert'])
series_data2 + 2 * series_data3
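
Note that operations between two Series match values by index label, not by position. A minimal sketch with two small hypothetical Series whose labels are in a different order:

```python
import pandas as pd

s1 = pd.Series([1, 2], index=["a", "b"])
s2 = pd.Series([10, 20], index=["b", "a"])  # same labels, different order

total = s1 + s2  # values are paired by label before adding
print(total["a"])  # 21  (1 + 20)
print(total["b"])  # 12  (2 + 10)
```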

Also boolean operations:

In [None]:
series_data2 < 120

And select sub-Series based on boolean condition:

In [None]:
series_data2[series_data2 < 120]

### DataFrames

Here is an example of a DataFrame, that gives some information about a group of observed animals:

In [None]:
data = pd.DataFrame({'population':[423, 334, 5686, 554, 438353, 229, 1001, 124],
                     'species':[1, 1, 1, 1, 2, 2, 2, 2],
                     'color':['Green', 'Red', 'Blue',
                              'Yellow', 'Green', 'Red', 'Blue', 'Yellow']})
data

In [None]:
data.shape

One could also import a DataFrame from a `.csv` file, for example with `data = pd.read_csv(file_path)` or with `pd.read_table(file_path, sep=",")` (which also works for other delimited text formats).
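
To try `pd.read_csv` without a file on disk, the sketch below reads CSV text from an in-memory buffer. The content of `csv_text` is made up for illustration; with a real file you would pass its path instead.

```python
import io
import pandas as pd

# Hypothetical CSV content, standing in for a file on disk
csv_text = """population,species,color
423,1,Green
334,1,Red
"""

data_from_csv = pd.read_csv(io.StringIO(csv_text))
print(data_from_csv.shape)          # (2, 3)
print(list(data_from_csv.columns))  # ['population', 'species', 'color']
```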

One can access a subset of columns via:

In [None]:
data[["color","population"]]

Be aware of the different syntaxes for accessing a single column:

In [None]:
data[["population"]] #returns a DataFrame

In [None]:
data["population"] #Returns a Series

In [None]:
data.population #Alternative syntax, also returns a Series

One can access specific observations with `.loc`:

In [None]:
data.loc[[3,5]]

In [None]:
data.loc[3:5]
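
Be careful: unlike numpy slicing, a `.loc` slice is based on index labels and includes both endpoints. Position-based selection is done with `.iloc`, which excludes the end. A small sketch on a hypothetical single-column DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"population": [423, 334, 5686, 554, 438353, 229]})

by_label = df.loc[3:5]      # rows labelled 3, 4 and 5 (end included)
by_position = df.iloc[3:5]  # rows at positions 3 and 4 (end excluded)

print(len(by_label), len(by_position))  # 3 2
```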

It also works with conditions:

In [None]:
data.loc[data.population < 500]

Or with queries (for more complex conditions):

In [None]:
data.query("population < 500 & color == 'Red'")
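
The same selection can be written with boolean conditions inside `.loc`. In this sketch (on a small hypothetical DataFrame), each condition needs its own parentheses because `&` binds more tightly than `<` and `==`:

```python
import pandas as pd

df = pd.DataFrame({"population": [423, 334, 5686, 229],
                   "color": ["Green", "Red", "Blue", "Red"]})

with_query = df.query("population < 500 & color == 'Red'")
with_mask = df.loc[(df["population"] < 500) & (df["color"] == "Red")]

print(with_query.equals(with_mask))  # True
```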

One can access (and modify) some information, like column names and the index via, for example:

In [None]:
data.columns # = ["new_name1","new_name2","new_name3"]

In [None]:
data.index # = [new index values]

In [None]:
data.index.name="id"
data

A new column can be easily created or modified, for example:

In [None]:
data["year"]=2020
data

In [None]:
data["year"]=[2020,2020,2021,2021,2019,2019,2019,2021]
data

or dropped:

In [None]:
data.drop("year", axis=1, inplace=True)
# or equivalently:
# data = data.drop("year", axis=1)
data

### Data Types

In [None]:
data.dtypes # type of each column (strings are considered as "object")

One can change the column types, for example with:

In [None]:
modified_data = data.astype({"population":"float"})
modified_data

The type "category" is the pandas equivalent to R's factors:

In [None]:
modified_data = data.astype({"color":"category"})
modified_data

In [None]:
modified_data.dtypes

In [None]:
modified_data.color.cat.categories

In [None]:
modified_data.color.cat.codes

### Some useful methods

Here are some useful (self-explanatory) methods and attributes that you may often need to use:

#### Data Exploration

In [None]:
data.head(5) #first 5 entries  (tail(5) for the last 5)

In [None]:
data.population.describe() # basic stats for the "population" variable

In [None]:
data.color.value_counts() # counts the occurrences of each value in "color"

In [None]:
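data.corr(numeric_only=True) #correlation between numerical columns (numeric_only=True skips the string column "color", which pandas >= 2.0 would otherwise reject)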

In [None]:
data.sort_values(by="population", ascending=True)

#### Indexing

In [None]:
data.index.is_unique #checks if index is unique (different value for each observation) which is always desirable

In [None]:
data.sort_index(ascending=True)

In [None]:
data = data.set_index(["species","color"]) # (Multi)-indexing
data

In [None]:
data.reset_index(inplace=True)
data

#### Missing values

In [None]:
data.loc[3, "population"] = np.nan #suppose there is a missing value
data

In [None]:
data.dropna() # Drop rows with NA values 

In [None]:
data.fillna(0) #fills NAs
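
Two related points, sketched on a small hypothetical DataFrame: `isna()` lets you count missing values, and `dropna()` / `fillna()` return a new object rather than modifying the original. Filling with the column mean is shown only as one possible choice, not as a recommendation.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"population": [423.0, np.nan, 5686.0, 554.0]})

n_missing = df["population"].isna().sum()  # number of missing values
col_mean = df["population"].mean()         # NaN is skipped: (423 + 5686 + 554) / 3

filled = df.fillna({"population": col_mean})  # returns a new DataFrame

print(n_missing)                              # 1
print(filled.loc[1, "population"])            # 2221.0
print(df["population"].isna().sum())          # 1, the original is unchanged
```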

### More advanced concepts

#### GroupBy, aggregate and Apply

Quite often in your analyses you will need to evaluate statistics within subgroups in your data.

You can achieve this with the `groupby` method. Suppose, for example, that we want the mean population by species in `data`:

In [None]:
data.groupby("species")[["population"]].mean()

In [None]:
data.groupby("species")[["population"]].mean().add_suffix('_mean') # The result is a bit clearer with a suffix

One could achieve the same with:

In [None]:
data.groupby("species")[["population"]].apply(np.mean).add_suffix('_mean')

where any desired function can be applied to the groups instead of `np.mean`, for example a user defined function.
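
As an illustration of a user-defined function, the sketch below computes the range (max minus min) of the population within each species. The helper `value_range` and the small DataFrame are hypothetical, and the function is passed to `.agg`, which is one way of reducing each group to a single value.

```python
import pandas as pd

df = pd.DataFrame({"population": [423, 334, 5686, 554],
                   "species": [1, 1, 2, 2]})

def value_range(values):
    """Hypothetical helper: difference between the largest and smallest value."""
    return values.max() - values.min()

# The function receives the "population" values of one group at a time
ranges = df.groupby("species")["population"].agg(value_range)
print(ranges.loc[1])  # 89    (423 - 334)
print(ranges.loc[2])  # 5132  (5686 - 554)
```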

GroupBy can also take multiple columns for the grouping. Suppose we want the mean population by species and color: (in this particular case it's not very useful, but you get the idea)

In [None]:
data.groupby(["species","color"])[["population"]].apply(np.mean).add_suffix('_mean')

The apply method can also be used without the groupby, to apply any function to a column!
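
For instance, here is a minimal sketch of `apply` on single columns of a small hypothetical DataFrame, once with a named function and once with a lambda. The new column names are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"color": ["Green", "Red"],
                   "population": [423, 334]})

def first_letter(name):
    """Hypothetical helper: first letter of a color name."""
    return name[0]

# The function is called once per value in the column
df["color_code"] = df["color"].apply(first_letter)
df["population_k"] = df["population"].apply(lambda n: n / 1000)

print(df["color_code"].tolist())    # ['G', 'R']
print(df["population_k"].tolist())  # [0.423, 0.334]
```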

#### Combining DataFrames

If you need additional variables for your analysis, that are imported in a separate DataFrame, you can "merge" the already existing `data` with it:

In [None]:
# The additional data frame:
df2 = pd.DataFrame({'species':[1, 2],
                     'number_legs':[2,4]})
df2.set_index("species", inplace=True)
df2

In [None]:
pd.merge(data, df2, left_on='species', right_index=True, how='outer')

You can also concatenate your `data` with additional observations for the same variables:

In [None]:
# The additional observations:
df3 = pd.DataFrame({'population':[12, 34],
                     'species':[1, 2],
                     'color':['Purple', 'Pink']})
df3

In [None]:
pd.concat([data, df3], axis=0)
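
Note that a plain concatenation keeps the original index of each DataFrame, so index values can end up duplicated. The sketch below, on two small hypothetical DataFrames, shows how `ignore_index=True` renumbers the rows instead:

```python
import pandas as pd

df_a = pd.DataFrame({"population": [423, 334]})
df_b = pd.DataFrame({"population": [12, 34]})

stacked = pd.concat([df_a, df_b], axis=0)
print(stacked.index.is_unique)  # False: the index is 0, 1, 0, 1

renumbered = pd.concat([df_a, df_b], axis=0, ignore_index=True)
print(list(renumbered.index))   # [0, 1, 2, 3]
```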

If interested, you can find more detailed explanations here:
https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html

The pandas library offers many other powerful tools for data wrangling, which should cover most of the tasks and challenges you will face with your data. Do not hesitate to check out the parts of the documentation that are relevant to your specific needs at https://pandas.pydata.org/pandas-docs/stable/user_guide/index.html.

## Figures with `matplotlib`

In [None]:
import matplotlib.pyplot as plt
# To ensure showing the plots nicely in the notebook:
%matplotlib inline

The matplotlib module lets you construct a wide range of plot types, with a syntax similar to MATLAB.
Here are a few simple examples:

In [None]:
x = np.arange(10)
y = [4,2,2,8,7,6,6,9,6,10]

In [None]:
plt.figure(figsize=(9,4))
plt.plot(x,y)
plt.title("A simple line plot")
plt.xlabel("x")
plt.ylabel("y")
plt.show()

In [None]:
plt.figure(figsize=(6,6))
plt.bar(x,y)
plt.title("A simple bar plot")
plt.xlabel("x")
plt.ylabel("y")
plt.show()

In [None]:
plt.figure(figsize=(6,6))
plt.boxplot(y)
plt.title("A simple boxplot",size=16)
plt.xticks(ticks=[1], labels=["x"], size=16)
plt.show()

In [None]:
plt.figure(figsize=(6,5))
plt.hist(y, bins=6, density=True)
plt.title("A simple histogram", size=20)
plt.xlabel("x", size=14)
plt.ylabel("density", size=14)
plt.show()
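
The examples above use the `plt.` functions directly. Matplotlib also has an object-oriented interface, which is convenient for placing several plots side by side. A minimal sketch with `plt.subplots`, reusing the same `x` and `y` values:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(10)
y = [4, 2, 2, 8, 7, 6, 6, 9, 6, 10]

# One figure containing a 1x2 grid of axes
fig, axes = plt.subplots(1, 2, figsize=(10, 4))

axes[0].plot(x, y)
axes[0].set_title("Line plot")
axes[0].set_xlabel("x")
axes[0].set_ylabel("y")

axes[1].bar(x, y)
axes[1].set_title("Bar plot")
axes[1].set_xlabel("x")

fig.tight_layout()  # avoids overlapping labels between the two panels
plt.show()
```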

### `pandas` + `matplotlib`

Pandas already has built-in plotting functions that are compatible with matplotlib! :)

Suppose that we want a bar plot of the log population by color, using our `DataFrame` "data" from above:

In [None]:
#First we extract the desired info from the data, with groupby:
pop_by_color = data.groupby("color")[["population"]].sum().apply(np.log).add_prefix("log_")
pop_by_color

In [None]:
pop_by_color.plot.bar(y=["log_population"], figsize=(8,6))
plt.title("A simple bar plot of the log population")
plt.show()

Same with a `Series`:

In [None]:
#Same data, but in the form of a Series instead of a DataFrame:
pop_by_color_series = pop_by_color.log_population
pop_by_color_series

In [None]:
pop_by_color_series.plot.bar(figsize=(8,6))
plt.title("A simple bar plot of the log population")
plt.ylabel("log population")
plt.show()

## Machine Learning with Scikit-Learn

Scikit-learn is currently one of the most used frameworks for machine learning. It not only contains efficient implementations of a wide range of classical machine learning methods, but it also set a new syntax (API) standard for many other complementary machine learning libraries in python, that now share the same user interface.

Apart from the ML methods themselves, sklearn also provides tools for the other tasks in your pipeline, such as preprocessing the data (Transformers), computing metrics and losses, performing model selection, and so on.

All these tools are available as classes (see the relevant section in `Introduction_to_python.ipynb`), with consistent method names:
- All supervised learning methods have a `.fit()` method to fit the model and a `.predict()` method for prediction on new data.
- All data transformers have a `.fit()` method to estimate the needed statistics and a `.transform()` method for transforming the data.
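
As a first illustration of the `.fit()` / `.predict()` pattern, here is a minimal sketch with `LinearRegression` on made-up data that follows y = 2x + 1 exactly. The data and variable names are assumptions chosen for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up training data: one feature, targets on the line y = 2x + 1
X_train = np.array([[0.], [1.], [2.], [3.]])
y_train = 2 * X_train[:, 0] + 1

model = LinearRegression()   # create the model from the class
model.fit(X_train, y_train)  # estimate slope and intercept

X_new = np.array([[4.], [5.]])
y_pred = model.predict(X_new)  # predictions for unseen inputs
print(y_pred)                  # close to [ 9. 11.]
```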

Scikit-learn works both with numpy arrays and pandas DataFrames.

For example, the `StandardScaler` is a Transformer that computes the mean and standard deviation of each variable (column) and uses these statistics to center and scale the data:

In [None]:
X = np.array([[2.2,4.6],
             [3.,8.],
             [7.4,6.2]])
X

In [None]:
from sklearn.preprocessing import StandardScaler

my_scaler = StandardScaler() # create a scaler from the class (blueprint)

my_scaler.fit(X) # computes and stores the mean and SD by column
X_std = my_scaler.transform(X) # Transforms each column: (col_values - mean) / SD

X_std

The scaler can then transform additional data without re-evaluating the mean and SD (you will understand in a few weeks why this is useful).
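
A minimal sketch of this, with a made-up new observation `X_new`: the fitted scaler reuses the mean and SD stored in its `mean_` and `scale_` attributes, which we can verify by redoing the computation by hand.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[2.2, 4.6],
              [3., 8.],
              [7.4, 6.2]])

my_scaler = StandardScaler()
my_scaler.fit(X)  # mean and SD are computed on X only

X_new = np.array([[4.2, 5.0]])  # made-up new observation
X_new_std = my_scaler.transform(X_new)  # uses the statistics stored from X

# Same computation by hand, with the stored statistics
manual = (X_new - my_scaler.mean_) / my_scaler.scale_
print(np.allclose(X_new_std, manual))  # True
```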

We will not go into more detail here, as you will learn more about what `sklearn` offers through the exercises during the semester.

Scikit-learn also has really comprehensive tutorials and documentation on the official website.
To see how to use the machine learning methods covered in the course, along with some practical tips, check the user guide at https://scikit-learn.org/stable/user_guide.html; a structured list of all classes and functions is available at https://scikit-learn.org/stable/modules/classes.html