# Preprocessing with pandas and scikit-learn

## What you'll learn in this class

In the coming weeks, you will learn how to do machine learning on data. A very important step before using any machine learning model is preprocessing, which covers all the operations used to prepare data for that model. You may have noticed that you will sometimes have to handle a wide variety of data (numbers, text, dates, times, ...) that must be transformed, because a machine learning model is a mathematical model that only knows how to process numbers. Here's what you'll be able to do after this course:

* Choose which preprocessing steps to apply according to the type of data you want to use in a machine learning model

## What is preprocessing

From week 3 onwards, you will get to the heart of machine learning! Without going into details (this is not the purpose of this course), you can remember for now that "machine learning" is a set of methods that make it possible to carry out tasks without explicitly coding every step those tasks require. The applications are very diverse: image recognition, semantic analysis, cost prediction, customer segmentation, ... The principle of machine learning is to "show" data to mathematical models and let these models "learn" from this data the best way to solve the problem.

You may have noticed that in Data Science, we have to manipulate different types of data: tables of numbers, but also text, or even images. However, since machine learning models are mathematical models, they can only process numbers. It is therefore necessary to transform the different types of data into numbers that a model can interpret. "Preprocessing" is the set of operations carried out to prepare a dataset before using it in a machine learning model.

As you will discover during the rest of the Full Stack program, preprocessing is a crucial step in machine learning, and it is a discipline in its own right! Data Scientists often spend a lot of time figuring out how to properly prepare the data to optimize the performance of a machine learning model.

Today's course introduces the "basic" steps of preprocessing, those that are the most common and that you will almost always encounter when you do machine learning. You will learn how to use the pandas and scikit-learn libraries to code the different preprocessing steps in Python.



## The different types of data

### Structured data

All data that can be represented in tabular form is called "structured data" (you may also come across the term "flat database"). The data you have encountered so far is in fact all structured data: SQL tables, CSV files, dictionaries returned by an API, etc. As a rule of thumb, if you can find a way to read and store your data in a pandas DataFrame, then you are working with structured data.

When working with structured data, the columns of the table represent the available variables: age, salary, gender, country, etc. These variables can be classified into four broad families (two kinds of quantitative variables and two kinds of qualitative variables). These families are useful to keep in mind because the preprocessing steps depend on the family of the variable we are working on.


#### The four main families of variables

**Quantitative/numeric variables**
These are all variables that are written as a number. There are two sub-categories:

- continuous quantitative variable: can take all possible values in a given interval. For example: a distance, temperature, salary, a patient's blood sugar level, etc.
- discrete quantitative variable: takes only a limited number of values. For example: age, shoe size, exam score, etc.

**Qualitative/categorical variables**
These are all variables that are not (directly) expressed as a number. They are then expressed in the form of short texts, called "modalities". There are two subfamilies:

- ordinal qualitative variable: there is a notion of order or hierarchy between modalities. For example: the answers to a satisfaction questionnaire, or the types of house you created in exercise S1-3B "house market": "small house" / "large house" / "very large house".
- non-ordinal qualitative variable: there is no hierarchy between modalities. For example: nationality, gender, socio-professional category, city, postcode, ... In general, most categorical variables are non-ordinal.
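As a first step, pandas can help separate numeric columns from the others. The sketch below uses a small invented dataset (the column names are assumptions made for illustration). Note that pandas only sees the storage type: deciding whether a variable is continuous or discrete, ordinal or non-ordinal, still requires your own judgment.

```python
import pandas as pd

# Hypothetical toy dataset: column names and values are invented for illustration.
df = pd.DataFrame({
    "salary": [2100.5, 3500.0, 2800.75],        # continuous quantitative
    "age": [25, 41, 33],                        # discrete quantitative
    "satisfaction": ["low", "high", "medium"],  # ordinal qualitative
    "country": ["France", "Spain", "Germany"],  # non-ordinal qualitative
})

# Columns stored as numbers are candidates for quantitative variables.
numeric_columns = df.select_dtypes(include="number").columns.tolist()

# Everything else (here, text) is a candidate for qualitative variables.
categorical_columns = df.select_dtypes(exclude="number").columns.tolist()

print(numeric_columns)
print(categorical_columns)
```

Be careful with columns such as postcodes: they may be stored as numbers even though they are really non-ordinal categorical variables.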


#### Target variable / explanatory variables

One branch of machine learning aims to train models that can predict the value of one variable as a function of the others. This is called "supervised" machine learning. In that case, we call:

- **Y**, the "target variable", "variable to explain" or the "label", the variable you are trying to predict.
- **X**, the "explanatory variables", the other variables that the model will use to make its prediction.
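In pandas, separating **Y** from **X** usually comes down to selecting one column and dropping it from the rest. This is a minimal sketch on an invented dataset, where the column name `purchased` is a hypothetical target.

```python
import pandas as pd

# Hypothetical dataset: "purchased" plays the role of the target variable.
df = pd.DataFrame({
    "age": [25, 41, 33, 52],
    "country": ["France", "Spain", "Germany", "France"],
    "purchased": ["no", "yes", "no", "yes"],
})

target_name = "purchased"

# Y: the variable we want to predict.
Y = df[target_name]

# X: all the other variables, used by the model to make its prediction.
X = df.drop(columns=[target_name])

print(X.columns.tolist())
print(Y.tolist())
```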


### Unstructured data

Unstructured data is any data that cannot be stored in a simple table. Here are some examples of unstructured data:

- all the articles on Wikipedia (this is called a corpus of texts)
- images
- digitized sounds

Each of the types of data listed above requires preparation according to a specific method. You will learn how to process this kind of data in the course on Deep Learning (S6-S7).

For now, we will restrict this course to the preparation of structured data.



## The different types of preprocessing for structured data

### Dropping rows and columns

First of all, we clean the dataset to keep only the rows and columns that will be useful for machine learning.

#### Dropping columns

The following columns are excluded from the dataset:

- All columns that are "unique identifiers": dataset index, surname/first name of a person, social security number, transaction number, ...
- Columns with too many missing values. There is no general rule, but as a rule of thumb, if the rate of missing values exceeds 60-70%, it is better to exclude the column from the dataset.
- Non-ordinal categorical variables that have too many modalities. Again, there is no general rule, but for the small datasets we will be working on during the training, you should exclude columns that have more than 20-30 modalities.
- Columns that are perfectly correlated with another column: only one of the two is kept. Typically, one never keeps both the age and the year of birth of a person.
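The rules above can be sketched with pandas as follows. The dataset, the column names and the exact thresholds (60% of missing values, 20 modalities) are assumptions chosen for illustration, not fixed rules.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset: all column names and values are invented.
df = pd.DataFrame({
    "customer_id": [101, 102, 103, 104, 105],                 # unique identifier
    "age": [25, 41, np.nan, 52, 38],
    "birth_year": [1999, 1983, np.nan, 1972, 1986],           # redundant with age
    "fax_number": [np.nan, np.nan, np.nan, np.nan, "0102"],   # mostly missing
    "country": ["France", "Spain", "Spain", "France", "Germany"],
})

# 1. Identifiers and redundant columns are chosen by hand.
df = df.drop(columns=["customer_id", "birth_year"])

# 2. Columns whose rate of missing values exceeds the chosen threshold.
missing_rate = df.isna().mean()
too_empty = missing_rate[missing_rate > 0.6].index.tolist()
df = df.drop(columns=too_empty)

# 3. Categorical columns with more modalities than the chosen limit.
max_modalities = 20
categorical = df.select_dtypes(exclude="number").columns
too_many_modalities = [c for c in categorical if df[c].nunique() > max_modalities]
df = df.drop(columns=too_many_modalities)

print(df.columns.tolist())
```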

#### Dropping rows

The following rows are excluded from the dataset:

- If we are working on a supervised machine learning problem, all the rows for which the target variable **Y** is missing.
- Rows containing outliers, i.e. values that are "strange", inconsistent, or very far from the usual values: a negative age, a score above 20 (when scores are out of 20), a very high salary compared to the rest of the population, an unknown city name that appears only once in the dataset, ...
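Here is a minimal sketch of both rules with pandas. The data is invented, and the validity rules (age must be non-negative, score must lie between 0 and 20) are assumptions that you would adapt to your own dataset.

```python
import pandas as pd

# Hypothetical dataset: "Y" is the target variable.
df = pd.DataFrame({
    "age": [25, -3, 33, 52, 47],
    "score": [12.0, 15.5, 25.0, 8.0, 17.0],  # assumed to be graded out of 20
    "Y": ["yes", "no", "no", None, "yes"],
})

# 1. Drop the rows where the target variable is missing.
df = df.dropna(subset=["Y"])

# 2. Keep only the rows whose values are consistent.
is_valid = (df["age"] >= 0) & df["score"].between(0, 20)
df = df[is_valid].reset_index(drop=True)

print(df)
```

Only the first and last rows survive: one row has a missing target, one has a negative age, and one has a score above 20.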



### Imputation of missing values

Imputation is the replacement of a missing value with a substitute value. There are a multitude of imputation methods, and it is a discipline in its own right. Below are the simplest and most common methods. Depending on the type of variable, you will use one or the other.

#### Quantitative variables

##### Mean imputation

Mean imputation consists of replacing the missing values with the mean of the values found in the column. This method is generally used for continuous quantitative variables.

##### Median imputation

Median imputation involves replacing missing values with the median of the values found in the column. This method is used for discrete quantitative variables, or for continuous quantitative variables that may have extreme values (for example, the salaries of a population), because the median is less sensitive to extreme values than the mean.
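Both strategies are available in scikit-learn through `SimpleImputer`. The sketch below applies mean imputation to one column and median imputation to another, using invented values.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data: one missing value per column.
X = pd.DataFrame({
    "temperature": [18.0, np.nan, 22.0, 20.0],
    "salary": [2000.0, 2500.0, np.nan, 30000.0],  # contains an extreme value
})

mean_imputer = SimpleImputer(strategy="mean")
median_imputer = SimpleImputer(strategy="median")

# fit_transform expects a 2D input, hence the double brackets.
X["temperature"] = mean_imputer.fit_transform(X[["temperature"]]).ravel()
X["salary"] = median_imputer.fit_transform(X[["salary"]]).ravel()

print(X)
```

The missing temperature is replaced by 20.0 (the mean of 18, 22 and 20), and the missing salary by 2500.0 (the median), which is not pulled upwards by the salary of 30000.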


#### Categorical variables 

For categorical variables, neither a mean nor a median can be calculated. In this case, the simplest way is to replace the missing values with the most frequently encountered modality in the column. This is called "mode" imputation.
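Mode imputation can be sketched directly with pandas, as below (the data is invented). In scikit-learn, `SimpleImputer(strategy="most_frequent")` serves the same purpose.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column with two missing values.
df = pd.DataFrame({"country": ["France", "Spain", np.nan, "France", np.nan]})

# mode() ignores missing values and returns the most frequent modality first.
most_frequent = df["country"].mode()[0]

# Replace every missing value with that modality.
df["country"] = df["country"].fillna(most_frequent)

print(df["country"].tolist())
```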

### Standardization (quantitative variables only)

To avoid working with very large values (which can cause numerical problems) and to help our models train, we transform all the quantitative explanatory variables so that they vary only within intervals of a few units: this process is called normalization. There are various methods of normalization, but the most widely used, called standardization, rescales each variable so that it has a mean of 0 and a standard deviation of 1.

_Note: in supervised machine learning, the target variable is never normalized!_
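Concretely, standardization replaces each value x of a column with (x - mean) / standard deviation, where the mean and standard deviation are computed on that column. In scikit-learn this is done by `StandardScaler`. Here is a minimal sketch on invented data:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical quantitative explanatory variables on very different scales.
X = pd.DataFrame({
    "age": [25, 41, 33, 52],
    "salary": [2000.0, 3500.0, 2800.0, 5200.0],
})

scaler = StandardScaler()

# fit_transform learns the mean and standard deviation of each column,
# then applies (x - mean) / std. The result is a NumPy array.
X_scaled = scaler.fit_transform(X)

print(X_scaled.mean(axis=0))  # close to 0 for every column
print(X_scaled.std(axis=0))   # close to 1 for every column
```

After the transformation, both columns vary within a few units around 0, whatever their original scale.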

### Encoding (categorical variables only)

Finally, since machine learning models deal with numbers only, all categorical variables must be encoded. Depending on the type of variable, different methods will be used:


#### Target variable

When working in supervised machine learning, it is possible that the target variable **Y** is categorical. In this case, we will simply encode it by making each modality correspond to a number. This transformation can be summarized as follows:

|  Y  | 
|-----|
| no  |
| yes |
| no  |
| no  |
| yes |

becomes:

| Y | 
|---|
| 0 |
| 1 |
| 0 |
| 0 |
| 1 |
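The transformation shown in the tables above can be sketched with scikit-learn's `LabelEncoder`, which is intended for target variables. It assigns the numbers following the sorted order of the modalities, so here "no" becomes 0 and "yes" becomes 1.

```python
from sklearn.preprocessing import LabelEncoder

# The target variable from the example above.
Y = ["no", "yes", "no", "no", "yes"]

encoder = LabelEncoder()
Y_encoded = encoder.fit_transform(Y)

print(encoder.classes_)  # modalities, in the order of their assigned numbers
print(Y_encoded)
```

The fitted encoder can also convert predictions back to the original modalities with `encoder.inverse_transform`.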


#### Explanatory variables

##### Ordinal variable
Ordinal variables can be encoded in the same way as the target variable, as long as the numbers follow the order of the modalities. However, remember that it is rare to have to deal with ordinal categorical variables.
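For explanatory variables, scikit-learn provides `OrdinalEncoder`. By default it sorts the modalities alphabetically, which rarely matches their real hierarchy, so the sketch below passes the order explicitly. The column name `house_type` is an assumption, and the modalities reuse the "house market" example mentioned earlier.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordinal variable.
X = pd.DataFrame({
    "house_type": ["small house", "very large house", "large house", "small house"],
})

# Explicit order of the modalities, from smallest to largest.
order = ["small house", "large house", "very large house"]

encoder = OrdinalEncoder(categories=[order])
X["house_type_encoded"] = encoder.fit_transform(X[["house_type"]]).ravel()

print(X)
```

With this order, "small house" is encoded as 0, "large house" as 1 and "very large house" as 2.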

##### Non-ordinal variable
Most of the time, the method described in this section is the one you will use to encode categorical explanatory variables.

When the variable is non-ordinal, it cannot be encoded by simply making a correspondence between a modality and a number, as this would introduce a notion of hierarchy between categories that could degrade the performance of the model. We have to find a way to transform information into numbers, without introducing this notion of order, which would be meaningless. In this case, the simplest method is what is called One Hot Encoding (or Dummy Encoding): we will create as many columns as there are possible modalities, and indicate by a "1" the belonging to a category. Let's take an example to make it more concrete:

| Country |
|---------|
| France  |
| France  |
| Spain   |
| Germany |
| Germany |
| Spain   |
| Spain   |

becomes:

| France | Spain | Germany |
|--------|-------|---------|
|    1    |   0    |    0     |
|    1    |   0    |    0     |
|    0    |   1    |    0     |
|    0    |   0    |    1     |
|    0    |   0    |    1     |
|    0    |   1    |    0     |
|    0    |   1    |    0     |

An additional subtlety: we saw earlier in this course that perfectly correlated columns should not be fed to machine learning models. In the above example, if the values in the first two columns are known, the values in the third column can be determined exactly. For this reason, one of the columns produced by the encoding is usually discarded, resulting in:

| France | Spain |
|--------|-------|
|    1    |   0    |
|    1    |   0    |
|    0    |   1    |
|    0    |   0    |
|    0    |   0    |
|    0    |   1    |
|    0    |   1    |

In this way, no information is lost, but the redundancy between the encoded columns is removed.
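One Hot Encoding with a discarded column can be sketched with scikit-learn's `OneHotEncoder` and its `drop="first"` option. Note one difference with the tables above: the encoder sorts the modalities alphabetically and drops the first one, so here the "France" column is discarded instead of "Germany". The principle is the same.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# The Country column from the example above.
X = pd.DataFrame({
    "Country": ["France", "France", "Spain", "Germany", "Germany", "Spain", "Spain"],
})

# drop="first" discards the column of the first modality (alphabetical order).
encoder = OneHotEncoder(drop="first")

# The encoder returns a sparse matrix by default; toarray() makes it dense.
encoded = encoder.fit_transform(X[["Country"]]).toarray()

print(encoder.categories_[0])  # France, Germany, Spain
print(encoded)                 # remaining columns: Germany, Spain
```

A row made only of zeros therefore corresponds to "France", the dropped modality.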
