# Data preprocessing using pandas and scikit-learn

## Feature selection

Data preprocessing is almost always the step that comes before training a machine learning model. Some features are not useful for predicting a given outcome. For example, an `id` field that uniquely identifies each sample carries no information that generalizes to new samples.
Such variables can safely be deleted.


In [8]:
import pandas as pd
from matplotlib import pyplot as plt
filename="./2018_Central_Park_Squirrel_Census_-_Squirrel_Data.csv"
df = pd.read_csv(filename)
df.columns

Index(['X', 'Y', 'Unique Squirrel ID', 'Hectare', 'Shift', 'Date',
       'Hectare Squirrel Number', 'Age', 'Primary Fur Color',
       'Highlight Fur Color', 'Combination of Primary and Highlight Color',
       'Color notes', 'Location', 'Above Ground Sighter Measurement',
       'Specific Location', 'Running', 'Chasing', 'Climbing', 'Eating',
       'Foraging', 'Other Activities', 'Kuks', 'Quaas', 'Moans', 'Tail flags',
       'Tail twitches', 'Approaches', 'Indifferent', 'Runs from',
       'Other Interactions', 'Lat/Long'],
      dtype='object')

Again, we'll ask you to do a bit of work yourself. This time, we ask you to drop unnecessary columns.

In [11]:
# Drop the `Unique Squirrel ID` column
del df['Unique Squirrel ID']
df.head()

Unnamed: 0,X,Y,Hectare,Shift,Date,Hectare Squirrel Number,Age,Primary Fur Color,Highlight Fur Color,Combination of Primary and Highlight Color,...,Kuks,Quaas,Moans,Tail flags,Tail twitches,Approaches,Indifferent,Runs from,Other Interactions,Lat/Long
0,-73.956134,40.794082,37F,PM,10142018,3,,,,+,...,False,False,False,False,False,False,False,False,,POINT (-73.9561344937861 40.7940823884086)
1,-73.968857,40.783783,21B,AM,10192018,4,,,,+,...,False,False,False,False,False,False,False,False,,POINT (-73.9688574691102 40.7837825208444)
2,-73.974281,40.775534,11B,PM,10142018,8,,Gray,,Gray+,...,False,False,False,False,False,False,False,False,,POINT (-73.97428114848522 40.775533619083)
3,-73.959641,40.790313,32E,PM,10172018,14,Adult,Gray,,Gray+,...,False,False,False,False,False,False,False,True,,POINT (-73.9596413903948 40.7903128889029)
4,-73.970268,40.776213,13E,AM,10172018,5,Adult,Gray,Cinnamon,Gray+Cinnamon,...,False,False,False,False,False,False,False,False,,POINT (-73.9702676472613 40.7762126854894)
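
When several columns carry no predictive signal, they can be dropped in a single call. The following is a minimal sketch on a small hypothetical frame: the `toy` DataFrame, its values, and the `to_drop` list are illustrative assumptions, not part of the census data.

```python
import pandas as pd

# Hypothetical miniature frame standing in for the squirrel data
toy = pd.DataFrame({
    "Unique Squirrel ID": ["id-1", "id-2"],
    "Color notes": [None, None],
    "Age": ["Adult", "Juvenile"],
})

# Columns assumed (for illustration only) to be uninformative
to_drop = ["Unique Squirrel ID", "Color notes"]

# drop() returns a new frame; errors="ignore" skips names already removed,
# which keeps the cell safe to run more than once
toy = toy.drop(columns=to_drop, errors="ignore")
print(list(toy.columns))
```

Unlike `del df[...]`, which fails on a second run because the column no longer exists, this form can be re-executed without error.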


## Feature slicing
Feature slicing is the act of *slicing* a feature into multiple different features.
For example, we can slice the `Date` into day, month, and year.


In [15]:
df['Date'].head()

0    10142018
1    10062018
2    10102018
3    10182018
4    10182018
Name: Date, dtype: int64

Hint: use the `Series.apply()` method with a lambda function. [[Help]](https://www.analyticsvidhya.com/blog/2020/03/what-are-lambda-functions-in-python/)


In [20]:
# Split the feature Date (an integer in MMDDYYYY format) into day, month and year
from datetime import datetime
dates = df["Date"].apply(lambda x: datetime.strptime(str(x).zfill(8), "%m%d%Y"))
for part in ("day", "month", "year"):
    df[part] = dates.apply(lambda d: getattr(d, part))
df[["Date", "day", "month", "year"]].head()

Unnamed: 0,Date,day,month,year
0,10142018,14,10,2018
1,10192018,19,10,2018
2,10142018,14,10,2018
3,10172018,17,10,2018
4,10172018,17,10,2018
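
As an alternative to `Series.apply()`, the same slicing can be sketched with `pd.to_datetime` and the `.dt` accessor. The small `dates` series below is a stand-in for `df["Date"]`, and `sliced` is a name chosen for illustration. Note that `pd.to_datetime` applied to raw integers without a `format` interprets them as nanoseconds since 1970-01-01, so every row would land on 1 January 1970; passing an explicit format avoids this.

```python
import pandas as pd

# Small stand-in for df["Date"]: integers in MMDDYYYY form
dates = pd.Series([10142018, 10062018, 10102018], name="Date")

# Convert to zero-padded strings, then parse with an explicit format
parsed = pd.to_datetime(dates.astype(str).str.zfill(8), format="%m%d%Y")

# The .dt accessor exposes the individual date components
sliced = pd.DataFrame({
    "day": parsed.dt.day,
    "month": parsed.dt.month,
    "year": parsed.dt.year,
})
print(sliced)
```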


## Feature engineering

You can create new features based on the features you have. These might be more useful for your (future) machine learning model than the ones that are already present in the dataset.
In this squirrel dataset, most of the fields encode the action taken by the squirrel when being approached by the human.
We will combine them into a single feature `Reaction` with values `'yes'` and `'no'`.

In [21]:
reaction_columns = ['Kuks', 'Quaas', 'Moans', 'Tail flags',
                   'Tail twitches', 'Approaches', 'Runs from',
                   'Other Interactions']

df['Reaction'] = df[reaction_columns].any(axis=1)
df['Reaction'] = df['Reaction'].apply(lambda x : "yes" if x else "no")
df['Reaction']

0        no
1        no
2        no
3       yes
4        no
       ... 
3018    yes
3019     no
3020     no
3021     no
3022    yes
Name: Reaction, Length: 3023, dtype: object

An important step in a data processing pipeline is making the data understandable for machine learning algorithms. Most of them do not understand strings, like `yes` and `no` in our newly created column.
We need to transform them to a binary format so that the machine learning model can take advantage of that feature. We are going to **One Hot Encode** our feature.


In [None]:
pd.get_dummies(df.Reaction, prefix='Reaction')

Unnamed: 0,Reaction_no,Reaction_yes
0,1,0
1,0,1
2,1,0
3,0,1
4,0,1
...,...,...
3018,0,1
3019,1,0
3020,1,0
3021,1,0


However, we have a redundancy here, as we could just transform `'yes'` to `1` and `'no'` to `0` in our `Reaction` column. This can be done by setting the argument `drop_first` to `True`.

In [None]:
# drop_first=True keeps only the `Reaction_yes` column; `df` itself is left untouched
reaction = pd.get_dummies(df.Reaction, prefix='Reaction', drop_first=True)
reaction.rename(columns={"Reaction_yes" : "Reaction"})

Unnamed: 0,Reaction
0,0
1,1
2,0
3,1
4,1
...,...
3018,1
3019,0
3020,0
3021,0



Similar things can be done after converting the data frame to an array using the `scikit-learn` library with `LabelBinarizer` or `OneHotEncoder`.
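
A minimal sketch of the scikit-learn route, assuming a short toy array in place of the real `Reaction` column (the `reaction` array and the variable names are illustrative):

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer, OneHotEncoder

# Toy stand-in for the `Reaction` column
reaction = np.array(["no", "yes", "no", "yes", "yes"])

# LabelBinarizer maps a two-class label to a single 0/1 column
lb = LabelBinarizer()
binary = lb.fit_transform(reaction)  # shape (5, 1)

# OneHotEncoder expects a 2-D array and yields one column per category
ohe = OneHotEncoder()
one_hot = ohe.fit_transform(reaction.reshape(-1, 1)).toarray()  # shape (5, 2)

print(lb.classes_, binary.ravel())
print(ohe.categories_[0])
print(one_hot)
```

With two classes, `LabelBinarizer` plays the role of `get_dummies(..., drop_first=True)`, while `OneHotEncoder` with default settings keeps one column per category.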

## Feature normalization or standardization
Although they are sometimes used interchangeably, normalization and standardization are two different ways to bring a column of values to a common scale.
In this section, we're going to use the word normalization to refer to this concept.

Why do we normalize data?

*For example, assume your input dataset contains one column with values ranging from 0 to 1, and another column with values ranging from 10,000 to 100,000. The great difference in the scale of the numbers could cause problems when you attempt to combine the values as features during modeling.* [[Source]](https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/normalize-data)

Some algorithms require that data be normalized before training a model. Other algorithms perform their own data scaling or normalization.

Given a column of values *x*, if we choose to scale them, there are a few options:

- Normalization, also called *min-max scaling*, rescales every value to the range [0, 1]. The minimum and the maximum are computed for each column separately.

  $$ z = \frac{x - \min(x)}{\max(x) - \min(x)} $$

- Standardization, also called *z-score normalization*, rescales the values so that they have a mean of 0 and a standard deviation of 1. It essentially transforms all values of *x* to a *z-score*. Mean and standard deviation are computed for each column separately.

  $$ z = \frac{x - \operatorname{mean}(x)}{\operatorname{std}(x)} $$
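
To make the two formulas concrete, here is a small worked example on a made-up column of five values (the array `x` is purely illustrative):

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Min-max scaling: the smallest value maps to 0, the largest to 1
x_minmax = (x - x.min()) / (x.max() - x.min())

# Z-score: subtract the mean, divide by the standard deviation
# (np.std uses the population formula by default)
x_zscore = (x - x.mean()) / x.std()

print(x_minmax)   # values 0, 0.25, 0.5, 0.75 and 1
print(x_zscore)   # symmetric around 0
print(x_zscore.mean(), x_zscore.std())
```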


### Be careful!

When you want to normalize your dataset, you have to do so **AFTER** splitting your data into train and test sets. Normalizing before the split would let statistics of the testing set (its minimum, maximum, mean, or standard deviation) leak into the training set, thus biasing the model and its evaluation.
In a real-world scenario, you would not have access to the testing set, as this would be the data that you are meant to predict.

The procedure is the following:

1. Split your data into train and test sets.
2. For every variable $x$ of your **training set**, compute $\max(x_{train})$ and $\min(x_{train})$, or $\operatorname{mean}(x_{train})$ and $\operatorname{std}(x_{train})$, depending on whether you use min-max scaling or z-score normalization.
3. Normalize your training set and your testing set using these values. Only the testing set is shown here; the training set is transformed with the same formulas.

For min-max scaling:

$$ z_{test} = \frac{x_{test} - \min(x_{train})}{\max(x_{train}) - \min(x_{train})} $$

For z-score normalization:

$$ z_{test} = \frac{x_{test} - \operatorname{mean}(x_{train})}{\operatorname{std}(x_{train})} $$
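
The three steps can also be sketched with scikit-learn's `StandardScaler` and `MinMaxScaler`, which store the training statistics when `fit()` is called and reuse them in `transform()`. The data below is synthetic and the variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(100, 3))

# Step 1: split first
X_tr, X_te = train_test_split(X_demo, test_size=0.2, random_state=0)

# Step 2: fit() learns the statistics from the training split only
std_scaler = StandardScaler().fit(X_tr)
mm_scaler = MinMaxScaler().fit(X_tr)

# Step 3: transform() applies those training statistics to both splits
X_tr_std, X_te_std = std_scaler.transform(X_tr), std_scaler.transform(X_te)
X_tr_mm, X_te_mm = mm_scaler.transform(X_tr), mm_scaler.transform(X_te)

print(X_tr_std.mean(axis=0).round(6), X_tr_std.std(axis=0).round(6))
print(X_te_mm.min(), X_te_mm.max())  # test values may fall outside [0, 1]
```

Because the testing set is scaled with the training statistics, its mean is generally not exactly 0 and its min-max values are not guaranteed to stay inside [0, 1]. This is expected.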


In [None]:
from sklearn.model_selection import train_test_split
from sklearn import datasets
import numpy as np
dataset = datasets.load_breast_cancer()
X, y = dataset.data, dataset.target

In [None]:
print(dataset.feature_names)

['mean radius' 'mean texture' 'mean perimeter' 'mean area'
 'mean smoothness' 'mean compactness' 'mean concavity'
 'mean concave points' 'mean symmetry' 'mean fractal dimension'
 'radius error' 'texture error' 'perimeter error' 'area error'
 'smoothness error' 'compactness error' 'concavity error'
 'concave points error' 'symmetry error' 'fractal dimension error'
 'worst radius' 'worst texture' 'worst perimeter' 'worst area'
 'worst smoothness' 'worst compactness' 'worst concavity'
 'worst concave points' 'worst symmetry' 'worst fractal dimension']


In [None]:
# Splitting data into train and test split
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state=41, test_size=0.2)
print(X_train.shape)

(455, 30)


Here, we'll want you to use numpy's mean and standard deviation functions to standardize each feature of the training and testing set. These are `np.mean()` and `np.std()`.

In [None]:
# Standardizing each feature using the train mean and standard deviation

for idx, name in enumerate(dataset.feature_names):
    # Get mean and standard deviation from training set (per feature)
    mean = np.mean(X_train[:, idx])
    stdev = np.std(X_train[:, idx])
    print(f"Feature '{name}' has mean {mean:.2f} and standard deviation {stdev:.2f}")
    # Standardize training and testing set using the mean and standard deviation from the training set
    X_train[:, idx] = (X_train[:, idx] - mean) / stdev
    X_test[:, idx] = (X_test[:, idx] - mean) / stdev

Feature 'mean radius' has mean 14.08 and standard deviation 3.43
Feature 'mean texture' has mean 19.32 and standard deviation 4.30
Feature 'mean perimeter' has mean 91.64 and standard deviation 23.62
Feature 'mean area' has mean 648.64 and standard deviation 335.35
Feature 'mean smoothness' has mean 0.10 and standard deviation 0.01
Feature 'mean compactness' has mean 0.10 and standard deviation 0.05
Feature 'mean concavity' has mean 0.09 and standard deviation 0.08
Feature 'mean concave points' has mean 0.05 and standard deviation 0.04
Feature 'mean symmetry' has mean 0.18 and standard deviation 0.03
Feature 'mean fractal dimension' has mean 0.06 and standard deviation 0.01
Feature 'radius error' has mean 0.39 and standard deviation 0.24
Feature 'texture error' has mean 1.20 and standard deviation 0.53
Feature 'perimeter error' has mean 2.78 and standard deviation 1.73
Feature 'area error' has mean 38.54 and standard deviation 35.48
Feature 'smoothness error' has mean 0.01 and standard

If you run the previous cell twice (without running the other cells again), you'll see that the second time, the mean and standard deviation of each feature are 0 and 1 respectively, which is exactly what we want when we standardize (z-score normalization).
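
The per-feature standardization can also be written without an explicit loop by computing the statistics along `axis=0` and relying on NumPy broadcasting. This is a sketch that reloads the data under separate, illustrative names (`X_tr`, `X_te`, `train_mean`, `train_std`) so that it does not interfere with the cells above.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

data = datasets.load_breast_cancer()
X_tr, X_te, y_tr, y_te = train_test_split(
    data.data, data.target, random_state=41, test_size=0.2)

# One mean and one standard deviation per column, from the training set only
train_mean = X_tr.mean(axis=0)
train_std = X_tr.std(axis=0)

# Broadcasting applies the per-column statistics to every row
X_tr_scaled = (X_tr - train_mean) / train_std
X_te_scaled = (X_te - train_mean) / train_std

# The training columns now have mean 0 and standard deviation 1, which is
# why standardizing them a second time reports 0 and 1
print(np.allclose(X_tr_scaled.mean(axis=0), 0.0))
print(np.allclose(X_tr_scaled.std(axis=0), 1.0))
```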

## Resampling

Sometimes the classes in your dataset are imbalanced, i.e. the number of samples differs considerably from one class to another. In that case, you can resort to resampling.
Resampling means drawing more (or fewer) samples of a given class to obtain a balanced dataset.

**BE CAREFUL**, resample **AFTER** splitting your data set into two parts. You do not want to accidentally have a copy of a testing sample in the training set.
Moreover, **do not resample the testing set**. This would give a false sense of the performance of the model.


In [None]:

import sklearn
import numpy as np
from sklearn import datasets

dataset = datasets.load_breast_cancer()
X, y = dataset.data, dataset.target


In [None]:
# We separate the samples of the different classes
class_one_idx = np.argwhere(y==1)
class_zero_idx = np.argwhere(y==0)

class_one_x = np.squeeze(X[class_one_idx])
class_zero_x = np.squeeze(X[class_zero_idx])

print("Shape of class 0 samples : ", class_zero_x.shape)
print("Shape of class 1 samples : ", class_one_x.shape)


Shape of class 0 samples :  (212, 30)
Shape of class 1 samples :  (357, 30)


You can see that we have 212 samples of class 0 and 357 samples of class 1.
We can either upsample the minority class (here class 0), i.e. draw more samples of it, or downsample the majority class (here class 1), i.e. draw fewer samples of it.
This is why we first separated the samples of each class.

In [19]:
from sklearn.utils import resample

# Upsample minority class
class_zero_upsampled = resample(class_zero_x, 
                                 replace=True,     # sample with replacement
                                 n_samples=357,    # to match majority class
                                 random_state=123) # reproducible results

print("New shape of class 0 samples: ",class_zero_upsampled.shape)

# Downsample majority class
class_one_downsampled = resample(class_one_x, 
                                 replace=False,    # sample without replacement
                                 n_samples=212,    # to match minority class
                                 random_state=123) # reproducible results

print("New shape of class 1 samples: ",class_one_downsampled.shape)


New shape of class 0 samples:  (357, 30)
New shape of class 1 samples:  (212, 30)


After having either upsampled our minority class, or downsampled our majority class, we can combine the upsampled with the majority class or the downsampled with the minority class to have a balanced data set.

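Which one you use depends on your goal and on which one performs best for your model: upsampling keeps every original sample but duplicates minority samples, whereas downsampling discards part of the majority class.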

In [18]:
X_balanced = np.concatenate((class_zero_upsampled, class_one_x), axis=0)
print("(Upsampled) Balanced data set shape : ", X_balanced.shape)

X_balanced = np.concatenate((class_one_downsampled, class_zero_x), axis=0)
print("(Downsampled) Balanced data set shape : ", X_balanced.shape)

(Upsampled) Balanced data set shape :  (714, 30)
(Downsampled) Balanced data set shape :  (424, 30)
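
The cells above resample the full data set to keep the demonstration short, and they only track the features. A sketch that follows the earlier advice (split first, resample the training split only, and keep the labels aligned with the features) could look like this. All variable names are illustrative, and `stratify` is an optional choice made here to keep the class ratio similar in both splits.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

data = datasets.load_breast_cancer()
X_tr, X_te, y_tr, y_te = train_test_split(
    data.data, data.target, test_size=0.2, random_state=41,
    stratify=data.target)

# Boolean masks select each class within the training split
X_min, y_min = X_tr[y_tr == 0], y_tr[y_tr == 0]  # minority class
X_maj, y_maj = X_tr[y_tr == 1], y_tr[y_tr == 1]  # majority class

# Upsample the minority class; features and labels are resampled together
X_min_up, y_min_up = resample(
    X_min, y_min, replace=True, n_samples=len(X_maj), random_state=123)

X_tr_bal = np.concatenate((X_maj, X_min_up), axis=0)
y_tr_bal = np.concatenate((y_maj, y_min_up), axis=0)

print(np.bincount(y_tr_bal))  # both classes now have the same count
# X_te and y_te are left untouched
```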


## Reading material and additional resources

[[1] Feature Engineering - Elite Data Science](https://elitedatascience.com/feature-engineering)  
[[2] Feature Engineering - Towards Data Science](https://towardsdatascience.com/feature-engineering-for-machine-learning-3a5e293a5114)  
[[3] Feature Engineering Tutorial - Kaggle](https://towardsdatascience.com/feature-engineering-for-machine-learning-3a5e293a5114)  
[[4] LabelBinarizer - scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelBinarizer.html)  
[[5] OneHotEncoder - scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html)  
[[6] Zscore - Simply Psychology](https://www.simplypsychology.org/z-score.html)  
[[7] Normalize data - Microsoft Azure](https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/normalize-data)  
[[8] How to handle imbalanced classes - Elite Data Science](https://elitedatascience.com/imbalanced-classes)