<h1><div align='center'><font size='6' color='brick'> Data Preprocessing for Modeling</font></div></h1>
<br>

**This document will cover:**

- Removing and imputing missing values from the dataset 
- Getting categorical data into shape for machine learning algorithms 
- Partitioning a dataset into separate training and test sets 
- Bringing features onto the same scale

At this stage, all we have is raw data made up of **variables/fields (columns)** and **observations (rows)**. Feature engineering comes later; once it is done, we will call the variables features, and they will form the inputs to our models. Until then, we refer to the columns as variables.

## Importing packages

In [17]:
import pandas as pd 
from io import StringIO 
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.impute import KNNImputer

## Dealing with missing data

Real-world datasets are often missing a large number of values. This may be due to the nature of the data collected, errors in the data collection process, or fields that do not apply to certain observations.

In [3]:
# Creating a csv to work with
csv_data =\
'''A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,'''
df = pd.read_csv(StringIO(csv_data))
df

Unnamed: 0,A,B,C,D
0,1.0,2.0,3.0,4.0
1,5.0,6.0,,8.0
2,10.0,11.0,12.0,


### Finding missing data

#### isnull( )
The `isnull` method returns a DataFrame of Boolean values that indicate whether each cell holds a value (False) or is missing (True). It is not limited to numeric data: any non-missing value is reported as False.


In [5]:
df.isnull()

Unnamed: 0,A,B,C,D
0,False,False,False,False
1,False,False,True,False
2,False,False,False,True


`notnull` is the element-wise negation of `isnull`: it returns True where a value is present and False where it is missing.

In [6]:
df.notnull()

Unnamed: 0,A,B,C,D
0,True,True,True,True
1,True,True,False,True
2,True,True,True,False


#### Missing values per column
Because True counts as 1 when summed, chaining `sum()` onto `isnull()` returns the number of missing values per column:

In [7]:
df.isnull().sum()

A    0
B    0
C    1
D    1
dtype: int64
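The same idea extends to rows and to proportions. The sketch below rebuilds the small CSV from above under the hypothetical name `df_demo`, then counts missing values per observation with `axis=1` and uses `mean()` to get the share of missing values per variable.

```python
import pandas as pd
from io import StringIO

# Same toy CSV as above; df_demo is a name chosen for this sketch
csv_data = '''A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,'''
df_demo = pd.read_csv(StringIO(csv_data))

# Number of missing values in each observation (row)
missing_per_row = df_demo.isnull().sum(axis=1)

# Fraction of missing values in each variable (column)
missing_share = df_demo.isnull().mean()

print(missing_per_row.tolist())
print(missing_share.round(2).to_dict())
```

The per-column share is often easier to act on than the raw count, since it shows at a glance which variables are mostly empty.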

### Eliminating observations or variables with missing values

Calling `dropna` with `axis=0` drops every observation (row) that contains at least one missing value.

In [8]:
df.dropna(axis=0)  # axis=0 is the default

Unnamed: 0,A,B,C,D
0,1.0,2.0,3.0,4.0


To drop variables (columns) that contain missing values instead, set `axis=1`:

In [9]:
df.dropna(axis=1)

Unnamed: 0,A,B
0,1.0,2.0
1,5.0,6.0
2,10.0,11.0


In [10]:
# only drop rows where all fields are NaN
# (returns the whole DataFrame here since we don't
# have a row where all values are NaN)
df.dropna(how='all')

Unnamed: 0,A,B,C,D
0,1.0,2.0,3.0,4.0
1,5.0,6.0,,8.0
2,10.0,11.0,12.0,


In [11]:
# drop rows that have fewer than 4 non-NaN values
df.dropna(thresh=4)

Unnamed: 0,A,B,C,D
0,1.0,2.0,3.0,4.0


In [12]:
# only drop rows where NaN appears in specific columns (here: 'C')
df.dropna(subset=['C'])

Unnamed: 0,A,B,C,D
0,1.0,2.0,3.0,4.0
2,10.0,11.0,12.0,
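Note that each `dropna` call above returns a new DataFrame and leaves the original untouched, which is why every cell starts again from the full data. A minimal sketch of this behavior, using a hypothetical two-column frame named `df_demo`:

```python
import numpy as np
import pandas as pd

# Small illustrative frame; the names are chosen for this sketch
df_demo = pd.DataFrame({'A': [1.0, 5.0, 10.0],
                        'C': [3.0, np.nan, 12.0]})

# dropna returns a cleaned copy; assign it to keep the result
cleaned = df_demo.dropna()

print(df_demo.shape)   # the original still has all three rows
print(cleaned.shape)   # the copy has lost the row with the NaN
```

Assigning the result to a new name keeps the raw data available for comparison with the cleaned version.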


### Imputing missing values

Eliminating observations or variables with missing values can be dangerous when building models, because missingness itself can carry information. In a fraud detection model, for example, a missing value may be exactly what marks a record as anomalous, so dropping it could discard the fraud cases the model is meant to catch.

To avoid the errors and biases that can arise from eliminating data points with missing values, we can replace each missing value with a representative value imputed from the other values of that variable.

#### Replacing with column mean

With `strategy='mean'`, `SimpleImputer` replaces each missing value with the mean of its feature column, computed from the non-missing values in that column.

In [14]:
imp = SimpleImputer(missing_values = np.nan, strategy = 'mean') 
imp = imp.fit(df.values)
imputed_data = imp.transform(df.values)
imputed_data

array([[ 1. ,  2. ,  3. ,  4. ],
       [ 5. ,  6. ,  7.5,  8. ],
       [10. , 11. , 12. ,  6. ]])

In [15]:
imp

SimpleImputer()

**Other options for the `strategy` parameter are:**
- _median_: replaces missing values with the column median.
- _constant_: replaces all occurrences of `missing_values` with the value given by the `fill_value` parameter. If left at its default, `fill_value` is 0 when imputing numerical data and "missing_value" for strings or object data types.
- _most_frequent_: replaces missing values with the most frequent value in the column. This is useful for imputing **categorical variables**, for example a feature column that stores color names such as red, green, and blue.
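As a rough illustration of the last two strategies on categorical data, the sketch below imputes a hypothetical `color` column. The column values and the `'unknown'` fill value are assumptions made for this example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical categorical column with one missing entry
colors = pd.DataFrame({'color': ['red', 'green', np.nan, 'red', 'blue']})

# most_frequent: fill the gap with the mode of the column ('red')
mode_imp = SimpleImputer(strategy='most_frequent')
filled_mode = mode_imp.fit_transform(colors)

# constant: fill the gap with an explicit placeholder category
const_imp = SimpleImputer(strategy='constant', fill_value='unknown')
filled_const = const_imp.fit_transform(colors)

print(filled_mode.ravel())
print(filled_const.ravel())
```

A placeholder category keeps the fact that a value was missing visible to the model, whereas the mode hides it.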


**For numerical variables:**
- Mean imputation is risky because the mean is sensitive to outliers. Replacing NaNs with a column mean that outliers have pulled away from the typical value can bias the data further.

- The median is more robust to outliers, but there are still better ways to deal with missing values in numerical variables.
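A small sketch of the outlier problem, using a made-up single column in which one value (100) is far from the rest:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up column: three typical values, one outlier, one missing value
col = np.array([[1.0], [2.0], [3.0], [100.0], [np.nan]])

# Value each strategy writes into the missing (last) position
mean_fill = SimpleImputer(strategy='mean').fit_transform(col)[-1, 0]
median_fill = SimpleImputer(strategy='median').fit_transform(col)[-1, 0]

print(mean_fill)    # 26.5, dragged upward by the outlier
print(median_fill)  # 2.5, close to the typical values
```

The mean-imputed value lies far from every typical observation, while the median stays among them.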

#### KNN Imputer

The KNNImputer class provides imputation for filling in missing values using the k-Nearest Neighbors approach.

By default, a Euclidean distance metric that supports missing values, `nan_euclidean_distances`, is used to find the nearest neighbors. Each missing value is imputed using the mean of that feature across the `n_neighbors` nearest neighbors in the training set that have a value for it. Two observations are close if the features that neither is missing are close.
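To make the distance concrete: `nan_euclidean_distances` skips every coordinate that is missing in either observation and scales the remaining squared distance by (total coordinates / coordinates present in both). The sketch below checks this on two made-up observations.

```python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

# Two made-up observations with four features each
a = [[3.0, np.nan, np.nan, 6.0]]
b = [[1.0, np.nan, 4.0, 5.0]]

# Only the first and last features are present in both observations
d = nan_euclidean_distances(a, b)

# Manual check: weight = 4 total / 2 present,
# squared distance = (3 - 1)**2 + (6 - 5)**2
manual = np.sqrt(4 / 2 * ((3 - 1) ** 2 + (6 - 5) ** 2))

print(d[0, 0], manual)
```

The scaling compensates for the skipped coordinates, so pairs that share few features are not treated as artificially close.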

In [19]:
X = [[1, 2, np.nan], [3, 4, 3], [np.nan, np.nan, 5], [8, 8, 7]] 
df = pd.DataFrame(X, columns=['A', 'B', 'C'])
df

Unnamed: 0,A,B,C
0,1.0,2.0,
1,3.0,4.0,3.0
2,,,5.0
3,8.0,8.0,7.0


* We set `n_neighbors` to 2, so two neighboring observations are used for each imputation. **The default value of `n_neighbors` is 5.**

* We also set `weights` to "uniform", which means that all points in each neighborhood are weighted equally. Alternatively, `weights` can be set to "distance", which weights points by the inverse of their distance. In other words, closer neighbors of a query point have a greater influence than neighbors that are further away. **The default value of `weights` is "uniform".**

In [20]:
imputer = KNNImputer(n_neighbors=2, weights="uniform") 
imputer.fit_transform(X)  # returns a NumPy array, not a DataFrame

array([[1. , 2. , 5. ],
       [3. , 4. , 3. ],
       [5.5, 6. , 5. ],
       [8. , 8. , 7. ]])
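For comparison, here is a sketch of the same imputation with `weights="distance"`. In the first row, the two donors for column C hold the values 3 and 7. The donor holding 3 is much closer, so the distance-weighted estimate should land nearer to 3 than the uniform average of 5.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Same data as in the cell above
X = [[1, 2, np.nan], [3, 4, 3], [np.nan, np.nan, 5], [8, 8, 7]]

# Equal weighting versus inverse-distance weighting of the two neighbors
uniform = KNNImputer(n_neighbors=2, weights="uniform").fit_transform(X)
weighted = KNNImputer(n_neighbors=2, weights="distance").fit_transform(X)

# Imputed value for column C in the first observation
print(uniform[0, 2])
print(weighted[0, 2])
```

The third row is imputed identically under both settings, because its two donors are equally distant from it.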

#### Processing training and testing data

In [21]:
data_train = {'A': [3, 2, np.nan, 4, 3],
              'B': [3, np.nan, 4, 4, 5],
              'C': [np.nan, 4.8, 5.1, 4.9, 5.2],
              'D': [6, 7, 9, np.nan, 10]}
df2_train = pd.DataFrame(data = data_train)
df2_train

Unnamed: 0,A,B,C,D
0,3.0,3.0,,6.0
1,2.0,,4.8,7.0
2,,4.0,5.1,9.0
3,4.0,4.0,4.9,
4,3.0,5.0,5.2,10.0


In [23]:
df2_train.mean()

A    3.0
B    4.0
C    5.0
D    8.0
dtype: float64
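These training means are the values a mean imputer learns when it is fitted. A minimal sketch of the usual pattern follows: fit the imputer on the training data only, then reuse the learned means on the test data. The test set `df2_test` is hypothetical and made up for this example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Training data from the cell above
data_train = {'A': [3, 2, np.nan, 4, 3],
              'B': [3, np.nan, 4, 4, 5],
              'C': [np.nan, 4.8, 5.1, 4.9, 5.2],
              'D': [6, 7, 9, np.nan, 10]}
df2_train = pd.DataFrame(data=data_train)

# Hypothetical test set, invented for this sketch
df2_test = pd.DataFrame({'A': [np.nan, 1],
                         'B': [2, np.nan],
                         'C': [5.0, np.nan],
                         'D': [np.nan, 12]})

# Learn the column means from the training data only
imp = SimpleImputer(strategy='mean').fit(df2_train)

# Fill the test set with the training means, not its own means
test_imputed = pd.DataFrame(imp.transform(df2_test),
                            columns=df2_test.columns)

print(imp.statistics_)
print(test_imputed)
```

Reusing the training statistics keeps information from the test set out of the preprocessing step.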