# Data Preprocessing

Data preprocessing is the phase in data mining in which you deal with out-of-range values, impossible data combinations, missing values, etc. in the gathered data. The most important part of it is data cleaning.

## Data Cleaning

When exploring and visualizing, we can often ignore missing values because many of the packages used can deal with them properly (such as the plotting tools we presented earlier). However, we must always check for them first and be aware of them. It would not be the first time wrong conclusions were drawn because someone was not aware of all the missing values in their data.

On the other hand, most of the models we will use later to make predictions cannot handle missing values. So instead of just ignoring them and being aware of them, we have to deal with them in some way.

In [None]:
import pandas as pd
df = pd.read_csv("./data/housing.csv")
df.head()

In [None]:
# check if there are missing values in any of the columns?
df.isnull().any().any()

In [None]:
# in which column are these missing values? count them per column
df.isnull().sum()

Now that we know where the missing values are, we can cope with them in one of the following ways:

1. Get rid of the corresponding rows
2. Get rid of the entire feature/attribute/variable
3. Set the missing values to some value (e.g. mean or median of that feature)

In [None]:
# option 1:
df_option1 = df.dropna(subset=["total_bedrooms"])
print("original size:" + str(df.shape[0]))
print("new size:" + str(df_option1.shape[0]))

In [None]:
# option 2:
df_option2 = df.drop("total_bedrooms", axis=1)
df_option2.head()

In [None]:
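# option 3:
median = df["total_bedrooms"].median()
# fill the gaps in a copy so the original dataframe stays untouched
df_option3 = df.copy()
df_option3["total_bedrooms"] = df_option3["total_bedrooms"].fillna(median)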

Although at this point pandas is the most logical and convenient way to deal with missing values, `sklearn` also offers a standard way to handle them. We will touch upon this because we will be making extensive use of `sklearn` later on as the primary package to fit and test models. More specifically, we will use the `SimpleImputer` class from `sklearn.impute` (older versions of `sklearn` offered it as `Imputer` in `sklearn.preprocessing`, which has since been removed):

In [None]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy="median")

In [None]:
# NOTE imputer only works on numerical features!
df_num = df.drop("ocean_proximity", axis=1)

# and let's fit/set up the imputer
imputer.fit(df_num)

In [None]:
# and finally transform/apply this imputer
X = imputer.transform(df_num)
print(X)
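
Note that `transform` returns a plain NumPy array, so the column names are lost. The sketch below shows one way to inspect the fitted medians and to put the result back into a dataframe. It uses a small made-up dataframe (`toy`, with hypothetical values) instead of the housing data, and assumes a recent `sklearn` in which the imputer is available as `SimpleImputer` in `sklearn.impute`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data standing in for the numerical housing columns.
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
})

toy_imputer = SimpleImputer(strategy="median")
toy_filled = toy_imputer.fit_transform(toy)  # plain NumPy array, no column names

# The medians learned during fit, one per column.
print(toy_imputer.statistics_)

# Wrap the array in a dataframe again to keep the column names and the index.
toy_df = pd.DataFrame(toy_filled, columns=toy.columns, index=toy.index)
print(toy_df)
```

The missing bedroom count is replaced by the median of the three known values in that column.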

Most often this will then be used in a pipeline to run models on. Preprocessing, however, does not stop here. Depending on which models we use, or on what the data looks like, further transformations are in order. For example, Gaussian Naive Bayes assumes that each feature is normally distributed within a class, and many other models work best when all features are on a comparable scale, i.e. when the data is standardized. Below we show how to do such transformations using `sklearn` preprocessing, in a way similar to how we imputed the data.

## MinMaxScaler
To **rescale features to a given range**, the preprocessing package provides `MinMaxScaler(feature_range=(0,1), copy=True)`. You can give it a **minimum** and a **maximum** through `feature_range`; the default is (0,1). You can also set `copy` to False if you want to change the array itself rather than rescale a copy of it.

It uses the following algorithm:

    X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
    X_scaled = X_std * (max - min) + min
where min,max = feature_range.
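
As a quick check of the formula, the sketch below applies it by hand with NumPy to a small made-up array (`X_demo` is a hypothetical example, not part of the housing data) and compares the result with what `MinMaxScaler` returns for the same range.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data: two features on very different scales.
X_demo = np.array([[1.0, 10.0],
                   [2.0, 20.0],
                   [4.0, 40.0]])
lo, hi = -1.0, 1.0  # the desired feature_range

# The two steps of the formula, applied per column (axis=0).
X_std = (X_demo - X_demo.min(axis=0)) / (X_demo.max(axis=0) - X_demo.min(axis=0))
X_manual = X_std * (hi - lo) + lo

# The same rescaling done by sklearn.
X_sklearn = MinMaxScaler(feature_range=(lo, hi)).fit(X_demo).transform(X_demo)

print(X_manual)
print(np.allclose(X_manual, X_sklearn))
```

Each column is rescaled independently, so both columns end up spanning the full range from -1 to 1.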

Assume `x` is an array we want to rescale. First we must **instantiate a new scaler**, then **fit it to x** and finally **transform x with the scaler**:

In [None]:
from sklearn.preprocessing import MinMaxScaler
import numpy as np
    
# lets take one feature from our dataframe:
x = df.median_income.values.reshape(-1, 1)
print(x[0:10])

Let's assume we want to rescale the values of this numpy array to the range from -1 to 1.

In [None]:
scaler = MinMaxScaler((-1,1))
scaler.fit(x)
scaled_x = scaler.transform(x)
print(scaled_x[0:10])

The fitting and transforming can be combined into a single command, `fit_transform()`.
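
A minimal sketch of that shortcut, using a small made-up array (`x_demo` is hypothetical) so that it runs on its own; both routes should give the same result.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single feature, shaped as a column.
x_demo = np.array([[1.5], [3.0], [4.5], [6.0]])

# Two separate steps: fit first, then transform.
two_steps = MinMaxScaler(feature_range=(-1, 1)).fit(x_demo).transform(x_demo)

# The same thing in one call.
one_step = MinMaxScaler(feature_range=(-1, 1)).fit_transform(x_demo)

print(one_step)
print(np.allclose(two_steps, one_step))
```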

### Exercise
Plot a histogram of the following three normal distributions. Then rescale them using the MinMaxScaler to an interval of (0,1). Plot these as well and compare the results.

In [None]:
mu1=20
sigma1=10

mu2=5
sigma2=3

mu3=0
sigma3=1

In [None]:
%load 3_Exploratory_Data_Analysis/minmaxscaler.py


## Normalizer
The `Normalizer` rescales each sample (row) individually so that it has a **unit norm**. It has a similar syntax to the MinMaxScaler.

With the default L2 norm, unit norm means that if we square each element of the vector and sum the squares, the result equals 1 (such a vector is also called a unit vector, or a vector of length 1).
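
To make this concrete, here is a small worked example with plain NumPy on a single hypothetical vector. It only illustrates the arithmetic and does not use the scaler itself.

```python
import numpy as np

v = np.array([3.0, 4.0])

# The L2 norm (length) of v is sqrt(3**2 + 4**2) = 5.
length = np.linalg.norm(v)

# Dividing by the length gives a vector that points the same way but has length 1.
unit_v = v / length

print(unit_v)
print(np.sum(unit_v ** 2))  # sum of the squared elements, very close to 1
```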

Scaling inputs to unit norms is a common operation for text classification or clustering for instance.

### Exercise
Normalize the following array. Check if the norm of each row is indeed 1. 

Use `np.isclose(a,b)` to check if two elements are equal within a tolerance. Use the `norm()` function from `np.linalg` to calculate the norm.

https://docs.scipy.org/doc/numpy-dev/reference/generated/numpy.isclose.html

https://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.norm.html

In [None]:
%load 3_Exploratory_Data_Analysis/normalizer.py


## Standard Scaler
The standard scaler will **remove the mean** and **scale to unit variance**. This way the data is **centered around 0 and has variance 1**. It has a similar syntax to the other scalers.

Standardization of a dataset is a common requirement for many machine learning estimators: they might behave badly if the individual features do not more or less look like standard normally distributed data (e.g. Gaussian with 0 mean and unit variance).
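
In formula form, every value is replaced by `(x - mean) / std`, computed per feature. The sketch below does this by hand on a small made-up column (`X_demo` is hypothetical) and compares it with `StandardScaler`, which uses the population standard deviation, the same as NumPy's default.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single feature with mean 5 and standard deviation 2.
X_demo = np.array([[2.0], [4.0], [4.0], [4.0], [5.0], [5.0], [7.0], [9.0]])

# Subtract the column mean and divide by the column standard deviation.
X_manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# The same operation done by sklearn.
X_sklearn = StandardScaler().fit_transform(X_demo)

print(X_manual.ravel())
print(np.allclose(X_manual, X_sklearn))
```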

### Exercise
Standardize the given normal distribution. Print the first 10 items of each distribution and check whether the mean and variance indeed approach 0 and 1.

In [None]:
import numpy as np
X = np.random.normal(6,4,(1000000,1))

In [None]:
# %load 5_Machine_Learning/standardscaler
import numpy as np
from sklearn.preprocessing import StandardScaler
from scipy import stats as s

X = np.random.normal(6,4,(1000000,1))
scaler = StandardScaler()
standardized_X = scaler.fit_transform(X)
print(X[0:10,:])
print()
print(standardized_X[0:10,:])
print()
print(np.isclose(s.describe(standardized_X).mean,0))
print()
print(np.isclose(s.describe(standardized_X).variance,1))

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

plt.figure(figsize=(12,8))
plt.hist(X)
plt.hist(standardized_X);

## LabelEncoder and OneHotEncoder
LabelEncoder will take an input of labels and **encode these as sequential integers**. It uses the same syntax as the scalers.

In [None]:
import numpy as np
from sklearn.preprocessing import LabelEncoder

values = ['Warm', 'Cold', 'Warm', 'Hot', 'Hot', 'Cold']
labenc = LabelEncoder()
int_encoded = labenc.fit_transform(values)
print(int_encoded)
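
To see which integer belongs to which label, you can inspect the fitted encoder. The sketch below repeats the encoding under different variable names (`demo_values` and `demo_enc` are chosen here only to avoid overwriting the variables above) and then maps the integers back to the original labels.

```python
from sklearn.preprocessing import LabelEncoder

demo_values = ['Warm', 'Cold', 'Warm', 'Hot', 'Hot', 'Cold']
demo_enc = LabelEncoder()
demo_int = demo_enc.fit_transform(demo_values)

# The classes are stored in sorted order; the position is the integer code.
print(demo_enc.classes_)   # ['Cold' 'Hot' 'Warm']
print(demo_int)            # [2 0 2 1 1 0]

# inverse_transform maps the integers back to the original labels.
decoded = demo_enc.inverse_transform(demo_int)
print(decoded)
```

Note that the integers follow the alphabetical order of the labels, not any meaningful ranking such as Cold < Warm < Hot.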

One-hot encoding uses a **series of bits**. Of these bits only one can be "hot" (1); all the others must be "cold" (0). All states with more than one hot bit are illegal.

This kind of encoding is needed when **feeding categorical data to many scikit-learn estimators**, notably linear models and SVMs with the standard kernels. This way we're **not implying that certain categories have a higher rank than others**, as we would with a LabelEncoder. It has a similar syntax to the scalers we've discussed. In older versions of scikit-learn (before 0.20) you need to **start from the label-encoded array**, not from the array with categories; newer versions accept the categories directly.

Note: a one-hot encoding of y labels should use a LabelBinarizer instead.

Here's an example of how to encode a simple list of categories:

In [None]:
import numpy as np
from sklearn.preprocessing import OneHotEncoder

OHenc = OneHotEncoder()
int_encoded = int_encoded.reshape(len(int_encoded),1)
print(int_encoded)
print()
OH_encoded = OHenc.fit_transform(int_encoded).toarray()
print(OH_encoded)
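
If you use scikit-learn 0.20 or newer, the label-encoding step can be skipped. The sketch below is a variation on the example above, under that version assumption; `categories` and `str_enc` are names introduced here for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# The encoder expects a 2D array: one row per sample, one column per feature.
categories = np.array(['Warm', 'Cold', 'Warm', 'Hot', 'Hot', 'Cold']).reshape(-1, 1)

str_enc = OneHotEncoder()
onehot = str_enc.fit_transform(categories).toarray()  # sparse matrix to dense array

# The column order of the output follows the sorted categories.
print(str_enc.categories_)
print(onehot)
```

Every row contains exactly one 1, in the column that belongs to its category.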

### Exercise
Convert the following list into a one-hot encoded array. Compare the output of the label encoder with that of the one-hot encoder.

In [None]:
data  = ['cat', 'dog', 'cat', 'cat','guinea pig','dog','elephant','cat']

In [None]:
%load 3_Exploratory_Data_Analysis/labelandonehotencoding.py


## Dummy feature
A dummy feature is an extra feature that has the same value everywhere. It is added as the first column of the data set with `sklearn.preprocessing.add_dummy_feature(X, value=1.0)`.

This is useful for fitting an intercept term with implementations which cannot otherwise fit it directly.
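
A minimal sketch with the default value of 1.0, on a small made-up array (`X_demo` is hypothetical):

```python
import numpy as np
from sklearn.preprocessing import add_dummy_feature

X_demo = np.array([[0.0, 1.0],
                   [2.0, 3.0]])

# Prepend a column of ones (the default value) to the data.
X_with_bias = add_dummy_feature(X_demo)
print(X_with_bias)
# [[1. 0. 1.]
#  [1. 2. 3.]]
```

With a constant column of ones, the coefficient a linear model learns for that column plays the role of the intercept.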

### Exercise
Add a dummy feature of 7.5 to the given array.

In [None]:
import numpy as np
X = np.array([[1., 4.5],[10.4, 7.2],[2.4, 9.1]])

In [None]:
%load 3_Exploratory_Data_Analysis/dummyfeature.py
