# Data Preprocessing

- Data import
- Styling
- Metadata and data types
- Duplicate detection
- Dataframe manipulation
- Outlier detection
- Variable generation and manipulation
- Preparation of data for modeling

#### Facts
- A common rule of thumb: pre-modeling (data preparation) takes about 80% of the work, modeling about 20%

#### Libraries:
- NumPy
- pandas
- scikit-learn
- Matplotlib

[Data Set - Adult](https://archive.ics.uci.edu/ml/datasets/Adult)

[Machine Learning Databases (a lot of them!!!)](http://archive.ics.uci.edu/ml/machine-learning-databases/)

[Main UCI page](http://archive.ics.uci.edu/ml)

Boston Housing dataset - we will use the copy that comes with sklearn (`load_boston`, which is only available in scikit-learn versions before 1.2)

[Boston Housing at UCI page](https://archive.ics.uci.edu/ml/machine-learning-databases/housing/)


### Import libraries

In [None]:
import pandas as pd
import numpy as np

# explicit import instead of *; load_boston was removed in scikit-learn 1.2
from sklearn.datasets import load_boston

### Import the data


In [None]:
df = load_boston()

### Data Exploration - Exploratory Data Analysis (EDA)

The best way to start is to look at the data, both as raw values and visually

In [None]:
df.keys()

In [None]:
print(df.DESCR)

In [None]:
df.data

In [None]:
print(df.data.shape)

In [None]:
print(df.target)

In [None]:
dataF = pd.DataFrame(df.data)

In [None]:
dataF.head()

In [None]:
dataF.columns = df.feature_names

In [None]:
dataF['Price'] = df.target

In [None]:
dataF.head()

In [None]:
dataF.info()

In [None]:
dataF.describe()

In [None]:
dataF.count()

In [None]:
dataF.max()

In [None]:
#Return a subset of the dataframe columns based on the column data types
dataF.select_dtypes(include='float64')

In [None]:
dataF.select_dtypes(include='bool')

In [None]:
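#find unique values: via a Python set

list(set(dataF['ZN']))

In [None]:
#a second way: unique values together with their frequencies
dataF['ZN'].value_counts()

In [None]:
#a third way: unique values as a NumPy array, in order of appearance
dataF['ZN'].unique()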

In [None]:
#more visual approach
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
sns.pairplot(dataF, height=1.5);
plt.show()

| Code   | Description   |
|:---|:---|
|**ZN**  | proportion of residential land zoned for lots over 25,000 sq.ft. | 
|**INDUS**  | proportion of non-retail business acres per town | 
|**NOX**  | nitric oxides concentration (parts per 10 million) | 
|**RM**  | average number of rooms per dwelling | 


In [None]:
col_study = ['ZN', 'INDUS', 'NOX', 'RM']

sns.pairplot(dataF[col_study], height=2.5);
plt.show()

| Code   | Description   |
|:---|:---|
|**PTRATIO**  | pupil-teacher ratio by town | 
|**B**  | 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town | 
|**LSTAT**  | % lower status of the population | 
|**Price**  | Median value of owner-occupied homes in \$1000's | 

In [None]:
col_study = ['PTRATIO', 'B', 'LSTAT', 'Price']

sns.pairplot(dataF[col_study], height=2.5);
plt.show()

### Data preprocessing


### data transformation

##### let's write a function that creates some categories

NOX - nitric oxides concentration (parts per 10 million)

In [None]:
dataF['NOX'].min(),  dataF['NOX'].mean(),  dataF['NOX'].max()

In [None]:
def NOX_category(nox):
    # 0.55 is a hard-coded threshold, roughly the NOX mean printed above
    if nox >= 0.55:
        NOX_cat = 'nox above mean'
    else:
        NOX_cat = 'nox below mean'

    return NOX_cat

In [None]:
dataF['NOX_category'] = dataF['NOX'].apply(NOX_category)

In [None]:
dataF['NOX_category'].head()

In [None]:
dataF['NOX_category'].value_counts()

In [None]:
#maybe we can do it more simply?
#Take a look at the TAX column (min, mean, max are roughly 187, 408, 711)

#TAX      full-value property-tax rate per $10,000

tax_mean = dataF['TAX'].mean()  # compute the mean once, not once per row
dataF['Tax_category'] = ['tax above mean' if tax >= tax_mean else 'tax below mean' for tax in dataF['TAX']]


In [None]:
dataF['Tax_category'].head(),  dataF['Tax_category'].value_counts()
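
The same two-way split can also be written without a Python-level loop. The sketch below is one possible vectorized version using `np.where`; the `tax` values are hypothetical stand-ins for `dataF['TAX']`, not values taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical tax rates standing in for dataF['TAX']
tax = pd.Series([187, 300, 408, 666, 711], name="TAX")

tax_mean = tax.mean()  # 454.4 for these toy values

# np.where picks the first label where the condition holds, the second otherwise
tax_category = pd.Series(
    np.where(tax >= tax_mean, "tax above mean", "tax below mean"),
    index=tax.index,
    name="Tax_category",
)
print(tax_category.value_counts())
```

For more than two categories, `pd.cut` with explicit bin edges is usually a better fit than nested conditions.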

##### analyze only a subset of the data - remove columns


In [None]:
dataF['CHAS'].value_counts()

In [None]:
del dataF['CHAS']

In [None]:
dataF.info()

##### proper data types

- There are three main data types:
    - Numeric, e.g. income, age
    - Categorical, e.g. gender, nationality 
    - Ordinal, e.g. low/medium/high
    
    
- Most models can only handle numeric features


- Categorical and ordinal features must be converted into numeric features
    - Create dummy features
    - Transform a categorical feature into a set of dummy features, each representing a unique category
    - In the set of dummy features, 1 indicates that the observation belongs to that category
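
As a minimal sketch of the dummy-feature idea, `pd.get_dummies` turns one categorical column into one 0/1 column per category. The `toy` frame and its column names are hypothetical and only serve as an illustration.

```python
import pandas as pd

# Hypothetical toy data: one categorical and one numeric feature
toy = pd.DataFrame({
    "nationality": ["PL", "DE", "PL", "FR"],
    "age": [31, 45, 27, 52],
})

# Each unique nationality becomes its own 0/1 column; numeric columns are kept as they are
dummies = pd.get_dummies(toy, columns=["nationality"], dtype=int)
print(dummies)
```

A row has a 1 in exactly one of the `nationality_*` columns, the one matching its original category.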

In [None]:
# Decide which categorical variables you want to use in the model
for col_name in dataF.columns:
    if dataF[col_name].dtype == 'object':
        unique_cat = len(dataF[col_name].unique())
        print(f"Feature '{col_name}' has {unique_cat} unique categories")

#### handling missing values

In [None]:
np.where(dataF.isnull())

In [None]:
dataF.isnull().sum().sort_values(ascending=False).head()

An example, just to show how we can deal with missing values:


In [None]:
# np.nan is the canonical spelling; the np.NAN / np.NaN aliases were removed in NumPy 2.0
dt1 = pd.Series([1, 0, np.nan, np.nan, 5, 9, np.nan])

dt2 = pd.DataFrame ([ [1,      0,     np.nan],
                      [np.nan, 0,     np.nan],
                      [6,      8,     2] ])

dt3 = pd.DataFrame ([ [1,      0,     np.nan],
                      [np.nan, 0,     np.nan],
                      [6,      8,     2] ])

dt4 = pd.DataFrame ([ ['?',0,2],
                      ['?',0,4],
                      [6  ,8,3] ])


dt5 = pd.DataFrame ([ [1,      0,     np.nan],
                      [np.nan, 0,     np.nan],
                      [2,      8,     2]     ,
                      [np.nan, 0,     np.nan],
                      [3,      8,     2]     ,
                      [np.nan, 0,     np.nan],
                      [4,      8,     2]     ,
                      [4,      8,     2]])

In [None]:
# Series: dropna() removes the NaN entries
dt1

In [None]:
dt1.dropna()

In [None]:
# DataFrame: dropna() removes every row that contains a NaN (pass axis=1 to drop columns instead)
dt2

In [None]:
dt2.dropna()

In [None]:
# filter cut-off: thresh=N keeps only rows with at least N non-NaN values
dt3

In [None]:
dt3.dropna(thresh=1)

In [None]:
dt3.dropna(thresh=2)
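
To make the `thresh` cut-off concrete, the sketch below counts the non-missing values per row and shows which rows survive each threshold. The `frame` variable is a hypothetical copy of the small example above.

```python
import numpy as np
import pandas as pd

# Same shape of data as the small example: rows have 2, 1 and 3 non-missing values
frame = pd.DataFrame([[1,      0, np.nan],
                      [np.nan, 0, np.nan],
                      [6,      8, 2]])

non_null_per_row = frame.notna().sum(axis=1).tolist()

# thresh=N keeps a row only if it has at least N non-missing values
kept_1 = frame.dropna(thresh=1).index.tolist()
kept_2 = frame.dropna(thresh=2).index.tolist()
kept_3 = frame.dropna(thresh=3).index.tolist()

print(non_null_per_row, kept_1, kept_2, kept_3)
```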

In [None]:
# change a placeholder value to NaN

In [None]:
dt4

In [None]:
dt4[0].describe()

In [None]:
dt4[0].value_counts()

In [None]:
dt4[dt4[0] == "?"]

In [None]:
# assign the result back; inplace=True on a selected column may not update dt4 in recent pandas
dt4[0] = dt4[0].replace("?", np.nan)
dt4


# dt4 = dt4.dropna()
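
When a column should be numeric but contains placeholder strings, another option is `pd.to_numeric` with `errors='coerce'`. The sketch below assumes a hypothetical `raw` series; note that coercion turns every non-numeric entry into NaN, not just `'?'`, so it can hide other bad values.

```python
import pandas as pd

# Hypothetical column where '?' marks a missing value
raw = pd.Series(['?', '?', 6])

# Anything that cannot be parsed as a number becomes NaN; the result is a float column
clean = pd.to_numeric(raw, errors='coerce')
print(clean)
```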

In [None]:
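# Impute missing values using SimpleImputer from sklearn.impute
# (the older sklearn.preprocessing.Imputer was removed in scikit-learn 0.22)
from sklearn.impute import SimpleImputer

In [None]:
dt5

In [None]:
# SimpleImputer always works column-wise, so there is no axis argument
imp = SimpleImputer(missing_values=np.nan, strategy='median')
imp.fit(dt5)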

In [None]:
dt5 = pd.DataFrame(data=imp.transform(dt5) , columns=dt5.columns)

In [None]:
dt5

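We could use other approaches as well:

dt5.fillna(dt5.mean())  -- like the imputer above, but with the column mean instead of the median

dt5.fillna(0) -- constant value

dt5.ffill()  -- forward fill: repeat the last valid value (replaces the deprecated fillna(method='ffill')); for true interpolation use dt5.interpolate()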
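
The strategies differ in what they write into the gaps. The sketch below applies each of them to one hypothetical series `s`, chosen so that the mean and the median differ.

```python
import numpy as np
import pandas as pd

# Hypothetical series with two gaps; the 9.0 pulls the mean above the median
s = pd.Series([1.0, np.nan, 2.0, np.nan, 9.0])

mean_filled = s.fillna(s.mean()).tolist()      # gaps become the mean (4.0)
median_filled = s.fillna(s.median()).tolist()  # gaps become the median (2.0)
forward_filled = s.ffill().tolist()            # gaps repeat the previous valid value
interpolated = s.interpolate().tolist()        # gaps lie on a line between neighbours

print(mean_filled, median_filled, forward_filled, interpolated)
```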

##### removing duplicates

- delete a row
- delete by analyzing a specific column


In [None]:
dups = pd.DataFrame({'col1':[1,2,2,3,3,3,4,4,4,4],
                     'col2':[1,1,1,1,1,2,2,2,2,2]})

In [None]:
dups

In [None]:
# delete a row -> entire data frame is analyzed

dups.drop_duplicates()

In [None]:
# delete by analyzing a specified column
dups

In [None]:
dups.drop_duplicates(['col1'])
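
Before dropping anything, it can help to see which rows pandas considers duplicates. The sketch below rebuilds the same small frame under the hypothetical name `frame` and uses `duplicated()` and the `keep` parameter.

```python
import pandas as pd

frame = pd.DataFrame({'col1': [1, 2, 2, 3, 3, 3, 4, 4, 4, 4],
                      'col2': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2]})

# Boolean mask: True for every row that repeats an earlier row in all columns
n_full_duplicates = int(frame.duplicated().sum())

# Whole-row comparison vs. comparison on col1 only (first occurrence kept by default)
kept_full = frame.drop_duplicates().index.tolist()
kept_col1_first = frame.drop_duplicates(subset=['col1']).index.tolist()

# keep='last' keeps the last occurrence of each col1 value instead
kept_col1_last = frame.drop_duplicates(subset=['col1'], keep='last').index.tolist()

print(n_full_duplicates, kept_full, kept_col1_first, kept_col1_last)
```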

### Outlier detection

- An outlier is an observation that deviates drastically from other observations in a dataset


- Occurrence:
    - Natural, e.g. Bill Gates's income
    - Error, e.g. an impossible value in the data set: heart rate 1681 (should be 168.1)


- How bad is it?
    - Naturally occurring:
        - Not necessarily problematic
        - But can skew your model by affecting the slope
    - Error:
        - Indicative of data quality issues
        - Treat in the same way as a missing value, i.e. use imputation


- Many approaches for detecting outliers:
    - Tukey IQR
    - Kernel density estimation

### Outlier detection - Tukey IQR
- Identifies extreme values in data

- Outliers are defined as:
    - Values below Q1 - 1.5 × (Q3 - Q1) or above Q3 + 1.5 × (Q3 - Q1), where Q3 - Q1 is the interquartile range (IQR)
 
 
- Standard deviation from the mean is another common method to detect extreme values
    - But it can be problematic:
        - Assumes normality 
        - Sensitive to very extreme values
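
The sensitivity of the standard-deviation rule can be seen on a tiny, hypothetical sample: a single huge value inflates the standard deviation so much that its own z-score stays below 3, while the Tukey fences still flag it. The threshold of 3 is a common convention, assumed here for illustration.

```python
import pandas as pd

# Hypothetical sample: nine ordinary values and one extreme value
x = pd.Series([10, 11, 12, 13, 14, 15, 16, 17, 18, 1000])

# Standard-deviation rule: flag values more than 3 sample standard deviations from the mean
z = (x - x.mean()) / x.std(ddof=1)
z_outliers = x[z.abs() > 3].tolist()

# Tukey rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = x.quantile(0.25), x.quantile(0.75)
iqr = q3 - q1
tukey_outliers = x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)].tolist()

print(z_outliers, tukey_outliers)
```

Quartiles barely move when one value is extreme, which is why the IQR fences are more robust in this situation.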

In [None]:
def find_outliers_tukey(x):
    q1 = np.percentile(x, 25)
    q3 = np.percentile(x, 75)
    iqr = q3-q1 
    floor = q1 - 1.5*iqr
    ceiling = q3 + 1.5*iqr
    outlier_indices = list(x.index[(x < floor)|(x > ceiling)])
    outlier_values = list(x[outlier_indices])

    return outlier_indices, outlier_values

In [None]:
dataF['CRIM'].describe()

In [None]:
tukey_indices, tukey_values = find_outliers_tukey(dataF['CRIM'])
print(np.sort(tukey_values))

### Outlier detection - Kernel Density Estimation
- Non-parametric way to estimate the probability density function of a given feature


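- Can be advantageous compared to extreme value detection (e.g. Tukey IQR): it flags observations in low-density regions rather than only in the tails, so it can also catch outliers in multimodal distributions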
    

In [None]:
from sklearn.preprocessing import scale
from statsmodels.nonparametric.kde import KDEUnivariate

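def find_outliers_kde(x):
    # Standardize so the density cut-off below is comparable across features
    x_scaled = scale(list(map(float, x)))
    kde = KDEUnivariate(x_scaled)
    kde.fit(bw="scott", fft=True)
    # Estimated density at every observation
    pred = kde.evaluate(x_scaled)

    # Flag the n observations whose estimated density is below 0.05
    n = sum(pred < 0.05)
    outlier_ind = np.asarray(pred).argsort()[:n]  # positional indices, lowest density first
    outlier_value = np.asarray(x)[outlier_ind]

    return outlier_ind, outlier_value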

In [None]:
kde_indices, kde_values = find_outliers_kde(dataF['CRIM'])
print(np.sort(kde_values))

### Correlation Analysis and Feature Selection

In [None]:
pd.options.display.float_format = '{:,.2f}'.format

In [None]:
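# numeric_only=True (pandas >= 1.5) skips the string category columns created earlier
dataF.corr(numeric_only=True)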

In [None]:
plt.figure(figsize=(16,10))
sns.heatmap(dataF.corr(numeric_only=True), annot=True)
plt.show()

In [None]:
plt.figure(figsize=(16,10))
sns.heatmap(dataF[['CRIM', 'ZN', 'INDUS',  'Price']].corr(), annot=True)
plt.show()
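
One simple way to turn a correlation matrix into a feature selection step is to rank features by the absolute value of their correlation with the target. The sketch below uses synthetic data; the column names and the 0.5 cut-off are assumptions for illustration, not a recommended rule.

```python
import numpy as np
import pandas as pd

# Synthetic data: 'strong' drives the target, 'weak' is unrelated noise
rng = np.random.default_rng(0)
n = 200
strong = rng.normal(size=n)
weak = rng.normal(size=n)
toy = pd.DataFrame({
    "strong": strong,
    "weak": weak,
    "target": 3 * strong + rng.normal(scale=0.5, size=n),
})

# Absolute correlation of every feature with the target, highest first
corr_with_target = (
    toy.corr()["target"].drop("target").abs().sort_values(ascending=False)
)

# Keep features above a hypothetical cut-off
selected = corr_with_target[corr_with_target > 0.5].index.tolist()
print(corr_with_target)
print(selected)
```

Correlation only captures linear relationships, so a low value does not prove that a feature is useless.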

### Label encoding

In [None]:
labelF = pd.DataFrame ([ ['XS'],['S'],['M'],['L'],['XL'],['XXL'] ])

In [None]:
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()

In [None]:
labelF['encoded'] = encoder.fit_transform(labelF[0]) 

In [None]:
labelF
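
`LabelEncoder` assigns codes in sorted (alphabetical) order of the labels, so for sizes the codes do not follow XS < S < M < L < XL < XXL. When the order matters, one option is an ordered categorical with an explicit category list, sketched below with hypothetical variable names.

```python
import pandas as pd

sizes = pd.Series(['XS', 'S', 'M', 'L', 'XL', 'XXL'])

# Explicit order from smallest to largest
order = ['XS', 'S', 'M', 'L', 'XL', 'XXL']
ordinal_codes = pd.Categorical(sizes, categories=order, ordered=True).codes.tolist()

# For comparison: codes based on alphabetical order of the labels
alphabetical = sorted(order)
alphabetical_codes = [alphabetical.index(size) for size in sizes]

print(ordinal_codes)
print(alphabetical_codes)
```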

### One Hot Encoder or Dummy Features


In [None]:
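categories = np.array(['A', 'B', 'C', 'D', 'E'])  # named so it does not shadow the built-in dict

In [None]:
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder()

In [None]:
# OneHotEncoder expects a 2D array: one row per sample, one column per feature
categories = categories.reshape(len(categories), 1)

In [None]:
dfEncoded = encoder.fit_transform(categories)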

In [None]:
# the result is a sparse matrix by default; convert it to a dense array to view it
dfEncoded.toarray()


### Scaling, normalization

##### MinMax

In [None]:
from sklearn.preprocessing import MinMaxScaler

In [None]:
scaler = MinMaxScaler()

In [None]:
#Compute the minimum and maximum to be used for later scaling
scaler.fit(dataF)

In [None]:
scaler.data_max_

In [None]:
scaler.transform(dataF)

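If `scaler.fit(dataF)` raises an error: `dataF` still contains two columns that store categorical (string) data, and the scaler only accepts numeric input.
Either drop those columns or convert them to dummy variables, then re-run the cells above.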

In [None]:
del dataF['NOX_category']

del dataF['Tax_category']
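
Min-max scaling maps each column to the range [0, 1] using (x - min) / (max - min). The sketch below reproduces that formula with plain NumPy on a hypothetical array `X`.

```python
import numpy as np

# Hypothetical data: two features on very different scales
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [4.0, 40.0]])

col_min = X.min(axis=0)
col_max = X.max(axis=0)

# Column-wise min-max scaling
X_minmax = (X - col_min) / (col_max - col_min)
print(X_minmax)
```

Each column ends up with minimum 0 and maximum 1, regardless of its original units.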


##### Standard

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
scaler = StandardScaler()

In [None]:
#Compute the mean and std to be used for later scaling
scaler.fit(dataF)

In [None]:
scaler.transform(dataF)
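
Standardization subtracts the column mean and divides by the column standard deviation, z = (x - mean) / std. `StandardScaler` uses the population standard deviation (`ddof=0`), which the NumPy sketch below mirrors on a hypothetical array `X`.

```python
import numpy as np

# Hypothetical data: two features on different scales
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])

mean = X.mean(axis=0)
std = X.std(axis=0)  # population standard deviation (ddof=0)

X_standard = (X - mean) / std
print(X_standard)
```

After the transformation every column has mean 0 and standard deviation 1.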

##### Normalizer

In [None]:
from sklearn.preprocessing import Normalizer

In [None]:
transformer = Normalizer()

In [None]:
#Do nothing and return the estimator unchanged
transformer.fit(dataF)

In [None]:
transformer.transform(dataF)

How is it calculated?
x0 = the cell value
norm0 = the L2 norm of the row, i.e. sqrt(sum of the squared values in the row); this is the default, norm='l2'

x0 / norm0
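
With its default settings, `Normalizer` rescales each row (not each column) to unit L2 norm. The NumPy sketch below performs the same row-wise computation on a hypothetical array `X`.

```python
import numpy as np

# Hypothetical data: each row is one sample
X = np.array([[3.0, 4.0],
              [6.0, 8.0],
              [1.0, 0.0]])

# L2 norm of every row, kept as a column so it broadcasts over the row
row_norms = np.linalg.norm(X, axis=1, keepdims=True)

X_normalized = X / row_norms
print(X_normalized)
```

The first two rows point in the same direction, so they become identical after normalization: only the direction of a row survives, not its length.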

##### Binarizer

In [None]:
from sklearn.preprocessing import Binarizer

In [None]:
#Do nothing and return the estimator unchanged
transformer = Binarizer().fit(dataF)

In [None]:
transformer

In [None]:
transformer.transform(dataF)

In [None]:
transformer.set_params(threshold=7.0)

In [None]:
transformer.transform(dataF)
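
`Binarizer` maps values strictly greater than the threshold to 1 and everything else to 0. The sketch below reproduces that rule with NumPy on a hypothetical array, using the default threshold 0.0 and the threshold 7.0 set above.

```python
import numpy as np

# Hypothetical values, including one exactly at the threshold 7.0
X = np.array([[0.5, 7.0, 9.2]])

# Values strictly greater than the threshold become 1, all others 0
binary_default = (X > 0.0).astype(int)
binary_7 = (X > 7.0).astype(int)

print(binary_default, binary_7)
```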

## Dimensionality Reduction

### Principal Component Analysis (PCA)

A statistical procedure that uses an orthogonal transformation

It converts possibly correlated features (predictors) into linearly uncorrelated features (predictors) called principal components

Number of principal components <= number of features (predictors)

The first principal component explains the largest possible variance

Each subsequent component has the highest variance subject to the restriction that it must be orthogonal to the preceding components.

Each component is a vector (an eigenvector of the data's covariance matrix)

Sensitive to scaling, so standardize the features first
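
These properties can be checked by hand on a small synthetic example. The sketch below computes principal components from the eigendecomposition of the covariance matrix with NumPy; it is an illustration under simple assumptions (two correlated synthetic features), not how scikit-learn implements PCA internally.

```python
import numpy as np

# Synthetic data: x2 is strongly correlated with x1
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = 0.8 * x1 + 0.2 * rng.normal(size=500)
X = np.column_stack([x1, x2])

# Standardize first, because PCA is sensitive to scaling
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Eigenvectors of the covariance matrix are the principal components
cov = np.cov(X_std, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

# Sort by explained variance, largest first
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained_ratio = eigvals / eigvals.sum()
scores = X_std @ eigvecs  # data expressed in the new coordinates

print(explained_ratio)
print(np.corrcoef(scores, rowvar=False))
```

The first ratio dominates because the two features carry almost the same information, and the off-diagonal correlation of the scores is zero up to rounding.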



In [None]:
from sklearn.decomposition import PCA

In [None]:
X = dataF.iloc[:, 0:12]
y = dataF['Price']

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,random_state=100)

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

In [None]:
scaler.fit(X_train)
X_train_sc = scaler.transform(X_train)
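
The scaler is fitted on the training split only. Any later transformation of the test split should reuse the training mean and standard deviation, so that no information from the test data leaks into preprocessing. The NumPy sketch below shows the idea on hypothetical arrays.

```python
import numpy as np

# Hypothetical split
X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[4.0]])

# Statistics come from the training data only
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)

X_train_scaled = (X_train - mu) / sigma
X_test_scaled = (X_test - mu) / sigma  # same mu and sigma, not refitted on the test data

print(X_train_scaled.ravel(), X_test_scaled.ravel())
```

With a fitted `StandardScaler`, the equivalent step is calling `transform` (not `fit_transform`) on the test data.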

In [None]:
from sklearn.decomposition import PCA

In [None]:
pca = PCA(n_components=2)
pca.fit(X_train_sc)

In [None]:
pca.explained_variance_ratio_

In [None]:
pd.DataFrame(np.round(pca.components_, 3), columns=X.columns).T

In [None]:
pca = PCA(n_components=None)
pca.fit(X_train_sc)

In [None]:
pca.transform(X_train_sc)

In [None]:
np.cumsum(pca.explained_variance_ratio_)
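
The cumulative sum is typically used to decide how many components to keep. The sketch below picks the smallest number of components whose cumulative explained variance reaches a target; the ratios and the 90% target are hypothetical values, not results from this dataset.

```python
import numpy as np

# Hypothetical explained-variance ratios, standing in for pca.explained_variance_ratio_
ratios = np.array([0.5, 0.3, 0.15, 0.05])
cumulative = np.cumsum(ratios)

# Index of the first component where the cumulative share reaches the target, plus one
target = 0.90
n_components = int(np.argmax(cumulative >= target) + 1)

print(cumulative, n_components)
```

scikit-learn can also make this choice directly: passing a float between 0 and 1 as `n_components` to `PCA` keeps enough components to explain that share of the variance.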

In [None]:
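# Loadings: one row per principal component, one column per original feature
res = pca.components_
index_name = ['PCA_'+str(k) for k in range(0, len(res))]

In [None]:
# first four components; transposed so that features are rows, sorted by their weight in PCA_0
df1 = pd.DataFrame(res, columns=dataF.columns[0:12],
                   index=index_name)[0:4]
df1.T.sort_values(by='PCA_0')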