# Introduction to Data Pre-Processing

Data source: https://sds-platform-private.s3-us-east-2.amazonaws.com/uploads/P14-Data-Preprocessing.zip <br>
First, import the required packages.

In [32]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
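import sklearn as sk
# import the submodules explicitly so that sk.impute, sk.preprocessing, etc. resolve
import sklearn.impute, sklearn.preprocessing, sklearn.compose, sklearn.model_selection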

Now load the data from the `Data.csv` file into a DataFrame.

In [14]:
ds = pd.read_csv('ds/Data.csv')
ds.head()

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes


You can also check the dimensions of the dataset, which contains 10 rows and 4 columns.

In [15]:
ds.shape

(10, 4)

## Separating Independent and Dependent Variables

In supervised machine learning, the dataset comes in a labeled format. The data matrix contains the samples used to train the model and is usually denoted by $X_{n\times m}$, where $n$ is the number of samples (observations) and $m$ is the number of features (__independent variables__). The label vector holds the labels (__dependent variable__) corresponding to each sample in $X$ and is denoted by $y_{n\times 1}$. <br>
The first task is to separate the independent variable matrix from the dependent variable vector. In this case the first 3 columns are independent and the last column is dependent.

In [186]:
X = ds.iloc[ : , : -1 ].values   # iloc [all rows, all cols but the last one]
y = ds.iloc[ : , -1 ].values     # iloc [all rows, last col]

In [187]:
X

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, nan],
       ['France', 35.0, 58000.0],
       ['Spain', nan, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)

In [188]:
y

array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'],
      dtype=object)

`X` and `y` are now NumPy __ndarray__ objects; verify their dimensions.

In [189]:
print(f'X matrix is of type {type(X)} with Dimension {X.shape}')
print(f'y vector is of type {type(y)} with Dimension {y.shape}')

X matrix is of type <class 'numpy.ndarray'> with Dimension (10, 3)
y vector is of type <class 'numpy.ndarray'> with Dimension (10,)


## Handling Missing Data
Missing data are entries in your dataset that are either empty or filled with __NaN__ (Not a Number). Most learning algorithms cannot handle missing entries, so they must be dealt with during the pre-processing phase. <br>
There are two common approaches to mitigate this problem:
1. Remove every sample that has a missing value. This can be dangerous, as it discards information and can distort the original dataset, especially a small one.
2. Replace the missing values with a summary statistic (mean, median, etc.) of the corresponding column.
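
A minimal sketch of the two approaches on a small, made-up frame (the column names mirror the dataset, but the values are illustrative only): `dropna` discards incomplete rows, while `fillna` with the column means keeps every row.

```python
import numpy as np
import pandas as pd

# Illustrative toy data, not the tutorial's Data.csv
toy = pd.DataFrame({
    'Age':    [20.0, np.nan, 40.0],
    'Salary': [1000.0, 2000.0, np.nan],
})

# Approach 1: drop every row that contains a missing value
dropped = toy.dropna()

# Approach 2: replace each missing value with the mean of its column
filled = toy.fillna(toy.mean())

print(dropped)
print(filled)
```

Here the first approach keeps only one of the three rows, which shows how costly dropping can be on a small dataset.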

Our dataset has some missing data:

In [190]:
ds[ds.isna().any(axis=1)]   # select the rows that contain at least one NaN

Unnamed: 0,Country,Age,Salary,Purchased
4,Germany,40.0,,Yes
6,Spain,,52000.0,No


To replace NaN with the column mean, we'll use a pre-processing class called __sklearn.impute.SimpleImputer__.
1. Create an object of the class, specifying how missing values are marked and which strategy to use.
2. Use the __fit()__ function to compute the replacement values (here, the column means) from the columns that contain missing data, then __transform()__ to apply the replacement.

In [191]:
imputer = sk.impute.SimpleImputer(missing_values=np.nan , 
                                  strategy='mean')   # set basic params 
imputer = imputer.fit(X[ :, [1,2] ])                 # col 1,2 has NaN
X[: , [1,2] ] = imputer.transform( X[: , [1,2]] )    # transform the actual matrix 

In [192]:
pd.DataFrame(X)

Unnamed: 0,0,1,2
0,France,44.0,72000.0
1,Spain,27.0,48000.0
2,Germany,30.0,54000.0
3,Spain,38.0,61000.0
4,Germany,40.0,63777.8
5,France,35.0,58000.0
6,Spain,38.7778,52000.0
7,France,48.0,79000.0
8,Germany,50.0,83000.0
9,France,37.0,67000.0
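
As a quick arithmetic check, the imputed entries should equal the mean of the nine observed values in each column. The sketch below recomputes them with `np.nanmean`, using the values copied from the `X` array printed earlier.

```python
import numpy as np

# Age and Salary columns as printed earlier, with the gaps left as NaN
age = np.array([44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37])
salary = np.array([72000, 48000, 54000, 61000, np.nan,
                   58000, 52000, 79000, 83000, 67000])

# nanmean ignores the NaN entries, so each mean is taken over 9 values
age_mean = np.nanmean(age)        # 349 / 9
salary_mean = np.nanmean(salary)  # 574000 / 9

print(round(age_mean, 4), round(salary_mean, 1))
```

The results agree with the 38.7778 and 63777.8 that appear in rows 6 and 4 of the table above.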


## Encoding Categorical Data
A categorical variable is one that holds non-numeric values (e.g. the 'Country' and 'Purchased' attributes). The members of such variables must be encoded as numbers, because ML models use mathematical equations that expect numeric inputs. Once encoded, each distinct member of a categorical column gets an integer code. <br>
This creates a problem, however: replacing categories with numbers implies an order that does not exist (i.e. a category with a higher numeric code appears superior, in the underlying equations, to a category with a lower code). To eliminate this, __dummy variables__ are introduced and added to the dataset. This encoding scheme is called __One-Hot Encoding__ and it works as follows:
* A categorical column is replaced by a number of columns, each representing one category.
* The added columns are binary and contain 1 at the positions where the column's category appears.
* It is recommended to encode the categorical columns one by one when the dataset contains several of them.
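
The idea can be sketched without scikit-learn. Assuming the categories are ordered alphabetically (a choice made here for illustration), each category gets its own binary column:

```python
import numpy as np

country = np.array(['France', 'Spain', 'Germany', 'Spain'])

# sorted unique categories: France, Germany, Spain
categories = np.unique(country)

# compare every entry with every category: one binary column per category
one_hot = (country[:, None] == categories[None, :]).astype(int)

print(categories)
print(one_hot)
```

Each row contains exactly one 1, so no category is numerically "larger" than another.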

Thus, the complete encoding scheme for categorical variables is as follows:
1. The __`sklearn.preprocessing.LabelEncoder`__ class transforms a categorical column into a column of integer codes.
2. The __`sklearn.compose.make_column_transformer`__ function applies a __`OneHotEncoder`__ to that column, turning the integer codes into one-hot-encoded columns.

In [193]:
label_encoder_X = sk.preprocessing.LabelEncoder()        # create an LE object for X 
X[: , 0] = label_encoder_X.fit_transform( X[ : , 0 ])    # col 0 is categorical

In [194]:
pd.DataFrame(X)

Unnamed: 0,0,1,2
0,0,44.0,72000.0
1,2,27.0,48000.0
2,1,30.0,54000.0
3,2,38.0,61000.0
4,1,40.0,63777.8
5,0,35.0,58000.0
6,2,38.7778,52000.0
7,0,48.0,79000.0
8,1,50.0,83000.0
9,0,37.0,67000.0


In [195]:
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder                   # needed for the encoder used below
col_trans = make_column_transformer( (OneHotEncoder(), [0]) ,     #apply OHE on col=0
                                      remainder = 'passthrough' ) #keep rest identical 
X=col_trans.fit_transform(X)

In [196]:
pd.DataFrame(X)

Unnamed: 0,0,1,2,3,4
0,1,0,0,44.0,72000.0
1,0,0,1,27.0,48000.0
2,0,1,0,30.0,54000.0
3,0,0,1,38.0,61000.0
4,0,1,0,40.0,63777.8
5,1,0,0,35.0,58000.0
6,0,0,1,38.7778,52000.0
7,1,0,0,48.0,79000.0
8,0,1,0,50.0,83000.0
9,1,0,0,37.0,67000.0
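
In recent scikit-learn releases, `OneHotEncoder` should also accept string columns directly, so the intermediate `LabelEncoder` step on `X` may not be needed. A minimal sketch under that assumption, with the Country column retyped from the data above:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Country column retyped from the dataset, shaped as a single 2-D column
country = np.array(['France', 'Spain', 'Germany', 'Spain', 'Germany',
                    'France', 'Spain', 'France', 'Germany', 'France'],
                   dtype=object).reshape(-1, 1)

encoder = OneHotEncoder()
# the default output is a sparse matrix; convert it to a dense array for display
dummies = encoder.fit_transform(country).toarray()

print(encoder.categories_[0])   # categories in the order of the dummy columns
print(dummies[:3])
```

The dummy columns follow the sorted category order (France, Germany, Spain), which matches the first three columns of the table above.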


Now label encode the dependent variable $y$

In [231]:
y

array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'],
      dtype=object)

In [232]:
label_enc_y = sk.preprocessing.LabelEncoder()
y = label_enc_y.fit_transform(y)

In [233]:
y

array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])
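
The codes follow the sorted order of the distinct labels, so 'No' maps to 0 and 'Yes' to 1. A NumPy-only sketch of the same mapping, using `np.unique` as a stand-in for `LabelEncoder` (for illustration; it is not what the cell above runs):

```python
import numpy as np

y = np.array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'])

# classes: sorted distinct labels; codes: index of each label within classes
classes, codes = np.unique(y, return_inverse=True)

print(classes)
print(codes)

# the original labels can be recovered by indexing classes with the codes
decoded = classes[codes]
```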

## Train-Test Split
Before training a model, the dataset is split into a training set and a test set. The training set is used to fit the model, and the test set is used to evaluate it on unseen data. Two notes on the feature scaling that follows the split:
* Feature scaling is not recommended on dummy variables.
* Scaling the dependent variable is not required for a classification problem, but may be needed for regression problems.

In [282]:
X_train, X_test, y_train, y_test = sk.model_selection.train_test_split(X, y, 
                                                                       test_size = 0.3, 
                                                                       random_state = 1)

In [283]:
print(f'X dim       = {X.shape}')
print(f'X_train dim = {X_train.shape}')
print(f'X_test dim  = {X_test.shape}')
print(f'y_train dim = {y_train.shape}')
print(f'y_test dim  = {y_test.shape}')

X dim       = (10, 5)
X_train dim = (7, 5)
X_test dim  = (3, 5)
y_train dim = (7,)
y_test dim  = (3,)
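
Conceptually, the split shuffles the row indices and cuts them into two groups. The sketch below does this by hand with NumPy. It illustrates the idea only and will not reproduce the exact rows that `train_test_split` selects for a given `random_state`; rounding the test size up is an assumption of this sketch.

```python
import math
import numpy as np

n_samples = 10
test_size = 0.3

rng = np.random.default_rng(1)          # seeded so the shuffle is repeatable
indices = rng.permutation(n_samples)    # shuffled row indices 0..9

# number of test rows, rounded up (assumption of this sketch)
n_test = math.ceil(test_size * n_samples)
test_idx, train_idx = indices[:n_test], indices[n_test:]

print(len(train_idx), len(test_idx))
```

With 10 samples and `test_size = 0.3`, this gives 7 training rows and 3 test rows, the same sizes as reported above.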


## Feature Scaling
The data is not scaled, i.e. the ranges of the columns differ from each other. This is a problem for training, as many ML models rely on the Euclidean distance between samples. If the scales differ, the attribute with the larger range dominates the others. Thus, many ML models work better with scaled data, where no attribute dominates merely because of its units. <br>
There are mainly two ways to perform feature scaling:
1. Standardisation: $x_{stand}=\frac{x - mean(X)}{\sigma(X)} \ \forall x \in X$
2. Normalisation: $x_{norm}=\frac{x - min(X)}{max(X) - min(X)} \ \forall x \in X$

Use the __`sklearn.preprocessing.StandardScaler`__ class to perform standardisation.
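
Both formulas are easy to check by hand. The sketch below applies them with NumPy to the seven ages of the training split. It uses the population standard deviation (`ddof=0`), which is also what `StandardScaler` uses.

```python
import numpy as np

# ages of the seven training rows
age = np.array([40.0, 44.0, 38.0, 27.0, 48.0, 50.0, 35.0])

# Standardisation: zero mean, unit (population) standard deviation
age_stand = (age - age.mean()) / age.std()      # np.std defaults to ddof=0

# Normalisation: rescale to the range [0, 1]
age_norm = (age - age.min()) / (age.max() - age.min())

print(age_stand.round(4))
print(age_norm.round(4))
```

After standardisation the column has mean 0 and standard deviation 1; after normalisation the smallest age maps to 0 and the largest to 1.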

In [284]:
pd.DataFrame(X_train)

Unnamed: 0,0,1,2,3,4
0,0,1,0,40,63777.8
1,1,0,0,44,72000.0
2,0,0,1,38,61000.0
3,0,0,1,27,48000.0
4,1,0,0,48,79000.0
5,0,1,0,50,83000.0
6,1,0,0,35,58000.0


In [285]:
pd.DataFrame(X_test)

Unnamed: 0,0,1,2,3,4
0,0,1,0,30.0,54000
1,1,0,0,37.0,67000
2,0,0,1,38.7778,52000


In [286]:
scaller_X = sk.preprocessing.StandardScaler()

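# scaling is not recommended on dummy variables, so only columns 3 and 4 are scaled;
# the scaler is fitted on the training set and then reused on the test set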
X_train[ :, [3,4]] = scaller_X.fit_transform(X_train[: , [3,4]]) 
X_test[: , [3,4]]  = scaller_X.transform(X_test[:, [3,4]]) 

In [287]:
pd.DataFrame(X_train)

Unnamed: 0,0,1,2,3,4
0,0,1,0,-0.0389102,-0.2296
1,1,0,0,0.505833,0.491205
2,0,0,1,-0.311282,-0.473116
3,0,0,1,-1.80932,-1.61277
4,1,0,0,1.05058,1.10486
5,0,1,0,1.32295,1.45553
6,1,0,0,-0.719839,-0.736112


In [288]:
pd.DataFrame(X_test)

Unnamed: 0,0,1,2,3,4
0,0,1,0,-1.40077,-1.08677
1,1,0,0,-0.447467,0.0528776
2,0,0,1,-0.205359,-1.26211
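
Note that the test ages are standardised with the mean and standard deviation of the training ages, not with their own. A NumPy sketch with the values from the tables above illustrates this:

```python
import numpy as np

age_train = np.array([40.0, 44.0, 38.0, 27.0, 48.0, 50.0, 35.0])
age_test = np.array([30.0, 37.0, 38.7778])

# statistics learned from the training set only (what fit() stores)
mu, sigma = age_train.mean(), age_train.std()

# the same statistics are reused on the test set (what transform() applies)
age_test_scaled = (age_test - mu) / sigma

print(age_test_scaled.round(4))
```

The values agree with column 3 of the scaled `X_test` table, up to the rounding of 38.7778.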


# Data Pre-processing Template 

In [327]:
import numpy as np
import pandas as pd
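import sklearn as sk
# import the submodules explicitly so that sk.impute, sk.compose, etc. resolve
import sklearn.impute, sklearn.preprocessing, sklearn.compose, sklearn.model_selection
from sklearn.preprocessing import OneHotEncoder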

def pre_processing(data_source, nan_cols, cat_cols, is_cat_y, test_size, is_scale_y):
    '''
    Performs data pre-processing for ML/DL programs 
    data_source: a STRING locating the source CSV file
    nan_cols   : a LIST of columns that have NaN entries (use 'ds[ds.isna().any(axis=1)]' to locate)  
    cat_cols   : a LIST of columns which are categorical
    is_cat_y   : a BOOLEAN value, True if the dependent variable is categorical
    test_size  : a NUMBER in [0,1], proportion of the data held out as the test set 
    is_scale_y : a BOOLEAN value, True if the dependent variable is to be scaled (recommended for regression)
    '''
    
    
    # loading data source
    ds = pd.read_csv(data_source)   
    
    # splitting dependent and independent variables
    X = ds.iloc[ : , :-1].values
    y = ds.iloc[ : , -1].values
    
    # replacing NaN values with the column mean
    imputer = sk.impute.SimpleImputer(missing_values=np.nan , strategy='mean')
    X[: , nan_cols ] = imputer.fit_transform(X[ : , nan_cols])
    
    # Encoding categorical Variables
    le_X = sk.preprocessing.LabelEncoder()
    for col in cat_cols:
        X[ : , col ] = le_X.fit_transform(X[ : , col])
    
    if is_cat_y:
        le_y = sk.preprocessing.LabelEncoder()
        y = le_y.fit_transform(y)
        
    col_trans = sk.compose.make_column_transformer((OneHotEncoder(), cat_cols),
                                                   remainder='passthrough'
                                                  )
    X=col_trans.fit_transform(X)
    
    # Train-test split
    X_train, X_test, y_train, y_test = sk.model_selection.train_test_split(X, y, 
                                                                           test_size = test_size, 
                                                                           random_state = 0)
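    # Feature scaling (note: this scales every column, including the dummy columns)
    scaller_X = sk.preprocessing.StandardScaler()
    X_train = scaller_X.fit_transform(X_train)
    X_test = scaller_X.transform(X_test)
    
    if is_scale_y:
        # StandardScaler expects a 2-D array, so reshape y into a single column
        scaller_y = sk.preprocessing.StandardScaler()
        y_train = scaller_y.fit_transform(y_train.reshape(-1, 1))
        y_test = scaller_y.transform(y_test.reshape(-1, 1))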
    
    print(pd.DataFrame(X_train))
    print(pd.DataFrame(y_train))

In [324]:
pre_processing(data_source = 'ds/Data.csv', 
               nan_cols=[1,2], 
               cat_cols=[0], 
               is_cat_y=True, 
               test_size=0.3, 
               is_scale_y=False)

          0    1         2         3         4
0  0.866025  0.0 -0.866025 -0.202981  0.448971
1 -1.154701  0.0  1.154701 -1.821689 -1.417064
2 -1.154701  0.0  1.154701  0.084789 -1.024215
3  0.866025  0.0 -0.866025  1.577598  1.627519
4 -1.154701  0.0  1.154701 -0.041110 -0.140303
5  0.866025  0.0 -0.866025  0.930115  0.940033
6  0.866025  0.0 -0.866025 -0.526723 -0.434940
   0
0  1
1  1
2  0
3  1
4  0
5  0
6  1
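
As a possible variation on the template (a sketch, not the notebook's method), the steps can be bundled into a scikit-learn column transformer so that only the numeric columns are imputed and scaled, while the dummy columns stay as 0/1. The sketch splits before fitting, so the statistics come from the training rows alone and its numbers will differ from the output above. The inline frame retypes the ten rows shown earlier; `handle_unknown='ignore'` and `sparse_threshold=0` are choices made here, not settings from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# the ten rows of the dataset, retyped inline so the sketch is self-contained
df = pd.DataFrame({
    'Country': ['France', 'Spain', 'Germany', 'Spain', 'Germany',
                'France', 'Spain', 'France', 'Germany', 'France'],
    'Age': [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    'Salary': [72000, 48000, 54000, 61000, np.nan,
               58000, 52000, 79000, 83000, 67000],
    'Purchased': ['No', 'Yes', 'No', 'No', 'Yes',
                  'Yes', 'No', 'Yes', 'No', 'Yes'],
})

X = df[['Country', 'Age', 'Salary']]
y = (df['Purchased'] == 'Yes').astype(int).to_numpy()

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# numeric columns: mean imputation followed by standardisation
numeric = make_pipeline(SimpleImputer(strategy='mean'), StandardScaler())

prep = make_column_transformer(
    (OneHotEncoder(handle_unknown='ignore'), ['Country']),  # dummies stay 0/1
    (numeric, ['Age', 'Salary']),
    sparse_threshold=0,                                     # always return a dense array
)

X_train_p = prep.fit_transform(X_train)   # fit on the training rows only
X_test_p = prep.transform(X_test)         # reuse the fitted statistics

print(pd.DataFrame(X_train_p))
```

With `handle_unknown='ignore'`, a country that appears only in the test rows is encoded as all zeros instead of raising an error, which matters on a dataset this small.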
