<img src="img/make ai logo.png" width="180" height="360" align="left"/>

<div class="alert alert-block alert-success" style="margin-top: 5px">
<h2> In this lesson, we will learn about data preprocessing.

Data preprocessing cleans and transforms raw data into a form that a machine learning model can work with.

Let's learn. </h2>
</div>

In [None]:
#core package for data science
import numpy as np
import pandas as pd

#import package for data visualization
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

#hide warnings so they do not clutter the output
import warnings 
warnings.simplefilter('ignore')

print("Done !!!")

To preprocess data, we first divide it into numerical and categorical columns, because each type of data needs a different treatment.

As always, our example is the Titanic data.

In [None]:
data = pd.read_csv("data/train.csv")
data.head()

In [None]:
# Fill missing ages with the median age
data = data.fillna({'Age': data['Age'].median()})
# Fill missing embarkation ports with 'S'
data = data.fillna({'Embarked': 'S'})

In [None]:
# Drop columns that we will not use as features
data = data.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1)

In [None]:
data.tail()

Divide your data into numerical and categorical columns!
   - Numerical   : **Age, SibSp, Parch, Fare**
   - Categorical : **Pclass, Sex, Embarked**
   - We do not preprocess **Survived**, since it is the target
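The split above is made by hand, and a small sketch shows why. The frame below is made up for illustration (it is not the real Titanic data). Selecting columns by dtype would treat **Pclass** as numerical because it is stored as an integer, even though it is really a category.

```python
import pandas as pd

# Tiny stand-in for a few Titanic columns (values are made up)
toy = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0],
    "Fare": [7.25, 71.28, 7.92],
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "female"],
})

# dtype-based selection picks up Age, Fare and also Pclass
numeric_by_dtype = toy.select_dtypes(include="number").columns.tolist()
print(numeric_by_dtype)

# Manual split: Pclass is an integer code for a class, so we list it as categorical
num_cols = ["Age", "Fare"]
cat_cols = ["Pclass", "Sex"]
print(toy[num_cols].shape, toy[cat_cols].shape)
```

So dtype selection is a useful first pass, but the final lists still need a look at what each column means.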

In [None]:
data_num = data[['Age','SibSp','Parch','Fare']]
data_cat = data[['Pclass','Sex','Embarked']]

In [None]:
data_num.tail()

In [None]:
data_cat.tail()

## Numerical Data Normalization

In general, there are two common ways to normalize numerical data: the **Min-Max Scaler** and the **Standard Scaler**.

The **Min-Max Scaler** rescales the values linearly so that the smallest value becomes 0 and the largest becomes 1.

Example : 

     data_raw  = [0, 1000, 2000, 3000, 4000]

     data_norm = [0, 0.25, 0.5 , 0.75,  1  ]
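The example above can be reproduced by hand. This is a minimal sketch of the min-max formula with NumPy, not the scikit-learn implementation itself:

```python
import numpy as np

data_raw = np.array([0, 1000, 2000, 3000, 4000], dtype=float)

# Min-max scaling: (x - min) / (max - min)
data_norm = (data_raw - data_raw.min()) / (data_raw.max() - data_raw.min())
print(data_norm)  # values: 0, 0.25, 0.5, 0.75, 1
```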
     
     
     
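The **Standard Scaler** standardizes the values by subtracting the mean and dividing by the standard deviation (see **Day 3: Basic Statistics and Math, with Normal Distribution and Z-Scores**), so the output has mean 0 and standard deviation 1. The output has no fixed range, but for roughly normally distributed data most values fall between about -2 and 2. The example uses the population standard deviation, as scikit-learn's Standard Scaler does.

Example : 

     data_raw  = [  0  , 1000 , 2000, 3000, 4000]

     data_norm = [-1.41, -0.71, 0 , 0.71,  1.41 ]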
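The z-score formula can also be checked by hand. The sketch below computes it twice, because the result depends on which standard deviation is used: scikit-learn's Standard Scaler divides by the population standard deviation (`ddof=0`), while the sample standard deviation (`ddof=1`) gives slightly smaller values.

```python
import numpy as np

data_raw = np.array([0, 1000, 2000, 3000, 4000], dtype=float)

# Z-score with the population standard deviation (ddof=0)
z_population = (data_raw - data_raw.mean()) / data_raw.std(ddof=0)

# Same formula with the sample standard deviation (ddof=1)
z_sample = (data_raw - data_raw.mean()) / data_raw.std(ddof=1)

print(np.round(z_population, 2))  # about -1.41, -0.71, 0, 0.71, 1.41
print(np.round(z_sample, 2))      # about -1.26, -0.63, 0, 0.63, 1.26
```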

<img src="img/about_standardization_normalization_44_0.png" width="560" height="560" align="center"/>

### How to Scale

#### 1. Import package

We use another package called **sklearn** (scikit-learn), a widely used library for building machine learning models and related tasks. For more information on preprocessing, see https://scikit-learn.org/stable/modules/preprocessing.html.

In [None]:
# To get Min-Max or Standard, use this package
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler

#### 2. Create the scalers

In [None]:
# For Min-Max
scaler1 = MinMaxScaler()

# For Standard
scaler2 = StandardScaler()

#### 3. Copy your data (optional)

In [None]:
data_1 = data_num.copy()

#### 4. Fit and Transform your data

In [None]:
data_m = scaler1.fit_transform(data_1)
data_m

In [None]:
data_s = scaler2.fit_transform(data_1)
data_s

#### 5. Convert to DataFrame !

In [None]:
data_m = pd.DataFrame(data_m,columns = ['Age','SibSp','Parch','Fare'])
data_m.head()

In [None]:
data_s = pd.DataFrame(data_s,columns = ['Age','SibSp','Parch','Fare'])
data_s.head()

### So, which one do you choose?
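One way to think about the choice is to look at what each scaler guarantees. The sketch below uses made-up fares with one extreme value (not the real Titanic column). Min-Max always lands in the range 0 to 1, but a single large value squeezes the rest close to 0. Standard has no fixed range, but the result has mean 0 and standard deviation 1.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up fares with one extreme value, shaped (n_samples, 1) as sklearn expects
fares = np.array([[7.0], [8.0], [10.0], [13.0], [512.0]])

minmax = MinMaxScaler().fit_transform(fares)
standard = StandardScaler().fit_transform(fares)

print(np.round(minmax.ravel(), 3))    # bounded between 0 and 1
print(np.round(standard.ravel(), 3))  # mean 0, standard deviation 1
```

Neither scaler removes the effect of the extreme value, so it is worth checking the distribution of each column before deciding.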

## Categorical Data Normalization

In general, there are two common ways to encode categorical data: the **Label Encoder** and the **One Hot Encoder**.

The **Label Encoder** converts each category to an integer label such as 0, 1, 2. The numbers carry no meaning of their own. scikit-learn's Label Encoder assigns them in sorted (alphabetical) order of the categories, so here Belgium = 0, France = 1, Germany = 2.

Example : 

     data_raw  = [France, Belgium, Germany, France, Germany]

     data_norm = [1, 0, 2, 1, 2]
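To see how scikit-learn assigns the labels for this example, here is a minimal sketch. LabelEncoder sorts the classes, so the labels follow alphabetical order rather than order of appearance:

```python
from sklearn.preprocessing import LabelEncoder

data_raw = ["France", "Belgium", "Germany", "France", "Germany"]

encoder = LabelEncoder()
data_norm = encoder.fit_transform(data_raw)

# classes_ holds the categories in sorted order; the position is the label
print(list(encoder.classes_))
print(list(data_norm))
```

With this input, Belgium becomes 0, France 1, and Germany 2, so the encoded list is 1, 0, 2, 1, 2.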
     
     
     
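The **One Hot Encoder** creates one new column per category and marks the matching column with 1 and all others with 0. It uses only 0 and 1, but it produces more columns. In the example below, the three columns stand for France, Belgium, and Germany.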

Example : 

     data_raw  = [France, Belgium, Germany, France, Germany]

     data_norm = [[1,0,0],[0,1,0],[0,0,1],[1,0,0],[0,0,1]]
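The same idea with scikit-learn, as a minimal sketch. It assumes a recent scikit-learn (0.20 or later), where OneHotEncoder accepts string categories directly. The columns come out in sorted order (Belgium, France, Germany), so the positions of the 1s differ from the hand-written example above.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One column of categories, shaped (n_samples, 1)
data_raw = np.array([["France"], ["Belgium"], ["Germany"], ["France"], ["Germany"]])

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix; toarray() makes it dense for display
data_norm = encoder.fit_transform(data_raw).toarray()

print(encoder.categories_[0])  # column order: Belgium, France, Germany
print(data_norm)
```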

<img src="img/0 T5jaa2othYfXZX9W.jpg" width="560" height="560" align="center"/>

### How to Encode

#### 1. Import package

In [None]:
# Import the Label and One Hot encoders
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
print("Done !!!")

#### 2. Create the encoders

In [None]:
# Label Encoder
encoder1 = LabelEncoder()

# One Hot Encoder
encoder2 = OneHotEncoder()

#### 3. Copy your data (optional)

In [None]:
data_2 = data_cat.copy()

In [None]:
data_2.head()

#### 4. Fit and Transform your Data

### Label Encoder

The steps are almost the same, but Label Encoder has a drawback: it *can't* encode all columns at once. It expects a single column, so passing the whole DataFrame fails:

In [None]:
# Fails on purpose: LabelEncoder expects one 1-D column, not a whole DataFrame
encoder1.fit(data_2)

So we have to encode the columns one at a time.

In [None]:
# Fit on a single 1-D column (Embarked is the third column)
encoder1.fit(data_2['Embarked'])

We can inspect the classes. The position of each class in this list is the label it receives.

In [None]:
encoder1.classes_

In [None]:
a = encoder1.transform(data_2['Embarked'])

In [None]:
a = pd.DataFrame(a)

In [None]:
data_2.join(a).head()

Don't worry, you can also encode every column with a loop!

In [None]:
# Encode each categorical column in turn and add it as "<column>new"
for name in list(data_2.columns):
    encoded = encoder1.fit_transform(data_2[name])
    data_2[name + 'new'] = encoded

In [None]:
data_2.head()
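As an alternative to the loop, scikit-learn also provides **OrdinalEncoder**, which is a different class from the Label Encoder used above and accepts several columns at once. The sketch below uses made-up rows standing in for the categorical Titanic columns. The `new` suffix only mirrors the naming used in the loop.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up rows standing in for the categorical Titanic columns
toy_cat = pd.DataFrame({
    "Pclass": [3, 1, 2, 3],
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "Q", "S"],
})

# OrdinalEncoder takes a 2-D input and encodes each column independently
encoder = OrdinalEncoder()
encoded = pd.DataFrame(
    encoder.fit_transform(toy_cat),
    columns=[c + "new" for c in toy_cat.columns],
)
print(encoded)
```

The result holds floats rather than integers, and each column's categories are again assigned in sorted order.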

### One Hot Encoder

One Hot Encoder has a drawback too: older versions of scikit-learn (before 0.20) could not encode string values directly, although recent versions can.

Here we use the columns that we already converted with the Label Encoder.

In [None]:
#create new data
data_ohe = data_2[["Pclassnew","Sexnew","Embarkednew"]]
data_ohe.head()

In [None]:
data_ohe = encoder2.fit_transform(data_ohe)

In [None]:
data_ohe

Fit and transform alone returns a sparse matrix, which does not show the values directly, so we convert it to an array with **toarray** to see our data.

In [None]:
data_ohe = data_2[["Pclassnew","Sexnew","Embarkednew"]]
data_ohe.head(10)

In [None]:
data.head(10)

In [None]:
data_ohe = encoder2.fit_transform(data_ohe).toarray()
data_ohe = pd.DataFrame(data_ohe, columns=['Pclass_1','Pclass_2','Pclass_3','Sex_female','Sex_male','Embarked_C','Embarked_Q','Embarked_S'])
data_ohe.head()

In [None]:
data_2 = data_2.join(data_ohe)
data_2.tail()

So, which one would you use? **Label Encoder** or **One Hot Encoder**?

## Join All of your Data

Bring all of your data, numerical and categorical, together!

In [None]:
data_num_fix = data_m
data_cat_fix = data_2[["Pclassnew","Sexnew","Embarkednew"]]
data_fix     = data_num_fix.join(data_cat_fix)
data_fix     = data_fix.join(data[["Survived"]])

data_fix.tail()

## Split Training and Testing Data

<img src="img/1 -8_kogvwmL1H6ooN1A1tsQ.png" width="560" height="560" align="center"/>

A machine learning model needs both training and testing data, since it has to learn from data first, right?

The model first learns from the training data. Once it scores well there, we use the testing data as an exam on rows it has not seen, to check whether the model really performs well.

In [None]:
# package needed to split training and testing data
from sklearn.model_selection import train_test_split

# separate X (features) from y (target)
X = data_fix.drop(columns='Survived')
y = data_fix['Survived']

# split the dataset: 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=45)

# show the number of rows and columns in each split
print("Train size : ", X_train.shape)
print("Test size : ", X_test.shape)
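One caveat: in this lesson the scaler was fitted on the full dataset before the split. A common alternative, sketched below on made-up data, is to split first and fit the scaler on the training rows only, so that nothing about the test rows influences the preprocessing. This is a sketch under those assumptions, not a replacement for the steps above.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Made-up feature matrix and binary target
rng = np.random.default_rng(45)
X = rng.uniform(0, 100, size=(50, 2))
y = rng.integers(0, 2, size=50)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=45
)

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn min/max from training rows only
X_test_scaled = scaler.transform(X_test)        # reuse those values on the test rows

print(X_train_scaled.shape, X_test_scaled.shape)
```

With this approach the scaled test values can fall slightly outside 0 to 1, because the test rows may contain values beyond the training minimum or maximum.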

### Bonus : How to save our data to csv

Just use **to_csv** to save your data to a csv file, so you can import it again later.

In [None]:
# Concatenate your data !
train_data = pd.concat([X_train,y_train],axis=1)
test_data = pd.concat([X_test,y_test],axis=1)

# Finally, save your data !!
train_data.to_csv('train_data.csv',index=False)
test_data.to_csv('test_data.csv',index=False)
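To check that a saved file can be imported again, here is a small round-trip sketch. The frame and the temporary folder are made up for illustration; in the lesson you would read back `train_data.csv` directly.

```python
import os
import tempfile
import pandas as pd

# Made-up frame standing in for train_data
frame = pd.DataFrame({"Age": [0.25, 0.5], "Survived": [0, 1]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train_data.csv")
    frame.to_csv(path, index=False)  # index=False: do not write the row index
    restored = pd.read_csv(path)

print(restored.equals(frame))
```

Without `index=False`, the row index would be written as an extra unnamed column and would come back as a feature when the file is read again.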

<div class="alert alert-block alert-success" style="margin-top: 5px">
<h2> Question </h2>
</div>

Now we turn to the diamonds dataset, where we can analyze diamonds by their cut, color, clarity, price, and other attributes.

Import the data !

In [None]:
data = pd.read_csv("data/diamonds.csv")
data.tail()

The columns are described as follows:

**Unnamed:0** : Index counter

**carat** : Carat weight of the diamond

**cut** :
Describe cut quality of the diamond. Quality in increasing order Fair, Good, Very Good, Premium, Ideal

**color** :
Color of the diamond, with D being the best and J the worst

**clarity** :
How obvious inclusions are within the diamond (in order from best to worst, FL = flawless, I3 = level 3 inclusions): FL, IF, VVS1, VVS2, VS1, VS2, SI1, SI2, I1, I2, I3

**depth** :
Depth %: the height of a diamond, measured from the culet to the table, divided by its average girdle diameter (total depth percentage = z / mean(x, y) = 2 * z / (x + y))

**table** :
table%: The width of the diamond's table expressed as a percentage of its average diameter

**price** :
the price of the diamond

**x** :
length mm

**y** :
width mm

**z** :
depth mm
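A hint for columns with a natural order, such as **cut**: a Label Encoder would sort the categories alphabetically and lose the quality ranking. One possible approach, sketched here on a made-up sample of the column, is an ordered pandas categorical. The order list is taken from the description above.

```python
import pandas as pd

# Made-up sample of the `cut` column
cut = pd.Series(["Ideal", "Fair", "Premium", "Good", "Very Good"])

# Quality order from the column description, worst to best
cut_order = ["Fair", "Good", "Very Good", "Premium", "Ideal"]

# An ordered categorical keeps that ranking; .codes gives 0 (Fair) to 4 (Ideal)
cut_encoded = pd.Categorical(cut, categories=cut_order, ordered=True).codes
print(list(cut_encoded))
```

The same pattern could apply to **color** and **clarity**, which are also described as ordered.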

<div class="alert alert-block alert-success" style="margin-top: 5px">
<h2> Preprocess this data so that it can be split into Training and Testing Data !!! </h2>
</div>