# Data Preprocessing Template

## Importing the libraries

In [1]:
import numpy as np
import pandas as pd

## Importing the dataset

In [2]:
dataset = pd.read_csv('Data.csv')
# iloc selects rows and columns by integer position: [rows, columns]
# X takes all rows and every column except the last; y takes all rows of the last column
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values
# Count the missing (NaN = "not a number") values in each column
missing_data = dataset.isnull().sum()

In [3]:
print(X)
print(missing_data)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
Country      0
Age          1
Salary       1
Purchased    0
dtype: int64


In [4]:
print(y)

['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']


## Taking care of missing data

In [6]:
from sklearn.impute import SimpleImputer
# Create a SimpleImputer object. Two arguments are passed to the constructor:
#   missing_values=np.nan tells it which cells count as missing (nan = not a number)
#   strategy='mean' tells it to replace each missing cell with the mean of the non-missing values in that column
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
# fit() computes the mean of each numeric column (Age and Salary)
imputer.fit(X[:, 1:3])
# transform() fills the missing cells with the means learned by fit()
X[:, 1:3] = imputer.transform(X[:, 1:3])

In [7]:
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
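
As a sanity check, the two filled-in values can be reproduced by hand. The sketch below uses only NumPy and copies the Age and Salary columns from the printed dataset; `np.nanmean` is used here as a stand-in for the statistic that `strategy='mean'` learns, and the variable names are illustrative only.

```python
import numpy as np

# Age and Salary columns copied from the printed dataset (nan = missing)
age = np.array([44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37], dtype=float)
salary = np.array([72000, 48000, 54000, 61000, np.nan,
                   58000, 52000, 79000, 83000, 67000], dtype=float)

# np.nanmean ignores nan cells, like the mean learned during fit()
age_mean = np.nanmean(age)
salary_mean = np.nanmean(salary)

# Replace each nan with its column mean, mimicking transform()
age_filled = np.where(np.isnan(age), age_mean, age)
salary_filled = np.where(np.isnan(salary), salary_mean, salary)

print(age_mean, salary_mean)
```

The two printed means should agree with the values that replaced `nan` in the output above.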


## Encoding categorical data

### How to identify categorical features

In [8]:
categorical_features = dataset.select_dtypes(include=['object']).columns.tolist()
print(categorical_features)

['Country', 'Purchased']


### Encoding the Independent variable

The step below is crucial because **most machine learning models work only with numerical data**. If your dataset contains non-numerical data, you have to transform it into numerical data first.


---


**`ColumnTransformer`**: Applying transformations to specific columns


> It is a powerful tool for managing data preprocessing when you have a mix of different data types (e.g., numerical, categorical, text).  It allows you to apply different preprocessing steps to different columns in a clean and organized way.


---

**`One Hot Encoding`**: Converting Categories to Numbers

> It is a crucial preprocessing technique for handling categorical data, which are variables that represent qualities or characteristics (e.g., colors, types of fruit, city names).

**How it works**

Let's say we have a "Color" column:
```
Color
-----
Red
Blue
Green
Red
```

After one-hot encoding, it becomes:
```
Red  Blue  Green
---  ----  -----
1    0     0
0    1     0
0    0     1
1    0     0
```

> **Prevents Misinterpretation**: If we simply assigned numbers to categories (e.g., Red=1, Blue=2, Green=3), the algorithm might incorrectly interpret these numbers as having an ordinal relationship (i.e., that Green is "greater" than Blue, which is not true). One-hot encoding avoids this.
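
The "Color" example can be sketched with scikit-learn's `OneHotEncoder`. The array `colors` is made up for illustration. Note that the encoder sorts categories alphabetically, so the columns come out as Blue, Green, Red rather than in the order drawn in the table above.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy "Color" column; the encoder expects a 2D array (rows x columns)
colors = np.array([['Red'], ['Blue'], ['Green'], ['Red']])

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; toarray() makes it dense
encoded = encoder.fit_transform(colors).toarray()

print(encoder.categories_[0])  # column order chosen by the encoder
print(encoded)
```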


In [9]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# ct = column transformer
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder="passthrough")
X = np.array(ct.fit_transform(X))

In [10]:
print(X)

[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]


#### **Explanation of the code above**

The first argument of **`ColumnTransformer`**
is `transformers`

* It's a list of tuples, where each tuple defines a transformation to be applied to a specific set of columns.
* Each tuple (similar to a fixed-size array in other programming languages) contains:

  1.   `name`:  A string that names this specific transformation (here `'encoder'`, because it encodes values as numbers)
  2.   `transformer`: The transformer object to apply, here **`OneHotEncoder`**
  3.   `columns`: The columns the transformer is applied to, here `[0]` (the Country column)



---

The second argument is `remainder`


*   Specifies what to do with the columns that are not explicitly specified in the transformers argument
 1.   **`drop`**: The default. Columns not specified in transformers are dropped from the output.

 2.   **`passthrough`**: The unspecified columns are passed through without transformation. This is very common when you have a mix of numerical and categorical features

 3. **`any transformer`**: You can provide another transformer object. This transformer will be applied to the remaining columns.
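
The difference between `drop` and `passthrough` can be seen on a small array. This is a sketch on made-up rows (`X_demo` and the helper `encode` are hypothetical names); `sparse_threshold=0` is added here only to force a dense result so the shapes are easy to inspect.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Three illustrative rows: a country (column 0) and an age (column 1)
X_demo = np.array([['France', 44.0],
                   ['Spain', 27.0],
                   ['Germany', 30.0]], dtype=object)

def encode(remainder):
    # One-hot encode column 0; `remainder` decides what happens to column 1
    ct = ColumnTransformer(
        transformers=[('encoder', OneHotEncoder(), [0])],
        remainder=remainder,
        sparse_threshold=0,  # always return a dense array in this sketch
    )
    return np.array(ct.fit_transform(X_demo))

dropped = encode('drop')         # only the 3 one-hot columns survive
kept = encode('passthrough')     # 3 one-hot columns plus the age column

print(dropped.shape)
print(kept.shape)
```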







### Encoding the Dependent variable

In [11]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)

In [12]:
print(y)

[0 1 0 0 1 1 0 1 0 1]
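
To see which label maps to which integer, the fitted encoder can be inspected. A small sketch on a shortened, illustrative label array (`y_demo` and `le_demo` are hypothetical names): `classes_` lists the labels in the order of their integer codes, and `inverse_transform` maps the codes back.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Shortened copy of the label column, for illustration only
y_demo = np.array(['No', 'Yes', 'No', 'No', 'Yes'])

le_demo = LabelEncoder()
y_encoded = le_demo.fit_transform(y_demo)

print(le_demo.classes_)                      # position in this array = integer code
print(y_encoded)
print(le_demo.inverse_transform(y_encoded))  # back to the original labels
```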


#### A detailed explanation of what ```fit()``` and ```transform()``` do in the data preprocessing step of **Machine Learning**


---
**```fit()```**: the step that learns parameters from your data.

- When you call ```fit()``` on a preprocessing object (like a **```StandardScaler```**, **```MinMaxScaler```**, **```OneHotEncoder```**, etc.) and pass your data (typically your training data), the method analyzes the data and stores the statistics or parameters that the transformation needs later (for example, the mean and standard deviation of each column for **```StandardScaler```**).


---
**```transform()```**: the step that applies the parameters learned by **```fit()```** to actually transform your data.
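
A minimal sketch of the two steps with `StandardScaler`, using made-up numbers (`train` and `new` are illustrative arrays): after `fit()`, the learned parameters are available as `mean_` and `scale_`, and `transform()` reuses them on data the scaler has not seen.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0], [2.0], [3.0]])  # data used to learn the parameters
new = np.array([[4.0]])                  # data seen only at transform time

scaler = StandardScaler()
scaler.fit(train)                        # learns the column mean and standard deviation
print(scaler.mean_, scaler.scale_)

# transform() applies (x - mean_) / scale_ with the parameters learned above
print(scaler.transform(new))
```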



### Splitting the dataset into the Training set and Test set

*   **Training set**: Used to build (train) your machine learning model
*   **Test set**: Used to test or evaluate your model

---

#### Why we have to **split the dataset** before **feature scaling**

##### Consider this example:
- Imagine you are preparing for an upcoming exam. You practice with previous exams (previous exams = training set). The upcoming exam (= test set) will be completely new: you never see it while practicing
    - **Feature scaling before splitting (wrong)**: This is like using the answers of the real exam while you practice for it. You will do well on the exam, but it is not a fair assessment of your true knowledge. In the same way, a scaler fitted on the whole dataset has already seen the test set, so information leaks from the test set into training.
    - **Feature scaling after splitting (right)**: This is like practicing with previous exams and then taking a brand-new exam you have never seen. The result reflects your true level of preparation.
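
The leak can be made concrete with numbers. The sketch below uses only NumPy and copies the Salary values of the training and test rows from this notebook's printed output. It compares the statistics a scaler would learn from the training rows alone with those it would learn if the test rows were included.

```python
import numpy as np

# Salary values copied from this notebook's training rows and test rows
salary_train = np.array([52000.0, 63777.77777777778, 72000.0, 61000.0,
                         48000.0, 79000.0, 83000.0, 58000.0])
salary_test = np.array([54000.0, 67000.0])
salary_all = np.concatenate([salary_train, salary_test])

# Correct: statistics come from the training rows only
mean_train, std_train = salary_train.mean(), salary_train.std()
# Leaky: the test rows influence the statistics
mean_all, std_all = salary_all.mean(), salary_all.std()

print(mean_train, std_train)
print(mean_all, std_all)
```

The two sets of statistics differ, which means that scaling before splitting would let the test rows shape how the training data is transformed.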

In [13]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

##### Explanation of the code above:

- **```train_test_split()```**: This function splits the dataset into 2 pairs:
    - A training set and a test set for the feature variables (`X_train`, `X_test`)
    - A training set and a test set for the dependent variable (`y_train`, `y_test`)
- **```test_size = 0.2```**: The test set gets 20% of the dataset (here, 2 of the 10 rows)
- **```random_state=1```**: The rows are shuffled before splitting (this is the default behaviour). `random_state` fixes the seed of that shuffle, so the same split is produced on every run. It makes your results reproducible; it does not by itself make the model better
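
The effect of `random_state` can be checked by splitting the same data twice. This is a sketch on placeholder arrays (`X_demo` and `y_demo` are hypothetical stand-ins for the real features and labels).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten placeholder rows standing in for the real dataset
X_demo = np.arange(10).reshape(10, 1)
y_demo = np.arange(10)

# Same seed twice: both calls produce the same shuffle and the same split
a_train, a_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)
b_train, b_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)

print(a_test.ravel(), b_test.ravel())
print(len(a_train), len(a_test))
```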

In [14]:
print(X_train)

[[0.0 0.0 1.0 38.77777777777778 52000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 35.0 58000.0]]


In [15]:
print(X_test)

[[0.0 1.0 0.0 30.0 54000.0]
 [1.0 0.0 0.0 37.0 67000.0]]


In [16]:
print(y_train)

[0 1 0 0 1 1 0 1]


In [17]:
print(y_test)

[0 1]


### Feature scaling
Main usage: **Feature scaling** is like converting everything to a common "size unit" so we can compare features fairly
  - If that is unclear, consider the example below:
      - Imagine you're comparing apples and oranges, and you're trying to compare them based on their "size."
          - **Apples** might be measured in centimeters (diameter). Let's say apple sizes range from 6cm to 9cm
          - **Oranges** might be measured in grams (weight). Let's say orange weights range from 150g to 250g.
          
          => These numbers are on very different scales, so they are hard to compare directly. **Feature scaling** solves this problem
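
Standardization, one common form of feature scaling, rescales each feature to mean 0 and standard deviation 1 via z = (x - mean) / std. The sketch below applies it to made-up apple and orange measurements chosen inside the ranges quoted above; `standardize` is a hypothetical helper, not a library function.

```python
import numpy as np

# Hypothetical measurements within the ranges from the example
apple_cm = np.array([6.0, 7.0, 8.0, 9.0])
orange_g = np.array([150.0, 180.0, 220.0, 250.0])

def standardize(values):
    # z = (x - mean) / std; np.std uses the population standard deviation by default
    return (values - values.mean()) / values.std()

apple_z = standardize(apple_cm)
orange_z = standardize(orange_g)

# Both features are now unitless and on a comparable scale
print(apple_z)
print(orange_z)
```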

In [18]:
# We will use the standardization technique instead of normalization
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train[:, 3:] = scaler.fit_transform(X_train[:, 3:])
X_test[:, 3:] = scaler.transform(X_test[:, 3:])

In [19]:
print(X_train)
print(X_test)

[[0.0 0.0 1.0 -0.19159184384578545 -1.0781259408412425]
 [0.0 1.0 0.0 -0.014117293757057777 -0.07013167641635372]
 [1.0 0.0 0.0 0.566708506533324 0.633562432710455]
 [0.0 0.0 1.0 -0.30453019390224867 -0.30786617274297867]
 [0.0 0.0 1.0 -1.9018011447007988 -1.420463615551582]
 [1.0 0.0 0.0 1.1475343068237058 1.232653363453549]
 [0.0 1.0 0.0 1.4379472069688968 1.5749910381638885]
 [1.0 0.0 0.0 -0.7401495441200351 -0.5646194287757332]]
[[0.0 1.0 0.0 -1.4661817944830124 -0.9069571034860727]
 [1.0 0.0 0.0 -0.44973664397484414 0.2056403393225306]]


#### Explanation of the code above

Why does **```X_train```** go through both **```fit```** and **```transform```**, while **```X_test```** only goes through **```transform```**?

  * Because **fit** is the step where the scaler learns its parameters (the mean and standard deviation of each column), and these must be learned from the training set only. The test set is then transformed with the same learned parameters, so it stays unseen during training and both sets are scaled in the same way
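
The scaled Age values of the test set can be reproduced by hand, which shows that only training statistics are involved. The sketch below uses NumPy and copies the unscaled Age values from the earlier printed `X_train` and `X_test`; `np.std` with its default population formula is assumed to correspond to what `StandardScaler` computes.

```python
import numpy as np

# Unscaled Age values copied from the earlier printed X_train and X_test
age_train = np.array([38.77777777777778, 40.0, 44.0, 38.0,
                      27.0, 48.0, 50.0, 35.0])
age_test = np.array([30.0, 37.0])

# "fit": learn the mean and standard deviation from the training rows only
mean_train = age_train.mean()
std_train = age_train.std()

# "transform": apply the training statistics to the test rows
age_test_scaled = (age_test - mean_train) / std_train
print(age_test_scaled)
```

The result should match the Age column of the scaled `X_test` printed above.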



---

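Why ```[:, 3:]```?

  * Because columns 0 to 2 are the one-hot-encoded country columns. Their values (0 or 1) already lie within ```[-3;3]```, the range that standardized values typically fall into, so scaling them is unnecessary and would only hide their meaning. Only the columns from index 3 onward (Age and Salary) have values outside this range, so only they need **feature scaling**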
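
To illustrate what would happen otherwise, the sketch below standardizes one dummy column by hand. The values are the first (France) column of the printed `X_train`; the variable names are illustrative only.

```python
import numpy as np

# First one-hot column (France) copied from the printed X_train
france = np.array([0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0])

# Standardizing the dummy column by hand
france_z = (france - france.mean()) / france.std()
print(france_z)
```

After scaling, the column no longer contains 0 and 1, so it is harder to read which rows belong to France, and nothing is gained because the original values were already small.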