# Lesson 1.8: Encoding the model

### Lesson Duration: 3 hours

> Purpose: The purpose of this lesson is to learn how to process categorical data before fitting the model, fit the model, make predictions on the test data, and check the accuracy of the model. We will also talk about other models used for supervised and unsupervised learning.

---

### Setup

To start this lesson, students should have:

- Completed lesson 1.7
- All previous Setup

### Learning Objectives

After this lesson, students will be able to:

- Encode categorical data
- Fit the model on the training data
- Make predictions on the test data
- Check the accuracy of the model using different statistical measures

---

### Lesson 1 key concepts

> :clock10: 20 min

- Categorical data - nominal, ordinal
- Encoding categorical variables
  - Label encoding
  - One Hot encoding

:exclamation: Note: You can continue using the same Jupyter file from the last lesson. If you do not have that, use the following code to quickly set up:

In [1]:
import pandas as pd
import numpy as np

In [2]:
data = pd.read_csv('./files_for_lesson_and_activities/regression_data.csv')
Y = data['TARGET_D']
data = data.drop(['TARGET_D'], axis=1)
X_num = data.select_dtypes(include = np.number)
from sklearn.preprocessing import Normalizer
transformer = Normalizer().fit(X_num)
x_normalized = transformer.transform(X_num)
print(x_normalized.shape)
# pd.DataFrame(x_normalized)

(4670, 4)


<details>
<summary> Click for Code Sample </summary>

Links to docs:

- [sklearn.preprocessing.OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html)
- [.fit(x)](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder.fit)
- [.transform(x)](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder.transform)

In [3]:
X_cat = data.select_dtypes(include=object)  # np.object was removed in NumPy 1.24; use the builtin object
# pd.get_dummies(X_cat, drop_first=True)

In [4]:
from sklearn.preprocessing import OneHotEncoder

In [5]:
encoder = OneHotEncoder(handle_unknown='error', drop='first').fit(X_cat)
# encoder.categories_

In [6]:
encoded = encoder.transform(X_cat).toarray()

In [7]:
encoded

array([[1., 0.],
       [1., 0.],
       [0., 0.],
       ...,
       [0., 0.],
       [1., 0.],
       [1., 0.]])

In [8]:
#le = preprocessing.LabelEncoder().fit(X_cat).transform(X_cat) # ordered wrt value counts

</details>

#### :pencil2: Check for Understanding - Class activity/quick quiz

> :clock10: 10 min (+ 10 min Review)

# 1.08 Activity 1

- Do you think it is important to reduce the number of categories in a column if you can? How might it impact your model?
- Discussion on reducing the number of categories in a column.

Take One Hot encoding as the reference: every category in a column becomes its own binary column, so a column with many categories adds many columns to the data set. If there are several such categorical columns, the data set becomes wide and sparse, with most of the new binary columns holding zeros.
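One possible way to reduce the number of categories is to group rare values into a single bucket before encoding. The sketch below uses a made-up `state` column and an arbitrary `min_count` threshold; both are assumptions for illustration.

```python
import pandas as pd

# Hypothetical column with several rare categories (made-up values).
state = pd.Series(["CA", "CA", "NY", "CA", "NY", "TX", "WA", "OR", "CA", "NY"])

# Number of one-hot columns before any grouping.
before = pd.get_dummies(state).shape[1]

# Keep categories that appear at least `min_count` times; lump the rest into "Other".
min_count = 2  # arbitrary threshold chosen for this sketch
counts = state.value_counts()
frequent = counts[counts >= min_count].index
state_reduced = state.where(state.isin(frequent), "Other")

# Number of one-hot columns after grouping.
after = pd.get_dummies(state_reduced).shape[1]
print(before, after)
```

Here the five original categories collapse to three (`CA`, `NY`, `Other`). Whether this helps or hurts the model depends on how informative the rare categories are.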

### Lesson 2 key concepts

> :clock10: 20 min

- Fitting the model with processed data
- Understanding the documentation
- Making predictions
  - Predictions on the test data
  - Predictions on new data

<details>
<summary> Click for Code Sample </summary>

Links to docs:

- [sklearn.model_selection.train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html#)
- [sklearn.linear_model.LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)

In [9]:
X = np.concatenate((x_normalized, encoded), axis=1)

In [10]:
# Y = data['TARGET_D'] # This column was already dropped from 'data'

In [11]:
from sklearn.model_selection import train_test_split
from sklearn import linear_model 
from sklearn.metrics import r2_score

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.4, random_state=100)

In [13]:
lm = linear_model.LinearRegression()

In [14]:
model = lm.fit(X_train,y_train)

In [15]:
predictions  = lm.predict(X_test)

In [16]:
r2_score(y_test, predictions)
# to make predictions on the new data, we have to process the data (X features) in the same way.

0.4367709325610222
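To make the comment about new data concrete, here is a minimal sketch that fits the preprocessing objects once and reuses them on an unseen row. The training values, the `new_data` row, and the `preprocess` helper are all invented for this example; only the column names mirror the lesson data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import Normalizer, OneHotEncoder

# Small synthetic training set (values invented for this sketch).
train = pd.DataFrame({
    "AVGGIFT": [15.5, 3.1, 7.5, 6.7, 8.8, 12.0],
    "HV1_log": [7.8, 6.2, 7.1, 5.8, 6.6, 7.3],
    "gender": ["Male", "Male", "Female", "Male", "Female", "U"],
})
y_train = np.array([21.0, 3.0, 20.0, 5.0, 10.0, 15.0])

num_cols = ["AVGGIFT", "HV1_log"]
cat_cols = ["gender"]

# Fit the preprocessing objects once, on the training data only.
transformer = Normalizer().fit(train[num_cols])
encoder = OneHotEncoder(handle_unknown="error", drop="first").fit(train[cat_cols])

def preprocess(df):
    """Apply the already-fitted transformer and encoder to a DataFrame (hypothetical helper)."""
    x_num = transformer.transform(df[num_cols])
    x_cat = encoder.transform(df[cat_cols]).toarray()
    return np.concatenate((x_num, x_cat), axis=1)

lm = LinearRegression().fit(preprocess(train), y_train)

# New, unseen rows go through the same preprocess() before predict().
new_data = pd.DataFrame({"AVGGIFT": [9.0], "HV1_log": [6.9], "gender": ["Female"]})
new_prediction = lm.predict(preprocess(new_data))
print(new_prediction.shape)
```

The key point is that `encoder` is not refitted on the new row: refitting would change the set of columns and the model would no longer receive the features it was trained on.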

</details>

#### :pencil2: Check for Understanding - Class activity/quick quiz

> :clock10: 10 min (+ 10 min Review)

<details>
  <summary> Click for Instructions: Activity 2 </summary>

# 1.08 Activity 2

Refer to the file `files_for_activities/regression_data___.csv` for this exercise.

1. Import the data from `regression_data___.csv`.
2. Select categorical columns.
3. Explain the difference between One Hot Encoding, Label Encoding, and Ordinal Encoding.

# Solutions

In [17]:
# 1
import pandas as pd

df=pd.read_csv('./files_for_activities/csv_files/regression_data___.csv') # this file is inside files_for_lesson_and_activities folder
df.head()

| | AVGGIFT | HV1_log | IC1_transformed | IC5_transformed | gender | TARGET_D |
|---|---|---|---|---|---|---|
| 0 | 15.5 | 7.760467 | 17.343389 | 4.181353 | Male | 21.0 |
| 1 | 3.08 | 6.20859 | 16.230984 | 4.150313 | Male | 3.0 |
| 2 | 7.5 | 7.113956 | 18.047227 | 4.205057 | Female | 20.0 |
| 3 | 6.7 | 5.783825 | 11.73711 | 4.055333 | Male | 5.0 |
| 4 | 8.785714 | 6.64379 | 12.494862 | 4.088969 | Female | 10.0 |


In [18]:
# 2
import numpy as np

cat_data=df.select_dtypes(include=object)  # np.object was removed in NumPy 1.24; use the builtin object

In [19]:
# 3

OneHotEncoder can be used for transforming your independent variables according to how one-hot-encoding works. It is not really intended to be used on your dependent variables.

The OrdinalEncoder can be used if you can order / rank your independent variables, e.g., small, medium, large, very large. This is also not intended to be used on your dependent variables.

The third one, LabelEncoder, is used when you want to transform your dependent variable into classes, e.g.:
[1, 1, 2, 6] -> [0, 0, 1, 2]. This is only intended to be used with your LABELS, i.e., your dependent variable, and not your independent variables.
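The differences described above can be sketched side by side on toy arrays. The `sizes` and `genders` values are invented for illustration; the `labels` list is the example from the text.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder

# Toy feature columns (invented), shaped (n_samples, 1) as the feature encoders expect.
sizes = np.array([["small"], ["large"], ["medium"], ["small"]])
genders = np.array([["Male"], ["Female"], ["U"], ["Female"]])

# OrdinalEncoder: pass the category order explicitly;
# without it, categories are sorted alphabetically.
ordinal = OrdinalEncoder(categories=[["small", "medium", "large"]])
sizes_encoded = ordinal.fit_transform(sizes)

# OneHotEncoder: one binary column per category of a nominal feature.
one_hot = OneHotEncoder()
genders_encoded = one_hot.fit_transform(genders).toarray()

# LabelEncoder: takes a 1-D array of target labels and maps them to 0..n_classes-1.
labels = [1, 1, 2, 6]
labels_encoded = LabelEncoder().fit_transform(labels)

print(sizes_encoded.ravel())   # [0. 2. 1. 0.]
print(genders_encoded.shape)   # (4, 3)
print(labels_encoded)          # [0 0 1 2]
```

Note that the two feature encoders take 2-D input (one column per feature), while LabelEncoder takes a single 1-D array, which is one practical reason it is not meant for feature matrices.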

In [20]:
# One hot Encoding
from sklearn.preprocessing import OneHotEncoder

encoder=OneHotEncoder(handle_unknown='error', drop='first').fit(cat_data)
encoded=encoder.transform(cat_data).toarray()

In [21]:
# with pandas
data_one_hot=pd.get_dummies(cat_data, drop_first=True)
data_one_hot.head()

| | gender_Male | gender_U |
|---|---|---|
| 0 | 1 | 0 |
| 1 | 1 | 0 |
| 2 | 0 | 0 |
| 3 | 1 | 0 |
| 4 | 0 | 0 |


In [22]:
# Label Encoding
from sklearn.preprocessing import LabelEncoder

data_labeled=pd.DataFrame()

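# Loop over every column of df except the target, so the cell does not depend on `data` from the lesson notebook
for c in df.columns.drop('TARGET_D'):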
    encoder=LabelEncoder().fit(df[c])
    encoded=encoder.transform(df[c])
    data_labeled[c]=encoded

data_labeled.head()

| | AVGGIFT | HV1_log | IC1_transformed | IC5_transformed | gender |
|---|---|---|---|---|---|
| 0 | 1296 | 1472 | 339 | 2043 | 1 |
| 1 | 28 | 269 | 269 | 1194 | 1 |
| 2 | 581 | 940 | 388 | 2714 | 0 |
| 3 | 492 | 103 | 66 | 58 | 1 |
| 4 | 756 | 537 | 92 | 180 | 0 |


In [23]:
# Ordinal Encoding
from sklearn.preprocessing import OrdinalEncoder

encoder=OrdinalEncoder().fit(df)
encoded=encoder.transform(df)

pd.DataFrame(encoded, columns=df.columns)

| | AVGGIFT | HV1_log | IC1_transformed | IC5_transformed | gender | TARGET_D |
|---|---|---|---|---|---|---|
| 0 | 1296.0 | 1472.0 | 339.0 | 2043.0 | 1.0 | 30.0 |
| 1 | 28.0 | 269.0 | 269.0 | 1194.0 | 1.0 | 3.0 |
| 2 | 581.0 | 940.0 | 388.0 | 2714.0 | 0.0 | 29.0 |
| 3 | 492.0 | 103.0 | 66.0 | 58.0 | 1.0 | 6.0 |
| 4 | 756.0 | 537.0 | 92.0 | 180.0 | 0.0 | 13.0 |
| ... | ... | ... | ... | ... | ... | ... |
| 4665 | 1296.0 | 473.0 | 215.0 | 2223.0 | 1.0 | 34.0 |
| 4666 | 1075.0 | 1682.0 | 202.0 | 3574.0 | 1.0 | 28.0 |
| 4667 | 1037.0 | 472.0 | 240.0 | 1991.0 | 0.0 | 13.0 |
| 4668 | 1157.0 | 373.0 | 124.0 | 589.0 | 1.0 | 34.0 |


</details>

### Lesson 3 key concepts

> :clock10: 20 min

- Checking the accuracy of the model
  - RMSE
  - MSE
  - R square

In [72]:
from sklearn.metrics import mean_squared_error
import math

In [73]:
mse = mean_squared_error(y_test, predictions)
print(mse)

81.59292426207563


In [74]:
rmse = math.sqrt(mse)
print(rmse)

9.032880175341397


In [75]:
r2 = r2_score(y_test, predictions)
print(r2)

0.4367709325610222
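To see what these metrics measure, the following sketch recomputes them by hand with NumPy on a tiny made-up example. It also includes MAE (mean absolute error), which the lab asks for. The arrays `y_true` and `y_hat` are invented so that the arithmetic can be checked on paper.

```python
import numpy as np

# Made-up targets and predictions, small enough to check by hand.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 11.0])

errors = y_true - y_hat                      # 1, 0, -1, -2

mse_manual = np.mean(errors ** 2)            # (1 + 0 + 1 + 4) / 4 = 1.5
rmse_manual = np.sqrt(mse_manual)            # square root of the MSE
mae_manual = np.mean(np.abs(errors))         # (1 + 0 + 1 + 2) / 4 = 1.0

ss_res = np.sum(errors ** 2)                       # residual sum of squares = 6
ss_tot = np.sum((y_true - y_true.mean()) ** 2)     # total sum of squares = 20
r2_manual = 1 - ss_res / ss_tot                    # 1 - 6/20 = 0.7

print(mse_manual, rmse_manual, mae_manual, r2_manual)
```

These should agree with `mean_squared_error`, `mean_absolute_error`, and `r2_score` from `sklearn.metrics` when called on the same arrays.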


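Adjusted R2 is a goodness-of-fit measure for linear models that corrects R2 for the number of predictors. R2 is the proportion of variance in the target that is explained by the inputs. On the data used to fit the model it never decreases when a predictor is added, even an irrelevant one, so it tends to give an optimistic estimate of the fit. Adjusted R2 applies a penalty that grows with the number of predictors p relative to the number of observations n.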

In [76]:
n = len(X_test)
p = X_test.shape[1]
adj_r2 = 1-((1-r2)*(n-1)/(n-p-1))
print(adj_r2)

0.43495504088738757
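The effect of the penalty can be illustrated with a small experiment: fit one model on a single informative feature and another on the same feature plus ten columns of pure noise. This is a sketch on synthetic data; the helper name `adjusted_r2`, the sample size, and the number of noise columns are arbitrary choices. The scores are computed on the data used for fitting (not on a test set) to isolate the effect of adding predictors.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

def adjusted_r2(r2, n, p):
    """Adjusted R2 for n observations and p predictors (same formula as in the cell above)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

rng = np.random.default_rng(0)
n = 60
x_signal = rng.normal(size=(n, 1))               # one informative feature
y = 3 * x_signal[:, 0] + rng.normal(size=n)      # target = signal + noise
x_noise = rng.normal(size=(n, 10))               # ten irrelevant features
X_big = np.hstack([x_signal, x_noise])

# In-sample R2 for the small and the large model.
r2_small = r2_score(y, LinearRegression().fit(x_signal, y).predict(x_signal))
r2_big = r2_score(y, LinearRegression().fit(X_big, y).predict(X_big))

print("1 predictor:  ", r2_small, adjusted_r2(r2_small, n, 1))
print("11 predictors:", r2_big, adjusted_r2(r2_big, n, 11))
```

R2 of the larger model is at least as high as that of the smaller one, while the gap between R2 and adjusted R2 widens as more predictors are added.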


#### :pencil2: Check for Understanding - Class activity/quick quiz

> :clock10: 10 min (+ 10 min Review)

<details>
  <summary> Click for Instructions: Activity 3 </summary>

# 1.08 Activity 3

- Check the difference between MSE and RMSE. Which one is bigger? Does it matter?
- There is another measure of accuracy called "adjusted R-square". How is this different from R-square?
- Read this [article](https://blog.minitab.com/blog/adventures-in-statistics-2/multiple-regession-analysis-use-adjusted-r-squared-and-predicted-r-squared-to-include-the-correct-number-of-variables) and let's discuss it in class.

#1
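In this example MSE (81.59) is bigger than RMSE (9.03). RMSE is the square root of MSE, so MSE is the bigger one whenever it is greater than 1, and the smaller one when it is below 1. Both carry the same information about the error, so comparing models by either one gives the same ranking. RMSE is expressed in the same units as the target, which makes it easier to interpret.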

#2 
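R-square is the proportion of variance in the target that is explained by the inputs. Adjusted R-square corrects it for the number of predictors: it penalizes each additional input, so it only increases when a new input improves the fit by more than would be expected by chance. Plain R-square tends to estimate the fit of a linear regression optimistically, especially when there are many inputs.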

</details>

### Lesson 4 key concepts

> :clock10: 20 min

- Recap the complete process
- Discussion

  - Regression and classification models that will be covered - a quick intro
    - KNN, SVM, decision trees, random forests, neural networks

- Introduce unsupervised machine learning models
  - Clustering algorithms that will be covered - a quick intro
    - K-means clustering, DBSCAN
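As a preview of the difference between the two families, here is a minimal sketch on synthetic data: a supervised model (KNN) is fitted with features and labels, while an unsupervised model (K-means) only receives the features. The blob data and all parameter values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

# Two well-separated synthetic blobs in 2-D.
rng = np.random.default_rng(42)
blob_a = rng.normal(loc=0.0, scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=5.0, scale=0.5, size=(50, 2))
X_toy = np.vstack([blob_a, blob_b])
y_toy = np.array([0] * 50 + [1] * 50)   # known labels, used only by the supervised model

# Supervised: fit(X, y) learns the mapping from features to the given labels.
knn = KNeighborsClassifier(n_neighbors=3).fit(X_toy, y_toy)
knn_pred = knn.predict([[0.2, -0.1], [4.8, 5.3]])

# Unsupervised: fit(X) groups the rows without ever seeing labels.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_toy)
cluster_ids = kmeans.labels_

print(knn_pred)
print(cluster_ids[:5], cluster_ids[-5:])
```

The cluster ids returned by K-means are arbitrary: the algorithm may call the first blob `0` or `1`, because it has no notion of what the groups mean.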

#### :pencil2: Check for Understanding - Class activity/quick quiz

> :clock10: 10 min (+ 10 min Review)

# 1.08 Activity 4

1. List some examples/problems that can be solved using regression models
2. List some examples/problems that can be solved using classification models
3. List some examples/problems that can be solved using clustering models

- There is a file `files_for_activities/sites_with_free_data_sets.pdf`. You can use that to find some examples on different models.

#### 1/2 Supervised ML: Examples
- Predictive analytics (house prices, stock exchange prices, etc.)
- Text recognition
- Spam detection
- Customer sentiment analysis
- Object detection (e.g. face detection)

#### 3 Unsupervised
-  Audience segmentation
-  Customer persona investigation
-  Anomaly detection (for example, to detect bot activity)
-  Pattern recognition (grouping images, transcribing audio)
-  Inventory management (by conversion activity or by availability)

### :pencil2: Practice on key concepts - Lab

> :clock10: 30 min

<details>
  <summary> Click for Instructions: Lab </summary>

# Lab | Customer Analysis Round 6

For this lab, we will keep using the `marketing_customer_analysis.csv` file, which you can find in the `files_for_lab` folder.

### Get the data

We are using the `marketing_customer_analysis.csv` file.

### Dealing with the data

Already done in round 2.

### Explore the data

Done in round 3.

### Processing Data

(_Further processing..._)

- X-y split. (_done_)
- Normalize (numerical). (_done_)
- One Hot/Label Encoding (categorical).
- Concat DataFrames

### Linear Regression

- Train-test split.
- Apply linear regression.

### Model Validation

- Description:
  - R2.
  - MSE.
  - RMSE.
  - MAE.

</details>