## Machine Learning Tutorial 6: Dummy Variables & One Hot Encoding

Numerical data can be fed to a model directly, but real-world datasets often include text columns. Categorical text has to be converted to numbers first. **Label encoding** assigns an integer to each category, which is misleading for nominal variables (categories with no inherent order) because the model can read the integers as a ranking. **One hot encoding** avoids this by giving each category its own 0/1 column. In this tutorial, we'll use `pandas.get_dummies` to create dummy variables and perform one hot encoding on a dataset.

Alternatively, you can use `sklearn.preprocessing.OneHotEncoder` for the same task. Both approaches are shown below.
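To make the contrast concrete, here is a minimal sketch that encodes a small, made-up `town` column both ways. The sample values are illustrative only, and the integer codes come from pandas' category codes rather than from `LabelEncoder`, which is an assumption made to keep the example short.

```python
import pandas as pd

# Illustrative sample, not the tutorial's CSV file
towns = pd.Series(
    ["monroe township", "west windsor", "robinsville", "west windsor"],
    name="town",
)

# Label-style encoding: each category becomes an integer code.
# Categories are sorted alphabetically, so the codes imply an order
# that does not really exist between towns.
label_codes = towns.astype("category").cat.codes

# One hot encoding: one 0/1 column per category, no implied order.
one_hot = pd.get_dummies(towns, dtype=int)

print(label_codes.tolist())
print(one_hot)
```

Every row of `one_hot` contains exactly one 1, whereas the integer codes would let a linear model treat one town as "larger" than another.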

#### Topics covered:
* Categorical Variables 
* Dummy Variables
* One Hot Encoding

### Categorical Variables

Categorical variables can be classified into two types:

- **Nominal:** categories with no inherent order.
  - Examples: 
    - Gender: Male, Female
    - Colors: Red, Green, Blue
    - Locations: Monroe Township, Robbinsville, West Windsor

- **Ordinal:** categories with a meaningful order.
  - Examples:
    - Levels: Low, Medium, High
    - Education: Graduate, Master's, PhD
    - Ratings: Bad, Average, Good
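For ordinal variables, integer codes are reasonable as long as they follow the real order. The sketch below assumes the order Low < Medium < High and uses `pandas.Categorical` with `ordered=True`; the sample values are made up for illustration.

```python
import pandas as pd

# Made-up sample of an ordinal variable
levels = pd.Series(["Medium", "Low", "High", "Low"])

# State the order explicitly; otherwise pandas sorts categories alphabetically
ordered = pd.Categorical(levels, categories=["Low", "Medium", "High"], ordered=True)

# Integer codes now follow the declared order: Low=0, Medium=1, High=2
level_codes = ordered.codes.tolist()
print(level_codes)
```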

### One-Hot Encoding

Each category is transformed into a **binary vector**, where only one element is '1' (hot) and all others are '0'. This is especially useful for nominal variables, where there is **no intrinsic order among categories**. It converts categorical variables into a numerical format that machine learning models can work with.

- **Example for Nominal Variables:**
  - Gender: 
    - Male: `[1, 0]`
    - Female: `[0, 1]`
  - Colors:
    - Red: `[1, 0, 0]`
    - Green: `[0, 1, 0]`
    - Blue: `[0, 0, 1]`
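The mapping above can be reproduced in a few lines of plain Python. This is only a sketch to show the idea; the helper name `one_hot` is hypothetical, and the rest of the tutorial relies on pandas and scikit-learn instead.

```python
def one_hot(value, categories):
    """Return a list with 1 at the position of `value` and 0 elsewhere."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if category == value else 0 for category in categories]


colors = ["Red", "Green", "Blue"]

# Build the vector for every color in the list
encoded = {color: one_hot(color, colors) for color in colors}
print(encoded)
```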

## Task

Build a model that predicts the price of a home for each of the following:

1) A **3400 sqft** home in **west windsor**
2) A **2800 sqft** home in **robinsville** (the dataset's spelling of Robbinsville)

In [68]:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

In [2]:
df = pd.read_csv("C:\\Users\\Vaishob\\Downloads\\homeprices (1).csv")
df

Unnamed: 0,town,area,price
0,monroe township,2600,550000
1,monroe township,3000,565000
2,monroe township,3200,610000
3,monroe township,3600,680000
4,monroe township,4000,725000
5,west windsor,2600,585000
6,west windsor,2800,615000
7,west windsor,3300,650000
8,west windsor,3600,710000
9,robinsville,2600,575000


## Using Pandas to create dummy variables

In [76]:
# Convert 'town' to one-hot encoded binary columns
pd.get_dummies(df.town)

Unnamed: 0,monroe township,robinsville,west windsor
0,True,False,False
1,True,False,False
2,True,False,False
3,True,False,False
4,True,False,False
5,False,False,True
6,False,False,True
7,False,False,True
8,False,False,True
9,False,True,False


In [8]:
dummies = pd.get_dummies(df.town)

In [10]:
# Merge the original dataframe with the one-hot encoded columns
merged = pd.concat([df,dummies], axis='columns')
merged

Unnamed: 0,town,area,price,monroe township,robinsville,west windsor
0,monroe township,2600,550000,True,False,False
1,monroe township,3000,565000,True,False,False
2,monroe township,3200,610000,True,False,False
3,monroe township,3600,680000,True,False,False
4,monroe township,4000,725000,True,False,False
5,west windsor,2600,585000,False,False,True
6,west windsor,2800,615000,False,False,True
7,west windsor,3300,650000,False,False,True
8,west windsor,3600,710000,False,False,True
9,robinsville,2600,575000,False,True,False


In [41]:
# Convert the Boolean columns to integer 1/0 values
merged[['monroe township', 'robinsville', 'west windsor']] = merged[['monroe township', 'robinsville', 'west windsor']].astype(int)

# Display the updated DataFrame
merged

Unnamed: 0,town,area,price,monroe township,robinsville,west windsor
0,monroe township,2600,550000,1,0,0
1,monroe township,3000,565000,1,0,0
2,monroe township,3200,610000,1,0,0
3,monroe township,3600,680000,1,0,0
4,monroe township,4000,725000,1,0,0
5,west windsor,2600,585000,0,0,1
6,west windsor,2800,615000,0,0,1
7,west windsor,3300,650000,0,0,1
8,west windsor,3600,710000,0,0,1
9,robinsville,2600,575000,0,1,0
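The three steps above (create the dummies, concatenate, convert to integers) can also be done in one call. The sketch below assumes a pandas version in which `get_dummies` accepts the `columns` and `dtype` arguments, and uses three rows copied from the table above rather than the full file. Note that the generated column names carry a `town_` prefix by default.

```python
import pandas as pd

# Three rows copied from the table above, one per town
df_small = pd.DataFrame({
    "town": ["monroe township", "west windsor", "robinsville"],
    "area": [2600, 2600, 2600],
    "price": [550000, 585000, 575000],
})

# Encode the 'town' column in place: the original column is replaced
# by one integer 0/1 column per town, appended after the other columns
encoded = pd.get_dummies(df_small, columns=["town"], dtype=int)
print(encoded)
```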


In [42]:
# Drop the 'town' column as it is redundant
# Also drop one of the dummy variables
final = merged.drop(['town','west windsor'],axis='columns')
final

Unnamed: 0,area,price,monroe township,robinsville
0,2600,550000,1,0
1,3000,565000,1,0
2,3200,610000,1,0
3,3600,680000,1,0
4,4000,725000,1,0
5,2600,585000,0,0
6,2800,615000,0,0
7,3300,650000,0,0
8,3600,710000,0,0
9,2600,575000,0,1


### Why Drop a Dummy Variable?

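With k categories, k - 1 dummy columns already carry all the information: a row with 0 in every remaining dummy column must belong to the dropped category. Dropping one dummy variable removes this redundancy and so prevents **multicollinearity**. The dropped category (here, `west windsor`) becomes the baseline against which the other towns are compared.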
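pandas can drop a column for you through the `drop_first` argument of `get_dummies`. The sketch below uses made-up sample values. Be aware that it drops the first category in sorted order, which is `monroe township` here, not `west windsor` as in the manual step above, so the order of the remaining columns differs.

```python
import pandas as pd

# Made-up sample with one row per town
towns = pd.Series(["monroe township", "robinsville", "west windsor"], name="town")

# drop_first=True removes the first category in sorted order,
# which then acts as the baseline (all remaining columns are 0 for it)
dummies = pd.get_dummies(towns, drop_first=True, dtype=int)
print(dummies)
```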

### What is the "Dummy Variable Trap"?

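The **dummy variable trap** occurs when one-hot encoding keeps a dummy column for every category. The columns are then redundant (any one of them can be computed from the others), which is a case of perfect **multicollinearity**, known in linear algebra as **linear dependency**.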

### What is Multicollinearity?

**Multicollinearity** occurs when two or more independent variables in a regression model are highly correlated, making it difficult for the model to distinguish their individual effects.

#### **Linear Dependency Example:**
- **Category 1:** [1, 0, 0]
- **Category 2:** [0, 1, 0]
- **Category 3:** [0, 0, 1]

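In every row exactly one of the three dummy values is 1, so the three dummy columns always add up to a column of ones. That is the same column a regression model uses for its intercept, so the intercept and the three dummy columns together are linearly dependent.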
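The dependency can be checked numerically with `numpy.linalg.matrix_rank`. The sketch below builds a small, made-up design matrix with an intercept column of ones and compares its rank with the number of columns, before and after dropping one dummy column.

```python
import numpy as np

# Six made-up rows, two per category (categories 0, 1, 2)
category_ids = [0, 0, 1, 1, 2, 2]
dummies = np.eye(3)[category_ids]        # one-hot rows, shape (6, 3)
intercept = np.ones((6, 1))              # column of ones for the intercept

with_all = np.hstack([intercept, dummies])                  # 4 columns
with_one_dropped = np.hstack([intercept, dummies[:, :2]])   # 3 columns

rank_all = np.linalg.matrix_rank(with_all)
rank_dropped = np.linalg.matrix_rank(with_one_dropped)

# A rank below the column count means the columns are linearly dependent
print(rank_all, with_all.shape[1])
print(rank_dropped, with_one_dropped.shape[1])
```

With all three dummy columns the rank is lower than the number of columns; after dropping one, the rank equals the number of columns.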

In [43]:
model = LinearRegression()

In [44]:
X = final.drop('price',axis='columns')
X

Unnamed: 0,area,monroe township,robinsville
0,2600,1,0
1,3000,1,0
2,3200,1,0
3,3600,1,0
4,4000,1,0
5,2600,0,0
6,2800,0,0
7,3300,0,0
8,3600,0,0
9,2600,0,1


In [46]:
y = final.price
y

0     550000
1     565000
2     610000
3     680000
4     725000
5     585000
6     615000
7     650000
8     710000
9     575000
10    600000
11    620000
12    695000
Name: price, dtype: int64

In [47]:
# Training your machine learning model
model.fit(X, y)

In [48]:
# Make predictions
model.predict([[2800,0,1]])



array([590775.63964739])

In [49]:
# Make predictions
model.predict([[3400,0,0]])



array([681241.66845839])
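Passing a bare list such as `[[2800,0,1]]` works, but it depends on remembering the column order (`area`, `monroe township`, `robinsville`), and recent scikit-learn versions warn when a model fitted on a DataFrame is given input without column names. A sketch of a more explicit alternative follows. It uses only the ten rows displayed above, so its predictions will not match the outputs shown, which come from the full file.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# The ten rows displayed above (the full CSV file contains more)
df_homes = pd.DataFrame({
    "town": ["monroe township"] * 5 + ["west windsor"] * 4 + ["robinsville"],
    "area": [2600, 3000, 3200, 3600, 4000, 2600, 2800, 3300, 3600, 2600],
    "price": [550000, 565000, 610000, 680000, 725000,
              585000, 615000, 650000, 710000, 575000],
})

# Same feature layout as X above: area, monroe township, robinsville
dummies = pd.get_dummies(df_homes["town"], dtype=int)
X = pd.concat([df_homes[["area"]], dummies[["monroe township", "robinsville"]]],
              axis="columns")
y = df_homes["price"]

model = LinearRegression().fit(X, y)

# Name the columns in the query so the meaning of each value is explicit
query = pd.DataFrame([[2800, 0, 1], [3400, 0, 0]], columns=X.columns)
predictions = model.predict(query)
print(predictions)
```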

In [50]:
# Check the R-squared score (coefficient of determination) on the training data
model.score(X, y)

0.9573929037221873
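For a regression model, `score` returns the coefficient of determination (R-squared) rather than a classification accuracy. The sketch below recomputes it by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. It fits on the five monroe township rows only, to keep the example small, so the value differs from the one above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# The five monroe township rows from the table above
X = np.array([[2600], [3000], [3200], [3600], [4000]])
y = np.array([550000, 565000, 610000, 680000, 725000])

model = LinearRegression().fit(X, y)
predicted = model.predict(X)

ss_res = np.sum((y - predicted) ** 2)   # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)    # total sum of squares
r2_manual = 1 - ss_res / ss_tot

r2_from_model = model.score(X, y)
print(r2_manual, r2_from_model)
```

Because the score here is computed on the same rows used for fitting, it says how well the line fits the training data, not how well it generalizes.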


## Using sklearn OneHotEncoder for the same task

In [62]:
# Initialize LabelEncoder class object
le = LabelEncoder()

In [63]:
# Carry out Label Encoding using Label Column as input
dfle = df.copy()  # work on a copy so the original df keeps its town names
dfle.town = le.fit_transform(dfle.town)
dfle

Unnamed: 0,town,area,price
0,0,2600,550000
1,0,3000,565000
2,0,3200,610000
3,0,3600,680000
4,0,4000,725000
5,2,2600,585000
6,2,2800,615000
7,2,3300,650000
8,2,3600,710000
9,1,2600,575000
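To see which integer stands for which town, you can inspect the fitted encoder's `classes_` attribute: the position of each label in that array is its code. A small sketch with made-up sample values:

```python
from sklearn.preprocessing import LabelEncoder

# Made-up sample of town names
towns = ["monroe township", "west windsor", "robinsville", "west windsor"]

le = LabelEncoder()
codes = le.fit_transform(towns)

# classes_ holds the sorted unique labels; the index of each label is its code
mapping = {str(town): code for code, town in enumerate(le.classes_)}
print(mapping)

# inverse_transform converts codes back to the original labels
decoded = [str(town) for town in le.inverse_transform(codes)]
```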


In [64]:
X = dfle[['town','area']].values # We want 2D array instead of dataframe
X

array([[   0, 2600],
       [   0, 3000],
       [   0, 3200],
       [   0, 3600],
       [   0, 4000],
       [   2, 2600],
       [   2, 2800],
       [   2, 3300],
       [   2, 3600],
       [   1, 2600],
       [   1, 2900],
       [   1, 3100],
       [   1, 3600]], dtype=int64)

In [66]:
y = dfle.price.values
y

array([550000, 565000, 610000, 680000, 725000, 585000, 615000, 650000,
       710000, 575000, 600000, 620000, 695000], dtype=int64)

In [70]:
# Create the dummy variables for each town value 
ct = ColumnTransformer([('town', OneHotEncoder(), [0])], remainder = 'passthrough')

In [71]:
X = ct.fit_transform(X)
X

array([[1.0e+00, 0.0e+00, 0.0e+00, 2.6e+03],
       [1.0e+00, 0.0e+00, 0.0e+00, 3.0e+03],
       [1.0e+00, 0.0e+00, 0.0e+00, 3.2e+03],
       [1.0e+00, 0.0e+00, 0.0e+00, 3.6e+03],
       [1.0e+00, 0.0e+00, 0.0e+00, 4.0e+03],
       [0.0e+00, 0.0e+00, 1.0e+00, 2.6e+03],
       [0.0e+00, 0.0e+00, 1.0e+00, 2.8e+03],
       [0.0e+00, 0.0e+00, 1.0e+00, 3.3e+03],
       [0.0e+00, 0.0e+00, 1.0e+00, 3.6e+03],
       [0.0e+00, 1.0e+00, 0.0e+00, 2.6e+03],
       [0.0e+00, 1.0e+00, 0.0e+00, 2.9e+03],
       [0.0e+00, 1.0e+00, 0.0e+00, 3.1e+03],
       [0.0e+00, 1.0e+00, 0.0e+00, 3.6e+03]])

In [72]:
# Drop first column to avoid dummy variable trap
X = X[:,1:]
X

array([[0.0e+00, 0.0e+00, 2.6e+03],
       [0.0e+00, 0.0e+00, 3.0e+03],
       [0.0e+00, 0.0e+00, 3.2e+03],
       [0.0e+00, 0.0e+00, 3.6e+03],
       [0.0e+00, 0.0e+00, 4.0e+03],
       [0.0e+00, 1.0e+00, 2.6e+03],
       [0.0e+00, 1.0e+00, 2.8e+03],
       [0.0e+00, 1.0e+00, 3.3e+03],
       [0.0e+00, 1.0e+00, 3.6e+03],
       [1.0e+00, 0.0e+00, 2.6e+03],
       [1.0e+00, 0.0e+00, 2.9e+03],
       [1.0e+00, 0.0e+00, 3.1e+03],
       [1.0e+00, 0.0e+00, 3.6e+03]])
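The label encoding and manual column slicing above can be folded into the encoder itself. The sketch below assumes a reasonably recent scikit-learn release in which `OneHotEncoder` accepts string categories directly and supports `drop="first"`. It uses a few illustrative rows rather than the full file, and `sparse_threshold=0` is set so that `ColumnTransformer` returns a dense array.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Illustrative rows with the town still stored as text
X_raw = pd.DataFrame({
    "town": ["monroe township", "west windsor", "robinsville", "west windsor"],
    "area": [2600, 2600, 2600, 3300],
})

ct = ColumnTransformer(
    [("town", OneHotEncoder(drop="first"), ["town"])],  # drops 'monroe township'
    remainder="passthrough",                              # keep 'area' unchanged
    sparse_threshold=0,                                   # always return a dense array
)

# Resulting columns: robinsville, west windsor, area
X_encoded = ct.fit_transform(X_raw)
print(X_encoded)
```

The column order matches the array obtained above by slicing off the first column, so the same prediction inputs apply.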

In [73]:
model.fit(X, y)

In [74]:
model.predict([[1,0,2800]])

array([590775.63964739])

In [75]:
model.predict([[0,1,3400]])

array([681241.6684584])

## Exercise

Given the car sale price data for different models in the file **carprices.csv**, first plot the data points on a scatter plot to see whether a linear regression model can be applied. If so, build a model that can answer the following questions:

* **i) Predict the price of a Mercedes Benz that is 4 years old with a mileage of 45000**
* **ii) Predict the price of a BMW X5 that is 7 years old with a mileage of 86000**
* **iii) Determine the score of your model (Hint: `LinearRegression().score()` returns the R-squared value)**



In [77]:
df_cars = pd.read_csv("C:\\Users\\Vaishob\\Downloads\\carprices.csv")
df_cars

Unnamed: 0,Car Model,Mileage,Sell Price($),Age(yrs)
0,BMW X5,69000,18000,6
1,BMW X5,35000,34000,3
2,BMW X5,57000,26100,5
3,BMW X5,22500,40000,2
4,BMW X5,46000,31500,4
5,Audi A5,59000,29400,5
6,Audi A5,52000,32000,5
7,Audi A5,72000,19300,6
8,Audi A5,91000,12000,8
9,Mercedez Benz C class,67000,22000,6
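The exercise asks for a scatter plot before modelling. A minimal sketch with matplotlib follows, using the ten rows displayed above instead of reading the file. The variable names and axis labels are illustrative choices, and the `Agg` backend line is only there so the code runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# The ten rows displayed above
cars = pd.DataFrame({
    "mileage": [69000, 35000, 57000, 22500, 46000, 59000, 52000, 72000, 91000, 67000],
    "age": [6, 3, 5, 2, 4, 5, 5, 6, 8, 6],
    "price": [18000, 34000, 26100, 40000, 31500, 29400, 32000, 19300, 12000, 22000],
})

# One panel per numeric feature, sharing the price axis
fig, (ax_mileage, ax_age) = plt.subplots(1, 2, figsize=(10, 4), sharey=True)

ax_mileage.scatter(cars["mileage"], cars["price"])
ax_mileage.set_xlabel("Mileage")
ax_mileage.set_ylabel("Sell price (USD)")

ax_age.scatter(cars["age"], cars["price"])
ax_age.set_xlabel("Age (years)")

fig.tight_layout()
```

If the points in both panels fall roughly along a straight, downward-sloping line, a linear model is a reasonable first choice.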


In [79]:
dummies = pd.get_dummies(df_cars['Car Model'])
dummies

Unnamed: 0,Audi A5,BMW X5,Mercedez Benz C class
0,False,True,False
1,False,True,False
2,False,True,False
3,False,True,False
4,False,True,False
5,True,False,False
6,True,False,False
7,True,False,False
8,True,False,False
9,False,False,True


In [80]:
merged = pd.concat([df_cars, dummies],axis='columns')
merged

Unnamed: 0,Car Model,Mileage,Sell Price($),Age(yrs),Audi A5,BMW X5,Mercedez Benz C class
0,BMW X5,69000,18000,6,False,True,False
1,BMW X5,35000,34000,3,False,True,False
2,BMW X5,57000,26100,5,False,True,False
3,BMW X5,22500,40000,2,False,True,False
4,BMW X5,46000,31500,4,False,True,False
5,Audi A5,59000,29400,5,True,False,False
6,Audi A5,52000,32000,5,True,False,False
7,Audi A5,72000,19300,6,True,False,False
8,Audi A5,91000,12000,8,True,False,False
9,Mercedez Benz C class,67000,22000,6,False,False,True


In [82]:
final = merged.drop(["Car Model","Mercedez Benz C class"],axis='columns')
final

Unnamed: 0,Mileage,Sell Price($),Age(yrs),Audi A5,BMW X5
0,69000,18000,6,False,True
1,35000,34000,3,False,True
2,57000,26100,5,False,True
3,22500,40000,2,False,True
4,46000,31500,4,False,True
5,59000,29400,5,True,False
6,52000,32000,5,True,False
7,72000,19300,6,True,False
8,91000,12000,8,True,False
9,67000,22000,6,False,False


In [83]:
X = final.drop('Sell Price($)',axis='columns')
X

Unnamed: 0,Mileage,Age(yrs),Audi A5,BMW X5
0,69000,6,False,True
1,35000,3,False,True
2,57000,5,False,True
3,22500,2,False,True
4,46000,4,False,True
5,59000,5,True,False
6,52000,5,True,False
7,72000,6,True,False
8,91000,8,True,False
9,67000,6,False,False


In [84]:
y = final['Sell Price($)']
y

0     18000
1     34000
2     26100
3     40000
4     31500
5     29400
6     32000
7     19300
8     12000
9     22000
10    20000
11    21000
12    33000
Name: Sell Price($), dtype: int64

In [85]:
model = LinearRegression()

In [86]:
model.fit(X,y)

In [87]:
model.score(X, y)

0.9417050937281083

#### i) Predict the price of a Mercedes Benz that is 4 years old with a mileage of 45000

In [88]:
model.predict([[45000,4,0,0]])



array([36991.31721061])

#### ii) Predict the price of a BMW X5 that is 7 years old with a mileage of 86000

In [89]:
model.predict([[86000,7,0,1]])



array([11080.74313219])

#### iii) Determine the accuracy score of your model (Hint: can use LinearRegression().score())

In [90]:
model.score(X, y)

0.9417050937281083