# Introduction to Dummy Variables & One Hot Encoding
Link to the Youtube video tutorial: https://www.youtube.com/watch?v=9yl6-HEY7_s&list=PLeo1K3hjS3uvCeTYTeyfe0-rN5r8zn9rw&index=6

Categorical variable explanation: `town`, one of the independent variables (features) in this training data, is a categorical variable. It takes one of a fixed set of names rather than a numeric value.
![title](hidden\categorical_variable.png)


## Load training data

In [1]:
import pandas as pd

df = pd.read_csv('homeprices.csv') # load the training data from the CSV file into a pandas data frame called df. The categorical variable in this dataset is town
print(df)

               town  area   price
0   monroe township  2600  550000
1   monroe township  3000  565000
2   monroe township  3200  610000
3   monroe township  3600  680000
4   monroe township  4000  725000
5      west windsor  2600  585000
6      west windsor  2800  615000
7      west windsor  3300  650000
8      west windsor  3600  710000
9       robinsville  2600  575000
10      robinsville  2900  600000
11      robinsville  3100  620000
12      robinsville  3600  695000
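
As a quick check of what makes `town` categorical, the sketch below rebuilds the table above inline (an assumption made so that it runs without `homeprices.csv`) and lists the distinct values of the column. The name `town_values` is only illustrative.

```python
import pandas as pd

# Same 13 rows as the printed table, typed inline so the sketch is self-contained
df = pd.DataFrame({
    "town": ["monroe township"] * 5 + ["west windsor"] * 4 + ["robinsville"] * 4,
    "area": [2600, 3000, 3200, 3600, 4000, 2600, 2800, 3300, 3600,
             2600, 2900, 3100, 3600],
    "price": [550000, 565000, 610000, 680000, 725000, 585000, 615000,
              650000, 710000, 575000, 600000, 620000, 695000],
})

# town holds a small, fixed set of names; area and price are numeric measurements
town_values = sorted(df["town"].unique())
print(town_values)

# Number of rows available for each town
print(df["town"].value_counts())
```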


### Training data preprocessing option 1) Create dummy variables using pandas get_dummies method

#### Create dummy variables for categorical variable using pandas get_dummies method

Town is the categorical variable in this training dataset

In [2]:
# get_dummies() creates one dummy (indicator) column for each distinct value of the categorical column passed as input. dtype sets the type of the dummy values: dtype=int gives 1/0; if dtype is omitted, recent pandas versions give True/False
dummies = pd.get_dummies(df.town, dtype=int) # build the dummy columns for town as a data frame and store it in a variable called dummies
print(dummies)


    monroe township  robinsville  west windsor
0                 1            0             0
1                 1            0             0
2                 1            0             0
3                 1            0             0
4                 1            0             0
5                 0            0             1
6                 0            0             1
7                 0            0             1
8                 0            0             1
9                 0            1             0
10                0            1             0
11                0            1             0
12                0            1             0
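
One property of this output is worth checking directly: every row has exactly one 1, so the dummy columns always sum to 1 across a row. The sketch below shows this on a small hypothetical Series of four towns (not the full dataset).

```python
import pandas as pd

# A small illustrative sample of town names (hypothetical, not the full dataset)
towns = pd.Series(
    ["monroe township", "west windsor", "robinsville", "west windsor"],
    name="town",
)

# One indicator column per distinct town, with 1/0 values
dummies = pd.get_dummies(towns, dtype=int)
print(dummies)

# Each row belongs to exactly one town, so each row sums to 1
row_sums = dummies.sum(axis=1)
print(row_sums.tolist())
```

Because the columns always sum to 1, any one of them can be computed from the others, which is why one dummy column is dropped before training.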


In [3]:
# concat() joins data frames. axis sets the direction of the join: 'columns' places them side by side, 'index' stacks them vertically.
merged = pd.concat([df,dummies],axis='columns') # join df and dummies side by side, so the columns of dummies follow the last column of df
print(merged)

               town  area   price  monroe township  robinsville  west windsor
0   monroe township  2600  550000                1            0             0
1   monroe township  3000  565000                1            0             0
2   monroe township  3200  610000                1            0             0
3   monroe township  3600  680000                1            0             0
4   monroe township  4000  725000                1            0             0
5      west windsor  2600  585000                0            0             1
6      west windsor  2800  615000                0            0             1
7      west windsor  3300  650000                0            0             1
8      west windsor  3600  710000                0            0             1
9       robinsville  2600  575000                0            1             0
10      robinsville  2900  600000                0            1             0
11      robinsville  3100  620000                0            1             0
12      robinsville  3600  695000                0            1             0

In [4]:
final = merged.drop(['town','west windsor'], axis = 'columns') # drop the town and west windsor columns from merged and save the result in a variable called final. We drop town because it holds text, which the machine learning model cannot use directly. We drop one dummy column to avoid the dummy variable trap: the three dummy columns always sum to 1, so any one of them is fully determined by the other two
print(final)



    area   price  monroe township  robinsville
0   2600  550000                1            0
1   3000  565000                1            0
2   3200  610000                1            0
3   3600  680000                1            0
4   4000  725000                1            0
5   2600  585000                0            0
6   2800  615000                0            0
7   3300  650000                0            0
8   3600  710000                0            0
9   2600  575000                0            1
10  2900  600000                0            1
11  3100  620000                0            1
12  3600  695000                0            1
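
The get_dummies, concat and drop steps above can also be collapsed into a single call. The following is a minimal sketch, assuming the same data typed inline; `final_alt` is an illustrative name. Note that `drop_first=True` drops the first category in sorted order (monroe township), not west windsor as in the cell above, and that the dummy columns receive a `town_` prefix.

```python
import pandas as pd

# Same 13 rows as the printed table, typed inline so the sketch is self-contained
df = pd.DataFrame({
    "town": ["monroe township"] * 5 + ["west windsor"] * 4 + ["robinsville"] * 4,
    "area": [2600, 3000, 3200, 3600, 4000, 2600, 2800, 3300, 3600,
             2600, 2900, 3100, 3600],
    "price": [550000, 565000, 610000, 680000, 725000, 585000, 615000,
              650000, 710000, 575000, 600000, 620000, 695000],
})

# Encode town, remove the original text column, and drop one dummy column in one step
final_alt = pd.get_dummies(df, columns=["town"], drop_first=True, dtype=int)
print(final_alt.columns.tolist())
print(final_alt.head())
```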


#### Develop a linear regression model

Predict the dependent variable (price) based on the independent variables (area and town)

In [5]:
from sklearn.linear_model import LinearRegression

model = LinearRegression() # create a linear regression model

In [6]:
# Prepare independent variable / feature of the training data
X = final.drop('price', axis='columns') # drop the price column from final and assign the remaining columns to X. A single column name can be passed as a plain string; a list is needed only when dropping two or more columns.
print(X) # town is encoded by the last two columns [monroe township, robinsville]: [1,0] for monroe township; [0,0] for west windsor; [0,1] for robinsville


    area  monroe township  robinsville
0   2600                1            0
1   3000                1            0
2   3200                1            0
3   3600                1            0
4   4000                1            0
5   2600                0            0
6   2800                0            0
7   3300                0            0
8   3600                0            0
9   2600                0            1
10  2900                0            1
11  3100                0            1
12  3600                0            1


In [7]:
# Prepare dependent variable / ground truth of the training data
Y = final.price # assign only price column to a variable called Y
print(Y)


0     550000
1     565000
2     610000
3     680000
4     725000
5     585000
6     615000
7     650000
8     710000
9     575000
10    600000
11    620000
12    695000
Name: price, dtype: int64


In [8]:
model.fit(X,Y) # train the machine learning model (linear regression model)

print(model.score(X,Y)) # score the trained model. It first predicts a price for every row (instance) in X, then compares the predictions with the actual (ground truth) values in Y and returns the coefficient of determination, R^2 (1.0 is a perfect fit)

0.9573929037221872
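
For a regressor, `score()` returns R^2 = 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the actual prices from their mean. The sketch below recomputes it by hand and compares it with `model.score`. It assumes the same data typed inline, and names such as `r2_manual` are illustrative.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Same 13 rows as the printed table, typed inline so the sketch is self-contained
df = pd.DataFrame({
    "town": ["monroe township"] * 5 + ["west windsor"] * 4 + ["robinsville"] * 4,
    "area": [2600, 3000, 3200, 3600, 4000, 2600, 2800, 3300, 3600,
             2600, 2900, 3100, 3600],
    "price": [550000, 565000, 610000, 680000, 725000, 585000, 615000,
              650000, 710000, 575000, 600000, 620000, 695000],
})

# Features: area plus two town dummies (west windsor is the dropped category)
X = pd.get_dummies(df[["area", "town"]], columns=["town"], dtype=int)
X = X.drop(columns="town_west windsor")
y = df["price"]

model = LinearRegression().fit(X, y)
pred = model.predict(X)

ss_res = ((y - pred) ** 2).sum()      # squared errors of the model
ss_tot = ((y - y.mean()) ** 2).sum()  # squared errors of always predicting the mean
r2_manual = 1 - ss_res / ss_tot

print(r2_manual)
print(model.score(X, y))
```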


#### Apply the trained machine learning model (linear regression model)

In [9]:
print(model.predict([[2800,0,1]])) # predict the price of a home with an area of 2800 sq ft in robinsville. The input follows the feature order used in training: [[area, monroe township, robinsville]]

[590775.63964739]




In [10]:
print(model.predict([[3400,0,0]])) # predict the price of a home with an area of 3400 sq ft in west windsor. The input follows the feature order used in training: [[area, monroe township, robinsville]]


[681241.66845839]
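
Typing the dummy values by hand is easy to get wrong, so it can help to build the input row from the town name. The sketch below is one possible approach, not part of the original tutorial: `predict_price` is a hypothetical helper, and the data is typed inline so the example runs on its own.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Same 13 rows as the printed table, typed inline so the sketch is self-contained
df = pd.DataFrame({
    "town": ["monroe township"] * 5 + ["west windsor"] * 4 + ["robinsville"] * 4,
    "area": [2600, 3000, 3200, 3600, 4000, 2600, 2800, 3300, 3600,
             2600, 2900, 3100, 3600],
    "price": [550000, 565000, 610000, 680000, 725000, 585000, 615000,
              650000, 710000, 575000, 600000, 620000, 695000],
})

dummies = pd.get_dummies(df.town, dtype=int)
X = pd.concat([df[["area"]], dummies[["monroe township", "robinsville"]]], axis="columns")
model = LinearRegression().fit(X, df.price)


def predict_price(area, town):
    """Hypothetical helper: build a one-row frame with the training column names."""
    row = pd.DataFrame([{
        "area": area,
        "monroe township": int(town == "monroe township"),
        "robinsville": int(town == "robinsville"),
    }])
    # Reorder to the training column order before predicting
    return float(model.predict(row[X.columns])[0])


print(predict_price(2800, "robinsville"))
print(predict_price(3400, "west windsor"))
```

Passing a data frame with the training column names also keeps the feature order explicit, whereas a plain list relies on remembering it.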




### Training data preprocessing option 2) Create dummy variables using one hot encoder method

#### Encode the data of categorical variable into a unique numerical representation using label encoder

Town is the categorical variable in this training dataset

In [11]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder() # create label encoder

dfle = df.copy() # copy df into a new data frame so that the encoding below does not overwrite the town names in df (plain dfle = df would only create a second name for the same data frame)
# le.fit_transform() takes the label (category) column as input and returns a unique integer for each distinct label
dfle.town = le.fit_transform(dfle.town) # encode the town column (monroe township, robinsville, west windsor) as integers, then assign the result back to the data frame
print(dfle) # after label encoding, 0, 1, and 2 represent monroe township, robinsville, and west windsor respectively (labels are numbered in sorted order)



    town  area   price
0      0  2600  550000
1      0  3000  565000
2      0  3200  610000
3      0  3600  680000
4      0  4000  725000
5      2  2600  585000
6      2  2800  615000
7      2  3300  650000
8      2  3600  710000
9      1  2600  575000
10     1  2900  600000
11     1  3100  620000
12     1  3600  695000
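
To confirm which integer stands for which town, the fitted encoder can be inspected through its `classes_` attribute and `inverse_transform` method. The sketch below uses a small hypothetical list of towns rather than the full data frame, and `mapping` is an illustrative name.

```python
from sklearn.preprocessing import LabelEncoder

# A small illustrative sample of town names (hypothetical, not the full dataset)
towns = ["monroe township", "west windsor", "robinsville", "west windsor"]

le = LabelEncoder()
codes = le.fit_transform(towns)
print(codes)

# classes_ lists the labels in sorted order; the position of a label is its code
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform converts codes back to the original labels
print(le.inverse_transform([0, 1, 2]))
```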


In [12]:
# Prepare independent variable / feature of the training data
X_ohe = dfle[['town','area']].values # select the independent variables and convert them to a 2D NumPy array with .values. Selecting two or more columns needs a list of names: dataframe[['column1','column2',...]]. A single column can be accessed as dataframe.column_name
print(X_ohe)

[[   0 2600]
 [   0 3000]
 [   0 3200]
 [   0 3600]
 [   0 4000]
 [   2 2600]
 [   2 2800]
 [   2 3300]
 [   2 3600]
 [   1 2600]
 [   1 2900]
 [   1 3100]
 [   1 3600]]


In [13]:
# Prepare dependent variable / ground truth of the training data
Y_ohe = dfle.price # load the dependent variable into Y_ohe
print(Y_ohe)

0     550000
1     565000
2     610000
3     680000
4     725000
5     585000
6     615000
7     650000
8     710000
9     575000
10    600000
11    620000
12    695000
Name: price, dtype: int64


#### Create dummy variables for categorical variable using one hot encoder method

Town is the categorical variable in this training dataset

More information on handling categorical data with ColumnTransformer and OneHotEncoder:
1) https://www.youtube.com/watch?v=ZS0hzcA5w9I
2) https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html


In [14]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# Use ColumnTransformer to apply OneHotEncoder to selected columns
ct = ColumnTransformer([("encode",OneHotEncoder(),[0])],remainder='passthrough') # each transformer is a tuple (name, transformer, columns): name is any label you choose for the step, transformer is the operation to apply, and columns lists the indices of the columns to apply it to. remainder='passthrough' keeps all unspecified columns unchanged instead of dropping them

X_ohe = ct.fit_transform(X_ohe) # apply the ColumnTransformer on X_ohe variable based on the specified rule, to get dummy variables for each entry of the category variable
print(X_ohe) # the first 3 columns are the dummy variables for the town, the last column is area


[[1.0e+00 0.0e+00 0.0e+00 2.6e+03]
 [1.0e+00 0.0e+00 0.0e+00 3.0e+03]
 [1.0e+00 0.0e+00 0.0e+00 3.2e+03]
 [1.0e+00 0.0e+00 0.0e+00 3.6e+03]
 [1.0e+00 0.0e+00 0.0e+00 4.0e+03]
 [0.0e+00 0.0e+00 1.0e+00 2.6e+03]
 [0.0e+00 0.0e+00 1.0e+00 2.8e+03]
 [0.0e+00 0.0e+00 1.0e+00 3.3e+03]
 [0.0e+00 0.0e+00 1.0e+00 3.6e+03]
 [0.0e+00 1.0e+00 0.0e+00 2.6e+03]
 [0.0e+00 1.0e+00 0.0e+00 2.9e+03]
 [0.0e+00 1.0e+00 0.0e+00 3.1e+03]
 [0.0e+00 1.0e+00 0.0e+00 3.6e+03]]


In [15]:
# drop one dummy column to avoid the dummy variable trap
X_ohe = X_ohe[:,1:] # keep all rows and every column from index 1 onward (this drops column 0, the monroe township dummy), then assign back to X_ohe
print(X_ohe) # town is now encoded by the first two columns [robinsville, west windsor]: [0,0] for monroe township; [0,1] for west windsor; [1,0] for robinsville


[[0.0e+00 0.0e+00 2.6e+03]
 [0.0e+00 0.0e+00 3.0e+03]
 [0.0e+00 0.0e+00 3.2e+03]
 [0.0e+00 0.0e+00 3.6e+03]
 [0.0e+00 0.0e+00 4.0e+03]
 [0.0e+00 1.0e+00 2.6e+03]
 [0.0e+00 1.0e+00 2.8e+03]
 [0.0e+00 1.0e+00 3.3e+03]
 [0.0e+00 1.0e+00 3.6e+03]
 [1.0e+00 0.0e+00 2.6e+03]
 [1.0e+00 0.0e+00 2.9e+03]
 [1.0e+00 0.0e+00 3.1e+03]
 [1.0e+00 0.0e+00 3.6e+03]]
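
The label encoding and manual slicing steps can be folded into the encoder itself. This is a sketch of one alternative, assuming a scikit-learn version in which `OneHotEncoder` accepts string columns and the `drop` parameter. `X_alt` is an illustrative name, the data is typed inline, and `sparse_threshold=0` is set only to guarantee a dense array.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# town and area columns of the printed table, typed inline
df = pd.DataFrame({
    "town": ["monroe township"] * 5 + ["west windsor"] * 4 + ["robinsville"] * 4,
    "area": [2600, 3000, 3200, 3600, 4000, 2600, 2800, 3300, 3600,
             2600, 2900, 3100, 3600],
})

# drop="first" removes the first category (monroe township) during encoding
ct = ColumnTransformer(
    [("encode", OneHotEncoder(drop="first"), ["town"])],
    remainder="passthrough",  # keep area unchanged
    sparse_threshold=0,       # always return a dense array
)

X_alt = ct.fit_transform(df[["town", "area"]])

# One row per town: monroe township, west windsor, robinsville
print(X_alt[[0, 5, 9]])
```

The column layout matches the array above: robinsville dummy, west windsor dummy, then area.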


#### Develop a linear regression model

Predict the dependent variable (price) based on the independent variables (area and town)

In [16]:
model_ohe = LinearRegression() # create a linear regression model
model_ohe.fit(X_ohe,Y_ohe) # train the machine learning model (linear regression model)

print(model_ohe.score(X_ohe,Y_ohe)) # score the trained model. It first predicts a price for every row (instance) in X_ohe, then compares the predictions with the actual (ground truth) values in Y_ohe and returns the coefficient of determination, R^2 (1.0 is a perfect fit)

0.9573929037221873
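
The score matches option 1 (up to floating-point rounding) even though a different dummy column was dropped. Dropping a different column only changes which town acts as the baseline; the fitted prices stay the same. The sketch below checks this, assuming the same data typed inline; `X1`, `X2`, `m1` and `m2` are illustrative names.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Same 13 rows as the printed table, typed inline so the sketch is self-contained
df = pd.DataFrame({
    "town": ["monroe township"] * 5 + ["west windsor"] * 4 + ["robinsville"] * 4,
    "area": [2600, 3000, 3200, 3600, 4000, 2600, 2800, 3300, 3600,
             2600, 2900, 3100, 3600],
    "price": [550000, 565000, 610000, 680000, 725000, 585000, 615000,
              650000, 710000, 575000, 600000, 620000, 695000],
})
dummies = pd.get_dummies(df.town, dtype=int)
y = df.price

# Option 1 layout: area first, west windsor dropped (baseline)
X1 = pd.concat([df[["area"]], dummies.drop(columns="west windsor")], axis="columns")
# Option 2 layout: dummies first, monroe township dropped (baseline)
X2 = pd.concat([dummies.drop(columns="monroe township"), df[["area"]]], axis="columns")

m1 = LinearRegression().fit(X1, y)
m2 = LinearRegression().fit(X2, y)

# The fitted prices agree; only the intercept and dummy coefficients differ
same_fit = np.allclose(m1.predict(X1), m2.predict(X2))
print(same_fit)
print(m1.intercept_, m2.intercept_)
```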


#### Apply the trained machine learning model (linear regression model)

In [17]:
print(model_ohe.predict([[1,0,2800]])) # predict the price of a home with an area of 2800 sq ft in robinsville. The input follows the feature order used in training: [[robinsville, west windsor, area]]


[590775.63964739]


In [18]:
print(model_ohe.predict([[0,1,3400]])) # predict the price of a home with an area of 3400 sq ft in west windsor. The input follows the feature order used in training: [[robinsville, west windsor, area]]


[681241.6684584]
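
Encoding and regression can also be chained so that predictions accept raw town names instead of hand-built dummy values. This is a minimal sketch using scikit-learn's `make_pipeline`, not the method shown in the video; `pipe` and `new_homes` are illustrative names and the data is typed inline.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Same 13 rows as the printed table, typed inline so the sketch is self-contained
df = pd.DataFrame({
    "town": ["monroe township"] * 5 + ["west windsor"] * 4 + ["robinsville"] * 4,
    "area": [2600, 3000, 3200, 3600, 4000, 2600, 2800, 3300, 3600,
             2600, 2900, 3100, 3600],
    "price": [550000, 565000, 610000, 680000, 725000, 585000, 615000,
              650000, 710000, 575000, 600000, 620000, 695000],
})

ct = ColumnTransformer(
    [("encode", OneHotEncoder(drop="first"), ["town"])],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)

# The pipeline encodes town and then fits the regression in one fit() call
pipe = make_pipeline(ct, LinearRegression())
pipe.fit(df[["town", "area"]], df["price"])

# New homes are described with plain town names
new_homes = pd.DataFrame({"town": ["robinsville", "west windsor"], "area": [2800, 3400]})
preds = pipe.predict(new_homes)
print(preds)
```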
