## Categorical Variables
There are two types:

    1. Nominal
    2. Ordinal

Nominal: These variables have no inherent order among their categories.

         Examples: town names, gender, color names

Ordinal: These variables, by contrast, have a natural order among their categories.

         Examples: satisfaction level, qualification level, expertise level
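
A minimal sketch of the distinction, assuming pandas is available. The sample values are illustrative and are not taken from the dataset used later. An ordered `pd.Categorical` supports comparisons such as `max`, while an unordered one carries no ranking.

```python
import pandas as pd

# Nominal: categories with no inherent order
towns = pd.Categorical(["monroe township", "west windsor", "robinsville"])

# Ordinal: the order low < medium < high is declared explicitly
satisfaction = pd.Categorical(
    ["low", "high", "medium"],
    categories=["low", "medium", "high"],
    ordered=True,
)

print(towns.ordered)                # False
print(satisfaction.ordered)         # True
print(satisfaction.max())           # high
print(satisfaction.codes.tolist())  # [0, 2, 1]
```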

## Handling Categorical Variables


When the data contains a text column (a categorical variable), it must be converted to numbers before a model can use it. One option is label encoding, which assigns an integer to each category.


The problem is that, once integers are assigned, the model may treat them as magnitudes: a category with a larger number is treated as if it ranked higher, even when the categories have no order.
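
The effect can be seen in a short sketch, assuming scikit-learn is installed. The town list here is a small illustrative sample. `LabelEncoder` sorts the classes and numbers them, so the integers reflect alphabetical position and nothing about the towns themselves.

```python
from sklearn.preprocessing import LabelEncoder

towns = ["monroe township", "west windsor", "robinsville", "monroe township"]

le = LabelEncoder()
codes = le.fit_transform(towns)

# Classes are sorted alphabetically before being numbered
print(le.classes_.tolist())  # ['monroe township', 'robinsville', 'west windsor']
print(codes.tolist())        # [0, 2, 1, 0]
```

A linear model fed these codes would treat west windsor (2) as "twice" robinsville (1), which has no meaning for town names.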

## One Hot Encoding

With one-hot encoding, no category is given preference over another.

Each category gets its own column, which holds 1 for rows that belong to that category and 0 otherwise.
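
The mechanism can be sketched by hand with NumPy, assuming the category list is known up front. This is only an illustration of the idea, not the method used in the cells that follow.

```python
import numpy as np

categories = ["monroe township", "robinsville", "west windsor"]
towns = ["monroe township", "west windsor", "robinsville"]

# Position of each town in the category list
codes = [categories.index(t) for t in towns]

# Row i of the identity matrix is the one-hot vector for category i
one_hot = np.eye(len(categories), dtype=int)[codes]
print(one_hot.tolist())  # [[1, 0, 0], [0, 0, 1], [0, 1, 0]]
```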

In [1]:
import numpy as np
import matplotlib.pyplot as plt 
import pandas as pd 
from sklearn.model_selection import train_test_split
from sklearn import linear_model

In [2]:
df = pd.read_csv("../Datasets/Homeprices_OneHotEncoding.csv")
df.head()

Unnamed: 0,town,area,price
0,monroe township,2600,550000
1,monroe township,3000,565000
2,monroe township,3200,610000
3,monroe township,3600,680000
4,monroe township,4000,725000


In [62]:
## Let's get the dummy variables

dummies = pd.get_dummies(df.town)
dummies

Unnamed: 0,monroe township,robinsville,west windsor
0,1,0,0
1,1,0,0
2,1,0,0
3,1,0,0
4,1,0,0
5,0,0,1
6,0,0,1
7,0,0,1
8,0,0,1
9,0,1,0
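
One version note: pandas 2.0 and later return boolean (`True`/`False`) dummy columns by default, so the 0/1 output shown above may not appear on a newer install. A minimal sketch, using a small illustrative Series in place of `df.town`, that requests integers explicitly:

```python
import pandas as pd

towns = pd.Series(
    ["monroe township", "west windsor", "robinsville"], name="town"
)

# dtype=int forces 0/1 columns regardless of the pandas default
dummies_int = pd.get_dummies(towns, dtype=int)
print(dummies_int)
```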


In [63]:
## Let's merge the dummies with the original dataset

df_merged = pd.concat([df,dummies],axis='columns')
df_merged

Unnamed: 0,town,area,price,monroe township,robinsville,west windsor
0,monroe township,2600,550000,1,0,0
1,monroe township,3000,565000,1,0,0
2,monroe township,3200,610000,1,0,0
3,monroe township,3600,680000,1,0,0
4,monroe township,4000,725000,1,0,0
5,west windsor,2600,585000,0,0,1
6,west windsor,2800,615000,0,0,1
7,west windsor,3300,650000,0,0,1
8,west windsor,3600,710000,0,0,1
9,robinsville,2600,575000,0,1,0


In [64]:
df_final = df_merged.drop('town',axis=1)
df_final.head()

Unnamed: 0,area,price,monroe township,robinsville,west windsor
0,2600,550000,1,0,0
1,3000,565000,1,0,0
2,3200,610000,1,0,0
3,3600,680000,1,0,0
4,4000,725000,1,0,0


## Dummy Variable Trap

After creating the dummies, notice that any one dummy column can be inferred from the others: if two of the three are 0, the third must be 1.

This is a clear case of multicollinearity. To avoid it, the usual practice is to drop one of the dummy columns.

scikit-learn's `LinearRegression` will generally still fit without error when all dummy columns are present, but it is better to drop the extra dummy column explicitly.
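
The redundancy can be checked numerically. The sketch below, assuming NumPy and pandas, stacks an intercept column next to the dummies and compares the matrix rank with the number of columns. It uses `drop_first=True`, which drops the first category (monroe township) and not west windsor, so it differs from the next cell in which column is removed.

```python
import numpy as np
import pandas as pd

towns = pd.Series(
    ["monroe township"] * 5 + ["west windsor"] * 4 + ["robinsville"] * 4
)
intercept = np.ones((len(towns), 1))

# All three dummies: every row sums to 1, which duplicates the intercept
all_dummies = pd.get_dummies(towns, dtype=int)
full = np.hstack([intercept, all_dummies.to_numpy()])
rank_full = np.linalg.matrix_rank(full)
print(rank_full, "of", full.shape[1], "columns")  # 3 of 4 columns

# One dummy dropped: the remaining columns are linearly independent
reduced_dummies = pd.get_dummies(towns, dtype=int, drop_first=True)
reduced = np.hstack([intercept, reduced_dummies.to_numpy()])
rank_reduced = np.linalg.matrix_rank(reduced)
print(rank_reduced, "of", reduced.shape[1], "columns")  # 3 of 3 columns
```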

In [65]:
df_final = df_final.drop('west windsor',axis=1)

df_final.columns = df_final.columns.str.replace(" ","_")
df_final.head()



Unnamed: 0,area,price,monroe_township,robinsville
0,2600,550000,1,0
1,3000,565000,1,0
2,3200,610000,1,0
3,3600,680000,1,0
4,4000,725000,1,0


In [66]:
X = df_final.drop('price',axis=1)
y = df_final.price

In [67]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()

model.fit(X,y)
model.score(X,y)  ## R^2 of about 0.96 on the training data (not classification accuracy)


0.9573929037221873
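
For a regressor, `score` returns the coefficient of determination R², not an accuracy. A sketch of the same quantity computed by hand, assuming the 13 rows printed later in this notebook are the full dataset (columns: area, monroe_township, robinsville):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

areas = [2600, 3000, 3200, 3600, 4000, 2600, 2800, 3300, 3600,
         2600, 2900, 3100, 3600]
towns = ["monroe township"] * 5 + ["west windsor"] * 4 + ["robinsville"] * 4
prices = [550000, 565000, 610000, 680000, 725000, 585000, 615000,
          650000, 710000, 575000, 600000, 620000, 695000]

# west windsor is the dropped baseline, so it has no column of its own
X = np.array(
    [[a, t == "monroe township", t == "robinsville"]
     for a, t in zip(areas, towns)],
    dtype=float,
)
y = np.array(prices, dtype=float)

model = LinearRegression().fit(X, y)
pred = model.predict(X)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y - pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, model.score(X, y))
```

Because the score is measured on the same rows the model was fitted on, it says little about performance on unseen homes.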

In [68]:
## Let's predict a price
## 3400 sq ft home in west windsor: both remaining dummies are 0
## because west windsor is the dropped (baseline) column

model.predict([[3400,0,0]])

array([681241.66845839])
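
Building the input row by position is easy to get wrong. The sketch below uses a hypothetical helper, `encode`, that is not part of the original notebook; it builds the row by column name and assumes the same 13 rows and the same column order as `df_final`.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

areas = [2600, 3000, 3200, 3600, 4000, 2600, 2800, 3300, 3600,
         2600, 2900, 3100, 3600]
towns = ["monroe township"] * 5 + ["west windsor"] * 4 + ["robinsville"] * 4
prices = [550000, 565000, 610000, 680000, 725000, 585000, 615000,
          650000, 710000, 575000, 600000, 620000, 695000]


def encode(area, town):
    """Hypothetical helper: one feature row in the training column order."""
    return {
        "area": area,
        "monroe_township": int(town == "monroe township"),
        "robinsville": int(town == "robinsville"),
    }


X = pd.DataFrame([encode(a, t) for a, t in zip(areas, towns)])
model = LinearRegression().fit(X, prices)

# west windsor is the baseline, so both dummy columns are 0
new_home = pd.DataFrame([encode(3400, "west windsor")])
print(new_home.values.tolist())  # [[3400, 0, 0]]
print(model.predict(new_home))
```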

## Remember: `OneHotEncoder.fit_transform` expects a 2-D array as input
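
A small sketch of the shape requirement, assuming scikit-learn and an illustrative three-town array. `LabelEncoder` takes a 1-D array of labels, while `OneHotEncoder` expects shape `(n_samples, n_features)`, so a single column has to be reshaped first.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

towns = np.array(["monroe township", "west windsor", "robinsville"])

# LabelEncoder works on the 1-D array directly
codes = LabelEncoder().fit_transform(towns)

# OneHotEncoder needs a column: reshape (3,) into (3, 1)
towns_2d = towns.reshape(-1, 1)
encoded = OneHotEncoder().fit_transform(towns_2d).toarray()

print(towns.shape, towns_2d.shape, encoded.shape)  # (3,) (3, 1) (3, 3)
```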

## Let's use OneHotEncoder

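Older versions of scikit-learn's `OneHotEncoder` required the categorical column to be converted to integers first; recent versions can also encode string columns directly. This walkthrough label-encodes the town column first.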
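
As a point of comparison, the sketch below skips the label-encoding step and passes the string column straight to `OneHotEncoder`, which recent scikit-learn releases accept. The three-row frame is illustrative, and `sparse_threshold=0` is set only to force a dense array for printing.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

df_small = pd.DataFrame({
    "town": ["monroe township", "west windsor", "robinsville"],
    "area": [2600, 2600, 2600],
})

ct_str = ColumnTransformer(
    [("town", OneHotEncoder(), ["town"])],  # select the column by name
    remainder="passthrough",                # keep area unchanged
    sparse_threshold=0,                     # always return a dense array
)
X_str = ct_str.fit_transform(df_small)
print(X_str)
```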

In [69]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

df_le = df.copy()
df_le.town = le.fit_transform(df_le.town)
df_le.head()

Unnamed: 0,town,area,price
0,0,2600,550000
1,0,3000,565000
2,0,3200,610000
3,0,3600,680000
4,0,4000,725000


In [70]:
X = df_le[['town','area']].values
X

array([[   0, 2600],
       [   0, 3000],
       [   0, 3200],
       [   0, 3600],
       [   0, 4000],
       [   2, 2600],
       [   2, 2800],
       [   2, 3300],
       [   2, 3600],
       [   1, 2600],
       [   1, 2900],
       [   1, 3100],
       [   1, 3600]], dtype=int64)

In [71]:
y = df_le['price'].values
y

array([550000, 565000, 610000, 680000, 725000, 585000, 615000, 650000,
       710000, 575000, 600000, 620000, 695000], dtype=int64)

In [72]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

ct = ColumnTransformer([('town',OneHotEncoder(),[0])],remainder='passthrough')

## By default (remainder='drop'), only the columns listed in transformers
## are transformed and kept in the output; all other columns are dropped.
## With remainder='passthrough', the remaining columns are passed through
## unchanged and appended to the transformed output.

In [73]:
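X = ct.fit_transform(X)  ## one-hot encodes only the town column (index 0)

## Older scikit-learn releases offered a shortcut, shown for reference only:
##   ohe = OneHotEncoder(categorical_features=[0])  ## first column only: town
##   X = ohe.fit_transform(X).toarray()
##   X = X[:,1:]
##   model.fit(X,y)
## The categorical_features parameter was removed in scikit-learn 0.22,
## so use ColumnTransformer as above.
X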

array([[1.0e+00, 0.0e+00, 0.0e+00, 2.6e+03],
       [1.0e+00, 0.0e+00, 0.0e+00, 3.0e+03],
       [1.0e+00, 0.0e+00, 0.0e+00, 3.2e+03],
       [1.0e+00, 0.0e+00, 0.0e+00, 3.6e+03],
       [1.0e+00, 0.0e+00, 0.0e+00, 4.0e+03],
       [0.0e+00, 0.0e+00, 1.0e+00, 2.6e+03],
       [0.0e+00, 0.0e+00, 1.0e+00, 2.8e+03],
       [0.0e+00, 0.0e+00, 1.0e+00, 3.3e+03],
       [0.0e+00, 0.0e+00, 1.0e+00, 3.6e+03],
       [0.0e+00, 1.0e+00, 0.0e+00, 2.6e+03],
       [0.0e+00, 1.0e+00, 0.0e+00, 2.9e+03],
       [0.0e+00, 1.0e+00, 0.0e+00, 3.1e+03],
       [0.0e+00, 1.0e+00, 0.0e+00, 3.6e+03]])
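
The effect of `remainder` can be checked by comparing output shapes. A minimal sketch on an illustrative three-row array laid out like `X` before the transform (town code, area); `sparse_threshold=0` is an addition here, used only to force dense output.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

X_small = np.array([[0, 2600], [2, 2600], [1, 2600]])

# Default remainder='drop': the area column is discarded
ct_drop = ColumnTransformer(
    [("town", OneHotEncoder(), [0])], sparse_threshold=0
)
dropped = ct_drop.fit_transform(X_small)

# remainder='passthrough': area is appended after the dummy columns
ct_pass = ColumnTransformer(
    [("town", OneHotEncoder(), [0])],
    remainder="passthrough",
    sparse_threshold=0,
)
passed = ct_pass.fit_transform(X_small)

print(dropped.shape)  # (3, 3)
print(passed.shape)   # (3, 4)
```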

In [74]:
## Let's drop one of the 3 dummies created

X = X[:,1:]  ## all rows, columns from index 1 onward (drops the first dummy)
X

array([[0.0e+00, 0.0e+00, 2.6e+03],
       [0.0e+00, 0.0e+00, 3.0e+03],
       [0.0e+00, 0.0e+00, 3.2e+03],
       [0.0e+00, 0.0e+00, 3.6e+03],
       [0.0e+00, 0.0e+00, 4.0e+03],
       [0.0e+00, 1.0e+00, 2.6e+03],
       [0.0e+00, 1.0e+00, 2.8e+03],
       [0.0e+00, 1.0e+00, 3.3e+03],
       [0.0e+00, 1.0e+00, 3.6e+03],
       [1.0e+00, 0.0e+00, 2.6e+03],
       [1.0e+00, 0.0e+00, 2.9e+03],
       [1.0e+00, 0.0e+00, 3.1e+03],
       [1.0e+00, 0.0e+00, 3.6e+03]])
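
Slicing by hand works, but it relies on knowing where the dummy columns sit. An alternative sketch, assuming a scikit-learn release that supports the `drop` parameter of `OneHotEncoder` (0.21 or later), lets the encoder drop the first category itself. The three-row array is illustrative.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

X_small = np.array([[0, 2600], [2, 2800], [1, 2900]])

ct_first = ColumnTransformer(
    [("town", OneHotEncoder(drop="first"), [0])],  # drops the dummy for code 0
    remainder="passthrough",
    sparse_threshold=0,  # force dense output for printing
)
X_enc = ct_first.fit_transform(X_small)
print(X_enc.tolist())
# [[0.0, 0.0, 2600.0], [0.0, 1.0, 2800.0], [1.0, 0.0, 2900.0]]
```

The result has the same layout as the manual `X[:,1:]` slice: two dummy columns followed by area.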

In [75]:
model.fit(X,y)
model.score(X,y)

0.9573929037221873

## When to use Label and One Hot encoding

Label Encoding:

    1. The categorical variable is ordinal, i.e. its categories have a meaningful order
    2. The number of categories is very large, so one-hot encoding would consume too much memory


One Hot Encoding:

    1. The categories have no order (nominal variables)
    2. The number of categories is small
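
One caveat for the ordinal case: `LabelEncoder` numbers classes alphabetically, which may not match the real order. A sketch using scikit-learn's `OrdinalEncoder` with the order stated explicitly; the satisfaction levels are illustrative values, not data from this notebook.

```python
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

levels = ["low", "medium", "high"]          # intended order
values = ["medium", "low", "high"]

# Alphabetical numbering: high=0, low=1, medium=2 (order is lost)
alphabetical = LabelEncoder().fit_transform(values)
print(alphabetical.tolist())  # [2, 1, 0]

# Explicit order: low=0, medium=1, high=2
enc = OrdinalEncoder(categories=[levels])
ordered = enc.fit_transform([[v] for v in values])  # 2-D input required
print(ordered.ravel().tolist())  # [1.0, 0.0, 2.0]
```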