
Practicing OneHot Encoding using:

1. Pandas
2. Sklearn
3. Feature Engine

In [1]:
import pandas as pd
import numpy as np

In [2]:
path = '/content/drive/MyDrive/Udemy Courses - 2.0/Feature Engineering/Dataset/titanic.csv'
data = pd.read_csv(path)
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [3]:
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [4]:
use_cols=['Sex', 'Embarked', 'Cabin', 'Survived']

In [5]:
# take an explicit copy so that later column assignments do not trigger SettingWithCopyWarning
data = data[use_cols].copy()

In [6]:
# keep only the first character of the cabin code (the deck letter); missing values stay NaN
data['Cabin'] = data['Cabin'].str[0]
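
The `.str[0]` step reduces each cabin code to its first letter and leaves missing cabins as NaN. A minimal sketch on toy data (the frame `toy` and its values are made up for illustration, not taken from the Titanic file):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the Titanic data.
toy = pd.DataFrame({"Cabin": ["C85", np.nan, "C123", "E46"],
                    "Survived": [1, 0, 1, 0]})

# Select the columns and take an explicit copy before modifying them.
subset = toy[["Cabin", "Survived"]].copy()

# .str[0] keeps the first character of each string; NaN passes through unchanged.
subset["Cabin"] = subset["Cabin"].str[0]

print(subset["Cabin"].tolist())  # ['C', nan, 'C', 'E']
print(toy["Cabin"].tolist())     # the original frame is untouched
```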


In [7]:
data.isnull().sum()

Sex           0
Embarked      2
Cabin       687
Survived      0
dtype: int64

In [8]:
X = data.drop('Survived', axis = 1)
y = data['Survived']

In [9]:
from sklearn.model_selection import train_test_split

In [10]:
X_train,X_test,y_train,y_test = train_test_split(X,y,train_size = 0.7, random_state= 100, stratify = y)

### Pandas OneHot Encoding

Advantages

    quick
    returns pandas dataframe
    returns feature names for the dummy variables



Limitations

    it does not remember the categories seen in the train data, so the test data can end up with a different set of dummy columns
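
To make this limitation concrete, here is a small sketch on made-up data (the frames `train` and `test` are hypothetical). Each `get_dummies` call only sees the frame it is given, so the resulting columns can differ; `reindex` is one possible workaround, not the only one.

```python
import pandas as pd

# Toy data: the test set lacks category "Q" and contains an unseen category "X".
train = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})
test = pd.DataFrame({"Embarked": ["S", "C", "X"]})

train_dummies = pd.get_dummies(train)
test_dummies = pd.get_dummies(test)

# The two frames have different columns because each call sees only its own data.
print(list(train_dummies.columns))  # ['Embarked_C', 'Embarked_Q', 'Embarked_S']
print(list(test_dummies.columns))   # ['Embarked_C', 'Embarked_S', 'Embarked_X']

# Workaround: force the test frame onto the training columns.
# Columns missing from test are filled with 0; columns unseen in training are dropped.
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(list(test_aligned.columns))   # ['Embarked_C', 'Embarked_Q', 'Embarked_S']
```

Note that the unseen category "X" ends up as a row of zeros after alignment, so it becomes indistinguishable from the dropped reference level if `drop_first = True` is used.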


In [11]:
# missing values (NaN) are ignored: no dummy column is created for them
pandas_onehot = pd.get_dummies(data = X_train, drop_first = True)
pandas_onehot.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 623 entries, 410 to 528
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   Sex_male    623 non-null    uint8
 1   Embarked_Q  623 non-null    uint8
 2   Embarked_S  623 non-null    uint8
 3   Cabin_B     623 non-null    uint8
 4   Cabin_C     623 non-null    uint8
 5   Cabin_D     623 non-null    uint8
 6   Cabin_E     623 non-null    uint8
 7   Cabin_F     623 non-null    uint8
 8   Cabin_G     623 non-null    uint8
 9   Cabin_T     623 non-null    uint8
dtypes: uint8(10)
memory usage: 11.0 KB


In [12]:
# missing values get their own dummy column (one per encoded variable)
# similar to adding a missing-value indicator to the dummy encoding
pandas_onehot = pd.get_dummies(data = X_train, drop_first = True, dummy_na = True)
pandas_onehot.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 623 entries, 410 to 528
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   Sex_male      623 non-null    uint8
 1   Sex_nan       623 non-null    uint8
 2   Embarked_Q    623 non-null    uint8
 3   Embarked_S    623 non-null    uint8
 4   Embarked_nan  623 non-null    uint8
 5   Cabin_B       623 non-null    uint8
 6   Cabin_C       623 non-null    uint8
 7   Cabin_D       623 non-null    uint8
 8   Cabin_E       623 non-null    uint8
 9   Cabin_F       623 non-null    uint8
 10  Cabin_G       623 non-null    uint8
 11  Cabin_T       623 non-null    uint8
 12  Cabin_nan     623 non-null    uint8
dtypes: uint8(13)
memory usage: 12.8 KB
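
The effect of `dummy_na` is easier to see on a few rows. The sketch below uses a made-up frame (`toy` is hypothetical) and compares the two calls without `drop_first`, so that every category stays visible.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing value.
toy = pd.DataFrame({"Cabin": ["C", np.nan, "B", "C"]})

without_na = pd.get_dummies(toy)              # the NaN row gets zeros everywhere
with_na = pd.get_dummies(toy, dummy_na=True)  # the NaN row gets its own indicator

print(list(without_na.columns))  # ['Cabin_B', 'Cabin_C']
print(list(with_na.columns))     # ['Cabin_B', 'Cabin_C', 'Cabin_nan']

# Row 1 is the missing value: all zeros in one frame, flagged in the other.
print(int(without_na.loc[1].sum()))       # 0
print(int(with_na.loc[1, "Cabin_nan"]))   # 1
```

As the `Sex_nan` column in the output above shows, the indicator is added for every encoded variable, including those without any missing values.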


### OneHot Encoding using Sklearn

Advantages

    quick
    Creates the same number of features in train and test set

Limitations

    it returns a NumPy array instead of a pandas dataframe
    the array carries no variable names (they have to be fetched separately from the encoder), which is inconvenient for variable exploration


In [13]:
from sklearn.preprocessing import OneHotEncoder

In [14]:
# note: from scikit-learn 1.2 onwards the `sparse` argument is named `sparse_output`
sklearn_onehot = OneHotEncoder(categories = 'auto',
                               drop = 'first',
                               handle_unknown= 'error',
                               sparse = False)

In [15]:
X_train.isnull().sum()

Sex           0
Embarked      1
Cabin       485
dtype: int64

In [16]:
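# only Cabin is imputed here; the single missing Embarked value stays NaN
# and is encoded as its own category (x1_nan in the feature names below)
X_train['Cabin'] = X_train['Cabin'].fillna(X_train['Cabin'].mode()[0])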

In [17]:
sklearn_onehot.fit(X_train)

OneHotEncoder(drop='first', sparse=False)

In [18]:
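# note: get_feature_names() was deprecated in scikit-learn 1.0 in favour of get_feature_names_out()
sklearn_onehot.get_feature_names().tolist()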



['x0_male',
 'x1_Q',
 'x1_S',
 'x1_nan',
 'x2_B',
 'x2_C',
 'x2_D',
 'x2_E',
 'x2_F',
 'x2_G',
 'x2_T']

In [19]:
sklearn_onehot.transform(X_train)

array([[1., 0., 1., ..., 0., 0., 0.],
       [1., 0., 1., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       ...,
       [1., 0., 1., ..., 0., 0., 0.],
       [1., 0., 1., ..., 0., 0., 0.],
       [1., 0., 1., ..., 0., 0., 0.]])

In [20]:
# note: the original row index is not carried over; the rows are renumbered from 0
X_train = pd.DataFrame(data = sklearn_onehot.transform(X_train), columns = sklearn_onehot.get_feature_names())



In [21]:
X_train

Unnamed: 0,x0_male,x1_Q,x1_S,x1_nan,x2_B,x2_C,x2_D,x2_E,x2_F,x2_G,x2_T
0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
1,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
3,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
4,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...
618,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
619,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
620,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
621,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0


### OneHot Encoding using Feature Engine

Advantages

    quick
    returns dataframe
    returns feature names
    allows selecting which features to encode

Limitations

    Not sure yet.


In [22]:
X_train,X_test,y_train,y_test = train_test_split(X,y,train_size = 0.7, random_state= 100, stratify = y)

In [23]:
!pip install feature_engine

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting feature_engine
  Downloading feature_engine-1.5.2-py2.py3-none-any.whl (290 kB)
Installing collected packages: feature-engine
Successfully installed feature-engine-1.5.2


In [24]:
# note: this import shadows the sklearn OneHotEncoder imported earlier
from feature_engine.encoding import OneHotEncoder

In [25]:
X_train.isnull().sum()

Sex           0
Embarked      1
Cabin       485
dtype: int64

In [26]:
for col in X_train.columns:
  # learn the most frequent category from the train set only, then apply it to both sets
  mode = X_train[col].mode()[0]
  X_train[col] = X_train[col].fillna(mode)
  X_test[col] = X_test[col].fillna(mode)
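
The imputation step above follows the usual rule of learning a statistic from the train set and reusing it on the test set. A compact sketch on made-up data (the frames and their values are hypothetical), where the test set's own mode differs from the training mode:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames: the training mode is "C", the test mode would be "B".
train = pd.DataFrame({"Cabin": ["C", "C", "B", np.nan]})
test = pd.DataFrame({"Cabin": [np.nan, "B", "B", "B"]})

# One fill value per column, learned from the training data only.
fill_values = train.mode().iloc[0]

train_filled = train.fillna(fill_values)
test_filled = test.fillna(fill_values)

print(fill_values["Cabin"])         # C
print(test_filled.loc[0, "Cabin"])  # C, the training mode rather than the test mode
```

Using the training mode on both sets keeps information from the test set out of the preprocessing.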

In [27]:
featureengine_onehot = OneHotEncoder(drop_last = True,
                                     variables = ['Sex','Embarked','Cabin'])

In [28]:
featureengine_onehot.fit(X_train)

OneHotEncoder(drop_last=True, variables=['Sex', 'Embarked', 'Cabin'])

In [29]:
X_train = featureengine_onehot.transform(X_train)
X_test = featureengine_onehot.transform(X_test)

In [30]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 623 entries, 410 to 528
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   Sex_male    623 non-null    int64
 1   Embarked_S  623 non-null    int64
 2   Embarked_C  623 non-null    int64
 3   Cabin_C     623 non-null    int64
 4   Cabin_F     623 non-null    int64
 5   Cabin_B     623 non-null    int64
 6   Cabin_A     623 non-null    int64
 7   Cabin_D     623 non-null    int64
 8   Cabin_E     623 non-null    int64
 9   Cabin_T     623 non-null    int64
dtypes: int64(10)
memory usage: 53.5 KB


In [31]:
X_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 268 entries, 338 to 200
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   Sex_male    268 non-null    int64
 1   Embarked_S  268 non-null    int64
 2   Embarked_C  268 non-null    int64
 3   Cabin_C     268 non-null    int64
 4   Cabin_F     268 non-null    int64
 5   Cabin_B     268 non-null    int64
 6   Cabin_A     268 non-null    int64
 7   Cabin_D     268 non-null    int64
 8   Cabin_E     268 non-null    int64
 9   Cabin_T     268 non-null    int64
dtypes: int64(10)
memory usage: 23.0 KB


### TopN Categories OneHotEncoding

In [33]:
path = '/content/drive/MyDrive/Udemy Courses - 2.0/Feature Engineering/Dataset/houseprice/houseprice.csv'
data = pd.read_csv(path)
data.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


In [34]:
sel_cols = ['Neighborhood', 'Exterior1st', 'Exterior2nd', 'SalePrice']

In [36]:
df = data[sel_cols]

In [37]:
X = df.drop('SalePrice', axis = 1)
y = df['SalePrice']

In [39]:
X_train,X_test,y_train,y_test = train_test_split(X,y,train_size = 0.7, random_state= 100)

In [42]:
X_train.nunique()

Neighborhood    25
Exterior1st     15
Exterior2nd     15
dtype: int64

In [44]:
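# create dummies only for the 10 most frequent categories of each variable;
# rows with any other category get zeros in all of that variable's dummies
onehot_topcat = OneHotEncoder(top_categories = 10, drop_last = False)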

In [45]:
onehot_topcat.fit(X_train)

OneHotEncoder(top_categories=10)

In [47]:
onehot_topcat.encoder_dict_

{'Neighborhood': ['NAmes',
  'CollgCr',
  'OldTown',
  'Edwards',
  'Somerst',
  'NridgHt',
  'Gilbert',
  'Sawyer',
  'NWAmes',
  'SawyerW'],
 'Exterior1st': ['VinylSd',
  'MetalSd',
  'HdBoard',
  'Wd Sdng',
  'Plywood',
  'CemntBd',
  'BrkFace',
  'Stucco',
  'WdShing',
  'AsbShng'],
 'Exterior2nd': ['VinylSd',
  'MetalSd',
  'HdBoard',
  'Wd Sdng',
  'Plywood',
  'CmentBd',
  'Wd Shng',
  'Stucco',
  'BrkFace',
  'AsbShng']}
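
The dictionary above holds the most frequent categories learned from the train set. The idea can be sketched in plain pandas as follows; this is an illustration of the technique under simple assumptions, not Feature Engine's actual implementation, and the helper names `fit_top_categories` and `transform_top_categories` are hypothetical.

```python
import pandas as pd


def fit_top_categories(X, top_n):
    """Return the top_n most frequent categories of each column, learned from X."""
    return {col: X[col].value_counts().head(top_n).index.tolist() for col in X.columns}


def transform_top_categories(X, top_categories):
    """Create one 0/1 column per learned category; other categories get all zeros."""
    out = pd.DataFrame(index=X.index)
    for col, categories in top_categories.items():
        for cat in categories:
            out[f"{col}_{cat}"] = (X[col] == cat).astype(int)
    return out


# Hypothetical toy data with clearly different category frequencies.
train = pd.DataFrame({"Neighborhood": ["NAmes", "NAmes", "NAmes",
                                       "CollgCr", "CollgCr", "OldTown"]})
test = pd.DataFrame({"Neighborhood": ["OldTown", "NAmes", "Veenker"]})

top = fit_top_categories(train, top_n=2)
print(top)  # {'Neighborhood': ['NAmes', 'CollgCr']}

test_encoded = transform_top_categories(test, top)
print(test_encoded)  # only the 'NAmes' row has a 1; 'OldTown' and 'Veenker' are all zeros
```

Because the category list is fixed at fit time, the train and test sets always get the same columns, and rare or unseen categories simply map to a row of zeros.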

In [48]:
onehot_topcat.variables_

['Neighborhood', 'Exterior1st', 'Exterior2nd']

In [51]:
X_train = onehot_topcat.transform(X_train)
X_test = onehot_topcat.transform(X_test)

In [53]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1021 entries, 318 to 792
Data columns (total 30 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   Neighborhood_NAmes    1021 non-null   int64
 1   Neighborhood_CollgCr  1021 non-null   int64
 2   Neighborhood_OldTown  1021 non-null   int64
 3   Neighborhood_Edwards  1021 non-null   int64
 4   Neighborhood_Somerst  1021 non-null   int64
 5   Neighborhood_NridgHt  1021 non-null   int64
 6   Neighborhood_Gilbert  1021 non-null   int64
 7   Neighborhood_Sawyer   1021 non-null   int64
 8   Neighborhood_NWAmes   1021 non-null   int64
 9   Neighborhood_SawyerW  1021 non-null   int64
 10  Exterior1st_VinylSd   1021 non-null   int64
 11  Exterior1st_MetalSd   1021 non-null   int64
 12  Exterior1st_HdBoard   1021 non-null   int64
 13  Exterior1st_Wd Sdng   1021 non-null   int64
 14  Exterior1st_Plywood   1021 non-null   int64
 15  Exterior1st_CemntBd   1021 non-null   int64
 16  Exter