# Why you should NEVER use pandas' get_dummies() for creating dummy variables

We'll go over how pandas creates dummy variables for encoding categorical columns.

More specifically, we'll look at why pd.get_dummies() is a poor choice for creating dummy variables once the work has to fit into a machine learning workflow.

pd.get_dummies() is, in a way, static in its behavior. It cannot "learn" the categories present in the training data, so it cannot carry them over to the testing data (something sklearn's encoders handle well, by the way).

The only advantages of pd.get_dummies() are that its output is easy to interpret and that it returns a pandas DataFrame with clean column names.

In [1]:
import pandas as pd

In [2]:
flowers = pd.DataFrame({
    'color' : ['red', 'green', 'red', 'green', 'red', 'green', 'red', 'green', 'blue', 'blue'],
    'height': [4,9,4,8,4,7,4,7.5,20,19],
    'petals': [3,9,1,8,1,10,2,8,50,47],
    'days'  : [6,16,7,15,8,17,5,12,40,45]
})
flowers

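|   | color | height | petals | days |
|---|-------|--------|--------|------|
| 0 | red   | 4.0    | 3      | 6    |
| 1 | green | 9.0    | 9      | 16   |
| 2 | red   | 4.0    | 1      | 7    |
| 3 | green | 8.0    | 8      | 15   |
| 4 | red   | 4.0    | 1      | 8    |
| 5 | green | 7.0    | 10     | 17   |
| 6 | red   | 4.0    | 2      | 5    |
| 7 | green | 7.5    | 8      | 12   |
| 8 | blue  | 20.0   | 50     | 40   |
| 9 | blue  | 19.0   | 47     | 45   |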


In [3]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(flowers.drop('days', axis=1), flowers['days'],
                                                    test_size=0.2, random_state=40
)

In [4]:
X_train

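|   | color | height | petals |
|---|-------|--------|--------|
| 8 | blue  | 20.0   | 50     |
| 1 | green | 9.0    | 9      |
| 2 | red   | 4.0    | 1      |
| 9 | blue  | 19.0   | 47     |
| 0 | red   | 4.0    | 3      |
| 5 | green | 7.0    | 10     |
| 7 | green | 7.5    | 8      |
| 6 | red   | 4.0    | 2      |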


In [5]:

X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8 entries, 8 to 6
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   color   8 non-null      object 
 1   height  8 non-null      float64
 2   petals  8 non-null      int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 256.0+ bytes


In [6]:

X_train['color'].unique()

array(['blue', 'green', 'red'], dtype=object)

In [7]:
# Creating Dummy Variables For Train Set
pd.get_dummies(X_train).join(X_train['color'])

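|   | height | petals | color_blue | color_green | color_red | color |
|---|--------|--------|------------|-------------|-----------|-------|
| 8 | 20.0   | 50     | 1          | 0           | 0         | blue  |
| 1 | 9.0    | 9      | 0          | 1           | 0         | green |
| 2 | 4.0    | 1      | 0          | 0           | 1         | red   |
| 9 | 19.0   | 47     | 1          | 0           | 0         | blue  |
| 0 | 4.0    | 3      | 0          | 0           | 1         | red   |
| 5 | 7.0    | 10     | 0          | 1           | 0         | green |
| 7 | 7.5    | 8      | 0          | 1           | 0         | green |
| 6 | 4.0    | 2      | 0          | 0           | 1         | red   |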


## Note
Because pd.get_dummies() cannot "learn" the categories from the training data, it encodes the testing data using only the values that happen to appear there. Notice that the test set below contains no blue flowers.

In [8]:
X_test

|   | color | height | petals |
|---|-------|--------|--------|
| 4 | red   | 4.0    | 1      |
| 3 | green | 8.0    | 8      |


In [9]:
pd.get_dummies(X_test)

|   | height | petals | color_green | color_red |
|---|--------|--------|-------------|-----------|
| 4 | 4.0    | 1      | 0           | 1         |
| 3 | 8.0    | 8      | 1           | 0         |


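As you can see, the number of columns in the training set is not equal to the number in the testing set: the test set contains no blue flowers, so color_blue is never created.
This mismatch is a problem for a machine learning model, which expects the same columns at prediction time that it saw during training.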
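
To make the mismatch concrete, here is a minimal sketch that compares the dummy columns produced for two small frames. The frames `train` and `test` are cut-down stand-ins for the split above (an assumption made to keep the snippet self-contained), and the `reindex` call at the end is only a manual patch, not something the notebook itself uses.

```python
import pandas as pd

# Cut-down stand-ins for X_train and X_test (hypothetical, for illustration only).
train = pd.DataFrame({'color': ['blue', 'green', 'red'], 'petals': [50, 9, 1]})
test = pd.DataFrame({'color': ['red', 'green'], 'petals': [1, 8]})

# Each call only sees the categories present in the frame it is given.
train_dummies = pd.get_dummies(train)
test_dummies = pd.get_dummies(test)

# Columns that exist after encoding the training data but not the testing data.
missing = [c for c in train_dummies.columns if c not in test_dummies.columns]
print(missing)  # ['color_blue']

# A manual patch: force the test frame onto the training columns,
# filling any column that was never created with zeros.
patched = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
```

The patch works, but it has to be remembered and repeated by hand every time new data arrives, which is exactly the bookkeeping a fitted encoder takes care of.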

It is better to use OneHotEncoder from sklearn.preprocessing, which learns the categories from the training set and applies the same encoding to the testing set.
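
As a rough sketch of what that could look like, the snippet below fits scikit-learn's `OneHotEncoder` on the training colors and reuses it on the test colors. The two frames repeat the values shown in the outputs above so the snippet runs on its own; the helper `encode` and the choice of `handle_unknown='ignore'` are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Same values as the X_train / X_test outputs shown earlier.
X_train = pd.DataFrame({
    'color': ['blue', 'green', 'red', 'blue', 'red', 'green', 'green', 'red'],
    'height': [20.0, 9.0, 4.0, 19.0, 4.0, 7.0, 7.5, 4.0],
    'petals': [50, 9, 1, 47, 3, 10, 8, 2],
}, index=[8, 1, 2, 9, 0, 5, 7, 6])
X_test = pd.DataFrame({
    'color': ['red', 'green'],
    'height': [4.0, 8.0],
    'petals': [1, 8],
}, index=[4, 3])

# Fit on the training data only, so the encoder "learns" blue, green and red.
# handle_unknown='ignore' (an assumed choice) encodes unseen categories as all zeros.
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(X_train[['color']])

def encode(df):
    """Hypothetical helper: swap 'color' for the dummy columns learned from training."""
    dense = encoder.transform(df[['color']]).toarray()
    names = ['color_' + c for c in encoder.categories_[0]]
    dummies = pd.DataFrame(dense, columns=names, index=df.index)
    return df.drop('color', axis=1).join(dummies)

train_encoded = encode(X_train)
test_encoded = encode(X_test)

# Both frames now share the same columns, including color_blue for the test set.
print(list(test_encoded.columns))
```

Because the categories are stored on the fitted encoder, the test set gets a color_blue column filled with zeros even though it contains no blue flowers.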

### I hope you enjoyed it and see you in the next one! Please Upvote and Share