## Pandas get_dummies() method

 - In Pandas, the get_dummies() function converts categorical variables into dummy/indicator variables.

 - This is known as one-hot encoding.

 - The function returns a DataFrame in which each unique category in the original data becomes a separate column, with values of True (category present) or False (category absent).

## Encoding a Pandas DataFrame

- A DataFrame is a 2-dimensional labeled table.

In [5]:
import pandas as pd

data = {
    'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red'],
    'Size': ['Small', 'Large', 'Medium', 'Small', 'Large']
}

df = pd.DataFrame(data)
print(f"Original DataFrame\n\n{df}\n\n")


#Perform one-hot encoding 
df_encoded = pd.get_dummies(df)
print(f"DataFrame after performing one-hot encoding\n\n{df_encoded}\n")

Original DataFrame

   Color    Size
0    Red   Small
1   Blue   Large
2  Green  Medium
3   Blue   Small
4    Red   Large


DataFrame after performing one-hot encoding

   Color_Blue  Color_Green  Color_Red  Size_Large  Size_Medium  Size_Small
0       False        False       True       False        False        True
1        True        False      False        True        False       False
2       False         True      False       False         True       False
3        True        False      False       False        False        True
4       False        False       True        True        False       False



In [7]:
#To get the output as 0 and 1
df_encoded = pd.get_dummies(df, dtype = int)
print(f"DataFrame after performing one-hot encoding\n\n{df_encoded}\n")

DataFrame after performing one-hot encoding

   Color_Blue  Color_Green  Color_Red  Size_Large  Size_Medium  Size_Small
0           0            0          1           0            0           1
1           1            0          0           1            0           0
2           0            1          0           0            1           0
3           1            0          0           0            0           1
4           0            0          1           1            0           0
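
The column labels above are built as `<column>_<category>`. As a small sketch that goes beyond the cells above, the `prefix` and `prefix_sep` parameters of `get_dummies()` can change how those labels are formed. The prefixes `'C'` and `'S'` and the `'='` separator are arbitrary choices made for illustration only.

```python
import pandas as pd

df = pd.DataFrame({
    'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red'],
    'Size': ['Small', 'Large', 'Medium', 'Small', 'Large'],
})

# prefix replaces the source column name in each new label (one prefix per
# encoded column), and prefix_sep replaces the default "_" separator.
# 'C', 'S' and '=' are illustrative values, not anything the data requires.
df_prefixed = pd.get_dummies(df, prefix=['C', 'S'], prefix_sep='=', dtype=int)
print(list(df_prefixed.columns))
```

The encoded values are the same as before; only the column labels differ.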



## Encoding a Pandas Series
- A Series is a 1-dimensional labeled array.

In [10]:
days = pd.Series(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Monday'])

print(f"Original Series\n\n{days}\n\n")

#Performing one-hot encoding 
series_encoded = pd.get_dummies(days, dtype ='int')
print(f"Series after performing one-hot encoding\n\n{series_encoded}\n\n")

Original Series

0       Monday
1      Tuesday
2    Wednesday
3     Thursday
4       Friday
5       Monday
dtype: object


Series after performing one-hot encoding

   Friday  Monday  Thursday  Tuesday  Wednesday
0       0       1         0        0          0
1       0       0         0        1          0
2       0       0         0        0          1
3       0       0         1        0          0
4       1       0         0        0          0
5       0       1         0        0          0
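
One way to check the encoded Series is to confirm that every row contains exactly one 1, and that the label of that column matches the original value. The sketch below does this with `sum(axis=1)` and `idxmax(axis=1)`; it is an illustrative check, not part of the `get_dummies()` API itself.

```python
import pandas as pd

days = pd.Series(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Monday'])
series_encoded = pd.get_dummies(days, dtype=int)

# Every row has exactly one 1, in the column of that row's category.
row_sums = series_encoded.sum(axis=1)
print(row_sums.tolist())

# idxmax(axis=1) returns the label of the column holding the 1,
# which recovers the original value of each row.
decoded = series_encoded.idxmax(axis=1)
print(decoded.tolist())
```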




## Converting NaN Values into a Dummy Variable

- The "dummy_na = True" option can be used when dealing with missing values.

- It creates a separate column indicating whether the value is missing or not.

  

In [13]:
import numpy as np

#List with colour categories and NaN
colours = ['Red', 'Blue', 'Green', np.nan, 'Red', 'Blue']
print(f"Original Series\n\n{colours}\n\n")

#performing one-hot encoding
colour_encoded = pd.get_dummies(colours, dummy_na = True, dtype= 'int')
print(f"After performing one-hot encoding\n\n{colour_encoded}\n\n")



Original Series

['Red', 'Blue', 'Green', nan, 'Red', 'Blue']


After performing one-hot encoding

   Blue  Green  Red  NaN
0     0      0    1    0
1     1      0    0    0
2     0      1    0    0
3     0      0    0    1
4     0      0    1    0
5     1      0    0    0
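
To see what `dummy_na` changes, it can help to encode the same data with and without it. In this sketch (which wraps the list in a Series for convenience, an assumption not made in the cell above), the row holding the missing value is all zeros by default and is flagged in an extra NaN column when `dummy_na = True`.

```python
import numpy as np
import pandas as pd

colours = pd.Series(['Red', 'Blue', 'Green', np.nan, 'Red', 'Blue'])

without_na = pd.get_dummies(colours, dtype=int)
with_na = pd.get_dummies(colours, dummy_na=True, dtype=int)

# Without dummy_na, the missing entry (row 3) is encoded as all zeros.
print(without_na.loc[3].tolist())

# With dummy_na=True, an extra NaN column flags the missing entry.
print(with_na.loc[3].tolist())
```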




### One Hot Encoding Using Pandas

In [17]:
#More Examples 

data = {
    'Employee id': [10, 20, 15, 25, 30],
    'Gender': ['M', 'F', 'F', 'M', 'F'],
    'Remarks': ['Good', 'Nice', 'Good', 'Great', 'Nice']
}

df = pd.DataFrame(data)
print(f"Original DataFrame\n\n{df}\n\n")

#Performing one-hot encoding 
# drop_first = True drops the first category of each encoded column, which is redundant
# eg: drops Gender_F and keeps only Gender_M to avoid multicollinearity
df_encoded = pd.get_dummies(df, columns = ['Gender', 'Remarks'], drop_first = True)
print(f"After one-hot encoding \n\n{df_encoded}\n\n")

Original DataFrame

   Employee id Gender Remarks
0           10      M    Good
1           20      F    Nice
2           15      F    Good
3           25      M   Great
4           30      F    Nice


After one-hot encoding 

   Employee id  Gender_M  Remarks_Great  Remarks_Nice
0           10      True          False         False
1           20     False          False          True
2           15     False          False         False
3           25      True           True         False
4           30     False          False          True
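
The reason the dropped column is redundant can be checked directly. The sketch below uses only the Gender column from the example above: with every category kept, the dummy columns of each row sum to 1, so the column removed by `drop_first = True` can be rebuilt from the one that remains.

```python
import pandas as pd

df = pd.DataFrame({'Gender': ['M', 'F', 'F', 'M', 'F']})

full = pd.get_dummies(df, dtype=int)                      # Gender_F and Gender_M
reduced = pd.get_dummies(df, drop_first=True, dtype=int)  # Gender_M only

# The two full columns always sum to 1, so either one determines the other.
row_totals = full['Gender_F'] + full['Gender_M']
print(row_totals.tolist())

# The dropped column can be rebuilt from the column that was kept.
rebuilt_f = 1 - reduced['Gender_M']
print(rebuilt_f.tolist() == full['Gender_F'].tolist())
```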




### One Hot Encoding Using Scikit Learn Library

In [63]:
from sklearn.preprocessing import OneHotEncoder

data = {'Employee id': [10, 20, 15, 25, 30],
        'Gender': ['M', 'F', 'F', 'M', 'F'],
        'Remarks': ['Good', 'Nice', 'Good', 'Great', 'Nice'],
        }
df = pd.DataFrame(data)
print(f"Original DataFrame \n\n{df}\n\n")

#Select the categorical (object dtype) columns of the DataFrame and put their names into a list
categorical_columns = df.select_dtypes(include =['object']).columns.tolist()
print(f"\n\n {categorical_columns}\n\n")


encoder = OneHotEncoder(sparse_output =False)

#One hot encoding in a Matrix form
one_hot_encoded = encoder.fit_transform(df[categorical_columns])
print(f"One hot encoding Matrix \n\n{one_hot_encoded}\n\n")

#One hot encoding in a DataFrame
one_hot_df = pd.DataFrame(one_hot_encoded, columns = encoder.get_feature_names_out(categorical_columns))
print(f"After one-hot encoding \n\n{one_hot_df}\n\n")


#Original DataFrame + One-hot encoded DataFrame
df_encoded = pd.concat([df, one_hot_df], axis =1)
print(f"Both Original DataFrame and the DataFrame after one-hot encoding \n\n{df_encoded} \n\n")


df_encoded = df_encoded.drop(categorical_columns, axis = 1)
print(f"Encoded Employee data : \n\n{df_encoded}\n\n")

                                                                                   

Original DataFrame 

   Employee id Gender Remarks
0           10      M    Good
1           20      F    Nice
2           15      F    Good
3           25      M   Great
4           30      F    Nice




 ['Gender', 'Remarks']


One hot encoding Matrix 

[[0. 1. 1. 0. 0.]
 [1. 0. 0. 0. 1.]
 [1. 0. 1. 0. 0.]
 [0. 1. 0. 1. 0.]
 [1. 0. 0. 0. 1.]]


After one-hot encoding 

   Gender_F  Gender_M  Remarks_Good  Remarks_Great  Remarks_Nice
0       0.0       1.0           1.0            0.0           0.0
1       1.0       0.0           0.0            0.0           1.0
2       1.0       0.0           1.0            0.0           0.0
3       0.0       1.0           0.0            1.0           0.0
4       1.0       0.0           0.0            0.0           1.0


Both Original DataFrame and the DataFrame after one-hot encoding 

   Employee id Gender Remarks  Gender_F  Gender_M  Remarks_Good  \
0           10      M    Good       0.0       1.0           1.0   
1           20      F    Nice    