## Data Encoding

1. Nominal/OHE Encoding
2. Label and Ordinal Encoding
3. Target Guided Ordinal Encoding 

### Nominal/OHE Encoding
One-hot encoding (OHE), a form of nominal encoding, represents categorical data numerically so that machine learning algorithms can work with it. Each category becomes a binary vector with one position per unique category: the position belonging to that category is 1 and every other position is 0. For example, a categorical variable "color" with three possible values (red, green, blue) can be one-hot encoded as follows:

1. Red: [1, 0, 0]
2. Green: [0, 1, 0]
3. Blue: [0, 0, 1]
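
The mapping above can be reproduced by hand before reaching for a library. The following is a minimal sketch in plain Python; the helper name `one_hot` and the fixed category order are assumptions made for illustration, not part of any library.

```python
def one_hot(value, categories):
    """Return a binary list with a single 1 at the position of `value`."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == category else 0 for category in categories]


# The category order is a choice; here it follows the list in the text.
categories = ["red", "green", "blue"]

vectors = {color: one_hot(color, categories) for color in categories}
print(vectors)
# {'red': [1, 0, 0], 'green': [0, 1, 0], 'blue': [0, 0, 1]}
```

Note that scikit-learn's `OneHotEncoder` sorts the categories it finds by default, so its output columns for this variable appear in the order blue, green, red rather than the order listed here.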

In [5]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

In [6]:
## Create a simple dataframe 
df = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'green', 'red', 'blue']
})

In [7]:
df.head()

| | color |
|---|---|
| 0 | red |
| 1 | blue |
| 2 | green |
| 3 | green |
| 4 | red |


In [8]:
## create an instance of OneHotEncoder
encoder = OneHotEncoder()

In [9]:
## perform fit and transform
encoded=encoder.fit_transform(df[['color']]).toarray()

In [10]:
## wrap the encoded array in a DataFrame with readable column names
encoder_df = pd.DataFrame(encoded, columns=encoder.get_feature_names_out())

In [11]:
encoder_df

| | color_blue | color_green | color_red |
|---|---|---|---|
| 0 | 0.0 | 0.0 | 1.0 |
| 1 | 1.0 | 0.0 | 0.0 |
| 2 | 0.0 | 1.0 | 0.0 |
| 3 | 0.0 | 1.0 | 0.0 |
| 4 | 0.0 | 0.0 | 1.0 |
| 5 | 1.0 | 0.0 | 0.0 |


In [12]:
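## for new data: pass a DataFrame with the same column name used in fit,
## which avoids scikit-learn's feature-name warning
encoder.transform(pd.DataFrame({'color': ['blue']})).toarray()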



array([[1., 0., 0.]])

In [13]:
pd.concat([df,encoder_df],axis=1)

| | color | color_blue | color_green | color_red |
|---|---|---|---|---|
| 0 | red | 0.0 | 0.0 | 1.0 |
| 1 | blue | 1.0 | 0.0 | 0.0 |
| 2 | green | 0.0 | 1.0 | 0.0 |
| 3 | green | 0.0 | 1.0 | 0.0 |
| 4 | red | 0.0 | 0.0 | 1.0 |
| 5 | blue | 1.0 | 0.0 | 0.0 |


In [14]:
import seaborn as sns
data = sns.load_dataset('tips')

In [18]:

## refitting on 'sex' replaces the 'color' categories learned earlier
encoded_data = encoder.fit_transform(data[['sex']]).toarray()
## preview the first five rows instead of printing all 244
print(encoded_data[:5])

[[1. 0.]
 [0. 1.]
 [0. 1.]
 [0. 1.]
 [1. 0.]]

In [20]:
encoded_df = pd.DataFrame(encoded_data, columns=encoder.get_feature_names_out(['sex']))
print(encoded_df)

     sex_Female  sex_Male
0           1.0       0.0
1           0.0       1.0
2           0.0       1.0
3           0.0       1.0
4           1.0       0.0
..          ...       ...
239         0.0       1.0
240         1.0       0.0
241         0.0       1.0
242         0.0       1.0
243         1.0       0.0

[244 rows x 2 columns]


In [21]:
final_df = pd.concat([data, encoded_df], axis=1)
print(final_df.head())

   total_bill   tip     sex smoker  day    time  size  sex_Female  sex_Male
0       16.99  1.01  Female     No  Sun  Dinner     2         1.0       0.0
1       10.34  1.66    Male     No  Sun  Dinner     3         0.0       1.0
2       21.01  3.50    Male     No  Sun  Dinner     3         0.0       1.0
3       23.68  3.31    Male     No  Sun  Dinner     2         0.0       1.0
4       24.59  3.61  Female     No  Sun  Dinner     4         1.0       0.0
