## Data Encoding
- Nominal encoding / one-hot encoding (OHE)
- Label and ordinal encoding
- Target-guided ordinal encoding



Data encoding is a crucial preprocessing step in data science and machine learning. It converts data from one format to another, most commonly categorical data into a numerical format. This conversion is essential because most machine learning algorithms are built on mathematical operations and require numerical inputs to function effectively.


**Types of Categorical Data**
* Nominal Data: Categories that do not have any inherent order or ranking. Examples include colors ('Red', 'Blue', 'Green') or city names.
* Ordinal Data: Categories that have a meaningful, intrinsic order or ranking. Examples include education levels ('High School', 'Bachelor's', 'Master's') or customer satisfaction ratings ('Poor', 'Average', 'Good').
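
The distinction between the two types can be made explicit in pandas. The following is a minimal sketch, assuming `pd.Categorical` is used to hold the values; the variable names `colors` and `ratings` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Nominal: the categories carry no order
colors = pd.Categorical(['Red', 'Blue', 'Green', 'Blue'], ordered=False)

# Ordinal: the order is declared explicitly, from lowest to highest
ratings = pd.Categorical(
    ['Good', 'Poor', 'Average', 'Good'],
    categories=['Poor', 'Average', 'Good'],
    ordered=True,
)

print(colors.ordered)                # False
print(ratings.ordered)               # True
print(ratings.codes.tolist())        # [2, 0, 1, 2], positions in the declared order
print(ratings.min(), ratings.max())  # Poor Good
```

Comparisons such as `min()` and `max()` are only meaningful for the ordered variant, which is why nominal and ordinal features call for different encodings.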

1. One-Hot Encoding
* What it does: Creates a new binary column for each unique category. For each observation, it places a '1' in the column corresponding to its category and '0's in all other new columns.
* Example: For the "Color" feature, it would create three new columns: 'Color_Red', 'Color_Green', and 'Color_Blue'. Each observation is then represented as:
    * Red:   [1 0 0]
    * Green: [0 1 0]
    * Blue:  [0 0 1]
* When to use: This is the most common and safest choice for nominal data, especially with linear models, logistic regression, and neural networks.
* Pros: Avoids introducing false ordinal relationships between categories.
* Cons: Can lead to high dimensionality (the "curse of dimensionality") if the feature has a large number of unique categories, which can increase memory usage and model complexity.
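
The dimensionality cost is easy to see on a small synthetic example. This sketch uses `pd.get_dummies`, a pandas shortcut for one-hot encoding that the notebook itself does not use; the `city_0` ... `city_999` values are made up for illustration.

```python
import pandas as pd

# Hypothetical feature with 1,000 distinct categories
city = pd.Series([f"city_{i}" for i in range(1000)], name="city")

# One new column is created per unique category
dummies = pd.get_dummies(city)

print(city.nunique())   # 1000
print(dummies.shape)    # (1000, 1000)
```

A single column has turned into 1,000 columns, and each row contains exactly one non-zero entry.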


sklearn.preprocessing.OneHotEncoder

In [1]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

In [2]:
# Create a DataFrame with a single categorical column
df = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'blue', 'red'],
})
df

,color
0,red
1,blue
2,green
3,blue
4,red


In [3]:
df['color'].value_counts()

color
red      2
blue     2
green    1
Name: count, dtype: int64

In [4]:
# Create an instance of OneHotEncoder
encoder = OneHotEncoder()


In [5]:
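# Fit the encoder and transform the column; the result is a sparse matrix by default
encoded_data = encoder.fit_transform(df[['color']])
# Convert the sparse matrix to a dense NumPy array
encoded_data = encoded_data.toarray()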

In [6]:
encoded_data

array([[0., 0., 1.],
       [1., 0., 0.],
       [0., 1., 0.],
       [1., 0., 0.],
       [0., 0., 1.]])
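
Note that 'red' appears here as `[0, 0, 1]`, not `[1, 0, 0]` as in the conceptual example earlier. The sketch below, which rebuilds the same small DataFrame, shows why: when no order is given, `OneHotEncoder` sorts the categories it finds, and the learned order is exposed in the `categories_` attribute.

```python
import pandas as pd
import scipy.sparse
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'color': ['red', 'blue', 'green', 'blue', 'red']})

encoder = OneHotEncoder()
encoded = encoder.fit_transform(df[['color']])

# The default output is a SciPy sparse matrix, which saves memory
print(scipy.sparse.issparse(encoded))      # True

# Categories are sorted, so the columns are blue, green, red
print(list(encoder.categories_[0]))        # ['blue', 'green', 'red']

# The first row is 'red', which maps to the last column
print(encoded.toarray()[0].tolist())       # [0.0, 0.0, 1.0]
```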

In [7]:
# Wrap the array in a DataFrame, using the encoder's generated column names
encoded_df = pd.DataFrame(encoded_data, columns=encoder.get_feature_names_out(['color']))

In [8]:
encoded_df

,color_blue,color_green,color_red
0,0.0,0.0,1.0
1,1.0,0.0,0.0
2,0.0,1.0,0.0
3,1.0,0.0,0.0
4,0.0,0.0,1.0


In [9]:
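# Encode new data; passing a DataFrame with the same column name avoids a feature-name warning
encoder.transform(pd.DataFrame({'color': ['red']})).toarray()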



array([[0., 0., 1.]])
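
Transforming new data works only for categories the encoder saw during fitting. The sketch below shows what can happen with an unseen value, assuming a hypothetical category 'purple': the default encoder raises an error, while `handle_unknown='ignore'` encodes the unseen value as a row of zeros.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({'color': ['red', 'blue', 'green']})
new = pd.DataFrame({'color': ['red', 'purple']})  # 'purple' was never seen in training

# Default behaviour (handle_unknown='error'): unseen categories raise ValueError
strict = OneHotEncoder()
strict.fit(train)
try:
    strict.transform(new)
    rejected = False
except ValueError as exc:
    rejected = True
    print("Unknown category rejected:", exc)

# With handle_unknown='ignore', an unseen category becomes an all-zero row
lenient = OneHotEncoder(handle_unknown='ignore')
lenient.fit(train)
result = lenient.transform(new).toarray()
print(result.tolist())  # [[0.0, 0.0, 1.0], [0.0, 0.0, 0.0]]
```

Which behaviour is preferable depends on the use case: failing loudly surfaces data problems early, while ignoring keeps a prediction pipeline running.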

In [10]:
import seaborn as sns
tips = sns.load_dataset('tips')
tips.head()

,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [11]:
tips['smoker'].value_counts()

smoker
No     151
Yes     93
Name: count, dtype: int64

In [12]:
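# Encode the 'sex' column; refitting the same encoder replaces the categories learned from 'color'
# The 'smoker' column could be encoded the same way:
# smoker_encoded = encoder.fit_transform(tips[['smoker']]).toarray()
sex_encoded = encoder.fit_transform(tips[['sex']]).toarray()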

In [13]:
sex_encoded_df = pd.DataFrame(sex_encoded, columns=encoder.get_feature_names_out())
sex_encoded_df

,sex_Female,sex_Male
0,1.0,0.0
1,0.0,1.0
2,0.0,1.0
3,0.0,1.0
4,1.0,0.0
...,...,...
239,0.0,1.0
240,1.0,0.0
241,0.0,1.0
242,0.0,1.0


In [14]:
# Append the encoded columns to the original DataFrame (rows are matched by index)
tips_encoded = pd.concat([tips, sex_encoded_df], axis=1)

In [15]:
tips_encoded

,total_bill,tip,sex,smoker,day,time,size,sex_Female,sex_Male
0,16.99,1.01,Female,No,Sun,Dinner,2,1.0,0.0
1,10.34,1.66,Male,No,Sun,Dinner,3,0.0,1.0
2,21.01,3.50,Male,No,Sun,Dinner,3,0.0,1.0
3,23.68,3.31,Male,No,Sun,Dinner,2,0.0,1.0
4,24.59,3.61,Female,No,Sun,Dinner,4,1.0,0.0
...,...,...,...,...,...,...,...,...,...
239,29.03,5.92,Male,No,Sat,Dinner,3,0.0,1.0
240,27.18,2.00,Female,Yes,Sat,Dinner,2,1.0,0.0
241,22.67,2.00,Male,Yes,Sat,Dinner,2,0.0,1.0
242,17.82,1.75,Male,No,Sat,Dinner,2,0.0,1.0


In [16]:
# List the days explicitly to control the column order; days absent from the data become all-zero columns
day_encoder = OneHotEncoder(categories=[['Sun', 'Mon', 'Tue', 'Wed', 'Thur', 'Fri', 'Sat']])
day_encoded = day_encoder.fit_transform(tips[['day']]).toarray()
day_encoded_df = pd.DataFrame(day_encoded, columns=day_encoder.get_feature_names_out())
day_encoded_df


,day_Sun,day_Mon,day_Tue,day_Wed,day_Thur,day_Fri,day_Sat
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1.0,0.0,0.0,0.0,0.0,0.0,0.0
4,1.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...
239,0.0,0.0,0.0,0.0,0.0,0.0,1.0
240,0.0,0.0,0.0,0.0,0.0,0.0,1.0
241,0.0,0.0,0.0,0.0,0.0,0.0,1.0
242,0.0,0.0,0.0,0.0,0.0,0.0,1.0


In [17]:
tips_encoded = pd.concat([tips_encoded, day_encoded_df], axis=1)

In [18]:
tips_encoded

,total_bill,tip,sex,smoker,day,time,size,sex_Female,sex_Male,day_Sun,day_Mon,day_Tue,day_Wed,day_Thur,day_Fri,day_Sat
0,16.99,1.01,Female,No,Sun,Dinner,2,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
1,10.34,1.66,Male,No,Sun,Dinner,3,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
2,21.01,3.50,Male,No,Sun,Dinner,3,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
3,23.68,3.31,Male,No,Sun,Dinner,2,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
4,24.59,3.61,Female,No,Sun,Dinner,4,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
239,29.03,5.92,Male,No,Sat,Dinner,3,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
240,27.18,2.00,Female,Yes,Sat,Dinner,2,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
241,22.67,2.00,Male,Yes,Sat,Dinner,2,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
242,17.82,1.75,Male,No,Sat,Dinner,2,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
