## Data Encoding

1. **Nominal / One-Hot Encoding (OHE)**  
   Used for categorical data with no order.  
   Creates a separate binary column for each category.

2. **Label and Ordinal Encoding**  
   - Label Encoding: assigns a unique number to each category.  
   - Ordinal Encoding: assigns numbers based on the natural order of categories.

3. **Target Guided Ordinal Encoding**  
   Categories are encoded based on the target variable (for example, mean of the target for each category).
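
A minimal sketch of target guided ordinal encoding with pandas. The `city` and `price` columns and their values are made up for illustration: compute the mean of the target per category, rank the categories by that mean, and map each category to its rank.

```python
import pandas as pd

# Hypothetical toy data: 'city' is the categorical feature, 'price' is the target.
df = pd.DataFrame({
    "city": ["Delhi", "Mumbai", "Pune", "Delhi", "Mumbai", "Pune"],
    "price": [200, 300, 100, 220, 320, 120],
})

# Mean of the target for each category.
mean_price = df.groupby("city")["price"].mean()

# Rank categories by their target mean (1 = lowest mean).
rank_by_mean = mean_price.rank(method="dense").astype(int)

# Replace each category with its rank.
df["city_encoded"] = df["city"].map(rank_by_mean)
print(df)
```

In practice the per-category means would usually be computed on the training split only, so that the encoding does not leak target information from the test data.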


### Nominal / One-Hot Encoding (OHE)
One-hot encoding, also known as nominal encoding, represents categorical data numerically so that machine learning algorithms can use it. Each category becomes a separate binary column: a row holds 1 in the column of its own category and 0 in every other column.

Example (color = red, green, blue):

- Red   → [1, 0, 0]  
- Green → [0, 1, 0]  
- Blue  → [0, 0, 1]

### Disadvantages of One-Hot Encoding (OHE)
- As the number of categories increases, the number of columns increases (high dimensionality).
- The encoded matrix is sparse (most values are 0). Stored as a dense array, it wastes memory and computation, which is why scikit-learn's `OneHotEncoder` returns a sparse matrix by default.
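
To make the sparsity point concrete, here is a rough sketch that one-hot encodes a synthetic column with 1,000 distinct categories (the data is made up for illustration) and compares the number of values stored in the sparse output with the number held by the equivalent dense array.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Synthetic column: 1,000 rows, each with its own category (illustrative only).
values = np.array([f"cat_{i}" for i in range(1000)]).reshape(-1, 1)

# By default OneHotEncoder returns a SciPy sparse matrix.
sparse_encoded = OneHotEncoder().fit_transform(values)
dense_encoded = sparse_encoded.toarray()

print(sparse_encoded.shape)  # one column per category
print(sparse_encoded.nnz)    # stored non-zero entries: one per row
print(dense_encoded.size)    # entries held by the dense array
```

The dense array holds a value for every row and column pair, while the sparse matrix stores only the single 1 in each row.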


In [1]:
import pandas as pd 
from sklearn.preprocessing import OneHotEncoder

In [2]:
#  create a simple dataframe
df = pd.DataFrame({
    'color' : ['red', 'blue', 'green', 'green', 'red', 'blue']
})

df.head()

|   | color |
|---|-------|
| 0 | red   |
| 1 | blue  |
| 2 | green |
| 3 | green |
| 4 | red   |


In [10]:
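# create an instance of OneHotEncoder
encoder = OneHotEncoder()

# fit and transform; the default output is a sparse matrix, so convert it to a dense array
# categories are sorted alphabetically, so the columns are blue, green, red
encoded = encoder.fit_transform(df[['color']]).toarray()

encoded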

array([[0., 0., 1.],
       [1., 0., 0.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [1., 0., 0.]])

In [8]:
encoder_df = pd.DataFrame(encoded, columns=encoder.get_feature_names_out())

encoder_df.head()

|   | color_blue | color_green | color_red |
|---|------------|-------------|-----------|
| 0 | 0.0 | 0.0 | 1.0 |
| 1 | 1.0 | 0.0 | 0.0 |
| 2 | 0.0 | 1.0 | 0.0 |
| 3 | 0.0 | 1.0 | 0.0 |
| 4 | 0.0 | 0.0 | 1.0 |


In [12]:
# encode new data, using the same column name as during fit

encoder.transform(pd.DataFrame({'color': ['blue']})).toarray()



array([[1., 0., 0.]])
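
The encoder above only knows the categories it saw during `fit`. Below is a minimal sketch of one way to handle an unseen value, assuming scikit-learn's `handle_unknown='ignore'` option suits the use case. The name `safe_encoder` and the value `'purple'` are made up for this example.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"color": ["red", "blue", "green"]})

# handle_unknown='ignore' encodes categories not seen during fit as all zeros.
safe_encoder = OneHotEncoder(handle_unknown="ignore")
safe_encoder.fit(train)

new_data = pd.DataFrame({"color": ["blue", "purple"]})  # 'purple' was never seen
encoded_new = safe_encoder.transform(new_data).toarray()
print(encoded_new)
```

With the default `handle_unknown='error'`, the same `transform` call would raise an error for `'purple'` instead.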

In [13]:
pd.concat([df, encoder_df], axis=1)

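|   | color | color_blue | color_green | color_red |
|---|-------|------------|-------------|-----------|
| 0 | red   | 0.0 | 0.0 | 1.0 |
| 1 | blue  | 1.0 | 0.0 | 0.0 |
| 2 | green | 0.0 | 1.0 | 0.0 |
| 3 | green | 0.0 | 1.0 | 0.0 |
| 4 | red   | 0.0 | 0.0 | 1.0 |
| 5 | blue  | 1.0 | 0.0 | 0.0 |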


In [23]:
import seaborn as sns

tips = sns.load_dataset('tips')
tips.head()

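|   | total_bill | tip | sex | smoker | day | time | size |
|---|------------|-----|-----|--------|-----|------|------|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |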


In [22]:
X = tips.drop("tip", axis=1)
y = tips['tip']

In [24]:
cat_cols = ["sex", "smoker", "day", "time"]
num_cols = ["total_bill", "size"]

In [25]:
from sklearn.preprocessing import OneHotEncoder

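# sparse_output=False returns a dense NumPy array directly
# (the parameter exists in scikit-learn 1.2+; older versions use sparse=False)
encoder = OneHotEncoder(sparse_output=False)

# encode all categorical columns at once
encoded = encoder.fit_transform(tips[cat_cols])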

In [32]:
import pandas as pd 

cat_df = pd.DataFrame(encoded, columns=encoder.get_feature_names_out(cat_cols))

cat_df.head()

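|   | sex_Female | sex_Male | smoker_No | smoker_Yes | day_Fri | day_Sat | day_Sun | day_Thur | time_Dinner | time_Lunch |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 1 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 2 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 3 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 4 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |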


In [33]:
X_final = pd.concat([tips[num_cols], cat_df], axis=1)
X_final.head()

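|   | total_bill | size | sex_Female | sex_Male | smoker_No | smoker_Yes | day_Fri | day_Sat | day_Sun | day_Thur | time_Dinner | time_Lunch |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 2 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 1 | 10.34 | 3 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 2 | 21.01 | 3 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 3 | 23.68 | 2 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 4 | 24.59 | 4 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |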


### Label Encoding

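Label encoding converts each category into a unique integer.

The integers are arbitrary identifiers (scikit-learn assigns them in sorted order), so label encoding suits **nominal data** only when the model will not read an order into the numbers. scikit-learn's `LabelEncoder` is intended for the target `y` rather than for input features.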

Example (color):

- Red   → 1  
- Green → 2  
- Blue  → 3


In [34]:
import pandas as pd

df = pd.DataFrame({
    'color' : ['red', 'blue', 'green', 'green', 'red', 'blue']
})

df.head()

|   | color |
|---|-------|
| 0 | red   |
| 1 | blue  |
| 2 | green |
| 3 | green |
| 4 | red   |


In [37]:
from sklearn.preprocessing import LabelEncoder

lbl_encoder = LabelEncoder()

# LabelEncoder expects a 1-D array, so pass the Series rather than a one-column DataFrame
lbl_encoder.fit_transform(df['color'])

array([2, 0, 1, 1, 2, 0])

In [42]:
lbl_encoder.transform(['green'])

array([1])

In [43]:
lbl_encoder.transform(['blue'])

array([0])

In [44]:
lbl_encoder.transform(['red'])

array([2])
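
To see which integer belongs to which category, and to map codes back to labels, `LabelEncoder` exposes the `classes_` attribute and the `inverse_transform` method. A short sketch on the same color values (the names `label_enc`, `codes`, and `mapping` are chosen for this example):

```python
from sklearn.preprocessing import LabelEncoder

colors = ["red", "blue", "green", "green", "red", "blue"]

label_enc = LabelEncoder()
codes = label_enc.fit_transform(colors)

# classes_ holds the categories in sorted order; the position is the code.
mapping = {label: int(code) for code, label in enumerate(label_enc.classes_)}
print(mapping)

# inverse_transform maps codes back to the original labels.
print(label_enc.inverse_transform(codes))
```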

### Problems with Label Encoding

- It introduces a false order between categories: a model may treat 3 > 2 > 1 as meaningful.
- It is therefore a poor fit for nominal features (e.g., color, city, gender) in models that use the numeric values directly, such as linear models.
- It can mislead distance-based models (like KNN, K-Means) because the numeric distances between codes are meaningless.
- The assigned numbers do not represent any real relationship between categories.
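
The distance problem can be checked with a few lines of arithmetic. In this sketch the codes follow the alphabetical order a label encoder would assign (chosen here for illustration): label codes make red look twice as far from blue as green is, while one-hot vectors keep every pair equally distant.

```python
import numpy as np

# Label codes: blue=0, green=1, red=2 (arbitrary integers).
label_codes = {"blue": 0, "green": 1, "red": 2}

# One-hot vectors for the same three categories.
one_hot = {
    "blue": np.array([1, 0, 0]),
    "green": np.array([0, 1, 0]),
    "red": np.array([0, 0, 1]),
}

pairs = [("blue", "green"), ("green", "red"), ("blue", "red")]

# Distances implied by label codes differ between pairs.
label_dist = {p: abs(label_codes[p[0]] - label_codes[p[1]]) for p in pairs}

# Euclidean distances between one-hot vectors are the same for every pair.
one_hot_dist = {
    p: float(np.linalg.norm(one_hot[p[0]] - one_hot[p[1]])) for p in pairs
}

print(label_dist)
print(one_hot_dist)
```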


### Ordinal Encoding

Ordinal encoding is used for categorical data that **has a natural order**.

Each category is assigned a number based on its rank.

Example (education level):

- High school   → 1  
- College       → 2  
- Graduate      → 3  
- Post-graduate → 4


In [45]:
# a sample dataframe with an ordinal variable
df = pd.DataFrame({
    'size' : ['small', 'medium', 'large', 'medium', 'small', 'large']
})

df

|   | size   |
|---|--------|
| 0 | small  |
| 1 | medium |
| 2 | large  |
| 3 | medium |
| 4 | small  |
| 5 | large  |


In [46]:
# ordinal encoding
from sklearn.preprocessing import OrdinalEncoder

encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])

encoder.fit_transform(df[['size']])

array([[0.],
       [1.],
       [2.],
       [1.],
       [0.],
       [2.]])

In [47]:
# encode new data, using the same column name as during fit
encoder.transform(pd.DataFrame({'size': ['small']}))



array([[0.]])
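
With an explicit category list, a value outside that list raises an error at transform time by default. Below is a minimal sketch of one option, assuming an out-of-range code such as -1 is acceptable downstream: set `handle_unknown='use_encoded_value'` together with `unknown_value`. The name `size_encoder` and the value `'extra-large'` are made up for this example.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

sizes = pd.DataFrame({"size": ["small", "medium", "large"]})

# Unknown categories are mapped to -1 instead of raising an error.
size_encoder = OrdinalEncoder(
    categories=[["small", "medium", "large"]],
    handle_unknown="use_encoded_value",
    unknown_value=-1,
)
size_encoder.fit(sizes)

new_sizes = pd.DataFrame({"size": ["large", "extra-large"]})
encoded_sizes = size_encoder.transform(new_sizes)
print(encoded_sizes)
```

Whether -1 is a sensible placeholder depends on the model: a tree-based model can split it off, while a linear model would read it as a size below `small`.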