# Feature Encoding:
Feature encoding is the process of converting non-numerical (categorical) data into a numerical format that machine learning algorithms can understand and process.

### We choose the encoding technique based on:
- The type of categorical data (nominal or ordinal)
- The number of unique values in the categorical column (its cardinality)
- The model we plan to train
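
A quick way to check the second criterion is to count the distinct values per column. The snippet below is a minimal sketch; the `sample` frame and its values are assumptions made up for illustration.

```python
import pandas as pd

# Illustrative data only; column names and values are assumptions for this sketch.
sample = pd.DataFrame({
    "Color": ["Red", "Green", "Blue", "Green", "Red"],
    "Size": ["Small", "Medium", "Large", "Medium", "Small"],
    "City": ["Hyderabad", "Delhi", "Mumbai", "Delhi", "Chennai"],
})

# Number of distinct categories in each column (the column's cardinality).
cardinality = sample.nunique()
print(cardinality)
```

Columns with only a few distinct values are usually easy to one-hot encode, while columns with many distinct values may call for a more compact encoding.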

# Types of categorical data:
### 1. Nominal data:
Categories have no inherent order.

Examples:
- Color (Red, Blue, Green)
- City (Delhi, Mumbai)

### 2. Ordinal data:
Categories have a meaningful order.

Examples:
- Size (Small < Medium < Large)
- Rating (Low < Medium < High)
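
The difference can be made concrete in pandas, which can store an explicit order with a categorical. This is a small sketch, assuming the Small < Medium < Large order from the example above; the variable name `sizes` is only illustrative.

```python
import pandas as pd

# Declare the order explicitly; without ordered=True the categories are nominal.
sizes = pd.Categorical(
    ["Small", "Large", "Medium"],
    categories=["Small", "Medium", "Large"],
    ordered=True,
)

# codes gives each value's position in the declared order.
print(sizes.codes)

# min/max are only defined because the categorical is ordered.
print(sizes.min(), sizes.max())
```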


### Why feature encoding is important:
Most machine learning models cannot use raw text directly in training. Proper encoding:
- improves model accuracy
- prevents the model from making wrong assumptions, such as an order between nominal categories

In [6]:
import pandas as pd


In [12]:
df = pd.DataFrame({
    "Color" : ["Red", "Green" ,"Blue" ,"Green" ,"Red"],
    "Size"  : ["Small" , "Medium" , "Large" , "Medium" , "Small"],
    "City"  : ["Hyderabad" , "Delhi" , "Mumbai" , "Delhi" , "Chennai"],
    "Target": [10,20,15,25,12]
               })
df

Unnamed: 0,Color,Size,City,Target
0,Red,Small,Hyderabad,10
1,Green,Medium,Delhi,20
2,Blue,Large,Mumbai,15
3,Green,Medium,Delhi,25
4,Red,Small,Chennai,12


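# 1. Label Encoding:
- Each category is assigned a unique integer.
- scikit-learn's LabelEncoder assigns the integers in sorted (alphabetical) order of the categories (e.g. "Blue": 0, "Green": 1, "Red": 2).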



In [15]:
from sklearn.preprocessing import LabelEncoder

In [19]:
df["Color_Label"] = LabelEncoder().fit_transform(df["Color"])

df[["Color","Color_Label"]]

Unnamed: 0,Color,Color_Label
0,Red,2
1,Green,1
2,Blue,0
3,Green,1
4,Red,2
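
To see which integer belongs to which category, the fitted encoder can be inspected. The sketch below keeps a reference to the encoder (the cell above discards it) and uses a plain list of colors as stand-in data; the names `le`, `mapping`, and `restored` are illustrative.

```python
from sklearn.preprocessing import LabelEncoder

colors = ["Red", "Green", "Blue", "Green", "Red"]

le = LabelEncoder()
codes = le.fit_transform(colors)

# classes_ holds the categories in sorted order; a category's position is its code.
mapping = {str(label): int(code) for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform maps the integers back to the original labels.
restored = le.inverse_transform(codes)
print(list(restored))
```

Keeping the fitted encoder around is what makes it possible to apply the same mapping to new data later.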


# 2. One-Hot Encoding:
- Creates a new binary column for each category, with 1 indicating the presence of that category and 0 otherwise.
- Ideal for nominal data (no inherent order) with low to medium cardinality, especially for linear models or neural networks.

In [27]:
one_hot = pd.get_dummies(df['Color'], prefix='Color')
one_hot

Unnamed: 0,Color_Blue,Color_Green,Color_Red
0,False,False,True
1,False,True,False
2,True,False,False
3,False,True,False
4,False,False,True


In [29]:
ohe = pd.get_dummies(df, columns=['Color'])
print(ohe)

     Size       City  Target  Color_Label  Color_Blue  Color_Green  Color_Red
0   Small  Hyderabad      10            2       False        False       True
1  Medium      Delhi      20            1       False         True      False
2   Large     Mumbai      15            0        True        False      False
3  Medium      Delhi      25            1       False         True      False
4   Small    Chennai      12            2       False        False       True
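
With one column per category, the dummy columns always sum to 1 in every row, so one of them is redundant. For linear models with an intercept it is common to drop one level. The following is a hedged sketch of that option using `drop_first` and `dtype`, both existing `pd.get_dummies` parameters; the `colors` frame is stand-in data.

```python
import pandas as pd

colors = pd.DataFrame({"Color": ["Red", "Green", "Blue", "Green", "Red"]})

# drop_first=True removes the first level (here "Blue");
# dtype=int produces 0/1 integers instead of booleans.
dummies = pd.get_dummies(colors, columns=["Color"], drop_first=True, dtype=int)
print(dummies)
```

A row with 0 in both remaining columns then stands for the dropped category.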


In [31]:
from sklearn.preprocessing import OneHotEncoder

In [41]:
encoder = OneHotEncoder(sparse_output=False)
encoded = encoder.fit_transform(df[["City"]])
encoded_df = pd.DataFrame(encoded,columns=encoder.get_feature_names_out(["City"]))

In [43]:
encoded_df

Unnamed: 0,City_Chennai,City_Delhi,City_Hyderabad,City_Mumbai
0,0.0,0.0,1.0,0.0
1,0.0,1.0,0.0,0.0
2,0.0,0.0,0.0,1.0
3,0.0,1.0,0.0,0.0
4,1.0,0.0,0.0,0.0


# 3. Ordinal Encoding:
- Similar to label encoding, but the integers are assigned according to a specific, predefined order or rank of the categories (e.g. "Low": 0, "Medium": 1, "High": 2).
- Use it when the categorical variable has a natural, meaningful order.

In [55]:
from sklearn.preprocessing import OrdinalEncoder

In [60]:
# List the categories from lowest to highest rank
order = [["Small", "Medium", "Large"]]
df["Size_Ordinal"] = OrdinalEncoder(categories=order).fit_transform(df[["Size"]])

df[["Size", "Size_Ordinal"]]

Unnamed: 0,Size,Size_Ordinal
0,Small,0.0
1,Medium,1.0
2,Large,2.0
3,Medium,1.0
4,Small,0.0
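
Because the result depends entirely on the order that is passed in, it can help to write the ranks out explicitly. As an alternative sketch in plain pandas, a dictionary can be mapped over the column; `size_rank` is an illustrative name, and the ranks assume Small < Medium < Large.

```python
import pandas as pd

sizes = pd.Series(["Small", "Medium", "Large", "Medium", "Small"], name="Size")

# Explicit rank for each category, lowest to highest.
size_rank = {"Small": 0, "Medium": 1, "Large": 2}

# Values missing from the dictionary would become NaN.
size_encoded = sizes.map(size_rank)
print(size_encoded.tolist())
```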


# 4. Target Encoding (mean encoding):
- Each category is replaced with the mean of the target variable for that category.
- Effective for high-cardinality features and when there is a strong relationship between the category and the target variable, but it requires careful validation to avoid data leakage, because the encoding is computed from the target itself.

In [72]:
te = df.groupby("City")["Target"].mean()

df["City_Target_Enc"] = df["City"].map(te)
df[["City","City_Target_Enc"]]

Unnamed: 0,City,City_Target_Enc
0,Hyderabad,10.0
1,Delhi,22.5
2,Mumbai,15.0
3,Delhi,22.5
4,Chennai,12.0
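
The cell above computes the means on the full dataset, which is where leakage can creep in. One common way to reduce it is to compute the category means on the training rows only and apply them to held-out rows. The sketch below assumes an arbitrary split (first four rows for training, last row held out) and falls back to the training mean for categories that were not seen; both choices are assumptions for illustration.

```python
import pandas as pd

data = pd.DataFrame({
    "City": ["Hyderabad", "Delhi", "Mumbai", "Delhi", "Chennai"],
    "Target": [10, 20, 15, 25, 12],
})

# Assumed split for illustration: first four rows train, last row held out.
train, test = data.iloc[:4], data.iloc[4:]

# Learn the encoding from the training rows only.
city_means = train.groupby("City")["Target"].mean()
global_mean = train["Target"].mean()

# Unseen categories (here "Chennai") fall back to the training mean.
test_encoded = test["City"].map(city_means).fillna(global_mean)
print(test_encoded.tolist())
```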


# 5. Frequency Encoding:
- Replaces each category with its frequency (count) in the dataset.
- Useful when the frequency of a category is predictive of the target variable.


In [None]:
freq = df["City"].value_counts()
df["City_Freq"] = df["City"].map(freq)
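
Raw counts depend on the size of the dataset, so a relative frequency is sometimes used instead. The following self-contained sketch shows both variants on stand-in city data; `value_counts(normalize=True)` returns each category's share of the rows, and the variable names are illustrative.

```python
import pandas as pd

cities = pd.Series(["Hyderabad", "Delhi", "Mumbai", "Delhi", "Chennai"], name="City")

counts = cities.value_counts()                # absolute count per category
shares = cities.value_counts(normalize=True)  # fraction of rows per category

# Replace each city with its count or its share.
city_count = cities.map(counts)
city_share = cities.map(shares)
print(city_count.tolist())
print(city_share.tolist())
```

Note that categories with the same frequency receive the same encoded value, so they can no longer be told apart.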
