## One-Hot Encoding

*It is closely related to dummy encoding; the only difference is whether the first category column is dropped.*

One-Hot Encoding converts categorical values into separate binary (0/1) columns.

In [27]:
import pandas as pd 

*Import the Crop_recommendation.csv dataset*

In [26]:
crop_data=pd.read_csv("dataset/Crop_recommendation.csv")
crop_data.head()

Unnamed: 0,N,P,K,temperature,humidity,ph,rainfall,label
0,90,42,43,20.879744,82.002744,6.502985,202.935536,rice
1,85,58,41,21.770462,80.319644,7.038096,226.655537,rice
2,60,55,44,23.004459,82.320763,7.840207,263.964248,rice
3,74,35,40,26.491096,80.158363,6.980401,242.864034,rice
4,78,42,42,20.130175,81.604873,7.628473,262.71734,rice


In [11]:
for col in crop_data.columns:
    print(f"{col} : {len(crop_data[col].unique())}")

N : 137
P : 117
K : 73
temperature : 2200
humidity : 2200
ph : 2200
rainfall : 2200
label : 22


In [13]:
crop_data["label"].unique()  # unique crop names, in order of first appearance

array(['rice', 'maize', 'chickpea', 'kidneybeans', 'pigeonpeas',
       'mothbeans', 'mungbean', 'blackgram', 'lentil', 'pomegranate',
       'banana', 'mango', 'grapes', 'watermelon', 'muskmelon', 'apple',
       'orange', 'papaya', 'coconut', 'cotton', 'jute', 'coffee'],
      dtype=object)

In [28]:
# Replace the 'label' column with one True/False column per crop
crop_data = pd.get_dummies(crop_data, columns=['label'])
crop_data


Unnamed: 0,N,P,K,temperature,humidity,ph,rainfall,label_apple,label_banana,label_blackgram,...,label_mango,label_mothbeans,label_mungbean,label_muskmelon,label_orange,label_papaya,label_pigeonpeas,label_pomegranate,label_rice,label_watermelon
0,90,42,43,20.879744,82.002744,6.502985,202.935536,False,False,False,...,False,False,False,False,False,False,False,False,True,False
1,85,58,41,21.770462,80.319644,7.038096,226.655537,False,False,False,...,False,False,False,False,False,False,False,False,True,False
2,60,55,44,23.004459,82.320763,7.840207,263.964248,False,False,False,...,False,False,False,False,False,False,False,False,True,False
3,74,35,40,26.491096,80.158363,6.980401,242.864034,False,False,False,...,False,False,False,False,False,False,False,False,True,False
4,78,42,42,20.130175,81.604873,7.628473,262.717340,False,False,False,...,False,False,False,False,False,False,False,False,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2195,107,34,32,26.774637,66.413269,6.780064,177.774507,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2196,99,15,27,27.417112,56.636362,6.086922,127.924610,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2197,118,33,30,24.131797,67.225123,6.362608,173.322839,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2198,117,32,34,26.272418,52.127394,6.758793,127.175293,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [29]:
crop_data.shape

(2200, 29)

**Pros**

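- No false ordering between categories (good for nominal data)

- Easy to understand and inspect

- Works well with linear models and most other ML models

- Does not mislead the model into treating category codes as numeric magnitudes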

**Cons**

- Creates one column per category (high memory usage)

- Slower to train when there are many categories

- Not suitable for high-cardinality features (e.g., 1000 cities)
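
To make the memory cost concrete, here is a minimal sketch on made-up data. The `city` column and its 1000 values are hypothetical and are not part of the crop dataset. It also shows the `sparse=True` option of `pd.get_dummies`, which is one possible way to reduce memory, not something the steps above rely on.

```python
import pandas as pd

# Hypothetical data: 5000 rows, 1000 distinct city names (not the crop dataset)
cities = pd.DataFrame({"city": [f"city_{i % 1000}" for i in range(5000)]})

# Regular one-hot encoding: one dense column per city
dense = pd.get_dummies(cities, columns=["city"])

# Same encoding, stored as sparse columns (only the non-zero entries are kept)
sparse = pd.get_dummies(cities, columns=["city"], sparse=True)

print(dense.shape)  # (5000, 1000): one column per unique city

dense_bytes = dense.memory_usage(deep=True).sum()
sparse_bytes = sparse.memory_usage(deep=True).sum()
print(f"dense: {dense_bytes} bytes, sparse: {sparse_bytes} bytes")
```

A single column turns into 1000 columns here, which is why one-hot encoding is usually avoided for features like this.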

# When to Use

*Use when:*

- The number of categories is small (few unique values)

- The categories have no natural order (nominal data)

- Interpretability is important

*Avoid when:*

- The column has too many unique values

- Memory or computation is limited

# Difference between one-hot and dummy encoding
- One-hot: keeps all k category columns (the first column is not dropped)
- Dummy: drops the first category column, leaving k - 1 columns
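
The difference can be sketched with the `drop_first` parameter of `pd.get_dummies`. The small `toy` frame below is a made-up example that reuses three crop names; it is not the full dataset.

```python
import pandas as pd

# Hypothetical toy column with 3 categories
toy = pd.DataFrame({"label": ["rice", "maize", "coffee", "rice"]})

# One-hot: one column per category (3 columns)
one_hot = pd.get_dummies(toy, columns=["label"])

# Dummy: first category (alphabetically 'coffee') is dropped (2 columns)
dummy = pd.get_dummies(toy, columns=["label"], drop_first=True)

print(one_hot.columns.tolist())  # ['label_coffee', 'label_maize', 'label_rice']
print(dummy.columns.tolist())    # ['label_maize', 'label_rice']
```

In the dummy version, a row where every remaining column is False stands for the dropped category. The one-hot columns always sum to 1 in each row, so one of them is redundant; dropping it removes that redundancy, which matters mainly for linear models that include an intercept.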

# One-hot vs. dummy: when to use which
| Situation         | Better Option                        |
| ----------------- | ------------------------------------ |
| Tree-based models | One-Hot                              |
| Linear models     | Dummy                                |
| Small dataset     | Either works                         |
| Many categories   | Neither (use Binary/Target Encoding) |
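
For the last row of the table, here is a rough sketch of what target encoding can look like, done by hand with pandas. The `city` and `price` columns are hypothetical, and this is only one simple variant (mean of the target per category), not a complete recipe.

```python
import pandas as pd

# Hypothetical data: a categorical feature and a numeric target
toy = pd.DataFrame({
    "city": ["A", "A", "B", "B", "B", "C"],
    "price": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
})

# Mean target value per category
city_mean = toy.groupby("city")["price"].mean()

# Replace each category with its mean target: one numeric column,
# no matter how many categories there are
toy["city_encoded"] = toy["city"].map(city_mean)

print(toy)
```

Here `A` becomes 15.0, `B` becomes 40.0, and `C` becomes 60.0. In practice the means should be computed on the training data only, otherwise the target leaks into the feature.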
