# **Feature Encoding**
Feature encoding is the process of transforming `categorical features` into `numeric features`. This is necessary because most machine learning algorithms expect numeric input. There are many ways to encode categorical features, and each method has its own advantages and disadvantages. In this notebook, we will explore some of the most popular methods:

- Label encoding
- Ordinal encoding
- One-hot encoding
- Binary encoding

In [2]:
# import libraries 
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns 
import warnings 
warnings.filterwarnings('ignore')

In [3]:
# Load data 
df = sns.load_dataset('tips')
df

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.50,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4
...,...,...,...,...,...,...,...
239,29.03,5.92,Male,No,Sat,Dinner,3
240,27.18,2.00,Female,Yes,Sat,Dinner,2
241,22.67,2.00,Male,Yes,Sat,Dinner,2
242,17.82,1.75,Male,No,Sat,Dinner,2


In [4]:
df.dtypes

total_bill     float64
tip            float64
sex           category
smoker        category
day           category
time          category
size             int64
dtype: object

In [5]:
df['time'].value_counts()

time
Dinner    176
Lunch      68
Name: count, dtype: int64

In [6]:
df['day'].value_counts()

day
Sat     87
Sun     76
Thur    62
Fri     19
Name: count, dtype: int64

In [7]:
df['smoker'].value_counts()

smoker
No     151
Yes     93
Name: count, dtype: int64

In [8]:
df['sex'].value_counts()

sex
Male      157
Female     87
Name: count, dtype: int64

In [9]:
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


# 1. Label Encoder

In [30]:
# Let's encode the time column
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder

le = LabelEncoder()
df['encoded_time'] = le.fit_transform(df['time'])
df.sample(5)

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,day_0,day_1,day_2,encoded_time
201,12.74,2.01,Female,Yes,Thur,Lunch,2,0,1,1,1
197,43.11,5.0,Female,Yes,Thur,Lunch,4,0,1,1,1
166,20.76,2.24,Male,No,Sun,Dinner,2,0,0,1,0
165,24.52,3.48,Male,No,Sun,Dinner,3,0,0,1,0
174,16.82,4.0,Male,Yes,Sun,Dinner,2,0,0,1,0


In [11]:
df['encoded_time'].value_counts()

encoded_time
0    176
1     68
Name: count, dtype: int64

In [12]:
df['time'].value_counts()

time
Dinner    176
Lunch      68
Name: count, dtype: int64
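
The counts above show that `Dinner` became `0` and `Lunch` became `1`. A minimal sketch of where that mapping comes from, using a small hand-made list as a stand-in for the `time` column (the toy values are an assumption for illustration): `LabelEncoder` sorts the unique labels and uses each label's position as its code. Note that scikit-learn documents `LabelEncoder` as a tool for target labels (`y`); for input features, `OrdinalEncoder` is the usual choice.

```python
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the `time` column (illustrative values only)
times = ["Dinner", "Lunch", "Dinner", "Dinner", "Lunch"]

le = LabelEncoder()
codes = le.fit_transform(times)

# classes_ holds the sorted unique labels; a label's position is its code
mapping = {label: code for code, label in enumerate(le.classes_.tolist())}
print(mapping)          # {'Dinner': 0, 'Lunch': 1}
print(codes.tolist())   # [0, 1, 0, 0, 1]

# inverse_transform maps codes back to the original labels
print(le.inverse_transform([0, 1]).tolist())   # ['Dinner', 'Lunch']
```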

# 2. Ordinal Encoder

In [13]:
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,encoded_time
0,16.99,1.01,Female,No,Sun,Dinner,2,0
1,10.34,1.66,Male,No,Sun,Dinner,3,0
2,21.01,3.5,Male,No,Sun,Dinner,3,0
3,23.68,3.31,Male,No,Sun,Dinner,2,0
4,24.59,3.61,Female,No,Sun,Dinner,4,0


In [14]:
# let's encode the day column 
oe = OrdinalEncoder()
df['encoded_day'] = oe.fit_transform(df[['day']])
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,encoded_time,encoded_day
0,16.99,1.01,Female,No,Sun,Dinner,2,0,2.0
1,10.34,1.66,Male,No,Sun,Dinner,3,0,2.0
2,21.01,3.5,Male,No,Sun,Dinner,3,0,2.0
3,23.68,3.31,Male,No,Sun,Dinner,2,0,2.0
4,24.59,3.61,Female,No,Sun,Dinner,4,0,2.0


In [15]:
df['encoded_day'].value_counts()

encoded_day
1.0    87
2.0    76
3.0    62
0.0    19
Name: count, dtype: int64

In [16]:
df['day'].value_counts()

day
Sat     87
Sun     76
Thur    62
Fri     19
Name: count, dtype: int64

In [17]:
# let's encode the day column with manual order 
oe = OrdinalEncoder(categories = [['Thur', 'Fri', 'Sat', 'Sun']])
df['encoded_day'] = oe.fit_transform(df[['day']])
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,encoded_time,encoded_day
0,16.99,1.01,Female,No,Sun,Dinner,2,0,3.0
1,10.34,1.66,Male,No,Sun,Dinner,3,0,3.0
2,21.01,3.5,Male,No,Sun,Dinner,3,0,3.0
3,23.68,3.31,Male,No,Sun,Dinner,2,0,3.0
4,24.59,3.61,Female,No,Sun,Dinner,4,0,3.0


In [18]:
df['encoded_day'].value_counts()

encoded_day
2.0    87
3.0    76
0.0    62
1.0    19
Name: count, dtype: int64

In [19]:
df['day'].value_counts()

day
Sat     87
Sun     76
Thur    62
Fri     19
Name: count, dtype: int64
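
The two runs above differ only in the order of the categories. As a small sketch on hand-made data (the five toy rows are an assumption, not the `tips` dataset), the fitted `categories_` attribute shows which order the encoder used: sorted order by default, or the order passed through `categories`.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy stand-in for the `day` column (illustrative values only)
days = pd.DataFrame({"day": ["Sun", "Sat", "Thur", "Fri", "Sat"]})

# Default: categories are sorted, so Fri=0, Sat=1, Sun=2, Thur=3
default_oe = OrdinalEncoder()
default_codes = default_oe.fit_transform(days)
print(default_oe.categories_[0].tolist())   # ['Fri', 'Sat', 'Sun', 'Thur']
print(default_codes.ravel().tolist())       # [2.0, 1.0, 3.0, 0.0, 1.0]

# Manual order: a category's position in the list becomes its code
manual_oe = OrdinalEncoder(categories=[["Thur", "Fri", "Sat", "Sun"]])
manual_codes = manual_oe.fit_transform(days)
print(manual_codes.ravel().tolist())        # [3.0, 2.0, 0.0, 1.0, 2.0]
```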

# 3. One-Hot Encoding

In [20]:
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,encoded_time,encoded_day
0,16.99,1.01,Female,No,Sun,Dinner,2,0,3.0
1,10.34,1.66,Male,No,Sun,Dinner,3,0,3.0
2,21.01,3.5,Male,No,Sun,Dinner,3,0,3.0
3,23.68,3.31,Male,No,Sun,Dinner,2,0,3.0
4,24.59,3.61,Female,No,Sun,Dinner,4,0,3.0


In [32]:
# Let's encode the smoker column  
ohe = OneHotEncoder()
ohe.fit_transform(df[['smoker']]).toarray()

array([[1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       ...,
       [0., 1.],
       [1., 0.],
       [0., 1.

In [22]:
# example of one hot encoding 
titanic = sns.load_dataset('titanic')

ohe = OneHotEncoder(sparse_output = False)
embarked_ohe = ohe.fit_transform(titanic[['embarked']])
embarked_df = pd.DataFrame(embarked_ohe, columns = ohe.get_feature_names_out())
embarked_df

Unnamed: 0,embarked_C,embarked_Q,embarked_S,embarked_nan
0,0.0,0.0,1.0,0.0
1,1.0,0.0,0.0,0.0
2,0.0,0.0,1.0,0.0
3,0.0,0.0,1.0,0.0
4,0.0,0.0,1.0,0.0
...,...,...,...,...
886,0.0,0.0,1.0,0.0
887,0.0,0.0,1.0,0.0
888,0.0,0.0,1.0,0.0
889,1.0,0.0,0.0,0.0


In [23]:
pd.concat([titanic.reset_index(drop = True), embarked_df.reset_index(drop = True)], axis = 1)

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone,embarked_C,embarked_Q,embarked_S,embarked_nan
0,0,3,male,22.0,1,0,7.2500,S,Third,man,True,,Southampton,no,False,0.0,0.0,1.0,0.0
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False,1.0,0.0,0.0,0.0
2,1,3,female,26.0,0,0,7.9250,S,Third,woman,False,,Southampton,yes,True,0.0,0.0,1.0,0.0
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False,0.0,0.0,1.0,0.0
4,0,3,male,35.0,0,0,8.0500,S,Third,man,True,,Southampton,no,True,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,13.0000,S,Second,man,True,,Southampton,no,True,0.0,0.0,1.0,0.0
887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True,0.0,0.0,1.0,0.0
888,0,3,female,,1,2,23.4500,S,Third,woman,False,,Southampton,no,False,0.0,0.0,1.0,0.0
889,1,1,male,26.0,0,0,30.0000,C,First,man,True,C,Cherbourg,yes,True,1.0,0.0,0.0,0.0



---

🔹 `sparse_output` in `OneHotEncoder` decides **how the encoded data is stored**:

* `sparse_output = True` (default):

  * Stores the result in a **sparse matrix** (only saves positions of 1s, not all 0s).
  * Saves **memory** when there are **lots of zeros**.

* `sparse_output = False`:

  * Stores the result in a **normal dense array** (shows all 0s and 1s).
  * Easier to **see** and work with, but uses **more memory**.

---

📌 Example (Color: Red, Blue, Green):
Encoded row (Red): `[1, 0, 0]`

* Sparse: only saves **index of 1**.
* Dense: saves the **whole array** `[1, 0, 0]`.

---

💡 **In short:**
`sparse_output=False` → Easier to read but heavier.
`sparse_output=True` → Memory-efficient for large data.

---




# 4. Binary Encoding

In [24]:
!pip install category_encoders

Collecting category_encoders
  Downloading category_encoders-2.9.0-py3-none-any.whl.metadata (7.9 kB)
Collecting patsy>=0.5.1 (from category_encoders)
  Downloading patsy-1.0.2-py2.py3-none-any.whl.metadata (3.6 kB)
Collecting statsmodels>=0.9.0 (from category_encoders)
  Downloading statsmodels-0.14.5-cp312-cp312-win_amd64.whl.metadata (9.8 kB)
Downloading category_encoders-2.9.0-py3-none-any.whl (85 kB)
Downloading patsy-1.0.2-py2.py3-none-any.whl (233 kB)
Downloading statsmodels-0.14.5-cp312-cp312-win_amd64.whl (9.6 MB)

In [25]:
df = sns.load_dataset('tips')
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [26]:
from category_encoders import BinaryEncoder 

binary_encoder = BinaryEncoder()
# fit_transform returns a DataFrame with one column per binary digit
df_binary = binary_encoder.fit_transform(df['day'])
df_binary

Unnamed: 0,day_0,day_1,day_2
0,0,0,1
1,0,0,1
2,0,0,1
3,0,0,1
4,0,0,1
...,...,...,...
239,0,1,0
240,0,1,0
241,0,1,0
242,0,1,0


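##### NOTE : Binary encoding needs fewer columns than one-hot encoding. Each category is first given an integer code, and that code is then written in binary with one column per bit, so `n` categories need roughly `log2(n)` columns instead of `n`. Here the `day` column has 4 categories and the encoder returns 3 columns (`day_0`, `day_1`, `day_2`); the output above is consistent with codes starting at 1, so the fourth category (binary `100`) needs three bits. The saving is small here but grows with the number of categories. This is a reduction in dimensionality; it is not specifically a fix for multicollinearity.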
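
The idea can be sketched by hand in a few lines. This is not the `category_encoders` implementation: `binary_encode` is a hypothetical helper, and numbering the categories from 1 in order of first appearance is an assumption chosen because it reproduces the `day_0`, `day_1`, `day_2` pattern shown above.

```python
import pandas as pd

def binary_encode(series: pd.Series, prefix: str) -> pd.DataFrame:
    """Hand-rolled binary encoding: integer code per category, one column per bit."""
    # Assumption: number categories 1..n in order of first appearance
    codes = {cat: i for i, cat in enumerate(pd.unique(series), start=1)}
    # Bits needed to write the largest code in binary
    n_bits = max(codes.values()).bit_length()
    # Extract each bit, most significant first
    rows = [
        [(codes[value] >> shift) & 1 for shift in range(n_bits - 1, -1, -1)]
        for value in series
    ]
    columns = [f"{prefix}_{i}" for i in range(n_bits)]
    return pd.DataFrame(rows, columns=columns, index=series.index)

# Toy stand-in for the `day` column (illustrative values only)
days = pd.Series(["Sun", "Sat", "Thur", "Fri", "Sun"], name="day")
encoded = binary_encode(days, "day")
print(encoded)
```

With four categories the largest code is 4, which is `100` in binary, so three columns are produced; one-hot encoding would need four.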

In [27]:
encoded_df = pd.DataFrame(df_binary, columns = binary_encoder.get_feature_names_out())
encoded_df

Unnamed: 0,day_0,day_1,day_2
0,0,0,1
1,0,0,1
2,0,0,1
3,0,0,1
4,0,0,1
...,...,...,...
239,0,1,0
240,0,1,0
241,0,1,0
242,0,1,0


In [28]:
df = pd.concat([df.reset_index(drop = True), encoded_df.reset_index(drop = True)], axis = 1)

In [29]:
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,day_0,day_1,day_2
0,16.99,1.01,Female,No,Sun,Dinner,2,0,0,1
1,10.34,1.66,Male,No,Sun,Dinner,3,0,0,1
2,21.01,3.5,Male,No,Sun,Dinner,3,0,0,1
3,23.68,3.31,Male,No,Sun,Dinner,2,0,0,1
4,24.59,3.61,Female,No,Sun,Dinner,4,0,0,1


##### Let's Understand reset_index(drop = True)
---

#### 🔥 **reset_index(drop=True): Simple Explanation**

##### 🧠 **What is an index?**

Every row in a DataFrame has a label called the **index**. By default it is simply the row number (0, 1, 2, …).

Example:

| index | Name  |
| ----- | ----- |
| 0     | Sahil |
| 1     | Raj   |
| 2     | Aman  |

Sometimes after filtering, splitting, or encoding, the index becomes messy:

| index | Name  |
| ----- | ----- |
| 5     | Sahil |
| 7     | Raj   |
| 10    | Aman  |

Now the index labels (5, 7, 10) are no longer consecutive.

---

#### ✔ **reset_index()**

This function **resets the index back to 0,1,2,3…**

##### Without `drop=True`:

It keeps the old index as a new column named `index`.

Example:

```python
df.reset_index()
```

Output:

|     | index | Name  |
| --- | ----- | ----- |
| 0   | 5     | Sahil |
| 1   | 7     | Raj   |
| 2   | 10    | Aman  |

---

#### ✔ **reset_index(drop=True)**

This means:

❗ “Do NOT keep the old index, just remove it.”

It gives you clean index numbers:

| index | Name  |
| ----- | ----- |
| 0     | Sahil |
| 1     | Raj   |
| 2     | Aman  |

---

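#### 🔥 **Why do we use reset_index(drop=True) in this notebook?**

The concatenation step was:

```python
pd.concat([df.reset_index(drop=True), encoded_df.reset_index(drop=True)], axis=1)
```

When we concatenate two DataFrames side by side (`axis=1`), pandas aligns rows by **index label**, not by position:

* both DataFrames should have the **same index**
* otherwise rows will not match properly, and labels present in only one DataFrame produce rows filled with `NaN`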

So before merging:

* we clean the index of `df`
* we clean the index of `encoded_df`

This ensures **row 0 joins with row 0**, **row 1 with row 1**, etc.
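
A minimal sketch of what goes wrong without the reset, using two small hand-made DataFrames (`left` and `right` are hypothetical names, not variables from this notebook):

```python
import pandas as pd

# `left` has a non-consecutive index, as if rows had been filtered out
left = pd.DataFrame({"Name": ["Sahil", "Raj", "Aman"]}, index=[5, 7, 10])
# `right` has the default index 0, 1, 2
right = pd.DataFrame({"code": [1, 2, 3]})

# Without resetting: no labels match, so we get 6 rows padded with NaN
misaligned = pd.concat([left, right], axis=1)
print(misaligned.shape)                      # (6, 2)
print(int(misaligned.isna().sum().sum()))    # 6

# After resetting both indexes: rows pair up by position
aligned = pd.concat(
    [left.reset_index(drop=True), right.reset_index(drop=True)], axis=1
)
print(aligned.shape)                         # (3, 2)
print(aligned)
```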

---





The `.reset_index()` function in pandas is used to **reset the index** of a DataFrame back to the default numbering (0, 1, 2, …).

---

### 💡 In simple words:

> It converts the current index into a normal column and creates a **new default index**.

---

### 🔍 Example

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['A', 'B', 'C'],
    'Age': [22, 25, 30]
})

# set Name as index
df = df.set_index('Name')
print(df)
```

**Output:**

```
      Age
Name     
A      22
B      25
C      30
```

Now, if you reset the index:

```python
df_reset = df.reset_index()
print(df_reset)
```

**Output:**

```
  Name  Age
0    A   22
1    B   25
2    C   30
```

---

### ⚙️ Common options:

1. **`drop=True`** → removes the old index completely (doesn’t add it as a column)

   ```python
   df.reset_index(drop=True)
   ```

2. **`inplace=True`** → makes the change directly in the same DataFrame (no need to assign)

   ```python
   df.reset_index(inplace=True)
   ```

---

✅ **In short:**

* `reset_index()` → moves index to a column and makes a new default index.
* `reset_index(drop=True)` → removes index and starts numbering fresh (0,1,2,…).