# Feature Engineering

## Encoding Categorical Data Techniques

Most machine learning models accept only numerical inputs, so categorical data must be encoded as numbers before it is fed to a model. The key encoding techniques are:  
  
### 1. Label Encoding:  
    Used for: Target (output) variable. On non-ordinal input features the integer codes imply an order that does not exist, so use it there with care.  
    How it works: Assigns each category a unique integer value.  
    Example: Color as Target Variable: ['Red', 'Green', 'Blue']
    Label Encoding: [0, 1, 2]
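
As a minimal sketch of the idea above, scikit-learn's `LabelEncoder` can be applied to the same color list. Note that it assigns codes by sorted label order, so the integers differ from the illustrative `[0, 1, 2]` mapping; the variable names here are only for illustration.

```python
from sklearn.preprocessing import LabelEncoder

colors = ['Red', 'Green', 'Blue']

encoder = LabelEncoder()
codes = encoder.fit_transform(colors)

# classes_ holds the sorted unique labels; a label's code is its index in this array
print(list(encoder.classes_))  # ['Blue', 'Green', 'Red']
print(codes.tolist())          # [2, 1, 0]
```

Each category still gets a unique integer, but which integer it gets depends on alphabetical order rather than on the order of appearance.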
  
### 2. Ordinal Encoding:  
    Used for: Ordered categorical data (e.g., education levels, customer satisfaction ratings).  
    How it works: Assigns increasing integer values based on order.  
    Example: Student Grades: ['A+', 'A', 'B+', 'B-', 'C']
    Ordinal Encoding: [4, 3, 2, 1, 0]
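
The grade example above can be reproduced with scikit-learn's `OrdinalEncoder`. This is a sketch that assumes the grades are ranked from `'C'` (lowest) to `'A+'` (highest); `grade_order` is a name introduced here for illustration.

```python
from sklearn.preprocessing import OrdinalEncoder

# One feature column with five rows
grades = [['A+'], ['A'], ['B+'], ['B-'], ['C']]

# Categories listed from lowest to highest; the position in the list becomes the code
grade_order = ['C', 'B-', 'B+', 'A', 'A+']
encoder = OrdinalEncoder(categories=[grade_order])

encoded = encoder.fit_transform(grades)
print(encoded.ravel().tolist())  # [4.0, 3.0, 2.0, 1.0, 0.0]
```

The encoder returns floats by default. Without an explicit `categories` list it would fall back to sorted order, which rarely matches the real ranking.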
      
### 3. One Hot Encoding:
    Used for: Nominal categorical data (categories without order).
    How it works: Creates separate binary columns for each category.
    Note: Adds one column per category, so the feature space (number of dimensions) grows with the number of categories.
    Example: Colors: ['Red', 'Green', 'Blue']  
  
| # | Red | Green | Blue |
|---|-----|-------|------|
| 0 |  1  |   0   |  0   |
| 1 |  0  |   1   |  0   |
| 2 |  0  |   0   |  1   |
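
One possible way to build the table above is `pandas.get_dummies`. This is a sketch on a toy frame; the column name `color` is an assumption made for the example.

```python
import pandas as pd

colors = pd.DataFrame({'color': ['Red', 'Green', 'Blue']})

# dtype=int yields 0/1 columns; without it, recent pandas versions return booleans
one_hot = pd.get_dummies(colors, columns=['color'], dtype=int)

# get_dummies sorts the new columns alphabetically; reorder them to match the table
one_hot = one_hot[['color_Red', 'color_Green', 'color_Blue']]
print(one_hot)
```

Each row has exactly one `1`, in the column of its category.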

### 4. Frequency Encoding:  
    Used for: High-Cardinality categorical data.  
    How it works: Replaces categories with their occurrence frequencies.  
    Example: City ['NY', 'LA', 'NY', 'SF', 'SF', 'SF']
    Frequency Encoding:  
    NY -> 2/6 = 0.33  
    LA -> 1/6 = 0.17  
    SF -> 3/6 = 0.5  
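
The city example can be sketched in pandas with `value_counts` and `map`. The series name and variable names below are illustrative only.

```python
import pandas as pd

cities = pd.Series(['NY', 'LA', 'NY', 'SF', 'SF', 'SF'], name='city')

# Relative frequency of each category (count / total rows)
freq = cities.value_counts(normalize=True)

# Replace every category with its frequency
encoded = cities.map(freq)
print(encoded.round(2).tolist())  # [0.33, 0.17, 0.33, 0.5, 0.5, 0.5]
```

In a train/test workflow the frequencies would normally be computed on the training split only and then mapped onto the test split.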

In [46]:
import numpy as np
import pandas as pd

In [47]:
df = pd.read_csv('./data/customer.csv')
df.head()

Unnamed: 0,age,gender,review,education,purchased
0,30,Female,Average,School,No
1,68,Female,Poor,UG,No
2,70,Female,Good,PG,No
3,72,Female,Good,PG,No
4,16,Female,Average,UG,No


In [48]:
df = df.loc[:, ['review', 'education', 'purchased']]
df.head(), df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50 non-null     object
 1   education  50 non-null     object
 2   purchased  50 non-null     object
dtypes: object(3)
memory usage: 1.3+ KB


(    review education purchased
 0  Average    School        No
 1     Poor        UG        No
 2     Good        PG        No
 3     Good        PG        No
 4  Average        UG        No,
 None)

In [49]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df.drop('purchased', axis=1), df['purchased'], random_state=10, test_size=0.2)

X_train.shape, X_test.shape

((40, 2), (10, 2))

## Ordinal Encoding

In [50]:
# Import the Ordinal Encoder
from sklearn.preprocessing import OrdinalEncoder

# We pass the order for each column to the Encoder Object as Categories
ordencoder = OrdinalEncoder(categories=[['Poor', 'Average', 'Good'], # Review Column Ordered Categories ASC order.
                                      ['School', 'UG', 'PG'] # Education Column Ordered Categories ASC order.
])

ordencoder = ordencoder.fit(X_train)

In [51]:
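# Transform only while the data is still a DataFrame, so re-running this cell is safe.
# (Comparing type(...) to the string 'object' is always False and would skip the transform.)
if isinstance(X_train, pd.DataFrame):
    X_train = ordencoder.transform(X_train)
if isinstance(X_test, pd.DataFrame):
    X_test = ordencoder.transform(X_test)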

ordencoder.categories_

[array(['Poor', 'Average', 'Good'], dtype=object),
 array(['School', 'UG', 'PG'], dtype=object)]

## Label Encoding

In [53]:
from sklearn.preprocessing import LabelEncoder

lblencoder = LabelEncoder()

lblencoder = lblencoder.fit(y_train)
# Print the identified classes
lblencoder.classes_

array(['No', 'Yes'], dtype=object)

In [54]:
y_train = lblencoder.transform(y_train)
y_test = lblencoder.transform(y_test)

y_train, y_test

(array([0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0,
        1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1]),
 array([1, 0, 0, 1, 1, 1, 0, 0, 1, 0]))
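
Once the target is encoded, integer predictions usually need to be mapped back to the original labels. A minimal sketch using `inverse_transform`, with a small stand-in label list rather than the notebook's `y_train`:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Stand-in labels (hypothetical); in the notebook these would come from y_train
labels = ['No', 'Yes', 'No', 'Yes', 'Yes']

encoder = LabelEncoder().fit(labels)
codes = encoder.transform(labels)  # 'No' -> 0, 'Yes' -> 1

# Map integer predictions back to the original string labels
decoded = encoder.inverse_transform(np.array([1, 0, 0]))
print(decoded.tolist())  # ['Yes', 'No', 'No']
```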

## One Hot Encoding