# Feature Engineering

## Techniques for Encoding Categorical Data

Encoding categorical data into numerical format is essential for feeding it into machine learning models since most models can only work with numerical inputs. Here are the key encoding techniques:  
  
### 1. Label Encoding:  
    Used for: The target (output) variable. For nominal input features, prefer One Hot Encoding, because integer codes imply an order that does not exist.  
    How it works: Assigns each category a unique integer value.  
    Example: Color as Target Variable: ['Red', 'Green', 'Blue']
    Label Encoding: [0, 1, 2]
  
### 2. Ordinal Encoding:  
    Used for: Ordered categorical data (e.g., education levels, customer satisfaction ratings).  
    How it works: Assigns increasing integer values based on order.      
    Example: Student Grades: ['A+', 'A', 'B+', 'B-', 'C']
    Ordinal Encoding: [4, 3, 2, 1, 0]
      
### 3. One Hot Encoding:
    Used for: Nominal categorical data (categories without order).  
    How it works: Creates separate binary columns for each category.  
    Why separate binary columns: Encoding nominal categories as 1, 2, 3 implies an order and a magnitude, so a model may give 3 more importance than 1. Giving every category its own 0/1 (off/on, no/yes) column avoids this.  
    After encoding, drop one of the binary columns and keep n-1 columns to avoid the dummy variable trap (the dropped column is fully determined by the others).  
    Trade-off: It increases the feature space (number of dimensions).  
    Example: Colors: ['Red', 'Green', 'Blue']  
  
| # | Red | Green | Blue |
|---|-----|-------|------|
| 0 |  1  |   0   |  0   |
| 1 |  0  |   1   |  0   |
| 2 |  0  |   0   |  1   |

### 4. Frequency Encoding:  
    Used for: High-Cardinality categorical data.  
    How it works: Replaces categories with their occurrence frequencies.  
    Example: City ['NY', 'LA', 'NY', 'SF', 'SF', 'SF']
    Frequency Encoding:  
    NY -> 2/6 = 0.33  
    LA -> 1/6 = 0.17  
    SF -> 3/6 = 0.5  
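
The frequency encoding example above can be reproduced with a few lines of pandas. This is a minimal sketch on the toy `City` values from the example; the variable names are illustrative, and in a real pipeline the frequencies would normally be computed on the training split only.

```python
import pandas as pd

# Toy data taken from the City example above
city = pd.Series(['NY', 'LA', 'NY', 'SF', 'SF', 'SF'], name='city')

# Relative frequency of each category: count / number of rows
freq = city.value_counts(normalize=True)

# Replace every category with its frequency
city_encoded = city.map(freq)

print(city_encoded.round(2).tolist())
# [0.33, 0.17, 0.33, 0.5, 0.5, 0.5]
```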

In [51]:
import numpy as np
import pandas as pd

In [52]:
df = pd.read_csv('./data/customer.csv')
df.head()

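| # | age | gender | review | education | purchased |
|---|-----|--------|---------|-----------|-----------|
| 0 | 30 | Female | Average | School | No |
| 1 | 68 | Female | Poor | UG | No |
| 2 | 70 | Female | Good | PG | No |
| 3 | 72 | Female | Good | PG | No |
| 4 | 16 | Female | Average | UG | No |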


In [53]:
df = df.loc[:, ['review', 'education', 'purchased']]
df.head(), df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50 non-null     object
 1   education  50 non-null     object
 2   purchased  50 non-null     object
dtypes: object(3)
memory usage: 1.3+ KB


(    review education purchased
 0  Average    School        No
 1     Poor        UG        No
 2     Good        PG        No
 3     Good        PG        No
 4  Average        UG        No,
 None)

In [54]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df.drop('purchased', axis=1), df['purchased'], random_state=10, test_size=0.2)

X_train.shape, X_test.shape

((40, 2), (10, 2))

## Ordinal Encoding

In [55]:
# Import the Ordinal Encoder
from sklearn.preprocessing import OrdinalEncoder

# We pass the order for each column to the Encoder Object as Categories
ordencoder = OrdinalEncoder(categories=[['Poor', 'Average', 'Good'], # Review Column Ordered Categories ASC order.
                                      ['School', 'UG', 'PG'] # Education Column Ordered Categories ASC order.
])

ordencoder = ordencoder.fit(X_train)

In [56]:
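# Encode only while the inputs are still DataFrames, so re-running this cell does not re-encode them
if isinstance(X_train, pd.DataFrame):
    X_train = ordencoder.transform(X_train)
if isinstance(X_test, pd.DataFrame):
    X_test = ordencoder.transform(X_test)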

ordencoder.categories_

[array(['Poor', 'Average', 'Good'], dtype=object),
 array(['School', 'UG', 'PG'], dtype=object)]
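
To make the effect of the category order visible, here is a small sketch on a made-up three-row frame (the `toy` data is an assumption, not part of the customer dataset). Each value is replaced by its position in the list passed to `categories`, and the result is a float array by default.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical rows that use the same categories as the customer data
toy = pd.DataFrame({
    'review': ['Poor', 'Good', 'Average'],
    'education': ['PG', 'School', 'UG'],
})

# Same ascending orders as in the cell above
encoder = OrdinalEncoder(categories=[['Poor', 'Average', 'Good'],
                                     ['School', 'UG', 'PG']])
toy_encoded = encoder.fit_transform(toy)

print(toy_encoded)
# [[0. 2.]
#  [2. 0.]
#  [1. 1.]]
```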

## Label Encoding

In [57]:
from sklearn.preprocessing import LabelEncoder

lblencoder = LabelEncoder()

lblencoder = lblencoder.fit(y_train)
# Print the identified classes
lblencoder.classes_

array(['No', 'Yes'], dtype=object)

In [58]:
y_train = lblencoder.transform(y_train)
y_test = lblencoder.transform(y_test)

y_train, y_test

(array([0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0,
        1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1]),
 array([1, 0, 0, 1, 1, 1, 0, 0, 1, 0]))
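
Since a model trained on the encoded target predicts 0 and 1, it can be handy to map predictions back to the original labels. A minimal sketch with `LabelEncoder.inverse_transform`, using made-up labels rather than the notebook's `y_train`:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Classes are stored in sorted order, so 'No' -> 0 and 'Yes' -> 1
encoder = LabelEncoder().fit(['No', 'Yes', 'No', 'Yes'])

encoded = encoder.transform(['Yes', 'No', 'No'])
decoded = encoder.inverse_transform(np.array([1, 0, 0]))

print(encoded.tolist())  # [1, 0, 0]
print(decoded.tolist())  # ['Yes', 'No', 'No']
```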

## One Hot Encoding

In [59]:
df_cars = pd.read_csv('./data/cars.csv')
df_cars.head()

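| # | brand | km_driven | fuel | owner | selling_price |
|---|-------|-----------|------|-------|---------------|
| 0 | Maruti | 145500 | Diesel | First Owner | 450000 |
| 1 | Skoda | 120000 | Diesel | Second Owner | 370000 |
| 2 | Honda | 140000 | Petrol | Third Owner | 158000 |
| 3 | Hyundai | 127000 | Diesel | First Owner | 225000 |
| 4 | Maruti | 120000 | Petrol | First Owner | 130000 |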


In [60]:
# Several columns hold nominal categories.
# We need to one-hot encode them before passing the data
# to an ML model. Why not ordinal encoding? These categories
# have no natural order, and integer codes like 1, 2, 3
# would imply one. The model could wrongly interpret those
# codes as differences in magnitude, which biases the
# encoded data.
print(f"\nFuel: {df_cars.fuel.value_counts()}")
print(f"\nowner: {df_cars.owner.value_counts()}")
print(f"\nbrand: {df_cars.brand.nunique()} categories")


Fuel: fuel
Diesel    4402
Petrol    3631
CNG         57
LPG         38
Name: count, dtype: int64

owner: owner
First Owner             5289
Second Owner            2105
Third Owner              555
Fourth & Above Owner     174
Test Drive Car             5
Name: count, dtype: int64

brand: 32 categories


### One Hot Encoding using Pandas

In [61]:
pd.get_dummies(df_cars[['fuel', 'owner']]).head()

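| # | fuel_CNG | fuel_Diesel | fuel_LPG | fuel_Petrol | owner_First Owner | owner_Fourth & Above Owner | owner_Second Owner | owner_Test Drive Car | owner_Third Owner |
|---|----------|-------------|----------|-------------|-------------------|----------------------------|--------------------|----------------------|-------------------|
| 0 | False | True | False | False | True | False | False | False | False |
| 1 | False | True | False | False | False | False | True | False | False |
| 2 | False | False | False | True | False | False | False | False | True |
| 3 | False | True | False | False | True | False | False | False | False |
| 4 | False | False | False | True | True | False | False | False | False |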


In [62]:
# Implement k-1 one-hot encoding (drop one column from each set of dummies)
# to avoid multicollinearity in the data.
pd.get_dummies(df_cars[['fuel', 'owner']], drop_first=True).head()

# fuel_CNG and owner_First Owner were dropped (k-1) because drop_first=True

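| # | fuel_Diesel | fuel_LPG | fuel_Petrol | owner_Fourth & Above Owner | owner_Second Owner | owner_Test Drive Car | owner_Third Owner |
|---|-------------|----------|-------------|----------------------------|--------------------|----------------------|-------------------|
| 0 | True | False | False | False | False | False | False |
| 1 | True | False | False | False | True | False | False |
| 2 | False | False | True | False | False | False | True |
| 3 | True | False | False | False | False | False | False |
| 4 | False | False | True | False | False | False | False |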
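
Why dropping one column loses no information can be checked directly. The sketch below uses a made-up `fuel` column: with all k dummy columns, every row sums to 1, so any one column is fully determined by the others (this is the multicollinearity behind the dummy variable trap). The dropped `fuel_CNG` column can be rebuilt from the remaining k-1 columns.

```python
import pandas as pd

# Hypothetical fuel values with the same four categories as the cars data
fuel = pd.DataFrame({'fuel': ['Diesel', 'Petrol', 'CNG', 'LPG', 'Diesel']})

full = pd.get_dummies(fuel, dtype=int)                      # k columns
reduced = pd.get_dummies(fuel, drop_first=True, dtype=int)  # k-1 columns

# Every row has exactly one 1, so the k columns always sum to 1
row_sums = full.sum(axis=1)

# The dropped column equals 1 minus the sum of the remaining columns
recovered_cng = 1 - reduced.sum(axis=1)

print(row_sums.tolist())                           # [1, 1, 1, 1, 1]
print((recovered_cng == full['fuel_CNG']).all())   # True
```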


### One Hot Encoding using OneHotEncoder

In [63]:
from sklearn.preprocessing import OneHotEncoder

# We selected all columns as input set
# except selling_price which is our output set 
X_train, X_test, y_train, y_test = train_test_split(
    df_cars.drop('selling_price', axis=1), 
    df_cars.loc[:, ['selling_price']],
    test_size=0.2, random_state=10
)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((6502, 4), (1626, 4), (6502, 1), (1626, 1))

In [64]:
X_train.head()

| # | brand | km_driven | fuel | owner |
|---|-------|-----------|------|-------|
| 3365 | Toyota | 90000 | Diesel | First Owner |
| 6948 | Hyundai | 60000 | Petrol | Second Owner |
| 7977 | Ford | 105000 | Diesel | First Owner |
| 6708 | Mahindra | 110000 | Diesel | Second Owner |
| 4652 | Mahindra | 120000 | Diesel | Second Owner |


In [65]:
y_train.head()

| # | selling_price |
|---|---------------|
| 3365 | 415000 |
| 6948 | 190000 |
| 7977 | 550000 |
| 6708 | 575000 |
| 4652 | 980000 |


In [78]:
# Now we will apply OneHotEncoder to the fuel and owner columns

# drop='first' drops the first category of every feature; drop='if_binary' drops a column
# only for features that have exactly 2 categories.
# dtype='int32' creates the binary encoded columns as integers; the default is float.
# sparse_output=False returns a dense array instead of a sparse matrix.
ohencoder = OneHotEncoder(drop='first', dtype='int32', sparse_output=False)
ohencoder = ohencoder.fit(X_train[['fuel', 'owner']])

# Transform train set
X_train_encoded = ohencoder.transform(X_train[['fuel', 'owner']])
X_train_encoded

array([[1, 0, 0, ..., 0, 0, 0],
       [0, 0, 1, ..., 1, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       ...,
       [1, 0, 0, ..., 1, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 1, 0, 0]], dtype=int32)

In [75]:
# Transform test set
X_test_encoded = ohencoder.transform(X_test[['fuel', 'owner']])
X_test_encoded

array([[1, 0, 0, ..., 0, 0, 1],
       [0, 0, 1, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       ...,
       [1, 0, 0, ..., 0, 0, 0],
       [0, 0, 1, ..., 0, 0, 0],
       [0, 0, 1, ..., 0, 0, 0]], dtype=int32)

Now we have to join the encoded arrays with the remaining columns of our training and test data. We will look at two ways to do this: NumPy's horizontal stack (np.hstack) and a ColumnTransformer.

In [76]:
X_train_new = np.hstack((X_train[['brand','km_driven']].values, X_train_encoded))
X_test_new = np.hstack((X_test[['brand', 'km_driven']].values, X_test_encoded))

X_train_new.shape, X_test_new.shape

((6502, 9), (1626, 9))

In [77]:
# The train and test sets are now each combined into one NumPy array
X_train_new[:5], X_test_new[:5]

(array([['Toyota', 90000, 1, 0, 0, 0, 0, 0, 0],
        ['Hyundai', 60000, 0, 0, 1, 0, 1, 0, 0],
        ['Ford', 105000, 1, 0, 0, 0, 0, 0, 0],
        ['Mahindra', 110000, 1, 0, 0, 0, 1, 0, 0],
        ['Mahindra', 120000, 1, 0, 0, 0, 1, 0, 0]], dtype=object),
 array([['Tata', 60000, 1, 0, 0, 0, 0, 0, 1],
        ['Maruti', 30000, 0, 0, 1, 0, 0, 0, 0],
        ['Maruti', 80000, 1, 0, 0, 0, 0, 0, 0],
        ['Mahindra', 30000, 1, 0, 0, 0, 0, 0, 0],
        ['Hyundai', 35000, 0, 0, 1, 0, 0, 0, 0]], dtype=object))

### Encoding the brand column

The brand column has many categories. We will group every brand whose value count is 100 or less into a single 'uncommon' category and then encode the result.

In [102]:
print(df_cars['brand'].nunique())

counts = df_cars['brand'].value_counts()
low_counts = counts[counts <= 100]

low_counts.index

32


Index(['Nissan', 'Jaguar', 'Volvo', 'Datsun', 'Mercedes-Benz', 'Fiat', 'Audi',
       'Lexus', 'Jeep', 'Mitsubishi', 'Land', 'Force', 'Isuzu', 'Ambassador',
       'Kia', 'MG', 'Daewoo', 'Ashok', 'Opel', 'Peugeot'],
      dtype='object', name='brand')

In [101]:
# Encode the brand column using pandas.get_dummies
df_cars_brand_encoded = pd.get_dummies(df_cars['brand'].replace(low_counts.index, 'uncommon'))
df_cars_brand_encoded.head()

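| # | BMW | Chevrolet | Ford | Honda | Hyundai | Mahindra | Maruti | Renault | Skoda | Tata | Toyota | Volkswagen | uncommon |
|---|-----|-----------|------|-------|---------|----------|--------|---------|-------|------|--------|------------|----------|
| 0 | False | False | False | False | False | False | True | False | False | False | False | False | False |
| 1 | False | False | False | False | False | False | False | False | True | False | False | False | False |
| 2 | False | False | False | True | False | False | False | False | False | False | False | False | False |
| 3 | False | False | False | False | True | False | False | False | False | False | False | False | False |
| 4 | False | False | False | False | False | False | True | False | False | False | False | False | False |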
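
The cell above counts brands on the full dataset. When a train/test split is involved, one option is to learn the set of common brands from the training data only and reuse it for the test data, so that brands never seen in training also fall into 'uncommon'. This is a sketch on made-up series; `threshold`, `group_rare` and the toy values are assumptions (the notebook uses a threshold of 100).

```python
import pandas as pd

# Hypothetical brand columns for a train and a test split
train_brand = pd.Series(['Maruti', 'Maruti', 'Maruti', 'Honda', 'Honda', 'Opel'])
test_brand = pd.Series(['Honda', 'Opel', 'Kia'])

threshold = 1  # toy value; brands with count <= threshold are treated as uncommon

# Learn the common brands from the training data only
counts = train_brand.value_counts()
common = counts[counts > threshold].index


def group_rare(brands: pd.Series) -> pd.Series:
    """Keep common brands and replace everything else with 'uncommon'."""
    return brands.where(brands.isin(common), 'uncommon')


train_grouped = group_rare(train_brand)
test_grouped = group_rare(test_brand)

print(train_grouped.tolist())
# ['Maruti', 'Maruti', 'Maruti', 'Honda', 'Honda', 'uncommon']
print(test_grouped.tolist())
# ['Honda', 'uncommon', 'uncommon']
```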
