### Feature encoding

---


#### 1. Ordinal categorical data (Ordered data)


#### **Ordinal encoding**


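If the features `X` contain ordinal data, use `OrdinalEncoder`, which accepts an explicit category order for each column.
If the target `y` is categorical, use `LabelEncoder`, which is meant for 1D targets and assigns codes in sorted (alphabetical) order.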

<img src="../assets/ordinal_encoding.png" />
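
To make the difference concrete, here is a minimal sketch on toy data. The arrays and the order Poor < Average < Good are assumptions made for illustration and are not taken from `customer.csv`.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Toy feature matrix (2D: one row per sample, one column per feature).
X = np.array([["Poor"], ["Good"], ["Average"], ["Poor"]])
# Toy target (1D: one label per sample).
y = np.array(["No", "Yes", "Yes", "No"])

# OrdinalEncoder: the order is given explicitly, one list per feature column.
oe = OrdinalEncoder(categories=[["Poor", "Average", "Good"]])
X_enc = oe.fit_transform(X)

# LabelEncoder: no order can be passed; classes are sorted alphabetically.
le = LabelEncoder()
y_enc = le.fit_transform(y)

print(X_enc.ravel())  # [0. 2. 1. 0.]
print(y_enc)          # [0 1 1 0]
```

`OrdinalEncoder` expects a 2D input and returns floats, while `LabelEncoder` expects a 1D input and returns integers.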


In [40]:
import os
import pandas as pd

In [41]:
path = os.path.join("..", "data", "customer.csv")

df = pd.read_csv(
    path,
    dtype={
        "gender": "category",
        "review": "category",
        "education": "category",
        "purchased": "category",
    },
)
df.sample(5)

|  | age | gender | review | education | purchased |
|---|---|---|---|---|---|
| 44 | 77 | Female | Average | UG | No |
| 30 | 73 | Male | Average | UG | No |
| 27 | 69 | Female | Poor | PG | No |
| 6 | 18 | Male | Good | School | No |
| 34 | 86 | Male | Average | School | No |


In [42]:
print(df["review"].cat.categories)
print(df["education"].cat.categories)
print(df["purchased"].cat.categories)

Index(['Average', 'Good', 'Poor'], dtype='object')
Index(['PG', 'School', 'UG'], dtype='object')
Index(['No', 'Yes'], dtype='object')


In [43]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df.drop("purchased", axis=1), df["purchased"]
)

In [44]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder, LabelEncoder

ct = ColumnTransformer(
    [
        (
            "ordinal",
            OrdinalEncoder(
                categories=[["Poor", "Average", "Good"], ["School", "UG", "PG"]],
            ),
            ["review", "education"],
        ),
    ],
    remainder="passthrough",
)
ct.set_output(transform="pandas")

In [45]:
X_train_trans = ct.fit_transform(X_train)
X_test_trans = ct.transform(X_test)

In [46]:
X_train.sample(5), X_train_trans.sample(5)

(    age  gender review education
 15   75    Male   Poor        UG
 35   74    Male   Poor    School
 36   34  Female   Good        UG
 18   19    Male   Good    School
 9    74    Male   Good        UG,
     ordinal__review  ordinal__education  remainder__age remainder__gender
 19              0.0                 2.0              97              Male
 36              2.0                 1.0              34            Female
 7               0.0                 0.0              60            Female
 16              0.0                 1.0              59              Male
 45              0.0                 2.0              61              Male)

#### **Label encoding**


In [47]:
le = LabelEncoder()
y_train_trans = le.fit_transform(y_train)
y_test_trans = le.transform(y_test)
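
A fitted `LabelEncoder` keeps the mapping it learned, so the integer codes can be turned back into the original labels. The sketch below uses a hand-written toy list rather than the notebook's `y_train`.

```python
from sklearn.preprocessing import LabelEncoder

le_demo = LabelEncoder()
encoded = le_demo.fit_transform(["No", "Yes", "No", "No"])

# classes_ holds the labels in sorted order; the position is the code.
print(le_demo.classes_.tolist())                   # ['No', 'Yes']
print(encoded.tolist())                            # [0, 1, 0, 0]

# inverse_transform maps codes back to labels, e.g. for model predictions.
print(le_demo.inverse_transform([1, 0]).tolist())  # ['Yes', 'No']
```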

---


#### 2. Nominal categorical data (Unordered data)


#### **One hot encoding**


A feature with **n** categories is converted into **n** separate binary columns, one per category.

<img src="../assets/onehot_encoding.png" />


Since the one-hot columns of every row now sum to 1, any one column can be derived from the others. This is the **dummy variable trap**, and it introduces **multicollinearity**.


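**Multicollinearity** is a problem for linear models: when one feature is an exact linear combination of the others, the coefficients are no longer uniquely determined, so their estimates become unstable and hard to interpret.
More generally, a redundant feature adds no new information, so it helps to keep features as independent of each other as possible.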


To avoid this problem, one of the generated columns is dropped.
Hence for **n** categories, only **n-1** columns are generated; the dropped category is represented by a row of all zeros.


In [52]:
import os
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.compose import ColumnTransformer

In [49]:
path = os.path.join("..", "data", "cars.csv")

df = pd.read_csv(path)
df.sample(5)

|  | brand | km_driven | fuel | owner | selling_price |
|---|---|---|---|---|---|
| 4506 | Maruti | 81000 | Diesel | First Owner | 438999 |
| 6366 | Maruti | 30000 | Diesel | First Owner | 370000 |
| 6600 | Mahindra | 70000 | Diesel | Third Owner | 900000 |
| 2226 | Maruti | 70000 | Petrol | First Owner | 580000 |
| 5734 | Maruti | 100000 | Petrol | Second Owner | 100000 |


In [51]:
X_train, X_test, y_train, y_test = train_test_split(
    df.drop("selling_price", axis=1), df["selling_price"]
)
X_train.sample(5), y_train.sample(5)

(        brand  km_driven    fuel                 owner
 6015    Honda      49185  Petrol           First Owner
 3222  Hyundai      19000  Petrol           First Owner
 4234   Jaguar       9000  Diesel           First Owner
 2928  Hyundai      60000  Petrol  Fourth & Above Owner
 7102   Toyota     376412  Diesel          Second Owner,
 119     300000
 3252    535000
 3916    350000
 3993    750000
 3200    400000
 Name: selling_price, dtype: int64)

In [61]:
ct = ColumnTransformer(
    [
        ("ohe", OneHotEncoder(drop="first", sparse_output=False), ["brand", "fuel"]),
        ("ordinal", OrdinalEncoder(), ["owner"]),
    ],
    remainder="passthrough",
)
ct.set_output(transform="pandas")

In [62]:
X_train_trans = ct.fit_transform(X_train)
X_test_trans = ct.transform(X_test)

In [64]:
X_train_trans.sample(5)

|  | ohe__brand_Ashok | ohe__brand_Audi | ohe__brand_BMW | ohe__brand_Chevrolet | ohe__brand_Daewoo | ohe__brand_Datsun | ohe__brand_Fiat | ohe__brand_Force | ohe__brand_Ford | ohe__brand_Honda | ... | ohe__brand_Skoda | ohe__brand_Tata | ohe__brand_Toyota | ohe__brand_Volkswagen | ohe__brand_Volvo | ohe__fuel_Diesel | ohe__fuel_LPG | ohe__fuel_Petrol | ordinal__owner | remainder__km_driven |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3156 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 2.0 | 40000 |
| 3087 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 2.0 | 79328 |
| 5301 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 50000 |
| 5284 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 4.0 | 100000 |
| 270 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 120000 |
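
One thing to watch in the `ordinal__owner` column: when `OrdinalEncoder` is given no `categories`, it sorts the labels alphabetically, which need not match their real-world order. The sketch below uses only the four owner labels visible in the samples above. The ranking in `owner_order` is an assumption. The transformed output also contains the code 4.0, so the real column has at least one more label, which would have to be added to the list.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

owner = np.array(
    [["First Owner"], ["Second Owner"], ["Third Owner"], ["Fourth & Above Owner"]]
)

# Default behaviour: categories are sorted alphabetically.
default_enc = OrdinalEncoder().fit(owner)
print(default_enc.categories_[0].tolist())
# ['First Owner', 'Fourth & Above Owner', 'Second Owner', 'Third Owner']

# Explicit order (assumed here: fewer previous owners ranks lower).
owner_order = ["First Owner", "Second Owner", "Third Owner", "Fourth & Above Owner"]
explicit_enc = OrdinalEncoder(categories=[owner_order]).fit(owner)
print(explicit_enc.transform(owner).ravel().tolist())  # [0.0, 1.0, 2.0, 3.0]
```

With the default order, "Fourth & Above Owner" receives code 1 and sits between "First Owner" and "Second Owner", so passing the order explicitly is the safer choice for ordinal features.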


---
