# Encoding Categorical Values

- Encoding is the process of converting text-based categories (like "Red", "Green", "Blue") into numbers so that machine learning algorithms can perform mathematical operations on them.

- There are two main types, chosen according to whether the categories have an inherent order.

## Label (Ordinal) Encoding
- Use this when there is a natural order or rank (e.g., "Small" < "Medium" < "Large").
- Theory: Assigns each category a unique integer (0, 1, 2...).
- Formula: for $n$ categories, each category is mapped to one integer, $x \rightarrow \{0, 1, 2, \dots, n-1\}$
- Pros: Simple and memory-efficient.
- Cons: If used on unordered data (like "City"), the model may wrongly infer that "London" (2) is "greater than" "Paris" (1).
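
The mapping described above can be sketched by hand with plain pandas. This is a minimal illustration, assuming a made-up `sizes` column and an `order` list chosen for the example; neither comes from the dataset used later.

```python
import pandas as pd

# Hypothetical ordered category, listed from lowest to highest rank
order = ["Small", "Medium", "Large"]
sizes = pd.Series(["Small", "Large", "Medium", "Small"])

# Build category -> integer codes 0..n-1 that follow the chosen order
mapping = {category: rank for rank, category in enumerate(order)}
encoded = sizes.map(mapping)

print(encoded.tolist())  # [0, 2, 1, 0]
```

Because the codes follow `order`, comparisons such as `encoded > 0` keep their meaning ("larger than Small").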


## One-Hot Encoding (Nominal)
- Use this when there is no order (e.g., "Color" or "Country").
- Theory: Creates a new binary column (0 or 1) for every unique category in the original column.
- Formula: For $n$ categories, create $n$ columns where $x_i = 1$ if the value is the $i$-th category, else $0$.
- Pros: No fake "ranking" is created; the model treats all categories equally.
- Cons: Can lead to the curse of dimensionality when a column has hundreds of unique categories, because each category adds one more column.
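
As a rough sketch of the idea above, `pd.get_dummies` is one way to produce the binary columns. The `colors` frame is invented for illustration and is not part of the dataset used below.

```python
import pandas as pd

# Hypothetical nominal column with three unique categories
colors = pd.DataFrame({"color": ["Red", "Green", "Blue", "Green"]})

# One new 0/1 column per unique category; dtype=int gives 0/1 instead of booleans
one_hot = pd.get_dummies(colors, columns=["color"], dtype=int)

print(one_hot.columns.tolist())  # ['color_Blue', 'color_Green', 'color_Red']
print(one_hot)
```

Each row has exactly one `1`, so no ordering between the colors is implied.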

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv('customer.csv')

In [3]:
df.sample(5)

Unnamed: 0,age,gender,review,education,purchased
37,94,Male,Average,PG,Yes
11,74,Male,Good,UG,Yes
8,65,Female,Average,UG,No
32,92,Male,Average,UG,Yes
7,60,Female,Poor,School,Yes


In [4]:
# Drop age and gender; keep review, education and purchased
df = df.iloc[:, 2:]

In [5]:
df.head()

Unnamed: 0,review,education,purchased
0,Average,School,No
1,Poor,UG,No
2,Good,PG,No
3,Good,PG,No
4,Average,UG,No


In [7]:
from sklearn.model_selection import train_test_split

# Features: review and education; target: purchased (last column)
x_train, x_test, y_train, y_test = train_test_split(
    df.iloc[:, 0:2], df.iloc[:, -1], test_size=0.2
)
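
The split above draws different rows on every run. If repeatable results are wanted, one option is to pass `random_state`. The sketch below uses a small invented frame (`toy`) rather than `customer.csv`, and the seed value 42 is an arbitrary choice.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the review/education/purchased columns
toy = pd.DataFrame({
    "review": ["Poor", "Average", "Good", "Good", "Poor",
               "Average", "Good", "Poor", "Average", "Good"],
    "education": ["School", "UG", "PG", "UG", "School",
                  "PG", "UG", "PG", "School", "UG"],
    "purchased": ["No", "Yes", "No", "Yes", "No",
                  "Yes", "No", "Yes", "No", "Yes"],
})
features = toy[["review", "education"]]
target = toy["purchased"]

# Same seed -> same rows in train and test on every call
split_a = train_test_split(features, target, test_size=0.2, random_state=42)
split_b = train_test_split(features, target, test_size=0.2, random_state=42)
same_rows = split_a[0].index.equals(split_b[0].index)

print(len(split_a[0]), len(split_a[1]), same_rows)  # 8 2 True
```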

In [9]:
from sklearn.preprocessing import OrdinalEncoder


In [10]:
# List each column's categories from lowest to highest so the codes follow the intended rank
oe = OrdinalEncoder(categories=[['Poor', 'Average', 'Good'], ['School', 'UG', 'PG']])

In [11]:
oe.fit(x_train)

parameter,value
categories,"[['Poor', 'Average', ...], ['School', 'UG', ...]]"
dtype,<class 'numpy.float64'>
handle_unknown,'error'
unknown_value,None
encoded_missing_value,nan
min_frequency,None
max_categories,None


In [14]:

x_train

Unnamed: 0,review,education
49,Good,UG
26,Poor,PG
40,Good,School
12,Poor,School
28,Poor,School
46,Poor,PG
37,Average,PG
7,Poor,School
31,Poor,School
20,Average,School


In [15]:
oe.categories_


[array(['Poor', 'Average', 'Good'], dtype=object),
 array(['School', 'UG', 'PG'], dtype=object)]
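
Fitting only learns the category order; the integer codes appear once `transform` is called. A minimal sketch, assuming a three-row invented frame (`sample`) and a separate encoder (`oe_demo`) configured with the same category lists as above:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical rows using the same review/education categories
sample = pd.DataFrame({
    "review": ["Good", "Poor", "Average"],
    "education": ["UG", "PG", "School"],
})

oe_demo = OrdinalEncoder(
    categories=[["Poor", "Average", "Good"], ["School", "UG", "PG"]]
)

# fit_transform learns the categories and returns a float array of codes
encoded = oe_demo.fit_transform(sample)
print(encoded)
```

With the notebook's own objects, the equivalent step would be `oe.transform(x_train)` and `oe.transform(x_test)`, reusing the encoder fitted on the training data.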

In [17]:
from sklearn.preprocessing import LabelEncoder

In [18]:

le = LabelEncoder()


In [19]:
le.fit(y_train)

In [20]:
le.classes_

array(['No', 'Yes'], dtype=object)

In [21]:

# Encode both targets with the mapping learned from y_train ('No' -> 0, 'Yes' -> 1)
y_train = le.transform(y_train)
y_test = le.transform(y_test)

In [22]:
y_train

array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1,
       1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1])
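
To read encoded targets back as the original labels, `LabelEncoder` also offers `inverse_transform`. The sketch below uses an invented `labels` array and a separate encoder (`le_demo`) rather than the notebook's `y_train`.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical target values
labels = np.array(["No", "Yes", "Yes", "No"])

le_demo = LabelEncoder()
codes = le_demo.fit_transform(labels)       # classes are sorted: 'No' -> 0, 'Yes' -> 1
decoded = le_demo.inverse_transform(codes)  # back to the original strings

print(codes.tolist())    # [0, 1, 1, 0]
print(decoded.tolist())  # ['No', 'Yes', 'Yes', 'No']
```

`LabelEncoder` sorts its classes alphabetically and is intended for the target column; for ordered input features, `OrdinalEncoder` with an explicit `categories` list (as used earlier) keeps the intended rank.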