
# **Ordinal Encoding**


### ✅ What is Ordinal Encoding?

**Ordinal Encoding** is used to **convert categorical data** (that has an order) into **integers that keep that order**.

---

### 🔤 Example:

Suppose you have a feature called **"Size"** with values:

```
['Small', 'Medium', 'Large']
```

These have a **meaningful order**, right?

With **Ordinal Encoding**, we convert them to:

```
Small  →  0  
Medium →  1  
Large  →  2
```

Now the model can understand the **order**:
`Small < Medium < Large`
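The mapping above can be sketched by hand with a plain dictionary. This is a minimal illustration, assuming a small made-up `Size` column (the sample values and the `size_order` name are hypothetical, not from the dataset used later):

```python
import pandas as pd

# Hypothetical sample values for the "Size" feature.
sizes = pd.Series(['Small', 'Large', 'Medium', 'Small'], name='Size')

# Explicit mapping that follows the natural order Small < Medium < Large.
size_order = {'Small': 0, 'Medium': 1, 'Large': 2}

# Replace each category with its integer code.
encoded = sizes.map(size_order)
print(encoded.tolist())  # [0, 2, 1, 0]
```

Writing the mapping yourself keeps the order fully under your control, which is the whole point of ordinal encoding.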

---

### 🟡 When to use?

Use **Ordinal Encoding** when:

* The categories have a **natural order**
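* The gaps between categories do not have to be equal (e.g., size, level, rating), but remember that the encoded values are plain integers, so many models will treat them as evenly spaced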
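One thing to watch for: if you do not tell scikit-learn's `OrdinalEncoder` the order, it sorts the categories itself, which for strings means alphabetical order. The sketch below compares both cases on a hypothetical one-column `Size` frame:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical one-column feature (OrdinalEncoder expects 2D input).
X = pd.DataFrame({'Size': ['Small', 'Medium', 'Large']})

# Default: categories are inferred and sorted, so Large=0, Medium=1, Small=2.
default_codes = OrdinalEncoder().fit_transform(X).ravel().tolist()
print(default_codes)   # [2.0, 1.0, 0.0]

# Explicit order: Small=0, Medium=1, Large=2.
explicit = OrdinalEncoder(categories=[['Small', 'Medium', 'Large']])
explicit_codes = explicit.fit_transform(X).ravel().tolist()
print(explicit_codes)  # [0.0, 1.0, 2.0]
```

So for ordered data it is usually safer to pass `categories` explicitly rather than rely on the default.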




# **Label Encoding**


### ✅ What is Label Encoding?

**Label Encoding** is used to **convert categorical (text) values** into **numeric labels**.

It **assigns a unique number** to each category, **without implying any order**.

---

### 🔤 Example:

If you have a feature called **"Color"**:

```
['Red', 'Blue', 'Green']
```

Label Encoding will convert it to (scikit-learn's `LabelEncoder` numbers the categories in sorted, i.e. alphabetical, order):

```
Blue  →  0  
Green →  1  
Red   →  2
```

⚠️ These numbers **do not mean any ranking** — they're just identifiers.
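With scikit-learn's `LabelEncoder`, the integer codes follow the sorted order of the class labels, so the exact numbers can differ from a hand-written mapping. A minimal sketch on a hypothetical color list:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample values for the "Color" feature.
colors = ['Red', 'Blue', 'Green', 'Blue']

le = LabelEncoder()
codes = le.fit_transform(colors)

# classes_ holds the labels in sorted order; a label's index is its code.
print(le.classes_.tolist())  # ['Blue', 'Green', 'Red']
print(codes.tolist())        # [2, 0, 1, 0]
```

The codes are just positions in `classes_`; they carry no ranking between the colors.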

---

### 🟡 Where is Label Encoding used?

* ✅ Mostly used for the **output/target column** (e.g., converting `"Yes"`/`"No"` or class labels like `"Dog"`, `"Cat"` to numbers).
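* Can be used on input features **only if** the categories are **unordered** and the model does not rely on the numeric order of the codes (like tree-based models). Note that scikit-learn's `LabelEncoder` is designed for the target `y` and accepts a single 1D column.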
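The difference in intended use also shows up in the input shapes: `LabelEncoder` works on one 1D column (the target), while `OrdinalEncoder` works on a 2D block of features. A small sketch, assuming a made-up frame with `education` and `purchased` columns:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical data: one ordered feature and a Yes/No target.
df = pd.DataFrame({
    'education': ['School', 'UG', 'PG', 'UG'],
    'purchased': ['No', 'Yes', 'Yes', 'No'],
})

# Target: 1D Series in, 1D integer array out.
y = LabelEncoder().fit_transform(df['purchased'])
print(y.shape, y.tolist())          # (4,) [0, 1, 1, 0]

# Feature: 2D DataFrame in, 2D float array out.
oe = OrdinalEncoder(categories=[['School', 'UG', 'PG']])
X = oe.fit_transform(df[['education']])
print(X.shape, X.ravel().tolist())  # (4, 1) [0.0, 1.0, 2.0, 1.0]
```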




In [11]:
import numpy as np
import pandas as pd

In [12]:
df = pd.read_csv('/content/customer.csv')

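In [33]:
# Output reflects a re-run (In [33]) after the column selection below.
df.head(5)

| | review | education | purchased |
|---|---|---|---|
| 0 | Average | School | No |
| 1 | Poor | UG | No |
| 2 | Good | PG | No |
| 3 | Good | PG | No |
| 4 | Average | UG | No |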


In [15]:
# Keep the columns from position 2 onward (review, education, purchased).
df = df.iloc[:, 2:]

In [16]:
df.head()

| | review | education | purchased |
|---|---|---|---|
| 0 | Average | School | No |
| 1 | Poor | UG | No |
| 2 | Good | PG | No |
| 3 | Good | PG | No |
| 4 | Average | UG | No |


In [17]:
from sklearn.preprocessing import OrdinalEncoder

In [19]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df.drop('purchased', axis=1),
                                                    df['purchased'],
                                                    test_size=0.3,
                                                    random_state=0)

X_train.shape, X_test.shape


((35, 2), (15, 2))

In [20]:
X_train

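| | review | education |
|---|---|---|
| 7 | Poor | School |
| 14 | Poor | PG |
| 45 | Poor | PG |
| 48 | Good | UG |
| 29 | Average | UG |
| 15 | Poor | UG |
| 30 | Average | UG |
| 32 | Average | UG |
| 16 | Poor | UG |
| 42 | Good | PG |

(first 10 of 35 rows shown)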


In [21]:
# List each column's categories from lowest to highest so the codes follow that order.
oe = OrdinalEncoder(categories=[['Poor', 'Average', 'Good'], ['School', 'UG', 'PG']])

In [22]:
oe.fit(X_train)

In [23]:
# Reuse the encoder fitted on the training data for both splits.
X_train = oe.transform(X_train)
X_test = oe.transform(X_test)

In [24]:
X_train

array([[0., 0.],
       [0., 2.],
       [0., 2.],
       [2., 1.],
       [1., 1.],
       [0., 1.],
       [1., 1.],
       [1., 1.],
       [0., 1.],
       [2., 2.],
       [1., 0.],
       [0., 2.],
       [1., 1.],
       [1., 0.],
       [2., 0.],
       [1., 0.],
       [0., 1.],
       [2., 0.],
       [2., 1.],
       [0., 1.],
       [0., 0.],
       [1., 2.],
       [1., 2.],
       [2., 0.],
       [2., 0.],
       [2., 1.],
       [1., 2.],
       [0., 2.],
       [2., 1.],
       [0., 2.],
       [0., 2.],
       [2., 2.],
       [1., 0.],
       [2., 2.],
       [1., 1.]])
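When the same fitted encoder is later applied to the test split, a category that was not in the `categories` list raises an error by default. The sketch below shows one possible way to handle that, assuming scikit-learn 0.24 or newer, where `OrdinalEncoder` accepts `handle_unknown='use_encoded_value'`. The tiny frames and the `'Excellent'` value are hypothetical:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical train/test frames with the same two columns as above.
train = pd.DataFrame({'review': ['Poor', 'Average', 'Good'],
                      'education': ['School', 'UG', 'PG']})
test = pd.DataFrame({'review': ['Good', 'Excellent'],  # 'Excellent' is unseen
                     'education': ['PG', 'UG']})

oe = OrdinalEncoder(
    categories=[['Poor', 'Average', 'Good'], ['School', 'UG', 'PG']],
    handle_unknown='use_encoded_value',  # do not raise on unseen categories
    unknown_value=-1,                    # code assigned to unseen categories
)
oe.fit(train)

encoded_test = oe.transform(test)
print(encoded_test.tolist())  # [[2.0, 2.0], [-1.0, 1.0]]
```

Whether `-1` is a sensible code for unseen categories depends on the model, so treat this as an option rather than a default.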

In [25]:
oe.categories_

[array(['Poor', 'Average', 'Good'], dtype=object),
 array(['School', 'UG', 'PG'], dtype=object)]
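Because the encoder stores the category lists in `categories_`, the codes can be turned back into the original labels with `inverse_transform`. A minimal sketch, assuming a three-row sample shaped like the training data above:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical three-row sample with the same columns as X_train.
sample = pd.DataFrame({'review': ['Poor', 'Poor', 'Good'],
                       'education': ['School', 'PG', 'UG']})

oe = OrdinalEncoder(categories=[['Poor', 'Average', 'Good'],
                                ['School', 'UG', 'PG']])
codes = oe.fit_transform(sample)
print(codes.tolist())    # [[0.0, 0.0], [0.0, 2.0], [2.0, 1.0]]

# Map the integer codes back to the original category labels.
decoded = oe.inverse_transform(codes)
print(decoded.tolist())  # [['Poor', 'School'], ['Poor', 'PG'], ['Good', 'UG']]
```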

In [26]:
from sklearn.preprocessing import LabelEncoder

In [27]:
le = LabelEncoder()

In [28]:
le.fit(y_train)

In [29]:
le.classes_

array(['No', 'Yes'], dtype=object)

In [30]:
y_train = le.transform(y_train)
y_test = le.transform(y_test)

In [31]:
y_train

array([1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1,
       1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0])
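Two follow-ups are worth knowing once the target is encoded: predictions can be decoded back to `'No'`/`'Yes'` with `inverse_transform`, and `transform` raises a `ValueError` for a label the encoder did not see during `fit`. A small sketch with hypothetical labels (the `'Maybe'` value is made up to trigger the error):

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(['No', 'Yes', 'No'])  # hypothetical training labels

# Decode integer predictions back to the original class labels.
decoded = le.inverse_transform([1, 0, 1])
print(decoded.tolist())  # ['Yes', 'No', 'Yes']

# A label that was not present during fit cannot be transformed.
try:
    le.transform(['Maybe'])
    unseen_raised = False
except ValueError:
    unseen_raised = True
print(unseen_raised)  # True
```

This is why the encoder is fitted on `y_train` and only applied to `y_test`: both splits need to share the same set of classes.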