<a href="https://colab.research.google.com/github/samiha-mahin/Data-Analysis/blob/main/Feature_Engineering_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Ordinal Encoding**


### ✅ What is Ordinal Encoding?

**Ordinal Encoding** is used to **convert categorical data** (that has an order) into **numbers**.

---

### 🔤 Example:

Suppose you have a feature called **"Size"** with values:

```
['Small', 'Medium', 'Large']
```

These have a **meaningful order**, right?

With **Ordinal Encoding**, we convert them to:

```
Small  →  0  
Medium →  1  
Large  →  2
```

Now the model can understand the **order**:
`Small < Medium < Large`

---

### 🟡 When to use?

Use **Ordinal Encoding** when:

* The categories have a **natural order**
* The gaps between categories do not have to be equal (e.g., size, level, rating)
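
A minimal sketch of the Size example using scikit-learn's `OrdinalEncoder`. The toy data and variable names are made up for illustration; the key point is that the order is passed explicitly through `categories`.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy data, invented for illustration
sizes = pd.DataFrame({'Size': ['Medium', 'Small', 'Large', 'Small']})

# Pass the order explicitly; without it the categories are sorted alphabetically
encoder = OrdinalEncoder(categories=[['Small', 'Medium', 'Large']])
size_codes = encoder.fit_transform(sizes)

print(size_codes.ravel().tolist())  # [1.0, 0.0, 2.0, 0.0]
```

The encoder returns floats, one column per input column.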




# **Label Encoding**


### ✅ What is Label Encoding?

**Label Encoding** is used to **convert categorical (text) values** into **numeric labels**.

It **assigns a unique number** to each category, **without implying any order**.

---

### 🔤 Example:

If you have a feature called **"Color"**:

```
['Red', 'Blue', 'Green']
```

scikit-learn's `LabelEncoder` sorts the categories alphabetically and numbers them in that order:

```
Blue  →  0  
Green →  1  
Red   →  2
```

⚠️ These numbers **do not mean any ranking**; they're just identifiers.

---

### 🟡 Where is Label Encoding used?

* ✅ Mostly used for the **output/target column** (e.g., converting `"Yes"`/`"No"` or class labels like `"Dog"`, `"Cat"` to numbers). scikit-learn's `LabelEncoder` is meant for this: it takes a single 1-D column.
* For **unordered** input features, plain integer codes are only safe when the model does not read them as magnitudes (tree-based models usually cope); otherwise prefer One Hot Encoding.
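
A small sketch of `LabelEncoder` on a target column. The labels are invented for illustration; the point is that `classes_` is sorted, and `inverse_transform` recovers the original text.

```python
from sklearn.preprocessing import LabelEncoder

# Toy target column, invented for illustration
labels = ['Dog', 'Cat', 'Dog', 'Bird']

label_encoder = LabelEncoder()
encoded = label_encoder.fit_transform(labels)

print(label_encoder.classes_)  # ['Bird' 'Cat' 'Dog']
print(encoded)                 # [2 1 2 0]

# Map the integers back to the original labels
decoded = label_encoder.inverse_transform(encoded)
print(decoded.tolist())        # ['Dog', 'Cat', 'Dog', 'Bird']
```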




In [None]:
import numpy as np
import pandas as pd

In [None]:
df = pd.read_csv('/content/customer.csv')

In [None]:
df.head(5)

Unnamed: 0,review,education,purchased
0,Average,School,No
1,Poor,UG,No
2,Good,PG,No
3,Good,PG,No
4,Average,UG,No


In [None]:
# Drop the first two columns; keep review, education and purchased
df = df.iloc[:, 2:]

In [None]:
df.head()

Unnamed: 0,review,education,purchased
0,Average,School,No
1,Poor,UG,No
2,Good,PG,No
3,Good,PG,No
4,Average,UG,No


In [None]:
from sklearn.preprocessing import OrdinalEncoder

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df.drop('purchased', axis=1),
                                                    df['purchased'],
                                                    test_size=0.3,
                                                    random_state=0)

X_train.shape, X_test.shape


((35, 2), (15, 2))

In [None]:
X_train

Unnamed: 0,review,education
7,Poor,School
14,Poor,PG
45,Poor,PG
48,Good,UG
29,Average,UG
15,Poor,UG
30,Average,UG
32,Average,UG
16,Poor,UG
42,Good,PG


In [None]:
oe = OrdinalEncoder(categories=[['Poor','Average','Good'],['School','UG','PG']])

In [None]:
oe.fit(X_train)

In [None]:
X_train = oe.transform(X_train)
X_test = oe.transform(X_test)  # reuse the encoder fitted on the training data

In [None]:
X_train

array([[0., 0.],
       [0., 2.],
       [0., 2.],
       [2., 1.],
       [1., 1.],
       [0., 1.],
       [1., 1.],
       [1., 1.],
       [0., 1.],
       [2., 2.],
       [1., 0.],
       [0., 2.],
       [1., 1.],
       [1., 0.],
       [2., 0.],
       [1., 0.],
       [0., 1.],
       [2., 0.],
       [2., 1.],
       [0., 1.],
       [0., 0.],
       [1., 2.],
       [1., 2.],
       [2., 0.],
       [2., 0.],
       [2., 1.],
       [1., 2.],
       [0., 2.],
       [2., 1.],
       [0., 2.],
       [0., 2.],
       [2., 2.],
       [1., 0.],
       [2., 2.],
       [1., 1.]])

In [None]:
oe.categories_

[array(['Poor', 'Average', 'Good'], dtype=object),
 array(['School', 'UG', 'PG'], dtype=object)]
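
To see why the `categories` argument matters, here is a small comparison on a toy frame (invented for illustration, reusing the review labels). Without an explicit order, `OrdinalEncoder` falls back to alphabetical sorting, which would rank 'Poor' highest.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frame, invented for illustration
reviews = pd.DataFrame({'review': ['Poor', 'Average', 'Good']})

# Default: categories sorted alphabetically (Average=0, Good=1, Poor=2)
default_codes = OrdinalEncoder().fit_transform(reviews).ravel()

# Explicit order: Poor=0, Average=1, Good=2
ordered_codes = OrdinalEncoder(
    categories=[['Poor', 'Average', 'Good']]
).fit_transform(reviews).ravel()

print(default_codes.tolist())  # [2.0, 0.0, 1.0]
print(ordered_codes.tolist())  # [0.0, 1.0, 2.0]
```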

In [None]:
from sklearn.preprocessing import LabelEncoder

In [None]:
le = LabelEncoder()

In [None]:
le.fit(y_train)

In [None]:
le.classes_

array(['No', 'Yes'], dtype=object)

In [None]:
y_train = le.transform(y_train)
y_test = le.transform(y_test)

In [None]:
y_train

array([1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1,
       1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0])

# **One Hot Encoding**


### ✅ What is One Hot Encoding?

**One Hot Encoding** is a technique to convert **categorical values into binary (0/1) columns** — one column for **each category**.

This is especially useful when **categories have no order** (i.e., nominal data).

---

### 🔤 Example:

Suppose you have a `Color` column:

| Color |
| ----- |
| Red   |
| Blue  |
| Green |

If you use **One Hot Encoding**, it becomes:

| Color\_Blue | Color\_Green | Color\_Red |
| ----------- | ------------ | ---------- |
| 0           | 0            | 1          |
| 1           | 0            | 0          |
| 0           | 1            | 0          |

* Each category becomes a **new column**.
* The value is **1** in the column that matches the category, and **0** in others.

---

### 🛠 How to do it in Python (with pandas)?

```python
import pandas as pd

df = pd.DataFrame({'Color': ['Red', 'Blue', 'Green']})

# dtype=int gives 0/1 columns; pandas 2.x returns True/False by default
encoded_df = pd.get_dummies(df, dtype=int)
print(encoded_df)
```

✅ Output:

```
   Color_Blue  Color_Green  Color_Red
0           0            0          1
1           1            0          0
2           0            1          0
```

---

### 💡 Why One Hot Encoding?

* Prevents the model from thinking that **one category is greater than another**.
* Especially useful for algorithms like **Logistic Regression**, **Linear Regression**, etc., which treat every input as a number and would read integer codes as magnitudes.


# **Column Transformer**



### 🧰 What is `ColumnTransformer`?

`ColumnTransformer` lets you **apply different preprocessing steps to different columns** of your dataset **at once**.

👉 It’s super helpful when your dataset has **both numerical and categorical columns**, and you want to treat them differently.

---

### 💡 Why use it?

In real-world datasets:

* You might want to **scale** numerical features (like Age, Salary)
* And **encode** categorical features (like Gender, City)

**Instead of doing them separately**, `ColumnTransformer` does them **in one clean pipeline**.

---

### 🧪 Example:

Let’s say your dataset looks like this:

```python
import pandas as pd

df = pd.DataFrame({
    'Age': [25, 30, 35, 40],
    'Salary': [50000, 60000, 70000, 80000],
    'City': ['Paris', 'London', 'Paris', 'Berlin']
})
```

You want to:

* **Scale** `Age` and `Salary`
* **One-hot encode** `City`

---

### ⚙️ Using `ColumnTransformer`:

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Define which columns need what transformation
transformer = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['Age', 'Salary']),        # scale numeric columns
        ('cat', OneHotEncoder(), ['City'])                   # one-hot encode categorical column
    ]
)

transformed = transformer.fit_transform(df)
```

---

### 📊 Result:

The output is a **NumPy array** where:

* The first 2 columns are the scaled versions of `Age` and `Salary`
* The next 3 columns are the one-hot encoded `City`, in alphabetical order of the categories (Berlin, London, Paris), so 'London' becomes `[0 1 0]`

---

### ✅ Summary:

| Feature Type            | Transformation   | Tool Used             |
| ----------------------- | ---------------- | --------------------- |
| Numerical (Age, Salary) | Standardization  | `StandardScaler()`    |
| Categorical (City)      | One-hot encoding | `OneHotEncoder()`     |
| Combined with           |                  | `ColumnTransformer()` |



In [1]:
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder

In [2]:
df = pd.read_csv('/content/covid_toy.csv')

In [14]:
df.head()

Unnamed: 0,age,gender,fever,cough,city,has_covid
0,60,Male,103.0,Mild,Kolkata,No
1,27,Male,100.0,Mild,Delhi,Yes
2,42,Male,101.0,Mild,Delhi,No
3,31,Female,98.0,Mild,Kolkata,No
4,65,Female,101.0,Mild,Mumbai,No


In [16]:
df.iloc[:,:]

Unnamed: 0,age,gender,fever,cough,city,has_covid
0,60,Male,103.0,Mild,Kolkata,No
1,27,Male,100.0,Mild,Delhi,Yes
2,42,Male,101.0,Mild,Delhi,No
3,31,Female,98.0,Mild,Kolkata,No
4,65,Female,101.0,Mild,Mumbai,No
...,...,...,...,...,...,...
95,12,Female,104.0,Mild,Bangalore,No
96,51,Female,101.0,Strong,Kolkata,Yes
97,20,Female,101.0,Mild,Bangalore,No
98,5,Female,98.0,Strong,Mumbai,No


In [4]:
df.isnull().sum()

Unnamed: 0,0
age,0
gender,0
fever,10
cough,0
city,0
has_covid,0


To fill the missing values we will use SimpleImputer.


## **🧩 What is SimpleImputer?**

`SimpleImputer` is a tool from sklearn used to fill in missing values (NaN) in your dataset using a basic strategy like:

* Mean
* Median
* Most frequent value (mode)
* Constant value

## **💡 Why do we need it?**
Many machine learning models can’t handle missing values. So we use SimpleImputer to fill them in automatically before training.
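
A minimal sketch of mean imputation on a toy fever column. The values are invented for illustration; `strategy='mean'` is also the default.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy column with one missing reading, invented for illustration
fever = pd.DataFrame({'fever': [98.0, np.nan, 102.0, 100.0]})

imputer = SimpleImputer(strategy='mean')
filled = imputer.fit_transform(fever)

print(imputer.statistics_.tolist())  # [100.0]  (mean of the observed values)
print(filled.ravel().tolist())       # [98.0, 100.0, 102.0, 100.0]
```

Fitting on the training split and then calling `transform` on the test split fills test gaps with the training mean.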



In [5]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(df.drop(columns=['has_covid']),df['has_covid'],
                                                test_size=0.2)

In [6]:
X_train

Unnamed: 0,age,gender,fever,cough,city
49,44,Male,104.0,Mild,Mumbai
37,55,Male,100.0,Mild,Kolkata
19,42,Female,,Strong,Bangalore
75,5,Male,102.0,Mild,Kolkata
43,22,Female,99.0,Mild,Bangalore
...,...,...,...,...,...
3,31,Female,98.0,Mild,Kolkata
25,23,Male,,Mild,Mumbai
39,50,Female,103.0,Mild,Kolkata
80,14,Female,99.0,Mild,Mumbai


In [7]:
from sklearn.compose import ColumnTransformer

In [9]:
transformer = ColumnTransformer(transformers=[('tnf1',SimpleImputer(),['fever']),
                                              ('tnf2',OrdinalEncoder(categories=[['Mild','Strong']]),['cough']),
                                              # sparse=False was renamed to sparse_output=False in scikit-learn 1.2; the default is used here
                                              ('tnf3',OneHotEncoder(drop='first'),['gender','city'])
                                              ],remainder='passthrough')

## **✅ OneHotEncoder with drop='first':**
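When you set `drop='first'`, the encoder drops the first category of each column. Categories are sorted alphabetically by default, so that is the alphabetically first one.

**For gender:**

* Categories: `['Female', 'Male']`
* Dropped: `'Female'`
* Encoding: `'Female'` → `0`, `'Male'` → `1`

**For city:**

* Categories: `['Bangalore', 'Delhi', 'Kolkata', 'Mumbai']`
* Dropped: `'Bangalore'`
* Encoding:
  * `'Bangalore'` → `[0, 0, 0]`
  * `'Delhi'` → `[1, 0, 0]`
  * `'Kolkata'` → `[0, 1, 0]`
  * `'Mumbai'` → `[0, 0, 1]`

## **🧾 Transformed Data:**

The first five rows of `X_train` shown above encode to:

| gender\_Male | city\_Delhi | city\_Kolkata | city\_Mumbai |
| ------------ | ----------- | ------------- | ------------ |
| 1            | 0           | 0             | 1            |
| 1            | 0           | 1             | 0            |
| 0            | 0           | 0             | 0            |
| 1            | 0           | 1             | 0            |
| 0            | 0           | 0             | 0            |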


In [10]:
transformer.fit_transform(X_train).shape

(80, 7)

In [13]:
transformer.transform(X_test).shape

(20, 7)