## Handling Nominal categorical data:

`Nominal categorical data` is a type of qualitative data that uses labels or tags to classify observations into groups that have no numerical value and no natural order.

<img src="images/nominal.png" alt="Image Description" width="550" height="250">

## One Hot Encoding:

`One-hot encoding` is a technique used to convert categorical data into a binary format where each category is represented by a separate column with a 1 indicating its presence and 0s for all other categories.

<img src="images/ohe.png" alt="Image Description" width="550" height="230">
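To make the definition concrete, here is a minimal sketch that builds the indicator columns by hand, without any encoder. The toy values and the `race_` column prefix are made up for illustration and are not read from the dataset.

```python
import pandas as pd

# Toy column; the labels are illustrative only
race = pd.Series(["group B", "group C", "group A", "group B"], name="race")

# One indicator column per category: 1 where the row has that label, else 0
categories = sorted(race.unique())
encoded = pd.DataFrame(
    {f"race_{c}": (race == c).astype(int) for c in categories}
)

print(encoded)
# Each row has exactly one 1, because a row belongs to exactly one category
print(encoded.sum(axis=1).tolist())
```

In practice `pd.get_dummies` or `OneHotEncoder` does this work for you; the manual version only shows what the output means.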

In [2]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv('Datasets/data_for_ohe.csv')

In [3]:
df.head()

Unnamed: 0,gender,race/ethnicity,lunch,math score,reading score
0,female,group B,standard,72,72
1,female,group C,standard,69,90
2,female,group B,standard,90,95
3,male,group A,free/reduced,47,57
4,male,group C,standard,76,78


In [8]:
print(df['gender'].unique())
print(df['race/ethnicity'].unique())
print(df['lunch'].unique())

['female' 'male']
['group B' 'group C' 'group A' 'group D' 'group E']
['standard' 'free/reduced']


The three categorical columns have 2 + 5 + 2 = 9 distinct categories in total, so the `get_dummies` function will create 9 new columns and drop the three original columns.
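The column count can be checked before encoding by summing the number of unique values per categorical column. The sketch below uses a small stand-in frame (`toy`, an assumed name) with the same categories as the notebook's dataset rather than the real CSV.

```python
import pandas as pd

# Small stand-in frame with the same categorical columns as the notebook
toy = pd.DataFrame({
    "gender": ["female", "male", "female", "male", "female"],
    "race/ethnicity": ["group A", "group B", "group C", "group D", "group E"],
    "lunch": ["standard", "free/reduced", "standard", "standard", "free/reduced"],
    "math score": [72, 69, 90, 47, 76],
})
cat_cols = ["gender", "race/ethnicity", "lunch"]

# Number of dummy columns = sum of unique values over the categorical columns
per_column = toy[cat_cols].nunique()
expected_new = int(per_column.sum())

encoded = pd.get_dummies(toy, columns=cat_cols, dtype=int)

print(per_column.to_dict())
print(expected_new, encoded.shape)
```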

## 1. One Hot Encoding Using Pandas:

- Using `get_dummies` function:

In [9]:
temp = pd.get_dummies(df,columns=['gender','race/ethnicity','lunch'],dtype=np.int32)

In [10]:
temp.sample(4)

Unnamed: 0,math score,reading score,gender_female,gender_male,race/ethnicity_group A,race/ethnicity_group B,race/ethnicity_group C,race/ethnicity_group D,race/ethnicity_group E,lunch_free/reduced,lunch_standard
924,74,70,0,1,0,0,0,1,0,1,0
109,70,64,1,0,0,1,0,0,0,0,1
333,90,78,0,1,0,1,0,0,0,0,1
133,75,81,1,0,0,0,1,0,0,0,1


In [12]:
temp.shape

(1000, 11)

### `drop_first = True`

- This parameter drops the first dummy column of each encoded feature, i.e. `three` columns in total.    
- It is used to eliminate multicollinearity from the data, so that no dummy column can be derived from the others.

- `Multicollinearity` in machine learning (ML) is a situation where two or more independent variables in a regression model are highly correlated. This can negatively impact the model's performance and the accuracy of its predictions. 
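A small sketch of why the full set of dummy columns is redundant: the dummies of one feature always sum to 1, so next to an intercept column one of them carries no new information. The intercept column and the rank check are assumptions added for illustration; the notebook itself does not fit a regression model.

```python
import numpy as np
import pandas as pd

lunch = pd.Series(["standard", "free/reduced", "standard", "standard"], name="lunch")

full = pd.get_dummies(lunch, dtype=int)                      # both dummy columns
reduced = pd.get_dummies(lunch, dtype=int, drop_first=True)  # first one dropped

# The complete set of dummies sums to 1 in every row
row_sums = full.sum(axis=1).tolist()

# With an intercept column, the full design matrix is rank deficient
intercept = np.ones((len(lunch), 1))
X_full = np.hstack([intercept, full.to_numpy()])
X_reduced = np.hstack([intercept, reduced.to_numpy()])

rank_full = np.linalg.matrix_rank(X_full)
rank_reduced = np.linalg.matrix_rank(X_reduced)

print(row_sums)
print(X_full.shape[1], rank_full)        # more columns than rank
print(X_reduced.shape[1], rank_reduced)  # columns equal rank
```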

In [13]:
new_temp = pd.get_dummies(df,columns=['gender','race/ethnicity','lunch'],dtype=np.int32,drop_first=True)

In [14]:
new_temp.sample(4)

Unnamed: 0,math score,reading score,gender_male,race/ethnicity_group B,race/ethnicity_group C,race/ethnicity_group D,race/ethnicity_group E,lunch_standard
748,50,60,0,0,1,0,0,0
226,72,72,0,0,1,0,0,1
679,63,61,1,0,0,1,0,0
527,36,53,0,0,1,0,0,0


In [15]:
new_temp.shape

(1000, 8)

## 2. One Hot Encoding Using Sklearn:

1. train test split:

In [16]:
df.head(2)

Unnamed: 0,gender,race/ethnicity,lunch,math score,reading score
0,female,group B,standard,72,72
1,female,group C,standard,69,90


In [17]:
from sklearn.model_selection import train_test_split

x = df.iloc[:,0:4]
y = df.iloc[:,-1]
X_train,X_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=2)

In [19]:
X_train.sample(4)

Unnamed: 0,gender,race/ethnicity,lunch,math score
458,female,group E,standard,100
768,female,group D,standard,68
517,female,group E,standard,66
466,female,group D,free/reduced,26


In [20]:
from sklearn.preprocessing import OneHotEncoder

In [21]:
ohe = OneHotEncoder()

In [24]:
ohe.fit(X_train[['gender','race/ethnicity','lunch']])

In [29]:
new_X_train = ohe.transform(X_train[['gender','race/ethnicity','lunch']])
new_X_test = ohe.transform(X_test[['gender','race/ethnicity','lunch']])
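The fit-then-transform pattern above means the encoder learns its categories from the training split only and reuses them for the test split. A minimal sketch on toy frames (`train` and `test` are stand-ins, not the notebook's splits) shows where the learned categories are stored:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy stand-ins for the training and testing splits
train = pd.DataFrame({"lunch": ["standard", "free/reduced", "standard"]})
test = pd.DataFrame({"lunch": ["free/reduced", "standard"]})

encoder = OneHotEncoder()
encoder.fit(train)  # categories are learned here, from the training data only

# One array of categories per input column, in sorted order
learned = [list(c) for c in encoder.categories_]

# The test split is encoded with the column layout learned from train
test_encoded = encoder.transform(test).toarray()

print(learned)
print(test_encoded)
```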

In [27]:
new_X_train # It will return a sparse matrix

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 2400 stored elements and shape (800, 9)>

In [None]:
new_X_test.toarray() # toarray() can be used to convert the data to a numpy array

array([[1., 0., 0., ..., 0., 1., 0.],
       [1., 0., 0., ..., 1., 0., 1.],
       [0., 1., 0., ..., 0., 0., 1.],
       ...,
       [0., 1., 0., ..., 0., 1., 0.],
       [1., 0., 0., ..., 1., 0., 1.],
       [0., 1., 0., ..., 0., 0., 1.]])
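The `2400 stored elements` reported for the 800-row training matrix equals 800 × 3: each row stores a single 1 per encoded column, and the zeros are not stored at all. The sketch below checks the same relationship on a toy frame with two categorical columns (an assumed example, not the notebook's data).

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

toy = pd.DataFrame({
    "gender": ["female", "male", "male", "female"],
    "lunch": ["standard", "standard", "free/reduced", "standard"],
})

sparse = OneHotEncoder().fit_transform(toy)  # SciPy sparse matrix by default
dense = sparse.toarray()                     # full NumPy array, zeros included

n_rows, n_features = toy.shape

print(sparse.nnz)   # stored entries: one per row per encoded column
print(dense.size)   # total cells once the zeros are materialised
```

The gap between the two numbers grows with the number of categories, which is why the sparse format is the default.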

- `sparse_output=False` and `dtype=np.int32`

Instead of calling `toarray()`, we can pass the parameter `sparse_output=False` so that the encoder returns a NumPy array directly.    
The `dtype` parameter changes the output type from float to int.

In [33]:
ohe = OneHotEncoder(sparse_output=False,dtype=np.int32)

ohe.fit(X_train[['gender','race/ethnicity','lunch']])
new_X_train = ohe.transform(X_train[['gender','race/ethnicity','lunch']])
new_X_test = ohe.transform(X_test[['gender','race/ethnicity','lunch']])

new_X_train

array([[1, 0, 0, ..., 0, 0, 1],
       [1, 0, 0, ..., 0, 1, 0],
       [1, 0, 0, ..., 0, 0, 1],
       ...,
       [1, 0, 0, ..., 0, 0, 1],
       [1, 0, 0, ..., 0, 1, 0],
       [1, 0, 0, ..., 0, 0, 1]], dtype=int32)

In [34]:
new_X_train.shape

(800, 9)

### drop = 'first':

 - Drops the first category of each feature to eliminate multicollinearity from the data, like `drop_first=True` in pandas.

In [35]:
ohe = OneHotEncoder(sparse_output=False,dtype=np.int32,drop='first')

ohe.fit(X_train[['gender','race/ethnicity','lunch']])
new_X_train = ohe.transform(X_train[['gender','race/ethnicity','lunch']])
new_X_test = ohe.transform(X_test[['gender','race/ethnicity','lunch']])

new_X_train

array([[0, 0, 1, 0, 0, 1],
       [0, 1, 0, 0, 0, 0],
       [0, 0, 1, 0, 0, 1],
       ...,
       [0, 0, 1, 0, 0, 1],
       [0, 0, 1, 0, 0, 0],
       [0, 1, 0, 0, 0, 1]], dtype=int32)

In [36]:
new_X_train.shape

(800, 6)

- Now merge the encoded array with the remaining original column (`math score`) after converting that column into an array.

In [65]:
X_train_math = X_train.drop(columns=['gender','race/ethnicity','lunch']).values

In [68]:
new_X_train.shape

(800, 6)

In [69]:
X_train_math.shape

(800, 1)

In [75]:
new_array = np.hstack((new_X_train,X_train_math))

In [76]:
new_df = pd.DataFrame(new_array,columns=['gender_male','race/ethnicity_group B','race/ethnicity_group C','race/ethnicity_group D','race/ethnicity_group E','lunch_standard','math score'])

In [77]:
new_df.sample(4)

Unnamed: 0,gender_male,race/ethnicity_group B,race/ethnicity_group C,race/ethnicity_group D,race/ethnicity_group E,lunch_standard,math score
500,1,0,0,1,0,1,65
322,1,0,0,1,0,1,75
256,0,0,0,0,1,1,100
644,0,0,0,0,1,0,64


The steps above were applied to the training data.  
The same steps apply to the testing data, reusing the encoder that was fitted on the training data.

------

`Problem`: A dataset often has many columns that need different encodings.     
Applying each encoder to its columns one by one and then merging the results is time-consuming and makes the code hard to follow.

`Solution`: Column Transformer

## Column Transformer:

It allows you to selectively apply data preparation transforms to different columns in your dataset. This is particularly useful when you have a mix of numerical and categorical data that require different preprocessing steps.

<img src="images/transformer.png" alt="Image Description" width="850" height="280">

In [3]:
df1 = pd.read_csv('Datasets/col transformer data.csv')

In [4]:
df1.sample(4)

Unnamed: 0,Years_of_Experience,Department,Job_Role,Performance_Rating,Seniority_Level,Promoted
97,9,IT,System Analyst,Excellent,Senior,Yes
160,10,IT,System Analyst,Excellent,Senior,Yes
39,4,IT,Data Scientist,Good,Mid,No
152,19,Marketing,Marketing Analyst,Excellent,Senior,Yes


## Summary table of the data:

<img src="images/data.png" alt="Image Description" width="650" height="470">

In [8]:
from sklearn.model_selection import train_test_split

x = df1.drop(columns=['Promoted'])
y = df1.iloc[:,-1]
X_train,X_test,y_train,y_test = train_test_split(x,y,test_size=0.3)

In [17]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder

In [12]:
X_train.columns

Index(['Years_of_Experience', 'Department', 'Job_Role', 'Performance_Rating',
       'Seniority_Level'],
      dtype='object')

In [15]:
print(X_train['Performance_Rating'].unique())
print(X_train['Seniority_Level'].unique())

['Good' 'Average' 'Excellent' 'Poor']
['Mid' 'Senior' 'Junior']
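Note that `unique()` lists the labels in order of appearance, not by rank, so an ordinal encoder needs the rank order spelled out. The sketch below assumes the order Poor < Average < Good < Excellent for illustration and compares it with what the encoder does when no order is given.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

ratings = pd.DataFrame(
    {"Performance_Rating": ["Good", "Average", "Excellent", "Poor"]}
)

# Assumed rank order for illustration, from lowest to highest
order = [["Poor", "Average", "Good", "Excellent"]]
encoder = OrdinalEncoder(categories=order)
codes = encoder.fit_transform(ratings).ravel().tolist()

# Without `categories`, the labels are sorted alphabetically instead
default_codes = OrdinalEncoder().fit_transform(ratings).ravel().tolist()

print(codes)
print(default_codes)
```

With the default alphabetical order, `Excellent` would rank below `Good`, which is why the explicit `categories` list matters for ranked data.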


### The `ColumnTransformer` takes two main parameters:

1. `transformers` (List of Transformations Applied)  

-> Each tuple inside the transformers list consists of three elements:

- Name (String Identifier) → A unique name for the transformation (used for reference).   
- Transformer (Transformation Method) → The transformation applied, such as OneHotEncoder, OrdinalEncoder, StandardScaler, etc.    
- Column List (Columns to Transform) → Specifies which columns the transformer should be applied to.  

Example:  

- **('ct1', OneHotEncoder(...), ['Department', 'Job_Role'])** →   
        Applies One-Hot Encoding to categorical columns (Department, Job_Role), converting them into binary (0/1) columns.   

- **('ct2', OrdinalEncoder(...), ['Performance_Rating', 'Seniority_Level'])** →    
         Applies Ordinal Encoding, mapping Performance_Rating (Poor, Average, Good, Excellent) and Seniority_Level (Junior, Mid, Senior) into numerical ranks, from lowest to highest.     

2. `remainder` (What Happens to Other Columns) 

- 'drop' → Drops any columns not listed in transformers.   
- 'passthrough' → Keeps the columns not listed in transformers (like Years_of_Experience) unchanged in the output.

In [24]:
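ct = ColumnTransformer(transformers=[
    # One-hot encode the nominal columns
    ('ct1',OneHotEncoder(sparse_output=False,dtype=np.int32),['Department', 'Job_Role']),
    # Ordinal-encode the ranked columns; each list runs from lowest to highest rank
    ('ct2',OrdinalEncoder(categories=[['Poor','Average','Good','Excellent'],['Junior','Mid','Senior']]),
     ['Performance_Rating','Seniority_Level']),
],remainder='passthrough')  # keep Years_of_Experience unchanged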

In [25]:
ct.fit(X_train)

The format of the columns of the 'remainder' transformer in ColumnTransformer.transformers_ will change in version 1.7 to match the format of the other transformers.
At the moment the remainder columns are stored as indices (of type int). With the same ColumnTransformer configuration, in the future they will be stored as column names (of type str).



In [26]:
new_X_train = ct.transform(X_train)
new_X_test = ct.transform(X_test)

In [28]:
new_X_train.shape

(140, 19)

In [29]:
new_X_test.shape

(60, 19)

Using a column transformer, you can easily encode multiple types of features simultaneously.

----