<div id="header">
    <p style="color:black; text-align:center; font-weight:bold; font-family:Tahoma, sans-serif; font-size:24px;">
        Data Preprocessing with Ordinal and Label Encoding
    </p>
</div>

<div style="background-color:#bfbfbf; padding:8px; border:2px dotted black; border-radius:8px; font-family:sans-serif; line-height: 1.7em">

**Encoding categorical data** is a crucial preprocessing step in machine learning that transforms categorical variables into a numerical format. This transformation allows algorithms to interpret the data, as most machine learning models require numerical input.

**Ordinal encoding** is used for categorical features that have a clear order or ranking. This technique assigns a unique integer to each category, reflecting its rank. For example, in a feature representing customer satisfaction levels (e.g., "low," "medium," "high"), ordinal encoding might map these categories to integers as follows:
- Low: 0
- Medium: 1
- High: 2
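
The mapping above can be sketched in a few lines of pandas. This is only an illustration: the `satisfaction` series and the `rank` dictionary are hypothetical and are not part of the dataset used later in this notebook.

```python
import pandas as pd

# Hypothetical satisfaction column, used only for illustration
satisfaction = pd.Series(["low", "high", "medium", "low"])

# Explicit rank for each category (the order is chosen by us, not inferred)
rank = {"low": 0, "medium": 1, "high": 2}

# Replace each category with its rank
encoded = satisfaction.map(rank)
print(encoded.tolist())  # [0, 2, 1, 0]
```

Because the order is written out by hand, the integers preserve the intended ranking regardless of how the values happen to be sorted.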

**Label encoding** assigns a unique integer to each category without regard to any order. The integers are arbitrary identifiers and should not be read as a ranking. Because many models would still treat them as ordered numbers, label encoding is mainly used for the target variable, which is what scikit-learn's `LabelEncoder` is designed for. For instance, a variable representing colors (e.g., "red," "blue," "green") might be mapped as follows:
- Red: 0
- Blue: 1
- Green: 2
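
As a rough sketch of how this looks with scikit-learn, the snippet below encodes a hypothetical list of colors. Note that `LabelEncoder` sorts the classes before numbering them, so the integers it produces differ from the illustrative mapping listed above.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical color values, used only for illustration
colors = ["red", "blue", "green", "blue"]

le_demo = LabelEncoder()
codes = le_demo.fit_transform(colors)

# Classes are stored in sorted order; a code is the index into classes_
print(le_demo.classes_)  # ['blue' 'green' 'red']
print(codes)             # [2 0 1 0]
```

The codes carry no meaning beyond identifying the class, which is why they should not be interpreted as a ranking.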

</div>


In [None]:
import numpy as np
import pandas as pd

In [None]:
df = pd.read_csv('employee_promotion.csv')

In [None]:
df.head(10)

Unnamed: 0,department,education,gender,recruitment_channel,age,performance,is_promoted
0,Sales & Marketing,Master,f,sourcing,35,Average,No
1,Operations,Bachelor,m,other,30,Average,No
2,Sales & Marketing,Bachelor,m,sourcing,34,Average,No
3,Sales & Marketing,Bachelor,m,other,39,Average,No
4,Technology,Bachelor,m,other,45,Average,No
5,Analytics,Bachelor,m,sourcing,31,Average,No
6,Operations,Bachelor,f,other,31,Average,No
7,Operations,Master,m,sourcing,33,Average,No
8,Analytics,Bachelor,m,other,28,Average,No
9,Sales & Marketing,Master,m,sourcing,32,Average,No


In [None]:
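# Keep only the columns used in this example (selected by name rather than position)
df = df[['education', 'performance', 'is_promoted']]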

In [None]:
df.head()

Unnamed: 0,education,performance,is_promoted
0,Master,Average,No
1,Bachelor,Average,No
2,Bachelor,Average,No
3,Bachelor,Average,No
4,Bachelor,Average,No


In [None]:
from sklearn.preprocessing import OrdinalEncoder

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X = df[['education', 'performance']]
y = df['is_promoted']

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Shapes of the splits
print(X_train.shape, X_test.shape)

(38365, 2) (16443, 2)


In [None]:
X_train

Unnamed: 0,education,performance
50994,Bachelor,Good
48799,Bachelor,Good
32986,Master,Good
22251,Bachelor,Good
48703,Master,Good
...,...,...
45891,PHD,Good
52416,Master,Good
42613,Bachelor,Good
43567,Bachelor,Good


In [None]:
oe = OrdinalEncoder(categories=[['Bachelor', 'Master', 'PHD'], ['Average', 'Good', 'Outstanding']])

In [None]:
# Fitting the encoder on the training features and transforming them
X_train = X_train.copy()  # work on a copy to avoid pandas' SettingWithCopyWarning
X_train[['education', 'performance']] = oe.fit_transform(X_train[['education', 'performance']])

In [None]:
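# Transforming the test features with the encoder fitted on the training data
X_test = X_test.copy()  # work on a copy to avoid pandas' SettingWithCopyWarning
X_test[['education', 'performance']] = oe.transform(X_test[['education', 'performance']])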
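
One thing to watch when transforming test data is a category that was never declared to the encoder. By default `OrdinalEncoder` raises an error in that case. The sketch below shows one possible way to handle it, assuming scikit-learn 0.24 or later; the toy frames and the `'Diploma'` value are hypothetical and do not come from the dataset above.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data; 'Diploma' appears only in the test set
train = pd.DataFrame({"education": ["Bachelor", "Master", "PHD"]})
test = pd.DataFrame({"education": ["Master", "Diploma"]})

enc = OrdinalEncoder(
    categories=[["Bachelor", "Master", "PHD"]],
    handle_unknown="use_encoded_value",  # map unseen categories to unknown_value
    unknown_value=-1,
)
enc.fit(train)

# 'Master' keeps its rank, the unseen 'Diploma' becomes -1
test_encoded = enc.transform(test)
print(test_encoded.ravel())  # [ 1. -1.]
```

Whether a sentinel such as -1 is appropriate depends on the downstream model, so treat this as one option rather than a recommendation.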

In [None]:
X_train

Unnamed: 0,education,performance
50994,0.0,1.0
48799,0.0,1.0
32986,1.0,1.0
22251,0.0,1.0
48703,1.0,1.0
...,...,...
45891,2.0,1.0
52416,1.0,1.0
42613,0.0,1.0
43567,0.0,1.0


In [None]:
oe.categories_

[array(['Bachelor', 'Master', 'PHD'], dtype=object),
 array(['Average', 'Good', 'Outstanding'], dtype=object)]
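
Since `categories_` stores the order used for encoding, the integer codes can be mapped back to the original labels with `inverse_transform`. The following is a minimal, self-contained sketch; the encoder `enc` and the rows passed to it are hypothetical stand-ins for `oe` and the training data.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

enc = OrdinalEncoder(categories=[['Bachelor', 'Master', 'PHD'],
                                 ['Average', 'Good', 'Outstanding']])

# Hypothetical rows, only needed so that the encoder is fitted
enc.fit([['Bachelor', 'Average'], ['Master', 'Good'], ['PHD', 'Outstanding']])

# Encoded rows in the same layout as the transformed X_train
codes = np.array([[0.0, 1.0], [2.0, 1.0]])

# Each code is looked up in the matching array of categories_
labels = enc.inverse_transform(codes)
print(labels.tolist())  # [['Bachelor', 'Good'], ['PHD', 'Good']]
```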

In [None]:
from sklearn.preprocessing import LabelEncoder

In [None]:
le = LabelEncoder()

In [None]:
# Fitting and transforming the training target variable
y_train_encoded = le.fit_transform(y_train)

In [None]:
# Transforming the test target variable
y_test_encoded = le.transform(y_test)

In [None]:
le.classes_

array(['No', 'Yes'], dtype=object)

In [None]:
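# Reuse the arrays encoded above instead of transforming a second time
# (calling le.transform on already-encoded values would raise an error on re-run)
y_train = y_train_encoded
y_test = y_test_encoded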

In [None]:
y_train

array([0, 0, 0, ..., 0, 0, 0])