# What is Encoding in Machine Learning

Encoding in ML is the process of converting categorical data into a numerical format so that algorithms can process it.


## OneHotEncoder
OneHotEncoder in sklearn converts categorical data into a binary (0/1) matrix with one column for each category.

- Used on: Nominal (non-ordinal) categorical features (e.g., color: red, green, blue).
- Purpose: Makes data usable for ML models that can't handle text labels directly.


![image-7.png](attachment:image-7.png)

In [59]:
import pandas as pd 


df = pd.read_csv("binary_classification_sample.csv")
df

Unnamed: 0,Age,Salary,Experience,Gender,Department,Education,LocationScore,Purchased
0,56,51905.183591,27,Female,HR,Bachelors,67.964728,0
1,69,31258.344158,16,Female,Engineering,High School,21.825389,0
2,46,79176.734217,4,Male,HR,PhD,94.996118,0
3,32,47699.953137,4,Male,Engineering,High School,78.634501,1
4,60,36395.191619,5,Male,Marketing,High School,8.941100,1
...,...,...,...,...,...,...,...,...
195,69,69228.805705,10,Female,Engineering,Bachelors,77.985099,1
196,30,49573.678136,14,Female,Marketing,Bachelors,3.961883,1
197,58,24253.633311,27,Female,HR,High School,48.050695,0
198,20,,12,Female,Sales,High School,,0


# How to Apply One-Hot Encoding?

In [60]:
df.dropna(inplace=True)

In [61]:
df['Department'].value_counts()

Department
Sales          47
HR             41
Engineering    41
Marketing      40
Name: count, dtype: int64

In [27]:
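from sklearn.preprocessing import OneHotEncoder

# Pitfall: with drop='first' the encoder returns three columns (HR, Marketing, Sales),
# but the result is assigned to the single 'Department' column. In the output below
# only the HR indicator survives, so every other department collapses to 0.
# Some pandas versions reject this assignment with an error instead.
encoder = OneHotEncoder(drop='first', sparse_output=False)
df['Department'] = encoder.fit_transform(df[['Department']]).astype(int)
df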

Unnamed: 0,Age,Salary,Experience,Gender,Department,Education,LocationScore,Purchased
0,56,51905.183591,27,Female,1,Bachelors,67.964728,0
1,69,31258.344158,16,Female,0,High School,21.825389,0
2,46,79176.734217,4,Male,1,PhD,94.996118,0
3,32,47699.953137,4,Male,0,High School,78.634501,1
4,60,36395.191619,5,Male,0,High School,8.941100,1
...,...,...,...,...,...,...,...,...
194,44,33125.723933,13,Male,0,PhD,86.012240,1
195,69,69228.805705,10,Female,0,Bachelors,77.985099,1
196,30,49573.678136,14,Female,0,Bachelors,3.961883,1
197,58,24253.633311,27,Female,1,High School,48.050695,0


In [62]:
# One-hot encoding
cols = ['Department']
encoder = OneHotEncoder(drop='first', sparse_output=False)

encoded_col = encoder.fit_transform(df[cols]).astype(int)
encoded_df = pd.DataFrame(encoded_col, columns=encoder.get_feature_names_out(cols))

# Align indexes: df keeps its original row labels after dropna, while encoded_df starts again at 0
encoded_df.index = df.index

# Concatenate
final_df = pd.concat([df, encoded_df], axis=1)
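
The index alignment step matters because `pd.concat(..., axis=1)` matches rows by index label, not by position. Here is a small sketch of what can go wrong without it, using a hypothetical `toy` frame and a hand-written `flags` column (both invented for illustration):

```python
import pandas as pd

# Hypothetical frame with one missing value
toy = pd.DataFrame({"Department": ["HR", None, "Sales"], "Age": [30, 41, 52]})
toy = toy.dropna()  # remaining index labels: 0 and 2

# A freshly built DataFrame gets the default index 0, 1
flags = pd.DataFrame({"Department_Sales": [0, 1]})

# Without alignment: rows are matched on the union of labels 0, 1, 2
misaligned = pd.concat([toy, flags], axis=1)

# With alignment: copy the index from the original frame first
flags_aligned = flags.copy()
flags_aligned.index = toy.index
aligned = pd.concat([toy, flags_aligned], axis=1)

print(misaligned)
print(aligned)
```

The misaligned result gains an extra row and contains NaN values, while the aligned one keeps two complete rows.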


In [63]:
final_df

Unnamed: 0,Age,Salary,Experience,Gender,Department,Education,LocationScore,Purchased,Department_HR,Department_Marketing,Department_Sales
0,56,51905.183591,27,Female,HR,Bachelors,67.964728,0,1,0,0
1,69,31258.344158,16,Female,Engineering,High School,21.825389,0,0,0,0
2,46,79176.734217,4,Male,HR,PhD,94.996118,0,1,0,0
3,32,47699.953137,4,Male,Engineering,High School,78.634501,1,0,0,0
4,60,36395.191619,5,Male,Marketing,High School,8.941100,1,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...
194,44,33125.723933,13,Male,Sales,PhD,86.012240,1,0,0,1
195,69,69228.805705,10,Female,Engineering,Bachelors,77.985099,1,0,0,0
196,30,49573.678136,14,Female,Marketing,Bachelors,3.961883,1,0,1,0
197,58,24253.633311,27,Female,HR,High School,48.050695,0,1,0,0
