## Overview: Encoding Categorical Variables with LabelEncoder and OrdinalEncoder

### Objective
This section shows how to **encode categorical variables** in a dataset with **scikit-learn's preprocessing encoders**: `LabelEncoder` for the dependent (target) variable and `OrdinalEncoder` for the independent (feature) variables.

---

### Dataset Information
- The dataset (`Encoding.csv`) contains categorical columns such as:
  - `fuel`
  - `seller_type`
  - `transmission`
  - `owner`
- The `name` column (car name/model) is dropped because it is not needed for encoding and is not treated as a feature for model building.

---


In [121]:
# NumPy: Used for numerical operations (arrays, mathematical calculations)
import numpy as np

# Pandas: Used for data loading, cleaning, and manipulation
import pandas as pd

# Scikit-learn preprocessing: Used for converting categorical data into numeric form
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

In [126]:
# Load the dataset
df = pd.read_csv("Encoding.csv")

# Display the first 5 rows of the dataset
df.head()

In [127]:
# Drop the 'name' column: it is not needed for encoding and is not an important feature for model building
df.drop('name', axis=1, inplace=True)

In [128]:
df

|      | fuel   | seller_type | transmission | owner        |
|------|--------|-------------|--------------|--------------|
| 0    | Petrol | Individual  | Manual       | First Owner  |
| 1    | Petrol | Individual  | Manual       | First Owner  |
| 2    | Diesel | Individual  | Manual       | First Owner  |
| 3    | Petrol | Individual  | Manual       | First Owner  |
| 4    | Diesel | Individual  | Manual       | Second Owner |
| ...  | ...    | ...         | ...          | ...          |
| 4335 | Diesel | Individual  | Manual       | Second Owner |
| 4336 | Diesel | Individual  | Manual       | Second Owner |
| 4337 | Petrol | Individual  | Manual       | Second Owner |
| 4338 | Diesel | Individual  | Manual       | First Owner  |


In [38]:
df['transmission'].value_counts()

transmission
Manual       3892
Automatic     448
Name: count, dtype: int64

In [39]:
# Manual alternative (left disabled): map each category to an integer by hand
# df['Transmission_new'] = df['transmission'].map({'Manual': 1, 'Automatic': 0})
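The commented-out line above encodes a column by hand with `Series.map`. A minimal sketch of that idea follows, using a small made-up frame (`toy`) rather than the real dataset. The frame and the `transmission_map` name are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy column standing in for df['transmission']
toy = pd.DataFrame({"transmission": ["Manual", "Automatic", "Manual", "Manual"]})

# Explicit category-to-integer mapping, mirroring the disabled line in the notebook
transmission_map = {"Manual": 1, "Automatic": 0}
toy["transmission_new"] = toy["transmission"].map(transmission_map)

# Categories missing from the dictionary would become NaN, so count the gaps
unmapped = int(toy["transmission_new"].isna().sum())

print(toy)
print("unmapped rows:", unmapped)
```

A hand-written mapping gives full control over which integer each category receives, but every new category has to be added to the dictionary manually.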

In [40]:
df

|      | fuel   | seller_type | transmission | owner        |
|------|--------|-------------|--------------|--------------|
| 0    | Petrol | Individual  | Manual       | First Owner  |
| 1    | Petrol | Individual  | Manual       | First Owner  |
| 2    | Diesel | Individual  | Manual       | First Owner  |
| 3    | Petrol | Individual  | Manual       | First Owner  |
| 4    | Diesel | Individual  | Manual       | Second Owner |
| ...  | ...    | ...         | ...          | ...          |
| 4335 | Diesel | Individual  | Manual       | Second Owner |
| 4336 | Diesel | Individual  | Manual       | Second Owner |
| 4337 | Petrol | Individual  | Manual       | Second Owner |
| 4338 | Diesel | Individual  | Manual       | First Owner  |


In [41]:
df['seller_type'].value_counts()

seller_type
Individual          3244
Dealer               994
Trustmark Dealer     102
Name: count, dtype: int64

In [42]:
# df['seller_type_new'] = df['seller_type'].map({'Individual':0,'Dealer':1,'Trustmark Dealer':2})

In [43]:
df.head()

|   | fuel   | seller_type | transmission | owner        |
|---|--------|-------------|--------------|--------------|
| 0 | Petrol | Individual  | Manual       | First Owner  |
| 1 | Petrol | Individual  | Manual       | First Owner  |
| 2 | Diesel | Individual  | Manual       | First Owner  |
| 3 | Petrol | Individual  | Manual       | First Owner  |
| 4 | Diesel | Individual  | Manual       | Second Owner |


In [44]:
# Draw a random sample of 5 rows
df.sample(5)

|      | fuel   | seller_type | transmission | owner        |
|------|--------|-------------|--------------|--------------|
| 2525 | Petrol | Individual  | Manual       | First Owner  |
| 3515 | Diesel | Individual  | Manual       | Third Owner  |
| 3002 | Petrol | Individual  | Manual       | First Owner  |
| 2553 | Petrol | Dealer      | Manual       | First Owner  |
| 779  | Diesel | Individual  | Manual       | Second Owner |


In [46]:
# Another method (left disabled): apply() calls the lambda once for each value in the column
# df['transmission'].apply(lambda x: 1 if x == 'Manual' else 0)
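To make the element-wise behaviour of `apply` concrete, the sketch below runs the same lambda on a small made-up Series and compares it with a vectorized comparison. The Series `s` and the variable names are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Hypothetical stand-in for df['transmission']
s = pd.Series(["Manual", "Automatic", "Manual"], name="transmission")

# apply() invokes the lambda once per element and collects the results
via_apply = s.apply(lambda x: 1 if x == "Manual" else 0)

# Vectorized equivalent: compare the whole column, then cast booleans to integers
via_compare = (s == "Manual").astype("int64")

print(via_apply.tolist())
print(via_compare.tolist())
```

Both approaches give the same codes; the vectorized form is usually preferable on large columns because it avoids a Python-level call per row.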

In [130]:
from sklearn.model_selection import train_test_split

# Split the dataset into features (X) and target (Y)
# 'fuel' is the target column, so we drop it from X
X = df.drop('fuel', axis=1)
Y = df['fuel']

In [131]:
X.head()

|   | seller_type | transmission | owner        |
|---|-------------|--------------|--------------|
| 0 | Individual  | Manual       | First Owner  |
| 1 | Individual  | Manual       | First Owner  |
| 2 | Individual  | Manual       | First Owner  |
| 3 | Individual  | Manual       | First Owner  |
| 4 | Individual  | Manual       | Second Owner |


In [135]:
# Split the data into training and testing sets
# 80% of data for training and 20% for testing
# random_state=42 ensures reproducibility of results
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
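As a rough illustration of what `test_size=0.2` and `random_state=42` do, the sketch below splits a made-up 10-row frame twice and checks that both runs select the same rows. The toy data and variable names are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical 10-row stand-ins for X and Y
toy_X = pd.DataFrame({"seller_type": ["Individual", "Dealer"] * 5})
toy_y = pd.Series(["Petrol", "Diesel"] * 5, name="fuel")

# Two splits with the same seed
a_train, a_test, _, _ = train_test_split(toy_X, toy_y, test_size=0.2, random_state=42)
b_train, b_test, _, _ = train_test_split(toy_X, toy_y, test_size=0.2, random_state=42)

print(a_train.shape, a_test.shape)  # 8 training rows, 2 testing rows
same_rows = a_test.index.equals(b_test.index)
print("same test rows on both runs:", same_rows)
```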

### Encoding with scikit-learn

In [137]:
# Create encoder objects for categorical data
# LabelEncoder: used for encoding the target variable (Y)
# OrdinalEncoder: used for encoding categorical feature columns (X)
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder
label = LabelEncoder()
ordinal = OrdinalEncoder()

In [138]:
# Fit and transform the target variable (y_train) into numeric values
y_train_scaled = label.fit_transform(y_train)

# Convert the encoded array into a DataFrame for better readability
y_train_scaled = pd.DataFrame(y_train_scaled)

# Display the count of each encoded category in y_train
y_train_scaled.value_counts()


0
1    1714
4    1704
0      35
3      18
2       1
Name: count, dtype: int64

In [139]:
# Display the count of each category in the original target variable (y_train)
y_train.value_counts()

fuel
Diesel      1714
Petrol      1704
CNG           35
LPG           18
Electric       1
Name: count, dtype: int64
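Comparing the two count tables suggests which integer stands for which fuel type. The fitted encoder's `classes_` attribute lets you read that mapping directly. The sketch below uses a short made-up list of fuel labels (an assumption, not the real `y_train`) to show the idea, along with `inverse_transform` for decoding.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels covering the same five fuel types
fuels = ["Diesel", "Petrol", "CNG", "LPG", "Electric", "Petrol"]

le = LabelEncoder()
codes = le.fit_transform(fuels)

# classes_ is sorted alphabetically; a class's position is its integer code
mapping = {str(cls): code for code, cls in enumerate(le.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the codes
decoded = le.inverse_transform(codes)
print(list(decoded))
```

Because `LabelEncoder` sorts the classes, the codes are alphabetical positions and carry no meaning about the fuels themselves.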

In [140]:
# Transform the test target variable (y_test) using the same fitted LabelEncoder
y_test_scaled = label.transform(y_test)

# Convert the encoded test data into a DataFrame for better readability
y_test_scaled = pd.DataFrame(y_test_scaled)

# Display the count of each encoded category in y_test
y_test_scaled.value_counts()


0
1    439
4    419
0      5
3      5
Name: count, dtype: int64

In [90]:
y_test.value_counts()

fuel
Diesel    439
Petrol    419
LPG         5
CNG         5
Name: count, dtype: int64
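Here the test split simply lacks one class that the training split contains, which `transform` handles without complaint. The opposite case, a label seen only at test time, raises an error. The sketch below demonstrates both on made-up labels; the label lists are assumptions for illustration.

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(["Diesel", "Petrol", "CNG"])  # hypothetical training labels

# A subset of the known classes transforms without trouble
subset_codes = le.transform(["Petrol", "Diesel"])
print(subset_codes)

# A label the encoder never saw during fit is rejected
try:
    le.transform(["Electric"])
    unseen_ok = True
except ValueError as exc:
    unseen_ok = False
    print("unseen label rejected:", exc)
```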

In [141]:
# Display the count of each category combination in the original feature columns (X_train)
X_train.value_counts()

seller_type       transmission  owner               
Individual        Manual        First Owner             1408
                                Second Owner             731
Dealer            Manual        First Owner              509
Individual        Manual        Third Owner              220
Dealer            Automatic     First Owner              157
Individual        Automatic     First Owner              113
Dealer            Manual        Second Owner              86
Individual        Manual        Fourth & Above Owner      68
Trustmark Dealer  Manual        First Owner               64
Individual        Automatic     Second Owner              39
Dealer            Automatic     Second Owner              18
Individual        Automatic     Third Owner               17
Dealer            Manual        Test Drive Car            14
Trustmark Dealer  Automatic     First Owner               11
Dealer            Manual        Third Owner                7
Trustmark Dealer  Automatic     

In [142]:
# Check the shape of the training feature data (rows, columns)
X_train.shape

# Fit and transform the independent feature columns using OrdinalEncoder
X_train_scaled = ordinal.fit_transform(X_train)

# Display the encoded numeric values of the training features
X_train_scaled


array([[1., 1., 0.],
       [1., 1., 0.],
       [1., 1., 2.],
       ...,
       [0., 1., 0.],
       [1., 1., 0.],
       [1., 1., 0.]])

In [143]:
# Convert the encoded training feature array into a DataFrame with original column names
X_train_df = pd.DataFrame(data=X_train_scaled, columns=X_train.columns)

# Display the first 5 rows of the encoded training DataFrame
X_train_df.head()


|   | seller_type | transmission | owner |
|---|-------------|--------------|-------|
| 0 | 1.0         | 1.0          | 0.0   |
| 1 | 1.0         | 1.0          | 0.0   |
| 2 | 1.0         | 1.0          | 2.0   |
| 3 | 1.0         | 1.0          | 0.0   |
| 4 | 0.0         | 1.0          | 0.0   |
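By default `OrdinalEncoder` assigns codes in sorted (alphabetical) order, which is why `First Owner` becomes 0.0 and `Second Owner` becomes 2.0 in the output above. If the numeric order should reflect a real ranking, the `categories` parameter accepts an explicit list. The sketch below shows both behaviours on a made-up `owner` column. The ranking in `owner_order` is an assumption chosen for illustration, not something the notebook prescribes.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical frame with one row per owner category
owners = pd.DataFrame({"owner": [
    "First Owner", "Second Owner", "Third Owner",
    "Fourth & Above Owner", "Test Drive Car",
]})

# Default behaviour: categories are sorted alphabetically
default_enc = OrdinalEncoder()
default_enc.fit(owners)
default_order = [str(c) for c in default_enc.categories_[0]]
print(default_order)

# Assumed ranking (illustrative only): fewer previous owners gets a lower code
owner_order = ["Test Drive Car", "First Owner", "Second Owner",
               "Third Owner", "Fourth & Above Owner"]
ranked_enc = OrdinalEncoder(categories=[owner_order])
ranked = ranked_enc.fit_transform(owners)
print(ranked.ravel().tolist())
```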


In [145]:
# Check the shape of the testing feature data (rows, columns)
X_test.shape

(868, 3)

In [146]:
# Transform the testing feature data with the OrdinalEncoder already fitted on X_train
# (calling fit_transform() here would re-learn the categories from the test set, so the codes could disagree with the training encoding)
X_test_scaled = ordinal.transform(X_test)

# Display the encoded numeric values of the testing features
X_test_scaled


array([[1., 1., 0.],
       [1., 1., 0.],
       [1., 1., 2.],
       ...,
       [1., 1., 2.],
       [2., 1., 0.],
       [0., 1., 0.]])

In [147]:
# Convert the encoded testing feature array into a DataFrame with original column names
X_test_df = pd.DataFrame(data=X_test_scaled, columns=X_test.columns)

# Display the encoded testing DataFrame
X_test_df


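|     | seller_type | transmission | owner |
|-----|-------------|--------------|-------|
| 0   | 1.0         | 1.0          | 0.0   |
| 1   | 1.0         | 1.0          | 0.0   |
| 2   | 1.0         | 1.0          | 2.0   |
| 3   | 1.0         | 1.0          | 2.0   |
| 4   | 1.0         | 0.0          | 2.0   |
| ... | ...         | ...          | ...   |
| 863 | 1.0         | 1.0          | 4.0   |
| 864 | 1.0         | 1.0          | 0.0   |
| 865 | 1.0         | 1.0          | 2.0   |
| 866 | 2.0         | 1.0          | 0.0   |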
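The usual pattern is to fit the encoder on the training features only and then reuse it on the test features, so both share one category-to-code mapping. A minimal sketch follows, on made-up frames (`train` and `test` are assumptions). It also shows one way to handle a category that appears only at test time, through `handle_unknown="use_encoded_value"`. That option is not used in the notebook and is offered here only as an assumption.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical training and testing features
train = pd.DataFrame({"seller_type": ["Individual", "Dealer", "Individual"]})
test = pd.DataFrame({"seller_type": ["Dealer", "Trustmark Dealer"]})

# Unknown categories at transform time are encoded as -1 instead of raising an error
enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
enc.fit(train)                     # learn categories from training data only
encoded_test = enc.transform(test) # reuse the same mapping on test data

print(encoded_test.ravel().tolist())
```

With the default `handle_unknown="error"`, the same `transform` call would raise a `ValueError` for `Trustmark Dealer`, because it never appeared in the training frame.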


## Conclusion

All categorical columns in our dataset have been successfully converted into numeric form.

- **LabelEncoder** was used to convert the target column (`fuel`) into numerical values.  
- **OrdinalEncoder** was used to convert all independent feature columns into numbers.  

Now, both **training** and **testing** data are fully numeric and ready to be used in **machine learning models**.
