# Encoding Multiple Categorical Features in Classification

This notebook demonstrates **how to handle multiple string (categorical) features** in a **classification problem**.

We will:
- Load a sample customer purchase dataset
- Identify nominal vs ordinal features
- Apply correct encoding techniques
- Train a classification model
- Explain *why* each step is needed


## 1. Problem Statement

Predict whether a customer will **purchase a product** based on demographic and categorical features.

**Target variable:** `Purchased` (Yes / No)


In [1]:
import numpy as np
import pandas as pd


## 2. Reading the Dataset

In [3]:
dataset = pd.read_csv("customer_purchase_data.csv")

In [4]:
dataset.head()

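| Unnamed: 0 | Gender | City | Product | Education | Age | Purchased |
|---|---|---|---|---|---|---|
| 0 | Male | Mumbai | Laptop | Bachelor | 34 | No |
| 1 | Female | Chennai | Tablet | High School | 31 | No |
| 2 | Male | Chennai | Phone | Bachelor | 38 | No |
| 3 | Male | Bangalore | Phone | Master | 32 | Yes |
| 4 | Male | Mumbai | Phone | High School | 43 | No |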
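Before choosing encoders, it helps to list the string columns and their distinct values, since that is how nominal and ordinal features are told apart. The following is a minimal sketch: `preview` is a hypothetical DataFrame holding only the five rows shown above, retyped so the example runs on its own, and the variable names are illustrative.

```python
import pandas as pd

# The five preview rows from dataset.head(), retyped for a self-contained example
preview = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Male", "Male"],
    "City": ["Mumbai", "Chennai", "Chennai", "Bangalore", "Mumbai"],
    "Product": ["Laptop", "Tablet", "Phone", "Phone", "Phone"],
    "Education": ["Bachelor", "High School", "Bachelor", "Master", "High School"],
    "Age": [34, 31, 38, 32, 43],
    "Purchased": ["No", "No", "No", "Yes", "No"],
})

# Non-numeric columns are the candidates for encoding
categorical_columns = [
    col for col in preview.columns
    if not pd.api.types.is_numeric_dtype(preview[col])
]

# Distinct values per column help decide whether a feature has a natural order
unique_values = {col: sorted(preview[col].unique()) for col in categorical_columns}

print(categorical_columns)
for col, values in unique_values.items():
    print(col, "->", values)
```

On the full dataset the same loop can be run against `dataset` instead of `preview`.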


## 3. Separate Features and Target

- `X` → Input features
- `y` → Target variable


In [None]:
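# `dataset` is the DataFrame loaded above; `Unnamed: 0` is a saved row index, not a feature
X = dataset.drop(columns=['Unnamed: 0', 'Purchased'])
y = dataset['Purchased']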


## 4. Encode Target Variable (y)

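Many machine learning libraries expect **numeric class labels**. scikit-learn classifiers accept string labels directly, but encoding `Purchased` as 0/1 keeps the target consistent with metrics and downstream code, so we encode it here.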


In [None]:
from sklearn.preprocessing import LabelEncoder

y_encoder = LabelEncoder()
y = y_encoder.fit_transform(y)
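To see which integer each label receives, and to turn predictions back into readable labels, the fitted encoder can be inspected. This is a small sketch on hypothetical sample labels (`labels` and `demo_encoder` are illustrative names, not part of the notebook's data):

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample labels mirroring the `Purchased` column
labels = ["No", "No", "Yes", "No", "Yes"]

demo_encoder = LabelEncoder()
encoded = demo_encoder.fit_transform(labels)

# classes_ holds the original labels in sorted order; the position is the code
label_to_code = {str(label): code for code, label in enumerate(demo_encoder.classes_)}

# inverse_transform maps integer codes back to the original strings
decoded = [str(label) for label in demo_encoder.inverse_transform(encoded)]

print(label_to_code)  # {'No': 0, 'Yes': 1}
print(decoded)        # ['No', 'No', 'Yes', 'No', 'Yes']
```

The same calls on `y_encoder` would show the mapping used in this notebook.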


## 5. Encode Ordinal Feature (Education)

Education has a **natural order**:

`High School < Bachelor < Master < PhD`


In [None]:
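# LabelEncoder would assign codes alphabetically (Bachelor=0, High School=1, ...),
# which breaks the intended order, so the ranking is mapped explicitly instead.
education_order = {'High School': 0, 'Bachelor': 1, 'Master': 2, 'PhD': 3}
X['Education'] = X['Education'].map(education_order)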
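Encoders that sort labels alphabetically, such as `LabelEncoder`, would rank `Bachelor` below `High School`, which does not match the order stated above. One way to pin the intended order inside a scikit-learn transformer is `OrdinalEncoder` with an explicit `categories` list. The sketch below compares the two on hypothetical sample values:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical sample values for illustration
education = pd.DataFrame(
    {"Education": ["High School", "Bachelor", "Master", "PhD", "Bachelor"]}
)

# Alphabetical codes: Bachelor=0, High School=1, Master=2, PhD=3
alphabetical = LabelEncoder().fit_transform(education["Education"])

# Explicit order: High School=0, Bachelor=1, Master=2, PhD=3
education_levels = ["High School", "Bachelor", "Master", "PhD"]
ordinal_encoder = OrdinalEncoder(categories=[education_levels])
ordered = ordinal_encoder.fit_transform(education[["Education"]]).ravel()

print(alphabetical.tolist())  # [1, 0, 2, 3, 0]
print(ordered.tolist())       # [0.0, 1.0, 2.0, 3.0, 1.0]
```

`OrdinalEncoder` expects a 2D input, hence the double brackets around the column name.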


## 6. One-Hot Encode Nominal Features

Nominal features have **no order**, so we apply One-Hot Encoding:

- Gender
- City
- Product


In [None]:
nominal_features = ['Gender', 'City', 'Product']
X = pd.get_dummies(X, columns=nominal_features, drop_first=True)
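To make the effect of `drop_first=True` concrete, here is a small sketch on a hypothetical three-row sample (`sample` is an illustrative DataFrame, not the notebook's data). The first level of each feature in sorted order is dropped and acts as the baseline, so a binary feature becomes one column and a three-level feature becomes two.

```python
import pandas as pd

# Hypothetical three-row sample for illustration
sample = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "City": ["Mumbai", "Chennai", "Bangalore"],
    "Age": [34, 31, 32],
})

encoded_sample = pd.get_dummies(sample, columns=["Gender", "City"], drop_first=True)

# Untouched columns come first, followed by the indicator columns.
# "Female" and "Bangalore" are the dropped baseline levels.
print(list(encoded_sample.columns))
# ['Age', 'Gender_Male', 'City_Chennai', 'City_Mumbai']
```

A row with zeros in both `City_Chennai` and `City_Mumbai` therefore represents Bangalore.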


## 7. Train-Test Split


In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)


## 8. Train Classification Model

We use **Logistic Regression**, which requires fully numeric input.


In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
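The same steps can also be bundled into a single scikit-learn `Pipeline`, so that the encoders are fitted on the training split only and reapplied unchanged to the test split. The following is a sketch under stated assumptions, not the notebook's method: the data is randomly generated as a stand-in for the CSV (so the accuracy is meaningless), the names `demo`, `preprocess`, and `pipeline` are illustrative, and `OneHotEncoder(handle_unknown="ignore")` is used instead of `pd.get_dummies` so that categories unseen during training do not raise an error.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Synthetic stand-in for the customer data (random values, illustration only)
rng = np.random.default_rng(0)
n_rows = 200
demo = pd.DataFrame({
    "Gender": rng.choice(["Male", "Female"], n_rows),
    "City": rng.choice(["Mumbai", "Chennai", "Bangalore"], n_rows),
    "Product": rng.choice(["Laptop", "Tablet", "Phone"], n_rows),
    "Education": rng.choice(["High School", "Bachelor", "Master", "PhD"], n_rows),
    "Age": rng.integers(18, 60, n_rows),
    "Purchased": rng.choice(["Yes", "No"], n_rows),
})

X_demo = demo.drop(columns="Purchased")
y_demo = demo["Purchased"]  # string labels are accepted by scikit-learn classifiers

education_levels = ["High School", "Bachelor", "Master", "PhD"]
preprocess = ColumnTransformer(
    transformers=[
        # Ordinal feature: explicit category order
        ("ordinal", OrdinalEncoder(categories=[education_levels]), ["Education"]),
        # Nominal features: one indicator column per category
        ("nominal", OneHotEncoder(handle_unknown="ignore"), ["Gender", "City", "Product"]),
    ],
    remainder="passthrough",  # keep Age as-is
)

pipeline = Pipeline([
    ("preprocess", preprocess),
    ("classifier", LogisticRegression(max_iter=1000)),
])

X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# Encoders are fitted on the training split only, then reused for prediction
pipeline.fit(X_train_demo, y_train_demo)
predictions = pipeline.predict(X_test_demo)
accuracy = accuracy_score(y_test_demo, predictions)
print("Accuracy on random data:", accuracy)
```

Because the pipeline owns the encoders, calling `pipeline.predict` on new rows with the same raw columns applies the identical transformation.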


## 9. Key Takeaways

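- Encode each categorical feature **separately**, according to its type
- Ordinal features → Ordinal encoding with an **explicit category order** (`LabelEncoder` sorts alphabetically and is intended for targets)
- Nominal features → One-Hot Encoding
- Encode the target when numeric labels are needed (scikit-learn classifiers also accept string labels)
- Logistic Regression **cannot accept string features**, so every input column must be numeric

This workflow covers the common case of mixed nominal and ordinal features. To reuse it on new data, fit the encoders on the training split only and apply the same fitted encoders at prediction time.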
