## Encoding

Encoding is the next essential step after cleaning data.

Encoding means transforming categorical or text data into numeric form so that machine learning models can understand it.

Here's a complete guide 👇

## 🧭 Step 1: Identify What Needs Encoding

Categorical features usually come in two types:

| Type                    | Example                                                | Encoding Method          |
| ----------------------- | ------------------------------------------------------ | ------------------------ |
| **Nominal (no order)**  | Gender = {Male, Female}, City = {Paris, Tokyo, London} | One-Hot Encoding         |
| **Ordinal (has order)** | Size = {Small, Medium, Large}                          | Ordinal Encoding         |
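
One way to find candidate columns is to look at dtypes and cardinality. The following is a minimal sketch, assuming a small hypothetical DataFrame (the column names and values are illustrative only, not taken from a real dataset):

```python
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
df = pd.DataFrame({
    "City": ["Paris", "Tokyo", "London", "Paris"],
    "Size": ["Small", "Large", "Medium", "Small"],
    "Age": [25, 32, 47, 51],
})

# Columns that are not numeric are candidates for encoding.
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

# Number of distinct values per candidate column (its cardinality).
cardinality = df[categorical_cols].nunique()
print(cardinality)
```

Whether a column is nominal or ordinal cannot be read from the dtype; that part still needs domain knowledge.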


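## 🧩 Step 2: Choose an Encoding Technique

🔹 1. Label Encoding (integer codes)

Each category gets a unique integer label. scikit-learn's `LabelEncoder` assigns the integers in sorted (alphabetical) order, not in any meaningful order, and is intended for encoding the target column rather than input features.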

In [None]:
from sklearn.preprocessing import LabelEncoder

# LabelEncoder assigns integers in sorted (alphabetical) order of the labels
le = LabelEncoder()
df['Size'] = le.fit_transform(df['Size'])


| Size   | Encoded |
| ------ | ------- |
| Large  | 0       |
| Medium | 1       |
| Small  | 2       |


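✅ Good for: Target labels, or ordinal variables whose alphabetical order happens to match their real order.
⚠️ Avoid for nominal variables: it can mislead the model (it may think 2 > 1). `LabelEncoder` also numbers categories alphabetically (Large = 0, Medium = 1, Small = 2), which does not follow Small < Medium < Large, so prefer Ordinal Encoding with an explicit order for features like Size.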
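
To see which integer each category receives, the fitted encoder's `classes_` attribute can be inspected. A minimal sketch, assuming a hypothetical list of sizes:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical values, for illustration only.
sizes = pd.Series(["Small", "Medium", "Large", "Medium"])

le = LabelEncoder()
codes = le.fit_transform(sizes)

# classes_ holds the categories in sorted order; the position is the assigned integer.
mapping = {str(label): i for i, label in enumerate(le.classes_)}
print(mapping)  # {'Large': 0, 'Medium': 1, 'Small': 2}
```

The mapping follows alphabetical order, so "Small" ends up with the largest code.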

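🔹 2. One-Hot Encoding (for nominal data)

Creates a binary (0/1) column for each category. With `drop_first=True`, the first category is dropped and is represented by zeros in all remaining columns.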

In [None]:
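import pandas as pd

# drop_first=True drops the first category; dtype=int gives 0/1 rather than True/False
df = pd.get_dummies(df, columns=['City'], drop_first=True, dtype=int)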


| City (original) | City_Paris | City_Tokyo |
| --------------- | ---------- | ---------- |
| Paris           | 1          | 0          |
| Tokyo           | 0          | 1          |
| London          | 0          | 0          |


✅ Good for: Nominal variables (no natural order).
⚠️ Can increase dimensionality if there are many categories.
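
The effect of `drop_first` on the number of columns can be checked on a small example. A minimal sketch, assuming a hypothetical three-row DataFrame:

```python
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({"City": ["Paris", "Tokyo", "London"]})

# One column per category.
full = pd.get_dummies(df, columns=["City"], dtype=int)

# One column fewer: the first category (alphabetically, London) is dropped.
reduced = pd.get_dummies(df, columns=["City"], drop_first=True, dtype=int)

print(full.columns.tolist())     # ['City_London', 'City_Paris', 'City_Tokyo']
print(reduced.columns.tolist())  # ['City_Paris', 'City_Tokyo']
```

Dropping one column avoids perfectly collinear features, which mostly matters for linear models; tree-based models usually work with either form.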

🔹 3. Ordinal Encoding (custom order)

If categories have a logical order, define it manually.

In [None]:
from sklearn.preprocessing import OrdinalEncoder

# categories lists the values from lowest to highest; the output codes are floats (0.0, 1.0, 2.0)
encoder = OrdinalEncoder(categories=[['Small', 'Medium', 'Large']])
df[['Size']] = encoder.fit_transform(df[['Size']])


✅ Keeps order information.
⚠️ The order you specify must be meaningful, and by default any value missing from the `categories` list raises an error when fitting.
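
A small round trip shows that the explicit order is respected and that the codes can be mapped back to labels. A minimal sketch, assuming a hypothetical `Size` column:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data.
df = pd.DataFrame({"Size": ["Medium", "Small", "Large"]})

# The list gives the order explicitly: Small < Medium < Large.
encoder = OrdinalEncoder(categories=[["Small", "Medium", "Large"]])
encoded = encoder.fit_transform(df[["Size"]])
print(encoded.ravel().tolist())  # [1.0, 0.0, 2.0]

# inverse_transform recovers the original labels from the codes.
decoded = encoder.inverse_transform(encoded)
print(decoded.ravel().tolist())  # ['Medium', 'Small', 'Large']
```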

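🔹 4. Binary Encoding (for high-cardinality features)

Each category is first mapped to an integer, and the binary digits of that integer are then spread across columns. This needs roughly log2(n) columns for n categories, rather than the n columns of one-hot encoding.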

In [None]:
!pip install category_encoders
import category_encoders as ce

encoder = ce.BinaryEncoder(cols=['Product_ID'])
df = encoder.fit_transform(df)


✅ Good for: Columns with many unique categories (e.g., zip codes).
⚠️ Less interpretable: an individual column no longer corresponds to a single category.
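
To make the idea concrete, binary encoding can be sketched by hand with pandas and NumPy. This is an illustration of the technique only, not the `category_encoders` implementation, whose integer assignment and column names may differ; the product IDs and the `Product_ID_<i>` column names are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical product IDs, for illustration only.
products = pd.Series(["A17", "B02", "C33", "A17", "D08", "E51"], name="Product_ID")

# Step 1: give each distinct category an integer code (0, 1, 2, ...) in order of first appearance.
codes, uniques = pd.factorize(products)

# Step 2: number of bits needed to represent the largest code.
n_bits = max(1, (len(uniques) - 1).bit_length())

# Step 3: write each code in binary, one column per bit (most significant bit first).
shifts = np.arange(n_bits - 1, -1, -1)
bits = (codes[:, None] >> shifts) & 1

binary_df = pd.DataFrame(bits, columns=[f"Product_ID_{i}" for i in range(n_bits)])
print(binary_df)
```

Five distinct IDs fit into three columns here, where one-hot encoding would need five.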

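🔹 5. Target Encoding (Mean Encoding)

Replace each category with the average target value for that category (supervised learning only). Compute the averages on the training data only; using the full dataset leaks target information into the features.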

In [None]:
df['City_encoded'] = df.groupby('City')['Sales'].transform('mean')
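
A leakage-aware variant learns the category means from the training split and only looks them up for new data. A minimal sketch, assuming hypothetical `train` and `test` DataFrames and falling back to the overall training mean for categories that were never seen (that fallback is an assumption, other choices are possible):

```python
import pandas as pd

# Hypothetical toy data, for illustration only.
train = pd.DataFrame({
    "City": ["Paris", "Paris", "Tokyo", "London"],
    "Sales": [100.0, 200.0, 300.0, 400.0],
})
test = pd.DataFrame({"City": ["Paris", "Tokyo", "Berlin"]})

# Learn the per-category means from the training data only.
city_means = train.groupby("City")["Sales"].mean()
global_mean = train["Sales"].mean()

# Apply the learned mapping; unseen categories (Berlin) fall back to the global mean.
train["City_encoded"] = train["City"].map(city_means)
test["City_encoded"] = test["City"].map(city_means).fillna(global_mean)

print(test["City_encoded"].tolist())  # [150.0, 300.0, 250.0]
```

The training rows still see their own target value in this sketch; smoothing or out-of-fold means are common ways to reduce that effect.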


🔹 6. Frequency / Count Encoding

Replace each category with how often it appears.


In [None]:
df['City_encoded'] = df['City'].map(df['City'].value_counts())


✅ Simple and effective for tree-based models.
⚠️ Doesn't capture the relationship with the target variable, and categories with the same count become indistinguishable.
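
Counts and relative frequencies can both be produced with `value_counts`. A minimal sketch, assuming a hypothetical `City` column (the `City_count` and `City_freq` names are illustrative):

```python
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({"City": ["Paris", "Paris", "Tokyo", "London"]})

# Absolute counts per category.
df["City_count"] = df["City"].map(df["City"].value_counts())

# Relative frequencies (fractions that sum to 1 over the categories).
df["City_freq"] = df["City"].map(df["City"].value_counts(normalize=True))

print(df)
```

Tokyo and London both appear once, so they receive the same encoded value.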

## 🧠 Step 3: Verify the Encoding

After encoding:

In [None]:
df.info()
df.head()

Check:

* All categorical columns are now numeric.

* The number of columns has not grown unexpectedly.

* Encodings make sense (ordered vs unordered features handled properly).
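
These checks can also be done programmatically. A minimal sketch, assuming a hypothetical DataFrame that has already been encoded:

```python
import pandas as pd

# Hypothetical, already-encoded data.
df = pd.DataFrame({
    "City_Paris": [1, 0, 0],
    "City_Tokyo": [0, 1, 0],
    "Size": [0.0, 1.0, 2.0],
    "Age": [25, 32, 47],
})

# Any column that is neither numeric nor boolean still needs encoding.
non_numeric = df.select_dtypes(exclude=["number", "bool"]).columns.tolist()

# Column count, to compare against the count before encoding.
n_columns = df.shape[1]

# Mapping with a lookup table can introduce NaN for unmatched categories.
has_missing = bool(df.isna().any().any())

print(non_numeric, n_columns, has_missing)
```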

## ⚙️ Step 4: Automate with Scikit-learn Pipelines

To avoid data leakage, put the encoder inside a pipeline so that it is fitted only on the training data passed to `fit`:

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

categorical_cols = ['City', 'Gender']
numerical_cols = ['Age', 'Income']

preprocessor = ColumnTransformer([
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols),
    ('num', 'passthrough', numerical_cols)
])

model = Pipeline(steps=[
    ('preprocess', preprocessor),
    ('classifier', RandomForestClassifier())
])
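
Fitting the pipeline and predicting on a row with an unseen category might look like the following. This is a minimal sketch on hypothetical toy data; the values, the labels in `y_train`, and the `n_estimators` and `random_state` settings are assumptions chosen to keep the example small and repeatable:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training data and labels.
X_train = pd.DataFrame({
    "City": ["Paris", "Tokyo", "London", "Paris", "Tokyo", "London"],
    "Gender": ["Male", "Female", "Female", "Male", "Male", "Female"],
    "Age": [25, 32, 47, 51, 38, 29],
    "Income": [40000, 52000, 61000, 58000, 45000, 39000],
})
y_train = [0, 1, 1, 0, 1, 0]

# A new row whose City ("Berlin") never appeared during training.
X_new = pd.DataFrame({"City": ["Berlin"], "Gender": ["Female"], "Age": [30], "Income": [50000]})

preprocessor = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["City", "Gender"]),
    ("num", "passthrough", ["Age", "Income"]),
])
model = Pipeline(steps=[
    ("preprocess", preprocessor),
    ("classifier", RandomForestClassifier(n_estimators=10, random_state=0)),
])

# The encoder learns its categories from X_train only.
model.fit(X_train, y_train)

# handle_unknown="ignore" encodes the unseen city as all zeros instead of raising an error.
encoded_new = model.named_steps["preprocess"].transform(X_new)
prediction = model.predict(X_new)
print(encoded_new.shape, prediction)
```

Because the encoder is a pipeline step, cross-validation utilities refit it on each training fold, which is what keeps validation data out of the learned categories.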
