This notebook is based on and adapted from the Kaggle Learn “Intermediate Machine Learning” course.
I used the exercise structure and dataset provided by Kaggle as a foundation for exploring missing data and categorical encoding techniques.

In [1]:
# Set up code checking
import os
if not os.path.exists("../input/train.csv"):
    os.symlink("../input/home-data-for-ml-course/train.csv", "../input/train.csv")  
    os.symlink("../input/home-data-for-ml-course/test.csv", "../input/test.csv") 
from learntools.core import binder
binder.bind(globals())
from learntools.ml_intermediate.ex3 import *
print("Setup Complete")

Setup Complete


In this exercise, you will work with data from the [Housing Prices Competition for Kaggle Learn Users](https://www.kaggle.com/c/home-data-for-ml-course). 

![Ames Housing dataset image](https://storage.googleapis.com/kaggle-media/learn/images/lTJVG4e.png)

Load the training and validation sets in `X_train`, `X_valid`, `y_train`, and `y_valid`.  The test set is loaded in `X_test`.

- The following code is taken from the exercise unchanged; no modifications were needed.

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Read the data
X = pd.read_csv('../input/train.csv', index_col='Id') 
X_test = pd.read_csv('../input/test.csv', index_col='Id')

# Remove rows with missing target, separate target from predictors
X.dropna(axis=0, subset=['SalePrice'], inplace=True)
y = X.SalePrice
X.drop(['SalePrice'], axis=1, inplace=True)

# To keep things simple, we'll drop columns with missing values
cols_with_missing = [col for col in X.columns if X[col].isnull().any()] 
X.drop(cols_with_missing, axis=1, inplace=True)
X_test.drop(cols_with_missing, axis=1, inplace=True)

# Break off validation set from training data
X_train, X_valid, y_train, y_valid = train_test_split(X, y,
                                                      train_size=0.8, test_size=0.2,
                                                      random_state=0)

In [3]:
X_train.head()

Id,MSSubClass,MSZoning,LotArea,Street,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,...,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SaleType,SaleCondition
619,20,RL,11694,Pave,Reg,Lvl,AllPub,Inside,Gtl,NridgHt,...,108,0,0,260,0,0,7,2007,New,Partial
871,20,RL,6600,Pave,Reg,Lvl,AllPub,Inside,Gtl,NAmes,...,0,0,0,0,0,0,8,2009,WD,Normal
93,30,RL,13360,Pave,IR1,HLS,AllPub,Inside,Gtl,Crawfor,...,0,44,0,0,0,0,8,2009,WD,Normal
818,20,RL,13265,Pave,IR1,Lvl,AllPub,CulDSac,Gtl,Mitchel,...,59,0,0,0,0,0,7,2008,WD,Normal
303,20,RL,13704,Pave,IR1,Lvl,AllPub,Corner,Gtl,CollgCr,...,81,0,0,0,0,0,1,2006,WD,Normal


Notice that the dataset contains both numerical and categorical variables.  Encode the categorical data before training a model.

To compare different models, you'll use the same `evaluate_model_with_mae()` function from the previous exercise on handling missing values.  This function reports the [mean absolute error](https://en.wikipedia.org/wiki/Mean_absolute_error) (MAE) from a random forest model.

In [4]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Trains a Random Forest model and evaluates it using Mean Absolute Error
def evaluate_model_with_mae(X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)

_Adapted from Kaggle Learn: Categorical Variables_

## Step 1: Drop columns with categorical data

Preprocess the data in `X_train` and `X_valid` to remove columns with categorical data.  Set the preprocessed DataFrames to `drop_X_train` and `drop_X_valid`, respectively.  

In [5]:
# Drop the categorical (object-dtype) columns from the training and validation data
drop_X_train = X_train.select_dtypes(exclude=['object'])
drop_X_valid = X_valid.select_dtypes(exclude=['object'])


Print the MAE for this approach

In [6]:
print("MAE from Approach 1 (Drop categorical variables):")
print(evaluate_model_with_mae(drop_X_train, drop_X_valid, y_train, y_valid))

MAE from Approach 1 (Drop categorical variables):
17837.82570776256


Before jumping into ordinal encoding, we'll investigate the dataset.  Specifically, we'll look at the `'Condition2'` column.  The code cell below prints the unique entries in both the training and validation sets.

In [7]:
print("Unique values in 'Condition2' column in training data:", X_train['Condition2'].unique())
print("\nUnique values in 'Condition2' column in validation data:", X_valid['Condition2'].unique())

Unique values in 'Condition2' column in training data: ['Norm' 'PosA' 'Feedr' 'PosN' 'Artery' 'RRAe']

Unique values in 'Condition2' column in validation data: ['Norm' 'RRAn' 'RRNn' 'Artery' 'Feedr' 'PosN']


- _Above: adapted from Kaggle Learn_

## Step 2: Ordinal Encoding

### Handling Issues with Unseen Categories

When fitting an `OrdinalEncoder` on the training data and transforming the validation data, an error can occur if the validation set contains categories not present in the training set. 

This is because the `OrdinalEncoder` by default does not know how to handle unseen categories, and will raise a `ValueError`. For instance, if a categorical feature includes a new category like `"Purple"` in the validation set that wasn't present during fitting, the encoder won't know how to process it.

A solution is to use the `handle_unknown='use_encoded_value'` parameter along with `unknown_value=-1` to encode such categories explicitly.
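As a minimal sketch of that option (the two tiny DataFrames below are made-up stand-ins for the real training and validation data), the encoder can be told to map any category it did not see during fitting to `-1`. This assumes a scikit-learn version that supports `handle_unknown='use_encoded_value'` (0.24 or later).

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data: 'RRAn' appears only in the validation set
toy_train = pd.DataFrame({"Condition2": ["Norm", "PosA", "Feedr", "Norm"]})
toy_valid = pd.DataFrame({"Condition2": ["Norm", "RRAn", "Feedr"]})

# Unseen categories are encoded as -1 instead of raising a ValueError
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_encoded = encoder.fit_transform(toy_train)
valid_encoded = encoder.transform(toy_valid)

# Categories are sorted at fit time: Feedr -> 0, Norm -> 1, PosA -> 2
print(encoder.categories_)
print(valid_encoded.ravel())  # the unseen 'RRAn' becomes -1
```

Whether `-1` is a sensible placeholder depends on the model; tree-based models such as the random forest used here can simply split it off from the known codes.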


## Identifying Categorical Columns for Encoding

In real-world datasets, it's common for categorical variables in the validation or test set to include categories not seen during training. This causes issues with encoders like `OrdinalEncoder`, which by default cannot handle unknown categories.

There are multiple ways to address this:
- Implementing a custom encoder that gracefully handles unseen values
- Using encoding methods that can ignore unknowns (e.g., one-hot encoding with `handle_unknown='ignore'`)
- Dropping problematic columns altogether (a simple and effective approach)

In this step, we identify:
- `bad_label_cols`: categorical columns that contain values in the validation set not present in the training set (unsafe for ordinal encoding)
- `good_label_cols`: categorical columns that can be safely ordinal encoded


In [8]:
# Categorical columns in the training data
object_cols = [col for col in X_train.columns if X_train[col].dtype == "object"]

# Columns that can be safely ordinal encoded
good_label_cols = [col for col in object_cols if 
                   set(X_valid[col]).issubset(set(X_train[col]))]
        
# Problematic columns that will be dropped from the dataset
bad_label_cols = list(set(object_cols)-set(good_label_cols))
        
print('Categorical columns that will be ordinal encoded:', good_label_cols)
print('\nCategorical columns that will be dropped from the dataset:', bad_label_cols)

Categorical columns that will be ordinal encoded: ['MSZoning', 'Street', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'BldgType', 'HouseStyle', 'RoofStyle', 'Exterior1st', 'Exterior2nd', 'ExterQual', 'ExterCond', 'Foundation', 'Heating', 'HeatingQC', 'CentralAir', 'KitchenQual', 'PavedDrive', 'SaleType', 'SaleCondition']

Categorical columns that will be dropped from the dataset: ['Condition2', 'Functional', 'RoofMatl']
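To see which specific values make a column unsafe, the subset check can be turned into a set difference. The helper below is a hypothetical illustration (`unseen_categories` is not part of the exercise), shown on made-up toy data; on the real data it could be called with `X_train`, `X_valid`, and `object_cols`.

```python
import pandas as pd

def unseen_categories(train_df, valid_df, columns):
    """Map each column to the sorted validation values that never occur in training."""
    unseen = {}
    for col in columns:
        extra = set(valid_df[col].dropna()) - set(train_df[col].dropna())
        if extra:
            unseen[col] = sorted(extra)
    return unseen

# Hypothetical toy data: only 'Condition2' has values unseen in training
toy_train = pd.DataFrame({
    "Condition2": ["Norm", "PosA", "Feedr"],
    "Street": ["Pave", "Grvl", "Pave"],
})
toy_valid = pd.DataFrame({
    "Condition2": ["Norm", "RRAn", "RRNn"],
    "Street": ["Pave", "Pave", "Grvl"],
})

result = unseen_categories(toy_train, toy_valid, ["Condition2", "Street"])
print(result)  # {'Condition2': ['RRAn', 'RRNn']}
```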


## Ordinal Encoding of Safe Categorical Columns

In this step, we preprocess the data by applying ordinal encoding to only those categorical columns that are safe to encode — that is, columns where all categories present in the validation set also exist in the training set (`good_label_cols`).

The remaining categorical columns (`bad_label_cols`) are excluded from the dataset to prevent errors caused by unseen categories.

The resulting preprocessed datasets are stored in:
- `label_X_train`: the training set with ordinal encoding applied to safe categorical features
- `label_X_valid`: the validation set encoded using the same mapping


In [9]:
from sklearn.preprocessing import OrdinalEncoder

# Drop problematic columns
label_X_train = X_train.drop(bad_label_cols, axis=1)
label_X_valid = X_valid.drop(bad_label_cols, axis=1)

# Apply ordinal encoding to safe categorical columns
ordinal_encoder = OrdinalEncoder()
label_X_train[good_label_cols] = ordinal_encoder.fit_transform(label_X_train[good_label_cols])
label_X_valid[good_label_cols] = ordinal_encoder.transform(label_X_valid[good_label_cols])

Get MAE for the 2nd approach:

In [10]:
print("MAE from Approach 2 (Ordinal Encoding):") 
print(evaluate_model_with_mae(label_X_train, label_X_valid, y_train, y_valid))

MAE from Approach 2 (Ordinal Encoding):
17098.01649543379


## Comparing Categorical Data Handling Strategies

At this point, two strategies have been evaluated for handling categorical variables:

1. **Dropping all categorical columns**
2. **Ordinal encoding** the categorical columns that are safe to encode, and dropping the rest

In this run, ordinal encoding gave a lower validation MAE (about 17,098) than dropping the categorical columns (about 17,838), which suggests that the categorical features carry useful signal for this model.

Next, we'll explore a more flexible and commonly used method: **one-hot encoding**. Before diving into that, however, there’s one more important concept to address.


- Unmodified cell from Kaggle

In [11]:
# Get number of unique entries in each column with categorical data
object_nunique = list(map(lambda col: X_train[col].nunique(), object_cols))
d = dict(zip(object_cols, object_nunique))

# Print number of unique entries by column, in ascending order
sorted(d.items(), key=lambda x: x[1])

[('Street', 2),
 ('Utilities', 2),
 ('CentralAir', 2),
 ('LandSlope', 3),
 ('PavedDrive', 3),
 ('LotShape', 4),
 ('LandContour', 4),
 ('ExterQual', 4),
 ('KitchenQual', 4),
 ('MSZoning', 5),
 ('LotConfig', 5),
 ('BldgType', 5),
 ('ExterCond', 5),
 ('HeatingQC', 5),
 ('Condition2', 6),
 ('RoofStyle', 6),
 ('Foundation', 6),
 ('Heating', 6),
 ('Functional', 6),
 ('SaleCondition', 6),
 ('RoofMatl', 7),
 ('HouseStyle', 8),
 ('Condition1', 9),
 ('SaleType', 9),
 ('Exterior1st', 15),
 ('Exterior2nd', 16),
 ('Neighborhood', 25)]

## Step 3: Investigating cardinality

### Part A

The output above shows, for each column with categorical data, the number of unique values in the column.  For instance, the `'Street'` column in the training data has two unique values: `'Grvl'` and `'Pave'`, corresponding to a gravel road and a paved road, respectively.

We refer to the number of unique entries of a categorical variable as the **cardinality** of that categorical variable.  For instance, the `'Street'` variable has cardinality 2.


In [12]:
# How many categorical variables in the training data
# have cardinality greater than 10?
high_cardinality_numcols = 3

# How many columns are needed to one-hot encode the 
# 'Neighborhood' variable in the training data?
num_cols_neighborhood = 25
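Both answers can also be derived from the counts rather than read off by eye. The sketch below copies the unique-value counts from the output above into a plain dictionary (here named `cardinality` for illustration); in the notebook itself, the existing dictionary `d` could be used instead.

```python
# Unique-value counts copied from the sorted output above
cardinality = {
    "Street": 2, "Utilities": 2, "CentralAir": 2, "LandSlope": 3,
    "PavedDrive": 3, "LotShape": 4, "LandContour": 4, "ExterQual": 4,
    "KitchenQual": 4, "MSZoning": 5, "LotConfig": 5, "BldgType": 5,
    "ExterCond": 5, "HeatingQC": 5, "Condition2": 6, "RoofStyle": 6,
    "Foundation": 6, "Heating": 6, "Functional": 6, "SaleCondition": 6,
    "RoofMatl": 7, "HouseStyle": 8, "Condition1": 9, "SaleType": 9,
    "Exterior1st": 15, "Exterior2nd": 16, "Neighborhood": 25,
}

# Columns whose cardinality is strictly greater than 10
high_cardinality = [col for col, n in cardinality.items() if n > 10]
print(len(high_cardinality), high_cardinality)
# 3 ['Exterior1st', 'Exterior2nd', 'Neighborhood']

# One-hot encoding creates one column per unique value
print(cardinality["Neighborhood"])  # 25
```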

### Part B: Impact of Encoding on Dataset Size

One-hot encoding can significantly increase the dimensionality of a dataset, especially when applied to categorical columns with high cardinality (many unique values). This expansion can affect computational efficiency and model performance.

Therefore, it is common practice to:
- One-hot encode only categorical features with **low cardinality**, to avoid a large increase in features.
- For **high cardinality** columns, either drop them or use ordinal encoding as a more compact alternative.

**Example:**  
Consider a dataset with 10,000 rows and a categorical column that contains 100 unique categories.

- When one-hot encoding this column, the original column is replaced by **100 new feature columns**, one per category, for a net gain of 99 columns.
- When using ordinal encoding, the original column is replaced by **a single column** in which each category is mapped to a unique integer, so the number of columns does not change.

Below, compute the total number of entries added to the dataset in each encoding scenario.


In [13]:
# How many entries are added to the dataset by
# replacing the column with a one-hot encoding?
OH_entries_added = 990000

# How many entries are added to the dataset by
# replacing the column with an ordinal encoding?
label_entries_added = 0
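The two values above follow from counting entries (rows × columns) before and after the replacement. A short worked version of that arithmetic, using the example's 10,000 rows and 100 categories (the variable names are illustrative only):

```python
n_rows = 10_000       # rows in the example dataset
n_categories = 100    # unique values in the categorical column

# The original column contributes one entry per row and is removed in both cases
original_entries = n_rows * 1

# One-hot encoding: one new column per category
oh_entries_added = n_rows * n_categories - original_entries

# Ordinal encoding: a single integer column takes the original's place
ordinal_entries_added = n_rows * 1 - original_entries

print(oh_entries_added)       # 990000
print(ordinal_entries_added)  # 0
```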

## Selecting Columns for One-Hot Encoding

One-hot encoding can lead to a large number of new features when applied to high-cardinality categorical columns. To manage this, we focus on encoding only those categorical columns with **low cardinality** (fewer than 10 unique categories). 

- `low_cardinality_cols` will store the list of columns selected for one-hot encoding.
- `high_cardinality_cols` will contain categorical columns with high cardinality that will be dropped to avoid dimensionality explosion.

The code below identifies these columns accordingly.


In [14]:
# Columns that will be one-hot encoded
low_cardinality_cols = [col for col in object_cols if X_train[col].nunique() < 10]

# Columns that will be dropped from the dataset
high_cardinality_cols = list(set(object_cols)-set(low_cardinality_cols))

print('Categorical columns that will be one-hot encoded:', low_cardinality_cols)
print('\nCategorical columns that will be dropped from the dataset:', high_cardinality_cols)

Categorical columns that will be one-hot encoded: ['MSZoning', 'Street', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'ExterQual', 'ExterCond', 'Foundation', 'Heating', 'HeatingQC', 'CentralAir', 'KitchenQual', 'Functional', 'PavedDrive', 'SaleType', 'SaleCondition']

Categorical columns that will be dropped from the dataset: ['Exterior1st', 'Exterior2nd', 'Neighborhood']


## Step 4: One-Hot Encoding

In this step, we apply one-hot encoding to the categorical columns with low cardinality (stored in `low_cardinality_cols`). 

- Only the columns in `low_cardinality_cols` are one-hot encoded.
- The remaining categorical columns (those in `high_cardinality_cols`) are dropped to keep the number of features manageable.
- The full set of categorical columns is stored in `object_cols`.

The processed datasets are saved as:
- `OH_X_train` for the training features
- `OH_X_valid` for the validation features


In [15]:
from sklearn.preprocessing import OneHotEncoder

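# Create encoder; unseen categories are encoded as all zeros.
# Note: `sparse_output` requires scikit-learn >= 1.2 (older versions use `sparse`)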
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

# Apply encoder and assign meaningful column names
OH_cols_train = pd.DataFrame(
    OH_encoder.fit_transform(X_train[low_cardinality_cols]),
    columns=OH_encoder.get_feature_names_out(low_cardinality_cols),
    index=X_train.index
)

OH_cols_valid = pd.DataFrame(
    OH_encoder.transform(X_valid[low_cardinality_cols]),
    columns=OH_encoder.get_feature_names_out(low_cardinality_cols),
    index=X_valid.index
)

# Drop original categorical columns
num_X_train = X_train.drop(columns=object_cols)
num_X_valid = X_valid.drop(columns=object_cols)

# Concatenate numerical and one-hot encoded columns
OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)


In [16]:
# Make sure all column names are strings before fitting the model
OH_X_train.columns = OH_X_train.columns.astype(str)
OH_X_valid.columns = OH_X_valid.columns.astype(str)


The MAE for this approach:

In [17]:
print("MAE from Approach 3 (One-Hot Encoding):") 
print(evaluate_model_with_mae(OH_X_train, OH_X_valid, y_train, y_valid))

MAE from Approach 3 (One-Hot Encoding):
17525.345719178084
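The one-hot pipeline could keep columns such as `'Condition2'` because the encoder was created with `handle_unknown='ignore'`. A minimal sketch of what that setting does, on made-up toy data: a category that was not seen during fitting is encoded as a row of zeros rather than raising an error. The default sparse output is converted with `.toarray()` here so the sketch does not depend on the `sparse_output` argument.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: 'RRAn' appears only in the validation set
toy_train = pd.DataFrame({"Condition2": ["Norm", "Feedr", "Norm"]})
toy_valid = pd.DataFrame({"Condition2": ["Norm", "RRAn"]})

toy_encoder = OneHotEncoder(handle_unknown="ignore")
toy_encoder.fit(toy_train)

# One output column per category seen in training: Feedr, Norm
print(toy_encoder.categories_)

# 'Norm' sets its own column to 1; the unseen 'RRAn' becomes all zeros
valid_onehot = toy_encoder.transform(toy_valid).toarray()
print(valid_onehot)
```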


## Conclusion

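In this exercise, adapted from the Kaggle Learn “Categorical Variables” exercise in the “Intermediate Machine Learning” course, I compared three ways of handling categorical variables and measured their effect on validation MAE using a random forest regressor. Columns with missing values were simply dropped at the start.

- Dropping categorical columns resulted in the highest mean absolute error (about 17,838), which suggests that these columns carry useful information.

- Ordinal encoding of the safe categorical features produced the lowest MAE (about 17,098). Note that `OrdinalEncoder` assigns integers by sorted category value rather than by any meaningful ranking, so the gain more likely comes from retaining the categorical information, which tree-based models can still split on, than from a preserved order.

- One-hot encoding (MAE about 17,525) improved on dropping the columns but did not outperform ordinal encoding in this case. Possible reasons include the larger number of sparse features and the fact that the one-hot pipeline dropped the three high-cardinality columns (`Exterior1st`, `Exterior2nd`, `Neighborhood`) that the ordinal pipeline kept.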

These experiments demonstrate how different preprocessing choices impact predictive accuracy. Selecting the right encoding strategy requires understanding the dataset and the trade-offs between model complexity and information retention.
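For a side-by-side view, the three validation scores reported in this notebook can be ranked and expressed relative to the drop-columns baseline. This is only a small bookkeeping sketch; the values are copied from the outputs above, and the dictionary name is illustrative.

```python
# Validation MAE values copied from the outputs reported in this notebook
mae_by_approach = {
    "Drop categorical columns": 17837.82570776256,
    "Ordinal encoding": 17098.01649543379,
    "One-hot encoding": 17525.345719178084,
}

baseline = mae_by_approach["Drop categorical columns"]

# Rank approaches from lowest (best) to highest MAE
ranking = sorted(mae_by_approach.items(), key=lambda item: item[1])

for name, mae in ranking:
    # Percentage reduction in MAE relative to dropping the columns
    improvement = 100 * (baseline - mae) / baseline
    print(f"{name}: MAE = {mae:,.0f} ({improvement:.1f}% lower than the baseline)")
```

These differences come from a single train/validation split with a fixed random seed, so they should be read as indicative rather than conclusive.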

For future work, I plan to explore alternative encoding methods, such as target encoding, and experiment with hyperparameter tuning to enhance model performance further.