### Working with Categorical Data

- Notebook based on the Kaggle tutorials:
  - https://www.kaggle.com/code/alexisbcook/categorical-variables
  - https://www.kaggle.com/learn/intermediate-machine-learning



In [None]:
# TODO 
# 1. Split features and target, then split the training data into train/validation sets.
# 2. Impute missing values. 
# 3. Determine categorical columns and numerical columns.
# 4. Encode categorical columns which are ordinal and have low cardinality.
#     Ensure consistent column values between train/validation.
# 5. Reduce cardinality by introducing "Other" category for low frequency categories.
# 6. One-hot encode the remainder of the categorical variables, after reducing their cardinality.
# 7. Combine numerical and encoded categorical variables.
# 8. Create a random forest model. Fit to train data and check model accuracy on validation data.
# 9. categorical-data-competition-prediction.ipynb:  
# Preprocess final test data, use all of training data for model fit, generate predictions and submit to competition. 

In [None]:
# 1. Split features and target, then split the training data into train/validation sets.
import pandas as pd
import numpy as np
pd.set_option('display.max_columns',100)
pd.set_option('display.max_rows',200)
from sklearn.model_selection import train_test_split

X_full = pd.read_csv("./home-data-for-ml-course/train.csv",index_col="Id")
X_test_full = pd.read_csv("./home-data-for-ml-course/test.csv",index_col="Id")

# remove rows with no y values
X_full.dropna(axis=0,subset=['SalePrice'],inplace=True)

y = X_full['SalePrice']

# drop SalePrice column from predictors
X_full.drop(['SalePrice'],axis=1,inplace=True)

X_train, X_valid, y_train, y_valid = train_test_split(X_full,y,train_size=0.8,test_size=0.2,random_state=1)

In [None]:
# 2. Impute missing values. 
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='most_frequent')

imputed_X_train = pd.DataFrame(imputer.fit_transform(X_train))
imputed_X_valid = pd.DataFrame(imputer.transform(X_valid))

# Imputation returns a bare array, so restore the column names
imputed_X_train.columns = X_train.columns
imputed_X_valid.columns = X_valid.columns

# Restore the original row index of X_train and X_valid
imputed_X_train.index = X_train.index
imputed_X_valid.index = X_valid.index

# Restore the original data types (imputing mixed-type columns yields object dtype)
imputed_X_train = imputed_X_train.astype(X_train.dtypes.to_dict())
imputed_X_valid = imputed_X_valid.astype(X_train.dtypes.to_dict())


print(imputed_X_train.dtypes.unique())
print(len(imputed_X_train.columns))
print(len([col for col in imputed_X_train.columns if imputed_X_train[col].dtype == "object"]))
print(len([col for col in imputed_X_train.columns if imputed_X_train[col].dtype == "int64"]))
print(len([col for col in imputed_X_train.columns if imputed_X_train[col].dtype == "float64"]))


In [None]:
# 3. Determine categorical columns and numerical columns.
object_cols = [col for col in imputed_X_train.columns if imputed_X_train[col].dtype == "object"]
numerical_cols = [col for col in imputed_X_train.columns
                    if imputed_X_train[col].dtype in ("int64", "float64")]

numerical_X_train_df = imputed_X_train[numerical_cols]
numerical_X_valid_df = imputed_X_valid[numerical_cols]

print(len(object_cols))
print(len(numerical_cols))
print(len(imputed_X_train.columns))

In [None]:
# 4. Encode categorical columns which are ordinal. Ensure consistent column values between train/validation.

# Note: determine ordinal columns by looking at data_description.txt
ordinal_cols = ["ExterQual", "ExterCond", "BsmtQual", "BsmtCond", "BsmtExposure", "BsmtFinType1", "BsmtFinType2",
                "HeatingQC", "KitchenQual", "Functional", "FireplaceQu", "GarageQual", "GarageCond"]

In [None]:
# Note: columns whose validation values all appear in the training set can be safely encoded

good_label_ordinal_cols = [col for col in ordinal_cols
                   if set(imputed_X_valid[col]).issubset(set(imputed_X_train[col]))]
bad_label_ordinal_cols = list(set(ordinal_cols) - set(good_label_ordinal_cols))

print('Categorical columns that will be ordinal encoded:', good_label_ordinal_cols)
print('\nCategorical columns that will be dropped from the dataset:', bad_label_ordinal_cols)
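
The consistency check above can be illustrated on toy data. This is a minimal sketch with hypothetical values (`train_values`, `valid_values`, and the helper `split_by_consistency` are illustrative names, not part of the notebook): a column is treated as safe to encode only if every validation category also appears in the training data.

```python
# Hypothetical toy data: category labels per column (not the housing data)
train_values = {"Quality": ["Gd", "TA", "Ex", "TA"], "Condition": ["Gd", "TA", "TA", "Gd"]}
valid_values = {"Quality": ["TA", "Gd"], "Condition": ["Fa", "TA"]}

def split_by_consistency(train, valid):
    """Return (good, bad) column lists; a column is good if its validation labels are a subset of its training labels."""
    good = [col for col in train if set(valid[col]).issubset(set(train[col]))]
    bad = [col for col in train if col not in good]
    return good, bad

good_cols, bad_cols = split_by_consistency(train_values, valid_values)
print(good_cols)  # ['Quality']
print(bad_cols)   # ['Condition'] because "Fa" never occurs in the training labels
```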

In [None]:
# Ordinal columns with inconsistent labels (bad_label_ordinal_cols) are excluded here too, i.e. dropped
remaining_object_cols = list(set(object_cols) - (set(good_label_ordinal_cols+bad_label_ordinal_cols))) 
print(len(object_cols))
print(len(good_label_ordinal_cols))
print(len(remaining_object_cols))
# TODO: instead of dropping bad_label_ordinal_cols, fix their values or handle them with the one-hot encoder?

In [None]:
from sklearn.preprocessing import OrdinalEncoder

ordinal_X_train = imputed_X_train.copy()
ordinal_X_valid = imputed_X_valid.copy()

ordinal_X_train = ordinal_X_train[good_label_ordinal_cols]
ordinal_X_valid = ordinal_X_valid[good_label_ordinal_cols]

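# Apply ordinal encoder to categorical cols with good labels
# Note: with the default categories="auto", codes follow the sorted (alphabetical) label order,
# not the quality ranking described in data_description.txt
ordinal_encoder = OrdinalEncoder()
ordinal_encoded_X_train = pd.DataFrame(ordinal_encoder.fit_transform(ordinal_X_train))
ordinal_encoded_X_valid = pd.DataFrame(ordinal_encoder.transform(ordinal_X_valid))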

ordinal_encoded_X_train.index = imputed_X_train.index
ordinal_encoded_X_valid.index = imputed_X_valid.index

ordinal_encoded_X_train.columns = ordinal_X_train.columns
ordinal_encoded_X_valid.columns = ordinal_X_valid.columns

In [None]:
print(type(ordinal_encoded_X_train))
print(ordinal_encoded_X_train.shape)
print(ordinal_encoded_X_valid.shape)

print(ordinal_encoded_X_train.index)
print(ordinal_encoded_X_train.columns)

ordinal_encoded_X_train.head()

In [None]:
ordinal_encoded_X_valid.head()

In [None]:
#Note: One-hot encode the remaining object columns
print(remaining_object_cols)

In [None]:
# 5. Reduce cardinality by introducing "Other" category for low frequency categories.
low_cardinality_cols = []
high_cardinality_cols = []
for col in remaining_object_cols:
    if imputed_X_train[col].nunique() < 8 and imputed_X_train[col].dtype == "object":
        low_cardinality_cols.append(col)
    else:
        high_cardinality_cols.append(col)



In [None]:
# Note: reduce the cardinality of high cardinality columns by mapping low frequency values to an "Other" label
# Based on: https://towardsdatascience.com/dealing-with-features-that-have-high-cardinality-1c9212d7ff1b

from collections import Counter
def reduce_cardinality(column, threshold = 0.75, return_categories = True):
    """Keep the most frequent categories that together cover `threshold` of the rows; map the rest to "Other"."""
    threshold_value = threshold*len(column)
    frequency_sum = 0

    new_category_list = []
    counts = Counter(column)

    # Walk the categories from most to least frequent until the cumulative count reaches the threshold
    for category, count in counts.most_common():
        frequency_sum = frequency_sum + count
        new_category_list.append(category)
        if frequency_sum >= threshold_value:
            break

    # "Other" is itself a valid category of the transformed column
    new_category_list.append("Other")
    new_column = column.apply(lambda x: x if x in new_category_list else "Other")

    if return_categories:
        return new_column,new_category_list
    else:
        return new_column


transformed_col, new_categories = reduce_cardinality(imputed_X_train["Neighborhood"])
print(new_categories)
print(type(transformed_col))
print(transformed_col)
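
To make the thresholding concrete, here is a small standard-library sketch of the same idea on a made-up column. The helper `top_categories` and the toy values are hypothetical; the sketch also shows one way to reuse the categories learned from the training rows on another split, so both end up with the same labels.

```python
from collections import Counter

def top_categories(values, threshold=0.75):
    """Most frequent categories whose cumulative count first reaches threshold * len(values)."""
    target = threshold * len(values)
    kept, running_total = [], 0
    for category, count in Counter(values).most_common():
        kept.append(category)
        running_total += count
        if running_total >= target:
            break
    return kept

# Hypothetical toy column: 10 training rows, 4 distinct categories
train = ["A"] * 5 + ["B"] * 3 + ["C"] + ["D"]
valid = ["A", "C", "B", "E"]

kept = top_categories(train, threshold=0.75)  # A covers 5/10, A and B cover 8/10 >= 7.5
reduced_train = [v if v in kept else "Other" for v in train]
# Reuse the categories learned from the training rows for the other split
reduced_valid = [v if v in kept else "Other" for v in valid]

print(kept)           # ['A', 'B']
print(reduced_valid)  # ['A', 'Other', 'B', 'Other']
```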

In [None]:
OH_X_train = imputed_X_train[low_cardinality_cols]
OH_X_valid = imputed_X_valid[low_cardinality_cols]

for col in high_cardinality_cols:
    # Learn the categories to keep from the training data only
    transformed_X_train_col, kept_categories = reduce_cardinality(imputed_X_train[col], threshold=0.70)
    OH_X_train = pd.concat([OH_X_train,transformed_X_train_col.to_frame()],axis=1)

    # Reuse the training categories for validation so both sets share the same labels
    valid_col = imputed_X_valid[col]
    transformed_X_valid_col = valid_col.where(valid_col.isin(kept_categories), "Other")
    OH_X_valid = pd.concat([OH_X_valid,transformed_X_valid_col.to_frame()],axis=1)



In [None]:
print(type(OH_X_train))
print(OH_X_train.shape)
OH_X_train.describe()

In [None]:
print(type(OH_X_valid))
print(OH_X_valid.shape)
OH_X_valid.describe()

In [None]:
# 6. One-hot encode the remainder of the categorical variables, after reducing their cardinality.

from sklearn.preprocessing import OneHotEncoder

# Note: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
# def custom_combiner(feature, category):
#     return str(feature) + "_" + type(category).__name__ + "_" + str(category)

# Note: handle_unknown="ignore" avoids errors when the validation data contains classes that aren't represented in the training data
# sparse_output=False ensures that the encoded columns are returned as a numpy array (instead of a sparse matrix)
hot1_encoder = OneHotEncoder(handle_unknown="ignore",sparse_output=False)

OH_encoded_X_train = pd.DataFrame(hot1_encoder.fit_transform(OH_X_train))
OH_encoded_X_valid = pd.DataFrame(hot1_encoder.transform(OH_X_valid))

OH_encoded_X_train.index = OH_X_train.index
OH_encoded_X_valid.index = OH_X_valid.index

OH_encoded_X_train.columns = hot1_encoder.get_feature_names_out( OH_X_train.columns.values)
OH_encoded_X_valid.columns = hot1_encoder.get_feature_names_out(OH_X_valid.columns.values)

print(OH_X_train.shape)
print(OH_encoded_X_train.shape)
print("*"*10)

print(OH_encoded_X_train.index)
print(OH_encoded_X_train.columns)
print("*"*10)
print(OH_encoded_X_valid.index)
print(OH_encoded_X_valid.columns)


In [None]:
OH_encoded_X_train.head()

In [None]:
OH_encoded_X_valid.head()

In [None]:
# 7. Combine numerical and encoded categorical dataframes.
train_frames = [numerical_X_train_df, ordinal_encoded_X_train,  OH_encoded_X_train]
valid_frames = [numerical_X_valid_df,ordinal_encoded_X_valid, OH_encoded_X_valid ]

for df in train_frames:
    print(len(df.columns.values))


# Note: should not be overlapping columns, but indexes should match
# for f in range(len(train_frames)-1):
#     intersect = set(train_frames[f].columns.values).intersection(set(train_frames[f+1].columns.values))
#     if intersect:
#         print(intersect)


# for f in range(len(valid_frames)-1):
#     intersect = set(valid_frames[f].columns.values).intersection(set(valid_frames[f+1].columns.values))
#     if intersect:
#         print(intersect)

final_X_train = pd.concat(train_frames,axis=1)
final_X_valid = pd.concat(valid_frames,axis=1)

# print(final_X_train.columns)
# print(final_X_valid.columns)

print(final_X_train.shape, y_train.shape, final_X_valid.shape,y_valid.shape)


In [None]:
# 8. Create a random forest model. Fit to train data and check model accuracy on validation data.
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

def check_model_accuracy(X_train,X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=100,random_state=1)
    model.fit(X_train,y_train)
    predictions = model.predict(X_valid)
    return model, mean_absolute_error(y_valid,predictions)


final_model,error = check_model_accuracy(final_X_train, final_X_valid,y_train,y_valid)
print(final_model)
print(error)
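
`mean_absolute_error` averages the absolute differences between the actual and predicted values, so a lower number means a better fit. A small worked example with made-up prices (the helper `mean_abs_error` is illustrative, not part of scikit-learn):

```python
def mean_abs_error(actual, predicted):
    """Average of |actual - predicted| over all rows."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical sale prices and model predictions
actual = [200000, 150000, 310000]
predicted = [190000, 165000, 300000]

# Absolute errors are 10000, 15000 and 10000, so the mean is 35000 / 3
print(mean_abs_error(actual, predicted))
```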



In [None]:
# Preprocess test data
test_imputer = SimpleImputer(strategy='most_frequent')
test_ordinal_encoder = OrdinalEncoder()
test_1hot_encoder = OneHotEncoder(handle_unknown="ignore",sparse_output=False)

#1. Get test data
X_test_final = X_test_full.copy()
# 2. Impute missing values. 
X_test_final_imputed = pd.DataFrame(test_imputer.fit_transform(X_test_final))
X_test_final_imputed.index = X_test_final.index
X_test_final_imputed.columns = X_test_final.columns
X_test_final_imputed = X_test_final_imputed.astype(X_test_final.dtypes.to_dict())

print(X_test_final_imputed.dtypes.unique())
print(X_test_final_imputed.shape)
X_test_final_imputed.head()

In [None]:
# 3. Determine categorical and numerical columns.
test_object_cols = [col for col in X_test_final_imputed.columns.values
                        if X_test_final_imputed[col].dtype == "object"]
test_numerical_cols = [ col for col in X_test_final_imputed.columns.values
                       if X_test_final_imputed[col].dtype == "float64" or \
                         X_test_final_imputed[col].dtype == "int64"]

test_ordinal_cols = ["ExterQual", "ExterCond", "BsmtQual", "BsmtCond","BsmtExposure","BsmtFinType1","BsmtFinType2",\
"HeatingQC","KitchenQual","Functional","FireplaceQu","GarageQual","GarageCond"]
X_train_full = pd.concat([final_X_train, final_X_valid],axis=0)

# Note: features used in training data should match those in test data
test_data_cols = test_object_cols + test_numerical_cols + test_ordinal_cols

for col in X_train_full.columns.values:
    if col not in test_data_cols:
        print(col)

In [None]:

# 4. Encode categorical columns which are ordinal and have low cardinality.
#     Ensure consistent category values between the training data used to fit
#     the encoder and the test data used for the final competition predictions.

columns_with_consistent_values = []
not_consistent_cols = []

for col in good_label_ordinal_cols:
    # Compare the raw (imputed) labels: the encoded training frames hold numeric codes, not labels.
    # Every category in the test data must have been seen when the encoder was fitted.
    if set(X_test_final_imputed[col]).issubset(set(imputed_X_train[col])):
        columns_with_consistent_values.append(col)
    else:
        not_consistent_cols.append(col)

print("columns_with_consistent_values: \n",  columns_with_consistent_values)
print("not_consistent_cols: \n ",not_consistent_cols)

# Reuse the encoder fitted on the training data so that test codes match the training codes.
# Note: transform() raises an error if not_consistent_cols is non-empty (unseen categories).
ordinal_X_test = X_test_final_imputed[good_label_ordinal_cols]
ordinal_encoded_X_test = pd.DataFrame(ordinal_encoder.transform(ordinal_X_test))

ordinal_encoded_X_test.index = X_test_final_imputed.index
ordinal_encoded_X_test.columns = good_label_ordinal_cols

print(ordinal_encoded_X_test.shape)
ordinal_encoded_X_test.head()
                 


In [None]:
# 5. Reduce cardinality by introducing "Other" category for low frequency categories.
# Reuse the column split and the kept categories from the training data so that
# the test features match the features the model was fitted on.
OH_X_test = X_test_final_imputed[low_cardinality_cols]
for col in high_cardinality_cols:
    _, kept_categories = reduce_cardinality(imputed_X_train[col], threshold=0.7)
    test_col = X_test_final_imputed[col]
    transformed_X_test_col = test_col.where(test_col.isin(kept_categories), "Other")
    OH_X_test = pd.concat([OH_X_test,transformed_X_test_col.to_frame()],axis=1 )


print(OH_X_test.shape)
OH_X_test.describe()


In [None]:
# 6. One-hot encode the remainder of the categorical variables, after reducing their cardinality.
# Reuse the encoder fitted on the training data (same columns, same order) so the
# one-hot columns match the training features; unseen categories encode as all zeros.
OH_encoded_X_final = pd.DataFrame(hot1_encoder.transform(OH_X_test[OH_X_train.columns]))

OH_encoded_X_final.index = OH_X_test.index
OH_encoded_X_final.columns = hot1_encoder.get_feature_names_out(OH_X_train.columns.values)
print(OH_encoded_X_final.shape)
OH_encoded_X_final.head()

In [None]:
# 7. Combine numerical and encoded categorical variables.
test_frames_to_combine = [X_test_final_imputed[test_numerical_cols],ordinal_encoded_X_test,OH_encoded_X_final]
for df in test_frames_to_combine:
    print(len(df.columns.values))
X_test_final_df = pd.concat(test_frames_to_combine, axis=1)
print(X_test_final_df.shape)
X_test_final_df.head()

In [None]:
# Note: use all of training data now to train model
print(set(final_X_train.columns.values) == set(final_X_valid.columns.values))
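
When two encoded frames end up with different column sets, one option is to align them explicitly before prediction. The sketch below uses pandas `DataFrame.reindex` on hypothetical toy frames (`train_features`, `test_features` and their column names are made up). It assumes the mismatched columns are one-hot indicators, where filling a missing column with 0 is meaningful; that would not be appropriate for numeric features.

```python
import pandas as pd

# Hypothetical encoded frames: the test data lacks one one-hot column and has an extra one
train_features = pd.DataFrame({"LotArea": [8450, 9600],
                               "Street_Grvl": [0.0, 1.0],
                               "Street_Pave": [1.0, 0.0]})
test_features = pd.DataFrame({"Street_Pave": [1.0],
                              "LotArea": [11622],
                              "Street_Dirt": [0.0]})

# Keep exactly the training columns, in training order; missing columns are filled with 0
aligned_test = test_features.reindex(columns=train_features.columns, fill_value=0)

print(list(aligned_test.columns))  # ['LotArea', 'Street_Grvl', 'Street_Pave']
```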

In [None]:
# 8. Create a random forest model and fit it to the entire training data (train + validation splits).
final_preprocessed_training_data_X = pd.concat([final_X_train,final_X_valid],axis=0)
final_preprocessed_training_data_y = pd.concat([y_train,y_valid],axis=0)

# Note: number of features used to fit model should equal number of features used to test
print("number features training: ",len(final_preprocessed_training_data_X.columns))
print("number features test: ",len( X_test_final_df.columns ))
print(set(final_preprocessed_training_data_X.columns) == set(X_test_final_df.columns))


competition_model = RandomForestRegressor(n_estimators=100, random_state=1)
competition_model.fit(final_preprocessed_training_data_X,final_preprocessed_training_data_y )

# 9. Generate predictions for the competition with the model fitted on all training data
#    (final_model was fitted on the 80% training split only).
final_predictions = competition_model.predict(X_test_final_df)
final_predictions

### Notes
- For tree-based models (like decision trees and random forests), you can expect ordinal encoding to work well with ordinal variables.
- Categorical variables without an inherent ordering are nominal variables; one-hot encode them.
- One-hot encoding is generally not used for variables that take more than 15 different values.
- "Cardinality" means the number of unique values in a column.
- One-hot encode low-cardinality columns; ordinal encode the remaining categorical columns where possible, and drop the rest.
  
```python
# low cardinality
low_cardinality_cols = [cname for cname in X_train_full.columns if X_train_full[cname].nunique() < 10 and X_train_full[cname].dtype == "object"]

# Select numerical columns
numerical_cols = [cname for cname in X_train_full.columns if X_train_full[cname].dtype in ['int64', 'float64']]

# Keep selected columns only
my_cols = low_cardinality_cols + numerical_cols
X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()

```

#### Ordinal Encoder

```python
from sklearn.preprocessing import OrdinalEncoder

# Make copy to avoid changing original data 
label_X_train = X_train.copy()
label_X_valid = X_valid.copy()

# Apply ordinal encoder to each column with categorical data
ordinal_encoder = OrdinalEncoder()
label_X_train[object_cols] = ordinal_encoder.fit_transform(X_train[object_cols])
label_X_valid[object_cols] = ordinal_encoder.transform(X_valid[object_cols])
```
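
By default, `OrdinalEncoder` assigns codes in sorted label order, which for strings is alphabetical rather than the real ranking. When the ranking is known, it can be passed through the `categories` parameter. The sketch below assumes a quality scale of Po < Fa < TA < Gd < Ex; the scale and the toy rows are assumptions for illustration and should be checked against data_description.txt.

```python
from sklearn.preprocessing import OrdinalEncoder

# Assumed quality scale, worst to best (Poor, Fair, Typical/Average, Good, Excellent)
quality_order = ["Po", "Fa", "TA", "Gd", "Ex"]

# Toy single-column data standing in for a column such as KitchenQual
train_quality = [["TA"], ["Gd"], ["Ex"], ["Fa"]]
valid_quality = [["Gd"], ["Po"]]

# One list of categories per column; each code is the label's position in that list
encoder = OrdinalEncoder(categories=[quality_order])
encoded_train = encoder.fit_transform(train_quality)
encoded_valid = encoder.transform(valid_quality)

print(encoded_train.ravel())  # codes 2, 3, 4, 1
print(encoded_valid.ravel())  # codes 3, 0
```

Listing the categories explicitly also means a label such as "Po" can be encoded even if it never occurs in the training rows.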


#### One-hot encoder

- Set handle_unknown='ignore' to avoid errors when the validation data contains classes that aren't represented in the training data.
- sparse_output=False ensures that the encoded columns are returned as a numpy array (instead of a sparse matrix). The parameter was named sparse before scikit-learn 1.2.

```python
from sklearn.preprocessing import OneHotEncoder

# Apply one-hot encoder to each column with categorical data
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[object_cols]))
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid[object_cols]))

# One-hot encoding removed index; put it back
OH_cols_train.index = X_train.index
OH_cols_valid.index = X_valid.index

# Remove categorical columns (will replace with one-hot encoding)
num_X_train = X_train.drop(object_cols, axis=1)
num_X_valid = X_valid.drop(object_cols, axis=1)

# Add one-hot encoded columns to numerical features
OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)

# Ensure all columns have string type
OH_X_train.columns = OH_X_train.columns.astype(str)
OH_X_valid.columns = OH_X_valid.columns.astype(str)
```
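
The effect of `handle_unknown='ignore'` is easiest to see on a tiny example. In this sketch (toy values, not the housing data) a category that was never seen during fitting is encoded as a row of zeros instead of raising an error. The encoder is left at its default sparse output and converted with `toarray()`, so the snippet does not depend on which name the dense-output parameter has in the installed scikit-learn version.

```python
from sklearn.preprocessing import OneHotEncoder

# Toy single-column data; "Wood" never appears in the training rows
train_rows = [["Brick"], ["Stone"], ["Brick"]]
valid_rows = [["Stone"], ["Wood"]]

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_rows)

# The default output is a sparse matrix; toarray() makes it dense for printing
encoded_valid = encoder.transform(valid_rows).toarray()

print(encoder.categories_[0])  # categories learned from the training rows
print(encoded_valid)           # the "Wood" row contains only zeros
```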