# Categorical Variables with Melbourne Housing dataset
This notebook follows the Intermediate Machine Learning course on Kaggle. Data source: https://www.kaggle.com/datasets/dansbecker/melbourne-housing-snapshot

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Read the data
data = pd.read_csv('melb_data.csv')

# Separate target from predictors
y = data.Price
X = data.drop(['Price'], axis=1)

# Divide data into training and validation subsets
X_train_full, X_valid_full, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)

# Drop columns with missing values (simplest approach)
cols_with_missing = [col for col in X_train_full.columns if X_train_full[col].isnull().any()]
X_train_full.drop(cols_with_missing, axis=1, inplace=True)
X_valid_full.drop(cols_with_missing, axis=1, inplace=True)

# "Cardinality" means the number of unique values in a column
# Select categorical columns with relatively low cardinality (convenient but arbitrary)
# A more detailed explanation follows this cell.
low_cardinality_cols = [cname for cname in X_train_full.columns if X_train_full[cname].nunique() < 10 and
                       X_train_full[cname].dtype == "object"]

# Select numerical columns 
numerical_cols = [cname for cname in X_train_full.columns if X_train_full[cname].dtype in ['int64', 'float64']]

# Keep selected columns only
my_cols = low_cardinality_cols + numerical_cols
X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()


### About Cardinality and low_cardinality_cols code

I asked ChatGPT for a detailed explanation of this code, for my own reference (and for anyone else reading this):

The code above is typically used in machine learning for feature selection or preprocessing. Specifically, it identifies and selects categorical columns with relatively low cardinality. Let's break down what it does and why it's necessary:

- Cardinality: In the context of machine learning, "cardinality" refers to the number of unique values in a categorical column. For example, in a "color" column, if there are only three unique values ("red," "green," and "blue"), the cardinality of that column is 3.

- Categorical Columns: Categorical columns are those that contain discrete, non-numeric values. These values represent categories or labels, such as "color," "gender," or "country."

- Low Cardinality: The code aims to identify categorical columns with relatively low cardinality. This is often done for several reasons: 

    High cardinality categorical columns can lead to a large number of one-hot encoded features, which can significantly increase the dimensionality of your data and make the model more complex. It can also lead to sparse data, which may affect the model's performance.
    
    Low cardinality columns are typically easier to work with because they have fewer unique categories, making them more manageable for feature engineering and modeling.
    
    
- Selection: The code creates a list of column names (low_cardinality_cols) by iterating through the columns of the dataset X_train_full. It checks if the column's data type is "object" (indicating it's a categorical column) and if the number of unique values in that column is less than 10. If both conditions are met, the column name is added to the low_cardinality_cols list.

The resulting low_cardinality_cols list contains the names of categorical columns with low cardinality, which can then be used for various purposes in the machine learning workflow. For example, you might use these columns for one-hot encoding, label encoding, or as features for your machine learning model.

In summary, this code helps you identify and select categorical columns with low cardinality, which is a common practice in machine learning to manage data complexity and improve model performance when dealing with categorical features.
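
To make the selection rule concrete, here is a minimal sketch on a made-up DataFrame (the `toy` data and the threshold of 4 are assumptions for illustration only; the notebook itself uses a threshold of 10 on the Melbourne data). It counts the unique values per text column and keeps only the columns below the threshold.

```python
import pandas as pd

# Hypothetical toy data, not taken from the Melbourne dataset.
# dtype="object" is set explicitly so the dtype check behaves the same
# across pandas versions.
toy = pd.DataFrame({
    "color": pd.Series(["red", "green", "blue", "red"], dtype="object"),
    "street": pd.Series(["1 A St", "2 B St", "3 C St", "4 D St"], dtype="object"),
    "rooms": [1, 2, 3, 2],
})

# Cardinality (number of unique values) of each text column
cardinality = {
    col: toy[col].nunique()
    for col in toy.columns
    if toy[col].dtype == "object"
}

# Keep only text columns with fewer than 4 unique values
# (threshold lowered from 10 so the toy data shows both outcomes)
low_card = [col for col, n in cardinality.items() if n < 4]

print(cardinality)
print(low_card)
```

Here `street` has a distinct value in every row, so it is excluded, while `color` has only three distinct values and is kept.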

Take a peek at the training data with the 'head()' method.

As you can see, I keep only the numerical columns and the low-cardinality categorical columns, which are collected in my_cols.

In [3]:
X_train.head()

| | Type | Method | Regionname | Rooms | Distance | Postcode | Bedroom2 | Bathroom | Landsize | Lattitude | Longtitude | Propertycount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 12167 | u | S | Southern Metropolitan | 1 | 5.0 | 3182.0 | 1.0 | 1.0 | 0.0 | -37.85984 | 144.9867 | 13240.0 |
| 6524 | h | SA | Western Metropolitan | 2 | 8.0 | 3016.0 | 2.0 | 2.0 | 193.0 | -37.858 | 144.9005 | 6380.0 |
| 8413 | h | S | Western Metropolitan | 3 | 12.6 | 3020.0 | 3.0 | 1.0 | 555.0 | -37.7988 | 144.822 | 3755.0 |
| 2919 | u | SP | Northern Metropolitan | 3 | 13.0 | 3046.0 | 3.0 | 1.0 | 265.0 | -37.7083 | 144.9158 | 8870.0 |
| 6043 | h | S | Western Metropolitan | 3 | 13.3 | 3020.0 | 3.0 | 1.0 | 673.0 | -37.7623 | 144.8272 | 4217.0 |


Next, I obtain a list of all of the categorical variables in the training data. 

I do this by checking the data type (or dtype) of each column. The "object" dtype indicates that a column contains text. For this dataset, the text columns are the categorical variables.

In [4]:
# Get list of categorical variables
s = (X_train.dtypes == 'object')
object_cols = list(s[s].index)

print("Categorical variables:")
print(object_cols)

Categorical variables:
['Type', 'Method', 'Regionname']


### Define a Function to Measure the Quality of Each Approach
We define a function 'score_dataset()' to compare the three different approaches to dealing with categorical variables. This function reports the mean absolute error (MAE) from a random forest model. In general, we want the MAE to be as low as possible!

In [5]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Function for comparing different approaches
def score_dataset(X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)
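
As a quick sanity check on what 'score_dataset()' returns, the sketch below computes the MAE by hand, as the average of the absolute differences between true and predicted values, and compares it with scikit-learn's `mean_absolute_error`. The three price pairs are made-up numbers for illustration, not values from the dataset.

```python
from sklearn.metrics import mean_absolute_error

# Hypothetical true and predicted prices (illustration only)
y_true = [100, 200, 300]
y_pred = [110, 190, 330]

# MAE by hand: mean of |true - predicted|
manual_mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Same quantity via scikit-learn
sklearn_mae = mean_absolute_error(y_true, y_pred)

print(manual_mae, sklearn_mae)
```

The absolute errors are 10, 10, and 30, so both values come out to 50/3. Because the MAE is in the same units as the target, the scores in this notebook can be read as an average error in dollars.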

## Score from Approach 1 (Drop Categorical Variables)
I drop the 'object' columns with the 'select_dtypes()' method.

In [6]:
drop_X_train = X_train.select_dtypes(exclude=['object'])
drop_X_valid = X_valid.select_dtypes(exclude=['object'])

print("MAE from Approach 1 (Drop categorical variables):")
print(score_dataset(drop_X_train, drop_X_valid, y_train, y_valid))

MAE from Approach 1 (Drop categorical variables):
175703.48185157913


## Score from Approach 2 (Ordinal Encoding)

scikit-learn has an 'OrdinalEncoder' class that can be used to get ordinal encodings. We fit it on the categorical columns of the training data and apply it to all of them in one call; the encoder learns a separate category-to-integer mapping for each column.

In [7]:
from sklearn.preprocessing import OrdinalEncoder

# Make copy to avoid changing original data
label_X_train = X_train.copy()
label_X_valid = X_valid.copy()

# Apply ordinal encoder to each column with categorical data
ordinal_encoder = OrdinalEncoder()
label_X_train[object_cols] = ordinal_encoder.fit_transform(X_train[object_cols])
label_X_valid[object_cols] = ordinal_encoder.transform(X_valid[object_cols])

print("MAE from Approach 2 (Ordinal Encoding):")
print(score_dataset(label_X_train, label_X_valid, y_train, y_valid))

MAE from Approach 2 (Ordinal Encoding):
165936.40548390493
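
To see what the encoder actually produces, here is a minimal sketch on a single made-up column (the `toy_train` frame is an assumption for illustration). By default, `OrdinalEncoder` sorts the unique values of each column and assigns integers in that order.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical one-column training set (illustration only)
toy_train = pd.DataFrame({"Type": pd.Series(["h", "u", "t", "h"], dtype="object")})

toy_encoder = OrdinalEncoder()
codes = toy_encoder.fit_transform(toy_train)

# Categories learned for the single column, in sorted order
learned = list(toy_encoder.categories_[0])

# Integer code assigned to each row
flat_codes = codes.ravel().tolist()

print(learned)
print(flat_codes)
```

The resulting order is alphabetical rather than meaningful, so the integers should not be read as a ranking of property types. Also note that, with its default settings, the encoder raises an error if the validation data contains a category it did not see during fitting.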


## Score from Approach 3 (One-Hot Encoding)
We use the OneHotEncoder class from scikit-learn to get one-hot encodings. There are a number of parameters that can be used to customize its behavior.

- We set handle_unknown='ignore' to avoid errors when the validation data contains classes that aren't represented in the training data, and
- setting sparse=False ensures that the encoded columns are returned as a numpy array (instead of a sparse matrix). In scikit-learn 1.2 this parameter was renamed to sparse_output, and the old name was removed in 1.4.

To use the encoder, we supply only the categorical columns that we want to be one-hot encoded. For instance, to encode the training data, we supply X_train[object_cols]. (object_cols in the code cell below is a list of the column names with categorical data, and so X_train[object_cols] contains all of the categorical data in the training set.)

In [14]:
from sklearn.preprocessing import OneHotEncoder

# Apply one-hot encoder to each column with categorical data
# Note: on scikit-learn >= 1.2, pass sparse_output=False instead of sparse=False
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)  # Parameters explained above
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[object_cols]))
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid[object_cols]))

# One-hot encoding removed index; put it back
OH_cols_train.index = X_train.index
OH_cols_valid.index = X_valid.index

# Remove categorical columns (will replace with one-hot encoding)
num_X_train = X_train.drop(object_cols, axis=1)
num_X_valid = X_valid.drop(object_cols, axis=1)

# Add one-hot encoded columns to numerical features
# Concatenate means combining two or more DataFrames into one specified axis
OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)

# Ensure all columns have string type
# Explained in the box below
OH_X_train.columns = OH_X_train.columns.astype(str)
OH_X_valid.columns = OH_X_valid.columns.astype(str)

print("MAE from Approach 3 (One-Hot Encoding):")
print(score_dataset(OH_X_train, OH_X_valid, y_train, y_valid))



MAE from Approach 3 (One-Hot Encoding):
166089.4893009678


#### About astype(str)

Wrapping the encoder output in pd.DataFrame() gives the one-hot columns integer labels (0, 1, 2, ...), while the numerical columns keep their string names such as Rooms. After pd.concat, the column index therefore contains a mix of strings and integers. Applying .astype(str) to the column names converts every label to a string.

This matters because recent versions of scikit-learn refuse to fit a model on a DataFrame whose column names mix strings with non-strings. More generally, keeping all column names as strings is a common convention that makes it easier to reference, filter, and manipulate columns by name without unexpected errors.
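
The mixed column labels can be reproduced with pandas alone. This is a minimal sketch with made-up frames (`toy_num` and `toy_oh_cols` are assumptions standing in for num_X_train and OH_cols_train).

```python
import pandas as pd

# Hypothetical stand-ins for the numerical and one-hot encoded frames
toy_num = pd.DataFrame({"Rooms": [1, 2]})
toy_oh_cols = pd.DataFrame([[1.0, 0.0], [0.0, 1.0]])  # columns default to 0, 1

combined = pd.concat([toy_num, toy_oh_cols], axis=1)

# Column labels before conversion: one string and two integers
labels_before = list(combined.columns)

# Convert every column label to a string
combined.columns = combined.columns.astype(str)
labels_after = list(combined.columns)

print(labels_before)
print(labels_after)
```

After the conversion, every label is a string, so the frame can be passed to a scikit-learn model without the mixed-name problem.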

## Conclusion

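Approach 2 (ordinal encoding) performed best, with the lowest MAE of about 165,936. However, Approach 3 (one-hot encoding) scored about 166,089, which is so close that there doesn't appear to be any meaningful benefit to one over the other. Both clearly beat Approach 1 (dropping the categorical variables), which scored about 175,703.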