
## Initialize

In [76]:
import pandas as pd
import numpy as np

import sklearn
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

## Load Data

In [77]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [78]:
df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/data/black_friday/train.csv')
df.head()
# Target variable is the Purchase column as explained in dataset description
df = df.dropna(subset=["Purchase"])
y = df.loc[:,['Purchase']].values.ravel()
X = df.drop(['Purchase'],axis=1)

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X,y,train_size=0.8, test_size=0.2,random_state=1)

## Build Random Forest Model

In [79]:
from sklearn.ensemble import RandomForestRegressor

# Function for building and scoring Random Forest models
def get_random_forest_mae(X_trn, X_tst, y_trn, y_tst):
    mdlRfs = RandomForestRegressor(random_state=1)
    mdlRfs.fit(X_trn, y_trn)
    y_tst_prd = mdlRfs.predict(X_tst)
    mae = mean_absolute_error(y_tst, y_tst_prd)
    return (mae)
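For reference, the MAE returned by this function is the average absolute difference between the actual and the predicted values. The sketch below reproduces that calculation in plain Python on made-up purchase amounts. The helper name `mae_by_hand` is hypothetical and is not part of the notebook.

```python
def mae_by_hand(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up purchase amounts, for illustration only
actual = [8000, 12000, 5000]
predicted = [9000, 11500, 5500]

example_mae = mae_by_hand(actual, predicted)
print(example_mae)
```

Because MAE is expressed in the same unit as `Purchase`, the scores in this notebook can be read directly as an average error in purchase amount.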


Let's try to build a model with all the features...

In [80]:
# Try to build a model using all features
get_random_forest_mae(X_train, X_test, y_train, y_test)

ValueError: could not convert string to float: 'P00304042'

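The error shows that some columns hold non-numeric values (for example, `Product_ID`), which `RandomForestRegressor` cannot fit directly. We set those columns aside for now and start with the numeric features only.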

In [94]:
# Select numeric features
cols_num = [col for col in X.columns if X[col].dtype in ['int64', 'float64']]
Xnum = X[cols_num]

# Split numeric features into training and test sets
Xnum_train, Xnum_test, y_train, y_test = train_test_split(Xnum,y,train_size=0.8, test_size=0.2,random_state=1)
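As an aside, pandas can make the same selection with `select_dtypes`, which also picks up numeric types other than `int64` and `float64`. The sketch below uses a toy frame with made-up values; only the column names are borrowed from the dataset, and the variable names are hypothetical.

```python
import pandas as pd

# Toy frame with made-up values; column names borrowed from the dataset
toy = pd.DataFrame({
    "Occupation": [10, 16, 7],
    "Product_Category_2": [2.0, None, 14.0],
    "Gender": ["F", "M", "M"],
    "Product_ID": ["P001", "P002", "P003"],
})

# Split the columns by dtype instead of comparing dtype names by hand
toy_num = toy.select_dtypes(include="number")
toy_obj = toy.select_dtypes(exclude="number")

print(list(toy_num.columns))
print(list(toy_obj.columns))
```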

In [82]:
# Try to build a model using all numeric features
get_random_forest_mae(Xnum_train, Xnum_test, y_train, y_test)

ValueError: Input X contains NaN.
RandomForestRegressor does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values

Some of the numeric columns have **Missing Values**. Therefore, we'll look for the best way to handle them.

In [83]:
# Count number of missing values in each column of the training data
Xnum_train.isna().sum()

User_ID                    0
Occupation                 0
Marital_Status             0
Product_Category_1         0
Product_Category_2    138892
Product_Category_3    306504
dtype: int64

Missing values occur in only two columns: `Product_Category_2` and `Product_Category_3`.
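Raw counts are easier to judge as a share of the rows. A minimal sketch on a toy frame with made-up values (not the real data): `isna().mean()` gives the fraction of missing entries per column.

```python
import pandas as pd

# Toy frame with made-up values; column names borrowed from the dataset
toy = pd.DataFrame({
    "Product_Category_1": [1, 5, 8, 3],
    "Product_Category_2": [2.0, None, 14.0, None],
    "Product_Category_3": [None, None, None, 5.0],
})

# Fraction of missing values in each column (0.0 = none, 1.0 = all)
missing_share = toy.isna().mean()
print(missing_share)
```

On the notebook's data, the same call on `Xnum_train` would show how large a part of each column is missing.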

#### Approach 1. Drop columns with missing values
The simplest option is to **drop columns** with missing values.

In [85]:
# Identify columns with missing values and then drop such columns
cols_num_null = [col for col in Xnum_train.columns
    if Xnum_train[col].isnull().any()]
Xnum_train_drpnull = Xnum_train.drop(cols_num_null, axis=1)
Xnum_test_drpnull = Xnum_test.drop(cols_num_null, axis=1)

In [86]:
print('MAE from Approach 1 (Drop features with missing values):')
print(get_random_forest_mae(Xnum_train_drpnull, Xnum_test_drpnull, y_train, y_test))

MAE from Approach 1 (Drop features with missing values):
2091.2402741391948


We dropped every column that has missing values and obtained an MAE of about 2091. In general, this approach can throw away useful information, because it discards an entire column even if only one value is missing. That is not quite the situation here: only two columns have missing values, and in both of them the missing values are far from a small minority. Still, we'll explore other options.

#### Approach 2. Fill missing values by Imputation
**Imputation** fills in the missing values with some number. For instance, we can fill in the mean value along each column.

The imputed value won't be exactly right in most cases, but it usually leads to more accurate models than you would get from dropping the column entirely.
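One common way to do mean imputation is scikit-learn's `SimpleImputer`. The sketch below runs on made-up values and is an illustration, not the notebook's own code. The point it shows is that the fill value is learned from the training data only and then reused for the test data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up values; the column name is borrowed from the dataset
train = pd.DataFrame({"Product_Category_2": [2.0, np.nan, 6.0, 4.0]})
test = pd.DataFrame({"Product_Category_2": [np.nan, 8.0]})

imputer = SimpleImputer(strategy="mean")

# fit_transform learns the column mean from the training data only
train_imp = pd.DataFrame(
    imputer.fit_transform(train), columns=train.columns, index=train.index
)
# transform reuses that training mean for the test data
test_imp = pd.DataFrame(
    imputer.transform(test), columns=test.columns, index=test.index
)

print(train_imp)
print(test_imp)
```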

In [88]:
# Forward fill: replace each missing value with the value from the previous row
Xnum_train_repnull = Xnum_train.ffill()
Xnum_test_repnull = Xnum_test.ffill()

print("Missing values in Xnum_train_repnull:", Xnum_train_repnull.isnull().sum().sum())
print("Missing values in Xnum_test_repnull:", Xnum_test_repnull.isnull().sum().sum())

print('MAE from Approach 2 (Replace missing values with forward fill):')
print(get_random_forest_mae(Xnum_train_repnull, Xnum_test_repnull, y_train, y_test))

Missing values in Xnum_train_repnull: 2
Missing values in Xnum_test_repnull: 1
MAE from Approach 2 (Replace missing values with forward fill):


ValueError: Input X contains NaN.
RandomForestRegressor does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values

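There seem to be missing values at the very start of these columns. Forward fill has no earlier value to copy there, which is why we still observe NaN values afterwards. We drop the first row of each set, along with the matching target value, and try again.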
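The effect is easy to reproduce on a tiny made-up series. The sketch also shows one possible alternative to dropping the row, a backward fill after the forward fill, which the notebook itself does not use.

```python
import numpy as np
import pandas as pd

# Made-up series that starts with a missing value
s = pd.Series([np.nan, 2.0, np.nan, 5.0])

# Forward fill copies the previous value, so the leading NaN stays
ffilled = s.ffill()

# Alternative (assumption, not the notebook's approach):
# a backward fill afterwards covers the leading gap
both = s.ffill().bfill()

print(ffilled.tolist())
print(both.tolist())
```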

In [99]:
Xnum_train_repnull = Xnum_train.ffill()
Xnum_test_repnull = Xnum_test.ffill()

# Drop the first row of each set, which forward fill could not fill,
# together with the matching target value
Xnum_train_repnull = Xnum_train_repnull.iloc[1:]
Xnum_test_repnull = Xnum_test_repnull.iloc[1:]
y_train_repnull = y_train[1:]
y_test_repnull = y_test[1:]
print("Missing values in Xnum_train_repnull:", Xnum_train_repnull.isnull().sum().sum())
print("Missing values in Xnum_test_repnull:", Xnum_test_repnull.isnull().sum().sum())

print('MAE from Approach 2 (Replace missing values with forward fill):')
print(get_random_forest_mae(Xnum_train_repnull, Xnum_test_repnull, y_train_repnull, y_test_repnull))

Missing values in Xnum_train_repnull: 0
Missing values in Xnum_test_repnull: 0
MAE from Approach 2 (Replace missing values with forward fill):
2271.429521996585


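After replacing all NaN values with the forward fill method, we get a higher error (about 2271 versus 2091). This suggests that dropping the columns Product_Category_2 and Product_Category_3 didn't cost us vital information in the first place. Keep in mind that forward fill copies the value from the previous row, which has no particular meaning here because `train_test_split` shuffles the rows.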

In [100]:
# Replace with mean value
Xnum_train_repnull = Xnum_train.fillna(Xnum_train.mean())
Xnum_test_repnull = Xnum_test.fillna(Xnum_train.mean())

print('MAE from Approach 2 (Replace missing values with mean):')
print(get_random_forest_mae(Xnum_train_repnull, Xnum_test_repnull, y_train, y_test))

MAE from Approach 2 (Replace missing values with mean):
2193.456214200903


Again, we get a higher error (about 2193 versus 2091) after replacing the missing values with the mean of each affected column. This is further evidence that Product_Category_2 and Product_Category_3, once imputed, do not carry vital information about the target.

In [101]:
# Going forward, drop the numeric columns that contain missing values
X_train = X_train.drop(columns=['Product_Category_2','Product_Category_3'])
X_test = X_test.drop(columns=['Product_Category_2','Product_Category_3'])

Next, we'll add the non-numeric features.

## Non-numerical Features

In [102]:
# Select non-numeric features
cols_obj = [col for col in X.columns if X[col].dtype == 'object']
cols_obj

['Product_ID', 'Gender', 'Age', 'City_Category', 'Stay_In_Current_City_Years']

In [103]:
# Label encoding on all non-numeric features

from sklearn.preprocessing import LabelEncoder

Xle_train = X_train.copy()
Xle_test = X_test.copy()
# Apply label encoder to each column with non-numeric data
label_encoder = LabelEncoder()
for col in cols_obj:
    Xle_train[col] = label_encoder.fit_transform(X_train[col])
    Xle_test[col] = label_encoder.transform(X_test[col])

ValueError: y contains previously unseen labels: 'P00206242'

`LabelEncoder` can't handle values that it did not see when it was fitted on the training set. Product_ID has many distinct values, so some IDs appear only in the test set, which makes this column especially prone to the issue. Therefore, we'll keep only the columns that have a limited number of unique values.
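For completeness, scikit-learn also offers `OrdinalEncoder`, which can map unseen categories to a placeholder code instead of raising an error. The sketch below uses made-up product IDs and shows an alternative that the notebook does not use. The placeholder value `-1` is an arbitrary choice.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up product IDs; "P999" never appears in the training data
train = pd.DataFrame({"Product_ID": ["P001", "P002", "P003"]})
test = pd.DataFrame({"Product_ID": ["P002", "P999"]})

# Unseen categories are encoded as -1 instead of raising a ValueError
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_codes = encoder.fit_transform(train)
test_codes = encoder.transform(test)

print(train_codes.ravel())
print(test_codes.ravel())
```

Whether a high-cardinality identifier such as Product_ID is a useful feature at all is a separate question that this sketch does not answer.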

In [104]:
# Select categorical features
cols_cat = [col for col in X.columns if X[col].dtype == 'object' and X[col].nunique()<10]
cols_cat

['Gender', 'Age', 'City_Category', 'Stay_In_Current_City_Years']

In [105]:
# Label encoding on only categorical features

from sklearn.preprocessing import LabelEncoder

Xle_train = X_train.copy()
Xle_test = X_test.copy()
# Apply label encoder to each column with categorical data
label_encoder = LabelEncoder()
for col in cols_cat:
    Xle_train[col] = label_encoder.fit_transform(X_train[col])
    Xle_test[col] = label_encoder.transform(X_test[col])

`cols_num` still contains Product_Category_2 and Product_Category_3, so we'll remove those columns from the list.

In [107]:
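# Remove the dropped columns by name, so that re-running this cell is harmless
cols_num = [col for col in cols_num if col not in ('Product_Category_2', 'Product_Category_3')]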

In [108]:
cols_num

['User_ID', 'Occupation', 'Marital_Status', 'Product_Category_1']

In [109]:
cols_cat

['Gender', 'Age', 'City_Category', 'Stay_In_Current_City_Years']

In [110]:
# Encode and Build/Score using all Categorical columns

mae = get_random_forest_mae(Xle_train[cols_num + cols_cat], Xle_test[cols_num + cols_cat], y_train, y_test)
print("MAE from Label Encoding all Categorical columns:")
print(mae)

MAE from Label Encoding all Categorical columns:
2086.9820597803287


By including the categorical features, the MAE drops only slightly, from about 2091 to about 2087. This suggests either that these categorical variables have little effect on the target variable or that label encoding does not capture their information well.
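Label encoding assigns an arbitrary integer order to categories such as City_Category. One-hot encoding is a common alternative that avoids this implied order. The sketch below runs on made-up values and was not tried in this notebook, so it makes no claim about the resulting MAE.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up values; the column name is borrowed from the dataset
train = pd.DataFrame({"City_Category": ["A", "B", "C", "A"]})
test = pd.DataFrame({"City_Category": ["C", "B"]})

# handle_unknown="ignore" encodes unseen categories as an all-zero row
encoder = OneHotEncoder(handle_unknown="ignore")
train_onehot = encoder.fit_transform(train).toarray()
test_onehot = encoder.transform(test).toarray()

print(train_onehot)
print(test_onehot)
```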

## Build Gradient Boosted Tree Model

Let's begin by training a simple Gradient Boosting model...

In [111]:
from xgboost import XGBRegressor

# Build and score default Gradient Boosting Model
mdlXgb = XGBRegressor()
mdlXgb.fit(Xle_train[cols_num + cols_cat], y_train)
y_test_pred = mdlXgb.predict(Xle_test[cols_num + cols_cat])
# mean_absolute_error expects (y_true, y_pred)
mae = mean_absolute_error(y_test, y_test_pred)

print("MAE from default XGBoost model:")
print(mae)

MAE from default XGBoost model:
2138.092662320839


Since the result is worse than that of the Random Forest model, we'll tune the parameters, following the parameter tuning we did in class.

`n_estimators`: number of boosting rounds, i.e. the number of decision trees added to the ensemble

`max_depth`: maximum depth of each tree (typically 3-10)

`learning_rate`: shrinkage factor applied to each tree's contribution (typically 0.01-0.2)
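These three parameters can also be searched systematically with cross-validation. The sketch below is only an illustration under loud assumptions: it uses synthetic data from `make_regression`, a deliberately tiny grid, and scikit-learn's `GradientBoostingRegressor` as a stand-in for `XGBRegressor`, so that it runs without xgboost installed.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data (assumption): stands in for the encoded training features
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=1
)

# Deliberately small grid so the search finishes quickly
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.2],
}

# Stand-in estimator (assumption): not the XGBRegressor used in the notebook
search = GridSearchCV(
    GradientBoostingRegressor(random_state=1),
    param_grid,
    scoring="neg_mean_absolute_error",  # scikit-learn maximizes, hence the sign
    cv=3,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(-search.best_score_)  # cross-validated MAE of the best combination
```

Cross-validating on the training data keeps the test set out of the tuning loop, so the final test MAE remains an honest estimate.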

In [112]:
# Build and score a tuned Gradient Boosting Model
mdlXgb = XGBRegressor(n_estimators=5000, learning_rate=0.2, max_depth=5)
mdlXgb.fit(Xle_train[cols_num + cols_cat], y_train)
y_test_pred = mdlXgb.predict(Xle_test[cols_num + cols_cat])
# mean_absolute_error expects (y_true, y_pred)
mae = mean_absolute_error(y_test, y_test_pred)

print("MAE from tuned XGBoost model:")
print(mae)

MAE from tuned XGBoost model:
2081.0612316920938


By tuning the hyperparameters of the gradient boosting model, we're able to further decrease the MAE to about 2081. The improvement is small, but it shows that the choice of hyperparameters can change the result.

## Conclusion

* At first, a random forest model was fitted. A function that evaluates the MAE of random forest models was created and then used throughout the preprocessing stage to compare different ways of preparing the data.
* The data contained non-numeric features that the model could not fit directly, so the dataframe was first restricted to its numeric features.
* Between dropping the columns with missing values and filling the missing values by imputation, dropping the two columns with missing values gave the lower error (MAE of about 2091, versus about 2271 with forward fill and about 2193 with mean imputation). After dropping those two columns, the next step was to analyze the categorical features.
* The categorical features were label encoded, and the resulting MAE was slightly lower (about 2087).
* Next, the effectiveness of the random forest model was questioned. With default settings, the gradient boosting model was worse (MAE of about 2138), but with tuned hyperparameters it gave the lowest MAE (about 2081).