<a href="https://colab.research.google.com/github/yamac0/IE423/blob/main/task3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Initialize

In [None]:
import pandas as pd
import numpy as np

import sklearn
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

## Load Data

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
dfSls = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/data/sales/sales.csv')

# Select target as a series and features as dataframe
y = dfSls.loc[:,['Purchase']].values.ravel()
X = dfSls.drop(['Purchase'],axis=1)

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X,y,train_size=0.8, test_size=0.2,random_state=1)

## Build Random Forest Model

In [None]:
from sklearn.ensemble import RandomForestRegressor

# Function for building and scoring Random Forest models
def get_random_forest_mae(X_trn, X_tst, y_trn, y_tst):
    mdlRfsSls = RandomForestRegressor(random_state=1)
    mdlRfsSls.fit(X_trn, y_trn)
    y_tst_prd = mdlRfsSls.predict(X_tst)
    mae = mean_absolute_error(y_tst, y_tst_prd)
    return (mae)

In [None]:
# Try to build a model using all features
get_random_forest_mae(X_train, X_test, y_train, y_test)

ValueError: could not convert string to float: 'P00304042'

Some of the columns are non-numeric, so they must be handled or converted before the model can use them.

## Numerical Features
Columns with quantitative data, either discrete or continuous, are called numerical features. As a first step, we train on the numeric columns only.

In [None]:
# Select numeric features
cols_num = [col for col in X.columns if X[col].dtype in ['int64', 'float64']]
Xnum = X[cols_num]

# Split numeric features into training and test sets
Xnum_train, Xnum_test, y_train, y_test = train_test_split(Xnum,y,train_size=0.8, test_size=0.2,random_state=1)

Let's take a look at the data we are dealing with.

In [31]:
Xnum.head()

Unnamed: 0,User_ID,Occupation,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3
0,1000001,10,0,3,,
1,1000001,10,0,1,6.0,14.0
2,1000001,10,0,12,,
3,1000001,10,0,12,14.0,
4,1000002,16,0,8,,


In [33]:
Xnum.describe(include = 'all')

Unnamed: 0,User_ID,Occupation,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3
count,550068.0,550068.0,550068.0,550068.0,376430.0,166821.0
mean,1003029.0,8.076707,0.409653,5.40427,9.842329,12.668243
std,1727.592,6.52266,0.49177,3.936211,5.08659,4.125338
min,1000001.0,0.0,0.0,1.0,2.0,3.0
25%,1001516.0,2.0,0.0,1.0,5.0,9.0
50%,1003077.0,7.0,0.0,5.0,9.0,14.0
75%,1004478.0,14.0,1.0,8.0,15.0,16.0
max,1006040.0,20.0,1.0,20.0,18.0,18.0


In [34]:
Xnum.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 550068 entries, 0 to 550067
Data columns (total 6 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   User_ID             550068 non-null  int64  
 1   Occupation          550068 non-null  int64  
 2   Marital_Status      550068 non-null  int64  
 3   Product_Category_1  550068 non-null  int64  
 4   Product_Category_2  376430 non-null  float64
 5   Product_Category_3  166821 non-null  float64
dtypes: float64(2), int64(4)
memory usage: 25.2 MB


In [None]:
# Try to build a model using all numeric features
get_random_forest_mae(Xnum_train, Xnum_test, y_train, y_test)

ValueError: Input X contains NaN.
RandomForestRegressor does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values

It seems that some of the numeric columns have **missing values**.

In [None]:
# Count number of missing values in each column of the training data
Xnum_train.isna().sum()

User_ID                    0
Occupation                 0
Marital_Status             0
Product_Category_1         0
Product_Category_2    138892
Product_Category_3    306504
dtype: int64

As expected, there are some missing values.
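Raw counts are easier to judge as a share of the rows. The sketch below uses a small made-up frame (`df_demo` is hypothetical, not the sales data) to report both the count and the percentage of missing values per column:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the numeric training features
df_demo = pd.DataFrame({
    "Product_Category_1": [3, 1, 12, 12, 8],
    "Product_Category_2": [np.nan, 6.0, np.nan, 14.0, np.nan],
    "Product_Category_3": [np.nan, 14.0, np.nan, np.nan, np.nan],
})

# Count and percentage of missing values per column
missing_summary = pd.DataFrame({
    "n_missing": df_demo.isna().sum(),
    "pct_missing": (df_demo.isna().mean() * 100).round(1),
})
print(missing_summary)
```

Going by the non-null counts in `Xnum.info()`, roughly 32% of `Product_Category_2` and 70% of `Product_Category_3` are missing in the full data.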

#### Approach 1. Drop columns with missing values
The simplest option is to **drop columns** with missing values.

In [None]:
# Identify columns with missing values and then drop such columns
cols_num_null = [col for col in Xnum_train.columns
    if Xnum_train[col].isnull().any()]
Xnum_train_drpnull = Xnum_train.drop(cols_num_null, axis=1)
Xnum_test_drpnull = Xnum_test.drop(cols_num_null, axis=1)

So we dropped both `Product_Category_2` and `Product_Category_3` entirely; in the training set they had 138892 and 306504 missing values, respectively.

In [None]:
print('MAE from Approach 1 (Drop features with missing values):')
print(get_random_forest_mae(Xnum_train_drpnull, Xnum_test_drpnull, y_train, y_test))

MAE from Approach 1 (Drop features with missing values):
2091.2402741391948


#### Approach 2. Fill missing values by Imputation
**Imputation** fills in the missing values with some number. For instance, we can fill in the mean value along each column.

The imputed value won't be exactly right in most cases, but it usually leads to more accurate models than you would get from dropping the column entirely.
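As a minimal sketch of mean imputation, assuming scikit-learn's `SimpleImputer` is an acceptable stand-in for a manual `fillna`, the example below learns the column means from the training rows only and reuses them on the test rows. The frames `train_demo` and `test_demo` are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data; the column names mirror the sales data
train_demo = pd.DataFrame({"Product_Category_2": [2.0, np.nan, 6.0],
                           "Product_Category_3": [np.nan, 10.0, 14.0]})
test_demo = pd.DataFrame({"Product_Category_2": [np.nan, 8.0],
                          "Product_Category_3": [5.0, np.nan]})

# Learn the column means from the training rows only
imputer = SimpleImputer(strategy="mean")
train_filled = pd.DataFrame(imputer.fit_transform(train_demo),
                            columns=train_demo.columns, index=train_demo.index)

# Reuse the training means on the test rows so no test information leaks in
test_filled = pd.DataFrame(imputer.transform(test_demo),
                           columns=test_demo.columns, index=test_demo.index)
print(test_filled)
```

Fitting on the training rows and only transforming the test rows keeps the test set untouched by anything learned during preprocessing.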

In [None]:
# Forward fill: replace each missing value with the last valid value above it
Xnum_train_repnull = Xnum_train.ffill()
Xnum_test_repnull = Xnum_test.ffill()

# Leading NaNs have nothing above them to copy, so back fill those
Xnum_train_repnull = Xnum_train_repnull.bfill()
Xnum_test_repnull = Xnum_test_repnull.bfill()

# Check if there are still any missing values after filling
print("Missing values in Xnum_train_repnull:", Xnum_train_repnull.isnull().sum().sum())
print("Missing values in Xnum_test_repnull:", Xnum_test_repnull.isnull().sum().sum())

print('MAE from Approach 2 (Replace missing values with forward fill):')
print(get_random_forest_mae(Xnum_train_repnull, Xnum_test_repnull, y_train, y_test))

Missing values in Xnum_train_repnull: 0
Missing values in Xnum_test_repnull: 0
MAE from Approach 2 (Replace missing values with forward fill):
2272.049645096092


In [None]:
# Replace with mean value
Xnum_train_repnull = Xnum_train.fillna(Xnum_train.mean())
Xnum_test_repnull = Xnum_test.fillna(Xnum_train.mean())

print('MAE from Approach 2 (Replace missing values with mean):')
print(get_random_forest_mae(Xnum_train_repnull, Xnum_test_repnull, y_train, y_test))

MAE from Approach 2 (Replace missing values with mean):
2193.456214200903


It turns out that replacing the missing numeric values with the column mean (MAE 2193.46) beats forward fill (2272.05) but does not improve on simply dropping the columns with missing values (2091.24).

In [None]:
# Going forward, let us replace all missing numeric values with the column mean
X_train[cols_num]=Xnum_train_repnull[cols_num]
X_test[cols_num]=Xnum_test_repnull[cols_num]

Next, let's try to improve the model by including some non-numeric features...


## Non-numerical Features

We have already seen the error generated by non-numeric features, but let's try to convert them to numeric values so that they can be used in the model.

In [None]:
# Select non-numeric features
cols_obj = [col for col in X.columns if X[col].dtype == 'object']
cols_obj

['Product_ID', 'Gender', 'Age', 'City_Category', 'Stay_In_Current_City_Years']

In [None]:
# Label encoding on all non-numeric features

from sklearn.preprocessing import LabelEncoder

Xle_train = X_train.copy()
Xle_test = X_test.copy()
# Apply label encoder to each column with non-numeric data
label_encoder = LabelEncoder()
for col in cols_obj:
    Xle_train[col] = label_encoder.fit_transform(X_train[col])
    Xle_test[col] = label_encoder.transform(X_test[col])

ValueError: y contains previously unseen labels: 'P00206242'

The encoder fails because `Product_ID` has so many distinct values that the test set contains labels never seen in the training set. So we restrict label encoding to the non-numeric features with low cardinality, i.e. the categorical features.
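One way to sidestep the unseen-label error, not used in this notebook, is scikit-learn's `OrdinalEncoder`, which can map categories absent from the training data to a sentinel code. A minimal sketch on made-up product IDs (`train_demo`, `test_demo`, and the sentinel `-1` are illustrative choices):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical product IDs; "P999" appears only in the test rows
train_demo = pd.DataFrame({"Product_ID": ["P001", "P002", "P003"]})
test_demo = pd.DataFrame({"Product_ID": ["P002", "P999"]})

# Categories unseen during fit are encoded as -1 instead of raising an error
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_codes = encoder.fit_transform(train_demo)
test_codes = encoder.transform(test_demo)
print(test_codes.ravel())
```

Whether an integer-coded high-cardinality column such as `Product_ID` actually helps the model is a separate question that this sketch does not answer.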



In [None]:
# Select categorical features
cols_cat = [col for col in X.columns if X[col].dtype == 'object' and X[col].nunique()<10]
cols_cat

['Gender', 'Age', 'City_Category', 'Stay_In_Current_City_Years']

In [None]:
# Label encoding on only categorical features

from sklearn.preprocessing import LabelEncoder

Xle_train = X_train.copy()
Xle_test = X_test.copy()
# Apply label encoder to each column with categorical data
label_encoder = LabelEncoder()
for col in cols_cat:
    Xle_train[col] = label_encoder.fit_transform(X_train[col])
    Xle_test[col] = label_encoder.transform(X_test[col])

print("Number of NaNs in Xle_train:", Xle_train.isna().sum().sum())
print("Number of NaNs in Xle_test:", Xle_test.isna().sum().sum())

Number of NaNs in Xle_train: 0
Number of NaNs in Xle_test: 0


In [38]:
# Encode and Build/Score using all Categorical columns

mae = get_random_forest_mae(Xle_train[cols_num + cols_cat], Xle_test[cols_num + cols_cat], y_train, y_test)
print("MAE from Label Encoding all Categorical columns:")
print(mae)

MAE from Label Encoding all Categorical columns:
2218.810253436235


## Build Gradient Boosted Tree Model

We begin by training a simple Gradient Boosting model.

In [None]:
from xgboost import XGBRegressor

#Build and score default Gradient Boosting Model
mdlXgbMlb = XGBRegressor()
mdlXgbMlb.fit(Xle_train[cols_num + cols_cat], y_train)
y_test_pred = mdlXgbMlb.predict(Xle_test[cols_num + cols_cat])
mae = mean_absolute_error(y_test, y_test_pred)

print("MAE from default XGBoost model:")
print(mae)

MAE from default XGBoost model:
2108.359831063556


Let's try to improve this by **tuning the parameters** that drive the Gradient Boosting model.  Below are some popular parameters...

`n_estimators`: maximum number of decision trees that will be ensembled

`max_depth`: maximum depth of each tree (typically 3-10)

`learning_rate`: weight applied to each tree (typically 0.01-0.2)
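To make the roles of `n_estimators`, `max_depth`, and `learning_rate` concrete, here is a deliberately simplified boosting loop for squared error, built from scikit-learn decision trees on synthetic data. It is an illustration of the idea only, not XGBoost's actual algorithm, which also uses second-order gradient information and regularisation. The helper `boost` and the data `X_demo`, `y_demo` are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic one-feature regression problem (made up for illustration)
rng = np.random.default_rng(1)
X_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) * 100 + rng.normal(0, 5, size=200)

def boost(X, y, n_estimators, max_depth, learning_rate):
    """Fit shallow trees to residuals and return the training predictions."""
    pred = np.full(len(y), y.mean())              # start from the mean
    for _ in range(n_estimators):
        residual = y - pred                       # negative gradient of squared error
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=1)
        tree.fit(X, residual)
        pred += learning_rate * tree.predict(X)   # shrink each tree's contribution
    return pred

mae_start = np.mean(np.abs(y_demo - y_demo.mean()))
mae_boost = np.mean(np.abs(y_demo - boost(X_demo, y_demo, 50, 3, 0.1)))
print(f"MAE of mean prediction: {mae_start:.1f}")
print(f"MAE after 50 boosted trees: {mae_boost:.1f}")
```

A smaller `learning_rate` makes each tree correct less of the remaining error, which is why it is usually paired with a larger `n_estimators`.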

In [43]:
depths = [3, 5, 7, 9]
results = {}

for depth in depths:
    mdlXgbMlb = XGBRegressor(n_estimators=5000, learning_rate=0.01, max_depth=depth)
    mdlXgbMlb.fit(Xle_train[cols_num + cols_cat], y_train)
    y_test_pred = mdlXgbMlb.predict(Xle_test[cols_num + cols_cat])
    mae = mean_absolute_error(y_test, y_test_pred)
    results[depth] = mae
    print(f"MAE from XGBoost model with max_depth={depth}: {mae}")


MAE from XGBoost model with max_depth=3: 2205.7670529624943
MAE from XGBoost model with max_depth=5: 2116.815223390688
MAE from XGBoost model with max_depth=7: 2064.416389897939
MAE from XGBoost model with max_depth=9: 2053.144651265808


As the depth increases, the model has more capacity and can capture more complex relationships and patterns. Here the gains shrink with each step: the MAE drops by about 89, 52, and 11 when going from depth 3 to 5, 5 to 7, and 7 to 9. Increasing the depth does not always produce better results, because the model may overfit the training data, which hurts generalization performance.
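A common way to spot overfitting is to compare the error on the training rows with the error on held-out rows as the depth grows. The sketch below does this with a single scikit-learn decision tree on synthetic, noisy data; the tree is a stand-in for the boosted model, and all names and values are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy data (hypothetical, not the sales data)
rng = np.random.default_rng(1)
X_demo = rng.uniform(0, 10, size=(400, 1))
y_demo = np.sin(X_demo[:, 0]) * 100 + rng.normal(0, 30, size=400)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.25, random_state=1)

# Record (train MAE, test MAE) for each depth
maes = {}
for depth in [2, 4, 8, 16]:
    tree = DecisionTreeRegressor(max_depth=depth, random_state=1).fit(X_tr, y_tr)
    mae_tr = mean_absolute_error(y_tr, tree.predict(X_tr))
    mae_te = mean_absolute_error(y_te, tree.predict(X_te))
    maes[depth] = (mae_tr, mae_te)
    print(f"max_depth={depth}: train MAE={mae_tr:.1f}, test MAE={mae_te:.1f}")
```

A training MAE that keeps falling while the held-out MAE stalls or rises is the usual sign that extra depth is fitting noise.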

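With a tuned gradient boosted model, the MAE falls to 2053.14 (`max_depth=9`). That is about 38 below the best Random Forest result so far (2091.24, from dropping the columns with missing values) and about 166 below the Random Forest trained on the same label-encoded features (2218.81). Building the ensemble sequentially, with each tree fitted to the errors of the previous ones, and tuning the parameters therefore gives a modest improvement.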

# (Extra) Most-frequent-value and k-nearest-neighbours imputation

The aim here is to train and test the same Random Forest model after replacing the missing values, first with the most frequent value in each column and then with k-nearest-neighbours imputation.

In [35]:
# Select target as a series and features as dataframe
y = dfSls.loc[:, ['Purchase']].values.ravel()
X = dfSls.drop(['Purchase'], axis=1)

# Select numeric features
cols_num = [col for col in X.columns if X[col].dtype in ['int64', 'float64']]
Xnum = X[cols_num]

# Split numeric features into training and test sets
Xnum_train, Xnum_test, y_train, y_test = train_test_split(Xnum, y, train_size=0.8, test_size=0.2, random_state=1)

# Fill missing values with the most frequent values in each column
Xnum_train_repnull = Xnum_train.apply(lambda x: x.fillna(x.mode()[0]) if x.isnull().sum() > 0 else x)
Xnum_test_repnull = Xnum_test.apply(lambda x: x.fillna(x.mode()[0]) if x.isnull().sum() > 0 else x)

# Function to build and score Random Forest model
def get_random_forest_mae(X_trn, X_tst, y_trn, y_tst):
    mdlRfsSls = RandomForestRegressor(random_state=1)
    mdlRfsSls.fit(X_trn, y_trn)
    y_tst_prd = mdlRfsSls.predict(X_tst)
    mae = mean_absolute_error(y_tst, y_tst_prd)
    return mae

# Calculate MAE
print('MAE from Approach (Replace missing values with the most frequent values):')
print(get_random_forest_mae(Xnum_train_repnull, Xnum_test_repnull, y_train, y_test))

MAE from Approach (Replace missing values with the most frequent values):
2186.22104201576


#### K-Nearest Neighbors Imputation (KNNImputer)

In [37]:
from sklearn.impute import KNNImputer
# Select target as a series and features as dataframe
y = dfSls.loc[:, ['Purchase']].values.ravel()
X = dfSls.drop(['Purchase'], axis=1)

# Select numeric features
cols_num = [col for col in X.columns if X[col].dtype in ['int64', 'float64']]
Xnum = X[cols_num]

# Split numeric features into training and test sets
Xnum_train, Xnum_test, y_train, y_test = train_test_split(Xnum, y, train_size=0.8, test_size=0.2, random_state=1)

# Apply KNN Imputer
imputer = KNNImputer(n_neighbors=1)
Xnum_train_imp = imputer.fit_transform(Xnum_train)
Xnum_test_imp = imputer.transform(Xnum_test)

# Function to build and score Random Forest model
def get_random_forest_mae(X_trn, X_tst, y_trn, y_tst):
    mdlRfsSls = RandomForestRegressor(random_state=1)
    mdlRfsSls.fit(X_trn, y_trn)
    y_tst_prd = mdlRfsSls.predict(X_tst)
    mae = mean_absolute_error(y_tst, y_tst_prd)
    return mae

# Calculate MAE
print('MAE from Approach (KNN Imputer):')
print(get_random_forest_mae(Xnum_train_imp, Xnum_test_imp, y_train, y_test))

KeyboardInterrupt: 
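The run above was interrupted before it finished. `KNNImputer` compares every row that has a missing value against the rows it was fitted on, so the cost grows with the product of the two row counts. One possible workaround, sketched here on synthetic data and not evaluated on the sales data, is to fit the imputer on a random subsample. The frame `demo`, the sample size of 200, and `n_neighbors=5` are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

rng = np.random.default_rng(1)

# Hypothetical numeric frame with roughly 30% of one column missing
demo = pd.DataFrame({
    "Occupation": rng.integers(0, 21, size=2000),
    "Product_Category_1": rng.integers(1, 21, size=2000),
    "Product_Category_2": rng.integers(2, 19, size=2000).astype(float),
})
demo.loc[rng.random(2000) < 0.3, "Product_Category_2"] = np.nan

# Fit on a random subsample so each row is compared with 200 donors, not 2000
fit_sample = demo.sample(n=200, random_state=1)
imputer = KNNImputer(n_neighbors=5)
imputer.fit(fit_sample)

# Impute the full frame using only the subsampled donors
demo_imp = pd.DataFrame(imputer.transform(demo),
                        columns=demo.columns, index=demo.index)
print("Missing values after imputation:", demo_imp.isna().sum().sum())
```

A smaller donor sample trades imputation quality for speed, so the effect on the MAE would still need to be measured.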

## Takeaways

* Expanded the model to include other Numerical features, and replaced missing values by *Imputation*
* Included Categorical features, and converted them to usable information by *Label Encoding*
* Ensembled many decision trees more intelligently using the *Gradient Boosting* model and *Parameter Tuning* for better results
* The lowest MAE came from the tuned Gradient Boosting model (2053.14 at `max_depth=9`); among the Random Forest variants, dropping the columns with missing values gave the lowest MAE (2091.24). These rankings may differ on other datasets