## Introduction

This notebook uses an XGBoost regression to predict total sales for every item and store in the following month (as proposed by the competition: https://www.kaggle.com/competitions/competitive-data-science-predict-future-sales/overview/description).

We begin with an **EDA**, followed by **data engineering** that pivots the data into a time series with a training and a testing period that can be fed into our regression algorithm. Lastly, we run our **XGBoost regression** and apply our **prediction** to the testing data.

In [None]:
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
from numpy import absolute
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
import xgboost as xgb
from xgboost import XGBRegressor
from hyperopt import STATUS_OK, Trials, fmin, hp, tpe
from sklearn.model_selection import RandomizedSearchCV
import seaborn as sns

In [None]:
item_categories = pd.read_csv ('../input/competitive-data-science-predict-future-sales/item_categories.csv')
items = pd.read_csv ('../input/competitive-data-science-predict-future-sales/items.csv')
train = pd.read_csv ('../input/competitive-data-science-predict-future-sales/sales_train.csv')
shops = pd.read_csv ('../input/competitive-data-science-predict-future-sales/shops.csv')
test = pd.read_csv ('../input/competitive-data-science-predict-future-sales/test.csv')
sample_submission = pd.read_csv ('../input/competitive-data-science-predict-future-sales/sample_submission.csv')

## EDA

In [None]:
# Basic EDA helper: print the first rows, info, summary statistics and missing values
def EDA(df):
    print('First rows\n', df.head(3))
    print('Info')
    df.info()  # info() prints its report directly and returns None
    print('Describe\n', df.describe())
    print('Missing Values\n', df.isnull().sum())

In [None]:
datasets = [item_categories, items, train, shops, test, sample_submission]
# run the EDA helper on each of the provided datasets
for dataset in datasets:
    EDA(dataset)

Looking at the "Describe" table of Train, the minimum values of "item_price" and "item_cnt_day" are negative, which does not make much sense for sales data. The maximum values of both features are also far larger than their means, so we may have outliers.

In [None]:
# plot item_price
sns.violinplot(y = train['item_price']).set(title='item_price has a long tail in the upper values')

In [None]:
print(train.sort_values(by=['item_price']).head(5),train.sort_values(by=['item_price']).tail(5))

We clearly have outliers: the last rows shown above have unusually high "item_price" values. Row 484683 contains a negative item_price, which is not realistic.

In [None]:
# plot item_cnt_day
sns.violinplot(y = train['item_cnt_day']).set(title='item_cnt_day has a long tail in the upper values')

In [None]:
print(train.sort_values(by=['item_cnt_day']).head(5),train.sort_values(by=['item_cnt_day']).tail(5))

We notice outliers with very high "item_cnt_day".

## Data Engineering

In [None]:
#To deal with outliers we compute a Z-score for item_cnt_day and item_price and remove rows more than 3 standard deviations from the mean
#limiting |Z| to 3 covers ~99.73% of the area under a normal distribution

train['Zscore_item_cnt_day'] = (train.item_cnt_day - train.item_cnt_day.mean())/train.item_cnt_day.std(ddof=0)
train['Zscore_item_price'] = (train.item_price - train.item_price.mean())/train.item_price.std(ddof=0)

In [None]:
#Based on our analysis above we remove Outliers and "sketchy" Rows

#First we remove the 1 row with negative item_price
train = train[train['item_price'] > 0]

#removing outliers based on Z-score
train = train[(train['Zscore_item_cnt_day']<3)&(train['Zscore_item_cnt_day']>-3)]
train = train[(train['Zscore_item_price']<3)&(train['Zscore_item_price']>-3)]

#removing the Z-score columns now that the operation is finished (drop returns a new frame, so we reassign it)
train = train.drop(columns=['Zscore_item_cnt_day','Zscore_item_price'])
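
The Z-score filter above can be illustrated on a small synthetic column. This is only a minimal sketch: the `toy` frame and its values are made up for illustration and are not competition data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 19 ordinary prices and one extreme value
toy = pd.DataFrame({'item_price': [100.0] * 10 + [110.0] * 9 + [10000.0]})

# Population Z-score (ddof=0), computed the same way as in the cell above
z = (toy['item_price'] - toy['item_price'].mean()) / toy['item_price'].std(ddof=0)

# Keep only rows whose |Z| is below 3
filtered = toy[z.abs() < 3]

print(len(toy), len(filtered))
```

Note that a single extreme value also inflates the mean and standard deviation used to compute the scores, so on very small samples this filter can miss outliers.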

In [None]:
#our estimation is restricted to November 2015, so it's convenient to organise the data by month.
#We are working with a time series, so we convert the date column to a datetime dtype to make things easier

train['date'] = pd.to_datetime(train['date'], format = "%d.%m.%Y" )

In [None]:
train.head(5) #updated date

In summary, the goal of this competition is to predict how many units of each item are sold in each shop during November 2015. A convenient way to frame the problem is to build a **pivot table** with one row per shop and item pair and one column per month (`date_block_num`), holding the total count sold in that month. We therefore group the train data by "shop_id" and "item_id".
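
To make the reshaping concrete, here is a minimal sketch on a few made-up rows. The `toy` frame is hypothetical; only the column names are taken from the train data.

```python
import pandas as pd

# Hypothetical daily sales records (made-up values, same column names as train)
toy = pd.DataFrame({
    'date_block_num': [0, 0, 1, 1, 1],
    'shop_id':        [5, 5, 5, 5, 6],
    'item_id':        [10, 10, 10, 11, 10],
    'item_cnt_day':   [1, 2, 4, 1, 3],
})

# One row per (shop_id, item_id), one column per month, daily counts summed;
# months with no sales for a pair are filled with 0
toy_pt = pd.pivot_table(toy, index=['shop_id', 'item_id'], columns='date_block_num',
                        values='item_cnt_day', aggfunc='sum', fill_value=0)
print(toy_pt)
```

Shop 5 sold item 10 on two days in month 0, so its month 0 cell holds the sum of both daily counts.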

In [None]:
pt = pd.pivot_table(train, index = ['shop_id', 'item_id'], values = 'item_cnt_day', columns = ['date_block_num'], aggfunc = 'sum', fill_value = 0)
pt

In [None]:
#we currently have MultiIndex from the pivot table we built
#It's easier if we convert pt to a plain DataFrame by resetting the index with reset_index which removes the MultiIndex
pt.reset_index(inplace = True)

In [None]:
pt.tail(5)

In [None]:
#now that we have monthly item data, we left-merge the pivot table onto the test data so that every test row is kept
#the date_block_num columns keep their order
df = pd.merge(test, pt, on=['shop_id', 'item_id'], how = 'left')
df.head(5)

In [None]:
#as seen above we are missing values
df.fillna(0, inplace=True)
df.head(5)
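
The effect of the left merge followed by `fillna(0)` can be seen on a tiny example. This is a sketch with made-up frames (`toy_test`, `toy_pt`), not the competition data.

```python
import pandas as pd

# Hypothetical test pairs and a tiny monthly pivot (made-up values)
toy_test = pd.DataFrame({'ID': [0, 1], 'shop_id': [5, 5], 'item_id': [10, 99]})
toy_pt = pd.DataFrame({'shop_id': [5], 'item_id': [10], 0: [3], 1: [4]})

# A left join keeps every test row; pairs absent from the pivot get NaN
toy_df = pd.merge(toy_test, toy_pt, on=['shop_id', 'item_id'], how='left')

# A pair that never appears in training is treated as having zero sales
toy_df = toy_df.fillna(0)
print(toy_df)
```

Filling with zero assumes that a shop and item pair missing from the training data had no sales in those months, which is the choice this notebook makes.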

In [None]:
#With the data engineering complete, we split the data into training and test sets

#training data
X_train = df.drop(columns=['shop_id','item_id', 'ID', 33]) #we don't need the ID, shop_id or item_id columns, and the last month (33) is the target
y_train = df[33]

#for test we keep all the month columns except the first one so we maintain the same window length as in training
X_test = df.drop(columns=['shop_id','item_id', 'ID', 0])

In [None]:
#observing our split datasets
print('X TRAIN \n', X_train.head(3))
print('Y TRAIN \n', y_train.head(3))
print('X TEST \n', X_test.head(3))

Naturally, X_test has different column names from X_train because the two are one month apart. So that the fitted model can be applied to it, we rename the X_test columns to match X_train (as if sliding an imaginary time window).

In [None]:
X_test.columns = X_train.columns
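
The sliding window can be sketched on a pivot with only four months. The `toy` frame and the names `X_tr`, `y_tr` and `X_te` are hypothetical stand-ins for the real frames.

```python
import pandas as pd

# Hypothetical pivot with 4 monthly columns (0..3) instead of 34
toy = pd.DataFrame({0: [1, 0], 1: [2, 0], 2: [3, 1], 3: [4, 2]})

# Training: months 0..2 are the features, month 3 is the target
X_tr = toy.drop(columns=[3])
y_tr = toy[3]

# Prediction: months 1..3 are the features for the unseen month 4
X_te = toy.drop(columns=[0])

# Relabel so each column position plays the same role in both frames
X_te.columns = X_tr.columns
print(X_tr.columns.tolist(), X_te.iloc[0].tolist())
```

After the relabelling, the last feature column of each frame holds the month immediately before the month being predicted.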

In [None]:
X_test.head(5) #artificial column names added

## XGBoost Regression

In [None]:
# create an xgboost regression model
model = XGBRegressor() #first we try the model with default parameters, our relevant evaluation metric is rmse (default)
# define model evaluation method
cv = RepeatedKFold(n_splits=5, n_repeats=2, random_state=1)
# evaluate model
scores = cross_val_score(model, X_train, y_train, scoring='neg_root_mean_squared_error', cv=cv, n_jobs=-1)
# force scores to be positive
scores = absolute(scores)
print('Mean RMSE: %.3f (%.3f)' % (scores.mean(), scores.std()) )

### Hyperparameter tuning

Using RandomizedSearchCV, we run a randomized search over a grid of hyperparameter values. Although computationally intensive, the search aims to find parameters that improve on the RMSE obtained with the defaults.

In [None]:
"""
CODE BELOW TAKES UP TO 2 HOURS TO RUN, SKIP THIS CELL FOR RESULTS

regressor = model

hyperparameter_grid = {
    'n_estimators': [100, 400, 800],
    'max_depth': [3, 6, 9],
    'learning_rate': [0.05, 0.1, 0.20],
    'min_child_weight': [1, 10, 100]
    }

# Set up the random search with 5-fold cross validation
random_cv = RandomizedSearchCV(estimator=regressor,
            param_distributions=hyperparameter_grid,
            cv=5, n_iter=50,
            scoring = 'neg_root_mean_squared_error',n_jobs = 4,
            verbose = 5, 
            return_train_score = True,
            random_state=42)

random_cv.fit(X_train,y_train)

random_cv.best_estimator_
"""

In [None]:
# IMPROVED xgboost regression model
regressor = XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
             importance_type='gain', interaction_constraints='',
             learning_rate=0.1, max_delta_step=0, max_depth=3,
             min_child_weight=1, monotone_constraints='()',
             n_estimators=400, n_jobs=8, num_parallel_tree=1, random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
             tree_method='exact', validate_parameters=1, verbosity=None)

# define model evaluation method
cv = RepeatedKFold(n_splits=5, n_repeats=2, random_state=1)
# evaluate model
scores = cross_val_score(regressor, X_train, y_train, scoring='neg_root_mean_squared_error', cv=cv, n_jobs=-1)
# force scores to be positive
scores = absolute(scores)
print('Mean RMSE: %.3f (%.3f)' % (scores.mean(), scores.std()) )

The tuned hyperparameters slightly improve the model's mean RMSE, that is, the square root of the mean squared residual (prediction error) decreased. In simpler terms, the predictions are on average closer to the observed values.
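
For reference, RMSE matches the standard deviation of the residuals only when the residuals average to zero; otherwise it also includes the bias. A minimal sketch with made-up arrays (`y_true` and `y_pred` are hypothetical, not model output):

```python
import numpy as np

# Hypothetical observed and predicted monthly counts (made-up values)
y_true = np.array([0.0, 1.0, 2.0, 5.0])
y_pred = np.array([0.5, 1.0, 1.0, 3.5])

residuals = y_true - y_pred
rmse = np.sqrt(np.mean(residuals ** 2))

# RMSE^2 = variance of the residuals + squared mean residual (bias)
decomposed = np.sqrt(residuals.var(ddof=0) + residuals.mean() ** 2)
print(rmse, decomposed)
```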

## Prediction

In [None]:
#Best Regressor
Best_Regressor = XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
             importance_type='gain', interaction_constraints='',
             learning_rate=0.1, max_delta_step=0, max_depth=3,
             min_child_weight=1, monotone_constraints='()',
             n_estimators=400, n_jobs=8, num_parallel_tree=1, random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
             tree_method='exact', validate_parameters=1, verbosity=None)

#Given the improved model, we fit our training data
Best_Regressor.fit(X_train, y_train)

# make a prediction
yhat = Best_Regressor.predict(X_test)

# summarize prediction for total number of sales
print(yhat)

In [None]:
submission = pd.DataFrame({
            "ID": test['ID'],  # use the IDs from the test file rather than assuming they run 0..N-1
            "item_cnt_month": yhat
    })

In [None]:
submission.head(5)
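
To our understanding, the competition's evaluation page states that true target values are clipped to the [0, 20] range, so clipping the predictions in the same way before submitting is a common step. This notebook does not do so; the following is only a sketch with a made-up `toy_pred` array, and the range should be checked against the competition page.

```python
import numpy as np
import pandas as pd

# Hypothetical raw predictions (made-up values), including out-of-range ones
toy_pred = np.array([-0.3, 0.7, 25.0])

# Clip to the assumed [0, 20] target range before building the submission
toy_submission = pd.DataFrame({
    'ID': np.arange(len(toy_pred)),
    'item_cnt_month': np.clip(toy_pred, 0, 20),
})
print(toy_submission)
```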

In [None]:
submission.to_csv('../submission.csv', index=False)

## Brief Conclusion

In brief, we conducted an EDA and then cleaned and prepared the data for an XGBoost regressor. We then ran a randomized search with cross-validation to obtain better hyperparameters for our final regressor model.

In this version of the notebook we remove outliers based on the Z-score. The competition score barely changed, while the in-sample mean RMSE improved from 4.028 (before treating outliers) to 0.96 (after) using the default XGBoost regressor. The fact that the out-of-sample score stayed almost the same while the in-sample score improved significantly might be a sign of overfitting. Part of the in-sample gain is also to be expected mechanically, since removing extreme rows shrinks the spread of the target. If I ever revisit this notebook, it would be worth exploring ways to improve the out-of-sample score.