*Note:* This is still a WIP (work in progress) and it is my first attempt at a Kaggle competition from scratch. I'm trying to write it in a manner suitable for those who, like me, are just starting with Kaggle competitions. I will start with basic classical ML models and plan to use an LSTM at the end. If this notebook helped you, please give it an upvote!

In [None]:
import numpy as np 
import pandas as pd
import seaborn as sns

# **Preprocessing and Exploratory Data Analysis**

## Preprocessing

Instead of preprocessing all csv files together, I will check them one by one to see what preprocessing each needs and what other features I can extract.


In [None]:
sales_train = pd.read_csv('../input/competitive-data-science-predict-future-sales/sales_train.csv')
sales_train.head()

This file has item_id, so we can add another feature, item_category, by merging in information from items.csv.

In [None]:
items = pd.read_csv('../input/competitive-data-science-predict-future-sales/items.csv')
items.head()

Let's have a look at item_categories.csv

In [None]:
item_categories = pd.read_csv('../input/competitive-data-science-predict-future-sales/item_categories.csv')
item_categories.head()

Categories like PS2 and PS3 can be further merged into a single category, PS. Some notebooks have used that approach; I'm instead extracting features more generally from the category name. I translated this Russian file to English, after which it seems that splitting the category name into two separate categories, viz. main category and sub category, might be more helpful. We will try doing so. But let's check the remaining shops.csv before preprocessing everything.

In [None]:
shops = pd.read_csv('../input/competitive-data-science-predict-future-sales/shops.csv')
shops.head()

After talking with some people who know Russian (the internet is a very useful thing), I figured out that most shop names are in fact in the format: CityName TP/TPK/ETC ShopOrMallName. TP/TPK/TU mean Shopping complex, Shopping and Entertainment complex and Shopping center respectively. So we can add more columns: city name, shop category (TP/TPK/TU/Others) and vanilla shop name.

Now that we have seen all the files, we can start preprocessing. I will start with shops.csv, then see what we can do for item_categories.csv, and then merge data from all 4 files into a single dataframe.

In [None]:
# Remove punctuation. regex=True is needed because pandas 2.0+ treats the pattern literally by default.
shops["shop_name"] = shops["shop_name"].str.replace(r'[^\w\s]', '', regex=True)

## Uncomment the following line to see how we can create the city column
## shops['city'] = shops['shop_name'].apply(lambda shop : shop.split()[0])
## Since we are creating more than one column, we will create all of them together.

shops[["shop_city", "shop_category", "shop_name"]] = shops["shop_name"].str.split(n=2, expand=True)
shops.drop('shop_name', axis=1, inplace=True)
shops.head()

Let's explore this data more.

In [None]:
print('--Cities--')
print(shops['shop_city'].value_counts())
print('--Categories--')
print(shops['shop_category'].value_counts())


Upon translating the above into English, I found that some city names and some categories are incorrect; however, such entries are few, so I kept things as they are.
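
The shop category was described earlier as TP/TPK/TU/Others, but the split above keeps whatever second word each shop name happens to have. One possible way to fold the rare, probably mis-parsed tokens into a single bucket is sketched below. The helper name `collapse_rare`, the `min_count` threshold and the toy labels are assumptions made for illustration; they are not part of this notebook's pipeline.

```python
import pandas as pd

def collapse_rare(series: pd.Series, min_count: int = 2, other: str = "Other") -> pd.Series:
    """Replace values that occur fewer than `min_count` times with `other`."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    # Keep frequent values as they are, overwrite rare ones.
    return series.where(~series.isin(rare), other)

# Toy labels standing in for the real shop_category column.
toy = pd.Series(["A", "B", "A", "C", "B", "D"], name="shop_category")
collapsed = collapse_rare(toy, min_count=2)
print(collapsed.value_counts())
```

On the real data this could be applied as `shops['shop_category'] = collapse_rare(shops['shop_category'])`, with the threshold chosen after looking at the value counts printed above.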

In [None]:
item_categories = pd.read_csv('../input/competitive-data-science-predict-future-sales/item_categories.csv', index_col='item_category_id')
item_categories["item_category_name"] = item_categories["item_category_name"].str.replace(r'[^\w\s-]', '', regex=True)  # Remove punctuation, keep hyphens

item_categories[["item_main_category_name", "item_sub_category_name"]] = item_categories["item_category_name"].str.split(pat='-', n=1, expand=True)
item_categories.drop('item_category_name', axis=1, inplace=True)
# Categories without a hyphen have no sub category; we fill those with 'Undefined'.
item_categories.fillna(value = {'item_sub_category_name' : 'Undefined'}, inplace=True)

item_categories.head()

In [None]:
item_categories['item_sub_category_name'].value_counts()

For now, I will not extract further information from item categories. Next I will prepare the training data: we start from sales_train.csv and join the appropriate columns so that all required features are in a single dataframe.

In [None]:
sales_train = pd.read_csv('../input/competitive-data-science-predict-future-sales/sales_train.csv')
sales_train.head()
sales_train['date'] = pd.to_datetime(sales_train['date'], format='%d.%m.%Y')
sales_train['day'] = sales_train['date'].dt.day
sales_train['month'] = sales_train['date'].dt.month
sales_train['year'] = sales_train['date'].dt.year
sales_train['weekday'] = sales_train['date'].dt.weekday  # Monday is 0
sales_train = sales_train.merge(items.drop('item_name', axis=1), on='item_id')
sales_train = sales_train.merge(item_categories, on='item_category_id')
sales_train = sales_train.merge(shops, on='shop_id')
sales_train.head(25)

We do have some outliers, but detecting them would be easy once we've merged things, so I'm not removing them right now.

## EDA

I had done EDA earlier, but I've removed major portions of it for now; I'll be adding them back.

In [None]:
# TODO: Change this plot with something that clearly shows outliers
sales_train['item_cnt_day'].plot() 

Our outlier:

In [None]:
sales_train[sales_train['item_cnt_day'] == sales_train['item_cnt_day'].max()]

In [None]:
sales_train.nlargest(20, 'item_cnt_day', keep='first')
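
A line plot of every row makes the extreme values hard to judge. As one alternative, a few upper quantiles show how far the maximum sits from the bulk of the data. This is a minimal sketch on made-up numbers; `tail_summary` and the chosen quantiles are assumptions, not something this notebook already uses.

```python
import pandas as pd

def tail_summary(values: pd.Series, quantiles=(0.5, 0.99, 0.999, 1.0)) -> pd.Series:
    """Return selected quantiles so that a heavy right tail stands out."""
    return values.quantile(list(quantiles))

# Toy data: mostly single-item sales plus one extreme value.
toy_counts = pd.Series([1] * 97 + [2, 5, 1000], name="item_cnt_day")
summary = tail_summary(toy_counts)
print(summary)
```

On the real data the same call would be `tail_summary(sales_train['item_cnt_day'])`, and likewise for `item_price`.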

In [None]:
pd.options.display.float_format = '{:.4f}'.format  # Globally suppress sci format
# Total number of items sold across all shops per day can be found as follows;
# we are exploring on which days sales were highest.
print((sales_train.groupby('date').item_cnt_day).sum().nlargest(50, keep='first'))
(sales_train.groupby('date').item_cnt_day).sum().plot()

#TODO: More EDA later. Like sales vs weekday etc.
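
For the sales-vs-weekday idea, a grouped sum is probably enough to start with. The sketch below uses a small made-up frame with the same column names as `sales_train`; the dates and counts are invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "date": pd.to_datetime(["2013-01-07", "2013-01-08", "2013-01-12",
                            "2013-01-13", "2013-01-14"]),
    "item_cnt_day": [1, 2, 5, 4, 3],
})
toy["weekday"] = toy["date"].dt.weekday  # Monday is 0

# Total items sold on each weekday.
by_weekday = toy.groupby("weekday")["item_cnt_day"].sum()
print(by_weekday)
```

Totals depend on how many times each weekday occurs in the data, so dividing by the number of distinct dates per weekday may give a fairer comparison.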

Let's remove outliers and prepare the final training data.

In [None]:
sales_train = sales_train[(sales_train['item_cnt_day'] < 1001) & (sales_train['item_price'] < 100000) & (sales_train['item_price'] > 0)]

## Preparing training data

Currently we have following training data.

In [None]:
sales_train.head(20)
sales_train.shape

As you can see, each entry in the above dataframe represents non-zero sales of a particular item, in a particular shop, on a particular date. If you look at the test data, we are asked to predict *monthly* sales for the next month for given items in given shops, and it has 214k rows. We are given nothing apart from item_id and shop_id; in particular, we are not given item_price for the test set. Other features like item_category and shop_city can be fetched from the training data using item_id and shop_id, but item_price understandably varies across dates even for the same shop and item pair. Getting item_price for our test data is another challenge we have to deal with.

Clearly our current sales_train dataframe and the test data are not consistent, because sales_train currently has data for non-zero daily sales only. The following two solutions came to my mind; other approaches are discussed in other notebooks.

Approach 1. We can make predictions for each day of the upcoming month and sum them per shop and item combination to get the monthly prediction. With this approach we have to make 214k * 30 predictions and then aggregate them into monthly counts. Also, sales would be zero on many days, and our current training data only has non-zero entries, so we would (probably) need to teach the model, by modifying the training data, that most days have zero sales. If that means adding zero rows, our training set, which already has <> rows, would become very large, along with a large test set. So I haven't used this approach.

Approach 2. Instead, we can restructure our training set to look more like the test set. To do so, we create a new dataframe where each record shows the monthly sales of a particular item_id and shop_id together with its average price. We group rows by month, item_id and shop_id; the average price is the mean price within each group, and the monthly sales count is the sum of the daily sales counts within the group.
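
To make the restructuring concrete, here is a tiny worked example on made-up rows. It uses pandas named aggregation to get both numbers in one pass; the output column name `item_cnt_monthly` matches the one used later in this notebook, and all values are invented.

```python
import pandas as pd

toy = pd.DataFrame({
    "date_block_num": [0, 0, 0, 1],
    "shop_id":        [5, 5, 5, 5],
    "item_id":        [10, 10, 11, 10],
    "item_price":     [100.0, 120.0, 50.0, 110.0],
    "item_cnt_day":   [1, 3, 2, 4],
})

# One row per (month, shop, item): summed count and mean price.
monthly = (toy.groupby(["date_block_num", "shop_id", "item_id"])
              .agg(item_cnt_monthly=("item_cnt_day", "sum"),
                   item_price=("item_price", "mean"))
              .reset_index())
print(monthly)
```

Note that the mean here is taken over transaction rows, so it is not weighted by the number of items sold at each price.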

In [None]:
monthly_group = sales_train.groupby(['date_block_num', 'shop_id', 'item_id'])
monthly_count = monthly_group['item_cnt_day'].sum()  # Total number of items sold in month.
monthly_avg_price = monthly_group['item_price'].mean()  # Average price of all items sold in month.

In [None]:
monthly_count.describe()

I was very optimistic when I made the weekday feature, because in real life sales depend heavily on the day of the week. However, we need to remove daily features from the training set now that we are building a monthly one. So basically, I created a feature, didn't use it, and am now deleting it. I'm leaving those steps in this notebook as a reminder to myself and to emphasise that machine learning is an iterative process.

In [None]:
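# Keep one row per (month, shop, item). drop_duplicates() ignores the index, so we deduplicate on the key columns before indexing.
X_train = sales_train.drop(['date', 'item_cnt_day', 'day', 'weekday'], axis=1).drop_duplicates(subset=['date_block_num', 'shop_id', 'item_id']).set_index(['date_block_num', 'shop_id', 'item_id'])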
X_train['item_price'] = monthly_avg_price
X_train['item_cnt_monthly'] = monthly_count
X_train.reset_index(inplace=True)
X_train.head()

In [None]:
y_train = X_train['item_cnt_monthly']
X_train.drop(['item_cnt_monthly'], axis = 1, inplace = True)

In [None]:
X_train.head()

In [None]:
X_train.columns

In [None]:
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import cross_val_score, train_test_split, TimeSeriesSplit
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import SGDRegressor
from sklearn.ensemble import RandomForestRegressor

categorical_cols = ['item_main_category_name', 'item_sub_category_name', 'shop_id', 'shop_city','month', 'year']
numerical_cols = ['item_price']
drop_cols = ['shop_category', 'date_block_num']
preprocessor = ColumnTransformer(
    transformers=[
        ('categorical', OneHotEncoder(handle_unknown='ignore'), categorical_cols),
        ('numerical', StandardScaler(), numerical_cols),
    ])

linear_model = SGDRegressor(early_stopping = True, learning_rate='adaptive')
forest_model = RandomForestRegressor(n_estimators = 4)
linear_pipe = Pipeline(steps=[('preprocessor', preprocessor), ('model', linear_model)])
forest_pipe = Pipeline(steps=[('preprocessor', preprocessor), ('model', forest_model)])

In [None]:
from sklearn.metrics import mean_squared_error
X_train.drop(drop_cols, axis=1, inplace=True)
X_train1, X_valid1, y_train1, y_valid1 = X_train[:250000], X_train[250000:], y_train[:250000], y_train[250000:]
y_valid1_np = np.clip(y_valid1.to_numpy(), 0, 20)
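
The split above is positional. Since the task is to predict a future month, an alternative worth trying is to hold out the most recent months by `date_block_num`. The sketch below is an assumption about how that could look, not what this notebook does; `split_by_month` is a hypothetical helper, and it would have to run before `date_block_num` is dropped.

```python
import pandas as pd

def split_by_month(frame: pd.DataFrame, target: pd.Series, valid_from: int):
    """Train on months before `valid_from`, validate on that month and later."""
    is_valid = frame["date_block_num"] >= valid_from
    return frame[~is_valid], frame[is_valid], target[~is_valid], target[is_valid]

# Toy data with the same key column as X_train.
toy_X = pd.DataFrame({"date_block_num": [0, 1, 2, 3, 3], "shop_id": [1, 1, 2, 2, 3]})
toy_y = pd.Series([4, 0, 2, 7, 1])
X_tr, X_va, y_tr, y_va = split_by_month(toy_X, toy_y, valid_from=3)
print(len(X_tr), len(X_va))
```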

In [None]:
linear_pipe.fit(X_train1, y_train1)
ans1 = linear_pipe.predict(X_valid1)
ans1 = np.clip(ans1, 0, 20)
print("RMSE:",(mean_squared_error(ans1, y_valid1_np)**0.5))
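
The clip-then-RMSE steps are repeated for every model in this notebook, so they could be wrapped in a small helper. This is a minimal sketch; `clipped_rmse` is a name introduced here, and the [0, 20] range is simply the one already used in the cells above.

```python
import numpy as np

def clipped_rmse(y_true, y_pred, low=0, high=20):
    """RMSE after clipping both targets and predictions to [low, high]."""
    y_true = np.clip(np.asarray(y_true, dtype=float), low, high)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), low, high)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# After clipping, 30 and 25 both become 20, so only the first entry contributes.
print(clipped_rmse([0, 5, 30], [1, 5, 25]))
```

With this helper, the evaluation for each model becomes a single call such as `clipped_rmse(y_valid1, ans1)`.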

In [None]:
#forest_pipe.fit(X_train1, y_train1)
#ans2 = forest_pipe.predict(X_valid1)
#ans2 = np.clip(ans2, 0, 20)
#print("RMSE:",(mean_squared_error(ans2, y_valid1_np)**0.5))

In [None]:
# XGBoost
from xgboost import XGBRegressor

xgb_model = XGBRegressor(n_estimators = 150, max_depth = 15, learning_rate = 0.21, random_state=0)
xgb_pipe = Pipeline(steps=[('preprocessor', preprocessor), ('model', xgb_model)])
xgb_pipe.fit(X_train1, y_train1)
ans3 = xgb_pipe.predict(X_valid1)
ans3 = np.clip(ans3, 0, 20)
print("RMSE:",(mean_squared_error(ans3, y_valid1_np)**0.5))

In [None]:
ans3_tr = xgb_pipe.predict(X_train1)
ans3_tr = np.clip(ans3_tr, 0, 20)
y_train1_np = np.clip(y_train1.to_numpy(), 0, 20)
print("RMSE:",(mean_squared_error(ans3_tr, y_train1_np)**0.5))

In [None]:
pd.Series(abs(ans3-y_valid1_np)).nlargest(n=20)
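
The Series above gets a fresh 0-based index, so its labels are positions within the validation set, not row labels of `X_valid1`. To avoid mixing the two up, the errors can be attached to the validation frame itself. This is a sketch under that assumption; `worst_predictions` and the toy values are made up for illustration.

```python
import numpy as np
import pandas as pd

def worst_predictions(X_valid: pd.DataFrame, y_true, y_pred, n: int = 20) -> pd.DataFrame:
    """Return the n validation rows with the largest absolute error."""
    report = X_valid.copy()
    report["actual"] = np.asarray(y_true)
    report["predicted"] = np.asarray(y_pred)
    report["abs_error"] = (report["actual"] - report["predicted"]).abs()
    return report.nlargest(n, "abs_error")

# Toy validation frame whose index does not start at 0, like X_valid1.
toy_X = pd.DataFrame({"item_id": [7, 8, 9]}, index=[250000, 250001, 250002])
worst = worst_predictions(toy_X, y_true=[1.0, 20.0, 3.0], y_pred=[1.5, 2.0, 3.0], n=2)
print(worst)
```

Each returned row carries its features, the actual value and the prediction together, so no separate index lookups are needed.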

In [None]:
y_valid1_np[6917]  # Actual sales for one of the highest-error entries.

In [None]:
pd.Series(ans3).iloc[6917]  # Predicted sales of same entry.

In [None]:
X_valid1.iloc[7041]  # Let's see what is that entry.

In [None]:
y_train1[X_train1['item_id'] == 11457]  # Have we seen that item_id earlier? What were its sales?

In [None]:
# This item was sold in small quantities; what caused our model to predict higher sales?
X_train1[X_train1['item_id'] == 11457]

In [None]:
X_train1[X_train1['item_id'] == 12397]

In [None]:
print(X_valid1.iloc[7041])
print(X_valid1.iloc[6917])

In [None]:
'''Index(['date_block_num', 'shop_id', 'item_id', 'item_price', 'month', 'year',
       'item_category_id', 'item_main_category_name', 'item_sub_category_name',
       'shop_city', 'shop_category'],
      dtype='object')'''
#todo : item price function
test_raw = pd.read_csv("../input/competitive-data-science-predict-future-sales/test.csv", index_col='ID')
test_raw['date_block_num'] = 34
test_raw['month'] = 11
test_raw['year'] = 2015

#test_raw['item_category_id'] = items[items['item_id'] == test_raw['item_id']].item_category_id

In [None]:
#test_raw = pd.merge(test_raw, items, how="left", on="item_id")
#test_raw = pd.merge(test_raw, item_categories, how="left", on="item_category_id")
#test_raw = pd.merge(test_raw, shops, how="left", on="shop_id")

test_raw = test_raw.merge(items, on='item_id', suffixes=[None, '_drop'])
test_raw = test_raw.merge(item_categories, on='item_category_id', suffixes=[None, '_drop'])
test_raw = test_raw.merge(shops, on='shop_id', suffixes=[None, '_drop'])
test_raw.drop(test_raw.filter(regex='_drop$').columns.tolist(), axis=1, inplace=True)
#test_raw.drop(drop_cols, axis=1, inplace=True)
test_raw['item_price'] = 0
test_raw = test_raw[X_train.columns]
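
Setting `item_price` to 0 is only a placeholder. One possible way to fill it, sketched below, is to reuse the most recent known price with fallbacks. Everything here is an assumption rather than the notebook's method: `impute_test_price` is a hypothetical helper, the fallback order is just one reasonable choice, and the toy numbers are invented.

```python
import pandas as pd

def impute_test_price(monthly: pd.DataFrame, test: pd.DataFrame) -> pd.Series:
    """Fill item_price for test rows from the most recent known prices.

    Fallback order: last price of the (shop, item) pair, then last price of
    the item in any shop, then the overall median price.
    """
    ordered = monthly.sort_values("date_block_num", kind="stable")
    pair_last = (ordered.groupby(["shop_id", "item_id"])["item_price"].last()
                        .rename("pair_price").reset_index())
    item_last = (ordered.groupby("item_id")["item_price"].last()
                        .rename("item_fallback").reset_index())
    merged = (test[["shop_id", "item_id"]]
              .merge(pair_last, on=["shop_id", "item_id"], how="left")
              .merge(item_last, on="item_id", how="left"))
    price = (merged["pair_price"]
             .fillna(merged["item_fallback"])
             .fillna(monthly["item_price"].median()))
    price.index = test.index  # left merges keep row order, so labels can be restored
    return price.rename("item_price")

toy_monthly = pd.DataFrame({
    "date_block_num": [0, 1, 0, 1, 2],
    "shop_id":        [1, 1, 2, 2, 3],
    "item_id":        [10, 10, 11, 11, 10],
    "item_price":     [100.0, 90.0, 20.0, 30.0, 25.0],
})
toy_test = pd.DataFrame({"shop_id": [1, 2, 1], "item_id": [10, 10, 12]})
prices = impute_test_price(toy_monthly, toy_test)
print(prices)
```

In the toy data the first test row has a known pair, the second falls back to the item's latest price in another shop, and the third item was never sold, so it gets the overall median.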

In [None]:
X_train.head()

In [None]:
test_raw.head()

In [None]:
len(X_train[(X_train['shop_id'] == 48) & (X_train['item_id'] == 944)]) == 0
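
The check above looks at one shop and item pair at a time. To count every test pair that never appears in the training data, a left merge with `indicator=True` can flag them all in one go. This is a minimal sketch on toy frames; `unseen_pairs` is a hypothetical helper name.

```python
import pandas as pd

def unseen_pairs(train: pd.DataFrame, test: pd.DataFrame) -> pd.DataFrame:
    """Return test rows whose (shop_id, item_id) pair never occurs in train."""
    seen = train[["shop_id", "item_id"]].drop_duplicates()
    flagged = test.merge(seen, on=["shop_id", "item_id"], how="left", indicator=True)
    return flagged.loc[flagged["_merge"] == "left_only", ["shop_id", "item_id"]]

toy_train = pd.DataFrame({"shop_id": [48, 48, 5], "item_id": [100, 100, 944]})
toy_test = pd.DataFrame({"shop_id": [48, 48, 5], "item_id": [100, 944, 944]})
missing = unseen_pairs(toy_train, toy_test)
print(len(missing), "of", len(toy_test), "test pairs are unseen")
```

On the real data this would be called as `unseen_pairs(X_train, test_raw)`, which gives a sense of how many predictions have no sales history behind them.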