## Author: Dwi Hadyan Harsono
* [Github source](https://github.com/dwihdyn/ds-exploration/blob/main/p2/retail-simple.ipynb) 

## A beginner-friendly, straight-to-the-point notebook (no fuss, no lengthy charts)
- Obtain data
- Scrub data : to make it all numerical & model-friendly
- Explore data : correlation check with weekly_sales
- Model : RandomForestRegressor
- Interpret : predicting the future sales

> Beginner friendly: this notebook walks you through the full data science cycle. Once you understand it, you can treat it as a stepping stone for improving the model's accuracy.

# Obtain

In [None]:
# load necessary packages
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore") # ignoring annoying warnings

# load data
df_features = pd.read_csv('../input/walmart-recruiting-store-sales-forecasting/features.csv.zip', parse_dates=['Date']) # parse_dates ensures Date is stored as datetime64
df_sales = pd.read_csv('../input/walmart-recruiting-store-sales-forecasting/train.csv.zip', parse_dates=['Date'])
df_stores = pd.read_csv('../input/walmart-recruiting-store-sales-forecasting/stores.csv')
df_sales_answer = pd.read_csv('../input/walmart-recruiting-store-sales-forecasting/test.csv.zip', parse_dates=['Date'])
sample_submission = pd.read_csv('../input/walmart-recruiting-store-sales-forecasting/sampleSubmission.csv.zip')


# Scrub
- combine all datasets into one
- then separate into train (2010-02-05 to 2012-10-26, inclusive) and test (2012-11-02 to 2013-07-26, inclusive)
- convert all columns to numerical

In [None]:
# combine all 4 datasets into one, from largest to smallest

# merge the two sales tables (train & test) into one for now
sales_answer = pd.merge(df_sales, df_sales_answer, how='outer', on=['Store', 'Dept', 'Date', 'IsHoliday'])

# merge sales & features on the 'Store', 'Date' and 'IsHoliday' keys
sales_feat = pd.merge(sales_answer, df_features, how='outer', on=['Store', 'Date', 'IsHoliday'])

# merge sales_feat & stores on the 'Store' key
df_all = pd.merge(sales_feat, df_stores, how='outer', on='Store')
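An outer merge keeps rows that exist on only one side and fills the missing columns with NaN, so it is worth checking how many rows actually matched. The sketch below uses small made-up frames (hypothetical stand-ins for `df_sales` and `df_features`, not the competition data) and the `indicator=True` option of `pd.merge` to count matched and unmatched rows.

```python
import pandas as pd

# Toy stand-ins for df_sales and df_features (all values are made up)
sales = pd.DataFrame({
    "Store": [1, 1, 2],
    "Date": pd.to_datetime(["2012-10-19", "2012-10-26", "2012-10-26"]),
    "IsHoliday": [False, False, False],
    "Weekly_Sales": [100.0, 120.0, 90.0],
})
features = pd.DataFrame({
    "Store": [1, 1, 2, 2],
    "Date": pd.to_datetime(["2012-10-19", "2012-10-26", "2012-10-26", "2012-11-02"]),
    "IsHoliday": [False, False, False, False],
    "Temperature": [60.1, 58.3, 55.0, 50.2],
})

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'
merged = pd.merge(sales, features, how="outer",
                  on=["Store", "Date", "IsHoliday"], indicator=True)

# 'right_only' rows are feature rows without a sales record (Weekly_Sales is NaN)
merge_counts = merged["_merge"].value_counts()
print(merge_counts)
```

If the `right_only` or `left_only` counts are larger than expected, the join keys probably disagree between the two tables.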

In [None]:
def multipledummies(df, non_numerical_columns):
    ''' Input: the whole dataframe & the names of its non-numerical columns. Output: a dataframe with those columns replaced by one-hot (dummy) columns '''

    for i in non_numerical_columns:

        # convert to numerical using get_dummies
        one_hot = pd.get_dummies(df[i], prefix=i)

        # append new numerical column to main df
        df = df.join(one_hot)

        # drop that non-numerical column
        df.drop(i, axis = 1, inplace=True)

    return df
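To see what `multipledummies` produces, here is a minimal usage sketch on a made-up three-row frame. The function is repeated (with a non-in-place drop) so the block runs on its own; the `toy` frame and its values are assumptions for illustration.

```python
import pandas as pd

def multipledummies(df, non_numerical_columns):
    """One-hot encode the given columns and drop the originals."""
    for col in non_numerical_columns:
        one_hot = pd.get_dummies(df[col], prefix=col)
        df = df.join(one_hot)        # append the dummy columns at the end
        df = df.drop(col, axis=1)    # remove the original text column
    return df

# Made-up store table with one categorical column
toy = pd.DataFrame({
    "Store": [1, 2, 3],
    "Type": ["A", "B", "A"],
    "Size": [150000, 200000, 40000],
})

encoded = multipledummies(toy, ["Type"])
print(encoded.columns.tolist())  # ['Store', 'Size', 'Type_A', 'Type_B']
```

The input frame is left unchanged because `join` returns a new dataframe before anything is dropped.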

In [None]:
# convert Date to 'Day, Week, Year' to make it numerical
df_all['Day'] = df_all.Date.dt.day
df_all['Week'] = df_all.Date.dt.isocalendar().week.astype(int) # .dt.week was removed in pandas 2.0
df_all['Year'] = df_all.Date.dt.year
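`Series.dt.week` was deprecated in pandas 1.1 and removed in pandas 2.0, where `dt.isocalendar().week` gives the same ISO week number. A small sketch on three made-up dates shows the date parts this step extracts.

```python
import pandas as pd

# Three made-up dates, including the first and last training weeks
dates = pd.DataFrame({
    "Date": pd.to_datetime(["2010-02-05", "2012-10-26", "2012-12-28"])
})

dates["Day"] = dates["Date"].dt.day
# isocalendar() returns year/week/day columns; cast week to a plain int
dates["Week"] = dates["Date"].dt.isocalendar().week.astype(int)
dates["Year"] = dates["Date"].dt.year

print(dates)
```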


In [None]:
# convert Type columns to numerical using multipledummies
df_all = multipledummies(df_all, ['Type'])
df_all.sample(3)


In [None]:
# split df_all into training data (2010-02-05 to 2012-10-26, inclusive) and answer data (2012-11-02 to 2013-07-26, inclusive)

data_range = (df_all['Date'] >= '2010-02-05') & (df_all['Date'] <= '2012-10-26')
answer_range = (df_all['Date'] >= '2012-11-02') & (df_all['Date'] <= '2013-07-26')


# .copy() so the in-place drops below act on independent frames, not on slices of df_all
df = df_all.loc[data_range].copy()
df_answer = df_all.loc[answer_range].copy()


# drop the Date column now that the rows have been split by date
df.drop(['Date'], axis=1, inplace=True)
df_answer.drop(['Date'], axis=1, inplace=True)

In [None]:
# ensure all columns in df are numerical before finishing the "scrub" section
df.info()
df_answer.info()

# for columns with many nulls, we check the heatmap first: if the correlation with weekly_sales is weak (between -0.1 and 0.1), we drop the column; otherwise we would fill it with IterativeImputer (scikit-learn)

# IsHoliday stays boolean for now. It is converted to weights in the WMAE function in the 'Model' section
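`df.info()` shows non-null counts, but the share of missing values per column is often easier to read. The sketch below computes it on a made-up frame; the 0.5 cut-off is an arbitrary assumption for illustration, not a value used in this notebook.

```python
import numpy as np
import pandas as pd

# Made-up frame with different amounts of missing data per column
toy = pd.DataFrame({
    "Weekly_Sales": [10.0, 20.0, 30.0, 40.0],
    "MarkDown1": [np.nan, np.nan, np.nan, 5.0],
    "CPI": [210.0, np.nan, 211.0, 212.0],
})

# isna() gives True/False per cell; the mean of booleans is the missing share
null_share = toy.isna().mean().sort_values(ascending=False)
print(null_share)

# Hypothetical threshold: flag columns that are more than half empty
mostly_missing = null_share[null_share > 0.5].index.tolist()
print(mostly_missing)  # ['MarkDown1']
```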

# Explore

- Heatmap: drop a column if its correlation with the target variable (Weekly_Sales) is between -0.1 and 0.1

In [None]:
sns.set(style="white")

corr = df.corr()

mask = np.triu(np.ones_like(corr, dtype=bool)) # np.bool was removed in NumPy 1.24; use the built-in bool

f, ax = plt.subplots(figsize=(20, 15))

cmap = sns.diverging_palette(220, 10, as_cmap=True)

plt.title('Correlation Matrix', fontsize=18)

sns.heatmap(corr, mask=mask, cmap=cmap, vmax=1, vmin=-1, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5}, annot=True, fmt='.2f')

plt.show()

In [None]:
# drop CPI, Unemployment & MarkDown1-5: they correlate only weakly with weekly_sales AND have too many nulls
df.drop(['CPI', 'Unemployment', 'MarkDown1', 'MarkDown2', 'MarkDown3', 'MarkDown4', 'MarkDown5'], axis=1, inplace=True)
df_answer.drop(['CPI', 'Unemployment', 'MarkDown1', 'MarkDown2', 'MarkDown3', 'MarkDown4', 'MarkDown5'], axis=1, inplace=True)
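Instead of reading the weak columns off the heatmap by eye, the same rule (absolute correlation with the target below 0.1) can be applied programmatically. This is a minimal sketch on a made-up four-row frame; the column values are invented so that one feature is strongly correlated and the other is not.

```python
import pandas as pd

# Made-up data: Size tracks Weekly_Sales closely, Temperature does not
toy = pd.DataFrame({
    "Size": [100, 200, 300, 400],
    "Temperature": [60, 40, 40, 60],
    "Weekly_Sales": [10.0, 55.0, 100.0, 135.0],
})

# Correlation of every feature with the target, excluding the target itself
target_corr = toy.corr()["Weekly_Sales"].drop("Weekly_Sales")
print(target_corr.round(3))

# Columns whose absolute correlation falls inside the (-0.1, 0.1) band
weak = target_corr[target_corr.abs() < 0.1].index.tolist()
print(weak)  # ['Temperature']
```

Keep in mind that correlation only captures linear relationships, so a column flagged here could still be useful to a tree-based model.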

# Model
- The competition scores submissions with WMAE (weighted mean absolute error), which is not one of scikit-learn's built-in metrics, so GridSearchCV or RandomizedSearchCV cannot tune against it out of the box (they would need a custom scorer)
- For the sake of simplicity, we'll jump straight to model building
- Once you grasp the concept, you can tweak the model and search around for ways to improve it

In [None]:
# WMAE function as error measurement (the lower the better)

def WMAE(dataset, real, predicted):
    ''' Input: df, real values, predicted values. Output: the error value. The lower the value, the more accurate our model is '''

    # weight allocation on IsHoliday: holiday weeks count 5 times as much
    weights = dataset.IsHoliday.apply(lambda x : 5 if x else 1)

    # WMAE formula
    return np.round(np.sum(weights * abs(real - predicted)) / (np.sum(weights)), 2)
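A worked example helps to see how the holiday weight changes the score. The sketch below recomputes the same formula on three made-up weeks; `wmae_from_flags` is a hypothetical helper written for this illustration, taking the holiday flags directly instead of a dataframe.

```python
import numpy as np
import pandas as pd

def wmae_from_flags(is_holiday, real, predicted):
    """Weighted MAE: holiday weeks get weight 5, all other weeks weight 1."""
    weights = np.where(is_holiday, 5, 1)
    return np.sum(weights * np.abs(real - predicted)) / np.sum(weights)

# Three made-up weeks; only the first one is a holiday week
is_holiday = pd.Series([True, False, False])
real = pd.Series([100.0, 200.0, 300.0])
predicted = pd.Series([110.0, 190.0, 330.0])

# Absolute errors are 10, 10, 30; weighted sum = 5*10 + 10 + 30 = 90; weights sum to 7
score = wmae_from_flags(is_holiday, real, predicted)
print(round(score, 2))  # 12.86
```

The unweighted mean absolute error on the same numbers would be 50 / 3, about 16.67, so the weighting shifts the score toward the holiday week's error.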

In [None]:
# prep data

X_train = df.drop(['Weekly_Sales'], axis = 1)
Y_train = df['Weekly_Sales']


In [None]:
# model building & training
from sklearn.ensemble import RandomForestRegressor
# from sklearn.model_selection import train_test_split

RF = RandomForestRegressor(n_estimators=58, max_depth=27, max_features=12, min_samples_split=4, min_samples_leaf=1)
RF.fit(X_train, Y_train)
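The notebook trains on all labelled rows, so the WMAE function is never applied to data the model has not seen. One way to get an offline estimate is a time-ordered hold-out: fit on the earlier weeks and score on the later ones. The sketch below does this on synthetic data; the frame, the 80/20 cut-off and the small forest are all assumptions for illustration, not the notebook's settings.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic weekly data (made up): sales depend on store size plus noise
rng = np.random.default_rng(0)
n_weeks = 60
toy = pd.DataFrame({
    "Week": np.arange(n_weeks),
    "Size": rng.integers(50, 200, size=n_weeks),
    "IsHoliday": np.arange(n_weeks) % 10 == 0,
})
toy["Weekly_Sales"] = 100 * toy["Size"] + rng.normal(scale=500, size=n_weeks)

# Time-ordered split: first 80% of weeks to fit, last 20% to validate
cutoff = int(n_weeks * 0.8)
fit_part, val_part = toy.iloc[:cutoff], toy.iloc[cutoff:]

feature_cols = ["Week", "Size", "IsHoliday"]
model = RandomForestRegressor(n_estimators=20, random_state=0)
model.fit(fit_part[feature_cols].astype(float), fit_part["Weekly_Sales"])
val_pred = model.predict(val_part[feature_cols].astype(float))

# Same weighting as the WMAE function: 5 for holiday weeks, 1 otherwise
weights = np.where(val_part["IsHoliday"], 5, 1)
val_wmae = np.sum(weights * np.abs(val_part["Weekly_Sales"] - val_pred)) / np.sum(weights)
print(round(float(val_wmae), 2))
```

A random split would leak future weeks into training, which is why the rows are split by position in time here.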

In [None]:
# get prediction answer

X_test = df_answer.drop(['Weekly_Sales'], axis = 1)
predict = RF.predict(X_test)



# iNterpret

- Convert the predictions to a CSV file for submission

In [None]:
sample_submission['Weekly_Sales'] = predict
sample_submission
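Assigning `predict` by position assumes the rows of `df_answer` are in the same order as the sample submission, which an outer merge may not preserve. A more defensive option is to match on the `Id` column. The sketch below assumes the `Id` has the form `Store_Dept_Date` (for example `1_1_2012-11-02`); check this against the actual sample submission file before relying on it. All frames and values are made up.

```python
import pandas as pd

# Made-up predictions with their identifying columns still attached
pred = pd.DataFrame({
    "Store": [1, 1, 2],
    "Dept": [1, 2, 1],
    "Date": pd.to_datetime(["2012-11-02", "2012-11-02", "2012-11-02"]),
    "Weekly_Sales": [1000.0, 2000.0, 3000.0],
})

# Build the Id string (assumed format: Store_Dept_Date)
pred["Id"] = (pred["Store"].astype(str) + "_" + pred["Dept"].astype(str)
              + "_" + pred["Date"].dt.strftime("%Y-%m-%d"))

# Made-up submission template whose rows are in a different order
submission = pd.DataFrame({
    "Id": ["2_1_2012-11-02", "1_1_2012-11-02", "1_2_2012-11-02"]
})

# Look up each Id instead of relying on row position
submission["Weekly_Sales"] = submission["Id"].map(pred.set_index("Id")["Weekly_Sales"])
print(submission)
```

Any `Id` without a matching prediction ends up as NaN, which makes misalignment visible instead of silently shifting values. In this notebook the `Date` column is dropped before prediction, so it would need to be kept aside for this approach.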

In [None]:
# export file to csv
sample_submission = sample_submission.set_index('Id')
sample_submission.to_csv('walmart_v1.csv', sep=',')