### **1. DATA LOADING**

In this challenge, the following files were available:

*   stores.csv
*   features.csv
*   train.csv
*   test.csv
*   sampleSubmission.csv

The "stores" and "features" files were merged and then joined to the "train" and "test" datasets. The sampleSubmission file is the template for the Kaggle submission and is used at the end. It covers the same records as the test dataset, but contains only the (store, department, date) triples to be predicted.



In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

In [None]:
df_features = pd.read_csv('../input/walmart-recruiting-store-sales-forecasting/features.csv.zip', sep=',')
df_stores = pd.read_csv('../input/walmart-recruiting-store-sales-forecasting/stores.csv', sep=',')

df_features_stores = df_features.merge(df_stores, how='inner', on='Store')
df_features_stores.head()

In [None]:
df_train = pd.read_csv('../input/walmart-recruiting-store-sales-forecasting/train.csv.zip', sep=',')
train = df_train.merge(df_features_stores, how='inner', on=['Store','Date','IsHoliday'])
train.head()

In [None]:
df_test = pd.read_csv('../input/walmart-recruiting-store-sales-forecasting/test.csv.zip', sep=',')
test = df_test.merge(df_features_stores, how='inner', on=['Store','Date','IsHoliday'])
test.head()
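An inner join silently drops any sales row that has no matching store/date entry in the features table. As a minimal sketch on toy data (the values below are made up, not taken from the competition files), a left join with `indicator=True` and `validate='many_to_one'` makes such mismatches visible:

```python
import pandas as pd

# Toy stand-ins for train.csv and the merged features/stores table (hypothetical values).
sales = pd.DataFrame({
    "Store": [1, 1, 2],
    "Dept": [1, 2, 1],
    "Date": ["2010-02-05", "2010-02-05", "2010-02-05"],
    "IsHoliday": [False, False, False],
    "Weekly_Sales": [24924.5, 50605.27, 35034.06],
})
features = pd.DataFrame({
    "Store": [1, 2],
    "Date": ["2010-02-05", "2010-02-05"],
    "IsHoliday": [False, False],
    "Size": [151315, 202307],
})

# validate raises if the right-hand keys are not unique;
# indicator adds a "_merge" column telling where each row came from.
merged = sales.merge(
    features,
    how="left",
    on=["Store", "Date", "IsHoliday"],
    validate="many_to_one",
    indicator=True,
)
unmatched = int((merged["_merge"] != "both").sum())
print("rows:", len(merged), "unmatched:", unmatched)
```

On the real data, `unmatched` should be 0 and the merged frame should have as many rows as `df_train` if the inner join lost nothing.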

Checking the first and last records of the training and test datasets: the train dataset contains 421570 weekly sales records, detailed by store and department, from 02-05-2010 to 10-26-2012. The test dataset starts one week later and runs until 07-26-2013.

In [None]:
print("First train record:", train['Date'].min())
print("Last train record:", train['Date'].max())

print("First test record:", test['Date'].min())
print("Last test record:", test['Date'].max())

Since the records are weekly, the "date" variable was converted to week of the year and year, as two new variables.

In [None]:
train['Date'] = pd.to_datetime(train['Date'])
test['Date'] = pd.to_datetime(test['Date'])

train['Week'] = train['Date'].dt.isocalendar().week
test['Week'] = test['Date'].dt.isocalendar().week

train['Year'] = train['Date'].dt.isocalendar().year
test['Year'] = test['Date'].dt.isocalendar().year

A summary of the training and test datasets shows similar patterns in both, without much discrepancy in the values.

In [None]:
train.describe()

In [None]:
test.describe()

### **2. FEATURES TYPES**

#### **2.1 TRANSFORMATION**


Inspecting the variable types shows that all of them are numeric, except for the Date, Type and IsHoliday variables. The "Date" variable will not be used for training the model; only Week and Year are used. "IsHoliday" was transformed to a numeric binary format and "Type" to an ordinal numeric format. These transformations were applied to both the train and test datasets.

In [None]:
train.dtypes

In [None]:
train['Type'].unique()

In [None]:
train['Date'] = pd.to_datetime(train['Date'])
train['Type'] = train['Type'].apply(lambda x: 3 if x == 'A' else (2 if x == 'B' else 1))
train['IsHoliday'] = train['IsHoliday'].apply(lambda x: 1 if x == True else 0)

cols = train.columns.drop(['Date'])
train[cols] = train[cols].apply(pd.to_numeric, errors='coerce')

In [None]:
test['Date'] = pd.to_datetime(test['Date'])
test['Type'] = test['Type'].apply(lambda x: 3 if x == 'A' else(2 if x == 'B' else 1))
test['IsHoliday'] = test['IsHoliday'].apply(lambda x: 1 if x == True else 0)

cols = test.columns.drop(['Date'])
test[cols] = test[cols].apply(pd.to_numeric, errors='coerce')
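The nested conditional in the `apply` calls can also be written as a dictionary lookup. This is a sketch on a toy frame, assuming `Type` only takes the values 'A', 'B' and 'C'; unlike the lambda, `map` yields NaN for an unexpected value instead of silently assigning 1:

```python
import pandas as pd

# Same ordinal codes as in the cells above.
TYPE_ORDER = {"A": 3, "B": 2, "C": 1}

# Toy frame with made-up rows.
toy = pd.DataFrame({
    "Type": ["A", "B", "C", "A"],
    "IsHoliday": [True, False, False, True],
})
toy["Type"] = toy["Type"].map(TYPE_ORDER)
toy["IsHoliday"] = toy["IsHoliday"].astype(int)  # True/False -> 1/0
print(toy)
```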

#### **2.2. HOLIDAYS**

According to the challenge instructions, the holiday dates are expected to have a greater weight in the model training, since in general they represent a greater volume of sales.

The code below shows all dates that represent holidays, both in the train and test dataset. It is observed that the holidays are in the same weeks (6, 36, 47 and 52) for the years 2010, 2011, 2012 and 2013. From the data provided by the challenge, it is possible to identify what these holidays are.

*   Super Bowl: 12-Feb-10, 11-Feb-11, 10-Feb-12, 8-Feb-13 --> WEEK 6
*   Labor Day: 10-Sep-10, 9-Sep-11, 7-Sep-12, 6-Sep-13 --> WEEK 36
*   Thanksgiving: 26-Nov-10, 25-Nov-11, 23-Nov-12, 29-Nov-13 --> WEEK 47 (29-Nov-13 is ISO week 48, but falls outside both datasets)
*   Christmas: 31-Dec-10, 30-Dec-11, 28-Dec-12, 27-Dec-13 --> WEEK 52

There are no sales records for the Labor Day holiday in the test dataset, since the holiday is in September and the test period runs only until July.
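The week numbers listed above can be checked with the standard library alone. The sketch below recomputes the ISO week of each listed date; it is only a sanity check, not part of the pipeline. Note that 29-Nov-13 comes out as week 48, which does not affect the analysis because the test data ends in July 2013:

```python
from datetime import date

# Holiday dates as listed in the challenge description.
holiday_dates = {
    "Super Bowl": [date(2010, 2, 12), date(2011, 2, 11), date(2012, 2, 10), date(2013, 2, 8)],
    "Labor Day": [date(2010, 9, 10), date(2011, 9, 9), date(2012, 9, 7), date(2013, 9, 6)],
    "Thanksgiving": [date(2010, 11, 26), date(2011, 11, 25), date(2012, 11, 23), date(2013, 11, 29)],
    "Christmas": [date(2010, 12, 31), date(2011, 12, 30), date(2012, 12, 28), date(2013, 12, 27)],
}

# isocalendar() returns (ISO year, ISO week, ISO weekday); keep the distinct weeks.
iso_weeks = {
    name: sorted({d.isocalendar()[1] for d in dates})
    for name, dates in holiday_dates.items()
}
print(iso_weeks)
```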



In [None]:
holiday_train = train[['Date','Week','Year','IsHoliday']]
holiday_train = holiday_train.loc[holiday_train['IsHoliday']==True].drop_duplicates()

holiday_test = test[['Date','Week','Year','IsHoliday']]
holiday_test = holiday_test.loc[holiday_test['IsHoliday']==True].drop_duplicates()

holidays = pd.concat([holiday_train, holiday_test])
holidays

In order to identify not only if it is holiday, but also which holiday it is, and try to improve the sales volume prediction for these dates, the IsHoliday binary variable was transformed to:

*   0 - if it is not a holiday
*   1 - if the holiday is SuperBowl
*   2 - if the holiday is LaborDay
*   3 - if the holiday is Thanksgiving
*   4 - if the holiday is Christmas

In [None]:
def holiday_type(x):
    # Map each holiday week to its own code; every other row gets 0.
    if x['IsHoliday'] == 1 and x['Week'] == 6:
        return 1  # SuperBowl
    elif x['IsHoliday'] == 1 and x['Week'] == 36:
        return 2  # LaborDay
    elif x['IsHoliday'] == 1 and x['Week'] == 47:
        return 3  # Thanksgiving
    elif x['IsHoliday'] == 1 and x['Week'] == 52:
        return 4  # Christmas
    else:
        return 0

In [None]:
train['IsHoliday'] = train.apply(holiday_type, axis=1)
train['IsHoliday'].unique()

In [None]:
test['IsHoliday'] = test.apply(holiday_type, axis=1)
test['IsHoliday'].unique()
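Row-wise `apply` calls a Python function once per record, which can be slow on several hundred thousand rows. A vectorized sketch of the same encoding is shown below on toy data; `HOLIDAY_BY_WEEK` and `encode_holiday` are illustrative names, not part of the original notebook:

```python
import pandas as pd

# Week of the year -> holiday code (1 SuperBowl, 2 LaborDay, 3 Thanksgiving, 4 Christmas).
HOLIDAY_BY_WEEK = {6: 1, 36: 2, 47: 3, 52: 4}

def encode_holiday(df):
    """Return the holiday code per row; 0 when the row is not flagged as a holiday."""
    codes = df["Week"].map(HOLIDAY_BY_WEEK).fillna(0).astype(int)
    # Keep the code only where the binary flag is set.
    return codes.where(df["IsHoliday"] == 1, 0)

# Toy rows: week 6 appears once flagged and once unflagged.
toy = pd.DataFrame({
    "Week": [6, 6, 36, 47, 52, 10],
    "IsHoliday": [1, 0, 1, 1, 1, 0],
})
toy["HolidayType"] = encode_holiday(toy)
print(toy)
```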


### **3. EXPLORATORY ANALYSIS**

#### **NULLS AND CORRELATIONS**


Between 64% and 74% of the records in the training dataset are null for the MarkDown variables. However, before removing them, we analyze their correlation with the other variables to check their impact on weekly sales.

In [None]:
train = train.replace('None', np.nan)
train = train.replace('NaN', np.nan)
train = train.replace('NaT', np.nan)
train = train.replace('', np.nan)
train_nulls = (train.isnull().sum(axis = 0)/len(train))*100
train_nulls
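The percentage computed above is the mean of the boolean null mask, so the same figure can be obtained more compactly. A toy sketch with made-up values:

```python
import numpy as np
import pandas as pd

# Made-up frame with a mostly empty MarkDown column.
toy = pd.DataFrame({
    "MarkDown1": [np.nan, np.nan, 5.0, np.nan],
    "CPI": [211.1, 211.2, np.nan, 211.4],
    "Size": [151315, 151315, 202307, 202307],
})

# isna() gives True/False per cell; the column mean of that mask is the null fraction.
null_pct = toy.isna().mean() * 100
print(null_pct)
```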

The correlation matrix of the training dataset shows only weak correlations between Weekly_Sales and the other variables. "Size", followed by "Type" and "Dept", appears to have the greatest impact on Weekly_Sales.

In [None]:
plt.figure(figsize=(15, 10))

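# Drop the datetime column explicitly: recent pandas versions no longer skip non-numeric columns in corr()
heatmap = sns.heatmap(train.drop(columns=['Date']).corr(), vmin=-1, vmax=1, annot=True,cmap="Blues",annot_kws={"fontsize":10})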
heatmap.set_title('Correlation Matrix - Train', fontdict={'fontsize':12}, pad=12);

Unlike the train dataset, the test dataset has a much lower percentage of nulls in the MarkDown variables, but about 33% of null records in the CPI and Unemployment variables.

The correlation matrix shows that the "Markdown" variables have a certain correlation with the IsHoliday variable, which can help in predicting. Therefore, these variables will not be eliminated at first.

Null records have been replaced by zero in both datasets.

In [None]:
test = test.replace('None', np.nan)
test = test.replace('NaN', np.nan)
test = test.replace('NaT', np.nan)
test = test.replace('', np.nan)
test_nulls = (test.isnull().sum(axis = 0)/len(test))*100
test_nulls

In [None]:
plt.figure(figsize=(15, 10))

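# Drop the datetime column explicitly: recent pandas versions no longer skip non-numeric columns in corr()
heatmap = sns.heatmap(test.drop(columns=['Date']).corr(), vmin=-1, vmax=1, annot=True,cmap="Blues",annot_kws={"fontsize":10})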
heatmap.set_title('Correlation Matrix - Test', fontdict={'fontsize':12}, pad=12);

In [None]:
train = train.fillna(0)
test = test.fillna(0)

train.isnull().sum()

#### **AVG OF SALES X WEEK X YEAR**

Weekly sales data were grouped by week and year in order to identify the average and median sales per week over the years.

In general, the average values are well above the median, which indicates a right-skewed distribution with high dispersion in sales across stores and departments within a week.

Despite this, there is a certain pattern over the years, with high seasonality at the end of the year.

In [None]:
weekly_sales = train.groupby(['Year','Week']).agg({'Weekly_Sales': ['mean', 'median']})
weekly_sales2010 = train.loc[train['Year']==2010].groupby(['Week']).agg({'Weekly_Sales': ['mean', 'median']})
weekly_sales2011 = train.loc[train['Year']==2011].groupby(['Week']).agg({'Weekly_Sales': ['mean', 'median']})
weekly_sales2012 = train.loc[train['Year']==2012].groupby(['Week']).agg({'Weekly_Sales': ['mean', 'median']})

In [None]:
weekly_sales.plot(figsize=(20,5))
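To see why a mean far above the median points to a skewed distribution, consider a toy week (made-up numbers) in which one department sells far more than the others. The single large value pulls the mean up while the median stays with the typical department:

```python
import pandas as pd

# One week, five departments; the last one is an extreme seller (hypothetical values).
toy = pd.DataFrame({
    "Year": [2010] * 5,
    "Week": [5] * 5,
    "Weekly_Sales": [1000.0, 2000.0, 3000.0, 4000.0, 90000.0],
})

# Selecting the column before agg() yields flat column names ("mean", "median").
summary = toy.groupby(["Year", "Week"])["Weekly_Sales"].agg(["mean", "median"])
gap = summary["mean"] - summary["median"]
print(summary)
print(gap)
```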

The data was also grouped by week separately for each year, in order to compare the weeks of different years. A similar pattern can be seen across the years, with a significant increase in sales in weeks 47 (Thanksgiving) and 51 (the run-up to Christmas). The Super Bowl (week 6) and Labor Day (week 36) holidays have little impact on sales volume.

In [None]:
plt.figure(figsize=(20, 7))

# Pass x and y as keywords: seaborn >= 0.12 no longer accepts them positionally
sns.lineplot(x=weekly_sales2010['Weekly_Sales']['mean'].index, y=weekly_sales2010['Weekly_Sales']['mean'].values)
sns.lineplot(x=weekly_sales2011['Weekly_Sales']['mean'].index, y=weekly_sales2011['Weekly_Sales']['mean'].values)
sns.lineplot(x=weekly_sales2012['Weekly_Sales']['mean'].index, y=weekly_sales2012['Weekly_Sales']['mean'].values)

plt.grid()
plt.xticks(np.arange(1, 53, step=1))
plt.legend(['2010', '2011', '2012'])
plt.show()

#### **STORES X WEEK SALES**


Analyzing the average weekly sales per store, there is a strong variation in sales volume between stores, ranging from 5000 up to 30000.

In [None]:
stores = train.groupby(['Store']).agg({'Weekly_Sales': ['mean']})

plt.figure(figsize=(20, 7))
plt.bar(stores.index,stores['Weekly_Sales']['mean'])
plt.xticks(np.arange(1, 46, step=1))
plt.ylabel('Week Sales', fontsize=16)
plt.xlabel('Store', fontsize=16)
plt.show()

Despite this discrepancy in weekly sales by store, this behavior seems to remain stable over the years. Some stores showed a decrease in sales over the years, such as stores 14, 27, 35 and 36.

In [None]:
stores_sales2010 = train.loc[train['Year']==2010].groupby(['Store']).agg({'Weekly_Sales': ['mean', 'median']})
stores_sales2011 = train.loc[train['Year']==2011].groupby(['Store']).agg({'Weekly_Sales': ['mean', 'median']})
stores_sales2012 = train.loc[train['Year']==2012].groupby(['Store']).agg({'Weekly_Sales': ['mean', 'median']})

plt.figure(figsize=(20, 7))
# Pass x and y as keywords: seaborn >= 0.12 no longer accepts them positionally
sns.lineplot(x=stores_sales2010['Weekly_Sales']['mean'].index, y=stores_sales2010['Weekly_Sales']['mean'].values)
sns.lineplot(x=stores_sales2011['Weekly_Sales']['mean'].index, y=stores_sales2011['Weekly_Sales']['mean'].values)
sns.lineplot(x=stores_sales2012['Weekly_Sales']['mean'].index, y=stores_sales2012['Weekly_Sales']['mean'].values)

plt.xticks(np.arange(1, 46, step=1))
plt.legend(['2010', '2011', '2012'])
plt.ylabel('Week Sales', fontsize=16)
plt.xlabel('Store', fontsize=16)
plt.show()



#### **DEPARTMENT x WEEK SALES**

Weekly sales by department are even more irregular, with department averages ranging from 0 to more than 70000.

In [None]:
departament = train.groupby(['Dept']).agg({'Weekly_Sales': ['mean', 'median']})

plt.figure(figsize=(20, 7))
plt.bar(departament.index,departament['Weekly_Sales']['mean'])
plt.xticks(np.arange(1, 100, step=2))
plt.ylabel('Week Sales', fontsize=16)
plt.xlabel('Department', fontsize=16)
plt.show()



Despite this discrepancy in weekly sales by department, this behavior seems to remain stable over the years. Some departments showed a decrease in sales over the years, such as departments 18, 65 and 73.

In [None]:
departament_sales2010 = train.loc[train['Year']==2010].groupby(['Dept']).agg({'Weekly_Sales': ['mean', 'median']})
departament_sales2011 = train.loc[train['Year']==2011].groupby(['Dept']).agg({'Weekly_Sales': ['mean', 'median']})
departament_sales2012 = train.loc[train['Year']==2012].groupby(['Dept']).agg({'Weekly_Sales': ['mean', 'median']})

plt.figure(figsize=(20, 7))
# Pass x and y as keywords: seaborn >= 0.12 no longer accepts them positionally
sns.lineplot(x=departament_sales2010['Weekly_Sales']['mean'].index, y=departament_sales2010['Weekly_Sales']['mean'].values)
sns.lineplot(x=departament_sales2011['Weekly_Sales']['mean'].index, y=departament_sales2011['Weekly_Sales']['mean'].values)
sns.lineplot(x=departament_sales2012['Weekly_Sales']['mean'].index, y=departament_sales2012['Weekly_Sales']['mean'].values)

plt.xticks(np.arange(1, 100, step=2))
plt.legend(['2010', '2011', '2012'])

plt.ylabel('Week Sales', fontsize=16)
plt.xlabel('Department', fontsize=16)
plt.show()


#### **SIZE x WEEK SALES**

Grouping weekly sales by store size, the chart below seems to indicate a certain trend towards higher sales for larger stores.

However, this relationship is far from linear, with several cases contradicting the trend.

In [None]:
size = train.groupby(['Size']).agg({'Weekly_Sales': ['mean']})

plt.figure(figsize=(20, 7))
plt.plot(size)
#plt.xticks(np.arange(1, 100, step=2))
#plt.show()

plt.ylabel('Week Sales', fontsize=16)
plt.xlabel('Size', fontsize=16)

The pattern of weekly sales by store size seems stable over the years, despite some cases of increase or decrease in sales of stores of the same size from 2010 to 2012.

In [None]:
size_sales2010 = train.loc[train['Year']==2010].groupby(['Size']).agg({'Weekly_Sales': ['mean', 'median']})
size_sales2011 = train.loc[train['Year']==2011].groupby(['Size']).agg({'Weekly_Sales': ['mean', 'median']})
size_sales2012 = train.loc[train['Year']==2012].groupby(['Size']).agg({'Weekly_Sales': ['mean', 'median']})

plt.figure(figsize=(20, 7))
# Pass x and y as keywords: seaborn >= 0.12 no longer accepts them positionally
sns.lineplot(x=size_sales2010['Weekly_Sales']['mean'].index, y=size_sales2010['Weekly_Sales']['mean'].values)
sns.lineplot(x=size_sales2011['Weekly_Sales']['mean'].index, y=size_sales2011['Weekly_Sales']['mean'].values)
sns.lineplot(x=size_sales2012['Weekly_Sales']['mean'].index, y=size_sales2012['Weekly_Sales']['mean'].values)

plt.legend(['2010', '2011', '2012'])
plt.ylabel('Week Sales', fontsize=16)
plt.xlabel('Size', fontsize=16)
plt.show()



#### **TYPE x WEEK SALES**

The 'type' feature also seems to have a certain relationship with Weekly Sales. Type 'A' (transformed to '3') has a higher sales median than types 'B' and 'C', in addition to a greater dispersion of sales values around this median. Type 'C' (transformed to '1') tends to have lower weekly sales.

In [None]:
plt.figure(figsize=(10, 7))
sns.boxplot(x='Type', y='Weekly_Sales', data=train,showfliers = False)

Despite this difference in medians, all three types have many outlier records.

In [None]:
plt.figure(figsize=(15, 7))
sns.boxplot(x='Type', y='Weekly_Sales', data=train,showfliers = True)
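The points drawn as outliers are those beyond the boxplot whiskers, which by default reach 1.5 times the interquartile range past the quartiles. The sketch below applies that rule to a toy array (made-up values); `iqr_outlier_mask` is an illustrative helper, not part of the original notebook:

```python
import numpy as np

def iqr_outlier_mask(values, whis=1.5):
    """Flag values lying more than whis * IQR below Q1 or above Q3."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - whis * iqr) | (values > q3 + whis * iqr)

# Six ordinary sales values and one extreme value.
sales = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 100.0])
mask = iqr_outlier_mask(sales)
print("outliers:", sales[mask])
```

Applied per store type, the share of flagged records would quantify what the second boxplot only shows visually.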

### **4. EVALUATION FUNCTION**

The challenge evaluation is based on Weighted Mean Absolute Error (WMAE), with a weight of 5 for Holiday Weeks and 1 otherwise. A function was created to evaluate the model considering these criteria. 

In [None]:
sample_weight = train['IsHoliday'].apply(lambda x: 1 if x==0 else 5)
sample_weight_frame = pd.DataFrame(sample_weight, index=train.index)

In [None]:
from sklearn.metrics import make_scorer

def WMAE(y_test, y_pred):
    # Align the predictions with the true values by index
    y_pred_df = pd.DataFrame(y_pred, index=y_test.index)

    # Indices of the evaluated rows that are holiday weeks (weight 5) and regular weeks (weight 1)
    weights = sample_weight_frame.loc[y_test.index, 'IsHoliday']
    weights_5 = weights[weights == 5].index
    weights_1 = weights[weights == 1].index

    # Weighted sum of absolute errors for each group
    sum_5 = np.sum(5 * abs(y_test.loc[weights_5].values - y_pred_df.loc[weights_5].values))
    sum_1 = np.sum(abs(y_test.loc[weights_1].values - y_pred_df.loc[weights_1].values))

    return np.round((sum_5 + sum_1) / (5 * len(weights_5) + len(weights_1)), 2)

# greater_is_better=False makes scikit-learn negate the score, since a lower WMAE is better
my_score = make_scorer(WMAE, greater_is_better=False)
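The metric is the sum of weighted absolute errors divided by the sum of the weights, with weight 5 for holiday weeks and 1 otherwise. A compact, vectorized sketch of the same formula follows; it takes the holiday flag as an explicit argument instead of looking it up by index, and the function name `wmae` and the toy numbers are illustrative only:

```python
import numpy as np

def wmae(y_true, y_pred, is_holiday):
    """Weighted MAE: weight 5 where is_holiday is non-zero, 1 elsewhere."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    weights = np.where(np.asarray(is_holiday).ravel() != 0, 5.0, 1.0)
    return float(np.sum(weights * np.abs(y_true - y_pred)) / np.sum(weights))

# Absolute errors are 10, 20 and 30; only the second record is a holiday week.
example = wmae([100.0, 200.0, 300.0], [110.0, 180.0, 330.0], [0, 1, 0])
print(example)  # (1*10 + 5*20 + 1*30) / (1 + 5 + 1) = 20.0
```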

### **5. MODEL TRAINING**



#### **5.1. TRAINING WITH ALL FEATURES**

The models were initially fitted with all the features of the train dataset. In order to select the best regression algorithm, Random Search was applied to some of the main regression algorithms. The RandomForestRegressor algorithm obtained the best result.



In [None]:
train_all = train.drop(['Date'],axis=1)
train_all

In [None]:
y_train_all = train_all.loc[:, ['Weekly_Sales']]
x_train_all = train_all.drop(['Weekly_Sales'], axis=1)

In [None]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x_train_all, y_train_all, test_size=0.2, random_state=0)

print(x_train.shape)
print(x_test.shape)

In [None]:
#RandomForest, ExtraTrees, XGB
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.metrics import make_scorer
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.neighbors import KNeighborsRegressor
from xgboost import XGBRegressor

clf = RandomForestRegressor(random_state=0)
pca = PCA()

pipe = Pipeline(steps=[('clf', clf)])

param_grid = [ {
                'clf':[RandomForestRegressor()],
                'clf__n_estimators': [50,100,150],
                'clf__max_depth': [10,20,30]
                },
               
                {
                'clf': [ExtraTreesRegressor()],
                'clf__n_estimators': [50,100,150],
                'clf__max_depth': [10,20,30]
                },
               
                {
                'clf': [XGBRegressor()],  
                'clf__learning_rate':[0.1,0.05],
                'clf__min_child_weight':[5,7,9],  # XGBoost has no min_samples_split; min_child_weight is its closest analogue
                'clf__max_depth':[10,20,30]
                }
              ]

rscv_all_tree = RandomizedSearchCV(pipe, param_grid, cv = 3, scoring = my_score, n_jobs=-1)
model_all_tree = rscv_all_tree.fit(x_train, y_train)

In [None]:
rscv_all_tree.best_estimator_

In [None]:
y_pred = rscv_all_tree.best_estimator_.predict(x_test)
print('WMAE:', WMAE(y_test, y_pred))
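`RandomizedSearchCV` evaluates only `n_iter` sampled candidates, which defaults to 10 when it is not set, as in the search above. The stdlib sketch below counts how many combinations the three blocks of `param_grid` contain, to give a sense of how much of the space those 10 candidates cover; the dictionary simply restates the number of values per hyperparameter:

```python
from math import prod

# Number of candidate values per hyperparameter in each block of param_grid.
values_per_param = {
    "RandomForestRegressor": [3, 3],  # n_estimators, max_depth
    "ExtraTreesRegressor": [3, 3],    # n_estimators, max_depth
    "XGBRegressor": [2, 3, 3],        # three hyperparameters with 2, 3 and 3 values
}

sizes = {name: prod(counts) for name, counts in values_per_param.items()}
total = sum(sizes.values())
n_iter = 10  # RandomizedSearchCV default
coverage = n_iter / total
print(sizes, total, round(coverage, 2))
```

With `cv = 3`, the default setting therefore fits 30 models and leaves most combinations untested; raising `n_iter` trades run time for coverage.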

#### **5.2. TRAINING WITH MAIN FEATURES**

In an attempt to obtain even better results in the prediction, models were also trained only with the features of greatest impact in the "Weekly Sales", based on the correlation matrix.

Therefore, the features with the highest correlation ("Size", "Type" and "Dept") were used to train these models, in addition to "IsHoliday", needed to calculate the evaluation metric and the features "Store", "Week" and "Year", essential for identifying the record and future prediction.

The results show that the models using only the most relevant features perform better than the models using all variables.

In [None]:
plt.figure(figsize=(15, 10))

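# Drop the datetime column explicitly: recent pandas versions no longer skip non-numeric columns in corr()
heatmap = sns.heatmap(train.drop(columns=['Date']).corr(), vmin=-1, vmax=1, annot=True,cmap="Blues",annot_kws={"fontsize":10})
heatmap.set_title('Correlation Matrix', fontdict={'fontsize':12}, pad=12);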

In [None]:
train_relevant = train.drop(['Date','Temperature','Fuel_Price','MarkDown1','MarkDown2','MarkDown3','MarkDown4','MarkDown5','CPI','Unemployment'],axis=1)
train_relevant

In [None]:
y_relevant = train_relevant.loc[:, ['Weekly_Sales']]
x_relevant = train_relevant.drop(['Weekly_Sales'], axis=1)

In [None]:
from sklearn.model_selection import train_test_split

x_train_relevant, x_test_relevant, y_train_relevant, y_test_relevant = train_test_split(x_relevant, y_relevant, test_size=0.2, random_state=0)

print(x_train_relevant.shape)
print(x_test_relevant.shape)

In [None]:
#RandomForest, ExtraTrees, XGB
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.metrics import make_scorer
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.neighbors import KNeighborsRegressor
from xgboost import XGBRegressor
import numpy as np

clf = RandomForestRegressor(random_state=0)
pca = PCA()

pipe = Pipeline(steps=[('clf', clf)])

param_grid = [ {
                'clf':[RandomForestRegressor()],
                'clf__n_estimators': [50,100,150],
                'clf__max_depth': [10,20,30]
                },
               
                {
                'clf': [ExtraTreesRegressor()],
                'clf__n_estimators': [50,100,150],
                'clf__max_depth': [10,20,30]
                },
               
                {
                'clf': [XGBRegressor()],  
                'clf__learning_rate':[0.1,0.05],
                'clf__min_child_weight':[5,7,9],  # XGBoost has no min_samples_split; min_child_weight is its closest analogue
                'clf__max_depth':[10,20,30]
                }
              ]

rscv_relevant_tree = RandomizedSearchCV(pipe, param_grid, cv = 3, scoring = my_score, n_jobs=-1)
model_relevant_tree = rscv_relevant_tree.fit(x_train_relevant, y_train_relevant)

In [None]:
rscv_relevant_tree.best_estimator_

In [None]:
y_pred= rscv_relevant_tree.best_estimator_.predict(x_test_relevant)
print('WMAE:', WMAE(y_test_relevant, y_pred))

#### **5.3. HYPERPARAMETERS TUNING**




Since the model that obtained the best result was the Random Forest algorithm trained with the most relevant features, we will tune some hyperparameters to try to obtain even better results.

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.metrics import make_scorer
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
clf = RandomForestRegressor(random_state=0)

pipe = Pipeline(steps=[('clf', clf)])

param_grid_rf = [ {
                'clf':[RandomForestRegressor()],
                'clf__n_estimators': [140,150,160],
                'clf__max_depth': [25,30,35],
                'clf__max_features': [1.0,5,6]  # 1.0 = all features, which is what 'auto' meant; 'auto' was removed in scikit-learn 1.3
                }
              ]

gscv_rf1 = GridSearchCV(pipe, param_grid_rf, cv = 3, scoring = my_score, n_jobs=-1)
model_rf1 = gscv_rf1.fit(x_train_relevant, y_train_relevant)

In [None]:
gscv_rf1.best_estimator_

In [None]:
y_pred_rf = gscv_rf1.best_estimator_.predict(x_test_relevant)
print('WMAE:', WMAE(y_test_relevant, y_pred_rf))

"Dept" and "Size" seem to be the most important features for the model. Although "IsHoliday" is used to weight the evaluation metric, the large majority of records are not holiday weeks, so the small proportion of holiday sales records was not as decisive for forecasting weekly sales.

In [None]:
plt.rcParams["figure.figsize"] = (5,3)

importances = gscv_rf1.best_estimator_.named_steps['clf'].feature_importances_

attributes = list(x_train_relevant.columns)
indices = np.argsort(importances)
attributes_rank = []
for i in indices:
    attributes_rank.append(attributes[i])
plt.title('Feature Importances')
plt.tight_layout()
plt.barh(range(len(indices)), importances[indices], color='gray', align='center')
plt.yticks(range(len(indices)), attributes_rank, fontsize=5)
plt.xlabel('Relative Importance',fontsize=5)
plt.xticks(color='k', size=15)
plt.yticks(color='k', size=15)
plt.xlim([0.0, 0.25])
plt.show()

### **6. SUBMISSION PREDICTIONS**

Finally, we use the training model with the lowest WMAE score to predict the test dataset values. 

In [None]:
date = test['Date']
test = test.drop(['Date'], axis=1)

In [None]:
test_relevant = test.drop(['Temperature','Fuel_Price','MarkDown1','MarkDown2','MarkDown3','MarkDown4','MarkDown5','CPI', 'Unemployment'],axis=1)
test_relevant = test_relevant.sort_values(['Store', 'Dept'], ascending=[True, True])
y_pred_rf = gscv_rf1.best_estimator_.predict(test_relevant)

In [None]:
test_relevant['Date'] = date
test_relevant = test_relevant.sort_values(['Store', 'Dept'], ascending=[True, True])
test_relevant['Weekly_Sales'] = y_pred_rf
test_relevant

Plotting the weekly sales average of the training dataset next to the predictions for the test dataset suggests that the forecasts are consistent: they reproduce both the overall pattern of the data and its seasonal component.

In [None]:
test = test_relevant

weekly_sales_train = train.groupby(['Year','Week']).agg({'Weekly_Sales': ['mean']}).reset_index()
weekly_sales_test = test.groupby(['Year','Week']).agg({'Weekly_Sales': ['mean']}).reset_index()

indices = weekly_sales_train.shape[0] + weekly_sales_test['Weekly_Sales'].index 
plt.figure(figsize=(20, 7))
# Pass x and y as keywords: seaborn >= 0.12 no longer accepts them positionally
sns.lineplot(x=weekly_sales_train['Weekly_Sales'].index, y=weekly_sales_train['Weekly_Sales']['mean'], color='gray')
sns.lineplot(x=indices, y=weekly_sales_test['Weekly_Sales']['mean'], color='red')


In [None]:
plt.figure(figsize=(20, 7))

weekly_sales2010 = train.loc[train['Year']==2010].groupby(['Week']).agg({'Weekly_Sales': ['mean']})
weekly_sales2011 = train.loc[train['Year']==2011].groupby(['Week']).agg({'Weekly_Sales': ['mean']})
weekly_sales2012 = train.loc[train['Year']==2012].groupby(['Week']).agg({'Weekly_Sales': ['mean']})
weekly_sales2012_test = test.loc[test['Year']==2012].groupby(['Week']).agg({'Weekly_Sales': ['mean']})
weekly_sales2013_test = test.loc[test['Year']==2013].groupby(['Week']).agg({'Weekly_Sales': ['mean']})

# Pass x and y as keywords: seaborn >= 0.12 no longer accepts them positionally
sns.lineplot(x=weekly_sales2010['Weekly_Sales']['mean'].index, y=weekly_sales2010['Weekly_Sales']['mean'].values, color='gray')
sns.lineplot(x=weekly_sales2011['Weekly_Sales']['mean'].index, y=weekly_sales2011['Weekly_Sales']['mean'].values, color='gray')
sns.lineplot(x=weekly_sales2012['Weekly_Sales']['mean'].index, y=weekly_sales2012['Weekly_Sales']['mean'].values, color='gray')
sns.lineplot(x=weekly_sales2012_test['Weekly_Sales']['mean'].index, y=weekly_sales2012_test['Weekly_Sales']['mean'].values, color='red')
sns.lineplot(x=weekly_sales2013_test['Weekly_Sales']['mean'].index, y=weekly_sales2013_test['Weekly_Sales']['mean'].values, color='red')

plt.grid()
plt.xticks(np.arange(1, 53, step=1))
plt.legend(['2010', '2011', '2012','2012 test', '2013 test'])
plt.show()

### **7. SUBMISSION**

After prediction, we prepare the file with the results to submit it for Kaggle evaluation and check the final score.

In [None]:
sampleSubmission = pd.read_csv('../input/walmart-recruiting-store-sales-forecasting/sampleSubmission.csv.zip', sep=',')

In [None]:
sampleSubmission['Weekly_Sales'] = y_pred_rf
sampleSubmission.to_csv('submission.csv',index=False)
sampleSubmission
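Assigning `y_pred_rf` positionally relies on the sorted test rows being in exactly the same order as the rows of sampleSubmission. A more defensive sketch is to join on the identifier instead. The example below uses toy data and assumes that the template's `Id` column follows the `Store_Dept_Date` pattern; that format is an assumption and should be checked against the actual file:

```python
import pandas as pd

# Toy predictions, deliberately out of order (hypothetical values).
preds = pd.DataFrame({
    "Store": [1, 1, 2],
    "Dept": [2, 1, 1],
    "Date": pd.to_datetime(["2012-11-02", "2012-11-02", "2012-11-09"]),
    "Weekly_Sales": [200.0, 100.0, 300.0],
})

# Toy stand-in for sampleSubmission, with the assumed Id format.
template = pd.DataFrame({
    "Id": ["1_1_2012-11-02", "1_2_2012-11-02", "2_1_2012-11-09"],
    "Weekly_Sales": [0, 0, 0],
})

# Build the same identifier on the prediction side.
preds["Id"] = (
    preds["Store"].astype(str) + "_"
    + preds["Dept"].astype(str) + "_"
    + preds["Date"].dt.strftime("%Y-%m-%d")
)

# A left join keeps the template's row order; validate guards against duplicate Ids.
submission = template[["Id"]].merge(
    preds[["Id", "Weekly_Sales"]], on="Id", how="left", validate="one_to_one"
)
print(submission)
```

Any `Id` left without a prediction shows up as NaN in `Weekly_Sales`, which makes an ordering or coverage problem visible before the file is written.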