**<center><font size = 6 >Rossmann Future Sales</font></center>**

**<font size = 5> Contents </font><br>
<br>
<font size = 4 color = '#4740B9'> 1. Data Cleaning </font><br>
<font size = 4 color = '#4740B9'> 2. Random Forest </font><br>
<font size = 4 color = '#4740B9'> 3. Neural Network </font><br>
<font size = 4 color = '#4740B9'> 4. Graphic </font><br>
<font size = 4 color = '#4740B9'> 5. Time Series Analysis </font><br>
<font size = 4 color = '#4740B9'> 6. Result </font>**

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input/rossmann-store-sales'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
train = pd.read_csv('/kaggle/input/rossmann-store-sales/train.csv')
test = pd.read_csv('/kaggle/input/rossmann-store-sales/test.csv')
store = pd.read_csv('/kaggle/input/rossmann-store-sales/store.csv')
sample = pd.read_csv('/kaggle/input/rossmann-store-sales/sample_submission.csv')

***

# 1. Data Cleaning

==> I'm going to use the **store data** to get more features for predicting **Sales**.

In [None]:
train.head()

In [None]:
store.head()

It only makes sense to look at the days on which a store is open.

In [None]:
train_open = train[train.Open == 1].copy()  # copy, so later column assignments do not act on a view of train

I'm going to use the following features from store: <br>
**"StoreType, Assortment, CompetitionDistance, Promo2"**

In [None]:
features = ["Store", "StoreType", "Assortment", "CompetitionDistance", "Promo2"]
store_selected = store[features]

We need to check whether there are **null values** in both datasets (train and store).

In [None]:
train_open.info() # As you can see there is no null value in the train set

In [None]:
store_selected.info() 
#There are Null Values in "CompetitionDistance"

### Merging Train and Store

In [None]:
together = train_open.merge(store_selected, on='Store')  # join on the shared Store ID
together = together.drop(columns = ['Date']) # we don't need the Date column

In [None]:
together.info()
#There are Null Values in "CompetitionDistance"

In [None]:
# the number of null values
together.CompetitionDistance.isnull().sum()

Here we have to decide how to deal with these **null values**, or with the **feature** that contains them.

==> **Possible Choices**<br>
<br>
<font color = 'red'>1. mean value </font><br>
<font color = 'blue'>2. median value </font><br>
<font color = 'blue'>3. removing rows with null values from the dataset (dropna) </font><br>
<font color = 'blue'>4. removing features that have null values </font><br>
<font color = 'blue'>5. imputing missing values</font>

For this dataset, I will take the first option, because only a few rows have null values.<br>
If we dropped those rows, we would also have to drop the corresponding rows in the test set, and then we could not predict future sales for the stores that have no **CompetitionDistance**.<br>
For convenience, I use the mean value.

In [None]:
mean_CD = round(together.CompetitionDistance.mean(),0)
mean_CD

In [None]:
# fill the missing distances with the mean (avoids chained assignment)
together['CompetitionDistance'] = together['CompetitionDistance'].fillna(mean_CD)

In [None]:
together.info()
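
A minimal sketch of the imputation step on made-up numbers (the frames `toy_train` and `toy_test` and their values are hypothetical, not taken from the Rossmann data). It computes the fill value once on the training rows and reuses it for the test rows, and it shows the median next to the mean for comparison:

```python
import pandas as pd

# Hypothetical toy data: store 3 has no CompetitionDistance in either frame
toy_train = pd.DataFrame({"Store": [1, 2, 3, 4],
                          "CompetitionDistance": [100.0, 300.0, None, 2000.0]})
toy_test = pd.DataFrame({"Store": [3, 4],
                         "CompetitionDistance": [None, 2000.0]})

# Statistics are computed on the training rows only (missing values are skipped)
mean_cd = round(toy_train["CompetitionDistance"].mean(), 0)
median_cd = toy_train["CompetitionDistance"].median()

# The same fill value is applied to both frames, so no row has to be dropped
toy_train["CompetitionDistance"] = toy_train["CompetitionDistance"].fillna(mean_cd)
toy_test["CompetitionDistance"] = toy_test["CompetitionDistance"].fillna(mean_cd)

print(mean_cd, median_cd)
```

In this toy example the single distant competitor pulls the mean well above the median, which is why the median is often considered when the distances are skewed.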

### Splitting: X and y

In [None]:
together.columns

In [None]:
X = together[['DayOfWeek', 'Open', 'Promo', 'Customers',
       'StateHoliday', 'SchoolHoliday', 'StoreType', 'Assortment', ## note: the test set has no 'Customers' feature
       'CompetitionDistance', 'Promo2']]
y = together[['Sales']]

Now we have X and y. <br>
But we still have to check whether all the data are ready to be used for prediction.

Reset the index:

In [None]:
X = X.reset_index(drop=True)
y = y.reset_index(drop=True)

In [None]:
X.StateHoliday.unique() 
# here we have both 0 and '0'. They mean the same thing, so we convert the integer 0 to the string '0'

In [None]:
X.loc[X.StateHoliday == 0, 'StateHoliday'] = '0'
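
A small illustration of why the mixed labels matter, on a hypothetical series (`labels` is made up for this sketch): the integer `0` and the string `'0'` count as two different categories until they are brought to one type.

```python
import pandas as pd

# Hypothetical column mixing the integer 0 with string labels
labels = pd.Series([0, "0", "a", 0, "b"])

categories_before = labels.nunique()             # 0 and "0" are counted separately
categories_after = labels.astype(str).nunique()  # after casting, they collapse into "0"

print(categories_before, categories_after)
```

Without this step, a one-hot encoder would see the two spellings as separate categories.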

### OneHot Encoding

In [None]:
from sklearn.preprocessing import OneHotEncoder

I want to use one-hot encoding **for the features DayOfWeek, StateHoliday, StoreType and Assortment**. <br>
Why? These are categorical variables: their values are labels rather than quantities, so they should not be fed to a model as plain numbers.

In [None]:
X.columns

In [None]:
X_OH =  X[['DayOfWeek','StateHoliday','StoreType','Assortment']] # for OneHot Encoding
X_rest = X[['Customers', 'Promo','SchoolHoliday','CompetitionDistance', 'Promo2']] # we don't need open because it is always 1

In [None]:
OHencoder = OneHotEncoder(handle_unknown='ignore')
OH_result = pd.DataFrame(OHencoder.fit_transform(X_OH).toarray())

In [None]:
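# get_feature_names was removed in scikit-learn 1.2; get_feature_names_out needs scikit-learn >= 1.0
OH_result.columns = OHencoder.get_feature_names_out(['DayOfWeek','StateHoliday','StoreType','Assortment'])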

In [None]:
OH_result.head()

In [None]:
X_final = pd.concat([X_rest,OH_result],axis = 1)

### We now have X (all features) ready to be used for prediction!

In [None]:
X_final

In [None]:
X_final.info()

## The Last Step: Train and Valid

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_valid, y_train, y_valid = train_test_split(X_final, y, test_size=0.33, random_state=42)

***

# 2. Random Forest

In [None]:
from sklearn.ensemble import RandomForestRegressor

Step: Training

In [None]:
rf = RandomForestRegressor(n_estimators=50, max_depth=8, random_state=0)
rf.fit(X_train, y_train.values.ravel())  # pass a 1-D target to avoid a DataConversionWarning

In [None]:
rf.score(X_valid, y_valid)

In [None]:
X

In [None]:
predicted_RF = pd.DataFrame(rf.predict(X_valid))
X_ML = together.iloc[X_valid.index,:]
X_ML = X_ML.reset_index(drop=True)

In [None]:
All_RF = pd.concat([X_ML,predicted_RF], axis = 1)
All_RF = All_RF.rename(columns={0: "Predicted"})
All_RF

In [None]:
All_RF[['Sales','Predicted']] ## to compare

### Calculating MASE

In [None]:
from sklearn.metrics import mean_absolute_error

In [None]:
mean_absolute_error(All_RF.Sales, All_RF.Predicted)

In [None]:
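# "MASE" as used in this notebook: the MAE divided by the number of validation rows.
# Note: the standard MASE instead divides the MAE by the in-sample MAE of a naive forecast.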
MASE_RF = mean_absolute_error(All_RF.Sales, All_RF.Predicted)/len(All_RF.Sales)
MASE_RF
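
For comparison, a minimal sketch of the standard MASE definition: the MAE of the forecast divided by the in-sample MAE of a one-step naive forecast (each value predicted by the previous one). The function `mase` and the numbers are hypothetical, and the definition assumes time-ordered data, which `together` no longer has after the Date column was dropped, so this is an illustration only:

```python
import numpy as np

def mase(y_true, y_pred, y_insample):
    """MAE of the forecast scaled by the in-sample MAE of a one-step naive forecast."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    y_insample = np.asarray(y_insample, dtype=float)

    mae_forecast = np.mean(np.abs(y_true - y_pred))
    # Naive forecast: predict each in-sample value with its predecessor
    mae_naive = np.mean(np.abs(np.diff(y_insample)))
    return mae_forecast / mae_naive

# Hypothetical time-ordered history and a two-step forecast
history = [1, 3, 5, 7]
score = mase(y_true=[9, 11], y_pred=[8, 13], y_insample=history)
print(score)
```

A value below 1 means the forecast has a smaller average error than the naive forecast had on the history.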

***

# 3. Neural Network

In [None]:
from sklearn.neural_network import MLPRegressor

In [None]:
NN = MLPRegressor(hidden_layer_sizes=(30,30,30),max_iter=30)
NN.fit(X_train, y_train.values.ravel())  # pass a 1-D target to avoid a DataConversionWarning

In [None]:
NN.score(X_valid, y_valid)

In [None]:
predicted_NN = pd.DataFrame(NN.predict(X_valid))

In [None]:
All_NN = pd.concat([X_ML,predicted_NN], axis = 1)
All_NN = All_NN.rename(columns={0: "Predicted"})
All_NN

In [None]:
All_NN[['Sales','Predicted']] ## to compare

### Calculating MASE


In [None]:
mean_absolute_error(All_NN.Sales, All_NN.Predicted)

In [None]:
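# "MASE" as used in this notebook: the MAE divided by the number of validation rows.
# Note: the standard MASE instead divides the MAE by the in-sample MAE of a naive forecast.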
MASE_NN = mean_absolute_error(All_NN.Sales, All_NN.Predicted)/len(All_NN.Sales)
MASE_NN

***

# 4. Graphic: Weekly Comparison of Random Forest and Neural Network

<font size = 4>==> **RandomForest** </font>

In [None]:
week_mean_RF = All_RF.groupby('DayOfWeek').agg({'Sales':'mean','Predicted':'mean'})
week_mean_RF['Difference'] = week_mean_RF['Sales'] - week_mean_RF['Predicted']
week_mean_RF
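
A toy version of this weekly aggregation (the frame `toy` and its values are hypothetical). It also computes the mean absolute error per day, since a difference of means can look small when over- and under-predictions cancel each other out:

```python
import pandas as pd

# Hypothetical actual and predicted sales for two weekdays
toy = pd.DataFrame({"DayOfWeek": [1, 1, 2, 2],
                    "Sales": [100.0, 200.0, 300.0, 500.0],
                    "Predicted": [110.0, 170.0, 320.0, 500.0]})

week_mean = toy.groupby("DayOfWeek").agg({"Sales": "mean", "Predicted": "mean"})
# Positive difference: the model under-predicts on average for that day
week_mean["Difference"] = week_mean["Sales"] - week_mean["Predicted"]

# Mean absolute error per day, which does not let errors cancel
abs_error = (toy["Sales"] - toy["Predicted"]).abs()
week_mean["MAE"] = abs_error.groupby(toy["DayOfWeek"]).mean()
print(week_mean)
```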

In [None]:
plt.figure(figsize = (12,8))
sns.barplot(x = week_mean_RF.index, y = week_mean_RF.Sales, color = 'red', alpha = 0.3, label = 'Sales')
sns.barplot(x = week_mean_RF.index, y = week_mean_RF.Predicted, color = 'blue', alpha = 0.1, label = 'Predicted')
plt.legend()
plt.title('Random Forest weekly Comparison')

In [None]:
plt.figure(figsize = (12,8))
sns.barplot(x = week_mean_RF.index, y = week_mean_RF.Difference, color = 'gray')
plt.title('Difference Random Forest')

<font size = 4>==> **Neural Network** </font>

In [None]:
week_mean_NN = All_NN.groupby('DayOfWeek').agg({'Sales':'mean','Predicted':'mean'})
week_mean_NN['Difference'] = week_mean_NN['Sales'] - week_mean_NN['Predicted']
week_mean_NN

In [None]:
plt.figure(figsize = (12,8))
sns.barplot(x = week_mean_NN.index, y = week_mean_NN.Sales, color = 'red', alpha = 0.3, label = 'Sales')
sns.barplot(x = week_mean_NN.index, y = week_mean_NN.Predicted, color = 'blue', alpha = 0.1, label = 'Predicted')
plt.legend()
plt.title('Neural Network weekly Comparison')

In [None]:
plt.figure(figsize = (12,8))
sns.barplot(x = week_mean_NN.index, y = week_mean_NN.Difference, color = 'gray')
plt.title('Difference Neural Network')

In [None]:
print('MASE calculated by Random Forest: ', MASE_RF)
print('MASE calculated by Neural Network: ', MASE_NN)

***

# 5. Time Series Analysis for Customers in Test Set

As you can see, the test set has no data for the feature **Customers**.

In [None]:
test.head()

The feature **Customers** is the most important feature for predicting **Sales**, so I would like to forecast **Customers** with an **ARIMA model (time series analysis)**.

In [None]:
train_open.head()

In [None]:
# assign() returns a new frame, so this does not write into a slice of train
train_open = train_open.assign(Date=pd.to_datetime(train_open.Date))

We need to split the data by Store ID to predict the number of Customers, because the number of Customers differs from store to store.

In [None]:
train_for_ts = train_open[['Store','Date','Customers']]

In [None]:
# Splitting based on Store Number: one time series per store
ts_stores = {store_id: g.reset_index(drop=True)
             for store_id, g in train_for_ts.groupby('Store')}

## ARIMA Model

In [None]:
ts_stores[1]

In [None]:
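# Note: this is the older statsmodels API, which is no longer usable from statsmodels 0.13 on.
# Newer versions provide statsmodels.tsa.arima.model.ARIMA, whose forecast() returns only the point forecasts.
from statsmodels.tsa.arima_model import ARIMA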

### ARIMA Model (1,0,2)
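
For reference, `order=(1,0,2)` as passed to `ARIMA` in the code means one autoregressive term, no differencing and two moving-average terms. In the usual textbook notation (the symbols $c$, $\phi_1$, $\theta_1$, $\theta_2$ and $\varepsilon_t$ are standard names, not identifiers from this notebook, and libraries may parametrize the constant differently), the number of customers $y_t$ on day $t$ is modelled as:

$$y_t = c + \phi_1 y_{t-1} + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2}$$

Here $\varepsilon_t$ is the unpredictable error on day $t$, so the forecast depends on the previous day's value and on the last two forecast errors.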

In [None]:
ARIMA_model = ARIMA(ts_stores[1].Customers, order=(1,0,2))
ARIMA_model_fit = ARIMA_model.fit()

In [None]:
ARIMA_predicted = ARIMA_model_fit.predict()

In [None]:
gr = pd.concat([ts_stores[1],ARIMA_predicted],axis = 1)
gr = gr.rename(columns={0: "Predicted"})

In [None]:
gr

In [None]:
# To check whether ARIMA Model works well
plt.figure(figsize = (16,8))

sns.lineplot(x = gr.Date[0:360], y = gr.Customers[0:360], label = 'Customers')
sns.lineplot(x = gr.Date[0:360], y = gr.Predicted[0:360], label = 'Predicted')

In [None]:
# Fit one ARIMA model per store and collect the in-sample predictions
frames = []

for index, value in ts_stores.items():

    ARIMA_model = ARIMA(value.Customers, order=(1,0,2))
    ARIMA_model_fit = ARIMA_model.fit()
    ARIMA_predicted = ARIMA_model_fit.predict()

    tmp = pd.concat([value, ARIMA_predicted], axis = 1)
    tmp = tmp.rename(columns={0: "Predicted"})

    frames.append(tmp)

# DataFrame.append was removed in pandas 2.0; concatenate once at the end instead
result_ARIMA = pd.concat(frames, ignore_index=True)

In [None]:
result_ARIMA.info()

In [None]:
result_ARIMA

In [None]:
len(test[ (test.Store == 1) & (test.Open == 1)])
## we have to forecast 41 days!

In [None]:
ARIMA_model = ARIMA(ts_stores[1].Customers, order=(1,0,2))
ARIMA_model_fit = ARIMA_model.fit()
ARIMA_predicted = ARIMA_model_fit.predict()
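# with this (older) API, forecast() returns (forecast, stderr, conf_int); index 0 holds the point forecasts
ARIMA_forecast = ARIMA_model_fit.forecast(41)[0]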

In [None]:
tmp = pd.concat([ts_stores[1],ARIMA_predicted],axis = 1)
tmp = tmp.rename(columns={0: "Predicted"})

In [None]:
tmp

In [None]:
from datetime import datetime

In [None]:
times = pd.date_range(start="2015-08-01",end="2015-09-17")
weekday = times.weekday
times = pd.DataFrame(times)
times = times.rename(columns={0: "Date"})
weekday = pd.DataFrame(weekday)
weekday = weekday.rename(columns={0: "Open"})

In [None]:
x = pd.concat([times,weekday],axis = 1)
# weekday 6 is Sunday: treat Sundays as closed (0) and all other days as open (1)
x['Open'] = (x['Open'] != 6).astype(int)
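
A quick cross-check of the 41-day horizon using only the standard library. It relies on the same assumption as the notebook, namely that stores are closed on Sundays and open on every other day in the forecast period:

```python
from datetime import date, timedelta

start, end = date(2015, 8, 1), date(2015, 9, 17)

# Every calendar day in the forecast period, both ends included
days = [start + timedelta(days=n) for n in range((end - start).days + 1)]

# date.weekday() counts Monday as 0, so 6 is Sunday
open_days = [d for d in days if d.weekday() != 6]

print(len(days), len(open_days))
```

The count of open days matches the number of rows found for store 1 in the test set.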

In [None]:
ARIMA_forecast = pd.DataFrame(ARIMA_forecast)
ARIMA_forecast = ARIMA_forecast.rename(columns={0: "Forecast"})

In [None]:
base = x[x['Open'] == 1]
base = base.reset_index(drop=True)

In [None]:
# one row per (store, open day): 1115 stores x 41 days
f = pd.DataFrame([[i, ii] for i in range(1, 1116) for ii in range(1, 42)])

In [None]:
f

In [None]:
base2 = base.copy()

In [None]:
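# Repeat the 41 open days once per store (1115 stores).
# Concatenating base2 with itself in a loop would double it on every pass and exhaust memory.
base2 = pd.concat([base] * 1115, axis = 0, ignore_index=True)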
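
One possible way to build the store-by-day grid in a single step is a Cartesian product. This is a sketch on hypothetical inputs (3 stores and 3 dates instead of 1115 stores and 41 open days; `store_ids`, `open_dates` and `grid` are illustrative names):

```python
import pandas as pd

# Hypothetical small inputs standing in for the real store IDs and open days
store_ids = range(1, 4)
open_dates = pd.to_datetime(["2015-08-01", "2015-08-03", "2015-08-04"])

# Every combination of store and date, as an ordinary two-column frame
grid = pd.MultiIndex.from_product([store_ids, open_dates],
                                  names=["Store", "Date"]).to_frame(index=False)
print(grid)
```

Each store then has one row per open day, which is the shape needed to attach a per-store forecast.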

In [None]:
base2.head(45)

In [None]:
g = pd.concat([ base,ARIMA_forecast ],axis = 1)
g = pd.concat([g, ])