<img src="https://storage.googleapis.com/kaggle-competitions/kaggle/3816/logos/front_page.png">
## Walmart Recruiting - Store Sales Forecasting
https://www.kaggle.com/c/walmart-recruiting-store-sales-forecasting

### Challenge
One challenge of modeling retail data is the need to make decisions based on limited history. If Christmas comes but once a year, so does the chance to see how strategic decisions impacted the bottom line. 

This challenge uses historical sales data for 45 Walmart stores located in different regions. Each store contains many departments, and participants **must project the sales for each department in each store**. To add to the challenge, selected holiday markdown events are included in the dataset. These markdowns are known to affect sales, but it is challenging to predict which departments are affected and the extent of the impact.


### Promotional Markdowns

These are discounts that derive from any type of promotional sale such as a temporary price reduction, circular promotion, coupons, endcap promotions and more.

### Consumer Price Index – [CPI](https://www.investopedia.com/terms/c/consumerpriceindex.asp)

#### What Is the Consumer Price Index – CPI?
The Consumer Price Index (CPI) is a measure that examines the weighted average of prices of a basket of consumer goods and services, such as transportation, food, and medical care. It is calculated by taking price changes for each item in the predetermined basket of goods and averaging them. Changes in the CPI are used to assess price changes associated with the cost of living; the CPI is one of the most frequently used statistics for identifying periods of inflation or deflation.
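
As a rough illustration of the weighted-average idea, the sketch below computes a price index for a made-up three-item basket. The items, weights, and prices are hypothetical and are not real CPI inputs; the official methodology is considerably more involved.

```python
# Hypothetical basket: expenditure weights sum to 1; prices for a base and a current period.
basket = {
    "food":      {"weight": 0.5, "base_price": 100.0, "current_price": 104.0},
    "transport": {"weight": 0.3, "base_price": 50.0,  "current_price": 55.0},
    "medical":   {"weight": 0.2, "base_price": 200.0, "current_price": 202.0},
}

def price_index(basket):
    """Weighted average of price relatives, scaled so the base period equals 100."""
    total = sum(
        item["weight"] * item["current_price"] / item["base_price"]
        for item in basket.values()
    )
    return 100.0 * total

index = price_index(basket)
inflation_pct = index - 100.0  # percent change relative to the base period
print(round(index, 2), round(inflation_pct, 2))
```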

#### How the CPI Is Used
CPI is widely used as an economic indicator. It is the most widely used measure of inflation and, by proxy, of the effectiveness of the government’s economic policy. The CPI gives the government, businesses, and citizens an idea about price changes in the economy, and can act as a guide for making informed decisions about the economy.


### Dataset

Walmart runs several promotional markdown events throughout the year. These markdowns precede prominent holidays, the four largest of which are the Super Bowl, Labor Day, Thanksgiving, and Christmas. In the competition's evaluation, the weeks that include these holidays are weighted five times higher than non-holiday weeks. Part of the challenge is modeling the effect of markdowns on these holiday weeks without complete or ideal historical data.
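
The holiday weighting can be sketched as a weighted mean absolute error. This is a minimal sketch that assumes the evaluation is a weighted MAE with weight 5 on holiday weeks and 1 elsewhere; the function name `wmae` and the `holiday_weight` parameter are illustrative, not taken from the competition code.

```python
import numpy as np

def wmae(y_true, y_pred, is_holiday, holiday_weight=5.0):
    """Weighted mean absolute error: holiday weeks count `holiday_weight` times."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    weights = np.where(np.asarray(is_holiday, dtype=bool), holiday_weight, 1.0)
    return float(np.sum(weights * np.abs(y_true - y_pred)) / np.sum(weights))

# Toy example: the second week is a holiday week, so its error counts five times.
score = wmae([100, 200, 300], [110, 180, 300], [False, True, False])
print(score)
```

Using a metric like this during validation keeps the local score closer to what the leaderboard rewards than a plain RMSE would.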

- features.csv
- sampleSubmission.csv
- stores.csv
- test.csv
- train.csv

#### **stores.csv**

This file contains anonymized information about the 45 stores, indicating the type and size of store.

#### **train.csv**

This is the historical training data, which covers 2010-02-05 to 2012-11-01. Within this file you will find the following fields:

- Store - the store number
- Dept - the department number
- Date - the week
- Weekly_Sales -  sales for the given department in the given store
- IsHoliday - whether the week is a special holiday week

#### **test.csv**

This file is identical to train.csv, except we have withheld the weekly sales. You must predict the sales for each triplet of store, department, and date in this file.
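
A small sketch of how each (store, department, date) triplet maps to a submission row. It assumes the `Id` format is `Store_Dept_Date`, which is inferred from the way this notebook later splits `Id` on `_`; the toy rows are made up.

```python
import pandas as pd

# Toy stand-in for test.csv (values are made up).
test_toy = pd.DataFrame({
    "Store": [1, 1, 2],
    "Dept": [1, 2, 1],
    "Date": ["2012-11-02", "2012-11-02", "2012-11-09"],
})

def make_submission_id(df):
    """Build one Id per row as '<Store>_<Dept>_<Date>' (assumed format)."""
    return df["Store"].astype(str) + "_" + df["Dept"].astype(str) + "_" + df["Date"].astype(str)

test_toy["Id"] = make_submission_id(test_toy)
# Recover the date part as the third "_"-separated field.
test_toy["Date_from_Id"] = test_toy["Id"].str.split("_").str[2]
print(test_toy)
```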

#### **features.csv**

This file contains additional data related to the store, department, and regional activity for the given dates. It contains the following fields:

- Store - the store number
- Date - the week
- Temperature - average temperature in the region
- Fuel_Price - cost of fuel in the region
- MarkDown1-5 - anonymized data related to promotional markdowns that Walmart is running. MarkDown data is only available after Nov 2011, and is not available for all stores all the time. Any missing value is marked with an NA.
- CPI - the consumer price index
- Unemployment - the unemployment rate
- IsHoliday - whether the week is a special holiday week

#### **Holidays**

For convenience, the four holidays fall within the following weeks in the dataset (not all holidays are in the data):

- Super Bowl: 12-Feb-10, 11-Feb-11, 10-Feb-12, 8-Feb-13
- Labor Day: 10-Sep-10, 9-Sep-11, 7-Sep-12, 6-Sep-13
- Thanksgiving: 26-Nov-10, 25-Nov-11, 23-Nov-12, 29-Nov-13
- Christmas: 31-Dec-10, 30-Dec-11, 28-Dec-12, 27-Dec-13

#### Support
- https://www.kaggle.com/abefukasawa/walmart-recruiting-draft


In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

### <span style="color: blue">Importing Datasets</span>

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.stattools import acf, pacf
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error as mse
from sklearn.preprocessing import StandardScaler

%matplotlib inline

In [None]:
stores = pd.read_csv('/kaggle/input/walmart-recruiting-store-sales-forecasting/stores.csv', dtype={"Type": "category"})
features = pd.read_csv('/kaggle/input/walmart-recruiting-store-sales-forecasting/features.csv.zip')
train = pd.read_csv('/kaggle/input/walmart-recruiting-store-sales-forecasting/train.csv.zip')

In [None]:
stores.head()

In [None]:
stores.info()

In [None]:
features.head()

In [None]:
features.info()

In [None]:
train.head()

In [None]:
train.info()

### <span style="color: blue">Auxiliary Functions</span>

In [None]:
# Set Date as Datetime
def set_datetime_column(df):
    df["Datetime"] = pd.to_datetime(df.Date, format='%Y-%m-%d')
    df.drop("Date", axis=1, inplace=True)
    df.rename(columns={"Datetime" : "Date"}, inplace=True)
    return df

# Split Date to Year, Month, Day
def split_datetime_info(df):
    df["Year"] = df.Date.dt.year
    df["Month"] = df.Date.dt.month
    df["Day"] = df.Date.dt.day
    return df

# Create Numerical Ordinal Type column
def create_num_ordinary_type_column(df):
    di = {"A": 3, "B": 2, "C": 1}
    df["TypeInt"] = df.Type.map(di)
    return df

# Join Dataframes informations
def join_dataframe_columns(base_df, how="inner"):
    df = base_df.copy()
    df = df.merge(right=stores, how=how, on="Store")
    df = df.merge(right=features.drop("IsHoliday", axis=1), how=how, on=["Date", "Store"])
    return df

def set_information_lag(df, lag_range= 7):
    for i in range(1, (lag_range + 1)):
        df["lag_{}".format(i)] = df.Weekly_Sales.shift(i)
    return df

def get_train_test_size(df_len, test_size=0.3):
    test = round(df_len * test_size)
    train = df_len - test
    return train, test

def get_train_test_ndarray(df, train_len):
    train = df.iloc[0:train_len, :].values
    test = df.iloc[train_len:, :].values
    return train, test

def get_train_test_df(df, train_len):
    train = df.iloc[0:train_len, :]
    test = df.iloc[train_len:, :]
    return train, test

def train_test_split_time_series(df, test_size=0.3, ndarrayType=True):
    train_len, _ = get_train_test_size(len(df), test_size)
    if ndarrayType == True:
        return get_train_test_ndarray(df, train_len)
    else:
        return get_train_test_df(df, train_len)

def split_X_y_ndarray(ndarray):
    X = ndarray[:, 2:]
    y = ndarray[:, 1]
    return X, y

def split_X_y_df(df):
    try:
        df = df.set_index("Date")
    except:
        pass
    X = df.drop("Weekly_Sales",axis=1).iloc[:, 1:]
    y = df["Weekly_Sales"].values
    return X, y

def rmse(y_test, y_hat):
    result = mse(y_test, y_hat)**0.5
    print("RMSE", result)
    return result

def add_sqr_foot_sales(df):
    df["SquareFoot_Sales"] = df.Weekly_Sales / df.Size
    return df

def drop_sqr_foot_sales(df):
    df.drop(columns=["SquareFoot_Sales"], inplace=True)
    
def drop_type(df):
    df.drop(columns=["Type"], inplace=True)
    
def plot_lag(df, lag_num=0):
    if (lag_num == 0):
        plt.plot(df.Date, df.Weekly_Sales)
    else:
        plt.plot(df.Date, df[F"lag_{lag_num}"])
        
def lag_graph(df, lag_range= 3):
    # Plot the series together with its first lag_range lags
    plt.figure(figsize=(20, 10))
    plot_lag(df)
    for i in range(1, (lag_range + 1)):
        plot_lag(df, i)
        
def plot_prediction_result(df, y_test, y_hat):
    # Plot true vs. predicted values over the test period (the tail of df)
    plt.figure(figsize=(20, 7))
    offset = len(df) - len(y_test)
    plt.plot(df.Date[offset:], y_test)
    plt.plot(df.Date[offset:], y_hat)
    
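def update_dataframe_lag(df, value, lag_range=7, offset=0):
    # Write the prediction into Weekly_Sales (column 0) of the current row
    # and into lag_i (column i) of the i-th following row
    for i in range(0, lag_range + 1):
        try:
            df.iloc[i + offset, i] = value
        except IndexError: # rows past the end of the dataframe
            pass   
        
def predict_df_sales(df, model):
    # Recursive forecast: each prediction feeds the lags of the following weeks
    for i in range(len(df)):   
        y_hat = model.predict([df.iloc[i, 1:].to_numpy()])
        update_dataframe_lag(df, y_hat[0], offset=i)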
        
def plot_dow_boxplot(df, side: int, title: str):
    sns.set(style="whitegrid")
    plt.figure(figsize=(10,3))
    plt.title(title)
    ax = sns.boxplot(x=df["Weekly_Sales"])
    ax.axis(xmin=4*10000000,xmax=6.5*10000000)
    
def apply_dickey_fuller_stationary_test(df):
    print('Results of the Dickey Fuller Test')
    dftest = adfuller(x = df['Weekly_Sales'], autolag= 'AIC')
    dfoutput = pd.Series(dftest[0:4], index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])
    print(dfoutput)
    for key,value in dftest[4].items():
        print('Critical Value ({}) = {}'.format(key,value))

def remove_variance_df(df):
    try:
        df = df.set_index("Date")
    except:
        pass
    df_log = np.log(df) # Penalizes the positive heteroscedasticity
    plt.figure(figsize=(18, 7))
    plt.plot(df_log['Weekly_Sales'], linewidth = 3)
    return df_log

def remove_tendency_df(df):
    try:
        df = df.set_index("Date")
    except:
        pass
    dfTimeShift = df.shift()
    dfDiffShift = df - dfTimeShift 
    plt.figure(figsize=(18, 7))
    plt.plot(dfDiffShift, linewidth = 3)
    return dfDiffShift

def apply_holiday_progression(df, regression_len=4, _type="exp", column_name='Holiday_Progression'):
    # Ramp from 1 up to regression_len over the weeks leading into each holiday week
    holidays = df[df.IsHoliday == True].Date.unique()
    df[column_name] = 0
    days = df.Date.unique()
    for date in holidays:
        idx = np.where(days == date)[0][0]
        for i in range(0, regression_len):
            if idx - i < 0: # a negative index would wrap around to the last weeks
                break
            df.loc[df.Date == days[idx - i], column_name] = regression_len - i
    if (_type == "exp"):
        # "exp" squares the ramp, so it grows quadratically toward the holiday
        df[column_name] = df[column_name] ** 2
        
    return df

def drop_holiday_progression(df, column_name='Holiday_Progression'):
    df.drop(columns=[column_name], inplace=True)

def plot_scatter(df, independent, dependent="Weekly_Sales"):
    plt.figure(figsize=(18, 10))
    _ = df.groupby(independent)[dependent].sum().reset_index()
    # Keyword arguments: recent seaborn versions no longer accept x and y positionally
    sns.scatterplot(x=_[independent], y=_[dependent], alpha=0.7)
    del _
    
def plot_hist(df, independent, dependent="Weekly_Sales", bins=20):
    plt.figure(figsize=(18, 10))
    sns.distplot(df.groupby(independent)[dependent].sum().reset_index(), bins=bins)
    
def drop_markdowns(df):
    for i in range(1, 6):
        df.drop(columns=F"MarkDown{i}", inplace=True)
        
def set_prophet_requirements(df):
    df = set_datetime_column(df)
    df.rename(columns={"Date": "ds", "Weekly_Sales": "y"}, inplace=True)
    #df = df.set_index('ds')
    return df

def fill_markdowns(df, how="zero"):
    # fillna returns a new Series, so the result must be assigned back
    for i in range(1, 6):
        col = F"MarkDown{i}"
        if how == "zero":
            df[col] = df[col].fillna(0)
        elif how == "median":
            df[col] = df[col].fillna(df[col].median())
        elif how == "mean":
            df[col] = df[col].fillna(df[col].mean())
    return df

### <span style="color: blue">Exploratory Data Analysis (EDA)</span>

#### Possibilities

- I can rank (encode) store type as ordinals 
- sales per square foot as an indicator of sales performance
- lags. How many?
- holidays
- Holidays progression
- DOW (Day of Week)
- sales last month (window of 4)
- get year-month-day
- get sales same week last year
- get statistical data from windows
- what is more important: raw size or log size?
- how does fuel price impact weekly sales?
- how does temperature impact weekly sales?
- what is the relation between CPI and sales?
- cluster CPI
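
A few of the ideas above (window statistics, sales in the same week of the previous year) can be sketched on a toy weekly series. This is only a sketch: the column names `roll_mean_4`, `roll_std_4`, and `same_week_last_year` are hypothetical, and the toy sales values are a simple counter rather than real data.

```python
import numpy as np
import pandas as pd

# Toy weekly series on Fridays, like the dataset's weekly dates.
weeks = pd.date_range("2010-02-05", periods=60, freq="W-FRI")
sales = pd.DataFrame({"Date": weeks, "Weekly_Sales": np.arange(60, dtype=float)})

def add_window_features(df, window=4, year_lag=52):
    """Add rolling statistics over past weeks and the value from `year_lag` weeks ago."""
    out = df.copy()
    past = out["Weekly_Sales"].shift(1)  # only use weeks strictly before the current one
    out[f"roll_mean_{window}"] = past.rolling(window).mean()
    out[f"roll_std_{window}"] = past.rolling(window).std()
    out["same_week_last_year"] = out["Weekly_Sales"].shift(year_lag)
    return out

windowed = add_window_features(sales)
print(windowed.tail())
```

Shifting by one week before rolling keeps the current week's sales out of its own features, which matters once these columns are used for training.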

In [None]:
stores.describe().T

In [None]:
features.describe().T

In [None]:
df = join_dataframe_columns(train)
df.head()

In [None]:
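plt.figure(figsize=(15, 15))
corr = df.select_dtypes(include=["number", "bool"]).corr() # skip Date and Type, which are not numeric
sns.heatmap(corr, annot=True, cmap='Blues',
        xticklabels=corr.columns,
        yticklabels=corr.columns)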

### Type

In [None]:
# How Store Type matters to Weekly Sales
_type = df.groupby("Store").Weekly_Sales.sum().reset_index().merge(stores, on="Store")
_type.head()
plt.title("Type X Weekly_Sales")
plt.scatter(_type.Type, _type.Weekly_Sales, alpha=0.5)

In [None]:
# How Store Size matters to define Store's Type
plt.title("Type X Size")
plt.scatter(_type.Type, _type.Size, alpha=0.5)

In [None]:
# How Store Size matters to define Store's Type
plt.figure(figsize=(17, 10))
plt.title("Type X Log(Size)")
sns.violinplot(x=_type.Type, y=np.log(_type.Size))
del _type

**Type Conclusion**

Type seems to matter for Weekly_Sales, but the types overlap in both Store Size and Weekly Sales. Some outliers need to be treated before encoding Type as an ordinal.
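
One possible way to find such outliers is an interquartile-range rule applied to Size within each Type. This is a sketch on made-up stores, not the treatment used in this notebook; the `SizeOutlier` column and the factor `k=1.5` are assumptions.

```python
import pandas as pd

# Made-up stores: store 4 is labelled type A but is far smaller than the other A stores.
stores_toy = pd.DataFrame({
    "Store": [1, 2, 3, 4, 5, 6, 7, 8],
    "Type":  ["A", "A", "A", "A", "B", "B", "B", "B"],
    "Size":  [200000, 205000, 210000, 40000, 110000, 115000, 120000, 125000],
})

def flag_size_outliers(df, k=1.5):
    """Flag stores whose Size falls outside [Q1 - k*IQR, Q3 + k*IQR] of their own Type."""
    grouped = df.groupby("Type")["Size"]
    q1 = grouped.transform(lambda s: s.quantile(0.25))
    q3 = grouped.transform(lambda s: s.quantile(0.75))
    iqr = q3 - q1
    return (df["Size"] < q1 - k * iqr) | (df["Size"] > q3 + k * iqr)

stores_toy["SizeOutlier"] = flag_size_outliers(stores_toy)
print(stores_toy)
```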

______

### Size


In [None]:
# How Size matters to Weekly Sales
plt.figure(figsize=(17, 10))

_ = df.groupby('Size').Weekly_Sales.median().reset_index()

m, b = np.polyfit(np.log(_.Size),_.Weekly_Sales, 1) # slope and intercept of best fit line
plt.plot(np.log(_.Size), m*np.log(_.Size) + b, color="g") # plot best fit line

plt.scatter(np.log(_.Size), _.Weekly_Sales) # plot scatter

In [None]:
# Sales per square foot is a good indicator of sales performance?

df = add_sqr_foot_sales(df)
_sf_sales = df.groupby(['Date', 'Store']).SquareFoot_Sales.sum().reset_index()
_sf_sales = _sf_sales.merge(df.groupby(['Date', 'Store']).Weekly_Sales.sum().reset_index(), on=["Store", "Date"])
_sf_sales.head()
plt.scatter(_sf_sales.SquareFoot_Sales, _sf_sales.Weekly_Sales, alpha=0.7)
drop_sqr_foot_sales(df)
del _sf_sales, _

**Size Conclusion**

Size has a linear correlation with Weekly Sales. Sales per square foot has a linear but "clustered" correlation: despite its heteroscedasticity, it seems to have a positive trend within separate clusters.

_____

### Lag

In [None]:
# Plotting the ACF (Autocorrelation Function) to determine the relevant lags

_temp = df.groupby(["Date"]).Weekly_Sales.sum().reset_index()
_temp = set_datetime_column(_temp)
_temp = _temp.set_index('Date')

fig, ax = plt.subplots(figsize=(15,10)) # Increase plot size
fig = sm.graphics.tsa.plot_acf(_temp.values.squeeze(), lags=60, ax=ax) # shows ACF test result
ax.set_xticks(range(0,60, 2)) # change X axis ticks to show every 2 numbers
fig.show() # show figure

In [None]:
# Plotting the PACF (Partial Autocorrelation Function) to determine the seasonal lags

fig, ax = plt.subplots(figsize=(15,10)) # Increase plot size
fig = sm.graphics.tsa.plot_pacf(_temp.values.squeeze(), lags=60, ax=ax) # shows PACF result
ax.set_xticks(range(0,60, 2)) # change X axis ticks to show every 2 numbers
fig.show() # show figure


In [None]:
_temp = set_information_lag(_temp, 60)
plt.figure(figsize=(15, 15))
sns.heatmap(_temp.corr(), cmap="Blues"); # Manual ACF
del _temp

**Lag Conclusion**

The most relevant lags were:
- ACF (1, 2, 5, 52)
- PACF (1, 5, 37, 47, 48, 49, 50 ,51, 52) 
- Pearson's Lag Corr. (1, 2, 3, 4, 48 ~ 56)

The annual seasonality is really strong, as we can see at lag 52 (the same week of the previous year).
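
A minimal sketch of turning a chosen set of lags into features with `shift`. The default lag set `(1, 2, 5, 52)` mirrors the ACF list above; the helper name `add_selected_lags` and the toy series are assumptions.

```python
import numpy as np
import pandas as pd

def add_selected_lags(df, lags=(1, 2, 5, 52), target="Weekly_Sales"):
    """Add one lag_<k> column per selected lag instead of a contiguous range."""
    out = df.copy()
    for lag in lags:
        out[f"lag_{lag}"] = out[target].shift(lag)
    return out

# Toy series: two years of weekly values 0, 1, 2, ...
weekly = pd.DataFrame({"Weekly_Sales": np.arange(104, dtype=float)})
lagged = add_selected_lags(weekly).dropna()
print(lagged.head())
```

Note that a 52-week lag leaves the first year without a value, so on a series this short roughly half of the rows are lost after `dropna()`.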

___

### Markdown (1 - 5)

In [None]:
features.info()

**Markdown Conclusion**

The `features.info()` output shows that too many MarkDown values are missing. My first decision is to drop the MarkDown columns for now, and then come back later to analyse them in more depth.
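
The share of missing values per column can be quantified directly instead of being read off `info()`. A small sketch on a made-up frame (the values below are not from features.csv):

```python
import numpy as np
import pandas as pd

# Made-up frame with the same kind of gaps as the MarkDown columns.
markdown_toy = pd.DataFrame({
    "Store": [1, 1, 1, 1],
    "MarkDown1": [np.nan, np.nan, np.nan, 5000.0],
    "MarkDown2": [np.nan, np.nan, 120.0, 80.0],
    "CPI": [211.1, 211.2, 211.3, 211.4],
})

def missing_share(df):
    """Fraction of missing values per column, largest first."""
    return df.isna().mean().sort_values(ascending=False)

share = missing_share(markdown_toy)
print(share)
```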

___

### Holidays

In [None]:
_holi = df[df.IsHoliday == True].groupby(["Date"]).Weekly_Sales.sum().reset_index()
_common = df[df.IsHoliday == False].groupby(["Date"]).Weekly_Sales.sum().reset_index()

In [None]:
# Plotting Holiday Weekly_Sales against Common Days Weekly_Sales

plot_dow_boxplot(_holi, 1, "Holidays") # Plot boxplot (holidays)
plot_dow_boxplot(_common, 2, "Common Days") # Plot boxplot (Common days)

In [None]:
_holi.Date.unique()

In [None]:
df = apply_holiday_progression(df)
df.head()

In [None]:
# Analysing Holiday Progression with Weekly_Sales
_temp = df.groupby(["Date", "Holiday_Progression"]).Weekly_Sales.sum().reset_index()
_temp.head()
plt.figure(figsize=(18, 10))
plt.title("Holiday Progression (4 Weeks)")
sns.scatterplot(x=_temp.Holiday_Progression, y=_temp.Weekly_Sales, alpha=0.5)

In [None]:
# Analysing Holiday Progression with Weekly_Sales
df = apply_holiday_progression(df, 3)
df.head()
_temp = df.groupby(["Date", "Holiday_Progression"]).Weekly_Sales.sum().reset_index()
_temp.head()
plt.figure(figsize=(18, 10))
plt.title("Holiday Progression (3 Weeks)")
sns.scatterplot(x=_temp.Holiday_Progression, y=_temp.Weekly_Sales, alpha=0.5)

In [None]:
# Cleaning data
drop_holiday_progression(df)
del _holi, _common, _temp

#### Current Holidays

- Super Bowl: 12-Feb-10, 11-Feb-11, 10-Feb-12, 8-Feb-13
- Labor Day: 10-Sep-10, 9-Sep-11, 7-Sep-12, 6-Sep-13
- Thanksgiving: 26-Nov-10, 25-Nov-11, 23-Nov-12, 29-Nov-13
- Christmas: 31-Dec-10, 30-Dec-11, 28-Dec-12, 27-Dec-13

The box plots of holiday weeks and common weeks show that holiday weeks tend to have higher sales, so IsHoliday is a relevant feature. The holiday progression shows that the increase in sales is not centered on the holiday week itself: depending on the holiday, sales peak either in the holiday week or in the weeks before it.
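
One way to check where the peak falls is to average sales by the number of weeks remaining until the next holiday. This is a sketch on made-up numbers, constructed so that the peak lands one week before the holiday; `WeeksToHoliday` is a hypothetical column name.

```python
import pandas as pd

# Made-up weekly totals with a holiday every fourth week.
weekly_toy = pd.DataFrame({
    "Weekly_Sales": [40.0, 42.0, 55.0, 48.0, 41.0, 43.0, 60.0, 52.0],
    "IsHoliday":    [False, False, False, True, False, False, False, True],
})

def weeks_until_holiday(is_holiday):
    """Weeks from each row to the next holiday row (0 on the holiday itself)."""
    distances = []
    next_holiday = None
    # Walk backwards so the most recently seen holiday is the next one in time.
    for pos in range(len(is_holiday) - 1, -1, -1):
        if is_holiday.iloc[pos]:
            next_holiday = pos
        distances.append(None if next_holiday is None else next_holiday - pos)
    return pd.Series(distances[::-1], index=is_holiday.index, dtype="float")

weekly_toy["WeeksToHoliday"] = weeks_until_holiday(weekly_toy["IsHoliday"])
profile = weekly_toy.groupby("WeeksToHoliday")["Weekly_Sales"].mean()
print(profile)
```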

___

### Stationarity

In [None]:
# grouping all week sales on dataframe
_ = train.groupby("Date").Weekly_Sales.sum().reset_index()
plt.figure(figsize=(20, 10))
plt.plot(_.Date, _.Weekly_Sales)

In [None]:
# Rolling Statistics
rolmean = _['Weekly_Sales'].rolling(window=4).mean() # Mean of the current and the 3 previous values (window size 4)
rolstd = _['Weekly_Sales'].rolling(window=4).std()

plt.figure(figsize=(18, 7))
plt.plot(_['Weekly_Sales'], linewidth = 2, label = 'Weekly_Sales')
plt.plot(rolmean, linewidth = 2, label = 'Rolling Mean', color = 'r')
plt.plot(rolstd, linewidth = 2, label = 'Rolling Std Dev', color = 'k')
plt.legend(loc = 'best')
plt.title('Rolling Mean and Standard Deviation')

In [None]:
apply_dickey_fuller_stationary_test(_)

In [None]:
# Applying variance correction to time series
_ = remove_variance_df(_)

In [None]:
# Applying tendency correction to time series
_ = remove_tendency_df(_)

In [None]:
apply_dickey_fuller_stationary_test(_.dropna())
del _

**Stationarity Conclusion**

According to the Augmented Dickey-Fuller test, our time series is not stationary. Although it shows no trend and no relevant change in variance, it did not pass the test even after the stationarity corrections (log transform and differencing).
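
The two corrections (a log transform for the variance, then a first difference for the trend) can be combined in one step. The sketch below applies them to a synthetic series with exponential growth, where the result is a constant; the helper name `log_difference` and the toy series are assumptions.

```python
import numpy as np
import pandas as pd

def log_difference(series):
    """Log transform (stabilises variance) followed by a first difference (removes trend)."""
    return np.log(series).diff().dropna()

# Synthetic series growing 2% per step in log terms: 1000 * exp(0.02 * t).
t = np.arange(50)
toy_sales = pd.Series(1000.0 * np.exp(0.02 * t))
stationary_toy = log_difference(toy_sales)
print(stationary_toy.head())
```

On real sales the differenced series will not be constant, and strong yearly seasonality may call for an additional seasonal difference (for example `diff(52)`), which is not attempted here.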

___

### Fuel

In [None]:
_ = df.groupby("Fuel_Price").Weekly_Sales.sum().reset_index()
sns.scatterplot(x=_.Fuel_Price, y=_.Weekly_Sales, alpha=0.7)

In [None]:
sns.distplot(_.Fuel_Price, bins=20)

In [None]:
sns.distplot(np.log(_.Fuel_Price), bins=20)
del _

**Fuel Conclusion**

The graphs show that fuel price does not seem to correlate directly with weekly sales: the scatter plot shows a random dispersion of weekly sales over fuel prices. The fuel price distribution is not Gaussian, but seems to be a mixture of two Gaussian distributions.

___

### Temperature

In [None]:
_ = df.groupby("Temperature").Weekly_Sales.sum().reset_index()
sns.scatterplot(x=_.Temperature, y=_.Weekly_Sales, alpha=0.7)

In [None]:
sns.distplot(_.Temperature, bins=20)
del _

**Temperature Conclusion**

The graphs show that sales increase as temperature increases. The scatter plot is heteroscedastic, but it shows a correlation between the two variables.

___

### CPI

In [None]:
plot_scatter(df, "CPI")

In [None]:
plot_hist(df, "CPI")

**CPI Conclusion**

Lower CPI values are correlated with higher Weekly_Sales. There are also some clusters of CPI values that can be separated easily.
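
If the CPI clusters are separated by wide gaps, a simple one-dimensional split is enough to label them. A sketch under that assumption: the helper `cluster_by_gaps`, the `min_gap` threshold, and the toy CPI values are all hypothetical.

```python
import pandas as pd

def cluster_by_gaps(values, min_gap):
    """Label 1-D values by cutting wherever two sorted neighbours differ by more than min_gap."""
    s = pd.Series(values, dtype=float)
    ordered = s.sort_values()
    # Every gap larger than min_gap starts a new cluster; cumsum turns cuts into labels.
    labels_sorted = (ordered.diff() > min_gap).cumsum()
    return labels_sorted.reindex(s.index).astype(int)

cpi_toy = [211.1, 126.5, 189.4, 127.0, 212.3, 190.1]
cpi_cluster = cluster_by_gaps(cpi_toy, min_gap=10)
print(cpi_cluster.tolist())
```

Labels are ordered from the lowest-CPI cluster (0) upward, so the column can be used as an ordinal feature.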

___

### Unemployment

In [None]:
plot_scatter(df, "Unemployment")

In [None]:
plot_hist(df, "Unemployment")

**Unemployment Conclusion**

Unemployment seems to have a weak correlation with Weekly_Sales. Even when unemployment is high, weekly sales do not show a decreasing trend.
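
The visual impressions for Fuel_Price, Temperature, CPI, and Unemployment can be backed by a number, for example the Pearson correlation of each feature with sales. A sketch on made-up values, built so that Temperature is perfectly positively correlated and CPI perfectly negatively correlated; it says nothing about the real dataset.

```python
import pandas as pd

def feature_correlations(df, target="Weekly_Sales",
                         features=("Fuel_Price", "Temperature", "CPI", "Unemployment")):
    """Pearson correlation of each external feature with the target, sorted ascending."""
    return df[list(features)].corrwith(df[target]).sort_values()

# Made-up values for illustration only.
toy = pd.DataFrame({
    "Weekly_Sales": [10.0, 12.0, 14.0, 16.0, 18.0],
    "Fuel_Price":   [3.0, 2.5, 3.1, 2.6, 3.0],
    "Temperature":  [40.0, 50.0, 60.0, 70.0, 80.0],
    "CPI":          [220.0, 215.0, 210.0, 205.0, 200.0],
    "Unemployment": [8.0, 7.9, 8.1, 8.0, 8.0],
})
corrs = feature_correlations(toy)
print(corrs)
```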

___

### Prophet

In [None]:
!pip install fbprophet

In [None]:
# Super Bowl: 12-Feb-10, 11-Feb-11, 10-Feb-12, 8-Feb-13
# Labor Day: 10-Sep-10, 9-Sep-11, 7-Sep-12, 6-Sep-13
# Thanksgiving: 26-Nov-10, 25-Nov-11, 23-Nov-12, 29-Nov-13
# Christmas: 31-Dec-10, 30-Dec-11, 28-Dec-12, 27-Dec-13

laborday = pd.DataFrame({
  'holiday': 'laborday',
  'ds': pd.to_datetime(['2010-09-10', '2011-09-09', '2012-09-07', '2013-09-06']),
  'lower_window': 0,
  'upper_window': 1,
})
superbowls = pd.DataFrame({
  'holiday': 'superbowl',
  'ds': pd.to_datetime(['2010-02-12', '2011-02-11', '2012-02-10', '2013-02-08']),
  'lower_window': 0,
  'upper_window': 1,
})
thanksgiving = pd.DataFrame({
  'holiday': 'thanksgiving',
  'ds': pd.to_datetime(['2010-11-26', '2011-11-25', '2012-11-23', '2013-11-29']),
  'lower_window': 0,
  'upper_window': 1,
})
xmas = pd.DataFrame({
  'holiday': 'christmas',
  'ds': pd.to_datetime(['2010-12-31', '2011-12-30', '2012-12-28', '2013-12-27']),
  'lower_window': 0,
  'upper_window': 1,
})
holidays = pd.concat((laborday, superbowls, thanksgiving, xmas))

In [None]:
from fbprophet import Prophet

m = Prophet(holidays=holidays, interval_width=0.95)
_temp = df.groupby("Date").Weekly_Sales.sum().reset_index()
_temp = set_prophet_requirements(_temp)
m.fit(_temp)
_temp.head()

In [None]:
future = m.make_future_dataframe(periods=365, freq='d', include_history = True)
forecast = m.predict(future)
forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].head()

In [None]:
fig1 = m.plot(forecast)

In [None]:
fig2 = m.plot_components(forecast)

In [None]:
sample = pd.read_csv('/kaggle/input/walmart-recruiting-store-sales-forecasting/sampleSubmission.csv.zip')
sample['Date'] = sample.Id.apply(lambda x: x.split('_')[2])
sample = set_datetime_column(sample)
final = sample.drop('Weekly_Sales', axis=1).merge(forecast[['ds', 'yhat']], how='left',left_on='Date', right_on='ds').drop('Date', axis=1)
final.drop(columns='ds', inplace=True)
final.rename(columns={'yhat': 'Weekly_Sales'}, inplace=True)
final.to_csv('walmart-submission.csv', index=False)
final.tail()

___

### Clusters

#### Stores

#### Dept

___

### <span style="color:blue">Feature Engineering</span>

### <span style="color:blue">Training</span>

In [None]:
df = set_datetime_column(df)
df = split_datetime_info(df)
df = create_num_ordinary_type_column(df)
df.head()

### <span style="color: blue" >Benchmark Baseline (Lag Based - Random Forest Regressor)</span>

Basic Time Series (Lag-Based) prediction for benchmark baseline
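
Because the test period has no observed sales, a lag-based model has to forecast recursively: each prediction becomes the newest lag for the next week. The sketch below shows that loop in isolation. `recursive_forecast` is a hypothetical helper, and `mean_of_lags` is a stand-in for a fitted model, not the Random Forest used in this notebook.

```python
import numpy as np

def recursive_forecast(history, predict_fn, n_lags, horizon):
    """Forecast `horizon` steps ahead, feeding each prediction back in as the newest lag."""
    window = list(history[-n_lags:])
    forecasts = []
    for _ in range(horizon):
        # Most recent value first, matching the lag_1, lag_2, ... column order.
        x = np.array(window[::-1], dtype=float)
        y_hat = float(predict_fn(x))
        forecasts.append(y_hat)
        window = window[1:] + [y_hat]  # drop the oldest value, append the prediction
    return forecasts

# Hypothetical stand-in for a fitted model: the mean of the lag features.
mean_of_lags = lambda x: x.mean()
preds = recursive_forecast([10.0, 20.0, 30.0], mean_of_lags, n_lags=3, horizon=2)
print(preds)
```

Errors compound in this scheme, since later predictions are built on earlier ones, so accuracy usually degrades as the horizon grows.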

In [None]:
# grouping all week sales on dataframe
df_sales = train.groupby("Date").Weekly_Sales.sum().reset_index()
df_sales.head()

In [None]:
plt.figure(figsize=(20, 10))
plt.plot(df_sales.Date, df_sales.Weekly_Sales)

In [None]:
# Applying Information Lag
df_sales = set_information_lag(df_sales)
df_sales.head(10)

In [None]:
# Lag Visualization
lag_graph(df_sales)

In [None]:
# Random Forest
model = RandomForestRegressor()

# Date stays as column 0: split_X_y_ndarray expects Weekly_Sales in column 1 and the lags after it
first_weekly_sales = df_sales.iloc[0, 1]
df_sales.fillna(first_weekly_sales, inplace=True) # fill the NaN lags with the first week's sales
nd_train, nd_test = train_test_split_time_series(df_sales)
X_train, y_train = split_X_y_ndarray(nd_train)

model.fit(X_train, y_train)

In [None]:
X_test, y_test = split_X_y_ndarray(nd_test)
y_hat = model.predict(X_test)
rmse(y_test, y_hat) # RMSE
plt.scatter(y_test, y_hat)

In [None]:
# Benchmark baseline result for comparison
plot_prediction_result(df_sales, y_test, y_hat)

### Prediction

In [None]:
sample = pd.read_csv('/kaggle/input/walmart-recruiting-store-sales-forecasting/sampleSubmission.csv.zip')
test = pd.read_csv('/kaggle/input/walmart-recruiting-store-sales-forecasting/test.csv.zip')

In [None]:
test.head()

In [None]:
sample.head()

In [None]:
# Setting initial lag on test dataframe

test = join_dataframe_columns(test)
test["Weekly_Sales"] = 0
df_test = test.groupby("Date").Weekly_Sales.sum().reset_index().set_index("Date")
df_test = set_information_lag(df_test)
df_test = pd.concat([df_sales.iloc[-7:, :].set_index("Date"), df_test])
df_test = pd.DataFrame(df_test.iloc[:, :1])
df_test = set_information_lag(df_test)
df_test = df_test.iloc[7:, :]
df_test.head(10)

In [None]:
y_hat = model.predict([df_test.iloc[0, 1:].to_numpy()])

In [None]:
predict_df_sales(df_test, model)
sample['Date'] = sample.Id.apply(lambda x: x.split('_')[2])
final = sample.drop('Weekly_Sales', axis=1).merge(df_test.iloc[:,0], how='left', on='Date').drop('Date', axis=1)

In [None]:
final.head()

In [None]:
final.to_csv('walmart-submission.csv', index=False)

Submission and Description

Private Score 60776366.55718

Public Score 61069388.70502

___

### <span style="color: blue">Simple Random Forest Regressor</span>

In [None]:
# grouping all week sales on dataframe
df_sales = train.groupby(["Date", "Store"]).Weekly_Sales.sum().reset_index()
df_sales = join_dataframe_columns(df_sales)
df_sales = create_num_ordinary_type_column(df_sales)
drop_type(df_sales)
df_sales = df_sales.set_index('Date')
drop_markdowns(df_sales)
df_sales.head()

In [None]:
# Random Forest Ensembles for Regression
model = RandomForestRegressor()
df_train, df_test = train_test_split_time_series(df_sales, ndarrayType=False)
X_train, y_train = split_X_y_df(df_train)
model.fit(X_train, y_train)

In [None]:
# Testing
X_test, y_test = split_X_y_df(df_test)
y_hat = model.predict(X_test)
rmse(y_test, y_hat) # RMSE

In [None]:
plt.figure(figsize=(20, 7))
df_sales = train.groupby(["Date", "Store"]).Weekly_Sales.sum().reset_index()
offset = len(df_sales) - len(y_test)
_ = df_sales.iloc[offset:, :].copy() # copy so the new columns are not written to a view
_["y_test"] = y_test
_["y_hat"] = y_hat
plt.plot(_.groupby("Date").Weekly_Sales.sum().reset_index().Date.values, _.groupby("Date").y_test.sum().reset_index().y_test)
plt.plot(_.groupby("Date").Weekly_Sales.sum().reset_index().Date.values, _.groupby("Date").y_hat.sum().reset_index().y_hat)


____

### <span style="color: blue">Random Forest Regressor</span>

In [None]:
df.head()