# Description given by Kaggle

You are provided with historical sales data for 45 Walmart stores located in different regions. Each store contains a number of departments, and you are tasked with predicting the department-wide sales for each store.

In addition, Walmart runs several promotional markdown events throughout the year. These markdowns precede prominent holidays, the four largest of which are the Super Bowl, Labor Day, Thanksgiving, and Christmas. The weeks including these holidays are weighted five times higher in the evaluation than non-holiday weeks. Part of the challenge presented by this competition is modeling the effects of markdowns on these holiday weeks in the absence of complete/ideal historical data.
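To make the holiday weighting concrete, here is a minimal sketch of a weighted mean absolute error in which holiday weeks count five times as much as other weeks. The function name `wmae` and the `holiday_weight` argument are my own illustrative choices, and the toy numbers are invented; as far as I know this matches the competition metric, but treat it as an assumption rather than the official scoring code.

```python
import numpy as np

def wmae(y_true, y_pred, is_holiday, holiday_weight=5.0):
    """Weighted mean absolute error; holiday weeks get a larger weight (illustrative helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Weight 5 for holiday weeks, 1 otherwise
    weights = np.where(np.asarray(is_holiday, dtype=bool), holiday_weight, 1.0)
    return float(np.sum(weights * np.abs(y_true - y_pred)) / np.sum(weights))

# Two toy weeks: an error of 10 on a normal week and an error of 20 on a holiday week
score = wmae([100.0, 200.0], [90.0, 220.0], [False, True])
print(score)
```

With this weighting, an error made on a holiday week costs as much as the same error repeated on five ordinary weeks, which is why the holiday peaks deserve special attention later on.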

stores.csv

This file contains anonymized information about the 45 stores, indicating the type and size of store.

train.csv

This is the historical training data, which covers 2010-02-05 to 2012-11-01. Within this file you will find the following fields:

* Store - the store number
* Dept - the department number
* Date - the week
* Weekly_Sales - sales for the given department in the given store
* IsHoliday - whether the week is a special holiday week

test.csv

This file is identical to train.csv, except we have withheld the weekly sales. You must predict the sales for each triplet of store, department, and date in this file.

features.csv

This file contains additional data related to the store, department, and regional activity for the given dates. It contains the following fields:

* Store - the store number
* Date - the week
* Temperature - average temperature in the region
* Fuel_Price - cost of fuel in the region
* MarkDown1-5 - anonymized data related to promotional markdowns that Walmart is running. MarkDown data is only available after Nov 2011, and is not available for all stores all the time. Any missing value is marked with an NA.
* CPI - the consumer price index
* Unemployment - the unemployment rate
* IsHoliday - whether the week is a special holiday week

For convenience, the four holidays fall within the following weeks in the dataset (not all holidays are in the data):

Super Bowl: 12-Feb-10, 11-Feb-11, 10-Feb-12, 8-Feb-13

Labor Day: 10-Sep-10, 9-Sep-11, 7-Sep-12, 6-Sep-13

Thanksgiving: 26-Nov-10, 25-Nov-11, 23-Nov-12, 29-Nov-13

Christmas: 31-Dec-10, 30-Dec-11, 28-Dec-12, 27-Dec-13

# First notes on strategy

I will approach this task as I would the first steps of a proof of concept (POC). Most of the work here is initial data exploration, aimed at insights for a first model. To build a proper first-version model I would need more domain knowledge, both to understand the features given in features.csv (mainly the so-called markdowns) and to study the intrinsic behaviour of specific Store-Depts.

Since the data are aggregated weekly (as explained in the description), Pandas seems a reasonable choice. I will not build my own Python package; instead I will use packages available on PyPI, such as dataprep (for basic EDA: missing values, correlations) and pycaret (to load different models and data transformations). I will work with popular models such as XGBoost, CatBoost, LightGBM and the linear models from scikit-learn.

Since it seems useful to build several features for the model, I will probably not try ARIMA, statsmodels or Facebook Prophet at this stage, because I would need to make many assumptions to handle so many different Stores and Depts (3331 store-depts in total).

LSTMs (from Keras, for instance) will not be studied here either. I think a multivariate approach would be worth trying, though, especially if the time series follow seasonal trends.

In [None]:
!pip install pycaret==2.3.1 dataprep

## Loading main libraries

In [None]:
from pycaret import regression as pyreg
import numpy as np
from sklearn.metrics import make_scorer
import logging
import seaborn as sns
import matplotlib.pyplot as plt
import time
import random
from dataprep import eda

# Checking DATA

In [None]:

import gc

# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
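def del_df(df):
    """Drop the local reference to df and run the garbage collector.

    Note: `del df` only removes the name inside this function. The caller's
    variable still points to the DataFrame, so memory is released only once
    the caller rebinds or deletes its own reference.
    """
    del df
    gc.collect()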

df_features = pd.read_csv("/kaggle/input/walmart-recruiting-store-sales-forecasting/features.csv.zip")
df_train = pd.read_csv("/kaggle/input/walmart-recruiting-store-sales-forecasting/train.csv.zip")
df_test = pd.read_csv("/kaggle/input/walmart-recruiting-store-sales-forecasting/test.csv.zip")
df_sample = pd.read_csv("/kaggle/input/walmart-recruiting-store-sales-forecasting/sampleSubmission.csv.zip")
df_stores = pd.read_csv("/kaggle/input/walmart-recruiting-store-sales-forecasting/stores.csv")

In [None]:
gc.collect()

# Basic info and sample of dataframes

Taking into account the size (up to tens of MB) and the number of rows (up to ~400k), it seems reasonable to use Pandas transformations for this test. 

In [None]:
df_stores.info()
df_features.info()
df_train.info()
df_test.info()
df_sample.info()

In [None]:
df_train.head()

In [None]:
df_train.nunique()

In [None]:
df_train.groupby(['Store', 'Dept']).size()

In [None]:
df_stores.head()

In [None]:
df_stores.nunique()

In [None]:
df_features.head()

In [None]:
df_features.sort_values("Date").tail(4)

# Min, max dates

df_test starts the week right after df_train ends. df_sample covers the same range as df_test, as indicated in the competition description. 

In [None]:
df_train["Date"].min(),df_train["Date"].max(),df_train["Date"].count()

In [None]:
df_test["Date"].min(),df_test["Date"].max(),df_test["Date"].count()

In [None]:
df_sample.head()

In [None]:
df_sample["Id"].min(),df_sample["Id"].max(),df_sample["Id"].count()

# Some remarks about range and granularity of ids of time series

This competition asks for a forecast up to 8 months ahead, which is very challenging given how little information is provided. We really need to rely on seasonality to produce good forecasts. Handling 3331 Store-Dept pairs is also tough, since the problem is very granular. We cannot simply add a Store-Dept ID variable to the model: the complexity it would introduce would greatly reduce the model's degrees of freedom. So it is important to find features that capture general information about the store-depts.

Besides using the features provided, there are some approaches we could think of:
1. propagating one-year information such as lagged sales, averages, etc., and enriching it with the available features
2. propagating the most recent information available: short-term information for the first weeks of the prediction window, but only long-term information for the last months.

Option 2 is probably better in terms of the Kaggle metric, but it makes the model less general if we decide to turn it into an application. It also involves more technical details to take care of. I would postpone it to a later version of this model. 
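Approach 1 can be sketched on a toy frame: last year's sales for the same Store, Dept and week number are attached by shifting the year by one and joining. The helper name `add_one_year_lag`, the column `lag_sales` and the numbers are illustrative assumptions, not part of the competition data.

```python
import pandas as pd

def add_one_year_lag(df):
    """Attach last year's sales for the same Store, Dept and week number (hypothetical helper)."""
    lag = df[["Store", "Dept", "Year", "Week", "Weekly_Sales"]].copy()
    lag["Year"] = lag["Year"] + 1  # last year's row becomes joinable to this year's row
    lag = lag.rename(columns={"Weekly_Sales": "lag_sales"})
    return df.merge(lag, on=["Store", "Dept", "Year", "Week"], how="left")

toy = pd.DataFrame({
    "Store": [1, 1, 1],
    "Dept": [1, 1, 1],
    "Year": [2010, 2011, 2012],
    "Week": [47, 47, 47],
    "Weekly_Sales": [100.0, 120.0, None],  # None stands for a week still to be predicted
})
toy_lagged = add_one_year_lag(toy)
print(toy_lagged[["Year", "Week", "Weekly_Sales", "lag_sales"]])
```

Joining on the week number rather than shifting by 52 rows keeps the lag aligned even when a series has missing weeks, at the cost of leaving the first year without a lag value.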

# Check Stores

We lack some interesting information about Stores and Departments:
* What is the meaning of Weekly_Sales? Total amount? Gross profit? Should Weekly_Sales be strictly positive?
* What are the start and end dates of each store (e.g. a store that was closed)?
* What are the start and end dates of the given Depts? Many of them may be seasonal, which means we do not expect sales in some specific weeks of the year. Does the sample provided here for Kaggle only contain available Departments? Can we assume, in some cases, that the first entry in weekly sales marks the Dept's starting point? Does a missing Store-Dept mean zero Weekly_Sales?
* A better description of the categories of Departments and Stores would be interesting, especially to understand whether we should expect sales to increase in given seasons.
* A plus would be insights from stock levels.

I think we cannot answer these questions with the description given. We will need to think about strategies to move forward.





## Type A: Bigger size and number of stores

The definitions of A, B and C follow the common-sense order:
* There are more Type A stores, and they are bigger in size.
* There are only a few Type C stores, and they are smaller than both A and B.
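As a numeric complement to the plots, the same comparison can be read from a small summary table. This is only a sketch on invented data: `toy_stores` and its values are made up for illustration, and on the real data one would apply the same aggregation to `df_stores`.

```python
import pandas as pd

# Invented stores, only to show the shape of the summary
toy_stores = pd.DataFrame({
    "Store": [1, 2, 3, 4, 5, 6],
    "Type": ["A", "A", "A", "B", "B", "C"],
    "Size": [200000, 180000, 150000, 120000, 90000, 40000],
})
# Number of stores and typical size per store type
type_summary = toy_stores.groupby("Type").agg(
    n_stores=("Store", "nunique"),
    median_size=("Size", "median"),
)
print(type_summary)
```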

In [None]:
df_stores.head(2)

In [None]:
sns.countplot(x="Type",data=df_stores)
plt.grid(True)

plt.show()
sns.boxplot(x="Type", y="Size", data=df_stores)

plt.grid(True)
plt.show()


# Join df_stores and date transformations

## Functions to be used later on

In [None]:
FIRST_DATE_AVAIABLE='2010-02-05' # This is to take into account the WeekIndex
def apply_ID(df):
    """
    Here, the transformation needed for the sample data
    """
    df["Store_Dept"] = df["Store"].astype(str)+"_"+df["Dept"].astype(str)
    df["Id"] = df["Store_Dept"]+"_"+df["Date"].astype(str)
    return df        

def date_transforms(df):
    """ Create features from the date: year, month, ISO week, day of month (and Day_Range),
    and Week_Index (number of weeks since the first available date).
    """
    df['dt'] = pd.to_datetime(df['Date'])
    df['Year'] = df['dt'].dt.year
    df['Month'] = df['dt'].dt.month
    # Series.dt.week was removed in pandas 2.0; isocalendar() (pandas >= 1.1) gives the same ISO week number
    df['Week'] = df['dt'].dt.isocalendar().week.astype(int)
    df['Day'] = df['dt'].dt.day
    df['Day_Range'] = pd.cut(df["Day"], bins=[0, 10, 20, 31], right=True)
    df["YearMonth"] = df['dt'].dt.strftime('%Y-%b')
    df["Week_Index"] = (df['dt']-pd.to_datetime(FIRST_DATE_AVAIABLE)).dt.days/7

    return df

In [None]:
def correct_negative_sales(sales):
    """ Since we find negative weekly sales, let's propose a fix for that """
    return 0 if sales < 0 else sales

def join_stores(df,df_stores):
    return df.merge(df_stores, on=["Store"], how="left")

def join_features(df,df_feat):
    return df.merge(df_feat, on=["Store","Date","IsHoliday"], how="left")

## Apply ID-like for samples, joining stores and features + applying date transformations

Since we are going to work with one-year lags, only the data starting one year after the first available date can be used to train the model. 

One interesting feature is the min and max of weekly sales over this first year, which propagates some information about scale.
Another interesting approach is to normalize the values of each store_dept, which could help in understanding the seasonal shapes of the time series. 

Besides improving the model (by removing the differences in scale between stores), normalization might also reduce the amount of propagated average and lag information, making the model more general.

For this version of the model, we decided to use the first-year min and max values as features, leaving the normalized values for later.
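The normalization idea can be illustrated on a toy frame: each Store-Dept series is divided by its own maximum, so that series of very different scale end up with comparable shapes. The data below are invented, and scaling by the maximum is just one possible choice of norm.

```python
import pandas as pd

toy = pd.DataFrame({
    "Store": [1, 1, 1, 2, 2, 2],
    "Dept": [1, 1, 1, 1, 1, 1],
    "Weekly_Sales": [100.0, 200.0, 400.0, 10.0, 20.0, 40.0],
})
# Broadcast each Store-Dept maximum back onto its rows, then scale by it
series_max = toy.groupby(["Store", "Dept"])["Weekly_Sales"].transform("max")
toy["Weekly_Sales_Norm"] = toy["Weekly_Sales"] / series_max
print(toy)
```

Both toy series differ by a factor of ten in absolute value but have identical normalized values, which is the property that would let a single model learn a shared seasonal shape.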

In [None]:
del_df(df_train)
del_df(df_test)
del_df(df_stores)
del_df(df_features)
gc.collect()

df_train = pd.read_csv("/kaggle/input/walmart-recruiting-store-sales-forecasting/train.csv.zip")
df_test = pd.read_csv("/kaggle/input/walmart-recruiting-store-sales-forecasting/test.csv.zip")
df_stores = pd.read_csv("/kaggle/input/walmart-recruiting-store-sales-forecasting/stores.csv")
df_features = pd.read_csv("/kaggle/input/walmart-recruiting-store-sales-forecasting/features.csv.zip")

df_test["Weekly_Sales"] = None

df_train = apply_ID(df = df_train)
df_train = date_transforms(df = df_train)
df_train = join_stores(df = df_train, df_stores=df_stores)
df_train = join_features(df = df_train, df_feat=df_features)

df_test = apply_ID(df = df_test)
df_test = date_transforms(df = df_test)
df_test = join_stores(df = df_test, df_stores=df_stores)
df_test = join_features(df = df_test, df_feat=df_features)

# Min, max and mean of weekly sales per Store-Dept (computed here over the whole training history)
df_max_mins = df_train.groupby(["Store","Dept"]).agg(min_sales_history=("Weekly_Sales","min"),
                                                     max_sales_history=("Weekly_Sales","max"),
                                                     avg_sales_history=("Weekly_Sales","mean")).reset_index()
        
df_max_mins["min_sales_history_nonneg"] = df_max_mins.apply(lambda x:correct_negative_sales(x["min_sales_history"]),axis=1)
df_max_mins["min_sales_history_norm"] = df_max_mins["min_sales_history_nonneg"]/df_max_mins["max_sales_history"]
df_max_mins["avg_sales_history_norm"] = df_max_mins["avg_sales_history"]/df_max_mins["max_sales_history"]


df_train = df_train.merge(df_max_mins,on=["Store","Dept"],how="left")
df_test = df_test.merge(df_max_mins,on=["Store","Dept"],how="left")

df_train["Weekly_Sales_Norm"] = df_train["Weekly_Sales"]/df_train["max_sales_history"]
df_test["Weekly_Sales_Norm"] = None

## Negative Weekly Sales

Yes, there are negative entries!
In a real problem, I would ask the business and data teams what such entries mean. For this test, I will remove those entries from training/testing and use zero as the prediction for such cases in the test data. The fraction is small, though, around 1/400.
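A quick way to quantify and handle such entries is sketched below on invented values. Dropping the rows and flooring at zero are shown side by side as two alternatives; which one is appropriate depends on what negative sales really mean, which I do not know here.

```python
import pandas as pd

toy = pd.DataFrame({"Weekly_Sales": [1500.0, -20.0, 0.0, 320.5]})

# Share of rows with negative sales
negative_share = (toy["Weekly_Sales"] < 0).mean()
# Option 1: drop the negative rows before training
toy_positive = toy[toy["Weekly_Sales"] >= 0]
# Option 2: keep the rows but floor the values at zero
clipped = toy["Weekly_Sales"].clip(lower=0)
print(negative_share, len(toy_positive), clipped.tolist())
```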

In [None]:
df_train[df_train.Weekly_Sales<0].shape

## Checking Missing data

Among the features, a lot of MarkDown information is missing. The description says this information is only recorded from 2011-11-11 onwards. Even with this time cut, only MarkDown1 is well populated.
The other variables are well populated.
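The same check can be done with plain pandas as a cross-check of the dataprep charts: the share of missing values per column, before and after the date cut. The toy frame and its values are invented for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Date": ["2011-10-28", "2011-11-04", "2011-11-11", "2011-11-18"],
    "MarkDown1": [np.nan, np.nan, 5000.0, 7000.0],
    "MarkDown2": [np.nan, np.nan, np.nan, 120.0],
    "CPI": [211.0, 211.1, 211.2, 211.3],
})
# Fraction of missing values per column over the whole period
missing_all = toy.isnull().mean()
# Same fraction restricted to the weeks from the cut date onwards (ISO date strings compare correctly)
missing_after_cut = toy[toy["Date"] >= "2011-11-11"].isnull().mean()
print(missing_all)
print(missing_after_cut)
```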

In [None]:
eda.plot_missing(df_train,display=["Bar Chart"])

In [None]:
eda.plot_missing(df_train[df_train.Date>="2011-11-11"],display=["Bar Chart"])

In [None]:
eda.plot_missing(df_train[df_train.Date>="2011-02-05"],display=["Bar Chart"])

# Time Series for Store and/or Type aggregated data

Let's have a first look at time series plots in order to understand the important general aspects of our Weekly_Sales.

In [None]:
def stores_month(df,col="Weekly_Sales"):
    """ Monthly behaviour by store and type"""
    return df.groupby(["Store","Type","Year","Month","YearMonth"]).agg(avg_sales=(col,"mean"),
                                                                             sum_sales=(col,"sum"),
                                                                             n_dist_Dept=("Dept","nunique")).reset_index().sort_values(["Year","Month"])

def stores_week(df,col="Weekly_Sales"):
    """ Type of Store weekly aggregated behaviour"""
    return df.groupby(["Type","Week_Index","Date"]).agg(avg_sales=(col,"mean"),
                                           sum_sales=(col,"sum"),
                                           n_dist_Store_Dept=("Store_Dept","nunique")).reset_index().sort_values(["Date"])

In [None]:
df_agg_stores_month = stores_month(df_train,"Weekly_Sales")
df_agg_stores_week= stores_week(df_train,"Weekly_Sales")

## Functions to help plot boxplot and time series

In [None]:
def plot_boxes_ts(df,xcol="YearMonth",hue="Type"):
    """ Boxplots of 3 aggregated variables: sum of sales, avg of sales and n distinct Departments. Hue is the type of store by default"""
    width = 25
    height = 10

    # One figure per aggregated variable, drawn from the df and hue passed in
    for ycol, title in [("sum_sales","Sum sales by Month"),
                        ("avg_sales","Avg sales by Month"),
                        ("n_dist_Dept","n_dist_Dept sales by Month")]:
        plt.figure(figsize=(width,height))
        sns.boxplot(x=xcol,y=ycol,data=df,hue=hue).set_title(title)
        plt.xticks(rotation=90)
        plt.grid(True)
        plt.show()

def plot_ts(df, var,  title='', xlabel='', ylabel='', all_labels=False, savefig=False, savepath="",label=None,color=None):
    """ Time series for a given variable """
    df[var].plot(title=title, marker="o",label=label,color=color)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.grid(True)
    if all_labels:
        labels = df.index.to_list()
        arrticks = np.arange(len(labels))
        plt.xticks(arrticks, labels, rotation=90)
    else:
        plt.xticks(rotation=90)
    if savefig and savepath != "":
        plt.savefig(savepath)    
        
def plot_ts_multi(df, var1=None,var2=None,title='', xlabel='', ylabel='', all_labels=False, savefig=False, savepath=""):
    """ Time series comparing two variables on two distinct y-axes"""
    if(var1 is None):
        print("var1 cannot be None, returning")
        return False
    
    ax1 = df[var1].plot(figsize=(12, 7), title=title, marker="o",color="blue",label=var1)
    ax1.set_ylabel(var1,color="blue") 
    if(var2 is not None):
        ax2 = ax1.twinx()
        ax2.set_ylabel(var2,color="red") 
        # Draw explicitly on the secondary axis
        df[var2].plot(ax=ax2, figsize=(12, 7), title=title, marker="o",color="red",label=var2)
   
    plt.xlabel(xlabel)
    plt.grid(True)
    
    title = f"Time Series, {var1}" 
    if(var2 is not None):
        title+=f" and {var2}"
    plt.title(title)
    
    if all_labels:
        labels = df.index.to_list()
        arrticks = np.arange(len(labels))
        plt.xticks(arrticks, labels, rotation=90)
    else:
        plt.xticks(rotation=90)
    if savefig and savepath != "":
        plt.savefig(savepath)
 

def plot_ts_multix(df, var1=None,var2=None, title='', xlabel='', ylabel='', all_labels=False, savefig=False, savepath=""):
    """ Just a different size. I need time to improve this"""
    if(var1 is None):
        print("var1 cannot be None, returning")
        return False
    
    ax1 = df[var1].plot(title=title, marker="o",color="blue",label=var1)
    ax1.set_ylabel(var1,color="blue") 
    if(var2 is not None):
        ax2 = ax1.twinx()
        ax2.set_ylabel(var2,color="red") 
        # Draw explicitly on the secondary axis
        df[var2].plot(ax=ax2, title=title, marker="o",color="red",label=var2)
   
    plt.xlabel(xlabel)
    plt.grid(True)
    
    title = f"Time Series, {var1}" 
    if(var2 is not None):
        title+=f" and {var2}"
    plt.title(title)
    
    if all_labels:
        labels = df.index.to_list()
        arrticks = np.arange(len(labels))
        plt.xticks(arrticks, labels, rotation=90)
    else:
        plt.xticks(rotation=90)
    if savefig and savepath != "":
        plt.savefig(savepath)
            

## Stores monthly

Some things are as expected:
* Type A consistently has more departments and sales per month, followed by B and C.
* There are peaks in December for Types A and B (as expected from the major holidays). For Type C this is not clear in general. Type C stores might not follow the trend and may need specific care.

In [None]:
plot_boxes_ts(df_agg_stores_month,xcol="YearMonth",hue="Type")

## Stores weekly

### Stores: Type A

We clearly see peaks around Black Friday and Christmas, both for the sum and the average of weekly sales. Although the Christmas peak drops slightly from 2010 to 2011, it looks like a seasonal pattern. There is a drop in January, which may be due to two effects:
* one: a side effect of the very high sales during Christmas.
* two: weather effects.

Case one seems to be the main reason, but we may come back to this when dealing with the Temperature feature.

In [None]:
var1="sum_sales"
var2="avg_sales"
var3="n_dist_Store_Dept"
typestore="A"
width = 25
height = 10
plt.figure(figsize=(width,height))
plot_ts_multi(df=df_agg_stores_week[df_agg_stores_week.Type==typestore].set_index("Date"),var1=var1,var2=var2)

plt.figure(figsize=(width,height))
plot_ts_multi(df=df_agg_stores_week[df_agg_stores_week.Type==typestore].set_index("Date"),var1=var1,var2=var3)

Here we zoom in on the peaks around Black Friday and Christmas. They seem to deserve special attention.

In [None]:
var1="sum_sales"
var2="avg_sales"
typestore="A"
width = 25
height = 10
plt.figure(figsize=(width,height))
plot_ts_multi(df=df_agg_stores_week[(df_agg_stores_week.Type==typestore)&(df_agg_stores_week.Date>"2010-10-20")&(df_agg_stores_week.Date<"2011-01-10")].set_index("Date"),
              var1=var1,var2=var2)


### Stores: Type B

With an expectedly lower absolute value than Type A, Type B stores follow a seasonality similar to Type A stores.

In [None]:
var1="sum_sales"
var2="avg_sales"
var3="n_dist_Store_Dept"
typestore="B"
width = 25
height = 10
plt.figure(figsize=(width,height))
plot_ts_multi(df=df_agg_stores_week[df_agg_stores_week.Type==typestore].set_index("Date"),var1=var1,var2=var2)

plt.figure(figsize=(width,height))
plot_ts_multi(df=df_agg_stores_week[df_agg_stores_week.Type==typestore].set_index("Date"),var1=var1,var2=var3)

### Stores: Type C

Unlike Types A and B, Type C does not seem to follow a well-behaved trend. Although Type C stores are smaller, it is worth taking a close look at them to understand how to improve the model.

In [None]:
var1="sum_sales"
var2="avg_sales"
var3="n_dist_Store_Dept"
typestore="C"
width = 25
height = 10
plt.figure(figsize=(width,height))
plot_ts_multi(df=df_agg_stores_week[df_agg_stores_week.Type==typestore].set_index("Date"),var1=var1,var2=var2)

plt.figure(figsize=(width,height))
plot_ts_multi(df=df_agg_stores_week[df_agg_stores_week.Type==typestore].set_index("Date"),var1=var1,var2=var3)

## Aggregate by store and IsHoliday

Motivation: check missing dates

In [None]:
try:
  del(df_agg_stores_hol)
except:
  pass
gc.collect()

df_agg_stores_hol = df_train.groupby(["Store","IsHoliday"]).agg(
    ndist_date_store=("Date","nunique"),
    min_sales_store=("Weekly_Sales","min"),
    max_sales_store=("Weekly_Sales","max"),
    avg_sales_store=("Weekly_Sales","mean"),
    sum_sales_store=("Weekly_Sales","sum")).reset_index()

### Checking missing dates

Within the time span of the training sample, let's check for missing weekly entries per store, using ndist_date_store. 
As seen below, there are entries for all the stores on the 10 holiday weeks and on the 133 non-holiday weeks.
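The logic of this check can be sketched on a toy frame: count the distinct dates per store and compare with the number of distinct dates in the whole sample. The data are invented, and `incomplete_stores` is just an illustrative name.

```python
import pandas as pd

toy = pd.DataFrame({
    "Store": [1, 1, 1, 2, 2],
    "Date": ["2010-02-05", "2010-02-12", "2010-02-19", "2010-02-05", "2010-02-19"],
})
# Number of distinct weeks present anywhere in the sample
n_weeks = toy["Date"].nunique()
# Number of distinct weeks present for each store
dates_per_store = toy.groupby("Store")["Date"].nunique()
# Stores with at least one week missing
incomplete_stores = dates_per_store[dates_per_store < n_weeks]
print(incomplete_stores)
```

An empty result on the real data would confirm that every store has an entry for every week.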

In [None]:
try:
  del(df_agg_stores_dept_hol)
except:
  pass
gc.collect()
  
df_agg_stores_dept_hol = df_train.groupby(["Store","Dept","IsHoliday"]).agg(
    first_date=("Date","first"),
    last_date=("Date","last"),
    ndist_date=("Date","nunique"),
    max_sales=("Weekly_Sales","max"),
    min_sales=("Weekly_Sales","min"),
    avg_sales=("Weekly_Sales","mean"),
    sum_sales=("Weekly_Sales","sum")).reset_index()

#Let's join total store levels and check share
df_agg_stores_dept_hol["Store_Dep"] = df_agg_stores_dept_hol["Store"].astype(str)+"_"+df_agg_stores_dept_hol["Dept"].astype(str)
df_agg_stores_dept_hol = df_agg_stores_dept_hol.merge(df_agg_stores_hol,on=["Store","IsHoliday"],how="inner")
df_agg_stores_dept_hol["share_Dep"] = df_agg_stores_dept_hol["sum_sales"]/df_agg_stores_dept_hol["sum_sales_store"]
df_agg_stores_dept_hol["diff_Dep_avg"] = (df_agg_stores_dept_hol["avg_sales"]-df_agg_stores_dept_hol["avg_sales_store"])/df_agg_stores_dept_hol["avg_sales_store"]


## Checking missing dates dept

Within the time span of the training sample, let's check for missing weekly entries per department, using ndist_date. 

Out of 3331 store_depts (see the cells below), for non-holiday weeks:
* 2663 store_depts have entries on all dates.
* 196 store_depts have entries on between 100 and 132 dates.
* 128 store_depts have entries on between 50 and 99 dates.
* 340 store_depts have entries on fewer than 50 dates.
* 49 store_depts have an entry on a single date only.

Out of 3331 store_depts (see the cells below), for holiday weeks:
* 2785 store_depts have entries on all dates.
* 60 store_depts have entries on 9 of the 10 dates.
* 109 store_depts have an entry on a single date only.


It is important to identify the source of these missing dates:
* New departments?
* A lack of transactions, meaning sales = 0 for the given departments?

### What to do with them? 
In a real business, we would probably contact the business, software engineering and/or data engineering teams to understand the source of this behaviour. With these ready-made CSVs, we need to figure out strategies to deal with the missing dates ourselves.
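One possible strategy, sketched here on invented data, is to put every series on the full weekly calendar, flag the rows that had to be created, and fill their sales with zero. Treating a missing row as zero sales is an assumption that would need to be confirmed with the business; the names `filled` and `was_missing` are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    "Store_Dept": ["1_1", "1_1", "1_1", "1_2", "1_2"],
    "Date": pd.to_datetime(["2010-02-05", "2010-02-12", "2010-02-19",
                            "2010-02-05", "2010-02-19"]),
    "Weekly_Sales": [10.0, 12.0, 11.0, 5.0, 7.0],
})
# Full weekly calendar between the first and last date in the sample
all_weeks = pd.date_range(toy["Date"].min(), toy["Date"].max(), freq="7D")
# Every (series, week) combination
full_index = pd.MultiIndex.from_product(
    [toy["Store_Dept"].unique(), all_weeks], names=["Store_Dept", "Date"]
)
filled = toy.set_index(["Store_Dept", "Date"]).reindex(full_index).reset_index()
filled["was_missing"] = filled["Weekly_Sales"].isnull()
filled["Weekly_Sales"] = filled["Weekly_Sales"].fillna(0.0)  # assumption: no row means no sales
print(filled)
```

Keeping the `was_missing` flag lets a model distinguish a true zero from an imputed one.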

In [None]:
df_agg_stores_dept_hol["Store_Dep"].nunique()

In [None]:
df_count_non_holidays = df_agg_stores_dept_hol[df_agg_stores_dept_hol.IsHoliday==False].groupby(["ndist_date","IsHoliday"]).agg(n_Store_dep=("Store_Dep","nunique")).reset_index()

In [None]:
df_count_holidays = df_agg_stores_dept_hol[df_agg_stores_dept_hol.IsHoliday==True].groupby(["ndist_date","IsHoliday"]).agg(n_Store_dep=("Store_Dep","nunique")).reset_index()

In [None]:
print(
    df_count_non_holidays[df_count_non_holidays.ndist_date==133]["n_Store_dep"].sum(),
    df_count_non_holidays[df_count_non_holidays.ndist_date.between(100,132)]["n_Store_dep"].sum(),
    df_count_non_holidays[df_count_non_holidays.ndist_date.between(50,99)]["n_Store_dep"].sum(),
    df_count_non_holidays[df_count_non_holidays.ndist_date<50]["n_Store_dep"].sum(),
    df_count_non_holidays[df_count_non_holidays.ndist_date==1]["n_Store_dep"].sum())

In [None]:
df_count_holidays

## Checking dates between

Here, we test whether there are gaps between the first and last weekly sales of a Store-Dept, meaning the missing dates are not just due to the Dept starting at some later point. 
The answer is yes: there are gaps inside the time series, probably due to seasonal Departments.

In [None]:
df_agg_dates = df_train.groupby(["Store","Dept"]).agg(
    first_date=("Date","first"),
    last_date=("Date","last"),
    ndist_date=("Date","nunique"))
df_agg_dates["diff_first_last_weeks"] = 1 + ((pd.to_datetime(df_agg_dates["last_date"]) - pd.to_datetime(df_agg_dates["first_date"])).dt.days)/7
df_agg_dates = df_agg_dates[["first_date","last_date","diff_first_last_weeks","ndist_date"]].reset_index()
df_agg_dates["ndist_equals_diff"] = df_agg_dates["diff_first_last_weeks"]==df_agg_dates["ndist_date"]
df_agg_dates["Store_Dep"] = df_agg_dates["Store"].astype(str) + "_" + df_agg_dates["Dept"].astype(str)

In [None]:
df_agg_dates.groupby("ndist_equals_diff")["Store_Dep"].count()

## Aggregate for store type

In [None]:
try:
  del(df_agg_type)
except:
  pass
gc.collect()


df_agg_type = df_train.groupby(["Type"]).agg(
    min_size=("Size","min"),
    max_size=("Size","max"),
    avg_size=("Size","mean"),
    max_sales_store=("Weekly_Sales","max"),
    min_sales_store=("Weekly_Sales","min"),
    avg_sales_store=("Weekly_Sales","mean")).reset_index()



try:
  del(df_agg_type_date)
except:
  pass
gc.collect()


df_agg_type_date = df_train.groupby(["Type","Date"]).agg(
    ndist_Store=("Store","nunique"),
    ndist_StoreDept=("Store_Dept","nunique"),
    
    min_size=("Size","min"),
    max_size=("Size","max"),
    avg_size=("Size","mean"),
    max_sales_store=("Weekly_Sales","max"),
    min_sales_store=("Weekly_Sales","min"),
    avg_sales_store=("Weekly_Sales","mean")).reset_index()


# Join Features

The strategy here is to put train and test together (the test period needs information propagated from the train period, such as lags), prepare the features, and then split back into train and test.
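The pattern can be sketched on a toy example: stack train and test, compute a backward-looking feature once on the stacked frame, then split again at the last training date. The feature `prev_week_sales` and the numbers are invented for illustration.

```python
import pandas as pd

toy_train = pd.DataFrame({"Date": ["2012-10-19", "2012-10-26"],
                          "Weekly_Sales": [100.0, 110.0]})
toy_test = pd.DataFrame({"Date": ["2012-11-02", "2012-11-09"],
                         "Weekly_Sales": [None, None]})  # labels unknown in test

toy_all = pd.concat([toy_train, toy_test], ignore_index=True)
toy_all["Weekly_Sales"] = toy_all["Weekly_Sales"].astype(float)
# A backward-looking feature computed once on the stacked frame (here: last week's sales)
toy_all["prev_week_sales"] = toy_all["Weekly_Sales"].shift(1)

# Split back at the last training date (ISO date strings compare correctly)
last_train_date = toy_train["Date"].max()
train_part = toy_all[toy_all["Date"] <= last_train_date]
test_part = toy_all[toy_all["Date"] > last_train_date]
print(test_part)
```

The first test week receives the last training value as its feature, while the second one has nothing to look back on. This is exactly why lags far enough in the past (one year here) are needed for a long forecast horizon.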

## Concat Train + Test for lag propagation

In [None]:
df_all = pd.concat([df_train,df_test])

In [None]:
df_all.Date.min(),df_all.Date.max()

In [None]:
def create_date_table(df):
    """ Distinct dates (with their holiday flag), used for the cross join """
    return df[["dt","IsHoliday"]].drop_duplicates()

def apply_selection(week1,week2,lag1y,lag):
  if(week1!=week2):
    return lag
  else:
    return lag1y


def cross_dates(df):
    """ This tries to propagate zeros for all missing dates. Basically, a cross join with all the available dates.
    We end up with ~560k rows, which is not that many more.
    It seemed promising and gave good first results.
    Dropped in this version since more time is needed to investigate it """
    df_list = df.groupby(["Store","Dept","Store_Dept","Type","Size"]).agg({"Weekly_Sales":"count"}).reset_index()
    df_dates = create_date_table(df)
    df_list['key'] = 1
    df_dates['key'] = 1
    df_list = df_list.merge(df_dates,on="key").drop(columns=["Weekly_Sales","key"])
    df_list = df_list.merge(df,on=["Store","Dept","Store_Dept","Type","Size","dt","IsHoliday"],how="left")
    df_list["was_missing"] = df_list["Date"].isnull()
    df_list["Date"] = df_list["dt"].astype(str)
    df_list = apply_ID(df = df_list)
    df_list = date_transforms(df = df_list)
    return df_list.fillna({"Weekly_Sales":0,"Weekly_Sales_Norm":0})


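def rolling_means(df):
  """ Rolling mean over up to 25 weeks, computed within each Store-Dept. Counts are just a cross-check for missing dates """
  df = df.sort_values(['Store','Dept', 'Date']).reset_index(drop=True)
  for col in ["Weekly_Sales","Weekly_Sales_Norm"]:
    df[col] = df[col].astype(float)
  # Group first so that a window never mixes rows from two different Store-Depts
  grouped = df.groupby(['Store','Dept'])
  df["avg_sales_w"] = grouped["Weekly_Sales"].transform(lambda s: s.rolling(25,min_periods=1).mean())
  df["ncounts_w"] = grouped["Weekly_Sales"].transform(lambda s: s.rolling(25,min_periods=1).count())
  df["avg_sales_w_norm"] = grouped["Weekly_Sales_Norm"].transform(lambda s: s.rolling(25,min_periods=1).mean())
  df["ncount_w_norm"] = grouped["Weekly_Sales_Norm"].transform(lambda s: s.rolling(25,min_periods=1).count())
  return df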


def apply_month_zero(df):
    """ applying zeros for negative min sales month"""
    if(df["min_sales_month"]<0):
        return 0
    else:
        return df["min_sales_month"]
    
def month_avg(df):
    """ some aggregations by month to be propagated to the following year"""
    df['Weekly_Sales']=df['Weekly_Sales'].astype(float)
    df['avg_month'] = df.groupby(['Store','Dept',"Year","Month"])['Weekly_Sales'].transform('mean')  
    df['min_month'] = df.groupby(['Store','Dept',"Year","Month"])['Weekly_Sales'].transform('min')    
    df['max_month'] = df.groupby(['Store','Dept',"Year","Month"])['Weekly_Sales'].transform('max')
    df['avg_month_norm'] = df.groupby(['Store','Dept',"Year","Month"])['Weekly_Sales_Norm'].transform('mean')  
    df['min_month_norm'] = df.groupby(['Store','Dept',"Year","Month"])['Weekly_Sales_Norm'].transform('min')    
    df['max_month_norm'] = df.groupby(['Store','Dept',"Year","Month"])['Weekly_Sales_Norm'].transform('max')
    return df

def shift_dates(df):
    """ Since there are missing dates we cannot rely on shift(52) alone. 
    We first try to join each week number with its counterpart one year earlier, then fall back to another strategy when last year's value is missing.
    For lag sales we try:
    * first, the weekly sales of the same week in the previous year.
    * second, the moving average at the same week in the previous year.
    * third, fill with zero.
    """
    # .copy() so that the column assignments below work on an independent frame
    df_lag1 = df[['Store','Dept', 'Week','Year',"avg_sales_w","avg_sales_w_norm","Weekly_Sales","Weekly_Sales_Norm","Date","avg_month","avg_month_norm"]].copy()
    df_lag1["lag_avg_month"] = df_lag1["avg_month"]
    df_lag1["lag_avg_month_norm"] = df_lag1["avg_month_norm"]
    df_lag1["lag_avg_sales_w"] = df_lag1["avg_sales_w"]
    df_lag1["lag_avg_sales_w_norm"] = df_lag1["avg_sales_w_norm"]
    df_lag1["lag_Weekly_Sales"] = df_lag1["Weekly_Sales"]
    df_lag1["lag_Weekly_Sales_Norm"] = df_lag1["Weekly_Sales_Norm"]  
    df_lag1["lag_Date"] = df_lag1["Date"]
    df_lag1["Year"] = df_lag1["Year"]+1
    df_lag1 = df_lag1.drop(columns=["avg_sales_w","avg_sales_w_norm","avg_month","avg_month_norm","Weekly_Sales","Weekly_Sales_Norm","Date"])
    df = df.sort_values(['Store','Dept', 'Date'])
    df = df.set_index(['Store','Dept'])

    df["lag1y"] = df["Weekly_Sales"].shift(52)
    df["lag1y_norm"] = df["Weekly_Sales_Norm"].shift(52)
    df["lag1y_date"] = df["Date"].shift(52)
    df["lag1y_Week"] = df["Week"].shift(52)

    df["lag1y_avg_sales_w"] = df["avg_sales_w"].shift(52)
    df["lag1y_ncounts_w"] = df["ncounts_w"].shift(52)

    df = df.merge(df_lag1,on=['Store','Dept', 'Week','Year'],how="left").copy()

    df["lag_sales"] = df["lag_Weekly_Sales"].fillna(df["lag_avg_sales_w"]).fillna(0)
    df["lag_sales_Norm"] = df["lag_Weekly_Sales_Norm"].fillna(df["lag_avg_sales_w_norm"]).fillna(0)

    return df.reset_index()

def apply_holiday(df):
    """ Trying special category for holidays, from Black Friday to Christmas """
    if(df["Week"] in([47,52])):
        return df["IsHoliday"]*10
    if(df["Week"] in([48, 49, 50, 51])):
        return df["IsHoliday"]*5
    else:
        return  df["IsHoliday"]*1

def new_features(df):
    """ features for holiday, Store Size and temperature ranges"""
    df["IsHolidayw"] = df["IsHoliday"]*5
    df["IsHolidayfix"] = df.apply(apply_holiday,axis=1)
    df["Month"] = df["Month"].astype(str)
    df["Temperature_Range"] = pd.qcut(df["Temperature"], 4)
    # Bin the store size (the original binned Temperature here by mistake)
    df["Size_Range"] = pd.qcut(df["Size"], 10, duplicates="drop")
    return df

## Running the final transformations on features

In [None]:
try:
  del(df_all_features)
except:
  pass
gc.collect()

df_all_features = df_all.copy()
df_all_features = df_all_features.fillna({"Weekly_Sales":0,"Weekly_Sales_Norm":0})
# df_all_features = cross_dates(df=df_all_features) aborted
df_all_features = rolling_means(df=df_all_features)
df_all_features = month_avg(df=df_all_features)
df_all_features = shift_dates(df=df_all_features)
df_all_features = new_features(df=df_all_features)

In [None]:
df_all_features.head(2)

## Separating train and test dataframes 

In [None]:
df_train_features = df_all_features[df_all_features.Date<="2012-10-26"]
df_test_features = df_all_features[df_all_features.Date>"2012-10-26"]

# Min and max dates of the train features

Checking the date range against the information provided in the description.

In [None]:
df_train_features["Date"].min(),df_train_features["Date"].max()

In [None]:
df_train_features[df_train_features.MarkDown1.notnull()]["Date"].min(),df_train_features[df_train_features.MarkDown1.notnull()]["Date"].max()

# TimeSeries: Features

Before studying correlation plots, let's start with some time series plots, aggregating sales by store. The main goal here is to highlight expected seasonal events.


Filling 'na' with zeros to allow comparison plots, then aggregating by Store.

In [None]:
# Filling out 'na' with zeros to be able to plot comparison plots
df_agg_train_features = df_train_features.fillna({"MarkDown1":0,"MarkDown2":0,"MarkDown3":0,"MarkDown4":0,"MarkDown5":0}).groupby(["Store","Type","Temperature","CPI",
                                                                           "Unemployment","Fuel_Price","MarkDown1","MarkDown2",
                                                                           "MarkDown3","MarkDown4","MarkDown5","Date","Week_Index"]).agg(sum_sales=("Weekly_Sales","sum"),
                                                                                                               avg_sales=("Weekly_Sales","mean")).reset_index()

##  Having a look into random stores

### Temperature

Sales seem to show some dependency on temperature in January-February, when temperatures are lowest, but this period also comes right after Christmas. Part of this effect may simply be a consequence of the large sales in December.  

In [None]:
var1="sum_sales"
var2="Temperature"
list_stores = random.choices(df_agg_train_features.Store.unique(),k=8)


width=30
height=30
fig = plt.figure(figsize=(width,height), dpi=80)
x=[421,422,423,424,425,426,427,428]
k=-1


for store in list_stores:    
    k+=1
    ax = fig.add_subplot(x[k])    
    plot_ts_multix(df=df_agg_train_features[df_agg_train_features.Store==store].sort_values("Date").set_index("Date"),var1=var1,var2=var2)
    
plt.show()    


### Unemployment

In many cases the reduction in unemployment is not clearly reflected in higher sales. Unless some clearly correlated effect shows up, it is hard to see how this feature could help.

In [None]:
var1="sum_sales"
var2="Unemployment"
list_stores = random.choices(df_agg_train_features.Store.unique(),k=8)


width=30
height=30
fig = plt.figure(figsize=(width,height), dpi=80)
x=[421,422,423,424,425,426,427,428]
k=-1


for store in list_stores:    
    k+=1
    ax = fig.add_subplot(x[k])    
    plot_ts_multix(df=df_agg_train_features[df_agg_train_features.Store==store].sort_values("Date").set_index("Date"),var1=var1,var2=var2)
    
plt.show()    

### CPI: Consumer Price Index

Unless there is an abrupt peak, this variable does not seem promising. There is a marginal reduction of the Christmas peak from one year to the next, which may be related to purchasing power, whereas in the other months sales are very close to the previous year. Further study is needed to understand its impact.

In [None]:
var1="sum_sales"
var2="CPI"
list_stores = random.choices(df_agg_train_features.Store.unique(),k=8)


width=30
height=30
fig = plt.figure(figsize=(width,height), dpi=80)
x=[421,422,423,424,425,426,427,428]
k=-1


for store in list_stores:    
    k+=1
    ax = fig.add_subplot(x[k])    
    plot_ts_multix(df=df_agg_train_features[df_agg_train_features.Store==store].sort_values("Date").set_index("Date"),var1=var1,var2=var2)
    
plt.show()    

### Fuel price

As with CPI, fuel price increased from one year to the next, which may have affected prices and sales at Christmas. It also needs careful study, since only the Christmas period seems to be clearly affected.

In [None]:
var1="sum_sales"
var2="Fuel_Price"
list_stores = random.choices(df_agg_train_features.Store.unique(),k=8)


width=30
height=30
fig = plt.figure(figsize=(width,height), dpi=80)
x=[421,422,423,424,425,426,427,428]
k=-1


for store in list_stores:    
    k+=1
    ax = fig.add_subplot(x[k])    
    plot_ts_multix(df=df_agg_train_features[df_agg_train_features.Store==store].sort_values("Date").set_index("Date"),var1=var1,var2=var2)
    
plt.show()  

### Markdowns

There were no clear insights from the markdowns. Some of them seem to occur after Christmas, which will not help much. MarkDown3 is related to Black Friday, which gives some insight into the sales peak, but that is also common sense; we could simply flag those weeks as we did with the feature IsHolidayfix. MarkDown5 also shows some dependency on the main holidays, but it is also active in other intervals where it does not seem to be really correlated with an increase in sales.
If there is any dependency with departments that appear only in a few weeks of the year, the markdowns could be interesting for this competition. For real life we cannot be sure: we really need to ask business people about the meaning of these features!

In [None]:
var1="sum_sales"
var2="MarkDown1"
list_stores = random.choices(df_agg_train_features.Store.unique(),k=8)


width=30
height=30
fig = plt.figure(figsize=(width,height), dpi=80)
x=[421,422,423,424,425,426,427,428]
k=-1


for store in list_stores:    
    k+=1
    ax = fig.add_subplot(x[k])    
    plot_ts_multix(df=df_agg_train_features[df_agg_train_features.Store==store].sort_values("Date").set_index("Date"),var1=var1,var2=var2)
    
plt.show()  

In [None]:
var1="sum_sales"
var2="MarkDown2"
list_stores = random.choices(df_agg_train_features.Store.unique(),k=8)


width=30
height=30
fig = plt.figure(figsize=(width,height), dpi=80)
x=[421,422,423,424,425,426,427,428]
k=-1


for store in list_stores:    
    k+=1
    ax = fig.add_subplot(x[k])    
    plot_ts_multix(df=df_agg_train_features[df_agg_train_features.Store==store].sort_values("Date").set_index("Date"),var1=var1,var2=var2)
    
plt.show()  

In [None]:
var1="sum_sales"
var2="MarkDown3"
list_stores = random.choices(df_agg_train_features.Store.unique(),k=8)


width=30
height=30
fig = plt.figure(figsize=(width,height), dpi=80)
x=[421,422,423,424,425,426,427,428]
k=-1


for store in list_stores:    
    k+=1
    ax = fig.add_subplot(x[k])    
    plot_ts_multix(df=df_agg_train_features[df_agg_train_features.Store==store].sort_values("Date").set_index("Date"),var1=var1,var2=var2)
    
plt.show() 

In [None]:
var1="sum_sales"
var2="MarkDown4"
list_stores = random.choices(df_agg_train_features.Store.unique(),k=8)


width=30
height=30
fig = plt.figure(figsize=(width,height), dpi=80)
x=[421,422,423,424,425,426,427,428]
k=-1


for store in list_stores:    
    k+=1
    ax = fig.add_subplot(x[k])    
    plot_ts_multix(df=df_agg_train_features[df_agg_train_features.Store==store].sort_values("Date").set_index("Date"),var1=var1,var2=var2)
    
plt.show() 

In [None]:
var1="sum_sales"
var2="MarkDown5"
list_stores = random.choices(df_agg_train_features.Store.unique(),k=8)


width=30
height=30
fig = plt.figure(figsize=(width,height), dpi=80)
x=[421,422,423,424,425,426,427,428]
k=-1


for store in list_stores:    
    k+=1
    ax = fig.add_subplot(x[k])    
    plot_ts_multix(df=df_agg_train_features[df_agg_train_features.Store==store].sort_values("Date").set_index("Date"),var1=var1,var2=var2)
    
plt.show() 

## Correlations

Now some correlation plots, using different date intervals, based on availability and strategy for training process.

### Date >= 2011-11-11 (Where markdowns are available)

Considering the target:
* As expected, lag, avg, min and max sales are highly correlated with the target.
* Markdowns 1 and 5 are slightly correlated. I would keep temperature due to its seasonal effect in January, although it is not strongly correlated.

In [None]:
cols_corr=[]
for col in df_train_features.select_dtypes(include=['float64']).columns:
    if(("_norm" not in col) and ("_Norm" not in col) and ("counts" not in col) and ("ndist" not in col)):
        cols_corr.append(col)
eda.plot_correlation(df_train_features[df_train_features.Date>="2011-11-11"][cols_corr],col1="Weekly_Sales")

## Pair correlations:

* Lags, averages and similar features are highly correlated with each other, which is why we will not use most of them.
* Markdowns 1 and 4 are also correlated with each other.
* CPI and Fuel_Price have a non-negligible negative correlation with each other, which suggests a dependency between prices and fuel. 
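
The pair correlations read off the plot can also be listed programmatically. The following is a minimal sketch, not part of the original analysis: the helper `correlated_pairs`, the 0.8 threshold and the toy values are assumptions chosen only for illustration.

```python
import pandas as pd

def correlated_pairs(df, threshold=0.8):
    """List column pairs whose absolute Pearson correlation is >= threshold.
    Hypothetical helper, not part of the notebook's eda module."""
    corr = df.corr()
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:  # upper triangle only, so each pair appears once
            r = corr.loc[a, b]
            if abs(r) >= threshold:
                pairs.append((a, b, round(float(r), 3)))
    return pairs

# Toy values (made up): two nearly collinear sales features and an unrelated one.
toy = pd.DataFrame({
    "lag_sales": [1.0, 2.0, 3.0, 4.0, 5.0],
    "avg_month": [1.1, 2.1, 2.9, 4.2, 5.0],
    "Temperature": [30.0, 10.0, 25.0, 5.0, 20.0],
})
pairs = correlated_pairs(toy)
print(pairs)
```

On the real data one would pass `df_train_features[cols_corr]` instead of the toy frame; pairs flagged this way are candidates for keeping only one of the two features.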

In [None]:
cols_corr=[]
for col in df_train_features.select_dtypes(include=['float64']).columns:
    if(("_norm" not in col) and ("_Norm" not in col) and ("counts" not in col) and ("ndist" not in col)):
        cols_corr.append(col)
eda.plot_correlation(df_train_features[df_train_features.Date>="2011-11-11"][cols_corr])

### Final Missing checks with features

In [None]:
eda.plot_missing(df_train_features[df_train_features.Date<"2011-11-11"],display=["Bar Chart"])

In [None]:
eda.plot_missing(df_train_features[df_train_features.Date>="2011-11-11"],display=["Bar Chart"])

## Models

Some points for the model:
1. Since using Markdowns does not look very promising in a short study, I will leave them aside in this POC. Doing so also extends the available time range for the model (the start moves from 2011-11 back to 2011-02). Using Markdowns over this longer range would require assumptions to deal with the missing data.
2. The train range starts at 2011-02-05, so that our one-year lag strategy can be applied.
3. Negative lag_sales and Weekly_Sales are filtered out for the training process. For the test labels, negative predictions are set to zero.  

In [None]:
filter_dates=None
filter_dates="2011-02-05"


#min2010-02-05
filter_positive=True
if(filter_dates is not None):
  print(f"filter dates: {filter_dates}")
  df_train_features_final = df_train_features[df_train_features.Date>=filter_dates]
else:
  df_train_features_final = df_train_features

if(filter_positive):
  print(f"filter positive: {filter_positive}")
  df_train_features_final = df_train_features_final[df_train_features_final.Weekly_Sales>=0]
  df_train_features_final = df_train_features_final[df_train_features_final.lag_sales>=0]
else:
  df_train_features_final = df_train_features_final

    

In [None]:
df_train_features_final.IsHolidayfix.unique()

## Transformations:
* Removing outliers with a 5% threshold (Pycaret uses SVD for this purpose)
* Normalizing (using zscore) for linear models
* Using the following features:
 1. Weekly_Sales as target
 2. Day_Range (dependency on the beginning and end of the month)
 3. Temperature, especially for some trends seen in January-February
 4. lag_sales --> One-year lag or moving average as explained before, especially for seasonal dependency.
 5. lag_avg_month --> To bring in month dependency. We decided to remove the Month variable in this version, reducing complexity
 6. min_sales_history_nonneg --> for the lower bound of the scale
 7. max_sales_history --> for the upper bound of the scale
    
As an improvement, it would be interesting to try the normalized Weekly_Sales, which would also reduce the need for min and max sales as model variables and capture more of the intrinsic shape of each store-dept series.    
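
To make the lag_sales feature concrete, here is a small sketch of the year-over-year join with its fallback chain. It mirrors the idea of `shift_dates` on made-up data for a single store and department; the toy values and the variable names `prev` and `out` are assumptions, not the notebook's actual objects.

```python
import pandas as pd

nan = float("nan")

# Toy data (made up): 2010 week 6 has no sales value, and 2010 has no week 7 at all.
df = pd.DataFrame({
    "Store": [1, 1, 1, 1, 1],
    "Dept": [1, 1, 1, 1, 1],
    "Year": [2010, 2010, 2011, 2011, 2011],
    "Week": [5, 6, 5, 6, 7],
    "Weekly_Sales": [100.0, nan, 110.0, 125.0, 130.0],
    "avg_sales_w": [105.0, 115.0, 112.0, 120.0, 125.0],
})

keys = ["Store", "Dept", "Year", "Week"]

# Relabel every row as belonging to the following year, so that a plain join
# on (Store, Dept, Year, Week) pairs each week with the same week one year earlier.
prev = (
    df[keys + ["Weekly_Sales", "avg_sales_w"]]
    .rename(columns={"Weekly_Sales": "lag_Weekly_Sales",
                     "avg_sales_w": "lag_avg_sales_w"})
    .assign(Year=lambda d: d["Year"] + 1)
)

out = df.merge(prev, on=keys, how="left")

# Fallback chain: last year's sales, then last year's moving average, then zero.
out["lag_sales"] = out["lag_Weekly_Sales"].fillna(out["lag_avg_sales_w"]).fillna(0)
print(out[["Year", "Week", "Weekly_Sales", "lag_sales"]])
```

Unlike a positional shift of 52 rows, the join stays aligned on the calendar week even when some weeks are missing from a series.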

In [None]:
experiment_name = "test" # for Mlflow use
log_experiment = False # for Mlflow use
log_plots = False# for Mlflow use
log_data = False# for Mlflow use
use_gpu = True  #yes we use GPUs for Xgboost
high_cardinality_features = None

data_split_stratify = [] 
bin_numeric_features = []  


train_size = 0.8
transform_target = False
transform_target_method = "box-cox"
remove_multicollinearity = False
multicollinearity_threshold = 0.9

combine_rare_levels = False  #
rare_level_threshold = 0.1

feature_interaction = False
interaction_threshold = 0.01

feature_selection = False
feature_selection_threshold = 0.9

normalize = True
normalize_method = "zscore"

pca = False
pca_components = 10

remove_outliers = True
outliers_threshold = 0.05

all_cols=df_train_features.columns

use_cols=["Weekly_Sales","Type","Day_Range",'Temperature', 'Size','lag_sales',"lag_avg_month",'min_sales_history_nonneg', 'max_sales_history']#"MarkDown1",
numeric_features = []  #"IsHolidayw"
categorical_features = ["Month"]
ordinal_features = {"Type":["C","B","A"]}

ignore_features = set(all_cols) - set(use_cols)
target_col="Weekly_Sales"
fix_imbalance=False 

In [None]:
df_train_features_final.Date.max(),df_train_features_final.Date.min()

## Train/Test and validation sample
Strategies:
* First separate the data into train_data (80%) and validation/hold-out (20%). The validation data is selected by date, as the last N months, to check the feasibility of forecasting months in advance and to keep a sample held out from the model's internal tests.

* From train_data, randomly select 80% for train and 20% for test. Since we do not use serial patterns in this task, we split the data randomly in order to guarantee plenty of seasonality in both samples.

* Afterwards, the whole data from 2011-02-05 on will be used for training.

* For the purpose of model selection we use MAE as the score. An implementation of WMAE (as explained in the description, weighting holiday weeks by 5) with sklearn's make_scorer might be used in future versions of this work to improve the selection a bit. In this version WMAE is used only in the validation process, for comparison with the MAE results.
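
For reference, the weighted metric mentioned in the last point can be written as follows. This follows the competition description (holiday weeks weighted five times); the symbols $y_i$, $\hat{y}_i$ and $w_i$ are notation introduced here, not taken from the notebook.

$$
\mathrm{WMAE} = \frac{1}{\sum_{i=1}^{n} w_i} \sum_{i=1}^{n} w_i \,\lvert y_i - \hat{y}_i \rvert,
\qquad
w_i = \begin{cases} 5 & \text{if week } i \text{ is a holiday week} \\ 1 & \text{otherwise} \end{cases}
$$

As a small worked example, take absolute errors of 100, 200 and 300 where only the third week is a holiday: MAE is $(100+200+300)/3 = 200$, while WMAE is $(100+200+5\cdot 300)/(1+1+5) = 1800/7 \approx 257.1$. The two metrics only drift apart when the errors on holiday weeks differ from those on ordinary weeks.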


In [None]:
train_fraction=0.8
total = len(df_train_features_final)
train_sample = int(total*train_fraction)

In [None]:
print(total,train_sample)

In [None]:

df_train_features_final=df_train_features_final.sort_values("Date")
date_val = df_train_features_final[0:train_sample].Date.max()
train_data = df_train_features_final[df_train_features_final.Date<date_val]
val_data  = df_train_features_final[df_train_features_final.Date>=date_val]

## Checking dates

In [None]:
train_data.Date.min(),train_data.Date.max(),train_data.shape

In [None]:
val_data.Date.min(),val_data.Date.max(),val_data.shape

In [None]:
train_data.shape,val_data.shape

## Pycaret setup for train used for first selections and validations

In [None]:
reg_val = pyreg.setup(data=train_data,target=target_col,
            high_cardinality_features=high_cardinality_features,
     ignore_features=ignore_features,categorical_features=categorical_features,
     silent=True, experiment_name=experiment_name,
     normalize=normalize, normalize_method=normalize_method,
     rare_level_threshold=rare_level_threshold, combine_rare_levels=combine_rare_levels,
     html=False, log_experiment=log_experiment, log_plots=log_plots, log_data=log_data,
     numeric_features=numeric_features,
     remove_multicollinearity=remove_multicollinearity,
     multicollinearity_threshold=multicollinearity_threshold,
     feature_interaction=feature_interaction, interaction_threshold=interaction_threshold,
     pca=pca, pca_components=pca_components, feature_selection=feature_selection,
     feature_selection_threshold=feature_selection_threshold, train_size=train_size,
     use_gpu=use_gpu,
     remove_outliers=remove_outliers,
     outliers_threshold=outliers_threshold,
     bin_numeric_features=bin_numeric_features,
     ordinal_features=ordinal_features,transform_target=transform_target
     )


## Sample of transformed data

In [None]:
pyreg.get_config("X_train").head(30).T


## Comparing Defaults: "lightgbm","ridge","lr","xgboost"

Using the default configs, which might of course be a bit misleading, Xgboost seems to be the best choice for this POC. We will use the linear model from sklearn (lr) as a benchmark, along with the plain lag_sales.

In [None]:
start_time = time.time()
best = pyreg.compare_models(include=["lightgbm","ridge","lr","xgboost"],fold=2,sort="MAE")
print("--- %s seconds ---" % (time.time() - start_time))

In [None]:
best

## A few plots

The main features are the lags and lag averages, since they are intrinsically related to scale and seasonal behaviour. The other ones have a small effect in some cases. An obvious improvement for this model is to rely less on these features and find others related to the intrinsic behaviour of store sales. 

In [None]:
pyreg.plot_model(best,plot="feature")

In [None]:
pyreg.plot_model(best,plot="feature_all")

In [None]:
linear_model = pyreg.create_model("lr",fold=3)

### Finalizing Linear model for later use

In [None]:
# Finalizing (test+train sample together) for further studies with validation samples
linear_model_final = pyreg.finalize_model(linear_model)

### Creating Xgboost model

In [None]:
#Creating Xgboost model
start_time = time.time()
model_train = pyreg.create_model("xgboost",fold=3,max_depth=6,n_estimators=100,learning_rate=0.3)
print("--- %s seconds ---" % (time.time() - start_time))

# First Grid to check Depth

Depth is closely tied to the complexity of our model, so we scan it first.

In [None]:
fold=3
# tune hyperparameters with custom_grid
params = {"max_depth": [2,4,6,8,10,12,14]
          }
tuned_model = pyreg.tune_model(model_train, custom_grid = params,return_tuner=True,return_train_score=True,fold=fold,optimize="MAE")

In [None]:
tuned_model[0]

In [None]:
def get_metrics(tuned_model,save_file=None):
    """ This is to get the metrics from train and test"""
    rand = tuned_model[1]
    params = rand.cv_results_["params"][0]
    data = dict()
    for name in params.keys():
        print(name, name.split("__")[1])
        data[name.split("__")[1]] = [rand.cv_results_["params"][i][name] for i in
                                     range(len(rand.cv_results_["params"]))]

    data["mean_train"] = rand.cv_results_["mean_train_score"]
    data["mean_test"] = rand.cv_results_["mean_test_score"]
    data["std_train"] = rand.cv_results_["std_train_score"]
    data["std_test"] = rand.cv_results_["std_test_score"]
    data["mean_fit_time"] = rand.cv_results_["mean_fit_time"]

    df_metrics = pd.DataFrame(data=data)
    if(save_file is not None):
        print(f"Saving file: {save_file}")
        df_metrics.to_csv(save_file)
    return df_metrics    

def plot_metrics(df_metrics):
    """ Plotting Metrics x Depth"""
    sns.lineplot(x="max_depth",y="mean_train",data=df_metrics,marker="o",label="train")
    sns.lineplot(x="max_depth",y="mean_test",data=df_metrics,marker="o",label="test")

In [None]:
df_metrics_depth = get_metrics(tuned_model=tuned_model,save_file="metrics_study_xgboost_mae_depth_with_val_sample.csv")

## Checking Depth dependency
The model does not seem to improve beyond depths of 8-10, so I will keep depth=10 for the validation tests and then try to reduce complexity by adding parameters such as colsample_bytree to the hyperparameter tuning. For future improvements, it may be safer to stay around depths 6-8 and tune the other parameters more carefully. 
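
One way to turn "no improvement after a certain depth" into a rule is to pick the smallest depth whose test error is within a small tolerance of the best one. The sketch below is only an illustration: the helper `smallest_good_depth`, the 1% tolerance and the error values are assumptions, not results from this notebook. It expects positive errors where lower is better, so negated scores (as in sklearn's `neg_mean_absolute_error`) would need their sign flipped first.

```python
def smallest_good_depth(scores, tolerance=0.01):
    """Return the smallest depth whose error is within `tolerance` (relative)
    of the best error. `scores` maps max_depth -> mean test MAE."""
    best = min(scores.values())
    threshold = best * (1 + tolerance)
    return min(depth for depth, err in scores.items() if err <= threshold)

# Hypothetical mean test MAE per depth (made-up numbers).
mae_by_depth = {2: 2400.0, 4: 1900.0, 6: 1650.0, 8: 1560.0,
                10: 1550.0, 12: 1549.0, 14: 1551.0}

chosen = smallest_good_depth(mae_by_depth, tolerance=0.01)
strict = smallest_good_depth(mae_by_depth, tolerance=0.0)
print(chosen, strict)
```

With a tolerance the rule prefers the simpler model when deeper trees bring only marginal gains; with zero tolerance it simply returns the depth with the lowest error.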

In [None]:
plot_metrics(df_metrics_depth)

In [None]:
tuned_model[0]

In [None]:
pyreg.plot_model(tuned_model[0],plot="feature")

In [None]:
final_model_val = pyreg.finalize_model(tuned_model[0])

# Plotting on validation 

For the validation, we will compare the model curve with the data and with a baseline assumption (using only the 1-year lag information). We expect our model to perform better than the plain lag baseline.

In [None]:
predictions=pyreg.predict_model(final_model_val,data=df_train_features_final)
predictions_lr=pyreg.predict_model(linear_model_final,data=df_train_features_final)

In [None]:
def apply_positives(df):
    if(df["Label"]<0):
        return 0
    else:
        return df["Label"]
predictions["Label"] =  predictions.apply(apply_positives,axis=1) 
predictions_lr["Label"] =  predictions_lr.apply(apply_positives,axis=1) 

In [None]:
pred_stores_week_data =  stores_week(predictions,col="Weekly_Sales").drop(columns=['n_dist_Store_Dept']).sort_values(['Type', 'Date'])
pred_stores_week_label =  stores_week(predictions,col="Label").drop(columns=['n_dist_Store_Dept']).sort_values(['Type', 'Date']).rename(columns={"avg_sales": "avg_sales_label", "sum_sales": "sum_sales_label"})

pred_stores_week_data_lr =  stores_week(predictions_lr,col="Weekly_Sales").drop(columns=['n_dist_Store_Dept']).sort_values(['Type', 'Date'])
pred_stores_week_label_lr =  stores_week(predictions_lr,col="Label").drop(columns=['n_dist_Store_Dept']).sort_values(['Type', 'Date']).rename(columns={"avg_sales": "avg_sales_label", "sum_sales": "sum_sales_label"})

pred_stores_week_lag =  stores_week(predictions,col="lag_sales").drop(columns=['n_dist_Store_Dept']).sort_values(['Type', 'Date']).rename(columns={"avg_sales": "avg_sales_lag", "sum_sales": "sum_sales_lag"})

In [None]:
pred_stores_week_label.columns,pred_stores_week_data.columns

In [None]:
pred_stores_week =  pred_stores_week_data.merge(pred_stores_week_label,on=['Type', 'Date',"Week_Index"], how="left")
pred_stores_week_lr =  pred_stores_week_data_lr.merge(pred_stores_week_label_lr,on=['Type', 'Date',"Week_Index"], how="left")

In [None]:
pred_stores_week.columns

In [None]:
from sklearn.metrics import mean_absolute_error


In [None]:
try:
    del(predictions_val)
except:
    pass
try:
    del(predictions_val_lr)
except:
    pass

predictions_val = predictions[predictions.Date>=date_val].copy()
predictions_val_lr = predictions_lr[predictions_lr.Date>=date_val].copy()

##  Checking WMAE x MAE

On the validation sample, Xgboost is again the best, followed by the linear model. We also evaluate lag_sales as a prediction.
As expected, WMAE is not too different from MAE. It seems to be relevant only for selections with small margin gains.

In [None]:
def WMAE(df, y, ypred):
    weights = df.IsHoliday.apply(lambda x: 5 if x else 1)
    return np.round(np.sum(weights*abs(y-ypred))/(np.sum(weights)), 2)

In [None]:
print("Xgboost:",mean_absolute_error(predictions_val["Weekly_Sales"],predictions_val["Label"]) )
print("lag_sales:",mean_absolute_error(predictions_val["Weekly_Sales"],predictions_val["lag_sales"]) )
print("lr:",mean_absolute_error(predictions_val_lr["Weekly_Sales"],predictions_val_lr["Label"]) )

In [None]:
print("Xgboost:",WMAE(predictions_val,predictions_val["Weekly_Sales"],predictions_val["Label"]) )
print("lag_sales:",WMAE(predictions_val,predictions_val["Weekly_Sales"],predictions_val["lag_sales"]) )
print("lr:",WMAE(predictions_val_lr,predictions_val_lr["Weekly_Sales"],predictions_val_lr["Label"]) )

## Checking the sum of sales for grouping by Store Type

The motivation here is to study the overall distributions segmented by store type, which we have seen has an important effect on the time series, especially for type C, which does not appear to have a well-behaved time series. 
From the plots, the trained model seems to describe the curve for types A and B quite well, fluctuating up and down around the data as expected (also on the hold-out validation data). For type C, it seems to underpredict the growth seen after Week_Index 100.
The lag distribution seems to overpredict the peak, underpredict the other regions, and describe the Type C stores badly.

In [None]:
for typestore in["A","B","C"]:
    var1="sum_sales"
    var2="sum_sales_label"
    var3="sum_sales_lag"
    pred_stores_week_val = pred_stores_week[pred_stores_week.Date>=date_val]
    pred_stores_week_train = pred_stores_week[pred_stores_week.Date<date_val]
    
    pred_stores_week_val_lr = pred_stores_week_lr[pred_stores_week_lr.Date>=date_val]
    pred_stores_week_train_lr = pred_stores_week_lr[pred_stores_week_lr.Date<date_val]
    
    width = 25
    height = 10
    plt.figure(figsize=(width,height))
    plot_ts(df=pred_stores_week[pred_stores_week.Type==typestore].set_index("Week_Index"),var=var1,label="data")
    plot_ts(df=pred_stores_week_lag[pred_stores_week_lag.Type==typestore].set_index("Week_Index"),var=var3,label="lag")
    plot_ts(df=pred_stores_week_train[pred_stores_week_train.Type==typestore].set_index("Week_Index"),var=var2,label="train labels")
    plot_ts(df=pred_stores_week_val[pred_stores_week_val.Type==typestore].set_index("Week_Index"),var=var2,label="validation labels",color="black")


    plt.legend()

In [None]:
for typestore in["A","B","C"]:
    var1="avg_sales"
    var2="avg_sales_label"
    var3="avg_sales_lag"
    # Same boundaries as the train/validation split above (validation starts at date_val)
    pred_stores_week_val = pred_stores_week[pred_stores_week.Date>=date_val]
    pred_stores_week_train = pred_stores_week[pred_stores_week.Date<date_val]
    width = 25
    height = 10
    plt.figure(figsize=(width,height))
    plot_ts(df=pred_stores_week[pred_stores_week.Type==typestore].set_index("Week_Index"),var=var1,label="data")
    plot_ts(df=pred_stores_week_lag[pred_stores_week_lag.Type==typestore].set_index("Week_Index"),var=var3,label="lag")
    plot_ts(df=pred_stores_week_train[pred_stores_week_train.Type==typestore].set_index("Week_Index"),var=var2,label="train labels")
    plot_ts(df=pred_stores_week_val[pred_stores_week_val.Type==typestore].set_index("Week_Index"),var=var2,label="validation labels",color="black")
    plt.legend()

## Brief Discussion on Linear Models

Looking at the time series from the linear model in comparison with Xgboost, it seems to be slightly worse, especially for Store Type C. Since this type does not follow the lag trend well, other nonlinear relations may play an important role.

In [None]:
for typestore in["A","B","C"]:
    var1="sum_sales"
    var2="sum_sales_label"
    var3="sum_sales_lag"
    pred_stores_week_val = pred_stores_week[pred_stores_week.Date>=date_val]
    pred_stores_week_train = pred_stores_week[pred_stores_week.Date<date_val]
    
    pred_stores_week_val_lr = pred_stores_week_lr[pred_stores_week_lr.Date>=date_val]
    pred_stores_week_train_lr = pred_stores_week_lr[pred_stores_week_lr.Date<date_val]
    
    width = 25
    height = 10
    plt.figure(figsize=(width,height))
    plot_ts(df=pred_stores_week[pred_stores_week.Type==typestore].set_index("Week_Index"),var=var1,label="data")

    
    
    plot_ts(df=pred_stores_week_train[pred_stores_week_train.Type==typestore].set_index("Week_Index"),var=var2,label="train labels",color="red")
    plot_ts(df=pred_stores_week_val[pred_stores_week_val.Type==typestore].set_index("Week_Index"),var=var2,label="validation labels",color="black")

    
    plot_ts(df=pred_stores_week_train_lr[pred_stores_week_train_lr.Type==typestore].set_index("Week_Index"),var=var2,label="train labels lr",color="green")
    plot_ts(df=pred_stores_week_val_lr[pred_stores_week_val_lr.Type==typestore].set_index("Week_Index"),var=var2,label="validation labels lr",color="magenta")

    plt.legend()

In [None]:
for typestore in["A","B","C"]:
    var1="avg_sales"
    var2="avg_sales_label"
    var3="avg_sales_lag"
    pred_stores_week_val = pred_stores_week[pred_stores_week.Date>=date_val]
    pred_stores_week_train = pred_stores_week[pred_stores_week.Date<date_val]
    
    pred_stores_week_val_lr = pred_stores_week_lr[pred_stores_week_lr.Date>=date_val]
    pred_stores_week_train_lr = pred_stores_week_lr[pred_stores_week_lr.Date<date_val]
    
    width = 25
    height = 10
    plt.figure(figsize=(width,height))
    plot_ts(df=pred_stores_week[pred_stores_week.Type==typestore].set_index("Week_Index"),var=var1,label="data")

    
    
    plot_ts(df=pred_stores_week_train[pred_stores_week_train.Type==typestore].set_index("Week_Index"),var=var2,label="train labels",color="red")
    plot_ts(df=pred_stores_week_val[pred_stores_week_val.Type==typestore].set_index("Week_Index"),var=var2,label="validation labels",color="black")

    
    plot_ts(df=pred_stores_week_train_lr[pred_stores_week_train_lr.Type==typestore].set_index("Week_Index"),var=var2,label="train labels lr",color="green")
    plot_ts(df=pred_stores_week_val_lr[pred_stores_week_val_lr.Type==typestore].set_index("Week_Index"),var=var2,label="validation labels lr",color="magenta")

    plt.legend()

## Finalize Model

Let's finalize the model using the whole data from 2011-02-05 on.

In [None]:
reg_final = pyreg.setup(data=df_train_features_final,target=target_col,
            high_cardinality_features=high_cardinality_features,
     ignore_features=ignore_features,categorical_features=categorical_features,
     silent=True, experiment_name=experiment_name,
     normalize=normalize, normalize_method=normalize_method,
     rare_level_threshold=rare_level_threshold, combine_rare_levels=combine_rare_levels,
     html=False, log_experiment=log_experiment, log_plots=log_plots, log_data=log_data,
     numeric_features=numeric_features,
     remove_multicollinearity=remove_multicollinearity,
     multicollinearity_threshold=multicollinearity_threshold,
     feature_interaction=feature_interaction, interaction_threshold=interaction_threshold,
     pca=pca, pca_components=pca_components, feature_selection=feature_selection,
     feature_selection_threshold=feature_selection_threshold, train_size=train_size,
     use_gpu=use_gpu,
     remove_outliers=remove_outliers,
     outliers_threshold=outliers_threshold,
     bin_numeric_features=bin_numeric_features,
     ordinal_features=ordinal_features,transform_target=transform_target
     )

### Creating generic model for tuning

In [None]:
start_time = time.time()
model_train_xgb = pyreg.create_model("xgboost",fold=3,max_depth=6,n_estimators=100,learning_rate=0.3)
print("--- %s seconds ---" % (time.time() - start_time))

## Hypertuning 

We will try up to 30 samples from a randomized grid. We also worked with a Bayesian approach using scikit-optimize, which also worked well. The main purpose here is just to find a first group of parameters. Since we saw that going beyond depth 10 is pointless, we will work with combinations of colsample_bytree (reducing complexity), learning_rate and n_estimators for depths from 6 to 10.
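
To get a feeling for how much of the search space 30 samples cover, the grid used below can be enumerated by hand. This is only a sketch of the idea of random search with the standard library; it is not how PyCaret or scikit-learn sample internally, and the fixed seed is an assumption made for reproducibility.

```python
import itertools
import random

# Same grid as in the tuning cell below.
params = {
    "max_depth": [6, 7, 8, 9, 10],
    "learning_rate": [0.1, 0.3],
    "n_estimators": [200, 100],
    "colsample_bytree": [0.6, 0.8, 1.0],
}

# Full Cartesian product of the grid: 5 * 2 * 2 * 3 combinations.
names = list(params)
grid = [dict(zip(names, values)) for values in itertools.product(*params.values())]

# Draw 30 distinct combinations at random (sampling without replacement).
rng = random.Random(0)
candidates = rng.sample(grid, k=30)

print(len(grid), len(candidates))
```

The full grid has 60 combinations, so 30 random draws visit half of it; each draw is then evaluated with 3-fold cross-validation in the tuning step.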

In [None]:
start_time = time.time()
fold=3
# tune hyperparameters with custom_grid
params = {"max_depth": [6,7,8,9,10],
          "learning_rate":[0.1,0.3],
          "n_estimators":[200,100],
          "colsample_bytree":[0.6,0.8,1.0]
          }
tuned_model_final = pyreg.tune_model(model_train_xgb, 
                                     custom_grid = params,
                                     return_tuner=True,
                                     return_train_score=True,
                                     fold=fold,
                                     optimize="MAE",
                                     n_iter=30)
print("--- %s seconds ---" % (time.time() - start_time))

## Getting final model

In [None]:
tuned_model_final[0]

In [None]:
tuned_model_final[1] 

In [None]:
df_metrics_final = get_metrics(tuned_model=tuned_model_final,save_file="metrics_study_xgboost_mae_final.csv")

In [None]:
df_metrics_final

In [None]:
df_metrics_final.sort_values("mean_test")

In [None]:
final_model_xgb = pyreg.finalize_model(tuned_model_final[0])


In [None]:
predictions_xgb=pyreg.predict_model(final_model_xgb,df_test_features[df_test_features.Date>='2012-11-02'])
# predictions_lr=pyreg.predict_model(final_model_lr,df_test_features[df_test_features.Date>='2012-11-02']) # for linear

In [None]:
df_sample_pyreg_xgb = df_sample.drop(columns=["Weekly_Sales"]).merge(predictions_xgb,on=["Id"],how="inner")
# df_sample_pyreg_lr = df_sample.drop(columns=["Weekly_Sales"]).merge(predictions_lr,on=["Id"],how="inner") #for linear 

In [None]:
df_sample_pyreg_xgb["Weekly_Sales"] = df_sample_pyreg_xgb["Label"]
# df_sample_pyreg_lr["Weekly_Sales"] = df_sample_pyreg_lr["Label"]

## Fixing any negative weekly sales

In [None]:
def apply_non_zeros(sales):
    """Floor a predicted sales value at zero."""
    return 0 if sales < 0 else sales

In [None]:
df_sample_pyreg_zeros_xgb=df_sample_pyreg_xgb.copy()
df_sample_pyreg_zeros_xgb["Weekly_Sales"] = df_sample_pyreg_zeros_xgb["Weekly_Sales"].apply(apply_non_zeros)
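
The same flooring can also be done in one vectorized call with pandas `Series.clip`. The sketch below uses a small toy frame with made-up values rather than the real predictions.

```python
import pandas as pd

# Toy predictions; ids and values are made up for illustration.
toy = pd.DataFrame({
    "Id": ["a", "b", "c"],
    "Weekly_Sales": [1520.5, -37.2, 0.0],
})

# Replace every negative prediction with 0 and leave the rest untouched.
toy["Weekly_Sales"] = toy["Weekly_Sales"].clip(lower=0)

print(toy)
```

On a submission with many rows this avoids a Python-level function call per row.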

In [None]:
df_sample_pyreg_zeros_xgb[["Id","Weekly_Sales"]].to_csv('/kaggle/working/final_hyper_xgb.csv',index=False)


In [None]:
df_sample_pyreg_zeros_xgb.head(2)

## Output with lag_sales

In [None]:
# df_sample_pyreg_lag=df_sample_pyreg_xgb
# df_sample_pyreg_lag["Weekly_Sales"]=df_sample_pyreg_lag["lag_sales"]
# df_sample_pyreg_lag_zeros=df_sample_pyreg_lag.copy()
# df_sample_pyreg_lag_zeros["Weekly_Sales"] = df_sample_pyreg_lag_zeros.apply(lambda x:apply_non_zeros(x["Weekly_Sales"]),axis=1)
# df_sample_pyreg_lag_zeros[["Id","Weekly_Sales"]].to_csv('/kaggle/working/final_lag_zeros.csv',index=False)
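
For reference, a lag feature of this kind can be built by shifting the history forward and joining it back on Store, Dept and Date. The sketch below is a minimal illustration, assuming a lag of 52 weeks (364 days) and a hypothetical helper name `add_lag_sales`; it is not necessarily how `lag_sales` was computed earlier in this notebook.

```python
import pandas as pd

def add_lag_sales(history, target, weeks=52):
    """Attach sales from `weeks` weeks earlier as a `lag_sales` column.

    Hypothetical helper: the 52-week lag is an assumption for illustration.
    """
    lagged = history[["Store", "Dept", "Date", "Weekly_Sales"]].copy()
    # Shift each historical week forward so it lines up with the target week.
    lagged["Date"] = lagged["Date"] + pd.Timedelta(weeks=weeks)
    lagged = lagged.rename(columns={"Weekly_Sales": "lag_sales"})
    # Left join keeps every target row; missing history yields NaN.
    return target.merge(lagged, on=["Store", "Dept", "Date"], how="left")

# Toy data with made-up values.
history = pd.DataFrame({
    "Store": [1, 1],
    "Dept": [1, 2],
    "Date": pd.to_datetime(["2011-11-04", "2011-11-04"]),
    "Weekly_Sales": [100.0, 250.0],
})
target = pd.DataFrame({
    "Store": [1, 1, 1],
    "Dept": [1, 2, 3],
    "Date": pd.to_datetime(["2012-11-02"] * 3),
})

result = add_lag_sales(history, target)
print(result)
```

Joining on the shifted date, rather than shifting rows by position, keeps the lag correct even when a Store-Dept has missing weeks; those rows simply get `NaN`.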

## Final Scores:

* XGBoost: 2998.46 (private), 2848.31 (public)
* Lag sales: 3025.84 (private), 2943.89 (public)
* Linear model: 3010.27 (private), 3088.29 (public)


# Discussion

Overall, the XGBoost model scored best on both the public and the private data. Its margin was larger on the public data, perhaps because depth=10 overfitted that sample slightly.
The lag sales even beat the linear model on the public data. On the private data it was the opposite, which suggests that the last part of the data depended less on the previous year's results. Perhaps more Store Type C data was hidden there, for which it was harder to find a good trend based on previous years.

I think much work still needs to be done on feature engineering, especially generating better features and relying a bit less on averages and lags. Important things to consider as next steps:

* Maybe use normalized sales instead of absolute weekly sales.
* Better understand the missing dates for Store-Depts.
* Understand the specific behaviour of Type C stores.
* Better understand how to use markdowns.
* Maybe cluster Store-Depts (Types A, B and C may not be enough) or find dependencies between specific stores or Depts.
* With parallel computing, maybe pre-fit models for single Store-Depts.
* Sequence models such as LSTM may be interesting.
* Improve the use of WMAE, or check carefully its importance.
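
As a starting point for the last item, here is a minimal sketch of a weighted mean absolute error, assuming the weighting from the Kaggle description (holiday weeks count five times as much as other weeks). The function name `wmae` and its arguments are illustrative, not part of the notebook above.

```python
import numpy as np

def wmae(y_true, y_pred, is_holiday, holiday_weight=5.0):
    """Weighted MAE: sum(w * |y - y_hat|) / sum(w).

    Assumes w = holiday_weight for holiday weeks and 1 otherwise.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    weights = np.where(np.asarray(is_holiday, dtype=bool), holiday_weight, 1.0)
    return float(np.sum(weights * np.abs(y_true - y_pred)) / np.sum(weights))

# Toy example with made-up values: absolute errors are 10, 10 and 30,
# and only the last week is a holiday week.
score = wmae(
    y_true=[100, 200, 300],
    y_pred=[110, 190, 330],
    is_holiday=[False, False, True],
)
print(score)  # (10 + 10 + 5 * 30) / (1 + 1 + 5) = 170 / 7
```

In this toy case the plain MAE would be about 16.7, while the weighted score is about 24.3, which shows how much a single holiday-week error can dominate the metric.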