#<div style="padding:20px;color:white;margin:0;font-size:175%;text-align:center;display:fill;border-radius:5px;background-color:#016CC9;overflow:hidden;font-weight:500"> Competition target.</div>

In this competition, the training data contains the daily sales of various product families in various stores.
We are asked to predict the sales for the 16 days following the training period.

# <b><span style='color:#4B4B4B'>1 |</span><span style='color:#016CC9'> Load data </span></b>

In [None]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import matplotlib.ticker as plticker
import seaborn as sns
pd.options.mode.chained_assignment = None
train=pd.read_csv('../input/store-sales-time-series-forecasting/train.csv')
print(train.head(3))

In [None]:
stores=pd.read_csv('../input/store-sales-time-series-forecasting/stores.csv')
print(stores.head(3))

# <b><span style='color:#4B4B4B'>2 |</span><span style='color:#016CC9'> First look at the data</span></b> 

We would first like to get a quantitative idea of how much data we need to process for training.

In [None]:
# Arguments are passed in the same order as the placeholders: days, stores, families
print('There are {} days, {} stores and {} families in the training data'.format(train['date'].nunique(),
                                                                                train['store_nbr'].nunique(),
                                                                                train['family'].nunique()))

There are **33x54=1782** possible (store, family) pairs. Since we have to predict the sales over 16 days, we have to produce **16x1782=28512** values. Let's check whether this is consistent with the test file.

In [None]:
test=pd.read_csv('../input/store-sales-time-series-forecasting/test.csv')
print(test.head(3))

In [None]:
# Arguments are passed in the same order as the placeholders: rows, days, stores, families
print('Test has {} rows, {} days, {} stores and {} families'.format(len(test),test['date'].nunique(),
                                                                                test['store_nbr'].nunique(),
                                                                                test['family'].nunique()))


We have checked that **test** is consistent with **train** and now have a clear quantitative idea of the competition. There are 1782 (store, family) pairs. That is too many series to visualize individually, so grouping will help us get a clear picture of the time series. We will group by **stores** first and then by **families**.
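
The counting argument above assumes that every (store, family) pair is present on every day. A minimal sketch of how that assumption could be checked is shown here on a small made-up frame (`toy_train` is hypothetical and only mimics the columns of `train`); the same lines should apply to `train` or `test` directly.

```python
import pandas as pd

# Toy stand-in for train.csv: 2 days x 2 stores x 3 families (made-up values).
toy_train = pd.DataFrame(
    [(d, s, f, 1.0)
     for d in ["2013-01-01", "2013-01-02"]
     for s in [1, 2]
     for f in ["BEVERAGES", "DAIRY", "PRODUCE"]],
    columns=["date", "store_nbr", "family", "sales"],
)

n_days = toy_train["date"].nunique()
n_stores = toy_train["store_nbr"].nunique()
n_families = toy_train["family"].nunique()

# Number of (store, family) pairs that actually occur in the data.
n_pairs = toy_train.groupby(["store_nbr", "family"]).ngroups

# If every pair is present on every day, both checks below hold.
all_pairs_present = n_pairs == n_stores * n_families
one_row_per_pair_per_day = len(toy_train) == n_days * n_pairs
print(n_pairs, all_pairs_present, one_row_per_pair_per_day)
```

If either check came out `False` on the real data, some pairs or days would be missing and the row count would no longer equal days x stores x families.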

# <b><span style='color:#4B4B4B'>3 |</span><span style='color:#016CC9'> Stores  </span></b>

Let's group the training data by stores.

In [None]:
store_daily=train.groupby(['date','store_nbr']).agg({'sales':'sum'}).reset_index()
store_daily.rename(columns={'sales':'store_daily_sales'},inplace=True)

We can now visualize the time series for each store. We will use a 150-day moving average for more clarity.
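
As a reminder of what a moving average does, here is a small sketch on made-up numbers, with a window of 3 instead of 150 so the result is easy to read. With the default settings, `rolling(n).mean()` returns `NaN` for the first `n-1` rows, which is why each smoothed curve only starts some time after the first date. The `min_periods=1` variant is an optional alternative and is not what this notebook uses.

```python
import pandas as pd

# Made-up daily sales for one store (not competition data).
sales = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])

# Window of 3 for readability; the notebook uses 150.
ma_strict = sales.rolling(3).mean()

# min_periods=1 averages whatever is available at the start of the series.
ma_partial = sales.rolling(3, min_periods=1).mean()

print(ma_strict.tolist())   # the first two entries are NaN
print(ma_partial.tolist())  # no NaN, but early values average fewer days
```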

In [None]:
fig, ax = plt.subplots(figsize=(20,10))
for store_nbr in store_daily['store_nbr'].unique():
    # Copy the slice so that the new column is added to an independent frame
    store_daily_i = store_daily.loc[store_daily['store_nbr']==store_nbr].copy()
    store_daily_i['MA_store_daily_sales']=store_daily_i['store_daily_sales'].rolling(150).mean()
    ax.plot(store_daily_i['date'],store_daily_i['MA_store_daily_sales'])
ax.set_xlabel('date', fontsize=18)
ax.set_ylabel('sales / store', fontsize=16)
ax.set_title('Sales by Store time series',fontsize=18)
loc = plticker.MultipleLocator(base=60) # this locator puts ticks at regular intervals
ax.xaxis.set_major_locator(loc)
ax.tick_params(axis='x', rotation=70)
plt.show()

We can see that some stores consistently sell more than the others, while other stores had no sales at the beginning and picked up later. We would now like to visualize the sales distribution over the stores.

In [None]:
store=store_daily.groupby(['store_nbr']).agg({'store_daily_sales':'sum'}).reset_index()
store.rename(columns={'store_daily_sales':'store_sales'},inplace=True)

In [None]:
plt.figure(figsize=(20,10))
sns.barplot(x=store['store_nbr'],y=store['store_sales'])
plt.title('Sales distribution by Store', fontsize=18)
plt.xlabel('Store',fontsize=16)
plt.ylabel('Sales',fontsize=16)
plt.show()

This graph clearly shows which stores have the highest total sales over the training period.

# <b><span style='color:#4B4B4B'>4 |</span><span style='color:#016CC9'> Families</span></b>

Let's now group the sales by family.

In [None]:
family_daily=train.groupby(['date','family']).agg({'sales':'sum'}).reset_index()
family_daily.rename(columns={'sales':'family_daily_sales'},inplace=True)

We can now visualize the time series for each family. We will use a 50-day moving average for more clarity.

In [None]:
fig, ax = plt.subplots(figsize=(20,10))
for family in family_daily['family'].unique():
    # Copy the slice so that the new column is added to an independent frame
    family_daily_i = family_daily.loc[family_daily['family']==family].copy()
    family_daily_i['MA_family_daily_sales']=family_daily_i['family_daily_sales'].rolling(50).mean()
    ax.plot(family_daily_i['date'],family_daily_i['MA_family_daily_sales'])
ax.set_xlabel('date', fontsize=16)
ax.set_ylabel('sales / family', fontsize=16)
ax.set_title('Sales by Family time series',fontsize=18)
loc = plticker.MultipleLocator(base=60) # this locator puts ticks at regular intervals
ax.xaxis.set_major_locator(loc)
ax.tick_params(axis='x', rotation=70)
plt.show()

We can see that some families consistently sell more than the others. We would now like to visualize the sales distribution over the families.

In [None]:
family=family_daily.groupby('family').agg({'family_daily_sales':'sum'}).reset_index()
family.rename(columns={'family_daily_sales':'family_sales'},inplace=True)

In [None]:
plt.figure(figsize=(20,10))
sns.barplot(x=family['family'],y=family['family_sales'])
plt.xticks(rotation=70)
plt.title('Sales distribution by Family', fontsize=18)
plt.xlabel('Family',fontsize=16)
plt.ylabel('Sales',fontsize=16)

plt.show()

**Grocery I, Beverages, Produce and Cleaning** are the most successful families.
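
To put a number on how dominant the top families are, one option is to rank them and compute each family's share of total sales. The sketch below uses a hypothetical `toy_family` frame with made-up totals; it has the same columns as `family`, so the same lines could be applied to the real frame. The `share` and `cumulative_share` columns are names introduced here for illustration.

```python
import pandas as pd

# Hypothetical totals per family (made-up numbers, same columns as `family`).
toy_family = pd.DataFrame({
    "family": ["GROCERY I", "BEVERAGES", "PRODUCE", "CLEANING", "BOOKS"],
    "family_sales": [400.0, 250.0, 200.0, 100.0, 50.0],
})

# Rank families from largest to smallest total.
ranked = toy_family.sort_values("family_sales", ascending=False).reset_index(drop=True)

# Fraction of all sales due to each family, and the running total of that fraction.
ranked["share"] = ranked["family_sales"] / ranked["family_sales"].sum()
ranked["cumulative_share"] = ranked["share"].cumsum()
print(ranked)
```

The cumulative share shows how many families are needed to cover a given fraction of all sales.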

# <b><span style='color:#4B4B4B'>5 |</span><span style='color:#016CC9'> Total sales time series </span></b>

We will now visualize the total sales time series. This will give us some important clues regarding trend and seasonality. Let's group train by day.

In [None]:
total_daily=train.groupby(['date']).agg({'sales':'sum'}).reset_index()
total_daily.rename(columns={'sales':'total_daily_sales'},inplace=True)
print(total_daily.head(3))

In [None]:

total_daily['MA_total_daily_sales']=total_daily['total_daily_sales'].rolling(30).mean()

fig, ax = plt.subplots(figsize=(15,10))
ax.plot(total_daily['date'],total_daily['MA_total_daily_sales']) 
ax.set_xlabel('date', fontsize=16)
ax.set_ylabel('total sales', fontsize=16)
ax.set_title('total sales time series', fontsize=18)
loc = plticker.MultipleLocator(base=60) # this locator puts ticks at regular intervals
ax.xaxis.set_major_locator(loc)
ax.tick_params(axis='x', rotation=70)
plt.show()

There is a general uptrend with some swings. We would like to better visualize the weekly, monthly and yearly cycles.

In [None]:
# Let pandas infer the date format instead of hard-coding a separator that may not match the file
total_daily['date'] = pd.to_datetime(total_daily['date'])
total_daily['day_of_month'] = total_daily['date'].dt.day
total_daily['month'] = total_daily['date'].dt.month_name()
total_daily['day_of_week'] = total_daily['date'].dt.day_name()

year_cycle=total_daily.groupby('month').agg({'total_daily_sales':'sum'}).reset_index()
year_cycle.rename(columns={'total_daily_sales':'total_sales'},inplace=True)
month_cycle=total_daily.groupby('day_of_month').agg({'total_daily_sales':'sum'}).reset_index()
month_cycle.rename(columns={'total_daily_sales':'total_sales'},inplace=True)
week_cycle=total_daily.groupby('day_of_week').agg({'total_daily_sales':'sum'}).reset_index()
week_cycle.rename(columns={'total_daily_sales':'total_sales'},inplace=True)
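
The cell above sums sales per calendar bucket. A sum depends on how many times each bucket occurs in the data: the 31st exists in only seven months of the year, and if the training period does not span a whole number of years, some months occur more often than others. Averaging per bucket can therefore be a useful cross-check. Here is a minimal sketch on synthetic data, where the date range and the constant sales value are assumptions chosen only to make the effect visible.

```python
import pandas as pd

# Synthetic series: constant sales of 100 per day over one and a half years
# (made-up data; the date range is an assumption for illustration only).
toy = pd.DataFrame({"date": pd.date_range("2013-01-01", "2014-06-30", freq="D")})
toy["total_daily_sales"] = 100.0
toy["month"] = toy["date"].dt.month_name()

# Compare the sum with the mean and the number of days in each bucket.
by_month = toy.groupby("month")["total_daily_sales"].agg(["sum", "mean", "count"])
print(by_month.loc[["January", "December"]])
```

In this toy series, January occurs twice and December once, so January's sum is twice December's even though every day has identical sales; the mean is the same for both months.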

# <b><span style='color:#4B4B4B'>6 |</span><span style='color:#016CC9'> Weekly cycle </span></b>

In [None]:
cat_week = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
week_cycle['day_of_week'] = pd.Categorical(week_cycle['day_of_week'], categories=cat_week, ordered=True)
week_cycle = week_cycle.sort_values('day_of_week')
fig, ax = plt.subplots(figsize=(15,10))
ax.bar(week_cycle['day_of_week'],week_cycle['total_sales'],color = 'g', width = 0.5) 
ax.set_xlabel('day of week', fontsize=16)
ax.set_ylabel('total sales', fontsize=16)
ax.set_title('Weekly cycle',fontsize=18)
plt.show()

There is a clear weekly pattern. Sunday is the busiest day, followed by Saturday. Thursday is the least busy day.

# <b><span style='color:#4B4B4B'>7 |</span><span style='color:#016CC9'> Monthly cycle </span></b>

In [None]:
fig, ax = plt.subplots(figsize=(15,10))
ax.bar(month_cycle['day_of_month'],month_cycle['total_sales'],color = 'g', width = 0.5) 
ax.set_xlabel('day of month', fontsize=16)
ax.set_ylabel('total sales', fontsize=16)
ax.set_title('Monthly cycle',fontsize=18)
ax.set_xlim(xmin=1,xmax=30)
plt.show()

Here we have a useful piece of information: the days following the start and the middle of the month are stronger.
The data description states:
***Wages in the public sector are paid every two weeks on the 15th and on the last day of the month. Supermarket sales could be affected by this.***
We can see this reflected in the sales numbers.
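
One way to make the payday effect available to a model is to flag the paydays and count the days elapsed since the last one. The following is a sketch under the assumption that paydays fall exactly on the 15th and on the last calendar day of each month, as the quote says; the column names `is_payday` and `days_since_payday` are introduced here for illustration and are not part of the competition data.

```python
import pandas as pd

# A short made-up date range spanning one month end.
dates = pd.DataFrame({"date": pd.date_range("2017-01-13", "2017-02-02", freq="D")})

# Paydays as described in the quote: the 15th and the last day of the month.
dates["is_payday"] = (dates["date"].dt.day == 15) | dates["date"].dt.is_month_end

# Days elapsed since the most recent payday (NaN before the first payday in the range).
last_payday = dates["date"].where(dates["is_payday"]).ffill()
dates["days_since_payday"] = (dates["date"] - last_payday).dt.days

print(dates[dates["is_payday"]])
```

Grouping sales by `days_since_payday` instead of by day of month would align the two pay cycles on the same axis.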


# <b><span style='color:#4B4B4B'>8 |</span><span style='color:#016CC9'> Yearly cycle </span></b>

In [None]:
cat_year = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']
year_cycle['month'] = pd.Categorical(year_cycle['month'], categories=cat_year, ordered=True)
year_cycle = year_cycle.sort_values('month')
fig, ax = plt.subplots(figsize=(15,10))
ax.bar(year_cycle['month'],year_cycle['total_sales'],color = 'g', width = 0.5) 
ax.set_xlabel('month', fontsize=16)
ax.set_ylabel('total sales', fontsize=16)
ax.set_title('Yearly cycle',fontsize=18)
ax.tick_params(axis='x', rotation=70)
plt.show()

July is the strongest month, and September the weakest.
