## <span style="font-size:25px;color:#328c4f">Table of Contents</span>
- <span style="font-size:20px">[Details about Data](#data-details)</span>
- <span style="font-size:20px">[Importing the libraries](#library-import)</span>
- <span style="font-size:20px">[Loading CSV Files and Merging data](#data-load)</span>
- <span style="font-size:20px">[Average Sales Analysis](#avg-sales)</span>
- <span style="font-size:20px">[Average Sales Analysis for Year-Month](#avg-yearly)</span>
- <span style="font-size:20px">[Average Sales Analysis by Month, Week, Day of Week, Quarters](#avg-othertimes)</span>
- <span style="font-size:20px">[Store Analysis](#store-analysis)</span>
    - <span style="font-size:20px">[Average Sales: Store Type Vs Holiday Type](#store_type-holiday)</span>
    - <span style="font-size:20px">[Average Sales: Store Type Vs Year(Month)](#store_type-year)</span>
    - <span style="font-size:20px">[Average Sales: Month Vs Holiday Type](#month-holiday)</span>
    - <span style="font-size:20px">[Average Sales: Holiday_type Vs Year(Month)](#holiday-year)</span>

## <a id="data-details"></a><span style="color:#328c4f">Details about Data</span>

<p style="font-size:25px; color:#04661e">Competition Purpose</p>

In this competition, one will predict sales for the thousands of product families sold at Favorita stores located in Ecuador. The training data includes dates, store and product information, whether that item was being promoted, as well as the sales numbers. Additional files include supplementary information that may be useful in building your models.


<p style="font-size:25px; color:#04661e">train.csv</p>

- The training data, comprising time series of features store_nbr, family, and onpromotion as well as the target sales.
- store_nbr identifies the store at which the products are sold.
- family identifies the type of product sold.
- sales gives the total sales for a product family at a particular store at a given date. Fractional values are possible since products can be sold in fractional units (1.5 kg of cheese, for instance, as opposed to 1 bag of chips).
- onpromotion gives the total number of items in a product family that were being promoted at a store at a given date.

<p style="font-size:25px; color:#04661e">test.csv</p>

- The test data, having the same features as the training data. You will predict the target sales for the dates in this file.
- The dates in the test data are for the 15 days after the last date in the training data.

<p style="font-size:25px; color:#04661e">stores.csv</p>

- Store metadata, including city, state, type, and cluster.
- cluster is a grouping of similar stores.

<p style="font-size:25px; color:#04661e">oil.csv</p>

- Daily oil price. Includes values during both the train and test data timeframes. (Ecuador is an oil-dependent country and its economic health is highly vulnerable to shocks in oil prices.)

<p style="font-size:25px; color:#04661e">holidays_events.csv</p>

- Holidays and Events, with metadata
- NOTE: Pay special attention to the transferred column. A holiday that is transferred officially falls on that calendar day, but was moved to another date by the government. A transferred day is more like a normal day than a holiday. To find the day that it was actually celebrated, look for the corresponding row where type is Transfer. For example, the holiday Independencia de Guayaquil was transferred from 2012-10-09 to 2012-10-12, which means it was celebrated on 2012-10-12.
- Days that are type Bridge are extra days that are added to a holiday (e.g., to extend the break across a long weekend). These are frequently made up by the type Work Day, which is a day not normally scheduled for work (e.g., Saturday) that is meant to pay back the Bridge.
- Additional holidays are days added to a regular calendar holiday, for example, as typically happens around Christmas (making Christmas Eve a holiday).
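The transferred and Work Day rules above can be turned into a single flag. The following is a minimal sketch under stated assumptions: the rows are illustrative (modeled on the Independencia de Guayaquil example, not read from the real file), and the column name `is_day_off` is hypothetical.

```python
import pandas as pd

# Illustrative rows only; the real holidays_events.csv has more columns and rows.
holidays = pd.DataFrame({
    "date": ["2012-10-09", "2012-10-12", "2012-12-24", "2013-01-05"],
    "type": ["Holiday", "Transfer", "Additional", "Work Day"],
    "transferred": [True, False, False, False],
})

# A date counts as an actual day off when it is not a make-up Work Day
# and the holiday was not moved away from it (transferred == False).
holidays["is_day_off"] = (holidays["type"] != "Work Day") & ~holidays["transferred"]

print(holidays)
```

With this rule, the original date of a transferred holiday is treated as a normal day, while the matching Transfer row is treated as the day it was celebrated.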

<p style="font-size:25px; color:#04661e">Additional Notes</p>

- Wages in the public sector are paid every two weeks, on the 15th and on the last day of the month. Supermarket sales could be affected by this.
- A magnitude 7.8 earthquake struck Ecuador on April 16, 2016. People rallied in relief efforts, donating water and other first-need products, which greatly affected supermarket sales for several weeks after the earthquake.
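Both notes can be encoded as simple calendar features. This is a rough sketch using only the standard library; the helper names `is_payday` and `days_since_earthquake` are hypothetical and not part of the competition data. How many days after the earthquake to treat as affected is left open, since the notes only say "several weeks".

```python
import calendar
from datetime import date

EARTHQUAKE_DATE = date(2016, 4, 16)  # date given in the notes above

def is_payday(d: date) -> bool:
    """True on the 15th and on the last day of the month."""
    last_day = calendar.monthrange(d.year, d.month)[1]
    return d.day == 15 or d.day == last_day

def days_since_earthquake(d: date) -> int:
    """Signed number of days between d and the earthquake (negative before it)."""
    return (d - EARTHQUAKE_DATE).days

print(is_payday(date(2016, 2, 29)), days_since_earthquake(date(2016, 4, 30)))
```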


##  <a id="library-import"></a> <span style="color:#328c4f">Importing the libraries</span>

In [None]:
import pandas as pd
import numpy as np
import calendar

import plotly.express as px
from plotly.subplots import make_subplots
import plotly.figure_factory as ff
import plotly.offline as offline
import plotly.graph_objs as go
offline.init_notebook_mode(connected = True)

##  <a id="data-load"></a><span style="color:#328c4f">Loading CSV Files and Merging data</span>

In [None]:
df_holi = pd.read_csv('../input/store-sales-time-series-forecasting/holidays_events.csv')
df_oil = pd.read_csv('../input/store-sales-time-series-forecasting/oil.csv')
df_stores = pd.read_csv('../input/store-sales-time-series-forecasting/stores.csv')
df_trans = pd.read_csv('../input/store-sales-time-series-forecasting/transactions.csv')

df_train = pd.read_csv('../input/store-sales-time-series-forecasting/train.csv')
df_test = pd.read_csv('../input/store-sales-time-series-forecasting/test.csv')

A brief description of the available CSV data (loaded into dataframes) is presented below.

In [None]:
print('Holiday Events###############')
print(df_holi.shape)
print(df_holi.dtypes)
print(df_holi.head(3))

print('Oil Data################')
print(df_oil.shape)
print(df_oil.dtypes)
print(df_oil.head(3))

print('Store Data############')
print(df_stores.shape)
print(df_stores.dtypes)
print(df_stores.head(3))

print('Transaction Data################')
print(df_trans.shape)
print(df_trans.dtypes)
print(df_trans.head(3))

print('Train Data###################')
print(df_train.shape)
print(df_train.dtypes)
print(df_train.head(3))
print(df_test.shape)

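Merge the data from the other dataframes into the training dataframe and derive the calendar columns (year, month, week, quarter, day of week) used in the analysis.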

In [None]:
# copying of train data and merging other data
df_train1 = df_train.merge(df_holi, on = 'date', how='left')
df_train1 = df_train1.merge(df_oil, on = 'date', how='left')
df_train1 = df_train1.merge(df_stores, on = 'store_nbr', how='left')
df_train1 = df_train1.merge(df_trans, on = ['date', 'store_nbr'], how='left')
df_train1 = df_train1.rename(columns = {"type_x" : "holiday_type", "type_y" : "store_type"})

df_train1['date'] = pd.to_datetime(df_train1['date'])
df_train1['year'] = df_train1['date'].dt.year
df_train1['month'] = df_train1['date'].dt.month
df_train1['week'] = df_train1['date'].dt.isocalendar().week
df_train1['quarter'] = df_train1['date'].dt.quarter
df_train1['day_of_week'] = df_train1['date'].dt.day_name()
df_train1.head(3)
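One thing worth checking after a left merge on `date` is whether it added rows: if the holiday table lists more than one entry for a date, every training row for that date is duplicated, which shifts the averages computed later. The sketch below uses tiny made-up frames (not the competition files) to show the effect and how pandas' `validate` argument can catch it.

```python
import pandas as pd

# Made-up stand-ins for the training and holiday tables.
train = pd.DataFrame({"date": ["2016-05-01", "2016-05-02"],
                      "store_nbr": [1, 1],
                      "sales": [10.0, 20.0]})
holi = pd.DataFrame({"date": ["2016-05-01", "2016-05-01"],
                     "type": ["Holiday", "Event"]})

# Two holiday rows share a date, so the left merge repeats that training row.
merged = train.merge(holi, on="date", how="left")
rows_added = len(merged) - len(train)

# validate="many_to_one" raises MergeError when the right-hand keys are not unique.
try:
    train.merge(holi, on="date", how="left", validate="many_to_one")
    merge_is_safe = True
except pd.errors.MergeError:
    merge_is_safe = False

print(rows_added, merge_is_safe)
```

If the check fails on the real data, one option is to reduce the holiday table to one row per date before merging; which row to keep is a modeling decision this sketch does not make.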

##  <a id="avg-sales"></a><span style="color:#328c4f">Average Sales Analysis</span>

Let's determine the average sales by store type, product family, and store cluster.

In [None]:
df_st_sa = df_train1.groupby('store_type').agg({"sales" : "mean"}).reset_index().sort_values(by='sales', ascending=False)
print('Average Sales by Store Type')
print(df_st_sa)
df_fa_sa = df_train1.groupby('family').agg({"sales" : "mean"}).reset_index().sort_values(by='sales', ascending=False)
# take an explicit copy so that later column assignments do not act on a slice
df_fa_sa_top_10 = df_fa_sa[:10].copy()
print('\nAverage Sales by Product Family')
print(df_fa_sa.shape)
print(df_fa_sa_top_10)

df_cl_sa = df_train1.groupby('cluster').agg({"sales" : "mean"}).reset_index()
print('\nAverage Sales by Cluster')
print(df_cl_sa)

In [None]:

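# chart color: highlight the top two families and mute the rest
# (assign through .iloc on a copy instead of chained indexing)
df_fa_sa_top_10 = df_fa_sa_top_10.copy()
df_fa_sa_top_10['color'] = '#eda29d'
df_fa_sa_top_10.iloc[:2, df_fa_sa_top_10.columns.get_loc('color')] = '#ba3d34'
df_cl_sa['color'] = '#70aad4'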

# chart
fig = make_subplots(rows=2, cols=2, 
                    specs=[[{"type": "bar"}, {"type": "pie"}],
                           [{"colspan": 2}, None]],
                    column_widths=[0.7, 0.3], vertical_spacing=0, horizontal_spacing=0.02,
                    subplot_titles=("Top 10 Highest Product Sales", "Highest Sales in Stores", "Clusters Vs Sales"))

fig.add_trace(go.Bar(x=df_fa_sa_top_10['sales'], y=df_fa_sa_top_10['family'], marker=dict(color= df_fa_sa_top_10['color']),
                     name='Family', orientation='h'), 
                     row=1, col=1)
fig.add_trace(go.Pie(values=df_st_sa['sales'], labels=df_st_sa['store_type'], name='Store type',
                     marker=dict(colors=['#3FA64B','#4DAF59','#5DB867','#6DC177','#7FC987']), hole=0.7,
                     hoverinfo='label+percent+value', textinfo='label'), 
                    row=1, col=2)
fig.add_trace(go.Bar(x=df_cl_sa['cluster'], y=df_cl_sa['sales'], 
                     marker=dict(color= df_cl_sa['color']), name='Cluster'), 
                     row=2, col=1)

# styling
fig.update_yaxes(showgrid=False, ticksuffix=' ', categoryorder='total ascending', row=1, col=1)
fig.update_xaxes(visible=False, row=1, col=1)
# label each tick with its own cluster id so tickvals and ticktext always have the same length
fig.update_xaxes(tickmode = 'array', tickvals=df_cl_sa.cluster, ticktext=df_cl_sa.cluster, row=2, col=1)
fig.update_yaxes(visible=False, row=2, col=1)
fig.update_layout(height=500, bargap=0.2,
                  margin=dict(b=0,r=20,l=20), xaxis=dict(tickmode='linear'),
                  title_text="Average Sales Analysis",
                  template="plotly_white",
                  title_font=dict(size=29, color='#8a8d93', family="Lato, sans-serif"),
                  font=dict(color='#8a8d93'), 
                  hoverlabel=dict(bgcolor="#f2f2f2", font_size=13, font_family="Lato, sans-serif"),
                  showlegend=False)
fig.show()

<p style="font-size:19px">
    <b>Interpretation:</b><br><br>
    - Product families such as grocery and beverages have the highest average sales.<br>
    - Store type A has the highest average sales (a share of about 40% in the chart) and store type C the lowest (about 10%).<br>
    - Cluster 5 has the highest average sales.
</p>
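A note on reading the percentages: the pie chart is fed the per-store-type averages, so each slice is that type's share of the summed averages, not its share of total sales. The sketch below shows the arithmetic with made-up averages chosen only to reproduce the rough 40% and 10% figures; they are not taken from the dataset.

```python
import pandas as pd

# Made-up average sales per store type (illustrative values only).
avg_sales = pd.Series({"A": 400.0, "B": 250.0, "C": 100.0, "D": 150.0, "E": 100.0})

# A pie trace normalizes its values, so each slice is value / sum of values.
share_pct = (avg_sales / avg_sales.sum() * 100).round(1)

print(share_pct)
```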

##  <a id="avg-yearly"></a><span style="color:#328c4f">Average Sales Analysis for Year-Month</span>

 <font size="4">First, we'll determine the average sales per month for each year from 2013 to 2017</font> 


In [None]:
#2013 average sales
df_2013 = df_train1[df_train1['year']==2013][['month','sales']]
df_2013 = df_2013.groupby('month').agg({"sales" : "mean"}).reset_index().rename(columns={'sales':'s13'})
#2014 average sales
df_2014 = df_train1[df_train1['year']==2014][['month','sales']]
df_2014 = df_2014.groupby('month').agg({"sales" : "mean"}).reset_index().rename(columns={'sales':'s14'})
#2015 average sales
df_2015 = df_train1[df_train1['year']==2015][['month','sales']]
df_2015 = df_2015.groupby('month').agg({"sales" : "mean"}).reset_index().rename(columns={'sales':'s15'})
#2016 average sales
df_2016 = df_train1[df_train1['year']==2016][['month','sales']]
df_2016 = df_2016.groupby('month').agg({"sales" : "mean"}).reset_index().rename(columns={'sales':'s16'})
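#2017 average sales (no data for Sep-Dec, so those months are filled with 0)
df_2017 = df_train1[df_train1['year']==2017][['month','sales']]
df_2017 = df_2017.groupby('month').agg({"sales" : "mean"}).reset_index()
df_2017_no = pd.DataFrame({'month': [9,10,11,12], 'sales':[0,0,0,0]})
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df_2017 = pd.concat([df_2017, df_2017_no], ignore_index=True).rename(columns={'sales':'s17'})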
#merge all the yearly average sales df into one dataframe
df_year = df_2013.merge(df_2014,on='month').merge(df_2015,on='month').merge(df_2016,on='month').merge(df_2017,on='month')
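The same month-by-year table can be built in one step with `pivot_table`, which avoids repeating the filter and merge for every year. This is an alternative sketch on a small made-up frame, not the notebook's own method; `df_year_alt` is a hypothetical name, and the column labels come out as the year integers rather than `s13` ... `s17`.

```python
import pandas as pd

# Made-up sample with the same column names as the merged training frame.
df = pd.DataFrame({
    "year":  [2013, 2013, 2014, 2014, 2014],
    "month": [1, 1, 1, 2, 2],
    "sales": [1.0, 3.0, 4.0, 6.0, 8.0],
})

# Rows are months, columns are years, cells are mean sales.
# reindex makes sure all 12 months are present; missing cells become 0.
df_year_alt = (
    df.pivot_table(index="month", columns="year", values="sales", aggfunc="mean")
      .reindex(range(1, 13))
      .fillna(0)
)

print(df_year_alt.head(3))
```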

 <font size="4">Let's draw the average sales per month from 2013-2017</font> 

In [None]:
# top labels (one per year segment)
top_labels = ['2013', '2014', '2015', '2016', '2017']

colors = ['rgba(143, 28, 28, 0.8)', 'rgba(161, 49, 49, 0.8)',
          'rgba(181, 82, 82, 0.8)', 'rgba(206, 116, 116, 0.85)',
          'rgba(222, 151, 151, 1)']

# X axis value 
df_year = df_year[['s13','s14','s15','s16','s17']].replace(np.nan,0)
x_data = df_year.values

# y axis value (Month)
df_2013['month'] =['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']
y_data = df_2013['month'].tolist()

fig = go.Figure()
for i in range(0, len(x_data[0])):
    for xd, yd in zip(x_data, y_data):
        fig.add_trace(go.Bar(
            x=[xd[i]], y=[yd],
            orientation='h',
            marker=dict(
                color=colors[i],
                line=dict(color='rgb(248, 248, 249)', width=1)
            )
        ))
        
fig.update_layout(title='Avg Sales for each Year',
    xaxis=dict(showgrid=False, 
               zeroline=False, domain=[0.15, 1]),
    yaxis=dict(showgrid=False, showline=False,
               showticklabels=False, zeroline=False),
    barmode='stack', 
    template="plotly_white",
    margin=dict(l=0, r=50, t=100, b=10),
    showlegend=False, 
)


annotations = []
for yd, xd in zip(y_data, x_data):
    # labeling the y-axis
    annotations.append(dict(xref='paper', yref='y',
                            x=0.14, y=yd,
                            xanchor='right',
                            text=str(yd),
                            font=dict(family='Arial', size=14,
                                      color='rgb(67, 67, 67)'),
                            showarrow=False, align='right'))
    # labeling the first year segment (on the top)
    if yd == y_data[-1]:
        annotations.append(dict(xref='x', yref='paper',
                                x=xd[0] / 2, y=1.1,
                                text=top_labels[0],
                                font=dict(family='Arial', size=14,
                                          color='rgb(67, 67, 67)'),
                          showarrow=False))
    space = xd[0]
    for i in range(1, len(xd)):
            # labeling the remaining year segments
            if yd == y_data[-1]:
                annotations.append(dict(xref='x', yref='paper',
                                        x=space + (xd[i]/2), y=1.1,
                                        text=top_labels[i],
                                        font=dict(family='Arial', size=14,
                                                  color='rgb(67, 67, 67)'),
                                        showarrow=False))
            space += xd[i]
fig.update_layout(
    annotations=annotations)


fig.show()

<p style="font-size:19px;">
    <b>Interpretation:</b><br><br>
    - Sales peak in December and then drop in January.<br>
    - In 2013-2016, sales rise in the year-end months (Sep-Dec).<br>
    - Sales increase gradually from 2013 to 2017.<br>
    <b>Note:</b> There is no data for September to December 2017.
</p>

##  <a id="avg-othertimes"></a><span style="color:#328c4f">Average Sales Analysis by Month, Week, Day of Week, Quarters</span>

 <font size="4">Let's calculate the average sales by month/week/quarter</font> 

In [None]:
df_m_sa = df_train1.groupby('month').agg({"sales" : "mean"}).reset_index()
df_m_sa['sales'] = round(df_m_sa['sales'],2)
df_m_sa['month_text'] = df_m_sa['month'].apply(lambda x: calendar.month_abbr[x])
df_m_sa['text'] = df_m_sa['month_text'] + ' - ' + df_m_sa['sales'].astype(str) 

df_w_sa = df_train1.groupby('week').agg({"sales" : "mean"}).reset_index() 
df_q_sa = df_train1.groupby('quarter').agg({"sales" : "mean"}).reset_index() 
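# chart color: highlight the last month (December) and mute the rest
# (assign through .iloc instead of chained indexing)
df_m_sa['color'] = '#e37f78'
df_m_sa.iloc[-1, df_m_sa.columns.get_loc('color')] = '#bd2b20'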

 <font size="4">Let's draw the average sales by month/week/quarter</font> 

In [None]:
# chart
fig = make_subplots(rows=2, cols=2, vertical_spacing=0.08,
                    row_heights=[0.7, 0.3], 
                    specs=[[{"type": "bar"}, {"type": "pie"}],
                           [{"colspan": 2}, None]],
                    column_widths=[0.7, 0.3],
                    subplot_titles=("Month wise Avg Sales Analysis", "Quarter wise Avg Sales Analysis", 
                                    "Week wise Avg Sales Analysis"))

fig.add_trace(go.Bar(x=df_m_sa['sales'], y=df_m_sa['month'], marker=dict(color= df_m_sa['color']),
                     text=df_m_sa['text'],textposition='auto',
                     name='Month', orientation='h'), 
                     row=1, col=1)
fig.add_trace(go.Pie(values=df_q_sa['sales'], labels=df_q_sa['quarter'], name='Quarter',
                     marker=dict(colors=['#347834','#52a152','#6db06d','#7fb37f']), hole=0.7,
                     hoverinfo='label+percent+value', textinfo='label+percent'), 
                     row=1, col=2)
fig.add_trace(go.Scatter(x=df_w_sa['week'], y=df_w_sa['sales'], mode='lines+markers', fill='tozeroy', fillcolor='#c979c4',
                     marker=dict(color= '#4d0548'), name='Week'), 
                     row=2, col=1)

# styling
fig.update_yaxes(visible=False, row=1, col=1)
fig.update_xaxes(visible=False, row=1, col=1)
# label each tick with its own week number (ISO weeks can run up to 53)
fig.update_xaxes(tickmode = 'array', tickvals=df_w_sa.week, ticktext=df_w_sa.week, 
                 row=2, col=1)
fig.update_yaxes(visible=False, row=2, col=1)
fig.update_layout(height=750, bargap=0.15,
                  margin=dict(b=0,r=20,l=20), 
                  title_text="Average Sales Analysis",
                  template="plotly_white",
                  title_font=dict(size=25, color='#8a8d93', family="Lato, sans-serif"),
                  font=dict(color='#8a8d93'),
                  hoverlabel=dict(bgcolor="#f2f2f2", font_size=13, font_family="Lato, sans-serif"),
                  showlegend=False)
fig.show()
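The weekly panel groups by the ISO week number from `dt.isocalendar().week`. ISO weeks do not line up with calendar years: the first days of January can belong to week 52 or 53 of the previous ISO year, so those dates are averaged together with late-December dates. A small standard-library check illustrates this:

```python
from datetime import date

# isocalendar() returns (ISO year, ISO week number, ISO weekday).
iso_jan_1_2016 = tuple(date(2016, 1, 1).isocalendar())
iso_jan_1_2017 = tuple(date(2017, 1, 1).isocalendar())

print(iso_jan_1_2016)  # 1 January 2016 falls in ISO week 53 of 2015
print(iso_jan_1_2017)  # 1 January 2017 falls in ISO week 52 of 2016
```

This matters mainly when reading the right-hand end of the weekly chart, where the last weeks can include a few early-January days.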

 <font size="4">Let's draw the average sales by days of week</font> 

In [None]:
# data
df_dw_sa = df_train1.groupby('day_of_week').agg({"sales" : "mean"}).reset_index()
df_dw_sa.sales = round(df_dw_sa.sales, 2)

# chart
fig = px.bar(df_dw_sa, y='day_of_week', x='sales', title='Avg Sales vs Day of Week',
             color_discrete_sequence=['#b85cb1'], text='sales',
             category_orders=dict(day_of_week=["Monday","Tuesday","Wednesday","Thursday", "Friday","Saturday","Sunday"]))
fig.update_yaxes(showgrid=False, ticksuffix=' ', showline=False)
fig.update_xaxes(visible=False)
fig.update_layout(margin=dict(t=60, b=0, l=0, r=0), height=350,
                  hovermode="y unified", 
                  yaxis_title=" ", template='plotly_white',
                  title_font=dict(size=25, color='#5c5a5c', family="Lato, sans-serif"),
                  font=dict(color='#757175'),
                  hoverlabel=dict(bgcolor="#d9ccd8", font_size=13, font_family="Lato, sans-serif"))
fig.show()

<p style="font-size:19px"><b>Interpretation:</b><br><br>
    - As the charts above show, sales trend upward over time. There are ups and downs at every point, but they recur fairly regularly, which suggests a seasonal pattern as well.<br>
    - Highest sales are made on <b>Sunday</b>.<br>
    - <b>December</b> has the highest sales.<br>
    <b>Note:</b> There is no data for September to December 2017.
</p>

##  <a id="store-analysis"></a> <span style="color:#328c4f">Store Analysis</span>

 <a id="store_type-holiday"></a> <font size="4">Average Sales: Store Type Vs Holiday Type</font> 

In [None]:
# data
df_st_ht = df_train1.groupby(['store_type','holiday_type']).agg({"sales" : "mean"}).reset_index()
df_st_ht['sales'] = round(df_st_ht['sales'], 2)

# chart
fig = px.scatter(df_st_ht, x='store_type', color='sales', y='holiday_type', size='sales',
                 color_discrete_sequence=px.colors.qualitative.D3,
                 title="Average Sales: Store Type Vs Holiday Type")
# styling
fig.update_yaxes(ticksuffix='  ')
fig.update_layout(height=400, xaxis_title='', yaxis_title='',
                  margin=dict(b=0),
                  plot_bgcolor='#fafafa', paper_bgcolor='#fafafa',
                  title_font=dict(size=23, color='#ba97b8', family="Lato, sans-serif"),
                  font=dict(color='#555'), 
                  hoverlabel=dict(bgcolor="#f2f2f2", font_size=13, font_family="Lato, sans-serif"))
fig.show()
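A caveat when grouping by `holiday_type`: after the left merge, dates without any holiday entry have a missing `holiday_type`, and `groupby` drops missing keys by default, so ordinary days are left out of the comparison. The sketch below, on a made-up frame, shows one way to keep them by filling in a placeholder label; the label `'No holiday'` is an assumption for illustration.

```python
import pandas as pd

# Made-up rows: two ordinary days (no holiday entry) and two holiday rows.
df = pd.DataFrame({
    "holiday_type": ["Holiday", None, None, "Event"],
    "sales": [10.0, 2.0, 4.0, 6.0],
})

# Default behaviour: rows with a missing key are excluded from the result.
default_means = df.groupby("holiday_type")["sales"].mean()

# Filling the gap first keeps ordinary days as their own category.
with_normal_days = (
    df.assign(holiday_type=df["holiday_type"].fillna("No holiday"))
      .groupby("holiday_type")["sales"].mean()
)

print(default_means)
print(with_normal_days)
```

Having ordinary days as a baseline makes it easier to judge whether a holiday type really lifts sales.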

 <a id="store_type-year"></a> <font size="4">Average Sales: Store Type Vs Year(Month)</font> 

In [None]:
# data
df_y_m_st = df_train1.groupby(['year','month','store_type']).agg({"sales" : "mean"}).reset_index()
df_y_m_st['sales'] = round(df_y_m_st['sales'], 2)

# chart
fig = px.scatter(df_y_m_st, x='month', y='store_type', color='sales', size='sales', 
                 facet_row='year', title='Average Sales: Store Type Vs Year(Month)')
# styling
fig.update_yaxes(ticksuffix='  ')
fig.update_xaxes(tickmode = 'array', tickvals=[i for i in range(1,13)], 
                 ticktext=['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec'])
fig.update_layout(height=900, xaxis_title='', yaxis_title='',
                  margin=dict(t=70, b=0),
                  plot_bgcolor='#fafafa', paper_bgcolor='#fafafa',
                  title_font=dict(size=23, color='#ba97b8', family="Lato, sans-serif"),
                  font=dict(color='#555'), 
                  hoverlabel=dict(bgcolor="#f2f2f2", font_size=13, font_family="Lato, sans-serif"))
fig.show()

 <a id="month-holiday"></a> <font size="4">Average Sales: Month Vs Holiday Type</font> 

In [None]:
# data
df_m_ht = df_train1.groupby(['month','holiday_type']).agg({"sales" : "mean"}).reset_index()
df_m_ht['sales'] = round(df_m_ht['sales'], 2)

# chart
fig = px.scatter(df_m_ht, x='month', color='sales', y='holiday_type', size='sales',
                 color_discrete_sequence=px.colors.qualitative.D3,
                 title="Average Sales: Month Vs Holiday Type")
# styling
fig.update_yaxes(ticksuffix='  ')
fig.update_xaxes(tickmode = 'array', tickvals=[i for i in range(1,13)], 
                 ticktext=['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec'])
fig.update_layout(height=400, xaxis_title='', yaxis_title='',
                  margin=dict(b=0),
                  plot_bgcolor='#fafafa', paper_bgcolor='#fafafa',
                  title_font=dict(size=23, color='#ba97b8', family="Lato, sans-serif"),
                  font=dict(color='#555'), 
                  hoverlabel=dict(bgcolor="#f2f2f2", font_size=13, font_family="Lato, sans-serif"))
fig.show()

<p style="font-size:19px"><b>Interpretation:</b><br><br>
    - Average sales are highest on <b>Transfer</b>-type holidays.<br>
    - Store type <b>A</b> has the highest average sales in every month of every year.<br>
    - <b>December</b> has the highest sales.<br>
    <b>Note:</b> There is no data for September to December 2017.
</p>

 <a id="holiday-year"></a> <font size="4">Average Sales: Holiday_type Vs Year(Month)</font> 

In [None]:
# data
df_y_m_ht = df_train1.groupby(['year','month','holiday_type']).agg({"sales" : "mean"}).reset_index()
df_y_m_ht['sales'] = round(df_y_m_ht['sales'], 2)

# chart
fig = px.scatter(df_y_m_ht, x='month', y='holiday_type', color='sales', size='sales', 
                 facet_row='year', title='Average Sales: Holiday_type Vs Year(Month)')
# styling
fig.update_yaxes(ticksuffix='  ')
fig.update_xaxes(tickmode = 'array', tickvals=[i for i in range(1,13)], 
                 ticktext=['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec'])
fig.update_layout(height=900, xaxis_title='', yaxis_title='',
                  margin=dict(t=70, b=0),
                  plot_bgcolor='#fafafa', paper_bgcolor='#fafafa',
                  title_font=dict(size=23, color='#ba97b8', family="Lato, sans-serif"),
                  font=dict(color='#555'), 
                  hoverlabel=dict(bgcolor="#f2f2f2", font_size=13, font_family="Lato, sans-serif"))
fig.show()

<p style="font-size:19px"><b>Interpretation:</b><br><br>
    - <b>December</b> has the highest sales across the various holiday types.<br>
    - From <b>2015</b> onwards, sales are growing in <b>May</b> too.<br>
    <b>Note:</b> There is no data for September to December 2017.
</p>

*This notebook is for research/surveying, learning, experimenting, and reproducing existing literature found online*

**Reference:**
- https://www.kaggle.com/code/kashishrastogi/store-sales-analysis-time-serie