In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import plotly.graph_objects as go
import plotly.express as px
import warnings
warnings.filterwarnings("ignore")

In [None]:
holidays_df=pd.read_csv("../input/store-sales-time-series-forecasting/holidays_events.csv")
oil_df=pd.read_csv("../input/store-sales-time-series-forecasting/oil.csv")
stores_df=pd.read_csv("../input/store-sales-time-series-forecasting/stores.csv")
test_df=pd.read_csv("../input/store-sales-time-series-forecasting/test.csv")
train_df=pd.read_csv("../input/store-sales-time-series-forecasting/train.csv")
transactions_df=pd.read_csv("../input/store-sales-time-series-forecasting/transactions.csv")

**Determining shape of each dataset:**

In [None]:
holidays_df.shape

In [None]:
oil_df.shape

In [None]:
stores_df.shape

In [None]:
test_df.shape

In [None]:
train_df.shape

In [None]:
transactions_df.shape
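
The six shape calls above can also be collected in a single cell. This is a minimal sketch on two small stand-in frames; the `frames` dict and its contents are illustrative, not the competition data.

```python
import pandas as pd

# Stand-in frames; in the notebook these would be holidays_df, oil_df, etc.
frames = {
    "oil_df": pd.DataFrame({"date": ["2013-01-01", "2013-01-02"], "dcoilwtico": [None, 93.14]}),
    "stores_df": pd.DataFrame({"store_nbr": [1, 2, 3], "city": ["Quito", "Quito", "Guayaquil"]}),
}

# Map each name to its (rows, columns) tuple
shapes = {name: df.shape for name, df in frames.items()}
for name, (rows, cols) in shapes.items():
    print(f"{name}: {rows} rows x {cols} columns")
```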

# Understanding each dataset:

In [None]:
transactions_df.head()

In [None]:
transactions_df.tail()

**1.transactions_df**

It is a time series dataset containing the **number of transactions** recorded at **each store** of **Corporación Favorita, a large Ecuador-based grocery retailer**.

The data covers the years **2013-2017**.

It has **83,488** entries across **3 columns**: the transaction **date, the store number and the number of transactions made that day in that store**.



In [None]:
stores_df.head()

**2.stores_df**

This dataset lists the **store number** of each **Corporación Favorita store**, together with the **city** and **state in Ecuador** where it is located.

There are **54 stores** in total. Each store has a specific **type** and belongs to a **cluster, i.e. a grouping of similar stores**.
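
The counts of stores, cities, states, types and clusters can be read off with `nunique()`. This sketch uses a made-up frame with the same column names as `stores_df`; the rows are purely illustrative.

```python
import pandas as pd

# Toy frame with the same columns as stores.csv; the rows are made up
stores = pd.DataFrame({
    "store_nbr": [1, 2, 3, 4],
    "city": ["Quito", "Quito", "Guayaquil", "Cuenca"],
    "state": ["Pichincha", "Pichincha", "Guayas", "Azuay"],
    "type": ["D", "D", "A", "B"],
    "cluster": [13, 13, 8, 6],
})

# Number of distinct values in each column
counts = stores.nunique()
print(counts)
```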

In [None]:
oil_df.head()

In [None]:
oil_df.tail()

**3.oil_df**

It contains the daily **oil price** from **January 1st, 2013 to August 31st, 2017**.

In [None]:
train_df.head()

**4.train_df**

**The training data, comprising time series of features store_nbr, family, and onpromotion as well as the target sales.**

* store_nbr identifies the store at which the products are sold.

* family identifies the type of product sold.

* sales gives the total sales for a product family at a particular store at a given date. Fractional values are possible since products can be sold in fractional units (1.5 kg of cheese, for instance, as opposed to 1 bag of chips).

* onpromotion gives the total number of items in a product family that were being promoted at a store at a given date.

In [None]:
test_df.head()

**5.test_df**
* The test data, having the same features as the training data. You will **predict the target sales for the dates in this file.**

* The **dates** in the test data are for the **15 days after the last date in the training data.**
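
A quick way to confirm how the test period relates to the training period is to compare the date columns. This is a sketch on toy dates standing in for `train_df["date"]` and `test_df["date"]`; the values are illustrative only.

```python
import pandas as pd

# Toy date columns standing in for train_df["date"] and test_df["date"]
train_dates = pd.Series(pd.to_datetime(["2017-08-14", "2017-08-15"]))
test_dates = pd.Series(pd.to_datetime(["2017-08-16", "2017-08-17", "2017-08-18"]))

# Days between the last training date and the first test date
gap_days = (test_dates.min() - train_dates.max()).days
# Number of distinct dates to forecast
horizon_days = test_dates.nunique()
print(gap_days, horizon_days)
```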

In [None]:
holidays_df.head()

**6.holidays_df**
* Holidays and Events, with metadata

* NOTE: Pay special attention to the transferred column. A holiday that is transferred officially falls on that calendar day, but was moved to another date by the government. A transferred day is more like a normal day than a holiday. To find the day that it was actually celebrated, look for the corresponding row where type is Transfer. For example, the holiday Independencia de Guayaquil was transferred from 2012-10-09 to 2012-10-12, which means it was celebrated on 2012-10-12. Days that are type Bridge are extra days that are added to a holiday (e.g., to extend the break across a long weekend). These are frequently made up by the type Work Day which is a day not normally scheduled for work (e.g., Saturday) that is meant to pay back the Bridge.

* Additional holidays are days added to a regular calendar holiday, for example, as typically happens around Christmas (making Christmas Eve a holiday).
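
The note on the transferred column can be sketched as a filter. The toy rows below are modelled on the Independencia de Guayaquil example; the descriptions and the `celebrated` name are illustrative, not taken from the real file.

```python
import pandas as pd

# Toy rows modelled on the example in the note above (values are illustrative)
holidays = pd.DataFrame({
    "date": pd.to_datetime(["2012-10-09", "2012-10-12", "2012-12-25"]),
    "type": ["Holiday", "Transfer", "Holiday"],
    "description": ["Independencia de Guayaquil",
                    "Traslado Independencia de Guayaquil",
                    "Navidad"],
    "transferred": [True, False, False],
})

# Rows with transferred == True mark the official date, which was worked as a normal day;
# dropping them leaves the dates on which a holiday was actually observed.
celebrated = holidays[~holidays["transferred"]].reset_index(drop=True)
print(celebrated[["date", "type", "description"]])
```

After the filter, the holiday appears only on 2012-10-12, through its Transfer row.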

# Standardising values 

### 1.store_sales

In [None]:
# work on a copy so that the dtype changes below do not alter train_df
store_sales=train_df.copy()

In [None]:
store_sales.info()

In [None]:
store_sales.family= store_sales.family.astype('category')
store_sales.store_nbr= store_sales.store_nbr.astype('category')
store_sales.date=pd.to_datetime(store_sales["date"])

In [None]:
store_sales.info()

**2.stores_df**

In [None]:
stores_df.info()

In [None]:
stores_df["cluster"]=stores_df.cluster.astype("category")
stores_df["type"]=stores_df.type.astype("category")


In [None]:
stores_df.info()

**3.oil_df**

In [None]:
oil_df.info()

In [None]:
oil_df["date"]=pd.to_datetime(oil_df["date"])

**4.transactions_df**

In [None]:
transactions_df["date"]=pd.to_datetime(transactions_df["date"])

In [None]:
transactions_df.info()

**5.holidays_df**

In [None]:
holidays_df.info()

In [None]:
holidays_df["type"]=holidays_df["type"].astype("category")
holidays_df["date"]=pd.to_datetime(holidays_df["date"])

In [None]:
holidays_df.info()

### **Merging all the dataframes in store_sales for performing Data Analysis**

In [None]:
# merging the other data into the training data
store_sales = store_sales.merge(holidays_df, on = 'date', how='left')
store_sales = store_sales.merge(oil_df, on = 'date', how='left')
store_sales = store_sales.merge(stores_df, on = 'store_nbr', how='left')
store_sales = store_sales.merge(transactions_df, on = ['date', 'store_nbr'], how='left')
# "type" exists in both holidays_df and stores_df, so pandas suffixes it with _x and _y;
# oil_df is merged only once and its price column is named oil_price here
store_sales = store_sales.rename(columns = {"type_x" : "holiday_type", "type_y" : "store_type", "dcoilwtico" : "oil_price"})

store_sales['date'] = pd.to_datetime(store_sales['date'])
store_sales['year'] = store_sales['date'].dt.year
store_sales['month'] = store_sales['date'].dt.month
store_sales['week'] = store_sales['date'].dt.isocalendar().week
store_sales['quarter'] = store_sales['date'].dt.quarter
store_sales['day_of_week'] = store_sales['date'].dt.day_name()


In [None]:
# year and month are already set in the previous cell; only the day of month is new
store_sales["day"]=store_sales["date"].dt.day



In [None]:
store_sales.head()
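
Because `holidays_df` is joined on `date` alone, any date that appears more than once in it would repeat the matching sales rows. The sketch below shows one way to check for this on toy frames (`sales` and `holidays` are hypothetical stand-ins); `validate` is a standard `DataFrame.merge` parameter.

```python
import pandas as pd

sales = pd.DataFrame({
    "date": pd.to_datetime(["2013-01-01", "2013-01-02"]),
    "sales": [10.0, 12.0],
})
# Two holiday rows share the same date, e.g. a national and a local event
holidays = pd.DataFrame({
    "date": pd.to_datetime(["2013-01-01", "2013-01-01"]),
    "type": ["Holiday", "Additional"],
})

merged = sales.merge(holidays, on="date", how="left")
print(len(sales), "->", len(merged))  # the row count grows when keys repeat

# validate="many_to_one" raises MergeError if the right-hand keys are not unique
try:
    sales.merge(holidays, on="date", how="left", validate="many_to_one")
    keys_unique = True
except pd.errors.MergeError:
    keys_unique = False
print("holiday dates unique:", keys_unique)
```

If the row count grows after a merge, averages computed on the merged frame weight the duplicated days more heavily.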

# Data Analysis:

### **1. Top 10 family based on average sales**

In [None]:
avg_sales_category=store_sales.groupby(by="family")["sales"].mean().sort_values(ascending=False)
avg_sales_top_10_categories=pd.DataFrame(avg_sales_category[:10])
avg_sales_top_10_categories.reset_index(inplace=True)

In [None]:
fig = go.Figure(data=[go.Pie(labels=avg_sales_top_10_categories["family"], values=avg_sales_top_10_categories["sales"], hole=.4)])

fig.update_layout(
    width=700,
    height=700)

fig.show()

### **2. Average sales each store type**

In [None]:
avg_sales_store_type=store_sales.groupby(by="store_type")["sales"].mean()
avg_sales_store_type=pd.DataFrame(avg_sales_store_type)
avg_sales_store_type.reset_index(inplace=True)


In [None]:
avg_sales_store_type= round(avg_sales_store_type, 2)
fig = go.Figure(data=[go.Pie(labels=avg_sales_store_type["store_type"], values=avg_sales_store_type["sales"], textinfo='label+percent',
                             insidetextorientation='radial'
                            )])
fig.show()

### **3. Average sales each cluster**

In [None]:
average_sales_each_cluster=pd.pivot_table(data=store_sales,values="sales",index="cluster",aggfunc="mean")
average_sales_each_cluster= round(average_sales_each_cluster, 2)

In [None]:
fig = px.bar(average_sales_each_cluster, x=average_sales_each_cluster.index, y='sales')
fig.show()

### **4.Average Weekly sales over each year:**

In [None]:
average_sales_weekly_over_each_year=pd.pivot_table(data=store_sales,values="sales",index="week",aggfunc="mean",columns="year")
average_sales_weekly_over_each_year= round(average_sales_weekly_over_each_year, 2)

In [None]:
fig = px.line(average_sales_weekly_over_each_year, x=average_sales_weekly_over_each_year.index, y=average_sales_weekly_over_each_year.columns,
              title='Average sales Weekly Over Each Year',markers=True)

fig.update_layout(
    width=1500,
    height=800)

fig.show()

### **5.Average Sales monthly each year:**

In [None]:
average_sales_month_year=pd.pivot_table(data=store_sales,values="sales",columns="month",index="year",aggfunc='mean')
average_sales_month_year= round(average_sales_month_year, 2)
fig = px.bar(average_sales_month_year, x=average_sales_month_year.columns, y=average_sales_month_year.index, orientation='h',
             height=400,
             title='Average sales figure monthly per year')
fig.update_layout(
    width=1400,
    height=500)

fig.show()

### **6.Average sales per day in a week**

In [None]:
#Average sales per day of the week; day names are mapped to 1-7 (Monday=1)
average_sales_daywise=store_sales.groupby(by="day_of_week")["sales"].mean()
average_sales_daywise= round(average_sales_daywise, 2)
average_sales_daywise.rename(index={"Monday":1,"Tuesday":2,"Wednesday":3,"Thursday":4,"Friday":5,"Saturday":6,"Sunday":7},inplace=True)


In [None]:
fig = px.bar(average_sales_daywise, x="sales", y=average_sales_daywise.index, orientation='h',
             height=400,
             title='Average sales per day in a week')
fig.update_layout(
    width=1400,
    height=500)
# label the numeric day codes 1-7 with day names (1 = Monday)
fig.update_yaxes(tickmode='array', tickvals=[i for i in range(1,8)],
                 ticktext=['Mon','Tue','Wed','Thu','Fri','Sat','Sun'])
fig.show()
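
An alternative to renaming day names to numbers is an ordered categorical, which makes `groupby` return the days in calendar order. This is a sketch on made-up daily sales; `toy` and `day_order` are illustrative names.

```python
import pandas as pd

# Toy daily sales; the values are made up for illustration
toy = pd.DataFrame({
    "date": pd.date_range("2017-01-02", periods=14, freq="D"),  # starts on a Monday
    "sales": range(14),
})
day_order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# An ordered categorical makes groupby sort by weekday instead of alphabetically
toy["day_of_week"] = pd.Categorical(toy["date"].dt.day_name(), categories=day_order, ordered=True)
by_day = toy.groupby("day_of_week", observed=False)["sales"].mean()
print(by_day)
```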

### **7.Monthly sales unit each state**

In [None]:
monthly_store_sales_statewise=store_sales.groupby(by=["state","month"])["sales"].mean()
monthly_store_sales_statewise=pd.DataFrame(monthly_store_sales_statewise)
monthly_store_sales_statewise.reset_index(inplace=True)

In [None]:
monthly_store_sales_statewise['sales'] = round(monthly_store_sales_statewise['sales'], 2)
fig = px.scatter(monthly_store_sales_statewise, x="state", y="month",color="sales",size="sales",title='Monthly sales figure each state')
fig.update_yaxes(tickmode = 'array', tickvals=[i for i in range(1,13)], 
                 ticktext=['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec'])
fig.update_layout(height=800,
                  width=1000,
                  plot_bgcolor='#fafafa')

fig.show()

In [None]:
store_sales.head()

### **8.Average sales vs oil price each day over the years**

In [None]:
# name the oil price column oil_price if the merge step has not already done so
store_sales.rename(columns={"dcoilwtico_y": "oil_price"},inplace=True)

In [None]:
avg_sales_each_day_vs_oil_price=pd.DataFrame(store_sales.groupby(by=["date","oil_price"])["sales"].mean())
avg_sales_each_day_vs_oil_price.reset_index(inplace=True)
avg_sales_each_day_vs_oil_price["sales"]=round(avg_sales_each_day_vs_oil_price["sales"],2)

In [None]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Create figure with secondary y-axis
fig = make_subplots(specs=[[{"secondary_y": True}]])

# Add traces
fig.add_trace(
    go.Scatter(x=avg_sales_each_day_vs_oil_price["date"], y=avg_sales_each_day_vs_oil_price.sales, name="Sales Unit"),
    secondary_y=False,
)

fig.add_trace(
    go.Scatter(x=avg_sales_each_day_vs_oil_price["date"], y=avg_sales_each_day_vs_oil_price.oil_price, name="Oil Price"),
    secondary_y=True,
)

# Add figure title
fig.update_layout(
    title_text="Average sales vs oil price each day over the years"
)

# Set x-axis title
fig.update_xaxes(title_text="Date")

# Set y-axes titles
fig.update_yaxes(title_text="Sales", secondary_y=False,type='log')
fig.update_yaxes(title_text="Oil price", secondary_y=True,type='log')


fig.show()
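
To put a number on the relationship shown in the chart, one option is a Pearson correlation between the two series. The sketch below uses made-up values; in the notebook, `avg_sales_each_day_vs_oil_price` would take the place of the toy `daily` frame.

```python
import pandas as pd

# Toy daily averages; in the notebook this role is played by avg_sales_each_day_vs_oil_price
daily = pd.DataFrame({
    "oil_price": [100.0, 90.0, 80.0, 60.0, 50.0],
    "sales": [200.0, 220.0, 260.0, 300.0, 330.0],
})

# Pearson correlation: values near -1 mean the two series move in opposite directions
corr = daily["sales"].corr(daily["oil_price"])
print(round(corr, 3))
```

A strongly negative value would be consistent with the chart, though on its own it says nothing about cause.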

### **9.Sales each Holiday in each month**

In [None]:
sales_each_holiday=pd.DataFrame(store_sales.groupby(by=["month","holiday_type"])["sales"].mean()).reset_index()
sales_each_holiday['sales'] = round(sales_each_holiday['sales'], 2)
sales_each_holiday.dropna(inplace=True)

In [None]:

fig = px.scatter(sales_each_holiday, x="month", y="holiday_type",color="sales",size="sales",title='Sales each Holiday in each month')
fig.update_yaxes(ticksuffix='  ')
fig.update_xaxes(tickmode = 'array', tickvals=[i for i in range(1,13)], 
                 ticktext=['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec'])
fig.update_layout(height=400,width=800,
                  plot_bgcolor='#fafafa')


fig.show()

### **10.Average sales each day throughout the year**

In [None]:
sales_each_day_of_every_years=pd.DataFrame(store_sales.groupby(by=["year","day"])["sales"].mean()).reset_index()

In [None]:
from plotly.subplots import make_subplots
fig = make_subplots(rows=3, cols=2)

# one line trace per year, placed left to right and top to bottom in the 3x2 grid
for i, year in enumerate(range(2013, 2018)):
    yearly = sales_each_day_of_every_years[sales_each_day_of_every_years["year"]==year]
    fig.add_trace(go.Scatter(x=yearly["day"],
                             y=yearly["sales"],
                             mode='lines',
                             name=str(year)),row=i // 2 + 1,col=i % 2 + 1)
fig.update_xaxes(title="day")
fig.update_yaxes(title="sales")
fig.update_layout(title="Average sales across each year",height=1000,width=1400)

fig.show()
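
One way to check whether particular days of the month stand out, such as the 15th and 30th discussed in the insights, is to compare their mean sales with the remaining days. This is a sketch on made-up numbers; the bump is placed in the toy data on purpose and is not a result.

```python
import pandas as pd

# Toy day-of-month averages; in the notebook these come from sales_each_day_of_every_years
toy = pd.DataFrame({"day": range(1, 32)})
toy["sales"] = 100.0
toy.loc[toy["day"].isin([15, 30]), "sales"] = 150.0  # made-up payday bump

# Compare mean sales on the assumed paydays with all other days
is_payday = toy["day"].isin([15, 30])
payday_mean = toy.loc[is_payday, "sales"].mean()
other_mean = toy.loc[~is_payday, "sales"].mean()
print(payday_mean, other_mean)
```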

### **11.Average sales each store type throughout the year**

In [None]:
# data
Store_Type_Vs_Year_Month= pd.DataFrame(store_sales.groupby(by=['year','month',"store_type"])["sales"].mean()).reset_index()
Store_Type_Vs_Year_Month['sales'] = round( Store_Type_Vs_Year_Month['sales'], 2)
Store_Type_Vs_Year_Month.dropna(inplace=True)
# chart
fig = px.scatter( Store_Type_Vs_Year_Month, x='month', y='store_type', color='sales' ,size="sales",
                 facet_row='year', title='Average Sales: Store Type Vs Year(Month)')
# styling
fig.update_yaxes(ticksuffix='  ')
fig.update_xaxes(tickmode = 'array', tickvals=[i for i in range(1,13)], 
                 ticktext=['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec'])
fig.update_layout(height=900, xaxis_title='', yaxis_title='',
                  margin=dict(t=70, b=0),
                  plot_bgcolor='#fafafa', paper_bgcolor='#fafafa')
fig.show()

# **Insights**

1. **Top 10 family based on average sales**- Among them, Grocery is the highest-selling product category, accounting for around 35% of the sales units, which is expected since groceries are an essential commodity.

2. **Average sales each store type**- Ranking the store types by sales performance gives: A>D>B>E>C.

3. **Average sales each cluster**- Cluster 15 has the highest average sales, followed by clusters 14 and 8. Cluster 7 has the lowest.

4. **Average weekly sales over each year**- Sales rise in December, the last month of the year, and then gradually decrease in January.

5. **Average sales monthly each year**- Sales are highest in December and lowest in January.

6. **Average sales per day in a week**- Sales are highest on Sunday, when most people have the day off and visit stores to stock up on daily needs for the week ahead. This in turn leads to low sales at the start of the week, i.e. Monday.

7. **Monthly sales unit each state**- The state of Pichincha has the highest sales throughout the months, and the state of Pastaza has the lowest.

8. **Average sales vs oil price each day over the years**- The graph shows that as the oil price dropped over the years, sales began to rise. Since Ecuador is an oil exporter, the oil price strongly influences its economy. In this data the oil price moves inversely to unit sales, which suggests that citizens' purchasing power increased.

9. **Sales each Holiday in each month**- Average sales were highest on Transfer holidays. The Christmas holiday months of December and January, the pre-Christmas month of November, and the month of May also showed strong shopping activity.

10. **Average sales across each year**- Salaries are paid on the 15th and the 30th, and we can see a sudden jump in sales around these days.

11. **Average sales each store type through out the year**- As analysed previously, type A has the highest sales, followed by D, B, E and C.


**Reference:** Thanks to [Kashish Rastogi](https://www.kaggle.com/kashishrastogi/store-sales-analysis-time-serie) for the incredible presentation ideas.