# Introduction

This is my first data analysis notebook as a newcomer to Kaggle. Please bear with the redundant source code that appears throughout. If there is a better way to write any of it, please let me know!

# These notes will show you

- Purpose of the competition
- Overview of the data provided
- Basic analysis of the data provided

# Understand the purpose of the competition

The goal of M5 Forecasting - Accuracy is to predict 28 days of sales for Walmart, the largest retailer in the United States.

# Understand the data provided

- sales_train_validation.csv

Daily sales volume data by product and store. It's possible to see "when, in which stores, in which category, and which products sold".

- calendar.csv 

It contains date information for each sales day. Joining its "d" column to the "d_" columns of "sales_train_validation.csv" lets you view the sales data by calendar date.

- sample_submission.csv

It shows the correct form of submission.

- sell_prices.csv

The selling price of each product is shown.

- sales_train_evaluation.csv

Released one month before the competition deadline. It includes the sales data.


In this analysis, we will mainly use "sales_train_validation.csv" and "calendar.csv".
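The link between the two main files can be sketched on toy data. The frames below are made-up stand-ins for the real CSVs, and the names `long_sales` and `units` are assumptions chosen for illustration; only the `d` key and the `date` column mirror the provided files.

```python
import pandas as pd

# Toy stand-ins for sales_train_validation.csv and calendar.csv (values are made up).
sales = pd.DataFrame({
    "id": ["FOODS_1_001_CA_1_validation", "HOBBIES_1_001_CA_1_validation"],
    "cat_id": ["FOODS", "HOBBIES"],
    "d_1": [3, 0],
    "d_2": [1, 2],
    "d_3": [4, 0],
})
calendar = pd.DataFrame({
    "d": ["d_1", "d_2", "d_3"],
    "date": ["2011-01-29", "2011-01-30", "2011-01-31"],
})

# Reshape from one column per day to one row per (series, day).
long_sales = sales.melt(id_vars=["id", "cat_id"], var_name="d", value_name="units")

# Attach the calendar date through the shared "d" key (many sales rows per calendar row).
long_sales = long_sales.merge(calendar, on="d", how="left", validate="m:1")
long_sales["date"] = pd.to_datetime(long_sales["date"])
print(long_sales)
```

The cells in this notebook keep the wide layout and merge on the transposed index instead; the long layout shown here is just another way to see the same join.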

# Read the data and check the provided data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from itertools import cycle
# set_option
pd.set_option('display.max_columns', 50)
plt.style.use('bmh')
!ls -GFlash --color ../input/m5-forecasting-accuracy/

In [None]:
INPUT_DIR = '../input/m5-forecasting-accuracy'
cal = pd.read_csv(f'{INPUT_DIR}/calendar.csv')
stv = pd.read_csv(f'{INPUT_DIR}/sales_train_validation.csv')
ss = pd.read_csv(f'{INPUT_DIR}/sample_submission.csv')
sellp = pd.read_csv(f'{INPUT_DIR}/sell_prices.csv')

In [None]:
ss.head()

In [None]:
stv.head()

In [None]:
cal.head()

# Data Analysis

## Analysis
1. number of sales by product category in chronological order (all years, annual and weekly)
2. number of sales per store in chronological order (all years, annual and weekly)
3. number of sales by region in time series (all years, annual and weekly)
4. number of stores by region
5. number of sales in each product category by region
6. top 5 best-selling products
7. sales of the best-selling products by store

## Preliminary preparations

In [None]:
# Check the size of your training data
stv.info()

In [None]:
# Basic analysis of training data
stv.describe() 

## 1. Number of sales by product category in chronological order

In [None]:
# Day columns are named d_1, d_2, ...; match on the prefix only
d_cols = [c for c in stv.columns if c.startswith('d_')]

In [None]:
df= stv.groupby('cat_id')[d_cols].sum().sum(axis=1).sort_values()
df.name = 'Category_sum'
df

### Total sales by category (all years)

In [None]:
pd.DataFrame(df).plot(kind='barh', figsize=(15, 5), title='Total sales by category (all years)')
plt.show()

### Considerations.
- FOODS accounts for by far the largest number of sales.

### Time series Total sales by category (all years)

In [None]:
stv.groupby('cat_id')[d_cols].sum().T.merge(cal.set_index('d')['date'],
           left_index=True,
           right_index=True,
            validate='1:1') \
    .set_index('date').plot(kind='line',figsize=(18, 8))
plt.title("Time series Total sales by category (all years)")
plt.show()

### Considerations.
- Once a year there is a point where sales drop to zero.
- The exact dates cannot be read off the graph above because the time axis is too coarse, so shorter periods are examined below.
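One way to pin down such dips without zooming into the plot is to ask for the lowest-selling date in each year. The sketch below uses a synthetic series; placing the closed day on December 25 is an assumption made for the toy data, not a result taken from the notebook.

```python
import pandas as pd

# Synthetic daily totals for two years: constant sales with one closed day per year.
dates = pd.date_range("2013-01-01", "2014-12-31", freq="D")
total = pd.Series(100, index=dates, name="units")
total.loc[pd.Timestamp("2013-12-25")] = 0  # assumed closed day (toy data)
total.loc[pd.Timestamp("2014-12-25")] = 0  # assumed closed day (toy data)

# For each calendar year, report the date with the lowest total.
lowest_day = total.groupby(total.index.year).idxmin()
print(lowest_day)
```

With the real data, `total` would be the sum over all categories of the date-indexed frame built in the cell above, with the index converted by `pd.to_datetime`.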

### Time series Total sales by category Moving average (all years)

In [None]:
stv.groupby('cat_id')[d_cols].sum().T.rolling(90).mean().merge(cal.set_index('d')['date'],
           left_index=True,
           right_index=True,
            validate='1:1') \
    .set_index('date').plot(kind='line',figsize=(15, 5))
plt.title("Rolling 90 Day Average Total Sales (Category)")
plt.show()

### Considerations.
- Overall, sales rise at a moderate pace over the period.
- In particular, all categories appear to be rising around 2013-02-17.
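To make the smoothing step concrete, here is a minimal sketch of what `rolling(...).mean()` computes. A 3-day window stands in for the 90-day window used above, and the numbers are made up.

```python
import pandas as pd

daily = pd.Series([10, 20, 30, 40, 50], name="units")

# Each value is the mean of the current day and the two days before it.
smoothed = daily.rolling(3).mean()
# The first window-1 values are NaN because the window is not yet full.
print(smoothed.tolist())  # [nan, nan, 20.0, 30.0, 40.0]

# min_periods=1 averages over whatever is available instead of returning NaN.
smoothed_partial = daily.rolling(3, min_periods=1).mean()
print(smoothed_partial.tolist())  # [10.0, 15.0, 20.0, 30.0, 40.0]
```

This is why the 90-day curves start roughly three months after the first date in the data.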

### Time series Total sales by category (2015)

The analysis will be conducted on the 2015 information.
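The cells that follow select periods by matching string prefixes on the `date` index. A possible alternative, sketched here on a synthetic frame (`df_demo` is a made-up stand-in for the notebook's `df`), is to convert the index to datetimes once and slice with partial date strings.

```python
import pandas as pd

# Synthetic frame with string dates as index, mimicking the notebook's `df`.
dates = pd.date_range("2014-12-30", "2015-04-02", freq="D").strftime("%Y-%m-%d")
df_demo = pd.DataFrame({"FOODS": range(len(dates))}, index=dates)

# Convert the index once, then slice by year or by a range of months.
df_demo.index = pd.to_datetime(df_demo.index)
year_2015 = df_demo.loc["2015"]
q1_2015 = df_demo.loc["2015-01":"2015-03"]  # both end months are included

print(len(year_2015), len(q1_2015))
```

A datetime index also lets matplotlib choose sensible tick labels for the time axis, which string dates do not.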

In [None]:
df = stv.groupby('cat_id')[d_cols].sum().T.merge(cal.set_index('d')['date'],left_index=True,right_index=True,validate='1:1').set_index('date')
df[df.index.str.startswith('2015-')].plot(kind='line',figsize=(18, 8))
plt.title("Time series Total sales by category (2015)")
plt.show()

### Time Series Total Sales by Category (January-March 2015)

In [None]:
df = stv.groupby('cat_id')[d_cols].sum().T.merge(cal.set_index('d')['date'],left_index=True,right_index=True,validate='1:1').set_index('date')
df[df.index.str.startswith('2015-01-') | df.index.str.startswith('2015-02-')| df.index.str.startswith('2015-03-')].plot(kind='line',figsize=(18, 8))
plt.title("Time Series Total Sales by Category (January-March 2015)")
plt.show()

### Considerations.
- Sales temporarily drop on January 1 due to the New Year holiday.

### Time Series Total Sales by Category (April - June 2015)

In [None]:
df = stv.groupby('cat_id')[d_cols].sum().T.merge(cal.set_index('d')['date'],left_index=True,right_index=True,validate='1:1').set_index('date')
df[df.index.str.startswith('2015-04-') | df.index.str.startswith('2015-05-')| df.index.str.startswith('2015-06-')].plot(kind='line',figsize=(18, 8))
plt.title("Time Series Total Sales by Category (April - June 2015)")
plt.show()

### Time Series Total Sales by Category (July-September 2015)

In [None]:
df = stv.groupby('cat_id')[d_cols].sum().T.merge(cal.set_index('d')['date'],left_index=True,right_index=True,validate='1:1').set_index('date')
df[df.index.str.startswith('2015-07-') | df.index.str.startswith('2015-08-')| df.index.str.startswith('2015-09-')].plot(kind='line',figsize=(18, 8))
plt.title("Time Series Total Sales by Category (July-September 2015)")
plt.show()

### Time Series Total Sales by Category (October - December 2015)

In [None]:
df = stv.groupby('cat_id')[d_cols].sum().T.merge(cal.set_index('d')['date'],left_index=True,right_index=True,validate='1:1').set_index('date')
df[df.index.str.startswith('2015-10-') | df.index.str.startswith('2015-11-')| df.index.str.startswith('2015-12-')].plot(kind='line',figsize=(18, 8))
plt.title("Time Series Total Sales by Category (October - December 2015)")
plt.show()

### Time Series Total Sales by Category (October 2015)
Review the data for October and check the weekly data cycle.

In [None]:
df = stv.groupby('cat_id')[d_cols].sum().T.merge(cal.set_index('d')['date'],left_index=True,right_index=True,validate='1:1').set_index('date')
df[df.index.str.startswith('2015-10-')].plot(kind='line',figsize=(18, 8))

plt.title("Time series Total sales by category (October 2015)")
plt.show()

### Considerations.
- Sales go up on weekends and holidays.
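The weekly cycle can also be checked numerically by averaging sales per weekday. The series below is synthetic (weekend days are assigned higher values on purpose), so it only illustrates the mechanics.

```python
import pandas as pd

# Synthetic October 2015 series: weekend days sell 150 units, weekdays 100.
dates = pd.date_range("2015-10-01", "2015-10-31", freq="D")
units = pd.Series([150 if d.dayofweek >= 5 else 100 for d in dates], index=dates)

# Average units per weekday, ordered Monday to Sunday.
order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
by_weekday = units.groupby(units.index.day_name()).mean().reindex(order)
print(by_weekday)
```

On the real data the same grouping could be applied to each category column after converting the `date` index with `pd.to_datetime`.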

## 2. Number of sales per store in chronological order
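The aggregate, transpose, and merge chain from the category section recurs for stores and regions. One way to reduce the repetition is a small helper; `daily_totals` is a hypothetical name, not part of the original notebook, and the data below is made up.

```python
import pandas as pd

def daily_totals(sales, calendar, by, day_cols):
    """Sum unit sales per group; return a date-indexed frame with one column per group."""
    totals = sales.groupby(by)[day_cols].sum().T
    # Map the d_1, d_2, ... labels to calendar dates.
    dates = pd.to_datetime(calendar.set_index("d")["date"])
    totals.index = pd.DatetimeIndex(totals.index.map(dates), name="date")
    return totals

# Toy data standing in for stv and cal.
sales = pd.DataFrame({
    "store_id": ["CA_1", "CA_1", "TX_1"],
    "cat_id": ["FOODS", "HOBBIES", "FOODS"],
    "d_1": [1, 2, 3],
    "d_2": [4, 5, 6],
})
calendar = pd.DataFrame({"d": ["d_1", "d_2"], "date": ["2015-10-01", "2015-10-02"]})

by_store = daily_totals(sales, calendar, "store_id", ["d_1", "d_2"])
by_cat = daily_totals(sales, calendar, "cat_id", ["d_1", "d_2"])
print(by_store)
```

In the notebook this would be called as `daily_totals(stv, cal, 'store_id', d_cols)`, and the result can be plotted or sliced by period directly.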

### Total sales per store (all years)

In [None]:
df= stv.groupby('store_id')[d_cols].sum().sum(axis=1).sort_values()
df.name = 'store_sum'
df

In [None]:
pd.DataFrame(df).plot(kind='barh', figsize=(15, 5), title='Total sales per store (all years)')
plt.show()

### Time series Total sales by store (all years)

In [None]:
# Sales trends of each store
stv.groupby('store_id')[d_cols].sum().T.merge(cal.set_index('d')['date'],
           left_index=True,
           right_index=True,
            validate='1:1') \
    .set_index('date').plot(kind='line',figsize=(18, 8))
plt.title("Time series Total sales by store (all years)")
plt.show()

### Time series Total sales per store Moving average line (all years)

In [None]:
# Rolling 90-day average of total sales (store)
stv.groupby('store_id')[d_cols].sum().T.rolling(90).mean().merge(cal.set_index('d')['date'],
           left_index=True,
           right_index=True,
            validate='1:1') \
    .set_index('date').plot(kind='line',figsize=(15, 5))
plt.title("Time series Total sales per store Moving average line (all years)")
plt.show()

### Time series Total sales by store (Jan-Mar 2015)

In [None]:
df = stv.groupby('store_id')[d_cols].sum().T.merge(cal.set_index('d')['date'],left_index=True,right_index=True,validate='1:1').set_index('date')
df[df.index.str.startswith('2015-01-') | df.index.str.startswith('2015-02-')| df.index.str.startswith('2015-03-')].plot(kind='line',figsize=(18, 8))
plt.title("Time series Total sales by store (Jan-Mar 2015)")
plt.show()

### Time series Total sales by store (April - June 2015)

In [None]:
df = stv.groupby('store_id')[d_cols].sum().T.merge(cal.set_index('d')['date'],left_index=True,right_index=True,validate='1:1').set_index('date')
df[df.index.str.startswith('2015-04-') | df.index.str.startswith('2015-05-')| df.index.str.startswith('2015-06-')].plot(kind='line',figsize=(18, 8))
plt.title("Time series Total sales by store (April - June 2015)")
plt.show()

### Time series Total sales by store (Jul-Sep 2015)

In [None]:
df = stv.groupby('store_id')[d_cols].sum().T.merge(cal.set_index('d')['date'],left_index=True,right_index=True,validate='1:1').set_index('date')
df[df.index.str.startswith('2015-07-') | df.index.str.startswith('2015-08-')| df.index.str.startswith('2015-09-')].plot(kind='line',figsize=(18, 8))
plt.title("Time series Total sales by store (Jul-Sep 2015)")
plt.show()

### Time series Total sales by store (October - December 2015)

In [None]:
df = stv.groupby('store_id')[d_cols].sum().T.merge(cal.set_index('d')['date'],left_index=True,right_index=True,validate='1:1').set_index('date')
df[df.index.str.startswith('2015-10-') | df.index.str.startswith('2015-11-')| df.index.str.startswith('2015-12-')].plot(kind='line',figsize=(18, 8))
plt.title("Time series Total sales by store (October - December 2015)")
plt.show()

### Time series Total sales by store (October 2015)

In [None]:
df = stv.groupby('store_id')[d_cols].sum().T.merge(cal.set_index('d')['date'],left_index=True,right_index=True,validate='1:1').set_index('date')
df[df.index.str.startswith('2015-10-')].plot(kind='line',figsize=(18, 8))

plt.title("Time series Total sales by store (October 2015)")
plt.show()

## 3. Sales by region in time series

### Total sales by region (all years)

In [None]:
df= stv.groupby('state_id')[d_cols].sum().sum(axis=1).sort_values()
df.name = 'state_sum'
df

In [None]:
pd.DataFrame(df).plot(kind='barh', figsize=(15, 5), title='Total sales by region (all years)')
plt.show()

### Time series Total sales by region (all years)

In [None]:
# Sales trends of each region
stv.groupby('state_id')[d_cols].sum().T.merge(cal.set_index('d')['date'],
           left_index=True,
           right_index=True,
            validate='1:1') \
    .set_index('date').plot(kind='line',figsize=(18, 8))
plt.title("Time series Total sales by region (all years)")
plt.show()

### Time series Total sales by region Moving average (all years)

In [None]:
# Rolling 90-day average of total sales (region)
stv.groupby('state_id')[d_cols].sum().T.rolling(90).mean().merge(cal.set_index('d')['date'],
           left_index=True,
           right_index=True,
            validate='1:1') \
    .set_index('date').plot(kind='line',figsize=(15, 5))
plt.title("Time series Total sales by region Moving average (all years)")
plt.show()

### Time series Total sales by region (Jan-Mar 2015)

In [None]:
df = stv.groupby('state_id')[d_cols].sum().T.merge(cal.set_index('d')['date'],left_index=True,right_index=True,validate='1:1').set_index('date')
df[df.index.str.startswith('2015-01-') | df.index.str.startswith('2015-02-')| df.index.str.startswith('2015-03-')].plot(kind='line',figsize=(18, 8))
plt.title("Time series Total sales by region (Jan-Mar 2015)")
plt.show()

### Time series Total sales by region (Apr. 2015 - Jun. 2015)

In [None]:
df = stv.groupby('state_id')[d_cols].sum().T.merge(cal.set_index('d')['date'],left_index=True,right_index=True,validate='1:1').set_index('date')
df[df.index.str.startswith('2015-04-') | df.index.str.startswith('2015-05-')| df.index.str.startswith('2015-06-')].plot(kind='line',figsize=(18, 8))
plt.title("Time series Total sales by region (Apr. 2015 - Jun. 2015)")
plt.show()

### Time series Total sales by region (Jul-Sep 2015)

In [None]:
df = stv.groupby('state_id')[d_cols].sum().T.merge(cal.set_index('d')['date'],left_index=True,right_index=True,validate='1:1').set_index('date')
df[df.index.str.startswith('2015-07-') | df.index.str.startswith('2015-08-')| df.index.str.startswith('2015-09-')].plot(kind='line',figsize=(18, 8))
plt.title("Time series Total sales by region (Jul-Sep 2015)")
plt.show()

### Time series Total sales by region (October - December 2015)

In [None]:
df = stv.groupby('state_id')[d_cols].sum().T.merge(cal.set_index('d')['date'],left_index=True,right_index=True,validate='1:1').set_index('date')
df[df.index.str.startswith('2015-10-') | df.index.str.startswith('2015-11-')| df.index.str.startswith('2015-12-')].plot(kind='line',figsize=(18, 8))
plt.title("Time series Total sales by region (October - December 2015)")
plt.show()

### Time series Total sales by region (October 2015)

In [None]:
df = stv.groupby('state_id')[d_cols].sum().T.merge(cal.set_index('d')['date'],left_index=True,right_index=True,validate='1:1').set_index('date')
df[df.index.str.startswith('2015-10-')].plot(kind='line',figsize=(18, 8))

plt.title("Time series Total sales by region (October 2015)")
plt.show()

## 4.Number of stores by region

In [None]:
# Count distinct stores per state (select the column before calling nunique)
df = stv.groupby('state_id')['store_id'].nunique().sort_values()
df.name = 'store_count'
pd.DataFrame(df).plot(kind='barh', figsize=(15, 5), title='Number of stores by region')
plt.show()

## 5. Number of sales in each product category by region

In [None]:
# Sum only the day columns, then pivot categories into columns (rows: state, columns: category)
df = stv.groupby(['state_id','cat_id'])[d_cols].sum().sum(axis=1).unstack('cat_id')
df

In [None]:
df.plot(kind='barh', alpha=0.6,figsize=(9, 3), title='Number of sales in each product category by region')
plt.show()

## 6. Top 5 best-selling products

In [None]:
# Sum only the day columns so the string id columns are not aggregated
stv.groupby(['item_id'])[d_cols].sum().sum(axis=1).sort_values(axis=0,ascending=False).head(5)

### Considerations.
- The five best-selling products are FOODS_3_090, FOODS_3_586, FOODS_3_252, FOODS_3_555, and FOODS_3_714.
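As a compact alternative to sorting the whole series, `Series.nlargest` returns only the top entries. The sketch below uses made-up items and two day columns; picking the top 2 instead of the top 5 is only to keep the toy data small.

```python
import pandas as pd

# Toy sales table: item "A" is sold in two stores, "B" and "C" in one each.
sales = pd.DataFrame({
    "item_id": ["A", "A", "B", "C"],
    "d_1": [5, 1, 2, 9],
    "d_2": [5, 1, 2, 0],
})
day_cols = ["d_1", "d_2"]

# Total units per item over all stores and days, then keep the largest totals.
item_totals = sales.groupby("item_id")[day_cols].sum().sum(axis=1)
top2 = item_totals.nlargest(2)
print(top2)
```

On the notebook's data the equivalent call would be `stv.groupby('item_id')[d_cols].sum().sum(axis=1).nlargest(5)`.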

## 7. Sales of the best-selling products by store
The five best-selling products (FOODS_3_090, FOODS_3_586, FOODS_3_252, FOODS_3_555, and FOODS_3_714) are examined here.

In [None]:
# Sum only the day columns so the string id columns are not aggregated
df = stv.groupby(['item_id','store_id'])[d_cols].sum().sum(axis=1)
df['FOODS_3_090'].plot(kind='barh', alpha=0.6,figsize=(9, 3), title='FOODS_3_090')
plt.show()

In [None]:
df['FOODS_3_586'].plot(kind='barh', alpha=0.6,figsize=(9, 3), title='FOODS_3_586')
plt.show()

In [None]:
df['FOODS_3_252'].plot(kind='barh', alpha=0.6,figsize=(9, 3), title='FOODS_3_252')
plt.show()

In [None]:
df['FOODS_3_555'].plot(kind='barh', alpha=0.6,figsize=(9, 3), title='FOODS_3_555')
plt.show()

In [None]:
df['FOODS_3_714'].plot(kind='barh', alpha=0.6,figsize=(9, 3), title='FOODS_3_714')
plt.show()