# Walmart Retail Goods Unit Sales Forecasting (Kaggle M5)

<img src="../img/kaggle-m5-overview.png"/>

# 1a.EDA-Introduction

- Estimate the unit sales of Walmart retail goods
- From 2011-01-29 to 2016-06-19

# Dataset
- calendar.csv - The dates on which products were sold (df_calendar)
- sales_train_validation.csv - Historical daily unit sales per product and store [d_1 - d_1913] (df_sales)
- sell_prices.csv - Prices of the products sold, per store and date (df_prices)
- sales_train_evaluation.csv - Available one month before the competition deadline. Will include sales [d_1 - d_1941]
- sample_submission.csv - The correct format for submissions. Reference the Evaluation tab for more info. (df_sub)

<img src="../img/data-overview-01.png" align="left">
<img src="../img/data-overview-02.png" align="left">

## Import Dataset
### Import required libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import os
from itertools import cycle
import warnings

%matplotlib inline

warnings.simplefilter("ignore", DeprecationWarning)
warnings.simplefilter("ignore", FutureWarning)

### Download datasets

In [None]:
!mkdir -p data
!wget -O ./data/kaggle-m5.zip https://sagemaker-sinjoonk.s3.amazonaws.com/kaggle/kaggle-m5.zip
!unzip ./data/kaggle-m5.zip

In [None]:
# Create Dataframes
df_sales = pd.read_csv("./data/sales_train_validation.csv")
df_prices = pd.read_csv("./data/sell_prices.csv")
df_calendar = pd.read_csv("./data/calendar.csv")

# sales_train_validation(df_sales)

Contains the historical daily unit sales data per product and store [d_1 - d_1913].

- item_id: The id of the product.
- dept_id: The id of the department the product belongs to.
- cat_id: The id of the category the product belongs to.
- store_id: The id of the store where the product is sold.
- state_id: The State where the store is located.
- d_1, d_2, ..., d_i, ... d_1913: The number of units sold on day i, starting from 2011-01-29.
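
Since the day columns count up from 2011-01-29, a column name can be converted to a calendar date with simple date arithmetic. The following is a minimal sketch of that conversion; the helper name `day_column_to_date` is hypothetical and not part of the dataset or of any library.

```python
from datetime import date, timedelta

# d_1 corresponds to 2011-01-29, as stated in the column description.
START_DATE = date(2011, 1, 29)


def day_column_to_date(d_col: str) -> date:
    """Map a day column name such as 'd_1913' to its calendar date."""
    day_index = int(d_col.split("_")[1])  # 'd_1913' -> 1913
    return START_DATE + timedelta(days=day_index - 1)


print(day_column_to_date("d_1"))
print(day_column_to_date("d_1913"))
```

This is handy for sanity-checking that the last training column lines up with the expected end of the training window.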

In [None]:
df_sales.head()

In [None]:
df_sales.info()

In [None]:
# Count ids per "state_id", "store_id", "cat_id", "dept_id"

titles = ["state_id", "store_id", "cat_id", "dept_id"]
for title in titles:
    df_sales.groupby(title)["id"].count().reset_index().plot.bar(x=title, figsize=(8,5))

- Observations
    - Every store carries the same number of items.
    - The `FOODS` category contains the most items (these plots count rows, not units sold).
    - Within `FOODS`, the `FOODS_3` department contains the most items.

In [None]:
# Check for NaN values
df_sales.isnull().values.any()

### Sales volume by "state_id", "store_id", "cat_id", "dept_id"

In [None]:
# Collect the day columns (names starting with "d_")
d_cols = [c for c in df_sales.columns if c.startswith("d_")]

# Sum the daily unit sales across all day columns into a new "sales_total" column
df_sales["sales_total"] = df_sales.loc[:, d_cols].sum(axis=1)

In [None]:
titles = ["state_id", "store_id", "cat_id", "dept_id"]
for title in titles:
    df_sales.groupby(title)["sales_total"].sum().reset_index().plot.bar(x=title, figsize=(8,5))

- Observations
    - CA (California) has the largest sales_total.
    - The FOODS category has the largest sales_total.
    - Within FOODS, dept_id FOODS_3 has the largest sales_total.
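
Bar charts show the ranking, but expressing each group as a share of the total makes the gap easier to quantify. Below is a minimal sketch of that calculation on a toy frame; `toy_sales` and its numbers are made up for illustration and are not taken from `df_sales`.

```python
import pandas as pd

# Toy stand-in for df_sales; the values are invented for illustration only.
toy_sales = pd.DataFrame({
    "cat_id": ["FOODS", "FOODS", "HOBBIES", "HOUSEHOLD"],
    "sales_total": [60, 20, 5, 15],
})

# Sum sales per category, then divide by the grand total to get shares.
share = toy_sales.groupby("cat_id")["sales_total"].sum()
share = (share / share.sum()).sort_values(ascending=False)
print(share)
```

Applying the same two lines to `df_sales` with `"state_id"`, `"store_id"`, or `"dept_id"` as the grouping key would give the corresponding shares for the real data.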

# Calendar (df_calendar)
Contains the dates on which products are sold. The dates are in yyyy-mm-dd format.

- date: The date in a “y-m-d” format.
- wm_yr_wk: The id of the week the date belongs to.
- weekday: The type of the day (Saturday, Sunday, ..., Friday).
- wday: The id of the weekday, starting from Saturday.
- month: The month of the date.
- year: The year of the date.
- d: The day identifier (d_1, d_2, ...), matching the day columns in df_sales.
- event_name_1: If the date includes an event, the name of this event.
- event_type_1: If the date includes an event, the type of this event.
- event_name_2: If the date includes a second event, the name of this event.
- event_type_2: If the date includes a second event, the type of this event.
- snap_CA, snap_TX, and snap_WI: A binary variable (0 or 1) indicating whether the stores of CA, TX or WI allow SNAP purchases on the examined date. 1 indicates that SNAP purchases are allowed.
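
The `d` column is what ties the calendar to the day columns of the sales table. As a rough sketch of how that link could be used, the example below reshapes a toy sales frame from wide to long format and joins it to a toy calendar on `d`. Both frames and the `ITEM_*` ids are hypothetical stand-ins, not rows from the real files.

```python
import pandas as pd

# Toy stand-ins for df_sales and df_calendar (illustrative values only).
toy_sales = pd.DataFrame({
    "id": ["ITEM_A_CA_1", "ITEM_B_CA_1"],
    "d_1": [3, 0],
    "d_2": [1, 2],
})
toy_calendar = pd.DataFrame({
    "d": ["d_1", "d_2"],
    "date": ["2011-01-29", "2011-01-30"],
    "weekday": ["Saturday", "Sunday"],
})

# Wide -> long: one row per (id, day) with the units sold that day.
long_sales = toy_sales.melt(id_vars="id", var_name="d", value_name="units")

# Attach the calendar attributes of each day via the shared "d" key.
merged = long_sales.merge(toy_calendar, on="d", how="left")
print(merged)
```

On the full dataset the long format has one row per item, store, and day, so it is much larger than the wide table; melting a subset first is usually the safer way to explore.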

In [None]:
print(df_calendar.min())
print(df_calendar.max())

In [None]:
df_calendar.head(5)

In [None]:
df_calendar.info()

- Observation: there are no NaN values except in the event-related columns.

### Event information

In [None]:
# List the distinct event names
df_calendar["event_name_1"].unique()

- Observation: NBAFinalsStart and NBAFinalsEnd mark the start and end of a multi-day event.

In [None]:
df_calendar[df_calendar["event_name_1"].isin(["NBAFinalsStart", "NBAFinalsEnd"])]

In [None]:
df_calendar[df_calendar["wm_yr_wk"] == 11118]

- Observation: the NBA Finals event lasts roughly two weeks.
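
The duration can also be computed directly instead of being read off the table. The sketch below pairs each start marker with the end marker of the same year and subtracts the dates. `toy_calendar` is a hypothetical stand-in whose rows are illustrative rather than read from `calendar.csv`.

```python
import pandas as pd

# Toy stand-in for df_calendar; rows are illustrative, not read from the file.
toy_calendar = pd.DataFrame({
    "date": ["2011-05-30", "2011-05-31", "2011-06-12"],
    "year": [2011, 2011, 2011],
    "event_name_1": [None, "NBAFinalsStart", "NBAFinalsEnd"],
})
toy_calendar["date"] = pd.to_datetime(toy_calendar["date"])

# Keep only the two marker events, then put start and end side by side per year.
finals = toy_calendar[toy_calendar["event_name_1"].isin(["NBAFinalsStart", "NBAFinalsEnd"])]
span = finals.pivot(index="year", columns="event_name_1", values="date")

# Days elapsed between the start and end markers.
span["duration_days"] = (span["NBAFinalsEnd"] - span["NBAFinalsStart"]).dt.days
print(span)
```

Running the same steps on `df_calendar` (after converting `date` with `pd.to_datetime`) would give one duration per year.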

In [None]:
# "event_name_1" is set on 162 days in total, "event_name_2" on 5 days

print("# of event_name_1 : {}".format(df_calendar["event_name_1"].notnull().sum()))
print("# of event_name_2 : {}".format(df_calendar["event_name_2"].notnull().sum()))
print("Event ratio(%) : {:.2%}".format(df_calendar["event_name_1"].notnull().sum()/len(df_calendar)))

# sell_prices (df_prices)
Contains information about the price of the products sold per store and date.

- store_id: The id of the store where the product is sold.
- item_id: The id of the product.
- wm_yr_wk: The id of the week.
- sell_price: The price of the product for the given week/store. The price is provided per week (average across seven days). If not available, this means that the product was not sold during the examined week. Note that although prices are constant on a weekly basis, they may change over time (both training and test set).

In [None]:
df_prices.head(5)

In [None]:
df_prices.info()

In [None]:
df_prices.min()

In [None]:
df_prices.groupby("item_id").min().reset_index().head(20)

- Observations
    - The minimum `wm_yr_wk` in `df_prices` is 11101.
    - For `FOODS_1_004` the minimum `wm_yr_wk` is 11206, so it was not sold during weeks 11101 through 11205.
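
To find every item that entered the assortment late, rather than spotting them by eye, the first priced week per item can be compared against the earliest week in the table. This is a minimal sketch on a toy frame; `toy_prices` and the `ITEM_*` ids are hypothetical, and only the column names follow `df_prices`.

```python
import pandas as pd

# Toy stand-in for df_prices (illustrative values only).
toy_prices = pd.DataFrame({
    "store_id": ["CA_1", "CA_1", "CA_1", "CA_1"],
    "item_id": ["ITEM_A", "ITEM_A", "ITEM_B", "ITEM_B"],
    "wm_yr_wk": [11101, 11102, 11206, 11207],
    "sell_price": [2.0, 2.0, 1.5, 1.5],
})

# First week in which each item has a price, i.e. was on sale.
first_week = toy_prices.groupby("item_id")["wm_yr_wk"].min()

# Items whose first priced week is later than the earliest week overall.
late_starters = first_week[first_week > toy_prices["wm_yr_wk"].min()]
print(late_starters)
```

Grouping by both `store_id` and `item_id` instead would show launch weeks per store, since an item may not start selling in every store at the same time.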