#<div style="padding:20px;color:white;margin:0;font-size:175%;text-align:center;display:fill;border-radius:5px;background-color:#016CC9;overflow:hidden;font-weight:500"> Competition target.</div>

**Items** are sold in **shops** during a training period of **34 months**. Each item belongs to a **category**. The goal of the competition is to predict, for a given set of (item, shop) pairs, how many units were sold during the month following the training period. To get a clearer view of the training data, we will produce a few graphs.

# <b><span style='color:#4B4B4B'>1 |</span><span style='color:#016CC9'> Load data </span></b>

We are going to load all the data: train, test and the additional resources.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# All competition files are read from the same Kaggle input directory.
DATA_DIR = '../input/competitive-data-science-predict-future-sales'
train = pd.read_csv(f'{DATA_DIR}/sales_train.csv')            # daily sales records
test = pd.read_csv(f'{DATA_DIR}/test.csv')                    # (item, shop) pairs to predict
items = pd.read_csv(f'{DATA_DIR}/items.csv')                  # item names and category ids
categories = pd.read_csv(f'{DATA_DIR}/item_categories.csv')   # category names
shops = pd.read_csv(f'{DATA_DIR}/shops.csv')                  # shop list
print(f'There are {len(items)} items and {len(shops)} shops.')

We have 22,170 items and 60 shops, which gives 22,170 x 60 = 1,330,200 possible (item, shop) pairs.
Each line in the test file (and in the submission file) is one such pair.

In [None]:
print(f'We have to predict {len(test)} pairs (item,shop)')

We are asked to predict 214,200 of the 1,330,200 possible pairs (**16.1%**), so the test set covers only a subset of the item and shop combinations. Still, 214,200 pairs is a lot, and training a separate LSTM for each of them is not practical. We need to work at the group level: **items can be grouped by shops or by categories**.
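
The pair arithmetic above can be reproduced directly from the dataframes, and it is worth checking whether the test set is a full grid of its own shops and items. The sketch below is only an illustration on small made-up frames (`items_toy`, `shops_toy` and `test_toy` are hypothetical stand-ins for `items`, `shops` and `test`).

```python
import pandas as pd

# Toy stand-ins for items.csv, shops.csv and test.csv (hypothetical values).
items_toy = pd.DataFrame({'item_id': [0, 1, 2, 3, 4]})
shops_toy = pd.DataFrame({'shop_id': [0, 1, 2, 3]})
test_toy = pd.DataFrame({
    'shop_id': [0, 0, 0, 2, 2, 2],
    'item_id': [1, 3, 4, 1, 3, 4],
})

# Every item combined with every shop.
possible_pairs = len(items_toy) * len(shops_toy)
# Fraction of those pairs that we are asked to predict.
coverage = len(test_toy) / possible_pairs

n_test_shops = test_toy['shop_id'].nunique()
n_test_items = test_toy['item_id'].nunique()
# True when the test set has no duplicate pairs and pairs every one of
# its shops with every one of its items.
is_full_grid = (
    not test_toy.duplicated(['shop_id', 'item_id']).any()
    and len(test_toy) == n_test_shops * n_test_items
)

print(f'{len(test_toy)} of {possible_pairs} possible pairs ({coverage:.1%})')
print(f'{n_test_shops} shops x {n_test_items} items, full grid: {is_full_grid}')
```

Running the same lines on the real `items`, `shops` and `test` frames should reproduce the 16.1% figure and show how many distinct shops and items the test set actually involves.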

# <b><span style='color:#4B4B4B'>2 |</span><span style='color:#016CC9'>  Monthly grouping</span></b>

The months are indexed with **date_block_num**. Because the prediction is for a whole month, we are going to group the training data by month and give up the day-level precision.

In [None]:
train_monthly = train.groupby(['date_block_num', 'shop_id', 'item_id']).agg({'item_cnt_day': 'sum'}).reset_index()
train_monthly.rename(columns = {'item_cnt_day':'item_cnt_month'},inplace=True)
print(train_monthly.head(3))
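
Before relying on **date_block_num** as a month index, it can be useful to check that it lines up with calendar months. The sketch below is a rough check on made-up rows; it assumes that the `date` column is written as day.month.year and that block 0 is January 2013, both of which should be confirmed against the real file.

```python
import pandas as pd

# Toy rows in the shape of sales_train.csv (hypothetical values).
train_toy = pd.DataFrame({
    'date': ['02.01.2013', '15.02.2013', '31.12.2013', '01.01.2014', '10.10.2015'],
    'date_block_num': [0, 1, 11, 12, 33],
})

# Assumption: dates are written day.month.year.
parsed = pd.to_datetime(train_toy['date'], format='%d.%m.%Y')

# Assumption: block 0 corresponds to January 2013.
expected_block = (parsed.dt.year - 2013) * 12 + (parsed.dt.month - 1)

# Number of rows whose block number disagrees with the calendar month.
mismatches = int((expected_block != train_toy['date_block_num']).sum())
print(f'{mismatches} rows disagree with the calendar-month mapping')
```

If the number of mismatches is zero on the real data, the 34 blocks can be read as consecutive calendar months.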

# <b><span style='color:#4B4B4B'>3 |</span><span style='color:#016CC9'> Shops </span></b>

We will now focus on the **shops**. We will group the data by shop and give up the item-level precision.

In [None]:
train_monthly_shop = train_monthly.groupby(['date_block_num','shop_id']).agg({'item_cnt_month':'sum'}).reset_index()
train_monthly_shop.rename(columns={'item_cnt_month':'shop_cnt_month'},inplace=True)
print(train_monthly_shop.head(3))

It is interesting to graph the time series for each shop.

In [None]:
plt.figure(figsize=(20,10))
for shop_id in train_monthly_shop['shop_id'].unique():
    monthly_shop_i = train_monthly_shop.loc[train_monthly_shop['shop_id']==shop_id]
    plt.plot(monthly_shop_i['date_block_num'],monthly_shop_i['shop_cnt_month'])
plt.xlabel('month', fontsize=18)
plt.ylabel('items sold / shop', fontsize=18)
plt.title('Monthly sales for each shop', fontsize=18)
plt.show()

We can see that some shops sell much more than others, and that some shops opened more recently. Let's look more closely at how the sales are distributed over the shops.

In [None]:
train_shop=train_monthly_shop.groupby(['shop_id']).agg({'shop_cnt_month':'sum'}).reset_index()
train_shop.rename(columns={'shop_cnt_month':'shop_cnt'},inplace=True)
print(train_shop.head(3))

In [None]:
plt.figure(figsize=(20,10))
sns.barplot(x=train_shop['shop_id'],y=train_shop['shop_cnt'])
plt.title('Total sales for each shop',fontsize = 18)
plt.show()

Shops 31 and 25 have very high total sales, while shops 11 and 36 sell very few items.
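
The observations about shops (large differences in volume, shops that opened later) can also be read from a small summary table instead of the plots. This is a minimal sketch on made-up numbers; `monthly_shop_toy` is a hypothetical stand-in for `train_monthly_shop`.

```python
import pandas as pd

# Toy stand-in for train_monthly_shop (hypothetical values).
monthly_shop_toy = pd.DataFrame({
    'date_block_num': [0, 1, 2, 0, 1, 2, 2],
    'shop_id':        [5, 5, 5, 7, 7, 7, 9],
    'shop_cnt_month': [100, 120, 90, 10, 12, 8, 40],
})

# One row per shop: total sales plus the first and last month with sales.
shop_summary = (
    monthly_shop_toy
    .groupby('shop_id')
    .agg(shop_cnt=('shop_cnt_month', 'sum'),
         first_month=('date_block_num', 'min'),
         last_month=('date_block_num', 'max'),
         active_months=('date_block_num', 'nunique'))
    .sort_values('shop_cnt', ascending=False)
)
# Share of all sales made by each shop.
shop_summary['share'] = shop_summary['shop_cnt'] / shop_summary['shop_cnt'].sum()
print(shop_summary)
```

A shop whose `first_month` is greater than 0 has no sales at the start of the period, which is one way to spot shops that opened later; a `last_month` below the final block may indicate a shop that closed.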

# <b><span style='color:#4B4B4B'>4 |</span><span style='color:#016CC9'> Categories </span></b>

Another interesting way to group the sales is by **category**. However, the categories are not in the train file, so let's add them using the items file.

In [None]:
train_monthly = pd.merge(train_monthly, items, on='item_id', how='left')
train_monthly.drop(columns='item_name',inplace=True)
print(train_monthly.head(3))
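
A left merge like this one should neither duplicate rows nor leave items without a category. The sketch below shows one way to guard against both on made-up frames (`train_monthly_toy` and `items_toy` are hypothetical); it uses the `validate` argument of `pd.merge`, which raises an error if `item_id` is not unique on the right-hand side.

```python
import pandas as pd

# Toy stand-ins for train_monthly and items (hypothetical values).
train_monthly_toy = pd.DataFrame({
    'date_block_num': [0, 0, 1],
    'shop_id': [5, 5, 7],
    'item_id': [10, 11, 10],
    'item_cnt_month': [3.0, 1.0, 2.0],
})
items_toy = pd.DataFrame({
    'item_id': [10, 11, 12],
    'item_name': ['a', 'b', 'c'],
    'item_category_id': [40, 55, 40],
})

# Merge only the columns we need; validate that each item_id maps to one row.
merged = pd.merge(
    train_monthly_toy,
    items_toy[['item_id', 'item_category_id']],
    on='item_id',
    how='left',
    validate='many_to_one',
)

# The merge must keep exactly the original rows ...
rows_preserved = len(merged) == len(train_monthly_toy)
# ... and every row should have received a category.
missing_categories = int(merged['item_category_id'].isna().sum())
print(rows_preserved, missing_categories)
```

Selecting the two needed columns before merging also makes the later `drop` of `item_name` unnecessary, although either approach gives the same result.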

It is interesting to graph the time series for each category.

In [None]:
train_monthly_category = train_monthly.groupby(['date_block_num','item_category_id']).agg({'item_cnt_month':'sum'}).reset_index()
train_monthly_category.rename(columns={'item_cnt_month':'category_cnt_month'},inplace=True)
print(train_monthly_category.head(3))

In [None]:
plt.figure(figsize=(20,10))
for item_category_id in train_monthly_category['item_category_id'].unique():
    monthly_category_i = train_monthly_category.loc[train_monthly_category['item_category_id']==item_category_id]
    plt.plot(monthly_category_i['date_block_num'],monthly_category_i['category_cnt_month'])
plt.xlabel('month', fontsize=18)
plt.ylabel('items sold / category', fontsize=18)
plt.title('Monthly sales for each category',fontsize = 18)
plt.show()

We notice that some categories sell much more than others. We also notice a general downward trend and a 12-month seasonality. Let's look more closely at how the sales are distributed over the categories.
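
The trend and the seasonality can be checked numerically before moving on. The sketch below is a rough illustration on a synthetic series, not on the competition data: `totals` is a made-up 34-month series with a linear decline and a spike at every 12th position. On the real data, the equivalent series would be the monthly totals summed over all categories.

```python
import numpy as np
import pandas as pd

# Synthetic monthly totals: linear decline plus a spike at every 12th position.
months = np.arange(34)
totals = pd.Series(1000 - 10 * months + 300 * (months % 12 == 11), index=months)

# Trend: slope of a straight line fitted through the monthly totals.
slope, intercept = np.polyfit(months, totals.to_numpy(), 1)

# Year-over-year change: each month minus the same month one year earlier.
yoy_change = totals.diff(12).dropna()

# Seasonality: average detrended value per position in the 12-month cycle.
detrended = totals - (intercept + slope * months)
seasonal_profile = detrended.groupby(months % 12).mean()
peak_position = int(seasonal_profile.idxmax())

# Correlation of the detrended series with itself shifted by 12 months.
lag12_autocorr = detrended.autocorr(lag=12)

print(f'slope per month: {slope:.1f}')
print(f'mean year-over-year change: {yoy_change.mean():.1f}')
print(f'peak position in the cycle: {peak_position}')
print(f'lag-12 autocorrelation: {lag12_autocorr:.2f}')
```

A negative slope and negative year-over-year changes point to a downward trend, while a clear peak in the seasonal profile and a high lag-12 autocorrelation point to a yearly cycle. With fewer than three full years of data, these are indications rather than firm estimates.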

In [None]:
train_category=train_monthly_category.groupby(['item_category_id']).agg({'category_cnt_month':'sum'}).reset_index()
train_category.rename(columns={'category_cnt_month':'category_cnt'},inplace=True)
print(train_category.head(3))

In [None]:
plt.figure(figsize=(20,10))
sns.barplot(x=train_category['item_category_id'],y=train_category['category_cnt'])
plt.title('Total sales for each category',fontsize = 18)
plt.show()

Categories 30 (Игры PC - Стандартные издания, i.e. PC Games - Standard Editions), 40 (Кино - DVD, i.e. Movies - DVD) and 55 (Музыка - CD локального производства, i.e. Music - Locally Produced CDs) are the 3 most successful. Some categories have almost no sales.
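
The top categories can also be listed with their names and their share of total sales rather than read off the bar chart. This is a minimal sketch on made-up numbers; `train_category_toy` and `categories_toy` are hypothetical stand-ins for `train_category` and `categories`, and the column name `item_category_name` is assumed to match the categories file.

```python
import pandas as pd

# Toy stand-ins for train_category and categories (hypothetical values).
train_category_toy = pd.DataFrame({
    'item_category_id': [30, 40, 55, 10],
    'category_cnt': [500, 300, 150, 50],
})
categories_toy = pd.DataFrame({
    'item_category_id': [10, 30, 40, 55],
    'item_category_name': ['cat A', 'cat B', 'cat C', 'cat D'],
})

# Attach the category names, then keep the three largest totals.
top3 = (
    train_category_toy
    .merge(categories_toy, on='item_category_id', how='left')
    .nlargest(3, 'category_cnt')
)

# Fraction of all sales that comes from the top three categories.
top3_share = top3['category_cnt'].sum() / train_category_toy['category_cnt'].sum()

print(top3[['item_category_id', 'item_category_name', 'category_cnt']])
print(f'Top 3 categories account for {top3_share:.1%} of sales')
```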