### This is the first notebook, '1_EDA', of the full solution for the Predict Future Sales competition on Kaggle.
## This notebook covers the first part of the solution: Exploratory Data Analysis and Data Cleaning
The full solution consists of 4 notebooks:
- 1_EDA: Exploratory Data Analysis and Data Cleaning
- 2_FE: Feature Engineering
- 3_HPO: Model Hyperparameter Optimization
- 4_Ensemble: Ensembling the models

Data:
- The input data is in the 'input' folder of this directory
- The output data is saved in the 'output' folder of this directory

In [None]:
%%writefile libraries.py
# Create a file that imports the top-level packages and functions (useful throughout the whole solution) with one line: %run libraries

import os # Interface to the underlying operating system (paths, directories)

import pickle # Fast saving/loading of data

import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 100)

# Import visualizations
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (30,5) # Set standard output figure size
import seaborn as sns # sns visualization library
from IPython.display import display # Displays several figures or dataframes nicely in one cell

# Create an 'output' folder to save data from the notebook
os.makedirs('output', exist_ok=True) # Does nothing if the folder already exists

print('Upper level libraries loaded')

## 1. EDA

In [None]:
%reset -f
# The %reset magic clears all previously defined names and releases the memory they used. The -f (force) flag runs it without asking the user for confirmation.

%run libraries
# Jupyter magic that runs the file created above and loads the standard libraries.

In [None]:
# Input data files are available in the "/kaggle/input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

### Load Data

Data Description  
You are provided with daily historical sales data. The task is to forecast the total amount of products sold in every shop for the test set. Note that the list of shops and products slightly changes every month. Creating a robust model that can handle such situations is part of the challenge.

#### Files description:  
- sales_train.csv - the training set. Daily historical data from January 2013 to October 2015.  
- test.csv - the test set. You need to forecast the sales for these shops and products for November 2015.  
- sample_submission.csv - a sample submission file in the correct format. 
- items.csv - supplemental information about the items/products.  
- item_categories.csv - supplemental information about the item categories.  
- shops.csv - supplemental information about the shops.  

#### Data fields:
- ID - an Id that represents a (Shop, Item) tuple within the test set
- shop_id - unique identifier of a shop
- item_id - unique identifier of a product
- item_category_id - unique identifier of item category
- item_cnt_day - number of products sold. You are predicting a monthly amount of this measure
- item_price - current price of an item
- date - date in format dd/mm/yyyy
- date_block_num - a consecutive month number, used for convenience. January 2013 is 0, February 2013 is 1,..., October 2015 is 33
- item_name - name of item
- shop_name - name of shop
- item_category_name - name of item category
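
The target is a monthly amount, while `item_cnt_day` is daily. Here is a minimal sketch of the aggregation, run on a tiny made-up sample rather than on the real file. The `daily` values are illustrative only, and the column name `item_cnt_month` is our own choice for the monthly total.

```python
import pandas as pd

# Tiny made-up sample with the same columns as sales_train.csv (values are illustrative only)
daily = pd.DataFrame({
    'date_block_num': [0, 0, 0, 1, 1],
    'shop_id':        [5, 5, 5, 5, 7],
    'item_id':        [100, 100, 200, 100, 200],
    'item_cnt_day':   [1.0, 2.0, 1.0, 4.0, 3.0],
})

# Sum the daily counts for every (month, shop, item) combination
monthly = (daily.groupby(['date_block_num', 'shop_id', 'item_id'], as_index=False)['item_cnt_day']
                .sum()
                .rename(columns={'item_cnt_day': 'item_cnt_month'}))  # item_cnt_month: name chosen here
print(monthly)
```

Each row of `monthly` is one (month, shop, item) combination, which is the level at which the forecast is made.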

In [None]:
#Load data from the competition folder in the Kaggle input directory
train   = pd.read_csv('/kaggle/input/competitive-data-science-predict-future-sales/sales_train.csv')
items   = pd.read_csv('/kaggle/input/competitive-data-science-predict-future-sales/items.csv')
cats    = pd.read_csv('/kaggle/input/competitive-data-science-predict-future-sales/item_categories.csv')
shops   = pd.read_csv('/kaggle/input/competitive-data-science-predict-future-sales/shops.csv')
test    = pd.read_csv('/kaggle/input/competitive-data-science-predict-future-sales/test.csv')
sample  = pd.read_csv('/kaggle/input/competitive-data-science-predict-future-sales/sample_submission.csv')

In [None]:
test = test.set_index('ID') #Set index to ID. This way we do not need to drop ID column every time in future calculations

In [None]:
# Show the Loaded Data
# display() can output multiple dataframes in one cell
display('train',   train.shape,  train.head(),
        'items',   items.shape,  items.head(),
        'cats',    cats.shape,   cats.head(),
        'shops',   shops.shape,  shops.head(),
        'test',    test.shape,   test.head(),
        'sample',  sample.shape, sample.head()) 

In [None]:
# Define dataframe information function
def df_info(df):
    print('-------------------------------------------shape----------------------------------------------------------------')
    print(df.shape)
    print('-------------------------------------head() and tail(1)---------------------------------------------------------')
    display(df.head(), df.tail(1))
    print('------------------------------------------nunique()-------------------------------------------------------------')
    print(df.nunique())
    print('-------------------------------------describe().round()---------------------------------------------------------')
    print(df.describe().round())
    print('--------------------------------------------info()--------------------------------------------------------------')
    df.info() # info() prints its report itself and returns None, so print() would only add a stray 'None'
    print('-------------------------------------------isnull()-------------------------------------------------------------')
    print(df.isnull().sum())
    print('--------------------------------------------isna()--------------------------------------------------------------')
    print(df.isna().sum())
    print('-----------------------------------------duplicated()-----------------------------------------------------------')
    print(len(df[df.duplicated()]))
    print('----------------------------------------------------------------------------------------------------------------')

## Train

In [None]:
df_info(train)

In [None]:
# We see 6 duplicates in data, let's drop them
train.drop_duplicates(inplace=True)

In [None]:
# We see a possible typo in item price in train - negative value 
train[train.item_price <= 0 ]

In [None]:
# Only one datapoint - it should be safe to simply remove it
train = train[train.item_price > 0]

### Train.item_price

In [None]:
#check price distribution
plt.plot(train.item_price)

In [None]:
# There is one clear outlier
print(train[train.item_price > 100000])
print(items[items.item_id == 6066])

In [None]:
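# As we see, this is a sale of 522 packages in one pack (each one costs 307980/522 = 590), so let us correct this line
# .loc avoids chained assignment, which may silently modify a copy instead of train
train.loc[train.item_id == 6066, 'item_cnt_day'] = 522
train.loc[train.item_id == 6066, 'item_price'] = 590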

In [None]:
# Now let us plot it again
plt.plot(train.item_price)

We see a step-like (piecewise) graph here with some outliers. The item price increases over time, probably because of inflation: Russia went through a currency crisis from June to December 2014, when oil prices fell and the ruble lost about half of its value. So adding prices in USD would probably help the model.
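
A minimal sketch of how such a USD price could be computed, assuming we have one exchange rate per `date_block_num`. The rates and prices below are placeholders, not real data, and the names `usd_rate` and `item_price_usd` are our own; real monthly RUB/USD rates would have to come from an external source.

```python
import pandas as pd

# Placeholder exchange rates (RUB per USD) indexed by date_block_num: NOT real data
usd_rate = pd.Series({0: 30.0, 1: 30.0, 2: 31.0}, name='rub_per_usd')

# Made-up sales rows, just to show the mechanics
sales = pd.DataFrame({'date_block_num': [0, 1, 2],
                      'item_price':     [300.0, 600.0, 310.0]})

# Look up the rate of each row's month and convert the ruble price
sales['item_price_usd'] = sales['item_price'] / sales['date_block_num'].map(usd_rate)
print(sales)
```

A monthly rate is the coarsest sensible choice; a daily rate joined on `date` would follow the same pattern.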

In [None]:
# Let us plot variation of the mean item price with time
plt.plot(train.groupby(['date_block_num'])['item_price'].mean())

We see the increase in price clearly here

In [None]:
#We do not see much variation of prices within one month of sales
plt.plot(train[train.date_block_num == 33].item_price)

In [None]:
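#Let us see how the price changes for one arbitrarily chosen item
item_id = 1000 # arbitrary id (named item_id so that it does not shadow the built-in id())
plt.figure(figsize=(10,4))
# histplot replaces distplot, which is deprecated in recent seaborn versions
ax = sns.histplot(train[train.item_id == item_id].item_price, bins = 100)
ax.set_yscale('log') # logarithmic count axis, as hist_kws={'log':True} did

train[train.item_id == item_id].sort_values(by=['date_block_num'])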

We can see that the price varies between shops within a given month and also changes over time; the price distribution is multimodal.

### Train.item_cnt_day

In [None]:
# Now let us plot item_cnt_day
plt.plot(train.item_cnt_day)

In [None]:
# Plot the histogram of item_cnt_day with a logarithmic count axis
# histplot replaces distplot, which is deprecated in recent seaborn versions
ax = sns.histplot(train.item_cnt_day, bins = 200)
ax.set_yscale('log')

In [None]:
#A couple of outliers above 900
train[train.item_cnt_day > 900]

In [None]:
display(items[items.item_id == 9248])

In [None]:
display(items[items.item_id == 20949],
        items[items.item_id == 11373])

In [None]:
# It is possible that many packets and deliveries were sold on some special occasion, such as a holiday.
# I think it is better to remove these points as outliers
train = train[train.item_cnt_day < 900]

In [None]:
# Now let us plot item_cnt_day again
plt.plot(train.item_cnt_day)

In [None]:
# Let us see sales distribution per month
sns.countplot(x='date_block_num', data=train);

We clearly see a pattern here: an overall downward trend (the crisis in Russia) with a 12-month seasonal cycle on top of it. Sales peak in December and are lowest in the summer months.
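
Since the cycle is yearly, the calendar month can be recovered directly from `date_block_num` (January 2013 is 0). A minimal sketch, where the column names `month` and `year` are our own choice:

```python
import pandas as pd

# Blocks 0-33 are the training months, block 34 is the test month (November 2015)
blocks = pd.DataFrame({'date_block_num': range(35)})

blocks['month'] = blocks['date_block_num'] % 12 + 1      # 1 = January, ..., 12 = December
blocks['year'] = 2013 + blocks['date_block_num'] // 12   # 2013, 2014 or 2015
print(blocks.tail())
```

A model can use `month` to pick up the December peak and the summer dip, while `date_block_num` itself carries the trend.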

In [None]:
# Let us see sales distribution over one month
sns.countplot(x='date', data=train[(train.date_block_num == 21)&(train.shop_id == 12)])

There is a very clear weekly pattern in sales: low sales on Mondays and the highest sales on Saturdays and Sundays. Adding the number of Mondays, Tuesdays, etc. in each month as features would probably help. Holidays also show higher sales, so it is better to take them into account.
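
A minimal sketch of the weekday-count idea. It builds the counts from a plain calendar, so it does not depend on the sales data at all. The column names (`n_mon` ... `n_sun`) are our own choice; holidays are not handled here and would need a separate calendar.

```python
import pandas as pd

# Calendar covering the training months (blocks 0-33) plus the test month (block 34)
days = pd.DataFrame({'date': pd.date_range('2013-01-01', '2015-11-30', freq='D')})

# Same numbering as date_block_num: January 2013 is 0, ..., November 2015 is 34
days['date_block_num'] = (days.date.dt.year - 2013) * 12 + days.date.dt.month - 1
days['weekday'] = days.date.dt.dayofweek  # 0 = Monday, ..., 6 = Sunday

# One row per month, one column per weekday, values = number of such days in the month
weekday_counts = (days.groupby(['date_block_num', 'weekday']).size()
                      .unstack(fill_value=0))
weekday_counts.columns = ['n_mon', 'n_tue', 'n_wed', 'n_thu', 'n_fri', 'n_sat', 'n_sun']
print(weekday_counts.head())
```

The resulting table is indexed by `date_block_num`, so it can be joined to the monthly data on that column.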

In [None]:
# Let's see the sales per shop
sns.countplot(x='shop_id', data=train)

In [None]:
# Let us plot total monthly sales per shop over time. We will use a red title for the shops that are not present in the test set.

fig = plt.figure(figsize=(30,36))
for i in range(len(shops)):
    ts=train[train.shop_id == i].groupby(['date_block_num'])['item_cnt_day'].sum()
    plt.subplot(10, 6, i+1)
    plt.bar(ts.index, ts.values)
    plt.xlim((0, 33))
    plt.ylim(0, 12000)
    if i in set(test.shop_id):
        plt.title(str(i) +' '+ shops.shop_name[i], color = 'k')
    else: 
        plt.title(str(i) +' '+ shops.shop_name[i], color = 'r')
plt.show()

We need to predict sales only for the shops that have not been closed. It might be a good idea to give the model a flag for open/closed shops.
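
One possible way to derive such a flag, sketched on made-up rows: treat a shop as open if it still has sales in the last training month. This rule and the names `first_month`, `last_month` and `is_open` are assumptions for illustration, not something given by the competition data.

```python
import pandas as pd

# Made-up sales rows: shop 2 sells until the last month, shop 3 stops at month 20, shop 4 opens late
sales = pd.DataFrame({'shop_id':        [2, 2, 3, 3, 4],
                      'date_block_num': [0, 33, 0, 20, 33]})

last_block = sales['date_block_num'].max()

# First and last month with any sale, per shop
activity = sales.groupby('shop_id')['date_block_num'].agg(first_month='min', last_month='max')

# Assumed rule: a shop is open if it sold something in the last available month
activity['is_open'] = (activity['last_month'] == last_block).astype(int)
print(activity)
```

`first_month` may also be useful on its own, since it marks shops that opened recently.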

In [None]:
# We see that some shops appear twice under different ids (intentionally, I guess), let's merge them
# Якутск Орджоникидзе, 56
train.loc[train.shop_id == 0, 'shop_id'] = 57
test.loc[test.shop_id == 0, 'shop_id'] = 57
# Якутск ТЦ "Центральный"
train.loc[train.shop_id == 1, 'shop_id'] = 58
test.loc[test.shop_id == 1, 'shop_id'] = 58
# Жуковский ул. Чкалова 39м²
train.loc[train.shop_id == 10, 'shop_id'] = 11
test.loc[test.shop_id == 10, 'shop_id'] = 11
# Now delete those shops from the shops dataframe:
shops.drop([0, 1, 10], inplace = True)

# I think it is also better to remove any data for outbound trade, 
# which is very unusual and misleading (we are not going to predict the outbound trade)
train = train[train.shop_id != 9]
train = train[train.shop_id != 20]
# Now delete those shops from the shops dataframe:
shops.drop([9, 20], inplace = True)

# 12 and 55 are online stores - we cannot remove them because these shops are in test.

In [None]:
# Let us add item_category_id to train and test sets
items_dict = dict(zip(items.item_id, items.item_category_id))
train['item_category_id'] = train['item_id'].map(items_dict)
test['item_category_id'] = test['item_id'].map(items_dict)

In [None]:
# Plot the distribution for sold items relative to the category
fig, ax = plt.subplots(2,1, figsize=(30,10))
# Pass the column as x=: recent seaborn versions no longer accept it as a positional argument
sns.countplot(x=train['item_category_id'], ax=ax[0])
sns.countplot(x=test['item_category_id'], ax=ax[1])

In [None]:
# We see that some categories are present in train but absent in test. Let us remove those categories from train to make it closer to test.
absent_cats = set(train.item_category_id) - set(test.item_category_id)
train = train[~train.item_category_id.isin(absent_cats)]
items = items[~items.item_category_id.isin(absent_cats)] # remove them from items
cats = cats[~cats.item_category_id.isin(absent_cats)]    # remove from cats

In [None]:
# Plot the distribution again
fig, ax = plt.subplots(2,1, figsize=(30,10))
# Pass the column as x=: recent seaborn versions no longer accept it as a positional argument
sns.countplot(x=train['item_category_id'], ax=ax[0])
sns.countplot(x=test['item_category_id'], ax=ax[1])

We see that the distributions of sales vs. category ID differ somewhat between the test and train sets. For example, category 31 is large in test (many of its items were picked for test) but small in train (because of the relatively low sales of those items).
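
To put numbers on such a comparison instead of reading it off two plots, one could compare the share of rows per category in both sets. A minimal sketch on made-up category ids (the values are illustrative only, and `shares` is a name chosen here):

```python
import pandas as pd

# Made-up category ids, standing in for train['item_category_id'] and test['item_category_id']
train_cats = pd.Series([31, 40, 40, 40, 55], name='item_category_id')
test_cats = pd.Series([31, 31, 40, 55], name='item_category_id')

# Fraction of rows per category in each set, aligned on the category id
shares = pd.DataFrame({
    'train_share': train_cats.value_counts(normalize=True),
    'test_share': test_cats.value_counts(normalize=True),
}).fillna(0.0)

# Positive values mean the category is over-represented in test relative to train
shares['diff'] = shares['test_share'] - shares['train_share']
print(shares.sort_values('diff', ascending=False))
```

Sorting by `diff` brings the categories with the largest mismatch to the top.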

In [None]:
# How many shops are left in train now?
len(set(train.shop_id))

In [None]:
fig = plt.figure(figsize=(30,60))
i = 1
for shop_id in set(train.shop_id):
    ts=train[train.shop_id == shop_id].groupby(['item_category_id'])['item_cnt_day'].sum()
    plt.subplot(11, 5, i)
    plt.bar(ts.index, ts.values)
    plt.xlim((0, 82))
    if shop_id in set(test.shop_id):
        plt.title(str(shop_id) +' '+ shops.shop_name[shop_id], color = 'k')
    else: 
        plt.title(str(shop_id) +' '+ shops.shop_name[shop_id], color = 'r')
    i+=1
plt.show()

In [None]:
# Shop 40 shows a very different pattern from the other shops, so let us remove it (it closed a long time ago anyway, and we do not need to predict for it)
train = train[train.shop_id != 40]
# Now delete the shop from the shops dataframe:
shops.drop([40], inplace = True)

# We do not remove shops 12 and 55, which are online shops and also show a different distribution
# Shop 55 is an online shop for 1C software (business accounting software, #1 in Russia). Its sales categories appear only in this shop and in no other:
set(train[train.shop_id == 55].item_category_id)

In [None]:
# Now we will save the data. At this stage we only cleaned the data and do not want to save any other modifications, so let us drop the newly created columns. We will modify the data in the next notebook: 2_FE
train.drop(columns = 'item_category_id', inplace = True)
test.drop(columns = 'item_category_id', inplace = True)

# Save data to the folder to use it in the next part
with open(r'output/1_EDA_data.pkl','wb') as f:
    pickle.dump((train, items, cats, shops, test, sample), f)  
    
# Load the saved data in the next section as:
# with open(r'output/1_EDA_data.pkl', 'rb') as f:
#     (train, items, cats, shops, test, sample) = pickle.load(f)

# Thank you for your time!
## Please share your thoughts and comments, as well as suggestions for future improvements.