# Overview
In this competition you will work with a challenging time-series dataset consisting of daily sales data, kindly provided by one of the largest Russian software firms - 1C Company. 

We are asking you to **predict total sales for every product and store in the next month**.  

You are provided with daily historical sales data. The task is to *forecast the total amount of products sold in every shop for the test set*. Note that **the list of shops and products slightly changes every month**. Creating a robust model that can handle such situations is part of the challenge.  


# File descriptions
- sales_train.csv - the training set. Daily historical data from January 2013 to October 2015.
- test.csv - the test set. You need to forecast the sales for these shops and products for November 2015.
- sample_submission.csv - a sample submission file in the correct format.
- items.csv - supplemental information about the items/products.
- item_categories.csv - supplemental information about the item categories.
- shops.csv - supplemental information about the shops.

# Data fields
- ID - an Id that represents a (Shop, Item) tuple within the test set
- shop_id - unique identifier of a shop
- item_id - unique identifier of a product
- item_category_id - unique identifier of item category
- item_cnt_day - number of products sold. You are predicting a monthly amount of this measure
- item_price - current price of an item
- date - date in format dd/mm/yyyy
- date_block_num - a consecutive month number, used for convenience. January 2013 is 0, February 2013 is 1,..., October 2015 is 33
- item_name - name of item
- shop_name - name of shop
- item_category_name - name of item category
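
Because the target is a monthly amount of "item_cnt_day", the daily rows have to be summed per month, shop and item at some point. A minimal sketch on toy data, assuming the column names listed above; the output column name `item_cnt_month` is a hypothetical label chosen for illustration, not something defined in this notebook:

```python
import pandas as pd

# Toy daily sales rows using the column names from the data fields list.
daily = pd.DataFrame({
    "date_block_num": [0, 0, 0, 1],
    "shop_id": [5, 5, 6, 5],
    "item_id": [100, 100, 100, 100],
    "item_cnt_day": [1.0, 2.0, 4.0, 3.0],
})

# Sum the daily counts per (month, shop, item) to get a monthly total.
# "item_cnt_month" is an assumed name for the aggregated column.
monthly = (
    daily.groupby(["date_block_num", "shop_id", "item_id"], as_index=False)["item_cnt_day"]
    .sum()
    .rename(columns={"item_cnt_day": "item_cnt_month"})
)
print(monthly)
```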

# 0. Configure Package Dependencies

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

# 1. Import the Complete Dataset

In [None]:
# Read the ".csv" from input folder.
train_df = pd.read_csv('../input/sales_train.csv', index_col=False)
test_df = pd.read_csv('../input/test.csv', index_col=False)
item_df = pd.read_csv('../input/items.csv', index_col=False)
item_cat_df = pd.read_csv('../input/item_categories.csv', index_col=False)
shop_df = pd.read_csv('../input/shops.csv', index_col=False)


# 2. Preview the Complete Dataset

### ds1 -  "sales_train.csv"
From the preview below, we can see that the same shop sells the same products on different dates.

In [None]:
# Preview the first 10 instances in "sales_train.csv".
train_df.head(10)

In [None]:
# Display the dimensions and data type of the train data.
train_df.info()

### ds2 - "test.csv"
Each "ID" identifies one ("shop_id", "item_id") pair in the test set.  
Given a "shop_id" and an "item_id", we can therefore look up the matching "ID" and attach it as a new feature.  
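
One way to attach the "ID" is a merge on the two key columns. The sketch below uses toy frames (`train_toy` and `test_toy` are hypothetical stand-ins, not the real files) and shows that pairs missing from the test set end up with NaN:

```python
import pandas as pd

# Toy stand-in for test.csv: one ID per (shop_id, item_id) pair.
test_toy = pd.DataFrame({"ID": [0, 1], "shop_id": [5, 5], "item_id": [100, 101]})

# Toy training rows; the pair (6, 100) does not appear in the test set.
train_toy = pd.DataFrame({
    "shop_id": [5, 5, 6],
    "item_id": [100, 101, 100],
    "item_cnt_day": [1.0, 2.0, 4.0],
})

# A left merge attaches the ID where the pair exists and leaves NaN otherwise.
with_id = train_toy.merge(test_toy, how="left", on=["shop_id", "item_id"])
print(with_id)
```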

In [None]:
# Preview the first 10 instances in "test.csv".
test_df.head(10)

In [None]:
# Display the dimensions and data type of the test data.
test_df.info()

### ds3 - "shops.csv"
Currently, there are 60 shops in the dataset.  
"shop_name" is not a necessary feature.

In [None]:
# Preview the first 10 instances in "shops.csv".
shop_df.head(10)

In [None]:
# Display the dimensions and data type of the shops data.
shop_df.info()

### ds4 - "items.csv"
Currently, there are 22170 items in the dataset.  
We can merge "item_category_id" into train data feature set.  
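
A small sketch of that merge on toy data (`sales_toy` and `items_toy` are hypothetical stand-ins). Selecting only the two needed columns avoids carrying "item_name" along, and `validate="many_to_one"` is an optional guard that raises if an "item_id" is duplicated in the items table:

```python
import pandas as pd

# Toy sales rows and a toy items table with the columns described above.
sales_toy = pd.DataFrame({"item_id": [100, 100, 101], "item_cnt_day": [1.0, 2.0, 1.0]})
items_toy = pd.DataFrame({
    "item_id": [100, 101],
    "item_name": ["A", "B"],
    "item_category_id": [40, 55],
})

# Left merge keeps every sales row and adds its category.
enriched = sales_toy.merge(
    items_toy[["item_id", "item_category_id"]],
    how="left",
    on="item_id",
    validate="many_to_one",  # each item_id must be unique on the items side
)
print(enriched)
```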

In [None]:
# Preview the first 10 instances in "items.csv".
item_df.head(10)

In [None]:
# Display the dimensions and data type of the items data.
item_df.info()

### ds5 - "item_categories.csv"  
There are 84 different item categories in the dataset.

In [None]:
# Preview the first 10 instances in "item_categories.csv".
item_cat_df.head(10)

In [None]:
# Display the dimensions and data type of the item category data.
item_cat_df.info()

# 3.  Initialise Data Wrangling Routine

## Exploratory Data Analysis

In [None]:
import pandas_profiling

# Generates profile reports from a pandas DataFrame.
pandas_profiling.ProfileReport(train_df, check_correlation=False)

## Data Wrangling and Feature Construction
1. Drop all instances whose "item_cnt_day" is negative.  
2. Drop all instances whose "item_price" is negative.  
3. Drop duplicates.  
4. Keep only instances whose "shop_id" and "item_id" appear in "test.csv".  
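
The four steps can be sketched end to end on a toy frame. The data below is made up for illustration, and the frame names (`train_toy`, `test_toy`, `clean`) are hypothetical; the filters themselves mirror the steps in the list:

```python
import pandas as pd

train_toy = pd.DataFrame({
    "date": ["02/01/2013", "02/01/2013", "03/01/2013",
             "04/01/2013", "05/01/2013", "06/01/2013"],
    "date_block_num": [0, 0, 0, 0, 0, 0],
    "shop_id": [5, 5, 5, 5, 6, 5],
    "item_id": [100, 100, 100, 101, 100, 101],
    "item_price": [10.0, 10.0, 10.0, -1.0, 10.0, 20.0],
    "item_cnt_day": [1.0, 1.0, -1.0, 1.0, 2.0, 3.0],
})
test_toy = pd.DataFrame({"ID": [0, 1], "shop_id": [5, 5], "item_id": [100, 101]})

# 1. Drop rows with a negative daily count.
clean = train_toy[train_toy["item_cnt_day"] >= 0]
# 2. Drop rows with a negative price.
clean = clean[clean["item_price"] >= 0]
# 3. Drop duplicates on the identifying columns.
key_cols = ["date", "date_block_num", "shop_id", "item_id", "item_cnt_day"]
clean = clean.drop_duplicates(subset=key_cols)
# 4. Keep rows whose shop and item both occur in the test set.
clean = clean[clean["shop_id"].isin(test_toy["shop_id"].unique())]
clean = clean[clean["item_id"].isin(test_toy["item_id"].unique())]
print(clean)
```

Step 4 here filters shops and items independently, so a row survives if its shop and its item each occur somewhere in the test set, even if that exact pair does not.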


In [None]:
# Work on an independent copy so the raw train data stays untouched.
wrangled_train = train_df.copy()

In [None]:
# Drop all instances whose "item_cnt_day" is negative.
wrangled_train = wrangled_train[wrangled_train['item_cnt_day'] >= 0]

In [None]:
# Drop all instances whose "item_price" is negative.
wrangled_train = wrangled_train[wrangled_train['item_price'] >= 0]

In [None]:
# Check for NaN or null values in each column of the train data.
wrangled_train.isnull().any()

In [None]:
# Drop duplicates.
subset = ['date','date_block_num','shop_id','item_id','item_cnt_day']
print(wrangled_train.duplicated(subset=subset).value_counts())
wrangled_train = wrangled_train.drop_duplicates(subset=subset)

In [None]:
# Keep instances whose "shop_id" and "item_id" appear in "test.csv".
shop_id_arr = test_df['shop_id'].unique()
item_id_arr = test_df['item_id'].unique()

wrangled_train = wrangled_train[wrangled_train['shop_id'].isin(shop_id_arr)]
wrangled_train = wrangled_train[wrangled_train['item_id'].isin(item_id_arr)]

In [None]:
# Display the dimensions and data type of the wrangled data.
wrangled_train.info()

In [None]:
# Drop all features with datetime values.
wrangled_train = wrangled_train.drop(['date'], axis=1)    

In [None]:
# Drop all features with price.
wrangled_train = wrangled_train.drop(['item_price'], axis=1)

In [None]:
# Rename "date_block_num" to "month"; values stay in the range [0,33].
wrangled_train.rename(columns={'date_block_num':'month'},inplace=True)

In [None]:
# Merge train data with items to generate new feature "item_category_id".
wrangled_train = wrangled_train.merge(item_df,how='left',on=['item_id'])
wrangled_train = wrangled_train.drop(['item_name'],axis=1)

In [None]:
wrangled_train.head(10)

In [None]:
# Check for NaN or null values in each column of the train data.
wrangled_train.isnull().any()

In [None]:
# Attach the test "ID" to each row by merging on "shop_id" and "item_id" (right merge keeps every test pair).
complete_df = wrangled_train.merge(test_df,how='right',on=['shop_id','item_id'])

In [None]:
# Check for NaN or null values in each column of the merged data.
complete_df.isnull().any()
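
A right merge keeps every test row, so test pairs with no training history come back with NaN in the train-side columns. A minimal sketch on toy data (the frames are hypothetical stand-ins) that uses the `indicator` option of `merge` to isolate those pairs:

```python
import pandas as pd

# One training row, two test pairs; the pair (5, 101) has no history.
train_toy = pd.DataFrame({
    "shop_id": [5], "item_id": [100], "month": [0], "item_cnt_day": [1.0],
})
test_toy = pd.DataFrame({"ID": [0, 1], "shop_id": [5, 5], "item_id": [100, 101]})

# indicator=True adds a "_merge" column telling which side each row came from.
merged = train_toy.merge(test_toy, how="right", on=["shop_id", "item_id"], indicator=True)

# Rows marked "right_only" exist in the test set but not in the train data.
no_history = merged[merged["_merge"] == "right_only"]
print(no_history[["ID", "shop_id", "item_id", "month"]])
```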

In [None]:
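# Test pairs with no sales history hold NaN here after the right merge,
# so cast to pandas' nullable integer dtype; plain int would raise on NaN.
complete_df['month'] = complete_df['month'].astype('Int64')
complete_df['item_category_id'] = complete_df['item_category_id'].astype('Int64')
complete_df.head(10)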

In [None]:
# Group by "ID" and 

# 4. Test Harness (Pre-Optimisation)