## *NOTE: This notebook is still under construction*

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Introduction

This notebook is being developed as I take the Coursera course "How to Win a Data Science Competition". The goal is not only to practice the skills taught in the course but also to gain insights that are useful for winning the competition.

# Understanding the competition

Our first step is to understand the competition requirements and the data that was provided.
The main points we can take from the competition description are:

* We have a time-series dataset
* We have to predict total sales for every product and store in the next month
* Our submissions will be evaluated by root mean squared error (RMSE)
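Since submissions are scored by RMSE, it helps to have the metric at hand for local validation. The following is a minimal sketch; the function name `rmse` and the toy values are illustrative assumptions, not part of the competition material.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Square the errors, average them, then take the square root
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy values, not competition data: errors are 0, 0 and 3
example_score = rmse([0, 2, 4], [0, 2, 1])
print(example_score)
```

Because the errors are squared before averaging, a single large miss weighs more on the score than several small ones.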

Competition organizers provide us with 6 files, which are described as follows:

* **sales_train.csv** - the training set. Daily historical data from January 2013 to October 2015.
* **test.csv** - the test set. We need to forecast the sales for these shops and products for November 2015.
* **sample_submission.csv** - a sample submission file in the correct format.
* **items.csv** - supplemental information about the items/products.
* **item_categories.csv** - supplemental information about the item categories.
* **shops.csv** - supplemental information about the shops.

In the next sections, we will explore each one of these files in detail.

# Exploring *sales_train.csv*

We begin this task by importing the csv file into our notebook. Next, we take a look at the first rows of the imported dataframe.

In [None]:
sales_train = pd.read_csv('../input/competitive-data-science-predict-future-sales/sales_train.csv')
sales_train.head()

On the description page of the competition, there is information about each feature of this table:

* **date** - date in format dd/mm/yyyy
* **date_block_num** - a consecutive month number, used for convenience. January 2013 is 0, February 2013 is 1,..., October 2015 is 33
* **shop_id** - unique identifier of a shop
* **item_id** - unique identifier of a product
* **item_price** - current price of an item
* **item_cnt_day** - number of products sold. We are predicting a monthly amount of this measure
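The month numbering described for *date_block_num* can be reproduced from a year and a month. This is a small sketch; the helper name `date_block_num` is hypothetical, and treating November 2015 as block 34 is an assumption that simply continues the numbering.

```python
def date_block_num(year, month):
    """Consecutive month index, with January 2013 mapped to 0."""
    return (year - 2013) * 12 + (month - 1)

print(date_block_num(2013, 1))   # January 2013
print(date_block_num(2015, 10))  # October 2015
print(date_block_num(2015, 11))  # November 2015, assuming the numbering continues
```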

Now we will examine each feature in detail.

## Features *date* and *date_block_num*

We are examining both features *date* and *date_block_num* together because they are intrinsically related.

There is not much to learn from these features on their own, so let's just check whether every month from January 2013 to October 2015 is represented in the dataset.

In [None]:
sales_train['date_block_num'].unique()

We can conclude that all 34 months within that interval are represented.
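Reading the array of unique values by eye works, but the same check can be made explicit by comparing against the expected set of month numbers. The sketch below uses a toy dataframe instead of `sales_train`, and the helper name `missing_months` is hypothetical.

```python
import pandas as pd

def missing_months(df, n_months=34):
    """Return the month numbers in 0..n_months-1 that never appear in df."""
    expected = set(range(n_months))
    observed = set(df['date_block_num'].unique())
    return sorted(expected - observed)

# Toy data with four expected months, where month 2 is absent
toy = pd.DataFrame({'date_block_num': [0, 1, 1, 3]})
missing = missing_months(toy, n_months=4)
print(missing)
```

An empty list means every month in the range is represented.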

## Feature *shop_id*

The first thing we can check is the number of different shops that we have in our dataset.

In [None]:
len(sales_train['shop_id'].unique())

Let's check how these ids are distributed.

In [None]:
np.sort(sales_train['shop_id'].unique())

Okay, now we know that all 60 shop ids range from 0 to 59.

## Feature *item_id*

The first thing we can check is the number of different items that we have in our dataset.

In [None]:
len(sales_train['item_id'].unique())

Okay, that is a huge variety of items.

## Feature *item_price*

We can start by checking the price range in this dataset.

In [None]:
print(f"The minimum price is {sales_train['item_price'].min()}")
print(f"The maximum price is {sales_train['item_price'].max()}")

That is odd: the minimum price is negative. Let's check how many rows are affected.


In [None]:
print(f"There are {len(sales_train)} rows in our table.")
print(f"{(sales_train['item_price'] == -1).sum()} of them are priced as -1")
print(f"{(sales_train['item_price'] == 0).sum()} of them are priced as 0")

Okay, so only one row has a price of -1. We should probably exclude it from our analysis.
Now let's check the price distribution.

In [None]:
plt.hist(sales_train['item_price'], bins=[0,50,100,250,500,1000,2000,5000,10000])

In [None]:
prices = sales_train['item_price']
print(f"There are {len(prices)} rows in our table.")
# Share of rows in each price band, as a percentage of all rows
print(f"{100 * ((prices >= 0) & (prices < 500)).mean()}% of them have prices 0 <= x < 500")
print(f"{100 * ((prices >= 500) & (prices < 1000)).mean()}% of them have prices 500 <= x < 1000")

More than half of the sales records have a price below 500.
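The per-band shares can also be computed for all bands at once with `pd.cut`. The sketch below runs on toy prices rather than `sales_train`, and the band edges and labels are illustrative assumptions. Unlike the cell above, it drops negative prices first, so the percentages are relative to the valid rows only.

```python
import pandas as pd

# Toy prices, not competition data; -1.0 stands in for the invalid row
toy_prices = pd.Series([-1.0, 99.0, 250.0, 499.0, 500.0, 750.0, 1200.0, 3000.0])

# Ignore negative prices before computing shares
valid = toy_prices[toy_prices >= 0]

# Left-closed bands: [0, 500), [500, 1000), [1000, inf)
bins = [0, 500, 1000, float('inf')]
labels = ['0 <= x < 500', '500 <= x < 1000', 'x >= 1000']
buckets = pd.cut(valid, bins=bins, labels=labels, right=False)

# Percentage of valid rows falling in each band
share = buckets.value_counts(normalize=True, sort=False) * 100
print(share)
```

Passing `right=False` makes each band include its lower edge and exclude its upper edge, which matches the `>=` and `<` comparisons used above.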

## Feature *item_cnt_day*

There is a lot to explore about this feature, as it is our target feature for prediction.
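Because the target is a monthly amount while `item_cnt_day` is recorded per day, one natural first step is to sum the daily counts per month, shop and item. The sketch below does this on a toy dataframe; the column name `item_cnt_month` and the toy values are assumptions made for illustration.

```python
import pandas as pd

# Toy daily records, not competition data
toy_sales = pd.DataFrame({
    'date_block_num': [0, 0, 0, 1, 1],
    'shop_id':        [5, 5, 5, 5, 7],
    'item_id':        [10, 10, 11, 10, 10],
    'item_cnt_day':   [1.0, 2.0, 1.0, 4.0, 2.0],
})

# Sum daily counts within each (month, shop, item) combination
monthly = (
    toy_sales
    .groupby(['date_block_num', 'shop_id', 'item_id'], as_index=False)['item_cnt_day']
    .sum()
    .rename(columns={'item_cnt_day': 'item_cnt_month'})
)
print(monthly)
```

Each row of the result is one shop and item pair in one month, which is the granularity the competition asks us to predict.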