# EDA & Previous Value Benchmark

This notebook is a first pass through the data, in two parts:
- **EDA**: build up an intuition for the data and the competition challenge.
- **Previous Value Benchmark**: once we have some intuition, I'll build a very basic benchmark that uses previous values as predictions. No model needs to be trained yet. The benchmark gives us a baseline against which to evaluate all future models.

## Task Background

Aim: We are asking you to predict total sales for every product and store in the next month. By solving this competition you will be able to apply and enhance your data science skills. You are provided with daily historical sales data.

The task is to forecast the total amount of products sold in every shop for the test set. Note that the list of shops and products slightly changes every month. Creating a robust model that can handle such situations is part of the challenge.

Evaluation: Submissions are evaluated by root mean squared error (RMSE). True target values are clipped into [0,20] range.
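
To make the metric concrete, here is a minimal sketch of RMSE with the true targets clipped into [0, 20]. The helper name `clipped_rmse` and the toy values are made up for illustration; how the leaderboard applies the clipping internally is an assumption based on the description above.

```python
import numpy as np

def clipped_rmse(y_true, y_pred, lower=0, upper=20):
    """RMSE after clipping the true targets into [lower, upper]."""
    y_true = np.clip(np.asarray(y_true, dtype=float), lower, upper)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy values: a true count of 35 is treated as 20, and -1 (a return) as 0.
y_true = [0, 3, 35, -1]
y_pred = [0, 1, 20, 0]
score = clipped_rmse(y_true, y_pred)
print(score)
```

Because the targets are clipped, predicting more than 20 for any pair can only hurt the score, which is why clipping predictions matters later on.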



**File descriptions**
* sales_train.csv - the training set. Daily historical data from January 2013 to October 2015.
* test.csv - the test set. You need to forecast the sales for these shops and products for November 2015.
* sample_submission.csv - a sample submission file in the correct format.
* items.csv - supplemental information about the items/products.
* item_categories.csv - supplemental information about the item categories.
* shops.csv - supplemental information about the shops.

**Data fields**
* ID - an Id that represents a (Shop, Item) tuple within the test set
* shop_id - unique identifier of a shop
* item_id - unique identifier of a product
* item_category_id - unique identifier of item category
* item_cnt_day - number of products sold. You are predicting a monthly amount of this measure
* item_price - current price of an item
* date - date in format dd/mm/yyyy
* date_block_num - a consecutive month number, used for convenience. January 2013 is 0, February 2013 is 1,..., October 2015 is 33
* item_name - name of item
* shop_name - name of shop
* item_category_name - name of item category

## Load Data

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
# sample_submission / test: ID identifies a (shop_id, item_id) pair; item_cnt_month is the predicted number of items sold
sample_submission = pd.read_csv("/kaggle/input/competitive-data-science-predict-future-sales/sample_submission.csv")
test = pd.read_csv("/kaggle/input/competitive-data-science-predict-future-sales/test.csv")
# items: item_name, item_id, item_category_id
items = pd.read_csv("/kaggle/input/competitive-data-science-predict-future-sales/items.csv") 
# item_categories: item_category_name, item_category_id
item_categories = pd.read_csv("/kaggle/input/competitive-data-science-predict-future-sales/item_categories.csv")
# sales_train: date, date_block_num, shop_id, item_id, item_price, item_cnt_day
sales_train = pd.read_csv("/kaggle/input/competitive-data-science-predict-future-sales/sales_train.csv")
# shops: shop_name, shop_id
shops = pd.read_csv("/kaggle/input/competitive-data-science-predict-future-sales/shops.csv")

In [None]:
# merge all training data together: items, item_categories, sales_train, shops
items_with_categories = pd.merge(items, item_categories, on="item_category_id")
sales_with_items = pd.merge(sales_train, items_with_categories, on="item_id")
train_merged = pd.merge(sales_with_items, shops, on="shop_id")
train_merged.info()

### Downcast Dataframe
Our DataFrame takes up a lot of memory (almost 250MB!), but we can downcast the numeric columns to types that use fewer bits.
I learned about this from another notebook in this competition: https://www.kaggle.com/kyakovlev/1st-place-solution-part-1-hands-on-data. The downcasting code below is taken from there.

In [None]:
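def downcast_dtypes(df):
    # Shrink 64-bit floats to float32 and 32/64-bit ints to int16 (modifies df in place).
    # int16 only holds values up to 32,767, so this assumes every integer column stays below that.
    float_cols = [c for c in df if df[c].dtype == "float64"]
    int_cols = [c for c in df if df[c].dtype in ["int64", "int32"]]
    df[float_cols] = df[float_cols].astype(np.float32)
    df[int_cols] = df[int_cols].astype(np.int16)
    return df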

train_merged = downcast_dtypes(train_merged)
print(train_merged.info())

Downcasting got our dataframe down to 157MB (a huge improvement). We likely won't need all these variables when we train our model so we should be able to decrease this size further once we start actually using the data.
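
Casting every integer column straight to `int16` works only while all values fit in that range. As a hedged alternative (not the approach from the linked notebook), `pd.to_numeric` with its `downcast` argument picks the smallest dtype that can hold each column. The function name `downcast_safely` and the toy values below are hypothetical.

```python
import numpy as np
import pandas as pd

def downcast_safely(df):
    """Return a copy with numeric columns shrunk to the smallest dtype that fits."""
    out = df.copy()
    for col in out.select_dtypes(include=[np.integer]).columns:
        out[col] = pd.to_numeric(out[col], downcast="integer")
    for col in out.select_dtypes(include=[np.floating]).columns:
        out[col] = pd.to_numeric(out[col], downcast="float")
    return out

# Toy frame: 40000 does not fit in int16, so item_id should stay wider.
toy = pd.DataFrame({
    "shop_id": np.array([0, 59], dtype="int64"),
    "item_id": np.array([22169, 40000], dtype="int64"),
    "item_price": np.array([999.0, 1.5], dtype="float64"),
})
small = downcast_safely(toy)
print(small.dtypes)
```

The trade-off is that columns may end up with different widths, which is harmless for memory savings but worth remembering when merging frames later.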

### Pandas Profiler

The Pandas Profiler does some initial basic EDA. It's helpful to use to get a first high-level overview of the data.

In [None]:
# Pandas profiling automates some early EDA
from pandas_profiling import ProfileReport
profile = ProfileReport(train_merged.sample(frac=0.01), title="Training Data Profile")

In [None]:
# show profile
profile

### Key Top Level Takeaways from Profiler

Right off the bat, there are a few high-level takeaways that help us understand the data a bit better. Because the profile was run on a 1% sample, the unique counts below are lower bounds for the full training set.
1. There is no missing data (good!)
2. There are no duplicate rows (good!)
3. There are 34 date blocks, one per month in the data
4. There are around 60 shops
5. There are around 8500 unique items for sale
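
Since the profiler only saw a sample, the same checks can be repeated exactly on a full DataFrame with a few pandas calls. This is a minimal sketch on a made-up toy frame; `summarize` is a hypothetical helper, and in the notebook it would be called on `train_merged` instead.

```python
import pandas as pd

def summarize(df, id_cols):
    """Count missing cells, fully duplicated rows, and unique values per ID column."""
    return {
        "missing_cells": int(df.isna().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
        "unique": {col: int(df[col].nunique()) for col in id_cols},
    }

# Toy data: the first two rows are identical, so one duplicate is expected.
toy = pd.DataFrame({
    "date_block_num": [0, 0, 1, 1, 1],
    "shop_id": [5, 5, 5, 7, 7],
    "item_id": [10, 10, 11, 12, 12],
    "item_cnt_day": [1.0, 1.0, 2.0, 1.0, 3.0],
})
summary = summarize(toy, ["date_block_num", "shop_id", "item_id"])
print(summary)
```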


### Compare Training and Test Set Distributions
It's important to check the distribution in the training data compared with the distribution in the test data. We want to know if they have been pulled from the same distribution.


In [None]:
test_profile = ProfileReport(test.sample(frac=0.01), title="Test Data Profile")
test_profile

First, note that the test set contains just three variables:
- shop_id
- item_id
- ID (unique identifier for a shop_id, item_id tuple)

This is key because we'll be submitting an output file that uses the unique ID for shop_id and item_id pairs. Our target variable will be the number of sales a given shop made of a given product for a particular month (in this case, November 2015).

At first glance, the test set appears to cover fewer unique shops (~42 in test vs. ~60 in train) and fewer items (~1750 in test vs. ~8500 in train). Part of the gap is an artifact of profiling only a 1% sample of each DataFrame, which undercounts unique values. It will be worth investigating further by comparing how well each shop or product is represented in the train set vs. the test set.
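
One way to start that investigation is to measure what share of test rows refer to a shop, an item, or a (shop, item) pair that also appears in the training data. The sketch below uses toy frames and a hypothetical helper named `coverage`; in the notebook it would be called with `sales_train` and `test`.

```python
import pandas as pd

def coverage(train, test, keys):
    """Fraction of test rows whose `keys` combination also occurs in train."""
    seen = train[keys].drop_duplicates()
    flagged = test.merge(seen, on=keys, how="left", indicator=True)
    return float((flagged["_merge"] == "both").mean())

toy_train = pd.DataFrame({"shop_id": [1, 1, 2, 3], "item_id": [10, 11, 10, 12]})
toy_test = pd.DataFrame({"shop_id": [1, 2, 2, 4], "item_id": [10, 10, 11, 10]})

shop_cov = coverage(toy_train, toy_test, ["shop_id"])
item_cov = coverage(toy_train, toy_test, ["item_id"])
pair_cov = coverage(toy_train, toy_test, ["shop_id", "item_id"])
print(shop_cov, item_cov, pair_cov)
```

Low pair coverage would mean many test pairs have no sales history at all, which matters for a benchmark that relies on last month's values.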

## Clean the Data

* **Data Types**:
The first thing we should do is convert the date column in the training data from an "object" to a "datetime" dtype.

* **Constant features**:
There don't appear to be any constant features (features with no variation across all rows).

* **Duplicated features and rows**:
As the Pandas Profiler showed, there is no missing data and there are no duplicate rows.
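
The first two checks can be sketched on a toy frame as follows. The dot-separated date strings and the `currency` column are assumptions made up for this example; adjust the `format` string to whatever the date column actually contains.

```python
import pandas as pd

toy = pd.DataFrame({
    "date": ["02.01.2013", "13.01.2013", "05.02.2013"],  # assumed day-first strings
    "shop_id": [59, 25, 25],
    "currency": ["RUB", "RUB", "RUB"],  # hypothetical constant column
})

# An explicit format removes any day/month ambiguity when parsing.
toy["date"] = pd.to_datetime(toy["date"], format="%d.%m.%Y")

# A column is constant when it holds a single distinct value.
constant_cols = [c for c in toy.columns if toy[c].nunique(dropna=False) == 1]
months = toy["date"].dt.month.tolist()
print(months, constant_cols)
```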

In [None]:
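# dates are day-first, so say so explicitly; otherwise ambiguous dates may be parsed month-first
train_merged['date'] = pd.to_datetime(train_merged['date'], dayfirst=True)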
train_merged.dtypes

# Previous Value Benchmark

OK, so now we have a basic intuition for what's in our dataset.

Let's build a very basic benchmark model. The assumption behind the previous value benchmark is that the number of units of each item sold at a particular shop doesn't vary too much from month to month. As a result, we can simply submit the monthly sales for October 2015 as a proxy for November 2015 (the test set period).

### Project Tip #2 (From Coursera):
A good exercise is to reproduce previous_value_benchmark. As the name suggests, in this benchmark the prediction for each shop/item pair is just its monthly sales from the previous month, i.e. October 2015.

The most important step in reproducing this score is correctly aggregating the daily data and constructing a monthly sales data frame. You need to get lagged values, fill NaNs with zeros and clip the values into [0,20] range. If you do it correctly, you'll get precisely 1.16777 on the public leaderboard.

Generating features like this is a necessary basis for more complex models. Also, if you decide to fit some model, don't forget to clip the target into [0,20] range, it makes a big difference.
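
The steps in the tip (aggregate, lag, fill NaNs, clip) can be sketched end to end on a tiny made-up dataset. The toy frames `daily` and `test_pairs` are hypothetical stand-ins for the real training and test data.

```python
import pandas as pd

daily = pd.DataFrame({
    "date_block_num": [32, 33, 33, 33, 33, 33],
    "shop_id":        [1, 1, 1, 1, 2, 2],
    "item_id":        [10, 10, 10, 11, 10, 12],
    "item_cnt_day":   [5.0, 15.0, 10.0, 2.0, -1.0, 3.0],
})
test_pairs = pd.DataFrame({
    "ID": [0, 1, 2, 3],
    "shop_id": [1, 1, 2, 3],
    "item_id": [10, 11, 10, 10],
})

# 1. Aggregate daily rows into one row per (month, shop, item).
monthly = (
    daily.groupby(["date_block_num", "shop_id", "item_id"], as_index=False)["item_cnt_day"]
    .sum()
    .rename(columns={"item_cnt_day": "item_cnt_month"})
)

# 2. Lag: the last training month (33) becomes the prediction for the next month.
prev = monthly.loc[monthly["date_block_num"] == 33, ["shop_id", "item_id", "item_cnt_month"]]

# 3. Left-join onto the test pairs, fill unseen pairs with 0, and clip into [0, 20].
pred = test_pairs.merge(prev, on=["shop_id", "item_id"], how="left")
pred["item_cnt_month"] = pred["item_cnt_month"].fillna(0).clip(0, 20)
print(pred)
```

In this toy example the pair (1, 10) sold 25 units and is clipped to 20, the pair (2, 10) had a net return of -1 and is clipped to 0, and the pair (3, 10) has no history and is filled with 0.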

In [None]:
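# For now we only care about a few features, so subset the training df to make it easier to work with.
# .copy() gives us an independent frame, so adding columns later doesn't trigger SettingWithCopyWarning.
X = train_merged[["date_block_num", "shop_id", "item_id", "item_cnt_day"]].copy()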

In [None]:
# aggregate daily data and construct a monthly sales data frame
X["item_cnt_month"] = X.groupby(["date_block_num", "shop_id", "item_id"])["item_cnt_day"].transform("sum")

# remove item_cnt_day column as it's no longer useful for the previous value benchmark
X = X.drop(columns="item_cnt_day")

# Each row still represents a single day, so a (date_block_num, shop_id, item_id) combination
# can appear many times. Keep one row per combination so that each row represents a whole month.
print(len(X))

X = X.drop_duplicates(["date_block_num", "shop_id", "item_id"])
print(len(X))

In [None]:
# filter down to just Oct 2015
last_month = X["date_block_num"] == 33
last_month_X = X[last_month].copy()
last_month_X = last_month_X.sort_values(by=["shop_id", "item_id"])

In [None]:
# match item_cnt_month from last month to the test set by joining on shop_id and item_id
combined = pd.merge(test, last_month_X, on=["shop_id", "item_id"], how="left")

# Some shops didn't sell a given item in October 2015, so there are lots of NaNs.
# Fill them with zero. We could impute the mean or median monthly count from previous months,
# but that's a bit too much for a basic benchmark.
combined = combined.fillna(0)

# clip results into [0, 20]; the lower bound matters because returns can make monthly counts negative
combined["item_cnt_month"] = combined["item_cnt_month"].clip(0, 20)

# only keep necessary columns for submission
submission = combined[["ID", "item_cnt_month"]]
submission.head()

In [None]:
# submit benchmark - achieves RMSE of 1.16777
submission.to_csv("submission.csv", index=False)
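
Before uploading, a quick sanity check on the submission frame can catch formatting mistakes. This is a minimal sketch, not a Kaggle-provided validator; `check_submission` and the toy frames are hypothetical, and the expected columns are taken from the submission built above.

```python
import pandas as pd

def check_submission(submission, test, lower=0, upper=20):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    if list(submission.columns) != ["ID", "item_cnt_month"]:
        problems.append("unexpected columns")
        return problems
    if len(submission) != len(test):
        problems.append("row count differs from test")
    if not submission["ID"].is_unique or set(submission["ID"]) != set(test["ID"]):
        problems.append("IDs do not match test")
    if submission["item_cnt_month"].isna().any():
        problems.append("missing predictions")
    elif not submission["item_cnt_month"].between(lower, upper).all():
        problems.append("predictions outside clip range")
    return problems

toy_test = pd.DataFrame({"ID": [0, 1, 2], "shop_id": [1, 1, 2], "item_id": [10, 11, 10]})
good = pd.DataFrame({"ID": [0, 1, 2], "item_cnt_month": [0.0, 20.0, 3.0]})
bad = pd.DataFrame({"ID": [0, 1, 2], "item_cnt_month": [0.0, 25.0, -1.0]})

good_problems = check_submission(good, toy_test)
bad_problems = check_submission(bad, toy_test)
print(good_problems, bad_problems)
```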