# A Robust and Thorough Exploratory Data Analysis
After reading the rules of the challenge and agreeing to the terms, every challenge should start with setting up your environment (I usually use [miniconda](https://docs.conda.io/en/latest/miniconda.html)) and directory structure, and obtaining all of the needed data.

_**You should then spend at least several hours getting to know the data very well. This will save you time in the long run**_. It will help identify potential data problems and might spark new modeling ideas.

You can spend a very long time on EDA, but at the very least, be able to answer the following questions:
## What information is contained in each dataset?
1. What is the size of the stored data? This will affect how you store, access, and process the data. Working with MB of data vs. TB requires different tools and hardware.
    1. Number of rows
    1. Number of columns
    1. How is it stored?
1. Look at a couple of example rows to get a better sense of what it looks like
1. Column fields
    1. What are the descriptions?
    1. Which are allegedly unique? (Always good to verify!)
    1. Which columns will be feature variables and which one is the target variable?
    1. If you have multiple datasets, how can the data be joined together? Are there common columns that you can join the datasets on?
    1. Some examples of values for each column
    1. Types: does anything need to be converted? E.g. strings into dates, strings into floats, etc.
    1. Any metadata explaining possible encodings? E.g. in SKU numbers, the first two digits may represent the retailer's department for that particular item. There may be more information contained in some of the columns.
    
## What is the quality?
1. Any missing data? Can it be imputed or should it be dropped?
1. Any duplicates?
1. Outliers? Should they be removed or adjusted?
1. Unusual values? E.g. negative values for price, or very old dates, etc. What assumptions can be made? What cannot be corrected?
1. Check any assumptions if possible. Are values actually unique?

## What are some basic statistics? 
1. Mean
1. Median
1. Quartiles
1. Count of distinct values

## What does the data look like when plotted?
1. Histograms/Density plots
1. Box plots
1. Scatter plots
1. Line plots

Now let's get started! Feel free to add additional tips in the comments--I'll try to update it here.

..................................................................................................................................................................................................................................

# What information is contained in each dataset?

## Size and Format

We have been given 6 datasets, all as CSVs. I downloaded and unzipped them all to get their sizes.

| Dataset Name  | Size (unzipped) | 
|---------------|-----------------|
|sales_train    |94.6 MB          |
|test           |3.2 MB           |
|items          |1.6 MB           |
|item_categories|4 KB             |
|shops          |3 KB             |

Great news! I can probably do all of the processing locally and just use the data in the raw csv format. That's a relief.
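
File size on disk is only a rough guide, since a DataFrame in memory can be noticeably larger than its CSV. Here is a minimal sketch of checking the in-memory footprint. The toy frame and its values are made up and just stand in for one of the real files:

```python
import pandas as pd

# Toy frame standing in for one of the competition files (values are made up).
toy = pd.DataFrame({
    "shop_id": [1, 2, 3],
    "item_name": ["alpha", "beta", "gamma"],
    "item_price": [199.0, 349.5, 1299.0],
})

# deep=True also counts the contents of string columns, not just the pointers.
bytes_per_column = toy.memory_usage(deep=True)
total_mb = bytes_per_column.sum() / 1024 ** 2

print(bytes_per_column)
print("Total: {:.6f} MB".format(total_mb))
```

Running the same two lines on each real DataFrame should tell you quickly whether everything fits comfortably in RAM.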

Okay, now let's get it into Python. Of course, I'll use pandas to start exploring. If you don't know pandas very well, you can pick up [Wes McKinney's book](https://www.amazon.com/Python-Data-Analysis-Wrangling-IPython/dp/1491957662).

In [None]:
import pandas as pd

# Import all of my data
sales_train = pd.read_csv("../input/sales_train.csv")
test = pd.read_csv("../input/test.csv")
items = pd.read_csv("../input/items.csv")
item_categories = pd.read_csv("../input/item_categories.csv")
shops = pd.read_csv("../input/shops.csv")


# This step just helps me loop over these variables in the future
# I'm lazy and always try to write less code
from collections import namedtuple
Dataset = namedtuple("Dataset", "name df")
datasets = [Dataset(name="Sales train", df=sales_train),
            Dataset(name="Test", df=test),
            Dataset(name="Items", df=items),
            Dataset(name="Item categories", df=item_categories),
            Dataset(name="Shops", df=shops)]

In [None]:
# Get the size info
for d in datasets:
    print(d.name + ": " + str(d.df.shape))

For better or for worse, these are pretty slim datasets in terms of number of features. Thus, if we want to use some ML models as opposed to time series models, we may want to do a fair amount of feature engineering.

## Examples
Look at some examples of rows so we have some sense of what we will be looking at

In [None]:
sales_train.sample(random_state=4)

In [None]:
test.sample(random_state=4)

In [None]:
items.sample(random_state=4)

A good purchase

In [None]:
item_categories.sample(random_state=4)

Definitely make sure you don't save anything with an ASCII encoding or it will screw up the Russian characters.
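
To make the encoding point concrete, here is a small sketch that writes two of the category names to a temporary CSV as UTF-8 and reads them back. The file name `categories.csv` is hypothetical, and the last part only shows why ASCII cannot work for Cyrillic text:

```python
import os
import tempfile

import pandas as pd

# Two category names from this dataset, used for a round-trip check.
names = ["Подарки - Мягкие игрушки", "Подарки - Открытки, наклейки"]
toy = pd.DataFrame({"item_category_name": names})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "categories.csv")  # hypothetical file name
    # State the encoding explicitly on both the write and the read.
    toy.to_csv(path, index=False, encoding="utf-8")
    roundtrip = pd.read_csv(path, encoding="utf-8")

# ASCII simply has no code points for Cyrillic letters.
try:
    names[0].encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(roundtrip["item_category_name"].tolist() == names, ascii_ok)
```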

In [None]:
shops.sample(random_state=4)

## Column fields

In [None]:
# Get the names of columns
print("Column names for each dataset")
for d in datasets:
    print('{:<16}'.format(d.name + ":"), end=" ")
    print(*d.df.columns.tolist(), sep=", ")

Going back to the [Data documentation](https://www.kaggle.com/c/competitive-data-science-predict-future-sales/data), I know what these fields are

|Source         |Column            |Description
|---------------|------------------|-----------
|sales_train    |date              |date of sales presumably, in dd.mm.yyyy
|sales_train    |date_block_num    |consecutive month number. January 2013 is 0, February 2013 is 1
|sales_train    |shop_id           |unique identifier of a shop
|sales_train    |item_id           |unique identifier of a product
|sales_train    |item_price        |current price of an item
|sales_train    |item_cnt_day      |number of products sold
|test           |ID                |an ID that represents a (shop,item) tuple in test set
|test           |shop_id           |unique identifier of a shop
|test           |item_id           |unique identifier of a product
|items          |item_name         |item name
|items          |item_id           |unique identifier of a product
|items          |item_category_id  |unique identifier of item category
|item_categories|item_category_name|name of item category
|item_categories|item_category_id  |unique identifier of item category
|shops          |shop_name         |shop name
|shops          |shop_id           |unique identifier of a shop

So great! No descriptions are missing and I more or less know what everything is. My only confusion was the ID in the test is not a concatenation of the shop id and item id. It's just a different unique identifier.

The target variable in this case is derived from *item_cnt_day*. Note that predictions need to be at a monthly level per shop and item. The train data looks like the test set, but the test set does not have item price, so we need to consider that before building a model.
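
Since the data is daily but the forecast is monthly, the daily rows will need to be rolled up at some point. Here is a rough sketch of that roll-up on made-up toy rows. The output column name `item_cnt_month` is my own choice here:

```python
import pandas as pd

# Toy daily sales with the same column names as sales_train (values made up).
daily = pd.DataFrame({
    "date_block_num": [0, 0, 0, 1, 1],
    "shop_id":        [5, 5, 5, 5, 7],
    "item_id":        [10, 10, 11, 10, 10],
    "item_cnt_day":   [1.0, 2.0, 1.0, 4.0, 1.0],
})

# One row per month + shop + item, with the daily counts summed.
keys = ["date_block_num", "shop_id", "item_id"]
monthly = (daily
           .groupby(keys, as_index=False)["item_cnt_day"]
           .sum()
           .rename(columns={"item_cnt_day": "item_cnt_month"}))
print(monthly)
```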

It looks like my sales_train and test can be merged with items, item_categories, and shops to get the shop name, item category name, and item name if I want it.

In [None]:
# Get some samples of values
for d in datasets:
    cols = d.df.columns
    print(d.name + " column examples")
    for c in cols:
        print('    ' + c + ": ", end = " ")
        print(*d.df[c].sample(5, random_state=4), sep="; ")

Looking at the items, it seems that this store sells CDs, games, and DVDs. Google-translate of a couple of the categories shows Подарки - Мягкие игрушки is 'Gifts - soft toys', Подарки - Открытки, наклейки is 'Gifts - Cards, stickers'. So it seems to sell an array of items.

Moving onwards, you may have noticed at this point that the dates are in a different format than what pandas usually prints. Let's check the types.

In [None]:
# Check the types
for d in datasets:
    print("• " + d.name + " types")
    print(d.df.dtypes, end="\n\n")

Sure enough, date is an object instead of a datetime. Also, I'm a little curious that item_cnt_day is a float instead of an int. How can you sell just half an item? I will check my intuition

In [None]:
# Convert date to a datetime
sales_train["date"] = pd.to_datetime(sales_train["date"], format='%d.%m.%Y')

In [None]:
# Check if all of the item count values are whole numbers
(sales_train["item_cnt_day"] == sales_train["item_cnt_day"].astype(int)).all()

Okay, I am satisfied that item_cnt_day always holds whole numbers.

## Quality check

**Never assume that the data is clean, because it's probably not**. You're probably going to spend more time cleaning up data than modeling, but it will probably improve your results more than a model ever could. Don't skimp on this step!

In [None]:
# Looks for NAs
for d in datasets:
    print("• " + d.name + " number of missing values")
    print(d.df.isna().sum(), end="\n\n")

AMAZING!!!

In [None]:
# Looks for duplicates
for d in datasets:
    print(d.name + " any duplicates?", end=" ")
    print(d.df.duplicated().any())

Darn, let's figure out what this/these duplicate(s) is/are

In [None]:
sales_train[sales_train.duplicated(keep=False)]

In my job, I would try to track down or talk to the team that owns this data to ask why there are duplicates. Given that this is a Kaggle competition, it's hard to ask someone. But I can make one of two assumptions: either these are duplicated rows, and shop 54 only sold 1 unit of item 20130 on 05.01.2013, or it actually sold 2 and the rows weren't aggregated correctly. I'm going to assume it's just a duplicate and drop the extra rows.

In [None]:
sales_train = sales_train.drop_duplicates(keep='first')
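
For comparison, here is a sketch of what the two assumptions about duplicates would each produce. The toy rows mimic the duplicated record, but the price is made up:

```python
import pandas as pd

# Two identical rows plus one ordinary row (the price is a made-up value).
toy = pd.DataFrame({
    "date": ["05.01.2013", "05.01.2013", "06.01.2013"],
    "shop_id": [54, 54, 54],
    "item_id": [20130, 20130, 20130],
    "item_price": [149.0, 149.0, 149.0],
    "item_cnt_day": [1.0, 1.0, 1.0],
})

# Assumption A: the row was recorded twice, so keep a single copy.
dropped = toy.drop_duplicates(keep="first")

# Assumption B: these were two separate sales, so add the counts together.
keys = ["date", "shop_id", "item_id", "item_price"]
summed = toy.groupby(keys, as_index=False)["item_cnt_day"].sum()

print(dropped["item_cnt_day"].sum(), summed["item_cnt_day"].sum())
```

Both options leave two rows. They only differ in how many units are counted for the duplicated day.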

Now, I mentioned that we could merge these datasets together. Let's do that and make sure we wouldn't be missing any data between the datasets. E.g. if the sales_train dataset had a shop_id = 9000, but there was no shop 9000 in the shops dataset, then I would be missing some info.

The following code will merge the train and test datasets to the lookup datasets items, shops, and item_categories. It will raise exceptions if a lookup fails.

In [None]:
def merge_data(df):
    """Merges lookup data onto the input dataframe to return shop and item details.
    
    Parameters
    ---------
    df : DataFrame
        Train or test dataframe to merge lookup values to
    
    Returns
    -------
    DataFrame
        Inputted dataframe with columns shop_name, item_name,
            item_category_id, and item_category_name appended to it
    
    Raises
    ------
    Exception
        If there is a value in the inputted df that is not present in the lookup dataframes
    """
    # validate="m:1" checks that the merge keys are unique in the right (lookup) dataset
    shop_merge = pd.merge(df, shops,
                          on="shop_id",
                          how="left",
                          validate="m:1",
                          indicator="_shop_merge")
    # Raise if any key could not be found in the right (lookup) dataset
    if any(shop_merge["_shop_merge"]=="left_only"):
        raise Exception("Could not lookup value for shop")

    item_merge = pd.merge(shop_merge, items, 
                          on="item_id", 
                          how="left", 
                          validate="m:1", 
                          indicator="_item_merge")
    if any(item_merge["_item_merge"]=="left_only"):
        raise Exception("Could not lookup value for item")
    
    item_cat_merge = pd.merge(item_merge, item_categories, 
                              on="item_category_id", 
                              how="left", 
                              validate="m:1", 
                              indicator="_cat_merge")
    if any(item_cat_merge["_cat_merge"]=="left_only"):
        raise Exception("Could not lookup value for item category")

    # Drop merge columns
    item_cat_merge = item_cat_merge.drop(columns=["_shop_merge", "_item_merge", "_cat_merge"])
    return item_cat_merge 

sales_train_merge = merge_data(sales_train)
test_merge = merge_data(test)

Great! Everything merged without issues and I'm not missing any keys. I'm going to be using these datasets going forward. Here's a quick preview:

In [None]:
sales_train_merge.sample(random_state=4)
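
If the `validate` and `indicator` arguments used in `merge_data` are new to you, this toy sketch shows what each one catches. The shop names and the missing shop 9000 are made up:

```python
import pandas as pd

left = pd.DataFrame({"shop_id": [1, 2, 9000]})
lookup = pd.DataFrame({"shop_id": [1, 2], "shop_name": ["A", "B"]})

# indicator adds a column saying where each row's key was found.
merged = pd.merge(left, lookup, on="shop_id", how="left",
                  validate="m:1", indicator="_shop_merge")
unmatched = merged.loc[merged["_shop_merge"] == "left_only", "shop_id"].tolist()
print("Keys missing from the lookup:", unmatched)

# validate="m:1" raises if a key appears more than once in the right frame.
dup_lookup = pd.concat([lookup, lookup.iloc[[0]]], ignore_index=True)
try:
    pd.merge(left, dup_lookup, on="shop_id", how="left", validate="m:1")
    raised = False
except pd.errors.MergeError:
    raised = True
print("Duplicate lookup key detected:", raised)
```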

Now this is not the plotting section, but I'm going to use scatter plots to find outliers for *item_cnt_day* and *item_price*, if there are any. Box plots could also be helpful here, but I am trying to view the data in as raw a form as possible for now.

In [None]:
from matplotlib import pyplot
sales_train_merge.plot.scatter(x="date_block_num", y="item_price")
pyplot.show()

What?! There's an item with a price over 300k?

In [None]:
sales_train_merge.loc[sales_train_merge["item_price"]>300000]

This is some kind of software program. Maybe it was just a mistake? Let's see if there are any other entries

In [None]:
sales_train_merge.loc[sales_train_merge["item_id"]==6066]

No other entries. A quick Google search finds the software's site: http://www.radmin.com/ordering/, and looking at the conversion of USD to RUB, it looks like 300k is actually reasonable for over 500 licenses. That's about $5k USD. Okay, so I guess it's fine.

In [None]:
sales_train_merge.plot.scatter(x="date_block_num", y="item_cnt_day")
pyplot.show()

Let's look closer at some of these values

In [None]:
sales_train_merge.loc[sales_train_merge["item_cnt_day"] > 500].sort_values("item_cnt_day", ascending=False)

I noticed 'Grand Theft Auto V' in there, the release date of which was [14 April 2015](https://support.rockstargames.com/articles/202423276/Grand-Theft-Auto-V-PC-Release-Date-And-Time), so I guess that explains the high quantity that day. Clearly a new release of an item will lead to large sales. Let's keep that in mind when building our model.

The item category 80 is for service tickets, so I expect this was probably when tickets were first released.

The top-selling item is for some delivery service: https://boxberry.ru/en/


Okay, most of these outliers seem defensible so I'm not going to modify anything
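
If you would rather get a systematic list of candidates than eyeball a scatter plot, one common rule of thumb is to flag anything more than 1.5 times the interquartile range above the third quartile. I did not use this rule above, so treat it as an optional sketch on made-up numbers:

```python
import pandas as pd

# Made-up daily counts with one obvious extreme value.
counts = pd.Series([1, 1, 2, 1, 3, 1, 2, 1000], dtype="float64")

q1, q3 = counts.quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)

# Flag, don't delete: each flagged row still deserves a manual look.
flagged = counts[counts > upper_fence]
print(upper_fence, flagged.tolist())
```

On heavily skewed sales data this rule can flag a lot of perfectly legitimate rows, which is why checking them by hand, as above, is still worthwhile.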

Now, let's check some assumptions. First, let's make sure the dates appear to be in the correct range

In [None]:
print("Min date ", min(sales_train_merge["date"]))
print("Max date ", max(sales_train_merge["date"]))

These seem fine. Now what about the item_cnt_day? Are they all non-negative?

In [None]:
(sales_train_merge["item_cnt_day"] >= 0 ).all()

Hmm, no?

In [None]:
sales_train_merge.loc[sales_train_merge["item_cnt_day"] < 0]

It seems like there are probably some returns in there. It is good to know that. Sometimes, it may make sense to predict returns and sales separately. In this case, all of the data has been aggregated together so we won't be able to do that. 

In [None]:
sales_train_merge.loc[sales_train_merge["item_cnt_day"]==0]

It looks like we don't have any rows for items where nothing was sold, so it's not clear whether a missing row means the item wasn't carried or the item just didn't sell.
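
One way to make those missing rows explicit is to build every month + shop + item combination and fill the gaps with zero. This is only a sketch on made-up monthly totals (the column name `item_cnt_month` is my own), and it treats "no row" as "zero sold", which may be wrong for items a shop never carried:

```python
import pandas as pd

# Made-up monthly totals with several combinations missing.
monthly = pd.DataFrame({
    "date_block_num": [0, 0, 1],
    "shop_id": [5, 7, 5],
    "item_id": [10, 10, 11],
    "item_cnt_month": [3.0, 1.0, 2.0],
})

keys = ["date_block_num", "shop_id", "item_id"]
# Every combination of the observed months, shops, and items.
full_index = pd.MultiIndex.from_product(
    [monthly[k].unique() for k in keys], names=keys)

grid = (monthly.set_index(keys)
        .reindex(full_index, fill_value=0.0)
        .reset_index())
print(grid)
```

On the real data the full grid gets large quickly, so you may want to build it one month at a time.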

In [None]:
sales_train_merge.loc[sales_train_merge["item_price"] < 0 ]

This just doesn't seem right.

In [None]:
sales_train_merge.loc[(sales_train_merge["item_id"] == 2973 ) & (sales_train_merge["shop_id"] == 32 )]

Let's fill this problematic value with the median

In [None]:
sales_train_merge.loc[sales_train_merge["item_price"] < 0, "item_price"] = None
sales_train_merge["item_price"] = sales_train_merge["item_price"].fillna(sales_train_merge["item_price"].median())
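
The global median is a quick fix, but it mixes the prices of very different products. An alternative sketch, assuming the other rows for the same item are trustworthy, fills the bad value with that item's own median. The prices below are made up for illustration:

```python
import numpy as np
import pandas as pd

# Made-up prices: one negative value for item 2973.
toy = pd.DataFrame({
    "item_id": [2973, 2973, 2973, 100, 100],
    "item_price": [2499.0, -1.0, 2499.0, 10.0, 12.0],
})

# Blank out the impossible price, then fill it from the same item's median.
toy.loc[toy["item_price"] < 0, "item_price"] = np.nan
item_median = toy.groupby("item_id")["item_price"].transform("median")
toy["item_price"] = toy["item_price"].fillna(item_median)
print(toy)
```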

I'm going to make sure that the dateblock values are correctly calculated before I use them for anything

In [None]:
months = sales_train_merge["date"].dt.month - 1
years = (sales_train_merge["date"].dt.year - 2013)*12

my_dateblock = months + years
(my_dateblock == sales_train_merge["date_block_num"]).all()

Swell, nothing unexpected there

Last, I just want to get an idea of how many new combinations of shop and item are in the test set compared to the training set

In [None]:
train_set = set([k for k in sales_train_merge[["shop_id", "item_id"]].itertuples(index=False)])
test_set = set([k for k in test_merge[["shop_id", "item_id"]].itertuples(index=False)])
new = test_set - train_set
print("New or never sold item + shop combinations in test", len(new), "out of", len(test_set))

In [None]:
test_merge.columns

That's a big proportion. It may just be because most of these items never sold previously at those stores, but it will be trickier if it is because a store never carried them before. If it's the latter, then we might want to forecast at a higher level than item + shop.
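
To get a first feel for which case we are in, the new pairs can be split into "item sold somewhere, just not in this shop" and "item never seen at all". This is a sketch on made-up ids, and the labels are my own:

```python
import pandas as pd

# Made-up shop/item pairs standing in for the train and test sets.
train_pairs = pd.DataFrame({"shop_id": [1, 1, 2], "item_id": [10, 11, 10]})
test_pairs = pd.DataFrame({"shop_id": [1, 2, 2], "item_id": [10, 11, 12]})

seen_pairs = set(zip(train_pairs["shop_id"], train_pairs["item_id"]))
seen_items = set(train_pairs["item_id"])


def classify(shop_id, item_id):
    """Label a test pair by how much of it appears in the training data."""
    if (shop_id, item_id) in seen_pairs:
        return "pair seen in train"
    if item_id in seen_items:
        return "item sold in another shop"
    return "item never seen"


test_pairs["status"] = [classify(s, i)
                        for s, i in zip(test_pairs["shop_id"], test_pairs["item_id"])]
print(test_pairs["status"].value_counts())
```

This still cannot tell you whether a shop stocked an item without selling it, but it does separate brand-new items from the rest.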

## Statistics

Let's quickly calculate some basic statistics using pandas's describe function

In [None]:
sales_train_merge[["item_price", "item_cnt_day"]].describe()

3/4 of items are priced below 1000. It also looks like most items only sell 1 unit per shop per day. It probably makes more sense to look at this information at a monthly level across all stores since that's what we need to forecast.

In [None]:
sales_train_agg = sales_train_merge.groupby(["item_id", "date_block_num"])["item_cnt_day"].sum()
sales_train_agg.describe()

This is more helpful. From these values, we should expect that each item will only sell several units per month. The mean (15.6) is much greater than the median (4) indicating a large skew. Let's plot to visualize the distribution.
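
A mean far above the median is the classic sign of a long right tail. As a side sketch on made-up numbers, here is how the skew can be measured, and how a log transform (a common option, not something I apply here) compresses the tail:

```python
import numpy as np
import pandas as pd

# Made-up monthly unit counts with a long right tail.
units = pd.Series([1, 1, 2, 2, 3, 4, 4, 5, 30, 104], dtype="float64")
print("mean:", units.mean(), "median:", units.median(), "skew:", units.skew())

# log1p(x) = log(1 + x) handles zeros and is undone with expm1.
log_units = np.log1p(units)
print("skew after log1p:", log_units.skew())
```

Note that `log1p` is undefined for values of -1 or below, so monthly totals that are negative because of returns would need to be clipped or handled separately first.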

In [None]:
pyplot.figure(1, figsize=(12,6))
pyplot.suptitle("Distribution of units sold")
pyplot.subplot(121)
xlims = (-20, 100)  # set x-limits
sales_train_agg.hist(range=xlims, bins=12)
pyplot.subplot(122)
sales_train_agg.to_frame().boxplot()
pyplot.show()

It's kind of hard to read the boxplot, so let's zoom in closer

In [None]:
sales_train_agg.to_frame().boxplot()
ax = pyplot.gca()
ax.set_ylim(xlims)
pyplot.show()

We can also see on the plots above that very few items sell more than 20 units in a month.

In [None]:
print("Number of unique shops: {0:d}".format(sales_train_merge["shop_id"].nunique()))
print("Number of unique item names: {0:d}".format(sales_train_merge["item_name"].nunique()))
print("Number of unique item ids: {0:d}".format(sales_train_merge["item_id"].nunique()))
print("Number of unique categories: {0:d}".format(sales_train_merge["item_category_id"].nunique()))

## Plots

There are so many ways to view this data, by shop, by item id, by category, by day, by month, etc. Let's start with the big picture and view the number of units sold across time.

In [None]:
pyplot.figure(figsize=(16,8))
sales_train_merge.groupby("date")["item_cnt_day"].sum().plot()
pyplot.show()

There are huge spikes around December. There is definitely some seasonal variation. There is an overall decline in sales. Let's see what it looks like when we view it from a monthly view.

In [None]:
pyplot.figure(figsize=(16,8))
sales_train_merge.groupby("date_block_num")["item_cnt_day"].sum().plot()
pyplot.show()

This makes the spikes and trend even more apparent and less noisy. A yearly view makes the decline clearer still (and it's always good to know how to use Grouper in pandas):

In [None]:
pyplot.figure(figsize=(16,8))
# "A" is the year-end frequency alias; pandas 2.2+ prefers "YE" and warns on "A"
sales_train_merge.groupby(pd.Grouper(key="date", freq="A"))["item_cnt_day"].sum().plot.bar()
pyplot.show()

Part of this can be explained by the fact that the data for 2015 ends in October and doesn't include the final months, and we have seen how sales tend to spike in December.
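
To compare the years on equal footing, one option is to restrict every year to January through October before summing. Here is a sketch on made-up rows:

```python
import pandas as pd

# Made-up sales rows spread across three years.
dates = pd.to_datetime(["2013-03-01", "2013-12-15", "2014-05-01",
                        "2014-11-20", "2015-02-01", "2015-10-31"])
toy = pd.DataFrame({"date": dates,
                    "item_cnt_day": [5.0, 9.0, 4.0, 6.0, 3.0, 2.0]})

# Keep only the months that exist in every year, then total by year.
jan_to_oct = toy.loc[toy["date"].dt.month <= 10]
like_for_like = jan_to_oct.groupby(jan_to_oct["date"].dt.year)["item_cnt_day"].sum()
print(like_for_like)
```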

Now let's see sales at a category-level

In [None]:
# There are too many to plot, so let's just look at top-selling categories
topselling = sales_train_merge.groupby("item_category_id")["item_cnt_day"].sum().nlargest(10).index
sales_topselling = sales_train_merge.loc[sales_train_merge["item_category_id"].isin(topselling)]
sales_topselling.groupby(["date_block_num", "item_category_id"])["item_cnt_day"].sum().unstack().plot(figsize=(16,8))
pyplot.show()

In [None]:
sales_topselling["item_category_name"].unique()

These correspond to 
'Cinema - Blu-Ray', 'Music - Local Production CD',
'Games - XBOX 360', 'Games - PS3',
'PC Games - Additional Editions',
'PC Games - Standard Editions', 'Cinema DVD',
'Gifts - Board Games (Compact)',
'Gifts - Bags, Albums, Mousepad', 'Games - PS4'

Thanks to Google Translate

Hope this helped you consider some new things to look for when doing EDA! Happy modeling!