Fast EDA from raw data
-----


This notebook gives a quick visual overview of each dataset (before merging) to get an idea of how to handle it.


In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt

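# 'seaborn-pastel' was renamed 'seaborn-v0_8-pastel' in matplotlib 3.6; use whichever is available
pastel_style = 'seaborn-v0_8-pastel' if 'seaborn-v0_8-pastel' in plt.style.available else 'seaborn-pastel'
plt.style.use(pastel_style)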
sns.set_theme(style="whitegrid", palette="pastel")

First, load the data.
Below is a recap of the dataset description.

- **Data**. This is the train dataset. It contains time series of the features store_nbr, family, and onpromotion, and the target sales.
    - store_nbr identifies the store at which the products are sold.
    - family identifies the type of product sold.
    - sales gives the total sales for a product family at a particular store at a given date. Fractional values are possible since products can be sold in fractional units (1.5 kg of cheese, for instance, as opposed to 1 bag of chips).
    - onpromotion gives the total number of items in a product family that were being promoted at a store at a given date.
   
- **Stores**. Store metadata, including city, state, type, and cluster (grouping of similar stores).

- **Oil**. Daily oil price (includes values during both the train and test data timeframes!).

- **Holiday**. Holidays and events, with metadata.


- **Transactions**. Number of transactions recorded for each store (store_nbr).


Notes:

1. Ecuador is an oil-dependent country and its economic health is highly vulnerable to shocks in oil prices.

2. The transferred column in holidays. A holiday that is transferred officially falls on that calendar day but was moved to another date by the government. A transferred day is more like a normal day than a holiday. To find the day on which it was actually celebrated, look for the corresponding row where type is Transfer. Additional holidays are days added to a regular calendar holiday, as typically happens around Christmas (making Christmas Eve a holiday).


3. Wages in the public sector are paid every two weeks, on the 15th and on the last day of the month. Supermarket sales could be affected by this.

4. A magnitude 7.8 earthquake struck Ecuador on April 16, 2016. People rallied in relief efforts, donating water and other basic necessities, which greatly affected supermarket sales for several weeks after the earthquake.
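
Notes 3 and 4 can be turned into simple calendar flags. The following is only a sketch on a made-up date range: the names `calendar`, `is_payday` and `post_quake` are hypothetical, and the 21-day window is an assumption standing in for "several weeks".

```python
import pandas as pd

# Toy calendar (made-up range); in the notebook this would be the parsed date column.
calendar = pd.DataFrame({"date": pd.date_range("2016-04-10", "2016-05-02", freq="D")})

# Note 3: public-sector wages are paid on the 15th and on the last day of the month.
calendar["is_payday"] = (calendar["date"].dt.day == 15) | calendar["date"].dt.is_month_end

# Note 4: the earthquake struck on 2016-04-16. The 21-day window (days 0..20) is an
# assumption for "several weeks"; it should be tuned against the data.
quake = pd.Timestamp("2016-04-16")
days_since = (calendar["date"] - quake).dt.days
calendar["post_quake"] = days_since.between(0, 20)

print(calendar[calendar["is_payday"]])
```

Flags like these could later be compared against sales to see whether the payday or earthquake effects are visible.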

In [None]:
data = pd.read_csv('/kaggle/input/store-sales-time-series-forecasting/train.csv')
stores = pd.read_csv('/kaggle/input/store-sales-time-series-forecasting/stores.csv')
oil = pd.read_csv('/kaggle/input/store-sales-time-series-forecasting/oil.csv')
holiday = pd.read_csv('/kaggle/input/store-sales-time-series-forecasting/holidays_events.csv')
transactions = pd.read_csv('/kaggle/input/store-sales-time-series-forecasting/transactions.csv')

## 1. Data (train/test dataframe)

In [None]:
data.info()

There are no missing values in this dataset.
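
The same conclusion can be checked explicitly instead of reading it off `info()`. A minimal sketch on a made-up frame (`toy` and its values are hypothetical; only the column names come from the recap above):

```python
import pandas as pd

# Toy frame (made-up values) with the train columns named in the recap.
toy = pd.DataFrame({
    "store_nbr": [1, 1, 2],
    "family": ["GROCERY I", "BEVERAGES", "PRODUCE"],
    "sales": [10.0, 3.5, 0.0],
    "onpromotion": [0, 2, 1],
})

missing_per_column = toy.isna().sum()   # NaN count per column
print(missing_per_column)
print("Any missing values:", bool(toy.isna().any().any()))
```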

In [None]:
print(data.family.unique())
print('There are %s families' %len(data.family.unique()))

Next, let us plot sales and onpromotion grouped by family.

In [None]:
# unique family names, in order of first appearance
family_types = data.family.unique()
# converting type of columns to 'category'
data['family'] = data['family'].astype('category')
# Assigning numerical values and storing in another column
data['family_cat'] = data['family'].cat.codes

# let us group and visually check
# (select the numeric columns before .mean(); recent pandas raises on non-numeric columns)
data_grouped_family_types = data.groupby('family_cat')[['sales', 'onpromotion']].mean()

# one figure with three side-by-side panels
plt.figure(figsize=(22, 6))
plt.subplot(131)
plt.title('Sales')
data_grouped_family_types.sales.plot(kind='pie')
plt.subplot(132)
plt.title('On promotion')
data_grouped_family_types.onpromotion.plot(kind='pie')
plt.subplot(133)
plt.xticks(np.arange(0, 33, step=2))
plt.plot(data_grouped_family_types.index, data_grouped_family_types.sales)
plt.plot(data_grouped_family_types.index, data_grouped_family_types.onpromotion)

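data_grouped_family = data.groupby('family', observed=True)[['sales', 'onpromotion']].mean()
# top three families by mean sales (printed, since only a cell's last expression is displayed)
print(data_grouped_family.sort_values('sales', ascending=False)[:3])
print(family_types[12], ',', family_types[3], ',' , family_types[30])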

For both sales and onpromotion, the group with the largest slice is GROCERY I, followed by BEVERAGES and PRODUCE. Comparing the two pies, the relative sizes of the slices look similar. In the right-hand plot, the three highest peaks correspond, as expected, to the three groups mentioned above.
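
The top groups can also be read off by name rather than by position. A small sketch on made-up numbers (`toy`, `family_means`, `top3` and `share` are hypothetical names; the values are not taken from the dataset):

```python
import pandas as pd

# Toy data (made-up values) with a few product families.
toy = pd.DataFrame({
    "family": ["GROCERY I", "GROCERY I", "BEVERAGES", "PRODUCE", "BOOKS", "BOOKS"],
    "sales": [400.0, 380.0, 250.0, 140.0, 1.0, 0.0],
    "onpromotion": [5, 7, 3, 2, 0, 0],
})

# Mean sales and onpromotion per family, as in the pies above.
family_means = toy.groupby("family")[["sales", "onpromotion"]].mean()

# Three families with the highest mean sales, largest first.
top3 = family_means["sales"].nlargest(3)

# Each family's share of the summed means (what a pie slice represents).
share = family_means / family_means.sum()

print(top3)
print(share.loc[top3.index].round(2))
```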

Now let us compare how sales and onpromotion vary depending on the day of the week, month or year.

In [None]:
data['date'] = pd.to_datetime(data['date'])
data['day_of_week'] = data['date'].dt.dayofweek
data['month'] = data['date'].dt.month
data['year'] = data['date'].dt.year

In [None]:
# select the numeric columns before .mean(); recent pandas raises on non-numeric columns
cols = ['sales', 'onpromotion']
data_grouped_day = data.groupby('day_of_week')[cols].mean()
data_grouped_month = data.groupby('month')[cols].mean()
data_grouped_year = data.groupby('year')[cols].mean()

#### cheeky plots
# 2x3 grid: sales on the top row, onpromotion on the bottom row
plt.figure(figsize=(22, 10))
plt.subplot(231)
plt.title('sales - day')
data_grouped_day.sales.plot()
plt.subplot(232)
plt.title('sales - month')
data_grouped_month.sales.plot()
plt.subplot(233)
plt.title('sales - year')
data_grouped_year.sales.plot()
plt.subplot(234)
plt.title('onpromotion - day')
data_grouped_day.onpromotion.plot(color='lightcoral')
plt.subplot(235)
plt.title('onpromotion - month')
data_grouped_month.onpromotion.plot(color='lightcoral')
plt.subplot(236)
plt.title('onpromotion - year')
data_grouped_year.onpromotion.plot(color='lightcoral')

The first row of plots corresponds to the average sales per day of the week (Monday is 0), month, and year. The second row shows onpromotion over the same timespans.
The shapes of the curves seem to correlate, especially in the case of the day of the week.

- The highest sales values appear on Sunday and Saturday, while onpromotion peaks on Saturday.
- Sales increase notably in month 12 (Christmas?), and there is a peak in month 7 that also appears in onpromotion.
- Over the years, both sales and onpromotion tend to increase.
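
The visual impression that the two weekday curves move together can be quantified with a correlation coefficient. This is a sketch on made-up numbers, not a result from the dataset; `toy`, `profile` and `r` are hypothetical names.

```python
import pandas as pd

# Toy data (made-up numbers): two weeks of daily values.
toy = pd.DataFrame({
    "date": pd.date_range("2017-01-02", periods=14, freq="D"),  # 2017-01-02 is a Monday
    "sales": [300, 280, 290, 270, 310, 400, 420, 305, 275, 295, 265, 315, 410, 430],
    "onpromotion": [2, 2, 3, 2, 3, 5, 4, 2, 2, 3, 2, 3, 5, 4],
})
toy["day_of_week"] = toy["date"].dt.dayofweek   # Monday is 0

# Average profile per day of the week, as in the plots above.
profile = toy.groupby("day_of_week")[["sales", "onpromotion"]].mean()

# Pearson correlation (the pandas default) between the two weekday profiles.
r = profile["sales"].corr(profile["onpromotion"])

print(profile)
print(f"Pearson r between the two weekday profiles: {r:.2f}")
```

With only seven points per profile the coefficient is a rough indication at best, and it says nothing about whether promotions drive sales or simply follow the busy days.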

## 2. Stores

In [None]:
stores.head(3)

In [None]:
print('There are %s Store Numbers in the dataset' %len(stores.store_nbr.unique()))
print('There are %s different cities' %len(stores.city.unique()))
print('There are %s different types' %len(stores.type.unique()))
print('There are %s different clusters' %len(stores.cluster.unique()))
print('There are %s missing values' %sum(stores.isna().sum()))

Let's try some visualizations.

In [None]:
# one figure with two side-by-side panels
plt.figure(figsize=(21,6))
plt.subplot(121)
sns.countplot(x=stores.type, order = stores.type.value_counts().index)
plt.subplot(122)
sns.countplot(y=stores.city, order = stores.city.value_counts().index)

In [None]:
#sns.factorplot(x = 'type', y='cluster',data=stores)
sns.catplot(x = 'type', y='cluster',data=stores, kind='strip')
#stores[stores.type == 'E']
#stores[stores.cluster == 10]

The count plots show that the order of the types, by number of stores, is D-C-A-B-E, and that most stores are in Quito.

Recalling that the stores are grouped into clusters, the strip plot above shows which clusters each type belongs to.
Interestingly, all type E stores belong to cluster 10, but not all cluster 10 stores are type E! This is the only case in which this happens, so after merging with the dataset containing 'family' it would be interesting to look at the relation between family and cluster.
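
A cross-tabulation is one way to check this kind of type/cluster relation without reading it off a plot. The sketch below uses a made-up store table (`toy_stores` and its rows are hypothetical; only the column names type and cluster come from the stores data):

```python
import pandas as pd

# Toy store table (made-up rows) with the type and cluster columns.
toy_stores = pd.DataFrame({
    "store_nbr": [1, 2, 3, 4, 5, 6, 7, 8],
    "type": ["D", "D", "E", "E", "C", "C", "A", "A"],
    "cluster": [13, 1, 10, 10, 10, 3, 14, 11],
})

# Rows: type, columns: cluster, cells: number of stores.
table = pd.crosstab(toy_stores["type"], toy_stores["cluster"])
print(table)

# Types whose stores all sit in a single cluster.
clusters_per_type = toy_stores.groupby("type")["cluster"].nunique()
single_cluster_types = clusters_per_type[clusters_per_type == 1].index.tolist()
print(single_cluster_types)

# Types present in cluster 10 (more than one means the cluster is shared).
types_in_cluster_10 = sorted(toy_stores.loc[toy_stores["cluster"] == 10, "type"].unique())
print(types_in_cluster_10)
```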

## 3. Transactions

Below we have the mean number of transactions per store_nbr.

In [None]:
print(transactions.head(2))
# select the numeric column before .mean(); recent pandas raises on non-numeric columns
transactions_grouped_store = transactions.groupby('store_nbr')[['transactions']].mean()
print(stores.store_nbr.unique())
plt.figure(figsize=(15, 5))
plt.title('Mean of transactions for each store_nbr')
plt.xticks(np.arange(1, 55, step=2))
plt.plot(transactions_grouped_store)

## 4. Oil

In [None]:
oil.head(3)

In [None]:
print(oil.isna().sum())
plt.figure(figsize=(12, 4))
# label the line so plt.legend() has an entry to show
plt.plot(oil.dcoilwtico, label='dcoilwtico')
#oil.dcoilwtico.plot()
plt.legend()
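
If the `isna()` count printed above turns out to be non-zero, one possible way to fill the gaps before merging is linear interpolation. This is only a sketch on a made-up series: the prices are invented, the `date` column name and the `dcoilwtico_filled` column are assumptions, and interpolation is just one of several reasonable choices.

```python
import numpy as np
import pandas as pd

# Toy oil series (made-up prices) with gaps at the start, middle and end.
toy_oil = pd.DataFrame({
    "date": pd.date_range("2013-01-01", periods=6, freq="D"),
    "dcoilwtico": [np.nan, 93.1, np.nan, 93.0, 93.2, np.nan],
})

print(toy_oil["dcoilwtico"].isna().sum(), "missing before filling")

# Linear interpolation for interior gaps; bfill/ffill cover any gaps left at the ends.
toy_oil["dcoilwtico_filled"] = (
    toy_oil["dcoilwtico"].interpolate(method="linear").bfill().ffill()
)
print(toy_oil)
```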

## 5. Holiday

In [None]:
holiday.head(2)

In [None]:
print('There are %s types of days' %len(holiday.type.unique()))
print('There are %s types of locale' %len(holiday.locale.unique()))
print('There are %s different locale_names' %len(holiday.locale_name.unique()))
print('There are %s missing values' %sum(holiday.isna().sum()))

In [None]:
# one figure with two side-by-side panels
plt.figure(figsize=(20,5))
plt.subplot(121)
plt.title('Counts of each day type, split by locale')
sns.countplot(x=holiday.type, hue=holiday.locale)
plt.legend(loc='upper right')
plt.subplot(122)
plt.title('Counts per locale')
sns.countplot(x=holiday.locale)
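
Following note 2 at the top of the notebook, rows flagged as transferred behave more like normal days, while the row with type Transfer marks the date actually celebrated. A minimal sketch of that filtering on a made-up table (`toy_holidays`, its dates, and the `date` column name are assumptions; `type` and `transferred` are the columns named in the note):

```python
import pandas as pd

# Toy holiday table (made-up rows).
toy_holidays = pd.DataFrame({
    "date": pd.to_datetime(["2016-05-24", "2016-05-27", "2016-12-24", "2016-12-25"]),
    "type": ["Holiday", "Transfer", "Additional", "Holiday"],
    "transferred": [True, False, False, False],
})

# Drop holidays that were moved away; the matching Transfer row keeps the celebrated date.
celebrated = toy_holidays[~toy_holidays["transferred"]].reset_index(drop=True)
print(celebrated)
```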