# Exploratory Data Analysis

In [53]:
import pandas as pd
import numpy as np

import plotly.express as px
import plotly.io as pio

pio.templates.default = "plotly_dark"

import warnings
warnings.filterwarnings("ignore")

### Load the data

In [54]:
df = pd.read_csv('data/train.csv')
df.head()

Unnamed: 0,Date,store,product,number_sold
0,2010-01-01,0,0,801
1,2010-01-02,0,0,810
2,2010-01-03,0,0,818
3,2010-01-04,0,0,796
4,2010-01-05,0,0,808


In [55]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 230090 entries, 0 to 230089
Data columns (total 4 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   Date         230090 non-null  object
 1   store        230090 non-null  int64 
 2   product      230090 non-null  int64 
 3   number_sold  230090 non-null  int64 
dtypes: int64(3), object(1)
memory usage: 7.0+ MB


Note that the `Date` column has the `object` dtype. We'll convert it to `datetime` and set it as the index of the dataframe. This helps with time series plotting as well as with pandas' built-in time-based functions.
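
As a rough illustration of what a `DatetimeIndex` enables, here is a minimal sketch on a made-up daily series (the values and the names `sales`, `january` and `monthly_total` are assumptions for illustration, not part of `train.csv`):

```python
import pandas as pd

# Made-up daily series, for illustration only
idx = pd.date_range("2010-01-01", "2010-03-31", freq="D")
sales = pd.Series(range(len(idx)), index=idx, name="number_sold")

# Partial-string indexing: select every row in January 2010
january = sales.loc["2010-01"]

# Resampling: aggregate the daily values into month-start bins
monthly_total = sales.resample("MS").sum()

print(january.shape, monthly_total.shape)
```

Neither operation is available while the dates are stored as plain strings.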

We'll also convert the `store` and `product` columns from `int64` to `category`. It's good practice to store unique identifiers as the `category` dtype: it reduces memory usage and it helps prevent us from accidentally having our model learn from the numeric values of `store` and `product` themselves. For example, a product ID of 2 does not inherently mean it will sell twice as much as a product ID of 1.

In [56]:
df['Date'] = pd.to_datetime(df['Date'])
df.set_index('Date', inplace=True)

df['store'] = df['store'].astype('category')
df['product'] = df['product'].astype('category')

df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 230090 entries, 2010-01-01 to 2018-12-31
Data columns (total 3 columns):
 #   Column       Non-Null Count   Dtype   
---  ------       --------------   -----   
 0   store        230090 non-null  category
 1   product      230090 non-null  category
 2   number_sold  230090 non-null  int64   
dtypes: category(2), int64(1)
memory usage: 4.0 MB
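
To look at the memory effect of the `category` conversion in isolation, here is a minimal sketch on synthetic identifier columns (the frame `demo` and its size are assumptions for illustration, not the real data):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the identifier columns (assumed sizes, not the real data)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "store": rng.integers(0, 7, size=100_000),
    "product": rng.integers(0, 10, size=100_000),
})

# Memory used by the integer columns, excluding the index
bytes_int = demo.memory_usage(deep=True, index=False).sum()

# Same data stored as categories: small integer codes plus a short list of labels
demo_cat = demo.astype("category")
bytes_cat = demo_cat.memory_usage(deep=True, index=False).sum()

print(f"int64: {bytes_int:,} bytes, category: {bytes_cat:,} bytes")
```

The saving comes from the low number of distinct IDs; a column with mostly unique values would not benefit in the same way.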


### Visualizing the data

In [57]:
# Example of what a time series would look like
swatch = px.colors.qualitative.Set2

for _ in range(3):
    random_store, random_product = df['store'].sample(1).values[0],df['product'].sample(1).values[0]

    # Filter the data to only include the store and product
    df_ = df[(df['store'] == random_store) & (df['product'] == random_product)]

    # Plot the data
    fig = px.line(df_, 
                  x=df_.index, 
                  y='number_sold', 
                  title=f'Daily Sales for Store {random_store}, Product {random_product}', 
                  labels={'number_sold': 'Units sold'})
    
    fig.update_layout(height=300,showlegend=False)
    fig.update_traces(line=dict(color=np.random.choice(swatch, 1)[0]))  # Change the color of the line
    
    fig.show()

Each store-product combination follows a different trajectory, and the series show both a trend and seasonality.
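
One rough way to look at trend and seasonality separately is a rolling mean plus a month-of-year average of what is left. The sketch below is an assumption-laden illustration, not a method used in this notebook: `trend_and_seasonality` is a hypothetical helper, the 365-day window assumes a yearly cycle, and the series is synthetic.

```python
import numpy as np
import pandas as pd

def trend_and_seasonality(series: pd.Series, window: int = 365):
    """Split a daily series into a rolling-mean trend and a month-of-year profile."""
    # A centered rolling mean over about one year smooths out a yearly cycle
    trend = series.rolling(window=window, center=True, min_periods=window // 2).mean()
    # The average detrended value per calendar month approximates the seasonal pattern
    detrended = series - trend
    seasonal = detrended.groupby(detrended.index.month).mean()
    return trend, seasonal

# Synthetic example: linear trend plus a yearly sine wave (made-up data)
idx = pd.date_range("2010-01-01", "2013-12-31", freq="D")
t = np.arange(len(idx))
y = pd.Series(800 + 0.05 * t + 20 * np.sin(2 * np.pi * t / 365.25), index=idx)

trend, seasonal = trend_and_seasonality(y)
print(seasonal.round(1))
```

On the real data, `y` would be the `number_sold` column of one filtered store-product frame such as `df_`.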

There are 7 stores and 10 products.
We can create a time series for each store and product combination. This would result in 70 different time series associated with every unique store-product combination.

We can also create a time series for each store, which would be an aggregation of all products sold in that store. This would result in 7 time series.
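
The two ways of slicing the data can be sketched as follows. This is a minimal example on a small synthetic frame with the same column layout; `split_into_series` is a hypothetical helper, and the 2 stores and 3 products are made up to keep the example short.

```python
import pandas as pd

def split_into_series(frame: pd.DataFrame, keys):
    """Return {group key: number_sold Series} for each observed key combination."""
    grouped = frame.groupby(keys, observed=True)["number_sold"]
    return {key: series for key, series in grouped}

# Synthetic frame: 5 days x 2 stores x 3 products, indexed by Date
dates = pd.date_range("2010-01-01", periods=5, freq="D")
index = pd.MultiIndex.from_product(
    [dates, range(2), range(3)], names=["Date", "store", "product"]
)
demo = pd.DataFrame({"number_sold": range(len(index))}, index=index)
demo = demo.reset_index(["store", "product"])
demo = demo.astype({"store": "category", "product": "category"})

# One series per store-product combination
per_pair = split_into_series(demo, ["store", "product"])

# One aggregated daily series per store (summing over products)
per_store_daily = demo.groupby(["store", "Date"], observed=True)["number_sold"].sum()

print(len(per_pair), per_store_daily.index.get_level_values("store").nunique())
```

Applied to `df`, the same calls would be expected to give 70 store-product series and 7 store-level series.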

In [58]:
df_store_groupedSales = df.groupby(['store', 'Date'])['number_sold'].sum()

px.line(df_store_groupedSales.reset_index(), x='Date', y='number_sold', color='store', title='Daily Sales per Store')

In [85]:
fig = px.box(df, 
             x='store', 
             y='number_sold', 
             color = 'product',
             title='Distribution of Units Sold by Store and Product')
fig.show()

# Conclusion

We performed a basic EDA of the data to see how the sales data is distributed. As expected, the sales distribution differs for each store and product. One incidental takeaway is that store 3 underperforms the other stores in sales of every product.

This is important to keep in mind when building a model to predict sales.
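
A store-level observation like this can also be checked numerically instead of being read off the box plot. The sketch below assumes the same column names; `lowest_store_per_product` is a hypothetical helper and the numbers are synthetic.

```python
import pandas as pd

def lowest_store_per_product(frame: pd.DataFrame) -> pd.Series:
    """For each product, return the store with the lowest median number_sold."""
    medians = (
        frame.groupby(["product", "store"], observed=True)["number_sold"]
        .median()
        .unstack("store")  # rows: product, columns: store
    )
    return medians.idxmin(axis=1)

# Synthetic data in which store 1 sells less of both products
demo = pd.DataFrame({
    "store":       [0, 0, 1, 1, 0, 0, 1, 1],
    "product":     [0, 0, 0, 0, 1, 1, 1, 1],
    "number_sold": [10, 12, 5, 7, 20, 22, 9, 11],
})

worst = lowest_store_per_product(demo)
print(worst)
```

If a single store comes out lowest for every product on the real data, that supports the takeaway above.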