# Initial Exploration

In this notebook, we explore the training data: we check for missing values, inspect the categorical variables, and compute basic summary statistics.

In [1]:
import polars as pl
import duckdb as db
import pandas as pd
import numpy as np

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [2]:
train = pd.read_csv("../../data/raw/TRAIN.csv", parse_dates=["Date"])
train

| | ID | Store_id | Store_Type | Location_Type | Region_Code | Date | Holiday | Discount | Orders | Sales |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | T1000001 | 1 | S1 | L3 | R1 | 2018-01-01 | 1 | Yes | 9 | 7011.84 |
| 1 | T1000002 | 253 | S4 | L2 | R1 | 2018-01-01 | 1 | Yes | 60 | 51789.12 |
| 2 | T1000003 | 252 | S3 | L2 | R1 | 2018-01-01 | 1 | Yes | 42 | 36868.20 |
| 3 | T1000004 | 251 | S2 | L3 | R1 | 2018-01-01 | 1 | Yes | 23 | 19715.16 |
| 4 | T1000005 | 250 | S2 | L3 | R4 | 2018-01-01 | 1 | Yes | 62 | 45614.52 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 188335 | T1188336 | 149 | S2 | L3 | R2 | 2019-05-31 | 1 | Yes | 51 | 37272.00 |
| 188336 | T1188337 | 153 | S4 | L2 | R1 | 2019-05-31 | 1 | No | 90 | 54572.64 |
| 188337 | T1188338 | 154 | S1 | L3 | R2 | 2019-05-31 | 1 | No | 56 | 31624.56 |
| 188338 | T1188339 | 155 | S3 | L1 | R2 | 2019-05-31 | 1 | Yes | 70 | 49162.41 |


In [4]:
train.dtypes

ID                       object
Store_id                  int64
Store_Type               object
Location_Type            object
Region_Code              object
Date             datetime64[ns]
Holiday                   int64
Discount                 object
Orders                    int64
Sales                   float64
dtype: object

In [5]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 188340 entries, 0 to 188339
Data columns (total 10 columns):
 #   Column         Non-Null Count   Dtype         
---  ------         --------------   -----         
 0   ID             188340 non-null  object        
 1   Store_id       188340 non-null  int64         
 2   Store_Type     188340 non-null  object        
 3   Location_Type  188340 non-null  object        
 4   Region_Code    188340 non-null  object        
 5   Date           188340 non-null  datetime64[ns]
 6   Holiday        188340 non-null  int64         
 7   Discount       188340 non-null  object        
 8   Orders         188340 non-null  int64         
 9   Sales          188340 non-null  float64       
dtypes: datetime64[ns](1), float64(1), int64(3), object(5)
memory usage: 14.4+ MB
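
The Non-Null Count column above matches the row count for every column, which implies there are no missing values. A more direct check could be sketched as follows. The `sample` frame and its values are made up for illustration and only stand in for `train`.

```python
import pandas as pd

# Toy frame standing in for `train`; the values are invented for illustration.
sample = pd.DataFrame(
    {
        "Store_id": [1, 2, 3],
        "Discount": ["Yes", None, "No"],
        "Sales": [7011.84, 51789.12, None],
    }
)

# Count missing values per column; all zeros would mean there are no gaps.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Single yes/no answer: does any cell hold a missing value?
has_missing = bool(sample.isna().any().any())
print(has_missing)
```

Running `train.isna().sum()` on the real data would be expected to return zero for every column, given the `info()` output above.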


In [6]:
train.describe()

| | Store_id | Date | Holiday | Orders | Sales |
|---|---|---|---|---|---|
| count | 188340.0 | 188340 | 188340.0 | 188340.0 | 188340.0 |
| mean | 183.0 | 2018-09-15 12:00:00.000000256 | 0.131783 | 68.205692 | 42784.327982 |
| min | 1.0 | 2018-01-01 00:00:00 | 0.0 | 0.0 | 0.0 |
| 25% | 92.0 | 2018-05-09 18:00:00 | 0.0 | 48.0 | 30426.0 |
| 50% | 183.0 | 2018-09-15 12:00:00 | 0.0 | 63.0 | 39678.0 |
| 75% | 274.0 | 2019-01-22 06:00:00 | 0.0 | 82.0 | 51909.0 |
| max | 365.0 | 2019-05-31 00:00:00 | 1.0 | 371.0 | 247215.0 |
| std | 105.366308 | | 0.338256 | 30.467415 | 18456.708302 |


<b style='padding: 4px 10px 6px 10px;border-radius: 5px;background: #009688;color: #fff;display: inline-block;'>Observations</b>

- The dataset contains 188,340 records and 10 columns, with no missing values in any column.
- The data spans 2018-01-01 to 2019-05-31.

In [7]:
train["Store_Type"].value_counts()

Store_Type
S1    88752
S4    45924
S2    28896
S3    24768
Name: count, dtype: int64

In [8]:
train["Region_Code"].value_counts()

Region_Code
R1    63984
R2    54180
R3    44376
R4    25800
Name: count, dtype: int64

In [9]:
train["Location_Type"].value_counts()

Location_Type
L1    85140
L2    48504
L3    29928
L5    13932
L4    10836
Name: count, dtype: int64
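
The three `value_counts()` calls above can also be summarised in one pass. The sketch below computes the number of distinct levels and the relative share of each level for a list of categorical columns. The `sample` frame is invented for illustration, and `categorical_columns` is a name introduced here, not one from the notebook.

```python
import pandas as pd

# Toy frame standing in for `train`; the values are invented for illustration.
sample = pd.DataFrame(
    {
        "Store_Type": ["S1", "S1", "S4", "S2"],
        "Region_Code": ["R1", "R2", "R1", "R1"],
        "Location_Type": ["L1", "L1", "L2", "L3"],
    }
)
categorical_columns = ["Store_Type", "Region_Code", "Location_Type"]

# Number of distinct levels per categorical column.
cardinality = sample[categorical_columns].nunique()
print(cardinality)

# Relative frequency of each level, one Series per column.
shares = {
    column: sample[column].value_counts(normalize=True)
    for column in categorical_columns
}
print(shares["Store_Type"])
```

Relative shares make it easier to spot imbalance between levels than raw counts do.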

In [10]:
train["Store_id"].value_counts()

Store_id
1      516
61     516
63     516
64     516
65     516
      ... 
338    516
349    516
350    516
351    516
364    516
Name: count, Length: 365, dtype: int64

<b style='padding: 4px 10px 6px 10px;border-radius: 5px;background: #009688;color: #fff;display: inline-block;'>Observations</b>

- The dataset covers 365 stores, each with 516 records (365 × 516 = 188,340).
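
The identical counts per store suggest a balanced panel, meaning every store has one row for every date. A minimal sketch of how that could be verified is shown below, on a made-up two-store frame rather than on `train` itself. The variable names are illustrative.

```python
import pandas as pd

# Toy panel: 2 stores x 3 consecutive days, invented for illustration.
dates = pd.date_range("2018-01-01", periods=3, freq="D")
sample = pd.DataFrame(
    {
        "Store_id": [1] * 3 + [2] * 3,
        "Date": list(dates) * 2,
    }
)

# Rows per store; a balanced panel has a single distinct count.
rows_per_store = sample.groupby("Store_id").size()
is_balanced = bool(rows_per_store.nunique() == 1)

# True when no (Store_id, Date) pair occurs more than once.
no_duplicates = not sample.duplicated(subset=["Store_id", "Date"]).any()

# Calendar days between the first and last date that never appear in the data.
full_range = pd.date_range(sample["Date"].min(), sample["Date"].max(), freq="D")
missing_dates = full_range.difference(pd.DatetimeIndex(sample["Date"].unique()))

print(is_balanced, no_duplicates, len(missing_dates))
```

If all three checks pass on the real data, each store's series can be treated as a gap-free daily time series.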

In [9]:
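# `train` is a pandas DataFrame, so re-read the CSV with polars before using the polars `filter` API
pl.read_csv("../../data/raw/TRAIN.csv").filter(pl.col("Date") == "2018-01-01")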

| ID | Store_id | Store_Type | Location_Type | Region_Code | Date | Holiday | Discount | Order | Sales |
|---|---|---|---|---|---|---|---|---|---|
| str | i64 | str | str | str | str | i64 | str | i64 | f64 |
| "T1000001" | 1 | "S1" | "L3" | "R1" | "2018-01-01" | 1 | "Yes" | 9 | 7011.84 |
| "T1000002" | 253 | "S4" | "L2" | "R1" | "2018-01-01" | 1 | "Yes" | 60 | 51789.12 |
| "T1000003" | 252 | "S3" | "L2" | "R1" | "2018-01-01" | 1 | "Yes" | 42 | 36868.2 |
| "T1000004" | 251 | "S2" | "L3" | "R1" | "2018-01-01" | 1 | "Yes" | 23 | 19715.16 |
| "T1000005" | 250 | "S2" | "L3" | "R4" | "2018-01-01" | 1 | "Yes" | 62 | 45614.52 |
| … | … | … | … | … | … | … | … | … | … |
| "T1000361" | 359 | "S2" | "L3" | "R2" | "2018-01-01" | 1 | "Yes" | 55 | 43514.28 |
| "T1000362" | 362 | "S1" | "L3" | "R3" | "2018-01-01" | 1 | "Yes" | 37 | 27770.4 |
| "T1000363" | 363 | "S1" | "L1" | "R2" | "2018-01-01" | 1 | "Yes" | 42 | 29676.24 |
| "T1000364" | 360 | "S2" | "L1" | "R1" | "2018-01-01" | 1 | "Yes" | 28 | 25680.27 |


In [24]:
db.sql("""
    select Region_Code, Location_Type, Store_Type, count(*) as cnt
    from train
    group by Region_Code, Location_Type, Store_Type
    order by Region_Code, Location_Type, Store_Type
""")

┌─────────────┬───────────────┬────────────┬───────┐
│ Region_Code │ Location_Type │ Store_Type │  cnt  │
│   varchar   │    varchar    │  varchar   │ int64 │
├─────────────┼───────────────┼────────────┼───────┤
│ R1          │ L1            │ S1         │  8772 │
│ R1          │ L1            │ S2         │  1032 │
│ R1          │ L1            │ S3         │  5676 │
│ R1          │ L1            │ S4         │ 10836 │
│ R1          │ L2            │ S1         │   516 │
│ R1          │ L2            │ S3         │  2064 │
│ R1          │ L2            │ S4         │ 17028 │
│ R1          │ L3            │ S1         │  6708 │
│ R1          │ L3            │ S2         │   516 │
│ R1          │ L3            │ S3         │  1548 │
│ ·           │ ·             │ ·          │    ·  │
│ ·           │ ·             │ ·          │    ·  │
│ ·           │ ·             │ ·          │    ·  │
│ R4          │ L1            │ S2         │  1032 │
│ R4          │ L1            │ S3         │  

<b style='padding: 4px 10px 6px 10px;border-radius: 5px;background: #009688;color: #fff;display: inline-block;'>Observations</b>

- The stores are spread across 4 regions (R1 to R4). For high-level forecasting, we could build 4 separate models, one per region.
- Alternatively, we could build a single global model for the entire dataset.
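
To make the per-region idea concrete, the sketch below aggregates sales into one daily series per region, which is the kind of input a per-region model might take. It runs on a made-up frame, and the names `regional_daily`, `series_by_region`, and `total_daily` are introduced here for illustration. The overall daily total is only one possible input for a global model; training a single model on all store-level rows would be another.

```python
import pandas as pd

# Toy frame standing in for `train`; the values are invented for illustration.
sample = pd.DataFrame(
    {
        "Region_Code": ["R1", "R1", "R2", "R2"],
        "Date": pd.to_datetime(
            ["2018-01-01", "2018-01-01", "2018-01-01", "2018-01-02"]
        ),
        "Sales": [100.0, 50.0, 30.0, 70.0],
    }
)

# Total sales per region and day.
regional_daily = sample.groupby(["Region_Code", "Date"], as_index=False)["Sales"].sum()

# One date-indexed Series per region, keyed by region code.
series_by_region = {
    region: frame.set_index("Date")["Sales"]
    for region, frame in regional_daily.groupby("Region_Code")
}

# Overall daily total across all regions (one possible global input).
total_daily = sample.groupby("Date")["Sales"].sum()

print(series_by_region["R1"])
print(total_daily)
```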

In [7]:
db.sql("""
select distinct Date, Holiday from train
 """)
db.sql("""
select distinct Date from train
 """)

┌────────────┬─────────┐
│    Date    │ Holiday │
│  varchar   │  int64  │
├────────────┼─────────┤
│ 2018-02-24 │       0 │
│ 2018-04-05 │       0 │
│ 2018-06-08 │       0 │
│ 2018-07-11 │       0 │
│ 2018-08-10 │       0 │
│ 2018-11-17 │       0 │
│ 2018-12-19 │       0 │
│ 2019-01-18 │       0 │
│ 2019-02-21 │       0 │
│ 2019-05-31 │       1 │
│     ·      │       · │
│     ·      │       · │
│     ·      │       · │
│ 2018-11-11 │       0 │
│ 2019-02-17 │       0 │
│ 2019-04-23 │       0 │
│ 2018-05-16 │       0 │
│ 2018-06-18 │       0 │
│ 2018-02-21 │       0 │
│ 2018-07-08 │       0 │
│ 2018-06-28 │       0 │
│ 2019-03-13 │       0 │
│ 2018-09-06 │       0 │
├────────────┴─────────┤
│ 516 rows (20 shown)  │
└──────────────────────┘

┌────────────┐
│    Date    │
│  varchar   │
├────────────┤
│ 2018-04-08 │
│ 2018-06-12 │
│ 2018-07-14 │
│ 2018-08-14 │
│ 2018-10-19 │
│ 2018-10-20 │
│ 2018-12-21 │
│ 2018-12-22 │
│ 2019-03-30 │
│ 2019-04-30 │
│     ·      │
│     ·      │
│     ·      │
│ 2018-07-09 │
│ 2018-09-11 │
│ 2018-12-16 │
│ 2018-02-22 │
│ 2018-02-24 │
│ 2018-11-06 │
│ 2018-01-28 │
│ 2018-11-30 │
│ 2018-07-16 │
│ 2018-04-04 │
├────────────┤
│  516 rows  │
│ (20 shown) │
└────────────┘
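
Both queries return 516 rows, which suggests that each date carries a single `Holiday` value across all stores. The same comparison could be sketched in pandas as follows, on an invented frame. The names `holiday_is_per_date` and `calendar` are illustrative, not from the notebook.

```python
import pandas as pd

# Toy frame standing in for `train`; the values are invented for illustration.
sample = pd.DataFrame(
    {
        "Date": pd.to_datetime(
            ["2018-01-01", "2018-01-01", "2018-01-02", "2018-01-02"]
        ),
        "Holiday": [1, 1, 0, 0],
    }
)

# Compare the number of distinct (Date, Holiday) pairs with the number of dates.
distinct_pairs = len(sample[["Date", "Holiday"]].drop_duplicates())
distinct_dates = sample["Date"].nunique()

# Equal counts mean every date maps to exactly one Holiday value.
holiday_is_per_date = distinct_pairs == distinct_dates

# One row per date: a calendar-level holiday flag.
calendar = sample.groupby("Date")["Holiday"].first()
n_holidays = int(calendar.sum())

print(holiday_is_per_date, n_holidays)
```

When the check holds, `Holiday` can be treated as a calendar feature rather than a store-level one.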

# Report

- The dataset contains 188,340 records with no missing values.
- The dataset covers 365 stores.
- The stores are spread across 4 regions. For high-level forecasting, we could build 4 separate models, one per region.