## **Problem Statement**

Given historic sales data, forecast department-wise weekly sales for 45 Walmart stores located in different regions.




In [5]:
# Data analysis libraries
import pandas as pd
import numpy as np

# Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# Statistics
from scipy.stats import pearsonr

# Set the default figure size for Seaborn plots globally
sns.set(rc={'figure.figsize': (10, 6)})
%matplotlib inline

In [9]:
# Read the data
from pathlib import Path

# Directory that holds the four competition CSV files
DATA_DIR = Path('/Users/gopi.gadu/Downloads/Forecasting_Sales/Forecasting_Walmart_weekly_sales/walmart-recruiting-store-sales-forecasting')

features_df = pd.read_csv(DATA_DIR / 'features.csv')
stores_df = pd.read_csv(DATA_DIR / 'stores.csv')
train_df = pd.read_csv(DATA_DIR / 'train.csv')
test_df = pd.read_csv(DATA_DIR / 'test.csv')


### Train data 

The historical training data covers 2010-02-05 to 2012-11-01 and has the following fields:

- Store - the store number (1 to 45)
- Dept - the department number (1 to 99); not every store has every department
- Date - the date identifying each week (the dates in the data, such as 2010-02-05, are Fridays)
- Weekly_Sales - sales for the given department in the given store during that week
- IsHoliday - whether the week is a special holiday week
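Since not every store carries every department, it can help to count the departments observed per store and list the ones that are absent. The following is a minimal sketch on a hypothetical toy frame (the values are made up for illustration and are not taken from the dataset). On the real data the same calls would be applied to `train_df`.

```python
import pandas as pd

# Hypothetical toy frame for illustration only; not taken from the dataset.
toy = pd.DataFrame({
    "Store": [1, 1, 1, 2, 2],
    "Dept": [1, 2, 3, 1, 3],
})

# Number of distinct departments observed in each store.
depts_per_store = toy.groupby("Store")["Dept"].nunique()

# Departments that appear somewhere in the data but are absent from a given store.
all_depts = set(toy["Dept"])
missing = toy.groupby("Store")["Dept"].apply(lambda d: sorted(all_depts - set(d)))

print(depts_per_store)
print(missing)
```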

In [11]:
train_df.head()


| | Store | Dept | Date | Weekly_Sales | IsHoliday |
|---|---|---|---|---|---|
| 0 | 1 | 1 | 2010-02-05 | 24924.5 | False |
| 1 | 1 | 1 | 2010-02-12 | 46039.49 | True |
| 2 | 1 | 1 | 2010-02-19 | 41595.55 | False |
| 3 | 1 | 1 | 2010-02-26 | 19403.54 | False |
| 4 | 1 | 1 | 2010-03-05 | 21827.9 | False |


In [12]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 421570 entries, 0 to 421569
Data columns (total 5 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   Store         421570 non-null  int64  
 1   Dept          421570 non-null  int64  
 2   Date          421570 non-null  object 
 3   Weekly_Sales  421570 non-null  float64
 4   IsHoliday     421570 non-null  bool   
dtypes: bool(1), float64(1), int64(2), object(1)
memory usage: 13.3+ MB


- There are no null values in the dataset.
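`info()` reports `Date` with dtype `object`, which means the column holds plain strings. Below is a minimal sketch of parsing it into datetimes and checking the weekday and weekly spacing. It uses only the five rows shown by `train_df.head()`, re-typed so the snippet runs on its own. The name `sample` is an assumption for illustration. On the full data the same calls would be applied to `train_df`.

```python
import pandas as pd

# The five rows shown by train_df.head(), re-typed here so the sketch runs on its own.
sample = pd.DataFrame({
    "Store": [1, 1, 1, 1, 1],
    "Dept": [1, 1, 1, 1, 1],
    "Date": ["2010-02-05", "2010-02-12", "2010-02-19", "2010-02-26", "2010-03-05"],
    "Weekly_Sales": [24924.50, 46039.49, 41595.55, 19403.54, 21827.90],
    "IsHoliday": [False, True, False, False, False],
})

# Parse the string dates; an explicit format avoids ambiguous parsing.
sample["Date"] = pd.to_datetime(sample["Date"], format="%Y-%m-%d")

# Inspect which weekday each date falls on and the gap between consecutive weeks.
weekday_names = sample["Date"].dt.day_name()
week_gaps = sample["Date"].diff().dropna().dt.days

print(weekday_names.unique())
print(week_gaps.unique())
```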

In [13]:
train_df.describe()


| | Store | Dept | Weekly_Sales |
|---|---|---|---|
| count | 421570.0 | 421570.0 | 421570.0 |
| mean | 22.200546 | 44.260317 | 15981.258123 |
| std | 12.785297 | 30.492054 | 22711.183519 |
| min | 1.0 | 1.0 | -4988.94 |
| 25% | 11.0 | 18.0 | 2079.65 |
| 50% | 22.0 | 37.0 | 7612.03 |
| 75% | 33.0 | 74.0 | 20205.8525 |
| max | 45.0 | 99.0 | 693099.36 |


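- There are 45 stores, and department numbers range from 1 to 99.
- Our target variable, Weekly_Sales, is the sales total for the week identified by Date (the dates in the data fall on Fridays).
- The sales column contains negative values (minimum -4988.94) and clear outliers: the maximum of 693099.36 is more than 30 times the 75th percentile (20205.85).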
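One way to quantify these observations is to count negative sales and flag values above an upper fence of Q3 + 1.5 × IQR. The 1.5 × IQR rule is a common convention assumed here, not something the notebook prescribes. With the quartiles reported by `describe()`, that fence would be 20205.8525 + 1.5 × (20205.8525 − 2079.65) ≈ 47395.16, far below the maximum of 693099.36. The sketch below uses a hypothetical helper, `summarize_sales_anomalies`, and toy values. Only the minimum and maximum come from the output above.

```python
import pandas as pd

def summarize_sales_anomalies(sales: pd.Series, k: float = 1.5) -> dict:
    """Count negative values and values above the upper IQR fence (Q3 + k * IQR)."""
    q1, q3 = sales.quantile([0.25, 0.75])
    upper_fence = q3 + k * (q3 - q1)
    return {
        "n_negative": int((sales < 0).sum()),
        "upper_fence": float(upper_fence),
        "n_above_fence": int((sales > upper_fence).sum()),
    }

# Toy values for illustration only; apart from the min and max reported by
# describe(), they are not taken from the dataset.
toy_sales = pd.Series([-4988.94, 2000.0, 5000.0, 7600.0, 12000.0, 20000.0, 693099.36])

result = summarize_sales_anomalies(toy_sales)
print(result)
```

On the real data the call would be `summarize_sales_anomalies(train_df["Weekly_Sales"])`. Whether negative weeks should be kept, clipped, or dropped is a modeling decision that this sketch does not make.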

### Test Data

This dataset has the same fields as the train dataset, except for Weekly_Sales, which is the target variable to predict.

In [14]:
test_df.head()

| | Store | Dept | Date | IsHoliday |
|---|---|---|---|---|
| 0 | 1 | 1 | 2012-11-02 | False |
| 1 | 1 | 1 | 2012-11-09 | False |
| 2 | 1 | 1 | 2012-11-16 | False |
| 3 | 1 | 1 | 2012-11-23 | True |
| 4 | 1 | 1 | 2012-11-30 | False |


In [15]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 115064 entries, 0 to 115063
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype 
---  ------     --------------   ----- 
 0   Store      115064 non-null  int64 
 1   Dept       115064 non-null  int64 
 2   Date       115064 non-null  object
 3   IsHoliday  115064 non-null  bool  
dtypes: bool(1), int64(2), object(1)
memory usage: 2.7+ MB


- There are no null values in the test set.
- We have to predict weekly sales for a period of 39 weeks (2012-11-02 to 2013-07-26, inclusive).
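The 39-week horizon can be checked by generating weekly dates between the two endpoints stated above. This is a small sketch that assumes the test weeks are evenly spaced seven days apart. The name `forecast_weeks` is illustrative.

```python
import pandas as pd

# Weekly dates from the first to the last test date stated above (both inclusive).
forecast_weeks = pd.date_range(start="2012-11-02", end="2013-07-26", freq="7D")

print(len(forecast_weeks))  # 39
```

On the loaded data, `test_df["Date"].nunique()` gives the same kind of count directly from the file.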

### **Stores data**

This dataset contains anonymized information about the 45 stores, including the type and size of each store.

In [16]:
stores_df.head()

| | Store | Type | Size |
|---|---|---|---|
| 0 | 1 | A | 151315 |
| 1 | 2 | A | 202307 |
| 2 | 3 | B | 37392 |
| 3 | 4 | A | 205863 |
| 4 | 5 | B | 34875 |


In [17]:
stores_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45 entries, 0 to 44
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Store   45 non-null     int64 
 1   Type    45 non-null     object
 2   Size    45 non-null     int64 
dtypes: int64(2), object(1)
memory usage: 1.2+ KB
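Because the stores table has exactly one row per store, its `Type` and `Size` columns can be attached to the sales rows with a join on `Store`. The following is a minimal sketch of one way to do that. It is not necessarily how the rest of the notebook proceeds. The frames `stores_sample` and `train_sample` re-type the first rows shown above so the snippet runs on its own.

```python
import pandas as pd

# First rows of each table as shown above, re-typed so the sketch is self-contained.
stores_sample = pd.DataFrame({
    "Store": [1, 2, 3, 4, 5],
    "Type": ["A", "A", "B", "A", "B"],
    "Size": [151315, 202307, 37392, 205863, 34875],
})
train_sample = pd.DataFrame({
    "Store": [1, 1, 1],
    "Dept": [1, 1, 1],
    "Date": ["2010-02-05", "2010-02-12", "2010-02-19"],
    "Weekly_Sales": [24924.50, 46039.49, 41595.55],
})

# A left join keeps every sales row; validate raises an error if Store
# is duplicated in stores_sample.
merged = train_sample.merge(stores_sample, on="Store", how="left", validate="many_to_one")
print(merged)
```

The `validate="many_to_one"` argument is a cheap safeguard: a duplicated store row would otherwise silently multiply sales rows.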
