
You are provided with historical sales data for 45 Walmart stores located in different regions. Each store contains a number of departments, and you are tasked with predicting the department-wide sales for each store.

In addition, Walmart runs several promotional markdown events throughout the year. These markdowns precede prominent holidays, the four largest of which are the Super Bowl, Labor Day, Thanksgiving, and Christmas. The weeks including these holidays are weighted five times higher in the evaluation than non-holiday weeks. Part of the challenge presented by this competition is modeling the effects of markdowns on these holiday weeks in the absence of complete/ideal historical data.
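
The description above says only that holiday weeks carry five times the weight of other weeks; it does not name the metric. As a minimal sketch, assuming the evaluation is a weighted mean absolute error, the weighting could look like this. The function name `wmae` and the toy numbers are illustrative, not part of the dataset.

```python
import numpy as np

def wmae(y_true, y_pred, is_holiday, holiday_weight=5.0):
    """Weighted mean absolute error: holiday weeks count `holiday_weight` times."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Weight is 5 for holiday weeks and 1 for all other weeks.
    weights = np.where(np.asarray(is_holiday, dtype=bool), holiday_weight, 1.0)
    return float(np.sum(weights * np.abs(y_true - y_pred)) / np.sum(weights))

# Toy data (made up): the second week is a holiday week.
actual = [100.0, 200.0, 150.0]
predicted = [110.0, 180.0, 150.0]
holiday = [False, True, False]
score = wmae(actual, predicted, holiday)
print(round(score, 2))
```

Under this assumption, an error of 20 in the holiday week contributes as much as an error of 100 in an ordinary week, which is why holiday weeks deserve extra attention during EDA.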

stores.csv

This file contains anonymized information about the 45 stores, indicating the type and size of store.

train.csv

This is the historical training data, which covers 2010-02-05 to 2012-11-01. Within this file you will find the following fields:

Store - the store number
Dept - the department number
Date - the week
Weekly_Sales - sales for the given department in the given store
IsHoliday - whether the week is a special holiday week

test.csv

This file is identical to train.csv, except we have withheld the weekly sales. You must predict the sales for each triplet of store, department, and date in this file.

features.csv

This file contains additional data related to the store, department, and regional activity for the given dates. It contains the following fields:

Store - the store number
Date - the week
Temperature - average temperature in the region
Fuel_Price - cost of fuel in the region
MarkDown1-5 - anonymized data related to promotional markdowns that Walmart is running. MarkDown data is only available after Nov 2011, and is not available for all stores all the time. Any missing value is marked with an NA.
CPI - the consumer price index
Unemployment - the unemployment rate
IsHoliday - whether the week is a special holiday week

For convenience, the four holidays fall within the following weeks in the dataset (not all holidays are in the data):

Super Bowl: 12-Feb-10, 11-Feb-11, 10-Feb-12, 8-Feb-13
Labor Day: 10-Sep-10, 9-Sep-11, 7-Sep-12, 6-Sep-13
Thanksgiving: 26-Nov-10, 25-Nov-11, 23-Nov-12, 29-Nov-13
Christmas: 31-Dec-10, 30-Dec-11, 28-Dec-12, 27-Dec-13
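
The week dates listed above can be turned into a lookup that labels each row with the holiday it belongs to, which is more informative than the boolean IsHoliday alone. This is a sketch on a toy frame; the column name `Holiday_Name` and the `"None"` label are hypothetical choices, not fields of the dataset.

```python
import pandas as pd

# Week dates copied from the list above, written as ISO strings.
holiday_weeks = {
    "Super Bowl": ["2010-02-12", "2011-02-11", "2012-02-10", "2013-02-08"],
    "Labor Day": ["2010-09-10", "2011-09-09", "2012-09-07", "2013-09-06"],
    "Thanksgiving": ["2010-11-26", "2011-11-25", "2012-11-23", "2013-11-29"],
    "Christmas": ["2010-12-31", "2011-12-30", "2012-12-28", "2013-12-27"],
}

# Invert to a {date string: holiday name} lookup.
date_to_holiday = {
    d: name for name, dates in holiday_weeks.items() for d in dates
}

# Toy frame standing in for the real data.
toy = pd.DataFrame({"Date": pd.to_datetime(["2010-02-05", "2010-02-12", "2011-11-25"])})

# Map via the ISO string form of each date; unmatched weeks get the label "None".
toy["Holiday_Name"] = (
    toy["Date"].dt.strftime("%Y-%m-%d").map(date_to_holiday).fillna("None")
)
print(toy)
```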



1. Load Data & Basic Setup

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load merged train dataset
df = pd.read_csv('/content/drive/MyDrive/AI ML SQL Excel projects/Walmat Sales Forecasting Time series project/Data/walmart-recruiting-store-sales-forecasting/Prcoessed Data/train_combined.csv', parse_dates=['Date'])

# Column overview (non-null counts, dtypes) and shape
df.info()
df.shape



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 421570 entries, 0 to 421569
Data columns (total 16 columns):
 #   Column        Non-Null Count   Dtype         
---  ------        --------------   -----         
 0   Store         421570 non-null  int64         
 1   Dept          421570 non-null  int64         
 2   Date          421570 non-null  datetime64[ns]
 3   Weekly_Sales  421570 non-null  float64       
 4   IsHoliday     421570 non-null  bool          
 5   Temperature   421570 non-null  float64       
 6   Fuel_Price    421570 non-null  float64       
 7   MarkDown1     150681 non-null  float64       
 8   MarkDown2     111248 non-null  float64       
 9   MarkDown3     137091 non-null  float64       
 10  MarkDown4     134967 non-null  float64       
 11  MarkDown5     151432 non-null  float64       
 12  CPI           421570 non-null  float64       
 13  Unemployment  421570 non-null  float64       
 14  Type          421570 non-null  object        
 15  Size          421

(421570, 16)
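
The notebook loads an already merged file, and the merge itself is not shown. One plausible way to build such a table from train.csv, features.csv and stores.csv is sketched below on tiny in-memory frames. The join keys are an assumption based on the field lists above, and the `Type` and `Size` values are made up.

```python
import pandas as pd

# Toy stand-ins for train.csv, features.csv and stores.csv.
train = pd.DataFrame({
    "Store": [1, 1],
    "Dept": [1, 1],
    "Date": pd.to_datetime(["2010-02-05", "2010-02-12"]),
    "Weekly_Sales": [24924.50, 46039.49],
    "IsHoliday": [False, True],
})
features = pd.DataFrame({
    "Store": [1, 1],
    "Date": pd.to_datetime(["2010-02-05", "2010-02-12"]),
    "Temperature": [42.31, 38.51],
    "Fuel_Price": [2.572, 2.548],
    "IsHoliday": [False, True],
})
stores = pd.DataFrame({"Store": [1], "Type": ["A"], "Size": [100000]})  # made-up values

# Left joins keep every training row; validate= raises if a key is duplicated
# on the right-hand side, which would silently multiply rows otherwise.
combined = (
    train
    .merge(features, on=["Store", "Date", "IsHoliday"], how="left", validate="many_to_one")
    .merge(stores, on="Store", how="left", validate="many_to_one")
)
print(combined.shape)
```

Including IsHoliday in the join keys avoids duplicated `IsHoliday_x` / `IsHoliday_y` columns, since the flag appears in both train.csv and features.csv.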

In [None]:
# Basic inspection
print("✅ Dataset loaded!\n")
print("🔹 Shape:", df.shape)
print("\n🔹 Data Types:")
print(df.dtypes)
print("\n🔹 Preview:")
print(df.head())

✅ Dataset loaded!

🔹 Shape: (421570, 16)

🔹 Data Types:
Store                    int64
Dept                     int64
Date            datetime64[ns]
Weekly_Sales           float64
IsHoliday                 bool
Temperature            float64
Fuel_Price             float64
MarkDown1              float64
MarkDown2              float64
MarkDown3              float64
MarkDown4              float64
MarkDown5              float64
CPI                    float64
Unemployment           float64
Type                    object
Size                     int64
dtype: object

🔹 Preview:
   Store  Dept       Date  Weekly_Sales  IsHoliday  Temperature  Fuel_Price  \
0      1     1 2010-02-05      24924.50      False        42.31       2.572   
1      1     1 2010-02-12      46039.49       True        38.51       2.548   
2      1     1 2010-02-19      41595.55      False        39.93       2.514   
3      1     1 2010-02-26      19403.54      False        46.63       2.561   
4      1     1 2010-03-05  

Check Missing Values

In [None]:
# Check for missing values
missing_values = df.isnull().sum()
print("Missing Values in Each Column:\n", missing_values)

# Calculate percentage of missing values
missing_percentage = (missing_values / len(df)) * 100
print("\nMissing Value Percentage:\n", missing_percentage)


Missing Values in Each Column:
 Store                0
Dept                 0
Date                 0
Weekly_Sales         0
IsHoliday            0
Temperature          0
Fuel_Price           0
MarkDown1       270889
MarkDown2       310322
MarkDown3       284479
MarkDown4       286603
MarkDown5       270138
CPI                  0
Unemployment         0
Type                 0
Size                 0
dtype: int64

Missing Value Percentage:
 Store            0.000000
Dept             0.000000
Date             0.000000
Weekly_Sales     0.000000
IsHoliday        0.000000
Temperature      0.000000
Fuel_Price       0.000000
MarkDown1       64.257181
MarkDown2       73.611025
MarkDown3       67.480845
MarkDown4       67.984676
MarkDown5       64.079038
CPI              0.000000
Unemployment     0.000000
Type             0.000000
Size             0.000000
dtype: float64
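
The data description states that MarkDown data is only available after Nov 2011, so most of the missing values above are probably explained by date rather than by random gaps. A sketch of how that could be checked, using a made-up toy frame and an assumed cutoff of 2011-11-01:

```python
import numpy as np
import pandas as pd

# Toy data (made up): markdown values appear only from November 2011 onward.
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2011-09-30", "2011-10-28", "2011-11-11", "2011-12-09"]),
    "MarkDown1": [np.nan, np.nan, 1200.0, np.nan],
})

# Assumed cutoff; the description only says "after Nov 2011".
cutoff = pd.Timestamp("2011-11-01")
period = np.where(toy["Date"] < cutoff, "before", "after")

# Mean of a boolean mask = share of missing values within each period.
missing_share = toy["MarkDown1"].isnull().groupby(period).mean()
print(missing_share)
```

If the share before the cutoff is close to 1.0 on the real data, the gaps are structural, and the remaining gaps after the cutoff are the ones that reflect stores without a recorded markdown.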


Explanation of the code below:
.notnull(): returns True for non-null (non-NaN) values and False for NaN values.

For example, if MarkDown1 = 10.5, MarkDown1.notnull() returns True; if MarkDown1 = NaN, it returns False.

.astype(int): converts the True/False values to 1/0.

True becomes 1, and False becomes 0.
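
The two steps can be tried on a small Series first. The values here are made up; note that a recorded value of 0.0 still yields a flag of 1, because the flag tracks whether a value is present, not whether it is positive.

```python
import numpy as np
import pandas as pd

# Toy Series: one regular value, one missing value, one recorded zero.
s = pd.Series([10.5, np.nan, 0.0], name="MarkDown1")

is_present = s.notnull()       # boolean mask: True, False, True
flags = is_present.astype(int)  # integer flags: 1, 0, 1
print(flags.tolist())
```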

In [None]:
# Create a flag for each MarkDown column (1 if a markdown value is recorded, 0 if it is missing)
for col in ['MarkDown1', 'MarkDown2', 'MarkDown3', 'MarkDown4', 'MarkDown5']:
    df[f'{col}_active'] = df[col].notnull().astype(int)

# Check the new columns
print(df[['MarkDown1', 'MarkDown1_active', 'MarkDown2', 'MarkDown2_active', 'MarkDown3', 'MarkDown3_active']].head())


   MarkDown1  MarkDown1_active  MarkDown2  MarkDown2_active  MarkDown3  \
0        NaN                 0        NaN                 0        NaN   
1        NaN                 0        NaN                 0        NaN   
2        NaN                 0        NaN                 0        NaN   
3        NaN                 0        NaN                 0        NaN   
4        NaN                 0        NaN                 0        NaN   

   MarkDown3_active  
0                 0  
1                 0  
2                 0  
3                 0  
4                 0  


In [None]:
df.sample(5)

Unnamed: 0,Store,Dept,Date,Weekly_Sales,IsHoliday,Temperature,Fuel_Price,MarkDown1,MarkDown2,MarkDown3,...,MarkDown5,CPI,Unemployment,Type,Size,MarkDown1_active,MarkDown2_active,MarkDown3_active,MarkDown4_active,MarkDown5_active
345712,36,93,2011-04-15,26343.69,False,73.25,3.763,,,,...,,214.026217,8.3,A,39910,0,0,0,0,0
163404,17,48,2010-09-24,1791.0,False,58.39,2.872,,,,...,,126.190033,6.697,B,93188,0,0,0,0,0
185804,19,85,2011-09-09,5148.86,True,68.28,3.93,,,,...,,136.274581,7.806,A,203819,0,0,0,0,0
108202,12,3,2010-05-21,10327.94,False,76.2,3.12,,,,...,,126.184387,14.099,B,112238,0,0,0,0,0
216405,22,95,2010-07-09,69672.06,False,79.22,2.806,,,,...,,136.396264,8.433,B,119557,0,0,0,0,0


Handling Missing Values

In [None]:
markdown_cols = ['MarkDown1', 'MarkDown2', 'MarkDown3', 'MarkDown4', 'MarkDown5']
df[markdown_cols] = df[markdown_cols].fillna(0)


In [None]:
df.sample(5)

Unnamed: 0,Store,Dept,Date,Weekly_Sales,IsHoliday,Temperature,Fuel_Price,MarkDown1,MarkDown2,MarkDown3,...,MarkDown5,CPI,Unemployment,Type,Size,MarkDown1_active,MarkDown2_active,MarkDown3_active,MarkDown4_active,MarkDown5_active
155381,16,74,2011-03-04,8936.56,False,31.77,3.232,0.0,0.0,0.0,...,0.0,192.0116,6.614,B,57197,0,0,0,0,0
418926,45,67,2011-04-08,5108.3,False,48.71,3.72,0.0,0.0,0.0,...,0.0,185.363666,8.521,B,118221,0,0,0,0,0
63257,7,32,2012-02-17,8369.43,False,27.03,3.113,5453.54,4918.31,1.88,...,5303.89,196.943271,8.256,B,70713,1,1,1,1,1
138779,15,5,2011-10-07,16981.6,False,51.24,3.775,0.0,0.0,0.0,...,0.0,136.472,7.866,B,123737,0,0,0,0,0
18615,2,82,2010-10-01,20599.14,False,69.24,2.603,0.0,0.0,0.0,...,0.0,211.329874,8.163,A,202307,0,0,0,0,0


In [None]:
missing_values=df.isnull().sum()
print(missing_values)

Store               0
Dept                0
Date                0
Weekly_Sales        0
IsHoliday           0
Temperature         0
Fuel_Price          0
MarkDown1           0
MarkDown2           0
MarkDown3           0
MarkDown4           0
MarkDown5           0
CPI                 0
Unemployment        0
Type                0
Size                0
MarkDown1_active    0
MarkDown2_active    0
MarkDown3_active    0
MarkDown4_active    0
MarkDown5_active    0
dtype: int64
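
The order of the two steps matters: the `_active` flags were created before the zero-fill, so they still record which rows had no markdown data. A small sketch with made-up values shows what would be lost if the order were reversed.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"MarkDown1": [np.nan, 0.0, 250.0]})

# Flag first, then fill: the flag separates "not recorded" from a recorded 0.
toy["MarkDown1_active"] = toy["MarkDown1"].notnull().astype(int)
toy["MarkDown1"] = toy["MarkDown1"].fillna(0)

# Computing the flag after the fill marks every row as active.
flag_after_fill = toy["MarkDown1"].notnull().astype(int)

print(toy)
print(flag_after_fill.tolist())
```

After the fill, the first two rows both hold 0.0 in `MarkDown1`; only the flag column still tells them apart.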


Descriptive Stats

Transposed for better readability.

In [None]:
df.describe().T


Unnamed: 0,count,mean,min,25%,50%,75%,max,std
Store,421570.0,22.200546,1.0,11.0,22.0,33.0,45.0,12.785297
Dept,421570.0,44.260317,1.0,18.0,37.0,74.0,99.0,30.492054
Date,421570.0,2011-06-18 08:30:31.963375104,2010-02-05 00:00:00,2010-10-08 00:00:00,2011-06-17 00:00:00,2012-02-24 00:00:00,2012-10-26 00:00:00,
Weekly_Sales,421570.0,15981.258123,-4988.94,2079.65,7612.03,20205.8525,693099.36,22711.183519
Temperature,421570.0,60.090059,-2.06,46.68,62.09,74.28,100.14,18.447931
Fuel_Price,421570.0,3.361027,2.472,2.933,3.452,3.738,4.468,0.458515
MarkDown1,421570.0,2590.074819,0.0,0.0,0.0,2809.05,88646.76,6052.385934
MarkDown2,421570.0,879.974298,-265.76,0.0,0.0,2.2,104519.54,5084.538801
MarkDown3,421570.0,468.087665,-29.1,0.0,0.0,4.54,141630.61,5528.873453
MarkDown4,421570.0,1083.132268,0.0,0.0,0.0,425.29,67474.85,3894.529945


In [None]:
df.columns

Index(['Store', 'Dept', 'Date', 'Weekly_Sales', 'IsHoliday', 'Temperature',
       'Fuel_Price', 'MarkDown1', 'MarkDown2', 'MarkDown3', 'MarkDown4',
       'MarkDown5', 'CPI', 'Unemployment', 'Type', 'Size', 'MarkDown1_active',
       'MarkDown2_active', 'MarkDown3_active', 'MarkDown4_active',
       'MarkDown5_active'],
      dtype='object')

Renaming Columns

In [None]:
df.rename(columns={
    'Store': 'store',
    'Dept': 'dept',
    'Date': 'date',
    'Weekly_Sales': 'wk_sales',
    'IsHoliday': 'is_holiday',
    'Temperature': 'temp',
    'Fuel_Price': 'fuel',
    'MarkDown1': 'md1',
    'MarkDown2': 'md2',
    'MarkDown3': 'md3',
    'MarkDown4': 'md4',
    'MarkDown5': 'md5',
    'MarkDown1_active': 'md1_active',
    'MarkDown2_active': 'md2_active',
    'MarkDown3_active': 'md3_active',
    'MarkDown4_active': 'md4_active',
    'MarkDown5_active': 'md5_active',
    'CPI': 'cpi',
    'Unemployment': 'unemp',
    'Type': 'type',
    'Size': 'size'
}, inplace=True)
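
By default, `rename` silently skips keys that do not match any column, so a typo in the mapping goes unnoticed. Passing `errors="raise"` makes pandas report unknown labels instead. A sketch on a toy frame, where the misspelled key `Weekly_Sale` is deliberate:

```python
import pandas as pd

toy = pd.DataFrame({"Store": [1], "Weekly_Sales": [24924.50]})
mapping = {"Store": "store", "Weekly_Sales": "wk_sales", "Weekly_Sale": "typo_col"}

# Default behaviour: the misspelled key is ignored without any warning.
renamed = toy.rename(columns=mapping)

# With errors="raise", pandas raises KeyError for labels not in the frame.
try:
    toy.rename(columns=mapping, errors="raise")
    raised = False
except KeyError:
    raised = True

print(list(renamed.columns), raised)
```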


In [None]:
df.sample(5)

Unnamed: 0,store,dept,date,wk_sales,is_holiday,temp,fuel,md1,md2,md3,...,md5,cpi,unemp,type,size,md1_active,md2_active,md3_active,md4_active,md5_active
111829,12,29,2011-12-09,10809.29,False,42.17,3.644,8374.63,15.85,573.92,...,37581.27,129.855533,12.89,B,112238,1,1,1,1,1
144529,15,54,2011-09-23,58.92,False,59.0,3.899,0.0,0.0,0.0,...,0.0,136.367,7.806,B,123737,0,0,0,0,0
119751,13,17,2010-02-05,22419.66,False,31.53,2.666,0.0,0.0,0.0,...,0.0,126.442065,8.316,A,219622,0,0,0,0,0
229982,24,23,2012-07-06,40085.29,False,77.18,3.646,2920.43,559.3,181.49,...,3889.14,138.229633,8.953,A,203819,1,1,1,1,1
394804,42,46,2011-07-15,5848.39,False,86.01,3.779,0.0,0.0,0.0,...,0.0,129.133839,8.257,C,39690,0,0,0,0,0


Descriptive Statistics
A summary of the numeric features showing basic statistics (mean, std, min, quartiles, max). This gives an idea of the overall distributions.


In [None]:
# Get descriptive statistics for numeric columns (transposed for readability)
print(df.describe().T)



               count                           mean                  min  \
store       421570.0                      22.200546                  1.0   
dept        421570.0                      44.260317                  1.0   
date          421570  2011-06-18 08:30:31.963375104  2010-02-05 00:00:00   
wk_sales    421570.0                   15981.258123             -4988.94   
temp        421570.0                      60.090059                -2.06   
fuel        421570.0                       3.361027                2.472   
md1         421570.0                    2590.074819                  0.0   
md2         421570.0                     879.974298              -265.76   
md3         421570.0                     468.087665                -29.1   
md4         421570.0                    1083.132268                  0.0   
md5         421570.0                    1662.772385                  0.0   
cpi         421570.0                     171.201947              126.064   
unemp       
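
The minimum values above are negative for `wk_sales`, `md2` and `md3`. Before deciding how to treat them, it helps to know how many rows are affected. A sketch on a toy frame, which borrows a few values from the summary but arranges them arbitrarily:

```python
import pandas as pd

# Toy frame (illustrative values only).
toy = pd.DataFrame({
    "wk_sales": [24924.50, -4988.94, 7612.03],
    "md2": [0.0, -265.76, 2.2],
    "md3": [-29.1, 0.0, 4.54],
})

negative_mask = toy < 0
negative_counts = negative_mask.sum()        # rows below zero per column
negative_share = negative_mask.mean() * 100  # same, as a percentage of rows

print(negative_counts)
print(negative_share.round(2))
```

On the real data the same two lines can be applied to `df[['wk_sales', 'md2', 'md3']]`.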

Adjusting display settings in Colab

In [None]:


# Set display formatting options for clearer viewing
pd.options.display.float_format = '{:,.2f}'.format  # Format floats with 2 decimal places
pd.set_option('display.max_rows', None)            # Show all rows
pd.set_option('display.max_columns', None)         # Show all columns
pd.set_option('display.width', 2000)                 # Set display width to 2000 characters

# Now when you run your descriptive statistics, the output should be clearer
print(df.describe().T)


                count                           mean                  min                  25%                  50%                  75%                  max       std
store      421,570.00                          22.20                 1.00                11.00                22.00                33.00                45.00     12.79
dept       421,570.00                          44.26                 1.00                18.00                37.00                74.00                99.00     30.49
date           421570  2011-06-18 08:30:31.963375104  2010-02-05 00:00:00  2010-10-08 00:00:00  2011-06-17 00:00:00  2012-02-24 00:00:00  2012-10-26 00:00:00       NaN
wk_sales   421,570.00                      15,981.26            -4,988.94             2,079.65             7,612.03            20,205.85           693,099.36 22,711.18
temp       421,570.00                          60.09                -2.06                46.68                62.09                74.28               100.14   
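
The settings above change pandas display options for the rest of the session. When the wider format is needed for a single output only, `pd.option_context` applies the options temporarily and restores the previous values afterwards. A sketch on a toy frame:

```python
import pandas as pd

toy = pd.DataFrame({"wk_sales": [15981.258123, 693099.36]})
width_before = pd.get_option("display.width")

# Options set here apply only inside the with-block.
with pd.option_context("display.float_format", "{:,.2f}".format,
                       "display.width", 2000):
    formatted = toy.to_string()

# Outside the block the earlier width is back in effect.
width_after = pd.get_option("display.width")
print(formatted)
```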

In [None]:
missing_values = df.isnull().sum()
print(missing_values[missing_values > 0].sort_values(ascending=False))


Series([], dtype: int64)


Understand MarkDown Usage Patterns
Let's look at how frequently markdowns were used. This can give us insight into promotion strategies and whether some of these markdowns are even worth keeping in the model.

In [None]:
markdown_flags = ['md1_active', 'md2_active', 'md3_active', 'md4_active', 'md5_active']
print(df[markdown_flags].sum().sort_values(ascending=False))


md5_active    151432
md1_active    150681
md3_active    137091
md4_active    134967
md2_active    111248
dtype: int64
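
Raw counts are easier to compare as shares of rows. For example, 150681 of 421570 rows is about 35.7% for md1, which matches 100 minus the missing percentage reported earlier. Since markdowns are only recorded in the later part of the data, the share within that period is often the more useful number. A sketch on made-up data, with an assumed cutoff of 2011-11-01:

```python
import pandas as pd

# Toy frame (made up) using the renamed columns.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2011-10-07", "2011-12-09", "2012-02-17", "2012-07-06"]),
    "md1_active": [0, 1, 1, 1],
    "md2_active": [0, 1, 1, 0],
})
flags = ["md1_active", "md2_active"]

# Mean of a 0/1 flag = fraction of rows where the flag is set.
overall_pct = toy[flags].mean() * 100

# Same share, restricted to weeks on or after the assumed cutoff.
cutoff = pd.Timestamp("2011-11-01")
recent_pct = toy.loc[toy["date"] >= cutoff, flags].mean() * 100

print(overall_pct)
print(recent_pct.round(1))
```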
