<a href="https://colab.research.google.com/github/shreeya09/sales-forecasting/blob/main/Walmart_Sales_Forecasting_(Mid_Course_Assessment_ML_Project).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Walmart Sales Forecasting



##### **Project Type**    - Regression
##### **Contribution**    - Individual

# **Project Summary -**

In today’s competitive retail environment, data-driven decision-making is no longer optional — it is a necessity. This project, Walmart Sales Forecasting, aims to use historical sales data to build machine learning models that can accurately predict future weekly sales. Forecasting sales not only enables better inventory and supply chain management but also ensures optimal staffing, marketing, and operational planning. Walmart, being a retail giant with a complex sales ecosystem across numerous stores and departments, presents a rich and challenging dataset ideal for predictive analytics.

The core goal of this project is to develop a regression model that predicts the weekly sales for Walmart stores using historical data. The dataset contains the store number, the week's date, a holiday flag, and the weekly sales figure, along with temperature, fuel price, CPI, and unemployment rate. In addition to these, temporal features like the month, year, and ISO calendar week are extracted to enrich the feature space and capture seasonal and cyclical trends.

**The project is structured in several phases:**

**Problem Definition:**
The objective is to build a machine learning model that can predict the Weekly_Sales value using relevant features. The problem type is regression, as the output is a continuous numerical value.

**Data Exploration and Cleaning:**
The dataset is loaded and examined for structure, missing values, and duplicate entries. We convert the date column to a datetime object and derive new time-based features such as Month, Week, and Year. The Holiday_Flag field is cast to an integer (0 or 1), and any missing values are forward-filled. Duplicate rows are dropped, and all features are examined for their unique values and distributions.

**Exploratory Data Analysis (EDA):**
This phase involves visualizing and understanding relationships between variables. A line plot tracks total weekly sales over time. Bar charts compare average sales by month and by store, and a box plot contrasts holiday and non-holiday weeks to show the impact of seasonal events. A correlation heatmap and a pair plot explore relationships among numerical features. From these visualizations, patterns like higher sales during holiday seasons, monthly seasonality, and store-wise performance disparities are uncovered.

**Hypothesis Testing:**
Before modeling, we formulate and test three hypotheses:

“Holiday weeks have significantly higher weekly sales than non-holiday weeks.”

“Average weekly sales differ significantly across months.”

“Fuel prices are correlated with weekly sales.”

These hypotheses guide feature selection and model expectations.

**Feature Engineering:**
Important features like store ID, holiday flag, economic indicators (temperature, fuel price, CPI, unemployment), and time features are selected. These features are used to train machine learning models. Splitting the data into training and testing sets allows evaluation on rows the models have not seen.

**Modeling and Evaluation:**
Multiple machine learning models are implemented — Linear Regression, Decision Tree Regressor, and Random Forest Regressor. Each model is evaluated using regression metrics: Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and R-squared. Among these, the Random Forest model outperforms the others with the lowest error rates and highest explanatory power.
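
To make the three metrics concrete, here is a minimal sketch that computes them by hand with NumPy. The numbers are made up for illustration and are not taken from the Walmart data.

```python
import numpy as np

# Hypothetical actual and predicted weekly sales (illustrative values only)
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 320.0, 380.0])

errors = y_true - y_pred
mae = np.mean(np.abs(errors))                     # average absolute miss, in sales units
rmse = np.sqrt(np.mean(errors ** 2))              # penalises large misses more heavily
ss_res = np.sum(errors ** 2)                      # variation left unexplained
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total variation around the mean
r2 = 1 - ss_res / ss_tot                          # unitless share of variance explained

print(f"MAE={mae:.2f}  RMSE={rmse:.2f}  R2={r2:.3f}")
```

RMSE and MAE are in the same units as Weekly_Sales, while R-squared is unitless, so the three should be compared across models rather than against each other.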

# **GitHub Link -**

https://github.com/shreeya09/sales-forecasting

# **Problem Statement**


In the dynamic and competitive landscape of retail, accurately forecasting sales is crucial for effective inventory management, staffing, supply chain optimization, and financial planning. Walmart, as one of the largest retail chains in the world, generates a massive volume of sales data across hundreds of stores and departments. Making sense of this data to forecast future weekly sales is both a challenge and an opportunity.

The goal of this project is to develop a machine learning model that can predict the weekly sales for Walmart stores using historical data. The dataset includes key attributes such as store ID, date, whether the week is a holiday, economic indicators (temperature, fuel price, CPI, and unemployment), and the actual weekly sales figures. These variables provide the foundation for uncovering patterns, seasonality, and trends in sales behavior.

This is a supervised regression problem, where the target variable Weekly_Sales is continuous, and the objective is to minimize prediction errors on unseen data. By leveraging historical sales patterns and other relevant features, we aim to create a predictive model that can assist Walmart in making informed decisions for future weeks.

The solution should:

- Accurately forecast weekly sales based on available data.

- Capture the effects of holidays and seasonal trends.

- Generalize well to new data (stores or weeks not seen during training).

- Provide actionable insights through data analysis and visualization.

Ultimately, this model will help Walmart proactively plan logistics, improve product availability, reduce overstock and stockouts, and deliver a better customer experience through smarter forecasting.

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
# Data handling
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="whitegrid")

# Machine Learning models
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

# Model evaluation
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

# Warnings and Display settings
import warnings
warnings.filterwarnings('ignore')



### Dataset Loading

In [None]:
# Load Dataset
file_path = 'https://raw.githubusercontent.com/shreeya09/sales-forecasting/main/walmart.xlsx'
df = pd.read_excel(file_path)

### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
rows, columns = df.shape
print(f"Number of rows: {rows}")
print(f"Number of columns: {columns}")

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
duplicate_count = df.duplicated().sum()
print(f"Number of duplicate rows: {duplicate_count}")

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
missing_values = df.isnull().sum()
print("Missing/Null values in each column:\n")
print(missing_values)

In [None]:
# Visualizing the missing values
plt.figure(figsize=(10, 6))
sns.heatmap(df.isnull(), cmap='viridis', cbar=False)
plt.title('Missing Values Heatmap')
plt.show()

### What did you know about your dataset?

The Walmart sales dataset contains **6,435 records and 8 columns**, capturing weekly sales performance across 45 stores. Each row represents the sales for a specific store in a particular week.

The key target variable is Weekly_Sales, which records the total sales in dollars. The dataset includes features like Store (store ID), Date (week-ending date), Holiday_Flag (binary indicator for major holidays), Temperature (average temperature), Fuel_Price (fuel cost), CPI (Consumer Price Index), and Unemployment (unemployment rate).

These features provide insights into store operations, economic conditions, and seasonal trends. Importantly, the dataset has no missing values and no duplicate rows, ensuring data quality is high for analysis and modeling.
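
The reported row count is consistent with a balanced panel: 45 stores × 143 weekly dates = 6,435 rows. A minimal sketch of how such a check could look is given below on a toy frame (3 hypothetical stores, 4 weeks); the same expressions should apply to `df`, assuming its `Store` and `Date` columns.

```python
import pandas as pd

# Toy stand-in for the real data: 3 hypothetical stores observed over 4 weeks
toy = pd.DataFrame({
    "Store": [s for s in (1, 2, 3) for _ in range(4)],
    "Date": list(pd.date_range("2010-02-05", periods=4, freq="7D")) * 3,
})

n_stores = toy["Store"].nunique()
n_weeks = toy["Date"].nunique()
weeks_per_store = toy.groupby("Store")["Date"].nunique()
has_duplicate_keys = toy.duplicated(["Store", "Date"]).any()

# A balanced panel has exactly one row per store-week combination
is_balanced = (
    len(toy) == n_stores * n_weeks
    and (weeks_per_store == n_weeks).all()
    and not has_duplicate_keys
)
print(n_stores, n_weeks, len(toy), bool(is_balanced))
```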

## ***2. Understanding Your Variables***

In [None]:
#Dataset Columns
print("Dataset Columns:\n")
print(df.columns.tolist())

In [None]:
# Dataset Describe
df.describe()

### Variables Description


**Store**: This column represents the unique identifier for each Walmart store. There are 45 unique stores in the dataset, indicating that data is collected from 45 different retail locations. This field helps segment sales and performance across geographic and operational variations among the stores.

**Date**: This column contains the week-ending date for each sales record. It is a critical feature for time-series analysis and helps in extracting seasonal trends and identifying holiday effects. There are 143 unique dates, implying the data spans roughly three years of weekly observations.

**Weekly_Sales**: This is the target variable of the project. It represents the total dollar amount of sales generated by a specific store in a particular week. Each value is continuous and unique to the store-week combination. The goal of the machine learning model is to accurately forecast this variable.

**Holiday_Flag**: This binary column indicates whether the week contains a major holiday (e.g., Super Bowl, Labor Day, Thanksgiving, or Christmas). A value of 1 means it's a holiday week, and 0 otherwise. This flag is essential for understanding the impact of holidays on consumer purchasing behavior.

**Temperature**: This field shows the average temperature for that week in the respective store's location. Weather can influence store traffic and shopping patterns, so it’s useful to analyze its correlation with sales.

**Fuel_Price**: This column indicates the cost of fuel during the week, which can affect consumer spending and store visits, particularly in suburban or rural areas. It has 892 unique values, showing moderate variability over time.

**CPI (Consumer Price Index)**: This economic indicator reflects the average change over time in the prices paid by consumers for goods and services. It can impact overall spending capacity and patterns. This column can help capture macroeconomic influences on retail sales.

**Unemployment**: This represents the unemployment rate for the region in that specific week. Higher unemployment might correlate with lower spending. With 349 unique values, this feature may serve as a good economic signal for sales prediction.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
unique_values = df.nunique()
print("Unique values in each column:\n")
print(unique_values)

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# 1. Remove duplicates if any
df = df.drop_duplicates()

# 2. Handle missing values
# (None were found above, but this is a safety net.)
# DataFrame.ffill() replaces the deprecated fillna(method='ffill')
df = df.ffill()

# 3. Convert 'Date' column to datetime
df['Date'] = pd.to_datetime(df['Date'])

# 4. Create time-based features
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Week'] = df['Date'].dt.isocalendar().week.astype(int)  # isocalendar() yields nullable UInt32; cast to plain int

# 5. Ensure 'Holiday_Flag' is integer (for modeling)
df['Holiday_Flag'] = df['Holiday_Flag'].astype(int)

# 6. Re-check data types
print(df.dtypes)

# 7. Final shape and sample
print(f"Cleaned Dataset Shape: {df.shape}")
print(df.head())


### What all manipulations have you done and insights you found?

- Removed any duplicate rows

- Forward-filled missing values (if any)

- Converted the Date column to datetime

- Extracted Year, Month, Week for time-based analysis

- Converted Holiday_Flag to numeric (if it wasn’t already)

**Insights Found So Far**

- Clean and Complete Data: The dataset is free of missing and duplicate values, which means it's reliable for modeling and doesn't require imputation or heavy cleaning.

- Time-Aware Data: Creating new time-based features (like Month, Year, Week) allows for the detection of seasonal trends, such as spikes in sales during holidays or end-of-year periods.

- Holiday Indicator: The binary Holiday_Flag column will be valuable for evaluating sales performance during major holidays, which is common in retail forecasting.

- Economic Context: Columns like Fuel_Price, CPI, and Unemployment introduce macroeconomic factors that could influence purchasing power and consumer behavior.
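
The time-based features can be illustrated on a few standalone dates. This is a small sketch, not the notebook's own cell; the third date is hypothetical and was chosen to show that the ISO week number can belong to the previous ISO year even though the calendar year has already changed.

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(["2010-02-05", "2010-12-31", "2011-01-01"]))
features = pd.DataFrame({
    "Year": dates.dt.year,
    "Month": dates.dt.month,
    # isocalendar() returns a nullable UInt32 column; cast to plain int for modeling
    "Week": dates.dt.isocalendar().week.astype(int),
})
print(features)
```

Here 1 January 2011 gets `Year` 2011 but `Week` 52, because ISO week 1 of 2011 only starts on Monday 3 January. If Year and Week are used together as features, this boundary behaviour is worth keeping in mind.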



## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1 Weekly Sales Over Time

In [None]:
# Chart - 1 visualization code
# Weekly Sales Over Time
df.groupby('Date')['Weekly_Sales'].sum().plot(figsize=(14, 5), title="1. Total Weekly Sales Over Time")
plt.ylabel("Sales ($)")
plt.grid(True)
plt.show()

##### Gained Insights and Business Impact

**Insight**:

The line chart of total weekly sales shows clear fluctuations over time, with noticeable spikes during holiday seasons and dips during off-peak periods.

**Business Impact:**

Helps Walmart anticipate demand surges and prepare by increasing inventory, staff, and marketing during high-sales weeks.

Useful for seasonal planning and understanding year-over-year growth or decline.

#### Chart - 2- Average Sales by Month

In [None]:
# Chart - 2 visualization code
# Average Sales by Month

plt.figure(figsize=(10, 5))
sns.barplot(x='Month', y='Weekly_Sales', data=df, estimator=np.mean)
plt.title("2. Average Weekly Sales by Month")
plt.xlabel("Month")
plt.ylabel("Average Sales ($)")
plt.show()

##### Gained Insights and Business Impact

**Insight**:

This bar chart reveals that certain months consistently outperform others in terms of sales (e.g., November and December often show higher averages due to holidays).

**Business Impact:**

Assists in budgeting and promotional planning based on monthly trends.

Helps optimize supply chain operations by planning restocking and logistics ahead of high-sales months.

Drives strategic campaign timing for sales and discounts.

#### Chart - 3 Holiday vs Non-Holiday Sales

In [None]:
# Chart - 3 visualization code
# Holiday vs Non-Holiday Sales

plt.figure(figsize=(7, 5))
sns.boxplot(x='Holiday_Flag', y='Weekly_Sales', data=df)
plt.title("3. Sales Distribution: Holiday vs Non-Holiday Weeks")
plt.xticks([0, 1], ['Non-Holiday', 'Holiday'])
plt.ylabel("Weekly Sales ($)")
plt.show()

##### Gained Insights and Business Impact

**Insight**:

Sales during holiday weeks are generally higher and more variable than during non-holiday weeks, as shown by the box plot.

**Business Impact:**

Justifies investing in holiday-specific promotions and staffing boosts.

Enables precise forecasting adjustments for special weeks.

Supports holiday-centric stock management to avoid shortages or overstocking.

#### Chart - 4 Average Sales by Store

In [None]:
# Chart - 4 visualization code
# Average Sales by Store

plt.figure(figsize=(12, 5))
avg_sales = df.groupby('Store')['Weekly_Sales'].mean().sort_values()
sns.barplot(x=avg_sales.index, y=avg_sales.values)
plt.title("4. Average Weekly Sales by Store")
plt.xlabel("Store")
plt.ylabel("Average Sales ($)")
plt.xticks(rotation=90)
plt.show()

##### Gained Insights and Business Impact

**Insight**:

There’s significant variation in performance between stores — some consistently outperform others, while a few lag behind.

**Business Impact:**

Identifies high-performing stores to replicate their strategies across the chain.

Flags underperforming stores for audit, retraining, or regional analysis.

Helps in resource allocation, i.e., prioritizing investment and logistics where ROI is highest.

#### Chart - 5 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
plt.figure(figsize=(10, 6))
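# numeric_only=True skips the datetime 'Date' column, which recent pandas versions cannot correlate
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm', fmt=".2f")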
plt.title("5. Feature Correlation Heatmap")
plt.show()

##### Gained Insights and Business Impact

**Insight**:

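The correlation matrix shows how variables like CPI, Fuel Price, and Temperature relate (positively or negatively) to weekly sales; the annotated coefficients give the strength of each relationship. For example, temperature may be only mildly correlated with sales, while CPI and unemployment change slowly from week to week and act more as background economic context.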

**Business Impact:**

Aids in feature selection for machine learning models, improving prediction accuracy.

Reveals external economic drivers that affect consumer spending.

Allows Walmart to consider economic indicators in strategic planning.
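
A heatmap is easier to act on when the correlations with the target are also ranked by absolute strength. The sketch below does this on synthetic data (the column names mirror the dataset, but the values and the built-in temperature effect are invented for illustration).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "Temperature": rng.normal(60, 15, n),
    "Fuel_Price": rng.normal(3.3, 0.4, n),
    "Date": pd.date_range("2010-02-05", periods=n, freq="7D"),
})
# Synthetic target that depends (negatively) on temperature only
toy["Weekly_Sales"] = 1_000_000 - 2_000 * toy["Temperature"] + rng.normal(0, 20_000, n)

# numeric_only=True leaves the datetime column out of the correlation matrix
corr_with_target = (
    toy.corr(numeric_only=True)["Weekly_Sales"]
    .drop("Weekly_Sales")
    .sort_values(key=np.abs, ascending=False)
)
print(corr_with_target)
```

On the real data, replacing `toy` with `df` should give the same kind of ranked list, which can then feed the feature-selection discussion.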

#### Chart - 6 - Pair Plot

In [None]:
# Pair Plot visualization code

# Select important numeric columns
selected_cols = ['Weekly_Sales', 'Temperature', 'Fuel_Price', 'CPI', 'Unemployment']

# Create pair plot
sns.pairplot(df[selected_cols], corner=True)
plt.suptitle("Pair Plot of Key Variables", y=1.02)
plt.show()


##### Gained Insights and Business Impact

Reveals relationships and distribution patterns between multiple variables at once.

Helps identify linear or non-linear trends, clusters, or outliers.

Useful for feature engineering and initial hypothesis testing before model training.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

**Hypothesis 1:**
Holiday weeks have significantly higher weekly sales than non-holiday weeks.
We observed this in the boxplot comparing holiday vs non-holiday sales.

**Hypothesis 2:**
There is a significant difference in average sales across different months.
Monthly bar chart showed seasonality in sales — we’ll test if this is statistically significant.

**Hypothesis 3:**
Fuel prices are correlated with weekly sales.
From the heatmap and pair plot, we suspect there may be a relationship worth testing.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null (H₀): Mean sales during holiday weeks = Mean sales during non-holiday weeks

Alt (H₁): Mean sales during holiday weeks ≠ Mean sales during non-holiday weeks

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy.stats import ttest_ind, f_oneway, pearsonr

holiday_sales = df[df['Holiday_Flag'] == 1]['Weekly_Sales']
non_holiday_sales = df[df['Holiday_Flag'] == 0]['Weekly_Sales']
t_stat, p_val_ttest = ttest_ind(holiday_sales, non_holiday_sales, equal_var=False)
print("Hypothesis 1 - t-test (Holiday vs Non-Holiday Sales):")
print(f"  t-statistic = {t_stat:.3f}, p-value = {p_val_ttest:.5f}")
print("  → Reject H0 if p < 0.05")

##### Which statistical test have you done to obtain P-Value?

Statistical Test Used: Independent Samples t-test (Welch's variant, since the code sets equal_var=False)

##### Why did you choose the specific statistical test?

The t-test is used to compare the means of two independent groups. In this case, the two groups are:

Weekly sales during holiday weeks

Weekly sales during non-holiday weeks

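Since the sales values are continuous and the two groups are distinct and non-overlapping, an independent two-sample t-test is the most suitable choice. Because the two groups need not have equal variances (the box plot suggests holiday weeks are more variable), the code passes equal_var=False, which runs Welch's version of the test. It helps determine if the difference in average sales between the two types of weeks is statistically significant.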
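
Hypothesis 1 is directional (holiday sales are higher), while the test reports a two-sided p-value. The sketch below, on synthetic groups with invented means and spreads, shows one way to derive a one-sided p-value from the two-sided result and to add a rough effect size. The effect-size formula used here (difference in means divided by the root mean of the two sample variances) is an assumption of this sketch, not something the notebook computes.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)
# Synthetic stand-ins: a small holiday group with higher mean and more spread
holiday = rng.normal(1_150_000, 300_000, 60)
regular = rng.normal(1_040_000, 250_000, 600)

# equal_var=False selects Welch's t-test
t_stat, p_two_sided = ttest_ind(holiday, regular, equal_var=False)

# One-sided p-value for H1: holiday mean > regular mean
p_one_sided = p_two_sided / 2 if t_stat > 0 else 1 - p_two_sided / 2

# Rough standardized effect size (Cohen's d style)
avg_sd = np.sqrt((holiday.var(ddof=1) + regular.var(ddof=1)) / 2)
effect_size = (holiday.mean() - regular.mean()) / avg_sd

print(f"t={t_stat:.2f}, two-sided p={p_two_sided:.4f}, "
      f"one-sided p={p_one_sided:.4f}, effect size={effect_size:.2f}")
```

With thousands of rows, even small differences can reach p < 0.05, so the effect size helps judge whether a significant difference is also large enough to matter.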

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null (H₀): Mean weekly sales are the same across all months

Alt (H₁): At least one month has a different mean weekly sales

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

anova_groups = [df[df['Month'] == m]['Weekly_Sales'] for m in df['Month'].unique()]
f_stat, p_val_anova = f_oneway(*anova_groups)

print("Hypothesis 2 - ANOVA (Sales across Months):")
print(f"  F-statistic = {f_stat:.3f}, p-value = {p_val_anova:.5f}")
print("  → Reject H0 if p < 0.05")


##### Which statistical test have you done to obtain P-Value?

Statistical Test Used: One-Way ANOVA (Analysis of Variance)

##### Why did you choose the specific statistical test?

One-way ANOVA is used when comparing the means of more than two independent groups. Here, we are analyzing average sales across 12 months, which are separate groups (January to December).

ANOVA checks whether at least one of these groups has a mean that's significantly different from the others. It's ideal for detecting seasonal variations in sales across the year.
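
One-way ANOVA assumes roughly normal values within each group and similar group variances, which right-skewed sales data may not satisfy. A rank-based Kruskal-Wallis test (`scipy.stats.kruskal`) can serve as a cross-check. The sketch below runs both on synthetic monthly groups; the data and the December uplift are invented for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import f_oneway, kruskal

rng = np.random.default_rng(1)
# Synthetic data: 12 months, 50 observations each, with December shifted upward
toy = pd.DataFrame({
    "Month": np.repeat(np.arange(1, 13), 50),
    "Weekly_Sales": rng.lognormal(mean=13.8, sigma=0.5, size=600),
})
toy.loc[toy["Month"] == 12, "Weekly_Sales"] *= 1.5

# One array of sales per month
groups = [g["Weekly_Sales"].to_numpy() for _, g in toy.groupby("Month")]

f_stat, p_anova = f_oneway(*groups)     # compares group means
h_stat, p_kruskal = kruskal(*groups)    # compares group rank distributions

print(f"ANOVA: F={f_stat:.2f}, p={p_anova:.4f}")
print(f"Kruskal-Wallis: H={h_stat:.2f}, p={p_kruskal:.4f}")
```

If both tests point the same way, the conclusion about monthly seasonality is less likely to be an artifact of skew.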



### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null (H₀): There is no correlation between fuel price and weekly sales

Alt (H₁): There is a statistically significant correlation between fuel price and weekly sales

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
corr_coef, p_val_corr = pearsonr(df['Fuel_Price'], df['Weekly_Sales'])

print("Hypothesis 3 - Pearson Correlation (Fuel Price vs Sales):")
print(f"  Correlation = {corr_coef:.3f}, p-value = {p_val_corr:.5f}")
print("  → Significant if p < 0.05")

##### Which statistical test have you done to obtain P-Value?

Statistical Test Used: Pearson Correlation Coefficient

##### Why did you choose the specific statistical test?

The Pearson correlation measures the linear relationship between two continuous variables — in this case:

Fuel_Price (independent variable)

Weekly_Sales (dependent variable)

It provides a correlation coefficient (ranging from -1 to +1) and a p-value to determine whether the relationship is statistically significant. This test is appropriate because both variables are numerical and assumed to be normally distributed.
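
Pearson's coefficient only captures linear association. As a hedged illustration, the sketch below compares it with Spearman's rank correlation (`scipy.stats.spearmanr`) on synthetic data where the relationship is monotonic but strongly non-linear; the variables are invented and do not reflect the real fuel price data.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(7)
x = rng.uniform(2.5, 4.5, 300)              # synthetic "fuel prices"
y = np.exp(3 * x) + rng.normal(0, 1, 300)   # monotonic but strongly non-linear response

r_pearson, p_pearson = pearsonr(x, y)
rho_spearman, p_spearman = spearmanr(x, y)

print(f"Pearson r={r_pearson:.3f}, Spearman rho={rho_spearman:.3f}")
```

A large gap between the two coefficients suggests a non-linear relationship, in which case a low Pearson value alone should not be read as "no relationship".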

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments
# Optional: Remove extreme sales outliers using IQR
Q1 = df['Weekly_Sales'].quantile(0.25)
Q3 = df['Weekly_Sales'].quantile(0.75)
IQR = Q3 - Q1
df = df[(df['Weekly_Sales'] >= Q1 - 1.5 * IQR) & (df['Weekly_Sales'] <= Q3 + 1.5 * IQR)]


##### What all outlier treatment techniques have you used and why did you use those techniques?

Outlier treatment was considered due to the presence of extreme spikes in weekly sales, particularly during holidays. We used the Interquartile Range (IQR) method: rows whose Weekly_Sales fall more than 1.5 × IQR below Q1 or above Q3 are dropped. As written, the cell above applies this filter to df, so it may also remove genuine holiday peaks; skip the cell if those seasonal patterns should be preserved.
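
An alternative to dropping rows is to flag them, which keeps the holiday peaks available for later inspection. The helper below is a hypothetical sketch (`iqr_bounds` is not part of the notebook) applied to made-up values.

```python
import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) Tukey fences for a numeric series."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical weekly sales with one extreme spike
sales = pd.Series([900, 950, 1000, 1020, 1050, 1100, 1150, 3000], name="Weekly_Sales")
lower, upper = iqr_bounds(sales)

# Flag rows outside the fences instead of removing them
is_outlier = ~sales.between(lower, upper)
print(f"bounds=({lower:.1f}, {upper:.1f}), flagged={int(is_outlier.sum())}")
```

The boolean flag could also be passed to a model as a feature, or used to compare metrics with and without the extreme weeks.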

### 2. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features
# Convert 'Date' to datetime
df['Date'] = pd.to_datetime(df['Date'])

# Extract time-based features
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Week'] = df['Date'].dt.isocalendar().week.astype(int)  # cast nullable UInt32 to plain int

# Encode Holiday flag as binary
df['Holiday_Flag'] = df['Holiday_Flag'].astype(int)

# Optionally: Create 'Season' feature
df['Season'] = df['Month'] % 12 // 3 + 1  # 1: Winter, 2: Spring, 3: Summer, 4: Fall


#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting
# Define feature columns and target variable
features = ['Store', 'Holiday_Flag', 'Temperature', 'Fuel_Price', 'CPI',
            'Unemployment', 'Month', 'Week', 'Year', 'Season']
target = 'Weekly_Sales'

X = df[features]
y = df[target]

##### What all feature selection methods have you used  and why?

For feature selection, we relied primarily on manual selection driven by domain knowledge and insights gained during exploratory data analysis (EDA).

##### Which all features you found important and why?

Features like Store, Holiday_Flag, Temperature, Fuel_Price, CPI, Unemployment, and date-derived variables such as Month, Week, Year, and Season were retained due to their logical relationship with sales behavior. These features are not only interpretable but also supported by visual evidence of their influence on sales trends.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

**Why Transformation May Be Needed:**

The Weekly_Sales values often have:

- A skewed distribution with very high peaks during holidays
- A large scale (in the range of hundreds of thousands or millions), which may affect model stability, especially for linear models or algorithms sensitive to magnitude and variance

**Why Use the following transformation?**

- Reduces right skewness in sales data
- Compresses extreme values, making the distribution closer to normal
- Helps improve performance of models like Linear Regression, Ridge/Lasso, and algorithms assuming Gaussian distribution
- Maintains all values as real and interpretable
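
The log transform only helps if predictions are mapped back to the original units before computing RMSE or MAE. The sketch below shows the `log1p`/`expm1` round trip on made-up values, and then one possible way to automate it with scikit-learn's `TransformedTargetRegressor`. The toy regression data is invented; the notebook itself does not use this wrapper.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Hypothetical right-skewed weekly sales values
sales = np.array([250_000.0, 500_000.0, 1_000_000.0, 3_800_000.0])
log_sales = np.log1p(sales)       # log(1 + x), defined even at x = 0
recovered = np.expm1(log_sales)   # inverse transform: exp(y) - 1
print(np.round(log_sales, 3), np.allclose(recovered, sales))

# Toy regression where the target grows exponentially with the single feature
X_toy = np.arange(1, 9, dtype=float).reshape(-1, 1)
y_toy = 1_000 * np.exp(0.5 * X_toy.ravel())

# The wrapper fits on log1p(y) and applies expm1 to predictions automatically
model = TransformedTargetRegressor(
    regressor=LinearRegression(), func=np.log1p, inverse_func=np.expm1
)
model.fit(X_toy, y_toy)
preds = model.predict(X_toy)      # already back in the original units
print(np.round(preds[:3], 1))
```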

In [None]:
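# Transform Your data
# Note: this adds a log-scaled copy of the target. X and y defined above are unchanged,
# so the models below still train on raw Weekly_Sales unless y is redefined.
df['Weekly_Sales_log'] = np.log1p(df['Weekly_Sales'])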

### 6. Data Scaling

In [None]:
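# Scaling your data
from sklearn.preprocessing import StandardScaler

# Note: fitting on all of X is for illustration only. The modeling cells below
# re-fit the scaler on the training split so test statistics do not leak in.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)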


##### Which method have you used to scale your data and why?

The dataset was scaled using StandardScaler from scikit-learn, which standardizes features by removing the mean and scaling to unit variance. This matters most for regularized linear models such as Ridge, whose penalty depends on feature magnitude, and it makes linear regression coefficients comparable across features.
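
One way to make sure the scaler only ever sees training rows is to wrap it in a scikit-learn pipeline. This is a minimal sketch on synthetic features with invented coefficients, not the notebook's own modeling code.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Three synthetic features on very different scales
X = rng.normal(loc=[50, 3, 200], scale=[15, 0.5, 30], size=(500, 3))
y = 1_000 * X[:, 0] - 50_000 * X[:, 1] + rng.normal(0, 5_000, 500)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# The pipeline fits the scaler on the training rows only, then reuses those
# statistics when transforming the test rows inside predict/score
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_train, y_train)

scaler = model.named_steps["standardscaler"]
print("Scaler means (train only):", np.round(scaler.mean_, 2))
print("Test R2:", round(model.score(X_test, y_test), 3))
```

The same pipeline object can be handed to `GridSearchCV`, in which case the scaler is re-fit inside every cross-validation fold.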

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

We did not apply dimensionality reduction, as the number of features was relatively small and all variables were meaningful and relevant. Introducing dimensionality reduction like PCA would risk losing interpretability without a significant gain in performance.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


##### What data splitting ratio have you used and why?

For data splitting, we used an 80/20 train-test ratio, a widely accepted standard that balances training performance with the ability to evaluate generalization on unseen data. This split ensures enough data for learning while holding back sufficient data for reliable testing of model performance.
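
Because the rows are weekly observations, a random split lets the model train on weeks that come after some test weeks. A chronological split is a common alternative for forecasting problems. The sketch below shows one on a toy frame with invented values; it is offered as an option, not as the split this notebook uses.

```python
import pandas as pd

# Toy weekly frame standing in for the real data (hypothetical values)
toy = pd.DataFrame({
    "Date": pd.date_range("2010-02-05", periods=10, freq="7D"),
    "Weekly_Sales": range(10),
})

# Use the date at the 80% mark of the sorted unique dates as the cutoff
unique_dates = toy["Date"].drop_duplicates().sort_values()
cutoff = unique_dates.iloc[int(len(unique_dates) * 0.8) - 1]

train = toy[toy["Date"] <= cutoff]   # earliest ~80% of weeks
test = toy[toy["Date"] > cutoff]     # most recent ~20% of weeks

print(len(train), len(test), train["Date"].max() < test["Date"].min())
```

Splitting on unique dates keeps all stores for a given week on the same side of the cutoff.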

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import numpy as np  # Also import numpy

# Scaling for linear regression
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Linear Regression
lr = LinearRegression()
lr.fit(X_train_scaled, y_train)
y_pred_lr = lr.predict(X_test_scaled)

# Correct evaluation without 'squared' argument
mse = mean_squared_error(y_test, y_pred_lr)
rmse = np.sqrt(mse)  # Take square root manually for RMSE

print("Linear Regression")
print("RMSE:", rmse)
print("MAE:", mean_absolute_error(y_test, y_pred_lr))
print("R2 Score:", r2_score(y_test, y_pred_lr))


#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
import matplotlib.pyplot as plt

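# Evaluation metric scores for Linear Regression, computed from the predictions above
# (the originally recorded run gave RMSE 519075.10, MAE 431692.64, R² 0.164)
rmse_lr = np.sqrt(mean_squared_error(y_test, y_pred_lr))
mae_lr = mean_absolute_error(y_test, y_pred_lr)
r2_lr = r2_score(y_test, y_pred_lr)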

# Metrics and Values
metrics = ['RMSE', 'MAE', 'R² Score']
scores = [rmse_lr, mae_lr, r2_lr]

# Plotting
plt.figure(figsize=(8, 5))
bars = plt.bar(metrics, scores, color=['skyblue', 'lightgreen', 'salmon'])

# Adding value labels on top
for bar in bars:
    yval = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2.0, yval, f'{yval:.2f}', ha='center', va='bottom')

plt.title('Linear Regression Model Evaluation Metrics')  # emoji removed: default fonts often lack the glyph
plt.ylabel('Score Value')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()


#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Import libraries
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import numpy as np

# Scaling features if not already done
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Define the model
ridge = Ridge()

# Define parameter grid for tuning
param_grid = {
    'alpha': [0.01, 0.1, 1, 10, 50, 100]  # Regularization strength
}

# Setup GridSearchCV
grid_search = GridSearchCV(estimator=ridge, param_grid=param_grid,
                           cv=5, scoring='neg_mean_squared_error', verbose=1)

# Fit the algorithm
grid_search.fit(X_train_scaled, y_train)

# Get the best model
best_ridge = grid_search.best_estimator_

# Predict on the model
y_pred_ridge = best_ridge.predict(X_test_scaled)

# 📊 Evaluate
mse = mean_squared_error(y_test, y_pred_ridge)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, y_pred_ridge)
r2 = r2_score(y_test, y_pred_ridge)

# Print results
print("Best Parameters:", grid_search.best_params_)
print("Best CV Score (Negative MSE):", grid_search.best_score_)
print("\nTuned Ridge Regression Model Performance:")
print(f"RMSE: {rmse:.2f}")
print(f"MAE: {mae:.2f}")
print(f"R² Score: {r2:.4f}")


##### Which hyperparameter optimization technique have you used and why?

We used GridSearchCV for hyperparameter optimization.

GridSearchCV exhaustively searches through a specified grid of hyperparameter values.

It performs k-fold cross-validation (we used 5-fold CV) for each combination of parameters to find the set of parameters that gives the best performance.

It is simple and systematic, and within the defined grid it always finds the combination with the best cross-validated score.

GridSearchCV is ideal when:

The parameter space is small or manageable (like tuning alpha in Ridge Regression).

You want precise, optimal parameters rather than a rough guess.
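
The search results can be inspected through `cv_results_`. The sketch below reruns the same alpha grid on synthetic data from `make_regression` (not the Walmart data) and converts the negated MSE scores into RMSE for readability. Placing the scaler inside a pipeline is a choice of this sketch, made so that scaling is re-fit within each fold.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=300, n_features=8, noise=25.0, random_state=0)

pipe = make_pipeline(StandardScaler(), Ridge())
grid = GridSearchCV(
    pipe,
    {"ridge__alpha": [0.01, 0.1, 1, 10, 50, 100]},
    cv=5,
    scoring="neg_mean_squared_error",
)
grid.fit(X, y)

results = pd.DataFrame(grid.cv_results_)[["param_ridge__alpha", "mean_test_score"]]
# Scores are negated MSE (higher is better), so flip the sign before the square root
results["cv_rmse"] = np.sqrt(-results["mean_test_score"])

print(results.sort_values("cv_rmse"))
print("Best alpha:", grid.best_params_["ridge__alpha"])
```

Looking at the whole table, rather than only `best_params_`, shows whether the score is flat across alphas, which would indicate that regularization strength has little influence.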



##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Yes, we saw an improvement!
After tuning the Ridge Regression model's hyperparameters using GridSearchCV:

RMSE decreased slightly

MAE decreased slightly

R² Score increased a bit, indicating better explanatory power

While the improvement was not drastic (because Ridge is still a linear model and the data is complex), the model's overall fit improved compared to the simple Linear Regression baseline.

### ML Model - 2

In [None]:
# Decision Tree
dt = DecisionTreeRegressor(random_state=42)
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)

print("Decision Tree Results")
mse_dt = mean_squared_error(y_test, y_pred_dt)
rmse_dt = np.sqrt(mse_dt)  # Take square root manually

print("RMSE:", rmse_dt)
print("MAE:", mean_absolute_error(y_test, y_pred_dt))
print("R2 Score:", r2_score(y_test, y_pred_dt))


#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

# Fit the Decision Tree model
dt = DecisionTreeRegressor(random_state=42)
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)

# Calculate evaluation metrics manually
mse_dt = mean_squared_error(y_test, y_pred_dt)
rmse_dt = np.sqrt(mse_dt)
mae_dt = mean_absolute_error(y_test, y_pred_dt)
r2_dt = r2_score(y_test, y_pred_dt)

# Metrics and Values
metrics = ['RMSE', 'MAE', 'R² Score']
scores = [rmse_dt, mae_dt, r2_dt]

# Plotting the bar chart
plt.figure(figsize=(8, 5))
bars = plt.bar(metrics, scores, color=['skyblue', 'lightgreen', 'salmon'])

# Add value labels on top
for bar in bars:
    yval = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2.0, yval, f'{yval:.2f}', ha='center', va='bottom')

# Add titles and labels
plt.title('🌳 Decision Tree Regressor - Evaluation Metrics')
plt.ylabel('Score Value')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()


#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)


# Define parameter grid for Decision Tree
param_grid_dt = {
    'max_depth': [5, 10, 15, 20, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 5]
}

# Initialize Decision Tree Regressor
dt_model = DecisionTreeRegressor(random_state=42)

# Set up GridSearchCV
grid_search_dt = GridSearchCV(estimator=dt_model,
                              param_grid=param_grid_dt,
                              cv=5,                   # 5-Fold Cross-Validation
                              scoring='neg_mean_squared_error',
                              verbose=1)

# Fit the Grid Search to the data
grid_search_dt.fit(X_train, y_train)

# Best model after tuning
best_dt = grid_search_dt.best_estimator_

# Predict on the test set
y_pred_best_dt = best_dt.predict(X_test)

# Evaluate the tuned Decision Tree
mse_best_dt = mean_squared_error(y_test, y_pred_best_dt)
rmse_best_dt = np.sqrt(mse_best_dt)
mae_best_dt = mean_absolute_error(y_test, y_pred_best_dt)
r2_best_dt = r2_score(y_test, y_pred_best_dt)

# Print results
print(" Best Hyperparameters for Decision Tree:", grid_search_dt.best_params_)
print("\nTuned Decision Tree Performance:")
print(f"RMSE: {rmse_best_dt:.2f}")
print(f"MAE: {mae_best_dt:.2f}")
print(f"R² Score: {r2_best_dt:.4f}")


##### Which hyperparameter optimization technique have you used and why?

We used GridSearchCV for hyperparameter optimization on the Decision Tree Regressor.

- GridSearchCV systematically tries every combination of the given hyperparameters.
- It uses k-fold cross-validation (we used 5 folds) to evaluate the performance of each combination.
- It is guaranteed to find the best-performing set of hyperparameters within the defined search space.
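
To make the cost of "every combination" concrete, the sketch below counts the candidates in the Decision Tree grid used above. It only does the arithmetic and trains nothing; the names `param_grid_demo`, `n_combos`, and `n_cv_fits` are illustrative.

```python
from itertools import product

# Same values as the Decision Tree grid searched above
param_grid_demo = {
    'max_depth': [5, 10, 15, 20, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 5],
}
n_folds = 5

# One candidate per combination of one value from each hyperparameter
combos = list(product(*param_grid_demo.values()))
n_combos = len(combos)          # 5 * 3 * 3 = 45
n_cv_fits = n_combos * n_folds  # one fit per candidate per fold = 225

print("Combinations:", n_combos)
print("Cross-validation fits:", n_cv_fits)
print("Total fits including the final refit:", n_cv_fits + 1)
```

The extra fit at the end assumes GridSearchCV's default `refit=True`, which retrains the best candidate on the full training set.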

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Yes, we observed a modest but consistent improvement after hyperparameter tuning:

- RMSE decreased (better predictive accuracy).
- MAE decreased (lower average error).
- R² increased from 0.939 to 0.948, i.e. from 93.9% to 94.8% of variance explained (better explanatory power).
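
One way to put the R² change in perspective is to look at the share of variance left unexplained, 1 − R². The short calculation below uses only the two R² values quoted above; the variable names are illustrative.

```python
r2_before = 0.939
r2_after = 0.948

# Share of variance in weekly sales that the model fails to explain
unexplained_before = 1 - r2_before
unexplained_after = 1 - r2_after

# How much of the previously unexplained variance the tuning removed
relative_reduction = (unexplained_before - unexplained_after) / unexplained_before

print(f"Unexplained variance before tuning: {unexplained_before:.1%}")
print(f"Unexplained variance after tuning:  {unexplained_after:.1%}")
print(f"Relative reduction: {relative_reduction:.1%}")
```

Seen this way, a change of less than one percentage point in R² corresponds to removing roughly 15% of the remaining unexplained variance.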

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

# Random Forest
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)

# Manual RMSE calculation
mse_rf = mean_squared_error(y_test, y_pred_rf)
rmse_rf = np.sqrt(mse_rf)

# Print evaluation metrics
print("Random Forest Results")
print("RMSE:", rmse_rf)
print("MAE:", mean_absolute_error(y_test, y_pred_rf))
print("R2 Score:", r2_score(y_test, y_pred_rf))


#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart


# Evaluation metric scores for Random Forest, computed from the predictions above
# (the values previously hard-coded here were RMSE 113283.34, MAE 61764.26, R² 0.960)
rmse_rf = np.sqrt(mean_squared_error(y_test, y_pred_rf))
mae_rf = mean_absolute_error(y_test, y_pred_rf)
r2_rf = r2_score(y_test, y_pred_rf)

# Metrics and Values
metrics = ['RMSE', 'MAE', 'R² Score']
scores = [rmse_rf, mae_rf, r2_rf]

# Plotting
plt.figure(figsize=(8, 5))
bars = plt.bar(metrics, scores, color=['skyblue', 'lightgreen', 'salmon'])

# Adding value labels on top
for bar in bars:
    yval = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2.0, yval, f'{yval:.2f}', ha='center', va='bottom')

plt.title('🌳 Random Forest Model Evaluation Metrics')
plt.ylabel('Score Value')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()



#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import numpy as np

# Base model
rf = RandomForestRegressor(random_state=42)

# Define hyperparameter space
param_dist_rf = {
    'n_estimators': [100, 150, 200],
    'max_depth': [10, 20, 30, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Randomized Search Setup
random_search_rf = RandomizedSearchCV(estimator=rf,
                                      param_distributions=param_dist_rf,
                                      n_iter=20,             # Search only 20 random combos
                                      cv=3,                  # 3-Fold Cross Validation (faster)
                                      scoring='neg_mean_squared_error',
                                      verbose=1,
                                      n_jobs=-1,
                                      random_state=42)

# Fit the randomized search model
random_search_rf.fit(X_train, y_train)

# Best model
best_rf = random_search_rf.best_estimator_

# Predictions
y_pred_best_rf = best_rf.predict(X_test)

# Evaluation
mse_best_rf = mean_squared_error(y_test, y_pred_best_rf)
rmse_best_rf = np.sqrt(mse_best_rf)
mae_best_rf = mean_absolute_error(y_test, y_pred_best_rf)
r2_best_rf = r2_score(y_test, y_pred_best_rf)

print("Best Hyperparameters for Random Forest:", random_search_rf.best_params_)
print("Tuned Random Forest Performance:")
print(f"RMSE: {rmse_best_rf:.2f}")
print(f"MAE: {mae_best_rf:.2f}")
print(f"R² Score: {r2_best_rf:.4f}")


##### Which hyperparameter optimization technique have you used and why?

We used RandomizedSearchCV for hyperparameter optimization of the Random Forest Regressor model.

Why RandomizedSearchCV?

- GridSearchCV was taking too much time because it tries every combination of parameters (108 combinations × 5-fold CV = 540 model fits).
- RandomizedSearchCV instead samples a limited number of parameter combinations (20 here, with 3-fold CV, so 60 model fits), which makes it much faster.
- It still covers the parameter space efficiently and is usually good enough to find an almost-optimal model, although unlike a full grid search it is not guaranteed to find the best combination.
- It is ideal when the parameter grid is large and time or computation resources are limited.

Thus, RandomizedSearchCV helped us achieve good performance improvements quickly without exhaustively searching the entire parameter grid.
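
The trade-off can be sketched with plain Python. The snippet below enumerates the same Random Forest grid, draws 20 distinct combinations the way a randomized search over list-valued parameters does, and compares the number of fits. The final figure is a purely combinatorial estimate of how likely 20 draws are to include at least one of the best `k` combinations; `k = 5` is an arbitrary illustrative choice, and no models are trained.

```python
import random
from itertools import product
from math import comb

# Same values as the Random Forest search space above
param_space = {
    'n_estimators': [100, 150, 200],
    'max_depth': [10, 20, 30, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}

all_combos = list(product(*param_space.values()))
n_total = len(all_combos)  # 3 * 4 * 3 * 3 = 108

# Draw 20 distinct combinations (sampling without replacement)
random.seed(42)
sampled = random.sample(all_combos, 20)

grid_fits = n_total * 5         # exhaustive search with 5-fold CV = 540
random_fits = len(sampled) * 3  # 20 candidates with 3-fold CV = 60

# Chance that at least one of the 20 draws is among the best k combinations
k = 5
p_hit_top_k = 1 - comb(n_total - k, 20) / comb(n_total, 20)

print("Grid search fits:", grid_fits)
print("Randomized search fits:", random_fits)
print(f"P(at least one of the top {k} is sampled): {p_hit_top_k:.2f}")
```

Raising `n_iter` increases that probability at the cost of more fits, which is the knob to turn when the randomized result looks unstable.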



##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Yes, model performance improved after applying RandomizedSearchCV for hyperparameter tuning on the Random Forest Regressor.

Before tuning:

- The Random Forest model used default parameters.
- Although the model performed well, it was not fully optimized for the data patterns and complexity.

After tuning:

- Using RandomizedSearchCV, we found better hyperparameters (e.g., optimized max_depth, n_estimators, min_samples_split, and min_samples_leaf).
- The model's predictive performance on the test set improved; the exact RMSE, MAE, and R² values are printed by the tuning cell above.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

In the Walmart Sales Forecasting project, we consistently used three key evaluation metrics for all three machine learning models (Linear Regression, Decision Tree, Random Forest) to ensure positive business impact:


| Metric | Purpose | Business Relevance |
|---|---|---|
| RMSE (Root Mean Squared Error) | Measures overall prediction error magnitude | Penalizes large errors heavily, critical for forecasting large sales numbers. |
| MAE (Mean Absolute Error) | Measures average prediction error size | Easier to interpret in dollars, directly usable for planning inventory and revenue. |
| R² Score (Coefficient of Determination) | Measures how much variance in sales is explained by the model | Ensures good model fit, so management can trust forecasts for decision making. |
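
For reference, the three metrics can be written out by hand. The sketch below is a plain-Python equivalent of what `mean_squared_error`, `mean_absolute_error`, and `r2_score` compute; the helper name `regression_metrics` and the sales figures are made up for illustration.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (RMSE, MAE, R²) for two equal-length sequences of numbers."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e ** 2 for e in errors)              # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(e) for e in errors) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return rmse, mae, r2

# Made-up weekly sales figures in dollars, purely for illustration
actual = [1000.0, 1200.0, 900.0, 1500.0]
predicted = [1100.0, 1150.0, 950.0, 1300.0]

rmse, mae, r2 = regression_metrics(actual, predicted)
print(f"RMSE: {rmse:.2f}")
print(f"MAE:  {mae:.2f}")
print(f"R²:   {r2:.4f}")
```

In this toy example the RMSE comes out higher than the MAE because the single 200-dollar miss is squared before averaging, which is the "penalizes large errors" behavior described in the table.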

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

After evaluating all three machine learning models—Linear Regression, Decision Tree Regressor, and Random Forest Regressor—we selected the **Random Forest Regressor** as our final prediction model. This decision was based on its superior performance across all key evaluation metrics.

Compared to the other models, Random Forest achieved the lowest Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE), along with the highest R² score, indicating it captured the most variance in weekly sales. Unlike Linear Regression, which underperformed due to its inability to model complex patterns, and Decision Trees, which risk overfitting, Random Forest provided a robust balance of accuracy and generalization by averaging the outputs of multiple decision trees.

It also effectively handled non-linear relationships, seasonal variations, and holiday impacts in the dataset. Additionally, after applying RandomizedSearchCV for hyperparameter tuning, the model's performance improved further, making it more reliable for real-world forecasting. By choosing Random Forest, we ensure Walmart benefits from more precise sales forecasts, leading to better inventory planning, supply chain optimization, and overall operational efficiency.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

For this project, we used the Random Forest Regressor as our final prediction model. Random Forest is an ensemble learning algorithm that builds multiple decision trees during training and outputs the average of the predictions from all the trees. This approach increases accuracy and reduces the risk of overfitting compared to a single decision tree. One of the key advantages of Random Forest is its ability to provide insights into feature importance, which helps us understand which variables are most influential in predicting Walmart's weekly sales.
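
The averaging step can be checked directly. The sketch below fits a small forest on synthetic data and confirms that the forest's prediction equals the mean of its individual trees' predictions. The data and the names `forest`, `per_tree`, and `manual_average` are illustrative assumptions, not part of the notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic non-linear data (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(300, 2))
y_demo = (np.sin(X_demo[:, 0]) + 0.5 * X_demo[:, 1] ** 2
          + rng.normal(scale=0.1, size=300))

forest = RandomForestRegressor(n_estimators=25, random_state=42)
forest.fit(X_demo, y_demo)

# Predict with each fitted tree separately, then average across trees
X_new = rng.uniform(-3, 3, size=(5, 2))
per_tree = np.array([tree.predict(X_new) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

forest_prediction = forest.predict(X_new)
print("Matches forest.predict:", np.allclose(manual_average, forest_prediction))
```

The spread of `per_tree` along its first axis also gives a rough, informal sense of how much the individual trees disagree on a given prediction.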

To explain the model’s behavior, we used feature importance analysis, which is built into the Random Forest model itself. This technique ranks features based on how much they decrease the impurity (or error) across all trees in the forest. In our analysis, the most important features turned out to be 'Store', 'CPI' (Consumer Price Index), 'Unemployment', and 'Holiday_Flag'. This indicates that both economic conditions and calendar effects significantly influence sales. Additionally, time-based features like 'Month' and 'Week' also showed meaningful contributions, highlighting the importance of seasonal patterns and recurring sales trends.

For more detailed explainability, we can optionally use advanced tools like SHAP (SHapley Additive exPlanations) or LIME to visualize and interpret the model’s predictions at a local and global level. These tools help decompose predictions into individual feature contributions, making the model more transparent and business-friendly. However, for this project, Random Forest’s built-in feature importance provided a sufficient and interpretable way to understand how different factors drive Walmart's weekly sales predictions.

In [None]:


# Train Random Forest model
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Feature importances
importances = rf.feature_importances_
feature_importance_df = pd.DataFrame({
    'Feature': features,
    'Importance': importances
}).sort_values(by='Importance', ascending=False)

# Plot
plt.figure(figsize=(10, 6))
plt.barh(feature_importance_df['Feature'], feature_importance_df['Importance'], color='skyblue')
plt.gca().invert_yaxis()
plt.title("Feature Importance - Random Forest Regressor")
plt.xlabel("Importance Score")
plt.tight_layout()
plt.show()

# Display the data table
feature_importance_df.reset_index(drop=True)


**Insights**:

- Store has by far the highest impact, showing that store-level differences dominate sales behavior.
- CPI and Unemployment are strong economic indicators, directly tied to customer spending.
- Week contributes to capturing seasonal and promotional patterns.
- Time-based variables like Month, Year, and Season have minor contributions individually, but still help model seasonality.
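
Impurity-based importances can favor features with many distinct values, so a complementary check is permutation importance, which measures how much a held-out score drops when one feature's values are shuffled. This is a different technique from the built-in importances used above. The sketch below runs it on synthetic data, and the feature names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: the first column matters most, the third is pure noise
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(400, 3))
y_demo = 4.0 * X_demo[:, 0] + 1.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=400)
feature_names = ['strong', 'weak', 'noise']  # hypothetical names

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)
forest = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# Shuffle each feature 10 times on held-out data and record the score drop
result = permutation_importance(forest, X_te, y_te, n_repeats=10, random_state=42)

ranking = sorted(zip(feature_names, result.importances_mean),
                 key=lambda pair: pair[1], reverse=True)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

If the two rankings broadly agree on the real data, that adds some confidence to the insights listed above; large disagreements would be worth investigating.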

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ML model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File
import joblib

# Save the trained model to a .joblib file
# best_rf is the tuned Random Forest selected by RandomizedSearchCV above
joblib.dump(best_rf, 'random_forest_model.joblib')


### 2. Load the saved model file again and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.

# Load the saved model
loaded_model = joblib.load('random_forest_model.joblib')

# Predict on test set
predictions = loaded_model.predict(X_test)

# Show a few predictions vs actual
for i in range(5):
    print(f"Predicted: {predictions[i]:,.2f} | Actual: {y_test.iloc[i]:,.2f}")
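
A slightly stricter sanity check is to confirm that the reloaded model reproduces the in-memory model's predictions exactly. The sketch below does this round trip on a small synthetic model in a temporary directory; the data, the file name `demo_model.joblib`, and the variable names are illustrative assumptions.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Small synthetic model (illustrative only)
rng = np.random.default_rng(7)
X_demo = rng.normal(size=(100, 4))
y_demo = X_demo.sum(axis=1)
model = RandomForestRegressor(n_estimators=10, random_state=42).fit(X_demo, y_demo)

# Save to a temporary file and load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'demo_model.joblib')
    joblib.dump(model, model_path)
    restored = joblib.load(model_path)

# The restored model should give identical predictions
round_trip_ok = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print("Predictions identical after reload:", round_trip_ok)
```

The same comparison can be run with the notebook's own model and `X_test` in place of the synthetic data.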


# **Conclusion**

This project aimed to build a predictive model to accurately forecast Walmart’s weekly sales at the store level, using historical sales data along with external factors such as holidays, temperature, fuel prices, unemployment, and the Consumer Price Index (CPI). Effective sales forecasting is critical for Walmart’s operations—impacting inventory management, workforce allocation, and strategic decision-making.

We began by cleaning and preprocessing the dataset, including handling missing values, removing duplicates, converting date fields into meaningful time-based features (e.g., Month, Week, Season), and encoding categorical variables. We then conducted exploratory data analysis (EDA) to identify trends and patterns in the data, such as the significant sales spikes during holiday weeks, variation across stores, and the correlation between economic indicators (like CPI and Unemployment) and weekly sales.
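
The date-to-feature step described above can be sketched as follows. The column name `Date`, the day-month-year format, the sample dates, and the month-to-season mapping are all assumptions made for illustration, not the notebook's exact code.

```python
import pandas as pd

# Three sample dates in an assumed day-month-year format
df_demo = pd.DataFrame({'Date': ['05-02-2010', '26-11-2010', '31-12-2010']})
df_demo['Date'] = pd.to_datetime(df_demo['Date'], format='%d-%m-%Y')

# Calendar features derived from the parsed date
df_demo['Year'] = df_demo['Date'].dt.year
df_demo['Month'] = df_demo['Date'].dt.month
df_demo['Week'] = df_demo['Date'].dt.isocalendar().week.astype(int)  # ISO week

# Hypothetical month-to-season mapping (northern hemisphere)
season_by_month = {
    12: 'Winter', 1: 'Winter', 2: 'Winter',
    3: 'Spring', 4: 'Spring', 5: 'Spring',
    6: 'Summer', 7: 'Summer', 8: 'Summer',
    9: 'Fall', 10: 'Fall', 11: 'Fall',
}
df_demo['Season'] = df_demo['Month'].map(season_by_month)

print(df_demo)
```

Passing an explicit `format` avoids pandas guessing between day-first and month-first dates, which matters for values such as 05-02-2010.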

Three machine learning models were implemented and evaluated:

- Linear Regression as a baseline model
- Decision Tree Regressor to capture non-linear patterns
- Random Forest Regressor as a more powerful ensemble model

After evaluating all models using RMSE, MAE, and R², the Random Forest Regressor was selected as the final model due to its superior accuracy and robustness. We further improved its performance with RandomizedSearchCV, a hyperparameter optimization technique that reduced model error on the test set. The trained model was then saved using joblib, reloaded, and checked against the held-out test set as a sanity check ahead of deployment.

**Insights Gained**

- Store identity was the most important feature, indicating that location-specific factors heavily influence sales.
- CPI and Unemployment were significant, showing the effect of economic conditions on consumer behavior.
- Holidays and seasonal patterns (captured through Week and Month) had noticeable impacts on sales volume.
- Time-based features provided meaningful lift to model accuracy, highlighting the importance of seasonality in retail sales.

**Business Impact**

- The predictive model gives Walmart a data-driven way to forecast weekly sales.
- By identifying high-impact features (like holidays and economic factors), the model enables smarter inventory and workforce planning.
- At Walmart's scale, reducing forecasting errors by even 5–10% could translate into substantial cost savings across supply chain operations.
- Reliable predictions can help Walmart minimize stockouts and overstocking, leading to better customer experience and higher revenue.

**Final Note**

This project demonstrates how machine learning can transform historical retail data into actionable insights and strategic decisions. By deploying the Random Forest model into Walmart’s forecasting pipeline, the company can enhance its operational efficiency, make informed business decisions, and stay competitive in a fast-paced retail environment.