
Welcome to my captivating project, where I embark on a journey to forecast store sales for Corporation Favorita, a prominent grocery retailer in Ecuador. With a meticulous approach, I'll dive into a rich dataset encompassing store information, product families, promotions, holidays, and more.


snyamson/LP3-Super-Store-Time-Series-Forecasting


💼 Time Series Sales Machine Learning Prediction Project


This is a machine learning regression project: the goal is to train a model that predicts a dependent variable (here, sales) from independent variables such as store, product family, promotions, and dates. The model is trained on historical data, from which the algorithm learns patterns it can use to predict the dependent variable.

Once trained, the model can predict the dependent variable for new data points. Such predictions support decisions about future outcomes, such as predicting sales, forecasting demand, or assessing risk.

Project Overview

This project follows the CRISP-DM (Cross-Industry Standard Process for Data Mining) framework to explore and analyze the sales data. We predict store sales using data from Corporation Favorita, a large grocery retailer based in Ecuador.

Specifically, the task is to build a model that accurately predicts unit sales for thousands of items sold at different Favorita stores.


Project Structure📂

  • code/: Contains the dataset used for analysis and the Jupyter notebook detailing the data exploration, preprocessing, and model building steps.
  • article/: Holds project-related article.
  • LICENSE: Project license.
  • README.md: Project overview, links, highlights, and information.

Data Dictionary

  • train.csv: Training data containing time series of the features store_nbr, family, and onpromotion, as well as the target sales.
      - store_nbr: Identifies the store where the products are sold.
      - family: Identifies the type of product sold.
      - sales: Total sales for a product family at a specific store on a given date (can be fractional).
      - onpromotion: Total number of items in a product family that were being promoted at a store on a given date.
  • test.csv: Test data with the same features as the training data. Predict target sales for these dates.
  • transaction.csv: Contains date, store_nbr, and transactions made on specific dates.
  • sample_submission.csv: Sample submission file in the correct format.
  • stores.csv: Store metadata, including city, state, type, and cluster.
      - cluster: Grouping of similar stores.
  • oil.csv: Daily oil price data, including values during both the train and test data timeframes.
  • holidays_events.csv: Holidays and events data, with metadata.
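
As a rough illustration of how these files relate, the sketch below joins store metadata onto training rows via the shared store_nbr key. The inline rows are made-up stand-ins, the date column in the training data is an assumption based on the description of a time series, and the helper names (read_rows, attach_store_metadata) are hypothetical rather than taken from the project notebook. It uses only the standard library so it runs anywhere; in a notebook, pandas `merge` would be the more usual choice.

```python
import csv
import io

# Made-up sample rows that follow the column names in the data dictionary.
# The "date" column is an assumption; values are for illustration only.
TRAIN_CSV = """date,store_nbr,family,sales,onpromotion
2017-01-01,1,GROCERY I,120.5,3
2017-01-01,2,GROCERY I,80.0,0
2017-01-02,1,BEVERAGES,45.25,1
"""

STORES_CSV = """store_nbr,city,state,type,cluster
1,Quito,Pichincha,D,13
2,Guayaquil,Guayas,A,6
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def attach_store_metadata(train_rows, store_rows):
    """Left-join store metadata onto each training row using store_nbr."""
    stores = {row["store_nbr"]: row for row in store_rows}
    merged = []
    for row in train_rows:
        meta = stores.get(row["store_nbr"], {})
        merged.append({
            **meta,
            **row,
            "sales": float(row["sales"]),            # sales can be fractional
            "onpromotion": int(row["onpromotion"]),  # count of promoted items
        })
    return merged


merged = attach_store_metadata(read_rows(TRAIN_CSV), read_rows(STORES_CSV))
for row in merged:
    print(row["date"], row["store_nbr"], row["city"], row["family"], row["sales"])
```

The same keyed-join idea extends to the other files: oil.csv and holidays_events.csv would be joined on the date, and transaction.csv on the date and store_nbr together.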

Project Highlights🚀

  • Followed the CRISP-DM framework end to end to build an understanding of the retail data.
  • Ran extensive exploratory data analysis to surface trends and patterns in the dataset.
  • Built predictive models, including XGBoost, to forecast sales.
  • Tuned hyperparameters to improve model performance.
  • Wrote an article describing the project's process, results, and implications for retail forecasting.

Summary

  • Code: LP 3
  • Name: Sales Time Series Prediction
  • Published Article: Read Article
  • Deployed Dashboard: View Dashboard

Hypothesis Investigated

Null Hypothesis (H0): The number of products under promotion does not influence sales in supermarkets.

Alternate Hypothesis (H1): The number of products under promotion significantly influences sales in supermarkets.

Rationale

The rationale for testing these hypotheses is to determine whether there is empirical evidence to support the idea that promotions have a meaningful impact on sales in supermarkets.

By testing these hypotheses and examining the correlation between promotions and sales, businesses can gain valuable insights into the dynamics of supermarket sales and make evidence-based decisions regarding their promotional strategies.

Results

  • Test Conducted: Independent Samples T-Test
  • Pearson Correlation: 0.4180
  • P-Value: 0.0000

The Pearson correlation coefficient between the number of products under promotion (the "onpromotion" column) and sales is approximately 0.4180. The corresponding p-value rounds to 0.0000 at four decimal places, which is below any conventional significance level, so we reject the null hypothesis.

There is a statistically significant, moderate positive correlation (r = 0.4180) between the number of products under promotion and sales: as the number of promoted products increases, sales tend to increase as well. This is consistent with promotions influencing sales, although a correlation on its own does not establish that promotions cause the increase.
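
For readers who want to see what the coefficient measures, here is a minimal sketch of a Pearson correlation computed by hand on made-up numbers. The data, the function names (pearson_r, t_statistic), and the variable names are illustrative assumptions, not the project's code or results. The p-value is normally obtained from a t distribution with n - 2 degrees of freedom; a library routine such as `scipy.stats.pearsonr` returns both the coefficient and the p-value.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 3:
        raise ValueError("need two sequences of equal length, at least 3 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def t_statistic(r, n):
    """t statistic for testing r against zero; undefined when |r| == 1."""
    return r * math.sqrt((n - 2) / (1 - r * r))


# Made-up values for illustration only (not taken from the Favorita data).
onpromotion = [0, 1, 2, 3, 4, 5]
sales = [10, 12, 11, 15, 14, 18]

r = pearson_r(onpromotion, sales)
t = t_statistic(r, len(sales))
print(f"r = {r:.4f}, t = {t:.3f}")
```

A larger absolute t statistic corresponds to a smaller p-value, which is why a moderate coefficient can still be highly significant when the number of rows is large.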

Exploratory Data Analysis (EDA)📊

A snapshot of the exploratory data analysis, which was aimed at answering key business questions.

[Figures: stores by type; stores by state; oil price trend; newplot]

Model Selection

[Figure: model selection results]

After assessing the models on the evaluation metrics, XGBoost is the most effective choice for this dataset. Using RMSLE (Root Mean Squared Logarithmic Error) as the main metric, XGBoost achieved the lowest value, 0.0054, among all models evaluated, which means its predictions were closer to the actual sales than those of the ARIMA, SARIMA, and ETS models.

Therefore, for this specific forecasting task, we are adopting the XGBoost model for its superior predictive accuracy.
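
For reference, RMSLE is the square root of the mean squared difference between log(1 + predicted) and log(1 + actual). The sketch below implements that standard definition; the function name rmsle and the sample values are assumptions for illustration, and the notebook's own implementation may differ.

```python
import math


def rmsle(actual, predicted):
    """Root Mean Squared Logarithmic Error for non-negative values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    if any(v < 0 for v in list(actual) + list(predicted)):
        raise ValueError("RMSLE is only defined for non-negative values")
    squared_log_errors = [
        (math.log1p(p) - math.log1p(a)) ** 2  # log1p(x) = log(1 + x)
        for a, p in zip(actual, predicted)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))


# Made-up sales figures for illustration only.
actual_sales = [120.5, 80.0, 45.25]
predicted_sales = [118.0, 85.5, 44.0]
print(f"RMSLE = {rmsle(actual_sales, predicted_sales):.4f}")
```

Because the error is taken on a log scale, RMSLE penalizes relative rather than absolute differences, which suits sales series where product families differ widely in volume.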

Recommendations

  1. Promotion Optimization: Based on the analysis of the impact of promotions on sales, consider optimizing promotion strategies. Identify which types of promotions (e.g., discounts, BOGO offers) have the most significant influence on sales and tailor promotional campaigns accordingly. By focusing promotional efforts on what truly drives sales, you can maximize the return on investment.

  2. Focus on High-Performing Cities: The top-performing city, "Quito," stands out with the highest sales. It's essential to allocate additional resources and marketing efforts to maintain and potentially increase sales in Quito. Additionally, cities like "Guayaquil," "Cuenca," "Ambato," and "Santo Domingo" have also shown strong sales performance. Consider developing city-specific strategies to capitalize on these markets.

  3. Cluster-Centric Approach: The analysis reveals that certain clusters, such as "Cluster 14," "Cluster 6," and "Cluster 8," exhibit remarkable sales figures. Invest in understanding the unique characteristics of these clusters and tailor product assortments, promotions, and inventory management strategies to maximize sales potential in these areas.

  4. Cross-Analysis Opportunities: Explore opportunities to combine the strengths of high-performing cities, clusters, store types, and states. For example, consider aligning promotions with holidays and events in top cities and clusters to maximize sales impact. Additionally, assess whether specific store types thrive in particular cities or clusters, and use this information to refine expansion plans.

Getting Started🏁

  1. Clone this repository: git clone https://github.com/snyamson/LP3-Super-Store-Time-Series-Forecasting.git
  2. Navigate to the project directory: cd LP3-Super-Store-Time-Series-Forecasting
  3. Explore the Jupyter notebook in code/ for detailed steps and code execution.
  4. Read the published article for a comprehensive understanding of the project.

License📜

This project is licensed under the MIT License.

Author✍️

Solomon Nyamson

Connect with me on LinkedIn: LinkedIn Profile


Feel free to star ⭐ this repository if you find it helpful!
