Brewery Sales Forecasting🍺

Project Overview

This project forecasts total sales for a brewery using a dataset from Kaggle. The models used for forecasting include Linear Regression, Random Forest, and Decision Tree. The project was run and tested on Google Cloud Platform's DataProc service.

Dataset

The dataset used in this project is the Brewery Operations and Market Analysis Dataset, available on Kaggle. It provides extensive data on brewery operations and market analysis, which makes it suitable for developing predictive models.

Requirements

Google Cloud Platform account
Access to GCP DataProc
Apache Spark
PySpark
Python 3.x

Installation and Setup

Set up GCP DataProc Cluster: Ensure that you have a GCP account and create a DataProc cluster to run PySpark jobs.
Install PySpark:
```
pip install pyspark
```

Download the Dataset:

Download the dataset from Kaggle and upload it to a bucket in Google Cloud Storage accessible by your DataProc cluster.

EDA

Refer to notebooks/Brewery data analysis.ipynb and the Databricks notebook.

Models Used

Linear Regression: A basic model that establishes a baseline for forecasting performance. Refer to brewery_pyspark_lr.py
Random Forest: An ensemble model that combines multiple decision trees to improve predictive accuracy and control over-fitting. Refer to brewery_pyspark_rf.py
Decision Tree: A model that recursively splits the data into subsets based on feature values. The resulting tree is then used to make predictions. Refer to brewery_pyspark_dt.py

Training:

Initialize SparkSession
Load dataset
Drop duplicates and NAs
Cast the target variable to Float
Split DateTime column into Year, Month and Day
Convert categorical columns to numeric values, or apply one-hot encoding for Linear Regression
Initialize VectorAssembler
Split dataset into Train and Test
Initialize Model
[Optional] Initialize params for grid search
[Optional] Initialize the CrossValidator along with metric
Train the model
Initialize evaluator
Compute RMSE and R squared

NOTE: Because this is not treated as a time series problem, the dataset can be split randomly into Train and Test.

Parameter Selection

Grid search for the Random Forest model: refer to brewery_pyspark_rf_grid.py

Results

RMSE: 2486
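
For readers interpreting this figure, the two metrics computed during training can be written out in plain Python. RMSE is expressed in the same units as the target variable, while R squared is unitless. The numbers in the example are toy values for illustration only and are not results from this project.

```
import math


def rmse(actual, predicted):
    """Root mean squared error: sqrt of the mean squared residual."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Toy values, not project data: every prediction is off by exactly 10.
toy_actual = [100.0, 200.0, 300.0]
toy_predicted = [110.0, 190.0, 310.0]
print(rmse(toy_actual, toy_predicted))       # 10.0
print(r_squared(toy_actual, toy_predicted))  # approximately 0.985
```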
