This project involves forecasting total sales for a brewery using a dataset from Kaggle. The models used for forecasting include Linear Regression, Random Forest, and Decision Tree. The project has been executed and tested on Google Cloud Platform's DataProc service.
The dataset used in this project is the Brewery Operations and Market Analysis Dataset, available on Kaggle. It covers brewery operations and market analysis data, which makes it suitable for developing predictive models.
- Google Cloud Platform account
- Access to GCP DataProc
- Apache Spark
- PySpark
- Python 3.x
- Set up GCP DataProc Cluster: Ensure that you have a GCP account and create a DataProc cluster to run PySpark jobs.
- Install PySpark: pip install pyspark
Download the dataset from Kaggle and upload it to a bucket in Google Cloud Storage accessible by your DataProc cluster.
Refer to notebooks/Brewery data analysis.ipynb and the Databricks notebook.
- Linear Regression: A basic model that establishes a baseline for forecasting performance. Refer to brewery_pyspark_lr.py
- Random Forest: An ensemble model that combines multiple decision trees to improve predictive accuracy and control overfitting. Refer to brewery_pyspark_rf.py
- Decision Tree: A model that recursively splits the data into subsets based on feature values; the resulting tree is then used to make predictions. Refer to brewery_pyspark_dt.py
- Initialize SparkSession
- Load dataset
- Drop duplicates and NAs
- Cast the target variable to Float
- Split DateTime column into Year, Month and Day
- Convert categorical columns to numeric values, or one-hot encode them for Linear Regression
- Initialize VectorAssembler
- Split dataset into Train and Test
- Initialize Model
- [Optional] Initialize params for grid search
- [Optional] Initialize the CrossValidator along with metric
- Train the model
- Initialize evaluator
- Compute RMSE and R squared
NOTE: As we're not treating this as a time series problem, we can split the data randomly into Train and Test.
brewery_pyspark_rf_grid.py
RMSE: 2486