# QCTO - Workplace Module

### Project Title: Regional and Packaging Trends in Avocado Prices: 2015-2023
#### Done By: Tshepiso Mudau
#### Github Repo: 

© ExploreAI 2024

---

<a id="cont"></a>
## Table of Contents

<a href=#BC> Background Context</a>

<a href=#one>1. Importing Packages</a>

<a href=#two>2. Data Collection and Description</a>

<a href=#three>3. Loading Data </a>

<a href=#four>4. Data Cleaning and Filtering</a>

<a href=#five>5. Exploratory Data Analysis (EDA)</a>

<a href=#six>6. Modeling </a>

<a href=#seven>7. Evaluation and Validation</a>

<a href=#eight>8. Final Model</a>

<a href=#nine>9. Conclusion and Future Work</a>

<a href=#ten>10. References</a>

---
 <a id="BC"></a>
## **Background Context**
<a href=#cont>Back to Table of Contents</a>

### Introduction

The goal of this project is to analyze historical data from the Hass Avocado Board to understand how regional differences and packaging types (bulk vs. bagged) have influenced avocado prices and sales trends between 2015 and 2023. By examining these factors, we aim to uncover seasonal and regional patterns, consumer preferences, and price dynamics, ultimately developing a predictive model to forecast future avocado prices. This analysis will provide valuable insights into market trends that can assist retailers, farmers, and industry stakeholders in making informed decisions.

### Problem Statement and Objectives

Avocado consumption has surged in recent years, and understanding the factors that influence its pricing and sales trends is crucial for businesses across the supply chain. The dataset includes key variables such as avocado type (conventional vs. organic), packaging (bulk vs. bagged), region, and price over time, allowing us to address the following questions:

- Regional Price Variability: How do avocado prices fluctuate across different U.S. regions? Are there specific regions where prices are consistently higher or lower?
- Packaging Preferences: Do consumers prefer bulk or bagged avocados, and how does this preference impact prices and sales?
- Seasonal Effects: What role does seasonality play in avocado pricing and sales volumes? Are there noticeable peaks or dips during certain times of the year?
- Price Forecasting: Can we build a predictive model to estimate future avocado prices based on historical data, regional differences, and packaging type?

##### This project leverages historical data analysis and predictive modeling to provide actionable insights, helping stakeholders anticipate market shifts and optimize their business strategies.
---
<a id="one"></a>
## **Importing Packages**
<a href=#cont>Back to Table of Contents</a>


---

In [3]:
# Data manipulation
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Date and time handling
import datetime as dt

# Machine learning and statistical modeling
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.ensemble import RandomForestRegressor
import statsmodels.api as sm
import xgboost as xgb

# Utility for saving/loading models
import joblib


---
<a id="two"></a>
## **Data Collection and Description**
<a href=#cont>Back to Table of Contents</a>

### Purpose:

The data used for this analysis originates from the **Hass Avocado Board** and was obtained as a publicly available dataset on **Kaggle**. It contains historical avocado prices and sales volumes across various regions of the United States from 2015 to 2023, and supports analysis of how region, packaging (bulk vs. bagged), and seasonality affect avocado pricing and sales trends.

### Details:

1. **Source**: 
   - The dataset was obtained from Kaggle: [Avocado Prices and Sales Volume (2015-2023)](https://www.kaggle.com/datasets/vakhariapujan/avocado-prices-and-sales-volume-2015-2023).
   - Data is provided by the Hass Avocado Board, representing multiple U.S. regions.

2. **Collection Methods**: 
   - Data was collected by aggregating sales information from various grocery stores and suppliers across the United States.
   - It captures information about different types of avocados (conventional vs. organic), and categorizes the sales data based on packaging types (bulk vs. bagged) and regions.

3. **Dataset Overview**:
   - **Size**: The dataset consists of **53,415 rows** and **12 columns**.
   - **Scope**: It covers data for both **conventional** and **organic** avocados, including information on total sales volume, price, and packaging type across **multiple U.S. regions**.
   - **Data Types**:
     - **Numerical**: `AveragePrice`, `TotalVolume`, `plu4046`, `plu4225`, `plu4770`, `TotalBags`, `SmallBags`, `LargeBags`, `XLargeBags`.
     - **Categorical**: `region`, `type`.
     - **Date**: `Date`, capturing the time dimension of the data.

The data provides an extensive view of market trends for avocado pricing and consumer behavior across regions and time, enabling further exploration of regional price variability, seasonal effects, and packaging preferences.
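The grouping of columns into numerical, categorical, and date kinds can be checked programmatically once the data is in a DataFrame. The following is a minimal sketch, not part of the original notebook: the helper name `classify_columns` is hypothetical, and the two-row sample frame uses only a subset of the dataset's columns, with values copied from the first rows displayed later in this notebook.

```python
import pandas as pd


def classify_columns(df):
    """Group column names into the broad kinds used in the overview (hypothetical helper)."""
    return {
        "numerical": df.select_dtypes(include="number").columns.tolist(),
        "date": df.select_dtypes(include="datetime").columns.tolist(),
        # Everything that is neither numeric nor datetime is treated as categorical
        "categorical": df.select_dtypes(exclude=["number", "datetime"]).columns.tolist(),
    }


# Illustrative two-row sample with a subset of the dataset's columns
sample = pd.DataFrame({
    "Date": pd.to_datetime(["2015-01-04", "2015-01-04"]),
    "AveragePrice": [1.22, 1.00],
    "TotalVolume": [40873.28, 435021.49],
    "type": ["conventional", "conventional"],
    "region": ["Albany", "Atlanta"],
})

kinds = classify_columns(sample)
print(sample.shape)  # (rows, columns)
print(kinds)
```

Applied to the full DataFrame, the same helper would list `Date` under the date kind only after the column has been converted to `datetime`.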


---
<a id="three"></a>
## **Loading Data**
<a href=#cont>Back to Table of Contents</a>

To begin the analysis, the avocado dataset is loaded from a CSV file into a pandas DataFrame using the `pandas.read_csv()` function.

After loading, the column types, non-null counts, and summary statistics are displayed to confirm that the data was read correctly and to give an initial view of its structure and contents.


In [4]:
# Load the dataset
avocado_data = pd.read_csv('Avocado_HassAvocadoBoard_20152023v1.0.1.csv')

# Display basic information about the dataset
avocado_data.info()

# Show the first few rows to get an overview
avocado_data.head()

# Summary statistics
avocado_data.describe()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53415 entries, 0 to 53414
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Date          53415 non-null  object 
 1   AveragePrice  53415 non-null  float64
 2   TotalVolume   53415 non-null  float64
 3   plu4046       53415 non-null  float64
 4   plu4225       53415 non-null  float64
 5   plu4770       53415 non-null  float64
 6   TotalBags     53415 non-null  float64
 7   SmallBags     41025 non-null  float64
 8   LargeBags     41025 non-null  float64
 9   XLargeBags    41025 non-null  float64
 10  type          53415 non-null  object 
 11  region        53415 non-null  object 
dtypes: float64(9), object(3)
memory usage: 4.9+ MB


| | AveragePrice | TotalVolume | plu4046 | plu4225 | plu4770 | TotalBags | SmallBags | LargeBags | XLargeBags |
|---|---|---|---|---|---|---|---|---|---|
| count | 53415.0 | 53415.0 | 53415.0 | 53415.0 | 53415.0 | 53415.0 | 41025.0 | 41025.0 | 41025.0 |
| mean | 1.42891 | 869447.4 | 298270.7 | 222217.0 | 20531.95 | 217508.3 | 103922.2 | 23313.16 | 2731.811796 |
| std | 0.393116 | 3545274.0 | 1307669.0 | 955462.4 | 104097.7 | 867694.7 | 569260.8 | 149662.2 | 22589.096454 |
| min | 0.44 | 84.56 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 1.119091 | 16264.65 | 694.725 | 2120.8 | 0.0 | 7846.52 | 0.0 | 0.0 | 0.0 |
| 50% | 1.4 | 120352.5 | 14580.58 | 17516.63 | 90.05 | 36953.1 | 694.58 | 0.0 | 0.0 |
| 75% | 1.69 | 454238.0 | 128792.4 | 93515.6 | 3599.735 | 111014.6 | 37952.98 | 2814.92 | 0.0 |
| max | 3.44083 | 61034460.0 | 25447200.0 | 20470570.0 | 2860025.0 | 16298300.0 | 12567160.0 | 4324231.0 | 679586.8 |
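The `info()` output above lists `Date` with dtype `object`, meaning the dates were read as text. One possible alternative to converting the column afterwards is to parse it while reading, using the `parse_dates` parameter of `pandas.read_csv()`. The sketch below is an illustration under assumptions: it reads a two-row in-memory CSV (a trimmed subset of the dataset's columns, with values from the first rows shown later in this notebook) instead of the real file.

```python
import io

import pandas as pd

# In-memory stand-in for the CSV file; only a subset of the real columns is included
csv_text = """Date,AveragePrice,TotalVolume,type,region
2015-01-04,1.22,40873.28,conventional,Albany
2015-01-04,1.00,435021.49,conventional,Atlanta
"""

# Default read: the Date column is kept as text
raw = pd.read_csv(io.StringIO(csv_text))

# Parsing at read time: the Date column becomes a datetime column
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])

print(raw["Date"].dtype)     # a text dtype, not a datetime
print(parsed["Date"].dtype)  # a datetime64 dtype
print(parsed["Date"].dt.year.tolist())
```

With a datetime column, accessors such as `.dt.year` and `.dt.month` become available, which is useful for the seasonal questions posed in the problem statement.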


---
<a id="four"></a>
## **Data Cleaning and Filtering**
<a href=#cont>Back to Table of Contents</a>

Before proceeding with the analysis, it is essential to clean and filter the data to ensure that it is in a usable format. Data cleaning involves handling missing values, removing duplicates, and transforming data types where necessary. Proper filtering is also done to focus on the most relevant data for analysis.

#### Key Cleaning and Filtering Steps:
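1. **Handling Missing Values**: Each column is checked for missing values, which are then either imputed with suitable values or removed along with the affected rows/columns. In this dataset, `SmallBags`, `LargeBags`, and `XLargeBags` each have 12,390 missing values (41,025 non-null entries out of 53,415 rows); all other columns are complete.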
   
2. **Removing Duplicates**: To avoid redundant data and ensure accurate analysis, any duplicate rows are identified and removed.
   
3. **Converting Data Types**: The `Date` column is converted from string format to a `datetime` object for easier manipulation and filtering.
   
4. **Filtering Data**: Depending on the scope of the analysis, we can filter the data based on specific attributes such as avocado type (e.g., "conventional") or region (e.g., "California").


In [5]:
# Count missing values per column (inspect missing_counts to review them)
missing_counts = avocado_data.isnull().sum()

# Remove duplicate rows, if any
avocado_data = avocado_data.drop_duplicates()

# Convert 'Date' to datetime format for easier filtering
avocado_data['Date'] = pd.to_datetime(avocado_data['Date'])

# Filter data (e.g., only keeping 'conventional' avocados);
# .copy() gives an independent DataFrame that is safe to modify later
filtered_data = avocado_data[avocado_data['type'] == 'conventional'].copy()

# Further filtering, e.g., for a specific region
region_filtered_data = filtered_data[filtered_data['region'] == 'California'].copy()

# Show the cleaned data filtered by type (all regions)
filtered_data.head()


| | Date | AveragePrice | TotalVolume | plu4046 | plu4225 | plu4770 | TotalBags | SmallBags | LargeBags | XLargeBags | type | region |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2015-01-04 | 1.22 | 40873.28 | 2819.5 | 28287.42 | 49.9 | 9716.46 | 9186.93 | 529.53 | 0.0 | conventional | Albany |
| 2 | 2015-01-04 | 1.0 | 435021.49 | 364302.39 | 23821.16 | 82.15 | 46815.79 | 16707.15 | 30108.64 | 0.0 | conventional | Atlanta |
| 4 | 2015-01-04 | 1.08 | 788025.06 | 53987.31 | 552906.04 | 39995.03 | 141136.68 | 137146.07 | 3990.61 | 0.0 | conventional | BaltimoreWashington |
| 6 | 2015-01-04 | 1.01 | 80034.32 | 44562.12 | 24964.23 | 2752.35 | 7755.62 | 6064.3 | 1691.32 | 0.0 | conventional | Boise |
| 8 | 2015-01-04 | 1.02 | 491738.0 | 7193.87 | 396752.18 | 128.82 | 87663.13 | 87406.84 | 256.29 | 0.0 | conventional | Boston |
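The cleaning cell above counts missing values but does not yet act on them, while the earlier `info()` output shows 41,025 non-null values out of 53,415 rows for `SmallBags`, `LargeBags`, and `XLargeBags`. A minimal sketch of how the missing-value step could be carried out follows. It is one option under stated assumptions, not the approach chosen in this notebook: the helper names `missing_report` and `drop_rows_missing_bag_sizes` are hypothetical, dropping rows is only one of the strategies mentioned (imputation is the other), and the last two rows of the sample frame are made up to illustrate missing entries.

```python
import numpy as np
import pandas as pd

# Columns that contain missing values according to the info() output
BAG_SIZE_COLUMNS = ["SmallBags", "LargeBags", "XLargeBags"]


def missing_report(df):
    """Return the count and percentage of missing values for columns that have any."""
    counts = df.isnull().sum()
    report = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(1),
    })
    return report[report["missing"] > 0]


def drop_rows_missing_bag_sizes(df):
    """Keep only rows where every bag-size column is present."""
    return df.dropna(subset=BAG_SIZE_COLUMNS)


# First two rows use values shown above; the last two are hypothetical
sample = pd.DataFrame({
    "TotalBags": [9716.46, 46815.79, 500.0, 800.0],
    "SmallBags": [9186.93, 16707.15, np.nan, np.nan],
    "LargeBags": [529.53, 30108.64, np.nan, np.nan],
    "XLargeBags": [0.0, 0.0, np.nan, np.nan],
})

report = missing_report(sample)
complete = drop_rows_missing_bag_sizes(sample)
print(report)
print(len(complete))
```

Dropping rows keeps the bag-size breakdown exact but shrinks the dataset. If the analysis only needs `TotalBags`, which has no missing values, leaving these rows in place may be preferable.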
