# Avocado Price Prediction


* The data is made up of 2015-2018 retail scan data for national retail volume (units) and price for avocados.
* Retail scan data comes directly from retailers’ cash registers based on actual retail sales of Hass avocados.
* Multi-outlet reporting includes an aggregation of the following channels: grocery, mass, club, drug, dollar and military.
* The Average Price (of avocados) in the table reflects a per unit (per avocado) cost, even when multiple units (avocados) are sold in bags. 
* The Product Lookup codes (PLUs) in the table are only for Hass avocados.

## The Problem
Here we want to use some or all of the features in this dataset to predict avocado price.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('/kaggle/input/avocado-prices/avocado.csv')

df['Date'] = pd.to_datetime(df['Date'])
df.head()

## Check out the dataset
Let's first check if we have any nulls

In [None]:
df.isnull().sum()

In [None]:
df.info()

Looks like we have a couple of categorical fields. What are their distinct values? We see that type is either conventional or organic, and it's roughly 50/50. We also see that each region has 338 records.

In [None]:
sns.countplot(x="type" , data=df)

In [None]:
df.type.value_counts()

In [None]:
df.region.value_counts()

Right off the bat, there are a couple of fields we don't need. First, the unnamed field seems to be a row id, so we can discard it. Year is a field we can always derive from the date, so let's drop it too.

In [None]:
df = df.drop(['Unnamed: 0', 'year'], axis=1)
df.head(2)

## EDA

Let's look at the distributions.

In [None]:
remove_dates = [col for col in df.columns if col !='Date']
df[remove_dates].hist(bins=50, figsize=(15,8))
plt.tight_layout()
plt.show()

We see that average price is close to normally distributed. The other features are heavily right-skewed. Let's take a deeper look at volume, as this might be driving the behavior. It looks like 80% of the volume data doesn't exceed about 6.25K units. We also see from the box plot and the outlier stats that the first outlier is somewhere around 1,067,498 units. This is something to consider going forward.

In [None]:
pd.cut(df['Total Volume'],100).value_counts()/df.shape[0]

In [None]:
plt.figure(figsize=(12,6))
# Pass the series by keyword; recent seaborn versions no longer accept it positionally
sns.boxplot(x=df['Total Volume'])

In [None]:
from matplotlib.cbook import boxplot_stats  
boxplot_stats(df['Total Volume']).pop(0)['fliers'].min()
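
For reference, the fliers reported by `boxplot_stats` are the points beyond the whiskers. Assuming the default rule of 1.5 times the interquartile range, the smallest high outlier can be reproduced by hand. This is only a sketch on made-up numbers; the `volumes` array is hypothetical and is not the avocado data.

```python
import numpy as np

# Hypothetical right-skewed sample standing in for 'Total Volume'
volumes = np.array([10, 12, 13, 15, 16, 18, 20, 22, 25, 30, 90, 400], dtype=float)

# Quartiles and the upper fence at Q3 + 1.5 * IQR
q1, q3 = np.percentile(volumes, [25, 75])
upper_fence = q3 + 1.5 * (q3 - q1)

# The first high outlier is the smallest value above the fence
first_high_outlier = volumes[volumes > upper_fence].min()
print(upper_fence, first_high_outlier)
```

Anything above `upper_fence` would be drawn as an individual point on the box plot.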

Now let's see how the average price changes over time. I imagine that season has a lot to do with both the price and the volume of avocados: when they are in season, I would expect more to be sold. According to a quick Google search, Hass avocados are in season from May through January. We can take averages for organic and conventional and see how they behave over time. Below I see a bit of a correlation with month, particularly in spring. We can also clearly see that when volume jumps, price goes down. This looks like basic supply and demand at work, at least for the conventional type. Organic sales are pretty flat.

In [None]:
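# numeric_only=True skips the string columns (type, region), which pandas >= 2.0 refuses to average
con = df[df['type']=='conventional'].groupby('Date').mean(numeric_only=True).reset_index()
org = df[df['type']=='organic'].groupby('Date').mean(numeric_only=True).reset_index()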

fig, ax1 = plt.subplots(figsize=(15,8))
ax2 = ax1.twinx()
ax1.plot(con['Date'], con['AveragePrice'], c='red', label='Conventional Price')
ax1.plot(org['Date'], org['AveragePrice'], c='blue', label='Organic Price', alpha=.4, linestyle=':', lw=3)
ax2.plot(con['Date'], con['Total Volume'], c='green', label='Conventional Volume')
ax2.plot(org['Date'], org['Total Volume'], c='purple', label='Organic Volume', linestyle=':', lw=3, alpha=.4)
         
ax1.set_xlabel('Date')
ax1.set_ylabel('Avg of Avg Price')
ax2.set_ylabel('Volume')

h1, l1 = ax1.get_legend_handles_labels()
h2, l2 = ax2.get_legend_handles_labels()
ax1.legend(h1+h2, l1+l2, loc=2)
plt.title('Mean Avg Price and Volume Over Time')
plt.show()
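
To put a number on the visual impression from a plot like this, one option is to compute the price/volume correlation separately for each type. The snippet below is a minimal sketch on a hypothetical `toy` frame that only borrows the column names from the dataset; the values are invented for illustration.

```python
import pandas as pd

# Hypothetical data: conventional price falls as volume rises, organic is flat
toy = pd.DataFrame({
    'type': ['conventional'] * 5 + ['organic'] * 5,
    'AveragePrice': [1.0, 1.1, 0.9, 1.2, 0.8, 1.6, 1.5, 1.7, 1.6, 1.5],
    'Total Volume': [500, 450, 560, 400, 620, 50, 52, 49, 51, 50],
})

# Pearson correlation between price and volume within each type
corr_by_type = {
    name: group['AveragePrice'].corr(group['Total Volume'])
    for name, group in toy.groupby('type')
}
print(corr_by_type)
```

On the real data the same loop could be run on `df` directly, or on the per-date means in `con` and `org`.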

Let's take a look at just one location and use STL to extract a trend. For example, let's look at conventional avocados in Chicago. Below I don't see much correlation with month; however, we can clearly see that when volume jumps, price goes down. Again, this looks like basic supply and demand.

In [None]:
from pandas.plotting import register_matplotlib_converters
from statsmodels.tsa.seasonal import STL

chicago_conventional = df[(df['region'] == 'Chicago') & (df['type']=='conventional')].sort_values('Date')
chicago_conventional.index = chicago_conventional['Date']

stl_price = STL(chicago_conventional['AveragePrice'], period=7)
res_price = stl_price.fit()

stl_vol = STL(chicago_conventional['Total Volume'], period=7)
res_vol = stl_vol.fit()


fig, ax1 = plt.subplots(figsize=(15,8))
ax2 = ax1.twinx()
ax1.plot(chicago_conventional['Date'], res_price.trend, c='blue', label='Price', alpha=.5, linestyle=':', lw=4)
ax2.plot(chicago_conventional['Date'], res_vol.trend, c='green', label='Volume', alpha=.5, lw=3)

ax1.set_xlabel('Date')
ax1.set_ylabel('Avg Price')
ax2.set_ylabel('Volume')

h1, l1 = ax1.get_legend_handles_labels()
h2, l2 = ax2.get_legend_handles_labels()
ax1.legend(h1+h2, l1+l2, loc=2)
plt.title('Price and Volume Trend')
plt.show()

We see that 2018 has much less volume than the other years, and that 2017 is the strongest of the three remaining years.

In [None]:
# Select the column before summing so the Date and string columns are not aggregated
df.groupby(df.Date.dt.year)[['Total Volume']].sum().plot(kind='bar')

And as it turns out the last date in 2018 is March 25th.

In [None]:
df[df.Date.dt.year == 2018][['Date']].value_counts().sort_index()

Finally, what does the monthly distribution look like for each of these years? We see a spike in spring in 2015 and 2016. There is also a bit of a spike in spring 2017, but that year started out strong. I think there is a bit of signal here, so we can replace the date with the month in the dataset.

In [None]:
df2 = df.copy()
df2['month'] = df2.Date.dt.month
df2['year'] = df2.Date.dt.year
# numeric_only=True keeps the Date and string columns out of the sum
df2 = df2.groupby(['year', 'month']).sum(numeric_only=True).reset_index()

plt.figure(figsize=(12,8))
for i,year in enumerate(df2['year'].unique()):
    plt.subplot(2,2,i+1)
    plt.bar(df2[df2['year']==year]['month'], df2[df2['year']==year]['Total Volume'])
    plt.title(year)
# Use plt here: `fig` still points at the earlier price/volume figure
plt.tight_layout()

In [None]:
df_copy = df.copy()
df_copy['month'] = df_copy['Date'].dt.month
df_copy = df_copy.drop(['Date'], axis=1)
df_copy.head(2)

Finally, let's explore correlation. We see a very mild negative correlation between price and volume, which we already picked up from the time series plot. It seems to be more pronounced for PLU 4046. We also see a very mild positive correlation between price and month.

In [None]:
plt.figure(figsize=(12,8))
# Restrict to numeric columns; type and region are strings
sns.heatmap(df_copy.corr(numeric_only=True), annot=True)

# Data Preparation

Let's get our data ready for model selection. We need to do the following:
* create train and test splits
* scale our numerical features
* one-hot encode our categorical features

Let's do a simple train/test split. It doesn't appear that we need a stratified split, so this is very straightforward.

In [None]:
from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(df_copy, test_size=0.2, random_state=42)
train_set.shape, test_set.shape
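
One quick way to sanity-check the decision to skip stratification is to compare category shares in the two splits. This is a minimal sketch on a hypothetical, perfectly balanced `toy` frame rather than on `df_copy`; only the column names are taken from the dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical balanced data: 500 conventional rows and 500 organic rows
toy = pd.DataFrame({
    'type': ['conventional', 'organic'] * 500,
    'AveragePrice': [1.0, 1.6] * 500,
})

train, test = train_test_split(toy, test_size=0.2, random_state=42)

# Share of each type in the two splits; similar shares suggest a plain split is fine
train_share = train['type'].value_counts(normalize=True)
test_share = test['type'].value_counts(normalize=True)
print(pd.DataFrame({'train': train_share, 'test': test_share}))
```

The same comparison could be made for `region` on the real split. If the shares drift apart noticeably, passing `stratify=` to `train_test_split` is the usual remedy.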

In [None]:
train_features = train_set.drop("AveragePrice", axis=1)
train_labels = train_set["AveragePrice"].copy()

test_features = test_set.drop("AveragePrice", axis=1)
test_labels = test_set["AveragePrice"].copy()

Now let's set up a pipeline for scaling and one-hot encoding and perform it on our train set.

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

num_attribs = ['Total Volume', '4046', '4225', '4770', 'Total Bags','Small Bags', 'Large Bags', 'XLarge Bags','month']
cat_attribs = ['type', 'region']

# Scale the numeric columns and one-hot encode the categorical ones
num_pipeline = Pipeline([('std_scaler', StandardScaler()),])
pipeline = ColumnTransformer([("num", num_pipeline, num_attribs), ("cat", OneHotEncoder(), cat_attribs)])

train_prepared = pipeline.fit_transform(train_features)
train_prepared.toarray()[0]

# Model Selection

Let's start off with simple linear regression and look at how it predicts on some of our training data. Then let's see what the RMSE is. We see that our predictions do follow a line, though not a great one. The RMSE is about 26 cents.

In [None]:
from sklearn.linear_model import LinearRegression


lin_reg = LinearRegression()
lin_reg.fit(train_prepared, train_labels)

some_data = train_features.iloc[:30]
some_data_prepared = pipeline.transform(some_data)
some_data_predicted = lin_reg.predict(some_data_prepared)
some_data_actual = train_labels.iloc[:30]

test_result = pd.DataFrame({
    'actual': list(some_data_actual),
    'predictions': list(some_data_predicted)
})

plt.scatter(test_result['actual'], test_result['predictions'])

In [None]:
from sklearn.metrics import mean_squared_error
pred = lin_reg.predict(train_prepared)
lin_mse = mean_squared_error(train_labels, pred)
lin_rmse = np.sqrt(lin_mse)
lin_rmse
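
As a reminder of what this number means, RMSE is the square root of the mean squared error, $\sqrt{\tfrac{1}{n}\sum_i (\hat{y}_i - y_i)^2}$, so it is in the same units as the target (dollars here). A tiny worked example on made-up prices:

```python
import numpy as np

# Hypothetical actual and predicted prices in dollars
actual = np.array([1.00, 1.50, 2.00, 1.25])
predicted = np.array([1.10, 1.30, 2.30, 1.25])

# Square the errors, average them, then take the square root
errors = predicted - actual
rmse = np.sqrt(np.mean(errors ** 2))
print(round(rmse, 4))
```

Because of the squaring, one large miss pulls the RMSE up more than several small ones.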

Let's try Decision Tree Regressor.

In [None]:
from sklearn.tree import DecisionTreeRegressor

# Decision tree model
tree_reg = DecisionTreeRegressor()
tree_reg.fit(train_prepared, train_labels)

some_data = train_features.iloc[:30]
some_data_prepared = pipeline.transform(some_data)
some_data_predicted = tree_reg.predict(some_data_prepared)
some_data_actual = train_labels.iloc[:30]

test_result = pd.DataFrame({
    'actual': list(some_data_actual),
    'predictions': list(some_data_predicted)
})

plt.scatter(test_result['actual'], test_result['predictions'])

In [None]:
tree_predictions = tree_reg.predict(train_prepared)
tree_mse = mean_squared_error(train_labels, tree_predictions)
tree_rmse = np.sqrt(tree_mse)
tree_rmse

Wow, awesome results! Nah, we just badly overfit the training data. We will need some cross-validation.

In [None]:
from sklearn.model_selection import cross_val_score

def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())

scores_tree = cross_val_score(tree_reg, train_prepared, train_labels, scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores_tree)
display_scores(tree_rmse_scores)
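
The minus sign before the scores is there because scikit-learn scorers follow a "higher is better" convention, so `neg_mean_squared_error` returns the MSE with its sign flipped. A short sketch on synthetic regression data (the data and coefficients are invented) shows the sign and the conversion to RMSE:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic linear data with a little noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([0.5, -0.2, 0.1]) + rng.normal(scale=0.1, size=100)

# Each fold's score is the negated MSE, so every value is <= 0
neg_mse = cross_val_score(LinearRegression(), X, y,
                          scoring='neg_mean_squared_error', cv=5)

# Flip the sign back before taking the square root
rmse_per_fold = np.sqrt(-neg_mse)
print(neg_mse.max() <= 0, rmse_per_fold.mean())
```

Forgetting the sign flip would mean taking the square root of negative numbers and getting NaN.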

Now let's try it for linear regression.

In [None]:
scores_lin_reg = cross_val_score(lin_reg, train_prepared, train_labels, scoring="neg_mean_squared_error", cv=10)
lin_reg_rmse_scores = np.sqrt(-scores_lin_reg)
display_scores(lin_reg_rmse_scores)

Let's also take a look at Random Forest Regressor. We see that RandomForestRegressor is the best of the three. We will use that going forward.

In [None]:
from sklearn.ensemble import RandomForestRegressor
forest_reg = RandomForestRegressor()
forest_reg.fit(train_prepared, train_labels)
forest_scores = cross_val_score(forest_reg, train_prepared, train_labels, scoring="neg_mean_squared_error", cv=10)
forest_rmse_scores = np.sqrt(-forest_scores)
display_scores(forest_rmse_scores)

# Fine-Tuning
Let's use GridSearchCV to better our model.

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = [
    {'n_estimators': [3, 10, 20, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10, 20, 30], 'max_features': [2, 4, 6, 8]},
  ]

forest_reg = RandomForestRegressor()

grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error',
                           return_train_score=True)

grid_search.fit(train_prepared, train_labels)
print(grid_search.best_params_)
print(grid_search.best_estimator_)
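
It is worth knowing how much work a grid like this implies before running it. `ParameterGrid` expands the same list-of-dicts format that `GridSearchCV` accepts, so it can be used to count the candidates. The sketch below repeats the grid from above and assumes `cv=5`:

```python
from sklearn.model_selection import ParameterGrid

param_grid = [
    {'n_estimators': [3, 10, 20, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10, 20, 30], 'max_features': [2, 4, 6, 8]},
]

# Each dict is expanded on its own: 4 * 4 combinations, twice
n_candidates = len(ParameterGrid(param_grid))

# Every candidate is trained once per fold (the final refit is not counted)
n_fits = n_candidates * 5
print(n_candidates, n_fits)
```

If the fit count gets out of hand, `RandomizedSearchCV` is a common alternative that samples a fixed number of candidates.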

Let's explore the final model with the test set.

In [None]:
final_model = grid_search.best_estimator_

X_test_prepared = pipeline.transform(test_features)

final_predictions = final_model.predict(X_test_prepared)

final_mse = mean_squared_error(test_labels, final_predictions)
final_rmse = np.sqrt(final_mse) 
print(final_rmse)
plt.scatter(test_labels, final_predictions)