All About Avocados

My attempt to dissect a data set about one of my favourite foods while experimenting with different seaborn plotting methods and the fbprophet forecasting library. This will be an ongoing project as I continue to explore and learn as much as I can.

1) Exploratory Data Analysis using **seaborn** 
2) Simple Forecasting using **fbprophet** 

# Part 1: Exploratory Data Analysis

In [None]:
#import libraries
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
#load data
data = pd.read_csv('../input/avocado-prices/avocado.csv', parse_dates=['Date'])

In [None]:
data.head()

In [None]:
#plot average prices of conventional vs. organic avocados over time
sns.set_style('darkgrid')
sns.set_context('notebook')
sns.relplot(x='Date', y='AveragePrice', data=data, kind='line', hue='type', height=6, aspect=2);

In [None]:
#drill down into the price and volume differences between conventional and organic avocados across the years
display(sns.catplot(x='year', y='AveragePrice', data=data, col='type', kind='bar'));
display(sns.catplot(x='year', y='Total Volume', col='type', kind='bar', data=data, sharey=False));

In [None]:
#transform data to discover seasonality trend
data['month']=data.Date.dt.strftime('%b')
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", 
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
data.month = pd.Categorical(values=data.month, categories=months, ordered=True)
seasonal = data.groupby(['month','type'], as_index=False)['AveragePrice'].mean().sort_values('month')
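
The ordered categorical above is what keeps the months in calendar order rather than alphabetical order. Here is a minimal sketch of that effect on a hypothetical toy frame (the values are made up, and the month labels are typed in directly instead of coming from `strftime('%b')`):

```python
import pandas as pd

# Hypothetical toy data standing in for the avocado frame
toy = pd.DataFrame({
    "month": ["Mar", "Jan", "Feb", "Jan"],
    "AveragePrice": [1.50, 1.00, 1.20, 1.10],
})

# Plain strings sort alphabetically, which scrambles the calendar
alphabetical = sorted(toy["month"].unique())

# An ordered categorical carries the calendar order with it
month_order = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
               "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
toy["month"] = pd.Categorical(toy["month"], categories=month_order, ordered=True)

# observed=True drops months that never appear in the toy data
monthly = toy.groupby("month", observed=True)["AveragePrice"].mean().sort_index()
calendar = list(monthly.index)

print(alphabetical)
print(calendar)
```

The same ordering is what lets seaborn draw the x-axis from Jan to Dec in the seasonal plots.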

In [None]:
#plot seasonal trend of avocado prices
g = sns.relplot(x='month', y='AveragePrice', data=data, kind='line', row='type', ci=None, height=3, aspect=3, facet_kws={'sharey':False, 'sharex':False});

In [None]:
#plot seasonal trend of avocado retail volume
g = sns.relplot(x='month', y='Total Volume', data=data, kind='line', row='type', ci=None, height=3, aspect=3, facet_kws={'sharey':False, 'sharex':False});

In [None]:
#transform data for geographical analysis, excluding aggregated regions
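#select multiple columns with a list (double brackets); the bare-tuple form fails on newer pandas
df = data.groupby(['region', 'year'], as_index=False)[['AveragePrice', 'Total Volume']].mean()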
list_to_exclude = ['TotalUS', 'West', 'SouthCentral','Northeast','Southeast','GreatLakes','Midsouth','Plains']
df1 = df[~df.region.isin(list_to_exclude)].sort_values('AveragePrice', ascending=False)

In [None]:
#plot the range of average prices per city
sns.catplot(x='AveragePrice', y='region', data=df1, height=10, aspect=1, kind='box');

# Part 2: Forecasting

This part follows the guide in this very helpful [article](https://pbpython.com/prophet-overview.html) as an example of how to use fbprophet.

In [None]:
#import the forecasting library (published as `prophet` in newer releases)
from fbprophet import Prophet

In [None]:
#create data subset of Total US conventional avocado prices as the forecast target 
subset = data[(data.region == 'TotalUS') & (data.type == 'conventional')]
subset = subset[['Date', 'AveragePrice']]
subset = subset.set_index('Date').sort_index()

In [None]:
#split the data into train and test: train is everything on or before 1 Aug 2017, test is everything after
split_date = '01-Aug-2017'
train = subset.loc[subset.index <= split_date].copy()
test = subset.loc[subset.index > split_date].copy()
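
As a sanity check on the split logic, here is a small sketch on a hypothetical weekly index (eight made-up Sundays around the split date). It only assumes that the real data is indexed by date in the same way:

```python
import pandas as pd

# Hypothetical weekly observations, 2 Jul 2017 to 20 Aug 2017 (Sundays)
idx = pd.date_range("2017-07-02", periods=8, freq="W")
toy = pd.DataFrame({"AveragePrice": [1.0 + 0.1 * i for i in range(8)]}, index=idx)

split_date = pd.Timestamp("2017-08-01")
train_toy = toy.loc[toy.index <= split_date]
test_toy = toy.loc[toy.index > split_date]

# The two parts should cover every row exactly once, with no overlap in time
covers_all = len(train_toy) + len(test_toy) == len(toy)
no_overlap = train_toy.index.max() < test_toy.index.min()

print(len(train_toy), len(test_toy), covers_all, no_overlap)
```

Because the split is purely chronological, the test set only contains dates the model has never seen.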

In [None]:
#plot train and test together to visualize the full data set
test \
    .rename(columns={'AveragePrice': 'Test Set'}) \
    .join(train.rename(columns={'AveragePrice': 'Training Set'}),
          how='outer') \
    .plot(figsize=(15,5), title='Conventional Avocado Prices')

plt.show();

In [None]:
#format columns for prophet model using ds and y
train = train.reset_index()
train.columns = ["ds", "y"]

In [None]:
#create the model and fit the data
m1 = Prophet()
m1.fit(train)

In [None]:
#tell prophet to predict out 2 years (730 daily periods)
future1 = m1.make_future_dataframe(periods=365*2)
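
The call above asks for 730 daily steps. The prices appear to be recorded weekly, so a weekly future frame may be a closer match; `make_future_dataframe` accepts a `freq` argument for this. The sketch below uses only pandas to compare the two date grids, with a hypothetical last training date of 2017-07-30. It does not call Prophet itself.

```python
import pandas as pd

# Hypothetical last date in the training data (a Sunday)
last_train_date = pd.Timestamp("2017-07-30")

# Two years ahead at daily resolution: 730 future rows
daily = pd.date_range(start=last_train_date + pd.Timedelta(days=1),
                      periods=365 * 2, freq="D")

# Two years ahead at weekly resolution: 104 future rows, one per Sunday
weekly = pd.date_range(start=last_train_date + pd.Timedelta(weeks=1),
                       periods=52 * 2, freq="W")

print(len(daily), len(weekly))
print(weekly[0].date())
```

In the notebook the weekly version would look something like `m1.make_future_dataframe(periods=104, freq='W')`, which I haven't run here.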

In [None]:
future1.tail()

In [None]:
#make the forecast
forecast1 = m1.predict(future1)

In [None]:
#examine the forecasted values yhat and its lower & upper range
forecast1[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].head()

In [None]:
#plot a pretty graph 
m1.plot(forecast1);

In [None]:
#plot the forecast with the actuals
f, ax = plt.subplots(1)
f.set_figheight(5)
f.set_figwidth(10)
ax.scatter(test.index, test['AveragePrice'], color='r', alpha=0.4)

fig = m1.plot(forecast1, ax=ax);

In [None]:
#plot various components of the model too 
m1.plot_components(forecast1);

Here are some findings:
* Conventional avocado prices are expected to continue rising
* Prices dip at the start of the year and peak a few months after summer
* This mirrors the opposite trend in volume, which is highest at the start of the year and lowest after summer
* There was a price spike in 2017, probably due to high demand and flat supply (especially for conventional avocados, whose volume remained the same as in 2016)
* Unsurprisingly, organic avocados are more expensive than conventional avocados
* They're most expensive in more affluent cities like San Francisco and New York
* FB Prophet managed to predict the general trend, but not the granular points; more data may be required
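
To put a number on how far the forecast is from the actuals, one option is to join the two on date and compute simple error metrics. This is a minimal sketch, not something from the original run: `forecast_errors` is a hypothetical helper, and the toy values below are made up. It assumes the forecast frame has `ds` and `yhat` columns, as Prophet's output does.

```python
import pandas as pd

def forecast_errors(actual: pd.Series, forecast: pd.DataFrame) -> dict:
    """Compare actual prices (indexed by date) with a forecast frame
    that has 'ds' and 'yhat' columns, on the dates present in both."""
    joined = forecast.set_index("ds")[["yhat"]].join(actual.rename("y"), how="inner")
    abs_err = (joined["yhat"] - joined["y"]).abs()
    return {
        "n": len(joined),                                   # matched dates
        "mae": abs_err.mean(),                              # mean absolute error
        "mape": (abs_err / joined["y"].abs()).mean() * 100  # mean absolute % error
    }

# Made-up actuals on three weekly dates
actual_toy = pd.Series(
    [1.0, 2.0, 4.0],
    index=pd.to_datetime(["2017-08-06", "2017-08-13", "2017-08-20"]),
)

# Made-up forecast; the 2017-08-07 row has no matching actual and is ignored
forecast_toy = pd.DataFrame({
    "ds": pd.to_datetime(["2017-08-06", "2017-08-07", "2017-08-13", "2017-08-20"]),
    "yhat": [1.1, 9.9, 1.8, 4.4],
})

scores = forecast_errors(actual_toy, forecast_toy)
print(scores)
```

In this notebook the call would be something like `forecast_errors(test['AveragePrice'], forecast1)`, since `test` is still indexed by `Date`.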

Future ideas: 
* Examine other predictive algorithms (XGBoost) 
* Try other data sets including stocks