In this notebook, I predict the average price of avocados using time-series modeling 
techniques.

In [None]:
# The usual suspects
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Cross-validation helpers live in sklearn.model_selection
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
# Ignore warnings (this isn't a good practice usually)
import warnings
warnings.filterwarnings("ignore")

In [None]:
# Avocados are green. :) 
GREEN_COLORMAP = sns.color_palette("Greens")

Let's start by loading the data and doing some time-series exploration.

# Loading data and EDA

In [None]:
DATA_PATH  = "../input/avocado.csv"
df = pd.read_csv(DATA_PATH, parse_dates=['Date'])

First, let's plot the avocado's average price over time.

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(20, 8))
df.set_index('Date').plot(y='AveragePrice', ax=ax, color=GREEN_COLORMAP[2])

Not bad for a first graph. However, things are a little bit packed. Possible solution: let's make 
a time-series plot per year.

In [None]:
# Get the number of years and create one plot per year.
# Notice that 2018 has fewer samples than the previous years.
years = df.year.unique()
number_years = len(years)
fig, axes = plt.subplots(number_years, 1, figsize=(12, 8))
for i, year in enumerate(years):
    # One green shade per year :)
    # Also, no line connecting the points and marker set to a dot
    # for enhanced readability.
    (df.set_index('Date')
       .loc[lambda df: df.year == year]
       .plot(y='AveragePrice', ax=axes[i], color=GREEN_COLORMAP[i],
             marker="o", linestyle=""))
    axes[i].legend_.remove()

fig.tight_layout()

It appears there are multiple points per day. How many exactly?

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(12, 8))
df.groupby('Date').size().plot(ax=ax)

In [None]:
df.Date.diff().value_counts()

We also notice that most observations are made once a week (hence the delta of -7 days). 
Why are there 108 observations per week though? Hint: look at the `region` and `type` columns.

In [None]:
fig, axes = plt.subplots(2, 1, figsize=(12, 8))
df.groupby('Date')['region'].nunique().plot(ax=axes[0])
df.groupby('Date')['type'].nunique().plot(ax=axes[1])

In [None]:
("Bingo, that's it: there are {} unique regions and {} unique" 
" types of Avocado ({})").format(df['region'].nunique(),
                                 df['type'].nunique(),
                                ' and '.join(df['type'].unique())) 

To wrap up this short EDA, let's check these regions and types.

In [None]:
fig, axes = plt.subplots(2, 1, figsize=(12, 8))

df['region'].value_counts().plot(kind='bar', ax=axes[0], 
                                 color=GREEN_COLORMAP)
df['type'].value_counts().plot(kind='bar', ax=axes[1], color=GREEN_COLORMAP)
fig.tight_layout()

# Time-series processing

Next, we will compute the "real" average avocado price (averaged over the different regions and types) and keep only
this column (in addition to `Date`, obviously). 

In [None]:
ts = df.groupby('Date')['AveragePrice'].mean().reset_index()

In [None]:
ts.sample(5)

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(12, 8))
ts.set_index('Date').plot(ax=ax, marker="o", linestyle="-", color=GREEN_COLORMAP[2])

Let's see what this time series looks like when resampled to a monthly frequency.

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(12, 8))
(ts.set_index('Date')
   .resample('1M')
   .mean()
   .plot(ax=ax, marker="o", linestyle="-", color=GREEN_COLORMAP[2]))

Some observations: 
    
* There are seasonal variations: prices are higher from July to October (roughly), possibly because demand is higher during these months.
* There are also yearly variations: an upward trend, probably due to higher demand?
* As mentioned earlier, the 2018 data stops in March. 

To finish this section, let's plot some simple statistics about the average monthly price: mean, standard deviation, median, min, and max values. 

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(12, 10))
(ts.set_index('Date')
   .assign(month=lambda df: df.index.month)
   .groupby('month')['AveragePrice'].agg(["mean", "std", "median", "min", "max"])
   .plot(ax=ax, marker="o"))
ax.set_xlabel('Month')

# Temporal train/test split

As in any ML task, we will start by dividing the dataset into train and test data. 
We won't look at the test dataset until the end. 
Also, since this is a time-series ML problem, we will split on a timestamp rather than at random (so we can't use [`train_test_split`](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) from sklearn with its default shuffling): 
data before 2018 belongs to the train dataset, and data from 2018 belongs to the test dataset. 

In [None]:
# Renaming the ts DataFrame's columns (you will see why soon) before temporal split
renamed_ts = ts.rename(columns={"Date": "ds", "AveragePrice": "y"})
train_ts = renamed_ts.loc[lambda df: df['ds'].dt.year < 2018, :]
test_ts = renamed_ts.loc[lambda df: df['ds'].dt.year == 2018, :]

In [None]:
train_ts.head()

In [None]:
train_ts.tail()
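
To make the reasoning behind the temporal split concrete, here is a minimal sketch on a made-up weekly series (the `toy` frame and its values are hypothetical, not the avocado data). It checks that splitting by year keeps every training date before every test date, which a shuffled split of the same size would almost certainly not do.

```python
import numpy as np
import pandas as pd

# Hypothetical weekly series covering 2017 and early 2018 (not the avocado data)
toy = pd.DataFrame({"ds": pd.date_range("2017-01-01", "2018-03-25", freq="7D")})
toy["y"] = np.linspace(1.0, 2.0, len(toy))

# Temporal split: same rule as above, applied to the toy frame
toy_train = toy.loc[toy["ds"].dt.year < 2018]
toy_test = toy.loc[toy["ds"].dt.year == 2018]
temporal_ok = toy_train["ds"].max() < toy_test["ds"].min()

# Shuffled split of the same sizes, for comparison
shuffled = toy.sample(frac=1.0, random_state=0)
shuffled_train = shuffled.iloc[:len(toy_train)]
shuffled_test = shuffled.iloc[len(toy_train):]
shuffled_ok = shuffled_train["ds"].max() < shuffled_test["ds"].min()

print("temporal split keeps the past before the future:", temporal_ok)
print("shuffled split keeps the past before the future:", shuffled_ok)
```

With a shuffled split, the model would be trained on dates that come after some of the dates it is evaluated on, which is a form of leakage.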

To assess the quality of the model's predictions, we will use the [**mean absolute error**](https://en.wikipedia.org/wiki/Mean_absolute_error). The lower this error, the better the model. 
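
As a quick reminder of what this metric computes, here is a minimal sketch on made-up numbers (the `mae` helper and the toy values are mine, for illustration only): the MAE is the mean of the absolute differences between true and predicted values.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error: the average of |true - predicted|."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.abs(y_true - y_pred).mean()

# Made-up prices: the absolute errors are 0.5, 0.0 and 1.0
toy_true = [1.0, 2.0, 3.0]
toy_pred = [1.5, 2.0, 2.0]
toy_mae = mae(toy_true, toy_pred)
print(toy_mae)
```

Since the errors are expressed in the same unit as the target, an MAE of 0.5 here would mean being off by half a dollar on average.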

Alright, time to kickstart the modeling! Let's begin with traditional models, i.e. statistical time-series models. 

# Statistical models

For that, let's use [`Prophet`](https://facebook.github.io/prophet/docs/quick_start.html) (this is why we renamed the `Date` and `AveragePrice` columns to `ds` and `y`), an open-source time-series analysis library developed at Facebook and unveiled about a year and a half ago. For more details, I recommend checking the [announcement](https://research.fb.com/prophet-forecasting-at-scale/) blog post.

In [None]:
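# The package was renamed from `fbprophet` to `prophet` in version 1.0
try:
    from prophet import Prophet
    from prophet.diagnostics import cross_validation, performance_metrics
except ImportError:
    from fbprophet import Prophet
    from fbprophet.diagnostics import cross_validation, performance_metrics

# Each CV forecast predicts up to 90 days past its cutoff date
HORIZON = "90days"
# Consecutive cutoff dates are spaced 7 days apart
PERIOD = "7days"

# Fit on the training data, then run Prophet's simulated historical forecasts
prophet_model = Prophet()
prophet_model.fit(train_ts)
prophet_cv_df = cross_validation(prophet_model, horizon=HORIZON, 
                                 period=PERIOD)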
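
To get a feel for what the horizon and period control, here is a simplified sketch of how rolling-origin cutoff dates can be laid out. This is my own approximation, not Prophet's code: the `rolling_origin_cutoffs` helper and the toy dates are hypothetical, and the 270-day initial training window is an assumption (three times the horizon, which I believe matches Prophet's default).

```python
import pandas as pd

def rolling_origin_cutoffs(dates, horizon, period, initial):
    """Simplified sketch of rolling-origin cutoff dates (not Prophet's own code)."""
    horizon = pd.Timedelta(horizon)
    period = pd.Timedelta(period)
    initial = pd.Timedelta(initial)
    cutoffs = []
    # The last cutoff leaves exactly one horizon of data to forecast
    cutoff = dates.max() - horizon
    # Walk backwards while at least `initial` worth of history remains
    while cutoff >= dates.min() + initial:
        cutoffs.append(cutoff)
        cutoff = cutoff - period
    return cutoffs[::-1]

# Hypothetical weekly dates spanning three years (not read from the CSV)
toy_dates = pd.date_range("2015-01-04", "2017-12-31", freq="7D")
toy_cutoffs = rolling_origin_cutoffs(toy_dates, "90 days", "7 days", "270 days")
print(len(toy_cutoffs), toy_cutoffs[0].date(), toy_cutoffs[-1].date())
```

For each cutoff, the model is refit on the data up to the cutoff and asked to forecast the following horizon, so a smaller period means more (and more overlapping) forecasts to evaluate.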

In [None]:
prophet_cv_df.head()

In [None]:
prophet_perf_df = performance_metrics(prophet_cv_df)

In [None]:
prophet_perf_df.sample(5)

In [None]:
try:
    from prophet.plot import plot_cross_validation_metric
except ImportError:
    from fbprophet.plot import plot_cross_validation_metric
plot_cross_validation_metric(prophet_cv_df, metric='mae');

How does the CV MAE evolve when varying the horizon (i.e. the number of days in the future to predict)? 

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(12, 8))
(prophet_perf_df.groupby('horizon')['mae']
                .mean()
                .plot(ax=ax, marker="o", color=GREEN_COLORMAP[2]))
ax.set_ylabel('MAE')

In [None]:
future_prophet_df = prophet_model.make_future_dataframe(periods=365)
predicted_prophet_df = prophet_model.predict(future_prophet_df)
prophet_model.plot(predicted_prophet_df);
prophet_model.plot_components(predicted_prophet_df);

# ML models

Before starting this section, we need to extract calendar features from the `ds` column. 
We will also add rolling mean prices (yearly and monthly). Notice that I approximate the last year using
the 52 previous points and the last month using the 4 previous points. Finally, I backfill missing data. 

In [None]:
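def add_calendar_features(df):
    """Add calendar features and rolling means of past prices.

    The rolling means are shifted by one row so that each feature only uses
    prices observed strictly before the current date.
    """
    return df.assign(
        month=lambda df: df['ds'].dt.month,
        week=lambda df: df['ds'].dt.isocalendar().week.astype(int),
        year=lambda df: df['ds'].dt.year,
        # Mean of the 4 previous weekly points (roughly the past month)
        past_month_mean_y=lambda df: (df['y'].shift(1)
                                             .rolling(window=4)
                                             .mean()
                                             .bfill()),
        # Mean of the 52 previous weekly points (roughly the past year)
        past_year_mean_y=lambda df: (df['y'].shift(1)
                                            .rolling(window=52)
                                            .mean()
                                            .bfill()),
    )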



augmented_ts = add_calendar_features(renamed_ts)
augmented_train_ts = augmented_ts.loc[lambda df: df['ds'].dt.year < 2018, :].drop('ds', axis=1)
augmented_test_ts = augmented_ts.loc[lambda df: df['ds'].dt.year == 2018, :].drop('ds', axis=1)

In [None]:
augmented_train_ts.head()

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(12, 8))
augmented_train_ts.plot(y='past_month_mean_y', ax=ax)
augmented_train_ts.plot(y='past_year_mean_y', ax=ax)
augmented_train_ts.plot(y='y', ax=ax)
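
The rolling-mean features can be illustrated on a toy series. The sketch below uses made-up values (the `toy_prices` name is hypothetical) and a window of 2 instead of 4 or 52. Shifting by one step before taking the rolling mean is one way to make sure each window only covers strictly earlier points; the leading gaps are then backfilled.

```python
import pandas as pd

# Made-up weekly prices, for illustration only
toy_prices = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Shift first so that the window at position i only sees positions < i
past_mean = toy_prices.shift(1).rolling(window=2).mean()

# The first rows have no complete window, so they are filled from the next valid value
past_mean_filled = past_mean.bfill()

print(pd.DataFrame({"y": toy_prices,
                    "past_mean": past_mean,
                    "past_mean_filled": past_mean_filled}))
```

Keep in mind that backfilling copies a later value into the first rows, so those rows still carry a bit of information from their own future.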

Alright, now we need to define a time-series compatible CV. For that, we will 
use `TimeSeriesSplit` from `sklearn`.

In [None]:
tscv = TimeSeriesSplit(n_splits=3)

If you are unfamiliar with CV for time-series, I highly recommend checking this blog post: https://robjhyndman.com/hyndsight/tscv/. 
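
To see what `TimeSeriesSplit` does in practice, here is a minimal sketch on eight made-up samples (the `toy_X` array is hypothetical). Each fold trains on an expanding window of past samples and validates on the block that comes right after it.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Eight made-up samples, assumed to be sorted by time
toy_X = np.arange(8).reshape(-1, 1)
toy_cv = TimeSeriesSplit(n_splits=3)

# Collect the row positions used for training and validation in each fold
folds = [(train_idx.tolist(), val_idx.tolist())
         for train_idx, val_idx in toy_cv.split(toy_X)]

for train_idx, val_idx in folds:
    print("train:", train_idx, "validate:", val_idx)
```

The validation indices always come after the training indices, so no fold is trained on data from its own future.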

As a model, let's try the TPOT AutoML tool and see what it gets. Notice that I use the negative MAE since sklearn expects a score (the higher the better) to optimize during CV.

In [None]:
from tpot import TPOTRegressor

# TODO: Try more generations and a bigger population size. 
# Be careful not to run out of time!

tpot_model = TPOTRegressor(generations=20, population_size=100, cv=tscv, 
                           scoring="neg_mean_absolute_error", 
                           n_jobs=2, verbosity=2)

In [None]:
tpot_model.fit(augmented_train_ts.drop('y', axis=1), 
               augmented_train_ts['y'])
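
As a side note on the scoring convention, here is a minimal sketch of how a negative MAE from sklearn can be turned back into a regular MAE. It uses a plain `LinearRegression` on synthetic data as a stand-in (an assumption for illustration; it is not the pipeline TPOT finds).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# Synthetic, time-ordered data: a noisy linear trend (made up for this sketch)
rng = np.random.RandomState(0)
toy_features = np.arange(40, dtype=float).reshape(-1, 1)
toy_target = 0.5 * toy_features.ravel() + rng.normal(scale=0.1, size=40)

# sklearn maximizes scores, so the MAE is reported with a minus sign
neg_mae_scores = cross_val_score(LinearRegression(), toy_features, toy_target,
                                 cv=TimeSeriesSplit(n_splits=3),
                                 scoring="neg_mean_absolute_error")

# Flip the sign to get back a regular (non-negative) MAE
cv_mae = -neg_mae_scores.mean()
print(neg_mae_scores, cv_mae)
```

A score closer to zero is better here, which is what lets the same "higher is better" search logic work for an error metric.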

Based on the CV score, tpot is the winner!
Let's see if this is true on the test dataset.

# Test evaluation

Let's plot the predictions for each model (alongside the true values). For that, we need to prepare
the predictions DataFrame. Also, we will compute the MAE for each model.

In [None]:
test_timestamps = test_ts.ds.values

In [None]:
predicted_prophet_s = predicted_prophet_df.loc[lambda df: df['ds']
                                               .isin(test_timestamps), "yhat"]
predicted_tpot_s = tpot_model.predict(augmented_test_ts.drop("y", axis=1))

In [None]:
assert predicted_tpot_s.shape == predicted_prophet_s.shape
assert predicted_tpot_s.shape == test_ts["y"].shape

In [None]:
predictions_df = pd.DataFrame({'tpot': predicted_tpot_s, 
                              'prophet': predicted_prophet_s,
                              'true': test_ts['y'].values,
                              'Date': test_ts['ds'].values})

In [None]:
print("MAE for tpot on the test dataset is: {}".format(
    (predictions_df['tpot'] - predictions_df['true']).abs().mean(axis=0)))
print("MAE for prophet on the test dataset is: {}".format(
    (predictions_df['prophet'] - predictions_df['true']).abs().mean(axis=0)))

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(12, 10))
predictions_df.set_index('Date').plot(marker='o', ax=ax)

That's it for now. I hope you have enjoyed exploring this dataset and some of the time-series modeling techniques.

To be continued, stay tuned!

Other ideas: 

* Better explanation and investigation of CV for the Prophet model.
* Tuning hyperparameters for the Prophet model.
* RNN models. 
* More generations for TPOT.

Also, since I am not an expert in time-series modeling, let me know if there is any mistake or data leakage.
As usual, enjoy!