# Predicting avocado prices using a neural network


## Introduction
Avocados are definitely trending these days! And why not, they are healthy. That makes the avocado market an interesting one to look at.  
Our goal is to get a good picture of that market by trying to predict the average price.

## Our method
The way I usually work on this kind of data is iterative: the goal is to produce a first 'silly' model and then improve it so that it generalizes better.  
Here we will be using a neural network.

## The Data

In [None]:
# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

## Load the data


In [None]:
import pandas as pd
import numpy as np

# seed
np.random.seed(1337)

# load avocado file
avocadoCSV = pd.read_csv('/kaggle/input/avocado-prices/avocado.csv', parse_dates=['Date'], dtype={"region": "category","type": "category","year": "category"})  



## A bit of data visualization

Let's check the distribution of avocado prices

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

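sns.set_style('darkgrid')  # replaces the 'seaborn-darkgrid' matplotlib style, which was renamed in matplotlib 3.6
fig1 = plt.figure(1, figsize=(14,7))

# histplot replaces the deprecated distplot; stat='density' keeps a density y-axis
sns.histplot(avocadoCSV['AveragePrice'], kde=True, stat='density', color='b')
plt.xlabel('Average Price')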


From this plot we can see that most avocados are priced between 0.7 and 1.7, and we can interpret the two peaks as the most common prices. The fact that the prices fall in such a small interval may also cause some issues for our model.

To get a more specific idea of the data, let's look at the distribution by type (organic, conventional).

In [None]:
fig2 = plt.figure(2, figsize=(5,5))
sns.boxplot(x="type", y="AveragePrice", data=avocadoCSV, palette="Set1")

fig3 = plt.figure(3, figsize=(8,7))
sns.boxplot(x="type", y="AveragePrice", hue='year', data=avocadoCSV, palette="Set1")

Big news! Organic products are generally more expensive :)  
It also seems that 2017 is the year when avocados were the most expensive. The interesting thing to notice is that the boxplots for conventional avocados show less variance than the ones for organic. My interpretation is that the 'quality' of an organic product depends on a lot of natural causes (bad season, different expiration dates by species, etc.), which can strongly affect the price.  

Another important point: because the dataset only covers a few years, it is difficult to extract trends.

## Data preprocessing  

First we shuffle the rows. The goal is to use the dataset for supervised training in order to predict the average price.

In [None]:
avocadoCSV = avocadoCSV.sample(frac=1) # randomize sample
avocadoCSV.head()

In [None]:
print("dataset shape = ",avocadoCSV.shape)
avocadoCSV.describe()

Because the average price might show some seasonality, it is important to add a new column for the month.  
  
  
Note that we also leave out the 'Total Bags' column when selecting features, because it is redundant with the columns 'Small Bags', 'Large Bags' and 'XLarge Bags'.

In [None]:
avocadoCSV['month'] = avocadoCSV['Date'].map(lambda x: x.month)
avocadoCSV['month'] = avocadoCSV['month'].astype('category')

#avocado=avocadoCSV[[ 'month' ,'year', 'region', 'Small Bags', 
#'Large Bags','XLarge Bags', 'type','AveragePrice']]

Now we want our model to predict the average price from all the selected features. First, however, we need to one-hot encode the categorical columns, so that the network can assign a weight to each category value.  

For instance, we have 2 types of avocados, organic and conventional. Instead of one column holding the two possible values, we want two columns, each one representing a value:  
type -> (type_conventional, type_organic)

We do the same for month, year and region.
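To make the encoding concrete, here is a minimal sketch on a made-up three-row frame (the values are invented, only the `type` column name comes from the dataset). Passing `dtype=float` is an addition of mine to keep the indicator columns numeric.

```python
import pandas as pd

# Toy frame with a categorical 'type' column (values are made up).
toy = pd.DataFrame({
    "type": pd.Categorical(["organic", "conventional", "organic"]),
    "AveragePrice": [1.50, 1.00, 1.75],
})

# One indicator column per category, named <prefix>_<category>.
typeDummies = pd.get_dummies(toy["type"], prefix="type", dtype=float)

# Replace the original column with its indicator columns.
encoded = pd.concat([toy.drop(columns=["type"]), typeDummies], axis=1)

print(encoded.columns.tolist())
# ['AveragePrice', 'type_conventional', 'type_organic']
```

Each row has exactly one `1.0` among the `type_*` columns, so the network gets one input per category value.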

In [None]:
def makeAvocadoWithCategory(data, categoryColumns, fieldsToKeep):

	allFields = categoryColumns + fieldsToKeep
	df = data[allFields]

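	# dtype=float keeps the indicator columns numeric (recent pandas versions default to bool)
	dfCategories = [ pd.get_dummies(df[column], prefix=column, dtype=float) for column in categoryColumns ]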
	df = pd.concat([df] + dfCategories, axis=1)
	df = df.drop(columns=categoryColumns)

	return df

avocado = makeAvocadoWithCategory(
	avocadoCSV,
	['month' ,'year', 'region','type'],
	['Small Bags','Large Bags','XLarge Bags','AveragePrice']
)

In [None]:
avocado.shape

In [None]:
avocado.columns

## The Neural Network

Now we can make the train and the test sets.

In [None]:
def makeTrainAndTestSet(data):
	dataAveragePrice = data['AveragePrice']
	dataNoAveragePrice = data.drop(columns=['AveragePrice'])

	dataTrain = dataNoAveragePrice[:15000]
	dataYTrain = dataAveragePrice[:15000]


	# start at row 15000 so that no row is skipped between the train and test sets
	dataTest = dataNoAveragePrice[15000:]
	dataYTest = dataAveragePrice[15000:]
	return dataTrain, dataYTrain, dataTest,dataYTest

avocadoTrain, avocadoYTrain, avocadoTest, avocadoYTest = makeTrainAndTestSet(avocado)

In [None]:
avocadoTrain.shape

In [None]:
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Flatten

model = Sequential()
model.add(Dense(4, activation='relu', input_dim=75))
model.add(Dense(6, activation='relu'))
model.add(Dense(10, activation='relu'))
model.add(Dense(6, activation='relu'))
model.add(Dense(4, activation='relu'))
model.add(Dense(1))
model.compile('adam', loss='mean_squared_error')

In [None]:
history1 = model.fit(avocadoTrain, avocadoYTrain, epochs=15)

In [None]:
def makePredictSummary(modelTo, XtestData, YtestData):
	pred = modelTo.predict(XtestData).T[0]
	real = YtestData.values
	# compute relative error, as a percentage of the real price
	err = 100 * np.abs((real - pred) / real)
	predictionSummary = pd.DataFrame({'real': real, 'pred': pred, 'err(%)': err})

	return predictionSummary

summary = makePredictSummary(model, avocadoTest, avocadoYTest)
summary[:20]
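The summary lists one relative error per row. To compare models it helps to collapse those errors into a single number, for example the mean absolute percentage error. This is a small sketch with made-up prices, not the actual results of the model:

```python
import numpy as np

real = np.array([1.00, 1.50, 2.00])  # made-up true prices
pred = np.array([1.10, 1.20, 2.00])  # made-up predictions

# Relative error of each prediction, as a fraction of the true price.
relErr = np.abs((real - pred) / real)

# Mean absolute percentage error over all rows.
mape = 100 * relErr.mean()

print(np.round(100 * relErr, 1))
print(round(mape, 1))
```

Here the three rows are off by about 10%, 20% and 0%, so the mean error is about 10%.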

## Tuning the model
### First remarks


1. The model can return negative values
2. Because the prices are so close to each other, the model sometimes has difficulty telling avocados apart
3. If you rerun the model you'll see that the results are not stable
 * The model is very sensitive to the number of epochs
 * With a different number of epochs, you can see that at some point the model just returns constant values
4. Some features need to be adjusted
 * This is the case for the region features: we need a more abstract representation of them, otherwise the model will overfit
 * This one is a bit tricky, but the year should not be turned into a categorical variable. What we want instead is to capture the trend up to the current year (see [ARIMA](https://en.wikipedia.org/wiki/Autoregressive%E2%80%93moving-average_model) or any other time series analysis method). Honestly, I don't know if it's possible for that one :)
5. The data doesn't have a lot of rows
 
### Change the model

First we're going to improve the model

In [None]:
from keras.layers import BatchNormalization

model2 = Sequential()
model2.add(Dense(6, activation='relu', input_dim=75))
model2.add(Dense(6, activation='relu'))
model2.add(BatchNormalization())
model2.add(Dense(10, activation='relu'))
model2.add(Dropout(0.25))
model2.add(Dense(16, activation='elu'))
model2.add(Dense(10, activation='elu'))
model2.add(Dropout(0.5))
model2.add(Dense(6, activation='relu'))
model2.add(BatchNormalization())
model2.add(Dense(4, activation='relu'))
model2.add(Dense(4, activation='relu'))
model2.add(Dense(1))

model2.compile("adam", loss='mean_squared_error')

In [None]:
history2 = model2.fit(avocadoTrain, avocadoYTrain, epochs=20, batch_size=64)

Ok, so my idea here is to add more layers, with more neurons. Because I often noticed that my gradient flow got killed, I had to:
1. Change the relu in the deep layers to elu
 * This gives me an extended non-saturating regime compared to relu
 * I also have the intuition that in the deep layers you want the gradient to be *passed*, whereas in the first or the last layers you want to *summarize* the information. That's why you need activation functions like relu there, which *cut* the space and cut the gradient flow with their saturation regime
2. Add Dropout layers to prevent overfitting, and batch normalization to stabilize the model  
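The difference between the two activations is easiest to see on their gradients. The sketch below writes both functions by hand in NumPy (it does not call Keras) and assumes `alpha = 1.0` for elu; the function names are mine.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def reluGrad(x):
    # Gradient is exactly 0 for negative inputs: the unit passes nothing back.
    return (x > 0).astype(float)

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def eluGrad(x, alpha=1.0):
    # Gradient stays positive for negative inputs: alpha * exp(x).
    return np.where(x > 0, 1.0, alpha * np.exp(x))

x = np.array([-2.0, -0.5, 0.5, 2.0])
print(reluGrad(x))               # zero on the negative side, one on the positive side
print(np.round(eluGrad(x), 3))   # small but non-zero on the negative side
```

For negative inputs relu blocks the gradient completely, while elu still lets a fraction of it through, which is the behavior wanted in the deep layers.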


I also set the batch size to 64 (the Keras default is 32) to get a more accurate mini-batch estimate of the gradient at each step, which is important given the size of the data.

In [None]:
summary2 = makePredictSummary(model2, avocadoTest, avocadoYTest)
summary2[:20]

Because the average prices are so close together, I suggest predicting an exponential transform of the price instead, which spreads the values out:
$$expAveragePrice = 3^{AveragePrice}$$

In [None]:
avocadoYTrainExp = np.power(3, avocadoYTrain)
avocadoYTestExp = np.power(3, avocadoYTest)
historyExp2 = model2.fit(avocadoTrain, avocadoYTrainExp, epochs=20, batch_size=64)

In [None]:
summaryExp = makePredictSummary(model2, avocadoTest, avocadoYTestExp)
summaryExp
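The predictions in this summary live on the transformed scale $3^{AveragePrice}$. To read them as prices again, the transform has to be inverted with a base-3 logarithm. A minimal sketch with made-up prices; the helper names `toExpPrice` and `fromExpPrice` are hypothetical:

```python
import numpy as np

def toExpPrice(price):
    # Forward transform used for the training target: 3 ** price.
    return np.power(3.0, price)

def fromExpPrice(expPrice):
    # Inverse transform: log base 3, only defined for strictly positive values.
    return np.log(expPrice) / np.log(3.0)

prices = np.array([0.7, 1.0, 1.7])  # made-up average prices
roundTrip = fromExpPrice(toExpPrice(prices))

print(np.allclose(roundTrip, prices))  # True
```

Note that a prediction of zero or below cannot be mapped back to a price this way, so such outputs would need to be handled separately.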

 
### Learning rate

Next we are going to tune the learning rate.  
For the first model, the training loss below drops very quickly, which I read as a sign of overfitting (a validation loss would be needed to confirm it).


In [None]:
plt.figure(4, figsize = (7,4))

plt.plot(history1.history['loss'], '-p', markersize=6, linewidth=2)
plt.title('First Model Loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['learning rate 0.001'], loc='upper left')

We can see that the loss is dropping too fast!  
Let's see how it evolves with our second model and different learning rates.

In [None]:
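from keras import optimizers
from keras.models import clone_model

# Train a freshly initialized copy of the second model for each learning rate,
# so that the two loss curves start from comparable conditions.
modelLr2 = clone_model(model2)
modelLr2.compile(optimizer=optimizers.Adam(learning_rate=0.0001), loss='mean_squared_error')
historyExp2_2 = modelLr2.fit(avocadoTrain, avocadoYTrainExp, epochs=20, batch_size=64, verbose=0)

modelLr3 = clone_model(model2)
modelLr3.compile(optimizer=optimizers.Adam(learning_rate=0.0005), loss='mean_squared_error')
historyExp2_3 = modelLr3.fit(avocadoTrain, avocadoYTrainExp, epochs=20, batch_size=64, verbose=0)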



In [None]:
plt.figure(5, figsize = (9,4))
plt.plot(historyExp2_2.history['loss'], '-p', markersize=6, linewidth=2)
plt.plot(historyExp2_3.history['loss'], '-p', markersize=6, linewidth=2)
plt.title('Second Model Loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['0.0001', '0.0005'], loc='upper left')

From the two curves we can see that a good learning rate would be between 0.0001 and 0.0005. Let's choose 0.0003 with 20 epochs.  
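The curves above only show the training loss. Keras `fit` also accepts a `validation_split` argument, and the returned history then contains a `val_loss` list next to `loss`. A hypothetical helper like the one below (the name `summarizeLosses` and the numbers are made up) could then be used to see where the validation loss stops improving:

```python
def summarizeLosses(trainLoss, valLoss):
    """Return the epoch with the lowest validation loss and the final val/train gap."""
    bestEpoch = min(range(len(valLoss)), key=lambda i: valLoss[i])
    finalGap = valLoss[-1] - trainLoss[-1]
    return bestEpoch, finalGap

# Made-up curves shaped like history.history['loss'] and history.history['val_loss'].
trainLoss = [0.90, 0.50, 0.30, 0.20, 0.15]
valLoss = [0.95, 0.60, 0.45, 0.50, 0.58]

bestEpoch, finalGap = summarizeLosses(trainLoss, valLoss)
print(bestEpoch)  # 2
```

In this invented example the training loss keeps falling while the validation loss rises after epoch 2, which is the pattern one would expect from overfitting.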

## Remove the year column (TBC)

As I said earlier, treating the year as a category is not the best thing to do. Instead we want to incorporate the trend into the dataset.

In [None]:
# remove the year column as a category

avocadoNew = makeAvocadoWithCategory(
	avocadoCSV,
	['month' ,'region','type'],
	['Small Bags','Large Bags','XLarge Bags','AveragePrice', 'year']
)

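# 'year' was read as a category; cast it to a number so it can be used as a single numeric feature
avocadoNew['year'] = avocadoNew['year'].astype(int)

# split the new frame (not the original one-hot encoded `avocado`)
avocadoNewTrain, avocadoNewYTrain, avocadoNewTest, avocadoNewYTest = makeTrainAndTestSet(avocadoNew)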

avocadoNewYTrain = np.power(3, avocadoNewYTrain)
avocadoNewYTest = np.power(3, avocadoNewYTest)

Our new model

In [None]:
model3 = Sequential()
# take the input size from the training frame instead of hard-coding it
model3.add(Dense(6, activation='relu', input_dim=avocadoNewTrain.shape[1]))
model3.add(Dense(6, activation='relu'))
model3.add(BatchNormalization())
model3.add(Dense(10, activation='relu'))
model3.add(Dropout(0.25))
model3.add(Dense(16, activation='elu'))
model3.add(Dense(10, activation='elu'))
model3.add(Dropout(0.5))
model3.add(Dense(6, activation='relu'))
model3.add(BatchNormalization())
model3.add(Dense(4, activation='relu'))
model3.add(Dense(4, activation='relu'))
model3.add(Dense(1))

adam = optimizers.Adam(learning_rate=0.0003) #new learning rate
model3.compile(adam, loss='mean_squared_error')

## To be improved

Add new features (regional features)  
Incorporate a yearly trend feature  
Look into RNNs to get an ARIMA-like behavior in the model