In this study, we will try to predict the avocado's average price from a set of features (Total Bags, Date, Type, Year, Region…).

The variables of the dataset are the following:

* Categorical: 'region', 'type'
* Date: 'Date'
* Numerical: 'Unnamed: 0', 'Total Volume', '4046', '4225', '4770', 'Total Bags', 'Small Bags', 'Large Bags', 'XLarge Bags', 'year'
* Target: 'AveragePrice'

The less obvious numerical variables are explained below:

* 'Unnamed: 0': just a redundant index column that will be removed later
* 'Total Volume': total sales volume of avocados
* '4046': total sales volume of small/medium Hass avocados
* '4225': total sales volume of large Hass avocados
* '4770': total sales volume of extra large Hass avocados
* 'Total Bags': total number of bags sold
* 'Small Bags': total number of small bags sold
* 'Large Bags': total number of large bags sold
* 'XLarge Bags': total number of XLarge bags sold
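
Since the numeric column names are hard to read, one option is to rename them. The sketch below does this on a toy frame with made-up values; the readable names are hypothetical and are not part of the dataset. The rest of this notebook keeps the original column names.

```python
import pandas as pd

# Toy frame that reuses the dataset's numeric column names (values are made up).
toy = pd.DataFrame({
    "4046": [1036.74, 674.28],
    "4225": [54454.85, 44638.81],
    "4770": [48.16, 58.33],
})

# Hypothetical readable names, chosen only for illustration.
plu_names = {
    "4046": "small_medium_hass",
    "4225": "large_hass",
    "4770": "xlarge_hass",
}

# rename returns a new DataFrame and leaves `toy` untouched.
renamed = toy.rename(columns=plu_names)
print(renamed.columns.tolist())
```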

**So let's start by importing our usual suspects!**


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
%matplotlib inline

Read in the Avocado Prices CSV file as a DataFrame called df:

In [None]:
df= pd.read_csv("../input/avocado.csv")

Let's check the head of our data:

In [None]:
df.head()

The "Unnamed: 0" feature just duplicates the row index, so there is no point in keeping it. Let's remove it!

In [None]:
df.drop('Unnamed: 0',axis=1,inplace=True)

Let's check the head again to make sure that the "Unnamed: 0" feature has been removed:

In [None]:
df.head()

Great! Now let's use the info() method to get a general idea of our data:

In [None]:
df.info()

As a first observation, we are lucky: there are no missing values (each of the 13 columns has 18249 non-null entries).
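
A more direct way to confirm this is to count the missing entries per column. Here is a minimal sketch on a toy frame where I removed two values on purpose (the values are not from the dataset); on the real df, the same call should report zeros everywhere if the observation above holds.

```python
import numpy as np
import pandas as pd

# Toy frame with two deliberately missing entries (illustrative values only).
toy = pd.DataFrame({
    "AveragePrice": [1.33, np.nan, 0.93],
    "Total Volume": [64236.62, 54876.98, 118220.22],
    "type": ["conventional", "organic", None],
})

# isnull() marks missing cells; sum() counts them per column.
missing_per_column = toy.isnull().sum()
total_missing = int(missing_per_column.sum())
print(missing_per_column)
print("Total missing:", total_missing)
```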
Now let's do some feature engineering on the Date feature so that we can use the month and the day as columns when building our machine learning model later. (I don't extract the year because it is already in the DataFrame.)

In [None]:
df['Date']=pd.to_datetime(df['Date'])
df['Month']=df['Date'].apply(lambda x:x.month)
df['Day']=df['Date'].apply(lambda x:x.day)
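
As an aside, pandas also offers the vectorized `.dt` accessor for this kind of extraction. The sketch below uses a toy frame with made-up dates and checks that `.dt.month` gives the same values as the row-by-row `apply` used above.

```python
import pandas as pd

# Toy dates, only for illustration.
toy = pd.DataFrame({"Date": ["2015-12-27", "2016-01-03", "2017-03-19"]})
toy["Date"] = pd.to_datetime(toy["Date"])

# Row-by-row version, same pattern as in the cell above.
month_apply = toy["Date"].apply(lambda x: x.month)

# Vectorized version using the .dt accessor.
toy["Month"] = toy["Date"].dt.month
toy["Day"] = toy["Date"].dt.day
print(toy)
```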

Let's check the head to see what we have done:

In [None]:
df.head()

Now let's do some plots!
I'll start by plotting the avocado's average price over time, using the Date column:

In [None]:
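# Average the numeric columns per date; numeric_only skips the text columns (region, type)
byDate=df.groupby('Date').mean(numeric_only=True)
plt.figure(figsize=(12,8))
byDate['AveragePrice'].plot()
plt.title('Average Price')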
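
To make the grouping step concrete, here is a small sketch on a toy frame (made-up prices, two regions per date). Several rows share the same date, and the groupby collapses them into one mean price per date. Selecting the price column before calling mean() also keeps text columns such as region out of the average.

```python
import pandas as pd

# Toy data: two regions reported on each of two dates (values are made up).
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2015-01-04", "2015-01-04", "2015-01-11", "2015-01-11"]),
    "region": ["Albany", "Boston", "Albany", "Boston"],
    "AveragePrice": [1.0, 2.0, 1.5, 2.5],
})

# One mean price per date, indexed by Date.
mean_price_by_date = toy.groupby("Date")["AveragePrice"].mean()
print(mean_price_by_date)
```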

Cool, right? Now let's get an idea of the relationships between our features (correlation):

In [None]:
plt.figure(figsize=(12,6))
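# numeric_only=True leaves the text columns (region, type) out of the correlation matrix
sns.heatmap(df.corr(numeric_only=True),cmap='coolwarm',annot=True)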
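
If the heatmap is hard to read, the target's column of the correlation matrix can also be printed as plain numbers. The sketch below is built on synthetic data that I constructed so that the price is unrelated to the features while two features are nearly copies of each other; it does not use the real dataset. It assumes a pandas version where `corr` accepts `numeric_only`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
volume = rng.normal(size=n)

# Synthetic frame: price is independent noise, Total Bags is almost a copy of Total Volume.
toy = pd.DataFrame({
    "AveragePrice": rng.normal(size=n),
    "Total Volume": volume,
    "Total Bags": volume + 0.1 * rng.normal(size=n),
    "type": ["conventional", "organic"] * (n // 2),
})

corr = toy.corr(numeric_only=True)

# Correlation of each feature with the target, sorted from lowest to highest.
target_corr = corr["AveragePrice"].drop("AveragePrice").sort_values()
print(target_corr)

# Correlation between the two features themselves.
print(corr.loc["Total Volume", "Total Bags"])
```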

As we can see from the heatmap above, none of the features is strongly correlated with the AveragePrice column; instead, most of them are correlated with each other.
So now I am a bit worried, because that will not help us get a good model. Let's try and see.
First we have to do some feature engineering on the categorical features: region and type.

In [None]:
df['region'].nunique()



In [None]:
df['type'].nunique()

As we can see, we have 54 regions and 2 unique types, so it's going to be easy to transform the type feature into dummy variables. For the region it would be a bit more complex, so I decided to drop the entire column.
I will drop the Date feature as well, because I already have 3 other columns for the year, month and day.

In [None]:
df_final=pd.get_dummies(df.drop(['region','Date'],axis=1),drop_first=True)
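
To see what `get_dummies` with `drop_first=True` does to a two-level column like type, here is a minimal sketch on a toy frame (made-up prices). Only one indicator column is kept, because the second level is implied when the indicator is 0.

```python
import pandas as pd

# Toy frame with the two type levels (prices are made up).
toy = pd.DataFrame({
    "AveragePrice": [1.33, 1.79, 0.93],
    "type": ["conventional", "organic", "conventional"],
})

# drop_first=True drops the first level (conventional) and keeps type_organic.
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded)
```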

In [None]:
df_final.head()

In [None]:
df_final.tail()

Now our data is ready! Let's apply our first model, which is going to be Linear Regression, because our target variable 'AveragePrice' is continuous.
Let's now begin to train our regression model! We first need to split our data into an X array that contains the features to train on, and a y array with the target variable.

In [None]:
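from sklearn.model_selection import train_test_split

# Features: every column except the target
X=df_final.drop('AveragePrice',axis=1)
y=df_final['AveragePrice']
# Hold out 20% of the rows for testing; random_state makes the split reproducible
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=42)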
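
As a quick sanity check on what the split produces, here is a sketch on a toy array of 100 rows (not the avocado data). With `test_size=0.2` we should get 80 training rows and 20 test rows, with no row in both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 2 features; the target is just the row number.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.arange(100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)
print(X_tr.shape, X_te.shape)
```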

Creating and Training the Model

In [None]:
from sklearn.linear_model import LinearRegression
lr=LinearRegression()
lr.fit(X_train,y_train)
pred=lr.predict(X_test)

In [None]:
from sklearn import metrics
print('MAE:', metrics.mean_absolute_error(y_test, pred))
print('MSE:', metrics.mean_squared_error(y_test, pred))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, pred)))
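
For reference, the three metrics above are simple to compute by hand. The sketch below uses four made-up values so that the arithmetic is easy to follow; it only needs NumPy.

```python
import numpy as np

# Four made-up true values and predictions.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.5, 2.0, 2.0, 5.0])

errors = y_true - y_hat            # [-0.5, 0.0, 1.0, -1.0]

mae = np.mean(np.abs(errors))      # mean of absolute errors
mse = np.mean(errors ** 2)         # mean of squared errors
rmse = np.sqrt(mse)                # square root of the MSE, in the target's units

print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)
```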

The RMSE looks low, but a low error on its own does not prove that we have a good model, so let's check further.
Let's plot y_test against the predictions:

In [None]:
plt.scatter(x=y_test,y=pred)

As we can see, the points do not fall on a straight line, so I am not sure that this is the best model we can apply to our data.
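
One way to judge whether an RMSE is really "low" is to compare it with a baseline that always predicts the training mean, and to look at R² as well. The sketch below does this on synthetic data that I generated for illustration (it is not the avocado data), using scikit-learn's `DummyRegressor` and `r2_score`.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: the target depends linearly on the first feature plus noise.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(500, 3))
y_toy = 1.4 + 0.3 * X_toy[:, 0] + 0.1 * rng.normal(size=500)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

# Baseline: always predict the mean of the training target.
baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
model = LinearRegression().fit(X_tr, y_tr)

rmse_baseline = np.sqrt(mean_squared_error(y_te, baseline.predict(X_te)))
rmse_model = np.sqrt(mean_squared_error(y_te, model.predict(X_te)))
r2_model = r2_score(y_te, model.predict(X_te))

print("Baseline RMSE:", rmse_baseline)
print("Model RMSE:", rmse_model)
print("Model R2:", r2_model)
```

If the model's RMSE is not clearly below the baseline's, the model is not learning much from the features, however small the number looks.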

Let's try working with the DecisionTreeRegressor model:


In [None]:
from sklearn.tree import DecisionTreeRegressor
dtr=DecisionTreeRegressor()
dtr.fit(X_train,y_train)
pred=dtr.predict(X_test)
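
A small note on reproducibility: a decision tree permutes the features at random at each split, so when several splits are equally good, two fits can differ unless `random_state` is fixed. The sketch below, on synthetic data that I made up for illustration, shows that fixing `random_state` gives identical predictions across two fits.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, only for illustration.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 4))
y_toy = X_toy[:, 0] ** 2 + 0.1 * rng.normal(size=300)

# Two trees fitted with the same random_state on the same data.
tree_a = DecisionTreeRegressor(random_state=0).fit(X_toy, y_toy)
tree_b = DecisionTreeRegressor(random_state=0).fit(X_toy, y_toy)

same_predictions = np.array_equal(tree_a.predict(X_toy), tree_b.predict(X_toy))
print("Identical predictions:", same_predictions)
```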

In [None]:
plt.scatter(x=y_test,y=pred)
plt.xlabel('Y Test')
plt.ylabel('Predicted Y')

Nice, here we can see that the points nearly form a straight line; in other words, this model does better than the Linear Regression model. To be more sure, let's check the RMSE:

In [None]:
print('MAE:', metrics.mean_absolute_error(y_test, pred))
print('MSE:', metrics.mean_squared_error(y_test, pred))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, pred)))

Very nice, our RMSE is lower than the one we got with Linear Regression. Now I am going to try one last model, the RandomForestRegressor, to see if I can improve my predictions for this data:

In [None]:
from sklearn.ensemble import RandomForestRegressor
rdr = RandomForestRegressor()
rdr.fit(X_train,y_train)
pred=rdr.predict(X_test)

In [None]:
print('MAE:', metrics.mean_absolute_error(y_test, pred))
print('MSE:', metrics.mean_squared_error(y_test, pred))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, pred)))

As we can see, the RMSE is lower than for the two previous models, so the RandomForestRegressor is the best model in this case.
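
To compare the three models side by side instead of scrolling between cells, the RMSE values can be collected in a dictionary. Here is a sketch on synthetic data (generated for illustration, so the ranking it prints says nothing about the avocado data); the `n_estimators` and `random_state` values are my own choices.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data with a non-linear part, only for illustration.
rng = np.random.default_rng(0)
X_toy = rng.uniform(-2, 2, size=(400, 3))
y_toy = np.sin(X_toy[:, 0]) + 0.5 * X_toy[:, 1] + 0.1 * rng.normal(size=400)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

models = {
    "LinearRegression": LinearRegression(),
    "DecisionTreeRegressor": DecisionTreeRegressor(random_state=0),
    "RandomForestRegressor": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Fit each model on the same split and store its test RMSE.
rmse_by_model = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    mse = mean_squared_error(y_te, model.predict(X_te))
    rmse_by_model[name] = float(np.sqrt(mse))

# Print the models from lowest to highest RMSE.
for name, rmse in sorted(rmse_by_model.items(), key=lambda item: item[1]):
    print(f"{name}: {rmse:.3f}")
```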

In [None]:
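# distplot is deprecated in recent seaborn versions; histplot with kde=True draws a similar plot
sns.histplot(y_test-pred,bins=50,kde=True)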

Notice that our residuals look approximately normally distributed. That's a good sign: the errors show no obvious systematic pattern, which suggests that our model is a reasonable choice for the data.
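
The visual impression can be backed up with a few summary numbers. The sketch below uses synthetic residuals that I drew from a normal distribution (they stand in for `y_test-pred` and are not real results): for roughly normal residuals, the mean, skewness and excess kurtosis should all be close to zero.

```python
import numpy as np
import pandas as pd

# Synthetic residuals standing in for y_test - pred (not real results).
rng = np.random.default_rng(0)
residuals = pd.Series(rng.normal(loc=0.0, scale=0.2, size=5000))

summary = {
    "mean": residuals.mean(),             # close to 0 if there is no systematic bias
    "std": residuals.std(),               # typical size of an error
    "skew": residuals.skew(),             # close to 0 for a symmetric distribution
    "excess_kurtosis": residuals.kurt(),  # close to 0 for normal-like tails
}
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```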

In [None]:
data = pd.DataFrame({'Y Test':y_test , 'Pred':pred},columns=['Y Test','Pred'])
sns.lmplot(x='Y Test',y='Pred',data=data,palette='rainbow')
data.head()