# Avocado Prices

DataSet : https://www.kaggle.com/neuromusic/avocado-prices 

It is a well-known fact that Millennials LOVE Avocado Toast. It's also a well-known fact that all Millennials live in their parents' basements.

Clearly, they aren't buying homes because they are buying too much Avocado Toast!

But maybe there's hope… if a Millennial could find a city with cheap avocados, they could live out the Millennial American Dream.

Here we compare a Linear Regression model with an SVR (Support Vector Regression) model.

The goal of this project is to predict the average price of avocados from the features we have, such as:

* Type     
* Bags(4 units) vs Bundle(one unit)     
* Region      
* Volume      
* Size     
* Years

## Column Description

* Date - The date of the observation   --> Will not be using this feature.
* AveragePrice - the average price of a single avocado    --> Target
* Total Volume - Total number of avocados sold (small Hass + Large Hass + XLarge Hass + Total Bags)
* 4046 - Total number of avocados with PLU 4046 sold  (Small Hass)
* 4225 - Total number of avocados with PLU 4225 sold  (Large Hass)
* 4770 - Total number of avocados with PLU 4770 sold  (XLarge Hass)
* Total Bags = Small Bags + Large Bags + XLarge Bags 
* type - conventional or organic
* year - the year
* region - the city or region of the observation

# Import Library

In [None]:
# Import Library
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import seaborn as sns
from sklearn import metrics
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
from sklearn import svm



# Load Data

In [None]:
df = pd.read_csv('../input/avocado-prices/avocado.csv', usecols=range(1,14))

In [None]:
df.head()

In [None]:
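# Sanity check with values read off one row of df.head():
# Total Bags   = Small Bags + Large Bags + XLarge Bags -> 8603.62 + 93.25 + 0
# Total Volume = 4046 + 4225 + 4770 + Total Bags
1036.74 + 54454.85 + 48.16 + 8696.87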
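The arithmetic above checks one row by hand. A minimal sketch of the same check for a whole DataFrame is shown here; `volume_gap` is a hypothetical helper, and the one-row frame is only an illustration (its `Total Volume` value is assumed to equal the sum being checked). On the real data, `volume_gap(df)` would show how closely each row follows the column description.

```python
import pandas as pd

def volume_gap(frame: pd.DataFrame) -> pd.Series:
    """Return Total Volume minus the sum of its documented components."""
    parts = frame[["4046", "4225", "4770", "Total Bags"]].sum(axis=1)
    return frame["Total Volume"] - parts

# Illustrative one-row frame; the Total Volume figure is an assumption
sample = pd.DataFrame({
    "Total Volume": [64236.62],
    "4046": [1036.74],
    "4225": [54454.85],
    "4770": [48.16],
    "Total Bags": [8696.87],
})

gap = volume_gap(sample)
print(gap.abs().max())  # a value near zero means the row adds up
```

Rows with a large gap would be worth inspecting before trusting `Total Volume` as a pure sum.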

In [None]:
df.shape

# Check for Missing Data

In [None]:
df.isnull().sum()

# Class Imbalance Check
## Region

In [None]:
len(df.region.unique())

In [None]:
df.groupby('region').size() 

There are 54 regions with roughly 338 observations each, so the dataset looks balanced across regions.

## The average prices by regions

In [None]:
plt.figure(figsize=(15,15))

plt.title("Average Price of Avocado by Region")

sns.barplot(x="AveragePrice",y="region",data= df)

plt.show()

Observation: the regions mix several levels of granularity. Some are US states (e.g. California), some are cities (e.g. San Francisco, which lies within California), and some are aggregates such as "TotalUS" and "West".

For now let's leave them as they are, although these overlapping regions could be handled separately.
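One way the overlapping regions could be handled is to drop the aggregate rows before modelling. The sketch below is only an illustration: `AGGREGATE_REGIONS` and `drop_aggregates` are hypothetical names, the set contains only labels mentioned above (the real dataset may have more aggregates), and the toy frame stands in for `df`.

```python
import pandas as pd

# Hypothetical, incomplete set of aggregate labels (assumption)
AGGREGATE_REGIONS = {"TotalUS", "West", "California"}

def drop_aggregates(frame: pd.DataFrame, aggregates=AGGREGATE_REGIONS) -> pd.DataFrame:
    """Keep only rows whose region is not an aggregate label."""
    return frame[~frame["region"].isin(aggregates)].reset_index(drop=True)

# Toy data standing in for df; the prices are made up
toy = pd.DataFrame({
    "region": ["TotalUS", "West", "SanFrancisco", "California", "Albany"],
    "AveragePrice": [1.0, 1.1, 1.8, 1.4, 1.5],
})

city_level = drop_aggregates(toy)
print(city_level["region"].tolist())
```

Whether to drop states as well as national and multi-state aggregates depends on which level of geography the model is meant to describe.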

## type

In [None]:
print(len(df.type.unique()))

df.groupby('type').size()


The two avocado types are also balanced: each accounts for roughly half of the observations.

## The average prices of avocados by types

In [None]:
plt.figure(figsize=(5,7))

plt.title("Avg.Price of Avocados by Type")

sns.barplot(x="type",y="AveragePrice",data= df)

plt.show()

# Correlation

In [None]:
plt.figure(figsize=(12,6))
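# Correlate numeric columns only; recent pandas versions raise an error on string columns
sns.heatmap(df.select_dtypes(include='number').corr(), cmap='coolwarm', annot=True)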

## Observation :
There is a high correlation between pairs: 
* 4046 & total volume  (0.98)    
* 4225 & total volume  (0.97)
* 4770 & total volume  (0.87)
* total bags & total volume  (0.96)      
* small bags & total bags    (0.99) 
* etc

* `4046` avocados have the strongest correlation with total volume (0.98), which suggests they are the most sold size in the US and that customers tend to buy them in bulk rather than in bags.
* A possible reading is that retailers would like to increase sales of bagged avocados over bulk ones because bags are more advantageous for them; the correlations alone do not confirm this.
* The `Total Bags` variable is very highly correlated with both `Total Volume` (total sales) and `Small Bags`, so most of the bagged sales appear to come from small bags.
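Reading strongly correlated pairs off a heatmap by eye is error-prone, so here is a minimal sketch that lists them programmatically. `high_corr_pairs` is a hypothetical helper, the 0.9 threshold is an arbitrary choice, and the synthetic columns (`volume`, `bags`, `noise`) only stand in for the real data.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(frame: pd.DataFrame, threshold: float = 0.9) -> pd.Series:
    """List column pairs whose absolute correlation reaches the threshold."""
    corr = frame.select_dtypes(include="number").corr()
    # Keep the strict upper triangle so each pair is reported once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs.abs() >= threshold].sort_values(ascending=False)

# Synthetic demo: "bags" is almost a multiple of "volume", "noise" is independent
rng = np.random.default_rng(0)
volume = rng.normal(size=200)
demo = pd.DataFrame({
    "volume": volume,
    "bags": 2 * volume + 0.01 * rng.normal(size=200),
    "noise": rng.normal(size=200),
})

pairs = high_corr_pairs(demo)
print(pairs)
```

Pairs that show up here are candidates for dropping one column, since near-duplicate features make linear regression coefficients unstable.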

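Based on the field descriptions and the correlations above, we train only on the fields below. `Total Volume` and `Total Bags` are left out because they are sums of other columns, and `Date` is not used.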

In [None]:
# Specifying dependent and independent variables

X = df[['4046', '4225', '4770', 'Small Bags', 'Large Bags', 'XLarge Bags', 'type', 'year', 'region']]
y = df['AveragePrice']
y = np.log1p(y)  # model log(1 + price); all metrics below are on this log scale

In [None]:
X.head()

In [None]:
y.head()
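Because the target is transformed with `np.log1p`, any error computed directly on the predictions is in log units. The sketch below shows one way to get back to price units with `np.expm1`; `rmse_in_price_units` is a hypothetical helper and the prices are made up for illustration.

```python
import numpy as np

def rmse_in_price_units(y_true_log, y_pred_log) -> float:
    """RMSE after undoing log1p on both the targets and the predictions."""
    y_true = np.expm1(np.asarray(y_true_log, dtype=float))
    y_pred = np.expm1(np.asarray(y_pred_log, dtype=float))
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

prices = np.array([1.0, 1.5, 2.0])       # hypothetical prices
logged = np.log1p(prices)                # same transform as applied to y
round_trip = np.expm1(logged)            # inverse transform recovers the prices

# Pretend every prediction is 0.10 too high in price units
predicted_log = np.log1p(prices + 0.10)
price_rmse = rmse_in_price_units(logged, predicted_log)
print(price_rmse)
```

With the notebook's variables this would be called as `rmse_in_price_units(y_valid, model.predict(X_valid))` for either fitted model.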

# Labeling the categorical variables

In [None]:
# X_labelled = pd.get_dummies(X[["type","region"]], drop_first = True)
# X_labelled.head()

X = pd.get_dummies(X, prefix=["type","region"], columns=["type","region"], drop_first = True)
X.head()

In [None]:
X.columns
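To make the effect of `drop_first=True` concrete, here is a small sketch on a toy frame (the rows are invented; only the column names match the notebook). The first category of each encoded column becomes the implicit baseline and gets no column of its own.

```python
import pandas as pd

# Toy frame standing in for X; the values are made up
toy = pd.DataFrame({
    "year": [2015, 2016, 2017],
    "type": ["conventional", "organic", "organic"],
    "region": ["Albany", "Boston", "Albany"],
})

encoded = pd.get_dummies(
    toy,
    prefix=["type", "region"],
    columns=["type", "region"],
    drop_first=True,  # drop the first category of each column to avoid redundancy
)
print(list(encoded.columns))
```

Here `conventional` and `Albany` are the dropped baselines, so a row with zeros in every dummy column represents a conventional avocado sold in Albany.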

# Split into Train and Valid set

In [None]:
from sklearn.model_selection import train_test_split 

In [None]:
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size = 0.3, random_state = 99)

In [None]:
X_train.shape, X_valid.shape, y_train.shape, y_valid.shape

# Training the Model
## Multiple Linear Regression 

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
lr = LinearRegression()
lr.fit(X_train,y_train)

print("R2 of Linear Regression (train):", lr.score(X_train,y_train) )
print("----- Prediction Accuracy-----")
print('MAE: ',metrics.mean_absolute_error(y_valid, lr.predict(X_valid)))
print('MSE: ',metrics.mean_squared_error(y_valid, lr.predict(X_valid)))
print('RMSE:',np.sqrt(metrics.mean_squared_error(y_valid, lr.predict(X_valid))))

In [None]:
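# Creating a Histogram of Residuals
plt.figure(figsize=(6,4))
# sns.distplot is deprecated; histplot with kde=True draws the same kind of plot
sns.histplot(y_valid - lr.predict(X_valid), kde=True)
plt.title('Distribution of residuals (Linear Regression)')
plt.show()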

In [None]:
plt.scatter(y_valid,lr.predict(X_valid))

## SVR Regressor

In [None]:
from sklearn.svm import SVR

Let's first choose the best kernel for our data from the kernels that SVR provides.

In [None]:
# clf = svm.SVR(kernel = 'linear')
# clf.fit(X_train, y_train)
# confidence = clf.score(X_train, y_train)
# print('linear', confidence)

In [None]:
# for k in ['linear','poly','rbf','sigmoid']:
#     print("Running for k as ", k)
#     clf = svm.SVR(kernel=k)
#     clf.fit(X_train, y_train)
#     confidence = clf.score(X_train, y_train)
#     print(k,confidence)
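The commented-out loop above scores each kernel on the training data, which tends to favour whichever kernel overfits most. A hedged alternative is sketched below: it scales the features and uses cross-validation instead. The data is synthetic and the names `X_demo`, `y_demo` and `kernel_scores` are illustrative; with the notebook's data, `X_train` and `y_train` would take their place (and the run would be much slower).

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic data with features on very different scales (assumption for the demo)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(150, 3)) * np.array([1.0, 100.0, 10000.0])
y_demo = np.sin(X_demo[:, 0]) + 0.1 * rng.normal(size=150)

kernel_scores = {}
for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    # Scaling matters for SVR because kernels depend on distances between rows
    model = make_pipeline(StandardScaler(), SVR(kernel=kernel))
    scores = cross_val_score(model, X_demo, y_demo, cv=3, scoring="r2")
    kernel_scores[kernel] = scores.mean()
    print(kernel, round(kernel_scores[kernel], 3))
```

The kernel with the highest mean cross-validated R2 is the natural candidate, though the ranking on synthetic data says nothing about the avocado data.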

So '' is best for our data.

Hyperparameter Tuning

Intuitively, the `gamma` parameter of the RBF kernel defines how far the influence of a single training example reaches, with low values meaning ‘far’ and high values meaning ‘close’.

The `C` parameter trades off fitting the training examples closely against keeping the fitted function simple. In SVR, only errors larger than `epsilon` are penalized, and `C` sets how heavily.

* For larger values of `C`, the model accepts a more complex function if it fits the training points more closely.

* A lower `C` encourages a smoother, simpler function, at the cost of training accuracy.

* In other words, `C` behaves as an inverse regularization parameter in the SVM: the smaller `C` is, the stronger the regularization.
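Rather than picking `C` and `gamma` by hand, the two can be searched together. The sketch below uses scikit-learn's `GridSearchCV` on synthetic data; the grid values, the step names `scale` and `svr`, and the demo arrays are all assumptions made for illustration, not settings taken from this notebook.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Small synthetic regression problem (assumption for the demo)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(120, 2))
y_demo = np.sin(X_demo[:, 0]) + 0.1 * rng.normal(size=120)

pipeline = Pipeline([
    ("scale", StandardScaler()),   # put features on a comparable scale
    ("svr", SVR(kernel="rbf")),
])

# Hypothetical grid; a real search would likely cover a wider range
param_grid = {
    "svr__C": [0.1, 1, 10],
    "svr__gamma": [0.01, 0.1, 0.5],
}

search = GridSearchCV(
    pipeline,
    param_grid,
    cv=3,
    scoring="neg_root_mean_squared_error",  # higher (closer to 0) is better
)
search.fit(X_demo, y_demo)
print(search.best_params_)
print(-search.best_score_)  # cross-validated RMSE of the best combination
```

Putting the scaler inside the pipeline means it is refit on each training fold, so the validation folds do not leak into the scaling statistics.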

In [None]:
svr = SVR(kernel='rbf', C=1, gamma=0.5)   # Hand-picked values; no systematic search is run here

svr.fit(X_train,y_train)
print("R2 of SVR (train):", svr.score(X_train,y_train))

In [None]:
from math import sqrt 

In [None]:
# calculate RMSE
error = sqrt(metrics.mean_squared_error(y_valid,svr.predict(X_valid))) 
print('RMSE value of the SVR Model is:', error)

In [None]:
# Creating a Histogram of Residuals
plt.figure(figsize=(6,4))
# sns.distplot is deprecated; histplot with kde=True draws the same kind of plot
sns.histplot(y_valid - svr.predict(X_valid), kde=True)
plt.title('Distribution of residuals (SVR)')
plt.show()

In [None]:
plt.scatter(y_valid,svr.predict(X_valid))

# Compare RMSE

In [None]:
# Linear Regression RMSE : 
print('RMSE value of the Linear Regr : ',round(np.sqrt(metrics.mean_squared_error(y_valid, lr.predict(X_valid))),4))

# SVR RMSE               : 
print('RMSE value of the SVR Model   : ',round(np.sqrt(metrics.mean_squared_error(y_valid, svr.predict(X_valid))),4))
