# Avocado prices

### Nature and rationale of the data

> The `data` represents weekly 2018 retail scan data for National retail volume (units) and price. Retail scan data comes directly from retailers’ cash registers based on actual retail sales of Hass avocados. Starting in 2013, the table below reflects an expanded, multi-outlet retail data set. Multi-outlet reporting includes an aggregation of the following channels: grocery, mass, club, drug, dollar and military. The Average Price (of avocados) in the table reflects a per unit (per avocado) cost, even when multiple units (avocados) are sold in bags. The Product Lookup codes (PLU’s) in the table are only for Hass avocados. Other varieties of avocados (e.g. greenskins) are not included in this table.

## In this Notebook I have done the following tasks: 

#### Did Feature engineering to round off certain features and created new features like Day & Month.

#### Explored and visualized the data in detail. Some important graphs in the notebook are:
* Avg. price of avocado by city, total volume of avocado sold by city.
* Avg. price of avocado as per type, avg. price of avocado as per day of the month, total volume of avocado sold as per month.
* Pie graph to visualize volume distribution.

#### Used Facebook Prophet to predict the future average price of avocado.



## Some relevant columns in the dataset:

*  Date - The date of the observation
*  AveragePrice - the average price of a single avocado
*  type - conventional or organic
*  year - the year
*  region - the city or region of the observation
*  Total Volume - Total number of avocados sold
*  4046 - Total number of avocados with PLU 4046 sold
*  4225 - Total number of avocados with PLU 4225 sold
*  4770 - Total number of avocados with PLU 4770 sold

## Step 1: Import all the libraries, load the dataset and have a first look at the data.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")

In [None]:
df = pd.read_csv('../input/avocado-prices/avocado.csv')
df.head()

In [None]:
df = df.drop('Unnamed: 0',axis=1)

In [None]:
df.info()

In [None]:
df.describe()

## Step 2: Feature Engineering.

##### Change the date column to datetime so it can be used as time series data.

In [None]:
df['Date'] = pd.to_datetime(df['Date'],errors='coerce')
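
Note that `errors='coerce'` does not raise on malformed dates; it replaces them with `NaT`, so counting missing values after parsing is a useful sanity check. A minimal sketch on a made-up series (the values are illustrative and not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy column: two valid dates and one malformed entry.
raw_dates = pd.Series(["2015-12-27", "2015-12-20", "not a date"])

# errors="coerce" converts anything unparseable to NaT instead of raising.
parsed = pd.to_datetime(raw_dates, errors="coerce")

# Count how many values were turned into NaT during parsing.
n_failed = int(parsed.isna().sum())
print(n_failed)
```

On the real frame the same check would be `df['Date'].isna().sum()` right after the conversion.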

#### Luckily we don't have any null values, which makes working with the data easier.

In [None]:
sns.heatmap(df.isnull(),cbar=False,cmap='Blues',yticklabels=False)

In [None]:
df.year.value_counts()

#### Rounding off certain columns, as I find it more comfortable to work with rounded values. This step can be skipped.

In [None]:
round_columns = df[['Total Volume','4046','4225','4770','Total Bags','Small Bags','Large Bags','XLarge Bags']]

In [None]:
for i in round_columns.columns:
    df[i] = df[i].apply(np.round)
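
The same rounding can probably be done without a Python loop, since `DataFrame.round()` works on several columns at once. A small sketch on a toy frame that reuses two of the column names (the numbers are made up):

```python
import pandas as pd

# Toy frame reusing two of the notebook's column names; values are invented.
toy = pd.DataFrame({
    "Total Volume": [64236.62, 54876.98],
    "4046": [1036.74, 674.28],
})

cols = ["Total Volume", "4046"]
# Round all selected columns to whole numbers in one vectorized call.
toy[cols] = toy[cols].round()
print(toy)
```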

In [None]:
df.head()

## Step 3: Exploratory Data Analysis

#### From the histograms of the data I see that Avg Price has a very nice distribution. Hence I will plot it separately in the next cell.

In [None]:
df.hist(bins=30,figsize=(12,10),color='skyblue',ec="black")

In [None]:
plt.figure(figsize=(10,5))
plt.title("Price Distribution")
ax = sns.histplot(df["AveragePrice"], kde=True, stat="density", color='b')  # distplot is deprecated in recent seaborn

In [None]:
sns.barplot(x='type', y='Total Volume', data=df)  # mean Total Volume per avocado type

In [None]:
df.year.value_counts().sort_index().plot(kind='barh',figsize=(6,4),color='skyblue',ec='black')

#### It makes more sense to work with only cities rather than states and aggregate regions. So I am going to remove the states and aggregate regions from the region column.

In [None]:
regionsToRemove = ['California', 'GreatLakes', 'Midsouth', 'NewYork', 'Northeast', 'SouthCarolina', 'Plains', 'SouthCentral', 'Southeast', 'TotalUS', 'West']
df = df[~df.region.isin(regionsToRemove)]
len(df.region.unique())
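
The filter above combines `isin` with the `~` negation operator. A minimal sketch of the same pattern on a toy frame (region names are reused from the dataset, prices are invented), which also shows one way to confirm that only the intended labels remain:

```python
import pandas as pd

# Toy frame: two cities and two aggregate labels, with invented prices.
toy = pd.DataFrame({
    "region": ["Houston", "TotalUS", "West", "Boston"],
    "AveragePrice": [1.05, 1.32, 1.27, 1.53],
})

aggregates = ["TotalUS", "West"]
# isin() marks rows whose region is in the list; ~ flips the mask to keep the rest.
cities_only = toy[~toy["region"].isin(aggregates)]

remaining = sorted(cities_only["region"].unique())
print(remaining)
```

Dropping labels such as `TotalUS` also avoids counting the same sales twice when volumes are summed or compared across regions.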

In [None]:
plt.figure(figsize=(15,12))
sns.set(style="white", context="talk")
plt.title("Avg.Price of Avocado by City")
sns.barplot(x="AveragePrice",y="region",data= df,palette="rocket")


#### As seen, the avg price of avocado is highest in SanFrancisco & HartfordSpringfield.

In [None]:
plt.figure(figsize=(15,12))
sns.set(style="white", context="talk")
plt.title("Total volume of Avocado sold by City")
sns.barplot(x="Total Volume",y="region",data= df,palette="deep")

#### Clearly LA is in love with avocados. One possible reason is that the Avg price of avocado in LA is not very high.

In [None]:
plt.figure(figsize=(8,4))
sns.set(style="white", context="talk")
plt.title("Avg.price of Avocado as per type")
sns.boxplot(x="AveragePrice",y="type",data= df,palette="vlag")

#### Organic surely are more expensive. 

In [None]:
# Making a new column 'Month'
df['Month'] = pd.DatetimeIndex(df['Date']).month

In [None]:
df.head(1)

In [None]:
axis = df.groupby('Month')[['AveragePrice']].mean().plot(figsize=(10,5),marker='o',color='r')
plt.figure()
axis = df.groupby('Month')[['Total Volume']].mean().plot(figsize=(10,5),marker='o',color='g')

#### The above visuals suggest that the Avg Price of avocado has an effect on sales: sales tend to be high in months when the Avg price is low and vice versa.
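
The relationship read off the two line plots can also be checked numerically with a correlation between the monthly means. A rough sketch on invented monthly averages (these numbers are not from the dataset); a value near -1 would support the reading that volume is high when price is low:

```python
import pandas as pd

# Made-up monthly averages purely for illustration; not values from the dataset.
monthly = pd.DataFrame(
    {
        "AveragePrice": [1.30, 1.25, 1.35, 1.45, 1.60, 1.55],
        "Total Volume": [900.0, 950.0, 880.0, 800.0, 700.0, 720.0],
    },
    index=pd.Index([1, 2, 3, 4, 5, 6], name="Month"),
)

# Pearson correlation between the two monthly series.
corr = monthly["AveragePrice"].corr(monthly["Total Volume"])
print(round(corr, 3))
```

On the notebook's frame, the equivalent would be `df.groupby('Month')[['AveragePrice', 'Total Volume']].mean().corr()`.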

In [None]:
# Making a new column 'Day'.
df['Day'] = pd.DatetimeIndex(df['Date']).day

In [None]:
axis = df.groupby('Day')[['AveragePrice']].mean().plot(figsize=(14,5),marker='o',color='r')
plt.figure()
axis = df.groupby('Day')[['Total Volume']].mean().plot(figsize=(14,5),marker='o',color='g')

In [None]:
plt.figure(figsize=(18,18))
sns.set(style="white", context="talk")
plt.title("Avg.Price of Avocado by City")
sns.boxplot(x="AveragePrice",y="region",data= df,palette="deep")

#### As we can see above, Houston clearly has the cheapest avocados, but the lowest single price was recorded in PhoenixTucson.

#### Avg price of avocado over time as per types.

In [None]:
fig,ax = plt.subplots(figsize=(15,6))
# Select the column before aggregating so non-numeric columns (e.g. region) are not averaged.
df.groupby(['Date','type'])['AveragePrice'].mean().unstack().plot(ax=ax)
plt.title('Avg price of avocado over time as per type')
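
The `groupby` followed by `unstack` turns the long table into one column per avocado type, indexed by date, which is what lets a single `.plot()` call draw one line per type. A small sketch on a toy frame (dates, regions and prices are invented):

```python
import pandas as pd

# Toy long-format data: several rows per (Date, type) pair, as in the dataset.
toy = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2018-01-07", "2018-01-07", "2018-01-07", "2018-01-14", "2018-01-14"]
    ),
    "type": ["conventional", "conventional", "organic", "conventional", "organic"],
    "region": ["Houston", "Boston", "Houston", "Houston", "Houston"],
    "AveragePrice": [1.0, 1.2, 1.8, 1.1, 1.7],
})

# Mean price per (Date, type), then pivot the 'type' level into columns.
wide = toy.groupby(["Date", "type"])["AveragePrice"].mean().unstack()
print(wide)
```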

#### Total volume of avocado sold over time as per types.

In [None]:
fig,ax = plt.subplots(figsize=(15,6))
# Plot 'Total Volume' so the chart matches its heading and title.
df.groupby(['Date','type'])['Total Volume'].mean().unstack().plot(ax=ax)
plt.title('Total volume of avocado sold over time as per type')

In [None]:
avacado_type = df['type']=='organic'
plt.figure(figsize=(18,18))
sns.set(style="white", context="talk")
plt.title("Average price of organic Avocado as per City")
sns.boxplot(x="AveragePrice",y="region",data= df[avacado_type],palette="deep")

In [None]:
avacado_type = df['type']=='conventional'
plt.figure(figsize=(18,18))
sns.set(style="white", context="talk")
plt.title("Average Price of conventional Avocado as per City")
sns.boxplot(x="AveragePrice",y="region",data= df[avacado_type],palette="deep")

#### As seen from the above 2 graphs, Houston has the cheapest organic avocados, whereas conventional avocados are cheapest in PhoenixTucson.

In [None]:
df_corr = df[['AveragePrice','Total Volume','Total Bags','Month']]
correlations = df_corr.corr()
plt.figure(figsize=(8,5))
sns.heatmap(correlations,annot=True,cmap="YlGnBu",linewidths=.5)
# Month has a very good correlation with AvgPrice.

### Selecting important volume features to draw pie charts to see the volume distribution.

In [None]:
df_to_plot = df.drop(['Date','AveragePrice', 'Total Volume', 'Total Bags','type','region','Month','Day'], axis = 1).groupby('year').agg('sum')
df_to_plot.head()

In [None]:
index = ['4046', '4225', '4770', 'Small Bags', 'Large Bags', 'XLarge Bags']
colors = ['silver', 'pink', 'orange', 'palegreen', 'aqua', 'blue']
# One column per year, one row per volume feature.
series = pd.DataFrame({str(year): df_to_plot.loc[year].values for year in [2015, 2016, 2017, 2018]}, index=index)
# Draw one pie chart per year.
for year in series.columns:
    series.plot.pie(y=year, figsize=(9, 9), autopct='%1.1f%%', colors=colors, fontsize=18, legend=False, title=f'{year} Volume Distribution').set_ylabel('')
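
The percentages that `autopct` prints on the pies can also be computed directly, which makes it easier to compare years side by side. A sketch on a small invented table shaped like `df_to_plot` (years as rows, volume features as columns):

```python
import pandas as pd

# Invented yearly totals; only the shape mirrors df_to_plot.
yearly = pd.DataFrame(
    {
        "4046": [300.0, 200.0],
        "4225": [500.0, 300.0],
        "Small Bags": [200.0, 500.0],
    },
    index=pd.Index([2015, 2016], name="year"),
)

# Divide each row by its own total to get percentage shares per year.
shares = yearly.div(yearly.sum(axis=1), axis=0) * 100
print(shares.round(1))
```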

## Step 4: Avg avocado price prediction using Facebook Prophet.

In [None]:
from prophet import Prophet  # the package was named fbprophet before version 1.0

In [None]:
df_pr = df.copy()

In [None]:
df_pr = df_pr[['Date', 'AveragePrice']].rename(columns = {'Date': 'ds', 'AveragePrice':'y'})

In [None]:
train_data_pr = df_pr.iloc[:len(df)-30]
test_data_pr = df_pr.iloc[len(df)-30:]
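
Because the frame holds many rows per date (one per region and type) and is not sorted by date, the last 30 rows are not necessarily the last 30 points in time. One possible refinement, which is not what this notebook does, is to aggregate to a single price per date and split chronologically. A sketch on toy data (dates, prices and the 3-date hold-out are assumptions for illustration):

```python
import pandas as pd

# Toy stand-in for df_pr: 8 weekly dates, each observed in 2 hypothetical regions.
dates = pd.date_range("2018-01-07", periods=8, freq="W")
toy = pd.DataFrame({
    "ds": list(dates) * 2,
    "y": [1.0 + 0.1 * i for i in range(8)] + [2.0 + 0.1 * i for i in range(8)],
})

# Collapse to one mean price per date and make sure rows are in time order.
weekly = toy.groupby("ds", as_index=False)["y"].mean().sort_values("ds")

# Hold out the last few dates (3 here; the notebook holds out 30 rows).
holdout = 3
train, test = weekly.iloc[:-holdout], weekly.iloc[-holdout:]
print(len(train), len(test))
```

With a split like this, every test date lies strictly after every training date, so the forecast horizon and the held-out data refer to the same period.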

In [None]:
m = Prophet()
m.fit(train_data_pr)
future = m.make_future_dataframe(periods=30,freq='MS')
prophet_pred = m.predict(future)
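
The `freq` argument decides which dates the future frame contains. The data is described as weekly, while `'MS'` stands for month start, so 30 periods at `'MS'` reach roughly two and a half years ahead rather than 30 weeks. The difference can be seen with plain `pd.date_range`, used here only to illustrate the frequency aliases (the start date is an assumed example):

```python
import pandas as pd

# Assumed example start date (a Sunday); not read from the dataset.
start = pd.Timestamp("2018-03-25")

# 'MS' = month start: dates jump to the first day of each following month.
monthly = pd.date_range(start=start, periods=3, freq="MS")

# 'W' = weekly, anchored on Sundays: dates advance one week at a time.
weekly = pd.date_range(start=start, periods=3, freq="W")

print(list(monthly.strftime("%Y-%m-%d")))
print(list(weekly.strftime("%Y-%m-%d")))
```

If the goal is to compare forecasts with held-out weekly observations, a weekly frequency such as `freq='W'` may line up better with the test dates.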

In [None]:
prophet_pred.tail()

In [None]:
prophet_pred = pd.DataFrame({"Date" : prophet_pred[-30:]['ds'], "Pred" : prophet_pred[-30:]["yhat"]})

In [None]:
prophet_pred = prophet_pred.set_index("Date")

In [None]:
prophet_pred.index.freq = "MS"

In [None]:
test_data_pr["Prophet_Predictions"] = prophet_pred['Pred'].values

In [None]:
test_data_pr = test_data_pr.set_index("ds")

### Comparing Prophet's predictions with the original.

In [None]:
test_data_pr.head(10)

In [None]:
test_data_pr.tail(10)

### Visualising the comparison between Prophet's prediction and the actual price. Blue is Prophet's prediction and red is the actual price.

In [None]:
plt.figure(figsize=(16,5))
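# Set the colors explicitly so the plot matches the description: red = actual, blue = prediction.
ax = sns.lineplot(x=test_data_pr.index, y=test_data_pr["y"], color='red', label='Actual price')
sns.lineplot(x=test_data_pr.index, y=test_data_pr["Prophet_Predictions"], color='blue', label='Prophet prediction');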
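
Besides the visual comparison, a single error number makes it easier to judge the forecast and to compare it with other models later. A minimal sketch of mean absolute error and root mean squared error on invented values (the arrays stand in for `test_data_pr["y"]` and `test_data_pr["Prophet_Predictions"]`):

```python
import numpy as np

# Invented actual and predicted prices, for illustration only.
actual = np.array([1.50, 1.40, 1.60, 1.55])
predicted = np.array([1.40, 1.45, 1.50, 1.65])

errors = predicted - actual
# MAE: average size of the miss, in the same units as the price.
mae = np.mean(np.abs(errors))
# RMSE: like MAE but penalizes large misses more strongly.
rmse = np.sqrt(np.mean(errors ** 2))
print(mae, rmse)
```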

## Conclusion.

#### I was able to get some really nice visualizations and comparison graphs. I tried to look through the data left, right and centre. The first 3 steps went smoothly, but in the last step I was not able to get very accurate predictions using Facebook Prophet. I will try to come back to this notebook and make changes to the last step. I would also like to try LSTM & ARIMA in future work. It would be nice to compare the results of all 3 then.
#### As of now I would like to conclude here. 
#### If you have gone through my work and found it good, please upvote. Anyone can copy and edit my notebook. It would be amazing to get some suggestions.

#### More of my notebooks and work can be found here: https://www.kaggle.com/vikasbhadoria/notebooks

### Thank you