# Data Visualisation of Avocado Prices

### Context
It's well known that millennials LOVE avocado toast. It's also a well-known fact that all millennials live in their parents' basements. Clearly, they aren't buying homes because they are buying too much avocado toast! But if a millennial could find a city where avocados are cheap, they could live out the millennial American Dream.

### Data
The table reflects an expanded, multi-outlet retail data set. Multi-outlet reporting includes an aggregation of the following channels: grocery, mass, club, drug, dollar and military. The Average Price (of avocados) in the table reflects a per-unit (per avocado) cost, even when multiple units (avocados) are sold in bags. The Product Lookup codes (PLUs) in the table are only for Hass avocados.

**Some relevant columns in the dataset:**
* `Date` - The date of the observation
* `AveragePrice` - the average price of a single avocado
* `type` - conventional or organic
* `year` - the year
* `region` - the city or region of the observation
* `Total Volume` - Total number of avocados sold
* `4046` - Total number of avocados with PLU 4046 sold
* `4225` - Total number of avocados with PLU 4225 sold
* `4770` - Total number of avocados with PLU 4770 sold

### Objective
In this notebook, we will try to understand and answer the following questions:
* In which cities can millennials have their avocado toast and buy a home?
* Was the Avocadopocalypse of 2017 real?

## Importing the libraries and retrieving the data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="whitegrid")

avocado = pd.read_csv("/kaggle/input/avocado-prices/avocado.csv", index_col=0)

# sorting data frame by Date
avocado.sort_values("Date", axis=0, ascending=True, inplace=True, na_position='last')
avocado.head(10)

**We need to check the file for missing data**  
The dataset doesn't appear to use a special marker for missing data, so let's count the null (NaN) values in each column.

In [None]:
print(avocado.isnull().sum())

There are no missing values, so we can start exploring the data.
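
Missing values are sometimes stored as 0 rather than NaN, which `isnull()` would not catch. The snippet below is a minimal sketch of a complementary check that counts exact zeros per numeric column. The helper `count_zeros` and the toy rows are illustrative assumptions, not part of the original notebook; only the column names mirror the dataset.

```python
import pandas as pd

# Toy rows with made-up values; only the column names mirror the dataset.
sample = pd.DataFrame({
    "AveragePrice": [1.10, 0.95, 1.48],
    "Total Volume": [64236.62, 54876.98, 1505.12],
    "4770": [48.16, 0.0, 0.0],
})


def count_zeros(frame: pd.DataFrame) -> pd.Series:
    """Return the number of exact zeros in each numeric column (hypothetical helper)."""
    numeric = frame.select_dtypes(include="number")
    return (numeric == 0).sum()


zero_counts = count_zeros(sample)
print(zero_counts)
```

On the real data this would be called as `count_zeros(avocado)`. A zero in a PLU column may simply mean that none of that size were sold, so zeros on their own are not proof of missing data.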

## Let's get some valuable insights & answer the first question

In [None]:
lowest_price_year = avocado.groupby("year")["AveragePrice"].min().reset_index()
average_price_year = avocado.groupby("year")["AveragePrice"].mean().reset_index()
highest_price_year = avocado.groupby("year")["AveragePrice"].max().reset_index()

price_table = pd.DataFrame({"Year":lowest_price_year["year"],
                            "Lowest average":lowest_price_year["AveragePrice"],
                            "Overall average":average_price_year["AveragePrice"],
                            "Highest average":highest_price_year["AveragePrice"]})
price_table.head()

In [None]:
plt.figure(figsize=(20,10))
sns.boxplot(x=avocado["year"], y=avocado["AveragePrice"], palette="rainbow")
plt.xticks(rotation=90)
plt.xlabel('Year')
plt.tick_params(labelsize = 15)
plt.ylabel('Average Price')
plt.title('Average Price of Avocado According to Year')

**What does it tell us**  
The table and boxplot above tell us that the single cheapest avocados could be bought in 2017, yet 2017 also had by far the highest overall average price. The lowest overall average price was paid in 2016.
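
The three yearly statistics in the table above can also be computed in a single pass with pandas named aggregation. This is a sketch on made-up prices, assuming only that the `year` and `AveragePrice` columns exist; the output column names are my own choice.

```python
import pandas as pd

# Toy rows with made-up prices; the notebook itself would use `avocado`.
sample = pd.DataFrame({
    "year": [2015, 2015, 2016, 2016, 2017, 2017],
    "AveragePrice": [1.00, 1.50, 0.90, 1.30, 0.80, 2.00],
})

# Named aggregation: one output column per statistic, one row per year.
price_summary = (
    sample.groupby("year")["AveragePrice"]
    .agg(lowest="min", overall_mean="mean", highest="max")
    .reset_index()
)
print(price_summary)
```

Compared with three separate `groupby` calls, this keeps the statistics aligned by construction, since they all come from the same grouping.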

In [None]:
# Overall average price per region, sorted from low to high
grouped = avocado.groupby("region")["AveragePrice"].mean().reset_index()
sorted_regio_mean = grouped.sort_values("AveragePrice", ascending=True)
sorted_regio_mean.head()

In [None]:
plt.figure(figsize=(20,10))
sns.barplot(x=sorted_regio_mean["region"], y=sorted_regio_mean["AveragePrice"], palette="icefire_r")
plt.xticks(rotation=90)
plt.xlabel('Region')
plt.tick_params(labelsize = 15)
plt.ylabel('Average Price')
plt.title('Average Price of Avocado According to Region')

### We now know..
.. that Houston has the cheapest overall average price ($1.05) for avocados. So millennials who live there can live out the millennial American Dream. As a side note, keep in mind that this overall average price covers both organic and conventional avocados.
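
Because the overall average mixes both avocado types, the regional ranking can be checked per type as well. The snippet below is a sketch of that split using `groupby` and `unstack`; the prices are made up and the two region names are placeholders for illustration.

```python
import pandas as pd

# Made-up prices; region names are placeholders for illustration only.
sample = pd.DataFrame({
    "region": ["Houston", "Houston", "Houston", "Boston", "Boston", "Boston"],
    "type": ["conventional", "conventional", "organic",
             "conventional", "organic", "organic"],
    "AveragePrice": [0.80, 1.00, 1.50, 1.20, 1.80, 2.00],
})

# Mean price per region and type, with one column per type,
# sorted by the conventional price from low to high.
price_by_region_type = (
    sample.groupby(["region", "type"])["AveragePrice"]
    .mean()
    .unstack("type")
    .sort_values("conventional")
)
print(price_by_region_type)
```

With the real data, replacing `sample` by `avocado` would show whether the cheapest region stays the same for conventional and organic avocados.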

**Let's find out if it's relevant to split per type**

In [None]:
sns.boxplot(y="AveragePrice", x="type", data=avocado, palette = 'Set3')

In [None]:
sns.barplot(y="Total Volume", x="type", data=avocado, palette = 'Set3')

**What we can learn from the above data**  
As the plots above show, organic and conventional avocados differ a lot in price. The second plot shows that conventional avocados sell in far larger volumes. This could be due to the higher price of organic avocados, but also to the number of stores that stock them or to limited awareness among buyers. At least we learned that the volume of organic avocados is a drop in the ocean.
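
Note that `sns.barplot` draws the mean of the y variable per category by default, so the volume plot compares the average volume per observation rather than the summed volume. A minimal sketch of computing each type's share of the summed volume, on made-up numbers, could look like this:

```python
import pandas as pd

# Made-up volumes; only the column names mirror the dataset.
sample = pd.DataFrame({
    "type": ["conventional", "conventional", "organic", "organic"],
    "Total Volume": [900.0, 1000.0, 40.0, 60.0],
})

# Sum the volume per type, then express each type as a fraction of the total.
total_by_type = sample.groupby("type")["Total Volume"].sum()
share_by_type = total_by_type / total_by_type.sum()
print(share_by_type)
```

Run against `avocado`, the same two lines would put a number on how small the organic share is.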

In [None]:
# Split the data by type
conventional = avocado[avocado["type"] == "conventional"]
organic = avocado[avocado["type"] == "organic"]

# Average only the price column per date; calling .mean() on the whole
# frame fails on the text columns in recent pandas versions
conventional_price = conventional.groupby("Date")["AveragePrice"].mean()
organic_price = organic.groupby("Date")["AveragePrice"].mean()

df = pd.DataFrame({
    'Conventional': conventional_price,
    'Organic': organic_price
    })
df.index = pd.to_datetime(df.index)  # real dates give a readable time axis
lines = df.plot.line(figsize=(20,10))
plt.tick_params(labelsize = 15)
plt.xlabel('Date')
plt.ylabel('Average Price')
plt.title('Average Price of Avocado According to Type')

## To answer the second question..
.. as shown above, prices rose sharply in 2017, so the Avocadopocalypse of 2017 was real. A quick Google search suggests why prices rose so steeply: in 2016, a massive heat wave affected both California and Mexico, destroying fruit that would have matured into the 2017 crop.
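
To put a number on the 2017 increase instead of reading it off the line plot, the year-over-year change of the mean price can be computed with `pct_change`. The sketch below uses made-up prices, so its output says nothing about the real data.

```python
import pandas as pd

# Made-up prices; the notebook itself would use `avocado`.
sample = pd.DataFrame({
    "year": [2015, 2015, 2016, 2016, 2017, 2017],
    "AveragePrice": [1.00, 1.40, 1.10, 1.30, 1.40, 1.60],
})

# Mean price per year, then the percentage change against the previous year.
yearly_mean = sample.groupby("year")["AveragePrice"].mean()
yearly_change_pct = yearly_mean.pct_change() * 100
print(yearly_change_pct.round(1))
```

The first year has no predecessor, so its change is NaN.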

**To round off, we will plot the average prices per year for every region in one figure**  
This gives us a clear overview of the price fluctuations per year in each region.

In [None]:
# Recent seaborn versions require x and y to be passed as keyword arguments
sns.catplot(x='AveragePrice', y='region', data=avocado,
            hue='year',
            palette='Set3', kind="box",
            height=20, aspect=0.6, width=1.1)

## Got any feedback?
I would be delighted to get any feedback about writing cleaner and faster code, or best practices that I could use. So please take a minute to scan my code and help me advance my data career. I look forward to doing the same for other people in the near future.