# Exercise 2: Exploring the Avocado Dataset

We're going to do some exploratory data analysis on an avocado retail dataset. 

### Importing Dependencies 

Before we start, we'll need to import `pandas`, `seaborn`, and `matplotlib`, which are important tools for data exploration in Python.

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

### Loading the Data

We'll look at the [Avocado Dataset](https://www.kaggle.com/neuromusic/avocado-prices/data) published on Kaggle. The data was retrieved in 2018 and comes from the Hass Avocado Board. The data dictionary below will give us a better idea of what each column represents:

|column|description|
|------|-----------|
|date|Date of the observation|
|average_price|Average price of a single avocado|
|type|Type of avocado. Can be either conventional or organic|
|year|Year of observation|
|region|City or region of the observation|
|total_volume|Total number of avocados sold|
|4046|Total number of avocados with PLU 4046 sold|
|4225|Total number of avocados with PLU 4225 sold|
|4770|Total number of avocados with PLU 4770 sold|

In [None]:
avocados = pd.read_csv("https://s3.us-east-2.amazonaws.com/explore.datasets/rbi/avocado.csv", index_col=0)
avocados.head()

How many columns and rows are we dealing with?

In [None]:
avocados.____

What is the datatype of each column?

In [None]:
avocados.____

How many missing values are in our dataset?

In [None]:
avocados.____().sum()
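If the three inspection steps above feel abstract, here is a minimal sketch on a small made-up DataFrame (the `toy` frame and its columns are hypothetical and unrelated to the avocado file). It shows one common way to check the size, column types, and missing values of a DataFrame:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not part of the avocado dataset
toy = pd.DataFrame({
    "fruit": ["apple", "pear", "plum"],
    "price": [1.20, np.nan, 0.80],
    "units": [10, 4, 7],
})

n_rows, n_cols = toy.shape               # shape is a (rows, columns) tuple
column_types = toy.dtypes                # one dtype per column
missing_per_column = toy.isnull().sum()  # number of missing values per column

print(n_rows, n_cols)
print(column_types)
print(missing_per_column)
```

Note that `shape` and `dtypes` are attributes (no parentheses), while `isnull` is a method.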

What is the mean price of an avocado? What are the lowest and highest prices? Use the `mean`, `min`, and `max` methods to calculate these values.

In [None]:
mean_price = avocados['average_price']._____
min_price = avocados['average_price']._____
max_price = avocados['average_price']._____

print(f"The mean price of an avocado is {mean_price}, with a min of {min_price} and max of {max_price}")

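What is the distribution of price? Use Seaborn's [distplot](https://seaborn.pydata.org/generated/seaborn.distplot.html) to plot the distribution. Note that `distplot` has been deprecated since seaborn 0.11; on newer versions, `histplot` or `displot` is the replacement. You can also add a vertical line representing the mean price using Matplotlib's `axvline`.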

In [None]:
sns._____(avocados['____'])
plt.axvline(mean_price)

What proportion of avocados are conventional vs. organic? Check the `value_counts` of the `type` column and pass in `normalize=True`.

In [None]:
avocados['type'].______(_____)
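To illustrate what `normalize=True` changes, here is a small sketch on a made-up Series (the `colors` data is hypothetical). Without the flag you get raw counts; with it, each count is divided by the total so the values sum to 1:

```python
import pandas as pd

# Hypothetical toy data: four observations, two categories
colors = pd.Series(["red", "red", "green", "red"])

counts = colors.value_counts()                # raw counts per category
shares = colors.value_counts(normalize=True)  # fractions of the total

print(counts)
print(shares)
```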

Is there a difference in price between avocado types? You can group by `type` and calculate the mean price for each avocado type.

In [None]:
avocados.groupby('____')['average_price'].mean()
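The group-then-aggregate pattern can be sketched on a tiny made-up table (the `sales` frame and its `kind` and `price` columns are hypothetical). Rows are split by the grouping column, and the mean is computed separately within each group:

```python
import pandas as pd

# Hypothetical toy data with two groups, "a" and "b"
sales = pd.DataFrame({
    "kind": ["a", "b", "a", "b"],
    "price": [1.0, 2.0, 3.0, 4.0],
})

# Split rows by "kind", then average "price" within each group
mean_by_kind = sales.groupby("kind")["price"].mean()

print(mean_by_kind)
```

The result is a Series indexed by group label, so `mean_by_kind["a"]` gives the mean for group `a`.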

We have a `date` column that represents the date at which avocado retail information was collected. Let's extract the month from this column. We first need to convert the `date` column to datetime. Use `pd.to_datetime` to make this conversion.

In [None]:
avocados['date'] = pd.to_datetime(____)

To extract the month, we can simply use `dt.month`. If you want the full month name, you can use `dt.month_name()`.

In [None]:
avocados['month'] = avocados['date'].dt.____
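As a rough illustration of the conversion and the `dt` accessor, here is a sketch on a made-up `orders` frame (the frame and its `placed` column are hypothetical). It assumes the dates are stored as ISO-formatted strings:

```python
import pandas as pd

# Hypothetical toy data: dates stored as plain strings
orders = pd.DataFrame({"placed": ["2017-01-15", "2017-03-02", "2018-12-24"]})

# Convert the string column to datetime so the .dt accessor becomes available
orders["placed"] = pd.to_datetime(orders["placed"])

orders["month"] = orders["placed"].dt.month              # attribute: month number
orders["month_name"] = orders["placed"].dt.month_name()  # method: needs parentheses

print(orders)
```

`dt.month` is an attribute, while `dt.month_name()` is a method call; forgetting the parentheses on the latter stores the method itself rather than the names.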

Let's look at avocados from the year 2017. We can create a new dataframe that gets avocados exclusively from 2017.

In [None]:
avocados_2017 = avocados[avocados['year'] == ____]
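Filtering with a boolean mask can be sketched as follows, using a made-up `events` frame (hypothetical names throughout). The comparison produces a Series of `True`/`False` values, and indexing with it keeps only the `True` rows. Calling `.copy()` is optional, but it is one way to avoid chained-assignment warnings if you later add columns to the filtered frame:

```python
import pandas as pd

# Hypothetical toy data
events = pd.DataFrame({"year": [2016, 2017, 2017], "value": [1, 2, 3]})

mask = events["year"] == 2017      # boolean Series, one entry per row
events_2017 = events[mask].copy()  # keep matching rows as an independent frame

print(events_2017)
```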

Now that we have our `month` column, let's plot the average price of avocados for each month in 2017. We can do this using Seaborn's [barplot](https://seaborn.pydata.org/generated/seaborn.barplot.html).

In [None]:
sns.barplot(x='_____', y='average_price', data=avocados_2017)

We can add another element to our barplot. Let's look at differences in price by `type` for each month. We just need to assign `type` to the `hue` parameter. Try playing around with the palette too. 

In [None]:
sns.barplot(x='_____', y='average_price', hue='_____', data=avocados_2017, palette='viridis')

Now let's look at how `total_bags` varies by `month`. Group the `avocados_2017` dataset by month and calculate the mean number of bags.

In [None]:
avocados_2017.groupby(['month'])['_______'].mean()

Let's repeat this but this time we'll look at the number of `xlarge_bags`. 

In [None]:
avocados_2017.groupby(['month'])['_______'].mean()

Which `region` sold the most avocados in 2017? Let's look at the mean `total_volume` of avocados for each region.

In [None]:
volume_by_region = avocados_2017.groupby('region')['_____'].mean().reset_index()

To see which regions sold the greatest volume of avocados, we'll need to sort `volume_by_region` by `total_volume`. Make sure to set `ascending` to `False`.

In [None]:
volume_by_region.sort_values(by='___', ______).head()
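The aggregate, reset, and sort steps above can be sketched end to end on a made-up `sales` frame (the frame and its `city` and `units` columns are hypothetical). `reset_index()` turns the group labels back into a regular column, which makes the result easy to sort and display:

```python
import pandas as pd

# Hypothetical toy data
sales = pd.DataFrame({
    "city": ["X", "Y", "X", "Z", "Y"],
    "units": [10, 40, 30, 5, 20],
})

# Mean units per city, with "city" restored as an ordinary column
units_by_city = sales.groupby("city")["units"].mean().reset_index()

# Largest means first; head(2) keeps the top two rows
top = units_by_city.sort_values(by="units", ascending=False).head(2)

print(top)
```

Keep in mind that `sort_values` returns a new DataFrame rather than sorting in place, so assign the result if you want to reuse it.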