# Exploratory Data Analysis on Avocado Prices

First of all, let's import our data and take a look at the DataFrame.

In [None]:
import pandas as pd
pd.plotting.register_matplotlib_converters()
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
data = pd.read_csv('/kaggle/input/avocado-prices/avocado.csv')
data.head()

Let's check the data types of the DataFrame's columns:

In [None]:
data.dtypes

As seen above, the Date column was read as a string, so let's convert it to datetime format.

In [None]:
data.Date = pd.to_datetime(data.Date)
data.dtypes
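
Here is a minimal sketch of the same conversion on a tiny made-up frame. The `format` string assumes ISO-style dates such as `2015-12-27`, and `errors='coerce'` is an optional safeguard that is not part of the cell above; the `toy` frame and its values are invented for illustration.

```python
import pandas as pd

# Toy data (made up for illustration); the real frame comes from avocado.csv
toy = pd.DataFrame({'Date': ['2015-12-27', '2016-01-03', 'not a date']})

# Assumed format: YYYY-MM-DD. errors='coerce' turns unparseable values into NaT
toy['Date'] = pd.to_datetime(toy['Date'], format='%Y-%m-%d', errors='coerce')

n_bad = toy['Date'].isna().sum()  # number of values that failed to parse
print(toy.dtypes)
print('Unparseable dates:', n_bad)
```

Counting the `NaT` values right after the conversion is a cheap way to confirm that nothing was silently dropped.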

Looking for missing values:

In [None]:
data.isnull().sum()

Nice! There are no missing values in our DataFrame, so we can jump to the fun part!

Let's get an overview of the price variation across the entire dataset.

In [None]:
data['AveragePrice'].agg(['min','max','mean','std'])

It looks like there was a moment when people could buy a lot of avocados without feeling guilty!

I wonder if the average price has been increasing over the years. Let's check it!

First of all, we need to create our 'month_name' column.

In [None]:
data['month_name'] = data.Date.dt.month_name()
data.head()

Selecting columns we need:

In [None]:
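# .copy() gives an independent frame, so adding or changing columns later does not raise SettingWithCopyWarning
date_price = data[['year','month_name','AveragePrice']].copy()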
# Checking which years we have in our data
date_price.year.unique()

In [None]:
# Converting 'month_name' to an ordered categorical type in order to plot the graph sorted by month
month_ordered = ['January','February','March','April','May','June','July','August','September','October','November','December']
date_price['month_name'] = pd.Categorical(date_price['month_name'], categories=month_ordered, ordered=True)

# Slicing by year
price2015 = date_price.loc[date_price['year'] == 2015].groupby('month_name').mean()
price2016 = date_price.loc[date_price['year'] == 2016].groupby('month_name').mean()
price2017 = date_price.loc[date_price['year'] == 2017].groupby('month_name').mean()
price2018 = date_price.loc[date_price['year'] == 2018].groupby('month_name').mean()

In [None]:
sns.set_style('darkgrid')
plt.figure(figsize=(12,6))
sns.lineplot(x= price2015.index, y= price2015.AveragePrice, label='2015')
sns.lineplot(x= price2016.index, y= price2016.AveragePrice, label='2016')
sns.lineplot(x= price2017.index, y= price2017.AveragePrice, label='2017')
sns.lineplot(x= price2018.index, y= price2018.AveragePrice, label='2018')
plt.xlabel('Month')

### Things to notice
- Avocados tend to be cheaper at the beginning and end of each year.
- There was a big increase in avocado prices in 2017 between July and September.
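
The monthly averages behind these observations can also be computed for all years at once. Below is a minimal sketch using `pivot_table` on made-up data; the `toy` frame, its values and the shortened `months` list are assumptions for illustration, not the notebook's actual numbers.

```python
import pandas as pd

# Toy data (made up); column names mirror the notebook's date_price frame
toy = pd.DataFrame({
    'Date': pd.to_datetime(['2015-01-04', '2015-01-11', '2015-02-01',
                            '2016-01-03', '2016-02-07', '2016-02-14']),
    'AveragePrice': [1.00, 1.20, 1.30, 0.90, 1.10, 1.50],
})
toy['year'] = toy['Date'].dt.year
toy['month_name'] = toy['Date'].dt.month_name()

# Ordered categorical keeps months in calendar order instead of alphabetical
months = ['January', 'February']
toy['month_name'] = pd.Categorical(toy['month_name'], categories=months, ordered=True)

# One row per month, one column per year, mean price in each cell
monthly = toy.pivot_table(index='month_name', columns='year',
                          values='AveragePrice', aggfunc='mean', observed=False)
print(monthly)
```

With a table shaped like this, `monthly.plot()` should draw one line per year, which avoids slicing the frame once for every year.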


So far we have only checked how the price varies within each year, ignoring the other variables. What else can we discover?

## Average Prices distribution per Type of Avocado

Let's take a look at the price density for each type of avocado.

In [None]:
sns.set_style('darkgrid')
fig, ax = plt.subplots(ncols=2, nrows=1, figsize=(20,6))
fig.suptitle('Average Price distribution per Type of Avocado')

# Swarm plot Axes
ax[0].set_title('Average Prices per Type')
sns.swarmplot(ax=ax[0], x= data.type, y= data.AveragePrice)
# KDE plot Axes
ax[1].set_title('KDE for Average Price')
# fill=True replaces the deprecated shade=True (seaborn 0.11 and later)
sns.kdeplot(ax=ax[1], x=data.loc[data['type'] == 'conventional', 'AveragePrice'], fill=True, label='Conventional')
sns.kdeplot(ax=ax[1], x=data.loc[data['type'] == 'organic', 'AveragePrice'], fill=True, label='Organic')
ax[1].legend()

It seems that organic avocado prices tend to vary more than conventional ones. Their average price also tends to be higher.
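
One way to put numbers on this impression is to compare the mean and standard deviation per type. The sketch below uses made-up prices, so the figures it prints say nothing about the real dataset; only the `groupby(...).agg(...)` pattern carries over.

```python
import pandas as pd

# Toy data (made up for illustration); the real values come from avocado.csv
toy = pd.DataFrame({
    'type': ['conventional'] * 4 + ['organic'] * 4,
    'AveragePrice': [1.0, 1.1, 1.2, 1.1, 1.2, 1.8, 1.5, 2.1],
})

# Mean shows the typical price level, std shows how much prices spread out
summary = toy.groupby('type')['AveragePrice'].agg(['mean', 'std'])
print(summary)
```

Running the same aggregation on `data` instead of `toy` would give the actual figures for the two avocado types.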

### Which region sold the most avocados?


This dataset has sales counts for US cities and also for larger aggregated regions (including the US total). The aggregated regions are not interesting for our analysis, so let's filter them out and focus on the cities.

In [None]:
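region = data[['type','region','Total Volume']]
# Aggregated regions to exclude (brackets allow line breaks, so no backslash is needed)
aggregated_regions = ['TotalUS','West','SouthCentral','Northeast','Southeast','Plains','GreatLakes',
                      'Midsouth region','Midsouth']
region = region.loc[~region['region'].isin(aggregated_regions)]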
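
As a small illustration of how the `~` and `isin` combination works, here is a sketch on made-up rows. `Houston` and `Denver` are placeholder city names and the volumes are invented.

```python
import pandas as pd

# Toy data (made up); Houston and Denver are placeholder city names
toy = pd.DataFrame({
    'region': ['TotalUS', 'West', 'Houston', 'Denver', 'Houston'],
    'Total Volume': [1000.0, 400.0, 50.0, 30.0, 60.0],
})

aggregates = ['TotalUS', 'West']               # rows that summarize several areas
is_aggregate = toy['region'].isin(aggregates)  # boolean mask, True for aggregate rows
cities = toy.loc[~is_aggregate]                # ~ inverts the mask, keeping the other rows

print(cities['region'].unique())
```

Printing the remaining unique values is a quick check that no aggregated region slipped through the filter.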

Let's make a stacked bar plot so we can see the Total Volume sold per type of avocado. Let's keep only the 5 top-selling cities.

In [None]:

# Total volume (both types) for the 5 top-selling regions
region_total = region[['region','Total Volume']].groupby('region').sum().sort_values(by='Total Volume', ascending=False).iloc[0:5]
# Organic volume for the same regions
region_organic = region.loc[(region['type'] == 'organic') & (region['region'].isin(region_total.index))][['region','Total Volume']]
region_organic = region_organic.groupby('region').sum()
top_order = list(region_total.index)  # same bar order in both plots so the bars line up
plt.figure(figsize=(14,6))
# The organic bars are drawn over the totals, so the visible purple part is the conventional volume
sns.barplot(x=region_total.index, y=region_total['Total Volume'], order=top_order, color='purple', label='Conventional')
sns.barplot(x=region_organic.index, y=region_organic['Total Volume'], order=top_order, color='blue', label='Organic')
plt.legend(fontsize=14)
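
The per-type volumes can also be put side by side in a single table before plotting. This is a minimal sketch on made-up data; `CityA` and `CityB` are placeholder region names, and `n_top` is a hypothetical parameter standing in for the 5 used above.

```python
import pandas as pd

# Toy data (made up); 'CityA' and 'CityB' are placeholder region names
toy = pd.DataFrame({
    'region': ['CityA', 'CityA', 'CityB', 'CityB', 'CityA'],
    'type': ['conventional', 'organic', 'conventional', 'organic', 'conventional'],
    'Total Volume': [100.0, 10.0, 300.0, 20.0, 50.0],
})

# Sum volume per region and type, then spread the types into columns
by_type = toy.groupby(['region', 'type'])['Total Volume'].sum().unstack('type')

# Rank regions by their combined volume and keep the n_top largest
n_top = 1
ranking = by_type.sum(axis=1).sort_values(ascending=False).index
top = by_type.loc[ranking].head(n_top)
print(top)
```

Calling `top.plot(kind='bar', stacked=True)` on a table like this should stack the two types for each region, with no need to overlay two separate bar plots.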

In [None]:
# .copy() avoids SettingWithCopyWarning when the month_name column is replaced below
california = data.loc[data.region == 'California'].copy()
california['month_name'] = pd.Categorical(california['month_name'], categories=month_ordered, ordered=True)
fig, ax= plt.subplots(1,2, figsize=(25,6))
plt.suptitle('Price Variation x Amount of Avocado sold in California')
sns.lineplot(ax=ax[0], x=california['month_name'], y = california['AveragePrice'])
sns.lineplot(ax=ax[1], x=california['month_name'], y = california['Total Volume'], color='purple')

It seems that people tend to buy fewer avocados when prices rise. I would buy fewer, too! What about you?
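
A rough way to check this impression numerically is the Pearson correlation between price and volume. The sketch below uses made-up numbers chosen to fall as price rises, so its result is not a finding about California; a correlation also says nothing about cause.

```python
import pandas as pd

# Toy data (made up): volume falls as price rises
toy = pd.DataFrame({
    'AveragePrice': [1.0, 1.2, 1.4, 1.6, 1.8],
    'Total Volume': [500.0, 450.0, 420.0, 380.0, 300.0],
})

# Pearson correlation: values near -1 mean volume drops as price goes up
corr = toy['AveragePrice'].corr(toy['Total Volume'])
print(round(corr, 2))
```

Applying the same call to the `california` frame would show how strong the relationship is in the real data.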

Thank you for spending some time checking my work! This is my first data analysis here on Kaggle. I would appreciate your comments or tips to improve my analysis skills. Feel free to say something!

![](https://www.pngkit.com/png/detail/70-706212_nutrition-for-nerds-nutrition-manga-food-and-wine.png)