# * **Introduction**

**Avocado is one of the fruits that most vegetarians love. Today we will analyze the avocado price data.**

# * **Data**

This data was downloaded from the Hass Avocado Board website in May of 2018 & compiled into a single CSV. Here's how the Hass Avocado Board describes the data on [their website](https://hassavocadoboard.com/):

Some relevant columns in the dataset:

Date - The date of the observation.

AveragePrice - the average price of a single avocado.

type - conventional or organic.

year - the year.

region - the city or region of the observation.

Total Volume - Total number of avocados sold.

4046 - Total number of avocados with PLU 4046 sold.

4225 - Total number of avocados with PLU 4225 sold.

4770 - Total number of avocados with PLU 4770 sold.


# * **Let's Start**

**First, we will import the libraries and read the data**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv('../input/avocado-prices/avocado.csv')

**Show the data**

In [None]:
df.head()

In [None]:
df.columns

In [None]:
df.shape

**Now let's describe the data**

In [None]:
df.describe()

**The first column is just a row index, so we drop it to get clean data.**

In [None]:
df = df.drop('Unnamed: 0', axis = 1)
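An alternative to dropping the column afterwards is to ask `read_csv` to use the first column as the row index while loading. The sketch below assumes a tiny made-up CSV (`sample_csv` and its values are hypothetical, not the real file):

```python
import io
import pandas as pd

# Hypothetical two-row CSV that mimics the unnamed index column of the real file
sample_csv = io.StringIO(
    ",Date,AveragePrice,type\n"
    "0,2015-01-04,1.10,conventional\n"
    "1,2015-01-11,1.20,conventional\n"
)

# index_col=0 uses the first column as the row index,
# so it is never loaded as a regular 'Unnamed: 0' column
sample = pd.read_csv(sample_csv, index_col=0)
print(sample.columns.tolist())
```

Both approaches should leave the same set of data columns; this one simply skips the extra `drop` step.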

# * **Missing Data**

**Let's check whether there is any missing data.**

In [None]:
df.isna().sum()

**Great, there is no missing data, which makes the work easier and more accurate.**

# * **Data Visualization**

**We will focus on the average price and look for relationships between it and the other features.**

**First, let's plot its distribution to understand the values.**

In [None]:
plt.figure(figsize = (9,6))
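plt.title('Distribution of Average Price')
# distplot is deprecated in recent seaborn versions; histplot with a KDE overlay is the replacement
sns.histplot(df['AveragePrice'], kde = True, color = 'b')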

**The average price ranges roughly between 0.3 and 3.4. The distribution peaks at around 1.1, and most of the data is concentrated between 0.9 and 1.8.**

**There are two types of avocados in the data, organic and conventional, so let's see how many observations there are of each.**

In [None]:
print('The number of each type ',df['type'].value_counts())
plt.figure(figsize =(9,6))
plt.title('The number of each type')
# Pass the column as a keyword; recent seaborn versions no longer accept it positionally
sns.countplot(x = 'type', data = df)

In [None]:
plt.figure(figsize = (9,6))
plt.title('Price of each avocado type')
sns.boxplot(x= 'type', y = 'AveragePrice', data = df)

**It looks like organic avocados are more expensive than conventional ones.**
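The visual impression from a boxplot can be backed by numbers with a `groupby`. This is a minimal sketch on a made-up frame (`toy` and its prices are hypothetical); with the real data, `df` would take its place:

```python
import pandas as pd

# Made-up prices for illustration only; with the real data, use df instead of toy
toy = pd.DataFrame({
    'type': ['organic', 'organic', 'conventional', 'conventional'],
    'AveragePrice': [1.6, 1.8, 1.0, 1.2],
})

# Median and mean price per avocado type
summary = toy.groupby('type')['AveragePrice'].agg(['median', 'mean'])
print(summary)
```

The median is the line drawn inside each box, so comparing medians matches what the boxplot shows.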

In [None]:
plt.figure(figsize =(11,6))
plt.title('Avocado price in each region')
plt.xticks(rotation ='vertical')
sns.boxplot(x = 'region', y = 'AveragePrice', data = df, width = 1, whis= 2)

**Looking at prices by region, San Francisco is generally the most expensive city for avocados and Houston is the cheapest.**
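With this many regions on one axis, a sorted table of medians can be easier to read than the boxes. A rough sketch, assuming a small made-up frame (`toy`, its region labels and its prices are all hypothetical):

```python
import pandas as pd

# Made-up rows for illustration only; with the real data, use df instead of toy
toy = pd.DataFrame({
    'region': ['Houston', 'Houston', 'SanFrancisco', 'SanFrancisco', 'Chicago', 'Chicago'],
    'AveragePrice': [0.9, 1.1, 1.7, 1.9, 1.4, 1.6],
})

# Median price per region, sorted from cheapest to most expensive
median_by_region = toy.groupby('region')['AveragePrice'].median().sort_values()
cheapest = median_by_region.index[0]
most_expensive = median_by_region.index[-1]
print(median_by_region)
```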

**Now let's play a little with the data. We will split the data by avocado type and then, for each type, look at its prices by region and year.**

In [None]:
organic = df[df['type'] == 'organic']
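# factorplot was renamed to catplot (point plot variant) and its `size` argument to `height`
# Newer seaborn versions prefer linestyle='none' over join=False
sns.catplot(x = 'AveragePrice', y = 'region', hue = 'year', data = organic, kind = 'point',
            height = 12, aspect = 0.8, join = False)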

In [None]:
conventional = df[df['type'] == 'conventional']
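# Same plot for conventional avocados, using catplot instead of the removed factorplot
sns.catplot(x = 'AveragePrice', y = 'region', hue = 'year', data = conventional, kind = 'point',
            height = 12, aspect = 0.8, join = False)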

**When the data is split by type, we find that the lowest conventional prices are in Phoenix and the most expensive city is Chicago.**

In [None]:
plt.figure(figsize = (9,6))
plt.title('Average price each year')
sns.boxplot(x = 'year', y= 'AveragePrice', data = df)

**Prices were at their highest in 2017 :)**

**Now we will look at the relationship between price and month. Since we have the date, we can extract the month from it.**

In [None]:
df['Date'] = pd.to_datetime(df['Date'])
df['months'] = df['Date'].dt.month  # vectorized month extraction
plt.figure(figsize = (9,6))
plt.title('Average price each month')
sns.lineplot(x = 'months', y = 'AveragePrice' , data = df, hue = 'type')

**As expected, prices rise in some months and fall in others, for both types. This suggests looking at the data by season of the year.**

In [None]:
# Season codes: Dec-Feb -> 1, Mar-May -> 2, Jun-Aug -> 3, Sep-Nov -> 4
seasons = {1: 'winter', 2: 'spring', 3: 'summer', 4: 'autumn'}
df['seasons'] = ((df['months'] % 12 + 3) // 3).map(seasons)
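The arithmetic `(month % 12 + 3) // 3` is compact, so it helps to check it month by month. The sketch below wraps the same expression in a small helper (`month_to_season` is a hypothetical name, not part of the notebook) and assumes Northern Hemisphere seasons:

```python
# Same arithmetic as in the cell above: Dec-Feb -> 1, Mar-May -> 2, Jun-Aug -> 3, Sep-Nov -> 4
SEASONS = {1: 'winter', 2: 'spring', 3: 'summer', 4: 'autumn'}

def month_to_season(month):
    """Map a month number (1-12) to a Northern Hemisphere season name."""
    return SEASONS[(month % 12 + 3) // 3]

# Print the season assigned to every month
for m in range(1, 13):
    print(m, month_to_season(m))
```

The `% 12` step turns December into 0, which is what places it in the same group as January and February.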

In [None]:
plt.figure(figsize = (9,6))
plt.title('Average price each season')
sns.barplot(x = 'seasons', y= 'AveragePrice', data = df)

**Prices clearly rise in autumn and drop in winter.**

In [None]:
plt.figure(figsize = (9,6))
plt.title('Number of observations in each season')
sns.countplot(x = 'seasons', data = df)

**Finally, let's plot a correlation heatmap**

In [None]:
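from sklearn.preprocessing import LabelEncoder

# Encode the categorical 'type' column as integers and remember the original labels
objectt = LabelEncoder()
di = {}
objectt.fit(df['type'].drop_duplicates())
di['type'] = list(objectt.classes_)
df['type'] = objectt.transform(df['type'])

# Same for 'seasons'. LabelEncoder assigns codes in alphabetical order,
# so these integers are labels only and do not follow the calendar order.
di2 = {}
objectt2 = LabelEncoder()
objectt2.fit(df['seasons'].drop_duplicates())
di2['seasons'] = list(objectt2.classes_)
df['seasons'] = objectt2.transform(df['seasons'])

# Pearson correlation matrix of the selected columns, drawn as an annotated heatmap
cols = ['AveragePrice','type','year','Total Volume','Total Bags', 'seasons']
cm = np.corrcoef(df[cols].values.T)
sns.heatmap(cm, cbar = True, fmt = '.2f', annot = True, square = True
            , yticklabels = cols, xticklabels = cols)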
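As a side note, pandas can compute the same Pearson matrix with `DataFrame.corr()`, which keeps the column names attached. A minimal sketch on a made-up numeric frame (`toy` and its values are hypothetical):

```python
import numpy as np
import pandas as pd

# Made-up numeric columns for illustration only
toy = pd.DataFrame({
    'AveragePrice': [1.0, 1.2, 1.5, 1.8, 1.1],
    'Total Volume': [900.0, 700.0, 400.0, 300.0, 800.0],
    'type': [0, 0, 1, 1, 0],
})

# Pearson correlation by default; the result is a labeled DataFrame
corr = toy.corr()
print(corr.round(2))

# The values agree with the np.corrcoef approach used above
same = np.allclose(corr.values, np.corrcoef(toy.values.T))
print(same)
```

Because the result carries its own labels, it could be passed to `sns.heatmap` without setting `xticklabels` and `yticklabels` by hand.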