# Introduction

## Features
- Date - The date of the observation
- AveragePrice - the average price of a single avocado
- type - conventional or organic
- year - the year
- region - the city or region of the observation
- Total Volume - Total number of avocados sold
- 4046 - Total number of avocados with PLU 4046 sold
- 4225 - Total number of avocados with PLU 4225 sold
- 4770 - Total number of avocados with PLU 4770 sold

## Purpose
The purpose of this notebook is to perform a robust analysis of the data. It will include data cleaning, descriptive analysis, exploratory data analysis, data analysis with data wrangling, and classification and regression models.

## Table of Contents
1. [Data Loading and Data Cleaning](#1.-Data-Loading-and-Data-Cleaning)
2. [Descriptive Analysis](#2.-Descriptive-Analysis)
3. [EDA](#3.-EDA)
4. [Organic vs Conventional](#4.-which-type-sells-better?-which-one-is-expensier?)
5. [Are organic avos gaining popularity?](#5.-Are-organic-avos-gaining-popularity?)
6. [Seasonality. When can I find more avos? and cheaper?](#6.-Seasonality.-When-can-I-find-more-avos?-and-cheaper?)
7. [Regions. Where can I find more avos? and cheaper?](#7.-Regions.-Where-can-I-find-more-avos?-and-cheaper?)
8. [Classification models. Predicting the type of avocado](#8.-Classification-models.-Predicting-the-type-of-avocado)
9. [Regression models](#9.-Regression-models)

In [None]:
# Data manipulation
from datetime import datetime
import pandas as pd
import numpy as np

# Data visualization
import matplotlib.pyplot as plt
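# DivergingNorm was renamed to TwoSlopeNorm in matplotlib 3.2; keep the old name as an alias
try:
    from matplotlib.colors import TwoSlopeNorm as DivergingNorm
except ImportError:
    from matplotlib.colors import DivergingNorm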
import seaborn as sns
sns.set_style('whitegrid')

# preprocessing
from sklearn.preprocessing import StandardScaler, RobustScaler, MinMaxScaler, Normalizer
from sklearn.preprocessing import OneHotEncoder


# Machine Learning
from sklearn.model_selection import train_test_split

# Classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Regression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.tree import DecisionTreeRegressor


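# np.warnings is not available in recent NumPy releases; use the standard library module
import warnings
warnings.filterwarnings('ignore')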

# 1. Data Loading and Data Cleaning

In [None]:
avo = pd.read_csv("../input/avocado-prices/avocado.csv")

In [None]:
#avo.profile_report()

In [None]:
display(avo.head(5))
print(avo.info())
print(avo.describe())
print("\n", avo.shape)

The first column contains redundant index data, so let's drop it.

In [None]:
avo.drop('Unnamed: 0', axis=1, inplace=True)

Let's begin by taking a look at null values.

In [None]:
sns.heatmap(avo.isnull())

Thankfully, we don't have any null values. Let's continue with the descriptive analysis and, as we take a closer look at the data, see if we find strange values that we can drop.
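
A heatmap is a quick visual check, but a numeric count per column is easier to read when nulls are rare. Here is a minimal sketch on a hypothetical toy frame (the values below are made up and stand in for `avo`):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `avo`; the values are invented.
toy = pd.DataFrame({
    "AveragePrice": [1.33, np.nan, 0.93],
    "Total Volume": [64236.62, 54876.98, np.nan],
    "type": ["conventional", "organic", "organic"],
})

# Count missing values per column; all zeros would mean nothing to impute or drop.
null_counts = toy.isnull().sum()
print(null_counts)

# Single boolean answer for the whole frame.
has_nulls = bool(toy.isnull().values.any())
print("any nulls:", has_nulls)
```

On the real data, `avo.isnull().sum()` should give the same kind of per-column summary.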

# 2. Descriptive Analysis
In this section we will take a closer look at the data, plot distributions, further clean the data, calculate basic initial statistics and start analysing the dataset.
To do this, we are going to look at each feature in turn.

In [None]:
avo.info()
avo.head()

## 2.1 Strings

Now let's take a look at the columns with data type 'object'.

In [None]:
avo.select_dtypes('object').columns

In [None]:
print(avo['type'].value_counts())
# recent seaborn versions require the column to be passed by keyword
sns.countplot(x='type', data=avo, palette='Set3')

plt.show()

- We have two classes which are almost perfectly balanced, so `type` could be used as the target of a classification model
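
To put a number on "balanced", the class shares can be computed directly. A small sketch with hypothetical labels (in the notebook this would be `avo['type']`):

```python
import pandas as pd

# Hypothetical labels standing in for avo['type'].
labels = pd.Series(
    ["conventional", "organic", "conventional", "organic", "organic", "conventional"]
)

# Share of each class; values near 0.5 indicate a balanced binary target.
shares = labels.value_counts(normalize=True)
print(shares)
```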

In [None]:
print(avo['region'].value_counts())
print('\n', 'There are:', len(avo['region'].unique()), 'unique values in the feature')
# recent seaborn versions require the column to be passed by keyword
sns.countplot(x='region', data=avo, palette='Set3')

plt.show()

- The feature has 54 unique values, which are evenly distributed through the dataset.
- For machine learning purposes, the data could be transformed with one-hot encoding (e.g. `OneHotEncoder`) to have a larger variety of features to build a model
- The data can also be used to analyze price behaviour and quantity sold in each region
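
As a rough illustration of what one-hot encoding does to a column like `region`, here is a sketch using `pd.get_dummies` on a made-up frame (the region names and prices are only examples):

```python
import pandas as pd

# Made-up rows; only the column names mirror the dataset.
toy = pd.DataFrame({
    "region": ["Albany", "Boston", "Albany", "Denver"],
    "AveragePrice": [1.33, 1.45, 1.29, 1.10],
})

# One indicator column per distinct region; numeric columns pass through unchanged.
encoded = pd.get_dummies(toy, columns=["region"])
print(encoded.columns.tolist())
```

With 54 regions this would add 54 indicator columns, which is worth keeping in mind for distance-based models.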

## 2.2 Numbers

In [None]:
avo.info()

In [None]:
numbers = list(avo.select_dtypes(['float64', 'int64']).keys())

# removing years
numbers.remove('year')

avo[numbers].hist(figsize=(20,10), color='green', edgecolor='white')

plt.show()

display(avo[numbers].describe())

**AveragePrice**
- It has the most normal-looking distribution. Mean and median are really close, which means the distribution is not severely influenced by outliers. Still, it is a bit skewed to the right, as reflected by the mean being bigger than the median.

**Remaining features**
- The remaining features are severely influenced by outliers: most of the values are located in the first bin of the histograms and the mean is way bigger than the median.
- These features seem to follow the same distribution, which makes sense since the information (quantity sold) is similar
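
The "mean bigger than median" reading of right skew can be checked numerically. A minimal sketch on an invented sample with one large value (not taken from the dataset):

```python
import pandas as pd

# Invented right-skewed sample: many small values and one very large one.
volume = pd.Series([10, 12, 11, 13, 12, 11, 500], dtype=float)

mean, median = volume.mean(), volume.median()
skewness = volume.skew()  # positive values indicate a longer right tail

print(f"mean={mean:.2f}, median={median:.2f}, skew={skewness:.2f}")
```

The single large value pulls the mean far above the median, which is the same pattern the volume histograms show.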

Let's take the outliers out of the quantities to see if we can find a more normal distribution.

In [None]:
avo_o = avo[avo['Total Volume']<50000]
avo_o[numbers].hist(figsize=(20,10), color='green', edgecolor='white')

plt.show()

This kind of distribution, where most of the values sit at the low end and the frequency then falls off, is really common. It can be represented in a different way through a log transform to make it more 'normal' and more useful for a model, such as a regression model, without getting rid of outliers.

An example below with Total Volume.


In [None]:
TotalLog = np.log(avo['Total Volume'] + 1)
TotalLog.hist(color='green', edgecolor='white')

## 2.3 Dates

We have two date columns, 'Date' and 'year', where 'year' is the year extracted from 'Date'. To make the analysis easier, let's extract day and month out of 'Date' and look at each value separately. That way, we also get two more potentially useful columns: day and month.

In [None]:
avo.info()

In [None]:
# parse the date strings in one vectorized call
avo['Date'] = pd.to_datetime(avo['Date'], format='%Y-%m-%d')

avo['month'] = avo['Date'].dt.month
avo['day'] = avo['Date'].dt.day
# monday = 0
avo['day of week'] = avo['Date'].dt.dayofweek
dates = ['year', 'month', 'day', 'day of week']
avo[dates]

In [None]:
fig, ax = plt.subplots(2,2, figsize=(20,10))

sns.countplot(x='year', data=avo, ax=ax[0,0], palette='BuGn_r')
sns.countplot(x='month', data=avo, ax=ax[0,1], palette='BuGn_r')
sns.countplot(x='day', data=avo, ax=ax[1,0], palette='BuGn_r')
sns.countplot(x='day of week', data=avo, ax=ax[1,1], palette='BuGn')

plt.show()

**year**
- 2015, 2016 and 2017 have almost the same number of entries
- 2018 has the fewest; the data collection must have ended at the beginning of 2018

**month**
- Shows a descending pattern. This could be for the same reason as year: the 2018 data ends at the beginning of the year and, therefore, the first months have more entries

**day & day of week**
- The day chart shows a repeating pattern because the data was always recorded on the same weekday: day 6 (Sunday).
- The data was, therefore, recorded weekly, so 'day of week' is redundant and we can eliminate it.
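
The weekly-on-Sunday observation can also be verified without a chart. A small sketch on hypothetical dates (the series below is generated, not read from the dataset):

```python
import pandas as pd

# Hypothetical weekly dates; 2015-01-04 is a Sunday.
dates = pd.Series(pd.date_range("2015-01-04", periods=5, freq="7D"))

# Monday=0 ... Sunday=6, so a single value of 6 means every row is a Sunday.
weekdays = dates.dt.dayofweek.unique().tolist()
print(weekdays)

# Gap in days between consecutive observations; a single value of 7 means weekly data.
gap_days = dates.diff().dropna().dt.days.unique().tolist()
print(gap_days)
```

On the real data the dates would need to be sorted within one region and type before taking the differences.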

In [None]:
avo.drop('day of week', axis=1, inplace=True)

## 2.4 Descriptive analysis conclusions
- 'type' has two categories and is balanced; it could be used as the target in a classification model
- 'region' has 54 unique values and is perfectly balanced; it could be one-hot encoded for model building
- 'AveragePrice' shows a fairly normal distribution and looks like a good candidate target variable for a regression model
- The units-sold columns show similar data which is similarly distributed; log transforms could be used to improve model performance
- 'Date' is evenly distributed until 2018 and shows that the data was recorded on a weekly basis, every Sunday

Let's begin analysing and exploring the data to get insights through data wrangling and a clearer idea of what the model is going to look like.

In [None]:
avo.info()

# 3. EDA
## 3.1 Correlations
- Let's begin by looking at correlations so we can represent our data in a scatterplot along with the type of avocado
- We are going to look at the dataset both with and without outliers

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(20,8))

avo_o = avo[avo['Total Volume']<50000]

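# correlate numeric columns only (newer pandas raises on non-numeric columns)
sns.heatmap(avo.select_dtypes('number').corr(), vmin=-1, vmax=1, cmap=sns.diverging_palette(20, 220, as_cmap=True), annot=True, ax=ax[0])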
ax[0].set_title('With outliers', fontsize=20)

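# correlate numeric columns only (newer pandas raises on non-numeric columns)
sns.heatmap(avo_o.select_dtypes('number').corr(), vmin=-1, vmax=1, cmap=sns.diverging_palette(20, 220, as_cmap=True), annot=True, ax=ax[1])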
ax[1].set_title('Without outliers', fontsize=20)

plt.show()


- We are going to take the strongest relationship among the volume variables and the strongest among the date variables
- We are going to take the relationships with AveragePrice, out of both heatmaps, since it is our target variable for the regression model
- We are going to color the scatterplot by the type of avo since it is our target variable for the classification model
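
Picking the strongest relationships by eye from a heatmap can be error-prone, so one option is to rank the correlations with the target. A sketch on a made-up frame (values invented; only the column names mirror the dataset):

```python
import pandas as pd

# Made-up rows for illustration.
toy = pd.DataFrame({
    "AveragePrice": [1.0, 1.2, 1.4, 1.6],
    "Total Volume": [400.0, 300.0, 200.0, 100.0],
    "month": [1, 2, 4, 3],
    "type": ["conventional", "conventional", "organic", "organic"],
})

# Keep numeric columns only, then read off the correlations with the price.
corr_with_price = (
    toy.select_dtypes("number")
       .corr()["AveragePrice"]
       .drop("AveragePrice")
)

# Rank by absolute strength so strong negative links are not overlooked.
order = corr_with_price.abs().sort_values(ascending=False).index
ranked = corr_with_price.reindex(order)
print(ranked)
```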

In [None]:
fig, ax = plt.subplots(2, 2, figsize=(20,10))

sns.scatterplot(x='4046', y='AveragePrice', data=avo, hue='type', ax=ax[0,0])
sns.scatterplot(x='Large Bags', y='AveragePrice', data=avo_o, hue='type', ax=ax[0,1])
sns.scatterplot(x='month', y='AveragePrice', data=avo, hue='type', ax=ax[1,0])
sns.scatterplot(x='month', y='AveragePrice', data=avo_o, hue='type', ax=ax[1,1])

- An important insight here is that we can't take the outliers out, since all of them correspond to the conventional type, which means that conventional avocados sell way more than organic avocados
- There doesn't seem to be a relationship between month and AveragePrice. What we can see in this graph is that the average price of conventional avocados is way lower than that of organic ones. **We are going to take a closer look at this in later sections**
- There is an expected decreasing trend for both types: the more units sold, the lower the average price. **We are going to take a closer look at this later as well**
- Perhaps a better way of representing the data is not by taking out the outliers but by normalizing the data. Let's try that now with AveragePrice and 4046

In [None]:
scaler = Normalizer()
scaler.fit(avo[['4046', 'AveragePrice']].values)
avo['4046_scaled'] = scaler.transform(avo[['4046', 'AveragePrice']].values)[:,0]
avo['AveragePrice_scaled'] = scaler.transform(avo[['4046', 'AveragePrice']].values)[:,1]

sns.regplot(x='4046_scaled', y='AveragePrice_scaled', data=avo, color='g')
plt.show()

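Beautiful :) There is a clear tendency, which suggests that both the regression and the classification are feasible. One caveat: `Normalizer` rescales each row to unit length, so the two scaled columns are tied together by construction and part of this pattern comes from the transformation itself.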
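
It may help to see what `Normalizer` does compared with a column-wise scaler such as `StandardScaler`. The sketch below uses invented (volume, price) pairs standing in for `['4046', 'AveragePrice']`:

```python
import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler

# Invented (volume, price) pairs; not taken from the dataset.
X = np.array([[1000.0, 1.2], [50.0, 1.8], [3.0, 1.5], [0.5, 2.0]])

# Normalizer works per row: each sample is divided by its own L2 norm.
row_scaled = Normalizer().fit_transform(X)
row_norms = np.linalg.norm(row_scaled, axis=1)
print(row_norms)

# StandardScaler works per column: each feature gets zero mean and unit variance.
col_scaled = StandardScaler().fit_transform(X)
print(col_scaled.mean(axis=0).round(6), col_scaled.std(axis=0).round(6))
```

Because every row ends up with norm 1, the two row-scaled columns always satisfy x² + y² = 1, whatever the original data looked like. A column-wise scaler does not impose that constraint.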

## 3.2 Dates
Can we predict price or volume with a time series analysis?

In [None]:
fig, ax = plt.subplots(2, 1, figsize=(23,10))

avo['year_month'] = avo['Date'].dt.to_period('M')
grouped = avo.groupby('year_month')[['AveragePrice', 'Total Volume']].mean()

ax[0].plot(grouped.index.astype(str), grouped['AveragePrice'])
ax[0].tick_params(labelrotation=90)
ax[0].set_ylabel('AveragePrice')


ax[1].plot(grouped.index.astype(str), grouped['Total Volume'])
ax[1].tick_params(labelrotation=90)
ax[1].set_ylabel('Total Volume')

plt.show()


- From the graphic we can tell, first of all, that average price and total volume move in opposite directions
- Total volume has a spike at the beginning of the year. On the other hand, average price drops at the beginning of the year
- These drops and spikes are a sign of seasonality, which could help in forecasting
- We will dig deeper into this seasonality in later sections
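
One simple way to look for seasonality is to average each calendar month across years and see whether the same month keeps coming out on top. A sketch on an invented two-year monthly series:

```python
import pandas as pd

# Invented monthly volumes spanning two years; the same shape repeats each year.
idx = pd.period_range("2015-01", "2016-12", freq="M")
volume = pd.Series(
    [90, 80, 70, 60, 60, 55, 55, 50, 50, 55, 60, 65] * 2,
    index=idx,
    dtype=float,
)

# Average each calendar month across years; a repeating peak suggests seasonality.
by_month = volume.groupby(volume.index.month).mean()
peak_month = int(by_month.idxmax())

print(by_month)
print("peak month:", peak_month)
```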

## 3.3 EDA Conclusions
- Conventional avocados sell way more than organic avocados and cost less. Therefore, Total Volume, along with the other volume variables and AveragePrice, should work well to predict our target variable, type, in the classification model
- Average price and total volume move in opposite directions; this will come in handy when doing a regression analysis on our target variable, average price
- In the time series exploration, we see a spike in total volume and a drop in prices at the beginning of the year, hinting at seasonality and forecasting possibilities

# 4. which type sells better? which one is expensier?
We know from previous sections that organic is more expensive and sells less, but let's get into the numbers.

In [None]:
avo.info()

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(14,5))

sns.barplot(x='type', y='AveragePrice', data=avo, palette='Set3', ax=ax[0])
sns.barplot(x='type', y='Total Volume', data=avo, palette='Set3', ax=ax[1], estimator=sum, ci=None)
plt.show()

display(avo.groupby('type')['AveragePrice'].mean())
display(avo.groupby('type')['Total Volume'].sum())

- So we see that conventional is cheaper than organic but, more shockingly, conventional sales dwarf organic sales

So conventional avos are performing quite well and organic are being left behind, but is organic at least gaining popularity?

# 5. Are organic avos gaining popularity?

In [None]:
fig, ax = plt.subplots(2, 1, figsize=(23,12))
fig.tight_layout(pad=8)


group = avo.groupby(['type', 'year_month'])['Total Volume'].sum()

organic = group['organic']
organic = pd.DataFrame(organic)
organic['Total Volume % change'] = np.round(organic['Total Volume'].pct_change() * 100, 2)

conventional = group['conventional']
conventional = pd.DataFrame(conventional)
conventional['Total Volume % change'] = np.round(conventional['Total Volume'].pct_change() * 100, 2)

norm = DivergingNorm(vmin=organic['Total Volume % change'].min(), vcenter=0, vmax=organic['Total Volume % change'].max())
colors = [plt.cm.RdYlGn(norm(c)) for c in organic['Total Volume % change']]
sns.barplot(x=organic.index, y=organic['Total Volume % change'], data=organic, ax=ax[0], palette=colors)

norm = DivergingNorm(vmin=conventional['Total Volume % change'].min(), vcenter=0, vmax=conventional['Total Volume % change'].max())
colors = [plt.cm.RdYlGn(norm(c)) for c in conventional['Total Volume % change']]
sns.barplot(x=conventional.index, y=conventional['Total Volume % change'], data=conventional, ax=ax[1], palette=colors)


ax[0].tick_params(labelrotation=90)
ax[0].set_title('Organic Percentage Change in Sales', fontsize=15)

ax[1].tick_params(labelrotation=90)
ax[1].set_title('Conventional Percentage Change in Sales', fontsize=15)

plt.show()

conventional['Total Volume % change'].mean()
print("The sum of percentage change of Organic is: {}".format(np.around(organic['Total Volume % change'].sum(), 2)))
print("The sum of percentage change of Conventional is: {}".format(np.around(conventional['Total Volume % change'].sum(), 2)))

- It is hard to tell from the graphic alone, but if we sum every percentage change we find that organic has bigger overall growth, with 200.48 against 137.02 for conventional.
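
The sum of monthly percentage changes is only a rough indicator, since percentage changes compound rather than add. As a complementary check, the first-to-last growth can be computed as well. A sketch on invented volumes:

```python
import pandas as pd

# Invented monthly volumes for illustration only.
volume = pd.Series([100.0, 150.0, 75.0, 150.0])

# Month-over-month change in percent (the first entry is NaN and is skipped by sum).
pct = volume.pct_change() * 100
sum_of_changes = pct.sum()

# Growth from the first to the last month, in percent.
overall_growth = (volume.iloc[-1] / volume.iloc[0] - 1) * 100

print(f"sum of monthly changes: {sum_of_changes:.2f}")
print(f"first-to-last growth:   {overall_growth:.2f}")
```

Here the two measures disagree by a factor of two, so it might be worth reporting both on the real data.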

Let's add some business strategy concepts to refine the strategy and conclusions here.

The BCG matrix is a model that evaluates how a business is performing according to its growth and market share. It has four quadrants:
1. Dogs: Products with low growth and low market share.
2. Question marks or Problem Child: Products in high growth markets with low market share.
3. Stars: Products in high growth markets with high market share.
4. Cash cows: Products in low growth markets with high market share.

- Organic might have way smaller sales than conventional, but its growth rate (higher than conventional) is a good sign to keep producing organic avos, and it already has a market. This is a healthy indicator for businesses. Then, **organic is a Star in the BCG matrix**. A suggestion would be to pursue a business growth strategy with them: technologies and methods that produce more at lower cost, promotion and imports.
- Conventional avos are very successful and have an already established business infrastructure. Therefore, **conventional are Cash cows in the BCG matrix**, and businesses should keep producing them at the same or a higher rate.

# 6. Seasonality. When can I find more avos? and cheaper?
Section 3.2 gave us a seasonality clue: more avos are sold at the beginning of the year.
Let's take a closer look and confirm this.

The approach will be to get the quarters and the average Total Volume and price for each quarter.

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(12,5))

avo['quarter'] = avo['Date'].dt.quarter


sns.barplot(x='quarter', y='Total Volume', data=avo, palette='Greens_r', ci=None, ax=ax[0])
sns.barplot(x='quarter', y='AveragePrice', data=avo, palette='Greens_r', ci=None, ax=ax[1])


plt.show()

quarter = avo.groupby('quarter')[['Total Volume', 'AveragePrice']].mean()
display(quarter)

- So we see that in the first quarter of the year sales are higher than in other quarters and prices are the lowest.
- After the first quarter, sales decrease and prices rise. Given the popularity of avos, businesses should consider importing more avos when they are not produced in the country, a big opportunity for businesspeople from both countries.

# 7. Regions. Where can I find more avos? and cheaper?

In [None]:
avo.head()

## 7.1 Price

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(18,5))

regionP = avo.groupby('region')['AveragePrice'].mean()

expensive = regionP.sort_values(ascending = False).iloc[:10]
cheap = regionP.sort_values().iloc[:10]

sns.barplot(x='AveragePrice', y='region', data = avo, order=expensive.index, ci=None, palette='Greens_r', ax=ax[0])
sns.barplot(x='AveragePrice', y='region', data = avo, order=cheap.index, ci=None, palette='Greens_r', ax=ax[1])

plt.show()

cheap = pd.DataFrame(cheap).reset_index()
expensive = pd.DataFrame(expensive).reset_index()

print('the most expensive avocados can be found in {} '.format(list(expensive.iloc[:5,0])))
print('the cheapest avocados can be found in {} '.format(list(cheap.iloc[:5,0])))


## 7.2 Quantity

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(18,5))

avoStates = avo[avo['region'] !='TotalUS']

regionV = avoStates.groupby('region')['Total Volume'].sum()

most = regionV.sort_values(ascending = False).iloc[:10]
least = regionV.sort_values().iloc[:10]

sns.barplot(x='Total Volume', y='region', data = avoStates, order=most.index, ci=None, palette='Greens_r', ax=ax[0])
sns.barplot(x='Total Volume', y='region', data = avoStates, order=least.index, ci=None, palette='Greens_r', ax=ax[1])

plt.show()

most = pd.DataFrame(most).reset_index()
least = pd.DataFrame(least).reset_index()

print('States with the biggest demand are {} '.format(list(most.iloc[:5,0])))
print('States with the least demand are {} '.format(list(least.iloc[:5,0])))

# 8. Classification models. Predicting the type of avocado

## 8.1 Decision Tree Classifier

In [None]:
avoTree = avo.drop(['Date', 'region', '4046_scaled', 'AveragePrice_scaled', 'year_month'], axis=1)

target = avoTree['type']
features = avoTree.drop(['type'], axis=1)

X_train, X_test, y_train, y_test = train_test_split(features.values, target.values, random_state=0)

tree = DecisionTreeClassifier(max_depth=7, random_state=0).fit(X_train, y_train)

print("training set score : {:.2f}".format(tree.score(X_train, y_train)))
print("test set score: {:.2f}".format(tree.score(X_test, y_test)))

print("feature importances:")
# index the importances by feature name so each value is labelled
feature_importance = pd.Series(tree.feature_importances_, index=features.columns)
print(feature_importance)


## 8.2 Knn

In [None]:
# knn

avo_model = avo.drop(['Date', 'region', '4046_scaled', 'AveragePrice_scaled', 'year_month'], axis=1)

target = avo_model['type']
features = avo_model.drop(['type'], axis=1)

X_train, X_test, y_train, y_test = train_test_split(features.values, target.values, random_state=0)


clf = KNeighborsClassifier(n_neighbors=12)
clf.fit(X_train, y_train)
print("training set score : {:.2f}".format(clf.score(X_train, y_train)))
print("test set score: {:.2f}".format(clf.score(X_test, y_test)))

## 8.3 SVM

In [None]:
# Linear SVC

svc = LinearSVC(C=211).fit(X_train, y_train)
print("training set score : {:.2f}".format(svc.score(X_train, y_train)))
print("test set score: {:.2f}".format(svc.score(X_test, y_test)))

- The best model is Knn with a training score of 0.97 and a test score of 0.97
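
KNN and linear SVMs are sensitive to feature scale, and the volume columns are orders of magnitude larger than the price. One thing that might be worth trying is scaling inside a pipeline. The sketch below runs on synthetic data from `make_classification`, not on the avocado dataset, and the inflated first feature is an assumption meant to mimic volume-sized columns:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the avocado features; not the real dataset.
X, y = make_classification(n_samples=400, n_features=6, random_state=0)
X[:, 0] *= 10_000  # inflate one feature to mimic volume-sized columns

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The scaler is fitted on the training split only, then applied to the test split.
knn_scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=12))
knn_scaled.fit(X_train, y_train)

test_score = knn_scaled.score(X_test, y_test)
print("test set score: {:.2f}".format(test_score))
```

Whether scaling helps on the real data is something to check empirically; the scores reported above were obtained without it.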

# 9. Regression models
In this section we are going to try to predict the price of the avos.

## 9.1 Decision Tree Regressor
Tree models usually don't require preprocessing, so we are going to begin with this model.

In [None]:
avo_model = avo.drop(['Date', '4046_scaled', 'AveragePrice_scaled', 'year_month', 'type', 'region'], axis=1)
target = avo_model['AveragePrice']
features = avo_model.drop(['AveragePrice'], axis=1)

X_train, X_test, y_train, y_test = train_test_split(features.values, target.values, random_state=0)

In [None]:
tree = DecisionTreeRegressor(max_depth=14, random_state=0).fit(X_train, y_train)
print("training set score : {:.2f}".format(tree.score(X_train, y_train)))
print("test set score: {:.2f}".format(tree.score(X_test, y_test)))

print("\n", "feature importances:")
# index the importances by feature name, then list the most important first
feature_importance = pd.Series(tree.feature_importances_, index=features.columns)
print(feature_importance.sort_values(ascending=False))

## 9.2 Linear models

In [None]:
lr = LinearRegression().fit(X_train, y_train)
print('Linear Regression')
print("training set score : {:.2f}".format(lr.score(X_train, y_train)))
print("test set score: {:.2f}".format(lr.score(X_test, y_test)))

print('Ridge')
ridge = Ridge().fit(X_train, y_train)
print("\n", "training set score : {:.2f}".format(ridge.score(X_train, y_train)))
print("test set score: {:.2f}".format(ridge.score(X_test, y_test)))

print('Lasso')
lasso = Lasso().fit(X_train, y_train)
print("\n", "training set score: {:.2f}".format(lasso.score(X_train, y_train)))
print("Test set score: {:.2f}".format(lasso.score(X_test, y_test)))
print("Number of features used:", np.sum(lasso.coef_ != 0))

Our models perform quite badly with the given features. From previous sections, we know that a better way to represent the volume data is to apply a log transform or a normalizer. Let's try both methods and see if we can get a better prediction.

### log

In [None]:
avo_model = avo.drop(['Date', '4046_scaled', 'AveragePrice_scaled', 'year_month', 'type', 'region'], axis=1)
target = avo_model['AveragePrice']
features = avo_model.drop(['AveragePrice'], axis=1)

features.iloc[:,0:7] = np.log(features.iloc[:,0:7] + 1)
X_train, X_test, y_train, y_test = train_test_split(features.values, target.values, random_state=0)

lr = LinearRegression().fit(X_train, y_train)
print('Linear Regression')
print("training set score : {:.2f}".format(lr.score(X_train, y_train)))
print("test set score: {:.2f}".format(lr.score(X_test, y_test)))

print('Ridge')
ridge = Ridge().fit(X_train, y_train)
print("\n", "training set score : {:.2f}".format(ridge.score(X_train, y_train)))
print("test set score: {:.2f}".format(ridge.score(X_test, y_test)))

print('Lasso')
lasso = Lasso().fit(X_train, y_train)
print("\n", "training set score: {:.2f}".format(lasso.score(X_train, y_train)))
print("Test set score: {:.2f}".format(lasso.score(X_test, y_test)))
print("Number of features used:", np.sum(lasso.coef_ != 0))

Way better. Let's add dummy variables.

In [None]:
avo_model = avo.drop(['Date', '4046_scaled', 'AveragePrice_scaled', 'year_month'], axis=1)
avo_model = pd.get_dummies(avo_model)

target = avo_model['AveragePrice']
features = avo_model.drop(['AveragePrice'], axis=1)

features.iloc[:,0:7] = np.log(features.iloc[:,0:7] + 1)
X_train, X_test, y_train, y_test = train_test_split(features.values, target.values, random_state=0)

lr = LinearRegression().fit(X_train, y_train)
print('Linear Regression')
print("training set score : {:.2f}".format(lr.score(X_train, y_train)))
print("test set score: {:.2f}".format(lr.score(X_test, y_test)))

print('Ridge')
ridge = Ridge().fit(X_train, y_train)
print("\n", "training set score : {:.2f}".format(ridge.score(X_train, y_train)))
print("test set score: {:.2f}".format(ridge.score(X_test, y_test)))

print('Lasso')
lasso = Lasso().fit(X_train, y_train)
print("\n", "training set score: {:.2f}".format(lasso.score(X_train, y_train)))
print("Test set score: {:.2f}".format(lasso.score(X_test, y_test)))
print("Number of features used:", np.sum(lasso.coef_ != 0))

- We just got a test score 0.1 better than the decision tree with a way lower training score, which is a good sign that we are not overfitting.
- With some feature engineering we were able to make the linear models better than the decision tree, except for Lasso, where we were never able to increase the test score more than 1
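
The gap between training and test score is the overfitting signal used above. To make it concrete, here is a sketch on synthetic noisy data (a sine curve plus noise, chosen only for illustration) comparing a shallow and a deep decision tree:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy data; stands in for the avocado features and prices.
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Store (train R^2, test R^2) per depth so the gap can be compared.
results = {}
for depth in (3, 14):
    model = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_train, y_train)
    results[depth] = (model.score(X_train, y_train), model.score(X_test, y_test))
    train_r2, test_r2 = results[depth]
    print(f"max_depth={depth}: train={train_r2:.2f}, test={test_r2:.2f}, gap={train_r2 - test_r2:.2f}")
```

A deep tree can fit the training split almost perfectly, so a high training score alone says little; the test score and the gap between the two are more informative.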

The notebook is a bit long, but I tried to be as concise as possible considering that I wanted to deliver the most robust analysis I could.
Thank you very much for reading my kernel and please upvote if you find it useful :) A vote for a beginner never hurts and motivates me to keep learning.