# Avocado Data Analysis (Guided Tutorial For Beginner-Intermediate)

This notebook analyzes the Avocado Prices Dataset, which records how the prices of different types of avocados change with their type (conventional or organic), the date they were sold and the total volume sold. The dataset covers Hass avocados, which are commonly used to make avocado toast in homes nationwide. Three sizes of Hass avocados appear in the dataset: PLU 4046 (small/medium), PLU 4225 (large) and PLU 4770 (extra large).

Let's begin our analysis by importing the required libraries and taking a first look at the dataset.

1. `matplotlib.pyplot` allows us to graph the data and see relationships between them before we do any machine learning.


2. `pandas` lets us view our data as a table, much like an Excel worksheet in Python. It offers many operations for manipulating the data, so we can spot anything that needs changing before we do any machine learning.


3. `sklearn` is a machine learning library where we will use algorithms to learn relationships between the data. If we come up with a research question about our data, we can use this library to answer our question using machine learning.


4. `tensorflow` is a deep learning library that will be used to see if deep learning can give us a better result for our research question without overfitting the data. *This may be used in an update to this notebook.*

In [None]:
import numpy as np
import matplotlib.pyplot as plt
#import tensorflow as tf
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import scale
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import LabelEncoder

## Exploratory Data Analysis

We begin by loading the dataset from its CSV file so we can take a look at it.

This is done using the `read_csv` method from the `pandas` library.

In [None]:
df = pd.read_csv('../input/avocado-prices/avocado.csv')
df.head()

As you can see, the dataset has many columns containing both numerical and categorical data.

Before we can begin our analysis of the columns of the dataset, we should remove the `Unnamed: 0` column since it brings only a duplicate index to our dataset. 

This can be done by dropping the column using the `df.drop` method, specifying an axis of `1` (1 = column, 0 = row). Since we don't need to keep a copy of the dataframe with this column lying around, I use the `inplace` parameter to directly change the DataFrame instead of creating a copy.

In [None]:
df.drop('Unnamed: 0', axis=1, inplace=True)
df.head()
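
As a small aside, the same drop can be written without `inplace` by reassigning the result. The sketch below uses a made-up three-row frame (`toy` is a hypothetical stand-in for the avocado data) to show that both forms leave the same columns.

```python
import pandas as pd

# Toy stand-in for the avocado data (values are made up for illustration)
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "AveragePrice": [1.33, 1.35, 0.93],
    "region": ["Albany", "Albany", "Albany"],
})

# Option 1: return a new DataFrame and leave `toy` untouched
dropped_copy = toy.drop(columns="Unnamed: 0")

# Option 2: modify a DataFrame directly, as done in the cell above
dropped_inplace = toy.copy()
dropped_inplace.drop("Unnamed: 0", axis=1, inplace=True)

print(list(dropped_copy.columns))
print(list(dropped_inplace.columns))
```

Which form to use is mostly a matter of taste; reassigning keeps the original frame around if you still need it.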

To start our data analysis, let's statistically describe our data and find the different regions where avocados were sold using the `region` column. A statistical description of the data is produced with the `df.describe()` method of the DataFrame.

Notice that `df.describe()` only provides statistical analysis of the numerical columns.

In [None]:
df.describe()

Statistical analysis of the categorical columns of the dataframe can be done by specifying the datatype of the columns to be described. For categorical data, we pass `object` to the `include` keyword argument of the method (the older `np.object` alias has been removed from recent NumPy releases).

In [None]:
df.describe(include=object)
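
To make the difference between the two summaries concrete, here is a minimal sketch on a made-up four-row frame (`toy` and its values are hypothetical, not taken from the dataset). Numerical columns get a mean, standard deviation and quartiles, while text columns get a count, the number of unique values, the most frequent value and its frequency.

```python
import pandas as pd

# Made-up rows with one numerical and two text columns
toy = pd.DataFrame({
    "AveragePrice": [1.33, 1.35, 0.93, 1.08],
    "type": ["conventional", "conventional", "organic", "organic"],
    "region": ["Albany", "Albany", "Tampa", "Tampa"],
})

numeric_summary = toy.describe()                    # numerical columns only
categorical_summary = toy.describe(include=object)  # text (object) columns only

print(numeric_summary)
print(categorical_summary)
```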

Now, let's see the different regions where avocados were sold by isolating the `region` column from the dataframe and using the `value_counts()` method. This method counts the frequency of each region in the column and returns the counts in a column (or specifically a `pd.Series`).

In [None]:
df['region'].value_counts()
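
One detail that matters for plotting later: `value_counts()` sorts its result by frequency, so its order generally differs from the order of `unique()`. A minimal sketch with a made-up `regions` Series (hypothetical data) shows how to read labels and counts from the same object so they stay aligned.

```python
import pandas as pd

# Made-up region labels, listed in an arbitrary order
regions = pd.Series(
    ["Tampa", "Albany", "Albany", "LasVegas", "Albany", "Tampa"], name="region"
)

counts = regions.value_counts()       # sorted by frequency, highest first
labels = counts.index.tolist()        # region names, in the same order as the counts
values = counts.values.tolist()

print(labels)                         # ['Albany', 'Tampa', 'LasVegas']
print(values)                         # [3, 2, 1]
print(regions.unique().tolist())      # order of first appearance instead
```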

It seems that avocados in this dataset are recorded equally often for each of the regions listed above. I gather this from the frequency of 338 records in the dataset for each U.S. city. The one exception is the West Texas / New Mexico area, which has 335 records.

If I didn't have as much of a keen eye, I could visualize some of these records in a *horizontal bar graph* as seen below.

### Primer To Plots

Plots will be discussed in more detail later in the machine learning club, but there are essentially three things to know in order to create a generic plot using `matplotlib`. These steps apply to any plot except a histogram or pie plot.

1. The first argument in the `plot` function must be the x-values to put into the plot.


2. The second argument in the `plot` function must be the y-values to put into the plot.


3. Optional Arguments can be added in the `plot` function to change the color and style of the plot. In this example, I used the `color` keyword argument to provide my own hex color to color the inside of each of the bars. The `edgecolor` argument was also used to give a black outline to each bar to emphasize their length.

In addition to these three steps, you might want to use the `plt.title`, `plt.xlabel` and `plt.ylabel` methods to give your plot a title and labels for your x and y axis. Plots are a good way to visualize your data so try to make as many informative plots as possible.



In [None]:
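# Take labels and counts from the same Series so each bar matches its region
# (unique() and value_counts() do not share the same order)
region_counts = df['region'].value_counts()

plt.barh(region_counts.index[45:54], region_counts.values[45:54], color = '#e57373', edgecolor ='black')
# In a horizontal bar graph the counts run along the x-axis and the categories along the y-axis
plt.xlabel('Number of Records')
plt.ylabel('Cities')
plt.title('Sample of Records of Avocados Sold In Different Cities')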

#### Time Plots

Make sure to convert the **Date** column to `datetime` using `pd.to_datetime` (converts from `object` to `datetime`) in order to get a time line plot.

In [None]:
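# .copy() gives an independent DataFrame, so assigning to 'Date' below does not trigger a SettingWithCopyWarning
albany_df = df[df['region'] == 'Albany'].sort_values('Date').copy()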
albany_df['Date'] = pd.to_datetime(albany_df['Date'])
albany_avocado_median = [albany_df['AveragePrice'].median()] * len(albany_df['AveragePrice'])

plt.plot(albany_df['Date'], albany_df['AveragePrice'])
plt.plot(albany_df['Date'], albany_avocado_median, color="#e57373")
plt.xlabel('Month Avocado Sold')
plt.ylabel('Average Price Sold')
plt.title('Average Price of Avocados in Albany, NY over 40 months.')

plt.gcf().autofmt_xdate()
plt.grid()

In [None]:
albany_volume_median = [albany_df['Total Volume'].median()] * len(albany_df['Total Volume'])

plt.plot(albany_df['Date'], albany_df['Total Volume'])
plt.plot(albany_df['Date'], albany_volume_median, color="#e57373")
plt.xlabel('Month Avocado Sold')
plt.ylabel('Total Volume Sold (log scale)')
plt.title('Total Volume of Avocados Sold in Albany, NY over 40 months.')
plt.gcf().autofmt_xdate()
plt.yscale('log')
plt.grid()

In [None]:
tampa_df = df[df['region'] == 'Tampa'].sort_values('Date').copy()
tampa_df['Date'] = pd.to_datetime(tampa_df['Date'])
tampa_avocado_median = [tampa_df['AveragePrice'].median()] * len(tampa_df['AveragePrice'])

plt.plot(tampa_df['Date'], tampa_df['AveragePrice'], color="#ef5350")
plt.plot(tampa_df['Date'], tampa_avocado_median, color="#64b5f6")
plt.xlabel('Month Avocado Sold')
plt.ylabel('Average Price Sold')
plt.grid()
plt.title('Average Price of Avocados in Tampa, FL over 40 months.')
plt.gcf().autofmt_xdate()

In [None]:
lv_df = df[df['region'] == 'LasVegas'].sort_values('Date').copy()
lv_df['Date'] = pd.to_datetime(lv_df['Date'])
lv_avocado_median = [lv_df['AveragePrice'].median()] * len(lv_df['AveragePrice'])

plt.plot(lv_df['Date'], lv_df['AveragePrice'], color="#ff9800")
plt.plot(lv_df['Date'], lv_avocado_median, color="#4db6ac")
plt.xlabel('Month Avocado Sold')
plt.ylabel('Average Price Sold')
plt.grid()
plt.title('Average Price of Avocados in Las Vegas, NV over 40 months.')
plt.gcf().autofmt_xdate()
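
The region plots above all repeat the same steps: filter by region, convert the dates, then draw the series with a median line. One possible way to avoid the repetition is a small helper. This is only a sketch: `plot_region_series` is a hypothetical name, and the `toy` rows are made up to mimic the columns used above.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_region_series(data, region, column="AveragePrice", ax=None):
    """Plot `column` over time for one region, with a horizontal median line."""
    region_df = data[data["region"] == region].copy()
    region_df["Date"] = pd.to_datetime(region_df["Date"])
    region_df = region_df.sort_values("Date")

    if ax is None:
        _, ax = plt.subplots()
    ax.plot(region_df["Date"], region_df[column], label=column)
    # axhline draws the median across the whole axis, so no repeated list is needed
    ax.axhline(region_df[column].median(), color="#e57373", label="median")
    ax.set_xlabel("Date")
    ax.set_ylabel(column)
    ax.set_title(f"{column} in {region}")
    ax.grid(True)
    ax.legend()
    ax.figure.autofmt_xdate()
    return ax


# Made-up rows, deliberately out of date order
toy = pd.DataFrame({
    "Date": ["2015-01-18", "2015-01-04", "2015-01-11", "2015-01-04"],
    "AveragePrice": [1.17, 1.22, 1.24, 0.93],
    "region": ["Albany", "Albany", "Albany", "Tampa"],
})
ax = plot_region_series(toy, "Albany")
```

With the real data, a call such as `plot_region_series(df, 'Tampa')` would stand in for one of the cells above.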

#### Other Data Exploration Plots

This plot asks the EDA question: "**Do organic avocados cost more than conventional avocados on average?**"

In [None]:
conventional_price = df[df['type'] == 'conventional']['AveragePrice']
organic_price = df[df['type'] == 'organic']['AveragePrice']


price_conv_mean = conventional_price.mean()
price_conv_std = conventional_price.std()

price_org_mean = organic_price.mean()
price_org_std = organic_price.std()

plt.bar(['conventional', 'organic'], [price_conv_mean, price_org_mean], 
        yerr=[price_conv_std, price_org_std], capsize=3, color=["#fbc02d", "#aed581"], edgecolor='black')

plt.grid()
plt.xlabel('Type of Avocado Grown')
plt.ylabel('Average Price')
plt.title('Effect of Avocado Type on Average Price')

The error bars (one standard deviation each) overlap, so the plot alone cannot tell us whether the higher price of organic avocados is a statistically significant increase over conventional avocados. I will use a *t-test* to check whether the difference is statistically significant.

1. Run a two-sample *t-test* and get the *t* value. 
2. Find the corresponding p-value for the appropriate degrees of freedom. A t-chart works, but `ttest_ind` from `scipy` returns the p-value directly.

In [None]:
from scipy.stats import ttest_ind

# Compare only the first five prices of each type (a deliberately small sample)
ttest = ttest_ind(conventional_price[:5], organic_price[:5])

print("The p value for the t-test to find the statistical difference between conventional and organic avocados is " + 
      "{:.7f}".format(ttest.pvalue))
print("The t value for this t-test is {:.3f}".format(ttest.statistic))
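
For readers who want to see step 1 of the list above done by hand, here is a minimal sketch of the pooled (equal-variance) two-sample t-test, which is what `ttest_ind` computes by default. The helper name `pooled_t_test` and the two price lists are made up for illustration and are not taken from the dataset.

```python
import numpy as np
from scipy import stats


def pooled_t_test(a, b):
    """Two-sided Student's t-test for two samples, assuming equal variances."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n_a, n_b = len(a), len(b)
    dof = n_a + n_b - 2                                   # degrees of freedom
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / dof
    t_stat = (a.mean() - b.mean()) / np.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    p_value = 2 * stats.t.sf(abs(t_stat), dof)            # two-sided tail probability
    return t_stat, p_value


# Made-up prices for illustration only
conventional = [1.33, 1.35, 0.93, 1.08, 1.28]
organic = [1.79, 1.67, 1.83, 1.58, 1.72]

t_manual, p_manual = pooled_t_test(conventional, organic)
reference = stats.ttest_ind(conventional, organic)

print(t_manual, p_manual)
print(reference.statistic, reference.pvalue)
```

If the two groups may have different variances, `ttest_ind(..., equal_var=False)` runs Welch's t-test instead.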


With a sample of only 5 values from each group, the *p-value* is already below the conventional significance level of $\alpha = 0.05$, so the difference in price between organic and conventional avocados is statistically significant.

**No wonder organic avocados cost more!**

----
There was also an inspirational question attached to this dataset when I downloaded it: ***"Was the Avocadopocalypse of 2017 real?"***

In layman's terms, this is asking if there was a shortage in 2017 compared to other years in the dataset.

To answer this question, we need to graph the average volume of avocados sold per record in each year and see if the volume in 2017 was lower than in 2015, 2016 and 2018.

In [None]:
unique_years = df['year'].unique()

# These dict comprehensions create dicts with the structure {2015: value for the year}

combined_means = {y: df[df['year'] == y]['Total Volume'].mean() for y in unique_years}
combined_stds = {y: df[df['year'] == y]['Total Volume'].std() for y in unique_years}

# plt.bar expects sequences, so the dict values are converted to lists (2017 gets its own color)
plt.bar(unique_years, list(combined_means.values()),
        color=['#66bb6a', '#66bb6a', '#4db6ac', '#66bb6a'], edgecolor='black',
        yerr=list(combined_stds.values()))

plt.xticks(range(2015, 2019))
plt.grid()
plt.title('Mean Volume of Avocados Sold Per Year, 2015-2018')
plt.xlabel('Year')
plt.ylabel('Mean Total Volume Sold')
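
The per-year statistics can also be computed in one step with `groupby`. This is a sketch on made-up numbers (the `toy` frame is hypothetical), not a replacement for the cell above.

```python
import pandas as pd

# Made-up volumes, two records per year
toy = pd.DataFrame({
    "year": [2015, 2015, 2016, 2016, 2017, 2017],
    "Total Volume": [100.0, 300.0, 200.0, 400.0, 250.0, 350.0],
})

# One row per year, one column per statistic
yearly = toy.groupby("year")["Total Volume"].agg(["mean", "std", "sum"])
print(yearly)

# The columns can be passed straight to a bar plot, for example:
# plt.bar(yearly.index, yearly["mean"], yerr=yearly["std"])
```

The `sum` column is included because a total and a mean answer slightly different questions: a year with fewer recorded weeks has a lower total even if the weekly volume is unchanged.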

Unfortunately for the Avocadopocalypse theory, it seems that 2015 was the year with the lowest volume of avocados sold. The volume sold in 2017 roughly matches the volume sold in 2016, and the volume increased again in 2018.

---

Now, let's check for any null data to see if we need to impute anything.

In [None]:
df.head()

In [None]:
df.isna().sum()

Sweet, no null values!

## Feature Engineering

For machine learning, we will likely not use the `Date` column. Next, we do feature engineering to find the columns that are unlikely to add any useful information for our model.

We will start by trying to find possible correlations among the data.

### Correlations

However, we will first need to define our target variable and our independent features before we start any feature engineering. So let's square that away by isolating the **average price** from the dataframe as our target feature.

In [None]:
X = df.drop('AveragePrice', axis = 1)
y = df['AveragePrice']

**Heatmap**

In [None]:
from seaborn import heatmap

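# Restrict to numerical columns: recent pandas versions raise an error if corr() sees text columns
heatmap(df.select_dtypes(include='number').corr(), cmap='summer', annot=True, linecolor='black', fmt='.2f')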

In [None]:
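# select_dtypes is the public way to split columns by type (np.object no longer exists in recent NumPy)
numerical_X = X.select_dtypes(include='number')
categorical_X = X.select_dtypes(include=object)
cat_X = categorical_X.drop('Date', axis = 1)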

cat_X

In [None]:
corr_threshold = 0.6
corr_features = []

print(numerical_X.columns)
print()

for col in numerical_X.columns:
    # Compare each numerical feature with the target (the target itself is not in numerical_X)
    if abs(y.corr(numerical_X[col])) >= corr_threshold:
        corr_features.append(col)

        
# Columns are so correlated to one another! This method might not be the best way to do things.
# See Vishal Patel's Dimensionality Reduction lecture for more ways to do this
        
print(corr_features)

Lecture: [Press Here](https://www.slideshare.net/VishalPatel321/feature-reduction-techniques).
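
The threshold loop above can also be expressed with `corrwith`, and the feature-to-feature correlations can be checked in the same way. Below is a minimal sketch on synthetic data; the column names `volume`, `bags` and `noise` and all values are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
volume = rng.normal(size=n)

# Synthetic features: `bags` is nearly a copy of `volume`, `noise` is unrelated
toy = pd.DataFrame({
    "volume": volume,
    "bags": 0.9 * volume + rng.normal(scale=0.1, size=n),
    "noise": rng.normal(size=n),
})
target = pd.Series(-0.8 * volume + rng.normal(scale=0.3, size=n), name="price")

# Absolute correlation of every feature with the target
corr_with_target = toy.corrwith(target).abs()
selected = corr_with_target[corr_with_target >= 0.6].index.tolist()

# Absolute correlation between the features themselves
feature_corr = toy.corr().abs()

print(selected)
print(feature_corr.round(2))
```

Here both `volume` and `bags` pass the threshold, yet they carry almost the same information, which is the kind of redundancy the comment in the cell above is pointing at.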

### Variance Removal

It seems like most of the numerical data is correlated to each other *but* before removing nearly all of the numerical data because of this, let's explore other options in removing some of our features.




In [None]:
from sklearn.feature_selection import VarianceThreshold
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression

vthresh = VarianceThreshold()
quasi_vthresh = VarianceThreshold(0.1)

Xp = vthresh.fit_transform(numerical_X)
Xpp = quasi_vthresh.fit_transform(numerical_X)

print(f"Original Numerical DataFrame Shape: {numerical_X.shape}\n", 
      f"With Variance: {Xp.shape}\n", f"With At Least 0.1 variance: {Xpp.shape}")

del Xpp

`VarianceThreshold` only removes features whose variance does not exceed the threshold, so it is a blunter tool than `SelectKBest`, which ranks features by their relationship to the target. Since categorical values also do not merge well with numerical variables with `VarianceThreshold`, `SelectKBest` is used in this analysis.
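
To see what `VarianceThreshold` actually removes, here is a small sketch on a made-up array (`X_toy` is hypothetical). The first column is constant, the second barely varies and the third varies a lot.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Column variances: 0.0, 0.0025 and 125.0
X_toy = np.array([
    [1.0, 0.0, 10.0],
    [1.0, 0.1, 20.0],
    [1.0, 0.0, 30.0],
    [1.0, 0.1, 40.0],
])

# Default threshold of 0.0: only constant columns are dropped
no_constant = VarianceThreshold().fit_transform(X_toy)

# Threshold of 0.1: the nearly constant second column is dropped as well
low_variance_removed = VarianceThreshold(threshold=0.1).fit_transform(X_toy)

print(no_constant.shape)           # (4, 2)
print(low_variance_removed.shape)  # (4, 1)
```

Variance depends on the scale of each column, so features measured in large units (such as raw sales volumes) will pass a small threshold easily.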

In [None]:
# The k parameter will be tuned in a grid search. For right now, k = 5 seems to be the best value to use.
kbest = SelectKBest(f_regression, k = 5)
Xk = kbest.fit_transform(numerical_X, y)

Xk.shape
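
`fit_transform` returns a plain array, so the column names are lost. A fitted selector's `get_support()` mask can recover which features were kept. The sketch below uses synthetic data with made-up column names, where only the two `signal` columns influence the target.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(42)
n = 300

# Synthetic features: two informative columns and two pure-noise columns
toy_X = pd.DataFrame({
    "signal_a": rng.normal(size=n),
    "signal_b": rng.normal(size=n),
    "noise_a": rng.normal(size=n),
    "noise_b": rng.normal(size=n),
})
toy_y = 2.0 * toy_X["signal_a"] - 1.5 * toy_X["signal_b"] + rng.normal(scale=0.5, size=n)

selector = SelectKBest(f_regression, k=2).fit(toy_X, toy_y)

# Boolean mask over the input columns, True where a column was kept
kept = toy_X.columns[selector.get_support()].tolist()
print(kept)
print(selector.scores_.round(1))  # F-statistic of each column against the target
```

On the avocado data, the equivalent lookup would be `numerical_X.columns[kbest.get_support()]`.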

## Machine Learning

Now, let's test the results of the feature engineering with some basic machine learning. After this, we'll add some grid search to get the best hyperparameters for the model.

<u> Research Question:</u> **How can the information about avocado distribution and volume sold predict the price of avocados sold?**

1. Plot the data in a scatter plot to see an expected regression.
2. Find a model to compute regression.

In [None]:
plt.scatter(df['Total Volume'], df['AveragePrice']) 
plt.xlabel('Total Volume of Avocados Sold (hundred thousands)')
plt.ylabel('Average Price of Avocados')
plt.title('Total Volume vs. Average Price on Avocados')

### Categorical Features & ColumnTransformer

In [None]:
X.columns

In [None]:
from sklearn.compose import make_column_transformer
from sklearn.compose import make_column_selector
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict

cat_columns = tuple(cat_X.columns)
num_columns = tuple(numerical_X.columns)

col_transform = make_column_transformer(
    (OneHotEncoder(), cat_columns),
    (make_pipeline(SelectKBest(f_regression, k = 5), StandardScaler(with_mean=False)), num_columns)
)
X_transform = col_transform.fit_transform(X, y)
y_transform = y.values.reshape((-1, 1))

# From tinkering (ColumnTransformer is still a bit new to me), k = 5 seems to be the best parameter for SelectKBest

train_x, test_x, train_y, test_y = train_test_split(X_transform, y_transform)
print(train_x)

Recall that the target variable `y` has the price of the avocados and the feature matrix `X` has all of the other items. The `ColumnTransformer` only transforms the columns it is given, here the numerical and categorical ones. All other columns will be dropped (i.e. **Date**).

In [None]:
X.head()

In [None]:
y.head()

There are also 2 unique values for the `type` column and 54 unique values for the `region` column that the `ColumnTransformer` must one-hot encode.

In [None]:
unique_types = X['type'].unique()
unique_types, len(unique_types)

In [None]:
unique_regions = X['region'].unique()
unique_regions, len(unique_regions)

### Model Selection

Let's start with a simple Linear Regression using only the preprocessing in the `ColumnTransformer` and observe the results with the root mean squared error. The $R^2$ result is a bit murky so I will skip using that for now.

The cross validation score is used for the final testing score. 

#### 1. LinearRegression

In [None]:
from sklearn.metrics import mean_squared_error

lr = LinearRegression()

lr.fit(X_transform, y_transform)
y_pred = lr.predict(X_transform)
print(f'Train RMSE: {(mean_squared_error(y_transform, y_pred)) ** 0.5}')

In [None]:
lr.fit(train_x, train_y)
y_pred = lr.predict(test_x)

print(f"Test RMSE: {(mean_squared_error(test_y, y_pred)) ** 0.5}")

cvscore = cross_val_score(lr, X_transform, y_transform, cv=5, scoring="neg_mean_squared_error")
cv_rmse = (-1 * cvscore) ** 0.5

print(f"Cross Validation RMSE: {cv_rmse.mean()} +/- {cv_rmse.std()}")
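
scikit-learn also offers a `neg_root_mean_squared_error` scorer, which saves the manual square root. The sketch below runs both scorers on synthetic data (`X_toy` and `y_toy` are made up) and checks that they agree fold by fold.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression problem for illustration only
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

neg_rmse = cross_val_score(LinearRegression(), X_toy, y_toy, cv=5,
                           scoring="neg_root_mean_squared_error")
neg_mse = cross_val_score(LinearRegression(), X_toy, y_toy, cv=5,
                          scoring="neg_mean_squared_error")

rmse_direct = -neg_rmse            # flip the sign to get a positive error
rmse_from_mse = (-neg_mse) ** 0.5  # the manual conversion used above

print(f"Cross Validation RMSE: {rmse_direct.mean():.4f} +/- {rmse_direct.std():.4f}")
```

Scores are negated because scikit-learn always treats a higher score as better.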

In [None]:
print(f"Coefficients: {lr.coef_}")
print(lr.coef_.shape) # these are for all of the columns that were kept with the SelectKBest 
print(f"Intercepts: {lr.intercept_}")

Since the training and testing RMSE are close together, the model does not appear to be overfitting, which suggests the preprocessing produced a reasonably well-fit model.
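
One caveat worth noting: the `ColumnTransformer` above was fit on every row before the train/test split, so the test rows influenced the feature selection and scaling. A common way to avoid this is to put the preprocessing and the model into a single pipeline and fit it on the training split only. The sketch below shows the idea on synthetic data; the `toy` frame, its values and the simplified preprocessing (no `SelectKBest`) are assumptions for illustration, not the notebook's actual setup.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the avocado features and target
rng = np.random.default_rng(1)
n = 200
toy = pd.DataFrame({
    "type": rng.choice(["conventional", "organic"], size=n),
    "region": rng.choice(["Albany", "Tampa", "LasVegas"], size=n),
    "Total Volume": rng.lognormal(mean=10, sigma=1, size=n),
})
toy_price = 1.0 + 0.5 * (toy["type"] == "organic") + rng.normal(scale=0.1, size=n)

preprocess = make_column_transformer(
    # handle_unknown="ignore" keeps prediction working if a test row has an unseen category
    (OneHotEncoder(handle_unknown="ignore"), ["type", "region"]),
    (StandardScaler(), ["Total Volume"]),
)
model = make_pipeline(preprocess, LinearRegression())

X_train, X_test, y_train, y_test = train_test_split(toy, toy_price, random_state=0)
model.fit(X_train, y_train)  # the encoder and scaler only ever see training rows
test_rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5

print(f"Test RMSE: {test_rmse:.3f}")
```

A pipeline like this can also be passed directly to `cross_val_score` or `GridSearchCV`, so the preprocessing is refit inside every fold.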

#### 2. Ridge Regression
A `LinearRegression` model is not that bad with an RMSE of 0.27, but nonlinear models may do better. Even though the training and validation errors are close together, let's first see whether regularizing the regression with ridge regression (`Ridge`, or `RidgeCV` to search over several alphas) changes the results.

In [None]:
from sklearn.linear_model import Ridge
#ridge = RidgeCV(alphas=np.linspace(0.01, 10, 50)) # sample from a wide variety of alpha regularizations.

ridge = Ridge(alpha=0.7)
ridge.fit(train_x, train_y)
y_pred = ridge.predict(test_x)

print(f'Test RMSE Ridge: {(mean_squared_error(test_y, y_pred)) ** 0.5}')

Regularization does not seem to help much here (there is little variation in the RMSE for different alphas), so let's check other regression models such as decision trees and random forests.

#### 3. DecisionTreeRegressor

Let's try a decision tree and random forest before building the full pipeline for the model in machine learning. After that, we'll try a deep learning approach before closing the notebook. 

In [None]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

dtree = DecisionTreeRegressor(max_depth = 2)

dtree.fit(train_x, train_y)
y_pred = dtree.predict(test_x)

print(f'Test RMSE: {(mean_squared_error(test_y, y_pred)) ** 0.5}')

After noticing a trend that increasing the depth of the tree gives a lower RMSE, let's use `GridSearchCV` to find the optimal depth and maximum number of leaf nodes. 

Note that maximizing the negative mean squared error also minimizes the RMSE, because the square root is an increasing function.
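
Written out, with $\theta$ as shorthand (introduced here, not part of the original notebook) for one hyperparameter combination such as a `max_depth` and `max_leaf_nodes` pair, and $\mathrm{MSE}(\theta)$ as the cross-validated mean squared error for that combination:

$$
\arg\max_{\theta}\bigl(-\mathrm{MSE}(\theta)\bigr)
= \arg\min_{\theta}\mathrm{MSE}(\theta)
= \arg\min_{\theta}\sqrt{\mathrm{MSE}(\theta)}
= \arg\min_{\theta}\mathrm{RMSE}(\theta)
$$

The middle step holds because the square root is strictly increasing on non-negative numbers, so it never changes which combination comes out on top.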

In [None]:
from sklearn.model_selection import GridSearchCV

#param_grid = [{'max_depth' : np.arange(1, 11), 'max_leaf_nodes' : [50, 100, 150, 200, 250, 300]}]
param_grid = [{'max_depth' : np.arange(10, 20), 'max_leaf_nodes' : np.arange(500, 601, 10)}]

gsearch = GridSearchCV(dtree, param_grid, cv=5, scoring='neg_mean_squared_error')
gsearch.fit(train_x, train_y)

gsearch.best_params_

# Upon running the GridSearch, the max values of 10 and 300 were obtained. I need to broaden the range to find a 
# more optimal parameter if it hit the max

# New optimal parameters: depth = 18, max_leaf_nodes (maximum # of leaf nodes) = 520

Now with optimal parameters.

In [None]:
dtree = DecisionTreeRegressor(max_depth = 18, max_leaf_nodes = 520)

dtree.fit(train_x, train_y)
y_pred = dtree.predict(test_x)

print(f'Test RMSE: {(mean_squared_error(test_y, y_pred)) ** 0.5}')

#### 4. RandomForestRegressor

Let's try it with the optimal parameters from the decision tree before optimizing the forest itself.

In [None]:
rforest = RandomForestRegressor(max_depth = 18, max_leaf_nodes = 520, n_estimators=250)

rforest.fit(train_x, train_y.ravel())
y_pred = rforest.predict(test_x)

print(f'Test RMSE: {(mean_squared_error(test_y, y_pred)) ** 0.5}')

It seems that the **RandomForestRegressor** was the winner out of all of the regressors, with a lower RMSE than the decision tree, ridge regression and linear regression. 

## Thanks for Tuning In!

I hope you learned something if you are a beginner. Using a deep neural network might be overkill in this scenario since the RMSE for simple machine learning solutions is low enough to be used for production processes. Of course, if you add to the parameter space or the number of estimators in ensembles like random forests, the machine learning algorithm takes longer to run.