>Hello! My name is [Mauricio Ruanova](https://mruanova.com).

Table of Contents
1. [Step 1 - Identify The Problem](#step1)
1. [Step 2 - Exploratory Data Analysis](#step2)
1. [Step 3 - Distribution](#step3)
1. [Step 4 - Feature Importance](#step4)
1. [Step 5 - Outliers](#step5)
1. [Step 6 - Missing Data](#step6)
1. [Step 7 - Select the model](#step7)
1. [Step 8 - Evaluate the model](#step8)
1. [Step 9 - Conclusion](#step9)

![avocado](https://mruanova.com/avocado.gif)

<a id="step1"></a>
# Step 1 Identify The Problem
Given the data set of [avocado prices](https://www.kaggle.com/neuromusic/avocado-prices) from 2015 to 2018, we will predict the average price using XGBoost.

<a id="step2"></a>
# Step 2 Exploratory Data Analysis

In [None]:
import os
import sys  # access to system parameters https://docs.python.org/3/library/sys.html
import warnings

import matplotlib  # collection of functions for scientific and publication-ready visualization
import matplotlib.pyplot as plt
import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)

%matplotlib inline
warnings.filterwarnings('ignore')  # ignore warnings

print("Python version: {}".format(sys.version))
print("NumPy version: {}".format(np.__version__))
print("pandas version: {}".format(pd.__version__))
print("matplotlib version: {}".format(matplotlib.__version__))

# List the input files available in the Kaggle environment.
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
df = pd.read_csv('../input/avocado-prices/avocado.csv', index_col=0)  # the unnamed first column is used as the index
df.shape

In [None]:
df.head(3)

In [None]:
df.tail(3)

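The unnamed first column of the CSV is loaded as the index (`index_col=0`), so it does not appear among the columns and does not need to be renamed to 'id'.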

In [None]:
df.columns

<a id="step3"></a>
# Step 3 Distribution
First let's take a look at the distribution.

In [None]:
import seaborn as sns
from scipy import stats

f, ax = plt.subplots(nrows=1, ncols=3, figsize=(18, 4))
sns.histplot(df['AveragePrice'], kde=True, ax=ax[0])  # histogram with density curve (distplot is deprecated)
sns.boxplot(x=df['AveragePrice'], ax=ax[1])
stats.probplot(df['AveragePrice'], plot=ax[2])  # Q-Q plot against a normal distribution
plt.show()

Conclusion: The distribution is bimodal, but why? Possibly because the data mixes conventional and organic avocados.

In [None]:
df['type'].unique()

I see that there are two types: conventional and organic.

In [None]:
plt.figure()
plt.title("Avocado Average Price by Type")
sns.barplot(x="type",y="AveragePrice",data= df)
plt.show()

In [None]:
conventional = len(df[df['type'] == 'conventional'])
conventional

In [None]:
organic = len(df[df['type']== 'organic'])
organic

In [None]:
# Pie chart; slices are plotted counter-clockwise, starting at 90 degrees.
labels = ['conventional', 'organic']
sizes = [conventional, organic]
fig1, ax1 = plt.subplots()
ax1.pie(sizes, labels=labels, autopct='%1.1f%%', startangle=90)
ax1.axis('equal')  # Equal aspect ratio ensures that the pie is drawn as a circle.
plt.title('Percentage', size=16)
plt.show()

Conclusion: The data is split roughly 50% conventional and 50% organic.
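
The same split can be read directly from pandas without counting each type by hand. The snippet below is a minimal sketch on a hypothetical four-row stand-in for the data (the values are made up for illustration, not taken from the avocado file).

```python
import pandas as pd

# Hypothetical miniature stand-in for the avocado data (not the real file).
sample = pd.DataFrame({
    "type": ["conventional", "organic", "conventional", "organic"],
    "AveragePrice": [1.10, 1.65, 0.98, 1.80],
})

# Share of rows per type, as fractions that sum to 1.
type_share = sample["type"].value_counts(normalize=True)
print(type_share)
```

On the real DataFrame the equivalent call would be `df['type'].value_counts(normalize=True)`.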

In [None]:
print("Skewness: %f" % df['AveragePrice'].skew())

As a rule of thumb, skewness between −3 and +3 is considered acceptable.

In [None]:
print("Kurtosis: %f" % df['AveragePrice'].kurt())

By a similar rule of thumb, kurtosis between −10 and +10 is considered acceptable.
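
To make the two statistics above more concrete, here is a rough sketch of the plain moment-based definitions, written with NumPy on made-up numbers. The helper name `moment_skew_kurtosis` is hypothetical. pandas applies a small-sample bias correction in `.skew()` and `.kurt()`, so its results will differ slightly from these uncorrected formulas.

```python
import numpy as np

def moment_skew_kurtosis(values):
    """Return (skewness, excess kurtosis) from uncorrected central moments."""
    x = np.asarray(values, dtype=float)
    centered = x - x.mean()
    m2 = np.mean(centered ** 2)  # variance (population form)
    m3 = np.mean(centered ** 3)  # third central moment
    m4 = np.mean(centered ** 4)  # fourth central moment
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0  # 0 for a normal distribution
    return skewness, excess_kurtosis

# Made-up examples: one symmetric sample, one with a long right tail.
symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]
right_tailed = [1.0, 1.0, 1.0, 2.0, 10.0]

skew_sym, kurt_sym = moment_skew_kurtosis(symmetric)
skew_right, kurt_right = moment_skew_kurtosis(right_tailed)
print("symmetric:", skew_sym, kurt_sym)
print("right-tailed:", skew_right, kurt_right)
```

A symmetric sample gives a skewness of zero, while a long right tail pushes it positive.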

In [None]:
df_conventional = df[df['type'] == 'conventional']
df_organic = df[df['type'] == 'organic']
f, ax = plt.subplots(nrows=1, ncols=1, figsize=(18, 4))
# Overlaid histograms with density curves (distplot is deprecated in seaborn).
sns.histplot(df_conventional['AveragePrice'], kde=True, label='conventional', ax=ax)
sns.histplot(df_organic['AveragePrice'], kde=True, color='orange', label='organic', ax=ax)
ax.legend()
plt.show()

Conclusion: The organic avocados are more expensive.
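
The price gap can also be put into numbers with a grouped summary. This is a small sketch on hypothetical prices (made up for illustration), showing one way to compare the mean and median per type.

```python
import pandas as pd

# Hypothetical prices, not taken from the real data set.
sample = pd.DataFrame({
    "type": ["conventional", "conventional", "organic", "organic"],
    "AveragePrice": [1.00, 1.20, 1.60, 1.80],
})

# Mean and median price for each type.
price_by_type = sample.groupby("type")["AveragePrice"].agg(["mean", "median"])

# Difference between the mean organic and mean conventional price.
premium = price_by_type.loc["organic", "mean"] - price_by_type.loc["conventional", "mean"]
print(price_by_type)
print("Organic premium (mean):", round(premium, 2))
```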

In [None]:
# Box plot of the average price for each type, side by side.
f, ax = plt.subplots(nrows=1, ncols=1, figsize=(18, 4))
sns.boxplot(x='AveragePrice', y='type', data=df, ax=ax)
plt.show()

Now we need to know what other features drive up the price. Maybe the region?

In [None]:
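mask = df['type'] == 'organic'
# factorplot was renamed to catplot (and size to height) in newer seaborn releases.
g = sns.catplot(x='AveragePrice', y='region', data=df[mask], kind='point',
    hue='year', height=13, aspect=0.8, palette='Spectral', join=False)
# https://seaborn.pydata.org/tutorial/color_palettes.html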

Conclusion: The price not only depends on the type, but also on the region. 
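
A table view can complement the plot when many regions overlap. The sketch below builds a region-by-year table of mean prices on a hypothetical four-row sample (region names are real, prices are made up).

```python
import pandas as pd

# Hypothetical organic prices for two regions and two years.
sample = pd.DataFrame({
    "region": ["Albany", "Albany", "Boston", "Boston"],
    "year": [2015, 2016, 2015, 2016],
    "AveragePrice": [1.50, 1.70, 1.90, 2.10],
})

# Rows are regions, columns are years, cells are mean prices.
region_year = sample.pivot_table(
    index="region", columns="year", values="AveragePrice", aggfunc="mean"
)
print(region_year)
```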

<a id="step4"></a>
# Step 4 Feature Importance

In [None]:
corrmat = df.select_dtypes(include='number').corr()  # correlate numeric columns only
f, ax = plt.subplots(nrows=1, ncols=1, figsize=(12, 10))
ax.set_title("Correlation Matrix", fontsize=16)
sns.heatmap(corrmat, vmin=-1, vmax=1, cmap='coolwarm', annot=True, ax=ax)

Conclusion: Total Volume (0.98) and Total Bags (0.99) also show strong correlations.
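
Reading strong correlations off a heatmap by eye gets harder as the number of columns grows. As a rough sketch, the pairs above a chosen threshold can be listed programmatically. The column names, values and the 0.9 threshold below are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric columns; "bags" is built to track "volume" closely.
sample = pd.DataFrame({
    "volume": [10.0, 20.0, 30.0, 40.0, 50.0],
    "bags": [1.1, 2.0, 3.2, 3.9, 5.0],
    "price": [1.5, 1.1, 1.6, 1.2, 1.4],
})

corr = sample.corr()

# Keep only the strict upper triangle so each pair appears once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()

# Pairs whose absolute correlation exceeds the (assumed) 0.9 threshold.
strong_pairs = pairs[pairs.abs() > 0.9]
print(strong_pairs)
```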

In [None]:
print('Total number of duplicate rows:', df.duplicated().sum())

<a id="step5"></a>
# Step 5 Outliers

In [None]:
df.describe() # outliers?

Conclusion: I usually look at the min and max values to identify outliers, but I didn't find any this time.
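
Besides checking min and max, one common alternative is the 1.5 × IQR rule. It is not what this notebook uses, only a sketch of a different technique, shown here on a made-up price series with one deliberately extreme value.

```python
import pandas as pd

# Made-up prices; 3.9 is planted as an obvious extreme value.
prices = pd.Series([1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 3.9], name="AveragePrice")

q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
iqr = q3 - q1  # interquartile range

# Points beyond 1.5 * IQR from the quartiles are flagged as outliers.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = prices[(prices < lower) | (prices > upper)]
print(outliers)
```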

<a id="step6"></a>
# Step 6 Missing Data

In [None]:
print(f"Missing data: {df.isna().sum(axis=0).any()}")
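
The check above only answers yes or no. If it ever returned True, a per-column count would show where the gaps are. A minimal sketch on a hypothetical frame with two planted missing values:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing price and one missing region.
sample = pd.DataFrame({
    "AveragePrice": [1.2, np.nan, 1.5],
    "region": ["Albany", "Boston", None],
    "year": [2015, 2016, 2017],
})

# Number of missing values in each column.
missing_per_column = sample.isna().sum()
print(missing_per_column)
```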

In [None]:
# Encode the avocado type as a binary feature.
df['type'] = df['type'].map({'conventional': 0, 'organic': 1})

# Extract the month from the Date column, then drop Date.
df['Date'] = pd.to_datetime(df['Date'])
df['Month'] = df['Date'].dt.month
df.drop('Date', axis=1, inplace=True)
df['Month'] = df['Month'].map({1:'JAN',2:'FEB',3:'MARCH',4:'APRIL',5:'MAY',6:'JUNE',7:'JULY',8:'AUG',9:'SEPT',10:'OCT',11:'NOV',12:'DEC'})

<a id="step7"></a>
# Step 7 Select the model


In [None]:
# Creating dummy variables
dummies = pd.get_dummies(df[['year','region','Month']],drop_first=True)
df_dummies = pd.concat([df[['Total Volume', '4046', '4225', '4770', 'Total Bags',
       'Small Bags', 'Large Bags', 'XLarge Bags', 'type']],dummies],axis=1)
target = df['AveragePrice']

# Splitting data into training and test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df_dummies,target,test_size=0.30)

# Standardizing the data
cols_to_std = ['Total Volume', '4046', '4225', '4770', 'Total Bags', 'Small Bags','Large Bags', 'XLarge Bags']
from sklearn.preprocessing import StandardScaler
scaler=StandardScaler()
scaler.fit(X_train[cols_to_std])
X_train[cols_to_std] = scaler.transform(X_train[cols_to_std])
X_test[cols_to_std] = scaler.transform(X_test[cols_to_std])
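
For readers unfamiliar with the dummy-variable step in the cell above, here is a small sketch of what `pd.get_dummies(..., drop_first=True)` produces. The three-row frame is hypothetical. With `drop_first=True` the first category (alphabetically) is dropped and acts as the baseline, so a row with all dummy columns off belongs to that category.

```python
import pandas as pd

# Hypothetical three-row sample with one categorical column.
sample = pd.DataFrame({"region": ["Albany", "Boston", "Chicago"]})

# One indicator column per category, except the first (Albany), which is the baseline.
dummies = pd.get_dummies(sample[["region"]], drop_first=True)
print(dummies)
```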

<a id="step8"></a>
# Step 8 Evaluate the model

In [None]:
from xgboost import XGBRegressor
from sklearn.metrics import mean_absolute_error,mean_squared_error,r2_score
model = XGBRegressor()
model.fit(X_train, y_train)
Y_pred = model.predict(X_test)
score = model.score(X_train, y_train)
print('Training Score:', score)
score = model.score(X_test, y_test)
print('Testing Score:', score)
output = pd.DataFrame({'Predicted':Y_pred})

In [None]:
print(output.head())  # first five predicted average prices

<a id="step9"></a>
# Step 9 Conclusion

Root Mean Square Error (RMSE) is the standard deviation of the residuals (prediction errors). Residuals are a measure of how far from the regression line data points are; RMSE is a measure of how spread out these residuals are. In other words, it tells you how concentrated the data is around the line of best fit.
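
The metrics computed in this section are MAE, MSE and R²; RMSE itself is simply the square root of MSE. The sketch below works through that relationship on four made-up values (the arrays are hypothetical, not model output).

```python
import numpy as np

# Made-up true prices and predictions, for illustration only.
y_true = np.array([1.0, 1.5, 2.0, 2.5])
y_hat = np.array([1.1, 1.4, 2.3, 2.5])

residuals = y_true - y_hat         # prediction errors
mse = np.mean(residuals ** 2)      # mean squared error
rmse = np.sqrt(mse)                # root mean squared error, in price units
print("MSE:", mse)
print("RMSE:", rmse)
```

In the notebook, the equivalent would be `np.sqrt(mean_squared_error(y_test, Y_pred))`.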

In [None]:
mae = np.round(mean_absolute_error(y_test,Y_pred),3)
print('Mean Absolute Error:', mae)

In [None]:
mse = np.round(mean_squared_error(y_test,Y_pred),3)
print('Mean Squared Error:', mse)

In [None]:
score = np.round(r2_score(y_test,Y_pred),3)
print('R2 Score:', score)

XGBoost reaches an R² score of 89%, which is pretty good but could be better.
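
To clarify what that score measures: R² compares the model's squared errors with those of a baseline that always predicts the mean. A minimal sketch of the formula on made-up values (hypothetical arrays, not the notebook's predictions):

```python
import numpy as np

# Made-up true prices and predictions, for illustration only.
y_true = np.array([1.0, 1.5, 2.0, 2.5])
y_hat = np.array([1.1, 1.4, 2.3, 2.5])

ss_res = np.sum((y_true - y_hat) ** 2)          # squared error of the model
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # squared error of the mean baseline
r2 = 1.0 - ss_res / ss_tot
print("R2:", r2)
```

A value of 1 means perfect predictions, and 0 means the model does no better than predicting the mean.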

Thanks to [ayushikaushik](https://www.kaggle.com/ayushikaushik/comparison-of-all-regression-models) for the examples.