Feature importance, hyperparameter tuning and further usage of an ML model
======================


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from xgboost import XGBRegressor
from xgboost import plot_importance
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.feature_selection import SelectFromModel
from numpy import sort
import seaborn as sns
import joblib  # sklearn.externals.joblib was removed from scikit-learn
from pdpbox import pdp

plt.style.use('ggplot')

data = pd.read_csv('../input/avocado.csv')
data.head()

Feature engineering and data cleaning
==============================
This dataset is already fairly clean, so we do not really have to do any cleaning. Often you will be working with much messier data, spread over multiple datasets that have to be merged. We will still do a little feature engineering to get more out of our dataset.



In [None]:
data['Date'] = pd.to_datetime(data.Date)

In [None]:
# dt.weekday_name was removed in pandas 1.0; dt.day_name() returns the same day names
data['day_of_week'] = data['Date'].dt.day_name()

In [None]:
data.day_of_week.unique()

In [None]:
data['month'] = data['Date'].dt.month

In [None]:
data['day'] = data['Date'].dt.day
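
The date-derived columns can be checked on a tiny frame before touching the real data. This is a minimal sketch with made-up dates (not rows from avocado.csv); the name `toy` is hypothetical, and it assumes a pandas version that provides `dt.day_name()`.

```python
import pandas as pd

# Made-up dates for illustration; these are not rows from avocado.csv.
toy = pd.DataFrame({'Date': ['2018-03-25', '2018-03-18', '2017-12-31']})
toy['Date'] = pd.to_datetime(toy['Date'])

# The .dt accessor exposes calendar parts of a datetime column.
toy['day_of_week'] = toy['Date'].dt.day_name()
toy['month'] = toy['Date'].dt.month
toy['day'] = toy['Date'].dt.day

print(toy)
```

If every row of a column like `day_of_week` ends up with the same value, that column carries no information for the model and could arguably be dropped.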

In [None]:
data = data.rename(columns={'Unnamed: 0': 'Store'})

Understanding the dataset
====================

Unique values of categorical data. Knowing how many unique values each text column has helps you pick a way of turning your categorical data into numerical data.
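
As a rough illustration of that idea, the sketch below counts distinct values per text column on a hypothetical toy frame. The column names mirror the dataset, but the values are made up.

```python
import pandas as pd

# Hypothetical toy frame; the values are made up for illustration.
toy = pd.DataFrame({
    'type': ['conventional', 'organic', 'organic', 'conventional'],
    'region': ['Albany', 'Boston', 'Chicago', 'Albany'],
})

# Count the distinct values in each text column.
cardinality = toy[['type', 'region']].nunique()
print(cardinality)
```

A column with two values maps naturally to 0/1, while a column with many values forces a choice between integer codes and one-hot columns.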

<h2>About this Dataset</h2>
<h4>Context</h4>

It is a well known fact that Millennials LOVE Avocado Toast. It's also a well known fact that all Millennials live in their parents' basements.

Clearly, they aren't buying homes because they are buying too much Avocado Toast!

But maybe there's hope... if a Millennial could find a city with cheap avocados, they could live out the Millennial American Dream.

<h4>Content</h4>

This data was downloaded from the Hass Avocado Board website in May of 2018 & compiled into a single CSV. Here's how the Hass Avocado Board describes the data on their website:

> The table below represents weekly 2018 retail scan data for National retail volume (units) and price. Retail scan data comes directly from retailers’ cash registers based on actual retail sales of Hass avocados. Starting in 2013, the table below reflects an expanded, multi-outlet retail data set. Multi-outlet reporting includes an aggregation of the following channels: grocery, mass, club, drug, dollar and military. The Average Price (of avocados) in the table reflects a per unit (per avocado) cost, even when multiple units (avocados) are sold in bags. The Product Lookup codes (PLU’s) in the table are only for Hass avocados. Other varieties of avocados (e.g. greenskins) are not included in this table.

<h4>Some relevant columns in the dataset:</h4>

Date - the date of the observation

AveragePrice - the average price of a single avocado

type - conventional or organic

year - the year

region - the city or region of the observation

Numerical column names refer to price lookup codes.

4046:  small Hass

4225:  large Hass

4770:  extra large Hass

To make the dataset easier to read, we can rename the lookup-code columns to the sizes they stand for.

In [None]:
data = data.rename(columns={'4046': 'small Hass', '4225':  'large Hass', '4770':  'extra large Hass'})

In [None]:
print('Unique values in columns with text:\n\n Dates: {0} \n\n Data type: {1} \n\n Year: {2} \n\n Region: {3} \n\n Day of week: {4} \n\n Month: {5}'.format(data.Date.unique(), data.type.unique(), data.year.unique(), data.region.unique(), data['day_of_week'].unique(), data.month.unique()))

In [None]:
data.info()

Handling categorical data
=====================

I am choosing to handle categorical data by replacing each category with an integer code.

In [None]:
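# Map each category to an integer code.
mappings_type = {'conventional': 0, 'organic': 1}

# Only 'Sunday' is mapped here; any other day name would be left unchanged.
mappings_dayofweek = {'Sunday': 1}

# Number the regions 1, 2, ... in order of first appearance.
mappings_region = {region: code
                   for code, region in enumerate(data['region'].unique(), start=1)}

# Assign the result back instead of calling replace(..., inplace=True) on a
# column, which is not guaranteed to modify the DataFrame in recent pandas.
data['type'] = data['type'].replace(mappings_type)
data['day_of_week'] = data['day_of_week'].replace(mappings_dayofweek)
data['region'] = data['region'].replace(mappings_region)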

In [None]:
data.head()
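
The integer encoding used above can be sketched on a hypothetical toy frame (made-up values; `toy` and `region_codes` are illustrative names, not part of the notebook):

```python
import pandas as pd

# Hypothetical toy frame; the values are made up for illustration.
toy = pd.DataFrame({
    'type': ['conventional', 'organic', 'organic'],
    'region': ['Albany', 'Boston', 'Albany'],
})

# Explicit mapping for a two-valued column.
toy['type'] = toy['type'].map({'conventional': 0, 'organic': 1})

# Number the regions 1, 2, ... in order of first appearance.
region_codes = {name: code
                for code, name in enumerate(toy['region'].unique(), start=1)}
toy['region'] = toy['region'].map(region_codes)

print(region_codes)
print(toy)
```

Integer codes impose an arbitrary order on the categories. Tree-based models can usually cope with that, but for linear models one-hot encoding is generally the safer choice.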

<h2>Descriptive statistics</h2>
Descriptive statistics are crucial for understanding your dataset properly. They are the first step towards good prescriptive and predictive statistics.

In [None]:
data.describe()

In [None]:
# numeric_only=True skips the datetime column; to_frame names the resulting column
skew_df = data.skew(numeric_only=True).to_frame('Skewness')
skew_df

In [None]:
kurt_df = data.kurtosis(numeric_only=True).to_frame('Kurtosis')
kurt_df
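
To get a feel for what these two numbers mean, here is a small sketch on made-up series (not the avocado data). Skewness near zero suggests a symmetric distribution, and a positive value suggests a long right tail. pandas reports excess kurtosis, so values above zero suggest heavier tails than a normal distribution.

```python
import pandas as pd

# Made-up values for illustration.
symmetric = pd.Series([1, 2, 3, 4, 5])
right_tailed = pd.Series([1, 1, 1, 2, 10])

print('symmetric skew:   ', symmetric.skew())
print('right-tailed skew:', right_tailed.skew())
print('right-tailed kurtosis:', right_tailed.kurtosis())
```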

<h2>Summarize / What do we know now?</h2>

Building our model
===============
___________
Now it is finally time for some fun! We will build an XGBoost model and plot its feature importance to learn more about our data.

In [None]:
X = data.drop(['AveragePrice', 'Date'], axis=1)

y = data['AveragePrice']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

model = XGBRegressor(n_jobs=4)
model.fit(X_train, 
            y_train,
            verbose=True)

In [None]:
predictions = model.predict(X_test)
scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=5)
print(scores)
# scores are negative MAE values, so flip the sign; %.2f prints two decimals
print('Mean Absolute Error: %.2f' % (-1 * scores.mean()))

In [None]:
# scikit-learn metrics take (y_true, y_pred); MAE is symmetric, so the value is unchanged
mae = mean_absolute_error(y_test, predictions)
print("Mean Absolute Error : " + str(mae))

error_percent = mae/data['AveragePrice'].mean()*100
print(str(error_percent) + ' %')
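
The two numbers printed above can be reproduced by hand. The sketch below uses made-up prices and predictions to show the mean absolute error and the same error expressed as a percentage of the mean target value.

```python
import numpy as np

# Made-up true prices and predictions for illustration.
y_true = np.array([1.0, 1.5, 2.0, 1.5])
y_pred = np.array([1.1, 1.4, 1.8, 1.7])

# Mean absolute error: average size of the prediction error.
mae = np.mean(np.abs(y_true - y_pred))

# Relative error: MAE as a percentage of the mean target value.
error_percent = mae / y_true.mean() * 100

print('MAE:', mae)
print('Error percentage:', error_percent, '%')
```

Here the absolute errors are 0.1, 0.1, 0.2 and 0.2, so the MAE is 0.15, which is 10 % of the mean price of 1.5.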

Feature Selection
=============
Selecting the right number of features, and the right features, is crucial to avoid data leakage, overfitting or underfitting.

We can therefore start with a feature importance plot.

In [None]:
# plot feature importance
fig, ax = plt.subplots(figsize=(15, 15))
imp_plt = plot_importance(model, ax=ax)
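
The same ranking can also be read as numbers rather than from a plot. The sketch below is an assumption-laden stand-in: it uses scikit-learn's `GradientBoostingRegressor` on synthetic data instead of the XGBoost model above, and importance definitions differ between libraries, so treat it only as an illustration of the idea.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data: the target depends on 'signal' only, 'noise' is irrelevant.
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({
    'signal': rng.normal(size=200),
    'noise': rng.normal(size=200),
})
y_toy = 3.0 * X_toy['signal'] + rng.normal(scale=0.1, size=200)

# Stand-in for the XGBoost model; any fitted tree ensemble exposes importances.
toy_model = GradientBoostingRegressor(n_estimators=50, random_state=0)
toy_model.fit(X_toy, y_toy)

# Pair each importance with its column name and sort from most to least important.
importances = pd.Series(toy_model.feature_importances_, index=X_toy.columns)
print(importances.sort_values(ascending=False))
```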

In [None]:
features_to_plot = ['region', 'year']
inter1  =  pdp.pdp_interact(model=model, dataset=X_test, model_features=X.columns.tolist(), features=features_to_plot)

pdp.pdp_interact_plot(pdp_interact_out=inter1, feature_names=features_to_plot, plot_type='contour')

# Create the data that we will plot
pdp_goals = pdp.pdp_isolate(model=model, dataset=X_test, model_features=X.columns.tolist(), feature='month')

# plot it
pdp.pdp_plot(pdp_goals, 'Month')

# Create the data that we will plot
pdp_goals = pdp.pdp_isolate(model=model, dataset=X_test, model_features=X.columns.tolist(), feature='year')

# plot it
pdp.pdp_plot(pdp_goals, 'Year')

# Create the data that we will plot
pdp_goals = pdp.pdp_isolate(model=model, dataset=X_test, model_features=X.columns.tolist(), feature='region')

# plot it
pdp.pdp_plot(pdp_goals, 'Region')
plt.show()
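
Conceptually, a one-feature partial dependence curve is the average prediction obtained when that feature is forced to a fixed value for every row. The sketch below computes it by hand on synthetic data with a scikit-learn stand-in model. `partial_dependence_1d` is a hypothetical helper written for this illustration, not part of pdpbox.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data: the target depends on feature 0 only.
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 1, size=(300, 2))
y_toy = 2.0 * X_toy[:, 0] + rng.normal(scale=0.05, size=300)

toy_model = GradientBoostingRegressor(n_estimators=50, random_state=0).fit(X_toy, y_toy)


def partial_dependence_1d(model, X, feature_idx, grid):
    """Average prediction when one feature is forced to each grid value."""
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature_idx] = value  # overwrite the feature for every row
        averages.append(model.predict(X_mod).mean())
    return np.array(averages)


grid = np.linspace(0.1, 0.9, 5)
pd_feature0 = partial_dependence_1d(toy_model, X_toy, 0, grid)
pd_feature1 = partial_dependence_1d(toy_model, X_toy, 1, grid)

print('feature 0:', pd_feature0)  # should rise with the grid value
print('feature 1:', pd_feature1)  # should stay roughly flat
```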

In [None]:
row_to_show = 5
data_for_prediction = X

In [None]:
import shap  # package used to calculate Shap values

# Create object that can calculate shap values
explainer = shap.TreeExplainer(model)

# Calculate Shap values
shap_values = explainer.shap_values(data_for_prediction)



We now know how the model ranks the features, that is, which features are the most important.

Performing feature selection with scikit-learn can give us better insight into how many features might be the optimal number to include.

In [None]:
mae = mean_absolute_error(y_test, predictions)
error_percent = mae / data['AveragePrice'].mean() * 100
print("Mean Absolute Error : " + str(mae) + "\t" + str(error_percent) + ' %')

# Fit a model using each feature importance as a selection threshold
thresholds = sort(model.feature_importances_)

# Maps number of selected features -> error percentage.
# The value is kept as a float so that min() compares numbers, not strings.
best_score = {}

for thresh in thresholds:
    # select features using threshold
    selection = SelectFromModel(model, threshold=thresh, prefit=True)
    select_X_train = selection.transform(X_train)
    # train model
    selection_model = XGBRegressor(n_jobs=4)
    selection_model.fit(select_X_train, y_train)
    # eval model
    select_X_test = selection.transform(X_test)
    select_predictions = selection_model.predict(select_X_test)
    mae = mean_absolute_error(y_test, select_predictions)
    error_percent = mae / data['AveragePrice'].mean() * 100

    best_score[select_X_train.shape[1]] = error_percent

print(best_score)

In [None]:
# Pick the feature count with the lowest error percentage
min_key = min(best_score, key=best_score.get)
min_val = best_score[min_key]
print('Best number of features: key: {0}, value: {1:.2f} %'.format(min_key, min_val))

We can now see that the best number of features, according to the feature selection we just did, is 13. We already know which features are the most important from our feature importance plot, so we can include the top 13 features when building a more optimized model.
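
Picking the top features by importance can be sketched as follows. The importance scores here are made up for illustration, not read from the fitted model, and `k` is set to 3 only to keep the example small.

```python
import pandas as pd

# Made-up importance scores, not values from the fitted model.
importances = pd.Series({
    'region': 0.30,
    'type': 0.25,
    'month': 0.20,
    'year': 0.15,
    'day': 0.10,
})

# Keep the names of the k most important features.
k = 3
top_features = importances.nlargest(k).index.tolist()
print(top_features)
```

With a real model, the series could be built from `model.feature_importances_` indexed by `X.columns`, and `X[top_features]` would then give the reduced feature matrix.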
______________________

Building a better model
==================
Now that we know which features are the most important, and roughly how many features to include, we can build an optimized model and let it find the best hyperparameters using cross-validation.

--------
We start by selecting the data we need.

In [None]:
X_opt = data.drop(['AveragePrice', 'Date', 'XLarge Bags'], axis=1)
y_opt = data['AveragePrice']

opt_X_train, opt_X_test, opt_y_train, opt_y_test = train_test_split(X_opt, y_opt, test_size=0.33, random_state=42)

Now that we have the data, we can build our pipeline and use GridSearchCV to find the best hyperparameters.

In [None]:
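# early_stopping_rounds is set on the estimator here, because recent XGBoost
# versions no longer accept it as an argument to fit()
opt_pipeline = Pipeline([('xgb', XGBRegressor(n_jobs=4, early_stopping_rounds=10))])

param_grid = {
    "xgb__n_estimators": [100, 250, 500, 1000],
    "xgb__learning_rate": [0.1, 0.25, 0.5, 1],
    "xgb__max_depth": [6, 7, 8],
    "xgb__min_child_weight": [0.25, 0.5, 1, 1.5]
}

# Forwarded to XGBRegressor.fit(). Note that early stopping on the test set lets
# test data influence training; a separate validation split would be cleaner.
fit_params = {"xgb__eval_set": [(opt_X_test, opt_y_test)],
              "xgb__verbose": False}

# fit_params are passed to fit(); GridSearchCV no longer takes them in the constructor.
# return_train_score=True is needed for cv_results_['mean_train_score'].
searchCV = GridSearchCV(opt_pipeline, cv=5, param_grid=param_grid, return_train_score=True)
searchCV.fit(opt_X_train, opt_y_train, **fit_params)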

In [None]:
searchCV.best_params_

In [None]:
searchCV.cv_results_['mean_train_score']

In [None]:
searchCV.cv_results_['mean_test_score']

In [None]:
searchCV.cv_results_['mean_train_score'].mean(), searchCV.cv_results_['mean_test_score'].mean()
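
The `cv_results_` arrays are easier to read as a table. The sketch below runs a deliberately tiny grid search on synthetic data, with scikit-learn's `GradientBoostingRegressor` standing in for XGBoost. All names prefixed with `toy_` are hypothetical, and the explicit `scoring` and `return_train_score=True` arguments are choices made for this example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic regression data for illustration.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(120, 3))
y_toy = X_toy[:, 0] - 2.0 * X_toy[:, 1] + rng.normal(scale=0.1, size=120)

# Pipeline step names become parameter prefixes ('gbr__...') in the grid.
toy_pipeline = Pipeline([('gbr', GradientBoostingRegressor(random_state=0))])
toy_grid = {'gbr__n_estimators': [20, 50], 'gbr__max_depth': [2, 3]}

toy_search = GridSearchCV(
    toy_pipeline,
    param_grid=toy_grid,
    cv=3,
    scoring='neg_mean_absolute_error',  # higher (closer to zero) is better
    return_train_score=True,            # needed for the mean_train_score column
)
toy_search.fit(X_toy, y_toy)

# One row per parameter combination, sorted by validation score.
results = pd.DataFrame(toy_search.cv_results_)
summary = results[['params', 'mean_train_score', 'mean_test_score']]
print(summary.sort_values('mean_test_score', ascending=False))
print(toy_search.best_params_)
```

A large gap between `mean_train_score` and `mean_test_score` for a given row is a common hint that this parameter combination overfits.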

In [None]:
opt_predictions = searchCV.predict(opt_X_test)

In [None]:
mae = mean_absolute_error(opt_predictions, opt_y_test)
print("Mean Absolute Error : " + str(mae))

error_percent = mae/data['AveragePrice'].mean()*100
print("Error percentage: " + str(error_percent) + ' %')

As we can see, we have actually improved our model quite a bit with just this basic optimization and feature selection.

In [None]:
features_to_plot = ['region', 'year']
inter1  =  pdp.pdp_interact(model=searchCV, dataset=opt_X_test, model_features=X_opt.columns.tolist(), features=features_to_plot)

pdp.pdp_interact_plot(pdp_interact_out=inter1, feature_names=features_to_plot, plot_type='contour')

Further Usage Of The Model
=====================
-----------------
We can now use our model in other applications by saving it and loading it in another Python program. You can also build an API around it and then include it in an app, on a website, in a bigger production system, etc.


Let's go ahead and save the model.

In [None]:
joblib.dump(searchCV, "xgboostmodel.joblib.dat")

For further information on how to load it into another application, look up saving gradient boosting models with joblib.
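
As a starting point, a save-and-load round trip might look like the sketch below. It uses a small `LinearRegression` on made-up data as a stand-in for the tuned model, and a temporary directory so that nothing is left on disk. The file name is hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data following y = 2x + 1; a stand-in for the tuned model.
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([1.0, 3.0, 5.0, 7.0])
toy_model = LinearRegression().fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_model.joblib')
    joblib.dump(toy_model, path)   # what the saving program does
    restored = joblib.load(path)   # what the loading program does

# The restored model predicts like the original one.
prediction = restored.predict(np.array([[4.0]]))
print(prediction)
```

The loading program needs the same libraries installed, ideally in the same versions, as the program that saved the model. joblib files are pickle-based, so only load files from sources you trust.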