<a href="https://colab.research.google.com/github/willisbridges/Rural-or-Urban/blob/main/Applied_modeling_Building_the_model_and_communicating_the_results.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


BloomTech Data Science

*Unit 2, Sprint 3, Module 4*

---

# Model Interpretation

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your work.

- [ ] Continue to iterate on your project: data cleaning, exploratory visualization, feature engineering, modeling.
- [ ] Make at least 1 partial dependence plot to explain your model.
- [ ] Make at least 1 Shapley force plot to explain an individual prediction.
- [ ] **Share at least 1 visualization (of any type) on Slack!**

If you aren't ready to make these plots with your own dataset, you can practice these objectives with any dataset you've worked with previously. Example solutions are available for Partial Dependence Plots with the Tanzania Waterpumps dataset, and Shapley force plots with the Titanic dataset. (These datasets are available in the data directory of this repository.)

Please be aware that **multi-class classification** will result in multiple Partial Dependence Plots (one for each class), and multiple sets of Shapley Values (one for each class).
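
The note on multi-class output can be checked before any plotting. The sketch below is a minimal illustration on synthetic data from `make_classification` (an assumption, not the project dataset), using scikit-learn's `partial_dependence` function rather than pdpbox. It compares the number of partial dependence curves returned for a binary and a three-class model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import partial_dependence

# Synthetic stand-in data (an assumption; swap in your own X and y).
X_bin, y_bin = make_classification(
    n_samples=300, n_features=5, n_informative=3, random_state=42
)
X_multi, y_multi = make_classification(
    n_samples=300, n_features=5, n_informative=3,
    n_classes=3, n_clusters_per_class=1, random_state=42
)

clf_bin = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_bin, y_bin)
clf_multi = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_multi, y_multi)

# kind="average" requests the classic averaged partial dependence curve.
pd_bin = partial_dependence(
    clf_bin, X_bin, features=[0], kind="average", grid_resolution=20
)
pd_multi = partial_dependence(
    clf_multi, X_multi, features=[0], kind="average", grid_resolution=20
)

# First axis = number of curves: one for binary, one per class for multi-class.
print("binary curves:", pd_bin["average"].shape[0])
print("multi-class curves:", pd_multi["average"].shape[0])
```

For a multi-class project, each row of `pd_multi["average"]` can be plotted as its own line, labeled with the matching entry of `clf_multi.classes_`.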

## Stretch Goals

#### Partial Dependence Plots
- [ ] Make multiple PDPs with 1 feature in isolation.
- [ ] Make multiple PDPs with 2 features in interaction. 
- [ ] Use Plotly to make a 3D PDP.
- [ ] Make PDPs with categorical feature(s). Use Ordinal Encoder, outside of a pipeline, to encode your data first. If there is a natural ordering, then take the time to encode it that way, instead of random integers. Then use the encoded data with pdpbox. Get readable category names on your plot, instead of integer category codes.
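
The categorical-feature goal above hinges on encoding with a meaningful order and keeping a way back to the labels. A minimal sketch follows, assuming a hypothetical `county_size` column. It swaps in scikit-learn's `OrdinalEncoder` with an explicit `categories` list instead of the `category_encoders` version, so treat it as one possible approach rather than the lesson's method.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical categorical column with a natural order (illustrative only).
toy = pd.DataFrame({"county_size": ["medium", "small", "large", "small"]})

# Spell out the order explicitly so the integer codes are meaningful.
order = ["small", "medium", "large"]
encoder = OrdinalEncoder(categories=[order])
toy["county_size_code"] = (
    encoder.fit_transform(toy[["county_size"]]).ravel().astype(int)
)

# Lookup from integer code back to label, useful for relabeling plot ticks.
code_to_label = dict(enumerate(encoder.categories_[0]))

print(toy)
print(code_to_label)
```

On a matplotlib axis, the lookup can be passed to `plt.xticks(list(code_to_label), list(code_to_label.values()))` so the plot shows category names instead of integer codes.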

#### SHAP Values
- [ ] Make Shapley force plots to explain at least 4 individual predictions.
    - If your project is Binary Classification, you can do a True Positive, True Negative, False Positive, False Negative.
    - If your project is Regression, you can do a high prediction with low error, a low prediction with low error, a high prediction with high error, and a low prediction with high error.
- [ ] Use Shapley values to display verbal explanations of individual predictions.
- [ ] Use the SHAP library for other visualization types.
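
For the binary-classification suggestion above, the four rows to explain first have to be located. The helper below is a hypothetical sketch (`pick_confusion_examples` is not part of any library) that returns one positional index for each confusion-matrix cell, given true and predicted labels coded as 0 and 1.

```python
import numpy as np

def pick_confusion_examples(y_true, y_pred):
    """Return the first positional index found for each confusion-matrix
    cell, or None when a cell has no examples."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    masks = {
        "true_positive": (y_true == 1) & (y_pred == 1),
        "true_negative": (y_true == 0) & (y_pred == 0),
        "false_positive": (y_true == 0) & (y_pred == 1),
        "false_negative": (y_true == 1) & (y_pred == 0),
    }
    picks = {}
    for name, mask in masks.items():
        hits = np.flatnonzero(mask)
        picks[name] = int(hits[0]) if hits.size else None
    return picks

# Toy labels, made up for illustration.
y_true_demo = [1, 0, 1, 0, 1, 0]
y_pred_demo = [1, 0, 0, 1, 1, 0]
picks = pick_confusion_examples(y_true_demo, y_pred_demo)
print(picks)
```

With a fitted model, the inputs would be the validation labels and `model.predict(X_val)`, and each index can be used with `X_val.iloc[[index]]` to pull the row for a force plot.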

The [SHAP repo](https://github.com/slundberg/shap) has examples for many visualization types, including:

- Force Plot, individual predictions
- Force Plot, multiple predictions
- Dependence Plot
- Summary Plot
- Summary Plot, Bar
- Interaction Values
- Decision Plots

We just did the first type during the lesson. The [Kaggle microcourse](https://www.kaggle.com/dansbecker/advanced-uses-of-shap-values) shows two more. Experiment and see what you can learn!
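
To make the "verbal explanations" stretch goal concrete, it can help to see what a Shapley value is without the SHAP library. The sketch below computes exact Shapley values by brute force over every feature subset, filling "absent" features with a baseline row. This is a simplification for illustration, not the algorithm the SHAP library uses, and its cost doubles with every added feature, so it only suits a handful of features. The weights, baseline, and row are invented toy numbers; only the column names come from this dataset.

```python
from itertools import combinations
from math import factorial

import numpy as np

def exact_shapley(predict, x, baseline):
    """Brute-force Shapley values for one row `x`.
    Features outside a coalition take their value from `baseline`."""
    n = len(x)

    def value(subset):
        z = baseline.copy()
        idx = np.array(subset, dtype=int)
        z[idx] = x[idx]
        return float(predict(z.reshape(1, -1))[0])

    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                phi[i] += weight * (value(subset + (i,)) - value(subset))
    return phi

def explain(names, x, phi):
    """Turn Shapley values into one sentence per feature, largest effect first."""
    lines = []
    for i in np.argsort(-np.abs(phi)):
        if phi[i] > 0:
            effect = f"raised the prediction by {phi[i]:.3f}"
        elif phi[i] < 0:
            effect = f"lowered the prediction by {-phi[i]:.3f}"
        else:
            effect = "did not move the prediction"
        lines.append(f"{names[i]} = {x[i]:g} {effect}")
    return lines

# Toy linear "model" with invented weights (illustrative only).
names = ["Median_HH_Inc_ACS", "Poverty_Rate_ACS", "PctEmpServices"]
weights = np.array([0.5, -0.2, 0.1])

def toy_predict(Z):
    return Z @ weights

baseline = np.array([1.0, 2.0, 3.0])
row = np.array([3.0, 1.0, 3.0])

phi = exact_shapley(toy_predict, row, baseline)
for sentence in explain(names, row, phi):
    print(sentence)
```

The values add up to the gap between the prediction for `row` and the prediction for `baseline`, which is the property a force plot visualizes.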

### Links

#### Partial Dependence Plots
- [Kaggle / Dan Becker: Machine Learning Explainability — Partial Dependence Plots](https://www.kaggle.com/dansbecker/partial-plots)
- [Christoph Molnar: Interpretable Machine Learning — Partial Dependence Plots](https://christophm.github.io/interpretable-ml-book/pdp.html) + [animated explanation](https://twitter.com/ChristophMolnar/status/1066398522608635904)
- [pdpbox repo](https://github.com/SauceCat/PDPbox) & [docs](https://pdpbox.readthedocs.io/en/latest/)
- [Plotly: 3D PDP example](https://plot.ly/scikit-learn/plot-partial-dependence/#partial-dependence-of-house-value-on-median-age-and-average-occupancy)

#### Shapley Values
- [Kaggle / Dan Becker: Machine Learning Explainability — SHAP Values](https://www.kaggle.com/learn/machine-learning-explainability)
- [Christoph Molnar: Interpretable Machine Learning — Shapley Values](https://christophm.github.io/interpretable-ml-book/shapley.html)
- [SHAP repo](https://github.com/slundberg/shap) & [docs](https://shap.readthedocs.io/en/latest/)

In [None]:
#Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.model_selection import train_test_split
from category_encoders import OneHotEncoder, OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, precision_score, recall_score, classification_report, roc_auc_score
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier 




In [None]:
#Importing cleaned csv
from google.colab import files
files.upload()

Saving urban.csv to urban.csv



In [None]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'urban.csv'
    !pip install category_encoders==2.*
    !pip install eli5
    !pip install pdpbox
    !pip install shap

# If you're working locally:
else:
    DATA_PATH = '../data/urban.csv'

In [None]:
# Wrangle function: load the cleaned CSV and use FIPS as the index
def wrangle(filepath):
    df = pd.read_csv(filepath, index_col='FIPS')
    return df

df = wrangle(DATA_PATH)

# Exploring the dataframe
What might give me issues?
Are there NaNs?

In [None]:
df.head()

| FIPS | Urban | Median_HH_Inc_ACS | PerCapitaInc | Poverty_Rate_ACS | Deep_Pov_All | PctEmpConstruction | PctEmpServices | PctEmpChange0720 | PctEmpGovt | PctEmpTrade | PctEmpTrans | PctEmpInformation | PctEmpFIRE | PctEmpGovt.1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1001 | 1 | 58731.0 | 29819.0 | 15.185172 | 6.261607 | 6.072099 | 44.082864 | 4.1 | 9.436424 | 12.445967 | 6.797977 | 1.362042 | 5.978305 | 9.436424 |
| 1003 | 1 | 58320.0 | 32626.0 | 10.354073 | 4.046885 | 8.58546 | 45.203016 | 13.9 | 5.224469 | 16.4779 | 5.003628 | 1.525907 | 7.520165 | 5.224469 |
| 1005 | 0 | 32525.0 | 18473.0 | 30.668689 | 15.042156 | 6.810888 | 33.638417 | -17.7 | 7.012956 | 12.813503 | 6.632592 | 0.606205 | 3.720433 | 7.012956 |
| 1007 | 1 | 47542.0 | 20778.0 | 18.127181 | 8.690384 | 9.848575 | 39.024681 | -4.5 | 5.138905 | 12.900918 | 5.723143 | 1.263861 | 5.3416 | 5.138905 |
| 1009 | 1 | 49358.0 | 24747.0 | 13.551516 | 6.52448 | 9.718483 | 37.879272 | -8.4 | 4.539855 | 14.824109 | 6.880504 | 0.857782 | 6.063786 | 4.539855 |
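
One answer to "what might give me issues?" shows up in the header: `PctEmpGovt` appears twice in the CSV, so pandas renames the second copy to `PctEmpGovt.1`, and the two columns hold the same values in the rows shown. The sketch below reproduces this on a three-row excerpt of those rows (values rounded, subset of columns) and shows one way to detect and drop exact duplicate columns. Whether dropping is the right call for the full dataset is an assumption to verify.

```python
import io

import pandas as pd

# Small excerpt of the CSV with the repeated header name (values rounded).
csv_text = (
    "FIPS,Urban,PctEmpGovt,PctEmpTrade,PctEmpGovt\n"
    "1001,1,9.436424,12.445967,9.436424\n"
    "1003,1,5.224469,16.477900,5.224469\n"
    "1005,0,7.012956,12.813503,7.012956\n"
)
demo = pd.read_csv(io.StringIO(csv_text), index_col="FIPS")

# Transpose so columns become rows, then flag rows that repeat an earlier one.
is_duplicate = demo.T.duplicated().to_numpy()
duplicate_cols = demo.columns[is_duplicate].tolist()

demo_clean = demo.drop(columns=duplicate_cols)
print("duplicate columns:", duplicate_cols)
print(demo_clean.columns.tolist())
```

Leaving a duplicated feature in a tree ensemble usually does not break training, but it can split importance and Shapley attributions between the two copies.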


In [None]:
# Total count of missing values; if it is small, impute them inside the pipeline
df.isnull().sum().sum()

97
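
A single total hides where the gaps are. The sketch below, on an invented four-row frame that borrows two column names from this dataset, breaks the count down per column and per row, then shows what `SimpleImputer()` does with its default settings: it fills each gap with that column's mean.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Invented values for illustration; only the column names come from the dataset.
toy = pd.DataFrame({
    "PerCapitaInc": [20000.0, np.nan, 30000.0, 26000.0],
    "Poverty_Rate_ACS": [15.0, 10.0, np.nan, np.nan],
})

# Missing values per column, worst first, and the number of affected rows.
missing_per_column = toy.isnull().sum().sort_values(ascending=False)
rows_with_missing = int(toy.isnull().any(axis=1).sum())

# Default SimpleImputer uses strategy="mean", learned from the data passed to fit.
imputer = SimpleImputer()
filled = imputer.fit_transform(toy)

print(missing_per_column)
print("rows with at least one NaN:", rows_with_missing)
print("column means used for filling:", imputer.statistics_)
```

Inside a pipeline the means are learned from the training split only, which keeps validation data from leaking into the imputation.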

In [None]:
target = 'Urban'
X, y = df.drop(columns=target), df[target]

In [None]:
X_train, X_val, y_train, y_val = train_test_split(X, y, train_size=0.80, test_size=0.20, random_state=42)


In [None]:
#Baseline accuracy
print('The baseline accuracy is ', y_train.value_counts(normalize=True).max())

The baseline accuracy is  0.6240776699029126
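
The figure above is the majority-class share of the training labels. A like-for-like comparison with validation accuracy scores that same majority guess on the validation labels, which scikit-learn's `DummyClassifier` does directly. The sketch uses small invented label arrays. In the notebook the inputs would be `X_train`, `y_train`, `X_val`, and `y_val`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Invented labels for illustration: 5 of 8 training rows are class 1.
y_train_demo = np.array([1, 1, 1, 0, 0, 1, 1, 0])
y_val_demo = np.array([1, 0, 0, 1])

# DummyClassifier ignores the features, so placeholder columns are enough.
X_train_demo = np.zeros((len(y_train_demo), 1))
X_val_demo = np.zeros((len(y_val_demo), 1))

dummy = DummyClassifier(strategy="most_frequent").fit(X_train_demo, y_train_demo)

train_baseline = dummy.score(X_train_demo, y_train_demo)
val_baseline = dummy.score(X_val_demo, y_val_demo)

print("Baseline accuracy (train):", train_baseline)
print("Baseline accuracy (validation):", val_baseline)
```

The two numbers differ whenever the class balance differs between splits, so the validation figure is the fairer bar for a model's validation score.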


# Building the Models (Classification)

In [None]:
#First, Random Forest

model_rf = make_pipeline(
    SimpleImputer(),
    RandomForestClassifier(n_jobs = -1)
)
model_rf.fit(X_train, y_train)

Pipeline(steps=[('simpleimputer', SimpleImputer()),
                ('randomforestclassifier', RandomForestClassifier(n_jobs=-1))])

In [None]:
#Time to check the training and validation accuracy
print('Training Accuracy (RF):', model_rf.score(X_train, y_train))
print('Validation Accuracy (RF):', model_rf.score(X_val, y_val))

Training Accuracy (RF): 1.0
Validation Accuracy (RF): 0.7934782608695652
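
A training accuracy of 1.0 next to a validation accuracy near 0.79 suggests the forest is memorizing the training rows. Two common follow-ups are to estimate accuracy with cross-validation instead of a single split, and to try a setting that limits tree growth, such as `min_samples_leaf`. The sketch below does both on synthetic data (an assumption; the value 10 is an arbitrary illustration, not a tuned choice), so the numbers it prints say nothing about this project's dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data (an assumption; use X_train and y_train in practice).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=4, random_state=42
)

def cv_accuracy(min_samples_leaf):
    """5-fold cross-validated accuracy for the imputer + random forest pipeline."""
    pipe = make_pipeline(
        SimpleImputer(),
        RandomForestClassifier(
            n_estimators=100,
            min_samples_leaf=min_samples_leaf,
            random_state=42,
        ),
    )
    return cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="accuracy")

scores_default = cv_accuracy(min_samples_leaf=1)
scores_limited = cv_accuracy(min_samples_leaf=10)

print("default trees, mean CV accuracy:", scores_default.mean())
print("min_samples_leaf=10, mean CV accuracy:", scores_limited.mean())
```

Because the imputer sits inside the pipeline, it is refit on each training fold, so the cross-validated scores are not inflated by statistics from the held-out fold.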
