# The Model

So we've built a [well-performing model](https://www.kaggle.com/ruthgn/house-prices-top-8-featengineering-xgb-optuna).

In fact, our final model's submission landed in the top 1% of Kaggle's [Housing Prices Competition for Kaggle Learn Users](https://www.kaggle.com/c/home-data-for-ml-course/overview) leaderboard and top 8% of Kaggle's [House Price Prediction Competition](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/overview) leaderboard (as of 10/29/2021).

In [None]:
# Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import pickle
import shap

# Load the saved (pre-trained) model
with open("/kaggle/input/house-prices-top-8-featengineering-xgb-optuna/ames_house_xgb_model.pkl", "rb") as f:
    load_model = pickle.load(f)

# Load the processed test data
X_test = pd.read_csv("/kaggle/input/house-prices-top-8-featengineering-xgb-optuna/df_test_processed.csv")

# Make predictions (np.exp converts the model's log-scale output back to prices)
predictions = np.exp(load_model.predict(X_test))

# Save predictions!
# If you want to try making a submission with the generated predictions, 
# download `my_submission.csv` from this notebook's output first 
# and then submit the predictions by uploading the downloaded file on the competition page
output = pd.DataFrame({'Id': X_test.index, 'SalePrice': predictions})
output = output.set_index('Id', drop=True)
output.index += 1461  # test-set Ids start at 1461
output.to_csv('my_submission.csv')
print("Your predictions are successfully saved!")

The question is, **now what**?

# Model Interpretation

First, we need to *understand* our model better. Let's try to answer the following questions about our model:
* What features in the data did the model think are most important?
* For any single prediction from a model, how did each feature in the data affect that particular prediction?
* How does each feature affect the model's predictions in a big-picture sense (what is its typical effect when considered over a large number of possible predictions)?

There is an increasing need for data scientists who are able to extract insights from sophisticated machine learning models to help inform human decision-making.

> Many people say machine learning models are "black boxes", in the sense that they can make good predictions but you can't understand the logic behind those predictions. This statement is true in the sense that most data scientists don't know how to extract insights from models yet.

Some decisions are made automatically by models, but many important decisions are made by humans. For these decisions, insights can be more valuable than predictions. Beyond informing human decision-making, insights extracted from machine learning models have many other uses, including debugging, informing feature engineering, directing future data collection, and *building trust*.



We'll start by using [SHAP values](https://www.kaggle.com/dansbecker/shap-values) to explain one arbitrarily chosen prediction from the test set. Afterwards, we will look at model-level insights.
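For intuition about what a SHAP value measures, here is a minimal sketch that computes exact Shapley values by brute force for a tiny, hypothetical three-feature function. The names `toy_model`, `shapley_values`, and the all-zero `baseline` are illustrative assumptions, not part of this notebook, and `shap.TreeExplainer` uses a much faster tree-specific algorithm rather than this enumeration.

```python
from itertools import combinations
from math import factorial


def toy_model(x):
    # Hypothetical model: two linear terms plus one interaction term
    return 2.0 * x[0] + 1.0 * x[1] + 0.5 * x[0] * x[2]


def shapley_values(model, x, baseline):
    """Exact Shapley values of `model` at `x`, relative to a single baseline point."""
    n = len(x)

    def value(subset):
        # Features in `subset` keep their real value; the rest fall back to the baseline
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return model(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for k in range(n):  # k = size of the coalition that feature i joins
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


x = [3.0, 1.0, 2.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
base_value = toy_model(baseline)

print("contributions:", phi)
print("base value + contributions:", base_value + sum(phi))
print("model output:", toy_model(x))
```

The contributions add up, together with the base value, to the model's output for that row. The force plots later in this notebook visualize the same additive property.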

In [None]:
# Pick an arbitrary row (first row starts at 0)
row_to_show = 42
data_for_prediction = X_test.iloc[[row_to_show]]

# Generate prediction
y_sample = np.exp(load_model.predict(data_for_prediction))

# Create object that can calculate Shap values
explainer = shap.TreeExplainer(load_model)

# Calculate Shap values from prediction
shap_values = explainer.shap_values(data_for_prediction)

**For a single prediction, what features in the data did the model think are most important?**

In [None]:
plt.title('Feature importance based on SHAP values')
shap.summary_plot(shap_values, data_for_prediction, plot_type="bar")

**How did each feature in the data affect that particular prediction?**

In [None]:
plt.title('Feature impact on model output (feature impact in details below)')
shap.summary_plot(shap_values, data_for_prediction)


shap.initjs()
shap.force_plot(explainer.expected_value, shap_values, data_for_prediction)
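Because the notebook applies `np.exp` to the model's output, the SHAP values in the plots above are presumably expressed in log-price units rather than dollars. Under that assumption, the sketch below shows how additive log-space contributions turn into multiplicative effects on the price. The base value, feature names, and contribution numbers are made up for illustration and are not taken from the model.

```python
import math

# Hypothetical numbers for illustration only (not from the notebook's model)
base_value = 12.0  # stands in for explainer.expected_value, in log-price units
contributions = {"GrLivArea": 0.15, "OverallQual": 0.08, "YearBuilt": -0.05}

# In log space, contributions add up to the model output
log_prediction = base_value + sum(contributions.values())
price = math.exp(log_prediction)

# exp() turns each additive contribution into a multiplicative factor on the price
factors = {name: math.exp(phi) for name, phi in contributions.items()}

print(f"baseline price: {math.exp(base_value):,.0f}")
for name, factor in factors.items():
    print(f"{name}: x{factor:.3f} ({(factor - 1) * 100:+.1f}%)")
print(f"predicted price: {price:,.0f}")
```

Read this way, a SHAP value of 0.15 means the feature pushes the predicted price up by roughly 16% relative to the baseline, not by 0.15 dollars.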

Now that we've seen the inner workings of our model in making an individual prediction, let's aggregate all the information into powerful model-level insights.

In [None]:
# Use test set to get multiple predictions
data_for_prediction = X_test

# Generate predictions
y_sample = np.exp(load_model.predict(data_for_prediction))

# Create object that can calculate Shap values
explainer = shap.TreeExplainer(load_model)

# Calculate Shap values from predictions
shap_values = explainer.shap_values(data_for_prediction)

**How does each feature affect the model's predictions in a big-picture sense? In other words, what is its typical effect when considered over a large number of possible predictions?**

In [None]:
plt.title('Feature impact on overall model output (feature impact in details below)')
shap.summary_plot(shap_values, data_for_prediction)


shap.initjs()
shap.force_plot(explainer.expected_value, shap_values, data_for_prediction)
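If you want the global ranking as numbers rather than a plot, one common summary is the mean absolute SHAP value per feature, which is also what the bar-style summary plot displays. The helper below is a small sketch in plain Python; `rank_features`, the toy values, and the feature names are hypothetical.

```python
def rank_features(shap_rows, feature_names):
    """Rank features by mean absolute SHAP value across all rows."""
    n_rows = len(shap_rows)
    mean_abs = [
        sum(abs(row[j]) for row in shap_rows) / n_rows
        for j in range(len(feature_names))
    ]
    return sorted(zip(feature_names, mean_abs), key=lambda pair: pair[1], reverse=True)


# Hypothetical SHAP values: 3 predictions x 3 features
toy_shap = [
    [0.20, -0.05, 0.01],
    [-0.30, 0.10, 0.00],
    [0.10, -0.15, 0.02],
]
toy_names = ["GrLivArea", "OverallQual", "PoolArea"]

ranking = rank_features(toy_shap, toy_names)
for name, score in ranking:
    print(f"{name}: {score:.3f}")

# With NumPy arrays, the same per-feature scores are np.abs(shap_values).mean(axis=0)
```

In this notebook the call would look like `rank_features(shap_values, X_test.columns)`, assuming `shap_values` is the two-dimensional array returned by the explainer.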

# Model Deployment

Great! With SHAP values, we gained some understanding of the model's behavior and uncovered the features that have the most impact on predictions. Now the question is, how do we get *others* to understand how the model works and makes its decisions? How do we build trust in our model, particularly among non-coders and stakeholders who can't run the code themselves?

Building an **interactive machine learning web app** for your predictive model allows anyone to explore the model hands-on, which is the best way to see how it works. It is also one way to let others put your machine learning model to work in their business. **[Streamlit](https://streamlit.io/)** is an open-source app framework that lets you create beautiful data apps in **hours** instead of **weeks**. The best part? You can do it all in pure Python. Legendary.

Steps to deploy your model:

**1. Convert notebook code into Python scripts**

This includes removing nonessential code, refactoring some of your notebook code into functions (depending on how the code was written originally), and creating more Python scripts for other related tasks (more details on this coming up). You can export your trained machine learning model as a pickle file and load it in your script to save time and make the app run faster, since there is no need to re-train the model each time we want to make predictions.

**2. Customize your web app interface with the Streamlit API**
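The export-and-reload step can be sketched as follows. This is a minimal illustration with assumed helper names (`save_model`, `load_saved_model`) and a plain dictionary standing in for the trained model, so it runs without XGBoost installed.

```python
import os
import pickle
import tempfile


def save_model(model, path):
    """Serialize a trained model to disk (run once, after training)."""
    with open(path, "wb") as f:
        pickle.dump(model, f)


def load_saved_model(path):
    """Load a previously saved model (run at app start-up)."""
    with open(path, "rb") as f:
        return pickle.load(f)


# A dictionary stands in for the trained model in this sketch
stand_in_model = {"name": "xgb", "n_estimators": 100}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "model.pkl")
    save_model(stand_in_model, model_path)
    restored = load_saved_model(model_path)

print(restored)
```

Keep in mind that unpickling can execute arbitrary code, so only load pickle files that you created yourself or otherwise trust.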

During this process, you are going to write Python scripts that will affect how everything is displayed on the app's interface. [Streamlit's API](https://docs.streamlit.io/library/api-reference) is *magically simple*. With just a few lines of code you can visualize, mutate, and share data in various ways (while seeing it automatically update as you iteratively save the source file!). In this step, you get to decide how much customization you want to add to the web app. 

For this particular app, to make the user-specified input features more readable, I decided to re-label the feature names and the level names of each categorical feature. This way, app users understand what data they're actually feeding into the model to generate a prediction. Note that this re-labeling process can be very time-consuming (unless you've figured out how to automate the script generation process), and it also made it necessary to add more Python scripts that convert the readable labels back into the variable names and values that our trained model can work with.
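The re-labeling idea can be sketched with two lookup tables: one from readable labels to column names, and one from readable level names to the original category codes. The dictionaries and the `to_model_inputs` helper below are hypothetical and only loosely modeled on the Ames data dictionary; the actual app's scripts may be organized quite differently.

```python
# Hypothetical display labels -> column names the trained model expects
FEATURE_LABELS = {
    "Above-ground living area (sq ft)": "GrLivArea",
    "Exterior quality": "ExterQual",
}

# Hypothetical readable level names -> original category codes, per column
LEVEL_LABELS = {
    "ExterQual": {
        "Excellent": "Ex",
        "Good": "Gd",
        "Typical": "TA",
        "Fair": "Fa",
        "Poor": "Po",
    },
}


def to_model_inputs(user_inputs):
    """Convert readable app inputs back into model-ready names and values."""
    model_inputs = {}
    for label, value in user_inputs.items():
        column = FEATURE_LABELS[label]
        levels = LEVEL_LABELS.get(column, {})
        # Categorical levels are translated; numeric values pass through unchanged
        model_inputs[column] = levels.get(value, value)
    return model_inputs


user_inputs = {
    "Above-ground living area (sq ft)": 1500,
    "Exterior quality": "Good",
}
model_inputs = to_model_inputs(user_inputs)
print(model_inputs)
```

Keeping the mappings in one place means the interface can change its wording without touching the code that feeds the model.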

You should expect to go back and forth between Step 1 and Step 2 until you're happy with how everything looks and you're sure that everything is running smoothly. Click [here](https://github.com/ruthgn/Ames-Housing-Price-Prediction/blob/main/ames-house-ml-app.py) to see the final Python script powering the app. At this point, the app is 100% functional locally.

**3. Create a [GitHub](https://github.com/) repository containing all relevant files**

It's time to gather all of the files we need to run our Python scripts in one repository. One thing that is missing from our repo is a `requirements.txt` file that lists all of the required packages to run our script. To create this file, activate the environment where you're able to run the app locally and enter the following on your command line:

`pip list --format=freeze > requirements.txt`

I would also recommend removing unnecessary packages from the generated `requirements.txt` file to avoid errors and allow faster deployment. Make sure to add the file to your repository.
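Trimming the file by hand works fine. If you prefer to script it, here is a small sketch that keeps only the packages you name. The `trim_requirements` helper, the package list, and the version numbers are placeholders for illustration, not the actual requirements of this app.

```python
def trim_requirements(frozen_lines, keep):
    """Keep only the `name==version` lines whose package name is in `keep`."""
    keep_lower = {name.lower() for name in keep}
    kept = []
    for line in frozen_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        package = line.split("==")[0].lower()
        if package in keep_lower:
            kept.append(line)
    return kept


# Placeholder output in the style of `pip list --format=freeze`
frozen = [
    "numpy==1.21.2",
    "pandas==1.3.3",
    "notebook==6.4.4",
    "streamlit==1.1.0",
]

trimmed = trim_requirements(frozen, keep=["numpy", "pandas", "streamlit"])
print("\n".join(trimmed))
```

In practice you would read the lines from the generated file, pass in the packages your app scripts actually import, and write the result back to `requirements.txt`.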

**4. Deploy app via Streamlit Cloud**

[Streamlit Cloud](https://streamlit.io/cloud) allows you to deploy apps directly from GitHub repos. Now that we have everything we need in one repo, sign up for a Streamlit account (if you haven't already) and log into Streamlit Cloud! From there, you can simply click "New App", enter your GitHub repo information, and deploy your app in one click!





# The App

Without further ado, I present [the app](https://share.streamlit.io/ruthgn/ames-housing-price-prediction/main/ames-house-ml-app.py).

(Complete project code and data are available on [GitHub](https://github.com/ruthgn/Ames-Housing-Price-Prediction).)




*Have questions or comments? Share them in the comments section!*

_____

# Acknowledgement

Steps taken throughout the model-building process in this notebook are inspired by [this Kaggle notebook](https://www.kaggle.com/ryanholbrook/feature-engineering-for-house-prices) by Ryan Holbrook and Alexis Cook (modified for better performance). Check out their notebook for more ideas to improve the prediction model.

Some text at the beginning of the Model Interpretation section is copied from Kaggle's fantastic [Machine Learning Explainability](https://www.kaggle.com/learn/machine-learning-explainability) course.

Other quoted sources include [Business Data Science](https://www.amazon.com/Business-Data-Science-Combining-Accelerate/dp/1260452778) by Matt Taddy.