In [146]:
import pandas as pd
import numpy as np
import sklearn
import random as rand
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import train_test_split
from pandas import json_normalize  # top-level import path (pandas >= 1.0)
from IPython.display import clear_output, display
import ipywidgets as widgets


# Introduction

### What are we doing here?

You are in school because you either love to, or want to, work with data (hopefully both!). Over the last 10 years, the field of Data Science has matured quite a bit, and with it has come the re-emergence of ML as a tool for solving some of the most challenging data problems with increased speed and accuracy, and with entirely new capabilities.

So, Abbass and your other professors are teaching you all about how to make those amazing tools do amazing things--Right? Great.

Well, at some point you'll graduate (seriously). You'll also get a job (seriously). At that job, you'll do some amazing work eventually--you'll build an ML model that passes all of the tests, and you'll probably want to scale it to your whole user population.

So...What do you do?

That's where this workshop comes in.  There are a lot of ways to deploy and scale ML applications today (this is both a good and bad thing) but since it's all changing at breakneck pace, maybe school hasn't caught up yet.  So I want to show you a bit about the deployment paradigms that exist and give you a taste for where you might want to invest more time as you continue your ML/DS/AI Journey.

### First, we need to give credit where credit is due:

I drew inspiration, code and ideas from several excellent articles written by some great people in the open data science community:

[Create a model to predict house prices using Python - Shreyas Raghavan](https://towardsdatascience.com/create-a-model-to-predict-house-prices-using-python-d34fe8fad88f)

[Deploy your Machine Learning model as an API in 5 minutes (with Docker and Flask) - Guissart](https://medium.com/dataswati-garage/deploy-your-machine-learning-model-as-api-in-5-minutes-with-docker-and-flask-8aa747b1263b)

[Create a complete Machine learning web application using React and Flask](https://towardsdatascience.com/create-a-complete-machine-learning-web-application-using-react-and-flask-859340bddb33)

[Tidymodels-tidypredict](https://tidymodels.github.io/tidypredict/)


### Here are the main technologies and tools we'll use for the workshop
* [Python virtual environments](https://docs.python.org/3/tutorial/venv.html)
* [Docker](https://en.wikipedia.org/wiki/Docker_(software))
* [Scikit-learn](https://scikit-learn.org/stable/)
* [Flask](https://en.wikipedia.org/wiki/Flask_(web_framework))
* [React](https://en.wikipedia.org/wiki/React_(web_framework)) - We'll cover this at a high level.  


###  Three Deployment Paradigms

As I said, there are many different ways to push an ML model into production, but for simplicity's sake we'll cover just a few...

1. [Easy] Train in R/Python and run prediction directly in the database using tidypredict or something similar
2. **[Medium-Flexible] Wrap the model inside a web app framework (e.g. Flask) and expose the scoring function as a RESTful API endpoint**
3. [Hard-Scalable] Build a highly responsive, highly scalable production AI product using Apache Kafka and MLflow

We're going to spend the most time working through an example that falls into #2.
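
To make paradigm #2 concrete, here is a minimal sketch of a scoring endpoint. It is written against Python's standard-library WSGI interface rather than Flask so that it runs with no extra installs; Flask is itself a WSGI framework and would replace the manual routing below with a route decorator. The `/predict` path, the `fake_model` function and its coefficients are hypothetical stand-ins, not part of the workshop code.

```python
import io
import json

def fake_model(features):
    # Hypothetical stand-in for a trained model: a made-up linear rule.
    return 100.0 * features["sqft_living"] + 50000.0 * features["waterfront"]

def app(environ, start_response):
    """Tiny WSGI app: POST a JSON object to /predict, get a JSON prediction back."""
    if environ["REQUEST_METHOD"] != "POST" or environ["PATH_INFO"] != "/predict":
        start_response("404 Not Found", [("Content-Type", "application/json")])
        return [json.dumps({"error": "not found"}).encode("utf-8")]
    length = int(environ.get("CONTENT_LENGTH") or 0)
    payload = json.loads(environ["wsgi.input"].read(length))
    body = json.dumps({"prediction": fake_model(payload)}).encode("utf-8")
    start_response("200 OK", [("Content-Type", "application/json")])
    return [body]

def call(app, method, path, payload=None):
    """Invoke the WSGI app in-process (no network); return (status, parsed JSON)."""
    raw = json.dumps(payload).encode("utf-8") if payload is not None else b""
    environ = {
        "REQUEST_METHOD": method,
        "PATH_INFO": path,
        "CONTENT_LENGTH": str(len(raw)),
        "wsgi.input": io.BytesIO(raw),
    }
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    chunks = app(environ, start_response)
    return captured["status"], json.loads(b"".join(chunks))

status, result = call(app, "POST", "/predict", {"sqft_living": 620.0, "waterfront": 0.0})
print(status, result)  # 200 OK {'prediction': 62000.0}
```

The point of the pattern is that the model lives in one long-running process and every client, whatever its language, talks to it with plain JSON over HTTP.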

### The Data:

The data is taken from this [Kaggle](https://www.kaggle.com/shivachandel/kc-house-data) page. 

“Online property companies offer valuations of houses using machine learning techniques. The aim of this report is to predict the house sales in King County, Washington State, USA using Multiple Linear Regression (MLR). The dataset consisted of historic data of houses sold between May 2014 to May 2015. We will predict the sales of houses in King County with an accuracy of at least 75-80% and understand which factors are responsible for higher property value - $650K and above.”

The dataset consists of house prices from King County, an area in the US state of Washington that also includes Seattle. The dataset was obtained from Kaggle. **This data was published/released under CC0: Public Domain.** Unfortunately, the user has not indicated the source of the data. Please find the citation and database description in the Glossary and Bibliography. The dataset consists of 21 variables and 21613 observations.

In [147]:
data = pd.read_csv("kc_house_data.csv")

In [148]:
data.count()

id               21613
date             21613
price            21613
bedrooms         21613
bathrooms        21613
sqft_living      21613
sqft_lot         21613
floors           21613
waterfront       21613
view             21613
condition        21613
grade            21613
sqft_above       21613
sqft_basement    21613
yr_built         21613
yr_renovated     21613
zipcode          21613
lat              21613
long             21613
sqft_living15    21613
sqft_lot15       21613
dtype: int64

In [149]:
data.head()

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,7129300520,20141013T000000,221900.0,3,1.0,1180,5650,1.0,0,0,...,7,1180,0,1955,0,98178,47.5112,-122.257,1340,5650
1,6414100192,20141209T000000,538000.0,3,2.25,2570,7242,2.0,0,0,...,7,2170,400,1951,1991,98125,47.721,-122.319,1690,7639
2,5631500400,20150225T000000,180000.0,2,1.0,770,10000,1.0,0,0,...,6,770,0,1933,0,98028,47.7379,-122.233,2720,8062
3,2487200875,20141209T000000,604000.0,4,3.0,1960,5000,1.0,0,0,...,7,1050,910,1965,0,98136,47.5208,-122.393,1360,5000
4,1954400510,20150218T000000,510000.0,3,2.0,1680,8080,1.0,0,0,...,8,1680,0,1987,0,98074,47.6168,-122.045,1800,7503


### Project Setup

We'll need a few things to make this work: 

#### 1. A data set (see above)

We are going to work with a dataset that contains information about each house: its location, price and other attributes such as square footage. The goal is to build a model that gives a good prediction of the price of a house based on these attributes. Before going much further, we should also discuss our success criteria: what is a good enough outcome for the model to be useful? [Insert discussion about AI product development] There is no universal definition of “good accuracy” for this problem, but for this workshop we will treat anything above 85% as good. Because this is a regression problem, the score in question is R^2 (what `clf.score` reports), not a classification accuracy. Our aim on this dataset is a score of 85% or better.
 
Let's also do some prep: handle dates, remove the index, extract and remove the target variable (price) for training, and reduce the feature space from 19 to 6.

In [150]:
labels = data['price']
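# date holds strings like "20141013T000000"; comparing them to the int 2014 is always False, so check the year prefix
conv_dates = [1 if str(values).startswith('2014') else 0 for values in data.date]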
data['date'] = conv_dates
train1 = data.drop(['id', 'price'],axis=1)
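
The `date` column holds strings such as `20141013T000000`, so turning it into a sold-in-2014 flag means getting the year out of the string first. A minimal sketch using only the standard library; the helper name `sold_in_2014` is hypothetical and the two sample strings are copied from the `data.head()` output:

```python
from datetime import datetime

def sold_in_2014(raw_date):
    """Return 1 if a kc_house_data-style timestamp falls in 2014, else 0."""
    parsed = datetime.strptime(raw_date, "%Y%m%dT%H%M%S")
    return 1 if parsed.year == 2014 else 0

samples = ["20141013T000000", "20150225T000000"]
flags = [sold_in_2014(s) for s in samples]
print(flags)  # [1, 0]
```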

In [151]:
x_train, x_test, y_train, y_test = train_test_split(train1, labels, test_size=0.10, random_state=2)
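
With `test_size=0.10` and no `train_size`, scikit-learn (as far as I know) rounds the test set size up and gives the remaining rows to training. The arithmetic for this dataset's 21613 rows, as a quick sketch:

```python
import math

n_rows = 21613
test_size = 0.10

n_test = math.ceil(test_size * n_rows)   # test set size is rounded up
n_train = n_rows - n_test                # remaining rows go to training
print(n_train, n_test)  # 19451 2162
```

Under that assumption, valid positional indices into `x_test` run from 0 to 2161.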

In [152]:
col_imp = ["grade", "lat", "long", "sqft_living", "waterfront", "yr_built"]

#### 2. An acceptable model which does something interesting (here: predicts home price from house attributes)

 * Walk through training of GBT Regressor

In [153]:
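# loss='ls' (least squares) was renamed 'squared_error' in scikit-learn 1.0 and the old
# name was removed in 1.2; on current versions pass loss='squared_error' instead.
clf = GradientBoostingRegressor(n_estimators=400, max_depth=5, min_samples_split=2,
          learning_rate=0.1, loss='ls')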

In [154]:
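# Caution: this fits on all rows, x_test included, so the score below is optimistic; use clf.fit(x_train[col_imp], y_train) for a true hold-out score.
clf.fit(train1[col_imp], labels)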

GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
             learning_rate=0.1, loss='ls', max_depth=5, max_features=None,
             max_leaf_nodes=None, min_impurity_decrease=0.0,
             min_impurity_split=None, min_samples_leaf=1,
             min_samples_split=2, min_weight_fraction_leaf=0.0,
             n_estimators=400, presort='auto', random_state=None,
             subsample=1.0, verbose=0, warm_start=False)

In [155]:
clf.score(x_test[col_imp],y_test)

0.95538427386965774
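
For a regressor, `score` returns the coefficient of determination R^2 rather than a classification accuracy: one minus the ratio of the residual sum of squares to the total sum of squares. A small pure-Python sketch with made-up numbers (not taken from the housing data); `r_squared` is a hypothetical helper name:

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]
print(round(r_squared(y_true, y_pred), 4))  # 0.9486
```

A perfect model scores 1.0, and a model that always predicts the mean of `y_true` scores 0.0.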

#### 3. A way to encode the model for reuse in the Flask server

* Create a function that takes a dictionary representation of a home, plus a model, and returns a price prediction

In [156]:
def predict(dict_values, col_imp=col_imp, clf=clf):
    x = np.array([float(dict_values[col]) for col in col_imp])
    x = x.reshape(1,-1)
    y_pred = clf.predict(x)[0]
    return y_pred
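
Because `predict` looks values up by name and then relies on position, the order of `col_imp` has to match the order used at training time, and a missing key surfaces only as a bare `KeyError`. A hedged sketch of a stricter input check that could sit in front of it; `build_feature_row` is a hypothetical helper, not part of the notebook:

```python
FEATURES = ["grade", "lat", "long", "sqft_living", "waterfront", "yr_built"]

def build_feature_row(dict_values, columns=FEATURES):
    """Return the feature values as floats, in training-column order."""
    missing = [col for col in columns if col not in dict_values]
    if missing:
        raise ValueError(f"missing features: {missing}")
    return [float(dict_values[col]) for col in columns]

# Keys arrive in arbitrary order and may be strings or ints, as they would from a web form.
house = {"yr_built": "1939", "grade": 5, "lat": 47.5138, "long": -122.364,
         "sqft_living": 620, "waterfront": 0}
row = build_feature_row(house)
print(row)  # [5.0, 47.5138, -122.364, 620.0, 0.0, 1939.0]
```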

* Export some test data to a [JSON](https://en.wikipedia.org/wiki/JSON) object so it can be easily handled by the prediction API that we're going to set up

In [157]:
x_test[col_imp].iloc[20].T.to_json(force_ascii = False)

'{"grade":5.0,"lat":47.5138,"long":-122.364,"sqft_living":620.0,"waterfront":0.0,"yr_built":1939.0}'
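
On the receiving side, that string parses straight back into the dictionary shape that `predict` expects. A short sketch using the standard `json` module and the exact string shown above:

```python
import json

payload = ('{"grade":5.0,"lat":47.5138,"long":-122.364,'
           '"sqft_living":620.0,"waterfront":0.0,"yr_built":1939.0}')

house = json.loads(payload)       # JSON object -> Python dict
print(sorted(house))              # the six feature names
print(house["sqft_living"])       # 620.0
```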

In [158]:
# Despite the name, this is a pandas Series, not a JSON string; predict() only needs lookup by column name.
x_test_json = x_test[col_imp].iloc[20]

In [159]:
x_test_json

grade             5.0000
lat              47.5138
long           -122.3640
sqft_living     620.0000
waterfront        0.0000
yr_built       1939.0000
Name: 12418, dtype: float64

In [160]:
predict(x_test_json)

179378.01971649766

* Save a random example to a file, so we can test it with the Flask API

In [161]:
x_test[col_imp].iloc[rand.randint(1,2100)].T.to_json("../to_predict_json.json")
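
`random.randint(a, b)` includes both endpoints, so hard-coded bounds can skip rows or run past the end of the test set. A sketch of a pattern that derives the bound from the data instead; the list here is only a stand-in for `x_test`, sized at the 2162 rows that a 10% split of 21613 rows should produce:

```python
import random

rows = list(range(2162))              # stand-in for the rows of x_test

random.seed(0)                        # seeded only to make the sketch repeatable
idx = random.randrange(len(rows))     # uniform over 0 .. len(rows) - 1
picked = rows[idx]
print(0 <= idx < len(rows))           # True
```

`random.randrange(n)` never returns `n` itself, which makes it a natural fit for positional indexing with `.iloc`.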

### UI lab

In [165]:
#draft 1 
score_btn = widgets.Button(description='Return Score')
upload_btn = widgets.Button(description = 'Upload')
btn_select_random = widgets.Button(description = 'Select Random Row & Score')
out_pl = widgets.Output(layout={'border': '1px solid black'})


def btn_select_random_eventhandler(obj):
    # Pick a random test row and show it in the output area
    global x_test_json
    with out_pl:
        x_test_json = x_test[col_imp].iloc[rand.randint(1,2100)]
        clear_output()
        print(x_test_json)


def btn_score_eventhandler(obj):
    # Score the currently selected row and show the predicted price
    with out_pl:
        clear_output()
        print(f"Predicted Price: ${predict(x_test_json):,.2f}")

In [167]:
#display(score_btn)
#display(upload_btn)

display(widgets.VBox([btn_select_random,score_btn,out_pl]))

btn_select_random.on_click(btn_select_random_eventhandler)
score_btn.on_click(btn_score_eventhandler)



A Jupyter Widget

In [214]:
#draft 2

score_btn = widgets.Button(description='Return Price', 
                           button_style='primary',
                           layout = widgets.Layout(width='auto', height='40px'))
btn_select_random = widgets.Button(description = 'Select Random House', 
                                   layout = widgets.Layout(width='auto', height='auto'))
out_select = widgets.Output(layout={'border': '1px solid black'})
out_score = widgets.Output(layout={'border': '1px solid black'})


def btn_select_random_eventhandler(obj):
    # Pick a random test row and show it in the selection output area
    global x_test_json
    with out_select:
        x_test_json = x_test[col_imp].iloc[rand.randint(1,2100)]
        clear_output()
        print(x_test_json)


def btn_score_eventhandler(obj):
    # Score the currently selected row and show the predicted price
    with out_score:
        clear_output()
        print(f"Predicted Price: ${predict(x_test_json):,.2f}")



In [215]:
display(widgets.VBox([btn_select_random,score_btn,out_select,out_score]))

btn_select_random.on_click(btn_select_random_eventhandler)
score_btn.on_click(btn_score_eventhandler)

A Jupyter Widget

## Appendix

 * Walk through training of RF Regressor?
 * Compare R^2 scores to select one?

In [69]:
# randint includes both endpoints, so the upper bound must be len(x_test) - 1
x_test[col_imp].iloc[rand.randint(0, len(x_test) - 1)]

grade            11.0000
lat              47.5696
long           -122.0900
sqft_living    5270.0000
waterfront        1.0000
yr_built       1979.0000
Name: 13710, dtype: float64

In [None]:
# randint includes both endpoints, so the upper bound must be len(x_test) - 1
x_test[col_imp].iloc[rand.randint(0, len(x_test) - 1)].to_json("../to_predict_json.json")

In [217]:
!pip install voila
!jupyter serverextension enable --sys-prefix voila 

Collecting tornado>=5.0 (from jupyter-server<2.0.0,>=0.3.0->voila)
  Downloading https://files.pythonhosted.org/packages/95/84/119a46d494f008969bf0c775cb2c6b3579d3c4cc1bb1b41a022aa93ee242/tornado-6.0.4.tar.gz (496kB)
Collecting pyzmq>=17 (from jupyter-server<2.0.0,>=0.3.0->voila)
  Downloading https://files.pythonhosted.org/packages/9e/fd/dcebddd29df55fa951144da02057aa2b1c521a5abcf37e811dc093f6f03d/pyzmq-19.0.2-cp36-cp36m-macosx_10_9_intel.whl (1.4MB)
Building wheels for collected packages: tornado
  Running setup.py bdist_wheel for tornado ... done
  Stored in directory: /Users/junorman/Library/Caches/pip/wheels/93/84/2f/409c7b2bb3afc3aa727f7ee8787975e0793f74d1165f4d0104
Successfully built tornado
Installing collected packages: tornado, pyzmq
  Found existing installation: tornado 4.5.2
    Uninstalling tornado-4.5.2:
      Success