## Lighthouse Labs
### W07D2 Deployment of ML Models
Instructor: Socorro Dominguez  
October 27, 2020

**Agenda:**

* REST APIs
    * What is it?
    * Applications
    * Demo
   
* Intro to Flask
    * Flask for API creation

## How is Data Science related to the Web?

Web pages are intended for humans. However, there is a lot of valuable data embedded in web pages:
* course listings
* bank records
* blogs

### What if we wanted to collect this data for analysis?

We would need a program that acts like a web browser but collects web document data rather than displaying it.

This is called `web scraping`. Popular tools include Scrapy, a free and open-source web-crawling framework written in Python.

A web scraper...
* acts like a web browser (i.e., sends HTTP GET requests to a web server)
* at the same time, lets you process the data that comes back.

Some other libraries that are useful when scraping, if you are interested:

Beautiful Soup
* Python library that can parse HTML (super useful)
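To make "parsing HTML" concrete, here is a minimal sketch that pulls course names out of a page. It uses the standard library's `html.parser` as a stand-in so that it runs without extra installs; Beautiful Soup offers a higher-level interface for the same job. The page content, the `course` class name, and the `CourseParser` class are made up for illustration.

```python
from html.parser import HTMLParser

# A made-up page, standing in for the HTML a GET request would return
PAGE = """
<html><body>
  <h1>Course listings</h1>
  <ul>
    <li class="course">Intro to Python</li>
    <li class="course">Deployment of ML Models</li>
    <li class="note">Updated weekly</li>
  </ul>
</body></html>
"""


class CourseParser(HTMLParser):
    """Collects the text of every <li class="course"> element."""

    def __init__(self):
        super().__init__()
        self.in_course = False
        self.courses = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "li" and ("class", "course") in attrs:
            self.in_course = True

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_course = False

    def handle_data(self, data):
        if self.in_course and data.strip():
            self.courses.append(data.strip())


parser = CourseParser()
parser.feed(PAGE)
print(parser.courses)
```

Note how much of the code depends on the page layout: if the site renames the `course` class, the parser silently returns nothing.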

### Disadvantages of Web Scraping

- Scraping processes are hard to understand.

- Extracted data needs extensive cleaning (this is where we use `Beautiful Soup`).

- In certain cases, this might take a long time and a lot of energy to complete.

- New data extraction applications take a lot of time to set up in the beginning.

- Web scraping services are slower than API calls.

- If the developer of a website decides to introduce changes in the code, the scraping service might stop working.

## What is an API?

**A**pplication  
**P**rogramming  
**I**nterface  
  
  
**RE**presentational  
**S**tate  
**T**ransfer  

### Characteristics?

Client-server, typically HTTP-based, stateless server


### Furthermore....

Some websites provide direct access to their data. For example: Twitter, TransLink, Car2Go, Google Maps, Yahoo

* Why would they do this?

* Why would some web sites not do this?

### What representation is DATA found in?

**J**ava**S**cript **O**bject **N**otation (JSON)


Textual format for structured data  
* `[a, b, c]` for arrays  
* `{"x": m, "y": n, "z": o}` for objects

JSON
* textual description of JavaScript objects, which map naturally onto Python objects
* arrays and objects (lists and dictionaries in Python)

```
{
"library": [
           {"title": "For Whom the Bell Tolls", "author": "Ernest Hemingway"},
           {"title": "Trump: The Art of the Deal", "author": "Good Question"}
           ]
}
```
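As a minimal sketch of how this looks from Python, the standard `json` module turns JSON text into dictionaries and lists. The snippet reuses the library data from the example; it also shows that JSON text requires double-quoted strings, so a Python-style single-quoted literal is rejected.

```python
import json

# JSON text, as it might arrive in the body of an HTTP response
text = """
{
  "library": [
    {"title": "For Whom the Bell Tolls", "author": "Ernest Hemingway"},
    {"title": "Trump: The Art of the Deal", "author": "Good Question"}
  ]
}
"""

data = json.loads(text)  # JSON object -> dict, JSON array -> list
titles = [book["title"] for book in data["library"]]
print(titles)

# Single-quoted strings are valid Python but not valid JSON
try:
    json.loads("{'title': 'For Whom the Bell Tolls'}")
    single_quotes_ok = True
except json.JSONDecodeError:
    single_quotes_ok = False
print(single_quotes_ok)
```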

XML
* hierarchical description of tagged data. This is similar to what you usually see when you click Inspect in the browser while web scraping.

```
<library>
<book>
<title>
For Whom the Bell Tolls
</title>
<author>
Ernest Hemingway
</author>
</book>
<book>
<title>
Trump: The Art of the Deal
</title>
<author>
Good Question
</author>
</book>
</library>
```
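The same data can be read from XML with the standard library's `xml.etree.ElementTree`. This is a small sketch rather than a recommendation of one parser over another; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

xml_text = """
<library>
  <book>
    <title>For Whom the Bell Tolls</title>
    <author>Ernest Hemingway</author>
  </book>
  <book>
    <title>Trump: The Art of the Deal</title>
    <author>Good Question</author>
  </book>
</library>
"""

# Parse the text into a tree; the root element is <library>
root = ET.fromstring(xml_text.strip())

# Walk the hierarchy: each <book> has a <title> and an <author> child
books = [
    {
        "title": book.findtext("title").strip(),
        "author": book.findtext("author").strip(),
    }
    for book in root.findall("book")
]
print(books)
```

The result is the same list of dictionaries that the JSON version produces, which is one reason many APIs offer both formats.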

### Using a Web API

Provider defines:
* message format for requests and responses
    * usually in both XML and JSON
* registration and authentication
    * usually using OAuth, a delegated authorization framework for REST APIs. It enables apps to obtain limited access to a user's data without giving away the user's password.


Language integration
* might be provided or you might have to do it yourself
* if provided, usually by someone other than the data source
* library API for various languages like Python
    * you write a Python program that calls library procedures
    * the library formats messages, sends them to the web provider, and translates responses into return values
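To illustrate what such a library does behind the scenes, here is a rough sketch of a wrapper that formats the request, sends it, and translates the JSON response into a Python value. Everything in it is hypothetical: the `StopClient` class, the `https://api.example.com/v1` base URL, and the canned response are assumptions for illustration, not a real provider's API. The transport is passed in as a function so the example runs without network access.

```python
import json
import urllib.parse
import urllib.request


def send_over_http(request):
    """Default transport: perform the HTTP call and return the body as text."""
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


class StopClient:
    """Hypothetical wrapper around an imaginary transit web API."""

    def __init__(self, api_key, base_url="https://api.example.com/v1",
                 send=send_over_http):
        self.api_key = api_key
        self.base_url = base_url
        self.send = send

    def build_request(self, path, **params):
        # Format the message: URL, query string, and Accept header
        query = urllib.parse.urlencode({"apikey": self.api_key, **params})
        url = f"{self.base_url}/{path}?{query}"
        return urllib.request.Request(url, headers={"Accept": "application/json"})

    def get_stop(self, stop_id):
        # Send the message and translate the JSON response into a dict
        request = self.build_request(f"stops/{stop_id}")
        return json.loads(self.send(request))


def fake_send(request):
    """Canned response (made up) so the sketch runs offline."""
    return '{"StopNo": 61935, "Name": "EXAMPLE STOP"}'


client = StopClient("my-key", send=fake_send)
stop = client.get_stop(61935)
print(stop["Name"])
```

The caller only sees `client.get_stop(61935)` returning a dictionary; URLs, headers, and parsing stay inside the library.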

### Getting JSON Data

We need to select the output format using the API:
* e.g., HTTP header: `accept: application/json`


View in browser or Postman
* good for exploration / debugging

Use `requests.get`
* calling `.json()` on the response returns a Python list or dictionary

Get a string and parse
* `import json`
* `x = json.loads(aJSONString)`

Example using the TransLink API  


```python
import requests

# Get your own API token from developer.translink.ca
apikey = 'cYLpZHtgW36bD647D1kq'

# Ask for JSON rather than XML
headers = {'accept': 'application/JSON'}

x = requests.get('http://api.translink.ca/rttiapi/v1/stops/61935?apikey={}'.format(apikey), headers=headers).json()
y = requests.get('http://api.translink.ca/rttiapi/v1/stops/61935/estimates?apikey={}'.format(apikey), headers=headers).json()
z = requests.get('http://api.translink.ca/rttiapi/v1/buses?apikey={}&routeNo=099'.format(apikey), headers=headers).json()
```

```python
y[0]
```

```
{'RouteNo': '099',
 'RouteName': 'COMMERCIAL-BROADWAY/UBC (B-LINE)',
 'Direction': 'EAST',
 'RouteMap': {'Href': 'https://nb.translink.ca/geodata/099.kmz'},
 'Schedules': [{'Pattern': 'E8FL2',
   'Destination': 'TO BOUNDARY B-LINE',
   'ExpectedLeaveTime': '5:16pm 2020-10-26',
   'ExpectedCountdown': -3,
   'ScheduleStatus': ' ',
   'CancelledTrip': False,
   'CancelledStop': False,
   'AddedTrip': False,
   'AddedStop': False,
   'LastUpdate': '04:16:58 pm'},
  {'Pattern': 'E1',
   'Destination': "COMM'L-BDWAY STN",
   'ExpectedLeaveTime': '5:19pm 2020-10-26',
   'ExpectedCountdown': 0,
   'ScheduleStatus': ' ',
   'CancelledTrip': False,
   'CancelledStop': False,
   'AddedTrip': False,
   'AddedStop': False,
   'LastUpdate': '04:19:21 pm'},
  {'Pattern': 'E1',
   'Destination': "COMM'L-BDWAY STN",
   'ExpectedLeaveTime': '5:22pm 2020-10-26',
   'ExpectedCountdown': 3,
   'ScheduleStatus': '*',
   'CancelledTrip': False,
   'CancelledStop': False,
   'AddedTrip': False,
   'AddedStop
```

### The Anatomy Of A Request

It’s important to know that a request is made up of four things:

1. The endpoint

2. The method

3. The headers

4. The data (or body)

1. The endpoint (or route) is the URL you request. It has the form:

root-endpoint/?

For example, the root endpoint of GitHub's API is:

https://api.github.com

2. The method is the type of request you send to the server. You can choose from the types below:

a. GET - used to get a resource from the server

b. POST - used to create a new resource on the server

c. PUT/PATCH - used to update a resource on the server

d. DELETE - used to delete a resource on the server
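The four parts of a request can be seen side by side in a short sketch. It builds two requests with the standard library's `urllib` and inspects them without sending anything. The GitHub paths, the `octocat` user, and the JSON body are illustrative assumptions, not part of the demo.

```python
import json
import urllib.parse
import urllib.request

# 1. The endpoint: root endpoint + path + optional query string
endpoint = "https://api.github.com/users/octocat/repos?sort=updated"
parts = urllib.parse.urlsplit(endpoint)
root_endpoint = f"{parts.scheme}://{parts.netloc}"
print(root_endpoint, parts.path, parts.query)

# 2. The method and 3. the headers: a GET request that asks for JSON
get_request = urllib.request.Request(
    endpoint,
    headers={"Accept": "application/json"},
    method="GET",
)

# 4. The data (or body): a POST request carries a payload to the server
body = json.dumps({"name": "demo"}).encode("utf-8")
post_request = urllib.request.Request(
    "https://api.github.com/user/repos",
    data=body,
    headers={"Content-Type": "application/json"},
    method="POST",
)

for req in (get_request, post_request):
    print(req.get_method(), req.full_url, req.headers, req.data)
```

Passing either object to `urllib.request.urlopen` would actually send it; the POST would additionally need authentication to succeed.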

STOCK API demo

## FLASK

Flask is a micro web framework written in Python. It can create a REST API that allows you to send data and receive a prediction as a response.

Now that you are going to be a Data Scientist, you cannot always rely on having your models in Jupyter Notebook.

Jupyter Notebooks are awesome for EDA. However, when you need an application that has a predictive model, you will need to deploy your model elsewhere.

You can try to get the best model possible in a notebook or a script. Once you have decided that you have the best model, you must hand it over in a way that the client can run easily in their infrastructure.

For this purpose you need a tool that can fit in their infrastructure, preferably in a language that you’re familiar with. This is where you can use Flask.

Let's quickly create a model for predicting Boston house prices.


```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics

# importing dataset from sklearn
# (load_boston was removed in scikit-learn 1.2, so this cell needs an older version)
from sklearn.datasets import load_boston
boston_data = load_boston()

# initializing dataset
data_ = pd.DataFrame(boston_data.data)

# top five rows of dataset
data_.head()
```

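|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 |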


```python
# label the columns with the dataset's feature names
data_.columns = boston_data.feature_names
data_.head()
```

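|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 |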


```python
# Target feature of Boston Housing data
data_['PRICE'] = boston_data.target
```

```python
# creating feature and target variable
X = data_.drop(['PRICE'], axis=1)
y = data_['PRICE']

# splitting into training and testing set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
print("X training shape : ", X_train.shape)
print("X test shape : ", X_test.shape)
print("y training shape :", y_train.shape)
print("y test shape :", y_test.shape)

# creating model (a regressor, despite the variable name)
from sklearn.ensemble import RandomForestRegressor
classifier = RandomForestRegressor()
classifier.fit(X_train, y_train)
```

```
X training shape :  (404, 13)
X test shape :  (102, 13)
y training shape : (404,)
y test shape : (102,)

RandomForestRegressor()
```

```python
# Model evaluation for training data
prediction = classifier.predict(X_train)
print("r^2 : ", metrics.r2_score(y_train, prediction))
print("Mean Absolute Error: ", metrics.mean_absolute_error(y_train, prediction))
print("Mean Squared Error: ", metrics.mean_squared_error(y_train, prediction))
print("Root Mean Squared Error : ", np.sqrt(metrics.mean_squared_error(y_train, prediction)))


# Model evaluation for testing data
prediction_test = classifier.predict(X_test)
print("r^2 : ", metrics.r2_score(y_test, prediction_test))
print("Mean Absolute Error : ", metrics.mean_absolute_error(y_test, prediction_test))
print("Mean Squared Error : ", metrics.mean_squared_error(y_test, prediction_test))
print("Root Mean Squared Error : ", np.sqrt(metrics.mean_squared_error(y_test, prediction_test)))
```

```
r^2 :  0.9811314961312023
Mean Absolute Error:  0.8088490099009886
Mean Squared Error:  1.5242238737623763
Root Mean Squared Error :  1.2345946192019372
r^2 :  0.9061638456037587
Mean Absolute Error :  2.38821568627451
Mean Squared Error :  9.27358217647059
Root Mean Squared Error :  3.045255683267103
```


```python
y_test
```

```
307    28.2
343    23.9
47     16.6
67     22.0
362    20.8
       ... 
92     22.9
224    44.8
110    21.7
426    10.2
443    15.4
Name: PRICE, Length: 102, dtype: float64
```

```python
prediction_test
```

```
array([29.544, 27.597, 20.492, 20.42 , 19.419, 19.505, 27.797, 18.824,
       20.366, 23.996, 29.116, 30.834, 20.72 , 19.985, 20.345, 24.152,
       11.914, 40.394, 24.311, 14.185, 20.025, 16.878, 24.231, 23.718,
       25.712,  9.547, 14.772, 19.766, 43.614, 12.464, 26.561, 19.695,
       47.78 , 15.929, 23.834, 20.707, 15.504, 33.291, 13.959, 19.941,
       24.558, 23.187, 25.266, 16.043, 15.423, 10.841, 47.421, 11.421,
       21.506, 18.819, 22.879, 21.273, 24.641, 20.895, 10.758, 23.773,
       11.902, 23.595, 19.153, 42.433, 14.445, 26.701, 13.038, 14.702,
       18.001, 32.591, 41.144, 24.998, 21.098, 21.193, 23.854,  6.905,
       18.329, 21.429, 19.523, 20.458, 42.844, 24.175, 28.488, 32.579,
       17.137, 20.929, 34.3  , 12.   , 24.469, 25.462, 14.937, 24.186,
       19.819, 16.895, 26.279, 45.127, 15.501, 21.262, 15.321, 20.903,
       23.804, 23.703, 42.576, 20.82 , 15.589, 16.113])
```
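To see what the evaluation metrics measure, they can be computed by hand. The sketch below uses only the first five pairs of `y_test` and `prediction_test` displayed above, so its values differ from the full test-set metrics; it is meant to show the formulas, not to reproduce the reported numbers.

```python
import math

# First five entries of y_test and prediction_test as displayed above
y_true = [28.2, 23.9, 16.6, 22.0, 20.8]
y_pred = [29.544, 27.597, 20.492, 20.42, 19.419]

n = len(y_true)
errors = [p - t for t, p in zip(y_true, y_pred)]

# Mean Absolute Error: average size of the errors
mae = sum(abs(e) for e in errors) / n

# Mean Squared Error: average squared error (penalizes large misses more)
mse = sum(e ** 2 for e in errors) / n

# Root Mean Squared Error: square root of MSE, in the units of the target
rmse = math.sqrt(mse)

# r^2: 1 minus (residual sum of squares / total sum of squares)
mean_true = sum(y_true) / n
ss_res = sum(e ** 2 for e in errors)
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

print("MAE:", mae, "MSE:", mse, "RMSE:", rmse, "r^2:", r2)

# The reported test RMSE is the square root of the reported test MSE
reported_test_mse = 9.27358217647059
print(math.sqrt(reported_test_mse))
```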

```python
import pickle

# saving the columns
model_columns = list(X.columns)
with open('model_columns.pkl', 'wb') as file:
    pickle.dump(model_columns, file)

# saving the model (the with-block makes sure the file is closed)
with open('final_prediction.pickle', 'wb') as file:
    pickle.dump(classifier, file)
```
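With the model and its column list saved, a Flask app can serve predictions. What follows is a minimal sketch under stated assumptions, not the definitive deployment: the `/predict` route, the `create_app` factory, and the JSON payload shape (one value per feature name) are choices made for illustration. A stand-in `ConstantModel` replaces the pickled regressor so the example runs on its own.

```python
from flask import Flask, jsonify, request


def create_app(model, model_columns):
    """Build a Flask app that serves predictions from an already-loaded model."""
    app = Flask(__name__)

    @app.route("/predict", methods=["POST"])
    def predict():
        # Expect a JSON object mapping feature names to values
        payload = request.get_json(silent=True)
        if not isinstance(payload, dict):
            return jsonify({"error": "expected a JSON object"}), 400

        missing = [name for name in model_columns if name not in payload]
        if missing:
            return jsonify({"error": "missing features", "missing": missing}), 400

        try:
            # Order the values the same way the model was trained
            row = [float(payload[name]) for name in model_columns]
        except (TypeError, ValueError):
            return jsonify({"error": "feature values must be numeric"}), 400

        prediction = model.predict([row])
        return jsonify({"prediction": float(prediction[0])})

    return app


class ConstantModel:
    """Stand-in for the trained regressor (hypothetical, for illustration only)."""

    def predict(self, rows):
        return [22.5 for _ in rows]


app = create_app(ConstantModel(), ["RM", "LSTAT"])

# Exercise the endpoint in-process with Flask's test client (no server needed)
client = app.test_client()
response = client.post("/predict", json={"RM": 6.575, "LSTAT": 4.98})
print(response.status_code, response.get_json())
```

To serve the real model, load `final_prediction.pickle` and `model_columns.pkl` with `pickle.load`, pass them to `create_app`, and start the server with `app.run()`.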

### Tmux

Show Tmux and its interactivity for multiple session handling.

You can learn more about it [here](https://www.hamvocke.com/blog/a-quick-and-easy-guide-to-tmux/)
