## Lighthouse Labs
### W05D5 Midterm Project
Instructor: Socorro Dominguez  
October 27, 2020

**Agenda:**

* REST APIs
    * What is it?
    * Applications
    * Demo
   
* Intro to Flask
    * Flask for API creation

## How is Data Science related to the Web?

Web pages are intended for humans. However, there is a lot of valuable data embedded in web pages:
* course listings
* bank records
* blogs


What if we wanted to collect this data for analysis?

We would need a program that acts like a web browser but collects web document data rather than displaying it.

This is called `web scraping`. Popular tools include Scrapy, a free and open-source web-crawling framework written in Python.

A web scraper...
* acts like a web browser (i.e., sends HTTP GET requests to a web server)
* at the same time, lets you process the data that comes back.

Some other useful libraries for scraping if you are interested:

curl
* utility and library for accessing web servers
* delivers web data as text

Beautiful Soup
* Python library that can parse HTML (super useful)
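
To make the "collect rather than display" idea concrete, here is a minimal sketch of the parsing half of a scraper. It deliberately uses the standard library's `html.parser` instead of Beautiful Soup so that it runs without extra installs. The page content and the `CourseParser` class are made up for illustration; a real scraper would first download the HTML with an HTTP GET request.

```python
from html.parser import HTMLParser

# Hypothetical page content; a real scraper would download this with an HTTP GET.
HTML_PAGE = """
<html><body>
  <h1>Course listings</h1>
  <ul>
    <li class="course">Intro to Python</li>
    <li class="course">Statistics 101</li>
    <li class="note">Registration closes Friday</li>
  </ul>
</body></html>
"""


class CourseParser(HTMLParser):
    """Collects the text of every <li class="course"> element."""

    def __init__(self):
        super().__init__()
        self.in_course = False
        self.courses = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "li" and ("class", "course") in attrs:
            self.in_course = True

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_course = False

    def handle_data(self, data):
        # keep only non-empty text found inside a course item
        if self.in_course and data.strip():
            self.courses.append(data.strip())


parser = CourseParser()
parser.feed(HTML_PAGE)
print(parser.courses)
```

Beautiful Soup offers a much shorter way to express the same selection, which is why it is usually preferred once the pages get more complex.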

### Disadvantages of Web Scraping

For anyone who is not an expert, scraping processes are hard to understand.

The extracted data first needs to be cleaned so that it can be easily understood (this is where we use `Beautiful Soup`).
In certain cases, this might take a long time and a lot of effort.

It is common for new data extraction applications to take some time in the beginning, as the software often has a learning curve. Sometimes web scraping services need time to become familiar with the core application and to adjust to the scraping language. This means that such services can take some days before they are up and running at full speed.

Most web scraping services are slower than API calls. Another problem is that some websites do not allow screen scraping, which renders web scraping services useless. Also, if the developer of the website changes the page's code, the scraping service might stop working.

## What is an API?

**A**pplication  
**P**rogramming  
**I**nterface  
  
  
**RE**presentational  
**S**tate  
**T**ransfer  

### Characteristics?

Client-server, typically HTTP-based, stateless server


Some websites provide direct access to their data. For example: Twitter, Translink, Car2Go, Google Maps, Yahoo.

Why would they do this?  
Why would some websites not do this?

### What representation is DATA found in?

**J**ava**S**cript **O**bject **N**otation (JSON)


Textual format for structured data  
* `[a, b, c]` for arrays  
* `{"x": m, "y": n, "z": o}` for objects

JSON
* textual description of Python (actually JavaScript) objects
* arrays and dictionaries

```
{
  "library": [
    {"title": "For Whom the Bell Tolls", "author": "Ernest Hemingway"},
    {"title": "Trump: The Art of the Deal", "author": "Good Question"}
  ]
}
```
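
As a small illustration of how JSON text maps onto Python objects, the sketch below parses a shortened version of the library example with the standard `json` module and then converts it back to text. The variable names are arbitrary. Note that valid JSON requires double quotes around strings.

```python
import json

# Shortened version of the library example, written as valid JSON (double quotes).
json_text = '{"library": [{"title": "For Whom the Bell Tolls", "author": "Ernest Hemingway"}]}'

# JSON text -> Python objects (objects become dicts, arrays become lists)
data = json.loads(json_text)
first_book = data["library"][0]
print(type(data).__name__, type(data["library"]).__name__)
print(first_book["author"])

# Python objects -> JSON text
round_trip = json.dumps(data, indent=2)
print(round_trip)
```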

XML
* hierarchical description of tagged data; this is how you would usually see data if you click Inspect to do web scraping

```
<library>
  <book>
    <title>For Whom the Bell Tolls</title>
    <author>Ernest Hemingway</author>
  </book>
  <book>
    <title>Trump: The Art of the Deal</title>
    <author>Good Question</author>
  </book>
</library>
```
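
For comparison with JSON, here is a minimal sketch that reads the same library data from XML using the standard library's `xml.etree.ElementTree`. The variable names are arbitrary; the point is that the tag hierarchy has to be walked explicitly to get plain Python values.

```python
import xml.etree.ElementTree as ET

xml_text = """
<library>
  <book>
    <title>For Whom the Bell Tolls</title>
    <author>Ernest Hemingway</author>
  </book>
  <book>
    <title>Trump: The Art of the Deal</title>
    <author>Good Question</author>
  </book>
</library>
""".strip()

root = ET.fromstring(xml_text)  # root is the <library> element

# Walk the hierarchy: every <book> child becomes one dictionary
books = [
    {
        "title": book.findtext("title").strip(),
        "author": book.findtext("author").strip(),
    }
    for book in root.findall("book")
]
print(books)
```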

### Using a Web API

Provider defines:
* message format for requests and responses
    * usually in both XML and JSON
* registration and authentication
    * usually using OAuth


Language integration
* might be provided, or you might have to do it yourself
* if provided, usually by someone other than the data source
* library API for various languages like Python
    * you write a Python program that calls library procedures
    * the library formats messages, sends them to the web provider, and translates responses into return values
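
The three jobs of such a client library (format the message, send it, translate the response) can be sketched as follows. Everything here is hypothetical: `StopsClient`, the `api.example.com` URL, and the `fake_transport` function that stands in for the network are invented for illustration and do not correspond to any published library.

```python
import json
from urllib.parse import urlencode


def fake_transport(url, headers):
    """Stand-in for the network: returns a canned JSON response body."""
    return json.dumps({"StopNo": 61935, "requested_url": url})


class StopsClient:
    """Hypothetical wrapper around a web API; not a real client library."""

    def __init__(self, base_url, api_key, transport):
        self.base_url = base_url
        self.api_key = api_key
        self.transport = transport

    def get_stop(self, stop_no):
        # 1. Format the request message
        query = urlencode({"apikey": self.api_key})
        url = f"{self.base_url}/stops/{stop_no}?{query}"
        headers = {"Accept": "application/json"}
        # 2. Send it to the provider (here: the fake transport)
        body = self.transport(url, headers)
        # 3. Translate the response into a Python return value
        return json.loads(body)


client = StopsClient("https://api.example.com/v1", "YOUR_API_KEY", fake_transport)
stop = client.get_stop(61935)
print(stop["StopNo"])
```

The caller only sees `client.get_stop(...)` and a dictionary; the URL building, headers, and JSON decoding stay inside the library.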

### Getting JSON Data

We need to select the output format using the API:
* e.g., HTTP header: `Accept: application/json`


View in browser or Postman
* good for exploration / debugging

Use `requests.get(...).json()`
* this returns a Python list or dictionary

Get a string and parse
* `import json`
* `x = json.loads(aJSONString)`

Example using the Translink API

In [1]:
import requests

# Get your own API token from developer.translink.ca
apikey = 'YOUR_API_KEY'  # placeholder: never publish a real key

base_url = 'http://api.translink.ca/rttiapi/v1'
headers = {'accept': 'application/json'}  # ask for JSON responses

# stop information, arrival estimates for that stop, and buses on route 099
x = requests.get('{}/stops/61935?apikey={}'.format(base_url, apikey), headers=headers).json()
y = requests.get('{}/stops/61935/estimates?apikey={}'.format(base_url, apikey), headers=headers).json()
z = requests.get('{}/buses?apikey={}&routeNo=099'.format(base_url, apikey), headers=headers).json()

In [2]:
y[0]

{'RouteNo': '099',
 'RouteName': 'COMMERCIAL-BROADWAY/UBC (B-LINE)',
 'Direction': 'EAST',
 'RouteMap': {'Href': 'https://nb.translink.ca/geodata/099.kmz'},
 'Schedules': [{'Pattern': 'E8FL2',
   'Destination': 'TO BOUNDARY B-LINE',
   'ExpectedLeaveTime': '5:16pm 2020-10-26',
   'ExpectedCountdown': -3,
   'ScheduleStatus': ' ',
   'CancelledTrip': False,
   'CancelledStop': False,
   'AddedTrip': False,
   'AddedStop': False,
   'LastUpdate': '04:16:58 pm'},
  {'Pattern': 'E1',
   'Destination': "COMM'L-BDWAY STN",
   'ExpectedLeaveTime': '5:19pm 2020-10-26',
   'ExpectedCountdown': 0,
   'ScheduleStatus': ' ',
   'CancelledTrip': False,
   'CancelledStop': False,
   'AddedTrip': False,
   'AddedStop': False,
   'LastUpdate': '04:19:21 pm'},
  {'Pattern': 'E1',
   'Destination': "COMM'L-BDWAY STN",
   'ExpectedLeaveTime': '5:22pm 2020-10-26',
   'ExpectedCountdown': 3,
   'ScheduleStatus': '*',
   'CancelledTrip': False,
   'CancelledStop': False,
   'AddedTrip': False,
   'AddedStop

### The Anatomy Of A Request

It’s important to know that a request is made up of four things:

1. The endpoint

2. The method

3. The headers

4. The data (or body)

**1. The endpoint**

The endpoint (or route) is the URL you request. It consists of a root endpoint followed by a path and, optionally, query parameters. For example, the root endpoint of GitHub's API is:

https://api.github.com

**2. The method**

The method is the type of request you send to the server. You can choose from the types below:

a. GET - get a resource from the server

b. POST - create a new resource on the server

c. PUT/PATCH - update a resource on the server

d. DELETE - delete a resource on the server
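
To see the four parts side by side, the sketch below builds (but does not send) a request object with the standard library's `urllib.request.Request`. The path, the `OWNER/REPO` placeholders, and the payload are illustrative assumptions; nothing is transmitted over the network.

```python
import json
from urllib.request import Request

# 4. The data (or body), encoded as JSON bytes; the content is made up
payload = json.dumps({"title": "Example issue"}).encode("utf-8")

req = Request(
    # 1. The endpoint: root endpoint + path (OWNER/REPO are placeholders)
    url="https://api.github.com/repos/OWNER/REPO/issues",
    # 2. The method
    method="POST",
    # 3. The headers
    headers={"Accept": "application/json", "Content-Type": "application/json"},
    data=payload,
)

print(req.full_url)
print(req.get_method())
print(req.get_header("Accept"))
print(req.data)
```

With the `requests` library the same four parts appear as the URL argument, the function name (`requests.post`), and the `headers=` and `json=` keyword arguments.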

STOCK API demo

## FLASK

Flask is a micro web framework written in Python. It can be used to create a REST API that lets you send data and receive a prediction as a response.

Now that you are going to be a Data Scientist, you cannot always rely on having your models in a Jupyter Notebook.

Sure, Jupyter Notebooks are great for data exploration and analysis. But once you fully understand a project, one of your deliverables may be an application that has a predictive model.

You can try to get the best model possible in a notebook or a script. Once everybody has agreed that it is the best model, you must hand it over in a way that the client can run easily in their infrastructure.

For this purpose you need a tool that fits into their infrastructure, preferably in a language that you’re familiar with. This is where you can use Flask.
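
A minimal sketch of what such a prediction API could look like in Flask is shown below. The `/predict` route, the `features` field, and the `predict_price` function are assumptions made for illustration; `predict_price` is only a placeholder where a trained model's `predict` call would go.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def predict_price(features):
    """Placeholder for a trained model: returns the mean of the inputs."""
    return sum(features) / len(features)


@app.route("/predict", methods=["POST"])
def predict():
    # Read the JSON body; silent=True yields None instead of an error page
    payload = request.get_json(silent=True) or {}
    features = payload.get("features")

    # Reject anything that is not a non-empty list of numbers
    is_valid = (
        isinstance(features, list)
        and len(features) > 0
        and all(isinstance(v, (int, float)) for v in features)
    )
    if not is_valid:
        return jsonify({"error": "expected JSON with a 'features' list of numbers"}), 400

    return jsonify({"prediction": predict_price(features)})
```

The app can be started with `app.run()` or the `flask run` command, after which a client sends a POST request with a JSON body to `/predict` and gets the prediction back as JSON.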

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics
 
# importing dataset from sklearn
# note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version
from sklearn.datasets import load_boston
boston_data = load_boston()

# initializing dataset
data_ = pd.DataFrame(boston_data.data)

# top five rows of dataset
data_.head()

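| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 |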


In [4]:
data_.columns = boston_data.feature_names
data_.head()

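| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 |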


In [5]:
# Target feature of Boston Housing data
data_['PRICE'] = boston_data.target

In [6]:
# creating feature and target variable 
X = data_.drop(['PRICE'], axis=1)
y = data_['PRICE']
 
# splitting into training and testing set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2, random_state=1)
print("X training shape : ", X_train.shape)
print("X test shape : ", X_test.shape )
print("y training shape :", y_train.shape )
print("y test shape :", y_test.shape )
 
# creating model (note: this is a regressor, although the variable is named `classifier`)
from sklearn.ensemble import RandomForestRegressor
classifier = RandomForestRegressor()
classifier.fit(X_train, y_train)

X training shape :  (404, 13)
X test shape :  (102, 13)
y training shape : (404,)
y test shape : (102,)


RandomForestRegressor()

In [7]:
# Model evaluation for training data
prediction = classifier.predict(X_train)
print("r^2 : ", metrics.r2_score(y_train, prediction))
print("Mean Absolute Error: ", metrics.mean_absolute_error(y_train, prediction))
print("Mean Squared Error: ", metrics.mean_squared_error(y_train, prediction))
print("Root Mean Squared Error : ", np.sqrt(metrics.mean_squared_error(y_train, prediction)))


# Model evaluation for testing data
prediction_test = classifier.predict(X_test)
print("r^2 : ", metrics.r2_score(y_test, prediction_test))
print("Mean Absolute Error : ", metrics.mean_absolute_error(y_test, prediction_test))
print("Mean Squared Error : ", metrics.mean_squared_error(y_test, prediction_test))
print("Root Mean Squared Error : ", np.sqrt(metrics.mean_squared_error(y_test, prediction_test)))


r^2 :  0.9808574134972359
Mean Absolute Error:  0.8176287128712857
Mean Squared Error:  1.5463646485148501
Root Mean Squared Error :  1.2435291104412676
r^2 :  0.9052584692614715
Mean Absolute Error :  2.3210980392156855
Mean Squared Error :  9.363058156862746
Root Mean Squared Error :  3.059911462258793
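
To make the four metrics above less opaque, here is a small sketch that computes them by hand on three made-up values, using only the standard library. The numbers are invented for illustration and have nothing to do with the Boston data; the formulas are the usual definitions that `sklearn.metrics` implements.

```python
import math

# Made-up true values and predictions, only for illustration
y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]
n = len(y_true)

errors = [t - p for t, p in zip(y_true, y_pred)]

mae = sum(abs(e) for e in errors) / n   # Mean Absolute Error
mse = sum(e ** 2 for e in errors) / n   # Mean Squared Error
rmse = math.sqrt(mse)                   # Root Mean Squared Error

# r^2 = 1 - (residual sum of squares / total sum of squares)
mean_true = sum(y_true) / n
ss_res = sum(e ** 2 for e in errors)
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

print("r^2 :", r2)
print("Mean Absolute Error :", mae)
print("Mean Squared Error :", mse)
print("Root Mean Squared Error :", rmse)
```

The gap between the training and testing scores in the output above is the thing to watch: a model that scores much better on data it has already seen is fitting that data more closely than it generalizes.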


In [8]:
prediction_test

array([30.324, 26.611, 19.755, 20.54 , 19.641, 19.722, 28.093, 18.885,
       20.808, 23.675, 30.086, 30.998, 20.578, 19.861, 20.385, 27.151,
       11.918, 41.285, 24.23 , 14.704, 20.05 , 16.177, 24.32 , 23.685,
       25.42 ,  9.301, 15.078, 20.23 , 43.472, 12.891, 26.582, 19.55 ,
       47.898, 16.036, 23.519, 21.001, 15.368, 33.602, 12.93 , 19.562,
       24.572, 23.147, 25.981, 15.942, 16.038, 11.928, 47.708, 10.862,
       21.912, 18.89 , 24.059, 21.955, 25.046, 20.663, 10.915, 23.677,
       11.526, 23.202, 18.671, 42.309, 14.016, 26.831, 12.743, 14.697,
       18.279, 32.385, 41.746, 25.079, 21.492, 20.342, 23.908,  6.609,
       18.596, 20.574, 19.852, 20.459, 39.488, 24.278, 27.146, 32.751,
       17.114, 20.458, 34.248, 11.834, 24.763, 25.8  , 14.847, 24.64 ,
       19.835, 17.512, 28.137, 45.882, 16.513, 21.106, 15.131, 20.755,
       24.128, 23.279, 42.103, 20.856, 16.636, 15.337])

In [10]:
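import os
import pickle

# saving the columns (the models/ directory must exist before writing into it)
os.makedirs('models', exist_ok=True)
model_columns = list(X.columns)
with open('models/model_columns.pkl', 'wb') as file:
    pickle.dump(model_columns, file)

# saving the model
with open('final_prediction.pickle', 'wb') as file:
    pickle.dump(classifier, file)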
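
The saved files are only useful if they can be loaded again, for example inside a Flask app. The sketch below shows the save-and-load round trip with `pickle`. To stay self-contained it writes to a temporary directory and uses a shortened column list and a plain dictionary as a stand-in for the trained model; both are assumptions for illustration.

```python
import os
import pickle
import tempfile

model_columns = ["CRIM", "ZN", "INDUS"]                 # shortened column list
fake_model = {"kind": "stand-in for a trained model"}   # placeholder object

with tempfile.TemporaryDirectory() as tmp_dir:
    columns_path = os.path.join(tmp_dir, "model_columns.pkl")
    model_path = os.path.join(tmp_dir, "final_prediction.pickle")

    # Save: binary write mode ('wb')
    with open(columns_path, "wb") as f:
        pickle.dump(model_columns, f)
    with open(model_path, "wb") as f:
        pickle.dump(fake_model, f)

    # Load: binary read mode ('rb')
    with open(columns_path, "rb") as f:
        loaded_columns = pickle.load(f)
    with open(model_path, "rb") as f:
        loaded_model = pickle.load(f)

print(loaded_columns)
print(loaded_model)
```

Only unpickle files that come from a source you trust, since loading a pickle can execute arbitrary code.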