# Appendix B - Building and Deploying the App

## Part 1 - The Webservice



The code for the webservice can be found in the `./webservice` subdirectory. This is a copy of the code running in production. The origin git repo for production is *not* part of this repository; it is hosted on [Heroku](https://www.heroku.com/). Ask if you want access to the production git repository.

### How the Webservice Works

The webservice is a minimal [Flask](http://flask.pocoo.org/) application that provides prediction probabilities for a given candidate. There is one useful entry point, `/predict`, which expects the applicant's attributes (`admissionstest`, `AP`, and so on) on the query string. It returns a JSON document containing the probability of getting into each college.

Sample input:

```
http://mypythonapp-wihl.rhcloud.com/predict?admissionstest=0.926899206&AP=7&averageAP=1.06733864&SATsubject=0.324271565&GPA=-0.187109979&schooltype=0&intendedgradyear=2017&female=1&MinorityRace=0&international=0&sports=0&earlyAppl=0&alumni=0&outofstate=0&acceptrate=0.151&size=6621&public=0&finAidPct=0&instatePct=0
```

Sample output:
```
{
  "preds": [
    {
      "college": "Princeton",
      "prob": 0.26166666666666666
    },
    {
      "college": "Harvard",
      "prob": 0.23999999999999999
    },
    {
      "college": "Yale",
      "prob": 0.23999999999999999
    },
    ...
 ]
}
```
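As a rough illustration of the request and response shapes above, the following Python sketch builds a `/predict` URL and picks the most likely colleges out of a response body. It is written for Python 3 (the service itself targets Python 2.7), and the helper names `build_predict_url` and `top_colleges`, the local base URL, and the shortened sample data are assumptions made for this example, not part of the project code.

```python
import json
from urllib.parse import urlencode

# Hypothetical base URL: a Flask development server running locally.
BASE_URL = "http://127.0.0.1:5000/predict"

def build_predict_url(base_url, params):
    """Return base_url with params encoded as a query string."""
    return base_url + "?" + urlencode(params)

def top_colleges(response_text, n=3):
    """Parse a /predict response body and return the n highest-probability entries."""
    preds = json.loads(response_text)["preds"]
    return sorted(preds, key=lambda p: p["prob"], reverse=True)[:n]

# Only a subset of the sample input is shown; a real request needs every predictor.
params = {"admissionstest": 0.926899206, "AP": 7, "GPA": -0.187109979, "female": 1}
url = build_predict_url(BASE_URL, params)

# A shortened body in the same shape as the sample output.
sample = '{"preds": [{"college": "Princeton", "prob": 0.26}, {"college": "Harvard", "prob": 0.24}]}'
best = top_colleges(sample, n=1)
```

In a real client the response text would come from an HTTP GET on `url`; the parsing step stays the same.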

### Webservice startup

Upon receiving the first `/predict` request, the web service performs the same logic as the classification IPython notebook. It loads the normalized college data, imputes missing values, and trains a scikit-learn Random Forest classifier. The resulting classifier is kept in memory as a Python global variable to service subsequent prediction requests. There is currently no locking around this initialization.
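Since the startup logic has no locking, two requests arriving at the same time could both train the model. One possible way to guard the one-time initialization is a lock with a second check inside it. This is only a sketch: `get_classifier`, `_clf`, and `_clf_lock` are hypothetical names, and `train` stands in for the real loading function.

```python
import threading

_clf = None                    # cached classifier, shared across requests
_clf_lock = threading.Lock()   # guards the one-time training step

def get_classifier(train):
    """Return the cached classifier, calling train() at most once."""
    global _clf
    if _clf is None:               # fast path once the model exists
        with _clf_lock:
            if _clf is None:       # re-check: another thread may have won the race
                _clf = train()
    return _clf
```

This only protects threads within one process. If the server runs several worker processes, each process would still train and hold its own copy of the model.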

### Webservice Dependencies - OpenShift

We started out on OpenShift. The free account worked fine at first, then developed persistent and severe performance problems a few days before the project was due. With two days remaining, we scrambled and moved from OpenShift to Heroku. This section is kept for reference (and consider it a warning not to use OpenShift again).

Since the webservice runs the full Pandas and Scikit-Learn stacks, these had to be installed on the OpenShift cartridge. Here's what was done:

1. Create an OpenShift account
1. Install the [client tools](https://developers.openshift.com/en/managing-client-tools.html). This will install `rhc`, the necessary local command line tool for managing OpenShift apps.
1. Use the Flask Quickstart template ([details](https://developers.openshift.com/en/python-flask.html))

    ```
    rhc app create myflaskapp python-2.7 --from-code=https://github.com/openshift-quickstart/flask-base.git
    ```

1. This will create a local myflaskapp git repository. Go into this repository: `cd myflaskapp`
1. SSH into the app and install the dependent packages:

    ```
    rhc ssh myflaskapp
    source ~/python/virtenv/activate
    pip install numpy
    ```

    The `pip install` has to be repeated for `scipy`, `pandas` and `scikit-learn`. This takes a while because each package is compiled locally on the OpenShift instance, and the resulting build may not have optimal performance.

1. After all the packages have been installed, take the output of `pip freeze` and update the `requirements.txt` in the *local* repository.
1. At this point, you can grab the appropriate files from `./webservice` directory, notably: `TIdatabase.py, collegelist.csv, collegedata_normalized.csv, flaskapp.py`.



#### DevOps Notes

To see the logs, use `rhc tail -o '-n 100' mypythonapp`

Common rhc commands can be found [here](https://developers.openshift.com/en/managing-common-rhc-commands.html)

### Webservice Dependencies - Heroku

Heroku was easier to configure since there are buildpacks available that contain the entire Conda stack with all the Scipy, Numpy, and Scikit-learn dependencies. There were still numerous gotchas, mainly related to finding a combination of scipy, numpy and scikit-learn versions that would all play nicely together.

Heroku has a [nice walkthrough](https://devcenter.heroku.com/articles/getting-started-with-python#introduction) about setting up a Python app in minutes. I mostly followed that, with the following changes:

Add the Conda buildpack:
```
heroku config:add BUILDPACK_URL=https://github.com/kennethreitz/conda-buildpack.git
```

The scipy shipped with this buildpack is broken, so scipy was obtained from a different buildpack:
```
heroku buildpacks:set https://github.com/thenovices/heroku-buildpack-scipy
```

This version of scipy does not work with the latest scikit-learn, so I had to downgrade scikit-learn. Here is our final requirements.txt file:
```
gunicorn==19.3.0
psycopg2==2.6
SQLAlchemy==1.0.4
whitenoise==1.0.6
Flask==0.10.1
pandas==0.17.1
numpy==1.9.1
scipy==0.15.1
scikit-learn==0.16.1
nose==1.3.7
```

and our Procfile
```
web: gunicorn flaskapp:app --log-file=-
```
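Because most of the trouble came from mismatched scipy, numpy and scikit-learn versions, it can help to compare the pins in `requirements.txt` against what is actually installed. The sketch below is one way to do that in plain Python. The helper names `parse_pins` and `find_mismatches` are hypothetical, and the installed versions are made-up sample values; in practice they could be obtained by running the same parser over `pip freeze` output.

```python
def parse_pins(requirements_text):
    """Map package name -> pinned version for every 'name==version' line."""
    pins = {}
    for line in requirements_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins

def find_mismatches(pins, installed):
    """Return {name: (pinned, installed_or_None)} where the two disagree."""
    return {name: (want, installed.get(name))
            for name, want in pins.items()
            if installed.get(name) != want}

requirements = """numpy==1.9.1
scipy==0.15.1
scikit-learn==0.16.1
"""
pins = parse_pins(requirements)
# Hypothetical installed versions, used here only to show a mismatch.
installed = {"numpy": "1.9.1", "scipy": "0.15.1", "scikit-learn": "0.17"}
mismatches = find_mismatches(pins, installed)
```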

After the pain and suffering with OpenShift's free account, we went with the Heroku \$7/mo hobbyist dyno with the hope that the app would not go down again.

### Programming Notes

Using the Quickstart template, the real work is done in `flaskapp.py`.

Logging is off by default. To log errors from your app, use:

```
import logging

logging.basicConfig(level=logging.DEBUG,format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
```
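With logging configured this way, errors can be recorded at the point where they occur. As a small sketch, the helper below converts a query-string value to a float (as the webservice does for every argument) and logs the traceback before re-raising when the value is malformed. The function name `safe_float` and the logger name are assumptions for this example.

```python
import logging

logging.basicConfig(level=logging.DEBUG,
                    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
logger = logging.getLogger("flaskapp")  # hypothetical logger name

def safe_float(name, raw):
    """Convert a query-string value to float, logging and re-raising on bad input."""
    try:
        return float(raw)
    except ValueError:
        # logger.exception records the message together with the traceback.
        logger.exception("bad value for %s: %r", name, raw)
        raise
```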

We simply used global variables to store information. It is also possible to use the [appcontext](http://flask.pocoo.org/docs/0.10/appcontext/).

Note that the app can be tested locally very easily. From a local shell use: `python flaskapp.py`. It will say which address / port it is listening on when starting up.

### Consuming the Webservice from R

Sample code to consume the webservice can be found in `rclient.R`. This simulates how the production Shiny app invokes the webservice. An R data.frame is created with the normalized user-supplied values and is used to populate the query string of the webservice. Note that the webservice ignores the last five variables, which are specific to a given college, since probabilities for *all* colleges are returned.

The returned JSON is easily parsed into an R data.frame for presentation to the user or further manipulation. Here is a snippet:

```
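# create query string from the first row of the data.frame
qs = paste0(colnames(pred),"=",pred[1,],collapse="&")
# local development server
server = "http://127.0.0.1:5000/predict"
# hosted instance; this second assignment overrides the first
server = "http://mypythonapp-wihl.rhcloud.com/predict"

URL = paste0(server,"?",qs)

# fromJSON is assumed to come from a JSON package such as jsonlite
js  = fromJSON(URL)
df = js$preds
df$college = as.factor(df$college)
summary(df)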
```




### The Webservice Code

(This is not in a code cell because it is not meant to be executed)



```
from flask import Flask
from flask import jsonify, request

import os
import pandas as pd
from sklearn.preprocessing import Imputer
from sklearn.ensemble import RandomForestClassifier
import numpy as np
import logging

import TIdatabase as ti

app = Flask(__name__)

clf = None
logging.basicConfig(level=logging.DEBUG,format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
 

ws_cols = ["admissionstest","AP","averageAP","SATsubject","GPA","schooltype",
                  "female","MinorityRace","international","sports",
                  "earlyAppl","alumni","outofstate"]
college_cols = ["acceptrate","size","public"]
predictor_cols = ws_cols + college_cols

cols_to_drop = ['classrank', 'canAfford', 'firstinfamily', 'artist', 'workexp', 'visited', 'acceptProb',
                'addInfo','intendedgradyear']
NUM_ESTIMATORS = 1000

colleges = ti.College()

def load_classifier():
    global clf
    df = pd.read_csv(os.path.join(os.path.dirname(__file__),"collegedata_normalized.csv"), index_col=0)
    dfr = df.drop(cols_to_drop,axis=1)
    dfr = dfr[pd.notnull(df["acceptStatus"])]
    dfpredict = dfr[predictor_cols]
    dfresponse = dfr["acceptStatus"]
    imp = Imputer(missing_values="NaN", strategy="median", axis=1)
    imp.fit(dfpredict)
    X = imp.transform(dfpredict)
    y = dfresponse
    clf = RandomForestClassifier(n_estimators=NUM_ESTIMATORS, criterion="gini")
    clf.fit(X,y)
    return clf

def genPredictionList(vals):
    """
    vals (coming from the request arguments) is a list of tuples [('name1','val1'),('name2','val2')...]
    """
    global ws_cols
    global clf
    global colleges
    X = pd.Series(dict((name, float(val)) for name, val in vals))
    if clf is None: load_classifier()
    preds = []
    for i, row in colleges.df.iterrows():
        X[college_cols] = row[college_cols]
        y = clf.predict_proba(X[predictor_cols])[0][1]
        p = {'college':row.collegeID, 'prob':y}
        preds.append(p)
    return preds
    #e.g.  [{'college':'harvard', 'prob':y}, {'college':'yale', 'prob':0.25}, {'college':'brown', 'prob':0.89}]

@app.route('/')
def hello_world():
    return "Welcome to the Team Ivy Web Service"

@app.route("/predict")
def predict():
    preds = genPredictionList(request.args.iteritems())
    return jsonify(preds = preds)


if __name__ == '__main__':
    app.run(debug=True)

```
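The per-college loop in `genPredictionList` can be illustrated without pandas or scikit-learn. The sketch below shows only how one feature row per college is assembled: the user's values stay fixed, while the college-specific columns are overwritten for each college, which is why any college values sent on the query string have no effect. The function `feature_rows`, the shortened column list, and the college records are hypothetical; the real records come from `TIdatabase`.

```python
ws_cols = ["admissionstest", "GPA", "female"]       # shortened for the example
college_cols = ["acceptrate", "size", "public"]
predictor_cols = ws_cols + college_cols

def feature_rows(user_vals, colleges):
    """Yield (collegeID, feature list) with college columns taken from each college."""
    for college in colleges:
        merged = dict(user_vals)
        # College-specific values always win over anything on the query string.
        merged.update((c, college[c]) for c in college_cols)
        yield college["collegeID"], [merged[c] for c in predictor_cols]

user_vals = {"admissionstest": 0.93, "GPA": -0.19, "female": 1.0,
             "acceptrate": 0.151, "size": 6621.0, "public": 0.0}
# Hypothetical college records used only for this illustration.
colleges = [{"collegeID": "A", "acceptrate": 0.07, "size": 5000.0, "public": 0.0},
            {"collegeID": "B", "acceptrate": 0.40, "size": 30000.0, "public": 1.0}]
rows = list(feature_rows(user_vals, colleges))
```

Each feature list would then be passed to the classifier's `predict_proba` to obtain the probability for that college.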

## Part 2 - The Shiny App

[Shiny](http://shiny.rstudio.com/) is a web application framework for R. It allows rapid development of reactive web applications. In this project, Shiny is used to implement all user interaction including plots and charts.

The Shiny app is hosted at http://www.shinyapps.io/

## Part 3 - SquareSpace

SquareSpace hosts the static portion of the public-facing web site. It also provides summary usage statistics.

### References

Getting started with Python on Heroku: https://devcenter.heroku.com/articles/getting-started-with-python-o#prerequisites

Buildpack for Conda on Heroku: https://github.com/kennethreitz/conda-buildpack

Getting started with OpenShift and Python 2.7 (without Flask): https://developers.openshift.com/en/python-getting-started.html

Getting started with OpenShift and Flask: https://developers.openshift.com/en/python-flask.html

Blog post about OpenShift and Flask: https://blog.openshift.com/day-3-flask-instant-python-web-development-with-python-and-openshift/

Beginner's guide to writing Flask apps on OpenShift (somewhat dated): https://blog.openshift.com/beginners-guide-to-writing-flask-apps-on-openshift/