# PredictionIO
This tutorial introduces PredictionIO, a machine learning framework for rapidly deploying multiple engines. Suppose you are a data scientist who wants to deploy several algorithms "into the wild" to compare their performance: you would have to set up an endpoint for incoming data for each one, and then rewrite a lot of boilerplate code for processing data, setting up engines, and evaluating performance. PredictionIO puts all of this functionality in one place, which removes many of the non-trivial obstacles to deployment. It is generally used with Scala, which combines Java's object-oriented paradigm with characteristics of a functional language; in this tutorial, however, we cover usage with the Python SDK.

## Setup
Installing the library consists of two parts: installing the PredictionIO application, and installing the Python SDK.

### Manual Setup
At the time of writing, automatic setup using a package manager (such as Homebrew on macOS) is buggy and did not work for me. Manual setup is therefore preferred, and is detailed [here](http://predictionio.incubator.apache.org/install/install-sourcecode/). I personally prefer HBase and Elasticsearch over PostgreSQL, as I have found the PostgreSQL setup to be slightly glitchy. To make the change, open the `conf/pio-env.sh` file in the directory where you installed PredictionIO and set the properties named `PIO_STORAGE_REPOSITORIES_<REPO_TYPE>_SOURCE`, where `REPO_TYPE` is one of `METADATA`, `EVENTDATA`, or `MODELDATA`, to `ELASTICSEARCH`, `HBASE`, and `LOCALFS` respectively. For your convenience, a sample `pio-env.sh` file is included with this tutorial; it should require very few (if any) changes in order to work.

After installing PredictionIO, test that your setup works by navigating to the `bin/` directory of your PredictionIO installation (or adding it to your `PATH`) and typing

`$ pio-start-all`

After your prompt appears again, you can check the status of PredictionIO by typing

`$ pio status`

If everything is OK, the output should indicate that your system is ready to go; otherwise, troubleshoot using the [FAQ page](http://predictionio.incubator.apache.org/resources/faq/).

### Python SDK
Installing the Python SDK is a much shorter process, and only requires Python's built-in package manager, `pip`. To install the module, you can use

`$ pip install predictionio`

This should install the package for Python. If you want to install it manually instead, you can get the Python SDK from its GitHub repository [here](https://github.com/apache/incubator-predictionio-sdk-python). After cloning the repo, navigate to the project root and run

`$ python setup.py install`

After installation, you can check that you've successfully added the module to your Python distribution by running the following code (which should throw an exception otherwise):

In [28]:
# Related to the code, but not to PredictionIO
import json
import numpy as np
import pprint
import pytz
import random
from datetime import datetime

# PredictionIO-specific; Client classes imported for brevity
import predictionio as pio
from predictionio import EventClient, EngineClient

### Recommendation Engine Template
We will demo PredictionIO using a ready-made recommendation engine (such ready-made engines are referred to in PIO as "templates"). Navigate to the directory where you want to place the engine, and type

`$ pio template get apache/incubator-predictionio-template test`

This creates a new directory called `test`, containing the engine template. Navigate to this new directory, and run 

`$ pio app new test`

This should print the details of the newly created app, including the access key that the Python client needs below.

## Procedure 
### Importing data using Python
PredictionIO represents data using two core concepts: "entities" and "events". Entities are abstractions for real-world objects (e.g., users), and events are the actions that they perform (e.g., liking a post, giving a post a rating, or signing in). To access this functionality from Python, we first create an `EventClient` for our app.

In [6]:
# replace with your own access key that you got from running 'pio app new test'
event_client = pio.EventClient(access_key='aM0e6FtMBN6FA0xgI_9_2LXUIEjV5aBqMAQ9A_Y889MeIHxZE1qMUR4rVLNCy3Qf', threads=5) 

This allows us to start adding events, which take on two forms: generic events, which are performed by an entity (potentially on a target entity), and special events which record changes to an entity's properties. These special events also allow us to create entities - this is why there is no `EntityClient`.
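
To make the two forms concrete, here is a minimal sketch of what each kind of event might look like as a plain dictionary. The helper names `make_set_event` and `make_generic_event` are made up for illustration and are not part of the SDK. The `event`, `entityType`, `entityId`, and `properties` keys mirror the event-server responses printed later in this tutorial; the `targetEntityType` and `targetEntityId` keys are an assumption based on the SDK's `target_entity_type` and `target_entity_id` arguments.

```python
import json


def make_set_event(entity_type, entity_id, properties):
    """Build a special `$set` event body (hypothetical helper, not SDK code)."""
    return {
        'event': '$set',
        'entityType': entity_type,
        'entityId': str(entity_id),
        'properties': dict(properties),
    }


def make_generic_event(name, entity_type, entity_id,
                       target_entity_type=None, target_entity_id=None,
                       properties=None):
    """Build a generic event body, optionally aimed at a target entity."""
    event = {
        'event': name,
        'entityType': entity_type,
        'entityId': str(entity_id),
    }
    # Target keys are assumed names; they are only added when both are given.
    if target_entity_type is not None and target_entity_id is not None:
        event['targetEntityType'] = target_entity_type
        event['targetEntityId'] = str(target_entity_id)
    if properties:
        event['properties'] = dict(properties)
    return event


set_event = make_set_event('user', 'u1', {'popularity': 89})
rate_event = make_generic_event('rate', 'user', 'u1', 'item', 'i0', {'rating': 5.0})
print(json.dumps(set_event, sort_keys=True))
print(json.dumps(rate_event, sort_keys=True))
```

The special event describes the entity itself, while the generic event links an acting entity to a target entity.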

We now add entities (in this case, users) to our event server, so that they can act as the subjects of the events that we create later. We would also like a set of properties to be associated with each user; to keep it simple, each user gets a single property. In Python, the properties are passed as a dictionary whose keys are the property names and whose values are the property values.

To communicate with the server, our client can make two types of requests: synchronous and asynchronous. Asynchronous calls return immediately with an object of type `predictionio.AsyncRequest`, whose response has to be fetched separately, which makes them much faster when many requests are sent. Synchronous calls simply block until they complete. For comparison, we add 30 users to our event server in both ways, each with a property named `popularity` set to a random integer from 0 to 100.

In [9]:
# capture the results of calling set_user so that we can find the IDs later.
async_event_results = []
event_results = []

# asynchronous requests
for i in range(30):
    user_id = 'u' + str(i)
    user_properties = {}
    user_properties['popularity'] = random.randint(0, 100)
    async_event_result = event_client.aset_user(user_id, properties=user_properties)
    async_event_results.append(async_event_result)
# event_client.close() # this line will cause asynchronous requests to block until they are completed.

# check an asynchronous result
event_result = random.choice(async_event_results)

try:
    async_response = event_result.get_response() # blocks until complete
    json_body = json.loads(async_response.__dict__['body']) 
    event_id = json_body['eventId']
    pprint.pprint(event_client.get_event(event_id))
except Exception as e:
    print('Encountered an error while trying to get the asynchronous response: %s' % e)

print('')  # blank line between the two outputs

# synchronous requests
for i in range(30):
    user_id = 'u' + str(i)
    user_properties = {}
    user_properties['popularity'] = random.randint(0, 100)
    try:
        event_result = event_client.set_user(user_id, properties=user_properties)
        event_results.append(event_result)
    except Exception as e:
        # can log the error here
        print('Encountered an error making a synchronous request to the event server: %s' % e)

# check a random user to ensure correctness
event_result = random.choice(event_results)
json_body = json.loads(event_result.__dict__['body'])
event_id = json_body['eventId']
pprint.pprint(event_client.get_event(event_id))

{u'creationTime': u'2016-11-04T17:15:18.562Z',
 u'entityId': u'u1',
 u'entityType': u'user',
 u'event': u'$set',
 u'eventId': u'Z0813DMQIKz7N4VGxZhmngAAAVgwVmaRi7wMA2LBwQI',
 u'eventTime': u'2016-11-04T17:15:18.545Z',
 u'properties': {u'popularity': 89}}

{u'creationTime': u'2016-11-04T17:15:18.886Z',
 u'entityId': u'u28',
 u'entityType': u'user',
 u'event': u'$set',
 u'eventId': u'1UisRWSy-wCp-lStTgdKbAAAAVgwVmfklr6M8VFQ5AI',
 u'eventTime': u'2016-11-04T17:15:18.884Z',
 u'properties': {u'popularity': 94}}
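
The difference between the two request styles can be mimicked without a running event server. The sketch below assumes Python 3 and uses the standard library's thread pool purely as an analogy; `fake_set_user` is a made-up stand-in for a network call, and none of this is meant to describe how the PredictionIO SDK is implemented internally.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_set_user(user_id, delay=0.01):
    """Stand-in for a network call: sleeps, then returns a fake event ID."""
    time.sleep(delay)
    return 'event-for-' + user_id


user_ids = ['u' + str(i) for i in range(5)]

# Synchronous style: each call blocks until its result is available.
sync_results = [fake_set_user(user_id) for user_id in user_ids]

# Asynchronous style: submit() returns a handle immediately, and
# result() blocks, much like get_response() in the cell above.
with ThreadPoolExecutor(max_workers=5) as pool:
    handles = [pool.submit(fake_set_user, user_id) for user_id in user_ids]
    async_results = [handle.result() for handle in handles]

print(sync_results == async_results)
```

Both styles produce the same results; the asynchronous one only changes when the caller waits for them.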


If we want to re-set any user's properties at a later time, we can re-use our `set_user` method. Here we update the last user from the loop above (`user_id` still holds that ID) to a popularity of 100/100:

In [None]:
user_properties = {}
user_properties['popularity'] = 100

try:
    event_result = event_client.set_user(user_id, properties=user_properties)
    json_body = json.loads(event_result.__dict__['body'])
    # the response carries the ID of the new $set event, not the user ID
    event_id = json_body['eventId']
    user = event_client.get_event(event_id)
    pprint.pprint(user)
except Exception as e:
    print('Error when trying to set user on event server: %s' % e)

Now we can proceed with adding events to PredictionIO. Each event has a name and, as with entities, events can have properties. An event must also identify the entity that performed it, using the entity ID (from our previous code, `u0` through `u29`) and the entity type (which we implicitly set to `user` with our `set_user` calls). Events can also take a timestamp, which we could extract from our data (here we just set it to the current time).

The engine template that we are currently using accepts two types of events: **rate** and **buy**, which represent a user rating an item and a user buying an item, respectively.

As with user creation, events can be created asynchronously or synchronously, using the `acreate_event` and `create_event` methods of our `EventClient`; for brevity's sake we won't repeat the comparison, and use the synchronous method. Here, we create a single event representing a user giving an item a rating of 5. We also record the ID of the event we create, so that we can retrieve it from the event server as a sanity check.

In [None]:
timestamp = datetime.utcnow().replace(tzinfo=pytz.UTC) # need to add timezone info
event_result = event_client.create_event(event='rate', entity_type='user', entity_id=user['entityId'],
                                         target_entity_type='item', target_entity_id='i0', 
                                         properties={ 'rating': float(5) }, event_time = timestamp).__dict__

# sanity check 
json_body = json.loads(event_result['body'])
event_id = json_body['eventId']
pprint.pprint(event_client.get_event(event_id))

Now that the process is clear, we want to speed up importing data into our event server. As sample data for our engine template, we use the provided `sample_data.txt`, in which each line has the format `<user_id>::<item_id>::<rating>`. For demo purposes, we randomly turn about half of the lines into `buy` events instead of `rate` events.

In [None]:
event_results = []

def import_events(file_name):
    with open(file_name) as f:
        for line in f:
            data = line.split('::')
            if (random.choice([True, False])): # randomly introduce buy events
                event_results.append(event_client.create_event(event='buy', entity_type='user', entity_id=data[0],
                                                               target_entity_type='item', 
                                                               target_entity_id=data[1]).__dict__)
            else:
                event_results.append(event_client.create_event(event='rate', entity_type='user', entity_id=data[0],
                                                               target_entity_type='item', target_entity_id=data[1],
                                                               properties={ 'rating': float(data[2]) }).__dict__)

import_events('sample_data.txt')
print(len(event_results))
pprint.pprint(event_results[0])
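
The `import_events` function above splits each line directly. A slightly more defensive variant, sketched here under the assumption that the file may contain blank or malformed lines, strips the trailing newline and checks the field count first. The names `parse_rating_line` and `parse_rating_lines` are made up for this example.

```python
def parse_rating_line(line):
    """Split '<user_id>::<item_id>::<rating>' into (user_id, item_id, rating)."""
    fields = line.strip().split('::')
    if len(fields) != 3:
        raise ValueError('expected 3 fields, got %d: %r' % (len(fields), line))
    user_id, item_id, rating = fields
    return user_id, item_id, float(rating)


def parse_rating_lines(lines):
    """Yield parsed records, skipping blank lines."""
    for line in lines:
        if line.strip():
            yield parse_rating_line(line)


# Made-up sample lines in the same format as sample_data.txt.
sample = ['0::2::3\n', '\n', '3::9::4.5\n']
records = list(parse_rating_lines(sample))
print(records)
```

Each parsed tuple can then be passed to `create_event` in the same way as `data[0]`, `data[1]`, and `data[2]` above.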

### Interfacing with our engine template

Now that we have imported all of our data, we can interface with our engine template. First we set up our engine template as a local server, by modifying a line in the `engine.json` file in our engine template directory so that `appName` matches the name of our app.

```
...
"datasource": {
  "params": {
    "appName": "test"
  }
},
...
```

Now, we build, train, and deploy our engine with the shell commands `pio build`, `pio train`, and `pio deploy`, run in that order (make sure that you are still in the `test` directory).

Now we can create a client to our engine, using the default url for the engine (this parameter is more important when we deploy multiple engines simultaneously).

We now send a request to the server in order to retrieve recommendations for items for a single user, ranked by the score that the engine gives them:

In [None]:
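# the engine server from the previous step listens on port 8000
engine_client = EngineClient(url='http://localhost:8000')

query_properties = {'user': str(random.choice(range(30))), 'num': random.randint(1, 5)}
query_result = engine_client.send_query(query_properties)
pprint.pprint(query_result)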

Additionally, we can access a web interface for our server at http://localhost:8000, which gives you server information along with engine information.
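
Under the hood, a query like the one above is an HTTP POST with a JSON body. As far as I know, the engine server accepts queries at the `/queries.json` path; treat that path, and the helper name `build_query_request`, as assumptions of this sketch rather than something taken from the code above. The sketch only builds the request and does not send it.

```python
import json


def build_query_request(user, num, url='http://localhost:8000'):
    """Return (endpoint, JSON body) for a recommendation query."""
    # '/queries.json' is an assumed path; check it against your engine server.
    endpoint = url.rstrip('/') + '/queries.json'
    body = json.dumps({'user': str(user), 'num': int(num)}, sort_keys=True)
    return endpoint, body


endpoint, body = build_query_request(user=1, num=4)
print(endpoint)
print(body)
```

Seeing the raw request can help when debugging a query with a generic HTTP tool instead of the Python client.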

## Census Data

We now look at sample data provided to us in class. In the previous homework, we looked at publicly available census data when implementing a Naive Bayes classifier. We look at the same data, except with only three extracted features (for brevity's sake). There is another engine template, titled `classification`, available [here](https://www.dropbox.com/sh/gfalmgeky5ubtlo/AACA04uErXZxNsq8CcFgga_9a?dl=0). Copy this into a directory of your choice; the rest of the tutorial takes place within this directory. We first read in the data, as with the previous assignment.

In [17]:
import pandas as pd

df = pd.read_csv('census.csv')
df = df[(df.occupation!='?') & (df.native_country!='?') & (df.work_class!='?')]
df['label'] = df['income'].map({'<=50K': 0, '>50K': 1})
del df['income']
df = df.reset_index(drop=True)

print(df.head())

# replace this with your own access key!
event_client = EventClient(access_key='Tjbr4B3_O0hmZxLdHARhOgTA9gXBKWXiKxxcBshILruTUxF7qibOcsejgqo4v4Yl')

   age        work_class  final_weight  education  education_num  \
0   39         State-gov         77516  Bachelors             13   
1   50  Self-emp-not-inc         83311  Bachelors             13   
2   38           Private        215646    HS-grad              9   
3   53           Private        234721       11th              7   
4   28           Private        338409  Bachelors             13   

       marital_status         occupation   relationship   race     sex  \
0       Never-married       Adm-clerical  Not-in-family  White    Male   
1  Married-civ-spouse    Exec-managerial        Husband  White    Male   
2            Divorced  Handlers-cleaners  Not-in-family  White    Male   
3  Married-civ-spouse  Handlers-cleaners        Husband  Black    Male   
4  Married-civ-spouse     Prof-specialty           Wife  Black  Female   

   capital_gain  capital_loss  hours_per_week native_country  label  
0          2174             0              40  United-States      0  
1     

We will use age, final_weight, and education_num as the features for our naive Bayes classifier. With each person representing a single data point, we create entities of type "person" on our event server, thus adding entirely different data points without touching our previous "users". Note that this also allows us to re-use the data with a different engine (one that we wrote, for instance). Our method calls are also slightly different from the previous example: for custom entity types, we call `acreate_event` directly and set the event name to `$set`, a special value that PredictionIO uses for the (implicit) creation of entities.

In [18]:
event_results = []
count = 1

for _, row in df.iterrows():
    person_properties = { 'age': row.age, 'finalWeight': row.final_weight, 
                         'educationNum': row.education_num, 'label': row.label }
    event_result = event_client.acreate_event(event='$set', entity_type='person', 
                                              entity_id=count, properties=person_properties)
    event_results.append(event_result)
    count += 1

In [19]:
# sanity check, as usual
event_result = random.choice(event_results)
async_response = event_result.get_response()
json_body = json.loads(async_response.__dict__['body'])
print(event_client.get_event(json_body['eventId']))

{u'eventId': u'aOXP92IPt8aN_f82z-CNtgAAAVgwXMc1gCqFk97Uykw', u'eventTime': u'2016-11-04T17:22:16.501Z', u'entityType': u'person', u'creationTime': u'2016-11-04T17:22:45.473Z', u'properties': {u'age': 39, u'finalWeight': 124090, u'educationNum': 5, u'label': 0}, u'entityId': u'4725', u'event': u'$set'}
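
The mapping from a data row to the `properties` dictionary can also be factored into a small helper. The sketch below uses a plain dictionary in place of a DataFrame row so that it runs on its own; the helper names are made up, and the `int()` casts are an assumption that plain Python integers are the safest values to serialize to JSON. The sample values are copied from the first row printed above.

```python
def person_properties(row):
    """Map snake_case census columns to the camelCase keys used above."""
    return {
        'age': int(row['age']),
        'finalWeight': int(row['final_weight']),
        'educationNum': int(row['education_num']),
        'label': int(row['label']),
    }


def person_set_events(rows, first_id=1):
    """Yield (entity_id, properties) pairs, numbering entities from first_id."""
    for entity_id, row in enumerate(rows, first_id):
        yield entity_id, person_properties(row)


# First row of the census data shown above, as a plain dictionary.
rows = [{'age': 39, 'final_weight': 77516, 'education_num': 13, 'label': 0}]
events = list(person_set_events(rows))
print(events[0])
```

Each pair could then be passed to `acreate_event` as `entity_id` and `properties`, as in the loop above.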


That's it! We can now run `pio train` as before, and then run `pio deploy --port 8001`, so that the engine server listens on the port used by the client below.

Deploying the PredictionIO server on a different port allows multiple servers to run at the same time. We can now make requests to our engine using a different engine client.

In [30]:
engine_client = EngineClient(url='http://localhost:8001')

# sample query result
query_result = engine_client.send_query({ 'age': 24.0, 'finalWeight': 10.0, 'educationNum': 9.0 })
print(query_result)

{u'label': 1.0}


We now test the engine on the training set, measuring the fraction of points that it misclassifies.

In [36]:
predictions = []
# query the engine for every point in the training set
for _, row in df.iterrows():
    prediction = engine_client.send_query({'age': row.age, 
                                           'finalWeight': row.final_weight, 
                                           'educationNum': row.education_num})
    predictions.append(prediction['label'])
predictions = np.array(predictions)

print(predictions)
# fraction of mismatches, i.e. the error rate (1 - accuracy)
print(float(sum(predictions != df['label'])) / len(df['label']))

[ 1.  1.  0. ...,  1.  0.  0.]
0.412273721902
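
Note that the number printed above counts mismatches, so it is an error rate rather than an accuracy. A minimal sketch of the relationship, using made-up toy values rather than the census data, and including the error of always predicting the most common label as a point of comparison:

```python
def error_rate(predictions, labels):
    """Fraction of predictions that differ from the true labels."""
    if len(predictions) != len(labels):
        raise ValueError('predictions and labels must have the same length')
    mismatches = sum(1 for p, y in zip(predictions, labels) if p != y)
    return float(mismatches) / len(labels)


def majority_baseline_error(labels):
    """Error rate of always predicting the most common label."""
    most_common = max(set(labels), key=list(labels).count)
    return error_rate([most_common] * len(labels), labels)


# Toy values for illustration only (not taken from the census data).
toy_predictions = [1.0, 1.0, 0.0, 1.0, 0.0]
toy_labels = [0, 1, 0, 0, 0]
print(error_rate(toy_predictions, toy_labels))        # error rate
print(1.0 - error_rate(toy_predictions, toy_labels))  # accuracy
print(majority_baseline_error(toy_labels))            # baseline error
```

Comparing the engine's error rate with such a baseline gives a rough sense of how much the chosen features actually help.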


With roughly 41% of the training points misclassified, it seems that our engine needs a bit of tweaking, or needs to use different features. However, hopefully this has demonstrated how easily and quickly servers can be deployed using PredictionIO.