## Introduction

In this tutorial, you'll learn about the GraphLab library for Python. An extensive set of GraphLab tutorials is available at this [link](https://github.com/turi-code/tutorials/tree/master/notebooks). Here, I explain some important functions of the library through a few interesting applications.

[GraphLab Create](https://turi.com/products/create/) is an extensible machine learning framework that enables developers and data scientists to easily build and deploy intelligent applications and services at scale. It includes distributed data structures and rich libraries for data transformation and manipulation, scalable task-oriented machine learning toolkits for creating, evaluating, and improving machine learning models, data and model visualization for all aspects of development, and a client to define and deploy both distributed batch jobs to Turi Distributed™ as well as real-time machine learning services to Turi Predictive Services™. It is designed for end-to-end developer productivity, scale, and the variety and complexity of real-world data.

GraphLab was designed with the following considerations in mind:

- Sparse data with local dependencies
- Iterative algorithms
- Potentially asynchronous execution

### Tutorial content

This tutorial walks through installing GraphLab and working with it, with the help of two simple applications of the library:

- Sentiment Analysis Classifier
- Exploring graph of American films

The data for each application is collected from the GraphLab data repository. The following topics are covered in the tutorial.

- [Installing Graphlab](#Installing-Graphlab)
- [What is an Sframe](#What-is-an-Sframe)
- [Pagerank algorithm to find best page](#Pagerank-algorithm-to-find-best-page)
- [Bag of Words](#Bag-of-Words)
- [Predicting airline performance](#Predicting-airline-performance)
- [Batch recommendations](#Batch-recommendations)
- [Regression & Classification](#Regression-&-Classification)

## Installing Graphlab

As a first step, register with your academic email address to get a product key. You can register at https://turi.com/download/academic.html . There are two ways of installing GraphLab; I highly recommend the first.

#### 1. Install into Anaconda Python Environment.

This requires Anaconda, which can be downloaded from https://www.continuum.io/downloads. Once Anaconda is installed, make sure the Python version is 2.7.x and follow these steps.

    $  conda create -n gl-env python=2.7 anaconda

    $  activate gl-env
    
Now make sure the pip version is 7 or later, then install GraphLab Create and the IPython Notebook:

    $  conda update pip
    
    $  pip install --upgrade --no-cache-dir https://get.graphlab.com/GraphLab-Create/2.1/your registered email address here/your product key here/GraphLab-Create-License.tar.gz
    
    $  conda install ipython-notebook
    
#### 2. Install into a Python environment using virtualenv.

    $  virtualenv gl-env
    
    $  .\gl-env\Scripts\activate

    $  pip install "ipython[notebook]"
    
    $  pip install --upgrade --no-cache-dir https://get.graphlab.com/GraphLab-Create/2.1/your registered email address here/your product key here/GraphLab-Create-License.tar.gz

## What is an Sframe

An SFrame is a tabular data structure. If you are familiar with R or the pandas Python package, SFrames behave similarly to the dataframes available in those frameworks. An SFrame acts like a table made up of zero or more columns. Each column has its own datatype, and every column of a particular SFrame must have the same number of entries as the columns that already exist. Two things make SFrames very different from other dataframes. First, each column is an SArray, which is a series of elements stored on disk. This makes SFrames disk-based and therefore able to hold datasets that are too large to fit in your system's memory. Second, an SFrame's data is located on the server that is running the GraphLab toolkits, which is not necessarily your client machine.
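
To make the column rule concrete, here is a minimal plain-Python sketch of a column-oriented table. The `ColumnTable` class and its methods are hypothetical names invented for this illustration. It is not how SFrame is implemented, and unlike an SFrame it keeps everything in memory rather than on disk.

```python
class ColumnTable(object):
    """Toy column-oriented table: each column is a typed list of equal length."""

    def __init__(self):
        self.columns = {}   # column name -> list of values
        self.dtypes = {}    # column name -> Python type of that column

    def num_rows(self):
        # All columns share one length, so any column gives the row count.
        for values in self.columns.values():
            return len(values)
        return 0

    def add_column(self, name, values, dtype):
        values = [dtype(v) for v in values]   # enforce a single type per column
        if self.columns and len(values) != self.num_rows():
            raise ValueError("column %r has %d entries, expected %d"
                             % (name, len(values), self.num_rows()))
        self.columns[name] = values
        self.dtypes[name] = dtype


table = ColumnTable()
table.add_column("name", ["James Bond", "M", "Q"], str)
table.add_column("license_to_kill", [1, 1, 1], int)
```

Adding a column with a different number of entries raises a `ValueError`, which mirrors the rule that every column must match the length of the existing ones.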

### Creating Sframes

We'll look at how to create an SFrame. First, import GraphLab and the IPython display helpers.

In [2]:
import graphlab as gl
from IPython.display import display
from IPython.display import Image

## Pagerank algorithm to find best page
Here, we'll apply the PageRank algorithm to a dataset loaded from https://static.turi.com/datasets/bond/bond_vertices.csv and https://static.turi.com/datasets/bond/bond_edges.csv . We'll use PageRank to find the highest-ranked person.

PageRank is an algorithm used by Google Search to rank websites in their search engine results. PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites. More information about pagerank is available here http://www.cs.princeton.edu/~chazelle/courses/BIB/pagerank.htm
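
To make the idea concrete, the following is a minimal pure-Python sketch of the iterative PageRank update on a tiny made-up graph. It is not GraphLab's implementation: the function name `pagerank_sketch`, the reset probability of 0.15, and the unnormalized form (ranks that sum to the number of vertices) are assumptions chosen for illustration, and vertices without outgoing links are not given any special treatment.

```python
def pagerank_sketch(edges, reset_probability=0.15, max_iterations=100, threshold=1e-9):
    """Iterate rank = reset + (1 - reset) * sum(rank of source / out-degree of source)."""
    out_links = {}
    nodes = set()
    for src, dst in edges:
        nodes.update((src, dst))
        out_links.setdefault(src, []).append(dst)

    ranks = {node: 1.0 for node in nodes}            # start every vertex at 1.0
    for _ in range(max_iterations):
        incoming = {node: 0.0 for node in nodes}
        for src, dsts in out_links.items():
            share = ranks[src] / len(dsts)           # split rank evenly over out-links
            for dst in dsts:
                incoming[dst] += share
        new_ranks = {node: reset_probability + (1.0 - reset_probability) * incoming[node]
                     for node in nodes}
        delta = sum(abs(new_ranks[node] - ranks[node]) for node in nodes)
        ranks = new_ranks
        if delta < threshold:                        # stop once ranks barely change
            break
    return ranks


# Made-up graph: "c" is linked to by both "a" and "b"
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")]
ranks = pagerank_sketch(edges)
```

In this toy graph, `c` receives links from both other vertices and ends up with the highest rank, while `b`, which only receives half of `a`'s rank, ends up lowest.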

To get an understanding of how the data is distributed, we'll look at the SFrame object using the show() and head() methods.

Here, vertices is an SFrame object created by gl.SFrame.read_csv(). One of its methods is num_rows(), which gives the number of rows in the SFrame.

In [3]:
gl.canvas.set_target('ipynb') # use IPython Notebook output for GraphLab Canvas
vertices = gl.SFrame.read_csv('https://static.turi.com/datasets/bond/bond_vertices.csv',verbose=False)
edges = gl.SFrame.read_csv('https://static.turi.com/datasets/bond/bond_edges.csv',verbose=False)

print vertices.num_rows()
vertices.show()
vertices.head()

10


name,gender,license_to_kill,villian
James Bond,M,1,0
M,M,1,0
Moneypenny,F,1,0
Q,M,1,0
Wai Lin,F,1,0
Inga Bergstorm,F,0,0
Elliot Carver,M,0,1
Paris Carver,F,0,1
Gotz Otto,M,0,1
Henry Gupta,M,0,1


The vertices SFrame contains the name, gender, license_to_kill and villian columns. license_to_kill is 1 if the person has a license to kill and 0 if the person does not. The villian column is 1 if the person is a villain and 0 otherwise.

Now we'll create an SGraph object and use the PageRank algorithm to find the highest-ranked person in the list.

In [4]:
g = gl.SGraph() # Creates graph object
g = g.add_vertices(vertices=vertices, vid_field='name')
g = g.add_edges(edges=edges, src_field='src', dst_field='dst')

We have added the vertices and edges to the graph object using add_vertices and add_edges. Now we'll use PageRank to find the highest-ranked person.

In [55]:
pr = gl.pagerank.create(g,verbose=False)
print pr.get('pagerank').topk(column_name='pagerank')
pr.show()

+----------------+----------------+-------------------+
|      __id      |    pagerank    |       delta       |
+----------------+----------------+-------------------+
|   James Bond   | 2.52743578524  |  0.0132914517076  |
|       M        | 1.87718696576  |  0.00666194771763 |
|   Moneypenny   | 1.18363921275  |  0.00143637385736 |
|       Q        | 1.18363921275  |  0.00143637385736 |
| Inga Bergstorm | 0.869872717136 |  0.00477951418076 |
|    Wai Lin     | 0.869872717136 |  0.00477951418076 |
| Elliot Carver  | 0.634064732205 | 0.000113553313724 |
|  Henry Gupta   | 0.284762885673 | 1.89255522873e-05 |
|  Paris Carver  | 0.284762885673 | 1.89255522873e-05 |
|   Gotz Otto    | 0.284762885673 | 1.89255522873e-05 |
+----------------+----------------+-------------------+
[10 rows x 3 columns]



We have used the PageRank algorithm to find the highest-ranked person, with the pagerank column as the ranking.

## Bag of Words

The dataset for this analysis can be downloaded from https://www.kaggle.com/c/word2vec-nlp-tutorial/data . It contains movie reviews selected for sentiment analysis. Download the dataset and extract it into the same folder as the notebook.

An n-gram is a contiguous sequence of n items from a given sequence of text or speech. The items can be phonemes, syllables, letters, words or base pairs, depending on the application. The n-grams are typically collected from a text or speech corpus. When the items are words, n-grams may also be called shingles.
An n-gram of size 1 is referred to as a "unigram", size 2 is a "bigram" (or, less commonly, a "digram"), and size 3 is a "trigram". Larger sizes are sometimes referred to by the value of n, e.g., "four-gram", "five-gram", and so on.
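
As a small illustration of word n-grams, here is a plain-Python sketch that counts them in a sentence. The helper name `count_word_ngrams` is made up for this example, and the tokenization (lowercasing and splitting on whitespace) is a simplifying assumption. It is not the `gl.text_analytics.count_ngrams` implementation.

```python
from collections import Counter


def count_word_ngrams(text, n):
    """Count contiguous runs of n words in text (lowercased, whitespace-split)."""
    words = text.lower().split()
    grams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    return dict(Counter(grams))


sentence = "the cat sat on the mat"
unigrams = count_word_ngrams(sentence, 1)   # "the" appears twice
bigrams = count_word_ngrams(sentence, 2)    # five bigrams, each appearing once
```

Each review is turned into a dictionary of this kind, mapping an n-gram to how often it occurs, and that dictionary serves as the feature for the classifier.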

First, we'll experiment with 1-gram and 2-gram features.

In [8]:
traindata_path = "labeledTrainData.tsv"
testdata_path = "testData.tsv"

movies_reviews_data = gl.SFrame.read_csv(traindata_path,header=True, delimiter='\t',quote_char='"', column_type_hints = {'id':str, 'sentiment' : str, 'review':str } , verbose=False)
movies_reviews_data.head()

id,sentiment,review
5814_8,1,With all this stuff going down at the moment with ...
2381_9,1,"""The Classic War of the Worlds"" by Timothy Hines ..."
7759_3,0,The film starts with a manager (Nicholas Bell) ...
3630_4,0,It must be assumed that those who praised this ...
9495_8,1,Superbly trashy and wondrously unpretentious ...
8196_8,1,I dont know why people think this is such a bad ...
7166_2,0,"This movie could have been very good, but c ..."
10633_1,0,I watched this video at a friend's house. I'm glad ...
319_1,0,"A friend of mine bought this film for £1, and ..."
8713_10,1,<br /><br />This movie is full of references. Like ...


We'll use 1-gram features to predict the sentiment. Here we'll use the text_analytics module to generate the features and then train a classifier on them. We split the dataset randomly using the random_split() function and train the classifier on train_set using the 1-gram features. Then we can measure the accuracy on test_set.

In [None]:
movies_reviews_data['1grams features'] = gl.text_analytics.count_ngrams(movies_reviews_data ['review'],1)
train_set, test_set = movies_reviews_data.random_split(0.8, seed=5)
model_1 = gl.classifier.create(train_set, target='sentiment', features=['1grams features'],verbose=False)
result1 = model_1.evaluate(test_set)
print "Accuracy        : ", result1["accuracy"]
print "Confusion Matrix: \n", result1["confusion_matrix"]

Accuracy        :  0.87454249695
Confusion Matrix: 
+--------------+-----------------+-------+
| target_label | predicted_label | count |
+--------------+-----------------+-------+
|      1       |        0        |  240  |
|      1       |        1        |  2153 |
|      0       |        1        |  377  |
|      0       |        0        |  2148 |
+--------------+-----------------+-------+
[4 rows x 3 columns]
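
The accuracy reported above follows directly from the confusion matrix: it is the number of correctly classified reviews divided by the total number of reviews in the test set. A short sketch of that arithmetic, with the counts copied from the table above (the variable names are only for illustration):

```python
# (target_label, predicted_label, count), copied from the confusion matrix above
confusion = [
    (1, 0, 240),
    (1, 1, 2153),
    (0, 1, 377),
    (0, 0, 2148),
]

# A prediction is correct when the predicted label equals the target label
correct = sum(count for target, predicted, count in confusion if target == predicted)
total = sum(count for _, _, count in confusion)
accuracy = correct / float(total)   # 4301 / 4918, roughly 0.8745
```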



Now we'll add 2-gram features to predict the sentiment and compare the result with the 1-gram model's accuracy. We expect the model with 2-gram features to give better accuracy than the 1-gram model.

In [None]:
movies_reviews_data['2grams features'] = gl.text_analytics.count_ngrams(movies_reviews_data['review'],2)
train_set, test_set = movies_reviews_data.random_split(0.8, seed=5)
model_2 = gl.classifier.create(train_set, target='sentiment', features=['1grams features','2grams features'], verbose=False)
result2 = model_2.evaluate(test_set)
print "Accuracy        : ", result2["accuracy"]
print "Confusion Matrix: \n", result2["confusion_matrix"]

The model with 2-gram features gave slightly better accuracy than the 1-gram model. We'll now retrain on the full labeled training data, using both 1-gram and 2-gram features, and use the new model to predict the sentiment of the test reviews. The test dataset has no labels, so we cannot check the accuracy on it.

In [None]:
traindata_path = "labeledTrainData.tsv"
train_data = gl.SFrame.read_csv(traindata_path, header=True, delimiter='\t', quote_char='"', column_type_hints={'id': str, 'sentiment': int, 'review': str}, verbose=False)
train_data['1grams features'] = gl.text_analytics.count_ngrams(train_data['review'], 1)
train_data['2grams features'] = gl.text_analytics.count_ngrams(train_data['review'], 2)

cls = gl.classifier.create(train_data, target='sentiment', features=['1grams features', '2grams features'], verbose=False)

test_data = gl.SFrame.read_csv(testdata_path, header=True, delimiter='\t', quote_char='"', column_type_hints={'id': str, 'review': str}, verbose=False)
test_data['1grams features'] = gl.text_analytics.count_ngrams(test_data['review'], 1)
test_data['2grams features'] = gl.text_analytics.count_ngrams(test_data['review'], 2)

# predicting the sentiment of each review in the test dataset
test_data['sentiment'] = cls.classify(test_data)['class'].astype(int)

# saving the prediction to a CSV for submission
#test_data[['id','sentiment']].save("predictions.csv", format="csv")

## Predicting airline performance

We'll find the best performing airline by fitting the regression model to the dataset. 

The dataset for this part is downloaded from the URL http://stat-computing.org/dataexpo/2009/2008.csv.bz2 . The full collection has information about flight arrival/departure times for US flights from 1987 to 2008, with each year's data recorded in a single CSV file. We'll use up to 1,000,000 records of flight data from the year 2008 (the nrows value passed to read_csv). This dataset doesn't contain the airport names, for which we need to download another dataset from http://stat-computing.org/dataexpo/2009/airports.csv.

In [None]:
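data_url = "http://stat-computing.org/dataexpo/2009/2008.csv.bz2"

# Assumes the archive at data_url has been downloaded and decompressed to 2008.csv
data = gl.SFrame.read_csv('2008.csv',
                          column_type_hints={"ActualElapsedTime": float, "Distance": float},
                          na_values=["NA"], nrows=1000000, verbose=False)

# Drop rows with a missing elapsed time or carrier delay
data = data.dropna(['ActualElapsedTime', 'CarrierDelay'])
data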

We split the data into training and test subsets; the test subset is used to evaluate the model. We then fit a simple yet powerful linear regression model to try to predict the actual flight times.

In [None]:
(train, test) = data.random_split(0.8)
model = gl.linear_regression.create(train, 
                                          target="ActualElapsedTime", 
                                          validation_set=test,verbose=False)
print model.get('coefficients').topk('value')
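
For intuition about what fitting a linear regression means, here is a minimal sketch of ordinary least squares with a single feature. The function name `fit_line` and the numbers are made up for illustration and are not taken from the airline dataset; the GraphLab model above fits many features at once and may use a different solver.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one feature."""
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both unnormalized)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up example: distance in miles against elapsed time in minutes
distances = [100.0, 200.0, 300.0, 400.0]
elapsed = [40.0, 55.0, 70.0, 85.0]
slope, intercept = fit_line(distances, elapsed)
```

The slope plays the same role as an entry in the model's coefficients table: it says how much the predicted elapsed time changes per unit change in the feature.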

Next, we'll explore which airports rank at the top. For this we need the list of airports from the URL http://stat-computing.org/dataexpo/2009/airports.csv . We can then join the airports SFrame with the model coefficients to get the list of top-k airports.

In [None]:
airports = gl.SFrame.read_csv('http://stat-computing.org/dataexpo/2009/airports.csv',verbose=False)
airports.rename({'iata':'Dest'})
result = model.get('coefficients').topk('value')
result = result[result['name'] == 'Dest']
result = result.join(airports,on={'index':'Dest'}).topk('value')
print result

Our task is to predict the actual flight time, which is affected by the airport load, weather, plane type, carrier and many other parameters. We can cast this problem as that of predicting a real-valued variable (flight time) for a pair of entities (source and destination airports). This can be solved easily using certain models in the recommender toolkit. First, let us try regular matrix factorization.
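
The idea behind matrix factorization can be sketched in plain Python: each entity on either side of the pair gets a small vector of latent factors, and the prediction is the global mean plus the dot product of the two vectors. The code below is only an illustrative sketch trained with stochastic gradient descent. The function names, hyperparameters, and flight data are all made up, and GraphLab's `factorization_recommender` is more elaborate than this.

```python
import math
import random


def train_factorization(observations, num_factors=2, learning_rate=0.05,
                        regularization=0.02, epochs=500, seed=0):
    """Fit value ~ mean + dot(row_factors, col_factors) by SGD."""
    rng = random.Random(seed)
    mean = sum(value for _, _, value in observations) / float(len(observations))
    # Small random starting factors for every row entity and column entity
    rows = {r: [rng.uniform(-0.1, 0.1) for _ in range(num_factors)]
            for r, _, _ in observations}
    cols = {c: [rng.uniform(-0.1, 0.1) for _ in range(num_factors)]
            for _, c, _ in observations}

    def predict(row, col):
        return mean + sum(p * q for p, q in zip(rows[row], cols[col]))

    for _ in range(epochs):
        for row, col, value in observations:
            error = value - predict(row, col)
            for f in range(num_factors):
                p, q = rows[row][f], cols[col][f]
                # Gradient step on the squared error with L2 regularization
                rows[row][f] += learning_rate * (error * q - regularization * p)
                cols[col][f] += learning_rate * (error * p - regularization * q)
    return mean, predict


def rmse(observations, predict):
    squared = [(value - predict(row, col)) ** 2 for row, col, value in observations]
    return math.sqrt(sum(squared) / len(squared))


# Made-up (flight number, destination, elapsed hours) observations
flights = [("F1", "ATL", 1.0), ("F1", "ORD", 2.0), ("F2", "ATL", 2.0),
           ("F2", "ORD", 4.0), ("F3", "ATL", 1.5), ("F3", "DEN", 3.0),
           ("F2", "DEN", 4.5)]
mean, predict = train_factorization(flights)
baseline_rmse = rmse(flights, lambda row, col: mean)   # always predict the mean
trained_rmse = rmse(flights, predict)
```

Comparing `trained_rmse` with `baseline_rmse` shows how much the latent factors improve on simply predicting the average value.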

In [None]:
# Train a matrix factorization model with default parameters
model = gl.recommender.factorization_recommender.create(train, 
                                    user_id="FlightNum", 
                                    item_id="Dest", 
                                    target="ActualElapsedTime", 
                                    side_data_factorization=False, verbose=False)

# check out the results of training and validation
print 'Training RMSE', model.get('training_rmse')
print 'Validation RMSE', gl.evaluation.rmse(test['ActualElapsedTime'], model.predict(test))

In [None]:
# Drop columns that are only known after the flight (assumed reason: they would leak the target)
train.remove_columns(['AirTime', 'ArrDelay', 'DepDelay', 'ArrTime'])
test.remove_columns(['AirTime', 'ArrDelay', 'DepDelay', 'ArrTime'])

Now we'll use boosted trees to create a model and predict the time.

In [None]:
# Train a boosted trees regression model with 50 iterations
model = gl.boosted_trees_regression.create(train, 
                                    target="ActualElapsedTime", max_iterations=50,verbose= False)
# check out the results of training and validation
print 'Training RMSE', model.get('training_rmse')
print 'Validation RMSE', gl.evaluation.rmse(test['ActualElapsedTime'], model.predict(test))

In [33]:
print model.get_feature_importance()

+----------------+-------+-------+
|      name      | index | count |
+----------------+-------+-------+
|    NASDelay    |  None |  505  |
|    TaxiOut     |  None |  345  |
| CRSElapsedTime |  None |  257  |
|    DepTime     |  None |  172  |
|     TaxiIn     |  None |  160  |
|   DayofMonth   |  None |  121  |
|    Distance    |  None |  118  |
|   FlightNum    |  None |   79  |
|   CRSDepTime   |  None |   73  |
| UniqueCarrier  |   OO  |   64  |
+----------------+-------+-------+
[5451 rows x 3 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.


In [35]:
model = gl.linear_regression.create(train, target="ActualElapsedTime",verbose = False)
# check out the results of training and validation
print 'Training RMSE', model.get('training_rmse')
print 'Validation RMSE', gl.evaluation.rmse(test['ActualElapsedTime'], model.predict(test))

Training RMSE 11.0556233613
Validation RMSE 11.5699896022


## Batch recommendations

Recommender systems (or recommendation systems) are a subclass of information filtering systems that seek to predict the "rating" or "preference" that a user would give to an item.

The raw data is read using graphlab.SFrame.read_csv, with the file path passed in as a parameter. Once the data is loaded into an SFrame, we clean it by calling dropna() on the SFrame.

I have implemented separate functions to clean the data, train the model and to make recommendations. This makes it easier to understand the implementation.

In [71]:
def clean_data(path):
    sf = gl.SFrame.read_csv(path, delimiter='\t')
    sf['rating'] = sf['rating'].astype(int)
    sf = sf.dropna()
    sf.rename({'user':'user_id', 'movie':'movie_id'})
    
    # To simplify this example, only keep 0.1% of the number of rows from the input data
    sf = sf.sample(0.001)
    return sf

# To train the model, we need the SFrame created in the previous step.
def train_model(data):
    model = gl.recommender.create(data, user_id='user_id', item_id='movie_id', target='rating',verbose=False)
    return model

# To generate recommendations we need the trained model to use, and the users needing recommendations.
def gen_recs(model, data):
    recs = model.recommend(data['user_id'])
    return recs

The three steps are wrapped in a single function, my_batch_job, which is submitted as a job with gl.deploy.job.create. The location of the raw data is not hard-coded in the function; it is passed as the path argument when the job is created, so it can be specified at runtime.

In [48]:
def my_batch_job(path):
    data = clean_data(path)
    model = train_model(data)
    recs = gen_recs(model, data)
    return recs
        
job = gl.deploy.job.create(my_batch_job, 
        path = 'https://static.turi.com/datasets/movie_ratings/sample.small')

[INFO] graphlab.deploy.job: Validating job.
[INFO] graphlab.deploy.job: Validation complete. Job: 'my_batch_job-Oct-30-2016-18-50-45' ready for execution.
[INFO] graphlab.deploy.job: Job: 'my_batch_job-Oct-30-2016-18-50-45' scheduled.


In [49]:
job.get_status()
recs = job.get_results() # Blocking call which waits for the job to complete.

u'Running'

In [74]:
print job

Info
------
Job                : my_batch_job-Oct-30-2016-18-50-45
Function(s)        : ['my_batch_job']
Status             : Completed

Help
------
Visualize progress : self.show()
Query status       : self.get_status()
Get results        : self.get_results()

Environment
----------
LocalAsync: ["name": async]

Metrics
-------
Start time         : 2016-10-30 18:50:48
End time           : 2016-10-30 18:52:15
+--------------+-----------+---------------------+---------------+-----------+
|  task_name   |   status  |      start_time     |    run_time   | exception |
+--------------+-----------+---------------------+---------------+-----------+
| my_batch_job | Completed | 2016-10-30 18:50:49 | 85.6809999943 |    None   |
+--------------+-----------+---------------------+---------------+-----------+
+-------------------+---------------------+
| exception_message | exception_traceback |
+-------------------+---------------------+
|        None       |         None        |
+-------------------+---------------------+

In [75]:
recs.show()

## Regression & Classification

Creating regression models is easy with GraphLab Create! The regression/classification toolkit contains several models including (but not restricted to) linear regression, logistic regression, and gradient boosted trees. All models are built to work with millions of features and billions of examples. The models differ in how they make predictions, but conform to the same API. Like all GraphLab Create toolkits, you can call create() to create a model, predict() to make predictions on the returned model, and evaluate() to measure performance of the predictions.

In this tutorial, we will use a subset of the data from the Yelp Dataset Challenge. The task is to predict the 'star rating' that a given user gives to a restaurant. The dataset comprises three tables that cover 11,537 businesses, 8,282 check-ins, 43,873 users, and 229,907 reviews.

In [6]:
business = gl.SFrame('https://static.turi.com/datasets/regression/business.csv')
user = gl.SFrame('https://static.turi.com/datasets/regression/user.csv')
review = gl.SFrame('https://static.turi.com/datasets/regression/review.csv')

------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,list,str,str,float,float,str,long,long,float,str,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------

------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[float,str,long,str,str,long,long,long]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------

------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,str,long,str,str,str,dict,long,long,long]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


First, we use an SFrame join operation to merge the business and review tables, using the business_id column to "match" the rows of the two tables. 

In [61]:
review_business_table = review.join(business, how='inner', on='business_id')
review_business_table = review_business_table.rename({'stars.1': 'business_avg_stars', 
                              'type.1': 'business_type',
                              'review_count': 'business_review_count'})
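
To show what an inner join does, here is a small plain-Python sketch over lists of dictionaries. The function name `inner_join` and the sample rows are made up for illustration. The ".1" suffix for clashing column names imitates the column names renamed in the cell above (such as 'stars.1'); treat it as an assumption about the naming rather than a description of how SFrame.join works internally.

```python
def inner_join(left_rows, right_rows, key):
    """Keep only rows whose key value appears in both tables, merging their columns."""
    index = {}
    for row in right_rows:
        index.setdefault(row[key], []).append(row)   # group right rows by key value

    joined = []
    for left in left_rows:
        for right in index.get(left[key], []):       # no match means the row is dropped
            merged = dict(left)
            for name, value in right.items():
                if name == key:
                    continue
                # A column name present in both tables gets a ".1" suffix
                merged[name + ".1" if name in left else name] = value
            joined.append(merged)
    return joined


# Made-up rows: the review of business "b9" has no matching business
reviews = [{"business_id": "b1", "stars": 4},
           {"business_id": "b2", "stars": 2},
           {"business_id": "b9", "stars": 5}]
businesses = [{"business_id": "b1", "stars": 3.5, "city": "Phoenix"},
              {"business_id": "b2", "stars": 4.0, "city": "Tempe"}]
merged = inner_join(reviews, businesses, "business_id")
```

The review of `b9` disappears from the result because an inner join keeps only keys found in both tables, and the business-level `stars` column arrives as `stars.1` because the review table already has a `stars` column.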

Now, join the user table to the result, using the user_id column to match rows. We then have review, business, and user information in a single table.

In [62]:
user_business_review_table = review_business_table.join(user, how='inner', on="user_id")
user_business_review_table = user_business_review_table.rename({'name.1': 'user_name', 
                                   'type.1': 'user_type', 
                                   'average_stars': 'user_avg_stars',
                                   'review_count': 'user_review_count'})

Next, let us split our data into training and testing sets, using SFrame's random_split function.

In [63]:
train_set, test_set = user_business_review_table.random_split(0.8, seed=1)
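
A seeded random split can be sketched in a few lines of plain Python. The function `random_split_sketch` below is a hypothetical illustration of the idea: each row independently goes to the first subset with the given probability. It is not the SFrame implementation and will not reproduce the same split as the cell above.

```python
import random


def random_split_sketch(rows, fraction, seed=None):
    """Send each row to the first subset with probability `fraction`."""
    rng = random.Random(seed)        # a fixed seed makes the split repeatable
    first, second = [], []
    for row in rows:
        if rng.random() < fraction:
            first.append(row)
        else:
            second.append(row)
    return first, second


rows = list(range(100))
train_rows, test_rows = random_split_sketch(rows, 0.8, seed=1)
```

Because each row is assigned independently, the first subset holds roughly, not exactly, 80% of the rows. Passing the same seed again returns the same two subsets.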

Let's start out with a simple model. The target is the star rating for each review and the features are:
- Average rating of a given business
- Average rating made by a user
- Number of reviews made by a user
- Number of reviews that concern a business

In [65]:
model = gl.linear_regression.create(train_set, target='stars', 
                                    features = ['user_avg_stars','business_avg_stars', 
                                                'user_review_count', 'business_review_count'], verbose = False)

GraphLab Create easily allows you to make predictions using the created model with the predict function. The predict function returns an SArray with a prediction for each example in the test dataset.

In [66]:
predictions = model.predict(test_set)

We can also evaluate our predictions by comparing them to known ratings. The results are evaluated using two metrics: root-mean-square error (RMSE) is a global summary of the differences between predicted values and the values actually observed, while max-error measures the worst-case performance of the model on a single observation. In this example, our model made predictions that were about 1 star away from the true rating (on average), but there were a few cases where we were off by about 4 stars.

In [67]:
model.evaluate(test_set)

{'max_error': 4.019750949512857, 'rmse': 0.9710267935060807}
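
The two metrics can be written out in a few lines of plain Python. This is a sketch of the definitions with made-up ratings, not the code that model.evaluate() runs; the function names are chosen only for this example.

```python
import math


def rmse(actual, predicted):
    """Root-mean-square error: square root of the mean squared difference."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / float(len(squared)))


def max_error(actual, predicted):
    """Largest absolute difference on any single observation."""
    return max(abs(a - p) for a, p in zip(actual, predicted))


# Made-up star ratings and predictions
actual = [5, 3, 1, 4]
predicted = [4.0, 3.0, 3.0, 4.0]
example_rmse = rmse(actual, predicted)            # sqrt((1 + 0 + 4 + 0) / 4)
example_max_error = max_error(actual, predicted)  # the 1-star review predicted as 3
```

RMSE averages over all observations, so a single large miss is diluted, whereas max-error reports only that single worst miss.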

In [68]:
sf = gl.SFrame()
sf['Predicted-Rating'] = predictions
sf['Actual-Rating'] = test_set['stars']
predict_count = sf.groupby('Actual-Rating', [gl.aggregate.COUNT('Actual-Rating'), gl.aggregate.AVG('Predicted-Rating')])
predict_count.topk('Actual-Rating', k=5, reverse=True)

Actual-Rating,Count,Avg of Predicted-Rating
1,3280,2.64874516707
2,4003,3.27132293866
3,6455,3.54142644226
4,15150,3.81638846127
5,14383,4.23733573113
