# Building a simple news recommendation system

The goal of this example is to train a very simple news recommendation system. We will:
- prepare the training data in parallel with Ray,
- train a simple model that classifies article titles as "popular" or "less popular" using scikit-learn, and
- find good hyperparameter settings for the model with Tune, Ray's parallel hyperparameter optimization library.

### Downloading and preparing the training data

<html><img src="newsreader_1.png"/></html>

First we download and uncompress 400,000 Hacker News submissions, a small subset of the articles that have been submitted to https://news.ycombinator.com. The data includes the title of each submission and its score, which roughly corresponds to the number of upvotes. The data is split across four JSON files, named `submission-1.json` through `submission-4.json`, with one JSON record per line. The `head` command below prints the first two lines of the first file.

In [1]:
!wget -nc https://s3-us-west-2.amazonaws.com/ray-tutorials/hackernews.zip
!unzip -o hackernews.zip
!head -n 2 submission-1.json

--2020-01-02 12:16:41--  https://s3-us-west-2.amazonaws.com/ray-tutorials/hackernews.zip
Resolving s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)... 52.218.128.120
Connecting to s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)|52.218.128.120|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 56402193 (54M) [application/zip]
Saving to: ‘hackernews.zip’


2020-01-02 12:16:44 (25.7 MB/s) - ‘hackernews.zip’ saved [56402193/56402193]

Archive:  hackernews.zip
  inflating: submission-1.json       
  inflating: submission-2.json       
  inflating: submission-3.json       
  inflating: submission-4.json       
{"body": {"descendants": 0, "url": "http://markpincus.blogspot.com/2005/03/peopleweb-i-believe-we-are-close-to.html", "text": "", "title": "The PeopleWeb | Mark Pincus Blog (March 2005)", "by": "sayemm", "score": 3, "time": 1286515576, "type": "story", "id": 1770734}, "source": "firebase", "id": 1770734, "retrieved_at_ts": 1436469924}
{"body": {"de

In [2]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import json
import numpy as np
import pandas as pd
import ray
import time

%env RAY_DASHBOARD_DEBUG = True

env: RAY_DASHBOARD_DEBUG=True


In [6]:
ray.shutdown()

In [7]:
ray.init(num_cpus=4, include_webui=True, ignore_reinit_error=True)

2020-01-02 12:22:17,658	INFO resource_spec.py:216 -- Starting Ray with 3.32 GiB memory available for workers and up to 1.66 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2020-01-02 12:22:17,900	INFO services.py:1101 -- View the Ray dashboard at localhost:8267.


{'node_ip_address': '10.1.10.91',
 'redis_address': '10.1.10.91:61548',
 'object_store_address': '/tmp/ray/session_2020-01-02_12-22-17_649165_4496/sockets/plasma_store',
 'raylet_socket_name': '/tmp/ray/session_2020-01-02_12-22-17_649165_4496/sockets/raylet',
 'webui_url': 'localhost:8267',
 'session_dir': '/tmp/ray/session_2020-01-02_12-22-17_649165_4496'}

The function below parses a chunk of the data and produces a pandas DataFrame with the titles and scores of the submissions.

In [16]:
def parse_hn_submissions(path):
    with open(path, "r") as f:
        records = []
        for line in f.readlines():
            body = json.loads(line)["body"]
            records.append({"data": body["title"], "score": body["score"]})
        return pd.DataFrame(records)
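
To make the record layout concrete, here is a minimal sketch of what the parser extracts from a single line. The sample line is a shortened copy of the first record printed by `head` above (several fields are dropped for brevity), and the variable names are illustrative only.

```python
import json

# Shortened copy of the first record in submission-1.json
# (only a few of the original fields are kept).
sample_line = (
    '{"body": {"title": "The PeopleWeb | Mark Pincus Blog (March 2005)", '
    '"by": "sayemm", "score": 3, "type": "story", "id": 1770734}, '
    '"source": "firebase", "id": 1770734}'
)

# Each line is a standalone JSON object; the submission sits under "body".
body = json.loads(sample_line)["body"]

# Keep only the two fields the model needs: the title and the score.
record = {"data": body["title"], "score": body["score"]}
print(record)
```

One such dictionary is collected per line, and the list of dictionaries becomes the rows of the DataFrame.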

We now process all the data chunks serially and concatenate them into a single DataFrame:

In [17]:
start_time = time.time()

files = ["submission-" + str(i) + ".json" for i in range(1, 5)]
records = [parse_hn_submissions(file) for file in files]
df = pd.concat(records)

end_time = time.time()
duration = end_time - start_time
print("Took {} seconds to parse the hackernews submissions".format(duration))

df.head()

Took 2.8535571098327637 seconds to parse the hackernews submissions


Unnamed: 0,data,score
0,The PeopleWeb | Mark Pincus Blog (March 2005),3
1,Computer science and programming are two separ...,1
2,Don't Go It Alone: Create an Advisory Board,1
3,Wikileaks Secret Dreams,1
4,MakeMyTrip.com: Is eCommerce in India Finall...,1


**EXERCISE:** Modify the code above to parallelize the parsing of the four files with Ray!

**Note**: In Binder this will not lead to a speedup (in fact, it will be slower) due to constrained resources (each Binder instance is shared with many other people). On an uncontended EC2 instance, we get **4.25s** for the serial code and **1.34s** for the parallel version.

The following lines determine the cutoff for what we consider a "good" article. The median score is 1, so we label articles with a score above 1 as class "1" (stored as `True` in the `target` column) and everything else as class "0" (`False`).

In [18]:
df["score"].median()

1.0

In [19]:
df["target"] = df["score"] > 1.0
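
Because the comparison is strict, every article whose score ties with the median falls into class "0", so the two classes need not be the same size. The sketch below uses made-up scores (not taken from the dataset) to show the effect. On the real data, `df["target"].mean()` gives the class balance to compare any accuracy figure against.

```python
import pandas as pd

# Made-up scores: most submissions sit at the minimum score of 1.
toy = pd.DataFrame({"score": [1, 1, 1, 3, 2, 1, 10, 1]})

cutoff = toy["score"].median()
toy["target"] = toy["score"] > cutoff

# Fraction of rows labeled "popular"; with many ties at the median,
# this can be well below one half.
popular_fraction = toy["target"].mean()
print(cutoff, popular_fraction)
```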

We are now done preparing the data and can start training a model.

### Training a model

<html><img src="newsreader_2.png"/></html>

First we split the data into a training set (80%) and a test set (20%).

In [20]:
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.2)

The following defines a pipeline that first converts the title of each submission to a bag of words and then applies a linear SVM (an `SGDClassifier` with hinge loss) for the actual classification. Note that we are fitting a very simple SVM here because of the computational restrictions of Binder. With more resources, a state-of-the-art model like [BERT](https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html) would be a better choice; the code would be structured similarly.

In [21]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier

pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("clf", SGDClassifier(loss="hinge", penalty="l2",
                          alpha=0.001,
                          max_iter=5, tol=1e-3,
                          warm_start=True))])
result = pipeline.fit(train.data, train.target)

predicted = result.predict(train.data)
print("Accuracy on the training set is {}".format(np.mean(predicted == train.target)))



Accuracy on the training set is 0.586221875


In [22]:
predicted = pipeline.predict(test.data)
print("Accuracy on the test set is {}".format(np.mean(predicted == test.target)))

Accuracy on the test set is 0.578675
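
To see what the bag-of-words step hands to the classifier, here is a minimal sketch that runs `CountVectorizer` with default settings on two made-up titles. It assumes scikit-learn is installed; the titles and variable names are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up titles, purely for illustration.
titles = ["Show HN: a tiny web server", "Ask HN: a web framework?"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(titles)  # sparse matrix, one row per title

# The learned vocabulary, in column order. With default settings the text is
# lowercased and one-character tokens such as "a" are dropped.
vocabulary = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(vocabulary)
print(counts.toarray())
```

Each title becomes a row of word counts, so word order is discarded and the classifier only sees which words occur and how often.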


We can also classify new titles as follows:

In [23]:
pipeline.predict(["Iconic consoles of the IBM System/360 mainframes, 55 years old today",
                  "Are Banned Drugs in Your Meat?"])

array([ True, False])

### Hyperparameter tuning
<html><img src="newsreader_3.png"/></html>

Now let's try to improve these results by doing some hyperparameter tuning. Hyperparameter tuning is the process of finding the best parameters for the learning algorithm. These parameters are typically a few numbers, such as the learning rate schedule (i.e. how large a step to take in each iteration), regularization parameters, or the size of the model. By tuning these knobs, we can typically make the model perform better. Tune supports a number of different algorithms for hyperparameter tuning. The simplest is a grid search, where we exhaustively try out different values for the parameters. More sophisticated algorithms include HyperBand and population-based training. If you want to learn more about these, check out the [Tune documentation](https://ray.readthedocs.io/en/latest/tune.html).
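
As a rough illustration of what a grid search does, independent of Tune, the sketch below tries every combination in a small grid and keeps the best one. The function names `grid_search` and `toy_objective`, the `max_iter` values, and the scoring formula are all hypothetical. In practice the objective would train a model and return its validation accuracy.

```python
import itertools
import math


def grid_search(objective, grid):
    """Evaluate objective on every combination in grid; return the best one."""
    names = sorted(grid)
    best_config, best_score = None, float("-inf")
    for values in itertools.product(*(grid[name] for name in names)):
        config = dict(zip(names, values))
        score = objective(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


def toy_objective(config):
    # Hypothetical stand-in for "train and evaluate": peaks at alpha = 1e-4
    # and mildly rewards more iterations.
    return -(math.log10(config["alpha"]) + 4) ** 2 + 0.01 * config["max_iter"]


grid = {"alpha": [1e-3, 1e-4, 1e-5, 1e-6], "max_iter": [5, 10]}
best_config, best_score = grid_search(toy_objective, grid)
print(best_config)
```

The number of evaluations is the product of the list lengths (here 4 × 2 = 8), which is why grid search becomes expensive as more hyperparameters are added.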

In [24]:
import os
import pickle
from ray import tune

First we need to put the training data into the object store (to make sure it is reused between training runs) and define the objective function. The objective function `train_func` takes two arguments: `config`, which contains the hyperparameters for the current trial, and `reporter`, which is used to report the performance of these hyperparameters back to Tune so that it can select the next trial based on the performance of past ones.

**EXERCISE**: Inside the `train_func`, instantiate the training pipeline as above and replace the concrete value of $\alpha$ with the value `config["alpha"]` that is passed in by Tune.

The following function instantiates a model corresponding to the hyperparameters in `config`, runs 5 iterations of training, and after each iteration saves the model to a checkpoint file (`model.pkl`) and reports the test accuracy.

In [25]:
train_id = ray.put(train)
test_id = ray.put(test)

def train_func(config, reporter):
    # Same pipeline as above, but with alpha taken from the Tune config.
    pipeline = Pipeline([
        ("vect", CountVectorizer()),
        ("clf", SGDClassifier(loss="hinge", penalty="l2",
                              alpha=config["alpha"],
                              max_iter=5, tol=1e-3,
                              warm_start=True))])
    train = ray.get(train_id)
    test = ray.get(test_id)
    for i in range(5):
        # Perform one epoch of SGD
        X = pipeline.named_steps["vect"].fit_transform(train.data)
        pipeline.named_steps["clf"].partial_fit(X, train.target, classes=[0, 1])
        with open("model.pkl", "wb") as f:
            pickle.dump(pipeline, f)
        reporter(mean_accuracy=np.mean(pipeline.predict(test.data) == test.target))  # report metrics
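
The loop above relies on `partial_fit`, which performs one pass of SGD over the data it is given and keeps the learned weights between calls. Below is a minimal sketch on made-up titles, assuming scikit-learn is installed. Here the vectorizer is fitted once before the loop, which is one possible variation rather than what the cell above does.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier

# Made-up titles and labels, purely for illustration.
titles = ["rust compiler release", "python web framework",
          "celebrity gossip news", "lottery winner story"]
labels = [1, 1, 0, 0]

# Fit the vocabulary once; it does not change between epochs.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(titles)

clf = SGDClassifier(loss="hinge", penalty="l2", alpha=1e-4, random_state=0)
for epoch in range(5):
    # Each call makes one pass over X and updates the existing weights.
    # `classes` is required on the first call so the model knows every label.
    clf.partial_fit(X, labels, classes=[0, 1])

predictions = clf.predict(vectorizer.transform(titles))
print(predictions)
```

Because the weights persist across calls, a metric can be reported after every pass, which is what lets Tune observe each trial's progress iteration by iteration.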

We can then get the best setting for the regularization parameter $\alpha$ as follows. **You should expect the training to take about 4-5 minutes**.

In [26]:
all_trials = tune.run(
    train_func,
    name="news_recommendation",
    # With the "stop" parameter, you could also specify a stopping criterion.
    config={"alpha": tune.grid_search([1e-3, 1e-4, 1e-5, 1e-6])}
)



Trial name,status,loc,alpha
train_func_b36a39c4,RUNNING,,
train_func_b36dfc58,PENDING,,
train_func_b36e24d0,PENDING,,
train_func_b36e4bfe,PENDING,,




Result for train_func_b36e24d0:
  date: 2020-01-02_12-23-18
  done: false
  experiment_id: 0c18f4cfd3a54af0902bb2ade835f338
  experiment_tag: 2_alpha=1e-05
  hostname: Yunzhis-MBP.hsd1.ca.comcast.net
  iterations_since_restore: 1
  mean_accuracy: 0.5702625
  node_ip: 10.1.10.91
  pid: 4527
  time_since_restore: 4.267539978027344
  time_this_iter_s: 4.267539978027344
  time_total_s: 4.267539978027344
  timestamp: 1577996598
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: b36e24d0
  


Trial name,status,loc,alpha,iter,total time (s),acc
train_func_b36a39c4,RUNNING,,,,,
train_func_b36dfc58,RUNNING,,,,,
train_func_b36e24d0,RUNNING,10.1.10.91:4527,1e-05,1.0,4.26754,0.570263
train_func_b36e4bfe,RUNNING,,,,,


Result for train_func_b36e4bfe:
  date: 2020-01-02_12-23-18
  done: false
  experiment_id: 8852a9f9a054441ea0fb3d6963922800
  experiment_tag: 3_alpha=1e-06
  hostname: Yunzhis-MBP.hsd1.ca.comcast.net
  iterations_since_restore: 1
  mean_accuracy: 0.5621125
  node_ip: 10.1.10.91
  pid: 4525
  time_since_restore: 4.303200960159302
  time_this_iter_s: 4.303200960159302
  time_total_s: 4.303200960159302
  timestamp: 1577996598
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: b36e4bfe
  
Result for train_func_b36dfc58:
  date: 2020-01-02_12-23-18
  done: false
  experiment_id: 12c88bd52e6047679f7a52af9aa71d02
  experiment_tag: 1_alpha=0.0001
  hostname: Yunzhis-MBP.hsd1.ca.comcast.net
  iterations_since_restore: 1
  mean_accuracy: 0.59805
  node_ip: 10.1.10.91
  pid: 4528
  time_since_restore: 4.316836833953857
  time_this_iter_s: 4.316836833953857
  time_total_s: 4.316836833953857
  timestamp: 1577996598
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: b36df

Trial name,status,loc,alpha,iter,total time (s),acc
train_func_b36a39c4,RUNNING,10.1.10.91:4526,0.001,2,8.39669,0.58005
train_func_b36dfc58,RUNNING,10.1.10.91:4528,0.0001,2,8.38358,0.601413
train_func_b36e24d0,RUNNING,10.1.10.91:4527,1e-05,3,12.2363,0.591962
train_func_b36e4bfe,RUNNING,10.1.10.91:4525,1e-06,2,8.35894,0.568412


Result for train_func_b36e4bfe:
  date: 2020-01-02_12-23-26
  done: false
  experiment_id: 8852a9f9a054441ea0fb3d6963922800
  experiment_tag: 3_alpha=1e-06
  hostname: Yunzhis-MBP.hsd1.ca.comcast.net
  iterations_since_restore: 3
  mean_accuracy: 0.5668
  node_ip: 10.1.10.91
  pid: 4525
  time_since_restore: 12.327115058898926
  time_this_iter_s: 3.9681711196899414
  time_total_s: 12.327115058898926
  timestamp: 1577996606
  timesteps_since_restore: 0
  training_iteration: 3
  trial_id: b36e4bfe
  
Result for train_func_b36dfc58:
  date: 2020-01-02_12-23-26
  done: false
  experiment_id: 12c88bd52e6047679f7a52af9aa71d02
  experiment_tag: 1_alpha=0.0001
  hostname: Yunzhis-MBP.hsd1.ca.comcast.net
  iterations_since_restore: 3
  mean_accuracy: 0.60135
  node_ip: 10.1.10.91
  pid: 4528
  time_since_restore: 12.36200499534607
  time_this_iter_s: 3.9784250259399414
  time_total_s: 12.36200499534607
  timestamp: 1577996606
  timesteps_since_restore: 0
  training_iteration: 3
  trial_id: b36d

Trial name,status,loc,alpha,iter,total time (s),acc
train_func_b36a39c4,RUNNING,10.1.10.91:4526,0.001,4,16.3448,0.578825
train_func_b36dfc58,RUNNING,10.1.10.91:4528,0.0001,4,16.3186,0.609237
train_func_b36e24d0,RUNNING,10.1.10.91:4527,1e-05,5,20.0713,0.596325
train_func_b36e4bfe,RUNNING,10.1.10.91:4525,1e-06,4,16.267,0.5628


Result for train_func_b36e4bfe:
  date: 2020-01-02_12-23-34
  done: false
  experiment_id: 8852a9f9a054441ea0fb3d6963922800
  experiment_tag: 3_alpha=1e-06
  hostname: Yunzhis-MBP.hsd1.ca.comcast.net
  iterations_since_restore: 5
  mean_accuracy: 0.568875
  node_ip: 10.1.10.91
  pid: 4525
  time_since_restore: 20.2430899143219
  time_this_iter_s: 3.976069927215576
  time_total_s: 20.2430899143219
  timestamp: 1577996614
  timesteps_since_restore: 0
  training_iteration: 5
  trial_id: b36e4bfe
  
Result for train_func_b36dfc58:
  date: 2020-01-02_12-23-34
  done: false
  experiment_id: 12c88bd52e6047679f7a52af9aa71d02
  experiment_tag: 1_alpha=0.0001
  hostname: Yunzhis-MBP.hsd1.ca.comcast.net
  iterations_since_restore: 5
  mean_accuracy: 0.6095125
  node_ip: 10.1.10.91
  pid: 4528
  time_since_restore: 20.28740692138672
  time_this_iter_s: 3.968809127807617
  time_total_s: 20.28740692138672
  timestamp: 1577996614
  timesteps_since_restore: 0
  training_iteration: 5
  trial_id: b36dfc

Trial name,status,loc,alpha,iter,total time (s),acc
train_func_b36a39c4,TERMINATED,,0.001,5,20.3502,0.578388
train_func_b36dfc58,TERMINATED,,0.0001,5,20.2874,0.609513
train_func_b36e24d0,TERMINATED,,1e-05,5,20.0713,0.596325
train_func_b36e4bfe,TERMINATED,,1e-06,5,20.2431,0.568875


2020-01-02 12:23:34,849	INFO tune.py:334 -- Returning an analysis object by default. You can call `analysis.trials` to retrieve a list of trials. This message will be removed in future versions of Tune.
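
As a quick sanity check, the winning setting can be read off the final status table. The sketch below simply copies the four final accuracies from the output above into a dictionary (the name `final_accuracy` is illustrative) and picks the largest.

```python
# Final mean_accuracy per alpha, copied from the last status table above.
final_accuracy = {1e-3: 0.578388, 1e-4: 0.609513, 1e-5: 0.596325, 1e-6: 0.568875}

# The key with the highest accuracy is the best regularization strength in this run.
best_alpha = max(final_accuracy, key=final_accuracy.get)
print(best_alpha, final_accuracy[best_alpha])
```

In this particular run, `alpha=0.0001` ends slightly ahead of the value `0.001` used in the untuned pipeline; the exact numbers will vary between runs because the train/test split is random.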


The best model can now be loaded and evaluated like so:

In [None]:
best_result_path = os.path.join(all_trials.get_best_logdir("mean_accuracy"), "model.pkl")
with open(best_result_path, "rb") as f:
    pipeline = pickle.load(f)
print("Best result was {}".format(np.mean(pipeline.predict(test.data) == test.target)))
print("Best result path is {}".format(best_result_path))