# Building a simple news recommendation system

The goal of this example is to train a very simple news recommendation system. We will
- prepare the training data in parallel with Ray,
- train a simple model that classifies article titles as "popular" or "less popular" using scikit-learn, and
- find good hyperparameter settings for the model with Tune, Ray's parallel hyperparameter optimization library.

### Downloading and preparing the training data

First we will download and decompress 2 million Hacker News submissions. This is a small subset of the articles that have been submitted to https://news.ycombinator.com. The data includes the title of each submission and its score, which roughly corresponds to the number of upvotes. The data comes as four JSONL files, named `hackernews-1.json` through `hackernews-4.json`, with one JSON record per line. The cell below downloads the archive and prints the first two lines of the first file to show the format.

In [None]:
!wget -nc https://s3-us-west-2.amazonaws.com/ray-tutorials/hackernews.zip
!unzip -o hackernews.zip
!head -n 2 hackernews-1.json

In [None]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import json
import numpy as np
import pandas as pd
import ray
import time

In [None]:
ray.init(num_cpus=4, include_webui=False, ignore_reinit_error=True)

The following is a function to parse a chunk of the data and produce a pandas DataFrame with the title and the score of the submissions.

In [None]:
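def parse_hn_submissions(path):
    """Parse one JSONL file into a DataFrame with columns "data" (title) and "score"."""
    records = []
    with open(path, "r") as f:
        # Each line holds one JSON record; the fields we need are under "body".
        for line in f:
            body = json.loads(line)["body"]
            records.append({"data": body["title"], "score": body["score"]})
    return pd.DataFrame(records)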
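
To make the record shape that the parser relies on concrete, here is a minimal sketch that parses a single line with the standard library only. The sample line is invented for illustration; the real files may carry additional fields, and only the `body`, `title` and `score` keys are taken from the function above.

```python
import json

# Hypothetical sample line; the real files may contain more fields per record.
sample_line = '{"body": {"title": "Show HN: An example project", "score": 3}}'

# Same extraction as in parse_hn_submissions: the title is stored under "data".
body = json.loads(sample_line)["body"]
record = {"data": body["title"], "score": body["score"]}
print(record)
```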

We now process all the data chunks and concatenate them into a single DataFrame:

In [None]:
start_time = time.time()

files = ["hackernews-" + str(i) + ".json" for i in range(1, 5)]
records = [parse_hn_submissions(file) for file in files]
df = pd.concat(records)

end_time = time.time()
duration = end_time - start_time
print("Took {} seconds to parse the hackernews submissions".format(duration))

df.head()

**EXERCISE:** Speed up the parsing by processing the four files in parallel with Ray!

We use the following lines to determine the cutoff for what we consider a "popular" article. The median score for articles is 1, so we label articles with a score higher than that as the positive class (`True`) and everything else as the negative class (`False`).

In [None]:
df["score"].median()

In [None]:
df["target"] = df["score"] > 1.0
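
As a small illustration of the cutoff, the sketch below applies the same rule to a handful of made-up scores using only the standard library. The scores are invented for this example and are not taken from the dataset. When several items share the median score, as in this list, a strict `>` puts fewer than half of the items in the positive class.

```python
from statistics import median

# Made-up scores for illustration only.
scores = [0, 1, 1, 1, 2, 5, 40]

cutoff = median(scores)
# Same rule as above: strictly greater than the median is the positive class.
targets = [score > cutoff for score in scores]

print("cutoff:", cutoff)
print("targets:", targets)
print("positive fraction:", sum(targets) / len(targets))
```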

We are now done preparing the data and can start training a model.

### Training a model

First we split the data into a training set and a test set.

In [None]:
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.2)

The following defines a pipeline that first converts the title of the submission to a bag of words and then applies a linear SVM, trained with stochastic gradient descent, for the actual classification.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier

pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', SGDClassifier(loss='hinge', penalty='l2',
                          alpha=0.001,
                          max_iter=5, tol=1e-3))])
result = pipeline.fit(train.data, train.target)

predicted = result.predict(train.data)
print("Accuracy on the training set is {}".format(np.mean(predicted == train.target)))

In [None]:
predicted = pipeline.predict(test.data)
print("Accuracy on the test set is {}".format(np.mean(predicted == test.target)))
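
The bag-of-words step in the pipeline above can be sketched in plain Python. This is a simplified stand-in, not the `CountVectorizer` implementation: it lowercases each title, keeps tokens of two or more word characters, and counts them against a shared vocabulary. The two titles and the helper name `bag_of_words` are made up for this example.

```python
import re
from collections import Counter

def bag_of_words(title):
    """Count lowercase tokens of two or more word characters in a title."""
    tokens = re.findall(r"\b\w\w+\b", title.lower())
    return Counter(tokens)

# Made-up titles for illustration only.
titles = ["Ask HN: Is Rust worth learning?", "Rust is fast, and Rust is safe"]
counts = [bag_of_words(title) for title in titles]

# The vocabulary is the sorted set of all tokens seen in any title.
vocabulary = sorted(set().union(*counts))

# One row per title, one column per vocabulary word; missing words count as 0.
matrix = [[count[word] for word in vocabulary] for count in counts]

print(vocabulary)
for row in matrix:
    print(row)
```

Each row is the feature vector the classifier would see for one title. Word order is discarded, which is why this representation is called a bag of words.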

We can also classify new titles as follows:

In [None]:
pipeline.predict(["Iconic consoles of the IBM System/360 mainframes, 55 years old today",
                  "Are Banned Drugs in Your Meat?"])

### Hyperparameter tuning

Now let's try to improve these results by doing some hyperparameter tuning.

In [None]:
from ray import tune
from ray.tune.util import pin_in_object_store, get_pinned_object

First we need to put the training data into the object store (to make sure it will be re-used between training runs), and define the objective function.

In [None]:
train_id = pin_in_object_store(train)
test_id = pin_in_object_store(test)

def train_func(config, reporter):
    pipeline = Pipeline([
        ('vect', CountVectorizer()),
        ('clf', SGDClassifier(loss='hinge', penalty='l2',
                              alpha=config["alpha"],
                              max_iter=5, tol=1e-3))])
    
    train = get_pinned_object(train_id)
    test = get_pinned_object(test_id)
    pipeline.fit(train.data, train.target)
    # Report the accuracy on the test set back to Tune.
    accuracy = np.mean(pipeline.predict(test.data) == test.target)
    reporter(mean_accuracy=accuracy)

We can then search for the best setting of the regularization parameter $\alpha$ by running a grid search over four candidate values:

In [None]:
all_trials = tune.run(
    train_func,
    name="news_recommendation",
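    # mean_accuracy is a fraction in [0, 1], so this threshold is never reached;
    # each trial simply ends when train_func returns.
    stop={"mean_accuracy": 99},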
    config={"alpha": tune.grid_search([1e-3, 1e-4, 1e-5, 1e-6])}
)
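
Conceptually, the grid search above evaluates the objective once for every candidate value of `alpha` and keeps the configuration with the highest reported accuracy. The sketch below shows that idea sequentially in plain Python. It does not use Tune, and `toy_accuracy` is a made-up stand-in for training and evaluating the pipeline, so the "best" value it finds says nothing about the real model.

```python
import math

def toy_accuracy(alpha):
    """Hypothetical objective that peaks at alpha = 1e-4 (illustration only)."""
    return 1.0 - 0.1 * abs(math.log10(alpha) + 4)

# Same candidate grid as in the tune.run call above.
grid = [1e-3, 1e-4, 1e-5, 1e-6]

# Evaluate every configuration and record the metric it reports.
results = [{"alpha": alpha, "mean_accuracy": toy_accuracy(alpha)} for alpha in grid]

# Keep the configuration with the highest reported accuracy.
best = max(results, key=lambda result: result["mean_accuracy"])
print("Best alpha in this toy example:", best["alpha"])
```

Tune runs the trials in parallel rather than in a loop, but the selection step is the same: compare the reported `mean_accuracy` of each trial.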