# Building a news recommendation system

The goal of this exercise is to train and serve a news recommendation system.

### Downloading and preparing the training data

First, we download and extract a subset of Hacker News submissions. This is a small sample of the articles that have been submitted to https://news.ycombinator.com.

In [None]:
!wget https://s3-us-west-2.amazonaws.com/ray-tutorials/hackernews.zip

In [None]:
!unzip -o hackernews.zip

In [None]:
import ray
ray.init()

The following function parses one chunk of the data and returns a pandas DataFrame with the title and score of each submission. The `@ray.remote` decorator turns it into a Ray task, so several chunks can be parsed in parallel.

In [None]:
import json
import pandas as pd
import numpy as np

@ray.remote
def parse_hn_submissions(path):
    # Each line of the file is one JSON record; keep only the title and score.
    records = []
    with open(path, "r") as f:
        for line in f:
            body = json.loads(line)["body"]
            records.append({"data": body["title"], "score": body["score"]})
    return pd.DataFrame(records)
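
The parser above assumes that each line of the input is a JSON object whose `body` field holds the `title` and `score`. As a minimal sketch of that assumed layout, the same parsing logic can be tried on an in-memory file without Ray. The two sample records and the helper `parse_lines` are made up for illustration and are not taken from the dataset:

```python
import io
import json

# Hypothetical sample in the assumed one-JSON-object-per-line layout.
sample = io.StringIO(
    '{"body": {"title": "Show HN: A tiny demo", "score": 12}}\n'
    '{"body": {"title": "Ask HN: Favorite editor?", "score": 3}}\n'
)

def parse_lines(lines):
    """Extract title and score from each JSON line (plain-Python version)."""
    records = []
    for line in lines:
        body = json.loads(line)["body"]
        records.append({"data": body["title"], "score": body["score"]})
    return records

sample_records = parse_lines(sample)
print(sample_records)
```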

In [None]:
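files = ["hackernews-" + str(i) + ".json" for i in range(1, 5)]
# Serial baseline for timing comparison. It only works if parse_hn_submissions
# is NOT decorated with @ray.remote, since remote functions cannot be called directly.
# %time records = [parse_hn_submissions(file) for file in files]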

In [None]:
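# Launch one task per file, then block until all results are available.
%time results = ray.get([parse_hn_submissions.remote(file) for file in files])
# Combine the per-file DataFrames into one.
df = pd.concat(results)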
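
Calling `parse_hn_submissions.remote(file)` returns immediately with a future, and `ray.get` blocks until the corresponding results are ready. The same submit-then-gather pattern can be sketched with the standard library's `concurrent.futures`. This is only an analogy for the control flow, not a description of how Ray works internally, and `slow_square` is a made-up stand-in for the parsing task:

```python
from concurrent.futures import ThreadPoolExecutor
import time

def slow_square(x):
    # Stand-in for an expensive task such as parsing one file.
    time.sleep(0.1)
    return x * x

with ThreadPoolExecutor(max_workers=4) as pool:
    # Submitting returns futures right away, similar to calling .remote() in Ray.
    futures = [pool.submit(slow_square, i) for i in range(1, 5)]
    # Collecting the results blocks until each task finishes, similar to ray.get().
    squares = [f.result() for f in futures]

print(squares)
```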

In [None]:
df["score"].mean()

In [None]:
# Binary label: True for submissions that scored more than 9 points.
df["target"] = df["score"] > 9.0
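
Thresholding the score turns the task into binary classification. If most submissions receive few points, the two classes will be unbalanced, which is worth checking before training. A small sketch on made-up scores (not the real dataset):

```python
import pandas as pd

# Made-up scores for illustration only.
toy = pd.DataFrame({"score": [1, 2, 3, 10, 50, 9]})

# Same rule as above: a submission is a positive example if its score exceeds 9.
toy["target"] = toy["score"] > 9.0

# Fraction of positive examples; on the real data this would be df["target"].mean().
positive_fraction = toy["target"].mean()
print(toy["target"].tolist(), positive_fraction)
```

Note that a score of exactly 9 counts as a negative example because the comparison is strict.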

### Training a model

Next, we split the data into training and test sets, holding out 20% of the rows for evaluation.

In [None]:
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.2)
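
`train_test_split` shuffles randomly, so the split above changes from run to run. If reproducibility matters, one option is to pass `random_state`, and `stratify` keeps the class ratio the same in both parts. A sketch on a made-up DataFrame (`toy` and its contents are hypothetical):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 20 titles, half of them labeled positive.
toy = pd.DataFrame({
    "data": ["title %d" % i for i in range(20)],
    "target": [i % 2 == 0 for i in range(20)],
})

toy_train, toy_test = train_test_split(
    toy,
    test_size=0.2,           # hold out 20% of the rows
    random_state=0,          # fixed seed makes the split repeatable
    stratify=toy["target"],  # preserve the positive/negative ratio
)
print(len(toy_train), len(toy_test), toy_test["target"].sum())
```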

The following defines a pipeline that first converts each submission title to a bag-of-words vector and then classifies it with a linear SVM (`SGDClassifier` with hinge loss, trained by stochastic gradient descent).

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier

pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', SGDClassifier(loss='hinge', penalty='l2',
                          alpha=1e-3, random_state=42,
                          max_iter=5, tol=None))])
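
To make the bag-of-words step concrete, the sketch below runs `CountVectorizer` on two made-up titles. With its default settings it lowercases the text and keeps tokens of two or more word characters, so the single-letter word "a" is dropped:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up titles for illustration.
titles = ["Show HN: a tiny Ray demo", "Ray Tune demo"]

vect = CountVectorizer()
counts = vect.fit_transform(titles)  # sparse matrix: one row per title

# Column order follows the alphabetically sorted vocabulary.
vocabulary = sorted(vect.vocabulary_)
matrix = counts.toarray().tolist()
print(vocabulary)
print(matrix)
```

Each row counts how often each vocabulary word occurs in the corresponding title; word order within the title is discarded.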

In [None]:
result = pipeline.fit(train.data, train.target)

In [None]:
predicted = pipeline.predict(test.data)

In [None]:
np.mean(predicted == test.target)
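
Accuracy is easier to interpret next to a trivial baseline: a classifier that always predicts the most common class. A sketch with made-up labels and predictions, assuming boolean targets as above:

```python
import numpy as np

# Made-up labels and predictions for illustration only.
y_true = np.array([False, False, False, True, False, True, False, False])
y_pred = np.array([False, False, True, True, False, False, False, False])

accuracy = np.mean(y_pred == y_true)

# Always predicting the majority class scores the larger class frequency.
positive_rate = np.mean(y_true)
baseline = max(positive_rate, 1 - positive_rate)
print(accuracy, baseline)
```

In this toy case the predictions do no better than the baseline, which is the kind of situation a raw accuracy number alone would hide.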

### Hyperparameter tuning

In [None]:
from ray import tune

In [None]:
def train_func(config, reporter):
    pipeline = Pipeline([
        ('vect', CountVectorizer()),
        ('clf', SGDClassifier(loss='hinge', penalty='l2',
                              alpha=config["alpha"], random_state=42,
                              max_iter=5, tol=None))])

    pipeline.fit(train.data, train.target)
    # Report accuracy on the held-out test set rather than on the training data.
    reporter(mean_accuracy=np.mean(pipeline.predict(test.data) == test.target))

In [None]:
all_trials = tune.run(
    train_func,
    name="quick-start",
    stop={"mean_accuracy": 0.99},  # accuracy is a fraction in [0, 1]
    config={"alpha": tune.grid_search([1e-1, 1e-3, 1e-5])}
)
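
Conceptually, the grid search trains one model per value of `alpha` and records its accuracy. The plain-Python sketch below shows that loop without Tune, on a handful of made-up titles and labels. It illustrates the idea only and omits the parallel scheduling that Tune provides:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier

# Made-up training data for illustration only.
titles = ["ray tune release", "new ray release", "cat pictures",
          "more cat pictures", "ray cluster tips", "funny cat video"]
labels = np.array([True, True, False, False, True, False])

accuracy_by_alpha = {}
for alpha in [1e-1, 1e-3, 1e-5]:
    model = Pipeline([
        ('vect', CountVectorizer()),
        ('clf', SGDClassifier(loss='hinge', penalty='l2',
                              alpha=alpha, random_state=42,
                              max_iter=5, tol=None))])
    model.fit(titles, labels)
    # Accuracy on the toy training data, used here only to show the loop.
    accuracy_by_alpha[alpha] = float(np.mean(model.predict(titles) == labels))

print(accuracy_by_alpha)
```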