# End-to-End NLP: News Headline Classifier (Local Version)

This notebook trains a Keras-based model to classify news headlines into four categories: Business (b), Entertainment (e), Health & Medicine (m) and Science & Technology (t).

The model is trained and evaluated here on the notebook instance itself. The follow-on notebook shows how to take advantage of Amazon SageMaker to run training and hosting on separate infrastructure.


In [None]:
# First install some libraries which might not be available across all kernels (e.g. in Studio):
!pip install ipywidgets

### Set Up Execution Role and Session

Let's start by specifying:

- The S3 bucket and prefix that you want to use for training and model data. This should be within the same region as the Notebook Instance, training, and hosting. If you don't specify a bucket, the SageMaker SDK will create a default bucket in the same region, following a pre-defined naming convention.
- The IAM role ARN used to give SageMaker access to your data. It can be fetched using the `get_execution_role` function from the SageMaker Python SDK.


In [None]:
%%time
%load_ext autoreload
%autoreload 2

import sagemaker
from sagemaker import get_execution_role

role = get_execution_role()
print(role)
sess = sagemaker.Session()


### Download News Aggregator Dataset

We will download our dataset from the public **UCI Machine Learning Repository**. The dataset is the News Aggregator Dataset, and we will use its `newsCorpora.csv` file, which contains a table of news headlines and their corresponding classes.


In [None]:
%%time
import util.preprocessing

util.preprocessing.download_dataset()


### Load and Inspect the Dataset

We will load the `newsCorpora.csv` file into a pandas DataFrame for our data processing work.


In [None]:
import os
import re
import numpy as np
import pandas as pd


In [None]:
column_names = ["TITLE", "URL", "PUBLISHER", "CATEGORY", "STORY", "HOSTNAME", "TIMESTAMP"]
df = pd.read_csv("data/newsCorpora.csv", names=column_names, header=None, delimiter="\t")
df.head()


For this exercise we'll **only use**:

- The **title** (Headline) of the news story, as our input
- The **category**, as our target variable


In [None]:
df["CATEGORY"].value_counts()


The dataset has four article categories: Business (b), Entertainment (e), Health & Medicine (m) and Science & Technology (t).


## Natural Language Pre-Processing

We'll do some basic processing of the text data to convert it into numerical form that the algorithm will be able to consume to create a model.

We'll apply typical pre-processing steps for NLP workloads: dummy encoding the labels, tokenizing the documents, and padding them to a fixed sequence length so that every input vector has the same dimension.


### Dummy Encode the Labels


In [None]:
encoded_y, labels = util.preprocessing.dummy_encode_labels(df, "CATEGORY")
print(labels)


In [None]:
# Raw category code of the first headline, for comparison with encoded_y[0] below
df["CATEGORY"][0]

In [None]:
encoded_y[0]
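
To make the encoding concrete, here's a minimal sketch of dummy (one-hot) encoding in plain Python. The internals of `util.preprocessing.dummy_encode_labels` aren't shown in this notebook, so the `one_hot_encode` helper and the toy category list below are illustrative assumptions, not the actual implementation:

```python
def one_hot_encode(values):
    """Return (encoded_rows, labels) with one column per distinct label."""
    labels = sorted(set(values))  # stable, alphabetical column order
    index = {label: i for i, label in enumerate(labels)}
    encoded = []
    for value in values:
        row = [0] * len(labels)
        row[index[value]] = 1  # mark the column for this row's label
        encoded.append(row)
    return encoded, labels


categories = ["b", "t", "e", "m", "b"]  # toy category codes, not real data
encoded_y_demo, labels_demo = one_hot_encode(categories)
print(labels_demo)        # ['b', 'e', 'm', 't']
print(encoded_y_demo[1])  # [0, 0, 0, 1]
```

Each row contains exactly one `1`, in the column that matches that headline's category, which is the shape a softmax output layer expects as a target.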

### Tokenize and Set Fixed Sequence Lengths

We want to describe our inputs at the more meaningful word level (rather than individual characters), and ensure a fixed length of the input feature dimension.
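
As a rough illustration of what word-level tokenizing and padding involve, the sketch below builds a word index and converts headlines into fixed-length integer sequences. It is not the code behind `util.preprocessing.tokenize_pad_docs` (which isn't shown here): the helper names `fit_word_index` and `texts_to_padded`, the whitespace tokenization, and the toy headlines are all assumptions made for the example.

```python
from collections import Counter


def fit_word_index(texts):
    """Map each word to an integer id; id 0 is reserved for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # The most frequent words get the smallest ids, starting at 1.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


def texts_to_padded(texts, word_index, max_length):
    """Convert texts to id sequences, zero-padded at the end to max_length."""
    padded = []
    for text in texts:
        ids = [word_index[w] for w in text.lower().split() if w in word_index]
        ids = ids[:max_length]  # simplification: keep only the first max_length tokens
        padded.append(ids + [0] * (max_length - len(ids)))
    return padded


headlines = ["Markets rally after merger", "Merger talks stall"]  # toy headlines
word_index = fit_word_index(headlines)
padded_demo = texts_to_padded(headlines, word_index, max_length=6)
print(padded_demo)
```

Every row ends up with the same length, so the whole dataset can be stacked into a single 2-D array and fed to the embedding layer.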


In [None]:
padded_docs, tokenizer = util.preprocessing.tokenize_pad_docs(df, "TITLE")


In [None]:
# Raw text of the first headline, for comparison with padded_docs[0] below
df["TITLE"][0]

In [None]:
padded_docs[0]

### Import Word Embeddings

To represent our words in numeric form, we'll use pre-trained vector representations for each word in the vocabulary. In this case, we'll use pre-built GloVe word embeddings.

You could also explore training custom, domain-specific word embeddings using SageMaker's built-in [BlazingText algorithm](https://docs.aws.amazon.com/sagemaker/latest/dg/blazingtext.html). See the official [blazingtext_word2vec_text8 sample](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/introduction_to_amazon_algorithms/blazingtext_word2vec_text8) for an example notebook showing how.
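
To show how pre-trained vectors typically become an embedding matrix, here's a small sketch: row `i` of the matrix holds the vector for the word whose token id is `i`, and words without a pre-trained vector are left as zeros. The `build_embedding_matrix` helper, the 3-dimensional vectors and the toy word index are hypothetical; the actual `util.preprocessing.get_word_embeddings` function isn't shown in this notebook and may differ.

```python
def build_embedding_matrix(word_index, vectors, dim):
    """Row i holds the vector for the word with id i; unknown words stay zero."""
    # One extra row, because id 0 is reserved for padding.
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    for word, i in word_index.items():
        vector = vectors.get(word)
        if vector is not None:
            matrix[i] = list(vector)
    return matrix


# Made-up 3-dimensional vectors for illustration (not real GloVe values)
toy_vectors = {"market": [0.1, 0.2, 0.3], "merger": [0.4, 0.5, 0.6]}
toy_word_index = {"merger": 1, "market": 2, "zxqv": 3}  # "zxqv" has no vector

matrix = build_embedding_matrix(toy_word_index, toy_vectors, dim=3)
print(len(matrix), len(matrix[0]))  # 4 3
```

The number of rows in such a matrix is what the notebook later uses as `vocab_size` when defining the embedding layer.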


In [None]:
%%time
embedding_matrix = util.preprocessing.get_word_embeddings(tokenizer, "data/embeddings")


In [None]:
np.save(
    file="./data/embeddings/docs-embedding-matrix",
    arr=embedding_matrix,
    allow_pickle=False,
)
vocab_size = embedding_matrix.shape[0]
print(embedding_matrix.shape)


### Split Train and Test Sets

Finally we need to divide our data into model training and evaluation sets:


In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    padded_docs,
    encoded_y,
    test_size=0.2,
    random_state=42
)
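
For intuition, a shuffled 80/20 split can be sketched in a few lines of plain Python. This is only an illustration of the idea: the `shuffle_split` helper is hypothetical, and it will not reproduce the exact row ordering that scikit-learn's `train_test_split` produces for the same seed.

```python
import math
import random


def shuffle_split(features, targets, test_size=0.2, seed=42):
    """Shuffle row indices with a fixed seed, then carve off a test portion."""
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(test_size * len(indices))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(rows, idx):
        return [rows[i] for i in idx]

    return (
        pick(features, train_idx),
        pick(features, test_idx),
        pick(targets, train_idx),
        pick(targets, test_idx),
    )


toy_X = [[i, i + 1] for i in range(10)]  # toy feature rows
toy_y = [i % 2 for i in range(10)]       # toy targets
X_tr, X_te, y_tr, y_te = shuffle_split(toy_X, toy_y)
print(len(X_tr), len(X_te))  # 8 2
```

Because features and targets are selected with the same index lists, each row stays paired with its label, and fixing the seed makes the split repeatable between runs.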


In [None]:
# Do you always remember to save your datasets for traceability when experimenting locally? ;-)
os.makedirs("./data/train", exist_ok=True)
np.save("./data/train/train_X.npy", X_train)
np.save("./data/train/train_Y.npy", y_train)
os.makedirs("./data/test", exist_ok=True)
np.save("./data/test/test_X.npy", X_test)
np.save("./data/test/test_Y.npy", y_test)


## Define the Model


In [None]:
import tensorflow as tf
from tensorflow.keras.layers import Conv1D, Dense, Dropout, Embedding, Flatten, MaxPooling1D
from tensorflow.keras.models import Sequential

seed = 42
np.random.seed(seed)
tf.random.set_seed(seed)  # also seed TensorFlow's own RNG (weight init, dropout)
num_classes = len(labels)


In [None]:
model = Sequential()
model.add(Embedding(
    vocab_size,
    100,
    weights=[embedding_matrix],
    input_length=40,
    trainable=False,
    name="embed"
))
model.add(Conv1D(filters=128, kernel_size=3, activation="relu", name="conv_1"))
model.add(MaxPooling1D(pool_size=5, name="maxpool_1"))
model.add(Flatten(name="flat_1"))
model.add(Dropout(0.3, name="dropout_1"))
model.add(Dense(128, activation="relu", name="dense_1"))
model.add(Dense(num_classes, activation="softmax", name="out_1"))

# Compile the model
optimizer = tf.keras.optimizers.RMSprop(learning_rate=0.001)
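# The output is a softmax over mutually exclusive classes with one-hot targets,
# so use categorical (not binary) cross-entropy:
model.compile(optimizer=optimizer, loss="categorical_crossentropy", metrics=["acc"])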

model.summary()
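
If you'd like to sanity-check the shapes and parameter counts reported by `model.summary()`, the arithmetic can be sketched as below. It assumes the Keras defaults for these layers (`"valid"` padding, convolution stride 1, pooling stride equal to `pool_size`) and 4 output classes; the helper function names are just for this example.

```python
def conv1d_output_length(length, kernel_size, stride=1):
    """Output length of a 1-D convolution with 'valid' padding."""
    return (length - kernel_size) // stride + 1


def maxpool1d_output_length(length, pool_size):
    """Output length of 1-D max pooling with 'valid' padding, stride == pool_size."""
    return (length - pool_size) // pool_size + 1


seq_len, embed_dim, filters, n_classes = 40, 100, 128, 4

after_conv = conv1d_output_length(seq_len, kernel_size=3)      # 38 time steps
after_pool = maxpool1d_output_length(after_conv, pool_size=5)  # 7 time steps
flat_units = after_pool * filters                              # 896 features

# Trainable parameters = weights + biases for each layer
conv_params = 3 * embed_dim * filters + filters  # 38,528
dense_params = flat_units * 128 + 128            # 114,816
out_params = 128 * n_classes + n_classes         # 516

print(after_conv, after_pool, flat_units)
print(conv_params, dense_params, out_params)
```

The embedding layer adds `vocab_size * 100` further parameters, but they are frozen (`trainable=False`), so they appear under non-trainable parameters in the summary.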


## Fit (Train) and Evaluate the Model


In [None]:
%%time
# fit the model here in the notebook:
print("Training model")
model.fit(X_train, y_train, batch_size=16, epochs=5, verbose=1)
print("Evaluating model")
# TODO: Better differentiate train vs val loss in logs
scores = model.evaluate(X_test, y_test, verbose=2)
print(
    "Validation results: "
    + "; ".join(
        f"{name}={score:.5f}" for name, score in zip(model.metrics_names, scores)
    )
)
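
To make the accuracy figure concrete, the sketch below computes a categorical accuracy by hand: the fraction of rows where the class with the highest predicted probability matches the one-hot target. The helper names and the toy arrays are made up for illustration; Keras computes its own metrics internally.

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])


def categorical_accuracy(y_true, y_prob):
    """Fraction of rows whose most probable class matches the one-hot target."""
    hits = sum(argmax(t) == argmax(p) for t, p in zip(y_true, y_prob))
    return hits / len(y_true)


# Toy one-hot targets and made-up predicted probabilities (not real model output)
toy_true = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
toy_prob = [
    [0.7, 0.1, 0.1, 0.1],
    [0.2, 0.6, 0.1, 0.1],
    [0.5, 0.1, 0.3, 0.1],  # wrong: most probable class is 0, target is class 2
    [0.1, 0.1, 0.2, 0.6],
]
print(categorical_accuracy(toy_true, toy_prob))  # 0.75
```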
    

## (**JupyterLab / SageMaker Studio Only**) Installing IPyWidgets Extension

This notebook uses a fun little interactive widget to query the classifier. The widget works out of the box in plain Jupyter on a SageMaker Notebook Instance, but JupyterLab and SageMaker Studio require an extension that is not installed by default.

**If you're using JupyterLab on a SageMaker Notebook Instance**, you can install it via UI:

- Select "*Settings > Enable Extension Manager (experimental)*" from the toolbar, and confirm to enable it
- Click on the new jigsaw puzzle piece icon in the sidebar, to open the Extension Manager
- Search for `@jupyter-widgets/jupyterlab-manager` (Scroll down - search results show up *below* the list of currently installed widgets!)
- Click "**Install**" below the widget's description
- Wait for the blue progress bar that appears by the search box
- You should be prompted "*A build is needed to include the latest changes*" - select "**Rebuild**"
- The progress bar should resume, and you should shortly see a "Build Complete" dialogue.
- Select "**Reload**" to reload the webpage

**If you're using SageMaker Studio**, you can install it via CLI:

- Open a new launcher and select **System terminal** (and **not** *Image terminal*)
- Change to the repository root folder (e.g. with `cd sagemaker-workshop-101`) and check with `pwd` (print working directory)
- Run `./init-studio.sh` and refresh your browser page when the script is complete.


## Use the Model (Locally)

Let's evaluate our model with some example headlines...

If the widget gives you trouble, you can always call the `classify()` function directly from Python. Feel free to be creative with your headlines!


In [None]:
from IPython import display
import ipywidgets as widgets
# Import from tensorflow.keras, consistent with the model definition above
from tensorflow.keras.preprocessing.sequence import pad_sequences

def classify(text):
    """Classify a headline and print the results"""
    encoded_example = tokenizer.texts_to_sequences([text])
    # Pad documents to a max length of 40 words
    max_length = 40
    padded_example = pad_sequences(encoded_example, maxlen=max_length, padding="post")
    result = model.predict(padded_example)
    print(result)
    ix = np.argmax(result)
    print(f"Predicted class: '{labels[ix]}' with confidence {result[0][ix]:.2%}")

interaction = widgets.interact_manual(
    classify,
    text=widgets.Text(
        value="The markets were bullish after news of the merger",
        placeholder="Type a news headline...",
        description="Headline:",
        layout=widgets.Layout(width="99%"),
    )
)
interaction.widget.children[1].description = "Classify!"
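
If you'd prefer full category names over single-letter codes in the output, one option is a small lookup like the sketch below. The `CATEGORY_NAMES` mapping follows the four categories described earlier in this notebook; the `describe_prediction` helper, the label order and the probabilities are hypothetical.

```python
CATEGORY_NAMES = {
    "b": "Business",
    "e": "Entertainment",
    "m": "Health & Medicine",
    "t": "Science & Technology",
}


def describe_prediction(probabilities, label_codes):
    """Return the most probable category's full name and its confidence."""
    ix = max(range(len(probabilities)), key=lambda i: probabilities[i])
    code = label_codes[ix]
    # Fall back to the raw code if it has no known full name.
    return f"{CATEGORY_NAMES.get(code, code)} ({probabilities[ix]:.2%})"


toy_codes = ["b", "e", "m", "t"]       # assumed label order, for illustration
toy_probs = [0.05, 0.10, 0.04, 0.81]   # made-up probabilities
print(describe_prediction(toy_probs, toy_codes))  # Science & Technology (81.00%)
```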

## Review

In this notebook we pre-processed publicly downloadable data and trained a neural news headline classifier model, as a data scientist might normally do when working on a local machine.

...But can we use the cloud more effectively to allocate high-performance resources, and to easily deploy our trained models for use by other applications?

Head on over to the next notebook, [Headline Classifier SageMaker.ipynb](Headline%20Classifier%20SageMaker.ipynb), where we'll show how the same model can be trained and then deployed on specific target infrastructure with Amazon SageMaker.
