# End-to-end NLP: News Headline classifier

### Setup execution role and session

In [1]:
import numpy as np
import pandas as pd

Let's start by specifying:

- The S3 bucket and prefix that you want to use for training and model data. The bucket should be in the same region as the notebook instance, training jobs, and hosting endpoints. If you don't specify a bucket, the SageMaker SDK creates a default bucket in that region, following a predefined naming convention.
- The IAM role ARN used to give SageMaker access to your data. It can be fetched with the **get_execution_role** function from the SageMaker Python SDK.
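
As a rough illustration of how the bucket, prefix, and role fit together, the sketch below builds S3 URIs by hand. It assumes the default bucket name has the form `sagemaker-{region}-{account-id}`; the role ARN, region, and helper names are hypothetical and only serve the example.

```python
def default_bucket_name(region: str, account_id: str) -> str:
    """Name the default bucket is assumed to have for this account and region."""
    return f"sagemaker-{region}-{account_id}"


def account_id_from_role_arn(role_arn: str) -> str:
    """Extract the account ID, the fifth colon-separated field of an IAM ARN."""
    return role_arn.split(":")[4]


def s3_uri(bucket: str, *parts: str) -> str:
    """Join a bucket name and key parts into an s3:// URI."""
    return "s3://" + "/".join([bucket, *parts])


# Hypothetical values for illustration only.
role_arn = "arn:aws:iam::123456789012:role/service-role/ExampleSageMakerRole"
region = "us-east-1"

bucket = default_bucket_name(region, account_id_from_role_arn(role_arn))
train_uri = s3_uri(bucket, "news", "train")
print(train_uri)
```

In the notebook itself, `sess.default_bucket()` returns the bucket name directly, so a helper like this is only useful for reasoning about where artifacts will land.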

In [2]:
%%time
import sagemaker
from sagemaker import get_execution_role

role = get_execution_role()
print(role)
sess = sagemaker.Session()
#bucket = <bucket> # custom bucket name.
s3_bucket = sess.default_bucket()
s3_prefix = 'news'

arn:aws:iam::344028372807:role/service-role/AmazonSageMaker-ExecutionRole-20190212T154595
CPU times: user 552 ms, sys: 47.1 ms, total: 599 ms
Wall time: 722 ms


### Download News Aggregator Dataset available at the public UCI dataset repository

We will download the News Aggregator Dataset from the public UCI Machine Learning Repository and use its newsCorpora.csv file, which contains a table of news headlines and their corresponding categories.

In [3]:
!wget --no-check-certificate https://archive.ics.uci.edu/ml/machine-learning-databases/00359/NewsAggregatorDataset.zip 

--2020-06-07 07:10:00--  https://archive.ics.uci.edu/ml/machine-learning-databases/00359/NewsAggregatorDataset.zip
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
  Issued certificate has expired.
HTTP request sent, awaiting response... 200 OK
Length: 29224203 (28M) [application/x-httpd-php]
Saving to: ‘NewsAggregatorDataset.zip.1’


2020-06-07 07:10:02 (18.5 MB/s) - ‘NewsAggregatorDataset.zip.1’ saved [29224203/29224203]



In [4]:
!unzip NewsAggregatorDataset.zip

Archive:  NewsAggregatorDataset.zip
  inflating: 2pageSessions.csv       
   creating: __MACOSX/
  inflating: __MACOSX/._2pageSessions.csv  
  inflating: newsCorpora.csv         
  inflating: __MACOSX/._newsCorpora.csv  
  inflating: readme.txt              
  inflating: __MACOSX/._readme.txt   


In [5]:
!rm -rf __MACOSX/
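
The unzip and cleanup steps above can also be done from Python. The following is a minimal sketch using the standard-library `zipfile` module; the helper names are hypothetical, and it assumes the archive layout shown in the unzip output (data files at the top level, macOS metadata under `__MACOSX/`).

```python
import zipfile
from pathlib import Path


def real_members(archive_path):
    """List archive entries, skipping macOS resource-fork metadata."""
    with zipfile.ZipFile(archive_path) as archive:
        return [name for name in archive.namelist()
                if not name.startswith("__MACOSX/")]


def extract_member(archive_path, member, dest_dir="."):
    """Extract a single named file from a zip archive and return its path."""
    with zipfile.ZipFile(archive_path) as archive:
        return Path(archive.extract(member, path=dest_dir))


# Example usage (requires the downloaded archive in the working directory):
# extract_member("NewsAggregatorDataset.zip", "newsCorpora.csv")
```

Extracting only `newsCorpora.csv` avoids creating the `__MACOSX/` directory in the first place.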

In [6]:
ls

2pageSessions.csv                 model.tar.gz                 text8
blazingtext_word2vec_text8.ipynb  NewsAggregatorDataset.zip    tf-src/
eval.json                         NewsAggregatorDataset.zip.1  vectors.bin
headline-classifier-local.ipynb   newsCorpora.csv              vectors.txt
headline-classifier-tf.ipynb      readme.txt


### Let's visualize the dataset

We will load the newsCorpora.csv file into a pandas DataFrame for our data processing work.

In [7]:
import pandas as pd
import mxnet
import re
import numpy as np
import os

In [8]:
column_names = ["TITLE", "URL", "PUBLISHER", "CATEGORY", "STORY", "HOSTNAME", "TIMESTAMP"]
news_dataset = pd.read_csv('newsCorpora.csv', names=column_names, header=None, delimiter='\t')
news_dataset.head()

| | TITLE | URL | PUBLISHER | CATEGORY | STORY | HOSTNAME | TIMESTAMP |
|---|---|---|---|---|---|---|---|
| 1 | Fed official says weak data caused by weather,... | http://www.latimes.com/business/money/la-fi-mo... | Los Angeles Times | b | ddUyU0VZz0BRneMioxUPQVP6sIxvM | www.latimes.com | 1394470370698 |
| 2 | Fed's Charles Plosser sees high bar for change... | http://www.livemint.com/Politics/H2EvwJSK2VE6O... | Livemint | b | ddUyU0VZz0BRneMioxUPQVP6sIxvM | www.livemint.com | 1394470371207 |
| 3 | US open: Stocks fall after Fed official hints ... | http://www.ifamagazine.com/news/us-open-stocks... | IFA Magazine | b | ddUyU0VZz0BRneMioxUPQVP6sIxvM | www.ifamagazine.com | 1394470371550 |
| 4 | Fed risks falling 'behind the curve', Charles ... | http://www.ifamagazine.com/news/fed-risks-fall... | IFA Magazine | b | ddUyU0VZz0BRneMioxUPQVP6sIxvM | www.ifamagazine.com | 1394470371793 |
| 5 | Fed's Plosser: Nasty Weather Has Curbed Job Gr... | http://www.moneynews.com/Economy/federal-reser... | Moneynews | b | ddUyU0VZz0BRneMioxUPQVP6sIxvM | www.moneynews.com | 1394470372027 |
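
To make the file layout concrete, here is a small standard-library sketch of reading the same kind of tab-separated rows. The index starting at 1 in the output above suggests each line begins with a numeric ID before the seven named fields; that layout is an assumption here, and the sample rows are made up.

```python
import csv
import io

COLUMN_NAMES = ["TITLE", "URL", "PUBLISHER", "CATEGORY", "STORY", "HOSTNAME", "TIMESTAMP"]


def read_news_rows(handle):
    """Yield (row_id, record) pairs from a tab-separated news file.

    Assumes each line holds a numeric ID followed by the seven named fields.
    """
    # QUOTE_NONE keeps quotation marks inside headlines as literal characters.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        row_id, values = int(fields[0]), fields[1:]
        yield row_id, dict(zip(COLUMN_NAMES, values))


# Tiny made-up sample standing in for newsCorpora.csv.
sample = io.StringIO(
    "1\tExample headline, with a comma\thttp://example.com/a\tExample Times\tb\tstory1\texample.com\t1394470370698\n"
    "2\tAnother \"quoted\" headline\thttp://example.com/b\tExample Post\tt\tstory2\texample.com\t1394470371207\n"
)
rows = list(read_news_rows(sample))
print(rows[0][1]["TITLE"], "|", rows[0][1]["CATEGORY"])
```

Because the delimiter is a tab, commas inside headlines need no special handling.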


#### For this exercise we'll only use the title (headline) of the news story as input and the category as our target variable

In [9]:
df=news_dataset[['TITLE',"CATEGORY"]]

In [10]:
from collections import Counter
Counter(df['CATEGORY'])

Counter({'b': 115967, 't': 108344, 'e': 152469, 'm': 45639})

The dataset has four categories: Business (b), Science & Technology (t), Entertainment (e) and Health & Medicine (m).
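
The counts above are not evenly spread across the four categories. A quick sketch, using only the numbers printed by the `Counter` call, shows each class share and the accuracy a classifier would reach by always predicting the largest class; this baseline is an added reference point, not something the notebook computes.

```python
from collections import Counter

# Counts copied from the Counter output above.
category_counts = Counter({"b": 115967, "t": 108344, "e": 152469, "m": 45639})
total = sum(category_counts.values())

for label, count in category_counts.most_common():
    print(f"{label}: {count:>6} ({count / total:.1%})")

# Accuracy of always predicting the most frequent category.
majority_baseline = max(category_counts.values()) / total
print(f"majority-class baseline: {majority_baseline:.1%}")
```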

In [11]:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
leMapped=le.fit_transform(df["CATEGORY"].values)
list(le.classes_)

['b', 'e', 'm', 't']

## Natural language preprocessing

We will do some basic processing of the text data to convert it into a numerical form that the algorithm can consume to build a model.
We will apply typical preprocessing steps for NLP workloads: dummy (one-hot) encoding the labels, tokenizing the documents, and padding them to a fixed sequence length so that every input vector has the same dimension.

#### Dummy encode the labels

In [12]:
from sklearn import preprocessing
from keras.utils.np_utils import to_categorical
encoder = preprocessing.LabelEncoder()

docs = df["TITLE"].values

encoder.fit(df["CATEGORY"].values)
encoded_Y = encoder.transform(df["CATEGORY"].values)
# convert integers to dummy variables (i.e. one hot encoded)
dummy_y = to_categorical(encoded_Y)

Using MXNet backend


In [13]:
list(encoder.classes_)

['b', 'e', 'm', 't']

In [14]:
encoded_Y

array([0, 0, 0, ..., 2, 2, 2])
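
For readers who want to see what the label encoding and dummy encoding amount to, here is a plain-Python sketch of the same idea. It is not the scikit-learn or Keras implementation; the helper names and the sample labels are hypothetical.

```python
def fit_label_index(labels):
    """Map each distinct label to an integer, in sorted order."""
    return {label: i for i, label in enumerate(sorted(set(labels)))}


def one_hot(indices, num_classes):
    """Turn integer class indices into one-hot rows."""
    return [[1.0 if j == i else 0.0 for j in range(num_classes)]
            for i in indices]


labels = ["b", "b", "t", "e", "m"]  # made-up sample of CATEGORY values
label_index = fit_label_index(labels)
encoded = [label_index[label] for label in labels]
dummy = one_hot(encoded, len(label_index))

print(label_index)
print(encoded)
print(dummy[2])
```

The sorted order is why the classes come out as b, e, m, t, and why the third one-hot column stands for Health & Medicine.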

#### Tokenize documents and set fixed sequence lengths for input feature dimension.

In [15]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# prepare tokenizer
t = Tokenizer()
t.fit_on_texts(docs)
vocab_size = len(t.word_index) + 1
# integer encode the documents
encoded_docs = t.texts_to_sequences(docs)
print("Vocabulary size: " + str(vocab_size))
# pad (or truncate) each document to a fixed length of 40 tokens
max_length = 40
padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
print("Number of headlines: " + str(len(padded_docs)))

Vocabulary size: 75286
Number of headlines: 422419


In [16]:
docs[0]

'Fed official says weak data caused by weather, should not slow taper'

In [62]:
padded_docs[0]

array([ 215,  452,   25, 1062,   84, 1970,   19, 1081,  270,   37, 1412,
       7900,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0], dtype=int32)
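
The tokenize-and-pad step can be sketched in plain Python as follows. This is a simplified stand-in for the Keras `Tokenizer` and `pad_sequences`, not a reimplementation: the regular expression is an assumption, and the truncation rule (keep the last `max_length` ids) mirrors what `pad_sequences` does by default when only `padding='post'` is set.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on anything that is not a letter, digit or apostrophe."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_word_index(texts):
    """Assign indices 1..N by descending word frequency (0 is kept for padding)."""
    counts = Counter(word for text in texts for word in tokenize(text))
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


def pad_post(sequence, max_length):
    """Keep at most the last max_length ids, then right-pad with zeros."""
    trimmed = sequence[-max_length:]
    return trimmed + [0] * (max_length - len(trimmed))


docs = ["Fed official says weak data", "Fed sees weak jobs data", "Stocks fall"]
word_index = build_word_index(docs)
encoded = [[word_index[w] for w in tokenize(d)] for d in docs]
padded = [pad_post(seq, 6) for seq in encoded]

for row in padded:
    print(row)
```

Index 0 never refers to a word, which is why the vocabulary size in the notebook is `len(t.word_index) + 1`.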

### Import word embeddings

The vectors.txt file is the output of the blazingtext_word2vec_text8.ipynb notebook. It contains one vector representation per word in the vocabulary learned by that notebook; headline words that do not appear in it will have no pretrained vector.

In [18]:
# load the whole embedding into memory
embeddings_index = dict()
with open('./vectors.txt') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
print('Loaded %s word vectors.' % len(embeddings_index))

Loaded 71291 word vectors.
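
Word2vec-style text exports often begin with a header line holding the vocabulary size and vector dimension. Whether vectors.txt has one is not stated here, but if it does, a loop like the one above would store that header as if it were a word. The sketch below is one hedged way to skip such a line; `load_vectors` is a hypothetical helper and the sample data is made up.

```python
import io


def load_vectors(handle):
    """Parse word2vec-style text lines into {word: [floats]}.

    A first line made of exactly two integers is treated as the
    "<vocab_size> <dimensions>" header and skipped.
    """
    vectors = {}
    for line_number, line in enumerate(handle):
        parts = line.split()
        if not parts:
            continue  # ignore blank lines
        if line_number == 0 and len(parts) == 2 and all(p.isdigit() for p in parts):
            continue  # header row, not a word
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


sample = io.StringIO("2 3\nthe 0.1 0.2 0.3\nfed -0.5 0.0 0.25\n")
vectors = load_vectors(sample)
print(len(vectors), "vectors of dimension", len(vectors["the"]))
```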


##### Create embedding matrix

In [19]:
# create a weight matrix for words in training docs:
# one 100-dimensional row per vocabulary index; words without a pretrained vector stay all-zero
embedding_matrix = np.zeros((vocab_size, 100))
for word, i in t.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
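
Because words missing from the pretrained vectors keep an all-zero row, it can be useful to check how much of the headline vocabulary is actually covered. A minimal sketch, with a hypothetical helper and toy dictionaries standing in for `t.word_index` and `embeddings_index`:

```python
def embedding_coverage(word_index, embeddings_index):
    """Return (covered, missing) word lists for a vocabulary."""
    covered = [w for w in word_index if w in embeddings_index]
    missing = [w for w in word_index if w not in embeddings_index]
    return covered, missing


# Toy stand-ins for t.word_index and embeddings_index.
word_index = {"fed": 1, "weak": 2, "chloroquine": 3}
embeddings_index = {"fed": [0.1, 0.2], "weak": [0.3, 0.4]}

covered, missing = embedding_coverage(word_index, embeddings_index)
coverage = len(covered) / len(word_index)
print(f"{coverage:.0%} covered, missing: {missing}")
```

A low coverage figure would suggest that many headline tokens contribute nothing to the frozen embedding layer.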

### Import the Keras libraries needed to build the deep learning network

In [20]:
from keras.models import Sequential
from keras.layers import Dense, Flatten, Dropout
from keras.layers.convolutional import Conv1D, MaxPooling1D
from keras.layers.embeddings import Embedding
from keras.callbacks import ModelCheckpoint
from sklearn.model_selection import KFold

# fix random seed for reproducibility
seed = 42
np.random.seed(seed)


### Train/test split for model training and evaluation

In [21]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(padded_docs, dummy_y, test_size=0.2, random_state=42)
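
The resulting split sizes can be worked out by hand. The sketch below assumes the test count is rounded up and the training set takes the remainder, which is consistent with the 337935 training samples reported in the training log.

```python
import math

n_samples = 422419  # "Number of headlines" printed earlier
test_size = 0.2

n_test = math.ceil(test_size * n_samples)  # test count rounded up
n_train = n_samples - n_test               # training set takes the rest
print(n_train, n_test)
```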

### Build Deep Learning network

#### Build a Keras Sequential feedforward model.
#### Embedding > Conv1D > MaxPooling > Flatten > Dropout > Dense > Dense (final softmax layer)
#### The first layer is an Embedding layer initialized with our pre-built word embeddings
#### We will use a 1D convolutional layer to capture the sequential dimension of language (neighbouring words matter when classifying context)
#### We will use dropout for regularization.
#### Finally, we will use the RMSprop optimizer.
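
To make the Conv1D > MaxPooling step concrete, here is a single-channel, single-filter sketch in plain Python. It is a simplification for illustration only (the real layers operate on many channels and filters at once), and the signal and kernel values are made up.

```python
def conv1d_valid(sequence, kernel, bias=0.0):
    """Slide kernel over sequence with stride 1 and no padding, then apply ReLU."""
    k = len(kernel)
    outputs = []
    for start in range(len(sequence) - k + 1):
        window = sequence[start:start + k]
        total = sum(x * w for x, w in zip(window, kernel)) + bias
        outputs.append(max(0.0, total))
    return outputs


def max_pool1d(sequence, pool_size):
    """Take the maximum of each non-overlapping window, dropping any remainder."""
    return [max(sequence[i:i + pool_size])
            for i in range(0, len(sequence) - pool_size + 1, pool_size)]


signal = [0.0, 1.0, 2.0, 1.0, 0.0, -1.0, 3.0, 0.5]
features = conv1d_valid(signal, kernel=[1.0, 0.0, -1.0])
pooled = max_pool1d(features, pool_size=2)
print(features)
print(pooled)
```

With no padding, a kernel of size 3 shortens a sequence of length n to n - 2, and pooling with size p then divides that length by p, rounding down.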


In [22]:
# No validation data is passed to fit() below, so monitor the training accuracy;
# with 'val_acc' the monitored value would never be available and no checkpoint would be saved.
saveBestModelWeights = ModelCheckpoint("news_model_weights.h5",
                                       monitor='acc',
                                       verbose=1,
                                       save_best_only=True,
                                       save_weights_only=False,
                                       mode='auto',
                                       period=1)

# define the model
model = Sequential()
model.add(Embedding(vocab_size, 100,
                    weights=[embedding_matrix],
                    input_length=40,
                    trainable=False,
                    name="embed"))
model.add(Conv1D(filters=128,
                 kernel_size=3,
                 activation='relu',
                 name="conv_1"))
model.add(MaxPooling1D(pool_size=5,
                       name="maxpool_1"))
model.add(Flatten(name="flat_1"))
model.add(Dropout(0.3,
                  name="dropout_1"))
model.add(Dense(128,
                activation='relu',
                name="dense_1"))
model.add(Dense(le.classes_.size,
                activation='softmax',
                name="out_1"))

# compile the model
# note: for one-hot multi-class targets, 'categorical_crossentropy' is the more usual loss;
# the results logged below were produced with 'binary_crossentropy'
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['acc'])

model.summary()


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embed (Embedding)            (None, 40, 100)           7528600   
_________________________________________________________________
conv_1 (Conv1D)              (None, 128, 98)           15488     
_________________________________________________________________
maxpool_1 (MaxPooling1D)     (None, 25, 98)            0         
_________________________________________________________________
flat_1 (Flatten)             (None, 2450)              0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 2450)              0         
_________________________________________________________________
dense_1 (Dense)              (None, 128)               313728    
_________________________________________________________________
out_1 (Dense)                (None, 4)                 516       
Total para
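
The per-layer parameter counts in the summary above can be reproduced with simple arithmetic. One caveat: the 15488 figure for conv_1 only works out if the convolution sees 40 input channels, which would be the case if the backend treats the 40 sequence positions as channels (a channels-first layout). That reading is inferred from the numbers and is not stated in the notebook.

```python
# Embedding: one 100-dimensional vector per vocabulary entry.
vocab_size, embed_dim = 75286, 100
embedding_params = vocab_size * embed_dim

# Conv1D: kernel weights plus one bias per filter.
# in_channels = 40 is an inference from the summary, see the note above.
kernel_size, in_channels, filters = 3, 40, 128
conv_params = kernel_size * in_channels * filters + filters

# Dense layers: weights plus one bias per output unit.
flat_units, dense_units, num_classes = 2450, 128, 4
dense_params = flat_units * dense_units + dense_units
out_params = dense_units * num_classes + num_classes

print(embedding_params, conv_params, dense_params, out_params)
```

Pooling, flatten, and dropout layers have no trainable weights, which is why their rows show 0.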

## Train model

In [23]:
        
# fit the model
model.fit(X_train,
          y_train,
          batch_size=16,
          epochs=5,  # no benefit from additional epochs
          verbose=1,
          callbacks=[saveBestModelWeights])
    
scores = model.evaluate(X_test, y_test, verbose=0)
print("%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))




Epoch 1/5
    16/337935 [..............................] - ETA: 1:18:37 - loss: 0.5620 - acc: 0.7500


Epoch 2/5
    80/337935 [..............................] - ETA: 11:29 - loss: 0.2388 - acc: 0.9094



Epoch 3/5
Epoch 4/5
Epoch 5/5
acc: 91.93%
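
The reported accuracy deserves a careful reading. The first logged batch shows acc: 0.7500, which is exactly what a per-label thresholded accuracy gives on four classes when no prediction crosses 0.5 (three of the four label slots match). This suggests, although the notebook does not say so, that 'acc' was computed as binary accuracy, which some Keras versions choose when the loss is binary_crossentropy. The plain-Python sketch below contrasts the two metrics on made-up predictions; the function names are hypothetical.

```python
def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


def binary_accuracy(y_true, y_pred, threshold=0.5):
    """Fraction of individual label slots where the thresholded prediction matches."""
    hits = total = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            hits += int((p > threshold) == (t == 1))
            total += 1
    return hits / total


def categorical_accuracy(y_true, y_pred):
    """Fraction of rows where the highest-scoring class is the true class."""
    hits = sum(argmax(p) == argmax(t) for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


y_true = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1]]
y_pred = [[0.3, 0.4, 0.2, 0.1],   # wrong class, nothing crosses 0.5
          [0.1, 0.6, 0.2, 0.1],   # right class
          [0.4, 0.3, 0.2, 0.1]]   # wrong class, nothing crosses 0.5

bin_acc = binary_accuracy(y_true, y_pred)
cat_acc = categorical_accuracy(y_true, y_pred)
print(f"binary accuracy: {bin_acc:.3f}, categorical accuracy: {cat_acc:.3f}")
```

On these toy rows only one of three headlines is classified correctly, yet the per-label metric is far higher, so the two numbers should not be compared directly.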


# While we wait for this model to finish training, please continue to the next notebook (headline-classifier-mxnet.ipynb)

In [71]:
example_doc=['It is proven that chloroquite has entered indonesia market']
# integer encode the document
encoded_example = t.texts_to_sequences(example_doc)

# pad (or truncate) the document to a fixed length of 40 tokens
max_length = 40
padded_example = pad_sequences(encoded_example, maxlen=max_length, padding='post')

In [72]:
result = map(lambda x: float("{:.4f}".format(x)), model.predict(padded_example)[0])
print(list(result))

[0.8256, 0.0035, 0.0051, 0.1658]



The predicted probabilities follow the order of the encoder classes: Business (b), Entertainment (e), Health & Medicine (m), and Science & Technology (t). For this example the highest score, 0.8256, is for Business.
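
Turning the probability vector into a readable label is a matter of taking the position of the largest value and looking it up in the class list. A minimal sketch, using the probabilities printed above and the class order ['b', 'e', 'm', 't'] from the encoder; the helper and the name mapping are hypothetical.

```python
CLASS_NAMES = {
    "b": "Business",
    "e": "Entertainment",
    "m": "Health & Medicine",
    "t": "Science & Technology",
}


def decode_prediction(probabilities, classes):
    """Return the class code with the highest probability, and that probability."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return classes[best], probabilities[best]


classes = ["b", "e", "m", "t"]                   # order of encoder.classes_
probabilities = [0.8256, 0.0035, 0.0051, 0.1658]  # model output shown above

label, confidence = decode_prediction(probabilities, classes)
print(f"{CLASS_NAMES[label]} ({label}): {confidence:.4f}")
```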
