# Multi LSTM Model with Pre-Trained Word Embedding
In this notebook we first look at the dataset and prepare it, then use [nnlm-en-dim128](https://tfhub.dev/google/nnlm-en-dim128/2) from TensorFlow Hub, a token-based text embedding trained on the English Google News 200B corpus.

## The Dataset
The [Guardian News Articles](https://www.kaggle.com/datasets/adityakharosekar2/guardian-news-articles) dataset on Kaggle was used for genre (section) classification. Because the dataset is large (~150,000 articles, >700 MB), only a portion of it was used.

Aditya Kharosekar, the author of the dataset, stated that no special preprocessing was done, so that users can preprocess the data in any way they wish. About 0.3% of rows were found to be corrupted when inspected in Excel. The labels were also found to be categorical strings.

There were 164 unique section names. The two feature columns of interest for news classification were webTitle and bodyContent. webTitle was chosen as the feature to work with because bodyContent contained too many troublesome characters. webTitle also has fewer characters, so training time should be slightly reduced.

### Removed corrupted rows
The Id on the last row read 149,839. After cleaning, 149,723 rows remained, meaning 116 were removed.

### Encoded categorical labels to an ordinal list
164 section names were found. A separate key map file called 'guardian_articles_labels.csv' was created, and the original dataset received a new column containing the ordinal label for each row. Any N/A values were corrected.
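The encoding step above was done outside the notebook. A minimal pandas sketch of an equivalent encoding follows; the toy rows and the variable names `articles` and `labels_lookup` are assumptions for illustration, not the author's actual procedure.

```python
import pandas as pd

# Toy stand-in for the article table; the real data has 164 section names.
articles = pd.DataFrame({
    "webTitle": ["a", "b", "c", "d"],
    "sectionName": ["US news", "World news", "US news", "Football"],
})

# factorize assigns 0-based integer codes in order of first appearance;
# add 1 so that labels start at 1, as in the lookup table loaded later.
codes, sections = pd.factorize(articles["sectionName"])
articles["labels"] = codes + 1

# Key map from section name to ordinal label.
labels_lookup = pd.DataFrame({
    "sectionName": sections,
    "labels": range(1, len(sections) + 1),
})
# labels_lookup.to_csv("guardian_articles_labels.csv")  # file name taken from the text
```

Keeping the lookup table separate means a predicted label can always be mapped back to its section name.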

### Got 10% of data from the large dataset
In Excel, the RAND() function was used to create a new column of random numbers, and the entire dataset was then sorted by that column in ascending order. The top 10% of rows (149,723 × 10% ≈ 14,972) were selected and exported to a separate file called guardian_articles_ten_perc.csv. This file is used for further processing.
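The same kind of random 10% sample can be drawn directly in pandas. This is a sketch rather than the procedure actually used: the stand-in frame `full` and the fixed `random_state` are assumptions added for illustration and reproducibility.

```python
import pandas as pd

# Hypothetical stand-in for the cleaned dataset (the real one has 149,723 rows).
full = pd.DataFrame({"Id": range(1, 1001)})

# Draw 10% of the rows uniformly at random without replacement,
# which mirrors sorting by a random column and keeping the top tenth.
sample = full.sample(frac=0.10, random_state=0)
n_rows = len(sample)

# sample.to_csv("guardian_articles_ten_perc.csv", index=False)
```

Unlike the spreadsheet approach, a fixed seed makes the sample reproducible.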

### Merged webTitle with bodyContent
To simplify the dataset, all columns except webTitle, bodyContent and label were removed. Two smaller datasets were created to compare results. The first has webTitle and bodyContent combined in one column, separated by a space. The second has webTitle only.
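As a rough illustration of the merge described above, the two variants could be built like this in pandas. The toy rows, and the choice to treat a missing body as an empty string, are assumptions.

```python
import pandas as pd

df_small = pd.DataFrame({
    "webTitle": ["Title one", "Title two"],
    "bodyContent": ["Body one.", None],
    "labels": [1, 2],
})

# Variant 1: title and body joined by a single space.
# A missing body is replaced by "" (an assumption) so the result is never NaN.
merged = pd.DataFrame({
    "webTitle_bodyContent": (
        df_small["webTitle"] + " " + df_small["bodyContent"].fillna("")
    ).str.strip(),
    "labels": df_small["labels"],
})

# Variant 2: title only.
title_only = df_small[["webTitle", "labels"]]
```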

In [None]:
!pip install --upgrade tensorflow_hub

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
#get data from google drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# Libraries
import pandas as pd
import numpy as np
from collections import Counter
import tensorflow as tf
from tensorflow.keras.layers import TextVectorization
from tensorflow.keras import layers

#word cleaning
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
import re

# save numpy array of tokens as csv file
from numpy import asarray
from numpy import savetxt

#used for transfer learning
import tensorflow_hub as hub

#plotting performance
import matplotlib.pyplot as plt



In [None]:

nltk.download('stopwords') #to remove common words
nltk.download('wordnet') #for WordNetLemmatizer

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [None]:
#149723 * 10% = 14,972
df = pd.read_csv('/content/drive/MyDrive/data/guardian_articles_formatted_merged_ten_percent.csv') # guardian_articles_ten_perc_webTitle
labels_lookup_table = pd.read_csv('/content/drive/MyDrive/data/guardian_articles_labels.csv')

In [None]:
df.head()

Unnamed: 0,webTitle_bodyContent,labels
0,Lost in showbiz: Marina Hyde's 2016 quiz,17
1,Tony Blair: return to dark 1930s politics no l...,11
2,'Prince far more royal than the Queen' says Mo...,21
3,Japan care home attack: picture emerges of mod...,2
4,Biden urged to scrap Trump â€˜Remain in Mexico...,1


In [None]:
labels_lookup_table.head()

Unnamed: 0,sectionName,labels
0,US news,1
1,World news,2
2,Football,3
3,Sport,4
4,Television & radio,5


In [None]:
df.labels.value_counts()

2      1498
10     1264
3      1090
4       975
8       814
       ... 
102       1
132       1
87        1
52        1
73        1
Name: labels, Length: 76, dtype: int64

In [None]:
#percentage of data imported from guardian dataset
print(str(round(len(df.index)/149723, 2)*100) + "%")

10.0%


In [None]:
#check missing data
df.isnull().sum() #none missing

webTitle_bodyContent    0
labels                  0
dtype: int64

In [None]:
df.webTitle_bodyContent[50]

'Politics are a matter of life and death. No wonder more Scots want to leave the UK | Adam Ramsay No one should be surprised that support for Scottish independence is surging. A year ago this week, Conservatives rallied behind a leader capable of rekindling an old flame. A leader whose very purpose was to assert against all the evidence that England or Britain â€“ theyâ€™re never sure which â€“ stands alone against the forces of history. A leader who would inflame the UKâ€™s oldest arguments. Europe looked on in horror. In Scotland, we gritted our teeth. During the 2019 general election, called five months after Johnson took the helm of the Tory party, he lost more than half of his partyâ€™s seats in Scotland. Rather than look reality in the eye and follow the example of its neighbour, England set its house on fire, giving Johnson a thumping win. Our homes are semidetached, so when you burn, so do we.\nIt was clear that the Corbyn project was dead, and with it any chance to seriously r

In [None]:
labels_lookup_table.loc[labels_lookup_table['labels'] == df.labels[50]]['sectionName'].values[0]

'Opinion'

In [None]:
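REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')  # punctuation that is replaced by a space
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')  # any other character is removed
STOPWORDS = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()
truncate = 255  # maximum number of characters kept per example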

In [None]:
def clean_text(text): #ref:https://towardsdatascience.com/multi-class-text-classification-with-lstm-1590bee1bd17
    """
        text: a string
        
        return: modified initial string
    """
    text = text.lower() # lowercase text
    text = REPLACE_BY_SPACE_RE.sub(' ', text) # replace REPLACE_BY_SPACE_RE symbols by space in text. substitute the matched string in REPLACE_BY_SPACE_RE with space.
    text = BAD_SYMBOLS_RE.sub('', text) # remove symbols which are in BAD_SYMBOLS_RE from text. substitute the matched string in BAD_SYMBOLS_RE with nothing.
    text = ' '.join(word for word in text.split() if word not in STOPWORDS) # remove stopwords from text
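    # NOTE: this call has no effect as written. lemmatize() returns its result
    # (discarded here) and expects a single word, so the text stays unlemmatized.
    lemmatizer.lemmatize(text)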
    text = text[:truncate] # reduce string length
    return text
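`WordNetLemmatizer.lemmatize` works on one word at a time and returns the lemma rather than modifying its argument, so lemmatizing a sentence means applying it to each token and keeping the result. The sketch below shows that pattern. `lemmatize_tokens` and `toy_lemmatize` are hypothetical helpers; the toy function only stands in for `lemmatizer.lemmatize` so the example runs without the WordNet data.

```python
def lemmatize_tokens(text, lemmatize):
    """Apply a single-word lemmatizer to every whitespace-separated token."""
    return " ".join(lemmatize(word) for word in text.split())


def toy_lemmatize(word):
    # Illustration only: strips a trailing "s" from longer words.
    # In the notebook, pass lemmatizer.lemmatize instead.
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


result = lemmatize_tokens("scots want leave conservatives", toy_lemmatize)
```

Inside `clean_text` this would be used as `text = lemmatize_tokens(text, lemmatizer.lemmatize)`, placed before the truncation step. Note that applying it would change the cleaned strings shown in the outputs of this notebook.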

In [None]:
df.iloc[:, 0] = df.iloc[:, 0].apply(clean_text)

In [None]:
df.webTitle_bodyContent[50]

'politics matter life death wonder scots want leave uk adam ramsay one surprised support scottish independence surging year ago week conservatives rallied behind leader capable rekindling old flame leader whose purpose assert evidence england britain theyr'

In [None]:
# number of distinct cleaned documents (rows), used here as the vocabulary-size setting
# note: this counts unique strings, not unique words
max_words = len(set(df['webTitle_bodyContent'].values))
print(max_words)

14953


In [None]:
# Maximum number of characters per row (spaces included), i.e. the longest example
# use to pad the output
seq_len = int(df.iloc[:, 0].map(len).max())
print(seq_len) #max length of each example

255
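The two figures above are computed over whole strings and characters. If word-level counts are wanted instead (for example to size a `TextVectorization` vocabulary), they can be obtained by splitting on whitespace. This is a minimal sketch on toy documents; `docs`, `vocab_size` and `max_tokens` are illustrative names.

```python
from collections import Counter

# Toy stand-in for df['webTitle_bodyContent'] after cleaning.
docs = [
    "lost showbiz marina hydes 2016 quiz",
    "tony blair return dark 1930s politics",
    "lost quiz",
]

# Frequency of every whitespace-separated token across all documents.
token_counts = Counter(token for doc in docs for token in doc.split())

vocab_size = len(token_counts)                      # number of distinct words
max_tokens = max(len(doc.split()) for doc in docs)  # longest document in words
```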


## Text to Vector to Embeddings
The cleaned text is passed to a pretrained embedding layer, which maps each example to a fixed-size vector.

In [None]:
max_features = int(max_words)  # Maximum vocab size.
max_len = seq_len
#df['webTitle_bodyContent']

In [None]:
X = df['webTitle_bodyContent']

In [None]:
# inspect the feature series: cleaned text, one example per row
# (the train, validation and test splits are created further below)
print(X)
print(type(X))
print(X.shape)

0                      lost showbiz marina hydes 2016 quiz
1        tony blair return dark 1930s politics longer f...
2        prince far royal queen says morrissey special ...
3        japan care home attack picture emerges modest ...
4        biden urged scrap trump remain mexico migrant ...
                               ...                        
14967    ups downs british gas homecare contract britis...
14968    two nights milo yiannopouloss campus tour offe...
14969    share views uber losing licence operate london...
14970    budapest festival orchestra fischer review tho...
14971    gvc faces shareholder rebellion 67m paid two b...
Name: webTitle_bodyContent, Length: 14972, dtype: object
<class 'pandas.core.series.Series'>
(14972,)


In [None]:
X[0]

'lost showbiz marina hydes 2016 quiz'

In [None]:
# one-hot encode the labels; only sections present in the sample get a column
#y = df['labels'].values
y = pd.get_dummies(df['labels']).values
print(y.shape)

(14972, 76)
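Because only 76 of the 164 labels occur in the sample, a column index in `y` is not the same as a label value. To map a prediction back to a section, the column order produced by `pd.get_dummies` has to be kept. A small sketch with toy labels (all names here are illustrative):

```python
import pandas as pd

labels = pd.Series([2, 10, 3, 2, 102])  # toy label column

one_hot = pd.get_dummies(labels, dtype="float32")
columns = one_hot.columns.to_list()  # distinct labels in sorted order
y_toy = one_hot.values

# Decode the last row: argmax gives the column index, columns gives the label.
predicted_index = int(y_toy[4].argmax())
predicted_label = columns[predicted_index]
```

The decoded label can then be looked up in `labels_lookup_table` to obtain the section name.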


## Split sample into 80% training, 10% test & 10% validation datasets
Next, 10% of the data was split off for testing and 10% for validation; the remaining 80% was used as training data.

In [None]:
from sklearn.model_selection import train_test_split

# first split off the test set (90:10)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

# then split the remaining 90% into training and validation;
# 0.1111 of 90% is about 10% of the full sample
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.1111, random_state=42)

print("Training data shape:", X_train.shape)
print("Training labels shape:", y_train.shape)
print("Validation data shape:", X_val.shape)
print("Validation labels shape:", y_val.shape)
print("Test data shape:", X_test.shape)
print("Test labels shape:", y_test.shape)


Training data shape: (11977,)
Training labels shape: (11977, 76)
Validation data shape: (1497,)
Validation labels shape: (1497, 76)
Test data shape: (1498,)
Test labels shape: (1498, 76)
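The second split uses 0.1111 because 10% of the full sample is one ninth (0.1 / 0.9) of the 90% that remains after the test split. The arithmetic below reproduces the split sizes, assuming the held-out part of each split is rounded up and the rest goes to training, which is consistent with the shapes printed above.

```python
import math

n_total = 14972

# First split: 10% held out for testing.
n_test = math.ceil(n_total * 0.1)
n_remaining = n_total - n_test

# Second split: 0.1111 of the remainder, roughly 0.1 / 0.9 of it.
val_fraction = 0.1 / 0.9
n_val = math.ceil(n_remaining * 0.1111)
n_train = n_remaining - n_val
```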


In [None]:
max_words = 14953

In [None]:
seq_len = 128 #255

## Multi-layer LSTM Model

In [None]:
ptwe_multi_LSTM_model = tf.keras.Sequential()

In [None]:
# Add the pretrained word embeddings layer
hub_layer = hub.KerasLayer("https://tfhub.dev/google/nnlm-en-dim128/2", input_shape=[], 
                           dtype=tf.string, trainable=True)

In [None]:
#check working
X_train[:1]

5533    fresh pressure theresa may brexit battle moves...
Name: webTitle_bodyContent, dtype: object

In [None]:
hub_layer(X_train[:1])

<tf.Tensor: shape=(1, 128), dtype=float32, numpy=
array([[ 0.20057896,  0.14303227,  0.06468769, -0.10665905, -0.0268971 ,
        -0.06172035,  0.0906657 ,  0.00424548, -0.10174389,  0.15166107,
         0.05054346,  0.04557198,  0.04096015, -0.07246089,  0.0500988 ,
         0.03589746, -0.26670226, -0.12244248,  0.14849818,  0.06931599,
        -0.03663989,  0.02653947,  0.06096202, -0.00818487,  0.1430749 ,
        -0.2994158 ,  0.21570113, -0.065112  , -0.10153643,  0.03918028,
         0.06017835, -0.32528248,  0.167346  ,  0.19237624,  0.1337378 ,
        -0.03682581, -0.09635773,  0.08444913, -0.08154714,  0.06724647,
        -0.3122109 , -0.00654316,  0.02008086,  0.02234356,  0.04416066,
         0.12768754, -0.18069619,  0.14701657,  0.21150202,  0.07597362,
         0.13863339, -0.04458964,  0.04293177, -0.00713998, -0.03755633,
         0.15221745,  0.02339762, -0.05930477, -0.11120779,  0.13369367,
        -0.08882883,  0.09813786,  0.2540373 , -0.06764533, -0.13779186,
 

In [None]:
ptwe_multi_LSTM_model.add(hub_layer)

In [None]:
ptwe_multi_LSTM_model.add(layers.Reshape((128, 1))) # LSTM expects (batch_size, timesteps, features); the hub layer returns one 128-dimensional vector per text, so each embedding component becomes a timestep with a single feature
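To make the effect of the reshape concrete, the NumPy sketch below applies the same shape change to a dummy batch. The array `batch` is a stand-in for the hub layer output, not real embeddings.

```python
import numpy as np

# Stand-in for hub_layer output: 4 texts, one 128-dimensional vector each.
batch = np.zeros((4, 128), dtype=np.float32)

# Same shape change as layers.Reshape((128, 1)): every embedding component
# becomes one timestep carrying a single feature.
sequence = batch.reshape(-1, 128, 1)
```

The LSTM therefore steps over the 128 components of a single document embedding rather than over the words of the text.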

In [None]:
ptwe_multi_LSTM_model.add(layers.LSTM(82, return_sequences=True))

In [None]:
ptwe_multi_LSTM_model.add(layers.LSTM(82))

In [None]:
ptwe_multi_LSTM_model.add(layers.Dense(y.shape[1], activation='softmax'))

In [None]:
ptwe_multi_LSTM_model.compile(loss='categorical_crossentropy', optimizer='adam', metrics = ['accuracy'])

In [None]:
ptwe_multi_LSTM_model.summary()

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 keras_layer_3 (KerasLayer)  (None, 128)               124642688 
                                                                 
 reshape_1 (Reshape)         (None, 128, 1)            0         
                                                                 
 lstm_2 (LSTM)               (None, 128, 82)           27552     
                                                                 
 lstm_3 (LSTM)               (None, 82)                54120     
                                                                 
 dense (Dense)               (None, 76)                6308      
                                                                 
Total params: 124,730,668
Trainable params: 124,730,668
Non-trainable params: 0
_________________________________________________________________
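The parameter counts in the summary can be checked by hand. The sketch below uses the standard count for an LSTM layer with bias (four gates, each with input weights, recurrent weights and a bias vector) and for a dense layer; the embedding count is copied from the summary rather than derived.

```python
def lstm_params(units, input_dim):
    # Four gates, each with input weights, recurrent weights and a bias vector.
    return 4 * (units * (input_dim + units) + units)


def dense_params(inputs, outputs):
    # Weight matrix plus one bias per output unit.
    return inputs * outputs + outputs


embedding = 124_642_688            # keras_layer_3, taken from the summary
lstm_first = lstm_params(82, 1)    # input is (128, 1): one feature per timestep
lstm_second = lstm_params(82, 82)  # input is the 82-unit output of the first LSTM
dense = dense_params(82, 76)       # 76 output classes
total = embedding + lstm_first + lstm_second + dense
```

Almost all trainable parameters sit in the embedding layer, which is updated during training because `trainable=True`.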


In [None]:
ptwe_multi_LSTM_history = ptwe_multi_LSTM_model.fit(X_train, y_train, validation_data=(X_val,y_val), epochs=10)

Epoch 1/10

In [None]:
loss_and_metrics = ptwe_multi_LSTM_model.evaluate(X_test, y_test, verbose=2)
print("Test Loss", loss_and_metrics[0])
print("Test Accuracy", loss_and_metrics[1])

In [None]:
# Plot training & validation accuracy values
plt.plot(ptwe_multi_LSTM_history.history['accuracy'])
plt.plot(ptwe_multi_LSTM_history.history['val_accuracy'])
plt.title('PTWE Multi LSTM Model Accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='lower right')
plt.show()

In [None]:
# Plot training & validation loss values
plt.plot(ptwe_multi_LSTM_history.history['loss'])
plt.plot(ptwe_multi_LSTM_history.history['val_loss'])
plt.title('PTWE Multi LSTM Model Loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='upper right')
plt.show()

In [None]:
ptwe_multi_LSTM_model.save('/content/drive/MyDrive/data/ptwe_multi_lstm_model')

## Part 1 News Article Section Prediction Task Checklist

Compare LSTM and Basic RNN models

Compare a single layer LSTM implementation to multi-layer LSTM implementations

Compare Embeddings learned on the fly to pre-trained word embedding available from the Tensorflow Hub or HuggingFace.

Compare embeddings based approaches to a more traditional text encoding approach.

Use of CNNs with multiple and heterogeneous kernel sizes as an alternative to an LSTM solution

Use of CNNs with multiple and heterogeneous kernel sizes as an additional layer before an LSTM solution

Compare the performance of one of your best performing neural models against a non-neural method (e.g. Random Forests).

build a version of a model that uses both the text of an article and the web title to predict the section heading

Save best model based on non pre-trained embeddings

Save best model based on pre-trained embeddings

A link to these best performing models should be included in your submission report and a
demonstration notebook (described later) should be capable of loading these from the web as
well as your snapshot of test data, and demonstrating your test results with these models. You
will also be using these saved models in Part 2 below for training.

## Part 2 Transfer Learning Checklist

https://www.kaggle.com/datasets/yufengdev/bbc-fulltext-and-category

create a new model based on predicting the topic based on the article text

Build and evaluate models based on your best 2 models obtained from part 1 using a Transfer Learning method, but also build models from scratch

The models you build should allow some
amount of like to like comparison between the newly created models and the models imported
from Part A, and should where appropriate demonstrate approaches to Transfer Learning and
good practice in model design.

Save the best performing resulting Transfer Learning and ‘From Scratch’ models for this
dataset. Links to these models need to be supplied as part of your submission.

Your evaluation of the models should be based minimally on training and validation error and any other metrics
or methods you think appropriate. Again, the demo notebook should be capable of downloading
the models and your test data and automatically demonstrating the calculation of test value
results.

## Part 3 Writing your own news article

write a few sentences of a news article for the two most frequent genre / section types in your dataset.

build a generative model based on this dataset that outputs script excerpts that
are 10 turns

core model should be based on the use of LSTMs, but beyond this you are free to explore
whatever architecture and hyper-parameter variants that you find results in the best
performance in the language generation task

Report model performance in terms of perplexity and any other metrics or methods you find appropriate.

report of quality that is worthy for submission for publication at a national conference