# Data Preparation
In this notebook we first explore the dataset and prepare it for modelling. We then output three representations of the text: raw text, bag of words, and word vectors.

## The Dataset
The [Guardian News Articles](https://www.kaggle.com/datasets/adityakharosekar2/guardian-news-articles) dataset on Kaggle was used for genre (section) classification. Because the dataset is large (~150,000 articles, >700 MB), only a portion of it was used.

Aditya Kharosekar, the author of the dataset, stated that no special preprocessing was done, so that users can preprocess the data in any way they wish. About 0.3% of rows were found to be corrupted when inspected in Excel. The labels were also found to be categorical strings.

There were 164 unique section names. Two columns of interest for news classification were webTitle and bodyContent. webTitle was chosen as the feature to work with because bodyContent contained too many troublesome characters. webTitle also has fewer characters, so training time should be slightly reduced.

### Removed corrupted rows
The Id on the last row was 149,839. After removal, 149,723 rows remained, meaning 116 rows were removed.

### Encoded categorical labels to an ordinal list
164 section names were found. A separate key map file called 'guardian_articles_labels.csv' was created, and the original dataset received a new column containing the ordinal label for each row. Any N/A values were corrected.
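
The encoding step above was done by hand in a spreadsheet. A minimal Python sketch of the same idea follows; the function name `encode_labels` and the sample section names are illustrative only, and assigning codes in order of first appearance is an assumption. Codes start at 1 to match the lookup table shown later in this notebook.

```python
def encode_labels(section_names):
    """Map each distinct section name to an integer code, starting at 1.

    Codes are assigned in order of first appearance (an assumption; the
    original key map was built by hand in a spreadsheet).
    """
    lookup = {}
    codes = []
    for name in section_names:
        if name not in lookup:
            lookup[name] = len(lookup) + 1
        codes.append(lookup[name])
    return codes, lookup


sections = ["US news", "World news", "US news", "Football"]
codes, lookup = encode_labels(sections)
print(codes)   # [1, 2, 1, 3]
print(lookup)  # {'US news': 1, 'World news': 2, 'Football': 3}
```

The `lookup` dictionary plays the role of the key map file, and `codes` is the new label column.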

### Got 10% of data from the large dataset
In Excel, the RAND() function was used to create a new column of random numbers, and the entire dataset was then sorted by that column in ascending order. The top 14,972 rows (10% of 149,723) were selected and exported into a separate file called guardian_articles_ten_perc.csv. This file is used for further processing.
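
The sampling above was carried out in a spreadsheet. The sketch below mirrors the same procedure in Python (random key per row, sort, keep the top 10%). The function name `sample_fraction`, the seed value, and the stand-in `rows` list are assumptions for illustration.

```python
import random


def sample_fraction(rows, fraction=0.10, seed=42):
    """Return a random subset of `rows` containing `fraction` of the items.

    Mirrors the spreadsheet approach: attach a random key to every row,
    sort by that key, and keep the top int(len(rows) * fraction) rows.
    """
    rng = random.Random(seed)  # fixed seed (assumed) makes the sample repeatable
    keyed = [(rng.random(), row) for row in rows]
    keyed.sort(key=lambda pair: pair[0])
    n_keep = int(len(rows) * fraction)
    return [row for _, row in keyed[:n_keep]]


rows = list(range(149_723))  # stand-in for the cleaned dataset rows
subset = sample_fraction(rows)
print(len(subset))  # 14972
```

With the data already in a pandas DataFrame, `df.sample(frac=0.1, random_state=42)` is a comparable one-liner, and a fixed seed makes the subset reproducible, which the spreadsheet approach is not.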

### Merged webTitle with bodyContent
To simplify the dataset, all columns except webTitle, bodyContent and label were removed. Two smaller datasets were created to test results. The first had webTitle and bodyContent combined in a single column, separated by a space. The second had just webTitle.

## Access data from Google Drive

In [None]:
#get data from google drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Import Libraries

In [None]:
# data handling
import pandas as pd
import numpy as np
from collections import Counter

# data prep and deep learning modelling
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.layers import TextVectorization
from tensorflow.keras.utils import pad_sequences
from tensorflow.keras import layers

#word cleaning
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
import re

# save numpy array of tokens as csv file
from numpy import asarray
from numpy import savetxt

In [None]:
nltk.download('stopwords') #to remove common words
nltk.download('wordnet') #for WordNetLemmatizer

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

## Import Dataset & Exploration

In [None]:
df = pd.read_csv('/content/drive/MyDrive/data/guardian_articles_formatted_merged_ten_percent.csv')
labels_lookup_table = pd.read_csv('/content/drive/MyDrive/data/guardian_articles_labels.csv')

In [None]:
gen_df = pd.read_csv('/content/drive/MyDrive/data/guardian_articles_formatted_title_body_ten_percent.csv')

In [None]:
df.head()

Unnamed: 0,webTitle_bodyContent,labels
0,Lost in showbiz: Marina Hyde's 2016 quiz,17
1,Tony Blair: return to dark 1930s politics no l...,11
2,'Prince far more royal than the Queen' says Mo...,21
3,Japan care home attack: picture emerges of mod...,2
4,Biden urged to scrap Trump â€˜Remain in Mexico...,1


In [None]:
labels_lookup_table.head()

Unnamed: 0,sectionName,labels
0,US news,1
1,World news,2
2,Football,3
3,Sport,4
4,Television & radio,5


In [None]:
df.labels.value_counts()

2      1498
10     1264
3      1090
4       975
8       814
       ... 
102       1
132       1
87        1
52        1
73        1
Name: labels, Length: 76, dtype: int64

In [None]:
#percentage of data imported from guardian dataset
print(str(round(len(df.index)/149723, 2)*100) + "%")

10.0%


In [None]:
#check missing data
df.isnull().sum() #none missing

webTitle_bodyContent    0
labels                  0
dtype: int64

In [None]:
df.webTitle_bodyContent[50]

'Politics are a matter of life and death. No wonder more Scots want to leave the UK | Adam Ramsay No one should be surprised that support for Scottish independence is surging. A year ago this week, Conservatives rallied behind a leader capable of rekindling an old flame. A leader whose very purpose was to assert against all the evidence that England or Britain â€“ theyâ€™re never sure which â€“ stands alone against the forces of history. A leader who would inflame the UKâ€™s oldest arguments. Europe looked on in horror. In Scotland, we gritted our teeth. During the 2019 general election, called five months after Johnson took the helm of the Tory party, he lost more than half of his partyâ€™s seats in Scotland. Rather than look reality in the eye and follow the example of its neighbour, England set its house on fire, giving Johnson a thumping win. Our homes are semidetached, so when you burn, so do we.\nIt was clear that the Corbyn project was dead, and with it any chance to seriously r

In [None]:
labels_lookup_table.loc[labels_lookup_table['labels'] == df.labels[50]]['sectionName'].values[0]

'Opinion'
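
The sample article above contains sequences such as `â€™` where an apostrophe is expected. This pattern is typical of UTF-8 text that was decoded as Windows-1252 at some point (an assumption about how the file was handled, not something the source confirms). A hedged sketch of a repair helper follows; `fix_mojibake` is a hypothetical name and is not used elsewhere in this notebook.

```python
def fix_mojibake(text):
    """Undo UTF-8 text that was mistakenly decoded as Windows-1252.

    If the text cannot be round-tripped (for example, it is already clean
    or contains characters outside Windows-1252), it is returned unchanged.
    """
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


garbled = "they\u00e2\u20ac\u2122re never sure"  # displays as "theyâ€™re never sure"
print(fix_mojibake(garbled))  # they’re never sure
```

The repair is all-or-nothing per string: one character that cannot be round-tripped leaves the whole string untouched, so long articles may need to be repaired in smaller pieces. Since `clean_text` deletes every non-ASCII symbol anyway, the repair mainly matters if the raw text is used elsewhere.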

## Data Cleaning

In [None]:
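REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')  # punctuation that is replaced by a space
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')  # any other character is deleted
STOPWORDS = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()
truncate = 255  # maximum number of characters kept per example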

In [None]:
def clean_text(text):  # ref: https://towardsdatascience.com/multi-class-text-classification-with-lstm-1590bee1bd17
    """
        text: a string

        return: the cleaned string, truncated to `truncate` characters
    """
    text = text.lower()  # lowercase text
    text = REPLACE_BY_SPACE_RE.sub(' ', text)  # replace characters matched by REPLACE_BY_SPACE_RE with a space
    text = BAD_SYMBOLS_RE.sub('', text)  # delete characters not allowed by BAD_SYMBOLS_RE
    text = ' '.join(word for word in text.split() if word not in STOPWORDS)  # remove stopwords
    # NOTE: the return value is discarded and lemmatize() expects a single word,
    # so this call has no effect on the text (the output below is unlemmatized).
    lemmatizer.lemmatize(text)
    text = text[:truncate]  # keep only the first `truncate` characters
    return text
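
For reference, a sketch of a variant that does apply lemmatization is given here. It is not the function used to produce the outputs in this notebook. `clean_and_lemmatize`, `toy_lemmatize` and `DEMO_STOPWORDS` are hypothetical names; the lemmatizer is passed in as a word-to-word function so the example runs without the NLTK data.

```python
import re

# Same patterns as in the notebook; the stopword set here is a tiny stand-in.
REPLACE_BY_SPACE = re.compile(r"[/(){}\[\]\|@,;]")
BAD_SYMBOLS = re.compile(r"[^0-9a-z #+_]")
DEMO_STOPWORDS = {"the", "are", "a", "of"}


def clean_and_lemmatize(text, lemmatize, stopwords=DEMO_STOPWORDS, truncate=255):
    """Clean `text` and apply `lemmatize` (a word -> word function) per token."""
    text = text.lower()
    text = REPLACE_BY_SPACE.sub(" ", text)
    text = BAD_SYMBOLS.sub("", text)
    words = [w for w in text.split() if w not in stopwords]
    words = [lemmatize(w) for w in words]  # keep the returned lemma
    return " ".join(words)[:truncate]


def toy_lemmatize(word):
    """Hypothetical stand-in for WordNetLemmatizer().lemmatize: strips a final 's'."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


print(clean_and_lemmatize("The Scots are leaving; votes matter!", toy_lemmatize))
# scot leaving vote matter
```

In this notebook, `lemmatizer.lemmatize` and `STOPWORDS` could be passed in place of the stand-ins. Doing so would change the cleaned text and therefore the vocabulary sizes reported further down.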

In [None]:
df.iloc[:, 0] = df.iloc[:, 0].apply(clean_text)

In [None]:
df.webTitle_bodyContent[50]

'politics matter life death wonder scots want leave uk adam ramsay one surprised support scottish independence surging year ago week conservatives rallied behind leader capable rekindling old flame leader whose purpose assert evidence england britain theyr'

In [None]:
# NOTE: this counts unique rows (documents), not unique words;
# the result is nevertheless used below as the vocabulary cap
max_words = len(set(df['webTitle_bodyContent'].values))
print(max_words)

14953


In [None]:
# Length of the longest row in characters (spaces included), not words.
# It is capped at 255 by the truncation in clean_text and is used below to pad the output.
seq_len = int(df.iloc[:, 0].map(len).max())
print(seq_len)  # max length of each example, in characters

255
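
The two cells above measure distinct rows (`len(set(...))` over the column) and characters (`map(len)` on strings). If word-level figures are wanted instead, a minimal sketch is shown here; `corpus_stats` and the sample `docs` are illustrative names, and whitespace tokenisation is assumed.

```python
from collections import Counter


def corpus_stats(texts):
    """Return (number of unique words, longest document length in words)."""
    word_counts = Counter()
    max_tokens = 0
    for text in texts:
        tokens = text.split()  # whitespace tokenisation, as in clean_text
        word_counts.update(tokens)
        max_tokens = max(max_tokens, len(tokens))
    return len(word_counts), max_tokens


docs = ["tony blair return dark politics", "lost showbiz quiz", "tony blair speech"]
vocab_size, longest = corpus_stats(docs)
print(vocab_size, longest)  # 9 5
```

Applied to `df['webTitle_bodyContent']`, the first value would give a word-based vocabulary size and the second a word-based sequence length.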


## Text Raw

In [None]:
X_raw = df.webTitle_bodyContent.to_numpy(dtype=str)

In [None]:
X_raw.shape

(14972,)

## Text to Bag-Of-Words

In [None]:
# create the tokenizer
t = Tokenizer()
t.fit_on_texts(df.webTitle_bodyContent)

In [None]:
# summarize what was learned
print(t.word_counts)
print(t.document_count)
print(t.word_index)
print(t.word_docs)

Output hidden; open in https://colab.research.google.com to view.

In [None]:
# encode each document as a vector whose entries are per-document word counts
X_bow = t.texts_to_matrix(df.webTitle_bodyContent.values, mode='count')

In [None]:
print(X_bow)

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 1. ... 1. 0. 0.]
 [0. 0. 0. ... 0. 1. 1.]]


In [None]:
X_bow.shape

(14972, 53780)
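
To make the count encoding concrete, here is a pure-Python sketch of the idea behind `texts_to_matrix(..., mode='count')`. It is an illustration, not the Keras implementation; `bag_of_words` is a hypothetical helper.

```python
from collections import Counter


def bag_of_words(texts):
    """Build a count matrix with one row per document.

    Column 0 is reserved and never used, so the matrix has
    len(vocabulary) + 1 columns; words are indexed from 1 in order of
    decreasing corpus frequency.
    """
    counts = Counter(word for text in texts for word in text.split())
    word_index = {w: i for i, (w, _) in enumerate(counts.most_common(), start=1)}
    matrix = [[0] * (len(word_index) + 1) for _ in texts]
    for row, text in zip(matrix, texts):
        for word in text.split():
            row[word_index[word]] += 1
    return matrix, word_index


docs = ["blair speech blair", "quiz speech"]
matrix, word_index = bag_of_words(docs)
print(word_index)  # {'blair': 1, 'speech': 2, 'quiz': 3}
print(matrix)      # [[0, 2, 1, 0], [0, 0, 1, 1]]
```

The reserved column is consistent with the shape printed above: 53,780 columns would correspond to 53,779 distinct words plus index 0.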

## Text to Vector
We use Keras's TextVectorization layer to convert the text into integer vectors.

In [None]:
max_features = int(max_words)  # Maximum vocab size.
max_len = seq_len

vectorize_layer = TextVectorization(max_tokens=max_features,
                                    ngrams=2,
                                    output_mode='int',
                                    output_sequence_length=max_len)

In [None]:
vectorize_layer.adapt(df['webTitle_bodyContent'])

In [None]:
encoder_model = tf.keras.models.Sequential()
encoder_model.add(tf.keras.Input(shape=(1,), dtype=tf.string))
encoder_model.add(vectorize_layer)

In [None]:
# wrap each document in a one-element list so the input has shape (n, 1)
X = [[text] for text in df['webTitle_bodyContent']]

In [None]:
print(len(X))
print(type(X))

14972
<class 'list'>


In [None]:
X = encoder_model.predict(X)



In [None]:
# X now holds one integer sequence of length 255 per example;
# shorter examples are padded with zeros. No train/validation/test split has been made yet.
print(X)
print(type(X))
print(X.shape)

[[  258     1  4888 ...     0     0     0]
 [  981  4995   205 ...     0     0     0]
 [ 1100   229   390 ...     0     0     0]
 ...
 [  495  1562  2466 ...     0     0     0]
 [12854   435  3857 ...     0     0     0]
 [    1   564  6226 ...     0     0     0]]
<class 'numpy.ndarray'>
(14972, 255)


In [None]:
#check that we can convert back from vector to text
vocab = vectorize_layer.get_vocabulary()

In [None]:
len(vocab) #list

14953

In [None]:
# decode the first five examples back to text
for i in range(0, 5):
    p = " ".join([vocab[int(word)] for word in X[i]])
    print(p)
print("Decoded:\n")

lost [UNK] marina [UNK] 2016 quiz [UNK] [UNK] [UNK] [UNK] [UNK]                                                                                                                                                                                                                                                    
tony blair return dark 1930s politics longer [UNK] return dark politics 1930s longer [UNK] today [UNK] nationalist populism widespread rejection [UNK] [UNK] according tony blair stark speech [UNK] house thinktank london former bri tony blair [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] tony blair [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK]                                                                                                                                                                                              
prince far royal queen says morrissey special kind [UNK] pay tribute 
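
The decoded rows above show single words followed by word pairs, along with many `[UNK]` entries. The sketch below illustrates, in plain Python, how an integer encoding with `ngrams=2`, a capped vocabulary and zero padding can produce that pattern. It is a simplified illustration rather than the Keras implementation; `ngram_tokens`, `build_vocab` and `encode` are hypothetical helpers.

```python
from collections import Counter


def ngram_tokens(text, n=2):
    """Return all unigrams followed by all n-grams up to size `n`."""
    words = text.split()
    tokens = list(words)
    for size in range(2, n + 1):
        tokens += [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]
    return tokens


def build_vocab(texts, max_tokens):
    """Index 0 is padding, index 1 is [UNK]; the rest are the most frequent tokens."""
    counts = Counter(tok for text in texts for tok in ngram_tokens(text))
    most_common = [tok for tok, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common


def encode(text, vocab, length):
    """Map tokens to indices, then truncate or zero-pad to `length`."""
    index = {tok: i for i, tok in enumerate(vocab)}
    ids = [index.get(tok, 1) for tok in ngram_tokens(text)][:length]
    return ids + [0] * (length - len(ids))


docs = ["tony blair speech", "tony blair return"]
vocab = build_vocab(docs, max_tokens=5)
print(vocab)                                # ['', '[UNK]', 'tony', 'blair', 'tony blair']
print(encode("tony blair quiz", vocab, 8))  # [2, 3, 1, 4, 1, 0, 0, 0]
```

Because unigrams and bigrams share one capped vocabulary, rarer tokens fall outside it and are mapped to index 1, which decodes as `[UNK]`. Raising `max_tokens` or dropping the bigrams would reduce the number of unknown tokens.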

In [None]:
#y = df['labels'].values
y = pd.get_dummies(df['labels']).values
print(y.shape)

(14972, 76)
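
The target has 76 columns because only 76 of the 164 sections appear in the 10% sample, so a column position is not the same as the original label value. The sketch below shows one-hot encoding with columns ordered by sorted label value, which is how `pd.get_dummies` orders them for integer labels, and how to map a column back to its label. `one_hot` and the sample labels are illustrative.

```python
def one_hot(labels):
    """One-hot encode integer labels; columns follow the sorted distinct labels."""
    classes = sorted(set(labels))
    column_of = {label: col for col, label in enumerate(classes)}
    rows = []
    for label in labels:
        row = [0] * len(classes)
        row[column_of[label]] = 1
        rows.append(row)
    return rows, classes


labels = [17, 11, 21, 2, 11]
y_demo, classes = one_hot(labels)
print(classes)    # [2, 11, 17, 21]
print(y_demo[0])  # [0, 0, 1, 0]

# Recover the original label from a one-hot row (or a model's argmax).
predicted_column = y_demo[0].index(1)
print(classes[predicted_column])  # 17
```

In the notebook, the equivalent of `classes` is available as `pd.get_dummies(df['labels']).columns`; keeping it alongside `y` allows predictions to be translated back to section names through the lookup table.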


In [None]:
# print the array
print(X)
print(y)

[[  258     1  4888 ...     0     0     0]
 [  981  4995   205 ...     0     0     0]
 [ 1100   229   390 ...     0     0     0]
 ...
 [  495  1562  2466 ...     0     0     0]
 [12854   435  3857 ...     0     0     0]
 [    1   564  6226 ...     0     0     0]]
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


## Generative Text Dataset

In [None]:
gen_df = gen_df.drop('random', axis=1)

In [None]:
# 10% of 148,614 rows is about 14,861
gen_df.head()

Unnamed: 0,bodyContent,webTitle
0,Suicide rates among university students in Eng...,Male students in England and Wales more likely...
1,"Jon Snow, Daenerys Targaryen and Cersei Lannis...",Game of Thrones season seven trailer: winter h...
2,The Bank of England must do more to ensure a g...,Bank of England must do more to secure green r...
3,Congress voted this week to allow internet ser...,Here's how to protect your internet browsing d...
4,"Since 2016, Britain and the United States have...",Trump and Johnson are getting their comeuppanc...


In [None]:
gen_df.bodyContent = gen_df.bodyContent.astype(str)
gen_df.webTitle = gen_df.webTitle.astype(str)

In [None]:
gen_df.iloc[:, 1]

0        Male students in England and Wales more likely...
1        Game of Thrones season seven trailer: winter h...
2        Bank of England must do more to secure green r...
3        Here's how to protect your internet browsing d...
4        Trump and Johnson are getting their comeuppanc...
                               ...                        
14855    Two decades after the 'Brooks Brothers riot', ...
14856                          Dame Beulah Bewley obituary
14857    Suspend Jared O'Mara over verbal abuse claim, ...
14858    Our safe haven: how we made ourselves at home ...
14859    Alex Gibney to make feature film debut with FB...
Name: webTitle, Length: 14860, dtype: object

In [None]:
gen_df.iloc[:, 0] = gen_df.iloc[:, 0].apply(clean_text)

In [None]:
gen_df.iloc[:, 1] = gen_df.iloc[:, 1].apply(clean_text)

In [None]:
X_gen = gen_df.bodyContent.to_numpy(dtype=str)

In [None]:
y_gen = gen_df.webTitle.to_numpy(dtype=str)

In [None]:
len(X_gen)

14860

In [None]:
len(y_gen)

14860

## Save Raw Text, Bag of Words and Word Vectors
We save the features and target to be used later in other notebooks

### Features

In [None]:
# save raw title-content as csv file
#savetxt('/content/drive/MyDrive/data/X_raw.csv', X_raw, delimiter=',', fmt='%s')

In [None]:
# save bag of words as csv file
#savetxt('/content/drive/MyDrive/data/X_bow.csv', X_bow, delimiter=',')

In [None]:
# save vectors as csv file
#savetxt('/content/drive/MyDrive/data/X.csv', X, delimiter=',')

In [None]:
# save cleaned body content (features for the generative dataset) as csv file
savetxt('/content/drive/MyDrive/data/X_gen.csv', X_gen, delimiter=',', fmt='%s')

### Targets

In [None]:
# save section labels as csv file
#savetxt('/content/drive/MyDrive/data/y.csv', y, delimiter=',')


01. Basic RNN model ✅
02. Single Layer LSTM ✅
03. Multi-Layer LSTM ✅
04. On-the-fly Embeddings ✅
05. Pre-trained Embeddings ✅
06. Bag of Words ✅
07. CNNs with multiple and heterogeneous kernel sizes ✅
08. CNNs with multiple and heterogeneous kernel sizes with LSTM ✅
09. Random Forests Model ✅
10. Build a model with webTitle and bodyContent
11. Save & Load best on-the-fly embedding model
12. Save & Load best pretrained embedding model

Part 2
1. Load model and train on BBC dataset

Part 3
1. Section Heading Generator  ✅

## Part 1 News Article Section Prediction Task Checklist

Compare LSTM and Basic RNN models

Compare a single layer LSTM implementation to multi-layer LSTM implementations

Compare Embeddings learned on the fly to pre-trained word embedding available from the Tensorflow Hub or HuggingFace.

Compare embeddings based approaches to a more traditional text encoding approach.

Use of CNNs with multiple and heterogeneous kernel sizes as an alternative to an LSTM solution

Use of CNNs with multiple and heterogeneous kernel sizes as an additional layer before an LSTM solution

Compare the performance of one of your best performing neural models against the non-neural method (e.g. Random Forests).

Build a version of a model that uses both the text of an article and the web title

Save best model based on non pre-trained embeddings

Save best model based on pre-trained embeddings

A link to these best performing models should be included in your submission report and a
demonstration notebook (described later) should be capable of loading these from the web as
well as your snapshot of test data, and demonstrating your test results with these models. You
will also be using these saved models in Part 2 below for training.

## Part 2 Transfer Learning Checklist

https://www.kaggle.com/datasets/yufengdev/bbc-fulltext-and-category

Create a new model predicting the topic based on the article text from the new dataset

Build and evaluate models based on your best 2 models obtained from part 1 using a Transfer Learning method, but also build models from scratch

The models you build should allow some
amount of like-for-like comparison between the newly created models and the models imported
from Part 1, and should where appropriate demonstrate approaches to Transfer Learning and
good practice in model design.

Save the best performing resulting Transfer Learning and ‘From Scratch’ models for this
dataset. Links to these models need to be supplied as part of your submission.

Your evaluation of the models should be based minimally on training and validation error and any other metrics
or methods you think appropriate. Again, the demo notebook should be capable of downloading
the models and your test data and automatically demonstrating the calculation of test value
results.

## Part 3 Writing your own news article

Write a few sentences of a news article for the two most frequent genre / section types in your dataset.

build a generative model based on this dataset that outputs script excerpts that
are 10 turns

core model should be based on the use of LSTMs, but beyond this you are free to explore
whatever architecture and hyper-parameter variants that you find results in the best
performance in the language generation task

Report model performance in terms of perplexity and any other metrics or methods you find appropriate.
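
As a reminder of what the perplexity metric measures, a minimal sketch follows. It assumes the model's probability for each true next token is available; `perplexity` and the example probabilities are illustrative, not results from this project.

```python
import math


def perplexity(token_probabilities):
    """Perplexity = exp(mean negative log-probability of the observed tokens)."""
    if not token_probabilities:
        raise ValueError("need at least one token probability")
    nll = -sum(math.log(p) for p in token_probabilities) / len(token_probabilities)
    return math.exp(nll)


# Probabilities a (hypothetical) model assigned to each true next token.
print(round(perplexity([0.25, 0.25, 0.25, 0.25]), 6))  # 4.0
print(perplexity([0.5, 0.1, 0.2]) > perplexity([0.5, 0.4, 0.3]))  # True
```

A model that always spreads probability evenly over four choices has perplexity 4, and lower values indicate better predictions. When a model is trained with categorical cross-entropy using natural logarithms, the exponential of the mean held-out loss is the same quantity.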

report of quality that is worthy for submission for publication at a national conference

# References
[embedding tut](https://www.google.com/search?q=on+the+fly+embedding+keras&sxsrf=APwXEdfEv10bndD2ZJBgAfBUHNePWkxSKA%3A1683056274303&ei=kmZRZNuPEtjhgAbN_riADA&ved=0ahUKEwjboPeasdf-AhXYMMAKHU0_DsAQ4dUDCA8&uact=5&oq=on+the+fly+embedding+keras&gs_lcp=Cgxnd3Mtd2l6LXNlcnAQAzIFCCEQoAE6CwgAEIoFEIYDELADOggIIRAWEB4QHToECCEQFToHCCEQoAEQCkoECEEYAVCnA1j1CGDtCWgBcAB4AIABVogBgQOSAQE2mAEAoAEByAEBwAEB&sclient=gws-wiz-serp#fpstate=ive&vld=cid:78ff3562,vid:8h8Z_pKyifM)