# Recurrent Neural Network

## Introduction
The goal of the project was to perform a systematic investigation of a number of Deep Learning methods in the context of text processing tasks, and benchmark these methods against classical methods where appropriate. 

This source code notebook was provided together with a report and a link to Google Colab, all contained within a zip file.

Models were trained using Keras / TensorFlow. No alternative data set or coding framework was used.

## The Dataset
The [Guardian News Articles](https://www.kaggle.com/datasets/adityakharosekar2/guardian-news-articles) dataset on Kaggle was used to perform genre (section) analysis. Because this dataset is large (~150,000 articles, >700 MB), only a proportion of it was used.

Aditya Kharosekar, the author of the dataset, stated that no special preprocessing was done, so that users of the dataset can preprocess the data in any way they wish. About 0.3% of rows were found to be corrupted when inspected in Excel. Labels were also found to be categorical strings. Two interesting feature columns for news classification were webTitle and bodyContent. webTitle was chosen as the feature to work with because bodyContent contained too many troublesome characters. webTitle also has fewer characters, which should slightly reduce training time.

### Removed corrupted rows
The ID on the last row read 149,839. After cleaning, 149,723 rows remained, meaning 116 were removed.

### Encoded categorical labels to an ordinal list
163 section names were found. A separate key map file called 'guardian_articles_labels.csv' was created, and the original dataset received a new column containing the ordinal label for each row. Any N/A values were corrected.
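The encoding above was done by hand, but an equivalent step can be sketched in pandas. This is a minimal illustration on toy data, not the procedure actually used: the column name `sectionName` and the `"Unknown"` placeholder for missing values are assumptions made for the example.

```python
import pandas as pd

# Toy stand-in for the Guardian data; "sectionName" is an assumed column name.
articles = pd.DataFrame({
    "webTitle": ["Title A", "Title B", "Title C", "Title D"],
    "sectionName": ["Politics", "Sport", None, "Politics"],
})

# Replace missing section names with an explicit placeholder (an assumption)
# so that every row receives a valid label.
articles["sectionName"] = articles["sectionName"].fillna("Unknown")

# factorize assigns integer codes in order of first appearance and also
# returns the unique names, which can be saved as the key map.
codes, section_names = pd.factorize(articles["sectionName"])
articles["labels"] = codes

label_key = pd.DataFrame({
    "labels": range(len(section_names)),
    "sectionName": section_names,
})
print(articles)
print(label_key)
```

Writing `label_key` out with `to_csv` would give a key file similar in purpose to 'guardian_articles_labels.csv'.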

### Got 10% of data from the large dataset
In Excel, the rand() function was used to create a new column of random numbers, and the entire dataset was then sorted by that column in ascending order. The top 14,972 rows, i.e. 10% of the dataset (149,723 * 10%), were selected and exported into a separate file called guardian_articles_ten_perc.csv. This file was used for further processing.
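The same kind of random 10% sample could also be drawn in pandas instead of Excel. The sketch below uses a toy frame; the seed value and the variable names are arbitrary choices for illustration.

```python
import pandas as pd

# Toy frame standing in for the cleaned dataset.
cleaned = pd.DataFrame({
    "webTitle": [f"title {i}" for i in range(100)],
    "labels": [i % 5 for i in range(100)],
})

# Draw 10% of the rows without replacement; fixing the seed (arbitrary here)
# makes the draw repeatable.
sample = cleaned.sample(frac=0.1, random_state=42)

# Keep only the columns used later and renumber the rows.
sample = sample[["webTitle", "labels"]].reset_index(drop=True)
print(len(sample))
```

Unlike sorting by a volatile rand() column, a seeded `sample` call returns the same rows each time it is run.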

### Only selected webTitle
All feature columns other than webTitle were removed; the encoded labels column was kept alongside it.

## Text to Vector


In [2]:
#get data from google drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
# Libraries
import pandas as pd
from collections import Counter
import tensorflow as tf
from tensorflow.keras.layers import TextVectorization

In [5]:
#149723 * 10% = 14,972
df = pd.read_csv('/content/drive/MyDrive/data/guardian_articles_ten_perc_webTitle.csv')

In [6]:
df.head()

Unnamed: 0,webTitle,labels
0,No Foreign Office takeover of international ai...,47
1,Eurostar calls on UK for urgent support as it ...,9
2,Boj: Gbagada Express review – a sensual Afrobe...,21
3,UK supermarkets ask suppliers for payments due...,9
4,Six Nations unions warned against putting tour...,4


In [27]:
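# Count how often each word occurs across all titles.
# Note: max(words.values()) is the frequency of the most common word,
# not the vocabulary size (which would be len(words)).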
words = Counter()
df['webTitle'].str.lower().str.split().apply(words.update)
max_words = max(words.values())
print(max_words)


4945
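The value printed above is the count of the single most frequent word, which is a different quantity from the number of distinct words. A small sketch on made-up titles shows the difference; the titles and variable names are illustrative only.

```python
from collections import Counter

titles = ["UK economy grows", "UK weather warning", "Economy warning for UK"]

word_counts = Counter()
for title in titles:
    word_counts.update(title.lower().split())

# Frequency of the most common word ("uk" appears in all three titles).
most_common_count = max(word_counts.values())

# Number of distinct words, i.e. the vocabulary size.
vocab_size = len(word_counts)

print(most_common_count, vocab_size)
```

If the intent is to size the vocabulary, `len(words)` is the quantity to pass on, whereas using the top frequency simply acts as an arbitrary cap.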


In [28]:
seq_len = int(df.webTitle.map(len).max())  # Longest title in characters (an upper bound on its token count); used as the padded sequence length
print(seq_len)

142
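Because the vectorizer pads sequences of word tokens, a tighter padding length can be obtained by counting words rather than characters. The following is a sketch on toy titles, assuming whitespace splitting is close enough to the tokenizer's behaviour.

```python
import pandas as pd

titles = pd.Series(["UK economy grows", "Economy warning for UK"])

# Longest title measured in characters (what the cell above computes).
max_chars = int(titles.str.len().max())

# Longest title measured in whitespace-separated tokens.
max_tokens = int(titles.str.split().str.len().max())

print(max_chars, max_tokens)
```

Padding to the token count instead of the character count would shorten each vector considerably, which the long runs of zeros in the output further down suggest.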


In [9]:
# Use ratios to split the dataset into training, testing, and validation sets
train_ratio = 0.8
val_ratio = 0.1
test_ratio = 0.1

train_cutoff = int(train_ratio * len(df))
val_cutoff = int((train_ratio + val_ratio) * len(df))

train_df = df.iloc[:train_cutoff]
val_df = df.iloc[train_cutoff:val_cutoff]
test_df = df.iloc[val_cutoff:]
print("**Training Dataset**")
print(str(round((len(train_df.index) /len(df.index))*100, 1)) + "%")
print("\n**Testing Dataset**")
print(str(round((len(test_df.index) /len(df.index))*100, 1)) + "%")
print("\n**Validation Dataset**")
print(str(round((len(val_df.index) /len(df.index))*100, 1)) + "%")

**Training Dataset**
80.0%

**Testing Dataset**
10.0%

**Validation Dataset**
10.0%


In [10]:
text_dataset = tf.data.Dataset.from_tensor_slices(train_df['webTitle'].to_numpy())

In [11]:
max_features = int(max_words)  # Vocabulary cap; note max_words is the top word frequency (4945), not the number of distinct words
max_len = seq_len
embedding_dims = 2
vectorize_layer = TextVectorization(max_tokens=max_features,
                                    output_mode='int',
                                    output_sequence_length=max_len)

In [12]:
vectorize_layer.adapt(text_dataset.batch(64))

In [13]:
model = tf.keras.models.Sequential()
model.add(tf.keras.Input(shape=(1,), dtype=tf.string))
model.add(vectorize_layer)

In [14]:
train_df['webTitle'][2]

'Boj: Gbagada Express review – a sensual Afrobeats celebration'

In [15]:
input_data = [["No Foreign Office takeover of international aid budget, says Priti Patel"], 
              ["Boj: Gbagada Express review – a sensual Afrobeats celebration"]]

In [18]:
predictions = model.predict(input_data)
print(predictions)

tf.Tensor(
[[  56  857  489  931    4 1089  419  540   24 2146 1799    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0]
 [   1    1 2821   12   10    7    1    1 3345    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    

In [26]:
vocab = vectorize_layer.get_vocabulary()
for sentence in predictions:
  x = " ".join([vocab[int(word)] for word in sentence])
  print(x)

no foreign office takeover of international aid budget says priti patel                                                                                                                                   
[UNK] [UNK] express review – a [UNK] [UNK] celebration                                                                                                                                     
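To make the output above easier to read, here is a simplified pure-Python imitation of what an integer-mode text vectorizer does: index 0 is padding, index 1 is the unknown token, and the remaining slots go to the most frequent words. It is only a sketch and not the Keras implementation; the helper names, the alphabetical tie-breaking rule, and the toy corpus are all assumptions.

```python
import string
from collections import Counter

def standardize(text):
    # Lowercase and drop ASCII punctuation; non-ASCII marks such as "–" survive.
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def build_vocab(texts, max_tokens):
    counts = Counter()
    for text in texts:
        counts.update(standardize(text).split())
    # Most frequent first; ties broken alphabetically (arbitrary for this sketch).
    ranked = sorted(counts, key=lambda tok: (-counts[tok], tok))
    # Two slots are reserved: index 0 for padding, index 1 for unknown tokens.
    return ["", "[UNK]"] + ranked[: max_tokens - 2]

def vectorize(text, vocab, seq_len):
    index = {tok: i for i, tok in enumerate(vocab)}
    ids = [index.get(tok, 1) for tok in standardize(text).split()][:seq_len]
    # Pad with zeros up to the fixed sequence length.
    return ids + [0] * (seq_len - len(ids))

corpus = [
    "Express review of a celebration",
    "A review of the budget",
    "Boj: a sensual celebration",
]
vocab = build_vocab(corpus, max_tokens=6)
ids = vectorize("Boj: a review", vocab, seq_len=5)
print(vocab)
print(ids)
```

In this toy run "boj" appears in the corpus but falls outside the capped vocabulary, so it maps to index 1. The same effect would explain why words from a training title decode as [UNK] above.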


In [None]:
# Libraries
import pandas as pd

In [None]:
#149723 * 10% = 14,972
df = pd.read_csv('guardian_articles_ten_perc_webTitle.csv')

### Data Exploration

In [None]:
df.head()

Unnamed: 0,webTitle,labels
0,No Foreign Office takeover of international ai...,47
1,Eurostar calls on UK for urgent support as it ...,9
2,Boj: Gbagada Express review – a sensual Afrobe...,21
3,UK supermarkets ask suppliers for payments due...,9
4,Six Nations unions warned against putting tour...,4


In [None]:
#percentage of data imported from guardian dataset
print(str(round(len(df.index)/149723, 2)*100) + "%")

10.0%


In [None]:
#check missing data
df.isnull().sum() #none missing

webTitle    0
labels      0
dtype: int64

### Split sample into 80% training, 10% test & 10% validation datasets
Next, 10% of the data was split off for testing, 10% for validation, and the remaining 80% was used as training data.

In [None]:
# Use ratios to split the dataset into training, testing, and validation sets
train_ratio = 0.8
val_ratio = 0.1
test_ratio = 0.1

train_cutoff = int(train_ratio * len(df))
val_cutoff = int((train_ratio + val_ratio) * len(df))

train_df = df.iloc[:train_cutoff]
val_df = df.iloc[train_cutoff:val_cutoff]
test_df = df.iloc[val_cutoff:]
print("**Training Dataset**")
print(str(round((len(train_df.index) /len(df.index))*100, 1)) + "%")
print("\n**Testing Dataset**")
print(str(round((len(test_df.index) /len(df.index))*100, 1)) + "%")
print("\n**Validation Dataset**")
print(str(round((len(val_df.index) /len(df.index))*100, 1)) + "%")

**Training Dataset**
80.0%

**Testing Dataset**
10.0%

**Validation Dataset**
10.0%
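The positional split above relies on the rows already being in random order, which holds here because the sample was sorted by a random column. Where that is not guaranteed, a shuffle can be added first. The helper below is a hypothetical sketch, with an arbitrary seed and toy data.

```python
import pandas as pd

def split_by_ratio(frame, train_ratio=0.8, val_ratio=0.1, seed=0):
    """Shuffle the rows, then cut into train / validation / test by position."""
    shuffled = frame.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    train_cutoff = int(train_ratio * len(shuffled))
    val_cutoff = int((train_ratio + val_ratio) * len(shuffled))
    return (
        shuffled.iloc[:train_cutoff],
        shuffled.iloc[train_cutoff:val_cutoff],
        shuffled.iloc[val_cutoff:],
    )

toy = pd.DataFrame({
    "webTitle": [f"title {i}" for i in range(50)],
    "labels": [i % 3 for i in range(50)],
})
train_part, val_part, test_part = split_by_ratio(toy)
print(len(train_part), len(val_part), len(test_part))
```

With 163 section labels and some of them rare, a stratified split may be worth considering, although classes with very few examples would need special handling.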


In [None]:
#tf.keras.preprocessing.text.Tokenizer