## Introduction
The goal of the project was to systematically investigate a number of Deep Learning methods for text processing tasks and, where appropriate, benchmark them against classical methods. The submission comprises a report, source code with a link to Google Colab (contained in the provided .zip file), and models trained using Keras / TensorFlow. No alternative dataset or coding framework was used.

In [18]:
# Libraries
import random
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from keras.preprocessing.text import Tokenizer

## The Dataset
The [Guardian News Articles dataset](https://www.kaggle.com/datasets/adityakharosekar2/guardian-news-articles) on Kaggle was used to perform genre, or more precisely section, analysis. Because the dataset is large (~150,000 articles, >700MB), the full dataset was not used. Instead, a random sample of 10%-20% of the rows was used.
### Get 10% of data from large dataset

In [30]:
random.seed(4321) #for reproducibility
#10% of 149,828 rows = 14,983 rows
df = pd.read_csv("guardian_articles.csv", skiprows=lambda x: x > 0 and random.random() >=.1)

In [43]:
#percentage of data imported from guardian dataset
print(str(round((len(df.index) /149828)*100, 2))  + "%")

10.0%
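
The sampling step can be checked on a small in-memory CSV. The sketch below is illustrative only: the toy data, the `sample_csv` helper, and the 10,000-row size are assumptions, not part of the original notebook. It shows that the `skiprows` callable always keeps the header (row 0) and keeps each later row with probability `frac`, so the sampled fraction is only approximately 10%.

```python
import io
import random

import pandas as pd


def sample_csv(source, frac, seed):
    """Read roughly `frac` of the rows of a CSV without loading it all.

    Hypothetical helper: row 0 (the header) is always kept, and every
    other row is skipped with probability 1 - frac.
    """
    rng = random.Random(seed)  # local generator, so global random state is untouched
    return pd.read_csv(source, skiprows=lambda i: i > 0 and rng.random() >= frac)


# Toy stand-in for guardian_articles.csv (assumed column names and contents).
n_rows = 10_000
csv_text = "webTitle,sectionName\n" + "".join(
    f"title {i},section {i % 5}\n" for i in range(n_rows)
)

sample = sample_csv(io.StringIO(csv_text), frac=0.1, seed=4321)
print(f"{len(sample) / n_rows:.2%} of rows kept")
```

Filtering rows while reading avoids holding the full >700MB file in memory, which is the main reason to prefer this over loading everything and then subsampling.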


In [32]:
print(df.columns) #df.head()#look at raw data

Index(['article_id', 'sectionName', 'webTitle', 'webUrl', 'bodyContent',
       'webPublicationDate', 'id'],
      dtype='object')


In [33]:
#reorder data and only take the news title and category label
df = df[['webTitle', 'sectionName']]

In [34]:
df.head()

| | webTitle | sectionName |
|---|---|---|
| 0 | British pilot in Tanzania 'manoeuvred to save... | World news |
| 1 | Jürgen Klopp hails Liverpool youngsters but re... | Football |
| 2 | Tommy Elphick turns thoughts to Harry Arter af... | Football |
| 3 | Chelsea draw Manchester City and Arsenal meet ... | Football |
| 4 | John Terry to leave Chelsea after refusal of f... | Football |


In [35]:
#check missing data
df.isnull().sum() #none missing

webTitle       0
sectionName    0
dtype: int64

### Split sample into 80% training, 10% test & 10% validation datasets
Next, 10% of the data was split off for testing, 10% for validation, and the remaining 80% was used as training data.

In [44]:
# Use ratios to split the dataset into training, testing, and validation sets
train_ratio = 0.8
val_ratio = 0.1
test_ratio = 0.1

train_cutoff = int(train_ratio * len(df))
val_cutoff = int((train_ratio + val_ratio) * len(df))

train_df = df.iloc[:train_cutoff]
val_df = df.iloc[train_cutoff:val_cutoff]
test_df = df.iloc[val_cutoff:]
print("**Training Dataset**")
print(str(round((len(train_df.index) /len(df.index))*100, 2)) + "%")
print("\n**Testing Dataset**")
print(str(round((len(test_df.index) /len(df.index))*100, 2)) + "%")
print("\n**Validation Dataset**")
print(str(round((len(val_df.index) /len(df.index))*100, 2)) + "%")

**Training Dataset**
80.0%

**Testing Dataset**
10.0%

**Validation Dataset**
10.0%
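
Because `df.iloc` slices rows in file order, the split above inherits whatever ordering the CSV has. As an alternative sketch (not the method used in this notebook), `train_test_split`, imported earlier, can produce a shuffled 80/10/10 split in two passes. The toy DataFrame, the `random_state` value, and the use of `stratify` are assumptions made for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the sampled Guardian data (assumed contents).
toy_df = pd.DataFrame({
    "webTitle": [f"title {i}" for i in range(1000)],
    "sectionName": ["Football", "World news", "Politics", "Film"] * 250,
})

# First pass: hold out 20% of the rows, keeping section proportions.
toy_train, toy_holdout = train_test_split(
    toy_df, test_size=0.2, random_state=4321, stratify=toy_df["sectionName"]
)
# Second pass: halve the hold-out into validation and test sets (10% each).
toy_val, toy_test = train_test_split(
    toy_holdout, test_size=0.5, random_state=4321,
    stratify=toy_holdout["sectionName"],
)

for name, part in [("Training", toy_train), ("Validation", toy_val), ("Test", toy_test)]:
    print(f"{name}: {len(part) / len(toy_df):.1%}")
```

Stratifying on `sectionName` keeps the section proportions similar across the three sets, which may matter if some sections are rare in the sample.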


### Prepare & embed data

In [None]:
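from keras.datasets import imdb  # missing import needed by imdb.load_data

max_features = 5000   # vocabulary size: keep the 5,000 most frequent words
maxlen = 400          # maximum sequence length, in tokens
embedding_dims = 50   # size of each word embedding vector

# NOTE: this loads the built-in IMDB reviews, not the Guardian sample prepared above
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)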
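
To make the roles of `max_features` and `maxlen` concrete, here is a minimal pure-Python sketch of the integer encoding that precedes an embedding layer: keep the most frequent words, map them to ids, and pad or truncate each sequence to a fixed length. It is not the Keras `Tokenizer` itself. The `build_vocab` and `encode` helpers, the toy titles, and the choice of 0 for padding and 1 for out-of-vocabulary words are all assumptions.

```python
from collections import Counter


def build_vocab(texts, max_features):
    """Map the most frequent words to ids 2..max_features-1 (0 = pad, 1 = unknown)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    most_common = counts.most_common(max_features - 2)
    return {word: idx for idx, (word, _) in enumerate(most_common, start=2)}


def encode(text, vocab, maxlen):
    """Convert a text to exactly `maxlen` ids.

    Only the first `maxlen` words are kept; shorter texts are padded
    on the left with 0, and unknown words become 1.
    """
    ids = [vocab.get(word, 1) for word in text.lower().split()][:maxlen]
    return [0] * (maxlen - len(ids)) + ids


titles = [
    "Chelsea draw Manchester City",
    "Chelsea beat Arsenal",
    "Manchester City beat Chelsea",
]
vocab = build_vocab(titles, max_features=6)
encoded = [encode(title, vocab, maxlen=5) for title in titles]
print(vocab)
print(encoded)
```

With `max_features=6`, only four real words fit in the vocabulary, so rarer words such as "draw" and "Arsenal" collapse to the unknown id. The same trade-off applies at larger scale when the vocabulary is capped at 5,000 words.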

## Section Heading Prediction
In this section, analysis was performed based on (1) training from scratch and (2) using pre-trained models to predict the section heading. Overfitting was minimised, hyperparameters were investigated, and functions or metrics were selected where appropriate. After training was completed, results were graphed for the training and validation data, together with the test result.

### RNN Variants: LSTM vs a basic RNN model & single-layer LSTM vs multi-layer LSTM

### Embeddings: on-the-fly vs pre-trained word embeddings (TensorFlow Hub/HuggingFace) & embeddings vs traditional text encoding