# Data Preparation

## Introduction
The goal of the project is to systematically investigate several Deep Learning methods for text processing tasks, and to benchmark them against classical methods where appropriate.

This Google Colab notebook accompanies a report contained within a zip file.

Models will be trained using Keras / TensorFlow. No alternative data set or coding framework will be used.

## The Dataset
The [Guardian News Articles](https://www.kaggle.com/datasets/adityakharosekar2/guardian-news-articles) dataset on Kaggle was used to perform genre (section) analysis. Because the dataset is large (~150,000 articles, >700 MB), only a proportion of it was used.

Aditya Kharosekar, the author of the dataset, stated that no special preprocessing was done, so that users are free to preprocess the data in any way they wish. About 0.3% of rows were found to be corrupted when inspected in Excel. Labels were also found to be categorical strings.

There were 164 unique section names. Two feature columns of interest for news classification were webTitle and bodyContent. webTitle was chosen as the feature to work with because bodyContent contained too many troublesome characters. webTitle also has fewer characters, so training time should be slightly reduced.

### Removed corrupted rows
The Id on the last row was 149,839. After cleaning, 149,723 rows remained, meaning 116 rows were removed.

### Encoded categorical labels to an ordinal list
164 section names were found. A separate key map file called 'guardian_articles_labels.csv' was created, and the original dataset received a new column containing the ordinal labels. Any N/A values were corrected.
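
The same encoding could be reproduced in pandas rather than by hand. The sketch below is only an illustration on made-up rows: the column names `sectionName` and `labels` follow the lookup table shown later in this notebook, while the toy data and the choice to number sections in order of first appearance are assumptions, not necessarily how the original file was built.

```python
import pandas as pd

# Toy stand-in for the full dataset (hypothetical rows).
articles = pd.DataFrame(
    {"sectionName": ["US news", "World news", "US news", "Football"]}
)

# factorize() assigns 0, 1, 2, ... in order of first appearance;
# adding 1 makes the ordinal labels start at 1.
codes, sections = pd.factorize(articles["sectionName"])
articles["labels"] = codes + 1

# Key map from section name to ordinal label.
labels_lookup = pd.DataFrame(
    {"sectionName": sections, "labels": range(1, len(sections) + 1)}
)
print(articles)
print(labels_lookup)
```

The key map could then be written out with `labels_lookup.to_csv(...)` so later notebooks can translate labels back to section names.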

### Got 10% of data from the large dataset
In Excel, the rand() function was used to create a new column of random numbers, and the entire dataset was then sorted by that column in ascending order. The top 14,972 rows (149,723 * 10%) were selected and exported to a separate file called guardian_articles_ten_perc.csv. This file is used for further processing.
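
A comparable random 10% subset could also be drawn directly in pandas. This is a minimal sketch on a toy frame, not the procedure that was actually used; the frame contents and the seed value `0` are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in for the cleaned dataset (hypothetical rows).
full = pd.DataFrame(
    {"id": range(50), "webTitle": [f"title {i}" for i in range(50)]}
)

# Draw 10% of the rows without replacement; random_state makes it repeatable.
sample = full.sample(frac=0.10, random_state=0)

print(len(sample))         # rows drawn from the toy frame
print(int(149723 * 0.10))  # row count expected for the real dataset
```

Sampling in code has the advantage that the subset can be regenerated from the seed, whereas a spreadsheet rand() column changes on every recalculation.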

### Merged webTitle with bodyContent
To simplify the dataset, all columns except webTitle, bodyContent and label were removed. Two smaller datasets were created to compare results. The first had webTitle and bodyContent combined in a single column, separated by a white space. The second had webTitle only.
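
The merge step can be sketched in pandas as follows. The rows are invented, and treating a missing body as an empty string is an assumption; the merged column name `webTitle_bodyContent` matches the file loaded later in this notebook.

```python
import pandas as pd

# Toy stand-in for the sampled dataset (hypothetical rows).
raw = pd.DataFrame(
    {
        "webTitle": ["Title one", "Title two"],
        "bodyContent": ["Body text.", None],
        "labels": [1, 2],
    }
)

# Dataset 1: title and body joined by a single space.
# Missing values become empty strings; stray edge whitespace is stripped.
combined = raw["webTitle"].fillna("") + " " + raw["bodyContent"].fillna("")
merged = pd.DataFrame(
    {"webTitle_bodyContent": combined.str.strip(), "labels": raw["labels"]}
)

# Dataset 2: title only.
title_only = raw[["webTitle", "labels"]]

print(merged)
print(title_only)
```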

In [1]:
#get data from google drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
# Libraries
import pandas as pd
import numpy as np
from collections import Counter
import tensorflow as tf
from tensorflow.keras.layers import TextVectorization
from tensorflow.keras import layers

#word cleaning
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
import re

# save numpy array of tokens as csv file
from numpy import asarray
from numpy import savetxt

In [3]:

nltk.download('stopwords') #to remove common words
nltk.download('wordnet') #for WordNetLemmatizer

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [4]:
#149723 * 10% = 14,972
df = pd.read_csv('/content/drive/MyDrive/data/guardian_articles_formatted_merged_ten_percent.csv') # guardian_articles_ten_perc_webTitle
labels_lookup_table = pd.read_csv('/content/drive/MyDrive/data/guardian_articles_labels.csv')

In [5]:
df.head()

| | webTitle_bodyContent | labels |
|---|---|---|
| 0 | Lost in showbiz: Marina Hyde's 2016 quiz | 17 |
| 1 | Tony Blair: return to dark 1930s politics no l... | 11 |
| 2 | 'Prince far more royal than the Queen' says Mo... | 21 |
| 3 | Japan care home attack: picture emerges of mod... | 2 |
| 4 | Biden urged to scrap Trump ‘Remain in Mexico... | 1 |


In [6]:
labels_lookup_table.head()

| | sectionName | labels |
|---|---|---|
| 0 | US news | 1 |
| 1 | World news | 2 |
| 2 | Football | 3 |
| 3 | Sport | 4 |
| 4 | Television & radio | 5 |


In [7]:
df.labels.value_counts()

2      1498
10     1264
3      1090
4       975
8       814
       ... 
102       1
132       1
87        1
52        1
73        1
Name: labels, Length: 76, dtype: int64

In [8]:
#percentage of data imported from guardian dataset
print(str(round(len(df.index)/149723, 2)*100) + "%")

10.0%


In [9]:
#check missing data
df.isnull().sum() #none missing

webTitle_bodyContent    0
labels                  0
dtype: int64

In [10]:
df.webTitle_bodyContent[50]

'Politics are a matter of life and death. No wonder more Scots want to leave the UK | Adam Ramsay No one should be surprised that support for Scottish independence is surging. A year ago this week, Conservatives rallied behind a leader capable of rekindling an old flame. A leader whose very purpose was to assert against all the evidence that England or Britain – they’re never sure which – stands alone against the forces of history. A leader who would inflame the UK’s oldest arguments. Europe looked on in horror. In Scotland, we gritted our teeth. During the 2019 general election, called five months after Johnson took the helm of the Tory party, he lost more than half of his party’s seats in Scotland. Rather than look reality in the eye and follow the example of its neighbour, England set its house on fire, giving Johnson a thumping win. Our homes are semidetached, so when you burn, so do we.\nIt was clear that the Corbyn project was dead, and with it any chance to seriously r

In [11]:
labels_lookup_table.loc[labels_lookup_table['labels'] == df.labels[50]]['sectionName'].values[0]

'Opinion'

In [12]:
REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]') # raw string avoids invalid-escape warnings
BAD_SYMBOLS_RE = re.compile('[^0-9a-z #+_]')
STOPWORDS = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()
truncate = 255

In [13]:
def clean_text(text): #ref:https://towardsdatascience.com/multi-class-text-classification-with-lstm-1590bee1bd17
    """
    text: a string

    return: the cleaned string, truncated to `truncate` characters
    """
    text = text.lower() # lowercase text
    text = REPLACE_BY_SPACE_RE.sub(' ', text) # replace characters matched by REPLACE_BY_SPACE_RE with a space
    text = BAD_SYMBOLS_RE.sub('', text) # delete characters matched by BAD_SYMBOLS_RE
    text = ' '.join(word for word in text.split() if word not in STOPWORDS) # remove stopwords from text
    lemmatizer.lemmatize(text) # NOTE: the return value is discarded, so this call leaves text unchanged
    text = text[:truncate] # keep only the first `truncate` characters
    return text

In [14]:
df.iloc[:, 0] = df.iloc[:, 0].apply(clean_text)

In [15]:
df.webTitle_bodyContent[50]

'politics matter life death wonder scots want leave uk adam ramsay one surprised support scottish independence surging year ago week conservatives rallied behind leader capable rekindling old flame leader whose purpose assert evidence england britain theyr'
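
The cleaned output above still contains plural forms such as "conservatives", because `lemmatize` is called once on the whole string and its return value is not kept. A minimal sketch of word-by-word lemmatization is given below. To keep it runnable without NLTK data, it uses a hypothetical `toy_lemmatize` function and a three-word stop list; both are stand-ins, and in this notebook one would presumably pass `WordNetLemmatizer().lemmatize` and the NLTK stopword set instead.

```python
import re

REPLACE_BY_SPACE_RE = re.compile(r"[/(){}\[\]|@,;]")
BAD_SYMBOLS_RE = re.compile(r"[^0-9a-z #+_]")


def toy_lemmatize(word):
    """Hypothetical stand-in: strips a trailing 's' from longer words."""
    return word[:-1] if len(word) > 3 and word.endswith("s") else word


def clean_tokens(text, lemmatize=toy_lemmatize,
                 stop_words=frozenset({"the", "a", "of"})):
    """Lowercase, strip symbols, drop stop words, lemmatize each word."""
    text = text.lower()
    text = REPLACE_BY_SPACE_RE.sub(" ", text)
    text = BAD_SYMBOLS_RE.sub("", text)
    # Lemmatize word by word and keep the result.
    return " ".join(
        lemmatize(word) for word in text.split() if word not in stop_words
    )


print(clean_tokens("The Conservatives rallied behind a leader"))
```

Applying this to the dataset would change the cleaned text, so the vocabulary and the vectors shown further down would differ from the recorded outputs.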

In [16]:
# number of distinct cleaned texts (rows), used below as the vocabulary size limit
# NOTE: this counts unique rows, not unique words
max_words = len(set(df['webTitle_bodyContent'].values))
print(max_words)

14953
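
The value 14953 is close to the row count (14972) because `set(df['webTitle_bodyContent'].values)` collects distinct texts rather than distinct words. A sketch of counting distinct words instead, on three invented cleaned texts, might look like this:

```python
from collections import Counter

# Toy cleaned texts (hypothetical); the first and last are duplicates.
texts = [
    "politics matter life death",
    "tony blair return dark politics",
    "politics matter life death",
]

# What the cell above measures: distinct texts.
unique_texts = len(set(texts))

# Distinct words: split every text on whitespace and count the tokens.
word_counts = Counter(word for text in texts for word in text.split())
unique_words = len(word_counts)

print(unique_texts, unique_words)
```

On the real data the same pattern would iterate over `df['webTitle_bodyContent']`; the resulting count could then be used as `max_tokens`.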


In [17]:
# Longest cleaned text measured in characters (spaces included), capped by `truncate`
# used as the padded output length
seq_len = int(df.iloc[:, 0].map(len).max())
print(seq_len) #max length of each example

255
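
Note that `map(len)` measures characters, which is why the result equals `truncate`, while the `output_sequence_length` argument of `TextVectorization` is counted in tokens. Using the character length is safe but pads more than necessary. A small sketch of the difference, on invented texts:

```python
# Toy cleaned texts (hypothetical).
texts = [
    "lost showbiz marina quiz",
    "tony blair return dark 1930s politics",
]

# Length in characters, as computed in the cell above.
max_chars = max(len(text) for text in texts)

# Length in whitespace-separated tokens, which is what padding is measured in.
max_tokens_per_text = max(len(text.split()) for text in texts)

print(max_chars, max_tokens_per_text)
```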


## Text to Vector


In [18]:
max_features = int(max_words)  # Maximum vocab size.
max_len = seq_len

vectorize_layer = TextVectorization(max_tokens=max_features,
                                    output_mode='int',
                                    output_sequence_length=max_len)

In [19]:
vectorize_layer.adapt(df['webTitle_bodyContent'])

In [20]:
encoder_model = tf.keras.models.Sequential()
encoder_model.add(tf.keras.Input(shape=(1,), dtype=tf.string))
encoder_model.add(vectorize_layer)

In [21]:
# wrap each text in its own list so every sample matches the (1,) input shape
X = [[text] for text in df['webTitle_bodyContent']]

In [22]:
print(len(X))
print(type(X))

14972
<class 'list'>


In [23]:
X = encoder_model.predict(X)



In [24]:
# X now holds one integer sequence per article, with a fixed length of 255.
# sequences shorter than that are padded with zeros.
print(X)
print(type(X))
print(X.shape)

[[  255 12175  4529 ...     0     0     0]
 [  955  4624   204 ...     0     0     0]
 [ 1071   226   384 ...     0     0     0]
 ...
 [  482  1519  2374 ...     0     0     0]
 [10380   426  3628 ...     0     0     0]
 [    1   551  5653 ...     0     0     0]]
<class 'numpy.ndarray'>
(14972, 255)
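
To make the encoding above easier to read, the following pure-Python sketch imitates integer vectorization: index 0 is reserved for padding, index 1 for out-of-vocabulary words, and the remaining indices go to the most frequent words. It is an illustration under those assumptions and not the Keras implementation; `build_vocab` and `encode` are hypothetical helper names.

```python
from collections import Counter


def build_vocab(texts, max_tokens):
    """Most frequent words first, after the two reserved entries."""
    counts = Counter(word for text in texts for word in text.split())
    frequent = [word for word, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + frequent


def encode(text, vocab, seq_len):
    """Map words to indices, truncate to seq_len, then pad with zeros."""
    index = {word: i for i, word in enumerate(vocab)}
    ids = [index.get(word, 1) for word in text.split()][:seq_len]
    return ids + [0] * (seq_len - len(ids))


vocab = build_vocab(["tony blair return", "tony blair speech"], max_tokens=4)
row = encode("tony blair quiz", vocab, seq_len=5)
print(vocab)
print(row)
```

Here "quiz" is not in the four-entry vocabulary, so it maps to index 1, and the two trailing zeros are padding.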


In [25]:
#check that we can convert back from vector to text
vocab = vectorize_layer.get_vocabulary()
for i in range(0, 5):#just pull five predictions
  p = " ".join([vocab[int(word)] for word in X[i]])
  print(p)
print("Decoded:\n")

lost showbiz marina [UNK] 2016 quiz                                                                                                                                                                                                                                                         
tony blair return dark 1930s politics longer [UNK] return dark politics 1930s longer [UNK] today rampant nationalist populism widespread rejection [UNK] [UNK] according tony blair stark speech [UNK] house thinktank london former bri                                                                                                                                                                                                                              
prince far royal queen says morrissey special kind [UNK] pay tribute music icon yet still manage turn dig british royal family piece [UNK] propaganda morrissey [UNK] statement released [UNK] true former smiths frontman paid tribute prince die                  
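
The decoded lines above end in long runs of spaces because the padding index 0 maps to the empty string in the vocabulary. A sketch of a decoder that skips padding is shown below; `decode` is a hypothetical helper and the five-entry vocabulary is invented for the example.

```python
def decode(row, vocab):
    """Join the words for all non-padding indices in a row."""
    return " ".join(vocab[int(i)] for i in row if int(i) != 0)


# Toy vocabulary in the same layout: '' for padding, '[UNK]' for unknown words.
vocab = ["", "[UNK]", "lost", "showbiz", "quiz"]
decoded = decode([2, 3, 1, 4, 0, 0, 0], vocab)
print(decoded)
```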

In [26]:
#y = df['labels'].values
y = pd.get_dummies(df['labels']).values
print(y.shape)

(14972, 76)
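
The one-hot matrix has 76 columns because only 76 of the section labels occur in the 10% sample, so a column index is not the same thing as a label id. The NumPy sketch below shows the idea on invented labels and how the original ids can be recovered; it assumes the columns are ordered by label value, which is how `pd.get_dummies` is understood to order them.

```python
import numpy as np

# Toy label column (hypothetical); ids are sparse, as in the real data.
labels = np.array([2, 10, 3, 2, 87])

# classes: sorted distinct ids; inverse: column position of each row's label.
classes, inverse = np.unique(labels, return_inverse=True)

# One row per sample, one column per distinct label.
one_hot = np.eye(len(classes), dtype=int)[inverse]

# Map the hot column of every row back to the original label id.
recovered = classes[one_hot.argmax(axis=1)]

print(one_hot.shape)
print(recovered)
```

Keeping `classes` alongside the saved matrix makes it possible to translate model predictions back to section names through the lookup table.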


In [27]:
# save vectors as csv file
savetxt('/content/drive/MyDrive/data/X.csv', X, delimiter=',')
savetxt('/content/drive/MyDrive/data/y.csv', y, delimiter=',')
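
Without an explicit `fmt`, `savetxt` writes every value in the `'%.18e'` floating-point format, which makes files of integer token ids much larger than they need to be. The sketch below writes a small invented array with `fmt='%d'` to a temporary file and reads it back; the array and file name are placeholders.

```python
import os
import tempfile

import numpy as np

# Toy stand-in for the encoded matrix (hypothetical values).
X_demo = np.array([[255, 12175, 0], [955, 4624, 204]])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "X_demo.csv")

    # fmt='%d' stores plain integers instead of scientific notation.
    np.savetxt(path, X_demo, fmt="%d", delimiter=",")

    with open(path) as handle:
        first_line = handle.readline().strip()

    # Read the file back as integers to confirm the round trip.
    restored = np.loadtxt(path, delimiter=",", dtype=int)

print(first_line)
print(restored)
```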