# Lab 8: Natural Language Processing: Classification of News Articles
In this lab we will use the content of news articles to classify them into one of the 4 following categories: `World, Sports, Business, Sci/Tech`.

The dataset is available at: https://www.kaggle.com/rmisra/news-category-dataset . The original data source is http://groups.di.unipi.it/~gulli/AG_corpus_of_news_articles.html .
You do not need to download the data from those websites: it has been made available on GCU learn in the compressed file *news_dataset.zip*. Simply download and extract it to get the files *train.csv* and *test.csv*. Place them in the same folder as this notebook (or change the path in `pd.read_csv()` accordingly).
Open the files using Excel (or any text editor) to see the format of the data and some examples.

To prepare a Conda python environment for this lab, either start from `lab_environment_dl.yml` and add the **wordcloud** package, or just import `lab8_NLP_news.yml`.

This lab follows the steps covered in the lecture to prepare textual data for classification using 'standard' ML algorithms such as k-Nearest Neighbors (k-NN) and more advanced algorithms such as Neural Networks (NNs) and Recurrent Neural Networks (RNNs).
The lab is divided into 3 parts:
1. Data preparation
2. Classification using k-NN
3. Classification using NNs and RNNs

At the end of the lab you will be able to:
1. Prepare textual data for classification with ML algorithms, using techniques such as tokenization, stop-word removal and TF-IDF.
2. Use more traditional ML algorithms such as k-NN for classification of textual data and compare the results with those obtained using NNs and RNNs.
3. Understand how to use NNs and RNNs for classification of textual data.

Whilst the exact steps will not match those used in the lecture, the overall approach is the same: preparing, loading and segregating the data, before applying the ML algorithms and obtaining the results.

Throughout this lab, you will see the comment or instruction: `TASK`. This is to help you keep track of what you need to do and where you need to do it.

**NOTE**: Some parts of the code are outlined with the keyword `ADVANCED CODE`. You do not need to understand what this part of the code does, simply read the comment next to it.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import re
import seaborn as sns
from wordcloud import WordCloud
import nltk
from nltk.corpus import stopwords
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from sklearn.model_selection import train_test_split
from tensorflow.keras.layers import (
    TextVectorization,
    Embedding,
    LSTM,
    Bidirectional,
    Dense,
)
import tensorflow.keras
from tensorflow.keras.models import Sequential
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

# 1. Data preparation
As usual, the first step is always the loading of the data. We then also want to look at what the target variable looks like. Since we have 4 classes, there should be 4 unique values in there.

In [None]:
# Importing the dataset. As done in lab 4, you usually have a look at the data. Here we simply show the first 5 rows
train_data = pd.read_csv("train.csv")
test_data = pd.read_csv("test.csv")
print(train_data.head(5))

# Printing out of the unique values of the target variable, e.g. the labels
print("Unique Target Variables", train_data["Class Index"].unique())

# TASK: Print the number of rows of the training and test data, i.e. the number of samples in each of them.

When looking at the label (Class Index), we see that World is class 1, Sports is class 2, etc. Usually, we want label numbers to start from 0 so that they do not cause issues further down the line. Therefore, we simply subtract 1 from the initial index values.

We then also create a list of label names, which lets us translate from the class indices to the actual class names.

In [3]:
# subtract 1 from the target variable for both train and test data
train_data["Class Index"] = train_data["Class Index"] - 1
test_data["Class Index"] = test_data["Class Index"] - 1

# ML algorithms can mostly only work with numeric labels, but for us humans it is easier to work with text labels.
# We therefore create the following variable called label_names which contains the actual news class as a string.
label_names = ["World", "Sports", "Business", "Sci/Tech"] # 0 is "World", 1 is "Sports", ...

## Visualizing
Word clouds can be used to visualize the most frequent words in the dataset and to get a better understanding of the data. In a word cloud, the size of a word is proportional to its frequency: the more frequent the word, the bigger it appears.
The following code cell contains a function that creates the word clouds for us.

In [5]:
# ADVANCED CODE: function to create word clouds using the 1000 most frequent words
def create_word_cloud(string):
    # build a word cloud from the given string (one large string per news class)
    wordcloud = WordCloud(
        width=1400,
        height=700,
        background_color="white",
        max_words=1000,
        min_font_size=10,
    ).generate(string)
    plt.figure(figsize=(18, 10), facecolor=None)
    plt.imshow(wordcloud)

To create the word clouds we cannot simply use pandas and each row of the dataset, because the dataset is not in the right format. We have to concatenate (join) the descriptions of all news samples within the same category into one large string per category.
This is done in the code below.

In [6]:
# This code separates the classes by their index and selects only the 'Description' column for each class.
# TASK: Print out one of the strings to see what it looks like.
world_news = " ".join(train_data[train_data["Class Index"] == 0]["Description"])
sports_news = " ".join(train_data[train_data["Class Index"] == 1]["Description"])
business_news = " ".join(train_data[train_data["Class Index"] == 2]["Description"])
science_news = " ".join(train_data[train_data["Class Index"] == 3]["Description"])

Have a look at the word cloud for the various news classes below the following code cells. Do you notice any differences in key words for each class?

**TASK**: Create a word cloud for the test data. Do you notice any differences in key words for each class?

In [None]:
create_word_cloud(world_news)
create_word_cloud(sports_news)
create_word_cloud(business_news)
create_word_cloud(science_news)

Before we start picking a metric like accuracy, we need to understand the data distribution. Let's plot the distribution of the classes.

In [None]:
# plot number of samples in each class
train_data["Class Index"].value_counts().plot(
    kind="bar", title="Number of samples in each class in training data"
)

In [None]:
test_data["Class Index"].value_counts().plot(
    kind="bar", title="Number of samples in each class in test data"
)

The classes seem evenly distributed so accuracy would be a good metric. 
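
To see why the class distribution matters when choosing a metric, the following minimal sketch uses a made-up, heavily imbalanced test set (it is not the news dataset). A "classifier" that always predicts the majority class reaches a high accuracy without ever finding the minority class.

```python
# Hypothetical imbalanced test set: 95 samples of class 0 and 5 of class 1.
true_labels = [0] * 95 + [1] * 5

# A "classifier" that ignores its input and always predicts the majority class.
predictions = [0] * len(true_labels)

# Accuracy: fraction of predictions that match the true label.
accuracy = sum(p == t for p, t in zip(predictions, true_labels)) / len(true_labels)

# Recall for class 1: fraction of the true class-1 samples that were found.
class_1_hits = sum(p == 1 and t == 1 for p, t in zip(predictions, true_labels))
recall_class_1 = class_1_hits / true_labels.count(1)

print("Accuracy:", accuracy)                # 0.95
print("Recall for class 1:", recall_class_1)  # 0.0
```

With evenly distributed classes, as in this lab, such a trivial classifier would only reach about 25% accuracy, which is why accuracy is a reasonable choice here.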

By looking at some sample entries, we can usually already tell what kind of steps we should start with. 
The following code will show some samples so that we can get a good overview of the data. You could also use Excel or a text editor to do this, but since we are already in a notebook, we will just use Python.

In [None]:
def show_samples(class_index, dataframe):
    """Show two sample descriptions of the class with the given index (label)
    Args:
        class_index (int): class index you want to see the samples from
        dataframe (pd.DataFrame): dataframe you want to see the samples from
    """
    print("-----------")
    print("Class Index: ", class_index)
    print("Description 1: ",
           dataframe[dataframe["Class Index"] == class_index]["Description"].iloc[2],)
    print("Description 2: ",
           dataframe[dataframe["Class Index"] == class_index]["Description"].iloc[4],)

show_samples(0, train_data)
show_samples(1, train_data)
show_samples(2, train_data)
show_samples(3, train_data)

## Preprocessing 

Simply by looking at the data you can already tell the following: 
- The text contains a lot of backslashes and punctuation marks, which are not really necessary.

The steps that you need to take to preprocess the data differ from one dataset to another. It usually helps to look at the data and see what needs to be done:
- If there are special characters, you need to remove them
- If there are numbers, you need to remove them (in most cases)

It also helps to look at the data and see if there are any patterns. For example, in this dataset a lot of news articles start with "Reuters - " or "AFP - ".
We could remove these patterns to make the data cleaner, as they indicate the news agency that is the source of the article rather than its topic, although we will not do that in this part of the lab.
Other techniques that you could try to preprocess the data are **Stemming** and **Lemmatization**. We have not used them here, but they will be covered in an extension task.

We therefore create a function that does the preprocessing for us in the code cell below. Remember, we want to use functions because we can then reuse the code for the test set as well.

In [11]:
# Function to preprocess the data
def preprocess_text(string):
    """preprocess the text in a string
    Args:
        string (str): string you want to preprocess
    Returns:
        str: lower-cased string in which every non-letter character is replaced by a space
    """
    # lower case the text
    string = string.lower()
    # remove the apostrophe and replace it with a space
    string = string.replace("'", " ")
    # remove the backslash and replace it with a space
    string = string.replace("\\", " ")
    # remove the special characters by only allowing letters
    # this uses a regular expression (regex) which is a special syntax to define a pattern to search for
    string = re.sub(r"[^a-zA-Z]", " ", string)

    return string
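
To see what these steps do, here is a minimal sketch on a made-up string (it is not taken from the dataset). It applies the same lower-casing and the same regular expression as `preprocess_text`; note that the regular expression on its own already turns apostrophes and backslashes into spaces.

```python
import re

# Made-up example text; it is not taken from the dataset.
example = "Reuters - Stocks rose 3% today!"

# Same steps as in preprocess_text: lower-case, then keep letters only.
cleaned = re.sub(r"[^a-zA-Z]", " ", example.lower())

# split() collapses the runs of spaces left behind by the removed characters.
tokens = cleaned.split()
print(tokens)  # ['reuters', 'stocks', 'rose', 'today']
```

The number and the punctuation are gone, while the agency name is still present, which is the pattern discussed in the text.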

We can then use this function on the dataframes.
In this case, we are applying the preprocess_text function to each row of the Description column. Behind the scenes, the string value saved in each row of the Description column is passed to the *preprocess_text* function. Then, because of the assignment (the = sign), the return values of the preprocess_text function are saved back into the Description column.

In [12]:
# Calling .apply() on a dataframe column applies a function to each value (row) of that column.
train_data["Description"] = train_data["Description"].apply(preprocess_text)
test_data["Description"] = test_data["Description"].apply(preprocess_text)

For our use case we then convert the dataframes to lists. The reason for this is that quite a few of the next steps require lists as input.
Whilst we could use pandas, this would require some more code and would be less efficient. 

In [13]:
train_data_list = train_data["Description"].tolist()
test_data_list = test_data["Description"].tolist()

# We have converted the data samples but we can not forget to convert the labels as well
train_labels_list = train_data["Class Index"].tolist()
test_labels_list = test_data["Class Index"].tolist()

We can also use a **validation dataset**. This is a subset of the training data used to evaluate the model during training, *without accessing the test dataset*, and it helps to detect overfitting. Recall that overfitting means the model performs well on the training data, but not on unseen data such as the test data. The validation data therefore provides a held-out dataset on which this can be checked.

We can use the `sklearn.model_selection.train_test_split` function to split the training data into training and validation data, similar to what we have done previously with the test data. We use 20% of the training data as validation data and also use `stratify` to make sure that the distribution of the classes in the new training and validation data is the same as in the original training data (this is not necessary, but it is good practice). Essentially, this means that the validation data will contain the same proportion of samples from each class as the training data.

In [17]:
(train_data_list, validation_data_list,
 train_labels_list, validation_labels_list) = train_test_split(train_data_list, train_labels_list, test_size=0.2, stratify=train_labels_list)
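
The effect of `stratify` can be checked on a small made-up dataset. The sketch below assumes toy labels with a 50/30/20 class split; `random_state` is set only to make this example reproducible and is not used in the lab code.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy data (an assumption for illustration): 100 samples in three unbalanced classes.
toy_labels = [0] * 50 + [1] * 30 + [2] * 20
toy_texts = [f"sample {i}" for i in range(len(toy_labels))]

# random_state is fixed here only so that the example is reproducible.
_, toy_val_texts, _, toy_val_labels = train_test_split(
    toy_texts, toy_labels, test_size=0.2, stratify=toy_labels, random_state=0
)

# With stratify, the 50/30/20 class proportions carry over to the validation split.
print(sorted(Counter(toy_val_labels).items()))  # [(0, 10), (1, 6), (2, 4)]
```

Without `stratify`, the class counts in the validation split would vary from one random split to the next.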

## Removing Stop Words
As said in the lecture, `nltk` has a lot of *corpora* (datasets) that can be used for different tasks. In this case, we are using the stopwords corpus, which contains a list of stopwords.

Other available corpora are:
- *wordnet* (Contains a list of words and their meanings)
- *movie_reviews* (Used in the lecture. Contains a list of movie reviews and their sentiment, which can be used as a dataset for sentiment analysis)
- *names* (Contains a list of names)
- further corpora can be found here: https://www.nltk.org/nltk_data/ and can be very useful for different tasks

You can download the corpora by using nltk.download() and then selecting the one you want to download. For example, to download the *stopwords* corpus, you would use `nltk.download('stopwords')` as shown below.

In [None]:
# download the stopwords
nltk.download("stopwords")

def remove_stopwords(data_list):
    """removes stopwords from a list of strings by checking, word by word, whether each word is a stopword

    Note: the list is modified in place and also returned.

    Args:
        data_list (list): list of strings you want to remove stopwords from
    Returns:
        list: the same list, with the stopwords removed from every string
    """
    # a set makes the membership test much faster than a list
    stopword_set = set(stopwords.words("english"))
    for i in range(len(data_list)):
        data_list[i] = " ".join(
            [word for word in data_list[i].split() if word not in stopword_set]
        )
    return data_list

Any kind of preprocessing should be applied to the train, validation and test data. 
However, in practice you usually only do those steps in a notebook on the training data, because in production you'll have a separate pipeline for fresh test data.
For this example, though, we do those steps in a single go as it is easier and keeps the Jupyter notebook cleaner.

In [19]:
train_data_list = remove_stopwords(train_data_list)
validation_data_list = remove_stopwords(validation_data_list)
test_data_list = remove_stopwords(test_data_list)

## Vectorization and Features

**Vectorization** is used to convert the text into a vector representation. Essentially, it is a way to convert text into numbers.
The **TF-IDF** vectorizer from *scikit-learn* is one implementation that converts texts into a vector representation. This is achieved by using:
- *term frequency (TF)*, the number of times a word appears in a document
- *inverse document frequency (IDF)*, a (log-scaled) inverse of the fraction of documents that contain a word, so that words appearing in many documents get a lower weight

The TF-IDF vectorizer combines both TF and IDF. This is done by multiplying the term frequency with the inverse document frequency for each word in a document.
Before we can create vectors, the vectorizer has to learn the vocabulary and the IDF values from the training data, which is done by calling its `.fit` method. After fitting the vectorizer, we can transform the data lists into vectors by using the `transform()` function. The function returns a sparse matrix that contains, for each data sample, the TF-IDF values of the vocabulary words found in that sample.

It is a very common way to convert text into a vector representation, with other approaches being:
- *Word2Vec*. Very powerful NN-based approach. It trains a small NN on the contexts in which words appear and uses the learned weights as the vector representation of each word
- *Bag of words*. Very simple approach, it just counts the number of times each word appears in a document and uses those counts as the vector representation of the document
- Other vectorizers can be found here: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_extraction.text
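
The TF and IDF terms described above can be computed by hand on a tiny made-up corpus. This is a minimal sketch, not the scikit-learn implementation: it uses the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1`, which to my understanding is the default of `TfidfVectorizer`, and it leaves out the length normalization that scikit-learn applies to each vector afterwards.

```python
import math
from collections import Counter

# Three tiny made-up "documents", already split into words.
documents = [
    ["oil", "prices", "rise"],
    ["oil", "prices", "fall"],
    ["team", "wins", "final"],
]


def inverse_document_frequency(term, docs):
    """Smoothed IDF: ln((1 + n_docs) / (1 + docs_containing_term)) + 1."""
    containing = sum(term in doc for doc in docs)
    return math.log((1 + len(docs)) / (1 + containing)) + 1


def tf_idf(doc, docs):
    """Return {term: count in doc * IDF of term}, before any length normalization."""
    counts = Counter(doc)
    return {
        term: count * inverse_document_frequency(term, docs)
        for term, count in counts.items()
    }


weights = tf_idf(documents[0], documents)
for term, weight in weights.items():
    print(f"{term}: {weight:.3f}")
```

In the first document, "rise" gets a higher weight than "oil" because it appears in fewer documents of the corpus.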


In [29]:
# scikit-learn's implementation of the vectorizer also has the option to remove stop words, create n-grams and create lowercase words.
# We have already lower-cased the text and removed the stop words in the preprocessing steps, so here we only use
# the n-gram option (unigrams and bigrams) and limit the vocabulary to 10000 features.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=10000)

# fit on train data
vectorizer.fit(train_data_list)
train_data_vectorized = vectorizer.transform(train_data_list)

# the vectorizer must only be fitted on the training data, so the validation data is only transformed
validation_data_vectorized = vectorizer.transform(validation_data_list)

# in practice we usually don't have access to the test data, so we transform it with the vectorizer that was fitted on the train data
test_data_vectorized = vectorizer.transform(test_data_list)
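
To make the effect of `ngram_range=(1, 2)` more concrete, the following sketch lists the unigrams and bigrams of a short made-up sentence. The helper `word_ngrams` is hypothetical and only illustrates the idea; it is not part of scikit-learn.

```python
def word_ngrams(text, min_n=1, max_n=2):
    """Return all word n-grams of text with min_n <= n <= max_n."""
    words = text.split()
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(words) - n + 1):
            ngrams.append(" ".join(words[start:start + n]))
    return ngrams


print(word_ngrams("oil prices rise sharply"))
# ['oil', 'prices', 'rise', 'sharply', 'oil prices', 'prices rise', 'rise sharply']
```

Every n-gram becomes a candidate feature, which is why the number of features grows quickly once bigrams are included.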


As there are quite a lot of features in the samples due to creating the *n-grams*, we would like to reduce their number.
For this, we will calculate the feature importance with `f_classif` from *scikit-learn*, which is a feature selection technique ([more information here](https://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection)).

This function is used to determine the *importance* of each feature. Essentially, it computes an ANOVA F-value: the ratio of the variance between the different classes to the variance within the classes. The higher the variance between the classes relative to the variance within the classes, the more important the feature is.
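
The idea behind this score can be sketched for a single feature in plain Python. The function `anova_f` below is a hypothetical helper written for illustration, under the assumption that the score is the standard one-way ANOVA F-statistic; the toy values are made up.

```python
def anova_f(values, labels):
    """One-way ANOVA F-statistic of a single feature with respect to the class labels."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)

    grand_mean = sum(values) / len(values)
    means = {label: sum(group) / len(group) for label, group in groups.items()}

    # Variation of the class means around the overall mean.
    between = sum(
        len(group) * (means[label] - grand_mean) ** 2 for label, group in groups.items()
    )
    # Variation of the samples around their own class mean.
    within = sum(
        (value - means[label]) ** 2 for label, group in groups.items() for value in group
    )

    k, n = len(groups), len(values)
    return (between / (k - 1)) / (within / (n - k))


toy_labels = [0, 0, 0, 1, 1, 1]
informative = [0.9, 1.0, 1.1, 0.0, 0.1, 0.2]    # clearly differs between the classes
uninformative = [0.1, 0.9, 0.5, 0.2, 0.8, 0.5]  # same mean in both classes

print(anova_f(informative, toy_labels))
print(anova_f(uninformative, toy_labels))
```

The informative feature gets a large score and the uninformative one a score close to zero, so a top-*k* selection would keep the first and drop the second.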

There are other methods to determine the most important features such as *chi-squared*, *random forest* and *mutual information*. We will not use those methods here but you can read more about them at the link above. 

The `SelectKBest` class will select the *k* best features based on the `f_classif` function. This means that only the *k* most important features will be used in the model, which will reduce the dimensionality of the data.

We then fit this selector to the train data only. We do not fit it to the test data because, in a real-world scenario, we would not have access to the test data when we are training the model. Therefore, we fit on the train data and then use the fitted selector to pick the same features from the validation and test data.
Fitting the selector (or the vectorizer) on the test data would let information from the test set influence the training. This is usually called *data leakage*. It is not a problem in this case because we only fit on the train data, but keep it in mind.

In [None]:
# select the top 1000 features
selector = SelectKBest(f_classif, k=1000)
# fit on train data
selector.fit(train_data_vectorized, train_labels_list)

# vectorizing the data transforms the data into a sparse matrix.
# we can use the toarray() method to transform the sparse matrix into a dense matrix. this will make it easier to visualize the data, but this happens at a later stage.
train_data_vectorized = selector.transform(train_data_vectorized)
validation_data_vectorized = selector.transform(validation_data_vectorized)
test_data_vectorized = selector.transform(test_data_vectorized)

# lets just check the shape of the data to see if we have the correct amount of features
print(train_data_vectorized.shape)
print(validation_data_vectorized.shape)
print(test_data_vectorized.shape)


It also helps to look at the most important features. The `get_support()` method returns a *boolean* array that indicates if each feature is one of the most important. The `get_feature_names_out()` method returns the names of the features, that we can combine with the boolean array to select the names of the most important features (and print them).
 
This helps us to understand what the model is looking at when it makes a prediction, put into terms that we can understand.

If you look through the output, you can see a lot of unigrams and some bigrams.

In [None]:
# ADVANCED CODE: Visualize the top 1000 most important features

# get the most important features
most_important_features = selector.get_support()
# get the names of the features
feature_names = vectorizer.get_feature_names_out()
# select the names of the most important features
most_important_feature_names = [
    name
    for name, is_selected in zip(feature_names, most_important_features)
    if is_selected
]
# print the names of the most important features
print(most_important_feature_names)


# "Traditional" Classification Models with *scikit-learn*

In this example we are not using cross validation because it is very computationally expensive. Rather, we are using the validation data to evaluate the model, simply a) to check if the model is overfitting and b) to get an idea of how well the model will perform on the test data. If you have more time and more computational power, you can use cross validation to get a better estimate of how well the model will perform on the test data, but this is not necessary in this case.

For this exercise, you could use any classifier you want. However, we will use the *k-nearest neighbours* (k-NN) classifier as it trains relatively quickly (on my laptop it took about 70s to run this cell).
You can read more about the K-NN scikit-learn implementation, `KNeighborsClassifier`, here: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

In [None]:
# create a K nearest neighbor classifier
kn_classifier = KNeighborsClassifier()

# fit the classifier on the training data
kn_classifier.fit(train_data_vectorized, train_labels_list)

# predict the labels for the validation data
validation_predictions = kn_classifier.predict(validation_data_vectorized)

# calculate the accuracy of the model on the validation data
accuracy = accuracy_score(validation_labels_list, validation_predictions)
print("Validation Set Accuracy: ", accuracy)

# predict the labels on the test data
kn_predictions = kn_classifier.predict(test_data_vectorized)

# calculate the accuracy score
accuracy = accuracy_score(test_labels_list, kn_predictions)
print("Test Set Accuracy: ", accuracy)
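
As a reminder of how a k-NN classifier arrives at a prediction, here is a minimal sketch in plain Python on made-up 2-D points. The function `knn_predict` is a hypothetical helper for illustration, not the scikit-learn implementation; it uses Euclidean distance and an unweighted majority vote, with `k=5` as the default to mirror the default `n_neighbors` of `KNeighborsClassifier`.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=5):
    """Predict the label of query by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest points and count their labels.
    nearest = sorted(distances)[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy 2-D data (made up): class 0 near the origin, class 1 near (5, 5).
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(points, labels, query=(0.5, 0.5), k=3))  # 0
print(knn_predict(points, labels, query=(5.5, 5.5), k=3))  # 1
```

The sketch also shows why prediction is the slow part of k-NN: every query has to be compared against the training samples, whereas "fitting" mostly means storing them.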


# Neural Networks with Keras and TensorFlow
# 1) Standard Fully Connected Classifier
Instead of the k-NN classifier, we can also use neural networks. In the last labs we have learned how to use neural networks, so we use them here as well. The NN that we will use is applied to the same data as the k-NN classifier, so we can compare the results of both models. The network here is a simple 3-layer, fully connected architecture (2 hidden layers and 1 output layer). We use the ReLU activation function for the hidden layers. Softmax activation is used for the output layer as we are trying to predict a single class out of multiple.

There are some extra steps that we need to take to get the data in the right format for the neural network. First, we need to convert the label indices to one-hot encoded vectors (example: [2] --> [0,0,1,0]). Scikit-learn handles this in the background, but for Keras we need to use the `to_categorical()` function from the `keras.utils` module. We then also need to turn the vectorized data into a format that is compatible with TensorFlow/Keras. To do this, we use the `toarray()` method on the sparse matrix, which generates a NumPy array.
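
The one-hot encoding described above can be sketched in plain NumPy. This only illustrates the idea on made-up labels; the lab itself uses `to_categorical()`, and the identity-matrix trick shown here is just one assumed way to get the same kind of result.

```python
import numpy as np

# Made-up label indices for four samples.
labels = np.array([2, 0, 3, 1])
num_classes = 4

# Row i of the identity matrix is the one-hot vector for class i.
one_hot = np.eye(num_classes, dtype=int)[labels]
print(one_hot[0])  # [0 0 1 0]

# argmax along each row recovers the original label index.
recovered = np.argmax(one_hot, axis=1)
print(recovered)  # [2 0 3 1]
```

The same `argmax` step is what turns the network's output, one score per class, back into a single predicted label.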


In [None]:
# It is always very useful to print the shape of the data to make sure that everything is working as expected
num_features = train_data_vectorized.shape[1]
print(num_features)


# the labels are of value 0, 1, 2, 3. We need to convert them to one-hot encoded vectors 
train_labels_list_np = np.array(train_labels_list)
validation_labels_list_np = np.array(validation_labels_list)
test_labels_list_np = np.array(test_labels_list)
print(test_labels_list_np.shape)

train_labels = tensorflow.keras.utils.to_categorical(
    train_labels_list_np, num_classes=4
)
validation_labels = tensorflow.keras.utils.to_categorical(
    validation_labels_list_np, num_classes=4
)
test_labels = tensorflow.keras.utils.to_categorical(test_labels_list_np, num_classes=4)

print(train_data_vectorized.shape, train_labels.shape)

# create a neural network as per the specs above
model = Sequential()
model.add(Dense(32, activation="relu", input_shape=(num_features,)))
model.add(Dense(32, activation="relu"))
model.add(Dense(4, activation="softmax"))

# print the model summary
model.summary()


# compile the model
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])

# we are using the validation data to evaluate the model, after each epoch.
# this is useful to check if the model is overfitting
model.fit(
    train_data_vectorized.toarray(),
    train_labels,
    epochs=10,
    batch_size=128,
    verbose=1,
    validation_data=(validation_data_vectorized.toarray(), validation_labels),
)

# predict the labels on the test data
nn_predictions = model.predict(test_data_vectorized.toarray())
# the predictions contain one probability per class, so we take the most likely class as the label ([0.1, 0.1, 0.7, 0.1] -> 2)
nn_predictions = np.argmax(nn_predictions, axis=1)


# calculate the accuracy score
accuracy = accuracy_score(test_labels_list, nn_predictions)
print("Test Set Accuracy: ", accuracy)

# create a confusion matrix
cm = confusion_matrix(test_labels_list, nn_predictions)
# plot the confusion matrix
plt.figure(figsize=(10, 7))
sns.heatmap(cm, annot=True, fmt="d", xticklabels=label_names, yticklabels=label_names)


You will see that the validation loss usually decreases as training progresses. However, after a certain point the validation loss may start to increase while the training loss keeps falling. This can indicate that the model is starting to overfit and that it will not perform as well on the test data. When this happens, training should usually be stopped. You can do this using the 'early stopping' callback. Whilst we do not use early stopping in this example, you can read more about it here: https://keras.io/api/callbacks/early_stopping/
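
The rule behind early stopping can be sketched without Keras. The hypothetical function below takes a list of made-up validation losses and returns the epoch at which training would stop once the loss has not improved for `patience` epochs in a row; it only illustrates the idea and does not reproduce every option of the Keras `EarlyStopping` callback (which is configured with arguments such as `monitor` and `patience`).

```python
def early_stopping_epoch(validation_losses, patience=2):
    """Return the 0-based epoch at which training would stop, or None if it never stops."""
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(validation_losses):
        if loss < best_loss:
            # New best validation loss: reset the counter.
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Made-up validation losses: improving at first, then getting worse.
losses = [0.90, 0.70, 0.60, 0.65, 0.66, 0.50]
print(early_stopping_epoch(losses, patience=2))  # 4
```

Note that training stops at epoch 4 and never sees the better loss at epoch 5, which is the trade-off controlled by the `patience` value.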

# Creating Example Predictions
Let us now test whether our two models work. Change the text in the sample news to see if it works. 

In [None]:
# sample_news = "The US president Donald Trump has been impeached by the House of Representatives. The vote was 230-197. The president is accused of abuse of power and obstruction of Congress. The vote was largely along party lines. The president is expected to be acquitted by the Senate."
# sample_news = "The Boston Red Sox have won the World Series. The Red Sox beat the Los Angeles Dodgers 4-1 in game 5 of the series. The Red Sox won the series 4-1. The Red Sox are the first team to win the World Series after losing the first three games of the series since the 1985 Kansas City Royals. The Red Sox are the first team to win the World Series after losing the first three games of the series since the 1985 Kansas City Royals."
# sample_news = "AI summit: Rishi Sunak says leaders must address dangers of artificial intelligence. World leaders have a responsibility to address the dangers of artificial intelligence, Rishi Sunak has told the UK's first AI safety summit. The prime minister said AI offered transformative change but that it also brought the potential for social harms like bias and disinformation. Some 28 countries are at the summit, alongside tech bosses and academics."
sample_news = "Bank warns of zero growth until 2025 as rates held. The UK economy is likely to see zero growth until 2025, while interest rates remain high for longer, the Bank of England has warned. It came as the Bank left rates on hold for the second time in a row at 5.25%, their highest level in 15 years."
# TASK: create some sample news strings yourself (or copy them from a news site like BBC) then see if the models can predict the correct label for them.
# Can you find an example of an incorrect prediction?


# preprocess the sample news
sample_news = preprocess_text(sample_news)

# vectorize the sample news
sample_news_vectorized = vectorizer.transform([sample_news])
sample_news_vectorized = selector.transform(sample_news_vectorized)

# predict the label
prediction = kn_classifier.predict(sample_news_vectorized)
print("k-NN classifier prediction: ", label_names[prediction[0]], "News")

# predict using the neural network
prediction = model.predict(sample_news_vectorized.toarray())
prediction = np.argmax(prediction, axis=1)
print("NN classifier prediction: ", label_names[prediction[0]], "News")


# 2) Recurrent Neural Network Classifier

Recurrent neural networks (RNNs) are networks that are used for sequential data. They can be very useful for NLP tasks, as they can take into account the order of the words in a sentence. Due to the way the layers are built, they have the ability to remember information from previous inputs. This aids them in understanding the context of the sentence.

We are using a different training configuration for the RNN. Instead of looking at the TF-IDF scores, we use **embeddings**.
First we use the `TextVectorization` layer to convert the text to a sequence of integers. You can read about how this works [here](https://www.tensorflow.org/api_docs/python/tf/keras/layers/TextVectorization).


In [None]:

# Vocabulary size and number of words in a sequence.
vocab_size = 2500
sequence_length = 30

# instead of using TF-IDF, we will use a text vectorizer to vectorize the text
# this will create a vocabulary of the top 2500 words and then convert the text to a sequence of numbers
vectorizer = TextVectorization(
    max_tokens=vocab_size, output_sequence_length=sequence_length
)

# fit the vectorizer on the training data
vectorizer.adapt(train_data_list)

# vectorize the training data
train_data_vectorized = vectorizer(train_data_list)

# the shape is (number of training samples, sequence_length): each sample is now a sequence of 30 integers
print(train_data_vectorized.shape)

# vectorize the validation data
validation_data_vectorized = vectorizer(validation_data_list)

# vectorize the test data
test_data_vectorized = vectorizer(test_data_list)
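
What the text vectorizer does can be sketched in plain Python. The two helper functions below are hypothetical and only mimic the basic behaviour: the most frequent words get integer ids, id 0 is reserved for padding and id 1 for unknown words, and every text is truncated or padded to a fixed length. The toy texts are made up.

```python
from collections import Counter


def build_vocabulary(texts, max_tokens):
    """Map the most frequent words to integer ids, reserving 0 (padding) and 1 (unknown)."""
    counts = Counter(word for text in texts for word in text.split())
    most_common = counts.most_common(max_tokens - 2)
    return {word: index + 2 for index, (word, _) in enumerate(most_common)}


def encode(text, vocabulary, sequence_length):
    """Convert text to a fixed-length list of ids: truncate long texts, pad short ones with 0."""
    ids = [vocabulary.get(word, 1) for word in text.split()][:sequence_length]
    return ids + [0] * (sequence_length - len(ids))


toy_texts = ["oil prices rise", "oil prices fall", "oil demand"]
toy_vocabulary = build_vocabulary(toy_texts, max_tokens=4)
print(toy_vocabulary)  # {'oil': 2, 'prices': 3}
print(encode("oil prices soar", toy_vocabulary, sequence_length=5))  # [2, 3, 1, 0, 0]
```

The word "soar" is not in the vocabulary and is therefore mapped to the unknown id 1, and the two trailing zeros are padding.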


Then, we use an *Embedding layer* to convert the integer sequences to embeddings. You can read about how this works [here](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding).
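
Conceptually, an embedding layer is a lookup table with one row per vocabulary entry. The NumPy sketch below illustrates this with random numbers standing in for the weights that a real layer would learn during training; the sizes match the ones used in this lab.

```python
import numpy as np

vocab_size = 2500     # same vocabulary size as in the lab
embedding_dim = 32    # same output size as the Embedding layer used in this lab
sequence_length = 30

# Random numbers stand in for the weights a real Embedding layer would learn.
rng = np.random.default_rng(seed=0)
embedding_table = rng.normal(size=(vocab_size, embedding_dim))

# A made-up integer sequence, as produced by the text vectorizer.
token_ids = rng.integers(low=0, high=vocab_size, size=sequence_length)

# The lookup is plain row indexing: one 32-dimensional vector per token.
embedded = embedding_table[token_ids]
print(embedded.shape)  # (30, 32)
```

Each text therefore enters the recurrent layer as a sequence of 30 vectors with 32 values each, instead of a sequence of 30 integers.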

We then build a neural network. The bidirectional layer was not covered in the lecture, but think of it as a 'wrapping' layer, that takes the input and passes it forwards and backwards through the LSTM layer.
You may now ask why we need to pass the input forwards and backwards. The reason for this is that it allows the network to learn the context of the sentence in both directions. By doing so, information from both the past and the future (that is, previous words and following words in the text) is taken into account.


In [None]:
model = Sequential()
model.add(Embedding(vocab_size, 32))
model.add(Bidirectional(LSTM(32)))
model.add(Dense(4, activation="softmax"))

model.summary()

# compile the model
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])

# fit the model
model.fit(
    train_data_vectorized,
    train_labels,
    epochs=5,
    batch_size=128,
    verbose=1,
    validation_data=(validation_data_vectorized, validation_labels),
)

# predict the labels on the test data
rnn_predictions = model.predict(test_data_vectorized)
# the predictions contain one probability per class, so we take the most likely class as the label
rnn_predictions = np.argmax(rnn_predictions, axis=1)


# calculate the accuracy score
accuracy = accuracy_score(test_labels_list, rnn_predictions)
print("Test Set Accuracy: ", accuracy)


# create a confusion matrix
cm = confusion_matrix(test_labels_list, rnn_predictions)
# plot the confusion matrix
plt.figure(figsize=(10, 7))
sns.heatmap(cm, annot=True, fmt="d", xticklabels=label_names, yticklabels=label_names)


## Model Comparison

Now let us compare the performance of all the models. 

In [None]:
# knn
print("k-NN")
print(classification_report(test_labels_list, kn_predictions, target_names=label_names))

# neural network
print("Neural network")
print(classification_report(test_labels_list, nn_predictions, target_names=label_names))

# rnn
print("RNN")
print(
    classification_report(test_labels_list, rnn_predictions, target_names=label_names)
)

Comparing the results of all the models, we can see that the RNN performs best. This is most likely because it takes into account the order of the words, and therefore the context of the sentence, which neither of the other two models does.

The k-NN still performs well, but this example shows that neural networks can be very useful for NLP tasks. You can also see that the standard neural network performs worse than the RNN.
However, you will notice that the results are not that far apart from each other. This is because the dataset is relatively small and the models are not very complex.

With more complex models, you could expect better results and more pronounced differences between the models.
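
To read the classification reports printed above, it helps to recall how the per-class numbers are defined. The following is a minimal sketch in plain Python with made-up labels; `per_class_scores` is a hypothetical helper written for this illustration and is not part of scikit-learn.

```python
def per_class_scores(true_labels, predicted_labels, cls):
    """Compute precision, recall and F1 for a single class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)  # correctly predicted cls
    fp = sum(1 for t, p in pairs if t != cls and p == cls)  # wrongly predicted as cls
    fn = sum(1 for t, p in pairs if t == cls and p != cls)  # cls that was missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# made-up labels for eight articles
true_labels = [0, 0, 1, 1, 2, 2, 3, 3]
predicted = [0, 1, 1, 1, 2, 2, 3, 0]

for cls in range(4):
    precision, recall, f1 = per_class_scores(true_labels, predicted, cls)
    print(cls, round(precision, 2), round(recall, 2), round(f1, 2))
```

Precision answers "of the articles predicted as this class, how many were right?", while recall answers "of the articles that truly belong to this class, how many were found?". Two models with similar accuracy can still differ noticeably on individual classes.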

<hr>

## Summary

In this lab we have followed the steps to create a text classification model, classifying news articles into their respective categories.
Three methods have been used to create the model: a k-nearest neighbours classifier, a simple neural network, and a recurrent neural network.
The workflow we have followed, which aligns closely with the ML pipeline discussed in the lectures, is as follows:
1. Preprocess the text
2. Vectorize the text
3. Create a model
4. Train the model
5. Evaluate the model
6. Make predictions

However, depending on your target task, there are many other steps or methods that you can use to preprocess the data and to select and improve the model. For example, you can use pre-trained word embeddings, such as GloVe or Word2Vec, to improve the performance of the model. You can also vectorize the text in a different way, or use a different model architecture altogether.
Stemming and lemmatization are two methods that were omitted in this lab, but they may also improve the performance of the model.
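
To give a feel for what stemming does, here is a deliberately naive sketch that strips a few common English suffixes. `toy_stem` is a hypothetical helper written for this illustration only; a real project would use an established stemmer or lemmatizer from a library such as NLTK or spaCy.

```python
def toy_stem(word):
    """Strip one common suffix, keeping at least three characters of the word."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


words = ["markets", "trading", "traded", "win", "wins"]
print([toy_stem(w) for w in words])  # ['market', 'trad', 'trad', 'win', 'win']
```

Different forms of the same word collapse to one token, which shrinks the vocabulary. The stem does not have to be a real word ("trad"); a lemmatizer would instead map such forms to a dictionary form such as "trade".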

Over time, and with more experience, you will be able to identify the best methods to use for your target task.
In the meantime, you can look at guides such as Google's Machine Learning guide on text classification: https://developers.google.com/machine-learning/guides/text-classification/step-2-5 which shows the model selection and preprocessing steps for a text classification task.

<img src="https://developers.google.com/static/machine-learning/guides/text-classification/images/TextClassificationFlowchart.png" width="600"/>

Image Source: https://developers.google.com/machine-learning/guides/text-classification/step-2-5


## Extra TASKS:
1) Whilst you look at the recommended guides, let the neural networks run their training for a few more epochs. How does the accuracy change?
2) Remove the bidirectional LSTM layer from the neural network. How does the accuracy change?
3) If you observe the most important features, you will see that there are:
- some links with "http" in them.
- names of news agencies that we have seen in the top few rows of the dataset, such as Reuters.

We want to remove these features from the dataset. Therefore: 
- Remove the occurrences of "http", "href", "https" and "www" from the datasets, which indicate weblinks.
- Remove the occurrences of the names of the news agencies from the datasets. You can do this by using the `.replace()` method on the dataframes and removing the first part of the string before the first hyphen.
- Then, re-run the code and see how the accuracy changes.
4) Try using a different model architecture. For example, increase the number of layers, or change the number of neurons in each layer. How does the accuracy change?
5) Repeat the same process using only the titles. How does the accuracy change?
6) Read through the guide from the TensorFlow website to build an RNN for text classification: https://www.tensorflow.org/text/tutorials/text_classification_rnn . How does this differ from the model we have built in this lab?
7) (Very advanced task). Try using a pre-trained word embedding, such as GloVe or Word2Vec. A guide to this can be found here: https://keras.io/examples/nlp/pretrained_word_embeddings/ . How does the accuracy change when this pre-trained word embedding is used in the models we have previously created?
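
As a possible starting point for Extra TASK 3, the sketch below cleans a single string with the standard `re` module. It is one way to approach the task under some assumptions, not the only solution: the function names and the sample sentence are made up, and it assumes the agency name is separated from the article text by " - ".

```python
import re

# Standalone tokens that indicate weblinks (taken from the task description).
LINK_TOKENS = re.compile(r"\b(?:https|http|href|www)\b", flags=re.IGNORECASE)


def strip_agency_prefix(text):
    """Drop everything up to and including the first ' - ' separator, if present.

    Assumption: the part before the first ' - ' is the agency name. Texts that
    contain ' - ' for another reason would lose their first part as well.
    """
    head, sep, tail = text.partition(" - ")
    return tail if sep else text


def remove_link_tokens(text):
    """Remove the link tokens and collapse the whitespace left behind."""
    cleaned = LINK_TOKENS.sub(" ", text)
    return re.sub(r"\s+", " ", cleaned).strip()


def clean_article(text):
    """Apply both cleaning steps to one article."""
    return remove_link_tokens(strip_agency_prefix(text))


sample = "Reuters - Oil prices rose, see href http www example com"  # made-up text
print(clean_article(sample))  # Oil prices rose, see example com
```

A function like `clean_article` can be applied to a whole text column of a dataframe with the pandas `.apply()` method before the vectorization step is re-run.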