# Tutorial (Text Data Processing)

(Last updated: Mar 3, 2023)

This tutorial will familiarize you with the data science pipeline for processing text data. We will go through the steps of a Natural Language Processing (NLP) pipeline for topic modelling and topic classification, including tokenization, lemmatization, and obtaining word embeddings. We will also build a neural network with PyTorch for multi-class topic classification on the AG's News Topic Classification Dataset.
The dataset contains news articles from four different categories, which makes it a good source of text data for NLP tasks. We will guide you through understanding the dataset, applying various NLP techniques, and building a classification model. The pipeline of this tutorial is shown below.

[pipeline pic]

You can use the following links to jump to the tasks and assignments:

[table of contents]

## Scenario

The [AG's News Topic Classification Dataset](https://github.com/mhjabreel/CharCnn_Keras/tree/master/data/ag_news_csv) is drawn from AG's corpus, a collection of over 1 million news articles from more than 2000 news sources. The dataset was created by selecting the 4 largest classes from the original corpus, resulting in 120,000 training samples and 7,600 testing samples. It is provided by the academic community for research purposes in data mining, information retrieval, and other non-commercial activities. We will use it to demonstrate various NLP techniques on real data and, in the end, build two models with it. The files `train.csv` and `test.csv` contain all the training and testing samples as comma-separated values with 3 columns: class index, title, and description. Download `train.csv` and `test.csv` for the following tasks.

## Import Packages

We put all the packages that are needed for this tutorial below:

In [None]:
import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import spacy
import torch

from gensim.models import Word2Vec

from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import adjusted_mutual_info_score, adjusted_rand_score

from xml.sax import saxutils as su

## Task Answers

The code block below contains answers for the assignments in this tutorial. **Do not check the answers in the next cell before practicing the tasks.**

In [None]:
def check_answer_df(df_result, df_answer, n=1):
    """
    This function checks if two output dataframes are the same.
    
    Parameters
    ----------
    df_result : pandas.DataFrame
        The result from the output of a function.
    df_answer: pandas.DataFrame
        The expected output of the function.
    n : int
        The numbering of the test case.
    """
    # The expected answer may be one dataframe or a list of accepted dataframes.
    multiple_answers = isinstance(df_answer, list)
    try:
        if multiple_answers:
            assert any(answer.equals(df_result) for answer in df_answer)
        else:
            assert df_answer.equals(df_result)
        print(f"Test case {n} passed.")
    except Exception:
        print(f"Test case {n} failed.")
        print("")
        print("Your output is:")
        print(df_result)
        print("")
        print("Expected output is", end="")
        if multiple_answers:
            print(" one of", end="")
        print(":")
        print(df_answer)

## Task 3: Preprocess Text Data

In this task, we will preprocess the text data from the AG News Dataset. First, we need to load the files.

In [None]:
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

display(train_df, test_df)

The counts below show that the classes are evenly distributed in both the train and test data.

In [None]:
display(train_df['Class Index'].value_counts(), test_df['Class Index'].value_counts())

To make the data easier to read, we will transform the `Class Index` column into a `class` column that contains the category name of each news article. To process the title and the news text together, we will combine the `Title` and `Description` columns into one `text` column. We will work only with the train data until the point where we need the test data again.

In [None]:
def reformat_data(df):
    """
    Reformat the Class Index column to a class column and combine
    the Title and Description columns into a text column.
    Select only the new columns afterwards.
    
    Parameters
    ----------
    df : pandas.DataFrame
        The original dataframe.
         
    Returns
    -------
    pandas.DataFrame
        The reformatted dataframe.
    """
    # Work on a copy so the dataframe passed in is not modified.
    df = df.copy()

    # Make the class column using a dictionary.
    classes = {1: 'World', 2: 'Sports', 3: 'Business', 4: 'Sci/Tech'}
    df['class'] = df['Class Index'].map(classes)
    
    # Use string concatenation for the text column and unescape HTML characters.
    df['text'] = (df['Title'] + ' ' + df['Description']).apply(su.unescape)
    
    # Select only the class and text columns.
    df = df[['class', 'text']]
    return df

train_df = reformat_data(train_df)
display(train_df)
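
To see what the unescaping step in `reformat_data` does, here is a minimal sketch on a made-up headline (the string is an assumption, not a row from the dataset). By default, `saxutils.unescape` only converts `&amp;`, `&lt;`, and `&gt;`, whereas the standard library's `html.unescape` also handles other named and numeric entities such as `&quot;` and `&#39;`.

```python
import html
from xml.sax import saxutils as su

# Made-up headline containing several kinds of escaped characters.
raw = "AT&amp;T says profit &gt; forecast, calls it &quot;solid&quot; (Reuters&#39; report)"

# Only &amp;, &lt; and &gt; are converted by default.
sax_result = su.unescape(raw)

# Named and numeric character references are converted as well.
html_result = html.unescape(raw)

print(sax_result)
print(html_result)
```

If the `text` column still shows sequences like `&quot;` after reformatting, swapping in `html.unescape` is one option worth trying.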

### Tokenization 

Tokenization is the process of breaking down a text into individual tokens, which are usually words but can also be phrases or sentences. It helps language models to understand and analyze text data by breaking it down into smaller, more manageable pieces. While it may seem like a trivial task, tokenization can be applied in multiple ways and thus be a complex and challenging task influencing NLP applications.

For example, in languages like English, it is generally straightforward to identify words by using spaces as delimiters. However, there are exceptions, such as contractions like "can't" and hyphenated words like "self-driving". In Dutch, multiple nouns can be combined into one longer noun without any delimiter, which makes word boundaries harder to find. How would you tokenize "hippopotomonstrosesquippedaliofobie"? In other languages, such as Chinese and Japanese, there are no spaces between words, so identifying word boundaries is much more difficult.
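
The contraction and hyphen cases can be made concrete with two deliberately naive tokenizers. This is a sketch using only the standard library; the sample sentence and the regular expression are assumptions chosen for illustration, not what NLTK does internally.

```python
import re

sample = "Self-driving cars can't park themselves yet."

# Naive approach 1: split on whitespace only.
# Punctuation stays glued to the neighbouring word.
whitespace_tokens = sample.split()

# Naive approach 2: runs of word characters, or single punctuation marks.
# Contractions and hyphenated words are broken into pieces.
regex_tokens = re.findall(r"\w+|[^\w\s]", sample)

print(whitespace_tokens)
# ['Self-driving', 'cars', "can't", 'park', 'themselves', 'yet.']
print(regex_tokens)
# ['Self', '-', 'driving', 'cars', 'can', "'", 't', 'park', 'themselves', 'yet', '.']
```

Neither result is obviously right, which is why dedicated tokenizers add language-specific rules on top of simple splitting.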

To illustrate the use of tokenization, let's consider the following example, which tokenizes a sample text using the `word_tokenize` function from the NLTK package. That function relies on a pre-trained tokenization model for English, so NLTK may ask you to download that resource with `nltk.download` the first time you run it.

In [None]:
# Sample text.
text = "The quick brown fox jumped over the lazy dog. The cats couldn't wait to sleep all day."

# Tokenize the text.
tokens = word_tokenize(text)

# Print the text and the tokens.
print("Original text:", text)
print("Tokenized text:", tokens)

### Stemming / lemmatization

Stemming and lemmatization are two common techniques used in NLP to preprocess and normalize text data. Both techniques involve transforming words into their root form, but they differ in their approach and the level of normalization they provide.

Stemming is a technique that reduces words to their base or stem form by removing affixes, most often suffixes. For example, the stem of the word "lazily" would be "lazi". Stemming is simple and fast, which makes it useful when processing large amounts of text. However, it can produce stems that are not real words, or conflate unrelated words, since it does not consider the context or part of speech of the word.

Lemmatization, on the other hand, is a more sophisticated technique that involves identifying the base or dictionary form of a word, also known as the lemma. Unlike stemming, lemmatization can consider the context and part of speech of the word, which can make it more accurate and reliable. In the example below, the context and part of speech are not used: `WordNetLemmatizer.lemmatize` treats every token as a noun unless a `pos` argument is passed. With lemmatization, the lemma of the word "lazily" would be "lazy". Lemmatization can be slower and more complex than stemming but provides a higher level of normalization.

In [None]:
# Initialize the stemmer and lemmatizer.
stemmer = SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()

# Perform stemming and lemmatization separately on the tokens.
stemmed_tokens = [stemmer.stem(token) for token in tokens]
lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]

# Print the results.
print("Stemmed text:", stemmed_tokens)
print("Lemmatized text:", lemmatized_tokens)
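
The difference between the two approaches can also be shown with a toy version of each. This is only a sketch: the suffix list and the lemma dictionary are hypothetical and far smaller than what the Snowball stemmer or WordNet use, but they show that a stemmer applies string rules while a lemmatizer looks words up.

```python
# Toy suffix rules, checked in order. Hypothetical, NOT the Snowball algorithm.
SUFFIXES = ("ing", "ly", "ed", "s")

# Toy lemma dictionary. A real lemmatizer uses a full lexicon such as WordNet.
LEMMAS = {"lazily": "lazy", "jumped": "jump", "cats": "cat", "better": "good"}


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the word up in the dictionary; return it unchanged if unknown."""
    return LEMMAS.get(word, word)


words = ["lazily", "jumped", "cats", "better"]
stems = [toy_stem(w) for w in words]
lemmas = [toy_lemmatize(w) for w in words]

print(stems)   # ['lazi', 'jump', 'cat', 'better']
print(lemmas)  # ['lazy', 'jump', 'cat', 'good']
```

The rule-based version produces the non-word "lazi" and cannot relate "better" to "good", while the lookup handles both but only for words it knows.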

### Stopword removal

Stopword removal is a common technique used in NLP to preprocess and clean text data by removing words that are considered to be of little or no value in terms of conveying meaning or information. These words are called "stopwords" and they include common words such as "the", "a", "an", "and", "or", "but", and so on.

The purpose of stopword removal in NLP is to improve the accuracy and efficiency of text analysis and processing by reducing the noise and complexity of the data. Stopwords are often used to form grammatical structures in a sentence, but they do not carry much meaning or relevance to the main topic or theme of the text. So by removing these words, we can reduce the dimensionality of the text data, improve the performance of machine learning models, and speed up the processing of text data. NLTK has a predefined list of stopwords for English.

In [None]:
# English stopwords in NLTK.
stopwords_list = stopwords.words('english')
print(stopwords_list)
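
The NLTK stopword list is lowercase, so the way tokens are compared against it matters. The sketch below uses a tiny hand-picked stopword set (an assumption, the real list is much longer) to show the effect of lowercasing before the lookup.

```python
# A tiny hand-picked stopword set, standing in for the NLTK list.
toy_stopwords = {"the", "a", "an", "and", "or", "but", "over", "to"}

toy_tokens = ["The", "quick", "brown", "fox", "jumped", "over", "the", "lazy", "dog"]

# Case-sensitive lookup: the capitalised "The" is not recognised as a stopword.
case_sensitive = [t for t in toy_tokens if t not in toy_stopwords]

# Lowercasing each token before the lookup catches it.
case_insensitive = [t for t in toy_tokens if t.lower() not in toy_stopwords]

print(case_sensitive)
# ['The', 'quick', 'brown', 'fox', 'jumped', 'lazy', 'dog']
print(case_insensitive)
# ['quick', 'brown', 'fox', 'jumped', 'lazy', 'dog']
```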

### Assignment for Task 3

**Your task (which is your assignment) is to write functions to do the following:**
- Since we want to use our text to make a model later on, we need to preprocess it. Add a `tokens` column to the `train_df` dataframe with the text tokenized, then lemmatize those tokens.
    - Hint: Use the `pandas.Series.apply` function with the imported `nltk.tokenize.word_tokenize` function. This might take a moment. Recall that you can use the `pd.Series.apply?` syntax in a code cell for more information.
    - Hint: use the `nltk.stem.WordNetLemmatizer.lemmatize` function to lemmatize a token.
- To see what the most used words per class are, create a new, separate dataframe with the 5 most used words per class. Sort the resulting dataframe ascending on the `class` and descending on the `count`.
    - Hint: use the `pandas.Series.apply` and `str.isalpha()` functions to filter out non-alphabetical tokens.
    - Hint: use the `pandas.DataFrame.explode` to create one row per class and token.
    - Hint: use `pandas.DataFrame.groupby` with `.size()` afterwards or `pandas.DataFrame.pivot_table` with `size` as the `aggfunc` to obtain the occurrences per class.
    - Hint: use the `pandas.Series.reset_index` function to obtain a dataframe with `[class, tokens, count]` as the columns.
    - Hint: use the `pandas.DataFrame.sort_values` function for sorting a dataframe.
    - Hint: use the `pandas.DataFrame.groupby` and `pandas.DataFrame.head` functions to get the first 5 rows per class.
- Remove the stopwords from the `tokens` column in the `train_df` dataframe. Do the most used tokens say something about the class now?
    - Hint: once again, you can use the `pandas.Series.apply` function.
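
The chain of pandas calls named in the hints can be tried out on a toy dataframe first. The rows below are made up, and the sketch keeps only the single most used word per class to stay short.

```python
import pandas as pd

# Made-up data: each row holds a class and a list of tokens.
toy = pd.DataFrame({
    "class": ["Sports", "Sports", "World"],
    "tokens": [["goal", "team", "goal"], ["team", "goal"], ["election", "vote", "election"]],
})

# One row per (class, token) pair.
exploded = toy.explode("tokens")

# Count how often each token occurs within each class.
counts = exploded.groupby(["class", "tokens"]).size().reset_index(name="count")

# Sort on class (ascending) and count (descending), keep the top row per class.
top = counts.sort_values(["class", "count"], ascending=[True, False]).groupby("class").head(1)

print(top)
```

Here "goal" (3 occurrences) comes out on top for Sports and "election" (2 occurrences) for World.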

In [None]:
def tokenize_and_lemmatize(df):
    """
    Tokenize and lemmatize the text in the dataset.
    
    Parameters
    ----------
    df : pandas.DataFrame
        The dataframe containing at least the text column.
         
    Returns
    -------
    pandas.DataFrame
        The dataframe with the added tokens column.
    """
    # Copy the dataframe to avoid editing the original one.
    df = df.copy(deep=True)
    
    # Apply the tokenizer to create the tokens column.
    df['tokens'] = df['text'].apply(word_tokenize)
    
    # Apply the lemmatizer on every word in the tokens list.
    df['tokens'] = df['tokens'].apply(lambda tokens: [lemmatizer.lemmatize(token) for token in tokens])
    return df


def most_used_words(df, token_col='tokens'):
    """
    Generate a dataframe with the 5 most used words per class, and their count.
    
    Parameters
    ----------
    df : pandas.DataFrame
        The dataframe containing at least the class and tokens columns.
    token_col : str
        The name of the column that holds the token lists.
         
    Returns
    -------
    pandas.DataFrame
        The dataframe with 5 rows per class, and an added 'count' column.
        The dataframe is sorted in ascending order on the class and in descending order on the count.
    """
    # Copy the dataframe to avoid editing the original one.
    df = df.copy(deep=True)
    
    # Filter out non-words
    df[token_col] = df[token_col].apply(lambda tokens: [token for token in tokens if token.isalpha()])
    
    # Explode the tokens so that every token gets its own row.
    df = df.explode(token_col)
    
    # Option 1: group by class and token, count the rows in each group,
    # and store that count in a new column.
    counts = df.groupby(['class', token_col]).size().reset_index(name='count')
    
    # Option 2: build a pivot table on class and token that counts the rows
    # per combination, and store that count in a new column.
    # counts = df.pivot_table(index=['class', token_col], aggfunc='size').reset_index(name='count')
    
    # Sort on class (ascending) and count (descending), then keep the first 5 rows per class.
    counts = counts.sort_values(['class', 'count'], ascending=[True, False]).groupby('class').head(5)

    return counts

def remove_stopwords(df):
    """
    Remove stopwords from the tokens.
    
    Parameters
    ----------
    df : pandas.DataFrame
        The dataframe containing at least the tokens column.
         
    Returns
    -------
    pandas.DataFrame
        The dataframe with stopwords removed from the tokens column.
    """
    # Copy the dataframe to avoid editing the original one.
    df = df.copy(deep=True)
    
    # Using a set for quicker lookups.
    stopwords_set = set(stopwords_list)
    
    # Filter stopwords from tokens.
    df['tokens'] = df['tokens'].apply(lambda tokens: [token for token in tokens if token.lower() not in stopwords_set])
    
    return df

tok = tokenize_and_lemmatize(train_df)
train_df = remove_stopwords(tok)
display(most_used_words(tok), most_used_words(train_df))

## Task 4: Another option: spaCy

spaCy is another library used to perform various NLP tasks like tokenization, part-of-speech tagging, named entity recognition, dependency parsing, and much more. It provides pre-trained models for different languages and domains, which can be used as-is but also can be fine-tuned on a specific task or domain.

In an object-oriented way, spaCy can be thought of as a collection of classes and objects that work together to perform NLP tasks. Some of the important functions and classes in spaCy include:

- `nlp`: The core object that provides the main functionality of spaCy. It is the loaded pipeline (a `Language` object) and is called like a function to process text and create a `Doc` object.
- [`Doc`](https://spacy.io/api/doc): A container for accessing linguistic annotations like tokens, part-of-speech tags, named entities, and dependency parse information. It is created by the `nlp` function and represents a processed document.
- [`Token`](https://spacy.io/api/token): An object representing a single token in a `Doc` object. It contains information like the token text, part-of-speech tag, lemma, embedding, and much more.

When a text is processed by spaCy, it is first passed to the `nlp` function, which uses the loaded model to tokenize the text and applies various linguistic annotations like part-of-speech tagging, named entity recognition, and dependency parsing in the background. The resulting annotations are stored in a `Doc` object, which can be accessed and manipulated using various methods and attributes. For example, the `Doc` object can be iterated over to access each `Token` object in the document.

In [None]:
# Load the small English model in spaCy.
nlp = spacy.load("en_core_web_sm")

# Process the text using spaCy.
doc = nlp(text)

# This becomes a spaCy Doc object, which prints nicely as the original string.
print(type(doc), doc)

# We can iterate over the tokens in the Doc, since it has already been tokenized underneath.
print(type(doc[0]))
for token in doc:
    print(token)

Since a lot of processing has already been done, we can also directly access multiple attributes of the `Token` objects. For example, we can directly access the lemma of the token with `Token.lemma_` and check if a token is a stop word with `Token.is_stop`.

In [None]:
print(doc[0].lemma_, type(doc[0].lemma_), doc[0].is_stop, type(doc[0].is_stop))
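
When many texts need to be processed, spaCy's `nlp.pipe` method streams them through the pipeline in batches, which is usually faster than calling `nlp` on one text at a time. The sketch below assumes a blank English pipeline (tokenizer only) so that it runs without a downloaded model; with `spacy.load("en_core_web_sm")` the same call also fills in lemmas and other annotations. The names `blank_nlp` and `piped_docs` are made up for this example.

```python
import spacy

# Assumption: a blank English pipeline, which only tokenizes.
blank_nlp = spacy.blank("en")

texts = [
    "The quick brown fox jumped over the lazy dog.",
    "The cats couldn't wait to sleep all day.",
]

# nlp.pipe yields one Doc per input text, in the same order.
piped_docs = list(blank_nlp.pipe(texts, batch_size=2))

for piped_doc in piped_docs:
    # Keep only the tokens that are not flagged as stop words.
    print([token.text for token in piped_doc if not token.is_stop])
```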

**Your task (which is your assignment) is to write functions to do the following:**
- Add a `doc` column to the `train_df` dataframe containing the `Doc` representation of that row's `text`.
- Add a `spacy_tokens` column to the `train_df` dataframe containing a list of lemmatized tokens (strings).

## Task 5: Unsupervised Learning - Topic Modelling

- Use LDA to transform the preprocessed text into features
- Run a simple KMeans to make clusters
- Let the student pick the number of clusters (elbow method)
- Evaluate using `adjusted_mutual_info_score` and `adjusted_rand_score`
- Similar to an assignment from the Applied ML course these students had last year, but LDA was cut for their year
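
To get a feel for what the LDA step produces, here is a minimal sketch on four made-up sentences (the corpus and the choice of 2 topics are assumptions for illustration). Each document ends up as a row of topic proportions that sums to 1, and those rows are the features that KMeans would cluster.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Made-up mini corpus: two sports-like and two business-like sentences.
toy_docs = [
    "team wins match after late goal",
    "striker scores goal for team",
    "market shares fall as bank reports loss",
    "bank profit lifts market shares",
]

# Bag-of-words counts: one row per document, one column per word.
toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_docs)

# Fit LDA and get the per-document topic proportions in one step.
toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)
toy_topics = toy_lda.fit_transform(toy_counts)

print(toy_topics.shape)        # (4, 2)
print(toy_topics.sum(axis=1))  # each row sums to 1
```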

In [None]:
# Load preprocessed text data
data = pd.read_csv('preprocessed_text.csv')

# Define the number of topics to extract with LDA
num_topics = 10

# Convert preprocessed text to features using CountVectorizer
vectorizer = CountVectorizer(max_features=5000)
X = vectorizer.fit_transform(data['preprocessed_text'])

# Fit LDA to the feature matrix
lda = LatentDirichletAllocation(n_components=num_topics, max_iter=10, random_state=42)
lda.fit(X)

# Extract the topic proportions for each document
doc_topic_proportions = lda.transform(X)

# Determine the optimal number of clusters with KMeans using the elbow method
k_range = range(2, 11)
sse = []
for k in k_range:
    kmeans = KMeans(n_clusters=k, max_iter=100, n_init=10, random_state=42)
    kmeans.fit(doc_topic_proportions)
    sse.append(kmeans.inertia_)
    
# Plot the elbow curve (matplotlib is already imported as plt at the top)
plt.plot(k_range, sse)
plt.xlabel('Number of clusters')
plt.ylabel('SSE')
plt.show()

# Let the user select the number of clusters
num_clusters = int(input("Enter the number of clusters: "))

# Cluster the documents using KMeans
kmeans = KMeans(n_clusters=num_clusters, max_iter=100, n_init=10, random_state=42)
kmeans.fit(doc_topic_proportions)
cluster_labels = kmeans.labels_

# Evaluate the clustering using adjusted mutual information score and adjusted rand score
ami_score = adjusted_mutual_info_score(data['category'], cluster_labels)
ari_score = adjusted_rand_score(data['category'], cluster_labels)

print(f"Adjusted mutual information score: {ami_score:.2f}")
print(f"Adjusted rand score: {ari_score:.2f}")
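
Cluster IDs from KMeans are arbitrary numbers, so they cannot be compared to the class names directly. Both scores used above compare how the samples are grouped rather than which label values are used. A small sketch with made-up labels illustrates this:

```python
from sklearn.metrics import adjusted_mutual_info_score, adjusted_rand_score

# Made-up ground truth and two hypothetical clusterings.
true_labels = ["World", "World", "Sports", "Sports", "Business", "Business"]
same_grouping = [2, 2, 0, 0, 1, 1]  # same groups, different names
one_mistake = [2, 2, 0, 1, 1, 1]    # one Sports sample put with Business

ari_same = adjusted_rand_score(true_labels, same_grouping)
ami_same = adjusted_mutual_info_score(true_labels, same_grouping)
ari_mistake = adjusted_rand_score(true_labels, one_mistake)

print(ari_same, ami_same)  # both equal to 1 (up to floating point)
print(ari_mistake)         # lower than 1
```

A score of 1 means the clustering matches the true classes exactly, and values near 0 are what a random assignment would be expected to get.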


## Task 6: Word embeddings

- Show code to make embeddings based on pre-processed text using both NLTK and spaCy
- Assignment: Let student apply it to dataframe

Sources:
- https://www.shanelynn.ie/word-embeddings-in-python-with-spacy-and-gensim/
- https://www.kaggle.com/code/vukglisovic/classification-combining-lda-and-word2vec/notebook

In [None]:

# Load preprocessed text data
data = pd.read_csv('preprocessed_text.csv')

# Define the preprocessing functions
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    tokens = word_tokenize(text)
    tokens = [token for token in tokens if token.lower() not in stop_words]
    tokens = [lemmatizer.lemmatize(token) for token in tokens]
    return tokens

# Apply the preprocessing function to the text data
data['tokens'] = data['preprocessed_text'].apply(preprocess_text)

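# Train a Word2Vec model on the preprocessed text data.
# Note: gensim 4.0 renamed the `size` argument to `vector_size`.
model = Word2Vec(data['tokens'].tolist(), vector_size=100, window=5, min_count=1, workers=4)

# Get the word embedding for a specific word.
# The lookup raises a KeyError if the word is not in the training vocabulary.
embedding = model.wv['word']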


In [None]:
import spacy

# Load the pre-trained spaCy model
nlp = spacy.load('en_core_web_sm')

# Load preprocessed text data
# data = pd.read_csv('preprocessed_text.csv')

# # Define the preprocessing function
# def preprocess_text(text):
#     doc = nlp(text)
#     tokens = [token.lemma_ for token in doc if not token.is_stop and not token.is_punct and token.is_alpha]
#     return tokens

# # Apply the preprocessing function to the text data
# data['tokens'] = data['preprocessed_text'].apply(preprocess_text)

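# Get the vector for a short text (here a single "."). Note that
# en_core_web_sm ships without static word vectors, so this vector
# comes from the pipeline's context-sensitive tensors instead.
embedding = nlp('.').vector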
embedding
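
A classifier needs one fixed-length feature vector per article, while the embeddings above are per word. One common and simple option, shown here as a sketch rather than as the method this tutorial settles on, is to average the vectors of the tokens in a document. The 3-dimensional vectors and the helper `document_vector` are hypothetical; a trained model would supply the real vectors.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings standing in for a trained model.
toy_vectors = {
    "goal": np.array([1.0, 0.0, 0.0]),
    "team": np.array([0.0, 1.0, 0.0]),
    "bank": np.array([0.0, 0.0, 1.0]),
}


def document_vector(tokens, vectors, dim=3):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)


doc_vec = document_vector(["goal", "team", "unknown"], toy_vectors)
empty_vec = document_vector(["unknown"], toy_vectors)

print(doc_vec)    # [0.5 0.5 0. ]
print(empty_vec)  # [0. 0. 0.]
```

Tokens without a vector are skipped, and a document with no known tokens falls back to a zero vector so that every row keeps the same length.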

## Task 7: Supervised Learning - Topic Classification

- Using the word embeddings features, train a small neural net 
- Don't give the full torch code, only one layer to let them do something with torch
- Hyperparameter tuning (either some hinted ones or see if Ray Tune is worth it for this task)
- Evaluate using a confusion matrix against the true labels

Sources:
- https://pytorch.org/tutorials/beginner/hyperparameter_tuning_tutorial.html
- https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html
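
As a sketch of how training in mini-batches could look, the example below feeds made-up features through a small network using `TensorDataset` and `DataLoader`. The data shapes, layer sizes, and names such as `toy_model` are assumptions for illustration. Note that `nn.CrossEntropyLoss` expects raw scores of shape (batch, classes) and integer class indices starting at 0, so the `Class Index` values 1 to 4 would need to be shifted to 0 to 3.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Made-up data: 32 documents, 100-dimensional features, 4 classes (0 to 3).
features = torch.randn(32, 100)
labels = torch.randint(0, 4, (32,))

# The DataLoader shuffles the samples and serves them in batches of 8.
loader = DataLoader(TensorDataset(features, labels), batch_size=8, shuffle=True)

toy_model = nn.Sequential(nn.Linear(100, 16), nn.ReLU(), nn.Linear(16, 4))
toy_criterion = nn.CrossEntropyLoss()
toy_optimizer = torch.optim.Adam(toy_model.parameters(), lr=0.001)

# One pass over the data (one epoch).
for batch_features, batch_labels in loader:
    toy_optimizer.zero_grad()
    logits = toy_model(batch_features)          # shape: (batch, classes)
    loss = toy_criterion(logits, batch_labels)  # labels: int64 class indices
    loss.backward()
    toy_optimizer.step()

print(logits.shape, loss.item())
```

Compared with updating on one sample at a time, batching makes far fewer optimizer steps per epoch and gives smoother gradient estimates.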

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

# Load preprocessed text data with word embeddings as features
df = pd.read_csv('preprocessed_data.csv')

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(df['text'], df['topic'], test_size=0.2, random_state=42)

# Define a simple neural network with one hidden layer
class Net(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        return x

# Define hyperparameters
# Use positional indexing: after the shuffled split, label 0 may not be in X_train.
input_dim = len(X_train.iloc[0])
hidden_dim = 100
output_dim = len(y_train.unique())
lr = 0.001
epochs = 10

# Initialize model, optimizer and loss function
model = Net(input_dim, hidden_dim, output_dim)
optimizer = optim.Adam(model.parameters(), lr=lr)
criterion = nn.CrossEntropyLoss()

# Train the model
for epoch in range(epochs):
    running_loss = 0.0
    for i in range(len(X_train)):
        # Convert input and target to tensors (float features, integer class index)
        input_tensor = torch.tensor(X_train.iloc[i], dtype=torch.float32)
        target_tensor = torch.tensor(y_train.iloc[i], dtype=torch.long)
        
        # Clear the gradients
        optimizer.zero_grad()

        # Forward pass
        output = model(input_tensor)

        # Calculate loss
        loss = criterion(output, target_tensor)

        # Backward pass
        loss.backward()

        # Update parameters
        optimizer.step()

        # Print statistics
        running_loss += loss.item()
        if i % 100 == 99:    # Print every 100 samples (each step uses one sample)
            print('[%d, %5d] loss: %.3f' %
                  (epoch + 1, i + 1, running_loss / 100))
            running_loss = 0.0

# Evaluate the model
y_pred = []
with torch.no_grad():
    for i in range(len(X_test)):
        # Convert input to a float tensor
        input_tensor = torch.tensor(X_test.iloc[i], dtype=torch.float32)

        # Forward pass
        output = model(input_tensor)

        # Get predicted class (index of the largest output score)
        predicted = torch.argmax(output, dim=0)
        y_pred.append(predicted.item())

# Calculate confusion matrix
conf_mat = confusion_matrix(y_test, y_pred)
print(conf_mat)
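
To read the resulting matrix, it helps to know that in scikit-learn's `confusion_matrix` each row corresponds to a true class and each column to a predicted class. A small sketch with made-up labels:

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: one sample of class 0 is wrongly predicted as class 1.
toy_true = [0, 0, 1, 1, 2]
toy_pred = [0, 1, 1, 1, 2]

toy_conf = confusion_matrix(toy_true, toy_pred)
print(toy_conf)
# [[1 1 0]
#  [0 2 0]
#  [0 0 1]]
```

The diagonal holds the correct predictions, and the off-diagonal entry in row 0, column 1 is the class 0 sample that was predicted as class 1.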
