# Tutorial (Text Data Processing)

(Last updated: Mar 3, 2023)

This tutorial will familiarize you with the data science pipeline for processing text data. We will go through the steps of an NLP pipeline for topic modelling and topic classification, including tokenization, lemmatization, and obtaining word embeddings. We will also build a neural network in PyTorch for multi-class topic classification on the AG's News Topic Classification Dataset.
The dataset contains news articles from four different categories, which makes it a good source of text data for NLP tasks. We will guide you through understanding the dataset, implementing various NLP techniques, and building a classification model. The pipeline of this tutorial is shown below.

[pipeline pic]

You can use the following links to jump to the tasks and assignments:

[table of contents]

## Scenario

The [AG's News Topic Classification Dataset](https://github.com/mhjabreel/CharCnn_Keras/tree/master/data/ag_news_csv) is derived from AG's corpus, a collection of over 1 million news articles from more than 2000 news sources. The dataset was created by selecting the 4 largest classes from the original corpus, resulting in 120,000 training samples and 7,600 testing samples. It is provided by the academic community for research purposes in data mining, information retrieval, and other non-commercial activities. We will use it to demonstrate various NLP techniques on real data and, in the end, build two models with it: an unsupervised topic model and a supervised topic classifier. The files train.csv and test.csv contain all the training and testing samples as comma-separated values with 3 columns: class index, title, and description. Download train.csv and test.csv for the following tasks.
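
As a quick orientation, the sketch below parses two rows in the same three-column layout and attaches a readable topic name to each. The mapping 1=World, 2=Sports, 3=Business, 4=Sci/Tech follows the `classes.txt` file that ships with the dataset; the inline sample rows and the names `sample_df` and `class_names` are made up for illustration and are not part of the tutorial files.

```python
import io

import pandas as pd

# Two made-up rows in the same layout as train.csv (class index, title, description).
sample_csv = io.StringIO(
    "Class Index,Title,Description\n"
    '3,Oil prices rise,"Crude futures climbed on Monday, traders said."\n'
    "2,Home team wins final,The hosts won the cup after extra time.\n"
)
sample_df = pd.read_csv(sample_csv)

# Assumed mapping from class index to topic name (taken from classes.txt).
class_names = {1: "World", 2: "Sports", 3: "Business", 4: "Sci/Tech"}
sample_df["Topic"] = sample_df["Class Index"].map(class_names)

print(sample_df[["Class Index", "Topic", "Title"]])
```

The same `map` call should work on the real `train_df` once it is loaded, provided its label column is also called `Class Index`.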

## Import Packages

We put all the packages that are needed for this tutorial below:

In [12]:
import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import spacy
import torch

from gensim.models import Word2Vec

from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import adjusted_mutual_info_score, adjusted_rand_score

## Task Answers

The code block below contains answers for the assignments in this tutorial. **Do not check the answers in the next cell before practicing the tasks.**

In [45]:
def check_answer_df(df_result, df_answer, n=1):
    """
    This function checks if two output dataframes are the same.
    
    Parameters
    ----------
    df_result : pandas.DataFrame
        The result from the output of a function.
    df_answer : pandas.DataFrame or list of pandas.DataFrame
        The expected output of the function, or a list of acceptable outputs.
    n : int
        The numbering of the test case.
    """
    # A list of answers means that any one of them counts as correct
    multiple_answers = isinstance(df_answer, list)
    try:
        if multiple_answers:
            assert any(answer.equals(df_result) for answer in df_answer)
        else:
            assert df_answer.equals(df_result)
        print(f"Test case {n} passed.")
    except Exception:
        print(f"Test case {n} failed.")
        print("")
        print("Your output is:")
        print(df_result)
        print("")
        print("Expected output is", end="")
        if multiple_answers:
            print(" one of", end="")
        print(":")
        print(df_answer)

## Task 3: Preprocess Text Data

In this task, we will preprocess the text data from the AG News Dataset. First, we need to load the files.

In [43]:
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

display(train_df, test_df)

| Unnamed: 0 | Class Index | Title | Description |
|---|---|---|---|
| 0 | 3 | Wall St. Bears Claw Back Into the Black (Reuters) | Reuters - Short-sellers, Wall Street's dwindli... |
| 1 | 3 | Carlyle Looks Toward Commercial Aerospace (Reu... | Reuters - Private investment firm Carlyle Grou... |
| 2 | 3 | Oil and Economy Cloud Stocks' Outlook (Reuters) | Reuters - Soaring crude prices plus worries\ab... |
| 3 | 3 | Iraq Halts Oil Exports from Main Southern Pipe... | Reuters - Authorities have halted oil export\f... |
| 4 | 3 | Oil prices soar to all-time record, posing new... | AFP - Tearaway world oil prices, toppling reco... |
| ... | ... | ... | ... |
| 119995 | 1 | Pakistan's Musharraf Says Won't Quit as Army C... | KARACHI (Reuters) - Pakistani President Perve... |
| 119996 | 2 | Renteria signing a top-shelf deal | Red Sox general manager Theo Epstein acknowled... |
| 119997 | 2 | Saban not going to Dolphins yet | The Miami Dolphins will put their courtship of... |
| 119998 | 2 | Today's NFL games | PITTSBURGH at NY GIANTS Time: 1:30 p.m. Line: ... |


| Unnamed: 0 | Class Index | Title | Description |
|---|---|---|---|
| 0 | 3 | Fears for T N pension after talks | Unions representing workers at Turner Newall... |
| 1 | 4 | The Race is On: Second Private Team Sets Launc... | SPACE.com - TORONTO, Canada -- A second\team o... |
| 2 | 4 | Ky. Company Wins Grant to Study Peptides (AP) | AP - A company founded by a chemistry research... |
| 3 | 4 | Prediction Unit Helps Forecast Wildfires (AP) | AP - It's barely dawn when Mike Fitzpatrick st... |
| 4 | 4 | Calif. Aims to Limit Farm-Related Smog (AP) | AP - Southern California's smog-fighting agenc... |
| ... | ... | ... | ... |
| 7595 | 1 | Around the world | Ukrainian presidential candidate Viktor Yushch... |
| 7596 | 2 | Void is filled with Clement | With the supply of attractive pitching options... |
| 7597 | 2 | Martinez leaves bitter | Like Roger Clemens did almost exactly eight ye... |
| 7598 | 3 | 5 of arthritis patients in Singapore take Bext... | SINGAPORE : Doctors in the United States have ... |


### Tokenization 

Tokenization is the process of breaking down a text into individual tokens, which are usually words but can also be phrases or sentences. It may seem trivial, but there are many ways to tokenize a text, and the choice affects every later step of a natural language processing (NLP) application. Different languages, and even different contexts within the same language, can call for very different tokenization rules.

For example, in languages like English and Dutch, it is generally straightforward to identify words by using spaces as delimiters. However, there are exceptions, such as contractions like "can't" and hyphenated words like "self-driving". In other languages, such as Chinese and Japanese, there are no spaces between words, so identifying word boundaries is much more difficult.

Moreover, tokenization is often a crucial step in the NLP pipeline because the accuracy of the subsequent analysis depends on the quality of the tokens. Poor tokenization can lead to inaccurate results and can make it difficult to extract meaningful information from the text.

To illustrate tokenization, let's consider an example in Python using the NLTK library. The following code tokenizes a sample text using the `word_tokenize` function from the NLTK package, which first splits the text into sentences with the pre-trained Punkt model for English and then splits each sentence into words with rule-based, Treebank-style tokenization:

In [36]:
# Sample text
text = "The quick brown fox jumped over the lazy dog. The dog couldn't wait to sleep all day."

# Tokenize the text
tokens = word_tokenize(text)

# Print the results
print("Original text:", text)
print("Tokenized text:", tokens)

Original text: The quick brown fox jumped over the lazy dog. The dog couldn't wait to sleep all day.
Tokenized text: ['The', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog', '.', 'The', 'dog', 'could', "n't", 'wait', 'to', 'sleep', 'all', 'day', '.']
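
For contrast, the sketch below tokenizes a sentence without a dedicated tokenizer, first by splitting on whitespace and then with a hand-written regular expression. The pattern is an illustrative assumption, not what NLTK uses internally. It keeps contractions and hyphenated words together, whereas `word_tokenize` above splits "couldn't" into "could" and "n't", which shows that different tokenizers follow different conventions.

```python
import re

text = "The self-driving car couldn't stop."

# Naive approach: split on whitespace; punctuation stays glued to the last word.
whitespace_tokens = text.split()

# Regex approach: keep hyphenated words and contractions together,
# and emit every other non-space symbol as its own token.
pattern = r"\w+(?:[-']\w+)*|[^\w\s]"
regex_tokens = re.findall(pattern, text)

print(whitespace_tokens)  # ['The', 'self-driving', 'car', "couldn't", 'stop.']
print(regex_tokens)       # ['The', 'self-driving', 'car', "couldn't", 'stop', '.']
```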


### Lemmatization or stemming

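Stemming and lemmatization both reduce inflected words to a common base form. Stemming strips suffixes using fixed rules and may produce non-words, while lemmatization maps each word to its dictionary form (lemma). The code below shows both, but stick to one for the assignment.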

In [None]:
# Initialize the stemmer and lemmatizer
stemmer = SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()

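# Perform stemming and lemmatization on the tokens
# (WordNetLemmatizer treats every token as a noun unless `pos` is given,
# so verb forms such as "jumped" are left unchanged here)
stemmed_tokens = [stemmer.stem(token) for token in tokens]
lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]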

# Print the results
print("Stemmed text:", stemmed_tokens)
print("Lemmatized text:", lemmatized_tokens)
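
To make the difference between the two approaches concrete, here is a deliberately simplified sketch that needs no NLTK resources. `toy_stem` and `toy_lemmatize` are hypothetical helpers written for this illustration; real stemmers such as Snowball apply many more rules, and real lemmatizers use a full dictionary.

```python
def toy_stem(word):
    """Strip a few common suffixes by rule, without consulting a dictionary."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# A tiny hand-written dictionary standing in for a real lexical resource.
toy_lemmas = {"jumped": "jump", "slept": "sleep", "mice": "mouse", "studies": "study"}


def toy_lemmatize(word):
    """Look the word up in the dictionary; unknown words are returned unchanged."""
    return toy_lemmas.get(word, word)


for word in ["jumped", "slept", "mice", "studies"]:
    print(f"{word}: stem={toy_stem(word)}, lemma={toy_lemmatize(word)}")
```

The rule-based stemmer handles "jumped" but leaves the irregular forms "slept" and "mice" untouched and turns "studies" into the non-word "studi", while the dictionary lookup returns a proper word for each of them.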

### Stopword removal

- Assignment: Show the top-n word frequencies in the dataset, both overall and per topic, before and after pre-processing

Sources:

- https://www.kaggle.com/code/vukglisovic/classification-combining-lda-and-word2vec/notebook

In [35]:
# Remove English stopwords
english_stopwords = set(stopwords.words('english'))
filtered_tokens = [token for token in lemmatized_tokens if token.lower() not in english_stopwords]
print("Filtered text:", filtered_tokens)

Filtered text: ['quick', 'brown', 'fox', 'jump', 'lazy', 'dog', '.', 'dog', 'sleep', 'day', '.']
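
One possible way to count word frequencies is `collections.Counter` from the standard library. The sketch below runs on a made-up toy corpus with a two-word stopword list; `docs`, `toy_stopwords`, and `top_n` are illustrative names, and for the assignment the tokens and stopwords would come from the dataset and NLTK instead.

```python
from collections import Counter

# Toy corpus: (topic, tokens) pairs standing in for tokenized news articles.
docs = [
    ("Business", ["the", "oil", "price", "rose", "and", "the", "market", "fell"]),
    ("Business", ["the", "market", "rallied"]),
    ("Sports", ["the", "team", "won", "the", "match"]),
]
# Tiny hand-written stopword list (an assumption for this example).
toy_stopwords = {"the", "and"}


def top_n(token_lists, n=2, stopwords=frozenset()):
    """Return the n most common tokens across token_lists, skipping stopwords."""
    counts = Counter(
        token for tokens in token_lists for token in tokens if token not in stopwords
    )
    return counts.most_common(n)


all_tokens = [tokens for _, tokens in docs]
print("Before stopword removal:", top_n(all_tokens))
print("After stopword removal: ", top_n(all_tokens, stopwords=toy_stopwords))

# Per-topic counts after stopword removal
for topic in sorted({topic for topic, _ in docs}):
    topic_tokens = [tokens for t, tokens in docs if t == topic]
    print(topic, top_n(topic_tokens, stopwords=toy_stopwords))
```

Before filtering, the most frequent token is the stopword "the". After filtering, a content word ("market") takes its place, which is the effect the assignment asks you to show on the real data.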


## Task 4: Another option: spaCy

- Make a separate column with the text transformed into a spaCy `Doc`
- Show that tokens, lemmas, and stopword flags are immediately available on each token of the `Doc` (spaCy does not include a stemmer)
- Assignment: make a version of the text that is tokenized, lemmatized (or stemmed), and has stopwords removed

In [10]:
import spacy

# Load the small English model in spaCy
nlp = spacy.load("en_core_web_sm")

# Sample text
text = "The quick brown fox jumped over the lazy dog. The dog slept all day."

# Tokenize the text using spaCy
doc = nlp(text)

# Lemmatize the tokens, lowercase them, and drop stopwords
lemmatized_tokens = [token.lemma_.lower() for token in doc if not token.is_stop]

# Print the results
print("Original text:", text)
print("Lemmatized text:", lemmatized_tokens)

Original text: The quick brown fox jumped over the lazy dog. The dog slept all day.
Lemmatized text: ['quick', 'brown', 'fox', 'jump', 'lazy', 'dog', '.', 'dog', 'sleep', 'day', '.']


## Task 5: Unsupervised Learning - Topic Modelling

- Use LDA to transform preprocessed text into features
- Run a simple k-means (`KMeans`) to form clusters
- Let the student pick the number of clusters (elbow method)
- Evaluate using `adjusted_mutual_info_score` and `adjusted_rand_score`
- Similar to an assignment from the Applied ML course that these students took last year, although LDA was cut from the course that year

In [None]:


# Load preprocessed text data
data = pd.read_csv('preprocessed_text.csv')

# Define the number of topics to extract with LDA
num_topics = 10

# Convert preprocessed text to features using CountVectorizer
vectorizer = CountVectorizer(max_features=5000)
X = vectorizer.fit_transform(data['preprocessed_text'])

# Fit LDA to the feature matrix
lda = LatentDirichletAllocation(n_components=num_topics, max_iter=10, random_state=42)
lda.fit(X)

# Extract the topic proportions for each document
doc_topic_proportions = lda.transform(X)

# Determine the optimal number of clusters with KMeans using the elbow method
k_range = range(2, 11)
sse = []
for k in k_range:
    # Fix n_init and random_state so that the curve is reproducible
    kmeans = KMeans(n_clusters=k, max_iter=100, n_init=10, random_state=42)
    kmeans.fit(doc_topic_proportions)
    sse.append(kmeans.inertia_)

# Plot the elbow curve (sum of squared errors against the number of clusters)
plt.plot(k_range, sse)
plt.xlabel('Number of clusters')
plt.ylabel('SSE')
plt.show()

# Let the user select the number of clusters
num_clusters = int(input("Enter the number of clusters: "))

# Cluster the documents using KMeans (fixed seed for reproducibility)
kmeans = KMeans(n_clusters=num_clusters, max_iter=100, n_init=10, random_state=42)
kmeans.fit(doc_topic_proportions)
cluster_labels = kmeans.labels_

# Evaluate the clustering using adjusted mutual information score and adjusted rand score
ami_score = adjusted_mutual_info_score(data['category'], cluster_labels)
ari_score = adjusted_rand_score(data['category'], cluster_labels)

print(f"Adjusted mutual information score: {ami_score:.2f}")
print(f"Adjusted rand score: {ari_score:.2f}")
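
Cluster IDs are arbitrary, so plain accuracy cannot be used to compare clusters with the true topics. The small sketch below, on made-up labels, shows how the two adjusted scores behave: they ignore how the clusters are numbered and drop to around zero or below when the grouping carries no information about the topics. The label lists are illustrative assumptions, not results from the dataset.

```python
from sklearn.metrics import adjusted_mutual_info_score, adjusted_rand_score

true_topics = [1, 1, 2, 2, 3, 3]

# Same grouping as the true topics, but with different (arbitrary) cluster IDs.
relabelled = [2, 2, 0, 0, 1, 1]
# A grouping that splits every topic evenly across two clusters.
mixed = [0, 1, 0, 1, 0, 1]

for name, clusters in [("relabelled", relabelled), ("mixed", mixed)]:
    ari = adjusted_rand_score(true_topics, clusters)
    ami = adjusted_mutual_info_score(true_topics, clusters)
    print(f"{name}: ARI={ari:.2f}, AMI={ami:.2f}")
```

The relabelled grouping gets a perfect score of 1 on both metrics, while the mixed grouping scores below zero, which means it is worse than what would be expected by chance.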


## Task 6: Word embeddings

- Show code to make embeddings based on pre-processed text using both gensim (Word2Vec trained on NLTK tokens) and spaCy
- Assignment: Let student apply it to dataframe

Sources:
- https://www.shanelynn.ie/word-embeddings-in-python-with-spacy-and-gensim/
- https://www.kaggle.com/code/vukglisovic/classification-combining-lda-and-word2vec/notebook

In [None]:

# Load preprocessed text data
data = pd.read_csv('preprocessed_text.csv')

# Define the preprocessing tools
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    """Tokenize, remove stopwords (case-insensitively), and lemmatize a text."""
    tokens = word_tokenize(text)
    tokens = [token for token in tokens if token.lower() not in stop_words]
    tokens = [lemmatizer.lemmatize(token) for token in tokens]
    return tokens

# Apply the preprocessing function to the text data
data['tokens'] = data['preprocessed_text'].apply(preprocess_text)

# Train a Word2Vec model on the tokenized texts
# (gensim >= 4.0 uses `vector_size`; older versions called this parameter `size`)
model = Word2Vec(data['tokens'].tolist(), vector_size=100, window=5, min_count=1, workers=4)

# Get the word embedding for a specific word
# (raises KeyError if the word does not occur in the training data)
embedding = model.wv['word']
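
A classifier needs one fixed-length feature vector per document, not one vector per word. A common and simple choice, sketched below under the assumption that averaging is good enough for this task, is to take the mean of the word vectors in a document. The dictionary `toy_vectors` and the helper `document_vector` are hypothetical stand-ins; with a trained gensim model, the lookup would go through `model.wv` instead.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings standing in for a trained model's vectors.
toy_vectors = {
    "oil": np.array([1.0, 0.0, 0.0]),
    "price": np.array([0.0, 1.0, 0.0]),
    "market": np.array([1.0, 1.0, 0.0]),
}


def document_vector(tokens, vectors, dim=3):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[token] for token in tokens if token in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)


# "unknownword" has no vector and is skipped
print(document_vector(["oil", "price", "unknownword"], toy_vectors))  # [0.5 0.5 0. ]
```

Skipping unknown tokens and falling back to a zero vector keeps every document the same length, which the network in the next task relies on.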


In [33]:
import spacy

# Load the pre-trained spaCy model
nlp = spacy.load('en_core_web_sm')

# Load preprocessed text data
# data = pd.read_csv('preprocessed_text.csv')

# # Define the preprocessing function
# def preprocess_text(text):
#     doc = nlp(text)
#     tokens = [token.lemma_ for token in doc if not token.is_stop and not token.is_punct and token.is_alpha]
#     return tokens

# # Apply the preprocessing function to the text data
# data['tokens'] = data['preprocessed_text'].apply(preprocess_text)

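# Get the vector for a text (here the single token '.').
# Note: en_core_web_sm ships without static word vectors, so `.vector` falls back to
# context-sensitive tensors; en_core_web_md or en_core_web_lg provide real word vectors
embedding = nlp('.').vector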
embedding

array([-5.8810937e-01,  1.2690233e+00,  1.6575636e+00, -9.0438890e-01,
       -7.0677483e-01, -1.1966441e+00,  4.2419106e-01, -4.2574078e-01,
       -2.8125042e-01,  1.2154953e+00,  1.0961263e-01, -9.9083734e-01,
       -8.9280665e-01, -9.2996472e-01, -1.2193933e+00, -3.3290449e-01,
       -1.2119348e+00,  7.6204681e-01,  4.9417186e+00, -3.7561244e-01,
        2.1576166e-02, -5.2177596e-01, -2.1905096e+00, -7.6049173e-01,
       -1.4267705e-01,  2.4515245e+00, -2.9129535e-04,  3.4355882e-01,
        1.1452764e+00, -1.3602724e+00, -1.2848355e+00,  3.1477764e-02,
        7.5193155e-01,  7.0128936e-01,  2.0565279e+00,  9.5156097e-01,
        4.5888591e-01,  1.1683748e+00,  3.1925082e-01, -9.0628773e-01,
       -6.1355400e-01, -8.2875299e-01, -3.2473198e-01, -9.0215296e-01,
       -4.9787417e-01, -8.8159394e-01, -8.8454676e-01, -1.1683216e+00,
       -7.5443119e-02,  1.3703958e+00, -6.6398099e-02,  2.9801071e-01,
        6.4264160e-01,  3.4167087e-01,  2.9193616e-01, -3.6580968e-01,
      

## Task 7: Supervised Learning - Topic Classification

- Using the word embeddings as features, train a small neural net
- Don't give the full torch code; leave one layer for the students to implement so that they practice with torch
- Hyperparameter tuning (either some hinted ones or see if Ray Tune is worth it for this task)
- Evaluate using a confusion matrix against the true labels

Sources:
- https://pytorch.org/tutorials/beginner/hyperparameter_tuning_tutorial.html
- https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

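# Load preprocessed text data with word embeddings as features
# (assumes each entry of df['text'] is a fixed-length numeric vector)
df = pd.read_csv('preprocessed_data.csv')

# Encode topics as integers 0..n_classes-1, as nn.CrossEntropyLoss expects
labels = df['topic'].astype('category').cat.codes.astype('int64')

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(df['text'], labels, test_size=0.2, random_state=42)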

# Define a simple neural network with one hidden layer
class Net(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        return x

# Define hyperparameters
input_dim = len(X_train.iloc[0])  # positional access; index label 0 may be in the test split
hidden_dim = 100
output_dim = len(y_train.unique())
lr = 0.001
epochs = 10

# Initialize model, optimizer and loss function
model = Net(input_dim, hidden_dim, output_dim)
optimizer = optim.Adam(model.parameters(), lr=lr)
criterion = nn.CrossEntropyLoss()

# Train the model
for epoch in range(epochs):
    running_loss = 0.0
    for i in range(len(X_train)):
        # Convert input and target to tensors with the dtypes the model and loss expect
        input_tensor = torch.tensor(X_train.iloc[i], dtype=torch.float32)
        target_tensor = torch.tensor(y_train.iloc[i], dtype=torch.long)
        
        # Clear the gradients
        optimizer.zero_grad()

        # Forward pass
        output = model(input_tensor)

        # Calculate loss
        loss = criterion(output, target_tensor)

        # Backward pass
        loss.backward()

        # Update parameters
        optimizer.step()

        # Print statistics
        running_loss += loss.item()
        if i % 100 == 99:    # Print every 100 samples (one sample per step)
            print('[%d, %5d] loss: %.3f' %
                  (epoch + 1, i + 1, running_loss / 100))
            running_loss = 0.0

# Evaluate the model
y_pred = []
with torch.no_grad():
    for i in range(len(X_test)):
        # Convert input to tensor
        input_tensor = torch.tensor(X_test.iloc[i], dtype=torch.float32)

        # Forward pass
        output = model(input_tensor)

        # Get predicted class (index of the largest logit)
        predicted = torch.argmax(output)
        y_pred.append(predicted.item())

# Calculate confusion matrix
conf_mat = confusion_matrix(y_test, y_pred)
print(conf_mat)
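
A confusion matrix is easier to interpret once it is reduced to a few summary numbers. The sketch below uses a made-up 4-class matrix (not a result from this tutorial) in scikit-learn's layout, where rows are true labels and columns are predicted labels, and derives overall accuracy plus per-class recall and precision with NumPy.

```python
import numpy as np

# Made-up confusion matrix for 4 classes: rows are true labels, columns are predictions.
toy_conf_mat = np.array([
    [50,  2,  3,  5],
    [ 1, 55,  2,  2],
    [ 4,  1, 45, 10],
    [ 3,  2,  9, 46],
])

# Correct predictions sit on the diagonal
accuracy = np.trace(toy_conf_mat) / toy_conf_mat.sum()
# Recall: fraction of each true class that was predicted correctly (row-wise)
recall = np.diag(toy_conf_mat) / toy_conf_mat.sum(axis=1)
# Precision: fraction of each predicted class that was correct (column-wise)
precision = np.diag(toy_conf_mat) / toy_conf_mat.sum(axis=0)

print(f"Accuracy: {accuracy:.3f}")
print("Recall per class:   ", np.round(recall, 3))
print("Precision per class:", np.round(precision, 3))
```

Large off-diagonal entries, such as the 10 and 9 between the last two classes here, point to the pairs of topics that the model confuses most often.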
