# 5.3 Deep Learning Methods, Part 3

Prerequisite: 

*  A basic understanding of the training process for neural networks, including forward pass, loss computation, and backpropagation. 
*  You should also be able to train and evaluate basic models in PyTorch using Datasets and DataLoaders.

Activity:

*  We will explore deep learning using PyTorch for text classification.
*  We will cover encoding and deep learning models for text using PyTorch.


## 5.3.1 Text processing pipeline in PyTorch

![](images/text_processing_pipeline.png)

The text-processing pipeline in PyTorch has three components: preprocessing, encoding, and loading the data through a `Dataset` and a `DataLoader`.

Starting from the **raw data**, we apply the following components in order:

1. The first pipeline component is **preprocessing**. We clean and prepare the text data for encoding. We'll use PyTorch and NLTK to transform raw text into processed text.

* The first step in text preprocessing is **tokenization**. We'll use the `get_tokenizer` function imported from `torchtext.data.utils`. Tokenization turns each text into a list of `tokens`.

* The second step is **stop word removal**. We download NLTK's stopword collection with `nltk.download` and import `stopwords` from `nltk.corpus`. Because the stopword list is lowercase, we first lowercase the text with the string method **lower**. We then build a set of English stopwords (no duplicates) from `stopwords.words`. With a list comprehension, we iterate through the `tokens` created earlier and keep only those that are not stopwords, which gives the `filtered tokens`.

* The third step is **stemming**, which reduces words or tokens to their base or root form for simplified analysis. We use NLTK's `PorterStemmer` class to stem each token.

2. The second pipeline component is **encoding**. The preprocessed text is passed to `CountVectorizer` from `sklearn.feature_extraction.text`, which turns each document into a vector of token counts.

3. The pipeline is completed with PyTorch's **Dataset** and **DataLoader**. The `Dataset` serves as a container for our processed and encoded text data. The `DataLoader` then lets us iterate over this dataset in **batches**, **shuffle** the data, and use **multiprocessing** for efficient loading.
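
The encoding and loading components can be sketched on a toy corpus. This is a minimal illustration rather than the notebook's pipeline: the three short documents, their labels, and the `count_encode` helper are made up for the example, and the hand-rolled counting only mimics the idea behind `CountVectorizer`.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical documents that are already lowercased, filtered, and stemmed.
docs = ["trump win elect", "senat vote bill vote", "elect vote count"]
labels = [0, 1, 1]  # made-up labels: 0 = FAKE, 1 = REAL


def count_encode(documents):
    """Toy bag-of-words encoder: one row per document, one column per vocabulary word."""
    vocab = sorted({word for doc in documents for word in doc.split()})
    index = {word: i for i, word in enumerate(vocab)}
    matrix = torch.zeros(len(documents), len(vocab))
    for row, doc in enumerate(documents):
        for word in doc.split():
            matrix[row, index[word]] += 1
    return matrix, vocab


features, vocab = count_encode(docs)

# The Dataset pairs each feature row with its label; the DataLoader serves batches.
dataset = TensorDataset(features, torch.tensor(labels))
loader = DataLoader(dataset, batch_size=2, shuffle=False)

batch_shapes = [tuple(x.shape) for x, _ in loader]
print(vocab)
print(batch_shapes)
```

With three documents and a batch size of 2, the loader yields one full batch and one final batch holding the remaining document.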

## 5.3.2 Our text data

In [1]:

import pandas as pd

news_df = pd.read_csv("datasets/fake_or_real_news.csv", index_col = "Unnamed: 0")

news_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 6335 entries, 8476 to 4330
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   title   6335 non-null   object
 1   text    6335 non-null   object
 2   label   6335 non-null   object
dtypes: object(3)
memory usage: 198.0+ KB


In [2]:
news_df.label.value_counts()

label
REAL    3171
FAKE    3164
Name: count, dtype: int64

In [3]:
# !pip install --upgrade pip
# !pip install dill
# !pip install huggingface-hub
# !pip install multiprocess
# !pip install torchtext

In [4]:
import torch
from torchtext.data.utils import get_tokenizer

import nltk
nltk.download('stopwords')

from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

# Create a set of English stopwords
stop_words = set(stopwords.words("english"))

# Initialize the tokenizer and stemmer
tokenizer = get_tokenizer("basic_english")
stemmer = PorterStemmer()

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\lexmuga\anaconda3\envs\math103b\lib\nltk_data
[nltk_data]     ...
[nltk_data]   Package stopwords is already up-to-date!


In [5]:
# Preprocess texts: lowercase, tokenize, remove stopwords, and stem
def preprocess_news(data):
    processed_news = []
    for news in data:
        news = news.lower()
        tokens = tokenizer(news)
        tokens = [token for token in tokens if token not in stop_words]
        tokens = [stemmer.stem(token) for token in tokens]
        processed_news.append(' '.join(tokens))
    return processed_news

# processed_news = preprocess_news(news_df.text)

In [6]:
# len(processed_news[0])

In [7]:
# processed_news[0][:274]

In [8]:
# print(len(processed_news))
# print(news_df.text.shape)
# print(news_df.label.shape)

In [9]:
import numpy as np
labels = news_df.label
encoded_labels = np.array([0 if label == "FAKE" else 1 for label in labels])
encoded_labels.shape

(6335,)

In [10]:
# from torch.utils.data.dataset import Dataset
# from torch.utils.data.dataloader import DataLoader


import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from torch.utils.data import DataLoader, TensorDataset
batch_size = 2  # Adjust as needed
test_size = 0.2  # Adjust as needed
random_state = 42  # Adjust as needed
num_classes = 2

# Encode the preprocessed texts as bag-of-words count vectors
def encode_news(news):
    vectorizer = CountVectorizer()
    features = vectorizer.fit_transform(news)
    return features.toarray(), vectorizer

# Step 2: Create a dataset and dataloader
# Full text processing pipeline: preprocess, encode, split, and wrap in DataLoaders
def text_processing_pipeline(news):
    processed_news = preprocess_news(news)
    encoded_news, vectorizer = encode_news(processed_news)
    X_train, X_test, y_train, y_test = train_test_split(encoded_news, encoded_labels, 
                                                        test_size = test_size, 
                                                        random_state = random_state)
    input_dim = X_train.shape[1]
    X_train = X_train.astype(np.float32)
    X_test =  X_test.astype(np.float32)
    y_train = y_train.astype(np.float32)
    y_test =  y_test.astype(np.float32)
    train_dataset = TensorDataset(torch.tensor(X_train), 
                                  torch.tensor(y_train))
    test_dataset =  TensorDataset(torch.tensor(X_test), 
                                  torch.tensor(y_test))
    train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    test_dataloader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)  # evaluation does not need shuffling

    dataset = [train_dataset, test_dataset]
    dataloader = [train_dataloader, test_dataloader]
    trainset = [X_train, y_train]
    testset = [X_test, y_test]
    return  dataset, dataloader, input_dim, trainset, testset, processed_news
    

In [11]:
import time
# get the start time
st = time.time()

#train_dataset, test_dataset, train_dataloader, test_dataloader, input_dim, X_train, X_test, y_train, y_test = text_processing_pipeline(news_df.text)

dataset, dataloader, input_dim, trainset, testset, processed_news = text_processing_pipeline(news_df.text)

train_dataset, test_dataset = dataset
train_dataloader, test_dataloader = dataloader
X_train, y_train = trainset
X_test, y_test = testset

# get the end time
et = time.time()

# get the execution time
print('Execution time:', et - st, 'seconds')


# execution time: about 54.36970829963684 seconds


Execution time: 56.43265104293823 seconds


In [12]:
import time
# get the start time
st = time.time()
input_dim = input_dim
output_dim = num_classes


import torch
import torch.nn as nn

# Step 3: Define the neural network model
class TextClassificationModel(nn.Module):
    def __init__(self, input_dim, output_dim):
        super(TextClassificationModel, self).__init__()
        self.fc = nn.Linear(input_dim, output_dim)

    def forward(self, x):
        # Return raw logits; nn.CrossEntropyLoss applies log-softmax internally
        x = self.fc(x)
        return x

# Step 4: Define the loss function and optimizer
import torch.optim as optim

output_dim = num_classes

model = TextClassificationModel(input_dim, output_dim )

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)


# Step 5: Train the model
def train_model(model, iterator, optimizer, criterion):
    """Run one epoch of training over the batches in `iterator`."""
    model.train()
    for features, targets in iterator:
        optimizer.zero_grad()
        predictions = model(features)

        # CrossEntropyLoss expects integer class indices as targets
        loss = criterion(predictions, targets.long())
        loss.backward()
        optimizer.step()

# Training loop

num_epochs = 10  # Adjust as needed
for epoch in range(num_epochs):
    train_model(model, train_dataloader, optimizer, criterion)

# Step 6: Evaluate the model
# from sklearn import metrics
from sklearn.metrics import accuracy_score, confusion_matrix

def evaluate_model(model, iterator):
    model.eval()
    all_predictions = []
    all_targets = []
    with torch.no_grad():
        for data in iterator:
            features, targets = data
            predictions = model(features)
            all_predictions.extend(predictions.argmax(1).tolist())
            all_targets.extend(targets.tolist())
    return all_predictions, all_targets

# Evaluate the model
predictions, true_targets = evaluate_model(model, test_dataloader)
accuracy = accuracy_score(true_targets, predictions)
print(f'Accuracy: {accuracy:.2f}')

# get the end time
et = time.time()

# get the execution time
print('Execution time:', et - st, 'seconds')

# Accuracy: 0.93
# Execution time: 56.27982807159424 seconds

Accuracy: 0.93
Execution time: 55.265727519989014 seconds


In [13]:
conf_matrix =  confusion_matrix(y_true = true_targets, 
                                y_pred = predictions)
conf_matrix

array([[589,  39],
       [ 54, 585]], dtype=int64)

**Confusion matrix**: the entry in row `i` and column `j` is the number of samples whose true label is class `i` and whose predicted label is class `j`.
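
The definition can be checked by counting label pairs directly. The sketch below uses made-up labels and a hypothetical helper, `confusion_counts`; it is not how scikit-learn computes the matrix, only the same counting rule written out by hand. It treats class 1 as the positive class, which matches the `tn, fp, fn, tp` unpacking used in this notebook.

```python
def confusion_counts(y_true, y_pred, num_classes=2):
    """Return a table whose rows are true labels and whose columns are predictions."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[true_label][predicted_label] += 1
    return matrix


# Made-up labels for illustration: 0 = FAKE, 1 = REAL.
y_true = [0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 1, 0, 1, 1, 0, 1]

matrix = confusion_counts(y_true, y_pred)

# Row 0 holds the true negatives and false positives,
# row 1 holds the false negatives and true positives.
(tn, fp), (fn, tp) = matrix
print(matrix)
print(tn, fp, fn, tp)
```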

In [14]:
tn, fp, fn, tp = confusion_matrix(true_targets, predictions).ravel()
print(f'True Negative = {tn}')
print(f'False Positive = {fp}')
print(f'False Negative = {fn}')
print(f'True Positive = {tp}')

True Negative = 589
False Positive = 39
False Negative = 54
True Positive = 585


In [15]:
# Sensitivity (recall) = tp / (tp + fn):
# the fraction of actual positives that the model correctly identifies
print(f'sensitivity = {round(tp/(tp + fn), 3)}')

sensitivity = 0.915


In [16]:
# specificity = tn / (tn + fp) 
print(f'specificity = {round(tn/(tn + fp),3)}')

specificity = 0.938


In [17]:
# Precision = tp / (tp +fp)
print(f'Precision = {round(tp / (tp + fp), 3)}')

Precision = 0.938


## Increasing the number of hidden layers and using the ReLU activation function

In [18]:
import time
import torch
import torch.nn as nn

# get the start time
st = time.time()

hidden_dim = 4
#output_dim = 2
num_classes = 2

# Step 3: Define the neural network model
class TextClassificationModel2(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(TextClassificationModel2, self).__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.relu1 = nn.ReLU()
        self.fc2 = nn.Linear(hidden_dim, hidden_dim)
        self.relu2 = nn.ReLU()
        self.fc3 = nn.Linear(hidden_dim, output_dim)  # outputs raw logits for nn.CrossEntropyLoss

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu1(x)  # Apply first ReLU activation
        x = self.fc2(x)
        x = self.relu2(x)  # Apply second ReLU activation
        x = self.fc3(x)
        return x

# Step 4: Define the loss function and optimizer
import torch.optim as optim

output_dim = num_classes

model2 = TextClassificationModel2(input_dim, hidden_dim, output_dim)
criterion2 = nn.CrossEntropyLoss()
optimizer2 = optim.Adam(model2.parameters(), lr=0.001)

# Step 5: Train the model

# Training loop
num_epochs = 10  # Adjust as needed
for epoch in range(num_epochs):
    train_model(model2, train_dataloader, optimizer2, criterion2)

# Step 6: Evaluate the model

# Evaluate the model
predictions2, true_targets = evaluate_model(model2, test_dataloader)
accuracy2 = accuracy_score(true_targets, predictions2)
print(f'Accuracy: {accuracy2:.2f}')

# get the end time
et = time.time()

# get the execution time
elapsed_time = et - st
print('Execution time:', elapsed_time, 'seconds')

# Accuracy: 0.93
# Execution time: 74.64704155921936 seconds

Accuracy: 0.93
Execution time: 81.53462433815002 seconds


## Using cross validation

In [19]:
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score
import time

# get the start time
st = time.time()

# Define the number of folds
num_folds = 5  # You can adjust this as needed

# Initialize a list to store accuracy scores
accuracy_scores = []

# Define the input and output dimensions
input_dim = input_dim
output_dim = num_classes


# Define the model_cv
model_cv  = TextClassificationModel2(input_dim, hidden_dim, output_dim)
criterion_cv = nn.CrossEntropyLoss()
optimizer_cv = optim.Adam(model_cv.parameters(), lr=0.001)

# Create KFold cross-validator
kf = KFold(n_splits=num_folds, shuffle=True, random_state=42)

# Perform k-fold cross-validation
for fold, (train_idx, val_idx) in enumerate(kf.split(X_train)):
    print(f'Fold {fold + 1}')
    
    # Split the data into training and validation sets for this fold
    X_train_fold, X_val_fold = X_train[train_idx], X_train[val_idx]
    y_train_fold, y_val_fold = y_train[train_idx], y_train[val_idx]

    # Convert NumPy arrays to PyTorch tensors
    train_data = TensorDataset(torch.tensor(X_train_fold), torch.tensor(y_train_fold))
    val_data = TensorDataset(torch.tensor(X_val_fold), torch.tensor(y_val_fold))

    train_dataloader_cv = DataLoader(train_data, batch_size=batch_size, shuffle=True, pin_memory=True)
    val_dataloader = DataLoader(val_data, batch_size=batch_size, shuffle=False, pin_memory=True)

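    # Train on this fold. Note: model_cv and optimizer_cv are created once, before
    # the loop, so the weights carry over from fold to fold. From fold 2 onward the
    # model has already been trained on samples in the current validation split,
    # which inflates the validation accuracy. For independent fold estimates,
    # create the model and optimizer inside this loop.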
    for epoch in range(num_epochs):
        train_model(model_cv, train_dataloader_cv, optimizer_cv, criterion_cv)
    
    # Evaluate the model on the validation set
    predictions_cv, true_labels_cv = evaluate_model(model_cv, val_dataloader)
    accuracy_cv = accuracy_score(true_labels_cv, predictions_cv)
    accuracy_scores.append(accuracy_cv)
    print(f'Validation Accuracy for Fold {fold + 1}: {accuracy_cv:.2f}')

# Calculate and print the mean and standard deviation of accuracy scores
mean_accuracy = sum(accuracy_scores) / len(accuracy_scores)
std_accuracy = (sum((x - mean_accuracy) ** 2 for x in accuracy_scores) / len(accuracy_scores)) ** 0.5
print(f'Mean Accuracy: {mean_accuracy:.2f}')
print(f'Standard Deviation: {std_accuracy:.2f}')

# Get the end time
et = time.time()

# Get the execution time
elapsed_time = et - st
print('Execution time:', elapsed_time, 'seconds')

# Fold 1
# Validation Accuracy for Fold 1: 0.91
# Fold 2
# Validation Accuracy for Fold 2: 0.97
# Fold 3
# Validation Accuracy for Fold 3: 0.99
# Fold 4
# Validation Accuracy for Fold 4: 0.99
# Fold 5
# Validation Accuracy for Fold 5: 1.00
# Mean Accuracy: 0.97
# Standard Deviation: 0.03
# Execution time: 1232.3690402507782 seconds


Fold 1
Validation Accuracy for Fold 1: 0.91
Fold 2
Validation Accuracy for Fold 2: 0.99
Fold 3
Validation Accuracy for Fold 3: 1.00
Fold 4
Validation Accuracy for Fold 4: 1.00
Fold 5
Validation Accuracy for Fold 5: 1.00
Mean Accuracy: 0.98
Standard Deviation: 0.03
Execution time: 389.9281520843506 seconds


In [20]:
# Evaluate model_cv
predictions3, true_targets = evaluate_model(model_cv, test_dataloader)
accuracy3 = accuracy_score(true_targets, predictions3)
print(f'Accuracy: {accuracy3:.2f}')

Accuracy: 0.91
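
The fold accuracies above climb toward 1.00 because a single `model_cv` keeps training across folds, so later validation splits contain samples it has already seen. A minimal sketch of the usual alternative, where a fresh model is built for every fold, is shown below. It rests on assumptions made for the example: the data is synthetic, `make_model` is a hypothetical helper, the folds are cut by hand with `torch.randperm` as a stand-in for `KFold`, and training is full-batch for brevity.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Synthetic stand-in data: 100 samples, 8 features, label set by the first feature.
X = torch.randn(100, 8)
y = (X[:, 0] > 0).long()


def make_model(input_dim, hidden_dim=4, output_dim=2):
    """Build a fresh, untrained network."""
    return nn.Sequential(
        nn.Linear(input_dim, hidden_dim),
        nn.ReLU(),
        nn.Linear(hidden_dim, output_dim),
    )


num_folds = 5
# Shuffle the indices once and cut them into equally sized folds.
folds = torch.randperm(len(X)).chunk(num_folds)
fold_accuracies = []

for k, val_idx in enumerate(folds):
    train_idx = torch.cat([f for i, f in enumerate(folds) if i != k])

    # New model and optimizer for every fold, so no weights carry over.
    model = make_model(X.shape[1])
    optimizer = optim.Adam(model.parameters(), lr=0.01)
    criterion = nn.CrossEntropyLoss()

    model.train()
    for epoch in range(50):
        optimizer.zero_grad()
        loss = criterion(model(X[train_idx]), y[train_idx])
        loss.backward()
        optimizer.step()

    model.eval()
    with torch.no_grad():
        predictions = model(X[val_idx]).argmax(dim=1)
    fold_accuracies.append((predictions == y[val_idx]).float().mean().item())

print([round(a, 2) for a in fold_accuracies])
```

Each fold's accuracy then reflects a model that never saw that fold's validation samples during training.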


## Using embedding with the vocabulary of the text data

In [20]:
import pandas as pd
import numpy as np
import re
import nltk
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('stopwords')
from nltk.tokenize import word_tokenize

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\lexmuga\anaconda3\envs\math103b\lib\nltk_data
[nltk_data]     ...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\lexmuga\anaconda3\envs\math103b\lib\nltk_data
[nltk_data]     ...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\lexmuga\anaconda3\envs\math103b\lib\nltk_data
[nltk_data]     ...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\lexmuga\anaconda3\envs\math103b\lib\nltk_data
[nltk_data]     ...
[nltk_data]   Package stopwords is already up-to-date!


In [21]:
from itertools import chain
import time

# get the start time
# st = time.time()

# Step 2: Tokenization and Vocabulary 
# processed_news = preprocess_news(news_df.text)
nopunc_news = [re.sub(r'[^\w\s]', '', item) for item in processed_news]
tokenized_news = [word_tokenize(item) for item in nopunc_news]

# Flatten the per-document token lists into a single list of words
words = list(chain.from_iterable(tokenized_news))

# Build the vocabulary: unique words in order of first appearance
vocab = list(dict.fromkeys(words))

word_to_idx = {word: i for i, word in enumerate(vocab)}
# idx_to_word = {idx: word for word, idx in word_to_idx.items()}

# Convert the word sequence to a tensor of vocabulary indices
inputs = torch.LongTensor([word_to_idx[w] for w in words])

# Step 3: Create an Embedding Layer
embedding_dim = 500  # Adjust the dimension as needed
num_embeddings = len(vocab)

# Initialize embedding layer 
embedding = nn.Embedding(num_embeddings, embedding_dim)

# Pass the tensor to the embedding layer
output = embedding(inputs)


# Step 4: Modify the Model Architecture
class TextClassificationModelWithEmbedding(nn.Module):
    def __init__(self, num_embeddings, embedding_dim, hidden_dim, output_dim):
        super(TextClassificationModelWithEmbedding, self).__init__()
        self.embedding = nn.Embedding(num_embeddings, embedding_dim)
        self.fc1 = nn.Linear(embedding_dim, hidden_dim)
        self.relu1 = nn.ReLU()
        self.fc2 = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        # x must hold integer token indices with shape (batch_size, sequence_length)
        embedded = self.embedding(x)
        x = embedded.mean(dim=1)  # Average the embeddings over the tokens of each document
        x = self.fc1(x)
        x = self.relu1(x)  # Apply ReLU activation
        x = self.fc2(x)
        return x  # Raw logits; nn.CrossEntropyLoss applies log-softmax itself

# Step 5: Define the loss function and optimizer
import torch.optim as optim

output_dim = num_classes

model_embed = TextClassificationModelWithEmbedding(num_embeddings, embedding_dim, hidden_dim, output_dim)
criterion_embed = nn.CrossEntropyLoss()
optimizer_embed = optim.Adam(model_embed.parameters(), lr=0.001)


# Step 6: Train the model

# Training loop
num_epochs = 10  # Adjust as needed
for epoch in range(num_epochs):
    train_model(model_embed, train_dataloader, optimizer_embed, criterion_embed)


RuntimeError: Expected tensor for argument #1 'indices' to have one of the following scalar types: Long, Int; but got torch.FloatTensor instead (while checking arguments for embedding)

In [None]:
print(inputs.shape)
print(output.shape)

In [23]:

# Step 7: Evaluate the model

# Evaluate the model
predictions_embed, true_targets = evaluate_model(model_embed, test_dataloader)
accuracy_embed = accuracy_score(true_targets, predictions_embed)
print(f'Accuracy: {accuracy_embed:.2f}')

# get the end time
# et = time.time()

# Get the execution time
# elapsed_time = et - st
# print('Execution time:', elapsed_time, 'seconds')

RuntimeError: Expected tensor for argument #1 'indices' to have one of the following scalar types: Long, Int; but got torch.FloatTensor instead (while checking arguments for embedding)
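
The `RuntimeError` arises because `train_dataloader` and `test_dataloader` yield float count vectors, while `nn.Embedding` looks up rows by integer index. One possible way around it is to encode every document as a padded sequence of word indices and feed those to the model. The sketch below illustrates the idea under stated assumptions: the three tokenized documents are made up, index 0 is reserved for padding, and `MeanEmbeddingClassifier` is a hypothetical, simplified model rather than the class defined above.

```python
import torch
import torch.nn as nn

# Hypothetical tokenized documents; in the notebook these would come from tokenized_news.
tokenized_docs = [["elect", "vote", "count"], ["senat", "vote"], ["trump"]]

# Reserve index 0 for padding so that real words start at 1.
vocab = sorted({word for doc in tokenized_docs for word in doc})
word_to_idx = {word: i + 1 for i, word in enumerate(vocab)}

# Encode each document as integer indices, padded with zeros to a common length.
max_len = max(len(doc) for doc in tokenized_docs)
indices = torch.zeros(len(tokenized_docs), max_len, dtype=torch.long)
for row, doc in enumerate(tokenized_docs):
    indices[row, :len(doc)] = torch.tensor([word_to_idx[w] for w in doc])


class MeanEmbeddingClassifier(nn.Module):
    """Averages word embeddings over each document, then applies a linear layer."""

    def __init__(self, num_embeddings, embedding_dim, output_dim):
        super().__init__()
        self.embedding = nn.Embedding(num_embeddings, embedding_dim, padding_idx=0)
        self.fc = nn.Linear(embedding_dim, output_dim)

    def forward(self, x):
        embedded = self.embedding(x)              # (batch, seq_len, embedding_dim)
        mask = (x != 0).unsqueeze(-1).float()     # 1 for real tokens, 0 for padding
        summed = (embedded * mask).sum(dim=1)     # sum over real tokens only
        lengths = mask.sum(dim=1).clamp(min=1)    # avoid division by zero
        return self.fc(summed / lengths)          # raw logits for CrossEntropyLoss


model = MeanEmbeddingClassifier(len(vocab) + 1, embedding_dim=16, output_dim=2)
logits = model(indices)
print(indices.dtype, tuple(logits.shape))
```

A `TensorDataset` built from `indices` and integer labels could then replace the count-vector datasets when training an embedding model.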