Name: Shirisha Hechirla

# RNN Classifier with LSTM Trained on Own Dataset (IMDB)

Example notebook showing how to use your own CSV text dataset to train a simple RNN for sentiment classification (here: a binary classification problem with two labels, positive and negative) using LSTM (Long Short-Term Memory) cells.

In [1]:
!pip install torch==2.0.1 torchvision==0.15.2 torchtext==0.15.2



In [2]:

import torch
import torch.nn.functional as F
import torchtext
import time
import random
import pandas as pd

torch.backends.cudnn.deterministic = True

## General Settings

In [96]:
RANDOM_SEED = 123
torch.manual_seed(RANDOM_SEED)

VOCABULARY_SIZE = 20000
LEARNING_RATE = 0.005
BATCH_SIZE = 128
NUM_EPOCHS = 15
DEVICE = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

EMBEDDING_DIM = 128
HIDDEN_DIM = 256
NUM_CLASSES = 2

## Download Dataset

The following cells download the IMDB movie review dataset (http://ai.stanford.edu/~amaas/data/sentiment/) for positive-negative sentiment classification as a CSV-formatted file:

In [4]:
!wget https://github.com/rasbt/python-machine-learning-book-3rd-edition/raw/master/ch08/movie_data.csv.gz

--2024-12-13 23:24:26--  https://github.com/rasbt/python-machine-learning-book-3rd-edition/raw/master/ch08/movie_data.csv.gz
Resolving github.com (github.com)... 140.82.113.4
Connecting to github.com (github.com)|140.82.113.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/rasbt/python-machine-learning-book-3rd-edition/master/ch08/movie_data.csv.gz [following]
--2024-12-13 23:24:26--  https://raw.githubusercontent.com/rasbt/python-machine-learning-book-3rd-edition/master/ch08/movie_data.csv.gz
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 26521894 (25M) [application/octet-stream]
Saving to: ‘movie_data.csv.gz’


2024-12-13 23:24:27 (81.4 MB/s) - ‘movie_data.csv.gz’ saved [26521894/26521894]



In [5]:
!gunzip -f movie_data.csv.gz

Check that the dataset looks okay:

In [6]:
df = pd.read_csv('movie_data.csv')
df.head()

Unnamed: 0,review,sentiment
0,"In 1974, the teenager Martha Moxley (Maggie Gr...",1
1,OK... so... I really like Kris Kristofferson a...,0
2,"***SPOILER*** Do not read this, if you think a...",0
3,hi for all the people who have seen this wonde...,1
4,"I recently bought the DVD, forgetting just how...",0


In [7]:
del df

## Prepare Dataset with Torchtext

In [8]:
# !conda install spacy

Download the English spaCy model via:
    
- `python -m spacy download en_core_web_sm`

Define the tokenizer and build a vocabulary from a toy example:

In [9]:
import torchtext
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

# Tokenizer
tokenizer = get_tokenizer("spacy", language="en_core_web_sm")

# Example dataset (replace with your dataset's text data)
train_data = ["I love this movie", "This is a bad movie"]

# Vocabulary
def yield_tokens(data_iter):
    for text in data_iter:
        yield tokenizer(text)

vocab = build_vocab_from_iterator(yield_tokens(train_data), specials=["<unk>"])
vocab.set_default_index(vocab["<unk>"])

print(f"Vocabulary size: {len(vocab)}")

Vocabulary size: 9
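
The size of 9 follows from eight distinct tokens plus `<unk>`: the spaCy tokenizer does not lowercase, so `This` and `this` are counted separately. A quick check of that arithmetic, assuming plain whitespace splitting as a stand-in for the spaCy tokenizer (for these punctuation-free sentences it produces the same tokens):

```python
sentences = ["I love this movie", "This is a bad movie"]

# Whitespace splitting stands in for the spaCy tokenizer (an assumption)
unique_tokens = {token for sentence in sentences for token in sentence.split()}

vocab_size = len(unique_tokens) + 1  # +1 for the "<unk>" special token
print(vocab_size)  # 9
```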


Process the dataset:

In [10]:
import torch
from torch.utils.data import Dataset
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

# Tokenizer
tokenizer = get_tokenizer("basic_english")

# Define custom dataset
class MovieDataset(Dataset):
    def __init__(self, csv_file, text_column, label_column, vocab=None):
        import pandas as pd
        self.df = pd.read_csv(csv_file)
        self.text_column = text_column
        self.label_column = label_column
        self.tokenizer = tokenizer

        # Build vocabulary if not provided
        if vocab is None:
            self.vocab = build_vocab_from_iterator(self._yield_tokens(), specials=["<unk>"])
            self.vocab.set_default_index(self.vocab["<unk>"])
        else:
            self.vocab = vocab

    def _yield_tokens(self):
        for text in self.df[self.text_column]:
            yield self.tokenizer(text)

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        text = self.df.iloc[idx][self.text_column]
        label = self.df.iloc[idx][self.label_column]
        # Convert text to token indices and label to tensor
        token_indices = [self.vocab[token] for token in self.tokenizer(text)]
        return torch.tensor(token_indices, dtype=torch.long), torch.tensor(label, dtype=torch.long)

# Create dataset
csv_file = "movie_data.csv"  # Replace with your dataset path
text_column = "review"  # Replace with your text column name
label_column = "sentiment"  # Replace with your label column name

dataset = MovieDataset(csv_file, text_column, label_column)

print(f"Dataset size: {len(dataset)}")
print(f"First item: {dataset[0]}")

Dataset size: 50000
First item: (tensor([    11,   5681,      3,      1,   2123,   3956,  28450,     24,   4741,
          1517,     23,   1097,      7,      1,  21396,   1557,      6,   6637,
           724,      3,  14901,      3,   8576,      2,     27,      1,  15546,
           323,      3,   3564,      6,   2099,      3,     61,     16,   1885,
            11,      1,   8712,      6,     48,    332,      4,     48,    596,
          4615,  15649,      2,  35014,    161,    310,      3,      1,    733,
           930,  26779,     24,   1418,  20166,     23,      3,     41,      9,
             5,   1094,   1121,   1355,     14,     52,   2982,     11,   5944,
            19,  52791,     11,    796,      2,   1626,      2,   5259,   3149,
             4,   1609,      7,  22888,      3,   1064,      7,   3830,      1,
           422,     18,     31,   1934,   1748,   2263,     24,   3603,   3233,
            23,     18,      1,   1245,      6,    497,      5,    282,      2,
       

## Split Dataset into Train/Validation/Test

Split the dataset into training, validation, and test partitions:

In [11]:
from torch.utils.data import random_split

# Define split sizes
train_size = int(0.8 * len(dataset))  # 80% for training
test_size = len(dataset) - train_size  # Remaining 20% for testing

# Perform the split
train_data, test_data = random_split(dataset, [train_size, test_size])

print(f'Num Train: {len(train_data)}')
print(f'Num Test: {len(test_data)}')

Num Train: 40000
Num Test: 10000


In [12]:
# Define sizes for train, validation, and test sets
train_size = int(0.7 * len(dataset))  # 70% for training
valid_size = int(0.1 * len(dataset))  # 10% for validation
test_size = len(dataset) - train_size - valid_size  # Remaining 20% for testing

# Perform the split
train_data, valid_data, test_data = random_split(dataset, [train_size, valid_size, test_size])

print(f'Num Train: {len(train_data)}')
print(f'Num Validation: {len(valid_data)}')
print(f'Num Test: {len(test_data)}')

Num Train: 35000
Num Validation: 5000
Num Test: 10000


In [13]:
from torch.utils.data import random_split

# Split train_data into training and validation sets
train_size = int(0.85 * len(train_data))  # 85% of current train_data
valid_size = len(train_data) - train_size  # Remaining 15% for validation

train_data, valid_data = random_split(train_data, [train_size, valid_size])

print(f'Num Train: {len(train_data)}')
print(f'Num Validation: {len(valid_data)}')

Num Train: 29750
Num Validation: 5250
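
The cells above split the data repeatedly. The final partition sizes (29,750 / 5,250 / 10,000) come from the 70/10/20 split followed by an 85/15 re-split of the training part, so the earlier 5,000-example validation set is discarded. A minimal sketch of a single, seeded three-way split, assuming a toy `TensorDataset` in place of the review dataset (the names `toy_dataset`, `train_part`, `valid_part`, and `test_part` are illustrative):

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Toy stand-in for the 50,000-review dataset (assumed data)
toy_dataset = TensorDataset(torch.arange(100))

train_size = int(0.7 * len(toy_dataset))
valid_size = int(0.1 * len(toy_dataset))
test_size = len(toy_dataset) - train_size - valid_size

# A dedicated generator makes the split reproducible across runs
generator = torch.Generator().manual_seed(123)
train_part, valid_part, test_part = random_split(
    toy_dataset, [train_size, valid_size, test_size], generator=generator
)

print(len(train_part), len(valid_part), len(test_part))  # 70 10 20
```

Splitting once keeps the three partitions disjoint and makes the percentages in the comments match the actual sizes.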


In [14]:
# Access the first data item in the Subset
example_idx = train_data.indices[0]  # Get the original index
example = train_data.dataset[example_idx]  # Fetch the original data

# Print the content
print(f"Text: {example[0]}")  # Token indices or raw text depending on your dataset
print(f"Label: {example[1]}")  # Label

Text: tensor([    5,    60,   720,    22,     3,  3537,    25,   126,    93,     7,
          111,   179,  3138,  2143,  4677,     2,    10,   206,    28,   221,
          558,     7,  9348,  4845,    46,    63,  1746,   133,  4677,    17,
            5,    22,  1691,     3,    21,    10,     8,    15,    80,    59,
          225,    79,   131,   775,   428,   896,   251,    46,   281,     8,
           15,   275,    27,     5,   124, 11911,   693,     2,    17,     5,
          693,     3,    10,     8,    15,    64, 10242, 16343,    46,  9601,
         4521,     3,    21,    10,    57,  7988,     1,   359, 87806,    19,
          116,     4,     1,   918,  1103,  9722,     2,     1,    22,  1226,
           11,   157,    67,   306,     5,   946,   124,     3,   615,   323,
            6,     1, 21781,     3,   121,     5,   596,    81,     4,   530,
          164,    81,     9,  1963, 11911,     2,     9,     1,  2686,   259,
           51,   165,     1,  8902,   732,    53,   254,  

## Build Vocabulary

Build the vocabulary based on the top "VOCABULARY_SIZE" words:

In [22]:
import torch
from torchtext.vocab import build_vocab_from_iterator
from torchtext.data.utils import get_tokenizer

# Tokenizer for text
tokenizer = get_tokenizer("basic_english")

# Function to yield tokens from the dataset
# NOTE: MovieDataset returns token-index tensors, so decoding their raw bytes
# does not recover the review text; it yields byte-like tokens such as '\x00...'.
def yield_tokens(data):
    for text, _ in data:  # Access text part of each example
        if isinstance(text, torch.Tensor):
            # tobytes() replaces the deprecated tostring() and returns the same bytes
            text = text.cpu().numpy().tobytes().decode('utf-8', errors='ignore')
        elif not isinstance(text, str):
            raise ValueError(f"Unexpected type for text: {type(text)}")

        # Tokenize the text
        yield tokenizer(text)

# Build text vocabulary
VOCABULARY_SIZE = 20000  # Define max vocabulary size
vocab = build_vocab_from_iterator(yield_tokens(train_data), specials=["<unk>", "<pad>"], max_tokens=VOCABULARY_SIZE)
vocab.set_default_index(vocab["<unk>"])  # Handle out-of-vocabulary tokens

print(f"Vocabulary size: {len(vocab)}")

Vocabulary size: 20000


In [23]:
# Extract unique labels
labels = set(label.item() for _, label in train_data)

# Create label-to-index mapping
label_vocab = {label: idx for idx, label in enumerate(sorted(labels))}
print(f"Number of classes: {len(label_vocab)}")
print(f"Label mapping: {label_vocab}")

Number of classes: 2
Label mapping: {0: 0, 1: 1}


In [24]:
from torch.nn.utils.rnn import pad_sequence

def collate_batch(batch):
    texts, labels = zip(*batch)
    texts = [torch.tensor([vocab[token] for token in tokenizer(text)], dtype=torch.long) for text in texts]
    texts = pad_sequence(texts, batch_first=True, padding_value=vocab["<pad>"])
    labels = torch.tensor([label_vocab[label.item()] for label in labels], dtype=torch.long)
    return texts, labels

In [26]:
import torch
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

# Tokenizer for text
tokenizer = get_tokenizer("basic_english")

# Function to yield tokens from train_data
def yield_tokens(data):
    for text, _ in data:  # Access text part of each example
        if isinstance(text, torch.Tensor):  # If the text is a tensor, convert it
            text = text.cpu().numpy().tobytes().decode('utf-8', errors='ignore')  # tobytes() replaces the deprecated tostring()
        elif isinstance(text, str):  # If it's already a string, no need to decode
            pass
        else:
            raise ValueError(f"Unexpected type for text: {type(text)}")

        yield tokenizer(text)

# Build text vocabulary
VOCABULARY_SIZE = 20000  # Define max vocabulary size
vocab = build_vocab_from_iterator(yield_tokens(train_data), specials=["<unk>", "<pad>"], max_tokens=VOCABULARY_SIZE)
vocab.set_default_index(vocab["<unk>"])  # Handle out-of-vocabulary tokens

print(f"Vocabulary size: {len(vocab)}")

# Build label vocabulary (assuming labels are integers)
labels = set(label.item() for _, label in train_data)  # Extract unique labels
label_vocab = {label: idx for idx, label in enumerate(sorted(labels))}

print(f"Number of classes: {len(label_vocab)}")
print(f"Label mapping: {label_vocab}")

Vocabulary size: 20000
Number of classes: 2
Label mapping: {0: 0, 1: 1}


- The vocabulary size is exactly 20,000 rather than 20,002 because `max_tokens` in `build_vocab_from_iterator` includes the `<unk>` and `<pad>` special tokens
- PyTorch RNNs can handle sequences of arbitrary length thanks to dynamic graphs, but the sequences within a given minibatch must be padded to the same length so that they can be stored in a single tensor
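
A minimal sketch of that padding step using `torch.nn.utils.rnn.pad_sequence`. The token indices are made up, and `PAD_IDX = 1` assumes `<pad>` is the second special token, as in `specials=["<unk>", "<pad>"]`:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_IDX = 1  # assumed index of "<pad>"

# Three already-encoded documents of different lengths (toy indices)
sequences = [
    torch.tensor([5, 8, 2]),
    torch.tensor([7]),
    torch.tensor([4, 9, 3, 6]),
]

# batch_first=True gives shape [batch_size, longest_sequence_length]
batch = pad_sequence(sequences, batch_first=True, padding_value=PAD_IDX)
print(batch.shape)  # torch.Size([3, 4])
```

Shorter sequences are filled on the right with `PAD_IDX`, so the whole minibatch fits into one rectangular tensor.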

**Look at most common words:**

In [29]:
from collections import Counter

# Manually count the frequencies of tokens in the dataset
token_frequencies = Counter()

# Count token frequencies over the dataset
# NOTE: the dataset items are token-index tensors, so the decoded "text" below
# consists of raw bytes rather than words (see the output).
def count_tokens(data):
    for text, _ in data:  # Access text part of each example
        if isinstance(text, torch.Tensor):  # If the text is a tensor, convert it
            text = text.cpu().numpy().tobytes().decode('utf-8', errors='ignore')
        elif not isinstance(text, str):
            raise ValueError(f"Unexpected type for text: {type(text)}")

        # Tokenize and update frequency count
        token_frequencies.update(tokenizer(text))

# Apply to the training data
count_tokens(train_data)

# Get the most common 20 tokens
most_common_tokens = token_frequencies.most_common(20)

# Print the result
print(f"Most common 20 tokens: {most_common_tokens}")

Most common 20 tokens: [('\x00\x00\x00\x00\x00\x00\x00', 140946), ('!', 54479), ('\x00\x00\x00\x00\x00\x00', 54244), ("'", 46807), ('(', 45492), (')', 44631), (',', 42267), ('.', 39362), ('?', 29782), ('\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', 16654), ('\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00', 10429), ('\x01\x00\x00\x00\x00\x00\x00', 6903), ('\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00', 6858), ('\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00', 6050), ('\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00', 5646), ('\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00', 5635), ('\x00\x00\x00\x00\x00\x00\x00\x05\x00\x00\x00\x00\x00\x00\x00', 4811), ('\x00\x00\x00\x00\x00\x00\x00\x14\x00\x00\x00\x00\x00\x00\x00', 4436), ('\x00\x00\x00\x00\x00\x00\x00\x08\x00\x00\x00\x00\x00\x00\x00\x0f\x00\x00\x00\x00\x00\x00\x00', 4327), ('\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', 3784)]
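
The tokens above are runs of `\x00` bytes rather than words because `MovieDataset.__getitem__` already returns integer token indices, and decoding the raw bytes of such a tensor as UTF-8 does not recover the review text. A minimal sketch of counting word frequencies and building an index mapping directly from raw strings, using only the standard library. The two sample reviews, the `simple_tokenize` helper, and the `encode` function are illustrative assumptions, not part of the notebook:

```python
import re
from collections import Counter

def simple_tokenize(text):
    # Hypothetical stand-in for a basic English tokenizer:
    # lowercase, then split into words and single punctuation marks.
    return re.findall(r"[a-z0-9']+|[^\sa-z0-9']", text.lower())

# Toy reviews (assumed data, not taken from movie_data.csv)
raw_reviews = ["I love this movie!", "This is a bad movie."]

token_frequencies = Counter()
for review in raw_reviews:
    token_frequencies.update(simple_tokenize(review))

# Reserve indices 0 and 1 for the special tokens, then add words by frequency
specials = ["<unk>", "<pad>"]
itos = specials + [token for token, _ in token_frequencies.most_common()]
stoi = {token: index for index, token in enumerate(itos)}

def encode(text):
    # Unknown words fall back to the "<unk>" index
    return [stoi.get(token, stoi["<unk>"]) for token in simple_tokenize(text)]

print(token_frequencies.most_common(2))  # [('this', 2), ('movie', 2)]
print(encode("This movie rocks"))        # [2, 3, 0]
```

With the real dataset, the raw strings would come from the `review` column of the DataFrame rather than from the index tensors.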


**Tokens corresponding to the first 10 indices (0, 1, ..., 9):**

In [31]:
# Print the first 10 tokens from the vocabulary (integer-to-string mapping)
itos = vocab.get_itos()  # Get integer-to-string mapping
print(itos[:10])  # Print first 10 tokens

['<unk>', '<pad>', '\x00\x00\x00\x00\x00\x00\x00', '!', '\x00\x00\x00\x00\x00\x00', "'", '(', ')', ',', '.']


**Converting token strings to integers in a collate function:**

In [61]:
from torch.nn.utils.rnn import pad_sequence
from torchtext.data.utils import get_tokenizer
import torch

tokenizer = get_tokenizer("basic_english")

def collate_batch(batch):
    text_list, label_list = zip(*batch)

    # Ensure that text is a string (handle if it's tensor or numpy array)
    text_list = [
        text if isinstance(text, str) else
        text.numpy().tobytes().decode('utf-8', errors='ignore') if isinstance(text, torch.Tensor) else
        text
        for text in text_list
    ]

    # Tokenize and convert to indices using the vocabulary
    text_list = [
        torch.tensor([vocab[token] for token in tokenizer(text)], dtype=torch.long)
        for text in text_list
    ]

    # Pad the sequences to ensure the batch is of equal length
    text_list = pad_sequence(text_list, batch_first=True, padding_value=vocab["<pad>"])

    # Convert labels to tensor
    label_list = torch.tensor(label_list, dtype=torch.long)

    return text_list, label_list

**Class labels:**

In [62]:
# Print the string-to-integer mapping for labels
print(f"Label to index mapping (stoi): {label_vocab}")

Label to index mapping (stoi): {0: 0, 1: 1}


**Class label count:**

In [63]:
from collections import Counter

# Create a counter to track label frequencies
label_frequencies = Counter()

# Iterate over the dataset to count label occurrences
for _, label in train_data:
    label_frequencies[label.item()] += 1  # Increment the count for each label

# Print the frequencies of labels
print(f"Label frequencies: {label_frequencies}")

# Print the most common 20 labels and their frequencies
most_common_labels = label_frequencies.most_common(20)
print(f"Most common 20 labels: {most_common_labels}")

Label frequencies: Counter({0: 14948, 1: 14802})
Most common 20 labels: [(0, 14948), (1, 14802)]


## Define Data Loaders

In [64]:
from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pad_sequence
from torchtext.data.utils import get_tokenizer
import torch

# Tokenizer for text
tokenizer = get_tokenizer("basic_english")

# Create a custom collate function to pad sequences
def collate_batch(batch):
    text_list, label_list = zip(*batch)

    # Decode tensors to strings (tobytes() replaces the deprecated tostring())
    text_list = [
        text if isinstance(text, str) else
        text.numpy().tobytes().decode('utf-8', errors='ignore') if isinstance(text, torch.Tensor) else
        text  # Any other type is passed through unchanged
        for text in text_list
    ]

    # Tokenize and convert to indices using the vocabulary
    text_list = [
        torch.tensor([vocab[token] for token in tokenizer(text)], dtype=torch.long)
        for text in text_list
    ]

    # Pad the sequences to ensure the batch is of equal length
    text_list = pad_sequence(text_list, batch_first=True, padding_value=vocab["<pad>"])

    # Convert labels to tensor
    label_list = torch.tensor(label_list, dtype=torch.long)

    return text_list, label_list

# Create DataLoader objects for train, valid, and test datasets
train_loader = DataLoader(train_data, batch_size=BATCH_SIZE, collate_fn=collate_batch, shuffle=True)
valid_loader = DataLoader(valid_data, batch_size=BATCH_SIZE, collate_fn=collate_batch, shuffle=False)
test_loader = DataLoader(test_data, batch_size=BATCH_SIZE, collate_fn=collate_batch, shuffle=False)

# Check if the DataLoader works properly
for batch_idx, (text, labels) in enumerate(train_loader):
    print(f'Batch {batch_idx + 1} - Text size: {text.size()}, Labels size: {labels.size()}')
    if batch_idx > 1:
        break

Batch 1 - Text size: torch.Size([128, 254]), Labels size: torch.Size([128])
Batch 2 - Text size: torch.Size([128, 243]), Labels size: torch.Size([128])
Batch 3 - Text size: torch.Size([128, 264]), Labels size: torch.Size([128])


Testing the data loaders (note that the number of columns, i.e. the sequence length, depends on the longest document in the respective batch):

In [65]:
print('Train')
for batch in train_loader:
    text, labels = batch  # Unpack the tuple
    print(f'Text matrix size: {text.size()}')  # Size of the text tensor
    print(f'Target vector size: {labels.size()}')  # Size of the label tensor
    break

print('\nValid:')
for batch in valid_loader:
    text, labels = batch  # Unpack the tuple
    print(f'Text matrix size: {text.size()}')  # Size of the text tensor
    print(f'Target vector size: {labels.size()}')  # Size of the label tensor
    break

print('\nTest:')
for batch in test_loader:
    text, labels = batch  # Unpack the tuple
    print(f'Text matrix size: {text.size()}')  # Size of the text tensor
    print(f'Target vector size: {labels.size()}')  # Size of the label tensor
    break

Train
Text matrix size: torch.Size([128, 242])
Target vector size: torch.Size([128])

Valid:
Text matrix size: torch.Size([128, 291])
Target vector size: torch.Size([128])

Test:
Text matrix size: torch.Size([128, 310])
Target vector size: torch.Size([128])
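
Because the dataset items are already index tensors, a collate function could pad them directly instead of decoding and re-tokenizing them. A hedged sketch of that idea on toy items: `collate_indexed`, `toy_items`, and `PAD_IDX = 1` are assumptions for illustration, not the notebook's code. It also shows why the second dimension differs from batch to batch:

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

PAD_IDX = 1  # assumed "<pad>" index (second entry in specials)

def collate_indexed(batch):
    # Each item is (1-D LongTensor of token indices, 0-D label tensor)
    token_tensors, labels = zip(*batch)
    padded = pad_sequence(list(token_tensors), batch_first=True, padding_value=PAD_IDX)
    return padded, torch.stack(labels)

# Toy items with lengths 3, 1, 5, 2 (made-up indices and labels)
toy_items = [
    (torch.tensor([4, 5, 6]), torch.tensor(1)),
    (torch.tensor([7]), torch.tensor(0)),
    (torch.tensor([8, 9, 10, 11, 12]), torch.tensor(1)),
    (torch.tensor([13, 14]), torch.tensor(0)),
]

loader = DataLoader(toy_items, batch_size=2, collate_fn=collate_indexed, shuffle=False)
batch_shapes = [tuple(texts.shape) for texts, _ in loader]
print(batch_shapes)  # [(2, 3), (2, 5)]
```

Each batch is only as wide as its own longest document, which keeps the amount of padding small.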


## Model

In [73]:
import torch
import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, input_dim, embedding_dim, hidden_dim, output_dim):
        super(RNN, self).__init__()

        self.embedding = nn.Embedding(input_dim, embedding_dim)  # Embedding layer
        self.rnn = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)  # LSTM layer
        self.fc = nn.Linear(hidden_dim, output_dim)  # Fully connected layer

    def forward(self, text):
        embedded = self.embedding(text)  # Shape: [batch_size, seq_len, embedding_dim]
        output, (hidden, cell) = self.rnn(embedded)  # LSTM returns output and hidden states

        # Use only the final hidden state (the last time step)
        hidden = hidden[-1]  # Shape: [batch_size, hidden_dim]

        logits = self.fc(hidden)  # Shape: [batch_size, output_dim]
        return logits
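
The shape comments in `forward` can be checked on a toy batch. The sketch below rebuilds the three layers with small, made-up sizes and confirms that, for a single-layer unidirectional LSTM, `hidden[-1]` equals the output at the last time step:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, embedding_dim, hidden_dim, num_classes = 50, 8, 16, 2  # toy sizes

embedding = nn.Embedding(vocab_size, embedding_dim)
lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
fc = nn.Linear(hidden_dim, num_classes)

tokens = torch.randint(0, vocab_size, (4, 10))  # [batch_size=4, seq_len=10]
embedded = embedding(tokens)                    # [4, 10, 8]
output, (hidden, cell) = lstm(embedded)         # output: [4, 10, 16], hidden: [1, 4, 16]
logits = fc(hidden[-1])                         # [4, 2]

print(embedded.shape, output.shape, hidden.shape, logits.shape)
```

Note that with right-padded batches the last time step may be a padding position, so the final hidden state also reflects the padding tokens.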

In [74]:
import torch

# Check if CUDA is available and choose the device accordingly
if torch.cuda.is_available():
    # Use the first available GPU (device 0)
    DEVICE = torch.device('cuda:0')  # You can also use 'cuda' to default to the first available GPU
else:
    DEVICE = torch.device('cpu')

print(f"Using device: {DEVICE}")

# Seed before creating the model so the weight initialization is reproducible
torch.manual_seed(RANDOM_SEED)

model = RNN(input_dim=len(vocab),
            embedding_dim=EMBEDDING_DIM,
            hidden_dim=HIDDEN_DIM,
            output_dim=NUM_CLASSES)  # Two logits (negative, positive), trained with cross-entropy

model = model.to(DEVICE)  # Move the model to the selected device (CPU/GPU)
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)

print("Model initialized and ready for training.")

Using device: cuda:0
Model initialized and ready for training.


## Training

In [75]:
def compute_accuracy(model, data_loader, device):
    model.eval()  # Evaluation mode (matters once dropout or batch norm is added)

    correct_pred, num_examples = 0, 0

    with torch.no_grad():
        for features, targets in data_loader:
            features = features.to(device)
            targets = targets.to(device)

            logits = model(features)
            predicted_labels = torch.argmax(logits, dim=1)  # Class with the largest logit

            num_examples += targets.size(0)
            correct_pred += (predicted_labels == targets).sum().item()

    return correct_pred / num_examples * 100  # Accuracy in percent

In [95]:
NUM_EPOCHS = 15  # Set the number of epochs for training

start_time = time.time()

for epoch in range(NUM_EPOCHS):
    model.train()
    for batch_idx, batch_data in enumerate(train_loader):
        text, labels = batch_data
        text = text.to(DEVICE)
        labels = labels.to(DEVICE)

        # Forward pass
        logits = model(text)
        loss = F.cross_entropy(logits, labels)
        optimizer.zero_grad()  # Zero gradients
        loss.backward()  # Backpropagation
        optimizer.step()  # Update model parameters

        # Logging every 50 batches
        if batch_idx % 50 == 0:
            print(f'Epoch: {epoch + 1:03d}/{NUM_EPOCHS:03d} | '
                  f'Batch {batch_idx:03d}/{len(train_loader):03d} | '
                  f'Loss: {loss:.4f}')

    # Print training and validation accuracy
    with torch.no_grad():
        train_acc = compute_accuracy(model, train_loader, DEVICE)
        valid_acc = compute_accuracy(model, valid_loader, DEVICE)
        print(f'Training accuracy: {train_acc:.2f}%\n'
              f'Validation accuracy: {valid_acc:.2f}%')

    print(f'Time elapsed: {(time.time() - start_time) / 60:.2f} min')

# Final evaluation on the test set
test_acc = compute_accuracy(model, test_loader, DEVICE)
print(f'Total Training Time: {(time.time() - start_time) / 60:.2f} min')
print(f'Test accuracy: {test_acc:.2f}%')

Epoch: 001/015 | Batch 000/233 | Loss: 0.6945
Epoch: 001/015 | Batch 050/233 | Loss: 0.6871
Epoch: 001/015 | Batch 100/233 | Loss: 0.6905
Epoch: 001/015 | Batch 150/233 | Loss: 0.6895
Epoch: 001/015 | Batch 200/233 | Loss: 0.6944
Training accuracy: 50.64%
Validation accuracy: 50.10%
Time elapsed: 1.07 min
Epoch: 002/015 | Batch 000/233 | Loss: 0.6977
Epoch: 002/015 | Batch 050/233 | Loss: 0.6924
Epoch: 002/015 | Batch 100/233 | Loss: 0.6965
Epoch: 002/015 | Batch 150/233 | Loss: 0.6862
Epoch: 002/015 | Batch 200/233 | Loss: 0.6871
Training accuracy: 50.64%
Validation accuracy: 50.23%
Time elapsed: 2.08 min
Epoch: 003/015 | Batch 000/233 | Loss: 0.6888
Epoch: 003/015 | Batch 050/233 | Loss: 0.6906
Epoch: 003/015 | Batch 100/233 | Loss: 0.6943
Epoch: 003/015 | Batch 150/233 | Loss: 0.6895
Epoch: 003/015 | Batch 200/233 | Loss: 0.6869
Training accuracy: 49.98%
Validation accuracy: 49.79%
Time elapsed: 3.08 min
Epoch: 004/015 | Batch 000/233 | Loss: 0.6883
Epoch: 004/015 | Batch 050/233 | 
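
The losses logged above stay close to 0.693, which is ln 2, the cross-entropy of a two-class model that assigns probability 0.5 to each class. Together with accuracies near 50%, this suggests the model is not learning from its inputs; the byte-like vocabulary built earlier is a plausible cause. A short check of the arithmetic, using only the standard library (the helper `cross_entropy` is written here for illustration):

```python
import math

def cross_entropy(probabilities, target_index):
    # Negative log-likelihood of the correct class
    return -math.log(probabilities[target_index])

chance_loss = cross_entropy([0.5, 0.5], target_index=1)
confident_loss = cross_entropy([0.1, 0.9], target_index=1)

print(f"{chance_loss:.4f}")  # 0.6931
```

A model that has learned something useful should push the loss clearly below this chance-level value.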

In [89]:
import spacy
import torch

# Blank English pipeline: provides spaCy's tokenizer without downloading a model
nlp = spacy.blank("en")

def predict_sentiment(model, sentence, vocab, device):
    model.eval()

    # Tokenize the sentence using spaCy
    tokenized = [tok.text.lower() for tok in nlp.tokenizer(sentence)]  # Lowercased tokens for consistency
    print(f"Tokenized sentence: {tokenized}")  # Debug print to check tokenization

    # Convert tokens to indices using the vocabulary
    indexed = [vocab[token] if token in vocab else vocab['<unk>'] for token in tokenized]
    print(f"Indexed tokens: {indexed}")  # Debug print to check the indexing

    if len(indexed) == 0:
        raise ValueError("The tokenized sentence is empty.")

    # Create a tensor from the indices and move it to the device (CPU or GPU)
    tensor = torch.tensor(indexed, dtype=torch.long, device=device)  # Shape: [seq_len]
    tensor = tensor.unsqueeze(0)  # Shape: [1, seq_len] for batch size 1

    # Pass the tensor through the model without tracking gradients
    with torch.no_grad():
        logits = model(tensor)

    # Get the probabilities using softmax
    probabilities = torch.nn.functional.softmax(logits, dim=1)

    return probabilities[0][1].item()  # Probability of the positive class (index 1)

# Check CUDA availability and devices
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"Number of GPUs available: {torch.cuda.device_count()}")

if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        print(f"Device {i}: {torch.cuda.get_device_name(i)}")

# Dynamically set the device (GPU if available, otherwise CPU)
try:
    DEVICE = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    print(f"Using device: {DEVICE}")
except Exception as e:
    print(f"Error in setting device: {e}")
    DEVICE = torch.device("cpu")
    print(f"Using CPU instead.")

# Example Usage
sentence = "This is such an awesome movie, I really love it!"
print(f"Probability positive: {predict_sentiment(model, sentence, vocab, DEVICE):.4f}")

# Calculate probability of negative sentiment
negative_prob = 1 - predict_sentiment(model, "I really hate this movie. It is really bad and sucks!", vocab, DEVICE)

# Print the probability of negative sentiment
print(f'Probability negative: {negative_prob:.4f}')

CUDA available: True
Number of GPUs available: 1
Device 0: Tesla T4
Using device: cuda:0
Tokenized sentence: ['this', 'is', 'such', 'an', 'awesome', 'movie', ',', 'i', 'really', 'love', 'it', '!']
Indexed tokens: [0, 0, 0, 0, 0, 0, 8, 8048, 0, 0, 0, 3]
Probability positive: 0.7749
Tokenized sentence: ['i', 'really', 'hate', 'this', 'movie', '.', 'it', 'is', 'really', 'bad', 'and', 'sucks', '!']
Indexed tokens: [8048, 0, 0, 0, 0, 9, 0, 0, 0, 0, 0, 0, 3]
Probability negative: 0.3285
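
Since a softmax over two logits yields probabilities that sum to 1, `1 - P(positive)` equals the probability at index 0 (the negative class). A small standard-library sketch with made-up logits; the `softmax` helper is written here for illustration:

```python
import math

def softmax(logits):
    # Subtract the maximum for numerical stability before exponentiating
    shifted = [x - max(logits) for x in logits]
    exps = [math.exp(x) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]

logits = [-0.4, 0.8]  # toy logits for [negative, positive]
p_negative, p_positive = softmax(logits)

print(f"positive: {p_positive:.4f}, negative: {1 - p_positive:.4f}")
```

In the output above, most tokens map to index 0 (`<unk>`), so the two example sentences look nearly identical to the model.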


In [92]:
!pip install watermark

Collecting watermark
  Downloading watermark-2.5.0-py2.py3-none-any.whl.metadata (1.4 kB)
Collecting jedi>=0.16 (from ipython>=6.0->watermark)
  Downloading jedi-0.19.2-py2.py3-none-any.whl.metadata (22 kB)
Downloading watermark-2.5.0-py2.py3-none-any.whl (7.7 kB)
Downloading jedi-0.19.2-py2.py3-none-any.whl (1.6 MB)
Installing collected packages: jedi, watermark
Successfully installed jedi-0.19.2 watermark-2.5.0


In [93]:
%load_ext watermark

In [94]:
%watermark -iv

pandas   : 2.2.2
torch    : 2.0.1
torchtext: 0.15.2
spacy    : 3.7.5

