### News Headlines Classification


This notebook demonstrates a complete workflow for training a Long Short-Term Memory (LSTM) neural network to **classify news headlines into four categories: entertainment, business, science/tech, and health.**

**Key Steps:**

1.  **Imports**: The notebook imports libraries for data handling (pandas, NumPy), deep learning (PyTorch), tokenization and stopword filtering (NLTK), and data splitting, metrics, and class weighting (scikit-learn).

2.  **Data Loading**: News headlines and their categories are loaded from a CSV file. The dataset is shuffled, and the category codes are mapped to integer labels: `e` (entertainment) to 0, `b` (business) to 1, `t` (science/tech) to 2, and `m` (health) to 3.

3.  **Text Preprocessing**: A custom function lowercases each headline, tokenizes it, drops tokens that are not purely alphabetic, and filters out common English stopwords. The resulting tokens are stored in a `TOKENS` column.

4.  **Vocabulary Building**: Token frequencies are counted across the whole dataset, and a `word2idx` mapping assigns a unique integer ID, starting at 1, to each of the (at most) 20,000 most frequent words. ID 0 is reserved for padding and for words outside the vocabulary.

5.  **Tokens to Sequences**: Tokenized headlines are converted into integer sequences using the `word2idx` mapping; words missing from the mapping become 0. Sequences are padded with zeros or truncated to a fixed length of 30 so that they can be batched as input to the LSTM model.

6.  **Train-Test Split**: The integer sequences and their labels are split into training (75%) and test (25%) sets, using stratified sampling so that both sets keep the same class proportions.

7.  **PyTorch Dataset & DataLoader**: A custom `NewsDataset` class is defined to handle the training and testing data, and `DataLoader` objects are created to efficiently batch and load data during training and evaluation.

8.  **LSTM Model Definition**: A bidirectional LSTM-based classifier (`LSTMClassifier`) is defined. It includes an embedding layer, a bidirectional LSTM layer, a dropout layer, and a final linear layer for classification. The model, loss function (CrossEntropyLoss with class weights), and optimizer (Adam) are initialized.

9.  **Training**: The model is trained for 5 epochs. In each epoch it iterates over the training batches, performs forward and backward passes, and updates the weights with the Adam optimizer. The mean batch loss is printed at the end of each epoch.

10. **Evaluation**: After training, the model is evaluated on the test set. Overall accuracy and a per-category classification report (precision, recall, F1-score) are printed.

11. **Prediction Example**: A new, unseen headline is preprocessed with the same functions and passed to the trained LSTM model to predict its category.

<p style="
  background: #ffffff;
  border: 5px solid #9f0b0bff;
  border-radius: 18px;
  padding: 30px 180px;
  text-align: center;
  text-shadow: 1px 1px 3px rgba(0, 0, 0, 0.4);
  font-family: 'serif';
  color: #9f0b0bff;
  font-size: 3.8rem;
  width: fit-content;
  margin: 20px auto;
">
  Imports <br>
  <span style="
  text-align: center;
  text-shadow: 1px 1px 3px rgba(0, 0, 0, 0.4);
  font-family: 'serif';
  color: #9f0b0bff;
  font-size: 1.6rem;
  width: fit-content;
">BY: Genia</span>
</p>


In [1]:
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.utils.class_weight import compute_class_weight
from collections import Counter

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

<p style="
  background: #ffffff;
  border: 5px solid #9f0b0bff;
  border-radius: 18px;
  padding: 30px 180px;
  text-align: center;
  text-shadow: 1px 1px 3px rgba(0, 0, 0, 0.4);
  font-family: 'serif';
  color: #9f0b0bff;
  font-size: 3.8rem;
  width: fit-content;
  margin: 20px auto;
">
  Data Loading <br>
</p>

In [2]:
# Local path to the news aggregator CSV; adjust it to your environment
data_path = 'C:\\sharing1\\ML4\\edu\\nlp\\uci-news-aggregator.csv'
data = pd.read_csv(data_path, usecols=['TITLE', 'CATEGORY'])

# Shuffle dataset
concated = data.sample(frac=1, random_state=42).reset_index(drop=True)

# Encode labels: e = entertainment, b = business, t = science/tech, m = health
label_map = {'e': 0, 'b': 1, 't': 2, 'm': 3}
concated['LABEL'] = concated['CATEGORY'].map(label_map)
concated.drop(['CATEGORY'], axis=1, inplace=True)

FileNotFoundError: [Errno 2] No such file or directory: 'C:\\sharing1\\ML4\\edu\\nlp\\uci-news-aggregator.csv'
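The `FileNotFoundError` above comes from the hard-coded Windows path in `data_path`, which does not exist on the machine where the cell ran. One way to make the cell more portable is to look for the CSV in a few candidate directories and fail with a clearer message. This is only a sketch: the helper name `find_dataset` and the candidate directories are hypothetical and not part of the original notebook.

```python
from pathlib import Path

def find_dataset(filename, search_dirs):
    """Return the first existing path to `filename` among `search_dirs`.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    for directory in search_dirs:
        candidate = Path(directory) / filename
        if candidate.is_file():
            return candidate
    searched = ", ".join(str(d) for d in search_dirs)
    raise FileNotFoundError(f"{filename} not found in: {searched}")

# Assumed candidate locations; adjust them to your environment.
demo_search_dirs = [Path.cwd(), Path.cwd() / "data"]
```

With this in place, `data_path = find_dataset('uci-news-aggregator.csv', demo_search_dirs)` could replace the hard-coded assignment.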

<p style="
  background: #ffffff;
  border: 5px solid #9f0b0bff;
  border-radius: 18px;
  padding: 40px 180px;
  text-align: center;
  text-shadow: 1px 1px 3px rgba(0, 0, 0, 0.4);
  font-family: 'serif';
  color: #9f0b0bff;
  font-size: 3.8rem;
  width: fit-content;
  margin: 20px auto;
">
  Text Preprocessing <br>
</p>

In [None]:
stop_words = set(stopwords.words('english'))

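def preprocess_text(text):
    """Lowercase and tokenize a headline, keeping alphabetic non-stopword tokens."""
    tokens = word_tokenize(text.lower())
    # isalpha() drops numbers, punctuation, and mixed tokens such as "covid-19"
    tokens = [t for t in tokens if t.isalpha() and t not in stop_words]
    return tokens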

concated['TOKENS'] = concated['TITLE'].apply(preprocess_text)
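To see what the filtering in `preprocess_text` does to a headline, here is a small standalone sketch. It does not use NLTK: the regular-expression tokenizer and the tiny `DEMO_STOP_WORDS` set are hypothetical stand-ins for `word_tokenize` and the English stopword list, so the exact tokens may differ from what the notebook produces.

```python
import re

# Hypothetical stand-in for NLTK's English stopword list (tiny subset).
DEMO_STOP_WORDS = {"the", "to", "in", "of", "a", "is", "as"}

def demo_preprocess(text):
    # Rough tokenizer: runs of letters/digits, optionally joined by - ' or .
    tokens = re.findall(r"[a-z0-9]+(?:[-'.][a-z0-9]+)*", text.lower())
    # Same filter as the notebook: purely alphabetic and not a stopword
    return [t for t in tokens if t.isalpha() and t not in DEMO_STOP_WORDS]

demo_tokens = demo_preprocess(
    "Apple to unveil iPhone 6 in the U.S. as Wall Street hits record"
)
print(demo_tokens)
# ['apple', 'unveil', 'iphone', 'wall', 'street', 'hits', 'record']
```

Note that the number `6` and the abbreviation `U.S.` disappear because they are not purely alphabetic. For news headlines this can remove informative tokens, which is a trade-off of the `isalpha()` filter.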

<p style="
  background: #ffffff;
  border: 5px solid #9f0b0bff;
  border-radius: 18px;
  padding: 40px 180px;
  text-align: center;
  text-shadow: 1px 1px 3px rgba(0, 0, 0, 0.4);
  font-family: 'serif';
  color: #9f0b0bff;
  font-size: 3.8rem;
  width: fit-content;
  margin: 20px auto;
">
  Vocabulary Building <br>
</p>

In [None]:
all_tokens = [token for tokens in concated['TOKENS'] for token in tokens]

MAX_VOCAB = 20000
counter = Counter(all_tokens)
most_common = counter.most_common(MAX_VOCAB)

# IDs start at 1; ID 0 is reserved for padding and out-of-vocabulary words
word2idx = {word: i+1 for i, (word, _) in enumerate(most_common)}
vocab_size = len(word2idx) + 1
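The vocabulary logic can be checked on a toy corpus. The sketch below repeats the same steps with a vocabulary limit of 2 instead of 20,000; the `demo_` names and the token lists are made up for illustration.

```python
from collections import Counter

# Toy tokenized headlines (made-up data)
demo_token_lists = [
    ["stocks", "rally", "record"],
    ["stocks", "rally"],
    ["stocks", "fall"],
]
demo_all_tokens = [t for tokens in demo_token_lists for t in tokens]

DEMO_MAX_VOCAB = 2
demo_counter = Counter(demo_all_tokens)            # stocks: 3, rally: 2, record: 1, fall: 1
demo_most_common = demo_counter.most_common(DEMO_MAX_VOCAB)

# IDs start at 1 so that 0 stays free for padding / unknown words
demo_word2idx = {word: i + 1 for i, (word, _) in enumerate(demo_most_common)}
demo_vocab_size = len(demo_word2idx) + 1           # +1 accounts for ID 0

print(demo_word2idx)    # {'stocks': 1, 'rally': 2}
print(demo_vocab_size)  # 3
```

The words `record` and `fall` fall outside the limit, so they have no entry in the mapping. The `+ 1` in `vocab_size` matters because the embedding layer needs a row for ID 0 as well.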


<p style="
  background: #ffffff;
  border: 5px solid #9f0b0bff;
  border-radius: 18px;
  padding: 40px 180px;
  text-align: center;
  text-shadow: 1px 1px 3px rgba(0, 0, 0, 0.4);
  font-family: 'serif';
  color: #9f0b0bff;
  font-size: 3.8rem;
  width: fit-content;
  margin: 20px auto;
">
  Tokens to Sequences <br>
</p>

In [None]:
max_len = 30

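def tokens_to_sequence(tokens):
    # Map each token to its ID; unknown words get 0 (the same ID as padding)
    seq = [word2idx.get(word, 0) for word in tokens]
    if len(seq) < max_len:
        # Right-pad short sequences with zeros
        seq += [0] * (max_len - len(seq))
    else:
        # Keep only the first max_len tokens
        seq = seq[:max_len]
    return seq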

concated['SEQ'] = concated['TOKENS'].apply(tokens_to_sequence)

X = np.array(concated['SEQ'].tolist())
y = concated['LABEL'].values
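A short standalone sketch of the padding and truncation behavior, using a fixed length of 5 instead of 30. The mapping and the function name `demo_tokens_to_sequence` are made up for illustration; the logic mirrors `tokens_to_sequence`.

```python
demo_word2idx = {"stocks": 1, "rally": 2}  # made-up mapping
DEMO_MAX_LEN = 5

def demo_tokens_to_sequence(tokens, word2idx, max_len):
    # Unknown words map to 0, then the sequence is cut to max_len
    seq = [word2idx.get(word, 0) for word in tokens][:max_len]
    # Right-pad with zeros up to max_len
    return seq + [0] * (max_len - len(seq))

demo_short = demo_tokens_to_sequence(["stocks", "surge", "rally"], demo_word2idx, DEMO_MAX_LEN)
demo_long = demo_tokens_to_sequence(["stocks"] * 7, demo_word2idx, DEMO_MAX_LEN)

print(demo_short)  # [1, 0, 2, 0, 0]
print(demo_long)   # [1, 1, 1, 1, 1]
```

In `demo_short`, the unknown word `surge` and the two padding positions all become 0, so the model cannot tell them apart. A common alternative, not used in this notebook, is to reserve a separate ID for unknown words.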

<p style="
  background: #ffffff;
  border: 5px solid #9f0b0bff;
  border-radius: 18px;
  padding: 40px 180px;
  text-align: center;
  text-shadow: 1px 1px 3px rgba(0, 0, 0, 0.4);
  font-family: 'serif';
  color: #9f0b0bff;
  font-size: 3.8rem;
  width: fit-content;
  margin: 20px auto;
">
  Train-Test Split <br>
</p>

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
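The effect of `stratify=y` is that each class contributes the same fraction of its samples to the test set. The sketch below illustrates that idea with the standard library only. It is not scikit-learn's implementation, and the function name and label counts are made up.

```python
import random
from collections import Counter, defaultdict

def demo_stratified_split(labels, test_size=0.25, seed=42):
    """Split indices so each class keeps roughly the same share in both sets."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                       # shuffle within the class
        n_test = round(len(indices) * test_size)   # per-class test share
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Made-up imbalanced labels: 40 / 28 / 20 / 12 samples per class
demo_labels = [0] * 40 + [1] * 28 + [2] * 20 + [3] * 12
demo_train_idx, demo_test_idx = demo_stratified_split(demo_labels)

demo_train_counts = Counter(demo_labels[i] for i in demo_train_idx)
demo_test_counts = Counter(demo_labels[i] for i in demo_test_idx)
print(dict(demo_train_counts))  # {0: 30, 1: 21, 2: 15, 3: 9}
print(dict(demo_test_counts))   # {0: 10, 1: 7, 2: 5, 3: 3}
```

Without stratification, a random split of an imbalanced dataset can leave a minority class under-represented in the test set, which makes the per-class metrics less reliable.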

<p style="
  background: #ffffff;
  border: 5px solid #9f0b0bff;
  border-radius: 18px;
  padding: 40px 180px;
  text-align: center;
  text-shadow: 1px 1px 3px rgba(0, 0, 0, 0.4);
  font-family: 'serif';
  color: #9f0b0bff;
  font-size: 3.8rem;
  width: fit-content;
  margin: 20px auto;
">
  PyTorch Dataset <br>
</p>

In [None]:
class NewsDataset(Dataset):
    def __init__(self, X, y):
        self.X = torch.LongTensor(X)
        self.y = torch.LongTensor(y)

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

batch_size = 512
train_loader = DataLoader(NewsDataset(X_train, y_train), batch_size=batch_size, shuffle=True)
test_loader = DataLoader(NewsDataset(X_test, y_test), batch_size=batch_size)
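A `DataLoader` with default settings keeps the last, smaller batch rather than dropping it, so the number of batches is the sample count divided by `batch_size`, rounded up. The pure-Python sketch below illustrates that arithmetic without PyTorch; the sample count and the `demo_` names are made up.

```python
import math

def demo_batches(items, batch_size):
    # Yield consecutive slices; the final slice may be shorter
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

demo_n_samples = 1100
demo_batch_size = 512

demo_sizes = [len(b) for b in demo_batches(list(range(demo_n_samples)), demo_batch_size)]
demo_n_batches = math.ceil(demo_n_samples / demo_batch_size)

print(demo_sizes)      # [512, 512, 76]
print(demo_n_batches)  # 3
```

In the notebook, `len(train_loader)` is this batch count, which is why the training loop divides `total_loss` by it to report a mean loss per batch. With `shuffle=True`, the training samples are also reordered at the start of every epoch.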

<p style="
  background: #ffffff;
  border: 5px solid #9f0b0bff;
  border-radius: 18px;
  padding: 40px 180px;
  text-align: center;
  text-shadow: 1px 1px 3px rgba(0, 0, 0, 0.4);
  font-family: 'serif';
  color: #9f0b0bff;
  font-size: 3.8rem;
  width: fit-content;
  margin: 20px auto;
">
  LSTM Model <br>
</p>

In [None]:
class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, dropout=0.5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(
            embedding_dim,
            hidden_dim,
            batch_first=True,
            bidirectional=True
        )
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(hidden_dim * 2, output_dim)

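    def forward(self, x):
        x = self.embedding(x)                    # (batch, seq_len, embedding_dim)
        _, (hn, _) = self.lstm(x)                # hn: (2, batch, hidden_dim) for one bidirectional layer
        # Concatenate the final forward (hn[-2]) and backward (hn[-1]) hidden states
        hn = torch.cat((hn[-2], hn[-1]), dim=1)  # (batch, hidden_dim * 2)
        out = self.dropout(hn)
        return self.fc(out)                      # unnormalized class scores (logits)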

embedding_dim = 128
hidden_dim = 128
output_dim = 4

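# Use a GPU when one is available; otherwise fall back to the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')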
model = LSTMClassifier(vocab_size, embedding_dim, hidden_dim, output_dim).to(device)

# Class weights: inversely proportional to class frequency in the training set
class_weights = compute_class_weight(class_weight='balanced', classes=np.unique(y_train), y=y_train)
class_weights = torch.tensor(class_weights, dtype=torch.float).to(device)

criterion = nn.CrossEntropyLoss(weight=class_weights)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
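With `class_weight='balanced'`, scikit-learn computes each weight as `n_samples / (n_classes * count_of_class)`, so rarer classes receive larger weights. The sketch below reproduces that formula with the standard library on made-up label counts; the `demo_` names are illustrative only.

```python
from collections import Counter

def demo_balanced_weights(labels):
    """Balanced class weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * counts[cls]) for cls in sorted(counts)}

# Made-up training labels: 40 / 30 / 20 / 10 samples per class
demo_y_train = [0] * 40 + [1] * 30 + [2] * 20 + [3] * 10
demo_weights = demo_balanced_weights(demo_y_train)

for cls, weight in demo_weights.items():
    print(cls, round(weight, 4))
# 0 0.625
# 1 0.8333
# 2 1.25
# 3 2.5
```

Passing these weights to `CrossEntropyLoss` makes a mistake on a rare class cost more than a mistake on a frequent one, which counteracts class imbalance during training.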

<p style="
  background: #ffffff;
  border: 5px solid #9f0b0bff;
  border-radius: 18px;
  padding: 40px 180px;
  text-align: center;
  text-shadow: 1px 1px 3px rgba(0, 0, 0, 0.4);
  font-family: 'serif';
  color: #9f0b0bff;
  font-size: 3.8rem;
  width: fit-content;
  margin: 20px auto;
">
  Training <br>
</p>

In [None]:
epochs = 5

for epoch in range(epochs):
    model.train()
    total_loss = 0

    for X_batch, y_batch in train_loader:
        X_batch, y_batch = X_batch.to(device), y_batch.to(device)

        optimizer.zero_grad()
        outputs = model(X_batch)
        loss = criterion(outputs, y_batch)
        loss.backward()
        optimizer.step()

        total_loss += loss.item()

    print(f"Epoch {epoch+1}/{epochs}, Loss: {total_loss/len(train_loader):.4f}")
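The `loss` in each training step is a weighted cross-entropy. With class weights and the default mean reduction, PyTorch's `CrossEntropyLoss` averages the per-sample losses weighted by the weight of each sample's true class. The sketch below works through that computation in plain Python on made-up logits; the function names are illustrative, and this is a sketch of the formula rather than PyTorch's implementation.

```python
import math

def demo_softmax(logits):
    # Subtract the max for numerical stability
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def demo_weighted_cross_entropy(batch_logits, targets, class_weights):
    """Weighted mean of -log p(true class), normalized by the sum of weights."""
    weighted_sum = 0.0
    weight_total = 0.0
    for logits, target in zip(batch_logits, targets):
        prob_true = demo_softmax(logits)[target]
        w = class_weights[target]
        weighted_sum += -w * math.log(prob_true)
        weight_total += w
    return weighted_sum / weight_total

demo_class_weights = [0.625, 0.8333, 1.25, 2.5]  # made-up weights

# An untrained model with identical scores for all four classes
demo_uniform_loss = demo_weighted_cross_entropy(
    [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]], [0, 3], demo_class_weights
)
# A model that is confident and correct on both samples
demo_confident_loss = demo_weighted_cross_entropy(
    [[5.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 5.0]], [0, 3], demo_class_weights
)
print(round(demo_uniform_loss, 4))  # 1.3863, i.e. ln(4)
print(demo_confident_loss < demo_uniform_loss)  # True
```

A loss near `ln(4) ≈ 1.386` in the first epoch is therefore what to expect from a four-class model that has not learned anything yet, and the printed epoch loss should drop below that value as training progresses.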

<p style="
  background: #ffffff;
  border: 5px solid #9f0b0bff;
  border-radius: 18px;
  padding: 40px 180px;
  text-align: center;
  text-shadow: 1px 1px 3px rgba(0, 0, 0, 0.4);
  font-family: 'serif';
  color: #9f0b0bff;
  font-size: 3.8rem;
  width: fit-content;
  margin: 20px auto;
">
  Evaluation <br>
</p>

In [None]:
model.eval()
all_preds, all_labels = [], []

with torch.no_grad():
    for X_batch, y_batch in test_loader:
        X_batch = X_batch.to(device)
        outputs = model(X_batch)
        preds = torch.argmax(outputs, dim=1).cpu().numpy()

        all_preds.extend(preds)
        all_labels.extend(y_batch.numpy())

print("Test Accuracy:", accuracy_score(all_labels, all_preds))
print(classification_report(
    all_labels,
    all_preds,
    target_names=['entertainment','business','science/tech','health']
))
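The numbers in the classification report can be reproduced by hand. The sketch below computes accuracy and per-class precision, recall, and F1 on made-up labels with the standard library; the function names are illustrative and the data is not a result from this notebook.

```python
def demo_accuracy(y_true, y_pred):
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def demo_class_metrics(y_true, y_pred, cls):
    """Return (precision, recall, f1) for one class."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up labels: 0 = entertainment, 1 = business, 2 = science/tech, 3 = health
demo_true = [0, 0, 1, 1, 2, 3, 3, 3]
demo_pred = [0, 1, 1, 1, 2, 3, 3, 0]

demo_acc = demo_accuracy(demo_true, demo_pred)
demo_health = demo_class_metrics(demo_true, demo_pred, 3)
print(demo_acc)                               # 0.75
print(tuple(round(v, 4) for v in demo_health))  # (1.0, 0.6667, 0.8)
```

Here every headline predicted as health really is health (precision 1.0), but one health headline was missed (recall 2/3). Per-class metrics like these reveal weaknesses that a single accuracy figure hides.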

<p style="
  background: #ffffff;
  border: 5px solid #9f0b0bff;
  border-radius: 18px;
  padding: 40px 180px;
  text-align: center;
  text-shadow: 1px 1px 3px rgba(0, 0, 0, 0.4);
  font-family: 'serif';
  color: #9f0b0bff;
  font-size: 3.8rem;
  width: fit-content;
  margin: 20px auto;
">
  Prediction Example <br>
</p>

In [None]:
txt = ["Regular fast food eating linked to fertility issues in women"]
tokens = preprocess_text(txt[0])
seq = tokens_to_sequence(tokens)
seq_tensor = torch.LongTensor(seq).unsqueeze(0).to(device)

model.eval()
with torch.no_grad():
    pred = model(seq_tensor)
    pred_label = pred.argmax(dim=1).item()

labels = ['entertainment','business','science/tech','health']
print("Prediction:", labels[pred_label])
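The model outputs one unnormalized score (logit) per class, and `argmax` picks the highest. If a confidence estimate is also wanted, the logits can be passed through a softmax. The sketch below shows this on hypothetical logits in plain Python; the values are made up and are not an actual output of the trained model.

```python
import math

demo_label_names = ['entertainment', 'business', 'science/tech', 'health']

def demo_softmax(logits):
    # Subtract the max for numerical stability
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a single headline
demo_logits = [0.2, -1.0, 0.5, 2.3]
demo_probs = demo_softmax(demo_logits)

# Index of the largest score, equivalent to argmax over the class dimension
demo_pred_idx = max(range(len(demo_logits)), key=lambda i: demo_logits[i])
demo_prediction = demo_label_names[demo_pred_idx]

print("Prediction:", demo_prediction)  # Prediction: health
print("Confidence:", round(demo_probs[demo_pred_idx], 3))
```

In the notebook itself, the equivalent would be `torch.softmax(pred, dim=1)` applied to the model output before taking the maximum.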