#Problem Statement and Motivation
The problem I am going to tackle is **predicting the category of a news article based on its title and description**. According to the Mediastack API, some articles are classified as "general", which means they were not assigned to a specific category. The categories Mediastack currently supports are sports, entertainment, health, science, technology, and business. The goal is to build a model that predicts the most likely category for some of these unassigned articles, assuming they were left unassigned because of missing metadata and not because they do not fit into any of these categories.

**I will be using BERT models for this task,** as BERT is a small language model that can be run locally without a GPU and has been shown in research to be relatively robust. BERT can be thought of as a scaled-down relative of a model like ChatGPT: both are built on the Transformer architecture and its attention mechanism, although BERT uses only the encoder stack while GPT-style models use only the decoder.

#Data Collection
For this assignment, I am **only going to retrieve 100 samples of news article metadata**. The API returns at most 100 results per call; for future work, the number of calls can be increased by adjusting the for-loop range.

Once the request goes through, the results are appended to a list, converted to a NumPy array, and saved for future use. Note that **I am requesting every category except "general", on the assumption that "general" is the unassigned category**.

I will also retrieve only English-language articles.

In [None]:
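import http.client
import json
import urllib.parse

import numpy as np

conn = http.client.HTTPConnection('api.mediastack.com')
data_arr = []
page_size = 100  # maximum number of results per request

# Increase the range to make more calls; each call fetches the next page.
for i in range(1):
    params = urllib.parse.urlencode({
        'access_key': 'e6fcc1cb587d615f0ed72c7fc28e6d82',
        'sort': 'published_desc',
        # Request the six supported categories and exclude "general"
        'categories': 'sports,entertainment,health,science,technology,business,-general',
        'limit': page_size,
        'offset': i * page_size,  # pagination offset so repeated calls do not return the same page
        'language': 'en'
    })

    conn.request('GET', '/v1/news?{}'.format(params))

    res = conn.getresponse()
    data = res.read()

    # json.loads accepts the raw UTF-8 bytes directly
    parsed = json.loads(data)

    for item in parsed['data']:
        data_arr.append(item)

# Each item is a dict, so the array has dtype=object and must be pickled
np.save('data.npy', np.array(data_arr), allow_pickle=True)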
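
With only 100 samples spread over six categories, it may be worth checking how many articles landed in each class before training. The following is a minimal sketch, not part of the original pipeline; `category_counts` and the sample records are hypothetical stand-ins for the saved Mediastack items, which are assumed to carry a `category` field as used later in the notebook.

```python
from collections import Counter


def category_counts(articles):
    """Count how many articles fall into each category."""
    return Counter(article["category"] for article in articles)


# Hypothetical records standing in for the saved Mediastack items.
sample_articles = [
    {"title": "Team wins final", "category": "sports"},
    {"title": "New phone released", "category": "technology"},
    {"title": "Playoff preview", "category": "sports"},
]

counts = category_counts(sample_articles)
for category, n in counts.most_common():
    print(f"{category}: {n}")
```

A heavily skewed count would suggest that the model sees too few examples of some categories to learn them.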

#Experiment 1

In this first experiment, **I am exploring fine-tuning BERT for sequence classification**: given the title and description of an article, which label/category should be assigned to that input?

First, **any punctuation is removed from the input fields, title and description, and each field is then lowercased**. This normalizes the data so that different inputs can be compared fairly. The category is mapped to an integer label.
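
The normalization step can be illustrated in isolation. This is a minimal sketch that mirrors the `preprocess` method used in the dataset class of this notebook; the example headline is made up for illustration.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(text):
    """Lowercase the text and strip ASCII punctuation; None becomes ''."""
    if text is None:
        return ""
    return text.lower().translate(_PUNCT_TABLE)


example = preprocess("U.S. Navy's Blue Angels arrive!")
print(example)  # prints: us navys blue angels arrive
```

Note that `string.punctuation` covers ASCII characters only, so symbols such as `¿` or curly quotes would pass through unchanged.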

A data loader is used to feed batches to the BERT sequence classification model during training.

The **AdamW optimizer (Adam with decoupled weight decay) was selected** based on the strong performance it has been shown to yield. Future work could involve experimenting with different optimizers.

The default **loss function being used here is cross-entropy loss**, which pushes the model to assign higher probabilities to correct labels. This type of loss function is particularly useful when dealing with multiple classes as we are doing here.
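
To make the behavior of cross-entropy concrete, here is a small worked sketch in plain Python. It is an illustration of the formula only, not the code path used by the model; the logits are invented, and `softmax` and `cross_entropy` are helper names introduced here.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, true_idx):
    """Negative log-probability assigned to the correct class."""
    return -math.log(softmax(logits)[true_idx])


# Illustrative logits for six classes; index 0 is the correct label.
confident = cross_entropy([4.0, 0.0, 0.0, 0.0, 0.0, 0.0], true_idx=0)
uniform = cross_entropy([0.0] * 6, true_idx=0)
print(f"confident and correct: {confident:.3f}")
print(f"uniform guess:         {uniform:.3f}")
```

A model that guesses uniformly over six classes scores ln 6 ≈ 1.79, which gives a rough baseline for reading the reported training losses.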

After the model is trained, it is saved for future use in the inference stage.

This model took 58 minutes to train locally.

In [None]:
import numpy as np

import torch
from torch.utils.data import Dataset, DataLoader
import string
from transformers import BertTokenizer

category_to_idx = {'sports': 0, 'science': 1, 'business': 2, 'entertainment': 3, 'health' : 4, 'technology': 5}

class NewsDataset(Dataset):
    def __init__(self, data, tokenizer, max_length=512):
        self.data = data
        self.tokenizer = tokenizer
        self.max_length = max_length

    def preprocess(self, text):
        if text is None:
            return ""
        return text.lower().translate(str.maketrans('', '', string.punctuation))

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        item = self.data[idx]
        # Concatenate title and description
        text = self.preprocess(item['title']) + " " + self.preprocess(item['description'])

        # Tokenize
        encoded_dict = self.tokenizer(
            text,
            add_special_tokens=True,  # Add '[CLS]' and '[SEP]'
            max_length=self.max_length,  # Pad & truncate all sentences.
            padding='max_length',
            truncation=True,
            return_tensors='pt',  # Return PyTorch tensors
        )
        label = torch.tensor(category_to_idx[item['category']], dtype=torch.long)
        # Return tokenized information along with the category
        return encoded_dict['input_ids'].squeeze(0), encoded_dict['attention_mask'].squeeze(0), label


# Initialize BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Load data
data = np.load('data.npy', allow_pickle=True)

# Create Dataset
news_dataset = NewsDataset(data, tokenizer)

# Create DataLoader
news_dataloader = DataLoader(news_dataset, batch_size=2, shuffle=True)


from transformers import BertForSequenceClassification
# AdamW from transformers is deprecated; PyTorch's implementation is used instead
from torch.optim import AdamW

# Load the pre-trained BERT model for sequence classification
model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=6  #number of unique categories
)



# Define optimizer
optimizer = AdamW(model.parameters(), lr=2e-5)

# Training loop
for epoch in range(8):
    model.train()
    total_loss = 0

    for batch in news_dataloader:
        # Unpack the training batch
        b_input_ids, b_input_mask, b_labels = batch

        # Clear previously calculated gradients
        model.zero_grad()

        # Forward pass
        outputs = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask, labels=b_labels)
        loss = outputs.loss
        total_loss += loss.item()

        # Backward pass
        loss.backward()

        # Update parameters
        optimizer.step()

    # Calculate average loss over the epoch
    avg_train_loss = total_loss / len(news_dataloader)
    print(f"Average train loss: {avg_train_loss}")

# Save the model
model.save_pretrained('./exp1_fine_tuned_bert')

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Average train loss: 1.438009613752365
Average train loss: 0.8464612972736358
Average train loss: 0.4123449271917343
Average train loss: 0.21202554874122143
Average train loss: 0.13355910535901785
Average train loss: 0.090833296533674
Average train loss: 0.060082337707281115
Average train loss: 0.04345612993463874


#Experiment 2
In experiment 2, we are **using a masked-token prediction approach, which potentially makes more sense than sequence classification**. Rather than simply assigning labels to inputs, this variant of BERT learns to predict a specific token in a sentence. **This task is formalized by adding a masked token, denoted by [MASK], to a sentence; the token stands in for the category the news article falls into**. This type of approach is closer to a text generation task than to a classification task.

The code for experiment 2 follows a similar structure to experiment 1, with one important distinction: how the input to the model is processed. The training data is formatted so that the model sees both the masked token and the surrounding "context" information. **Additionally, instead of the target output being a label, the target output is now the token that should fill the masked position.**
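
The prompt construction can be sketched on its own. This is a minimal illustration using the same template wording as the notebook; `build_prompt` is a helper name introduced here, and the example article is hypothetical.

```python
CATEGORY_LIST = (
    "The category can be sports, business, science, "
    "technology, entertainment, or health."
)


def build_prompt(title, description, mask_token="[MASK]"):
    """Build the cloze-style prompt whose final slot is the category."""
    return (
        f"{CATEGORY_LIST} The title is {title}, "
        f"the description is {description}, "
        f"the category is {mask_token}."
    )


# Hypothetical article used only for illustration.
prompt = build_prompt("team wins final", "a late goal decided the match")
print(prompt)

# At training time, the label for the masked slot is the category word.
target_word = "sports"
```

Using the same template at training and inference time keeps the masked slot in a position the model has learned to fill.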

The optimizer and loss function are the same ones used in experiment 1; however, the loss function is applied in a slightly different way. For each masked position, the BERT model outputs logits, which are raw, unnormalized scores for each token in the vocabulary. These logits are turned into probabilities, and cross-entropy penalizes the model when it assigns low probability to the correct token. Positions other than the mask are labeled -100 so that they are ignored by the loss.
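
The way the loss is restricted to the masked position can be sketched in plain Python. This is an illustration of the idea under simplifying assumptions (a toy vocabulary of four tokens and invented logits), not the library's implementation; `masked_lm_loss` is a name introduced here.

```python
import math

IGNORE_INDEX = -100  # positions with this label contribute no loss


def masked_lm_loss(logits_per_position, labels):
    """Mean cross-entropy over positions whose label is not IGNORE_INDEX."""
    losses = []
    for logits, label in zip(logits_per_position, labels):
        if label == IGNORE_INDEX:
            continue
        shift = max(logits)
        log_sum_exp = shift + math.log(sum(math.exp(x - shift) for x in logits))
        # log_sum_exp - logits[label] equals -log(softmax(logits)[label])
        losses.append(log_sum_exp - logits[label])
    if not losses:
        raise ValueError("no labeled positions; the [MASK] may have been truncated")
    return sum(losses) / len(losses)


# Toy sequence of three positions; only the last one is masked.
logits = [
    [0.1, 0.2, 0.3, 0.4],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 5.0, 0.0],
]
labels = [IGNORE_INDEX, IGNORE_INDEX, 2]
loss = masked_lm_loss(logits, labels)
print(f"loss at the masked position: {loss:.4f}")
```

Changing the logits at the ignored positions leaves the loss unchanged, which is why only the category slot drives the parameter updates.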

This model took ~1 hour to train.

In [None]:
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader
import string

class NewsDataset(Dataset):
    def __init__(self, data, tokenizer, max_length=512):
        self.data = data
        self.tokenizer = tokenizer
        self.max_length = max_length

    def preprocess(self, text):
        if text is None:
            return ""
        return text.lower().translate(str.maketrans('', '', string.punctuation))

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        item = self.data[idx]
        # Keep title and description separate so each fills its own slot in the prompt
        title = self.preprocess(item['title'])
        description = self.preprocess(item['description'])
        category = self.preprocess(item['category'])

        # Construct the prompt with [MASK] for category
        category_list = "The category can be sports, business, science, technology, entertainment, or health."
        masked_text = f"{category_list} The title is {title}, the description is {description}, the category is [MASK]."

        # Tokenize and create labels
        input_ids = self.tokenizer.encode(masked_text, add_special_tokens=True)
        # Initialize labels with -100, which tells PyTorch to ignore these tokens in the loss function
        labels = [-100] * len(input_ids)

        # Replace the [MASK] token's label with the true category's ID
        mask_token_index = input_ids.index(self.tokenizer.mask_token_id)
        labels[mask_token_index] = self.tokenizer.convert_tokens_to_ids(category)

        # Padding and truncation
        padding_length = self.max_length - len(input_ids)
        if padding_length > 0:  # Pad if necessary
            input_ids.extend([self.tokenizer.pad_token_id] * padding_length)
            labels.extend([-100] * padding_length)
        elif padding_length < 0:  # Truncate if necessary
            # Note: truncation drops the trailing [MASK], so such a sample has no labeled position
            input_ids = input_ids[:self.max_length]
            labels = labels[:self.max_length]

        return torch.tensor(input_ids, dtype=torch.long), torch.tensor(labels, dtype=torch.long)

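from transformers import BertForMaskedLM, BertTokenizer
# AdamW from transformers is deprecated; PyTorch's implementation is used instead
from torch.optim import AdamW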

model = BertForMaskedLM.from_pretrained('bert-base-uncased')
if torch.cuda.is_available():
    model.cuda()

optimizer = AdamW(model.parameters(), lr=2e-5)

num_epochs = 4


def fine_tune_bert_mlm(model, dataloader, optimizer, num_epochs=4):
    for epoch in range(num_epochs):
        model.train()
        total_loss = 0

        for batch in dataloader:
            input_ids, labels = batch

            # Attention mask - identify non-zero inputs
            attention_mask = (input_ids != tokenizer.pad_token_id).type(torch.long)

            if torch.cuda.is_available():
                input_ids = input_ids.cuda()
                labels = labels.cuda()
                attention_mask = attention_mask.cuda()

            optimizer.zero_grad()

            outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
            loss = outputs.loss
            total_loss += loss.item()

            loss.backward()
            optimizer.step()

        avg_train_loss = total_loss / len(dataloader)
        print(f"Average train loss: {avg_train_loss}")


# Initialize BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Load your data
data = np.load('data.npy', allow_pickle=True)

# Create Dataset
news_dataset = NewsDataset(data, tokenizer)

# DataLoader setup
news_dataloader = DataLoader(news_dataset, batch_size=2, shuffle=True)


fine_tune_bert_mlm(model, news_dataloader, optimizer, num_epochs=4)


model.save_pretrained('./exp2_fine_tuned_bert')


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identica

Average train loss: 1.128540556561202
Average train loss: 0.2584678082028404
Average train loss: 0.022833976553520187
Average train loss: 0.0061055827329983


#Inference with Models

In [None]:
# Let's first get some test data from the "general" category
import http.client
import json
import urllib.parse

import numpy as np

conn = http.client.HTTPConnection('api.mediastack.com')
data_arr = []


for i in range(1):
    params = urllib.parse.urlencode({
        'access_key': 'e6fcc1cb587d615f0ed72c7fc28e6d82',
        'sort': 'published_desc',
        'categories': 'general',
        'limit': 5,
        # Include English and explicitly exclude the other languages
        'language': 'en,-es,-ar,-de,-fr,-he,-it,-nl,-no,-pt,-ru,-zh,-se'
    })
    conn.request('GET', '/v1/news?{}'.format(params))

    res = conn.getresponse()
    data = res.read()

    # json.loads accepts the raw UTF-8 bytes directly
    parsed = json.loads(data)

    for item in parsed['data']:
        data_arr.append(item)

np.save('test_data.npy', np.array(data_arr), allow_pickle=True)

In [None]:
#Inference for experiment 1
data = np.load('test_data.npy', allow_pickle=True)
model_path = '/content/exp1_fine_tuned_bert' #replace with your file path
model = BertForSequenceClassification.from_pretrained(model_path)

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')


import string

def preprocess(text):
    if text is None:
        return ""
    return text.lower().translate(str.maketrans('', '', string.punctuation))

# Map predicted label indices back to category names
idx_to_category = {
    0: 'sports',
    1: 'science',
    2: 'business',
    3: 'entertainment',
    4: 'health',
    5: 'technology'
}

model.eval()  # Set the model to evaluation mode

for item in data:
    text = preprocess(item['title']) + " " + preprocess(item['description'])

    # Tokenize
    inputs = tokenizer(
        text,
        add_special_tokens=True,
        max_length=512,
        padding='max_length',
        truncation=True,
        return_tensors='pt'
    )
    with torch.no_grad():
        outputs = model(**inputs)

    # Process the output
    logits = outputs.logits
    predicted_probabilities = torch.nn.functional.softmax(logits, dim=1)
    predicted_label_idx = torch.argmax(predicted_probabilities, dim=1).item()

    # Convert predicted label index to actual label
    predicted_category = idx_to_category[predicted_label_idx]
    print(text, "Predicted Label:", predicted_category)

benny safdie confirma que los hermanos safdie se han separado benny afirmó que la separación con josh es amigable y representa una evolución natural en sus respectivas trayectorias profesionales Predicted Label: sports
us navy blue angels arrive in imperial county the us navy blue angels arrived at the naval air facility in el centro for their winter training to prepare for the 2024 air showthe post us navy blue angels arrive in imperial county appeared first on kyma Predicted Label: sports
luego del 9 de enero reiniciarán las movilizaciones sociales pacíficas en la zona sur de la región consejo de autoridades originarias de jacha mallkus anunció que luego del 9 de enero se reiniciarán movilizaciones sociales pacificas por el poco avance de las investigaciones el representante del consejo de autoridades originarias de jacha mallkus rubén añamuro anunció que luego de participar de las actividades programadas por los familiares de las víctimas del pasado the post luego del 9 de enero rei

In [None]:
#inference for experiment 2
data = np.load('test_data.npy', allow_pickle=True)
model_path = '/content/exp2_fine_tuned_bert' #replace with your file path
model = BertForMaskedLM.from_pretrained(model_path)

def prepare_input(text, tokenizer):
    inputs = tokenizer.encode_plus(
        text,
        add_special_tokens=True,
        return_tensors='pt'
    )
    return inputs

def predict_masked_token(text, model, tokenizer):
    model.eval()

    inputs = prepare_input(text, tokenizer)
    mask_token_index = torch.where(inputs.input_ids == tokenizer.mask_token_id)[1]

    # Move the inputs to whichever device the model is on (the reloaded model starts on the CPU)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}

    with torch.no_grad():
        outputs = model(**inputs)
        predictions = outputs.logits

    predicted_index = torch.argmax(predictions[0, mask_token_index]).item()
    predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]

    return predicted_token


for item in data:
    category_list = "The category can be sports, business, science, technology, entertainment, or health."
    masked_sentence = f"{category_list} The title is {item['title']}, the description is {item['description']}, the category is [MASK]."
    predicted_word = predict_masked_token(masked_sentence, model, tokenizer)
    print(masked_sentence, f"Predicted word for [MASK]: {predicted_word}")

The category can be sports, business, science, technology, entertainment, or health. The title is Few Weeks Left To Comment On Tuolumne County Wildfire Protection Plan, the description is The goal is to outline local fire priorities and reduce wildfire risk through a proactive plan for communities in the county., the category is [MASK]. Predicted word for [MASK]: business
The category can be sports, business, science, technology, entertainment, or health. The title is Condenan a 20 años de cárcel a uno de 3 asaltantes, the description is Los hechos ocurrieron el 11 de noviembre del 2022, cuando Pimentel Rosario, junto a Michael Miguel Pimentel Grullart y Wilmer Eduardo Peguero Camilo se presentaron en el vehículo marca Kia, modelo K5, año 2014, plateado, placa A809438, en momentos que Guerra Álvarez se encontraba abriendo la puerta de su residencia ubicada en los Suizos, de Bayaguana, donde encañonaron a la víctima con un arma de fuego., the category is [MASK]. Predicted word for [MASK

#Conclusion and Future Work

To start, I was working with limited computational power and time, and as a result was not able to build a very large dataset. **Ideally, the data retrieved should be more than 1000 samples, but I used only 100.**

Another major issue is that despite specifically requesting only English news articles, it appears that some Spanish articles may have been mislabeled as English. This is an issue as **I am not using a multilingual model, and BERT was trained on English text.**

I suspect that, because of the low number of samples, the models I created were **overfitting and not able to generalize well.**
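
One way to test this suspicion would be to hold out part of the labeled data and compare training loss against validation loss. The following is a minimal sketch of such a split, not something the notebook currently does; `train_val_split`, the 20% fraction, and the placeholder records are assumptions made for illustration.

```python
import random


def train_val_split(items, val_fraction=0.2, seed=0):
    """Shuffle a copy of the items and hold out a validation slice."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


# Placeholder records standing in for the 100 collected articles.
articles = [{"id": i} for i in range(100)]
train_items, val_items = train_val_split(articles)
print(len(train_items), len(val_items))  # prints: 80 20
```

A training loss that keeps falling while the validation loss stalls or rises would support the overfitting explanation.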

In regard to evaluation metrics, writing code that checks whether or not the actual category name is contained within the model's predicted output would be a good start. That would take into account the edge case of a model outputting "The predicted category is health".
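
Such a check could look roughly like the sketch below. It assumes a labeled held-out set is available, since the "general" articles have no gold category; `extract_category`, `accuracy`, and the example predictions are hypothetical.

```python
CATEGORIES = ("sports", "science", "business", "entertainment", "health", "technology")


def extract_category(model_output):
    """Return the first known category mentioned in the output, else None."""
    text = model_output.lower()
    hits = [(text.find(c), c) for c in CATEGORIES if c in text]
    return min(hits)[1] if hits else None


def accuracy(outputs, gold_labels):
    """Fraction of outputs whose extracted category equals the gold label."""
    correct = sum(extract_category(o) == g for o, g in zip(outputs, gold_labels))
    return correct / len(gold_labels)


# Hypothetical predictions and gold labels for illustration.
outputs = ["health", "The predicted category is health", "business", "unknown"]
gold = ["health", "health", "sports", "science"]
print(accuracy(outputs, gold))  # prints: 0.5
```

Substring matching handles both a bare category token and a full sentence, at the cost of being fooled by outputs that mention several categories.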

To summarize, future work would involve using a multilingual model and obtaining the computational means necessary to fine-tune language models on hundreds or thousands of samples. The most immediate next step for this project would be to implement a robust evaluation pipeline.