<a href="https://colab.research.google.com/github/ieg-dhr/NLP-Course4Humanities_2024/blob/main/Text_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Classification Using Small and Middle-sized LLMs

Created by Sarah Oberbichler [![ORCID](https://info.orcid.org/wp-content/uploads/2019/11/orcid_16x16.png)](https://orcid.org/0000-0002-1031-2759)

Text classification assigns predefined categories (labels) to text documents, for example sorting emails into spam and not-spam, or news articles into topics such as sports, politics, and technology.

* **Classifying Earthquake Reports as an Example**

As a text classification example, we use a dataset of earthquake reporting and classify the articles as either "aid reports" or "others." This classification enables, for example, further research into how humanitarian assistance is covered in disaster reporting, providing insights into the prominence and patterns of aid-related reporting during crises.

* **Trying Small and Middle-Sized LLMs**

While fine-tuning language models for specific classification tasks can achieve strong results, not all research projects have the resources to fine-tune models. Using LLMs with in-context learning provides an accessible alternative: the model draws on its pre-trained knowledge through prompts, learns from a few examples included in the prompt (few-shot learning), and adapts to new tasks without parameter updates.  
For this example, we compare a small (7b) and a middle-sized (70b) model on the same classification task.
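
As a minimal sketch of what in-context learning looks like in practice, the snippet below assembles a few-shot prompt from an instruction, a handful of labeled examples, and the text to classify. The helper name `build_few_shot_prompt` and the shortened example sentences are illustrative assumptions, not part of the course code; the notebook itself uses a fixed German prompt further down.

```python
def build_few_shot_prompt(instruction, examples, text):
    """Assemble a few-shot prompt (hypothetical helper, for illustration only)."""
    parts = [instruction, ""]
    # Each labeled example shows the model the expected answer format
    for number, (example_text, label) in enumerate(examples, start=1):
        parts.append(f"Example {number}: {example_text}")
        parts.append(f"<answer>{label}</answer>")
        parts.append("")
    # The text to classify comes last, so the model continues with its answer
    parts.append("Klassifiziere diesen Text:")
    parts.append(text)
    return "\n".join(parts)


instruction = (
    'Klassifiziere den Text als entweder "Hilfsreport" oder "Andere". '
    "Setze deine Antwort in XML-Tags."
)
examples = [
    ("Das deutsche Hilfskomitee sammelt Spenden.", "Hilfsreport"),
    ("Die Mannschaften der Kriegsschiffe haben Schutzhütten fertig gestellt.", "Andere"),
]
demo_prompt = build_few_shot_prompt(
    instruction, examples, "Der Finanzminister beantragt Hilfsgelder."
)
print(demo_prompt)
```

Keeping the examples in a list makes it easy to swap them out or add more without touching the instruction text.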

* **Evaluating Model Output**

To evaluate which model performs better, we compare the model output to ground truth labels. Ground truth labels are verified, correct labels that serve as the reference point for evaluating model performance. They represent the "true" or "correct" classification determined by human experts or established criteria.
Our dataset therefore also includes a column with human-annotated ground truth.
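
The comparison against ground truth can be illustrated with a small worked example. The labels below are invented for illustration and are not taken from the course dataset. The sketch computes plain accuracy and, as an assumed point of reference, the accuracy of a trivial "majority baseline" that always predicts the most frequent label.

```python
from collections import Counter

# Toy labels for illustration only; these are not taken from the course dataset
ground_truth = ["Hilfsreport", "Andere", "Andere", "Hilfsreport", "Andere"]
predictions = ["Hilfsreport", "Andere", "Hilfsreport", "Hilfsreport", "Andere"]

# Accuracy: share of predictions that match the ground truth label
correct = sum(truth == pred for truth, pred in zip(ground_truth, predictions))
accuracy = correct / len(ground_truth)

# Majority baseline: accuracy of always predicting the most frequent label
most_common_label, most_common_count = Counter(ground_truth).most_common(1)[0]
baseline_accuracy = most_common_count / len(ground_truth)

print(f"Accuracy: {accuracy:.0%}")
print(f"Majority baseline ({most_common_label}): {baseline_accuracy:.0%}")
```

A model is only informative if its accuracy clearly exceeds such a baseline, which matters when one class dominates the dataset.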

## Importing the Dataset


In [None]:
!git clone https://github.com/ieg-dhr/NLP-Course4Humanities_2024.git

In [None]:
import pandas as pd

articles_df = pd.read_excel('/content/NLP-Course4Humanities_2024/datasets/earthquake_articles_examples.xlsx')

articles_df.head()

## Text Classification Using the Qwen2.5 7b Model

We access the model through the Hugging Face Hub, which downloads it to our Colab runtime environment. Switching to Colab's T4 GPU runtime is recommended for faster processing. The initial download and model processing take approximately 10 minutes.

In [None]:
!pip install optimum
!pip install auto-gptq

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import torch


tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4")
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4",
    device_map="balanced",
    torch_dtype=torch.float16
)

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer
)

def format_with_examples(text):
    prompt = """Klassifiziere den Text als entweder "Hilfsreport" oder "Andere". Setze deine Antwort in XML-Tags.
Kriterien für "Hilfsreport":

Spendenbeträge und deren Verwendung
Aktivitäten von Hilfsorganisationen
Anzahl der Helfer und Art der Hilfsleistungen
Verteilung von Hilfsgütern
Hilfskomitees und ihre Arbeit
Hilfsmaßnahmen jeglicher Art
Güter oder Geld für Katastrophenopfer
Unterstützung für Opfer
Hilfe und Wiederaufbau

Überprüfe deine Antwort durch Kontrolle der relevanten Themen und Wörter sowie der Beispiele. Reflektiere deine Antwort und gib nur die beste Antwort.

Example 1: Die Mannschaften der Kriegsschiffe haben an der Küste Schutzhütten fertig gestellt.
<answer>Andere</answer>

Example 2: Das deutsche Hilfskomitee sammelt Spenden. Kleidungs- und Wäschestücke werden geschickt.
<answer>Hilfsreport</answer>

Example 3: W Madrid, 11. Jan. (Telegr.) Der Finanzminister hat in der Kammer den Antrag eingebracht, für die Opfer des Erdbebens in Süditalien 200000 Pesetas zu bewilligen.
<answer>Hilfsreport</answer>

Klassifiziere diesen Text:
{text}
"""
    return prompt.format(text=text)

def classify_text(text):
    import re  # local import keeps this function self-contained

    torch.cuda.empty_cache()
    prompt = format_with_examples(text)
    # Greedy decoding (do_sample=False) is deterministic; a temperature
    # setting would be ignored in this mode, so it is omitted.
    response = pipe(
        prompt,
        max_new_tokens=20,  # enough room for the XML tags and the label
        do_sample=False,
        return_full_text=False
    )[0]['generated_text'].strip()

    # Extract the label between the XML tags
    match = re.search(r'<answer>(.*?)</answer>', response)
    if match:
        return match.group(1).strip()
    return "Andere"  # default fallback when no tag is found


batch_size = 4
classifications = []
for i in range(0, len(articles_df), batch_size):
    batch = articles_df['extracted_article_clean'].iloc[i:i+batch_size]
    for text in batch:
        try:
            classification = classify_text(text)
            classifications.append(classification)
        except Exception as e:
            print(f"Error processing text: {e}")
            classifications.append("Error")
    torch.cuda.empty_cache()

articles_df['classification_qwen7b'] = classifications
articles_df

In [None]:
# Export to Excel
articles_df.to_excel('classified_articles_qwen7b.xlsx', index=False)
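
The `classify_text` function above falls back to "Andere" whenever no `<answer>` tag is found, so a parsing failure cannot be told apart from a genuine "Andere" prediction. As a hedged alternative, the sketch below marks such cases explicitly. The helper name `extract_label`, the `VALID_LABELS` mapping, and the "Unparsed" marker are assumptions introduced for illustration; they are not part of the course code.

```python
import re

# Maps lower-cased model output to the canonical label spelling
VALID_LABELS = {"hilfsreport": "Hilfsreport", "andere": "Andere"}


def extract_label(response, fallback="Unparsed"):
    """Return the label inside <answer> tags, or `fallback` if none is valid."""
    # DOTALL lets the tag content span line breaks; IGNORECASE tolerates <Answer>
    match = re.search(
        r"<answer>(.*?)</answer>", response or "", re.DOTALL | re.IGNORECASE
    )
    if not match:
        return fallback
    candidate = match.group(1).strip().lower()
    # Anything other than the two known labels is treated as unparsed
    return VALID_LABELS.get(candidate, fallback)


print(extract_label("<answer>Hilfsreport</answer>"))
print(extract_label("<answer>\n andere \n</answer>"))
print(extract_label("Der Text ist ein Hilfsreport."))
```

Rows marked "Unparsed" match neither ground truth label, so they count as errors in the evaluation and can be filtered out for manual inspection.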

## Text Classification Using the NVIDIA/Llama 3.1 Nemotron 70b Model

The Nemotron model is accessed via NVIDIA's API using an authentication token that is stored in Colab's Secrets under the name `NVIDIA_TOKEN` and read with `userdata.get`. Processing will be considerably faster than with the Hugging Face model, since inference runs remotely instead of in the Colab runtime.

In [None]:
import pandas as pd
from openai import OpenAI
from google.colab import userdata

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key=userdata.get('NVIDIA_TOKEN')
)

classifications = []
for index, row in articles_df.iterrows():
    try:
        completion = client.chat.completions.create(
            model="nvidia/llama-3.1-nemotron-70b-instruct",
            messages=[
                {
                    'role': 'system',
                    'content': """You are an expert in classification tasks"""
                },
                {
                    'role': 'user',
                    'content': f"""Klassifiziere den Text als entweder "Hilfsreport" oder "Andere". Setze deine Antwort in XML-Tags.
Kriterien für "Hilfsreport":

Spendenbeträge und deren Verwendung
Aktivitäten von Hilfsorganisationen
Anzahl der Helfer und Art der Hilfsleistungen
Verteilung von Hilfsgütern
Hilfskomitees und ihre Arbeit
Hilfsmaßnahmen jeglicher Art
Güter oder Geld für Katastrophenopfer
Unterstützung für Opfer
Hilfe und Wiederaufbau

Überprüfe deine Antwort durch Kontrolle der relevanten Themen und Wörter sowie der Beispiele. Reflektiere deine Antwort und gib nur die beste Antwort.

Example 1: Die Mannschaften der Kriegsschiffe haben an der Küste Schutzhütten fertig gestellt.
<answer>Andere</answer>

Example 2: Das deutsche Hilfskomitee sammelt Spenden. Kleidungs- und Wäschestücke werden geschickt.
<answer>Hilfsreport</answer>

Example 3: W Madrid, 11. Jan. (Telegr.) Der Finanzminister hat in der Kammer den Antrag eingebracht, für die Opfer des Erdbebens in Süditalien 200000 Pesetas zu bewilligen.
<answer>Hilfsreport</answer>

Klassifiziere diesen Text:

{row['extracted_article_clean']}"""
                }
            ],
            temperature=0.0,
            max_tokens=20
        )

        content = completion.choices[0].message.content

        try:
            import re
            match = re.search(r'<answer>(.*?)</answer>', content)
            classifications.append(match.group(1).strip() if match else "Andere")
        except Exception as e:
            print(f"Error extracting response for row {index}: {e}")
            classifications.append("Andere")

    except Exception as e:
        print(f"Error processing text for row {index}: {e}")
        classifications.append("Error")

articles_df['classification_nemotron70b'] = classifications
articles_df

In [None]:
# Export to Excel
articles_df.to_excel('classified_articles_nemotron70b.xlsx', index=False)
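
The loop above records "Error" as soon as a request fails. Remote APIs can fail temporarily, for instance because of rate limits or timeouts, so retrying a failed request a few times may recover some of these rows. The following is a minimal sketch under that assumption. The helper `call_with_retries` and the stand-in function `flaky_request` are hypothetical and not part of the course code or of the OpenAI client.

```python
import time


def call_with_retries(request_fn, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call `request_fn()` and retry with exponential backoff on any exception."""
    for attempt in range(1, max_attempts + 1):
        try:
            return request_fn()
        except Exception as error:
            if attempt == max_attempts:
                raise  # give up after the last attempt
            # Wait 1x, 2x, 4x, ... the base delay between attempts
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({error}); retrying in {delay:.0f}s")
            sleep(delay)


# Stand-in for an API call that fails twice before succeeding (illustration only)
attempts = {"count": 0}


def flaky_request():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("temporary failure")
    return "<answer>Hilfsreport</answer>"


# The sleep function is replaced here so the demonstration runs without waiting
result = call_with_retries(flaky_request, sleep=lambda seconds: None)
print(result)
```

In the notebook, the request passed to such a helper would be a function that wraps the `client.chat.completions.create` call for one article.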

## Evaluation and Comparison of the Model Performance

The code below compares the performance of two classification models on a dataset. It calculates overall accuracy and a class-specific accuracy for each of the "Hilfsreport" and "Andere" categories, that is, the share of articles truly belonging to a class that the model labeled correctly (also known as per-class recall). It then draws a matplotlib bar chart showing three metrics per model: total accuracy (blue), Andere accuracy (green), and Hilfsreport accuracy (red). The bars carry percentage labels. The function requires a DataFrame with a ground truth column and one prediction column per model.

In [None]:
from sklearn.metrics import accuracy_score
import numpy as np
import matplotlib.pyplot as plt

def compare_models_percentage(df, ground_truth_col, model1_col, model2_col=None):
   # Calculate accuracies
   acc1 = accuracy_score(df[ground_truth_col], df[model1_col]) * 100

   def class_accuracies(y_true, y_pred):
       hilfs_acc = sum((y_true == 'Hilfsreport') & (y_pred == 'Hilfsreport')) / sum(y_true == 'Hilfsreport') * 100
       andere_acc = sum((y_true == 'Andere') & (y_pred == 'Andere')) / sum(y_true == 'Andere') * 100
       return andere_acc, hilfs_acc

   andere_acc1, hilfs_acc1 = class_accuracies(df[ground_truth_col], df[model1_col])

   if model2_col:
       acc2 = accuracy_score(df[ground_truth_col], df[model2_col]) * 100
       andere_acc2, hilfs_acc2 = class_accuracies(df[ground_truth_col], df[model2_col])
       models = ['Qwen7b', 'Nemotron70b']
       total_acc = [acc1, acc2]
       andere_acc = [andere_acc1, andere_acc2]
       hilfs_acc = [hilfs_acc1, hilfs_acc2]
   else:
       models = ['Nemotron70b']
       total_acc = [acc1]
       andere_acc = [andere_acc1]
       hilfs_acc = [hilfs_acc1]

   x = np.arange(len(models))
   width = 0.25

   fig, ax = plt.subplots(figsize=(10, 6))
   ax.bar(x - width, total_acc, width, label='Total Accuracy', color='blue')
   ax.bar(x, andere_acc, width, label='Andere Accuracy', color='green')
   ax.bar(x + width, hilfs_acc, width, label='Hilfsreport Accuracy', color='red')

   ax.set_ylabel('Accuracy (%)')
   ax.set_title('Model Comparison')
   ax.set_xticks(x)
   ax.set_xticklabels(models)
   ax.legend()

   def add_labels(bars):
       for bar in bars:
           height = bar.get_height()
           ax.text(bar.get_x() + bar.get_width()/2., height,
                  f'{height:.1f}%', ha='center', va='bottom')

   for container in ax.containers:
       add_labels(container)

   plt.show()

# Example usage:
# For one model:
#compare_models_percentage(articles_df, 'ground_truth', 'classification_nemotron70b')
# For two models:
compare_models_percentage(articles_df, 'ground_truth', 'classification_qwen7b', 'classification_nemotron70b')
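
The class-specific accuracy computed above measures how many articles of a class were found (recall), but not how many of the predictions for that class were correct (precision). As a complementary sketch, the function below derives precision, recall, and F1 for one label from paired label lists. The helper name `per_class_metrics` and the toy labels are assumptions for illustration, not results from the course dataset.

```python
def per_class_metrics(y_true, y_pred, label):
    """Compute precision, recall, and F1 for one label from paired label lists."""
    pairs = list(zip(y_true, y_pred))
    true_positive = sum(t == label and p == label for t, p in pairs)
    false_positive = sum(t != label and p == label for t, p in pairs)
    false_negative = sum(t == label and p != label for t, p in pairs)

    predicted = true_positive + false_positive  # everything predicted as `label`
    actual = true_positive + false_negative     # everything that truly is `label`
    precision = true_positive / predicted if predicted else 0.0
    recall = true_positive / actual if actual else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy labels for illustration only; not results from the course dataset
y_true = ["Hilfsreport", "Hilfsreport", "Andere", "Andere", "Andere", "Hilfsreport"]
y_pred = ["Hilfsreport", "Andere", "Andere", "Hilfsreport", "Andere", "Hilfsreport"]

for label in ("Hilfsreport", "Andere"):
    precision, recall, f1 = per_class_metrics(y_true, y_pred, label)
    print(f"{label}: precision={precision:.2f}, recall={recall:.2f}, F1={f1:.2f}")
```

The same function accepts DataFrame columns such as `articles_df['ground_truth']` and `articles_df['classification_qwen7b']`. scikit-learn's `classification_report` provides these metrics as well.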