**AI & Machine Learning (KAN-CINTO4003U) - Copenhagen Business School | Spring 2025**

***


# Part III: LLM

Please see the description of the assignment in the README file (section 3) <br>
**Guide notebook**: [guides/llm_guide.ipynb](guides/llm_guide.ipynb)


***

<br>

* Note that you should report results using a classification report. 

* Also, remember to include some reflections on your results: how do they compare with the results from Part I (BoW) and Part II (BERT)? Are there any hyperparameters or prompting techniques that are particularly important?

* You should follow the steps given in the `llm_guide` notebook

<br>


***

In [42]:
# imports for the project

import pandas as pd
from sklearn.metrics import classification_report
from tqdm import tqdm


Watson

In [43]:
from decouple import Config, RepositoryEnv
from ibm_watsonx_ai import APIClient
from ibm_watsonx_ai import Credentials
from ibm_watsonx_ai.foundation_models import ModelInference

# Load secrets (e.g. WX_API_KEY) from a local .env file
config = Config(RepositoryEnv('.env'))

WX_API_KEY = config('WX_API_KEY')
credentials = Credentials(
    url = "https://us-south.ml.cloud.ibm.com/",
    api_key = WX_API_KEY
)

client = APIClient(
    credentials=credentials, 
    project_id="3b0687e2-70ee-471b-ba30-6d4514787f00"
)
model = ModelInference(
    api_client=client,
    model_id="ibm/granite-13b-instruct-v2",
)

from ibm_watsonx_ai.foundation_models.schema import TextGenParameters

TextGenParameters.show()

+-----------------------+----------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------+
| PARAMETER             | TYPE                                   | EXAMPLE VALUE                                                                                                                             |
| decoding_method       | str, TextGenDecodingMethod, NoneType   | sample                                                                                                                                    |
+-----------------------+----------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------+
| length_penalty        | dict, TextGenLengthPenalty, NoneType   | {'decay_factor': 2.5, 'start_index': 5}                                                                  

### 1. Load the data

We can load this data directly from the Hugging Face Hub (see [Hugging Face Datasets](https://huggingface.co/docs/datasets/)) into a pandas DataFrame. Pretty neat!

**Note**: This cell will download the dataset and keep it in memory. If you run this cell multiple times, it will download the dataset multiple times.

You are welcome to increase the `frac` parameter to load more data.

In [44]:

splits = {'train': 'data/train-00000-of-00001.parquet', 'test': 'data/test-00000-of-00001.parquet'}
# train = pd.read_parquet("hf://datasets/fancyzhx/ag_news/" + splits["train"])
test = pd.read_parquet("hf://datasets/fancyzhx/ag_news/" + splits["test"])

In [45]:
label_map = {
    0: 'World',
    1: 'Sports',
    2: 'Business',
    3: 'Sci/Tech'
}

def preprocess(df: pd.DataFrame, frac = 1e-2, label_map = label_map, seed=42) -> pd.DataFrame:
    return  (
        df
        .assign(label=lambda x: x['label'].map(label_map))
        [lambda df: df['label'].isin(label_map.values())]
        .groupby('label')
        .apply(lambda x: x.sample(frac=frac, random_state=seed))
        .reset_index(drop=True)

    )

# train_df = preprocess(train, frac=0.01)
test_df = preprocess(test, frac=0.1)

# clear up some memory by deleting the original dataframes
# del train
del test

test_df.shape, # train_df.shape, 

((760, 2),)

In [46]:

PARAMS = TextGenParameters(
    temperature=0,              # Higher temperature means more randomness - In this case we don't want randomness
    max_new_tokens=10,          # Maximum number of tokens to generate
    stop_sequences=[".", "\n"], # Stop generating text when these sequences are encountered
)

model = ModelInference(
    api_client=client,
    model_id="ibm/granite-13b-instruct-v2",  # We could also try a larger model!
    params=PARAMS
)

System prompt

In [52]:

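# First System Prompt = Basic
# Note: the wording below ("You task", "five categories" for four labels) is kept
# exactly as it was run, since the reported results depend on it.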
SYSTEM_PROMPT_1 = """You task is to classify news stories into one of five categories

CATEGORIES:
{categories}

TEXT:
{text}

Please assign the correct category to the text. Answer with the correct category and nothing else.

Category:
"""

# Second System Prompt = Few-Shot Prompting
SYSTEM_PROMPT_2 = """You are an expert classifier tasked with labeling news articles into one of the categories: World, Sports, Business, or Sci/Tech. Use the examples below as a guide:

EXAMPLES:
Example 1:
News article: "Global leaders convened at an international summit to discuss climate change policies."
Label: World

Example 2:
News article: "The local team clinched the championship title after a nail-biting final match."
Label: Sports

Example 3:
News article: "The stock market saw dramatic swings today after the release of the quarterly earnings report."
Label: Business

Example 4:
News article: "Scientists revealed a breakthrough in quantum computing that could transform the tech industry."
Label: Sci/Tech


CATEGORIES:
{categories}

TEXT:
{text}

Please assign the correct category to the text. Answer with the correct category and nothing else.

Category:
"""

# Third System Prompt =  Chain-of-Thought with Guided Reasoning
SYSTEM_PROMPT_3 = """You are an expert news classifier. For each article, follow this diagnostic reasoning process:

1. **Determine the Focus:** Identify the primary focus of the article. Consider if the content primarily addresses political/diplomatic issues, sports events, business/market developments, or technological/scientific advancements.
2. **List Supporting Evidence:** Extract and list at least two pieces of evidence (keywords or phrases) from the article that support your interpretation of its focus.
3. **Select the Appropriate Label:** Match the evidence to one of the following labels:
   - "World" for articles on global or international topics.
   - "Sports" for articles on athletic events or sports.
   - "Business" for articles on financial news and economic issues.
   - "Sci/Tech" for articles on technology or scientific discoveries.
4. **State Your Final Decision:** Provide category.

EXAMPLES:
Example 1:
News article: "In a historic summit, world leaders convened to address climate change."
- Focus Determination: The article covers an international event.
- Supporting Evidence: "historic summit," "world leaders," "climate change."
- Label Selection: World.
- Final Decision: World, because the evidence strongly points to global issues.
Category: World

Example 2:
News article: "A breakthrough in AI technology promises to reshape the tech industry."
- Focus Determination: The article centers on technological innovation.
- Supporting Evidence: "breakthrough in AI," "reshape the tech industry."
- Label Selection: Sci/Tech.
- Final Decision: Sci/Tech, as the key evidence revolves around technological advancement.
Category: Sci/Tech

Now, please classify the following news article by applying the diagnostic reasoning steps above.

CATEGORIES:
{categories}

TEXT:
{text}

Please assign the correct category to the text. Answer with the correct category and nothing else.

Category:
"""
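
The few-shot block in `SYSTEM_PROMPT_2` is written out by hand. A minimal sketch of assembling the same kind of layout from a list of (article, label) pairs, which makes it easier to swap examples in and out. `build_few_shot_prompt` and `FEW_SHOT_EXAMPLES` are hypothetical names used for illustration only; they are not part of the guide notebook.

```python
# Hypothetical helper: assemble a few-shot classification prompt from examples.
FEW_SHOT_EXAMPLES = [
    ("Global leaders convened at an international summit to discuss climate change policies.", "World"),
    ("The local team clinched the championship title after a nail-biting final match.", "Sports"),
]


def build_few_shot_prompt(examples, categories, text):
    """Return a prompt with numbered examples, the category list and the text to classify."""
    parts = ["EXAMPLES:"]
    for i, (article, label) in enumerate(examples, start=1):
        parts.append(f'Example {i}:\nNews article: "{article}"\nLabel: {label}\n')
    parts.append("CATEGORIES:\n" + "\n".join(categories) + "\n")
    parts.append("TEXT:\n" + text + "\n")
    parts.append("Answer with the correct category and nothing else.\n")
    parts.append("Category:")
    return "\n".join(parts)


prompt = build_few_shot_prompt(
    FEW_SHOT_EXAMPLES,
    ["Business", "Sci/Tech", "Sports", "World"],
    "Oil prices climbed after the central bank's announcement.",
)
print(prompt)
```

Because the article text is passed in as an argument rather than interpolated into a template with `str.format`, curly braces inside a news story cannot break the prompt.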

Generate predictions

In [53]:
# Ensure CATEGORIES is defined
CATEGORIES = 'Business\nSci/Tech\nSports\nWorld'

# Store predictions for each system prompt
predictions_1 = []
predictions_2 = []
predictions_3 = []

for text in tqdm(test_df["text"]):

    # Format and generate response for SYSTEM_PROMPT_1
    prompt_1 = SYSTEM_PROMPT_1.format(categories=CATEGORIES, text=text)
    response_1 = model.generate(prompt_1)
    prediction_1 = response_1["results"][0]["generated_text"].strip()
    predictions_1.append(prediction_1)

    # Format and generate response for SYSTEM_PROMPT_2
    prompt_2 = SYSTEM_PROMPT_2.format(categories=CATEGORIES, text=text)
    response_2 = model.generate(prompt_2)
    prediction_2 = response_2["results"][0]["generated_text"].strip()
    predictions_2.append(prediction_2)

    # Format and generate response for SYSTEM_PROMPT_3
    prompt_3 = SYSTEM_PROMPT_3.format(categories=CATEGORIES, text=text)
    response_3 = model.generate(prompt_3)
    prediction_3 = response_3["results"][0]["generated_text"].strip()
    predictions_3.append(prediction_3)

100%|██████████| 760/760 [10:53<00:00,  1.16it/s]
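
The raw `generated_text` is used directly as the prediction, so any answer outside the four labels (different casing, a trailing period, or an unexpected word) becomes its own class in the classification report. A minimal sketch of mapping raw outputs onto the known labels, with a fallback for anything unrecognised. `normalize_prediction` and the `"Unknown"` fallback are assumptions for illustration, not part of the guide notebook.

```python
VALID_LABELS = ["Business", "Sci/Tech", "Sports", "World"]


def normalize_prediction(raw, valid_labels=VALID_LABELS, fallback="Unknown"):
    """Map a raw model output to one of the valid labels, or to `fallback`."""
    # Drop surrounding whitespace, then stray periods and quotes, and ignore case
    cleaned = raw.strip().strip('."\'').lower()
    for label in valid_labels:
        if cleaned == label.lower():
            return label
    # Accept answers that merely start with a label, e.g. "World news"
    for label in valid_labels:
        if cleaned.startswith(label.lower()):
            return label
    return fallback


raw_outputs = ["Sports", " business ", "sci/tech.", "Letters", "World news"]
print([normalize_prediction(r) for r in raw_outputs])
# ['Sports', 'Business', 'Sci/Tech', 'Unknown', 'World']
```

Applying such a step before `classification_report` keeps the set of predicted classes fixed, at the cost of lumping all unparseable answers into one bucket.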


### Performance

In [54]:
# Evaluate performance for SYSTEM_PROMPT_1
print("Classification Report for SYSTEM_PROMPT_1:")
print(classification_report(test_df.label, predictions_1))

# Evaluate performance for SYSTEM_PROMPT_2
print("\nClassification Report for SYSTEM_PROMPT_2:")
print(classification_report(test_df.label, predictions_2))

# Evaluate performance for SYSTEM_PROMPT_3
print("\nClassification Report for SYSTEM_PROMPT_3:")
print(classification_report(test_df.label, predictions_3))

Classification Report for SYSTEM_PROMPT_1:
              precision    recall  f1-score   support

    Business       0.60      0.88      0.71       190
    Sci/Tech       0.84      0.58      0.69       190
      Sports       0.98      0.92      0.95       190
       World       0.84      0.76      0.80       190

    accuracy                           0.78       760
   macro avg       0.81      0.78      0.79       760
weighted avg       0.81      0.78      0.79       760


Classification Report for SYSTEM_PROMPT_2:
              precision    recall  f1-score   support

    Business       0.47      0.97      0.63       190
     Letters       0.00      0.00      0.00         0
    Sci/Tech       0.81      0.36      0.50       190
      Sports       0.94      0.89      0.92       190
       World       0.85      0.46      0.60       190

    accuracy                           0.67       760
   macro avg       0.61      0.54      0.53       760
weighted avg       0.77      0.67      0.66 



Overall, SYSTEM_PROMPT_1 performed best (accuracy 0.78, against 0.67 for SYSTEM_PROMPT_2), while SYSTEM_PROMPT_3 had the weakest results due to poor recall in multiple categories. It seems that the additional instructions have hindered accuracy.
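
The macro averages in the SYSTEM_PROMPT_2 report are taken over five rows, because the model answered `Letters` at least once and that row scores zero on every metric. A small worked check using the per-class values printed in that report; the four-class figures are computed here only for comparison and do not appear in the original output.

```python
# Per-class (precision, recall, f1) copied from the SYSTEM_PROMPT_2 report
report_2 = {
    "Business": (0.47, 0.97, 0.63),
    "Letters":  (0.00, 0.00, 0.00),  # predicted by the model, never a true label
    "Sci/Tech": (0.81, 0.36, 0.50),
    "Sports":   (0.94, 0.89, 0.92),
    "World":    (0.85, 0.46, 0.60),
}


def macro_average(rows):
    """Unweighted mean of precision, recall and f1 over the given per-class rows."""
    n = len(rows)
    return tuple(round(sum(r[i] for r in rows) / n, 2) for i in range(3))


with_spurious = macro_average(list(report_2.values()))
real_only = macro_average([v for k, v in report_2.items() if k != "Letters"])

print("macro avg over all 5 rows:    ", with_spurious)
print("macro avg over 4 real classes:", real_only)
```

The five-row result reproduces the `macro avg` line of the report (0.61, 0.54, 0.53), which shows that part of the gap to SYSTEM_PROMPT_1 on the macro metrics comes from the extra zero row rather than from the four real classes alone.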

### Confusion matrix

In [56]:
# Generate confusion matrix for SYSTEM_PROMPT_1
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score
import pandas as pd

# confusion_matrix sorts string labels alphabetically unless `labels` is given,
# so fix the order explicitly and reuse it for the row/column headers
labels = ['Business', 'Sci/Tech', 'Sports', 'World']

# Compute the confusion matrix for SYSTEM_PROMPT_1 (rows = true label, columns = predicted label)
conf_matrix_1 = confusion_matrix(test_df["label"], predictions_1, labels=labels)

# Convert the confusion matrix to a DataFrame with descriptive headers
conf_matrix_df_1 = pd.DataFrame(conf_matrix_1, index=labels, columns=labels)

# Calculate overall test accuracy for SYSTEM_PROMPT_1
overall_accuracy_1 = accuracy_score(test_df["label"], predictions_1)

# Calculate per-class precision and recall, in the same label order
per_class_precision_1 = precision_score(test_df["label"], predictions_1, labels=labels, average=None, zero_division=0)
per_class_recall_1 = recall_score(test_df["label"], predictions_1, labels=labels, average=None, zero_division=0)
# Diagonal divided by row total: share of each true class predicted correctly (identical to recall)
per_class_accuracy_1 = conf_matrix_1.diagonal() / conf_matrix_1.sum(axis=1)

# Display the confusion matrix
print("Confusion Matrix for SYSTEM_PROMPT_1:")
print(conf_matrix_df_1)

# Display overall accuracy
print(f"\nOverall Test Accuracy for SYSTEM_PROMPT_1: {overall_accuracy_1:.2f}")

# Display per-class metrics
print("\nPer-Class Metrics for SYSTEM_PROMPT_1:")
for label, precision, recall, accuracy in zip(labels, per_class_precision_1, per_class_recall_1, per_class_accuracy_1):
    print(f"{label}: Precision: {precision:.2f}, Recall: {recall:.2f}, Accuracy: {accuracy:.2f}")

Confusion Matrix for SYSTEM_PROMPT_1:
          Business  Sci/Tech  Sports  World
Business       168         9       0     13
Sci/Tech        70       110       0     10
Sports          11         1     174      4
World           31        11       4    144

Overall Test Accuracy for SYSTEM_PROMPT_1: 0.78

Per-Class Metrics for SYSTEM_PROMPT_1:
Business: Precision: 0.60, Recall: 0.88, Accuracy: 0.88
Sci/Tech: Precision: 0.84, Recall: 0.58, Accuracy: 0.58
Sports: Precision: 0.98, Recall: 0.92, Accuracy: 0.92
World: Precision: 0.84, Recall: 0.76, Accuracy: 0.76


Basic observations for LLM:

Rows are true labels and columns are predicted labels. The per-class "accuracy" is the diagonal divided by the row total, so it is the same number as recall.

1. Business:
"Business" has high recall (0.88) but relatively low precision (0.60): a considerable number of articles from other categories are misclassified as "Business". In particular, 70 "Sci/Tech" and 31 "World" articles were predicted as "Business", which suggests that the model over-assigns this label when an article has a commercial or economic angle.

2. Sci/Tech:
"Sci/Tech" shows high precision (0.84), meaning that when the model predicts "Sci/Tech" it is usually correct. However, recall is low (0.58): 70 of the 190 actual "Sci/Tech" articles are classified as "Business". The model may be conflating technology coverage with business news, for example stories about technology companies.

3. Sports:
"Sports" is the strongest category, with precision 0.98 and recall 0.92. This suggests that the vocabulary of sports news is highly distinctive, so the model separates it from the other categories with minimal confusion.

4. World:
"World" is fairly balanced, with precision 0.84 and recall 0.76. The model still confuses some "World" articles with "Business" (31 instances) and "Sci/Tech" (11 instances), so there remains some overlap with other categories, particularly when international news has an economic dimension.
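
The per-class figures can be re-derived from the confusion matrix counts. A minimal sketch in plain Python, assuming the rows and columns follow the alphabetical label order that scikit-learn's `confusion_matrix` uses for string labels when no `labels` argument is passed (Business, Sci/Tech, Sports, World). Under that ordering the values agree with the classification report for SYSTEM_PROMPT_1. `per_class_metrics` is a hypothetical helper name.

```python
labels = ["Business", "Sci/Tech", "Sports", "World"]  # alphabetical order

# Rows = true label, columns = predicted label (counts from the printed matrix)
matrix = [
    [168,   9,   0,  13],
    [ 70, 110,   0,  10],
    [ 11,   1, 174,   4],
    [ 31,  11,   4, 144],
]


def per_class_metrics(matrix, labels):
    """Return {label: (precision, recall)} rounded to two decimals."""
    metrics = {}
    for i, label in enumerate(labels):
        true_pos = matrix[i][i]
        predicted_total = sum(row[i] for row in matrix)  # column sum
        actual_total = sum(matrix[i])                    # row sum
        metrics[label] = (
            round(true_pos / predicted_total, 2),
            round(true_pos / actual_total, 2),
        )
    return metrics


metrics = per_class_metrics(matrix, labels)
accuracy = sum(matrix[i][i] for i in range(len(labels))) / sum(map(sum, matrix))

for label, (precision, recall) in metrics.items():
    print(f"{label}: precision={precision:.2f}, recall={recall:.2f}")
print(f"accuracy={accuracy:.2f}")
```

Precision divides the diagonal by the column total (everything predicted as that class), while recall divides it by the row total (everything that truly belongs to that class), which is why a class can score high on one and low on the other.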

# Final remarks

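* The BoW model noticeably struggled to distinguish between similar categories, particularly confusing "World" with "Sports" and "Sci/Tech", but it still has the highest accuracy of all the models.
* Pre-trained LLMs can match BERT in performance if prompted effectively, reducing the need for computationally expensive fine-tuning.
* LLM prompt 1 outperformed the others, likely because its instructions were the shortest and clearest. The few-shot prompt produced at least one answer outside the label set ("Letters"), and the chain-of-thought prompt asks for step-by-step reasoning that cannot fit within `max_new_tokens=10` and conflicts with the closing instruction to answer with the category and nothing else.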