**AI & Machine Learning (KAN-CINTO4003U) - Copenhagen Business School | Spring 2025**

***


# Part III: LLM

Please see the description of the assignment in the README file (section 3) <br>
**Guide notebook**: [guides/llm_guide.ipynb](guides/llm_guide.ipynb)


***

<br>

* Note that you should report results using a classification report.

* Also, remember to include some reflections on your results: how do they compare with the results from Part I (BoW) and Part II (BERT)? Are there any hyperparameters or prompting techniques that are particularly important?

* You should follow the steps given in the `llm_guide` notebook.

<br>


***

## Imports


In [138]:
# imports for the project

import pandas as pd
from sklearn.metrics import classification_report 
from tqdm import tqdm
from decouple import Config
from decouple import RepositoryEnv
from ibm_watsonx_ai import APIClient
from ibm_watsonx_ai import Credentials
from ibm_watsonx_ai.foundation_models import ModelInference
from ibm_watsonx_ai.foundation_models.schema import TextGenParameters

## Load API key

In [139]:

# Create a Config instance with the path to the .env file
config = Config(RepositoryEnv('.env'))

# Load the API key stored under the name KEY in the .env file
wx_api_key = config('KEY')
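
If the key is missing, the lookup above raises an error that may not say much about how to fix it. The following is a minimal sketch of a more explicit check, using only the standard library and assuming the key is also available as an environment variable. The helper name `require_secret` is hypothetical and not part of `decouple`.

```python
import os


def require_secret(name: str, env=None) -> str:
    """Return the value stored under `name`, or raise a clear error.

    `env` defaults to os.environ; passing a plain dict makes the helper
    easy to test. `require_secret` is a hypothetical helper name.
    """
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Missing secret {name!r}: add it to your .env file or export it "
            "in your shell before running the notebook."
        )
    return value
```

With this in place, `require_secret("KEY")` would either return the key or stop the notebook early with an actionable message.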



## Connecting to the WatsonX.ai Credentials API

In [140]:
credentials = Credentials(
    url = "https://us-south.ml.cloud.ibm.com",
    api_key = wx_api_key
)

client = APIClient(
    credentials=credentials, 
    project_id="52ba1404-06c0-47d5-87c2-df078023a34e"
)

## Test connection

In [141]:

model = ModelInference(
    api_client=client,
    model_id="ibm/granite-8b-code-instruct",
)

### 1. Load the data

We can load this data directly from the Hugging Face Hub (see [Hugging Face Datasets](https://huggingface.co/docs/datasets/)) into a pandas DataFrame. Pretty neat!

**Note**: This cell will download the dataset and keep it in memory. If you run this cell multiple times, it will download the dataset multiple times.

You are welcome to increase the `frac` parameter to keep a larger sample of the data.

In [142]:

splits = {'train': 'data/train-00000-of-00001.parquet', 'test': 'data/test-00000-of-00001.parquet'}
# train = pd.read_parquet("hf://datasets/fancyzhx/ag_news/" + splits["train"])
test = pd.read_parquet("hf://datasets/fancyzhx/ag_news/" + splits["test"])

In [143]:
label_map = {
    0: 'World',
    1: 'Sports',
    2: 'Business',
    3: 'Sci/Tech'
}

def preprocess(df: pd.DataFrame, frac=1e-2, label_map=label_map, seed=42) -> pd.DataFrame:
    # Map integer labels to names, then draw a sample of `frac` rows from each label
    return (
        df
        .assign(label=lambda x: x['label'].map(label_map))
        [lambda x: x['label'].isin(label_map.values())]
        .groupby('label')
        .apply(lambda x: x.sample(frac=frac, random_state=seed))
        .reset_index(drop=True)
    )

# train_df = preprocess(train, frac=0.01)
test_df = preprocess(test, frac=0.1)

# clear up some memory by deleting the original dataframes
# del train
del test

test_df.shape  # train_df.shape

test_df.head()

| | text | label |
|---|---|---|
| 0 | Ford: Monthly Sales Drop, Company Looks To New... | Business |
| 1 | United #39;s pension dilemma United Airlines s... | Business |
| 2 | Comcast part of group wanting to buy MGM A con... | Business |
| 3 | Treasuries Tussle with Profit-Takers NEW YORK... | Business |
| 4 | Lloyds TSB to Move More Than 1,000 UK Jobs to ... | Business |
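
The `preprocess` function draws the same fraction from every label, so the sample stays balanced. Below is a minimal standard-library sketch of that idea on toy data. It assumes 1,900 rows per label, which is the class balance of the AG News test split. The function name `stratified_sample` is hypothetical, and the code is an illustration rather than a drop-in replacement for the pandas version.

```python
import random
from collections import defaultdict


def stratified_sample(rows, frac, seed=42):
    """Sample `frac` of the rows from each label group.

    `rows` is a list of (text, label) tuples. Each group uses a fresh RNG
    seeded with `seed`, similar to passing random_state=seed per group.
    """
    groups = defaultdict(list)
    for text, label in rows:
        groups[label].append((text, label))

    sample = []
    for label in sorted(groups):  # sorted label order, like groupby's default
        group = groups[label]
        k = round(len(group) * frac)
        sample.extend(random.Random(seed).sample(group, k))
    return sample


# Toy data: 1,900 placeholder stories for each of the four labels.
labels = ["World", "Sports", "Business", "Sci/Tech"]
rows = [(f"story {i}", label) for label in labels for i in range(1900)]

sample = stratified_sample(rows, frac=0.1)
print(len(sample))  # 760
```

With `frac=0.1` each label contributes 190 rows, which matches the `support` column in the classification report further down.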


## Parameters

In [144]:
PARAMS = TextGenParameters(
    temperature=0,              
    max_new_tokens=10,          
    stop_sequences=[".", "\n"], # Stop generating text when these sequences are encountered
)

model = ModelInference(
    api_client=client,
    model_id="ibm/granite-8b-code-instruct",  # Model used for this run; the FLAN-T5 models scored highest (see "Models tried" below)
    params=PARAMS
)

## Prompt

In [145]:
SYSTEM_PROMPT = """Your task is to classify news stories into one of four categories. Try to avoid assigning a category that is too specific. For example, if the text is about a specific sports team, choose the Sports category instead of the team's name.
Try cutting out stop words and other unnecessary information to make the text easier to classify.
Try to take word roots into consideration. For example, if the text mentions "running", it is likely about sports.

CATEGORIES:
{categories}

TEXT:
{text}

Please assign the correct category to the text. Answer with the correct category and nothing else.

Category:
"""

## Predictions
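
Before sending every story to the model, it can help to render the prompt for a single example and check what the model will actually receive. The sketch below uses a shortened stand-in template (`TEMPLATE` is hypothetical, and only the `{categories}` and `{text}` placeholders match the real prompt) together with a made-up label list.

```python
# Shortened stand-in for the real prompt; only the placeholders are the same.
TEMPLATE = """Your task is to classify news stories into one of four categories.

CATEGORIES:
{categories}

TEXT:
{text}

Category:
"""

labels = ["Business", "Sci/Tech", "Sports", "World"]
categories = "- " + "\n- ".join(labels)  # one bullet per category

prompt = TEMPLATE.format(categories=categories, text="Ford: Monthly Sales Drop")
print(prompt)
```

Printing one rendered prompt makes it easy to spot formatting problems, such as a missing category or a placeholder that was never filled in.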

In [146]:
CATEGORIES = "- " + "\n- ".join(test_df["label"].unique())  # Create a string with all categories


predictions = []

for text in tqdm(test_df["text"]):

    # format the prompt with the categories and the text
    prompt = SYSTEM_PROMPT.format(categories=CATEGORIES, text=text)
    
    # generate the response from the model
    response = model.generate(prompt)

    # extract the generated text from the response
    prediction = response["results"][0]["generated_text"].strip()

    # append the prediction to the list of predictions
    predictions.append(prediction)

100%|██████████| 760/760 [03:25<00:00,  3.70it/s]
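
The generated text is not guaranteed to match one of the four labels exactly: it may differ in casing, carry trailing punctuation, or be something else entirely. A minimal sketch of a normalization step is shown below. The function `normalize_prediction` and the `"Unknown"` fallback label are assumptions for illustration and are not part of the guide notebook.

```python
VALID_LABELS = ["World", "Sports", "Business", "Sci/Tech"]


def normalize_prediction(raw: str, labels=VALID_LABELS, fallback="Unknown") -> str:
    """Map raw generated text onto one of the known labels.

    Strips whitespace, list markers and trailing punctuation, then compares
    case-insensitively. Anything unrecognised becomes `fallback`.
    """
    cleaned = raw.strip().lstrip("-* ").rstrip(".:,;!").strip().lower()
    # Exact (case-insensitive) match first.
    for label in labels:
        if cleaned == label.lower():
            return label
    # Then a prefix match, e.g. "sports news" -> "Sports".
    for label in labels:
        if cleaned.startswith(label.lower()):
            return label
    return fallback


print(normalize_prediction(" business."))
print(normalize_prediction("Politics"))
```

One way to use it would be `predictions = [normalize_prediction(p) for p in predictions]` before building the report. Any `"Unknown"` predictions would then show up as their own row, which makes off-label answers visible instead of hiding them.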


In [147]:
test_df["prediction"] = predictions
test_df

| | text | label | prediction |
|---|---|---|---|
| 0 | Ford: Monthly Sales Drop, Company Looks To New... | Business | Business |
| 1 | United #39;s pension dilemma United Airlines s... | Business | Business |
| 2 | Comcast part of group wanting to buy MGM A con... | Business | Business |
| 3 | Treasuries Tussle with Profit-Takers NEW YORK... | Business | Business |
| 4 | Lloyds TSB to Move More Than 1,000 UK Jobs to ... | Business | Business |
| ... | ... | ... | ... |
| 755 | Palestinian Attack Kills Woman in Gaza Settlem... | World | Sports |
| 756 | Mounties left in dark by U.S. on deportation o... | World | Business |
| 757 | Yemeni Poet Says He Is al-Qaida Member GUANTAN... | World | Business |
| 758 | Saudis Take a Small Dose of Democracy For the ... | World | Business |


## Evaluation

In [148]:
print(classification_report(test_df.label, predictions))

              precision    recall  f1-score   support

    Business       0.31      0.99      0.48       190
    Sci/Tech       1.00      0.01      0.02       190
      Sports       0.80      0.65      0.72       190
       World       0.00      0.00      0.00       190

    accuracy                           0.41       760
   macro avg       0.53      0.41      0.30       760
weighted avg       0.53      0.41      0.30       760
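
The report shows a recall of 0.99 but a precision of only 0.31 for Business, and 0.00 for World, which suggests that most stories are being labeled Business. Counting (true label, predicted label) pairs makes this kind of pattern explicit. The sketch below uses toy labels for illustration only, not the notebook's actual predictions, and `confusion_counts` is a hypothetical helper name.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count (true label, predicted label) pairs, most frequent first."""
    return Counter(zip(y_true, y_pred)).most_common()


# Toy labels for illustration only, not the notebook's actual predictions.
y_true = ["World", "World", "World", "Sports", "Business"]
y_pred = ["Business", "Business", "Sports", "Sports", "Business"]

for (true, pred), n in confusion_counts(y_true, y_pred):
    print(f"{true:>8} -> {pred:<8} {n}")
```

In the notebook, the same call could be made as `confusion_counts(test_df["label"], predictions)` to see which labels are most often mistaken for which.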





Models tried: 
- mistralai/mixtral-8x7b-instruct-v01
- ibm/granite-8b-code-instruct
- ibm/granite-13b-instruct-v2 - accuracy: 0.70
- google/flan-t5-xl - accuracy: 0.92
- google/flan-t5-xxl - accuracy: 0.92
