**AI & Machine Learning (KAN-CINTO4003U) - Copenhagen Business School | Spring 2025**

***


# Part III: LLM

Please see the description of the assignment in the README file (section 3) <br>
**Guide notebook**: [guides/llm_guide.ipynb](guides/llm_guide.ipynb)


***

<br>

* Note that you should report results using a classification report. 

* Also, remember to include some reflections on your results: how do they compare with the results from Part I, BoW?, and part II, BERT? Are there any hyperparameters or prompting techniques that are particularly important?

* You should follow the steps given in the `llm_guide` notebook

<br>


***

In [59]:
# imports for the project
# https://cloud.ibm.com/login
# Fr33R011R011!
# k2 eeFJTYLr8emEqfbb9lAoN3bedNb7t36JCzgk8PlO688B

import pandas as pd
from decouple import config
from ibm_watsonx_ai import APIClient
from ibm_watsonx_ai import Credentials
from ibm_watsonx_ai.foundation_models import ModelInference
from sklearn.metrics import classification_report 
from tqdm import tqdm

In [60]:
from ibm_watsonx_ai.foundation_models.schema import TextGenParameters


### 1. Load the data

We can load this data directly from [Hugging Face Datasets](https://huggingface.co/docs/datasets/) - The HuggingFace Hub- into a Pandas DataFrame. Pretty neat!

**Note**: This cell will download the dataset and keep it in memory. If you run this cell multiple times, it will download the dataset multiple times.

You are welcome to increase the `frac` parameter to load more data.

In [61]:

splits = {'train': 'data/train-00000-of-00001.parquet', 'test': 'data/test-00000-of-00001.parquet'}
# train = pd.read_parquet("hf://datasets/fancyzhx/ag_news/" + splits["train"])
test = pd.read_parquet("hf://datasets/fancyzhx/ag_news/" + splits["test"])

In [62]:
label_map = {
    0: 'World',
    1: 'Sports',
    2: 'Business',
    3: 'Sci/Tech'
}

def preprocess(df: pd.DataFrame, frac = 1e-2, label_map = label_map, seed=42) -> pd.DataFrame:
    return  (
        df
        .assign(label=lambda x: x['label'].map(label_map))
        [lambda df: df['label'].isin(label_map.values())]
        .groupby('label')
        .apply(lambda x: x.sample(frac=frac, random_state=seed))
        .reset_index(drop=True)

    )

# train_df = preprocess(train, frac=0.01)
test_df = preprocess(test, frac=0.1)

# clear up some memory by deleting the original dataframes
# del train
del test

test_df.shape, # train_df.shape, 

((760, 2),)

In [63]:

WX_API_KEY = config('WX_API_KEY')

credentials = Credentials(
    url = "https://us-south.ml.cloud.ibm.com",
    api_key = WX_API_KEY
)

client = APIClient(
    credentials=credentials, 
    project_id="52854b63-f97f-49d0-9357-b6a80b3e2d48"
)

In [64]:
PARAMS = TextGenParameters(
    temperature=0, 
    max_new_tokens=10, 
)

model = ModelInference(
    api_client=client,
    model_id="mistralai/mistral-large",  # We could also try a larger model!
    params=PARAMS
)

In [65]:
import pandas as pd
from sklearn.metrics import classification_report 
from tqdm import tqdm

In [66]:
SYSTEM_PROMPT = """You task is to classify news stories into one of five categories

CATEGORIES:
{categories}

TEXT:
{text}

Please assign the correct category to the text. Answer with the correct category and nothing else.

Category:
"""

In [None]:
CATEGORIES = "- " + "\n- ".join(test_df["label"].unique())  # Create a string with all categories

predictions = []

for text in tqdm(test_df["text"]):

    # format the prompt with the categories and the text
    prompt = SYSTEM_PROMPT.format(categories=CATEGORIES, text=text)
    
    # generate the response from the model
    response = model.generate(prompt)

    # extract the generated text from the response
    prediction = response["results"][0]["generated_text"].strip().replace(" ", "").replace("-", "")

    # append the prediction to the list of predictions
    predictions.append(prediction)

100%|██████████| 760/760 [04:35<00:00,  2.76it/s]


In [70]:
print(classification_report(test_df.label, predictions))

              precision    recall  f1-score   support

    Business       0.72      0.91      0.80       190
      Health       0.00      0.00      0.00         0
    Sci/Tech       0.91      0.64      0.75       190
      Sports       0.97      0.96      0.97       190
       World       0.87      0.91      0.89       190

    accuracy                           0.86       760
   macro avg       0.70      0.68      0.68       760
weighted avg       0.87      0.86      0.85       760



  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


In [72]:
# Let's try another good classification model too
# llama-3-3-70b-instruct
PARAMS = TextGenParameters(
    temperature=0, 
    max_new_tokens=10, 
    min_new_tokens=1,
    stop_sequences=[".", "\n"], # Stop generating text when these sequences are encountered
)

model = ModelInference(
    api_client=client,
    model_id="ibm/granite-3-8b-instruct",
    params=PARAMS
)

In [73]:
CATEGORIES = ", ".join(test_df["label"].unique())
SYSTEM_PROMPT = """{text}. Please assign the correct category to the text above. Answer with the correct category and nothing else. CATEGORIES available: {categories}. Category of the text above:"""

In [74]:
SYSTEM_PROMPT = """You task is to classify news stories into one of five categories

CATEGORIES:
{categories}

TEXT:
{text}

Please assign the correct category to the text. Answer with the correct category and nothing else.

Category:
"""

In [75]:

predictions = []

for text in tqdm(test_df["text"]):

    # format the prompt with the categories and the text
    prompt = SYSTEM_PROMPT.format(categories=CATEGORIES, text=text)
    
    # generate the response from the model
    response = model.generate(prompt)

    # extract the generated text from the response
    prediction = response["results"][0]["generated_text"].strip().replace(" ", "").replace("-", "")

    # append the prediction to the list of predictions
    predictions.append(prediction)

100%|██████████| 760/760 [02:58<00:00,  4.27it/s]


In [76]:
print(classification_report(test_df.label, predictions))

              precision    recall  f1-score   support

    Business       0.59      0.95      0.73       190
    Politics       0.00      0.00      0.00         0
    Sci/Tech       0.92      0.51      0.66       190
      Sports       0.97      0.91      0.94       190
       World       0.87      0.77      0.82       190

    accuracy                           0.79       760
   macro avg       0.67      0.63      0.63       760
weighted avg       0.84      0.79      0.79       760



  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


## Conclusions
The best model for classification in watson (mistral-large) performs the best on this task.  It has similar score to bow for the category "World", an amazing (and better) score for Sports, similar precision for Sci/tech but worse off recall, and way worse precision on Business but better recall.