**AI & Machine Learning (KAN-CINTO4003U) - Copenhagen Business School | Spring 2025**

***


# Part III: LLM

Please see the description of the assignment in the README file (section 3) <br>
**Guide notebook**: [guides/llm_guide.ipynb](guides/llm_guide.ipynb)


***

<br>

* Note that you should report results using a classification report. 

* Also, remember to include some reflections on your results: how do they compare with the results from Part I (BoW) and Part II (BERT)? Are there any hyperparameters or prompting techniques that are particularly important?

* You should follow the steps given in the `llm_guide` notebook

<br>


***

In [1]:
# imports for the project

import pandas as pd
from decouple import config
from ibm_watsonx_ai import APIClient
from ibm_watsonx_ai import Credentials
from ibm_watsonx_ai.foundation_models import ModelInference


In [3]:
# Load the environment variables using python-decouple
# The .env file should be in the root of the project
# The .env file should NOT be committed to the repository

WX_API_KEY = config('WX_API_KEY')

In [4]:
import os

print("Current Working Directory:", os.getcwd())
print("Files in Directory:", os.listdir())  # Check if .env is listed here


Current Working Directory: /Users/sarahaagaard/Documents/sarah/HA IT/4. år/8. semester/Artificial Intelligence and Machine Learning/AIML25/mas/ma2
Files in Directory: ['assignments', 'LICENSE', 'environment.yml', 'README.md', '.gitignore', '.env', 'guides', 'test.hf', '.env.example', '.git', '.vscode', 'data']


In [5]:
# Check whether the API key is also visible as a process environment variable
api_key = os.getenv("WX_API_KEY")

if api_key:
    print("✅ API Key Loaded Successfully!")
else:
    print("❌ Error: API Key not found. Check your .env file.")

✅ API Key Loaded Successfully!


### 1. Connecting to the WatsonX.ai Credentials API

In [6]:
credentials = Credentials(
    url = "https://us-south.ml.cloud.ibm.com", # set the URL to dallas
    api_key = WX_API_KEY
)

client = APIClient(
    credentials=credentials, 
    project_id="e2fff713-c98f-4236-8836-47cbe38cd27d" # insert my unique project ID
)

### 2. Testing the connection

In [7]:

model = ModelInference(
    api_client=client,
    model_id="ibm/granite-13b-instruct-v2",
)

In [8]:
from ibm_watsonx_ai.foundation_models.schema import TextGenParameters

TextGenParameters.show()

+-----------------------+----------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------+
| PARAMETER             | TYPE                                   | EXAMPLE VALUE                                                                                                                             |
| decoding_method       | str, TextGenDecodingMethod, NoneType   | sample                                                                                                                                    |
+-----------------------+----------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------+
| length_penalty        | dict, TextGenLengthPenalty, NoneType   | {'decay_factor': 2.5, 'start_index': 5}                                                                  

In [9]:
# pandas is already imported in the first cell
from sklearn.metrics import classification_report
from tqdm import tqdm

### 3. Load the data

In [10]:

splits = {'train': 'data/train-00000-of-00001.parquet', 'test': 'data/test-00000-of-00001.parquet'}
# train = pd.read_parquet("hf://datasets/fancyzhx/ag_news/" + splits["train"])
test = pd.read_parquet("hf://datasets/fancyzhx/ag_news/" + splits["test"])

In [11]:
label_map = {
    0: 'World',
    1: 'Sports',
    2: 'Business',
    3: 'Sci/Tech'
}

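def preprocess(df: pd.DataFrame, frac=1e-2, label_map=label_map, seed=42) -> pd.DataFrame:
    """Map numeric labels to names and draw a stratified sample of `frac` per class."""
    return (
        df
        .assign(label=lambda x: x['label'].map(label_map))   # numeric label -> category name
        [lambda df: df['label'].isin(label_map.values())]    # keep only rows with a known label
        .groupby('label')
        .apply(lambda x: x.sample(frac=frac, random_state=seed))  # same fraction from every class
        .reset_index(drop=True)
    )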

# train_df = preprocess(train, frac=0.01)
test_df = preprocess(test, frac=0.1)

# clear up some memory by deleting the original dataframes
# del train
del test

test_df.shape, # train_df.shape, 

((760, 2),)

In [18]:
test_df.label.value_counts()

label
Business    190
Sci/Tech    190
Sports      190
World       190
Name: count, dtype: int64

### 5. Set model parameters

In [12]:
PARAMS = TextGenParameters(
    temperature=0,              # Higher temperature means more randomness - In this case we don't want randomness
    max_new_tokens=10,          # Maximum number of tokens to generate
    stop_sequences=[".", "\n"], # Stop generating text when these sequences are encountered
)

model = ModelInference(
    api_client=client,
    model_id="ibm/granite-13b-instruct-v2",  # We could also try a larger model!
    params=PARAMS
)

### 4. Create a system prompt

In [38]:
SYSTEM_PROMPT = """
You are an advanced news categorization model. Your task is to classify a given news article into one of four categories: 

CATEGORIES:
- World: Covers international news, politics, diplomacy, conflicts, and global events.
- Sports: Includes news about athletic competitions, teams, players, and sporting events.
- Business: Encompasses financial markets, companies, economic trends, and corporate news.
- Sci/Tech: Relates to scientific discoveries, technological advancements, and industry innovations.

To classify the article correctly, follow this step-by-step reasoning:

1. Identify the main subject of the article. Is it discussing a country's affairs, an economic trend, a sports event, or a scientific/technological breakthrough?
2. Look for keywords and context clues. Does it mention political leaders, stock markets, athletes, or emerging technologies?
3. Determine the dominant theme. If the article covers multiple aspects, prioritize the primary focus.
4. Assign the most appropriate category based on the above reasoning.

TEXT:
{text}

Choose *ONLY* World, Sports, Business OR Sci/Tech! Based on the analysis, the correct category is:

Category:
"""


In [None]:
# chain of thought:
# Think step by step about this. Show your reasoning. Then, on a new line, choose *ONLY* World, Sports, Business OR Sci/Tech!

Changes to the prompt:
- Framed the task as a role the model takes on (a news categorization model), which might help it focus on the task and make the instructions clearer.
- Added a detailed description of each category, which gives the model clearer distinctions between the four categories.
- Introduced a step-by-step reasoning process.
- Instructed the model to output only one of the four categories.
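
The chain-of-thought variant noted above would make the model write its reasoning before the answer, so the final label has to be pulled out of a longer response. Below is a minimal sketch of that extraction step. The helper name `extract_final_answer` and the example response are hypothetical and not part of the guide notebook. It also assumes the generation settings are changed to allow a longer output, since `max_new_tokens=10` and a `"\n"` stop sequence would cut the reasoning short.

```python
def extract_final_answer(generated_text: str) -> str:
    """Return the last non-empty line of a response, without a leading 'Category:'."""
    lines = [line.strip() for line in generated_text.splitlines() if line.strip()]
    if not lines:
        return ""  # the model returned nothing usable
    answer = lines[-1]
    prefix = "category:"
    if answer.lower().startswith(prefix):
        answer = answer[len(prefix):].strip()
    return answer


# Made-up response text, for illustration only
example = "The article discusses quarterly earnings.\nCategory: Business\n"
print(extract_final_answer(example))
```

Taking the last non-empty line is a simple convention. It only works if the prompt explicitly asks for the category on its own final line.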

### 5. Generate predictions

In [39]:
# String listing all categories (not used by the current prompt, which hard-codes them)
CATEGORIES = "- " + "\n- ".join(test_df["label"].unique())

predictions = []

for text in tqdm(test_df["text"]):

    # insert the article text into the prompt template
    prompt = SYSTEM_PROMPT.format(text=text)
    
    # generate the response from the model
    response = model.generate(prompt)

    # extract the generated text from the response
    prediction = response["results"][0]["generated_text"].strip()

    # with chain-of-thought prompting, only the last line would be kept:
    #     prediction = prediction.split("\n")[-1]

    # append the prediction to the list of predictions
    predictions.append(prediction)

100%|██████████| 760/760 [04:58<00:00,  2.55it/s]
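
The raw predictions are compared to the labels as exact strings, so a response such as `business` or `Sci-Tech.` would count as an error even though the model chose the right class. A minimal sketch of a normalization step follows. The helper `normalize_label` and the `"Unknown"` fallback are assumptions for illustration, not part of the guide notebook.

```python
import re

VALID_LABELS = ["World", "Sports", "Business", "Sci/Tech"]


def _key(s: str) -> str:
    """Lowercase and drop everything except letters, so 'Sci-Tech.' -> 'scitech'."""
    return re.sub(r"[^a-z]", "", s.lower())


# Lookup from simplified key to the canonical label
_LOOKUP = {_key(label): label for label in VALID_LABELS}


def normalize_label(raw: str, fallback: str = "Unknown") -> str:
    """Map a raw model output to one of the four labels, or to the fallback."""
    return _LOOKUP.get(_key(raw), fallback)


# Made-up raw outputs, for illustration only
for raw in [" business", "Sci-Tech.", "Politics"]:
    print(repr(raw), "->", normalize_label(raw))
```

It could be applied as `predictions = [normalize_label(p) for p in predictions]` before evaluation. Counting the `"Unknown"` values then shows how often the model ignored the output instruction.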


### 6. Evaluate performance

In [40]:
print(classification_report(test_df.label, predictions))

              precision    recall  f1-score   support

    Business       0.47      0.95      0.63       190
    Sci/Tech       0.92      0.36      0.52       190
      Sports       0.96      0.69      0.80       190
       World       0.84      0.73      0.78       190

    accuracy                           0.68       760
   macro avg       0.80      0.68      0.68       760
weighted avg       0.80      0.68      0.68       760
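
A class with low precision and high recall is being predicted too often, while a class with high precision and low recall is being predicted too rarely. Counting (true, predicted) pairs shows where the extra predictions come from. The sketch below uses only the standard library and made-up labels. The helper names `confusion_counts` and `precision_recall` are hypothetical.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count how often each (true label, predicted label) pair occurs."""
    return Counter(zip(y_true, y_pred))


def precision_recall(y_true, y_pred, label):
    """Precision and recall for a single label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    predicted = sum(1 for p in y_pred if p == label)  # how often the label was predicted
    actual = sum(1 for t in y_true if t == label)     # how often the label is correct
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall


# Made-up labels, for illustration only
y_true = ["Business", "Business", "Sci/Tech", "Sci/Tech", "Sci/Tech", "Sports"]
y_pred = ["Business", "Business", "Business", "Business", "Sci/Tech", "Sports"]

print(confusion_counts(y_true, y_pred))
print("Business:", precision_recall(y_true, y_pred, "Business"))
print("Sci/Tech:", precision_recall(y_true, y_pred, "Sci/Tech"))
```

On the notebook's data, the same call would be `confusion_counts(test_df.label, predictions)`.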



### 7. Model evaluation and comparison

BoW Model:
- Training Performance: 100% accuracy
- Test Performance: 83% accuracy

BERT Model:
- Training Performance: 99% accuracy
- Test Performance: 79% accuracy

LLM (Large Language Model) Performance:
- Accuracy: 68%

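The LLM had the lowest overall accuracy (68%) of the three models. However, it outperformed BoW on recall for the Business category.
The LLM achieved high precision for Sports (0.96) but struggled with other categories, especially Sci/Tech, which received an f1-score of 0.52. Business has a recall of 0.95 but a precision of only 0.47, while Sci/Tech has a recall of 0.36, which suggests that the model over-predicts Business and likely absorbs many Sci/Tech articles into it.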

The LLM prompt plays a significant role in performance. A well-designed prompt with clear categories and instructions leads to more accurate predictions. In my case, the prompt might be too
Using examples in the prompt, rather than just listing categories, could help the model understand the types of articles that belong to each category.
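
The few-shot idea could be sketched as follows. The function `build_few_shot_prompt` and the example headlines are hypothetical: the headlines are invented for illustration, and real examples would ideally be drawn from the training split rather than the test set.

```python
# Invented example headlines, one per category (illustration only)
FEW_SHOT_EXAMPLES = [
    ("Central bank raises interest rates to curb inflation.", "Business"),
    ("Striker scores twice as home side wins cup final.", "Sports"),
    ("Researchers unveil faster chip for mobile devices.", "Sci/Tech"),
    ("Leaders meet for peace talks after border clashes.", "World"),
]


def build_few_shot_prompt(text: str, examples=FEW_SHOT_EXAMPLES) -> str:
    """Build a prompt with labelled examples followed by the article to classify."""
    header = "Classify the news article as World, Sports, Business or Sci/Tech.\n\n"
    shots = "".join(
        f"TEXT:\n{example}\nCategory: {label}\n\n" for example, label in examples
    )
    # The prompt ends with an open 'Category:' for the model to complete
    return f"{header}{shots}TEXT:\n{text}\nCategory:"


print(build_few_shot_prompt("Oil prices fall as supply concerns ease."))
```

Because every example ends with `Category: <label>`, the model sees the expected output format several times before it has to produce its own answer.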

In conclusion, the LLM performs worse than both the BoW and BERT models.