**AI & Machine Learning (KAN-CINTO4003U) - Copenhagen Business School | Spring 2025**

***


# Part III: LLM

Please see the description of the assignment in the README file (section 3) <br>
**Guide notebook**: [guides/llm_guide.ipynb](guides/llm_guide.ipynb)


***

<br>

* Note that you should report results using a classification report. 

* Also, remember to include some reflections on your results: how do they compare with the results from Part I (BoW) and Part II (BERT)? Are there any hyperparameters or prompting techniques that are particularly important?

* You should follow the steps given in the `llm_guide` notebook.

<br>


***

In [16]:
# imports for the project

import pandas as pd
from decouple import config
from ibm_watsonx_ai import APIClient
from ibm_watsonx_ai import Credentials
from ibm_watsonx_ai.foundation_models import ModelInference

In [20]:
# Load the environment variables using python-decouple
# The .env file should be in the root of the project
# The .env file should NOT be committed to the repository
from decouple import Config, RepositoryEnv

config = Config(RepositoryEnv("../.env"))  # Adjust path if needed
WX_API_KEY = config('WX_API_KEY')





### 1. Load the data

We can load this data directly from the Hugging Face Hub, via [Hugging Face Datasets](https://huggingface.co/docs/datasets/), into a Pandas DataFrame. Pretty neat!

**Note**: This cell will download the dataset and keep it in memory. If you run this cell multiple times, it will download the dataset multiple times.

You are welcome to increase the `frac` parameter to load more data.

In [3]:

splits = {'train': 'data/train-00000-of-00001.parquet', 'test': 'data/test-00000-of-00001.parquet'}
# train = pd.read_parquet("hf://datasets/fancyzhx/ag_news/" + splits["train"])
test = pd.read_parquet("hf://datasets/fancyzhx/ag_news/" + splits["test"])
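To avoid the repeated downloads mentioned in the note above, the split could be cached on disk after the first load. The following is a minimal sketch, not part of the guide notebook: the helper name `load_cached` and the cache file path are assumptions made for illustration, and the cache uses pickle so that no extra parquet engine is needed.

```python
from pathlib import Path
from typing import Callable

import pandas as pd


def load_cached(cache_path: str, loader: Callable[[], pd.DataFrame]) -> pd.DataFrame:
    """Return the DataFrame stored at cache_path, calling loader() only on a cache miss."""
    path = Path(cache_path)
    if path.exists():
        # Cache hit: read the local copy instead of downloading again
        return pd.read_pickle(path)
    # Cache miss: run the (possibly slow) loader once and store the result
    df = loader()
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_pickle(path)
    return df


# Hypothetical usage in this notebook (the cache file name is an assumption):
# test = load_cached(
#     "../data/ag_news_test.pkl",
#     lambda: pd.read_parquet("hf://datasets/fancyzhx/ag_news/" + splits["test"]),
# )
```

With this in place, re-running the cell reads the local file, and deleting the file forces a fresh download.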

In [4]:
label_map = {
    0: 'World',
    1: 'Sports',
    2: 'Business',
    3: 'Sci/Tech'
}

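def preprocess(df: pd.DataFrame, frac: float = 1e-2, label_map: dict = label_map, seed: int = 42) -> pd.DataFrame:
    """Map numeric labels to names and draw a stratified sample of `frac` rows per label."""
    return (
        df
        .assign(label=lambda x: x['label'].map(label_map))        # numeric id -> category name
        [lambda df: df['label'].isin(label_map.values())]          # drop rows with unmapped labels
        .groupby('label')
        .apply(lambda x: x.sample(frac=frac, random_state=seed))   # same fraction from every class
        .reset_index(drop=True)
    )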

# train_df = preprocess(train, frac=0.01)
test_df = preprocess(test, frac=0.1)

# clear up some memory by deleting the original dataframes
# del train
del test

test_df.shape, # train_df.shape, 

((760, 2),)

In [17]:
from ibm_watsonx_ai.foundation_models.schema import TextGenParameters

TextGenParameters.show()

+-----------------------+----------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------+
| PARAMETER             | TYPE                                   | EXAMPLE VALUE                                                                                                                             |
| decoding_method       | str, TextGenDecodingMethod, NoneType   | sample                                                                                                                                    |
+-----------------------+----------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------+
| length_penalty        | dict, TextGenLengthPenalty, NoneType   | {'decay_factor': 2.5, 'start_index': 5}                                                                  

In [22]:
credentials = Credentials(
    url = "https://us-south.ml.cloud.ibm.com",
    api_key = WX_API_KEY
)

client = APIClient(
    credentials=credentials, 
    project_id="163839a7-17ed-4d45-8690-531423735ab8"
)

In [26]:
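# Quick sanity check of the model with a short free-form prompt.
# Note: `model` is the ModelInference object created in the cell further down
# (the cells were executed out of order), and `prompt` is not shown in this notebook.
response = model.generate(prompt)
response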

{'model_id': 'ibm/granite-3-8b-instruct',
 'model_version': '1.1.0',
 'created_at': '2025-03-23T09:51:33.644Z',
 'results': [{'generated_text': '\n\n1. Preheat your oven to the temperature specified in your recipe.\n2. Gather and measure your ingredients.\n3. In a large bowl, combine your dry ingredients (flour, sugar, baking powder, salt).\n4. In a separate bowl, beat your wet ingredients (eggs, milk, oil or melted butter).\n5. Gradually add the wet ingredients to the dry ingredients, mixing until just combined.\n6. Pour the batter into a greased cake pan.\n7. Bake in the preheated oven for the time specified in your recipe.\n8. Check for doneness with a toothpick. If it comes out clean, the cake is done.\n9. Allow the cake to cool in the pan for 10-15 minutes, then transfer to a wire rack to cool completely.\n10. Once cooled, you can frost and decorate your cake as desired.',
   'generated_token_count': 215,
   'input_token_count': 8,
   'stop_reason': 'eos_token'}]}

In [27]:
import pandas as pd
from sklearn.metrics import classification_report 
from tqdm import tqdm

In [35]:
examples = test_df.groupby("label").sample(n=1, random_state=42).reset_index(drop=True)

for idx, row in examples.iterrows():
    print(f"Label: {row['label']}")
    print(f"Text: {row['text']}")
    print("-" * 80)

Label: Business
Text: Poor nations seek WTO textile aid With 40 years of textile quotas about to be abolished in a move to help developing nations, a group of the world #39;s poorest countries are asking for a different approach: special trade deals to protect them from a free-for-all.
--------------------------------------------------------------------------------
Label: Sci/Tech
Text: RIM takes new BlackBerry design overseas Revamped keyboard is key feature of the 7100v, which is headed for European and Asian shores.\
--------------------------------------------------------------------------------
Label: Sports
Text: Novak Captures First Indoor Title BASEL, Switzerland Oct 31, 2004 - Jiri Novak of the Czech Republic won the Swiss Indoors for his first indoor title, defeating David Nalbandian in five sets Sunday in a final in which the Argentine smashed two rackets.
--------------------------------------------------------------------------------
Label: World
Text: German investor conf

In [28]:
PARAMS = TextGenParameters(
    temperature=0,              # Higher temperature means more randomness - In this case we don't want randomness
    max_new_tokens=10,          # Maximum number of tokens to generate
    stop_sequences=[".", "\n"], # Stop generating text when these sequences are encountered
)

model = ModelInference(
    api_client=client,
    model_id="ibm/granite-3-8b-instruct", 
    params=PARAMS
)

In [43]:
SYSTEM_PROMPT = """You task is to classify news stories into one of four categories

CATEGORIES:
{categories}
Ensure that your answer is specifically the same as one the categories!


TEXT:
{text}

Please assign the correct category to the text. Answer with the correct category and nothing else.


Examples to help you determine the category:
Example 1:
Label: Business
Text: Poor nations seek WTO textile aid With 40 years of textile quotas about to be abolished in a move to help developing nations, a group of the world #39;s poorest countries are asking for a different approach: special trade deals to protect them from a free-for-all.

Example 2:
Label: Sci/Tech
Text: RIM takes new BlackBerry design overseas Revamped keyboard is key feature of the 7100v, which is headed for European and Asian shores.\

Example 3:
Label: Sports
Text: Novak Captures First Indoor Title BASEL, Switzerland Oct 31, 2004 - Jiri Novak of the Czech Republic won the Swiss Indoors for his first indoor title, defeating David Nalbandian in five sets Sunday in a final in which the Argentine smashed two rackets.

Example 4:
Label: World
Text: German investor confidence slumped in September BERLIN - German investor confidence dropped sharply in September, a key economic indicator released Tuesday showed amid concerns about the impact of high oil prices on consumer demand and the outlook for the global economy.

Category:
"""
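The few-shot examples above were sampled from `test_df`, so those four rows are also part of the evaluation set. One possible variation, which is not the prompt used for the reported results, is to build the example block programmatically from rows of the training split and to place the examples before the text to classify, so that the final `Category:` cue directly follows that text. The sketch below assumes that layout; `PROMPT_TEMPLATE`, `format_examples` and `build_prompt` are hypothetical names.

```python
from typing import Iterable, Mapping

# Hypothetical template: the examples come first, the text to classify comes last
PROMPT_TEMPLATE = """Your task is to classify news stories into one of four categories.

CATEGORIES:
{categories}

EXAMPLES:
{examples}

TEXT:
{text}

Category:"""


def format_examples(examples: Iterable[Mapping[str, str]]) -> str:
    """Render each example in the same Text/Category layout as the final query."""
    return "\n\n".join(
        f"Text: {ex['text']}\nCategory: {ex['label']}" for ex in examples
    )


def build_prompt(text: str, categories: Iterable[str], examples: Iterable[Mapping[str, str]]) -> str:
    """Fill the template with the category list, the few-shot examples and the text."""
    return PROMPT_TEMPLATE.format(
        categories="\n".join(f"- {c}" for c in categories),
        examples=format_examples(examples),
        text=text,
    )


# Hypothetical usage, assuming a `train_df` built with preprocess(train, ...):
# shots = train_df.groupby("label").sample(n=1, random_state=42).to_dict("records")
# prompt = build_prompt(text, label_map.values(), shots)
```

Whether this layout improves the scores is an open question; it would need to be evaluated in the same way as the prompt above.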

In [44]:
CATEGORIES = "- " + "\n- ".join(test_df["label"].unique())  # Create a string with all categories

predictions = []

for text in tqdm(test_df["text"]):

    # format the prompt with the categories and the text
    prompt = SYSTEM_PROMPT.format(categories=CATEGORIES, text=text)
    
    # generate the response from the model
    response = model.generate(prompt)

    # extract the generated text from the response
    prediction = response["results"][0]["generated_text"].strip()

    # append the prediction to the list of predictions
    predictions.append(prediction)

100%|██████████| 760/760 [05:58<00:00,  2.12it/s]
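The prompt asks the model to answer with exactly one of the categories, but the raw generated text is used as the prediction without any check. A small post-processing step could map each raw output to a valid label before scoring. This is a sketch under assumptions: `normalize_prediction` and the `"Unknown"` fallback are hypothetical and were not used for the results reported here.

```python
from typing import Sequence

VALID_LABELS = ("World", "Sports", "Business", "Sci/Tech")


def normalize_prediction(
    raw: str,
    valid_labels: Sequence[str] = VALID_LABELS,
    fallback: str = "Unknown",
) -> str:
    """Map raw model output to one of the valid labels, or to the fallback."""
    # Remove surrounding whitespace, punctuation and list markers, then lowercase
    cleaned = raw.strip().strip(".:-*\"' ").lower()

    # 1) Exact (case-insensitive) match
    for label in valid_labels:
        if cleaned == label.lower():
            return label

    # 2) Otherwise take the valid label that is mentioned first in the output
    hits = [
        (cleaned.find(label.lower()), label)
        for label in valid_labels
        if label.lower() in cleaned
    ]
    if hits:
        return min(hits)[1]

    # 3) Nothing recognisable: return the fallback so the miss stays visible
    return fallback


# Hypothetical usage after the loop above:
# predictions = [normalize_prediction(p) for p in predictions]
```

Note that any `"Unknown"` predictions would appear as an extra row in the classification report, which makes malformed answers easy to spot.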


In [46]:
print(classification_report(test_df.label, predictions))

              precision    recall  f1-score   support

    Business       0.60      0.93      0.73       190
    Sci/Tech       0.98      0.30      0.46       190
      Sports       0.98      0.89      0.93       190
       World       0.73      0.92      0.81       190

    accuracy                           0.76       760
   macro avg       0.82      0.76      0.74       760
weighted avg       0.82      0.76      0.74       760
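The report shows a low recall for Sci/Tech but does not show which categories those stories were assigned to instead. A confusion table would make that visible. The sketch below uses only the standard library; `confusion_table` is a hypothetical helper, and `sklearn.metrics.confusion_matrix` would give the same counts as an array.

```python
from collections import Counter
from typing import Iterable


def confusion_table(y_true: Iterable[str], y_pred: Iterable[str]) -> str:
    """Return a text table with true labels as rows and predicted labels as columns."""
    y_true, y_pred = list(y_true), list(y_pred)
    counts = Counter(zip(y_true, y_pred))          # (true, predicted) -> count
    labels = sorted(set(y_true) | set(y_pred))     # include unexpected predictions too
    width = max(len(label) for label in labels) + 2

    # Header row: one column per predicted label
    rows = [" " * width + "".join(f"{label:>{width}}" for label in labels)]
    # One row per true label; missing pairs count as 0
    for t in labels:
        cells = "".join(f"{counts[(t, p)]:>{width}}" for p in labels)
        rows.append(f"{t:<{width}}" + cells)
    return "\n".join(rows)


# Hypothetical usage with the variables from this notebook:
# print(confusion_table(test_df.label, predictions))
```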



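The LLM (accuracy 0.76, macro F1 0.74) does not manage to achieve as high a score as BERT. This makes sense, as LLMs are general-purpose generative models that are only prompted here, whereas BERT is an encoder-only model. The weakest class is Sci/Tech (recall 0.30), and the low precision for Business (0.60) suggests that many Sci/Tech stories end up in other categories, Business in particular. I gave the system prompt a few examples (one per category), which seemed to improve performance slightly.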