**AI & Machine Learning (KAN-CINTO4003U) - Copenhagen Business School | Spring 2025**

***


# Part III: LLM

Please see the description of the assignment in the README file (section 3) <br>
**Guide notebook**: [guides/llm_guide.ipynb](guides/llm_guide.ipynb)


***

<br>

* Note that you should report results using a classification report.

* Also, remember to include some reflections on your results: how do they compare with the results from Part I (BoW) and Part II (BERT)? Are there any hyperparameters or prompting techniques that are particularly important?

* You should follow the steps given in the `llm_guide` notebook.

<br>


***

In [None]:
# imports for the project
from decouple import config
from getpass import getpass
from ibm_watsonx_ai import APIClient
from ibm_watsonx_ai import Credentials
from ibm_watsonx_ai.foundation_models import ModelInference
import pandas as pd

# We chose the second method mentioned for supplying the API key (getpass), as the .env-file solution did not work


WX_API_KEY = getpass("Password=")

credentials = Credentials(
    url = "https://us-south.ml.cloud.ibm.com",
    api_key = WX_API_KEY
)

client = APIClient(
    credentials=credentials, 
    project_id= "8aa4fb28-5741-413e-aad8-0614d84ba965"
)

model = ModelInference(
    api_client=client,
    model_id="ibm/granite-3-8b-instruct"
)

prompt = "How do I make a cake?"
generated_response = model.generate(prompt)

generated_response
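The import cell notes that the `.env` approach did not work. One possible fallback, sketched here under the assumption that the key may be exported as an environment variable named `WX_API_KEY`, is to check the environment first and only prompt interactively when nothing is found. The helper `get_api_key` is hypothetical and not part of the guide.

```python
import os
from getpass import getpass


def get_api_key(env=None, prompt_fn=getpass, name="WX_API_KEY"):
    """Return the API key from the environment, or prompt for it.

    The helper and the variable name are illustrative assumptions.
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if key:
        return key
    # Nothing usable in the environment: fall back to an interactive prompt.
    return prompt_fn(f"{name}=")


# Example with an injected mapping, so the real environment is not read.
print(get_api_key(env={"WX_API_KEY": "dummy-key"}))
```

Passing `env` and `prompt_fn` as arguments keeps the helper easy to test without typing a real key.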

from ibm_watsonx_ai.foundation_models.schema import TextGenParameters

TextGenParameters.show()



+-----------------+--------------------------------------+-----------------------------------------+
| PARAMETER       | TYPE                                 | EXAMPLE VALUE                           |
+-----------------+--------------------------------------+-----------------------------------------+
| decoding_method | str, TextGenDecodingMethod, NoneType | sample                                  |
+-----------------+--------------------------------------+-----------------------------------------+
| length_penalty  | dict, TextGenLengthPenalty, NoneType | {'decay_factor': 2.5, 'start_index': 5} |
+-----------------+--------------------------------------+-----------------------------------------+

In [8]:
PARAMS = TextGenParameters(
    temperature=0.8,      # Higher temperature means more randomness
    max_new_tokens=500, # Maximum number of tokens to generate
    min_new_tokens=200, # Minimum number of tokens to generate
)

model = ModelInference(
    api_client=client,
    model_id="ibm/granite-13b-instruct-v2",
    params=PARAMS
)
response = model.generate(prompt)
response

print(response["results"][0]["generated_text"])

1. Preheat oven to 350 degrees F (175 degrees C). Grease and flour two 9-inch cake pans. 2. In a large bowl, beat together the cake mix, water, oil, and eggs until smooth. 3. Stir in the chocolate chips. Pour half of the batter into each prepared pan. 4. Bake for 25 to 30 minutes, or until a toothpick inserted into the center of the cakes comes out clean. 5. Let cool for 10 minutes before removing from pans to cool completely. 6. Frost with your favorite frosting and decorate as desired. icing sugar, chocolate frosting, vanilla, strawberries, blueberries, whipped cream, chocolate chips, strawberries, butterflies, sprinkles, cherries, lemon, cherry, coconut, pineapple, blueberries, raspberries, blackberries, peaches, cherries, apples, blackberries, strawberries, raspberries, pears, cranberries, blueberries, raspberries, blackberries,


### 1. Load the data

We can load this data directly from [Hugging Face Datasets](https://huggingface.co/docs/datasets/) (the Hugging Face Hub) into a pandas DataFrame. Pretty neat!

**Note**: This cell will download the dataset and keep it in memory. If you run this cell multiple times, it will download the dataset multiple times.

You are welcome to increase the `frac` parameter to load more data.

In [9]:
import pandas as pd
from sklearn.metrics import classification_report 
from tqdm import tqdm
splits = {'train': 'data/train-00000-of-00001.parquet', 'test': 'data/test-00000-of-00001.parquet'}
# train = pd.read_parquet("hf://datasets/fancyzhx/ag_news/" + splits["train"])
test = pd.read_parquet("hf://datasets/fancyzhx/ag_news/" + splits["test"])

In [10]:
label_map = {
    0: 'World',
    1: 'Sports',
    2: 'Business',
    3: 'Sci/Tech'
}

def preprocess(df: pd.DataFrame, frac=1e-2, label_map=label_map, seed=42) -> pd.DataFrame:
    """Map numeric labels to names and sample the fraction `frac` from every label."""
    return (
        df
        .assign(label=lambda x: x['label'].map(label_map))
        [lambda df: df['label'].isin(label_map.values())]  # keep only rows with a known label
        .groupby('label')
        .apply(lambda x: x.sample(frac=frac, random_state=seed))  # same fraction per label
        .reset_index(drop=True)
    )

# train_df = preprocess(train, frac=0.01)
test_df = preprocess(test, frac=0.1)

# clear up some memory by deleting the original dataframes
# del train
del test

test_df.shape, # train_df.shape, 

((760, 2),)
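`preprocess` draws the same fraction from every label, which is why the 760 sampled rows split evenly into 190 per label. The idea can be sketched without pandas as follows. `stratified_sample` and the toy rows are hypothetical, and the standard-library `random` module will not reproduce the exact rows that `DataFrame.sample` selects.

```python
import random
from collections import defaultdict


def stratified_sample(rows, frac, seed=42):
    """Sample the same fraction of rows from every label group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["label"]].append(row)

    rng = random.Random(seed)
    sample = []
    for label in sorted(groups):
        members = groups[label]
        k = round(len(members) * frac)  # number of rows to keep for this label
        sample.extend(rng.sample(members, k))
    return sample


# Toy data: 20 rows for each of the four labels.
labels = ["World", "Sports", "Business", "Sci/Tech"]
rows = [{"label": lab, "text": f"{lab} story {i}"} for lab in labels for i in range(20)]

subset = stratified_sample(rows, frac=0.1)
print(len(subset))  # 2 rows per label, 8 in total
```

Sampling per label rather than from the whole table keeps the classes balanced, so no class dominates the evaluation.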

In [None]:
PARAMS = TextGenParameters(
    temperature=0,              # Higher temperature means more randomness - In this case we don't want randomness
    max_new_tokens=30,          # Maximum number of tokens to generate
    stop_sequences=[".", "\n"], # Stop generating text when these sequences are encountered
    top_k=5,
)

model = ModelInference(
    api_client=client,
    model_id="ibm/granite-13b-instruct-v2",  # We could also try a larger model!
    params=PARAMS
)

In [None]:
SYSTEM_PROMPT = """Your task is to classify news stories into one of four categories

CATEGORIES:
{categories}

TEXT:
{text}

Please assign the correct category to the text. Answer with the correct category and nothing else.

Category:
"""

In [28]:
CATEGORIES = "- " + "\n- ".join(test_df["label"].unique())  # Create a string with all categories

predictions = []

for text in tqdm(test_df["text"]):

    # format the prompt with the categories and the text
    prompt = SYSTEM_PROMPT.format(categories=CATEGORIES, text=text)
    
    # generate the response from the model
    response = model.generate(prompt)

    # extract the generated text from the response
    prediction = response["results"][0]["generated_text"].strip()

    # append the prediction to the list of predictions
    predictions.append(prediction)

100%|██████████| 760/760 [02:19<00:00,  5.44it/s]
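The loop above stores the stripped generated text directly as the prediction. Since the model is free to add punctuation or extra words, one option is to map each raw output onto the known labels before scoring. The sketch below is an assumption about how that could be done, not a step from the guide. `normalize_prediction` and the `"Unknown"` fallback are hypothetical names.

```python
VALID_LABELS = ["World", "Sports", "Business", "Sci/Tech"]


def normalize_prediction(raw, labels=VALID_LABELS, fallback="Unknown"):
    """Map raw model output to one of the known labels."""
    cleaned = raw.strip().strip(".-* ").lower()
    # First try an exact, case-insensitive match.
    for label in labels:
        if cleaned == label.lower():
            return label
    # Otherwise accept the output only if it mentions exactly one label.
    mentioned = [label for label in labels if label.lower() in cleaned]
    if len(mentioned) == 1:
        return mentioned[0]
    return fallback


print(normalize_prediction(" sports."))            # Sports
print(normalize_prediction("Category: Sci/Tech"))  # Sci/Tech
print(normalize_prediction("World or Business"))   # Unknown
```

Any output mapped to the fallback would appear as its own row in `classification_report`, which makes unparseable answers visible instead of hiding them.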


In [30]:
print(classification_report(test_df.label, predictions))

              precision    recall  f1-score   support

    Business       0.54      0.91      0.68       190
    Sci/Tech       0.89      0.35      0.50       190
      Sports       0.96      0.91      0.94       190
       World       0.80      0.78      0.79       190

    accuracy                           0.74       760
   macro avg       0.80      0.74      0.73       760
weighted avg       0.80      0.74      0.73       760
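The f1-score column is the harmonic mean of precision and recall, so a class with one weak metric scores well below its stronger one. A quick check using the rounded values from the report above:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the rounded report above.
report = {
    "Business": (0.54, 0.91),
    "Sci/Tech": (0.89, 0.35),
    "Sports": (0.96, 0.91),
    "World": (0.80, 0.78),
}

for label, (p, r) in report.items():
    print(f"{label:<10} {f1(p, r):.2f}")
```

Because the inputs are already rounded, the last digit can differ from the report (Sports comes out as 0.93 here, against 0.94 in the report). The Business row is also informative: a recall of 0.91 on 190 stories at a precision of 0.54 implies roughly 320 Business predictions, while Sci/Tech received only roughly 75. This suggests the model over-assigns Business.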



**Reflection on analysis**

**Hyperparameter tuning**:
Temperature controls randomness. As the task calls for no randomness, we set it to 0. During testing we also tried other temperature values, and accuracy decreased as the temperature increased. This held for both high and low values of `max_new_tokens`. The `max_new_tokens` parameter keeps the number of generated tokens appropriate to the task, which for classification is a small number. We tested values in the range 5-30 without any significant change in accuracy. We also tuned `top_k` and `repetition_penalty`, but this did not improve accuracy either. The lack of improvement from hyperparameter tuning may indicate that the model already performs close to its best on this task with the parameters in use.
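A sweep like the one described could be organised as a small grid search. The sketch below is illustrative only: `run_sweep` and `fake_evaluate` are hypothetical, and `fake_evaluate` returns a placeholder score (not a measurement) so the example runs without API access. In the notebook it would be replaced by a function that builds `TextGenParameters` from the settings, runs the prediction loop, and returns the accuracy.

```python
from itertools import product


def run_sweep(evaluate, grid):
    """Evaluate every combination in `grid`; return (score, settings) pairs, best first."""
    names = list(grid)
    results = []
    for values in product(*(grid[name] for name in names)):
        settings = dict(zip(names, values))
        results.append((evaluate(**settings), settings))
    # Sort on the score only, so the settings dicts are never compared.
    return sorted(results, key=lambda item: item[0], reverse=True)


def fake_evaluate(temperature, max_new_tokens):
    """Placeholder scorer, NOT a real measurement."""
    return 1.0 - temperature  # stands in for an accuracy between 0 and 1


grid = {"temperature": [0.0, 0.4, 0.8], "max_new_tokens": [5, 15, 30]}
best_score, best_settings = run_sweep(fake_evaluate, grid)[0]
print(best_settings)
```

Recording every combination, rather than only the best one, makes it easy to confirm observations such as accuracy being insensitive to `max_new_tokens`.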


**Choice of prompting**:
During testing we tried several prompting strategies: zero-shot (the prompt from the guide), few-shot, and chain-of-thought. For few-shot prompting, the instruction included three examples. We achieved the best result with zero-shot prompting, with an accuracy of 74%. With few-shot and chain-of-thought prompting, accuracy decreased to 66% and 72%, respectively. Moreover, the classification report for these runs did not include a recall score, since the model could not identify the minority classes.
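For reference, a few-shot variant of the zero-shot template can be assembled by placing labelled examples ahead of the text to classify. The sketch below is an assumption about the prompt layout, not necessarily the exact prompt used in these experiments. `build_few_shot_prompt` and the three example headlines are invented for illustration.

```python
FEW_SHOT_EXAMPLES = [
    # Invented headlines, used only to illustrate the layout.
    ("Central bank raises interest rates again", "Business"),
    ("Home team wins the cup final on penalties", "Sports"),
    ("New smartphone chip promises longer battery life", "Sci/Tech"),
]

CATEGORIES = ["World", "Sports", "Business", "Sci/Tech"]


def build_few_shot_prompt(text, examples=FEW_SHOT_EXAMPLES, categories=CATEGORIES):
    """Build a prompt with labelled examples followed by the text to classify."""
    lines = ["Your task is to classify news stories into one of the categories below.", ""]
    lines.append("CATEGORIES:")
    lines.extend(f"- {c}" for c in categories)
    lines.append("")
    for example_text, label in examples:
        lines += ["TEXT:", example_text, "Category:", label, ""]
    # The final block is left open so the model completes the category.
    lines += ["TEXT:", text, "Category:"]
    return "\n".join(lines) + "\n"


print(build_few_shot_prompt("Parliament votes on new climate treaty"))
```

Which examples are chosen can bias the model towards the labels they cover, so it may be worth checking that they represent the categories evenly.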


**Comparison of models**:
The performance of the LLM system, with an accuracy of 74%, shows that it can handle classification tasks effectively but has room for improvement compared to established methods.

In contrast, BERT achieved an accuracy of 88%, reflecting its advanced capabilities in understanding and processing contextual information through its transformer architecture. This model's strength lies in the fact that it can be fine-tuned on a particular task, leading to higher classification accuracy. However, it comes with increased computational demands and complexity in implementation.

The BoW approach, with an accuracy of 72%, demonstrates the limitations inherent in simpler methods. While its straightforward design allows for quick implementation and computation, it lacks contextual awareness and nuance, which contributes to its lower performance compared to both the LLM system and BERT.