**AI & Machine Learning (KAN-CINTO4003U) - Copenhagen Business School | Spring 2025**

***


<p align="center">
<img src="media/llm_header.png" alt="LLM" width="800"/> <br>
Image from Speech and Language Processing. Daniel Jurafsky & James H. Martin.<br> Copyright © 2024. All
rights reserved. Draft of January 12, 2025.
</p>


# Generative LLMs: decoder-only Large Language Models

***

- **Large Language Models (LLMs)** are deep learning models trained on vast amounts of text data to understand, generate, and manipulate human language. They belong to the family of Transformer-based architectures, first introduced in the seminal paper *Attention Is All You Need* ([Vaswani et al., 2017](https://arxiv.org/abs/1706.03762)), which revolutionized Natural Language Processing (NLP) through self-attention mechanisms.  

- The development of LLMs gained momentum with OpenAI’s *Generative Pre-trained Transformer (GPT-1)* ([Radford et al., 2018](https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf)), which introduced a pretraining-finetuning paradigm for NLP tasks. This approach was later scaled with *GPT-2* ([Radford et al., 2019](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)) and *GPT-3* ([Brown et al., 2020](https://arxiv.org/abs/2005.14165)), with the latter demonstrating emergent capabilities in zero-shot and few-shot learning.  

- **BERT** ([Devlin et al., 2018](https://arxiv.org/abs/1810.04805)) played a parallel but pivotal role in advancing Natural Language Understanding (NLU) with bidirectional context modeling. In particular, BERT influenced many retrieval-based applications. Meanwhile, GPT models prioritized *autoregressive decoding*, which revitalized the field of Natural Language Generation (NLG). The bifurcation between *encoder-only* (BERT-like) and *decoder-only* (GPT-like) architectures shapes the evolution of modern NLP systems to this day.  

- **Scaling Laws** ([Kaplan et al., 2020](https://arxiv.org/abs/2001.08361)) demonstrated that model performance improves predictably with scale: test loss falls as a power law in parameter count, dataset size, and compute. This insight led to the training of massive LLMs, such as *GPT-4* ([OpenAI, 2023](https://openai.com/research/gpt-4)), *PaLM* ([Chowdhery et al., 2022](https://arxiv.org/abs/2204.02311)), *Claude* ([Anthropic, 2023](https://www.anthropic.com)), and *Gemini* ([Google DeepMind, 2023](https://blog.google/technology/ai/google-gemini-ai/)), which exhibit improved reasoning, multilingual capabilities, and tool use.  

- LLMs have been widely adopted in industry, powering chatbots (e.g., *ChatGPT*, *Claude*), code generation tools (e.g., *GitHub Copilot* powered by OpenAI’s *Codex* ([Chen et al., 2021](https://arxiv.org/abs/2107.03374))), and enterprise applications. For example, [Morgan Stanley](https://www.morganstanley.com/articles/morgan-stanley-ai-assistant?utm_source=chatgpt.com) uses LLMs to retrieve financial insights, while [Salesforce](https://www.salesforce.com/news/stories/how-salesforce-ai-assistant-works/?utm_source=chatgpt.com) has integrated LLMs into its AI assistant for customer relationship management (CRM).  

- Despite their strengths, LLMs have significant limitations: They are compute-intensive, struggle with factual consistency ([Maynez et al., 2020](https://arxiv.org/abs/2010.03043)), and often hallucinate incorrect information. Efforts to mitigate these issues include RLHF (Reinforcement Learning from Human Feedback) ([Christiano et al., 2017](https://arxiv.org/abs/1706.03741)) and retrieval-augmented generation (RAG) ([Lewis et al., 2020](https://arxiv.org/abs/2005.11401)), which combines LLMs with external knowledge sources.  

***

<br><br>

**In this notebook, we will explore how to use generative LLMs via WatsonX.AI.**

<br><br>


***

# 1. Introduction to decoder-only LLMs

<div style="background-color:rgba(4, 12, 78, 0.58); color: #ffffff; font-weight: 700; padding-left: 10px; padding-top: 20px; padding-bottom: 20px"><strong>Generative AI models</strong></div>

<div style="background-color:rgb(13, 14, 18); padding-left: 10px; padding-top: 10px; padding-bottom: 10px; padding-right: 10px">

<div style="padding-left: 10px; padding-right: 10px; padding-top: 10px; padding-bottom: 30px; text-align: justify">
<p align="center">
<img src="media/autoregressive.png" alt="autoregressive" width="800"/>
</p>
</div>

<div style="background-color:rgb(13, 14, 18); padding-left: 10px; padding-top: 10px; padding-bottom: 10px; padding-right: 10px; color: white;">
<p>
As mentioned above, the rise of Large Language Models (LLMs) is a direct result of advancements in Transformer-based neural network architectures, originally introduced in <i>Attention Is All You Need</i> (<a href="https://arxiv.org/abs/1706.03762">Vaswani et al., 2017</a>). <br><br>While early Transformer models were primarily used for sequence-to-sequence tasks (such as machine translation), modern LLMs - so-called decoder-only LLMs - are capable of human-level natural language generation (NLG) and, arguably, near-human reasoning, dialogue, and tool usage.

Unlike earlier neural architectures like convolutional neural networks (CNNs), recurrent neural networks (RNNs), or long short-term memory (LSTM) networks, Transformers (and by extension, decoder-only LLMs) rely on self-attention mechanisms to dynamically weigh relationships between words in a sequence. Effectively, this means that instead of reading text one word at a time in order (like RNNs) or trying to spot patterns in small chunks (like CNNs), Transformers look at the entire sentence at once and decide which words are most important to each other.

**For example**: Imagine you're reading a mystery novel. Instead of going page by page and only remembering the last few sentences, a Transformer can scan the whole book at once and instantly see connections—like how a clue in Chapter 2 relates to the big reveal in Chapter 20. Much like what humans do when reading a book, Transformers can understand context and meaning across long and complex sentences.

This ability to "pay attention" to all words at once makes LLMs much better at understanding context and meaning, even across long and complex sentences. It has its limits, however: the model can only attend to the tokens that fit inside its context window.

</div>
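
To make the idea of self-attention a little more concrete, here is a minimal sketch in plain Python. It is not how any production model is implemented: the token vectors are made-up numbers, and the same vectors are reused as queries, keys, and values, whereas a real Transformer derives them from learned projections. The sketch also applies the causal mask used by decoder-only models, so each position only attends to itself and to earlier positions.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_self_attention(queries, keys, values):
    """Scaled dot-product attention where position i only sees positions 0..i."""
    dim = len(keys[0])
    outputs, all_weights = [], []
    for i, query in enumerate(queries):
        # Score the current token against itself and every earlier token.
        scores = [
            sum(q * k for q, k in zip(query, keys[j])) / math.sqrt(dim)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        # The output is a weighted average of the visible value vectors.
        output = [
            sum(w * values[j][d] for j, w in enumerate(weights))
            for d in range(len(values[0]))
        ]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights

# Three toy token vectors (hypothetical numbers).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = causal_self_attention(tokens, tokens, tokens)
for position, row in enumerate(weights):
    print(position, [round(w, 2) for w in row])
```

Each printed row holds the attention weights for one position. The first token can only attend to itself, while the last token spreads its attention over all three.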


<div style="background-color:rgba(4, 12, 78, 0.58); color: #ffffff; font-weight: 700; padding-left: 10px; padding-top: 20px; padding-bottom: 20px">
    <strong>Key aspects governing LLM behavior</strong>
</div>

<div style="background-color:rgb(13, 14, 18); padding-left: 10px; padding-top: 10px; padding-bottom: 10px; padding-right: 10px; color: white;">

<ul>
    <li><strong>Context Window</strong><br>
        Defines how many tokens (words, subwords, or characters) the model can consider at once. Affects how well it retains information in long conversations or documents.
        <br><br>
    </li>
    <li><strong>Temperature</strong><br>
        Controls randomness in text generation. Lower values (e.g., 0.1) make responses more deterministic, while higher values (e.g., 1.0) make them more creative.
        <br><br>
    </li>
    <li><strong>Top-k Sampling</strong><br>
        Limits the model to choosing from the <i>k</i> most likely next tokens instead of considering all possibilities, reducing extreme randomness.
        <br><br>
    </li>
    <li><strong>Top-p (Nucleus) Sampling</strong><br>
        Selects tokens from the smallest possible set whose cumulative probability exceeds <i>p</i>. More flexible than top-k, ensuring balanced randomness.
        <br><br>
    </li>
    <li><strong>Repetition Penalty</strong><br>
        Adjusts the likelihood of repeating words or phrases. Helps prevent redundant or looping responses.
        <br><br>
    </li>
    <li><strong>Stop Sequences</strong><br>
        Predefined words or phrases that signal the model to stop generating text, useful for structuring responses.
        <br><br>
    </li>
    <li><strong>System and User Prompts</strong><br>
        The initial instructions given to the model (system prompt) and the user's inputs (user prompt). Both strongly influence the model's behavior and output style.
        <br><br>
    </li>
    <li><strong>Memory and Retrieval</strong><br>
        Some models have long-term memory or external retrieval systems (e.g., RAG - Retrieval-Augmented Generation) to pull in relevant context beyond the context window.
        <br><br>
    </li>
</ul>

</div>
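
The sampling parameters above are easier to grasp with a small example. The sketch below applies temperature, top-k, and top-p to a hand-written set of logits. The function names and the logit values are hypothetical, and hosted models may apply these filters in a different order or with different edge-case handling, so treat it as an illustration rather than a description of WatsonX.ai's internals.

```python
import math
import random

def next_token_distribution(logits, temperature=1.0, top_k=None, top_p=None):
    """Return {token: probability} after temperature, top-k and top-p filtering."""
    # Temperature rescales the logits: values below 1 sharpen the distribution.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    peak = max(scaled.values())
    exps = {tok: math.exp(v - peak) for tok, v in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(((tok, e / total) for tok, e in exps.items()),
                    key=lambda pair: pair[1], reverse=True)

    if top_k is not None:
        ranked = ranked[:top_k]  # keep only the k most likely tokens

    if top_p is not None:
        kept, cumulative = [], 0.0
        for tok, prob in ranked:
            kept.append((tok, prob))
            cumulative += prob
            if cumulative >= top_p:  # smallest set reaching probability mass p
                break
        ranked = kept

    mass = sum(prob for _, prob in ranked)
    return {tok: prob / mass for tok, prob in ranked}  # renormalise what is left

def sample_token(distribution, rng):
    """Draw one token according to the filtered probabilities."""
    tokens = list(distribution)
    return rng.choices(tokens, weights=[distribution[t] for t in tokens], k=1)[0]

# Hypothetical logits for the token following "How do I make a"
logits = {"cake": 3.0, "pie": 2.0, "sandwich": 1.0, "spaceship": -1.0}
rng = random.Random(42)
for temperature in (0.1, 1.0):
    dist = next_token_distribution(logits, temperature=temperature, top_k=3, top_p=0.95)
    print(temperature, {t: round(p, 3) for t, p in dist.items()}, "->", sample_token(dist, rng))
```

At a temperature of 0.1 almost all probability mass sits on the most likely token, while at 1.0 the alternatives keep a realistic chance of being picked. A temperature of exactly 0 would divide by zero in this sketch; in practice it is usually treated as greedy decoding, meaning the most likely token is always chosen.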


---


# 2. Connecting to WatsonX.ai

**If you haven't already, please follow the WatsonX.ai Guide on Canvas to set up your account and get your API key. You will need this key to connect to the WatsonX.ai API.**

### 2.0. Import libraries

In [14]:
from decouple import config
from ibm_watsonx_ai import APIClient
from ibm_watsonx_ai import Credentials
from ibm_watsonx_ai.foundation_models import ModelInference

### 2.1. Loading your API key
You can provide your IBM Cloud API key in one of two ways:

- Using `getpass` to enter your API key securely in the notebook (has to be done every time you restart the notebook)
- Storing your API key in a .env file and loading it using the `python-decouple` package (recommended)

**DO NOT HARD-CODE YOUR API KEY IN THE NOTEBOOK. EVER!!**

**IF YOU STORE YOUR API KEY IN A .env FILE, MAKE SURE .env IS ALSO LISTED IN .gitignore SO IT IS NOT COMMITTED TO YOUR REMOTE GITHUB REPO!**

In [15]:
# Load the environment variables using python-decouple

# The .env file should be in the root of the project
# The .env file should NOT be committed to the repository

WX_API_KEY = config('WX_API_KEY')

UndefinedValueError: WX_API_KEY not found. Declare it as envvar or define a default value.
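
The error above is what you see when `WX_API_KEY` is not defined anywhere. As a rough sketch of the `getpass` alternative mentioned earlier, the helper below first looks for the key in the process environment and only prompts for it when nothing is found. The function name `load_api_key` is hypothetical, and unlike `python-decouple` this version reads only environment variables, not the `.env` file.

```python
import os
from getpass import getpass

def load_api_key(name="WX_API_KEY", prompt=getpass):
    """Return the key from the environment, or ask for it without echoing it."""
    key = os.environ.get(name)
    if not key:
        # getpass hides the typed input, so the key never shows up in the notebook output.
        key = prompt(f"Enter {name}: ")
    return key
```

Calling `WX_API_KEY = load_api_key()` in a cell would then work both with and without the environment variable being set. Either way, the key itself is never written into the notebook.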

### 2.2. Connecting to the WatsonX.ai Credentials API

To authenticate and call LLMs via WatsonX.ai we need

* The URL of the WatsonX.ai API
* Your unique API key that you created in the WatsonX.ai guide
* The unique project ID of the project you created in the WatsonX.ai guide.
* The model ID of the LLM you want to use

**NOTE**: Depending on which region you picked for your project, you need to use the corresponding URL:

* Dallas: https://us-south.ml.cloud.ibm.com
* Frankfurt: https://eu-de.ml.cloud.ibm.com
* London: https://eu-gb.ml.cloud.ibm.com
* Sydney: https://au-syd.ml.cloud.ibm.com
* Tokyo: https://jp-tok.ml.cloud.ibm.com
* Toronto: https://ca-tor.ml.cloud.ibm.com
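
If you switch between projects in different regions, it can help to keep the endpoints from the list above in one place. The snippet below is a small convenience sketch; the names `WATSONX_URLS` and `watsonx_url` are made up for this example and are not part of the `ibm_watsonx_ai` package.

```python
# Region endpoints copied from the list above.
WATSONX_URLS = {
    "dallas": "https://us-south.ml.cloud.ibm.com",
    "frankfurt": "https://eu-de.ml.cloud.ibm.com",
    "london": "https://eu-gb.ml.cloud.ibm.com",
    "sydney": "https://au-syd.ml.cloud.ibm.com",
    "tokyo": "https://jp-tok.ml.cloud.ibm.com",
    "toronto": "https://ca-tor.ml.cloud.ibm.com",
}

def watsonx_url(region: str) -> str:
    """Look up the endpoint for a region name, ignoring case and surrounding whitespace."""
    key = region.strip().lower()
    if key not in WATSONX_URLS:
        raise ValueError(f"Unknown region {region!r}; expected one of {sorted(WATSONX_URLS)}")
    return WATSONX_URLS[key]

print(watsonx_url("Frankfurt"))  # https://eu-de.ml.cloud.ibm.com
```

The returned string can be passed as the `url` argument of `Credentials`.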

In [None]:
credentials = Credentials(
    url = "https://eu-de.ml.cloud.ibm.com",
    api_key = WX_API_KEY
)

client = APIClient(
    credentials=credentials, 
    project_id="ea176fc8-4852-4798-bf23-063620807ec9"
)

### 2.3. Testing the connection

The `ModelInference` class is a wrapper around the WatsonX.ai API that lets you interact with the hosted LLMs. Under the hood, it sends HTTP requests to the server on IBM Cloud that hosts the WatsonX.ai API and the models.

In [None]:

model = ModelInference(
    api_client=client,
    model_id="ibm/granite-13b-instruct-v2",
)

In [None]:
prompt = "How do I make a cake?"
generated_response = model.generate(prompt)

generated_response

{'model_id': 'ibm/granite-13b-instruct-v2',
 'created_at': '2025-03-01T23:38:27.533Z',
 'results': [{'generated_text': 'Mix the ingredients together in a bowl. Pour the batter into a cake pan. Bake for 30 minutes',
   'generated_token_count': 20,
   'input_token_count': 7,
   'stop_reason': 'max_tokens'}],
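 'system': {'warnings': [{'message': "The value of 'parameters.max_new_tokens' for this model was set to value 20",
    'id': 'unspecified_max_new_tokens',
    'additional_properties': {'limit': 0,
     'new_value': 20,
     'parameter': 'parameters.max_new_tokens',
     'value': 0}}]}}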

Notice that we get a warning that "*The value of `parameters.max_new_tokens` for this model was set to value 20*". Because we did not set `max_new_tokens` ourselves, the API fell back to a default, so the model generates at most 20 tokens (words, subwords, or characters) in response to each prompt. You can raise this value as needed, but be aware that generating large amounts of text is computationally expensive and takes longer to complete.

Notice also that the `stop_reason` was 'max_tokens', which means that the model stopped generating text because it reached the maximum number of tokens allowed. This is expected behavior and is not an error.
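
Since the response is a plain dictionary, it is easy to pull out just the fields you need. The helper below is an illustrative sketch that relies only on the keys visible in the output above; `summarize_response` is a hypothetical name, not part of the SDK.

```python
def summarize_response(response: dict) -> dict:
    """Pull the fields we usually care about out of a generate() response."""
    result = response["results"][0]
    return {
        "text": result["generated_text"].strip(),
        "tokens_in": result["input_token_count"],
        "tokens_out": result["generated_token_count"],
        # 'max_tokens' means the model was cut off rather than finishing on its own.
        "truncated": result["stop_reason"] == "max_tokens",
    }

# A trimmed copy of the response shown above.
example = {
    "model_id": "ibm/granite-13b-instruct-v2",
    "results": [{
        "generated_text": "Mix the ingredients together in a bowl. Pour the batter into a cake pan. Bake for 30 minutes",
        "generated_token_count": 20,
        "input_token_count": 7,
        "stop_reason": "max_tokens",
    }],
}
summary = summarize_response(example)
print(summary["truncated"], summary["tokens_out"])
```

Checking the `truncated` flag is a quick way to spot answers that were cut short by the token limit.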

You can check which parameters can be set like so:

In [None]:
from ibm_watsonx_ai.foundation_models.schema import TextGenParameters

TextGenParameters.show()

+-----------------+--------------------------------------+-----------------------------------------+
| PARAMETER       | TYPE                                 | EXAMPLE VALUE                           |
+-----------------+--------------------------------------+-----------------------------------------+
| decoding_method | str, TextGenDecodingMethod, NoneType | sample                                  |
+-----------------+--------------------------------------+-----------------------------------------+
| length_penalty  | dict, TextGenLengthPenalty, NoneType | {'decay_factor': 2.5, 'start_index': 5} |
+-----------------+--------------------------------------+-----------------------------------------+
...

### 2.4. Setting parameters

In [None]:
PARAMS = TextGenParameters(
    temperature=0.8,     # Higher temperature means more randomness
    max_new_tokens=500,  # Maximum number of tokens to generate
    min_new_tokens=200,  # Minimum number of tokens to generate
)

model = ModelInference(
    api_client=client,
    model_id="ibm/granite-13b-instruct-v2",
    params=PARAMS
)

In [None]:
response = model.generate(prompt)
response

{'model_id': 'ibm/granite-13b-instruct-v2',
 'created_at': '2025-03-01T23:39:30.770Z',
 'results': [{'generated_text': 'You can make a cake by following a recipe. You can also make a cake by following a mix. You can also make a cake by following a box. You can also make a cake by following a box and adding your own touches. You can also make a cake by following a recipe and adding your own touches. \nYou can also make a cake by following a mix and adding your own touches. You can also make a cake by following a box and adding your own touches. You can also make a cake by following a recipe and adding your own touches. You can also make a cake by following a mix and adding your own touches. You can also make a cake by following a box and adding your own touches. \nYou can also make a cake by following a recipe and adding your own touches. You can also make a cake by following a mix and adding your own touches. You can also make a cake by following a box and adding your own touches. \nYo

In [None]:
print(response["results"][0]["generated_text"])

You can make a cake by following a recipe. You can also make a cake by following a mix. You can also make a cake by following a box. You can also make a cake by following a box and adding your own touches. You can also make a cake by following a recipe and adding your own touches. 
You can also make a cake by following a mix and adding your own touches. You can also make a cake by following a box and adding your own touches. You can also make a cake by following a recipe and adding your own touches. You can also make a cake by following a mix and adding your own touches. You can also make a cake by following a box and adding your own touches. 
You can also make a cake by following a recipe and adding your own touches. You can also make a cake by following a mix and adding your own touches. You can also make a cake by following a box and adding your own touches. 
You can also make a cake by following a recipe and adding your own touches. You can also make a cake by following a mix and a
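
The output above is a good illustration of the looping behaviour that a repetition penalty is meant to reduce. The sketch below shows one common formulation of such a penalty, applied to made-up logits: tokens that were already generated have their scores pushed down before the next token is chosen. Whether WatsonX.ai implements the penalty in exactly this way is an assumption; check the output of `TextGenParameters.show()` for the parameter the API actually exposes.

```python
def apply_repetition_penalty(logits, generated_tokens, penalty=1.2):
    """Make tokens that were already generated less likely (for penalty > 1)."""
    seen = set(generated_tokens)
    adjusted = {}
    for token, logit in logits.items():
        if token in seen:
            # Dividing a positive logit, or multiplying a negative one, lowers the score.
            adjusted[token] = logit / penalty if logit > 0 else logit * penalty
        else:
            adjusted[token] = logit
    return adjusted

# Hypothetical logits and generation history.
logits = {"cake": 2.0, "recipe": 1.5, "oven": 1.4, "box": -0.5}
history = ["recipe", "box"]
adjusted = apply_repetition_penalty(logits, history, penalty=1.5)
print(adjusted)
```

In this toy example "recipe" drops below "oven" after the penalty, so the model becomes less likely to repeat it, while tokens that have not been used yet keep their original scores.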

# 3. Using LLMs for classification

In this section, we will use a decoder-only LLM to classify text. We will provide the model with a prompt and ask it to generate a response that classifies the text into one of several categories. This is a common use case for LLMs in NLP, where they can be used to classify text based on the context and content of the input.

Note that what we provide here is a very "vanilla" example of prompt engineering and optimization. In practice, you would likely need to experiment with different prompts and parameters to get the best results for your specific use case.

In [None]:
import pandas as pd
from sklearn.metrics import classification_report 
from tqdm import tqdm

### 3.0. Load data

We will be working with the AG News dataset, a collection of news articles drawn from AG's corpus of web news. The dataset has four categories: World, Sports, Business, and Science/Technology. We will use a subset of the dataset for this example.

We can load this data directly from the Hugging Face Hub (see [Hugging Face Datasets](https://huggingface.co/docs/datasets/)) into a Pandas DataFrame. Pretty neat!

**Note**: This cell will download the dataset and keep it in memory. If you run this cell multiple times, it will download the dataset multiple times.

In [None]:

splits = {'train': 'data/train-00000-of-00001.parquet', 'test': 'data/test-00000-of-00001.parquet'}
# train = pd.read_parquet("hf://datasets/fancyzhx/ag_news/" + splits["train"])
test = pd.read_parquet("hf://datasets/fancyzhx/ag_news/" + splits["test"])

In [None]:

label_map = {
    0: 'World',
    1: 'Sports',
    2: 'Business',
    3: 'Sci/Tech'
}

def preprocess(df: pd.DataFrame, frac : float = 1e-2, label_map : dict[int, str] = label_map, seed : int = 42) -> pd.DataFrame:
    """ Preprocess the dataset 

    Operations:
    - Map the label to the corresponding category
    - Filter out the labels not in the label_map
    - Sample a fraction of the dataset (stratified by label)

    Args:
    - df (pd.DataFrame): The dataset to preprocess
    - frac (float): The fraction of the dataset to sample in each category
    - label_map (dict): A mapping of the original label to the new label
    - seed (int): The random seed for reproducibility

    Returns:
    - pd.DataFrame: The preprocessed dataset
    """

    return  (
        df
        .assign(label=lambda x: x['label'].map(label_map))
        [lambda df: df['label'].isin(label_map.values())]
        .groupby('label')[["text", "label"]]
        .apply(lambda x: x.sample(frac=frac, random_state=seed))
        .reset_index(drop=True)

    )

# train_df = preprocess(train, frac=0.01)
test_df = preprocess(test, frac=0.1)

# clear up some memory by deleting the original dataframes
# del train
del test

test_df.shape # , train_df.shape

(760, 2)

### 3.1. Split data

Notice that we don't need any training data for LLMs! This is one of the key advantages of using pre-trained models. The model has already been trained on a large corpus of text data and has learned to generate text based on that training. We can use the model directly for text classification without needing to train it on any specific data.

**However**, it is good practice to set aside a small portion of your data to work with while you are developing your model. We still want our LLM system to generalize!

### 3.2. Set model parameters

In [None]:
PARAMS = TextGenParameters(
    temperature=0,              # Higher temperature means more randomness - In this case we don't want randomness
    max_new_tokens=10,          # Maximum number of tokens to generate
    stop_sequences=[".", "\n"], # Stop generating text when these sequences are encountered
)

model = ModelInference(
    api_client=client,
    model_id="ibm/granite-13b-instruct-v2",  # We could also try a larger model!
    params=PARAMS
)

### 3.3. Create a system prompt

In [None]:
SYSTEM_PROMPT = """Your task is to classify news stories into one of the following categories

CATEGORIES:
{categories}

TEXT:
{text}

Please assign the correct category to the text. Answer with the correct category and nothing else.

Category:
"""

### 3.4. Generate predictions

In [None]:
CATEGORIES = "- " + "\n- ".join(test_df["label"].unique())  # Create a string with all categories

predictions = []

for text in tqdm(test_df["text"]):

    # format the prompt with the categories and the text
    prompt = SYSTEM_PROMPT.format(categories=CATEGORIES, text=text)
    
    # generate the response from the model
    response = model.generate(prompt)

    # extract the generated text from the response
    prediction = response["results"][0]["generated_text"].strip()

    # append the prediction to the list of predictions
    predictions.append(prediction)

100%|██████████| 760/760 [01:47<00:00,  7.06it/s]
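
Because the model produces free text, nothing guarantees that every prediction matches one of our labels exactly. It may add a trailing period, change the casing, or invent a new category altogether. The helper below is one possible way to map raw answers back onto the known labels; the function name and the `"Unknown"` fallback are choices made for this sketch, not something the notebook above relies on.

```python
def normalize_prediction(raw: str, labels: list[str], fallback: str = "Unknown") -> str:
    """Map a raw model answer onto one of the known labels."""
    cleaned = raw.strip().strip(".-: ").lower()
    # First try an exact (case-insensitive) match.
    for label in labels:
        if cleaned == label.lower():
            return label
    # Then accept answers that merely start with a label, e.g. "Sports news".
    for label in labels:
        if cleaned.startswith(label.lower()):
            return label
    return fallback

LABELS = ["World", "Sports", "Business", "Sci/Tech"]
for raw in ["Sports.", " business\n", "- Sci/Tech", "Politics"]:
    print(repr(raw), "->", normalize_prediction(raw, LABELS))
```

Applying such a step before evaluation keeps stray labels out of the classification report, and counting the `"Unknown"` answers shows how often the model ignores the instructions.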


### 3.5. Evaluate performance

In [None]:
print(classification_report(test_df.label, predictions))

              precision    recall  f1-score   support

    Business       0.54      0.91      0.68       190
    Sci/Tech       0.89      0.35      0.50       190
      Sports       0.96      0.91      0.94       190
       World       0.80      0.78      0.79       190

    accuracy                           0.74       760
   macro avg       0.80      0.74      0.73       760
weighted avg       0.80      0.74      0.73       760
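
To read the report, it helps to remember that the F1 score is the harmonic mean of precision and recall, so it is pulled towards the lower of the two. The short sketch below recomputes F1 from the rounded precision and recall values in the table above. Because it starts from rounded inputs, the last digit can differ slightly from what scikit-learn reports.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall copied from the report above.
report = {
    "Business": (0.54, 0.91),
    "Sci/Tech": (0.89, 0.35),
    "Sports": (0.96, 0.91),
    "World": (0.80, 0.78),
}
for label, (precision, recall) in report.items():
    print(f"{label:<9} f1 = {f1_from_precision_recall(precision, recall):.2f}")
```

The numbers also suggest where the model goes wrong. Business has high recall but low precision, meaning the model predicts "Business" far more often than it should. Sci/Tech has the opposite pattern: most Sci/Tech articles are being assigned to some other category.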

