# IN4080: obligatory assignment 4
 
The final mandatory assignment for IN4080 consists of two parts. The first is about the development of dialogue systems, and the second about machine translation.
You are required to get at least 12/20 points to pass. 

- We assume that you have read and are familiar with IFI’s requirements and guidelines for mandatory assignments, see [here](https://www.uio.no/english/studies/examinations/compulsory-activities/mn-ifi-mandatory.html) and [here](https://www.uio.no/english/studies/examinations/compulsory-activities/mn-ifi-guidelines.html).
- This is an individual assignment. You should not deliver joint submissions. 
- You may redeliver in Devilry before the deadline (__Sunday, November 3 at 23:59__).
- Only the last delivery will be read! If you deliver more than one file, put them into a zip-archive. You don't have to include in your delivery the data files already provided for this assignment. 
- Name your submission _your\_username\_in4080\_mandatory\_4_

Part 1 should be done on your local computer, as it relies on a speech interface that will not work on remote machines. For Part 2, using _Fox_ is preferable, at least for the fine-tuning task.

You should deliver a completed version of this Jupyter notebook, containing both your code and explanations about the steps you followed. We want to stress that simply submitting code is __not__ by itself sufficient to complete the assignment - we expect the notebook to also contain explanations of what you have implemented, along with motivations for the choices you made along the way. Preferably use whole sentences, and mathematical formulas if necessary. Explaining in your own words (using concepts we have covered in the lectures) what you have implemented and reflecting on your solution is an important part of the learning process - take it seriously!

Regarding the use of LLMs (ChatGPT or similar): you are allowed to use them as a 'sparring partner', for instance to clarify something you have not understood. However, you are __not__ allowed to use them to generate solutions (either in part or in full) to the assignment tasks. 


## Part 1 : Dialogue systems

Our objective in this part is to build a spoken conversational interface for a (simulated) elevator. 

### Basic setup

First, let's make sure that we have all the necessary Python modules:

In [2]:
# %pip install ipywidgets pyaudio openai-whisper pyttsx3 setfit spacy jellyfish
# !python -m spacy download en_core_web_sm

The code for the simulated elevator is provided below. The elevator is displayed using simple widgets (where the current floor is shown in green). 

In [3]:
import ipywidgets as widgets
from IPython.display import display

import time, random, string, threading
from typing import List, Tuple, Dict, Set

class BasicElevator:
    """Elevator simulated using a GUI"""
    
    def __init__(self, start_floor: int = 1, nb_floors: int = 10):
        """Initialises a new elevator, placed by default on the first floor"""
        
        # Current floor of the elevator
        self.cur_floor: int = start_floor

        # (Possibly empty) list of next floor stops to reach 
        self.next_stops : List[int] = []
        
        # Building the basic GUI showing the elevator
        display(self._build_gui(nb_floors))

        # Starts the thread executing the movements
        thread = threading.Thread(target=self.elevator_move_thread)
        thread.start()     
            
    def move_to_floor(self, floor_number : int):
        """Move to a given floor (by adding it to a queue of floors to reach)"""
        
        if floor_number < 1 or floor_number > len(self.floors):
            raise RuntimeError("Floor number must be between 1 and %i"%len(self.floors))

        self.next_stops.append(floor_number)
        
        
    def stop(self):
        """Stops all movements of the elevator"""
        self.next_stops.clear()

    def _build_gui(self, nb_floors):
        """Creates the GUI for the elevator, with a status label and a visual representation
        of the floors, where the current floor is indicated in green."""

        # Displaying the current status of the elevator (still or going up or down)
        status_label = widgets.HTML("<b>Status</b>: ")
        self.status = widgets.Label("Still")
        status_box = widgets.HBox([status_label, self.status])

        # Displaying the floors on a vertical axis
        self.floors = []
        floor_layout = widgets.Layout(width='50px', height='30px', border='2px solid black',justify_content="center")
        for i in range(1, nb_floors+1):
            floor = widgets.Label(value=str(i), layout=floor_layout)
            floor.style = {"background":("white" if i!=self.cur_floor else "lightgreen")}
            self.floors.append(floor)

        # Create a vertical box container to hold the boxes
        vbox = widgets.VBox([status_box] + self.floors[::-1])
        return vbox
    

    def elevator_move_thread(self, speed=1.0, latency=0.1):
        """Trigger a movement of the elevator if the list of next stops is not 
        empty. The movement continues until all goals are reached."""

        while True:
            while self.next_stops:
                if self.cur_floor == self.next_stops[0]:
                    del self.next_stops[0]
                    continue
                if self.cur_floor < self.next_stops[0]:
                    next_floor = self.cur_floor+1
                    self.status.value = "UP"
                elif self.cur_floor > self.next_stops[0]:
                    next_floor = self.cur_floor-1
                    self.status.value = "DOWN"
                time.sleep(speed)   
                self.floors[self.cur_floor-1].style.background = "white"
                self.floors[next_floor-1].style.background = "lightgreen"
                self.cur_floor = next_floor
            self.status.value = "Still"
            
            # Wait loop (until we have a goal in self.next_stops)
            time.sleep(latency)

The elevator can be easily controlled through the functions `move_to_floor` and `stop`:

In [4]:
elevator = BasicElevator()
elevator.move_to_floor(5)


We will now make our elevator controllable through a speech interface instead of using function calls.

### Speech interface

First, make sure that you have installed `pyaudio` (for audio processing), `whisper` (for speech recognition), and `pyttsx3` (for speech synthesis).

The `TalkingElevator` class below extends the basic simulated elevator with speech input and output. 

When the record button is clicked, speech is recorded from the user's microphone until the stop button is clicked. The speech recognition engine `Whisper` from OpenAI is then employed to transcribe the spoken input (on GPU if your machine has one, otherwise on CPU). The transcription result is then sent to the `process_input` method, which is responsible for determining the system response. 

We are going to focus on implementing this `process_input` method. Note that the system's reaction to a new user input may comprise both verbal responses (to be uttered by the system through the `_say_to_user` method) and physical actions (through the `move_to_floor` and `stop` methods).
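
As a minimal sketch of where the `conf_score` argument of `process_input` comes from, the snippet below turns per-segment average log-probabilities into a single score between 0 and 1, following the same recipe as the `on_stop_button_clicked` callback of the class below (exponentiated mean, scaled by 1.2 and capped at 1). The function name `confidence_from_segments`, the handling of empty transcriptions and the example values are assumptions made for illustration; only the `avg_logprob` field name is taken from the Whisper output used in this notebook.

```python
import math
from typing import Dict, List

def confidence_from_segments(segments: List[Dict[str, float]], boost: float = 1.2) -> float:
    """Map Whisper-style segment log-probabilities to a score in [0, 1]."""
    if not segments:
        # Nothing was transcribed: treat it as zero confidence (assumption)
        return 0.0
    mean_logprob = sum(seg["avg_logprob"] for seg in segments) / len(segments)
    # exp() turns the mean log-probability back into a probability-like value
    return min(1.0, math.exp(mean_logprob) * boost)

# Hypothetical log-probabilities for two transcribed segments
example_segments = [{"avg_logprob": -0.2}, {"avg_logprob": -0.6}]
score = confidence_from_segments(example_segments)
print(round(score, 3))
```

A score computed this way is only a rough proxy for recognition quality, so any threshold applied to it is best tuned by trying a few spoken inputs.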

In [None]:
import threading, time
import numpy as np
from typing import List, Tuple, Dict, Set
import whisper, pyaudio, pyttsx3
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

class TalkingElevator(BasicElevator):
    """Extension of the simulated elevator with a speech interface"""
    
    def __init__(self):
        print("Loading TTS and ASR models", end="...", flush=True)
        self.tts_engine = pyttsx3.init()  
        self.asr_engine = whisper.load_model("small.en")
        print("Done")
        
        # Initializing the GUI
        BasicElevator.__init__(self)

        # Starts the dialogue
        self.dialogue_history = []
        self._say_to_user("Hi, what can I do for you today?")
    

    def process_input(self, user_input: str, conf_score:float=1.0):
        """Processes the (transcribed) user input, and respond appropriately 
        (through a verbal response and possibly also an action, such as moving floors)"""

        self._add_to_dialogue_history(user_input, speaker="user", conf_score=conf_score)

        # Dummy response. Should be replaced by the actual dialogue behaviour
        self._say_to_user("Sorry, I don't understand you, pal")

    
    def _say_to_user(self, system_response: str):
        """Say something back to the user, and add the dialogue turn to the history. The 
        synthesis is done using the pyttsx3 library."""

        self._add_to_dialogue_history(system_response, speaker="elevator")
        
        # Stopping current TTS if one is active
        try:
            self.tts_engine.endLoop()
        except Exception:
            # endLoop() raises an error when no TTS loop is running, which is harmless here
            pass
        self.tts_engine.say(system_response)
        self.tts_engine.runAndWait()


    def _add_to_dialogue_history(self, turn:str , speaker:str, conf_score:float=1.0):
        """Adds a new (user or system) turn to the dialogue history list, and displays it
         on the chat window displaying the turns"""

        self.dialogue_history.append({"speaker":speaker, "text":turn, 
                                      "conf_score":conf_score, "timestamp":time.time()})
        
        self.history_area.value += "&nbsp;<strong>%s</strong>:  %s"%(speaker.title(), turn)
        if conf_score < 1.0:
            self.history_area.value += " (%.2f)"%(conf_score)
        self.history_area.value += "<br>"
   
   
    def _build_gui(self, nb_floors):
        """GUI for the Talking elevator, comprising (beyond the simulated elevator from 
        BasicElevator) a chat window showing the dialogue turns, and buttons to record
        the user input. 
        The user should first click on the record button, then on stop when they have finished.
        Once the stop button is clicked, the audio is transcribed by Whisper, and finally 
        forwarded to the process_input function."""

        core_gui = BasicElevator._build_gui(self, nb_floors)

        self.frames = []
        self.recording = False

        def record(chunk_size=1024):
            """Record audio chunks to a buffer."""
            self.recording = True
            p = pyaudio.PyAudio()
            stream = p.open(format=pyaudio.paInt16, channels=1, rate=16000, input=True, 
                            frames_per_buffer=chunk_size)
            while self.recording:
                self.frames.append(stream.read(chunk_size))
            # Release the audio device once the recording is over
            stream.stop_stream()
            stream.close()
            p.terminate()

        def on_record_button_clicked(b):
            "Starts the recording"
            record_button.disabled=True
            stop_button.disabled=False
            self.frames = []  # Clear previous recordings
            thread = threading.Thread(target=record)
            thread.start()

        def on_stop_button_clicked(b):
            "Stops the recording, runs Whisper, and forwards the result to process_input"
            self.recording = False
            record_button.disabled=False
            stop_button.disabled=True
            audio_data = np.frombuffer(b"".join(self.frames), np.int16).astype(np.float32) * (1 / 32768.0)
            output = self.asr_engine.transcribe(audio_data)

            # We define the confidence score based on the log-probabilities
            conf_score = np.exp(np.mean([segment["avg_logprob"] for segment in output["segments"]]))
            # (and we push those up a bit, as the Whisper scores seem too low)
            conf_score = min(1, conf_score*1.2)

            # Finally, we process the input
            self.process_input(output["text"], conf_score)

        # The record and stop buttons
        record_button = widgets.Button(icon="microphone")
        stop_button = widgets.Button(icon="stop", disabled=True)
        record_button.on_click(on_record_button_clicked)
        stop_button.on_click(on_stop_button_clicked)
        
        # The chat area
        self.history_area = widgets.HTML(layout=widgets.Layout(width="600px", height="300px", 
                                                               border='1px solid black', overflow='scroll'))

        # The right side of the GUI
        right_side = widgets.VBox([widgets.Label(""), self.history_area, widgets.HBox([record_button, stop_button])])
        
        extended_gui = widgets.HBox([core_gui, right_side])
        return extended_gui


Let's give it a try:

In [6]:
elevator = TalkingElevator()

Loading TTS and ASR models...Done



We hope that the speech recognition and speech synthesis will work correctly. If that isn't the case, do let us know! (Audio processing in Python can be quite tricky and works differently from OS to OS.) [^1]

[^1]: If you are running on Linux and the TTS is not working, install the following packages on your machine: `sudo apt update && sudo apt install espeak ffmpeg libespeak1`

**Note**: The current implementation reloads the TTS and ASR models every time, which means that you may run into a "CUDA: out of memory" error if you reinitialise the `TalkingElevator` many times. If this happens, simply restart the Python kernel, which will clear the memory on both the CPU and the GPU. 


### Intent recognition

We wish our talking elevator to support the following functionalities:
 
- If the user expresses a wish to go to floor $X$ (where $X$ is an integer value between 1 and 10), the elevator should go to that floor. The interface should allow for several ways to express a given intent, such as "_Please go to the $X$-th floor_" or "_Floor $X$, please_".
- The user requests can also be relative, for instance "_Go one floor up_".
- The elevator should provide _grounding_ feedback to the user. For instance, it should respond "_Ok, going to the $X$-th floor_" after a user request to move to $X$.  
- The elevator should handle misunderstandings and uncertainties, e.g. by requesting the user to repeat, or asking the user to confirm if the intent is uncertain (say, when its confidence score is lower than 0.5). 
- The elevator should also allow the user to ask where the office of a given employee is located. For instance, the user could ask "_where is Erik Velldal's office?_", and the elevator would provide a response such as "_The office of Erik Velldal is on the 4th floor. Do you wish to go there?_".  We provide you with the office numbers of a small set of IFI employees in the `OFFICES` dictionary (see below).
- The elevator should also be able to inform the user about the current floor (such as replying to "_Which floor are we on?_" or "_Are we on the 5th floor?_"). 
- Finally, if the user asks the elevator to stop (or if the user says "_no_" after a grounding feedback "_Ok, going to floor $X$._"), the elevator should stop, and ask for clarification regarding the actual user intent. 
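
The handling of uncertainty described in the list above can be sketched as a small decision function. This is only one possible design: the function name `choose_grounding_strategy`, the lower threshold of 0.2 and the choice of multiplying the speech recognition confidence with the intent probability are all assumptions, while the 0.5 threshold comes from the requirement above.

```python
from typing import Dict, Tuple

def choose_grounding_strategy(intent_distrib: Dict[str, float], asr_conf: float,
                              confirm_threshold: float = 0.5,
                              reject_threshold: float = 0.2) -> Tuple[str, str]:
    """Return (strategy, intent), where strategy is 'execute', 'confirm' or 'repeat'."""
    best_intent = max(intent_distrib, key=intent_distrib.get)
    # Combine both sources of uncertainty (a simple product; other choices are possible)
    combined = asr_conf * intent_distrib[best_intent]
    if combined < reject_threshold:
        return "repeat", best_intent    # too uncertain: ask the user to repeat
    if combined < confirm_threshold:
        return "confirm", best_intent   # plausible: ask for an explicit confirmation
    return "execute", best_intent       # confident: act and give grounding feedback

# Hypothetical intent distribution, evaluated with three recognition confidences
distrib = {"RequestMoveToFloor": 0.9, "Stop": 0.1}
print(choose_grounding_strategy(distrib, asr_conf=0.95))
print(choose_grounding_strategy(distrib, asr_conf=0.45))
print(choose_grounding_strategy(distrib, asr_conf=0.10))
```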

To implement this conversational behaviour, we will rely on a classical NLU-based approach in which we will recognise the user _intent_, and then determine a response based on the recognised intent(s). 

__Task 1.1__ (1 point): You first need to define a list of user intents that cover the kinds of user inputs you expect to observe in this talking elevator, such as `RequestMoveToFloor` or `Confirm`. This is a design question, and there is no obvious right or wrong answer. Define below the intents you want to cover, along with an explanation and a few examples of user inputs for each.


<!-- Provide here the list of intent classes you have defined, together with an explanation and a few examples -->
**Answer:**

1. `RequestMoveToFloor`: This intent is used when the user wants to move to a specific floor. Examples: "Please go to the 5th floor", "Floor 3, please", "I want to go to the 7th floor".
2. `RequestMoveRelative`: This intent is used when the user wants to move to a floor relative to the current floor. Examples: "Go one floor up", "Move two floors down", "Take me to the floor above".
3. `RequestOfficeLocation`: This intent is used when the user wants to know the location of an employee's office. Examples: "Where is Erik's office?", "Can you tell me where the office of Erik is?", "I need to know where Erik's office is".
4. `RequestCurrentFloor`: This intent is used when the user wants to know the current floor. Examples: "Which floor are we on?", "Are we on the 5th floor?", "Can you tell me the current floor?".
5. `Confirm`: This intent is used when the user confirms the elevator's response. Examples: "Yes", "That's correct", "I confirm".
6. `Stop`: This intent is used when the user wants the elevator to stop. Examples: "Stop", "I want to stop", "Please stop".
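7. `Repeat`: This intent is used when the user wants the elevator to repeat the last response. Examples: "Can you repeat that?", "I didn't hear you", "What did you say?".
8. `OutOfCoverage`: This intent is used for inputs that fall outside the functionalities of the elevator, so that they are not forced into one of the other intents. Examples: "Close the door", "How many floors are there?", "Let's take the stairs".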

__Task 1.2__ (1 point): We wish to build a classifier that maps any user input to a probability distribution over those intents, and start by creating a small, synthetic training set. Make a list of about 100 user utterances, each labelled with one of the intents defined above. You can "make up" those utterances yourself, or ask someone else to come up with alternative formulations if you lack inspiration.
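
A hand-written training set can easily end up dominated by one or two intents, so it may help to check the label distribution before training. The sketch below counts the examples per intent; `label_counts` is a hypothetical helper and `sample` is a tiny illustrative list, not the actual training set.

```python
from collections import Counter
from typing import List, Tuple

def label_counts(examples: List[Tuple[str, str]]) -> Counter:
    """Count how many utterances are available for each intent label."""
    return Counter(label for _, label in examples)

# Tiny illustrative sample of (utterance, intent) pairs
sample = [
    ("Go to floor 1", "RequestMoveToFloor"),
    ("Floor 3, please", "RequestMoveToFloor"),
    ("Go one floor up", "RequestMoveRelative"),
    ("Stop", "Stop"),
]

# Print the intents from most to least frequent
for label, count in label_counts(sample).most_common():
    print(f"{label}: {count}")
```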

In [41]:
labelled_utterances = [
    # RequestMoveToFloor Intent
    ("Go to floor 1", "RequestMoveToFloor"),
    ("Take me to floor 2", "RequestMoveToFloor"),
    ("Please go up to the 5th floor", "RequestMoveToFloor"),
    ("Take me to the top floor", "RequestMoveToFloor"),
    ("I’d like to go to floor 3", "RequestMoveToFloor"),
    ("Move to floor 4", "RequestMoveToFloor"),
    ("Take us to the first floor", "RequestMoveToFloor"),
    ("Bring me to the highest floor", "RequestMoveToFloor"),
    ("Could we go to floor 6?", "RequestMoveToFloor"),
    ("Can you take me to floor 8?", "RequestMoveToFloor"),
    ("Please move to floor 7", "RequestMoveToFloor"),
    ("I’d like to go down to the first floor", "RequestMoveToFloor"),
    ("Take us up to floor 10", "RequestMoveToFloor"),
    ("Let's go to floor 9", "RequestMoveToFloor"),
    ("Bring me to floor 2", "RequestMoveToFloor"),
    ("Elevator, please move to floor 3", "RequestMoveToFloor"),
    ("Take me down to the ground floor", "RequestMoveToFloor"),
    ("Take me up a floor", "RequestMoveToFloor"),
    ("Let's head to floor 5", "RequestMoveToFloor"),
    ("I want to go to floor 1", "RequestMoveToFloor"),
    ("Take me to the basement", "RequestMoveToFloor"),
    ("Could you take me down?", "RequestMoveToFloor"),
    ("Let’s head down", "RequestMoveToFloor"),
    ("Will you take me to the rooftop?", "RequestMoveToFloor"),
    ("I'd like to visit floor 7", "RequestMoveToFloor"),
    ("Could you stop at floor 3?", "RequestMoveToFloor"),
    ("Take me to the middle floor", "RequestMoveToFloor"),
    ("Bring me down to the lobby", "RequestMoveToFloor"),
    ("Take me down to floor 2", "RequestMoveToFloor"),
    ("Let's ride to floor 6", "RequestMoveToFloor"),
    ("Please go up to floor 4", "RequestMoveToFloor"),
    ("Take us down to the entrance", "RequestMoveToFloor"),
    ("Elevator, take me to floor 8", "RequestMoveToFloor"),
    ("Go up a floor, please", "RequestMoveToFloor"),
    ("Take us to floor 3", "RequestMoveToFloor"),
    ("Let's stop at floor 5", "RequestMoveToFloor"),
    ("Move to floor 10", "RequestMoveToFloor"),
    ("Let’s reach the top floor", "RequestMoveToFloor"),
    ("Let's proceed to floor 4", "RequestMoveToFloor"),
    ("Bring us to the lowest level", "RequestMoveToFloor"),
    ("I'd like to head up to floor 9", "RequestMoveToFloor"),
    ("Let’s go to floor 1", "RequestMoveToFloor"),
    ("Head to the second level", "RequestMoveToFloor"),

    # RequestMoveRelative Intent
    ("Go one floor up", "RequestMoveRelative"),
    ("Move two floors down", "RequestMoveRelative"),
    ("Take me to the floor above", "RequestMoveRelative"),
    ("Go up one floor", "RequestMoveRelative"),
    ("Move down a floor", "RequestMoveRelative"),
    ("Take me one floor down", "RequestMoveRelative"),
    ("Go up two floors", "RequestMoveRelative"),
    ("Move up one level", "RequestMoveRelative"),
    ("Take me down two floors", "RequestMoveRelative"),
    ("Move up a floor", "RequestMoveRelative"),

    # RequestOfficeLocation Intent
    ("Where is Erik's office?", "RequestOfficeLocation"),
    ("Can you tell me where the office of Erik is?", "RequestOfficeLocation"),
    ("I need to know where Erik's office is", "RequestOfficeLocation"),
    ("Where is Erik Velldal's office?", "RequestOfficeLocation"),
    ("Can you tell me where Erik Velldal's office is?", "RequestOfficeLocation"),
    ("I need to know where Erik Velldal's office is", "RequestOfficeLocation"),
    ("Where is the office of Erik Velldal?", "RequestOfficeLocation"),
    ("Can you tell me where the office of Erik Velldal is?", "RequestOfficeLocation"),
    ("I need to know where the office of Erik Velldal is", "RequestOfficeLocation"),

    # RequestCurrentFloor Intent
    ("Which floor are we on?", "RequestCurrentFloor"),
    ("Are we on the 5th floor?", "RequestCurrentFloor"),
    ("Can you tell me the current floor?", "RequestCurrentFloor"),
    ("What floor are we on?", "RequestCurrentFloor"),
    ("Are we on floor 5?", "RequestCurrentFloor"),
    ("Which floor is this?", "RequestCurrentFloor"),
    ("What floor is this?", "RequestCurrentFloor"),
    ("Can you tell me which floor we are on?", "RequestCurrentFloor"),
    ("Are we on the fifth floor?", "RequestCurrentFloor"),

    # Confirm Intent
    ("Yes", "Confirm"),
    ("That's correct", "Confirm"),
    ("I confirm", "Confirm"),
    ("Sure", "Confirm"),
    ("Absolutely", "Confirm"),
    ("Correct", "Confirm"),
    ("Indeed", "Confirm"),
    ("Affirmative", "Confirm"),
    ("Right", "Confirm"),

    # Stop Intent
    ("Stop", "Stop"),
    ("I want to stop", "Stop"),
    ("Please stop", "Stop"),
    ("Can you stop here?", "Stop"),
    ("Stop the elevator", "Stop"),
    ("Stop moving", "Stop"),
    ("Stop at the next floor", "Stop"),
    ("Stop the elevator at once", "Stop"),
    ("End the ride", "Stop"),
    ("Stop right here", "Stop"),
    ("Stop the lift", "Stop"),
    ("Can you wait here?", "Stop"),
    ("Stop immediately", "Stop"),
    ("Let’s halt here", "Stop"),
    ("Stop the elevator for me", "Stop"),
    ("Elevator, stop now", "Stop"),
    ("Can you pause here?", "Stop"),

    # Repeat Intent
    ("Can you repeat that?", "Repeat"),
    ("I didn't hear you", "Repeat"),
    ("What did you say?", "Repeat"),
    ("Could you say that again?", "Repeat"),
    ("Please repeat", "Repeat"),
    ("Say that again", "Repeat"),
    ("Repeat that", "Repeat"),
    ("I missed that", "Repeat"),
    ("Can you say that again?", "Repeat"),

    # OutOfCoverage Intent
    ("I don’t need a ride", "OutOfCoverage"),
    ("Close the door", "OutOfCoverage"),
    ("Let's take the stairs", "OutOfCoverage"),
    ("How many floors are there?", "OutOfCoverage"),
    ("Are you on a break?", "OutOfCoverage"),
    ("I don't need the elevator", "OutOfCoverage"),
    ("Is the elevator moving fast?", "OutOfCoverage"),
    ("Are you stopping at every floor?", "OutOfCoverage"),
    ("Is anyone else here?", "OutOfCoverage"),
    ("I really like the IN4080 course", "OutOfCoverage")
]

We will now train an intent classifier based on the labelled utterances you have defined. To do so, we will rely on the [SetFit](https://huggingface.co/docs/setfit/index) library, which allows one to easily train a text classification model from few examples by fine-tuning a sentence-transformer model (like the ones we used in oblig 2 and 3). Make sure that the `setfit` library is installed (`pip install setfit`).

Read the [SetFit quickstart guide](https://huggingface.co/docs/setfit/quickstart) to find out how to use the library.

__Task 1.3__ (2 points): Implement the `__init__`, `train` and `get_intent_distrib` methods of the `IntentClassifier` class below. The classifier should rely on a `SetFit` model trained on the labelled utterances you have already defined. 

In [43]:
import setfit
from setfit import SetFitModel, Trainer, TrainingArguments
import datasets
from typing import List, Tuple, Dict

class IntentClassifier:

    def __init__(self, model_name="sentence-transformers/paraphrase-mpnet-base-v2"):
        """Initializes the SetFit model for intent recognition."""
        self.model = SetFitModel.from_pretrained(model_name)
        self.id2label = {}  # To store mapping from label IDs to label names
        self.label2id = {}  # To store mapping from label names to label IDs
        self.trainer = None  # Placeholder for the trainer

    def train(self, labelled_utterances: List[Tuple[str, str]]):
        """Trains the SetFit model on the labelled utterances."""

        # Extract unique labels for the intents and create label mappings
        intents = sorted(set(label for _, label in labelled_utterances))  # Sorted for consistent ID mapping
        self.label2id = {intent: i for i, intent in enumerate(intents)}
        self.id2label = {i: intent for intent, i in self.label2id.items()}

        # Creates the dataset from the list of labelled utterances
        train_data = datasets.Dataset.from_list([
            {"text": utt, "label": self.label2id[label]} for utt, label in labelled_utterances
        ])
        
        args = TrainingArguments(
            batch_size=32,
            num_epochs=10,
        )
        
        self.trainer = Trainer(
            model=self.model,
            args=args,
            train_dataset=train_data,
        )

        self.trainer.train()

    def get_intent_distrib(self, utterance: str) -> Dict[str, float]:
        """Applies the trained model on a new utterance and returns a dictionary mapping
        each intent to its probability."""

        # Get probabilities for each intent
        probabilities = self.model.predict_proba([utterance])[0]

        # Map each probability to its corresponding intent label
        intent_distribution = {
            self.id2label[i]: prob.item() for i, prob in enumerate(probabilities)
        }
        
        return intent_distribution

In [44]:
classifier = IntentClassifier()
classifier.train(labelled_utterances)

model_head.pkl not found on HuggingFace Hub, initialising classification head with random weights. You should TRAIN this model on a downstream task to use it for predictions and inference.


Map:   0%|          | 0/116 [00:00<?, ? examples/s]

***** Running training *****
  Num unique pairs = 10794
  Batch size = 32
  Num epochs = 10


  0%|          | 0/3380 [00:00<?, ?it/s]

{'embedding_loss': 0.1644, 'grad_norm': 0.9339766502380371, 'learning_rate': 5.91715976331361e-08, 'epoch': 0.0}
{'embedding_loss': 0.154, 'grad_norm': 0.7327700853347778, 'learning_rate': 2.958579881656805e-06, 'epoch': 0.15}
{'embedding_loss': 0.1145, 'grad_norm': 0.6400843858718872, 'learning_rate': 5.91715976331361e-06, 'epoch': 0.3}
{'embedding_loss': 0.0636, 'grad_norm': 0.34832996129989624, 'learning_rate': 8.875739644970414e-06, 'epoch': 0.44}
{'embedding_loss': 0.0259, 'grad_norm': 0.40534529089927673, 'learning_rate': 1.183431952662722e-05, 'epoch': 0.59}
{'embedding_loss': 0.0106, 'grad_norm': 0.14236658811569214, 'learning_rate': 1.4792899408284025e-05, 'epoch': 0.74}
{'embedding_loss': 0.003, 'grad_norm': 0.09258803725242615, 'learning_rate': 1.7751479289940828e-05, 'epoch': 0.89}
{'embedding_loss': 0.0022, 'grad_norm': 0.0409281961619854, 'learning_rate': 1.9921104536489153e-05, 'epoch': 1.04}
{'embedding_loss': 0.0014, 'grad_norm': 0.385511577129364, 'learning_rate': 1.9

Computing widget examples:   0%|          | 0/1 [00:00<?, ?example/s]

{'embedding_loss': 0.0008, 'grad_norm': 0.020420070737600327, 'learning_rate': 1.8606180144641684e-05, 'epoch': 1.63}
{'embedding_loss': 0.0007, 'grad_norm': 0.016306841745972633, 'learning_rate': 1.827744904667982e-05, 'epoch': 1.78}
{'embedding_loss': 0.0007, 'grad_norm': 0.01920117624104023, 'learning_rate': 1.794871794871795e-05, 'epoch': 1.92}
{'embedding_loss': 0.0005, 'grad_norm': 0.010478561744093895, 'learning_rate': 1.7619986850756083e-05, 'epoch': 2.07}
{'embedding_loss': 0.0007, 'grad_norm': 0.008270319551229477, 'learning_rate': 1.7291255752794215e-05, 'epoch': 2.22}
{'embedding_loss': 0.0005, 'grad_norm': 0.015562576241791248, 'learning_rate': 1.696252465483235e-05, 'epoch': 2.37}
{'embedding_loss': 0.0005, 'grad_norm': 0.009582655504345894, 'learning_rate': 1.6633793556870482e-05, 'epoch': 2.51}
{'embedding_loss': 0.0005, 'grad_norm': 0.018521921709179878, 'learning_rate': 1.6305062458908614e-05, 'epoch': 2.66}
{'embedding_loss': 0.0005, 'grad_norm': 0.011736877262592316

In [45]:
classifier.get_intent_distrib("go to floor 3")

{'Confirm': 0.001251610362192498,
 'OutOfCoverage': 0.0012193290092969659,
 'Repeat': 0.0012298758593317044,
 'RequestCurrentFloor': 0.0012648997132574955,
 'RequestMoveRelative': 0.0012447928632440548,
 'RequestMoveToFloor': 0.9912965930143235,
 'RequestOfficeLocation': 0.0012253039149202332,
 'Stop': 0.0012675952634335391}

Since we don't have any test data, we cannot really evaluate the classification performance, but this step would of course be strongly advised when developing a real system. 
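
If more labelled utterances were available, a simple option would be to hold out part of them for testing. The sketch below shows one way to do so with a per-intent split and an accuracy computation. The helper names, the 20% ratio and the toy data are assumptions, and the keyword-based `toy_predict` only stands in for a real classifier.

```python
import random
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]  # (utterance, intent label)

def stratified_split(examples: List[Example], test_ratio: float = 0.2,
                     seed: int = 0) -> Tuple[List[Example], List[Example]]:
    """Hold out roughly `test_ratio` of the examples of each label (at least one)."""
    by_label: Dict[str, List[Example]] = defaultdict(list)
    for example in examples:
        by_label[example[1]].append(example)
    rng = random.Random(seed)
    train, test = [], []
    for label_examples in by_label.values():
        rng.shuffle(label_examples)
        nb_test = max(1, round(len(label_examples) * test_ratio))
        test.extend(label_examples[:nb_test])
        train.extend(label_examples[nb_test:])
    return train, test

def accuracy(predict: Callable[[str], str], test: List[Example]) -> float:
    """Fraction of held-out utterances whose predicted label matches the gold label."""
    return sum(predict(text) == label for text, label in test) / len(test)

def toy_predict(text: str) -> str:
    """Trivial keyword-based predictor, standing in for a trained model."""
    return "Stop" if "stop" in text else "Confirm"

# Toy data: five utterances for each of two intents
data = [("stop %i" % i, "Stop") for i in range(5)] + [("yes %i" % i, "Confirm") for i in range(5)]
train_set, test_set = stratified_split(data)
print(len(train_set), len(test_set), accuracy(toy_predict, test_set))
```

With the classifier above, `predict` could return the intent with the highest probability in the dictionary produced by `get_intent_distrib`, after training on the training part only.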

### Slot filling

In addition to the intents themselves, we also wish to detect some slots, such as floor numbers or person names. For this step, we will not use a data-driven model, but rather rely on an old-fashioned, rule-based approach:
- For floor numbers, we will rely on string matching (with regular expressions or basic string search) that detects patterns such as "X floor" (where X is one of first, second, third, fourth, fifth, sixth, seventh, eighth, ninth, tenth) or "floor X" (where X is between 1 and 10).
- For person names, we have a predefined list of person names to detect (employees at IFI), and we should simply search for their occurrence in the user input. The simplest implementation is to just look for exact occurrences. However, since speech recognition will often struggle to recognize foreign person names, an even better approach would be to search for names that are phonetically close (you can use the `jellyfish` library for this).
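
As an illustration of the first bullet point, here is a minimal regular-expression sketch for floor numbers. The function name `extract_floor_number` and the exact patterns are assumptions. The sketch covers "floor 7", "7th floor" and "seventh floor", but not spelled-out cardinals such as "floor seven".

```python
import re
from typing import Optional

ORDINALS = ["first", "second", "third", "fourth", "fifth",
            "sixth", "seventh", "eighth", "ninth", "tenth"]
ORDINAL_TO_INT = {word: i for i, word in enumerate(ORDINALS, start=1)}

# Matches "floor 7" or "floor number 7" (numbers from 1 to 10 only)
FLOOR_THEN_NUMBER = re.compile(r"\bfloor\s+(?:number\s+)?(10|[1-9])\b", re.IGNORECASE)
# Matches "7th floor" (group 1) or "seventh floor" (group 2)
NUMBER_THEN_FLOOR = re.compile(
    r"\b(10|[1-9])(?:st|nd|rd|th)\s+floor\b|\b(%s)\s+floor\b" % "|".join(ORDINALS),
    re.IGNORECASE)

def extract_floor_number(text: str) -> Optional[int]:
    """Return the floor number mentioned in the text (1 to 10), or None."""
    match = FLOOR_THEN_NUMBER.search(text)
    if match:
        return int(match.group(1))
    match = NUMBER_THEN_FLOOR.search(text)
    if match:
        if match.group(1):
            return int(match.group(1))
        return ORDINAL_TO_INT[match.group(2).lower()]
    return None

for utterance in ["Please go to the 5th floor", "Floor 10, please",
                  "Take me to the ninth floor", "Which floor are we on?"]:
    print(utterance, "->", extract_floor_number(utterance))
```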

The result of the slot filling should be a dictionary mapping slot names to a canonical form of the slot value. For instance, if the utterance contains the expression "ninth floor", the resulting slot dictionary should be `{"floor_number":9}`. Similarly, the value of the `employee_name` slot should be a name present in the `OFFICES` dictionary. 

__Task 1.4__ (2 points): Implement the method `fill_slots` that will detect the occurrence of those slots in the user input.<br>
(+ 1 bonus point if you implement a fuzzy matching strategy to find person names that are phonetically close)
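
To give an idea of what fuzzy name matching can look like, the sketch below slides a window over the words of the user input and compares each span with the known names. Note that it uses the character-level similarity of the standard `difflib` module as a stand-in, which is not a phonetic measure; a phonetic comparison from `jellyfish` could be plugged in at the same place. The function name `closest_name`, the 0.8 threshold and the short `KNOWN_NAMES` list are assumptions.

```python
import difflib
import re
from typing import List, Optional

# Small subset of names, for illustration only
KNOWN_NAMES = ["Erik Velldal", "Lilja Øvrelid", "Stephan Oepen"]

def closest_name(user_input: str, names: List[str], threshold: float = 0.8) -> Optional[str]:
    """Return the known name most similar to some span of the input, if similar enough."""
    # Keep only alphabetic tokens (this also splits off the possessive "'s")
    tokens = re.findall(r"[^\W\d_]+", user_input)
    best_name, best_score = None, 0.0
    for name in names:
        width = len(name.split())
        # Slide a window with as many tokens as the candidate name
        for start in range(max(1, len(tokens) - width + 1)):
            span = " ".join(tokens[start:start + width])
            score = difflib.SequenceMatcher(None, span.lower(), name.lower()).ratio()
            if score > best_score:
                best_name, best_score = name, score
    return best_name if best_score >= threshold else None

print(closest_name("Where is Eric Veldal's office?", KNOWN_NAMES))
print(closest_name("Where is Lilja Ovrelid?", KNOWN_NAMES))
print(closest_name("Take me to floor 3", KNOWN_NAMES))
```

Comparing spans of the same length as the candidate name avoids penalising the match for the rest of the sentence, at the cost of missing users who only give a first name.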

In [46]:
# Floor numbers for a subset of the IFI employees
OFFICES = {'Adín Ramírez Rivera': 4, 'Andreas Austeng': 4, 'Anne H Schistad Solberg': 4, 
           'Arild Torolv Søetorp Waaler': 9, 'Audun Jøsang': 9, 'Birthe Soppe': 4, 'Carsten Griwodz': 4,
           'Dag Sjøberg': 9, 'Dag Trygve Eckhoff Wisland': 5, 'Einar Broch Johnsen': 8, 
           'Eric Bartley Jul': 10, 'Erik Velldal': 4, 'Henrik Skaug Sætra': 7, 'Ingrid Chieh Yu': 8,
           'Jørn Anders Braa': 6, 'Kristin Bråthen': 4, 'Kyrre Glette': 4, 'Lars Groth': 6, 
           'Lilja Øvrelid': 4, 'Maja Van Der Velden': 7, 'Martin Giese': 9, 'Michael Welzl': 5, 
           'Miria Grisot': 6, 'Nils Gruschka': 9, 'Olaf Owe': 9, 'Ole Christian Lingjærde': 4, 
           'Ole Hanseth': 6, 'Paulo Ferreira': 10, 'Philipp Dominik Häfliger': 5, 'Philipp Häfliger': 5, 
           'Roman Vitenberg': 4, 'Silvia Lizeth Tapia Tarifa': 8, 'Stephan Oepen': 4, 
           'Sundeep Sahay': 6, 'Thomas Peter Plagemann': 4, 'Tone Bratteteig': 7, 'Torbjørn Rognes': 8, 
           'Truls Erikson': 6, 'Viktoria Stray': 10, 'Yngvar Berg': 5, 'Yves Scherrer': 4, 
           'Özgü Mira Alay-Erduran': 4}

In [47]:
import re
import jellyfish
from typing import Dict, Union

import nltk
from nltk.corpus import stopwords

# Download stopwords if not already downloaded
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

def fuzzy_matching_name(user_input: str) -> Union[str, None]:
    """
    Fuzzy matches the user input to the closest employee name in the OFFICES dictionary,
    using the Jaro-Winkler string similarity (a spelling-based, not strictly phonetic, measure).
    Returns the best match if the similarity score is above 0.7, otherwise returns None.
    """
    best_match, highest_score = None, 0

    # Remove stopwords from user input
    input_parts = [word for word in user_input.split() if word.lower() not in stop_words]
    possible_name_string = " ".join(input_parts)

    # Extract potential names from the cleaned input
    possible_names = re.findall(r'[A-Z][a-z]*\s*[A-Z]*[a-z]*', possible_name_string)

    # Calculate similarity for each possible name component
    for possible_name in possible_names:
        for name in OFFICES.keys():
            match_score = jellyfish.jaro_winkler_similarity(possible_name, name)
            # Update the best match if score is higher
            if match_score > highest_score:
                best_match, highest_score = name, match_score

    return best_match if highest_score > 0.7 else None

def fill_slots(user_input: str) -> Dict[str, Union[int, str]]:
    """Extracts the set of slots detected in the user inputs. More precisely, the method
    should detect both floor numbers and person names, and return a dictionary mapping slot 
    names (in this case either `floor_number` or `employee_name`) to its corresponding
    value, in canonical form (integer for the floor number, string for the employee name)"""

    slots = {}

    # Mapping words to floor numbers
    floor_word_to_number = {
        "first": 1, "second": 2, "third": 3, "fourth": 4, "fifth": 5,
        "sixth": 6, "seventh": 7, "eighth": 8, "ninth": 9, "tenth": 10
    }

    # Regular expression to detect "floor X" or "X floor", where X is either a
    # number or an ordinal word. Requiring the word "floor" next to X avoids
    # false positives such as "wait a second".
    floor_value = r"\d+|" + "|".join(floor_word_to_number.keys())
    floor_pattern = re.compile(
        r"\b(?:floor\s+({0})|({0})\s+floor)\b".format(floor_value), re.IGNORECASE)

    # Search for floor patterns in the input
    floor_match = floor_pattern.search(user_input)
    if floor_match:
        # Group 1 matches "floor X", group 2 matches "X floor"
        value = (floor_match.group(1) or floor_match.group(2)).lower()
        if value.isdigit():
            slots["floor_number"] = int(value)
        else:
            slots["floor_number"] = floor_word_to_number[value]

    # Search for exact matches of employee names
    exact_matches = [name for name in OFFICES.keys() if name.lower() in user_input.lower()]
    if exact_matches:
        slots["employee_name"] = exact_matches[0]
    else:
        fuzzy_match = fuzzy_matching_name(user_input)
        if fuzzy_match:
            slots["employee_name"] = fuzzy_match

    return slots

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/khoimai/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [48]:
# Test cases for the fill_slots function

print(fill_slots("Could you take me to the ninth floor?"))  
print(fill_slots("I would like to visit Andreas Austeng"))  
print(fill_slots("Take me to floor 3"))  
print(fill_slots("Where is Olaf Owe located?")) 
print(fill_slots("Could you tell me where Andreas Austeng's office is?"))  
print(fill_slots("Take me to floor 9"))
print(fill_slots("Take me to floor ninth"))  
print(fill_slots("Where is Olaf Owe?"))
print(fill_slots("I'd like to see Andreas Austen"))  

{'floor_number': 9}
{'employee_name': 'Andreas Austeng'}
{'floor_number': 3}
{'employee_name': 'Olaf Owe'}
{'employee_name': 'Andreas Austeng'}
{'floor_number': 9}
{'floor_number': 9}
{'employee_name': 'Olaf Owe'}
{'employee_name': 'Andreas Austeng'}


In [49]:
# Test case for fuzzy matching

# 'Andreas Austeng'
print(fill_slots("I'd like to visit Andreas Austen"))
print(fill_slots("I want to speak with Andres Austeng"))

# 'Olaf Owe'
print(fill_slots("Where is Olaf Oh's office located?"))
print(fill_slots("Please tell me about Olaf Owa"))
print(fill_slots("Could you find Olaf Aw's office?"))

# 'Philipp Häfliger'
print(fill_slots("Where can I find Phillip Hafliger?"))
print(fill_slots("I need to talk to Philip Heffliger"))
print(fill_slots("Could you direct me to Phillip Hefligar?"))

# 'Eric Bartley Jul'
print(fill_slots("Is Erik Bartly in his office?"))
print(fill_slots("Can you show me where Eric Jul's office is?"))
print(fill_slots("Please guide me to Eric Bartley Jewel"))

{'employee_name': 'Andreas Austeng'}
{'employee_name': 'Andreas Austeng'}
{'employee_name': 'Olaf Owe'}
{'employee_name': 'Olaf Owe'}
{'employee_name': 'Olaf Owe'}
{'employee_name': 'Philipp Häfliger'}
{'employee_name': 'Philipp Häfliger'}
{'employee_name': 'Philipp Häfliger'}
{'employee_name': 'Eric Bartley Jul'}
{'employee_name': 'Eric Bartley Jul'}
{'employee_name': 'Eric Bartley Jul'}


### Response selection

The next step is to implement the response selection mechanism. The response will depend on various factors:
- the inferred user intents from the user utterance
- the detected slot values in the user utterance (if any)
- the current floor
- the list of next floor stops that are yet to be reached
- the dialogue history (as a list of dialogue turns).

The response may consist of verbal responses (enacted by calls to `_say_to_user`) but also physical actions, represented by calls to either `move_to_floor` or `stop`. 

__Task 1.5__ (3 points): Implement the method `_respond`, which is responsible for selecting and executing those responses. The responses should satisfy the aforementioned conversational criteria (provide grounding feedback, use confirmations and clarification requests etc.). This method will consist in practice of many _if...then...else_ blocks. 
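
One common way to decide between acting directly, asking for confirmation, and requesting clarification is to threshold the confidence of the most likely intent. The sketch below illustrates this under stated assumptions: the two threshold values and the helper name `choose_strategy` are hypothetical, and suitable thresholds depend on how well calibrated the intent classifier is.

```python
from typing import Dict, Tuple

# Hypothetical thresholds, to be tuned on real interactions
ACCEPT_THRESHOLD = 0.7
CONFIRM_THRESHOLD = 0.4

def choose_strategy(intent_distrib: Dict[str, float]) -> Tuple[str, str]:
    """Return (strategy, intent), where strategy is 'execute', 'confirm' or 'clarify'."""
    intent = max(intent_distrib, key=intent_distrib.get)
    confidence = intent_distrib[intent]
    if confidence >= ACCEPT_THRESHOLD:
        # Confident enough: act, and ground the action verbally
        return "execute", intent
    if confidence >= CONFIRM_THRESHOLD:
        # Plausible but uncertain: ask the user to confirm first
        return "confirm", intent
    # Too uncertain: ask the user to rephrase
    return "clarify", intent

print(choose_strategy({"RequestMoveToFloor": 0.55, "Stop": 0.45}))
```

The returned strategy can then select between an immediate action, a question such as "Do you want to go to floor 4?", or a clarification request.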

In [50]:
def _respond(self, intent_distrib: Dict[str, float], slots: Dict[str, Union[int, str]]):
    """Given a probability distribution over possible intents, and a (possibly empty) list
    of detected slots in the user input, decide how to react. The method should lead
    to calls to both physical actions (move_to_floor, stop) and dialogue responses 
    (via _say_to_user)."""

    # Ensure intent_distrib is a dictionary
    if not isinstance(intent_distrib, dict):
        raise ValueError("intent_distrib should be a dictionary")

    # Determine the most likely intent
    intent = max(intent_distrib, key=intent_distrib.get)
    confidence = intent_distrib[intent]

    # Handle different intents
    if intent == "RequestMoveToFloor":
        if "floor_number" in slots:
            floor_number = slots["floor_number"]
            if 1 <= floor_number <= 10:
                self._say_to_user(f"Ok, going to the {floor_number}th floor.")
                self.move_to_floor(floor_number)
            else:
                self._say_to_user("Sorry, the floor number must be between 1 and 10.")
        else:
            self._say_to_user("Which floor would you like to go to?")

    elif intent == "RequestMoveRelative":
        if "floor_number" in slots:
            relative_floor = slots["floor_number"]
            target_floor = self.cur_floor + relative_floor
            if 1 <= target_floor <= 10:
                self._say_to_user(f"Ok, moving {relative_floor} floors.")
                self.move_to_floor(target_floor)
            else:
                self._say_to_user("Sorry, that would take us out of the building's range.")
        else:
            self._say_to_user("How many floors would you like to move?")

    elif intent == "RequestOfficeLocation":
        if "employee_name" in slots:
            employee_name = slots["employee_name"]
            if employee_name in OFFICES:
                office_floor = OFFICES[employee_name]
                self._say_to_user(f"The office of {employee_name} is on the {office_floor}th floor. Do you wish to go there?")
            else:
                self._say_to_user(f"Sorry, I don't know where {employee_name}'s office is.")
        else:
            self._say_to_user("Whose office are you looking for?")

    elif intent == "RequestCurrentFloor":
        self._say_to_user(f"We are currently on the {self.cur_floor}th floor.")

    elif intent == "Confirm":
        if self.next_stops:
            next_floor = self.next_stops[0]
            self._say_to_user(f"Ok, continuing to the {next_floor}th floor.")
        else:
            self._say_to_user("There is no pending floor request to confirm.")

    elif intent == "Stop":
        self.stop()
        self._say_to_user("The elevator has been stopped. Where would you like to go?")

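    elif intent == "Repeat":
        # Look for the most recent system turn; the history may contain only user turns,
        # in which case next() falls back to None instead of raising StopIteration
        last_system_turn = next((turn for turn in reversed(self.dialogue_history)
                                 if turn["speaker"] == "elevator"), None)
        if last_system_turn is not None:
            self._say_to_user(f"I said: {last_system_turn['text']}")
        else:
            self._say_to_user("I haven't said anything yet.")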

    else:
        self._say_to_user("Sorry, I don't understand you, pal")

setattr(TalkingElevator, "_respond", _respond)
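
The responses above build ordinals by appending "th" to the floor number, which produces "1th", "2th" and "3th" for the lowest floors. A small helper can produce the correct English suffix. This is a minimal sketch, and the name `ordinal` is a hypothetical helper that is not part of the provided `TalkingElevator` class.

```python
def ordinal(n: int) -> str:
    """Return the English ordinal of a positive integer, e.g. 1 -> '1st', 12 -> '12th'."""
    # 11, 12 and 13 (and 111, 112, ...) take "th" despite ending in 1, 2 or 3
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"

print(", ".join(ordinal(floor) for floor in range(1, 11)))
```

A response could then be phrased as `f"Ok, going to the {ordinal(floor_number)} floor."`.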

### Putting it all together

The last step is to implement the `process_input` method in the `TalkingElevator` class. The method should rely on the intent recognition, slot filling and response selection mechanism (which you have implemented in the previous steps) to react to a given user input.

**Task 1.6** (1 point): Implement the `process_input` method:

In [52]:
def process_input(self, user_input: str, conf_score:float=1.0):
    """Processes the (transcribed) user input, and respond appropriately 
    (through a verbal response and possibly also an action, such as moving floors).
    The method should rely on the intent classifier, slot-filling function, and
    response selection function."""

    self._add_to_dialogue_history(user_input, speaker="user", conf_score=conf_score)
    
    # Get the intent distribution from the classifier
    intent_distrib = classifier.get_intent_distrib(user_input)
    
    # Fill the slots in the user input
    slots = fill_slots(user_input)
    
    # Respond based on the intent and slots
    self._respond(intent_distrib, slots)

setattr(TalkingElevator, "process_input", process_input)

We are now ready to test our talking elevator: 


In [53]:
elevator = TalkingElevator()

Loading TTS and ASR models...

Done



Your talking elevator will most likely not function properly right from the start. Identify what works and what doesn't, and correct the code you have developed in Tasks 1.1 - 1.6 until your system meets the specifications we have outlined.

## Part 2 : Machine translation

In this part, we evaluate a pre-trained machine translation model on data from the Lord of the Rings movies and fine-tune it to improve the translation quality.

### Data

We provide you with two files, `lotr.detok.de` and `lotr.detok.en`, containing German and English movie subtitles. These two files constitute a so-called _parallel corpus_, i.e. each sentence/line in German corresponds to a sentence/line in English. The two files have the same number of lines and the German sentence on line $i$ corresponds to the English sentence on line $i$. The subtitles are extracted from the [OpenSubtitles-2018](https://opus.nlpl.eu/OpenSubtitles/corpus/version/OpenSubtitles) corpus.

Here are the first ten lines of the two files:

<style scoped>
table {
  font-size: 12px;
}
</style>
| Nb  | German (`lotr.detok.de`)         | English (`lotr.detok.en`)      |
|---|----------------------------------|--------------------------------|
| 1 | Die Welt ist im Wandel. | The world is changed.   |
| 2 | Ich spüre es im Wasser. | I feel it in the water. |
| 3 | Ich spüre es in der Erde. | I feel it in the earth. |
| 4 | Ich rieche es in der Luft. | I smell it in the air. |
| 5 | Vieles, was einst war, ist verloren, da niemand mehr lebt, der sich erinnert. | Much that once was is lost. For none now live who remember it. |
| 6 | Es begann mit dem Schmieden der Großen Ringe. | It began with the forging of the Great Rings. |
| 7 | 3 wurden den Elben gegeben, den unsterblichen, weisesten und reinsten aller Wesen. | Three were given to the Elves: Immortal, wisest and fairest of all beings. |
| 8 | 7 den Zwergenherrschern, großen Bergleuten und Handwerkern in ihren Hallen aus Stein. | Seven to the Dwarf-lords: Great miners and craftsmen of the mountain halls. |
| 9 | Und 9... 9 Ringe wurden den Menschen geschenkt, die vor allem anderen nach Macht streben. | And nine nine rings were gifted to the race of Men who, above all else, desire power. |
| 10 | Denn diese Ringe bargen die Kraft und den Willen, jedes Volk zu leiten. | For within these rings was bound the strength and will to govern each race. |


### Getting started

We will use a pretrained machine translation model for German-to-English translation. The model is available on the HuggingFace model hub and can be used with the `transformers` library.

Let us first make sure that all required modules are installed:

In [None]:
%pip install torch transformers accelerate evaluate sacrebleu sacremoses sentencepiece unbabel-comet

The bilingual model is called [`opus-mt-de-en`](https://huggingface.co/Helsinki-NLP/opus-mt-de-en) and has been trained by the Helsinki-NLP group. Like (almost) all HuggingFace models, it consists of a _tokenizer_ and the _sequence-to-sequence model_ properly speaking. We need to load both separately:

In [2]:
import transformers

tokenizer = transformers.AutoTokenizer.from_pretrained("helsinki-nlp/opus-mt-de-en")
translator = transformers.AutoModelForSeq2SeqLM.from_pretrained("helsinki-nlp/opus-mt-de-en")

# Change "cuda" to "cpu" if you're running on a machine without GPU
device = "cuda"
translator = translator.to(device)

The `transformers` library will automatically download the models from the HuggingFace hub the first time you run this cell, so it may take a bit longer.

Let's take the first two German sentences, tokenize them, and translate them to English:

In [3]:
tokens = tokenizer(["Die Welt ist im Wandel.", "Ich spüre es im Wasser."], return_tensors="pt", padding=True)
print(tokens)

{'input_ids': tensor([[   55,   401,    29,    49,  9012,     3,     0, 58100, 58100],
        [  105,  2768,  1691,    18,    65,    49,   672,     3,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1]])}


In [4]:
outputs = translator.generate(**tokens.to(device), max_new_tokens=50)
print(outputs)

tensor([[58100,    36,   360,    19,  7315,     3,     0, 58100, 58100, 58100],
        [58100,    38,    85,  1595,    56,     5,     4,   616,     3,     0]],
       device='cuda:0')


__Task 2.1__ (1 point):
- What do the numbers in the `input_ids` represent?
- What is the effect of `padding=True`? What would the data look like if padding was disabled?
- What does `max_new_tokens` do? Why do you think it is important to set this parameter?

Answer:
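- The numbers in `input_ids` are the vocabulary indices of the subword tokens produced by the tokenizer, one integer per token. In the output above, `0` marks the end of each sentence and `58100` is the padding token.
- With `padding=True`, shorter sequences are filled with the padding token (here `58100`) up to the length of the longest sequence in the batch, and `attention_mask` marks the padded positions with 0 so that the model ignores them. Without padding, the two sequences would have different lengths (7 and 9 tokens), which cannot be stored in one rectangular tensor, so the call with `return_tensors="pt"` would fail.
- `max_new_tokens` is an upper bound on the number of tokens generated for each input. It guarantees that decoding terminates even if the model never produces the end-of-sequence token (for instance when it gets stuck repeating itself), and it keeps running time and memory usage bounded.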
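
The effect of padding can be reproduced by hand. The following sketch pads a batch of token ID lists to a common length and builds the matching attention mask. It only mimics what the tokenizer does internally; `pad_batch` is a hypothetical helper, and the token IDs are copied from the tokenizer output shown earlier.

```python
from typing import List, Tuple

def pad_batch(sequences: List[List[int]], pad_id: int) -> Tuple[List[List[int]], List[List[int]]]:
    """Right-pad all sequences to the length of the longest one and build the attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 for real tokens, 0 for padding positions
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return input_ids, attention_mask

# Unpadded token IDs of the two example sentences (lengths 7 and 9)
batch = [[55, 401, 29, 49, 9012, 3, 0],
         [105, 2768, 1691, 18, 65, 49, 672, 3, 0]]
input_ids, attention_mask = pad_batch(batch, pad_id=58100)
print(input_ids)
print(attention_mask)
```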

We can get actual words by running the output through the `batch_decode` function of the tokenizer:

In [5]:
translations = tokenizer.batch_decode(outputs, skip_special_tokens=True)
print(translations)

['The world is changing.', 'I can feel it in the water.']


__Note:__ We assume that you will run the translations from German to English. If you would like to work on the opposite translation direction (and feel comfortable evaluating the German output), you are welcome to do so. The corresponding bilingual model is called `opus-mt-en-de`.

### Data splitting

Before we move on, we need to split our data. We will evaluate different models and for that we'll need test data. We will also fine-tune a model, and for that we'll need training data. The entire Lord of the Rings dataset has 9640 lines.

__Task 2.2__ (1 point): Split the dataset in such a way that the **last** 1000 lines are used for testing and the remaining lines (8640) for training. Save the data under the following filenames: `lotr.train.de, lotr.train.en, lotr.test.de, lotr.test.en`. You can use Python code or other tools to perform the splitting.

In [12]:
import os

def split_dataset(input_file_de: str, input_file_en: str, test_size: int = 1000):
    """
    Splits the given dataset into training and testing files based on the specified test size.

    Parameters:
    - input_file_de (str): Path to the German input file.
    - input_file_en (str): Path to the English input file.
    - test_size (int): Number of lines to use for the test set. Default is 1000.

    Saves:
    - Training and testing files for both German and English.
    """
    # Check if input files exist
    if not os.path.exists(input_file_de) or not os.path.exists(input_file_en):
        print("Error: One or both input files do not exist.")
        return

    # Read the German file
    with open(input_file_de, "r") as f:
        lines_de = f.readlines()

    # Split into train and test
    train_de = lines_de[:-test_size]
    test_de = lines_de[-test_size:]

    # Read the English file
    with open(input_file_en, "r") as f:
        lines_en = f.readlines()

    # Split into train and test
    train_en = lines_en[:-test_size]
    test_en = lines_en[-test_size:]

    # Define output filenames
    output_train_de = input_file_de.replace(".detok", ".train")
    output_test_de = input_file_de.replace(".detok", ".test")
    output_train_en = input_file_en.replace(".detok", ".train")
    output_test_en = input_file_en.replace(".detok", ".test")

    # Write the training and test sets to files
    with open(output_train_de, "w") as f:
        f.writelines(train_de)

    with open(output_test_de, "w") as f:
        f.writelines(test_de)

    with open(output_train_en, "w") as f:
        f.writelines(train_en)

    with open(output_test_en, "w") as f:
        f.writelines(test_en)

    print(f"Files saved as: {output_train_de}, {output_test_de}, {output_train_en}, {output_test_en}")

In [13]:
split_dataset("lotr.detok.de", "lotr.detok.en", test_size=1000)

Files saved as: lotr.train.de, lotr.test.de, lotr.train.en, lotr.test.en


In [14]:
# Load the training and testing datasets to check the dimensions
with open("lotr.train.de", "r") as f:
    train_de = f.readlines()

with open("lotr.train.en", "r") as f:
    train_en = f.readlines()

with open("lotr.test.de", "r") as f:
    test_de = f.readlines()

with open("lotr.test.en", "r") as f:
    test_en = f.readlines()

print("--- Dataset Dimensions (German) ---")
print(f"Training set size: {len(train_de)}")
print(f"Testing set size: {len(test_de)}")
print("--- Dataset Dimensions (English) ---")
print(f"Training set size: {len(train_en)}")
print(f"Testing set size: {len(test_en)}")

--- Dataset Dimensions (German) ---
Training set size: 8640
Testing set size: 1000
--- Dataset Dimensions (English) ---
Training set size: 8640
Testing set size: 1000


__Task 2.3__ (1 point): What are potential risks and drawbacks of splitting the dataset in this way? 


Answer:
- Temporal Bias: If the subtitles follow the chronological sequence of the movie, the last 1000 lines would represent the movie's later parts. This can introduce a bias where the model may not generalize well to the entire movie context or to other movies.
- Loss of Diversity: Subtitles in different parts of the movie may vary in vocabulary, tone, and context (e.g., introductions vs. climax vs. resolution). Using only the last portion for testing may reduce the diversity in both the training and test sets, making evaluation less robust.
- Overfitting to Specific Contexts: Since the lines are sequential, training on one part of the movie and testing on another may lead to overfitting on context-specific patterns, reducing the model’s generalizability to random scenes or new movies.
- Evaluation Bias: The model’s performance on a test set derived from the end of the movie may not reflect its overall performance on the whole movie or on other subtitle datasets, as it hasn’t been exposed to different types of dialogues evenly.
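
For comparison, a shuffled split avoids the positional bias discussed above, at the cost of placing neighbouring (and often very similar) subtitle lines in both sets. The sketch below is only an illustration of that alternative: the assignment explicitly asks for the last 1000 lines as test set, and `shuffled_split` is a hypothetical helper.

```python
import random
from typing import List, Tuple

Pair = Tuple[str, str]

def shuffled_split(src: List[str], tgt: List[str], test_size: int,
                   seed: int = 0) -> Tuple[List[Pair], List[Pair]]:
    """Shuffle aligned sentence pairs with a fixed seed and return (train, test)."""
    if len(src) != len(tgt):
        raise ValueError("Source and target must have the same number of lines")
    # Zip first so that each source line stays attached to its translation
    pairs = list(zip(src, tgt))
    random.Random(seed).shuffle(pairs)
    return pairs[test_size:], pairs[:test_size]

german = [f"Satz {i}" for i in range(10)]
english = [f"sentence {i}" for i in range(10)]
train_pairs, test_pairs = shuffled_split(german, english, test_size=3)
print(len(train_pairs), len(test_pairs))
```

Fixing the seed keeps the split reproducible, so that different models are evaluated on the same test sentences.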

Now we are ready to translate the test set with our model.

__Task 2.4__ (2 points): Create a function that loads the entire `lotr.test.de` file, translates each line with the `opus-mt-de-en` model and writes its output to a new file, one sentence per line.

The easiest way to do this is to just load the entire test file into a list, tokenize and translate it, but the test set may be too large to fit in GPU memory, or it might be inefficient and slow if you use a CPU. A better alternative is to split the data into batches of 50-100 sentences and send each batch separately to the translator.
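
The batching itself is independent of the translation model. A minimal sketch of a batch generator, with the hypothetical helper name `batches`, could look as follows:

```python
from typing import Iterator, List

def batches(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size items; the last batch may be shorter."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

sentences = [f"Satz {i}" for i in range(250)]
print([len(batch) for batch in batches(sentences, 100)])
```

Each yielded batch can then be tokenized with `padding=True` and passed to `translator.generate`, so that only one batch at a time has to fit in memory.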

In [15]:
def translate(input_file, translation_file, tokenizer, translator, batch_size=100):
    """Translate an input file batch by batch using the loaded tokenizer and translator,
    and write the translations to translation_file, one sentence per line."""
    
    with open(input_file, "r") as f:
        lines = f.readlines()

    with open(translation_file, "w") as f:
        for i in range(0, len(lines), batch_size):
            batch = lines[i:i+batch_size]
            tokens = tokenizer(batch, return_tensors="pt", padding=True)
            outputs = translator.generate(**tokens.to(device), max_length=128)
            translations = tokenizer.batch_decode(outputs, skip_special_tokens=True)
            f.writelines([f"{line}\n" for line in translations])

translate("lotr.test.de", "lotr.output_opus.en", tokenizer, translator)

Before moving on, open the output file and check that the translations look ok. In particular, the file should contain the expected number of lines and output should be in the expected language (English or German, depending on the chosen direction).

__Task 2.5__ (1 point): Open both the output file and the reference translations (`lotr.test.en` if translating from German to English) and compare the first 20 lines. How would you rate the translations of the OPUS system on a scale from 1 (incomprehensible and/or completely different meaning) to 5 (grammatically correct and meaning fully preserved)? Justify your answer.

In [1]:
with open("lotr.output_opus.en", "r") as f:
    translations_opus = f.readlines()

with open("lotr.test.en", "r") as f:
    references = f.readlines()

for i in range(20):
    print(f"OPUS: {translations_opus[i].strip()}")
    print(f"Ref: {references[i].strip()}")
    print()

OPUS: There's nothing here.
Ref: Nothing here.

OPUS: Keep looking!
Ref: Keep searching!

OPUS: This stone could be anywhere.
Ref: That Jewel could be anywhere.

OPUS: The Arkenstein is located in these halls.
Ref: The Arkenstone is in these halls.

OPUS: - Find him!
Ref: - Find it!

OPUS: You heard me.
Ref: You heard him.

OPUS: - Keep looking.
Ref: - Keep looking.

OPUS: All of you!
Ref: All of you!

OPUS: No one rests until he's found.
Ref: No one rests until... it is found.

OPUS: It's almost tempting me to leave it to you.
Ref: I am almost tempted... to let you take it.

OPUS: And be it just to see how oak shield suffers.
Ref: If only... to see Oakenshield... suffer.

OPUS: Seeing it destroy him.
Ref: Watch it... destroy him.

OPUS: Seeing it contaminate his heart and drive him crazy.
Ref: Watch it corrupt... his heart... and drive him mad.

OPUS: I'll help you.
Ref: I've got you.

OPUS: Just take what's necessary.
Ref: Take only what you need.

OPUS: We have a long march ahead of

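I would rate the translations 4/5. They are fluent and grammatical, and the meaning is preserved in most lines. Most differences from the reference are harmless synonyms or alternative sentence structures, but a few are genuine errors, mainly in proper names and pronouns.

For example:
- crazy (OPUS) vs. mad (reference): a synonym, meaning preserved
- contaminate (OPUS) vs. corrupt (reference): a near-synonym, slightly less idiomatic in this context
- oak shield (OPUS) vs. Oakenshield (reference): a proper name translated literally
- Arkenstein (OPUS) vs. Arkenstone (reference): a proper name that is not rendered in its English form
- "Find him!" (OPUS) vs. "Find it!" (reference): the wrong pronoun for an object

In conclusion, the translations are mostly correct and the meaning is largely preserved, but domain-specific names and pronoun choices are weak points of the model.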

### Evaluation

We can now evaluate the quality of our translations. In a first step, we perform _reference-based surface-level evaluation_  using the popular BLEU score. We can do that with the `sacrebleu` module. Below is a slightly reformatted example taken from the [SacreBLEU documentation](https://github.com/mjpost/sacrebleu/tree/master?tab=readme-ov-file#using-sacrebleu-from-python):

In [17]:
from sacrebleu.metrics import BLEU

reference = ['The dog bit the man.', 'It was not unexpected.', 'The man bit him first.']
hypothesis = ['The dog bit the man.', "It wasn't surprising.", 'The man had just bitten him.']

bleu_scorer = BLEU()
# BLEU can deal with multiple references per sentence, but here we only have one, so we just enclose it in another set of brackets:
score = bleu_scorer.corpus_score(hypothesis, [reference])
print(score)

BLEU = 45.07 70.6/42.9/36.4/37.5 (BP = 1.000 ratio = 1.000 hyp_len = 17 ref_len = 17)


__Task 2.6__ (1 point): Load both the system output and the reference of your test set and compute the corpus-level BLEU score. Also compute the corpus-level chrF score. Which of the scores is higher?

In [18]:
from sacrebleu.metrics import BLEU, CHRF

def evaluate_bleu(hypothesis_file, reference_file):
    """Evaluate the BLEU score using the hypothesis and reference files."""

    with open(hypothesis_file, "r") as f:
        hypothesis = f.readlines()

    with open(reference_file, "r") as f:
        references = f.readlines()

    bleu_scorer = BLEU()
    score = bleu_scorer.corpus_score(hypothesis, [references])
    return score

evaluate_bleu("lotr.output_opus.en", "lotr.test.en")

BLEU = 29.03 63.3/38.3/25.2/16.2 (BP = 0.920 ratio = 0.923 hyp_len = 6203 ref_len = 6721)
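
As a sanity check, the BLEU line above can be approximately reproduced from its own components: BLEU is the brevity penalty multiplied by the geometric mean of the four n-gram precisions. The sketch below uses the standard formula with the rounded values printed by SacreBLEU, so the result is only expected to agree up to rounding.

```python
import math

# Values copied from the SacreBLEU output above (precisions are rounded to one decimal)
precisions = [63.3, 38.3, 25.2, 16.2]
hyp_len, ref_len = 6203, 6721

# Brevity penalty: penalizes hypotheses that are shorter than the reference
brevity_penalty = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)

# Geometric mean of the 1-gram to 4-gram precisions
geometric_mean = math.exp(sum(math.log(p) for p in precisions) / len(precisions))

bleu = brevity_penalty * geometric_mean
print(f"BP = {brevity_penalty:.3f}, BLEU = {bleu:.2f}")
```

The brevity penalty matches the reported 0.920, and the recomputed score lands within a few hundredths of the reported 29.03. The brevity penalty below 1 shows that the system output is shorter than the reference overall (6203 vs. 6721 tokens).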

Besides string-based metrics, neural metrics have become increasingly popular lately, since they have been shown to correlate better with human judgements. The most popular neural metric is called COMET and it can be used with the HuggingFace `evaluate` package. The example below is from the [documentation](https://huggingface.co/spaces/evaluate-metric/comet/blob/main/README.md):

In [19]:
import evaluate

comet_metric = evaluate.load('comet')
src = ["Dem Feuer konnte Einhalt geboten werden", "Schulen und Kindergärten wurden eröffnet."]
hyp = ["The fire could be stopped", "Schools and kindergartens were open"]
ref = ["They were able to control the fire.", "Schools and kindergartens opened"]
comet_score = comet_metric.compute(predictions=hyp, references=ref, sources=src)
print(comet_score)

{'mean_score': 0.9051420092582703, 'scores': [0.8385582566261292, 0.9717257618904114]}


__Task 2.7__ (1 point): Adapt this code to evaluate the output of the OPUS model. Note that COMET also requires the source text.

In [20]:
def evaluate_comet(hypothesis_file, reference_file, source_file):
    """Evaluate the COMET score using the hypothesis, reference, and source files."""
    
    with open(hypothesis_file, "r") as f:
        hypothesis = f.readlines()

    with open(reference_file, "r") as f:
        references = f.readlines()

    with open(source_file, "r") as f:
        source = f.readlines()

    comet_metric = evaluate.load('comet')
    comet_score = comet_metric.compute(predictions=hypothesis, references=references, sources=source)
    return comet_score

In [21]:
evaluate_comet("lotr.output_opus.en", "lotr.test.en", "lotr.test.de")

{'mean_score': 0.8517506479918957,
 'scores': [0.9153851866722107,
  0.9468668103218079,
  0.8256570100784302,
  0.8032516837120056,
  0.9530919194221497,
  0.9214392900466919,
  0.9798831343650818,
  0.9793310761451721,
  0.9100832939147949,
  0.7782115340232849,
  0.5980216264724731,
  0.8024845719337463,
  0.7926276326179504,
  0.7160571217536926,
  0.8175803422927856,
  0.9529080390930176,
  0.9506795406341553,
  0.9420503973960876,
  0.91288822889328,
  0.8457056283950806,
  0.9229069352149963,
  0.7305722832679749,
  0.76967453956604,
  0.898786723613739,
  0.8273663520812988,
  0.8060702681541443,
  0.9700202345848083,
  0.5075957179069519,
  0.8797752261161804,
  0.8298813104629517,
  0.877406656742096,
  0.9205743670463562,
  0.6748988032341003,
  0.8469387292861938,
  0.2771272659301758,
  0.34335044026374817,
  0.9769700169563293,
  0.7233666181564331,
  0.5694791674613953,
  0.6657575964927673,
  0.9521582126617432,
  0.7802305221557617,
  0.7778492569923401,
  0.9069486856

### Fine-tuning

Let us see now if we can further improve the translation quality. We still haven't used the training set after all...

Fine-tuning a translation model with the `transformers` library is a bit convoluted. You need the following ingredients:
- A `Seq2SeqTrainer` object, which defines the initial model and its tokenizer, the training data, and the configuration parameters (as a `Seq2SeqTrainingArguments` object). The training process starts with the `train()` method.
- A `Seq2SeqTrainingArguments` object, which contains the configuration parameters, such as the number of training epochs, the path for saving the fine-tuned model, the learning rate etc.
- A `DataCollatorForSeq2Seq` object that takes care of assembling the training examples into batches, padding the inputs and labels of each batch to a common length.
- A `DatasetDict` object containing the tokenized training data. Typically, the untokenized data is loaded into a `DatasetDict` object, and the tokenization function is applied to everything inside this `DatasetDict` using the `map()` function.
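
To make the role of the data collator more concrete, here is a pure-Python sketch of the padding it performs on a batch. It is an illustration only, not the library's implementation: `pad_batch` is a hypothetical function, the token ids are made up, and the padding id `0` is an assumption (the real value comes from the tokenizer). The label padding value `-100` mirrors the default of `DataCollatorForSeq2Seq`, which makes the loss ignore padded label positions.

```python
def pad_batch(examples, pad_token_id=0, label_pad_id=-100):
    """Pad 'input_ids' and 'labels' to the longest sequence in the batch."""
    max_in = max(len(ex["input_ids"]) for ex in examples)
    max_lab = max(len(ex["labels"]) for ex in examples)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for ex in examples:
        n_pad = max_in - len(ex["input_ids"])
        batch["input_ids"].append(ex["input_ids"] + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding.
        batch["attention_mask"].append([1] * len(ex["input_ids"]) + [0] * n_pad)
        # Padded label positions get label_pad_id so that the loss skips them.
        batch["labels"].append(ex["labels"] + [label_pad_id] * (max_lab - len(ex["labels"])))
    return batch

# Two toy examples of different lengths (made-up token ids).
toy = [
    {"input_ids": [5, 6, 7], "labels": [8, 9]},
    {"input_ids": [5], "labels": [8, 9, 10, 11]},
]
batch = pad_batch(toy)
print(batch)
```

Padding per batch rather than to a global maximum keeps short batches small, which is why the collator is applied at batching time and not during tokenization.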

__Task 2.8__ (1 point): The code in the box below shows a working example using the pretrained OPUS model, but is limited to two sentence pairs. Complete the code to load the entire training data.

In [22]:
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer
from transformers import DataCollatorForSeq2Seq
from datasets import Dataset, DatasetDict

max_length = 100

# Load the training data
with open("lotr.train.de", "r") as f:
    src_texts = f.readlines()

with open("lotr.train.en", "r") as f:
    tgt_texts = f.readlines()

# Create the dataset
ds = Dataset.from_dict({
    "src_text": src_texts,
    "tgt_text": tgt_texts
})
data = DatasetDict({"train": ds})

def preprocess_function(examples):
    model_inputs = tokenizer(examples["src_text"], text_target=examples["tgt_text"], max_length=max_length, truncation=True)
    return model_inputs

# Tokenize the dataset
tokenized_datasets = data.map(preprocess_function, batched=True)
print(tokenized_datasets)

# Create the data collator
data_collator = DataCollatorForSeq2Seq(tokenizer, model=translator)

# Define the training arguments
args = Seq2SeqTrainingArguments(
    output_dir="opus-mt-de-en-lotr",
    evaluation_strategy="no",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=3,
    predict_with_generate=True
)

# Create the trainer
trainer = Seq2SeqTrainer(
    translator,
    args,
    train_dataset=tokenized_datasets["train"],
    data_collator=data_collator,
    tokenizer=tokenizer
)

# Start training
trainer.train()

Map:   0%|          | 0/8640 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['src_text', 'tgt_text', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 8640
    })
})




| Step | Training Loss |
|------|---------------|
| 500  | 1.1815        |




TrainOutput(global_step=810, training_loss=1.1138491030092592, metrics={'train_runtime': 43.0474, 'train_samples_per_second': 602.127, 'train_steps_per_second': 18.816, 'total_flos': 160830792400896.0, 'train_loss': 1.1138491030092592, 'epoch': 3.0})

The model was fine-tuned for three epochs, and you should have three checkpoints in the `opus-mt-de-en-lotr` directory.
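
The step count reported in the training output can be reproduced with a short calculation. This sketch assumes training on a single device without gradient accumulation; under these assumptions, one checkpoint per epoch would be saved at the listed steps.

```python
import math

num_examples = 8640   # rows in the tokenized training set
batch_size = 32       # per_device_train_batch_size
num_epochs = 3        # num_train_epochs

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
# With save_strategy="epoch", a checkpoint is written at the end of each epoch.
checkpoint_steps = [steps_per_epoch * epoch for epoch in range(1, num_epochs + 1)]

print(steps_per_epoch, total_steps, checkpoint_steps)
# prints: 270 810 [270, 540, 810]
```

The total of 810 matches the `global_step` in the `TrainOutput` above.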

__Task 2.9__ (1 point): Choose one of the checkpoints and use it to translate the test set. Evaluate the test set with BLEU, chrF and COMET. Note that locally saved model files (and tokenizers) can be loaded in the same way as models from the HuggingFace hub, e.g. with the following command: `transformers.AutoModelForSeq2SeqLM.from_pretrained("opus-mt-de-en-lotr/checkpoint-810")`

Did fine-tuning help? Have a look at the first rows of the files. Do you agree with the metrics?
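
Before choosing a checkpoint, it can help to list what was actually saved. The following is a minimal sketch, assuming the checkpoints are stored as `checkpoint-<step>` subdirectories of the output directory; `list_checkpoints` is a hypothetical helper, and the demo uses a temporary directory rather than the real `opus-mt-de-en-lotr` folder.

```python
import re
import tempfile
from pathlib import Path

def list_checkpoints(output_dir):
    """Return (step, path) pairs for checkpoint-<step> subdirectories, sorted by step."""
    found = []
    for child in Path(output_dir).iterdir():
        match = re.fullmatch(r"checkpoint-(\d+)", child.name)
        if child.is_dir() and match:
            found.append((int(match.group(1)), child))
    return sorted(found, key=lambda pair: pair[0])

# Demonstration on a temporary directory with three empty checkpoint folders.
with tempfile.TemporaryDirectory() as tmp:
    for step in (810, 270, 540):
        (Path(tmp) / f"checkpoint-{step}").mkdir()
    checkpoints = list_checkpoints(tmp)
    steps = [step for step, _ in checkpoints]
    latest_name = checkpoints[-1][1].name

print(steps, latest_name)
```

Sorting by the numeric step rather than by name avoids the pitfall that `checkpoint-1000` would sort before `checkpoint-270` alphabetically.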

In [23]:
# Task 2.9: translate the test set with the last fine-tuned checkpoint and evaluate the output.

# Load the trained model
model = transformers.AutoModelForSeq2SeqLM.from_pretrained("opus-mt-de-en-lotr/checkpoint-810")
model = model.to(device)

# Translate the test set
translate("lotr.test.de", "lotr.output_custom.en", tokenizer, model)

# Evaluate the translations
bleu_score = evaluate_bleu("lotr.output_custom.en", "lotr.test.en")
comet_score = evaluate_comet("lotr.output_custom.en", "lotr.test.en", "lotr.test.de")

print(f"BLEU score: {bleu_score}")
print(f"COMET score: {comet_score}")

BLEU score: BLEU = 33.26 68.3/44.6/30.2/19.7 (BP = 0.906 ratio = 0.911 hyp_len = 6120 ref_len = 6721)
COMET score: {'mean_score': 0.8648609526157379, 'scores': [0.9153851866722107, 0.9468668103218079, 0.8256569504737854, 0.8963990211486816, 0.9530919194221497, 0.9214392900466919, 0.9798831343650818, 0.9793310761451721, 0.9540467262268066, 0.869658350944519, 0.7360653877258301, 0.7124089598655701, 0.8379898071289062, 0.7160571217536926, 0.7955370545387268, 0.9529080390930176, 0.9506795406341553, 0.9839560389518738, 0.91288822889328, 0.8799622654914856, 0.9229069352149963, 0.7448776960372925, 0.7838560938835144, 0.898786723613739, 0.891481339931488, 0.8060702681541443, 0.9700202345848083, 0.5075957179069519, 0.8797752261161804, 0.7960687279701233, 0.9037336707115173, 0.9205743670463562, 0.6748988032341003, 0.912050187587738, 0.3633274734020233, 0.3448634743690491, 0.9769700169563293, 0.7603086233139038, 0.605912983417511, 0.8996880650520325, 0.9300352931022644, 0.8096268773078918, 0.9077

- Before fine-tuning: BLEU = 29.03 63.3/38.3/25.2/16.2 (BP = 0.920 ratio = 0.923 hyp_len = 6203 ref_len = 6721) | COMET score: 0.8518
- After fine-tuning: BLEU = 33.26 68.3/44.6/30.2/19.7 (BP = 0.906 ratio = 0.911 hyp_len = 6120 ref_len = 6721) | COMET score: 0.8649

In conclusion, both metrics improved after fine-tuning. The BLEU score increased from 29.03 to 33.26, with higher n-gram precision at every order, and the mean COMET score increased from 0.8518 to 0.8649 (the `mean_score` values returned by `evaluate_comet` for the two output files). The gain is much larger in BLEU than in COMET. A plausible explanation is that fine-tuning mainly adapts word choice and phrasing to the domain of the reference translations, which BLEU rewards directly through n-gram overlap, whereas COMET already judged many baseline translations as adequate in meaning. The brevity penalty dropped slightly, from 0.920 to 0.906, so the fine-tuned output is somewhat shorter relative to the reference.
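
To look beyond the corpus-level averages, the per-segment COMET scores of the two systems can be compared directly. The sketch below uses only the first five scores from each of the two COMET outputs shown in this notebook; `compare_segments` and the tolerance `tol` are hypothetical choices for illustration.

```python
# First five per-segment COMET scores, copied from the outputs above.
baseline = [0.9153851866722107, 0.9468668103218079, 0.8256570100784302,
            0.8032516837120056, 0.9530919194221497]
finetuned = [0.9153851866722107, 0.9468668103218079, 0.8256569504737854,
             0.8963990211486816, 0.9530919194221497]

def compare_segments(before, after, tol=1e-4):
    """Count segments that got better, worse, or stayed the same (within tol)."""
    deltas = [a - b for b, a in zip(before, after)]
    counts = {"better": 0, "worse": 0, "same": 0}
    for delta in deltas:
        if delta > tol:
            counts["better"] += 1
        elif delta < -tol:
            counts["worse"] += 1
        else:
            counts["same"] += 1
    return counts, deltas

counts, deltas = compare_segments(baseline, finetuned)
# Index of the segment with the largest improvement, worth inspecting by hand.
largest_gain_index = max(range(len(deltas)), key=lambda i: deltas[i])

print(counts, largest_gain_index)
```

Running the same comparison over the full score lists would point to the segments where the two models differ most, which are good candidates for the manual inspection the task asks for.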