# IN4080: obligatory assignment 3
 
Mandatory assignment 3 is about the practical use of Large Language Models (LLMs). More specifically, you will be tasked to implement a RAG (Retrieval-Augmented Generation) system able to answer factual questions based on a document database, more specifically Wiki pages extracted from an [online Star Wars encyclopedia](https://starwars.fandom.com). 

You are required to get at least 12/20 points to pass. 

- We assume that you have read and are familiar with IFI’s requirements and guidelines for mandatory assignments, see [here](https://www.uio.no/english/studies/examinations/compulsory-activities/mn-ifi-mandatory.html) and [here](https://www.uio.no/english/studies/examinations/compulsory-activities/mn-ifi-guidelines.html).
- This is an individual assignment. You should not deliver joint submissions. 
- You may redeliver in Devilry before the deadline (__Sunday, October 13 at 23:59__).
- Only the last delivery will be read! If you deliver more than one file, put them into a zip-archive. You don't have to include in your delivery the data files already provided for this assignment. 
- Name your submission _your\_username\_in4080\_mandatory\_3_

The preferred way to complete this assignment is using the high-performance computing cluster _Fox_. See [here](https://www.uio.no/studier/emner/matnat/ifi/IN4080/h24/computing-setup.html) for instructions on how to register and log in to Fox.

You should deliver a completed version of this Jupyter notebook, containing both your code and explanations about the steps you followed. We want to stress that simply submitting code is __not__ by itself sufficient to complete the assignment - we expect the notebook to also contain explanations of what you have implemented, along with motivations for the choices you made along the way. Preferably use whole sentences, and mathematical formulas if necessary. Explaining in your own words (using concepts we have covered through in the lectures) what you have implemented and reflecting on your solution is an important part of the learning process - take it seriously!

Regarding the use of LLMs (ChatGPT or similar): you are allowed to use them as 'sparring partner', for instance to clarify something you have not understood. However, you are __not__ allowed to use them to generate solutions (either in part or in full) to the assignment tasks. 


## Basic setup

We will start by building a chatbot that directly answers user questions using an instruction-tuned LLM, without relying on any database. We will use the instruction-tuned version of the [Gemma 1.1 language model](https://huggingface.co/google/gemma-1.1-2b-it) from Google, which is available on HuggingFace. 

_Note: feel free to switch to another model (such as the newly released Llama 3 models) if you wish to experiment with them. Note, however, that the most recent LLMs will likely require a newer version of the `transformers` library than what is currently installed on Fox._



**Task 1** (4 points): Drawing inspiration from the code examples on the [Gemma webpage](https://huggingface.co/google/gemma-1.1-2b-it), implement the `__init__` and `get_response` methods. If you run the code on Fox with a GPU (or on a personal machine with a GPU), make sure that your code actually runs on the GPU.

In [16]:
import os
from dotenv import load_dotenv, find_dotenv

# loading the HUGGINGFACE_TOKEN Keys from .env file
load_dotenv(find_dotenv(), override=True)

hugging_face_token = os.getenv("HUGGINGFACE_TOKEN")

In [17]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

class BasicResponseGenerator:

    def __init__(self, model_name="google/gemma-1.1-2b-it"):
        """Loads the tokenizer and pretrained causal LM for the given model. 
        If a GPU is available, the model should be loaded on the GPU """
        self.model_name = model_name
        self.tokenizer = AutoTokenizer.from_pretrained(self.model_name, token=hugging_face_token)
        self.model = AutoModelForCausalLM.from_pretrained(
            self.model_name,
            token=hugging_face_token
        )

    def get_response(self, prompt:str, max_length:int=50) -> str:
        """Given a prompt, generate a response (of a maximum max_length tokens) and return it.
        Only the response should be returned, not the text of the prompt itself
        """
        
        input_ids = self.tokenizer(prompt, return_tensors="pt")
        outputs = self.model.generate(**input_ids,max_length=max_length)

        return self.tokenizer.decode(outputs[0])

agent = BasicResponseGenerator()

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

_Note: An easy way to verify that the GPU is actually used is to run the command `nvidia-smi` while your code is running. There also exists alternative GPU monitoring tools, like [`gpustat`](https://pypi.org/project/gpustat/0.3.2/)._

You can then test your response generator with the following set of questions: 

In [18]:
questions = ["Who is Luke Skywalker?",
             "Where is the Niima Outpost in Star Wars?",
             "Have you heard of Nute Gunray? Who is he?",
             "What kind of planet is Kashyyyk, and who discovered it?",
             "Who are Condlurans, and can you give 2-3 names of known Condlurans?",
             "What can you tell me about the First Battle of Geonosis?",
             "What is the name of the settlement where Anakin Skywalker and his mother lived?",
             "Which planet did Darth Sidious represent as senator?"]

for question in questions:
    print("Question:", question)
    print("Answer:", agent.get_response(question))
    print("-------")

Question: Who is Luke Skywalker?
Answer: <bos>Who is Luke Skywalker?

Luke Skywalker is a fictional character in the Star Wars franchise, a member of the Skywalker family, and a key figure in the Star Wars saga. He is the son of Anakin Skywalker and Padmé Amidala, and
-------
Question: Where is the Niima Outpost in Star Wars?
Answer: <bos>Where is the Niima Outpost in Star Wars?

The Niima Outpost is not mentioned in the Star Wars universe, so it does not exist.<eos>
-------
Question: Have you heard of Nute Gunray? Who is he?
Answer: <bos>Have you heard of Nute Gunray? Who is he?

I am unable to find any information about Nute Gunray on the internet.<eos>
-------
Question: What kind of planet is Kashyyyk, and who discovered it?
Answer: <bos>What kind of planet is Kashyyyk, and who discovered it?

**Kashyyyk** is a fictional planet from the Star Wars universe. It is a forested planet located in the Outer Rim.

**Kashyyyk was discovered by
-------
Question: Who are Condlurans, and can yo

If your implementation is correct, the model should give you a few correct answers, but also many responses for which the model is either unable to give a precise answer, or hallucinates a (wrong) answer. This is expected, as the model is relatively small (3 billion parameters) and is a generic model that is not particularly optimised to generate trivia about the Star Wars Franchise. We will now try to improve the model performance by coupling the LLM to a document database.

## Retrieval step

Retrieval-augmented generation operates on a simple idea: instead of directly generating a response based on the "parametric knowledge" of the LLM, we first search for relevant documents in a database (or on the web). We then include the most relevant documents to the prompt, and ask the LLM to answer the user question _based on this retrieved knowledge_. 

In this assignment, you will use a set of Wiki texts extracted from an [online Star Wars encyclopedia](https://starwars.fandom.com) as document database. The wiki texts are available as a JSON file, either [here](https://home.nr.no/~plison/data/starwars.json) or on Fox at `/fp/projects01/ec403/IN4080/starwars.json`. The JSON is simply a dictionary mapping Wiki page titles to their content (in plain text).

### Sparse retrieval 

We can start by using the newly released [BM25s](https://bm25s.github.io/) library, which implements a number of well-known search algorithms, which are all variants of the original [BM25 algorithm](https://en.wikipedia.org/wiki/Okapi_BM25) . Although BM25 is an old-fashioned search technique based on bag-of-words, it remains suprisingly effective, and is still widely used in modern NLP systems.

**Task 2** (4 points): Fill in the implementation for the `BM25Retriever` class using [BM25s](https://bm25s.github.io/) (see the library documentation for details). You should filter out stop words by adding `stopwords='en_plus'` to the arguments of the tokenizer. 

In [24]:
!curl https://home.nr.no/~plison/data/starwars.json > starwars.json

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 64.4M  100 64.4M    0     0  3172k      0  0:00:20  0:00:20 --:--:-- 3062k03 3084k20  0:00:20 --:--:-- 3006k


In [31]:
import json

with open("starwars.json", "r") as file:
    data = json.load(file)

In [41]:
data

{'Brianna': 'Brianna, also known as "The Last Handmaiden," was a half-Human, half-Echani hybrid fighter of Jedi heritage. The illegitimate daughter to the famous Echani General Yusanis and Jedi Master Arren Kae, both of whom fought in the Mandalorian Wars, she later served along with her five half-sisters as the Handmaiden Sisters under Jedi Master Atris in the period following the Jedi Civil War. During her service to Atris at the Jedi Academy on Telos IV, Brianna met Meetra Surik, also known as the Jedi Exile, and was subsequently tasked with accompanying her on her mission to seek out the Lost Jedi, Jedi Masters who survived the First Jedi Purge. As Surik\'s companion, Brianna helped battle against numerous adversaries, including the Sith Triumvirate.\n\nDuring the journey, Brianna trained Surik in Echani techniques and eventually received Jedi training from Surik in return, becoming one of the first new members of the Jedi Order after the Purge. In accepting such training, however,

In [None]:
import bm25s
import json
from typing import List

class BM25Retriever:

    def __init__(self, filename="starwars.json"):
        """Using the json file provided as input, create a BM25s retriever 
        containing all (indexed) documents."""
        with open(filename, "r") as file:
            self.data = json.load(file)
        
        self.corpus = [doc. for doc in self.data]
        corpus_tokens = bm25s.tokenize(self.corpus)
        self.retriever = bm25s.BM25(corpus=self.corpus)
        self.retriever.index(corpus_tokens)
        
    def search(self, query:str, k:int=5) -> List[str]:
        """Use the BM25 retriever to find the k documents that are closest
        to the provided query"""

        query_tokens = bm25s.tokenize(query)
        docs, scores = self.retriever.retrieve(query_tokens, k=2)

        return docs, scores


We can then test our retriever by checking whether the documents with highest BM25 scores are indeed the ones that are most relevant to the query:

In [None]:

retriever = BM25Retriever()
for question in questions:
    print("Question:", question)
    print("Retrieved documents:")
    for relevant_doc in retriever.search(question):
        print("- " + relevant_doc.replace("\n", " "))
    print("===========")

If your implementation is correct, the retrieved documents should for the most part relevant to the query. 

### Dense retrieval 

Many of those documents are, however, way too long to be included in the prompt for our Gemma model (especially if we wish to include 4-5 retrieved texts for each query!). Can we ensure that the length of each retrieved text stays within a reasonable length, such as one or two sentences? 

One strategy is to not return the full documents, but instead determine the most relevant _sentences_ within those documents. But how do we determine which sentence is most relevant? A sparse retriever using BM25 would not work well here, as it does not really account for the semantics of the query. Instead, what we can do is to:
- split the documents (retrieved through BM25) into sentences
- extract sentence embeddings for the query and for each sentence
- compute the cosine similarities between the query vector and each sentence vector
- and return the _k_ most similar sentences

In other words, our approach starts with a _sparse retrieval step_ at the level of full documents (which we already have implemented, using BM25S), and continues with a _dense retrieval step_ to determine the most relevant sentences among the sentences that are found in the retrieved documents.

**Task 3** (4 points): Re-implement the `search` method to segment into sentences each document retrieved with BM25, extract sentence embeddings for the query and sentences using the encoder model (see [here](https://sbert.net/examples/applications/semantic-search/README.html) for explanations and code examples), and then select the _k_ sentences with highest cosine similarities.  

_Tips_: You can use `nltk.sent_tokenize` to segment your document in sentences.

In [2]:
import bm25s
import re, json
import sentence_transformers
import nltk
#nltk.download('punkt_tab')
from typing import List

class HybridRetriever(BM25Retriever):

    def __init__(self, filename="/fp/projects01/ec403/IN4080/starwars.json", 
                 encoder_model="msmarco-MiniLM-L-6-v3"):
        
        """Using the json file provided as input, create a BM25s retriever 
        containing all (indexed) documents, and loads a sentence transformer model
        used to compute the embeddings for the query and sentences"""

        BM25Retriever.__init__(self, filename)
        self.encoder = sentence_transformers.SentenceTransformer(encoder_model)

    def search(self, query:str, k:int=5) -> List[str]:
        """Use the BM25 retriever to find the documents that are closest
        to the provided query, and then the sentence transformer model to
        determine the most relevant sentences"""

        docs = BM25Retriever.search(self, query, k)

        raise NotImplemented("You should implement this method")



And we can test our hybrid (sparse followed by dense) retriever on the same questions as before:

In [None]:

retriever = HybridRetriever()
for question in questions:
    print("Question:", question)
    print("Retrieved documents:")
    for relevant_doc in retriever.search(question):
        print("- " + relevant_doc.replace("\n", " "))
    print("===========")

## Putting it all together

Now that we have a functioning retriever model, we can connect it to the generative language model employed to produce the responses.

**Task 4** (4 points): Implement the `RetrievalAugmentedResponseGenerator`. Given an initial input prompt, the method should first retrieve relevant sentences using the `HybridRetriever` we have just developed. Then, it should expand the initial prompt using the provided template (you are of course free to edit or adapt it as you see fit). This expanded prompt should then be tokenized and fed as input to the LLM in the same way as before.

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

PROMPT_TEMPLATE = "'You are given the following information about Star Wars:\n-{retrieved_sentences}\nNow answer the following question in 1 or 2 sentences, based on the provided information: '{query}'"

class RetrievalAugmentedResponseGenerator:

    def __init__(self, model_name="google/gemma-1.1-2b-it", 
                 doc_filename="/fp/projects01/ec403/IN4080/starwars.json", 
                 encoder_model="all-MiniLM-L6-v2"):
        """Loads the tokenizer, pretrained causal LM for the given model, along with the 
        hybrid sparse-dense retriever model populated with the documents in doc_filename."""

        raise NotImplemented("You must implement this method")

    def get_response(self, query:str, max_length:int=50, k=3) -> str:
        """Given a prompt, retrieve k relevant sentences, generate a response (of a maximum 
        max_length tokens) and return it.
        Only the response should be returned, not the text of the prompt itself
        """

        raise NotImplemented("You must implement this method")


agent = RetrievalAugmentedResponseGenerator()

The last step is to test our system end-to-end:

In [None]:

for question in questions:
    print("Question:", question)
    print("Answer:", agent.get_response(question))
    print("-------")

**Task 5** (4 points): If you have implemented your model correctly, the system should answer correctly to at least a few questions. But it is still far from perfect, and some of the answers are flat-out wrong. Suggest 2-3 ways one could improve the current system and get even better answers. You don't need to implement anything, simply flesh out a few ideas you believe are worth trying out.

_(of course, it is even better if you actually try to implement those ideas and evaluate their influence on the quality of the system responses!)_