![Tulane](https://github.com/tulane-cmps6730/main/blob/main/img/banner.png?raw=true)

<center>

<font size="+3">Information Retrieval</font>

[Aron Culotta](https://cs.tulane.edu/~aculotta/)  
[Tulane University](https://cs.tulane.edu/)

<a href="http://colab.research.google.com/github/tulane-cmps6730/main/blob/main/notebooks/12_Information_Retrieval.ipynb">
        <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/d/d0/Google_Colaboratory_SVG_Logo.svg/320px-Google_Colaboratory_SVG_Logo.svg.png"  width=10%/></a>
<a href="https://github.com/tulane-cmps6730/main/tree/main">
        <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/c/c2/GitHub_Invertocat_Logo.svg/240px-GitHub_Invertocat_Logo.svg.png" width=6%/></a>

In this module, we'll learn how to guide language models to answer questions based on a given set of documents.

</center>

In [1]:
# we'll need these in a minute...
!pip install -U bitsandbytes unsloth


<hr size=10 color=#285C4D>

## Motivation

### "Hallucinations"

![openai](https://upload.wikimedia.org/wikipedia/commons/thumb/4/4d/OpenAI_Logo.svg/320px-OpenAI_Logo.svg.png)

> Dave & Doug: What’s the world record for walking across the English Channel?

GPT-3: The world record for walking across the English Channel is 18 hours and 33 minutes.

<br>

> D&D: When was the Golden Gate Bridge transported for the second time across Egypt?

GPT-3: The Golden Gate Bridge was transported for the second time across Egypt in October of 2016.

<br>

> D&D: When was Egypt transported for the second time across the Golden Gate Bridge?

GPT-3: Egypt was transported for the second time across the Golden Gate Bridge on October 13, 2017.

<br>

> D&D: What do fried eggs (sunny side up) eat for breakfast?

GPT-3: Fried eggs (sunny side up) typically eat toast and fruit for breakfast.

<br>

> D&D: Why does President Obama not have a prime number of friends?

GPT-3: President Obama does not have a prime number of friends because he is not a prime number.

<br>

[source](https://www.economist.com/by-invitation/2022/09/02/artificial-neural-networks-today-are-not-conscious-according-to-douglas-hofstadter)


<br>


**Hallucinations**: generated content that appears factual but is ungrounded

> Controversial anthropomorphization: (also called confabulation, delusion, or bullsh*tting)


### Customized Knowledge Bases



Often, we want to answer user questions using a specific set of documents, rather than the whole web.

> When does the Fall semester start?

<br><br>

<img width="60%" src="https://github.com/tulane-cmps6730/main/blob/main/lec/chat/figs/registrar.png?raw=true">

<br>
- Want answers to have references back to source

> Based on Tulane's website [[link]](https://registrar.tulane.edu/), the Fall semester begins on August 19th.

### Scalability + Freshness

> Who won the Tulane football game this weekend?

<br><br>

Given the time and money (millions of dollars) it takes to train an LLM on Internet-scale data, it is infeasible to retrain every day to include the most current information.

<hr size=10 color=#285C4D>

## Retrieval-Augmented Generation



**Idea:** Perform a document search to find information relevant to the query, then insert the results into the prompt to serve as context for the generated answer.


<img width="60%" src="https://github.com/tulane-cmps6730/main/blob/main/lec/chat/figs/rag1.png?raw=true">

[source](https://web.stanford.edu/class/cs329t/)

**Key steps:**

1. Break documents into chunks
2. Vectorize chunks
3. Vectorize query
4. Retrieve chunks similar to the query
5. Create LLM Prompt





## RAG Example

In [2]:
import numpy as np

documents = np.array([
    "Bob is a teacher. He teaches math. He likes to take long walks on the beach, read poetry, and drink wine.",

    "Mary is a doctor. She performs heart surgery. She likes to play pickleball and she also juggles.",

    "Bob met Mary at a Taylor Swift concert.",

    "Bob wanted to be a doctor, but failed organic chemistry.",

    "Mary hates wine.",

    "Jane is a lawyer. She does not juggle.",

    "Zebras can fly."
])

### 1. Break documents into chunks



Why?

<br><br>

- Don't want to include an entire document as part of the prompt (time + relevance).

Many options here:
- Break into sentences (e.g., see [NLTK](https://www.nltk.org/api/nltk.tokenize.sent_tokenize.html))
- Fixed window of N words or characters.
- Overlapping vs non-overlapping windows (**why?**)
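To make the window options concrete, here is a minimal sketch of fixed-size word windows with overlap. The function name `window_chunks` and the values `size=8` and `overlap=3` are illustrative assumptions, not part of the pipeline below. One thing to look for in the output: with overlap, text that straddles a window boundary still appears intact in at least one window.

```python
def window_chunks(text, size=8, overlap=3):
    """Split text into windows of `size` words; consecutive windows share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    words = text.split()
    step = size - overlap  # how far the window start moves each time
    windows = []
    for start in range(0, len(words), step):
        windows.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reaches the end of the text
    return windows

example_doc = ("Mary is a doctor. She performs heart surgery. "
               "She likes to play pickleball and she also juggles.")
for w in window_chunks(example_doc, size=8, overlap=3):
    print(w)
```

Setting `overlap=0` gives non-overlapping windows, which is a quick way to compare the two options on the same text.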



In [3]:
import re

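def chunk(documents):
  # split each document into sentence-like chunks at periods
  result = []
  for doc in documents:
    for piece in re.split(r'\.\s*', doc):
      # skip empty or very short fragments (10 characters or fewer)
      if len(piece) > 10:
        result.append(piece + '.')  # restore the period removed by the split
  return np.array(result)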

chunks = chunk(documents)
chunks

array(['Bob is a teacher.', 'He teaches math.',
       'He likes to take long walks on the beach, read poetry, and drink wine.',
       'Mary is a doctor.', 'She performs heart surgery.',
       'She likes to play pickleball and she also juggles.',
       'Bob met Mary at a Taylor Swift concert.',
       'Bob wanted to be a doctor, but failed organic chemistry.',
       'Mary hates wine.', 'Jane is a lawyer.', 'She does not juggle.',
       'Zebras can fly.'], dtype='<U70')

### 2. Vectorize chunks



Need to embed each chunk into a fixed-length vector. Options?

<br><br>
A. Binary vector of length $V$ (the vocabulary size), with a 1 for each term that appears in the chunk: [0,1,1,0]

B. TF-IDF vector:
  + $tf$: term frequency: relative frequency of term $i$ in this chunk. e.g., for "a b b c"   $~~~tf(b)=\frac{2}{4}$
  + $df$: document frequency: how many chunks does term $i$ appear in?
  + $idf$: inverse document frequency = $\frac{N}{df}$, where $N$ is the number of chunks.
    + This downweights the importance of common terms.
    + Typically use log to reduce outlier values
  + Final formula: $tf(i) * \log{\frac{N}{df(i)}}$

C. Sentence embeddings
  + We've spent half the class learning good representations of word sequences!
  + Many transformer models pre-trained for this task:
  See https://huggingface.co/sentence-transformers
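For comparison with the sentence embeddings used next, option B can be sketched in a few lines of plain Python. This is a minimal illustration of the formula $tf(i) * \log{\frac{N}{df(i)}}$ above, assuming lowercase whitespace tokenization; the function name `tfidf_vectors` and the toy chunks are made up for this example.

```python
import math
from collections import Counter

def tfidf_vectors(chunks):
    """Return (vocab, vectors) where each weight is tf(i) * log(N / df(i))."""
    tokenized = [c.lower().split() for c in chunks]
    vocab = sorted({t for toks in tokenized for t in toks})
    n = len(tokenized)
    # df(i): number of chunks that contain term i at least once
    df = Counter(t for toks in tokenized for t in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        length = max(len(toks), 1)  # guard against empty chunks
        vectors.append([(counts[t] / length) * math.log(n / df[t]) for t in vocab])
    return vocab, vectors

toy_chunks = ["a b b c", "a c d", "a d d e"]
toy_vocab, toy_vecs = tfidf_vectors(toy_chunks)
for term, weight in zip(toy_vocab, toy_vecs[0]):
    print(term, round(weight, 3))
```

In the first toy chunk, "a" gets weight 0 because it appears in every chunk ($\log{\frac{3}{3}} = 0$), while "b" gets the largest weight because it is frequent in this chunk and appears in no other.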

In [4]:
from sentence_transformers import SentenceTransformer
sentence_vectorizer = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
vectors = sentence_vectorizer.encode(chunks)
print(vectors.shape)

(12, 384)


**Each of the 12 chunks is represented by a 384-dimensional embedding vector.**

### 3. Vectorize query

Use the same model to vectorize the query, so that the query and chunk vectors live in the same embedding space.

In [5]:
query = "What does Bob do for a living?"
query_vector = sentence_vectorizer.encode(query)
query_vector.shape

(384,)

### 4. Retrieve chunks similar to the query

Recall **cosine similarity**:

$$ \cos(x,y) = \frac{x \cdot y}{\|x\| \, \|y\|}$$
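As a quick check on the formula, here is a plain-Python sketch that computes it term by term. The helper name `cosine` is an assumption for illustration, and it assumes neither vector is all zeros (otherwise the denominator is 0).

```python
import math

def cosine(x, y):
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

print(cosine([1, 0], [1, 0]))  # same direction
print(cosine([1, 0], [0, 1]))  # orthogonal vectors
print(cosine([1, 2], [2, 4]))  # same direction, different length
```

The last call shows why cosine is convenient for retrieval: it depends on the direction of the vectors, not on their length.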

In [6]:
from sklearn.metrics.pairwise import cosine_similarity
sims = cosine_similarity(query_vector.reshape(1,-1), vectors)[0]
sims

array([0.68319404, 0.38969168, 0.3044129 , 0.1635148 , 0.17533824,
       0.21293816, 0.333667  , 0.42745313, 0.1026054 , 0.13295731,
       0.0995698 , 0.1262123 ], dtype=float32)

In practice, we will have to select a subset of chunks to contribute to the answer. Options?

<br>

A. Top $k$

B. Top above similarity threshold $\epsilon$

C. ??
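Option B can be sketched as follows, here combined with a cap on the number of chunks. The scores are copied from the output above; the function name `select_chunks`, the threshold `epsilon=0.35`, and the cap `max_k=5` are arbitrary assumptions chosen for illustration.

```python
# Similarity scores copied from the output above (one per chunk).
example_sims = [0.68319404, 0.38969168, 0.3044129, 0.1635148, 0.17533824,
                0.21293816, 0.333667, 0.42745313, 0.1026054, 0.13295731,
                0.0995698, 0.1262123]

def select_chunks(sims, epsilon=0.35, max_k=5):
    """Indices with similarity >= epsilon, most similar first, at most max_k of them."""
    ranked = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)
    return [i for i in ranked if sims[i] >= epsilon][:max_k]

print(select_chunks(example_sims))               # [0, 7, 1]
print(select_chunks(example_sims, epsilon=0.9))  # []
```

Unlike a fixed top $k$, a threshold can return nothing at all, which gives the system a chance to say that no relevant document was found.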

In [7]:
# let's take top 3 for now.
topi = sims.argsort()[::-1][:3]
topi

array([0, 7, 1])

In [8]:
context = '\n'.join(['%d: %s' % (i, chunks[i]) for i in topi])
print(context)

0: Bob is a teacher.
7: Bob wanted to be a doctor, but failed organic chemistry.
1: He teaches math.


### 5. Create LLM Prompt

Finally, we pass the retrieved results and the original query to an LLM to request an answer.

First, let's load the same LLM we used in the Transformers lecture.

In [9]:
from unsloth import FastLanguageModel

def get_llm():
  from unsloth import PatchDPOTrainer
  PatchDPOTrainer()
  max_seq_length = 1024 # Choose any! We auto support RoPE Scaling internally!
  dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
  load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.
  llm, tokenizer = FastLanguageModel.from_pretrained(
      model_name = "unsloth/zephyr-sft-bnb-4bit", # Choose ANY! eg mistralai/Mistral-7B-Instruct-v0.2
      max_seq_length = max_seq_length,
      dtype = dtype,
      load_in_4bit = load_in_4bit,
  )
  return llm, tokenizer

llm, tokenizer = get_llm()

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2024.12.2: Fast Mistral patching. Transformers:4.46.3.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu121. CUDA: 7.5. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth

In [10]:
def chat(prompt, max_new_tokens=100):
  # using llm global variable
  test_prompts = [
      {'role': 'user', 'content': prompt}
  ]
  FastLanguageModel.for_inference(llm)
  test_inputs = tokenizer.apply_chat_template(test_prompts, return_tensors="pt").to('cuda')
  generated_ids = llm.generate(test_inputs, max_new_tokens=max_new_tokens,
                                 do_sample=False, # greedy decoding: take the most probable token every time (temperature is ignored when not sampling)
                                 pad_token_id=tokenizer.eos_token_id)
  print(tokenizer.batch_decode(generated_ids)[0])

Of course, without the retrieved chunks, the language model doesn't know anything about Bob...

In [11]:
chat("What does Bob do for a living?")

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


<|user|>
What does Bob do for a living?</s> 
<|assistant|>
The passage provided does not offer any information about Bob's occupation or profession.</s>


In [12]:
def make_prompt(query, context):
  return f"""Retrieved Documents:

{context}

Query: {query}

Provide a detailed answer to the query based on the retrieved documents.
"""
prompt = make_prompt(query, context)
print(prompt)

Retrieved Documents: 
  
0: Bob is a teacher.
7: Bob wanted to be a doctor, but failed organic chemistry.
1: He teaches math.

Query: What does Bob do for a living?

Provide a detailed answer to the query based on the retrieved documents.



In [13]:
chat(prompt)

<|user|>
Retrieved Documents: 
  
0: Bob is a teacher.
7: Bob wanted to be a doctor, but failed organic chemistry.
1: He teaches math.

Query: What does Bob do for a living?

Provide a detailed answer to the query based on the retrieved documents.
</s> 
<|assistant|>
Based on the retrieved documents, it is clear that Bob's current profession is that of a teacher. Document 0 states explicitly that Bob is a teacher, and document 1 also confirms this by stating that he teaches math. While document 7 provides some background information about Bob's original career aspirations, it does not change the fact that his current occupation is that of a teacher. Therefore, it can be concluded that Bob's primary


Boy, that is very wordy...make it shorter?

In [14]:
def make_concise_prompt(query, context):
  return f"""Retrieved Documents:

{context}

Query: {query}

Provide a *concise* answer to the query based on the Retrieved Documents.
"""

chat(make_concise_prompt(query, context))

<|user|>
Retrieved Documents:
  
0: Bob is a teacher.
7: Bob wanted to be a doctor, but failed organic chemistry.
1: He teaches math.

Query: What does Bob do for a living?

Provide a *concise* answer to the query based on the Retrieved Documents.
</s> 
<|assistant|>
Bob is a teacher who teaches math. The statement "Bob is a teacher" from document 0, along with the fact that he teaches math in document 1, provides enough information to answer the query. The additional information in document 7 about Bob's original career aspirations does not change the fact that he currently works as a teacher.</s>


### RAG Pipeline

In [15]:
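def answer_question(query, vectors=vectors, llm=llm, sentence_vectorizer=sentence_vectorizer):
  # note: chat() reads the global `llm`, and `chunks` is also a global
  query_vector = sentence_vectorizer.encode(query)  # 3. vectorize query
  sims = cosine_similarity(query_vector.reshape(1,-1), vectors)[0]  # 4. compare query to every chunk
  topi = sims.argsort()[::-1][:3]  # keep the 3 most similar chunks
  context = '\n'.join(['%d: %s' % (i, chunks[i]) for i in topi])
  chat(make_concise_prompt(query, context))  # 5. create prompt and generate answer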

answer_question('What does Bob do for a living?')

<|user|>
Retrieved Documents:
  
0: Bob is a teacher.
7: Bob wanted to be a doctor, but failed organic chemistry.
1: He teaches math.

Query: What does Bob do for a living?

Provide a *concise* answer to the query based on the Retrieved Documents.
</s> 
<|assistant|>
Bob is a teacher who teaches math. The statement "Bob is a teacher" from document 0, along with the fact that he teaches math in document 1, provides enough information to answer the query. The additional information in document 7 about Bob's original career aspirations does not change the fact that he currently works as a teacher.</s>


In [16]:
documents

array(['Bob is a teacher. He teaches math. He likes to take long walks on the beach, read poetry, and drink wine.',
       'Mary is a doctor. She performs heart surgery. She likes to play pickleball and she also juggles.',
       'Bob met Mary at a Taylor Swift concert.',
       'Bob wanted to be a doctor, but failed organic chemistry.',
       'Mary hates wine.', 'Jane is a lawyer. She does not juggle.',
       'Zebras can fly.'], dtype='<U105')

In [17]:
answer_question('What does Mary do for a living?')

<|user|>
Retrieved Documents:
  
3: Mary is a doctor.
8: Mary hates wine.
6: Bob met Mary at a Taylor Swift concert.

Query: What does Mary do for a living?

Provide a *concise* answer to the query based on the Retrieved Documents.
</s> 
<|assistant|>
Mary is a doctor, as stated in Retrieved Document 3.</s>


In [18]:
answer_question('Does Bob know Mary?')

<|user|>
Retrieved Documents:
  
6: Bob met Mary at a Taylor Swift concert.
0: Bob is a teacher.
3: Mary is a doctor.

Query: Does Bob know Mary?

Provide a *concise* answer to the query based on the Retrieved Documents.
</s> 
<|assistant|>
Based on the Retrieved Documents, it is not explicitly stated whether Bob knows Mary. However, the fact that they met at a Taylor Swift concert suggests they may have some connection beyond just being fans of the same artist. Without further information, it is unclear whether their relationship extends beyond this chance encounter.</s>


In [19]:
answer_question('Does Mary like Bob?')

<|user|>
Retrieved Documents:
  
6: Bob met Mary at a Taylor Swift concert.
3: Mary is a doctor.
0: Bob is a teacher.

Query: Does Mary like Bob?

Provide a *concise* answer to the query based on the Retrieved Documents.
</s> 
<|assistant|>
The Retrieved Documents do not provide information about whether Mary likes Bob. Therefore, a *concise* answer to the query would be "The Retrieved Documents do not indicate whether Mary likes Bob."</s>


In [20]:
answer_question('Does Mary juggle?')

<|user|>
Retrieved Documents:
  
10: She does not juggle.
5: She likes to play pickleball and she also juggles.
3: Mary is a doctor.

Query: Does Mary juggle?

Provide a *concise* answer to the query based on the Retrieved Documents.
</s> 
<|assistant|>
5 out of 10 individuals mentioned in the documents have the ability to juggle, and one of them is identified as "she." In document 5, it is also stated that this person, who is not named, enjoys playing pickleball. However, in document 3, we learn that Mary is a doctor. As there is no direct mention of Mary juggling in any of the documents, we cannot conclusively say that she does


**What went wrong here??**

<br><br><br>

Pronoun resolution!

Context is too short. Options??


<br><br>

- Overlapping chunks
- Return preceding/following $k$ chunks for each retrieved chunk.
- Resolve all pronouns in text
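The second option can be sketched as below. It assumes we keep, for every chunk, the index of the document it came from; this `doc_ids` bookkeeping is hypothetical, since the `chunk()` function above does not record it. The function name `expand_with_neighbors` is also made up for this example.

```python
def expand_with_neighbors(top_indices, doc_ids, k=1):
    """Add up to k preceding and k following chunks from the same source document."""
    keep = set()
    for i in top_indices:
        for j in range(i - k, i + k + 1):
            # stay inside the chunk list and never cross a document boundary
            if 0 <= j < len(doc_ids) and doc_ids[j] == doc_ids[i]:
                keep.add(j)
    return sorted(keep)  # document order, so pronouns follow their antecedents

# example_doc_ids[j] = index of the document that chunk j came from,
# matching the 12 chunks printed earlier.
example_doc_ids = [0, 0, 0, 1, 1, 1, 2, 3, 4, 5, 5, 6]

# Chunks retrieved for "Does Mary juggle?" were 10, 5, and 3.
print(expand_with_neighbors([10, 5, 3], example_doc_ids, k=1))  # [3, 4, 5, 9, 10]
```

With the neighbors included, "She does not juggle." (chunk 10) now arrives together with "Jane is a lawyer." (chunk 9), so the prompt contains what is needed to tell the two "she"s apart.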

In [21]:
answer_question('Can zebras fly?')

<|user|>
Retrieved Documents:
  
11: Zebras can fly.
5: She likes to play pickleball and she also juggles.
10: She does not juggle.

Query: Can zebras fly?

Provide a *concise* answer to the query based on the Retrieved Documents.
</s> 
<|assistant|>
No, based on the Retrieved Documents, zebras cannot fly. The first document is a false statement, and there is no information in the other documents to suggest that zebras have the ability to fly.</s>


In [22]:
chat("""Retrieved Documents:

1: Goopfoofs can fly.

Query:

Can Goopfoofs fly?

Provide a *concise* answer to the query based on the Retrieved Documents.
""")

<|user|>
Retrieved Documents: 

1: Goopfoofs can fly. 

Query: 

Can Goopfoofs fly?

Provide a *concise* answer to the query based on the Retrieved Documents.
</s> 
<|assistant|>
Yes, according to Retrieved Document 1, Goopfoofs can fly.</s>


**What went wrong here??**


<br><br><br>

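The model's pre-training knowledge (zebras cannot fly) conflicts with "[in-context learning](https://arxiv.org/abs/2301.00234)" from the retrieved documents. For a made-up word like Goopfoofs, there is no prior knowledge to conflict with, so the model accepts the retrieved statement.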

In [23]:
chat("""Retrieved Documents:

1: The Mona Lisa is one of the most famous paintings in the Louvre Museum.

Query:

Who painted the Mona Lisa, and when was it completed?

Provide a *concise* answer to the query based on the Retrieved Documents.
""")

<|user|>
Retrieved Documents: 

1: The Mona Lisa is one of the most famous paintings in the Louvre Museum.

Query: 

Who painted the Mona Lisa, and when was it completed?

Provide a *concise* answer to the query based on the Retrieved Documents.
</s> 
<|assistant|>
The Mona Lisa was painted by Leonardo da Vinci and its completion date is disputed, but most historians believe it was finished around 1506.</s>


In [24]:
chat("""Retrieved Documents:

1: John Adams was the second president of the United States.
2: Thomas Jefferson was the third president of the United States.

Query:  Did John Adams and Thomas Jefferson work together during their presidencies?

Provide a *concise* answer to the query based on the Retrieved Documents.
""")

<|user|>
Retrieved Documents: 

1: John Adams was the second president of the United States.
2: Thomas Jefferson was the third president of the United States.

Query:  Did John Adams and Thomas Jefferson work together during their presidencies?

Provide a *concise* answer to the query based on the Retrieved Documents.
</s> 
<|assistant|>
No, John Adams and Thomas Jefferson did not work together during their presidencies. John Adams served as the second president from 1797 to 1801, and Thomas Jefferson succeeded him as the third president from 1801 to 1809. They were political rivals during Adams' presidency and their relationship deteriorated further during Jefferson's presidency due to political differences.</s>


<hr size=10 color=#285C4D>

## Group Task

Play with the documents and prompts above to see what it takes to get the system to report that Zebras can fly. What worked and what didn't? Speculate why you observe the behavior that you do.

Also work with the Mona Lisa example. Are you able to get the LLM to only answer based on information in the retrieved documents (e.g., don't output the year if it's not listed)?

<hr size=10 color=#285C4D>

## Sources

- https://web.stanford.edu/class/cs329t/
- https://colab.research.google.com/github/truera/trulens/blob/main/examples/quickstart/quickstart.ipynb