<a href="https://colab.research.google.com/github/thanhnv2303/polio_llm/blob/main/lab06/text-06-prompt-engineering-rag.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Exercise 1: Prompt Engineering

We use Llama as our starting point. The cell below shows a typical way to feed a prompt to Llama and generate text.

In [1]:
from huggingface_hub import login

import torch
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM

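model_id = "meta-llama/Llama-3.2-1B"

# Llama 3.2 is a gated repo: request access on the model page and authenticate
# (for example with login() or the HF_TOKEN Colab secret) before loading it

# Use the GPU if one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the model and tokenizer (device_map="auto" is omitted because the model is moved explicitly below)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Move the model to the device (GPU if available)
model = model.to(device)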


# Input prompt - Make it clear that you want only the direct answer without any explanations or options
prompt = """
System: You are an expert on world capitals.
Respond with only the capital city of the given country. Do not repeat the question.

Query: What is the capital of France?
Answer:
"""

# Tokenize the input
inputs = tokenizer(prompt, return_tensors="pt").to(device)

# Generate a response
output = model.generate(
    inputs['input_ids'],  # Tokenized input
    attention_mask=inputs['attention_mask'],  # Marks which tokens are real input
    max_length=100,         # Cap the total length (prompt + generated tokens)
    temperature=0.7,        # Values below 1.0 reduce randomness
    do_sample=True,        # Sample from the distribution (set to False for deterministic greedy decoding)
    pad_token_id=tokenizer.eos_token_id  # Use the end-of-sequence token for padding
)

# Decode the response into human-readable text
response = tokenizer.decode(output[0], skip_special_tokens=True)

# Keep only the text generated after the "Answer:" marker in the prompt
answer = response.split("Answer:", 1)[-1].strip()
print("Response:", answer)



The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


OSError: You are trying to access a gated repo.
Make sure to have access to it at https://huggingface.co/meta-llama/Llama-3.2-1B.
401 Client Error. (Request ID: Root=1-691da1b6-29a240107df2a909433ce69b;ef289c6d-803f-4333-a77c-87fdeb4c4c9f)

Cannot access gated repo for url https://huggingface.co/meta-llama/Llama-3.2-1B/resolve/main/config.json.
Access to model meta-llama/Llama-3.2-1B is restricted. You must have access to it and be authenticated to access it. Please log in.

### Fitz

Reference libraries to install: `pip install openai pymupdf faiss-cpu scikit-learn`

PyMuPDF (imported as `fitz`) is a Python library for working with PDF files (as well as other document formats like XPS, OpenXPS, CBZ, EPUB, and FB2). It's built on the MuPDF library, a lightweight, high-performance PDF and XPS rendering engine. With PyMuPDF, you can read, create, and edit PDFs, and extract their text, images, and annotations.

In [None]:
import fitz

#open an example pdf
doc = fitz.open("example.pdf")

# Extract text from the first page
page = doc.load_page(0)
text = page.get_text("text")  # Use 'text' mode to get raw text
print(text)


### Example: Text Summarization

Let's ask Llama to summarize the example PDF.
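
As a starting point, here is a minimal sketch of how the summarization prompt could be assembled once the article text is available as a string. The helper name `build_summary_prompt`, the instruction wording, and the `max_chars` cut-off are illustrative assumptions, not part of the lab; the template mirrors the `{PROMPT}. article: {BODY}` pattern used in the cell below.

```python
# Hypothetical helper: the name, default instruction, and max_chars are assumptions.
DEFAULT_INSTRUCTION = "Summarize the following article in at most five bullet points"

def build_summary_prompt(article, instruction=DEFAULT_INSTRUCTION, max_chars=6000):
    """Combine an instruction and an article body into one prompt string."""
    # Collapse runs of whitespace left over from PDF extraction
    body = " ".join(article.split())
    # Crude character-based truncation so the prompt stays within the context window
    body = body[:max_chars]
    return "{PROMPT}. article: {BODY}".format(PROMPT=instruction, BODY=body)

example_article = "Large language models\n  generate text   token by token."
p_example = build_summary_prompt(example_article)
print(p_example)
```

The character limit is only a rough proxy for the token limit; counting tokens with the tokenizer would be more precise.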

In [None]:
#define the prompt to ask for text summarization.
text_summarization_prompt = ""      #define your prompt here
text = ""                           #load here the FULL text of the article
p1 =  """{PROMPT}. article: {BODY}""".format(PROMPT=text_summarization_prompt, BODY=text)

#feed the prompt to llama
#print the result of text summarization into bullets

r1 = ""

### Adding a System Prompt

Llama's chat models were trained with a system message that sets the context and persona to assume when solving a task. One of the unsung advantages of open-access models is that you have full control over the system prompt in chat applications. This is essential for specifying the behavior of your chat assistant (and even giving it some personality), but it is out of reach in models served behind APIs.
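
One way to attach a system message to a user prompt is sketched below. It assumes the Llama 2 chat layout (`[INST] ... [/INST]` with a `<<SYS>>` block), which matches the template shown later in this notebook; the helper name `with_system_prompt` is hypothetical, and a base checkpoint such as `meta-llama/Llama-3.2-1B` may not respond to these tags the way a chat-tuned model would.

```python
# Sketch only: assumes the Llama 2 chat layout ([INST] ... [/INST] with a <<SYS>> block).
def with_system_prompt(system_message, user_message):
    """Wrap a system message and a user message in Llama 2 style chat tags."""
    return (
        "[INST] <<SYS>>\n"
        f"{system_message.strip()}\n"
        "<</SYS>>\n\n"
        f"{user_message.strip()} [/INST]"
    )

p_example = with_system_prompt(
    "You are a helpful, respectful and honest assistant.",
    "Summarize the following article in three bullet points. article: (article text)",
)
print(p_example)
```

The leading `<s>` token is left out here on the assumption that the tokenizer adds the beginning-of-sequence token itself.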


In [None]:
#default system message from the Hugging Face blog, to be added to the prompt from above
#(adjacent string literals are joined, so no indentation leaks into the prompt)
system_prompt = (
    "<<SYS>> You are a helpful, respectful and honest assistant. "
    "Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, "
    "unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses "
    "are socially unbiased and positive in nature. If a question does not make any sense, or is not factually "
    "coherent, explain why instead of answering something not correct. If you don't know the answer to a question, "
    "please don't share false information. <</SYS>>"
)

#concatenate the system prompt with your prompt and get the response
p2 = ""

r2 = ""

#what changes?

### Customizing the System Prompt

With Llama we have full control over the system prompt. The following experiment instructs Llama to assume the persona of a researcher tasked with writing a concise brief.

Apply the following changes to the original system prompt:
- Use the researcher persona and specify the task of summarizing articles.
- Remove the safety instructions; they are unnecessary here because we ask Llama to stay faithful to the article.


In [None]:
new_system_prompt = ""

p3 = ""

r3 = ""

### Chain-of-Thought prompting

In this lab, chain-of-thought prompting means building a prompt from the answer to a previous prompt. To extract information from the text, we first ask Llama what the article is about and then use the response in a second question: what problem does [what the article is about] solve?
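
The two-step flow can be sketched as follows. `generate_fn` is a hypothetical callable that takes a prompt and returns the model's answer as text (for example, a thin wrapper around `model.generate` and `tokenizer.decode`); the prompt wording and the `fake_generate` stand-in are assumptions made so the flow can be checked without loading Llama.

```python
# Sketch: generate_fn is a hypothetical callable (prompt -> answer text).
def chain_prompts(article, generate_fn):
    """Ask what the article is about, then reuse that answer in a second prompt."""
    first_prompt = f"In one sentence, what is the following article about? article: {article}"
    topic = generate_fn(first_prompt).strip()

    # The first answer is embedded in the second question
    second_prompt = f"What problem does {topic} solve? article: {article}"
    answer = generate_fn(second_prompt).strip()
    return topic, second_prompt, answer

# Stand-in for the model so the flow can be checked without loading Llama
def fake_generate(prompt):
    if prompt.startswith("In one sentence"):
        return " retrieval-augmented generation "
    return "It grounds answers in retrieved documents."

topic, second_prompt, answer = chain_prompts("(article text)", fake_generate)
print(second_prompt)
```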



In [None]:
#define a prompt to ask what the article is about

p4 = ""

r4 = ""

#now embed the result of the previous prompt in a new prompt to ask what that solves

p5 = ""

r5 = ""




### Generating JSONs with Llama

Llama needs precise instructions when asked to generate JSON. In essence, here is what works for me to get valid JSON consistently:

- Explicitly state “All output must be in valid JSON. Don’t add explanation beyond the JSON” in the system prompt.
- Add an “explanation” variable to the JSON example. Llama enjoys explaining its answers. Give it an outlet.
- Use the JSON as part of the instruction. See the “in_less_than_ten_words” example below.
- Change “write the answer” to “output the answer.”
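
Even with these instructions, the model may wrap the JSON in extra text, so it can help to validate the output. The sketch below shows one possible approach using only the standard library; the helper name `extract_json`, the `{ARTICLE}` placeholder, and the sample output string are assumptions for illustration.

```python
import json

# Double braces survive str.format() as literal braces; {ARTICLE} is a hypothetical placeholder.
json_instruction = (
    "Output must be in valid JSON like the following example "
    '{{"topic": topic, "explanation": [in_less_than_ten_words]}}. '
    "Output must include only JSON. article: {ARTICLE}"
)
prompt = json_instruction.format(ARTICLE="(article text)")

def extract_json(model_output):
    """Return the first JSON object found in model_output, or None if none parses."""
    decoder = json.JSONDecoder()
    for start, char in enumerate(model_output):
        if char != "{":
            continue
        try:
            # raw_decode parses one JSON value and ignores any trailing text
            parsed, _ = decoder.raw_decode(model_output[start:])
            return parsed
        except json.JSONDecodeError:
            continue
    return None

raw = 'Sure! {"topic": "RAG", "explanation": "retrieval plus generation"} Hope this helps.'
result = extract_json(raw)
print(result)
```

Comparing how often `extract_json` returns `None` with and without the formatting instruction is one simple way to quantify the difference between the two prompts.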


In [None]:


#example addition to a prompt to deal with jsons
json_prompt_addition = "Output must be in valid JSON like the following example {{\"topic\": topic, \"explanation\": [in_less_than_ten_words]}}. Output must include only JSON."

#now generate a prompt by correctly concatenating the system prompt, the json prompt instruction, and an article
p6 = ""

r6 = ""

#compare the difference between the prompt with the formatting instruction and a regular prompt without formatting instructions. is there any difference?




### One-to-Many Shot Learning Prompting

One-to-Many Shot Learning is a term that refers to a type of machine learning problem where the goal is to learn to recognize many different classes of objects from only one or a few examples of each class. For example, if you have only one image of a cat and one image of a dog, can you train a model to distinguish between cats and dogs in new images? This is a challenging problem because the model has to generalize well from minimal data.

Important points about the prompts:

- The system prompt includes the instructions to output the answer in JSON.
- The prompt contains a one-to-many shot learning section that starts after ```<</SYS>>``` and ends with ```</s>```. The prompt template below makes this easier to understand.
- The examples are given in JSON because the answers need to be JSON.
- The JSON allows defining the response with name, type, and explanation.
- The prompt question starts with the second ```<s>[INST]``` and ends with the last ```[/INST]```.

```
<s>[INST] <<SYS>>
SYSTEM MESSAGE
<</SYS>>
EXAMPLE QUESTION [/INST]
EXAMPLE ANSWER(S)
</s>
<s>[INST]  
QUESTION
[/INST]
```
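
The template above can be filled programmatically. The following is a minimal sketch; the function name `build_few_shot_prompt` and all argument values are illustrative assumptions rather than the lab's reference solution.

```python
# Sketch following the template above; all argument values are illustrative.
def build_few_shot_prompt(system_message, example_question, example_answer, question):
    """Fill the one-to-many shot template with a worked example and the real question."""
    return (
        f"<s>[INST] <<SYS>>\n{system_message}\n<</SYS>>\n"
        f"{example_question} [/INST]\n"
        f"{example_answer}\n"
        "</s>\n"
        f"<s>[INST]\n{question}\n[/INST]"
    )

example_answer = '[{"name": "Ford", "type": "company", "explanation": "Ford is a company that built vehicles"}]'
p_example = build_few_shot_prompt(
    system_message="All output must be in valid JSON. Don't add explanation beyond the JSON.",
    example_question="Describe all the main nouns in the article. article: (example article)",
    example_answer=example_answer,
    question="Describe all the main nouns in the article. article: (new article)",
)
print(p_example)
```

Because the template writes `<s>` explicitly, it may be worth checking the token IDs after tokenization in case the tokenizer also adds its own beginning-of-sequence token.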

In [None]:
#describe all the main nouns in the example.pdf article

#use the following examples for one-to-many shot prompting
#(double braces are escaped for str.format; the list has no trailing comma so it stays valid JSON)
nouns = """[\
{{"name": "semiconductor", "type": "industry", "explanation": "Companies engaged in the design and fabrication of semiconductors and semiconductor devices"}},\
{{"name": "NBA", "type": "sport league", "explanation": "NBA is the national basketball league"}},\
{{"name": "Ford F150", "type": "vehicle", "explanation": "Article talks about the Ford F150 truck"}},\
{{"name": "Ford", "type": "company", "explanation": "Ford is a company that built vehicles"}},\
{{"name": "John Smith", "type": "person", "explanation": "Mentioned in the article"}}\
]"""

#now build the prompt following the template described above
p7 = ""

r7 = ""

#compare the response of the prompt described above and a zero-shot prompt. Are there any differences?


## Exercise 2: RAG (Retrieval-Augmented Generation)

RAG (Retrieval-Augmented Generation) is a powerful framework in Natural Language Processing (NLP) that enhances the performance of language models by combining traditional generative models with external knowledge retrieval. This hybrid approach allows models to retrieve relevant information from a large corpus (like a database or document collection) and incorporate this information into the generation process. It is particularly useful when a model needs to answer questions, generate content, or provide explanations based on real-time or domain-specific data.



In [None]:
import os
import glob


#TODO:  Function to extract text from a PDF
def extract_text_from_pdf(pdf_path):
    print("")
    #your code here...

# Extract text from all uploaded PDF files
pdf_texts = {}
# your code here...

#Display the text from all the PDF files
for pdf_file, text in pdf_texts.items():
    print("") #implement PDF read

### Creating an index of vectors to represent the documents

To perform efficient searches, we need to convert our text data into numerical vectors. To do so, we will use a pretrained BERT encoder and keep the hidden state of the `[CLS]` token as the vector for each text.

Since the full PDF texts are too long to be fed into BERT (bert-base-uncased accepts at most 512 tokens), we build two dictionaries: one maps a document number to its abstract, and the other maps the same document number to its full text.
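
BERT returns one vector per token, and a common choice is to keep only the vector of the first token, `[CLS]`, as the representation of the whole text. The sketch below illustrates just the indexing and the resulting shapes; it uses a random NumPy array as a stand-in for the real BERT output, and the batch size and sequence length are arbitrary assumptions.

```python
import numpy as np

# Stand-in for BERT output: random numbers instead of real hidden states (assumption for illustration).
batch_size, sequence_length, hidden_size = 3, 16, 768
rng = np.random.default_rng(0)
last_hidden_state = rng.normal(size=(batch_size, sequence_length, hidden_size))

# Position 0 of every sequence holds the [CLS] token, used here as the text-level vector
cls_vectors = last_hidden_state[:, 0]

print(last_hidden_state.shape)  # (3, 16, 768)
print(cls_vectors.shape)        # (3, 768)
```

With the real model, the same indexing would apply to the `last_hidden_state` attribute of the output of `model(**inputs)`, where `inputs` comes from calling the tokenizer with `padding=True`, `truncation=True`, and `return_tensors="pt"`.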


In [None]:
from transformers import AutoModel, AutoTokenizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np




#import the Bert pretrained model from the transformers library
model = AutoModel.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

#initialization of the dictionary of abstracts. Substitute this with the abstracts of the 10 papers considered as sources for RAG
#(we could use functions to read the PDFs to "cut" the abstracts from the papers. For simplicity reasons, we will copy and paste them)
abstracts_dict = {
    0: ""
}

#TODO: use the abstracts as input to the BERT model

#The tokenized inputs are passed to the BERT model for processing.
#(remember padding=True: it pads all inputs to the same length, allowing batch processing)
#The model outputs a tensor (last_hidden_state), where each input token is represented by a high-dimensional vector.
#last_hidden_state is of shape (batch_size, sequence_length, hidden_size), where:
#- batch_size: number of input texts
#- sequence_length: length of each tokenized text (after padding)
#- hidden_size: dimensionality of the vector for each token (768 for bert-base-uncased)

#last_hidden_state[:, 0] selects the representation of the [CLS] token for each input text. The [CLS] token is a special token added at the start of each input and is often used as the aggregate representation of the entire sequence.

abstract_vectors = ""   #TODO: replace with the [CLS] vectors

#abstract_vectors is a tensor of shape (batch_size, hidden_size), e.g. (10, 768) for 10 abstracts: each text becomes a single 768-dimensional vector.

print(abstract_vectors.shape)



### Search

With our text data vectorized and indexed, we can now perform searches. We will define a function that searches the index for the documents most relevant to a query.

The search function computes the cosine similarity between the query vector and all the abstract vectors and returns the top-k indices. A second function then uses these indices to collect the full text of the documents from the paper dictionary.

To compute cosine similarity, use the following call:

```cs = cosine_similarity(vector_a.detach().numpy(), vector_b.detach().numpy())```
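
To make the ranking step concrete, here is a small NumPy sketch that computes cosine similarity by hand and returns the top-k indices. The function name `top_k_by_cosine` and the toy 2-dimensional vectors are assumptions for illustration; in the notebook the inputs would be the BERT vectors converted with `.detach().numpy()`, and `cosine_similarity` from scikit-learn could replace the manual normalization.

```python
import numpy as np

def top_k_by_cosine(query_vector, abstract_vectors, k):
    """Return indices of the k rows of abstract_vectors most similar to query_vector, shape (1, k)."""
    # Normalize rows so that a dot product equals cosine similarity
    q = query_vector / np.linalg.norm(query_vector, axis=1, keepdims=True)
    a = abstract_vectors / np.linalg.norm(abstract_vectors, axis=1, keepdims=True)
    similarities = q @ a.T  # shape (1, number_of_abstracts)
    # Sort in descending order of similarity and keep the first k columns
    return np.argsort(-similarities, axis=1)[:, :k]

abstract_vectors = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query_vector = np.array([[2.0, 0.1]])
top_indices = top_k_by_cosine(query_vector, abstract_vectors, k=2)
print(top_indices)  # [[0 2]]
```

The query points almost along the first axis, so document 0 ranks first, followed by document 2, which shares that direction partially.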



In [None]:




def get_top_k_similar_indices(query_vector, abstract_vectors, k):

    #Computes the top k indices of the most similar abstracts to the query based on cosine similarity.

    #Parameters:
    #- query_vector: A tensor of shape (1, hidden_size) representing the query vector.
    #- abstract_vectors: A tensor of shape (batch_size, hidden_size) representing the abstract vectors.
    #- k: The number of top indices to return.

    #Returns:
    #- sorted_indices: A numpy array of shape (1, k) containing the indices of the top k most similar abstracts.

    return ""


def retrieve_documents(indices, documents_dict):

    #Retrieves the documents corresponding to the given indices and concatenates them into a single string.

    #Parameters:
    #- indices: A numpy array or list of top-k indices of the most similar documents.
    #- documents_dict: A dictionary where keys are document indices (integers) and values are the document texts (strings).

    #Returns:
    #- concatenated_documents: A string containing the concatenated texts of the retrieved documents.

    return ""



#now I create a vector also for my query

query = ""

query_vector = ""





### A function to perform Retrieval Augmented Generation

In this step, we combine the context retrieved from our documents with the query and pass both to Llama to generate a response. The context gives the model the information it needs to produce more accurate and relevant answers.
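
One possible way to assemble the augmented prompt is sketched below. The function name `build_rag_prompt`, the `System:` / `Context:` / `Query:` / `Answer:` layout (modeled on the first prompt in this notebook), and the `max_context_chars` limit are assumptions, not the required solution.

```python
# Sketch: function name, prompt layout, and max_context_chars are assumptions.
def build_rag_prompt(system_message, documents, query, max_context_chars=4000):
    """Assemble system message, retrieved context, and query into one prompt."""
    # Join the retrieved documents and cap the length so the prompt fits the context window
    context = "\n\n".join(documents)[:max_context_chars]
    return (
        f"System: {system_message}\n\n"
        f"Context:\n{context}\n\n"
        f"Query: {query}\n"
        "Answer:"
    )

docs = ["Paper A studies dense retrieval.", "Paper B evaluates RAG on question answering."]
rag_prompt = build_rag_prompt(
    "Answer using only the context. If the context is insufficient, say so.",
    docs,
    "Which paper evaluates RAG?",
)
print(rag_prompt)
```

Running the same query without the `Context:` section gives the no-RAG baseline for the comparison asked for at the end of the exercise.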

In [None]:


#now we put it all together

def generate_augmented_response(query, documents):

    system = ""             #TODO: define system prompt

    context = ""               #TODO: concatenate here all the search results


    prompt = ""                 #TODO: create the prompt for LLAMA (system + context + query)

    response = ""

    #perform a query with LLAMA in the usual way

    #return the response
    return ""


# TODO: generate the queries!
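query = ""
documents = ""  #TODO: retrieve the documents for the query (see retrieve_documents above)
response = generate_augmented_response(query, documents)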
print(response)

#TODO: now compare the results with a prompt without RAG. What are the results?



