# IUST Computer Engineering Department 🏫
## Introduction to Natural Language Processing 📚 (The Final Project)
### Course Instructor: Dr. Marzieh Davoodabadi Farahani 👩‍🏫
### Project Teaching Assistant: Erfan Moosavi Monazzah (tel: @ErfanMoosavi2000) 📞
-------------------------------------------------------------------------------<br>
The objective of this project is to acquaint you with the fundamentals of Retrieval Augmented Generation (RAG). Be sure to explore various options and address challenges in a creative manner. 🎯

**Project Guidelines** 📝
- Avoid cheating at all costs. If a set of submissions is found to be [plagiarized](https://translate.google.as/?sl=en&tl=fa&text=Very%20hard%20word%2C%20I%20know%2C%20here%27s%20the%20meaning%3A%0Aplagiarized&op=translate), only one will be randomly chosen for grading. The others will fail the project. ❌
- You are allowed to use any document, article, paper, or video as a resource for writing your code, provided you include a link to the material used. 📖
- The use of Large Language Models (LLMs), chatbots, and copilots is encouraged. If you use any of these tools, make sure to attach the chat history that led you to the answer to your question, or to the code, to this .ipynb document. (You must provide the entire chat, not just the final answer or your initial prompt.) 💻
- You may not submit any additional documents, files, etc., along with this document. Only solutions, codes, explanations, etc., in this document will be graded. 📄
- You are required to implement everything (except the Language Modeling parts) from scratch. The use of libraries like langchain, llama_index, etc., is not permitted for this purpose. 🚫
- Please adhere to the code guidelines provided throughout the documents. 📝 I’ve spent time in a library 📚 crafting all of this, so if you overlook them, you’ll lose the points allocated for that section. ❌
- This assignment requires a GPU, so don't forget to turn on GPU usage for your notebook session.

-------------------------------------------------------------------------------<br>
# Alright, let's get started. 🚀

## What is RAG? 🤔
We've all used ChatGPT and experienced moments when it generates content that is incorrect or unrelated to our query. Do you know why this happens? These Large Language Models (LLMs) are not magical entities; they are simply models trained on a vast amount of text. 📚 You could even think of that text as a significant portion of the internet. However, this is not all the data available in the world, because data is not a static concept. You yourself generate some data every day through your use of the Internet, social media, and so on. 🌐💻📱

So, no matter how much data you use to train your LLM, you will always end up encountering new data. This is one of the reasons behind the famous ChatGPT response that tells you it only knows things up to a certain date. 📅 These models also tend to hallucinate, meaning they give incorrect answers in a very convincing manner. 🎭

On the other hand, we have retrieval techniques. Don't worry if this sounds complicated: you already use retrieval on a daily basis. (It actually isn't easy, and you may need to take a course to familiarize yourself with these concepts 😅, but that's not necessary for this project.) You can think of search engines (like Google, for example) as a complex form of information retrieval. 🔍

So, one day, people came up with this idea that it would be cool if ChatGPT could search Google for us, read the articles for us, summarize what it read, and tell us that. 📖 So, this is not exactly what RAG is, but it's something similar. We have a corpus (a large amount of data) and a query (what a user typed as input). Now, we search through this corpus using techniques related to vectors and vector databases, and find the most similar items in our corpus to the query. Then, we pass these items to an LLM and ask for a structured, well-formatted, user-friendly output. 📈📊

## I'm Interested in the Technical Details, What Should I Read? 📚🔍
- I strongly recommend reading the [original RAG paper](https://arxiv.org/abs/2005.11401). If you need help understanding the paper or have any questions about it, feel free to reach out to me via Telegram or find me on the second floor of the department in the NLP lab on Sundays and Tuesdays. 📖
- There appears to be a [comprehensive 2.5-hour course](https://www.freecodecamp.org/news/mastering-rag-from-scratch/) available. I haven't personally watched it, but if you find a better one, let me know so I can update this document. 🎥
- Here is [an article](https://www.smashingmagazine.com/2024/01/guide-retrieval-augmented-generation-language-models/) that explains the concepts very well. Initially, I wanted to use this article as the basis for this project, but unfortunately, the llama_index library used in the article seems to be outdated, so most of the code would need to be rewritten. On second thought, I found it more useful to focus on core concepts rather than learning specific libraries. You might want to check out some libraries like langchain or llama_index which provide a lot of tools for RAG. (But not for this project) 📝💡
- Don't hesitate to use Google or to ask chatbots about any new concepts and terms. If you use search-engine-aware chatbots like Microsoft Copilot, they provide links for each part of their answers, which is useful if you want to delve deeper into that part. 🌐🤖
- Lastly, we have [the article](https://learnbybuilding.ai/tutorials/rag-from-scratch) that serves as the foundation for this project. 📚🔍

# Learn
First, we’re going to go through a simple RAG implementation. It’s going to be similar to the article, except for the LLM part. For that, I’m going to use Hugging Face. 🤗 I’ll also try to explain the code in simple terms, but feel free to read the article if you prefer their writing style.

## Let's Install the Necessary Libraries 📚🔧
Did you know that using the `--quiet` or `-q` option with the `pip install` command minimizes the output displayed on your screen? 🖥️ This can make your terminal less cluttered. Also, using `-U` will upgrade the libraries if they were previously installed. This is particularly useful for certain libraries like `transformers` that are frequently updated. 🔄

In [None]:
!pip install -U accelerate transformers --quiet

## Gather a Corpus 📚
Technically, a corpus refers to a large and structured set of texts. However, for the sake of our discussion, let’s consider our collection as a “corpus”, even though it might not be large in the traditional sense. 😉

In [None]:
corpus_of_documents = [
    "Take a leisurely walk in the park and enjoy the fresh air.",
    "Visit a local museum and discover something new.",
    "Attend a live music concert and feel the rhythm.",
    "Go for a hike and admire the natural scenery.",
    "Have a picnic with friends and share some laughs.",
    "Explore a new cuisine by dining at an ethnic restaurant.",
    "Take a yoga class and stretch your body and mind.",
    "Join a local sports league and enjoy some friendly competition.",
    "Attend a workshop or lecture on a topic you're interested in.",
    "Visit an amusement park and ride the roller coasters."
]

## Create a Retriever 🕵️‍♂️
Now, we’re going to create a simple retriever. The role of the retriever is to compare the user’s query with a large corpus of text and find those that are most similar in context. (You know what context is by now, don’t you? 😊 If you’ve forgotten, refer back to your initial lectures). For now, let’s say we want to find similar text based on simple similarity metrics. The code is straightforward, and I have faith in you, chief! Dive into the code. 👨‍💻

In [None]:
def jaccard_similarity(query, document):
    query = query.lower().split(" ")
    document = document.lower().split(" ")
    intersection = set(query).intersection(set(document))
    union = set(query).union(set(document))
    return len(intersection)/len(union)

Hey, you may want to look at the Wikipedia page for [Jaccard Similarity](https://en.wikipedia.org/wiki/Jaccard_index).
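
To make the metric concrete, here is a small worked example that uses the same tokenization as `jaccard_similarity` above (lowercase, then split on spaces). The helper name `jaccard_with_details` is introduced only for this illustration and is not part of the project code.

```python
def jaccard_with_details(query, document):
    """Return the shared tokens, the token union, and their Jaccard ratio."""
    query_tokens = set(query.lower().split(" "))
    document_tokens = set(document.lower().split(" "))
    shared = query_tokens & document_tokens
    combined = query_tokens | document_tokens
    return shared, combined, len(shared) / len(combined)


shared, combined, score = jaccard_with_details(
    "I like to hike",
    "Go for a hike and admire the natural scenery.",
)
print(sorted(shared))   # ['hike']
print(len(combined))    # 12
print(round(score, 4))  # 0.0833
```

Note that punctuation stays attached to the tokens (`scenery.`), so a document ending in `hike.` would not match the query word `hike`.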

In [None]:
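def return_response(query, corpus):
    # Score every document in the given corpus against the query
    similarities = []
    for doc in corpus:
        similarity = jaccard_similarity(query, doc)
        similarities.append(similarity)
    # Return the document with the highest score (from the corpus argument, not a global)
    return corpus[similarities.index(max(similarities))]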

## Create a Generator 🖥️
Now, we’re going to create a generator. This will help us compile the information retrieved into a well-structured and user-friendly text.

OK, let's say that, in a scenario, we ask the user what they like to do, and their answer is this:

In [None]:
user_input = "I like to hike"

Now, using the retriever, I find the activity that best fits this user.

In [None]:
relevant_document = return_response(user_input, corpus_of_documents)
print(relevant_document)

The answer seems good enough, but we can do better, yeah?

Let’s import a Language Model. I’m going to try out Microsoft Phi-3 because it recently hit the market, and I haven’t had a chance to try it for myself yet. So, I’m seizing this opportunity to do so! 😊👨‍💻

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

Downloading the model is going to take a while, so use this time to rest your eyes for a bit. 😊👀💤

In [None]:
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-128k-instruct",
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-128k-instruct")

In [None]:
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)

generation_args = {
    "max_new_tokens": 500,
    "return_full_text": False,
    "temperature": 0.0,
    "do_sample": False,
}

Now we try to get the LLM to become our generator. We simply place the retrieved information and the user query in the following prompt and ask the model for well-formatted text.

In [None]:
prompt = """You are a bot that makes recommendations for activities. Try to be helpful recommender system.
This is the recommended activity: {relevant_document}
The user input is: {user_input}
Compile a recommendation to the user based on the recommended activity and the user input."""

In [None]:
prompt = prompt.replace("{relevant_document}", relevant_document).replace("{user_input}", user_input)
print(prompt)

In [None]:
messages = [
    {"role": "user", "content": prompt},
]

Here's the retrieval-augmented generated text:

In [None]:
output = pipe(messages, **generation_args)
print(output[0]['generated_text'])

## Very Cool, but Not Perfect! 😎👌
Alright, you’ve just seen a very basic example of RAG. However, there are some issues present. The corpus is small, and the documents in the corpus are short sentences, which causes the Language Model (LM) to generate some text on its own. 📚🤖

Also, our retriever is not very efficient and it may encounter bugs in some cases. For instance, even when users specify that they are not interested in a certain activity, the retriever might still bring up that activity for them. 🐜🔍
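
The negation problem is easy to reproduce with the Jaccard retriever from the Learn section. The sketch below redefines the similarity function so that it runs on its own and uses three documents from the toy corpus; `small_corpus` and `negated_query` are names introduced only for this illustration.

```python
def jaccard_similarity(query, document):
    query_tokens = set(query.lower().split(" "))
    document_tokens = set(document.lower().split(" "))
    return len(query_tokens & document_tokens) / len(query_tokens | document_tokens)


# Three documents copied from the toy corpus above.
small_corpus = [
    "Visit a local museum and discover something new.",
    "Go for a hike and admire the natural scenery.",
    "Take a yoga class and stretch your body and mind.",
]

# The user explicitly says they do NOT want to hike.
negated_query = "I don't like to hike"
scores = [jaccard_similarity(negated_query, doc) for doc in small_corpus]
best_document = small_corpus[scores.index(max(scores))]
print(best_document)  # Go for a hike and admire the natural scenery.
```

Because the score only counts shared words, `don't` is just one more unmatched token, and the hiking document still wins.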

So, in this project, you’re going to address some of these issues. The rest of this document consists of some empty cells and tips for you on how to fill them with code. Let’s get coding! 👨‍💻🚀

# The Project

## Determine Your Task 🎯
What do you aim to implement with RAG? A recommender system? 🎁 A chatbot for a website’s FAQ? 💬 A medical advisor? 🩺 Or perhaps something else entirely?

Specify your objective in this cell.

In [None]:
task_title = "RAG-based Recommender System for Books"
url_for_more_information = "https://www.kaggle.com/datasets/jealousleopard/goodreadsbooks"

print(f"My task is: {task_title}")
print(f'For more information see: {url_for_more_information}')

## 🧐 Find or gather a corpus
Remember the fake corpus? 📚 It’s time to switch things up and use something real. 🌐 You need to use a dataset from [Hugging Face Datasets](https://huggingface.co/datasets) for this project. 🚀 Don’t use files that are outside of this notebook; this notebook should be able to run on its own without depending on anything external. 💻👍


In [None]:
# Import the load_dataset function from the datasets library
from datasets import load_dataset

# Load the Amazon review polarity dataset from Hugging Face Datasets
dataset = load_dataset("amazon_polarity")

# Display the first example from the dataset
print(dataset['train'][0])

# Print the structure of the dataset
print(dataset)

# Optionally, explore more examples
for i in range(5):
    print(f"Example {i+1}:")
    print(dataset['train'][i])
    print("\n")

## 📝 Create some queries
I want you to create 20 queries related to your task. You can use any Language Model you want for this matter, or if you’re feeling strong 💪 and have the time, write it yourself. 🖊️

You need to create a Hugging Face account, format your 20 queries into the accepted dataset format for Hugging Face 🤗 and push it to your Hugging Face account. Be sure to make it public and use it for the evaluation task. 👀

In [None]:
import json

queries = [
    {"query": "Recommend me a book similar to '1984' by George Orwell."},
    {"query": "What are some good mystery novels?"},
    {"query": "Can you suggest a romance book set in Paris?"},
    {"query": "I'm looking for a fantasy book with dragons."},
    {"query": "Suggest a non-fiction book about history."},
    {"query": "What are some popular science fiction books?"},
    {"query": "Recommend a thriller with a strong female lead."},
    {"query": "Can you suggest a children's book about adventure?"},
    {"query": "I'm interested in books about artificial intelligence."},
    {"query": "What are some classic literature books I should read?"},
    {"query": "Suggest a book that deals with mental health."},
    {"query": "Can you recommend a good autobiography?"},
    {"query": "I'm looking for a humorous book."},
    {"query": "Suggest a book that has won a Pulitzer Prize."},
    {"query": "What are some good dystopian novels?"},
    {"query": "Recommend a book that was turned into a movie."},
    {"query": "Can you suggest a young adult book series?"},
    {"query": "I'm interested in books about space exploration."},
    {"query": "What are some good horror novels?"},
    {"query": "Recommend a book with a plot twist."}
]

# Save queries to a JSON file
with open('book_recommendation_queries.json', 'w') as f:
    json.dump({"queries": queries}, f)
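
The cell above nests all queries under a single `queries` key. As far as I know, a flatter layout that the Hub's dataset tooling can read is JSON Lines, with one record per line. The following is only a sketch under that assumption: `sample_queries` stands in for the full list, and `output_path` is a hypothetical temporary location. It does not upload anything.

```python
import json
import tempfile
from pathlib import Path

# Two sample records; in the notebook this would be the full `queries` list.
sample_queries = [
    {"query": "What are some good mystery novels?"},
    {"query": "I'm looking for a fantasy book with dragons."},
]

# Hypothetical output location, used only for this illustration.
output_path = Path(tempfile.mkdtemp()) / "book_recommendation_queries.jsonl"

# Write one JSON object per line.
with open(output_path, "w", encoding="utf-8") as f:
    for record in sample_queries:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

# Read the file back to confirm that every line parses to the original record.
with open(output_path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]

print(loaded == sample_queries)  # True
```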


## 🛠️ Create a Retriever
To create your retriever, you need to use an encoder model. Something like BERT? Nah, BERT is so yesterday. Find something new and shiny! ✨ The basic idea is to encode every document (sentence) in your corpus into a vector space using the same encoder. Then, encode the user query into that same space. With some similarity metrics like dot product, you can find the most similar document to the user’s input and retrieve it. 🎯 You can train your own encoder if you have enough data and resources, 💪 or you can use one of those [ready-made on Hugging Face](https://huggingface.co/models?pipeline_tag=sentence-similarity&sort=trending), like these ones.
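
The idea of comparing a query vector with document vectors can be tried out without any model. Below is a minimal pure-Python sketch; the 3-dimensional vectors and the names `document_vectors` and `query_vector` are made up for illustration, whereas a real encoder would produce vectors with hundreds of dimensions.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine_similarity(u, v):
    """Cosine similarity, computed as the dot product of unit-length vectors."""
    return dot(normalize(u), normalize(v))


# Made-up "embeddings" standing in for encoder output.
document_vectors = {
    "doc_about_hiking": [0.9, 0.1, 0.0],
    "doc_about_museums": [0.1, 0.8, 0.3],
}
query_vector = [0.8, 0.2, 0.1]

scores = {name: cosine_similarity(query_vector, vec) for name, vec in document_vectors.items()}
best_match = max(scores, key=scores.get)
print(best_match)  # doc_about_hiking
```

Since the dot product of unit-length vectors equals their cosine similarity, normalizing the embeddings once lets you use either metric and get the same ranking.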

In [1]:
!pip install -U transformers accelerate bitsandbytes torch datasets sentence-transformers


In [2]:
from huggingface_hub import login

login()


In [3]:
from sentence_transformers import SentenceTransformer, util
from datasets import load_dataset
import torch

# Load the pre-trained model from Hugging Face
model_name = 'sentence-transformers/all-MiniLM-L6-v2'
model_sentence = SentenceTransformer(model_name)

# Load the Amazon review polarity dataset from Hugging Face
dataset = load_dataset("amazon_polarity")

# Use a smaller subset of the dataset
corpus = [entry['content'] for entry in dataset['train'].select(range(2000))]  # Adjust the range as needed
corpus_embeddings = model_sentence.encode(corpus, convert_to_tensor=True)

# Sample queries related to book recommendations
queries = [
    "Recommend me a book similar to '1984' by George Orwell.",
    "What are some good mystery novels?",
    "Can you suggest a romance book set in Paris?",
    "I'm looking for a fantasy book with dragons.",
    "Suggest a non-fiction book about history.",
    "What are some popular science fiction books?",
    "Recommend a thriller with a strong female lead.",
    "Can you suggest a children's book about adventure?",
    "I'm interested in books about artificial intelligence.",
    "What are some classic literature books I should read?",
    "Suggest a book that deals with mental health.",
    "Can you recommend a good autobiography?",
    "I'm looking for a humorous book.",
    "Suggest a book that has won a Pulitzer Prize.",
    "What are some good dystopian novels?",
    "Recommend a book that was turned into a movie.",
    "Can you suggest a young adult book series?",
    "I'm interested in books about space exploration.",
    "What are some good horror novels?",
    "Recommend a book with a plot twist."
]

# Encode the queries
query_embeddings = model_sentence.encode(queries, convert_to_tensor=True)

# Perform similarity search using cosine similarity
for query, query_embedding in zip(queries, query_embeddings):
    # Compute cosine similarities
    cos_scores = util.pytorch_cos_sim(query_embedding, corpus_embeddings)[0]
    
    # Find the top 5 most similar sentences
    top_results = torch.topk(cos_scores, k=5)
    
    print(f"Query: {query}")
    for score, idx in zip(top_results[0], top_results[1]):
        print(f"Score: {score:.4f}\tDocument: {corpus[idx]}")
    print("\n")



Query: Recommend me a book similar to '1984' by George Orwell.
Score: 0.5647	Document: This is truely an amazing book, Aldous Huxley was a genius. Although different from BNW in that the plot is a little slow, the ideas of society and the individual expressed in this book make it one of the most influential books that I ever read. Don't read this book to be entertained, read it to be enlightened.
Score: 0.5585	Document: I absolutely loved this book! Eckhart Tolle in "The Power of Now" actually references this book and that is how I heard of it. Despite the fact that much of the book is ruled by long monologues from the characters, I enjoyed it all. Huxley had an interesting view of what a society can become when it takes the best of the Eastern and Western worlds combine.
Score: 0.5200	Document: Excellent and well-written book. Recommended reading for young adults and older. We must all remember what atrocities were committed in Europe and other countries. It is a testament to the huma

In [4]:
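def return_response(query, corpus):
    # Encode the query into the same vector space as the corpus
    # (relies on model_sentence and corpus_embeddings defined in the cell above)
    query_embedding = model_sentence.encode(query, convert_to_tensor=True)
    # Cosine similarity between the query and every corpus embedding
    cos_scores = util.pytorch_cos_sim(query_embedding, corpus_embeddings)[0]
    # Keep only the single most similar document
    top_result = torch.topk(cos_scores, k=1)
    return corpus[top_result[1][0]]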

### Detailed Documentation

This document provides a detailed description of the code that loads a pre-trained sentence transformer model, retrieves relevant documents from a dataset, and performs a similarity search for a set of sample queries.

#### Importing Required Libraries

```python
from sentence_transformers import SentenceTransformer, util
from datasets import load_dataset
import torch
```

- **Purpose**: Import necessary libraries for loading models, handling datasets, and performing tensor computations.
  - `SentenceTransformer` and `util` from `sentence_transformers`: For encoding text and computing similarities.
  - `load_dataset` from `datasets`: For loading the dataset.
  - `torch`: For tensor operations and similarity calculations.

#### Loading the Pre-trained Model

```python
model_name = 'sentence-transformers/all-MiniLM-L6-v2'
model_sentence = SentenceTransformer(model_name)
```

- **Purpose**: Load a pre-trained sentence transformer model for encoding text.
  - **Model Name**: `sentence-transformers/all-MiniLM-L6-v2`.

#### Loading and Preparing the Dataset

```python
dataset = load_dataset("amazon_polarity")
corpus = [entry['content'] for entry in dataset['train'].select(range(2000))]
corpus_embeddings = model_sentence.encode(corpus, convert_to_tensor=True)
```

- **Purpose**: Load a dataset and prepare it for similarity search.
  - **Dataset**: Amazon review polarity dataset (`amazon_polarity`) from Hugging Face.
  - **Corpus Selection**: Use a subset of 2000 entries for efficient processing.
  - **Encoding**: Encode the corpus into dense vector representations using the loaded sentence transformer model.

#### Defining Sample Queries

```python
queries = [
    "Recommend me a book similar to '1984' by George Orwell.",
    "What are some good mystery novels?",
    "Can you suggest a romance book set in Paris?",
    "I'm looking for a fantasy book with dragons.",
    "Suggest a non-fiction book about history.",
    "What are some popular science fiction books?",
    "Recommend a thriller with a strong female lead.",
    "Can you suggest a children's book about adventure?",
    "I'm interested in books about artificial intelligence.",
    "What are some classic literature books I should read?",
    "Suggest a book that deals with mental health.",
    "Can you recommend a good autobiography?",
    "I'm looking for a humorous book.",
    "Suggest a book that has won a Pulitzer Prize.",
    "What are some good dystopian novels?",
    "Recommend a book that was turned into a movie.",
    "Can you suggest a young adult book series?",
    "I'm interested in books about space exploration.",
    "What are some good horror novels?",
    "Recommend a book with a plot twist."
]
```

- **Purpose**: Define a list of sample queries related to book recommendations for the similarity search.

#### Encoding the Queries

```python
query_embeddings = model_sentence.encode(queries, convert_to_tensor=True)
```

- **Purpose**: Encode the sample queries into dense vector representations using the loaded sentence transformer model.

#### Performing Similarity Search

```python
for query, query_embedding in zip(queries, query_embeddings):
    cos_scores = util.pytorch_cos_sim(query_embedding, corpus_embeddings)[0]
    top_results = torch.topk(cos_scores, k=5)
    
    print(f"Query: {query}")
    for score, idx in zip(top_results[0], top_results[1]):
        print(f"Score: {score:.4f}\tDocument: {corpus[idx]}")
    print("\n")
```

- **Purpose**: Perform a similarity search to find the most relevant documents for each query.
- **Process**:
  1. **Iterate through Queries**: For each query and its corresponding embedding.
  2. **Compute Similarity Scores**: Calculate cosine similarity scores between the query embedding and all corpus embeddings.
  3. **Identify Top Results**: Find the top 5 most similar documents based on the similarity scores.
  4. **Display Results**: Print the query, similarity scores, and corresponding documents.
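
Steps 2 and 3 can be mimicked without PyTorch. The sketch below picks the highest-scoring indices with the standard library's `heapq.nlargest`; the function name `top_k` and the scores in `example_scores` are hypothetical and serve only to show the selection logic that `torch.topk` performs on the real similarity tensor.

```python
import heapq


def top_k(scores, k):
    """Return (index, score) pairs for the k highest scores, best first."""
    return heapq.nlargest(k, enumerate(scores), key=lambda pair: pair[1])


# Hypothetical similarity scores for a six-document corpus.
example_scores = [0.12, 0.56, 0.08, 0.52, 0.31, 0.44]
print(top_k(example_scores, k=3))  # [(1, 0.56), (3, 0.52), (5, 0.44)]
```

The returned indices play the same role as `top_results[1]` in the loop above: they are used to look up the matching documents in `corpus`.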

#### Summary

This code provides a structured approach to:
1. Load and encode a subset of the `amazon_polarity` review dataset.
2. Define a set of sample queries related to book recommendations.
3. Encode the queries using a pre-trained sentence transformer model.
4. Perform a similarity search to retrieve and display the most relevant documents from the corpus for each query.

## 🎛️ Create a Generator
For this part, I practically handed you the whole code on a silver platter. 🍽️ But since we know you’re an explorer at heart and love trying new things, you can’t use the model I previously used. 😈 You have to try 3 different generators and compare them based on the quality of their answers. 🧪📊 [These might come in handy](https://huggingface.co/models?pipeline_tag=text-generation&sort=trending).
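
When comparing generators, it helps to run all of them on exactly the same prompt and to remove the prompt from any answer that echoes it, which can happen when the full output sequence of a causal language model is decoded. Here is a minimal sketch of such a comparison harness. The names `strip_prompt`, `compare_generators`, and `stub_generators` are hypothetical, and the stub functions stand in for real models so that the example runs without downloading anything.

```python
def strip_prompt(prompt, response):
    """Drop the prompt from the start of a response if the model echoed it."""
    if response.startswith(prompt):
        return response[len(prompt):].strip()
    return response.strip()


def compare_generators(generators, prompt):
    """Run every generator on the same prompt and collect the cleaned answers."""
    return {
        name: strip_prompt(prompt, generate(prompt))
        for name, generate in generators.items()
    }


# Stand-in generators (assumptions for illustration, not real models).
stub_generators = {
    "echoing_model": lambda p: p + " Try the Vampire Diaries series.",
    "clean_model": lambda p: "Try the Dark Visions trilogy.",
}

answers = compare_generators(stub_generators, "Suggest a young adult series.")
for name, answer in answers.items():
    print(f"{name}: {answer}")
```

With real models, each dictionary value would be a function that wraps the model's generation call, and the cleaned answers could then be judged side by side.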

In [5]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import shutil
import gc  # garbage collection
import torch

# Define a function to generate responses
def generate_response(model, tokenizer, input_text):
    inputs = tokenizer.encode(input_text, return_tensors='pt')
    outputs = model.generate(inputs, max_new_tokens=500, num_return_sequences=1, do_sample=True, top_k=50)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)

# Names of the generator models to compare (they are loaded one at a time below)
models = [
    'Qwen/Qwen2-7B-Instruct',
    'mistralai/Mistral-7B-Instruct-v0.3',
    'meta-llama/Meta-Llama-3-8B-Instruct'
]

# `query` is the query embedding, not the raw query text
def retrieve_relevant_document(query, corpus_embeddings):
    cos_scores = util.pytorch_cos_sim(query, corpus_embeddings)[0]
    top_result = torch.topk(cos_scores, k=1)
    return corpus[top_result[1][0]]

# Sample query and top retrieved document
input_user = "Can you suggest a young adult book series?"
input_user_embeddings = model_sentence.encode(input_user, convert_to_tensor=True)
relevant_document = retrieve_relevant_document(input_user_embeddings , corpus_embeddings)

# Free GPU and CPU memory held by the sentence encoder
del model_sentence
gc.collect()
torch.cuda.empty_cache()

# Remove the cached encoder files to free disk space
model_dir = f"/root/.cache/huggingface/hub/models--{model_name.replace('/', '--')}"
shutil.rmtree(model_dir, ignore_errors=True)

# Optionally, also remove the tokenizer directory if it's separate
tokenizer_dir = f"/root/.cache/huggingface/hub/tokenizers--{model_name.replace('/', '--')}"
shutil.rmtree(tokenizer_dir, ignore_errors=True)

print(f"Deleted model directory: {model_dir}")
print(f"Deleted tokenizer directory: {tokenizer_dir}")

prompt = """You are a bot that makes recommendations for books. Try to be helpful recommender system.
This is the recommended activity: {relevant_document}
The user input is: {user_input}
Compile a recommendation to the user based on the recommended activity and the user input."""
prompt = prompt.replace("{relevant_document}", relevant_document).replace("{user_input}", input_user)

# Generate responses using the models
for model_name in models:
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=quantization_config,
        device_map="auto",
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    input_text = prompt

    # Generate response
    response = generate_response(model, tokenizer, input_text)

    # Print results
    print(f"Model: {model_name}")
    print(f"Response: {response}")
    print("\n")

    # Free GPU and CPU memory before loading the next model
    del model
    gc.collect()
    torch.cuda.empty_cache()

    # Remove the cached model files to free disk space
    model_dir = f"/root/.cache/huggingface/hub/models--{model_name.replace('/', '--')}"
    shutil.rmtree(model_dir, ignore_errors=True)

    # Optionally, also remove the tokenizer directory if it's separate
    tokenizer_dir = f"/root/.cache/huggingface/hub/tokenizers--{model_name.replace('/', '--')}"
    shutil.rmtree(tokenizer_dir, ignore_errors=True)

    print(f"Deleted model directory: {model_dir}")
    print(f"Deleted tokenizer directory: {tokenizer_dir}")


Deleted model directory: /root/.cache/huggingface/hub/models--sentence-transformers--all-MiniLM-L6-v2
Deleted tokenizer directory: /root/.cache/huggingface/hub/tokenizers--sentence-transformers--all-MiniLM-L6-v2



Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Model: Qwen/Qwen2-7B-Instruct
Response: You are a bot that makes recommendations for books. Try to be helpful recommender system.
This is the recommended activity: DAUGHTERS OF DARKNESS was the first book by L. J. Smith that Iread and I was hooked after that! I loved the way L. J. Smithdescribed all the characters, especialy Ash, and gave them all completely different personalities. This is, by far, one of the best books I've ever read and I can't wait for STRANGE FATE to come out. For anyone who is looking for another good series by L. J. Smith I suggest the Vampire Diaries and Dark Visions trilogies.
The user input is: Can you suggest a young adult book series?
Compile a recommendation to the user based on the recommended activity and the user input. Here's a recommendation for you:

If you enjoyed "Daughters of Darkness" by L.J. Smith, you might also like the "House of Night" series by P.C. Cast and Kristin Cast. The series follows the life of Zoey Redbird, a 16-year-old girl who ge


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Model: mistralai/Mistral-7B-Instruct-v0.3
Response: You are a bot that makes recommendations for books. Try to be helpful recommender system.
This is the recommended activity: DAUGHTERS OF DARKNESS was the first book by L. J. Smith that Iread and I was hooked after that! I loved the way L. J. Smithdescribed all the characters, especialy Ash, and gave them all completely different personalities. This is, by far, one of the best books I've ever read and I can't wait for STRANGE FATE to come out. For anyone who is looking for another good series by L. J. Smith I suggest the Vampire Diaries and Dark Visions trilogies.
The user input is: Can you suggest a young adult book series?
Compile a recommendation to the user based on the recommended activity and the user input. Here's my suggestion:
If you're searching for an immersive and captivating young adult book series, I highly recommend starting with "Dark Visions" by L. J. Smith. This thrilling trilogy features rich characters, fascinating 


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Model: meta-llama/Meta-Llama-3-8B-Instruct
Response: You are a bot that makes recommendations for books. Try to be helpful recommender system.
This is the recommended activity: DAUGHTERS OF DARKNESS was the first book by L. J. Smith that Iread and I was hooked after that! I loved the way L. J. Smithdescribed all the characters, especialy Ash, and gave them all completely different personalities. This is, by far, one of the best books I've ever read and I can't wait for STRANGE FATE to come out. For anyone who is looking for another good series by L. J. Smith I suggest the Vampire Diaries and Dark Visions trilogies.
The user input is: Can you suggest a young adult book series?
Compile a recommendation to the user based on the recommended activity and the user input. 

Here's my response:

Hi there! I'm happy to help you find your next great read. Since you loved the Vampire Diaries series by L. J. Smith, I think you might enjoy other young adult book series with similar themes and super


### Explanation of the Code

This code demonstrates how to use various large language models (LLMs) to generate responses based on user input, perform document retrieval, and handle model caching efficiently. The code includes several key steps:

1. **Importing Libraries**:
    ```python
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
    import shutil
    import gc
    import torch
    ```
    - `transformers`: Provides classes and functions to load pre-trained language models and tokenizers.
    - `shutil`: Used for file operations such as deleting directories.
    - `gc`: The garbage collection library to manage memory usage.
    - `torch`: PyTorch library for tensor computations.

2. **Defining the Function to Generate Responses**:
    ```python
    def generate_response(model, tokenizer, input_text):
        inputs = tokenizer.encode(input_text, return_tensors='pt')
        outputs = model.generate(inputs, max_new_tokens=500, num_return_sequences=1, do_sample=True, top_k=50)
        return tokenizer.decode(outputs[0], skip_special_tokens=True)
    ```
    - This function encodes the input text using the tokenizer, generates a response using the model, and decodes the output back into human-readable text.

3. **Quantization Configuration**:
    ```python
    quantization_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
    )
    ```
    - This configuration loads the model weights in 4-bit NF4 precision and applies double quantization (the quantization constants are themselves quantized), which substantially reduces the memory needed to load and run each model.

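4. **Model and Tokenizer Loading**:
    ```python
    models = [
        'Qwen/Qwen2-7B-Instruct',
        'mistralai/Mistral-7B-Instruct-v0.3',
        'meta-llama/Meta-Llama-3-8B-Instruct'
    ]
    ```
    - This list names the three instruction-tuned models that are loaded, one at a time, together with their tokenizers.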

5. **Document Retrieval Function**:
    ```python
    def retrieve_relevant_document(query, corpus_embeddings):
        cos_scores = util.pytorch_cos_sim(query, corpus_embeddings)[0]
        top_result = torch.topk(cos_scores, k=1)
        return corpus[top_result[1][0]]
    ```
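    - This function retrieves the most relevant document from the corpus: it computes the cosine similarity between the query embedding and every corpus embedding and returns the highest-scoring document. Despite its name, the `query` argument here is expected to be an already encoded embedding.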

6. **Sample Query and Top Retrieved Document**:
    ```python
    input_user = "Can you suggest a young adult book series?"
    input_user_embeddings = model_sentence.encode(input_user, convert_to_tensor=True)
    ```

## 📊 Evaluate the results
Here, you’ve got to put those 3 models to the test. Use the 20 queries you’ve created on each of the 3 models. Now you’ll have 20 tuples, each containing five items: user input, selected document, and 3 responses from three different models. Use a judge model on each tuple to select the best answer. 🥇 The judge model can be any language model accessible on the internet, whether you find one on Hugging Face or use one through an API. 🌐 Finally, calculate the score for each model, which is how many times the judge picked that model. 🏆
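
The final scoring step is a simple tally. A minimal sketch, assuming the judge's decisions have already been collected as one winning model name per query; `tally_scores`, the `verdicts` list, and the short model names are hypothetical placeholders, not results from this notebook:

```python
from collections import Counter

def tally_scores(model_names, verdicts):
    """Count how many times the judge picked each model."""
    counts = Counter(verdicts)
    # Include models that never won so that every model gets a score.
    return {name: counts.get(name, 0) for name in model_names}

# Hypothetical judge decisions for five queries (placeholder data).
model_names = ["model_a", "model_b", "model_c"]
verdicts = ["model_a", "model_c", "model_a", "model_b", "model_a"]

scores = tally_scores(model_names, verdicts)
print(scores)
```

Because each query contributes exactly one verdict, the scores should add up to the number of queries, which is a quick sanity check on the evaluation loop.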

In [6]:
# Imports needed by this cell (they may already be loaded earlier in the notebook)
import torch
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, util

# Function to retrieve the most relevant document
def retrieve_relevant_document(query, corpus_embeddings):
    # Embed the query, score it against every corpus embedding, and return the top hit
    query_embedding = model_sentence.encode(query, convert_to_tensor=True)
    cos_scores = util.pytorch_cos_sim(query_embedding, corpus_embeddings)[0]
    top_result = torch.topk(cos_scores, k=1)
    return corpus[top_result[1][0]]

# Define function to generate responses
def generate_response(model, tokenizer, input_text):
    inputs = tokenizer.encode(input_text, return_tensors='pt')
    outputs = model.generate(inputs, max_new_tokens=500, num_return_sequences=1, do_sample=True, top_k=50)
    # The decoded text contains the prompt followed by the generated continuation
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Load the custom query dataset from Hugging Face
dataset_queries = load_dataset("soltaniali/RAGG")

# Extract the queries from the dataset
queries = [entry['query'] for entry in dataset_queries['train']]

# Load the sentence embedding model
model_name = 'sentence-transformers/all-MiniLM-L6-v2'
model_sentence = SentenceTransformer(model_name)

# Use a smaller subset of the corpus (`dataset` is assumed to be loaded in an earlier cell)
corpus = [entry['content'] for entry in dataset['train'].select(range(2000))]  # Adjust the range as needed
corpus_embeddings = model_sentence.encode(corpus, convert_to_tensor=True)



In [7]:
# Generate responses for each query using each model
responses = {query: {} for query in queries}
for model_name in models:
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=quantization_config,
        device_map="auto",
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    for query in queries:
        relevant_document = retrieve_relevant_document(query, corpus_embeddings)
        prompt = f"You are a bot that makes recommendations for books. Try to be a helpful recommender system.\nThis is the recommended activity: {relevant_document}\nThe user input is: {query}\nCompile a recommendation to the user based on the recommended activity and the user input."
        response = generate_response(model, tokenizer, prompt)
        responses[query][model_name] = response

    # Clear GPU & CPU memory
    del model
    gc.collect()
    torch.cuda.empty_cache()

    # Path to the cached model directory
    model_dir = f"/root/.cache/huggingface/hub/models--{model_name.replace('/', '--')}"
    shutil.rmtree(model_dir, ignore_errors=True)

    # Optionally, also remove the tokenizer directory if it's separate
    tokenizer_dir = f"/root/.cache/huggingface/hub/tokenizers--{model_name.replace('/', '--')}"
    shutil.rmtree(tokenizer_dir, ignore_errors=True)



In [12]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline
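# Load the judge model for sequence classification.
# Note: roberta-base has no fine-tuned classification head (see the warning in the output),
# so its scores only become meaningful after training on a judging task.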
judge_model = AutoModelForSequenceClassification.from_pretrained('roberta-base')
judge_tokenizer = AutoTokenizer.from_pretrained('roberta-base')
judge = pipeline("text-classification", model=judge_model, tokenizer=judge_tokenizer)

# Function to evaluate and select the best response
def judge_best_response(query, responses):
    inputs = [
        f"Query: {query}\nResponse: {responses[model_name]}\n"[:512] for model_name in responses
    ]
    results = judge(inputs)
    best_response_idx = max(range(len(results)), key=lambda idx: results[idx]['score'])
    best_model = list(responses.keys())[best_response_idx]
    return best_model

# Evaluate the responses and calculate the scores
scores = {model_name: 0 for model_name in models}
for query in queries:
    best_model = judge_best_response(query, responses[query])
    scores[best_model] += 1

# Print the scores
print("Scores:")
for model_name, score in scores.items():
    print(f"{model_name}: {score}")

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Scores:
Qwen/Qwen2-7B-Instruct: 15
mistralai/Mistral-7B-Instruct-v0.3: 2
meta-llama/Meta-Llama-3-8B-Instruct: 3


#DOC
### Detailed Documentation

This document describes the functionality and purpose of specific segments within the provided Python code, which demonstrates the process of retrieving relevant documents, generating responses, and evaluating the best responses using large language models (LLMs).

#### Retrieving the Most Relevant Document

```python
def retrieve_relevant_document(query, corpus_embeddings):
    query_embedding = model_sentence.encode(query, convert_to_tensor=True)
    cos_scores = util.pytorch_cos_sim(query_embedding, corpus_embeddings)[0]
    top_result = torch.topk(cos_scores, k=1)
    return corpus[top_result[1][0]]
```

- **Purpose**: This function identifies and retrieves the most relevant document from a given corpus based on a query.
- **Process**:
  1. **Encoding**: The input query is encoded into a dense vector using a pre-trained sentence transformer model.
  2. **Similarity Calculation**: Cosine similarity scores between the encoded query and corpus embeddings are computed.
  3. **Top Result Identification**: The document with the highest cosine similarity score is identified.
  4. **Return Document**: The most relevant document from the corpus is returned.
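
The four steps above can be illustrated without any model. A minimal sketch, assuming tiny hand-written vectors in place of real sentence embeddings; `cosine_similarity`, `retrieve_top1`, and the toy data are hypothetical names and values, not part of the notebook:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve_top1(query_vec, corpus_vecs, corpus_texts):
    """Return the text whose vector is most similar to the query vector."""
    scores = [cosine_similarity(query_vec, vec) for vec in corpus_vecs]
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return corpus_texts[best_index]

# Toy 3-dimensional "embeddings" (illustrative values, not model output).
toy_texts = ["vampire romance series", "cookbook for beginners", "space opera trilogy"]
toy_vecs = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.1], [0.1, 0.0, 0.9]]
toy_query = [0.8, 0.2, 0.1]

best = retrieve_top1(toy_query, toy_vecs, toy_texts)
print(best)
```

The notebook's version performs the same ranking with `util.pytorch_cos_sim` and `torch.topk`, which operate on whole tensors at once and so scale better to a corpus of thousands of documents.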

#### Generating Responses

```python
def generate_response(model, tokenizer, input_text):
    inputs = tokenizer.encode(input_text, return_tensors='pt')
    outputs = model.generate(inputs, max_new_tokens=500, num_return_sequences=1, do_sample=True, top_k=50)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```

- **Purpose**: This function generates a response to a given input text using a specified model and tokenizer.
- **Process**:
  1. **Tokenization**: The input text is tokenized into a format suitable for the model.
  2. **Response Generation**: The model generates a response based on the tokenized input.
  3. **Decoding**: The generated response is decoded from token format to human-readable text.
  4. **Return Response**: The decoded response is returned.
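
In the call above, `do_sample=True` with `top_k=50` asks the model to sample each next token from its 50 highest-scoring candidates instead of always taking the single most likely one (other generation settings may also apply). A toy sketch of that idea, assuming made-up scores for a five-token vocabulary; `top_k_sample` is a hypothetical helper, not the library's implementation:

```python
import math
import random

def top_k_sample(logits, k, rng):
    """Sample a token index from the k highest-scoring logits."""
    # Keep the indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax weights over the kept logits only (shifted by the max for stability).
    peak = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - peak) for i in top]
    # Draw one index in proportion to those weights.
    return rng.choices(top, weights=weights, k=1)[0]

toy_logits = [2.0, 0.5, 1.5, -1.0, 0.0]  # made-up scores, one per token
rng = random.Random(0)                   # seeded so repeated runs behave the same
samples = [top_k_sample(toy_logits, k=2, rng=rng) for _ in range(100)]
print(sorted(set(samples)))
```

With `k=2`, only the two best tokens (indices 0 and 2) can ever be drawn; with `k=1` the procedure reduces to greedy decoding.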

#### Loading and Processing the Dataset

```python
dataset_queries = load_dataset("soltaniali/RAGG")

queries = [entry['query'] for entry in dataset_queries['train']]
model_name = 'sentence-transformers/all-MiniLM-L6-v2'
model_sentence = SentenceTransformer(model_name)
corpus = [entry['content'] for entry in dataset['train'].select(range(2000))]
corpus_embeddings = model_sentence.encode(corpus, convert_to_tensor=True)
```

- **Purpose**: This segment loads the custom query dataset from Hugging Face, extracts the queries, and builds and encodes the corpus for later retrieval.
- **Process**:
  1. **Dataset Loading**: The custom query dataset (`soltaniali/RAGG`) is loaded from Hugging Face.
  2. **Query Extraction**: The `query` field of each entry in the `train` split is collected into a list.
  3. **Model Initialization**: The pre-trained sentence transformer `all-MiniLM-L6-v2` is initialized.
  4. **Corpus Selection**: The first 2,000 entries of the previously loaded `dataset` form the corpus.
  5. **Encoding**: The corpus contents are encoded into dense vectors; each query is encoded later, inside `retrieve_relevant_document`.

#### Generating Responses for Each Query

```python
responses = {query: {} for query in queries}
for model_name in models:
    model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=quantization_config, device_map="auto")
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    for query in queries:
        relevant_document = retrieve_relevant_document(query, corpus_embeddings)
        prompt = f"You are a bot that makes recommendations for books. Try to be a helpful recommender system.\nThis is the recommended activity: {relevant_document}\nThe user input is: {query}\nCompile a recommendation to the user based on the recommended activity and the user input."
        response = generate_response(model, tokenizer, prompt)
        responses[query][model_name] = response

    # Clear GPU & CPU memory before loading the next model
    del model, tokenizer
    gc.collect()
    torch.cuda.empty_cache()

    # Delete the cached checkpoint to free disk space. Tokenizer files are
    # stored in the same models--* directory, so one removal covers both.
    model_dir = f"/root/.cache/huggingface/hub/models--{model_name.replace('/', '--')}"
    shutil.rmtree(model_dir, ignore_errors=True)
```

- **Purpose**: This segment generates a response to every query with each model in turn, and frees GPU memory and disk space before the next model is loaded.
- **Process**:
  1. **Initialization**: A `responses` dictionary is initialized with one empty inner dictionary per query; it will map each query to `{model_name: response}`.
  2. **Model Loading**: Each model in `models` is loaded together with its tokenizer, using the quantization configuration to reduce memory usage.
  3. **Query Processing**: For each query:
     - **Document Retrieval**: The most relevant document is retrieved from the corpus.
     - **Prompt Creation**: A prompt is created by incorporating the relevant document and the query.
     - **Response Generation**: A response is generated using the model and tokenizer.
     - **Response Storage**: The generated response is stored under `responses[query][model_name]`.
  4. **Resource Cleanup**: After all queries have been processed for a model, the model is deleted, the Python garbage collector and the CUDA cache are cleared, and the model's cached files are removed from disk.
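
The loop structure above can be sketched without any GPU by replacing the models with stub functions. The helper `collect_responses`, the stub generators, and the toy retriever below are all hypothetical; they only illustrate the shape of the resulting `responses` dictionary.

```python
def collect_responses(queries, generators, retrieve, build_prompt):
    """Build {query: {model_name: response}} using one generator per model."""
    responses = {query: {} for query in queries}
    for model_name, generate in generators.items():
        for query in queries:
            prompt = build_prompt(retrieve(query), query)
            responses[query][model_name] = generate(prompt)
    return responses


def toy_retrieve(query):
    # Stand-in for the real retriever: always returns the same document.
    return "a book about space travel"


def toy_build_prompt(document, query):
    return f"Document: {document}\nUser input: {query}"


# Stub "models": plain functions from prompt text to response text.
toy_generators = {
    "stub-echo": lambda prompt: prompt.splitlines()[-1],
    "stub-shout": lambda prompt: prompt.upper(),
}

toy_responses = collect_responses(
    ["something about rockets", "a light read"],
    toy_generators,
    toy_retrieve,
    toy_build_prompt,
)
print(toy_responses["a light read"]["stub-echo"])
```

Looping over models on the outside and queries on the inside, as the notebook does, means each model only has to be loaded once.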

#### Evaluating Responses

```python
judge_model = AutoModelForSequenceClassification.from_pretrained('roberta-base')
judge_tokenizer = AutoTokenizer.from_pretrained('roberta-base')
judge = pipeline("text-classification", model=judge_model, tokenizer=judge_tokenizer)

def judge_best_response(query, responses):
    inputs = [f"Query: {query}\nResponse: {responses[model_name]}\n"[:512] for model_name in responses]
    results = judge(inputs)
    best_response_idx = max(range(len(results)), key=lambda idx: results[idx]['score'])
    best_model = list(responses.keys())[best_response_idx]
    return best_model

scores = {model_name: 0 for model_name in models}
for query in queries:
    best_model = judge_best_response(query, responses[query])
    scores[best_model] += 1

print("Scores:")
for model_name, score in scores.items():
    print(f"{model_name}: {score}")
```

- **Purpose**: This segment compares the generated responses for each query and counts how often each model produces the highest-scoring one.
- **Process**:
  1. **Judge Model Initialization**: `roberta-base` is loaded as a sequence classification model and wrapped in a text-classification pipeline. Because `roberta-base` ships without a fine-tuned classification head, the head is randomly initialized, so the scores are only meaningful if a fine-tuned judge checkpoint is substituted.
  2. **Best Response Evaluation**: A function is defined to evaluate responses:
     - **Input Preparation**: Inputs are prepared by concatenating the query and each model's response, truncated to the first 512 characters.
     - **Model Evaluation**: The judge pipeline classifies each input and returns a label with a confidence score.
     - **Best Response Selection**: The model whose input received the highest confidence score is selected.
  3. **Score Calculation**: For each query, the winning model's counter is incremented by one.
  4. **Score Printing**: The final win counts for each model are printed, indicating how many queries each model won.
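
The win-counting step can be sketched on its own, with hand-written numbers in place of the judge pipeline. The function name `tally_wins` and the scores below are made up for illustration and do not come from the notebook's results.

```python
def tally_wins(judge_scores):
    """Count, per model, the number of queries where it got the top score.

    judge_scores has the shape {query: {model_name: score}}.
    """
    wins = {}
    for per_model_scores in judge_scores.values():
        for model_name in per_model_scores:
            wins.setdefault(model_name, 0)
        best_model = max(per_model_scores, key=per_model_scores.get)
        wins[best_model] += 1
    return wins


# Hypothetical judge scores for three queries and two models.
toy_scores = {
    "q1": {"model-a": 0.7, "model-b": 0.6},
    "q2": {"model-a": 0.2, "model-b": 0.9},
    "q3": {"model-a": 0.8, "model-b": 0.5},
}

toy_wins = tally_wins(toy_scores)
print(toy_wins)  # {'model-a': 2, 'model-b': 1}
```

If two models tie on a query, `max` keeps the first one in dictionary order, so ties silently favour whichever model was evaluated first.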

### Now that I'm writing this message, it's 3 in the morning and I'm tired as fox. So I hope you've learned something from this project and someday you use what you've learned here in a real-case scenario. Good Luck! ✌️

# ChatGPT History:

https://chatgpt.com/share/e13a6ea8-9872-474d-83d6-96a9de8a443e

https://chatgpt.com/share/8a88d0bd-ac21-4620-919a-b1a75e6b3d32

https://chatgpt.com/share/06781b6e-0a3b-4f3f-9ea1-be25d6d010f5

https://chatgpt.com/share/7a7f5d78-4dc8-498b-bac1-c585f22b92ee

https://chatgpt.com/share/fc771389-12cc-4306-ba16-5dea88054955