<a href="https://colab.research.google.com/github/sayantika21175/RAG_Projects/blob/main/RAG_USING_FAISS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#install Libraries

In [5]:
# !pip install faiss-cpu
# !pip install sentence-transformers
# !pip install pypdf
# !pip install tiktoken
#!pip install -U langchain-community

#Import Libraries

In [6]:
from sentence_transformers import SentenceTransformer
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain.document_loaders import PyPDFLoader, DirectoryLoader
from langchain.chains import RetrievalQA
import pandas as pd

#load the pdf file


*   loader.load() -> This loads the pdf file and convert it into the Document object. It does not split the pdf into multiple chunks
*   loader.load_and_split() -> split the pdf into multiple chunks based on the textsplitter provided to it. If no text splitter is provided it splits the documemnt using RecursiveTextSplitter



In [7]:
loader=PyPDFLoader("/content/drive/MyDrive/lumen_training_dataset1/chasesql.pdf")
documents=loader.load()

* Calulating the Average no of characters in the documents

* Also calculate the Minimum and Maximum no of characters in each document in documents

In [16]:
char_counts=[len(doc.page_content) for doc in documents]
print(f"The average no of characters in each page of the pdf is {sum(char_counts)/len(char_counts)}")

The average no of characters in each page of the pdf is 2686.6


In [162]:
print(min(char_counts))
print(max(char_counts))

279
5000


#Define the Splitter. Here RecursiveCharacterTextSplitter is considered which splits the documents based on new line, '.', '?','!' and double newline recursively

* Considered Chunk size=500 as the input token limit of the embedding model is 512
* Chunk overlap=50

In [52]:
splitter=RecursiveCharacterTextSplitter(chunk_size=500,chunk_overlap=50,separators=["\n\n", "\n", ". ", "? ", "! ", " "])

In [68]:
documents[0]

Document(metadata={'producer': 'pdfTeX-1.40.25', 'creator': 'LaTeX with hyperref', 'creationdate': '2024-10-04T00:06:07+00:00', 'author': '', 'keywords': '', 'moddate': '2024-10-04T00:06:07+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5', 'subject': '', 'title': '', 'trapped': '/False', 'source': '/content/drive/MyDrive/lumen_training_dataset1/chasesql.pdf', 'total_pages': 30, 'page': 0, 'page_label': '1'}, page_content='CHASE-SQL: Multi-Path Reasoning and Preference Optimized\nCandidate Selection in Text-to-SQL\nMohammadreza Pourreza1∗, Hailong Li1∗, Ruoxi Sun1, Yeounoh Chung1, Shayan Talaei2,\nGaurav Tarlok Kakkar1, Yu Gan1, Amin Saberi2, Fatma Özcan1, Sercan Ö. Arık1\n1Google Cloud, Sunnyvale, CA, USA\n2Stanford University, Stanford, CA, USA\n{pourreza, hailongli, ruoxis, yeounoh}@google.com\n{gkakkar, gany, fozcan, soarik}@google.com\n{stalaei, saberi}@stanford.edu\n∗Equal contribution\nOctober 4, 2024\nAbstract\nI

#Preprocessing
* Fix broken words split by hyphens at line breaks.
* Remove multiple whitespaces including new lines and tabs with single space


In [71]:
import re

def fix_hyphenated_line_breaks(text):
    return re.sub(r'(\w+)-\n(\w+)', r'\1\2', text)

def clean_whitespace(text):
    text = re.sub(r'\s+', ' ', text)         # Replace multiple whitespace (including newlines, tabs) with single space
    return text.strip()

In [73]:
from langchain.schema import Document
cleaned_text=[]
for doc in documents:
  text=doc.page_content

  text=clean_whitespace(text)

  text=fix_hyphenated_line_breaks(text)

  cleaned_text.append(Document(page_content=text,metadata=doc.metadata))

In [75]:
cleaned_text[0]

Document(metadata={'producer': 'pdfTeX-1.40.25', 'creator': 'LaTeX with hyperref', 'creationdate': '2024-10-04T00:06:07+00:00', 'author': '', 'keywords': '', 'moddate': '2024-10-04T00:06:07+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5', 'subject': '', 'title': '', 'trapped': '/False', 'source': '/content/drive/MyDrive/lumen_training_dataset1/chasesql.pdf', 'total_pages': 30, 'page': 0, 'page_label': '1'}, page_content='CHASE-SQL: Multi-Path Reasoning and Preference Optimized Candidate Selection in Text-to-SQL Mohammadreza Pourreza1∗, Hailong Li1∗, Ruoxi Sun1, Yeounoh Chung1, Shayan Talaei2, Gaurav Tarlok Kakkar1, Yu Gan1, Amin Saberi2, Fatma Özcan1, Sercan Ö. Arık1 1Google Cloud, Sunnyvale, CA, USA 2Stanford University, Stanford, CA, USA {pourreza, hailongli, ruoxis, yeounoh}@google.com {gkakkar, gany, fozcan, soarik}@google.com {stalaei, saberi}@stanford.edu ∗Equal contribution October 4, 2024 Abstract In tackling t

# Generating the chunks with the cleaned text after Preprocessing steps

In [122]:
chunks=splitter.split_documents(cleaned_text)

In [123]:
chunks[1]

Document(metadata={'producer': 'pdfTeX-1.40.25', 'creator': 'LaTeX with hyperref', 'creationdate': '2024-10-04T00:06:07+00:00', 'author': '', 'keywords': '', 'moddate': '2024-10-04T00:06:07+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5', 'subject': '', 'title': '', 'trapped': '/False', 'source': '/content/drive/MyDrive/lumen_training_dataset1/chasesql.pdf', 'total_pages': 30, 'page': 0, 'page_label': '1'}, page_content='. Arık1 1Google Cloud, Sunnyvale, CA, USA 2Stanford University, Stanford, CA, USA {pourreza, hailongli, ruoxis, yeounoh}@google.com {gkakkar, gany, fozcan, soarik}@google.com {stalaei, saberi}@stanford.edu ∗Equal contribution October 4, 2024 Abstract In tackling the challenges of large language model (LLM) performance for Text-to-SQL tasks, we introduce CHASE-SQL, a new framework that employs innovative strategies, using test-time compute in multi-agent modeling to improve candidate generation')

In [124]:
chunks[1].page_content[-50:]

'lti-agent modeling to improve candidate generation'

In [125]:
chunks[2].page_content[:50]

'modeling to improve candidate generation and selec'

In [126]:
len(chunks)     # Number of chunks generated 228

228

#Load the embedding model "all-MiniLM-L6-v2" from SentenceTransformer
* Input token limit: 512
* Output embedding size =384

In [29]:
embed_model=SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/10.5k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

In [127]:
texts=[chunks.page_content for chunks in chunks]

In [128]:
len(texts)

228

#Generate embeddings with batch size 32

In [129]:
embeddings=embed_model.encode(texts,batch_size=32,show_progress_bar=True)

Batches:   0%|          | 0/8 [00:00<?, ?it/s]

In [130]:
embeddings.shape

(228, 384)

# Import Faiss as Vector db to store the embeddings

In [37]:
import faiss

In [38]:
embedding_dim=embeddings.shape[1]

#Faiss Index

*  Create A Faiss Index with stores the data in a flat array.
*  Consider a big 2d matrix of size (198*384) into a 1D array(usually float values one after the other)
* IndexFlatL2 -> Computes the L2 distances. So when distance value is low means more similar. High means less similar
* IndexFlatIP -> Inner product or Cosine Similarity. So when this distance value is more, more similar. When distance value is less, less similar
* This flat index suitable for small dataset bot not for large dataset because it searches linearly

In [131]:
index=faiss.IndexFlatL2(embedding_dim)  # this index considerd L2 distance during search

In [132]:
index.add(embeddings)
print(f"The number of vectors in the index {index.ntotal}")

The number of vectors in the index 228


In [133]:
faiss.write_index(index,"/content/drive/MyDrive/lumen_training_dataset1/faiss_index.faiss")   # write the index into the disk

In [134]:
faiss_index=faiss.read_index("/content/drive/MyDrive/lumen_training_dataset1/faiss_index.faiss")   # read the index from the disk

# Query

In [163]:
query_text="What is Text-to-SQL?"

In [136]:
query_embedding=embed_model.encode([query_text])   # Generate the embeddings for the Query

# Similarity search and Retrieve the top 3 Most Relevant document from vector db where the metric is L2 distance

In [137]:
k=3
distances,indices=faiss_index.search(query_embedding,k)
print(f"Nearest chunk indices:",indices)
print(f"Nearest chunk distances:",distances)

Nearest chunk indices: [[ 7 10  9]]
Nearest chunk distances: [[0.6139318 0.7177527 0.7873499]]


In [139]:
#db=FAISS.from_documents(chunks,embed_model)
retrieved_doc=[chunks[i] for i in indices[0]]
for doc in retrieved_doc:
  print(doc.page_content)
  print('\n')

. 1 Introduction Text-to-SQL, as a bridge between human language and machine-readable structured query languages, is crucial for many use cases, converting natural language questions into executable SQL commands (Androutsopoulos et al., 1995; Li & Jagadish, 2014; Li et al., 2024c; Yu et al., 2018;?)


. Text-to-SQL can be considered a specialized form of code generation, with the contextual information potentially including the database schema, its metadata and along with the values. In the broader code generation domain, utilizing LLMs to generate a wide range of diverse candidates and select the best one has proven to be effective (Chen et al., 2021; Li et al., 2022; Ni et al., 2023). However, it is non-obvious what 1 arXiv:2410.01943v1 [cs.LG] 2 Oct 2024


. Furthermore, Text-to-SQL systems play a pivotal role in automating data analytics with complex reasoning and powering conversational agents, expanding their applications beyond traditional data retrieval (Sun et al., 2023; Xie e

# Build Faiss index with Cosine Similarity

In [165]:
faissIndex_with_cosine=faiss.IndexFlatIP(embedding_dim)
faissIndex_with_cosine.add(embeddings)
faiss.write_index(faissIndex_with_cosine,"/content/drive/MyDrive/lumen_training_dataset1/faiss_index_cosine.faiss")
faissIndex_with_cosine_read=faiss.read_index("/content/drive/MyDrive/lumen_training_dataset1/faiss_index_cosine.faiss")

# Similarity search and Retrieve the top 3 Most Relevant document from vector DB where the metric is Cosine

In [166]:
distances_cosine,indices_cosine=faissIndex_with_cosine.search(query_embedding,3)
print(f"Nearest chunk indices:",indices_cosine)
print(f"Nearest chunk distances:",distances_cosine)

Nearest chunk indices: [[ 7 10  9]]
Nearest chunk distances: [[0.6930342  0.6411238  0.60632515]]


In [167]:
#db=FAISS.from_documents(chunks,embed_model)
retrieved_doc_cosine=[chunks[i] for i in indices_cosine[0]]
for doc in retrieved_doc_cosine:
  print(doc.page_content)
  print('\n')

. 1 Introduction Text-to-SQL, as a bridge between human language and machine-readable structured query languages, is crucial for many use cases, converting natural language questions into executable SQL commands (Androutsopoulos et al., 1995; Li & Jagadish, 2014; Li et al., 2024c; Yu et al., 2018;?)


. Text-to-SQL can be considered a specialized form of code generation, with the contextual information potentially including the database schema, its metadata and along with the values. In the broader code generation domain, utilizing LLMs to generate a wide range of diverse candidates and select the best one has proven to be effective (Chen et al., 2021; Li et al., 2022; Ni et al., 2023). However, it is non-obvious what 1 arXiv:2410.01943v1 [cs.LG] 2 Oct 2024


. Furthermore, Text-to-SQL systems play a pivotal role in automating data analytics with complex reasoning and powering conversational agents, expanding their applications beyond traditional data retrieval (Sun et al., 2023; Xie e

# Generation using LLM

* Pass the retrieved documents as context to the LLM along with the user question and prompt
* Now this LLM generates the final answer to the query based on the context provided to it

In [145]:
context=["\n\n".join([doc.page_content for doc in retrieved_doc])]
user_question=query_text
input_text=f"Answer the user question: {user_question} based on the context: {context}"


In [146]:
from transformers import BartTokenizer,BartForConditionalGeneration

tokenizer=BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model=BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

inputs=tokenizer(input_text,return_tensors="pt",max_length=1024,truncation=True)

model_output=model.generate(inputs["input_ids"],max_length=300,min_length=100,length_penalty=2.0,num_beams=4,early_stopping=True)

final_result=tokenizer.decode(model_output[0],skip_special_tokens=True)
print(final_result)



Text-to-SQL is a bridge between human language and machine-readable structured query languages. As data continues to grow exponentially, the ability to query databases efficiently without extensive SQL knowledge becomes increasingly vital for a broad range of applications. The contextual information potentially includes the database schema, its metadata and along with the values. It is non-obvious what 1 arXiv:2410.01943v1 [cs.LG] 2 Oct 2024\n\n. 1 Introduction Text- to-SQL can be considered a specialized form of code generation.


In [147]:
print(f"The final answer is: {final_result}")

The final answer is: Text-to-SQL is a bridge between human language and machine-readable structured query languages. As data continues to grow exponentially, the ability to query databases efficiently without extensive SQL knowledge becomes increasingly vital for a broad range of applications. The contextual information potentially includes the database schema, its metadata and along with the values. It is non-obvious what 1 arXiv:2410.01943v1 [cs.LG] 2 Oct 2024\n\n. 1 Introduction Text- to-SQL can be considered a specialized form of code generation.
