# We will be using:
 1. LangChain as the framework for the RAG architecture.
 2. FAISS as the vector database.
 3. pypdf for PDF text extraction.
 4. Sentence Transformers for vector embeddings.
 5. A free ChatGPT API from RapidAPI.
 6. FlashRank (open source) to improve retrieval performance.

In [1]:
!pip install langchain
!pip install faiss-gpu
!pip install pypdf
!pip install sentence-transformers
!pip install flashrank


In [2]:
# PDF loading, chunking, vector store, and embeddings
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_community.vectorstores.utils import filter_complex_metadata
from langchain_community.embeddings import SentenceTransformerEmbeddings

In [3]:
from google.colab import files
# Upload the file named Doc1.pdf through the Colab file picker
uploaded = files.upload()
txt_file_path = 'Doc1.pdf'  # path to the uploaded PDF

Saving Doc1.pdf to Doc1.pdf


In [19]:
# Extract the text and split it into chunks to be vectorized.
docs = PyPDFLoader(file_path=txt_file_path).load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=250, chunk_overlap=80)
chunks = text_splitter.split_documents(docs)
chunks = filter_complex_metadata(chunks)
# Use the all-MiniLM-L6-v2 embedding model (see its model card on Hugging Face).
embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
# Store the chunks in the vector DB.
db = FAISS.from_documents(documents=chunks, embedding=embedding_function)
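
To make `chunk_size=250` and `chunk_overlap=80` concrete, here is a minimal sketch of fixed-window chunking with overlap. It is a simplification: `RecursiveCharacterTextSplitter` also tries to split on separators such as paragraph and line breaks, so its real chunk boundaries will differ. The function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text: str, chunk_size: int = 250, chunk_overlap: int = 80) -> list[str]:
    """Split text into fixed-size windows that overlap by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# 600 characters of dummy text (assumption: stands in for the PDF text)
sample = "".join(str(i % 10) for i in range(600))
pieces = sliding_chunks(sample)
print([len(p) for p in pieces])  # → [250, 250, 250, 90]
```

The last 80 characters of each chunk repeat at the start of the next one, so a sentence that straddles a boundary still appears whole in at least one chunk.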

In [20]:
retriever = db.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"score_threshold": 0.4, "k": 3},
)
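
The `similarity_score_threshold` search type returns at most `k` chunks, and only those whose relevance score is at least `score_threshold`. The sketch below illustrates that filtering logic on toy vectors. It assumes plain cosine similarity as the score; LangChain derives its relevance score from the FAISS distance, so the actual numbers will differ. `cosine` and `retrieve` are hypothetical helper names.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, chunk_vecs, score_threshold=0.4, k=3):
    """Return (index, score) for the top-k chunks scoring at least score_threshold."""
    scored = [(i, cosine(query_vec, v)) for i, v in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [(i, s) for i, s in scored[:k] if s >= score_threshold]


# Toy 2-D "embeddings"; real all-MiniLM-L6-v2 vectors have 384 dimensions.
query = [1.0, 0.0]
chunk_vecs = [[1.0, 0.1], [0.5, 0.5], [0.0, 1.0], [0.2, 1.0]]
strict = retrieve(query, chunk_vecs, score_threshold=0.8, k=3)
loose = retrieve(query, chunk_vecs, score_threshold=0.1, k=10)
print(len(strict), len(loose))  # → 1 3
```

A strict threshold can leave a single chunk, while a lower threshold combined with a larger `k` widens the candidate set at the cost of admitting less relevant chunks.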

## The answer to this question is on page 12, in the 3rd paragraph.

In [21]:
what_to_ask = "What are the definitions of visualization?"

In [22]:
retrieves=retriever.invoke(what_to_ask)

### **Only one chunk crosses the pre-defined threshold, and it does not contain the answer we need.**

In [23]:
retrieves

[Document(page_content="phenomena. Scientists need a representation form that can show all the \ncorrelations. Static maps are not the best presentation form for displaying \nthese relationship s. Often, maps are overloaded with information layers \nto show the many correlations. ln an animated map sequence, map \nelements can be presented in different orders and combinations to make \nthe spatial relationships more apparent. The map user can be directed \nthrough the presented subject, and the correlations can be brought to the \nuser's attention. \nIn the late 1980s, the sciences discovered scientific visualization. It is \nused for data analysis to see patterns that either answer questions or that \npose new and unexpected questions. Scientific visualization requires \ncomputer animation, especially interactive animation, that can show the Cartographic \nAnimation: \nPotential and \nResearch \nIssues \nDoris Karl \nTHE EEO FOR ANIMA TIO \nIN CARTOGRAPHY \nDoris Karl is a stude11t at

# Using reranking to improve retrieval performance

In [24]:
# Lower the threshold and raise k to retrieve more candidate chunks.
retriever = db.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"score_threshold": 0.2, "k": 10},
)

In [25]:
retrieves=retriever.invoke(what_to_ask)
retrieves

[Document(page_content="phenomena. Scientists need a representation form that can show all the \ncorrelations. Static maps are not the best presentation form for displaying \nthese relationship s. Often, maps are overloaded with information layers \nto show the many correlations. ln an animated map sequence, map \nelements can be presented in different orders and combinations to make \nthe spatial relationships more apparent. The map user can be directed \nthrough the presented subject, and the correlations can be brought to the \nuser's attention. \nIn the late 1980s, the sciences discovered scientific visualization. It is \nused for data analysis to see patterns that either answer questions or that \npose new and unexpected questions. Scientific visualization requires \ncomputer animation, especially interactive animation, that can show the Cartographic \nAnimation: \nPotential and \nResearch \nIssues \nDoris Karl \nTHE EEO FOR ANIMA TIO \nIN CARTOGRAPHY \nDoris Karl is a stude11t at

In [26]:
# Import the open-source FlashRank library, which re-scores the retrieved chunks against the query with a lightweight ranking model
from flashrank import Ranker, RerankRequest
ranker = Ranker()

In [27]:
# Convert the retrieved documents into the list of dicts that FlashRank expects.
passages = []
for element in retrieves:
    passage = {
        "text": element.page_content,
        "meta": element.metadata
    }
    passages.append(passage)
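
The same conversion can be wrapped in a small helper. The sketch below uses a stand-in `Doc` dataclass so that it runs without LangChain; `Doc` and `to_passages` are hypothetical names, and the extra `id` field is an assumption modeled on the FlashRank README examples rather than something this notebook uses.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for a LangChain Document (hypothetical, for illustration only)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def to_passages(docs):
    """Convert documents into the list-of-dicts shape passed to the reranker."""
    return [
        {"id": i, "text": d.page_content, "meta": d.metadata}
        for i, d in enumerate(docs)
    ]


demo = to_passages([Doc("first chunk", {"page": 11}), Doc("second chunk", {"page": 12})])
print(demo[1])  # → {'id': 1, 'text': 'second chunk', 'meta': {'page': 12}}
```

Keeping the original position in `id` makes it easy to see how far the reranker moved each chunk.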

## After FlashRank reranking, the chunk that contains the answer to our question is ranked first, so we can confidently pass it to ChatGPT

In [28]:
# Use FlashRank to rerank the chunks
rerankrequest = RerankRequest(query=what_to_ask, passages=passages)
results = ranker.rerank(rerankrequest)
results

[{'text': 'tools allo"v our visual and cognitive processes to almost automatically \nfocus on the patterns depicted rather than on mentally generating those \npatterns. \nFollowing from the above conception of visualization, a research \nagenda to address visualizing uncertain information should include \nattention to the cognitive issues of what it means to understand attribute, \nspatial, and temporal uncertainty and the implications of this understand\xad\ning for decision making and for symbolizing and categorizing uncertainty. \nAt the most basic level, uncertainty can be divided into two components \nthat might require different visualization strategies: \\\'isualizing accuracy \nand visualizing precision. In addition, attention should be directed toward \nthe methodological, technical, and ergonomic issue:; of generating dis\xad\nplays and creating interfaces that work. It is, of course, also essential to \ndevelop methods for assessing and measuring uncertainty before we can',
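
Conceptually, reranking scores every candidate passage against the query and sorts by that score. The sketch below shows the idea with a toy word-overlap score. This is not how FlashRank scores passages (it runs a trained ranking model); `overlap_score` and `toy_rerank` are hypothetical stand-ins that only illustrate the re-sorting step.

```python
def overlap_score(query: str, text: str) -> float:
    """Fraction of query words that also appear in the passage (toy relevance score)."""
    q_words = set(query.lower().split())
    t_words = set(text.lower().split())
    return len(q_words & t_words) / len(q_words) if q_words else 0.0


def toy_rerank(query: str, passages: list[dict]) -> list[dict]:
    """Return copies of the passages with a 'score' key, best match first."""
    scored = [{**p, "score": overlap_score(query, p["text"])} for p in passages]
    return sorted(scored, key=lambda p: p["score"], reverse=True)


candidates = [
    {"text": "static maps are overloaded with information layers"},
    {"text": "the definitions of visualization vary across fields"},
]
ranked = toy_rerank("what are the definitions of visualization", candidates)
print(ranked[0]["text"])  # → the definitions of visualization vary across fields
```

The second-stage score is computed per (query, passage) pair, which is why a reranker can promote a chunk that the embedding search placed low in the candidate list.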


In [30]:
# Keep only the top 3 chunks from the reranked list

texts = [item['text'] for item in results[:3]]

# Join the chunks into a single string

concatenated_text = '\n'.join(texts)

print(concatenated_text)

tools allo"v our visual and cognitive processes to almost automatically 
focus on the patterns depicted rather than on mentally generating those 
patterns. 
Following from the above conception of visualization, a research 
agenda to address visualizing uncertain information should include 
attention to the cognitive issues of what it means to understand attribute, 
spatial, and temporal uncertainty and the implications of this understand­
ing for decision making and for symbolizing and categorizing uncertainty. 
At the most basic level, uncertainty can be divided into two components 
that might require different visualization strategies: \'isualizing accuracy 
and visualizing precision. In addition, attention should be directed toward 
the methodological, technical, and ergonomic issue:; of generating dis­
plays and creating interfaces that work. It is, of course, also essential to 
develop methods for assessing and measuring uncertainty before we can
8 cartographic perspectives Number

In [31]:
# Instruction to be passed to ChatGPT as the system message.
instruction = "Answer the user's question as accurately as possible, based on the following context: " + concatenated_text
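
One way to make the prompt easier for the model to parse is to separate the instruction from the context and to delimit each chunk. A minimal sketch, assuming the reranked results are dicts with a `text` key as in the output above; `build_instruction` and the `---` delimiter are illustrative choices, not part of the notebook.

```python
def build_instruction(results: list[dict], top_n: int = 3) -> str:
    """Build a system prompt from the top_n reranked chunks."""
    texts = [item["text"].strip() for item in results[:top_n]]
    context = "\n---\n".join(texts)  # visible delimiter between chunks
    return (
        "Answer the user's question as accurately as possible, "
        "based on the following context:\n\n" + context
    )


# Dummy reranked results (assumption: same shape as the FlashRank output)
fake_results = [{"text": f"chunk {i} "} for i in range(5)]
prompt = build_instruction(fake_results)
print(prompt)
```

Only the first `top_n` chunks reach the prompt, so the context stays short even when the retriever returns ten candidates.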

In [32]:
# Use RapidAPI's free ChatGPT API to answer the user query.

# You can get your key here: https://rapidapi.com/haxednet/api/chatgpt-api8
import requests

url = "https://chatgpt-api8.p.rapidapi.com/"

payload = [
	{
		"content": instruction,
		"role": "system"
	},
	{
		"content": what_to_ask,
		"role": "user"
	}
]
headers = {
	"content-type": "...",
	"X-RapidAPI-Key": "...",
	"X-RapidAPI-Host": "..."
}


response = requests.post(url, json=payload, headers=headers, timeout=30)
response.raise_for_status()  # fail early on HTTP errors such as an invalid key

print(response.json())

{'text': 'Visualization refers to the process of creating visual representations of data or information. In the context of uncertain information, visualization involves using tools to automatically focus on patterns depicted rather than mentally generating those patterns. It helps in understanding attributes, spatial, and temporal uncertainty, which can be divided into visualizing accuracy and visualizing precision. Visualization also involves creating dynamic and user-oriented forms of information display, especially important in the realm of computer animation in cartography.', 'finish_reason': 'stop', 'model': 'gpt-3.5-turbo-030', 'server': 'backup-K'}
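
To avoid pasting the API key into the notebook, the headers can be built from an environment variable. A hedged sketch: the variable name `RAPIDAPI_KEY`, the helper `build_headers`, and the placeholder host are assumptions; use the host value shown on your RapidAPI dashboard. The content-type header is left out because `requests` sets it automatically when the body is passed with `json=`.

```python
import os


def build_headers(host: str, env_var: str = "RAPIDAPI_KEY") -> dict:
    """Build RapidAPI request headers, reading the key from the environment."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return {"X-RapidAPI-Key": key, "X-RapidAPI-Host": host}


# Demo only: a fake key so the sketch runs; never hard-code a real key.
os.environ.setdefault("RAPIDAPI_KEY", "demo-key-not-real")
demo_headers = build_headers("example.invalid")
print(sorted(demo_headers))  # → ['X-RapidAPI-Host', 'X-RapidAPI-Key']
```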


## The response closely matches the answer in the PDF (page 12, 3rd paragraph).

In [33]:
print(response.json()['text'])

Visualization refers to the process of creating visual representations of data or information. In the context of uncertain information, visualization involves using tools to automatically focus on patterns depicted rather than mentally generating those patterns. It helps in understanding attributes, spatial, and temporal uncertainty, which can be divided into visualizing accuracy and visualizing precision. Visualization also involves creating dynamic and user-oriented forms of information display, especially important in the realm of computer animation in cartography.
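
Indexing the JSON with `['text']` raises a `KeyError` if the API returns an error payload instead of an answer. A hedged sketch of a more defensive extraction, assuming the response shape printed above (a `text` key holding the answer); `extract_answer` is a hypothetical helper, and the error payload in the comment is invented for illustration.

```python
def extract_answer(payload: dict) -> str:
    """Return the answer text, raising a clear error if the payload is unexpected."""
    text = payload.get("text")
    if not isinstance(text, str) or not text.strip():
        # Show which keys did arrive, e.g. ['message'] for an error payload
        raise ValueError(f"Unexpected API payload, keys: {sorted(payload)}")
    return text.strip()


ok = extract_answer({"text": " Visualization refers to ... ", "finish_reason": "stop"})
print(ok)  # → Visualization refers to ...
```

In the notebook this would be called as `extract_answer(response.json())`.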
