# RAG Chatbot
**Vector Search Enabled PDF Query System**

- Vector search enhances machine learning models by enabling similarity comparisons between embeddings.
- Embeddings are numeric vector representations of high-dimensional data such as text.
- Because LLMs are stateless, the embeddings are stored in a vector database so that they persist and can be searched across queries.
- Inspired by the [LangChain GEN AI Tutorial](https://www.youtube.com/watch?v=x0AnCE9SE4A)
- I'll be using [DataStax](https://www.datastax.com/?utm_medium=search_pd&utm_source=google&utm_campaign=ggl_s_nam_dev_brand&utm_content=) as the vector database
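
To make "similarity comparison between embeddings" concrete, here is a minimal sketch using cosine similarity on plain Python lists. The metric is an assumption (the notebook does not state which one the vector store uses), and `cosine_similarity` is a hypothetical helper, not part of LangChain or cassio.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional "embeddings"; real ones have hundreds of dimensions.
query_vec = [1.0, 0.0, 1.0]
doc_vec = [1.0, 0.0, 0.5]
print(cosine_similarity(query_vec, doc_vec))
```

Vectors pointing in the same direction score close to 1, and orthogonal vectors score 0.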

**Dataset and Data Source**

Data Source: [Ministry of Finance, Quebec](https://www.finances.gouv.qc.ca/Budget_and_update/budget/speech.asp)

PDF Doc: [Budget Speech](https://www.finances.gouv.qc.ca/Budget_and_update/budget/documents/Budget2324_BudgetSpeech.pdf)

## 1. Loading Libraries

In [1]:
# langchain
from langchain.vectorstores.cassandra import Cassandra
from langchain.indexes.vectorstore import VectorStoreIndexWrapper
from langchain.llms import OpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter

# datasets
from datasets import load_dataset

# db connections
import cassio

# pdf reader
from PyPDF2 import PdfReader

# generic
from typing_extensions import Concatenate

## 2. Setting up credentials

In [2]:
ASTRA_DB_APPLICATION_TOKEN = ""
ASTRA_DB_ID = ""

OPENAI_API_KEY = ""
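
Rather than pasting secrets into the notebook, one option is to read them from environment variables. The sketch below assumes the variables are named after the Python constants above; `read_credential` is a hypothetical helper introduced only for illustration.

```python
import os


def read_credential(name: str, default: str = "") -> str:
    """Return the environment variable `name`, or `default` if it is unset."""
    return os.environ.get(name, default)


# Assumed variable names; adjust them to match your own environment.
ASTRA_DB_APPLICATION_TOKEN = read_credential("ASTRA_DB_APPLICATION_TOKEN")
ASTRA_DB_ID = read_credential("ASTRA_DB_ID")
OPENAI_API_KEY = read_credential("OPENAI_API_KEY")
```

This keeps tokens out of the saved notebook and its version history.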

## 3. Reading PDF File (Quebec Budget)

In [3]:
pdf_reader = PdfReader('Budget2324_BudgetSpeech.pdf')

In [5]:
raw_text = ''

for page in pdf_reader.pages:
    content = page.extract_text()
    if content:
        raw_text += content

In [6]:
raw_text

'BUDGET 2023-2024 \nA COMMITTED \nQUÉBEC \nBUDGET SPEECH \nMarch 2023 BUDGET 2023-2024 \nA COMMITTED \nQUÉBEC \nBUDGET SPEECH \nMarch 2023 \nDelivered before the National Assembly by Eric Girard, Minister of Finance and \nResponsible for Relations with English-Speaking Quebecers, on March 21, 2023.  Budget 2023-2024 \nBudget Speech \nLegal deposit – March 21, 2023 \nBibliothèque et Archives nationales du Québec ISBN 978-2-550-94113-2 (Print) ISBN 978-2-550-94114-9 (PDF) \n© Gouvernement du Québec, 2023  \n     \n   \n    \n   \n  \n   \n   \n \n    \n    \n   \n   \n   \n  \n   \n  \n   \n   \n    \n    \n   \n   \n   \n  \n   \n   \n     \n   \n A COMMITTED QUÉBEC \nIntroduction .............................................................................................. 3 \n1. Growing Québec’s wealth ................................................................ 9 \nLowering taxes .................................................................................... 9 \nIncreasing t
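
The extracted text above contains long runs of whitespace-only lines. A minimal sketch of an optional cleanup step follows. It is not part of the original pipeline, and `collapse_blank_lines` is a hypothetical helper.

```python
def collapse_blank_lines(text: str) -> str:
    """Trim each line and drop lines that contain only whitespace."""
    lines = (line.strip() for line in text.split("\n"))
    return "\n".join(line for line in lines if line)


sample = "BUDGET 2023-2024 \nA COMMITTED \n   \n  \n A COMMITTED QUÉBEC "
print(collapse_blank_lines(sample))
```

Applying a step like this to `raw_text` before splitting would keep the chunk budget from being spent on empty lines, at the cost of losing the original layout.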

## 4. Initializing Connections

In [7]:
cassio.init(token=ASTRA_DB_APPLICATION_TOKEN, database_id=ASTRA_DB_ID)

## 5. LLM & Embeddings

In [8]:
llm = OpenAI(openai_api_key=OPENAI_API_KEY)
embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)


## 6. Vector Store

In [9]:
vector_store = Cassandra(embedding=embeddings, table_name='budget_speech_quebec_2023_2024', session=None, keyspace=None)

In [11]:
text_splitter = CharacterTextSplitter(separator='\n', chunk_size=800, chunk_overlap=200, length_function=len)
texts = text_splitter.split_text(raw_text)

In [12]:
texts[:10]

['BUDGET 2023-2024 \nA COMMITTED \nQUÉBEC \nBUDGET SPEECH \nMarch 2023 BUDGET 2023-2024 \nA COMMITTED \nQUÉBEC \nBUDGET SPEECH \nMarch 2023 \nDelivered before the National Assembly by Eric Girard, Minister of Finance and \nResponsible for Relations with English-Speaking Quebecers, on March 21, 2023.  Budget 2023-2024 \nBudget Speech \nLegal deposit – March 21, 2023 \nBibliothèque et Archives nationales du Québec ISBN 978-2-550-94113-2 (Print) ISBN 978-2-550-94114-9 (PDF) \n© Gouvernement du Québec, 2023  \n     \n   \n    \n   \n  \n   \n   \n \n    \n    \n   \n   \n   \n  \n   \n  \n   \n   \n    \n    \n   \n   \n   \n  \n   \n   \n     \n   \n A COMMITTED QUÉBEC \nIntroduction .............................................................................................. 3',
 'A COMMITTED QUÉBEC \nIntroduction .............................................................................................. 3 \n1. Growing Québec’s wealth .................................................
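
The second chunk above begins with text that also ends the first chunk. That repetition is the effect of `chunk_overlap`. The sketch below illustrates the idea with a plain character-level sliding window. It is a simplification and not LangChain's actual algorithm, which splits on the separator and merges pieces. `sliding_chunks` is a hypothetical name.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list:
    """Cut `text` into windows of `chunk_size` characters that share
    `chunk_overlap` characters with the previous window."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

Overlap means a sentence cut at a chunk boundary still appears whole in at least one neighbouring chunk, which tends to help retrieval.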

In [13]:
# loading data into Vector DB

vector_store.add_texts(texts)
print(f'Added {len(texts)} documents to the vector database')

Added 87 documents to the vector database


In [14]:
vector_index = VectorStoreIndexWrapper(vectorstore=vector_store)
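
Conceptually, a retrieval-augmented query fetches the chunks most similar to the question and hands them to the LLM as context. The sketch below shows only the prompt-assembly step. The template wording and the `build_prompt` helper are assumptions for illustration and are not the prompt that `VectorStoreIndexWrapper` uses internally.

```python
def build_prompt(question: str, chunks: list) -> str:
    """Combine retrieved chunks and the user's question into one prompt."""
    context = "\n\n".join(chunk.strip() for chunk in chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


retrieved = ["ADDRESSING THE LABOUR SHORTAGE ", " Lowering taxes"]
print(build_prompt("what is the role of higher education?", retrieved))
```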

## 7. Question Answer on PDF

**From Budget Speech**
- What is the role of higher education?
- Can you highlight budget benefits for seniors?
- Are there any implications for immigration policy?

**From Math's Book**
- What are mortgages?
- What is the strategy for seniors?

In [15]:
input_question = True

while True:
    if input_question:
        query = input('Ask Question: ').strip()
    else:
        query = input('Ask next Question: ').strip()

    if query.lower() == 'exit':
        break

    if query.lower() == '':
        continue

    input_question = False

    print(f'\nQuestion : {query}')
    response = vector_index.query(query, llm=llm).strip()
    print(f'Response : {response}')


    print('\n\nOther Relevant Responses:')
    for doc, score in vector_store.similarity_search_with_score(query, k=5):
        # Format the score directly; round(score, 2) * 100 can print float noise
        print(f'{score * 100:.1f}% - {doc.page_content[:300].strip()}\n')

Ask Question:  are there any implications on immigration policy?



Question : are there any implications on immigration policy?
Response : Yes, the context suggests that the government is taking steps to support the socioeconomic integration of immigrants into the labour market, which could have implications on immigration policy.


Other Relevant Responses:
89.0% - ADDRESSING THE LABOUR SHORTAGE 
Madam President, our government is particularly sensitive to the situation of the 
Québec workforce. It is the powerhouse of our economic vitality. We are aware that 
the current shortage combined with an aging population will present challenges for years to come. It

89.0% - ADDRESSING THE LABOUR SHORTAGE 
Madam President, our government is particularly sensitive to the situation of the 
Québec workforce. It is the powerhouse of our economic vitality. We are aware that 
the current shortage combined with an aging population will present challenges for years to come. It

89.0% - ADDRESSING THE LABOUR SHORTAGE 
Madam President, our government is particularly

Ask next Question:  exit
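
The output above lists the same passage several times. One possible cause, which is an assumption since the notebook does not confirm it, is that the same chunks were inserted into the table more than once, for example by re-running the `add_texts` cell. A minimal sketch for filtering duplicate hits before printing follows; `FakeDoc` is a stand-in for LangChain's document type, and `dedupe_results` is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass
class FakeDoc:
    """Stand-in for a retrieved document; only `page_content` is used."""
    page_content: str


def dedupe_results(results):
    """Keep the first (doc, score) pair for each distinct page_content."""
    seen = set()
    unique = []
    for doc, score in results:
        key = doc.page_content.strip()
        if key not in seen:
            seen.add(key)
            unique.append((doc, score))
    return unique


hits = [
    (FakeDoc("ADDRESSING THE LABOUR SHORTAGE"), 0.89),
    (FakeDoc("ADDRESSING THE LABOUR SHORTAGE"), 0.89),
    (FakeDoc("Lowering taxes"), 0.85),
]
for doc, score in dedupe_results(hits):
    print(f"{score * 100:.1f}% - {doc.page_content}")
```

In the loop above, the same function could wrap the list returned by `similarity_search_with_score`, since it only relies on each document having a `page_content` attribute.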
