<a href="https://colab.research.google.com/github/zulfiqaralimir/LangChain-UseCases/blob/master/chatgpt_for_your_own_pdf_files_with_langchain.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **ChatGPT for YOUR OWN PDF files with LangChain.**

# **Required Packages**

In [None]:
!pip install langchain
!pip install openai
!pip install PyPDF2
!pip install faiss-cpu
!pip install tiktoken

# **Loading the Packages**

In [None]:
from PyPDF2 import PdfReader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import FAISS

# **Operating System and API Key**

In [None]:
# Get your API key from OpenAI; you will need to create an account first.
# Link to your OpenAI account: https://platform.openai.com/account/billing/overview
import os
os.environ["OPENAI_API_KEY"] = "YOUR-OPENAI-API-KEY"
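
Hard-coding the key in a cell makes it easy to leak when the notebook is shared. A minimal sketch of one alternative, assuming an interactive session: reuse the key if it is already in the environment and prompt for it only when it is missing. The helper name `ensure_api_key` is hypothetical and not part of LangChain or OpenAI.

```python
import os
from getpass import getpass


def ensure_api_key(var_name="OPENAI_API_KEY"):
    """Return the key stored in var_name, prompting for it only if it is unset."""
    if not os.environ.get(var_name):
        # getpass hides the typed value so it is not echoed into the cell output.
        os.environ[var_name] = getpass(f"Enter {var_name}: ")
    return os.environ[var_name]
```

Calling `ensure_api_key()` once before creating the embeddings object would replace the hard-coded assignment above.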

# **Connecting to Google Drive for PDF File**

In [None]:
# connect your Google Drive
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)
root_dir = "/content/gdrive/My Drive/"

# **Reading the PDF File from Google Drive**

In [None]:
# location of the pdf file/files.
reader = PdfReader('/content/gdrive/My Drive/Colab Notebooks/2023_GPT4All_Technical_Report.pdf')

# **Reader Object** (holds the information needed to read the PDF file)

In [None]:
reader

# **Raw Text**

In [None]:
# Extract the text of every page and concatenate it into raw_text
raw_text = ''
for page in reader.pages:
    text = page.extract_text()
    if text:  # skip pages where no text could be extracted
        raw_text += text

In [None]:
raw_text

# **First 100 Characters of the Raw Text**

In [None]:
raw_text[:100]

'GPT4All: Training an Assistant-style Chatbot with Large Scale Data\nDistillation from GPT-3.5-Turbo\nY'

# **Splitting into Chunks to avoid Token Size Limits**

In [None]:
# Split the extracted text into smaller chunks so that retrieval does not exceed the model's token limit.

text_splitter = CharacterTextSplitter(
    separator = "\n",
    chunk_size = 1000,
    chunk_overlap  = 200,
    length_function = len,
)
texts = text_splitter.split_text(raw_text)
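
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified pure-Python sketch of a greedy splitter with overlap. It is an illustration under stated assumptions, not LangChain's implementation: the function name `split_with_overlap` is hypothetical, lengths are counted in characters, and a single piece longer than `chunk_size` is simply kept as its own oversized chunk.

```python
def split_with_overlap(text, separator="\n", chunk_size=1000, chunk_overlap=200):
    """Pack separator-delimited pieces into chunks of at most chunk_size
    characters, repeating trailing pieces of one chunk at the start of the next."""
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], []

    def joined_len(parts):
        return len(separator.join(parts))

    for piece in pieces:
        if current and joined_len(current + [piece]) > chunk_size:
            # The current chunk is full: emit it.
            chunks.append(separator.join(current))
            # Keep only a tail that fits the overlap budget and still
            # leaves room for the incoming piece.
            while current and (
                joined_len(current) > chunk_overlap
                or joined_len(current + [piece]) > chunk_size
            ):
                current.pop(0)
        current.append(piece)

    if current:
        chunks.append(separator.join(current))
    return chunks


# Small demo with tiny limits so the overlap is easy to see.
sample = "\n".join(f"line {i:02d}" for i in range(10))
chunks = split_with_overlap(sample, chunk_size=25, chunk_overlap=10)
for chunk in chunks:
    print(repr(chunk))
```

In the demo, the last line of each chunk reappears as the first line of the next one, which is the same effect the overlap has on the chunks of the PDF text.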

# **Number of Chunks**

In [None]:
len(texts)

# **Reading the First Chunk**

In [None]:
texts[0]

# **Second Chunk** (check the overlap between the two chunks; optional but helpful)

In [None]:
texts[1]

# **Setting Up the Embeddings from OpenAI** (Needs API Key)
An embedding is a list of floating-point numbers that represents a piece of text. The distance between two embeddings measures how closely related the two text strings or sentences are.

In [None]:
# Create the OpenAI embeddings client; the vectors are requested when texts are embedded
embeddings = OpenAIEmbeddings()
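
As a rough illustration of how two embeddings are compared, the sketch below computes cosine similarity between toy vectors. The three-dimensional vectors and their values are made up for the example; real OpenAI embeddings have far more dimensions, and the helper `cosine_similarity` is written here by hand rather than taken from a library.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors standing in for the embeddings of three pieces of text.
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

print(cosine_similarity(cat, kitten))   # related texts: close to 1
print(cosine_similarity(cat, invoice))  # unrelated texts: close to 0
```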

# **Vector Database** (taking the chunks and finding the corresponding embeddings)
The result is stored in 'docsearch'.

In [None]:
docsearch = FAISS.from_texts(texts, embeddings)

In [None]:
docsearch
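
Conceptually, a vector store keeps each chunk next to its embedding and returns the chunks whose vectors lie closest to the query vector. The brute-force sketch below shows that idea only; FAISS uses optimized index structures instead. Everything here is an assumption made for illustration: `toy_embed` is a word-count stand-in for real embeddings, `ToyVectorStore` is a hypothetical class, and squared Euclidean distance is just one possible metric.

```python
def toy_embed(text, vocabulary):
    """Stand-in embedding: count how often each vocabulary term occurs."""
    words = text.lower().split()
    return [words.count(term) for term in vocabulary]


class ToyVectorStore:
    """Stores (text, vector) pairs and searches them by brute force."""

    def __init__(self, texts, embed):
        self.embed = embed
        self.entries = [(text, embed(text)) for text in texts]

    def similarity_search(self, query, k=4):
        query_vec = self.embed(query)

        def squared_distance(vec):
            return sum((a - b) ** 2 for a, b in zip(query_vec, vec))

        # Smallest distance first; ties keep insertion order.
        ranked = sorted(self.entries, key=lambda entry: squared_distance(entry[1]))
        return [text for text, _ in ranked[:k]]


vocabulary = ["training", "cost", "authors", "dataset"]
toy_texts = [
    "notes about training cost",
    "list of authors",
    "description of the dataset",
]
store = ToyVectorStore(toy_texts, lambda t: toy_embed(t, vocabulary))
print(store.similarity_search("training cost", k=1))
```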

# Importing **QnA Chain** from LangChain and Corresponding OpenAI Object

In [None]:
from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI

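# **Passing different Models** (It will create a Chain)
Example model: text-ada-001. It is capable of very simple tasks and is usually the fastest and lowest-cost model in the GPT-3 series. It accepts up to 2,049 tokens, and its training data runs up to October 2019. Note that `OpenAI()` called without arguments, as in the cell that follows, uses LangChain's default model rather than text-ada-001.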

In [None]:
chain = load_qa_chain(OpenAI(), chain_type="stuff")
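
The "stuff" chain type places all retrieved chunks into a single prompt and sends it to the model in one call. The sketch below shows only that assembly step. The function `build_stuff_prompt` and the prompt wording are hypothetical; LangChain's own prompt template is worded differently.

```python
def build_stuff_prompt(documents, question):
    """Concatenate all retrieved chunks into one context block and
    append the question, so the model sees everything in a single prompt."""
    context = "\n\n".join(documents)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_stuff_prompt(
    ["first chunk", "second chunk"],
    "who are the authors?",
)
print(prompt)
```

Because every chunk goes into one prompt, the combined length must stay within the model's token limit, which is one reason the text was split into chunks earlier.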

# **Start Asking Questions**
(Using the embeddings, a semantic search finds the text in the document closest to the query.)

In [None]:
query = "who are the authors of the article?"
docs = docsearch.similarity_search(query)
chain.run(input_documents=docs, question=query)
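
Since every question follows the same retrieve-then-answer pattern, the two calls can be wrapped in one small helper. This is a sketch: `ask` is a hypothetical name, and `FakeStore` and `FakeChain` are stand-ins that let the example run without an API key. In the notebook, the real `docsearch` and `chain` objects would be passed instead.

```python
def ask(docsearch, chain, query, k=4):
    """Retrieve the k chunks closest to the query and pass them to the QA chain."""
    docs = docsearch.similarity_search(query, k=k)
    return chain.run(input_documents=docs, question=query)


class FakeStore:
    """Stand-in for the vector store: returns k placeholder chunks."""

    def similarity_search(self, query, k=4):
        return [f"chunk {i}" for i in range(k)]


class FakeChain:
    """Stand-in for the QA chain: reports what it received."""

    def run(self, input_documents, question):
        return f"{len(input_documents)} chunks used for: {question}"


answer = ask(FakeStore(), FakeChain(), "who are the authors of the article?", k=2)
print(answer)
```

With the real objects, a call would look like `ask(docsearch, chain, "who are the authors of the article?")`.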

# **Next Query**

In [None]:
query = "What was the cost of training the GPT4all model?"
docs = docsearch.similarity_search(query)
chain.run(input_documents=docs, question=query)

# **Some other Questions**

In [None]:
query = "How was the model trained?"
docs = docsearch.similarity_search(query)
chain.run(input_documents=docs, question=query)

In [None]:
query = "what was the size of the training dataset?"
docs = docsearch.similarity_search(query)
chain.run(input_documents=docs, question=query)

In [None]:
# The answer is not exact because this information is not present in the paper.
query = "How is this different from other models?"
docs = docsearch.similarity_search(query)
chain.run(input_documents=docs, question=query)

In [None]:
# Expected answer: I don't know. (Google Bard is not covered in the technical report.)
query = "What is Google Bard?"
docs = docsearch.similarity_search(query)
chain.run(input_documents=docs, question=query)

**Video Link:**

https://www.youtube.com/watch?v=TLf90ipMzfE

https://www.toolspedia.io/ai-tool/pdfgpt/

https://www.pdfgpt.io/plan
