<a href="https://colab.research.google.com/github/saffarizadeh/INSY4054/blob/main/LangChain_and_OpenAI_GPT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src="http://saffarizadeh.com/Logo.png" width="300px"/>

# *INSY 4054: Emerging Technologies*

# **LangChain and OpenAI GPT**

---

Credit: **Sam Witteveen** (https://www.youtube.com/@samwitteveenai)

Source: https://www.youtube.com/watch?v=ZzgUqFtxgXI

In [None]:
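# Note: the imports in this notebook follow LangChain's spring-2023 module layout;
# newer releases have reorganized these modules, so an older version may be needed.
!pip -q install langchain openai tiktoken PyPDF2 faiss-cpu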

# Chat & Query your PDF files

In [None]:
import os

os.environ["OPENAI_API_KEY"] = "PUT YOUR OPENAI API KEY HERE"

In [None]:
!pip show langchain

## The Game Plan


<img src="https://dl.dropboxusercontent.com/s/gxij5593tyzrvsg/Screenshot%202023-04-26%20at%203.06.50%20PM.png" alt="vectorstore">


<img src="https://dl.dropboxusercontent.com/s/v1yfuem0i60bd88/Screenshot%202023-04-26%20at%203.52.12%20PM.png" alt="retriever chain">


In [None]:
# Download the PDF of Reid Hoffman's book "Impromptu" (written with GPT-4) from its free download link
!wget -q https://www.impromptubook.com/wp-content/uploads/2023/03/impromptu-rh.pdf

### Basic Chat PDF


In [None]:
from PyPDF2 import PdfReader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import FAISS

## Reading in the PDF


In [None]:
# location of the pdf file/files.
doc_reader = PdfReader('/content/impromptu-rh.pdf')

In [None]:
doc_reader

In [None]:
# Extract the text of every page and concatenate it into raw_text
raw_text = ''
for page in doc_reader.pages:
    text = page.extract_text()
    if text:  # skip pages with no extractable text
        raw_text += text

In [None]:
len(raw_text)

In [None]:
raw_text[:100]

### Text Splitter

The text splitter breaks the raw text into chunks. Note that `chunk_size` is measured in characters, not tokens, because `length_function` is set to `len`.
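
To make the character-based sizing concrete, here is a minimal sketch of fixed-size chunking with overlap in plain Python. It is a simplified stand-in, not LangChain's implementation: `CharacterTextSplitter` splits on the separator and then merges the pieces, whereas the hypothetical `chunk_by_chars` helper below simply slides a window over the string. The sample string and the small sizes are made up so the overlap is easy to see.

```python
def chunk_by_chars(text, chunk_size, chunk_overlap):
    """Split text into windows of at most chunk_size characters.

    Hypothetical helper for illustration only; consecutive windows
    share chunk_overlap characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_by_chars(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(len(chunk), chunk)
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, so the tail of one chunk reappears at the head of the next. That repetition is what keeps a sentence cut at a chunk boundary retrievable from either side.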

In [None]:
# Splitting up the text into smaller chunks for indexing
text_splitter = CharacterTextSplitter(
    separator = "\n",
    chunk_size = 1000,
    chunk_overlap = 200,  # consecutive chunks share up to 200 characters
    length_function = len,
)
texts = text_splitter.split_text(raw_text)

In [None]:
len(texts)

In [None]:
texts[20]

In [None]:
texts[10]

## Making the embeddings
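
Each chunk is mapped to a vector, and a query is answered by finding the chunks whose vectors lie closest to the query's vector. The sketch below illustrates that idea with hand-made 3-dimensional vectors. Everything in it is made up for illustration: the texts, the vectors, and the `nearest` helper. Real embeddings have far more dimensions, and Euclidean distance is used here as an assumption, since the actual metric depends on how the index is configured.

```python
import math

# Toy "embeddings": hand-made vectors standing in for real model output.
toy_index = {
    "GPT-4 and social media": [0.9, 0.1, 0.0],
    "Journalism in the last 20 years": [0.1, 0.9, 0.1],
    "Creativity and AI": [0.0, 0.1, 0.9],
}


def nearest(query_vector, index, k=2):
    """Return the k texts whose vectors are closest to query_vector."""
    scored = [
        (math.dist(query_vector, vector), text)
        for text, vector in index.items()
    ]
    scored.sort()  # smallest distance first
    return [text for _, text in scored[:k]]


# A made-up query vector that points mostly along the first axis.
results = nearest([0.8, 0.2, 0.1], toy_index, k=2)
print(results)
```

A vector store does the same job at scale: it stores one vector per chunk and returns the `k` closest chunks for a query vector.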

In [None]:
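# Set up the OpenAI embeddings client (it reads OPENAI_API_KEY from the environment);
# the vectors themselves are computed later, when the index is built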
embeddings = OpenAIEmbeddings()

In [None]:
# Embed every chunk and build a FAISS index over the resulting vectors
docsearch = FAISS.from_texts(texts, embeddings)

In [None]:
docsearch.embedding_function

In [None]:
query = "how does GPT-4 change social media?"
docs = docsearch.similarity_search(query)

In [None]:
len(docs)

In [None]:
docs[0]

## Plain QA Chain

In [None]:
from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI

In [None]:
chain = load_qa_chain(OpenAI(),
                      chain_type="stuff") # we are going to stuff all the docs in at once

In [None]:
# check the prompt
chain.llm_chain.prompt.template

Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

{context}

Question: {question}
Helpful Answer:
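
The "stuff" strategy can be pictured as plain string formatting: all retrieved chunks are joined and substituted for `{context}`, and the query replaces `{question}`. The sketch below reproduces that step with the template printed above and two made-up chunks. It illustrates the idea and is not LangChain's internal code; the blank-line separator between chunks is an assumption.

```python
# The template text printed above, written as a Python format string.
template = (
    "Use the following pieces of context to answer the question at the end. "
    "If you don't know the answer, just say that you don't know, "
    "don't try to make up an answer.\n\n"
    "{context}\n\n"
    "Question: {question}\n"
    "Helpful Answer:"
)

# Made-up stand-ins for the chunks a similarity search would return.
toy_docs = ["First retrieved chunk.", "Second retrieved chunk."]

prompt = template.format(
    context="\n\n".join(toy_docs),  # assumed separator between chunks
    question="who are the authors of the book?",
)
print(prompt)
```

Because every chunk goes into a single prompt, the total length of the retrieved chunks plus the question has to fit within the model's context window.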

In [None]:
query = "who are the authors of the book?"
docs = docsearch.similarity_search(query)
chain.run(input_documents=docs, question=query)

In [None]:
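query = "who is the author of the book?"
query_02 = "has it rained this week?"
# Note: the chunks are retrieved for the unrelated query_02, but the chain is
# asked `query`, so the context passed to the model is likely off-topic.
docs = docsearch.similarity_search(query_02)
chain.run(input_documents=docs, question=query)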

In [None]:
query = "who is the book authored by?"
docs = docsearch.similarity_search(query, k=4)
chain.run(input_documents=docs, question=query)

## RetrievalQA
The `RetrievalQA` chain combines `load_qa_chain` with a retriever (in our case, the FAISS index).
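
Conceptually, this is a two-step pipeline: fetch the most relevant chunks, then hand them to the question-answering step. The sketch below mimics that flow with stand-ins only. `toy_retriever`, `toy_llm`, and `toy_retrieval_qa` are hypothetical names, the corpus is made up, and word overlap replaces vector similarity. The keys of the returned dictionary are chosen to resemble what the chain returns when source documents are requested, which is an assumption rather than a specification.

```python
import re


def words_of(text):
    """Lowercase a string and return its set of alphanumeric words."""
    return set(re.findall(r"\w+", text.lower()))


def toy_retriever(query, k=4):
    """Hypothetical retriever: rank made-up chunks by shared words."""
    corpus = [
        "OpenAI is the company that developed GPT-4.",
        "Journalists can use GPT-4 to draft and summarize.",
        "The weather was mild this week.",
    ]
    query_words = words_of(query)
    ranked = sorted(
        corpus,
        key=lambda chunk: len(query_words & words_of(chunk)),
        reverse=True,
    )
    return ranked[:k]


def toy_llm(prompt):
    """Hypothetical stand-in for a model call; reports the prompt size."""
    return f"(answer based on a {len(prompt)}-character prompt)"


def toy_retrieval_qa(query, k=2):
    """Retrieve chunks, stuff them into a prompt, and 'answer'."""
    docs = toy_retriever(query, k=k)
    prompt = "\n\n".join(docs) + f"\n\nQuestion: {query}\nHelpful Answer:"
    return {"query": query, "result": toy_llm(prompt), "source_documents": docs}


output = toy_retrieval_qa("What is OpenAI?")
print(output["result"])
print(output["source_documents"])
```

Returning the source chunks alongside the answer makes it possible to check which passages the answer was based on.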

In [None]:
from langchain.chains import RetrievalQA

# set up FAISS as a generic retriever
retriever = docsearch.as_retriever(search_type="similarity", search_kwargs={"k":4})

# create the chain to answer questions
rqa = RetrievalQA.from_chain_type(llm=OpenAI(),
                                  chain_type="stuff",
                                  retriever=retriever,
                                  return_source_documents=True)

In [None]:
rqa("What is OpenAI?")

In [None]:
query = "What does gpt-4 mean for creativity?"
rqa(query)['result']

In [None]:
query = "what have the last 20 years been like for American journalism?"
rqa(query)['result']

In [None]:
query = "how can journalists use GPT-4?"
rqa(query)['result']

In [None]:
query = "How is GPT-4 different from other models?"
rqa(query)['result']

In [None]:
query = "What is beagle Bard?"
rqa(query)['result']