# **OpenAI GPT-3 based chatbot**





By **Shivansh Singh**

- Installs, Imports and API Keys
- Loading PDFs and chunking with LangChain
- Embedding text and storing embeddings
- Creating retrieval function
- Creating chatbot with chat memory

## 0. Installs, Imports and API Keys

In [2]:
%%writefile requirements.txt
chromadb
langchain
tiktoken
pypdf
faiss-cpu

Writing requirements.txt


In [185]:
%%capture
!pip install openai -q

In [3]:
%%capture
%pip install -r requirements.txt

In [184]:
%%capture
# RUN THIS CELL FIRST! (the %%capture cell magic has to be the first line of the cell)
!pip install -q textract

In [164]:
import os
import platform

import openai
import langchain
import pandas as pd
import numpy as np
import tiktoken

from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import (
    CharacterTextSplitter,
    RecursiveCharacterTextSplitter,
    TokenTextSplitter,
)
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS, DeepLake
from langchain.chains import ConversationalRetrievalChain, RetrievalQA
from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI

# Path of the plain-text copy of the PDF
dataset_path = '/content/data_storytelling.txt'

print('Python: ', platform.python_version())

Python:  3.10.12


In [186]:
os.environ["OPENAI_API_KEY"] = "Enter your api key"
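
Hardcoding the key in a cell makes it easy to leak when the notebook is shared. A minimal sketch of an alternative, assuming the key is either already set in the environment or typed in interactively; the helper name `resolve_api_key` is hypothetical and not part of the original notebook:

```python
import os
from getpass import getpass


def resolve_api_key(env=os.environ, prompt=getpass):
    """Return the OpenAI key from the environment, prompting only if it is missing."""
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        # Hypothetical fallback: ask interactively so the key is never stored in the notebook
        key = prompt("OpenAI API key: ").strip()
    if not key:
        raise ValueError("No OpenAI API key was provided.")
    return key
```

In the notebook this could be used as `os.environ["OPENAI_API_KEY"] = resolve_api_key()`.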

## 1. Loading PDFs and chunking with LangChain

In [168]:
# You MUST add your PDF to local files in this notebook (folder icon on left hand side of screen)

# Simple method - Split by pages 
loader = PyPDFLoader("/content/Data Storytelling.pdf")
pages = loader.load_and_split()
print(pages[0])

# SKIP TO STEP 2 IF YOU'RE USING THIS METHOD
chunks_pages = pages

page_content='Data Storytelling & Methodology Document  \n– AirBnb, New York Analysis using Python  \nBy – Shivansh singh, Aparna Sahu, Manish Matsaniya  \nA. Methodology Approach  \n1. Research Problem  \n\uf0b7 For the past few months, Airbnb has seen a major decline in revenue due to the lockdown imposed \nduring the pandemic.  \n\uf0b7 Now after 2 years of devastating Covid pandemic, the restrictions knot have started losing, and people \nhave started to travel more. Hence, Airbnb wants to make sure to be fully prepared for this change.  \n \n2. Business Under standing  \n \n- Airbnb id an American company based in San Francisco, California. It operates an online market places \nfor lodging, primarily home stays for vacation rental, and tourism activities. The platform can be \nreachable via its web sites and mobile applic ation.  \n- Being as online market place for hosting personal home stays and private apartments in the majority, and \nsue to this, company have basically two ty

In [169]:
# Advanced method - Split by chunk

# Step 1: Convert PDF to text (textract.process returns bytes)
import textract
doc = textract.process("/content/Data Storytelling.pdf")

In [170]:
# Step 2: Save to .txt and reopen (helps prevent issues)
with open('data_storytelling.txt', 'w', encoding='utf-8') as f:
    f.write(doc.decode('utf-8'))

with open('data_storytelling.txt', 'r', encoding='utf-8') as f:
    text = f.read()


In [179]:
# Encode the text into GPT-2 token IDs with tiktoken
encoding = tiktoken.get_encoding("gpt2")
input_ids = encoding.encode(text)
print(input_ids)

[6601, 8362, 18072, 1222, 11789, 1435, 16854, 220, 220, 198, 198, 1906, 3701, 33, 46803, 11, 968, 1971, 14691, 1262, 11361, 220, 198, 198, 3886, 784, 43305, 504, 71, 1702, 71, 11, 5949, 28610, 311, 12196, 11, 1869, 680, 30107, 3216, 3972, 220, 198, 198, 32, 13, 220, 11789, 1435, 38066, 220, 198, 16, 13, 220, 4992, 20647, 220, 198, 198, 42122, 262, 19798, 5314, 13, 220, 628, 220, 198, 198, 17, 13, 220, 7320, 28491, 220, 198, 198, 171, 224, 115, 220, 1114, 220, 262, 220, 1613, 220, 1178, 220, 1933, 11, 220, 35079, 220, 468, 220, 1775, 220, 257, 220, 1688, 220, 7794, 220, 287, 220, 6426, 220, 2233, 220, 284, 220, 262, 220, 47955, 220, 10893, 220, 198, 198, 171, 224, 115, 220, 2735, 706, 362, 812, 286, 14101, 39751, 312, 19798, 5314, 11, 262, 8733, 29654, 423, 2067, 6078, 11, 290, 661, 220, 198, 198, 14150, 2067, 284, 3067, 517, 13, 16227, 11, 35079, 3382, 284, 787, 1654, 284, 307, 3938, 5597, 329, 428, 1487, 13, 220, 628, 220, 198, 12, 220, 35079, 4686, 281, 220, 1605, 1664, 1912, 287, 29

In [180]:
# Load the document again for the splitting step below
with open('data_storytelling.txt') as f:
    story_telling = f.read()

### Function to split document

In [173]:
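def split_text_into_documents(input_text, max_document_length, min_document_count):
    # Build the splitter here so the function does not rely on a global `text_splitter`
    # that is only defined in a later cell. Assumption: max_document_length is the
    # chunk size in characters.
    splitter = RecursiveCharacterTextSplitter(chunk_size=max_document_length, chunk_overlap=0)
    chunks = splitter.split_text(input_text)
    total_chunks = len(chunks)
    chunks_per_document = max(total_chunks // min_document_count, 1)

    # Join consecutive chunks into documents
    documents = []
    for i in range(0, total_chunks, chunks_per_document):
        documents.append("\n".join(chunks[i:i + chunks_per_document]))

    # Pad with empty strings so that at least min_document_count documents are returned
    while len(documents) < min_document_count:
        documents.append("")

    return documents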
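
The grouping arithmetic in this function is easy to misread: `min_document_count` is a lower bound, and the integer division can produce more documents than that. A small sketch of the grouping step alone, using made-up chunk strings and a hypothetical helper name `group_chunks`:

```python
def group_chunks(chunks, min_document_count):
    """Join consecutive chunks using the same arithmetic as split_text_into_documents."""
    chunks_per_document = max(len(chunks) // min_document_count, 1)
    return [
        "\n".join(chunks[i:i + chunks_per_document])
        for i in range(0, len(chunks), chunks_per_document)
    ]


# 12 illustrative chunks with a minimum of 5 documents: 12 // 5 = 2 chunks per document
chunks = [f"chunk-{n}" for n in range(12)]
documents = group_chunks(chunks, min_document_count=5)
print(len(documents))  # 6
```

With 12 chunks and a minimum of 5, the function therefore returns 6 documents rather than 5.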

### Load split documents

In [174]:
max_length = 2048
min_documents = 5
documents = split_text_into_documents(story_telling, max_length, min_documents)

for i, document in enumerate(documents):
    print(f"Document {i + 1}:\n{document}")

Document 1:
Data Storytelling & Methodology Document  

– AirBnb, New York Analysis using Python 

By – Shivansh singh, Aparna Sahu, Manish Matsaniya 

A.  Methodology Approach 
1.  Research Problem 

during the pandemic. 

 

2.  Business Understanding 

  For  the  past  few  months,  Airbnb  has  seen  a  major  decline  in  revenue  due  to  the  lockdown  imposed 

  Now after 2 years of devastating Covid pandemic, the restrictions knot have started losing, and people 

have started to travel more. Hence, Airbnb wants to make sure to be fully prepared for this change. 

 
-  Airbnb id an  American company based in San Francisco, California. It operates an  online  market places 
for  lodging,  primarily  home  stays  for  vacation  rental,  and  tourism  activities.  The  platform  can  be 
reachable via its web sites and mobile application.
-  Being as online market place for hosting personal home stays and private apartments in the majority, and 
sue to this, company have basi

In [77]:
# The result is a list of plain Python strings, not LangChain 'Document' objects, as the type check shows
type(documents[0])

str

## 2. Deployment of Embedding model, Embed text and store embeddings

In [181]:
embeddings = OpenAIEmbeddings(deployment="text-similarity-davinci-001")

In [150]:
%%capture
!pip install deeplake

The document is split again below because the call

{ db = FAISS.from_documents(story_telling, embeddings) }

raised `AttributeError: 'str' object has no attribute 'page_content'`. `from_documents` expects a list of LangChain `Document` objects, while `story_telling` is a plain string.
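
The error can be reproduced without LangChain. The sketch below uses a hypothetical stand-in class, `SimpleDocument`, that mimics the two fields of LangChain's `Document` (`page_content` and `metadata`); the helper `collect_texts` is also an assumption, mirroring the attribute access that a `from_documents` call presumably performs on each item:

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for LangChain's Document (page_content + metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def collect_texts(documents):
    # Reads .page_content from every item, which fails on plain strings
    return [doc.page_content for doc in documents]


raw_chunks = ["first chunk", "second chunk"]
wrapped = [SimpleDocument(page_content=chunk) for chunk in raw_chunks]
print(collect_texts(wrapped))

try:
    # Iterating a string yields single characters, which have no page_content
    collect_texts("a raw string")
except AttributeError as err:
    print(err)
```

In the notebook itself, `text_splitter.create_documents(...)` does the wrapping, so no custom class is needed.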

In [182]:

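# Re-read the text file; this is a long document we can split up.
with open('/content/data_storytelling.txt') as f:
    story_text = f.read()

# First pass: split into strings of up to roughly 1000 characters
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
pages = text_splitter.split_text(story_text)

# Second pass: wrap the strings in LangChain Document objects
# (pieces that still need splitting get a 100-character overlap)
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
texts = text_splitter.create_documents(pages)

print(texts)

# Embed the documents and store them in a Deep Lake vector store
embeddings = OpenAIEmbeddings()
db = DeepLake.from_documents(texts, embeddings, overwrite=True)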



[Document(page_content='Data Storytelling & Methodology Document  \n\n– AirBnb, New York Analysis using Python \n\nBy – Shivansh singh, Aparna Sahu, Manish Matsaniya \n\nA.  Methodology Approach \n1.  Research Problem \n\nduring the pandemic. \n\n \n\n2.  Business Understanding \n\n\uf0b7  For  the  past  few  months,  Airbnb  has  seen  a  major  decline  in  revenue  due  to  the  lockdown  imposed \n\n\uf0b7  Now after 2 years of devastating Covid pandemic, the restrictions knot have started losing, and people \n\nhave started to travel more. Hence, Airbnb wants to make sure to be fully prepared for this change. \n\n \n-  Airbnb id an  American company based in San Francisco, California. It operates an  online  market places \nfor  lodging,  primarily  home  stays  for  vacation  rental,  and  tourism  activities.  The  platform  can  be \nreachable via its web sites and mobile application.', metadata={}), Document(page_content='-  Being as online market place for hosting personal h

Evaluating ingest: 100%|██████████| 1/1 [00:00<00:00

Dataset(path='./deeplake/', tensors=['embedding', 'ids', 'metadata', 'text'])

  tensor     htype     shape      dtype  compression
  -------   -------   -------    -------  ------- 
 embedding  generic  (21, 1536)  float32   None   
    ids      text     (21, 1)      str     None   
 metadata    json     (21, 1)      str     None   
   text      text     (21, 1)      str     None   





## 3. Setup retrieval function

In [183]:

retriever = db.as_retriever()
retriever.search_kwargs['distance_metric'] = 'cos'  # rank chunks by cosine similarity
retriever.search_kwargs['k'] = 4                    # return the 4 closest chunks

qa = RetrievalQA.from_chain_type(llm=OpenAI(), chain_type="stuff", retriever=retriever, return_source_documents=False)

# Ask a question about the document
query = input("enter query :")

# The chain returns a dict containing the query and the generated result
ans = qa({"query": query})

print(ans)

enter query :what is story telling
{'query': 'what is story telling', 'result': ' Storytelling is the process of using narrative techniques to tell a story and convey an idea or message.'}
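
The retriever settings above (`distance_metric = 'cos'`, `k = 4`) amount to ranking stored chunks by cosine similarity to the query embedding and keeping the best `k`. A pure-Python sketch of that idea with toy 2-dimensional vectors; the helper names and the vectors are illustrative assumptions, not Deep Lake's implementation:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, doc_vecs, k=4):
    """Return the indices of the k vectors most similar to query_vec, best first."""
    scored = sorted(
        ((cosine_similarity(query_vec, vec), index) for index, vec in enumerate(doc_vecs)),
        reverse=True,
    )
    return [index for _, index in scored[:k]]


# Toy embeddings: chunk 0 points the same way as the query, chunk 3 the opposite way
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
print(top_k([1.0, 0.0], doc_vecs, k=2))  # [0, 2]
```

The real embeddings in this notebook have 1536 dimensions, but the ranking logic is the same.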


## 4. Create chatbot with chat memory.

In [175]:
from IPython.display import display
import ipywidgets as widgets

# Create a conversation chain that uses our vector DB as retriever; this also allows for chat history management
qa = ConversationalRetrievalChain.from_llm(OpenAI(temperature=0.1), db.as_retriever())

In [177]:
chat_history = []

def on_submit(_):
    query = input_box.value
    input_box.value = ""
    
    if query.lower() == 'exit':
        print("Thank you for using the Data Storytelling chatbot!")
        return
    
    # Pass the running history so follow-up questions keep their context
    result = qa({"question": query, "chat_history": chat_history})
    chat_history.append((query, result['answer']))
    
    display(widgets.HTML(f'<b>User:</b> {query}'))
    display(widgets.HTML(f'<b><font color="blue">Chatbot:</font></b> {result["answer"]}'))

print("Welcome to the Data Storytelling chatbot! Type 'exit' to stop.")

input_box = widgets.Text(placeholder='Please enter your question:')
input_box.on_submit(on_submit)

display(input_box)

Welcome to the Data Storytelling chatbot! Type 'exit' to stop.


Text(value='', placeholder='Please enter your question:')

HTML(value='<b>User:</b> what data story tel')

HTML(value='<b><font color="blue">Chatbot:</font></b>  We are presenting a data story and methodology document…

HTML(value='<b>User:</b> what is story telli')

HTML(value='<b><font color="blue">Chatbot:</font></b>  The data story is about understanding the decline in Ai…
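
The chat memory in this section is just a list of `(question, answer)` tuples that is passed back to the chain on every turn. The sketch below shows that bookkeeping in isolation. `fake_chain` is a hypothetical stand-in for `ConversationalRetrievalChain`, so no API key or vector store is needed:

```python
def make_chat_turn(chain, chat_history):
    """Return a function that asks one question and records the turn in chat_history."""
    def ask(question):
        result = chain({"question": question, "chat_history": chat_history})
        chat_history.append((question, result["answer"]))
        return result["answer"]
    return ask


def fake_chain(inputs):
    # Hypothetical stand-in: numbers its answers by how many turns it has seen
    turn = len(inputs["chat_history"]) + 1
    return {"answer": f"answer #{turn} to: {inputs['question']}"}


history = []
ask = make_chat_turn(fake_chain, history)
ask("what is data storytelling")
ask("who wrote it")
print(history)
```

Because the same list object is handed to the chain each time, the second call already sees the first turn, which is what lets the real chain resolve follow-up questions such as "who wrote it".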