# Chatbot based on internal data (PDFs)

Steps:
1. Read the PDFs and web pages
2. Chunk the PDFs and web pages
3. Create vector embeddings from the chunks
4. Add the embeddings to the Pinecone vector DB
5. Create a chatbot that queries Pinecone to implement the RAG architecture

### Import Libraries

Load all the necessary modules and libraries. 

If any are missing, add them to requirements.txt and run `pip install -r requirements.txt` in the terminal.

In [29]:
import os
import openai
import langchain
import pinecone
from langchain.document_loaders import PyPDFDirectoryLoader, AsyncHtmlLoader
from langchain_community.document_transformers import Html2TextTransformer
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Pinecone
from langchain.llms import OpenAI

Load the environment variables, including the API keys, from the `.env` file.

In [5]:
from dotenv import load_dotenv
load_dotenv()

True
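`load_dotenv()` returning `True` only indicates that a `.env` file was found and loaded, not that every key the notebook needs is set. A minimal sketch of an explicit check follows. The helper `missing_env_vars` and the `REQUIRED_KEYS` list are hypothetical, and `PINECONE_API_KEY` is an assumed variable name; only `OPENAI_API_KEY` is used later in this notebook.

```python
import os


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# Hypothetical list of keys; adjust it to match your own .env file.
REQUIRED_KEYS = ["OPENAI_API_KEY", "PINECONE_API_KEY"]

missing = missing_env_vars(REQUIRED_KEYS)
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

Running this right after `load_dotenv()` surfaces a missing key early instead of as a `KeyError` several cells later.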

### Reading the PDFs and Webpages

Create a function that reads every PDF in a given folder using a LangChain document loader. Each page is returned as a separate `Document`.

https://python.langchain.com/docs/modules/data_connection/document_loaders/pdf

In [42]:
def read_pdfs(folder):
    # Load every PDF in `folder`; each page becomes one Document
    file_loader = PyPDFDirectoryLoader(folder)
    pdfs = file_loader.load()
    return pdfs

In [43]:
pdfs = read_pdfs('data')
print("Number of pages:",len(pdfs))
pdfs

Number of pages: 23


[Document(page_content="Kakaraparti  Bhavanarayana  College  is truly a dream come true for many, \nespecially for those who are residing in the old town of Vijayawada. The long \ncherished dream has been realized through the benevolence of  Danaseela,  \nPurapramukha  and  Vidya -Poshaka  Sri Kakaraparti  Bhavanarayana  \nShresti.  The foundation stone of the college was laid on 6th November, 1964 \nby Sri Kasu Brahmananda Reddy, the then Chief Minister of Andhra Pradesh. \nThe college was constructed on 4.1 1 acres of land of the  S.K.P.V.V.  Hindu  \nHigh  Schools’  Committee. It commenced functioning fully ever since July, \n1965. The college had a humble beginning with 278 students and a devoted \nstaff of just nineteen under the visionary leadership of the Founder P rincipal \nSri S. Sundaram.  \nThe infrastructure of the college is admirable. The college has a state of the \nart library with a digital library and a spacious reading room. The college is \nembellished with expansi
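Displaying the whole list prints every page in full. A compact summary is often easier to read; the sketch below counts pages per source file and previews the first few documents. It assumes each document exposes `page_content` and a `metadata` dict with a `source` key, which is what the PDF loader is expected to produce. The helper names and the sample file names are hypothetical, and `SimpleNamespace` objects stand in for real documents so the sketch runs on its own.

```python
from collections import Counter
from types import SimpleNamespace


def pages_per_source(documents):
    """Count how many loaded pages came from each source file."""
    return Counter(doc.metadata.get("source", "unknown") for doc in documents)


def preview(documents, limit=2, width=80):
    """Print the first `width` characters of the first `limit` documents."""
    for doc in documents[:limit]:
        text = " ".join(doc.page_content.split())  # collapse newlines and runs of spaces
        print(f"{doc.metadata.get('source', 'unknown')}: {text[:width]}")


# Stand-in documents with made-up file names; in the notebook, pass `pdfs` instead.
sample = [
    SimpleNamespace(page_content="First page\ntext", metadata={"source": "data/a.pdf", "page": 0}),
    SimpleNamespace(page_content="Second page", metadata={"source": "data/a.pdf", "page": 1}),
    SimpleNamespace(page_content="Other file", metadata={"source": "data/b.pdf", "page": 0}),
]

counts = pages_per_source(sample)
preview(sample)
```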

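Create a similar function for web pages: fetch the HTML with `AsyncHtmlLoader`, then convert it to text with `Html2TextTransformer`.

In [44]:
def read_webpage(urls):
    # Fetch the raw HTML for each URL in the list
    html_loader = AsyncHtmlLoader(urls)
    html = html_loader.load()
    # Strip the markup and keep the readable text
    html_transformer = Html2TextTransformer()
    docs_transformed = html_transformer.transform_documents(html)
    return docs_transformed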

In [39]:
urls = ["https://www.espn.com"]
webpage = read_webpage(urls)
print("Number of pages:",len(webpage))
webpage

Fetching pages: 100%|##########| 1/1 [00:00<00:00,  3.68it/s]


Number of pages: 1


[Document(page_content="Skip to main content  Skip to navigation\n\n<\n\n>\n\nMenu\n\n## ESPN\n\n  *   *   * scores\n\n  * NFL\n  * NBA\n  * NCAAM\n  * NCAAW\n  * MLB\n  * NHL\n  * …\n\n    * Soccer\n    * NCAAF\n    * Sports Betting\n    * Boxing\n    * CFL\n    * NCAA\n    * Cricket\n    * F1\n    * Golf\n    * Horse\n    * LLWS\n    * MMA\n    * NASCAR\n    * NBA G League\n    * Olympic Sports\n    * PLL\n    * Professional Wrestling\n    * Racing\n    * RN BB\n    * RN FB\n    * Rugby\n    * Tennis\n    * WNBA\n    * X Games\n    * UFL\n\n  * More ESPN\n  * Fantasy\n  * Watch\n  * ESPN BET\n  * ESPN+\n\n##\n\n  * Subscribe Now\n  * NHL\n\n  * German Cup: Semifinals\n\n  * NCAA Baseball\n\n  * NCAA Softball\n\n  * MLB\n\n  * NCAA Women's Gymnastics: First Round\n\n  * Field Yates' Two-Round NFL Mock Draft\n\n  * UFC 300: Pereira vs. Hill (Apr. 13, ESPN+ PPV)\n\n## Quick Links\n\n  * Men's Tournament Challenge\n\n  * Women's Tournament Challenge\n\n  * Men's TC 2nd Chance\n\n  * Wome

### Chunk the PDFs

An LLM can only handle a limited number of tokens at a time, so the PDFs need to be split into smaller chunks.
LangChain's `RecursiveCharacterTextSplitter` does this by splitting the text recursively on a list of separators (paragraphs, then lines, then words) until each chunk fits within `chunk_size`.
By default the size is measured in characters, not tokens, and `chunk_overlap` sets how many characters adjacent chunks share.

https://python.langchain.com/docs/modules/data_connection/document_transformers/recursive_text_splitter
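To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal fixed-window sketch in plain Python. It is a simpler technique than LangChain's recursive splitter: it cuts at fixed character offsets and ignores paragraph and word boundaries. The function name `chunk_text` is hypothetical.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=50):
    """Split `text` into windows of `chunk_size` characters that share
    `chunk_overlap` characters with the previous window."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


pieces = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1)
print(pieces)  # ['abcd', 'defg', 'ghij']
```

Each chunk starts with the last character of the previous one, so a sentence cut at a boundary still appears partly in both chunks. The recursive splitter pursues the same goal but prefers to break at natural separators.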

In [47]:
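def chunk_documents(documents, chunk_size=1000, chunk_overlap=50):
    # Pass the sizes by keyword: the first positional parameter of
    # RecursiveCharacterTextSplitter is `separators`, so passing them
    # positionally raises "TypeError: 'int' object is not subscriptable"
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        is_separator_regex=False,
    )
    chunks = splitter.split_documents(documents)
    return chunks

In [48]:
chunked_pdfs = chunk_documents(documents=pdfs)
print("Number of chunks:", len(chunked_pdfs))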

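### Create vector embeddings

Embed the text of each chunk with OpenAI. `embed_documents` expects a list of strings, so pass each chunk's `page_content` rather than the `Document` objects themselves.

In [None]:
embeddings = OpenAIEmbeddings(api_key=os.environ['OPENAI_API_KEY'])
pdf_vectors = embeddings.embed_documents([chunk.page_content for chunk in chunked_pdfs])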
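Once the vectors exist, retrieval comes down to ranking them by similarity to the embedded question. The sketch below illustrates that idea with cosine similarity in plain Python. It is a stand-in for what a vector database does at scale, not Pinecone's API; the helper names and the two-dimensional toy vectors are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, vectors, k=3):
    """Return the indices of the `k` vectors most similar to `query_vector`."""
    scored = [(cosine_similarity(query_vector, v), i) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [index for _, index in scored[:k]]


# Toy vectors; real embeddings have hundreds or thousands of dimensions.
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
best = top_k([1.0, 0.1], toy_vectors, k=2)
print(best)  # [0, 2]
```

In the notebook, the query vector would come from `embeddings.embed_query(question)`, and the returned indices would select the chunks to place in the chatbot's prompt.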