# Chatbot based on internal data (PDFs)

Steps:
1. Read the PDFs and web pages
2. Chunk the PDFs and web pages
3. Create vector embeddings from the chunks
4. Add the embeddings to a vector store (Pinecone is the planned target; the code in this notebook uses a local FAISS index)
5. Create a chatbot that queries the vector store to implement a RAG architecture

### Import Libraries

Load all the necessary modules and libraries. 

If any are missing, add them to `requirements.txt` and run `pip install -r requirements.txt` in the terminal.
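
Before installing anything, it can help to see which imports are actually missing. The following is a minimal sketch using only the standard library; `missing_modules` is a hypothetical helper, not part of any package used here. Note that import names can differ from the package names listed in `requirements.txt` (for example, `dotenv` is installed as `python-dotenv`).

```python
from importlib.util import find_spec


def missing_modules(module_names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in module_names if find_spec(name) is None]


# Top-level import names used in this notebook.
needed = [
    "langchain",
    "langchain_community",
    "langchain_cohere",
    "langchain_openai",
    "dotenv",
]
print("Missing:", missing_modules(needed))
```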

In [20]:
import os
from langchain_community.document_loaders import PyPDFDirectoryLoader, AsyncHtmlLoader
from langchain_community.document_transformers import Html2TextTransformer
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_cohere import CohereEmbeddings
from langchain_openai import OpenAIEmbeddings

Load the environment variables that hold the API keys (for example `COHERE_API_KEY`) from a `.env` file.

In [2]:
from dotenv import load_dotenv
load_dotenv()

True
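
`load_dotenv()` returning `True` only means a `.env` file was found and loaded, not that a particular key is set. A small check like the one below can fail early with a clear message instead of failing later inside an API call. This is a sketch; `require_env` is a hypothetical helper introduced for illustration.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name!r} is not set; check your .env file.")
    return value
```

In this notebook it could be used as `require_env("COHERE_API_KEY")` when creating the embeddings object.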

### Reading the PDF and Webpages

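Create a function that reads all PDFs in a given folder using a document loader. `PyPDFDirectoryLoader` returns one `Document` per page, so the length of the result is the total page count.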

https://python.langchain.com/docs/modules/data_connection/document_loaders/pdf

In [3]:
def read_pdfs(folder):
    file_loader = PyPDFDirectoryLoader(folder)
    pdfs = file_loader.load()
    return pdfs
    

In [4]:
pdfs = read_pdfs('data')
print("Number of pages:",len(pdfs))
pdfs

Number of pages: 50


[Document(page_content='Future of \nWork Report November 2023\nAI at Work', metadata={'source': 'data/future-of-work-report-ai-november-2023.pdf', 'page': 0}),
 Document(page_content='Executive summary        3\nProfessionals are increasingly exploring and applying to AI-related roles      5\nGenerative artificial intelligence in the workforce: Bringing opportunities across    \neducation, generations, genders, and industries       11\nExecutives and employees express excitement, anxiousness about AI       18 \nHow LinkedIn can help       23\nMethodology and credits         30T able of contents:', metadata={'source': 'data/future-of-work-report-ai-november-2023.pdf', 'page': 1}),
 Document(page_content='3Executive summary \nProfessionals and business leaders around the world are asking how \nartificial intelligence (AI) may change work, and they’re coming to LinkedIn \nto deepen their understanding and share what they’re learning. That’s why \nwe’re releasing our second Future of Work 
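
The steps also mention web pages, and `AsyncHtmlLoader` and `Html2TextTransformer` are imported at the top but not used. A possible counterpart to `read_pdfs` is sketched below. `read_webpages` is a hypothetical name, the imports are repeated inside the function so that defining it has no side effects, and `Html2TextTransformer` is assumed to have the `html2text` package available.

```python
def read_webpages(urls):
    """Fetch web pages and convert their HTML to plain text (sketch, untested on real URLs)."""
    # Local imports: the packages are only needed when the function is called.
    from langchain_community.document_loaders import AsyncHtmlLoader
    from langchain_community.document_transformers import Html2TextTransformer

    html_docs = AsyncHtmlLoader(urls).load()  # one Document per URL, raw HTML
    transformer = Html2TextTransformer()
    return list(transformer.transform_documents(html_docs))  # HTML converted to text
```

The result is a list of `Document` objects, so it could be passed to the same chunking step as the PDFs.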

### Chunk the PDFs

An LLM can only handle a limited number of tokens at a time, so the PDFs need to be split into smaller chunks.
LangChain provides `RecursiveCharacterTextSplitter`, which splits text on a list of separators (paragraphs first, then lines, then words) until each chunk fits within `chunk_size`.
By default the size is measured in characters, not tokens, so `chunk_size=500` below means chunks of up to 500 characters.

https://python.langchain.com/docs/modules/data_connection/document_transformers/recursive_text_splitter

In [7]:
def chunk_documents(documents, chunk_size=500, chunk_overlap=0):
    splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    chunks = splitter.split_documents(documents)
    return chunks
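
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified fixed-window chunker in plain Python. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that one prefers to break on separators); `chunk_text` is a hypothetical helper that only illustrates how overlap makes consecutive chunks share characters.

```python
def chunk_text(text, chunk_size=500, chunk_overlap=0):
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1))
# prints ['abcd', 'defg', 'ghij']
```

With `chunk_overlap=0`, as in this notebook, chunks do not share any text, so a sentence that straddles a boundary is cut in two. A small overlap is a common way to soften that.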

In [8]:
chunked_pdfs = chunk_documents(documents=pdfs)
print("Number of chunks:", len(chunked_pdfs))

### Create vector embeddings

Raw text cannot be searched by meaning directly, so each chunk is converted into a vector embedding using an embedding model (Cohere here). The embeddings are then stored in a vector store for similarity search; the code below uses a local FAISS index.

In [21]:
embeds = CohereEmbeddings(
    cohere_api_key=os.getenv("COHERE_API_KEY")
)

In [24]:
from langchain_community.vectorstores import FAISS
docsearch = FAISS.from_documents(chunked_pdfs, embeds)


In [27]:
retriever = docsearch.as_retriever()
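
Conceptually, the retriever embeds the question and returns the chunks whose vectors are closest to it. The toy example below shows that idea with hand-written 2-dimensional vectors and cosine similarity. It is only an illustration: the vectors and the `top_k` helper are made up, and the real index may use a different distance metric and far higher-dimensional embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, indexed, k=2):
    """Return the ids of the k stored vectors most similar to query_vec."""
    ranked = sorted(
        indexed.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [chunk_id for chunk_id, _ in ranked[:k]]


# Made-up embeddings standing in for three chunks.
toy_index = {"chunk_a": [0.9, 0.1], "chunk_b": [0.0, 1.0], "chunk_c": [0.7, 0.7]}
print(top_k([1.0, 0.0], toy_index, k=2))
# prints ['chunk_a', 'chunk_c']
```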

### Creating Prompt Template

Create a prompt template that combines the retrieved chunks (`context`) with the user's question (`question`). The filled-in template is the text that gets sent to the LLM.

In [30]:
from langchain.prompts import PromptTemplate
prompt_template = """Text: {context}

Question: {question}

Answer the question based on the PDF Document provided. If the text doesn't contain the answer, reply that the answer is not available.
Do Not Hallucinate"""

prompt = PromptTemplate.from_template(prompt_template)
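
To see what the model will actually receive, the template can be filled by hand. For simple `{name}` placeholders like these, plain `str.format` produces the same text, so the sketch below needs no LangChain imports. The context string is a placeholder, not real retrieved text.

```python
prompt_template = """Text: {context}

Question: {question}

Answer the question based on the PDF Document provided. If the text doesn't contain the answer, reply that the answer is not available.
Do Not Hallucinate"""

filled = prompt_template.format(
    context="<retrieved chunk text goes here>",  # placeholder for retriever output
    question="How many AI skills are there according to LinkedIn?",
)
print(filled)
```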

### LLM Model

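The LLM generates the response to each query: the retriever fetches the relevant chunks, the prompt template combines them with the question, and the model answers from that context. We will use a Cohere model. Note that `temperature=0.9` produces fairly varied output; a lower value is usually a better fit for factual question answering.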

In [31]:
from langchain.llms import Cohere

llm=Cohere(model="command-nightly", temperature=0.9)



In [32]:
from langchain_core.runnables import RunnableParallel, RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

retrievable = RunnableParallel(
    {
        "context":retriever,
        "question":RunnablePassthrough()
    }
)
chain = retrievable | prompt | llm | StrOutputParser()
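
As written, the retriever's list of `Document` objects is passed straight into `{context}`, so the prompt most likely contains their printed representation, metadata included. One common alternative is to join only the page text before it reaches the prompt. The sketch below shows such a helper with stand-in objects so it runs without LangChain; `format_docs` is a hypothetical name.

```python
from types import SimpleNamespace


def format_docs(docs):
    """Join the text of retrieved documents into one context string."""
    return "\n\n".join(doc.page_content for doc in docs)


# Stand-ins for the Document objects a retriever returns.
fake_docs = [
    SimpleNamespace(page_content="First chunk."),
    SimpleNamespace(page_content="Second chunk."),
]
print(format_docs(fake_docs))

# In the chain above, this could be wired in as:
#   "context": retriever | format_docs
```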

In [33]:
question = "How many AI skills are there according to LinkedIn?"
output = chain.invoke(question)

In [34]:
output

'According to the text, LinkedIn identifies 121 AI skills from a total of 41,000 distinct skills.'