<a href="https://colab.research.google.com/github/sudikshyapant/DhammaBot-Project/blob/main/DhammaBot_1_0.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook builds a question-answering chain that uses the content of a specified website as its context data.

# Setting up

Install dependencies

In [None]:
!pip install langchain==0.0.189
!pip install pinecone-client
!pip install openai
!pip install tiktoken
!pip install nest_asyncio


Set up OpenAI API key

In [None]:
import os
from getpass import getpass
# Prompt for the key instead of hard-coding it; never commit secrets to a notebook
os.environ["OPENAI_API_KEY"] = getpass("OpenAI API key: ")

Set up Pinecone API keys

In [None]:
from getpass import getpass
import pinecone

# initialize pinecone; prompt for the key instead of hard-coding it
pinecone.init(
    api_key=getpass("Pinecone API key: "),  # find at app.pinecone.io
    environment="gcp-starter"  # next to api key in console
)

# Configuring DL model

# Index

**Load data from Web**

`SitemapLoader` extends `WebBaseLoader`: it loads a sitemap from a given URL, then scrapes and loads every page listed in the sitemap, returning each page as a document.

The pages are scraped concurrently. Concurrent requests are rate-limited, defaulting to 2 per second.

Link to the [documentation](https://python.langchain.com/en/latest/modules/indexes/document_loaders/examples/sitemap.html)
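
To make the effect of `filter_urls` concrete, here is a minimal standard-library sketch of the filtering step only. It parses a sitemap, collects the `<loc>` entries, and keeps the URLs that match at least one pattern. The sample XML, the `example.com` URLs, and the helper name `filter_sitemap_urls` are made up for illustration. Treating each pattern as a regular expression matched from the start of the URL is an assumption about how `SitemapLoader` behaves, and no pages are fetched here.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sitemap content; a real one would be downloaded from the site.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/blog/first-post</loc></url>
  <url><loc>https://example.com/blog/second-post</loc></url>
  <url><loc>https://example.com/contact</loc></url>
</urlset>"""

# Namespace used by standard sitemap files.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def filter_sitemap_urls(sitemap_xml, patterns):
    """Return the sitemap URLs that match at least one regex pattern."""
    root = ET.fromstring(sitemap_xml)
    urls = [loc.text.strip() for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)]
    # Assumption: a URL is kept if any pattern matches from its start.
    return [url for url in urls if any(re.match(p, url) for p in patterns)]


blog_urls = filter_sitemap_urls(SAMPLE_SITEMAP, ["https://example.com/blog/"])
print(blog_urls)
```

If a filter matches nothing, the result is an empty list, so it is worth printing the filtered URLs before running a full scrape.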

In [None]:
# fixes a bug with asyncio and jupyter
import nest_asyncio
nest_asyncio.apply()

from langchain.document_loaders.sitemap import SitemapLoader

loader = SitemapLoader(
    "https://humblewarriortherapy.com/sitemap.xml",
    filter_urls=["https://humblewarriortherapy.com/blog/"]
)
docs = loader.load()

Fetching pages: 0it [00:00, ?it/s]


**Split the text from docs into smaller chunks**

There are many ways to split the text. We use the text splitter that is recommended for generic text. For more ways to split the text, check the [documentation](https://python.langchain.com/en/latest/modules/indexes/text_splitters.html)

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

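text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1200,      # maximum number of characters per chunk
    chunk_overlap=200,    # maximum characters shared by consecutive chunks
    length_function=len,  # measure chunk length in characters
)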

docs_chunks = text_splitter.split_documents(docs)
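
To see what `chunk_size` and `chunk_overlap` mean, here is a simplified sketch that cuts a string into fixed-size windows that overlap. This is not how `RecursiveCharacterTextSplitter` works internally (it tries to break on separators such as paragraphs and sentences first). The helper name `split_with_overlap` and the sample text are hypothetical and only illustrate the two parameters.

```python
def split_with_overlap(text, chunk_size=1200, chunk_overlap=200):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


# Hypothetical 3000-character text made of repeating digits.
sample_text = "".join(str(i % 10) for i in range(3000))
sample_chunks = split_with_overlap(sample_text)
print([len(chunk) for chunk in sample_chunks])
```

The overlap means a sentence that falls on a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.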

Create embeddings

In [None]:
from langchain.embeddings.openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()

**Creating a vectorstore**

A vectorstore stores Documents and associated embeddings, and provides fast ways to look up relevant Documents by embeddings.

There are many ways to create a vectorstore. We are going to use Pinecone. For other types of vectorstores visit the [documentation](https://python.langchain.com/en/latest/modules/indexes/vectorstores.html)

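First you need to go to [Pinecone](https://www.pinecone.io/) and create an index there. The index dimension must match the embedding size (1536 for OpenAI's `text-embedding-ada-002`, the default model used by `OpenAIEmbeddings`). Then set `index_name` below to the name of that index.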

In [None]:
from langchain.vectorstores import Pinecone

index_name = "ind"

# embed the chunks and upload them to the existing Pinecone index
docsearch = Pinecone.from_documents(docs_chunks, embeddings, index_name=index_name)

# if the index is already populated, you can load it like this instead
# docsearch = Pinecone.from_existing_index(index_name, embeddings)


If you cannot create a Pinecone account, you can use Chroma instead. It creates a transient in-memory vectorstore. For further information check the [documentation](https://python.langchain.com/en/latest/modules/indexes/vectorstores/examples/chroma.html).

The following code block uses Chroma to create the vectorstore. Uncomment it and run it in place of the Pinecone cell if you don't have access to Pinecone.

In [None]:
# !pip install chromadb
# from langchain.vectorstores import Chroma
# docsearch = Chroma.from_documents(docs_chunks, embeddings)

The vectorstore is ready. Let's query it with a similarity search.

In [None]:
# query = "How can Buddhism help with distress?"
# docs = docsearch.similarity_search(query)
# print(docs[0])
# print(len(docs))
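
Conceptually, a similarity search embeds the query and returns the stored chunks whose vectors are closest to it. The sketch below shows that idea with cosine similarity over a tiny in-memory list. The three-dimensional vectors, the texts, and the function names are invented for illustration. Real embeddings come from `OpenAIEmbeddings`, and the metric Pinecone actually uses depends on how the index was configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def toy_similarity_search(query_vector, store, k=2):
    """Return the texts of the k stored items most similar to the query vector."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Hypothetical store of (text, embedding) pairs with made-up 3-d embeddings.
toy_store = [
    ("Meditation and breathing exercises", [0.9, 0.1, 0.0]),
    ("Booking and office hours", [0.0, 0.2, 0.9]),
    ("Mindfulness for everyday stress", [0.8, 0.3, 0.1]),
]
toy_query = [1.0, 0.2, 0.0]  # stands in for the embedded question

top_matches = toy_similarity_search(toy_query, toy_store, k=2)
print(top_matches)
```

A hosted vectorstore does the same ranking with an approximate index so that it stays fast over many vectors.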

# Making a question answering chain
The question answering chain will enable us to generate the answer based on the relevant context chunks. See the [documentation](https://python.langchain.com/en/latest/modules/chains/index_examples/qa_with_sources.html) for more explanation.

Additionally, we can return the source documents used to answer the question by specifying an optional parameter when constructing the chain. For more information visit the [documentation](https://python.langchain.com/en/latest/modules/chains/index_examples/vector_db_qa.html#return-source-documents)

In [None]:
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI
llm = OpenAI()

qa_with_sources = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=docsearch.as_retriever(),
    return_source_documents=True,
)

query = "How can Buddhism help me deal with stress?"
result = qa_with_sources({"query": query})
result["result"]

' Buddhism offers a number of techniques that can help bring awareness to and alleviate stress. These include meditation, mindfulness, and breathing exercises. Buddhist teachings also emphasize the importance of compassion and empathy. By cultivating these qualities, it can help us to better understand our own feelings and experience of stress, as well as those of others, so that we can respond to situations in a more constructive way.'
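
The `chain_type="stuff"` argument above means that all retrieved chunks are placed ("stuffed") into a single prompt together with the question. Here is a rough sketch of that idea. The helper name `build_stuff_prompt` and the prompt wording are illustrative assumptions, not the library's exact template.

```python
def build_stuff_prompt(question, context_chunks):
    """Join all retrieved chunks into one context block and append the question."""
    context = "\n\n".join(context_chunks)
    return (
        "Use the following pieces of context to answer the question at the end. "
        "If you don't know the answer, say that you don't know.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Helpful Answer:"
    )


# Hypothetical chunks standing in for what the retriever would return.
example_chunks = [
    "Mindfulness practice can reduce the physical symptoms of stress.",
    "Breathing exercises help bring attention back to the present moment.",
]
example_prompt = build_stuff_prompt("How can I deal with stress?", example_chunks)
print(example_prompt)
```

Because everything goes into one prompt, this approach only works while the retrieved chunks fit in the model's context window, which is one reason to keep the chunk size moderate.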

Output source documents that were found for the query

In [None]:
# result["source_documents"]
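
The raw list of source documents can be long, so it is often handier to list only the pages they came from. The sketch below runs offline using a stand-in `FakeDocument` class and a made-up result dictionary. It assumes each document's metadata has a `"source"` key holding the page URL, which you should verify on your own documents. The `example.com` URLs and the helper `unique_sources` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Stand-in for a LangChain Document, used only so this sketch runs offline."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def unique_sources(source_documents):
    """Return the distinct source URLs, in the order they first appear."""
    seen = []
    for doc in source_documents:
        # Assumption: the loader stored the page URL under the "source" key.
        url = doc.metadata.get("source", "unknown")
        if url not in seen:
            seen.append(url)
    return seen


# Made-up result in the same shape as the chain output: an answer plus documents.
fake_result = {
    "result": "An example answer.",
    "source_documents": [
        FakeDocument("chunk one", {"source": "https://example.com/blog/first-post"}),
        FakeDocument("chunk two", {"source": "https://example.com/blog/first-post"}),
        FakeDocument("chunk three", {"source": "https://example.com/blog/second-post"}),
    ],
}

for url in unique_sources(fake_result["source_documents"]):
    print(url)
```

With the real chain, the same helper can be called as `unique_sources(result["source_documents"])`.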