# Document-Based Question Answering System Using LangChain

## Installation

In [1]:
!pip install langchain InstructorEmbedding --upgrade
!pip install unstructured
!pip install unstructured[local-inference]
!apt-get install poppler-utils -y

Collecting langchain
  Downloading langchain-0.1.13-py3-none-any.whl.metadata (13 kB)
Collecting InstructorEmbedding
  Downloading InstructorEmbedding-1.0.1-py2.py3-none-any.whl.metadata (20 kB)
Collecting langchain-community<0.1,>=0.0.29 (from langchain)
  Downloading langchain_community-0.0.29-py3-none-any.whl.metadata (8.3 kB)
Collecting langchain-core<0.2.0,>=0.1.33 (from langchain)
  Downloading langchain_core-0.1.36-py3-none-any.whl.metadata (6.0 kB)
Collecting langchain-text-splitters<0.1,>=0.0.1 (from langchain)
  Downloading langchain_text_splitters-0.0.1-py3-none-any.whl.metadata (2.0 kB)
Collecting langsmith<0.2.0,>=0.1.17 (from langchain)
  Downloading langsmith-0.1.38-py3-none-any.whl.metadata (13 kB)
Collecting packaging<24.0,>=23.2 (from langchain-core<0.2.0,>=0.1.33->langchain)
  Downloading packaging-23.2-py3-none-any.whl.metadata (3.2 kB)
Collecting orjson<4.0.0,>=3.9.14 (from langsmith<0.2.0,>=0.1.17->langchain)
  Downloading orjson-3.10.0-cp310-cp310-manylinux_2_17_

In [2]:
!pip install --upgrade --quiet  langchain-pinecone
!pip install sentence-transformers==2.2.2

Collecting sentence-transformers==2.2.2
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: sentence-transformers
  Building wheel for sentence-transformers (setup.py) ... done
  Created wheel for sentence-transformers: filename=sentence_transformers-2.2.2-py3-none-any.whl size=125923 sha256=9694828d637110b3825aba3c5136b01bbab183786b17f2dc95700d73e9d9c7db
  Stored in directory: /root/.cache/pip/wheels/62/f2/10/1e606fd5f02395388f74e7462910fe851042f97238cbbd902f
Successfully built sentence-transformers
Installing collected packages: sentence-transformers
Successfully installed sentence-transformers-2.2.2


## Imports

In [3]:
import os

# Loaders and embeddings live in langchain_community in LangChain 0.1.x;
# the old langchain.* import paths still work but emit deprecation warnings.
from langchain_community.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_pinecone import PineconeVectorStore
from langchain.chains import RetrievalQA
from InstructorEmbedding import INSTRUCTOR
from langchain_community.embeddings import HuggingFaceInstructEmbeddings  # open-source alternative to OpenAI embeddings

## Loading the documents
First, we'll use LangChain's DirectoryLoader to load every document in a directory. In this example, the documents are stored in the Kaggle dataset directory '/kaggle/input/data-langchain-test'.

In [71]:
directory = '/kaggle/input/data-langchain-test'

def load_docs(directory):
    loader = DirectoryLoader(directory)
    documents = loader.load()
    return documents

documents = load_docs(directory)
print(documents)
print(len(documents))

[Document(page_content='My country Nepal is situated between two countries, India and China. Although it is sandwiched between international powers, conflicts have not happened between any, and peace remains. Nepal is a country of various castes and cultures. In other words, it is like a beautiful garden of flowers with people of different ethnicities and backgrounds.\n\nMy country Nepal is not only unique for its flag but also its geographical terrain, the variety of castes and cultures you can find, and the rich history of it. It is the land of various great places and important figures that people know far and wide. The temperature here spans from cool to hot and is a heaven for residing in.\n\nThere are about 126 castes in Nepal each with its own rich history and culture which makes Nepal a rich place for culture. Some dating back to the millenniums. Not only that, our country is very rich in its geographical terrain. From the lowest point of just 70 Meters from sea level to the wo

## Splitting Documents
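To make the documents easier to process, we now divide them into smaller chunks. We use LangChain's RecursiveCharacterTextSplitter, which by default tries to split on the separators ["\n\n", "\n", " ", ""], in that order. Here chunk_size caps each chunk at 1000 characters, and chunk_overlap lets consecutive chunks share up to 20 characters.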

In [72]:
def split_docs(documents, chunk_size=1000, chunk_overlap=20):
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    docs = text_splitter.split_documents(documents)
    return docs

docs = split_docs(documents)
print(docs)
print(len(docs))

[Document(page_content='My country Nepal is situated between two countries, India and China. Although it is sandwiched between international powers, conflicts have not happened between any, and peace remains. Nepal is a country of various castes and cultures. In other words, it is like a beautiful garden of flowers with people of different ethnicities and backgrounds.\n\nMy country Nepal is not only unique for its flag but also its geographical terrain, the variety of castes and cultures you can find, and the rich history of it. It is the land of various great places and important figures that people know far and wide. The temperature here spans from cool to hot and is a heaven for residing in.', metadata={'source': '/kaggle/input/data-langchain-test/nepal_essay.pdf'}), Document(page_content='There are about 126 castes in Nepal each with its own rich history and culture which makes Nepal a rich place for culture. Some dating back to the millenniums. Not only that, our country is very r

In [73]:
print(docs[0].page_content)

My country Nepal is situated between two countries, India and China. Although it is sandwiched between international powers, conflicts have not happened between any, and peace remains. Nepal is a country of various castes and cultures. In other words, it is like a beautiful garden of flowers with people of different ethnicities and backgrounds.

My country Nepal is not only unique for its flag but also its geographical terrain, the variety of castes and cultures you can find, and the rich history of it. It is the land of various great places and important figures that people know far and wide. The temperature here spans from cool to hot and is a heaven for residing in.
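
The idea behind recursive splitting can be illustrated with a small pure-Python sketch. This is a simplified illustration, not LangChain's actual implementation: the function name recursive_split is hypothetical, and chunk overlap is left out to keep the logic short. It tries the coarsest separator first and only falls back to finer separators for pieces that are still too long.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Simplified recursive character splitting (no overlap handling)."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    if sep not in text:
        # This separator does not occur, so try the next, finer one.
        return recursive_split(text, chunk_size, finer)

    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = current + sep + piece if current else piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces into one chunk
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: split it with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


print(recursive_split("aaaa bbbb\n\ncccc dddd eeee", chunk_size=10))
# -> ['aaaa bbbb', 'cccc dddd', 'eeee']
```

The first paragraph fits in one chunk, so it is kept whole; the second is too long and is split on spaces instead. This is why the chunks printed above tend to end at paragraph boundaries.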


## Creating Embeddings from documents
Once the documents are split, we embed the chunks with the Instructor model hkunlp/instructor-xl, loaded through LangChain's HuggingFaceInstructEmbeddings. First, we install the tiktoken library.

In [74]:
!pip install tiktoken -q


In [75]:
instructor_embeddings = HuggingFaceInstructEmbeddings(model_name="hkunlp/instructor-xl",
                                                      model_kwargs={"device": "cuda"})
print(instructor_embeddings)

load INSTRUCTOR_Transformer
max_seq_length  512
client=INSTRUCTOR(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: T5EncoderModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False})
  (2): Dense({'in_features': 1024, 'out_features': 768, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity'})
  (3): Normalize()
) model_name='hkunlp/instructor-xl' cache_folder=None model_kwargs={'device': 'cuda'} encode_kwargs={} embed_instruction='Represent the document for retrieval: ' query_instruction='Represent the question for retrieving supporting documents: '


In [76]:
model = INSTRUCTOR('hkunlp/instructor-xl')
sentence = "The process of photosynthesis converts light energy into chemical energy stored in glucose molecules, crucial for sustaining life on Earth. Diversification of investment portfolios mitigates risk by allocating resources across various assets, aiming to optimize returns while minimizing potential losses"
instruction = "Represent the Science sentence:"
embeddings = model.encode([[instruction,sentence]])
print(embeddings)

load INSTRUCTOR_Transformer
max_seq_length  512
[[ 1.86558086e-02 -1.30594689e-02  5.06217182e-02 -7.26895332e-02
  -2.51979083e-02 -2.11473238e-02 -5.37240282e-02 -1.39331026e-03
   1.13040032e-02 -3.11218407e-02  2.38871220e-02  2.89297402e-02
  -4.55654301e-02 -7.98486173e-02 -1.07878996e-02 -2.54836995e-02
  -4.00265772e-03 -3.10147628e-02 -2.79947594e-02 -2.89855897e-03
   2.41845977e-02 -4.38629044e-03 -5.20347841e-02  1.86297409e-02
  -4.47434373e-02 -6.59082681e-02  1.27046893e-03  2.23206263e-02
  -8.56544077e-03  9.86170676e-03  4.64066677e-02 -2.16308963e-02
   1.55185349e-02 -5.67714609e-02  5.90912290e-02 -2.58733761e-02
  -1.75394882e-02 -1.58283319e-02 -2.66248677e-02 -2.30054613e-02
  -1.00444928e-02  4.47555669e-02  5.96100464e-02 -2.81654135e-03
  -2.65924353e-03 -1.27874210e-03  4.16583233e-02  7.87444972e-03
   3.48810367e-02  1.06809149e-02 -7.62174046e-03 -9.59772058e-03
  -1.89019423e-02 -3.30612361e-02 -1.20124193e-02 -2.48775315e-02
  -3.69942375e-02  2.3463461

## Vector Search with Pinecone
Next, we will use Pinecone to index the chunk embeddings for vector search.

In [77]:
import os
os.environ['PINECONE_API_KEY'] = 'YOUR_PINECONE_API_KEY'  # placeholder: use your own key and never publish it
os.environ['PINECONE_INDEX_NAME'] = 'langchain-qna'

In [78]:
index_name = "langchain-qna"
doc_search = PineconeVectorStore.from_documents(docs, instructor_embeddings, index_name=index_name)

The PineconeVectorStore.from_documents() method processes the input documents, generates embeddings using the provided HuggingFaceInstructEmbeddings instance, and upserts the vectors into the Pinecone index with the specified name. The index must already exist in your Pinecone project, with a dimension that matches the embedding size (768 for the model printed above). The resulting vector store can perform similarity searches and retrieve relevant documents based on user queries.
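
To make "similarity search" concrete, here is a minimal pure-Python sketch of top-k retrieval by cosine similarity. It assumes the index uses the cosine metric, which depends on how the Pinecone index was created. The three-dimensional vectors and the chunk names are made up for illustration; real Instructor embeddings have 768 dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return the ids of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), doc_id)
              for doc_id, vec in doc_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]


# Hypothetical toy "index": chunk id -> embedding vector.
toy_index = {
    "nepal_chunk": [0.9, 0.1, 0.0],
    "musk_chunk": [0.0, 0.2, 0.9],
    "himalaya_chunk": [0.8, 0.3, 0.1],
}
query_vec = [1.0, 0.2, 0.0]

print(top_k(query_vec, toy_index, k=2))
# -> ['nepal_chunk', 'himalaya_chunk']
```

The k argument plays the same role as the "k" entry in search_kwargs: it sets how many chunks are returned for each query.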

In [91]:
retriever = doc_search.as_retriever(search_type="similarity", search_kwargs={"k":2})

In [92]:
print(retriever.search_type)

similarity


In [93]:
print(retriever.search_kwargs)

{'k': 2}


In [15]:
!pip install transformers
!pip install -qU huggingface_hub

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)



In [109]:
from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "openai-community/gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=50)
hf = HuggingFacePipeline(pipeline=pipe)

In [110]:
# create the chain to answer questions
qa_chain_instrucEmbed = RetrievalQA.from_chain_type(llm=hf,
                                  chain_type="stuff",
                                  retriever=retriever,
                                  return_source_documents=True)
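
The "stuff" chain type places all retrieved chunks into a single prompt. The sketch below imitates that step in plain Python. The instruction sentence is the one that appears in the chain outputs in this notebook; the "Question:" and "Helpful Answer:" tail and the helper name build_stuff_prompt are assumptions made for illustration, not a copy of LangChain's code.

```python
# Assumed prompt layout: instruction, then the chunks, then the question.
STUFF_TEMPLATE = (
    "Use the following pieces of context to answer the question at the end. "
    "If you don't know the answer, just say that you don't know, "
    "don't try to make up an answer.\n\n"
    "{context}\n\n"
    "Question: {question}\n"
    "Helpful Answer:"
)


def build_stuff_prompt(chunks, question):
    """Join the retrieved chunks and insert them into one prompt string."""
    context = "\n\n".join(chunks)
    return STUFF_TEMPLATE.format(context=context, question=question)


prompt = build_stuff_prompt(
    ["Nepal lies between India and China.", "There are about 126 castes in Nepal."],
    "Why nepal is beautiful?",
)
print(prompt)
```

Because every chunk goes into one prompt, the combined length of k chunks of up to 1000 characters each has to fit in the language model's context window.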

In [111]:
import textwrap

def wrap_text_preserve_newlines(text, width=110):
    # Split the input text into lines based on newline characters
    lines = text.split('\n')

    # Wrap each line individually
    wrapped_lines = [textwrap.fill(line, width=width) for line in lines]

    # Join the wrapped lines back together using newline characters
    wrapped_text = '\n'.join(wrapped_lines)

    return wrapped_text

def process_llm_response(llm_response):
    print(wrap_text_preserve_newlines(llm_response['result']))
    print('\nSources:')
    for source in llm_response["source_documents"]:
        print(source.metadata['source'])

In [116]:
query = 'Why nepal is beautiful?'
print(query)

Why nepal is beautiful?


In [117]:
llm_response = qa_chain_instrucEmbed({"query": query})
print(llm_response)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


{'query': 'Why nepal is beautiful?', 'result': "Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.\n\nHimalayan regions have high and mighty Himalayas that are breathtaking to look at. Out of the world’s top 10 highest peaks, 8 of them fall in my country. It is already a great pride to have come from such a country. The diverse flora and fauna, beautiful landscapes, lush and green jungles, historical and religious places in Nepal are enough to gather the attention of foreigners and locals too. People from all over the world pay thousands of dollars just to see our country’s snow-capped mountains, rivers, cliffs, waterfalls, other beautiful landscapes, the rich flora and fauna, and sites of great religious and historical importance. It just doesn’t end there.\n\nThere are about 126 castes in Nepal each with its own rich history and culture which makes Nepal a rich place for cu

In [118]:
process_llm_response(llm_response)

Use the following pieces of context to answer the question at the end. If you don't know the answer, just say
that you don't know, don't try to make up an answer.

Himalayan regions have high and mighty Himalayas that are breathtaking to look at. Out of the world’s top 10
highest peaks, 8 of them fall in my country. It is already a great pride to have come from such a country. The
diverse flora and fauna, beautiful landscapes, lush and green jungles, historical and religious places in
Nepal are enough to gather the attention of foreigners and locals too. People from all over the world pay
thousands of dollars just to see our country’s snow-capped mountains, rivers, cliffs, waterfalls, other
beautiful landscapes, the rich flora and fauna, and sites of great religious and historical importance. It
just doesn’t end there.

There are about 126 castes in Nepal each with its own rich history and culture which makes Nepal a rich place
for culture. Some dating back to the millenniums. Not only

In [119]:
query = 'what are the companies of Elon Musk?'
llm_response = qa_chain_instrucEmbed({"query": query})
process_llm_response(llm_response)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Use the following pieces of context to answer the question at the end. If you don't know the answer, just say
that you don't know, don't try to make up an answer.

In 2015, Musk co-founded OpenAI, a nonprofit research company that promotes friendly artificial intelligence.
In 2016, he co-founded Neuralink, a neurotechnology company focused on developing brain–computer interfaces,
and founded The Boring Company, a tunnel construction company.

Elon Musk is the co-founder of Zip2, a web software firm that he started with his brother Kimbal Musk. The
firm created and marketed a newspaper industry Internet “City Guide.” He obtained contracts with The New York
Times and The Chicago Tribune and persuaded the board of directors to abandon its plans for a merger with
CitySearch.

In 2003, Elon Musk formed Tesla Motors, an electric car company. He took on several investors including Martin
Eberhard, Marc Tarpenning, Ian Wright, and JB Straubel. After a successful

performance review, Elon becam
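
Both results above begin with the prompt and the retrieved context, so the model's own continuation is buried at the end of the text. A small post-processing helper can isolate it. This is a sketch under an assumption: it expects the prompt to end with the marker "Helpful Answer:", and the function name extract_answer is hypothetical.

```python
def extract_answer(generated_text, marker="Helpful Answer:"):
    """Return the text after the first marker, or the whole text if absent."""
    _, found, tail = generated_text.partition(marker)
    return tail.strip() if found else generated_text.strip()


echoed = (
    "Use the following pieces of context to answer the question at the end.\n\n"
    "Some retrieved context.\n\n"
    "Question: Why nepal is beautiful?\n"
    "Helpful Answer: Because of its mountains."
)
print(extract_answer(echoed))
# -> Because of its mountains.
```

It could be applied to llm_response['result'] before printing, so that only the generated answer is shown.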

## Conclusion
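In conclusion, the system works end to end, but its answers could be much better. We are using GPT-2, which is far less capable than more recent language models; in the outputs above, the visible part of each result is the prompt and the retrieved context echoed back. In addition, the model we use to create the embedding vectors is not especially strong, so a better embedding model could improve retrieval.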