<a href="https://colab.research.google.com/github/rolosapien206/hello-world/blob/master/langchain/Chat_with_Any_Documents_Own_ChatGPT_with_LangChain.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Chat with any documents using LangChain

#### [YouTube video covering this notebook](https://youtu.be/TeDgIDqQmzs)

[OpenAI token limit](https://platform.openai.com/docs/models/gpt-4)  
OpenAI's `text-embedding-ada-002` embedding model returns 1536-dimensional vectors.  
After the documents are turned into embeddings, the vectors are stored in a vector store such as Pinecone, Chroma, or FAISS.  
When a query comes in, it is embedded the same way and the most similar chunks are retrieved (semantic search).  
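
To make the semantic search idea concrete, here is a minimal sketch that ranks chunks by cosine similarity. The 3-dimensional vectors and chunk names are made up for illustration, and real vector stores may use a different metric or an approximate index.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 3-dimensional "embeddings"; real OpenAI vectors have 1536 dimensions.
chunks = {
    "authors": [0.9, 0.1, 0.0],
    "training cost": [0.1, 0.9, 0.1],
    "license": [0.0, 0.2, 0.9],
}
query_vector = [0.8, 0.2, 0.1]

# Rank chunk names from most to least similar to the query.
ranked = sorted(
    chunks,
    key=lambda name: cosine_similarity(chunks[name], query_vector),
    reverse=True,
)
print(ranked)
```

The chunk whose vector points in nearly the same direction as the query vector comes first.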


## Setup

In [None]:
%%capture
!pip install openai langchain  tiktoken pypdf unstructured[local-inference] gradio chromadb

In [None]:
%reload_ext watermark
%watermark -a "Sudarshan Koirala" -vmp langchain,openai,chromadb

Author: Sudarshan Koirala

Python implementation: CPython
Python version       : 3.10.11
IPython version      : 7.34.0

langchain: 0.0.162
openai   : 0.27.6
chromadb : 0.3.22

Compiler    : GCC 9.4.0
OS          : Linux
Release     : 5.10.147+
Machine     : x86_64
Processor   : x86_64
CPU cores   : 2
Architecture: 64bit



In [None]:
import os
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Pinecone, Chroma
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.chains import ConversationalRetrievalChain
from langchain.chat_models import ChatOpenAI

In [None]:
os.environ['OPENAI_API_KEY'] = "OPENAI_API_KEY"  # placeholder: replace with your own key
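
If you would rather not paste the key into a cell, one option is to prompt for it with `getpass`, the same way the Pinecone cells later in this notebook do. A minimal sketch, where `load_secret` is a hypothetical helper and not part of any library:

```python
import os
import getpass

def load_secret(name, prompt=getpass.getpass):
    """Return environment variable `name`, prompting only if it is unset or empty."""
    value = os.environ.get(name)
    if not value:
        value = prompt(f"{name}: ")
        os.environ[name] = value  # store it so libraries reading the env can find it
    return value

# Usage in the notebook (prompts once, without echoing the key):
# load_secret("OPENAI_API_KEY")
```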

In [None]:
#llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

[LangChain Document Loader](https://python.langchain.com/en/latest/modules/indexes/document_loaders.html)

In [None]:
from langchain.document_loaders import DirectoryLoader

pdf_loader = DirectoryLoader('/content/Documents/', glob="**/*.pdf")
readme_loader = DirectoryLoader('/content/Documents/', glob="**/*.md")
txt_loader = DirectoryLoader('/content/Documents/', glob="**/*.txt")
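
The `**/*.pdf` patterns are recursive globs. Assuming `DirectoryLoader` resolves them with pathlib-style matching, this small sketch shows which files such a pattern picks up (the file names are made up):

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as root:
    base = Path(root)
    (base / "papers").mkdir()
    (base / "top.pdf").touch()
    (base / "papers" / "nested.pdf").touch()
    (base / "notes.txt").touch()

    # "**" matches zero or more directory levels, so both PDFs are found.
    pdf_names = sorted(p.name for p in base.glob("**/*.pdf"))

print(pdf_names)
```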

In [None]:
# Collect all the loaders
loaders = [pdf_loader, readme_loader, txt_loader]

# Load the documents from every loader into one list
documents = []
for loader in loaders:
    documents.extend(loader.load())



In [None]:
print(f'You have {len(documents)} document(s) in your data')
print(f'There are {len(documents[0].page_content)} characters in your document')

You have 3 document(s) in your data
There are 10701 characters in your document


In [None]:
documents[0]

Document(page_content='GPT4All\n\n\n\nJ: An Apache\n\n\n\n2 Licensed Assistant\n\n\n\nStyle Chatbot\n\nYuvanesh Anand\n\nyuvanesh@nomic.ai\n\nZach Nussbaum\n\nzach@nomic.ai\n\nBrandon Duderstadt\n\nbrandon@nomic.ai\n\nBenjamin M. Schmidt\n\nben@nomic.ai\n\nAdam Treat\n\ntreat.adam@gmail.com\n\nAndriy Mulyar\n\nandriy@nomic.ai\n\nAbstract\n\nGPT4All-J is an Apache-2 licensed chatbot trained over a massive curated corpus of as- sistant interactions including word problems, multi-turn dialogue, code, poems, songs, and stories. It builds on the March 2023 GPT4All release by training on a significantly larger corpus, by deriving its weights from the Apache-licensed GPT-J model rather than the GPL-licensed of LLaMA, and by demonstrat- ing improved performance on creative tasks such as writing stories, poems, songs and plays. We openly release the training data, data curation procedure, training code, and fi- nal model weights to promote open research and reproducibility. Additionally, we rel

## Split the Text from the documents

In [None]:
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=40)  # a small overlap between chunks seems to work better
documents = text_splitter.split_documents(documents)
print(len(documents))

23
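
To see what `chunk_size` and `chunk_overlap` mean, here is a simplified sliding-window sketch. It is not LangChain's algorithm (`CharacterTextSplitter` splits on a separator first and then merges pieces); `split_with_overlap` is a hypothetical helper that only illustrates the two parameters.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

chunks = split_with_overlap("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=4)
print(chunks)
```

Each chunk starts with the last 4 characters of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk when it is shorter than the overlap.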


In [None]:
documents[0]

Document(page_content='GPT4All\n\n\n\nJ: An Apache\n\n\n\n2 Licensed Assistant\n\n\n\nStyle Chatbot\n\nYuvanesh Anand\n\nyuvanesh@nomic.ai\n\nZach Nussbaum\n\nzach@nomic.ai\n\nBrandon Duderstadt\n\nbrandon@nomic.ai\n\nBenjamin M. Schmidt\n\nben@nomic.ai\n\nAdam Treat\n\ntreat.adam@gmail.com\n\nAndriy Mulyar\n\nandriy@nomic.ai\n\nAbstract', metadata={'source': '/content/Documents/gpt4all.pdf'})

In [None]:
documents[1]

Document(page_content='GPT4All-J is an Apache-2 licensed chatbot trained over a massive curated corpus of as- sistant interactions including word problems, multi-turn dialogue, code, poems, songs, and stories. It builds on the March 2023 GPT4All release by training on a significantly larger corpus, by deriving its weights from the Apache-licensed GPT-J model rather than the GPL-licensed of LLaMA, and by demonstrat- ing improved performance on creative tasks such as writing stories, poems, songs and plays. We openly release the training data, data curation procedure, training code, and fi- nal model weights to promote open research and reproducibility. Additionally, we release Python bindings and a Chat UI to a quantized 4-bit version of GPT4All-J allowing virtually anyone to run the model on CPU.\n\n1 Data Collection and Curation\n\nWe gather a diverse sample of questions/prompts by leveraging several publicly available datasets and curating our own set of prompts:\n\nSeveral\n\nsubsam

## Embeddings and storing them in a vectorstore

In [None]:
embeddings = OpenAIEmbeddings()

### Using Chroma for storing vectors

In [None]:
from langchain.vectorstores import Chroma

In [None]:
vectorstore = Chroma.from_documents(documents, embeddings)



### Using Pinecone for storing vectors

In [None]:
%%capture
!pip install pinecone-client

- [Pinecone LangChain doc](https://python.langchain.com/en/latest/modules/indexes/vectorstores/examples/pinecone.html?highlight=pinecone#pinecone)
- What is a [vector database](https://www.pinecone.io/learn/vector-database/)?
- Get your Pinecone API key and environment at https://app.pinecone.io/

In [None]:
import os
import getpass
PINECONE_API_KEY = getpass.getpass('Pinecone API Key:')

Pinecone API Key:··········


In [None]:
PINECONE_ENV = getpass.getpass('Pinecone Environment:')

Pinecone Environment:··········


In [None]:
import pinecone

# initialize pinecone
pinecone.init(
    api_key=PINECONE_API_KEY,  # find at app.pinecone.io
    environment=PINECONE_ENV  # next to api key in console
)

index_name = "langchain-demo"

vectorstore = Pinecone.from_documents(documents, embeddings, index_name=index_name)
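
The cell above assumes the `langchain-demo` index already exists. A minimal sketch for creating it when it is missing, assuming the pinecone-client 2.x module-level functions `list_indexes` and `create_index`; `ensure_index` is a hypothetical helper, and the dimension of 1536 matches the OpenAI embedding size mentioned at the top of this notebook.

```python
def ensure_index(client, name, dimension=1536, metric="cosine"):
    """Create the index only when it is missing; return True if it was created."""
    if name in client.list_indexes():
        return False
    client.create_index(name, dimension=dimension, metric=metric)
    return True

# In the notebook, after pinecone.init(...), this would be called as:
# ensure_index(pinecone, "langchain-demo")
```

Passing the client in as an argument keeps the helper easy to try out without a Pinecone account.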

In [None]:
# if you already have an index, you can load it like this
import pinecone
from tqdm.autonotebook import tqdm

# initialize pinecone
pinecone.init(
    api_key=PINECONE_API_KEY,  # find at app.pinecone.io
    environment=PINECONE_ENV  # next to api key in console
)

index_name = "langchain-demo"
vectorstore = Pinecone.from_existing_index(index_name, embeddings)

#### The splitter produced 23 chunks, so 23 vectors are created in Pinecone.

In [None]:
query = "Who are the authors of gpt4all paper ?"
docs = vectorstore.similarity_search(query)

In [None]:
len(docs)  # similarity_search returns the 4 most similar chunks by default (k=4)

4
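
The number of results is just a top-k cut over similarity scores. A minimal sketch with made-up scores (`top_k` is a hypothetical helper, not the vectorstore's implementation):

```python
import heapq

def top_k(scored_chunks, k=4):
    """Return the k (score, text) pairs with the highest similarity score."""
    return heapq.nlargest(k, scored_chunks, key=lambda pair: pair[0])

# Made-up similarity scores for six chunks.
scored = [
    (0.61, "chunk A"), (0.83, "chunk B"), (0.42, "chunk C"),
    (0.79, "chunk D"), (0.55, "chunk E"), (0.90, "chunk F"),
]

best = top_k(scored)
print([text for _, text in best])
```

A smaller `k` means less context is handed to the LLM, which is what `search_kwargs={"k": 2}` does in the retriever further down.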

In [None]:
print(docs[0].page_content)

4 Use Considerations

The authors release data and training details in hopes that it will accelerate open LLM research, particularly in the domains of fairness, align- ment, interpretability, and transparency. GPT4All- J model weights and quantized versions are re- leased under an Apache 2 license and are freely available for use and distribution. Please note that the less restrictive license does not apply to the original GPT4All model that is based on LLaMA, which has a non-commercial GPL license. The assistant data was gathered from OpenAI’s GPT- 3.5-Turbo, whose terms of use prohibit developing models that compete commercially with OpenAI.

References

Stella Biderman, Hailey Schoelkopf, Quentin An- thony, Herbie Bradley, Kyle O’Brien, Eric Hal- lahan, Mohammad Aflah Khan, Shivanshu Puro- hit, USVSN Sai Prashanth, Edward Raff, Aviya Skowron, Lintang Sutawika, and Oskar van der Wal. 2023. Pythia: A suite for analyzing large language models across training and scaling.


In [None]:
print(docs[1].page_content)

Building on the GPT4All dataset, we curated the GPT4All-J dataset by augmenting the origi- nal 400k GPT4All examples with new samples encompassing additional multi-turn QA samples and creative writing such as poetry, rap, and short stories. We designed prompt templates to create different scenarios for creative writing. The cre- ative prompt template was inspired by Mad Libs style variations of ‘Write a [creative story type] about [NOUN] in the style of [PERSON]‘. In ear- lier versions of GPT4All, we found that rather than writing actual creative content, the model would discuss how it would go about writing the content. Training on this new dataset allows GPT4All-J to write poems, songs, and plays with increased com- petence.


## Now the langchain part (Chaining with Chat History) --> With One line of Code (Fantastic)
- There are many chains, but here we use `ConversationalRetrievalChain` ([docs](https://python.langchain.com/en/latest/modules/chains/index_examples/chat_vector_db.html))

In [None]:
from langchain.llms import OpenAI

In [None]:
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k":2})
qa = ConversationalRetrievalChain.from_llm(OpenAI(temperature=0), retriever)

In [None]:
chat_history = []
query = "How much is spent for training the gpt4all model?"
result = qa({"question": query, "chat_history": chat_history})
result["answer"]

' $200'

In [None]:
chat_history.append((query, result["answer"]))
chat_history

[('How much is spent for training the gpt4all model?', ' $200')]

In [None]:
query = "What is this number multiplied by 2?"
result = qa({"question": query, "chat_history": chat_history})
result["answer"]

' $1600'
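
Follow-up questions like "this number" only make sense together with the earlier turns. `ConversationalRetrievalChain` handles this by first asking the LLM to condense the chat history and the new question into a standalone question, then retrieving with that. The sketch below shows the idea only; the prompt wording and both helper names are hypothetical, not LangChain's actual template.

```python
def format_chat_history(chat_history):
    """Render (question, answer) tuples as a plain-text transcript."""
    lines = []
    for question, answer in chat_history:
        lines.append(f"Human: {question}")
        lines.append(f"Assistant: {answer.strip()}")
    return "\n".join(lines)

def build_condense_prompt(chat_history, follow_up):
    """Hypothetical prompt asking an LLM to rewrite a follow-up as a standalone question."""
    return (
        "Given the conversation below, rephrase the follow-up as a standalone question.\n\n"
        f"{format_chat_history(chat_history)}\n"
        f"Follow-up: {follow_up}\n"
        "Standalone question:"
    )

history = [("How much is spent for training the gpt4all model?", " $200")]
print(build_condense_prompt(history, "What is this number multiplied by 2?"))
```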

## Create a chatbot with memory with simple widgets

In [None]:
from IPython.display import display
import ipywidgets as widgets

In [None]:
chat_history = []

def on_submit(_):
    # Read the question and clear the input box for the next one
    query = input_box.value
    input_box.value = ""

    if query.lower() == 'exit':
        print("Thanks for the chat!")
        return

    # Ask the chain, passing earlier turns so follow-ups keep their context
    result = qa({"question": query, "chat_history": chat_history})
    chat_history.append((query, result['answer']))

    display(widgets.HTML(f'<b>User:</b> {query}'))
    display(widgets.HTML(f'<b><font color="Orange">Chatbot:</font></b> {result["answer"]}'))

print("Chat with your data. Type 'exit' to stop")

input_box = widgets.Text(placeholder='Please enter your question:')
input_box.on_submit(on_submit)

display(input_box)

Chat with your data. Type 'exit' to stop


Text(value='', placeholder='Please enter your question:')

HTML(value='<b>User:</b> who are the authors of gpt4al')

HTML(value='<b><font color="Orange">Chatbot:</font></b>  The authors of GPT4All are Yuvanesh Anand, Zach Nussb…

HTML(value='<b>User:</b> what is pandas ai ')

HTML(value='<b><font color="Orange">Chatbot:</font></b> \n\nPandas AI is a Python library that adds generative…
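
The widget code above drops the question and the answer straight into HTML, so text containing `<` or `&` may render oddly. A minimal sketch of escaping the text first, using the standard library's `html.escape` (`render_turn` is a hypothetical helper):

```python
import html

def render_turn(speaker, text, color=None):
    """Build an HTML snippet for one chat turn, escaping the message text."""
    label = f'<font color="{color}">{speaker}:</font>' if color else f"{speaker}:"
    return f"<b>{label}</b> {html.escape(text)}"

print(render_turn("User", "is 1 < 2 & 3 > 2?"))
print(render_turn("Chatbot", "Yes.", color="Orange"))
```

The returned string can be passed to `widgets.HTML(...)` in place of the f-strings.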

## Gradio Part (Building the [chatbot like UI](https://gradio.app/docs/#chatbot))

### Gradio sample example

In [None]:
import gradio as gr
import random

with gr.Blocks() as demo:
    chatbot = gr.Chatbot()
    msg = gr.Textbox()
    clear = gr.Button("Clear")

    def respond(message, chat_history):
        print(message)
        print(chat_history)
        bot_message = random.choice(["How are you?", "I love you", "I'm very hungry"])
        chat_history.append((message, bot_message))
        print(chat_history)
        return "", chat_history

    msg.submit(respond, [msg, chatbot], [msg, chatbot])
    clear.click(lambda: None, None, chatbot, queue=False)

demo.launch(debug=True, share=True)

Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
Running on public URL: https://1dbfbc6e387c4c0006.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades (NEW!), check out Spaces: https://huggingface.co/spaces


hello
[]
[('hello', 'I love you')]
hi
[['hello', 'I love you']]
[['hello', 'I love you'], ('hi', 'How are you?')]
Keyboard interruption in main thread... closing server.
Killing tunnel 127.0.0.1:7863 <> https://1dbfbc6e387c4c0006.gradio.live




### Gradio langchain example

In [None]:
import gradio as gr
with gr.Blocks() as demo:
    chatbot = gr.Chatbot()
    msg = gr.Textbox()
    clear = gr.Button("Clear")

    def respond(user_message, chat_history):
        print(user_message)
        print(chat_history)
        # Get response from QA chain
        response = qa({"question": user_message, "chat_history": chat_history})
        # Append user message and response to chat history
        chat_history.append((user_message, response["answer"]))
        print(chat_history)
        return "", chat_history

    msg.submit(respond, [msg, chatbot], [msg, chatbot], queue=False)
    clear.click(lambda: None, None, chatbot, queue=False)

demo.launch(debug=True, share=True)

Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
Running on public URL: https://42d679ac88ec3d1362.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades (NEW!), check out Spaces: https://huggingface.co/spaces


hello
[]
[('hello', " I'm sorry, I don't know the answer to that question.")]
who are the authors of gpt4all paper.
[['hello', ' I’m sorry, I don’t know the answer to that question.']]


Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/gradio/routes.py", line 399, in run_predict
    output = await app.get_blocks().process_api(
  File "/usr/local/lib/python3.10/dist-packages/gradio/blocks.py", line 1299, in process_api
    result = await self.call_function(
  File "/usr/local/lib/python3.10/dist-packages/gradio/blocks.py", line 1022, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "/usr/local/lib/python3.10/dist-packages/anyio/to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "<ipython-input-68-74e405dd0daf>", line 11, in respond
    response = qa({"question": user_message, "chat_history": ch

Keyboard interruption in main thread... closing server.
Killing tunnel 127.0.0.1:7863 <> https://42d679ac88ec3d1362.gradio.live
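
The traceback above is cut off, so the actual exception is not visible. One thing the printed output does show is that the first turn comes back from Gradio as a list (`['hello', ...]`) instead of the tuple that was appended. If the chain in this LangChain version only accepts `(question, answer)` tuples, that mismatch would explain the failure on the second message. A minimal sketch of converting the history before calling the chain (`as_tuples` is a hypothetical helper):

```python
def as_tuples(chat_history):
    """Convert Gradio-style [[user, bot], ...] history into [(user, bot), ...]."""
    return [tuple(turn) for turn in chat_history]

gradio_history = [["hello", "I'm sorry, I don't know the answer to that question."]]
print(as_tuples(gradio_history))

# Inside respond(), the chain call would then become:
# response = qa({"question": user_message, "chat_history": as_tuples(chat_history)})
```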


