# LangChain: Q&A over Documents

An example might be a tool that would allow you to query a product catalog for items of interest.

In [None]:
json.loads()

In [None]:
#pip install --upgrade langchain

In [5]:
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

In [2]:
import concurrent.futures
import pandas as pd
from langchain.chains import RetrievalQA, VectorDBQA
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import CSVLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS 
from langchain.vectorstores import DocArrayInMemorySearch, Chroma
from langchain.indexes import VectorstoreIndexCreator
from IPython.display import display, Markdown
from langchain.llms import OpenAI
from langchain_core.prompts import ChatPromptTemplate

  from pandas.core.computation.check import NUMEXPR_INSTALLED


In [3]:
## data contains news articles related to technology

In [11]:
df = pd.read_csv('../data/tech_news_articles2.csv').reset_index()


In [15]:
df = df.rename(columns={'index':'id'})
df[['text','id']].to_csv('../data/vectorstore.csv',index=False)
# pd.read_csv('../data/vectorstore.csv')

In [18]:
file = '../data/vectorstore.csv'
# file = 'data/tech_news_articles.csv'
loader = CSVLoader(file_path=file)
documents = loader.load()

In [19]:
documents[0]

Document(metadata={'source': '../data/vectorstore.csv', 'row': 0}, page_content='text: The Sun’ll come out tomorrow, and you no longer have to bet your bottom dollar to be sure of it. Google’s DeepMind team released its latest weather prediction model this week, which outperforms a lea… [+6059 chars]\nid: 0')

In [21]:
from langchain_huggingface import HuggingFaceEmbeddings
# define embeddings
embeddings = HuggingFaceEmbeddings(
    model_name="thenlper/gte-small")

  from tqdm.autonotebook import tqdm, trange


In [20]:
query = 'what is the most popular language?'

In [12]:
#pip install docarray

In [21]:
index = VectorstoreIndexCreator(
    embedding=embeddings,
    vectorstore_cls=DocArrayInMemorySearch
).from_loaders([loader])



In [13]:
llm_replacement_model = ChatOpenAI(temperature=0, model='gpt-4o-mini')

response = index.query(query, llm_replacement_model)

  llm_replacement_model = ChatOpenAI(temperature=0, model='gpt-4o-mini')


NameError: name 'index' is not defined

In [23]:
# Responds with the summary of the document most related to the query

In [24]:
display(Markdown(response))

I don't know.

In [51]:
query ="which article is most relevant to AI?"

In [52]:
response = index.query(query, llm_replacement_model)

In [53]:
display(Markdown(response))

The first blurp about two artificial intelligence pioneers being awarded the Nobel Prize for their work in machine learning is the most relevant to AI.

## Using Chroma Vectordb for RAG
- We create a chorma vectordb and use parallel processing for faster processing when inserting documents into chroma
- Then we create a Retrival Augmented Generation (RAG) system where it answers the query with the most relevant documents

In [22]:
#Place vectorDB under /tmp. It can be anywhere else
# from langchain.vectorstores import Chroma
persist_directory = "/tmp/chromadb"
# vectordb = Chroma.from_documents(documents=list(documents[0:1]), embedding=embeddings,
#                                  persist_directory=persist_directory)
vectordb = Chroma(embedding_function =embeddings,
                                 persist_directory=persist_directory)


# vectordb.persist()
# vectordb._collection.count()

def batch_process(documents_arr, batch_size,):
    for i in range(1, len(documents_arr), batch_size):
        batch = documents_arr[i:i + batch_size]
        add_to_chroma_database(batch)

def add_to_chroma_database(batch):
    vectordb.add_documents(documents=batch)
    
    
batch_size = 50

# batch_process(documents, batch_size, add_to_chroma_database)

def form_batch(documents_arr, batch_size):
    data_list = []
    for i in range(1, len(documents_arr), batch_size):
        data_list.append(documents_arr[i:i + batch_size])
    return data_list

data_list = form_batch(documents, 50)

#this allows parallel processing and faster processing for inserting the articles into chroma
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        executor.map(add_to_chroma_database, data_list)

  vectordb = Chroma(embedding_function =embeddings,


In [8]:
data_list = form_batch(documents, 50)

In [24]:
vectordb._collection.count()

985

In [39]:
results = vectordb.similarity_search(query)
[print(doc.page_content) for doc in results]

blurp: Deploying AI applications to the cloud is a crucial step in enhancing their accessibility, usability, and real-world impact. By transitioning AI apps from a local environment to the cloud, developers can ensure that their applications are easily accessible to…
blurp: Hello everyone,
I’m exploring ways to optimize [cloud storage][1] solutions using Wolfram Language and would love to hear your insights and experiences.
I’ve been working with large datasets and am particularly interested in:
1.Data Compression: Are there …
blurp: The article highlights the critical need for robust cloud security amidst emerging threats like APTs, quantum computing risks, and ransomware-as-a-service. It details advancements like Zero Trust Architecture, AI and ML integration, Secure Access Service Edge…
blurp: These cloud security statistics paint a worrying picture for businesses worldwide. Nearly one in two companies have reported security breaches, a statistic all the more disturbing consideri

[None, None, None, None]

In [29]:
PROMPT_TEMPLATE = """
Based only on the following context
{context}
 - -
Answer the question:{question} 
"""

In [30]:
def query_rag(query, vectordb):
    """
    Query a Retrieval-Augmented Generation (RAG) system using Chroma database and OpenAI.
    Args:
    - query_text (str): The text to query the RAG system with.
    Returns:
    - formatted_response (str): Formatted response including the generated text and sources.
    - response_text (str): The generated response text.
    """
    results = vectordb.similarity_search_with_relevance_scores(query, k=3)
    
    if len(results) == 0 or results[0][1] < 0.7:
        print(f"Unable to find matching results.")

    # Combine context from matching documents
    context_text = "\n\n - -\n\n".join([doc.page_content for doc, _score in results])

    # Create prompt template using context and query text
    prompt_template = ChatPromptTemplate.from_template(PROMPT_TEMPLATE)
    prompt = prompt_template.format(context=context_text, question=query)

    # Initialize OpenAI chat model
    model = ChatOpenAI()

    # Generate response text based on the prompt
    response_text = model.predict(prompt)

    # Get sources of the matching documents
    sources = [doc.metadata.get("source", None) for doc, _score in results]

    # Format and return response including generated text and sources
    formatted_response = f"Response: {response_text}\nSources: {sources}"
    
    return response_text, formatted_response

response_text, formatted_response = query_rag(prompt, vectordb)
print(response_text)

| Article Title | Summary |
|--------------|---------|
| AI Technology Advancements | This article discusses the latest advancements in AI technology, including new developments in machine learning, natural language processing, and computer vision. It explores how these advancements are shaping various industries and improving efficiency and productivity. |
| Challenges of AI | This article delves into the challenges that AI technology faces, such as bias in algorithms, data privacy concerns, and ethical implications. It discusses how these challenges are being addressed by researchers and the industry to ensure responsible AI development. |
| Opportunities in AI | This article highlights the opportunities that AI presents, such as improved healthcare diagnostics, autonomous vehicles, and personalized recommendations. It explores how businesses can leverage AI to gain a competitive edge and drive innovation in their respective fields. |


## Retreival QA

In [19]:
db = DocArrayInMemorySearch.from_documents(
    documents, 
    embeddings
)

In [20]:
retriever = db.as_retriever()
# retriever = vectordb.as_retriever()

In [21]:
llm_replacement_model = ChatOpenAI(temperature=0, model='gpt-4o-mini')
qa_stuff = RetrievalQA.from_chain_type(
    llm=llm_replacement_model, 
    chain_type="stuff", 
    retriever=retriever, 
    verbose=True
)

In [22]:
query =  "Please list all your articles with the topic of AI in a table in markdown and summarize each one"

In [23]:
response = qa_stuff.run(query)



[1m> Entering new RetrievalQA chain...[0m

[1m> Finished chain.[0m


In [18]:
display(Markdown(response))

I don't know.