Load and vectorize HTML documents, then pose questions to check search capability.

In [1]:
import os
import re
import glob
import time
import openai

# langchain
from langchain.document_loaders import UnstructuredHTMLLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma

# authentication
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())
openai.api_key = os.environ["OPENAI_API_KEY"]

Load the local HTML files here. Local files apparently have to be loaded one at a time.

In [3]:
urls = glob.glob('data/text/Robert_King*html')
urls

['data/text/Robert_King_pitchbook.html',
 'data/text/Robert_King_google.html',
 'data/text/Robert_King_equilar.html',
 'data/text/Robert_King_linkedin.html',
 'data/text/Robert_King_relsci.html',
 'data/text/Robert_King_zoominfo.html',
 'data/text/Robert_King_wealthx.html']

In [4]:
docs = []
for url in urls:
    doc = UnstructuredHTMLLoader(url).load()
    # infer client name and add to metadata
    root = url.split('/')[-1]
    toks = root.split('_')
    client_name = toks[:-1]
    doc_type = toks[-1][:-5]
    # manually edit metadata
    doc[0].metadata['client_name'] = ' '.join(client_name)
    doc[0].metadata['doc_type'] = doc_type
    #print(doc[0].metadata['client_name'], doc_type)
    docs.extend(doc)

docs

[Document(page_content='Lead partner on deals:\n\nCompany: Tech Startup XYZ\n\nDeal Date: May 15 2023\n\nDeal Type: Series B\n\nDeal Size: $30M\n\nDeal Status: Completed\n\nLocation: San Francisco, CA\n\nRepresenting: Sequoia Capital\n\nOther Partners: Rich Dude I, Rich Guy II, Notso Rich III, Rich Wannabe\n\nInvestor bio: Robert King is a highly successful investor with a strong track record in the hedge fund industry. He has led numerous successful deals and has a deep understanding of the startup ecosystem. Robert is known for his strategic thinking and ability to identify promising investment opportunities.', metadata={'source': 'data/text/Robert_King_pitchbook.html', 'client_name': 'Robert King', 'doc_type': 'pitchbook'}),
 Document(page_content='Article 1:\n\nTitle: Robert King donates $1 million to local charity\n\nDate: March 10, 2023\n\nAbstract: In a generous act of philanthropy, Robert King has donated $1 million to a local charity.                          The donation will
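
The metadata step in the loading cell relies on the `<client name>_<source>.html` naming pattern. A minimal sketch of that parsing as a standalone helper, assuming every file follows the pattern; `parse_client_filename` is a hypothetical name, not part of the notebook or LangChain:

```python
from pathlib import Path

def parse_client_filename(path):
    """Split 'dir/First_Last_source.html' into (client_name, doc_type)."""
    stem = Path(path).stem                    # drops the directory and the file suffix
    *name_parts, doc_type = stem.split('_')   # last token is the document source
    if not name_parts:
        raise ValueError(f'expected <client>_<doc_type>.html, got {path!r}')
    return ' '.join(name_parts), doc_type

print(parse_client_filename('data/text/Robert_King_pitchbook.html'))
```

Unlike slicing off the last five characters, this does not depend on the length of the `.html` suffix, and it rejects file names that carry no client name.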

Vectorstore

In [5]:
# optionally persist the vector db to disk
persist_directory = 'data/chroma/'
override = False

if os.path.exists(persist_directory) and override:  # clean out existing data
    for f in glob.glob(persist_directory+'index/*'):
        os.remove(f)
    
vectordb = Chroma.from_documents(
    documents=docs,
    embedding=OpenAIEmbeddings(),
    #persist_directory=persist_directory  # uncomment to persist on disk
)
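
The cleanup loop in the cell above calls `os.remove` on every entry under `index/`, which raises an error if an entry is a subdirectory. A hedged alternative, assuming the whole persist directory can safely be discarded; `reset_persist_dir` is a hypothetical helper, not a Chroma or LangChain function:

```python
import os
import shutil

def reset_persist_dir(persist_directory, override=False):
    """Delete an existing persist directory when override is True.

    Returns True if the directory was removed, False otherwise.
    """
    if override and os.path.isdir(persist_directory):
        shutil.rmtree(persist_directory)   # removes files and subdirectories alike
        return True
    return False
```

Calling `reset_persist_dir(persist_directory, override)` before `Chroma.from_documents` would play the same role as the loop, under the assumption stated above.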

In [6]:
# retrieval based on natural language question (no metadata filter applied)
q1 = 'Tell me about Robert King\'s family assets, board memberships, and venture capital deals'
max_k=7

### retrieval based on cosine similarity
for d in vectordb.similarity_search(q1, k=max_k):  # earlier note: 6/8 results were for Robert King; in the run shown, all 7 are his
    print(d.metadata)

{'client_name': 'Robert King', 'doc_type': 'linkedin', 'source': 'data/text/Robert_King_linkedin.html'}
{'client_name': 'Robert King', 'doc_type': 'pitchbook', 'source': 'data/text/Robert_King_pitchbook.html'}
{'client_name': 'Robert King', 'doc_type': 'zoominfo', 'source': 'data/text/Robert_King_zoominfo.html'}
{'client_name': 'Robert King', 'doc_type': 'google', 'source': 'data/text/Robert_King_google.html'}
{'client_name': 'Robert King', 'doc_type': 'relsci', 'source': 'data/text/Robert_King_relsci.html'}
{'client_name': 'Robert King', 'doc_type': 'equilar', 'source': 'data/text/Robert_King_equilar.html'}
{'client_name': 'Robert King', 'doc_type': 'wealthx', 'source': 'data/text/Robert_King_wealthx.html'}


In [7]:
### retrieval based on maximal marginal relevance
### (cosine similarity plus a redundancy penalty that promotes diversity)
### https://python.langchain.com/docs/modules/model_io/prompts/example_selectors/mmr
for d in vectordb.max_marginal_relevance_search(q1, k=max_k):  # earlier note: only 4/8 results pertained to Robert King (BAD); in the run shown, all 7 do
    print(d.metadata)

Number of requested results 20 is greater than number of elements in index 7, updating n_results = 7


{'client_name': 'Robert King', 'doc_type': 'linkedin', 'source': 'data/text/Robert_King_linkedin.html'}
{'client_name': 'Robert King', 'doc_type': 'pitchbook', 'source': 'data/text/Robert_King_pitchbook.html'}
{'client_name': 'Robert King', 'doc_type': 'zoominfo', 'source': 'data/text/Robert_King_zoominfo.html'}
{'client_name': 'Robert King', 'doc_type': 'google', 'source': 'data/text/Robert_King_google.html'}
{'client_name': 'Robert King', 'doc_type': 'relsci', 'source': 'data/text/Robert_King_relsci.html'}
{'client_name': 'Robert King', 'doc_type': 'equilar', 'source': 'data/text/Robert_King_equilar.html'}
{'client_name': 'Robert King', 'doc_type': 'wealthx', 'source': 'data/text/Robert_King_wealthx.html'}
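
For reference, maximal marginal relevance greedily picks the candidate that is most similar to the query and least similar to what has already been picked. The sketch below is a pure-Python illustration of that idea on toy 2-D vectors, not LangChain's or Chroma's implementation; `cosine` and `mmr` are hypothetical names. The "requested results 20" warning above appears to come from the MMR candidate pool (`fetch_k`, which defaults to 20 in LangChain) being larger than the 7-document index.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def mmr(query_vec, doc_vecs, k=4, lambda_mult=0.5):
    """Return indices of up to k documents chosen by maximal marginal relevance.

    lambda_mult=1.0 ranks by relevance only; lower values favor diversity.
    """
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # similarity to the closest already-selected document
            redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                             default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# toy embeddings: documents 0 and 1 are near-duplicates, document 2 is different
query = [1.0, 1.0]
docs_2d = [[1.0, 0.9], [1.0, 0.8], [0.0, 1.0]]
print(mmr(query, docs_2d, k=3, lambda_mult=0.5))
print(mmr(query, docs_2d, k=3, lambda_mult=1.0))
```

With `lambda_mult=0.5` the dissimilar document 2 is promoted ahead of the near-duplicate document 1; with `lambda_mult=1.0` the order falls back to plain similarity ranking.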


In [8]:
# retrieval with metadata filter: a flat dict filter matches one metadata field;
# Chroma's `$and`/`$or` operators can combine several conditions
q2 = 'Tell me about Robert King\'s equity holdings, stocks sold, transactions, and annual compensation'

print('#### Similarity search ####')
for d in vectordb.similarity_search(q2, k=max_k, filter={'client_name': 'Robert King'}):
    print(d.metadata)

print('\n\n#### MMR search ####')
for d in vectordb.max_marginal_relevance_search(q2, k=max_k, filter={'client_name': 'Robert King'}):
    print(d.metadata)

### equilar is ranking highest in both cases, but it took several attempts to make
### the question specific enough so it placed high
### documents appear ranked in the same order

#### Similarity search ####
{'client_name': 'Robert King', 'doc_type': 'equilar', 'source': 'data/text/Robert_King_equilar.html'}
{'client_name': 'Robert King', 'doc_type': 'linkedin', 'source': 'data/text/Robert_King_linkedin.html'}
{'client_name': 'Robert King', 'doc_type': 'zoominfo', 'source': 'data/text/Robert_King_zoominfo.html'}
{'client_name': 'Robert King', 'doc_type': 'pitchbook', 'source': 'data/text/Robert_King_pitchbook.html'}
{'client_name': 'Robert King', 'doc_type': 'google', 'source': 'data/text/Robert_King_google.html'}
{'client_name': 'Robert King', 'doc_type': 'relsci', 'source': 'data/text/Robert_King_relsci.html'}
{'client_name': 'Robert King', 'doc_type': 'wealthx', 'source': 'data/text/Robert_King_wealthx.html'}


#### MMR search ####


Number of requested results 20 is greater than number of elements in index 7, updating n_results = 7


{'client_name': 'Robert King', 'doc_type': 'equilar', 'source': 'data/text/Robert_King_equilar.html'}
{'client_name': 'Robert King', 'doc_type': 'linkedin', 'source': 'data/text/Robert_King_linkedin.html'}
{'client_name': 'Robert King', 'doc_type': 'zoominfo', 'source': 'data/text/Robert_King_zoominfo.html'}
{'client_name': 'Robert King', 'doc_type': 'pitchbook', 'source': 'data/text/Robert_King_pitchbook.html'}
{'client_name': 'Robert King', 'doc_type': 'google', 'source': 'data/text/Robert_King_google.html'}
{'client_name': 'Robert King', 'doc_type': 'relsci', 'source': 'data/text/Robert_King_relsci.html'}
{'client_name': 'Robert King', 'doc_type': 'wealthx', 'source': 'data/text/Robert_King_wealthx.html'}
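
Chroma's `where` syntax, which LangChain passes through as `filter`, supports the logical operators `$and` and `$or`, so a client filter can be combined with one or more document types. A minimal sketch of building such a filter; `client_doc_filter` is a hypothetical helper, and the commented call at the end assumes the `vectordb`, `q2`, and `max_k` objects from the cells above:

```python
def client_doc_filter(client_name, doc_types=None):
    """Build a Chroma-style metadata filter for one client and optional doc types."""
    client_clause = {'client_name': client_name}
    if not doc_types:
        return client_clause                      # single-field filter, as in the cell above
    type_clauses = [{'doc_type': t} for t in doc_types]
    # `$or` needs at least two clauses, so a single type is used as-is
    type_clause = type_clauses[0] if len(type_clauses) == 1 else {'$or': type_clauses}
    return {'$and': [client_clause, type_clause]}

print(client_doc_filter('Robert King', ['equilar', 'relsci']))

# assumed usage with the objects defined earlier in this notebook:
# vectordb.similarity_search(q2, k=max_k,
#                            filter=client_doc_filter('Robert King', ['equilar', 'relsci']))
```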


Self-query metadata filter

In [9]:
from langchain.llms import OpenAI
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo

In [10]:
metadata_field_info = [
    AttributeInfo(
        name="client_name",
        description="The name of the client associated with the document.",
        type="string",
    ),
    
    AttributeInfo(
        name="doc_type",
        description="The source the document comes from; should be one of `equilar`, "
            "`google`, `linkedin`, `pitchbook`, `relsci`, `wealthx`, or `zoominfo`",
        type="string",
    ),
]

In [11]:
document_content_description = "Client documents"

llm = OpenAI(temperature=0)

sq_retriever = SelfQueryRetriever.from_llm(  # sq_retriever = self-query retriever
    llm,
    vectordb,
    document_content_description,
    metadata_field_info,
    enable_limit=True,
    verbose=True
)

In [12]:
print(f'### Question: {q1} ###\n')
docs = sq_retriever.get_relevant_documents(q1)
for d in docs:
    print(d.metadata)

### self-query returns only 4 documents; the parsed query has limit=None, so the
### vector store's default k=4 most likely applies
### was hoping wealthx would be included

### Question: Tell me about Robert King's family assets, board memberships, and venture capital deals ###





query='Robert King family assets board memberships venture capital deals' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='client_name', value='Robert King') limit=None
{'client_name': 'Robert King', 'doc_type': 'linkedin', 'source': 'data/text/Robert_King_linkedin.html'}
{'client_name': 'Robert King', 'doc_type': 'pitchbook', 'source': 'data/text/Robert_King_pitchbook.html'}
{'client_name': 'Robert King', 'doc_type': 'relsci', 'source': 'data/text/Robert_King_relsci.html'}
{'client_name': 'Robert King', 'doc_type': 'google', 'source': 'data/text/Robert_King_google.html'}


In [13]:
print(f'### Question: {q2} ###\n')
docs = sq_retriever.get_relevant_documents(q2)
for d in docs:
    print(d.metadata)

### Question: Tell me about Robert King's equity holdings, stocks sold, transactions, and annual compensation ###

query='Robert King equity holdings stocks sold transactions annual compensation' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='client_name', value='Robert King') limit=None
{'client_name': 'Robert King', 'doc_type': 'equilar', 'source': 'data/text/Robert_King_equilar.html'}
{'client_name': 'Robert King', 'doc_type': 'linkedin', 'source': 'data/text/Robert_King_linkedin.html'}
{'client_name': 'Robert King', 'doc_type': 'pitchbook', 'source': 'data/text/Robert_King_pitchbook.html'}
{'client_name': 'Robert King', 'doc_type': 'zoominfo', 'source': 'data/text/Robert_King_zoominfo.html'}


In [14]:
### test query on doc_type metadata with limit
q = "Find the linkedin data for Robert King"
docs = sq_retriever.get_relevant_documents(q)
for d in docs:
    print(d.metadata)

query='Robert King' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='doc_type', value='linkedin') limit=None
{'client_name': 'Robert King', 'doc_type': 'linkedin', 'source': 'data/text/Robert_King_linkedin.html'}


In [15]:
### test query on doc_type and client_name metadata
q = "Find the linkedin data for client Robert King"
docs = sq_retriever.get_relevant_documents(q)
for d in docs:
    print(d.metadata)

### note: "client Robert King" also sets filter on the client name
### could also say "the linkedin profile" to imply there is one result

query=' ' filter=Operation(operator=<Operator.AND: 'and'>, arguments=[Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='client_name', value='Robert King'), Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='doc_type', value='linkedin')]) limit=None
{'client_name': 'Robert King', 'doc_type': 'linkedin', 'source': 'data/text/Robert_King_linkedin.html'}


In [16]:
q = "Find the equilar and relsci documents for client Robert King"
docs = sq_retriever.get_relevant_documents(q)
for d in docs:
    print(d.metadata)

### This demonstrates simultaneous filter on doc type and client name

query=' ' filter=Operation(operator=<Operator.AND: 'and'>, arguments=[Operation(operator=<Operator.OR: 'or'>, arguments=[Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='doc_type', value='equilar'), Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='doc_type', value='relsci')]), Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='client_name', value='Robert King')]) limit=None
{'client_name': 'Robert King', 'doc_type': 'relsci', 'source': 'data/text/Robert_King_relsci.html'}
{'client_name': 'Robert King', 'doc_type': 'equilar', 'source': 'data/text/Robert_King_equilar.html'}


Recommendations:
 - Vector DB searches (similarity, maximal marginal relevance) in Chroma are prone to mixing results for separate clients when no metadata filter is applied.
 - Metadata filters restrict results to the correct client or document type. The self-query runs above show that conditions on several fields (e.g., client_name and doc_type, or doc_type matching one of several values) can be combined with AND/OR operators.
 - Self-query has the potential to construct the client and document-type filter automatically. Unfortunately, it can still mix results for multiple clients unless the question is phrased carefully (e.g., "client Robert King"). Also, with Chroma, the output is capped at four documents whenever the parsed query sets no limit, which looks like the vector store's default k=4.
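
One way to guard against the client-mixing issue noted above is to post-filter whatever the retriever returns. A minimal sketch, assuming retrieved objects expose a `.metadata` dict as LangChain `Document` objects do; `keep_client` is a hypothetical helper and 'Jane Doe' is a made-up second client used only for illustration:

```python
from types import SimpleNamespace

def keep_client(docs, client_name):
    """Drop retrieved documents whose metadata names a different client."""
    return [d for d in docs if d.metadata.get('client_name') == client_name]

# stand-ins for LangChain Document objects (only .metadata is used here)
retrieved = [
    SimpleNamespace(metadata={'client_name': 'Robert King', 'doc_type': 'linkedin'}),
    SimpleNamespace(metadata={'client_name': 'Jane Doe', 'doc_type': 'linkedin'}),
]
print([d.metadata['client_name'] for d in keep_client(retrieved, 'Robert King')])
```

This is a safety net rather than a replacement for the metadata filter: filtering after retrieval can leave fewer than k documents.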

Todo:
 - Repeat this experiment with other vectorstores to see if one works better.
 - Implement the final call to the LLM with the retrieved documents as context.

### Q&A with retrieval

In [17]:
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

llm = ChatOpenAI(model_name='gpt-3.5-turbo', temperature=0)

In [21]:
### generation without retrieval
q3 = 'Who is Robert King?'
print(f'### Question: {q3} ###\n')
llm.call_as_llm(q3)

### Question: Who is Robert King? ###



'There are several individuals named Robert King, so it is unclear which specific person you are referring to. Some notable individuals named Robert King include:\n\n1. Robert King (composer): An English composer known for his Baroque music compositions.\n2. Robert King (journalist): An American journalist and author who has written extensively on criminal justice issues.\n3. Robert King (screenwriter): A British screenwriter known for his work on the TV series "The Good Karma Hospital" and "Vera."\n4. Robert King (politician): An American politician who served as the Governor of Indiana from 1997 to 2003.\n5. Robert King (music producer): An American music producer and songwriter who has worked with various artists in the music industry.\n\nWithout more specific information, it is difficult to determine which Robert King you are referring to.'

In [19]:
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=sq_retriever
)

result = qa_chain({"query": q3})
print(f'### Question: {q3} ###\n')
result['result']

query='Robert King' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='client_name', value='Robert King') limit=None
### Question: Who is Robert King? ###



'Robert King is a seasoned finance professional with extensive experience in the hedge fund industry. He has held positions as the CEO of Hedge Fund A and Senior Portfolio Manager at Hedge Fund B. He is also a board director for the San Francisco Symphony. Robert is known for his philanthropy, having donated $1 million to a local charity. He has also led successful deals as an investor, including representing Sequoia Capital in a Series B deal for Tech Startup XYZ.'

In [20]:
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=sq_retriever
)

result = qa_chain({"query": q2})
print(f'### Question: {q2} ###\n')
result['result']

query='Robert King equity holdings stocks sold transactions annual compensation' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='client_name', value='Robert King') limit=None
### Question: Tell me about Robert King's equity holdings, stocks sold, transactions, and annual compensation ###



"Robert King's equity holdings are valued at $20 million. He has sold $10 million worth of stock in equity transactions over the last 36 months. Additionally, he has received new equity grants totaling $5 million and has exercised options worth $2 million during the same period. His annual compensation is $5 million."

In [22]:
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=sq_retriever
)

result = qa_chain({"query": q1})
print(f'### Question: {q1} ###\n')
result['result']

query='Robert King family assets board memberships venture capital deals' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='client_name', value='Robert King') limit=None
### Question: Tell me about Robert King's family assets, board memberships, and venture capital deals ###



"Based on the provided context, there is no information available about Robert King's family assets, board memberships, or specific venture capital deals he has been involved in, apart from the Tech Startup XYZ deal where he represented Sequoia Capital."

### Combine multiple LLM calls to construct one coherent biography

Experiment 1:

    a. Summarize each document individually
    b. Feed summaries into one prompt
    c. Write the biography, emphasizing specific information
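
The three steps above can be sketched as one function that accepts any callable `llm` mapping a prompt string to a response string. This is an illustrative outline under that assumption, not the code used for the results in this notebook; `build_biography` and `echo_llm` are hypothetical names:

```python
def build_biography(docs_by_type, client, llm, max_words=100):
    """docs_by_type maps doc_type -> raw text; llm is any callable str -> str."""
    # a. summarize each document individually
    summaries = {
        doc_type: llm(f'Summarize the context below in {max_words} words or less.'
                      f'\n\nContext:\n{text}')
        for doc_type, text in docs_by_type.items()
    }
    # b. feed the summaries into one prompt
    context = '\n\n'.join(summaries.values())
    prompt = (f'Write a biography of {client} as prose, covering work history, '
              f'board memberships, philanthropy, education, and family.'
              f'\n\nInput context:\n{context}')
    # c. write the biography
    return llm(prompt)

def echo_llm(prompt):
    """Stand-in for a real model call; returns the last line of the prompt."""
    return prompt.splitlines()[-1]

print(build_biography({'linkedin': 'CEO of Hedge Fund A'}, 'Robert King', echo_llm))
```

Passing the model as a callable keeps the pipeline testable without API access; in the notebook, `llm.call_as_llm` would fill that role.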

In [54]:
### Summarize individual documents
multi_doc_prompt_dict = {'linkedin': {'q': '', 'a': ''},
                         'google': {'q': '', 'a': ''},
                         'equilar': {'q': '', 'a': ''},
                         'pitchbook': {'q': '', 'a': ''},
                         'relsci': {'q': '', 'a': ''},
                         'wealthx': {'q': '', 'a': ''},
                         'zoominfo': {'q': '', 'a': ''},
                        }

client = 'Robert King'
doc_retriever_template = 'Retrieve the {doc_type} document for {client}.'

doc_summary_template = '''
Summarize the context below in 100 words or less.

Context:
{context}
'''

for key in multi_doc_prompt_dict.keys():
    ### retrieve exact document per client
    q = doc_retriever_template.format(doc_type=key, client=client)
    docs = sq_retriever.get_relevant_documents(q)
    multi_doc_prompt_dict[key]['q'] = q
    #response = qa_chain({'query': 'Summarize the context in 100 words or less.', 'input_docs': docs})

    ### summarize the document
    doc_content = '\n'.join([d.page_content for d in docs])
    response = llm.call_as_llm(doc_summary_template.format(context=doc_content))
    
    multi_doc_prompt_dict[key]['a'] = response

print('\n\n### Results:')
for key in multi_doc_prompt_dict.keys():
    print(f"document_type: {key}, summary: {multi_doc_prompt_dict[key]['a']}")

query=' ' filter=Operation(operator=<Operator.AND: 'and'>, arguments=[Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='client_name', value='Robert King'), Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='doc_type', value='linkedin')]) limit=None
query=' ' filter=Operation(operator=<Operator.AND: 'and'>, arguments=[Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='client_name', value='Robert King'), Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='doc_type', value='google')]) limit=1
query=' ' filter=Operation(operator=<Operator.AND: 'and'>, arguments=[Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='client_name', value='Robert King'), Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='doc_type', value='equilar')]) limit=1


query=' ' filter=Operation(operator=<Operator.AND: 'and'>, arguments=[Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='client_name', value='Robert King'), Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='doc_type', value='pitchbook')]) limit=1
query=' ' filter=Operation(operator=<Operator.AND: 'and'>, arguments=[Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='client_name', value='Robert King'), Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='doc_type', value='relsci')]) limit=None
query=' ' filter=Operation(operator=<Operator.AND: 'and'>, arguments=[Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='client_name', value='Robert King'), Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='doc_type', value='wealthx')]) limit=None
query=' ' filter=Operation(operator=<Operator.AND: 'and'>, arguments=[Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='client_name', value='Robert King'), Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='doc_typ

In [57]:
### construct one prompt with all of the above summaries

bio_prompt_template = '''
You are a writer and biographer. You specialize in writing
accurate life summaries given large input documents.
Below is information from several documents about a single
client named {client}.

Prepare a biography that includes, in the following order:
1. The client's name
2. Their professional work history
3. Board member activities
4. Philanthropic activities
5. Their education
6. Any details about their family

Format the output as prose rather than an ordered list.

Input context:
{context}

Your response here:
'''

context = ''
for key in multi_doc_prompt_dict:
    context += ('\n\n' + multi_doc_prompt_dict[key]['a'])

formatted_prompt = bio_prompt_template.format(client=client, context=context)
print(f'Input prompt:\n\n{formatted_prompt}')

response = llm.call_as_llm(formatted_prompt)

Input prompt:


You are a writer and biographer. You specialize in writing
accurate life summaries given large input documents.
Below is information from several documents about a single
client named Robert King.

Prepare a biography that includes, in the following order:
1. The client's name
2. Their professional work history
3. Board member activities
4. Philanthropic activities
5. Their education
6. Any details about their family

Format the output as prose rather than an ordered list.

Input context:


Robert King is a highly respected finance professional with experience in the hedge fund industry. He is currently the CEO of Hedge Fund A in San Francisco and previously worked as a Senior Portfolio Manager at Hedge Fund B. He holds an MBA from Stanford University and is actively involved in the local community as a Board Director for the San Francisco Symphony. With his strong leadership skills and extensive knowledge in finance, Robert is well-regarded in the industry.

Robert King h

In [59]:
print(response)

Robert King, a highly respected finance professional, has had an illustrious career in the hedge fund industry. Currently serving as the CEO of Hedge Fund A in San Francisco, Robert's expertise and leadership have propelled him to the forefront of the finance world. Prior to his current role, he held the position of Senior Portfolio Manager at Hedge Fund B, where he made significant contributions to the company's success.

In addition to his professional achievements, Robert is actively involved in various board member activities. He serves as a Board Director for the San Francisco Symphony, showcasing his commitment to the local community and the arts. His strategic thinking and extensive knowledge in finance make him a valuable asset in this role.

Robert King's philanthropic endeavors are equally noteworthy. Recently, he made a generous donation of $1 million to a local charity, aiming to improve the lives of underprivileged youth in the community. This act of kindness reflects his 