# LangChain Tools

Inspirations:

https://python.langchain.com/en/latest/modules/chains/index_examples/vector_db_qa.html

https://python.langchain.com/en/latest/modules/chains/index_examples/qa_with_sources.html

https://github.com/openai/chatgpt-retrieval-plugin/blob/main/examples/providers/pinecone/semantic-search.ipynb



In [28]:
!pip install -qU openai langchain pinecone-client chromadb pandas tqdm

In [29]:
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.vectorstores import Pinecone
from langchain.text_splitter import CharacterTextSplitter
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA

In [30]:
import os
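# Read the key from the environment; never hard-code secrets in a notebook.
if not os.environ.get("OPENAI_API_KEY"):
    raise RuntimeError("Set the OPENAI_API_KEY environment variable before running this notebook.")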

## Load Data

### Skip: Use LangChain Pinecone Module

In [None]:
# https://python.langchain.com/en/latest/modules/indexes/vectorstores/examples/pinecone.html

import pinecone
# initialize pinecone
pinecone.init(
    api_key=os.environ["PINECONE_API_KEY"],  # read from the environment; never hard-code secrets
    environment="eu-west1-gcp"
)

index_name = "longevity-kb-v1"

In [None]:
# https://python.langchain.com/en/latest/modules/indexes/document_loaders/examples/csv.html

from langchain.document_loaders import CSVLoader

loader = CSVLoader("./albumin_aging.csv")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)

embeddings = OpenAIEmbeddings()

# Permanent Pinecone storage (index the split chunks, not the unsplit documents)
docsearch = Pinecone.from_documents(texts, embeddings, index_name=index_name)

# Transient in-memory Index
# docsearch = Chroma.from_documents(texts, embeddings)

### Use ChatGPT Plugin Retriever

#### Skip: OpenAI Plugin Quickstart

1. Install Python 3.10 if not already installed.

2. Clone the `retrieval-app` repository:

```
git clone git@github.com:openai/retrieval-app.git
```

3. Navigate to the app directory:

```
cd /path/to/retrieval-app
```

4. Install `poetry`:

```
pip install poetry
```

5. Create a new virtual environment:

```
poetry env use python3.10
```

6. Install the `retrieval-app` dependencies:

```
poetry install
```

7. Set app environment variables:

* `BEARER_TOKEN`: Secret token used by the app to authorize incoming requests. We will later include this in the request `headers`. The token can be generated however you prefer, such as using [jwt.io](https://jwt.io/).

* `OPENAI_API_KEY`: The OpenAI API key used for generating embeddings with the `text-embedding-ada-002` model. [Get an API key here](https://platform.openai.com/account/api-keys)!

8. Set Pinecone-specific environment variables:

* `DATASTORE`: Set to `pinecone`.

* `PINECONE_API_KEY`: Set to your Pinecone API key. This requires a free Pinecone account; the key can be [found in the Pinecone console](https://app.pinecone.io/).

* `PINECONE_ENVIRONMENT`: Set to your Pinecone environment, which looks like `us-east1-gcp` or `us-west1-aws` and can be found next to your API key in the [Pinecone console](https://app.pinecone.io/).

* `PINECONE_INDEX`: Set to your chosen index name. Any name works, but we recommend something descriptive such as `"openai-retrieval-app"`. *Note that index names may contain only alphanumeric characters and `"-"`, with a maximum of 45 characters.*

9. Run the app with:

```
poetry run start
```

If running the app locally you should see something like:

```
INFO:     Uvicorn running on http://0.0.0.0:8000
INFO:     Application startup complete.
```

In that case, the app is automatically connected to our index (specified by `PINECONE_INDEX`). If no index with that name existed beforehand, the app creates one for us.
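
Before starting the app, it can help to check the configuration described in steps 7 and 8. The sketch below is only an illustration and is not part of the retrieval app: the helper names `missing_env_vars` and `is_valid_index_name` are hypothetical, and the name check simply encodes the restriction quoted above (alphanumeric characters and `-`, at most 45 characters).

```python
import os
import re

# Variables named in steps 7 and 8 above.
REQUIRED_VARS = [
    "BEARER_TOKEN",
    "OPENAI_API_KEY",
    "DATASTORE",
    "PINECONE_API_KEY",
    "PINECONE_ENVIRONMENT",
    "PINECONE_INDEX",
]


def missing_env_vars(env=None):
    """Return the required variables that are unset or empty (hypothetical helper)."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


def is_valid_index_name(name):
    """Check the naming rule quoted above: alphanumerics and '-', at most 45 characters."""
    return bool(re.fullmatch(r"[A-Za-z0-9-]{1,45}", name))


missing = missing_env_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All required environment variables are set.")
```

Passing a plain dict instead of `os.environ` makes the check easy to try without touching the real environment, for example `missing_env_vars({"BEARER_TOKEN": "x"})`.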

Now we're ready to move on to populating our index with some data.

#### Prepare and Upsert CSV

In [31]:
import pandas as pd
from urllib.parse import urlencode

# Import a local CSV
# df = pd.read_csv("./albumin_aging.csv")

# Convert Google Sheets into CSV
# https://towardsdatascience.com/read-data-from-google-sheets-into-pandas-without-the-google-sheets-api-5c468536550
# https://docs.google.com/spreadsheets/d/1LMwztkSRMLfeci_UJ-9cDRgvGHciyV6FU3IyEjHHyM4/edit

# 1) Import a single sheet
# sheet_id = "1LMwztkSRMLfeci_UJ-9cDRgvGHciyV6FU3IyEjHHyM4"
# sheet_name = "Aging"
# url = f"https://docs.google.com/spreadsheets/d/{sheet_id}/gviz/tq?tqx=out:csv&sheet={sheet_name}"
# df = pd.read_csv(url)

# 2) Concat all sheets in a Google Sheet spreadsheet into a single DataFrame
sheet_names = ["Aging", "Aging + Increase", "Aging + Decrease", "Longevity", "Longevity + Increase", "Longevity + Decrease"]

li = []
for sheet_name in sheet_names:
    # https://stackoverflow.com/questions/65199173/how-to-format-string-in-url-by-python
    link_url = "https://docs.google.com/spreadsheets/d/1LMwztkSRMLfeci_UJ-9cDRgvGHciyV6FU3IyEjHHyM4/gviz/tq?{}"
    link_params = {'tqx': 'out:csv', 'sheet': sheet_name}
    url = link_url.format(urlencode(link_params))
    print(url)
    df = pd.read_csv(url, index_col=None, header=0)
    li.append(df)

df = pd.concat(li, axis=0, ignore_index=True)

df.dtypes

https://docs.google.com/spreadsheets/d/1LMwztkSRMLfeci_UJ-9cDRgvGHciyV6FU3IyEjHHyM4/gviz/tq?tqx=out%3Acsv&sheet=Aging
https://docs.google.com/spreadsheets/d/1LMwztkSRMLfeci_UJ-9cDRgvGHciyV6FU3IyEjHHyM4/gviz/tq?tqx=out%3Acsv&sheet=Aging+%2B+Increase
https://docs.google.com/spreadsheets/d/1LMwztkSRMLfeci_UJ-9cDRgvGHciyV6FU3IyEjHHyM4/gviz/tq?tqx=out%3Acsv&sheet=Aging+%2B+Decrease
https://docs.google.com/spreadsheets/d/1LMwztkSRMLfeci_UJ-9cDRgvGHciyV6FU3IyEjHHyM4/gviz/tq?tqx=out%3Acsv&sheet=Longevity
https://docs.google.com/spreadsheets/d/1LMwztkSRMLfeci_UJ-9cDRgvGHciyV6FU3IyEjHHyM4/gviz/tq?tqx=out%3Acsv&sheet=Longevity+%2B+Increase
https://docs.google.com/spreadsheets/d/1LMwztkSRMLfeci_UJ-9cDRgvGHciyV6FU3IyEjHHyM4/gviz/tq?tqx=out%3Acsv&sheet=Longevity+%2B+Decrease


Rank                                            int64
PMID                                            int64
Title                                          object
Year                                            int64
Journal                                        object
URL                                            object
Research Article                               object
Clinical Article                               object
Number of Citations from Clinical Articles      int64
Human                                         float64
Animal                                        float64
Molecular/Cellular                            float64
NIH Percentile                                float64
Number of Citations                             int64
Approximate Potential to Translate            float64
Parameter in Title                             object
Keywords in Title                              object
Parameter and Keywords in Title                object
Cooccurance of Parameter and
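
The sheet URLs printed above are produced by `urlencode`, which percent-encodes the `:` in `out:csv` and turns spaces and `+` in the sheet name into `+` and `%2B`. A minimal sketch of that step in isolation follows; the helper name `sheet_csv_url` is hypothetical, while the spreadsheet ID and sheet name are taken from the cell above.

```python
from urllib.parse import urlencode

# Spreadsheet ID taken from the cell above.
SHEET_ID = "1LMwztkSRMLfeci_UJ-9cDRgvGHciyV6FU3IyEjHHyM4"


def sheet_csv_url(sheet_id, sheet_name):
    """Build the CSV export URL for one sheet (hypothetical helper)."""
    params = {"tqx": "out:csv", "sheet": sheet_name}
    return f"https://docs.google.com/spreadsheets/d/{sheet_id}/gviz/tq?{urlencode(params)}"


url = sheet_csv_url(SHEET_ID, "Aging + Increase")
print(url)  # matches the second URL printed above
```

Encoding the parameters this way matters for sheet names such as `Aging + Increase`: pasted into the URL unescaped, the literal `+` would be read as a space.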

The format required by the app's `upsert` function is a list of documents like:

```json
[
    {
        "id": "abc",
        "text": "some important document text",
        "metadata": {
            "field1": "optional metadata goes here",
            "field2": 54
        }
    },
    {
        "id": "123",
        "text": "some other important text",
        "metadata": {
            "field1": "another metadata",
            "field2": 71,
            "field3": "not all metadatas need the same structure"
        }
    }
    ...
]
```

Every document *must* have a `"text"` field. The `"id"` and `"metadata"` fields are optional.
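
A quick pre-flight check can catch malformed documents before they are sent. The sketch below only encodes the rule stated above (a required `"text"` field, optional `"id"` and `"metadata"`); `validate_documents` is a hypothetical helper, not part of the retrieval app, and treating an empty string as a missing text is an extra assumption.

```python
def validate_documents(documents):
    """Return a list of problems found; an empty list means the batch looks well-formed."""
    problems = []
    for i, doc in enumerate(documents):
        if not isinstance(doc, dict):
            problems.append(f"document {i}: expected a dict, got {type(doc).__name__}")
            continue
        text = doc.get("text")
        # Assumption: an empty or whitespace-only text is as useless as a missing one.
        if not isinstance(text, str) or not text.strip():
            problems.append(f"document {i}: missing or empty 'text'")
        if "metadata" in doc and not isinstance(doc["metadata"], dict):
            problems.append(f"document {i}: 'metadata' must be a dict")
    return problems


sample = [
    {"id": "abc", "text": "some important document text", "metadata": {"field2": 54}},
    {"id": "123", "metadata": {"field1": "no text here"}},
]
print(validate_documents(sample))
```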

In [32]:
# Replace NaN with empty strings; otherwise the upsert request fails with
# "ValueError: Out of range float values are not JSON compliant"
df = df.fillna('')

df.columns

Index(['Rank', 'PMID', 'Title', 'Year', 'Journal', 'URL', 'Research Article',
       'Clinical Article', 'Number of Citations from Clinical Articles',
       'Human', 'Animal', 'Molecular/Cellular', 'NIH Percentile',
       'Number of Citations', 'Approximate Potential to Translate',
       'Parameter in Title', 'Keywords in Title',
       'Parameter and Keywords in Title',
       'Cooccurance of Parameter and Keywords',
       'Sentences with Parameter and Keywords', 'Sentences with Parameter',
       'Sentences with Keywords', 'Abstract', 'Relevant', 'Comment'],
      dtype='object')
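
The `ValueError` mentioned in the cell above arises because NaN has no representation in strict JSON. The sketch below reproduces it with the standard-library `json` module and `allow_nan=False`; that the HTTP client serializes the payload with the same strictness is an assumption based on the error message. The record is made up for illustration.

```python
import json
import math

# Made-up record; NaN stands in for an empty spreadsheet cell.
record = {"PMID": 1, "Human": math.nan}

try:
    # allow_nan=False enforces strict JSON, which has no NaN literal
    json.dumps(record, allow_nan=False)
    strict_ok = True
except ValueError as err:
    strict_ok = False
    print("strict JSON failed:", err)

# Replacing NaN with an empty string mirrors what df.fillna('') does to this record
cleaned = {
    key: ("" if isinstance(value, float) and math.isnan(value) else value)
    for key, value in record.items()
}
payload = json.dumps(cleaned, allow_nan=False)
print(payload)
```

One side effect worth keeping in mind: after `fillna('')`, numeric columns that contained NaN hold a mix of numbers and empty strings, which is one reason the metadata values are converted with `str(...)` later on.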

In [41]:
documents = [
    {
        'id': str(r['PMID']),
        'text': r['Abstract'],
        'metadata': {
            'url': str(r['URL']),
            'PMID': str(r['PMID']),
            'Rank': str(r['Rank']),
            'Title': str(r['Title']),
            'Year': str(r['Year']),
            'Journal': str(r['Journal']),
            'Research Article': str(r['Research Article']),
            'Clinical Article': str(r['Clinical Article']),
            'Number of Citations from Clinical Articles': str(r['Number of Citations from Clinical Articles']),
            'Human': str(r['Human']),
            'Animal': str(r['Animal']),
            'Molecular/Cellular': str(r['Molecular/Cellular']),
            'NIH Percentile': str(r['NIH Percentile']),
            'Number of Citations': str(r['Number of Citations']),
            'Approximate Potential to Translate': str(r['Approximate Potential to Translate'])
        }
    } for r in df.to_dict(orient='records')
]
documents[0]

{'id': '3335748',
 'text': 'To clarify the relation between age and serum albumin, measures were obtained on a screened population of 1066 healthy males in the Normative Aging Study. Multiple regression analysis shows only a slight decline in albumin of 0.054 gm/dl per decade with R = -0.12 (p less than .001) on cross-sectional data. This small decline occurs entirely within the range of normal, contrary to many previous reports. Mean albumin values were 4.25 (+/- .26 SD) for subjects in the eighth decade and 4.13 (+/- .29 SD) in the ninth decade. Longitudinally, there was an upwards trend in albumin for five birth cohorts over an 8-yr period which may reflect laboratory drift. A multivariate model of cross-sectional data can explain only 5% of the variance. The age-related decline within healthy subjects is far less than previously described. Our data demonstrate that hypoalbuminemia is not a consequence of normal aging.',
 'metadata': {'url': 'https://pubmed.ncbi.nlm.nih.gov/3335748'

In [34]:
type(documents[0])

dict

In [35]:
len(documents)

2454
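
The list comprehension above can also be written as a small function, which makes the choice of ID, text and metadata columns explicit. This is a sketch under assumptions: `row_to_document` is a hypothetical helper, and the example row is made up rather than taken from the spreadsheet.

```python
def row_to_document(row, id_field="PMID", text_field="Abstract", metadata_fields=("URL", "Year")):
    """Convert one record (a dict) into the id/text/metadata shape used above."""
    return {
        "id": str(row[id_field]),
        "text": str(row[text_field]),
        # Metadata values are stringified, as in the cell above.
        "metadata": {name: str(row[name]) for name in metadata_fields},
    }


# Made-up row with the same column names as the DataFrame.
example_row = {"PMID": 1, "Abstract": "example abstract", "URL": "https://example.org/1", "Year": 2000}
doc = row_to_document(example_row)
print(doc)
```

With a DataFrame, the same function could be applied as `[row_to_document(r) for r in df.to_dict(orient='records')]`.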

### Indexing the Docs

We're now ready to begin indexing (or *upserting*) our `documents`. To make these requests to the retrieval app API, we will need to provide authorization in the form of the `BEARER_TOKEN` we set earlier. We do this below:

In [36]:
import os

# Read the token from the environment; never hard-code it in the notebook.
BEARER_TOKEN = os.environ["BEARER_TOKEN"]

Use the `BEARER_TOKEN` to create our authorization `headers`:

In [37]:
headers = {
    "Authorization": f"Bearer {BEARER_TOKEN}"
}
# The headers contain the secret token, so we do not echo them in the notebook output.

We'll perform the upsert in batches of `batch_size`. Make sure that the `endpoint_url` variable is set to the correct location for your running *retrieval-app* API.

In [42]:
from tqdm.auto import tqdm
import requests
from requests.adapters import HTTPAdapter, Retry

batch_size = 100
endpoint_url = "https://chatgpt-longevity-plugin-2exib.ondigitalocean.app"
s = requests.Session()

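# set up a retry strategy for 5xx errors
# note: urllib3 only retries idempotent methods on these status codes by default,
# so POST requests may need `allowed_methods` (urllib3 >= 1.26) to be retried
retries = Retry(
    total=5,  # number of retries before raising an error
    backoff_factor=0.1,
    status_forcelist=[500, 502, 503, 504]
)
# mount the adapter for both schemes; the endpoint above uses https
adapter = HTTPAdapter(max_retries=retries)
s.mount('http://', adapter)
s.mount('https://', adapter)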

for i in tqdm(range(0, len(documents), batch_size)):
    i_end = min(len(documents), i+batch_size)
    print(i)
    # make the POST request for this batch
    res = s.post(
        f"{endpoint_url}/upsert",
        headers=headers,
        json={
            "documents": documents[i:i_end]
        }
    )
    # stop on the first failed batch instead of silently skipping it
    res.raise_for_status()

  0%|          | 0/25 [00:00<?, ?it/s]

0
100
200
300
400
500
600
700
800
900
1000
1100
1200
1300
1400
1500
1600
1700
1800
1900
2000
2100
2200
2300
2400
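
The batch boundaries printed above follow directly from `range(0, len(documents), batch_size)`. A minimal sketch of the slicing logic on its own, with no network calls, is shown below; `batch_ranges` is a hypothetical helper, and the numbers are the document count and batch size from this notebook.

```python
def batch_ranges(n_items, batch_size):
    """Yield (start, end) index pairs covering range(n_items) in chunks of batch_size."""
    for start in range(0, n_items, batch_size):
        yield start, min(n_items, start + batch_size)


ranges = list(batch_ranges(2454, 100))
print(len(ranges))  # number of upsert requests
print(ranges[-1])   # the final, shorter batch
```

For 2454 documents and a batch size of 100 this gives 25 requests, which agrees with the `0/25` shown by the progress bar; the last batch covers indices 2400 to 2453.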


#### Testing the Semantic Search

To query the datastore all we need to do is pass one or more queries to the `/query` endpoint.

In [43]:
queries = [
    {
        'query': 'How does albumin serum levels change as we grow older?'
    },
    {
        'query': 'What can we do to prevent lower albumin levels in the elderly?',
    },
    {
        'query': 'What does it mean if a young person has low albumin levels?'
    }
]
# `queries` is already in the structure needed by the /query endpoint:
# a list of {'query': ...} dicts

In [44]:
res = requests.post(
    f"{endpoint_url}/query",
    headers=headers,
    json={
        'queries': queries
    }
)
res

<Response [200]>

In [45]:
res.json()['results']

[{'query': 'How does albumin serum levels change as we grow older?',
  'results': [{'id': '3335748_0',
    'text': 'To clarify the relation between age and serum albumin, measures were obtained on a screened population of 1066 healthy males in the Normative Aging Study. Multiple regression analysis shows only a slight decline in albumin of 0.054 gm/dl per decade with R = -0.12 (p less than .001) on cross-sectional data. This small decline occurs entirely within the range of normal, contrary to many previous reports. Mean albumin values were 4.25 (+/- .26 SD) for subjects in the eighth decade and 4.13 (+/- .29 SD) in the ninth decade. Longitudinally, there was an upwards trend in albumin for five birth cohorts over an 8-yr period which may reflect laboratory drift. A multivariate model of cross-sectional data can explain only 5% of the variance. The age-related decline within healthy subjects is far less than previously described.',
    'metadata': {'source': None,
     'source_id': Non

Now we can loop through the responses and see the results returned for each query:

In [46]:
for query_result in res.json()['results']:
    query = query_result['query']
    answers = []
    scores = []
    for result in query_result['results']:
        answers.append(result['text'])
        scores.append(round(result['score'], 2))
    print("-"*70+"\n"+query+"\n\n"+"\n".join([f"{s}: {a}" for a, s in zip(answers, scores)])+"\n"+"-"*70+"\n\n")

----------------------------------------------------------------------
How does albumin serum levels change as we grow older?

0.9: To clarify the relation between age and serum albumin, measures were obtained on a screened population of 1066 healthy males in the Normative Aging Study. Multiple regression analysis shows only a slight decline in albumin of 0.054 gm/dl per decade with R = -0.12 (p less than .001) on cross-sectional data. This small decline occurs entirely within the range of normal, contrary to many previous reports. Mean albumin values were 4.25 (+/- .26 SD) for subjects in the eighth decade and 4.13 (+/- .29 SD) in the ninth decade. Longitudinally, there was an upwards trend in albumin for five birth cohorts over an 8-yr period which may reflect laboratory drift. A multivariate model of cross-sectional data can explain only 5% of the variance. The age-related decline within healthy subjects is far less than previously described.
0.9: While lower serum albumin concentra
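
The loop above relies on the response having a `results` list with one entry per query, each holding its own `results` list of hits with `text` and `score`. The sketch below shows the same traversal as a reusable function; `summarize_results` is a hypothetical helper and the response is made-up data that only mimics those keys.

```python
def summarize_results(response_json, top_k=3, snippet_len=80):
    """Return {query: [(score, snippet), ...]} with hits sorted by descending score."""
    summary = {}
    for query_result in response_json["results"]:
        hits = sorted(query_result["results"], key=lambda hit: hit["score"], reverse=True)
        summary[query_result["query"]] = [
            (round(hit["score"], 2), hit["text"][:snippet_len]) for hit in hits[:top_k]
        ]
    return summary


# Made-up response using the same keys as the real output above.
fake_response = {
    "results": [
        {
            "query": "example question",
            "results": [
                {"text": "less relevant passage", "score": 0.71},
                {"text": "more relevant passage", "score": 0.93},
            ],
        }
    ]
}
summary = summarize_results(fake_response, top_k=1)
print(summary)
```

Truncating each hit to a short snippet keeps the printout readable when the abstracts are several hundred words long.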

#### *LangChain Docs: Alternative Tutorial*

https://python.langchain.com/en/latest/modules/indexes/retrievers/examples/chatgpt-plugin-retriever.html

## Query Data w/ LangChain

In [47]:
# https://python.langchain.com/en/latest/modules/chains/index_examples/vector_db_qa.html#custom-prompts
# Create custom prompt

from langchain.prompts import PromptTemplate
prompt_template = """
You are one of the best medical researchers in the field of disease prevention and longevity in the world.
Use the following pieces of context to answer the question at the end if applicable. Always provide at least three numbered references with the correct title, authors and URL and include direct quotes with the corresponding numbers in your answers.
If you don't know the answer, just say that you don't know, don't try to make up an answer.

{context}

Question: {question}
Answer in English:"""

PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)

chain_type_kwargs = {"prompt": PROMPT}
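
To see what the model eventually receives, it helps to fill a template by hand. The sketch below uses plain `str.format` on a shortened version of the prompt; the retrieved chunks are made up, and joining them with a blank line is an assumption about how a "stuff"-style chain packs context, not a statement about LangChain internals.

```python
template = """Use the following pieces of context to answer the question at the end.

{context}

Question: {question}
Answer in English:"""

# Made-up chunks; in the notebook these come from the retriever.
chunks = ["First retrieved passage.", "Second retrieved passage."]

# Assumption: all chunks are concatenated into a single context block.
context = "\n\n".join(chunks)
filled = template.format(context=context, question="How does albumin relate to aging?")
print(filled)
```

Because every retrieved chunk is placed in one prompt, the number and size of chunks is bounded by the model's context window.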

### Skip: Pinecone Module

In [None]:
retriever = docsearch.as_retriever()

### ChatGPT Module

In [48]:
from langchain.retrievers import ChatGPTPluginRetriever

# Reuse the endpoint URL and token defined earlier instead of hard-coding them
retriever = ChatGPTPluginRetriever(url=endpoint_url, bearer_token=BEARER_TOKEN)

### Results

In [49]:
qa = RetrievalQA.from_chain_type(
    llm=OpenAI(),
    chain_type="stuff",
    retriever=retriever,
    chain_type_kwargs=chain_type_kwargs,
    return_source_documents=True,
)

# query = "How does albumin serum levels change as we grow older? Provide quotes and at least three references"
query = "How does albumin relate to aging?"
result = qa({"query": query})

In [50]:
print(result["result"])

 Albumin levels tend to decrease with age, although the rate of decrease is small. Studies have shown that a decrease in serum albumin concentration in community-dwelling elderly people is associated with aging, and that albumin variability is independently associated with an individual's aging and death process. However, because the relationship to age is small, the finding of hypoalbuminemia in an elderly patient should generally be attributed to disease rather than age alone.


In [51]:
result["source_documents"]

[Document(page_content="Different from the albumin levels, albumin variability was not dependent on chronological age but was independently associated with an individual's aging and death process.  CONCLUSION: The observed upward trend in albumin variability seems to be consistent with a presumed aging-related decline in homeostatic capacity.", metadata={'id': '28953942_2', 'metadata': {'source': None, 'source_id': None, 'url': 'https://pubmed.ncbi.nlm.nih.gov/28953942', 'created_at': None, 'author': None, 'document_id': '28953942'}, 'embedding': None, 'score': 0.896775067}),
 Document(page_content='In conclusion, a fall in serum albumin concentration in community-dwelling, self-supported elderly persons was associated significantly with aging.', metadata={'id': '17484377_2', 'metadata': {'source': None, 'source_id': None, 'url': 'https://pubmed.ncbi.nlm.nih.gov/17484377', 'created_at': None, 'author': None, 'document_id': '17484377'}, 'embedding': None, 'score': 0.893465698}),
 Docume
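
In the output above, the source URL sits one level down, under `metadata['metadata']['url']`. The sketch below collects the distinct URLs so they can be listed as references; it works on plain dicts standing in for each document's `metadata` attribute, with IDs, URLs and scores copied from the output above. `unique_source_urls` is a hypothetical helper.

```python
# Plain dicts standing in for the `metadata` attribute of each returned Document.
source_metadata = [
    {
        "id": "28953942_2",
        "metadata": {"url": "https://pubmed.ncbi.nlm.nih.gov/28953942", "document_id": "28953942"},
        "score": 0.896775067,
    },
    {
        "id": "17484377_2",
        "metadata": {"url": "https://pubmed.ncbi.nlm.nih.gov/17484377", "document_id": "17484377"},
        "score": 0.893465698,
    },
]


def unique_source_urls(metadata_list):
    """Collect distinct source URLs in order of first appearance."""
    seen = []
    for meta in metadata_list:
        url = meta.get("metadata", {}).get("url")
        if url and url not in seen:
            seen.append(url)
    return seen


print(unique_source_urls(source_metadata))
```

With the real result, the input would be `[doc.metadata for doc in result["source_documents"]]`. Deduplicating matters because several chunks of the same abstract can be returned for one query.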

# *Archive: Using the Pinecone Retrieval App*

## Required Libraries

This notebook needs a few Python libraries, which we `pip install` here:

In [None]:
!pip install -qU datasets pandas tqdm langchain

## Preparing Data manually

In this walkthrough we will see how to use the retrieval API with a Pinecone datastore for *semantic search / question-answering*.

Before running this notebook you should have already initialized the retrieval API and have it running locally or elsewhere. The full instructions for doing this are found in the project README.

We will summarize the instructions (specific to the Pinecone datastore) before moving on to the walkthrough.

### Use Pandas CSV Loader

In [None]:
import pandas as pd
df = pd.read_csv("./albumin_aging.csv")
df.dtypes

Rank                                            int64
PMID                                            int64
Title                                          object
Year                                            int64
Journal                                        object
URL                                            object
Research Article                               object
Clinical Article                               object
Number of Citations from Clinical Articles      int64
Human                                         float64
Animal                                        float64
Molecular/Cellular                            float64
NIH Percentile                                float64
Number of Citations                             int64
Approximate Potential to Translate            float64
Parameter in Title                             object
Keywords in Title                              object
Parameter and Keywords in Title                object
Cooccurance of Parameter and

The format required by the app's `upsert` function is a list of documents like:

```json
[
    {
        "id": "abc",
        "text": "some important document text",
        "metadata": {
            "field1": "optional metadata goes here",
            "field2": 54
        }
    },
    {
        "id": "123",
        "text": "some other important text",
        "metadata": {
            "field1": "another metadata",
            "field2": 71,
            "field3": "not all metadatas need the same structure"
        }
    }
    ...
]
```

Every document *must* have a `"text"` field. The `"id"` and `"metadata"` fields are optional.

To create this format from our CSV data we do:

In [None]:
# Replace NaN with empty strings; otherwise the upsert request fails with
# "ValueError: Out of range float values are not JSON compliant"
df = df.fillna('')

df.columns

Index(['Rank', 'PMID', 'Title', 'Year', 'Journal', 'URL', 'Research Article',
       'Clinical Article', 'Number of Citations from Clinical Articles',
       'Human', 'Animal', 'Molecular/Cellular', 'NIH Percentile',
       'Number of Citations', 'Approximate Potential to Translate',
       'Parameter in Title', 'Keywords in Title',
       'Parameter and Keywords in Title',
       'Cooccurance of Parameter and Keywords',
       'Sentences with Parameter and Keywords', 'Sentences with Parameter',
       'Sentences with Keywords', 'Abstract', 'Relevant', 'Comment'],
      dtype='object')

In [None]:
documents = [
    {
        'id': str(r['PMID']),
        'text': r['Abstract'],
        'metadata': {
            'Rank': str(r['Rank']),
            'Title': str(r['Title']),
            'Year': str(r['Year']),
            'Journal': str(r['Journal']),
            'URL': str(r['URL']),
            'Research Article': str(r['Research Article']),
            'Clinical Article': str(r['Clinical Article']),
            'Number of Citations from Clinical Articles': str(r['Number of Citations from Clinical Articles']),
            'Human': str(r['Human']),
            'Animal': str(r['Animal']),
            'Molecular/Cellular': str(r['Molecular/Cellular']),
            'NIH Percentile': str(r['NIH Percentile']),
            'Number of Citations': str(r['Number of Citations']),
            'Approximate Potential to Translate': str(r['Approximate Potential to Translate']),
            'Parameter in Title': str(r['Parameter in Title']),
            'Keywords in Title': str(r['Keywords in Title']),
            'Parameter and Keywords in Title': str(r['Parameter and Keywords in Title']),
            'Cooccurance of Parameter and Keywords': str(r['Cooccurance of Parameter and Keywords']),
            'Sentences with Parameter and Keywords': str(r['Sentences with Parameter and Keywords']),
            'Sentences with Parameter': str(r['Sentences with Parameter']),
            'Sentences with Keywords': str(r['Sentences with Keywords']),
        }
    } for r in df.to_dict(orient='records')
]
documents[3]

In [None]:
type(documents[0])

dict

### Indexing the Docs

We're now ready to begin indexing (or *upserting*) our `documents`. To make these requests to the retrieval app API, we will need to provide authorization in the form of the `BEARER_TOKEN` we set earlier. We do this below:

In [None]:
import os

# Read the token from the environment; never hard-code it in the notebook.
BEARER_TOKEN = os.environ["BEARER_TOKEN"]

Use the `BEARER_TOKEN` to create our authorization `headers`:

In [None]:
headers = {
    "Authorization": f"Bearer {BEARER_TOKEN}"
}
# The headers contain the secret token, so we do not echo them in the notebook output.

We'll perform the upsert in batches of `batch_size`. Make sure that the `endpoint_url` variable is set to the correct location for your running *retrieval-app* API.

In [None]:
from tqdm.auto import tqdm
import requests
from requests.adapters import HTTPAdapter, Retry

batch_size = 100
endpoint_url = "https://chatgpt-longevity-plugin-2exib.ondigitalocean.app"
s = requests.Session()

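# set up a retry strategy for 5xx errors
# note: urllib3 only retries idempotent methods on these status codes by default,
# so POST requests may need `allowed_methods` (urllib3 >= 1.26) to be retried
retries = Retry(
    total=5,  # number of retries before raising an error
    backoff_factor=0.1,
    status_forcelist=[500, 502, 503, 504]
)
# mount the adapter for both schemes; the endpoint above uses https
adapter = HTTPAdapter(max_retries=retries)
s.mount('http://', adapter)
s.mount('https://', adapter)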

for i in tqdm(range(0, len(documents), batch_size)):
    i_end = min(len(documents), i+batch_size)
    print(i)
    # make the POST request for this batch
    res = s.post(
        f"{endpoint_url}/upsert",
        headers=headers,
        json={
            "documents": documents[i:i_end]
        }
    )
    # stop on the first failed batch instead of silently skipping it
    res.raise_for_status()

  0%|          | 0/13 [00:00<?, ?it/s]

0
100
200
300
400
500
600
700
800
900
1000
1100
1200


### Testing the Semantic Search

To query the datastore all we need to do is pass one or more queries to the `/query` endpoint.

In [None]:
queries = [
    {
        'query': 'How does albumin serum levels change as we grow older?'
    },
    {
        'query': 'What can we do to prevent lower albumin levels in the elderly?',
    },
    {
        'query': 'What does it mean if a young person has low albumin levels?'
    }
]
# `queries` is already in the structure needed by the /query endpoint:
# a list of {'query': ...} dicts

In [None]:
res = requests.post(
    f"{endpoint_url}/query",
    headers=headers,
    json={
        'queries': queries
    }
)
res

<Response [200]>

In [None]:
res.json()['results']

Now we can loop through the responses and see the results returned for each query:

In [None]:
for query_result in res.json()['results']:
    query = query_result['query']
    answers = []
    scores = []
    for result in query_result['results']:
        answers.append(result['text'])
        scores.append(round(result['score'], 2))
    print("-"*70+"\n"+query+"\n\n"+"\n".join([f"{s}: {a}" for a, s in zip(answers, scores)])+"\n"+"-"*70+"\n\n")

The top results are all relevant as we would have hoped. With that we've finished. The retrieval app API can be shut down, and to save resources the Pinecone index can be deleted within the [Pinecone console](https://app.pinecone.io/).