## Generating Lucene Query Language queries from free text input

This notebook takes free-text input, uses it to generate Lucene Query Language statements, and runs them against the RSpace ELN API to retrieve documents. It uses OpenAI's GPT-4 model and its function-calling API.

Lucene syntax is powerful, but it is hard to remember well enough to write a valid complex search query by hand.

This notebook lets you write the query as free text instead.

For example:

    Find docs tagged with X and Y but not Z, oldest first
    Find "some phrase" in full text, and tagged with Y, alphabetical order
    Find docs where name starts with XXXX, newest first

Note that your RSpace data **is not** sent to OpenAI's servers, only your query.
The actual search of RSpace is performed by this script.

To run this you'll need:

* An OpenAI API key
* An account on https://community.researchspace.com, and an RSpace API key. (The account is free to set up.)
* The Python RSpace client, the OpenAI package and various dependencies:
 - `pip install rspace_client notebook requests openai scipy tenacity tiktoken termcolor`
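
Before running the cells, it can help to confirm that the environment variables this notebook reads are actually set. The following is a minimal sketch, assuming the three variable names used in the cells below (`RSPACE_URL`, `RSPACE_API_KEY`, `OPENAI_API_KEY`); the helper `missing_env_vars` is hypothetical and not part of any library.

```python
import os

# Variable names taken from the setup cell of this notebook.
REQUIRED_VARS = ("RSPACE_URL", "RSPACE_API_KEY", "OPENAI_API_KEY")


def missing_env_vars(names=REQUIRED_VARS, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All required environment variables are set.")
```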

### Setup

Here we import everything we need and check the RSpace API connection:

In [12]:
import os
from rspace_client.eln import eln
import json
import openai
import requests
from termcolor import colored
from tenacity import retry, wait_random_exponential, stop_after_attempt

eln_cli = eln.ELNClient(os.getenv("RSPACE_URL"), os.getenv("RSPACE_API_KEY"))
print(eln_cli.get_status())
## The OpenAI API key should be set as an environment variable: OPENAI_API_KEY=<YOUR KEY>
GPT_MODEL = "gpt-4"  ## gpt-3.5-turbo is an alternative option.

{'message': 'OK', 'rspaceVersion': '1.91.1'}


### Boilerplate

- sending request to OpenAI API
- pretty-printing results

In [13]:
@retry(wait=wait_random_exponential(multiplier=1, max=40), stop=stop_after_attempt(3))
def chat_completion_request(
    messages, functions=None, function_call=None, model=GPT_MODEL
):
    """POST a chat completion request and return the requests.Response object."""
    headers = {
        "Content-Type": "application/json",
        # Read the key from the environment directly rather than relying on
        # the openai package to have loaded it.
        "Authorization": "Bearer " + os.environ["OPENAI_API_KEY"],
    }
    json_data = {"model": model, "messages": messages}

    if functions is not None:
        json_data["functions"] = functions
    if function_call is not None:
        json_data["function_call"] = function_call
    try:
        response = requests.post(
            "https://api.openai.com/v1/chat/completions",
            headers=headers,
            json=json_data,
        )
        return response
    except Exception as e:
        print("Unable to generate ChatCompletion response")
        print(f"Exception: {e}")
        # Re-raise so that @retry can try again; returning the exception
        # would hide the failure from both tenacity and the caller.
        raise
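
To make the request body easier to inspect, the payload assembly can be pulled out into a pure function and printed without contacting OpenAI. This is only a sketch: `build_payload` is a hypothetical helper that mirrors the logic in `chat_completion_request`, and the function definition passed in is a stub. Passing `function_call={"name": "lucene"}` asks the API to call that specific function rather than letting the model choose.

```python
import json


def build_payload(messages, functions=None, function_call=None, model="gpt-4"):
    """Assemble the JSON body in the same way as chat_completion_request."""
    payload = {"model": model, "messages": messages}
    if functions is not None:
        payload["functions"] = functions
    if function_call is not None:
        payload["function_call"] = function_call
    return payload


example = build_payload(
    [{"role": "user", "content": "docs tagged PCR, newest first"}],
    # Stub definition for illustration; the real one appears later in the notebook.
    functions=[{"name": "lucene", "parameters": {"type": "object", "properties": {}}}],
    function_call={"name": "lucene"},
)
print(json.dumps(example, indent=2))
```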

In [14]:
role_to_color = {
    "system": "red",
    "user": "green",
    "assistant": "blue",
    "function": "magenta",
}


def pretty_print_conversation(messages):
    """
    Prints color-coded messages according to their role
    """

    for message in messages:
        if message["role"] == "system":
            print(
                colored(
                    f"system: {message['content']}\n", role_to_color[message["role"]]
                )
            )
        elif message["role"] == "user":
            print(
                colored(f"user: {message['content']}\n", role_to_color[message["role"]])
            )
        elif message["role"] == "assistant" and message.get("function_call"):
            print(
                colored(
                    f"assistant: {message['function_call']}\n",
                    role_to_color[message["role"]],
                )
            )
        elif message["role"] == "assistant" and not message.get("function_call"):
            print(
                colored(
                    f"assistant: {message['content']}\n", role_to_color[message["role"]]
                )
            )
        elif message["role"] == "function":
            print(
                colored(
                    f"function ({message['name']}): {message['content']}\n",
                    role_to_color[message["role"]],
                )
            )

The next function performs the conversation: it sends the messages and function definitions to OpenAI's Chat Completions API.

If the response contains a function call, that function is invoked. The conversation and the search results are returned as a tuple.
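
The dispatch step can be tried in isolation, without calling OpenAI or RSpace. The sketch below assumes the assistant message has the shape shown in the output at the end of this notebook (a `function_call` with a `name` and a JSON-encoded `arguments` string); `dispatch_function_call` and `fake_search` are hypothetical names used only for illustration.

```python
import json


def dispatch_function_call(message, registry):
    """Invoke the function named in an assistant message, or return None if there is none."""
    call = message.get("function_call")
    if not call:
        return None
    func = registry[call["name"]]
    kwargs = json.loads(call["arguments"])  # the arguments arrive as a JSON string
    return func(**kwargs)


# Stand-in for the real search function, so the example runs offline.
def fake_search(luceneQuery, sort_order="name asc"):
    return {"query": luceneQuery, "order": sort_order}


sample_message = {
    "role": "assistant",
    "content": None,
    "function_call": {
        "name": "lucene",
        "arguments": '{"luceneQuery": "docTag:PCR", "sort_order": "name desc"}',
    },
}
print(dispatch_function_call(sample_message, {"lucene": fake_search}))
```

Looking the function up in a registry keyed by name means the model can only trigger functions that were explicitly listed.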

In [15]:
def do_conversation(messages, functions):
    # Force the model to call the 'lucene' function.
    resp = chat_completion_request(messages, functions, {'name':'lucene'})
    active_messages = messages.copy()
    response_message = resp.json()['choices'][0]['message']
    active_messages.append(response_message)

    # Stays None if the model did not return a function call.
    rspace_search_result = None
    function_call = response_message.get('function_call')
    if function_call is not None:
        f_name = function_call['name']
        f_args = json.loads(function_call['arguments'])
        rspace_search_result = available_functions[f_name](**f_args)
    return (active_messages, rspace_search_result)

### Function definitions

The function we want to call, and its description in JSON Schema.

In [16]:
## This is the function that will be invoked with arguments generated by the AI.
## It makes calls to RSpace's search API.
## Note that date-range based syntax is not supported by RSpace. There also seems to be a limit
## of ~125 characters on the length of the Lucene search query.

def search_rspace_eln(luceneQuery, sort_order="lastModified desc"):
    q = "l: " + luceneQuery
    docs = eln_cli.get_documents(query=q, order_by=sort_order)['documents']
    wanted_keys = ['globalId', 'name', 'tags', 'created']  # The keys we want
    # Keep only the wanted keys from each document.
    summarised = [{k: d[k] for k in wanted_keys if k in d} for d in docs]
    return summarised
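
Because over-long queries appear to be rejected, it may be worth checking the generated query before sending it. The sketch below is one way to do that; `prepare_query` is a hypothetical helper, and the 125-character figure is the approximate limit observed above, not a documented value.

```python
MAX_QUERY_CHARS = 125  # approximate limit observed in this notebook; an assumption


def prepare_query(lucene_query, max_chars=MAX_QUERY_CHARS):
    """Trim whitespace, reject empty or over-long queries, and add the 'l: ' prefix."""
    cleaned = lucene_query.strip()
    if not cleaned:
        raise ValueError("Lucene query is empty")
    if len(cleaned) > max_chars:
        raise ValueError(
            f"Lucene query is {len(cleaned)} characters; the limit is about {max_chars}"
        )
    return "l: " + cleaned


print(prepare_query(' docTag:PCR AND fields.fieldData:"DNA replication" '))
```

Failing early with a clear message is easier to debug than an unexplained error or empty result from the search API.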

Below we describe the function to the model: its name, what it does, and the arguments (as JSON Schema) that we want GPT-4 to return.

In [17]:
available_functions = {
 "lucene":search_rspace_eln
}

functions = [
  {
    "name": "lucene",
    "description": """
    A valid Lucene Query Language string generated from user input.
    Document fields are name, docTag, fields.fieldData, and username.
    Don't use wildcards. Don't state your reasoning.
    """,
    "parameters": {
        "type":"object",
        "properties": {
            "luceneQuery": {
                "type":"string",
                "description":"Valid Lucene Query Language as plain text"
            },
            "sort_order": {
                "type":"string",
                "description":"How results should be sorted",
                "enum":["name asc", "name desc", "created asc", "created desc"]
            }
        },
        "required": ["luceneQuery"]
    }
  }
]
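
The model usually follows the schema, but nothing guarantees it, so the returned arguments can be checked before they reach the search function. This is a minimal hand-rolled sketch, not a full JSON Schema validator; `validate_arguments` is a hypothetical helper, and the schema is a cut-down copy (with a `required` list included for the sake of the example) so that the cell runs on its own.

```python
def validate_arguments(args, parameters):
    """Return a list of problems found when checking args against a simple schema."""
    problems = []
    properties = parameters.get("properties", {})
    for name in parameters.get("required", []):
        if name not in args:
            problems.append(f"missing required argument: {name}")
    for name, value in args.items():
        spec = properties.get(name)
        if spec is None:
            problems.append(f"unexpected argument: {name}")
            continue
        if spec.get("type") == "string" and not isinstance(value, str):
            problems.append(f"{name} should be a string")
        if "enum" in spec and value not in spec["enum"]:
            problems.append(f"{name} must be one of {spec['enum']}")
    return problems


# Cut-down copy of the 'lucene' parameters, repeated so this example is self-contained.
schema = {
    "type": "object",
    "properties": {
        "luceneQuery": {"type": "string"},
        "sort_order": {
            "type": "string",
            "enum": ["name asc", "name desc", "created asc", "created desc"],
        },
    },
    "required": ["luceneQuery"],
}

print(validate_arguments({"luceneQuery": "docTag:PCR", "sort_order": "newest"}, schema))
```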

### Executing the conversation

Change the content of the second message to be something relevant for your work.

Use language that is as precise as you can make it, and see how little you need to type.

Preamble such as `I want to search for documents....` seems mostly unnecessary.
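
To try several phrasings quickly, the two-message conversation can be built from a single string. The sketch below assumes the same system prompt as the cell that follows; `build_messages` is a hypothetical helper, and the example queries are illustrative only.

```python
SYSTEM_PROMPT = "Generate function arguments from user input. Don't show reasoning."


def build_messages(user_text, system_prompt=SYSTEM_PROMPT):
    """Return the two-message conversation for one free-text search request."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_text.strip()},
    ]


# Terse requests without any preamble.
for text in ["tagged PCR not ECL, newest first", 'phrase "DNA replication", alphabetical']:
    print(build_messages(text)[1])
```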


In [18]:
messages = [
 {
     "role" : "system",
     "content": "Generate function arguments from user input. Don't show reasoning."
 },
 {
     "role" : "user",
     "content": """
         I want to search for documents that are tagged with PCR but not ECL, 
         containing the phrase “DNA replication” but not "RNA".
         Reverse alphabetical order
         """
 } 
]


In [19]:
(conversation, results) = do_conversation(messages,functions)
pretty_print_conversation(conversation)
print("Search results from RSpace\n--------------------------")
print(json.dumps(results, indent=2))

system: Generate function arguments from user input. Don't show reasoning.

user: 
         I want to search for documents that are tagged with PCR but not ECL, 
         containing the phrase “DNA replication” but not "RNA"
         List results in reverse alphabetical order
         

assistant: {'name': 'lucene', 'arguments': '\n{\n  "luceneQuery": "docTag:PCR NOT docTag:ECL AND fields.fieldData:\\"DNA replication\\" NOT fields.fieldData:\\"RNA\\"",\n  "sort_order": "name desc"\n}'}

Search results from RSpace
--------------------------
[
  {
    "globalId": "SD1924558",
    "name": "aurora expression analysis on cell growth: second attempt",
    "tags": "polyclonal,PCR",
    "created": "2023-09-17T18:23:38.029Z"
  },
  {
    "globalId": "SD1924288",
    "name": "aurora expression analysis on cell growth",
    "tags": "polyclonal,PCR",
    "created": "2023-09-15T23:47:26.853Z"
  }
]
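
In the results above, `tags` is a comma-separated string and `created` is an ISO 8601 timestamp string. If the results feed into further processing, they can be converted into richer Python types. This is a sketch assuming the field shapes shown in the output above; `tidy_result` is a hypothetical helper.

```python
from datetime import datetime


def tidy_result(doc):
    """Split the comma-separated tags and parse the 'created' timestamp."""
    tags = [t.strip() for t in (doc.get("tags") or "").split(",") if t.strip()]
    # Timestamp format as seen in the output above, e.g. 2023-09-15T23:47:26.853Z
    created = datetime.strptime(doc["created"], "%Y-%m-%dT%H:%M:%S.%fZ")
    return {
        "globalId": doc["globalId"],
        "name": doc["name"],
        "tags": tags,
        "created": created,
    }


sample = {
    "globalId": "SD1924288",
    "name": "aurora expression analysis on cell growth",
    "tags": "polyclonal,PCR",
    "created": "2023-09-15T23:47:26.853Z",
}
print(tidy_result(sample))
```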
