# BTE Query Generator using GPT
- Generates BTE queries from a question using GPT
- Needs a lot more testing
- Currently only supports the following categories
```
List of Categories:
biolink:Disease
biolink:Gene
biolink:SmallMolecule
biolink:Drug
biolink:Protein
biolink:SequenceVariant
```
- Currently only supports the following predicates
```
List of Predicates:
biolink:causes
biolink:caused_by
biolink:participates_in
biolink:treats
biolink:treated_by
biolink:contributes_to
biolink:affects
biolink:related_to
biolink:has_phenotype
biolink:occurs_together_in_literature_with
biolink:regulates
```
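
Because only these categories and predicates are supported, it can help to check a generated query against them before sending it anywhere. The following is a minimal sketch, not part of the notebook's pipeline: the names `SUPPORTED_CATEGORIES`, `SUPPORTED_PREDICATES`, and `find_unsupported` are hypothetical, and `participates_in` is spelled as in the Biolink Model.

```python
# Hypothetical helper: report categories/predicates outside the supported lists.
SUPPORTED_CATEGORIES = {
    "biolink:Disease", "biolink:Gene", "biolink:SmallMolecule",
    "biolink:Drug", "biolink:Protein", "biolink:SequenceVariant",
}
SUPPORTED_PREDICATES = {
    "biolink:causes", "biolink:caused_by", "biolink:participates_in",
    "biolink:treats", "biolink:treated_by", "biolink:contributes_to",
    "biolink:affects", "biolink:related_to", "biolink:has_phenotype",
    "biolink:occurs_together_in_literature_with", "biolink:regulates",
}


def find_unsupported(query):
    """Return (unsupported categories, unsupported predicates) used in a query dict."""
    graph = query["message"]["query_graph"]
    bad_categories = sorted(
        category
        for node in graph["nodes"].values()
        for category in node.get("categories", [])
        if category not in SUPPORTED_CATEGORIES
    )
    bad_predicates = sorted(
        predicate
        for edge in graph["edges"].values()
        for predicate in edge.get("predicates", [])
        if predicate not in SUPPORTED_PREDICATES
    )
    return bad_categories, bad_predicates
```

Two empty lists mean the query stays within the supported vocabulary.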

In [1]:
# set the question we are trying to generate a query for
question = "What compounds are related to diabetes through other genes?"

In [2]:
# initialize the prompts
id_extraction_prompt = """
Your job is to extract the named things from a query and put them in square brackets.
Named things refer to one biological/chemical thing. 
The following (or synonyms) are NOT named things: proteins, genes, molecules, diseases, drugs

Example query: What diseases are caused by cyclin dependent kinase 2?
Your response would be the text in quotes: "[cyclin dependent kinase 2]"
Since disease is a broad category, it is NOT counted as a named thing! However, cyclin dependent kinase 2 is a specific entity so it is a named thing.

Another example query: Which diseases are related to cyclin dependent kinase 2 via a protein?
Your response would be the text in quotes: "[cyclin dependent kinase 2]"
Since disease is a broad category, it is NOT counted as a named thing! Since protein is a broad category, it is NOT counted as a named thing! However, cyclin dependent kinase 2 is a specific entity so it is a named thing.

Another example query: Which genes are related to alzheimer's via a drug?
Your response would be the text in quotes: "[alzheimer's]"
Since genes is a broad category, it is NOT counted as a named thing! Since drug is a broad category, it is NOT counted as a named thing! However, alzheimer's is a specific entity so it is a named thing.

The following (or synonyms) are NOT named things: proteins, genes, molecules, diseases, drugs

USE THE FORMAT BELOW:
[named thing #1] [named thing #2 IF APPLICABLE] ...

Your response should only contain named things! Please list ONLY specific named biological/chemical entities and NOT categories of biological/chemical entities. 

Again, DO NOT respond with a sentence or question or anything else more than one word, just use that exact format with the square brackets. 
DO NOT consider context.
DO NOT include intermediate entities.
"""

json_generation_prompt = """
Your job is to generate JSON based on the query given
List of Categories:
biolink:Disease
biolink:Gene
biolink:SmallMolecule
biolink:Drug
biolink:Protein
biolink:SequenceVariant

List of Predicates:
biolink:causes
biolink:caused_by
biolink:participates_in
biolink:treats
biolink:treated_by
biolink:contributes_to
biolink:affects
biolink:related_to
biolink:has_phenotype
biolink:occurs_together_in_literature_with
biolink:regulates

A query is in the following JSON format (JSON cannot have trailing commas, so make sure to avoid that)
{
    "message": {
         "query_graph": {
             "nodes": {
                "n1": {
                    "categories": ["entity category"],
                    "ids": ["an id"]
                },
                "n2": {
                   "categories": ["entity category"],
                   "ids": ["an id"]
                }
             },
             "edges": {
                "e1": {
                    "subject": "one of the nodes specified in nodes section",
                    "predicates": ["one of the predicates listed above"],
                    "object": "a different node specified in nodes section"
                }
            }
        }
    }
}
For each node, you may specify one or more IDs OR one or more categories. If you use a category you can just directly put the name of the category (as long as it is from the categories list), but if you want to refer to a specific entity [ie. a specific disease, or a specific gene], then it must be converted to an ID.

For example, for the question "What diseases are caused by cyclin dependent kinase 2?", you would know the ID of cyclin dependent kinase 2 is MESH:D051357 (only because I told you). You would then need to figure out the predicate (type of edge) based on the question; in this question it would be "biolink:causes". Then figure out how to order the nodes so that the order makes sense: it should be [subject] [predicate] [object], for example [some gene] [biolink:causes] [some disease] for the question above.

Now you can write it as JSON. For the question above, the JSON would be:
{
    "message": {
        "query_graph": {
            "nodes": {
                "n1": {
                    "ids": ["MESH:D051357"]
                },
                "n2": {
                    "categories": ["biolink:Disease"]
                }
            },
            "edges": {
                "e1": {
                    "subject": "n1",
                    "predicates": ["biolink:causes"],
                    "object": "n2"
                }
           }
        }
    }
}

Additionally, multiple edges can be used if needed for the query. In this case only ONE node needs an ID (as long as all other nodes are directly or indirectly connected to it). For example, for the question Which compounds are related to cyclin dependent kinase 2 via a protein? we could use:
{
    "message": {
        "query_graph": {
            "nodes": {
                "n1": {
                    "ids": ["MESH:D051357"]
                },
                "n2": {
                    "categories": ["biolink:Protein"]
                },
                "n3": {
                    "categories": ["biolink:SmallMolecule"]
                }
            },
            "edges": {
                "e1": {
                    "subject": "n1",
                    "predicates": ["biolink:related_to"],
                    "object": "n2"
                },
                "e2": {
                    "subject": "n2",
                    "predicates": ["biolink:related_to"],
                    "object": "n3"
                }
            }
        }
    }
}

Please add "knowledge_type": "inferred" to the edge ONLY for the following special cases (ensure that it EXACTLY matches these cases) AND when the query has EXACTLY one edge:
1. The subject of the edge is a SmallMolecule or Drug, the predicate is biolink:affects, and the object is a Gene or Protein
2. The subject of the edge is a SmallMolecule or Drug, the predicate is biolink:treats, and the object is a Disease

This is an example of adding "knowledge_type": "inferred"
"e1": {
...
"knowledge_type": "inferred"
}

Also, use the SIMPLEST query possible to answer the question. ONLY answer with the JSON (NO OTHER TEXT)
"""

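The `"knowledge_type": "inferred"` rule in the prompt above can also be checked in code after the model responds. This is a minimal sketch under stated assumptions: `qualifies_for_inferred` is a hypothetical helper, and it only looks at the categories declared on each node, so a node given only by ID never matches.

```python
# Hypothetical check for the two "inferred" special cases described in the prompt.
CHEMICAL = {"biolink:SmallMolecule", "biolink:Drug"}
INFERRED_CASES = [
    ("biolink:affects", {"biolink:Gene", "biolink:Protein"}),
    ("biolink:treats", {"biolink:Disease"}),
]


def qualifies_for_inferred(query):
    """Return True if a query dict matches one of the special cases exactly."""
    graph = query["message"]["query_graph"]
    edges = list(graph["edges"].values())
    if len(edges) != 1:
        # the rule only applies to queries with exactly one edge
        return False
    edge = edges[0]
    subject_cats = set(graph["nodes"][edge["subject"]].get("categories", []))
    object_cats = set(graph["nodes"][edge["object"]].get("categories", []))
    if not subject_cats or not subject_cats <= CHEMICAL:
        return False
    for predicate, allowed_objects in INFERRED_CASES:
        if (edge.get("predicates") == [predicate]
                and object_cats and object_cats <= allowed_objects):
            return True
    return False
```

Comparing this result with whether the generated edge actually carries `"knowledge_type": "inferred"` is one way to spot cases where the model applied the rule too broadly.
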
In [4]:
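# set up openai
# note: openai.ChatCompletion and openai.error (used below) belong to the pre-1.0 openai package interface
import requests
import openai
openai.api_key = "YOUR OPENAI KEY"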

In [5]:
# function to extract things we need to get IDs for (using gpt)
# GPT will respond like [named thing] [another named thing] ..., this function converts this syntax into a list
def extract_ids(question):
    messages_list = [{"role": "system", "content": id_extraction_prompt}, {"role": "user", "content": question}]
    chat_completion = openai.ChatCompletion.create(model="gpt-3.5-turbo", temperature=0, messages=messages_list)

    ids = []
    active = False  # True while we are between '[' and ']'
    for ch in chat_completion.choices[0].message.content:
        if ch == '[':
            # start collecting a new named thing
            active = True
            ids.append('')
        elif ch == ']':
            active = False
        elif active:
            ids[-1] += ch

    return ids
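
The bracket parsing can be tried without calling GPT. The sketch below is an alternative that uses a regular expression; `parse_bracketed` is a hypothetical name, and unlike the loop above it strips surrounding whitespace and drops empty brackets.

```python
import re


def parse_bracketed(text):
    """Return the non-empty strings found between square brackets in text."""
    # [^\[\]]+ matches one or more characters that are not brackets
    matches = re.findall(r"\[([^\[\]]+)\]", text)
    return [m.strip() for m in matches if m.strip()]


print(parse_bracketed("[cyclin dependent kinase 2] [alzheimer's]"))
```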

In [6]:
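# function to resolve IDs for names using SRI Name resolution
def resolve_ids(ids):
    url = "https://name-resolution-sri.renci.org/lookup"
    prefix_str = ""
    for name in ids:
        # let requests URL-encode the name (handles spaces, apostrophes, etc.)
        res = requests.post(url, params={"string": name, "offset": 0, "limit": 1})
        # take the first key of the response as the ID; skip names with no match
        curie = next(iter(res.json()), None)
        if curie is not None:
            prefix_str += name + "=" + curie + "\n"
    return prefix_str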
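
Names such as "alzheimer's" contain characters other than spaces that need escaping in a URL. The following is a small sketch of building the lookup URL with the standard library; `build_lookup_url` is a hypothetical helper, and the endpoint and parameter names are simply those used in the cell above.

```python
from urllib.parse import quote, urlencode

LOOKUP_URL = "https://name-resolution-sri.renci.org/lookup"


def build_lookup_url(name, limit=1):
    """Build the name-resolution lookup URL with all parameters percent-encoded."""
    # quote_via=quote encodes spaces as %20 (the default quote_plus would use '+')
    query = urlencode({"string": name, "offset": 0, "limit": limit}, quote_via=quote)
    return LOOKUP_URL + "?" + query


print(build_lookup_url("alzheimer's disease"))
```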

In [7]:
# function to get json (using gpt) when IDs have been resolved
def get_json(question, resolved_ids):
    messages_list = [{"role": "system", "content": json_generation_prompt}, {"role": "assistant", "content": resolved_ids}, {"role": "user", "content": question}]
    chat_completion = openai.ChatCompletion.create(model="gpt-3.5-turbo", temperature=0, messages=messages_list)
    
    return chat_completion.choices[0].message.content
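
`get_json` returns the model's reply as a plain string, so nothing guarantees it is valid JSON. A minimal sketch of a check that could run on that string follows; `parse_query_json` is a hypothetical helper, and it assumes the reply contains only the JSON object, as the prompt requests.

```python
import json


def parse_query_json(raw):
    """Parse the reply and check that every edge refers to defined nodes."""
    # json.loads rejects trailing commas and other syntax errors with
    # json.JSONDecodeError, which is a subclass of ValueError
    query = json.loads(raw)
    graph = query["message"]["query_graph"]
    for edge_key, edge in graph["edges"].items():
        for end in ("subject", "object"):
            if edge[end] not in graph["nodes"]:
                raise ValueError(f"edge {edge_key} refers to unknown node {edge[end]!r}")
    return query
```

A reply that is missing `message`, `query_graph`, `nodes`, or `edges` raises `KeyError` in this sketch.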

In [8]:
# main function which uses all the other functions to get question --> Query JSON for BTE
def question_to_json(question):
    output = 'Server error'
    try: 
        print('Question: ' + question)
        ids = extract_ids(question)
        print("IDs: " + str(ids))
        resolved_ids = resolve_ids(ids)
        print("Resolved IDs: " + resolved_ids)
        output = get_json(question, resolved_ids)
        print("JSON: \n" + output)
    except openai.error.RateLimitError:
        output = 'OpenAI rate limit reached (wait about 1 minute and retry)'
        print(output)
        
    return output

In [9]:
# get BTE input JSON from our question
question_to_json(question)

Question: What compounds are related to diabetes through other genes?
IDs: ['diabetes']
Resolved IDs: diabetes=MONDO:0005015

JSON: 
{
    "message": {
        "query_graph": {
            "nodes": {
                "n1": {
                    "ids": ["MONDO:0005015"]
                },
                "n2": {
                    "categories": ["biolink:Gene"]
                },
                "n3": {
                    "categories": ["biolink:SmallMolecule"]
                }
            },
            "edges": {
                "e1": {
                    "subject": "n1",
                    "predicates": ["biolink:related_to"],
                    "object": "n2"
                },
                "e2": {
                    "subject": "n2",
                    "predicates": ["biolink:related_to"],
                    "object": "n3"
                }
            }
        }
    }
}


'{\n    "message": {\n        "query_graph": {\n            "nodes": {\n                "n1": {\n                    "ids": ["MONDO:0005015"]\n                },\n                "n2": {\n                    "categories": ["biolink:Gene"]\n                },\n                "n3": {\n                    "categories": ["biolink:SmallMolecule"]\n                }\n            },\n            "edges": {\n                "e1": {\n                    "subject": "n1",\n                    "predicates": ["biolink:related_to"],\n                    "object": "n2"\n                },\n                "e2": {\n                    "subject": "n2",\n                    "predicates": ["biolink:related_to"],\n                    "object": "n3"\n                }\n            }\n        }\n    }\n}'
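
To read a generated query at a glance, the returned string can be turned into one line per edge. This is a small illustrative sketch; `describe_query` is a hypothetical helper and assumes the string parses as JSON in the format shown above.

```python
import json


def describe_query(query_json):
    """Return one 'subject -[predicates]-> object' line per edge of the query."""
    graph = json.loads(query_json)["message"]["query_graph"]

    def label(node_key):
        # prefer the node's IDs, then its categories, then the bare node key
        node = graph["nodes"][node_key]
        return ", ".join(node.get("ids") or node.get("categories") or [node_key])

    lines = []
    for edge in graph["edges"].values():
        predicates = ", ".join(edge["predicates"])
        lines.append(f"{label(edge['subject'])} -[{predicates}]-> {label(edge['object'])}")
    return lines
```

For the query above, this gives `MONDO:0005015 -[biolink:related_to]-> biolink:Gene` followed by `biolink:Gene -[biolink:related_to]-> biolink:SmallMolecule`.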