# Querying Chromeleon Data

## Configure Python Imports

To run this notebook, we need the following libraries:

- requests: Makes HTTP requests to the TDP API
- json: Parses JSON text into Python objects and serializes them back
- pandas: Stores data in tabular structures
- numpy: Open-source library for numerical computation
- matplotlib: Creates visualizations of your data

In [None]:
import requests, json, pandas as pd, numpy as np, matplotlib.pyplot as plt
%matplotlib inline

## Configure Connection Variables

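Create and store the information needed to connect to the TDP API. Fill in `api_root`, `org_slug`, and `user_token` with the values for your environment before running the cells.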

In [None]:
api_root = ""
search_url = api_root + "/v1/datalake/searchEql"
retrieve_file = api_root + "/v1/datalake/retrieve"
print(search_url)
print(retrieve_file)

In [None]:
org_slug = ""
user_token = ""
headers = {"x-org-slug": org_slug, "ts-auth-token": user_token}
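
Hard-coding a token in a notebook makes it easy to leak when the notebook is shared. As a minimal sketch, the same headers could be built from environment variables instead. The helper `build_headers` and the variable names `TDP_ORG_SLUG` and `TDP_AUTH_TOKEN` are hypothetical choices for this example, not part of the TDP API.

```python
import os


def build_headers(env=None):
    """Build the TDP request headers from an environment-style mapping.

    TDP_ORG_SLUG and TDP_AUTH_TOKEN are hypothetical names chosen for this sketch.
    """
    env = os.environ if env is None else env
    required = ("TDP_ORG_SLUG", "TDP_AUTH_TOKEN")
    missing = [name for name in required if not env.get(name)]
    if missing:
        # Fail before any request is sent with empty credentials
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return {
        "x-org-slug": env["TDP_ORG_SLUG"],
        "ts-auth-token": env["TDP_AUTH_TOKEN"],
    }
```

With both variables set, `headers = build_headers()` could replace the cell above.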

## Create Query to Search for All Injections in a Project

The TDP API uses Elasticsearch to index the Chromeleon content. This allows advanced searches against the data to find the information relevant to your use case.

In this scenario, we are creating a query to find all of the Chromeleon Data for a specific sequence/folder in a Data Vault.

To see all available sequences/folders, navigate to the Search File tab in the TetraScience Scientific Data Cloud. Click Browse, next to List in the top right, then navigate down the folder list to find the sequences.

Here we grab all the injections from the project under "SEQUENCE_NAME.seq" by searching for all files with a file path that contains the sequence/project/directory name. Change the sequence name to be relevant to your data.

We also limit the search to files whose IDS was created by a specific pipeline.

In [None]:
payload = {
    "size": 200,
    "query": {
        "bool": {
            "must": [
                {"term": {"idsType": "thermofisher-chromeleon"}},
                {"term": {"source.name": "PIPELINE_THAT_CREATED_CHROMELEON_IDS"}},
                {"wildcard": {"file.path": "*SEQUENCE_NAME.seq*"}},
            ]
        }
    },
    "_source": ["fileId", "filePath", "labels"],
}

In [None]:
payload
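
Because the sequence and pipeline names change from one analysis to the next, it can help to generate the query rather than edit the dictionary by hand. The sketch below builds the same payload as above from two arguments; `build_payload` is a hypothetical helper, and its field names are copied from the query in this notebook.

```python
def build_payload(sequence_name, pipeline_name, size=200):
    """Return a search payload for all Chromeleon IDS files in one sequence.

    sequence_name is matched as a substring of the file path,
    e.g. "SEQUENCE_NAME.seq".
    """
    return {
        "size": size,
        "query": {
            "bool": {
                "must": [
                    {"term": {"idsType": "thermofisher-chromeleon"}},
                    {"term": {"source.name": pipeline_name}},
                    {"wildcard": {"file.path": f"*{sequence_name}*"}},
                ]
            }
        },
        "_source": ["fileId", "filePath", "labels"],
    }
```

Note that `size` caps the number of hits returned, so a sequence with more injections than `size` would be truncated.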

## Run Search Request and Display Results

We now use the requests library to send the query to the TDP API, using the connection variables and the payload configured above.

In [None]:
request = requests.post(search_url, json=payload, headers=headers)
request.raise_for_status()  # stop here on an HTTP error instead of failing later on a missing key

In [None]:
result = request.json()['hits']['hits']

# print the first result to see structure
result[0]
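
Indexing `result[0]` raises an `IndexError` when the search matches nothing, which is easy to trigger with a mistyped sequence or pipeline name. A small guard, sketched here under the assumption that the response body has the `hits.hits` shape used above, gives a clearer message; `extract_hits` is a hypothetical helper.

```python
def extract_hits(body):
    """Return the list of hits from a search response body, or raise if empty."""
    hits = body.get("hits", {}).get("hits", [])
    if not hits:
        raise ValueError(
            "search returned no hits; check the sequence and pipeline names"
        )
    return hits
```

It could be used as `result = extract_hits(request.json())`.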

We can look at all the run names of the sequence by printing the run labels out:

In [None]:
for run in result:
    labels = run["_source"]["labels"]
    # Default to None so a file without a run_name label does not raise StopIteration
    run_name = next((item["value"] for item in labels if item["name"] == "run_name"), None)
    print(run_name)

To restrict the analysis to specific runs, filter `result` on the `run_name` label at this point.
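
One possible way to do that filtering is sketched below. It assumes each hit carries its labels as a list of `{"name": ..., "value": ...}` items, as in the loop above; `get_label` and `filter_runs` are hypothetical helper names.

```python
def get_label(hit, name):
    """Return the value of the first label called `name`, or None if absent."""
    labels = hit.get("_source", {}).get("labels", [])
    return next((item["value"] for item in labels if item["name"] == name), None)


def filter_runs(hits, wanted_run_names):
    """Keep only the hits whose run_name label is in wanted_run_names."""
    wanted = set(wanted_run_names)
    return [hit for hit in hits if get_label(hit, "run_name") in wanted]
```

For example, `result = filter_runs(result, ["RUN_A", "RUN_B"])` would keep only those two runs, where the run names are placeholders for your own.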

## Retrieve the Amounts of Chemicals in one Sequence

In [None]:
first_file_id = result[0]["_source"]["fileId"]
print(first_file_id)

In [None]:
# Pass fileId through params so requests URL-encodes it
first_file = requests.get(retrieve_file, params={"fileId": first_file_id}, headers=headers)

In [None]:
IDS_info = json.loads(first_file.text)

In [None]:
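# Index 1 selects the second entry in the IDS "results" list
peaks = IDS_info["results"][1]["peaks"]
chemicals = [peak["name"]["value"] for peak in peaks]
amounts = [peak["amount"]["value"] for peak in peaks]
# A DataFrame keeps the amounts numeric instead of mixing them with strings in one array
pd.DataFrame({"chemical": chemicals, "amount": amounts})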

## Ways to Extend This

*   Reproduce the Chromeleon Peak Results table from the peak data in the IDS
*   Compare levels of chemical compounds across runs
*   Predict unknown chemicals from unknown runs
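
As a starting point for comparing compound levels across runs, the sketch below collects the peak amounts of several IDS documents into one nested dictionary. It assumes every document has the `results[...]["peaks"]` layout used earlier in this notebook; `peak_amounts` and `compare_runs` are hypothetical helpers, and the default `result_index=1` only mirrors the index used above.

```python
def peak_amounts(ids_doc, result_index=1):
    """Map chemical name -> amount for one IDS document."""
    peaks = ids_doc["results"][result_index]["peaks"]
    return {peak["name"]["value"]: peak["amount"]["value"] for peak in peaks}


def compare_runs(ids_by_run, result_index=1):
    """Return {chemical: {run_name: amount}} for a mapping of run_name -> IDS document."""
    table = {}
    for run_name, ids_doc in ids_by_run.items():
        for chemical, amount in peak_amounts(ids_doc, result_index).items():
            # A chemical missing from a run simply has no entry for that run
            table.setdefault(chemical, {})[run_name] = amount
    return table
```

The result can be handed to `pd.DataFrame.from_dict(table, orient="index")` to get one row per chemical and one column per run.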