# Pulling IDS Data Cubes into Pandas DataFrame (Cytometry Example)

This notebook demonstrates how to use TetraScience's APIs to ingest raw data and make it available in Python for data analysis or machine learning. Cytometry data serves as the worked example.

The steps are as follows:
* Download example data - here we use cytometry
* Create a data pipeline that converts raw instrument files to vendor-neutral Intermediate Data Schema (IDS) - here we use a pipeline that converts cytometry .fcs files
* Upload data to the Tetra Data Platform (TDP) - this notebook uses the file upload API, although it is more typical to use TetraScience's [File-Log Agent](https://developers.tetrascience.com/docs/file-log-agent)
* Find all IDS data that came from our pipeline
* Pull out the Data Cubes and load them into a Pandas DataFrame

## Import Libraries

In [None]:
import os
import json
import requests
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import pprint

## Constants for use in Notebook

In [None]:
# Location of authentication file
AUTH_DIR = "./"
AUTH_FILENAME = "auth.json"

In [None]:
# APIs used in this notebook
BASE_API = "https://api.tetrascience-uat.com/v1/"
API_EQL_SEARCH = BASE_API + "datalake/searchEql"
API_RETRIEVE_FILE = BASE_API + "datalake/retrieve"
API_PIPELINE_INFO = BASE_API + "pipeline/"
API_PIPELINE_CREATION = BASE_API + "pipeline/create"
API_FILE_UPLOAD = BASE_API + 'datalake/upload'

## Pull in Authentication Information for Headers

In [None]:
with open(os.path.join(AUTH_DIR, AUTH_FILENAME), "r") as f:
    auth_data = json.loads(f.read())

headers = {"ts-auth-token": auth_data["auth_token"],
           "x-org-slug": auth_data["org"]}

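The cell above expects `auth.json` to hold two keys, `auth_token` and `org`. As a minimal sketch, the loading step could also check for those keys up front so that a malformed file fails with a clear message. The helper names `load_auth` and `build_headers` and the placeholder values are illustrative assumptions, not part of any TetraScience library.

```python
import json
import os
import tempfile

# The two keys this notebook reads from the auth file
REQUIRED_KEYS = ("auth_token", "org")


def load_auth(path):
    """Load the auth file and fail early if a required key is missing or empty."""
    with open(path, "r") as f:
        auth = json.load(f)
    missing = [key for key in REQUIRED_KEYS if not auth.get(key)]
    if missing:
        raise KeyError("auth file is missing: " + ", ".join(missing))
    return auth


def build_headers(auth):
    """Build the request headers used throughout the notebook."""
    return {"ts-auth-token": auth["auth_token"], "x-org-slug": auth["org"]}


# Demo with placeholder values written to a temporary file
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "auth.json")
    with open(demo_path, "w") as f:
        json.dump({"auth_token": "<your-token>", "org": "<your-org-slug>"}, f)
    demo_headers = build_headers(load_auth(demo_path))
```
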
## Download example cytometry data

* Navigate to [flowrepository.org](https://flowrepository.org) to download freely available cytometry datasets for analysis. In particular, navigate to [this dataset](https://flowrepository.org/id/FR-FCM-Z2KP), which contains data from a study analyzing blood from individuals with varying severity of COVID-19 (based on [this study](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7662088/))

* Click on the download button, and then on the download page click "ZIP & Download Files".

* Unzip the file to get a folder full of .fcs files
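
The unzip step can also be done from Python. The following is a sketch using the standard-library `zipfile` module; the helper name `extract_fcs` and the demo archive are assumptions for illustration, and the real archive name depends on what FlowRepository serves.

```python
import os
import tempfile
import zipfile


def extract_fcs(zip_path, dest_dir):
    """Extract only the .fcs members of a ZIP archive and return their paths."""
    with zipfile.ZipFile(zip_path) as archive:
        members = [m for m in archive.namelist() if m.lower().endswith(".fcs")]
        archive.extractall(dest_dir, members=members)
    return sorted(os.path.join(dest_dir, m) for m in members)


# Demo on a tiny archive built on the fly (placeholder bytes, not real FCS data)
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = os.path.join(tmp, "demo.zip")
    with zipfile.ZipFile(demo_zip, "w") as archive:
        archive.writestr("sample_01.fcs", b"placeholder")
        archive.writestr("README.txt", b"not cytometry data")
    extracted = extract_fcs(demo_zip, os.path.join(tmp, "out"))
    extracted_names = [os.path.basename(p) for p in extracted]
    readme_extracted = os.path.exists(os.path.join(tmp, "out", "README.txt"))
```

For the dataset above, the destination would be the folder named in `CYTOMETRY_FOLDER` in the next cell.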

In [None]:
# Location of fcs dataset
CYTOMETRY_DIR = "./"
CYTOMETRY_FOLDER = "FlowRepository_FR-FCM-Z2KP_files/"

## Create Pipeline to convert Cytometry data to IDS

In [None]:
cytometry_pipeline_info = {'name': 'Example - Create Cytometry Tetra Data',
                           'description': 'Transform FCS to IDS',
                           'triggerType': 'custom',
                           'triggerCondition': {'groupOperator': 'AND',
                                                'groupLevel': 1,
                                                'groups': [{'groupLevel': 2,
                                                            'groupOperator': 'AND',
                                                            'groups': [{'key': 'category', 
                                                                        'operator': 'is', 
                                                                        'value': 'raw'}]},
                                                           {'groupLevel': 2,
                                                            'groupOperator': 'AND',
                                                            'groups': [{'key': 'tags',
                                                                        'operator': 'has a tag that is',
                                                                        'value': 'example-cytometry'}]}]},
                           'protocolSlug': 'bd-flow-cytometers-raw-to-ids',
                           'protocolVersion': 'v1.1.2',
                           'masterScriptNamespace': 'common',
                           'masterScriptSlug': 'bd-flow-cytometers-raw-to-ids',
                           'masterScriptVersion': 'v1.1.2'}
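
The `triggerCondition` above reads as "category is raw AND the file has the tag example-cytometry". To make the nesting easier to follow, here is a small local evaluator for that structure. It is a sketch only: the function `matches`, the shape of the file metadata, and the assumption that any operator other than `AND` behaves like `OR` are illustrative guesses, not a description of how TDP evaluates triggers.

```python
def matches(condition, file_meta):
    """Evaluate a trigger condition (group or leaf) against file metadata."""
    if "groups" in condition:
        results = [matches(child, file_meta) for child in condition["groups"]]
        if condition["groupOperator"] == "AND":
            return all(results)
        return any(results)  # assumption: the only other operator is OR
    key, operator, value = condition["key"], condition["operator"], condition["value"]
    if operator == "is":
        return file_meta.get(key) == value
    if operator == "has a tag that is":
        return value in file_meta.get(key, [])
    raise ValueError("unsupported operator: %r" % operator)


# Same structure as the triggerCondition in the pipeline definition
trigger = {
    "groupOperator": "AND",
    "groupLevel": 1,
    "groups": [
        {"groupLevel": 2, "groupOperator": "AND",
         "groups": [{"key": "category", "operator": "is", "value": "raw"}]},
        {"groupLevel": 2, "groupOperator": "AND",
         "groups": [{"key": "tags", "operator": "has a tag that is",
                     "value": "example-cytometry"}]},
    ],
}

raw_tagged = matches(trigger, {"category": "raw", "tags": ["example-cytometry"]})
raw_untagged = matches(trigger, {"category": "raw", "tags": []})
ids_tagged = matches(trigger, {"category": "IDS", "tags": ["example-cytometry"]})
```

Under these assumptions only a raw file carrying the tag would start the pipeline, which is why the upload step below tags every file with `example-cytometry`.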

In [None]:
### WARNING: Run the commands below only once; running them multiple times creates duplicate pipelines

# create_cytometry_pipeline = requests.post(API_PIPELINE_CREATION, headers=headers, data=json.dumps(cytometry_pipeline_info))
# create_cytometry_pipeline.text

In [None]:
# Save Pipeline ID (from API response above) to a variable:
cytometry_pipeline_id = ""

## Upload Cytometry Data to TDP

In [None]:
fcs_dir = os.path.join(CYTOMETRY_DIR, CYTOMETRY_FOLDER)
# Keep only files whose name ends in .fcs (case-insensitive)
fcs_files = [os.path.join(fcs_dir, file) for file in sorted(os.listdir(fcs_dir))
             if file.lower().endswith(".fcs")]

In [None]:
num_raw_files = len(fcs_files)

In [None]:
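import subprocess

def upload_file(auth_token, org, filepath, filename, tag):
    # Pass the arguments as a list so paths containing spaces or quotes
    # do not need shell escaping
    file_upload_curl = ["curl", "--location", API_FILE_UPLOAD,
                        "--header", "ts-auth-token: %s" % auth_token,
                        "--header", "x-org-slug: %s" % org,
                        "--header", "Content-Transfer-Encoding: multipart/form-data",
                        "--form", "file=@%s" % filepath,
                        "--form", "filename=%s" % filename,
                        "--form", 'tags=["%s"]' % tag]
    # check=True raises CalledProcessError if curl exits with an error
    subprocess.run(file_upload_curl, check=True)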

In [None]:
for fcs_file in fcs_files:
    upload_file(auth_data["auth_token"], auth_data["org"], fcs_file, fcs_file, "example-cytometry")

## Find all IDS files created by Pipeline

In [None]:
query = {
  "size": 10000,
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "integration.id": cytometry_pipeline_id
          }
        },
        {
          "match": {
            "integration.type": "datapipeline"
          }
        }
      ]
    }
  }
}

payload = json.dumps(query)
ids_file_search = requests.post(API_EQL_SEARCH, headers=headers, data=payload)
ids_file_search.raise_for_status()  # fail early on an HTTP error
results = ids_file_search.json()["hits"]["hits"]
num_ids_files = len(results)

print("Checking status of pipeline.")
print("Number of Raw Files: %d" % num_raw_files)
print("Number of IDS Files: %d" % num_ids_files)
if num_raw_files > num_ids_files:
    print("Raw files still processing.")
else:
    print("All files converted.")
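
Because conversion runs asynchronously, the check above usually has to be repeated until every raw file has an IDS counterpart. One way to do that is a simple polling loop, sketched below. The function `wait_for_conversion` and its parameters are hypothetical; `count_ids_files` stands for any callable that reruns the search above and returns the number of hits.

```python
import time


def wait_for_conversion(count_ids_files, expected, interval_s=30.0,
                        timeout_s=1800.0, sleep=time.sleep):
    """Poll until count_ids_files() reaches expected or the timeout elapses.

    Returns the last observed count, which may be below expected on timeout.
    """
    waited = 0.0
    count = count_ids_files()
    while count < expected and waited < timeout_s:
        sleep(interval_s)
        waited += interval_s
        count = count_ids_files()
    return count


# Demo with a fake counter that reports 0, then 2, then 5 converted files;
# sleeping is stubbed out so the demo finishes immediately
observed = iter([0, 2, 5])
final_count = wait_for_conversion(lambda: next(observed), expected=5,
                                  sleep=lambda seconds: None)
```

In the notebook, `expected` would be `num_raw_files`, and the callable would wrap the `requests.post` search from the cell above.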

## Pull IDS Data Cube into Pandas DataFrame

In [None]:
first_file_id = results[0]["_source"]["fileId"]
print(first_file_id)

In [None]:
retrieve_file = requests.get(API_RETRIEVE_FILE, params={"fileId": first_file_id}, headers=headers)

In [None]:
IDS_info = json.loads(retrieve_file.text)

In [None]:
IDS_info

In [None]:
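# Pull out the raw data: each datacube holds the values of one channel.
# Stack them as rows, then transpose so rows are events and columns are channels.
data = [np.array(x["measures"][0]["value"]) for x in IDS_info["datacubes"]]
data = np.vstack(data).T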

In [None]:
# Pull out the channel information
channels = [x["measures"][0]["name"] for x in IDS_info["datacubes"]]

In [None]:
# Pull out the measurement timings
time = IDS_info["datacubes"][0]["dimensions"][0]["scale"]

In [None]:
# Insert information into Pandas DataFrame
cytometry_df = pd.DataFrame(data, index=time, columns=channels)

In [None]:
cytometry_df.head()
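
The cells above assume that every datacube has the same length and shares the time scale of the first one. The sketch below makes that assumption explicit on a small hand-written dictionary that mirrors only the fields this notebook reads (`datacubes`, `measures`, `dimensions`, `scale`). The function `datacubes_to_columns` is hypothetical and the numbers are made up for illustration; a real IDS file contains many more fields.

```python
def datacubes_to_columns(ids_doc):
    """Return (index, columns) from an IDS-like dict, checking the cubes line up."""
    cubes = ids_doc["datacubes"]
    index = cubes[0]["dimensions"][0]["scale"]
    columns = {}
    for cube in cubes:
        measure = cube["measures"][0]
        if cube["dimensions"][0]["scale"] != index:
            raise ValueError("datacube %r uses a different scale" % measure["name"])
        if len(measure["value"]) != len(index):
            raise ValueError("datacube %r has the wrong length" % measure["name"])
        columns[measure["name"]] = measure["value"]
    return index, columns


# Made-up example with two channels and three events
demo_doc = {
    "datacubes": [
        {"measures": [{"name": "FSC-H", "value": [101.0, 98.5, 120.2]}],
         "dimensions": [{"scale": [0.0, 0.1, 0.2]}]},
        {"measures": [{"name": "SSC-H", "value": [55.1, 60.3, 47.9]}],
         "dimensions": [{"scale": [0.0, 0.1, 0.2]}]},
    ]
}

index, columns = datacubes_to_columns(demo_doc)
```

The result can be handed to pandas with `pd.DataFrame(columns, index=index)`, which gives the same layout as `cytometry_df`: one row per event and one column per channel.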

## Visualize FSC-H Channel Histogram

In [None]:
_ = plt.hist(cytometry_df["FSC-H"], bins=100)

## References

* Spidlen J, Breuer K, Rosenberg C, Kotecha N and Brinkman RR. FlowRepository - A Resource of Annotated Flow Cytometry Datasets Associated with Peer-reviewed Publications. Cytometry A. 2012 Sep; 81(9):727-31
* Neumann, J., Prezzemolo, T., Vanderbeke, L., Roca, C. P., Gerbaux, M., Janssens, S., ... & Yserbyt, J. (2020). Increased IL‐10‐producing regulatory T cells are characteristic of severe cases of COVID‐19. Clinical & translational immunology, 9(11), e1204.