## Notebook Scheduler HSNE Demo Notebook 

### Lucas Graybuck & Paul Mariz 

### 2021-05-02

In this notebook, we'll use the Python HISE SDK to load a batch of .fcs files and run an HSNE dimension reduction. We'll discover midway through that the IDE instance we're on is too small, and that we need to schedule this notebook to run on a larger instance.

To do this, we'll follow this overall flow:

1. Get some files of interest from HISE Advanced search
2. Read the data from the files and prepare it for HSNE
3. Schedule the Notebook
4. Wait for the scheduled job to complete
5. Download the output
6. Profit

In [1]:
#I need a few extra libraries in order to run this analysis
#You may also have to restart the notebook kernel in order to load the new version of numpy
!pip uninstall -y numpy
!pip install "numpy == 1.20.3"
!pip install "cmake == 3.20.2"
!pip install "flowkit == 0.6.1"
!pip install nptsne
!pip install pyreadr
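#python-louvain is the package that provides the "community" module imported below;
#the separate PyPI package named "community" is unrelated and is not needed here
!pip install python-louvain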
import hisepy
import flowkit
import sklearn
import pandas
import numpy
import networkx
import nptsne
import pyreadr
from community import community_louvain

Found existing installation: numpy 1.20.3
Uninstalling numpy-1.20.3:
  Successfully uninstalled numpy-1.20.3
Collecting numpy==1.20.3
  Using cached numpy-1.20.3-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (15.3 MB)
Installing collected packages: numpy
ERROR: After October 2020 you may experience errors when installing or updating packages. This is because pip will change the way that it resolves dependency conflicts.

We recommend you use --use-feature=2020-resolver to test your packages with the new resolver before it becomes the default.

fastai 1.0.61 requires nvidia-ml-py3, which is not installed.
Successfully installed numpy-1.20.3


### 1. Use the SDK to read files from HISE

From the [HISE UI](https://allenimmunology.org) I did a search for FCS files (I used the public "flow QC fcs" query and selected batch B001). That gave me this code snippet:
```
fres1 <- hise::readFiles(list("d00fd209-7c6c-4f39-a857-3c945ae23b82", "2f1ea469-587c-4815-bb6e-77462eaa0748", "d564eebd-241c-4025-8b20-cdf79325d9ad", "52297a4e-64c4-42b8-b963-cf3f402e1139", "886b5851-f1d7-47fb-a52c-10abec6311d3", "2cf0d3c1-a4e0-4921-b941-50465aabdff7", "155d9078-b357-42c6-baf7-488c98c6786b", "6e423656-f5f6-48fa-8862-b4e5191f70ee", "28192359-4a24-4677-8893-60eb1bd55464", "daf3dc9f-31b5-4876-86e6-2ad599a274f5", "b6e214ea-5fd6-4596-8ea4-7b6c47f1f61e", "38dc23f6-f199-4d5c-9f83-a300d14fbece", "6875bfa0-2466-4903-aa87-920ad1a94366"))
```
That's R code, so I need to change it slightly to get the Python version:

In [6]:
fcs_files = hisepy.read_files(["d00fd209-7c6c-4f39-a857-3c945ae23b82", "2f1ea469-587c-4815-bb6e-77462eaa0748", "d564eebd-241c-4025-8b20-cdf79325d9ad", "52297a4e-64c4-42b8-b963-cf3f402e1139", "886b5851-f1d7-47fb-a52c-10abec6311d3", "2cf0d3c1-a4e0-4921-b941-50465aabdff7", "155d9078-b357-42c6-baf7-488c98c6786b", "6e423656-f5f6-48fa-8862-b4e5191f70ee", "28192359-4a24-4677-8893-60eb1bd55464", "daf3dc9f-31b5-4876-86e6-2ad599a274f5", "b6e214ea-5fd6-4596-8ea4-7b6c47f1f61e", "38dc23f6-f199-4d5c-9f83-a300d14fbece", "6875bfa0-2466-4903-aa87-920ad1a94366"])

In [7]:
len(fcs_files)

13
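
With only 13 IDs, rewriting the R snippet by hand is easy enough. For longer lists, the conversion could be automated with a regular expression. This is only a sketch: `extract_file_ids` is a hypothetical helper (not part of hisepy), and it assumes the file IDs are always canonical UUID strings like the ones above.

```python
import re

# Matches canonical 8-4-4-4-12 hexadecimal UUID strings.
UUID_PATTERN = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)


def extract_file_ids(snippet):
    """Return the UUIDs found in a pasted code snippet, in order of appearance."""
    return UUID_PATTERN.findall(snippet)


# A shortened copy of the R snippet from the HISE UI (first two IDs only).
r_snippet = (
    'fres1 <- hise::readFiles(list("d00fd209-7c6c-4f39-a857-3c945ae23b82", '
    '"2f1ea469-587c-4815-bb6e-77462eaa0748"))'
)

file_ids = extract_file_ids(r_snippet)
print(file_ids)
```

The resulting Python list could then be passed to `hisepy.read_files`, as in the cell above.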

In [10]:
test_sample = flowkit.Sample(fcs_files[0].path)
test_sample.pns_labels

['',
 '',
 '',
 '',
 '',
 '',
 'CD3',
 'CD45',
 'CD15',
 'CD45RA',
 'CD14',
 'CD8',
 'CD11c',
 'CD25',
 'CD4',
 'Live/Dead',
 'CD16',
 'CD123',
 'CD127',
 'IgD',
 'CD304',
 'CD141',
 'CD11b',
 'CD19',
 'CD27',
 'abTCR',
 'CD34',
 'CD197',
 'CD38',
 'CD56',
 'HLA-DR',
 '']
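
The labels are positional, so it helps to see which channels a positional slice keeps before relying on it. The sketch below uses plain Python on a copy of the label list printed above; the cut at index 10 mirrors the `iloc[:, 10:]` slice used in section 2. The variable names are mine, for illustration only.

```python
# Copy of the pns_labels output shown above (32 channels).
pns_labels = [
    "", "", "", "", "", "",
    "CD3", "CD45", "CD15", "CD45RA", "CD14", "CD8", "CD11c", "CD25", "CD4",
    "Live/Dead", "CD16", "CD123", "CD127", "IgD", "CD304", "CD141", "CD11b",
    "CD19", "CD27", "abTCR", "CD34", "CD197", "CD38", "CD56", "HLA-DR", "",
]

first_kept_column = 10  # same cut point as iloc[:, 10:] in section 2

kept_labels = pns_labels[first_kept_column:]
# Labeled channels that fall before the cut and are therefore dropped.
dropped_labels = [label for label in pns_labels[:first_kept_column] if label]

print(len(kept_labels), "columns kept, starting at", kept_labels[0])
print("labeled channels dropped:", dropped_labels)
```

Note that a slice like this keeps 22 columns, including the final unlabeled channel, and drops the first four labeled markers along with the six unlabeled channels at the start.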

In [11]:
test_events = test_sample.get_raw_events()
test_events

array([[1.02012023e+05, 8.91688438e+04, 1.44771969e+05, ...,
        1.36947742e+03, 2.56013153e+02, 3.01377630e-02],
       [1.19290570e+05, 1.08187844e+05, 1.45237297e+05, ...,
        2.50700586e+04, 8.14550928e+03, 3.03957605e-02],
       [1.24016266e+05, 1.07464906e+05, 1.46331109e+05, ...,
        1.85670215e+03, 3.35393677e+02, 3.07395220e-02],
       ...,
       [6.96097500e+04, 6.05296680e+04, 1.34352266e+05, ...,
        7.07714600e+02, 2.74119598e+02, 1.00024570e+02],
       [1.28355367e+05, 1.14163805e+05, 1.38775109e+05, ...,
        7.99269592e+02, 2.04366501e+02, 1.00024980e+02],
       [5.93681406e+04, 5.58736875e+04, 1.21945148e+05, ...,
        3.21582251e+03, 1.25084834e+04, 1.00025020e+02]])

### 2. Convert the fcs file data into dataframes

In [15]:
events = pandas.DataFrame()
for fcs_file in fcs_files:
    # Read sample
    fcs_sample = flowkit.Sample(fcs_file.path)
    # Get events
    fcs_events = fcs_sample.get_raw_events()
    # Convert to DataFrame for filtering columns
    fcs_events = pandas.DataFrame(fcs_events)
    fcs_events.columns = fcs_sample.pns_labels
    fcs_events = fcs_events.iloc[:,10:]
    events = pandas.concat([events, fcs_events], axis = 0)

MemoryError: Unable to allocate 1.18 GiB for an array with shape (22, 7222425) and data type float64

Oh no. I've got too much data for the instance I'm currently running, and I haven't even gotten to the HSNE dimension reduction yet. What can I do? 
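
For a sense of scale, the size in the error message follows directly from the array shape: 22 columns by 7,222,425 events, at 8 bytes per float64 value. A quick back-of-the-envelope check (`array_gib` is just an illustrative helper):

```python
def array_gib(n_rows, n_cols, bytes_per_value=8):
    """Size in GiB of a dense array with the given shape and item size."""
    return n_rows * n_cols * bytes_per_value / 2**30


# Shape reported in the MemoryError above; float64 uses 8 bytes per value.
size_float64 = array_gib(22, 7222425)
print(f"{size_float64:.2f} GiB")  # 1.18 GiB

# The same data stored as float32 (4 bytes per value) would need half of that.
size_float32 = array_gib(22, 7222425, bytes_per_value=4)
print(f"{size_float32:.2f} GiB")  # 0.59 GiB
```

That is only one intermediate array; the loop also holds the per-file data and the growing combined frame at the same time, so the real footprint is larger.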

### 3. Schedule This Notebook

I'm going to write the rest of the code that I want to execute on this data, and then schedule this notebook to run on a bigger instance. First, I'll define variables that name the output files I expect my notebook to produce, and set the project that I want those files to belong to (setting the project is only necessary if I'm currently working on multiple projects in HISE).

In [8]:
#I expect this notebook to output two files:
clustering_output = "clustering.rds"
embedding_output = "embedding.rds"
#it is VERY IMPORTANT to get the names of the output files correct when scheduling an instance

output_project = "cohorts"
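
Since the file names matter so much, one option is to end the notebook with a check that every expected output actually exists on disk. This is a sketch of my own, not something the scheduler requires: `missing_outputs` is a hypothetical helper, and the file names are repeated here only so the block runs on its own.

```python
import os


def missing_outputs(expected_paths):
    """Return the expected output files that do not exist on disk."""
    return [path for path in expected_paths if not os.path.isfile(path)]


# Same names as the variables defined in the cell above.
clustering_output = "clustering.rds"
embedding_output = "embedding.rds"

missing = missing_outputs([clustering_output, embedding_output])
if missing:
    print("Missing expected output files:", missing)
else:
    print("All expected output files were written")
```

Placed as the last cell, a check like this would print a clear message into the scheduled run's output if a file name were mistyped.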


Now I'm going to write a cell that has all the code I want to execute to produce those outputs. I have kind of a chicken-and-egg problem here: I don't actually want to run all this code right now. Instead I want to run it on the much bigger scheduled instance. So I'm going to write it all out and not execute the cell. But that's a little risky, because what if I accidentally make some typos or my code doesn't output the things I think it should? Here I don't have this problem because it's a sample notebook and I already tested the code. But what if that's not the case? Here are some things I could do:

 * Run the entire notebook on one .fcs file in order to test it, then change the code so it runs on all of them, then schedule it. 
 * Execute the cell to make sure that it at least doesn't have any syntax errors, and then kill the kernel before it gets too far.
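
The first option can be made less error-prone with a single switch near the top of the notebook, so that moving from a test run to the full run is a one-line change. A minimal sketch, assuming the files are held in a list like `fcs_files`; `files_for_run` and `DRY_RUN` are hypothetical names of mine:

```python
def files_for_run(all_files, dry_run, n_test_files=1):
    """Return a small subset of files for a test run, or all files otherwise."""
    if dry_run:
        return list(all_files[:n_test_files])
    return list(all_files)


# Stand-in for the list returned by hisepy.read_files in section 1.
example_files = ["file_a.fcs", "file_b.fcs", "file_c.fcs"]

DRY_RUN = True  # set to False before scheduling the full run
selected = files_for_run(example_files, DRY_RUN)
print(f"Processing {len(selected)} of {len(example_files)} files")
```

The catch is remembering to flip the switch before scheduling, which is why printing the number of selected files is worth the extra line.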

In [None]:
#I need to define this helper function before I use it
def make_graph_from_transition_matrix(tmat):    
    row = []
    col = []
    data = []

    for r_ind, rcol in enumerate(tmat):
        for tup in rcol:
            if not isinstance(tup, tuple):
                continue
            row.append(r_ind)
            col.append(tup[0])
            data.append(tup[1])
    
    g = networkx.Graph()
    g.add_weighted_edges_from(list(zip(row, col, data)))
    return g

#I'll add some print statements so I can debug this later if it fails
print("Running HSNE")
hsne = nptsne.HSne(True)
hsne.create_hsne(events.to_numpy(), 5)
print("HSNE Completed")
hsne_scale = hsne.get_scale(4)
#that helper function I defined above
hsne_graph = make_graph_from_transition_matrix(hsne_scale.transition_matrix)   
print("Running Louvain Partitioning")
clusters = community_louvain.best_partition(hsne_graph, resolution = 1)
print("Partitioning Done - Saving clusters data frame as %s" % (clustering_output))
cluster_df = pandas.DataFrame(list(clusters.items()), columns = ['orig_cell','cluster_id'])
cluster_df = cluster_df.sort_values('orig_cell')
cluster_df = cluster_df.reset_index()
cluster_df['cell_idx'] = list(hsne_scale.landmark_orig_indexes)
#write the clustering dataframe to an output file
pyreadr.write_rds(clustering_output, cluster_df)
print("Running embedding")
model = nptsne.hsne_analysis.Analysis(hsne, nptsne.hsne_analysis.EmbedderType.CPU)
for i in range(2000):
    model.do_iteration()
nptsne_embedding = model.embedding
print("Embedding done -- Saving embedding data frame as %s" % (embedding_output))
nptsne_df = pandas.DataFrame({'x' : [val[0] for val in nptsne_embedding], 'y' : [val[1] for val in nptsne_embedding]})
nptsne_df['cell_idx'] = hsne_scale.landmark_orig_indexes
nptsne_df = nptsne_df.sort_values('cell_idx')
nptsne_df = nptsne_df.reset_index()
nptsne_df['cluster_id'] = cluster_df['cluster_id'].astype('category')
#write the embedding output
pyreadr.write_rds(embedding_output, nptsne_df)
print("Done")
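
One cheap way to test the helper logic without the big instance is to run it on a toy input. The sketch below reproduces only the flattening step in plain Python, without networkx. It assumes, based on how the helper above indexes its input, that the transition matrix is a sequence of rows, each holding `(column_index, weight)` tuples; I have not confirmed that format against the nptsne documentation. `transition_matrix_to_edges` is a hypothetical name.

```python
def transition_matrix_to_edges(tmat):
    """Flatten rows of (column_index, weight) tuples into (row, col, weight) edges."""
    edges = []
    for row_index, row in enumerate(tmat):
        for entry in row:
            # Skip anything that is not a (column_index, weight) tuple.
            if not isinstance(entry, tuple):
                continue
            col_index, weight = entry
            edges.append((row_index, col_index, weight))
    return edges


# Toy matrix with three landmarks (format assumed, see above).
toy_matrix = [
    [(1, 0.7), (2, 0.3)],
    [(0, 1.0)],
    [(0, 0.4), (1, 0.6)],
]

edges = transition_matrix_to_edges(toy_matrix)
for edge in edges:
    print(edge)
```

Because `networkx.Graph` is undirected, edges such as `(0, 1, 0.7)` and `(1, 0, 1.0)` refer to the same edge once they are added to the graph, and the weight added last is the one that is kept.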

Awesome. Now I'm ready to schedule. I just pass the output files I expect as a list to the scheduler function. When the scheduled notebook runs later, it will execute the entire notebook, cell by cell. The only difference is that it will skip the schedule_notebook call, because we do not want the scheduled notebook to schedule another notebook (and so on, forever. That would, like, be bad?)

In [14]:
job = hisepy.schedule_notebook([clustering_output, embedding_output], project = output_project)

About to schedule notebook /home/jupyter/examples/NotebookSchedulerDemo.ipynb for run on a large instance.
I will run all the cells in the notebook, only skipping this schedule function.
I expect this notebook to produce the following output files:
	clustering.rds
	embedding.rds
I will copy those files back to HISE where they will be available for later download into this or another IDE instance.
OK? (y/n) 

 Y


Scheduling...
Scheduled.


In [None]:
job.check_status()

In [None]:
job.is_completed()

### 4. Wait for Scheduled Job To Complete

I can periodically run either of the cells above (job.check_status() or job.is_completed()) to check on the status of my job. I can also visit https://allenimmunology.org/#/notebook-jobs to see its status. I can even shut this notebook down now and resume it later by clicking the "Clone IDE" button next to my completed job on the notebook-jobs page.
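
If I'd rather not re-run those cells by hand, a small polling loop can do the waiting. This is a generic sketch rather than a hisepy feature: `wait_until` is a hypothetical helper, and using it with the job assumes that `job.is_completed()` returns a true value once the job has finished.

```python
import time


def wait_until(is_done, poll_seconds=60.0, timeout_seconds=None):
    """Call is_done() repeatedly until it returns a true value or the timeout passes.

    Returns True if is_done() succeeded, False if the timeout was reached first.
    """
    start = time.monotonic()
    while True:
        if is_done():
            return True
        if timeout_seconds is not None and time.monotonic() - start >= timeout_seconds:
            return False
        time.sleep(poll_seconds)


# Stand-in for a job that reports completion on the third check.
responses = iter([False, False, True])
finished = wait_until(lambda: next(responses), poll_seconds=0)
print("finished:", finished)
```

With the real job, the call might look like `wait_until(job.is_completed, poll_seconds=300)`. A long poll interval is the polite choice here, since the job takes hours rather than seconds.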


<six hours go by>

...finally...

### 5. Download The Output

In [None]:
files = job.download_output()

### 6. Profit

<insert your own profitable code here>