## Retrieving data from HISE in R

v1.0 (2021-03-24)  
Lucas Graybuck  

### Purpose

In this notebook, we'll examine how to use the `hise` package in our IDE environment to locate and download datasets related to our cohorts.

This notebook was generated using the `R` language, running in the Jupyter Notebook environment on a HISE IDE instance. See the [Session Info](#Session-Info) at the end of the document for additional software version details.

<a name = "contents"></a>

### Contents

- [Loading hise](#Loading-hise)
- [Querying datasets](#Querying-datasets)
    - [Example 1: scRNA-seq files from the FH1 cohort](#Example-1)
    - [Example 2: Flow Cytometry Survey Panel .fcs files and gating counts from BR1 7 days post-vaccination](#Example-2)
    - [Example 3: scATAC-seq .arrow files from CU1 donors](#Example-3)
- [Downloading datasets](#Downloading-datasets)
    - [Downloading to the local file cache](#Downloading-to-the-local-file-cache)
    - [Reading scRNA data directly into R](#Reading-scRNA-data-directly-into-R)
    - [Reading flow data directly into R](#Reading-flow-data-directly-into-R)
    - [Reading scATAC data directly into R](#Reading-scATAC-data-directly-into-R)
- [Saving cache info for downstream steps](#Saving-cache-info-for-downstream-steps)
- [Session Info](#Session-Info)

### Loading `hise`

Functions used to interact with the HISE SDK are stored in the `hise` package for R, which is installed by default on our IDE instances. It can be loaded with the `library()` function:

In [1]:
library(hise)

We'll also use this helper function to make it a bit easier to navigate the file description results we obtain:

In [2]:
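# Convert a list of file descriptors into a data.frame with one row per file
fileDescToDataframe <- function(descriptors,
                                keep_labs = FALSE) {
  
  assertthat::assert_that(typeof(descriptors) == "list")
  assertthat::assert_that(typeof(keep_labs) == "logical")
  
  do.call(
    rbind,
    lapply(
      descriptors,
      function(desc) {
        # Flatten nested fields into a single named vector
        desc <- unlist(desc)
        # Drop entries whose names contain "scheme"
        desc <- desc[!grepl("scheme", names(desc))]
        # Strip the leading "descriptors." prefix (the dot is escaped so it matches literally)
        names(desc) <- sub("^descriptors\\.", "", names(desc))
        # Unless requested, drop fields whose names start with "lab"
        if(!keep_labs) {
          desc <- desc[!grepl("^lab", names(desc))]
        }
        # Build a one-row data.frame; rbind() stacks these across files
        desc <- as.list(desc)
        df <- as.data.frame(desc)
        df
      }
    )
  )
}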
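
To see what the helper does to a single entry, here is a minimal sketch of the flattening step on its own. The `mock_desc` object and its field values are made up for illustration; we're only assuming that real descriptors are nested lists under a top-level `descriptors` element, as the prefix handling in the helper suggests.

```r
# Hypothetical descriptor; the real structure returned by hise may differ
mock_desc <- list(
  descriptors = list(
    file = list(id = "file-001", fileType = "scRNA-seq-labeled"),
    subject = list(subjectGuid = "FH1002")
  )
)

# unlist() joins nested names with dots, e.g. "descriptors.file.id"
flat <- unlist(mock_desc)

# Remove the leading "descriptors." prefix from each name
names(flat) <- sub("^descriptors\\.", "", names(flat))

# A named list becomes a one-row data.frame with one column per field
mock_df <- as.data.frame(as.list(flat))
names(mock_df)
```

The column names produced this way (`file.id`, `subject.subjectGuid`, and so on) are the same dotted names we use as filter keys in the next section.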

[Return to contents](#Contents)

### Querying datasets

To obtain file information we use the `getFileDescriptors` function along with a `fileType` parameter and a `filter` list object to perform more specific filtering.

A few key filters are described below: 
- `cohort.cohortGuid`: Cohort Globally Unique ID (GUID)
    - "BR1": Benaroya Research Institute Healthy Adult cohort 25-35 years old
    - "BR2": Benaroya Research Institute Healthy Adult cohort 55-65 years old
    - "CU1": University of Colorado Rheumatoid Arthritis cohort
    - "FH1": Fred Hutch Newly Diagnosed Multiple Myeloma cohort
    - "FH2": Fred Hutch Refractory Multiple Myeloma cohort
    - "SD1": University of California San Diego Rheumatoid Arthritis cohort
    - "UP1": University of Pennsylvania Healthy Pediatric cohort
    - "UP2": University of Pennsylvania Melanoma anti-PD1 Immunotherapy cohort
    - "UP3": University of Pennsylvania Inflammatory Bowel Disease cohort
- `file.id`: File Globally Unique ID (e.g. "a14dc9a1-0a48-40fe-9f3f-7ec570d46bd3")
- `file.fileType`: The file type to search for.
    - "FlowCytometry": Flow Cytometry .fcs files (see the `file.panel` option for panel selection)
    - "FlowCytometry-supervised-stats": Flow Cytometry supervised gating population counts (see the `file.panel` option for panel selection)
    - "scRNA-seq-labeled": Labeled scRNA-seq .h5 dataset
    - "atac-assembly-archr-arrow": scATAC-seq .arrow file for use with the ArchR framework
    - "atac-assembly-filtered-fragments-tsv-gz": scATAC-seq fragments.tsv.gz file for use with other analysis frameworks
    - "atac-assembly-read-counts-per-windows-h5": scATAC-seq fragment counts per genomic 5kb window
    - "atac-assembly-read-counts-gene-bodies-h5": scATAC-seq fragment counts per gene body
    - "atac-assembly-read-counts-per-region-h5": scATAC-seq fragment counts per TSS (+/-2kb)
    - "Olink": Olink data matrix (cross-project)
- `file.panel`: Used to select a flow cytometry panel. Not applicable to other data types.
    - "PS1": Survey Panel
    - "PB1": B Cell Panel
    - "PM1": Monocyte Panel
    - "PT1": T Cell Panel
- `sample.visitName`: Visit Name for the sample blood draw. "Flu Year 1" examples shown below. Year 2 is also becoming available for the 2021 flu shot season.
    - "Other - Non-Flu": Samples unrelated to flu shot time courses
    - "Flu Year 1 Day 0": Samples drawn on the same day as a flu vaccine
    - "Flu Year 1 Day 7": Samples drawn near 7 days post-vaccination
    - "Flu Year 1 Day 90": Samples drawn near 90 days post-vaccination
    - "Flu Year 1 Stand-Alone": Free-standing samples not timed relative to vaccination
- `sample.sampleKitGuid`: Sample submission kit Globally Unique ID (e.g. "KT00134")
- `subject.biologicalSex`: Biological sex of donors ("Female" or "Male")
- `subject.birthYear`: Birth year of the donors, in YYYY format as a character object (e.g. "1958")
- `subject.race`: Subject self-described race
- `subject.ethnicity`: Subject self-described ethnicity
- `subject.subjectGuid`: A subject Globally Unique ID (e.g. "FH1002")
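
Filters are combined by adding more named entries to the `filter` list. As a small sketch, we can build a multi-field filter and check its names against the keys listed above before sending a query. The `known_filters` vector and `my_filter` list are our own illustration, not part of the `hise` package, and the key list is only as complete as the table above.

```r
# Filter keys described in this notebook (may not be exhaustive)
known_filters <- c(
  "cohort.cohortGuid", "file.id", "file.fileType", "file.panel",
  "sample.visitName", "sample.sampleKitGuid",
  "subject.biologicalSex", "subject.birthYear",
  "subject.race", "subject.ethnicity", "subject.subjectGuid"
)

# Example: female BR2 donors on the day of vaccination
my_filter <- list(
  cohort.cohortGuid = "BR2",
  subject.biologicalSex = "Female",
  sample.visitName = "Flu Year 1 Day 0"
)

# Catch misspelled keys early
unknown <- setdiff(names(my_filter), known_filters)
if (length(unknown) > 0) {
  stop("Unrecognized filter keys: ", paste(unknown, collapse = ", "))
}
```

A list like `my_filter` would then be passed as the `filter` argument, in the same way as the filter lists in the examples below.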

[Return to contents](#Contents)

#### Example 1  
#### scRNA-seq files from the FH1 cohort

In [3]:
fh1_filter_list <- list(
    cohort.cohortGuid = "FH1"
)

fh1_rna_desc <- getFileDescriptors(
    fileType = "scRNA-seq-labeled", 
    filter = fh1_filter_list)

In [4]:
fh1_rna_desc <- fileDescToDataframe(fh1_rna_desc)

How many files did we find?

In [5]:
nrow(fh1_rna_desc)

What information do we have available for these files?

In [6]:
str(fh1_rna_desc)

'data.frame':	30 obs. of  21 variables:
 $ lastUpdated               : chr  "2021-03-19T18:00:54.396Z" "2021-03-19T18:00:54.396Z" "2021-03-19T18:00:54.396Z" "2021-03-19T18:00:54.396Z" ...
 $ sample.id                 : chr  "990f590a-83ab-4c31-95bc-f8bec685e96a" "a43d2542-b590-476a-bc19-3d07d3df59de" "687aa7ab-7cc3-4a8f-8c2c-6d94e24c5804" "7fcef0d1-1a4c-4d59-ba3a-7fbbad731fa4" ...
 $ sample.bridgingControl    : chr  "false" "false" "false" "false" ...
 $ sample.sampleKitGuid      : chr  "KT00132" "KT00134" "KT00135" "KT00136" ...
 $ sample.visitName          : chr  "Other - Non-Flu" "Other - Non-Flu" "Other - Non-Flu" "Other - Non-Flu" ...
 $ sample.drawDate           : chr  "2020-01-01T00:00:00Z" "2020-03-01T00:00:00Z" "2020-04-01T00:00:00Z" "2020-04-01T00:00:00Z" ...
 $ sample.daysSinceFirstVisit: chr  "0" "0" "67" "0" ...
 $ file.id                   : chr  "7d3a8295-402e-47ab-80c2-e51b1a2a6c59" "58687b56-b0d6-4d0b-a46e-da57f7cf4665" "2716b9e4-38f3-4225-bfbe-1fef010e1926" "8f8de5fb-

Let's take a look at the first few:

In [7]:
head(fh1_rna_desc)

| | lastUpdated | sample.id | sample.bridgingControl | sample.sampleKitGuid | sample.visitName | sample.drawDate | sample.daysSinceFirstVisit | file.id | file.name | file.batchID | ⋯ | file.pool | file.fileType | subject.id | subject.biologicalSex | subject.birthYear | subject.ethnicity | subject.partnerCode | subject.race | subject.subjectGuid | cohort.cohortGuid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | ⋯ | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` |
| 1 | 2021-03-19T18:00:54.396Z | 990f590a-83ab-4c31-95bc-f8bec685e96a | False | KT00132 | Other - Non-Flu | 2020-01-01T00:00:00Z | 0 | 7d3a8295-402e-47ab-80c2-e51b1a2a6c59 | automated/merged/2020-09-15T19:35:44.423205814Z/B017/labeled/B017-P1_PB00132-01_2020-09-15T19:35:44.423205814Z_labeled.h5 | B017 | ⋯ | | scRNA-seq-labeled | 46607e6b-2665-4b73-b5c2-7a771ff1f023 | Female | 1957 | non-Hispanic origin | FH | Caucasian | FH1002 | FH1 |
| 2 | 2021-03-19T18:00:54.396Z | a43d2542-b590-476a-bc19-3d07d3df59de | False | KT00134 | Other - Non-Flu | 2020-03-01T00:00:00Z | 0 | 58687b56-b0d6-4d0b-a46e-da57f7cf4665 | automated/merged/2020-09-15T19:35:44.423205814Z/B017/labeled/B017-P1_PB00134-01_2020-09-15T19:35:44.423205814Z_labeled.h5 | B017 | ⋯ | | scRNA-seq-labeled | 5fffe7b1-da27-479e-9a09-038fdef9709c | Male | 1958 | non-Hispanic origin | FH | Caucasian | FH1003 | FH1 |
| 3 | 2021-03-19T18:00:54.396Z | 687aa7ab-7cc3-4a8f-8c2c-6d94e24c5804 | False | KT00135 | Other - Non-Flu | 2020-04-01T00:00:00Z | 67 | 2716b9e4-38f3-4225-bfbe-1fef010e1926 | automated/merged/2020-09-15T19:35:44.423205814Z/B017/labeled/B017-P1_PB00135-01_2020-09-15T19:35:44.423205814Z_labeled.h5 | B017 | ⋯ | | scRNA-seq-labeled | 46607e6b-2665-4b73-b5c2-7a771ff1f023 | Female | 1957 | non-Hispanic origin | FH | Caucasian | FH1002 | FH1 |
| 4 | 2021-03-19T18:00:54.396Z | 7fcef0d1-1a4c-4d59-ba3a-7fbbad731fa4 | False | KT00136 | Other - Non-Flu | 2020-04-01T00:00:00Z | 0 | 8f8de5fb-16b2-4e9a-acc3-eaf060fc2591 | automated/merged/2020-09-15T19:35:44.423205814Z/B017/labeled/B017-P1_PB00136-01_2020-09-15T19:35:44.423205814Z_labeled.h5 | B017 | ⋯ | | scRNA-seq-labeled | edd5cfbc-f56d-4c50-9bb0-9da9aad4aa79 | Male | 1972 | non-Hispanic origin | FH | Caucasian | FH1004 | FH1 |
| 5 | 2021-03-19T18:00:54.396Z | 1cf38bae-2a87-41f9-9330-29a627eecb82 | False | KT00137 | Other - Non-Flu | 0001-01-01T00:00:00Z | 0 | b29c2b7d-8d3e-436e-bc4a-10548e752248 | automated/merged/2020-09-15T19:35:44.423205814Z/B017/labeled/B017-P1_PB00137-01_2020-09-15T19:35:44.423205814Z_labeled.h5 | B017 | ⋯ | | scRNA-seq-labeled | 5fffe7b1-da27-479e-9a09-038fdef9709c | Male | 1958 | non-Hispanic origin | FH | Caucasian | FH1003 | FH1 |
| 6 | 2021-03-19T18:00:54.396Z | c903c635-b5f2-4afe-a103-72d7537d9caf | False | KT00138 | Other - Non-Flu | 0001-01-01T00:00:00Z | 0 | eaebab13-1531-42e4-958b-78184a65d68c | automated/merged/2020-09-15T19:35:44.423205814Z/B017/labeled/B017-P1_PB00138-01_2020-09-15T19:35:44.423205814Z_labeled.h5 | B017 | ⋯ | | scRNA-seq-labeled | edd5cfbc-f56d-4c50-9bb0-9da9aad4aa79 | Male | 1972 | non-Hispanic origin | FH | Caucasian | FH1004 | FH1 |
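
Every column in the descriptor table comes back as `chr`, so dates, day counts, and flags need converting before we can sort or do arithmetic on them. Below is a minimal sketch on a small mock table that reuses a few of the column names and values shown above. Treating a draw date of `0001-01-01` as a missing-value placeholder is our assumption, not something documented here.

```r
# Mock rows using column names from the descriptor table above
mock_desc <- data.frame(
  sample.drawDate = c("2020-01-01T00:00:00Z", "0001-01-01T00:00:00Z"),
  sample.daysSinceFirstVisit = c("0", "67"),
  sample.bridgingControl = c("false", "false"),
  stringsAsFactors = FALSE
)

# Keep the YYYY-MM-DD part of the timestamp
draw_day <- substr(mock_desc$sample.drawDate, 1, 10)
# Assumption: year 0001 marks an unrecorded date, so treat it as NA
draw_day[draw_day == "0001-01-01"] <- NA
mock_desc$sample.drawDate <- as.Date(draw_day, format = "%Y-%m-%d")

# Day counts to integer, flags to logical
mock_desc$sample.daysSinceFirstVisit <- as.integer(mock_desc$sample.daysSinceFirstVisit)
mock_desc$sample.bridgingControl <- as.logical(mock_desc$sample.bridgingControl)

str(mock_desc)
```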


How many did we get from each donor?

In [8]:
table(fh1_rna_desc$subject.subjectGuid)


FH1002 FH1003 FH1004 FH1005 FH1006 FH1007 
     6      6      6      4      4      4 

[Return to contents](#Contents)

#### Example 2
#### Flow Cytometry Survey Panel .fcs files and gating counts from BR1 7 days post-vaccination

In [9]:
br1_filter_list <- list(
    cohort.cohortGuid = "BR1",
    file.panel = "PS1",
    sample.visitName = "Flu Year 1 Day 7"
)

br1_flow_desc <- getFileDescriptors(
    fileType = "FlowCytometry", 
    filter = br1_filter_list)

In [10]:
br1_flow_desc <- fileDescToDataframe(br1_flow_desc)

How many available?

In [11]:
nrow(br1_flow_desc)

What's the basic demographic breakdown of these samples?

In [12]:
table(br1_flow_desc$subject.race, br1_flow_desc$subject.biologicalSex)

           
            Female Male
  Asian          2    1
  Caucasian     10   12
  Other          0    1

We can obtain matching gating counts for each of these samples as well:

In [13]:
br1_count_desc <- getFileDescriptors(
    fileType = "FlowCytometry-supervised-stats", 
    filter = br1_filter_list)

In [14]:
br1_count_desc <- fileDescToDataframe(br1_count_desc)

In [15]:
nrow(br1_count_desc)

Looks like we don't have counts for everything. We can match these up to the .fcs files above using the `sample.sampleKitGuid` (which corresponds to the blood draw for the samples).

In [16]:
matching_br1_flow_desc <- br1_flow_desc[match(br1_count_desc$sample.sampleKitGuid, br1_flow_desc$sample.sampleKitGuid),]

In [17]:
nrow(matching_br1_flow_desc)
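
One thing to keep in mind with `match()`: if a kit in the counts table has no counterpart in the .fcs table, `match()` returns `NA` for it, and indexing with `NA` produces a row of `NA` values. The sketch below shows this on two small mock tables (the kit and file IDs are invented) and, as an alternative, uses base R's `merge()` to keep only kits present in both.

```r
# Mock descriptor tables with invented IDs
flow <- data.frame(
  sample.sampleKitGuid = c("KT1", "KT2", "KT3"),
  file.id = c("fcs-1", "fcs-2", "fcs-3"),
  stringsAsFactors = FALSE
)
counts <- data.frame(
  sample.sampleKitGuid = c("KT3", "KT1", "KT9"),
  file.id = c("csv-3", "csv-1", "csv-9"),
  stringsAsFactors = FALSE
)

# Position of each counts kit in the flow table; NA where there is no match
idx <- match(counts$sample.sampleKitGuid, flow$sample.sampleKitGuid)

# Drop unmatched kits before indexing to avoid all-NA rows
matched_flow <- flow[idx[!is.na(idx)], ]

# Inner join on kit ID, keeping both file IDs side by side
paired <- merge(
  flow, counts,
  by = "sample.sampleKitGuid",
  suffixes = c(".fcs", ".counts")
)
paired
```

Checking `anyNA(idx)` after the `match()` call is a quick way to confirm that every counts file found its .fcs partner.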

[Return to contents](#Contents)

#### Example 3
#### scATAC-seq .arrow files from CU1 donors

In [18]:
cu1_filter_list <- list(
    cohort.cohortGuid = "CU1"
)

cu1_atac_desc <- getFileDescriptors(
    fileType = "atac-assembly-archr-arrow", 
    filter = cu1_filter_list)

In [19]:
cu1_atac_desc <- fileDescToDataframe(cu1_atac_desc)

In [20]:
nrow(cu1_atac_desc)

In [21]:
table(cu1_atac_desc$subject.subjectGuid, cu1_atac_desc$sample.visitName)

        
         Flu Year 1 Day 90 Flu Year 1 Stand-Alone Flu Year 2 Day 0
  CU1001                 1                      0                0
  CU1002                 1                      0                1
  CU1003                 1                      0                1
  CU1004                 1                      0                1
  CU1005                 0                      0                1
  CU1019                 0                      1                0
  CU1020                 0                      1                0
  CU1021                 0                      1                0
  CU1022                 0                      1                0
        
         Flu Year 2 Day 7
  CU1001                0
  CU1002                1
  CU1003                1
  CU1004                1
  CU1005                1
  CU1019                0
  CU1020                0
  CU1021                0
  CU1022                0
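
A table like this is also a convenient way to pick out donors with paired time points. As a sketch, here is one way to keep only donors that have both a Day 0 and a Day 7 sample, shown on a mock descriptor table with invented donor IDs.

```r
# Mock descriptor table; donor IDs are invented
mock_desc <- data.frame(
  subject.subjectGuid = c("D1", "D1", "D2", "D3", "D3"),
  sample.visitName = c(
    "Flu Year 2 Day 0", "Flu Year 2 Day 7",
    "Flu Year 2 Day 7",
    "Flu Year 2 Day 0", "Flu Year 2 Day 7"
  ),
  stringsAsFactors = FALSE
)

# Donor x visit count table, as in the output above
visit_tab <- table(mock_desc$subject.subjectGuid, mock_desc$sample.visitName)

# Donors with at least one file at both visits
has_both <- visit_tab[, "Flu Year 2 Day 0"] > 0 &
  visit_tab[, "Flu Year 2 Day 7"] > 0
paired_donors <- rownames(visit_tab)[has_both]

# Subset the descriptors to those donors
paired_desc <- mock_desc[mock_desc$subject.subjectGuid %in% paired_donors, ]
paired_donors
```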

[Return to contents](#Contents)

### Downloading datasets

Once we've found the data we want to use, there are multiple ways to download these datasets for analysis.

The key piece of information we need to retrieve data is the `file.id` column of the descriptor tables we built above.

#### Downloading to the local file cache

The simplest approach is to download the files to our local cache using the `cacheFiles()` function.

`cacheFiles()` will pull data out of HISE and into a local directory ("cache/" within the current working directory).

To demonstrate, let's cache the first file from each of the 3 examples above.

Each call to `cacheFiles()` returns a `list` object, which contains both the `filePath` to the file and the same descriptors retrieved using `getFileDescriptors()`. We'll convert these to data.frame objects using `fileDescToDataframe()`.

In [22]:
fh1_rna_file_1 <- cacheFiles(list(fh1_rna_desc$file.id[1]))
fh1_rna_file_1 <- fileDescToDataframe(fh1_rna_file_1)

In [23]:
br1_flow_file_1 <- cacheFiles(list(br1_flow_desc$file.id[1]))
br1_flow_file_1 <- fileDescToDataframe(br1_flow_file_1)

In [24]:
cu1_atac_file_1 <- cacheFiles(list(cu1_atac_desc$file.id[1]))
cu1_atac_file_1 <- fileDescToDataframe(cu1_atac_file_1)

If we cache multiple files, we'll get an entry for each one:

In [25]:
fh1_rna_files_2_to_5 <- cacheFiles(list(fh1_rna_desc$file.id[2:5]))
fh1_rna_files_2_to_5 <- fileDescToDataframe(fh1_rna_files_2_to_5)

In [26]:
br1_count_files <- cacheFiles(list(br1_count_desc$file.id))
br1_count_files <- fileDescToDataframe(br1_count_files)

[1] "Initiating file download for population-stats.csv"
[1] "Download successful."
[1] "Initiating file download for population-stats.csv"
[1] "Download successful."
[1] "Initiating file download for population-stats.csv"
[1] "Download successful."
[1] "Initiating file download for population-stats.csv"
[1] "Download successful."
[1] "Initiating file download for population-stats.csv"
[1] "Download successful."
[1] "Initiating file download for population-stats.csv"
[1] "Download successful."
[1] "Initiating file download for population-stats.csv"
[1] "Download successful."
[1] "Initiating file download for population-stats.csv"
[1] "Download successful."


In each case, the file location is in the `filePath` column:

In [27]:
fh1_rna_file_1$filePath

In [28]:
fh1_rna_files_2_to_5$filePath
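
Before handing these paths to downstream code, it can be worth confirming that each file is actually on disk. The sketch below uses a temporary directory and a mock table with a `filePath` column (the IDs and file names are invented) and relies only on base R's `file.exists()` and `file.size()`.

```r
# Set up a stand-in cache directory with one real (empty) file
cache_dir <- file.path(tempdir(), "cache")
dir.create(cache_dir, showWarnings = FALSE)
present_path <- file.path(cache_dir, "example_labeled.h5")
file.create(present_path)

# Mock cache table: one file that exists, one that does not
mock_cache_info <- data.frame(
  file.id = c("id-1", "id-2"),
  filePath = c(present_path, file.path(cache_dir, "missing.h5")),
  stringsAsFactors = FALSE
)

# Flag missing files and record sizes (NA for files that are absent)
mock_cache_info$exists <- file.exists(mock_cache_info$filePath)
mock_cache_info$size_bytes <- file.size(mock_cache_info$filePath)

# IDs we would need to cache again
missing_ids <- mock_cache_info$file.id[!mock_cache_info$exists]
missing_ids
```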

#### Reading scRNA data directly into R

Helper functions are available to directly read scRNA-seq files for use with Seurat, Bioconductor (SingleCellExperiment), and Monocle (cell_data_set):

In [29]:
fh1_rna_so <- readSCRNAFileSeurat(fh1_rna_desc$file.id[1])
fh1_rna_so

Loading required namespace: Seurat



An object of class Seurat 
33538 features across 17791 samples within 1 assay 
Active assay: RNA (33538 features, 0 variable features)

In [30]:
fh1_rna_se <- readSCRNAFileSCE(fh1_rna_desc$file.id[1])
fh1_rna_se

Loading required namespace: SingleCellExperiment



class: SingleCellExperiment 
dim: 33538 17791 
metadata(0):
assays(1): counts
rownames(33538): ENSG00000243485 ENSG00000237613 ... ENSG00000277475
  ENSG00000268674
rowData names(4): id name type i
colnames(17791): 75464c00fa4d11ea81509e49704367d9
  75464e4efa4d11ea81509e49704367d9 ... e144c3f0fa5211ea8383e6a022d3c9e0
  e144c918fa5211ea8383e6a022d3c9e0
colData names(19): barcodes batch_id ... umap_2 well_id
reducedDimNames(0):
altExpNames(0):

You may need to run the following from a terminal to install `monocle3`:
```
apt-get update
apt-get install libudunits2-dev libpq-dev libgdal-dev
Rscript --vanilla -e "devtools::install_github('cole-trapnell-lab/monocle3', ref='develop')"
```

#### Reading flow data directly into R

Flow cytometry data can be directly loaded into R as a `flowFrame` object using the `readCytometryFile()` function for use with `flowCore` and related packages:

In [32]:
br1_flow_data <- readCytometryFile(br1_flow_desc$file.id[1])

In [33]:
br1_flow_data[[1]]

flowFrame object 'd967c76b-60be-434c-9f46-734f8357c7e4'
with 381484 cells and 32 observables:
                  name      desc  range minRange maxRange
$P1              FSC-A      <NA> 262144        0   262144
$P2              FSC-H      <NA> 262144        0   262144
$P3              FSC-W      <NA> 262144        0   262144
$P4              SSC-A      <NA> 262144        0   262144
$P5              SSC-H      <NA> 262144        0   262144
$P6              SSC-W      <NA> 262144        0   262144
$P7           BUV395-A       CD3 262144     -111   262144
$P8           BUV496-A      CD45 262144     -111   262144
$P9           BUV563-A      CD15 262144     -111   262144
$P10          BUV615-A    CD45RA 262144     -111   262144
$P11          BUV661-A      CD14 262144     -111   262144
$P12          BUV737-A       CD8 262144     -111   262144
$P13          BUV805-A     CD11c 262144     -111   262144
$P14           BV421-A      CD25 262144     -111   262144
$P15           BV480-A       CD4 262

#### Reading scATAC data directly into R

We currently don't have methods to immediately load scATAC-seq data into R.

See the scATAC-seq analysis demo notebook for additional details about loading cached scATAC-seq data.

### Saving cache info for downstream steps

For use with other scripts in this notebook, we'll save the file descriptor tables we've generated as a .csv file that we can read in later.

In [34]:
cache_info_list <- list(
    fh1_rna_file_1,
    fh1_rna_files_2_to_5,
    br1_flow_file_1,
    br1_count_files,
    cu1_atac_file_1
)

all_cache_info <- do.call(rbind, cache_info_list)

write.csv(all_cache_info, "cache_info.csv")
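
When reading this file back in, note that `write.csv()` stores row names as an unnamed first column by default. A minimal sketch of the round trip is shown below on a mock table written to a temporary path (the IDs and paths are invented). Using `colClasses = "character"` keeps values such as birth years as character, matching the descriptor tables.

```r
# Mock cache info table with invented values
mock_info <- data.frame(
  file.id = c("id-1", "id-2"),
  filePath = c("cache/a.h5", "cache/b.fcs"),
  subject.birthYear = c("1957", "1958"),
  stringsAsFactors = FALSE
)

# Write as in the cell above (row names included by default)
csv_path <- file.path(tempdir(), "cache_info.csv")
write.csv(mock_info, csv_path)

# row.names = 1 turns the first column back into row names
restored <- read.csv(
  csv_path,
  row.names = 1,
  colClasses = "character",
  stringsAsFactors = FALSE
)
str(restored)
```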

### Session Info

In [35]:
sessionInfo()

R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
 [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] hise_1.0.3

loaded via a namespace (and not attached):
  [1] Seurat_4.0.1                Rtsne_0.15                 
  [3] colorspace_2.0-0            deldir_0.2-10              
  [5] ellipsis_0.3.1              ggridges_0.5.3             
  [7] IRdisplay_1.0               cytolib_2.0.3              
  [9] X