# Integrating Controlled and Open Access 10X Visium Data in SB-CGC Data Studio

    Title:   Integrating Controlled and Open Access 10X Visium Data in SB-CGC Data Studio
    Author:  Clarisse Lau (clau@systemsbiology.org)
    Created: April 2023

# 1. Introduction & Overview
[HTAN](https://humantumoratlas.org/) is a National Cancer Institute (NCI)-funded Cancer Moonshot℠ initiative to construct 3-dimensional atlases of the dynamic cellular, morphological, and molecular features of human cancers as they evolve from precancerous lesions to advanced disease (see [Cell, April 2020](https://www.sciencedirect.com/science/article/pii/S0092867420303469)).

__Important__: This notebook is intended to be run within the Seven Bridges Cancer Genomics Cloud (SB-CGC) Data Studio. You must have dbGaP authorization to access controlled-access HTAN data within SB-CGC. See the [HTAN Missing Manual](https://docs.humantumoratlas.org/access_controlled/db_gap/) for instructions on how to request dbGaP access.

### 1.1 Goal

This notebook demonstrates how open-access HTAN data can be added to the Seven Bridges Cancer Genomics Cloud and used together with lower-level, dbGaP-authorized data already in the cloud. We use ISB-CGC BigQuery metadata tables to obtain the relevant file information.

### 1.2 Inputs and Outputs

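In this example, we aim to replicate the outputs of the `spaceranger` pipeline using 10X Visium data submitted by the Washington University in St. Louis (WashU) HTAN Center for run _HT264P1-S1H2Fc2U1Z1Bs1-H2Bs2-Test_. The inputs are the controlled-access Level 1 FASTQ files and the open-access brightfield TIFF image; the output compared at the end is the position-sorted BAM file.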



# 2. Environment & Module Setup

### 2.1 Google Authentication

Running the BigQuery cells in this notebook requires a Google Cloud project. Instructions for creating a project can be found in the [Google Documentation](https://cloud.google.com/resource-manager/docs/creating-managing-projects#console). The instance must be authorized to bill the project for queries. For more information on getting started in the cloud, see the 'Quick Start Guide to ISB-CGC'; alternative authentication methods are described in the [Google Documentation](https://cloud.google.com/resource-manager/docs/creating-managing-projects#console).

Before running this notebook, follow Google's documentation to install the [Google Cloud SDK](https://cloud.google.com/sdk/docs/install) for your OS.
Then open a Terminal from the Data Studio Launcher and run the following command to set up application credentials for accessing Google BigQuery:

`gcloud auth application-default login`

Follow the prompts to complete authentication.
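
As an optional check, the sketch below looks for the credentials file written by the login command. It assumes the default Linux/macOS location (`~/.config/gcloud/application_default_credentials.json`) and also honors the `GOOGLE_APPLICATION_CREDENTIALS` environment variable. The helper name `find_adc_file` is illustrative and not part of any Google library.

```python
import os
from pathlib import Path


def find_adc_file(env=None, home=None):
    """Return the path of the application default credentials file, or None.

    Checks GOOGLE_APPLICATION_CREDENTIALS first, then the default location
    used by gcloud on Linux/macOS (an assumption about the Data Studio image).
    """
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)

    explicit = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if explicit and Path(explicit).is_file():
        return Path(explicit)

    default = home / ".config" / "gcloud" / "application_default_credentials.json"
    return default if default.is_file() else None


adc = find_adc_file()
if adc is None:
    print("No credentials file found; run the gcloud login step first.")
else:
    print(f"Credentials found at: {adc}")
```

If the file is reported as missing even after a successful login, the credentials may simply live in a different location on your instance.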

### 2.2 Download and Install Spaceranger

Follow the 10X [documentation](https://support.10xgenomics.com/spatial-gene-expression/software/pipelines/latest/installation) to download and install Spaceranger.

### 2.3 Install Libraries

In [None]:
%pip install google-cloud-bigquery
%pip install synapseclient
%pip install protobuf==3.20.1 
%pip install db-dtypes

# 3. Import and Instantiate Libraries

In [8]:
import sevenbridges as sbg
import os

from google.cloud import bigquery
import synapseclient

In [9]:
# Set the Google Cloud project that will be billed for this notebook's queries
# (replace with the ID of a project you are authorized to bill)
google_project = 'htan-dcc'

# Create a client to access the data within BigQuery
client = bigquery.Client(google_project)

In [None]:
# instantiate synapse client
syn = synapseclient.Synapse()
syn.login()

In [7]:
# instantiate SB python client
# Requires SB developer auth token: https://docs.sevenbridges.com/docs/get-your-authentication-token

auth_token = '<your-auth-token>'

os.environ['SB_API_ENDPOINT'] = 'https://cgc-api.sbgenomics.com/v2' 
os.environ['SB_AUTH_TOKEN'] = auth_token

api = sbg.Api()
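
To avoid saving a token in the notebook itself, the token can be read from the environment or requested interactively. The following is a minimal sketch; `resolve_sb_token` and `PLACEHOLDER` are hypothetical names introduced here, not part of the `sevenbridges` package.

```python
import os
from getpass import getpass

PLACEHOLDER = "<your-auth-token>"


def resolve_sb_token(env=None, prompt=getpass):
    """Return an SB auth token from the environment, prompting only if needed."""
    env = os.environ if env is None else env
    token = env.get("SB_AUTH_TOKEN", "").strip()
    # Treat an unset variable or the untouched placeholder as "no token yet".
    if not token or token == PLACEHOLDER:
        token = prompt("Seven Bridges auth token: ").strip()
    if not token:
        raise ValueError("No Seven Bridges auth token provided")
    return token
```

Calling `os.environ['SB_AUTH_TOKEN'] = resolve_sb_token()` before `sbg.Api()` would then keep the token out of the saved notebook.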

# 4. Analysis

### 4.1 Obtain relevant file metadata info from ISB-CGC

In [11]:
f = client.query("""
    WITH l1 AS (
        SELECT Filename,
            HTAN_Parent_Biospecimen_ID,
            Component,
            entityId,
            Run_ID,
            md5
        FROM `htan-dcc.ISB_CGC_r3.10xVisiumSpatialTranscriptomics-RNA-seqLevel1`
        WHERE RUN_ID = 'HT264P1-S1H2Fc2U1Z1Bs1-H2Bs2-Test'
    ),
    l2 AS (
        SELECT Filename,
            HTAN_Parent_Biospecimen_ID,
            Component,
            entityId,
            Run_ID,
            md5
        FROM `htan-dcc.ISB_CGC_r3.10xVisiumSpatialTranscriptomics-RNA-seqLevel2`
        WHERE RUN_ID = 'HT264P1-S1H2Fc2U1Z1Bs1-H2Bs2-Test'
    ),
    aux AS (
        SELECT Filename,
            HTAN_Parent_Biospecimen_ID,
            Component,
            entityId,
            Run_ID,
            md5
        FROM `htan-dcc.ISB_CGC_r3.10xVisiumSpatialTranscriptomics-AuxiliaryFiles`
        WHERE RUN_ID = 'HT264P1-S1H2Fc2U1Z1Bs1-H2Bs2-Test'
    )
    SELECT * FROM l1
    UNION ALL 
    SELECT * FROM l2
    UNION ALL
    SELECT * FROM aux

""").result().to_dataframe()

f

| | Filename | HTAN_Parent_Biospecimen_ID | Component | entityId | Run_ID | md5 |
|---|---|---|---|---|---|---|
| 0 | visium_level_2_pdac_bam/HT264P1-S1H2Fc2U1Z1Bs1... | HTA12_27_5 | 10xVisiumSpatialTranscriptomics-RNA-seqLevel2 | syn51201377 | HT264P1-S1H2Fc2U1Z1Bs1-H2Bs2-Test | be3fb3eddc25e2d16149e6d1c291fbe7 |
| 1 | visium_level_1/TWAS-HT264P1-S1H2Fc2U1Z1Bs1-H2B... | HTA12_27_5 | 10xVisiumSpatialTranscriptomics-RNA-seqLevel1 | syn29282084 | HT264P1-S1H2Fc2U1Z1Bs1-H2Bs2-Test | 1f0c842a138a7eba44d01982cb8a3c8c |
| 2 | visium_level_1/TWAS-HT264P1-S1H2Fc2U1Z1Bs1-H2B... | HTA12_27_5 | 10xVisiumSpatialTranscriptomics-RNA-seqLevel1 | syn29290193 | HT264P1-S1H2Fc2U1Z1Bs1-H2Bs2-Test | 0e68bdda626ee7fa83b4a83f7d23e2c3 |
| 3 | visium_auxiliary_pdac/HT264P1-S1H2Fc2U1Z1Bs1-H... | HTA12_27_5 | 10xVisiumSpatialTranscriptomics-AuxiliaryFiles | syn51283237 | HT264P1-S1H2Fc2U1Z1Bs1-H2Bs2-Test | 3f0a3cdf4a7dfbac9ca9cf593795cdac |
| 4 | visium_auxiliary_pdac/HT264P1-S1H2Fc2U1Z1Bs1-H... | HTA12_27_5 | 10xVisiumSpatialTranscriptomics-AuxiliaryFiles | syn51283252 | HT264P1-S1H2Fc2U1Z1Bs1-H2Bs2-Test | 04cbce98dfba0636dd53bbaadf4cf6fd |
| 5 | visium_auxiliary_pdac/B1-HT264P1-S1H2Fc2U1.tif | HTA12_27_5 | 10xVisiumSpatialTranscriptomics-AuxiliaryFiles | syn51283214 | HT264P1-S1H2Fc2U1Z1Bs1-H2Bs2-Test | 541e816439946e6cc0c3588996f03a45 |
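
The three sub-queries in this section differ only in the table name, so the SQL text could also be generated and the run ID passed as a BigQuery named parameter (`@run_id`) instead of being repeated. The sketch below only builds the query string; `build_visium_query` and the module-level constants are illustrative names, and the commented lines show one way the result might be executed with the `client` created earlier.

```python
COLUMNS = ["Filename", "HTAN_Parent_Biospecimen_ID", "Component", "entityId", "Run_ID", "md5"]
DATASET = "htan-dcc.ISB_CGC_r3"
TABLES = [
    "10xVisiumSpatialTranscriptomics-RNA-seqLevel1",
    "10xVisiumSpatialTranscriptomics-RNA-seqLevel2",
    "10xVisiumSpatialTranscriptomics-AuxiliaryFiles",
]


def build_visium_query(tables=TABLES, dataset=DATASET, columns=COLUMNS):
    """Build one SELECT per table, joined by UNION ALL, filtered by @run_id."""
    column_list = ", ".join(columns)
    selects = [
        f"SELECT {column_list} FROM `{dataset}.{table}` WHERE Run_ID = @run_id"
        for table in tables
    ]
    return "\nUNION ALL\n".join(selects)


sql = build_visium_query()
print(sql)

# To run it (requires google-cloud-bigquery and the authenticated `client`):
# job_config = bigquery.QueryJobConfig(query_parameters=[
#     bigquery.ScalarQueryParameter("run_id", "STRING", "HT264P1-S1H2Fc2U1Z1Bs1-H2Bs2-Test"),
# ])
# f = client.query(sql, job_config=job_config).result().to_dataframe()
```

Passing the run ID as a parameter keeps the value in one place and avoids string interpolation into the SQL text.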


### 4.2 Pull in fastq files

RNA-seq Level 1 FASTQ files are controlled access and can be accessed through SB-CGC with dbGaP authorization:

1. Navigate to the CDS Data File Explorer: https://cgc.sbgenomics.com/datasets/file-repository 
2. Search by Sample ID 'HTA12_27_5'
3. Select for 'FASTQ.GZ' files
4. Add the resulting files to your project

Check that the files have been added to your workspace:

In [12]:
! ls /sbgenomics/project-files

TWAS-HT264P1-S1H2Fc2U1Z1Bs1-H2Bs2-Test_S3_L002_R1_001.fastq.gz
TWAS-HT264P1-S1H2Fc2U1Z1Bs1-H2Bs2-Test_S3_L002_R2_001.fastq.gz
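
The same check can be made in Python, which also makes it easy to confirm that every R1 file has its R2 mate before starting a long run. This is a sketch that assumes the Illumina-style naming seen in the listing above (`<sample>_S<n>_L<lane>_R1_001.fastq.gz`); `find_fastq_pairs` is a hypothetical helper.

```python
from pathlib import Path


def find_fastq_pairs(directory, sample):
    """Map each R1 FASTQ for `sample` to its R2 mate (None when the mate is missing)."""
    directory = Path(directory)
    if not directory.is_dir():
        return {}
    pairs = {}
    for r1 in sorted(directory.glob(f"{sample}_S*_L*_R1_*.fastq.gz")):
        # Swap only the last "_R1_" so a sample name containing it is left alone.
        r2_name = "_R2_".join(r1.name.rsplit("_R1_", 1))
        pairs[r1.name] = r2_name if (directory / r2_name).is_file() else None
    return pairs


pairs = find_fastq_pairs("/sbgenomics/project-files", "TWAS-HT264P1-S1H2Fc2U1Z1Bs1-H2Bs2-Test")
for r1_name, r2_name in pairs.items():
    print(r1_name, "->", r2_name or "MISSING R2")
```

An empty result means no matching FASTQ files were found in the directory, which usually indicates that the files have not yet been added to the project.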


### 4.3 Download image file

Auxiliary files, including TIFF images, are open access. We can download the high-resolution image from Synapse using its entity ID from the metadata table above.

In [5]:
tiff = syn.get('syn51283214')

In [6]:
tiff.path

'/home/jovyan/.synapseCache/778/123019778/B1-HT264P1-S1H2Fc2U1.tif'
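
Rather than copying the Synapse ID by hand, it could be selected from the metadata rows. The sketch below works on plain dictionaries so that it runs on its own; `select_entity_ids` is a hypothetical helper, and the two sample rows are copied from the table in section 4.1.

```python
def select_entity_ids(rows, component_suffix, extension):
    """Return entityIds of rows whose Component ends with `component_suffix`
    and whose Filename ends with `extension` (case-insensitive)."""
    return [
        row["entityId"]
        for row in rows
        if row["Component"].endswith(component_suffix)
        and row["Filename"].lower().endswith(extension.lower())
    ]


# With the DataFrame `f` from section 4.1, the same structure is available
# via f.to_dict("records").
rows = [
    {"Filename": "visium_auxiliary_pdac/B1-HT264P1-S1H2Fc2U1.tif",
     "Component": "10xVisiumSpatialTranscriptomics-AuxiliaryFiles",
     "entityId": "syn51283214"},
    {"Filename": "visium_level_1/TWAS-HT264P1-S1H2Fc2U1Z1Bs1-H2B...",
     "Component": "10xVisiumSpatialTranscriptomics-RNA-seqLevel1",
     "entityId": "syn29282084"},
]
tiff_ids = select_entity_ids(rows, "AuxiliaryFiles", ".tif")
print(tiff_ids)
```

Each returned ID can then be passed to `syn.get(...)` as in the cell above.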

### 4.4 Run Spaceranger pipeline

In [19]:
# Spaceranger installation: https://support.10xgenomics.com/spatial-gene-expression/software/pipelines/latest/installation
# https://github.com/reykajayasinghe/HTAN/blob/main/Single_cell_preprocessing/run_spaceranger_1.2.2.sh

# Define the inputs as Python variables: each `!` line runs in its own subshell,
# so shell variable assignments would not persist into the next cell.
sample = "HT264P1-Test"  # Output directory
sample2 = "TWAS-HT264P1-S1H2Fc2U1Z1Bs1-H2Bs2-Test"  # Sample name from FASTQ filename
TIF_image = tiff.path  # Path to brightfield image input
SLIDE_SERIAL_ID = "V10Y07-094"  # Slide ID
AREA = "B1"  # https://support.10xgenomics.com/spatial-gene-expression/software/pipelines/latest/using/slide-info
datadirectory = "/sbgenomics/project-files/"  # Path to FASTQs
reference = "refdata-gex-GRCh38-2020-A"  # Path to Reference


In [None]:
# IPython substitutes {name} with the Python variable before the line reaches the shell
! spaceranger count --id={sample} --transcriptome={reference} --fastqs={datadirectory} --sample={sample2} --image={TIF_image} --slide={SLIDE_SERIAL_ID} --area={AREA} --reorient-images=true --localcores=32 --localmem=150
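
The `spaceranger count` call can also be assembled as an argument list in Python, which makes it easy to inspect before launching and to run with `subprocess`. This is a sketch using only the flags that appear in this notebook; `build_spaceranger_cmd` is a hypothetical helper, and the image path is the one returned by `tiff.path` in section 4.3.

```python
import shlex
import shutil


def build_spaceranger_cmd(sample_id, transcriptome, fastqs, sample, image, slide, area,
                          localcores=32, localmem=150):
    """Assemble the `spaceranger count` call from section 4.4 as an argument list."""
    return [
        "spaceranger", "count",
        f"--id={sample_id}",
        f"--transcriptome={transcriptome}",
        f"--fastqs={fastqs}",
        f"--sample={sample}",
        f"--image={image}",
        f"--slide={slide}",
        f"--area={area}",
        "--reorient-images=true",
        f"--localcores={localcores}",
        f"--localmem={localmem}",
    ]


cmd = build_spaceranger_cmd(
    sample_id="HT264P1-Test",
    transcriptome="refdata-gex-GRCh38-2020-A",
    fastqs="/sbgenomics/project-files/",
    sample="TWAS-HT264P1-S1H2Fc2U1Z1Bs1-H2Bs2-Test",
    image="/home/jovyan/.synapseCache/778/123019778/B1-HT264P1-S1H2Fc2U1.tif",
    slide="V10Y07-094",
    area="B1",
)
print(shlex.join(cmd))

if shutil.which("spaceranger") is None:
    print("spaceranger is not on PATH; see section 2.2")
# import subprocess; subprocess.run(cmd, check=True)  # uncomment to launch the run
```

Passing a list to `subprocess.run` avoids shell quoting problems if any path contains spaces.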

### 4.5 Compare outputs with files submitted by WashU Center

In [13]:
! md5sum /sbgenomics/workspace/HT264P1-Test/outs/possorted_genome_bam.bam

00f1ca7606385970a3c7b0371bc2cda2  /sbgenomics/workspace/HT264P1-Test/outs/possorted_genome_bam.bam
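
The checksum can also be computed and compared in Python against the `md5` column of the metadata table from section 4.1. The sketch below uses only the standard library; `md5_of_file` and `compare_md5` are illustrative names. The expected value is the md5 listed for the Level 2 entry (syn51201377); because that filename is truncated in the table, treating it as the BAM itself is an assumption.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the md5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_md5(path, expected_md5):
    """Return (matches, actual_md5) for the file at `path`."""
    actual = md5_of_file(path)
    return actual == expected_md5.strip().lower(), actual


bam = Path("/sbgenomics/workspace/HT264P1-Test/outs/possorted_genome_bam.bam")
expected = "be3fb3eddc25e2d16149e6d1c291fbe7"  # assumed to be the submitted BAM's md5
if bam.is_file():
    matches, actual = compare_md5(bam, expected)
    print(f"md5 {actual}: {'match' if matches else 'differs from submitted file'}")
else:
    print(f"{bam} not found; run section 4.4 first")
```

A differing checksum does not by itself show that the alignments differ: BAM headers may record run-specific details such as command lines and paths, so a content-level comparison may be more informative than a byte-level one.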


# 5. Relevant Citations and Links

https://github.com/reykajayasinghe/HTAN/blob/main/Single_cell_preprocessing/run_spaceranger_1.2.2.sh

https://support.10xgenomics.com/spatial-gene-expression/software/pipelines/latest/using/count