# PoPS Workflow

This pipeline is modified from the [FinucaneLab's code](https://github.com/FinucaneLab/gene_features/tree/master/code). 

> [Weeks, E. M., Ulirsch, J. C., Cheng, N. Y., Trippe, B. L., Fine, R. S., Miao, J., ... & Finucane, H. K. (2020). Leveraging polygenic enrichments of gene features to predict genes underlying complex traits and diseases. medRxiv.](https://doi.org/10.1101/2020.09.08.20190561)

## Overview

### 1. MAGMA

### 2. Gene features

* Read in, QC, filter, scale, and normalize data

* Perform PCA and independent component analysis (ICA) across all cells, meta-cells, or tissues

* Cluster the cells, compute a UMAP embedding, and plot features on the projection

* Perform differential expression analysis
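
As a rough illustration of the QC and normalization bullet, the sketch below mirrors the thresholds used in the `genefeature` step of this notebook: cells are kept when their number of detected genes lies strictly between the 5th and 95th percentiles, and counts are log-normalized with a scale factor of 1e6. It is plain Python on a toy matrix, not the Seurat code the workflow runs; the function names and the toy counts are made up for the example.

```python
import math

def quantile(values, q):
    """Quantile with linear interpolation between order statistics
    (the same convention as R's default quantile())."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q
    lo = math.floor(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)

def qc_and_normalize(counts, low=0.05, high=0.95, scale_factor=1_000_000):
    """counts maps a cell id to its list of raw gene counts."""
    # Number of genes detected (count > 0) in each cell
    n_features = {cell: sum(1 for c in row if c > 0) for cell, row in counts.items()}
    lower = quantile(n_features.values(), low)
    upper = quantile(n_features.values(), high)
    normalized = {}
    for cell, row in counts.items():
        if not lower < n_features[cell] < upper:
            continue  # drop cells in either tail
        total = sum(row)
        # Log-normalization: log1p(count / total counts of the cell * scale factor)
        normalized[cell] = [math.log1p(c / total * scale_factor) for c in row]
    return normalized

# Toy counts: five cells, five genes (values invented for illustration)
toy_counts = {
    "c1": [5, 0, 0, 0, 0],
    "c2": [1, 1, 0, 0, 0],
    "c3": [1, 1, 2, 0, 0],
    "c4": [1, 1, 1, 1, 0],
    "c5": [1, 1, 1, 1, 1],
}
normalized = qc_and_normalize(toy_counts)
print(sorted(normalized))  # ['c2', 'c3', 'c4']
```

The cells with the fewest and the most detected genes (`c1` and `c5`) fall outside the percentile window and are dropped before normalization.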

### 3. Feature selection

### 4. Predict scores

## A working example

In [3]:
sos run pops.ipynb -h

usage: sos run pops.ipynb [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  qc
  pca
  umap
  deg

Global Workflow Options:
  --cwd VAL (as path, required)
                        the output directory for generated files
  --name VAL (as str, required)
                        A string to identify your analysis run
  --job-size 1 (as int)
                        For cluster jobs, number commands to run per job
  --walltime 5h
                        Wall clock time expected
  --mem 16G
                        Memory expected
  --numThreads 20 (as int)
                        Number of threads
  --container-pops 'guangyou/pops:1.0.0'
                        Software container option
  --n

In [11]:
sos run pops.ipynb genefeature\
    --cwd ~/pops/code/ \
    --expr_matrix ../data/human_airway/Raw_exprMatrix.tsv.gz \
    --tsv ../data/human_airway/meta.tsv \
    --anno_file ../resources/gene_annot_jun10.txt \
    --symbol ../resources/ensg2symbol.txt \
    --container_pops guangyou/pops:1.0.0 \
    --name human_airway 

INFO: Running pops: 
HINT: Pulling docker image guangyou/pops:1.0.0
HINT: Docker image guangyou/pops:1.0.0 is now up to date
INFO: pops is completed.
ERROR: [pops]: [pops]: Output target /home/lg/pops/code/features/human_airway_ does not exist after the completion of step pops

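The error reports that `features/human_airway_` was never created. That target looks like a file-name prefix handed to the helper functions rather than a file the step writes, although the helpers in `utils.R` are not shown here, so this is an inference. A small check such as the following sketch can list which declared targets are missing after a run. The helper names are hypothetical, only three of the declared targets are mirrored, and `human_airway_pcs.txt` is an invented stand-in for whatever files the prefix is used for.

```python
import tempfile
from pathlib import Path

def declared_outputs(cwd, name):
    """A few of the targets from the `output:` list of the genefeature step."""
    cwd = Path(cwd)
    return [
        cwd / "plots" / f"{name}_variablegenes.pdf",
        cwd / "features" / f"{name}_",        # a prefix, not a file
        cwd / "features" / f"{name}_so.rds",
    ]

def missing_targets(cwd, name):
    """Return the declared targets that do not exist on disk."""
    return [p for p in declared_outputs(cwd, name) if not p.exists()]

# Simulate a finished run in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    plots = Path(tmp) / "plots"
    features = Path(tmp) / "features"
    plots.mkdir()
    features.mkdir()
    (plots / "human_airway_variablegenes.pdf").touch()
    (features / "human_airway_so.rds").touch()
    # Hypothetical file written with the prefix; its real name is unknown
    (features / "human_airway_pcs.txt").touch()
    missing = [p.name for p in missing_targets(tmp, "human_airway")]

print(missing)  # ['human_airway_']
```

If the prefix is indeed the cause, declaring a concrete file in `output:` instead of the bare prefix would let the step complete.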


In [None]:
sos run pops.ipynb magma \
    --cwd ~/pops/code/ \
    --bfile ~/pops/data/data/1000G.EUR \
    --gene-annot ~/pops/data/data/magma_0kb.genes.annot \
    --pval ~/pops/data/AFib-gwas-summary-statistics.tbl \
    --gene-model snp-wise=mean \
    --out AFib

In [None]:
sos run pops.ipynb feature_selection\
    --cwd ~/pops/code/ \
    --features ~/pops/data/data/PoPS.features.txt.gz\
    --gene_results AFib\
    --out AFib

In [None]:
sos run pops.ipynb predict_scores\
    --cwd ~/pops/code/ \
    --gene_loc ~/pops/data/data/gene_loc.txt\
    --gene_results AFib\
    --features ~/pops/data/data/PoPS.features.txt.gz\
    --selected_features AFib.features\
    --control_features ~/pops/data/data/control.features\
    --chromosome 1\
    --out AFib
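
The call above scores a single chromosome. Assuming the same call is wanted once per autosome (chromosomes 1 to 22, which is an assumption and not something this notebook states), the argument lists could be generated with a short script like the sketch below. The function name and its defaults are illustrative; the paths are the ones used in the example above.

```python
from pathlib import Path

def predict_scores_commands(cwd="~/pops/code/", data="~/pops/data/data",
                            trait="AFib", chromosomes=range(1, 23)):
    """Build one `sos run ... predict_scores` argument list per chromosome."""
    # Expand "~" here because no shell will do it for an argument list
    data = Path(data).expanduser()
    cwd = str(Path(cwd).expanduser())
    commands = []
    for chrom in chromosomes:
        commands.append([
            "sos", "run", "pops.ipynb", "predict_scores",
            "--cwd", cwd,
            "--gene_loc", str(data / "gene_loc.txt"),
            "--gene_results", trait,
            "--features", str(data / "PoPS.features.txt.gz"),
            "--selected_features", f"{trait}.features",
            "--control_features", str(data / "control.features"),
            "--chromosome", str(chrom),
            "--out", trait,
        ])
    return commands

commands = predict_scores_commands()
print(len(commands))                 # 22
print(" ".join(commands[0][-4:]))    # --chromosome 1 --out AFib
```

Each list can be passed to `subprocess.run(cmd, check=True)` to launch the runs one after another.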

In [10]:
[global]
# the output directory for generated files
parameter: cwd = path
# For cluster jobs, number commands to run per job
parameter: job_size = 1
# Wall clock time expected
parameter: walltime = "5h"
# Memory expected
parameter: mem = "16G"
# Number of threads
parameter: numThreads = 20
# Software container option
parameter: container_pops = 'guangyou/pops:1.0.0'
cwd = f"{cwd:a}"

In [13]:
[genefeature]
# Path to expression matrix file
parameter: expr_matrix = path
# Path to the cell metadata file (TSV)
parameter: tsv = path
# Path to the gene annotation file
parameter: anno_file = path
# Path to the Ensembl gene ID to symbol mapping file
parameter: symbol = path
# A string to identify your analysis run
parameter: name = str
# Number of components: ICs computed by ICA, and PCs used for clustering and UMAP
parameter: number_pcs = 35
# Number of features to select as top variable features
parameter: vargenes = 1500
# Value of the resolution parameter, use a value above (below) 1.0 if you want to obtain a larger (smaller) number of communities
parameter: clus_res = 0.6
input: expr_matrix, tsv, anno_file, symbol
output: f'{cwd}/plots/{name}_variablegenes.pdf',
        f'{cwd}/plots/{name}_pcaelbow.pdf',
        f'{cwd}/plots/{name}_umap_clusters.pdf',
        f'{cwd}/plots/{name}_umap_clusters_pre_def.pdf',
        f'{cwd}/plots/{name}_umap_pcs.pdf',
        f'{cwd}/plots/{name}_umap_ics.pdf',
        f'{cwd}/plots/{name}_umap_knownmarkers.pdf',
        f'{cwd}/features/{name}_',
        f'{cwd}/plots/{name}_umap_degenes.pdf',
        f'{cwd}/plots/{name}_umap_degenes_pre_def.pdf',
        f'{cwd}/features/{name}_so.rds'
task: trunk_workers = 1, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{name}'
R:  container=container_pops, expand= "${ }", stderr = f'{name}.stderr', stdout = f'{name}.stdout'
    library(tidyverse)
    library(data.table)
    library(BuenColors)
    library(Seurat)
    library(irlba)
    library(Matrix)
    library(future)
    library(reticulate)
    library(ggrastr)
    library(tidytext)
    library(matrixTests)
    # Helper functions are kept in utils.R so that this step stays short
    source("utils.R")
    # Set up parallelization
    # Use htop to find and kill any forked workers left behind
    Sys.setenv(R_FUTURE_FORK_ENABLE = TRUE)
    # Allow up to 12 GiB of global objects per future
    options(future.globals.maxSize = 6 * 2048 * 1024^2)
    plan(strategy = "multicore", workers = ${numThreads})
    name <- "${name}"
    # Create the output directories
    dir.create(dirname(${_output[0]:r}), showWarnings = FALSE, recursive = TRUE)
    dir.create(dirname(${_output[10]:r}), showWarnings = FALSE, recursive = TRUE)
    # Notes on data:
    # Annotations provided. There is a mild batch effect by method, but correcting it breaks the clustering.
    # So we ignore it. Looking at pre-defined clusters, we are clearly still capturing biology.
    ### Assumes a sparse dgCMatrix as input
    ### Accepts row_id_type = ENSG, ENSMUSG, human_symbol, mouse_symbol
    #------------------------------------------------LOAD AND FORMAT DATA-----------------------------------------------#
    # Read in data and annotations
    mat <- data.frame(fread(${_input[0]:r}), row.names=1)[1:10000,1:10000] %>%
      data.matrix() %>%
      Matrix(sparse = TRUE)
    mat.annot <- data.frame(fread(${_input[1]:r}, header = TRUE), row.names = 1)
    colnames(mat) <-  gsub("[.]", "-", colnames(mat))

    # Convert to ENSG, drop duplicates, and fill in missing genes
    mat <- ConvertToENSGAndProcessMatrix(mat, "human_symbol")

    # Load the gene annotation in case we need it later
    keep <- read.table(${_input[2]:r}, sep = "\t", header = T, stringsAsFactors = F, col.names = c("ENSG", "symbol", "chr", "start", "end", "TSS"))

    #--------------------------------------------------COMPUTE FEATURES-------------------------------------------------#

    # Create Seurat object
    # min.features determined for each dataset
    so <- CreateSeuratObject(counts = mat, project = name, min.features = 200, meta.data = mat.annot)

    # Clean up
    rm(mat)

    # QC
    so <- subset(so, 
                 subset = nFeature_RNA > quantile(so$nFeature_RNA, 0.05) & 
                   nFeature_RNA < quantile(so$nFeature_RNA, 0.95))
    so <- NormalizeData(so, normalization.method = "LogNormalize", scale.factor = 1000000)
    so <- ScaleData(so, min.cells.to.block = 1, block.size = 500)

    # Identify variable genes
    so <- FindVariableFeatures(so, nfeatures = ${vargenes})

    # Plot variable genes with and without labels
    PlotAndSaveHVG(so, ${_output[0]:r})
    
    # Run PCA
    so <- RunPCA(so, npcs = 100)
    # Project PCA to all genes
    so <- ProjectDim(so, do.center = T)
    # Plot Elbow
    PlotAndSavePCAElbow(so, 100, ${_output[1]:r})

    # Run ICA
    so <- RunICA(so, nics = ${number_pcs})
    # Project ICA to all genes
    so <- ProjectDim(so, reduction = "ica", do.center = T)

    # Cluster cells
    so <- FindNeighbors(so, dims = 1:${number_pcs}, nn.eps = 0)
    so <- FindClusters(so, resolution = ${clus_res}, n.start = 100)

    # UMAP dim reduction
    so <- RunUMAP(so, dims = 1:${number_pcs}, min.dist = 0.4, n.epochs = 500,
                  n.neighbors = 10, learning.rate = 0.1, spread = 2)

    # Plot UMAP clusters
    PlotAndSaveUMAPClusters(so, so@meta.data$seurat_clusters, ${_output[2]:r})
    # Plot known clusters on UMAP (if applicable)
    PlotAndSaveUMAPClusters(so, so@meta.data$CellType, ${_output[3]:r})

    # Plot PCs on UMAP
    PlotAndSavePCsOnUMAP(so, ${_output[4]:r})
    # Plot ICs on UMAP
    PlotAndSaveICsOnUMAP(so, ${_output[5]:r})
    # Plot known marker genes on UMAP 
    marker_genes <- c("CCDC67", "DEUP1", "FOXN4", "CDC20B", "RERGL", "MCAM", "PDGFRB", "ACTA2", "MYL9", "ASCL3", "CFTR", "FOXJ1", "MUC5AC", "SFTPA2", "CA2", "CAV1", "ANXA3", "CAV2")
    PlotAndSaveKnownMarkerGenesOnUMAP(so, keep, marker_genes, ${_output[6]:r})
  
    # Save global features
    SaveGlobalFeatures(so, ${_output[7]:r})
    
    # Compute any cluster dependent features (DE genes, within-cluster PCs, etc.) and save them
    # Seurat clusters
    Idents(object=so) <- "seurat_clusters"
    clus <- levels(so@meta.data$seurat_clusters)
    demarkers <- WithinClusterFeatures(so, "seurat_clusters", clus, ${_output[7]:r})
    
    # Pre-defined cluster dependent features (if applicable)
    Idents(object=so) <- "CellType"
    clus <- unique(so@meta.data$CellType)
    demarkers_pre_def <- WithinClusterFeatures(so, "CellType", clus, ${_output[7]:r}, suffix = "_pre_def")
    
    # Plot DE genes on UMAP
    PlotAndSaveDEGenesOnUMAP(so, demarkers, ${_output[8]:r}, height = 30, rank_by_tstat = TRUE)
    # Plot DE genes from pre-defined clusters on UMAP (if applicable)
    PlotAndSaveDEGenesOnUMAP(so, demarkers_pre_def, ${_output[9]:r}, height = 30, rank_by_tstat = TRUE)

    # Save Seurat object
    saveRDS(so, ${_output[10]:r})
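
The differential expression part of this step is delegated to `WithinClusterFeatures` in `utils.R`, which is not shown here. The `rank_by_tstat = TRUE` argument and the `matrixTests` import suggest that genes are ranked by a t statistic. A minimal sketch of one common choice, Welch's t statistic for one cluster against all other cells, is given below. Whether the helper uses exactly this test is an assumption, and the function names and toy values are invented.

```python
import math
from statistics import mean, variance

def welch_t(in_cluster, rest):
    """Welch's t statistic for two samples with unequal variances.
    Each sample needs at least two values and some spread."""
    se = math.sqrt(variance(in_cluster) / len(in_cluster)
                   + variance(rest) / len(rest))
    return (mean(in_cluster) - mean(rest)) / se

def rank_genes_by_tstat(expr, labels, cluster):
    """expr maps a gene to its per-cell expression values;
    labels holds the cluster label of each cell, in the same order."""
    scores = {}
    for gene, values in expr.items():
        inside = [v for v, lab in zip(values, labels) if lab == cluster]
        outside = [v for v, lab in zip(values, labels) if lab != cluster]
        scores[gene] = welch_t(inside, outside)
    # Genes most up-regulated in the cluster come first
    return sorted(scores, key=scores.get, reverse=True)

# Toy data: seven cells in two clusters, two genes
labels = ["a", "a", "a", "b", "b", "b", "b"]
expr = {
    "g1": [1, 2, 3, 4, 5, 6, 7],
    "g2": [5, 6, 7, 1, 2, 3, 4],
}
ranking = rank_genes_by_tstat(expr, labels, "a")
print(ranking)  # ['g2', 'g1']
```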

In [None]:
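# Get gene-level z scores
[magma]
# Prefix of the PLINK reference files (.bed/.bim/.fam)
parameter: bfile = path
# MAGMA gene annotation file
parameter: gene_annot = path
# GWAS summary statistics file
parameter: pval = path
# MAGMA gene analysis model
parameter: gene_model = 'snp-wise=mean'
# Prefix of the output files, relative to cwd
parameter: out = str
input: f'{bfile:a}.bed', gene_annot, pval
# MAGMA writes <prefix>.genes.out and <prefix>.genes.raw
output: f'{cwd}/{out}.genes.out', f'{cwd}/{out}.genes.raw'
task: trunk_workers = 1, walltime = '10h', mem = '30G', cores = numThreads, tags = f'{step_name}_{out}'
bash: container=container_pops, expand= "${ }", stderr = f'{cwd}/{out}.{step_name}.stderr', stdout = f'{cwd}/{out}.{step_name}.stdout'
    # ncol=N: per-SNP sample sizes are read from the column named N
    magma \
        --bfile ${_input[0]:n} \
        --gene-annot ${_input[1]} \
        --pval ${_input[2]} ncol=N \
        --gene-model ${gene_model} \
        --out ${cwd}/${out}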
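
The gene-level z scores produced by this step can be inspected with a few lines of Python. The sketch below assumes the usual whitespace-delimited `<prefix>.genes.out` layout with a header row that includes `GENE` and `ZSTAT` columns; the function name, the gene identifiers, and the numbers in the toy input are placeholders, not real results.

```python
import io

def read_gene_zscores(handle):
    """Read gene-level z scores from an open MAGMA .genes.out file."""
    header = handle.readline().split()
    # Locate columns by name so that extra columns do not matter
    gene_col = header.index("GENE")
    z_col = header.index("ZSTAT")
    zscores = {}
    for line in handle:
        fields = line.split()
        if fields:  # skip blank lines
            zscores[fields[gene_col]] = float(fields[z_col])
    return zscores

# Toy input standing in for a real file (placeholder genes and values)
toy = io.StringIO(
    "GENE CHR START STOP NSNPS NPARAM N ZSTAT P\n"
    "ENSG_A 1 1000 2000 10 3 50000 2.5 0.0062\n"
    "ENSG_B 1 5000 9000 25 5 50000 -0.4 0.6554\n"
)
zscores = read_gene_zscores(toy)
print(zscores)  # {'ENSG_A': 2.5, 'ENSG_B': -0.4}
```

For a real run, replace the toy input with `open("AFib.genes.out")` (or wherever the MAGMA output was written).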

In [None]:
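# Feature selection by the GLS model
[feature_selection]
# Gene feature matrix
parameter: features = path
# Prefix of the MAGMA gene results, relative to cwd
parameter: gene_results = str
# Prefix of the output files, relative to cwd
parameter: out = str
input: features, f'{cwd}/{gene_results}.genes.out', f'{cwd}/{gene_results}.genes.raw'
output: f'{cwd}/{out}.features'
task: trunk_workers = 1, walltime = '10h', mem = '30G', cores = numThreads, tags = f'{step_name}_{out}'
bash: container=container_pops, expand= "${ }", stderr = f'{cwd}/{out}.{step_name}.stderr', stdout = f'{cwd}/{out}.{step_name}.stdout'
    # Assumes pops.feature_selection.py is executable and on PATH inside the container
    pops.feature_selection.py \
        --features ${_input[0]} \
        --gene_results ${cwd}/${gene_results} \
        --out ${_output:n}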
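
For reference, the textbook generalized least squares (GLS) estimator for a single feature is written out below. Here $z$ is the vector of gene-level association statistics, $X_j$ is the design matrix holding feature $j$ (plus any covariates), and $\Sigma$ is the covariance of the errors across genes. How `pops.feature_selection.py` builds $\Sigma$ from the MAGMA output and which covariates it includes are not shown in this notebook, so the symbols are generic rather than a description of that script.

```latex
\hat{\beta}_j = \left( X_j^{\top} \Sigma^{-1} X_j \right)^{-1} X_j^{\top} \Sigma^{-1} z,
\qquad
\operatorname{Var}\!\left( \hat{\beta}_j \right) = \left( X_j^{\top} \Sigma^{-1} X_j \right)^{-1}
```

The variance expression holds when the errors have covariance $\Sigma$. Dividing the coefficient of feature $j$ by the square root of the matching diagonal entry gives the test statistic on which a marginal selection threshold can be applied.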

In [None]:
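# Polygenic Priority Score calculation
[predict_scores]
# Gene location file
parameter: gene_loc = path
# Prefix of the MAGMA gene results, relative to cwd
parameter: gene_results = str
# Gene feature matrix
parameter: features = path
# Features retained by the feature_selection step
parameter: selected_features = path
# Control features
parameter: control_features = path
# Chromosome
parameter: chromosome = int
# Prefix of the output files, relative to cwd
parameter: out = str
input: gene_loc, f'{cwd}/{gene_results}.genes.out', features, selected_features, control_features
# The output file name is an assumption; adjust it to what pops.predict_scores.py writes
output: f'{cwd}/{out}.{chromosome}.results'
task: trunk_workers = 1, walltime = '10h', mem = '30G', cores = numThreads, tags = f'{step_name}_{out}_{chromosome}'
bash: container=container_pops, expand= "${ }", stderr = f'{cwd}/{out}.{step_name}.stderr', stdout = f'{cwd}/{out}.{step_name}.stdout'
    # Assumes pops.predict_scores.py is executable and on PATH inside the container
    pops.predict_scores.py \
        --gene_loc ${_input[0]} \
        --gene_results ${cwd}/${gene_results} \
        --features ${_input[2]} \
        --selected_features ${_input[3]} \
        --control_features ${_input[4]} \
        --chromosome ${chromosome} \
        --out ${cwd}/${out}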