# TreeAlign with allele-specific information
## Introduction
TreeAlign is a model for integrating scRNA and scDNA data. It takes total copy number information, allele-specific copy number information, or both, and assigns cells from scRNA data to a phylogenetic tree constructed from scDNA data.

## Loading data

In [1]:
from treealign import CloneAlignClone
from treealign import CloneAlignTree

import pandas as pd
from Bio import Phylo

In [2]:
# load total copy number input

# scRNA read count matrix where each row represents a gene, 
# each column represents a cell
expr = pd.read_csv("../data/example_expr.csv", index_col=0)

# scDNA copy number matrix where each row represents a gene,
# each column represents a cell
cnv = pd.read_csv("../data/example_gene_cnv.csv", index_col=0)
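
Before running the model, it can help to confirm that the two matrices share gene identifiers, since both are indexed by gene. The following is a minimal sketch on toy data, not part of TreeAlign itself; the gene and cell names are made up, and whether TreeAlign performs the same intersection internally is an assumption.

```python
import pandas as pd

# Toy stand-ins for the real inputs (genes x cells); all values are made up.
expr = pd.DataFrame(
    [[5, 0, 2], [1, 3, 0], [0, 0, 7]],
    index=["GENE_A", "GENE_B", "GENE_C"],
    columns=["rna_cell_1", "rna_cell_2", "rna_cell_3"],
)
cnv = pd.DataFrame(
    [[2, 3], [2, 2], [4, 1]],
    index=["GENE_B", "GENE_C", "GENE_D"],
    columns=["dna_cell_1", "dna_cell_2"],
)

# Genes present in both matrices; only these link expression to copy number.
shared_genes = expr.index.intersection(cnv.index)

# Restrict both matrices to the shared genes, in the same row order.
expr_shared = expr.loc[shared_genes]
cnv_shared = cnv.loc[shared_genes]

print(f"{len(shared_genes)} shared genes: {list(shared_genes)}")
```

A very small overlap usually points to mismatched gene identifiers (for example, gene symbols in one file and Ensembl IDs in the other).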

In [3]:
# load allele specific input

# B-allele frequency (BAF) matrix
# each row represents a SNP
# each column represents a cell
# each entry is the B-allele frequency at the given SNP and cell
hscn = pd.read_csv("../data/example_snp_baf.csv", index_col=0)

# reference allele count matrix from scRNA
# each row represents a SNP
# each column represents a cell
snv_allele = pd.read_csv("../data/example_snp_allele.csv", index_col=0)

# total count matrix at SNPs from scRNA
# each row represents a SNP
# each column represents a cell
snv_total = pd.read_csv("../data/example_snp_total.csv", index_col=0)
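
The two scRNA count matrices describe the same SNPs and cells, so a quick consistency check is possible before passing them to the model. The sketch below uses toy data with made-up SNP and cell names. It is an illustrative pre-flight check under those assumptions, not something TreeAlign requires.

```python
import pandas as pd

# Toy SNP x cell matrices; names and counts are made up for illustration.
snps = ["snp_1", "snp_2", "snp_3"]
cells = ["rna_cell_1", "rna_cell_2"]
snv_allele = pd.DataFrame([[3, 0], [1, 2], [0, 0]], index=snps, columns=cells)
snv_total = pd.DataFrame([[4, 0], [5, 2], [0, 1]], index=snps, columns=cells)

# Both matrices should carry the same row and column labels.
same_layout = (
    snv_allele.index.equals(snv_total.index)
    and snv_allele.columns.equals(snv_total.columns)
)

# An allele count can never exceed the total count at the same SNP and cell.
counts_consistent = bool((snv_allele <= snv_total).all().all())

# Fraction of reads supporting the counted allele; NaN where coverage is zero.
allele_fraction = snv_allele / snv_total.where(snv_total > 0)

print(same_layout, counts_consistent)
print(allele_fraction)
```

Masking zero-coverage entries with `where` keeps uncovered SNPs as missing values instead of producing division-by-zero results.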

In [4]:
# load phylogenetic tree used for clone assignment

# phylogenetic tree constructed with scDNA data in newick format
tree = Phylo.read("../data/example_hdbscan.newick", "newick")
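
To get a feel for the tree before assignment, the clade sizes can be inspected with Biopython. The sketch below parses a small hypothetical Newick string rather than the example file; the node and cell names are assumptions chosen for illustration.

```python
from io import StringIO

from Bio import Phylo

# Hypothetical five-cell tree with named internal nodes, in Newick format.
newick = "((cell_a,cell_b)node_1,(cell_c,cell_d,cell_e)node_2)node_0;"
tree = Phylo.read(StringIO(newick), "newick")

# Number of terminals (cells) under each child clade of the root.
child_sizes = {clade.name: clade.count_terminals() for clade in tree.root.clades}

# Names of all terminals in the tree.
terminal_names = [leaf.name for leaf in tree.get_terminals()]

print(child_sizes)
print(terminal_names)
```

The same calls work on a tree read from a file, which gives a quick view of how many scDNA cells sit under each child clade.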

## Running TreeAlign with tree

In [5]:
# construct CloneAlignTree object for data preprocessing
# run TreeAlign with both total copy number & allele specific datasets

# `repeat` is set to 1 here for demonstration purposes; in practice, a value larger than 5 is recommended.
obj = CloneAlignTree(tree=tree, expr=expr, cnv=cnv, hscn=hscn, snv_allele=snv_allele, snv=snv_total, repeat=1)

# it is possible to run TreeAlign with total copy number data only
# obj = CloneAlignTree(tree=tree, expr=expr, cnv=cnv, repeat=1)

# it is also possible to run TreeAlign with allele specific data only
# obj = CloneAlignTree(tree=tree, hscn=hscn, snv_allele=snv_allele, snv=snv_total, repeat=1)

# running TreeAlign to assign cells to phylogenetic subclades
obj.assign_cells_to_tree()

Start processing 
At node_0, one of the child clade is node_1 with 341 terminals. 
At node_0, one of the child clade is node_341 with 721 terminals. 


There are 1371 genes in matrices. 
Start run clonealign for clade: node_0
cnv gene count: 1371
expr cell count: 1000
hscn snp count: 1355
snv allele matrix cell count: 1000
seed = 55, initial_loss = 6514374.023016376
Start Inference.
ELBO converged at iteration 75
Clonealign finished!
CloneAlign Tree finishes at clade: node_0 with correct frequency 0.995




Start processing 
At node_1, one of the child clade is node_2 with 77 terminals. 
At node_1, one of the child clade is node_78 with 264 terminals. 
There are 1579 genes in matrices. 
Start run clonealign for clade: node_1
cnv gene count: 1579
expr cell count: 266
hscn snp count: 1291
snv allele matrix cell count: 266
seed = 55, initial_loss = 2091303.4187890508
Start Inference.
ELBO converged at iteration 246
Clonealign finished!
CloneAlign Tree finishes at clade: node_1 with correct frequency 1.0




Start processing 
At node_2, there are less than 20 cells in the expr matrix.



Start processing 
At node_78, one of the child clad
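
The log above suggests a top-down traversal: each clade is processed, its cells are divided between the child clades, and a clade is skipped when fewer than 20 cells remain in the expression matrix. The sketch below illustrates that control flow only. The tree, the per-cell clade memberships (a stand-in for the model's decisions), and the `descend` helper are all hypothetical; this is not TreeAlign's implementation.

```python
# Hypothetical tree: each internal node maps to its child clades.
children = {
    "node_0": ["node_1", "node_4"],
    "node_1": ["node_2", "node_3"],
}

# Stand-in for the model's decisions: the clades each cell falls under.
cell_clades = {
    "cell_a": {"node_1", "node_2"},
    "cell_b": {"node_1", "node_3"},
    "cell_c": {"node_1", "node_3"},
    "cell_d": {"node_4"},
}


def descend(node, cells, min_cells, visited):
    """Visit `node`, then recurse into its child clades with their cells."""
    if len(cells) < min_cells:
        # Too few cells to continue (an assumed stopping rule).
        visited.append((node, "skipped"))
        return
    visited.append((node, "processed"))
    # A node without children is recorded but has nothing left to split.
    for child in children.get(node, []):
        child_cells = [c for c in cells if child in cell_clades[c]]
        descend(child, child_cells, min_cells, visited)


visited = []
descend("node_0", sorted(cell_clades), min_cells=2, visited=visited)
for node, status in visited:
    print(node, status)
```

Here `min_cells=2` plays the role that a threshold of 20 cells appears to play in the log.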

In [6]:
# to view more details about parameters you can customize when you run TreeAlign
help(CloneAlignTree)

Help on class CloneAlignTree in module treealign.clonealign_tree:

class CloneAlignTree(treealign.clonealign.CloneAlign)
 |  CloneAlignTree(tree, expr=None, cnv=None, hscn=None, snv_allele=None, snv=None, normalize_cnv=True, cnv_cutoff=10, infer_s_score=True, infer_b_allele=True, repeat=10, min_clone_assign_prob=0.8, min_clone_assign_freq=0.7, min_consensus_gene_freq=0.6, min_consensus_snv_freq=0.6, max_temp=1.0, min_temp=0.5, anneal_rate=0.01, learning_rate=0.1, max_iter=400, rel_tol=5e-05, record_input_output=False, min_cell_count_expr=20, min_cell_count_cnv=20, min_gene_diff=100, min_snp_diff=100, level_cutoff=10, min_proceed_freq=0.7, min_record_freq=0.7)
 |  
 |  Method resolution order:
 |      CloneAlignTree
 |      treealign.clonealign.CloneAlign
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __init__(self, tree, expr=None, cnv=None, hscn=None, snv_allele=None, snv=None, normalize_cnv=True, cnv_cutoff=10, infer_s_score=True, infer_b_allele=True, repeat=10, min_
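
The signature lists `repeat` together with `min_clone_assign_freq`. One plausible reading, not confirmed by the help text shown here, is that the model is run several times and a cell keeps a clone label only if that label recurs often enough across runs. The sketch below illustrates such a consensus rule on made-up labels; the `consensus` helper is hypothetical and is not TreeAlign's API.

```python
from collections import Counter

# Hypothetical clone labels for two cells across five independent runs.
runs = {
    "cell_a": ["node_1", "node_1", "node_1", "node_1", "node_2"],
    "cell_b": ["node_1", "node_2", "node_2", "node_1", "node_3"],
}


def consensus(labels, min_freq):
    """Return (label, frequency); label is None if the top label is too rare."""
    label, count = Counter(labels).most_common(1)[0]
    freq = count / len(labels)
    return (label if freq >= min_freq else None), freq


# 0.7 mirrors the default shown for min_clone_assign_freq (assumed meaning).
result = {cell: consensus(labels, min_freq=0.7) for cell, labels in runs.items()}
print(result)
```

Under this reading, a larger `repeat` gives a more stable estimate of each label's frequency, which would explain why a single run is suitable only for demonstration.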

## Getting results
The output of TreeAlign includes:

1. a table indicating the subclade to which each cell in the scRNA data is assigned;
2. for each gene, a score between 0 and 1 reflecting dosage effects.

`generate_output()` also returns a third table, `allele_assign_prob_df`.

In [7]:
clone_assign_df, gene_type_score_df, allele_assign_prob_df = obj.generate_output()

In [8]:
# subclade assignment for each cell in scRNA data
clone_assign_df

| | cell_id | clone_id |
|---|---|---|
| 0 | SPECTRUM-OV-022_S1_CD45N_RIGHT_ADNEXA_CTAGACAT... | node_462 |
| 1 | SPECTRUM-OV-022_S1_CD45N_RIGHT_ADNEXA_CTCCATGA... | node_462 |
| 2 | SPECTRUM-OV-022_S1_CD45N_RIGHT_ADNEXA_TTGGGTAG... | node_351 |
| 3 | SPECTRUM-OV-022_S1_CD45N_RIGHT_ADNEXA_CCCTCAAC... | node_351 |
| 4 | SPECTRUM-OV-022_S1_CD45N_RIGHT_ADNEXA_TTGTTTGA... | node_149 |
| ... | ... | ... |
| 990 | SPECTRUM-OV-022_S1_CD45N_RIGHT_ADNEXA_AGCTACAT... | node_351 |
| 991 | SPECTRUM-OV-022_S1_CD45N_LEFT_ADNEXA_TCCTGCAGT... | node_462 |
| 992 | SPECTRUM-OV-022_S1_CD45N_RIGHT_ADNEXA_GTTACCCC... | node_462 |
| 993 | SPECTRUM-OV-022_S1_CD45N_LEFT_ADNEXA_ATTTCACTC... | node_462 |
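
A common next step is to count how many cells were assigned to each subclade. The sketch below builds a toy table with the same two columns as `clone_assign_df`; the cell IDs are placeholders, and the clone labels follow the first five rows shown above.

```python
import pandas as pd

# Toy table with the same columns as clone_assign_df; cell IDs are placeholders.
clone_assign_df = pd.DataFrame(
    {
        "cell_id": ["cell_1", "cell_2", "cell_3", "cell_4", "cell_5"],
        "clone_id": ["node_462", "node_462", "node_351", "node_351", "node_149"],
    }
)

# Number of cells per subclade, and the same as a fraction of all cells.
clone_sizes = clone_assign_df["clone_id"].value_counts()
clone_fractions = clone_assign_df["clone_id"].value_counts(normalize=True)

print(clone_sizes)
print(clone_fractions)
```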


In [9]:
# the probability of having dosage effects for each gene
gene_type_score_df

| | gene | gene_type_score |
|---|---|---|
| 0 | TINAGL1 | 3.315783e-01 |
| 1 | PEF1 | 9.254875e-01 |
| 2 | COL16A1 | 6.666608e-01 |
| 3 | SPOCD1 | 2.565527e-02 |
| 4 | AL136115.1 | 1.004067e-01 |
| ... | ... | ... |
| 3442 | LINC00992 | 1.000000e+00 |
| 3443 | ZNF474 | 8.513761e-17 |
| 3444 | KIAA1024L | 7.467213e-17 |
| 3445 | SOWAHA | 1.491436e-11 |
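
To shortlist genes that likely show dosage effects, the scores can be thresholded. The sketch below uses the first five rows of the output above, rounded to four decimals. The 0.5 cutoff is an arbitrary assumption for illustration, not a value recommended by TreeAlign.

```python
import pandas as pd

# First five rows of the output above, rounded to four decimals.
gene_type_score_df = pd.DataFrame(
    {
        "gene": ["TINAGL1", "PEF1", "COL16A1", "SPOCD1", "AL136115.1"],
        "gene_type_score": [0.3316, 0.9255, 0.6667, 0.0257, 0.1004],
    }
)

# Assumed cutoff; choose a value appropriate for your own analysis.
cutoff = 0.5

# Genes at or above the cutoff, highest score first.
dosage_genes = (
    gene_type_score_df[gene_type_score_df["gene_type_score"] >= cutoff]
    .sort_values("gene_type_score", ascending=False)["gene"]
    .tolist()
)

print(dosage_genes)
```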
