# TreeAlign with allele-specific information
## Introduction
TreeAlign is a model for integrating scRNA and scDNA data. It can take total copy number information, allele-specific copy number information, or both, and uses them to assign cells from scRNA data to a phylogenetic tree constructed from scDNA data.

## Loading data

In [1]:
from treealign import CloneAlignClone
from treealign import CloneAlignTree

import pandas as pd
from Bio import Phylo

In [2]:
# load total copy number input

# scRNA read count matrix where each row represents a gene, 
# each column represents a cell
expr = pd.read_csv("../data/example_expr.csv", index_col=0)

# scDNA copy number matrix where each row represents a gene,
# each column represents a cell
cnv = pd.read_csv("../data/example_gene_cnv.csv", index_col=0)
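Before building the model, it can be useful to confirm that the expression and copy number matrices share genes in their row index. The following is a minimal sketch on small made-up DataFrames rather than the example CSV files; the helper name `shared_genes` and the toy gene and cell labels are hypothetical and not part of TreeAlign.

```python
import pandas as pd


def shared_genes(expr: pd.DataFrame, cnv: pd.DataFrame) -> pd.Index:
    """Return the gene labels found in the row index of both matrices."""
    return expr.index.intersection(cnv.index)


# Toy stand-ins for the real inputs (genes as rows, cells as columns).
toy_expr = pd.DataFrame(
    {"rna_cell_1": [5, 0, 2], "rna_cell_2": [1, 3, 0]},
    index=["GENE_A", "GENE_B", "GENE_C"],
)
toy_cnv = pd.DataFrame(
    {"dna_cell_1": [2, 3, 4], "dna_cell_2": [2, 2, 4]},
    index=["GENE_B", "GENE_C", "GENE_D"],
)

genes = shared_genes(toy_expr, toy_cnv)
print(f"{len(genes)} shared genes: {sorted(genes)}")
```

With the real data, the same call would be `shared_genes(expr, cnv)`. A very small overlap usually points to mismatched gene identifiers between the two files.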

In [3]:
# load allele specific input

# b allele frequency matrix
# each row represents a snp
# each column represents a cell
# The number in the matrix is the b allele frequency at the given snp and cell
hscn = pd.read_csv("../data/example_snp_baf.csv", index_col=0)

# reference allele count matrix from scRNA
# each row represents a snp
# each column represents a cell
snv_allele = pd.read_csv("../data/example_snp_allele.csv", index_col=0)

# total count matrix at SNPs from scRNA
# each row represents a snp
# each column represents a cell
snv_total = pd.read_csv("../data/example_snp_total.csv", index_col=0)
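As a quick sanity check on the two scRNA count matrices, one can compute the observed allele fraction at each SNP and cell by dividing the allele counts by the total counts. The sketch below uses toy data and a hypothetical helper `allele_fraction`; it is only meant to illustrate how the two matrices relate, and it is not a description of how TreeAlign processes them internally.

```python
import pandas as pd


def allele_fraction(allele_counts: pd.DataFrame, total_counts: pd.DataFrame) -> pd.DataFrame:
    """Per-SNP, per-cell allele fraction; entries with zero coverage become NaN."""
    # Restrict both matrices to the SNPs and cells they have in common.
    snps = allele_counts.index.intersection(total_counts.index)
    cells = allele_counts.columns.intersection(total_counts.columns)
    allele = allele_counts.loc[snps, cells]
    total = total_counts.loc[snps, cells]
    # Mask zero totals so the division yields NaN instead of inf or an error.
    return allele / total.where(total > 0)


# Toy stand-ins for snv_allele and snv_total (SNPs as rows, cells as columns).
toy_allele = pd.DataFrame({"cell_1": [1, 2], "cell_2": [0, 3]}, index=["snp_1", "snp_2"])
toy_total = pd.DataFrame({"cell_1": [2, 4], "cell_2": [0, 3]}, index=["snp_1", "snp_2"])

toy_fraction = allele_fraction(toy_allele, toy_total)
print(toy_fraction)
```

scRNA coverage at individual SNPs is typically sparse, so many entries are expected to be NaN on real data.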

In [4]:
# load phylogenetic tree used for clone assignment

# phylogenetic tree constructed with scDNA data in newick format
tree = Phylo.read("../data/example_hdbscan.newick", "newick")

## Running TreeAlign with tree

Using the integrated model with both total CN and allele-specific input data.

Note that we set `repeat=1` in this notebook to keep the runtime short. By default, the model is run 10 times to generate a consensus assignment of scRNA cells to the tree.
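To give an intuition for what a consensus over repeated runs could look like, here is a minimal sketch of a majority vote with a minimum-frequency cutoff. This is an assumption made for illustration and not TreeAlign's actual implementation: the function `consensus_assignment`, the toy labels, and the choice to leave low-agreement cells unassigned are all hypothetical. The default of 0.7 simply mirrors the `min_clone_assign_freq` default listed in the `help(CloneAlignTree)` output.

```python
from collections import Counter


def consensus_assignment(runs, min_freq=0.7):
    """Majority vote across runs.

    runs: list of dicts mapping cell_id -> clade label, one dict per repeat.
    A cell keeps its most frequent label only if that label appears in at
    least `min_freq` of all runs; otherwise it is left unassigned (None).
    """
    cells = set().union(*runs)
    consensus = {}
    for cell in sorted(cells):
        labels = [run[cell] for run in runs if cell in run]
        label, count = Counter(labels).most_common(1)[0]
        consensus[cell] = label if count / len(runs) >= min_freq else None
    return consensus


# Five hypothetical repeats for two cells.
toy_runs = [
    {"cell_a": "clade_1", "cell_b": "clade_1"},
    {"cell_a": "clade_1", "cell_b": "clade_2"},
    {"cell_a": "clade_1", "cell_b": "clade_1"},
    {"cell_a": "clade_2", "cell_b": "clade_2"},
    {"cell_a": "clade_1", "cell_b": "clade_1"},
]

toy_consensus = consensus_assignment(toy_runs)
print(toy_consensus)
```

Here `cell_a` is assigned to `clade_1` in 4 of 5 runs (0.8) and keeps that label, while `cell_b` reaches only 3 of 5 (0.6) and stays unassigned. With a single run, every cell trivially passes the cutoff, which is why a larger `repeat` gives more informative assignments.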

In [5]:
# construct CloneAlignTree object for data preprocessing
# run TreeAlign with both total copy number & allele specific datasets

# `repeat` is set to 1 here for demonstration purposes; in practice, a value larger than 5 would be better.
obj = CloneAlignTree(tree=tree, expr=expr, cnv=cnv, hscn=hscn, snv_allele=snv_allele, snv=snv_total, repeat=1)

# it is possible to run TreeAlign with total copy number data only
# obj = CloneAlignTree(tree=tree, expr=expr, cnv=cnv, repeat=1)

# it is also possible to run TreeAlign with allele specific data only
# obj = CloneAlignTree(tree=tree, hscn=hscn, snv_allele=snv_allele, snv=snv_total, repeat=1)

# running TreeAlign to assign cells to phylogenetic subclades
obj.assign_cells_to_tree()

  intersect_index = self.hscn_df.index & self.snv_allele_df.index & self.snv_df.index
  intersect_cells = self.snv_allele_df.columns & self.snv_df.columns
  intersect_cells = self.expr_df.columns & self.snv_allele_df.columns & self.snv_df.columns





Start processing 
At node_0, one of the child clade is node_1 with 341 terminals. 
At node_0, one of the child clade is node_341 with 721 terminals. 


  intersect_index = clone_hscn_df.index & snv.index & snv_allele.index


Start run clonealign for clade: node_0
cnv gene count: 1371
expr cell count: 1000
hscn snp count: 1355
snv allele matrix cell count: 1000




seed = 55, initial_loss = 3691866.7827850343
Start Inference.
ELBO converged at iteration 217
Clonealign finished!
CloneAlign Tree finishes at clade: node_0 with correct frequency 1.0




Start processing 
At node_1, one of the child clade is node_2 with 77 terminals. 
At node_1, one of the child clade is node_78 with 264 terminals. 
Start run clonealign for clade: node_1
cnv gene count: 1571
expr cell count: 259
hscn snp count: 1291
snv allele matrix cell count: 259
seed = 50, initial_loss = 1071128.4455590134
Start Inference.
ELBO converged at iteration 299
Clonealign finished!
CloneAlign Tree finishes at clade: node_1 with correct frequency 1.0




Start processing 
At node_2, there are less than 20 cells in the expr matrix.



Start processing 
At node_78, one of the child clade is node_79 with 32 terminals. 
At node_78, one of the child clade is node_110 with 232 terminals. 
Start run clonealign for clade: node_78
cnv gene count: 355
expr cell count: 257
hscn snp count: 208
snv al

In [6]:
# to view more details about parameters you can customize when you run TreeAlign
help(CloneAlignTree)

Help on class CloneAlignTree in module treealign.clonealign_tree:

class CloneAlignTree(treealign.clonealign.CloneAlign)
 |  CloneAlignTree(tree, expr=None, cnv=None, hscn=None, snv_allele=None, snv=None, normalize_cnv=True, cnv_cutoff=10, infer_s_score=True, infer_b_allele=True, repeat=10, min_clone_assign_prob=0.8, min_clone_assign_freq=0.7, min_consensus_gene_freq=0.6, min_consensus_snv_freq=0.6, max_temp=1.0, min_temp=0.5, anneal_rate=0.01, learning_rate=0.1, max_iter=400, rel_tol=5e-05, record_input_output=False, min_cell_count_expr=20, min_cell_count_cnv=20, min_gene_diff=100, min_snp_diff=100, level_cutoff=10, min_proceed_freq=0.7, min_record_freq=0.7)
 |  
 |  Method resolution order:
 |      CloneAlignTree
 |      treealign.clonealign.CloneAlign
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __init__(self, tree, expr=None, cnv=None, hscn=None, snv_allele=None, snv=None, normalize_cnv=True, cnv_cutoff=10, infer_s_score=True, infer_b_allele=True, repeat=10, min_

## Getting results
The output of TreeAlign includes:

1. A table indicating the subclade to which each cell in the scRNA data is assigned.
2. For each gene, a score between 0 and 1 reflecting dosage effects.

In [7]:
clone_assign_df, gene_type_score_df, allele_assign_prob_df = obj.generate_output()

In [8]:
# subclade assignment for each cell in scRNA data
clone_assign_df

Unnamed: 0,cell_id,clone_id
0,SPECTRUM-OV-022_S1_CD45N_RIGHT_ADNEXA_CTAGACAT...,node_462
1,SPECTRUM-OV-022_S1_CD45N_RIGHT_ADNEXA_CTCCATGA...,node_462
2,SPECTRUM-OV-022_S1_CD45N_RIGHT_ADNEXA_TTGGGTAG...,node_351
3,SPECTRUM-OV-022_S1_CD45N_RIGHT_ADNEXA_CCCTCAAC...,node_351
4,SPECTRUM-OV-022_S1_CD45N_RIGHT_ADNEXA_TTGTTTGA...,node_118
...,...,...
995,SPECTRUM-OV-022_S1_CD45N_RIGHT_ADNEXA_AGCTACAT...,node_351
996,SPECTRUM-OV-022_S1_CD45N_LEFT_ADNEXA_TCCTGCAGT...,node_462
997,SPECTRUM-OV-022_S1_CD45N_RIGHT_ADNEXA_GTTACCCC...,node_462
998,SPECTRUM-OV-022_S1_CD45N_LEFT_ADNEXA_ATTTCACTC...,node_462
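A common next step is to summarize how many scRNA cells landed in each subclade. The sketch below assumes a table with the `cell_id` and `clone_id` columns shown above; the toy cell IDs are made up, and the clade labels are reused from the displayed rows only for illustration.

```python
import pandas as pd

# Toy stand-in for clone_assign_df.
toy_assign = pd.DataFrame(
    {
        "cell_id": ["cell_1", "cell_2", "cell_3", "cell_4", "cell_5"],
        "clone_id": ["node_462", "node_462", "node_351", "node_351", "node_462"],
    }
)

# Number of cells per subclade, and the same as a fraction of all cells.
clone_sizes = toy_assign["clone_id"].value_counts()
clone_fractions = clone_sizes / len(toy_assign)

print(clone_sizes)
print(clone_fractions)
```

On the real output, replace `toy_assign` with `clone_assign_df`.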


In [9]:
# the probability of having dosage effects for each gene
gene_type_score_df

Unnamed: 0,gene,gene_type_score
0,TINAGL1,1.000000e+00
1,PEF1,1.000000e+00
2,COL16A1,6.667235e-01
3,SPOCD1,6.203965e-14
4,AL136115.1,5.000000e-01
...,...,...
3450,KIAA1024L,8.827711e-02
3451,SLC22A4,7.272410e-01
3452,CCNI2,9.999994e-01
3453,SOWAHA,9.999984e-01
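The per-gene scores can be thresholded to obtain a list of genes that appear to show dosage effects. The following sketch reuses the first five rows displayed above; the cutoff of 0.5 and the variable names are illustrative assumptions, not a recommendation from TreeAlign.

```python
import pandas as pd

# First five rows of the table above, re-entered by hand.
toy_scores = pd.DataFrame(
    {
        "gene": ["TINAGL1", "PEF1", "COL16A1", "SPOCD1", "AL136115.1"],
        "gene_type_score": [1.0, 1.0, 6.667235e-01, 6.203965e-14, 5.0e-01],
    }
)

score_cutoff = 0.5  # hypothetical threshold, adjust to your own analysis

# Keep genes whose score is strictly above the cutoff.
dosage_genes = toy_scores.loc[toy_scores["gene_type_score"] > score_cutoff, "gene"].tolist()
print(dosage_genes)
```

Because the comparison is strict, a gene sitting exactly at the cutoff (here `AL136115.1`) is excluded.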


### Running the total CN TreeAlign model

In [10]:
# it is possible to run TreeAlign with total copy number data only
obj1 = CloneAlignTree(tree=tree, expr=expr, cnv=cnv, repeat=1)

# running TreeAlign to assign cells to phylogenetic subclades
obj1.assign_cells_to_tree()




Start processing 
At node_0, one of the child clade is node_1 with 341 terminals. 
At node_0, one of the child clade is node_341 with 721 terminals. 
Start run clonealign for clade: node_0
cnv gene count: 1371
expr cell count: 1000




seed = 55, initial_loss = 3667581.3937673043
Start Inference.
ELBO converged at iteration 81
Clonealign finished!
CloneAlign Tree finishes at clade: node_0 with correct frequency 0.998




Start processing 
At node_1, one of the child clade is node_2 with 77 terminals. 
At node_1, one of the child clade is node_78 with 264 terminals. 
Start run clonealign for clade: node_1
cnv gene count: 1584
expr cell count: 268
seed = 95, initial_loss = 1096122.6190733353
Start Inference.
ELBO converged at iteration 290
Clonealign finished!
CloneAlign Tree finishes at clade: node_1 with correct frequency 1.0




Start processing 
At node_2, there are less than 20 cells in the expr matrix.



Start processing 
At node_78, one of the child clade is node_79 with 32 terminals. 
At node_78, one of the child clade is node_110 with 232 terminals. 
Start run clonealign for clade: node_78
cnv gene count: 355
expr cell count: 266
seed = 50, initial_loss = 262864.70742489933
Start Inference.
ELBO converged at 

In [11]:
# generate output from the total CN model
clone_assign_df1, gene_type_score_df1, allele_assign_prob_df1 = obj1.generate_output()

In [12]:
# subclade assignment for each cell in scRNA data
clone_assign_df1

Unnamed: 0,cell_id,clone_id
0,SPECTRUM-OV-022_S1_CD45N_RIGHT_ADNEXA_CTAGACAT...,node_462
1,SPECTRUM-OV-022_S1_CD45N_RIGHT_ADNEXA_CTCCATGA...,node_462
2,SPECTRUM-OV-022_S1_CD45N_RIGHT_ADNEXA_TTGGGTAG...,node_462
3,SPECTRUM-OV-022_S1_CD45N_RIGHT_ADNEXA_CCCTCAAC...,node_351
4,SPECTRUM-OV-022_S1_CD45N_RIGHT_ADNEXA_TTGTTTGA...,node_149
...,...,...
995,SPECTRUM-OV-022_S1_CD45N_RIGHT_ADNEXA_AGCTACAT...,node_351
996,SPECTRUM-OV-022_S1_CD45N_LEFT_ADNEXA_TCCTGCAGT...,node_462
997,SPECTRUM-OV-022_S1_CD45N_RIGHT_ADNEXA_GTTACCCC...,node_462
998,SPECTRUM-OV-022_S1_CD45N_LEFT_ADNEXA_ATTTCACTC...,node_462
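To see how much two models agree, the assignment tables can be joined on `cell_id` and compared. The sketch below uses made-up cell IDs together with the clade labels from the first five displayed rows of the integrated and total CN outputs. Exact label matching is a crude measure, since clades in a tree are nested and two different labels may still refer to related subclades.

```python
import pandas as pd

# Toy stand-ins for clone_assign_df (integrated) and clone_assign_df1 (total CN).
toy_integrated = pd.DataFrame(
    {
        "cell_id": ["cell_0", "cell_1", "cell_2", "cell_3", "cell_4"],
        "clone_id": ["node_462", "node_462", "node_351", "node_351", "node_118"],
    }
)
toy_total_cn = pd.DataFrame(
    {
        "cell_id": ["cell_0", "cell_1", "cell_2", "cell_3", "cell_4"],
        "clone_id": ["node_462", "node_462", "node_462", "node_351", "node_149"],
    }
)

# Inner join keeps only cells present in both tables.
merged = toy_integrated.merge(toy_total_cn, on="cell_id", suffixes=("_integrated", "_total_cn"))

# Fraction of cells given the identical clade label by both models.
agreement = (merged["clone_id_integrated"] == merged["clone_id_total_cn"]).mean()

# Contingency table of integrated labels (rows) versus total CN labels (columns).
label_table = pd.crosstab(merged["clone_id_integrated"], merged["clone_id_total_cn"])

print(f"exact agreement: {agreement:.2f}")
print(label_table)
```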


In [13]:
# the probability of having dosage effects for each gene
gene_type_score_df1

Unnamed: 0,gene,gene_type_score
0,TINAGL1,9.976153e-01
1,PEF1,9.994708e-01
2,COL16A1,3.498424e-01
3,SPOCD1,6.595093e-04
4,AL136115.1,9.999971e-01
...,...,...
3449,LINC00992,1.000000e+00
3450,ZNF474,1.444011e-13
3451,KIAA1024L,7.423630e-17
3452,SOWAHA,6.839382e-17


### Running the allele-specific TreeAlign model

In [14]:
# it is also possible to run TreeAlign with allele specific data only
obj2 = CloneAlignTree(tree=tree, hscn=hscn, snv_allele=snv_allele, snv=snv_total, repeat=1)

# running TreeAlign to assign cells to phylogenetic subclades
obj2.assign_cells_to_tree()

  intersect_index = self.hscn_df.index & self.snv_allele_df.index & self.snv_df.index
  intersect_cells = self.snv_allele_df.columns & self.snv_df.columns





Start processing 
At node_0, one of the child clade is node_1 with 341 terminals. 
At node_0, one of the child clade is node_341 with 721 terminals. 


  intersect_index = clone_hscn_df.index & snv.index & snv_allele.index


Start run clonealign for clade: node_0
hscn snp count: 1355
snv allele matrix cell count: 1000
seed = 83, initial_loss = 37807.96081815402
Start Inference.
ELBO converged at iteration 213
Clonealign finished!
CloneAlign Tree finishes at clade: node_0 with correct frequency 0.997




Start processing 
At node_1, one of the child clade is node_2 with 77 terminals. 
At node_1, one of the child clade is node_78 with 264 terminals. 
Start run clonealign for clade: node_1
hscn snp count: 1291
snv allele matrix cell count: 290
seed = 0, initial_loss = 8646.4531740246
Start Inference.
ELBO converged at iteration 234
Clonealign finished!
CloneAlign Tree finishes at clade: node_1 with correct frequency 1.0




Start processing 
At node_2, one of the child clade is node_3 with 26 terminals. 
At node_2, one of the child clade is node_28 with 51 terminals. 
Start run clonealign for clade: node_2
hscn snp count: 1242
snv allele matrix cell count: 31
seed = 59, initial_loss = 355.5534299098649
Start 

In [15]:
# generate output from the allele-specific model
clone_assign_df2, gene_type_score_df2, allele_assign_prob_df2 = obj2.generate_output()

In [16]:
# subclade assignment for each cell in scRNA data
clone_assign_df2

Unnamed: 0,cell_id,clone_id
0,SPECTRUM-OV-022_S1_CD45N_RIGHT_ADNEXA_CTAGACAT...,node_351
1,SPECTRUM-OV-022_S1_CD45N_RIGHT_ADNEXA_CTCCATGA...,node_36
2,SPECTRUM-OV-022_S1_CD45N_RIGHT_ADNEXA_TTGGGTAG...,node_400
3,SPECTRUM-OV-022_S1_CD45N_RIGHT_ADNEXA_CCCTCAAC...,node_462
4,SPECTRUM-OV-022_S1_CD45N_RIGHT_ADNEXA_TTGTTTGA...,node_114
...,...,...
995,SPECTRUM-OV-022_S1_CD45N_RIGHT_ADNEXA_AGCTACAT...,node_462
996,SPECTRUM-OV-022_S1_CD45N_LEFT_ADNEXA_TCCTGCAGT...,node_351
997,SPECTRUM-OV-022_S1_CD45N_RIGHT_ADNEXA_GTTACCCC...,node_351
998,SPECTRUM-OV-022_S1_CD45N_LEFT_ADNEXA_ATTTCACTC...,node_462


In [17]:
# the probability of having dosage effects for each gene
# this should be empty because this version of the model lacks total copy number input data
gene_type_score_df2
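When writing downstream code that should work for all three model variants, it may be safer to guard against a missing or empty score table. Whether the allele-specific model returns an empty DataFrame or `None` here is not shown above, so this hypothetical helper, `has_gene_scores`, handles both cases.

```python
import pandas as pd


def has_gene_scores(scores) -> bool:
    """True if `scores` is a non-empty table; False for None or an empty table."""
    return scores is not None and len(scores) > 0


# Toy examples of the cases this guard is meant to cover.
empty_scores = pd.DataFrame(columns=["gene", "gene_type_score"])
some_scores = pd.DataFrame({"gene": ["TINAGL1"], "gene_type_score": [1.0]})

for name, table in [("none", None), ("empty", empty_scores), ("some", some_scores)]:
    print(name, has_gene_scores(table))
```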