# Extracting CNVs from snRNA-seq data using copykat

This project works on data from [(Terekhanova et al. 2023)](https://www.nature.com/articles/s41586-023-06682-5#data-availability).
Please have a look at the [copykat](https://github.com/navinlabcode/copykat) and [Seurat](https://satijalab.org/seurat/reference/) documentation.

#### TODO:

* provide container with installed software for analysis? (particularly due to `igraph` compatibility, this would be better suited than a conda env.)
* learn more about [cellranger feature barcode matrices](https://www.10xgenomics.com/support/software/cell-ranger/latest/analysis/outputs/cr-outputs-mex-matrices)

## Setting up
To install the necessary software, please have a look at `install.R`.
Note that the `igraph` package, which the `Seurat` package requires, usually does not work in conda environments.
Therefore, please install the software directly on your machine.
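
A quick way to confirm the setup before running the notebook is to check which of the required packages can be found. The following is a minimal sketch; the helper name `missing_packages` is hypothetical and not part of `install.R`.

```r
# Hypothetical helper: return the names of packages that cannot be loaded.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Packages loaded in this notebook, plus igraph (a Seurat dependency).
required <- c("data.table", "Seurat", "copykat", "igraph")
not_installed <- missing_packages(required)

if (length(not_installed) > 0) {
  message("Missing packages: ", paste(not_installed, collapse = ", "))
} else {
  message("All required packages are installed.")
}
```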

In [2]:
# loading libraries
library(data.table)
library(Seurat)
library(copykat)

Loading required package: SeuratObject

Loading required package: sp


Attaching package: ‘SeuratObject’


The following objects are masked from ‘package:base’:

    intersect, t




### Getting the data
The data we want to process is available from the Gene Expression Omnibus (GEO) at accession number [GSE240822](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE240822).
In particular, we are interested in the samples [C3N-00495-T1_CPT0078510004_snRNA_ccRCC](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM7710088) and [C3L-00004-T1_CPT0001540013_snRNA_ccRCC](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM7710073).
Download the data manually, or execute the following commands in a terminal (designed for bash).
```sh
# download metadata
wget ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE240nnn/GSE240822/suppl/GSE240822%5FGBM%5FccRCC%5FRNA%5Fmetadata%5FCPTAC%5Fsamples%2Etsv%2Egz
gunzip GSE240822_GBM_ccRCC_RNA_metadata_CPTAC_samples.tsv.gz

# download snRNA raw counts
wget ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM7710nnn/GSM7710073/suppl/GSM7710073%5FC3L%2D00004%2DT1%5FCPT0001540013%5FsnRNA%5FccRCC%2Etar%2Egz
wget ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM7710nnn/GSM7710088/suppl/GSM7710088%5FC3N%2D00495%2DT1%5FCPT0078510004%5FsnRNA%5FccRCC%2Etar%2Egz
tar -xf GSM7710073_C3L-00004-T1_CPT0001540013_snRNA_ccRCC.tar.gz
tar -xf GSM7710088_C3N-00495-T1_CPT0078510004_snRNA_ccRCC.tar.gz
```
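
The same URLs can also be assembled from within R. The sketch below assumes the directory layout visible in the `wget` commands above, where the last three digits of an accession are replaced by `nnn`; the helper `geo_suppl_dir` is hypothetical, it assumes accessions with at least three digits, and the file name is written without percent-encoding.

```r
# Hypothetical helper: build the GEO FTP supplementary-file directory for an accession.
# Example: GSM7710073 is stored under samples/GSM7710nnn/GSM7710073/suppl/.
geo_suppl_dir <- function(accession) {
  kind <- if (startsWith(accession, "GSM")) "samples" else "series"
  stub <- sub("[0-9]{3}$", "nnn", accession)
  paste0("ftp://ftp.ncbi.nlm.nih.gov/geo/", kind, "/", stub, "/", accession, "/suppl/")
}

sample_file <- "GSM7710073_C3L-00004-T1_CPT0001540013_snRNA_ccRCC.tar.gz"
url <- paste0(geo_suppl_dir("GSM7710073"), sample_file)
print(url)

# Uncomment to download and unpack from within R (requires network access):
# download.file(url, destfile = sample_file, mode = "wb")
# untar(sample_file)
```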

## Filtering
Since the data is only provided as raw 10X cellranger gene expression counts from single-nucleus RNA sequencing (snRNA-seq), we need to filter out all barcodes that do not correspond to annotated cells.
To do so, read the metadata as well as the 10X cellranger output, and filter the latter based on the barcodes listed in the metadata.
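
The filtering step can be illustrated on toy data. This is a minimal sketch with made-up gene names, barcodes, and counts; only the column names `Barcode` and `GEO.sample` are taken from the real metadata. Because the same barcode sequence can occur in several samples, the metadata is restricted to one sample before matching.

```r
library(Matrix)

# Toy stand-in for the raw 10X matrix: 3 genes x 4 barcodes (values are made up).
toy_raw <- sparseMatrix(
  i = c(1, 2, 3, 1, 2),
  j = c(1, 1, 2, 3, 4),
  x = c(4, 1, 2, 7, 3),
  dims = c(3, 4),
  dimnames = list(
    c("GeneA", "GeneB", "GeneC"),
    c("AAAC-1", "AAAG-1", "AAAT-1", "AACC-1")
  )
)

# Toy stand-in for the metadata: two annotated barcodes in "sampleA",
# one of which also appears in "sampleB".
toy_meta <- data.frame(
  Barcode    = c("AAAC-1", "AAAT-1", "AAAC-1"),
  GEO.sample = c("sampleA", "sampleA", "sampleB")
)

# Keep only the columns whose barcode is annotated for the sample of interest.
keep <- toy_meta$Barcode[toy_meta$GEO.sample == "sampleA"]
toy_filtered <- toy_raw[, colnames(toy_raw) %in% keep, drop = FALSE]

print(dim(toy_filtered))

# Sanity check: annotated barcodes that are absent from the count matrix.
print(setdiff(keep, colnames(toy_raw)))
```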

In [4]:
getwd()
data_root <- file.path("data/")
dir(data_root)

In [5]:
# load metadata
metadata <- fread(file.path(data_root, "GSE240822_GBM_ccRCC_RNA_metadata_CPTAC_samples.tsv"))
sample_names <- c("C3N-00495-T1_CPT0078510004_snRNA_ccRCC", "C3L-00004-T1_CPT0001540013_snRNA_ccRCC")

In [25]:
# explore our data set
str(metadata[`GEO.sample` == sample_names[1]])
metadata

Classes ‘data.table’ and 'data.frame':	6577 obs. of  13 variables:
 $ Merged_barcode             : chr  "ccRCC_C3N-00495-T1_AAACCCAAGCGTACAG-1" "ccRCC_C3N-00495-T1_AAACCCAAGGCTGAAC-1" "ccRCC_C3N-00495-T1_AAACCCAAGTAGGCCA-1" "ccRCC_C3N-00495-T1_AAACCCAGTACTGGGA-1" ...
 $ Barcode                    : chr  "AAACCCAAGCGTACAG-1" "AAACCCAAGGCTGAAC-1" "AAACCCAAGTAGGCCA-1" "AAACCCAGTACTGGGA-1" ...
 $ Sample_RNA                 : chr  "CPT0078510004-CPT0078510004-lib1" "CPT0078510004-CPT0078510004-lib1" "CPT0078510004-CPT0078510004-lib1" "CPT0078510004-CPT0078510004-lib1" ...
 $ Sample_ATAC                : chr  "CPT0078510004" "CPT0078510004" "CPT0078510004" "CPT0078510004" ...
 $ Case_ID                    : chr  "C3N-00495" "C3N-00495" "C3N-00495" "C3N-00495" ...
 $ Piece_ID                   : chr  "C3N-00495-T1" "C3N-00495-T1" "C3N-00495-T1" "C3N-00495-T1" ...
 $ Sample_type                : chr  "Tumor" "Tumor" "Tumor" "Tumor" ...
 $ data.type.rna              : chr  "snRNA" "snRNA" "snRN

| Merged_barcode | Barcode | Sample_RNA | Sample_ATAC | Case_ID | Piece_ID | Sample_type | data.type.rna | Chemistry | Cancer | cell_type.harmonized.cancer | Aliquot | GEO.sample |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` |
| ccRCC_C3L-00088-T1_AAACCCAAGACGACTG-1 | AAACCCAAGACGACTG-1 | CPT0000870003-CPT0000870003-lib1 | CPT0000870003 | C3L-00088 | C3L-00088-T1 | Tumor | snRNA | snATAC | ccRCC | Tumor | CPT0000870003 | C3L-00088-T1_CPT0000870003_snRNA_ccRCC |
| ccRCC_C3L-00088-T1_AAACCCACAAATGATG-1 | AAACCCACAAATGATG-1 | CPT0000870003-CPT0000870003-lib1 | CPT0000870003 | C3L-00088 | C3L-00088-T1 | Tumor | snRNA | snATAC | ccRCC | T-cells | CPT0000870003 | C3L-00088-T1_CPT0000870003_snRNA_ccRCC |
| ccRCC_C3L-00088-T1_AAACCCACAATCCAGT-1 | AAACCCACAATCCAGT-1 | CPT0000870003-CPT0000870003-lib1 | CPT0000870003 | C3L-00088 | C3L-00088-T1 | Tumor | snRNA | snATAC | ccRCC | Tumor | CPT0000870003 | C3L-00088-T1_CPT0000870003_snRNA_ccRCC |
| ccRCC_C3L-00088-T1_AAACCCACACAATCTG-1 | AAACCCACACAATCTG-1 | CPT0000870003-CPT0000870003-lib1 | CPT0000870003 | C3L-00088 | C3L-00088-T1 | Tumor | snRNA | snATAC | ccRCC | Tumor | CPT0000870003 | C3L-00088-T1_CPT0000870003_snRNA_ccRCC |
| ccRCC_C3L-00088-T1_AAACCCACACTCCTTG-1 | AAACCCACACTCCTTG-1 | CPT0000870003-CPT0000870003-lib1 | CPT0000870003 | C3L-00088 | C3L-00088-T1 | Tumor | snRNA | snATAC | ccRCC | Tumor | CPT0000870003 | C3L-00088-T1_CPT0000870003_snRNA_ccRCC |
| ccRCC_C3L-00088-T1_AAACCCACAGCACCCA-1 | AAACCCACAGCACCCA-1 | CPT0000870003-CPT0000870003-lib1 | CPT0000870003 | C3L-00088 | C3L-00088-T1 | Tumor | snRNA | snATAC | ccRCC | Tumor | CPT0000870003 | C3L-00088-T1_CPT0000870003_snRNA_ccRCC |
| ccRCC_C3L-00088-T1_AAACCCAGTCAACACT-1 | AAACCCAGTCAACACT-1 | CPT0000870003-CPT0000870003-lib1 | CPT0000870003 | C3L-00088 | C3L-00088-T1 | Tumor | snRNA | snATAC | ccRCC | Macrophages | CPT0000870003 | C3L-00088-T1_CPT0000870003_snRNA_ccRCC |
| ccRCC_C3L-00088-T1_AAACCCAGTCAACCTA-1 | AAACCCAGTCAACCTA-1 | CPT0000870003-CPT0000870003-lib1 | CPT0000870003 | C3L-00088 | C3L-00088-T1 | Tumor | snRNA | snATAC | ccRCC | Tumor | CPT0000870003 | C3L-00088-T1_CPT0000870003_snRNA_ccRCC |
| ccRCC_C3L-00088-T1_AAACCCAGTCTTACAG-1 | AAACCCAGTCTTACAG-1 | CPT0000870003-CPT0000870003-lib1 | CPT0000870003 | C3L-00088 | C3L-00088-T1 | Tumor | snRNA | snATAC | ccRCC | Tumor | CPT0000870003 | C3L-00088-T1_CPT0000870003_snRNA_ccRCC |
| ccRCC_C3L-00088-T1_AAACCCAGTGTCTTCC-1 | AAACCCAGTGTCTTCC-1 | CPT0000870003-CPT0000870003-lib1 | CPT0000870003 | C3L-00088 | C3L-00088-T1 | Tumor | snRNA | snATAC | ccRCC | Macrophages | CPT0000870003 | C3L-00088-T1_CPT0000870003_snRNA_ccRCC |


In [21]:
# load 10X cellranger data for a single sample
sample_name <- sample_names[1]
raw_10X <- Read10X(data.dir=file.path(data_root, sample_name, 'outs', 'raw_feature_bc_matrix'))

In [11]:
str(raw_10X[, colnames(raw_10X) %in% metadata[`GEO.sample` == sample_name, Barcode]])
str(raw_10X)

Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
  ..@ i       : int [1:9776487] 50 78 97 113 138 184 211 246 298 355 ...
  ..@ p       : int [1:6578] 0 1116 1920 3908 5094 5943 6955 8775 10072 11095 ...
  ..@ Dim     : int [1:2] 36601 6577
  ..@ Dimnames:List of 2
  .. ..$ : chr [1:36601] "MIR1302-2HG" "FAM138A" "OR4F5" "AL627309.1" ...
  .. ..$ : chr [1:6577] "AAACCCAAGCGTACAG-1" "AAACCCAAGGCTGAAC-1" "AAACCCAAGTAGGCCA-1" "AAACCCAGTACTGGGA-1" ...
  ..@ x       : num [1:9776487] 1 1 1 1 1 1 1 1 1 1 ...
  ..@ factors : list()
Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
  ..@ i       : int [1:27803757] 28264 29591 51 190 475 774 1237 1503 1538 1709 ...
  ..@ p       : int [1:940789] 0 0 0 0 2 189 190 197 377 378 ...
  ..@ Dim     : int [1:2] 36601 940788
  ..@ Dimnames:List of 2
  .. ..$ : chr [1:36601] "MIR1302-2HG" "FAM138A" "OR4F5" "AL627309.1" ...
  .. ..$ : chr [1:940788] "AAACCCAAGAAACACT-1" "AAACCCAAGAAACTAC-1" "AAACCCAAGAAAGACA-1" "AAACCCAAGAAATGGG-1" ...
  .

In [22]:
# filter raw 10X data based on metadata
filtered_10X <- raw_10X[, colnames(raw_10X) %in% metadata[`GEO.sample` == sample_name, Barcode]]

# convert to Seurat Object
filtered_10X <- CreateSeuratObject(counts=filtered_10X, project=sample_name, min.cells=1, min.features=1) # TODO: adapt filtering parameters min.cells = 0, min.features = 0?

In [23]:
str(filtered_10X@assays$RNA@layers$counts)
print(filtered_10X@assays$RNA)

Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
  ..@ i       : int [1:9776487] 36 62 80 95 109 136 159 188 232 256 ...
  ..@ p       : int [1:6578] 0 1116 1920 3908 5094 5943 6955 8775 10072 11095 ...
  ..@ Dim     : int [1:2] 28090 6577
  ..@ Dimnames:List of 2
  .. ..$ : NULL
  .. ..$ : NULL
  ..@ x       : num [1:9776487] 1 1 1 1 1 1 1 1 1 1 ...
  ..@ factors : list()


Assay (v5) data with 28090 features for 6577 cells
First 10 features:
 MIR1302-2HG, AL627309.1, AL627309.2, AL627309.5, AL627309.4, LINC01409,
FAM87B, LINC01128, LINC00115, FAM41C 
Layers:
 counts 


In [27]:
# run copykat
copykat_10X <- copykat(rawmat=filtered_10X@assays$RNA@layers$counts, sam.name=sample_name, id.type="E")

[1] "running copykat v1.1.0"
[1] "step1: read and filter data ..."
[1] "28090 genes, 6577 cells in raw data"


“sparse->dense coercion: allocating vector of size 1.4 GiB”
“sparse->dense coercion: allocating vector of size 1.4 GiB”


[1] "7744 genes past LOW.DR filtering"
[1] "step 2: annotations gene coordinates ..."
[1] "start annotation ..."


ERROR: Error in copykat(rawmat = filtered_10X@assays$RNA@layers$counts, sam.name = sample_name, : all cells are filtered
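
The error is consistent with two details visible in the cells above. First, the `counts` layer extracted via `@layers$counts` has `NULL` dimnames, so copykat receives neither gene nor cell names. Second, `id.type="E"` requests Ensembl IDs, although the features are gene symbols (e.g. `MIR1302-2HG`). The following sketch shows one possible remedy on a toy matrix. The helper `with_dimnames` is hypothetical, and the commented calls at the end are untested assumptions about how it could be applied in this notebook.

```r
library(Matrix)

# Hypothetical helper: attach gene and cell names to a count matrix.
with_dimnames <- function(counts, genes, cells) {
  stopifnot(nrow(counts) == length(genes), ncol(counts) == length(cells))
  dimnames(counts) <- list(genes, cells)
  counts
}

# Toy matrix without dimnames, mimicking the extracted counts layer.
toy <- sparseMatrix(i = c(1, 2, 3), j = c(1, 2, 2), x = c(5, 1, 2), dims = c(3, 2))
print(dimnames(toy))

toy_named <- with_dimnames(
  toy,
  genes = c("MIR1302-2HG", "AL627309.1", "LINC01409"),
  cells = c("AAACCCAAGCGTACAG-1", "AAACCCAAGGCTGAAC-1")
)
print(rownames(toy_named))

# Assumed application in this notebook (not run here):
# counts <- with_dimnames(filtered_10X@assays$RNA@layers$counts,
#                         rownames(filtered_10X), colnames(filtered_10X))
# or, alternatively, an accessor that keeps the names:
# counts <- GetAssayData(filtered_10X, assay = "RNA", layer = "counts")
# copykat_10X <- copykat(rawmat = as.matrix(counts), sam.name = sample_name, id.type = "S")
```

Whether this resolves the error for this data set still needs to be verified by rerunning the cell.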


In [26]:
?copykat

copykat                package:copykat                 R Documentation

_c_o_p_y_c_a_t _m_a_i_n__f_u_n_c.

_D_e_s_c_r_i_p_t_i_o_n:

     copycat main_func.

_U_s_a_g_e:

     copykat(
       rawmat = rawdata,
       id.type = "S",
       cell.line = "no",
       ngene.chr = 5,
       LOW.DR = 0.05,
       UP.DR = 0.1,
       win.size = 25,
       norm.cell.names = "",
       KS.cut = 0.1,
       sam.name = "",
       distance = "euclidean",
       output.seg = "FALSE",
       plot.genes = "TRUE",
       genome = "hg20",
       n.cores = 1
     )
     
_A_r_g_u_m_e_n_t_s:

  rawmat: raw data matrix; genes in rows; cell names in columns.

 id.type: gene id type: Symbol or Ensemble.

cell.line: if the data are from pure cell line,put "yes"; if cell line
          data are a mixture of tumor and normal cells, still put "no".

ngene.chr: minimal number of genes per chromosome for cell filtering.

  LOW.DR: minimal population fractions of genes for smo

### References
Terekhanova, N.V., Karpova, A., Liang, WW. et al. Epigenetic regulation during cancer transitions across 11 tumour types. Nature 623, 432–441 (2023). https://doi.org/10.1038/s41586-023-06682-5
