# Machine Learning in Single Cell RNA Sequencing (scRNAseq) Data

## Problem Statements
- Applying ML techniques to remove noise from count data
- Pattern recognition to compare healthy and diseased cells
- Trajectory inference to reconstruct biological processes such as cell differentiation

## Recent Advances in Single-Cell RNA Sequencing Data
### What is sequencing?
- Determining the sequence of nucleotides (A, C, G, T) in the DNA of a cell
- Helpful for determining differences between healthy and diseased/mutated cells
- Enables new techniques for diagnosing disease and developing personalized medicines
- Applications in paleontology, fossil analysis, and tracing the lineage of extinct organisms

## RNA Sequencing 
- We sequence RNA because, following the central dogma (DNA -> RNA -> Protein), RNA reflects which genes are being expressed
- Rather than the whole nucleotide sequence, we are interested in the genes that are expressed; mRNA carries the information that is translated into proteins via the genetic code
- Initial technology: bulk RNA sequencing, which averages expression over all the different cells in a sample
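
To make the averaging effect concrete, here is a minimal sketch with made-up numbers (the expression values are hypothetical and not taken from any dataset). Two groups of cells express a gene at very different levels, and a bulk-style average reports one intermediate value that matches neither group.

```python
from statistics import mean

# Hypothetical expression of one gene in six cells (arbitrary units).
# Cells 0-2 come from one population, cells 3-5 from another.
single_cell_expression = [0.0, 1.0, 0.5, 20.0, 19.0, 21.0]

# A bulk measurement of the pooled sample behaves like an average over cells.
bulk_estimate = mean(single_cell_expression)

# Per-cell values keep the two populations distinguishable.
low_group = [x for x in single_cell_expression if x < bulk_estimate]
high_group = [x for x in single_cell_expression if x >= bulk_estimate]

print(f"bulk estimate: {bulk_estimate:.2f}")
print(f"low-expressing cells: {low_group}")
print(f"high-expressing cells: {high_group}")
```

The bulk value sits between the two groups, which is the information loss that single-cell sequencing is meant to avoid.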

## Single Cell RNA Sequencing
- Isolate single cells
- Amplify the region of interest or, for scRNA-seq, the cDNA reverse-transcribed from each cell's RNA
- Construct sequencing libraries
- Apply next-generation sequencing technology such as Illumina or 10x Genomics

# Applying Machine Learning to scRNA-seq Data

- Machine learning based approaches allow cell cycle stage to be predicted from single-cell RNA-sequencing data.
- Data imputation/denoising of sequencing data using Machine Learning

### ML to predict cell cycle stage from scRNA-seq data
- In the context of scRNA-seq experiments, the transcriptome data itself provides informative cues about the cell cycle stage of individual cells
- Supervised learning was used to evaluate the ability of six algorithms to predict the unobserved cell cycle stage of a cell from its transcriptome profile
- Each algorithm was trained on the same scRNA-seq dataset where the cell-cycle stage of each cell was known.

![Approach](https://ars.els-cdn.com/content/image/1-s2.0-S1046202315300098-gr1.jpg)

- Performance measure: 10-fold cross-validation on the training dataset, plus evaluation on a variety of independent datasets
- Predictive power: the F1 score (harmonic mean of precision and recall), computed per class to summarize multi-class classification
- Six classifiers were compared for cell-cycle prediction
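
The F1 score used as the performance measure can be sketched in a few lines of plain Python. This is an illustrative implementation, not the evaluation code of the study; the function names and the toy labels are assumptions made for the example. The macro average (the unweighted mean of the per-class F1 scores) is the quantity drawn as a red line in the results figure.

```python
def f1_per_class(y_true, y_pred, label):
    """F1 for one class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred, labels=("G1", "S", "G2M")):
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1_per_class(y_true, y_pred, c) for c in labels) / len(labels)


# Toy example: six cells with known phase and a hypothetical prediction.
y_true = ["G1", "G1", "S", "S", "G2M", "G2M"]
y_pred = ["G1", "S", "S", "S", "G2M", "G1"]

for phase in ("G1", "S", "G2M"):
    print(phase, round(f1_per_class(y_true, y_pred, phase), 3))
print("macro", round(macro_f1(y_true, y_pred), 3))
```

In a 10-fold cross-validation, these scores would be computed on the held-out fold of each split and then summarized across folds.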

#### Random Forest
- Trained 500 trees by minimising the entropy in the leaves of the individual randomized trees, constructed using a subset of all N features
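
As a rough illustration of the entropy criterion, the sketch below scores candidate split thresholds for a single gene and keeps the one that leaves the least label entropy in the two child nodes. It covers only this one step: the random feature subsets and the ensemble of 500 trees are not implemented, and the function names and expression values are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of the class labels in one node."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def split_entropy(values, labels, threshold):
    """Size-weighted entropy of the two children after splitting at threshold."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    n = len(labels)
    return sum(len(part) / n * entropy(part) for part in (left, right) if part)


def best_threshold(values, labels):
    """Try midpoints between sorted unique values; keep the lowest-entropy split.

    Assumes at least two distinct values are present.
    """
    uniq = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(uniq, uniq[1:])]
    return min(candidates, key=lambda t: split_entropy(values, labels, t))


# Hypothetical expression of one gene across six cells with known phase.
expression = [0.2, 0.4, 0.3, 2.1, 2.5, 1.9]
phase = ["G1", "G1", "G1", "G2M", "G2M", "G2M"]

threshold = best_threshold(expression, phase)
print("entropy before split:", entropy(phase))
print("best threshold:", threshold)
print("entropy after split:", split_entropy(expression, phase, threshold))
```

A tree repeats this search over the features available at each node, and a forest averages the votes of many such trees.
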
#### Logistic regression and lasso
- Logistic regression with (lasso) and without L1 regularization
#### Support Vector Machines
- With an RBF (radial basis function) kernel, combined with feature selection
- Kernel parameters were determined using a cross-validated grid search
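
For reference, the RBF kernel itself is a one-line similarity function. The sketch below shows only the kernel and how its `gamma` parameter changes the similarity between two cells; the SVM optimisation and the cross-validated grid search are not implemented here, and the cell vectors and `gamma` values are made up for illustration.

```python
from math import exp


def rbf_kernel(x, y, gamma):
    """k(x, y) = exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return exp(-gamma * sq_dist)


# Two hypothetical cells described by three gene expression values.
cell_a = [1.0, 0.0, 2.0]
cell_b = [1.0, 1.0, 0.0]

# A grid search would try values like these and keep the one with the
# best cross-validated score; here we only print the resulting similarity.
for gamma in (0.01, 0.1, 1.0):
    print(gamma, rbf_kernel(cell_a, cell_b, gamma))
```

Larger `gamma` makes the similarity fall off faster with distance, which is why the parameter has to be tuned rather than fixed in advance.
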
#### PCA-based classification
- The first principal component (PC) of a set of annotated cell cycle marker genes is sufficient for constructing a cell–cell covariance matrix, reflecting the cell cycle induced correlation among cells
- Evaluated a Gaussian Naive Bayes classifier based on the first PC derived from the set of cell cycle markers
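
The idea can be sketched with NumPy: project each cell onto the first principal component of the marker genes, then fit a one-feature Gaussian Naive Bayes model on those scores. This is a simplified illustration on synthetic data, not the study's implementation; the helper names, the simulated expression values, and the small variance floor are all assumptions.

```python
import numpy as np


def first_pc_scores(X):
    """Project cells (rows) onto the first principal component of the genes (columns)."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal axes, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[0]


def fit_gaussian_nb(scores, labels):
    """Estimate prior, mean and variance of the single feature for each class."""
    labels = np.array(labels)
    params = {}
    for c in sorted(set(labels.tolist())):
        s = scores[labels == c]
        # The 1e-9 variance floor is an assumption to avoid division by zero.
        params[c] = (len(s) / len(scores), s.mean(), s.var() + 1e-9)
    return params


def predict(params, score):
    """Return the class with the highest log posterior for one PC score."""
    def log_posterior(prior, mu, var):
        return np.log(prior) - 0.5 * np.log(2 * np.pi * var) - (score - mu) ** 2 / (2 * var)
    return max(params, key=lambda c: log_posterior(*params[c]))


# Synthetic data: 10 "G1" and 10 "G2M" cells over 4 hypothetical marker genes.
rng = np.random.default_rng(0)
g1 = rng.normal(loc=[0, 0, 0, 0], scale=0.3, size=(10, 4))
g2m = rng.normal(loc=[3, 3, 3, 3], scale=0.3, size=(10, 4))
X = np.vstack([g1, g2m])
labels = ["G1"] * 10 + ["G2M"] * 10

scores = first_pc_scores(X)
model = fit_gaussian_nb(scores, labels)
predicted = [predict(model, s) for s in scores]
print(predicted)
```

The sketch scores the same cells it was fitted on; applying it to new cells would require reusing the training mean and principal axis for the projection.
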
#### Pairs
- A classification algorithm based on the relative expression of “marker pairs”, i.e. pairs of genes whose expression ordering within a cell depends on the cell-cycle phase
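
A simplified sketch of the marker-pair idea follows. Each phase has a list of gene pairs, and a cell is scored by the fraction of pairs in which the first gene is more highly expressed than the second. The gene names, the pairs, and the scoring rule are hypothetical and only illustrate the principle, not the published algorithm.

```python
# Hypothetical marker pairs per phase: (gene_a, gene_b) means
# "gene_a is expected to be higher than gene_b in this phase".
MARKER_PAIRS = {
    "G1": [("geneA", "geneB"), ("geneC", "geneD")],
    "S": [("geneB", "geneC"), ("geneD", "geneA")],
    "G2M": [("geneB", "geneA"), ("geneD", "geneC")],
}


def phase_scores(cell, marker_pairs=MARKER_PAIRS):
    """Fraction of each phase's pairs whose expected ordering holds in this cell."""
    scores = {}
    for phase, pairs in marker_pairs.items():
        hits = sum(1 for a, b in pairs if cell[a] > cell[b])
        scores[phase] = hits / len(pairs)
    return scores


def predict_phase(cell, marker_pairs=MARKER_PAIRS):
    """Assign the phase with the highest pair score."""
    scores = phase_scores(cell, marker_pairs)
    return max(scores, key=scores.get)


cell = {"geneA": 5.0, "geneB": 1.0, "geneC": 4.0, "geneD": 2.0}
print(phase_scores(cell))
print(predict_phase(cell))

# Rescaling the whole cell (e.g. a different sequencing depth) leaves the
# within-cell orderings, and therefore the prediction, unchanged.
rescaled = {gene: 10 * value for gene, value in cell.items()}
print(predict_phase(rescaled))
```

Because only within-cell comparisons are used, a classifier of this kind does not depend on how expression is scaled between cells or datasets.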

### Data
- Combined all genes annotated to cell cycle in the Gene Ontology database (GO:0007049) along with the 600 top-ranked genes from CycleBase
- Constructed an informative set of cell cycle marker genes, by excluding those genes whose variation was below the technical noise in the training dataset
- Normalized gene expression as FPKM (Fragments Per Kilobase of transcript per Million mapped reads)
- Alternative normalisation strategy: the data from each cell were normalised by the total number of reads mapped to the gene set used for prediction
- __Training and test data__: single-cell RNA-seq dataset comprising 182 mouse embryonic stem cells (mESCs) with known cell-cycle phase
- __Prediction data__: datasets without cell-cycle information: blastomeres, liver cells from a published study, and T cells
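
The two normalisation strategies listed above can be sketched as follows. The FPKM formula is the standard definition; the function names, gene names, and counts are made up for illustration and are not taken from the datasets.

```python
def fpkm(fragments, gene_length_bp, total_mapped_fragments):
    """Fragments Per Kilobase of transcript per Million mapped fragments."""
    return fragments * 1e9 / (gene_length_bp * total_mapped_fragments)


def normalise_by_gene_set_total(counts, gene_set):
    """Divide each gene's count by the cell's total over the prediction gene set."""
    total = sum(counts[g] for g in gene_set)
    return {g: counts[g] / total for g in gene_set}


# Hypothetical read counts for one cell.
cell_counts = {"geneA": 200, "geneB": 50, "geneC": 750}

# FPKM: 200 fragments on a 2,000 bp gene in a library of 1,000,000 fragments.
print(fpkm(200, 2000, 1_000_000))

# Alternative: normalise only within the gene set used for prediction.
print(normalise_by_gene_set_total(cell_counts, ["geneA", "geneB"]))
```

The first approach corrects for gene length and library size, while the second depends on which genes are in the prediction set, so the same cell gets different values under different gene sets.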

### Results
![Result1](https://ars.els-cdn.com/content/image/1-s2.0-S1046202315300098-gr2.jpg)
Validation on data with known cell-cycle phase. A–C: F1 scores from internal cross-validation for different gene sets (A, all variable genes; B, all annotated cell-cycle genes; C, all variable cell-cycle genes). D–F: F1 scores on the independent test set (D, all variable genes; E, all annotated cell-cycle genes; F, all variable cell-cycle genes). The F1 score for G1 phase is shown in green, for S phase in orange and for G2M phase in blue. Red lines represent the macro-averaged F1 score.

### Findings
- Most methods generalized poorly to independent test data; the PCA-based and pairs methods were the exceptions
- The alternative normalisation strategy resulted in poor generalizability