# Summarization of Research Papers
This document summarizes the papers read during the Research Training, which started on 30 July.
## 4CAC: 4-class classification of metagenome assemblies using machine learning and assembly graphs
Link: [4CAC](https://www.biorxiv.org/content/10.1101/2023.01.20.524935v1.full.pdf)

### Evaluation criteria
* $ \text{Precision} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}} $

* $ \text{Recall} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}} $

* $ \text{F1 Score} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} $
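
For a multi-class classifier, these three metrics are usually computed once per class, treating that class as "positive" and all others as "negative". The sketch below shows one way to do that in plain Python. The function name `per_class_scores` and the toy labels are illustrative assumptions, not taken from any of the papers, and the papers may handle unclassified contigs differently.

```python
from collections import Counter

def per_class_scores(true_labels, predicted_labels):
    """Return {class: (precision, recall, f1)} for every class seen in either list."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # the predicted class gains a false positive
            fn[truth] += 1  # the true class gains a false negative

    scores = {}
    for cls in set(true_labels) | set(predicted_labels):
        predicted_total = tp[cls] + fp[cls]
        actual_total = tp[cls] + fn[cls]
        precision = tp[cls] / predicted_total if predicted_total else 0.0
        recall = tp[cls] / actual_total if actual_total else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[cls] = (precision, recall, f1)
    return scores

# Toy example with made-up labels for six contigs.
truth = ["phage", "phage", "plasmid", "prokaryote", "prokaryote", "prokaryote"]
pred = ["phage", "plasmid", "plasmid", "prokaryote", "prokaryote", "phage"]
scores = per_class_scores(truth, pred)
print(scores["phage"])  # (0.5, 0.5, 0.5)
```

In this toy example the plasmid class has perfect recall but a precision of 0.5, which is why F1 is reported alongside the two individual metrics.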

### Summary

The paper introduces a novel classifier called 4CAC that simultaneously identifies four classes of microbial entities in metagenome assemblies: phages, plasmids, prokaryotes, and microeukaryotes.

4CAC produces an initial classification with XGBoost models and then refines it using the assembly graph.

### Key Points

- Bacteria have been widely studied as the dominant members of microbial communities, while our understanding of phages, plasmids, and microeukaryotes lags behind _due to their lower abundance._ These less-studied entities are crucial in horizontal gene transfer and antibiotic resistance.
- The XGBoost classifier is trained and tested on all complete assemblies from the NCBI GenBank database. Several XGBoost models are trained for different sequence lengths.
- The classification is refined using the assembly graph, in which nodes represent contigs and edges represent subsequence overlaps between the corresponding contigs.
- 4CAC works well on _short-read_ assemblies.
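
Training several models for different sequence lengths implies that each contig is routed to a model suited to its length. A minimal sketch of such routing follows. The length boundaries and model names are hypothetical placeholders chosen for illustration; they are not the values used by 4CAC.

```python
from bisect import bisect_right

# Hypothetical bucket boundaries in base pairs (NOT taken from the paper).
LENGTH_BOUNDS = [1000, 5000, 10000]
# Hypothetical model identifiers, one per length bucket.
MODEL_NAMES = ["model_short", "model_medium", "model_long", "model_very_long"]

def pick_model(contig_length):
    """Return the name of the model whose length bucket contains the contig."""
    # bisect_right counts how many boundaries are <= contig_length,
    # which is exactly the index of the bucket the contig falls into.
    return MODEL_NAMES[bisect_right(LENGTH_BOUNDS, contig_length)]

print(pick_model(800))    # model_short
print(pick_model(7500))   # model_long
```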

### Potential Areas for Improvement

- 4CAC's classification of underrepresented classes in real datasets lags behind that of other classifiers. More expressive models, e.g. deep learning models, could be explored to improve its precision.
- 4CAC's performance on long-read assemblies can be improved.
- The graph-based refinement can be improved further.


## 3CAC: improving the classification of phages and plasmids in metagenomic assemblies using assembly graphs
Link: [3CAC](https://academic.oup.com/bioinformatics/article/38/Supplement_2/ii56/6702013)

### Summary

The paper introduces 3CAC, a three-class classifier that improves the precision of phage and plasmid classification in metagenomic assemblies. 3CAC starts with initial classification results from existing tools and improves the classification of short and low-confidence contigs using adjacency information in the assembly graph. The tool has been shown to significantly outperform existing classifiers, such as PPR-Meta and viralVerify, on both simulated and real metagenome datasets.

### Evaluation criteria
* $ \text{Precision} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}} $

* $ \text{Recall} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}} $

* $ \text{F1 Score} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} $

### Key Points

- **Background**: Phages and plasmids coexist with bacterial chromosomes in microbial communities and play crucial roles in evolution, antibiotic resistance, and gene transfer. Existing tools perform poorly in classifying phages and plasmids due to the high abundance of chromosome contigs in assemblies.
  
- **Methodology**: 3CAC generates an initial classification using existing tools (viralVerify, PPR-Meta) and improves it by leveraging assembly graph information. It corrects classifications based on neighboring contig classes and propagates classifications from classified to unclassified contigs.
  
- **Performance**: 3CAC significantly improves precision and recall compared to existing tools. It is particularly effective for short contigs, where previous classifiers often fail.

- **Evaluation**: The classifier was tested on both simulated datasets and real human gut microbiome samples, where it increased F1 scores by 10-60 percentage points compared to existing methods.
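
The correction and propagation steps described under Methodology can be sketched as follows. This is a simplified reading, not the authors' implementation: the exact rules (here, "correct a contig when at least two classified neighbors agree on a different class" and "propagate when all classified neighbors agree") are assumptions, as are the function name `refine` and the toy graph. The sketch also assumes every contig named in an edge has an entry in `labels`.

```python
def refine(labels, edges):
    """labels: dict contig -> class name, or None if unclassified.
    edges: iterable of (contig_a, contig_b) pairs from the assembly graph."""
    neighbors = {contig: set() for contig in labels}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    refined = dict(labels)

    # Correction (assumed rule): relabel a classified contig when it has at
    # least two classified neighbors that all share one class different from
    # its own. Decisions use the initial labels so the result does not depend
    # on iteration order.
    for contig, label in labels.items():
        if label is None:
            continue
        seen = [labels[n] for n in neighbors[contig] if labels[n] is not None]
        if len(seen) >= 2 and len(set(seen)) == 1 and seen[0] != label:
            refined[contig] = seen[0]

    # Propagation (assumed rule): an unclassified contig takes the class of
    # its classified neighbors when they all agree. Repeat until stable.
    changed = True
    while changed:
        changed = False
        for contig in sorted(refined):
            if refined[contig] is not None:
                continue
            seen = {refined[n] for n in neighbors[contig] if refined[n] is not None}
            if len(seen) == 1:
                refined[contig] = seen.pop()
                changed = True
    return refined

# Toy graph: a chain c1-c2-c3-c4-c5 plus an isolated contig c6.
labels = {"c1": "chromosome", "c2": "phage", "c3": "chromosome",
          "c4": None, "c5": None, "c6": None}
edges = [("c1", "c2"), ("c2", "c3"), ("c3", "c4"), ("c4", "c5")]
result = refine(labels, edges)
print(result["c2"], result["c5"], result["c6"])  # chromosome chromosome None
```

In the toy graph, `c2` is corrected because both of its neighbors are chromosome contigs, `c4` and `c5` receive labels by propagation, and the isolated contig `c6` stays unclassified because it has no neighbors to learn from.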

### Potential Areas for Improvement

- **Handling Isolated Contigs**: The propagation step of 3CAC does not work on isolated contigs without neighbors in the assembly graph, limiting the classification improvement for these contigs. Enhancing methods for isolated contigs could improve overall recall.
  
- **Refinement of Existing Tools**: The classifier currently relies on the output of existing two-class and three-class classifiers. Developing a stand-alone version of 3CAC could further optimize performance.

- **Detection of Integrated Elements**: The detection of prophages and other integrated elements remains challenging. Specific tools designed for prophage detection could be incorporated to enhance classification accuracy.

- **Expansion to Four-Class Classification**: Extending 3CAC to include eukaryotic contigs could broaden its applicability in diverse metagenomic studies, addressing a wider range of microbial entities.



## DeepVirFinder: Identifying Viruses from Metagenomic Data Using Deep Learning
Link: [DeepVirFinder](https://doi.org/10.1007/s40484-019-0187-4)

### Evaluation criteria
* $ \text{Precision} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}} $

* $ \text{Recall} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}} $

* $ \text{F1 Score} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} $

### Summary

The paper introduces DeepVirFinder, a deep learning-based tool that identifies viral sequences in metagenomic data using convolutional neural networks (ConvNets). Unlike traditional reference-based and gene homology-based methods, DeepVirFinder uses a reference-free approach that learns viral genomic signatures directly from data, improving accuracy in identifying both known and unknown viral sequences across various contig lengths.

### Key Points

- **Background**: Viruses play crucial roles in ecosystems and human health, but their identification in metagenomic datasets is challenging due to high mutation rates and the incompleteness of reference databases. Traditional methods struggle to accurately identify short or novel viral sequences.
  
- **Methodology**: DeepVirFinder uses a deep learning model with a convolutional neural network that automatically learns features from viral sequences without relying on pre-defined features such as k-mer frequencies. The model was trained on viral sequences from RefSeq and metavirome datasets, enhancing its capacity to recognize under-represented viral groups.

- **Performance**: DeepVirFinder significantly outperformed the previous state-of-the-art tool, VirFinder, with AUROC scores of 0.93, 0.95, 0.97, and 0.98 for 300, 500, 1000, and 3000 bp sequences, respectively. It demonstrated robust performance even for very short sequences and showed improved accuracy when training was expanded to include millions of viral sequences from metavirome samples.

- **Application**: Applied to human gut metagenomic samples, DeepVirFinder identified over 51,000 viral sequences, linking 10 viral bins with colorectal cancer status. This highlights the potential role of viruses in human disease.
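
A convolutional network that learns features directly from sequence needs the raw DNA string turned into numbers first. A common way to do this is one-hot encoding, sketched below in plain Python. Whether DeepVirFinder encodes its input exactly this way is an assumption here, as are the function name `one_hot` and the choice to encode ambiguous bases as all zeros.

```python
BASES = "ACGT"

def one_hot(sequence):
    """Encode a DNA string as a list of 4-element rows, one row per base.

    Columns follow the order A, C, G, T.
    """
    rows = []
    for base in sequence.upper():
        row = [0, 0, 0, 0]
        if base in BASES:
            row[BASES.index(base)] = 1
        # Assumption: ambiguous bases such as N stay as an all-zero row.
        rows.append(row)
    return rows

encoded = one_hot("ACGN")
print(encoded)  # [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 0]]
```

Each row is one position in the sequence, so a convolutional filter sliding over the rows can pick up short sequence motifs without any hand-crafted feature definition.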

### Potential Areas for Improvement

- **Training with More Diverse Data**: While DeepVirFinder performs well, expanding its training dataset to include more metavirome samples and newly discovered viral sequences will further enhance its predictive power, particularly for under-represented viral groups.
  
- **Handling Eukaryotic Contamination**: The current model is trained primarily on prokaryotic viral sequences and may misidentify eukaryotic sequences as viral. Filtering out potential eukaryotic contaminants before analysis or enhancing the model to differentiate between viral and eukaryotic sequences would be beneficial.

- **Assembly Integration**: DeepVirFinder could be used to improve viral assembly pipelines by classifying viral reads before assembly, reducing computational complexity and improving the accuracy of viral genome reconstruction.

- **Detection of Out-of-Distribution Sequences**: As DeepVirFinder is a deep learning model, it may sometimes misclassify out-of-distribution sequences. Future work could explore methods for detecting and handling such sequences to improve classification reliability.


## Terms and Definitions
**Genomics**: focuses on studying the entire genetic material of a *single organism*

**Metagenomics**: focuses on studying the genetic material from *multiple organisms* found in a mixed sample

**Phage (Bacteriophage)**: a virus that specifically infects and replicates within bacteria. Phages play a key role in controlling bacterial populations in nature and can influence microbial ecosystems

**Horizontal Gene Transfer (HGT)**: the process by which an organism transfers genetic material to another organism that is not its offspring
* Why is HGT important?
  * Spread of antibiotic resistance: bacteria can pass genes that confer resistance to antibiotics, even across species, making infections harder to treat
  * Evolution and adaptation: HGT allows for rapid evolution, enabling organisms to quickly acquire traits that help them survive in new environments or under changing conditions
* How is HGT done?
  * Transduction: transfer of DNA between bacteria by viruses (phages)
  * Conjugation: transfer of DNA through direct contact between two cells. **Plasmids** are small, circular DNA molecules that exist independently of a bacterial chromosome. A bacterium with a plasmid transfers it to another bacterium via a physical connection called a pilus.
  * Transformation: uptake of free DNA from the environment by a cell. **Microeukaryotes** may release or take up free DNA in this way

**Vertical Gene Transfer (VGT)**: the process by which genes are transferred from parent to offspring through reproduction