# Summarization of Research Papers
This document summarizes the papers read during the Research Training, which started on 30 July.
## 4CAC: 4-class classification of metagenome assemblies using machine learning and assembly graphs
Link: [4CAC](https://www.biorxiv.org/content/10.1101/2023.01.20.524935v1.full.pdf)

### Evaluation criteria
* $ \text{Precision} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}} $

* $ \text{Recall} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}} $

* $ \text{F1 Score} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} $
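As a quick check of the three formulas above, the following minimal sketch computes them from raw counts for a single class. The function name and the example counts are made up for illustration; they do not come from any of the papers.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) from raw counts for one class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts for one class, e.g. "phage"
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
# precision=0.800 recall=0.667 f1=0.727
```

In a multi-class setting such as 4CAC, these values are computed per class by treating that class as "positive" and all others as "negative".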

### Summary

The paper introduces a novel classifier called 4CAC that identifies 4 classes of microbial entities, namely, phages, plasmids, prokaryotes, and microeukaryotes, in metagenome assemblies simultaneously.

The system first classifies contigs with XGBoost models and then refines the result using the assembly graph.

### Key Points

- While bacteria have been widely studied as the dominant members of microbial communities, our understanding of phages, plasmids, and microeukaryotes lags behind _due to their lower abundance._ These entities are crucial in horizontal gene transfer and antibiotic resistance.
- The XGBoost classifier is trained and tested on all complete assemblies from the NCBI GenBank database. Several XGBoost models are trained for different sequence lengths.
- The classification is refined using the assembly graph, in which nodes represent contigs and edges represent subsequence overlaps between the corresponding contigs.
- 4CAC works well on _short-read_ assemblies.
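To make the graph structure mentioned above concrete (contigs as nodes, overlaps as edges), here is a minimal sketch that builds an undirected adjacency map from GFA version 1 text, a format commonly written by assemblers. It assumes the graph is available as GFA1 `S` (segment) and `L` (link) lines; the function name is hypothetical and this is not 4CAC's own loader.

```python
from typing import Dict, Iterable, Set


def adjacency_from_gfa(lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Build an undirected adjacency map from GFA1 'S' and 'L' records.

    Orientation (+/-) and overlap length are ignored in this sketch.
    """
    graph: Dict[str, Set[str]] = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S":
            # S <name> <sequence>: register the node even if it has no links
            graph.setdefault(fields[1], set())
        elif fields[0] == "L":
            # L <from> <from_orient> <to> <to_orient> <overlap>
            a, b = fields[1], fields[3]
            graph.setdefault(a, set()).add(b)
            graph.setdefault(b, set()).add(a)
    return graph


example = [
    "H\tVN:Z:1.0",
    "S\tedge_1\tACGT",
    "S\tedge_2\tGGCC",
    "S\tedge_3\tTTAA",
    "L\tedge_1\t+\tedge_2\t-\t0M",
]
print(adjacency_from_gfa(example))
```

Here `edge_3` ends up with an empty neighbor set, which is the "isolated node" case that graph-based refinement cannot help with.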

### Potential Areas for Improvement

- Classification of underrepresented classes in real datasets lags behind other classifiers. Thus, more refined algorithms, e.g., deep learning models, can be explored to enhance the precision of 4CAC.
- 4CAC’s performance on long-read assemblies can be improved.
- The graph-based refinement approach can be refined further.


## Overview of the 4CAC System
The 4CAC system aims to classify contigs (contiguous DNA sequences) in metagenomic assemblies into four distinct categories: viral, plasmid, prokaryotic, and eukaryotic. This classification helps in identifying the origin of contigs within a complex metagenomic sample. Distinguishing these categories is crucial for understanding the structure and function of microbial communities, identifying novel viruses, understanding horizontal gene transfer through plasmids, and studying microbial diversity in different environments.

### Classification Process
#### Input
- **Contigs**: Short, contiguous sequences obtained from the metagenomic assembly, representing fragments of genomes from different organisms in the sample.
- **FASTA Format**: The input contigs are stored in a FASTA file, where each sequence is identified by a header line followed by the nucleotide sequence.

#### Feature Extraction
- **k-mer Frequencies**: The system extracts k-mer frequencies, which are counts of all possible subsequences of length `k` within each contig. These frequencies are calculated for multiple partition lengths (e.g., 500, 1000, 5000, etc.), capturing different scales of sequence composition.
- **Partition Lengths**: Using multiple partition lengths allows the model to capture features from both short and long-sequence contexts, improving classification accuracy.
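The k-mer feature idea can be sketched as below. This is only an illustration: the value of `k`, the normalization, and the handling of ambiguous bases are assumptions, not the settings used by 4CAC.

```python
from collections import Counter
from itertools import product
from typing import List


def kmer_frequencies(seq: str, k: int = 3) -> List[float]:
    """Normalized k-mer frequency vector over the alphabet ACGT.

    k-mers containing other characters (e.g. N) are ignored.
    The vector has 4**k entries in lexicographic k-mer order.
    """
    seq = seq.upper()
    vocab = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[kmer] for kmer in vocab)
    return [counts[kmer] / total if total else 0.0 for kmer in vocab]


vec = kmer_frequencies("ACGTAC", k=2)
print(len(vec))  # 16
print(vec[1])    # frequency of "AC": 0.4
```

In the example, the 2-mers are AC, CG, GT, TA, AC, so "AC" accounts for 2 of 5 windows.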


#### XGBoost
- **Training**: The classifier is trained on labeled datasets with known categories to learn patterns associated with each category.
- **Classification Scores**: After training, XGBoost assigns probabilities (scores) to each contig for belonging to each category. These scores indicate the confidence of classification, with higher values suggesting a higher likelihood of the contig belonging to that specific category.

### Understanding the Output
The following command was executed to classify the contigs using the 4CAC system:

`python3 classify_xgb.py -f test_data/assembly.fasta -o test_data/xgb_output.csv`


- **Viral Score (viral_score)**: Represents the probability that a given contig is of viral origin. A high value (close to 1) suggests the contig likely belongs to a virus, while a low value (close to 0) indicates it is not viral. Viral sequences are essential to identify, as they can influence microbial dynamics and host interactions in the environment.
- **Plasmid Score (plas_score)**: Indicates the probability that the contig is a plasmid. Plasmids are extrachromosomal DNA molecules often carrying genes for antibiotic resistance or other adaptive traits. A high plasmid score implies the contig is likely a plasmid, which can be crucial for understanding horizontal gene transfer within microbial communities.
- **Prokaryotic Score (prokar_score)**: This score reflects the likelihood that the contig belongs to a prokaryotic organism, such as bacteria or archaea. A high prokaryotic score means the contig is more likely to be from a prokaryotic source. Understanding the proportion of prokaryotic sequences can help in assessing the microbial community structure and function.
- **Eukaryotic Score (eukar_score)**: Represents the probability that the contig is of eukaryotic origin. Eukaryotic sequences include fungi, algae, or other small eukaryotic organisms. A high eukaryotic score indicates a strong likelihood of the contig being eukaryotic, providing insights into the presence of microeukaryotes in the sample.

*Ambiguous Scores*: Contigs with scores distributed across multiple categories may have characteristics that are not well-defined or could share features with multiple categories. This ambiguity could be due to mixed-origin contigs, horizontal gene transfer, or incomplete sequences.
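One simple way to turn the four scores into a single call, and to flag the ambiguous case described above, is sketched here. The thresholds `min_score` and `min_margin` are hypothetical values chosen for illustration; they are not the thresholds used inside 4CAC, and the dictionary keys simply reuse the score names listed above.

```python
from typing import Dict


def call_class(scores: Dict[str, float],
               min_score: float = 0.5,
               min_margin: float = 0.2) -> str:
    """Return the top-scoring class name, or "uncertain" if the call is weak.

    A call is weak when the best score is below min_score or when it leads
    the runner-up by less than min_margin (both thresholds are assumptions).
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    (best, best_score), (_, second_score) = ranked[0], ranked[1]
    if best_score < min_score or best_score - second_score < min_margin:
        return "uncertain"
    return best


confident = {"viral_score": 0.05, "plas_score": 0.10,
             "prokar_score": 0.80, "eukar_score": 0.05}
ambiguous = {"viral_score": 0.40, "plas_score": 0.35,
             "prokar_score": 0.20, "eukar_score": 0.05}
print(call_class(confident))  # prokar_score
print(call_class(ambiguous))  # uncertain
```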

These results cover only the output of the XGBoost classification, which includes scores for viral, plasmid, prokaryotic, and eukaryotic origins based on the sequence data alone. They do not include results from using the assembly graph data for refinement.

### Refinement of the Classification Results
The 4CAC system further refines the initial classification by incorporating the assembly graph data, which provides structural and connectivity information about the contigs. This step is crucial for resolving ambiguous classifications and improving the accuracy of the results. The following command performs this refinement:

`python3 classify_4CAC.py --assembler metaFlye --asmdir test_data/ --classdir test_data/ --outdir test_data/`

The refined results, saved in 4CAC\_classification.fasta, represent the final classification after integrating both sequence features and assembly graph context.

Refined classification results:

- Prokaryotic Contigs: 39
- Plasmid Contigs: 18
- Eukaryotic Contigs: 13
- Phage Contigs: 13

How the results changed:

- **Increase in Prokaryotic Classifications**: The number of prokaryotic contigs has increased, indicating that some previously ambiguous contigs were resolved as prokaryotic after considering their connectivity and structural features within the assembly graph.
- **Refinement of Plasmid and Viral Contigs**: Some contigs initially classified as plasmids or viral have been reassigned to other categories. This suggests that the initial classifications based on sequence features alone were not sufficient, and the structural data helped in providing a more accurate context.
- **Eukaryotic Contigs**: The number of eukaryotic contigs decreased slightly, likely due to the correction of misclassifications based on assembly graph data.

### Conclusion of 4CAC
The output agrees with the ground truth provided in the repo at a rate of approximately 80.49\%.


### How the Assembly Graph is Used in 4CAC for Refinement
#### 1. Initial classification
- The `contigCategoryToNodes` function assigns XGBoost class probabilities to nodes in the assembly graph based on contig paths.
- Scores are used to decide initial classifications, setting contig categories based on defined thresholds.

- **Input**: XGBoost scores, contig lengths, and their corresponding nodes.
- **Process**: Assigns initial classifications to nodes based on contig categories using thresholds (e.g., phage, plasmid). If a contig’s classification is confident (e.g., high score), it is propagated to all nodes in the contig’s path.

#### 2. Correction Step
The `correctionMaxAdj` function corrects nodes with conflicting classifications by majority voting among neighboring nodes with consistent scores and classifications.

- **Purpose**: Corrects nodes with conflicting classifications.
- **Process**: Each node checks its neighbors' classifications. If two or more neighbors agree, the node adopts their classification.
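The correction rule can be sketched as follows. This is an interpretation, not the code of `correctionMaxAdj`: it assumes that "two or more neighbors agree" means all classified neighbors share one class and there are at least `min_agree` of them, and it ignores the score checks. Function and parameter names are hypothetical.

```python
from collections import Counter
from typing import Dict, Optional, Set


def correct_by_neighbors(graph: Dict[str, Set[str]],
                         labels: Dict[str, Optional[str]],
                         min_agree: int = 2) -> Dict[str, Optional[str]]:
    """Relabel a classified node when its classified neighbors all agree
    on a different class and at least min_agree of them exist.

    Votes are read from the original labels, so the result does not
    depend on the order in which nodes are visited.
    """
    corrected = dict(labels)
    for node, neighbors in graph.items():
        if labels.get(node) is None:
            continue  # unclassified nodes are left for the propagation step
        votes = Counter(labels[n] for n in neighbors
                        if labels.get(n) is not None)
        if len(votes) != 1:
            continue  # no classified neighbors, or neighbors disagree
        top, count = votes.most_common(1)[0]
        if count >= min_agree and top != labels[node]:
            corrected[node] = top
    return corrected


graph = {"a": {"b", "c", "d"}, "b": {"a"}, "c": {"a"}, "d": {"a"}}
labels = {"a": "plasmid", "b": "prokaryote",
          "c": "prokaryote", "d": "prokaryote"}
print(correct_by_neighbors(graph, labels)["a"])  # prokaryote
```

Node `b` keeps its label in this example because it has only one classified neighbor, which is below `min_agree`.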

#### 3. Propagation Step
The `propagationMaxAdj` function assigns classifications to unclassified nodes using the scores and classes of adjacent nodes.

- **Purpose**: Assigns classifications to unclassified nodes.
- **Process**: Nodes with one or more classified neighbors adopt their classification based on majority support.
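A minimal sketch of this propagation idea is shown below. It is a simplification rather than the logic of `propagationMaxAdj`: scores are ignored, ties are left unresolved, and rounds repeat until nothing changes. All names are hypothetical.

```python
from collections import Counter
from typing import Dict, Optional, Set


def propagate_labels(graph: Dict[str, Set[str]],
                     labels: Dict[str, Optional[str]]) -> Dict[str, Optional[str]]:
    """Give each unclassified node the most common class among its
    classified neighbors, repeating until no node changes.

    Ties between the two leading classes leave the node unclassified.
    """
    labels = dict(labels)
    while True:
        updates = {}
        for node, neighbors in graph.items():
            if labels.get(node) is not None:
                continue
            votes = Counter(labels[n] for n in neighbors
                            if labels.get(n) is not None)
            ranked = votes.most_common(2)
            if not ranked:
                continue  # no classified neighbor yet
            if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
                continue  # tie between the two leading classes
            updates[node] = ranked[0][0]
        if not updates:
            return labels
        labels.update(updates)


# Chain a - b - c - d plus an isolated node e
graph = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}, "e": set()}
labels = {"a": "phage", "b": None, "c": None, "d": None, "e": None}
print(propagate_labels(graph, labels))
```

In the example the label spreads along the chain over three rounds, while the isolated node `e` stays unclassified because it has no neighbors to vote.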

#### 4. Final classification
The `finalContigClassifyFromEdges` function determines the final contig class based on the majority node classification, considering the node lengths and score consistency.

- **Purpose**: Determine the final class of each contig.
- **Process**: Each contig is classified based on the majority classification of its nodes, weighted by node length and score confidence.
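The length-weighted vote can be illustrated with the sketch below. It covers only the length weighting; the score-confidence part of `finalContigClassifyFromEdges` is left out, and the function name, node names, and lengths are invented for the example.

```python
from collections import defaultdict
from typing import Dict, List, Optional


def classify_contig(path: List[str],
                    node_labels: Dict[str, Optional[str]],
                    node_lengths: Dict[str, int]) -> Optional[str]:
    """Pick the class with the largest total node length along the
    contig's path; return None if no node on the path is classified."""
    weight: Dict[str, int] = defaultdict(int)
    for node in path:
        label = node_labels.get(node)
        if label is not None:
            weight[label] += node_lengths[node]
    if not weight:
        return None
    return max(weight, key=weight.get)


path = ["n1", "n2", "n3"]
node_labels = {"n1": "plasmid", "n2": "prokaryote", "n3": "prokaryote"}
node_lengths = {"n1": 5000, "n2": 1200, "n3": 900}
print(classify_contig(path, node_labels, node_lengths))  # plasmid
```

The example shows why weighting matters: two of the three nodes are prokaryotic, but the single plasmid node covers 5000 bp against 2100 bp, so the contig is called a plasmid.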




## 3CAC: improving the classification of phages and plasmids in metagenomic assemblies using assembly graphs
Link: [3CAC](https://academic.oup.com/bioinformatics/article/38/Supplement_2/ii56/6702013)

### Summary

The paper introduces 3CAC, a three-class classifier that improves the precision of phage and plasmid classification in metagenomic assemblies. The 3CAC classifier starts with initial classification results from existing tools and enhances the classification of short and low-confidence contigs using adjacency information in the assembly graph. The tool has been shown to significantly outperform existing classifiers, such as PPR-Meta and viralVerify, on both simulated and real metagenome datasets.

### Evaluation criteria
* $ \text{Precision} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}} $

* $ \text{Recall} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}} $

* $ \text{F1 Score} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} $

### Key Points

- **Background**: Phages and plasmids coexist with bacterial chromosomes in microbial communities and play crucial roles in evolution, antibiotic resistance, and gene transfer. Existing tools perform poorly in classifying phages and plasmids due to the high abundance of chromosome contigs in assemblies.
  
- **Methodology**: 3CAC generates an initial classification using existing tools (viralVerify, PPR-Meta) and improves it by leveraging assembly graph information. It corrects classifications based on neighboring contig classes and propagates classifications from classified to unclassified contigs.
  
- **Performance**: 3CAC significantly improves precision and recall compared to existing tools. It is particularly effective for short contigs, where previous classifiers often fail.

- **Evaluation**: The classifier was tested on both simulated datasets and real human gut microbiome samples, demonstrating a marked improvement in classification performance with increased F1 scores by 10-60 percentage points compared to existing methods.

### Potential Areas for Improvement

- **Handling Isolated Contigs**: The propagation step of 3CAC does not work on isolated contigs without neighbors in the assembly graph, limiting the classification improvement for these contigs. Enhancing methods for isolated contigs could improve overall recall.
  
- **Refinement of Existing Tools**: The classifier currently relies on the output of existing two-class and three-class classifiers. Developing a stand-alone version of 3CAC could further optimize performance.

- **Detection of Integrated Elements**: The detection of prophages and other integrated elements remains challenging. Specific tools designed for prophage detection could be incorporated to enhance classification accuracy.

- **Expansion to Four-Class Classification**: Extending 3CAC to include eukaryotic contigs could broaden its applicability in diverse metagenomic studies, addressing a wider range of microbial entities.



## DeepVirFinder: Identifying Viruses from Metagenomic Data Using Deep Learning
Link: [DeepVirFinder](https://doi.org/10.1007/s40484-019-0187-4)

### Evaluation criteria
* $ \text{Precision} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}} $

* $ \text{Recall} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}} $

* $ \text{F1 Score} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} $

### Summary

The paper introduces DeepVirFinder, a deep learning-based tool that identifies viral sequences in metagenomic data using convolutional neural networks (ConvNets). Unlike traditional reference-based and gene homology-based methods, DeepVirFinder uses a reference-free approach that learns viral genomic signatures directly from data, improving accuracy in identifying both known and unknown viral sequences across various contig lengths.

### Key Points

- **Background**: Viruses play crucial roles in ecosystems and human health, but their identification in metagenomic datasets is challenging due to high mutation rates and the incompleteness of reference databases. Traditional methods struggle to accurately identify short or novel viral sequences.
  
- **Methodology**: DeepVirFinder uses a deep learning model with a convolutional neural network that automatically learns features from viral sequences without relying on pre-defined metrics such as k-mers. The model was trained on viral sequences from RefSeq and metavirome datasets, enhancing its capacity to recognize under-represented viral groups.

- **Performance**: DeepVirFinder significantly outperformed the previous state-of-the-art tool, VirFinder, with AUROC scores of 0.93, 0.95, 0.97, and 0.98 for 300, 500, 1000, and 3000 bp sequences, respectively. It demonstrated robust performance even for very short sequences and showed improved accuracy when training was expanded to include millions of viral sequences from metavirome samples.

- **Application**: Applied to human gut metagenomic samples, DeepVirFinder identified over 51,000 viral sequences, linking 10 viral bins with colorectal cancer status. This highlights the potential role of viruses in human disease.
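The AUROC values quoted above can be read as the probability that a randomly chosen viral sequence receives a higher score than a randomly chosen non-viral one. The sketch below computes AUROC directly from that definition (ties count as one half). It is a plain pairwise implementation for illustration, not DeepVirFinder's evaluation code, and the scores are invented.

```python
from typing import Sequence


def auroc(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """AUROC as the fraction of (positive, negative) pairs in which the
    positive scores higher; tied pairs contribute 0.5."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


viral = [0.9, 0.8, 0.4]      # hypothetical scores for true viral sequences
non_viral = [0.7, 0.3, 0.2]  # hypothetical scores for non-viral sequences
print(round(auroc(viral, non_viral), 3))  # 0.889
```

Here 8 of the 9 pairs are ordered correctly (only 0.4 versus 0.7 is not), giving 8/9.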

### Potential Areas for Improvement

- **Training with More Diverse Data**: While DeepVirFinder performs well, expanding its training dataset to include more metavirome samples and newly discovered viral sequences will further enhance its predictive power, particularly for under-represented viral groups.
  
- **Handling Eukaryotic Contamination**: The current model is trained primarily on prokaryotic viral sequences and may misidentify eukaryotic sequences as viral. Filtering out potential eukaryotic contaminants before analysis or enhancing the model to differentiate between viral and eukaryotic sequences would be beneficial.

- **Assembly Integration**: DeepVirFinder could be used to improve viral assembly pipelines by classifying viral reads before assembly, reducing computational complexity and improving the accuracy of viral genome reconstruction.

- **Detection of Out-of-Distribution Sequences**: As DeepVirFinder is a deep learning model, it may sometimes misclassify out-of-distribution sequences. Future work could explore methods for detecting and handling such sequences to improve classification reliability.


## Evaluation of Computational Phage Detection Tools for Metagenomic Datasets
Link: [Evaluation of Computational Phage Detection Tools](https://www.frontiersin.org/articles/10.3389/fmicb.2023.1078760/full)

### Evaluation criteria
* $ \text{Precision} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}} $

* $ \text{Recall} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}} $

* $ \text{F1 Score} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} $

### Summary

The paper evaluates the performance of nine computational tools used for detecting phages in metagenomic datasets. The study benchmarks these tools using various simulated and real datasets, assessing their performance under different conditions, such as varying contig lengths, sequencing errors, eukaryotic contamination, and phage taxonomy biases. The tools evaluated include homology-based tools like VirSorter and VIBRANT, and sequence composition-based tools like DeepVirFinder and VirFinder. The study finds that these tools yield significantly different results, with homology-based methods demonstrating high precision but lower sensitivity, while sequence-based methods show the opposite pattern.

### Key Points

- **Background**: Phages are the most abundant biological entities and play critical roles in ecosystems and human health. Accurate detection in metagenomic datasets is challenging due to the high diversity and low abundance of phages compared to their bacterial hosts.
- **Tools Evaluated**: The study assessed nine phage detection tools, categorized into homology-based and sequence-based methods. Homology-based tools (e.g., VirSorter, VIBRANT) are highly specific but less sensitive, while sequence-based tools (e.g., DeepVirFinder, VirFinder) exhibit higher sensitivity but lower specificity.
- **Benchmark Datasets**: The evaluation utilized several benchmark datasets, including simulated metagenomes, real human gut metagenomes, and viromes. The study highlighted that real-world metagenomes present specific challenges such as eukaryotic contamination and low viral content.
- **Performance Metrics**: Key performance metrics include precision, recall, F1 score, and area under the precision-recall curve (AUPRC). Sequence-based tools generally showed higher recall but lower precision, particularly in datasets with short contigs or low viral content.
- **Tool Limitations**: The tools demonstrated varying degrees of robustness to sequencing errors and eukaryotic contamination. Homology-based tools maintained high precision regardless of these factors, while sequence-based tools were more affected.

### Potential Areas for Improvement

- **Handling Low Viral Content**: Tools often struggle with datasets where viral sequences are in low abundance. Future development could focus on improving detection accuracy under these conditions, potentially by integrating multi-tool approaches.
- **Enhancing Sensitivity for Underrepresented Phages**: Current tools show biases towards well-represented phage groups. Developing methods that can reliably detect less common phage families could expand the utility of these tools in diverse environments.
- **Improving Performance on Short Contigs**: Tools like VirFinder and DeepVirFinder, which are better suited for short sequences, often sacrifice precision. Refining algorithms to balance sensitivity and specificity for short contigs could enhance their performance.
- **Reducing Computational Resource Requirements**: The study highlights variability in computational demands among the tools. Enhancements that reduce processing times and memory usage, especially for large metagenomic datasets, would improve accessibility.
- **Addressing Eukaryotic Contamination**: Tools need better mechanisms to distinguish phage sequences from eukaryotic contaminants, especially in complex microbiomes. Incorporating eukaryotic sequence data into training could improve specificity.

This evaluation underscores the need for continued tool development and benchmarking to enhance phage detection capabilities across diverse metagenomic samples.


## Identifying Phage Sequences From Metagenomic Data Using Deep Neural Network With Word Embedding and Attention Mechanism
Link: [MetaPhaPred](https://pubmed.ncbi.nlm.nih.gov/37812548/)

### Summary

The paper introduces MetaPhaPred, a novel deep neural network model designed to identify phage sequences from metagenomic data. The approach integrates word embedding, CNN, Bi-LSTM, and an attention mechanism to improve phage identification accuracy. MetaPhaPred encodes DNA sequences using the word embedding technique dna2vec, extracts feature maps using CNN, captures long-term dependencies with Bi-LSTM, and highlights important features using an attention mechanism. The model is evaluated using both simulated and real metagenomic datasets, demonstrating superior performance compared to state-of-the-art phage detection methods.

### Key Points

- **Background**: Phages, or bacteriophages, are viruses that infect bacteria and archaea and are significant in microbial ecosystems. Identifying phage sequences in metagenomic data is challenging due to the mixed presence of bacterial and phage DNA.
- **MetaPhaPred Overview**: The model uses dna2vec to encode k-mer representations of DNA sequences, which are then processed by CNN for feature extraction, Bi-LSTM for dependency capture, and an attention mechanism to emphasize key features.
- **Datasets**: The study uses simulated metagenomic sequences generated from phage and bacterial genomes, as well as real metagenomic data and prophage datasets, to evaluate MetaPhaPred’s performance.
- **Comparison with Other Methods**: MetaPhaPred is benchmarked against eleven state-of-the-art methods, including sequence alignment-based, machine learning-based, and deep learning-based approaches. It consistently outperforms these methods, especially on short sequences where other models struggle.
- **Performance Metrics**: Key metrics used for evaluation include Matthews correlation coefficient (MCC), accuracy, F1-score, and AUROC. MetaPhaPred shows significant improvements in accuracy and robustness across different sequence lengths and types.
- **Ablation Study**: The study highlights the importance of each component of MetaPhaPred (embedding, CNN, Bi-LSTM, and attention mechanism) by evaluating variants of the model without these components, confirming that each plays a crucial role in overall performance.
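Of the metrics listed above, the Matthews correlation coefficient is the only one whose formula is not written out elsewhere in this document. A minimal sketch of the standard binary MCC follows; the counts in the example are invented and do not come from the paper.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient for a binary confusion matrix.

    Ranges from -1 (total disagreement) to +1 (perfect prediction);
    returns 0.0 when any marginal total is zero.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical counts: phage = positive class, bacterial = negative class
print(f"{mcc(tp=90, tn=80, fp=20, fn=10):.3f}")
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, which makes it more informative when phage and bacterial sequences are not equally represented.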

### Potential Areas for Improvement

- **Prediction of Non-Phage Sequences**: MetaPhaPred is currently limited to phage prediction and may struggle with sequences from other organisms, such as plasmids or fungi. Expanding its classification capabilities could enhance its utility.
- **Handling Imbalanced Datasets**: The model’s performance declines when trained on imbalanced datasets, suggesting a need for improved techniques to handle class imbalance, such as data augmentation or cost-sensitive learning approaches.
- **Computational Efficiency**: Although MetaPhaPred achieves high accuracy, its computational requirements are significant. Optimizing the model for faster processing without sacrificing accuracy would make it more practical for large-scale analyses.
- **Prophage and Homologous Fragment Removal**: The current model does not specifically address the removal of prophages and homologous fragments, which could improve prediction accuracy further by refining data preprocessing steps.
- **Multi-Class Classification**: Developing MetaPhaPred into a multi-class classification model could enable it to differentiate between various types of sequences, including bacteria, archaea, plasmids, and fungi, making it a more comprehensive tool for metagenomic analysis.

This evaluation emphasizes MetaPhaPred’s effectiveness in identifying phages from metagenomic data, highlighting both its innovative approach and areas where further research could enhance its performance and application scope.


## Accurate Identification of Bacteriophages From Metagenomic Data Using Transformer
Link: [PhaMer](https://academic.oup.com/bib/article/23/4/bbac258/6620872?login=false)

### Summary

The paper introduces PhaMer, a deep learning tool leveraging the Transformer model to identify phage sequences from metagenomic data. By constructing a protein-cluster vocabulary, PhaMer encodes phage sequences into sentences of protein-based tokens, which are then analyzed by the Transformer model. This approach enables PhaMer to learn the composition and associations of proteins in phages, distinguishing them from bacterial sequences. Rigorous testing on various datasets, including simulated and real metagenomic data, demonstrates that PhaMer outperforms existing state-of-the-art phage detection tools, significantly improving precision and recall, particularly on complex datasets.

### Key Points

- **Background**: Phages are highly diverse viruses that infect bacteria, playing crucial roles in microbial ecosystems. Identifying phages from metagenomic data is challenging due to shared sequence similarities between phages and bacterial genomes.
- **PhaMer Overview**: PhaMer utilizes a Transformer model with protein-cluster tokens to embed phage sequences contextually. This approach captures both the protein composition and interactions within phages, enhancing the model’s ability to differentiate phage sequences from bacterial ones.
- **Datasets**: The study evaluated PhaMer on multiple datasets, including the RefSeq dataset, short contigs, simulated metagenomes, mock metagenomic communities, and the IMG/VR dataset, to assess its performance across various scenarios.
- **Comparison with Other Methods**: PhaMer was benchmarked against five state-of-the-art tools, including alignment-based and learning-based approaches like VirSorter, Seeker, and DeepVirFinder. PhaMer showed superior performance, particularly in distinguishing phage sequences from bacterial contigs.
- **Performance Metrics**: Key metrics used for evaluation include precision, recall, and F1-score. PhaMer consistently achieved the highest scores, showing notable improvements in F1-score, particularly in complex datasets such as mock metagenomic data.
- **Transformer Model**: The model employs self-attention mechanisms and positional embeddings to capture protein-protein relationships and their importance in phage identification, providing a significant advantage over traditional k-mer-based and alignment-based methods.
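The "sentence of protein-based tokens" idea can be illustrated with a small encoding sketch. Everything concrete here is an assumption made for illustration: the cluster names (`PC_001`, ...), the padding and unknown-token ids, the fixed length, and the function name are hypothetical and are not taken from PhaMer's implementation.

```python
from typing import Dict, List


def encode_contig(protein_clusters: List[str],
                  vocab: Dict[str, int],
                  max_len: int = 8,
                  pad_id: int = 0,
                  unk_id: int = 1) -> List[int]:
    """Map the ordered protein-cluster hits of a contig to integer token
    ids, truncating or right-padding to max_len.

    Clusters missing from the vocabulary map to unk_id (an assumption).
    """
    ids = [vocab.get(pc, unk_id) for pc in protein_clusters[:max_len]]
    return ids + [pad_id] * (max_len - len(ids))


# Hypothetical vocabulary: ids 0 and 1 are reserved for padding/unknown
vocab = {"PC_001": 2, "PC_002": 3, "PC_003": 4}
print(encode_contig(["PC_002", "PC_007", "PC_001"], vocab, max_len=5))
# [3, 1, 2, 0, 0]
```

A fixed-length integer sequence like this is the usual input shape for a Transformer's embedding layer, with positional embeddings added so the model can use the order of proteins along the contig.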

### Potential Areas for Improvement

- **Prophage Detection**: PhaMer currently focuses on identifying phages and does not specifically detect prophages. Incorporating prophage detection algorithms could enhance the model’s utility in metagenomic studies.
- **Handling Plasmid Sequences**: PhaMer treats plasmid sequences as bacterial, which could impact its precision. Future updates could include a multi-class classification approach to distinguish plasmids from both phages and bacterial chromosomes.
- **Optimization for Computational Efficiency**: The current implementation of PhaMer is resource-intensive, mainly due to protein translation and sequence alignment steps. Streamlining these processes or using alternative alignment methods could reduce computational costs.
- **Balancing Sensitivity and Specificity**: Adjusting the criteria for protein matching (e.g., E-value cutoffs) impacts the balance between recall and precision. Exploring more adaptive approaches could help optimize this trade-off, especially for novel or highly diverse phages.

PhaMer demonstrates a significant step forward in phage identification from metagenomic data, showcasing the potential of advanced deep learning models like Transformer in bioinformatics applications. Further refinements and expansions of its capabilities could enhance its performance and broaden its applicability in phage discovery and microbial ecology studies.


## Terms and Definitions
**Genomics**: focuses on studying the entire genetic material of a *single organism*

**Metagenomics**: focuses on studying the genetic material from *multiple organisms* found in a mixed sample

**Phage (Bacteriophage)**: a virus that specifically infects and replicates within bacteria. Phages play a key role in controlling bacterial populations in nature and can influence microbial ecosystems

**Horizontal Gene Transfer (HGT)**: the process by which an organism transfers genetic material to another organism that is not its offspring
* Why is HGT important?
  * Spread of antibiotic resistance: bacteria can pass genes that confer resistance to antibiotics, even across species, making infections harder to treat
  * Evolution and adaptation: HGT allows for rapid evolution, enabling organisms to quickly acquire traits that help them survive in new environments or under changing conditions
* How is HGT done?
  * Transduction: transfer of DNA between bacteria by viruses (phages)
  * Conjugation: transfer of DNA through direct contact between two cells. **Plasmids** are small, circular DNA molecules that exist independently of a bacterial chromosome. A bacterium with a plasmid transfers it to another bacterium via a physical connection called a pilus.
  * Transformation: uptake of free DNA from the environment by a cell. **Microeukaryotes** may release or take up free DNA in this way

**Vertical Gene Transfer (VGT)**: the process by which genes are transferred from parent to offspring through reproduction