## Literature Review

#### By Tiffany Tang

This study employs an integrative computational pipeline to investigate how the cervicovaginal microbiome and systemic inflammatory cytokines influence the persistence of high-risk human papillomavirus (hrHPV) infection among Nigerian women. The dataset includes cervicovaginal 16S rRNA gene amplicon sequencing, serum cytokine measurements, hrHPV PCR results, and detailed metadata. Analyses will be conducted using a combination of R- and Python-based tools for microbiome profiling, statistical testing, and machine learning.

### Genomic Tools and Resources

This project integrates several genomic and analytical tools across R and Python to process and analyze a multi-omic dataset comprising cervicovaginal 16S rRNA sequencing, systemic cytokine levels, and clinical metadata. A key component of the microbiome analysis is **DADA2**, an R package that denoises 16S amplicon sequencing data by inferring exact amplicon sequence variants (ASVs) instead of clustering reads into operational taxonomic units (OTUs). DADA2 is preferred because ASVs offer finer taxonomic resolution and better reproducibility than OTUs, which is essential for identifying microbial signatures associated with HPV persistence. The pipeline, which is automated via an in-house script, includes quality filtering, dereplication, error modeling, chimera removal, and taxonomy assignment using reference databases such as SILVA or Greengenes. This step ensures that only high-quality, biologically meaningful sequence data are retained.

Once the ASVs and taxonomic assignments are generated, they are imported into **phyloseq**, an R package tailored for microbiome data integration and analysis. Phyloseq provides a standardized object format to store ASV tables, taxonomic information, sample metadata, and phylogenetic trees, if available. This structure enables downstream analyses, including calculation of alpha diversity metrics (e.g., Shannon index, observed ASVs) and beta diversity (e.g., Bray–Curtis distances), as well as ordination techniques like principal coordinates analysis (PCoA). Distance matrices computed from a phyloseq object can also be passed to PERMANOVA (for example, `adonis2` in the vegan package) to test whether microbial community structures differ by hrHPV status or other clinical covariates. Phyloseq’s compatibility with ggplot2 allows for publication-ready visualizations of diversity, taxa abundance, and sample clustering.
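
To make the two diversity measures concrete, the following is a minimal Python sketch of the Shannon index and the Bray–Curtis dissimilarity. It is illustrative only: the project computes these metrics through phyloseq in R, and the small `asv_counts` table below is invented for the example.

```python
import numpy as np

def shannon_index(counts):
    """Shannon diversity H' = -sum(p_i * ln p_i) over taxa with nonzero counts."""
    counts = np.asarray(counts, dtype=float)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log(p)).sum())

def bray_curtis(u, v):
    """Bray-Curtis dissimilarity: sum|u_i - v_i| / sum(u_i + v_i)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.abs(u - v).sum() / (u + v).sum())

# Hypothetical ASV count table: rows are samples, columns are ASVs.
asv_counts = np.array([
    [90, 5, 5, 0],      # dominated by a single taxon
    [25, 25, 25, 25],   # perfectly even community
    [0, 0, 10, 90],     # dominated by a different taxon
])

# Within-sample (alpha) diversity, one value per sample.
alpha = [shannon_index(row) for row in asv_counts]

# Between-sample (beta) diversity as a symmetric dissimilarity matrix.
n = len(asv_counts)
beta = np.array([[bray_curtis(asv_counts[i], asv_counts[j]) for j in range(n)]
                 for i in range(n)])
```

The even community has the highest Shannon index (ln 4 for four equally abundant taxa), while the two samples dominated by different taxa are the most dissimilar pair.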

To identify microbial features that are differentially abundant based on hrHPV status or cytokine levels, the project will apply **DESeq2** or **ANCOM-BC**, both R packages that can be applied to sparse microbiome count data. DESeq2 models raw counts with a negative binomial distribution and performs normalization, dispersion estimation, and differential testing, while ANCOM-BC explicitly adjusts for the compositional effects inherent to sequencing data. These tools support statistical inference about microbial taxa that may influence or respond to systemic inflammation or viral persistence. Importantly, both allow for covariate adjustment and multiple-testing correction, which is crucial in high-dimensional microbiome datasets.
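
One common way to control false discovery across many taxa is the Benjamini–Hochberg procedure, which is the default p-value adjustment in DESeq2. The sketch below reimplements it in Python purely for illustration. The `raw_p` values are invented, and in practice the adjustment would be done inside the R packages themselves.

```python
import numpy as np

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Scale each sorted p-value by m / rank.
    ranked = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working down from the largest p-value.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(ranked, 0.0, 1.0)
    return adjusted

# Hypothetical per-taxon p-values from a differential abundance test.
raw_p = [0.001, 0.008, 0.039, 0.041, 0.60]
adjusted_p = benjamini_hochberg(raw_p)
```

Taxa whose adjusted p-value falls below the chosen threshold (often 0.05 or 0.10) would be reported as differentially abundant.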

For the integrative analysis and predictive modeling, Python libraries including **pandas**, **numpy**, and **scikit-learn** will be employed. Pandas and numpy provide the foundation for manipulating and merging microbiome, cytokine, and metadata tables, while scikit-learn offers an accessible and flexible framework for building machine learning models. Supervised models such as random forests and logistic regression will be used to predict HPV persistence based on microbial and immune features. These models will be trained using cross-validation to ensure generalizability and evaluated using performance metrics like ROC-AUC and confusion matrices. The ability to integrate diverse data types and build predictive models makes Python a powerful complement to R in this multi-omic study.
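
As a rough sketch of this evaluation scheme, the example below compares logistic regression and a random forest using stratified cross-validation, ROC-AUC, and a confusion matrix. The data are synthetic stand-ins generated for the example, and the hyperparameters (five folds, 200 trees, a 0.5 decision threshold) are assumptions rather than choices made by the project.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for a (samples x features) matrix of microbial and immune features.
n_samples, n_features = 120, 20
X = rng.normal(size=(n_samples, n_features))
# Synthetic persistence label driven by the first two features plus noise.
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n_samples) > 0).astype(int)

models = {
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

results = {}
for name, model in models.items():
    # Out-of-fold probabilities: each sample is scored by a model that never saw it.
    proba = cross_val_predict(model, X, y, cv=cv, method="predict_proba")[:, 1]
    results[name] = {
        "roc_auc": roc_auc_score(y, proba),
        "confusion_matrix": confusion_matrix(y, (proba >= 0.5).astype(int)),
    }
```

Wrapping the scaler and classifier in one pipeline keeps the scaling step inside each training fold, which avoids leaking information from held-out samples.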



### Analysis Pipeline

The overall analysis begins with data preparation, which includes demultiplexing raw FASTQ files using a custom Bash script provided by the research team. These demultiplexed files are then processed through the DADA2 pipeline in R to filter low-quality reads, infer ASVs, remove chimeras, and assign taxonomy. This generates a high-resolution representation of the microbial community in each cervicovaginal sample.

Following sequence processing, the resulting ASV tables and taxonomy data are loaded into phyloseq, where they are merged with sample metadata. Microbiome composition is analyzed by calculating alpha diversity metrics such as the Shannon index to assess within-sample diversity, and beta diversity to examine differences in microbial communities across hrHPV persistence groups. Ordination plots based on Bray–Curtis dissimilarities will be used to visualize sample clustering, and PERMANOVA will test for statistically significant differences in community composition.
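
The logic of the PERMANOVA test can be sketched in a few lines of Python: compute a pseudo-F statistic from the distance matrix, then compare it with the values obtained after shuffling the group labels. This is a simplified one-factor illustration on simulated counts, not the implementation the project will use (which would be an established R routine such as vegan's `adonis2`). The group labels and Poisson rates are invented.

```python
import numpy as np

def permanova(dist, groups, n_perm=999, seed=0):
    """One-factor PERMANOVA: pseudo-F statistic and permutation p-value."""
    dist = np.asarray(dist, dtype=float)
    groups = np.asarray(groups)
    n = len(groups)
    labels = np.unique(groups)
    a = len(labels)
    d2 = dist ** 2
    # Total sum of squares from all pairwise squared distances.
    ss_total = d2[np.triu_indices(n, k=1)].sum() / n

    def pseudo_f(g):
        ss_within = 0.0
        for lab in labels:
            idx = np.flatnonzero(g == lab)
            sub = d2[np.ix_(idx, idx)]
            ss_within += np.triu(sub, k=1).sum() / len(idx)
        ss_between = ss_total - ss_within
        return (ss_between / (a - 1)) / (ss_within / (n - a))

    rng = np.random.default_rng(seed)
    f_obs = pseudo_f(groups)
    # Null distribution: recompute pseudo-F after shuffling group labels.
    f_perm = np.array([pseudo_f(rng.permutation(groups)) for _ in range(n_perm)])
    p_value = (1 + np.sum(f_perm >= f_obs)) / (n_perm + 1)
    return f_obs, p_value

# Simulated counts for two hypothetical groups with different community profiles.
rng = np.random.default_rng(1)
group_a = rng.poisson(lam=[80, 10, 5, 5], size=(8, 4))
group_b = rng.poisson(lam=[20, 30, 25, 25], size=(8, 4))
counts = np.vstack([group_a, group_b])

# Pairwise Bray-Curtis dissimilarities between all samples.
diff = np.abs(counts[:, None, :] - counts[None, :, :]).sum(axis=2)
total = (counts[:, None, :] + counts[None, :, :]).sum(axis=2)
dist = diff / total

groups = np.array(["persistent"] * 8 + ["cleared"] * 8)
f_stat, p_value = permanova(dist, groups)
```

A small p-value indicates that the observed separation between groups is larger than expected under random label assignment. PERMANOVA is also sensitive to differences in within-group dispersion, so a dispersion check is usually run alongside it.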

Next, differential abundance analysis will be conducted using DESeq2 or ANCOM-BC to identify taxa whose abundance varies by hrHPV infection status or systemic cytokine levels. These models will include relevant metadata as covariates to control for confounding factors. The cytokine data, which consist of approximately 30 serum analytes, will be cleaned, normalized, and used to compute correlations with microbial taxa. Correlation networks will be generated to visualize microbial-immune associations and to nominate candidate biomarkers of viral persistence.
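
A possible shape for the taxon-cytokine correlation step is sketched below. It makes several assumptions that the text does not specify: a centered log-ratio (CLR) transform with a pseudocount of 1 for the taxa, Spearman rank correlation, and an arbitrary cutoff of |rho| >= 0.6 for drawing network edges. All table contents and names such as `taxon_A` and `cytokine_1` are simulated placeholders.

```python
import numpy as np
import pandas as pd

def clr_transform(counts, pseudocount=1.0):
    """Centered log-ratio: log counts minus the per-sample mean log count."""
    logged = np.log(np.asarray(counts, dtype=float) + pseudocount)
    return logged - logged.mean(axis=1, keepdims=True)

rng = np.random.default_rng(2)
samples = [f"S{i:02d}" for i in range(40)]

# Simulated taxon counts (rows are samples).
taxa = pd.DataFrame(
    rng.poisson(lam=[50, 20, 10, 5], size=(40, 4)),
    index=samples,
    columns=["taxon_A", "taxon_B", "taxon_C", "taxon_D"],
)
taxa_clr = pd.DataFrame(clr_transform(taxa.to_numpy()), index=samples, columns=taxa.columns)

# Simulated cytokines; cytokine_1 is built to track taxon_A so the sketch contains a signal.
cytokines = pd.DataFrame(
    {
        "cytokine_1": taxa_clr["taxon_A"] + rng.normal(scale=0.05, size=40),
        "cytokine_2": rng.normal(size=40),
    },
    index=samples,
)

# Spearman correlation of every taxon with every cytokine.
combined = pd.concat([taxa_clr, cytokines], axis=1)
rho = combined.corr(method="spearman").loc[taxa_clr.columns, cytokines.columns]

# Edge list for a correlation network, keeping pairs above the assumed cutoff.
edges = rho.stack().rename("rho").reset_index()
edges.columns = ["taxon", "cytokine", "rho"]
edges = edges[edges["rho"].abs() >= 0.6]
```

In a real analysis the edges would also be filtered on multiplicity-adjusted p-values rather than on the correlation magnitude alone.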

The final stage of the pipeline is data integration and predictive modeling. Here, microbial and cytokine data will be combined into a single feature matrix. Using Python's scikit-learn library, supervised machine learning models such as random forests and logistic regression will be trained to predict hrHPV persistence. These models will be evaluated using techniques like cross-validation and receiver operating characteristic (ROC) curve analysis to assess their predictive performance and identify the most informative features.
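
The integration step might look like the following sketch, which joins a microbiome table and a cytokine table on sample ID and then ranks features with a random forest. Everything here is simulated: the column names, the sample IDs, the missing cytokine samples, and the outcome `hrhpv_persistent` are hypothetical, and impurity-based importance is only one of several possible ranking methods.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(3)
sample_ids = [f"S{i:03d}" for i in range(60)]

# Simulated, already-transformed microbiome features (rows are samples).
microbiome = pd.DataFrame(
    rng.normal(size=(60, 5)),
    index=sample_ids,
    columns=[f"asv_{i}" for i in range(5)],
)
# The simulated cytokine table lacks the last five samples, as can happen with real assays.
cytokines = pd.DataFrame(
    rng.normal(size=(55, 3)),
    index=sample_ids[:55],
    columns=[f"cytokine_{i}" for i in range(3)],
)
# Hypothetical outcome driven by one ASV and one cytokine.
signal = microbiome["asv_0"].add(cytokines["cytokine_0"], fill_value=0.0)
persistence = (signal > 0).astype(int).rename("hrhpv_persistent")

# Inner join keeps only samples measured in both tables.
features = microbiome.join(cytokines, how="inner")
labels = persistence.loc[features.index]

# Fit a random forest and rank features by impurity-based importance.
forest = RandomForestClassifier(n_estimators=300, random_state=0).fit(features, labels)
importance = pd.Series(forest.feature_importances_, index=features.columns)
importance = importance.sort_values(ascending=False)
```

Impurity-based importances can favor features with many distinct values, so permutation importance on held-out data (`sklearn.inspection.permutation_importance`) is a reasonable cross-check.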


### Justification of Tools and Approach

This analytical pipeline leverages state-of-the-art tools tailored to the unique characteristics of microbiome and immunological data. DADA2 offers high-resolution, accurate sequence inference that surpasses traditional OTU-based methods, which is vital for distinguishing subtle microbial differences. Phyloseq provides a powerful, integrated environment for exploring and visualizing microbial community structure, while DESeq2 and ANCOM-BC allow for rigorous differential testing of sparse, compositional sequencing data. In addition, Python’s machine learning ecosystem enables the project to go beyond exploratory analysis and test predictive models for identifying individuals at risk of persistent hrHPV infection. The combination of these tools supports a comprehensive, reproducible, and biologically meaningful investigation into the microbial and immunological correlates of HPV persistence.