Preface: We used Python 3 (v 5.5.0) to write this project

# Final Project: Metagenomic Contigs and Microbial Abundance Analysis

Group members: Yin Chen Wan, Ethan (Yixun) Tan, Jocelyn (Jinghua) Wu, Michael Xu

In Week 1, we attempted to assemble contigs using ```SPAdes``` but failed due to memory constraints. We also quantified microbial abundances using [One Codex](https://app.onecodex.com).

The assigned data were ```gzip```-compressed Illumina reads in FASTQ format:  

160523Alm_D16-4706_1_sequence.fastq.gz  
160523Alm_D16-4706_2_sequence.fastq.gz


## Contig Assembly

We tried to assemble contigs using ```SPAdes``` in Terminal.

```spades --meta -t 2 -m 16 -1 /data/metagenomes/160523Alm_D16-4706_1_sequence.fastq.gz -2 /data/metagenomes/160523Alm_D16-4706_2_sequence.fastq.gz -o Output```

However, the assembly failed due to memory constraints, with the following error:

```The reads contain too many k-mers to fit into available memory. You need approx. 139.871 GB of free RAM to assemble your dataset```

To work around these constraints, we could downsample our assigned reads. We could use ```gunzip```, ```head```, and pipes to count the lines in our assigned files and then assemble subsets of the data selected by line count.

Alternatively, we could use this command from [Biostars](https://www.biostars.org/p/9610/) to perform sequence number counts: ```zcat name_of_file.fastq.gz | echo $((`wc -l`/4))``` 

Using ```zcat``` would allow us to partition our assigned data by sequence number and assemble the partitions.
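
The counting and subsetting idea above can also be sketched in Python. This is a minimal sketch, not something we ran on the assigned data: the function names `count_reads` and `take_first_reads` are hypothetical, and it assumes each FASTQ record occupies exactly four lines and that both paired files list their reads in the same order. Taking the first reads of a file is not a random sample, so the subset may not be representative.

```python
import gzip
from itertools import islice


def count_reads(path):
    """Count reads in a gzip-compressed FASTQ file (assumes 4 lines per read)."""
    with gzip.open(path, "rt") as handle:
        return sum(1 for _ in handle) // 4


def take_first_reads(in_path, out_path, n_reads):
    """Copy the first n_reads records (4 lines each) into a new gzip file."""
    with gzip.open(in_path, "rt") as src, gzip.open(out_path, "wt") as dst:
        dst.writelines(islice(src, 4 * n_reads))


# Example usage (hypothetical output names; apply the same n_reads to both
# files so that mates stay paired):
# take_first_reads("160523Alm_D16-4706_1_sequence.fastq.gz", "subset_1.fastq.gz", 1_000_000)
# take_first_reads("160523Alm_D16-4706_2_sequence.fastq.gz", "subset_2.fastq.gz", 1_000_000)
```

The resulting subset files could then be passed to ```SPAdes``` in place of the full inputs, provided they fit in the available memory.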

### Statistical Analyses

Not performed, as the assembly was not completed.

## Microbial Abundance Analysis

We uploaded our assigned ```gzip```-compressed Illumina reads to One Codex to perform metagenomic analysis. Both inputs were mixed/metagenomic samples consisting of whole genome sequences.

```onecodex upload --forward 160523Alm_D16-4706_1_sequence.fastq.gz --reverse 160523Alm_D16-4706_2_sequence.fastq.gz```

### **160523Alm_D16-4706_sequence.fastq.gz**

There were 186200628 reads in total, 1.56% of which were host reads. The host reads were identified by cross-checking against the GRCh38 human genome assembly and subsequently removed. We looked at a 10% sample of all **classified** reads (1290626 reads).
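
As a rough check on scale, the host-read percentage above can be converted into an approximate read count. This is only a sketch: the 1.56% figure is rounded, so the resulting counts are approximate, and the variable names are our own.

```python
total_reads = 186_200_628      # total reads reported for the sample
host_fraction = 1.56 / 100     # share of reads flagged as host reads (rounded)

# Approximate counts; exact values depend on the unrounded percentage.
host_reads = round(total_reads * host_fraction)
non_host_reads = total_reads - host_reads

print(f"host reads (approx.):     {host_reads:,}")
print(f"non-host reads (approx.): {non_host_reads:,}")
```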

The most abundant species were _Cutibacterium acnes_, _Pseudomonas_ sp. NFPP08, _Acidobacteria bacterium_ 13_1_20CM_2_60_10, and _Pseudomonas_ sp. NFPP19. Other notable species in this sample include _Escherichia coli_, _Acidovorax delafieldii_, and _Micrococcus luteus_.


### **Summary**

It is clear that much microbiological work remains to be completed. A plurality of our reads (39%) were not recognized by the One Codex database, and a further 25% of reads could not be identified to the genus level.

Both samples have an abundance of _Pseudomonas_ (20%), _Streptomyces_ (11%), _Mycobacterium_ (2.2%), _Clostridium_ (1.5%), and _Bradyrhizobium_ (1.2%) reads. There is thus a diverse mixture of Gram-negative (_Pseudomonas_, _Bradyrhizobium_) and Gram-positive (_Streptomyces_, _Mycobacterium_, _Clostridium_) microbes in our soil samples.
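
The Gram-negative and Gram-positive shares implied by these percentages can be tallied with a short script. This is a minimal sketch that uses only the genus percentages quoted above and the Gram grouping given in the text; the dictionary and variable names are our own.

```python
# Genus-level read percentages quoted in the paragraph above.
genus_percent = {
    "Pseudomonas": 20.0,
    "Streptomyces": 11.0,
    "Mycobacterium": 2.2,
    "Clostridium": 1.5,
    "Bradyrhizobium": 1.2,
}

# Gram grouping as stated in the text.
gram_group = {
    "Pseudomonas": "Gram-negative",
    "Bradyrhizobium": "Gram-negative",
    "Streptomyces": "Gram-positive",
    "Mycobacterium": "Gram-positive",
    "Clostridium": "Gram-positive",
}

# Sum the percentages within each group.
totals = {}
for genus, pct in genus_percent.items():
    group = gram_group[genus]
    totals[group] = totals.get(group, 0.0) + pct

for group, pct in sorted(totals.items()):
    print(f"{group}: {pct:.1f}% of reads")
```

These totals cover only the five genera listed, not every classified read.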

Many reads correspond to bacteria we expect to live in soil. Members of _Pseudomonas_ and _Streptomyces_ are commonly found in soil and decaying vegetation. _P. putida_, for example, participates in many bioremediation reactions. _Streptomyces_ species develop complex mycelial networks to absorb organic compounds. Some species of _Mycobacterium_ also thrive in moist environments such as soil. Members of _Bradyrhizobium_ are involved in nitrogen fixation. _Acidovorax delafieldii_, one of the most common bacteria in our sample, is also a soil bacterium.

It is interesting that many species in our samples are pathogenic. _C. acnes_, the most common bacterium found in both samples, is a slow-growing Gram-positive bacterium implicated in acne. Many _Pseudomonas_ species are also implicated in disease. _P. aeruginosa_, for example, is a multidrug-resistant Gram-negative bacterium that often infects immunocompromised individuals. _Clostridium_ species such as _C. botulinum_ and _C. difficile_ cause botulism and diarrhea, respectively. All three genera are found in soil, in water, and among skin flora. They are also "hardy" organisms that can survive in both low-oxygen and normal atmospheric conditions. Other pathogenic species include _E. coli_ and _M. luteus_, a bacterium normally part of skin flora that may infect immunocompromised individuals.

![](Week_1_Analysis.png)