GitHub - xy-repo/q2-data-augment

q2-data-augment: QIIME2 plugin for data augmentation using rarefaction (Rarefy for Augment)

Data augmentation is a very useful and widely used method in data science (see: https://en.wikipedia.org/wiki/Data_augmentation). Especially, it can increase the sample size of the training set for machine learning models.

Rarefaction can be used as an effective and reliable method for data augmentation, given the following reasons:

Essentially, biological sample collection and sequencing are random sampling processes, which capture microbes from an unknown population. Rarefaction is just another random sampling process, which can also be seen as sampling certain reads from the same population, just the same as biological sample collection and sequencing.
Under the hypothesis that "rarefaction" = "biological sample collection and sequencing", each iteration of rarefaction on a sequencing sample in fact generates a new sequencing sub-sample. This new sub-sample contains a subset of reads of the original sample, and come from the same population that the original sample belongs to.
Some preliminary results show using rarefaction for data augmentation can significantly the results of machine learning classification (unpublished results).

This method, named Rarefy for Augment, is very simple. Run random rarefaction N times. Each time, rename the samples and corresponding metadata and concatenate them with the previous two files. Finally, the sample size can be enlarged N times.

Installing

conda activate qiime2-2020.11
pip install git+https://github.com/yxia0125/q2-data-augment.git

Type "qiime data-augment" to test if the installation is successful.

Uninstalling

pip uninstall q2-data-augment

Using

qiime data-augment augment --i-table raw_table.qza 
                           --m-raw-metadata-file raw_metadata.tsv 
                           --p-sampling-depth 2000 
                           --p-augment-times 10
                           --p-output-path-metadata augmented_meta.tsv  
                           --o-augmented-table augmented_table.qza

"raw_table.qza" and "raw_metadata.tsv" are the input raw feature table and metadata; --p-sampling-depth --> the rarefaction depth; --p-augment-times set to 10 means repeating rarefaction 10 times (i.e., enlarge sample 10 times); "augmented_table.qza" is the augmented feature table, its sample size is 10 times larger than "raw_table.qza", and new rarefied samples end with "_X" (X represents the i_th rarefaction); "augmented_meta.tsv" is the augmented metadata that has matching sample names in "augmented_table.qza".

Note: Only need to augment the training set.

Citing

If you are interested to use this method, please include the following citation:

Yao Xia, q2-data-augment: QIIME2 plugin for data augmentation using rarefaction (Rarefy for Augment), (2021), GitHub repository, https://github.com/yxia0125/q2-repeat-rarefy.

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
q2_data_augment		q2_data_augment
LICENSE		LICENSE
README.md		README.md
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

q2_data_augment

q2_data_augment

LICENSE

LICENSE

README.md

README.md

setup.py

setup.py

Repository files navigation

q2-data-augment: QIIME2 plugin for data augmentation using rarefaction (Rarefy for Augment)

Installing

Uninstalling

Using

Citing

About

Releases

Packages

Languages

License

xy-repo/q2-data-augment

Folders and files

Latest commit

History

Repository files navigation

q2-data-augment: QIIME2 plugin for data augmentation using rarefaction (Rarefy for Augment)

Installing

Uninstalling

Using

Citing

About

Resources

License

Stars

Watchers

Forks

Languages