# Data Sources for Bioinformatic Projects

This section lists selected data sources that readers of the text have found useful in the past. It is far from comprehensive. Typically, you'll want to look for relevant data sources when starting a large-scale bioinformatics project. Finding them is more art than science (Google and literature searches with 'database' thrown in liberally are often involved). Usually you are looking for databases that allow 'bulk download' of data (sometimes found under 'Data', 'Download', or similar tabs) and that provide the relevant data in a format you can interpret.

Some great formats to work with include:

- `.csv` (comma-separated values). One can imagine these as text representations of a table. If you imagine an Excel table, the file holds basically the same data, except that a comma (',') is placed at the boundary between each column and the next. This type of file is typically opened using the `read_csv` function in the `pandas` Python package, which loads the .csv as a `pandas` `DataFrame` object. You can also look at these files by opening Microsoft Excel, then opening the file from within Excel.
- `.tsv` (tab-separated values). This is essentially the same as a .csv file, except that a tab character (`\t`) rather than a comma separates the columns of the table.
- FASTA files (which may be labelled `.fasta`, `.fna`, `.faa`, etc., depending on the resource and on whether the file holds nucleotide or protein sequences; can be parsed in Python)
- `.newick` (for phylogenetic tree data; can be opened with the DendroPy or ETE3 Python packages)
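
As a rough sketch of how the first two formats load with `pandas`, the example below reads the same small table once as comma-separated text and once as tab-separated text. The table contents and column names are made up for illustration, and `io.StringIO` stands in for a file on disk.

```python
import io

import pandas as pd

# Tiny in-memory stand-ins for files on disk (made-up example data).
csv_text = "gene,length,gc_content\ngeneA,900,0.51\ngeneB,1200,0.47\n"
tsv_text = csv_text.replace(",", "\t")

# read_csv assumes commas by default; pass sep="\t" for tab-separated files.
csv_table = pd.read_csv(io.StringIO(csv_text))
tsv_table = pd.read_csv(io.StringIO(tsv_text), sep="\t")

print(csv_table)
print(csv_table.equals(tsv_table))
```

With a real file, you would pass the file path in place of the `io.StringIO(...)` object.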

# Genome Data


The National Center for Biotechnology Information (NCBI) hosts a huge collection of diverse data. Navigating the interface and finding what you want can be a bit tricky. I often find the direct download interface to be a useful way to get at the underlying data:

https://www.ncbi.nlm.nih.gov/home/download/


## Microbial Genomes

PATRIC has convenient downloads for FASTA files of bacterial genomes split up by coding sequence.

To download a genome from the PATRIC database using the web interface:

1. Go to this web address: https://www.patricbrc.org/view/DataType/Genomes

2. Click on a genome name. On the right side of the screen you will see a green tab pop up. Find the icon with the G marked 'Genome' and click on it.

3. You will now be on a page for that genome. On the far right you will see a download button.

4. Click it and you will get a ton of options for what to download. Check the box marked 'DNA Sequences of Protein Coding Genes (*.ffn)'.

5. Save the file somewhere convenient. The file may be zipped; usually you can double click, or right click and select 'Extract all', to unzip it.

Alternatively, you can use the ftp interface (it can be slow though): 

ftp://ftp.patricbrc.org/
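
Once you have a `.ffn` file (a FASTA file of coding sequences), a few lines of plain Python are enough to read it. The sketch below is a minimal FASTA reader, not a full-featured parser; the function name `read_fasta` and the example sequences are made up for illustration.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header line closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequences may wrap across several lines
    if header is not None:
        yield header, "".join(chunks)


# Made-up example data standing in for the contents of a downloaded file.
example = """>gene_1 hypothetical protein
ATGGCTAAA
TTTGGATAA
>gene_2 example kinase
ATGCCCGGGTAG
"""

records = list(read_fasta(example.splitlines()))
for header, sequence in records:
    print(header, len(sequence))
```

For a real download, you could pass an open file handle (for example `open("my_genome.ffn")`, where the file name is hypothetical) instead of `example.splitlines()`. For anything beyond quick exploration, the `SeqIO` module in Biopython is a well-tested alternative.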

## Viral Genomes

ViPR (Virus Pathogen Resource), a viral genomics resource:
https://www.viprbrc.org/

# Data on Mutations and Polymorphisms

## Single Nucleotide Polymorphism (SNP) Data in Humans 

OpenSNP has an ~15 GB download of user SNP data and a CSV file of associated phenotypes (eye color, handedness, etc.). Download the raw data here:
https://www.opensnp.org/genotypes

# Epigenetic Data

## Epigenetic Data from the Genomic Data Commons (Cancer Data)

Student groups interested in epigenetics and DNA methylation have used the Genomic Data Commons (GDC) resources on
cancer. Critically, these resources include data on both methylation and exposures (so the two can be correlated).
The interface for bulk data download was a little bit tricky, but otherwise we found this resource very useful:
https://portal.gdc.cancer.gov/

# Transcriptomic Data

Transcriptomic data measures the level of transcription from various genes in the genome. It is useful for studying how organisms alter their physiology in response to external conditions (stress, infection, etc.) or internal conditions (development).

## ArrayExpress

The [ArrayExpress](https://www.ebi.ac.uk/arrayexpress/) database has both raw microarray and RNA-seq data and, critically, provides metadata files that say what each sample is (e.g. treatment vs. control). Look for SDRF (Sample and Data Relationship Format) files and download them along with your raw data files. They are tab-separated (.tsv) files that open in Microsoft Excel.

Getting the data: head to Browse, then click the orange filter button in the upper left to apply filters (e.g. I just want corn, so I click Zea mays under organisms). This gives a list of studies filtered by your criteria. Clicking on a study ID will let you download the raw data.
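
Because the sample metadata is just a tab-separated table, you can use it to sort samples into groups before any analysis. The sketch below uses only the Python standard library. The rows are invented, and the column names (`Source Name`, `Factor Value[treatment]`) are assumptions for illustration; check the header row of your own file, since columns vary from study to study.

```python
import csv
import io
from collections import defaultdict

# Made-up metadata table; real column names and values vary by study.
sdrf_text = (
    "Source Name\tCharacteristics[organism]\tFactor Value[treatment]\n"
    "sample_1\tZea mays\tcontrol\n"
    "sample_2\tZea mays\tdrought\n"
    "sample_3\tZea mays\tcontrol\n"
)


def group_samples(handle, sample_column, factor_column):
    """Map each value of factor_column to the list of sample names that have it."""
    groups = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        groups[row[factor_column]].append(row[sample_column])
    return dict(groups)


groups = group_samples(
    io.StringIO(sdrf_text), "Source Name", "Factor Value[treatment]"
)
print(groups)
```

With a downloaded file, pass an open file handle in place of `io.StringIO(sdrf_text)`.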

**Advanced RNA-seq tutorial:** this paper has a Jupyter Notebook in Python that draws on a Docker container to analyze a Zika virus dataset. (Docker containers hold software in its own 'environment', separate from your system, which lets you run software within the Docker container without installing it on your system.) I haven't used it personally, but it looks promising.



# Microbial Community Data


## QIITA 
QIITA has feature tables and metadata for many microbial ecology projects publicly available. One advantage of QIITA for class projects is that the 'feature tables' that describe which microbes are in which samples are pre-calculated, making it easier to jump right into analysis. These can then be imported into the `qiime2` software package, which has a Python interface. https://qiita.ucsd.edu/





# Protein Structure Data

## Protein Data Bank

The Protein Data Bank (PDB) allows download of coordinate files describing a protein's 3D shape:
https://www.rcsb.org/#Category-download
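
To give a feel for what a coordinate file contains, here is a minimal sketch that pulls atom positions out of text in the legacy fixed-width PDB format, where the x, y, and z coordinates sit in fixed character columns of each `ATOM` or `HETATM` line. The function name `parse_atoms` and the two example lines are made up for illustration, and the sketch does not handle the newer mmCIF format. For real work, an established parser such as the one in Biopython is a safer choice.

```python
def parse_atoms(lines):
    """Collect atom names, residues, chains, and coordinates from PDB-format lines."""
    atoms = []
    for line in lines:
        if line.startswith(("ATOM", "HETATM")):
            atoms.append({
                "atom": line[12:16].strip(),     # columns 13-16: atom name
                "residue": line[17:20].strip(),  # columns 18-20: residue name
                "chain": line[21],               # column 22: chain identifier
                "x": float(line[30:38]),         # columns 31-38
                "y": float(line[38:46]),         # columns 39-46
                "z": float(line[46:54]),         # columns 47-54
            })
    return atoms


# Two invented example lines in fixed-width PDB layout.
example_lines = [
    "ATOM      1  N   MET A   1      27.340  24.430   2.614  1.00  9.67",
    "ATOM      2  CA  MET A   1      26.266  25.413   2.842  1.00 10.38",
]

atoms = parse_atoms(example_lines)
for atom in atoms:
    print(atom)
```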

# Coronavirus Structure & Sequence

## CoronaVirus 3D
(hat-tip to Jocel Clark, Jenny Harston, and Nathasya Asnawi)

A resource documenting coronavirus mutations and allowing visualization
of where they fall on 3D structures of coronavirus proteins.
https://coronavirus3d.org/#/

## The NCBI SARS-CoV-2 Resource Page
https://www.ncbi.nlm.nih.gov/sars-cov-2/

# Fossil Data

## Paleobiology Database

This database has raw data on the fossil record and on where different fossils have been found over time.

https://paleobiodb.org/#/

# Bioacoustic Data

## Xeno-canto Birdsong Database

This is a bird sound recording database. It offers manual access to and download of about 635,000 bird calls. It also has an API (application programming interface) that may allow for automation of bulk downloads. However, a direct built-in option to bulk download calls wasn't obvious to me. Explore all calls here: https://www.xeno-canto.org/explore?dir=0&order=xc
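
If you want to try automating downloads through the API, the sketch below shows the general shape such a script might take. Treat the details as assumptions to check against the current API documentation: the endpoint path (`/api/2/recordings`), the `query` and `page` parameters, and the `recordings`, `id`, and `file` fields in the response are what I believe version 2 of the API uses, and newer versions may differ or require an API key. The example response is invented, and the code only builds the request URL and reads an already-parsed response, so it makes no network calls.

```python
from urllib.parse import urlencode

# Assumed endpoint for version 2 of the API; verify before relying on it.
API_BASE = "https://www.xeno-canto.org/api/2/recordings"


def build_query_url(query, page=1):
    """Build a search URL for one page of results."""
    return API_BASE + "?" + urlencode({"query": query, "page": page})


def extract_downloads(response):
    """Return (recording id, audio file URL) pairs from a parsed JSON response."""
    return [
        (rec["id"], rec["file"])
        for rec in response.get("recordings", [])
        if rec.get("file")
    ]


url = build_query_url("Troglodytes troglodytes")
print(url)

# Invented response with the assumed structure, standing in for real JSON.
fake_response = {
    "recordings": [
        {"id": "1", "file": "https://example.org/audio/1.mp3"},
        {"id": "2", "file": ""},  # entries without a file URL are skipped
    ]
}
downloads = extract_downloads(fake_response)
print(downloads)
```

In a real script you would fetch each URL (for example with `urllib.request.urlopen`), parse the JSON, and loop over pages, pausing between requests to be polite to the server.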

# Neurobiology Data

## Electroencephalogram Data
https://github.com/meagmohit/EEG-Datasets