# Data mining of DNA sequences submitted by Peruvian institutions to public genetic databases

This Jupyter notebook presents the code used in this study. For more information, please contact:

Pedro Romero, pedro.romero@upch.pe
Camila Castillo-Vilcahuaman, camila.castillo.v@upch.pe


First of all, let's see which files we have available for this analysis. 

**05/2020**: For now, I'm unable to upload the `journal_and_organism` file because of its size. I will address this issue in the future. Meanwhile, I'll show here the code we used to extract data from the `Nucleotide` database.

In [2]:
ls

1.txt                            organizationNames
200804_bioproject.xml            organizationNames_2
archaea.txt                      organizationNames_3
BinderBash-master/               PATRIC_genome.csv
BinderBash-master.zip            Pedro_bioproject20200408.xml
bioproject_peru.xml              Pedro_bioproject_peru.xml
bold_data_species.txt            peptides.txt
bold_data.txt                    prueba
data_orgn_type_inst              prueba2
draft_scripts                    tax/
grepjournal                      taxonom/
journal_and_organism             test.csv
list                             Universidad_Cientifica_del_Sur
list_commas                      used_queries_per_inst_nucleotide.txt
mining_peruvian_sequences.ipynb


## Nucleotide

`Nucleotide` was the largest database we had to analyze. I used a server to download all records matching the term "Peru" with Entrez Direct (available from the NCBI website):

`esearch -db nucleotide -query "Peru" | efetch -format gb > peru.gb`

Once we had the `peru.gb` file, we ran an `awk` script to extract all the journal and organism data from it.

` awk '{ if($1 ~ /JOURNAL/ || $1 ~ /ORGANISM/){ ban = 1; } if($1 ~ /REFERENCE/ || $1 ~ /COMMENT/ || $1 ~ /PUBMED/ || $1 ~ /FEATURES/ || $1 ~ /AUTHORS/|| $1 ~ /REMARK/ || $1 ~ /TITLE/){ ban = 0; } if(ban == 1){ print $0; } }' peru.gb > journal_and_organism `
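To see what this filter keeps, here is a minimal sketch that applies the same `awk` logic to a tiny hand-written GenBank-style record. The record contents, the file name `sample.gb` and the function name `extract_journal_and_organism` are invented for illustration and are not part of the study data or code.

```shell
# Hypothetical miniature GenBank-style record (invented for illustration).
workdir=$(mktemp -d)
cat > "$workdir/sample.gb" <<'EOF'
LOCUS       XX000001     100 bp    DNA     linear   BCT 01-JAN-2020
SOURCE      Escherichia coli
  ORGANISM  Escherichia coli
            Bacteria; Proteobacteria; Gammaproteobacteria.
REFERENCE   1  (bases 1 to 100)
  AUTHORS   Doe,J.
  TITLE     Direct Submission
  JOURNAL   Submitted (01-JAN-2020) Universidad Nacional Mayor de San
            Marcos, Lima, Peru
COMMENT     Example comment.
FEATURES             Location/Qualifiers
EOF

# Same logic as the one-liner above: start printing at JOURNAL or ORGANISM,
# stop printing at any of the other listed field names.
extract_journal_and_organism() {
  awk '
    $1 ~ /JOURNAL/ || $1 ~ /ORGANISM/ { keep = 1 }
    $1 ~ /REFERENCE/ || $1 ~ /COMMENT/ || $1 ~ /PUBMED/ || $1 ~ /FEATURES/ ||
    $1 ~ /AUTHORS/ || $1 ~ /REMARK/ || $1 ~ /TITLE/ { keep = 0 }
    keep == 1 { print }
  ' "$1"
}

extract_journal_and_organism "$workdir/sample.gb" > "$workdir/journal_and_organism"
cat "$workdir/journal_and_organism"
```

On this sample, only the `ORGANISM` line, its lineage line, the `JOURNAL` line and its continuation line are kept. Note that the institution name is split across two lines, which matters for the searches described further down.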

To check how many organism records our `journal_and_organism` data contained, we used the `grep` command.

`grep -c "ORGANISM" journal_and_organism`

This gave us 817 694 records associated with the term "Peru" in our `journal_and_organism` file. However, we knew that this number could be an overestimate, because the term "Peru" does not appear only in sequences uploaded from this country. Thus, we decided to perform an analysis that included all institutions.

The Nucleotide format wraps long lines, which makes it difficult to search for institutions by their complete name. Uploaders are also free to label their sequences as they see fit, so some institution names appear in several variants. For example, the Instituto Nacional de Salud was also found in this database under its English translation. We therefore had to choose the query words carefully. The file `used_queries_per_inst_nucleotide.txt` contains all the query words used for this analysis.

In [None]:
head used_queries_per_inst_nucleotide.txt

To make the analysis easier, we made a list of all query words used:

In [None]:
head list

For the first part, we manually searched for each query word to see how many times an institution appeared in the `journal_and_organism` data. For example, here we used the query word "SAN MARCOS":

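`awk -v j=0 -v col=0 -v total_col=0 '{
  if ($1 ~ /JOURNAL/) {
    if (j == 1) {
      if (col == 1) {
        total_col = total_col + 1;
        col = 0;
      }
    } else {
      j = 1;
    }
  }
  if (toupper($0) ~ /SAN MARCOS/) {
    col = 1;
  }
} END {
  if (j == 1 && col == 1) {
    total_col = total_col + 1;
  }
  print "Total number of journal entries containing the keyword: " total_col;
}' journal_and_organism`

The check in the `END` block counts the last journal entry, which has no following `JOURNAL` line to close it.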
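To avoid editing the script for every institution, the same kind of count can be wrapped in a function that takes the query word as an argument. The following is only a sketch, not the code used in the study: the function name `count_journal_blocks` and the sample file are hypothetical, the query is matched as a literal uppercase string with `index` rather than as a regular expression, and a block is taken to run from one `JOURNAL` line to the next one (or to the end of the file).

```shell
workdir=$(mktemp -d)
# Hypothetical sample in the layout of journal_and_organism.
cat > "$workdir/sample_journals" <<'EOF'
  ORGANISM  Escherichia coli
  JOURNAL   Submitted (01-JAN-2020) Universidad Nacional Mayor de San
            Marcos, Lima, Peru
  ORGANISM  Vibrio cholerae
  JOURNAL   Submitted (02-JAN-2020) Instituto Nacional de Salud, Lima, Peru
  ORGANISM  Zea mays
  JOURNAL   Submitted (03-JAN-2020) Facultad de Ciencias, UNIVERSIDAD NACIONAL MAYOR DE SAN MARCOS, Lima
EOF

# count_journal_blocks QUERY FILE
# Prints the number of JOURNAL blocks containing QUERY (give QUERY in uppercase).
count_journal_blocks() {
  awk -v q="$1" '
    $1 ~ /JOURNAL/ {
      if (seen_journal && hit) total++   # close the previous block
      seen_journal = 1
      hit = 0
    }
    seen_journal && index(toupper($0), q) { hit = 1 }
    END {
      if (seen_journal && hit) total++   # close the last block
      print total + 0
    }
  ' "$2"
}

count_journal_blocks "SAN MARCOS" "$workdir/sample_journals"
count_journal_blocks "MARCOS" "$workdir/sample_journals"
```

On this sample, "SAN MARCOS" is counted once and "MARCOS" twice, because the first record splits the institution name across two lines. This is the line-wrapping problem described earlier and the reason the query words had to be chosen carefully.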

Once we had a first scan of all the institutions and how many times their names appeared in our database, we searched for the organisms that came from these institutions. For this, we used a `while` loop:

`while read p; do awk -v orgn="" -v p="$p" '{ if ($1 ~ /ORGANISM/) { ban = 1; orgn = $2 " " $3 } if (toupper($0) ~ p) { if (ban == 1) { print orgn; ban = 0 } } }' journal_and_organism > j_and_orgn_"$p"; done < list`

We then concatenated all the output files with the `cat` command into a single file called `total` (I prefer to keep all these files in one directory, so I created one and moved the files generated by the previous script into it). Finally, we used `sort` and `uniq` to count how many times each species appeared in our data.

`sort j_and_orgn_total | uniq -c > j_and_orgn_cont_total`
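The loop, concatenation and counting steps can be tried end to end on a small example. The sketch below uses invented sample data and hypothetical file names inside a temporary directory. It keeps the matching logic of the loop above, reads the list with `IFS= read -r` so query words are taken literally, and adds a final `sort -rn` (not part of the original pipeline) to rank species by count.

```shell
workdir=$(mktemp -d)

# Hypothetical stand-ins for journal_and_organism and list.
cat > "$workdir/journal_and_organism" <<'EOF'
  ORGANISM  Escherichia coli
            Bacteria; Proteobacteria.
  JOURNAL   Submitted (01-JAN-2020) Instituto Nacional de Salud, Lima, Peru
  ORGANISM  Escherichia coli
            Bacteria; Proteobacteria.
  JOURNAL   Submitted (02-JAN-2020) Instituto Nacional de Salud, Lima, Peru
  ORGANISM  Zea mays
            Eukaryota; Viridiplantae.
  JOURNAL   Submitted (03-JAN-2020) Universidad Nacional Agraria La Molina
EOF
printf '%s\n' 'NACIONAL DE SALUD' 'LA MOLINA' > "$workdir/list"

# One output file per query word: print the last organism seen before a match.
while IFS= read -r p; do
  awk -v p="$p" '
    $1 ~ /ORGANISM/ { ban = 1; orgn = $2 " " $3 }
    (toupper($0) ~ p) && (ban == 1) { print orgn; ban = 0 }
  ' "$workdir/journal_and_organism" > "$workdir/j_and_orgn_$p"
done < "$workdir/list"

# Concatenate the per-query files, then count and rank the species.
cat "$workdir"/j_and_orgn_* > "$workdir/total"
sort "$workdir/total" | uniq -c | sort -rn > "$workdir/species_counts"
cat "$workdir/species_counts"
```

On this sample, the result lists Escherichia coli with a count of 2 and Zea mays with a count of 1.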

To retrieve taxonomic information, we used this command:

`while read p; do awk -v orgn="" -v p="$p" -v orgn2="" -v orgn3="" '{ if ($1 ~ /ORGANISM/) { ban = 1; getline orgn; getline orgn2; getline orgn3 } if (toupper($0) ~ p) { if (ban == 1) { print orgn; print orgn2; print orgn3; ban = 0 } } }' journal_and_organism > j_and_orgn_"$p"; done < list`

This allowed us to see which domains and classes were the most represented in Peruvian sequences. We again used the `cat` command to combine all the generated files into a single `total` file. To check which taxa were present in this `total` file, we used the `grep -c` command.

`grep -c "Viruses" total`
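Rather than running `grep -c` once per taxon by hand, the same count can be looped over a list of names. This is a minimal sketch on invented lineage lines; the choice of the four top-level taxa and the output file name `taxa_counts` are assumptions made for illustration. Keep in mind that `grep -c` counts matching lines, not individual occurrences.

```shell
workdir=$(mktemp -d)
# Hypothetical excerpt of a concatenated taxonomy file.
cat > "$workdir/total" <<'EOF'
            Bacteria; Proteobacteria; Gammaproteobacteria;
            Enterobacterales; Enterobacteriaceae; Escherichia.
            Viruses; Riboviria; Flaviviridae; Flavivirus.
            Bacteria; Firmicutes; Bacilli; Staphylococcaceae.
            Eukaryota; Viridiplantae; Streptophyta; Poaceae; Zea.
EOF

# Print one tab-separated line per taxon: name, number of matching lines.
for taxon in Bacteria Archaea Eukaryota Viruses; do
  printf '%s\t%s\n' "$taxon" "$(grep -c "$taxon" "$workdir/total")"
done > "$workdir/taxa_counts"
cat "$workdir/taxa_counts"
```

The match is case-sensitive, so `Bacteria` does not match names such as `Proteobacteria` or `Enterobacteriaceae`, where the `b` is lowercase.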

If you run this code, you will notice that some records are lost during the process. We assume this is because of how heterogeneous the metadata is. Institution names may be repeated or misspelled and, when searching for organism data, we found some organisms without a complete taxonomic classification.

## Bioproject

## PATRIC

The PATRIC database proved to be more homogeneous. PATRIC data can be downloaded in `.csv` format, which makes it easier to analyze. The PATRIC data can be found in this repository.

In [None]:
head PATRIC_genome.csv

To count how many records a given institution has uploaded to the PATRIC database, we used the `grep` command.

In [None]:
grep -c "Universidad Nacional Mayor de San Marcos" PATRIC_genome.csv

To list all organisms uploaded to the PATRIC database, we combined `cut`, `awk`, `sort` and `uniq`:

In [None]:
cat PATRIC_genome.csv | cut -f2 -d , | awk '{print $1,$2}' | sort | uniq -c

To extract all organisms per institution, I had to install `csvgrep`, which is part of the `csvkit` package.

**05/2020**: I'm not sure whether I can install that package in this environment. I'll try this in the future.

In [None]:
csvgrep -c 22 -m "Instituto Nacional de Salud" PATRIC_genome.csv | cut -f2 -d , | awk '{print $1,$2}' | sort | uniq -c
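If `csvgrep` is not available, a rough equivalent can be sketched with plain `awk`. This assumes that no field contains a quoted comma; real PATRIC exports may break that assumption, in which case a proper CSV parser is the safer choice. The sample file, its three-column layout and the function name `organisms_for_institution` are hypothetical. In the command above the institution is in column 22, whereas in this sample it is in column 3.

```shell
workdir=$(mktemp -d)
# Hypothetical three-column CSV: genome id, genome name, sequencing center.
cat > "$workdir/sample.csv" <<'EOF'
genome_id,genome_name,sequencing_center
1.1,Escherichia coli strain A,Instituto Nacional de Salud
1.2,Escherichia coli strain B,Instituto Nacional de Salud
1.3,Vibrio cholerae strain C,Universidad Nacional Mayor de San Marcos
1.4,Vibrio cholerae strain D,Instituto Nacional de Salud
EOF

# organisms_for_institution FILE COLUMN NAME
# For rows whose COLUMN contains NAME, print the first two words of column 2
# (genus and species) and count how often each pair occurs. Skips the header.
organisms_for_institution() {
  awk -F, -v col="$2" -v name="$3" '
    NR > 1 && index($col, name) { split($2, w, " "); print w[1], w[2] }
  ' "$1" | sort | uniq -c
}

organisms_for_institution "$workdir/sample.csv" 3 "Instituto Nacional de Salud"
```

On this sample, the call reports two Escherichia coli genomes and one Vibrio cholerae genome for the Instituto Nacional de Salud.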