In [1]:
!git clone -b basic-search-feature https://github.com/bscrow/pysradb

Cloning into 'pysradb'...


In [None]:
!cd pysradb && pip install -U .

In [None]:
!cd pysradb && git log -n 5 

# pysradb search
##### The pysradb search module supports querying the Sequence Read Archive (SRA) and the European Nucleotide Archive (ENA) databases for sequencing data. The module also includes several built-in flags that can be used to fine-tune a search query.

In [1]:
%%html
<style>
th {font-size: 16px;}
td {font-size: 14px;}
td:first-child {font-size: 15px; font-weight: 500;}
</style>

### Terminal flags for the pysradb search module:

|Flags | Explanation|
|----------|------------|
| -h, --help | Displays the help message |
| --saveto | Saves the result in the file specified by the user.<br>Supported file types: txt, tsv, csv |
| --db   | Selects the database to query (SRA, ENA, or both SRA and GEO DataSets). Default database is SRA. Accepted inputs: sra, ena, geo|
| -v, --verbosity  | This determines how much detail is retrieved and shown in the search result: <br>0: run_accession only <br>1: run_accession and experiment_description only <br>2: <u>(default)</u> study_accession, experiment_accession, experiment_title, description, tax_id, scientific_name, library_strategy, <br>library_source, library_selection, sample_accession, sample_title, instrument_model, run_accession, read_count, base_count <br>3: Everything in verbosity level 2, followed by all other retrievable information from the database|
| -m, --max | Maximum number of returned entries. Default number is 20.<br>Note: If the maximum is large, querying the SRA and GEO DataSets databases takes significantly longer because of API limits|
| -q, --query | The main query string. <br><u>Note: if this flag is not used, at least one of the following flags must be supplied</u>: |
| --accession | A relevant study / experiment / sample / run accession number|
| --organism | Scientific name of the sample organism |
| --layout | Library layout. Accepted inputs: single, paired|
| --mbases | Size of the sample rounded to the nearest megabase|
| --publication-date | The publication date of the run in the format dd-mm-yyyy. If a date range is desired, <br>enter the start date, followed by end date, separated by a colon ':' in the format dd-mm-yyyy:dd-mm-yyyy <br>Example: 01-01-2010:31-12-2010|
| --platform | Sequencing platform used for the run. Possible inputs: illumina, ion torrent, oxford nanopore |
| --selection | Library selection. Possible inputs: cdna, chip, dnase, pcr, polya |
| --source | Library source. Possible inputs: genomic, metagenomic, transcriptomic |
| --strategy | Library preparation strategy. Possible inputs: wgs, amplicon, rna seq |
| --title | Title of the experiment associated with the run |
| --geo-query | The main query string to be sent to GEO DataSets |
| --geo-dataset-type | Dataset type. Possible inputs: expression profiling by array, expression profiling by high throughput sequencing, non coding rna profiling by high throughput sequencing |
| --geo-entry-type | Entry type. Accepted inputs: gds, gpl, gse, gsm |

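The verbosity levels in the table above can be summarised as a small lookup. This is only an illustration of the documented levels, not pysradb's own code: `COLUMNS_BY_VERBOSITY` and `columns_for` are hypothetical names, and the column names are copied from the table. Actual column names can differ between databases, as the SRA output later in this notebook shows.

```python
# Column names per verbosity level, copied from the flags table.
# Level 3 adds "all other retrievable information", which is not enumerated here.
LEVEL_2_COLUMNS = [
    "study_accession", "experiment_accession", "experiment_title",
    "description", "tax_id", "scientific_name", "library_strategy",
    "library_source", "library_selection", "sample_accession",
    "sample_title", "instrument_model", "run_accession",
    "read_count", "base_count",
]

COLUMNS_BY_VERBOSITY = {
    0: ["run_accession"],
    1: ["run_accession", "experiment_description"],
    2: LEVEL_2_COLUMNS,
}


def columns_for(verbosity):
    """Return the documented column names for verbosity 0 to 2 (hypothetical helper)."""
    if verbosity not in COLUMNS_BY_VERBOSITY:
        raise ValueError(f"no fixed column list documented for verbosity {verbosity}")
    return list(COLUMNS_BY_VERBOSITY[verbosity])


print(columns_for(0))
print(len(columns_for(2)))
```

Level 3 is deliberately left out of the lookup because the table does not list its extra fields.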
### Using pysradb search in Python:

##### pysradb search organises each search query as an instance of the SraSearch, EnaSearch or GeoSearch class. These classes take the following parameters in their constructors:



<b>`SraSearch(verbosity=2, return_max=20, query=None, accession=None, organism=None, layout=None, mbases=None, publication_date=None, platform=None, selection=None, source=None, strategy=None, title=None, suppress_validation=False)`</b>


<b>`EnaSearch(verbosity=2, return_max=20, query=None, accession=None, organism=None, layout=None, mbases=None, publication_date=None, platform=None, selection=None, source=None, strategy=None, title=None, suppress_validation=False)`</b>

<b>`GeoSearch(verbosity=2, return_max=20, query=None, accession=None, organism=None, layout=None, mbases=None, publication_date=None, platform=None, selection=None, source=None, strategy=None, title=None, geo_query=None, geo_dataset_type=None, geo_entry_type=None, suppress_validation=False)`</b>

| Parameters | Explanations|
|----------|------------|
| verbosity | This determines how much detail is retrieved and shown in the search result (default=2). Same as -v / --verbosity on terminal |
| return_max | Maximum number of returned entries (default=20). Same as -m / --max on terminal |
| suppress_validation | Defaults to False. If this is set to True, the user input format checks are skipped. Setting this to True may cause the program to behave in unexpected ways, but allows the user to run queries that do not pass the format check.|

Other parameters match the command line flags of the same name.
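Because the remaining parameters mirror the terminal flags, `publication_date` takes the same dd-mm-yyyy format (or a colon-separated range) described for `--publication-date`. The sketch below shows one way such a string could be built from `datetime.date` objects. `format_publication_date` is a hypothetical helper, not part of pysradb.

```python
from datetime import date


def format_publication_date(start, end=None):
    """Format one date, or a start:end range, as dd-mm-yyyy (hypothetical helper)."""
    fmt = "%d-%m-%Y"
    if end is None:
        return start.strftime(fmt)
    if end < start:
        raise ValueError("end date must not be earlier than start date")
    return f"{start.strftime(fmt)}:{end.strftime(fmt)}"


# Keyword arguments named after the constructor parameters listed above.
search_kwargs = {
    "verbosity": 2,
    "return_max": 5,
    "query": "ribosome profiling",
    "publication_date": format_publication_date(date(2010, 1, 1), date(2010, 12, 31)),
}
print(search_kwargs["publication_date"])  # 01-01-2010:31-12-2010
```

With pysradb installed and network access available, the dictionary could then be unpacked into a constructor, for example `SraSearch(**search_kwargs)`.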
<br>
<br>



<br>

##### To query the SRA database for ribosome profiling, expecting an output of verbosity level 2, and returning at most 5 entries, we can do the following:

In [2]:
from pysradb.search import SraSearch

instance = SraSearch(2, 5, query="ribosome profiling")
instance.search()
df = instance.get_df()
print(df)

100%|██████████| 5/5 [00:01<00:00,  3.01it/s]

  experiment_accession                                 experiment_title  \
0           SRX8683210  GSM4661079: RNAseq_12hr_Rep2; Zea mays; RNA-Seq   
1           SRX8683209  GSM4661078: RNAseq_12hr_Rep1; Zea mays; RNA-Seq   
2           SRX8683208   GSM4661077: RNAseq_6hr_Rep2; Zea mays; RNA-Seq   
3           SRX8683207   GSM4661076: RNAseq_6hr_Rep1; Zea mays; RNA-Seq   
4           SRX8683206   GSM4661075: RNAseq_0hr_Rep2; Zea mays; RNA-Seq   

  sample_taxon_id sample_scientific_name experiment_library_strategy  \
0            4577               Zea mays                     RNA-Seq   
1            4577               Zea mays                     RNA-Seq   
2            4577               Zea mays                     RNA-Seq   
3            4577               Zea mays                     RNA-Seq   
4            4577               Zea mays                     RNA-Seq   

  experiment_library_source experiment_library_selection sample_accession  \
0            TRANSCRIPTOMIC            




<br>

### Quickstart

##### To query ENA instead, replace the SraSearch class with the EnaSearch class:

In [3]:
from pysradb.search import EnaSearch

instance = EnaSearch(2, 5, "ribosome profiling")
instance.search()
df = instance.get_df()
print(df)

  study_accession experiment_accession  \
0      PRJEB12126           ERX1264364   
1      PRJEB12126           ERX1264365   
2      PRJEB12126           ERX1264366   
3      PRJEB12126           ERX1264367   
4      PRJEB12126           ERX1264368   

                                    experiment_title  \
0  Illumina HiSeq 2000 sequencing; Analysis of co...   
1  Illumina HiSeq 2000 sequencing; Analysis of co...   
2  Illumina HiSeq 2000 sequencing; Analysis of co...   
3  Illumina HiSeq 2000 sequencing; Analysis of co...   
4  Illumina HiSeq 2000 sequencing; Analysis of co...   

                                         description tax_id scientific_name  \
0  Illumina HiSeq 2000 sequencing; Analysis of co...  10090    Mus musculus   
1  Illumina HiSeq 2000 sequencing; Analysis of co...  10090    Mus musculus   
2  Illumina HiSeq 2000 sequencing; Analysis of co...  10090    Mus musculus   
3  Illumina HiSeq 2000 sequencing; Analysis of co...  10090    Mus musculus   
4  Illumina HiS

##### To query GEO DataSets instead and retrieve the metadata of linked entries in SRA:

In [5]:
from pysradb.search import GeoSearch

instance = GeoSearch(2, 5, geo_query="ribosome profiling")
instance.search()
df = instance.get_df()
print(df)

100%|██████████| 5/5 [00:01<00:00,  3.29it/s]

  experiment_accession                                   experiment_title  \
0           SRX1756760  GSM2150571: riboseq_shoot_2; Arabidopsis thali...   
1           SRX1764038  GSM2152853: mRNA-associated with polysome in h...   
2           SRX1764037  GSM2152852: mRNA-associated with monosome in  ...   
3           SRX1764036  GSM2152851: mRNA-assocatied with polyosme in t...   
4           SRX1764035  GSM2152850: mRNA-assocaited with monosome in t...   

  sample_taxon_id sample_scientific_name experiment_library_strategy  \
0            3702   Arabidopsis thaliana                       OTHER   
1            9606           Homo sapiens                     RNA-Seq   
2            9606           Homo sapiens                     RNA-Seq   
3            9606           Homo sapiens                     RNA-Seq   
4            9606           Homo sapiens                     RNA-Seq   

  experiment_library_source experiment_library_selection sample_accession  \
0            TRANSCRIPTOMIC




<br>

### Error Handling

##### Unless suppress_validation is set to True, an invalid entry in a query field raises IncorrectFieldException. For fields with a fixed vocabulary, such as "selection", the error message lists the complete set of acceptable inputs:

In [12]:
# 1. Invalid query entered for "selection"
SraSearch(selection="Mudkip")

IncorrectFieldException: Incorrect selection: Mudkip
--selection must be one of the following: 
5-methylcytidine antibody, CAGE, ChIP, ChIP-Seq, DNase, HMPR, Hybrid Selection,  
Inverse rRNA, Inverse rRNA selection, MBD2 protein methyl-CpG binding domain, 
MDA, MF, MNase, MSLL, Oligo-dT, PCR, PolyA, RACE, RANDOM, RANDOM PCR, RT-PCR,  
Reduced Representation, Restriction Digest, cDNA, cDNA_oligo_dT, cDNA_randomPriming 
other, padlock probes capture method, repeat fractionation, size fractionation, 
unspecified



In [13]:
# 2. Ambiguous query entered for "source":
EnaSearch(source="metagenomic viral rna ")

IncorrectFieldException: Multiple potential matches have been identified for metagenomic viral rna :
['METAGENOMIC', 'VIRAL RNA']
Please check your input.
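The two failure modes above (no match, and more than one match) can be illustrated with a small stand-alone sketch. This is not pysradb's validation code: `FieldError`, `match_field` and the word-boundary matching rule are assumptions made for illustration, and `ACCEPTED_SOURCES` contains only the library-source values that appear in this notebook.

```python
import re

# A subset of library-source values that appear in this notebook; illustrative only.
ACCEPTED_SOURCES = ["GENOMIC", "METAGENOMIC", "TRANSCRIPTOMIC", "VIRAL RNA"]


class FieldError(ValueError):
    """Hypothetical stand-in for pysradb's IncorrectFieldException."""


def match_field(user_input, accepted):
    """Return the single accepted value found in user_input, or raise FieldError."""
    # An accepted value matches if it occurs in the input as whole words,
    # ignoring case (an assumed rule, chosen to reproduce the errors above).
    matches = [
        value for value in accepted
        if re.search(r"\b" + re.escape(value) + r"\b", user_input, re.IGNORECASE)
    ]
    if not matches:
        raise FieldError(f"Incorrect input: {user_input!r}; expected one of {accepted}")
    if len(matches) > 1:
        raise FieldError(f"Multiple potential matches for {user_input!r}: {matches}")
    return matches[0]


print(match_field("genomic", ACCEPTED_SOURCES))

for bad_input in ("Mudkip", "metagenomic viral rna "):
    try:
        match_field(bad_input, ACCEPTED_SOURCES)
    except FieldError as err:
        print(err)
```

When calling pysradb itself, the same pattern applies: wrap the constructor call in `try` / `except` and report the exception message, which already names the acceptable values.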



<br>

### Usage Examples:

##### 1. Checking the help message on terminal:

In [6]:
!pysradb search -h

usage: pysradb search [-h] [--saveto SAVETO] [--db {ena,geo,sra}]
                      [-v {0,1,2,3}] [-m MAX] [-q QUERY [QUERY ...]]
                      [--accession ACCESSION]
                      [--organism ORGANISM [ORGANISM ...]]
                      [--layout {SINGLE,PAIRED}] [--mbases MBASES]
                      [--publication-date PUBLICATION_DATE]
                      [--platform PLATFORM [PLATFORM ...]]
                      [--selection SELECTION [SELECTION ...]]
                      [--source SOURCE [SOURCE ...]]
                      [--strategy STRATEGY [STRATEGY ...]]
                      [--title TITLE [TITLE ...]]
                      [--geo-query GEO_QUERY [GEO_QUERY ...]]
                      [--geo-dataset-type GEO_DATASET_TYPE [GEO_DATASET_TYPE ...]]
                      [--geo-entry-type GEO_ENTRY_TYPE [GEO_ENTRY_TYPE ...]]

optional arguments:
  -h, --help            show this help message and exit
  --saveto SAVETO       Save search result datafram

<br>

##### 2. Searching ENA from the terminal for 5 Illumina sequencing entries related to the covid-19 pandemic:

In [7]:
!pysradb search -q covid-19 --platform illumina --db ena -m 5

study_accession	experiment_accession	experiment_title	description	tax_id	scientific_name	library_strategy	library_source	library_selection	sample_accession	sample_title	instrument_model	run_accession	read_count	base_count

PRJNA612578  SRX7918726                                                                            Illumina MiSeq sequencing; RNA-seq of COVID-19 positive human  from San Diego county                                                                            Illumina MiSeq sequencing; RNA-seq of COVID-19 positive human  from San Diego county  2697049  Severe acute respiratory syndrome coronavirus 2  RNA-Seq           VIRAL RNA     PCR  SAMN14380341                                     Pathogen: clinical or host-associated sample from Severe acute respiratory syndrome coronavirus 2  Illumina MiSeq  SRR11314339   517281   156068315
 PRJNA612578  SRX7949318                                                                                     MinION sequencing; RNA-seq of C

<br>

##### 3. Searching for sequences related to the covid-19 pandemic on ENA, using the terminal, and saving the results in a formatted text file:

In [8]:
!pysradb search -q covid-19 --db ena --saveto query.txt

<br>

##### 4. Searching for Illumina sequences related to the covid-19 pandemic on ENA, within Python (outputs a pandas DataFrame):

In [9]:
from pysradb.search import EnaSearch

instance = EnaSearch(2, 20, query="covid-19", platform="illumina")
instance.search()
df = instance.get_df()
print(df)

   study_accession experiment_accession  \
0      PRJNA612578           SRX7918726   
1      PRJNA612578           SRX7949318   
2      PRJNA613951           SRX7966547   
3      PRJNA624231           SRX8089022   
4      PRJNA625669           SRX8128264   
5      PRJNA628125           SRX8176770   
6      PRJNA628125           SRX8176769   
7      PRJNA628125           SRX8176768   
8      PRJNA628125           SRX8176767   
9      PRJNA628125           SRX8176766   
10     PRJNA628125           SRX8176765   
11     PRJNA628125           SRX8176764   
12     PRJNA628125           SRX8176763   
13     PRJNA628125           SRX8176762   
14     PRJNA628125           SRX8176761   
15     PRJNA628125           SRX8176760   
16     PRJNA628125           SRX8176759   
17     PRJNA628125           SRX8176758   
18     PRJNA628125           SRX8176757   
19     PRJNA632678           SRX8341156   

                                     experiment_title  \
0   Illumina MiSeq sequencing; RNA-seq 

<br>

##### 5. More complex example:

In [10]:
from pysradb.search import EnaSearch

instance = EnaSearch(
    3, 100000, query="Escherichia coli", accession="SRS6898940", organism="Escherichia coli", 
    layout="Paired", mbases=10, publication_date="01-01-2019:31-12-2019", platform="Illumina", 
    selection="random", source="Genomic", strategy="WGS"
)
instance.search()
df = instance.get_df()
print(df)

      study_accession experiment_accession  \
0          PRJDA51849            DRX000357   
1              PRJDB5            DRX001133   
2           PRJDB2622            DRX001465   
3           PRJDB2622            DRX001466   
4           PRJDB2622            DRX001467   
...               ...                  ...   
37904     PRJNA230969           SRX6737267   
37905     PRJNA230969           SRX6737268   
37906     PRJNA368991           SRX6737270   
37907     PRJNA230969           SRX6737273   
37908     PRJNA318589           SRX6737281   

                                        experiment_title  \
0      Illumina Genome Analyzer sequencing; Deep sequ...   
1      AB SOLiD System 3.0 sequencing; Escherichia co...   
2      Illumina Genome Analyzer IIx sequencing; Reseq...   
3      Illumina Genome Analyzer IIx sequencing; Reseq...   
4      Illumina Genome Analyzer IIx sequencing; Reseq...   
...                                                  ...   
37904  Illumina MiSeq paire

<br>

##### 6. Corresponding terminal command example, with max set to 20 (note that this command uses --mbases 100, whereas the Python example above uses mbases=10):

In [11]:
!pysradb search --db ena -m 20 -v 3 -q Escherichia coli --accession SRS6898940 --organism Escherichia coli --layout Paired --mbases 100 --publication-date 01-01-2019:31-12-2019 --platform illumina --selection random --source Genomic --strategy wgs

    

study_accession	experiment_accession	experiment_title	description	tax_id	scientific_name	library_strategy	library_source	library_selection	sample_accession	sample_title	instrument_model	run_accession	read_count	base_count	accession	altitude	assembly_quality	assembly_software	binning_software	bio_material	broker_name	cell_line	cell_type	center_name	checklist	collected_by	collection_date	completeness_score	contamination_score	country	cram_index_aspera	cram_index_ftp	cram_index_galaxy	cultivar	culture_collection	depth	dev_stage	ecotype	elevation	environment_biome	environment_feature	environment_material	environmental_package	environmental_sample	experiment_alias	experimental_factor	fastq_aspera	fastq_bytes	fastq_ftp	fastq_galaxy	fastq_md5	first_created	first_public	germline	host	host_body_site	host_genotype	host_gravidity	host_growth_conditions	host_phenotype	host_sex	host_status	host_tax_id	identified_by	instrument_platform	investigation_type	isolate	isolation_source	last_updated	lat	lib
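The correspondence between the Python keyword arguments in example 5 and the terminal flags in example 6 can be sketched as a simple mapping. `FLAG_FOR_KEYWORD` and `build_search_command` are hypothetical names introduced here, and only flags listed in the tables of this notebook are used. The helper passes each value as a single argument, whereas example 6 passes `Escherichia coli` as two separate words. That pysradb treats both forms the same way is an assumption.

```python
import shlex

# Mapping from constructor keyword names to terminal flags, both taken from
# the tables in this notebook.
FLAG_FOR_KEYWORD = {
    "verbosity": "-v",
    "return_max": "-m",
    "query": "-q",
    "accession": "--accession",
    "organism": "--organism",
    "layout": "--layout",
    "mbases": "--mbases",
    "publication_date": "--publication-date",
    "platform": "--platform",
    "selection": "--selection",
    "source": "--source",
    "strategy": "--strategy",
    "title": "--title",
}


def build_search_command(db, **keywords):
    """Build a `pysradb search` argument list from keyword arguments (hypothetical helper)."""
    args = ["pysradb", "search", "--db", db]
    for name, value in keywords.items():
        if value is None:
            continue  # unset parameters have no corresponding flag
        args += [FLAG_FOR_KEYWORD[name], str(value)]
    return args


command = build_search_command(
    "ena", return_max=20, verbosity=3, query="Escherichia coli", strategy="wgs"
)
print(shlex.join(command))
```

The resulting list can be handed to `subprocess.run(command)` without going through a shell, and `shlex.join` is used only to display it as a copyable command line.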