*Please note: This notebook uses open access data*  
*Please note: The CANINE Google login must be authorized on the BRH Profile page*


## Installing the Gen3 SDK client and importing modules




In [1]:
from gen3.file import Gen3File
from gen3.query import Gen3Query
from gen3.auth import Gen3Auth
from gen3.submission import Gen3Submission
from gen3.index import Gen3Index

## Setting up data commons access

In [2]:
endpoint = "https://caninedc.org"
auth = Gen3Auth(endpoint, refresh_file = "/home/jovyan/.gen3/credentials.json")
sub = Gen3Submission(endpoint, auth)
file = Gen3File(endpoint, auth)

## Sample code to pull programs and projects

In [3]:
sub.get_programs()

{'links': ['/v0/submission/Canine']}

If the above returns {'links': []}, skip the next cell.

In [5]:
sub.get_projects("Canine")

{'links': ['/v0/submission/Canine/Korean_DongGyeongi',
  '/v0/submission/Canine/Osteosarcoma',
  '/v0/submission/Canine/Cornell_GWAS',
  '/v0/submission/Canine/Mizzou_Comparative_Resequencing',
  '/v0/submission/Canine/Glioma',
  '/v0/submission/Canine/Bladder_cancer',
  '/v0/submission/Canine/melanoma',
  '/v0/submission/Canine/B_cell_lymphoma',
  '/v0/submission/Canine/Non-Hodgkin_lymphoma',
  '/v0/submission/Canine/NHGRI',
  '/v0/submission/Canine/PMed_trial']}
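
The response above is a dictionary whose `links` entries end in the project code. As a minimal sketch, the codes could be extracted as shown below; `project_codes` is a hypothetical helper (not part of the Gen3 SDK), and the sample dictionary is a shortened copy of the output above so the snippet runs without a connection to the data commons.

```python
# Shortened copy of the get_projects() output shown above (first three entries only).
projects = {
    "links": [
        "/v0/submission/Canine/Korean_DongGyeongi",
        "/v0/submission/Canine/Osteosarcoma",
        "/v0/submission/Canine/Cornell_GWAS",
    ]
}


def project_codes(response):
    """Hypothetical helper: return the last path segment of every link."""
    return [link.rsplit("/", 1)[-1] for link in response.get("links", [])]


codes = project_codes(projects)
print(codes)
```

In the notebook, `project_codes(sub.get_projects("Canine"))` would apply the same idea to the live response, assuming it keeps the shape shown above.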

## Use the Gen3 SDK to download a sample FASTQ file (GUID provided)

In [6]:
!gen3 drs-pull object dg.C78ne/4527012c-3a5f-481d-820c-da7b77a26b48



In [7]:
!gunzip SRR7012463_1.fastq.gz
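
Decompressing the file is convenient but not strictly required for plain-Python processing. The sketch below reads a gzip-compressed FASTQ file directly with the standard library. It assumes the simple four-lines-per-record layout; `read_fastq` is a hypothetical helper, and the tiny demo file with two made-up reads exists only so the snippet is self-contained.

```python
import gzip
import os
import tempfile
from itertools import islice


def read_fastq(path):
    """Hypothetical helper: yield (header, sequence, separator, quality) tuples.

    Assumes each record spans exactly four lines; files ending in .gz are
    opened with gzip, everything else as plain text.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        while True:
            record = [line.rstrip("\n") for line in islice(handle, 4)]
            if len(record) < 4:  # end of file or incomplete trailing record
                break
            yield tuple(record)


# Demo input: two made-up reads written to a temporary .fastq.gz file.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.fastq.gz")
with gzip.open(demo_path, "wt") as handle:
    handle.write("@read1\nACGT\n+\nIIII\n@read2\nGGCC\n+\nFFFF\n")

records = list(read_fastq(demo_path))
print(records[0])
```

Pointing `read_fastq` at `SRR7012463_1.fastq.gz` before the `gunzip` step would iterate over the downloaded reads in the same way.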

## Use bioinfokit to read the FASTQ file and get information such as sequence and base counts

In [9]:
from itertools import islice

from bioinfokit.analys import fastq

# fastq_reader yields one (header_1, sequence, header_2, quality) record at a time
records = fastq.fastq_reader(file='SRR7012463_1.fastq')

# only process the first ten records for demo purposes
for record in islice(records, 10):
    _, sequence, _, quality = record
    # count the occurrences of each base in the read
    base_count = {base: sequence.count(base) for base in 'ACGT'}
    print(sequence, quality, base_count)

CGCGGATCCTGAGAGAAATGGATCAAGAAGAGGAGGAAGAATAATTGTAAA BBBFFFBFFFFFFFIFFFFIIFIIIFIIFFFFIBFFBF7FBFBBFBF7BF# {'A': 22, 'C': 5, 'G': 16, 'T': 8}
CGAGAGGGAACGTCGAGTCAGGGACACAGCAAAGCTCCACAGGCAGGGAGG BBBFFFFBFFFFFFFFFIIIIFFFIFIIIFFFFIIIBFBFIBFIFI##### {'A': 16, 'C': 12, 'G': 20, 'T': 3}
GCGATGTTCTTCAGCCCTGCACGGTACTCCAGTCGCACAGACTCCAACCAC BBBFFFFFFFFFFIIIIIIIIIIIBFFIIIIIFFIIIIIIIIIIIIIIFII {'A': 11, 'C': 20, 'G': 10, 'T': 10}
CTGCTTACCAAAAGTGGCCCACTAGGCACTCGCATTCCACGCCCGGCTCCA BBBFFFFFFFFFFIFFFIIIFIFFFIIIIIIIIIIIIIFFFFFIIIIIBFF {'A': 11, 'C': 21, 'G': 10, 'T': 9}
CCGGGTCAGTGAAAAACGATAAGAGTAGTGGTATTTCACCGGCGGCCCGCA BBBFFFFFFFF<<BBBBFBFFBFBF<BFBFFBBFIF<B<7FF'BF7<BB<B {'A': 14, 'C': 12, 'G': 16, 'T': 9}
CTGGGCTTTAGGCCCCAGAAAGCAGGAGAAAAGGACCAGCGCTGGTGAAAC BBBFFFFFFFFFBFFIBFFBFFFFFIFIBFFIIFFFFFBFFIIIIIIBBFB {'A': 16, 'C': 12, 'G': 17, 'T': 6}
CCGTCCCTCTCGCGCGCGTCACCGACTGCCAGCGACGGCCGGGTATGGGCC BBBFFFFFFFFFFIIIIIIFFIIIIIIIIIIIIFBF<BBBFBF77BFB0BB {'A': 5, 'C': 22, 'G': 17, 'T': 7}
CTCCTGGTCATTCCGAAACCA

The code snippet above prints basic information from the FASTQ file: the sequence, the quality string, and the base counts for each read. For details on the FASTQ file format, see https://support.illumina.com/bulletins/2016/04/fastq-files-explained.html
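
The quality strings printed above encode one score per base. As a minimal sketch, assuming the common Phred+33 encoding (each character's ASCII code minus 33 gives the score), they could be decoded as follows. The helper names are hypothetical, and the sample string is the first quality string from the output above.

```python
def phred_scores(quality, offset=33):
    """Hypothetical helper: decode a quality string, assuming a Phred+offset encoding."""
    return [ord(char) - offset for char in quality]


def mean_quality(quality, offset=33):
    """Hypothetical helper: average Phred score of a read (0.0 for an empty string)."""
    scores = phred_scores(quality, offset)
    return sum(scores) / len(scores) if scores else 0.0


# First quality string from the output above.
quality = "BBBFFFBFFFFFFFIFFFFIIFIIIFIIFFFFIBFFBF7FBFBBFBF7BF#"
scores = phred_scores(quality)
print(scores[:3], scores[-1])
print(round(mean_quality(quality), 2))
```

Under this assumption, `B` decodes to 33, `I` to 40, and the trailing `#` to 2, which matches the usual pattern of quality dropping toward the end of a read.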