## Get library info for the TARGET NB RNA-seq dataset from GDC

GDC query: https://portal.gdc.cancer.gov/repository?facetTab=files&filters=%7B%22op%22%3A%22and%22%2C%22content%22%3A%5B%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22cases.project.project_id%22%2C%22value%22%3A%5B%22TARGET-NBL%22%5D%7D%7D%2C%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22files.analysis.workflow_type%22%2C%22value%22%3A%5B%22STAR%202-Pass%20Genome%22%5D%7D%7D%2C%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22files.data_format%22%2C%22value%22%3A%5B%22bam%22%5D%7D%7D%2C%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22files.experimental_strategy%22%2C%22value%22%3A%5B%22RNA-Seq%22%5D%7D%7D%5D%7D&searchTableTab=files
 
![](./gdc-query.png)
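The `filters` parameter in the portal URL above is percent-encoded JSON. As a rough sketch, the same four filters can be rebuilt in Python and encoded or decoded with the standard library. The helper `in_filter` and the variable names are made up here for illustration.

```python
import json
from urllib.parse import quote, unquote


def in_filter(field, values):
    """Build one 'in' clause (hypothetical helper, not part of any GDC client)."""
    return {"op": "in", "content": {"field": field, "value": values}}


# The four filters visible in the portal URL above.
portal_filters = {
    "op": "and",
    "content": [
        in_filter("cases.project.project_id", ["TARGET-NBL"]),
        in_filter("files.analysis.workflow_type", ["STAR 2-Pass Genome"]),
        in_filter("files.data_format", ["bam"]),
        in_filter("files.experimental_strategy", ["RNA-Seq"]),
    ],
}

# Compact JSON, then percent-encode it the way it appears in the URL.
encoded = quote(json.dumps(portal_filters, separators=(",", ":")), safe="")

# Going the other way recovers a readable filter from a URL parameter.
decoded = json.loads(unquote(encoded))
```

Decoding is the handier direction in practice: pasting the `filters=` value from a portal URL into `unquote` shows exactly which fields and values the query used.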

Download the manifest as `gdc_manifest.2021-04-06.txt`.

## quick summary 
1. 161 BAM files
2. Most of them have 2 read groups (??)
   - one labeled `Other`
   - one labeled `Poly-T Enrichment`
3. 7 of them have only one read group, which is `Poly-T Enrichment`:
   ```
   17a78aac-5c3d-4c19-9c25-5ba6e6c33bc7.rna_seq.genomic.gdc_realn.bam
   2e437a60-c6df-407d-9f8f-c8b418434821.rna_seq.genomic.gdc_realn.bam
   371f6839-724c-4b5b-a5a7-5e453bdd1a8e.rna_seq.genomic.gdc_realn.bam
   8761d55f-706a-4180-a535-42b64bef36ec.rna_seq.genomic.gdc_realn.bam
   8b397736-190e-410f-9a42-27984ec919ba.rna_seq.genomic.gdc_realn.bam
   ad601a5c-0b18-4535-a922-37b67e7f989c.rna_seq.genomic.gdc_realn.bam
   e4ab0d64-9d7d-4a3c-b904-227ea0487f57.rna_seq.genomic.gdc_realn.bam
   ```
4. 4 of them have 3 read groups
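Counts like these can be tallied from a per-read-group table with columns `file_name`, `read_group_id`, and `library_selection` (the shape of the table built further down). The sketch below uses a small toy frame with made-up file names and read-group IDs, not the real data.

```python
import pandas as pd

# Toy stand-in for the per-read-group table; all values are hypothetical.
toy = pd.DataFrame(
    [
        ("a.bam", "rg1", "Other"),
        ("a.bam", "rg2", "Poly-T Enrichment"),
        ("b.bam", "rg3", "Poly-T Enrichment"),
        ("c.bam", "rg4", "Other"),
        ("c.bam", "rg5", "Poly-T Enrichment"),
        ("c.bam", "rg6", "Poly-T Enrichment"),
    ],
    columns=["file_name", "read_group_id", "library_selection"],
)

# Number of distinct read groups per BAM file.
rg_per_file = toy.groupby("file_name")["read_group_id"].nunique()

# How many files have 1, 2, 3, ... read groups.
files_by_rg_count = rg_per_file.value_counts().sort_index()

# Files with a single read group, and the selection method(s) they carry.
single_rg_files = rg_per_file[rg_per_file == 1].index
single_rg_selection = toy.loc[
    toy["file_name"].isin(single_rg_files), "library_selection"
].unique()
```

Swapping `toy` for the real table should give the per-file counts directly, which makes the summary easy to re-check if the manifest changes.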

In [1]:
import requests
import json
import pandas as pd

In [2]:
gdc_url = 'https://api.gdc.cancer.gov/files'
headers = {'Content-Type': 'application/json'}

fields = [
    'file_name',
    'analysis.metadata.read_groups.read_group_id',
    'analysis.metadata.read_groups.library_selection',
    'analysis.metadata.read_groups.library_strand'
]
fields = ','.join(fields)

manifest = pd.read_csv("gdc_manifest.2021-04-06.txt", sep='\t')

In [3]:
## API request body 
payload = {
        'filters':{
            'op':'in',  # 'in' matches any of the listed file IDs
            'content':{
                'field':'files.file_id',
                'value':manifest.id.tolist()}},
        'format':'json',
        'fields':fields,
        'size':5000 # larger than the number of files, so all hits come back in one page
}
payload = json.dumps(payload)
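The `size` value is only a guess that it exceeds the number of matching files. Once the response is parsed, a small check along these lines can confirm nothing was cut off. It assumes the response carries a `data.pagination.total` field, as GDC list responses generally do; the function name and the mock response are invented for illustration.

```python
def check_all_hits_returned(response_json):
    """Raise if the API reports more matching files than it returned.

    Assumes the parsed response has data.hits and data.pagination.total.
    """
    returned = len(response_json["data"]["hits"])
    total = response_json["data"]["pagination"]["total"]
    if total > returned:
        raise RuntimeError(
            f"only {returned} of {total} hits returned; increase 'size'"
        )
    return returned


# Hypothetical, heavily trimmed response used only to exercise the check.
mock_response = {
    "data": {
        "hits": [{"file_name": "a.bam"}, {"file_name": "b.bam"}],
        "pagination": {"total": 2, "size": 5000},
    }
}
n_hits = check_all_hits_returned(mock_response)
```

In the notebook this would be called on the parsed JSON right after the request, before looping over the hits.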

In [4]:
gdc_response = requests.post(gdc_url, headers=headers, data=payload)
gdc_response = gdc_response.json()

lib_info = []
for hit in gdc_response['data']['hits']:
    for rg in hit['analysis']['metadata']['read_groups']:
        lib_info.append([
            hit['file_name'],
            rg['read_group_id'],
            rg['library_selection'],
            # library_strand may be missing for a read group
            rg.get('library_strand', 'N/A')
        ])
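As an alternative to the nested loop, `pandas.json_normalize` can flatten the read groups in one call. This is only a sketch on made-up hits that mimic the response shape; the file names, IDs, and the `Unstranded` value are placeholders.

```python
import pandas as pd

# Hypothetical hits mimicking the response shape; rg2 and rg3 lack library_strand.
toy_hits = [
    {
        "file_name": "a.bam",
        "analysis": {"metadata": {"read_groups": [
            {"read_group_id": "rg1", "library_selection": "Other",
             "library_strand": "Unstranded"},
            {"read_group_id": "rg2", "library_selection": "Poly-T Enrichment"},
        ]}},
    },
    {
        "file_name": "b.bam",
        "analysis": {"metadata": {"read_groups": [
            {"read_group_id": "rg3", "library_selection": "Poly-T Enrichment"},
        ]}},
    },
]

columns = ["file_name", "read_group_id", "library_selection", "library_strand"]

flat = (
    pd.json_normalize(
        toy_hits,
        record_path=["analysis", "metadata", "read_groups"],  # one row per read group
        meta=["file_name"],  # carry the parent file name onto each row
    )
    .reindex(columns=columns)  # fixed column order; adds library_strand if absent
    .fillna({"library_strand": "N/A"})
)
```

The `reindex` step matters when no read group reports `library_strand` at all, since the column would otherwise be missing from the result.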


In [5]:
df = pd.DataFrame(lib_info).drop_duplicates()
df.columns = ['file_name','read_group_id','library_selection','library_strand']
df

| | file_name | read_group_id | library_selection | library_strand |
|---|---|---|---|---|
| 0 | 537216c6-4d20-42f5-b191-7e6c10cafbe7.rna_seq.g... | 75fc4dec-dc76-4af3-8979-c2d06a52950a | Poly-T Enrichment | |
| 1 | 537216c6-4d20-42f5-b191-7e6c10cafbe7.rna_seq.g... | 8c059ba1-e1f4-4d41-b20e-585900d18bf1 | Other | |
| 2 | 80304b6d-45f9-4210-9a2a-dee958cf32c4.rna_seq.g... | 5f9fead6-3566-43c7-9dae-230c09b6e621 | Other | |
| 3 | 80304b6d-45f9-4210-9a2a-dee958cf32c4.rna_seq.g... | e3c85f50-2bfd-47fa-ac98-26fe88d46541 | Poly-T Enrichment | |
| 4 | e9bec6fd-6a37-4f07-b396-2c960be6e9ef.rna_seq.g... | 4559bd4f-6433-44d1-91d8-dec23a0ffe33 | Other | |
| ... | ... | ... | ... | ... |
| 314 | ff6ebcd0-18b3-4334-aee2-187f41068eb9.rna_seq.g... | 86a58970-81c4-46a8-8d46-d4bbd84fa517 | Poly-T Enrichment | |
| 315 | 89d1d838-3d20-4983-b4ef-996f96ccf88a.rna_seq.g... | af6a764c-d997-4a17-bfd9-0fa1486ce22d | Other | |
| 316 | 89d1d838-3d20-4983-b4ef-996f96ccf88a.rna_seq.g... | 638c2c0a-4f88-4a25-8191-2d857edb353b | Poly-T Enrichment | |
| 317 | 39b8cc57-059d-433d-a22a-dc450104c78f.rna_seq.g... | 09c9dc73-50ad-435c-a80a-25c397af7e91 | Poly-T Enrichment | |


In [6]:
df.to_csv('./target-nb-rnaseq.lib-info.csv', index=False)
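One thing to watch when reading this CSV back: the `"N/A"` placeholder used for a missing `library_strand` is among the strings that `pandas.read_csv` treats as missing by default, so it comes back as `NaN` unless `keep_default_na=False` is passed. A minimal sketch on an in-memory toy table (values are made up):

```python
import io

import pandas as pd

# Toy table with the same placeholder string; values are hypothetical.
toy = pd.DataFrame({"file_name": ["a.bam"], "library_strand": ["N/A"]})
csv_text = toy.to_csv(index=False)

# Default parsing turns the literal "N/A" into a missing value.
default_read = pd.read_csv(io.StringIO(csv_text))

# Disabling the default NA strings keeps the placeholder as text.
literal_read = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)
```

Either behavior can be fine; the point is to choose deliberately so that downstream filters on `library_strand` match what is actually in the column.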