# Tutorial 2: Generate raw data packages with metadata

Raw data is either generated in-house by an instrument or, as in this demo, curated from a public source. Here, we downloaded raw RNA-sequencing data as FASTQ files from the Sequence Read Archive (SRA).

The raw sequencing data were then packaged into Quilt packages, one package per sample. Sample-level metadata describing both biological (tumor type, patient age, histology, ...) and technical (sequencer used, library kit, freezing media used for storage, ...) features of each sample were obtained from SRA and attached to the Quilt package housing that sample's raw data.

Quilt workflows and metadata schemas were used to keep the metadata consistent across samples, a key step in maximizing the utility of sample metadata in downstream analysis. No more Tumor vs. tumor vs. tumour!
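
To make the "Tumor vs. tumor vs. tumour" problem concrete, here is a minimal, standard-library sketch of a controlled-vocabulary check. It is only an illustration of the idea behind schema validation, not how Quilt workflows are implemented; the `ALLOWED_TUMOR_SITES` set and the `check_vocabulary` helper are hypothetical names introduced for this example.

```python
# Hypothetical controlled vocabulary; a real schema would list the project's own terms.
ALLOWED_TUMOR_SITES = {"Lung", "Liver", "Endometrium"}


def check_vocabulary(record, field, allowed):
    """Return an error message if record[field] is not an allowed term, else None."""
    value = record.get(field)
    if value in allowed:
        return None
    # Suggest the canonical spelling when only the letter case differs.
    by_lower = {term.lower(): term for term in allowed}
    suggestion = by_lower.get(str(value).lower())
    hint = f" (did you mean {suggestion!r}?)" if suggestion else ""
    return f"{field}={value!r} is not an allowed value{hint}"


print(check_vocabulary({"PrimaryTumorSite": "Lung"}, "PrimaryTumorSite", ALLOWED_TUMOR_SITES))
print(check_vocabulary({"PrimaryTumorSite": "lung"}, "PrimaryTumorSite", ALLOWED_TUMOR_SITES))
```

A check like this rejects inconsistent spellings before they reach a package, which is the role the metadata schema plays at push time.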

# 1. Set Up

In [7]:
import pandas as pd
import quilt3

In [4]:
quilt3.config()

<QuiltConfig at '/Users/laurarichards/Library/Application Support/Quilt/config.yml' {
    "navigator_url": "https://demo.quiltdata.com",
    "default_local_registry": "file:///Users/laurarichards/Library/Application%20Support/Quilt/packages",
    "default_remote_registry": null,
    "default_install_location": null,
    "registryUrl": "https://demo-registry.quiltdata.com",
    "telemetry_disabled": false,
    "s3Proxy": "https://demo-s3-proxy.quiltdata.com",
    "apiGatewayEndpoint": "https://lb4wpoeup8.execute-api.us-east-1.amazonaws.com/prod",
    "binaryApiGatewayEndpoint": null,
    "default_registry_version": 1
}>

# 2. Create raw data packages

Here we will create 1 Quilt package per sample to house the raw sequencing data. In this case, the raw sequencing data is a pair of fastqs from RNA-sequencing.

Using information from the CCLE SRA metadata, we will name each package and attach the sample's metadata to it.


In [5]:
# specify the location where the raw sequencing data lives
# this can be a local path or an S3 path
# for this demo, we downloaded the fastqs to ~/ccle_demo_fastqs/ in Tutorial0
# replace this with your own path if you are using your own data
fastq_dir = "~/ccle_demo_fastqs/"
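
Before packaging, it can help to confirm that both FASTQ files of each pair are actually present. The sketch below assumes a local directory (it does not handle S3 paths) and the `<sample>_<read>.fastq.gz` naming used in this tutorial; `paired_fastq_paths` and `missing_fastqs` are hypothetical helper names, not part of quilt3.

```python
from pathlib import Path


def paired_fastq_paths(fastq_dir, sample_id):
    """Return the expected read 1 / read 2 paths, named <sample>_<read>.fastq.gz."""
    root = Path(fastq_dir).expanduser()  # expand a leading "~" to the home directory
    return [root / f"{sample_id}_{read}.fastq.gz" for read in (1, 2)]


def missing_fastqs(fastq_dir, sample_ids):
    """Map each sample ID to the expected FASTQ file names that are not on disk."""
    missing = {}
    for sample_id in sample_ids:
        absent = [p.name for p in paired_fastq_paths(fastq_dir, sample_id) if not p.is_file()]
        if absent:
            missing[sample_id] = absent
    return missing
```

Once the metadata table is loaded, calling `missing_fastqs(fastq_dir, meta["SampleID"])` would return an empty dictionary when every sample has both files.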

In [9]:
# load sample metadata
meta = pd.read_csv("./demo_data/sample_metadata/demo_ccle_rnaseq_metadata.csv", index_col=0)
meta.head()


| Unnamed: 0 | SampleID | Run | FlowCellID | Age | AssayType | AssemblyName | AvgSpotLen | Bases | BiomaterialProvider | BioProject | ... | PrimaryTumorSite | Proteomics10PlexID | ProteomicsTMTLabel | Purity | SiteOfFinding | SiteSubtype1 | SiteSubtype2 | Subtype | Supplements | TMBNonSynonymous |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | SRR8615253 | SRR8615253 | 20190223_PRJNA523380_SRR8615253 | 70.0 | RNA-Seq | GCA_000001405.13 | 202 | 17711333336 | ATCC:NCI-H2066 | PRJNA523380 | ... | Lung | 14.0 | 128c | 1.0 |  | NS | NS | Small Cell Lung Cancer (SCLC) | .005 mg/ml insulin, .01 mg/ml transferrin, 30n... | 20.366667 |
| 1 | SRR8615472 | SRR8615472 | 20190223_PRJNA523380_SRR8615472 | 57.0 | RNA-Seq | GCA_000001405.13 | 202 | 26673244732 | KCLB:SNU-886 | PRJNA523380 | ... | Liver |  |  | 1.0 |  | NS | NS | Hepatocellular Carcinoma | 25mM Hepes and 25mM NaHCo3 | 7.766667 |
| 2 | SRR8615479 | SRR8615479 | 20190223_PRJNA523380_SRR8615479 | 34.0 | RNA-Seq | GCA_000001405.13 | 202 | 15542995848 | DSMZ:L-1236 | PRJNA523380 | ... | Haematopoietic_And_Lymphoid_Tissue |  |  | 0.99 |  | NS | NS | B-cell, Hodgkins |  | 19.166667 |
| 3 | SRR8615493 | SRR8615493 | 20190223_PRJNA523380_SRR8615493 | 39.0 | RNA-Seq | GCA_000001405.13 | 202 | 18186692664 | Unknown | PRJNA523380 | ... | Endometrium | 42.0 | 127n | 0.92 |  | NS | NS | Endometrial Adenocarcinoma |  | 91.7 |
| 4 | SRR8615509 | SRR8615509 | 20190223_PRJNA523380_SRR8615509 | 27.0 | RNA-Seq | GCA_000001405.13 | 202 | 20421075264 | DSMZ:OCI-LY-19 | PRJNA523380 | ... | Haematopoietic_And_Lymphoid_Tissue |  |  | 1.0 |  | NS | NS | Diffuse Large B-cell Lymphoma (DLBCL) |  | 6.033333 |


In [11]:
s = "SRR8615472"
pname = "ccle/" + list(meta.loc[meta["Run"] == s, "FlowCellID"])[0]
pname

'ccle/20190223_PRJNA523380_SRR8615472'
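
The naming step above can be wrapped in a small helper that fails early on malformed names. This is a sketch under one assumption: that a package name has the form `<namespace>/<name>` with only letters, digits, underscores and hyphens in each part. The pattern and the `package_name` helper are written for this example and are not taken from quilt3.

```python
import re

# Assumed format "<namespace>/<name>": letters, digits, underscores and hyphens only.
PACKAGE_NAME_PATTERN = re.compile(r"[\w-]+/[\w-]+")


def package_name(namespace, flow_cell_id):
    """Build '<namespace>/<flow_cell_id>' and raise ValueError if it looks malformed."""
    name = f"{namespace}/{flow_cell_id}"
    if not PACKAGE_NAME_PATTERN.fullmatch(name):
        raise ValueError(f"unexpected package name: {name!r}")
    return name


print(package_name("ccle", "20190223_PRJNA523380_SRR8615472"))
```

Checking the name before staging files avoids uploading data only to have the push rejected because of a stray space or slash in a metadata field.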

In [None]:
# for each sample, create a quilt package & push up fastqs

for s in meta["SampleID"]:
    
    print(">>>>> " + s)
    
    # define package name
    pname = "ccle/" + list(meta.loc[meta["SampleID"] == s, "FlowCellID"])[0]
    
    # create quilt package & stage fastqs
    p = quilt3.Package()
    for i in [1, 2]:
        fq = s + "_" + str(i) + ".fastq.gz"
        # fastq_dir already ends with "/"; fastqs are assumed to sit directly inside it
        p.set(fq, fastq_dir + fq)
    
    # set metadata
    # TIP: check out metadata with p.meta
    # select this sample's row and drop columns with missing values
    meta_s = meta.loc[meta["SampleID"] == s].dropna(axis=1)
    p.set_meta(meta_s.to_dict("records")[0])
    
    # push package
    # use the sra-raw-data workflow to check that the metadata conforms to the schema
    p.push(pname,
           registry='s3://quilt-example-bucket',
           message='upload raw data fastq from SRA',
           workflow='sra-raw-data'
          )
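
The "set metadata" step above drops missing values on the DataFrame side. As an alternative sketch, the same cleanup can be done on the plain dictionary just before it is attached to the package. The `drop_missing` helper is a hypothetical name introduced here, and the example record uses a few values from the table above for illustration.

```python
import math


def drop_missing(record):
    """Return a copy of a metadata dict without None or NaN values."""
    cleaned = {}
    for key, value in record.items():
        if value is None:
            continue
        # NaN is a float that is not equal to itself; math.isnan detects it.
        if isinstance(value, float) and math.isnan(value):
            continue
        cleaned[key] = value
    return cleaned


record = {"SampleID": "SRR8615472", "Age": 57.0, "Proteomics10PlexID": float("nan")}
print(drop_missing(record))
```

Removing empty fields keeps placeholder values out of the package metadata, so only fields that carry information are attached to the package.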