In [None]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import os, sys

# 2018-10-12 FASTQ and map
Here I want to test whether it is feasible to take the raw FASTQ files, map them, and run the available feature-counting tools on the result, so that we get a standardized, uniform set of files to use in subsequent analyses.

In [None]:
# directories
tissue_ai_rootdir = ".."
datadir = os.path.join(tissue_ai_rootdir, "data")
md_fname = os.path.join(datadir, "metadata.txt")
md = pd.read_csv(md_fname, sep='\t', low_memory=False)

Now let's look at the files in the metadata list that correspond to raw FASTQ files. The FASTQ files correspond either to single-ended runs or paired-ended runs, and the two cases need to be treated differently.

In [None]:
# subselect the files that have FASTQ as type
fastqs = md.loc[md["File format"] == "fastq"]

In [None]:
# let's check that every file is flagged as either single-ended or paired-ended
for index, sample in fastqs.iterrows() :
    if sample["Run type"] not in ('paired-ended', 'single-ended') :
        raise ValueError("Unrecognized run type: %s" % sample["Run type"])

Okay, so every sample is either single-ended or paired-ended.
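The same check can also be done without a loop, and counting the run types shows how many files fall into each case. This is only a sketch on a hypothetical toy table (`toy` and its rows are invented here); on the real data, `fastqs` would take its place.

```python
import pandas as pd

# Hypothetical toy table standing in for `fastqs`; the rows are invented.
toy = pd.DataFrame({
    "Run type": ["paired-ended", "single-ended", "paired-ended"],
})

known_run_types = ["paired-ended", "single-ended"]

# rows whose run type is neither of the two known values (missing values included)
unknown = toy.loc[~toy["Run type"].isin(known_run_types)]
if len(unknown) > 0:
    raise ValueError("Unrecognized run type in %d rows" % len(unknown))

# number of files per run type
counts = toy["Run type"].value_counts()
print(counts)
```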

In [None]:
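# let's pick an example sample in the list
i = 0
# positional lookup: the row labels of fastqs come from md and need not start at 0
sample = fastqs.iloc[i]

print(sample["Experiment accession"])
if sample["Run type"] == 'paired-ended' :
    # the mate is the file whose "Paired with" field points to this sample
    sample_id = sample["File accession"]
    sample_paired = fastqs.loc[fastqs["Paired with"] == sample_id]
    print(sample_paired["Experiment accession"])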

In [None]:
# get the list of all the experiments
experiments = fastqs["Experiment accession"].unique()
len(experiments)

In [None]:
# get the samples corresponding to an experiment
experiment = experiments[34]
samples = fastqs.loc[fastqs["Experiment accession"] == experiment]
# print only the first file of the experiment, together with its mate
for index, sample in samples.head(1).iterrows() :
    print("%s %s" % (sample["File accession"], sample["Paired with"]))
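Since paired-ended files have to be mapped together with their mate, it may help to group the files of an experiment into "mapping units" first. The following is a rough sketch, not something that was run on the real metadata: `toy` and its accessions are invented, and `mapping_units` is a hypothetical helper. It only relies on the columns already used above.

```python
import pandas as pd

# Hypothetical toy table; the accessions are invented for illustration.
toy = pd.DataFrame({
    "File accession": ["F1", "F2", "F3"],
    "Paired with": ["F2", "F1", None],
    "Run type": ["paired-ended", "paired-ended", "single-ended"],
    "Experiment accession": ["E1", "E1", "E1"],
})

def mapping_units(samples):
    """Group the FASTQ files of one experiment into units to be mapped together."""
    units = set()
    for _, row in samples.iterrows():
        if row["Run type"] == "paired-ended":
            # sort the two accessions so that (F1, F2) and (F2, F1) collapse into one unit
            units.add(tuple(sorted([row["File accession"], row["Paired with"]])))
        else:
            # single-ended files form a unit on their own
            units.add((row["File accession"],))
    return sorted(units)

units = mapping_units(toy)
print(units)
```

Note that sorting by accession only removes the duplicate pairs; it does not say which file holds the first mate and which the second, so that information would still have to come from the metadata.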

I created a directory in `scratch/test_map` to play around with this data. First I downloaded the raw FASTQ files, then I tried to map them with three different tools:

1. BWA: 28% unmapped reads; the run took about 17 minutes on 16 cores and used about 23 GB of RAM. I read online that for RNA-seq you should use a splice-aware aligner, such as STAR.

2. STAR: basically the same results as with BWA.

3. kallisto: this program is much better suited to quantifying transcript expression from RNA-seq data. It is much faster and uses much less memory. It requires only the sequences of the transcripts, which means that the quantification of expression is alignment-free: kallisto pseudoaligns the reads to the transcripts instead of aligning them to the genome.
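As a sketch of how the single-ended and paired-ended cases could translate into `kallisto quant` calls: the helper `kallisto_quant_cmd`, the index name, the output directories and the FASTQ file names below are all placeholders, and the fragment length and standard deviation are made-up values (kallisto needs them for single-ended reads, and they should come from the library preparation). The function only builds the command line, it does not run anything.

```python
def kallisto_quant_cmd(index, outdir, fastq_files, threads=1,
                       fragment_length=200, fragment_sd=20):
    """Build the argument list of a `kallisto quant` call for one mapping unit."""
    if len(fastq_files) not in (1, 2):
        raise ValueError("expected one (single-ended) or two (paired-ended) FASTQ files")
    cmd = ["kallisto", "quant", "-i", index, "-o", outdir, "-t", str(threads)]
    if len(fastq_files) == 1:
        # single-ended reads: kallisto requires the fragment length and its
        # standard deviation; the defaults here are placeholders, not measured values
        cmd += ["--single", "-l", str(fragment_length), "-s", str(fragment_sd)]
    return cmd + list(fastq_files)

# hypothetical file names, for illustration only
cmd_pe = kallisto_quant_cmd("transcripts.idx", "out_pe",
                            ["mate1.fastq.gz", "mate2.fastq.gz"], threads=16)
cmd_se = kallisto_quant_cmd("transcripts.idx", "out_se", ["reads.fastq.gz"])
print(" ".join(cmd_pe))
print(" ".join(cmd_se))
```

The resulting list could then be passed to `subprocess.call`, once per mapping unit.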