In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import os, sys

# 2018-12-06 Metadata management
In this first part of the project, we will develop a few functions to load the data that we downloaded and run a preliminary analysis on it.

In [None]:
# directories
tissue_ai_rootdir = "../"
datadir = os.path.join(tissue_ai_rootdir, "data")

The data directory contains a bunch of .tsv files, along with the following extra files:

- `files.txt`: the original file downloaded from the ENCODE project website, as explained in the README
- `filtered-files.txt`: the list of files that we downloaded from the ENCODE site
- `metadata.txt`: the file containing the metadata corresponding to 
- `filtered-metadata.txt`: the metadata of the filtered files

Let's start by opening the metadata of the filtered files, so that we can decide whether some of the files are not needed.

In [None]:
# use a Pandas DataFrame to store the information. The "read_csv" function reads
# a comma-separated file by default, so we must pass the sep='\t' option to indicate
# that the file is tab-separated
md_fname = "%s/filtered-metadata.txt"%(datadir)
md = pd.read_csv(md_fname, sep='\t')

In [None]:
md

A simple `ls` in the data directory shows that most of the files are about 10 MB in size, whereas others are smaller than 2 MB. Let's try to find out where this difference comes from.
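
The same size check can also be done from Python rather than the shell. The following is a minimal sketch, not part of the original analysis: the helper name `list_file_sizes` and the `.tsv` suffix filter are assumptions made for illustration.

```python
import os

def list_file_sizes(directory, suffix=".tsv"):
    """Return (file name, size in bytes) pairs for files ending in `suffix`.

    Hypothetical helper: the result is sorted from smallest to largest file.
    """
    sizes = []
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        # skip sub-directories and files with a different extension
        if name.endswith(suffix) and os.path.isfile(path):
            sizes.append((name, os.path.getsize(path)))
    return sorted(sizes, key=lambda pair: pair[1])
```

Calling `list_file_sizes(datadir)` would then list the smallest files first, which makes the outliers easy to spot.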

In [None]:
# easier to hand-pick two files and interrogate the database for differences
small_expid = "ENCFF994WND"
large_expid = "ENCFF994UBN"

In [None]:
small_md = md.loc[md["File accession"] == small_expid]
large_md = md.loc[md["File accession"] == large_expid]

In [None]:
# print every metadata field of the two files, one below the other
for column in md.columns:
    print(column)
    print("SMALL: %s" % (small_md[column].values[0]))
    print("LARGE: %s" % (large_md[column].values[0]))
    print("")
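
Printing every column produces a lot of output in which most fields are identical. A possible refinement is to keep only the fields where the two files disagree. The sketch below assumes a hypothetical helper `differing_columns`; it treats a field that is missing in both rows as equal.

```python
import pandas as pd

def differing_columns(df, key_column, key_a, key_b):
    """Return the fields in which the rows identified by key_a and key_b differ.

    Hypothetical helper: rows are looked up by their value in `key_column`,
    and the first match is used for each key.
    """
    row_a = df.loc[df[key_column] == key_a].iloc[0]
    row_b = df.loc[df[key_column] == key_b].iloc[0]
    # NaN != NaN evaluates to True, so mask out fields missing in both rows
    both_missing = row_a.isna() & row_b.isna()
    differs = (row_a != row_b) & ~both_missing
    return pd.DataFrame({key_a: row_a[differs], key_b: row_b[differs]})
```

With the variables defined above, `differing_columns(md, "File accession", small_expid, large_expid)` would return one row per differing field. The key column itself always appears in the result, since the two accessions differ by construction.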

Digging into the various experiments, it becomes clear that there is a problem due to inhomogeneous data management. The experiment we labelled "small" has a date stamp that goes back to 2010. Its output looks similar to the output generated by `htseq` or by `STAR`. However, neither of these software packages existed back then, so it's difficult to figure out what the values reported there actually mean.

Worse: there is no guarantee that the data from different experiments actually mean the same thing. Or rather, we can be sure that they don't. It is crucial that the samples are homogeneous in what their values mean; otherwise the accuracy of every step in the development of the AI will be seriously compromised.
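
One rough way to gauge how heterogeneous the collection is would be to count the distinct values of each metadata column across all files. The sketch below is only an illustration under that assumption; `varying_columns` is a hypothetical helper, and which columns matter for comparability still has to be judged by hand.

```python
import pandas as pd

def varying_columns(df):
    """Return the number of distinct values for each column that is not constant.

    Hypothetical helper: missing values count as a value of their own, and the
    columns with the most distinct values come first.
    """
    distinct = df.nunique(dropna=False)
    return distinct[distinct > 1].sort_values(ascending=False)
```

Applied as `varying_columns(md)`, columns that are constant across files drop out, leaving the fields along which the experiments actually differ.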