## Processing lmd files for 16S rRNA metadata from HMP 

In this notebook, we process the metadata downloaded from the [raw data page of the Human Microbiome Project](http://downloads.ihmpdcc.org/data/HMR16S/SRP002395_metadata_lmd.tar.gz). An SOP describing the contents of these `lmd` files is [available on the HMP site](https://www.hmpdacc.org/hmp/doc/SFF_LibraryMetadataFiles_SOP.pdf). Here, we read the folder after it has been downloaded and unpacked (using `tar`) and extract the relevant metadata. For our project, body site is the most important variable.

In [9]:
import pandas as pd
import os
import glob
import tarfile
!pwd
dpath = "/dartfs-hpc/rc/lab/H/HoenA/Lab/QNguyen/ResultsFiles/data/hmp_16s"

/dartfs-hpc/rc/home/k/f00345k/research/microbe_set_trait/analysis


In [69]:
metadata = pd.read_csv("../python/hmp_16s.txt")
metadata = metadata.rename(columns={"Run": "srr", "gene (exp)" : "region", 
                 "analyte_type":"body_site", 
                 "Sample Name": "sample_name", "Bases": "bases"})
metadata = metadata[["srr", "bases"]]

Although this CSV file contains a lot of useful information, the sequenced region and the body site are described slightly better in the full `lmd` metadata files.

In [4]:
lmdfiles = os.listdir("../python/hmp_metadata")

We define a function that retrieves the relevant metadata from each `lmd` file. Note that each `lmd` file contains information for all runs within an experiment, and some files may list multiple sample runs that have not been demultiplexed. Here, we retain only samples that have exactly one run associated with them.

In [5]:
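def process_lmd(path):
    """Return metadata for an lmd file with exactly one complete run, else None."""
    data = pd.read_csv(path, sep="\t", header=None)
    # Rows with any missing field are discarded before counting runs
    srr = data.dropna()
    if srr.shape[0] != 1:
        return None
    sample = srr.iloc[0]
    # Fields are picked by column position, as used throughout this notebook
    d = {"srr": sample[0], "region": sample[7], "body_site": sample[11], "sample_id": sample[12],
         "subject_id": sample[10], "reverse_primer": sample[8]}
    return pd.Series(d)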
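
To make the "exactly one complete run" rule concrete, here is a minimal sketch on in-memory data. The 13-column rows, the accession names, and the helper `single_run_or_none` are hypothetical stand-ins for real `lmd` content; only the `dropna` and row-count logic mirrors the function above.

```python
import io
import pandas as pd

def single_run_or_none(handle):
    """Return the only fully populated row of a tab-separated table, or None."""
    data = pd.read_csv(handle, sep="\t", header=None)
    complete = data.dropna()          # drop rows with any missing field
    if len(complete) != 1:            # zero or several complete runs: skip
        return None
    return complete.iloc[0]

# Hypothetical rows with 13 tab-separated fields (positions 0..12)
row_a = ["SRR000001", "a", "b", "c", "d", "e", "f", "V3-V5", "PRIMER", "x", "subj1", "Stool", "samp1"]
row_b = ["SRR000002", "a", "b", "c", "d", "e", "f", "V3-V5", "PRIMER", "x", "subj1", "Stool", "samp2"]
row_incomplete = list(row_b)
row_incomplete[11] = ""               # an empty field is parsed as NaN

def to_tsv(rows):
    return io.StringIO("\n".join("\t".join(r) for r in rows) + "\n")

kept = single_run_or_none(to_tsv([row_a, row_incomplete]))   # one complete row
dropped = single_run_or_none(to_tsv([row_a, row_b]))         # two complete rows

print(kept[0], kept[7], kept[11])
print(dropped)
```

A file with one complete row and one incomplete row is kept, while a file with two complete rows is skipped, which is the behaviour the notebook relies on to avoid runs that were not demultiplexed.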

In [6]:
lmddata = []
for idx, val  in enumerate(lmdfiles):
    if idx % 1000 == 0:
        print(idx)
    lmddata.append(process_lmd("../python/hmp_metadata/" + val))

0
1000
2000
3000
4000
5000
6000
7000


In [70]:
metapd = pd.DataFrame([i for i in lmddata if isinstance(i, pd.Series)])
metapd = metapd.astype({'sample_id': 'string', 'subject_id' : 'string'})
metapd = metapd[metapd.region.isin(["V5-V3","V3-V5"])]

We merge the two data frames to obtain the total number of bases per sequencing run. When a sample (same `sample_id`) has multiple runs, we keep the run with the largest number of bases.

In [71]:
metadata = pd.merge(metapd, metadata, how="left", on="srr")
metadata

Unnamed: 0,srr,region,body_site,sample_id,subject_id,reverse_primer,bases
0,SRR041296,V5-V3,Anterior nares,700033977.0,159510762.0,CCGTCAATTCMTTTRAGT,1175722.0
1,SRR044244,V3-V5,Palatine Tonsils,700024179.0,764245047.0,CCGTCAATTCMTTTRAGT,6266240.0
2,SRR046510,V3-V5,L_Antecubital fossa,700023584.0,763860675.0,CCGTCAATTCMTTTRAGT,517709.0
3,SRR041518,V5-V3,R_Retroauricular crease,700016810.0,159268001.0,CCGTCAATTCMTTTRAGT,2662468.0
4,SRR042831,V5-V3,Throat,700032266.0,159753524.0,CCGTCAATTCMTTTRAGT,1225.0
...,...,...,...,...,...,...,...
4460,SRR041489,V5-V3,R_Retroauricular crease,700032117.0,159672603.0,CCGTCAATTCMTTTRAGT,3212382.0
4461,SRR044423,V3-V5,Stool,700024866.0,764649650.0,CCGTCAATTCMTTTRAGT,5046645.0
4462,SRR048147,V3-V5,Palatine Tonsils,700095449.0,158418336.0,CCGTCAATTCMTTTRAGT,
4463,SRR044387,V3-V5,Posterior fornix,700024882.0,764649650.0,CCGTCAATTCMTTTRAGT,3738153.0


In [72]:
metadata = metadata[metadata.groupby('sample_id')['bases'].transform('max') == metadata.bases]
metadata = metadata.reset_index(drop=True)
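
The filtering step can be illustrated on a toy table. This is a minimal sketch with made-up accessions and base counts; it shows two side effects of the comparison against the per-group maximum: a run whose `bases` value is missing is dropped (a missing value never compares equal to anything), and tied runs would all be kept.

```python
import pandas as pd

# Hypothetical runs: sample s1 has two runs, s3 has no base count
runs = pd.DataFrame({
    "srr": ["SRR_A", "SRR_B", "SRR_C", "SRR_D"],
    "sample_id": ["s1", "s1", "s2", "s3"],
    "bases": [100.0, 250.0, 80.0, float("nan")],
})

# Per-row maximum of `bases` within each sample_id group
best = runs.groupby("sample_id")["bases"].transform("max")

# Keep only rows that match their group's maximum
deepest = runs[runs["bases"] == best].reset_index(drop=True)
print(deepest)
```

Only `SRR_B` (the deeper of the two `s1` runs) and `SRR_C` remain; `SRR_D` disappears because its base count is missing.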

In [74]:
metadata.shape

(2553, 7)

The result is 2,553 runs whose data files need to be unpacked and preprocessed.

In [88]:
num_identifier = [int(x.split("SRR0")[1]) for x in metadata.srr.tolist()]

In [90]:
print("Samples ranging from {} to {}".format(min(num_identifier), max(num_identifier)))

Samples ranging from 40576 to 51587


In [102]:
ranges = {
    "r1" : [40000, 40999],
    "r2" : [41000, 41999],
    "r3" : [42000, 42999],
    "r4" : [43000, 43999], 
    "r5": [44000, 44999], 
    "r6": [45000, 45999],
    "r7": [46000, 46999], 
    "r8": [47000, 47999], 
    "r9": [48000, 48999],
    "r10": [49000, 49999],
    "r11": [50000, 59999],
}

In [103]:
for key in ranges:
    query = [x for x in num_identifier if x <= ranges[key][1] and x >= ranges[key][0]]
    if len(query) >= 1:
        print(key)

r1
r2
r3
r4
r5
r6
r7
r8
r9
r10
r11
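
The same check can be done without writing the ranges out by hand. The sketch below is one possible alternative, assuming uniform blocks of 1,000 accession numbers (the last archive in this notebook spans 10,000, so this is a simplification); the accessions in `ids` are just a small illustrative subset.

```python
from collections import Counter

# Illustrative subset of run accessions
ids = ["SRR040576", "SRR041296", "SRR041518", "SRR051587"]

# Strip the "SRR" prefix and parse the numeric part
nums = [int(s[3:]) for s in ids]

# Map each accession to the start of its block of 1,000 and count per block
block_counts = Counter(n // 1000 * 1000 for n in nums)

for start in sorted(block_counts):
    print("{}-{}: {} run(s)".format(start, start + 999, block_counts[start]))
```

Counting per block also shows how many required runs fall in each archive, not just whether an archive is needed.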


In [105]:
metadata.to_csv("../metadata/hmp_16s_metadata.csv")

## Loading and filtering

The strategy in this section is to extract each `tar.bz2` archive in turn and then delete every extracted file whose run is not listed in the metadata.

In [2]:
metadata = pd.read_csv("../metadata/hmp_16s_metadata.csv")

In [43]:
dwn_files = glob.glob(dpath + "/*.tar.bz2")
extract_path = dpath + "/sff/"
dwn_files

['/dartfs-hpc/rc/lab/H/HoenA/Lab/QNguyen/ResultsFiles/data/hmp_16s/SRR049000_SR049999.tar.bz2',
 '/dartfs-hpc/rc/lab/H/HoenA/Lab/QNguyen/ResultsFiles/data/hmp_16s/SRR048000_SR048999.tar.bz2',
 '/dartfs-hpc/rc/lab/H/HoenA/Lab/QNguyen/ResultsFiles/data/hmp_16s/SRR050000_SRR059999.tar.bz2',
 '/dartfs-hpc/rc/lab/H/HoenA/Lab/QNguyen/ResultsFiles/data/hmp_16s/SRR047000_SR047999.tar.bz2']

In [44]:
# Extract each archive unless its first file is already present in the sff folder
wanted = set(metadata.srr.tolist())
for archive in dwn_files:
    # e.g. ".../SRR049000_SR049999.tar.bz2" -> "SRR049000.sff"
    first_file = os.path.basename(archive).split(".tar.bz2")[0].split("_")[0] + ".sff"
    if first_file not in os.listdir(extract_path):
        print(archive)
        tar = tarfile.open(archive, "r:bz2")
        tar.extractall(extract_path)
        tar.close()
    # Then remove extracted files that are not listed in the metadata
    remove_list = [f for f in os.listdir(extract_path) if f.split(".sff")[0] not in wanted]
    for j in remove_list:
        os.remove(extract_path + j)
        

/dartfs-hpc/rc/lab/H/HoenA/Lab/QNguyen/ResultsFiles/data/hmp_16s/SRR049000_SR049999.tar.bz2
/dartfs-hpc/rc/lab/H/HoenA/Lab/QNguyen/ResultsFiles/data/hmp_16s/SRR048000_SR048999.tar.bz2
/dartfs-hpc/rc/lab/H/HoenA/Lab/QNguyen/ResultsFiles/data/hmp_16s/SRR050000_SRR059999.tar.bz2
/dartfs-hpc/rc/lab/H/HoenA/Lab/QNguyen/ResultsFiles/data/hmp_16s/SRR047000_SR047999.tar.bz2
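
An alternative to extracting everything and deleting afterwards is to pull out only the required members. The sketch below is not what this notebook ran; it assumes each archive member is named `<SRR accession>.sff`, and the helper `extract_wanted`, the temporary archive, and the accessions are all hypothetical.

```python
import tarfile
import tempfile
from pathlib import Path

def extract_wanted(archive_path, out_dir, wanted_srr):
    """Extract only the .sff members whose accession is in wanted_srr."""
    wanted = set(wanted_srr)
    extracted = []
    with tarfile.open(archive_path, "r:bz2") as tar:
        for member in tar:
            name = Path(member.name).name
            if member.isfile() and name.endswith(".sff") and name[:-4] in wanted:
                member.name = name            # write directly into out_dir
                tar.extract(member, out_dir)
                extracted.append(name)
    return extracted

# Build a small throwaway archive to demonstrate the helper
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    src = tmp / "src"
    src.mkdir()
    for acc in ("SRR000001", "SRR000002", "SRR000003"):
        (src / (acc + ".sff")).write_bytes(b"placeholder")
    archive = tmp / "demo.tar.bz2"
    with tarfile.open(archive, "w:bz2") as tar:
        for f in sorted(src.iterdir()):
            tar.add(f, arcname=f.name)

    out = tmp / "sff"
    out.mkdir()
    extracted = extract_wanted(archive, out, ["SRR000002"])
    on_disk = sorted(p.name for p in out.iterdir())

print(extracted, on_disk)
```

This avoids writing unwanted files to disk at all, which may matter on a shared filesystem, at the cost of still having to read through the whole compressed archive.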


In [55]:
len([i for i in metadata.srr.tolist() if i + ".sff" not in os.listdir(extract_path)])

0
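
The completeness check above lists the directory once per run. A set difference gives the same answer with a single directory listing and also reports which runs are missing. This is a minimal sketch; `missing_runs`, the temporary folder, and the accessions are hypothetical.

```python
import os
import tempfile

def missing_runs(srr_ids, sff_dir):
    """Return the sorted accessions that have no matching .sff file in sff_dir."""
    present = {name[:-4] for name in os.listdir(sff_dir) if name.endswith(".sff")}
    return sorted(set(srr_ids) - present)

# Demonstrate on a throwaway folder holding two of three expected files
with tempfile.TemporaryDirectory() as sff_dir:
    for acc in ("SRR000001", "SRR000003"):
        open(os.path.join(sff_dir, acc + ".sff"), "wb").close()
    missing = missing_runs(["SRR000001", "SRR000002", "SRR000003"], sff_dir)

print(missing)
```

An empty list corresponds to the `0` printed by the cell above.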