# Concatenate TCGA MAF Files
This notebook loads all the MAF files downloaded from the TCGA data portal and joins them. Duplicates are removed prior to analyzing the data.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import os

%matplotlib inline

## Joining all MAF Files
First, let's try to read all the MAF files and concatenate them. After that, I should be left with one big MAF that I can write to file.

In [77]:
count = 0
maf_all = None
col_names = None
root_dir = '../data/pancancer/TCGA/downloads/'
for subdir, d, files in os.walk(root_dir):
    for fname in files:
        if fname.endswith('.maf'):
            p = os.path.join(subdir, fname)
            if maf_all is None:
                maf_all = pd.read_csv(p, sep='\t', comment='#', header=0)
                col_names = maf_all.columns
            else:
                print ("[{}] Got {} mutations and {} common columns".format(count, maf_all.shape[0], maf_all.shape[1]))
                try:
                    maf_new = pd.read_csv(p, sep='\t', comment='#', header=0)
                except:
                    print ("Exception occured while reading {}... Skipping!".format(p))
                    continue
                if {'Hugo_Symbol', 'Variant_Classification', 'Tumor_Sample_Barcode'}.issubset(maf_new.columns):
                    maf_join = pd.concat([maf_all, maf_new], ignore_index=True, join='inner')
                    maf_join.dropna(axis=1, inplace=True, how='all') # remove stupid all-na columns
                    if len(maf_join.columns) < len(col_names):
                        print ("Lost {} columns reading {}".format(len(col_names)-maf_join.shape[1], p))
                        print ("Lost columns: {}".format([c for c in col_names if not c in maf_join.columns]))
                        print ("New Maf cols: {}".format(maf_new.columns))
                        print ("Old MAF cols: {}".format(maf_all.columns))
                        print ("Join MAF cols: {}".format(maf_join.columns))
                        col_names = maf_all.columns
                    maf_all = maf_join
                else:
                    print ("File contains not all columns needed or is empty {}".format(p))
            count += 1
count

[1] Got 4858 mutations and 90 common columns
Lost 69 columns reading ../data/cancer/TCGA/downloads/08f8c62e-4a47-45f1-8b87-35acee6956c1/genome.wustl.edu_OV.ABI.26.0.0.somatic.maf
Lost columns: ['Match_Norm_Seq_Allele1', 'Match_Norm_Seq_Allele2', 'Tumor_Validation_Allele1', 'Tumor_Validation_Allele2', 'Match_Norm_Validation_Allele1', 'Match_Norm_Validation_Allele2', 'Verification_Status', 'Sequence_Source', 'Score', 'BAM_file', 'Sequencer', 'Tumor_Sample_UUID', 'Matched_Norm_Sample_UUID', 'Genome_Change', 'Annotation_Transcript', 'Transcript_Strand', 'Transcript_Exon', 'Transcript_Position', 'cDNA_Change', 'Codon_Change', 'Protein_Change', 'Other_Transcripts', 'Refseq_mRNA_Id', 'Refseq_prot_Id', 'SwissProt_acc_Id', 'SwissProt_entry_Id', 'Description', 'UniProt_AApos', 'UniProt_Region', 'UniProt_Site', 'UniProt_Natural_Variations', 'UniProt_Experimental_Info', 'GO_Biological_Process', 'GO_Cellular_Component', 'GO_Molecular_Function', 'COSMIC_overlapping_mutations', 'COSMIC_fusion_genes',

  interactivity=interactivity, compiler=compiler, result=result)


[11] Got 113135 mutations and 18 common columns
[12] Got 116199 mutations and 18 common columns


  interactivity=interactivity, compiler=compiler, result=result)


[13] Got 127600 mutations and 18 common columns
[14] Got 131155 mutations and 18 common columns


  interactivity=interactivity, compiler=compiler, result=result)


[15] Got 182954 mutations and 18 common columns
[16] Got 201832 mutations and 18 common columns
[17] Got 204463 mutations and 18 common columns
[18] Got 229594 mutations and 18 common columns


  interactivity=interactivity, compiler=compiler, result=result)


[19] Got 302135 mutations and 18 common columns
[20] Got 330555 mutations and 18 common columns


  interactivity=interactivity, compiler=compiler, result=result)


[21] Got 342918 mutations and 18 common columns
[22] Got 405978 mutations and 18 common columns
[23] Got 423720 mutations and 18 common columns


  interactivity=interactivity, compiler=compiler, result=result)


[24] Got 440671 mutations and 18 common columns
[25] Got 487914 mutations and 18 common columns
[26] Got 504832 mutations and 18 common columns
[27] Got 507895 mutations and 18 common columns
[28] Got 510480 mutations and 18 common columns
[29] Got 518366 mutations and 18 common columns
[30] Got 572408 mutations and 18 common columns
[31] Got 601128 mutations and 18 common columns
[32] Got 671175 mutations and 18 common columns
[33] Got 675820 mutations and 18 common columns
[34] Got 698746 mutations and 18 common columns


  interactivity=interactivity, compiler=compiler, result=result)


[35] Got 726491 mutations and 18 common columns


  interactivity=interactivity, compiler=compiler, result=result)


[36] Got 735640 mutations and 18 common columns


  interactivity=interactivity, compiler=compiler, result=result)


[37] Got 765997 mutations and 18 common columns
[38] Got 775434 mutations and 18 common columns
[39] Got 779960 mutations and 18 common columns
[40] Got 795545 mutations and 18 common columns
[41] Got 814518 mutations and 18 common columns
[42] Got 874158 mutations and 18 common columns
[43] Got 900046 mutations and 18 common columns
[44] Got 929654 mutations and 18 common columns
Exception occured while reading ../data/cancer/TCGA/downloads/0aaab448-a53f-4c40-a7d5-3e9e3e7ccc6e/SKCM_pairs.aggregated.capture.tcga.uuid.automated.somatic.maf... Skipping!
[44] Got 929654 mutations and 18 common columns
[45] Got 932407 mutations and 18 common columns
[46] Got 938578 mutations and 18 common columns
Exception occured while reading ../data/cancer/TCGA/downloads/2ec6bad8-3dba-4c5d-9fda-8074ab960b1b/genome.wustl.edu_STAD.IlluminaHiSeq_DNASeq_automated.1.3.0.somatic.maf... Skipping!
[46] Got 938578 mutations and 18 common columns
[47] Got 1036511 mutations and 18 common columns
[48] Got 1047439 m

  interactivity=interactivity, compiler=compiler, result=result)


[54] Got 1174079 mutations and 18 common columns
[55] Got 1194215 mutations and 18 common columns
[56] Got 1210123 mutations and 18 common columns
Exception occured while reading ../data/cancer/TCGA/downloads/42084331-06c4-42c6-8d40-a1a89141c809/step4_LUSC_Paper_v8.aggregated.tcga.maf2.4.migrated.somatic.maf... Skipping!
[56] Got 1210123 mutations and 18 common columns
Exception occured while reading ../data/cancer/TCGA/downloads/d7e90ea9-49b5-4efc-9f78-bd5244cd6367/gsc_LUSC_pairs.aggregated.capture.tcga.uuid.automated.somatic.maf... Skipping!
[56] Got 1210123 mutations and 18 common columns
[57] Got 1257366 mutations and 18 common columns
[58] Got 1303243 mutations and 18 common columns
[59] Got 1312895 mutations and 18 common columns
[60] Got 1354467 mutations and 18 common columns


  interactivity=interactivity, compiler=compiler, result=result)


[61] Got 1385997 mutations and 18 common columns
[62] Got 1393554 mutations and 18 common columns
[63] Got 1447331 mutations and 18 common columns


  interactivity=interactivity, compiler=compiler, result=result)


[64] Got 1467990 mutations and 18 common columns
Exception occured while reading ../data/cancer/TCGA/downloads/059c40fa-f329-46dc-a095-d0cb23bc7342/HNSC_pairs.aggregated.capture.tcga.uuid.automated.somatic.maf... Skipping!
[64] Got 1467990 mutations and 18 common columns


  interactivity=interactivity, compiler=compiler, result=result)


[65] Got 1541998 mutations and 18 common columns
[66] Got 1590195 mutations and 18 common columns
[67] Got 1598385 mutations and 18 common columns
[68] Got 1618143 mutations and 18 common columns
[69] Got 1624850 mutations and 18 common columns


  interactivity=interactivity, compiler=compiler, result=result)


[70] Got 1773370 mutations and 18 common columns
[71] Got 1788042 mutations and 18 common columns
[72] Got 1801171 mutations and 18 common columns
[73] Got 1829384 mutations and 18 common columns


  interactivity=interactivity, compiler=compiler, result=result)


[74] Got 1839404 mutations and 18 common columns
[75] Got 1850319 mutations and 18 common columns
[76] Got 1855678 mutations and 18 common columns


  interactivity=interactivity, compiler=compiler, result=result)


[77] Got 2040539 mutations and 18 common columns
[78] Got 2050424 mutations and 18 common columns
[79] Got 2056866 mutations and 18 common columns
[80] Got 2063621 mutations and 18 common columns
[81] Got 2068283 mutations and 18 common columns
[82] Got 2093286 mutations and 18 common columns
[83] Got 2208643 mutations and 18 common columns
[84] Got 2236535 mutations and 18 common columns
[85] Got 2421396 mutations and 18 common columns
[86] Got 2425631 mutations and 18 common columns
[87] Got 2445797 mutations and 18 common columns
[88] Got 2454022 mutations and 18 common columns
Exception occured while reading ../data/cancer/TCGA/downloads/35fd32b4-e725-4c6b-9711-5dc43dcc0cee/gsc_KIRC_pairs.aggregated.capture.tcga.uuid.automated.somatic.maf... Skipping!
[88] Got 2454022 mutations and 18 common columns
[89] Got 2478141 mutations and 18 common columns
[90] Got 2546762 mutations and 18 common columns


  interactivity=interactivity, compiler=compiler, result=result)


[91] Got 2556791 mutations and 18 common columns
Exception occured while reading ../data/cancer/TCGA/downloads/7abdba45-acfd-4e8d-a297-3f4af85362ad/gsc_GBM_pairs.aggregated.capture.tcga.uuid.automated.somatic.maf... Skipping!
[91] Got 2556791 mutations and 18 common columns
[92] Got 2565368 mutations and 18 common columns
[93] Got 2575119 mutations and 18 common columns
[94] Got 2625374 mutations and 18 common columns
Exception occured while reading ../data/cancer/TCGA/downloads/b2e25bdf-f2b5-4a37-b330-05251ea09f2c/gsc_LUAD_pairs.aggregated.capture.tcga.uuid.automated.somatic.maf... Skipping!
[94] Got 2625374 mutations and 18 common columns


  interactivity=interactivity, compiler=compiler, result=result)


[95] Got 2652689 mutations and 18 common columns
[96] Got 2761946 mutations and 18 common columns
[97] Got 2894358 mutations and 18 common columns
Exception occured while reading ../data/cancer/TCGA/downloads/87ac6b5b-806e-4de0-b8d8-ae6888759667/gsc_BRCA_pairs.aggregated.capture.tcga.uuid.automated.somatic.maf... Skipping!
[97] Got 2894358 mutations and 18 common columns
File contains not all columns needed or is empty ../data/cancer/TCGA/downloads/7e05d87e-966c-4321-84b0-19983dddcf8a/hgsc.bcm.edu_GBM.ABI.1.somatic.maf
[98] Got 2894358 mutations and 18 common columns
Exception occured while reading ../data/cancer/TCGA/downloads/1cc660a2-7c88-4ce5-8dad-8f8ba0115694/step4_An_UCEC_194.aggregated.tcga.maf2.4.migrated.somatic.maf... Skipping!
[98] Got 2894358 mutations and 18 common columns
[99] Got 2905700 mutations and 18 common columns
Exception occured while reading ../data/cancer/TCGA/downloads/15ce66c6-0211-4f03-bd41-568d0818a044/gsc_LIHC_pairs.aggregated.capture.tcga.uuid.automated.s

  interactivity=interactivity, compiler=compiler, result=result)


[115] Got 3381260 mutations and 16 common columns


  interactivity=interactivity, compiler=compiler, result=result)


[116] Got 3390913 mutations and 16 common columns
[117] Got 3437460 mutations and 16 common columns
[118] Got 3437536 mutations and 16 common columns
[119] Got 3463236 mutations and 16 common columns
[120] Got 3465024 mutations and 16 common columns
[121] Got 3467277 mutations and 16 common columns
[122] Got 3540139 mutations and 16 common columns
[123] Got 3546447 mutations and 16 common columns
[124] Got 3605049 mutations and 16 common columns


  interactivity=interactivity, compiler=compiler, result=result)


[125] Got 3617397 mutations and 16 common columns
[126] Got 3707887 mutations and 16 common columns
[127] Got 3717143 mutations and 16 common columns
[128] Got 3719669 mutations and 16 common columns
[129] Got 3722603 mutations and 16 common columns


130

## Do Some Preprocessing
After I now have one big DataFrame with all the MAF files concatenated, I want to clean it a little.

I will do the following:
* Remove all mutations without `Hugo_Symbol`. Mutations that are not in genes can't help me to compute mutation frequencies.
* Remove duplicated rows. If a mutation is an exact duplicate of another, remove it.

In [78]:
maf_no_nas = maf_all.loc[~maf_all.Hugo_Symbol.isnull()]

In [75]:
no_mut_dup = maf_no_nas.shape[0]
maf_no_nas.drop_duplicates(subset=['Hugo_Symbol', 'Variant_Classification', 'Tumor_Sample_Barcode'], inplace=True)
print ("Joined MAF contained {} duplicates. Left with {} out of {} mutations".format(no_mut_dup-maf_no_nas.shape[0],
                                                                                  maf_no_nas.shape[0], no_mut_dup))

Joined MAF contained 1859965 duplicates. Left with 1864774 out of 3724739 mutations


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  return func(*args, **kwargs)


In [79]:
maf_no_nas.isnull().

(3724739, 16)

## Write Back To File

In [None]:
maf_no_nas.to_csv('../data/pancancer/TCGA/all_mutations.maf')

((60, 43), (1575, 50), 1635)