# BreCaNet

This Jupyter notebook walks through the steps to process the full data sets before running PANDA.

##### Gene Expression Data

The data came from the TCGA database, using the following settings.

*Cases tab*
    Primary site -> breast,
    Disease type -> ductal and lobular neoplasms,
    Gender -> female,
    Race -> White

*Files tab*
    Experimental Strategy -> RNA-seq,
    Workflow -> HTSeq-FPKM,
    Project -> TCGA-BRCA

Click the Download button and select Cart.

##### Clinical Data with IHC Receptor Status

Use the same settings as for the expression data, but download the manifest instead of the cart. The manifest file is used with the supplemental R code and is necessary to retrieve the clinical IHC information.

Use the same settings in the TCGA repository on the Cases tab.

*Files tab:*
Data Category -> Clinical

Data Format -> BCR Biotab

Add to cart 

Download the nationwidechildrens.org_clinical_patient_brca.txt file

##### TF Binding Motif Data

GTex from Camila

##### Protein-Protein Interaction (PPI) Data

GTex from Camila

##### PANDA algorithm

https://netzoo.github.io

In [13]:
# Packages

import gzip
import numpy as np
import os
import shutil

In [14]:
# User's paths for all files

# gene expression paths
dataPath = '/Users/ursulawidocki/Desktop/BreCaNet/Data/gdc_download_FPKM_832' #zipped file path
dataNewFolder = '/Users/ursulawidocki/Desktop/BreCaNet/Data/processedTCGA_832' # folder for unzipped
PANDAexpPath = '/Users/ursulawidocki/Desktop/BreCaNet/Data/PANDAinput/PANDAgeneExp_832.txt' # final exp file

# motif data paths
motifPath = '/Users/ursulawidocki/Desktop/BreCaNet/Data/motif.txt' # original motif file
motifNew = '/Users/ursulawidocki/Desktop/BreCaNet/Data/PANDAinput/motifDataNew.txt'
PANDAmotifPath = '/Users/ursulawidocki/Desktop/BreCaNet/Data/PANDAinput/PANDAmotifs.txt' # final motif file

# ppi data paths
ppiPath = '/Users/ursulawidocki/Desktop/BreCaNet/Data/ppi2015_freeze.txt' # initial ppi path
PANDAppiPath = '/Users/ursulawidocki/Desktop/BreCaNet/Data/PANDAinput/PANDAppi.txt' # final ppi file

In [15]:
# Unzips the expression data and creates a folder (processedTCGA) of the unzipped files 
# Gets the codes of the patient folders and expression files

folderIDs = list() # codes of the patient IDs on the folder
indivList = list() # codes of the patients' expression file

os.makedirs(dataNewFolder, exist_ok=True) # makes sure the output folder exists

for fileName in os.listdir(dataPath):
    if fileName == '.DS_Store' or fileName == 'MANIFEST.txt':
        continue

    folderIDs.append(fileName)
    newPath = dataPath + '/' + fileName
    
    # goes into the individual folder
    for indivName in os.listdir(newPath):
        if indivName in ('.DS_Store', 'MANIFEST.txt', 'annotations.txt'):
            continue
        filePath = newPath + '/' + indivName
        
        # only the gzipped expression files are unzipped
        if indivName.endswith('.gz'):
            newFilePath = dataNewFolder + '/' + indivName[0:-3] # drops the '.gz'
            indivList.append(indivName[0:-3])
            
            with gzip.open(filePath, 'rb') as f_in, open(newFilePath, 'wb') as f_out:
                shutil.copyfileobj(f_in, f_out)
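
The unzip step above can also be sketched as a reusable helper. This is a minimal sketch, not the notebook's own code: `gunzip_all` and its parameter names are hypothetical, and it assumes the download layout used above (one sub-folder per sample, each holding a `.gz` expression file). Unlike the loop above, it records a folder ID only when an archive is actually unzipped, so the two returned lists stay parallel.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_all(src_root, dest_dir):
    """Decompress every *.gz file found one level below src_root into dest_dir.

    Returns two parallel lists: the sub-folder each archive came from and
    the name of the decompressed file. (Hypothetical helper.)
    """
    src_root, dest_dir = Path(src_root), Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)  # create the output folder if missing
    folder_ids, file_names = [], []
    # Plain files such as MANIFEST.txt are skipped because only directories are visited.
    for folder in sorted(p for p in src_root.iterdir() if p.is_dir()):
        for archive in sorted(folder.glob("*.gz")):
            out_path = dest_dir / archive.name[:-3]  # drop the ".gz" suffix
            with gzip.open(archive, "rb") as f_in, open(out_path, "wb") as f_out:
                shutil.copyfileobj(f_in, f_out)
            folder_ids.append(folder.name)
            file_names.append(out_path.name)
    return folder_ids, file_names
```

Sorting the folders makes the sample order reproducible across machines, which `os.listdir` does not guarantee.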
                

In [16]:
# makes the folderIDs and indivList into files

IDPath = '/Users/ursulawidocki/Desktop/BreCaNet/Data/IDList_FPKM.txt'
folderIDPath = '/Users/ursulawidocki/Desktop/BreCaNet/Data/foldIDList_FPKM.txt'

IDFile = open(IDPath, "w+")
folderFile = open(folderIDPath, "w+")

for i in range(0, len(indivList)): # one line per sample (832 here)
    newLineIDs = indivList[i][0:-9] + "\n"
    newLineFolder = folderIDs[i] + "\n"
    
    IDFile.write(newLineIDs)
    folderFile.write(newLineFolder)
    
IDFile.close()
folderFile.close()


In [17]:
# gets list of genes that are in the file

geneList = list() # list of all genes that are noted in the expression files

with open(dataNewFolder + '/' + indivList[0], "r") as file:
    for line in file.readlines():
        potentialGene = line.strip().split("\t")[0]
        
        if potentialGene[0:3] == "ENS":
            newGeneName = potentialGene.strip().split(".")[0]
            geneList.append(newGeneName)
print(len(geneList))

60483


In [19]:
# makes a matrix where patients are columns and genes are rows

allTesting = np.zeros((len(geneList), len(indivList))) # matrix with all info from files
j = 0 # index of the column (sample)
i = 0 # index of the row (gene)
print(allTesting.shape) # just to make sure dimensions are correct

# loops over indivList (not os.listdir) so that column j always matches indivList[j]
for j in range(0, len(indivList)):
    readFile = dataNewFolder + '/' + indivList[j]

    with open(readFile, "r") as file:
        i = 0
        for line in file.readlines():
            allTesting[i, j] = float(line.strip().split("\t")[1])
            i+=1
            
#for fileName in os.listdir(dataNewFolder):
    #if fileName == '.DS_Store' or fileName == 'MANIFEST.txt':
        #continue
    
    #if fileName in indivList:
        
        #temp_folder = folderIDs[indivList.index(fileName)]
        #readFile = dataNewFolder + '/' + fileName
            
        # adds to luminal A expression matrix
        #if temp_folder in lumA_samples:
            
            #lumA_i = 0
            #with open(readFile, "r") as file:
                #for line in file.readlines():
                    #lumA_exp[lumA_i, lumA_j] = float(line.strip().split("\t")[1])
                    #lumA_i += 1
                    
                #lumA_j += 1

(60483, 832)


In [20]:
## makes a matrix without zero rows and makes a list of only expressed genes
# rows of zeros cause errors in PANDA

# Makes note of which rows are zero rows
rowList = list()
for i in range(0, len(allTesting)):
    if np.sum(allTesting[i,:]) == 0.0:
        rowList.append(i)
        
allTesting2 = np.zeros(((len(allTesting)-len(rowList)), len(indivList))) # matrix without zero rows
geneList2 = list() # list of only the expressed genes
print("OLD ", len(allTesting))
print("ZEROS ", len(rowList))
print("NEW ", len(allTesting2))

j = 0 # index in old matrix
r = 0 # index in new matrix
while (j < len(allTesting)):
    if j in rowList:
        j+=1
    else:
        allTesting2[r,:] = allTesting[j,:]
        geneList2.append(geneList[j])
        j+=1
        r+=1
        
print(len(geneList2))
del allTesting
del geneList

OLD  60483
ZEROS  2393
NEW  58090
58090
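
The zero-row removal above can also be written with a NumPy boolean mask. The following is a minimal sketch under the assumption that the matrix holds non-negative values, so that a row sum of zero means every entry is zero; `drop_zero_rows` and the demo data are hypothetical and not part of the notebook.

```python
import numpy as np


def drop_zero_rows(matrix, row_names):
    """Return the matrix without all-zero rows, plus the matching row names."""
    keep = matrix.sum(axis=1) != 0  # True for rows with at least one non-zero value
    kept_names = [name for name, flag in zip(row_names, keep) if flag]
    return matrix[keep], kept_names


# Toy example: the first gene is never expressed and is dropped.
demo = np.array([[0.0, 0.0],
                 [1.0, 2.0],
                 [0.0, 3.0]])
filtered, names = drop_zero_rows(demo, ["g1", "g2", "g3"])
print(filtered.shape, names)
```

The mask avoids the repeated `j in rowList` list lookups, which grow slow when thousands of rows are removed.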


In [21]:
## TPM normalization
# because this allows genes to be compared well across samples
## TPM_ij = (FPKM_ij / sum_k(FPKM_kj)) * 10^6, where i is the gene, j is the sample, and k runs over all genes

colSums = np.sum(allTesting2, axis = 0) # total FPKM of each sample, computed once
for j in range(0, len(indivList)):
    sumPat = colSums[j]
    
    for i in range(0, len(allTesting2)):
        allTesting2[i,j] = (allTesting2[i,j] / sumPat) * (10**6)
        
# makes sure that the columns (samples) add up to the same sum

for j in range(0, len(allTesting2[0])):
    sumPat = np.sum(allTesting2, axis = 0)[j]
    print(sumPat)

999999.999999999
999999.9999999944
1000000.0000000113
999999.9999999946
999999.999999997
999999.9999999914
999999.9999999987
1000000.000000014
1000000.0000000066
1000000.0000000115
1000000.0000000031
1000000.0000000086
1000000.0000000006
999999.999999993
1000000.0000000064
999999.999999986
1000000.0000000057
999999.9999999962
999999.9999999921
999999.9999999952
1000000.0000000118
1000000.0000000079
999999.9999999978
1000000.0000000092
999999.9999999881
1000000.0000000007
1000000.0000000114
1000000.0000000028
1000000.0000000104
1000000.0000000044
999999.9999999925
1000000.0000000009
999999.9999999919
1000000.0000000108
1000000.000000009
999999.9999999963
1000000.0000000088
999999.9999999979
1000000.0000000021
999999.9999999695
1000000.0000000088
1000000.0000000128
1000000.0000000029
999999.999999991
999999.9999999944
999999.9999999976
999999.999999991
999999.9999999938
1000000.0000000073
1000000.0000000128
999999.9999999832
999999.999999995
1000000.000000002
1000000.000000002
1000000.00

999999.9999999801
999999.9999999965
1000000.0000000002
999999.9999999991
999999.9999999894
1000000.0000000121
999999.9999999905
999999.9999999778
1000000.0000000021
1000000.000000001
999999.9999999998
999999.9999999991
1000000.0000000073
999999.999999996
1000000.0000000033
1000000.000000003
1000000.000000017
999999.9999999977
1000000.0000000005
999999.9999999868
999999.9999999985
999999.9999999978
999999.9999999987
1000000.0000000072
1000000.0000000069
999999.9999999983
1000000.0000000009
1000000.0000000008
1000000.0000000023
1000000.0000000017
1000000.0000000143
1000000.000000006
999999.9999999934
999999.9999999845
999999.9999999951
1000000.0000000059
1000000.0000000001
999999.9999999995
999999.9999999831
999999.9999999806
1000000.0000000033
1000000.0000000208
999999.9999999978
1000000.0000000014
1000000.0000000111
1000000.0000000052
1000000.000000015
1000000.0000000024
1000000.0000000041
999999.9999999879
1000000.0000000116
999999.9999999966
1000000.0000000021
1000000.0000000085
9999
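
The FPKM-to-TPM conversion above divides each value by its sample's total and multiplies by 10^6. A vectorized version of the same arithmetic might look like the sketch below; `fpkm_to_tpm` and the toy matrix are hypothetical, and the function assumes genes are rows, samples are columns, and no sample sums to zero.

```python
import numpy as np


def fpkm_to_tpm(fpkm):
    """Scale every column (sample) of an FPKM matrix so that it sums to 1e6."""
    col_sums = fpkm.sum(axis=0, keepdims=True)  # shape (1, n_samples)
    return fpkm / col_sums * 1e6                # broadcasting divides column-wise


# Toy example with two genes and two samples.
demo = np.array([[1.0, 10.0],
                 [3.0, 30.0]])
tpm = fpkm_to_tpm(demo)
print(tpm.sum(axis=0))  # each sample total is 1e6 up to floating-point rounding
```

The small deviations from exactly 1000000 in the printed sums are ordinary floating-point rounding, so a check such as `np.allclose` is more appropriate than exact equality.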

In [22]:
# Extract TFBM information if the target gene is an expressed gene

motifTFs = list() # list of TFs with genes that are expressed
motifGenes = list() # list of genes with TFs and in expression file
expressedGenes = set(geneList2) # set makes the membership check fast

with open(motifPath, "r") as file:
    for line in file.readlines():
        fields = line.strip().split("\t")
        if fields[2] == "1" and fields[1] in expressedGenes:
            motifTFs.append(fields[0])
            motifGenes.append(fields[1])
                

In [26]:
print(len(motifTFs))
print(len(motifGenes))

1792776
1792776
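
The motif filtering above keeps an edge only when its weight is `1` and its target gene is expressed. The sketch below isolates that logic so it can be tried on a few lines; `filter_motif_edges` is a hypothetical helper, and it assumes the three-column, tab-separated layout (TF, gene, weight) that the notebook code reads. Lines with fewer than three columns are skipped here, which the notebook code does not do.

```python
def filter_motif_edges(lines, expressed_genes):
    """Keep TF-gene edges with weight "1" whose gene is in expressed_genes."""
    expressed = set(expressed_genes)  # set lookup is O(1) per edge
    tfs, targets = [], []
    for line in lines:
        fields = line.strip().split("\t")
        if len(fields) < 3:
            continue  # skip malformed or empty lines
        tf, gene, weight = fields[:3]
        if weight == "1" and gene in expressed:
            tfs.append(tf)
            targets.append(gene)
    return tfs, targets


# Toy example: two of the four edges survive.
demo_lines = [
    "TF1\tENSG1\t1\n",
    "TF1\tENSG2\t0\n",   # weight is not 1
    "TF2\tENSG3\t1\n",   # gene is not expressed
    "TF2\tENSG1\t1\n",
]
demo_tfs, demo_targets = filter_motif_edges(demo_lines, ["ENSG1", "ENSG2"])
print(demo_tfs, demo_targets)
```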


In [23]:
# finds expressed genes that are not in the motif file
# their row indices are collected so they can be removed from the exp matrix

motifGeneSet = set(motifGenes) # set makes the membership check fast
rowList2 = list() # list of indices to remove from expressed genes
for i in range(0, len(geneList2)):
    if geneList2[i] not in motifGeneSet:
        rowList2.append(i)
        
print(len(rowList2))


29216


In [24]:
# removes the rows and genes that are not in the motif file

allTesting3 = np.zeros((len(allTesting2)-len(rowList2), len(indivList)))
geneList3 = list() # list of expressed genes that also appear in the motif file

r = 0 # index of row in allTesting3
for i in range(0, len(allTesting2)):
    if i not in rowList2:
        allTesting3[r,:] = allTesting2[i,:]
        geneList3.append(geneList2[i])
        r+=1
        
print(allTesting3.shape)
print(len(geneList3))

(28874, 832)
28874


In [29]:
# to save memory
del allTesting2
del geneList2

In [25]:
# reads in the ppi data and keeps the interactions whose first protein is a TF in the motif data

ppiPath = '/Users/ursulawidocki/Desktop/BreCaNet/Data/ppi2015_freeze.txt'

protein1 = list()
protein2 = list()
motifTFSet = set(motifTFs) # set makes the membership check fast

with open(ppiPath, "r") as file:
    next(file) #skips the first line since that is just a header
    for line in file.readlines():
        fields = line.strip().split("\t")
        if fields[0] in motifTFSet:
            protein1.append(fields[0])
            protein2.append(fields[1])

In [26]:
# add checks here
len(protein1)

84387

In [27]:
# convert ENS IDs to Entrez

convertFilePath = "/Users/ursulawidocki/Desktop/BreCaNet/Data/geneNames.txt"

# gets Entrez IDs of genes 
with open(convertFilePath, "r") as file:
    next(file)
    for line in file.readlines():
        #print(line.strip().split("\t")[0], line.strip().split("\t")[1])
        ensID = line.strip().split("\t")[0]
        if ensID in geneList3:
            geneList3[geneList3.index(ensID)] = line.strip().split("\t")[1]

# locates duplicate genes
print(len(geneList3) - len(set(geneList3))) # there are no duplicates

0
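
The ID conversion above replaces each Ensembl ID in place and leaves unmatched IDs unchanged. A dictionary-based sketch of the same idea follows; `map_ids` is a hypothetical helper, and it assumes the two-column, tab-separated layout (Ensembl ID, new ID) read by the notebook code, with the header line already skipped. As in the loop above, the first mapping seen for an ID wins.

```python
def map_ids(ids, mapping_lines):
    """Translate ids through a two-column mapping; unmapped ids are kept as-is."""
    mapping = {}
    for line in mapping_lines:
        fields = line.strip().split("\t")
        if len(fields) >= 2 and fields[1]:
            mapping.setdefault(fields[0], fields[1])  # keep the first mapping only
    return [mapping.get(gene, gene) for gene in ids]


# Toy example: ENSG2 has no mapping and the duplicate ENSG1 line is ignored.
demo_mapping = ["ENSG1\t100\n", "ENSG3\t300\n", "ENSG1\t999\n"]
converted = map_ids(["ENSG1", "ENSG2", "ENSG3"], demo_mapping)
print(converted)
print(len(converted) - len(set(converted)))  # number of duplicate IDs after mapping
```

Building the dictionary once avoids calling `list.index` for every line of the mapping file.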


In [32]:
# makes the expression file

PANDAexpPath = '/Users/ursulawidocki/Desktop/BreCaNet/Data/PANDAinput/PANDAgeneExp.txt'
geneExpFile = open(PANDAexpPath,"w+")

for i in range(0, len(geneList3)):
    newLine = geneList3[i]
    
    for j in range(0, len(indivList)):
        newLine = newLine + "\t" + str(allTesting3[i,j])
        
    newLine = newLine + "\n"
    geneExpFile.write(newLine)
    
geneExpFile.close()
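
The expression file written above has one gene per line: the gene ID followed by tab-separated values, with no header. A compact sketch of that writer using a context manager is shown below; `write_expression` is a hypothetical helper, and the IDs and values in the usage are made up for illustration.

```python
import os
import tempfile

import numpy as np


def write_expression(path, gene_ids, matrix):
    """Write one tab-separated line per gene: the ID, then its expression values."""
    with open(path, "w") as handle:  # the file is closed even if an error occurs
        for gene, row in zip(gene_ids, matrix):
            values = "\t".join(str(float(v)) for v in row)
            handle.write(f"{gene}\t{values}\n")


# Toy example written to a temporary folder.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_exp.txt")
    write_expression(demo_path, ["geneA", "geneB"], np.array([[1.5, 0.0], [2.0, 3.25]]))
    with open(demo_path) as handle:
        demo_rows = handle.read().splitlines()
print(demo_rows)
```

Joining each row once avoids the repeated string concatenation in the inner loop, which matters for a matrix with hundreds of columns.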

In [33]:
# makes the motif file

PANDAmotifPath = '/Users/ursulawidocki/Desktop/BreCaNet/Data/PANDAinput/PANDAmotifs.txt'
motifFile = open(PANDAmotifPath, "w+")

for i in range(0, len(motifTFs)):
    newLine = motifTFs[i] + "\t" + motifGenes[i] + "\t" + "1.000000" + "\n"
    motifFile.write(newLine)

motifFile.close()

In [34]:
# makes the ppi file

PANDAppiPath = '/Users/ursulawidocki/Desktop/BreCaNet/Data/PANDAinput/PANDAppi.txt'
ppiFile = open(PANDAppiPath, "w+")

for i in range(0, len(protein1)):
    newLine = protein1[i] + "\t" + protein2[i] + "\t" + "1" + "\n"
    ppiFile.write(newLine)

ppiFile.close()

In [35]:
# makes a file of the Entrez IDs

PANDAgeneListPath = '/Users/ursulawidocki/Desktop/BreCaNet/Data/PANDAinput/finalGeneList.txt'
geneFile = open(PANDAgeneListPath, "w+")

for i in range(0, len(geneList3)):
    newLine = geneList3[i] + "\n" # one ID per line, no trailing tab
    geneFile.write(newLine)
    
geneFile.close()

### Terminal commands to run pypanda