####Permutation experiment for FVL modifiers project

##File preparation

Downloaded **mouse.protein.gpff.fromRefSeq09-2014** from the UniProt website (http://www.uniprot.org/).

As references I used the **annovar_mm10_refLink.txt** and **annovar_mm10_refGene.txt** files, which were also used to annotate the variants in the variant filtration pipeline (FVLmod_VariantFiltration.ipynb).

These files were used to identify the protein sequence length for each gene annotated in our variant file. When a gene name maps to several proteins, the longest one is kept. I used **annotate.py** for this step:

In [None]:
##UNIX command:
./annotate.py

In [None]:
#!/usr/bin/env python
# Define the protein-coding refGene gene set and the protein length of each gene

pIDAA = dict()   # protein ID -> protein length (from the LOCUS lines)
gIDAA = dict()   # refLink column 3 ID -> protein length
nameAA = dict()  # gene name -> longest protein length seen for that name

# LOCUS lines carry the protein ID (field 2) and its length (field 3)
with open('mouse.protein.gpff.fromRefSeq09-2014', 'r') as input1:
    for line in input1:
        if not line.startswith('LOCUS'):
            continue
        row = line.strip().split()
        pIDAA[row[1]] = row[2]

# refLink: carry the length over from the protein ID (column 4) to the ID in column 3
with open('annovar_mm10_refLink.txt', 'r') as input2:
    for line in input2:
        row = line.strip().split('\t')
        if row[3] in pIDAA:
            gIDAA[row[2]] = pIDAA[row[3]]

# refGene: keep, for each gene name (column 13), the longest protein length
with open('annovar_mm10_refGene.txt', 'r') as input3:
    for line in input3:
        row = line.strip().split('\t')
        name = row[12]
        gID = row[1]
        if gID not in gIDAA:
            continue
        if name not in nameAA or int(gIDAA[gID]) > int(nameAA[name]):
            nameAA[name] = gIDAA[gID]

# Output columns: protein length, gene name, running counter (1-based)
with open('refGene_protein_coding_genes.txt', 'w') as outfile:
    for counter, name in enumerate(sorted(nameAA), start=1):
        print(nameAA[name], name, str(counter), file=outfile)

	

Next, using R, I added a sampling weight for each gene, proportional to its protein size (protein length divided by the mean protein length across the 20582 genes):

In [None]:
table=read.table('refGene_protein_coding_genes.txt',head=FALSE)
table$V4=table$V1/(sum(table$V1)/20582)
write.table(table,file='refGene_protein_coding_genes_forSim.txt',sep="\t",quote=FALSE,col.names=FALSE,row.names=FALSE)
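
As a rough check of what this weighting does, here is a minimal Python sketch on made-up protein lengths (the four lengths are hypothetical and not taken from the real table). It mirrors the R arithmetic: each length is divided by the mean length, so an average-sized protein gets weight 1. Once the weights are renormalized to sum to 1, as simulate.py does before sampling, each gene's probability is simply its share of the total protein length.

```python
import numpy as np

# Hypothetical protein lengths (amino acids) for four made-up genes.
lengths = np.array([100.0, 200.0, 300.0, 400.0])

# Same arithmetic as the R step: V4 = V1 / (sum(V1) / number_of_genes),
# i.e. each length divided by the mean length.
weights = lengths / (lengths.sum() / len(lengths))

# Renormalizing to probabilities removes the constant factor again.
probs = weights / weights.sum()

print(weights)  # approximately [0.4 0.8 1.2 1.6]
print(probs)    # approximately [0.1 0.2 0.3 0.4]
```

In other words, the division by the mean only rescales the weights; it does not change the relative sampling probabilities.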

The sampling of 3482 hits out of 20582 protein-coding genes by weighted probability was done with the **simulate.py** program. In total, 10 x 1,000,000 permutations were performed.

In [None]:
##UNIX command:
./simulate.py 1000000 3482 out1.txt
./simulate.py 1000000 3482 out2.txt
./simulate.py 1000000 3482 out3.txt
./simulate.py 1000000 3482 out4.txt
./simulate.py 1000000 3482 out5.txt
./simulate.py 1000000 3482 out6.txt
./simulate.py 1000000 3482 out7.txt
./simulate.py 1000000 3482 out8.txt
./simulate.py 1000000 3482 out9.txt
./simulate.py 1000000 3482 out10.txt

**simulate.py**:

In [None]:
#!/usr/bin/env python

import sys

import numpy as np

# Command-line arguments:
# NUM_SAMPLES is the number of permutations
# mut is the number of mutations per permutation
# out is the output filename
NUM_SAMPLES = int(sys.argv[1])
mut = int(sys.argv[2])
out = sys.argv[3]

# gene ids (1-based counter, column 3) and sampling weights (column 4);
# ids are cast to int because they are used as array indices below
ids = np.loadtxt('refGene_protein_coding_genes_forSim.txt', usecols=[2]).astype(int)
probs = np.loadtxt('refGene_protein_coding_genes_forSim.txt', usecols=[3])
probs = probs / np.sum(probs)

# gene_freqs[g][k] = number of permutations in which gene g was hit k times
# (mut + 1 columns so that k can range from 0 to mut)
gene_freqs = np.zeros((len(ids), mut + 1))
for i in range(NUM_SAMPLES):
    if i % 1000 == 0:
        print(i)

    sample = np.random.choice(ids, size=mut, p=probs)
    counts = np.zeros(len(ids), dtype=int)
    selected_genes = set()
    # only counts if a gene has a non-zero frequency in this sample
    # can add the zero frequencies at the end
    for sampled_gene in sample:
        counts[sampled_gene - 1] += 1
        selected_genes.add(sampled_gene - 1)
    for j in selected_genes:
        gene_freqs[j][counts[j]] += 1

# fill in counts for number of samples a gene wasn't selected
with open(out, 'w') as outfile:
    for a in range(len(ids)):
        gene_freqs[a][0] = NUM_SAMPLES - np.sum(gene_freqs[a])
        # write the 0-based gene index and the frequencies for 0 to 20 hits
        outfile.write(str(a) + "\t" + "\t".join(map(str, gene_freqs[a][0:21])) + "\n")
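
To make the bookkeeping in simulate.py easier to follow, the sketch below runs the same idea on a toy problem: three hypothetical genes, five hits per permutation, and a fixed random seed. It tallies hits per gene with `np.bincount` instead of the explicit inner loop; that is a substitution made for illustration, not what the script does. The names `freqs`, `n_perm` and `n_hits` and the probabilities are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the toy run is repeatable

probs = np.array([0.1, 0.3, 0.6])  # hypothetical per-gene sampling probabilities
n_genes = len(probs)
n_perm = 1000  # permutations (NUM_SAMPLES in simulate.py)
n_hits = 5     # hits per permutation (mut in simulate.py)

# freqs[g, k] = number of permutations in which gene g received exactly k hits
freqs = np.zeros((n_genes, n_hits + 1), dtype=int)

for _ in range(n_perm):
    # draw 0-based gene indices with replacement, weighted by probs
    sample = rng.choice(n_genes, size=n_hits, p=probs)
    # hits per gene in this permutation, including genes with zero hits
    counts = np.bincount(sample, minlength=n_genes)
    # one increment per gene, in the column matching its hit count
    freqs[np.arange(n_genes), counts] += 1

print(freqs)
```

Because zero-hit genes are counted directly here, every row of `freqs` sums to the number of permutations; simulate.py reaches the same totals by filling in column 0 at the end.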

The final step was to combine the data from the 10 simulations in R:

In [None]:
data1=read.table('out1.txt')
data2=read.table('out2.txt')
data3=read.table('out3.txt')
data4=read.table('out4.txt')
data5=read.table('out5.txt')
data6=read.table('out6.txt')
data7=read.table('out7.txt')
data8=read.table('out8.txt')
data9=read.table('out9.txt')
data10=read.table('out10.txt')

data=data1+data2+data3+data4+data5+data6+data7+data8+data9+data10

data$V1=seq(1,20582,1)

input=read.table('refGene_protein_coding_genes_forSim.txt')

new=cbind(input,data)

write.table(new,file="simulation_new_10M_3482.txt", col.names=FALSE, append=FALSE, quote=FALSE,row.names=FALSE,sep="\t")
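
For readers who prefer to stay in Python, the same elementwise sum can be sketched as follows. The two arrays are small made-up stand-ins for the out*.txt tables (first column a gene index, remaining columns hit-count frequencies); the real files would first have to be loaded, for example with `np.loadtxt`. The sketch also shows why the index column is rewritten afterwards: adding whole tables sums the index column too.

```python
import numpy as np

# Two made-up result tables: column 0 is a gene index, the rest are counts.
run1 = np.array([[0, 90, 10], [1, 70, 30]])
run2 = np.array([[0, 85, 15], [1, 75, 25]])

# Elementwise sum; this also adds up the index column, like data1+data2 in R.
combined = run1 + run2

# Reset the first column to a 1-based gene index, like data$V1=seq(...) in R.
combined[:, 0] = np.arange(1, combined.shape[0] + 1)

print(combined)
```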