Thanks very much - Veronica, Ru Chen and I went over the pedigrees this morning and reverified the identities. I have attached the sample list here.

The sibship defined by B04522, B04523, B04524, B04525, B04526, B04527 and B04528 should all be compared to each other. Since B04523 and B04528 could not be collected, we are left assuming that B04522, B04524, B04525, B04526 and B04527 are full sibs. Of those 5 samples, 3 are affected and 2 are unaffected. We are interested in the list of rare variants (either coding or non-coding) that are shared among the affecteds (B04522, B04524 and B04525) and not shared with the unaffecteds (B04526, B04527).
These can be homozygous (autosomal recessive model) or heterozygous (autosomal dominant model).
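
The two inheritance models above can be written as a per-variant test. This is only a sketch under my own assumptions: genotypes are plain GT strings, the helper names (`carries_alt`, `is_hom_alt`, `segregates`) are hypothetical, a no-call (`./.`) is treated as not carrying the variant, and under the recessive model unaffected sibs may still be heterozygous carriers.

```python
def carries_alt(gt):
    """True if a GT string such as '0/1' or '1|1' holds any non-reference allele."""
    alleles = gt.replace('|', '/').split('/')
    return any(a not in ('0', '.') for a in alleles)


def is_hom_alt(gt):
    """True if every allele in the GT string is the same non-reference allele."""
    alleles = gt.replace('|', '/').split('/')
    return len(set(alleles)) == 1 and alleles[0] not in ('0', '.')


def segregates(row, affected, unaffected, model='dominant'):
    """Check one variant (dict of sample -> GT) against an inheritance model."""
    if model not in ('dominant', 'recessive'):
        raise ValueError('model must be "dominant" or "recessive"')
    test = carries_alt if model == 'dominant' else is_hom_alt
    in_all_affected = all(test(row[s]) for s in affected)
    in_any_unaffected = any(test(row[s]) for s in unaffected)
    return in_all_affected and not in_any_unaffected


affected = ['B04522', 'B04524', 'B04525']
unaffected = ['B04526', 'B04527']
samples = affected + unaffected

# three made-up variants, genotypes listed in the order of `samples`
rows = [
    dict(zip(samples, ['0/1', '0/1', '0/1', '0/0', '0/0'])),
    dict(zip(samples, ['1/1', '1/1', '1/1', '0/1', '0/0'])),
    dict(zip(samples, ['0/1', '0/0', '0/1', '0/0', '0/0'])),
]
dominant = [segregates(r, affected, unaffected, 'dominant') for r in rows]
recessive = [segregates(r, affected, unaffected, 'recessive') for r in rows]
```

The same function covers the two third cousins by passing both as `affected` and an empty list as `unaffected`.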

B04529 and B04530 are third cousins to each other, both affected. We are interested in the list of rare variants that are shared between them.

The other samples are all individuals from different families, unrelated to each other. It is possible (though not likely, given what we know about the genetic architecture of nonsyndromic brain aneurysms) that one or more of these samples has a rare variant in or near a gene that is flagged by one of the above approaches.
If there is an algorithmic way to approach that, I'd appreciate you letting me know, but as far as I am aware the onus will be on me to scroll through the variant lists for those individuals one at a time to see by eye whether any rare variants in these singletons are in the same genes as are flagged by the filtering of the multiplex sibship and/or the two third cousins.

# email correspondence to help filtering

The set of samples IBA1863, IBA1823, IBA1801, IBA1837 as labelled here each represent individual affected members from multiplex families. For each of these we will want .vcf files that represent all rare coding variants present at a MAF of 0.01 or less in gnomAD. It is likely that these families each have a different causative allele, <font color='red'>so there is no rationale for overlapping them to exclude rare alleles from consideration.</font>

It is possible that any two of them may have different causative alleles in the same gene, but that is not likely so we will analyse them in conjunction with other rare families we have.



The two samples labelled "IV-1" and "IV-2" here will be provided with IBA numbers when we send them to you. As you can see, they are third cousins. Thus, we would want to analyse these samples together - the .vcf files we would want from these two can safely <font color='red'>be restricted to rare coding variants with a MAF of 0.01 or less in gnomAD that are shared between both individuals.</font>



Lastly (though perhaps most importantly), the multiplex family II-1, II-2, II-3, II-4, II-5, II-6 and II-7.



Here we are looking for rare coding variants that satisfy the following
<font color='red'>

criteria:

1. MAF of 0.01 or less in gnomAD

2. Shared among all 4 of II-1, II-2, II-4, II-5.

AND

3. Not shared with any of II-3, II-6 or II-7.
</font>


We would appreciate a quote for an SOW that satisfied the above parameters, with 40X coverage across the whole genome.



Of course we are most interested in the variants that are predicted to be damaging. If CADD scores can be provided, that would be great, alongside predictions from SIFT, PolyPhen, MutationTaster and/or Align GVGD. I am not sure if there is a "standard" pipeline that you would use to call heterozygous dominant variants.

A haplotype is a group of genes within an organism that was inherited together from a single parent. The word "haplotype" is derived from the word "haploid," which describes cells with only one set of chromosomes, and from the word "genotype," which refers to the genetic makeup of an organism. A haplotype can describe a pair of genes inherited together from one parent on one chromosome, or it can describe all of the genes on a chromosome that were inherited together from a single parent. This group of genes was inherited together because of genetic linkage, or the phenomenon by which genes that are close to each other on the same chromosome are often inherited together. In addition, the term "haplotype" can also refer to the inheritance of a cluster of single nucleotide polymorphisms (SNPs), which are variations at single positions in the DNA sequence among individuals.

By examining haplotypes, scientists can identify patterns of genetic variation that are associated with health and disease states. For instance, if a haplotype is associated with a certain disease, then scientists can examine stretches of DNA near the SNP cluster to try to identify the gene or genes responsible for causing the disease.

# annotations

https://macarthurlab.org/2017/02/27/the-genome-aggregation-database-gnomad/

In [7]:
import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt
%reload_ext autoreload
%autoreload 2
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [8]:
wkdir = '/projects/da_workspace/DA-337/'

In [9]:
siblibs = ['B04531', 'B04532','B04533','B04534']

# vcf v4.2 specification

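QUAL : Phred-scaled quality score for the ALT call, −10log10 prob(call in ALT is wrong); for a called ALT allele this is −10log10 prob(no variant)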

AC : allele count in genotypes, for each ALT allele, in the same order as listed. It is the alternative allele count across all samples: AC=3 means 3 alt alleles were found among the 5 individuals in this case.

AF : allele frequency for each ALT allele in the same order as listed: use this when estimated from primary
data, not called genotypes, AF= AC/AN

AN : total number of alleles in called genotypes, in this case AN=10, 5 diploid samples

BQ : RMS base quality at this position

DB : dbSNP membership

DP : combined depth across samples, e.g. DP=154
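
To make the AC / AN / AF arithmetic concrete, here is a small sketch that recomputes the three values from called genotypes. It assumes a biallelic site (only ALT allele `1` is counted) and skips missing alleles; `allele_counts` is a hypothetical helper, not part of any pipeline used here.

```python
def allele_counts(genotypes):
    """Return (AC, AN, AF) for the first ALT allele from a list of GT strings."""
    alleles = []
    for gt in genotypes:
        # treat phased and unphased calls alike and drop missing alleles ('.')
        alleles += [a for a in gt.replace('|', '/').split('/') if a != '.']
    an = len(alleles)                      # number of called alleles
    ac = alleles.count('1')                # copies of the first ALT allele
    af = ac / an if an else float('nan')   # AF = AC / AN
    return ac, an, af


# 5 diploid samples, 3 of them heterozygous
example = ['0/1', '0/1', '0/1', '0/0', '0/0']
print(allele_counts(example))  # (3, 10, 0.3)
```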

GT : genotype, encoded as allele values separated by either of / or |. The allele values are 0 for the reference
allele (what is in the REF field), 1 for the first allele listed in ALT, 2 for the second allele listed in ALT and
so on. For diploid calls examples could be 0/1, 1|0, or 1/2, etc. For haploid calls, e.g. on Y, male nonpseudoautosomal X, or mitochondrion, only one allele value should be given; a triploid call might look like
0/0/1. If a call cannot be made for a sample at a given locus, ‘.’ should be specified for each missing allele
in the GT field (for example ‘./.’ for a diploid genotype and ‘.’ for haploid genotype). The meanings of the
separators are as follows (see the PS field below for more details on incorporating phasing information into the
genotypes):

◦ / : genotype unphased

◦ | : genotype phased

GL : genotype likelihoods comprised of comma separated floating point log10-scaled likelihoods for all possible
genotypes given the set of alleles defined in the REF and ALT fields. In presence of the GT field the same
ploidy is expected and the canonical order is used; without GT field, diploidy is assumed. If A is the allele in
REF and B,C,... are the alleles as ordered in ALT, the ordering of genotypes for the likelihoods is given by:
F(j/k) = (k*(k+1)/2)+j. In other words, for biallelic sites the ordering is: AA,AB,BB; for triallelic sites the
ordering is: AA,AB,BB,AC,BC,CC, etc. For example: GT:GL 0/1:-323.03,-99.29,-802.53 (Floats)


DP : read depth at this position for this sample (Integer)

GQ : conditional genotype quality, encoded as a phred quality −10log10 p(genotype call is wrong, conditioned
on the site’s being variant) (Integer)

PL : the phred-scaled genotype likelihoods rounded to the closest integer (and otherwise defined precisely as
the GL field) (Integers)
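
A short sketch of the GL ordering formula and of how PL relates to GL. Two assumptions of mine: `genotype_order` and `gl_to_pl` are hypothetical helper names, and PL is shifted so that the most likely genotype gets 0, which matches the PL values in the GATK output further down but is not spelled out in the field definition above.

```python
def genotype_order(alleles):
    """List diploid genotypes in GL/PL order, i.e. by index F(j/k) = k*(k+1)/2 + j."""
    return [alleles[j] + alleles[k]
            for k in range(len(alleles))
            for j in range(k + 1)]


def gl_to_pl(gl):
    """Phred-scale log10 likelihoods and shift them so the best genotype has PL 0."""
    best = max(gl)
    return [round(-10 * (g - best)) for g in gl]


print(genotype_order('AB'))                   # ['AA', 'AB', 'BB']
print(genotype_order('ABC'))                  # ['AA', 'AB', 'BB', 'AC', 'BC', 'CC']
print(gl_to_pl([-323.03, -99.29, -802.53]))   # [2237, 0, 7032]
```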

Phasing:  
If the GTs in the VCF have the pipe symbol instead of the forward slash, i.e., 1|0, 1|1, or 0|1, it means that we know which variants were called on the same paternal or maternal chromosome, i.e., the same haplotype. Getting this level of information reliably requires a special type of sample prep that I won't go into right now. To interpret this data, all variants reported on the left of the pipe symbol will be on the same haplotype, whilst all on the right will be on the other.
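
A minimal sketch of reading a GT string with the separator rules above. `parse_gt` is a hypothetical helper; it returns the allele indices (with `None` for a missing allele) and whether the call is phased.

```python
def parse_gt(gt):
    """Split a GT string into allele indices and a phased flag."""
    phased = '|' in gt
    sep = '|' if phased else '/'
    alleles = [None if a == '.' else int(a) for a in gt.split(sep)]
    return alleles, phased


print(parse_gt('1|0'))   # ([1, 0], True): ALT on the left haplotype
print(parse_gt('0/1'))   # ([0, 1], False): heterozygous, phase unknown
print(parse_gt('./.'))   # ([None, None], False): no call
```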


20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.  
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3  
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4  
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2  
20 1234567 microsat1 GTC G,GTCT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3  
This example shows (in order): a good simple SNP, a possible SNP that has been filtered out because its quality is
below 10, a site at which two alternate alleles are called, with one of them (T) being ancestral (possibly a reference
sequencing error), a site that is called monomorphic reference (i.e. with no alternate alleles), and a microsatellite with
two alternative alleles, one a deletion of 2 bases (TC), and the other an insertion of one base (T). Genotype data are
given for three samples, two of which are phased and the third unphased, with per sample genotype quality, depth
and haplotype qualities (the latter only for the phased samples) given as well as the genotypes. The microsatellite
calls are unphased.
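
The INFO column in the example mixes `key=value` pairs with bare flags such as `DB` and `H2`. A small sketch of a parser that keeps both (the name `parse_info` is mine; values are left as strings):

```python
def parse_info(info):
    """Turn an INFO string into a dict; flags without '=' map to True."""
    out = {}
    for item in info.split(';'):
        if not item:
            continue
        key, sep, value = item.partition('=')
        out[key] = value if sep else True
    return out


info = parse_info('NS=3;DP=14;AF=0.5;DB;H2')
print(info['AF'], info['DB'])  # 0.5 True
```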

# annotated vcf

In [10]:
# f = '/projects/da_workspace/DA-337/haplotype/B04522_27/haplotype_caller-1.1.4/joint/B04522_B04524_B04525_B04526_B04527/379/joint_calling.exclude_no_variation.snpeff.hg19_multianno.vcf.gz'
# f ='/projects/da_workspace/DA-337/haplotype/B04529_30/haplotype_caller-1.1.4/joint/B04529_B04530/382/joint_calling.exclude_no_variation.snpeff.hg19_multianno.vcf.gz'
f = '/projects/da_workspace/DA-337/haplotype/B04531_34/haplotype_caller-1.1.4/joint/B04531_B04533_B04534_B04532/387/joint_calling.exclude_no_variation.snpeff.hg19_multianno.vcf.gz'
df = pd.read_csv(f, sep='\t', comment='#', compression='gzip', low_memory=False, header=None)
df.columns = ['CHROM', 'POS','ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO', 
              'FORMAT', 'B04531', 'B04532','B04533','B04534']
df.head(2)

Unnamed: 0,CHROM,POS,ID,REF,ALT,QUAL,FILTER,INFO,FORMAT,B04531,B04532,B04533,B04534
0,1,10146,rs779258992,AC,A,465.56,PASS,AC=2;AF=0.250;AN=8;BaseQRankSum=-1.480e-01;DB;...,GT:AD:DP:GQ:PGT:PID:PL,"0/0:123,0:123:0:.:.:0,0,922","0/1:2,6:8:23:0|1:10146_AC_A:70,0,23","0/0:116,0:116:0:.:.:0,0,1211","0/1:7,19:26:99:.:.:403,0,119"
1,1,10230,rs775928745,AC,A,50.68,PASS,AC=1;AF=0.125;AN=8;BaseQRankSum=0.770;DB;DP=28...,GT:AD:DP:GQ:PGT:PID:PL,"0/0:85,0:85:0:.:.:0,0,385","0/0:75,0:75:0:.:.:0,0,1074","0/1:9,5:14:56:0|1:10230_AC_A:56,0,303","0/0:101,0:101:0:.:.:0,0,1226"


In [11]:
df.shape

(8786409, 13)

In [12]:
# filter step by  step
df = df[df.FILTER=='PASS']

In [13]:
df.shape

(7848131, 13)

In [14]:
def get_gnomAD_anno(info):
    # split INFO into key=value pairs; flag-only entries (e.g. DB) have no '=' and are skipped
    sps = [i.split('=', 1) for i in info.split(';')]
    d = {i[0]: i[1] for i in sps if len(i) == 2}
    ks = ['cosmic70',  'ExAC_ALL', 'gnomAD_genome_ALL'] #'Gene.refGene', 'AAChange.refGene',
    # fall back to '.', the missing-value marker already used in these columns, if a key is absent
    return pd.Series([d.get(k, '.') for k in ks])

In [15]:
# keep it easy for now: pick HIGH, then MODERATE, then LOW, and the first transcript. could pull out all transcripts if needed
def get_annotations(info):
    # the EFF value ends at the next ';', so annotations that follow it in INFO are not mixed in
    effs = info.split('EFF=')[1].split(';')[0].split(',')
    # extract effect, impact, functional class, amino acid change, gene and transcript
    effs = ['@'.join(list(np.array(re.split(r'\(|\|', ef))[[0, 1, 2, 4, 6, 9]])) for ef in effs if ('HIGH' in ef) or ('MODERATE' in ef) or ('LOW' in ef)]
    # drop duplicates but keep the original order, so [0] really is the first transcript
    effs = list(dict.fromkeys(effs))
    high = [ef for ef in effs if 'HIGH' in ef]
    moderate = [ef for ef in effs if 'MODERATE' in ef]
    low = [ef for ef in effs if 'LOW' in ef]
    if high:
        anno = high[0]
    elif moderate:
        anno = moderate[0]
    elif low:
        anno = low[0]
    else:
        # no HIGH/MODERATE/LOW effect: return placeholders instead of failing on an undefined anno
        return pd.Series(['.'] * 6)
    # anno = '@'-joined annotation fields
    return pd.Series(anno.split('@'))


In [16]:
libs = [i for i in df.columns if i.startswith('B')]
target_feat = 'GT'

In [17]:
def get_genotype(target_feat, features, x):
    feats = features.split(':')
    x = x.split(':')
    d = {k:v for k, v in zip(feats, x)}
    return d[target_feat]

In [18]:
get_genotype(target_feat, 'GT:AD:DP:GQ:PL',  '0/1:36,23:59:99:827,0,1516')

'0/1'

In [19]:
# .copy() makes dfhml an independent frame, so the column assignments below do not raise SettingWithCopyWarning
dfhml = df[(df['INFO'].str.contains("HIGH"))|(df['INFO'].str.contains("MODERATE"))|(df['INFO'].str.contains("LOW"))].copy()
if not dfhml.empty:
    dfhml[['variant', 'impact','mutation_type','aa_change','gene','transcript']] = dfhml['INFO'].apply(lambda x: get_annotations(x))
    dfhml[['cosmic70',  'ExAC_ALL', 'gnomAD_genome_ALL']] = dfhml['INFO'].apply(lambda x: get_gnomAD_anno(x))
    # pull the GT value of every sample into its own <sample>_gt column
    for lib in libs:
        dfhml['_'.join([lib, 'gt'])] = dfhml[['FORMAT', lib]].apply(lambda x: get_genotype(target_feat, x['FORMAT'], x[lib]), axis=1)
    dfhml.drop(['INFO', 'FILTER'], axis=1, inplace=True)


In [20]:
dfhml.shape

(57901, 24)

In [23]:
dfhml.head()

Unnamed: 0,CHROM,POS,ID,REF,ALT,QUAL,FORMAT,B04531,B04532,B04533,...,gene,transcript,cosmic70,ExAC_ALL,gnomAD_genome_ALL,B04531_gt,B04532_gt,B04533_gt,B04534_gt,gnomAD_genome_ALL_tmp
31,1,13656,.,CAG,C,4614.28,GT:AD:DP:GQ:PL,"0/1:28,13:41:99:445,0,1244","0/1:40,14:54:99:449,0,1620","0/1:48,36:84:99:1337,0,1903",...,DDX11L1,ENST00000518655,.,.,0.4901,0/1,0/1,0/1,0/1,0.4901
65,1,15903,rs557514207,G,GC,1806.22,GT:AD:DP:GQ:PL,"1/1:0,28:28:84:995,84,0","1/1:0,12:12:36:428,36,0","0/1:20,13:33:99:394,0,612",...,WASH7P,ENST00000438504,.,.,0.5042,1/1,1/1,0/1,./.,0.5042
924,1,523495,rs573870209,AC,A,39.28,GT:AD:DP:GQ:PL,"0/0:9,0:9:24:0,24,360","0/1:4,2:6:47:47,0,106","0/0:7,0:7:4:0,4,141",...,RP5-857K21.1,ENST00000417636,.,.,0.1124,0/0,0/1,0/0,0/0,0.1124
2491,1,874778,rs568340123,GCCTCCCCAGCCACGGTGAGGACCCACCCTGGCATGATCCCCCTCATCA,G,438.71,GT:AD:DP:GQ:PL,"0/0:35,0:35:93:0,93,1395","0/0:42,0:42:99:0,120,1800","0/0:51,0:51:99:0,107,1800",...,SAMD11,ENST00000341065,ID\x3dCOSM1344642\x3bOCCURENCE\x3d2(large_inte...,0.0490,0.0485,0/0,0/0,0/0,0/1,0.0485
2502,1,877831,rs6672356,T,C,7959.25,GT:AD:DP:GQ:PL,"1/1:0,35:35:99:1424,104,0","1/1:0,54:54:99:2321,162,0","1/1:0,44:44:99:1843,132,0",...,SAMD11,ENST00000455979,.,0.9999,0.9998,1/1,1/1,1/1,1/1,0.9998


In [24]:
# '.' means the variant is absent from gnomAD; treat it as frequency 0 so it passes the rarity filter
dfhml = dfhml.assign(gnomAD_genome_ALL_tmp=dfhml['gnomAD_genome_ALL'].replace('.', .0))


In [57]:
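# keep variants rarer than 1% in gnomAD; .copy() gives dff its own data so later edits do not touch dfhml
dff = dfhml[(dfhml.gnomAD_genome_ALL_tmp.astype(float) < 0.01) | (dfhml.gnomAD_genome_ALL_tmp.isna())].copy()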

In [58]:
dff.head()
dff.shape

Unnamed: 0,CHROM,POS,ID,REF,ALT,QUAL,FORMAT,B04531,B04532,B04533,...,gene,transcript,cosmic70,ExAC_ALL,gnomAD_genome_ALL,B04531_gt,B04532_gt,B04533_gt,B04534_gt,gnomAD_genome_ALL_tmp
2534,1,888628,.,C,T,775.71,GT:AD:DP:GQ:PL,"0/0:44,0:44:99:0,115,1762","0/0:36,0:36:93:0,93,1395","0/0:42,0:42:99:0,104,1525",...,NOC2L,ENST00000327044,.,.,.,0/0,0/0,0/0,0/1,0.0
2576,1,900526,rs146302352,G,A,620.71,GT:AD:DP:GQ:PL,"0/0:32,0:32:80:0,80,1292","0/1:20,18:38:99:630,0,684","0/0:26,0:26:72:0,72,1080",...,KLHL17,ENST00000455747,.,0.0011,0.0008,0/0,0/1,0/0,0/0,0.0008
2828,1,955596,rs764659938,C,G,862.71,GT:AD:DP:GQ:PGT:PID:PL,"0/0:54,0:54:99:.:.:0,100,1800","0/1:24,23:47:99:0|1:955596_C_G:872,0,1528","0/0:49,0:49:99:.:.:0,108,1800",...,AGRN,ENST00000379370,.,0.0002,0.0010,0/0,0/1,0/0,0/0,0.001
2951,1,985855,rs147990356,C,T,688.71,GT:AD:DP:GQ:PL,"0/0:40,0:40:99:0,99,1575","0/0:43,0:43:99:0,101,1613","0/0:41,0:41:99:0,114,1710",...,AGRN,ENST00000379370,.,0.0061,0.0051,0/0,0/0,0/0,0/1,0.0051
3650,1,1221539,.,C,T,728.71,GT:AD:DP:GQ:PL,"0/0:45,0:45:99:0,108,1687","0/0:41,0:41:99:0,105,1533","0/0:35,0:35:99:0,99,1420",...,SCNN1D,ENST00000379101,.,.,.,0/0,0/0,0/0,0/1,0.0


(6101, 25)
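
One detail worth checking against the email above: the request was for a MAF of "0.01 or less", while the filter here uses a strict `<`. A hedged sketch of a per-value test that makes the boundary explicit and handles the `.` placeholder directly; `is_rare` and its parameters are hypothetical names of mine.

```python
def is_rare(af_field, max_af=0.01, inclusive=True):
    """Decide whether one gnomAD frequency field counts as rare.

    '.' (variant not present in gnomAD) is treated as rare.
    """
    if af_field in ('.', '', None):
        return True
    af = float(af_field)
    return af <= max_af if inclusive else af < max_af


fields = ['0.4901', '.', '0.0008', '0.01']
print([is_rare(f) for f in fields])                   # [False, True, True, True]
print([is_rare(f, inclusive=False) for f in fields])  # [False, True, True, False]
```

In the notebook this could be applied as `dfhml['gnomAD_genome_ALL'].map(is_rare)` to build the boolean mask.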

# filter based on genotype

In [69]:
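# genotypes counted as "has the variant"; no-calls ('./.', '.|.') are included, so the count is an upper bound on carriers
ps = ['0/1', '1/0', '1/1', '0|1', '1|0', '1|1', './.', '.|.']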
# ns = ['0/0', '0|0', './.', '.|.']
# affected = ['B04522', 'B04524', 'B04525']
# unaffected = ['B04526', 'B04527']
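
Because the `ps` list lumps no-calls together with real carriers and does not list multi-allelic genotypes such as `1/2`, it can help to count the two cases separately. A sketch with a hypothetical `classify_gt` helper, using genotype patterns like those in the table above:

```python
def classify_gt(gt):
    """Label a GT string as 'carrier', 'ref' or 'missing'."""
    alleles = gt.replace('|', '/').split('/')
    if all(a == '.' for a in alleles):
        return 'missing'
    if any(a not in ('0', '.') for a in alleles):
        return 'carrier'
    return 'ref'


def count_calls(gts):
    """Return (number of carriers, number of no-calls) for one variant."""
    labels = [classify_gt(gt) for gt in gts]
    return labels.count('carrier'), labels.count('missing')


print(count_calls(['0/1', '0/1', '0/1', '0/1']))   # (4, 0)
print(count_calls(['1/1', '1/1', '0/1', './.']))   # (3, 1)
print(count_calls(['0/0', '0/1', '0/0', '0/0']))   # (1, 0)
```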

All 4 individuals are affected (B04531, B04532, B04533, B04534).



In [60]:
# drop the helper column; reassigning instead of inplace=True avoids SettingWithCopyWarning
dff = dff.drop('gnomAD_genome_ALL_tmp', axis=1)


# annotate whether a variant's gene was also flagged in the 5 full siblings or the 2 third cousins

In [61]:
of = '/projects/da_workspace/DA-337/haplotype/B04522_27/haplotype_caller-1.1.4/joint/B04522_B04524_B04525_B04526_B04527/379/B04522_B04524_B04525_B04526_B04527_potential_causal_contributing_coding_variants_20190910.tsv'
gene1 = pd.read_csv(of, sep='\t')['gene'].unique().tolist()
gene1[:2]

['NFIA', 'KANK4']

In [62]:
of = '/projects/da_workspace/DA-337/haplotype/B04529_30/haplotype_caller-1.1.4/joint/B04529_B04530/382/B04529_B04530_potential_causal_contributing_coding_variants_20190912.tsv'
gene2 = pd.read_csv(of, sep='\t')['gene'].unique().tolist()
gene2[:2]

['FOXO6', 'RP11-423O2.5']

In [63]:
dff.head(2)

Unnamed: 0,CHROM,POS,ID,REF,ALT,QUAL,FORMAT,B04531,B04532,B04533,...,aa_change,gene,transcript,cosmic70,ExAC_ALL,gnomAD_genome_ALL,B04531_gt,B04532_gt,B04533_gt,B04534_gt
2534,1,888628,.,C,T,775.71,GT:AD:DP:GQ:PL,"0/0:44,0:44:99:0,115,1762","0/0:36,0:36:93:0,93,1395","0/0:42,0:42:99:0,104,1525",...,p.Arg310Gln/c.929G>A,NOC2L,ENST00000327044,.,.,.,0/0,0/0,0/0,0/1
2576,1,900526,rs146302352,G,A,620.71,GT:AD:DP:GQ:PL,"0/0:32,0:32:80:0,80,1292","0/1:20,18:38:99:630,0,684","0/0:26,0:26:72:0,72,1080",...,p.Pro398Pro/c.1194G>A,KLHL17,ENST00000455747,.,0.0011,0.0008,0/0,0/1,0/0,0/0


In [64]:
# flag genes already reported for the sibship (gene1) and for the third cousins (gene2);
# assign returns a new frame, which avoids SettingWithCopyWarning
dff = dff.assign(gene_in_5_sibs_set=dff.gene.isin(gene1),
                 gene_in_2_cousin_set=dff.gene.isin(gene2))


In [65]:
dff.tail()

Unnamed: 0,CHROM,POS,ID,REF,ALT,QUAL,FORMAT,B04531,B04532,B04533,...,transcript,cosmic70,ExAC_ALL,gnomAD_genome_ALL,B04531_gt,B04532_gt,B04533_gt,B04534_gt,gene_in_5_sibs_set,gene_in_2_cousin_set
8754916,GL000212.1,65258,.,T,C,396.71,GT:AD:DP:GQ:PL,"0/0:62,0:62:99:0,119,1800","0/0:64,0:64:99:0,107,1800","0/1:15,13:28:99:406,0,443",...,ENST00000391545,.,.,.,0/0,0/0,0/1,0/0,False,True
8754917,GL000212.1,65289,.,G,GGGGACGCCGTCAAGGGCATCGCTGATC,2914.35,GT:AD:DP:GQ:PL,"1/1:0,20:20:56:886,56,0","1/1:3,14:17:24:637,24,0","1/1:0,15:15:50:925,50,0",...,ENST00000391545,.,.,.,1/1,1/1,1/1,0/1,False,True
8754919,GL000212.1,65540,.,T,C,5370.27,GT:AD:DP:GQ:PL,"1/1:0,41:41:99:1680,123,0","1/1:0,25:26:94:1276,94,0","1/1:0,31:31:93:1328,93,0",...,ENST00000391545,.,.,.,1/1,1/1,1/1,1/1,False,True
8761891,GL000193.1,92471,.,CAG,C,16.64,GT:AD:DP:GQ:PL,"0/0:3,0:3:3:0,3,45","0/1:8,2:10:25:25,0,330","0/0:2,0:2:6:0,6,79",...,ENST00000402833,.,.,.,0/0,0/1,0/0,0/0,False,False
8761965,GL000193.1,95299,.,CAG,C,68.71,GT:AD:DP:GQ:PL,"0/0:49,0:49:99:0,120,1800","0/0:113,0:113:99:0,120,1800","0/0:139,0:139:99:0,120,1800",...,ENST00000536753,.,.,.,0/0,0/0,0/0,0/1,False,False


In [66]:
dff.columns

Index(['CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FORMAT', 'B04531',
       'B04532', 'B04533', 'B04534', 'variant', 'impact', 'mutation_type',
       'aa_change', 'gene', 'transcript', 'cosmic70', 'ExAC_ALL',
       'gnomAD_genome_ALL', 'B04531_gt', 'B04532_gt', 'B04533_gt', 'B04534_gt',
       'gene_in_5_sibs_set', 'gene_in_2_cousin_set'],
      dtype='object')

In [71]:
dff.head()

Unnamed: 0,CHROM,POS,ID,REF,ALT,QUAL,FORMAT,B04531,B04532,B04533,...,transcript,cosmic70,ExAC_ALL,gnomAD_genome_ALL,B04531_gt,B04532_gt,B04533_gt,B04534_gt,gene_in_5_sibs_set,gene_in_2_cousin_set
2534,1,888628,.,C,T,775.71,GT:AD:DP:GQ:PL,"0/0:44,0:44:99:0,115,1762","0/0:36,0:36:93:0,93,1395","0/0:42,0:42:99:0,104,1525",...,ENST00000327044,.,.,.,0/0,0/0,0/0,0/1,False,False
2576,1,900526,rs146302352,G,A,620.71,GT:AD:DP:GQ:PL,"0/0:32,0:32:80:0,80,1292","0/1:20,18:38:99:630,0,684","0/0:26,0:26:72:0,72,1080",...,ENST00000455747,.,0.0011,0.0008,0/0,0/1,0/0,0/0,False,False
2828,1,955596,rs764659938,C,G,862.71,GT:AD:DP:GQ:PGT:PID:PL,"0/0:54,0:54:99:.:.:0,100,1800","0/1:24,23:47:99:0|1:955596_C_G:872,0,1528","0/0:49,0:49:99:.:.:0,108,1800",...,ENST00000379370,.,0.0002,0.0010,0/0,0/1,0/0,0/0,False,False
2951,1,985855,rs147990356,C,T,688.71,GT:AD:DP:GQ:PL,"0/0:40,0:40:99:0,99,1575","0/0:43,0:43:99:0,101,1613","0/0:41,0:41:99:0,114,1710",...,ENST00000379370,.,0.0061,0.0051,0/0,0/0,0/0,0/1,False,False
3650,1,1221539,.,C,T,728.71,GT:AD:DP:GQ:PL,"0/0:45,0:45:99:0,108,1687","0/0:41,0:41:99:0,105,1533","0/0:35,0:35:99:0,99,1420",...,ENST00000379101,.,.,.,0/0,0/0,0/0,0/1,False,False


In [74]:
# per variant, count the samples whose genotype is in ps; assign returns a new frame, so no SettingWithCopyWarning
dff = dff.assign(num_patients_with_variant_in_unrelated_set=dff[['B04531_gt', 'B04532_gt', 'B04533_gt', 'B04534_gt']].isin(ps).sum(axis=1))


In [75]:
dff.head()

Unnamed: 0,CHROM,POS,ID,REF,ALT,QUAL,FORMAT,B04531,B04532,B04533,...,cosmic70,ExAC_ALL,gnomAD_genome_ALL,B04531_gt,B04532_gt,B04533_gt,B04534_gt,gene_in_5_sibs_set,gene_in_2_cousin_set,num_patients_with_variant_in_unrelated_set
2534,1,888628,.,C,T,775.71,GT:AD:DP:GQ:PL,"0/0:44,0:44:99:0,115,1762","0/0:36,0:36:93:0,93,1395","0/0:42,0:42:99:0,104,1525",...,.,.,.,0/0,0/0,0/0,0/1,False,False,1
2576,1,900526,rs146302352,G,A,620.71,GT:AD:DP:GQ:PL,"0/0:32,0:32:80:0,80,1292","0/1:20,18:38:99:630,0,684","0/0:26,0:26:72:0,72,1080",...,.,0.0011,0.0008,0/0,0/1,0/0,0/0,False,False,1
2828,1,955596,rs764659938,C,G,862.71,GT:AD:DP:GQ:PGT:PID:PL,"0/0:54,0:54:99:.:.:0,100,1800","0/1:24,23:47:99:0|1:955596_C_G:872,0,1528","0/0:49,0:49:99:.:.:0,108,1800",...,.,0.0002,0.0010,0/0,0/1,0/0,0/0,False,False,1
2951,1,985855,rs147990356,C,T,688.71,GT:AD:DP:GQ:PL,"0/0:40,0:40:99:0,99,1575","0/0:43,0:43:99:0,101,1613","0/0:41,0:41:99:0,114,1710",...,.,0.0061,0.0051,0/0,0/0,0/0,0/1,False,False,1
3650,1,1221539,.,C,T,728.71,GT:AD:DP:GQ:PL,"0/0:45,0:45:99:0,108,1687","0/0:41,0:41:99:0,105,1533","0/0:35,0:35:99:0,99,1420",...,.,.,.,0/0,0/0,0/0,0/1,False,False,1


In [95]:
dff.gene_in_2_cousin_set.sum()
dff.gene_in_5_sibs_set.sum()
dff[dff.num_patients_with_variant_in_unrelated_set==4].gene

170

65

318201     RP11-31F15.2
347949     RP11-423O2.5
785423          C2orf56
785424          C2orf56
785425          C2orf56
1062046      AC097532.2
1062318      AC097532.2
1116166           FMNL2
1116167           FMNL2
1409690           TPRXL
1584598          ZNF717
1778694    RP11-680B3.2
1911843            MUC4
2706897           CENPK
3151710        HLA-DRB6
3151728        HLA-DRB6
3151778        HLA-DRB6
3151779        HLA-DRB6
3157356        HLA-DQA1
3157357        HLA-DQA1
3157550        HLA-DQA1
3157551        HLA-DQA1
3157574        HLA-DQA1
3157577        HLA-DQA1
3157584        HLA-DQA1
3160288        HLA-DQB1
3160289        HLA-DQB1
3160290        HLA-DQB1
3160300        HLA-DQB1
3503553           SYNE1
               ...     
7329488        C17orf66
7329493        C17orf66
7410186    CTD-2653B5.1
7414807           ABCA6
7693783         MADCAM1
7971508         FAM182B
7971573         FAM182B
8008576    RP4-564F22.2
8109636           BAGE2
8631241             ZFY
8631242         

In [77]:
of = '/projects/da_workspace/DA-337/haplotype/B04531_34/haplotype_caller-1.1.4/joint/B04531_B04533_B04534_B04532/387/B04531_B04533_B04534_B04532_potential_causal_contributing_coding_variants_20190912.tsv'
dff.to_csv(of, sep='\t', index=False)

In [78]:
dff.gene.unique()
dff.shape

array(['NOC2L', 'KLHL17', 'AGRN', ..., 'AL356585.1', 'AC018692.2',
       'AC018692.3'], dtype=object)

(6101, 27)

# get the vcf entries based on index

In [79]:
dfvcf = df.reindex(dff.index)

In [82]:
of = '/projects/da_workspace/DA-337/haplotype/B04531_34/haplotype_caller-1.1.4/joint/B04531_B04533_B04534_B04532/387/B04531_B04533_B04534_B04532_vcf_20190912.vcf'
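# name the first column '#CHROM' so the header row is a valid VCF column line; the '##' meta lines of the source file are not copied
dfvcf.rename(columns={'CHROM': '#CHROM'}).to_csv(of, sep='\t', index=False)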
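
If the subset needs to load in tools that expect the full VCF header, one option is to stream the original file and copy the header plus the selected records unchanged. This is a sketch only: `subset_vcf` is a hypothetical helper, records are matched on (CHROM, POS, REF, ALT) as strings, and the toy file below stands in for the real joint-called VCF.

```python
import gzip
import os
import tempfile


def subset_vcf(src_gz, out_path, keep):
    """Copy the header and every record whose (CHROM, POS, REF, ALT) is in `keep`."""
    n = 0
    with gzip.open(src_gz, 'rt') as src, open(out_path, 'w') as out:
        for line in src:
            if line.startswith('#'):
                out.write(line)  # '##' meta lines and the '#CHROM' column line
                continue
            chrom, pos, _id, ref, alt = line.split('\t', 5)[:5]
            if (chrom, pos, ref, alt) in keep:
                out.write(line)
                n += 1
    return n


# toy input standing in for the real file
tmpdir = tempfile.mkdtemp()
src = os.path.join(tmpdir, 'toy.vcf.gz')
with gzip.open(src, 'wt') as fh:
    fh.write('##fileformat=VCFv4.2\n'
             '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
             '1\t100\t.\tA\tG\t50\tPASS\tDP=10\n'
             '1\t200\t.\tC\tT\t60\tPASS\tDP=12\n')

out = os.path.join(tmpdir, 'subset.vcf')
n_written = subset_vcf(src, out, {('1', '200', 'C', 'T')})
with open(out) as fh:
    lines = fh.read().splitlines()
```

For the table above, the `keep` set could be built as `set(zip(dff.CHROM.astype(str), dff.POS.astype(str), dff.REF, dff.ALT))`.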