Thanks very much - Veronica, Ru Chen and I went over the pedigrees this morning and reverified the identities. I have attached the sample list here.

The sibship defined by B04522, B04523, B04524, B04525, B04526, B04527 and B04528 should all be compared to each other. Since B04523 and B04528 could not be collected, we are left with assuming B04522, B04524, B04525, B04526, B04527 are full sibs. Of those 5 samples, 3 are affected and 2 are unaffected. We are interested in the list of rare variants (either coding or non-coding) that are shared among the affecteds (B04522, B04524 and B04525) and not shared with the unaffecteds (B04526, B04527). 
These can be homozygous (autosomal recessive model) or heterozygous (autosomal dominant model).

B04529 and B04530 are third cousins to each other, both affected. We are interested in the list of rare variants that are shared between them.

The other samples are all individuals from different families, unrelated to each other. It is possible (though not likely, given what we know about the genetic architecture of nonsyndromic brain aneurysms) that one or more of these samples has a rare variant in or near a gene that is flagged by one of the above approaches.
If there is an algorithmic way to approach that, I'd appreciate you letting me know, but as far as I am aware the onus will be on me to scroll through the variant lists for those individuals one at a time to see by eye whether any rare variants in these singletons are in the same genes as are flagged by the filtering of the multiplex sibship and/or the two third cousins.
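
One algorithmic option is a gene-level join: collect the genes flagged by the sibship and/or third-cousin filters, then pull out any rare variant in a singleton that falls in one of those genes. The sketch below is a minimal illustration, assuming each singleton's rare variants are already in a pandas DataFrame with a `gene` column. The function name `flag_singleton_hits`, the sample labels `S1`/`S2`, and the toy rows are hypothetical and not taken from the real data.

```python
import pandas as pd

def flag_singleton_hits(family_genes, singleton_variants):
    """Return the singleton variants that fall in a family-flagged gene.

    family_genes: iterable of gene symbols flagged by the family filters.
    singleton_variants: dict of sample ID -> DataFrame with a 'gene' column.
    """
    family_genes = set(family_genes)
    hits = []
    for sample, variants in singleton_variants.items():
        matched = variants[variants['gene'].isin(family_genes)].copy()
        if matched.empty:
            continue  # nothing in a flagged gene for this sample
        matched.insert(0, 'sample', sample)  # remember which singleton it came from
        hits.append(matched)
    if not hits:
        return pd.DataFrame(columns=['sample', 'gene'])
    return pd.concat(hits, ignore_index=True)

# Toy data for illustration only (positions are made up).
toy = {
    'S1': pd.DataFrame({'gene': ['NFIA', 'TTN'], 'POS': [100, 200]}),
    'S2': pd.DataFrame({'gene': ['BRCA2'], 'POS': [300]}),
}
overlap = flag_singleton_hits(['NFIA', 'KANK4'], toy)
```

This only matches on gene symbol, so it would still be worth checking by eye that the flagged variants are plausible.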

# email correspondence to help filtering

The set of samples IBA1863, IBA1823, IBA1801, IBA1837 as labelled here each represent individual affected members from multiplex families. For each of these we will want .vcf files that represent all rare coding variants present at a MAF of 0.01 or less in gnomAD. It is likely that these families each have a different causative allele, <font color='red'>so there is no rationale for overlapping them to exclude rare alleles from consideration.</font>

It is possible that any two of them may have different causative alleles in the same gene, but that is not likely so we will analyse them in conjunction with other rare families we have.



The two samples labelled "IV-1" and "IV-2" here will be provided with IBA numbers when we send them to you. As you can see, they are third cousins. Thus, we would want to analyse these samples together - the .vcf files we would want from these two can safely <font color='red'>be restricted to rare coding variants with a MAF of 0.01 or less in gnomAD that are shared between both individuals.</font>



Lastly (though perhaps most importantly), the multiplex family II-1, II-2, II-3, II-4, II-5, II-6 and II-7.



Here we are looking for rare coding variants that satisfy the following
<font color='red'>

criteria:



1. MAF of 0.01 or less in gnomAD



2. Shared among all 4 of II-1, II-2, II-4, II-5.



AND



3. Not shared with any of II-3, II-6 or II-7.
</font>


We would appreciate a quote for an SOW that satisfied the above parameters, with 40X coverage across the whole genome.



Of course we are most interested in the variants that are predicted to be damaging. If CADD scores can be provided, that would be great, alongside predictions from SIFT, Polyphen, Mutation Taster and/or Align GVGD. I am not sure if there is a "standard" pipeline that you would use to call heterozygous dominant variants.

A haplotype is a group of genes within an organism that was inherited together from a single parent. The word "haplotype" is derived from the word "haploid," which describes cells with only one set of chromosomes, and from the word "genotype," which refers to the genetic makeup of an organism. A haplotype can describe a pair of genes inherited together from one parent on one chromosome, or it can describe all of the genes on a chromosome that were inherited together from a single parent. This group of genes was inherited together because of genetic linkage, or the phenomenon by which genes that are close to each other on the same chromosome are often inherited together. In addition, the term "haplotype" can also refer to the inheritance of a cluster of single nucleotide polymorphisms (SNPs), which are variations at single positions in the DNA sequence among individuals.

By examining haplotypes, scientists can identify patterns of genetic variation that are associated with health and disease states. For instance, if a haplotype is associated with a certain disease, then scientists can examine stretches of DNA near the SNP cluster to try to identify the gene or genes responsible for causing the disease.

# annotations

https://macarthurlab.org/2017/02/27/the-genome-aggregation-database-gnomad/

In [1]:
import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt
%load_ext autoreload
%autoreload 2
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [2]:
wkdir = '/projects/da_workspace/DA-337/'

In [3]:
siblibs = ['B04522', 'B04524', 'B04525', 'B04526', 'B04527']#'B04529' ,'B04530', 'B04531', 'B04533', 'B04534', 'B04532']

# vcf v4.2 specification

QUAL: −10log10prob(no variant)

AC : allele count in genotypes, for each ALT allele, in the same order as listed; it is the alternative allele count across all samples. AC=3 means 3 ALT alleles were found among the 5 individuals in this case.

AF : allele frequency for each ALT allele in the same order as listed: use this when estimated from primary
data, not called genotypes, AF= AC/AN

AN : total number of alleles in called genotypes, in this case AN=10, 5 diploid samples
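
As a quick check of how AC, AN and AF relate, the sketch below recomputes them from called genotypes. The function name `allele_counts` and the toy genotypes are illustrative assumptions; in the real VCF the variant caller writes these fields itself.

```python
import re

def allele_counts(genotypes, alt_index=1):
    """Compute (AC, AN, AF) for one ALT allele from a list of GT strings."""
    ac = 0
    an = 0
    for gt in genotypes:
        for allele in re.split(r'[/|]', gt):
            if allele == '.':
                continue  # missing alleles do not contribute to AN
            an += 1
            if int(allele) == alt_index:
                ac += 1
    af = ac / an if an else float('nan')
    return ac, an, af

# Five diploid samples carrying three ALT alleles in total (toy genotypes).
ac, an, af = allele_counts(['0/1', '0/1', '0/1', '0/0', '0/0'])
print(ac, an, af)
```

With five fully called diploid samples AN is 10, so three ALT alleles give AF = 3/10.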

BQ : RMS base quality at this position

DB : dbSNP membership

DP : combined depth across samples, e.g. DP=154

GT : genotype, encoded as allele values separated by either of / or |. The allele values are 0 for the reference
allele (what is in the REF field), 1 for the first allele listed in ALT, 2 for the second allele list in ALT and
so on. For diploid calls examples could be 0/1, 1 | 0, or 1/2, etc. For haploid calls, e.g. on Y, male nonpseudoautosomal X, or mitochondrion, only one allele value should be given; a triploid call might look like
0/0/1. If a call cannot be made for a sample at a given locus, ‘.’ should be specified for each missing allele
in the GT field (for example ‘./.’ for a diploid genotype and ‘.’ for haploid genotype). The meanings of the
separators are as follows (see the PS field below for more details on incorporating phasing information into the
genotypes):

◦ / : genotype unphased

◦ | : genotype phased

GL : genotype likelihoods comprised of comma separated floating point log10-scaled likelihoods for all possible
genotypes given the set of alleles defined in the REF and ALT fields. In presence of the GT field the same
ploidy is expected and the canonical order is used; without GT field, diploidy is assumed. If A is the allele in
REF and B,C,... are the alleles as ordered in ALT, the ordering of genotypes for the likelihoods is given by:
F(j/k) = (k*(k+1)/2)+j. In other words, for biallelic sites the ordering is: AA,AB,BB; for triallelic sites the
ordering is: AA,AB,BB,AC,BC,CC, etc. For example: GT:GL 0/1:-323.03,-99.29,-802.53 (Floats)
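
The ordering formula can be checked with a few lines of Python. This is a small sketch for diploid calls only; the helper names `gl_index` and `gl_order` are made up for illustration.

```python
def gl_index(j, k):
    """Position of diploid genotype j/k in the GL (or PL) array, using F(j/k) = k*(k+1)/2 + j."""
    if j > k:
        j, k = k, j  # the formula assumes j <= k
    return k * (k + 1) // 2 + j

def gl_order(n_alleles):
    """List diploid genotypes in canonical GL order for n_alleles (REF plus ALTs)."""
    pairs = [(j, k) for k in range(n_alleles) for j in range(k + 1)]
    return ['{}/{}'.format(j, k) for j, k in pairs]

print(gl_order(2))  # biallelic site
print(gl_order(3))  # triallelic site
```

With 0 = A, 1 = B and 2 = C, the triallelic list reads 0/0, 0/1, 1/1, 0/2, 1/2, 2/2, which is the AA,AB,BB,AC,BC,CC ordering quoted above.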


DP : read depth at this position for this sample (Integer)

GQ : conditional genotype quality, encoded as a phred quality −10log10 p(genotype call is wrong, conditioned
on the site’s being variant) (Integer)

PL : the phred-scaled genotype likelihoods rounded to the closest integer (and otherwise defined precisely as
the GL field) (Integers)

Phasing:  
If the GTs in the VCF use the pipe symbol instead of the forward slash, i.e., 1|0, 1|1, or 0|1, it means that we know which variants were called on the same paternal or maternal chromosome, i.e., the same haplotype. Getting this level of information reliably requires a special type of sample prep that I won't go into right now. To interpret these data, all alleles reported to the left of the pipe symbol are on one haplotype, while all those on the right are on the other.
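
A GT string can be split into its alleles and a phased/unphased flag with a short helper. This is a minimal sketch; the name `parse_gt` is an assumption, and missing alleles are returned as `None`.

```python
import re

def parse_gt(gt):
    """Split a GT string into (alleles, phased).

    alleles: list of ints, with None for a missing allele ('.').
    phased: True if the separator is '|', False if it is '/'.
    """
    phased = '|' in gt
    alleles = [None if a == '.' else int(a) for a in re.split(r'[/|]', gt)]
    return alleles, phased

for gt in ['1|0', '0/1', '1/2', './.']:
    print(gt, parse_gt(gt))
```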


20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.  
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3  
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4  
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2  
20 1234567 microsat1 GTC G,GTCT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3  
This example shows (in order): a good simple SNP, a possible SNP that has been filtered out because its quality is
below 10, a site at which two alternate alleles are called, with one of them (T) being ancestral (possibly a reference
sequencing error), a site that is called monomorphic reference (i.e. with no alternate alleles), and a microsatellite with
two alternative alleles, one a deletion of 2 bases (TC), and the other an insertion of one base (T). Genotype data are
given for three samples, two of which are phased and the third unphased, with per sample genotype quality, depth
and haplotype qualities (the latter only for the phased samples) given as well as the genotypes. The microsatellite
calls are unphased.

# get GT: genotype for each sample

In [3]:
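# Sample columns are the ones named after library IDs (B04522, ...).
# df must be loaded first (see the "annotated vcf" section below).
libs = [i for i in df.columns if i.startswith('B')]

target_cols = ['GT']
def x_gt(fmtstr, x, target_cols):
    """Return the requested FORMAT fields (e.g. GT, AD, DP) for one sample entry."""
    keys = fmtstr.split(':')
    values = x.split(':')
    d = {k: v for k, v in zip(keys, values)}
    return [d[i] for i in target_cols]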

In [90]:
x_gt('GT:AD:DP:GQ:PL', '0/0:94,0:94:0:0,0,905', ['GT', 'AD', 'DP'])

['0/0', '94,0', '94']

In [1]:
target_cols = ['GT']#, 'AD', 'DP']
gts = []
for ix, row in df.head().iterrows():
    gt = []
    fmtstr = row['FORMAT']
    for lib in libs:
        x = row[lib]
        gt.append(x_gt(fmtstr, x, target_cols)[0])
    gts.append(gt)
# print(gts)
dfgt = pd.DataFrame(gts)
dfgt.columns = ['_'.join([lib, 'GT']) for lib in libs]
dfgt.head()

NameError: name 'df' is not defined

Goal: keep variants shared among the affecteds (B04522, B04524 and B04525) and not shared with the unaffecteds (B04526, B04527).

# Records whose FILTER field is not PASS should be filtered out; they failed a filter

# annotated vcf

In [4]:
f = '/projects/da_workspace/DA-337/haplotype/B04522_27/haplotype_caller-1.1.4/joint/B04522_B04524_B04525_B04526_B04527/379/joint_calling.exclude_no_variation.snpeff.hg19_multianno.vcf.gz'
df = pd.read_csv(f, sep='\t', comment='#', compression='gzip', low_memory=False, header=None)
df.columns = ['CHROM', 'POS','ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO', 
              'FORMAT', 'B04522', 'B04524', 'B04525', 'B04526', 'B04527']
df.head(2)

Unnamed: 0,CHROM,POS,ID,REF,ALT,QUAL,FILTER,INFO,FORMAT,B04522,B04524,B04525,B04526,B04527
0,1,10146,rs779258992,AC,A,207.23,PASS,AC=1;AF=0.100;AN=10;BaseQRankSum=-6.830e-01;DB...,GT:AD:DP:GQ:PL,"0/0:94,0:94:0:0,0,905","0/0:105,0:105:63:0,63,2541","0/0:130,0:130:14:0,14,3038","0/1:2,10:12:30:216,0,30","0/0:118,0:118:85:0,85,3066"
1,1,10403,.,ACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAC,A,132.33,PASS,AC=1;AF=0.100;AN=10;BaseQRankSum=-2.400e-02;DP...,GT:AD:DP:GQ:PGT:PID:PL,"0/0:101,0:101:0:.:.:0,0,2695","0/1:8,5:13:99:0|1:10403_ACCCTAACCCTAACCCTAACCC...","0/0:115,0:115:55:.:.:0,55,3276","0/0:120,0:120:0:.:.:0,0,2770","0/0:136,0:136:15:.:.:0,15,3309"


In [5]:
df.shape

(7040126, 14)

In [6]:
# filter step by  step
df = df[df.FILTER=='PASS']

In [7]:
df.shape

(6293924, 14)

In [8]:
def get_gnomAD_anno(info):
    sps = [(i.split('=')) for i in info.split(';')[:-1]]
#     sps  = [(i.split('=')) for i in items]
    d = {i[0]: i[1] for i in sps if len(i) == 2}
    ks = ['cosmic70',  'ExAC_ALL', 'gnomAD_genome_ALL'] #'Gene.refGene', 'AAChange.refGene',
#     return {k: d[k] for k in ks}
    return pd.Series([d[k] for k in ks])

In [9]:
# get_gnomAD_anno(dfsub.loc[27, 'INFO'])

NameError: name 'dfsub' is not defined

In [10]:
# Keep it simple for now: take the highest available impact class (HIGH, then MODERATE, then LOW)
# and one transcript from it. All transcripts could be pulled out if needed.
def get_annotations(info):
    effs = info.split('EFF=')[1].split(',')
    # For each effect keep: variant type, impact, functional class, amino acid change, gene, transcript
    effs = ['@'.join(list(np.array(re.split(r'\(|\|', ef))[[0, 1, 2, 4, 6, 9]]))
            for ef in effs if ('HIGH' in ef) or ('MODERATE' in ef) or ('LOW' in ef)]
    # set() drops duplicates but does not preserve order, so the entry picked below is arbitrary
    effs = list(set(effs))
    high = [ef for ef in effs if 'HIGH' in ef]
    moderate = [ef for ef in effs if 'MODERATE' in ef]
    low = [ef for ef in effs if 'LOW' in ef]
    if high:
        anno = high[0]
    elif moderate:
        anno = moderate[0]
    elif low:
        anno = low[0]
    else:
        # No HIGH/MODERATE/LOW effect: fail loudly instead of reaching an undefined `anno` below
        raise ValueError('no HIGH, MODERATE or LOW effect found in INFO')
    # anno is an '@'-joined string; split it back into its six fields
    return pd.Series(anno.split('@'))


In [11]:
get_annotations(df.loc[27, 'INFO'])

0    splice_donor_variant+splice_region_variant+int...
1                                                 HIGH
2                                                     
3                                   n.734+2_734+3delAG
4                                              DDX11L1
5                                      ENST00000518655
dtype: object

In [13]:
libs = [i for i in df.columns if i.startswith('B')]
target_feat = 'GT'

In [14]:
def get_genotype(target_feat, features, x):
    feats = features.split(':')
    x = x.split(':')
    d = {k:v for k, v in zip(feats, x)}
    return d[target_feat]

In [15]:
get_genotype(target_feat, 'GT:AD:DP:GQ:PL',  '0/1:36,23:59:99:827,0,1516')

'0/1'

In [16]:
# .copy() makes dfhml an independent frame, so the column assignments below
# do not trigger pandas' SettingWithCopyWarning
dfhml = df[(df['INFO'].str.contains("HIGH"))|(df['INFO'].str.contains("MODERATE"))|(df['INFO'].str.contains("LOW"))].copy()
if not dfhml.empty:
    dfhml[['variant', 'impact','mutation_type','aa_change','gene','transcript']] = dfhml['INFO'].apply(get_annotations)
    dfhml[['cosmic70',  'ExAC_ALL', 'gnomAD_genome_ALL']] = dfhml['INFO'].apply(get_gnomAD_anno)
    for lib in libs:
        dfhml['_'.join([lib, 'gt'])] = dfhml[['FORMAT', lib]].apply(lambda x: get_genotype(target_feat, x['FORMAT'], x[lib]), axis=1)
    dfhml.drop(['INFO', 'FILTER'], axis=1, inplace=True)


In [17]:
dfhml.shape

(46321, 26)

In [18]:
dfhml.head()

Unnamed: 0,CHROM,POS,ID,REF,ALT,QUAL,FORMAT,B04522,B04524,B04525,...,gene,transcript,cosmic70,ExAC_ALL,gnomAD_genome_ALL,B04522_gt,B04524_gt,B04525_gt,B04526_gt,B04527_gt
27,1,13656,.,CAG,C,4649.2,GT:AD:DP:GQ:PL,"0/1:36,23:59:99:827,0,1516","0/1:35,22:57:99:803,0,1400","0/1:28,24:52:99:892,0,1108",...,DDX11L1,ENST00000518655,.,.,0.4901,0/1,0/1,0/1,0/1,0/1
54,1,15903,rs557514207,G,GC,1173.2,GT:AD:DP:GQ:PL,"1/1:0,7:7:21:249,21,0","1/1:0,11:11:33:392,33,0","1/1:0,8:8:24:285,24,0",...,WASH7P,ENST00000438504,.,.,0.5042,1/1,1/1,1/1,./.,1/1
2167,1,877831,rs6672356,T,C,8712.01,GT:AD:DP:GQ:PL,"1/1:0,42:42:99:1772,126,0","1/1:0,47:47:99:1962,141,0","1/1:0,33:33:99:1364,99,0",...,SAMD11,ENST00000455979,.,0.9999,0.9998,1/1,1/1,1/1,1/1,1/1
2169,1,879317,rs7523549,C,T,1601.96,GT:AD:DP:GQ:PL,"0/1:19,11:30:99:270,0,713","0/0:29,0:29:78:0,78,1170","0/1:13,20:33:99:667,0,460",...,SAMD11,ENST00000342066,.,0.0480,0.082,0/1,0/0,0/1,0/1,0/1
2175,1,881627,rs2272757,G,A,2911.2,GT:AD:DP:GQ:PL,"0/1:12,16:28:99:575,0,416","0/1:13,11:24:99:395,0,473","0/1:19,19:38:99:726,0,638",...,NOC2L,ENST00000327044,.,0.5653,0.4889,0/1,0/1,0/1,0/1,0/1


In [19]:
dfhml['gnomAD_genome_ALL_tmp'] = dfhml['gnomAD_genome_ALL'].replace('.', .0)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


In [20]:
dff = dfhml[(dfhml.gnomAD_genome_ALL_tmp.astype(float) <0.01)|(dfhml.gnomAD_genome_ALL_tmp.isna())]
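
The frequency cut depends on how the `.` placeholder is turned into a number. As an alternative sketch (not what the cell above does), `pd.to_numeric(..., errors='coerce')` keeps "absent from gnomAD" as NaN so it stays distinct from a true frequency of 0. The function name `rare_mask` and the toy values are assumptions. Note that this sketch uses `<=` to match the "0.01 or less" wording of the request, whereas the cell above uses a strict `<`.

```python
import pandas as pd

def rare_mask(freq, max_maf=0.01, keep_missing=True):
    """Boolean mask for rows whose population frequency is at or below max_maf."""
    numeric = pd.to_numeric(freq, errors='coerce')  # '.' becomes NaN
    mask = numeric <= max_maf
    if keep_missing:
        # variants absent from the database are treated as rare
        mask = mask | numeric.isna()
    return mask

# Toy frequencies in the same string form as the annotated INFO field.
toy = pd.Series(['0.4901', '0.0011', '.', '0.01', '8.649e-05'])
mask = rare_mask(toy)
print(toy[mask].tolist())
```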

In [21]:
dff.head()
dff.shape

Unnamed: 0,CHROM,POS,ID,REF,ALT,QUAL,FORMAT,B04522,B04524,B04525,...,transcript,cosmic70,ExAC_ALL,gnomAD_genome_ALL,B04522_gt,B04524_gt,B04525_gt,B04526_gt,B04527_gt,gnomAD_genome_ALL_tmp
2284,1,915614,rs186435690,T,A,2041.96,GT:AD:DP:GQ:PL,"0/1:21,12:33:99:416,0,792","0/0:35,0:35:96:0,96,1440","0/1:15,23:38:99:710,0,543",...,ENST00000433179,.,0.0008,0.0011,0/1,0/0,0/1,0/1,0/1,0.0011
2440,1,955701,rs767809484,C,T,636.23,GT:AD:DP:GQ:PL,"0/0:41,0:41:99:0,111,1764","0/1:22,19:41:99:646,0,753","0/0:52,0:52:99:0,110,1800",...,ENST00000379370,.,8.649e-05,0.0002,0/0,0/1,0/0,0/0,0/0,0.0002
2529,1,984660,rs200232485,C,T,1589.96,GT:AD:DP:GQ:PL,"0/1:12,14:26:99:497,0,454","0/0:27,0:27:81:0,81,1077","0/1:12,14:26:99:536,0,418",...,ENST00000379370,.,0.0011,0.0007,0/1,0/0,0/1,0/1,0/1,0.0007
2898,1,1139498,rs11466693,C,T,1977.44,GT:AD:DP:GQ:PL,"0/0:36,0:36:99:0,99,1485","0/1:15,16:31:99:566,0,499","0/1:14,20:34:99:692,0,481",...,ENST00000379265,.,0.0078,0.0093,0/0,0/1,0/1,0/1,0/0,0.0093
3208,1,1216859,rs556090955,C,T,1124.94,GT:AD:DP:GQ:PL,"0/1:19,18:37:99:562,0,746","0/0:33,0:33:90:0,90,1350","0/0:37,0:37:99:0,99,1485",...,ENST00000379116,.,.,.,0/1,0/0,0/0,0/0,0/1,0.0


(4301, 27)

# filter based on genotype

In [22]:
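# Genotypes accepted for affecteds (carriers) and for unaffecteds (non-carriers).
# Missing calls ('./.' and '.|.') are in both lists, so a no-call never excludes a variant.
ps = ['0/1', '1/0', '1/1', '0|1', '1|0', '1|1', './.', '.|.']
ns = ['0/0', '0|0', './.', '.|.']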
affected = ['B04522', 'B04524', 'B04525']
unaffected = ['B04526', 'B04527']

Goal: keep variants shared among the affecteds (B04522, B04524 and B04525) and not shared with the unaffecteds (B04526, B04527).



In [23]:
dfff = dff[(dff.B04522_gt.isin(ps))&(dff.B04524_gt.isin(ps))&(dff.B04525_gt.isin(ps))&(dff.B04526_gt.isin(ns))&(dff.B04527_gt.isin(ns))]
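
The same filter can be written against the `affected` and `unaffected` lists, with the inheritance model as a switch. This is a sketch under stated assumptions: the function name `segregating`, the `model` and `allow_missing` parameters, and the toy genotypes are all illustrative. As in the cell above, unaffecteds must be homozygous reference (or missing) under both models.

```python
import pandas as pd

CARRIER = {'0/1', '1/0', '1/1', '0|1', '1|0', '1|1'}
HOM_ALT = {'1/1', '1|1'}
HOM_REF = {'0/0', '0|0'}
MISSING = {'./.', '.|.'}

def segregating(df, affected, unaffected, model='dominant', allow_missing=True):
    """Rows where every affected sample carries the variant and no unaffected sample does."""
    carrier = HOM_ALT if model == 'recessive' else CARRIER
    ok_aff = (carrier | MISSING) if allow_missing else carrier
    ok_unaff = (HOM_REF | MISSING) if allow_missing else HOM_REF
    mask = pd.Series(True, index=df.index)
    for lib in affected:
        mask &= df[lib + '_gt'].isin(ok_aff)
    for lib in unaffected:
        mask &= df[lib + '_gt'].isin(ok_unaff)
    return df[mask]

# Toy genotypes for illustration only.
toy = pd.DataFrame({
    'B04522_gt': ['0/1', '1/1', '0/1', './.'],
    'B04524_gt': ['0/1', '1/1', '0/0', '0/1'],
    'B04525_gt': ['0/1', '1/1', '0/1', '0/1'],
    'B04526_gt': ['0/0', '0/0', '0/0', '0/0'],
    'B04527_gt': ['0/0', './.', '0/1', '0/0'],
})
affected = ['B04522', 'B04524', 'B04525']
unaffected = ['B04526', 'B04527']
dominant = segregating(toy, affected, unaffected)
recessive = segregating(toy, affected, unaffected, model='recessive')
strict = segregating(toy, affected, unaffected, allow_missing=False)
```

A stricter recessive model might instead allow heterozygous carriers among the unaffecteds; that would only change the `ok_unaff` set.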

In [24]:
dfff.drop('gnomAD_genome_ALL_tmp', axis=1, inplace=True)

In [25]:
dfff.head(2)

Unnamed: 0,CHROM,POS,ID,REF,ALT,QUAL,FORMAT,B04522,B04524,B04525,...,gene,transcript,cosmic70,ExAC_ALL,gnomAD_genome_ALL,B04522_gt,B04524_gt,B04525_gt,B04526_gt,B04527_gt
136444,1,61869885,rs771792554,G,A,1906.44,GT:AD:DP:GQ:PL,"0/1:24,23:47:99:728,0,900","0/1:11,19:30:99:669,0,400","0/1:17,15:32:99:523,0,619",...,NFIA,ENST00000371185,.,8.246e-06,.,0/1,0/1,0/1,0/0,0/0
138596,1,62739482,.,T,C,1533.44,GT:AD:DP:GQ:PL,"0/1:15,15:30:99:576,0,513","0/1:13,14:27:99:540,0,415","0/1:18,13:31:99:431,0,642",...,KANK4,ENST00000371153,.,.,.,0/1,0/1,0/1,0/0,0/0


In [27]:
of = '/projects/da_workspace/DA-337/haplotype/B04522_27/haplotype_caller-1.1.4/joint/B04522_B04524_B04525_B04526_B04527/379/potential_causal_contributing_variants_20190910.tsv'
dfff.to_csv(of, sep='\t', index=False)

In [26]:
dfff.gene.unique()
dfff.shape

array(['NFIA', 'KANK4', 'ROR1', 'RAVER2', 'TGFBR3', 'BCAR3', 'DENND2D',
       'WNT2B', 'VANGL1', 'CASQ2', 'ATP1A1', 'NBPF12', 'POGZ', 'FLG',
       'GATAD2B', 'ASH1L', 'RP11-29H23.5', 'SMG5', 'CAMSAP2',
       'AC140481.2', 'CHL1', 'OGG1', 'ARPC4', 'POM121C', 'ZSCAN21',
       'MTUS1', 'PDLIM2', 'LETM2', 'ADAM9', 'PRKDC', 'CUBN', 'C10orf114',
       'SKIDA1', 'ZEB1-AS1', 'ANKRD30A', 'ABCD1P2', 'BMS1',
       'RP11-285G1.2', 'WDFY4', 'TFAM', 'AHNAK', 'VSIG2', 'AP000708.1',
       'CHEK1', 'SNX19', 'GLB1L2', 'BTBD11', 'ACACB', 'PLBD2', 'SERPINA5',
       'CDC42BPB', 'KIF26A', 'INF2', 'AHNAK2', 'BTBD6', 'UBL7', 'CSPG4',
       'SCAPER', 'ACSBG1', 'ADAMTSL3', 'ZSWIM7', 'FAM83G', 'ULK2',
       'DHRS7B', 'SLFN12', 'CWC25', 'KRT19', 'KAT2A', 'AOC3', 'ARHGAP27',
       'EFCAB13', 'TBKBP1', 'EPN3', 'SPATA20', 'BRIP1', 'MAP2K6', 'FHOD3',
       'KATNAL2', 'PLEKHJ1', 'RBMY2OP', 'RPS4Y2'], dtype=object)

(96, 26)

In [85]:
dfff[dfff.B04522_gt == '1/1']

Unnamed: 0,CHROM,POS,ID,REF,ALT,QUAL,B04522,B04524,B04525,B04526,...,gene,transcript,cosmic70,ExAC_ALL,gnomAD_genome_ALL,B04522_gt,B04524_gt,B04525_gt,B04526_gt,B04527_gt
6887343,Y,9789275,rs368461823,CGT,C,467.28,"1/1:0,11:11:33:300,33,0","0/1:1,7:8:3:174,0,3","./.:0,0:0:.:0,0,0","./.:0,0:0:.:0,0,0",...,RBMY2OP,ENST00000447105,.,.,,1/1,0/1,./.,./.,./.
6921261,Y,22942897,rs34459399,T,C,1261.07,"1/1:0,14:14:42:583,42,0","1/1:0,16:16:48:685,48,0","./.:0,0:0:.:0,0,0","./.:0,0:0:.:0,0,0",...,RPS4Y2,ENST00000288666,.,0.0445,,1/1,1/1,./.,./.,./.


# get the vcf entries based on index

In [90]:
dfhml.head()

Unnamed: 0,CHROM,POS,ID,REF,ALT,QUAL,B04522,B04524,B04525,B04526,...,gene,transcript,cosmic70,ExAC_ALL,gnomAD_genome_ALL,B04522_gt,B04524_gt,B04525_gt,B04526_gt,B04527_gt
27,1,13656,.,CAG,C,4649.2,"0/1:36,23:59:99:827,0,1516","0/1:35,22:57:99:803,0,1400","0/1:28,24:52:99:892,0,1108","0/1:21,29:50:99:1155,0,829",...,DDX11L1,ENST00000518655,.,.,0.4901,0/1,0/1,0/1,0/1,0/1
54,1,15903,rs557514207,G,GC,1173.2,"1/1:0,7:7:21:249,21,0","1/1:0,11:11:33:392,33,0","1/1:0,8:8:24:285,24,0","./.:1,0:1:.:0,0,0",...,WASH7P,ENST00000438504,.,.,0.5042,1/1,1/1,1/1,./.,1/1
2167,1,877831,rs6672356,T,C,8712.01,"1/1:0,42:42:99:1772,126,0","1/1:0,47:47:99:1962,141,0","1/1:0,33:33:99:1364,99,0","1/1:0,46:46:99:1921,138,0",...,SAMD11,ENST00000341065,.,0.9999,0.9998,1/1,1/1,1/1,1/1,1/1
2169,1,879317,rs7523549,C,T,1601.96,"0/1:19,11:30:99:270,0,713","0/0:29,0:29:78:0,78,1170","0/1:13,20:33:99:667,0,460","0/1:17,9:26:99:324,0,692",...,SAMD11,ENST00000342066,.,0.0480,0.082,0/1,0/0,0/1,0/1,0/1
2175,1,881627,rs2272757,G,A,2911.2,"0/1:12,16:28:99:575,0,416","0/1:13,11:24:99:395,0,473","0/1:19,19:38:99:726,0,638","0/1:18,26:44:99:906,0,629",...,NOC2L,ENST00000327044,.,0.5653,0.4889,0/1,0/1,0/1,0/1,0/1


In [112]:
dfvcf = df.reindex(dfff.index)

In [113]:
of = '/projects/da_workspace/DA-337/haplotype/B04522_27/haplotype_caller-1.1.4/joint/B04522_B04524_B04525_B04526_B04527/379/B04522_B04524_B04525_B04526_B04527_vcf_20190910.tsv'
dfvcf.to_csv(of, sep='\t', index=False)

# look into B04529 and B04530, who are third cousins to each other and both affected
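
For the two cousins the filter reduces to "carried by both". The sketch below assumes a joint-called table analogous to `dff`, already restricted to rare variants, with `B04529_gt` and `B04530_gt` columns. Those column names, the function name `shared_by_all`, and the toy rows are assumptions, since that table has not been built here. Missing genotypes are not counted as carriers in this version.

```python
import pandas as pd

CARRIER = ['0/1', '1/0', '1/1', '0|1', '1|0', '1|1']

def shared_by_all(df, libs, carrier=CARRIER):
    """Rows where every listed sample carries at least one copy of the ALT allele."""
    gt_cols = [lib + '_gt' for lib in libs]
    return df[df[gt_cols].isin(carrier).all(axis=1)]

# Toy rows for illustration only (gene labels are placeholders).
toy = pd.DataFrame({
    'gene': ['G1', 'G2', 'G3'],
    'B04529_gt': ['0/1', '0/1', '1/1'],
    'B04530_gt': ['0/1', '0/0', '0|1'],
})
shared = shared_by_all(toy, ['B04529', 'B04530'])
print(shared['gene'].tolist())
```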