## Summary statistics of the simulated phenotypes and genotypes (MAF, genotypic frequencies, Odds Ratio)

To import a text file into Python line by line:

In [1]:
ped_file = ('seqsimla/input/results_200/Sim21.ped')
cases = ('seqsimla/input/results_200/cases_100.txt')

In [2]:
with open(ped_file) as fh:  # the context manager closes the file once it has been read
    lines = [x.strip().split() for x in fh]

To check the number of lines read from the file, use the `len()` function:

In [3]:
len(lines)

978

In [4]:
lines[0:10]

[['FAM1', 'M1', '0', '0', '2', '1', '1', '1', '1', '1', '1', '1', '2', '2'],
 ['FAM1', 'F1', '0', '0', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1'],
 ['FAM1', 'O2', 'F1', 'M1', '2', '2', '1', '1', '1', '1', '1', '1', '1', '2'],
 ['FAM1', 'O1', 'F1', 'M1', '2', '2', '1', '1', '1', '1', '1', '1', '1', '2'],
 ['FAM2', 'M2', '0', '0', '2', '1', '1', '1', '1', '1', '1', '1', '2', '2'],
 ['FAM2', 'F2', '0', '0', '1', '1', '1', '1', '1', '2', '1', '1', '1', '1'],
 ['FAM2', 'O2', 'F2', 'M2', '1', '2', '1', '1', '2', '1', '1', '1', '1', '2'],
 ['FAM2', 'O1', 'F2', 'M2', '2', '2', '1', '1', '2', '1', '1', '1', '1', '2'],
 ['FAM3', 'M3', '0', '0', '2', '1', '1', '2', '1', '1', '1', '1', '1', '1'],
 ['FAM3', 'F3', '0', '0', '1', '1', '2', '1', '1', '1', '1', '1', '1', '1']]

In [5]:
lines[0:1]

[['FAM1', 'M1', '0', '0', '2', '1', '1', '1', '1', '1', '1', '1', '2', '2']]
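As a minimal sketch of the reading step, the snippet below parses two records copied from the output above. `io.StringIO` stands in for the real file, and the helper name `read_ped` is an assumption made for illustration.

```python
import io

# Two records copied from the output above; io.StringIO stands in for the real file
ped_text = """FAM1 M1 0 0 2 1 1 1 1 1 1 1 2 2
FAM1 F1 0 0 1 1 1 1 1 1 1 1 1 1
"""

def read_ped(handle):
    """Split each non-empty line of a PED file into a list of string fields."""
    return [line.split() for line in handle if line.strip()]

with io.StringIO(ped_text) as fh:
    records = read_ped(fh)

print(len(records))    # number of individuals read
print(records[0][:6])  # family ID, individual ID, father, mother, sex, affection status
```

Skipping blank lines avoids empty sublists if the file ends with a trailing newline.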

To select part of the list of lists, use slicing inside a list comprehension, as shown below. The `pedigree` variable holds the first six columns of the file, `affected` holds only column 6 (the affection status), and `individual` holds only column 2 (the individual ID).

In [6]:
pedigree = [sublist[0:6] for sublist in lines]
affected = [sublist[5:6] for sublist in lines]
individual = [sublist[1:2] for sublist in lines]

Now count the number of occurrences of `'2'` in the affection column, that is, the number of affected individuals. Two different ways of doing this are shown below.

In [7]:
x = '2'
def countList(affected, x): 
    count = 0
    for i in range(len(affected)): 
        if x in affected[i]: 
            count+= 1
        
    return count
print(countList(affected, x))

573


In [8]:
def countList(affected, x): 
      
    return sum(x in item for item in affected) 
print(countList(affected, x))

573
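A third option, sketched here under the assumption that each sublist of `affected` holds exactly one status string, is `collections.Counter`, which tallies every status in a single pass. The small `affected` list below is a hypothetical stand-in for the real one.

```python
from collections import Counter

# Hypothetical stand-in for `affected`: one single-element list per individual
affected = [['1'], ['1'], ['2'], ['2'], ['1'], ['2']]

# Tally every affection status at once
status_counts = Counter(item[0] for item in affected)

print(status_counts['2'])  # number of affected individuals
print(status_counts['1'])  # number of unaffected individuals
```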


To select part of a file based on certain conditions, it is easier to use pandas, as in the example below. For saving a DataFrame as a CSV file, see https://www.geeksforgeeks.org/saving-a-pandas-dataframe-as-a-csv/

In [9]:
import pandas as pd 
df_cases = pd.read_csv(ped_file, header=None, sep=' ', names=["famid", "iid", "fid", "mid", "sex", "aff", "snp1_1", "snp1_2", "snp2_1", "snp2_2", "snp3_1", "snp3_2", "snp4_1", "snp4_2"], index_col=False)
df1_cases = df_cases[df_cases['iid'].str.contains("O1")] 
df2_cases = df1_cases[:100] 
df2_cases.to_csv('seqsimla/input/results_200/cases100.txt', header=False, index=False, sep=' ')

In [10]:
len(df1_cases)

200

In [11]:
df_controls = pd.read_csv(ped_file, header=None, sep=' ', names=["famid", "iid", "fid", "mid", "sex", "aff", "snp1_1", "snp1_2", "snp2_1", "snp2_2", "snp3_1", "snp3_2", "snp4_1", "snp4_2"], index_col=False)
# Parenthesize the comparison: `&` binds more tightly than `==` in Python
df1_controls = df_controls[(df_controls['aff'] == 1) & df_controls['iid'].str.contains('^[^O]')]
df2_controls = df1_controls[200:300]
df2_controls.to_csv('seqsimla/input/results_200/controls100.txt', header=False, index=False, sep=' ')

Below, I try to count the affected offspring in the `proband_file`. This count should equal the total number of affected individuals, since the data were simulated that way from the beginning.

In [293]:
#FIXME: how to count the O1, O2, O3, ... present in the list?
s = range(1,10)
y = f'O{1}'
def countList(individual, y): 
    count = 0
    for j in range(len(individual)): 
        if y in individual[j]: 
            count+= 1
        
    return count
print(countList(individual, y))

172
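One possible way to count every offspring ID (O1, O2, O3, ...) at once is a regular expression. This is a sketch, assuming offspring IDs are the letter O followed by digits; the short `individual` list is a hypothetical stand-in for the real one.

```python
import re

# Hypothetical stand-in for the `individual` list built earlier (one ID per sublist)
individual = [['M1'], ['F1'], ['O2'], ['O1'], ['M2'], ['F2'], ['O2'], ['O1'], ['O10']]

# The letter O followed by one or more digits, matched against the whole ID
offspring_id = re.compile(r'O\d+')

def count_offspring(individual):
    """Count sublists whose single ID is an offspring ID such as O1, O2 or O10."""
    return sum(1 for sub in individual if offspring_id.fullmatch(sub[0]))

n_offspring = count_offspring(individual)
print(n_offspring)
```

Using `fullmatch` keeps the match exact, whereas a substring test for `'O1'` would also accept IDs such as `O10`.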


Now obtain some summary statistics in R, for example the MAF of each variant, the OR, and the genotype frequencies.

In [12]:
getwd()

In [141]:
dat = read.table('seqsimla/input/results_200/Sim21.ped') #read the genotypic data
cases100 = read.table('seqsimla/input/results_200/cases100.txt') #100 cases from the simulation
controls100 = read.table('seqsimla/input/results_200/controls100.txt')#100 controls from the simulation

In [142]:
head(cases100)

V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,V14
<fct>,<fct>,<fct>,<fct>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>
FAM1,O1,F1,M1,2,2,1,1,1,1,1,1,1,2
FAM2,O1,F2,M2,2,2,1,1,2,1,1,1,1,2
FAM3,O1,F3,M3,1,2,2,2,1,1,1,1,1,1
FAM4,O1,F4,M4,2,2,2,1,2,1,1,1,1,1
FAM5,O1,F5,M5,2,2,1,2,1,1,1,1,1,1
FAM6,O1,F6,M6,1,2,1,1,1,1,1,1,1,2


In [137]:
head(dat) #look at the data

V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,V14
<fct>,<fct>,<fct>,<fct>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>
FAM1,M1,0,0,2,1,1,1,1,1,1,1,2,2
FAM1,F1,0,0,1,1,1,1,1,1,1,1,1,1
FAM1,O2,F1,M1,2,2,1,1,1,1,1,1,1,2
FAM1,O1,F1,M1,2,2,1,1,1,1,1,1,1,2
FAM2,M2,0,0,2,1,1,1,1,1,1,1,2,2
FAM2,F2,0,0,1,1,1,1,1,2,1,1,1,1


In [143]:
head(controls100)

V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,V14
<fct>,<fct>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>
FAM101,M101,0,0,2,1,1,1,1,1,1,1,1,1
FAM101,F101,0,0,1,1,1,2,1,2,1,1,1,1
FAM102,M102,0,0,2,1,1,1,1,1,1,1,1,1
FAM102,F102,0,0,1,1,2,1,1,1,1,2,2,1
FAM103,M103,0,0,2,1,1,1,1,2,1,1,1,1
FAM103,F103,0,0,1,1,1,1,2,1,1,1,2,1


Use the `rbind` function to append two data frames in R:

In [148]:
dim(cases100)
class(cases100)

In [145]:
dim(controls100)

In [151]:
controls100[,'V3']<-factor(controls100[,'V3']) #convert integer to factor so we can bind the data later
controls100[,'V4']<-factor(controls100[,'V4'])
head(controls100)

V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,V14
<fct>,<fct>,<fct>,<fct>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>
FAM101,M101,0,0,2,1,1,1,1,1,1,1,1,1
FAM101,F101,0,0,1,1,1,2,1,2,1,1,1,1
FAM102,M102,0,0,2,1,1,1,1,1,1,1,1,1
FAM102,F102,0,0,1,1,2,1,1,1,1,2,2,1
FAM103,M103,0,0,2,1,1,1,1,2,1,1,1,1
FAM103,F103,0,0,1,1,1,1,2,1,1,1,2,1


In [153]:
cases_controls <- rbind(cases100,controls100)
dim(cases_controls)

In [156]:
head(cases_controls)

V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,V14
<fct>,<fct>,<fct>,<fct>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>
FAM1,O1,F1,M1,2,2,1,1,1,1,1,1,1,2
FAM2,O1,F2,M2,2,2,1,1,2,1,1,1,1,2
FAM3,O1,F3,M3,1,2,2,2,1,1,1,1,1,1
FAM4,O1,F4,M4,2,2,2,1,2,1,1,1,1,1
FAM5,O1,F5,M5,2,2,1,2,1,1,1,1,1,1
FAM6,O1,F6,M6,1,2,1,1,1,1,1,1,1,2


In [None]:
caco = cases_controls[,-(1:5)]-1 # drop pedigree columns 1 to 5, keeping affection status (V6) and the alleles; subtracting 1 recodes 1/2 as 0/1
caco

In [163]:
caco_counts = sapply(split.default(caco, 1:(length(caco)) %/% 2), rowSums) # sum the two allele columns of each SNP; column 1 (affection status) forms its own group 0
caco_counts

0,1,2,3,4
1,0,0,0,1
1,0,1,0,1
1,2,0,0,0
1,1,1,0,0
1,1,0,0,0
1,0,0,0,1
1,0,0,0,0
1,1,1,0,0
1,2,0,0,0
1,0,1,0,1


In [180]:
#colnames(caco_counts) <- c("Affection","SNP1","SNP2","SNP3","SNP4")
caco1 <- as.data.frame(caco_counts)
y = table(as.matrix(caco1)[,1], as.matrix(caco1)[,2], 
dnn = c("Affected", "Genotype"))
y

        Genotype
Affected  0  1  2
       0 64 33  3
       1 62 34  4

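Calculate the OR from this table, comparing the odds of carrying at least one copy of the allele (genotype 1 or 2) between affected and unaffected individuals: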

In [184]:
odds_ratio = ((y[2,2] + y[2,3])/y[2,1])/((y[1,2] + y[1,3])/y[1,1])
odds_ratio
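As a worked check of this calculation, here is a short Python sketch using the counts from the table above. The names `table` and `carrier_odds` are illustrative assumptions, not part of the original analysis.

```python
# Genotype counts copied from the table above: keys are affection status (0, 1),
# values are the counts of genotypes 0, 1 and 2 at the first SNP.
table = {
    0: [64, 33, 3],  # unaffected
    1: [62, 34, 4],  # affected
}

def carrier_odds(counts):
    """Odds of carrying at least one copy versus carrying none."""
    non_carriers, het, hom = counts
    return (het + hom) / non_carriers

# (38 / 62) / (36 / 64)
odds_ratio = carrier_odds(table[1]) / carrier_odds(table[0])
print(round(odds_ratio, 4))
```

With these counts the ratio is about 1.09, that is, carriers and non-carriers have nearly the same odds of being affected at this SNP.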

In [25]:
cases_geno = cases100[,-(1:6)] -1
cases_geno2 = sapply(split.default(cases_geno, 0:(length(cases_geno)-1) %/% 2), rowSums) #sum the allele counts for every marker
cases_geno3 = t(cases_geno2) #transpose the matrix to have snp-by-individual
cases_geno3

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21
0,0,0,2,1,1,0,0,1,2,0,⋯,1,0,1,0,0,0,1,1,0,1
1,0,1,0,1,0,0,0,1,0,1,⋯,0,0,0,0,0,0,0,0,0,1
2,0,0,0,0,0,0,0,0,0,0,⋯,1,0,0,0,0,0,1,0,0,0
3,1,1,0,0,0,1,0,0,0,1,⋯,1,0,0,1,1,0,0,0,0,0


Select only the founders from the sample

In [18]:
founders = dat[which(dat[,'V3'] == '0'), ]
founders

Unnamed: 0_level_0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,V14
Unnamed: 0_level_1,<fct>,<fct>,<fct>,<fct>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>
1,FAM1,M1,0,0,2,1,1,1,1,1,1,1,2,2
2,FAM1,F1,0,0,1,1,1,1,1,1,1,1,1,1
5,FAM2,M2,0,0,2,1,1,1,1,1,1,1,2,2
6,FAM2,F2,0,0,1,1,1,1,1,2,1,1,1,1
9,FAM3,M3,0,0,2,1,1,2,1,1,1,1,1,1
10,FAM3,F3,0,0,1,1,2,1,1,1,1,1,1,1
13,FAM4,M4,0,0,2,1,1,1,1,1,1,1,1,1
14,FAM4,F4,0,0,1,1,2,1,2,1,1,1,1,1
17,FAM5,M5,0,0,2,1,2,1,1,1,1,1,1,1
18,FAM5,F5,0,0,1,1,1,1,1,1,1,1,1,1


In [19]:
# Indexing: drop columns 1 to 6 (pedigree information) and keep only the genotype columns.
# Subtracting 1 recodes the alleles from 1/2 to 0/1, which makes the allele counts easier.
geno = founders[,-(1:6)] - 1

In [20]:
head(geno) #look at the data

Unnamed: 0_level_0,V7,V8,V9,V10,V11,V12,V13,V14
Unnamed: 0_level_1,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,0,0,0,0,0,0,1,1
2,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,1,1
6,0,0,0,1,0,0,0,0
9,0,1,0,0,0,0,0,0
10,1,0,0,0,0,0,0,0


In [21]:
L = ncol(geno) #how many columns are there
L

In [22]:
#genotypes = apply(cbind(geno$V7,geno$V8),1,sum,na.rm = TRUE) # useful for summing only a specific set of columns in every row of the data frame while ignoring missing values
geno1 = sapply(split.default(geno, 0:(length(geno)-1) %/% 2), rowSums) #Sum the allele counts for every marker
geno1

Unnamed: 0,0,1,2,3
1,0,0,0,2
2,0,0,0,0
5,0,0,0,2
6,0,1,0,0
9,1,0,0,0
10,1,0,0,0
13,0,0,0,0
14,1,1,0,0
17,1,0,0,0
18,0,0,0,0


Transpose the matrix to obtain a SNP-by-individual matrix:

In [23]:
geno2 = t(geno1)
geno2

Unnamed: 0,1,2,5,6,9,10,13,14,17,18,⋯,939,940,947,948,955,956,963,964,971,972
0,0,0,0,0,1,1,0,1,1,0,⋯,0,1,2,0,0,1,1,0,1,0
1,0,0,0,1,0,0,0,1,0,0,⋯,0,0,1,0,0,0,1,0,0,0
2,0,0,0,0,0,0,0,0,0,0,⋯,0,0,0,0,0,0,0,0,0,0
3,2,0,2,0,0,0,0,0,0,0,⋯,0,0,1,0,0,0,0,0,0,1
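The allele-pair sum and the transpose can also be sketched in plain Python. The three rows below are founders 1, 2 and 6 from the `head(geno)` output above; the function name `allele_pairs_to_counts` is an assumption made for illustration.

```python
# 0/1-coded allele rows (two columns per SNP, four SNPs) for founders 1, 2 and 6
geno = [
    [0, 0, 0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 0, 0],
]

def allele_pairs_to_counts(row):
    """Sum consecutive allele pairs: columns (0, 1) give SNP 0, (2, 3) give SNP 1, ..."""
    return [row[i] + row[i + 1] for i in range(0, len(row), 2)]

geno1 = [allele_pairs_to_counts(row) for row in geno]  # individual-by-SNP
geno2 = [list(col) for col in zip(*geno1)]             # transpose: SNP-by-individual

print(geno1)
print(geno2)
```

Each entry of `geno1` is then the number of copies (0, 1 or 2) of the allele originally coded 2.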


Now calculate the minor allele and genotype frequencies in the founders to check that they match expectations. The code is taken from https://github.com/ekfchan/evachan.org-Rscripts/blob/master/rscripts/calc_snp_stats.R

In [26]:
m <- nrow(geno2)     ## number of snps
n <- ncol(geno2)     ## number of individuals

## assign all non {0,1,2} to NA
geno2[(geno2!=0) & (geno2!=1) & (geno2!=2)] <- NA 
geno2 <- as.matrix(geno2)

## calc_n
n0 <- apply(geno2==0,1,sum,na.rm=T)
n1 <- apply(geno2==1,1,sum,na.rm=T)
n2 <- apply(geno2==2,1,sum,na.rm=T)
n <- n0 + n1 + n2

## calculate allele frequencies in the founders
p <- ((2*n0)+n1)/(2*n)
q <- 1 - p
maf <- pmin(p, q) #minor allele frequency
mgf <- apply(cbind(n0,n1,n2),1,min) / n #minor genotype frequency

## HWE: Chi-Square test
obs <- cbind(n0=n0,n1=n1,n2=n2)
exp <- cbind(p*p, 2*p*q, q*q)
exp <- exp*n
chisq <- (obs-exp)
chisq <- (chisq*chisq) /exp
hwe.chisq <- apply(chisq,1,sum)
hwe.chisq.p <- 1-pchisq(hwe.chisq,df=1)

## HWE: Fisher's Exact test
z <- cbind(n0, ceiling(n1/2), floor(n1/2), n2)
z <- lapply( split( z, 1:nrow(z) ), matrix, ncol=2 )
z <- lapply( z, fisher.test )
hwe.fisher <- as.numeric(unlist(lapply(z, "[[", "estimate")))
hwe.fisher.p <- as.numeric(unlist(lapply(z, "[[", "p.value")))

# MODIFIED 21 Oct 2012:  prior to this version, we had "mono=(mgf<0)" instead of "mono<(maf<0)"
res <- data.frame( n=n, n0=n0, n1=n1, n2=n2, p=p, maf=maf, mgf=mgf,
                        mono=(maf<=0), loh=(n1<=0), 
                        hwe.chisq=hwe.chisq, hwe.chisq.p=hwe.chisq.p,
                        hwe.fisher=hwe.fisher, hwe.fisher.p=hwe.fisher.p, 
                        stringsAsFactors=F )
row.names(res) <- row.names(geno2)
res

Unnamed: 0_level_0,n,n0,n1,n2,p,maf,mgf,mono,loh,hwe.chisq,hwe.chisq.p,hwe.fisher,hwe.fisher.p
Unnamed: 0_level_1,<int>,<int>,<int>,<int>,<dbl>,<dbl>,<dbl>,<lgl>,<lgl>,<dbl>,<dbl>,<dbl>,<dbl>
0,400,258,127,15,0.80375,0.19625,0.0375,False,False,0.0165321,0.89769218,0.9599179,1.0
1,400,324,70,6,0.8975,0.1025,0.015,False,False,0.9544675,0.32858473,1.5848229,0.41060401
2,400,328,71,1,0.90875,0.09125,0.0025,False,False,1.9748323,0.15993588,0.2608709,0.23010924
3,400,338,56,6,0.915,0.085,0.015,False,False,3.9974289,0.04556973,2.5783488,0.05596811
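To make the arithmetic behind the `maf`, `hwe.chisq` and `hwe.chisq.p` columns explicit, here is a small Python sketch applied to the first SNP in the table above (n0 = 258, n1 = 127, n2 = 15). The function name `snp_stats` is an assumption, and the p-value uses the identity that the upper tail of a chi-square distribution with 1 degree of freedom equals `erfc(sqrt(x / 2))`.

```python
import math

def snp_stats(n0, n1, n2):
    """MAF and 1-df HWE chi-square test from counts of genotypes 0, 1 and 2."""
    n = n0 + n1 + n2
    p = (2 * n0 + n1) / (2 * n)  # frequency of the allele coded 0
    q = 1 - p
    maf = min(p, q)
    observed = [n0, n1, n2]
    expected = [p * p * n, 2 * p * q * n, q * q * n]  # Hardy-Weinberg expectations
    chisq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper tail of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(chisq / 2))
    return maf, chisq, p_value

maf, chisq, p_value = snp_stats(258, 127, 15)  # first SNP in the table above
print(maf, chisq, p_value)
```

The sketch does not guard against monomorphic SNPs, for which an expected count of zero would cause a division by zero.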


Calculate odds ratios based on the cases100 and controls100 data:

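Here $Y=1$ denotes an affected individual and $X=1$ a carrier of the causal allele. The odds ratio compares the odds of disease in carriers with the odds of disease in non-carriers:

$ OR = \dfrac{\dfrac{P(Y=1|X=1)}{1-P(Y=1|X=1)}}{\dfrac{P(Y=1|X=0)}{1-P(Y=1|X=0)}} $

$ P(X=1) = P(X=1|Y=1) P(Y=1)+ P(X=1|Y=0) P(Y=0) $

$ P(Y=1) = \text{prevalence} $

$ P(Y=1|X=1) = \dfrac {P(X=1|Y=1)P(Y=1)}{P(X=1)} $

$ P(Y=0|X=1) = \dfrac {P(X=1|Y=0)P(Y=0)}{P(X=1)} = 1 - P(Y=1|X=1) $

$ P(Y=1|X=0) = \dfrac {P(X=0|Y=1)P(Y=1)}{1 - P(X=1)} $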
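As a worked example of Bayes' rule in this setting, the Python sketch below computes the odds ratio of disease for carriers ($X=1$) versus non-carriers ($X=0$), the standard definition of an odds ratio. The carrier frequencies come from the genotype table earlier (38 of 100 cases and 36 of 100 controls carry at least one copy at the first SNP); the prevalence values and the function name are hypothetical.

```python
def odds_ratio_from_bayes(p_x_case, p_x_control, prevalence):
    """Odds ratio of disease, carriers versus non-carriers, via Bayes' rule."""
    p_x = p_x_case * prevalence + p_x_control * (1 - prevalence)  # P(X=1)
    p_y_x1 = p_x_case * prevalence / p_x                          # P(Y=1|X=1)
    p_y_x0 = (1 - p_x_case) * prevalence / (1 - p_x)              # P(Y=1|X=0)
    return (p_y_x1 / (1 - p_y_x1)) / (p_y_x0 / (1 - p_y_x0))

# Carrier frequencies at the first SNP; the prevalence values are hypothetical
results = {prev: odds_ratio_from_bayes(0.38, 0.36, prev) for prev in (0.01, 0.1, 0.3)}
for prev, value in results.items():
    print(prev, round(value, 4))
```

The prevalence cancels out of the ratio, so every prevalence gives the same value as the direct case-control calculation, (38 × 64) / (36 × 62).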

Count the number of individuals carrying the causal allele among the cases and among the controls to build a 2x2 table.

In [57]:
m <- nrow(cases_geno3)     ## number of snps
n <- ncol(cases_geno3)     ## number of individuals

## assign all non {0,1,2} to NA
cases_geno3[(cases_geno3!=0) & (cases_geno3!=1) & (cases_geno3!=2)] <- NA 
cases_geno3 <- as.matrix(cases_geno3)

## calc_n
n0 <- apply(cases_geno3==0,1,sum,na.rm=T)
n1 <- apply(cases_geno3==1,1,sum,na.rm=T)
n2 <- apply(cases_geno3==2,1,sum,na.rm=T)
n <- n0 + n1 + n2

## calculate allele frequencies in the cases
p <- ((2*n0)+n1)/(2*n)
q <- 1 - p
maf <- pmin(p, q) #minor allele frequency
mgf <- apply(cbind(n0,n1,n2),1,min) / n #minor genotype frequency


res <- data.frame( n=n, n0=n0, n1=n1, n2=n2, p=p, maf=maf, mgf=mgf, 
                        stringsAsFactors=F )
row.names(res) <- row.names(cases_geno3) # label the rows with the SNP indices of the cases matrix
res

Unnamed: 0_level_0,n,n0,n1,n2,p,maf,mgf
Unnamed: 0_level_1,<int>,<int>,<int>,<int>,<dbl>,<dbl>,<dbl>
0,100,62,34,4,0.79,0.21,0.04
1,100,81,18,1,0.9,0.1,0.01
2,100,89,11,0,0.945,0.055,0.0
3,100,77,22,1,0.88,0.12,0.01


In [73]:
a = res[1,3] + res[1,4] # cases carrying at least one copy (n1 + n2) at the first SNP
a

In [74]:
c = res[1,2] # cases carrying no copy (n0) at the first SNP
c

In [187]:
f <- function(x) {x^2 - 2*x + 1}
uniroot(f, lower = -1, upper = 1)