# Basics of Population Genetics

In this notebook we will step through some of the basic concepts in population genetics. You will:
* Compute allele frequencies
* Compute Hardy-Weinberg equilibrium (HWE) expectations and use a chi-square test to check whether the healthy controls deviate from HWE
* Run a chi-square test for association between allele frequencies in the control and diseased populations

## Allele Frequency

The frequency of an allele is defined as the total number of copies of that allele in the population divided by the total number of copies of all alleles of the gene. 

Assume we have a population with the following distributions:

<img src = "alleleFrequency.png">

We can calculate:
- total number of A alleles: $2 \times 180 + 240 = 600$
- total number of a alleles: $2 \times 80 + 240 = 400$

*A* is referred to as the major allele and *a* is the minor allele.

minor allele frequency $=$ total number of *a* alleles $/$ total number of alleles $= 400/1000 = 0.4$
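
The same arithmetic can be sketched in Python. The function name `allele_frequencies` and its argument names are illustrative and not part of the original notebook; the genotype counts (180 AA, 240 Aa, 80 aa) are the ones implied by the calculation above.

```python
def allele_frequencies(n_AA, n_Aa, n_aa):
    """Return (freq_A, freq_a) from genotype counts at a biallelic locus."""
    # Each homozygote carries two copies of its allele; each heterozygote one of each.
    count_A = 2 * n_AA + n_Aa
    count_a = 2 * n_aa + n_Aa
    total = 2 * (n_AA + n_Aa + n_aa)  # two alleles per diploid individual
    return count_A / total, count_a / total

freq_A, freq_a = allele_frequencies(180, 240, 80)
print(freq_A, freq_a)  # 0.6 0.4
```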



The year is 1999, and an investigator has painstakingly genotyped one SNP, rsGOINGALLIN, in individuals with and without bipolar disorder. rsGOINGALLIN can take on three genotype configurations: CC, CT, and TT. He has collected the following data:


|Disease/Controls  |CC    | CT  | TT  | 
|------------------|------|-----|-----| 
| Bipolar Disorder | 270  | 957 | 771 |
| Healthy Controls | 436  |1398 | 1170|


In [5]:
# What is the allele frequency (C and T) in the bipolar population? In the Controls?
c_bp = 270 * 2 + 957 * 1
c_cont = 436 * 2 + 1398 * 1
t_bp = 771 * 2 + 957 * 1
t_cont = 1170 * 2 + 1398 * 1
print('C in Bipolar', 'C in Healthy', 'T in Bipolar', 'T in Healthy')
print('counts', c_bp, c_cont, t_bp, t_cont)

bp_allele_cnt = (270 + 957 + 771) * 2
cont_allele_cnt = (436 + 1398 + 1170) * 2

print('frequencies', c_bp/bp_allele_cnt, c_cont/cont_allele_cnt, t_bp/bp_allele_cnt, t_cont/cont_allele_cnt)

C in Bipolar C in Healthy T in Bipolar T in Healthy
counts 1497 2270 2499 3738
frequencies 0.37462462462462465 0.37782956058588546 0.6253753753753754 0.6221704394141145


## Genetic Equilibrium and Hardy Weinberg Principle

(courtesy: https://www.cs.cmu.edu/~genetics/units/instructions/instructions-PGE.pdf)

A population is in genetic equilibrium when allele frequencies in the gene pool remain constant across generations. A gene pool will be in equilibrium under the following conditions:

* the population is very large
* individuals in the population mate randomly
* there is no migration into or out of the population
* natural selection does not act on any specific genotypes
* males and females have the same allele frequencies 
* no mutations occur

In 1908 Godfrey Hardy and Wilhelm Weinberg, working independently, specified the relationship between 
genotype frequencies and allele frequencies that must occur in such an idealized population in equilibrium. This
relationship, known as the Hardy-Weinberg principle, is important because we can use it to determine if a 
population is in equilibrium for a particular gene.

<img src = "HWE.png" >

Assume 

* p = The frequency of the major allele A in the population (0.6 above)
* q = The frequency of the minor allele a in the population (0.4 above)

The Hardy-Weinberg principle states that when a population is in equilibrium:

* frequency of AA $= p^2$
* frequency of Aa $= 2pq$
* frequency of aa $= q^2$

And: $p^2 + 2pq + q^2 = 1$
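
As a quick check on the toy population above, the sketch below derives the expected genotype proportions from p = 0.6. The helper name `hwe_expected_frequencies` is an assumption made for illustration.

```python
def hwe_expected_frequencies(p):
    """Expected (AA, Aa, aa) proportions under Hardy-Weinberg for major allele frequency p."""
    q = 1.0 - p  # with two alleles, the minor allele frequency is the complement
    return p * p, 2 * p * q, q * q

f_AA, f_Aa, f_aa = hwe_expected_frequencies(0.6)
print(round(f_AA, 2), round(f_Aa, 2), round(f_aa, 2))  # 0.36 0.48 0.16
```

Multiplying these proportions by the 500 individuals gives 180, 240, and 80, which matches the genotype counts used in the allele frequency example, so that toy population sits exactly at Hardy-Weinberg proportions.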


To determine if a population is in equilibrium, given the population genotype numbers:

(1) calculate the allele frequencies from the observed population genotype numbers

(2) calculate the genotype frequencies from the observed genotype numbers

(3) apply the Hardy-Weinberg principle to calculate the expected genotype frequencies from the allele frequencies 
in the population.

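(4) If the population is in Hardy-Weinberg equilibrium, the observed genotype frequencies from step 2 will be (roughly) the same as the expected frequencies from step 3. A chi-square test is used to determine whether the observed and expected genotype counts are statistically different.

The test statistic sums over the three genotypes and is computed on counts (frequencies multiplied by the number of individuals), not on proportions:

HWE $= \sum_{genotypes} (observed - expected)^2 / expected$

With three genotype classes and one allele frequency estimated from the data, the statistic has 1 degree of freedom. A chi-square p-value below 0.05 implies that the population is not in equilibrium.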
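
The four steps can be put together as below. This is a minimal sketch, not the notebook's reference solution: `hwe_chi_square` is a hypothetical helper, it assumes both alleles are present (otherwise an expected count is zero), and it relies on the identity that for one degree of freedom the chi-square tail probability equals `erfc(sqrt(x / 2))`, so only the standard library is needed.

```python
import math

def hwe_chi_square(n_AA, n_Aa, n_aa):
    """Chi-square test (1 degree of freedom) for deviation from Hardy-Weinberg proportions."""
    n = n_AA + n_Aa + n_aa
    # Step 1: allele frequencies from the observed genotype counts
    p = (2 * n_AA + n_Aa) / (2 * n)
    q = 1.0 - p
    # Steps 2 and 3: observed counts and expected counts under HWE
    observed = [n_AA, n_Aa, n_aa]
    expected = [p * p * n, 2 * p * q * n, q * q * n]
    # Step 4: sum of (observed - expected)^2 / expected over the three genotypes
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper-tail probability of a chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

chi2, p_value = hwe_chi_square(180, 240, 80)
print(f"chi-square = {chi2:.3f}, p-value = {p_value:.3f}")  # chi-square = 0.000, p-value = 1.000
```

Working on counts rather than proportions matters because the statistic scales with the number of individuals: the same proportional deviation is stronger evidence in a larger sample.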

In [32]:
# Test the healthy controls for deviation from Hardy-Weinberg equilibrium.

import scipy.stats as stats

# Genotype counts in healthy controls
CC_observed, CT_observed, TT_observed = 436, 1398, 1170
n_controls = CC_observed + CT_observed + TT_observed

# Allele frequencies: p is the frequency of C, q is the frequency of T
healthy_c_freq = c_cont/cont_allele_cnt
healthy_t_freq = t_cont/cont_allele_cnt

# Expected genotype counts under HWE (the chi-square test needs counts, not proportions)
CC_expected = healthy_c_freq**2 * n_controls
CT_expected = 2*healthy_c_freq*healthy_t_freq * n_controls
TT_expected = healthy_t_freq**2 * n_controls

# Chi-square statistic computed by hand
HWE = ((CC_observed-CC_expected)**2/CC_expected
       + (CT_observed-CT_expected)**2/CT_expected
       + (TT_observed-TT_expected)**2/TT_expected)

In [34]:
# Chi-Square test; ddof=1 leaves 3 - 1 - 1 = 1 degree of freedom
result = stats.chisquare([CC_observed, CT_observed, TT_observed], [CC_expected, CT_expected, TT_expected], ddof=1)
print(f"chi-square = {result.statistic:.3f}, p-value = {result.pvalue:.3f}")

chi-square = 0.309, p-value = 0.578

## Genome Wide Association Studies 

The goal of a genome-wide association study (GWAS) is to determine whether the allele frequencies in the diseased population differ significantly from the allele frequencies in the control population.

In [29]:
# Compute the odds ratio for the minor allele (C) in bipolar disorder vs. healthy controls.
# Odds of C among cases (c_bp/t_bp) divided by odds of C among controls (c_cont/t_cont)
oddsratio = c_bp*t_cont/(t_bp*c_cont)
oddsratio

0.9864361603672306

In [None]:
# Execute a chi-square test for association between allele frequencies in bipolar disorder vs. healthy controls.
# Is the association significant?
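
One possible way to approach this exercise is sketched below; it is not the notebook's official answer. It runs a Pearson chi-square test on the 2×2 table of allele counts (C and T in cases versus controls) without a continuity correction, and uses `erfc(sqrt(x / 2))` for the one-degree-of-freedom p-value. The function name `allele_chi_square` is hypothetical. `scipy.stats.chi2_contingency` would be an alternative, keeping in mind that it applies Yates' continuity correction to 2×2 tables by default.

```python
import math

def allele_chi_square(table):
    """Pearson chi-square test for a 2x2 table of counts, given as [[a, b], [c, d]]."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if allele and disease status were independent
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    # A 2x2 table has (2 - 1) * (2 - 1) = 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Allele counts derived from the genotype table: rows = bipolar, controls; columns = C, T
allele_table = [
    [270 * 2 + 957, 771 * 2 + 957],
    [436 * 2 + 1398, 1170 * 2 + 1398],
]
chi2, p_value = allele_chi_square(allele_table)
print(f"chi-square = {chi2:.3f}, p-value = {p_value:.3f}")
```

For these counts the statistic should come out small (roughly 0.1), far below the 3.84 threshold for significance at the 0.05 level with one degree of freedom. That would indicate no significant allelic association, in line with an odds ratio close to 1.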
