# Basics of Population Genetics

This notebook steps through some of the basic concepts in population genetics. You will:
* Compute allele frequencies
* Compute Hardy-Weinberg equilibrium (HWE) expectations and use a chi-square test to check whether the healthy controls deviate from HWE
* Compute an odds ratio and run a chi-square test for association between an allele and disease status in the diseased and control populations

## Allele Frequency

The frequency of an allele is defined as the total number of copies of that allele in the population divided by the total number of copies of all alleles of the gene. 

Assume we have a population with the following genotype counts:

<img src = "alleleFrequency.png">

We can calculate:
- total number of *A* alleles: $2 \times 180 + 240 = 600$
- total number of *a* alleles: $2 \times 80 + 240 = 400$

*A* is the more common allele, so it is referred to as the major allele; *a* is the minor allele.

minor allele frequency $=$ total number of *a* alleles $/$ total number of alleles $= 400 / 1000 = 0.4$
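
The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, assuming the genotype counts implied by the calculation (180 *AA*, 240 *Aa*, 80 *aa*); the function name `allele_frequencies` is a hypothetical helper, not part of any library.

```python
def allele_frequencies(n_AA, n_Aa, n_aa):
    """Return (freq_A, freq_a) from diploid genotype counts."""
    total_alleles = 2 * (n_AA + n_Aa + n_aa)   # each individual carries two alleles
    count_A = 2 * n_AA + n_Aa                  # homozygotes contribute two copies, heterozygotes one
    count_a = 2 * n_aa + n_Aa
    return count_A / total_alleles, count_a / total_alleles

# Genotype counts assumed from the worked example above
freq_A, freq_a = allele_frequencies(180, 240, 80)
print(freq_A, freq_a)
```

Because every individual contributes exactly two alleles, the two frequencies always sum to 1.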



The year is 1999, and an investigator has painstakingly genotyped one SNP (rsGOINGALLIN) in individuals with and without bipolar disorder. rsGOINGALLIN can take one of three genotypes: CC, CT, or TT. The investigator has collected the following data:


|Disease/Controls  |CC    | CT  | TT  | 
|------------------|------|-----|-----| 
| Bipolar Disorder | 270  | 957 | 771 |
| Healthy Controls | 436  |1398 | 1170|


In [26]:
# What is the allele frequency (C and T) in the bipolar population? In the Controls?
bip_c = (270*2 + 957)/((270+957+771)*2)        # minor allele frequency
print("Bipolar population C: " + str(bip_c))
bip_t = (771*2 + 957)/((270+957+771)*2)        # major allele frequency
print("Bipolar population T: " + str(bip_t))
health_c = (436*2+1398)/((436+1398+1170)*2)
print("Healthy population C: " + str(health_c))
health_t = (1170*2+1398)/((436+1398+1170)*2)
print("Healthy population T: " + str(health_t))

Bipolar population C: 0.37462462462462465
Bipolar population T: 0.6253753753753754
Healthy population C: 0.37782956058588546
Healthy population T: 0.6221704394141145


## Genetic Equilibrium and Hardy Weinberg Principle

(courtesy: https://www.cs.cmu.edu/~genetics/units/instructions/instructions-PGE.pdf)

A population is in genetic equilibrium when allele frequencies in the gene pool remain constant across generations. A gene pool will be in equilibrium under the following conditions:

* the population is very large
* individuals in the population mate randomly
* there is no migration into or out of the population
* natural selection does not act on any specific genotypes
* males and females have the same allele frequencies 
* no mutations occur

In 1908 Godfrey Hardy and Wilhelm Weinberg, working independently, specified the relationship between 
genotype frequencies and allele frequencies that must occur in such an idealized population in equilibrium. This
relationship, known as the Hardy-Weinberg principle, is important because we can use it to determine if a 
population is in equilibrium for a particular gene.

<img src = "HWE.png" >

Assume 

* p = The frequency of the major allele A in the population (0.6 above)
* q = The frequency of the minor allele a in the population (0.4 above)

The Hardy-Weinberg principle states that when a population is in equilibrium:

* frequency of AA $= p^2$
* frequency of Aa $= 2pq$
* frequency of aa $= q^2$

And: $p^2 + 2pq + q^2 = 1$
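
As a quick check of these formulas, the sketch below computes the expected genotype frequencies for the example values p = 0.6 and q = 0.4. The function name `hardy_weinberg_expected` is a hypothetical helper introduced only for illustration.

```python
import math

def hardy_weinberg_expected(p):
    """Expected genotype frequencies (AA, Aa, aa) for major allele frequency p."""
    q = 1.0 - p                      # only two alleles, so q is the remainder
    return p * p, 2 * p * q, q * q

freq_AA, freq_Aa, freq_aa = hardy_weinberg_expected(0.6)
print(round(freq_AA, 2), round(freq_Aa, 2), round(freq_aa, 2))   # 0.36 0.48 0.16

# The three frequencies sum to 1 because p^2 + 2pq + q^2 = (p + q)^2
assert math.isclose(freq_AA + freq_Aa + freq_aa, 1.0)
```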


To determine if a population is in equilibrium, given the population genotype numbers, 

(1) calculate the allele frequencies from the observed population genotype numbers

(2) calculate the genotype frequencies from the observed genotype numbers

(3) apply the Hardy-Weinberg principle to calculate the expected genotype frequencies from the allele frequencies 
in the population.

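(4) If the population is in Hardy-Weinberg equilibrium, the observed genotype frequencies from step 2 will be (roughly) the same as the expected frequencies from step 3. A chi-square test is used to determine whether the observed and expected genotypes differ significantly.

The test statistic sums over the three genotypes $g$:

$\chi^2 = \sum_g (observed_g - expected_g)^2 / expected_g$

Here $observed_g$ and $expected_g$ are genotype counts (genotype frequency times the number of individuals), not frequencies. With three genotype classes and one allele frequency estimated from the data, the test has 1 degree of freedom.

A p-value below 0.05 implies that the population is not in equilibrium.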
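
The four steps can be sketched as a single function. This is a minimal illustration using only the standard library, not the notebook's own solution: `hwe_chi_square` is a hypothetical helper, it works on genotype counts rather than frequencies, and it assumes the test has 1 degree of freedom (three genotype classes, minus one, minus one estimated allele frequency), for which the chi-square p-value equals `erfc(sqrt(x / 2))`.

```python
import math

def hwe_chi_square(n_AA, n_Aa, n_aa):
    """Chi-square test for deviation from Hardy-Weinberg equilibrium.

    Takes observed genotype counts; returns (statistic, p_value) for 1 degree of freedom.
    """
    n = n_AA + n_Aa + n_aa
    # Step 1: allele frequencies from the observed genotype counts
    p = (2 * n_AA + n_Aa) / (2 * n)
    q = 1.0 - p
    # Steps 2-3: observed counts and the counts expected under HWE
    observed = [n_AA, n_Aa, n_aa]
    expected = [p * p * n, 2 * p * q * n, q * q * n]
    # Step 4: sum of (observed - expected)^2 / expected over the genotypes
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of the chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Genotype counts assumed from the allele frequency example (180 AA, 240 Aa, 80 aa)
statistic, p_value = hwe_chi_square(180, 240, 80)
print(f"chi-square = {statistic:.4f}, p-value = {p_value:.4f}")
```

For these counts the observed and expected values coincide, so the statistic is essentially zero and the p-value is essentially 1: the example population is in equilibrium.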

In [25]:
# Compute the HWE equilibrium for healthy controls. Is there deviation from HWE?
# Step 1: Allele frequencies already calculated

# Step 2: Calculating genotype frequencies
tot = 436+1398+1170
print("Step 2")
health_cc = 436/tot
print("Healthy population CC: " + str(health_cc))
health_ct = 1398/tot
health_tt = 1170/tot
print("Healthy population CT: "+ str(health_ct))
print("Healthy population TT: "+ str(health_tt))

# Step 3: apply HWP to calc genotype frequencies based on allele frequencies
print("\nStep 3")
exp_health_cc = health_c**2
exp_health_ct = health_t * health_c * 2
exp_health_tt = health_t**2
print("Expected Healthy population CC: " + str(exp_health_cc))
print("Expected Healthy population CT: " + str(exp_health_ct))
print("Expected Healthy population TT: " + str(exp_health_tt))

# Step 4: compare observed and expected genotype counts
# The chi-square test needs counts, not frequencies, so scale the expected frequencies by the sample size
print("\nStep 4")
from scipy.stats import chisquare
reality = [436, 1398, 1170]
expectations = [exp_health_cc*tot, exp_health_ct*tot, exp_health_tt*tot]
for name, obs, exp in zip(["CC", "CT", "TT"], reality, expectations):
    print(f"{name} HWE = {(obs-exp)**2/exp:.4f}")
# ddof=1 because one allele frequency was estimated from the data (df = 3 - 1 - 1 = 1)
result = chisquare(reality, expectations, ddof=1)
print(f"chi-square = {result.statistic:.4f}, p-value = {result.pvalue:.3f}")   # p > 0.05: no evidence of deviation from HWE

Step 2
Healthy population CC: 0.14513981358189082
Healthy population CT: 0.46537949400798934
Healthy population TT: 0.38948069241011984

Step 3
Expected Healthy population CC: 0.14275517685252329
Expected Healthy population CT: 0.47014876746672435
Expected Healthy population TT: 0.3870960556807524

Step 4
CC HWE = 0.1197
CT HWE = 0.1453
TT HWE = 0.0441
chi-square = 0.3091, p-value = 0.578


## Genome Wide Association Studies 

The goal of a genome-wide association study (GWAS) is to determine whether the allele frequencies in the diseased population differ significantly from the allele frequencies in the control population.
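
As an illustration of this comparison, the sketch below converts the genotype counts from the bipolar disorder table above into allele counts and computes the odds ratio for the C allele. The helper name `allele_counts` is hypothetical and introduced only for this example.

```python
def allele_counts(n_CC, n_CT, n_TT):
    """Return (count_C, count_T) from genotype counts."""
    return 2 * n_CC + n_CT, 2 * n_TT + n_CT

# Genotype counts taken from the table above
case_C, case_T = allele_counts(270, 957, 771)            # bipolar disorder
control_C, control_T = allele_counts(436, 1398, 1170)    # healthy controls

# Odds of carrying C rather than T in cases, divided by the same odds in controls
odds_ratio = (case_C / case_T) / (control_C / control_T)
print(case_C, case_T, control_C, control_T)   # 1497 2499 2270 3738
print(round(odds_ratio, 4))                   # 0.9864
```

An odds ratio close to 1 means the C allele is about equally common in both groups; a significance test is still needed to judge whether the small difference could be due to chance.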

In [38]:
# Compute the odds ratio for the minor allele (C) in bipolar disorder vs. the controls
from scipy.stats import chi2_contingency
num_bip_c=1497          # 270*2 + 957
num_bip_t=2499          # 771*2 + 957
num_health_c=2270       # 436*2 + 1398
num_health_t=3738       # 1170*2 + 1398
odds_ratio = num_bip_c*num_health_t/(num_health_c*num_bip_t)
print("Odds Ratio: "+str(odds_ratio))
# chi2_contingency expects a single 2x2 table: rows = groups, columns = alleles
table = [[num_bip_c,num_bip_t],[num_health_c,num_health_t]]
arr = chi2_contingency(table)                                # Yates' continuity correction is applied by default for 2x2 tables
print(f"p-value: {arr[1]:.3f}")                              # p-value > 0.05, so there is no evidence of association

Odds Ratio: 0.9864361603672306
p-value: 0.762


In [None]:
# Execute a chi-square test to test the association of the allele frequencies in bipolar vs. healthy controls.
# Is the association significant?
