# Term Enrichment Examples
**References**  
- https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.hypergeom.html
- http://prob140.org/sp17/textbook/ch8/Hypergeometric_Distribution.html

**Website for checking p-values**
- https://stattrek.com/online-calculator/hypergeometric

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import pandas as pd
import numpy as np
from scipy.stats import hypergeom

## marble example

https://alexlenail.medium.com/understanding-and-implementing-the-hypergeometric-test-in-python-a7db688a7458  
Suppose you have a jar containing 10 red marbles and 90 black marbles.  
You collect 10 marbles from the jar.  
What is the probability you collect k red marbles?  

The probability of drawing x red marbles in N draws from a jar of M marbles, n of which are red, follows the hypergeometric distribution.  

We draw 10 marbles, of which `7` are red (x = 7), and we want to know how unlikely such a result is to occur by chance. SciPy's parameter names differ from the notation used earlier in the linked article:  
M: the population size (i.e., number of marbles) (previously N)  
n: the number of successes in the population (number of red marbles) (previously K)  
N: the sample size (i.e., number of draws) (previously n)  
x: still the number of drawn “successes”.  

`hypergeom.sf(k, M, n, N)` returns P(X > k), so passing `x - 1` gives the inclusive upper tail P(X >= x).  


```
from scipy.stats import hypergeom  
pval = hypergeom.sf(x-1, M, n, N)  
```

In [3]:
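x = 7 - 1   # observed red marbles minus 1, so that sf(x) = P(X >= 7)
M = 100     # total marbles in the jar
n = 10      # red marbles in the jar
N = 10      # marbles drawn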

In [4]:
hypergeom.sf(x, M, n, N)

8.248683269314892e-07
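
As a sanity check, the survival-function call above can be compared with an explicit sum of the probability mass function over 7 through 10 red marbles. This is a minimal sketch; the names `x_obs`, `tail_sf` and `tail_sum` are illustrative and not part of the original notebook.

```python
from scipy.stats import hypergeom

M, n, N = 100, 10, 10   # marbles in the jar, red marbles, marbles drawn
x_obs = 7               # red marbles observed in the sample (illustrative name)

# sf(k) is P(X > k), so x_obs - 1 gives the inclusive tail P(X >= x_obs)
tail_sf = hypergeom.sf(x_obs - 1, M, n, N)

# The same tail as an explicit sum of pmf terms for 7, 8, 9 and 10 red marbles
tail_sum = sum(hypergeom.pmf(k, M, n, N) for k in range(x_obs, min(n, N) + 1))

print(tail_sf, tail_sum)
```

The two values should agree up to floating-point rounding, which confirms that the `- 1` is what makes the tail inclusive.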

## red and black balls example  
Say I have a bag with 19275 balls, 4 of which are red and the rest black, and I draw 2880 balls without replacement. I happen to draw all 4 red balls.  
x: 4 - 1 = 3 (red balls drawn, minus 1 so that `sf` returns P(X >= 4))  
M: 19275 (number of balls)  
n: 4 (number of red balls)  
N: 2880 (number of draws)  

`hypergeom.sf(3, 19275, 4, 2880)`

In [5]:
hypergeom.sf(3, 19275, 4, 2880)

0.000497533644709495
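
Because every red ball is drawn, the tail contains a single term, so the result can be cross-checked by direct counting: the chance that all 4 red balls fall among the 2880 drawn is C(2880, 4) / C(19275, 4). A minimal sketch, with `p_exact` and `p_scipy` as illustrative names:

```python
from math import comb
from scipy.stats import hypergeom

M, n, N = 19275, 4, 2880   # balls in the bag, red balls, balls drawn

# All n red balls land among the N drawn: C(N, n) / C(M, n)
p_exact = comb(N, n) / comb(M, n)

# Same quantity from SciPy; with x = n the upper tail is the single term P(X = 4)
p_scipy = hypergeom.sf(n - 1, M, n, N)

print(p_exact, p_scipy)
```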

## genes and cancer
Suppose our database contains 1000 patients (P), of which 300 have been diagnosed with lung cancer (L).   
We find a gene (G) that may be associated with lung cancer.  
400 patients have been determined to have G. Of these patients, 200 have lung cancer.  
Is G overexpressed (enriched) among the patients with lung cancer?  

In [6]:
x = 200 - 1
M = 1000
n = 300
N = 400

In [7]:
hypergeom.sf(x, M, n, N)  # x already holds 200 - 1, so this is P(X >= 200)
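
To put the counts from this example in context, it can help to compare the observed overlap with the overlap expected if G and lung cancer were independent. The sketch below is an illustration under the numbers stated above; `expected`, `fold_enrichment` and `p_value` are names introduced here, not taken from the notebook.

```python
from scipy.stats import hypergeom

M = 1000   # patients in the database (P)
n = 300    # patients with lung cancer (L)
N = 400    # patients carrying gene G
x = 200    # patients with both G and lung cancer

# Overlap expected if G and lung cancer were independent
expected = N * n / M

# Observed overlap relative to the expected overlap
fold_enrichment = x / expected

# One-sided enrichment p-value, P(X >= x)
p_value = hypergeom.sf(x - 1, M, n, N)

print(expected, fold_enrichment, p_value)
```

With these counts the expected overlap is 120 patients, so 200 observed corresponds to an enrichment of about 1.67-fold.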

## hypergeometric with genes  
https://www.biostars.org/p/66729/  
Suppose I have a list of genes: `mygenes: gene13,gene2,gene111`  
And given another list of genes:  
```
gene_categoryA: gene1, gene2, gene44, gene111
gene_categoryB: gene13,gene34
```
After comparing mygenes against gene_categoryA, we see that 2 genes of categoryA are in mygenes.  
What I want to know is whether the occurrence of these 2 genes (gene2 and gene111) is significantly more than expected.  
What is the best way to go about it?  

```python
import sys
import scipy.stats as stats

# Command-line arguments, in order: M, n, N, x
M, n, N, x = (int(arg) for arg in sys.argv[1:5])

print(f'total number in population: {M}')
print(f'total number with condition in population: {n}')
print(f'number in subset: {N}')
print(f'number with condition in subset: {x}')
# P(X <= x): lower tail, inclusive
print(f'p-value <= {x}: {stats.hypergeom.cdf(x, M, n, N)}')
# P(X >= x): upper tail, inclusive, hence x - 1
print(f'p-value >= {x}: {stats.hypergeom.sf(x - 1, M, n, N)}')
```

Mapping to function: `hypergeom.sf(x, M, n, N)`  
```python
x = int(sys.argv[4])  # number with condition in subset
M = int(sys.argv[1])  # total number in population
n = int(sys.argv[2])  # total number with condition in population
N = int(sys.argv[3])  # number in subset (sample size)
```

In [8]:
# Each gene name is assigned a numeric value that this cell treats as a count
gene1, gene2, gene44, gene111 = 1000, 12345, 32567, 9845
gene13, gene34 = 5877, 34879

gene_categoryA = [gene1, gene2, gene44, gene111]
gene_categoryB = [gene13, gene34]

x = gene13 + gene34                                      # sum of the category B values
M = gene1 + gene2 + gene44 + gene111 + gene13 + gene34   # sum over all six genes
n = gene1 + gene2 + gene44 + gene111                     # sum of the category A values
N = gene111 + gene13 + gene34                            # gene111 plus the category B values
hypergeom.sf(x - 1, M, n, N)

0.0
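
The question as posed depends only on the sizes of the two lists, their overlap, and the size of the background gene set, which the question does not state. The sketch below computes the overlap with Python sets and assumes a hypothetical background of 200 genes purely for illustration; `background_size` is not from the source, and the p-value changes with it.

```python
from scipy.stats import hypergeom

mygenes = {"gene13", "gene2", "gene111"}
gene_categoryA = {"gene1", "gene2", "gene44", "gene111"}

background_size = 200   # ASSUMPTION: hypothetical number of genes in the background

overlap = mygenes & gene_categoryA   # genes shared by both lists

x = len(overlap)          # category A genes found in mygenes
M = background_size       # population size
n = len(gene_categoryA)   # category A genes in the population
N = len(mygenes)          # size of the gene list being tested

p_value = hypergeom.sf(x - 1, M, n, N)   # P(X >= x)
print(sorted(overlap), p_value)
```

Choosing the background matters: a larger gene universe makes the same overlap of 2 look more surprising, so the background should reflect the genes that could actually have appeared in `mygenes`.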

## playing cards example

https://gist.github.com/fbrundu/cfa675c1d79b4ade4724

A poker hand consists of 5 cards dealt at random without replacement from a standard deck of 52 cards of which 26 are red and the rest black. A poker hand is dealt. Find the chance that the hand contains three red cards and two black cards.

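To compute it, we use the hypergeometric distribution: we want 3 cards from the set of 26 red cards and 2 from the set of 26 black cards. The probability of exactly three red cards is given by the probability mass function, `hypergeom.pmf(x, M, n, N)`. The cells below evaluate the upper tail P(X >= 3) with `hypergeom.sf(x - 1, M, n, N)` instead, which equals 0.5 because red and black cards are equally numerous. The parameters are: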

In [9]:
x = 3   # Number of Type I cards we want in one hand
M = 52  # Total number of cards
n = 26  # Number of Type I cards (e.g. red cards) 
N = 5   # Number of draws (5 cards dealt in one poker hand)

In [10]:
hypergeom.sf(x-1, M, n, N)

0.5

In [11]:
1 - hypergeom.cdf(x-1, M, n, N)

0.5
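
The cells above give the chance of at least three red cards. For the chance of exactly three red and two black cards, the probability mass function is the matching call. A minimal sketch, cross-checked against direct counting with `math.comb`; the names `p_exactly_three` and `p_by_counting` are illustrative:

```python
from math import comb
from scipy.stats import hypergeom

x = 3    # red cards wanted in the hand
M = 52   # cards in the deck
n = 26   # red cards in the deck
N = 5    # cards dealt

# P(X = 3): exactly three red cards, hence exactly two black cards
p_exactly_three = hypergeom.pmf(x, M, n, N)

# Direct counting: choose 3 of 26 red and 2 of 26 black, out of all 5-card hands
p_by_counting = comb(26, 3) * comb(26, 2) / comb(52, 5)

print(p_exactly_three, p_by_counting)
```

Both values come out at roughly 0.325, noticeably smaller than the 0.5 obtained for the "at least three" tail.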