In [10]:
import pandas as pd
import numpy as np

In [17]:
df = pd.read_csv("bats.csv", header=None, names=['gene1', 'gene2', 'gene3', 'gene4', 'gene5', 'Ebola'])
expressed_trait = df["Ebola"]
df = df.drop(columns=["Ebola"])
df.head()

   gene1  gene2  gene3  gene4  gene5
0   True  False   True   True  False
1  False  False   True   True   True
2   True  False   True  False  False
3   True  False   True   True   True
4  False   True   True   True   True


In [18]:
total_num = df.shape[0]

In [19]:
# 11.a) The probability of the trait being expressed
active_num = df[expressed_trait == True].shape[0]
prob = active_num / total_num
prob

0.30079

In [24]:
# 11.b) For each gene i, calculate and report P(G_i)
gene_probs = {}
for gene in df.columns:
    gene_probs[gene] = df[df[gene] == True].shape[0] / total_num

for key, val in gene_probs.items():
    print(f"P({key}) = {val}")

P(gene1) = 0.70228
P(gene2) = 0.30076
P(gene3) = 0.5009
P(gene4) = 0.80162
P(gene5) = 0.32705


In [21]:
# 11.c) G_i independent of Ebola <=> P(G_i & Ebola) = P(G_i) * P(Ebola)
def joint_prob(gene):
    return df[(df[gene] == True) & (expressed_trait == True)].shape[0] / total_num

EPSILON = 0.001
independent_genes = []
for gene in df.columns:
    if np.abs(joint_prob(gene) - prob * gene_probs[gene]) < EPSILON:
        independent_genes.append(gene)

print("Some independent genes:", independent_genes)

Some independent genes: ['gene1', 'gene2']


In [23]:
# 11.d) For each gene i that is not assumed to be independent of T, calculate P(T | G_i)
def conditional_prob(gene):
    return df[(df[gene] == True) & (expressed_trait == True)].shape[0] / df[df[gene] == True].shape[0]

conditional_probs = {}
for gene in df.columns:
    if gene not in independent_genes:
        conditional_probs[gene] = conditional_prob(gene)
for key, val in conditional_probs.items():
    print(f"P(T | {key}) = {val}")

P(T | gene3) = 0.5831902575364344
P(T | gene4) = 0.37053716224645095
P(T | gene5) = 0.8999847118177648


Step 1: Analyze P(T) = 0.30079
- Approximately 30.08% of the sampled bats can carry Ebola.
- This suggests that Ebola carriage is relatively common in this bat population, affecting nearly one-third of the bats.

Step 2: Analyze P(Gi) for each gene
- Gene 1 (G1): P(G1) = 0.70228
  - Expressed in about 70.23% of bats, making it a commonly expressed gene.
- Gene 2 (G2): P(G2) = 0.30076
  - Expressed in about 30.08% of bats, less common than G1.
- Gene 3 (G3): P(G3) = 0.5009
  - Expressed in about 50.09% of bats, showing an almost even split in the population.
- Gene 4 (G4): P(G4) = 0.80162
  - Most commonly expressed gene, present in about 80.16% of bats.
- Gene 5 (G5): P(G5) = 0.32705
  - Expressed in about 32.71% of bats, similar in frequency to G2.

Step 3: Evaluate independence of genes and trait
- Gene 1 and Gene 2 are reported as independent of T.
  - A gene counts as independent here when |P(G_i & T) - P(G_i) * P(T)| < 0.001, so in this sample the expression of G1 and G2 shows no measurable association with the bat's ability to carry Ebola.
- Genes 3, 4, and 5 are not independent of T, indicating potential relationships with Ebola carriage.
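
The fixed tolerance of 0.001 used in 11.c is a rule of thumb. As one possible cross-check, a Pearson chi-square statistic for each 2x2 gene/trait table can be sketched from the figures printed above. This is a swapped-in technique, not the notebook's method: the counts are reconstructed from the reported probabilities assuming the 100,000-bat sample size, and the helper name `chi_square_2x2` is hypothetical.

```python
# Sketch: Pearson chi-square statistic for a 2x2 gene/trait table.
# Counts are rebuilt from the printed probabilities, assuming n = 100,000.
N = 100_000
P_T = 0.30079
P_G = {"gene3": 0.5009, "gene4": 0.80162, "gene5": 0.32705}
P_T_GIVEN_G = {
    "gene3": 0.5831902575364344,
    "gene4": 0.37053716224645095,
    "gene5": 0.8999847118177648,
}

def chi_square_2x2(a, b, c, d):
    """a: gene and trait, b: gene only, c: trait only, d: neither."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

CRITICAL_VALUE = 3.841  # chi-square, 1 degree of freedom, 5% significance level

chi2 = {}
for gene in P_G:
    with_gene = round(P_G[gene] * N)          # bats expressing the gene
    a = round(P_T_GIVEN_G[gene] * with_gene)  # ... that also carry the trait
    b = with_gene - a
    c = round(P_T * N) - a
    d = N - a - b - c
    chi2[gene] = chi_square_2x2(a, b, c, d)
    print(f"{gene}: chi2 = {chi2[gene]:.1f}, above critical value: {chi2[gene] > CRITICAL_VALUE}")
```

A statistic above the critical value means the observed table is unlikely under independence. Unlike a fixed tolerance, the test accounts for the sample size.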

Step 4: Analyze P(T | Gi) for non-independent genes
- Gene 3: P(T | G3) = 0.5831902575364344
  - When G3 is expressed, the probability of Ebola carriage increases to about 58.32%, compared to the baseline 30.08%.
  - G3 expression is associated with a higher likelihood of Ebola carriage.
- Gene 4: P(T | G4) = 0.37053716224645095
  - When G4 is expressed, the probability of Ebola carriage is about 37.05%.
  - G4 expression is associated with a slight increase in Ebola carriage compared to the baseline, but less so than G3.
- Gene 5: P(T | G5) = 0.8999847118177648
  - When G5 is expressed, the probability of Ebola carriage rises to about 90%, roughly three times the baseline.
  - G5 shows the strongest association with Ebola carriage among all genes.
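
The comparisons against the baseline in Step 4 can be made explicit as a ratio P(T | G_i) / P(T), often called lift. The following is a minimal sketch using only the values printed by the notebook; the variable names are illustrative.

```python
# Sketch: lift of each non-independent gene, i.e. P(T | G_i) / P(T).
# A lift of 1.0 would mean the gene carries no information about the trait.
P_T = 0.30079
P_T_GIVEN_G = {
    "gene3": 0.5831902575364344,
    "gene4": 0.37053716224645095,
    "gene5": 0.8999847118177648,
}

lift = {gene: p / P_T for gene, p in P_T_GIVEN_G.items()}
for gene, value in lift.items():
    print(f"{gene}: lift = {value:.2f}")
```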

Step 5: Synthesize findings
- G5 appears to be the most significant gene related to Ebola carriage. Its expression is strongly associated with the bat's ability to carry Ebola.
- G3 also shows a positive association with Ebola carriage, but to a lesser extent than G5.
- G4, despite being the most commonly expressed gene, has a relatively minor positive association with Ebola carriage.
- G1 and G2, being independent, likely play little to no role in determining Ebola carriage.
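
The synthesis above looks at carriage given gene expression. The reverse view, P(G_i | T), follows from Bayes' rule, P(G_i | T) = P(T | G_i) * P(G_i) / P(T), and can be sketched from the printed values. This quantity is not computed in the notebook, so treat it as a supplementary check rather than part of the original analysis.

```python
# Sketch: Bayes' rule applied to the printed probabilities,
# P(G_i | T) = P(T | G_i) * P(G_i) / P(T).
P_T = 0.30079
P_G = {"gene3": 0.5009, "gene4": 0.80162, "gene5": 0.32705}
P_T_GIVEN_G = {
    "gene3": 0.5831902575364344,
    "gene4": 0.37053716224645095,
    "gene5": 0.8999847118177648,
}

p_g_given_t = {gene: P_T_GIVEN_G[gene] * P_G[gene] / P_T for gene in P_G}
for gene, value in p_g_given_t.items():
    print(f"P({gene} | T) = {value:.4f}")
```

Each value can be read as the share of Ebola-carrying bats that express the gene.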

Step 6: Propose potential mechanisms
- G5 might be involved in cellular receptors or pathways that facilitate Ebola virus entry or replication.
- G3 could be part of a secondary mechanism that supports Ebola carriage, perhaps related to the bat's immune response.
- G4, being widely expressed but with less impact on carriage, might play a subtle role in creating a favorable environment for the virus.

Step 7: Consider limitations
- These results are based on a sample of 100,000 bats, providing strong statistical power but still subject to sampling variability.
- The study only shows correlations; causation cannot be inferred without further experimental evidence.
- Other environmental or genetic factors not captured in this data might also influence Ebola carriage.
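
To give the sampling variability mentioned above a rough scale, the standard error of an estimated proportion is sqrt(p * (1 - p) / n). The sketch below applies it to P(T) with a normal-approximation 95% interval. It assumes the bats are independent draws, which the data itself cannot confirm.

```python
import math

# Sketch: standard error and approximate 95% confidence interval for P(T),
# assuming 100,000 independently sampled bats.
N = 100_000
P_T = 0.30079

standard_error = math.sqrt(P_T * (1 - P_T) / N)
ci_low = P_T - 1.96 * standard_error
ci_high = P_T + 1.96 * standard_error

print(f"standard error = {standard_error:.5f}")
print(f"95% CI = ({ci_low:.5f}, {ci_high:.5f})")
```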

In [26]:
# 11.f) Expected value of K, where K is the number of genes expressed.
# Hint: K takes 6 values: 0, 1, 2, 3, 4, 5
# E(K) = Σ k * P(K = k)

expected_value = 0
num_expressed_genes = df.sum(axis=1)
for k in range(6):
    expected_value += k * df[num_expressed_genes == k].shape[0] / total_num
expected_value

2.63261
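
As a sanity check on the result above: K is a sum of five indicator variables, so by linearity of expectation E(K) should equal the sum of the P(G_i) values from 11.b, whether or not the genes are independent of each other. A minimal sketch using the printed probabilities:

```python
# Sketch: E(K) via linearity of expectation, E(K) = sum of P(G_i).
gene_probs = {
    "gene1": 0.70228,
    "gene2": 0.30076,
    "gene3": 0.5009,
    "gene4": 0.80162,
    "gene5": 0.32705,
}

expected_k = sum(gene_probs.values())
print(f"E(K) = {expected_k:.5f}")
```

The sum agrees with the value obtained from the distribution of K.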
