# Probabilistic Models – Spring 2021
## Exercise Session 1
Jan 27th 16.15.

### Instructions
Make sure the notebook produces correct results when run sequentially starting from the first cell. You can ensure this by clearing all outputs (`Edit > Clear All Outputs`), running all cells (`Run > Run All Cells`), and finally correcting any errors.

To get points:
1. Submit your answers to the automatically checked Moodle test. 
 - You have 5 tries on the test: the highest obtained score will be taken into account.
 - For numerical questions the tolerance is +/- 0.01.
2. Submit this notebook containing your derivations to Moodle.

## Exercise 1
***

Consider the following joint distribution $P$:

In [2]:
!cat data/1.csv

A	B	C	P
True	True	True	0.075
True	True	False	0.05
True	False	True	0.225
True	False	False	0.15
False	True	True	0.025
False	True	False	0.1
False	False	True	0.075
False	False	False	0.3


(a) What is $P(A=T, C=T)$?

Update the distribution by conditioning on the event $C=T$, that is, construct the conditional distribution $P( \cdot |C=T)$.

(b) What is $P(A=T|C=T)$? $P(B=T|C= T)$?

(c) Is the event $A=T$ independent of the event $C=T$? Is $B=T$ independent of $C=T$?

### Instructions

If you're using Python you can start by reading the provided file into a [Pandas](https://pandas.pydata.org/) [DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) or similarly to a [data.frame](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/data.frame) in R. To check for equality between two real numbers do not use `x == y`, as it gives false negatives on limited precision floats. Rather, use for example [`math.isclose(x, y)`](https://docs.python.org/3/library/math.html#math.isclose) in Python or [`near(x, y)`](https://dplyr.tidyverse.org/reference/near.html) in R.
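As a small illustration of the advice on floating-point comparison, the sketch below contrasts exact equality with `math.isclose`. The values `0.1 + 0.2` and `0.3` are only illustrative and do not come from the exercise data.

```python
import math

x = 0.1 + 0.2
y = 0.3

# Exact comparison fails because 0.1 + 0.2 is not representable exactly.
print(x == y)              # False

# Tolerance-based comparison treats the two values as equal.
print(math.isclose(x, y))  # True
```

Note that `math.isclose` uses a relative tolerance by default, so comparisons against exactly zero need an explicit `abs_tol` argument.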


(a) $P(A=T, C=T) = 0.075 + 0.225 = 0.3$  
  
(b)   
  $P(A=T|C=T) = P(A=T, C=T) / P(C=T) = 0.3 / 0.4 = 0.75$  
  $P(B=T|C= T) = P(B=T, C=T) / P(C=T) = 0.1 / 0.4 = 0.25$
  
(c)  
$A=T$ is not independent of the event $C=T$ because $P(A=T|C=T) = 0.75 \neq 0.5 = P(A=T)$  
$B=T$ is independent of the event $C=T$ because $P(B=T|C=T) = 0.25 = P(B=T)$

In [3]:
# (b)

import pandas as pd
import numpy as np
import math

df = pd.read_csv("data/1.csv", sep="\t")

df



Unnamed: 0,A,B,C,P
0,True,True,True,0.075
1,True,True,False,0.05
2,True,False,True,0.225
3,True,False,False,0.15
4,False,True,True,0.025
5,False,True,False,0.1
6,False,False,True,0.075
7,False,False,False,0.3


In [4]:
C_is_true = df.loc[df['C'] == True]

# Build the mask from the filtered frame itself rather than from df
A_given_C = C_is_true.loc[C_is_true['A'] == True]['P'].sum() / C_is_true["P"].sum()

A_given_C


0.7499999999999999

In [5]:
# Build the mask from the filtered frame itself rather than from df
B_given_C = C_is_true.loc[C_is_true['B'] == True]['P'].sum() / C_is_true["P"].sum()

B_given_C

0.25

In [6]:
# (c)

A_is_true = df.loc[df['A'] == True]["P"].sum()
B_is_true = df.loc[df['B'] == True]["P"].sum()

math.isclose(A_is_true, A_given_C)




False

In [7]:
math.isclose(B_is_true, B_given_C)

True

## Exercise 2
***

Consider again the joint distribution $P$ from Exercise 1.

(a) What is $P(A=T \vee B=T)$?


Update the distribution by conditioning on the event $(A=T \vee B=T)$, that is, construct the conditional distribution $P( \cdot |A=T \vee B=T)$.

(b) What is $P(A=T|A=T \vee B=T)$? $P(B=T|A=T \vee B=T)$?

(c) Is the event $B=T$ conditionally independent of $C=T$ given the event $(A=T \vee B=T)$?

In [8]:
# Provide your answer in cells here

#(a)
A_is_true_or_B_is_true = df.loc[(df['B'] == True) | (df['A'] == True)]

A_is_true_or_B_is_true["P"].sum()


0.625

In [9]:
#(b)
given_A_is_true_or_B_is_true = A_is_true_or_B_is_true.assign(P = A_is_true_or_B_is_true["P"] / A_is_true_or_B_is_true["P"].sum())

A_is_true = given_A_is_true_or_B_is_true.loc[given_A_is_true_or_B_is_true['A'] == True]["P"].sum()

A_is_true

0.8

In [10]:
B_is_true = given_A_is_true_or_B_is_true.loc[given_A_is_true_or_B_is_true['B'] == True]["P"].sum()

B_is_true

0.4

In [11]:
# (c)
P_BiC = given_A_is_true_or_B_is_true.loc[(given_A_is_true_or_B_is_true['B'] == True) & (given_A_is_true_or_B_is_true['C'] == True)]["P"].sum()
P_B = given_A_is_true_or_B_is_true.loc[given_A_is_true_or_B_is_true['B'] == True]["P"].sum()
P_C = given_A_is_true_or_B_is_true.loc[given_A_is_true_or_B_is_true['C'] == True]["P"].sum()

math.isclose(P_BiC, P_B * P_C)

False
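
The result for (c) can also be checked by hand from the table in Exercise 1. The worked values below are a sketch that uses only the rows where $A=T \vee B=T$ holds and the normalizer $P(A=T \vee B=T) = 0.625$ from (a):

$$
\begin{aligned}
P(B=T, C=T \mid A=T \vee B=T) &= \frac{0.075 + 0.025}{0.625} = 0.16 \\
P(B=T \mid A=T \vee B=T) \cdot P(C=T \mid A=T \vee B=T) &= 0.4 \cdot \frac{0.075 + 0.225 + 0.025}{0.625} = 0.4 \cdot 0.52 = 0.208
\end{aligned}
$$

Since $0.16 \neq 0.208$, the event $B=T$ is not conditionally independent of $C=T$ given $(A=T \vee B=T)$.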

## Exercise 3
***

Consider the following joint distribution.

In [12]:
!cat data/3.csv

A	B	C	P
True	True	True	0.27
True	True	False	0.18
True	False	True	0.03
True	False	False	0.02
False	True	True	0.02
False	True	False	0.03
False	False	True	0.18
False	False	False	0.27


For each pair of variables, state whether they are independent. State also whether they are independent given the third variable. Justify your answers.

In [42]:
df = pd.read_csv("data/3.csv", sep="\t")

variables = ["A", "B", "C"]  # named to avoid shadowing the built-in vars()

for i in range(0, 3):
    v1 = variables[i]

    # Start at i + 1 so that each unordered pair is visited exactly once
    for j in range(i + 1, 3):
        v2 = variables[j]

        # Marginal check: compare P(v1, v2) with P(v1) * P(v2)
        P_v1iv2 = df.loc[(df[v1] == True) & (df[v2] == True)]["P"].sum()
        P_v1v2 = df.loc[(df[v1] == True)]["P"].sum() * df.loc[(df[v2] == True)]["P"].sum()

        print("____________")
        print(f"P({v1},{v2}) = {P_v1iv2}")
        print(f"P({v1})P({v2}) = {P_v1v2}")
        if math.isclose(P_v1iv2, P_v1v2):
            print("independent")
        else:
            print("dependent")
        print("")

        # Conditional check given the third variable.
        # Note: this only conditions on v3 == True; a full check also covers v3 == False.
        v3 = next(k for k in variables if k != v1 and k != v2)
        P_v3 = df.loc[df[v3] == True]
        gP_v3 = P_v3.assign(P = P_v3["P"] / P_v3["P"].sum())

        P_v1iv2_gv3 = gP_v3.loc[(gP_v3[v1] == True) & (gP_v3[v2] == True)]["P"].sum()
        P_v1gv3_v2gv3 = gP_v3.loc[(gP_v3[v1] == True)]["P"].sum() * gP_v3.loc[(gP_v3[v2] == True)]["P"].sum()

        print(f"P({v1},{v2}|{v3}) = {P_v1iv2_gv3}")
        print(f"P({v1}|{v3})P({v2}|{v3}) = {P_v1gv3_v2gv3}")

        if math.isclose(P_v1iv2_gv3, P_v1gv3_v2gv3):
            print("independent")
        else:
            print("dependent")
        print("")




____________
P(A,B) = 0.45
P(A)P(B) = 0.25
dependent

P(A,B|C) = 0.54
P(A|C)P(B|C) = 0.3480000000000001
dependent

____________
P(A,C) = 0.30000000000000004
P(A)P(C) = 0.25
dependent

P(A,C|B) = 0.54
P(A|B)P(C|B) = 0.5220000000000001
dependent

____________
P(B,C) = 0.29000000000000004
P(B)P(C) = 0.25
dependent

P(B,C|A) = 0.54
P(B|A)P(C|A) = 0.5400000000000001
independent
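
The cell above tests conditional independence only for the case where all three variables are `True`. Strictly, $X$ and $Y$ are independent given $Z$ only if the factorization holds for every combination of values, including $Z=F$. The following is a minimal sketch of that fuller check. It hard-codes the table from `data/3.csv` as a dictionary rather than reading the file, and the helper names `prob` and `cond_independent` are hypothetical, introduced here for illustration.

```python
import math
from itertools import product

# Joint distribution from data/3.csv, keyed by (A, B, C).
joint = {
    (True, True, True): 0.27,
    (True, True, False): 0.18,
    (True, False, True): 0.03,
    (True, False, False): 0.02,
    (False, True, True): 0.02,
    (False, True, False): 0.03,
    (False, False, True): 0.18,
    (False, False, False): 0.27,
}
NAMES = ("A", "B", "C")


def prob(event):
    """Sum the probability of all rows matching event, a dict {column index: value}."""
    return sum(p for row, p in joint.items()
               if all(row[i] == v for i, v in event.items()))


def cond_independent(i, j, k):
    """Check whether column i is independent of column j given column k."""
    for x, y, z in product([True, False], repeat=3):
        p_z = prob({k: z})
        lhs = prob({i: x, j: y, k: z}) / p_z
        rhs = (prob({i: x, k: z}) / p_z) * (prob({j: y, k: z}) / p_z)
        if not math.isclose(lhs, rhs):
            return False
    return True


results = {
    f"{NAMES[i]},{NAMES[j]}|{NAMES[k]}": cond_independent(i, j, k)
    for i, j, k in [(0, 1, 2), (0, 2, 1), (1, 2, 0)]
}
print(results)
```

For this distribution the fuller check agrees with the output above: only $B$ and $C$ are independent given $A$.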



## Exercise 4
***

We have three urns labeled 1, 2 and 3. The urns contain, respectively, three white and three black balls, four white and two black balls, and one white and two black balls. An experiment consists of selecting an urn at random then drawing a ball from it.

Define the joint probability distribution over $U$ and $C$, where $U$ is the chosen urn with values 1, 2 and 3; and $C$ is the color of the ball, with values black and white.

(a) What is the probability of drawing a black ball?

(b) What is the conditional probability that urn 2 was selected given that a black ball was drawn?

(c) What is the probability of selecting urn 1 or a white ball?

In [45]:
# (a)

U1 = [3 / 6, 3 / 6]
U2 = [4 / 6, 2 / 6]
U3 = [1 / 3, 2 / 3]

P_B = (U1[1] * (1/3)) + (U2[1] * (1/3)) + (U3[1] * (1/3))

P_B

0.5

(b) 
  
$P(U2 \mid B) = \frac{P(B \mid U2) \cdot P(U2)}{P(B)}$ 



In [47]:
# (b)

P_B_g_U2 = U2[1]
P_B_g_U2 * (1/3) / P_B 

0.2222222222222222

(c) 
  
$P(U1 \cup W) = P(U1) + P(W) - P(U1 \cap W) = P(U1) + P(W) - P(U1)P(W | U1)$ 

In [48]:
# (c)

P_W = 1 - P_B
P_U1 = 1/3
P_W_g_U1 = U1[0]

P_U1 + P_W - P_U1 * P_W_g_U1

0.6666666666666666
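
The exercise also asks for the joint distribution over $U$ and $C$, which the cells above use only implicitly. The sketch below tabulates it explicitly with exact fractions and recomputes (a) to (c) from the table. It assumes that "at random" means each urn is chosen with probability 1/3, as the cells above do; the variable names are illustrative.

```python
from fractions import Fraction

# (white, black) ball counts per urn, taken from the exercise text.
urns = {1: (3, 3), 2: (4, 2), 3: (1, 2)}
p_urn = Fraction(1, len(urns))  # assumption: urns are chosen uniformly

# Joint distribution P(U=u, C=c) = P(U=u) * P(C=c | U=u).
joint = {}
for u, (white, black) in urns.items():
    total = white + black
    joint[(u, "white")] = p_urn * Fraction(white, total)
    joint[(u, "black")] = p_urn * Fraction(black, total)

# (a) marginal probability of a black ball
p_black = sum(p for (u, c), p in joint.items() if c == "black")
# (b) P(U=2 | C=black)
p_u2_given_black = joint[(2, "black")] / p_black
# (c) P(U=1 or C=white): sum over all rows satisfying the disjunction
p_u1_or_white = sum(p for (u, c), p in joint.items() if u == 1 or c == "white")

print(p_black, p_u2_given_black, p_u1_or_white)  # 1/2 2/9 2/3
```

The exact fractions 1/2, 2/9 and 2/3 match the floating-point results 0.5, 0.2222 and 0.6667 obtained above.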

## Exercise 5
***

Suppose Ed keeps track of forecasts of the Finnish Meteorological Institute (FMI) and believes they are correct with 80% probability and Mary believes the forecasts of Foreca are correct with 70% probability. Then suppose FMI predicts rain and Foreca does not.

Consider four sets of bets:

> (1) Bookie offers to sell Ed a bet for 85 euros returning Ed 100 euros if it rains. Bookie offers to sell Mary a bet for 60 euros returning Mary 100 euros if it does not rain.
> 
> (2) Bookie offers to sell Ed a bet for 79 euros returning Ed 100 euros if it rains. Bookie offers to sell Mary a bet for 69 euros returning Mary 100 euros if it does not rain.
> 
> (3) Bookie offers to sell Ed a bet for 73 euros returning Ed 100 euros if it rains. Bookie offers to sell Mary a bet for 73 euros returning Mary 100 euros if it does not rain.
> 
> (4) Bookie offers to sell Ed a bet for 55 euros returning Ed 100 euros if it rains. Bookie offers to sell Mary a bet for 34 euros returning Mary 100 euros if it does not rain.

(a) Which set of bets is a Dutch book?

(b) How much money is the bookie guaranteed to make in the Dutch book scenario?

Provide some calculations justifying your answers in the notebook.

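If both bets in a set are accepted, the bookie collects both prices and pays out 100 euros exactly once, because it either rains or it does not. The bookie's guaranteed result in euros is therefore:

(1) 85 + 60 - 100 = 45  
(2) 79 + 69 - 100 = 48  
(3) 73 + 73 - 100 = 46  
(4) 55 + 34 - 100 = -11


(a) Ed values his bet at 0.8 · 100 = 80 euros and Mary values hers at 0.7 · 100 = 70 euros, so each of them accepts a bet only if its price does not exceed that amount. Sets 1, 2 and 3 would all give the bookie a net gain, but Ed would reject his bet in set 1 (85 > 80) and Mary would reject her bet in set 3 (73 > 70). Set 4 is acceptable to both but guarantees a loss to the bookie. Thus set 2 is the Dutch book: both Ed and Mary consider their bets favorable, yet the bookie makes a profit whatever the weather.
  
(b) The bookie is guaranteed to make 79 + 69 - 100 = 48 euros.  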
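
The reasoning for Exercise 5 can be sketched in code. The sketch assumes that a bettor accepts a bet whenever its price does not exceed their expected payout (80 euros for Ed, 70 euros for Mary), and treats a set as a Dutch book when both bets are accepted and the bookie's result is positive. All names are illustrative.

```python
PAYOUT = 100
# Highest acceptable price per bettor: belief times payout (assumed acceptance rule).
MAX_PRICE_ED = 80    # 0.8 * 100, bet pays out if it rains
MAX_PRICE_MARY = 70  # 0.7 * 100, bet pays out if it does not rain

# (price offered to Ed, price offered to Mary) for each set of bets.
bet_sets = {1: (85, 60), 2: (79, 69), 3: (73, 73), 4: (55, 34)}

dutch_books = {}
for number, (price_ed, price_mary) in bet_sets.items():
    both_accept = price_ed <= MAX_PRICE_ED and price_mary <= MAX_PRICE_MARY
    # Exactly one of the two bets pays out, so the bookie pays PAYOUT once.
    bookie_profit = price_ed + price_mary - PAYOUT
    if both_accept and bookie_profit > 0:
        dutch_books[number] = bookie_profit

print(dutch_books)  # {2: 48}
```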

## Exercise 6
***

The file `data/6.csv` contains 200 data points sampled from the distribution defined in Exercise 3, with `True` mapped to 1 and `False` to 0.

For each pair of variables, conduct the G²-test for statistical independence. Also conduct the test for each pair of variables given the third variable. That is, repeat the task specified in Exercise 3, but this time based on data sampled from the distribution instead of direct access to the distribution. For each conducted test, report the p-value obtained when the null hypothesis is that the independence holds.

You can also try sampling data from the distribution yourself to see how the obtained p-values behave, but for the Moodle return use the given data set.

### G²-test

Under the null hypothesis $H_0: X \mathrel{\unicode{x2AEB}} Y \mid C$ we have that

$$\#_{e}(X=x \wedge Y=y \wedge C=c) = \frac{\#_{e}(X=x \wedge C=c) \cdot \#_{e}(Y=y \wedge C=c)}{\#_{e}(C=c)}$$

where $\#(\cdot)$ denotes the number of samples satisfying the condition in parentheses, and $\#_{e}(\cdot)$ is the expected number of such samples under $H_{0}$.

Then examine the following quantity:

$$G^{2} = 2 \sum \# \log \frac{\#}{\#_{e}} $$

where the summation is over the different configurations of the variables (i.e., different values the variables can assume).

Under $H_0$ the quantity $G^2$ is distributed as [$\chi^2$](https://en.wikipedia.org/wiki/Chi-square_distribution) with $(m_X - 1)(m_Y - 1)m_C$ degrees of freedom, where $m_X,m_Y,m_C$ are the number of possible configurations for $X$, $Y$ and $C$, respectively.
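
The definitions above can be sketched in code as follows. This is a minimal illustration on synthetic samples, not a solution for `data/6.csv`. The helper names `g2_test` and `chi2_sf` are hypothetical. To stay dependency-free, `chi2_sf` is a stand-in for `scipy.stats.chi2.sf` that only covers 1 and 2 degrees of freedom, which is what binary variables need. The sketch also counts configurations from the values observed in the data, which is an assumption.

```python
import math
from collections import Counter


def chi2_sf(stat, dof):
    """Upper tail probability of the chi-square distribution (dof 1 or 2 only)."""
    stat = max(stat, 0.0)  # guard against tiny negative rounding errors
    if dof == 1:
        return math.erfc(math.sqrt(stat / 2))
    if dof == 2:
        return math.exp(-stat / 2)
    raise ValueError("closed form implemented only for dof 1 and 2")


def g2_test(samples, x, y, cond=()):
    """Return (G2, dof, p-value) for H0: column x independent of column y given cond."""
    def c_of(s):
        return tuple(s[c] for c in cond)

    n_xyc = Counter((s[x], s[y], c_of(s)) for s in samples)
    n_xc = Counter((s[x], c_of(s)) for s in samples)
    n_yc = Counter((s[y], c_of(s)) for s in samples)
    n_c = Counter(c_of(s) for s in samples)

    # Configurations with zero observed count contribute nothing to the sum.
    g2 = 0.0
    for (xv, yv, cv), observed in n_xyc.items():
        expected = n_xc[(xv, cv)] * n_yc[(yv, cv)] / n_c[cv]
        g2 += 2 * observed * math.log(observed / expected)

    m_x = len({s[x] for s in samples})
    m_y = len({s[y] for s in samples})
    dof = (m_x - 1) * (m_y - 1) * len(n_c)
    return g2, dof, chi2_sf(g2, dof)


# Synthetic data: every (x, y, c) combination equally often, so H0 holds exactly.
uniform = [(x, y, c) for x in (0, 1) for y in (0, 1) for c in (0, 1)] * 10
# Synthetic data: the first two columns are always equal, so H0 is clearly violated.
copied = [(v, v, c) for v in (0, 1) for c in (0, 1)] * 20

print(g2_test(uniform, 0, 1))             # marginal test, dof 1
print(g2_test(uniform, 0, 1, cond=(2,)))  # conditional test, dof 2
print(g2_test(copied, 0, 1))              # strongly dependent columns
```

A small p-value is evidence against independence, while a large one means the data give no reason to reject it.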

### Instructions

You can use any libraries you find for the task, but it probably makes sense to implement the $G^2$ computation yourself, and then compute the p-value using, for example, [SciPy](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2.html) (if you're using Python) or the built-in chi-square functions such as `pchisq` in R.

In [8]:
# Provide your answer in cells here