## EXERCISE QUIZ: Sprinkler example (modified)
Repeat the third exercise about the Bayesian network (i.e., the Sprinkler example).

Compute the probability distribution of wet grass given that it is cloudy, using the following data (NB: some values have been changed).

<img src='https://drive.google.com/uc?id=10FdO9kMQ_ZrH0sHt261uVPILxms7s4zj'>


In [None]:
# The code explained during the lesson is reproduced below, since it is the basis for solving this exercise

import numpy as np

# 'true' and 'false' indexes
t, f = 0, 1

P_R_c = np.array([0.8, 0.2])
# this is a vector with 2 elements, which can be accessed as follows
print('P(r|c) = ', P_R_c[t])
print('P(¬r|c) = ', P_R_c[f])

# NB: the probabilities in this table have been modified; pay attention to the order.
# The array is indexed as [w, r, s], i.e. P_W_RS[w, r, s] = P(W=w | R=r, S=s)
P_W_RS = np.array([[[0.95, 0.90],[0.75, 0.80]],[[0.05, 0.10], [0.15, 0.2]]])

# this is a 2x2x2 array, the elements of which can be accessed as follows
print('P(w|¬r,s) = ', P_W_RS[t,f,t])

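P_S_c = np.array([0.1, 0.9])

# sum out S: Phi_S[w, r] = sum_s P(W=w | R=r, s) * P(s | c)
Phi_S = P_W_RS[:,:,t] * P_S_c[t] + P_W_RS[:,:,f] * P_S_c[f]
print(Phi_S)

# sum out R: Phi_R[w] = sum_r P(r | c) * Phi_S[w, r]
Phi_R = P_R_c[t] * Phi_S[:,t] + P_R_c[f] * Phi_S[:,f]
print(Phi_R)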

P(r|c) =  0.8
P(¬r|c) =  0.2
P(w|¬r,s) =  0.75
[[0.905 0.795]
 [0.095 0.195]]
[0.883 0.115]
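
As a cross-check, the two summations above can be written as a single tensor contraction. The sketch below is not part of the lesson code: it assumes the same `[w, r, s]` axis order for `P_W_RS`, uses `np.einsum`, and also checks whether each conditional distribution in the table sums to 1. With the modified values, the two entries for (¬r, s) add up to 0.90 rather than 1, which is why the resulting vector sums to 0.998 instead of 1.

```python
import numpy as np

# Same (modified) tables as above; P_W_RS is assumed to be indexed as [w, r, s]
P_R_c = np.array([0.8, 0.2])
P_S_c = np.array([0.1, 0.9])
P_W_RS = np.array([[[0.95, 0.90], [0.75, 0.80]],
                   [[0.05, 0.10], [0.15, 0.20]]])

# P(W | c) = sum_r P(r | c) * sum_s P(s | c) * P(W | r, s)
P_W_c = np.einsum('wrs,r,s->w', P_W_RS, P_R_c, P_S_c)
print(P_W_c)

# Sanity check: for each (r, s), P(w | r, s) + P(not w | r, s) should be 1
cpt_sums = P_W_RS.sum(axis=0)
print(cpt_sums)
```

The contraction string `'wrs,r,s->w'` sums over the repeated indices `r` and `s` and keeps `w`, so it reproduces `Phi_R` in one call.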


## EXERCISE 1: Weather's probability
You are given a (fake) <a href="https://drive.google.com/file/d/1LjZLE9ozaHcBwiCl90mHaS1nXKcglfr4/view">padua_weather.csv</a>
of historical records for Padua's weather. The weather, which can be either rainy (= 1 in the dataset), misty (= 2), or sunny (= 3), is reported for each day of the week, for a whole year (52 weeks).

After formalising the problem (i.e. identifying the random variables and the necessary mathematical formulae), write a Python program that reads the dataset and computes the following:
- probability of being sunny during the weekend (one or both days);
- expected weather for each day of the week (*);
- suppose you don't know which day of the week it is today: although this is very unrealistic, how could you guess the day based only on the weather?

(\*) An expected value of, for example, 2.5 can be interpreted as "a mix of misty and sunny weather".

# SOLUTIONS

## Probability of being sunny during the weekend (one or both days)

We have two random variables: the weather $W$, with values in $\{1, 2, 3\}$, and the day of the week $D$, with values in $\{mo, tu, we, th, fr, sa, su\}$.

There are 23 out of 52 weekends in which at least one of the two days is sunny, therefore

$P(W = 3 | D = sa \vee D = su) = 23/52 \approx 0.4423$
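
Note that 23 is a count of weekends, not of days: a weekend with two sunny days must be counted only once. A minimal sketch of this counting follows, on a small made-up list of (Saturday, Sunday) pairs. The data is hypothetical and not taken from `padua_weather.csv`; the sketch only shows that the direct count agrees with inclusion-exclusion.

```python
from fractions import Fraction

SUNNY = 3

# Hypothetical (Saturday, Sunday) weather for 5 weekends:
# 1 = rainy, 2 = misty, 3 = sunny
weekends = [(3, 1), (2, 3), (3, 3), (1, 2), (2, 2)]

# Direct count: weekends in which at least one of the two days is sunny
at_least_one = sum(1 for sa, su in weekends if sa == SUNNY or su == SUNNY)

# Inclusion-exclusion: |sunny Sat| + |sunny Sun| - |both sunny|
sunny_sa = sum(1 for sa, _ in weekends if sa == SUNNY)
sunny_su = sum(1 for _, su in weekends if su == SUNNY)
both = sum(1 for sa, su in weekends if sa == SUNNY and su == SUNNY)

p_direct = Fraction(at_least_one, len(weekends))
p_incl_excl = Fraction(sunny_sa + sunny_su - both, len(weekends))
print(p_direct, p_incl_excl)
```

Simply adding the sunny Saturdays and the sunny Sundays would count the weekend `(3, 3)` twice.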

## Expected weather for each day of the week

There are 17 rainy, 14 misty, and 21 sunny Mondays out of 52, therefore

$
\begin{align}
E(W | D = mo) &= \sum_y y~P(W = y | D = mo) \nonumber\\
&= 1 \times P(W = 1 | D = mo) + 2 \times P(W = 2 | D = mo) + 3 \times P(W = 3 | D = mo) \nonumber\\
&= 1 \times (17/52) + 2 \times (14/52) + 3  \times (21/52) \nonumber\\
&\approx 2.077 \nonumber
\end{align}
$

This is equivalent to summing all the values of the Monday column and dividing by the number of rows.

The same reasoning applies to all the other days.
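
The equivalence between the weighted sum and the column mean can be checked with the Monday counts quoted above. This is only an illustrative sketch: the column is rebuilt from the counts rather than read from the dataset, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction

# Monday counts quoted above: 17 rainy (1), 14 misty (2), 21 sunny (3)
monday_counts = {1: 17, 2: 14, 3: 21}
n_weeks = sum(monday_counts.values())

# E(W | D = mo) = sum_y y * P(W = y | D = mo)
expected = sum(y * Fraction(c, n_weeks) for y, c in monday_counts.items())

# Equivalent: rebuild the Monday column and take its mean
column = [y for y, c in monday_counts.items() for _ in range(c)]
column_mean = Fraction(sum(column), len(column))

print(float(expected), float(column_mean))
```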


## Guess which day is today based only on the weather

If it's rainy, we need to compute the conditional probability distribution of the day given $W=1$

${\bf P}(D | W=1) = \langle P(mo | W=1), P(tu | W=1), P(we | W=1), P(th | W=1), P(fr | W=1), P(sa | W=1), P(su | W=1)\rangle$

and then choose the day with the highest probability.

In the dataset, out of 137 rainy days, 17 are Mondays, 19 Tuesdays, 16 Wednesdays, 19 Thursdays, 19 Fridays, 20 Saturdays and 27 Sundays. Therefore

$
\begin{align}
{\bf P}(D | W=1) &= \langle (17/137), (19/137), (16/137), (19/137), (19/137), (20/137), (27/137) \rangle \nonumber\\
&\approx \langle 0.124, 0.139, 0.117, 0.139, 0.139, 0.146, {\bf 0.197} \rangle \nonumber
\end{align}
$

Since the last entry is the largest, if it's raining, it's most likely Sunday.

The same reasoning applies to the other weather conditions.

PS: notice that we could reach the same conclusions by simply counting and comparing the number of rainy Mondays, Tuesdays, etc. without normalising by the total number 137 (but then of course it wouldn't be a *probability* distribution).
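
A small sketch of this remark, using the rainy-day counts quoted above: dividing every count by the same positive total does not change which entry is the largest. The variable names are illustrative only.

```python
day_names = ['mo', 'tu', 'we', 'th', 'fr', 'sa', 'su']

# Rainy-day counts per day of the week, as quoted above
rainy_counts = [17, 19, 16, 19, 19, 20, 27]
total_rainy = sum(rainy_counts)

# P(D | W = 1): normalise the counts so that they sum to 1
p_day_given_rain = [c / total_rainy for c in rainy_counts]

# Index of the largest entry, with and without normalisation
best_by_prob = max(range(7), key=lambda i: p_day_given_rain[i])
best_by_count = max(range(7), key=lambda i: rainy_counts[i])

print(day_names[best_by_prob], day_names[best_by_count])
```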

In [None]:
# Code to read csv file into Colaboratory:
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
import pandas as pd
import numpy as np  # needed for the array manipulation in the next cells

# Read the data from the shareable link
# Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

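link = 'https://drive.google.com/open?id=1LjZLE9ozaHcBwiCl90mHaS1nXKcglfr4' # The shareable link

# the file id is the part of the link after '='; the name file_id avoids shadowing the built-in id()
_, file_id = link.split('=')
downloaded = drive.CreateFile({'id': file_id})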
downloaded.GetContentFile('padua_weather.csv')
df3 = pd.read_csv('padua_weather.csv')


In [None]:
# Probability of being sunny during the weekend (one or both days)

# Boolean mask over the weeks: True if Saturday and/or Sunday is sunny (= 3).
# The logical OR counts a weekend with two sunny days only once.
sunny_weekend = (df3['Saturday'] == 3) | (df3['Sunday'] == 3)
P_W_sa_su = sunny_weekend.sum() / len(df3)
print(P_W_sa_su)

0.4423076923076923


In [None]:
# Expected weather for each day of the week

days = np.array(df3.columns)
weather = np.array(range(1,4))

# weekly_count[i, k] = number of weeks in which day i had weather k+1
weekly_count = np.zeros((days.size, len(weather)))
for i in range(days.size):
  for j in range(len(df3)):
    weekly_count[i, int(df3.iat[j,i])-1] += 1

# relative frequencies: P(W = k+1 | D = day i)
weekly_probabilities = weekly_count / len(df3)

# E(W | D) = sum over the weather values of value * probability
weekly_averages = np.sum(weekly_probabilities * weather, axis=1)

print("Expected weather for each day of the week: " + str(weekly_averages))

Expected weather for each day of the week: [2.07692308 1.98076923 2.03846154 1.94230769 1.96153846 1.84615385
 1.75      ]


In [None]:
# Guess which day is today based only on the weather

days = np.array(df3.columns)
weather = np.array(range(1,4))

# weather_weekly_count[k, i] = number of weeks in which day i had weather k+1
weather_weekly_count = np.zeros((len(weather), days.size))

for wi, w in enumerate(weather):
  for di, d in enumerate(days):
    weather_weekly_count[wi, di] = (df3[d] == w).sum()

# total number of days with each weather, kept as a column so that it broadcasts over the days
total = np.sum(weather_weekly_count, axis=1, keepdims=True)

# P(D | W = w): each row is normalised by its own total
P_D_w = weather_weekly_count / total
print(P_D_w)

print("Most probable day for each weather (Monday=1, ... , Sunday=7): " + str(np.argmax(P_D_w, axis=1)+1))


[[0.12408759 0.13868613 0.11678832 0.13868613 0.13868613 0.1459854
  0.19708029]
 [0.12612613 0.13513514 0.16216216 0.15315315 0.14414414 0.18018018
  0.0990991 ]
 [0.18103448 0.15517241 0.15517241 0.13793103 0.14655172 0.10344828
  0.12068966]]
Most probable day for each weather (Monday=1, ... , Sunday=7): [7 6 1]


## EXERCISE 2: Broad Street cholera outbreak

The following is a simplified version of an example in Judea Pearl's *The Book of Why*. It refers to a cholera epidemic, caused by contaminated water, which killed hundreds of people in London between 1853 and 1854. The diagram below illustrates some of the key factors explaining this epidemic, in particular:
- $X$ indicates whether the water company's intake was downstream of London's sewers;
- $W$ indicates whether the water was contaminated or not;
- $Z$ indicates the presence of other external factors (e.g. poverty, miasma, etc.);
- $Y$ indicates the outbreak of cholera.

<img src='https://drive.google.com/uc?id=10O10x_nuuxF55rqRk0TpanHV_7Q819MA'>

(Please note that the probabilities in the diagram are fake.)

> - Formalise the problem using appropriate mathematical notation and derive an expression for computing the probability distribution of cholera given that the water company's intake is upstream (i.e. what is the query? how can it be decomposed?)
> - Write a Python program that computes the actual probabilities of the above distribution using the information from the given CPTs.

## SOLUTIONS: Probability distribution of the cholera given that the water company's intake is upstream

The question is to find the probability distribution ${\bf P}(Y | \neg x)$, where $Y$ is the cholera variable and $\neg x$ is the evidence "not downstream" (i.e. upstream). As usual, we can get this from the joint probability distribution of the Bayesian network, summing out the hidden variables ($Z$ and $W$ in this case) and normalising:

$
\begin{align}
{\bf P}(Y | \neg x) &= \alpha {\bf P}(Y, \neg x) \nonumber\\
&= \alpha \sum_{z, w} {\bf P}(Y, w, \neg x, z) \nonumber\\
&= \alpha \sum_z \sum_w {\bf P}(Y | w, z) P(w | \neg x, z) P(\neg x) P(z) \nonumber\\
&= \alpha P(\neg x) \sum_z P(z) \sum_w {\bf P}(Y | w, z) P(w | \neg x, z) \nonumber
\end{align}
$
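
A short worked step, added here as a sketch, makes the role of $\alpha$ explicit. Summing the unnormalised expression over the values of $Y$ gives $P(\neg x)$, because each conditional distribution sums to 1. Hence $\alpha = 1 / P(\neg x)$ and the factor $P(\neg x)$ cancels:

$
\begin{align}
\sum_y P(\neg x) \sum_z P(z) \sum_w P(y | w, z) P(w | \neg x, z) &= P(\neg x) \sum_z P(z) \sum_w P(w | \neg x, z) = P(\neg x) \nonumber\\
{\bf P}(Y | \neg x) &= \sum_z P(z) \sum_w {\bf P}(Y | w, z) P(w | \neg x, z) \nonumber
\end{align}
$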


## Computing the actual probabilities of the above distribution

The Python program should take the probabilities from the CPTs and compute the following numbers:

$
\begin{align}
{\bf P}(Y | \neg x) &= \alpha \times 0.5 \times \Big[ 0.25 \times \big( \langle 0.8, 0.2 \rangle \times 0.10 + \langle 0.15, 0.85 \rangle \times 0.90 \big) + 0.75 \times \big( \langle 0.75, 0.25 \rangle \times 0.02 + \langle 0.05, 0.95 \rangle \times 0.98 \big) \Big] \nonumber\\
&= \alpha \times 0.5 \times \Big[ 0.25 \times \big( \langle 0.08, 0.02 \rangle + \langle 0.135, 0.765 \rangle \big) + 0.75 \times \big( \langle 0.015, 0.005 \rangle + \langle 0.049, 0.931 \rangle \big) \Big] \nonumber\\
&= \alpha \times 0.5 \times \Big[ 0.25 \times \langle 0.215, 0.785 \rangle + 0.75 \times \langle 0.064, 0.936 \rangle \Big] \nonumber\\
&= \alpha \times 0.5 \times \Big[ \langle 0.05375, 0.19625 \rangle + \langle 0.048, 0.702 \rangle \Big] \nonumber\\
&= \alpha \times 0.5 \times \langle 0.10175, 0.89825 \rangle \nonumber\\
&= \langle 0.10175, 0.89825 \rangle \nonumber
\end{align}
$

In [None]:
import numpy as np

# 'true' and 'false' indexes
t, f = 0, 1

P_x = np.array([0.5, 0.5])
P_z = np.array([0.25, 0.75])
# P_w_xz[x, z, w] = P(W=w | X=x, Z=z)
P_w_xz = np.array([[[0.90, 0.10],[0.85, 0.15]],[[0.10, 0.90], [0.02, 0.98]]])
# P_y_wz[w, z, :] = P(Y | W=w, Z=z)
P_y_wz = np.array([[[0.80, 0.20],[0.75, 0.25]],[[0.15, 0.85], [0.05, 0.95]]])

# sum out W for z = false: sum_w P(Y | w, ¬z) P(w | ¬x, ¬z)
step1 = P_y_wz[t, f] * P_w_xz[f, f, t] + P_y_wz[f, f] * P_w_xz[f, f, f]

# sum out W for z = true: sum_w P(Y | w, z) P(w | ¬x, z)
# (the first weight is P(w | ¬x, z) = P_w_xz[f, t, t], not P_w_xz[t, t, f],
#  which only happens to hold the same value 0.10)
step2 = P_y_wz[t, t] * P_w_xz[f, t, t] + P_y_wz[f, t] * P_w_xz[f, t, f]
print("P(Y | z=false, x=false) = " + str(step1))
print("P(Y | z=true, x=false) = " + str(step2))

step3 = P_z[f] * step1 + P_z[t] * step2  # sum out Z
step4 = P_x[f] * step3  # multiply by the probability of the upstream condition
step5 = step4 / sum(step4)  # normalisation
print("P(Y | x=false) = " + str(step5))

P(Y | z=false, x=false) = [0.064 0.936]
P(Y | z=true, x=false) = [0.215 0.785]
P(Y | x=false) = [0.10175 0.89825]
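
The same result can be obtained with a single contraction over the hidden variables. The sketch below is an alternative formulation, not the lesson's method: it assumes the axis orders `P_w_xz[x, z, w]` and `P_y_wz[w, z, y]` (inferred from the numbers used in the derivation above) and leaves out the factor $P(\neg x)$, which cancels on normalisation.

```python
import numpy as np

f = 1  # index of 'false' (index 0 is 'true')

P_z = np.array([0.25, 0.75])
# Assumed layout: P_w_xz[x, z, w] = P(W = w | X = x, Z = z)
P_w_xz = np.array([[[0.90, 0.10], [0.85, 0.15]],
                   [[0.10, 0.90], [0.02, 0.98]]])
# Assumed layout: P_y_wz[w, z, y] = P(Y = y | W = w, Z = z)
P_y_wz = np.array([[[0.80, 0.20], [0.75, 0.25]],
                   [[0.15, 0.85], [0.05, 0.95]]])

# P(Y | not x) = sum_z P(z) * sum_w P(w | not x, z) * P(Y | w, z)
# P_w_xz[f] selects x = false and has axes (z, w)
P_y_notx = np.einsum('z,zw,wzy->y', P_z, P_w_xz[f], P_y_wz)
print(P_y_notx)
```

Naming each axis in the contraction string makes an index mix-up between `x`, `z` and `w` easier to spot than with chained positional indexing.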
