# **CAP-417: Estatística Computacional (Computational Statistics)**

<img src="https://raw.githubusercontent.com/vsantjr/CAP/master/Images/thedice.jpg" alt="Drawing" width="350"/>

<br>

# ***Probability***
-----

This notebook was developed by Prof. <a href="https://www.linkedin.com/in/valdivino-alexandre-de-santiago-j%C3%BAnior-103109206/?locale=en_US">Valdivino Alexandre de Santiago Júnior</a> for the Computational Statistics CAP course at INPE.

<br>

Most of this material was based on: R. E. Walpole, R. H. Myers, S. L. Myers, K. E. Ye. <a href="https://www.pearson.com/us/higher-education/product/Walpole-Probability-and-Statistics-for-Engineers-and-Scientists-9th-Edition/9780321629111.html">Probability and Statistics for Engineers and Scientists, 9th Edition</a>. Pearson, 2012.

<br>

**Licence**: GNU GENERAL PUBLIC LICENSE, Version 3 (GPLv3)



## Introduction to Probability
----

When we talk about **probability**, we are asking how likely an event is to occur. The probability of an event is a number between 0 and 1: 0 indicates that the event cannot occur, and 1 indicates that the event is certain to occur.

<br>

Statisticians use the word experiment to describe any process that generates
a set of data. A simple example of a statistical experiment is the tossing of a coin. In this experiment, there are only two possible outcomes, heads or tails.

<br>

The main goal is to obtain observations by repeating the experiment several times. In most cases, the outcome is not deterministic: it depends on chance.


In [1]:
import random
for i in range(10):
  n = random.randint(1,30) # Interval: [low, high]
  print('Number in iteration {} is: {}'.format(i,n))  

Number in iteration 0 is: 5
Number in iteration 1 is: 19
Number in iteration 2 is: 11
Number in iteration 3 is: 26
Number in iteration 4 is: 24
Number in iteration 5 is: 30
Number in iteration 6 is: 3
Number in iteration 7 is: 4
Number in iteration 8 is: 30
Number in iteration 9 is: 21
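
The coin-tossing experiment mentioned above can be repeated in the same way. The following is only a minimal sketch: it assumes a fair coin, and the seed and the number of tosses are arbitrary choices made here for reproducibility.

```python
import random

rng = random.Random(417)  # fixed seed (arbitrary) so the run is reproducible
n_tosses = 10_000         # number of repetitions (arbitrary)

# Each toss is one repetition of the experiment; the outcome is 'H' or 'T'.
tosses = [rng.choice('HT') for _ in range(n_tosses)]

# Relative frequency of heads over all repetitions.
freq_heads = tosses.count('H') / n_tosses
print('Relative frequency of heads:', freq_heads)
```

With many repetitions, the relative frequency of heads should settle close to 0.5, even though no single toss can be predicted.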


## Sample Space 
----

The set of all possible outcomes of a statistical experiment is called the **sample space** and we denote it by $S$.


<br>

Each outcome in a sample space is called an **element** of the
sample space, or simply a sample point. Thus, the sample space $S$ of possible outcomes when a coin is flipped may be written as

$$
S = \{H, T\},
$$

where $H$ and $T$ correspond to heads and tails, respectively.

<br>

Ex: Suppose that three items are selected at random from a manufacturing process. Each item is inspected and classified as defective, $D$, or nondefective, $N$. What are the elements of the sample space, $S$? 

R: The tree diagram below enumerates the possible outcomes, which gives $S = \{DDD, DDN, DND, DNN, NDD, NDN, NND, NNN\}$.

<br>

<img src="https://raw.githubusercontent.com/vsantjr/CAP/master/Images/treediag.png" alt="Drawing" width="400"/>
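
The same sample space can also be listed programmatically. The sketch below uses `itertools.product` from the standard library; the variable name `sample_space` is just an illustrative choice.

```python
from itertools import product

# Each of the three items is either defective ('D') or nondefective ('N').
sample_space = [''.join(outcome) for outcome in product('DN', repeat=3)]

print('S =', sample_space)
print('Number of sample points:', len(sample_space))
```

Each of the three items has two possible classifications, so the list contains $2^3 = 8$ sample points.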



## Events
----

For any given experiment, we may be interested in the occurrence of certain **events** rather than in the occurrence of a specific element in the sample space.

<br>

Each event is assigned a collection of sample points which constitute a subset
of the sample space. Hence, an **event** is a subset of a sample space.

<br>

Ex: Let us consider the previous example of the manufacturing process. What is the event $A_1$ in which the number of defective items is greater than 1?

R: $A_1 = \{DDD, DDN, DND, NDD\}$



In [2]:
elem_a1 = []
for i in range(3):
  elem_a1.append(random.randint(0,1)) # 0 = N; 1 = D 

print('Possible element of A1:', elem_a1)

if elem_a1.count(1) > 1:
  print('Element is indeed in A1')
else:
  print('Element is NOT in A1')    

Possible element of A1: [0, 1, 1]
Element is indeed in A1
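
The cell above draws a single random sample point and checks whether it belongs to $A_1$. As a complementary sketch, the whole event can be obtained by filtering the full sample space (the names `sample_space` and `a1` are illustrative).

```python
from itertools import product

# Full sample space for three inspected items ('D' = defective, 'N' = nondefective).
sample_space = [''.join(outcome) for outcome in product('DN', repeat=3)]

# A1: more than one defective item among the three.
a1 = [point for point in sample_space if point.count('D') > 1]
print('A1 =', a1)
```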


## Intersection, Mutually Exclusive, Union of Events
----

The **intersection** of two events $A$ and $B$, denoted by the symbol $A \cap B$, is the
event containing all elements that are common to $A$ and $B$.

<br>

Two events $A$ and $B$ are **mutually exclusive**, or disjoint, if $A \cap B = \emptyset$. 

<br>

The **union** of the two events $A$ and $B$, denoted by the symbol $A \cup B$, is the event containing all the elements that belong to $A$ or $B$ or both.
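
These three definitions map directly onto Python's built-in `set` type. The sketch below uses a hypothetical die-rolling experiment; the events `A`, `B`, and `C` are illustrative and are not taken from the text.

```python
# Hypothetical events on the roll of a six-sided die.
S = {1, 2, 3, 4, 5, 6}   # sample space
A = {2, 4, 6}            # an even number is rolled
B = {4, 5, 6}            # a number greater than 3 is rolled
C = {1, 3}               # an odd number less than 5 is rolled

print('A ∩ B =', A & B)                                 # intersection
print('A ∪ B =', A | B)                                 # union
print('A and C mutually exclusive?', A.isdisjoint(C))   # True when A ∩ C is empty
```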

<br>

Question: Let $A = \{a \quad | \quad 5 \leq a \leq 12 \}$ and $B = \{b \quad | \quad 1 \leq b \leq 8 \}$. Calculate:



1.   $A \cap B$;
2.   $A \cup B$;
3.   What change can be made so that $A$ and $B$ become disjoint?




## Probability of an Event
----

The probability of an event $A$, denoted by $P(A)$, is the sum of the probabilities assigned to
the sample points (elements) in $A$.
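
As a small sketch of this definition, consider a hypothetical loaded die (an assumption made only for illustration) in which each even face is twice as likely as each odd face. The probability of an event is then the sum of the weights of its sample points. `Fraction` is used so that the arithmetic stays exact.

```python
from fractions import Fraction

# Hypothetical loaded die: each even face is twice as likely as each odd face.
# Odd faces get weight w and even faces 2w, so 3w + 3(2w) = 1 gives w = 1/9.
weights = {face: Fraction(2 if face % 2 == 0 else 1, 9) for face in range(1, 7)}
assert sum(weights.values()) == 1  # the weights over S must add up to 1

# Event E: the face shown is less than 4.
E = {1, 2, 3}
p_E = sum(weights[face] for face in E)
print('P(E) =', p_E)
```

Here $P(E) = 1/9 + 2/9 + 1/9 = 4/9$.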

<br>

<img src="https://raw.githubusercontent.com/vsantjr/CAP/master/Images/probdefin.png" alt="Drawing" width="600"/>

<br>

Rule:

<img src="https://raw.githubusercontent.com/vsantjr/CAP/master/Images/rule2.3.png" alt="Drawing" width="600"/>

<br>

Ex: On a flight, there are 100 Brazilians, 90 Americans, 38 Spaniards, and 73 Brits in economy class. If a person is randomly selected for an upgrade to business class, find the probability that the person chosen is (a) a Brazilian, (b) an American or a Brit, and (c) a Spaniard or a Brazilian. 


In [3]:
import pandas as pd

data = {'total': [100, 90, 38, 73]}
row_labels = ['braz', 'amer', 'span', 'brit']

persons = pd.DataFrame(data=data, index=row_labels)

print('Persons: \n', persons)

all_persons = persons['total'].sum()

print('\nAll Passengers:', all_persons)

print('\nItem a: ', persons.loc['braz'].at['total'] / all_persons)
print('Item b: ', (persons.loc['amer'].at['total'] + persons.loc['brit'].at['total']) / all_persons)
print('Item c: ', (persons.loc['span'].at['total'] + persons.loc['braz'].at['total']) / all_persons)

Persons: 
       total
braz    100
amer     90
span     38
brit     73

All Passengers: 301

Item a:  0.33222591362126247
Item b:  0.5415282392026578
Item c:  0.4584717607973422


Question: In the example above, is $P(S)=1$?

## Additive Rules
----

One of the main goals of these rules is to simplify the computation of probabilities. Below we present some theorems and corollaries related to addition.

<br>


<img src="https://raw.githubusercontent.com/vsantjr/CAP/master/Images/adr1.png" alt="Drawing" width="600"/>



<br>

<img src="https://raw.githubusercontent.com/vsantjr/CAP/master/Images/adr2.png" alt="Drawing" width="600"/>

<br>

<img src="https://raw.githubusercontent.com/vsantjr/CAP/master/Images/adr3.png" alt="Drawing" width="600"/>

<br>

<img src="https://raw.githubusercontent.com/vsantjr/CAP/master/Images/adr4.png" alt="Drawing" width="600"/>

<br>
<br>

**PS:** Note that $P(A \cap B)$ is known as a **joint probability**, i.e. the probability that two events occur simultaneously. An alternative notation is $P(A,B)$.

<br>

Ex: The probability that tomorrow will be a cold day in SJCampos is 0.5, the probability that it will be a rainy day is 0.65, and the probability that it will be both cold and rainy is 0.45. What is the probability that it will be neither cold nor rainy?

R: $P(cold) = 0.5$, $P(rainy) = 0.65$, $P(cold \cap rainy) = 0.45$. 

$P(cold \cup rainy) = 0.5 + 0.65 - 0.45 = 0.7$

However, we want the complementary event. Thus:

$P((cold \cup rainy)^c) = 1 - P(cold \cup rainy) = 1 - 0.7 = 0.3$.
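
The same two steps (additive rule, then complement) can be written as a short check. This is only a sketch of the arithmetic above; the variable names are illustrative.

```python
p_cold = 0.5
p_rainy = 0.65
p_cold_and_rainy = 0.45

# Additive rule: P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
p_cold_or_rainy = p_cold + p_rainy - p_cold_and_rainy

# Complement rule: P(A') = 1 - P(A)
p_neither = 1 - p_cold_or_rainy

# Rounding hides the tiny floating-point error of the binary representation.
print('P(cold or rainy):', round(p_cold_or_rainy, 2))
print('P(neither):', round(p_neither, 2))
```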



## Conditional Probability
----

The probability of an event $B$ occurring when it is known that some event $A$
has occurred is called a **conditional probability**. It is denoted by $P(B|A)$, which is usually read as “the probability that $B$ occurs given that $A$ occurs”
or simply “the probability of $B$, given $A$”.

<br>

<img src="https://raw.githubusercontent.com/vsantjr/CAP/master/Images/condprob.png" alt="Drawing" width="600"/>

<br>

Ex: On a flight, there are 100 Brazilians (55 men and 45 women), 90 Americans (30 men and 60 women), 38 Spaniards (all men), and 73 Brits (12 men and 61 women) in economy class. If a person is randomly selected for an upgrade to business class, find the probability that the person chosen is (a) a Brazilian given that the person is a man, (b) an American given that the person is a woman, and (c) a man given that the person is a Spaniard. 

In [4]:
data = {'man': [55, 30, 38, 12],
        'woman': [45, 60, 0, 61],
        'total': [100, 90, 38, 73]}
row_labels = ['braz', 'amer', 'span', 'brit']

persons = pd.DataFrame(data=data, index=row_labels)

print('Persons: \n', persons)

all_persons = persons['total'].sum()

print('\nAll Passengers:', all_persons)

# Item a: We want P(braz|man).
p_man_and_braz = persons.loc['braz'].at['man'] / all_persons
p_man = persons['man'].sum() / all_persons
print('\nItem a: P(braz|man): ', p_man_and_braz / p_man)

# Item b: We want P(amer|woman).
p_woman_and_amer = persons.loc['amer'].at['woman'] / all_persons
p_woman = persons['woman'].sum() / all_persons
print('\nItem b: P(amer|woman): ', p_woman_and_amer / p_woman)

# Item c: We want P(man|span).
p_span_and_man = persons.loc['span'].at['man'] / all_persons
p_span = persons.loc['span'].at['total'] / all_persons
print('\nItem c: P(man|span): ', p_span_and_man / p_span)


Persons: 
       man  woman  total
braz   55     45    100
amer   30     60     90
span   38      0     38
brit   12     61     73

All Passengers: 301

Item a: P(braz|man):  0.40740740740740744

Item b: P(amer|woman):  0.3614457831325302

Item c: P(man|span):  1.0


## Independent Events and Multiplicative Rule
----

When the occurrence of one event has no impact on the probability of occurrence of another event, we say that the two events are **independent**.

<br>

<img src="https://raw.githubusercontent.com/vsantjr/CAP/master/Images/indepevents.png" alt="Drawing" width="600"/>

<br>

Based on the definition of conditional probability, $P(B|A) = P(A \cap B) / P(A)$ with $P(A) > 0$, if we multiply both sides of the equation by $P(A)$, we get the **multiplicative rule**.

<br>

<img src="https://raw.githubusercontent.com/vsantjr/CAP/master/Images/multrule.png" alt="Drawing" width="600"/>

<br>

Since the events $A \cap B$ and $B \cap A$ are equivalent, we also have:

$$
P(A \cap B) = P(B \cap A) = P(B)P(A|B)
$$

<br>

Hence, it does not matter which event is referred to as $A$ and which event
is referred to as $B$. Note, however, that although the joint probability is symmetrical, i.e. $P(A \cap B) = P(B \cap A)$, the conditional probability is in general not symmetrical, i.e. $P(A | B ) \neq P(B | A)$.
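
The asymmetry can be seen with the passenger counts from the flight example in the Conditional Probability section. The sketch below uses plain dictionaries instead of `pandas`, and the variable names are illustrative.

```python
# Passenger counts from the flight example: (men, women) per nationality.
counts = {'braz': (55, 45), 'amer': (30, 60), 'span': (38, 0), 'brit': (12, 61)}

total = sum(men + women for men, women in counts.values())
n_men = sum(men for men, _ in counts.values())
n_braz = sum(counts['braz'])
n_braz_and_man = counts['braz'][0]

# The joint probability is the same whichever event is named first...
p_joint = n_braz_and_man / total

# ...but the two conditional probabilities divide by different marginals.
p_braz_given_man = n_braz_and_man / n_men    # P(braz | man)
p_man_given_braz = n_braz_and_man / n_braz   # P(man | braz)

print('P(braz and man):', p_joint)
print('P(braz | man):  ', p_braz_given_man)
print('P(man | braz):  ', p_man_given_braz)
```

Both factorisations of the multiplicative rule, $P(man)P(braz|man)$ and $P(braz)P(man|braz)$, recover the same joint probability.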

<br>

Moreover, there is another theorem:

<img src="https://raw.githubusercontent.com/vsantjr/CAP/master/Images/multtheorem.png" alt="Drawing" width="600"/>


<br>

Ex: A small goods delivery company has one truck and one drone in its fleet. The probability that the truck is available to deliver goods on Fridays is 0.7, and the probability that the drone is available on the same day is 0.9. Given that goods can be delivered by either the truck or the drone, find the probability that both the truck and the drone will be available on a Friday, assuming they operate independently.

R: $P(truck) = 0.7$, $P(drone) = 0.9$. 

Hence, $P(truck \cap drone) = 0.7 \times 0.9 = 0.63$.
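
This product can be checked with a simple Monte Carlo simulation. The sketch below draws the availability of the truck and of the drone independently; the seed and the number of simulated Fridays are arbitrary choices.

```python
import random

rng = random.Random(2023)  # arbitrary seed for reproducibility
n_fridays = 100_000        # number of simulated Fridays (arbitrary)

both = 0
for _ in range(n_fridays):
    truck_ok = rng.random() < 0.7   # truck available with probability 0.7
    drone_ok = rng.random() < 0.9   # drone available with probability 0.9, drawn independently
    if truck_ok and drone_ok:
        both += 1

estimate = both / n_fridays
print('Simulated P(truck and drone):', estimate)
print('Exact value 0.7 * 0.9:       ', round(0.7 * 0.9, 2))
```

The simulated relative frequency should be close to 0.63, and it gets closer as `n_fridays` grows.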

## Bayes' Rule (Theorem)
----

Bayes' Rule (Theorem) is one of the most important concepts in probability theory. It is the basis of Bayesian inference, which specifies how one should update one’s beliefs upon observing data.

<br>

We have already shown how to calculate conditional probabilities, i.e. $P(B|A)$ and $P(A|B)$. However, there is another way to do that, based on Bayes' Rule below.

<br>

<img src="https://raw.githubusercontent.com/vsantjr/CAP/master/Images/bayesrule.png" alt="Drawing" width="300"/>

<br>
<br>

In the formula above (see <a href="https://towardsdatascience.com/what-is-bayes-rule-bb6598d8a2fd">D. Soni's TDS post</a>):



1.   $P(A|B)$ is called the posterior, i.e. this is what we want to estimate;
2.   $P(B|A)$ is called the likelihood. This is the probability of observing the new evidence, given our initial hypothesis;
3.   $P(A)$ is called the prior. This is the probability of our hypothesis before any additional evidence is taken into account;
4.   $P(B)$ is the marginal likelihood. This is the total probability of observing the evidence.

<br>

<img src="https://raw.githubusercontent.com/vsantjr/CAP/master/Images/bayesrule2.png" alt="Drawing" width="300"/>

Source: <a href ="https://towardsdatascience.com/bayes-rule-with-a-simple-and-practical-example-2bce3d0f4ad0">T. Sarkar's TDS post</a>

<br>

Note that the Bayes' Rule concept is closely related to data science (or the other way around).

<img src="https://raw.githubusercontent.com/vsantjr/CAP/master/Images/baysrule3.png" alt="Drawing" width="350"/>

Source: <a href ="https://towardsdatascience.com/bayes-rule-with-a-simple-and-practical-example-2bce3d0f4ad0">T. Sarkar's TDS post</a>

<br>
<br>

**PS:** The Bayes' Rule can be applied considering several events $A_i$ which constitute a partition of the sample space, $S$.
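
A minimal sketch of this partition form is given below. The helper `bayes_posteriors` is a hypothetical function written for this illustration, and the numbers (three machines and their defect rates) are illustrative and not taken from this notebook. The marginal likelihood $P(B)$ is obtained from the law of total probability, $P(B) = \sum_i P(B|A_i)P(A_i)$.

```python
def bayes_posteriors(priors, likelihoods):
    """Return P(A_i | B) for every event A_i of a partition of S.

    priors[i] is P(A_i) and likelihoods[i] is P(B | A_i).
    """
    # Joint probabilities P(A_i ∩ B) = P(B | A_i) P(A_i).
    joint = [prior * lik for prior, lik in zip(priors, likelihoods)]
    # Law of total probability: P(B) is the sum of the joint probabilities.
    p_b = sum(joint)
    return [j / p_b for j in joint]


# Illustrative numbers: three machines produce 30%, 45% and 25% of the
# items, with defect rates of 2%, 3% and 2%, respectively.
priors = [0.30, 0.45, 0.25]
likelihoods = [0.02, 0.03, 0.02]

posteriors = bayes_posteriors(priors, likelihoods)
for name, post in zip(['A1', 'A2', 'A3'], posteriors):
    print('P({} | defective) = {:.4f}'.format(name, post))
```

Because the $A_i$ form a partition, the posteriors returned by the function add up to 1.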

<br>

Ex: A new COVID test was evaluated with a group of people, $5\%$ of whom are known to have been truly infected. The test diagnosis is positive for $98\%$ of those who are truly infected and for $3\%$ of those who are not infected. What is the probability that someone testing positive for COVID under this new test is actually infected?

R: Likelihood: $P(positive|infected) = 0.98$.

Prior: $P(infected) = 0.05$. 

Marginal likelihood: $P(positive) = (0.98 \times 0.05) + (0.03 \times 0.95)$.

Hence:

$$
P(infected|positive)=\frac{P(positive|infected)P(infected)}{P(positive)} = \frac{0.98 \times 0.05}{(0.98 \times 0.05) + (0.03 \times 0.95)} \approx 0.632258
$$
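
The same result can be checked by counting people instead of multiplying probabilities. The sketch below assumes a hypothetical group of 10,000 people, a size chosen only so that every count is a whole number.

```python
population = 10_000  # hypothetical group size, chosen only to get whole numbers

infected = round(population * 0.05)       # prior: 5% are truly infected
not_infected = population - infected

true_positives = round(infected * 0.98)       # infected people who test positive
false_positives = round(not_infected * 0.03)  # non-infected people who test positive

# Among everyone who tests positive, the fraction that is truly infected.
p_infected_given_positive = true_positives / (true_positives + false_positives)
print('P(infected | positive) = {:.6f}'.format(p_infected_given_positive))
```

Although the test detects $98\%$ of infections, only about $63\%$ of positive results correspond to infected people, because the non-infected group is much larger.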









## Exercise
----

A regional telephone company operates three identical relay stations at different locations. During a one-year period, the number of malfunctions reported by each station and the causes are shown below.

<br>

<img src="https://raw.githubusercontent.com/vsantjr/CAP/master/Images/bayes2.png" alt="Drawing" width="550"/>

<br>

Suppose that a malfunction was reported and it was found to be caused by other human errors. Answer below:

1. What is the probability that it came from station $C$? What is the probability that it came from station $A$? Develop a simple program that accepts the table above as input and solves this item.
2. Solve the same item above but now assume that "caused by other human errors" is three times higher in the case of $C$ and two times higher in the case of $A$. Update the table before calculating them. Your program must be able to accept changes in the input table.

**Deadline**: 19 April 2023
