<a href="https://colab.research.google.com/github/rahiakela/data-science-research-and-practice/blob/main/data-science-bookcamp/case-study-1/01_computing_probabilities.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Computing probabilities

**Few things in life are certain; most things are driven by chance**. Whenever we cheer
for our favorite sports team, or purchase a lottery ticket, or make an investment in
the stock market, we hope for some particular outcome, but that outcome cannot
ever be guaranteed.

**Randomness permeates our day-to-day experiences**. Fortunately,
that randomness can still be mitigated and controlled.

We know that some
unpredictable events occur more rarely than others and that certain decisions carry
less uncertainty than other much-riskier choices. **Driving to work in a car is safer than riding a motorcycle.**

**These behaviors have been rigorously studied using
probability theory.** 

Probability theory is an inherently complex branch of math. However,
aspects of the theory can be understood without knowing the mathematical underpinnings. 

In fact, difficult probability problems can be solved in Python without
needing to know a single math equation. **Such an equation-free approach to probability requires a baseline understanding of what mathematicians call a sample space.**

We approach probability problems by following four steps:

1. Defining all the possible outcomes of the problem.
2. Creating the sample space.
3. Defining the event condition.
4. Computing the event probability.
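
As a rough illustration of the four steps, here is a minimal sketch that applies them to a single roll of a fair six-sided die. The names `die_sample_space`, `is_even`, and `prob_even` are chosen for this example only and are not part of the notebook's own code.

```python
# Step 1: define all possible outcomes of a single action (one roll of a die).
possible_rolls = [1, 2, 3, 4, 5, 6]

# Step 2: create the sample space (here, simply the set of possible rolls).
die_sample_space = set(possible_rolls)

# Step 3: define the event condition as a Boolean function of one outcome.
def is_even(outcome):
  return outcome % 2 == 0

# Step 4: compute the event probability as event size over sample space size.
# This assumes every outcome is equally likely.
even_event = {outcome for outcome in die_sample_space if is_even(outcome)}
prob_even = len(even_event) / len(die_sample_space)
print(f"Probability of rolling an even number is {prob_even}")
```

Three of the six outcomes are even, so the sketch prints a probability of 0.5.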

##Sample space analysis

**Certain actions have measurable outcomes. A sample space is the set of all the possible outcomes an action could produce.**

Let’s take the simple action of flipping a coin. The
coin will land on either heads or tails. Thus, the coin flip will produce one of two measurable outcomes: `heads` or `tails`.

In [1]:
sample_space = {"heads", "tails"}

Our sample space holds two elements.
Each element occupies an equal fraction of the space within the set. Therefore, if we pick an element at random, we expect `heads` to be selected with a frequency of `1/2`.

**That frequency is formally defined as the probability of an outcome. All outcomes within `sample_space` share an identical probability, which is equal to `1 / len(sample_space)`.**

In [2]:
probability_heads = 1 / len(sample_space)
print(f"Probability of choosing heads is {probability_heads}")

Probability of choosing heads is 0.5


The probability of choosing Heads equals 0.5. This relates directly to the action of flipping a coin.

Thus, a coin flip is conceptually equivalent to choosing
a random element from sample_space. The probability of the coin landing on heads
is therefore 0.5; the probability of it landing on tails is also equal to 0.5.
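
The equivalence between flipping a coin and choosing a random element can be checked empirically. The sketch below is an assumption-laden illustration rather than part of the original analysis: it uses the standard library's `random.choice`, a fixed seed, and an arbitrary count of 10,000 simulated flips.

```python
import random

random.seed(0)  # Fixed seed so that repeated runs give the same result

# random.choice needs a sequence, so the sample space is stored as a list here.
coin_sample_space = ["heads", "tails"]

num_flips = 10_000  # Arbitrary number of simulated flips
flips = [random.choice(coin_sample_space) for _ in range(num_flips)]

# The observed frequency should be close to, but rarely exactly, 0.5.
frequency_heads = flips.count("heads") / num_flips
print(f"Observed frequency of heads: {frequency_heads}")
```

The observed frequency fluctuates around 0.5 from run to run (when the seed is changed), which is what the probability of 0.5 leads us to expect.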

**An event is the subset of those elements within `sample_space` that satisfy some event condition.** An event condition
is a simple Boolean function whose input is a single `sample_space` element.

Let’s define two event conditions: one where the coin lands on either heads or tails, and another where the coin lands on neither heads nor tails.

In [3]:
def is_heads_or_tails(outcome):
  return outcome in sample_space

In [4]:
def is_neither(outcome):
  return not is_heads_or_tails(outcome)

Also, for the sake of completeness, let’s define event conditions for the two basic
events in which the coin satisfies exactly one of our two potential outcomes.

In [5]:
def is_heads(outcome):
  return outcome == "heads"

def is_tails(outcome):
  return outcome == "tails"

We can pass event conditions into a generalized `get_matching_event` function.
Its inputs are an event condition and a generic sample space.

In [6]:
def get_matching_event(event_condition, sample_space):
  return {outcome for outcome in sample_space if event_condition(outcome)}

Let’s execute `get_matching_event` on our four event conditions. Then we’ll output the four extracted events.

In [7]:
event_conditions = [is_heads_or_tails, is_heads, is_tails, is_neither]

for event_condition in event_conditions:
  print(f"Event Condition: {event_condition.__name__}")
  event = get_matching_event(event_condition, sample_space)
  print(f"Event: {event}\n")

Event Condition: is_heads_or_tails
Event: {'heads', 'tails'}

Event Condition: is_heads
Event: {'heads'}

Event Condition: is_tails
Event: {'tails'}

Event Condition: is_neither
Event: set()



The probability of a single-element outcome for a fair coin is `1 / len(sample_space)`. This property can be generalized to
include multi-element events. The probability of an event is equal to `len(event) / len(sample_space)`, but only if all outcomes are known to occur with equal likelihood.

In other words, **the probability of a multi-element event for a fair coin is equal to the event size divided by the sample space size.**

We now use event size to compute the four event probabilities.

In [8]:
def compute_probability(event_condition, generic_sample_space):
  # The compute_probability function extracts the event associated with an inputted event condition to compute its probability
  event = get_matching_event(event_condition, generic_sample_space)
  return len(event) / len(generic_sample_space)  # Probability is equal to event size divided by sample space size

In [9]:
for event_condition in event_conditions:
  prob = compute_probability(event_condition, sample_space)
  name = event_condition.__name__
  print(f"Probability of event arising from '{name}' is {prob}")

Probability of event arising from 'is_heads_or_tails' is 1.0
Probability of event arising from 'is_heads' is 0.5
Probability of event arising from 'is_tails' is 0.5
Probability of event arising from 'is_neither' is 0.0


The output is a diverse range of event probabilities, the smallest of
which is 0.0 and the largest of which is 1.0.

These values represent the lower and
upper bounds of probability; no probability can ever fall below 0.0 or rise above 1.0.
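
These bounds follow from the definition: an event is a subset of the sample space, so its size can never be negative or exceed the size of the sample space. The following sketch checks this for the fair coin, and also checks that an event and its complement have probabilities that sum to 1.0. The helper `probability` and the variable names are hypothetical stand-ins defined only for this example.

```python
coin_sample_space = {"heads", "tails"}

def probability(event_condition, space):
  # Event size divided by sample space size (assumes equally likely outcomes)
  event = {outcome for outcome in space if event_condition(outcome)}
  return len(event) / len(space)

prob_heads = probability(lambda outcome: outcome == "heads", coin_sample_space)
prob_not_heads = probability(lambda outcome: outcome != "heads", coin_sample_space)

# Every probability lies between the lower bound 0.0 and the upper bound 1.0.
assert 0.0 <= prob_heads <= 1.0
assert 0.0 <= prob_not_heads <= 1.0

# An event and its complement together cover the whole sample space.
assert prob_heads + prob_not_heads == 1.0
```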

###Analyzing a biased coin

What would happen if that coin were biased?

Suppose, for instance, that a coin is four times more likely to land on heads
relative to tails. 

How do we compute the likelihoods of outcomes that are not
weighted in an equal manner?

Well, **we can construct a weighted sample space represented
by a Python dictionary. Each outcome is treated as a key whose value maps to
the associated weight.**

In [10]:
weighted_sample_space = {"heads": 4, "tails": 1}

Our new sample space is stored in a dictionary. 

This allows us to redefine the size of
the sample space as the sum of all dictionary weights.

In [11]:
sample_space_size = sum(weighted_sample_space.values())
assert sample_space_size == 5
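
One way to read the weights is to divide each one by the total weight, which gives the probability of each individual outcome. The sketch below assumes the same 4-to-1 weighting; the names `weights` and `outcome_probabilities` are introduced only for illustration.

```python
weights = {"heads": 4, "tails": 1}

# The size of a weighted sample space is the sum of its weights.
total_weight = sum(weights.values())

# Dividing each weight by the total yields the probability of that outcome.
outcome_probabilities = {
  outcome: weight / total_weight for outcome, weight in weights.items()
}
print(outcome_probabilities)
```

With these weights, heads receives 4 of the 5 units of weight and tails receives the remaining 1.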

We can redefine event size in a similar manner. Each event is a set of outcomes, and
those outcomes map to weights. Summing over the weights yields the event size. 

Thus,
the size of the event satisfying the `is_heads_or_tails` event condition is also 5.

In [12]:
# get_matching_event also works on a dictionary because Python iterates over dictionary keys, not key-value pairs
event = get_matching_event(is_heads_or_tails, weighted_sample_space)
event_size = sum(weighted_sample_space[outcome] for outcome in event)
assert event_size == 5

In [13]:
event

{'heads', 'tails'}

Our generalized definitions of sample space size and event size permit us to create a
`compute_event_probability` function. 

The function takes as input a `generic_sample_space` variable that can be either a weighted dictionary or an unweighted set.

In [14]:
def compute_event_probability(event_condition, generic_sample_space):
  event = get_matching_event(event_condition, generic_sample_space)

  # Checks whether generic_sample_space is a set
  if isinstance(generic_sample_space, set):
    return len(event) / len(generic_sample_space)

  event_size = sum(generic_sample_space[outcome] for outcome in event)
  return event_size / sum(generic_sample_space.values())

We can now output all the event probabilities for the biased coin without needing to
redefine our four event condition functions.

In [15]:
for event_condition in event_conditions:
  prob = compute_event_probability(event_condition, weighted_sample_space)
  name = event_condition.__name__
  print(f"Probability of event arising from '{name}' is {prob}")

Probability of event arising from 'is_heads_or_tails' is 1.0
Probability of event arising from 'is_heads' is 0.8
Probability of event arising from 'is_tails' is 0.2
Probability of event arising from 'is_neither' is 0.0


Let's define a coin that is five times more likely to land on heads than on tails.

In [16]:
weighted_sample_space = {"heads": 5, "tails": 1}

In [17]:
for event_condition in event_conditions:
  prob = compute_event_probability(event_condition, weighted_sample_space)
  name = event_condition.__name__
  print(f"Probability of event arising from '{name}' is {prob}")

Probability of event arising from 'is_heads_or_tails' is 1.0
Probability of event arising from 'is_heads' is 0.8333333333333334
Probability of event arising from 'is_tails' is 0.16666666666666666
Probability of event arising from 'is_neither' is 0.0


Let's define a coin that is three times more likely to land on heads than on tails.

In [18]:
weighted_sample_space = {"heads": 3, "tails": 1}

for event_condition in event_conditions:
  prob = compute_event_probability(event_condition, weighted_sample_space)
  name = event_condition.__name__
  print(f"Probability of event arising from '{name}' is {prob}")

Probability of event arising from 'is_heads_or_tails' is 1.0
Probability of event arising from 'is_heads' is 0.75
Probability of event arising from 'is_tails' is 0.25
Probability of event arising from 'is_neither' is 0.0
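
The three biased coins above follow one pattern: if heads carries a weight of `bias` and tails a weight of 1, the probability of heads is `bias / (bias + 1)`. The sketch below illustrates this with a hypothetical helper, `heads_probability`, that is not part of the notebook's code.

```python
def heads_probability(bias):
  # Hypothetical helper: builds a weighted sample space in which heads is
  # `bias` times more likely than tails, then divides the heads weight by
  # the total weight.
  weighted_space = {"heads": bias, "tails": 1}
  return weighted_space["heads"] / sum(weighted_space.values())

for bias in [3, 4, 5]:
  print(f"Bias {bias}: probability of heads is {heads_probability(bias)}")
```

The printed values match the `is_heads` probabilities computed above for the three, four, and five times biased coins.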


##Computing nontrivial probabilities

We’ll now solve several example problems using `compute_event_probability`.

###Problem 1: Analyzing a family with four children

Suppose a family has four children. What is the probability that exactly two of the children are boys?

We’ll assume that each child is equally likely to be either a boy or a girl.
Thus we can construct an unweighted sample space where each outcome represents
one possible sequence of four children.

| Child 1 | Child 2 | Child 3 | Child 4 | Two boys |
|---|---|---|---|---|
|B|B|B|B| |
|B|B|B|G| |
|B|B|G|B| |
|B|G|B|B| |
|G|B|B|B| |
|G|G|B|B|←|
|G|B|B|G|←|
|B|B|G|G|←|
|B|G|B|G|←|
|G|B|G|B|←|
|B|G|G|B|←|
|B|G|G|G| |
|G|G|G|B| |
|G|B|G|G| |
|G|G|B|G| |
|G|G|G|G| |

The sample space for four sibling children.
Each row in the sample space contains 1 of 16 possible
outcomes. Every outcome represents a unique sequence
of four children. The sex of each child is indicated by a letter:
`B` for boy and `G` for girl. Outcomes with exactly two boys are marked
by an arrow in the last column. There are six such arrows; thus, the probability
of two boys equals `6 / 16`.




In [19]:
possible_children = ["boy", "girl"]
sample_space = set()

# Each possible sequence of four children is represented by a four-element tuple
for child1 in possible_children:
  for child2 in possible_children:
    for child3 in possible_children:
      for child4 in possible_children:
        outcome = (child1, child2, child3, child4)
        sample_space.add(outcome)
sample_space

{('boy', 'boy', 'boy', 'boy'),
 ('boy', 'boy', 'boy', 'girl'),
 ('boy', 'boy', 'girl', 'boy'),
 ('boy', 'boy', 'girl', 'girl'),
 ('boy', 'girl', 'boy', 'boy'),
 ('boy', 'girl', 'boy', 'girl'),
 ('boy', 'girl', 'girl', 'boy'),
 ('boy', 'girl', 'girl', 'girl'),
 ('girl', 'boy', 'boy', 'boy'),
 ('girl', 'boy', 'boy', 'girl'),
 ('girl', 'boy', 'girl', 'boy'),
 ('girl', 'boy', 'girl', 'girl'),
 ('girl', 'girl', 'boy', 'boy'),
 ('girl', 'girl', 'boy', 'girl'),
 ('girl', 'girl', 'girl', 'boy'),
 ('girl', 'girl', 'girl', 'girl')}

We can more easily generate our sample space using Python’s
built-in `itertools.product` function, which returns the Cartesian product of its inputs: every combination that takes one element from each input list.

In [20]:
from itertools import product

In [21]:
all_combinations = product(*(4 * [possible_children]))  # The * operator unpacks multiple arguments stored within a list.
assert set(all_combinations) == sample_space            # Note that after running this line, all_combinations will be empty.

In general, running `product(possible_children, repeat=n)`
returns an iterable over all possible combinations of n children.

In [22]:
sample_space_efficient = set(product(possible_children, repeat=4))
assert sample_space == sample_space_efficient
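
To make the behavior of `repeat=n` concrete, the sketch below lists the outcomes for two children and checks that the number of outcomes doubles with each additional child. The variable names `pairs` and `sizes` are introduced here for illustration only.

```python
from itertools import product

possible_children = ["boy", "girl"]

# With repeat=2, every outcome is a two-element tuple.
pairs = sorted(product(possible_children, repeat=2))
print(pairs)

# Each additional child doubles the number of outcomes: 2, 4, 8, 16.
sizes = {n: len(set(product(possible_children, repeat=n))) for n in range(1, 5)}
print(sizes)
```

For four children this gives 16 outcomes, which agrees with the size of `sample_space` built by the nested loops.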

Let’s calculate the fraction of `sample_space` that is composed of families with two boys. 

We define a `has_two_boys` event condition and then pass that condition into
`compute_event_probability`.

In [23]:
def has_two_boys(outcome):
  return len([child for child in outcome if child == "boy"]) == 2

In [24]:
prob = compute_event_probability(has_two_boys, sample_space)
print(f"Probability of 2 boys is {prob}")

Probability of 2 boys is 0.375


In [34]:
prob = compute_event_probability(lambda outcome: len([child for child in outcome if child == "boy"]) == 2, sample_space)
print(f"Probability of 2 boys is {prob}")

Probability of 2 boys is 0.375


The probability of exactly two boys being born in a family of four children is `0.375`. By implication, we expect `37.5%` of families with four children to contain an equal number
of boys and girls. 

Of course, the actual observed percentage of families with two
boys will vary due to random chance.

Now let's answer a few more questions.

What is the probability that exactly three of the children are boys?

In [25]:
def has_thee_boys(outcome):
  return len([child for child in outcome if child == "boy"]) == 3

In [26]:
prob = compute_event_probability(has_thee_boys, sample_space)
print(f"Probability of 3 boys is {prob}")

Probability of 3 boys is 0.25


In [35]:
prob = compute_event_probability(lambda outcome: len([child for child in outcome if child == "boy"]) == 3, sample_space)
print(f"Probability of 3 boys is {prob}")

Probability of 3 boys is 0.25


What is the probability that exactly one of the children is a boy?

In [27]:
def has_one_boys(outcome):
  return len([child for child in outcome if child == "boy"]) == 1

In [28]:
prob = compute_event_probability(has_one_boys, sample_space)
print(f"Probability of 1 boy is {prob}")

Probability of 1 boy is 0.25


In [36]:
prob = compute_event_probability(lambda outcome: len([child for child in outcome if child == "boy"]) == 1, sample_space)
print(f"Probability of 1 boy is {prob}")

Probability of 1 boy is 0.25


So the probabilities of exactly one boy and exactly three boys are the same.

What is the probability that exactly four of the children are boys?



In [29]:
def has_four_boys(outcome):
  return len([child for child in outcome if child == "boy"]) == 4

In [37]:
prob = compute_event_probability(lambda outcome: len([child for child in outcome if child == "boy"]) == 4, sample_space)
print(f"Probability of 4 boys is {prob}")

Probability of 4 boys is 0.0625


Note that the probability of four girls is the same as the probability of four boys.

Let's combine all events together.

In [31]:
event_conditions = [has_one_boys, has_two_boys, has_thee_boys, has_four_boys]

for event_condition in event_conditions:
  prob = compute_event_probability(event_condition, sample_space)
  name = event_condition.__name__
  print(f"Probability of event arising from '{name}' is {prob}")

Probability of event arising from 'has_one_boys' is 0.25
Probability of event arising from 'has_two_boys' is 0.375
Probability of event arising from 'has_thee_boys' is 0.25
Probability of event arising from 'has_four_boys' is 0.0625
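
The four probabilities above add up to 0.9375, not 1.0, because the list of event conditions omits the family with no boys. As a hedged alternative to writing one event condition per count, the sketch below tallies every possible number of boys in a single pass using `collections.Counter`. The names `families`, `boy_counts`, and `distribution` are assumptions made for this example.

```python
from collections import Counter
from itertools import product

# Rebuild the sample space of four children so the example is self-contained.
families = set(product(["boy", "girl"], repeat=4))

# Count how many outcomes contain 0, 1, 2, 3, or 4 boys.
boy_counts = Counter(outcome.count("boy") for outcome in families)

# Convert the counts to probabilities (all outcomes are equally likely).
distribution = {
  num_boys: boy_counts[num_boys] / len(families)
  for num_boys in sorted(boy_counts)
}
print(distribution)
```

Including the zero-boy case brings the total to 1.0, since the five events are mutually exclusive and together cover the whole sample space.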


###Problem 2: Analyzing multiple die rolls

Suppose we’re shown a fair six-sided die whose faces are numbered from 1 to 6. The die is rolled six times. 

What is the probability that these six die rolls add up to 21?

We begin by defining the possible values of any single roll. These are integers that range from 1 to 6.

**Step-1:** Defining all possible rolls of a six-sided die.

In [38]:
possible_rolls = list(range(1, 7))
possible_rolls

[1, 2, 3, 4, 5, 6]

**Step-2:** Creating the sample space for six consecutive rolls using the product function.

In [40]:
sample_space = set(product(possible_rolls, repeat=6))

**Step-3:** Defining a `has_sum_of_21` event condition.

In [41]:
def has_sum_of_21(outcome):
  return sum(outcome) == 21

**Step-4:** Computing the probability of a die-roll sum.

In [42]:
prob = compute_event_probability(has_sum_of_21, sample_space)
print(f"6 rolls sum to 21 with a probability of {prob}")

6 rolls sum to 21 with a probability of 0.09284979423868313


In [43]:
prob = compute_event_probability(lambda outcome: sum(outcome) == 21, sample_space)
print(f"6 rolls sum to 21 with a probability of {prob}")

6 rolls sum to 21 with a probability of 0.09284979423868313


The six die rolls will sum to 21 more than 9% of the time.
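
The sample space for six rolls holds `6 ** 6 = 46656` outcomes, all of which are stored in memory by `set(product(...))`. The sketch below is one possible alternative, not the notebook's method: it iterates over the outcomes lazily and keeps only two counters. This works because every tuple produced by `product` is distinct here, so counting tuples matches counting set elements. The counter names are chosen for illustration.

```python
from itertools import product

possible_rolls = range(1, 7)

num_outcomes = 0  # Size of the sample space
num_matches = 0   # Size of the event where the rolls sum to 21

# Iterate lazily over all sequences of six rolls without storing them.
for outcome in product(possible_rolls, repeat=6):
  num_outcomes += 1
  if sum(outcome) == 21:
    num_matches += 1

prob = num_matches / num_outcomes
print(f"{num_matches} of {num_outcomes} outcomes sum to 21 (probability {prob})")
```

The resulting probability agrees with the value computed by `compute_event_probability` above.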

###Problem 3: Computing die-roll probabilities using weighted sample spaces