<a href="https://colab.research.google.com/github/taskswithcode/probability_for_ml_notebooks/blob/main/ProbForML_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This is the notebook for [the video - A less known fact about computing averages](https://youtu.be/6SrH0OQca7Y)

- [Code walkthrough of this notebook](https://youtu.be/QuFo_jWrbyE) [![Code Walkthrough](https://raw.githubusercontent.com/taskswithcode/image_assets/main/.github/images/codewalkthrough.svg)](https://youtu.be/QuFo_jWrbyE)

The **goal of this notebook** is to illustrate that

  - By computing the **average** of the **observed outcomes** of an experiment in a specific way (weighting each distinct outcome by its relative frequency), we can also **estimate** the **underlying probability distribution of all outcomes**.
  - We can then use this distribution to predict how likely a specific event is.
  - Numerically, the average computed this way equals the average computed the usual way (up to floating-point rounding).

### We start with some observed outcomes from an experiment


In [None]:
# A sample set of observations from an experiment
numbers = [2, 4, 3, 1, 1, 1, 6, 5, 1, 1, 3, 2, 5, 4, 1, 5, 3, 2, 1, 1]
len(numbers)

20

### 1. We first compute the average the way we normally do

In [None]:
def average(numbers):
    return sum(numbers) / len(numbers)
print(f"Average as sum of observations divided by count of observations: {average(numbers)}")

Average as sum of observations divided by count of observations: 2.6


### 2. We then compute the average by estimating the underlying probability distribution

In [None]:
from collections import Counter
counter = Counter(numbers)
total_numbers = len(numbers)
print(total_numbers)

20


In [None]:
# Count the number of occurrences of each outcome
frequencies = dict(counter)
print(f"Event frequencies: {frequencies}")

Event frequencies: {2: 3, 4: 2, 3: 3, 1: 8, 6: 1, 5: 3}


In [None]:
# Compute the sample probability distribution: relative frequency of each outcome
sample_prob = [freq / total_numbers for freq in frequencies.values()]
print(f"Sample probability distribution: {sample_prob}")
print(f"Event probabilities add up to: {sum(sample_prob)}")

Sample probability distribution: [0.15, 0.1, 0.15, 0.4, 0.05, 0.15]
Event probabilities add up to: 1.0
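
The list above drops the outcome labels, so it cannot be queried directly. As a minimal sketch of how the estimated distribution could be used to predict an event, the snippet below keeps the probabilities keyed by outcome and sums them over the outcomes in the event. The names `prob_by_outcome` and `event_probability`, and the example event "the outcome is even", are illustrative assumptions and not part of the original notebook.

```python
from collections import Counter

# Same observations as earlier in the notebook
numbers = [2, 4, 3, 1, 1, 1, 6, 5, 1, 1, 3, 2, 5, 4, 1, 5, 3, 2, 1, 1]

# Estimated probability of each outcome, keyed by the outcome itself
counts = Counter(numbers)
prob_by_outcome = {x: count / len(numbers) for x, count in counts.items()}


def event_probability(event, probs):
    """Sum the estimated probabilities of the outcomes that belong to the event."""
    return sum(p for x, p in probs.items() if x in event)


# Hypothetical event chosen for illustration: the outcome is even
p_even = event_probability({2, 4, 6}, prob_by_outcome)
print(f"Estimated P(outcome is even): {round(p_even, 2)}")
```

With this sample, 6 of the 20 observations are even (three 2s, two 4s, one 6), so the estimate is 0.3. Like the distribution itself, this is only an estimate from 20 observations.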


In [None]:
# Compute the weighted average: each distinct outcome times its estimated probability
# (rounded to hide floating-point noise in the sum)
weighted_average = sum(x * (freq / total_numbers) for x, freq in frequencies.items())
print(f"Average by weighting each outcome by its probability: {round(weighted_average,1)}")

Average by weighting each outcome by its probability: 2.6


Both computations yield the **same value for the average**, but the second approach gives us more insight: along the way, we estimate the **underlying probability distribution** of the outcomes.
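
Why the two computations agree can be written out in one line. The symbols here are introduced only for illustration: $N$ is the number of observations $x_1, \dots, x_N$, $n_v$ is the number of times the value $v$ occurs, and $\hat{p}(v) = n_v / N$ is the estimated probability of $v$. Grouping equal terms in the sum gives:

$$
\frac{1}{N}\sum_{i=1}^{N} x_i \;=\; \frac{1}{N}\sum_{v} v \, n_v \;=\; \sum_{v} v \cdot \frac{n_v}{N} \;=\; \sum_{v} v \, \hat{p}(v)
$$

For the observations in this notebook: $(1 \cdot 8 + 2 \cdot 3 + 3 \cdot 3 + 4 \cdot 2 + 5 \cdot 3 + 6 \cdot 1) / 20 = 52 / 20 = 2.6$.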

A few points to keep in mind:

- Our estimate of the underlying probability distribution improves as we collect more observations.
- Our estimate of the average also tends to get closer to the true average as the number of observations grows.
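
The two points above can be checked with a small simulation. This is a sketch under a stated assumption: the notebook never says where its observations come from, so the code below assumes a fair six-sided die (true average 3.5) as the source. The names `estimate`, `results`, and `TRUE_AVERAGE` are hypothetical helpers introduced here.

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so repeated runs give the same numbers

# Assumption: outcomes come from a fair six-sided die, whose true average is 3.5
TRUE_AVERAGE = 3.5


def estimate(observations):
    """Return (estimated distribution, probability-weighted average)."""
    total = len(observations)
    probs = {x: count / total for x, count in Counter(observations).items()}
    avg = sum(x * p for x, p in probs.items())
    return probs, avg


results = {}
for n in (20, 200, 2000, 20000):
    # Simulate n rolls, then estimate the distribution and average from them
    rolls = [random.randint(1, 6) for _ in range(n)]
    probs, avg = estimate(rolls)
    results[n] = avg
    print(f"n={n:>6}: average={avg:.3f}, error={abs(avg - TRUE_AVERAGE):.3f}")
```

The error tends to shrink as `n` grows, though not necessarily at every step, since each run is still random.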