# Probability and Distributions

In science, the probability of an event is a number that indicates how likely the event is to occur. It is expressed as a number in the range from 0 to 1.

$P(X) = \frac{\text{Outcomes where the value is } X}{\text{Total outcomes}}$

Example: Find the probability that a flip of a fair coin comes up `HEAD`.

Total outcomes from a flip of a coin = 2 (`{H, T}`)

Total favourable outcomes = 1 (`{H}`)

$\therefore$ $P(H) = \frac{\text{Outcomes with HEAD}}{\text{Total outcomes}}$

$\implies P(H) = \frac{1}{2}$
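
The counting rule above can be written as a small helper. This is a minimal sketch that assumes every outcome is equally likely; the function name `probability` and the use of `fractions.Fraction` to keep the result exact are illustrative choices, not part of the original notebook.

```python
from fractions import Fraction

def probability(sample_space, is_favourable):
    """Return P(event) by counting, assuming all outcomes are equally likely."""
    favourable = [outcome for outcome in sample_space if is_favourable(outcome)]
    return Fraction(len(favourable), len(sample_space))

coin = ["H", "T"]
p_head = probability(coin, lambda outcome: outcome == "H")
print("P(H) =", p_head)  # P(H) = 1/2
```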

## Pre-requisites
1. Basic Python
2. Dataframes using [pandas](https://pandas.pydata.org/docs/user_guide/index.html). 

## Out of scope
1. Data preparation
2. Data analysis


### Code example

#### In a roll of a die, find the probability of getting
   1. Value equal to 1
   2. Odd value
   3. Even value
   4. Value less than 5
   5. Multiple of 3
   6. Value greater than 4
   7. Value equal to 6 or an odd value

In [None]:
outcomes = [1,2,3,4,5,6]

total_outcomes = len(outcomes)

# value equal to 1.
fav_outcomes = len([x for x in outcomes if x == 1])
print("P(1):", fav_outcomes/total_outcomes)

# Odd value
fav_outcomes = len([x for x in outcomes if x % 2 == 1])
print("P(OddValues):", fav_outcomes/total_outcomes)

# Even value
fav_outcomes = len([x for x in outcomes if x % 2 == 0])
print("P(EvenValues):", fav_outcomes/total_outcomes)

# Value less than 5
fav_outcomes = len([x for x in outcomes if x < 5])
print("P(<5):", fav_outcomes/total_outcomes)

# Multiple of 3
fav_outcomes = 0 # TODO
print("P(MultipleOf3):", fav_outcomes/total_outcomes)

# Value greater than 4
fav_outcomes = 0 # TODO
print("P(ValueGreaterThan4):", fav_outcomes/total_outcomes)

# Value equal to 6 or odd number
fav_outcomes = 0 # TODO
print("P(6OrOdd):", fav_outcomes/total_outcomes)

#### In the diabetes data, find the probability that a row selected at random from the data
1. Has diabetes.
2. BMI < 23
3. Glucose > 100
4. DiabetesPedigreeFunction < 0.5

In [None]:
import pandas as pd
data_path = '../resources/diabetes.csv'
df = pd.read_csv(data_path)
df.head()

In [None]:
# Has diabetes
fav_outcomes = len(df[df['Outcome'] == 1])
total_outcomes = df.shape[0]

print("P(HasDiabetes): ", fav_outcomes / total_outcomes)

# BMI < 23

fav_outcomes = len(df[df['BMI'] < 23])
total_outcomes = df.shape[0]

print("P(BMI < 23): ", fav_outcomes / total_outcomes)

# Glucose > 100

fav_outcomes = 0 # TODO
total_outcomes = df.shape[0]

print("P(Glucose > 100): ", fav_outcomes / total_outcomes)


# DiabetesPedigreeFunction < 0.5

fav_outcomes = 0 # TODO
total_outcomes = df.shape[0]

print("P(DiabetesPedigreeFunction < 0.5): ", fav_outcomes / total_outcomes)



## Types of Events

<u>**Independent Events**</u>
Two events are said to be independent, if the outcome of one event does not affect the outcome of the second event. 

Example: A coin is tossed twice. The outcome of the second coin toss is not dependent on the outcome of the first.


<u>**Dependent Events**</u>
Two events are said to be dependent if the outcome of one event affects the outcome of the second event.

Example: A coin is tossed, and if the outcome is `HEAD`, the coin is tossed again. Here, whether the second toss happens at all depends on the first: the probability of a second toss is equal to the probability of getting a `HEAD` on the first toss.

<u> **Mutually exclusive events**</u>
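Two events are said to be mutually exclusive if they cannot occur together.

Example: In a single coin toss, the outcomes `HEAD` and `TAIL` are mutually exclusive.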
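
One way to check these definitions numerically is the product rule: two events are independent exactly when $P(A \cap B) = P(A) \cdot P(B)$, and mutually exclusive when $P(A \cap B) = 0$. The sketch below enumerates two tosses of a fair coin; the helper `prob` and the event names are made up for illustration.

```python
from fractions import Fraction
from itertools import product

# Sample space for two tosses of a fair coin: HH, HT, TH, TT
two_tosses = list(product("HT", repeat=2))

def prob(event):
    """P(event) by counting equally likely outcomes."""
    return Fraction(sum(1 for o in two_tosses if event(o)), len(two_tosses))

first_is_head = lambda o: o[0] == "H"
second_is_head = lambda o: o[1] == "H"
both_heads = lambda o: first_is_head(o) and second_is_head(o)

# Independent: P(A and B) equals P(A) * P(B)
independent = prob(both_heads) == prob(first_is_head) * prob(second_is_head)
print("Independent:", independent)  # Independent: True

# Mutually exclusive: the first toss cannot be both HEAD and TAIL
first_is_tail = lambda o: o[0] == "T"
p_both = prob(lambda o: first_is_head(o) and first_is_tail(o))
print("P(first is H and first is T):", p_both)  # 0
```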

## Properties of probabilities

1. The probability of a sure event or certain event is 1.
2. The probability of an impossible event is 0.
3. $0 \le P(A) \le 1$
4. Probability of a complement: $P(A') = 1 - P(A)$
5. Probability of mutually exclusive events occurring together: $P(A \cap B) = 0$
6. Probability of independent events: $P(A \cap B) = P(A) \cdot P(B)$
7. Probability of dependent events: $P(A \cap B) = P(A|B) \cdot P(B)$ or $P(A \cap B) = P(B|A) \cdot P(A)$, where $P(A|B)$ is the conditional probability of $A$ given that $B$ has occurred.
8. Probability of either of two mutually exclusive events: $P(A \cup B) = P(A) + P(B)$
9. Probability of either of two events that are not mutually exclusive: $P(A \cup B) = P(A) + P(B) - P(A \cap B)$
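
Several of these properties can be verified by treating events as Python sets over the outcomes of a single die roll. This is a sketch under the assumption of a fair six-sided die; the helper `prob` and the chosen events are illustrative.

```python
from fractions import Fraction

outcomes = {1, 2, 3, 4, 5, 6}

def prob(event):
    """P(event) for a set of equally likely die outcomes."""
    return Fraction(len(event), len(outcomes))

odd = {x for x in outcomes if x % 2 == 1}       # {1, 3, 5}
less_than_3 = {x for x in outcomes if x < 3}    # {1, 2}

# Complement: P(A') = 1 - P(A)
assert prob(outcomes - odd) == 1 - prob(odd)

# General addition rule: P(A or B) = P(A) + P(B) - P(A and B)
p_union = prob(odd) + prob(less_than_3) - prob(odd & less_than_3)
assert p_union == prob(odd | less_than_3)
print("P(odd or < 3) =", p_union)  # 2/3

# Conditional probability: P(A | B) = P(A and B) / P(B)
p_odd_given_lt3 = prob(odd & less_than_3) / prob(less_than_3)
print("P(odd | < 3) =", p_odd_given_lt3)  # 1/2
```

Set intersection (`&`) and union (`|`) map directly onto $A \cap B$ and $A \cup B$, which makes the overlap term in the addition rule explicit.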

### Code example

1. In a roll of a die, find the probability of getting 
   1. Value equal to 1 or equal to 3.
   2. Odd value and less than 3
   3. Even value and greater than 3
   4. Value less than 3 or equal to 6 
   5. Multiple of 3 and multiple of 2
2. A coin is tossed. If the outcome is heads, a die is rolled. If the outcome is tails, the coin is tossed again. Find the probability of getting a tail or the number 3 on the die.

In [None]:
outcomes = [1,2,3,4,5,6]

total_outcomes = len(outcomes)

# Value equal to 1 or equal to 3.
fav_outcome_1 = len([x for x in outcomes if x == 1])
fav_outcome_2 = len([x for x in outcomes if x == 3])
prob_of_event = (fav_outcome_1/total_outcomes) + (fav_outcome_2/total_outcomes)
print("P(1 or 3):", prob_of_event)

# Odd value and less than 3.
# Both conditions apply to the same roll, so count the outcomes that satisfy both.
# (Multiplying P(odd) * P(< 3) is only valid when the two events are independent.)
fav_outcomes = len([x for x in outcomes if x % 2 == 1 and x < 3])
prob_of_event = fav_outcomes/total_outcomes
print("P(Odd and < 3):", prob_of_event)

# Even value and greater than 3
# These events are not independent on a single roll: P(even) * P(> 3) = 1/4,
# but only 4 and 6 satisfy both conditions, giving 2/6.
fav_outcomes = len([x for x in outcomes if x % 2 == 0 and x > 3])
prob_of_event = fav_outcomes/total_outcomes
print("P(Even and > 3):", prob_of_event)

# Value less than 3 or equal to 6 
fav_outcome_1 = 0 # TODO
fav_outcome_2 = 0 # TODO
prob_of_event = 0 # TODO
print("P(< 3 or =6):", prob_of_event)

# Multiple of 3 and multiple of 2 
fav_outcome_1 = 0 # TODO
fav_outcome_2 = 0 # TODO
prob_of_event = 0 # TODO
print("P(multipleOf3 and multipleOf 2):", prob_of_event)


# A coin is tossed. If the outcome is heads, a die is rolled. 
# If the outcome is tails, the coin is tossed again. Find the probability of getting a tail or the number 3 on the die.
# TODO: Conditional Probability


#### In the diabetes data, find the probability that a row selected at random from the data
1. Has diabetes or BMI < 23.
2. Has diabetes or Glucose > 100.
3. Has diabetes, given that Glucose > 100.
4. Has diabetes, given that DiabetesPedigreeFunction > 0.5

In [None]:
total_outcomes = df.shape[0]

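# Has diabetes or BMI < 23.
# The two events can occur together, so subtract the overlap:
# P(A or B) = P(A) + P(B) - P(A and B)
fav_outcome_1 = len(df[df['Outcome'] == 1])
fav_outcome_2 = len(df[df['BMI'] < 23])
fav_outcome_both = len(df[(df['Outcome'] == 1) & (df['BMI'] < 23)])
prob_of_event = (fav_outcome_1 + fav_outcome_2 - fav_outcome_both) / total_outcomes

print("P(HasDiabetes or BMI < 23): ", prob_of_event)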

# Has diabetes or Glucose > 100.
fav_outcome_1 = 0 # TODO
fav_outcome_2 = 0 # TODO
prob_of_event = 0 # TODO

print("P(Has diabetes or Glucose > 100): ", prob_of_event)

# Has diabetes if Glucose > 100.
# P(A given B) = P(A and B) / P(B)
# A: has diabetes
# B: Glucose > 100
outcomes_a_and_b = len(df[(df['Outcome'] == 1) & (df['Glucose'] > 100)])
outcomes_b = len(df[df['Glucose'] > 100])

p_a_and_b = outcomes_a_and_b / total_outcomes
p_b = outcomes_b / total_outcomes
prob_of_event = p_a_and_b / p_b

print("P(Has diabetes if Glucose > 100): ", prob_of_event)

# Has diabetes if DiabetesPedigreeFunction > 0.5
outcomes_a_and_b = 0 # TODO
outcomes_b = 0 # TODO

p_a_and_b = outcomes_a_and_b / total_outcomes
p_b = outcomes_b / total_outcomes
prob_of_event = 0 # TODO: P(A and B) / P(B)

print("P(Has diabetes if DiabetesPedigreeFunction > 0.5): ", prob_of_event)



## Random variable

A [random variable](https://encyclopediaofmath.org/index.php?title=Random_variable) is a variable whose value depends on the outcome of a random event.

In the coin toss example, the toss is the random event, and its outcome is the associated random variable, which can take the values `HEAD` or `TAIL`.

**Alternate definition:** A random variable is defined as a measurable function from a probability measure space to a measurable space.

### Types of random variables
- Discrete random variable: The random variable takes values from a countable set of distinct values (for example, the face shown by a die).
- Continuous random variable: The random variable can take any value within a range of real numbers (for example, a person's BMI).
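
The difference between the two types can be seen by drawing a few values of each. This is a minimal sketch using Python's standard `random` module; the seed value and the choice of a die roll and a number in $[0, 1)$ as examples are assumptions made for illustration.

```python
import random

random.seed(0)  # fixed seed so repeated runs give the same draws

# Discrete random variable: a die roll takes one of six separate values
die_rolls = [random.randint(1, 6) for _ in range(5)]

# Continuous random variable: any real number in the interval [0, 1)
uniform_draws = [random.random() for _ in range(5)]

print("Discrete draws:  ", die_rolls)
print("Continuous draws:", uniform_draws)
```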

### Distribution

The [distribution](https://statisticsbyjim.com/basics/probability-distributions/) of a random variable is defined as the probability measure on the set of all possible values the random variable can take.
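
As an illustration, the distribution of a discrete random variable can be tabulated by listing each possible value with its probability. The sketch below does this for the sum of two fair dice, an example chosen here for illustration rather than taken from the notebook.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two dice
rolls = list(product(range(1, 7), repeat=2))

# Random variable X = sum of the two dice
counts = Counter(a + b for a, b in rolls)

# Distribution of X: each possible value mapped to its probability
distribution = {value: Fraction(count, len(rolls))
                for value, count in sorted(counts.items())}

for value, p in distribution.items():
    print(value, p)

# The probabilities over all possible values add up to 1
print("Total:", sum(distribution.values()))  # Total: 1
```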

#### Further reading
1. Probability density function (PDF)
2. Probability mass function (PMF)

#### Types of Distributions
1. Normal Distribution.
2. Uniform Distribution.
3. Binomial Distribution.
4. Bernoulli Distribution.
5. Poisson Distribution.
6. Exponential Distribution.
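
A few values can be drawn from most of these distributions with Python's standard `random` module. This is a rough sketch: the parameter values are arbitrary, the `draw` helper is hypothetical, and the Poisson distribution is left out because the standard library has no sampler for it.

```python
import random

random.seed(42)  # fixed seed so repeated runs give the same draws

def draw(sampler, count=3):
    """Draw `count` values from a zero-argument sampler function."""
    return [sampler() for _ in range(count)]

samples = {
    "Normal(mean=0, sd=1)": draw(lambda: random.gauss(0, 1)),
    "Uniform(0, 1)": draw(lambda: random.uniform(0, 1)),
    # Bernoulli: a single trial that succeeds (1) with probability p = 0.3
    "Bernoulli(p=0.3)": draw(lambda: int(random.random() < 0.3)),
    # Binomial: number of successes in 10 independent Bernoulli(0.5) trials
    "Binomial(n=10, p=0.5)": draw(lambda: sum(random.random() < 0.5 for _ in range(10))),
    # Exponential with rate 2 (mean 1/2)
    "Exponential(rate=2)": draw(lambda: random.expovariate(2)),
}

for name, values in samples.items():
    print(name, values)
```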


### Mean and variance of a distribution

Mean, average, or the expected value (denoted as E(X) or $\mu$ or $\bar{x}$) of a distribution is defined as 

$E(X) = \sum_{x}xP(x)$

Example: If a whole number is chosen at random from 1 to 5, with each number equally likely, what is the expected (average) number?

Probability of choosing any one of the numbers from 1 to 5 = $\frac{1}{5}$

$E(X) = 1 \cdot \frac{1}{5} + 2 \cdot \frac{1}{5} + 3 \cdot \frac{1}{5} + 4 \cdot \frac{1}{5} + 5 \cdot \frac{1}{5}$ where X is the random variable chosen from 1 to 5

$\implies E(X) = 3$
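
The same sum can be evaluated in a couple of lines. This sketch uses `fractions.Fraction` to avoid floating-point rounding; the variable names are illustrative.

```python
from fractions import Fraction

values = [1, 2, 3, 4, 5]
p = Fraction(1, len(values))  # each value is equally likely

# E(X) = sum over x of x * P(x)
expected_value = sum(x * p for x in values)
print("E(X) =", expected_value)  # E(X) = 3
```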

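Variance is a statistical measure of how far the numbers in a data set are spread around their mean, and is denoted as $\sigma^2$. The standard deviation, denoted as $\sigma$, is the square root of the variance. For a sample of $n$ values with mean $\bar{x}$, the variance is estimated as: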

$\sigma^2 = \frac{\sum_{x} (x - \bar{x})^2}{n - 1}$
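
The formula above can be applied step by step and cross-checked against Python's `statistics` module, whose `variance` and `stdev` functions use the same $n - 1$ divisor. The data values below are made up for illustration.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up numbers for illustration
n = len(data)
mean = sum(data) / n

# Sample variance: squared deviations from the mean, divided by n - 1
variance = sum((x - mean) ** 2 for x in data) / (n - 1)
std_dev = variance ** 0.5

print("variance:", variance)
print("standard deviation:", std_dev)

# The standard library uses the same n - 1 formula
assert abs(variance - statistics.variance(data)) < 1e-12
assert abs(std_dev - statistics.stdev(data)) < 1e-12
```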


### Code Exercise

In the diabetes data, find the expected value and variance of BMI.


In [None]:
n = len(df)
average_bmi = sum(df['BMI'])/n
print(average_bmi)

# Sample variance divides by n - 1, matching the formula for sigma^2
var_bmi = sum((df['BMI'] - average_bmi) ** 2)/(n - 1)
print("sigma_square:", var_bmi)
print("standard_deviation:", var_bmi**0.5)

df['BMI'].describe()


## Normal distribution

**The Coin toss experiment**

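Aim: Plot the distribution of the number of heads in `N` coin tosses. The number of heads follows a binomial distribution, whose shape looks increasingly like a normal distribution as `N` grows.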


In [None]:
# Install plotting library
!conda install -y matplotlib

In [None]:
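import matplotlib.pyplot as plt

def get_bin_array(n, num_tosses):
    # Binary digits of n, zero-padded to one digit per toss
    return [int(x) for x in bin(n)[2:].zfill(num_tosses)]

def get_coin_toss_outcomes(num_tosses):
    # Each integer from 0 to 2**num_tosses - 1 encodes one distinct sequence of tosses
    return [get_bin_array(x, num_tosses) for x in range(2**num_tosses)]


num_tosses = 10
coin_toss_outcomes = get_coin_toss_outcomes(num_tosses)

# Denoting a head with 1, a tail with 0
heads_per_outcome = [sum(outcome) for outcome in coin_toss_outcomes]

# One bin per possible number of heads (0 to num_tosses)
plt.hist(heads_per_outcome, bins=range(num_tosses + 2), align='left')
plt.xlabel("Number of heads")
plt.ylabel("Number of outcomes")
plt.show()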
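
The bar heights in this histogram can also be checked without plotting. The sketch below assumes 10 tosses, enumerates every sequence with `itertools.product`, and confirms that the number of sequences with exactly `k` heads equals the binomial coefficient "10 choose k"; the text bars are only a rough stand-in for the plot.

```python
import math
from collections import Counter
from itertools import product

num_tosses = 10

# Enumerate every sequence of 10 tosses (1 = head, 0 = tail) and count heads
heads_counts = Counter(sum(seq) for seq in product([0, 1], repeat=num_tosses))

for k in range(num_tosses + 1):
    # The number of sequences with exactly k heads is "10 choose k"
    assert heads_counts[k] == math.comb(num_tosses, k)
    print(k, heads_counts[k], "#" * (heads_counts[k] // 10))
```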

In [None]:
plt.hist(df['Glucose'])
plt.show()