# Statistics Handbook

### Random Variables
A random variable maps the outcomes of a random process, such as flipping a coin or rolling a die, to numbers.
For instance, the outcome of flipping a coin can be represented by a random variable `X`, which takes the value `1` if the outcome is heads and `0` if the outcome is tails.

In this example, the possible outcomes are `{0,1}`. The set of all possible outcomes is called the **sample space** of the experiment. Each repetition of the random process is called a trial, and an **event** is a set of outcomes from the sample space, such as the coin landing heads. The chance or likelihood of an event occurring is called the **probability** of that event, `P(x)`.

For a fair coin, `Pr(X = 1) = 0.5` (heads) and `Pr(X = 0) = 0.5` (tails).
<p style='text-align: right;'><a href="https://www.freecodecamp.org/news/statistics-for-data-scientce-machine-learning-and-ai-handbook/">ref</a></p>


In [11]:
import random
sample_space = [0, 1]  # 0 denotes tails and 1 denotes heads

print('flipping coin...')
X = []
X.append(random.choice(sample_space))

# for a fair coin, every outcome in the sample space is equally likely
P_x = 1 / len(sample_space)
print(f"The outcome of event X is {X[0]} and occurs with probability {P_x}")


flipping coin...
The outcome of event X is 1 and occurs with probability 0.5
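
The probability of `0.5` above is the theoretical value for a fair coin. As a rough check, it can also be estimated as the fraction of heads over many simulated flips. The sketch below is illustrative only: the helper name `estimate_heads_probability` and the fixed seed are assumptions made here for repeatability, not part of the original notebook.

```python
import random

def estimate_heads_probability(n_flips, seed=0):
    """Estimate Pr(X = 1) as the fraction of simulated flips that land heads."""
    rng = random.Random(seed)  # seeded generator so the run is repeatable
    sample_space = [0, 1]      # 0 denotes tails and 1 denotes heads
    flips = [rng.choice(sample_space) for _ in range(n_flips)]
    return sum(flips) / n_flips

for n in (10, 1_000, 100_000):
    print(f"{n} flips: estimated Pr(X = 1) = {estimate_heads_probability(n):.4f}")
```

With only a few flips the estimate can be far from `0.5`; it tends to settle near `0.5` as the number of flips grows.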


### Mean
The **population** is the set of all observations (individuals, objects, events, or procedures) and is usually very large and diverse. On the other hand, a **sample** is a subset of observations from the population that ideally is a <u>true representation</u> of the population.

The **mean**, also known as the average, is a central value of a finite set of numbers.
![title](img/Mean_1.png)
where **N** is the number of observations (data points) in the sample, also called the **data frequency**. The sample mean, denoted by `μ` and very often used to approximate the **population mean**, can then be expressed as follows:
![title](img/Mean_2.png)
The mean is also referred to as the **expectation**, which is often denoted by `E()` or by the random variable with a bar on top. For example, the expectations of random variables X and Y, that is `E(X)` and `E(Y)`, respectively, can be expressed as follows:
![title](img/Mean_3.png)
<p style='text-align: right;'><a href="https://www.freecodecamp.org/news/statistics-for-data-scientce-machine-learning-and-ai-handbook/">ref</a></p>





In [12]:
import numpy as np
import math
x = np.array([1,3,5,6])
mean_x = np.mean(x)
print(f"The mean is {mean_x}")

# in case the data contains NaN values
x_nan = np.array([1,3,5,6, math.nan])
mean_x_nan = np.nanmean(x_nan)
print(f"The mean with NAN is {mean_x_nan}")

The mean is 3.75
The mean with NAN is 3.75
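
To connect `np.mean` with the definition, the mean can also be computed by hand as the sum of the observations divided by `N`. The sketch below additionally computes an expectation for a discrete random variable, assuming the standard definition `E(X) = Σ x · Pr(X = x)` and reusing the fair-coin example; the variable names are chosen here for illustration.

```python
import numpy as np

x = np.array([1, 3, 5, 6])

# Sample mean from the definition: sum of the observations divided by N
N = len(x)
manual_mean = x.sum() / N
print(f"Manual mean: {manual_mean}, np.mean: {np.mean(x)}")

# Expectation of a discrete random variable: E(X) = sum of x * Pr(X = x).
# Fair coin: X = 1 for heads and X = 0 for tails, each with probability 0.5.
values = np.array([0, 1])
probabilities = np.array([0.5, 0.5])
expected_x = np.sum(values * probabilities)
print(f"E(X) for a fair coin: {expected_x}")
```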


### Variance
The **variance** measures how far the data points are spread out from the average value. It is equal to the average of the squared differences between the data values and the mean.

<p style='text-align: right;'><a href="https://www.freecodecamp.org/news/statistics-for-data-scientce-machine-learning-and-ai-handbook/">ref</a></p>

<div>
<img src="img/Variance_1.png" alt="Variance" style="width: 450px;"/>
</div>
<p style='text-align: right;'><a href="https://www.google.com/url?sa=i&url=https%3A%2F%2Fyassineelkhal.medium.com%2Fvariance-and-standard-deviation-f4cc7e78b92&psig=AOvVaw0IWQqb_5OGHs5LrmR5_Rom&ust=1719356170116000&source=images&cd=vfe&opi=89978449&ved=2ahUKEwie9a7_qvWGAxXUk_0HHQnmDzsQjRx6BAgAEBU">ref</a></p>


In [21]:
x = np.array([1,3,5,6])
variance_x = np.var(x)
print(f"The variance is {variance_x}")

x_nan = np.array([1,3,5,6, math.nan])
variance_x_nan = np.nanvar(x_nan, ddof = 1)
print(f"The variance with NAN and ddof set to one is {variance_x_nan}")

The variance is 3.6875
The variance with NAN and ddof set to one is 4.916666666666667


**NB:** _To compute the variance in Python, the `var` function from the __NumPy library__ can be used. By default, this function calculates the __population variance__ because its `ddof` <u>(Delta Degrees of Freedom)</u> parameter is `0`. However, when dealing with a sample and not the entire population, `ddof` is typically set to `1` to get the __sample variance__._
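
To make the role of `ddof` concrete, here is a minimal sketch that computes the variance from its definition: the sum of squared deviations from the mean, divided by `N - ddof`. The helper `variance` is a hypothetical function written for this illustration; in practice `np.var` does the same job.

```python
import numpy as np

def variance(data, ddof=0):
    """Sum of squared deviations from the mean, divided by (N - ddof)."""
    data = np.asarray(data, dtype=float)
    squared_deviations = (data - data.mean()) ** 2
    return squared_deviations.sum() / (len(data) - ddof)

x = np.array([1, 3, 5, 6])
print(f"Population variance (ddof=0): {variance(x)}")
print(f"Sample variance (ddof=1): {variance(x, ddof=1)}")
```

Dividing by `N - 1` instead of `N` gives a slightly larger value, which is why the sample variance above exceeds the population variance for the same data.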


### Standard Deviation

**Standard deviation** is simply the square root of the variance and <u>measures the extent to which data varies from its mean</u>. The standard deviation, denoted by `sigma`, can be expressed as follows:
![title](img/SD_1.png)
Standard deviation is often preferred over the variance because it has the same units as the data points, which makes it easier to interpret.
<p style='text-align: right;'><a href="https://www.freecodecamp.org/news/statistics-for-data-scientce-machine-learning-and-ai-handbook/">ref</a></p>



In [23]:
x = np.array([1,3,5,6])
sd_x = np.std(x)
print(f"The SD is {sd_x}")

x_nan = np.array([1,3,5,6, math.nan])
sd_x_nan = np.nanstd(x_nan, ddof = 1)
print(f"The SD with NAN and ddof set to one is {sd_x_nan}")

The SD is 1.920286436967152
The SD with NAN and ddof set to one is 2.217355782608345
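
The square-root relationship between the standard deviation and the variance can be checked directly. This is a small sketch using the same data as above; the variable names are chosen here for illustration.

```python
import numpy as np

x = np.array([1, 3, 5, 6])

# Standard deviation as the square root of the variance (population version)
sd_from_variance = np.sqrt(np.var(x))
print(f"sqrt(np.var(x)) = {sd_from_variance}")
print(f"np.std(x)       = {np.std(x)}")

# The same relationship holds for the sample versions (ddof=1)
sample_sd_from_variance = np.sqrt(np.var(x, ddof=1))
print(f"sqrt(np.var(x, ddof=1)) = {sample_sd_from_variance}")
print(f"np.std(x, ddof=1)       = {np.std(x, ddof=1)}")
```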


__NB:__ _For data that is approximately normally distributed, about 95% of scores fall within two standard deviations of the mean. This is part of the empirical rule, or the `68-95-99.7 rule`, which states that:
`68%` of scores are within `one` standard deviation of the mean
`95%` of scores are within `two` standard deviations of the mean
`99.7%` of scores are within `three` standard deviations of the mean_

<div>
<img src="img/Variance_1.png" alt="Drawing" style="width: 450px;"/>
</div>
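
The empirical rule can be checked approximately by simulation. The sketch below draws values from a normal distribution and measures the fraction that falls within one, two, and three standard deviations of the mean. The sample size, the seed, and the distribution parameters are arbitrary choices made here for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # seed chosen arbitrarily for repeatability
samples = rng.normal(loc=0.0, scale=1.0, size=100_000)  # simulated normal data

mean = samples.mean()
sd = samples.std()

coverage = {}
for k in (1, 2, 3):
    within = np.abs(samples - mean) <= k * sd
    coverage[k] = within.mean()  # fraction of values within k SDs of the mean
    print(f"Within {k} SD: {coverage[k]:.1%}")
```

The printed fractions should land close to 68%, 95%, and 99.7%. For data that is far from normal (heavily skewed, for example), these percentages are not expected to hold.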