# Fundamentals of Statistics

#### Basic yet powerful terms: Mean, Mode, Median, Standard Deviation and Correlation Coefficient

```python
import numpy as np

learning_hour = [1, 2, 6, 4, 10]
scores = [3, 4, 6, 5, 6]

print("Mean learning time: ", np.mean(learning_hour))
print("Mean Score: ", np.mean(scores))
print("Median learning time: ", np.median(learning_hour))
print("Standard Deviation: ", np.std(learning_hour))
print("Correlation b/w learning hours and scores: ", np.corrcoef(learning_hour, scores))
```

```
Mean learning time:  4.6
Mean Score:  4.8
Median learning time:  4.0
Standard Deviation:  3.2
Correlation b/w learning hours and scores:  [[1.         0.88964891]
 [0.88964891 1.        ]]
```


- **Mean** is the average of a data set: the sum of the elements divided by the total number of elements.


- **Median** is the middle element of a **sorted copy** of the data. If the length is odd, it is the single middle value; if the length is even, it is the average of the two middle values.


- **Standard Deviation** is a measure of **how much the data is spread out** around the mean.
    - Standard Deviation is the square root of the **Variance**.
    - Variance is defined as the average of the squared differences from the mean.
    - To calculate variance manually:
        1. Compute the mean.
        2. Then for each number, subtract the mean and square the result, i.e. the squared difference.
        3. Then compute the average of those squared differences.
    - So to calculate the Standard Deviation, take the square root of the variance.


- **Correlation Coefficient**: When two data sets are strongly linked together, we say they have high correlation.
    - Correlation is Positive when the values for the two sets of data increase together.
    - It is Negative when one value increases while the other decreases.
    - Correlation coefficient is a way to put a value to the relationship.
    - Correlation coefficients have a value between -1 and 1:
        - 1 is a perfect positive correlation.
        - 0 is no correlation, meaning the values don't seem linked at all.
        - -1 is a perfect negative correlation.
    - **Correlation is NOT Causation**: a correlation between two things doesn't prove that one causes the other:
        - One event **might** cause the other.
        - The other event **might** cause the first to happen.
        - They **may** be linked by a different reason.
        - Or the result **could** have been random.
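
The manual steps for variance and standard deviation can be sketched in plain Python and checked against the NumPy results from the cell above. The helper names (`mean`, `variance`, `std_dev`, `correlation`) are illustrative, and the correlation helper assumes the Pearson coefficient (covariance divided by the product of the two standard deviations), which is what `np.corrcoef` reports.

```python
import math
import numpy as np

learning_hour = [1, 2, 6, 4, 10]
scores = [3, 4, 6, 5, 6]

def mean(values):
    return sum(values) / len(values)

def variance(values):
    # Population variance: the average of the squared differences from the mean.
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def std_dev(values):
    # Standard deviation is the square root of the variance.
    return math.sqrt(variance(values))

def correlation(xs, ys):
    # Pearson correlation: covariance divided by the product of the standard deviations.
    mx, my = mean(xs), mean(ys)
    covariance = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return covariance / (std_dev(xs) * std_dev(ys))

manual_std = std_dev(learning_hour)
manual_corr = correlation(learning_hour, scores)

# np.std divides by n by default (ddof=0), matching the definition above.
print(manual_std, np.std(learning_hour))
print(manual_corr, np.corrcoef(learning_hour, scores)[0, 1])
```

Note that `np.corrcoef` returns a 2x2 matrix; the off-diagonal entry `[0, 1]` is the correlation between the two inputs.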

## Box Plot

- A box plot is basically a graph that presents information from a **five-number summary**.


- If we look at a box plot diagram (any fully featured example):
    - The ends of the box are the first (lower) and third (upper) quartiles; the box spans the so-called interquartile range.
        - The first quartile represents the 25th percentile.
        - Meaning that 25% of the data points fall below the first quartile.
        - The third quartile is the 75th percentile, meaning 75% of the data points fall below the third quartile.
    - The **median, marked by the horizontal line** inside the box, is the middle value of the dataset, the 50th percentile.
        - The median is used instead of the mean because it is more robust to outlier values.
    - The **whiskers** are the two lines outside the box that extend to the highest and lowest (max/min) observations in the data.
        - Many plotting tools instead cap the whiskers at 1.5 times the interquartile range and draw points beyond them as individual outliers.


- **Five Number Summary** is made up of these five values: the **maximum value**, the **minimum value**, the **lower quartile**, the **upper quartile**, and the **median**. Ordered from lowest to highest, they are:
    - Minimum Value
    - Lower Quartile (Q1/25th percentile)
    - Median (Q2/50th percentile)
    - Upper Quartile (Q3/75th percentile)
    - Maximum Value
- These five numbers give us the summary of the data as each value describes a specific part of a dataset.
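
As a rough illustration, the five-number summary can be read off with `np.percentile`. The data reuses the `learning_hour` values from the first cell; the variable names are illustrative, and NumPy's default linear interpolation between data points is assumed.

```python
import numpy as np

data = [1, 2, 6, 4, 10]

# Percentiles 0, 25, 50, 75 and 100 are the min, Q1, median, Q3 and max.
minimum, q1, median, q3, maximum = np.percentile(data, [0, 25, 50, 75, 100])

# The interquartile range is the height of the box in a box plot.
iqr = q3 - q1

print("Min:", minimum, "Q1:", q1, "Median:", median, "Q3:", q3, "Max:", maximum)
print("IQR:", iqr)
```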


#### Interpreting a Box Plot

- A **short box plot** tells us that many of our **data points are similar**; we have many **values in a small range**.
  On the other hand, a **tall box plot** implies that many of the **data points are quite different**; we have values
  that are spread over a wide range.
- A **median value** that is **close to the bottom** of the box tells us that **most of our data points have lower values**,
  while a median that is **closer to the top** tells us that **most of our data has higher values**. Basically, **a median line not in the middle of the box indicates skewed data**.
- *What about the length of those whiskers?*
  **Long whiskers** tell us that our data has a **high standard deviation and variance**, i.e., the values are spread out and vary a lot. If there are long whiskers on one side of the box but not the other, it's an indication that our data varies, but only in one direction.

## Probability

- **Probability is the numerical chance that something will happen; it tells us how likely it is that some event will occur**.


#### Probability and Data Science

- Many of the methods and models needed for data science require a good understanding of probability. Examples include logistic regression, which we will encounter in the Machine Learning section, randomization in A/B testing and experimental design, and data sampling.

- **Probability of an Event** = (**Number of Ways It can Happen**)/(**Total Number of Outcomes**)

#### A. Independent Events
- Two events are independent when the outcome of the first event does not influence the outcome of the second event.
- **```P(X and Y) = P(X) * P(Y)```**

#### B. Dependent Events
- Two events are dependent when the outcome of the first event affects the outcome of the second event.
- **```P(X and Y) = P(X) * P(Y after X has occurred)```**
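
To make the difference between the two rules concrete, here is a small sketch using a hypothetical card-drawing example (not from the text above): the chance of drawing two aces from a standard 52-card deck, with and without putting the first card back.

```python
from fractions import Fraction

aces, deck = 4, 52

# Probability that the first card is an ace.
p_first_ace = Fraction(aces, deck)

# Independent events: the first card is replaced, so the second draw is unaffected.
p_two_aces_with_replacement = p_first_ace * Fraction(aces, deck)

# Dependent events: the first ace is not replaced, so one ace and one card are gone.
p_second_ace_given_first = Fraction(aces - 1, deck - 1)
p_two_aces_without_replacement = p_first_ace * p_second_ace_given_first

print(p_two_aces_with_replacement)     # 1/169
print(p_two_aces_without_replacement)  # 1/221
```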

#### C. Mutually Exclusive Events
- Two events are mutually exclusive when it is impossible for them to happen together.
- e.g. Turning left and turning right are mutually exclusive. They can't happen at once.
- ***Except in the world of Quantum Physics***.
- **```P(X or Y) = P(X) + P(Y)```**
- **```P(X and Y) = 0```**

#### D. Inclusive Events
- Inclusive events are events that can happen at the same time.
- To get the probability that at least one of two inclusive events occurs:
    - We first add the probabilities of the individual events.
    - Then subtract the probability of both events happening together, since it was counted twice.
- **```P(X or Y) = P(X) + P(Y) - P(X and Y)```**
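
The addition rules for mutually exclusive and inclusive events can be checked by counting outcomes of a single fair die. This is only a sketch; the events and the `prob` helper are chosen for illustration.

```python
from fractions import Fraction

outcomes = {1, 2, 3, 4, 5, 6}

def prob(event):
    # Number of ways the event can happen divided by the total number of outcomes.
    return Fraction(len(event), len(outcomes))

# Inclusive events: "even" and "greater than four" can both happen (rolling a 6).
even = {2, 4, 6}
greater_than_four = {5, 6}
p_union_formula = prob(even) + prob(greater_than_four) - prob(even & greater_than_four)
p_union_counted = prob(even | greater_than_four)

# Mutually exclusive events: a single roll cannot be both a 1 and a 6.
one, six = {1}, {6}
p_one_and_six = prob(one & six)
p_one_or_six = prob(one) + prob(six)

print(p_union_formula, p_union_counted)  # 2/3 2/3
print(p_one_and_six, p_one_or_six)       # 0 1/3
```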

## Conditional Probability

- Conditional probability is a measure of the probability of an event given that another event has occurred.
- It is the probability of one event occurring in relation to one or more other events.
- **```P(Y|X) = P(X and Y) / P(X)```**
    - Say event **X** is that it is raining outside, and there's a 0.3 chance of rain today.
    - Event **Y** might be that you will need to go outside, with a probability of 0.5.
    - A conditional probability would look at these two events, **X and Y**, in relation to one another, e.g. the probability that you need to go outside given that it is raining.
- **```P(X and Y) = P(X) * P(Y|X)```**: this rearrangement is known as the **multiplication rule**, and **Bayes' Theorem** follows from it.
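
A quick way to see the formula `P(Y|X) = P(X and Y) / P(X)` at work is to count outcomes of a fair die. The events below are illustrative assumptions: X is "the roll is even" and Y is "the roll is greater than 3".

```python
from fractions import Fraction

outcomes = {1, 2, 3, 4, 5, 6}

def prob(event):
    return Fraction(len(event), len(outcomes))

x = {2, 4, 6}  # the roll is even
y = {4, 5, 6}  # the roll is greater than 3

# Conditional probability from the formula.
p_y_given_x = prob(x & y) / prob(x)

# The same value by counting directly inside the reduced sample space X.
p_y_given_x_counted = Fraction(len(x & y), len(x))

print(p_y_given_x, p_y_given_x_counted)  # 2/3 2/3

# Rearranged form: P(X and Y) = P(X) * P(Y|X).
print(prob(x) * p_y_given_x == prob(x & y))  # True
```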

### Bayesian Statistics

- There are two main schools of statistics: **Frequentist Statistics** and **Bayesian Statistics**.
- *Bayesian Statistics* is also known as *Bayesian Inference*.
- **Bayesian Statistics** is a more general approach to statistics.
- It describes the probability of an event based on prior knowledge of the conditions that might be related to the event.
- It is based on the **Bayes' Theorem**: *basically a way of finding probabilities when we know certain other probabilities.*
- The formula: **```P(A|B) = P(A) * P(B|A) / P(B)```**
- This tells us how often A happens given that B happens, written P(A|B), when we have the following information:
    - How often B happens given that A happens, written P(B|A).
    - How likely A is on its own, written P(A).
    - How likely B is on its own, written P(B).
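
Here is a small worked sketch of the formula with made-up numbers for a hypothetical screening test: A is "has the condition" and B is "tests positive". All three input probabilities are assumptions chosen for illustration, and P(B) is obtained from the law of total probability, which is not covered above.

```python
# Hypothetical inputs (assumptions, not real data).
p_a = 0.01              # P(A): prior probability of having the condition
p_b_given_a = 0.90      # P(B|A): probability of a positive test if the condition is present
p_b_given_not_a = 0.05  # P(B|not A): false positive rate

# P(B) = P(B|A) * P(A) + P(B|not A) * P(not A)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes' Theorem: P(A|B) = P(A) * P(B|A) / P(B)
p_a_given_b = p_a * p_b_given_a / p_b

print(round(p_b, 4))          # 0.0585
print(round(p_a_given_b, 4))  # 0.1538
```

Even with a fairly accurate test, the low prior P(A) keeps P(A|B) at roughly 15% in this example, which is why the prior matters.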

### Probability Distributions

- A probability distribution is a function that represents the probabilities of all possible values.
- By specifying the relative chance of all possible outcomes, probability distributions allow us to understand underlying trends in data.

#### Random Variables

- A **Random Variable** is a variable whose value is the numerical outcome of a random experiment.
- Random variables can be either discrete or continuous:
    - **Discrete Data**: a.k.a. discrete variables, can only take specified values, e.g. rolling a die.
    - **Continuous Data**: a.k.a. continuous variables, can take any value within a range. This range can be finite or infinite.
- *Types of Probability Distributions*:
    1. **Discrete Probability Distributions**: for discrete variables.
    2. **Probability Density Functions**: for continuous variables.

#### Probability Functions
- The probability function for a *Discrete Random Variable* is often called the **Probability Mass Function**.
- For a *Continuous Random Variable* it is often called the **Probability Density Function** (sometimes loosely called a probability distribution function).

### Types of Distributions


#### 1. Uniform Distribution

- This is a basic probability distribution where all the values within a specified range have the same probability of occurrence.
- All the values outside that range have a probability of 0.
- e.g. rolling a fair die: outcomes range from 1 to 6, and each has a probability of 1/6. The probability of getting a 0 or a 7 is 0. (This is the discrete case.)
- Also known as the Rectangular Distribution.
- For the continuous case, the density function, f(x), of a variable X that is uniformly distributed is: **f(x) = 1 / (b-a)** for a ≤ x ≤ b, where **a** is the minimum value and **b** is the maximum value of the possible range of values of **X**.
- **Mean** = ***E(X) = (a + b) / 2***
- **Variance** = ***V(X) = (b - a)<sup>2</sup> / 12*** (for the continuous case)
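
The mean and variance formulas for the continuous uniform distribution can be sanity-checked with a quick simulation. This is a sketch under assumed values (`a = 2`, `b = 10`, a fixed seed), using only the standard library.

```python
import random

a, b = 2.0, 10.0  # hypothetical range

def uniform_pdf(x, a, b):
    # Density is constant inside [a, b] and zero outside it.
    return 1 / (b - a) if a <= x <= b else 0.0

theoretical_mean = (a + b) / 2
theoretical_variance = (b - a) ** 2 / 12

# Draw samples with a fixed seed so the run is repeatable.
rng = random.Random(0)
samples = [rng.uniform(a, b) for _ in range(100_000)]
sample_mean = sum(samples) / len(samples)
sample_variance = sum((s - sample_mean) ** 2 for s in samples) / len(samples)

print(uniform_pdf(5.0, a, b), uniform_pdf(11.0, a, b))
print(theoretical_mean, sample_mean)
print(theoretical_variance, sample_variance)
```

The simulated values should land close to the theoretical ones, with the gap shrinking as the number of samples grows.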


#### 2. Bernoulli Distribution

- A Bernoulli distribution is a **discrete** probability distribution.
- It is used when a random experiment has **only two outcomes**, **success** or **failure**.
- e.g. **Flipping a coin** has only two outcomes: **heads** and **tails**.
- A random variable ***X*** with a Bernoulli distribution takes the value 1 with probability of success ***p***, and the value 0 with probability of failure ***1-p***.
- Success and failure do not need to be equally likely.
- The probability function, ***P(x)***, is given by
    - ***P(x) = p<sup>x</sup> * (1-p)<sup>(1-x)</sup>***,    where x ∈ {0, 1}.
- **Mean**  =  ***E(X)  =  1 * p + 0 * (1-p)  =  p***
- **Variance**  =  ***V(X)  =  E(X<sup>2</sup>) - [E(X)]<sup>2</sup>  =  p - p<sup>2</sup> = p(1-p)***
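
The probability function and the two moments above can be written out directly. In this sketch the success probability `p = 0.3` is an arbitrary assumption, and `bernoulli_pmf` is an illustrative helper name.

```python
def bernoulli_pmf(x, p):
    # P(x) = p^x * (1 - p)^(1 - x), defined only for x in {0, 1}.
    if x not in (0, 1):
        raise ValueError("x must be 0 or 1")
    return p ** x * (1 - p) ** (1 - x)

p = 0.3  # hypothetical probability of success

# Mean and variance computed from the definition, summing over both outcomes.
mean = sum(x * bernoulli_pmf(x, p) for x in (0, 1))
variance = sum((x - mean) ** 2 * bernoulli_pmf(x, p) for x in (0, 1))

print(bernoulli_pmf(1, p), bernoulli_pmf(0, p))
print(mean, p)                 # mean equals p
print(variance, p * (1 - p))   # variance equals p(1 - p)
```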


#### 3. Binomial Distribution

- A Binomial distribution gives the probability of the number of successes in an experiment that is repeated multiple times, where each trial ends in either success or failure.
- Binomial distributions must follow these criteria:
    - There are **only two possible outcomes** in a trial: either success or failure.
    - The **probability of success** is exactly the same for all trials.
    - The **number of observations** or trials is fixed: a total of n identical trials.
    - Each **trial is independent**: none of the trials has an effect on the probability of the next trial.
- **Mean** = ***E(X) = n * p***
- **Variance** = ***V(X) = n * p * (1-p)***; n:= number of trials
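
The notes above give the mean and variance but not the probability function itself. The sketch below assumes the standard binomial formula, P(X = k) = C(n, k) * p^k * (1-p)^(n-k), and uses it to confirm both moments for a hypothetical case of 10 fair coin flips.

```python
import math

def binomial_pmf(k, n, p):
    # Probability of exactly k successes in n independent trials.
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 10, 0.5  # hypothetical: 10 flips of a fair coin

pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]
mean = sum(k * pk for k, pk in enumerate(pmf))
variance = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))

print(pmf[5])                     # probability of exactly 5 heads
print(mean, n * p)                # both 5.0
print(variance, n * p * (1 - p))  # both 2.5
```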


#### 4. Normal Distribution

- A normal distribution, also called the bell curve or Gaussian distribution, is a distribution that describes the behaviour of data in many situations.
- The bell curve is symmetrical: half of the data falls to the left of the mean and half falls to the right of it.
- The number of standard deviations a value lies from the mean (its distance from the mean) is called the **standard score** or **z-score**.
- Z-scores are a way to compare results from a test to a **"normal"** population.
- A normal distribution with a mean of 0 and standard deviation of 1 is called a **standard normal distribution**.
- The process of transforming a distribution to one with a mean of 0 and standard deviation of 1 is called **standardizing the distribution**.
- **Mean** = ***E(X) = μ***
- **Variance** = ***V(X) = σ<sup>2</sup>***
- **Z-score** = ***z = (x - μ) / σ***
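
Standardizing can be sketched with NumPy by applying the z-score formula to every value. The data reuses the `learning_hour` values from the first cell, with the sample's own mean and standard deviation standing in for μ and σ (an assumption made for illustration).

```python
import numpy as np

learning_hour = np.array([1, 2, 6, 4, 10])

mu = learning_hour.mean()    # 4.6
sigma = learning_hour.std()  # 3.2 (population standard deviation, ddof=0)

# z = (x - mu) / sigma, applied element-wise.
z_scores = (learning_hour - mu) / sigma

print(z_scores)
# After standardizing, the mean is (numerically) 0 and the standard deviation is 1.
print(z_scores.mean(), z_scores.std())
```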


#### 5. Poisson Distribution

- This distribution gives us the probability of a given **number of events** happening **in a fixed interval of time**.
- For a distribution to be called a **Poisson Distribution**, the following assumptions need to hold:
    - The numbers of successes in two disjoint time intervals are independent.
    - The probability of a success during a small time interval is proportional to the length of that interval. The probability of success in an interval approaches zero as the interval becomes smaller.
- **Probability Function** = ***P(x) = e<sup>-μ</sup> * μ<sup>x</sup> / x!***;       x:= 0,1,2,3...
- **Mean** = ***μ = λ * t***;     λ:= the rate at which events occur; t:= the length of the time interval.
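
A minimal sketch of the Poisson probability function, assuming a hypothetical rate of 3 events per hour observed over a 2-hour interval (so μ = 6). The sums are truncated at 60 terms, which is an approximation; the remaining tail is negligible for this μ.

```python
import math

def poisson_pmf(x, mu):
    # P(x) = e^(-mu) * mu^x / x!
    return math.exp(-mu) * mu ** x / math.factorial(x)

rate = 3.0      # hypothetical: events per hour
interval = 2.0  # hypothetical: hours observed
mu = rate * interval

p_zero = poisson_pmf(0, mu)  # no events in the interval
p_six = poisson_pmf(6, mu)   # exactly six events

# Truncated sums over x = 0..59 approximate the total probability and the mean.
total = sum(poisson_pmf(x, mu) for x in range(60))
mean = sum(x * poisson_pmf(x, mu) for x in range(60))

print(p_zero, p_six)
print(total, mean)
```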


#### 6. Exponential Distribution

- It allows us to go a step further than the Poisson Distribution.
- While the Poisson distribution counts events in an interval, the exponential distribution models the time between consecutive events.
- The kind of questions that can be answered by modeling waiting times:
    - How much time will go by before a major earthquake hits a certain area?
    - How long will a car component last before it needs replacement?
- **Probability Function** = ***f(x) = λ * e<sup>-λx</sup>***, x ≥ 0;       λ:= the rate at which events occur, so the mean time between events is 1/λ.
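
The waiting-time questions above can be sketched as follows, treating λ as the event rate (so the mean waiting time is 1/λ). The rate of 0.5 events per hour is a made-up value, and the cumulative form P(T ≤ t) = 1 - e^(-λt) is the standard result for this distribution rather than something stated in the notes.

```python
import math
import random

rate = 0.5  # hypothetical: 0.5 events per hour

def exponential_pdf(x, lam):
    # Density f(x) = lam * e^(-lam * x) for x >= 0, and 0 otherwise.
    return lam * math.exp(-lam * x) if x >= 0 else 0.0

def exponential_cdf(x, lam):
    # Probability that the waiting time is at most x.
    return 1 - math.exp(-lam * x) if x >= 0 else 0.0

mean_wait = 1 / rate                          # expected waiting time in hours
p_within_one_hour = exponential_cdf(1.0, rate)

# Simulated waiting times; random.expovariate takes the rate, not the mean.
rng = random.Random(0)
waits = [rng.expovariate(rate) for _ in range(100_000)]
simulated_mean = sum(waits) / len(waits)

print(mean_wait, simulated_mean)
print(p_within_one_hour)
```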

### Statistical Significance

- **Statistical significance is a measure of whether our findings are meaningful or just a result of random chance.**

- Components of Statistical Significance:
    1. Hypothesis Testing
    2. Normal Distribution
    3. p-values
    
1. Hypothesis Testing
    - is a technique for evaluating a theory using data.
    - The hypothesis is the researcher's initial belief about the situation before the study.
    - The commonly accepted fact is known as the ***Null Hypothesis***.
    - The opposing claim is known as the ***Alternative Hypothesis***.

2. Normal Distribution
    - **z-test**: A z-test is a statistical technique to test the Null Hypothesis against the Alternative Hypothesis.
    - This technique is used when the sample data is normally distributed or the sample size is greater than 30, and the population standard deviation is known. **Why 30?**
    - ***Central Limit Theorem***: According to this theorem, as the sample size grows, the distribution of the sample mean approaches a normal distribution, whatever the shape of the underlying population. A sample size above 30 is a common rule of thumb for when this approximation is good enough.
    - z-tests are based on z-scores, which tell us where the sample mean lies compared to the population mean.

3. P-value
    - The p-value quantifies how rare our results are.
    - It tells us how often we'd see the numerical results of an experiment (z-scores), or more extreme ones, if the null hypothesis is true and there are no differences between the groups.
    - This means that we can use p-values to reach conclusions in significance testing.
    - More specifically, we compare the p-value to a **significance level α** to make conclusions about our hypotheses:
        - **If the p-value is lower than the significance level** we chose, it means the numbers would rarely occur by chance alone, and we can reject the null hypothesis in favour of the alternative hypothesis.
        - **If the p-value is greater than or equal to the significance level**, then we fail to reject the null hypothesis. This doesn't mean we accept the null hypothesis though.
    - The choice of ***α*** depends on the situation; **0.05** is the most widely used value across scientific disciplines.
    - This means that *p < 0.05* is the threshold below which study results can be declared *Statistically Significant*.
    - i.e. It's unlikely that results this extreme would arise from random chance alone if the null hypothesis were true.
    - With a p-value of 0.05, if we ran the experiment 100 times, we'd expect to see these same numbers, or more extreme results, about 5 times, assuming the null hypothesis is true.
    - A p-value less than 0.05 means **there is less than a 5% chance of seeing our results, or more extreme results, assuming the null hypothesis is true**.
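
Putting the three components together, a one-sample two-sided z-test might look like the sketch below. All the numbers (population mean 100, population standard deviation 15, a sample of 36 with mean 105) are hypothetical, the function name is illustrative, and the standard normal tail probability is computed with `math.erfc` instead of a statistics library.

```python
import math

def z_test_p_value(sample_mean, population_mean, population_std, n):
    # z-score of the sample mean: distance from the population mean in standard errors.
    standard_error = population_std / math.sqrt(n)
    z = (sample_mean - population_mean) / standard_error
    # Two-sided p-value: probability of a z at least this extreme under the null hypothesis.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

alpha = 0.05  # significance level

# Hypothetical study: null hypothesis says the population mean is 100.
z, p_value = z_test_p_value(sample_mean=105, population_mean=100, population_std=15, n=36)

print(z, p_value)
if p_value < alpha:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")
```

Here z is 2.0 and the p-value is about 0.0455, just under the 0.05 threshold, so this hypothetical result would be declared statistically significant.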