# Lambda School - Sprint Challenge 2: 



## Basic Statistics. 

No understanding of statistics can begin without a thorough understanding of the normal distribution. 

![Normal distribution graph](https://upload.wikimedia.org/wikipedia/commons/a/a9/Empirical_Rule.PNG?1587251651906)

The general form of this type of function, which is known as a probability density function, can be represented as:

![Probability density function](https://wikimedia.org/api/rest_v1/media/math/render/svg/00cb9b2c9b866378626bcfa45c86a6de2f2b2e40)

With the specific parameters mu = 0 and sigma = 1, we get the "standard normal distribution". 

![Normal distribution equation](https://wikimedia.org/api/rest_v1/media/math/render/svg/3123d8dd4c3386afe9fac119fed2cfaf7ce9f336)

A faster way to get acquainted with how this function works is to look at an example with varying **mu** and **sigma**.

![Example of PDF](https://upload.wikimedia.org/wikipedia/commons/thumb/7/74/Normal_Distribution_PDF.svg/800px-Normal_Distribution_PDF.svg.png?1587251935434)
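To make the formula concrete, here is a minimal sketch that evaluates the density by hand and compares it with `scipy.stats.norm.pdf`. The helper name `normal_pdf` and the grid of `x` values are illustrative choices, not part of the original notes.

```python
import numpy as np
from scipy import stats


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Evaluate the normal probability density function at x."""
    coeff = 1.0 / (sigma * np.sqrt(2.0 * np.pi))
    return coeff * np.exp(-0.5 * ((x - mu) / sigma) ** 2)


# Illustrative grid of points around the mean
x = np.linspace(-4, 4, 9)

manual = normal_pdf(x, mu=0.0, sigma=1.0)
reference = stats.norm.pdf(x, loc=0.0, scale=1.0)

# The hand-written formula should agree with scipy's implementation
print(np.allclose(manual, reference))

# The density of the standard normal peaks at x = mu with value 1 / sqrt(2 * pi)
peak = normal_pdf(0.0)
print(peak)
```

Changing `mu` shifts the curve left or right, and changing `sigma` makes it wider or narrower, which matches the picture above.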


In [8]:
# With this in mind, we should learn how to use the basic summary statistics functions in numpy.
import numpy as np
import pandas as pd 
from scipy import stats

# need to generate a dataset to look at [np.random.normal(mu, sigma, n)]
normal_array = np.random.normal(0, 1, 1000)
normal_array_df = pd.DataFrame(normal_array) 

In [10]:
### Min 
print(f'The lowest value of the dataset is: {np.min(normal_array)}')

### Max 
print(f'The highest value of the dataset is: {np.max(normal_array)}')

### Mean 
print(f'The mean value of the dataset is (or mu): {np.mean(normal_array)}')

### Median
print(f'The median value of the dataset is: {np.median(normal_array)}')

### Sd
print(f'The standard deviation of the dataset is (or sigma): {np.std(normal_array)}')

The lowest value of the dataset is: -2.9358441714196175
The highest value of the dataset is: 3.302542451918032
The mean value of the dataset is (or mu): -0.013505437882494947
The median value of the dataset is: 0.00015701036844917554
The standard deviation of the dataset is (or sigma): 1.0036060146268915


In [9]:
### describe function
normal_array_df.describe()

| | 0 |
|:-|:-|
| count | 1000.0 |
| mean | -0.013505 |
| std | 1.004108 |
| min | -2.935844 |
| 25% | -0.734454 |
| 50% | 0.000157 |
| 75% | 0.646335 |
| max | 3.302542 |
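Note that `describe()` reports a standard deviation of 1.004108 while `np.std` printed 1.003606 for the same data. The difference comes from the default degrees-of-freedom correction: `np.std` divides by `n` (`ddof=0`), while pandas divides by `n - 1` (`ddof=1`). A small sketch on a freshly generated array (the seed and variable names are arbitrary choices for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)          # seed chosen only for reproducibility
sample = rng.normal(0, 1, 1000)

population_sd = np.std(sample)          # ddof=0: divides by n
sample_sd = np.std(sample, ddof=1)      # ddof=1: divides by n - 1
pandas_sd = pd.Series(sample).std()     # pandas defaults to ddof=1

print(population_sd, sample_sd, pandas_sd)

# The two numpy results differ by a factor of sqrt(n / (n - 1))
ratio = sample_sd / population_sd
print(ratio)
```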


## Statistical Definitions of Value.

**Set**: A collection of distinct entities regarded as a unit, being either individually specified or (more usually) satisfying specified conditions. See the Python class [`set()` ](https://docs.python.org/3/library/stdtypes.html#set) 

**Subset**: If all members of a set `A` are also members of another set `B`, then `A` is a subset of `B`.

**Empty Set**: A set without any members. The empty set is a subset of all other sets

**Universal Set**: The set that all other sets are a subset of
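These set definitions map directly onto Python's built-in `set`. The sketch below uses small made-up sets, and `universe` is a hypothetical universal set chosen just for this example.

```python
A = {1, 2}
B = {1, 2, 3}
empty = set()
universe = {1, 2, 3, 4, 5}  # hypothetical universal set for this example

# A is a subset of B because every member of A is also in B
a_subset_b = A <= B
same_check = A.issubset(B)

# The empty set is a subset of every set
empty_subset = empty <= A

# Every set in the example is a subset of the universal set
all_in_universe = all(s <= universe for s in (A, B, empty))

print(a_subset_b, same_check, empty_subset, all_in_universe)
```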

**Combination**: A selection of a given number of elements from a larger number without regard to their arrangement $nCr = \frac{n!}{r!(n-r)!}$

**Permutation**: A selection of a given number of elements from a set with concern for ordering $nPr = \frac{n!}{(n-r)!}$
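As a quick check of the two formulas, the standard library provides `math.comb` and `math.perm` (Python 3.8+). The values `n = 5` and `r = 2` are arbitrary illustration values.

```python
import math
from itertools import combinations, permutations

n, r = 5, 2  # arbitrary illustration values

# Formulas from the definitions above
n_c_r = math.factorial(n) // (math.factorial(r) * math.factorial(n - r))
n_p_r = math.factorial(n) // math.factorial(n - r)

# Built-in equivalents
assert n_c_r == math.comb(n, r)
assert n_p_r == math.perm(n, r)

# Counting the selections explicitly gives the same numbers
items = range(n)
count_combinations = len(list(combinations(items, r)))
count_permutations = len(list(permutations(items, r)))

print(n_c_r, n_p_r, count_combinations, count_permutations)
```

Each combination of `r` items can be ordered in `r!` ways, which is why `nPr = nCr * r!`.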

**Degrees of Freedom**: The number of independent values or quantities that can be assigned to a statistical distribution. 

**Sample**: A group of individual observations drawn from a population, usually at random, so that the sample mean can serve as an estimate of the population mean. The larger the sample size, the closer the sample mean tends to be to the population mean.
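The effect of sample size can be seen in a small simulation. This is only a sketch: the population parameters (mean 10, standard deviation 2), the seed, and the sample sizes are made-up values.

```python
import numpy as np

rng = np.random.default_rng(42)  # seed chosen only for reproducibility
population_mean = 10.0           # assumed population mean for the simulation

errors = {}
for n in (10, 100, 10_000, 1_000_000):
    sample = rng.normal(loc=population_mean, scale=2.0, size=n)
    # Absolute gap between the sample mean and the population mean
    errors[n] = abs(sample.mean() - population_mean)

for n, err in errors.items():
    print(f"n = {n:>9}: |sample mean - population mean| = {err:.5f}")
```

The gap is random for any single draw, but it shrinks on average roughly in proportion to `1 / sqrt(n)`.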

**Null Hypothesis**: A general statement or default position that there is no relationship between two measured phenomena, or no association among groups. "The boring" choice, nothing special is happening.

**Statistical Significance**: A result has statistical significance when a result at least as extreme would be very unlikely to occur if the null hypothesis were true (usually indicated by a p-value < 0.05).

**Student's T-test**: A set of statistical hypothesis tests applied in cases where the data are approximately normally distributed (a very typical situation), but the standard deviation of the population is unknown (typical for many real-world situations) and must be estimated from the sample. It is a relaxed version of the Z-test, which assumes a known standard deviation. The larger the absolute value of the t statistic, the more 'unusual' the result under the null hypothesis, while the p-value determines whether that result is considered statistically significant (leading us to reject the null hypothesis). 

## The three statistics tests done: 

- [Bayesian Inference](https://medium.com/@mark.rethana/bayesian-statistics-and-naive-bayes-classifier-33b735ad7b16) 
- [Chi-squared Test](https://machinelearningmastery.com/chi-squared-test-for-machine-learning/) 
- [T Test (Student’s T-Test)](https://towardsdatascience.com/statistical-tests-when-to-use-which-704557554740)

### T-Test (Student's T-Test)

Two common t-tests are:
- A [one-sample test](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_1samp.html#scipy.stats.ttest_1samp) of whether the mean of a population has a value specified in a null hypothesis.
- A [two-sample test](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html#scipy.stats.ttest_ind) of the null hypothesis that the means of two populations are equal.

In [None]:
#one sample t-test
from scipy import stats

stats.ttest_1samp(df_subset['feature'], df['feature'].mean())

In [None]:
#two sample t-test

stats.ttest_ind(df1['feature'], df2['feature'])
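Since `df`, `df1` and `df2` are placeholders in the cells above, here is a runnable sketch of both tests on synthetic data. The group means, sizes and seed are made-up values for illustration only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # seed chosen only for reproducibility

# Two synthetic groups whose true means differ by 0.5
group_a = rng.normal(loc=5.0, scale=1.0, size=200)
group_b = rng.normal(loc=5.5, scale=1.0, size=200)

# One-sample test: is the mean of group_a equal to 5.0?
t_one, p_one = stats.ttest_1samp(group_a, popmean=5.0)

# Two-sample test: do group_a and group_b share the same mean?
t_two, p_two = stats.ttest_ind(group_a, group_b)

print(f"one-sample: t = {t_one:.3f}, p = {p_one:.3f}")
print(f"two-sample: t = {t_two:.3f}, p = {p_two:.3g}")

# Reject the null hypothesis at the 0.05 level when p < 0.05
reject_two_sample = p_two < 0.05
```

Note that `ttest_ind` assumes equal variances by default; passing `equal_var=False` runs Welch's t-test instead.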

### Chi-squared Test
**Chi Square Test**: Used to test the independence of the rows and columns of a contingency table (the null hypothesis is that they are independent), usually with categorical variables. 
- [Chi Square Documentation](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chisquare.html)
- [Chi Square Contingency Documentation](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html)  Related: [Contingency Table](https://en.wikipedia.org/wiki/Contingency_table)

In [None]:
from scipy.stats import chisquare
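# chisquare is a one-way goodness-of-fit test: the null hypothesis is that the
# observed counts match the expected frequencies (all equal by default) -> low chi square
# The alternative is that the counts deviate from them -> high chi square
# axis=None treats all values in `observations` as a single dataset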

chisquare(observations, axis=None)

In [None]:
from scipy.stats import chi2_contingency

#Chi^2 from contingency table
chi_squared, p_value, dof, expected = chi2_contingency(contingency_table)
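The `contingency_table` above is a placeholder. One common way to build it is `pd.crosstab`, sketched here on a small made-up dataset (the column names and counts are hypothetical).

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical categorical data: 100 rows, two groups, yes/no outcome
df = pd.DataFrame({
    "group": ["a"] * 50 + ["b"] * 50,
    "outcome": ["yes"] * 40 + ["no"] * 10 + ["yes"] * 15 + ["no"] * 35,
})

# Rows are groups, columns are outcomes, cells are observed counts
contingency_table = pd.crosstab(df["group"], df["outcome"])
print(contingency_table)

chi_squared, p_value, dof, expected = chi2_contingency(contingency_table)

print(f"chi^2 = {chi_squared:.2f}, p = {p_value:.3g}, dof = {dof}")
print(expected)  # counts we would expect if group and outcome were independent
```

For a 2x2 table `chi2_contingency` applies Yates' continuity correction by default; pass `correction=False` to turn it off.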

#### Confidence Interval
**Confidence Interval**: Given a sample, the confidence interval is a calculated range that an unknown population parameter (such as the mean) is likely to be within. A 95% confidence level means that if we repeated the sampling many times and calculated an interval each time, about 95% of those intervals would contain the unknown parameter.

In [None]:
# Confidence intervals
import numpy as np
from scipy import stats

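def confidence_interval(data, confidence=0.95):
  """Return (mean, lower bound, upper bound) of a t-based confidence interval."""
  data = np.array(data)
  mean = np.mean(data)
  n = len(data)
  
  # Standard error of the mean, scaled by the two-sided critical t value
  stderr = stats.sem(data)
  interval = stderr * stats.t.ppf((1 + confidence) / 2., n - 1)
  
  return (mean, mean - interval, mean + interval)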
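One way to build intuition for the confidence level is to treat it as a long-run coverage rate. The simulation below is a sketch under made-up settings (true mean 0, sample size 30, 2000 repetitions): it builds a t-based interval for each sample and counts how often the interval contains the true mean.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)  # seed chosen only for reproducibility
true_mean, n, confidence, trials = 0.0, 30, 0.95, 2000  # assumed settings

# Two-sided critical value of the t distribution with n - 1 degrees of freedom
t_crit = stats.t.ppf((1 + confidence) / 2.0, n - 1)

hits = 0
for _ in range(trials):
    sample = rng.normal(loc=true_mean, scale=1.0, size=n)
    margin = t_crit * stats.sem(sample)
    lower, upper = sample.mean() - margin, sample.mean() + margin
    # Count the interval as a hit when it contains the true mean
    hits += int(lower <= true_mean <= upper)

coverage = hits / trials
print(f"fraction of intervals containing the true mean: {coverage:.3f}")
```

The printed fraction should land close to the chosen confidence level.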

### Bayesian Inference: 

**Conditional Probability** - A measure of the probability of an event (some particular situation occurring) given that another event has occurred. $$P(A|B) = \frac{P(A \cap B)}{P(B)}$$

**Bayes Theorem** - Describes the probability of an event, based on prior knowledge of conditions that might be related to the event. $$P(A|B) = \frac{P(B|A)P(A)}{P(B)}$$  The probability of $A$ conditioned on $B$ is the probability of $B$ conditioned on $A$, times the probability of $A$ and divided by the probability of $B$. These unconditioned probabilities are referred to as "prior beliefs", and the conditioned probabilities as "updated."
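Both formulas can be checked by brute-force enumeration. The sketch below uses two fair dice, with event A "the sum is 8" and event B "the first die is even"; these events are illustrative choices, not from the lecture.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two fair dice
outcomes = list(product(range(1, 7), repeat=2))


def prob(event):
    """Probability of an event, given as a predicate over outcomes."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))


def is_a(o):
    return o[0] + o[1] == 8   # A: the sum is 8


def is_b(o):
    return o[0] % 2 == 0      # B: the first die is even


p_a, p_b = prob(is_a), prob(is_b)
p_a_and_b = prob(lambda o: is_a(o) and is_b(o))

# Conditional probability: P(A|B) = P(A and B) / P(B)
p_a_given_b = p_a_and_b / p_b
p_b_given_a = p_a_and_b / p_a

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
bayes = p_b_given_a * p_a / p_b

print(p_a_given_b, bayes)
```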

### Kush's Notes on Bayesian Inference: 

#### ***Bayes Theorem: The many faces***

$$p(H|D) = \frac{p(D|H) * p(H)}{p(D)}$$

* H is the **hypothesis**  
* D is the observed **data**  
* p(H) is the probability of the hypothesis before we see the data, called the prior probability, or just **prior**.    
* p(D) is the **marginal probability** of the data taking into account all possible hypotheses (aka "total probability of D").  
* p(D|H) is the probability of the data given that the hypothesis is true, called the **conditional probability**.   
*  p(H|D) is what we want to compute, the probability of the hypothesis after we see the data, called the **posterior** probability.

$$P(A|B) = \frac{P(B|A)* P(A)}{P(B)}$$

In words - the probability of $A$ conditioned on $B$ is the probability of $B$ conditioned on $A$, times the probability of $A$ and divided by the probability of $B$. 


***How to apply: aka the problem written out for us***


![Bayes Theorem Drug Test Example](https://wikimedia.org/api/rest_v1/media/math/render/svg/95c6524a3736c43e4bae139713f3df2392e6eda9)

* P(+|User) = True Positive Rate (sensitivity)

* P(User) = 1/200 Prior probability

* P(+|Non-user) = False Positive rate
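Plugging numbers into the formula shows why a positive result can still be weak evidence. The prior of 1/200 comes from the notes above; the 99% true positive rate and 1% false positive rate are assumed values for illustration and are not stated in the notes.

```python
p_user = 1 / 200   # prior probability of being a user (from the notes)
tpr = 0.99         # P(+|User), assumed for illustration
fpr = 0.01         # P(+|Non-user), assumed for illustration

# Total probability of a positive test across users and non-users
p_positive = tpr * p_user + fpr * (1 - p_user)

# Bayes' theorem: P(User|+) = P(+|User) * P(User) / P(+)
p_user_given_positive = tpr * p_user / p_positive

print(f"P(+) = {p_positive:.4f}")
print(f"P(User|+) = {p_user_given_positive:.4f}")
```

Under these assumptions only about a third of positive tests belong to actual users, because non-users vastly outnumber users and so produce most of the positives.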

***Most Basic Explanation***
<div>
<img src="https://www.bayestheorem.net/images/Bayes-Theorem-Formula-Defined.jpeg" width="500"/>
</div>
$$\text{posterior} = \frac{\text{conditional} \times \text{prior}}{\text{marginal}}$$


In [None]:
from scipy.stats import bayes_mvs

bayes_mvs(df['feature'], alpha=.95)
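Because `df` is a placeholder in the cell above, here is a runnable sketch of `bayes_mvs` on synthetic data (the distribution parameters and seed are made up). The function returns three results, for the mean, variance and standard deviation, each holding a point estimate and a credible interval.

```python
import numpy as np
from scipy.stats import bayes_mvs

rng = np.random.default_rng(0)  # seed chosen only for reproducibility
data = rng.normal(loc=10.0, scale=2.0, size=500)

# Each result unpacks into (point estimate, (lower bound, upper bound))
mean_res, var_res, std_res = bayes_mvs(data, alpha=0.95)

mean_center, (mean_low, mean_high) = mean_res
std_center, (std_low, std_high) = std_res

print(f"mean: {mean_center:.3f} in [{mean_low:.3f}, {mean_high:.3f}]")
print(f"std:  {std_center:.3f} in [{std_low:.3f}, {std_high:.3f}]")
```

Here `alpha` is the probability that the returned interval contains the true parameter, so `alpha=0.95` asks for a 95% credible interval.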

# More Theory: 

This section covers additional theory from the lectures that is not covered in the lecture notes themselves: 

### Definitions: Prof. Austin's magnificent table: 

| Bayes term | Bayes formula | Confusion Matrix term | Confusion Matrix formula| Alternative CM term | 
|:-|:-|:-|:-|:-|
| prior | P(A) | prevalence | (TP + FN) / (TP+TN+FP+FN) | ? |
| posterior | P(A given B) | Positive Predictive Value (PPV) | TP / (TP + FP) | precision |
| conditional | P(B given A) | True Positive Rate (TPR)  |TP / (TP + FN) | sensitivity, recall |
| marginal | P(B) | queue rate | (TP + FP) / (TP+TN+FP+FN) | ? |  
| prior complement | P(not A) or 1-P(A) | prevalence complement | 1-prevalence | ? |
| ? | P(not B given not A) | True Negative Rate (TNR) | TN / (FP + TN) | specificity |
| ? | P(B given not A) | False Positive Rate (FPR) | FP / (FP+TN) | fall-out rate, false alarm rate |
| ? | P(not B given A) | False Negative Rate (FNR) | FN / (TP + FN) | miss rate |
|?|?|accuracy|(TP + TN) / (TP+TN+FP+FN)|?|
|?|?|error rate|(FP + FN) / (TP+TN+FP+FN)|misclassification rate|


**Abbreviations**  
A: Hypothesized Data       
B: Observed Data         
TP: True Positive  
TN: True Negative  
FP: False Positive  
FN: False Negative  
 
^ Note: Sometimes in Bayesian statistics the following terms are used instead:
 
* prior = hypothesis
* posterior = updated hypothesis
* conditional = likelihood
* marginal = model evidence
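The correspondence in the table can be verified numerically: computing the posterior from Bayes' theorem should give exactly the Positive Predictive Value. The confusion matrix counts below are hypothetical, chosen only to make the arithmetic easy to follow.

```python
# Hypothetical confusion matrix counts (20,000 cases in total)
tp, fn, fp, tn = 99, 1, 199, 19701
total = tp + tn + fp + fn

prevalence = (tp + fn) / total   # prior, P(A)
tpr = tp / (tp + fn)             # conditional, P(B given A)
queue_rate = (tp + fp) / total   # marginal, P(B)
ppv = tp / (tp + fp)             # posterior, P(A given B)

# Bayes' theorem: posterior = conditional * prior / marginal
posterior = tpr * prevalence / queue_rate

fpr = fp / (fp + tn)             # P(B given not A)
accuracy = (tp + tn) / total

print(f"prevalence = {prevalence}, TPR = {tpr}, FPR = {fpr:.4f}")
print(f"PPV = {ppv:.4f}, Bayes posterior = {posterior:.4f}")
print(f"accuracy = {accuracy}")
```

Note how accuracy is 0.99 here while the PPV is only about 0.33, which is why accuracy alone can be misleading when prevalence is low.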

### Generalized Structure of $H_0$ and $H_a$

Null Hypothesis $(H_0)$: There is no significant difference in thing between x and y on k-dataset.

Alternative Hypothesis $(H_a)$: There is a significant difference in thing between x and y on k-dataset.