# Statistics Overview
## Part I - Statistics Basics & Descriptive Statistics
---
Statistics is an area of applied mathematics concerned with the collection, analysis, interpretation, and presentation of data.

This overview is organized into four parts:

* **Part I   - Statistics Basics**
* **Part II  - Descriptive Statistics**
* **Part III  - Probability**
* **Part IV - Inferential Statistics**

---
# Part I - Statistics Basics
---

**Data Set** - The data collected for a particular study

**Elements** - Individual values within a data set

**Variable** - a column containing a group of elements; more specifically, the name of the column, which characterizes the data in that column.

**Population** - set of all elements about which we wish to draw conclusions

**Sample** - subset of the elements of a population

**Quantitative Data** - data that can be counted or measured and expressed using numbers (for example: 'how much of something' or 'how many')

**Quantitative Variables** - there are two types of quantitative variables.
* **Discrete** - Data that can only take distinct, countable values and cannot be broken down into smaller parts. (Example: number of iPhones sold last year, number of students in a class)

* **Continuous** - Data that can be infinitely broken down into smaller parts, i.e. can take any value within a range. (Example: weight, age). 

**Qualitative Data** - data that can be categorized based on traits and characteristics (examples: person's gender, car model, etc.)

**Qualitative (Categorical) Variables** - there are two types of categorical variables.
* **Ordinal** - a qualitative variable with meaningful ordering and ranking. (Example: a satisfaction rating such as 'happy' or 'unhappy'). Numbers can also be used as labels, such as 1 for happy and 0 for unhappy; note that ordinal numbers only encode rank, so the distance between them cannot be quantified. 

* **Nominal** - a qualitative variable with no meaningful ordering or ranking. (Example: gender, car model colors).

**Cross Sectional Data** - data collected at the same, or approximately the same, point in time. (Example: closing stock prices of multiple companies for one day)

**Time Series Data** - data collected over different periods of time. (Example: daily stock returns for one company for a month)

---
## Data Sources
Oftentimes data is available publicly or for sale privately. When this is the case the data is an **existing** data source. 
On the other hand, if data is not readily available then one must collect the data themselves. Often this is done through **surveys** or **experimental studies**.

**Existing Data** - data that has already been gathered from an outside source

**Response Variable** - main variable (column) of interest in a study

**Factors** - other variables that could potentially relate to the response variable

**Experimental Study (Supervised)** - when the factors of a response variable can be set and manipulated, the study is said to be experimental. (Example: a drug company testing medications against placebos)

**Observational Study (Unsupervised)** - when the factors of interest cannot be controlled, the study is said to be observational. (Example: cholesterol and diet studies)

---
## Sampling Techniques
Sampling is a statistical method for selecting observations from within a population. There are two main types of sampling methodologies: probability and non-probability. Probability sampling is the method most often used in conjunction with statistics. There are 3 main types of probability sampling:

**Random Sampling** - Each element or member of a population has an equal chance of being selected for the sample

**Systematic Sampling** - Every nth record is chosen from the population to be a part of the sample

**Stratified Sampling** - The population is divided into strata (subsets) whose members share at least one common characteristic (such as gender or age), then random sampling is used to select a sufficient number of subjects from each stratum 
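
The three probability sampling techniques above can be sketched with NumPy. This is a minimal illustration, not a prescribed method: the population of 100 ids, the two strata `A` and `B`, and the 10% sampling rate are all hypothetical values chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 100 member ids, each tagged with a stratum label
ids = np.arange(100)
groups = np.where(ids < 60, "A", "B")  # 60 members in stratum A, 40 in B

# Random sampling: every member has the same chance of being selected
random_sample = rng.choice(ids, size=10, replace=False)

# Systematic sampling: pick a random start, then take every nth record
n = 10
start = int(rng.integers(n))
systematic_sample = ids[start::n]

# Stratified sampling: draw 10% at random from each stratum
stratified_sample = np.concatenate([
    rng.choice(ids[groups == g], size=int(np.sum(groups == g)) // 10, replace=False)
    for g in ("A", "B")
])

print(random_sample, systematic_sample, stratified_sample, sep="\n")
```

Because each stratum is sampled at the same rate, the stratified sample keeps the 60/40 split of the population, which a single random draw does not guarantee.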

----
# Part II - Descriptive Statistics
---
**Descriptive Statistics** is the science of describing the important aspects of a set of data. **Statistical Inference** uses descriptive statistics from a sample in order to make generalizations about the important aspects of a population. 

Descriptive statistics are expressed both graphically and numerically, and the two methods are often intertwined: 

## Graphical Methods of Descriptive Statistics 
* **Quantitative Data** - quantitative data can be graphed using Histograms, Line Charts, and Scatter Plots to name a few. 


* **Qualitative Data** - qualitative data can be graphed using Bar Charts, Pie Charts, Point Plots, Categorical Scatter Plots, Box Plots, and Violin Plots to name a few. 

----
##  Quantitative Charts

In [2]:
import pandas as pd
import numpy as np
from scipy import stats

In [2]:
# CREATE SOME SAMPLE DATA

## Histogram

## Line Chart

## Scatter Plot

----
##  Qualitative Charts

## Bar Chart

## Pie Chart

## Point Plot

## Categorical Scatter Plot

## Box Plot

## Violin Plot

----
## Numerical Methods of Descriptive Statistics 
Numerical analysis with descriptive statistics is used to describe the **central tendency** of a data set (the typical value within the data set) and how much the data varies around it. 

**Parameter** - a number that describes some aspect of a population, for example its mean, median, or mode. The same number calculated from a sample is called a **statistic**.

**Mean** - the simple average of the values in a population or sample: their sum divided by their count. 

**Median** - the center value of the sorted data. If the data set length is odd, the median is the central value; if the length is even, the median is the average of the two central data points.

**IMPORTANT NOTE:**
The median is generally resistant to extreme (outlier) values whereas the mean is not. Therefore, in data sets with wide ranges, the median is often the best choice for measuring central tendency. 

**Mode** - the value that appears the most. 

**Min** - the smallest value in a data set. 

**Max** - the largest value in a data set. 

**Range** - The largest value minus the smallest value in a data set.

**Variance** - The average of the squared differences from the mean (variance describes how much a random variable differs from its expected value)

**Standard Deviation (STD)** - Measures how far the data as a whole deviates from the mean, and is calculated as the positive square root of the variance.

**Coefficient of Variation (aka Relative Standard Deviation)** - a standardized measure of the dispersion of a frequency or probability distribution. This measure tells how varied a set of data is relative to its mean. It is calculated by:

* **(STD / mean) * 100**

The Coefficient of Variation is useful when comparing separate samples or populations with different means and standard deviations. Below are some examples. 

<img src='data/images/coeff_var.PNG'>
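
The numerical measures defined above can be computed with NumPy as a rough sketch. The `data` array is a hypothetical sample made up for this example. Note that `var()` and `std()` divide by n by default (the population form described above); passing `ddof=1` divides by n - 1, which is the usual choice for a sample.

```python
import numpy as np
from collections import Counter

data = np.array([2, 4, 4, 5, 7, 9, 11])  # hypothetical sample

mean = data.mean()                                   # sum of values / count
median = np.median(data)                             # center value of the sorted data
mode = Counter(data.tolist()).most_common(1)[0][0]   # most frequent value
data_range = data.max() - data.min()                 # largest minus smallest

pop_var = data.var()            # population variance (divide by n)
sample_var = data.var(ddof=1)   # sample variance (divide by n - 1)
sample_std = data.std(ddof=1)   # positive square root of the sample variance

cv = sample_std / mean * 100    # coefficient of variation, in percent

print(mean, median, mode, data_range)
print(round(pop_var, 3), sample_var, round(sample_std, 3), round(cv, 1))
```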


## Normal Distribution
One type of relative frequency curve (or probability distribution) describing a data set is the normal curve, which is a symmetrical, bell-shaped curve. If a population or sample can be described by the normal curve, we say that the data is normally distributed.

Note: A **Standard Normal Random Variable** is a normally distributed variable with mean equal to 0 and variance equal to 1

The **mean** of a normal distribution determines the location of the center of the graph and the **Standard Deviation** determines the spread, and therefore the height, of the graph (large std = short, wide curve; small std = tall, narrow curve)

The **Central Limit Theorem** states that the sampling distribution of the mean of independent random variables (with finite variance) will be normal or nearly normal if the sample size is large enough, regardless of the shape of the underlying distribution

**Empirical Rule for Normal Distribution** - states that for a normal distribution, nearly all of the data will fall within three standard deviations of the mean. It states:
* 68.26% of a population's measurements will fall within 1 STD of the mean. 
* 95.44% of a population's measurements will fall within 2 STD of the mean.
* 99.73% of a population's measurements will fall within 3 STD of the mean. 

**Note that the Empirical Rule only holds for normal, bell-shaped distributions.**


<img src='data/images/emp_rule.jpg'>
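
The Empirical Rule percentages can be checked against the standard normal distribution. The sketch below uses `NormalDist` from Python's standard library (`scipy.stats.norm` offers an equivalent `cdf`). The exact shares round to 68.27%, 95.45%, and 99.73%; the slightly lower 68.26% and 95.44% quoted above come from rounded z-table values.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Share of the distribution lying within k standard deviations of the mean
shares = {k: standard_normal.cdf(k) - standard_normal.cdf(-k) for k in (1, 2, 3)}

for k, share in shares.items():
    print(f"within {k} STD of the mean: {share:.2%}")
```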

## Skewness & Kurtosis
**Skewness** - measures the asymmetry of a distribution around its mean. There is a formula to calculate skewness and the results can be interpreted as follows:
* skewness < -1 OR > 1, highly skewed
* skewness between -1 and -0.5 OR 0.5 and 1, moderately skewed
* skewness between -0.5 and 0.5, approximately symmetric

**Symmetrical Curve** - perfect bell shape with no skew.
* Mean = Median = Mode

**Left Skew (aka negative skew)** - has a disproportionately long tail to the left.
* Mean < Median < Mode 

**Right Skew (aka positive skew)** - has a disproportionately long tail to the right.
* Mean > Median > Mode

**Skewness Diagram**
<img src='data/images/skew.png'>

<br>
<br>

**Kurtosis** - a measure of whether a data distribution is heavy-tailed or light-tailed relative to a normal distribution.
* kurtosis = 3, tails match a normal distribution
* kurtosis > 3, fat-tailed distribution
* kurtosis < 3, thin-tailed distribution
* the generally acceptable range of -2 to 2 refers to excess kurtosis (kurtosis - 3), for which a normal distribution scores 0
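
As a rough sketch, skewness and kurtosis can be computed from their moment-based definitions with NumPy. The helper names `skewness` and `kurtosis` and both data lists are hypothetical, written for this example. `scipy.stats.skew` and `scipy.stats.kurtosis` provide library versions; note that `scipy.stats.kurtosis` reports excess kurtosis (kurtosis - 3) by default.

```python
import numpy as np

def skewness(values):
    """Third standardized moment: 0 for a perfectly symmetric data set."""
    x = np.asarray(values, dtype=float)
    d = x - x.mean()
    return np.mean(d ** 3) / np.mean(d ** 2) ** 1.5

def kurtosis(values):
    """Fourth standardized moment: 3 for a normal distribution."""
    x = np.asarray(values, dtype=float)
    d = x - x.mean()
    return np.mean(d ** 4) / np.mean(d ** 2) ** 2

symmetric = [1, 2, 3, 4, 5]          # hypothetical symmetric data
right_skewed = [1, 1, 2, 2, 3, 10]   # hypothetical data with a long right tail

print(skewness(symmetric), skewness(right_skewed))
print(kurtosis(symmetric), kurtosis(symmetric) - 3)  # kurtosis and excess kurtosis
```

With the thresholds listed above, the second data set scores above 1 and would be classed as highly skewed.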


## z-score (or standardized value)
A z-score is the number of standard deviations that x is from the mean:
* **z-score = (x - mean) / standard deviation**

Z-scores also denote relative location between samples. For example, values from two different populations or samples with the same z-score are therefore the same number of std's away from their respective means. Below is a z-score breakdown:

* z-score = 0; x is on the mean


* positive z-score; x > mean, so if z = 2, x is 2 std's above the mean


* negative z-score; x < mean, so if z = -1, x is 1 std below the mean 

To determine the exact percentage of values that fall below x, use a z-score table, but just knowing 1, 2, and 3 std's will give you a good idea of where the value lies.

A good reference link explaining z-scores and how to use them in depth:
https://www.statisticshowto.com/probability-and-statistics/z-score/


In [4]:
# Two methods for calculating outliers using z-score

# Method 1, Use Scipy Stats
#z = np.abs(stats.zscore(df))

# Set threshold to return values > 3 std's from the mean
#np.where(z > 3)

# To Get max or min vals, use numpy
#np.min(z)
#np.max(z)

#----------------------------------------------------
# Method 2, Manual Calculation with a Python list

#outliers=[]
#scores = list(scrdf['imdb_score'])

#for score in scores:
#    z_score = (score - 6.5)/1.0
#    if np.abs(z_score) > 3.0:
#        outliers.append(score)
#    else:
#        pass

#print(max(outliers))
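
The commented-out cell above depends on data frames that are not defined in this notebook. A self-contained sketch of the same idea follows; the `scores` array is synthetic (200 random values around 6.5 plus one planted extreme value of 12.0), so it is an assumption made for illustration rather than the original data.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Hypothetical scores: 200 values around 6.5, plus one planted extreme value
scores = np.append(rng.normal(loc=6.5, scale=1.0, size=200), 12.0)

# z-score = (x - mean) / standard deviation, computed for every value at once
z = (scores - scores.mean()) / scores.std()

# Flag values more than 3 std's away from the mean
outliers = scores[np.abs(z) > 3]

print(outliers)
```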

## Percentiles, Quartiles, and the Inter Quartile Range
A percentile is a value x such that a given percentage of a population or sample falls at or below it. 

In many cases, it can be helpful to divide a data set into four sections by percentile. The values that mark these sections are called Quartiles, and the distance between the 1st and 3rd Quartiles is the Inter Quartile Range (IQR = Q3 - Q1):

* 1st Quartile, the 25th percentile, or the value at or below which 25% of the data lies
* 2nd Quartile, the 50th percentile (THE MEDIAN), or the value at or below which 50% of the data lies. 
* 3rd Quartile, the 75th percentile, or the value at or below which 75% of the data lies. 
* 4th Quartile, the 100th percentile (Max Value); all other data points lie below this value. 

**IMPORTANT NOTE**
In general, percentiles are less susceptible to noise and outliers than other measures. In cases where a population or sample is highly skewed (either left or right), the standard deviation is often inflated by the long tail and becomes hard to interpret. In such cases, variation is best described using percentiles rather than the standard deviation.

The graphical display of the Quartiles is the Box Plot (aka Box and Whiskers); this plot is also good for detecting outliers.
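
Quartiles and the IQR can be computed with `np.percentile`, as in the sketch below. The `data` array is hypothetical. The 1.5 * IQR fences are the common box plot convention for flagging outliers; that rule is an assumption added here, not something stated above.

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])  # hypothetical data, one extreme value

q1, q2, q3 = np.percentile(data, [25, 50, 75])  # 1st, 2nd (median), and 3rd quartiles
iqr = q3 - q1                                   # Inter Quartile Range

# Common box plot convention: points beyond 1.5 * IQR from the box are outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = data[(data < lower_fence) | (data > upper_fence)]

print(q1, q2, q3, iqr)
print(outliers)
```

Note how the extreme value 50 barely moves the median, while it would pull the mean and standard deviation up sharply.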

## Covariance and Correlation
In many cases we want to analyze two separate variables (columns) from a population or sample in order to see if they move together or not. If two variables do move together in a straight-line pattern (that is, when one variable's values increase the other's increase or decrease proportionally) then they are said to have a **linear relationship.**

**Covariance** - a measure of the direction of linearity between variables x and y (ex: positive linearity, negative linearity). However, it does not indicate the strength of the linear relationship, because its magnitude depends on the units of x and y.  

**Correlation Coefficient** - measures both the direction and strength of a linear relationship between x and y. Correlation Coefficients will always be between -1 and 1. The scores are interpreted below:

* -1.0 = perfect negative linear relationship
* -0.7 = strong negative linear relationship
* -0.5 = moderate negative linear relationship
* -0.3 = weak negative linear relationship
*  0.0 = no linear relationship
*  0.3 = weak positive linear relationship
*  0.5 = moderate positive linear relationship
*  0.7 = strong positive linear relationship
*  1.0 = perfect positive linear relationship

Note that negative linear relationships move downhill from left to right whereas positive linear relationships move uphill from left to right. See image below:

<img src='data/images/corr_coeff.png' width="600">
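
A short NumPy sketch of both measures follows. The `x` and `y` arrays are hypothetical. `np.cov` and `np.corrcoef` each return a 2x2 matrix, and the off-diagonal entry is the value for the pair.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])   # hypothetical variable x
y = np.array([2, 4, 5, 4, 5])   # hypothetical variable y

cov_xy = np.cov(x, y)[0, 1]     # sample covariance (divides by n - 1)
r = np.corrcoef(x, y)[0, 1]     # correlation coefficient, always between -1 and 1

# The correlation coefficient is the covariance rescaled by both standard deviations
r_manual = cov_xy / (x.std(ddof=1) * y.std(ddof=1))

print(cov_xy, round(r, 4), round(r_manual, 4))
```

The positive covariance only says that x and y tend to rise together; the correlation of roughly 0.77 adds that the relationship is strong on the scale above.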



## Weighted Mean
Oftentimes values are combined using percentage weights. One simple example is a teacher who has a weighted grading system like so:

* homework = 25%
* quiz = 10%
* tests = 30%
* final exam = 35%

Note that the weights above add up to 100% of the final grade. Because not all values are treated equally, we can't use a simple average to compute the final grade. 

For example, say a student gets these grades:

* homework = 88
* quiz = 71
* tests = 97
* final exam = 90

Taking the simple average would give us: 

$ (88 + 71 + 97 + 90) / 4 = 86.5 $ 

However, this is incorrect because of the unequal weighting of each category. To take the weights into account, the averaging formula must be altered to:

$ \sum (w * x) / \sum w $ 

Which is:

$ ((88 * 25) + (71 * 10) + (97 * 30) + (90 * 35)) / 100 = 89.7 $

Notice that the weighted score is about 3 points higher. 

In [5]:
#Calculating weighted mean

weights = np.array([25, 10, 30, 35])
s1_grades = np.array([88, 71, 97, 90])

sum(weights * s1_grades) / sum(weights)

89.7

## Grouped Data
Central tendency and variability are usually measured using individual values from within a population or sample. However, oftentimes the only data available is in the form of a frequency distribution bar chart or histogram. 

Data summarized in frequency distribution or histogram form are often called grouped data. 

Suppose you have a frequency distribution summarizing a sample of 65 customer satisfaction ratings for a product like so:
* Ratings : Frequency
* 36 - 38 : count 4
* 39 - 41 : count 15
* 42 - 44 : count 25
* 45 - 47 : count 19
* 48 - 50 : count 2

Because we do not know each of the 65 individual ratings, we cannot compute the exact mean satisfaction rating. But we can calculate an approximation of this mean. 

To do so, the midpoint of each ratings class is used to represent the measurements in that class. The midpoints are multiplied by their class frequencies, then summed together and divided by the total number of samples. Here's the formula:

$ \sum $ **(freq * midpoint) / total samples**


<img src='data/images/mean.jpg' width="600">


The Variance can also be approximated in a similar manner:

<img src='data/images/var.jpg' width="600">
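
The grouped-data approximation can be sketched with NumPy using the ratings table above. The variance line assumes the sample form, which divides by n - 1; if the formula in the image divides by n instead, adjust the denominator accordingly.

```python
import numpy as np

# Class limits and frequencies from the ratings table above
lower = np.array([36, 39, 42, 45, 48])
upper = np.array([38, 41, 44, 47, 50])
freq = np.array([4, 15, 25, 19, 2])

midpoints = (lower + upper) / 2   # each class is represented by its midpoint
n = freq.sum()                    # total number of ratings

grouped_mean = np.sum(freq * midpoints) / n

# Assumed sample form: squared deviations weighted by frequency, divided by n - 1
grouped_var = np.sum(freq * (midpoints - grouped_mean) ** 2) / (n - 1)

print(n, grouped_mean, grouped_var)
```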




## Random Samples

In [1]:
# using Pandas
#c6_s1 = c6.sample(n=700, replace=False)  # replace=False samples without replacement, so no row is drawn twice

# np.random.choice() also creates random samples from a population

---
# Part III - Probability

---

For an in-depth dive into probability, including formulas and examples, see the notebook for Harvard CS50 Intro to AI Part III - Uncertainty. Only the very basics are touched on below.

**Probability** - the ratio of desired outcomes to total outcomes (desired outcomes / total outcomes). The probabilities of all outcomes always sum to 1

**Probability In Practice** - There are 3 terms used when calculating probability:
1. **Random Experiment** - an experiment or process for which the outcome cannot be predicted with certainty
2. **Sample Space** - The entire possible set of outcomes of a random experiment 
3. **Event** - One or more outcomes of an experiment (a subset of the Sample Space)

**Example - Rolling a Die**
1. **Random Experiment** - Rolling a die
2. **Sample Space** - All values on the die faces (1 - 6 with a standard die)
3. **Event** - Getting a 2 when rolling the die, which has a probability of 1/6
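
The die example can be written out as a small sketch using sets and exact fractions. The `probability` helper is a hypothetical function written for this illustration, and it assumes every outcome in the sample space is equally likely.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}   # all outcomes of rolling a standard die

def probability(event, space):
    """Desired outcomes / total outcomes, assuming equally likely outcomes."""
    return Fraction(len(event & space), len(space))

p_two = probability({2}, sample_space)          # event: rolling a 2
p_even = probability({2, 4, 6}, sample_space)   # event: rolling an even number

# The probabilities of all single outcomes sum to 1
total = sum(probability({outcome}, sample_space) for outcome in sample_space)

print(p_two, p_even, total)
```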

There are two types of **Events**:
1. **Disjoint Events** - events that do not have any common outcomes (A single card cannot be a king and a queen, a man can not be dead and alive)
2. **Non-Disjoint Events** - events that can have common outcomes (a student can get 100 on a math test and 100 on an English test)

## Three Types of Probability:
1. **Marginal Probability (aka Unconditional Probability)** is the probability of the occurrence of a single event (example: the probability of drawing a heart from a deck of cards would be 13/52)
2. **Joint Probability** is the probability of two events happening at the same time 
3. **Conditional Probability** is the probability of an event or outcome occurring given that another event or outcome has already occurred

**A Good example of each is given in the video at: 1:35:56**
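
The three types can also be illustrated with a standard 52-card deck. This is a sketch under the assumption that every card is equally likely to be drawn; the `prob` helper and the choice of events (heart, king) are made up for the example.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = list(product(ranks, suits))   # 52 (rank, suit) pairs

def prob(predicate, cards=deck):
    """Share of cards that satisfy the predicate."""
    return Fraction(sum(1 for card in cards if predicate(card)), len(cards))

# Marginal: probability of a single event (drawing a heart)
p_heart = prob(lambda card: card[1] == "hearts")

# Joint: probability of two events at the same time (a king AND a heart)
p_king_and_heart = prob(lambda card: card[0] == "K" and card[1] == "hearts")

# Conditional: probability of a king GIVEN that the card is a heart
p_king_given_heart = p_king_and_heart / p_heart

print(p_heart, p_king_and_heart, p_king_given_heart)
```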

## Bayes' Theorem
Bayes' theorem (alternatively Bayes' law or Bayes' rule) describes the probability of an event, based on prior knowledge of conditions that might be related to the event.
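
In symbols, the theorem reads $ P(A|B) = P(B|A) * P(A) / P(B) $. The sketch below applies it to a hypothetical medical test; the 1% prevalence, 95% true positive rate, and 5% false positive rate are invented numbers used only to show the mechanics.

```python
from fractions import Fraction

# Hypothetical inputs, chosen only for illustration
p_disease = Fraction(1, 100)             # prior: P(disease)
p_pos_given_disease = Fraction(95, 100)  # P(positive | disease)
p_pos_given_healthy = Fraction(5, 100)   # P(positive | no disease)

# Total probability of a positive test, P(positive)
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Bayes' theorem: P(disease | positive)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos

print(p_disease_given_pos, float(p_disease_given_pos))
```

Even with a fairly accurate test, the posterior stays low here because the prior (the prevalence) is small, which is the kind of prior knowledge the theorem accounts for.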

## Probability Density Function
**Probability Density Function** is the equation describing a continuous probability distribution (a Bell Curve or Normal Distribution is an example of this). There are 3 properties of a PDF:
1. A PDF graph will be continuous over a range
2. The area bounded by the curve of a density function and the x-axis is equal to 1
3. The probability that a random variable assumes a value between a & b is equal to the area under the PDF bounded by a & b
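
Properties 2 and 3 can be checked numerically for the standard normal PDF. The sketch below approximates the area with a simple midpoint rule; `area_under_pdf` is a hypothetical helper, and the range -8 to 8 stands in for the whole x-axis because the density outside it is negligible.

```python
from statistics import NormalDist

dist = NormalDist(mu=0, sigma=1)   # standard normal distribution

def area_under_pdf(dist, a, b, steps=10_000):
    """Approximate the area under dist.pdf between a and b (midpoint rule)."""
    width = (b - a) / steps
    return sum(dist.pdf(a + (i + 0.5) * width) for i in range(steps)) * width

total_area = area_under_pdf(dist, -8, 8)    # property 2: should be close to 1
p_between = area_under_pdf(dist, -1, 1)     # property 3: P(-1 < X < 1) as an area
p_from_cdf = dist.cdf(1) - dist.cdf(-1)     # the same probability from the CDF

print(round(total_area, 6), round(p_between, 6), round(p_from_cdf, 6))
```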

---
# Part IV - Inferential Statistics

---
Descriptive statistics describe data (for example, a chart or graph) while **inferential statistics** allows you to make predictions ("inferences") from that data. With inferential statistics, you take data from samples and make generalizations about a population, for example through point estimation. 

**Point Estimation** - the use of sample data to compute a single value which serves as the best estimate of an unknown population parameter. 

There are a number of methods used for determining a best estimate including:
1. **Method of Moments** - Estimates are found by equating the first k sample moments to the corresponding k population moments
2. **Maximum Likelihood** - Uses a model and the values in the model to maximize a likelihood function. This results in the most likely parameter for the inputs selected
3. **Bayes' Estimators** - Minimizes the average risk (an expectation of random variables)
4. **Best Unbiased Estimators** - Several unbiased estimators can be used to approximate a parameter (which one is 'best' depends on what parameter we are trying to find)

## Interval Estimate
By far the most used estimation method is the **Interval Estimate**, which uses an interval (or range of values) to estimate a population parameter (it usually has lower and upper confidence limits)

**Confidence Interval** - an interval estimate paired with a confidence level, which expresses how confident we are that the interval contains the population parameter (for example, the population mean). Statisticians use the CI to describe the amount of uncertainty associated with a sample estimate of a population parameter

### Confidence Interval Estimation Use Case

<img src='data/images/stat1.png'>

### Example
<img src='data/images/stat2.png'>

**Sampling Error** is the difference between the point estimate and the actual population parameter value 

**Margin Of Error** is the greatest possible distance between the point estimate and the value of the parameter it is estimating

### Margin of Error Use Case
Note that the 1.96 is the critical **Z-Score** for a 95% confidence level, taken from a z-score table: 95% of a standard normal distribution lies within 1.96 standard deviations of the mean
<img src='data/images/stat3.png'>
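
A margin of error and confidence interval for a mean can be sketched as follows. This assumes a known population standard deviation and a normally distributed sample mean; the sample mean, standard deviation, and sample size are hypothetical numbers and are not taken from the images above.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical inputs, chosen only for illustration
sample_mean = 50.0   # point estimate of the population mean
pop_std = 8.0        # assumed known population standard deviation
n = 64               # sample size
confidence = 0.95    # confidence level

# Critical z-score: leaves (1 - confidence) / 2 in each tail (about 1.96 for 95%)
z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)

margin_of_error = z_crit * pop_std / sqrt(n)
ci_lower = sample_mean - margin_of_error
ci_upper = sample_mean + margin_of_error

print(round(z_crit, 2), round(margin_of_error, 2))
print(round(ci_lower, 2), round(ci_upper, 2))
```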


## Hypothesis Testing
Statisticians use hypothesis testing to formally check whether a hypothesis should be rejected or not, based on sample data. Hypothesis testing is conducted in the following manner:
1. **State the Hypothesis** - This stage involves stating the null and alternative hypotheses
2. **Formulate an Analysis Plan** - This stage involves the construction of an analysis plan
3. **Analyze Sample Data** - This stage involves the calculation and interpretation of the test statistics as described in the analysis plan
4. **Interpret Results** - This stage involves the application of the decision rule described in the analysis plan

**Null Hypothesis** - the default assumption that there is no effect or no difference; it is retained when the test result is not significantly different from that assumption

**Alternate Hypothesis** - the competing claim, which is supported when the test result provides enough evidence to reject the null hypothesis
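
The four steps can be sketched as a two-sided, one-sample z-test. This is one simple test among many, shown under the assumption of a known population standard deviation; the hypothesized mean, sample values, and significance level `alpha` are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

# 1. State the hypotheses (hypothetical values)
mu_0 = 100.0          # null hypothesis: the population mean equals 100
                      # alternate hypothesis: the population mean differs from 100

# 2. Analysis plan: two-sided z-test at the 5% significance level
alpha = 0.05

# 3. Analyze sample data
sample_mean = 103.0   # hypothetical sample mean
pop_std = 10.0        # assumed known population standard deviation
n = 50                # sample size
z_stat = (sample_mean - mu_0) / (pop_std / sqrt(n))
p_value = 2 * (1 - NormalDist().cdf(abs(z_stat)))

# 4. Interpret results: reject the null hypothesis if the p-value is below alpha
reject_null = p_value < alpha

print(round(z_stat, 3), round(p_value, 4), reject_null)
```

A result that does not reject the null hypothesis is not proof that it is true; it only means the sample did not provide enough evidence against it.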


