<a name='home'></a>
# Table of Contents

1. [Exploratory Data Analysis](#eda)
2. [Data and Sampling Distribution](#samp)
3. [Statistical Experiments and Significance Testing](#exp)
4. [Regression and Prediction](#reg)
5. [Classification](#class)  
6. [Statistical Machine Learning](#ml)


#### Appendix:
* [Examples of Code](https://github.com/andrewgbruce/statistics-for-data-scientists)
* [Safari Books](https://www.oreilly.com/)
* [TheOnlineStatBook project](http://onlinestatbook.com/2/index.html)

<a name='eda'></a>
## Exploratory Data Analysis
* Two basic types of structured data
    * numeric
        * continuous
        * discrete
    * categorical
        * fixed set of values, e.g. state names
        * binary data, e.g. 0/1, true/false, yes/no
        * ordinal data: categorical data with an explicit ordering

The data type helps determine the appropriate type of visual display, data analysis, or statistical model.

### Rectangular Data
Rectangular data is essentially a two-dimensional matrix with rows indicating records (cases) and columns indicating features (variables).
* Data Frame: Rectangular data 
* Feature: A column in the table 
* Outcome: Dependent variable
* Record: A row in the table

### Nonrectangular Data Structures
* Time Series: successive measurements of the same variable
* Spatial data structure: object representation by its spatial coordinates
* Graph data structure: representation of physical, social and abstract relationships

### Estimates of Location
* Mean: The sum of all values divided by the number of values  
    MEAN = $\overline{x} = \frac{\sum_{i=1}^{n} x_{i}}{n}$
    
* Trimmed Mean: The average of all values after dropping a fixed number (p) of the smallest and largest values, where $x_{(i)}$ denotes the sorted values  
    T_MEAN = $\overline{x} = \frac{\sum_{i=p+1}^{n-p} x_{(i)}}{n-2p}$
    
* Weighted Mean: The sum of all values times a weight divided by the sum of the weights  
    W_MEAN = $\overline{x_{w}} = \frac{\sum_{i=1}^{n} w_{i} x_{i}}{\sum_{i=1}^{n}{w_{i}}}$
    
* Median: The value such that one-half of the data lies above and one-half below
* Weighted Median: The value such that one-half of the sum of the weights lies above and one-half below in the sorted data
* Robust: Not sensitive to extreme values
* Outlier: A data value that is very different from most of the data
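
The estimates of location above can be sketched with the Python standard library. The helper names `trimmed_mean`, `weighted_mean` and `weighted_median` and the sample data are illustrative assumptions, not part of the original notes.

```python
import statistics

def trimmed_mean(values, p):
    """Mean after dropping the p smallest and the p largest values."""
    ordered = sorted(values)
    kept = ordered[p:len(ordered) - p]
    return sum(kept) / len(kept)

def weighted_mean(values, weights):
    """Sum of value * weight divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

def weighted_median(values, weights):
    """First sorted value at which the cumulative weight reaches half the total weight."""
    half = sum(weights) / 2
    cumulative = 0.0
    for value, weight in sorted(zip(values, weights)):
        cumulative += weight
        if cumulative >= half:
            return value

# Hypothetical data: 100 is an outlier
data = [2, 3, 3, 4, 5, 6, 100]
weights = [1, 1, 1, 1, 1, 1, 1]

mean_value = statistics.mean(data)      # pulled upwards by the outlier
median_value = statistics.median(data)  # robust to the outlier
trimmed = trimmed_mean(data, p=1)       # drops 2 and 100
```

The mean is strongly affected by the single outlier, whereas the median and the trimmed mean stay close to the bulk of the data, which is what "robust" means here.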

### Estimates of Variability
Variability measures whether the data values are tightly clustered or spread out.
* Deviation: The difference between the observed values and the estimate of location
* Variance: The sum of squared deviations from the mean divided by n-1  
    Variance = $s^{2} = \frac{\sum_{i=1}^{n} (x_{i} - \overline{x})^{2}}{n-1}$
    
* Standard deviation: The square root of the variance  
    StdDev = $s = \sqrt{Variance}$
* Mean absolute deviation: The mean of the absolute values of the deviations from the mean  
    Mean Absolute Deviation = $\frac{\sum_{i=1}^{n} |x_{i} - \overline{x}|}{n}$
    
* Median absolute deviation from the median (MAD): The median of the absolute values of the deviations from the median
* Range: The difference between the largest and the smallest value in the data set
* Order statistics: Metrics based on the values sorted from the smallest to biggest
* Percentile: The value such that P percent of the values take on this value or less and (100-P) percent take on this value or more  
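
A minimal sketch of these variability estimates, using only the standard library. The data are made up, and `statistics.quantiles` uses its default interpolation method, which is only one of several common percentile definitions.

```python
import statistics

data = [1, 2, 3, 4, 10]   # hypothetical values

mean = statistics.mean(data)
median = statistics.median(data)

variance = statistics.variance(data)   # sample variance, divides by n - 1
std_dev = statistics.stdev(data)       # square root of the sample variance

# Mean absolute deviation from the mean
mean_abs_dev = sum(abs(x - mean) for x in data) / len(data)

# Median absolute deviation from the median (MAD)
median_abs_dev = statistics.median([abs(x - median) for x in data])

value_range = max(data) - min(data)

# Quartiles, i.e. the 25th, 50th and 75th percentiles
quartiles = statistics.quantiles(data, n=4)
```

Variance and standard deviation react strongly to the value 10, while the MAD is far less affected.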

### Exploring Data Distribution
* Boxplot: A plot based on percentiles that gives a quick way to visualise the distribution of data  
    ![Boxplot example](https://upload.wikimedia.org/wikipedia/commons/thumb/5/55/Box-Plot_mit_Min-Max_Abstand.png/440px-Box-Plot_mit_Min-Max_Abstand.png)
    
* Frequency table: A frequency table of a variable divides the variable range into equally spaced segments and tells how many values fall in each segment  
* Histogram: A histogram is a way to visualise a frequency table  
    ![Histogram example](https://upload.wikimedia.org/wikipedia/commons/1/1d/Example_histogram.png)
* Density plot: Shows the distribution of data values as a continuous line  
    ![Density plot example](https://upload.wikimedia.org/wikipedia/commons/thumb/5/56/Gumbel_distribtion.png/600px-Gumbel_distribtion.png)

### Exploring Binary and Categorical Data
* Bar Chart: The frequency or proportion for each category plotted as bars  
    Displaying a single categorical variable
* Pie Chart: The frequency or proportion for each category plotted as wedges in a pie
* Mode: The value that appears most often in the data
* Expected Value: A form of weighted mean in which the weights are probabilities  
    $E(X) = \sum_{i=1}^{n} x_{i} p(x_{i})$
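
As a small illustration, the mode and the expected value can be computed as follows. The categories, payoffs and probabilities are invented for the example.

```python
from collections import Counter

# Hypothetical categorical data
categories = ["a", "b", "b", "c", "b", "a"]
counts = Counter(categories)
mode = counts.most_common(1)[0][0]          # most frequent category

# Hypothetical outcomes x_i with probabilities p(x_i) that sum to 1
outcomes = [0, 100, 500]
probabilities = [0.5, 0.4, 0.1]
expected_value = sum(x * p for x, p in zip(outcomes, probabilities))
```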
    
### Correlation
* Correlation coefficient: A metric that measures the extent to which numeric variables are associated with one another (range -1 to +1)  
    r = $\frac{\sum_{i=1}^{N} (x_{i} - \overline{x})(y_{i} - \overline{y})}{(N-1)s_{x}s_{y}}$  
    The correlation coefficient is sensitive to outliers.
* Correlation matrix: A table where the variables are shown on rows and columns and the cell values are the correlations between the variables  
    Variables can have an association that is not linear, in which case the correlation coefficient is not a useful metric  
    ![Correlation matrix example](https://www.displayr.com/wp-content/uploads/2018/07/rsz_correlation_matrix_3.png)
* Scatterplot: A plot in which the x-axis is the value of one variable and the y-axis the value of another  
    ![Scatterplot example](https://upload.wikimedia.org/wikipedia/commons/4/4b/Mpl_example_scatter_plot.svg)
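
The correlation coefficient formula above translates directly into code. This is a sketch with made-up data; the function name `pearson_r` is an assumption.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient using sample standard deviations (n - 1)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_x = math.sqrt(sum((xi - mean_x) ** 2 for xi in x) / (n - 1))
    s_y = math.sqrt(sum((yi - mean_y) ** 2 for yi in y) / (n - 1))
    cross = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return cross / ((n - 1) * s_x * s_y)

x = [1, 2, 3, 4, 5]
y_linear = [2, 4, 6, 8, 10]       # perfectly linear in x
y_outlier = [2, 4, 6, 8, -30]     # same data with one extreme value

r_linear = pearson_r(x, y_linear)
r_outlier = pearson_r(x, y_outlier)
```

A single extreme value is enough to move the coefficient away from 1, which illustrates the sensitivity to outliers.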

### Further reading
* [Data type taxonomy for R](http://www.r-tutor.com/r-introduction/basic-data-types)
* [Data type taxonomy for data bases](https://www.w3schools.com/sql/sql_datatypes.asp)
* [Data Frames in R](https://stat.ethz.ch/R-manual/R-devel/library/base/html/data.frame.html)
* [Data Frames in Python](https://pandas.pydata.org/pandas-docs/stable/getting_started/dsintro.html#dataframe)
* [Boxplot](http://www.oswego.edu/~srp/stats/bp_con.htm)
* [Density estimation in R](https://vita.had.co.nz/papers/density-estimation.html)
* [Modern Data Science](https://www.crcpress.com/Modern-Data-Science-with-R/Baumer-Kaplan-Horton/p/book/9781498724487)
* [Ggplot2: Elegant Graphics for Data Analysis](https://www.springer.com/gp/book/9783319242750)

[Home](#home)

<a name="samp"></a>
## Data and Sampling Distribution

### Random Sampling and Sample Bias

* Sample: A subset from a larger data set
* Population: The larger data set or idea of a data set
* N (n): The size of the population (sample)
* Random sampling: Drawing elements into a sample at random
* Stratified sampling: Dividing the population into strata and randomly sampling from each stratum
* Simple random sample: The sample that results from random sampling without stratifying the population
* Sample bias: A sample that misrepresents the population

Random sampling is the process in which each available member of the population being sampled has an equal chance of being chosen for the sample at each draw.
Sampling can be done:
* with replacement
* without replacement
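
The sampling schemes can be sketched with the `random` module. The population, the strata names and the sample sizes are hypothetical.

```python
import random

rng = random.Random(42)              # fixed seed so the sketch is reproducible
population = list(range(1, 101))     # hypothetical population of 100 members

# Simple random sample without replacement: no member can appear twice
without_replacement = rng.sample(population, k=10)

# Sampling with replacement: the same member may be drawn more than once
with_replacement = rng.choices(population, k=10)

# Stratified sampling: split the population into strata, then sample each stratum
strata = {
    "low": [m for m in population if m <= 60],
    "high": [m for m in population if m > 60],
}
stratified = {name: rng.sample(members, k=5) for name, members in strata.items()}
```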

Data quality in data science involves:
* completeness
* consistency of format
* cleanliness
* accuracy of individual points

Statistics adds the notion of representativeness.

#### Bias
Sample bias becomes a problem when it is nonrandom, i.e. meaningful and likely to continue for other samples drawn in the same way. A <b>statistical bias</b> refers to measurement or sampling errors that are systematic and produced by the measurement or sampling process. It is often an indication that a statistical model has been misspecified, e.g. that an important variable was left out.  

<b>Selection bias</b> refers to the practice of selectively choosing data in a way that leads to a conclusion that is misleading or ephemeral.  
* Bias: Systematic error
* Data snooping: Extensive hunting through data in search of something interesting
* Vast search effect: Bias or nonreproducibility resulting from repeated data modeling, or modeling with large numbers of predictor variables
* others: cherry-picking data, selection of time intervals that accentuate a particular effect, stopping an experiment when the results look interesting

#### Regression to the Mean
Regression to the mean refers to a phenomenon involving successive measurements on a given variable: extreme observations tend to be followed by more central ones.

![Galton's regression to the mean](https://www.researchgate.net/profile/Yeming_Ma2/publication/280970132/figure/fig1/AS:284517131669510@1444845578444/Rate-of-regression-in-hereditary-stature-Galton-1886-Plate-IX-fig-a-The-short.png)

### Sampling Distribution of a Statistic
* Sample statistic: A metric calculated for a sample of data drawn from a larger population
* Data distribution: The frequency distribution of individual values in a data set
* Sampling distribution: The frequency distribution of a sample statistic over many samples or re-samples
* Central limit theorem: The tendency of the sampling distribution to take on a normal shape as the sample size increases
* Standard error: the variability (standard deviation) of a sample statistic over <b>many samples</b>

The <b>Central Limit Theorem</b> says that the means drawn from multiple samples will resemble the familiar bell-shaped normal curve, even if the source population is not normally distributed, provided the sample size is large enough.

The <b>Standard Error</b> is a single metric that sums up the variability in the sampling distribution for a statistic. For the sample mean, with sample standard deviation $s$ and sample size $n$:  
$SE = \frac{s}{\sqrt{n}}$  

As the sample size increases, the standard error decreases: to reduce the standard error by a factor of 2, the sample size must be increased by a factor of 4.
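
The following simulation is a sketch of both ideas: it draws many samples from a skewed (exponential) population, collects the sample means, and compares their spread with $s/\sqrt{n}$. The population, sample size and number of samples are arbitrary assumptions.

```python
import math
import random
import statistics

def standard_error(std_dev, n):
    """Standard error of the mean for standard deviation std_dev and sample size n."""
    return std_dev / math.sqrt(n)

rng = random.Random(0)
SAMPLE_SIZE = 40
N_SAMPLES = 2000

# Exponential population with rate 1: mean 1, standard deviation 1, strongly skewed
sample_means = []
for _ in range(N_SAMPLES):
    sample = [rng.expovariate(1.0) for _ in range(SAMPLE_SIZE)]
    sample_means.append(statistics.fmean(sample))

observed_se = statistics.stdev(sample_means)       # spread of the sampling distribution
theoretical_se = standard_error(1.0, SAMPLE_SIZE)  # 1 / sqrt(40)

# Quadrupling the sample size halves the standard error
se_n25 = standard_error(10, 25)
se_n100 = standard_error(10, 100)
```

A histogram of `sample_means` would look roughly bell-shaped even though the underlying population is not.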

The <b>Bootstrap</b> draws additional samples with replacement from the sample itself and recalculates the statistic or model for each resample.
* Bootstrap sample: A sample taken with replacement from an observed data set
* Resampling: The process of taking repeated samples from observed data; the bootstrap does so with replacement

The bootstrap can be used with <b>multivariate data</b>, where the rows are sampled as units.
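
A minimal bootstrap sketch: resample the observed data with replacement, recompute the statistic each time, and use the spread of the results as an estimate of the standard error. The function name, the data and the number of resamples are assumptions.

```python
import random
import statistics

def bootstrap(sample, statistic, n_resamples=1000, seed=0):
    """Return the statistic computed on n_resamples bootstrap samples."""
    rng = random.Random(seed)
    n = len(sample)
    results = []
    for _ in range(n_resamples):
        resample = rng.choices(sample, k=n)   # draw n values with replacement
        results.append(statistic(resample))
    return results

# Hypothetical observed sample
sample = [12, 15, 9, 20, 14, 18, 11, 25, 13, 16]

boot_medians = bootstrap(sample, statistics.median)
bootstrap_se = statistics.stdev(boot_medians)   # bootstrap estimate of the standard error
```

For multivariate data, `sample` would be a list of rows, so that each row is drawn as a unit.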

### Confidence Interval
* Confidence level: The percentage of confidence intervals, constructed in the same way from the same population, expected to contain the statistic of interest
* Interval endpoints: The top and bottom of the confidence interval

The percentage associated with the confidence interval is termed the level of confidence. Strictly speaking a 95% confidence interval means that if we were to take 100 different samples and compute a 95% confidence interval for each sample, then approximately 95 of the 100 confidence intervals will contain the true mean value (μ).
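
The "95 out of 100 intervals" interpretation can be checked by simulation. The sketch below assumes a normal population with a known true mean and builds each interval with the normal approximation (mean ± z · standard error); both the population parameters and the choice of interval method are assumptions made for illustration.

```python
import math
import random
import statistics

rng = random.Random(1)
TRUE_MEAN, TRUE_SD = 50.0, 10.0     # hypothetical population
SAMPLE_SIZE, REPETITIONS = 100, 1000

z = statistics.NormalDist().inv_cdf(0.975)   # about 1.96 for a 95% interval

covered = 0
for _ in range(REPETITIONS):
    sample = [rng.gauss(TRUE_MEAN, TRUE_SD) for _ in range(SAMPLE_SIZE)]
    mean = statistics.fmean(sample)
    se = statistics.stdev(sample) / math.sqrt(SAMPLE_SIZE)
    lower, upper = mean - z * se, mean + z * se   # interval endpoints
    if lower <= TRUE_MEAN <= upper:
        covered += 1

coverage = covered / REPETITIONS   # share of intervals that contain the true mean
```

The observed coverage should be close to, but not exactly, 0.95.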

### Normal Distribution
* Error: The difference between a data point and a predicted or average value
* Standardise: Subtract the mean and divide by the standard deviation
* z-score: The result of standardising an individual data point
* standard normal: A normal distribution with mean = 0 and standard deviation = 1
* QQ-Plot: A plot to visualise how close a sample distribution is to a normal distribution

A <b>standard normal distribution</b> is one in which the units on the x-axis are expressed in terms of standard deviations away from the mean. 
![Standard normal distribution (Gauss curve)](https://spss-tutorials.com/img/standard-normal-distribution-with-probabilities.png)

The standard normal distribution is centered at zero and the degree to which a given measurement deviates from the mean is given by the standard deviation. For the standard normal distribution, about 68% of the observations lie within 1 standard deviation of the mean, 95% lie within 2 standard deviations of the mean, and 99.7% lie within 3 standard deviations of the mean.

To compare data to a standard normal distribution, subtract the mean and then divide by the standard deviation. This is called normalisation or standardisation: $z = \frac{x - \overline{x}}{s}$
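
Standardisation takes only a few lines. The helper name `standardise` and the data are illustrative assumptions.

```python
import statistics

def standardise(values):
    """Convert values to z-scores: subtract the mean, divide by the standard deviation."""
    mean = statistics.fmean(values)
    std_dev = statistics.stdev(values)
    return [(x - mean) / std_dev for x in values]

data = [2, 4, 4, 4, 5, 5, 7, 9]     # hypothetical measurements
z_scores = standardise(data)
```

After standardisation the values have mean 0 and standard deviation 1, but the shape of the distribution is unchanged; standardising does not make data normally distributed.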

A <b>Q–Q (quantile-quantile) plot</b> is a probability plot, which is a graphical method for comparing two probability distributions by plotting their quantiles against each other. It is used to compare the shapes of distributions, providing a graphical view of how properties such as location, scale, and skewness are similar or different in the two distributions, e.g. when comparing a sample to a standard normal distribution.
![A QQ plot](https://upload.wikimedia.org/wikipedia/commons/thumb/0/08/Normal_normal_qq.svg/600px-Normal_normal_qq.svg.png)

### Long Tailed Distributions
* Tail: The long narrow portion of a frequency distribution, where relatively extreme values occur at low frequency
* Skew: Where one tail of a distribution is longer than the other

Raw data is usually not normally distributed. It can be
* skewed, i.e. asymmetric
* discrete (binomial data)
* and may have long tails

### Student's t-Distribution
The t-distribution is a normally shaped distribution but thicker and longer on the tails. The larger the sample, the more normally shaped the t-distribution becomes. 

* n: Sample size
* Degrees of freedom: A parameter that allows the t-distribution to adjust to different sample sizes, statistics and number of groups

In probability and statistics, Student's t-distribution (or simply the t-distribution) is any member of a family of continuous probability distributions that arises when estimating the mean of a normally distributed population in situations where the sample size is small and the population standard deviation is unknown.

![t-Distribution](https://upload.wikimedia.org/wikipedia/commons/thumb/4/41/Student_t_pdf.svg/650px-Student_t_pdf.svg.png)

### Binomial Distribution
* Trial: An event with a discrete outcome, e.g. 1 or 0
* Success: The outcome of interest, i.e. often 1, yes, true etc. but it does not imply that the outcome is desirable or beneficial
* Binomial: Having two outcomes
* Binomial trial: A trial with two outcomes (=Bernoulli trial)
* Binomial distribution: Distribution of the number of successes in *n* trials

The binomial distribution is the frequency distribution of the number of successes (*x*) in a given number of trials (*n*) with a specified probability (*p*) of success in each trial.

Mean of binomial = $n \cdot p$  
Variance = $n \cdot p(1-p)$  

A <b>Poisson Distribution</b> is the frequency distribution of the number of events per unit of time or space. Its key parameter $\lambda$ (lambda) is the mean number of events per unit; the variance of a Poisson distribution is also $\lambda$.
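
The binomial and Poisson probabilities can be computed from their standard probability mass functions, which are not written out in the notes above. The parameter values are arbitrary.

```python
import math

def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n trials with success probability p."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

def poisson_pmf(k, lam):
    """Probability of exactly k events when the mean number of events is lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

n, p = 10, 0.3
binom = [binomial_pmf(x, n, p) for x in range(n + 1)]
binom_mean = sum(x * q for x, q in enumerate(binom))                     # equals n * p
binom_var = sum((x - binom_mean) ** 2 * q for x, q in enumerate(binom))  # equals n * p * (1 - p)

lam = 2.0
poisson = [poisson_pmf(k, lam) for k in range(50)]   # the tail beyond 50 is negligible
poisson_mean = sum(k * q for k, q in enumerate(poisson))
poisson_var = sum((k - poisson_mean) ** 2 * q for k, q in enumerate(poisson))
```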

### Further reading
* [Identifying and Avoiding Bias in Research](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2917255/)
* [Fooled by Randomness Through Selection Bias](http://systemtradersuccess.com/fooled-by-randomness-through-selection-bias/)
* [The Black Swan](https://en.wikipedia.org/wiki/The_Black_Swan:_The_Impact_of_the_Highly_Improbable)
* [t-Distribution](http://onlinestatbook.com/2/estimation/t_distribution.html)

[Home](#home)

<a name="exp"></a>
## Statistical Experiments and Significance Testing
Design of experiments is a cornerstone of the practice of statistics. The goal is to design an experiment in order to confirm or reject a hypothesis.

Classical statistical inference "pipeline":
1. Formulate hypothesis
2. Design experiment
3. Collect data
4. Inference / Conclusions

A/B-Testing:
An A/B test is an experiment with two groups to establish which of two products, procedures or the like is superior. A proper A/B test has subjects that can be assigned to one treatment or another. Ideally, subjects are randomised to treatments.  
Inclusion of a control group greatly strengthens researchers’ ability to draw conclusions from a study. Indeed, only in the presence of a control group can a researcher determine whether a treatment under investigation truly has a significant effect on an experimental group, and the possibility of making an erroneous conclusion is reduced. Without a control group, there is no assurance that "all other things are equal" apart from the treatment.

* A blind study: a blind study is one in which the subjects are unaware of whether they are getting treatment A or B
* A double blind study: a double blind study is one in which the investigators and the facilitators, as well as the subjects, are unaware of which subjects are getting which treatment

In a standard A/B experiment you need to decide on the metrics ahead of time.

## Hypothesis Tests
The purpose of a hypothesis test is to learn whether random chance might be responsible for an observed effect.
* Null hypothesis: The hypothesis that chance is the cause
* Alternative hypothesis: Counterpoint to the null, i.e. what you hope to prove
* One-way test: Hypothesis test that counts chance results in one direction only
* Two-way test: Hypothesis test that counts chance results in both directions

Hypothesis tests use the following logic: an experiment requires proof that the difference between groups is more extreme than what chance might reasonably produce. This involves a baseline assumption that the treatments are equivalent and that any difference between groups is due to chance. The baseline assumption is termed the <b>null hypothesis</b>. 
* Two-way test: $H_0$: no difference between the means of group A and group B; $H_A$: $A \neq B$
* One-way test: $H_0$: $A \leq B$; $H_A$: $A > B$

## Resampling
Resampling in statistics means to repeatedly sample values from observed data, with the general goal of assessing random variability in a statistic. There are two main types of resampling procedures:
* bootstrap: used to assess the reliability of an estimate
* permutation: used to test hypotheses, typically involving two or more groups

### Permutation test
To permute means to change the order of a set of values. A permutation test follows these steps:
1. Combine the results from the different groups into a single data set
2. Shuffle the combined data, then randomly draw (without replacement) a resample of the same size as group A
3. From the remaining data, randomly draw (without replacement) a resample of the same size as group B
4. Repeat for groups C, D, etc.
5. Calculate the statistic of interest for the resamples and record it
6. Repeat the process R times to yield a permutation distribution of the test statistic

#### Exhaustive and Bootstrap Permutation Test
In an <b>exhaustive permutation test</b>, instead of randomly shuffling and dividing the data, we actually figure out and process all the possible combinations. This is practical only for relatively small sample sizes.
In a <b>bootstrap permutation test</b>, the draws are made with replacement. In this way the procedure also models the random element in the selection of subjects from a population.
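
A random permutation test for two groups can be sketched as follows. The statistic (difference in means), the two-sided p-value, the data and the number of permutations are assumptions chosen for illustration.

```python
import random

def mean(values):
    return sum(values) / len(values)

def permutation_test(group_a, group_b, n_permutations=5000, seed=0):
    """Return the observed difference in means and a two-sided permutation p-value."""
    rng = random.Random(seed)
    observed = mean(group_b) - mean(group_a)
    combined = list(group_a) + list(group_b)   # step 1: combine the groups
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(combined)                  # step 2: shuffle the combined data
        resample_a = combined[:n_a]            # resample the size of group A
        resample_b = combined[n_a:]            # step 3: the remainder, the size of group B
        diff = mean(resample_b) - mean(resample_a)
        if abs(diff) >= abs(observed):         # at least as extreme as observed
            extreme += 1
    return observed, extreme / n_permutations

# Hypothetical measurements for two treatments
group_a = [21, 23, 19, 24, 20, 22]
group_b = [30, 32, 29, 31, 33, 28]

observed_diff, p_value = permutation_test(group_a, group_b)
```

If the observed difference lies far out in the tail of the permutation distribution, chance is an unlikely explanation for it.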

## Statistical Significance
Statistical significance measures whether an experiment yields a result more extreme than what chance might produce.
* P-Value: Given a chance model that embodies the null hypothesis, the probability of obtaining results as unusual or extreme as the observed results
* Alpha: The probability threshold of "unusualness" that chance results must surpass for actual outcomes to be deemed statistically significant
* Type 1 error: Mistakenly concluding an effect is real, when it is due to chance
* Type 2 error: Mistakenly concluding an effect is due to chance when it is real

### P-Value
In statistical hypothesis testing, the p-value or probability value is the probability of obtaining test results at least as extreme as the results actually observed during the test, assuming that the null hypothesis is correct. 

The p-value is defined as the probability, under the null hypothesis $H$ (at times denoted $H_{0}$ as opposed to $H_{a}$ denoting the alternative hypothesis) about the unknown distribution $F$ of the random variable $X$, for the variate to be observed as a value equal to or more extreme than the value observed. If $x$ is the observed value, then depending on how we interpret it, the "equal to or more extreme than what was actually observed" can mean $ X \geq x$ (right-tail event), $X \leq x$ (left-tail event) or the event giving the smallest probability among $ X \geq x$ and $X \leq x$ (double-tailed event). Thus, the p-value is given by

* $\Pr(X \geq x|H)$ for a right-tail event,
* $\Pr(X \leq x|H)$ for a left-tail event,
* $2\min\{\Pr(X \leq x|H), \Pr(X \geq x|H)\}$ for a double-tail event.  

The smaller the p-value, the stronger the evidence that the hypothesis under consideration does not adequately explain the observation. The null hypothesis $H_0$ is rejected if any of these probabilities is less than or equal to a small, fixed but arbitrarily pre-defined threshold value $\alpha$, which is referred to as the level of significance. The threshold $\alpha$ is specified in advance, before the data are observed. 

![P-Value distribution](https://www.simplypsychology.org/p-value.png?ezimgfmt=rs:534x336/rscb18/ng:webp/ngcb18)

For data science, a p-value is a useful metric in situations where you want to know whether a model result that appears interesting and useful is within the range of normal chance variability.

### t-Test (Student's test)
The t-test is any statistical hypothesis test in which the test statistic follows a Student's t-distribution under the null hypothesis. The t-test tells you how significant the differences between groups are; in other words, it lets you know if those differences (measured in means/averages) could have happened by chance.
A standardised form of the test statistic must be used.
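
As one example of a standardised test statistic, the sketch below computes the two-sample t-statistic in its unequal-variance (Welch) form. The notes do not say which variant is meant, so this choice and the data are assumptions. Turning the statistic into a p-value requires the t-distribution, which is left out here.

```python
import math
import statistics

def welch_t_statistic(group_a, group_b):
    """Difference in means divided by its estimated standard error."""
    mean_a, mean_b = statistics.fmean(group_a), statistics.fmean(group_b)
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
    standard_error = math.sqrt(var_a / len(group_a) + var_b / len(group_b))
    return (mean_b - mean_a) / standard_error

# Hypothetical measurements for two treatments
group_a = [21, 23, 19, 24, 20, 22]
group_b = [30, 32, 29, 31, 33, 28]

t_stat = welch_t_statistic(group_a, group_b)
```

Dividing by the standard error is what makes the statistic comparable to a reference t-distribution regardless of the measurement scale.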

### Degrees of freedom
The concept applies to statistics calculated from sample data and refers to the number of values free to vary. Degrees of freedom is part of the standardisation calculation to ensure that your standardised data matches the appropriate reference distribution (t- or F-distribution).

### Analysis of Variance (ANOVA)
ANOVA is the statistical procedure that tests for a statistically significant difference in means among three or more groups in a sample. Basically, you’re testing groups to see if there’s a difference between them.

### F-Statistic
The F-statistic is simply a ratio of two variances. Variances are a measure of dispersion, or how far the data are scattered from the mean. Larger values represent greater dispersion. 
Variance is the square of the standard deviation. F-statistics are based on the ratio of mean squares. The term “mean squares” is simply an estimate of population variance that accounts for the degrees of freedom (DF) used to calculate that estimate.  
![F-Statistic_low_value](https://blog.minitab.com/hubfs/Imported_Blog_Media/low_f_dplot.png) ![F-Statistic_high_value](https://blog.minitab.com/hubfs/Imported_Blog_Media/high_f_dplot.png)

The higher the ratio, the more statistically significant the result.
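
For a one-way ANOVA the two mean squares are the between-group and the within-group mean square. The following sketch uses that standard decomposition; the function name and the data are assumptions.

```python
def f_statistic(groups):
    """One-way ANOVA F-statistic: between-group mean square / within-group mean square."""
    k = len(groups)                              # number of groups
    n = sum(len(g) for g in groups)              # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n
    group_means = [sum(g) / len(g) for g in groups]

    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)

    ms_between = ss_between / (k - 1)            # k - 1 degrees of freedom
    ms_within = ss_within / (n - k)              # n - k degrees of freedom
    return ms_between / ms_within

# Hypothetical measurements for three groups
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_value = f_statistic(groups)
```

Here the group means differ much more than the values within each group do, so the ratio is well above 1.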

### Chi-Square Test
The chi-square test is used with count data to test how well it fits some expected distribution.

The Pearson residual  
$ R = \frac {Observed - Expected} { \sqrt{Expected}}$  
$R$ measures the extent to which the actual counts differ from the expected counts.

The chi-square statistic is defined as the sum of the squared Pearson residuals, where $r$ and $c$ are the number of rows and columns:

$$ \chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} R_{ij}^2 $$ 

The chi-squared test ($\chi^2$) is used to determine whether there is a statistically significant difference between the expected frequencies and the observed frequencies in one or more categories of a so-called contingency table.

In the standard applications of this test, the observations are classified into mutually exclusive classes. If the so-called null hypothesis is true, the test statistic computed from the observations follows a $\chi^2$ distribution.  
The primary reason that the chi-square distribution is used extensively in hypothesis testing is its relationship to the normal distribution. Many hypothesis tests use a test statistic, such as the t-statistic in a t-test. For these hypothesis tests, as the sample size, n, increases, the sampling distribution of the test statistic approaches the normal distribution (central limit theorem). 
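
A sketch of the chi-square statistic for a contingency table, with expected counts derived from the row and column totals under the assumption of independence. The table values are invented.

```python
import math

def chi_square_statistic(observed):
    """Sum of squared Pearson residuals for a table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total   # count expected under independence
            residual = (count - expected) / math.sqrt(expected)
            chi2 += residual ** 2
    return chi2

# Hypothetical 2x2 contingency table
table = [[10, 20],
         [20, 10]]

chi2 = chi_square_statistic(table)
degrees_of_freedom = (len(table) - 1) * (len(table[0]) - 1)
```

The statistic is then compared with a chi-square distribution with the given degrees of freedom, or with a permutation distribution.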

<b>Interpret results</b>  
If the p-value (e.g. 0.0003) is less than the significance level (e.g. 0.05), we reject the null hypothesis and conclude that there is a relationship between the two categorical variables.

## Further reading
* Introductory Statistics and Analytics: A Resampling perspective by Peter Bruce
* [Chi-Square Test for Independence](https://stattrek.com/chi-square-test/independence.aspx)

[Home](#home)

<a name='reg'></a>
## Regression and Prediction
Perhaps the most common goal in statistics is to answer whether the variable X is associated with the variable Y and, if so, what the relationship is. Typical applications are:
* prediction of an outcome
* anomaly detection

### Simple Linear Regression
Simple linear regression (SLR) models the relationship between the magnitude of one variable and that of a second. Correlation is another way to measure how two variables are related. Whilst correlation measures the strength of an association between two variables, regression quantifies the nature of the relationship.

**Fitted values and residuals:**  
In general, the data do not fall exactly on a line, so the regression equation includes an explicit error term $e_i$:  

$Y_i = b_0 + b_1X_i + e_i$  

The fitted values, also referred to as *predicted* values, are typically denoted $\hat{Y_i}$. We compute the residuals $\hat{e_i}$ by subtracting the predicted values from the original data:  

$\hat{e_i} = Y_i - \hat{Y_i}$  

**Least Squares**  
In practice, the regression line is the estimate that minimises the sum of squared residual values, called the *residual sum of squares (RSS)*.  

$RSS = \sum_{i=1}^{n}{(Y_i - \hat{Y_i})^2} = \sum_{i=1}^{n}{(Y_i - \hat{b_0} - \hat{b_1}X_i )^2}$  
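
For a single predictor, minimising the RSS has the well-known closed-form solution $\hat{b_1} = \frac{\sum (X_i - \overline{X})(Y_i - \overline{Y})}{\sum (X_i - \overline{X})^2}$ and $\hat{b_0} = \overline{Y} - \hat{b_1}\overline{X}$. The sketch below applies it to made-up data; the function name is an assumption.

```python
def fit_simple_regression(x, y):
    """Least-squares estimates (b0, b1) for the model y = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = s_xy / s_xx
    b0 = mean_y - b1 * mean_x
    return b0, b1

# Hypothetical data
x = [1, 2, 3, 4, 5]
y = [2.2, 4.1, 6.2, 7.9, 10.1]

b0, b1 = fit_simple_regression(x, y)
fitted = [b0 + b1 * xi for xi in x]                # predicted values
residuals = [yi - fi for yi, fi in zip(y, fitted)]
rss = sum(e ** 2 for e in residuals)               # residual sum of squares
```

With an intercept in the model, the least-squares residuals always sum to zero.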

### Multiple Linear Regression
When there are multiple predictors, the equation is extended to accommodate:  

$Y = b_0 + b_1X_1 + b_2X_2 + \dots + b_pX_p + e$  

**Key terms: "in-sample" metrics**  
* Root mean squared error: The square root of the average squared error of the regression = RMSE  
    $RMSE = \sqrt{\frac{\sum_{i=1}^{n}{(y_i - \hat{y_i})^2}}{n}}$  
    This measures the overall accuracy of the model and is a basis for comparing it to other models
* Residual standard error: RMSE adjusted for degrees of freedom, where p is the number of predictors = RSE  
    $RSE = \sqrt{\frac{\sum_{i=1}^{n}{(y_i - \hat{y_i})^2}}{n-p-1}}$
* R-squared: The proportion of variance explained by the model, from **$R^2$ = 0 to 1**  
    $R^2 = 1 - \frac{\sum_{i=1}^{n}{(y_i - \hat{y_i})^2}}{\sum_{i=1}^{n}{(y_i - \overline{y})^2}}$  
    - $\hat{y_i}$ = predicted value  
    - $\overline{y}$ = mean of observed values  
    
    This measures the proportion of variation in the data that is accounted for in the model
* t-statistic: The coefficient for a predictor, divided by the standard error of the coefficient  
    $t_b = \frac{\hat{b}}{SE(\hat{b})}$  
    The t-statistic and its mirror image, the p-value, measure the extent to which a coefficient $b_i$ is "statistically significant", i.e. outside the range of what a random chance arrangement of predictor and target variable might produce. The higher the absolute value of the t-statistic, the more significant the predictor.  

$R^2$, F-statistics and p-values are all "in-sample" metrics.
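
The in-sample metrics can be computed from observed and predicted values as sketched below. The data, the function name and the choice of one predictor (p = 1) are assumptions; $R^2$ is computed as one minus the ratio of the residual to the total sum of squares.

```python
import math

def regression_metrics(y, y_hat, p):
    """Return (RMSE, RSE, R-squared) for a model with p predictors."""
    n = len(y)
    mean_y = sum(y) / n
    ss_residual = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    ss_total = sum((yi - mean_y) ** 2 for yi in y)

    rmse = math.sqrt(ss_residual / n)
    rse = math.sqrt(ss_residual / (n - p - 1))   # adjusted for degrees of freedom
    r_squared = 1 - ss_residual / ss_total
    return rmse, rse, r_squared

# Hypothetical observed and predicted values
y = [3, 5, 7, 9]
y_hat = [2.5, 5.5, 6.5, 9.5]

rmse, rse, r_squared = regression_metrics(y, y_hat, p=1)
```

RSE is always somewhat larger than RMSE because it divides by fewer degrees of freedom; the difference shrinks as n grows.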

### Prediction Using Regression
The primary purpose of regression in data science is prediction. Regression, however, should not be used to extrapolate beyond the range of the data. Useful metrics are confidence intervals, which are uncertainty intervals placed around regression coefficients and predictions.

**Factor variables**  
Factor variables, also *categorical* variables, take on a limited number of discrete values. Regression requires numerical inputs, so factor variables need to be recoded to be used in a model. 
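
One common recoding is into 0/1 indicator ("dummy") columns. The sketch below uses reference coding, in which the first level is dropped and serves as the baseline; this choice, the function name and the example levels are assumptions.

```python
def dummy_encode(values, drop_first=True):
    """Recode a factor variable into 0/1 indicator columns, one per level."""
    levels = sorted(set(values))
    if drop_first:
        levels = levels[1:]    # the dropped level becomes the reference level
    rows = [[1 if value == level else 0 for level in levels] for value in values]
    return levels, rows

# Hypothetical factor variable
property_type = ["Townhouse", "Single Family", "Multiplex", "Single Family"]

columns, encoded = dummy_encode(property_type)
```

A record belonging to the reference level is encoded as all zeros. Keeping all levels together with an intercept would make the columns linearly dependent, which is why one level is usually dropped in regression.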

### Further reading
* [To Explain or to Predict](https://projecteuclid.org/euclid.ss/1294167961)

[Home](#home)

<a name='class'></a>
## Classification
Classification is perhaps the most important form of prediction: the goal is to predict whether a record is 0 or 1. Rather than having a model simply assign a binary classification, most algorithms can return a probability score of belonging to the class of interest.

With more than two classes (e.g. Y = 0, 1 or 2), the problem can often be recast as a series of binary problems:

* Predict whether Y = 0 or Y > 0
* Given that Y > 0, predict whether Y = 1 or Y = 2

### Naive Bayes
The algorithm uses the probability of observing predictor values, given an outcome, to estimate the probability of observing outcome Y = i, given a set of predictor values.  

**Key Terms**  
* Conditional probability:  
    The probability of observing some event (X = i) given some other event (Y = i), written as $P(X_i | Y_i)$
* Posterior probability:
    The probability of an outcome after the predictor information has been incorporated (in contrast to the prior probability of outcomes, not taking predictor information into account)  

Naive Bayes is a data-driven, empirical method requiring relatively little statistical expertise. The naive Bayes classifier is known to produce biased estimates. However, where the goal is to rank records according to the probability that Y = 1, unbiased estimates of probability are not needed and naive Bayes produces good results.
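
A minimal sketch of naive Bayes for categorical predictors: the posterior is proportional to the prior times the product of the conditional probabilities of each predictor value. The function names and the toy data are assumptions, and no smoothing is applied, so a predictor value never seen with a class gives that class probability zero.

```python
from collections import Counter, defaultdict

def fit_naive_bayes(records, labels):
    """Count class frequencies and predictor values per class."""
    class_counts = Counter(labels)
    # value_counts[label][predictor_index][value] = number of records
    value_counts = defaultdict(lambda: defaultdict(Counter))
    for record, label in zip(records, labels):
        for j, value in enumerate(record):
            value_counts[label][j][value] += 1
    return class_counts, value_counts

def posterior(record, class_counts, value_counts):
    """Posterior probability of each class given the predictor values."""
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total                                # prior P(Y = label)
        for j, value in enumerate(record):
            score *= value_counts[label][j][value] / count   # P(X_j = value | Y = label)
        scores[label] = score
    normaliser = sum(scores.values())
    return {label: score / normaliser for label, score in scores.items()}

# Hypothetical training data: (outlook, temperature) -> outcome
records = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"),
           ("rainy", "hot"), ("sunny", "hot"), ("rainy", "mild")]
labels = [1, 1, 0, 0, 0, 1]

class_counts, value_counts = fit_naive_bayes(records, labels)
probabilities = posterior(("sunny", "mild"), class_counts, value_counts)
```

The "naive" part is the multiplication of the per-predictor conditional probabilities, which treats the predictors as independent given the class.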

### Covariance Matrix
Discriminant analysis relies on the concept of covariance between two or more variables. The covariance measures the relationship between two variables x and z.

$s_{x,z} = \frac{\sum_{i=1}^{n}{(x_i-\overline{x})(z_i-\overline{z})}}{(n-1)}$  

where $\overline{x}$ and $\overline{z}$ are the means of x and z, and n is the number of records.

Correlation is constrained to lie between -1 and 1, whereas covariance is on the same scale as the variables x and z. The covariance matrix $\Sigma$ for x and z consists of the individual variable variances $s_x^2$ and $s_z^2$ on the *diagonal* and the covariances between variable pairs on the *off-diagonals*.  

$\hat{\Sigma} = \begin{bmatrix} 
                s_x^2 & s_{x,z}\\
                s_{z,x} & s_z^2
                \end{bmatrix}$
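
The covariance matrix for two variables can be built directly from the covariance formula. The data and the function name are illustrative assumptions.

```python
def covariance(x, z):
    """Sample covariance of x and z (divides by n - 1)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_z = sum(z) / n
    return sum((xi - mean_x) * (zi - mean_z) for xi, zi in zip(x, z)) / (n - 1)

# Hypothetical variables
x = [1, 2, 3, 4]
z = [2, 4, 6, 9]

# Variances on the diagonal, covariances on the off-diagonals
cov_matrix = [
    [covariance(x, x), covariance(x, z)],
    [covariance(z, x), covariance(z, z)],
]
```

The covariance of a variable with itself is its variance, and the matrix is symmetric because $s_{x,z} = s_{z,x}$.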

### Discriminant Analysis
**Key Terms**
* Covariance: A measure of the extent to which one variable varies in concert with another.
* Discriminant function: The function that, when applied to the predictor variables, maximises the separation of the classes.
* Discriminant weights: The scores that result from the application of the discriminant function and are used to estimate probabilities of belonging to one class or another.

### Linear discriminant analysis (LDA)
Discriminant analysis assumes the predictor variables are normally distributed continuous variables. In practice, the method works well even for nonextreme departures from normality and for binary predictors. Fisher's linear discriminant distinguishes variation between the groups from variation within the groups. It seeks to maximise the "between" sum of squares $SS_{between}$ relative to the "within" sum of squares $SS_{within}$. The method finds the linear combination $w_xx + w_zz$ that maximises the sum of squares ratio.  

$\frac{SS_{between}}{SS_{within}}$  

$SS_{between}$ is the squared distance between the two group means, and $SS_{within}$ is the spread around the means within each group, weighted by the covariance matrix. This function yields the greatest separation between the groups by maximising $SS_{between}$ and minimising $SS_{within}$.  

Using the discriminant function weights, LDA splits the predictor space into two regions.

### Logistic Regression
Logistic regression is analogous to multiple regression except that the outcome is binary. It is a structured model approach, rather than a data-centric approach.  
**Key Terms**
* Logit: The function that maps the probability of belonging to a class to a range of $\pm \infty$ (instead of 0 to 1).
* Odds: The ratio of success (1) to not success (0)
* Log odds: The response in the transformed model (now linear), which gets mapped back to a probability

The key is the logistic response function and the logit, in which we map a probability to a more expansive scale. Think of the outcome variable not as a binary label but as the probability $p$ that the label is a "1". Model $p$ by applying a logistic response function to the predictor variables:  

$p = \frac{1}{1+e^{-(\beta_0 + \beta_1x_1 + ... + \beta_nx_n)}}$  

This transformation ensures that $p$ stays between 0 and 1.  

Odds(Y=1) = $\frac{p}{1-p}$ = ratio of success(1) to nonsuccess (0) = $e^{(\beta_0 + \beta_1x_1 + ... + \beta_nx_n)}$.  
log(Odds) = $\beta_0 + \beta_1x_1 + ... + \beta_nx_n$

Probability $p = \frac{Odds}{1+Odds}$

The log-odds function, known as the *logit* function, maps the probability p from (0,1) to any value in ($-\infty , +\infty$)  
![logit function](https://upload.wikimedia.org/wikipedia/commons/thumb/c/c8/Logit.svg/525px-Logit.svg.png)
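
The relationships between probability, odds and log odds can be sketched as follows. The function names and the coefficient values are assumptions for illustration.

```python
import math

def logistic(log_odds):
    """Logistic response function: maps log odds to a probability in (0, 1)."""
    return 1 / (1 + math.exp(-log_odds))

def odds(p):
    """Ratio of success to not success."""
    return p / (1 - p)

def logit(p):
    """Log odds: maps a probability in (0, 1) to the whole real line."""
    return math.log(odds(p))

# Hypothetical fitted coefficients and one record with two predictors
beta = [-1.0, 0.8, 0.5]     # beta_0 (intercept), beta_1, beta_2
x = [2.0, 1.0]

log_odds = beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))
p = logistic(log_odds)                      # predicted probability of a "1"
p_from_odds = odds(p) / (1 + odds(p))       # converting odds back to a probability
```

`logit` and `logistic` are inverse functions of each other, which is why the model is linear on the log-odds scale while the predictions stay between 0 and 1.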

Logistic regression is a special instance of a generalised linear model (GLM). GLMs are characterised by:
* a probability distribution family
* a link function mapping the response to the predictors

**Predicted Values**  
The predicted value is given by the logistic response function, where $\hat{Y}$ is the predicted log odds: $\hat{p} = \frac{1}{1 + e^{-\hat{Y}}}$  

Multiple linear regression and logistic regression share many commonalities:
* both assume a parametric linear form relating the predictors to the response
* exploring and finding the best model are done in similar ways

Logistic regression differs by:
* the way the model is fit
* the nature and analysis of the residuals from the model

**Fitting the model** is done by maximum likelihood estimation (MLE). MLE finds the solution such that the estimated log odds best describe the observed outcome.

**Assessing the model** is done by assessing how accurately the model classifies new data. In addition, statistical software such as R reports standard measures for the estimated coefficients:
* the standard error of the coefficients (SE)
* z-value
* p-value

However, a logistic regression model, which has a binary response, does not have an associated RMSE or $R^2$. Instead, a logistic regression model is evaluated using more general metrics for classification, i.e.
* Accuracy: the percent (or proportion) of cases classified correctly
* Confusion matrix: a tabular display (2x2) of the record counts by their predicted and actual classification status
* Sensitivity: The percent (or proportion) of 1's correctly classified
* Specificity: the percent (or proportion) of 0's correctly classified
* Precision: the percent (or proportion) of predicted 1's that are actually 1's
* ROC curve: A plot of sensitivity versus specificity
* Lift: A measure of how effective the model is at identifying (comparatively rare) 1s at different probability cut offs

**Accuracy**  
The proportion of predictions that are correct = $\frac{\sum{TruePos} + \sum{TrueNeg}}{SampleSize}$  

**Confusion Matrix**  
A table showing the number of correct and incorrect predictions categorised by the type of response. The diagonal elements of the matrix show the number of correct predictions and the off-diagonal elements show the number of incorrect predictions.  
![Confusion Matrix](https://glassboxmedicine.files.wordpress.com/2019/02/confusion-matrix.png?w=1200)  

* precision = $\frac{\sum{TruePos}}{\sum{TruePos} + \sum{FalsePos}}$
* recall (sensitivity) = $\frac{\sum{TruePos}}{\sum{TruePos} + \sum{FalseNeg}}$
* specificity = $\frac{\sum{TrueNeg}}{\sum{TrueNeg} + \sum{FalsePos}}$
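
The classification metrics follow from the four cells of the confusion matrix. In the sketch below, recall is computed as true positives divided by all actual positives; the labels, predictions and function name are made up for illustration.

```python
def confusion_counts(actual, predicted):
    """Return (true_pos, false_pos, true_neg, false_neg) for 0/1 labels."""
    pairs = list(zip(actual, predicted))
    true_pos = sum(1 for a, p in pairs if a == 1 and p == 1)
    false_pos = sum(1 for a, p in pairs if a == 0 and p == 1)
    true_neg = sum(1 for a, p in pairs if a == 0 and p == 0)
    false_neg = sum(1 for a, p in pairs if a == 1 and p == 0)
    return true_pos, false_pos, true_neg, false_neg

# Hypothetical actual classes and model predictions
actual =    [1, 1, 1, 0, 0, 0, 0, 1, 0, 0]
predicted = [1, 0, 1, 0, 0, 1, 0, 1, 0, 0]

tp, fp, tn, fn = confusion_counts(actual, predicted)

accuracy = (tp + tn) / len(actual)
precision = tp / (tp + fp)      # predicted 1's that are actually 1's
recall = tp / (tp + fn)         # actual 1's correctly classified (sensitivity)
specificity = tn / (tn + fp)    # actual 0's correctly classified
```

With imbalanced classes, accuracy alone can look good even when recall for the rare class is poor, which is why the other metrics are reported alongside it.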

**Receiver Operating Characteristics (ROC)**  
The ROC curve plots recall (sensitivity) on the y-axis against specificity on the x-axis (often drawn as the false positive rate, 1 - specificity) and shows the trade-off between recall and specificity. An extremely effective classifier will have an ROC curve that hugs the upper left corner.   
![ROC](http://deparkes.co.uk/wp-content/uploads/2018/02/roc_curve_1.png)  

### Strategies for Imbalanced Data
**Key Terms**  
* Undersample: use fewer of the prevalent class records in the classification model
* Oversample: use more of the rare class records in the classification model, bootstrapping if necessary
* Up weight or down weight: attach more (or less) weight to the rare (or prevalent) class in the model
* Data generation: like bootstrapping, except that each new bootstrapped record is slightly different from its source
* z-Score: the value that results after standardisation
* K: The number of neighbours considered in the nearest neighbour calculation

[Home](#home)

<a name='ml'></a>
## Statistical Machine Learning

### Further reading
* [To Explain or to Predict](https://projecteuclid.org/euclid.ss/1294167961)

[Home](#home)