<p style="padding: 10px;
          color: Black;
          font-weight: bold;
          text-align: center;
          font-size:240%;">
Statistical Learning Tutorial

</p>

<img src="https://media.giphy.com/media/l378c04F2fjeZ7vH2/giphy.gif">

1. [Data](#1)
2. [Level of Measurements](#2)
3. [Population and Sample](#3)
4. [Central Tendency](#4)
    - [Mean](#5)   
    - [Median](#6)     
    - [Mode](#7)
5. [Dispersion](#8)   
    - [Range](#9)   
    - [Variance](#10)   
    - [Standard Deviation](#11)
6. [Central Tendency and Dispersion](#12)
7. [Quartiles](#13)
8. [Bivariate Data and Covariance](#14)
9. [Pearson Correlation Coefficient](#15)
10. [Spearman Rank Coefficient](#16)
11. [Effect Size](#17)
12. [Probability](#18)   
    - [Permutation](#19)   
    - [Combination](#20)
    - [Intersection, Unions and Complements](#21)
13. [Statistics](#22)
    - [Sampling](#23)
    - [Central Limit Theorem](#24)
    - [Standard Error](#25)
    - [Hypothesis Testing](#26)
    - [T-Distribution](#27)
    - [A/B Testing](#28)
14. [ANOVA (Analysis of Variance)](#29)
    - [F-Distribution](#30)
15. [Chi-Square Analysis](#31)
    - [Chi-Square Analysis Example](#32)

<a id = "1"></a>
# Data
Data are characteristics or information, usually numerical, that are collected through observation. In a more technical sense, data are a set of values of qualitative or quantitative variables about one or more persons or objects, while a datum (singular of data) is a single value of a single variable.

There are 2 types of data:
- **Continuous:**  Continuous data can take any value within a range. Height, weight, temperature and length are all examples of continuous data. Some continuous data change over time, such as the weight of a baby in its first year or the temperature in a room throughout the day.
- **Categorical:**  Categorical variables represent types of data which may be divided into groups. Examples of categorical variables are race, sex, age group, and educational level.

<a id = "2"></a>
# Level of Measurements
Level of measurement or scale of measure is a classification that describes the nature of information within the values assigned to variables. Psychologist Stanley Smith Stevens developed the best-known classification with four levels, or scales, of measurement: nominal, ordinal, interval, and ratio. This framework of distinguishing levels of measurement originated in psychology and is widely criticized by scholars in other disciplines. Other classifications include those by Mosteller and Tukey, and by Chrisman.

**Nominal Measurement:** A nominal variable is one of the two types of categorical variables and is the simplest of the measurement levels: its values are labels with no inherent order. Examples of nominal variables include gender, name, and phone number.  
**Ordinal Measurement:** An ordinal variable is a categorical variable whose values have a meaningful order, although the distances between values are not defined. Examples of ordinal variables include: socioeconomic status ("low income", "middle income", "high income"), education level ("high school", "BS", "MS", "PhD"), income level ("less than 50K", "50K-100K", "over 100K"), satisfaction rating ("extremely dislike", "dislike", "neutral", "like", "extremely like").  
**Interval Measurement:** An interval scale is one where there is order and the difference between two values is meaningful. Examples of interval variables include: temperature (Fahrenheit), temperature (Celsius), pH, SAT score (200-800), credit score (300-850).  
**Ratio Measurement:** A ratio scale has all the properties of an interval scale plus a true zero, so ratios between values are meaningful. The most common examples of ratio scale are height, money, age, weight etc. With respect to market research, the common examples that are observed are sales, price, number of customers, market share etc.
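
The difference between nominal and ordinal variables can be illustrated with pandas categoricals. This is a minimal sketch: the variable names and values are made up for illustration and are not part of the notebook's data.

```python
import pandas as pd

# Nominal: labels with no inherent order (illustrative values).
blood_type = pd.Categorical(["A", "O", "B", "O"], ordered=False)

# Ordinal: labels with a meaningful order, declared explicitly.
education = pd.Categorical(
    ["BS", "high school", "PhD", "MS"],
    categories=["high school", "BS", "MS", "PhD"],
    ordered=True,
)

print("Nominal is ordered:", blood_type.ordered)
print("Lowest / highest education:", education.min(), "/", education.max())
print("Integer codes following the declared order:", education.codes.tolist())
```

Because `education` is ordered, operations such as `min()` and `max()` are meaningful, whereas they would raise an error for the unordered `blood_type`.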

<a id = "3"></a>
# Population and Sample
A population is the entire group that you want to draw conclusions about. A sample is the specific group that you will collect data from. The size of the sample is always less than the total size of the population. In research, a population doesn't always refer to people.

![image.png](attachment:image.png)

**Example:** The population may be "ALL people living in the US." A sample data set contains a part, or a subset, of a population. The size of a sample is always less than the size of the population from which it is taken.
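
As a rough illustration, the sketch below builds a simulated population and draws a simple random sample from it. The population values, sizes, and seed are assumptions chosen only for the example.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 100,000 simulated ages (illustrative values only).
population = rng.integers(low=18, high=90, size=100_000)

# A simple random sample of 500 members, drawn without replacement.
sample = rng.choice(population, size=500, replace=False)

print("Population size:", population.size, "mean:", population.mean())
print("Sample size:    ", sample.size, "mean:", sample.mean())
```

The sample mean will usually be close to, but not exactly equal to, the population mean; that gap is what inferential statistics tries to quantify.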

<a id = "4"></a>
# Central Tendency
Central tendency (or measure of central tendency) is a central or typical value for a probability distribution. It may also be called a center or location of the distribution. Colloquially, measures of central tendency are often called averages. The term central tendency dates from the late 1920s.

The most common measures of central tendency are the arithmetic mean, the median, and the mode. A central tendency can be calculated for either a finite set of values or for a theoretical distribution, such as the normal distribution. Occasionally authors use central tendency to denote "the tendency of quantitative data to cluster around some central value."

The central tendency of a distribution is typically contrasted with its dispersion or variability; dispersion and central tendency are the most often characterized properties of distributions. Analysts may judge whether data has a strong or a weak central tendency based on its dispersion.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats

import warnings
warnings.filterwarnings("ignore")

In [None]:
age  = [23,27,24,23,34,28,23,27,36,38]

<a id = "5"></a>
## Mean
There are several kinds of mean in mathematics, especially in statistics. For a data set, the arithmetic mean, also called the average, is the central value of a discrete set of numbers: specifically, the sum of the values divided by the number of values.

In [None]:
mean_age = np.mean(age)
print("Mean:" ,mean_age)

<a id = "6"></a>
## Median
The median is the value separating the higher half from the lower half of a data sample, a population, or a probability distribution. For a data set, it may be thought of as "the middle" value. The basic feature of the median in describing data compared to the mean (often simply described as the "average") is that it is not skewed by a small proportion of extremely large or small values, and therefore provides a better representation of a "typical" value.

In [None]:
median_age = np.median(age)
print("Median:" ,median_age)

<a id = "7"></a>
## Mode
The mode is the value that appears most often in a set of data values. If X is a discrete random variable, the mode is the value x (i.e, X = x) at which the probability mass function takes its maximum value. In other words, it is the value that is most likely to be sampled.

In [None]:
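mode_age = stats.mode(age)
# Newer SciPy versions return a scalar mode, older ones an array; atleast_1d handles both
print("Mode:", np.atleast_1d(mode_age.mode)[0])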

<a id = "8"></a>
# Dispersion
Dispersion (also called variability, scatter, or spread) is the extent to which a distribution is stretched or squeezed. Common examples of measures of statistical dispersion are the **variance**, **standard deviation**, and **interquartile range**.

<a id = "9"></a>
## Range
The range of a set of data is the difference between the largest and smallest values.

In [None]:
print("Range: ", (np.max(age)-np.min(age)))

<a id = "10"></a>
## Variance
Variance is the expectation of the squared deviation of a random variable from its mean. Informally, it measures how far a set of numbers is spread out from their average value.

$$\Huge \sigma^2 = \frac{\displaystyle\sum_{i=1}^{n}(x_i - \mu)^2} {n} $$

In [None]:
print("Variance: ", (np.var(age)))
var = sum((age - np.mean(age))**2)/len(age)
print("Variance with Formula: ", var)
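
The formula above divides by n, which is the population variance and is what `np.var` computes by default. When the data are a sample, it is common to divide by n - 1 instead (the sample variance). A short sketch of the difference, reusing the same ages:

```python
import numpy as np

age = np.array([23, 27, 24, 23, 34, 28, 23, 27, 36, 38])
n = len(age)

pop_var = np.var(age)             # divides by n (ddof=0, NumPy's default)
sample_var = np.var(age, ddof=1)  # divides by n - 1

print("Population variance:", pop_var)
print("Sample variance:    ", sample_var)
```

Note that pandas uses `ddof=1` by default in `Series.var()` and `Series.std()`, so NumPy and pandas can give slightly different numbers for the same data.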

<a id = "11"></a>
## Standard Deviation
The standard deviation is a measure of the amount of variation or dispersion of a set of values. A low standard deviation indicates that the values tend to be close to the mean (also called the expected value) of the set, while a high standard deviation indicates that the values are spread out over a wider range.

$$\Huge \sigma = \sqrt{\frac{\displaystyle\sum_{i=1}^{n}(x_i - \mu)^2}{n}} $$

In [None]:
print("Standard Deviation: ", np.std(age))
std = np.sqrt(sum((age - np.mean(age))**2)/len(age))
print("Standard deviation with Formula: ", std)

<a id = "12"></a>
# Central Tendency and Dispersion

In [None]:
import matplotlib.pyplot as plt

y = np.random.uniform(5,8,100)
x1 = np.random.uniform(10,20,100)
x2 = np.random.uniform(0,30,100)

plt.scatter(x1,y,color = "blue")
plt.scatter(x2,y,color = "red")
plt.xlim([-1,31])
plt.ylim([2,11])
plt.xlabel("x")
plt.ylabel("y")

print("x1 mean: {} and median: {}".format(np.mean(x1),np.median(x1)))
print("x2 mean: {} and median: {}".format(np.mean(x2),np.median(x2)))

In [None]:
x1_range = (np.max(x1)-np.min(x1))
x1_variance = (np.var(x1))
x1_std = (np.std(x1))

x2_range = (np.max(x2)-np.min(x2))
x2_variance = (np.var(x2))
x2_std = (np.std(x2))

x1_dispersion = [x1_range, x1_variance, x1_std]
x2_dispersion = [x2_range, x2_variance, x2_std]

df = pd.DataFrame([x1_dispersion,x2_dispersion],columns= ['Range','Variance','Std'], index= ['x1','x2'])
df

In [None]:
plt.figure(figsize=(6,4))    
plt.plot(['Range','Variance','Std'], x1_dispersion, color = "blue")
plt.plot(['Range','Variance','Std'], x2_dispersion, color  = "red")
plt.title("Dispersion", size = 14)
plt.show()

<a id = "13"></a>
# Quartiles
Quartiles are a type of quantile that divide the data points into four parts, or quarters, of more-or-less equal size. The data must be ordered from smallest to largest to compute quartiles; as such, quartiles are a form of order statistic. The three main quartiles are as follows:

- **The first quartile (Q1)** is defined as the middle number between the smallest number (minimum) and the median of the data set. It is also known as the lower or 25th empirical quartile, as **25%** of the data is below this point.
- **The second quartile (Q2)** is the median of a data set; thus **50%** of the data lies below this point.
- **The third quartile (Q3)** is the middle value between the median and the highest value (maximum) of the data set. It is known as the upper or 75th empirical quartile, as **75%** of the data lies below this point.
    
Along with the minimum and maximum of the data (which are also quartiles), the three quartiles described above provide a five-number summary of the data. This summary is important in statistics because it provides information about both the center and the spread of the data. Knowing the lower and upper quartile provides information on how big the spread is and if the dataset is skewed toward one side. Because each quarter contains the same share of the data points, the distances between quartiles are generally not equal (i.e., Q3-Q2 ≠ Q2-Q1). The distance between the first and third quartiles, Q3-Q1, is known as the interquartile range (IQR). While the maximum and minimum also show the spread of the data, the upper and lower quartiles can provide more detailed information on the location of specific data points, the presence of outliers in the data, and the difference in spread between the middle 50% of the data and the outer data points.
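
A five-number summary and the IQR can be computed with `np.percentile`. The sketch below reuses the small `age` list from earlier in the tutorial; keep in mind that different quartile conventions exist, and NumPy's default (linear interpolation) may differ slightly from values computed by hand.

```python
import numpy as np

age = np.array([23, 27, 24, 23, 34, 28, 23, 27, 36, 38])

# Minimum, Q1, median (Q2), Q3 and maximum in one call
minimum, q1, median, q3, maximum = np.percentile(age, [0, 25, 50, 75, 100])
iqr = q3 - q1  # spread of the middle 50% of the data

print("Five-number summary:", minimum, q1, median, q3, maximum)
print("IQR:", iqr)
```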

In [None]:
plt.style.use("ggplot")
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
data = pd.read_csv("/kaggle/input/biomechanical-features-of-orthopedic-patients/column_2C_weka.csv")
data.head()

In [None]:
data_abnormal = data[data["class"] == "Abnormal"]
data_normal = data[data["class"] == "Normal"]
desc = data_abnormal.pelvic_incidence.describe()
Q1 = desc["25%"]  # first quartile
Q3 = desc["75%"]  # third quartile
IQR = Q3 - Q1
lower_bound = Q1 - 1.5*IQR
upper_bound = Q3 + 1.5*IQR
print("Anything outside this range is an outlier: (" , lower_bound ,"," , upper_bound,")")
# Flag values below the lower bound or above the upper bound
is_outlier = (data_abnormal.pelvic_incidence < lower_bound) | (data_abnormal.pelvic_incidence > upper_bound)
print("Outliers: " , data_abnormal[is_outlier].pelvic_incidence.values)

In [None]:
melted_data  = pd.melt(data,id_vars = "class", value_vars = ['pelvic_incidence'])
sns.boxplot(x = "variable", y = "value", hue = "class", data = melted_data)
plt.show()

<a id = "14"></a>
# Bivariate Data and Covariance
**Bivariate data** is data on each of two variables, where each value of one of the variables is paired with a value of the other variable. Typically it would be of interest to investigate the possible association between the two variables. The association can be studied via a tabular or graphical display, or via sample statistics which might be used for inference. The method used to investigate the association would depend on the level of measurement of the variable. 

In [None]:
f,ax=plt.subplots(figsize = (8,8))
# corr() computes the Pearson correlation by default; the non-numeric "class" column is excluded
sns.heatmap(data.select_dtypes(include="number").corr(),
            annot= True,
            linewidths=0.5,
            fmt = ".2f",
            vmax = 1,
            vmin = -1,
            ax=ax,
            annot_kws = {'size': 14},
            cmap ="coolwarm")
plt.xticks(rotation=70, size = 12)
plt.yticks(rotation=0, size = 12)
plt.title('Correlation Map',size = 14)
plt.show()

In [None]:
# jointplot creates its own figure; recent seaborn versions require x and y as keywords
sns.jointplot(x=data.pelvic_incidence, y=data.sacral_slope, kind="reg")
sns.jointplot(x=data.pelvic_radius, y=data.sacral_slope, kind="reg")
plt.show()

In [None]:
sns.set(style = "white")
df = data.loc[:,["pelvic_incidence","sacral_slope","lumbar_lordosis_angle"]]
g = sns.PairGrid(df,diag_sharey = False,)
g.map_lower(sns.kdeplot,cmap="Blues_d")
g.map_upper(plt.scatter)
g.map_diag(sns.kdeplot,lw =3)
plt.show()

**Covariance** is a measure of the joint variability of two random variables. If the greater values of one variable mainly correspond with the greater values of the other variable, and the same holds for the lesser values (that is, the variables tend to show similar behavior), the covariance is positive. In the opposite case, when the greater values of one variable mainly correspond to the lesser values of the other, (that is, the variables tend to show opposite behavior), the covariance is negative. The sign of the covariance therefore shows the tendency in the linear relationship between the variables. The magnitude of the covariance is not easy to interpret because it is not normalized and hence depends on the magnitudes of the variables. The normalized version of the covariance, the correlation coefficient, however, shows by its magnitude the strength of the linear relation.

$$\Huge cov_{x,y}=\frac{\sum_{}(X_{i}-\bar{X})(Y_{i}-\bar{Y})}{n}$$
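
To make the formula concrete, the sketch below computes the covariance by hand on a tiny made-up data set and compares it with `np.cov`. The arrays `x` and `y` are illustrative assumptions. Note that `np.cov` and pandas' `Series.cov` divide by n - 1 by default (sample covariance); passing `bias=True` to `np.cov` divides by n as in the formula above.

```python
import numpy as np

# Small illustrative data set (not from the notebook's CSV)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])
n = len(x)

# Sum of products of deviations from the means
cross_products = np.sum((x - x.mean()) * (y - y.mean()))

cov_population = cross_products / n        # matches the formula (divide by n)
cov_sample = cross_products / (n - 1)      # NumPy / pandas default

print("Population covariance:", cov_population, np.cov(x, y, bias=True)[0, 1])
print("Sample covariance:    ", cov_sample, np.cov(x, y)[0, 1])
```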

In [None]:
# Series.cov() returns the sample covariance (it divides by n - 1 rather than n)
print("Covariance between Pelvic Incidence and Sacral Slope: ",data.pelvic_incidence.cov(data.sacral_slope))
print("Covariance between Pelvic Incidence and Pelvic Radius: ",data.pelvic_incidence.cov(data.pelvic_radius))
fig, axs = plt.subplots(1, 2, figsize = (10,4))
axs[0].scatter(data.pelvic_incidence, data.sacral_slope)
axs[1].scatter(data.pelvic_radius, data.pelvic_incidence)
plt.show()

<a id = "15"></a>
# Pearson Correlation Coefficient
The Pearson correlation coefficient is a statistic that measures the linear relationship, or association, between two continuous variables. It is one of the most widely used measures of association because it is based on covariance: it is the covariance of the two variables divided by the product of their standard deviations, which bounds it between -1 and 1. It gives information about the magnitude of the association, or correlation, as well as the direction of the relationship.

In [None]:
p1 = data.loc[:,["pelvic_incidence","sacral_slope"]].corr(method= "pearson")
p2 = data.sacral_slope.cov(data.pelvic_incidence)/(data.sacral_slope.std()*data.pelvic_incidence.std())
print('Pearson Correlation: ')
print(p1)
print('Pearson Correlation: ',p2)

In [None]:
sns.jointplot(x=data.sacral_slope, y=data.pelvic_incidence, kind="reg")
plt.show()

<a id = "16"></a>
# Spearman Rank Coefficient
Spearman's rank-order correlation is the nonparametric version of the Pearson product-moment correlation. Spearman's correlation coefficient (ρ, also written $r_s$) measures the strength and direction of the monotonic association between two ranked variables.

In [None]:
ranked_data = data.rank() 
spearman_corr = ranked_data.loc[:,["pelvic_incidence","sacral_slope"]].corr(method= "pearson")
print("Spearman's Correlation: ")
print(spearman_corr)

Spearman's correlation is a little higher than the Pearson correlation here.
- If the relationship between the variables is non-linear but monotonic, Spearman's correlation tends to give a better estimate of its strength.
- Pearson correlation can be affected by outliers. Spearman's correlation is more robust.
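
The first point can be illustrated with a small synthetic example, assuming a relationship that is perfectly monotonic but strongly non-linear. The data here are made up; `scipy.stats.spearmanr` and `scipy.stats.pearsonr` compute the two coefficients directly.

```python
import numpy as np
from scipy import stats

# Synthetic data: y always increases with x, but not along a straight line
x = np.arange(1, 11, dtype=float)
y = np.exp(x)

pearson_r, _ = stats.pearsonr(x, y)
spearman_rho, _ = stats.spearmanr(x, y)

print("Pearson r:   ", pearson_r)
print("Spearman rho:", spearman_rho)
```

Because the ranks of `x` and `y` agree exactly, Spearman's coefficient is 1, while Pearson's coefficient is lower because the points do not lie on a line.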

<a id = "17"></a>
# Effect Size
Effect size is a number measuring the strength of the relationship between two variables in a statistical population, or a sample-based estimate of that quantity. It can refer to the value of a statistic calculated from a sample of data, the value of a parameter of a hypothetical statistical population, or to the equation that operationalizes how statistics or parameters lead to the effect size value. Examples of effect sizes include the correlation between two variables, the regression coefficient in a regression, the mean difference, or the risk of a particular event (such as a heart attack) happening. Effect sizes complement statistical hypothesis testing, and play an important role in power analyses, sample size planning, and in meta-analyses. The cluster of data-analysis methods concerning effect sizes is referred to as estimation statistics.

$$\huge d = \frac {M_1 - M_2} {\sqrt{\frac {n_1 S_1^2 + n_2 S_2^2} {n_1 + n_2} }} $$

In [None]:
mean_diff = data_abnormal.pelvic_incidence.mean() - data_normal.pelvic_incidence.mean()    # m1 - m2
var_abnormal = data_abnormal.pelvic_incidence.var()
var_normal = data_normal.pelvic_incidence.var()
# Pool the variances, weighting each group's variance by its own sample size
var_pooled = (len(data_abnormal)*var_abnormal + len(data_normal)*var_normal) / float(len(data_abnormal) + len(data_normal))
effect_size = mean_diff/np.sqrt(var_pooled)
print("Effect Size:",effect_size)

<a id = "18"></a>
# Probability
Probability is the branch of mathematics concerning numerical descriptions of how likely an event is to occur, or how likely it is that a proposition is true. The probability of an event is a number between 0 and 1, where, roughly speaking, 0 indicates impossibility of the event and 1 indicates certainty. The higher the probability of an event, the more likely it is that the event will occur. A simple example is the tossing of a fair (unbiased) coin. Since the coin is fair, the two outcomes ("heads" and "tails") are both equally probable; the probability of "heads" equals the probability of "tails"; and since no other outcomes are possible, the probability of either "heads" or "tails" is 1/2 (which could also be written as 0.5 or 50%).

<a id = "19"></a>
## Permutation
A permutation of a set is, loosely speaking, an arrangement of its members into a sequence or linear order, or if the set is already ordered, a rearrangement of its elements. The word "permutation" also refers to the act or process of changing the linear order of an ordered set.

Permutations differ from combinations, which are selections of some members of a set regardless of order. For example, written as tuples, there are six permutations of the set {1,2,3}, namely: (1,2,3), (1,3,2), (2,1,3), (2,3,1), (3,1,2), and (3,2,1). These are all the possible orderings of this three-element set. Anagrams of words whose letters are different are also permutations: the letters are already ordered in the original word, and the anagram is a reordering of the letters. The study of permutations of finite sets is an important topic in the fields of combinatorics and group theory.


$$\Huge P(n,r)  = \frac {n!} {(n-r)! } $$

Permutations are the different ways in which a collection of items can be arranged. For example: the number of different ways in which the letters A, B and C, taken 2 at a time, can be arranged is 3!/(3-2)! = 3!/1! = 6 ways. (AB, AC, BA, BC, CA, CB)

In [None]:
import math
words = ["A","B","C"]
# P(n, r) = n! / (n - r)! with n = 3 and r = 2; integer division keeps the result exact
p = math.factorial(len(words)) // math.factorial(len(words)-2)
print(p)
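
The arrangements themselves can also be listed with `itertools.permutations`, and the count can be checked with `math.perm` (available in Python 3.8 and later). A minimal sketch:

```python
import math
from itertools import permutations

letters = ["A", "B", "C"]

# Every ordered arrangement of 2 letters chosen from 3
arrangements = ["".join(pair) for pair in permutations(letters, 2)]

print(arrangements)
print("Count:", len(arrangements), "| math.perm(3, 2):", math.perm(3, 2))
```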

<a id = "20"></a>
## Combination
A combination is a selection of items from a collection, such that the order of selection does not matter (unlike permutations). For example, given three fruits, say an apple, an orange and a pear, there are three combinations of two that can be drawn from this set: an apple and a pear; an apple and an orange; or a pear and an orange. More formally, a k-combination of a set S is a subset of k distinct elements of S. If the set has n elements, the number of k-combinations is equal to the binomial coefficient.

$$\Huge C(n,r)  = \frac {n!} { r!(n-r)! } $$
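
The fruit example can be reproduced with `itertools.combinations`, and the formula can be checked with `math.comb` (Python 3.8 and later). This is a small sketch, not part of the original notebook cells:

```python
import math
from itertools import combinations

fruits = ["apple", "orange", "pear"]

# Every unordered selection of 2 fruits out of 3
pairs = list(combinations(fruits, 2))

print(pairs)
print("Count:", len(pairs), "| math.comb(3, 2):", math.comb(3, 2))
```

Compared with permutations, each pair appears only once because order does not matter: C(3, 2) = 3, whereas P(3, 2) = 6.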

<a id = "21"></a>
## Intersection, Unions and Complements
- The **intersection** of two sets A and B, denoted by A ∩ B, is the set containing all elements of A that also belong to B (or equivalently, all elements of B that also belong to A).    
- In set theory, the **union** (denoted by ∪) of a collection of sets is the set of all elements in the collection. It is one of the fundamental operations through which sets can be combined and related to each other.
- The **complement** of a set A, denoted by Aᶜ (or A′), is the set of all elements of the universal set that do not belong to A.

![image.png](attachment:image.png)
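
These operations map directly onto Python's built-in `set` type. In the sketch below, the sets `A`, `B` and the universal set `U` are arbitrary illustrative choices; the complement is taken relative to `U`.

```python
A = {1, 2, 3, 4}
B = {3, 4, 5}
U = set(range(1, 8))  # assumed universal set {1, ..., 7}

intersection = A & B   # elements in both A and B
union = A | B          # elements in A or B (or both)
complement_A = U - A   # elements of U that are not in A

print("A ∩ B:", sorted(intersection))
print("A ∪ B:", sorted(union))
print("Complement of A:", sorted(complement_A))
```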

<a id = "22"></a>
# Statistics
Statistics is the discipline that concerns the collection, organization, analysis, interpretation, and presentation of data. In applying statistics to a scientific, industrial, or social problem, it is conventional to begin with a statistical population or a statistical model to be studied. Populations can be diverse groups of people or objects such as "all people living in a country" or "every atom composing a crystal". Statistics deals with every aspect of data, including the planning of data collection in terms of the design of surveys and experiments.

<a id = "23"></a>
## Sampling
In statistics, quality assurance, and survey methodology, sampling is the selection of a subset (a statistical sample) of individuals from within a statistical population to estimate characteristics of the whole population. Statisticians attempt to collect samples that are representative of the population in question. Two advantages of sampling are lower cost and faster data collection than measuring the entire population.

Each observation measures one or more properties (such as weight, location, colour) of observable bodies distinguished as independent objects or individuals. In survey sampling, weights can be applied to the data to adjust for the sample design, particularly in stratified sampling. Results from probability theory and statistical theory are employed to guide the practice. In business and medical research, sampling is widely used for gathering information about a population. Acceptance sampling is used to determine if a production lot of material meets the governing specifications.

![image.png](attachment:image.png)

<a id = "24"></a>
## Central Limit Theorem
The central limit theorem states that if you have a population with mean μ and standard deviation σ and take sufficiently large random samples from the population with replacement, then the distribution of the sample means will be approximately normal.

All this is saying is that as you take more samples, especially large ones, your graph of the sample means will look more like a normal distribution.

Here's what the Central Limit Theorem is saying, graphically. The cells below use one of the simplest types of experiment: drawing integers from 1 to 10 uniformly at random, much like rolling a fair ten-sided die. The individual draws are uniformly distributed, yet the more samples you take, the more the distribution of the sample means tends to look like a normal distribution.

In [None]:
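# random_integers is deprecated and removed in recent NumPy; randint(1, 11) draws integers from 1 to 10
x = np.random.randint(1, 11, size=100000)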
plt.hist(x)
plt.show()

In [None]:
import random
mean_sample = []
for i in range(1000):
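    sample_size = random.randrange(5,10)  # sample size between 5 and 9
    # Draw with replacement and record the sample mean (avoids rebuilding a list of x each time)
    mean_sample.append(np.mean(np.random.choice(x, size=sample_size)))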
plt.hist(mean_sample,bins = 50, color = "red")
plt.show()

In [None]:
plt.hist(x,alpha = 0.5,density=True)
plt.hist(mean_sample,bins = 50,alpha = 0.5,color = "red",density=True)
plt.title("Central Limit Theorem")
plt.show()

<a id = "25"></a>
## Standard Error
The standard error (SE) of a statistic (usually an estimate of a parameter) is the standard deviation of its sampling distribution or an estimate of that standard deviation. If the statistic is the sample mean, it is called the standard error of the mean (SEM). 

The sampling distribution of a mean is generated by repeated sampling from the same population and recording of the sample means obtained. This forms a distribution of different means, and this distribution has its own mean and variance. Mathematically, the variance of the sampling distribution obtained is equal to the variance of the population divided by the sample size. This is because as the sample size increases, sample means cluster more closely around the population mean.

Therefore, the relationship between the standard error of the mean and the standard deviation is such that, for a given sample size, the standard error of the mean equals the standard deviation divided by the square root of the sample size. In other words, the standard error of the mean is a measure of the dispersion of sample means around the population mean.

In regression analysis, the term "standard error" refers either to the square root of the reduced chi-squared statistic, or the standard error for a particular regression coefficient (as used in, say, confidence intervals).
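
The relationship between the standard deviation and the standard error of the mean can be checked with a small simulation. This is a sketch under stated assumptions: the population is synthetic (normal with mean 50 and standard deviation 10), and the sample size and number of repetitions are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # fixed seed for repeatability

# Synthetic population (illustrative parameters)
population = rng.normal(loc=50, scale=10, size=100_000)

n = 25  # size of each sample
# Repeatedly sample (with replacement) and record each sample mean
sample_means = [rng.choice(population, size=n).mean() for _ in range(5_000)]

theoretical_sem = population.std() / np.sqrt(n)  # sigma / sqrt(n)
empirical_sem = np.std(sample_means)             # spread of the sample means

print("Theoretical SEM:", theoretical_sem)
print("Empirical SEM:  ", empirical_sem)
```

The two values should agree closely, and both shrink as `n` grows, which is the sense in which larger samples give more precise estimates of the mean.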

<a id = "26"></a>
## Hypothesis Testing
Hypothesis testing refers to the process of making inferences or educated guesses about a particular parameter. This can either be done using statistics and sample data, or it can be done on the basis of an uncontrolled observational study.

When the sample data provide sufficient evidence for the "alternative hypothesis," the original hypothesis (the "null hypothesis") is "rejected." You must decide the level of statistical significance for your test in advance, as you can never be 100 percent confident in your findings. First, let's examine the steps to test a hypothesis. Then, we'll work through some examples of hypothesis testing.

## How to test a Hypothesis?
At this point, you'll already have a hypothesis ready to go. Now, it's time to test your theory. Remember, a hypothesis is a statement regarding what you believe might happen. These are the steps you'll want to take to see if your suppositions stand up:

1. **State your null hypothesis.** The null hypothesis is the default assumption. It's what we'd believe if the experiment was never conducted. It's the least exciting result, showing no significant difference between two or more groups. Researchers work to nullify or disprove null hypotheses.
2. **State an alternative hypothesis.** This is the claim you hope to find support for. It is the opposite of the null hypothesis, describing a statistically significant result. By rejecting the null hypothesis, you accept the alternative hypothesis.
3. **Determine a significance level.** This is the threshold, also known as the alpha (α). It defines the probability of rejecting the null hypothesis when it is actually true. A typical significance level is set at 0.05 (or 5%). You may also see 0.1 or 0.01, depending on the area of study.
    If you set the alpha at 0.05, then there is a 5% chance you'll find support for the alternative hypothesis (thus rejecting the null hypothesis) when, in truth, the null hypothesis is actually true and you were wrong to reject it.
    In other words, the significance level is a statistical way of demonstrating how confident you are in your conclusion. If you set a high alpha (0.25), then you'll have a better shot at supporting your alternative hypothesis, since you don't need to find as big a difference between your test groups. However, you'll also have a bigger chance at being wrong about your conclusion.
4. **Calculate the p-value.** The p-value, or calculated probability, is the probability of obtaining results at least as extreme as the ones you observed if the null hypothesis were true. While the alpha is the significance level you chose in advance, the p-value is what your actual data show when you calculate it. A low p-value offers stronger evidence against the null hypothesis and therefore stronger support for your alternative hypothesis.
5. **Draw a conclusion.** If your p-value meets your significance level requirements, then your alternative hypothesis may be valid and you may reject the null hypothesis. In other words, if your p-value is less than your significance level (e.g., if your calculated p-value is 0.02 and your significance level is 0.05), then you can reject the null hypothesis and accept your alternative hypothesis.
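
The five steps above can be sketched in code. This is only an illustration under assumptions: the two groups are synthetic (simulated scores for a hypothetical control and treatment group), and a two-sample t-test from SciPy stands in for whatever test fits a real study.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)

# Steps 1-2: H0 = the group means are equal; H1 = the group means differ.
# Synthetic scores for two hypothetical groups (illustrative values only).
control = rng.normal(loc=5.0, scale=1.0, size=200)
treatment = rng.normal(loc=4.0, scale=1.0, size=200)

# Step 3: choose the significance level before looking at the data.
alpha = 0.05

# Step 4: compute the p-value (Welch's t-test, no equal-variance assumption).
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)

# Step 5: draw a conclusion by comparing the p-value with alpha.
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print("t statistic:", t_stat)
print("p-value:", p_value)
print("Decision:", decision)
```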

## Hypothesis Testing Example
Let's take those five steps and look at a real-world scenario.

### Peppermint Essential Oil
Essential oils are becoming more and more popular. Chamomile, lavender, and ylang-ylang are commonly touted as anxiety remedies. Perhaps you'd like to test the healing powers of peppermint essential oil. Your hypothesis might go something like this:

1. **Null hypothesis** - Peppermint essential oil has no effect on the pangs of anxiety.
2. **Alternative hypothesis** - Peppermint essential oil alleviates the pangs of anxiety.
3. **Significance level** - The significance level is 0.25 (allowing for a better shot at proving your alternative hypothesis).
4. **P-value** - The p-value is calculated as 0.05.
5. **Conclusion** - After providing one group with peppermint oil and the other with a placebo, you gauge the difference between the two based on self-reported levels of anxiety. Based on your calculations, the difference between the two groups is statistically significant with a p-value of 0.05, well below the defined alpha of 0.25. You conclude that your study supports the alternative hypothesis that peppermint essential oil can alleviate the pangs of anxiety.

## Hypothesis Testing with Our Data
**Null hypothesis:** relationship between pelvic incidence and sacral slope is zero.   
**Alternate hypothesis:** relationship between pelvic incidence and sacral slope is not zero.   

Let's find p-value.

In [None]:
# Pearson correlation test: its null hypothesis is that the correlation is zero
statistic, p_value = stats.pearsonr(data.pelvic_incidence, data.sacral_slope)
alpha = 0.05  # significance level
print('p-value: ', p_value)
if p_value < alpha:
    print("Reject null hypothesis: the data support a non-zero relationship between pelvic incidence and sacral slope.")
else:
    print("Fail to reject null hypothesis: no significant relationship between pelvic incidence and sacral slope was detected.")

<a id = "27"></a>
## T-Distribution
Student's t-distribution (or simply the t-distribution) is any member of a family of continuous probability distributions that arise when estimating the mean of a normally-distributed population in situations where the sample size is small and the population's standard deviation is unknown. It was developed by English statistician William Sealy Gosset under the pseudonym "Student".

The t-distribution plays a role in a number of widely used statistical analyses, including Student's t-test for assessing the statistical significance of the difference between two sample means, the construction of confidence intervals for the difference between two population means, and in linear regression analysis. The Student's t-distribution also arises in the Bayesian analysis of data from a normal family.

In [None]:
s1 = np.array([14.67230258, 14.5984991 , 14.99997003, 14.83541808, 15.42533116,
       15.42023888, 15.0614731 , 14.43906856, 15.40888636, 14.87811941,
       14.93932134, 15.04271942, 14.96311939, 14.0379782 , 14.10980817,
       15.23184029])
print("mean 1: ", np.mean(s1))
print("standard deviation 1: ", np.std(s1))
print("variance 1: ", np.var(s1))
s2 = np.array([15.23658167, 15.30058977, 15.49836851, 15.03712277, 14.72393502,
       14.97462198, 15.0381114 , 15.18667258, 15.5914418 , 15.44854406,
       15.54645152, 14.89288726, 15.36069141, 15.18758271, 14.48270754,
       15.28841374])
print("mean 2: ", np.mean(s2))
print("standard deviation 2: ", np.std(s2))
print("variance 2: ", np.var(s2))
# visualize the estimated probability density (KDE) of each sample
sns.kdeplot(s1)
sns.kdeplot(s2)
plt.show()

In [None]:
# Two-sample t statistic; ddof=1 uses the unbiased sample variances
t_val = np.abs(np.mean(s1)-np.mean(s2))/np.sqrt((np.var(s1, ddof=1)/len(s1))+(np.var(s2, ddof=1)/len(s2)))
print("t-value: ", t_val)
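
The statistic above has the form of Welch's t-statistic. As a rough cross-check, SciPy's `ttest_ind` with `equal_var=False` computes the same quantity from sample variances (`ddof=1`), so a hand calculation matches it when the variances are computed the same way. The arrays `a` and `b` below are hypothetical samples generated only for this sketch.

```python
import numpy as np
from scipy import stats

# Hypothetical samples, generated only for illustration
rng = np.random.default_rng(0)
a = rng.normal(loc=15.0, scale=0.4, size=16)
b = rng.normal(loc=15.2, scale=0.4, size=16)

# Manual statistic: difference of means over its standard error (sample variances)
standard_error = np.sqrt(np.var(a, ddof=1) / len(a) + np.var(b, ddof=1) / len(b))
t_manual = np.abs(np.mean(a) - np.mean(b)) / standard_error

# Welch's t-test from SciPy (does not assume equal variances)
t_scipy, p_value = stats.ttest_ind(a, b, equal_var=False)

print("manual t: ", t_manual)
print("scipy |t|: ", abs(t_scipy))
print("p-value: ", p_value)
```

SciPy returns a signed statistic, so the comparison uses its absolute value.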

![image.png](attachment:image.png)

In [None]:
# Two-sided critical value from the t-table for alpha = 0.05 and 16 + 16 - 2 = 30 degrees of freedom
critical_value = 2.04
print("Null hypothesis: There is no statistically significant difference between the means of these two distributions.")
if t_val > critical_value:
    print("t value > critical value")
    print("Reject Null Hypothesis")
else:
    print("t value <= critical value")
    print("Fail to reject Null Hypothesis")
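
Instead of reading the critical value from a printed table, it can be computed. The sketch below assumes a two-sided test at a significance level of 0.05 and 16 + 16 - 2 = 30 degrees of freedom; the variable names are illustrative.

```python
from scipy import stats

alpha = 0.05        # significance level (two-sided test)
n1, n2 = 16, 16     # sizes of the two samples
df = n1 + n2 - 2    # degrees of freedom assumed in this sketch

# Upper alpha/2 quantile of the t-distribution
critical_value = stats.t.ppf(1 - alpha / 2, df)
print("critical value: ", round(critical_value, 3))
```

This gives roughly 2.04, the value used in the cell above.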

<a id = "28"></a>
## A/B Testing
A/B testing (also known as bucket testing or split-run testing) is a user experience research methodology. A/B tests consist of a randomized experiment with two variants, A and B. It includes application of statistical hypothesis testing or "two-sample hypothesis testing" as used in the field of statistics. A/B testing is a way to compare two versions of a single variable, typically by testing a subject's response to variant A against variant B, and determining which of the two variants is more effective.

![image.png](attachment:031e0bfb-17f6-4bf7-9c76-19c8f9529962.png)

In [None]:
from scipy.stats import shapiro, levene

data = pd.read_csv("../input/students-performance-in-exams/StudentsPerformance.csv")
# Mean math score for the two parental education groups being compared
groups = ["bachelor's degree", 'high school']
(data[data['parental level of education'].isin(groups)]
    .groupby('parental level of education')
    .agg({'math score': 'mean'}))

### Shapiro-Wilk Normality Test

The normality assumption is more important when the two groups have small sample sizes than for larger sample sizes.

Normal distributions are symmetric, which means they are "even" on both sides of the center, and they do not have extreme values, or outliers. The Shapiro-Wilk test checks the normality assumption formally: its null hypothesis is that the sample comes from a normally distributed population, so a small p-value is evidence against normality.

In [None]:
test_stat, p = shapiro(data[data['parental level of education'] == "bachelor's degree"]['math score'])
print('Test Stat: {}'.format(round(test_stat,4)))
print('p-value: {}'.format(round(p,4)))
if p < 0.05:
    print('p < 0.05 --> Reject Null Hypothesis, data are not normally distributed.')
else:
    print('p >= 0.05 --> Cannot reject Null Hypothesis, no evidence against normality.')

In [None]:
test_stat, p = shapiro(data[data['parental level of education'] == 'high school']['math score'])
print('Test Stat: {}'.format(round(test_stat,4)))
print('p-value: {}'.format(round(p,4)))
if p < 0.05:
    print('p < 0.05 --> Reject Null Hypothesis, data are not normally distributed.')
else:
    print('p >= 0.05 --> Cannot reject Null Hypothesis, no evidence against normality.')

### Levene Test for Equality of Variances

Levene's test is used to test if k samples have equal variances. Equal variances across samples is called homogeneity of variance. Some statistical tests, for example the analysis of variance, assume that variances are equal across groups or samples. The Levene test can be used to verify that assumption.

In [None]:
test_stat, p = levene(data[data['parental level of education'] == "bachelor's degree"]['math score'],
                      data[data['parental level of education'] == 'high school']['math score'])

print('Test Stat: {}'.format(round(test_stat,4)))
print('p-value: {}'.format(round(p,4)))

if p < 0.05:
    print('p < 0.05 --> Reject Null Hypothesis, variances are not equal.')
else:
    print('p >= 0.05 --> Cannot reject Null Hypothesis, no evidence that variances differ.')

### Two-Sample T-Test

The two-sample t-test (also known as the independent samples t-test) is a method used to test whether the unknown population means of two groups are equal or not.

**Is this the same as an A/B test?**   
Yes, a two-sample t-test is used to analyze the results from A/B tests.

**When can I use the test?**   
You can use the test when your data values are independent, are randomly sampled from two normal populations and the two independent groups have equal variances.

In [None]:
test_stat, p = stats.ttest_ind(data[data['parental level of education'] == "bachelor's degree"]['math score'],
                               data[data['parental level of education'] == 'high school']['math score'],
                               equal_var = True)

print('Test Stat: {}'.format(round(test_stat,4)))
print('p-value: {}'.format(round(p,4)))

if p < 0.05:
    print('p < 0.05 --> Reject Null Hypothesis, population means are not the same.')
else:
    print('p >= 0.05 --> Cannot reject Null Hypothesis, no evidence that population means differ.')
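
The assumption check and the t-test can be combined in one place. The following is a minimal sketch, not a definitive procedure: `compare_means` is a hypothetical helper, the two groups are synthetic data, and falling back to Welch's test (`equal_var=False`) when Levene's test rejects equal variances is one common choice among several.

```python
import numpy as np
from scipy import stats

def compare_means(group_a, group_b, alpha=0.05):
    """Hypothetical helper: Levene's test first, then a Student or Welch t-test."""
    _, p_levene = stats.levene(group_a, group_b)
    # Keep the equal-variance t-test only if Levene's test does not reject
    equal_var = bool(p_levene >= alpha)
    t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=equal_var)
    return {"equal_var": equal_var, "t_stat": float(t_stat), "p_value": float(p_value)}

# Synthetic scores, generated only for illustration
rng = np.random.default_rng(42)
group_a = rng.normal(loc=70, scale=15, size=120)
group_b = rng.normal(loc=62, scale=15, size=200)

result = compare_means(group_a, group_b)
print(result)
```

With real data, the two arguments would be the `math score` columns of the groups being compared.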

<a id = "29"></a>
# ANOVA (Analysis of Variance)
Analysis of variance (ANOVA) is an analysis tool used in statistics that splits an observed aggregate variability found inside a data set into two parts: systematic factors and random factors. The systematic factors have a statistical influence on the given data set, while the random factors do not. Analysts use the ANOVA test to determine the influence that independent variables have on the dependent variable in a regression study.


The t- and z-test methods developed in the 20th century were used for statistical analysis until 1918, when Ronald Fisher created the analysis of variance method. ANOVA is also called the Fisher analysis of variance, and it is the extension of the t- and z-tests. The term became well-known in 1925, after appearing in Fisher's book, "Statistical Methods for Research Workers." It was employed in experimental psychology and later expanded to subjects that were more complex.

- For example, does exam anxiety differ between middle school, high school and university students? We will answer the question with ANOVA.
- Null Hypothesis: The mean exam anxiety is the same in all three groups.

In [None]:
middle_school = np.array([51.36372405, 44.96944041, 49.43648441, 45.84584407, 45.76670682,
       56.04033356, 60.85163656, 39.16790361, 36.90132329, 43.58084076])
high_school = np.array([56.65674765, 55.92724431, 42.32435143, 50.19137162, 48.91784081,
       48.11598035, 50.91298812, 47.46134988, 42.76947742, 36.86738678])
university = np.array([60.03609029, 56.94733648, 57.77026852, 47.29851926, 54.21559389,
       57.74008243, 50.92416154, 53.47770749, 55.62968872, 59.42984391])

print("Middle school Mean: ",np.mean(middle_school))
print("High school Mean: ",np.mean(high_school))
print("University Mean: ",np.mean(university))
# With equal group sizes, the mean of the group means equals the grand mean
total_mean = (np.mean(middle_school) + np.mean(high_school) + np.mean(university))/3
print("Total Mean: ",total_mean)

sns.kdeplot(middle_school)
sns.kdeplot(high_school)
sns.kdeplot(university)
plt.show()

In [None]:
stats.f_oneway(middle_school, high_school, university)

In [None]:
f_value = stats.f_oneway(middle_school, high_school, university)[0]
print("F value:", f_value)
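
The F value is the ratio of between-group variability (the systematic part) to within-group variability (the random part). The sketch below computes it by hand and compares it with `f_oneway`; the three arrays are small hypothetical samples, not the tutorial's data.

```python
import numpy as np
from scipy import stats

# Hypothetical scores for three groups, used only for illustration
groups = [
    np.array([51.0, 45.0, 49.0, 46.0, 56.0]),
    np.array([57.0, 56.0, 42.0, 50.0, 49.0]),
    np.array([60.0, 57.0, 58.0, 47.0, 54.0]),
]

k = len(groups)                           # number of groups
n_total = sum(len(g) for g in groups)     # total number of observations
grand_mean = np.mean(np.concatenate(groups))

# Between-group and within-group sums of squares
ss_between = sum(len(g) * (np.mean(g) - grand_mean) ** 2 for g in groups)
ss_within = sum(np.sum((g - np.mean(g)) ** 2) for g in groups)

df_between = k - 1          # degrees of freedom for groups
df_within = n_total - k     # degrees of freedom for error

# F = mean square between / mean square within
f_manual = (ss_between / df_between) / (ss_within / df_within)
f_scipy = stats.f_oneway(*groups).statistic

print("manual F: ", f_manual)
print("scipy F:  ", f_scipy)
```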

<a id = "30"></a>
##  F-Distribution
In probability theory and statistics, the F-distribution, also known as Snedecor's F distribution or the Fisher–Snedecor distribution (after Ronald Fisher and George W. Snedecor) is a continuous probability distribution that arises frequently as the null distribution of a test statistic, most notably in the analysis of variance (ANOVA), e.g., F-test.

- F value < critical value --> fail to reject null hypothesis
- F value > critical value --> reject null hypothesis
- Degrees of freedom for groups: Number of groups - 1   
3 - 1 = 2
- Degrees of freedom for error: (number of rows - 1)* number of groups   
(10 - 1) * 3 = 27

![image.png](attachment:image.png)

In [None]:
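# Critical F value for (2, 27) degrees of freedom; 5.4881 corresponds to a significance level of 0.01
critical_value = 5.4881
if f_value > critical_value:
    print("Reject Null Hypothesis (f_value > critical_value)")
else:
    print("Fail to reject Null Hypothesis (f_value <= critical_value)")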
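
The critical value can also be computed rather than looked up in a table. This sketch assumes the degrees of freedom listed above (2 and 27) and prints the critical F for two common significance levels; the variable names are illustrative.

```python
from scipy import stats

df_groups = 3 - 1          # number of groups - 1
df_error = (10 - 1) * 3    # (rows per group - 1) * number of groups

critical_values = {}
for alpha in (0.05, 0.01):
    # Upper-alpha quantile of the F-distribution
    critical_values[alpha] = stats.f.ppf(1 - alpha, df_groups, df_error)
    print(f"alpha = {alpha}: critical F = {critical_values[alpha]:.4f}")
```

The value 5.4881 used above is the one for a significance level of 0.01; at 0.05 the threshold is lower, about 3.35.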

<a id = "31"></a>
# Chi-Square Analysis
A chi-squared test, also written as χ2 test, is a statistical hypothesis test that is valid to perform when the test statistic is chi-squared distributed under the null hypothesis, specifically Pearson's chi-squared test and variants thereof. Pearson's chi-squared test is used to determine whether there is a statistically significant difference between the expected frequencies and the observed frequencies in one or more categories of a contingency table.

In the standard applications of this test, the observations are classified into mutually exclusive classes. If the null hypothesis that there are no differences between the classes in the population is true, the test statistic computed from the observations follows a χ2 frequency distribution. The purpose of the test is to evaluate how likely the observed frequencies would be assuming the null hypothesis is true.

Test statistics that follow a χ2 distribution occur when the observations are independent and normally distributed, which assumptions are often justified under the central limit theorem. There are also χ2 tests for testing the null hypothesis of independence of a pair of random variables based on observations of the pairs.

The term chi-squared test often refers to tests for which the distribution of the test statistic approaches the χ2 distribution asymptotically, meaning that the sampling distribution (if the null hypothesis is true) of the test statistic approximates a chi-squared distribution more and more closely as sample sizes increase.

For example, suppose we toss a coin 10 times and get 9 tails and 1 head.
- Our question is: did 9 tails come up by chance, or is the coin inclined towards tails? In other words, is it biased (you can also consider it rigged)?
- Null hypothesis: The coin is fair, so heads and tails are equally likely. We test at a significance level of 0.05 (95% confidence).

For the tails in our example:   
- expected frequency = 5
- observed frequency = 9

For heads: 

- expected frequency = 5
- observed frequency = 1

Chi-Square Value: (9 - 5)² / 5 + (1 - 5)² / 5 = 6.4

![image.png](attachment:image.png)

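If the chi-square value is less than the critical value, the observed frequencies are consistent with the expected frequencies. If it is greater, the difference is statistically significant.

- With 2 - 1 = 1 degree of freedom and a significance level of 0.05, the critical value is about 3.8. Since 6.4 > 3.8, we reject the null hypothesis.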
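
The coin example can be reproduced in a few lines. This is a minimal sketch using `scipy.stats.chisquare`, with the category order (tails, heads) chosen for illustration; with only 10 tosses the chi-square approximation is coarse, so the result should be read as indicative.

```python
from scipy import stats

observed = [9, 1]    # tails, heads
expected = [5, 5]    # fair coin, 10 tosses

# Pearson chi-square statistic: sum((observed - expected)^2 / expected)
chi_value, p_value = stats.chisquare(f_obs=observed, f_exp=expected)

# Critical value for alpha = 0.05 and 2 - 1 = 1 degree of freedom
critical_value = stats.chi2.isf(0.05, 1)

print("Chi-square value: ", chi_value)
print("Critical value: ", round(critical_value, 3))
print("p-value: ", round(p_value, 4))
```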

<a id = "32"></a>
## Chi-Square Analysis Example
There are 7 washing machines in a laundry, each with an equal probability of breaking down. The expected number of failures should therefore be the same for every machine.

- Washing machines are independent of each other.
- Observed failures per machine: machine 1: 5, machine 2: 7, machine 3: 9, machine 4: 4, machine 5: 1, machine 6: 10, machine 7: 6
- Null Hypothesis: All machines have the same failure rate, and the observed differences are due to chance (significance level 0.05).
- Total Failures: 42
- Expected Value: 42 / 7 = 6
- Degrees of Freedom: 7 - 1 = 6

In [None]:
observation = np.array([5,7,9,4,1,10,6])
print("Total: ",np.sum(observation))
expected = np.sum(observation)/ len(observation)
print("Expected: ",expected)
chi_value = np.sum(((observation - expected)**2)/expected)
print("Chi_value: ",chi_value)

In [None]:
from scipy.stats import chi2
crit_value = chi2.isf(0.05,6)
print("Critical value: ", crit_value)

In [None]:
if crit_value > chi_value:
    print("Fail to reject Null Hypothesis (crit_value > chi_value)")
else:
    print("Reject Null Hypothesis (chi_value > crit_value)")
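
As a cross-check on the manual calculation, `scipy.stats.chisquare` computes the same statistic and also returns a p-value. When no expected frequencies are passed, it assumes all categories are equally likely, which matches this example.

```python
import numpy as np
from scipy import stats

observation = np.array([5, 7, 9, 4, 1, 10, 6])

# Expected frequencies default to the mean of the observations (6 per machine)
chi_value, p_value = stats.chisquare(observation)

print("Chi_value: ", chi_value)
print("p-value: ", round(p_value, 4))
```

A p-value above 0.05 leads to the same conclusion as comparing the statistic with the critical value.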

# Credits

https://www.itl.nist.gov/div898/handbook/eda/section3/eda35a.htm

https://www.jmp.com/en_ch/statistics-knowledge-portal/t-test/two-sample-t-test.html

https://analyse-it.com/docs/user-guide/distribution/continuous/normality-hypothesis-test