# Introduction to Statistics
This notebook serves as an introduction to statistics and probability.

# Section 1.0 -  Probability

Probability is the likelihood of an event occurring.

$$P(X)=\frac{Preferred\;Outcomes}{Sample\;Space}$$

An **event** is a set of outcomes from a sample space that we are interested in. In other words an event is the preferred outcomes.<br><br>
The Probability Formula:
* The probability of event X occurring equals the **number** of preferred outcomes over the **number** of outcomes in the
sample space.
* Preferred outcomes are the outcomes we want to occur or the outcomes we are interested in. We also refer to such
outcomes as “Favorable”.
* Sample space refers to all possible outcomes that can occur. Its “size” indicates the number of elements in it.

<br>If two events are **independent**:
The probability of them occurring simultaneously equals the product of the probabilities of each occurring on its own.<br>
For example, the probability of drawing an Ace does not depend on the probability of drawing a Spade. So the probability of drawing the Ace of Spades equals:
$$P(A\spadesuit)=P(A)*P(\spadesuit)$$
$$P(A\spadesuit)=\frac{4}{52}*\frac{13}{52}=\frac{1}{52} \approx 0.0192 = 1.92\%$$
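
The product rule above can be checked with exact fractions. This is a minimal sketch that assumes a standard 52-card deck; the variable names are illustrative only.

```python
from fractions import Fraction

# A standard 52-card deck has 4 aces and 13 spades.
p_ace = Fraction(4, 52)
p_spade = Fraction(13, 52)

# For independent events, the joint probability is the product.
p_ace_of_spades = p_ace * p_spade

# Cross-check by direct counting: exactly one card is the Ace of Spades.
p_direct = Fraction(1, 52)

print(p_ace_of_spades, float(p_ace_of_spades))
```

Both routes give the same fraction, which is what independence of rank and suit implies.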


## Section 1.1 - Expected Values

**Trial** - Observing an event occurrence and recording the outcome.  
**Experiment** - A collection of one or more trials.  
**Experimental Probability** - The probability we assign an event, based on an experiment we conduct.  

**Expected Value** - The specific outcome we expect to occur when we run an experiment.

1. <u>Example of a Trial</u>: Flipping a coin and recording the outcome
2. <u>Example of an Experiment</u>: Flipping a coin 20 times and recording the 20 individual outcomes
<br><br>



### Expected Value for categorical variables
A categorical variable is something like:
* Product Ratings (e.g. "Poor", "Average", "Good", "Excellent")
* Survey responses (e.g. "Yes", "No", "Maybe")

A numerical value **n** must be assigned to each category.

Each category contributes $n*p$, where **p** is the probability of that category, and the expected value is the sum of these contributions: $E(X) = \sum n*p$

#### Example:
Let's say we asked 100 people how satisfied they are, with the options being "Unsatisfied", "Neutral", or "Satisfied".<br>
Let's assume we assigned "Unsatisfied" to 1, "Neutral" to 2, and "Satisfied" to 3, and the results of the survey were as follows:
* "Unsatisfied" (score = 1) -> 10 responses
* "Neutral" (score = 2) -> 30 responses
* "Satisfied" (score = 3) -> 60 responses

$$E(X) = 1*\frac{10}{100} + 2*\frac{30}{100} + 3*\frac{60}{100} = 2.5$$
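
The same calculation can be sketched in a few lines of Python. The dictionary below simply restates the survey tallies from the example; its name and layout are assumptions made for illustration.

```python
# Survey tallies from the example: score -> number of responses.
responses = {1: 10, 2: 30, 3: 60}

total = sum(responses.values())

# E(X) is each score weighted by its experimental probability, summed.
expected_value = sum(score * count / total for score, count in responses.items())

print(expected_value)
```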

### Expected Value for numerical variables
$$E(X) = \sum_{i=1}^n x_i*p_i$$

## Section 1.2 - Combinatorics

Combinatorics is a branch of mathematics focused on counting, arranging and combining objects, often under specific rules or constraints.

### Section 1.2.1 - Permutations
Permutations represent the number of different possible ways we can arrange a number of elements.
$$P(n) = n*(n-1)*(n-2)*(n-3)*...*1$$

Characteristics of Permutations:
* Arranging all elements within the sample space.
* No repetition.
* $P(n) = n*(n-1)*(n-2)*(n-3)*...*1 = n!$ (Called "n factorial")

#### Example:
* If we need to arrange 3 people, we would have P(3)= 6 ways of doing so.
* Assume the people are "Jabari", "Ameer", "Tariq"

<br>We could arrange the following ways:
1. Jabari, Ameer, Tariq
2. Jabari, Tariq, Ameer
3. Ameer, Jabari, Tariq
4. Ameer, Tariq, Jabari
5. Tariq, Jabari, Ameer
6. Tariq, Ameer, Jabari
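
As a quick check, the six arrangements can be enumerated with the standard library. This is only a sketch; `itertools.permutations` is used here as a convenient way to list every ordering.

```python
from itertools import permutations
from math import factorial

people = ["Jabari", "Ameer", "Tariq"]

# Enumerate every ordering of all three people (no repetition).
arrangements = list(permutations(people))
for arrangement in arrangements:
    print(", ".join(arrangement))

# The number of orderings matches P(3) = 3!.
count_matches_factorial = len(arrangements) == factorial(len(people))
```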

### Rules for factorials:
* $0!=1$
* If $n < 0, n!$ does not exist
* $(n + k)! = n!*(n+1)*...*(n+k)$
* $(n-k)! = \frac{n!}{(n-k+1)*...*(n-k+k)} = \frac{n!}{(n-k+1)*...*(n)}$
* $\frac{n!}{k!}=\frac{k!*(k+1)*...*n}{k!}=(k+1)*...*n$

#### Examples:
Let n=7,k=4
* $(7 + 4)! = 11! = 7!*8*9*10*11$
* $(7-4)! = 3! = \frac{7!}{4*5*6*7}$
* $\frac{7!}{4!}=5*6*7$
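
The three identities can be verified numerically for n=7, k=4. The sketch below assumes Python 3.8 or later, where `math.prod` is available.

```python
from math import factorial, prod

n, k = 7, 4

# (n + k)! = n! * (n+1) * ... * (n+k)
lhs_sum = factorial(n + k)
rhs_sum = factorial(n) * prod(range(n + 1, n + k + 1))

# (n - k)! = n! / ((n-k+1) * ... * n)
lhs_diff = factorial(n - k)
rhs_diff = factorial(n) // prod(range(n - k + 1, n + 1))

# n! / k! = (k+1) * ... * n
lhs_ratio = factorial(n) // factorial(k)
rhs_ratio = prod(range(k + 1, n + 1))

print(lhs_sum == rhs_sum, lhs_diff == rhs_diff, lhs_ratio == rhs_ratio)
```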

### Section 1.2.2 - Variations
Variations represent the number of different possible ways we can <u>pick</u> and <u>arrange</u> a number of elements.<br><br>
Variations **with** repetition
$$\overline{V}(n,p) = n^p$$
Intuition behind the formula (with repetition):
* We have n-many options for the first element
* We still have n-many options for the second element because repetition is allowed.
* We have n-many options for each of the p-many elements
* $n*n*n*...*n = n^p$

<br><br>Variations **without** repetition
$$V(n,p)=\frac{n!}{(n-p)!}$$
Intuition behind the formula (without repetition):
* We have n-many options for the first element
* We have only (n-1)-many options for the second element because we can't repeat the value we chose to start with.
* We have fewer options left for each additional element
* $n*(n-1)*(n-2)*...*(n-p+1) = \frac{n!}{(n-p)!}$
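
Both variation formulas can be checked by enumeration. In this sketch the four elements and the choice p=2 are hypothetical; `itertools.product` lists picks with repetition and `itertools.permutations` lists picks without it.

```python
from itertools import permutations, product
from math import perm

elements = ["A", "B", "C", "D"]  # hypothetical elements, n = 4
n = len(elements)
p = 2

# With repetition: every position has all n options.
with_repetition = list(product(elements, repeat=p))

# Without repetition: each pick removes one option for later positions.
without_repetition = list(permutations(elements, p))

print(len(with_repetition), n ** p)
print(len(without_repetition), perm(n, p))
```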

### Section 1.2.3 - Combinations
Combinations represent the number of different possible ways we can pick a number of elements.
$$C(n,p) = C_p^n = \frac{n!}{(n-p)!p!}$$
Characteristics of Combinations:
* Takes into account double-counting. (Selecting Jabari, Ameer, Tariq and Makenna is the same as selecting Makenna, Tariq, Ameer and Jabari)
* All the different permutations of a single combination are different variations
* $$ C = \frac{V}{P} = \frac{n!/(n-p)!}{p!} = \frac{n!}{p!(n-p)!}$$
* Combinations are symmetric, so $C_p^n = C_{n-p}^n$, since selecting p elements is the same as omitting n-p elements
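
The relationship between combinations, variations and permutations can be sketched as follows, using the four names from the example and an assumed p=3.

```python
from itertools import combinations
from math import comb, factorial, perm

people = ["Jabari", "Ameer", "Tariq", "Makenna"]
n = len(people)
p = 3

# Each combination is an unordered selection, so no group is counted twice.
groups = list(combinations(people, p))

# C = V / P: variations divided by the orderings of each selection.
from_variations = perm(n, p) // factorial(p)

print(len(groups), comb(n, p), from_variations, comb(n, n - p))
```

Choosing 3 of 4 people leaves out exactly 1, which is why the count equals the number of ways to choose 1.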

#### Section 1.2.3.1 - Combinations with repetition
$$\overline{C}_p^n = C_p^{n+p-1}$$
In this case, the same element may be selected more than once. Order still does not matter: selecting Jabari, Jabari and Ameer is the same as selecting Ameer, Jabari and Jabari.
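
The formula $C_p^{n+p-1}$ is the standard count of selections of p elements from n when an element may be picked more than once and order is ignored. A small sketch, with hypothetical flavor names, checks it against `itertools.combinations_with_replacement`.

```python
from itertools import combinations_with_replacement
from math import comb

flavors = ["vanilla", "chocolate", "mint"]  # hypothetical elements, n = 3
n = len(flavors)
p = 2

# Unordered selections of p elements where repeats are allowed.
selections = list(combinations_with_replacement(flavors, p))

# The formula: choose p out of n + p - 1.
formula_count = comb(n + p - 1, p)

print(len(selections), formula_count)
```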

#### Section 1.2.3.2 - Combinations with separate sample spaces
Combinations with separate sample spaces represent the number of different possible ways we can pick one element from each of several sample spaces.
$$ C = n_1 * n_2 *...* n_p$$
where $n_1$ is the size of the first sample space, $n_2$ is the size of the second sample space,...,$n_p$ is the size of the p-th sample space
Characteristics of Combinations with separate sample spaces:
* The option we choose for any element does not affect the number of options for the other elements.
* The order in which we pick the individual elements is arbitrary.
* We need to know the size of the sample space for each individual element. $(n_1,n_2...n_p)$
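
A hypothetical menu illustrates the product rule for separate sample spaces. The dish names are invented for the sketch; only the sizes of the lists matter.

```python
from itertools import product
from math import prod

# Three hypothetical sample spaces of sizes 3, 2 and 4.
mains = ["pasta", "salad", "soup"]
sides = ["bread", "fries"]
drinks = ["water", "tea", "juice", "coffee"]

# Every meal picks exactly one element from each sample space.
meals = list(product(mains, sides, drinks))

# C = n_1 * n_2 * ... * n_p
expected_count = prod(len(space) for space in (mains, sides, drinks))

print(len(meals), expected_count)
```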

## Section 1.3 - Bayesian Notation

### Section 1.3.1 - Sets
A **set** is a collection of elements, which hold certain values. Additionally, every event has a set of outcomes
that satisfy it.
The null-set (or empty set), denoted $\emptyset$, is a set which contains no values.

$$x \in A$$
where the Element x is lower-case and the Set A is upper-case<br>
Notation:
* $x \in A$ means "Element x is a part of set A". Example: $2 \in All\,even\,numbers$
* $A \ni x$ means "Set A contains element x". Example: $All\,even\,numbers \ni 2$
* $x \notin A$ means "Element x is NOT a part of set A". Example: $1 \notin All\,even\,numbers$
* $\forall x:$ means "For all/any x such that...". Example: $\forall x:x \in All\,even\,numbers$
* $A \subseteq B$ means "A is a subset of B". Example: $Even\,numbers \subseteq Integers$

Remember! Every non-empty set has at least 2 subsets
* $ A \subseteq A$
* $ \emptyset \subseteq A$
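
Python's built-in `set` type mirrors this notation closely. The sketch below uses a finite range as a stand-in for the integers, which is an assumption made so the example can run.

```python
# A finite stand-in for the integers, and the even numbers within it.
integers = set(range(-10, 11))
evens = {x for x in integers if x % 2 == 0}

is_member = 2 in evens            # 2 is an element of the even numbers
is_not_member = 1 not in evens    # 1 is not an element of the even numbers
is_subset = evens <= integers     # the even numbers are a subset of the integers

# The two subsets that always exist: the set itself and the empty set.
set_is_own_subset = evens <= evens
empty_is_subset = set() <= evens

print(is_member, is_not_member, is_subset, set_is_own_subset, empty_is_subset)
```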

### Section 1.3.2 - Multiple Events

In [None]:
from ipycanvas import Canvas
from math import pi

def write_text(c,string,x,y,size):
    c.fill_style = "white"
    c.font = f"{size}px serif"
    c.fill_text(string, x, y)

def draw_circle(c,x,y,r,color,fontsize,text):
    c.fill_style = color
    c.fill_circle(x, y, r)
    write_text(c,text,x-r/3,y-r/3,fontsize)

def draw_intersect(c,x2,y2,r2,color):  
    c.fill_style = color 
    c.global_composite_operation = 'source-atop';
    c.fill_circle(x2,y2,r2)
    c.global_composite_operation = 'destination-over';

canvas = Canvas(width=1200, height=300)
sectionWidth=400
sectionStart=0

write_text(canvas,"Not touching at all",50,32,18)
draw_circle(canvas,150,150,100,"red",32,"A")
draw_circle(canvas,325,150,50,"orange",32,"B")

sectionStart=400
canvas.stroke_style = "white"
canvas.stroke_rect(sectionStart, 0, sectionWidth, canvas.height)
write_text(canvas,"Intersect (Partially Overlap)",sectionStart+50,32,18)
x1=sectionStart+200;y1=150;r1=100
x2=sectionStart+300;y2=150;r2=50

intersectColor="#CC710A"
draw_circle(canvas,x1,y1,r1,"red",32,"A")
draw_intersect(canvas,x2,y2,r2,intersectColor)
draw_circle(canvas,x2,y2,r2,"orange",32,"B")
canvas.global_composite_operation = 'source-over'
write_text(canvas,"B",x2+r2/4,y2-r2/3,32)

sectionStart=800
canvas.stroke_style = "white"
canvas.stroke_rect(sectionStart, 0, sectionWidth, canvas.height)
write_text(canvas,"One completely overlaps the other",sectionStart+50,32,18)
draw_circle(canvas,sectionStart+200,150,100,"red",32,"A")
draw_circle(canvas,sectionStart+200,175,50,intersectColor,32,"B")

canvas.global_composite_operation = 'destination-over'
canvas.fill_style = "#197186"
canvas.fill_rect(0, 0, canvas.width, canvas.height)
canvas.stroke_style = "white"

canvas

Canvas(height=300, width=1200)

Examples:
1. Not touching at all: $A \subseteq \clubsuit$ , $B \subseteq \spadesuit$
2. Intersecting: $A \subseteq \clubsuit$ , $B \subseteq Queen$
3. Completely overlaps: $A \subseteq Black\;Cards$ , $B \subseteq \spadesuit$

The **intersection** of two or more events expresses the set of outcomes that satisfy all the events
simultaneously. Graphically, this is the area where the sets intersect.<br>
We denote the intersection of two sets as:
$$A \cap B$$

In [14]:
from ipycanvas import Canvas
from math import pi,sin,cos

def write_text(c,string,x,y,size):
    c.fill_style = "white"
    c.font = f"{size}px serif"
    c.fill_text(string, x, y)

def draw_circle(c,x,y,r,color,fontsize,text):
    c.fill_style = color
    c.fill_circle(x, y, r)
    write_text(c,text,x-r/3,y-r/3,fontsize)

def draw_intersect(c,x2,y2,r2,color):  
    c.fill_style = color 
    c.global_composite_operation = 'source-atop';
    c.fill_circle(x2,y2,r2)
    c.global_composite_operation = 'destination-over';

canvas = Canvas(width=400, height=300)
sectionWidth=400
sectionStart=0

sectionStart=0
canvas.stroke_style = "white"
canvas.stroke_rect(sectionStart, 0, sectionWidth, canvas.height)
write_text(canvas,"Union",sectionStart+50,32,18)
x1=sectionStart+200;y1=150;r1=100
x2=sectionStart+300;y2=150;r2=50

intersectColor="#CC710A"
draw_circle(canvas,x1,y1,r1,"red",32,"A")

draw_intersect(canvas,x2,y2,r2,intersectColor)
draw_circle(canvas,x2,y2,r2,"orange",32,"B")
canvas.global_composite_operation = 'source-over'
write_text(canvas,"B",x2+r2/4,y2-r2/3,32)


canvas.global_composite_operation = 'destination-over'
draw_circle(canvas,x1,y1,r1*1.05,"white",32,"A")
draw_circle(canvas,x2,y2,r2*1.1,"white",32,"A")
canvas.fill_style = "#197186"
canvas.fill_rect(0, 0, canvas.width, canvas.height)
canvas.stroke_style = "white"
canvas.stroke_rect(sectionStart, 0, sectionWidth, canvas.height)

canvas

Canvas(height=300, width=400)

The **union** of two or more events expresses the set of outcomes that satisfy at least one of the events.<br>
Graphically, this is the area that includes both sets. We denote the union of two sets as:
$$ A\cup B$$
Counting the elements of both sets counts the intersection twice, so we subtract it once:
$$ |A \cup B| = |A| + |B| - |A \cap B| $$
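
The relationship between union and intersection can be checked by counting cards. This sketch builds a standard 52-card deck as a set of (rank, suit) tuples; that representation is an assumption chosen for illustration.

```python
from itertools import product

ranks = ["2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K", "A"]
suits = ["clubs", "diamonds", "hearts", "spades"]
deck = set(product(ranks, suits))

clubs = {card for card in deck if card[1] == "clubs"}
queens = {card for card in deck if card[0] == "Q"}

intersection = clubs & queens  # only the Queen of clubs
union = clubs | queens         # every card that is a club or a Queen

# Adding the sizes counts the shared card twice, so subtract it once.
size_by_formula = len(clubs) + len(queens) - len(intersection)

print(len(union), size_by_formula)
```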

In [15]:
from ipycanvas import Canvas
from math import pi,sin,cos

def write_text(c,string,x,y,size):
    c.fill_style = "white"
    c.font = f"{size}px serif"
    c.fill_text(string, x, y)

def draw_circle(c,x,y,r,color,fontsize,text):
    c.fill_style = color
    c.fill_circle(x, y, r)
    write_text(c,text,x-r/3,y-r/3,fontsize)

def draw_intersect(c,x2,y2,r2,color):  
    c.fill_style = color 
    c.global_composite_operation = 'source-atop';
    c.fill_circle(x2,y2,r2)
    c.global_composite_operation = 'destination-over';

canvas = Canvas(width=800, height=300)
sectionWidth=400
sectionStart=0



write_text(canvas,"Mutually Exclusive Sets",50,32,18)
draw_circle(canvas,150,150,100,"red",32,"A")
draw_circle(canvas,325,150,50,"orange",32,"B")

sectionStart=400
write_text(canvas,"Complements",sectionStart+50,32,18)
draw_circle(canvas,sectionStart+150,150,100,"red",32,"A")
canvas.global_composite_operation = 'destination-over';
canvas.fill_style = "orange"
canvas.fill_rect(sectionStart, 0, sectionWidth, canvas.height)
canvas.global_composite_operation = 'source-atop';
write_text(canvas,"B",sectionStart+sectionWidth-100,150,32)
canvas.stroke_style = "white"
canvas.stroke_rect(sectionStart, 0, sectionWidth, canvas.height)

canvas.global_composite_operation = 'destination-over';
canvas.fill_style = "#197186"
canvas.fill_rect(0, 0, canvas.width, canvas.height)
canvas.stroke_style = "white"
canvas.stroke_rect(sectionStart, 0, sectionWidth, canvas.height)

canvas

Canvas(height=300, width=800)

Sets with no overlapping elements are called **mutually exclusive**. Graphically, their circles never touch. <br>
If $ A \cap B = \emptyset$, then the two sets are mutually exclusive. <br>

The **complement** of an event is the set of ALL outcomes that are in the sample space but are NOT in the event.
$$ A^c = \lbrace x : x \notin A \rbrace $$
Remember:<br>
All complements are mutually exclusive, but not all mutually exclusive sets are complements.

Example: <br>
Dogs and Cats are mutually exclusive sets, since no species is simultaneously a feline and a canine, but the two are not complements, since there exist other types of animals as well.
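
The same distinction can be shown with a deck of cards. In this sketch, which assumes a standard 52-card deck stored as (rank, suit) tuples, spades and hearts are mutually exclusive but not complements.

```python
from itertools import product

ranks = ["2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K", "A"]
suits = ["clubs", "diamonds", "hearts", "spades"]
deck = set(product(ranks, suits))  # the sample space

spades = {card for card in deck if card[1] == "spades"}
hearts = {card for card in deck if card[1] == "hearts"}

# Mutually exclusive: no card is both a spade and a heart.
mutually_exclusive = spades.isdisjoint(hearts)

# Complement of spades: every card in the sample space that is not a spade.
not_spades = deck - spades

# Hearts are not the complement of spades, since clubs and diamonds exist too.
are_complements = hearts == not_spades

print(mutually_exclusive, len(not_spades), are_complements)
```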

### Section 1.3.3 - Conditional Probability

For any two events A and B, such that the likelihood of B occurring is greater than 0 ($P(B) > 0$), the conditional probability
formula states the following:

$$ P(A | B) = \frac{P(A \cap B)}{P(B)} $$

This reads as "The probability of A occurring given that B has occurred equals the probability of them both happening simultaneously divided by the probability of B occurring". <br><br>

Remember that $ P(A | B)$ does not mean the same thing as $P(B | A) $, even if $ P(A|B) = P(B|A) $ numerically.
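
A short card example shows how the two conditional probabilities differ. The events chosen here (A: the card is a Queen, B: the card is a face card) are an illustrative assumption.

```python
from fractions import Fraction

deck_size = 52
queens = 4            # event A: the card is a Queen
face_cards = 12       # event B: the card is a Jack, Queen or King
queens_and_faces = 4  # every Queen is also a face card

p_a = Fraction(queens, deck_size)
p_b = Fraction(face_cards, deck_size)
p_a_and_b = Fraction(queens_and_faces, deck_size)

p_a_given_b = p_a_and_b / p_b  # P(A | B): a face card is a Queen
p_b_given_a = p_a_and_b / p_a  # P(B | A): a Queen is a face card

print(p_a_given_b, p_b_given_a)
```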

#### Section 1.3.3.1 - Law of Total Probability

The law of total probability dictates that for any event A and any collection of mutually exclusive sets $B_1,B_2,...,B_n$ whose union contains A, the probability of A equals the
following sum.

$$ P(A) = P(A | B_1)*P(B_1) + P(A | B_2)*P(B_2) + ... + P(A | B_n)*P(B_n) $$
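
As a worked sketch, consider a hypothetical factory with two machines. All numbers below are invented for illustration: machine B1 makes 60% of the parts, machine B2 makes 40%, and A is the event that a part is defective.

```python
from fractions import Fraction

# Hypothetical production shares: P(B_i). They are mutually exclusive and sum to 1.
p_b = {"B1": Fraction(3, 5), "B2": Fraction(2, 5)}

# Hypothetical defect rates per machine: P(A | B_i).
p_a_given_b = {"B1": Fraction(1, 50), "B2": Fraction(1, 20)}

# P(A) = sum over i of P(A | B_i) * P(B_i)
p_a = sum(p_a_given_b[machine] * p_b[machine] for machine in p_b)

print(p_a, float(p_a))
```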

#### Section 1.3.3.2 - Multiplication Rule

The multiplication rule calculates the probability of the intersection based on the conditional probability.

$$ P(A \cap B) = P(A|B)*P(B) $$

Intuition behind the formula:
* If event B occurs 40% of the time ($P(B)=0.4$) and event A occurs 50% of the time that event B occurs ($P(A|B)=0.5$), then they would simultaneously occur 20% of the time ($P(A|B)*P(B)=0.5*0.4 = 0.2$) 

#### Section 1.3.3.3 - Bayes' Law

Bayes’ Law helps us understand the relationship between two events by computing the different conditional probabilities. <br>
We also call it Bayes’ Rule or Bayes’ Theorem.
$$ P(A|B) = \frac{P(B|A)*P(A)}{P(B)} $$

Intuition behind the formula
* According to the multiplication rule $ P(A \cap B) = P(A|B)*P(B) $ , so $P(B \cap A) = P(B|A)*P(A)$
* Since $P(A \cap B) = P(B \cap A)$, we plug in $P(B|A)*P(A)$ for $P(A \cap B)$ in the conditional probability formula $P(A|B) = \frac{P(A \cap B)}{P(B)}$
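
Bayes' Law can be sketched with a hypothetical quality-control scenario. The numbers are invented for illustration: two machines make 60% and 40% of all parts, with defect rates of 2% and 5%. The question is how likely a defective part is to have come from the second machine.

```python
from fractions import Fraction

# Hypothetical priors: share of parts made by each machine.
p_m1 = Fraction(3, 5)
p_m2 = Fraction(2, 5)

# Hypothetical likelihoods: P(defect | machine).
p_defect_given_m1 = Fraction(1, 50)
p_defect_given_m2 = Fraction(1, 20)

# P(defect), summed over both machines.
p_defect = p_defect_given_m1 * p_m1 + p_defect_given_m2 * p_m2

# Bayes' Law: P(m2 | defect) = P(defect | m2) * P(m2) / P(defect)
p_m2_given_defect = p_defect_given_m2 * p_m2 / p_defect

print(p_m2_given_defect, float(p_m2_given_defect))
```

Although the second machine makes fewer parts, its higher defect rate makes it the more likely source of a defective part.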

# Inferential Statistics

## Population
A population is a collection of all items of interest (an entire set). The number of items $X_i$ in the population is denoted $N$.

A number obtained from a population is called a parameter.<br>
The population mean denoted as $\mu$ is:
$$\mu = \frac{1}{N}*\sum_{i=1}^NX_i $$
The population variance denoted $\sigma^2$ is:
$$\sigma^2 = E((X-\mu)^2)$$
$$\sigma^2 = \frac{1}{N}*\sum_{i=1}^N(X_i-\mu)^2 $$

## Sample
A sample is a subset of the population. The number of items $x_i$ in a sample is denoted $n$.

A number obtained from a sample is called a statistic.<br>
The sample mean denoted as $\bar{x}$ is:
$$\bar{x} = \frac{1}{n}*\sum_{i=1}^nx_i $$
The sample variance denoted $s^2$ is:
$$s^2 = E((X-\bar{x})^2)$$
$$s^2 = \frac{1}{n-1}*\sum_{i=1}^n(x_i-\bar{x})^2 $$
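
The difference between dividing by $N$ and by $n-1$ can be seen with Python's `statistics` module. The data values below are hypothetical; `pvariance` treats them as a whole population and `variance` treats them as a sample.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations

mean = statistics.mean(data)

# Population variance: divide the squared deviations by N.
population_variance = statistics.pvariance(data)

# Sample variance: divide the squared deviations by n - 1.
sample_variance = statistics.variance(data)

print(mean, population_variance, sample_variance)
```

The sample variance is slightly larger because the same sum of squared deviations is divided by a smaller number.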


## Central Limit Theorem
If you take the sum (or mean) of each of many samples, those sums (or means) will have a distribution. This distribution is called the sampling distribution. The Central Limit Theorem states that, for a sufficiently large sample size $n$, the sampling distribution of the mean will approximate a normal distribution no matter what the underlying distribution of the population is (provided its variance is finite). The sampling distribution of the mean will have the following parameters:
$$\bar{x} \sim N(\mu,\frac{\sigma^2}{n})$$

The standard deviation of the sampling distribution is called the <u>standard error</u> and is $\frac{\sigma}{\sqrt{n}}$
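
A small simulation can illustrate the theorem. This sketch draws from an exponential distribution with mean 1 and standard deviation 1, which is strongly skewed; the sample size, number of samples and seed are arbitrary assumptions.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

n = 50              # size of each sample
num_samples = 2000  # how many samples we draw

# Mean of each sample drawn from a skewed (exponential) population.
sample_means = [
    statistics.fmean([random.expovariate(1.0) for _ in range(n)])
    for _ in range(num_samples)
]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)

# Standard error predicted by the theorem: sigma / sqrt(n), with sigma = 1.
theoretical_se = 1.0 / n ** 0.5

print(mean_of_means, sd_of_means, theoretical_se)
```

The mean of the sample means should land near the population mean of 1, and their spread should land near the predicted standard error.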

## Confidence Interval
If you take the mean of a sample, $\bar{x}$, it is called a point estimate. This point estimate will change from sample to sample as it is only an estimate of the population mean $\mu$. The amount that it changes is based on the population variance $\sigma^2$. The higher the variance, the more the point estimate will fluctuate.

A confidence interval is a range for which there is a certain probability that the true population parameter falls within that range.
For instance, a confidence interval, CI, of 95% means that there is 95% certainty (or confidence) that the true parameter is in that range.
Note: The only way to have a 100% confidence interval is to "sample" the entire population.

The level of confidence, denoted $1- \alpha$, is called the confidence level of the interval, where $0\leq \alpha \leq1$. So to be 95% confident that the value is inside the interval, $\alpha=0.05$ (5%).

The formula for the confidence interval is:
$$[point\,estimate - reliability\,factor*standard\,error,point\,estimate + reliability\,factor*standard\,error]$$

There are two scenarios when we calculate the confidence interval of a parameter: 1) when the population variance $\sigma^2$ is known and 2) when $\sigma^2$ is unknown.
$$ CI,variance_{known} = [\bar{x}-z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\bar{x}+z_{\alpha/2}\frac{\sigma}{\sqrt{n}}]$$
$$CI,variance_{known} = \bar{x} \pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$$

The z-table is based on the CDF of the normal distribution and is as follows, where each entry is the cumulative probability for its z-score. For a two-sided interval, we look up the entry equal to $(1-\alpha/2)$:

In [16]:
# Import required libraries
import numpy as np
import scipy.stats as stats
import pandas as pd

# The row headers (the first column in the table).
# The 2nd parameter (end of interval) is excluded from the output.  
# Thus, we'll set it to 3.1 to get values from 0 to 3.
row_headers = np.arange(0, 3.1, 0.1)
column_headers = np.arange(0.0, 0.10, 0.01)

# Generate a 2D grid of column and row headers.
# It'll return two 2D arrays containing pairs
# for each combination of row and column headers.
X1, X2 = np.meshgrid(column_headers, row_headers)

z_score_grid = X1 + X2

# Get cumulative probability for all the z-scores. 
z_score_cdf_grid = stats.norm.cdf(z_score_grid)

positive_z_score_table = pd.DataFrame(
    # Pass the z-score CDF 2D array 
    z_score_cdf_grid,
    # Set the single decimal z-scores as the index
    index=row_headers, 
    # Set array with 2nd decimal as the columns header
    columns=column_headers
)
# round all probability values to 4 decimals 
positive_z_score_table.round(4)

| z | 0.00 | 0.01 | 0.02 | 0.03 | 0.04 | 0.05 | 0.06 | 0.07 | 0.08 | 0.09 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.0 | 0.5 | 0.504 | 0.508 | 0.512 | 0.516 | 0.5199 | 0.5239 | 0.5279 | 0.5319 | 0.5359 |
| 0.1 | 0.5398 | 0.5438 | 0.5478 | 0.5517 | 0.5557 | 0.5596 | 0.5636 | 0.5675 | 0.5714 | 0.5753 |
| 0.2 | 0.5793 | 0.5832 | 0.5871 | 0.591 | 0.5948 | 0.5987 | 0.6026 | 0.6064 | 0.6103 | 0.6141 |
| 0.3 | 0.6179 | 0.6217 | 0.6255 | 0.6293 | 0.6331 | 0.6368 | 0.6406 | 0.6443 | 0.648 | 0.6517 |
| 0.4 | 0.6554 | 0.6591 | 0.6628 | 0.6664 | 0.67 | 0.6736 | 0.6772 | 0.6808 | 0.6844 | 0.6879 |
| 0.5 | 0.6915 | 0.695 | 0.6985 | 0.7019 | 0.7054 | 0.7088 | 0.7123 | 0.7157 | 0.719 | 0.7224 |
| 0.6 | 0.7257 | 0.7291 | 0.7324 | 0.7357 | 0.7389 | 0.7422 | 0.7454 | 0.7486 | 0.7517 | 0.7549 |
| 0.7 | 0.758 | 0.7611 | 0.7642 | 0.7673 | 0.7704 | 0.7734 | 0.7764 | 0.7794 | 0.7823 | 0.7852 |
| 0.8 | 0.7881 | 0.791 | 0.7939 | 0.7967 | 0.7995 | 0.8023 | 0.8051 | 0.8078 | 0.8106 | 0.8133 |
| 0.9 | 0.8159 | 0.8186 | 0.8212 | 0.8238 | 0.8264 | 0.8289 | 0.8315 | 0.834 | 0.8365 | 0.8389 |


For each entry, the value of $z_{\alpha/2}$ is found by adding the row header and the column header for that entry.

For example, for a 95% confidence interval, $\alpha = 0.05$; therefore $\alpha/2 = 0.025$ and $(1-\alpha/2) = 0.975$. The entry 0.975 sits in row 1.9 and column 0.06, so the reliability factor is $z_{0.025} = 1.9 + 0.06 = 1.96$
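
The lookup and the interval can be sketched with the standard library's `statistics.NormalDist`, whose `inv_cdf` method plays the role of reading the z-table in reverse. The function name and the sample numbers below are hypothetical.

```python
from statistics import NormalDist

def z_confidence_interval(sample_mean, sigma, n, confidence=0.95):
    """Confidence interval for the mean when the population sigma is known."""
    alpha = 1 - confidence
    # Reliability factor: the z-score whose cumulative probability is 1 - alpha/2.
    z = NormalDist().inv_cdf(1 - alpha / 2)
    margin = z * sigma / n ** 0.5  # reliability factor * standard error
    return sample_mean - margin, sample_mean + margin

# Hypothetical inputs: sample mean 100, known sigma 15, sample size 36.
low, high = z_confidence_interval(sample_mean=100.0, sigma=15.0, n=36)
z_95 = NormalDist().inv_cdf(0.975)

print(round(z_95, 2), round(low, 2), round(high, 2))
```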

When the population variance is unknown, we use the sample variance and the Student's t-distribution, where the degrees of freedom (df) equals $n-1$.
$$CI,variance_{unknown} = \bar{x} \pm t_{n-1,\alpha/2}\frac{s}{\sqrt{n}}$$
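
A minimal sketch of this interval follows. The ten measurements are hypothetical, and the critical value 2.2622 is the standard t-table entry for 9 degrees of freedom with an upper-tail area of 0.025, hard-coded here so the example needs only the standard library.

```python
import statistics

# Hypothetical sample of n = 10 measurements.
sample = [9.8, 10.2, 10.4, 9.9, 10.0, 10.1, 9.7, 10.3, 10.2, 9.9]

n = len(sample)
x_bar = statistics.fmean(sample)
s = statistics.stdev(sample)  # sample standard deviation (divides by n - 1)

# t critical value for df = n - 1 = 9 and alpha/2 = 0.025 (95% confidence).
t_crit = 2.2622

margin = t_crit * s / n ** 0.5
ci = (x_bar - margin, x_bar + margin)

print(round(x_bar, 3), round(margin, 4), ci)
```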

The Student's T table is as follows:

In [17]:
import numpy as np
from scipy.stats import t
import pandas as pd

def create_students_t_table(degrees_of_freedom_range, alpha_levels):
    """
    Creates a Student's t-distribution table.

    Args:
        degrees_of_freedom_range (list or array): A list or array of degrees of freedom.
        alpha_levels (list or array): A list or array of significance levels (e.g., 0.10, 0.05, 0.01).

    Returns:
        pandas.DataFrame: A DataFrame representing the t-table.
    """
    t_values = {}
    for df in degrees_of_freedom_range:
        row_values = []
        for alpha in alpha_levels:
            # Each column is an upper-tail probability, so the critical value
            # is the quantile at 1 - alpha (a one-tailed lookup).
            # For a two-sided interval, read the column for alpha/2.
            critical_value = t.ppf(1 - alpha, df)
            row_values.append(critical_value)
        t_values[df] = row_values

    df_t_table = pd.DataFrame(t_values).T
    df_t_table.columns = [f'alpha={a}' for a in alpha_levels]
    df_t_table.index.name = '(df)'
    return df_t_table

# Define the degrees of freedom and significance levels
dfs = np.arange(1, 31)  # Degrees of freedom from 1 to 30
alphas = [0.10, 0.05, 0.025, 0.01, 0.005] # Common alpha levels

# Create the t-table
t_table = create_students_t_table(dfs, alphas)

# Print the table
t_table.round(4)

| (df) | alpha=0.1 | alpha=0.05 | alpha=0.025 | alpha=0.01 | alpha=0.005 |
|---|---|---|---|---|---|
| 1 | 3.0777 | 6.3138 | 12.7062 | 31.8205 | 63.6567 |
| 2 | 1.8856 | 2.92 | 4.3027 | 6.9646 | 9.9248 |
| 3 | 1.6377 | 2.3534 | 3.1824 | 4.5407 | 5.8409 |
| 4 | 1.5332 | 2.1318 | 2.7764 | 3.7469 | 4.6041 |
| 5 | 1.4759 | 2.015 | 2.5706 | 3.3649 | 4.0321 |
| 6 | 1.4398 | 1.9432 | 2.4469 | 3.1427 | 3.7074 |
| 7 | 1.4149 | 1.8946 | 2.3646 | 2.998 | 3.4995 |
| 8 | 1.3968 | 1.8595 | 2.306 | 2.8965 | 3.3554 |
| 9 | 1.383 | 1.8331 | 2.2622 | 2.8214 | 3.2498 |


For two samples x and y, the confidence interval for the difference between the two means depends on how the samples are related:<br>
Dependent samples, where $s_d$ is the standard deviation of the paired differences:
$$CI,dependent = (\bar{x}-\bar{y}) \pm t_{n-1,\alpha/2}\frac{s_d}{\sqrt{n}} $$
Independent samples when the population variances are known:
$$CI,variance_{known} = (\bar{x}-\bar{y}) \pm z_{\alpha/2}\sqrt{\frac{\sigma_x^2}{n_x} + \frac{\sigma_y^2}{n_y}}$$
Independent samples when the population variances are unknown but assumed to be the same, using the pooled variance:
$$s_p^2=\frac{(n_x-1)s_x^2 + (n_y-1)s_y^2}{n_x+n_y-2}$$
$$CI,variance_{unknown} = (\bar{x}-\bar{y}) \pm t_{n_x+n_y-2,\alpha/2}\sqrt{\frac{s_p^2}{n_x} + \frac{s_p^2}{n_y}}$$
Independent samples when the population variances are unknown and assumed to be different:<br>
Note: This is not covered as it's rarely encountered in our cases.
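
The pooled-variance case can be sketched as follows. Both samples are hypothetical, and the critical value 2.2281 is the standard t-table entry for 10 degrees of freedom with an upper-tail area of 0.025, hard-coded so that only the standard library is needed.

```python
import statistics

# Two hypothetical independent samples of six observations each.
x = [20.1, 19.8, 20.5, 20.0, 19.6, 20.4]
y = [19.2, 19.9, 19.5, 19.0, 19.7, 19.3]

n_x, n_y = len(x), len(y)
mean_diff = statistics.fmean(x) - statistics.fmean(y)
var_x = statistics.variance(x)  # sample variances (divide by n - 1)
var_y = statistics.variance(y)

# Pooled variance: sample variances weighted by their degrees of freedom.
pooled_var = ((n_x - 1) * var_x + (n_y - 1) * var_y) / (n_x + n_y - 2)

# t critical value for df = n_x + n_y - 2 = 10 and alpha/2 = 0.025.
t_crit = 2.2281

margin = t_crit * (pooled_var / n_x + pooled_var / n_y) ** 0.5
ci = (mean_diff - margin, mean_diff + margin)

print(round(mean_diff, 4), round(pooled_var, 4), ci)
```

With equal sample sizes, the pooled variance reduces to the plain average of the two sample variances.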

## Hypothesis Testing

Steps in data-driven decision making
1. Formulate a hypothesis
2. Find the right test
3. Execute the test
4. Make a decision based on the results

If the population variance is known:
$$Z=\frac{\bar{x}-\mu_0}{\sigma/\sqrt{n}}$$
<br>
We aim to reject the null hypothesis if it is false. We could make an error and reject the null hypothesis when it is true. The probability of rejecting the null hypothesis when it is true is denoted as $\alpha$.<br>
If the test is whether the parameter equals a specific number (a two-sided test), then: <br>
**Decision rule:**
* Accept if: absolute value of Z-score < positive critical value ($z_{\alpha/2}$). A Z-score of 0 means the sample statistic equals the hypothesized population parameter
* Reject if: absolute value of Z-score >= positive critical value ($z_{\alpha/2}$)

The <u>p-value</u> is the smallest level of significance at which we can reject the null hypothesis, given the observed sample statistic.

For a 1-sided test, the p-value is: 1 minus the number from the z-table for the absolute value of the Z-score<br>
For a 2-sided test, the p-value is: (1 minus the number from the z-table for the absolute value of the Z-score)*2

**Decision rule:**
* Accept if: p-value > $\alpha$
* Reject if: p-value <= $\alpha$
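
The steps above can be sketched as a two-sided z-test. The function name and all input numbers are hypothetical; `statistics.NormalDist().cdf` stands in for the z-table lookup.

```python
from statistics import NormalDist

def two_sided_z_test(sample_mean, mu_0, sigma, n, alpha=0.05):
    """Return the Z-score, the two-sided p-value, and the reject decision."""
    z = (sample_mean - mu_0) / (sigma / n ** 0.5)
    # Two-sided p-value: (1 minus the table value for |Z|) times 2.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value, p_value <= alpha

# Hypothetical test of mu_0 = 100 with known sigma = 15 and n = 36.
z_a, p_a, reject_a = two_sided_z_test(sample_mean=104.0, mu_0=100.0, sigma=15.0, n=36)
z_b, p_b, reject_b = two_sided_z_test(sample_mean=106.0, mu_0=100.0, sigma=15.0, n=36)

print(round(z_a, 2), round(p_a, 4), reject_a)
print(round(z_b, 2), round(p_b, 4), reject_b)
```

The first sample mean is not far enough from 100 to reject at $\alpha = 0.05$, while the second is.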