Basic Statistics for Understanding Hypothesis Testing

# Statistics

## 1. Introduction to Statistics
Statistics is the science of **collecting**, **organizing**, **analyzing**, and **interpreting data** to make decisions or draw conclusions.

There are two main branches of statistics:
- **Descriptive Statistics** — Describe and summarize data.
- **Inferential Statistics** — Make predictions or inferences about a population from a sample.

---

## 2. Inferential and Descriptive Statistics

| Type | Purpose | Key Concepts | Example |
|------|----------|---------------|----------|
| **Descriptive Statistics** | Summarize the data you have | Mean, Median, Mode, Range, Variance, Charts | "The average exam score is 72." |
| **Inferential Statistics** | Make conclusions about a larger population using sample data | Sampling, Probability, Hypothesis Testing, Confidence Intervals | "The average score for all students is between 70–74." |

**In short:**
- Descriptive = Describe what you see  
- Inferential = Predict what you don’t see

---

## 3. Population and Sample

| Term | Meaning | Example |
|------|----------|----------|
| **Population** | The entire group you want to study | All university students in Kenya |
| **Sample** | A smaller part of the population used for analysis | 200 students selected from different universities |

**Relationship:**  
We use the **sample** to make **inferences about the population**.

**Example:**  
If 200 sampled students have an average score of 72, we can use 72 as an estimate of the average score for all students in the population.

---

## 4. Descriptive Statistics

Descriptive statistics help us **summarize data** in simple and meaningful ways.

### 4.1 Measures of Central Tendency
- **Mean** — The average value.  
- **Median** — The middle value when data is arranged in order.  
- **Mode** — The most frequent value in the dataset.

### 4.2 Measures of Spread
- **Range** — The difference between the highest and lowest value.  
- **Variance** — The average squared difference from the mean.  
- **Standard Deviation** — The square root of the variance; shows how spread out data is.
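
As a quick check of the measures in 4.1 and 4.2, the sketch below computes each one with Python's standard `statistics` module. The five scores are made-up illustration data, not values from this notebook, and the variance and standard deviation use the population formulas (dividing by the number of values).

```python
import statistics

# Made-up exam scores, used only for illustration
scores = [60, 70, 70, 80, 90]

mean = statistics.mean(scores)           # average value
median = statistics.median(scores)       # middle value of the sorted data
mode = statistics.mode(scores)           # most frequent value
value_range = max(scores) - min(scores)  # highest minus lowest
variance = statistics.pvariance(scores)  # average squared difference from the mean
std_dev = statistics.pstdev(scores)      # square root of the variance

print(mean, median, mode, value_range, variance, round(std_dev, 2))
```

Swapping `pvariance`/`pstdev` for `variance`/`stdev` would give the sample versions, which divide by one less than the number of values.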

### 4.3 Data Visualization
Common tools for visualizing data:
- Tables  
- Bar Charts  
- Histograms  
- Boxplots  
- Pie Charts  

---

## 5. Summary

| Concept | Description |
|----------|-------------|
| **Descriptive Statistics** | Describe the known data. |
| **Inferential Statistics** | Predict the unknown from the known. |
| **Population** | The whole group being studied. |
| **Sample** | A small subset of the population used for study. |

**In essence:**  
Descriptive tells the story of your data; inferential predicts the next chapter.

---

## 6. Importing Libraries 


```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
```

Describing the data gives summary statistics such as the mean, median, standard deviation, and percentiles.

A frequency distribution can be displayed as a histogram.

`%matplotlib inline` tells Jupyter:

“Show all my plots inside the notebook, right below the code cell that creates them.”

### Cumulative Histogram

A cumulative histogram shows how data accumulates across intervals (bins). Instead of showing the frequency within each bin, it shows the running total — how many values fall below or equal to each bin edge
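
To make the running total concrete, here is a minimal sketch that bins a small list of values and accumulates the counts. The scores and bin edges are made up for illustration, and each bin is treated as half-open (`low <= value < high`), which is an assumption of this sketch.

```python
from itertools import accumulate

# Made-up scores and bin edges, used only for illustration
scores = [52, 58, 61, 67, 70, 72, 75, 81, 88, 94]
bin_edges = [50, 60, 70, 80, 90, 100]

# Frequency within each bin [low, high)
counts = [
    sum(low <= s < high for s in scores)
    for low, high in zip(bin_edges[:-1], bin_edges[1:])
]

# Running total: how many values fall below each upper bin edge
cumulative = list(accumulate(counts))

for high, count, total in zip(bin_edges[1:], counts, cumulative):
    print(f"below {high}: bin count = {count}, cumulative = {total}")
```

The last cumulative value always equals the number of observations. For a plot, matplotlib's `plt.hist` accepts `cumulative=True` to draw the same running total.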

# Deviation

## Definition
Deviation is the amount by which a single measurement differs from a fixed value — usually the **mean**.  
It shows how far and in what direction a data point is from the average.

---

## Mathematical Expression

If $x_i$ is a data point and $\bar{x}$ is the mean, then:

$$
\text{Deviation} = x_i - \bar{x}
$$

---

## Example

**Heights of 5 students (in cm):**

| Student | Height ($x_i$) | Mean ($\bar{x} = 170$) | Deviation ($x_i - \bar{x}$) |
|----------|----------------|------------------------|-----------------------------|
| A | 165 | 170 | -5 |
| B | 168 | 170 | -2 |
| C | 170 | 170 | 0 |
| D | 175 | 170 | +5 |
| E | 172 | 170 | +2 |

**Interpretation:**
- Negative deviation → below the mean  
- Positive deviation → above the mean  
- Zero deviation → equal to the mean  

---

## Important Notes

The sum of all deviations from the mean is always **zero**:

$$
\sum (x_i - \bar{x}) = 0
$$

To measure overall spread, we use:
- **Variance:** Average of squared deviations  
- **Standard Deviation:** Square root of variance  

Thus, **deviation** is the foundation for all measures of dispersion.

---


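```python
# Calculate deviations (heights from the table above, in cm)
data = np.array([165, 168, 170, 175, 172])
mean = data.mean()
deviations = data - mean
```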




It’s a powerful library that provides:

- High-performance tools for mathematical operations
- Efficient handling of large datasets

Import the library before using it.

# Mean Deviation and Mean Absolute Deviation (MAD)

## Understanding Deviation

Find how far each height is from the mean (the **deviation**), then add them all together and round the result.  
The sum of all deviations from the mean is always **zero** — meaning the mean is the **balance point** of the data.

---

## Example

Let's calculate the deviations manually:

$$
(-20) + (-10) + 0 + 10 + 20 = 0
$$

- **Always equals zero** → because the mean is the balance point.
- **Not useful for measuring spread** → since positive and negative deviations cancel out.

---



## Mean Absolute Deviation (MAD)

### Definition
The **Mean Absolute Deviation (MAD)** is the **average of the absolute deviations** from the mean.  
It shows how far, on average, each data value lies from the mean.

---

### Formula

$$
MAD = \frac{1}{n} \sum_{i=1}^{n} |x_i - \bar{x}|
$$

Where:  
- \( x_i \) = individual data value  
- \( \bar{x} \) = mean (average) of the dataset  
- \( n \) = total number of observations  
- \( |x_i - \bar{x}| \) = absolute deviation from the mean  

---

### Example Data

**Heights of 5 students (in cm):**

| Student | \( x_i \) (Height) |
|----------|--------------------|
| A | 150 |
| B | 160 |
| C | 170 |
| D | 180 |
| E | 190 |

---

### Step 1 — Find the Mean

$$
\bar{x} = \frac{150 + 160 + 170 + 180 + 190}{5} = 170
$$

---

### Step 2 — Find Each Deviation from the Mean

$$
x_i - \bar{x} = [150 - 170, 160 - 170, 170 - 170, 180 - 170, 190 - 170]
$$

$$
x_i - \bar{x} = [-20, -10, 0, 10, 20]
$$

---

### Step 3 — Take the Absolute Deviations

$$
|x_i - \bar{x}| = [20, 10, 0, 10, 20]
$$

---

### Step 4 — Find the Mean of Absolute Deviations

$$
MAD = \frac{20 + 10 + 0 + 10 + 20}{5}
$$

$$
MAD = \frac{60}{5} = 12
$$

---

### Final Result

$$
MAD = 12
$$
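
The four steps above can be reproduced in a few lines of plain Python. This is only a sketch of the hand calculation, using the five heights from the example table; the variable names are chosen here for illustration.

```python
# Heights from the example table (cm)
heights = [150, 160, 170, 180, 190]

mean = sum(heights) / len(heights)                 # Step 1: the mean
deviations = [x - mean for x in heights]           # Step 2: signed deviations
abs_deviations = [abs(d) for d in deviations]      # Step 3: absolute deviations
mad = sum(abs_deviations) / len(abs_deviations)    # Step 4: their average

print(sum(deviations))  # signed deviations cancel out
print(mad)
```

The first printed value shows why the signed deviations cannot measure spread, and the second is the MAD.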

---


```python
# Fill in the column name between the quotes before running this cell
mad = (abs(data[''] - data[''].mean())).mean()
mad
```


# Variance

## Definition
The **Variance** is the **average of the squared deviations** from the mean.  
It shows how much the data values **spread out** from the mean.  
A higher variance indicates that data points are more spread out, while a lower variance means they are closer to the mean.

---

## Mathematical Expression

For a **population**, the variance is denoted by \( \sigma^2 \):

$$
\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N}
$$

For a **sample**, the variance is denoted by \( s^2 \):

$$
s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}
$$

Where:  
- \( x_i \) → each data point  
- \( \bar{x} \) → mean of the dataset  
- \( N \) → total number of data points (population)  
- \( n \) → total number of data points (sample)  
- \( (x_i - \bar{x})^2 \) → squared deviation from the mean  

---

## Example

**Heights of 5 students (in cm):**

| Student | Height ($x_i$) | Mean ($\bar{x} = 170$) | Deviation ($x_i - \bar{x}$) | Squared Deviation ($(x_i - \bar{x})^2$) |
|----------|----------------|------------------------|-----------------------------|------------------------------------------|
| A | 150 | 170 | -20 | 400 |
| B | 160 | 170 | -10 | 100 |
| C | 170 | 170 | 0 | 0 |
| D | 180 | 170 | +10 | 100 |
| E | 190 | 170 | +20 | 400 |

---

## Step-by-Step Calculation

**Step 1 — Find the Mean**

$$
\bar{x} = \frac{150 + 160 + 170 + 180 + 190}{5} = 170
$$

**Step 2 — Find the Squared Deviations**

$$
(x_i - \bar{x})^2 = [400, 100, 0, 100, 400]
$$

**Step 3 — Find the Mean of the Squared Deviations**

$$
\sigma^2 = \frac{400 + 100 + 0 + 100 + 400}{5}
$$

$$
\sigma^2 = \frac{1000}{5} = 200
$$

---

## Final Result

$$
\boxed{\text{Variance} = 200}
$$
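
The calculation above can be checked with a short sketch that applies both formulas from this section to the same five heights. It also compares the results with the standard library's `statistics.pvariance` (divide by N) and `statistics.variance` (divide by n - 1).

```python
import statistics

heights = [150, 160, 170, 180, 190]  # heights from the example table (cm)

mean = sum(heights) / len(heights)
squared_deviations = [(x - mean) ** 2 for x in heights]

population_variance = sum(squared_deviations) / len(heights)    # divide by N
sample_variance = sum(squared_deviations) / (len(heights) - 1)  # divide by n - 1

# The standard library gives the same two results
assert population_variance == statistics.pvariance(heights)
assert sample_variance == statistics.variance(heights)

print(population_variance, sample_variance)
```

The sample variance is larger because the same sum of squared deviations is divided by a smaller number.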

---




# Standard Deviation

## Definition
The **Standard Deviation (SD)** is the **square root of the variance**.  
It measures how spread out the data values are from the mean — in the **same units** as the data.  
A **small standard deviation** means the data values are close to the mean, while a **large standard deviation** means they are more spread out.

---

## Mathematical Expression

For a **population**, the standard deviation is denoted by \( \sigma \):

$$
\sigma = \sqrt{ \frac{ \sum_{i=1}^{N} (x_i - \bar{x})^2 }{ N } }
$$

For a **sample**, it is denoted by \( s \):

$$
s = \sqrt{ \frac{ \sum_{i=1}^{n} (x_i - \bar{x})^2 }{ n - 1 } }
$$

Where:  
- \( x_i \) → each data point  
- \( \bar{x} \) → mean of the dataset  
- \( N \) → total number of observations (population)  
- \( n \) → total number of observations (sample)  
- \( (x_i - \bar{x})^2 \) → squared deviation from the mean  

---

## Example

**Heights of 5 students (in cm):**

| Student | Height ($x_i$) | Mean ($\bar{x} = 170$) | Deviation ($x_i - \bar{x}$) | Squared Deviation ($(x_i - \bar{x})^2$) |
|----------|----------------|------------------------|-----------------------------|------------------------------------------|
| A | 150 | 170 | -20 | 400 |
| B | 160 | 170 | -10 | 100 |
| C | 170 | 170 | 0 | 0 |
| D | 180 | 170 | +10 | 100 |
| E | 190 | 170 | +20 | 400 |

---

## Step-by-Step Calculation

**Step 1 — Find the Variance**

$$
\sigma^2 = \frac{400 + 100 + 0 + 100 + 400}{5} = 200
$$

**Step 2 — Take the Square Root of Variance**

$$
\sigma = \sqrt{200}
$$

$$
\sigma \approx 14.14
$$

---

## Final Result

$$
\boxed{\text{Standard Deviation} = 14.14}
$$
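
As a sketch, both versions of the standard deviation can be computed for the same heights with the standard library: `statistics.pstdev` uses the population formula and `statistics.stdev` the sample formula.

```python
import statistics

heights = [150, 160, 170, 180, 190]  # heights from the example table (cm)

population_sd = statistics.pstdev(heights)  # square root of the variance with divisor N
sample_sd = statistics.stdev(heights)       # square root of the variance with divisor n - 1

print(round(population_sd, 2), round(sample_sd, 2))
```

When using other libraries, it is worth checking the default divisor: NumPy's `np.std` defaults to `ddof=0` (population formula), while pandas' `Series.std` defaults to `ddof=1` (sample formula).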

---



# Coefficient of Variation (CV)

## Definition
The **Coefficient of Variation (CV)** measures the **relative variability** of a dataset compared to its mean.  
It expresses the **standard deviation as a percentage of the mean**, allowing comparison between datasets with different units or scales.

A **low CV** indicates more consistency (less risk), while a **high CV** indicates greater variability (more risk).

---

## Mathematical Expression

For a **population**:

$$
CV = \frac{\sigma}{\bar{x}} \times 100
$$

For a **sample**:

$$
CV = \frac{s}{\bar{x}} \times 100
$$

Where:  
- \( \sigma \) → population standard deviation  
- \( s \) → sample standard deviation  
- \( \bar{x} \) → mean (average) of the dataset  
- \( CV \) → coefficient of variation (expressed as a percentage)

---

## Step-by-Step Explanation

1. Compute the **mean** (\( \bar{x} \)) of the dataset.  
2. Find the **standard deviation** (\( \sigma \) or \( s \)).  
3. Divide the standard deviation by the mean.  
4. Multiply the result by **100** to express it as a percentage.

---

## Example

| Dataset | Mean (\( \bar{x} \)) | Standard Deviation (\( \sigma \)) | CV (%) | Interpretation |
|:--------:|:-------------------:|:---------------------------------:|:------:|:----------------------------:|
| A (Heights) | 170 | 10 | \( \frac{10}{170} \times 100 = 5.88\% \) | Very consistent (Low risk) |
| B (Weights) | 60 | 12 | \( \frac{12}{60} \times 100 = 20.00\% \) | More variable (High risk) |

---

## Step-by-Step Calculation

**Step 1 — Write the Formula**

$$
CV = \frac{\sigma}{\bar{x}} \times 100
$$

**Step 2 — Substitute Values (for Dataset A)**

$$
CV = \frac{10}{170} \times 100
$$

**Step 3 — Simplify**

$$
CV = 5.88\%
$$

---

## Final Result

$$
\boxed{\text{Coefficient of Variation (CV)} = 5.88\%}
$$

---


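```python
# Means and standard deviations from the example table above
mean_A, std_A = 170, 10   # Dataset A (heights)
mean_B, std_B = 60, 12    # Dataset B (weights)

# Calculate Coefficient of Variation
cv_A = (std_A / mean_A) * 100
cv_B = (std_B / mean_B) * 100
```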




|   CV Range  | Meaning     | Interpretation                                          |
| :---------: | :---------- | :------------------------------------------------------ |
|  **< 10%**  | Low CV      | Data points are consistent (low risk / low variability) |
| **10%–20%** | Moderate CV | Some variation — acceptable risk                        |
|  **> 20%**  | High CV     | Data is very spread out (high risk / high uncertainty)  |




---

# Sample Variance and Sample Standard Deviation

## Definition

The **Sample Variance** and **Sample Standard Deviation** are used to estimate the **population variance** and **population standard deviation** when working with a **sample** (a subset of the entire population).

Because the deviations are measured from the sample mean rather than the unknown population mean, dividing by $N$ tends to **underestimate** the true spread.
To correct this, we use **Bessel’s Correction**, dividing by $N - 1$ instead of $N$.

---

## ⚖️ Why Use $N - 1$ Instead of $N$?

When calculating variance or standard deviation from a **sample**, we measure deviations from the **sample mean** ($\bar{x}$), which is itself computed from that same data.
This makes the spread appear **smaller** than the true spread of the entire population.

To fix this:

* Dividing by $N - 1$ instead of $N$:

  * Increases the variance slightly
  * Corrects underestimation
  * Provides an **unbiased estimate** of the population variance
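
The effect of the correction can be seen in a small simulation. The sketch below is an illustration under assumed settings (a normal population with standard deviation 10, samples of size 5, a fixed random seed), none of which come from this notebook. It averages the two variance estimates over many samples.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

TRUE_VARIANCE = 100.0  # assumed population: Normal(mean=50, sd=10), so variance = 10**2
SAMPLE_SIZE = 5
N_SAMPLES = 20000

sum_div_n = 0.0
sum_div_n_minus_1 = 0.0
for _ in range(N_SAMPLES):
    sample = [random.gauss(50, 10) for _ in range(SAMPLE_SIZE)]
    sample_mean = sum(sample) / SAMPLE_SIZE
    ss = sum((x - sample_mean) ** 2 for x in sample)  # sum of squared deviations
    sum_div_n += ss / SAMPLE_SIZE
    sum_div_n_minus_1 += ss / (SAMPLE_SIZE - 1)

avg_div_n = sum_div_n / N_SAMPLES
avg_div_n_minus_1 = sum_div_n_minus_1 / N_SAMPLES

print(f"true variance:        {TRUE_VARIANCE}")
print(f"average using N:      {avg_div_n:.1f}")
print(f"average using N - 1:  {avg_div_n_minus_1:.1f}")
```

With these settings the average using $N$ should settle near $\frac{N-1}{N} \times 100 = 80$, while the average using $N - 1$ should settle near the true value of 100.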

---

##  Mathematical Formulas

### Sample Variance

$$
s^2 = \frac{\sum (x_i - \bar{x})^2}{N - 1}
$$

### Sample Standard Deviation

$$
s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{N - 1}}
$$

### Relationship Between Variance and Standard Deviation

$$
SD = \sqrt{s^2}
$$

---

##  Example

Given the sample data:

$$
x_1 = 35, \quad x_2 = 31, \quad x_3 = 32.5
$$

### Step 1 — Compute the Mean

$$
\bar{x} = \frac{35 + 31 + 32.5}{3} = \frac{98.5}{3} \approx 32.83
$$

---

### Step 2 — Compute Deviations and Squared Deviations

Values are rounded to two decimals.

| Observation ($x_i$) | $x_i - \bar{x}$ | $(x_i - \bar{x})^2$ |
| :-----------------: | :-------------: | :-----------------: |
|          35         |      $2.17$     |        $4.69$       |
|          31         |     $-1.83$     |        $3.36$       |
|         32.5        |     $-0.33$     |        $0.11$       |

---

### Step 3 — Compute Sample Variance

$$
s^2 = \frac{(2.17)^2 + (-1.83)^2 + (-0.33)^2}{3 - 1} \approx \frac{8.17}{2} \approx 4.08
$$

---

### Step 4 — Compute Sample Standard Deviation

$$
s = \sqrt{4.08} \approx 2.02
$$

---

##  Summary Table

| Measure Type            | Formula                                      | Denominator | Used For          | Description                              |
| :---------------------- | :------------------------------------------- | :---------- | :---------------- | :--------------------------------------- |
| **Population Variance** | $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$    | $N$         | Entire population | True variance of population              |
| **Sample Variance**     | $s^2 = \frac{\sum (x_i - \bar{x})^2}{N - 1}$ | $N - 1$     | Sample data       | Unbiased estimate of population variance |

---

##  Key Takeaway

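Using $N - 1$ makes the **sample variance** an **unbiased estimator** of the population variance, a fundamental concept in **inferential statistics**. The **sample standard deviation** is its square root; it is still slightly biased, but it is the standard estimate of the population standard deviation.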

---






---

# Sampling Techniques

## Definition

**Sampling** is the process of selecting a **subset of individuals** (a *sample*) from a **population** to estimate characteristics of the **entire population**.
Different sampling methods are used depending on the **study design**, **population structure**, and **available resources**.

---

## Types of Sampling Methods

---

###  Random Sampling

**Definition:**
Every member of the population has an **equal chance** of being selected.

**Variants:**

* **With Replacement:** Selected individuals are returned to the population before the next draw.
  *Example:* Drawing a card from a deck, noting it, and putting it back.
* **Without Replacement:** Once selected, individuals are not returned to the population.
  *Example:* Picking 5 students from a class of 30 to survey — each student can only be chosen once.

**Key Point:**
Random sampling minimizes **bias** and ensures each individual has the same probability of being chosen.

---

###  Systematic Sampling

**Definition:**
Selects **every $k^{\text{th}}$** individual from a population list, starting at a random position.

**Steps:**

1. Choose a random starting point among the first $k$ individuals.
2. Select every $k^{\text{th}}$ element after that, where $k = N/n$ ($N$ = population size, $n$ = desired sample size).

**Example:**
Surveying every 10th customer entering a store.

**Note:**
Simple to implement but may introduce **bias** if there’s a repeating pattern in the population list.

---

###  Stratified Sampling

**Definition:**
The population is divided into **subgroups (strata)** based on shared characteristics (e.g., age, gender, income level), and **random samples** are taken from each stratum.

**Advantages:**

* Ensures **representation** from all important subgroups.
* Produces **more precise estimates** compared to simple random sampling.

**Example:**
If the student population is 40% male and 60% female, a stratified sample of 100 students would contain 40 males and 60 females, each group selected at random.

---

###  Cluster Sampling

**Definition:**
The population is divided into **clusters** (often naturally occurring groups like schools, villages, or companies), and **entire clusters** are randomly selected for the study.

**Advantages:**

* Cost-effective for **large or geographically spread populations**.
* Useful when a complete list of the population is not available.

**Example:**
Selecting 5 schools from a district and surveying **all students** in those selected schools.

---

###  Quota Sampling

**Definition:**
A **non-random** method where individuals are selected to meet **specific quotas** (e.g., by gender, age, or occupation).

**Example:**
Surveying 50 males and 50 females — chosen based on convenience to meet the quota.

**Note:**
Ensures representation but can introduce **bias** since selection isn’t random.

---

###  Snowball Sampling

**Definition:**
Existing participants **recruit other participants** from their network.

**When to Use:**

* For **hard-to-reach** or **hidden populations** (e.g., drug users, rare disease patients).

**Example:**
Studying a rare disease — initial patients refer other patients they know.

---

##  Summary Table

| Sampling Method | Description                        | Example                                         |
| :-------------- | :--------------------------------- | :---------------------------------------------- |
| **Random**      | Every member has equal chance      | Randomly selecting 10 students from a class     |
| **Systematic**  | Select every $k^{\text{th}}$ individual | Every 5th customer entering a store             |
| **Stratified**  | Population divided into strata     | Randomly pick males & females proportionally    |
| **Cluster**     | Randomly select entire groups      | Select 3 schools, survey all students           |
| **Quota**       | Meet specific quotas               | Survey 50 males & 50 females by convenience     |
| **Snowball**    | Participants recruit others        | Patients refer other patients with rare disease |

---

# Central Limit Theorem (CLT)

## Definition

The **Central Limit Theorem (CLT)** states that:

> The sampling distribution of the **sample means** approaches a **normal distribution** as the **sample size increases**,
> regardless of the shape of the population distribution.

---

##  Mathematical Representation

Let $X_1, X_2, \ldots, X_n$ be independent, identically distributed random variables with mean $\mu$ and finite standard deviation $\sigma$.
Then, the **sample mean** $\bar{X}$ is approximately normally distributed for large $n$:

$$
\bar{X} \sim N \left( \mu, \frac{\sigma^2}{n} \right)
$$

That is,

$$
Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \sim N(0,1)
$$

---

##  Key Implications

* The **mean of the sample means** equals the **population mean** ($E[\bar{X}] = \mu$).
* The **spread of the sample means** decreases as the **sample size** $n$ increases:
  $$
  \text{Standard Error (SE)} = \frac{\sigma}{\sqrt{n}}
  $$
* Works for **any population shape** with finite variance (skewed, uniform, etc.), as long as $n$ is large (a common rule of thumb is $n \geq 30$).

---

##  Example

Suppose the population of test scores has:

* Mean ($\mu$) = 70
* Standard deviation ($\sigma$) = 12

If we take samples of size $n = 36$, then:

### Sampling distribution of the sample mean:

$$
\bar{X} \sim N(70, \frac{12^2}{36})
$$

### Standard Error:

$$
SE = \frac{12}{\sqrt{36}} = 2
$$

Hence, sample means will cluster around 70 with a standard deviation of 2.
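
A short simulation can check this numerically. The example does not say what shape the population has, so the sketch below assumes a uniform population chosen to have mean 70 and standard deviation 12; the seed and the number of repetitions are also arbitrary choices made here.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

MU, SIGMA, N = 70, 12, 36

# A uniform population with half-width sigma * sqrt(3) has standard deviation sigma
half_width = SIGMA * math.sqrt(3)

def draw_sample_mean():
    """Draw one sample of size N and return its mean."""
    sample = [random.uniform(MU - half_width, MU + half_width) for _ in range(N)]
    return sum(sample) / N

sample_means = [draw_sample_mean() for _ in range(5000)]

theoretical_se = SIGMA / math.sqrt(N)
mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.pstdev(sample_means)

print(f"theoretical SE:        {theoretical_se:.2f}")
print(f"mean of sample means:  {mean_of_means:.2f}")
print(f"SD of sample means:    {sd_of_means:.2f}")
```

The mean of the sample means should land close to 70 and their standard deviation close to 2, even though the assumed population is flat rather than bell-shaped.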

---

##  Key Takeaways

* CLT explains **why normal distribution is common in statistics**.
* Enables **confidence intervals** and **hypothesis testing** even for non-normal data.
* Larger sample sizes lead to **more accurate estimates** of the population mean.

---

##  Python Illustration

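```python
import numpy as np
import matplotlib.pyplot as plt

# Skewed (exponential) population, so the population itself is clearly not normal
population = np.random.exponential(scale=10, size=100000)

# Draw 1000 samples of size 30 and record each sample mean
sample_means = [np.mean(np.random.choice(population, size=30)) for _ in range(1000)]

# density=True normalizes the histogram, so the y-axis shows density rather than counts
plt.hist(sample_means, bins=30, density=True)
plt.title("Central Limit Theorem Demonstration")
plt.xlabel("Sample Means")
plt.ylabel("Density")
plt.show()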
```

---





---

# Normal Distribution and Z-Score

---

##  Minimum Data Requirement for Normal Distribution

A large sample does not make the raw data normally distributed. What becomes approximately normal is the **sampling distribution of the mean**, provided the **sample size is sufficient**.

> **In practice, at least 30 data points** are recommended for the sampling distribution of the mean to **approximate a normal curve**, according to the **Central Limit Theorem (CLT)**.

As the sample size increases:

* The **sampling distribution** of the mean becomes **smoother** and **more symmetric**.
* Probabilities read from the **Z-table** for the sample mean become more accurate.

---

# Z-Score

## Definition

A **Z-Score** (also called a *standard score*) represents the **number of standard deviations** a value is **from the mean** of a dataset.

It converts raw data into a **standardized scale**, allowing comparison across different datasets.

---

##  Formula

$$
Z = \frac{X - \bar{X}}{s}
$$

For a population:

$$
Z = \frac{X - \mu}{\sigma}
$$

Where:

* $Z$ → Z-score (standard score)
* $X$ → observed data value
* $\bar{X}$ or $\mu$ → mean of the dataset
* $s$ or $\sigma$ → standard deviation

---

##  Interpretation

|   Z-Score  | Meaning         | Interpretation           |
| :--------: | :-------------- | :----------------------- |
|  **Z = 0** | At the mean     | Value equals the mean    |
|  **Z > 0** | Above the mean  | Value is above average   |
|  **Z < 0** | Below the mean  | Value is below average   |
| **Z = +1** | 1 SD above mean | Higher than ~84% of data (if normal) |
| **Z = –1** | 1 SD below mean | Higher than only ~16% of data (if normal) |

---

##  Example

Let’s consider test scores for 5 students:

| Student | Score ($X$) |
| :-----: | :-----------: |
|    A    |       60      |
|    B    |       70      |
|    C    |       80      |
|    D    |       90      |
|    E    |      100      |

### Step 1 — Find the Mean

$$
\bar{X} = \frac{60 + 70 + 80 + 90 + 100}{5} = 80
$$

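### Step 2 — Find the Standard Deviation

The five scores are treated as the complete group here, so the sum of squared deviations is divided by $N = 5$ (the population formula) rather than $n - 1$.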

$$
s = \sqrt{\frac{(60-80)^2 + (70-80)^2 + (80-80)^2 + (90-80)^2 + (100-80)^2}{5}}
$$

$$
s = \sqrt{\frac{400 + 100 + 0 + 100 + 400}{5}} = \sqrt{200} \approx 14.14
$$

### Step 3 — Compute Z-Scores

| Student | $X$ | $Z = \frac{X - \bar{X}}{s}$       | Interpretation |
| :-----: | :-: | :-------------------------------: | :------------- |
|    A    |  60 | $\frac{60 - 80}{14.14} = -1.41$   | Below mean     |
|    B    |  70 | $\frac{70 - 80}{14.14} = -0.71$   | Slightly below |
|    C    |  80 | $\frac{80 - 80}{14.14} = 0$       | At mean        |
|    D    |  90 | $\frac{90 - 80}{14.14} = +0.71$   | Slightly above |
|    E    | 100 | $\frac{100 - 80}{14.14} = +1.41$  | Above mean     |

---

##  Final Notes

* **Z-scores standardize data** — useful for comparing values from different distributions.
* In a normal distribution, about **95% of values** lie within **Z = ±1.96** of the mean, which is why 1.96 is the critical value for a 95% confidence interval.
* **Negative Z:** value is below mean; **Positive Z:** value is above mean.
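
The percentages quoted for a normal distribution can be checked with the standard library's `statistics.NormalDist`. This is a small sketch; the variable names are chosen here for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

below_plus_1 = standard_normal.cdf(1)    # share of values below Z = +1
below_minus_1 = standard_normal.cdf(-1)  # share of values below Z = -1
within_1_96 = standard_normal.cdf(1.96) - standard_normal.cdf(-1.96)  # share within ±1.96

print(f"below Z = +1:  {below_plus_1:.1%}")
print(f"below Z = -1:  {below_minus_1:.1%}")
print(f"within ±1.96:  {within_1_96:.1%}")
```

These figures only apply when the data are approximately normal; a Z-score can be computed for any dataset, but the percentile interpretation depends on the shape of the distribution.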

---


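```python
import pandas as pd

# Test scores from the worked example above
data = pd.DataFrame({'Scores': [60, 70, 80, 90, 100]})

mean = data['Scores'].mean()
std = data['Scores'].std(ddof=0)  # ddof=0 divides by N, matching the hand calculation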


data['Z-Score'] = (data['Scores'] - mean) / std

print("Mean:", mean)
print("Standard Deviation:", std)
print(data)
```

---





---

#  Inferential Statistics

---

##  Estimating the Population Mean Using Z-Score

**Inferential statistics** allow us to use **sample data** to make **conclusions or estimates about a population**.

When the **population standard deviation ($\sigma$)** is known, and the **sampling distribution** is **normal** (or the sample size is large), we use the **Z-distribution** to estimate the population mean.

---

##  Conditions for Inference About Population Mean

For valid inference using **Z-scores**, the following conditions must be met:

| Condition                               | Description                                                                                    |
| :-------------------------------------- | :--------------------------------------------------------------------------------------------- |
| **Random**                              | The sample must be randomly selected from the population.                                      |
| **Normal Data**                         | The population is normally distributed, or the sample size $n \ge 30$ (Central Limit Theorem). |
| **Independent**                         | The sampled observations must be independent of each other.                                    |
| **Known Standard Deviation ($\sigma$)** | Population standard deviation must be known.                                                   |

---

##  Important Concepts

### 1. **Point Estimate**

A **point estimate** is a **single value** used to estimate a population parameter (e.g., the sample mean $\bar{X}$ estimates the population mean $\mu$).

>  A point estimate is only **as good as the sample** behind it, and on its own it gives no indication of the **sampling error**.

---

### 2. **Interval Estimate (Confidence Interval)**

An **interval estimate** gives a **range of values** within which the analyst can state, **with some confidence**, that the **population parameter lies**.

Confidence intervals can be:

* **Two-sided** → captures uncertainty in both directions (common)
* **One-sided** → only upper or lower bound considered

> In this section, we focus on **two-sided confidence intervals**.

---

##  Based on the Central Limit Theorem (CLT)

When the population standard deviation $\sigma$ is known:

* For **large samples (n ≥ 30)**, the **sampling distribution** of the sample mean $\bar{X}$ is approximately **normal**, regardless of the population’s shape.
* For **small samples (n < 30)**, the **population itself must be normally distributed**.

Thus, we can use the **Z-distribution** for estimation.

---

##  Formula for Confidence Interval of the Mean

$$
\bar{X} \pm Z_{\alpha/2} \times \frac{\sigma}{\sqrt{n}}
$$

Where:

* $\bar{X}$ → sample mean
* $Z_{\alpha/2}$ → critical value from Z-table (depends on confidence level)
* $\sigma$ → population standard deviation
* $n$ → sample size

---

## Alpha ($\alpha$)

**Alpha ($\alpha$)** represents the **total area in the tails** of the distribution **outside** the confidence interval.

| Confidence Level | $\alpha$ | $\alpha/2$ (each tail) | $Z_{\alpha/2}$ |
| :--------------: | :--------: | :----------------------: | :--------------: |
|        90%       |    0.10    |           0.05           |       1.645      |
|        95%       |    0.05    |           0.025          |       1.96       |
|        99%       |    0.01    |           0.005          |       2.576      |
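
The critical values in the table can be reproduced without a Z-table. The sketch below uses the standard library's `statistics.NormalDist`; the dictionary `critical_values` is just a name chosen here to hold the results.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

critical_values = {}
for confidence in (0.90, 0.95, 0.99):
    alpha = 1 - confidence
    # The critical value leaves alpha/2 in the upper tail
    critical_values[confidence] = standard_normal.inv_cdf(1 - alpha / 2)
    print(f"{confidence:.0%}: alpha = {alpha:.2f}, z = {critical_values[confidence]:.3f}")
```

`inv_cdf` is the inverse of the cumulative distribution function, so it plays the same role as `stats.norm.ppf` in SciPy.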

Example:
When the **confidence level = 95%**,
then $\alpha = 1 - 0.95 = 0.05$.
From the **Z-table**, $Z_{\alpha/2} = Z_{0.025} = 1.96$.

---

##  Example

A random sample of **n = 50** students has an average test score of **78**, with a **known population standard deviation** of **10**.
Find a **95% confidence interval** for the **population mean**.

---

### Step 1 — Identify known values:

$$
\bar{X} = 78, \quad \sigma = 10, \quad n = 50, \quad Z_{0.025} = 1.96
$$

---

### Step 2 — Compute the standard error (SE):

$$
SE = \frac{\sigma}{\sqrt{n}} = \frac{10}{\sqrt{50}} \approx 1.414
$$

---

### Step 3 — Compute the margin of error (ME):

$$
ME = Z_{\alpha/2} \times SE = 1.96 \times 1.414 \approx 2.77
$$

---

### Step 4 — Construct the confidence interval:

$$
\text{CI} = \bar{X} \pm ME = 78 \pm 2.77
$$

$$
\boxed{(75.23, \ 80.77)}
$$

So we are **95% confident** that the **true population mean** lies between **75.23 and 80.77**.

---


```python
import math
import scipy.stats as stats


x_bar = 78        # sample mean
sigma = 10        # population std dev
n = 50            # sample size
confidence = 0.95

z_critical = stats.norm.ppf(1 - (1 - confidence) / 2)


SE = sigma / math.sqrt(n)
ME = z_critical * SE

lower = x_bar - ME
upper = x_bar + ME

print(f"Z-critical: {z_critical:.2f}")
print(f"Margin of Error: {ME:.2f}")
print(f"95% Confidence Interval: ({lower:.2f}, {upper:.2f})")
```

---

##  Key Takeaways

* Use **Z-distribution** when $\sigma$ is **known** and data is **normal or n ≥ 30**.
* **Confidence Interval (CI)** provides a **range estimate**, not a single value.
* **Higher confidence level** → **wider interval** (less precise).
* **Lower confidence level** → **narrower interval** (more precise, but less certain to contain the mean).
* The **Z-critical value** depends on the **confidence level** ($1 - \alpha$).

---





---

#  Confidence Interval When σ Is Unknown (Using *t*-Distribution)

---

##  Overview

When the **population standard deviation (σ)** is **unknown**, we use the **sample standard deviation (s)** as an estimate.
In such cases, the **t-distribution** is used instead of the **Z-distribution**.

The **t-distribution** is similar to the normal distribution but has **heavier tails**, which account for the **extra uncertainty** due to estimating σ from the sample.

---

##  Conditions for Using the *t*-Distribution

| Condition                    | Description                                                                                |
| :--------------------------- | :----------------------------------------------------------------------------------------- |
| **Random Sample**            | Data must be collected randomly from the population.                                       |
| **Normal Population**        | The underlying population should be approximately normal (important for small samples).    |
| **Independent Observations** | Each sample observation must be independent.                                               |
| **σ Unknown**                | Population standard deviation (σ) is not known. Use sample standard deviation (s) instead. |

---

##  Formula for Confidence Interval

The confidence interval for the **population mean (μ)** when **σ is unknown** is:

$$
\bar{X} \pm t_{\alpha/2,\,df} \times \frac{s}{\sqrt{n}}
$$

Where:

| Symbol              | Meaning                                                                |
| :------------------ | :--------------------------------------------------------------------- |
| $\bar{X}$           | Sample mean                                                            |
| $s$                 | Sample standard deviation                                              |
| $n$                 | Sample size                                                            |
| $t_{\alpha/2,\,df}$ | Critical *t*-value for a given confidence level and degrees of freedom |
| $df$                | Degrees of freedom = $n - 1$                                           |

---

##  Degrees of Freedom (df)

$$
df = n - 1
$$

The **degrees of freedom** determine which *t*-distribution curve to use.
Smaller samples (fewer degrees of freedom) give a *t*-distribution with **heavier tails**, making the confidence interval **wider**.

---

##  Example

A random sample of **n = 25** students has a mean score of **74** and a sample standard deviation of **8**.
Find the **95% confidence interval** for the **population mean**.

---

### Step 1 — Identify known values:

$$
\bar{X} = 74, \quad s = 8, \quad n = 25, \quad df = 24
$$

---

### Step 2 — Determine the critical *t*-value:

For a **95% confidence level** and **df = 24**,
from the *t*-table or Python:

$$
t_{0.025,\,24} = 2.064
$$

---

### Step 3 — Compute the standard error (SE):

$$
SE = \frac{s}{\sqrt{n}} = \frac{8}{\sqrt{25}} = 1.6
$$

---

### Step 4 — Compute the margin of error (ME):

$$
ME = t_{\alpha/2,\,df} \times SE = 2.064 \times 1.6 \approx 3.30
$$

---

### Step 5 — Construct the confidence interval:

$$
\text{CI} = \bar{X} \pm ME = 74 \pm 3.30
$$

$$
\boxed{(70.70, \ 77.30)}
$$

So we are **95% confident** that the **true population mean** lies between **70.7 and 77.3**.

---



```python
import math
from scipy import stats

# Given data
x_bar = 74        # sample mean
s = 8             # sample std deviation
n = 25            # sample size
confidence = 0.95
df = n - 1        # degrees of freedom


t_critical = stats.t.ppf(1 - (1 - confidence) / 2, df)


SE = s / math.sqrt(n)
ME = t_critical * SE


lower = x_bar - ME
upper = x_bar + ME

print(f"T-critical: {t_critical:.3f}")
print(f"Margin of Error: {ME:.3f}")
print(f"95% Confidence Interval: ({lower:.3f}, {upper:.3f})")
```

---

## Comparison: Z vs. t Distribution

| Feature             | **Z-Distribution** |        **t-Distribution**       |
| :------------------ | :----------------: | :-----------------------------: |
| Population σ known? |         Yes       |            No (use s)          |
| Sample size (n)     | Normal population or large n (n ≥ 30) | Any n; matters most when n < 30 |
| Shape               |       Normal       |           Wider tails           |
| Curve depends on    |        Fixed       | Degrees of freedom (df = n − 1) |

---

##  Key Takeaways

* Use **t-distribution** when **σ is unknown**; the difference from Z matters most when the **sample is small**.
* As **sample size increases**, the **t-distribution approaches** the **Z-distribution**.
* Always check assumptions: randomness, independence, and normality.
* For the same confidence level, the **t-critical value** is **larger than the Z-critical value**, reflecting greater uncertainty.

---

