## **Inferential Statistics**

**Inferential statistics** allows us to make predictions or inferences about a population based on a sample of data. While descriptive statistics summarize the data, inferential statistics help us draw conclusions about a larger group from the observed data. Two key components of inferential statistics are **hypothesis testing** and **confidence intervals**.

---

### 1. **Hypothesis Testing**

**Hypothesis testing** is a statistical method used to make decisions or inferences about population parameters based on sample data. It allows us to test assumptions or claims and evaluate whether the data supports or rejects those claims.

#### a) **Key Terms in Hypothesis Testing**
- **Null Hypothesis $( H_0 )$**: A statement that there is no effect or no difference. It is the hypothesis that is initially assumed to be true.
- **Alternative Hypothesis $( H_1 )$ or $( H_a )$**: A statement that contradicts the null hypothesis. It indicates the presence of an effect or difference.
- **Test Statistic**: A value calculated from the sample data that is used to make a decision about the null hypothesis. Examples include the \( t \)-statistic and \( z \)-statistic.
- **P-value**: The probability of obtaining a test statistic at least as extreme as the one observed, assuming the null hypothesis is true. A low p-value means the observed data would be unusual if the null hypothesis were true; it is not the probability that the null hypothesis itself is true.
- **Significance Level $( \alpha )$**: The threshold for rejecting the null hypothesis. Common significance levels are 0.05 (5%) or 0.01 (1%). If the p-value is less than $( \alpha )$, the null hypothesis is rejected.
- **Type I Error**: Rejecting the null hypothesis when it is actually true (false positive).
- **Type II Error**: Failing to reject the null hypothesis when it is actually false (false negative).
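The link between the significance level and the Type I error rate can be checked with a small simulation. This is a minimal sketch: the population (mean 50, standard deviation 10), the sample size, and the number of trials are illustrative assumptions, not values from the text above.

```python
import numpy as np
from scipy import stats

# Hypothetical setup: the null hypothesis (mu = 50) is TRUE in every trial.
rng = np.random.default_rng(42)
n_trials, n, alpha = 2000, 30, 0.05
samples = rng.normal(loc=50, scale=10, size=(n_trials, n))

# One t-test per row; every rejection here is a Type I error (false positive).
_, p_values = stats.ttest_1samp(samples, popmean=50, axis=1)
type_i_rate = np.mean(p_values < alpha)

print(f"Estimated Type I error rate: {type_i_rate:.3f}")
```

Because the null hypothesis holds in every simulated sample, the fraction of rejections should land close to the chosen $( \alpha )$.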

#### b) **Steps in Hypothesis Testing**

1. **State the Hypotheses**:
   - Null Hypothesis $( H_0 )$: There is no significant difference or effect.
   - Alternative Hypothesis $( H_1 )$: There is a significant difference or effect.

2. **Choose a Significance Level $( \alpha )$**: Typically, $( \alpha = 0.05 )$ is used, meaning there’s a 5% risk of rejecting the null hypothesis when it is true.

3. **Select the Appropriate Test**: Depending on the data, you might use a:
   - **t-test**: To compare a sample mean with a hypothesized value, or to compare the means of two groups.
   - **z-test**: For large samples when population variance is known.
   - **ANOVA (Analysis of Variance)**: To compare means of three or more groups.
   - **Chi-square test**: For categorical data to compare observed frequencies with expected frequencies.

4. **Calculate the Test Statistic**: Based on the sample data, compute the value of the test statistic (e.g., \( t \)-statistic or \( z \)-statistic).

5. **Compute the P-value**: The p-value is the probability of observing a test statistic at least as extreme as the one computed, assuming the null hypothesis is true. A small p-value is evidence against the null hypothesis.

6. **Make a Decision** (Decision Rule):
   - **Reject $( H_0 )$**: If the p-value is less than or equal to the significance level $( \alpha )$.
   - **Fail to reject $( H_0 )$**: If the p-value is greater than $( \alpha )$.

#### c) **Examples of Hypothesis Testing**

**Example 1: One-sample t-test**
Suppose we want to test whether the average height of adult men in a city is 175 cm. We collect a sample of 30 men and calculate their average height as 172 cm with a standard deviation of 5 cm. We test the null hypothesis $( H_0: \mu = 175 )$ cm against the alternative hypothesis $( H_1: \mu \neq 175 )$ cm using a t-test.

- Null Hypothesis $( H_0 )$: The mean height is 175 cm.
- Alternative Hypothesis $( H_1)$: The mean height is not 175 cm.
- We calculate the t-statistic and compare it with the critical value based on the significance level (e.g., $( \alpha = 0.05 )$).
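Example 1 can be worked through directly from its summary statistics. The sketch below assumes a two-sided test at $( \alpha = 0.05 )$ and uses `scipy.stats.t` for the critical value and p-value; the variable names are chosen for illustration.

```python
import math
from scipy import stats

# Summary statistics from Example 1
sample_mean, hypothesized_mean = 172.0, 175.0
sample_sd, n = 5.0, 30

# t = (x_bar - mu_0) / (s / sqrt(n))
standard_error = sample_sd / math.sqrt(n)
t_stat = (sample_mean - hypothesized_mean) / standard_error

# Two-sided p-value and critical value from the t-distribution with n - 1 df
df = n - 1
p_value = 2 * stats.t.sf(abs(t_stat), df)
t_critical = stats.t.ppf(1 - 0.05 / 2, df)

print(f"t = {t_stat:.2f}, critical value = ±{t_critical:.2f}, p = {p_value:.4f}")
```

Here the absolute value of the t-statistic (about 3.29) exceeds the critical value (about 2.05), so the null hypothesis of a 175 cm mean would be rejected at the 5% level.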

**Example 2: Two-sample t-test**
In another scenario, we might compare the means of two groups, such as the test scores of students who took an online course versus those who attended in person. Here, we would use a two-sample t-test to determine if there’s a significant difference between the two group means.

---

### 2. **Confidence Intervals**

A **confidence interval** is a range of values used to estimate the true value of a population parameter (such as the population mean) with a certain level of confidence. It provides an interval estimate rather than a point estimate and helps quantify the uncertainty associated with the sample statistic.

#### a) **Key Concepts of Confidence Intervals**
- **Confidence Level**: The long-run proportion of intervals, built by the same procedure from repeated samples, that contain the true population parameter. Common confidence levels are 90%, 95%, and 99%. A 95% confidence level means that if we drew 100 different samples and built an interval from each, approximately 95 of those intervals would contain the true population parameter.
- **Margin of Error**: The amount added and subtracted from the sample mean to create the confidence interval. It depends on the standard error and the confidence level.
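The repeated-sampling meaning of the confidence level can be illustrated with a simulation. This is a sketch under assumed values: a hypothetical normal population with known mean 50, samples of size 30, and 2000 repetitions.

```python
import numpy as np
from scipy import stats

# Hypothetical population with a known mean, so coverage can be checked.
rng = np.random.default_rng(0)
true_mean, n_trials, n = 50.0, 2000, 30
samples = rng.normal(loc=true_mean, scale=10, size=(n_trials, n))

# 95% t-interval for each simulated sample (one row = one sample)
means = samples.mean(axis=1)
sems = samples.std(axis=1, ddof=1) / np.sqrt(n)
t_crit = stats.t.ppf(0.975, df=n - 1)
lower, upper = means - t_crit * sems, means + t_crit * sems

# Fraction of intervals that contain the true mean
coverage = np.mean((lower <= true_mean) & (true_mean <= upper))
print(f"Fraction of intervals containing the true mean: {coverage:.3f}")
```

The printed fraction should be close to 0.95, which is what the confidence level describes: a property of the procedure over many samples rather than of any single interval.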

#### b) **Formula for Confidence Interval for the Mean**

For a sample mean, the confidence interval can be calculated as:

$$\text{CI} = \bar{x} \pm z^* \times \frac{\sigma}{\sqrt{n}}$$

Where:
- $( \bar{x})$ is the sample mean.
- $( z^* )$ is the critical value from the standard normal distribution corresponding to the desired confidence level (e.g., for a 95% confidence interval, $( z^* = 1.96 )$).
- $( \sigma )$ is the population standard deviation (or the sample standard deviation if $( \sigma )$ is unknown).
- \( n \) is the sample size.

If the population standard deviation is unknown and the sample size is small, we use the t-distribution instead of the z-distribution.

#### c) **Example of a Confidence Interval**

**Example**:
Suppose you survey 100 people and find that their average income is 50,000 dollars with a standard deviation of 10,000. You want to construct a $95\%$ confidence interval for the true average income of the population.

- Sample mean $( \bar{x} = 50,000 )$
- Sample standard deviation \( s = 10,000 \)
- Sample size \( n = 100 \)
- For a $95\%$ confidence level, the critical value $( z^* = 1.96 )$

The margin of error is:

$$\text{Margin of Error} = z^* \times \frac{\sigma}{\sqrt{n}}$$
$$\text{Margin of Error} = 1.96 \times \frac{10,000}{\sqrt{100}} = 1.96 \times 1,000 = 1,960$$


Thus, the 95% confidence interval is:

$$(50,000 - 1,960, 50,000 + 1,960) = (48,040, 51,960)$$


This means we are $95\%$ confident that the true average income lies between 48,040 and 51,960.
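The same arithmetic can be reproduced in code. This sketch takes the critical value from `scipy.stats.norm.ppf` instead of hard-coding 1.96; the variable names are illustrative.

```python
import math
from scipy import stats

# Values from the income example
x_bar, s, n = 50_000, 10_000, 100
confidence = 0.95

# Critical value z* for a two-sided interval (about 1.96 for 95%)
z_star = stats.norm.ppf(1 - (1 - confidence) / 2)
margin_of_error = z_star * s / math.sqrt(n)
ci = (x_bar - margin_of_error, x_bar + margin_of_error)

print(f"z* = {z_star:.2f}, margin of error = {margin_of_error:.0f}")
print(f"95% CI: ({ci[0]:.0f}, {ci[1]:.0f})")
```

Changing `confidence` to 0.99 gives a larger critical value and therefore a wider interval, which matches the trade-off between confidence and precision.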

#### d) **Interpretation of Confidence Intervals**

- If the confidence interval for a parameter does not include a particular hypothesized value (e.g., zero for a regression coefficient), we can conclude that the parameter differs from that value to a statistically significant degree at the corresponding significance level.
- A **wider confidence interval** indicates more uncertainty, while a **narrower confidence interval** indicates more precision in the estimate.

---

### Summary

**Hypothesis Testing**:
- Helps determine whether to reject or fail to reject a claim about the population based on sample data.
- Involves key steps such as stating hypotheses, calculating a test statistic, and making a decision based on the p-value.

**Confidence Intervals**:
- Provide a range of values likely to contain the population parameter.
- Give an indication of the reliability of the estimate, with a higher confidence level resulting in a wider interval.

Both hypothesis testing and confidence intervals are foundational tools in inferential statistics, enabling data scientists to make informed decisions about populations from samples.

In [1]:
import numpy as np
import scipy.stats as stats
import pandas as pd
from statsmodels.stats.proportion import proportions_ztest
from statsmodels.formula.api import ols
import statsmodels.api as sm

# Example Dataset
np.random.seed(0)
sample_data = np.random.normal(loc=50, scale=10, size=100)
print(sample_data)

[67.64052346 54.00157208 59.78737984 72.40893199 68.6755799  40.2272212
 59.50088418 48.48642792 48.96781148 54.10598502 51.44043571 64.54273507
 57.61037725 51.21675016 54.43863233 53.33674327 64.94079073 47.94841736
 53.13067702 41.45904261 24.47010184 56.53618595 58.64436199 42.5783498
 72.69754624 35.45634325 50.45758517 48.1281615  65.32779214 64.6935877
 51.54947426 53.7816252  41.12214252 30.19203532 46.52087851 51.56348969
 62.30290681 62.02379849 46.12673183 46.97697249 39.51447035 35.79982063
 32.93729809 69.50775395 44.90347818 45.61925698 37.4720464  57.77490356
 33.86102152 47.8725972  41.04533439 53.86902498 44.89194862 38.19367816
 49.71817772 54.28331871 50.66517222 53.02471898 43.65677906 46.37258834
 43.27539552 46.40446838 41.86853718 32.73717398 51.77426142 45.98219064
 33.69801653 54.62782256 40.92701636 50.51945396 57.29090562 51.28982911
 61.39400685 37.6517418  54.02341641 43.15189909 41.29202851 44.21150335
 46.88447468 50.56165342 38.34850159 59.00826487 54.65

### Code Breakdown

#### 1. **`np.random.seed(0)`**
   - Sets the random number generator's seed to `0`.
   - Ensures that the random numbers generated are reproducible. If you rerun the code with the same seed, you will get the same output.

#### 2. **`np.random.normal(loc=50, scale=10, size=100)`**
   - Generates a sample of random numbers from a **normal (Gaussian) distribution**.
   - Parameters:
     - `loc=50`: The mean (center) of the distribution.
     - `scale=10`: The standard deviation (spread) of the distribution.
     - `size=100`: The number of random numbers to generate.
   - Output:
     - An array of 100 random numbers drawn from the specified normal distribution.

#### 3. **`sample_data`**
   - Stores the generated random numbers in an array.

#### 4. **`print(sample_data)`**
   - Prints the array of random numbers to the console.

### Purpose
This code is often used in data analysis or machine learning scenarios to simulate or test algorithms on synthetic data, particularly when working with normally distributed data.

### Example Output
When you run the code, you might see something like this (exact values depend on the random seed and distribution parameters):

```plaintext
[67.64052346 54.00157208 59.78737984 72.40893199 68.6755799  ...
```

This array contains 100 numbers with a mean around 50 and a standard deviation of about 10.

### 1. Confidence Intervals

In [2]:
## Example: 95% confidence interval for the mean
mean = np.mean(sample_data)
std_error = stats.sem(sample_data)
confidence_level = 0.95
ci = stats.t.interval(confidence_level, len(sample_data)-1, loc=mean, scale=std_error)
print(f"95% Confidence Interval for the mean: {ci}")

95% Confidence Interval for the mean: (np.float64(48.58814820996595), np.float64(52.60801210072373))


This code calculates the **95% confidence interval (CI)** for the mean of the dataset using the **t-distribution**, which is appropriate for small samples or when the population standard deviation is unknown.

### Code Explanation

#### 1. **`mean = np.mean(sample_data)`**
   - Calculates the **sample mean** of the data in `sample_data`.
   - `np.mean` computes the average of all the values in the dataset.

#### 2. **`std_error = stats.sem(sample_data)`**
   - Calculates the **standard error of the mean** (SEM).
   - SEM estimates the standard deviation of the sampling distribution of the mean. By default, `stats.sem` uses the sample standard deviation (`ddof=1`):
     $$
     \text{SEM} = \frac{\text{Standard Deviation}}{\sqrt{\text{Sample Size}}}
     $$

#### 3. **`confidence_level = 0.95`**
   - Sets the **confidence level** at 95%.
   - This means that about 95% of intervals built this way from repeated samples would contain the true population mean.

#### 4. **`stats.t.interval(confidence_level, len(sample_data)-1, loc=mean, scale=std_error)`**
   - Uses the **t-distribution** to compute the confidence interval for the mean.
   - Parameters:
     - `confidence_level`: The desired confidence level (e.g., 0.95 for 95% CI).
     - `len(sample_data)-1`: The degrees of freedom, equal to the sample size minus 1, i.e. n-1
     - `loc=mean`: The sample mean, which is the center of the confidence interval.
     - `scale=std_error`: The standard error of the mean.

   - Returns a tuple `(lower_bound, upper_bound)` representing the confidence interval.
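To see what `stats.t.interval` does with these arguments, the interval can be rebuilt by hand as mean ± critical value × standard error. This is a minimal sketch that regenerates the same `sample_data` (seed 0) so it runs on its own; `manual_ci` and `library_ci` are illustrative names.

```python
import numpy as np
from scipy import stats

np.random.seed(0)
sample_data = np.random.normal(loc=50, scale=10, size=100)

n = len(sample_data)
mean = np.mean(sample_data)
# Same quantity as stats.sem: sample standard deviation (ddof=1) over sqrt(n)
std_error = np.std(sample_data, ddof=1) / np.sqrt(n)

# Two-sided 95% critical value from the t-distribution with n - 1 df
t_crit = stats.t.ppf(0.975, df=n - 1)
manual_ci = (mean - t_crit * std_error, mean + t_crit * std_error)

library_ci = stats.t.interval(0.95, n - 1, loc=mean, scale=stats.sem(sample_data))
print(manual_ci)
print(library_ci)
```

The two intervals should agree to floating-point precision.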

#### 5. **`print(f"95% Confidence Interval for the mean: {ci}")`**
   - Prints the computed confidence interval in a formatted string.

---

### Output Example
If `sample_data` is the array generated in your earlier code, the output might look something like this:

```plaintext
95% Confidence Interval for the mean: (48.58814820996595, 52.60801210072373)
```

### Interpretation
- The **95% confidence interval** means:
  - We are 95% confident that the **true population mean** lies between `48.59` and `52.61`: intervals built this way capture the true mean in about 95% of repeated samples.
  - The interval width depends on the variability in the data (standard deviation), the sample size, and the chosen confidence level.

### Why Use the t-distribution?
- The **t-distribution** accounts for the additional uncertainty in estimating the population standard deviation when working with a sample. It is more conservative than the normal distribution, especially for smaller sample sizes.
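A quick way to see this conservativeness is to compare two-sided 95% critical values from the two distributions. The sample sizes below are arbitrary choices for illustration.

```python
from scipy import stats

z_crit = stats.norm.ppf(0.975)  # two-sided 95% critical value, about 1.96

# Illustrative sample sizes (assumed for this sketch)
t_crits = {}
for n in (5, 15, 30, 100, 1000):
    t_crits[n] = stats.t.ppf(0.975, df=n - 1)
    print(f"n = {n:>4}: t* = {t_crits[n]:.3f}  vs  z* = {z_crit:.3f}")
```

The t critical value is always larger than the z critical value, so t-based intervals are wider, and the gap shrinks as the sample size grows.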

### 2. Hypothesis Testing (One-Sample t-Test)

In [3]:
## Example: Test if the sample mean is significantly different from 52
hypothesized_mean = 52
t_stat, p_value = stats.ttest_1samp(sample_data, hypothesized_mean)
print(f"One-sample t-test: t-statistic = {t_stat:.2f}, p-value = {p_value:.2f}")

One-sample t-test: t-statistic = -1.38, p-value = 0.17


This code performs a **one-sample t-test** to determine if the mean of the sample (`sample_data`) is significantly different from a **hypothesized population mean** (`hypothesized_mean`).

---

### Code Explanation

#### 1. **`hypothesized_mean = 52`**
   - Sets the hypothesized population mean to `52`.
   - This is the value you are testing the sample mean against.

#### 2. **`stats.ttest_1samp(sample_data, hypothesized_mean)`**
   - Performs a **one-sample t-test**.
   - The one-sample t-test compares the sample mean to the hypothesized mean and tests whether the difference is statistically significant.
   - Parameters:
     - `sample_data`: The array of sample data.
     - `hypothesized_mean`: The mean to compare the sample mean against.
   - Returns:
     - `t_stat`: The t-statistic, a measure of how far the sample mean is from the hypothesized mean in units of standard error.
     - `p_value`: The p-value, which indicates the probability of observing the test statistic (or one more extreme) under the null hypothesis.
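The two returned values can be reproduced from their definitions. This sketch regenerates the same `sample_data` (seed 0) so it is self-contained; `manual_t` and `manual_p` are illustrative names.

```python
import numpy as np
from scipy import stats

np.random.seed(0)
sample_data = np.random.normal(loc=50, scale=10, size=100)
hypothesized_mean = 52

# t = (sample mean - hypothesized mean) / standard error
n = len(sample_data)
manual_t = (np.mean(sample_data) - hypothesized_mean) / stats.sem(sample_data)
# Two-sided p-value: probability of |T| >= |t| under H0, with n - 1 df
manual_p = 2 * stats.t.sf(abs(manual_t), df=n - 1)

t_stat, p_value = stats.ttest_1samp(sample_data, hypothesized_mean)
print(f"manual: t = {manual_t:.2f}, p = {manual_p:.2f}")
print(f"scipy:  t = {t_stat:.2f}, p = {p_value:.2f}")
```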

#### 3. **`print(f"One-sample t-test: t-statistic = {t_stat:.2f}, p-value = {p_value:.2f}")`**
   - Prints the results of the t-test:
     - `t_stat`: Formatted to two decimal places.
     - `p_value`: Formatted to two decimal places.

---

### Output Example
If you run the code, the output might look like this:

```plaintext
One-sample t-test: t-statistic = -1.38, p-value = 0.17
```

---

### Interpretation of Results

#### 1. **`t_stat`**
   - The t-statistic is a measure of the difference between the sample mean and the hypothesized mean, relative to the standard error.
   - A larger absolute value of `t_stat` indicates a more significant difference.

#### 2. **`p_value`**
   - The p-value represents the probability of obtaining a result at least as extreme as the observed one under the null hypothesis (that the population mean equals the hypothesized mean).
   - **Significance Threshold**:
     - If `p_value < 0.05` (common threshold), the difference is **statistically significant**, and you reject the null hypothesis.
     - If `p_value >= 0.05`, you fail to reject the null hypothesis.

#### Example Interpretation
- **t-statistic = -1.38**: The sample mean is 1.38 standard errors below the hypothesized mean.
- **p-value = 0.17**: Since the p-value is not less than 0.05, the difference between the sample mean and the hypothesized mean is not statistically significant. 
  - The data do not provide sufficient evidence that the population mean differs from 52. This is not proof that the mean equals 52.

---

### Use Case
One-sample t-tests are useful in scenarios such as:
- Checking if a treatment group’s average differs from a known population average.
- Comparing a new product’s performance to a benchmark.

### 3. Two-Sample t-Test (Independent Samples)

In [4]:
## Example: Test if two independent samples have significantly different means
np.random.seed(10)
sample1 = np.random.normal(52, 10, 100)
sample2 = np.random.normal(48, 10, 100)
t_stat, p_value = stats.ttest_ind(sample1, sample2)
print(f"Two-sample t-test: t-statistic = {t_stat:.2f}, p-value = {p_value:.2f}")

Two-sample t-test: t-statistic = 2.96, p-value = 0.00


This code performs a **two-sample t-test** to determine if two independent samples, `sample1` and `sample2`, have significantly different means.

---

### Code Explanation

#### 1. **`sample1 = np.random.normal(52, 10, 100)`**
   - Generates `sample1`, a random sample of size 100 drawn from a normal distribution with:
     - Mean (`loc`) = 52.
     - Standard deviation (`scale`) = 10.
     - Sample size (`size`) = 100.

#### 2. **`sample2 = np.random.normal(48, 10, 100)`**
   - Similarly, generates `sample2` from a normal distribution with:
     - Mean (`loc`) = 48.
     - Standard deviation (`scale`) = 10.
     - Sample size (`size`) = 100.

#### 3. **`stats.ttest_ind(sample1, sample2)`**
   - Performs an **independent two-sample t-test** to compare the means of two independent samples.
   - Parameters:
     - `sample1`: The first sample data.
     - `sample2`: The second sample data.
   - Returns:
     - `t_stat`: The t-statistic, which quantifies the difference between the means relative to the variability in the data.
     - `p_value`: The p-value, indicating the probability of observing a test statistic at least as extreme as the computed one under the null hypothesis (no difference in means).

#### 4. **`print(f"Two-sample t-test: t-statistic = {t_stat:.2f}, p-value = {p_value:.2f}")`**
   - Prints the results of the t-test:
     - `t_stat`: The test statistic, formatted to two decimal places.
     - `p_value`: The p-value, formatted to two decimal places.

---

### Output Example
If you run the code, the output might look like this:

```plaintext
Two-sample t-test: t-statistic = 2.96, p-value = 0.00
```

---

### Interpretation of Results

#### 1. **`t_stat`**
   - A measure of the difference between the two sample means, scaled by the variability in the samples.
   - A larger absolute value indicates a more significant difference.

#### 2. **`p_value`**
   - The p-value indicates the probability of obtaining a result at least as extreme as the observed one under the null hypothesis (that the two population means are equal).
   - **Significance Threshold**:
     - If `p_value < 0.05`, the means are significantly different, and you reject the null hypothesis.
     - If `p_value >= 0.05`, you fail to reject the null hypothesis.

#### Example Interpretation
- **t-statistic = 2.96**: The difference between the sample means is about 2.96 standard errors, which is large relative to the data's variability.
- **p-value = 0.00**: The p-value rounds to 0.00 at two decimal places, so it is below 0.005. Since it is less than 0.05, the difference between the means of `sample1` and `sample2` is **statistically significant**.

---

### Assumptions of the Test
1. The two samples are independent.
2. The data in each sample are normally distributed.
3. The variances of the two populations are approximately equal (can be relaxed by using `stats.ttest_ind(..., equal_var=False)`).
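When the equal-variance assumption is doubtful, Welch's t-test is the variant selected by `equal_var=False`. The sketch below uses hypothetical groups with deliberately different spreads and sizes (not the samples from the cell above) and checks the Welch statistic against its formula.

```python
import numpy as np
from scipy import stats

# Hypothetical groups with different spreads and sizes
rng = np.random.default_rng(1)
group_a = rng.normal(loc=52, scale=5, size=40)
group_b = rng.normal(loc=48, scale=15, size=80)

# Student's t-test (pooled variance) vs. Welch's t-test (no equal-variance assumption)
student = stats.ttest_ind(group_a, group_b)
welch = stats.ttest_ind(group_a, group_b, equal_var=False)

# Welch's statistic uses each group's own variance in the standard error
se = np.sqrt(group_a.var(ddof=1) / len(group_a) + group_b.var(ddof=1) / len(group_b))
manual_welch_t = (group_a.mean() - group_b.mean()) / se

print(f"Student: t = {student.statistic:.2f}, p = {student.pvalue:.4f}")
print(f"Welch:   t = {welch.statistic:.2f}, p = {welch.pvalue:.4f}")
```

With unequal variances and unequal group sizes, the two versions can give noticeably different statistics and p-values, which is why the choice matters.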

---

### Use Case
Two-sample t-tests are commonly used to:
- Compare the performance of two independent groups (e.g., control vs. treatment).
- Test the effect of two different conditions in an experiment.

### 4. Paired Sample t-Test

In [5]:
## Example: Test if the means of two related samples differ significantly
np.random.seed(10)
before_treatment = np.random.normal(50, 10, 30) 
after_treatment = before_treatment + np.random.normal(-2, 5, 30)  # Simulating effect of treatment
t_stat, p_value = stats.ttest_rel(before_treatment, after_treatment)
print(f"Paired sample t-test: t-statistic = {t_stat:.2f}, p-value = {p_value:.2f}")

Paired sample t-test: t-statistic = 2.11, p-value = 0.04


This code performs a **paired sample t-test**, which is used to compare the means of two related samples to determine if their difference is statistically significant. This is often applied in "before-and-after" scenarios or when measurements are taken from the same subjects under two conditions.

---

### Code Explanation

#### 1. **`before_treatment = np.random.normal(50, 10, 30)`**
   - Generates the `before_treatment` data, a sample of size 30 drawn from a normal distribution with:
     - Mean (`loc`) = 50.
     - Standard deviation (`scale`) = 10.

#### 2. **`after_treatment = before_treatment + np.random.normal(-2, 5, 30)`**
   - Simulates the `after_treatment` data by adding an effect of treatment:
     - A random noise term is drawn from a normal distribution with:
       - Mean (`loc`) = -2 (indicating an average reduction due to treatment).
       - Standard deviation (`scale`) = 5 (simulating variability in the effect).
   - This models how the treatment influences the `before_treatment` values.

#### 3. **`stats.ttest_rel(before_treatment, after_treatment)`**
   - Performs a **paired t-test** to test whether the mean difference between the two paired samples is significantly different from zero.
   - Parameters:
     - `before_treatment`: The sample data before the treatment.
     - `after_treatment`: The paired sample data after the treatment.
   - Returns:
     - `t_stat`: The t-statistic, which measures the size of the difference relative to the variability in the paired differences.
     - `p_value`: The p-value, which indicates the probability of observing a test statistic at least as extreme as the computed one under the null hypothesis (no mean difference).
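A paired t-test is equivalent to a one-sample t-test on the per-pair differences against zero. This sketch regenerates the same data (seed 10) to show the equivalence; `differences` is an illustrative name.

```python
import numpy as np
from scipy import stats

np.random.seed(10)
before_treatment = np.random.normal(50, 10, 30)
after_treatment = before_treatment + np.random.normal(-2, 5, 30)

# The paired test works on the per-subject differences (before - after)
differences = before_treatment - after_treatment

paired = stats.ttest_rel(before_treatment, after_treatment)
one_sample = stats.ttest_1samp(differences, 0)

print(f"paired:      t = {paired.statistic:.2f}, p = {paired.pvalue:.2f}")
print(f"differences: t = {one_sample.statistic:.2f}, p = {one_sample.pvalue:.2f}")
```

Because the differences are taken as `before - after`, the sign of the t-statistic follows the argument order: a positive value means the "before" mean is higher.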

#### 4. **`print(f"Paired sample t-test: t-statistic = {t_stat:.2f}, p-value = {p_value:.2f}")`**
   - Prints the test results:
     - `t_stat`: Formatted to two decimal places.
     - `p_value`: Formatted to two decimal places.

---

### Output Example
If you run the code, the output might look like this:

```plaintext
Paired sample t-test: t-statistic = 2.11, p-value = 0.04
```

---

### Interpretation of Results

#### 1. **`t_stat`**
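   - `stats.ttest_rel(before_treatment, after_treatment)` is based on the differences `before - after`, so a positive t-statistic suggests that the mean of `after_treatment` is lower than the mean of `before_treatment`, and a negative one suggests it is higher.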

#### 2. **`p_value`**
   - The p-value indicates the probability of observing such a difference (or a more extreme one) under the null hypothesis (no difference between the paired means).
   - **Significance Threshold**:
     - If `p_value < 0.05`, the mean difference is statistically significant, and you reject the null hypothesis.
     - If `p_value >= 0.05`, you fail to reject the null hypothesis.

#### Example Interpretation
- **t-statistic = 2.11**: The mean of the `before - after` differences is about 2.11 standard errors above zero, which points to a decrease in the mean after treatment.
- **p-value = 0.04**: Since the p-value is less than 0.05, the decrease in the mean after treatment is **statistically significant**.

---

### Assumptions of the Test
1. The pairs are dependent (e.g., measurements on the same individuals or subjects).
2. The differences between pairs are normally distributed (especially important for small samples).

---

### Use Case
Paired t-tests are commonly used in:
- Pre-post studies (e.g., before and after a treatment or intervention).
- Studies measuring subjects under two different but related conditions.
- Comparing scores of the same individuals in two testing situations.

### 5. Chi-Square Test for Independence

In [6]:
## Example: Test for independence in a contingency table
observed = np.array([[30, 10], [20, 40]])
chi2_stat, p_value, _, _ = stats.chi2_contingency(observed)
print(f"Chi-square test: chi2-statistic = {chi2_stat:.2f}, p-value = {p_value:.2f}")

Chi-square test: chi2-statistic = 15.04, p-value = 0.00


In [7]:
observed 

array([[30, 10],
       [20, 40]])

This code performs a **Chi-square test for independence**, which evaluates whether there is a significant association between two categorical variables based on observed frequencies in a contingency table.

---

### Code Explanation

#### 1. **`observed = np.array([[30, 10], [20, 40]])`**
   - Defines the observed frequencies in a contingency table:
     ```
     [[30, 10],  # First row of observed counts
      [20, 40]]  # Second row of observed counts
     ```
   - Each cell represents the frequency count for a specific combination of two categorical variables (e.g., Group × Outcome).

#### 2. **`stats.chi2_contingency(observed)`**
   - Performs the Chi-square test for independence.
   - Parameters:
     - `observed`: The contingency table of observed frequencies.
   - Returns:
     - `chi2_stat`: The Chi-square test statistic.
     - `p_value`: The p-value indicating the likelihood of observing the data under the null hypothesis.
     - `_`: The degrees of freedom (not used in this example).
     - `_`: The expected frequencies table (not used in this example).
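The two values discarded above are worth a look. The sketch below keeps the degrees of freedom and the expected frequencies, rebuilds the expected table from the row and column totals, and shows the effect of the `correction` parameter (Yates' continuity correction, which `chi2_contingency` applies by default when the table has one degree of freedom).

```python
import numpy as np
from scipy import stats

observed = np.array([[30, 10], [20, 40]])

# Expected count = (row total * column total) / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
manual_expected = row_totals * col_totals / observed.sum()

# Default call (with Yates' correction for this 2x2 table) and uncorrected call
chi2_yates, p_yates, dof, expected = stats.chi2_contingency(observed)
chi2_raw, p_raw, _, _ = stats.chi2_contingency(observed, correction=False)

print("Expected frequencies:\n", expected)
print(f"dof = {dof}")
print(f"with correction:    chi2 = {chi2_yates:.2f}")
print(f"without correction: chi2 = {chi2_raw:.2f}")
```

The corrected statistic (about 15.04) is the one reported by the cell above; the uncorrected Pearson statistic is somewhat larger (about 16.67).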

#### 3. **`print(f"Chi-square test: chi2-statistic = {chi2_stat:.2f}, p-value = {p_value:.2f}")`**
   - Prints the test results:
     - `chi2_stat`: The Chi-square statistic, formatted to two decimal places.
     - `p_value`: The p-value, formatted to two decimal places.

---

### Output Example
If you run the code, the output might look like this:

```plaintext
Chi-square test: chi2-statistic = 15.04, p-value = 0.00
```

---

### Interpretation of Results

#### 1. **`chi2_stat`**
   - A measure of the difference between the observed and expected frequencies under the null hypothesis (independence of variables).
   - A larger value indicates a greater discrepancy between observed and expected frequencies.

#### 2. **`p_value`**
   - The p-value indicates the probability of observing the data (or something more extreme) if the null hypothesis is true.
   - **Significance Threshold**:
     - If `p_value < 0.05`, you reject the null hypothesis and conclude that the variables are significantly associated.
     - If `p_value >= 0.05`, you fail to reject the null hypothesis: the data do not provide sufficient evidence of an association between the variables.

#### Example Interpretation
- **Chi-square statistic = 15.04**: Indicates a substantial difference between observed and expected frequencies.
- **p-value = 0.00**: Since the p-value is less than 0.05, the variables are **not independent** and have a statistically significant association.

---

### Assumptions of the Test
1. The data in the contingency table are counts (frequencies).
2. Observations are independent of each other.
3. The expected frequency in each cell should generally be 5 or more (to ensure validity).

---

### Use Case
Chi-square tests for independence are commonly used to:
- Analyze the relationship between two categorical variables (e.g., gender and preference for a product).
- Test for associations in survey data or experimental studies involving categorical outcomes.

### 6. Analysis of Variance (ANOVA)

In [8]:
## Example: One-way ANOVA to test if means of multiple groups are significantly different
np.random.seed(13)
group1 = np.random.normal(50, 10, 30)
group2 = np.random.normal(55, 10, 30)
group3 = np.random.normal(60, 10, 30)
data = pd.DataFrame({"values": np.concatenate([group1, group2, group3]),
                     "group": ["Group 1"]*30 + ["Group 2"]*30 + ["Group 3"]*30})
print(data)
anova_result = stats.f_oneway(group1, group2, group3)
print(f"One-way ANOVA: F-statistic = {anova_result.statistic:.2f}, p-value = {anova_result.pvalue:.2f}")

       values    group
0   42.876093  Group 1
1   57.537664  Group 1
2   49.554969  Group 1
3   54.518123  Group 1
4   63.451017  Group 1
..        ...      ...
85  62.078001  Group 3
86  38.128847  Group 3
87  67.219651  Group 3
88  63.487506  Group 3
89  56.613817  Group 3

[90 rows x 2 columns]
One-way ANOVA: F-statistic = 4.43, p-value = 0.01


This code performs a **One-Way ANOVA (Analysis of Variance)**, which is used to test if the means of three or more independent groups are significantly different.

---

### Code Explanation

#### 1. **`np.random.seed(13)`**
   - Sets the random seed for reproducibility so that the random numbers generated remain the same on each execution.

#### 2. **`group1`, `group2`, `group3`**
   - Simulates data for three independent groups, each containing 30 samples drawn from normal distributions with:
     - **`group1`:** Mean = 50, SD = 10.
     - **`group2`:** Mean = 55, SD = 10.
     - **`group3`:** Mean = 60, SD = 10.

#### 3. **`data`**
   - Combines all group data into a single **DataFrame** for display purposes.
   - Columns:
     - `"values"`: Contains the combined data from all groups.
     - `"group"`: Labels each data point with its respective group name.

#### 4. **`stats.f_oneway(group1, group2, group3)`**
   - Performs a One-Way ANOVA to test the null hypothesis:
     - H₀: All group means are equal.
     - H₁: At least one group mean is different.
   - Parameters:
     - `group1, group2, group3`: The independent group samples.
   - Returns:
     - `statistic`: The F-statistic, which measures the ratio of between-group variance to within-group variance.
     - `pvalue`: The p-value, which indicates the likelihood of observing the data if the null hypothesis is true.
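The F-statistic can be rebuilt from the between-group and within-group sums of squares. This sketch regenerates the same three groups (seed 13, drawn in the same order) and compares the manual result with `stats.f_oneway`; the helper names are illustrative.

```python
import numpy as np
from scipy import stats

np.random.seed(13)
groups = [np.random.normal(mu, 10, 30) for mu in (50, 55, 60)]

k = len(groups)                          # number of groups
n_total = sum(len(g) for g in groups)    # total number of observations
grand_mean = np.concatenate(groups).mean()

# Between-group and within-group sums of squares
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F = mean square between / mean square within
manual_f = (ss_between / (k - 1)) / (ss_within / (n_total - k))
manual_p = stats.f.sf(manual_f, k - 1, n_total - k)

f_stat, p_value = stats.f_oneway(*groups)
print(f"manual: F = {manual_f:.2f}, p = {manual_p:.2f}")
print(f"scipy:  F = {f_stat:.2f}, p = {p_value:.2f}")
```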

#### 5. **`print(f"One-way ANOVA: F-statistic = {anova_result.statistic:.2f}, p-value = {anova_result.pvalue:.2f}")`**
   - Prints the results of the ANOVA test:
     - **F-statistic**: Formatted to two decimal places.
     - **p-value**: Formatted to two decimal places.

---

### Output Example

If you run the code, the output might look like this:

```plaintext
One-way ANOVA: F-statistic = 4.43, p-value = 0.01
```

---

### Interpretation of Results

#### 1. **F-statistic**
   - Measures the ratio of between-group variance to within-group variance.
   - A higher F-statistic indicates a larger difference between group means relative to the variability within groups.

#### 2. **p-value**
   - The p-value tests the null hypothesis that all group means are equal.
   - **Significance Threshold**:
     - If `p-value < 0.05`, reject the null hypothesis and conclude that at least one group mean is significantly different.
     - If `p-value >= 0.05`, fail to reject the null hypothesis, suggesting no significant difference between group means.

#### Example Interpretation
- **F-statistic = 4.43**: The between-group variance is about 4.43 times the within-group variance.
- **p-value = 0.01**: Since the p-value is less than 0.05, reject the null hypothesis and conclude that at least one group mean is significantly different.

---

### Assumptions of One-Way ANOVA
1. **Independence**: Observations in each group are independent.
2. **Normality**: The data in each group should be approximately normally distributed.
3. **Homogeneity of Variances**: The variances of the groups should be approximately equal.

---

### Use Case
One-Way ANOVA is commonly used in:
- Comparing test scores of students from different teaching methods.
- Analyzing the effectiveness of different treatments in medical studies.
- Evaluating the performance of multiple groups in experimental settings.

### 7. Correlation

In [9]:
## Example: Pearson correlation between two continuous variables
np.random.seed(13)
x = np.random.normal(50, 10, 100)
y = 0.8 * x + np.random.normal(0, 5, 100)  # y is correlated with x
corr_coeff, p_value = stats.pearsonr(x, y)
print(f"Pearson correlation: correlation coefficient = {corr_coeff:.2f}, p-value = {p_value:.2f}")

Pearson correlation: correlation coefficient = 0.88, p-value = 0.00


This code calculates the **Pearson correlation coefficient** between two variables, \( x \) and \( y \), to measure the strength and direction of their linear relationship.

---

### Code Explanation

#### 1. **Simulating Data**
- `x = np.random.normal(50, 10, 100)`
  - Generates 100 random numbers from a normal distribution with a mean of 50 and standard deviation of 10.
  - Represents the independent variable \( x \).

- `y = 0.8 * x + np.random.normal(0, 5, 100)`
  - Constructs \( y \) as a linear function of \( x \) with some random noise added.
  - The coefficient \( 0.8 \) is the slope: on average, \( y \) rises by 0.8 for each unit increase in \( x \). The noise term `np.random.normal(0, 5, 100)` adds variability to simulate realistic data. The strength of the correlation depends on the size of the slope relative to that noise.
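
The correlation implied by this simulation can be worked out from its parameters. A minimal sketch, assuming the noise is independent of \( x \); the variable names `slope`, `sd_x`, and `sd_noise` are introduced here for illustration:

```python
import math

slope, sd_x, sd_noise = 0.8, 10.0, 5.0  # parameters of the simulated data

# For y = slope * x + noise with independent noise:
#   cov(x, y) = slope * sd_x**2
#   var(y)    = slope**2 * sd_x**2 + sd_noise**2
sd_y = math.sqrt(slope**2 * sd_x**2 + sd_noise**2)
rho = slope * sd_x / sd_y  # cov(x, y) / (sd_x * sd_y)

print(f"Population correlation implied by the simulation: {rho:.2f}")
```

The value is about 0.85. A sample of 100 points will give an estimate near this value, not exactly equal to it, because of sampling variability.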

#### 2. **Pearson Correlation**
- `stats.pearsonr(x, y)`
  - Computes the Pearson correlation coefficient (\( r \)) and the associated p-value:
    - **`corr_coeff`** (\( r \)): Measures the strength and direction of the linear relationship between \( x \) and \( y \). 
      - Values range from:
        - \( -1 \): Perfect negative correlation.
        - \( 0 \): No linear correlation.
        - \( +1 \): Perfect positive correlation.
    - **`p_value`**: Tests the null hypothesis $( H_0 )$: No linear correlation between \( x \) and \( y \).
      - **If `p_value < 0.05`**, the correlation is statistically significant.
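
To make the two returned values less of a black box, the sketch below recomputes them by hand and compares them with `stats.pearsonr`. It assumes the standard t-test for a correlation coefficient, which is valid under the normality assumption; the names `r_manual`, `t_stat`, and `p_manual` are illustrative.

```python
import numpy as np
from scipy import stats

np.random.seed(13)
x = np.random.normal(50, 10, 100)
y = 0.8 * x + np.random.normal(0, 5, 100)
n = len(x)

# r = sum of cross-deviations divided by the product of the deviation norms
dx, dy = x - x.mean(), y - y.mean()
r_manual = np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2))

# Under H0 (no linear correlation), t = r * sqrt((n - 2) / (1 - r^2))
# follows a t-distribution with n - 2 degrees of freedom
t_stat = r_manual * np.sqrt((n - 2) / (1 - r_manual**2))
p_manual = 2 * stats.t.sf(abs(t_stat), df=n - 2)  # two-sided p-value

r_scipy, p_scipy = stats.pearsonr(x, y)
print(f"manual: r = {r_manual:.4f}, p = {p_manual:.2e}")
print(f"scipy:  r = {r_scipy:.4f}, p = {p_scipy:.2e}")
```

Printing the p-value in scientific notation (`.2e`) shows its actual magnitude, which a two-decimal format hides.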

#### 3. **Output**
- `print(f"Pearson correlation: correlation coefficient = {corr_coeff:.2f}, p-value = {p_value:.2f}")`
  - Displays the Pearson correlation coefficient and the p-value, both formatted to two decimal places.

---

### Output Example

Because the seed is fixed, running the code prints:

```plaintext
Pearson correlation: correlation coefficient = 0.88, p-value = 0.00
```

---

### Interpretation of Results

#### 1. **Correlation Coefficient (\( r \))**
- **0.88**: Indicates a strong positive linear relationship between \( x \) and \( y \).
  - Positive values (\( r > 0 \)) indicate that as \( x \) increases, \( y \) also increases.
  - Negative values (\( r < 0 \)) indicate that as \( x \) increases, \( y \) decreases.

#### 2. **p-value**
- **0.00**: The p-value is not exactly zero; it is below 0.005 and rounds to 0.00 at two decimal places. Since it is less than 0.05, the correlation is statistically significant, meaning there is strong evidence against the null hypothesis $( H_0 )$ of no linear correlation.

---

### Use Case
Pearson correlation is widely used to:
- Measure the strength of linear relationships in fields like economics, biology, and psychology.
- Check assumptions before linear regression analysis.
- Analyze the association between two continuous variables, e.g., height and weight, or temperature and sales.

---

### Assumptions of Pearson Correlation
1. **Linearity**: The relationship between \( x \) and \( y \) is linear.
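2. **Normality**: Both variables are approximately normally distributed. This matters for the p-value; \( r \) itself can be computed for any paired numeric data.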
3. **Homoscedasticity**: The variability of \( y \) should be constant across \( x \).
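
One rough way to screen the homoscedasticity assumption is to fit a straight line and compare the spread of the residuals for low and high values of \( x \). The sketch below is an informal check under that assumption, not a standard procedure from this notebook; splitting at the median and using Levene's test are illustrative choices.

```python
import numpy as np
from scipy import stats

np.random.seed(13)
x = np.random.normal(50, 10, 100)
y = 0.8 * x + np.random.normal(0, 5, 100)

# Fit a straight line (np.polyfit returns the slope first for deg=1)
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)

# Compare residual spread for low-x vs. high-x observations
median_x = np.median(x)
low = residuals[x <= median_x]
high = residuals[x > median_x]
_, p_spread = stats.levene(low, high)

print(f"Levene p-value for residual spread: {p_spread:.3f}")
```

A small p-value would suggest that the variability of \( y \) changes with \( x \). A scatter plot of the residuals against \( x \) is a useful complement to this check.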

### 8. Z-Test for Proportions

```python
# Example: Z-test for comparing two proportions
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

count = np.array([40, 35])    # e.g., 40 successes in sample 1 and 35 in sample 2
nobs = np.array([100, 100])   # Sample sizes of 100 each
z_stat, p_value = proportions_ztest(count, nobs)
print(f"Proportions Z-test: z-statistic = {z_stat:.2f}, p-value = {p_value:.2f}")
```

```plaintext
Proportions Z-test: z-statistic = 0.73, p-value = 0.47
```


This code performs a **two-proportion Z-test**, which tests whether the proportions of a certain outcome (e.g., success) are significantly different between two independent groups.

---

### Code Explanation

#### 1. **`count` and `nobs`**
- `count = np.array([40, 35])`:
  - Represents the number of successes in each group. 
  - Example: In sample 1, 40 successes; in sample 2, 35 successes.

- `nobs = np.array([100, 100])`:
  - Represents the sample sizes for each group.
  - Example: Each sample contains 100 observations.

#### 2. **`proportions_ztest`**
- **Function**: `statsmodels.stats.proportion.proportions_ztest(count, nobs)`
  - Performs the Z-test for proportions.
  - **Parameters**:
    - `count`: The number of successes in each sample.
    - `nobs`: The total number of observations in each sample.
  - **Returns**:
    - `z_stat`: The Z-statistic for the test.
      - Measures how far the observed difference in proportions is from 0 (the null hypothesis), in units of standard error.
    - `p_value`: The p-value for the test.
      - Tests the null hypothesis $( H_0 )$: The proportions are equal.
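
The calculation behind these two return values can be sketched by hand using only the standard library. This assumes the function's default behaviour of a two-sided test with a pooled variance estimate; the names `p_pool`, `se`, `z_manual`, and `p_manual` are illustrative.

```python
import math

count = [40, 35]   # successes in sample 1 and sample 2
nobs = [100, 100]  # sample sizes

p1, p2 = count[0] / nobs[0], count[1] / nobs[1]

# Pooled proportion: the common success rate estimated under H0
p_pool = sum(count) / sum(nobs)
se = math.sqrt(p_pool * (1 - p_pool) * (1 / nobs[0] + 1 / nobs[1]))

# Observed difference in units of standard error
z_manual = (p1 - p2) / se

# Two-sided p-value from the standard normal: P(|Z| >= |z|) = erfc(|z| / sqrt(2))
p_manual = math.erfc(abs(z_manual) / math.sqrt(2))

print(f"manual z-statistic = {z_manual:.2f}, p-value = {p_manual:.2f}")
```

The result is z ≈ 0.73 and p ≈ 0.47, matching the output reported for `proportions_ztest`.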

#### 3. **`print`**
- `print(f"Proportions Z-test: z-statistic = {z_stat:.2f}, p-value = {p_value:.2f}")`
  - Prints the Z-statistic and p-value, formatted to two decimal places.

---

### Output Example

Running the code prints:

```plaintext
Proportions Z-test: z-statistic = 0.73, p-value = 0.47
```

---

### Interpretation of Results

#### 1. **Z-statistic**
- **Z-statistic = 0.73**: Indicates how many standard errors the observed difference in proportions is from 0.
  - Positive values: Sample 1's proportion is larger.
  - Negative values: Sample 2's proportion is larger.

#### 2. **p-value**
- **p-value = 0.47**: Tests the null hypothesis $( H_0 )$: The proportions are equal.
  - If $( \text{p-value} < 0.05 )$: Reject $( H_0)$. The difference in proportions is statistically significant.
  - If $( \text{p-value} \geq 0.05 )$: Fail to reject $( H_0 )$. The proportions are not significantly different.

#### Example Interpretation
- Since $( \text{p-value} = 0.47 > 0.05 )$, fail to reject the null hypothesis. This means there is **no statistically significant difference** between the proportions of successes in the two groups.
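
A confidence interval for the difference in proportions complements this conclusion by showing the range of differences compatible with the data. The sketch below uses a simple Wald interval with an unpooled standard error; this is one common choice, and it is not something computed by the notebook's code.

```python
import math
from statistics import NormalDist

count = [40, 35]
nobs = [100, 100]
p1, p2 = count[0] / nobs[0], count[1] / nobs[1]
diff = p1 - p2

# Unpooled standard error: an interval does not assume the proportions are equal
se = math.sqrt(p1 * (1 - p1) / nobs[0] + p2 * (1 - p2) / nobs[1])

# Critical value for 95% confidence (about 1.96)
z_crit = NormalDist().inv_cdf(0.975)

ci_low, ci_high = diff - z_crit * se, diff + z_crit * se
print(f"difference = {diff:.2f}, 95% CI = ({ci_low:.3f}, {ci_high:.3f})")
```

The interval runs from roughly -0.084 to 0.184. It contains 0, which is consistent with failing to reject the null hypothesis.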

---

### Use Case
A two-proportion Z-test is commonly used to:
- Compare the success rates of two marketing strategies.
- Evaluate the effectiveness of two treatments in a clinical trial.
- Analyze proportions in survey responses between two groups.

---

### Assumptions of the Two-Proportion Z-Test
1. **Independence**:
   - Observations in each group are independent.
   - The two groups are independent of each other.
2. **Sufficient Sample Size**:
   - Each group must have at least 5 successes and 5 failures ($( \text{count} )$ and $( \text{nobs} - \text{count} )$).
3. **Random Sampling**:
   - The samples are randomly selected from the population.
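
The sample-size condition is easy to verify in code. A minimal sketch for the example data, using the threshold of 5 stated above (`min_cell` and `checks` are illustrative names):

```python
count = [40, 35]
nobs = [100, 100]
min_cell = 5  # minimum number of successes and of failures per group

checks = []
for i, (successes, n) in enumerate(zip(count, nobs), start=1):
    failures = n - successes
    ok = successes >= min_cell and failures >= min_cell
    checks.append(ok)
    print(f"Sample {i}: successes = {successes}, failures = {failures}, condition met = {ok}")
```

Both groups in the example satisfy the condition. Independence and random sampling cannot be checked from the counts alone; they depend on the study design.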