In [None]:
"""

### **Statistics Part 2: Assignment Questions**

-----

#### **1. What is hypothesis testing in statistics?**

Hypothesis testing is a core method of inferential statistics for drawing conclusions about a population from sample data. It sets up two competing hypotheses, the null hypothesis ($H_0$) and the alternative hypothesis ($H_a$), and uses the sample to decide whether there is enough evidence to reject the null hypothesis in favor of the alternative. In practice, you calculate a test statistic and then either compare it to a critical value or compare its p-value to the significance level.

-----

#### **2. What is the null hypothesis, and how does it differ from the alternative hypothesis?**

  * **Null Hypothesis ($H_0$)**: The null hypothesis is a statement of no effect, no difference, or no relationship between variables. It's the default assumption that any observed difference in the data is due to random chance. For example, a null hypothesis might state that the mean of a population is equal to a specific value.

  * **Alternative Hypothesis ($H_a$ or $H_1$)**: The alternative hypothesis is a statement that contradicts the null hypothesis. It proposes that there is an effect, a difference, or a relationship. The alternative hypothesis is what the researcher is typically trying to find evidence to support. For example, an alternative hypothesis might state that the mean of a population is not equal to, greater than, or less than a specific value.

The primary difference is that the **null hypothesis represents the status quo or a baseline of no change**, while the **alternative hypothesis represents the change or effect that the researcher is investigating**.

-----

#### **3. What is the significance level in hypothesis testing, and why is it important?**

The **significance level**, denoted by the Greek letter alpha ($\\alpha$), is the probability of rejecting the null hypothesis when it is actually true. In other words, it's the probability of making a Type I error. The most common significance level is 0.05 (or 5%), but other values like 0.01 and 0.10 are also used.

It is important because it **sets the threshold for how strong the sample evidence must be** to reject the null hypothesis. A smaller significance level (e.g., 0.01) means that the evidence must be stronger, which reduces the chance of a Type I error but increases the chance of a Type II error (failing to reject a false null hypothesis).
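
The trade-off between the two error types can be made concrete with a small sketch. It assumes a one-sided Z-test of $H_0$: mean = 100 when the true mean is 105, with a known standard deviation of 15 and n = 30; all of these numbers are illustrative assumptions rather than part of the question.

```python
import numpy as np
from scipy.stats import norm

# Illustrative assumptions: one-sided Z-test with a known sigma
mu0, mu_true, sigma, n = 100, 105, 15, 30
se = sigma / np.sqrt(n)

betas = {}
for alpha in (0.10, 0.05, 0.01):
    # Reject H0 when the sample mean exceeds this cutoff
    critical_value = norm.ppf(1 - alpha, loc=mu0, scale=se)
    # Type II error: the sample mean stays below the cutoff although the true mean is mu_true
    betas[alpha] = norm.cdf(critical_value, loc=mu_true, scale=se)
    print(f"alpha = {alpha:.2f} -> beta = {betas[alpha]:.3f}")
```

Under these assumptions, each reduction in alpha moves the cutoff further from the null mean, so beta grows.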

-----

#### **4. What does a P-value represent in hypothesis testing?**

The **P-value** (or probability value) represents the probability of obtaining test results at least as extreme as the results actually observed, assuming that the null hypothesis is correct. A small P-value indicates that the observed data is unlikely to have occurred if the null hypothesis were true.

-----

#### **5. How do you interpret the P-value in hypothesis testing?**

The interpretation of a P-value is based on comparing it to the pre-determined significance level ($\\alpha$):

  * **If P-value ≤ $\\alpha$**: You **reject the null hypothesis**. This means there is statistically significant evidence to support the alternative hypothesis.
  * **If P-value > $\\alpha$**: You **fail to reject the null hypothesis**. This means there is not enough statistical evidence to support the alternative hypothesis; it does not prove that the null hypothesis is true.

-----

#### **6. What are Type 1 and Type 2 errors in hypothesis testing?**

In hypothesis testing, two types of errors can occur:

  * **Type I Error**: This occurs when you **reject a true null hypothesis**. The probability of a Type I error is equal to the significance level ($\\alpha$).
  * **Type II Error**: This occurs when you **fail to reject a false null hypothesis**. The probability of a Type II error is denoted by the Greek letter beta ($\\beta$).

| Decision | Null Hypothesis is True | Null Hypothesis is False |
| :--- | :--- | :--- |
| **Reject Null Hypothesis** | Type I Error | Correct Decision |
| **Fail to Reject Null Hypothesis** | Correct Decision | Type II Error |
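
As a rough check on the statement that the Type I error rate equals $\\alpha$, the sketch below repeatedly draws samples for which the null hypothesis is true and counts how often a one-sample t-test rejects it. The population parameters, sample size, number of trials, and seed are arbitrary assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha = 0.05
n_trials, n = 2000, 30

# Every sample is drawn with the null hypothesis true (population mean = 50)
samples = rng.normal(loc=50, scale=10, size=(n_trials, n))
_, p_values = stats.ttest_1samp(samples, popmean=50, axis=1)

# Fraction of trials in which a true null hypothesis was rejected
type1_rate = np.mean(p_values < alpha)
print(f"Estimated Type I error rate: {type1_rate:.3f}")
```

The estimate should land close to 0.05, with some simulation noise.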

-----

#### **7. What is the difference between a one-tailed and a two-tailed test in hypothesis testing?**

  * **One-Tailed Test**: A one-tailed test is used when the alternative hypothesis specifies a direction of the difference or relationship. For example, if you are testing whether a new drug *increases* a certain metric. The critical region for rejection is in only one tail of the sampling distribution.

  * **Two-Tailed Test**: A two-tailed test is used when the alternative hypothesis does not specify a direction. For example, if you are testing whether a new drug has a *different* effect (either an increase or a decrease). The critical region is split between both tails of the sampling distribution.
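
The difference between the two kinds of test shows up directly in the p-value. A minimal sketch, assuming a hypothetical Z-statistic of 1.8:

```python
from scipy.stats import norm

z_stat = 1.8  # hypothetical test statistic

# One-tailed (H_a: the parameter is greater): area in the upper tail only
p_one_tailed = norm.sf(z_stat)
# Two-tailed (H_a: the parameter is different): area in both tails
p_two_tailed = 2 * norm.sf(abs(z_stat))

print(f"One-tailed p-value: {p_one_tailed:.4f}")
print(f"Two-tailed p-value: {p_two_tailed:.4f}")
```

With this assumed statistic, the one-tailed test rejects at the 5% level while the two-tailed test does not, which is why the direction should be chosen before looking at the data.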

-----

#### **8. What is the Z-test, and when is it used in hypothesis testing?**

The **Z-test** is a statistical test that determines whether a sample mean differs from a hypothesized population mean, or whether two population means differ, when the population variances are known and the sample size is large (typically n > 30). It is based on the standard normal distribution (Z-distribution).

It is used for:

  * Comparing a sample mean to a known population mean.
  * Comparing the means of two samples.

-----

#### **9. How do you calculate the Z-score, and what does it represent in hypothesis testing?**

The Z-score (or Z-statistic) is calculated using the formula:

$Z = \\frac{(\\bar{x} - \\mu)}{(\\frac{\\sigma}{\\sqrt{n}})}$

Where:

  * $\\bar{x}$ is the sample mean.
  * $\\mu$ is the population mean.
  * $\\sigma$ is the population standard deviation.
  * $n$ is the sample size.

The **Z-score represents how many standard errors the sample mean lies from the hypothesized population mean** (for a single data point, how many standard deviations it lies from the mean). In hypothesis testing, the larger its absolute value, the stronger the evidence against the null hypothesis.
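
A short worked example of the formula, using hypothetical values chosen so the arithmetic is easy to follow:

```python
import math

# Hypothetical values for the symbols in the formula
x_bar = 52.0   # sample mean
mu = 50.0      # population mean under H0
sigma = 8.0    # population standard deviation
n = 64         # sample size

standard_error = sigma / math.sqrt(n)   # 8 / 8 = 1.0
z = (x_bar - mu) / standard_error       # (52 - 50) / 1.0 = 2.0
print(f"Z = {z:.2f}")
```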

-----

#### **10. What is the T-distribution, and when should it be used instead of the normal distribution?**

The **T-distribution** (or Student's t-distribution) is a probability distribution that is similar to the normal distribution but has heavier tails. It is used in hypothesis testing when the **population standard deviation is unknown** and must be estimated from the sample, which matters most when the **sample size is small** (typically n < 30). As the sample size increases, the T-distribution approaches the normal distribution.
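
One way to see the heavier tails and the convergence is to compare two-tailed critical values. The sketch below assumes a 5% significance level and a handful of arbitrary sample sizes:

```python
from scipy.stats import norm, t

alpha = 0.05
z_crit = norm.ppf(1 - alpha / 2)  # two-tailed critical value, about 1.96

t_crits = {}
for n in (5, 10, 30, 100, 1000):
    t_crits[n] = t.ppf(1 - alpha / 2, df=n - 1)
    print(f"n = {n:4d}: t critical = {t_crits[n]:.3f}  (z critical = {z_crit:.3f})")
```

The t critical value is always larger than the z critical value, and the gap shrinks as n grows.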

-----

#### **11. What is the difference between a Z-test and a T-test?**

| Feature | Z-test | T-test |
| :--- | :--- | :--- |
| **Population Standard Deviation** | Known | Unknown |
| **Sample Size** | Large (n > 30) | Small (n < 30) |
| **Distribution** | Standard Normal Distribution | T-distribution |

-----

#### **12. What is the T-test, and how is it used in hypothesis testing?**

The **T-test** is a statistical test that compares a sample mean to a hypothesized value or compares the means of two groups. It is used when the population standard deviation is unknown, which is especially important when the sample size is small.

Types of T-tests:

  * **One-sample T-test**: Compares the mean of a single sample to a known or hypothesized population mean.
  * **Independent two-sample T-test**: Compares the means of two independent groups.
  * **Paired sample T-test**: Compares the means of the same group at two different times or under two different conditions.

-----

#### **13. What is the relationship between Z-test and T-test in hypothesis testing?**

The T-test is closely related to the Z-test. The primary difference lies in the conditions under which they are used, mainly related to the knowledge of the population standard deviation and the sample size. As the sample size ($n$) increases, the T-distribution converges to the standard normal distribution. For large sample sizes (n > 30), the results of a T-test and a Z-test will be very similar.

-----

#### **14. What is a confidence interval, and how is it used to interpret statistical results?**

A **confidence interval** is a range of values that is likely to contain the true population parameter (e.g., the population mean) with a certain level of confidence. For example, a 95% confidence interval for a population mean implies that if you were to take many samples and construct a confidence interval for each, about 95% of those intervals would contain the true population mean.

In interpreting statistical results, a confidence interval provides a range of plausible values for the population parameter. If a hypothesized value (from the null hypothesis) falls outside the confidence interval, it provides evidence against the null hypothesis.
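
The repeated-sampling interpretation can be checked with a small simulation. This is only a sketch: the true mean, spread, sample size, number of intervals, and seed are all assumed values.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_mean, n, n_intervals = 20, 40, 1000

covered = 0
for _ in range(n_intervals):
    sample = rng.normal(loc=true_mean, scale=5, size=n)
    # 95% t-based interval around this sample's mean
    low, high = stats.t.interval(0.95, df=n - 1, loc=sample.mean(), scale=stats.sem(sample))
    covered += int(low <= true_mean <= high)

coverage = covered / n_intervals
print(f"Intervals containing the true mean: {coverage:.1%}")
```

The share of intervals that contain the true mean should come out near 95%.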

-----

#### **15. What is the margin of error, and how does it affect the confidence interval?**

The **margin of error** is a statistic that expresses the amount of random sampling error in a survey's results. It is the half-width of the confidence interval. A larger margin of error means a wider confidence interval, indicating less precision in the estimate of the population parameter.

The margin of error is affected by:

  * **Confidence level**: Higher confidence levels lead to a larger margin of error.
  * **Sample size**: Larger sample sizes lead to a smaller margin of error.
  * **Sample variability**: Higher variability leads to a larger margin of error.
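
The three effects listed above can be sketched numerically. The helper `margin_of_error_known_sigma` is a hypothetical name introduced here, and it assumes a z-based interval with a known standard deviation to keep the arithmetic simple.

```python
import numpy as np
from scipy import stats

def margin_of_error_known_sigma(sigma, n, confidence=0.95):
    # Half-width of a z-based interval; sigma is treated as known (an assumption)
    z = stats.norm.ppf((1 + confidence) / 2)
    return z * sigma / np.sqrt(n)

base = margin_of_error_known_sigma(sigma=10, n=100)
higher_confidence = margin_of_error_known_sigma(sigma=10, n=100, confidence=0.99)
larger_sample = margin_of_error_known_sigma(sigma=10, n=400)
more_variable = margin_of_error_known_sigma(sigma=20, n=100)

print(f"Baseline (95%, n=100, sigma=10): {base:.3f}")
print(f"Higher confidence (99%):         {higher_confidence:.3f}")
print(f"Larger sample (n=400):           {larger_sample:.3f}")
print(f"More variability (sigma=20):     {more_variable:.3f}")
```

Quadrupling the sample size halves the margin of error, because the margin scales with one over the square root of n.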

-----

#### **16. How is Bayes' Theorem used in statistics, and what is its significance?**

**Bayes' Theorem** describes the probability of an event based on prior knowledge of conditions that might be related to the event. The formula is:

$P(A|B) = \\frac{P(B|A) \\cdot P(A)}{P(B)}$

Where:

  * $P(A|B)$ is the posterior probability (the probability of A given B).
  * $P(B|A)$ is the likelihood (the probability of B given A).
  * $P(A)$ is the prior probability (the initial belief in A).
  * $P(B)$ is the marginal probability of B.

Its significance lies in its application in **Bayesian inference**, a statistical method where Bayes' theorem is used to update the probability for a hypothesis as more evidence or information becomes available. It is widely used in fields like machine learning, medical diagnosis, and spam filtering.
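
The idea of updating a belief as evidence arrives can be sketched as repeated applications of the formula, where each posterior becomes the next prior. The spam-filter probabilities below are invented for illustration, the two words are assumed to be conditionally independent, and `update` is a hypothetical helper name.

```python
def update(prior, p_evidence_given_h, p_evidence_given_not_h):
    # Bayes' theorem with P(B) expanded by the law of total probability
    evidence = p_evidence_given_h * prior + p_evidence_given_not_h * (1 - prior)
    return p_evidence_given_h * prior / evidence

# Hypothetical spam filter: H = "the message is spam"
belief = 0.20  # prior probability of spam
# (P(word | spam), P(word | not spam)) for each word observed in the message
word_likelihoods = [(0.60, 0.05), (0.40, 0.10)]

history = [belief]
for p_word_spam, p_word_ham in word_likelihoods:
    belief = update(belief, p_word_spam, p_word_ham)  # posterior becomes the next prior
    history.append(belief)
    print(f"Updated P(spam) = {belief:.3f}")
```

With these assumed numbers the belief moves from 0.20 to 0.75 after the first word and to about 0.92 after the second.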

-----

#### **17. What is the Chi-square distribution, and when is it used?**

The **Chi-square ($\\chi^2$) distribution** is a continuous probability distribution that is widely used in hypothesis testing. The shape of the distribution depends on the degrees of freedom.

It is used in:

  * **Goodness-of-fit tests**: To determine if a sample of data comes from a specific distribution.
  * **Tests of independence**: To determine if there is a significant association between two categorical variables.
  * **Tests for variance**: To test hypotheses about the variance of a population.

-----

#### **18. What is the Chi-square goodness of fit test, and how is it applied?**

The **Chi-square goodness-of-fit test** is used to determine whether an observed frequency distribution differs from a theoretical (expected) frequency distribution.

It is applied by:

1.  Stating the null and alternative hypotheses.
2.  Calculating the expected frequencies for each category.
3.  Calculating the Chi-square statistic using the formula: $\\chi^2 = \\sum \\frac{(O - E)^2}{E}$, where O is the observed frequency and E is the expected frequency.
4.  Determining the critical value from the Chi-square distribution table based on the significance level and degrees of freedom.
5.  Comparing the calculated Chi-square statistic to the critical value to make a decision.
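
The steps above can be followed by hand in a few lines. This sketch uses hypothetical counts from 120 rolls of a die, with the null hypothesis that every face is equally likely and a 5% significance level.

```python
import numpy as np
from scipy.stats import chi2

# Hypothetical data: 120 rolls of a die; H0 says every face is equally likely
observed = np.array([25, 17, 15, 23, 24, 16])
expected = np.full(6, observed.sum() / 6)                  # step 2: 20 per face

chi2_stat = np.sum((observed - expected) ** 2 / expected)  # step 3
dof = len(observed) - 1
critical_value = chi2.ppf(1 - 0.05, dof)                   # step 4, alpha = 0.05

reject_null = chi2_stat > critical_value                   # step 5
print(f"Chi-square statistic: {chi2_stat:.4f}")
print(f"Critical value (df={dof}): {critical_value:.4f}")
print("Reject H0" if reject_null else "Fail to reject H0")
```

Here the statistic is 5.0, well below the critical value of about 11.07, so these counts give no evidence against a fair die.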

-----

#### **19. What is the F-distribution, and when is it used in hypothesis testing?**

The **F-distribution** is a continuous probability distribution that is used in hypothesis testing, particularly in Analysis of Variance (ANOVA) and F-tests. The shape of the F-distribution depends on two degrees of freedom parameters.

It is used for:

  * **Comparing the variances of two populations** (F-test for equality of variances).
  * **Comparing the means of three or more groups** (ANOVA).

-----

#### **20. What is an ANOVA test, and what are its assumptions?**

**ANOVA (Analysis of Variance)** is a statistical test used to compare the means of two or more groups to determine if there are any statistically significant differences between them.

The main assumptions of ANOVA are:

  * **Independence**: The observations in each group are independent.
  * **Normality**: The data in each group are approximately normally distributed.
  * **Homogeneity of variances (homoscedasticity)**: The variances of the groups are equal.

-----

#### **21. What are the different types of ANOVA tests?**

  * **One-Way ANOVA**: Used to compare the means of three or more groups based on one independent variable (factor).
  * **Two-Way ANOVA**: Used to study the effect of two independent variables on a dependent variable, including their interaction effect.
  * **MANOVA (Multivariate Analysis of Variance)**: Used when there is more than one dependent variable.

-----

#### **22. What is the F-test, and how does it relate to hypothesis testing?**

The **F-test** is a statistical test that uses the F-distribution. In the context of ANOVA, the **F-statistic** is a ratio of two variances (or mean squares). It is calculated as:

$F = \\frac{\\text{Variance between groups}}{\\text{Variance within groups}}$

In hypothesis testing, the F-test is used to:

  * Test the overall significance of a regression model.
  * Test the equality of means in ANOVA.
  * Compare the variances of two populations.

If the calculated F-statistic is larger than the critical F-value from the F-distribution table, the null hypothesis is rejected.
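
To show where the ratio comes from, the sketch below computes the between-group and within-group mean squares by hand for three small hypothetical groups and checks the result against `scipy.stats.f_oneway`. The data values and the 5% level are assumptions.

```python
import numpy as np
from scipy.stats import f, f_oneway

# Hypothetical measurements for three groups
groups = [np.array([4.0, 5.0, 6.0]),
          np.array([6.0, 7.0, 8.0]),
          np.array([8.0, 9.0, 10.0])]

k = len(groups)
n_total = sum(len(g) for g in groups)
grand_mean = np.concatenate(groups).mean()

# Between-group and within-group sums of squares
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# Mean squares are the sums of squares divided by their degrees of freedom
f_stat = (ss_between / (k - 1)) / (ss_within / (n_total - k))
critical_f = f.ppf(1 - 0.05, k - 1, n_total - k)
scipy_f, _ = f_oneway(*groups)

print(f"Manual F = {f_stat:.4f}, scipy F = {scipy_f:.4f}, critical F = {critical_f:.4f}")
```

For this data the F-statistic is 12, which exceeds the critical value of roughly 5.14, so the null hypothesis of equal means would be rejected.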
"""

In [None]:
### **Practical Python Implementations**

-----

#### **1. Z-test for Comparing a Sample Mean to a Known Population Mean**

```python
import numpy as np
from statsmodels.stats.weightstats import ztest

# Sample data
data = [2.5, 3.0, 2.8, 3.2, 2.9, 3.5, 3.1, 2.7, 3.3, 2.9]
population_mean = 3.0
alpha = 0.05

# Perform Z-test
z_stat, p_value = ztest(data, value=population_mean)

print(f"Z-statistic: {z_stat:.4f}")
print(f"P-value: {p_value:.4f}")

if p_value < alpha:
    print("Reject the null hypothesis: The sample mean is significantly different from the population mean.")
else:
    print("Fail to reject the null hypothesis: There is no significant difference.")

```

-----

In [None]:
#### **2. Simulate Data and Perform Hypothesis Testing**

```python
import numpy as np
from scipy import stats

# Simulate random data from a normal distribution
np.random.seed(0)
sample_data = np.random.normal(loc=105, scale=10, size=50)
population_mean = 100
alpha = 0.05

# Perform a one-sample t-test
t_stat, p_value = stats.ttest_1samp(sample_data, population_mean)

print(f"T-statistic: {t_stat:.4f}")
print(f"P-value: {p_value:.4f}")

if p_value < alpha:
    print("Reject the null hypothesis.")
else:
    print("Fail to reject the null hypothesis.")
```

-----

In [None]:
#### **3. Implement a One-Sample Z-test**

```python
import numpy as np
from statsmodels.stats.weightstats import ztest

def one_sample_ztest(sample_data, pop_mean, alpha=0.05):
    """
    Performs a one-sample Z-test.
    """
    z_stat, p_value = ztest(sample_data, value=pop_mean)
    
    print(f"Z-statistic: {z_stat:.4f}")
    print(f"P-value: {p_value:.4f}")

    if p_value < alpha:
        print("Reject the null hypothesis.")
    else:
        print("Fail to reject the null hypothesis.")

# Example usage
sample = np.random.normal(loc=5.2, scale=1.5, size=100)
one_sample_ztest(sample, 5.0)

```

-----

In [None]:
#### **4. Two-Tailed Z-test with Visualization**

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Sample data
x = np.linspace(-4, 4, 1000)
y = stats.norm.pdf(x, 0, 1)

# Z-test result
z_stat = 2.5
p_value = stats.norm.sf(abs(z_stat)) * 2
alpha = 0.05
critical_value = stats.norm.ppf(1 - alpha/2)

plt.figure(figsize=(10, 6))
plt.plot(x, y, label='Standard Normal Distribution')
plt.fill_between(x, y, where=(x > critical_value) | (x < -critical_value), color='red', alpha=0.5, label='Rejection Region')
plt.axvline(z_stat, color='green', linestyle='--', label=f'Z-statistic = {z_stat:.2f}')
plt.title('Two-Tailed Z-test Decision Region')
plt.xlabel('Z-score')
plt.ylabel('Probability Density')
plt.legend()
plt.grid(True)
plt.show()

print(f"P-value: {p_value:.4f}")
if p_value < alpha:
    print("Reject the null hypothesis.")
else:
    print("Fail to reject the null hypothesis.")
```

-----

In [None]:
#### **5. Visualize Type 1 and Type 2 Errors**

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

def visualize_errors(mu0, mu1, sigma, n, alpha):
    """
    Visualizes Type I and Type II errors.
    """
    se = sigma / np.sqrt(n)
    x = np.linspace(mu0 - 4*se, mu1 + 4*se, 1000)
    
    # Null and alternative distributions
    null_dist = norm(mu0, se)
    alt_dist = norm(mu1, se)

    # Critical value
    critical_value = norm.ppf(1 - alpha, loc=mu0, scale=se)

    plt.figure(figsize=(12, 7))
    plt.plot(x, null_dist.pdf(x), label='Null Hypothesis ($H_0$)')
    plt.plot(x, alt_dist.pdf(x), label='Alternative Hypothesis ($H_a$)')

    # Type I error (alpha)
    x_alpha = np.linspace(critical_value, mu0 + 4*se, 100)
    plt.fill_between(x_alpha, null_dist.pdf(x_alpha), color='red', alpha=0.5, label=f'Type I Error (alpha) = {alpha:.2f}')

    # Type II error (beta)
    x_beta = np.linspace(mu1 - 4*se, critical_value, 100)
    beta = alt_dist.cdf(critical_value)
    plt.fill_between(x_beta, alt_dist.pdf(x_beta), color='blue', alpha=0.5, label=f'Type II Error (beta) = {beta:.2f}')
    
    plt.axvline(critical_value, color='black', linestyle='--')
    plt.title('Type I and Type II Errors')
    plt.xlabel('Sample Mean')
    plt.ylabel('Probability Density')
    plt.legend()
    plt.grid(True)
    plt.show()

# Parameters
visualize_errors(mu0=100, mu1=105, sigma=15, n=30, alpha=0.05)
```

-----

In [None]:
#### **6. Independent T-test**

```python
import numpy as np
from scipy import stats

# Sample data for two independent groups
group1 = np.random.normal(loc=10, scale=2, size=30)
group2 = np.random.normal(loc=12, scale=2, size=30)

# Perform independent t-test
t_stat, p_value = stats.ttest_ind(group1, group2)

print(f"T-statistic: {t_stat:.4f}")
print(f"P-value: {p_value:.4f}")

if p_value < 0.05:
    print("Reject the null hypothesis: The means of the two groups are significantly different.")
else:
    print("Fail to reject the null hypothesis.")
```

-----

In [None]:
#### **7. Paired Sample T-test with Visualization**

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Paired sample data (e.g., before and after treatment)
before = np.random.normal(loc=80, scale=10, size=25)
after = before + np.random.normal(loc=5, scale=3, size=25)

# Perform paired t-test
t_stat, p_value = stats.ttest_rel(after, before)

print(f"T-statistic: {t_stat:.4f}")
print(f"P-value: {p_value:.4f}")

# Visualization
plt.figure(figsize=(8, 6))
plt.scatter(before, after)
plt.plot([min(before), max(before)], [min(before), max(before)], color='red', linestyle='--')
plt.title('Paired Sample T-test: Before vs. After')
plt.xlabel('Before Treatment')
plt.ylabel('After Treatment')
plt.grid(True)
plt.show()
```

-----

In [None]:
#### **8. Compare Z-test and T-test**

```python
import numpy as np
from scipy import stats

# Simulate data
sample = np.random.normal(loc=15.5, scale=2.5, size=50)
pop_mean = 15.0
pop_std_dev = 2.5  # Population standard deviation, assumed known for the Z-test

# Perform Z-test using the known population standard deviation
z_stat = (np.mean(sample) - pop_mean) / (pop_std_dev / np.sqrt(len(sample)))
z_p_value = 2 * stats.norm.sf(abs(z_stat))  # two-tailed p-value

# Perform T-test (estimates the standard deviation from the sample)
t_stat, t_p_value = stats.ttest_1samp(sample, pop_mean)

print(f"Z-test: Z-statistic = {z_stat:.4f}, P-value = {z_p_value:.4f}")
print(f"T-test: T-statistic = {t_stat:.4f}, P-value = {t_p_value:.4f}")
print("\nFor larger sample sizes, the Z-test and T-test results are very similar.")
```

-----

In [None]:
#### **9. Calculate Confidence Interval for a Sample Mean**

```python
import numpy as np
from scipy import stats

def confidence_interval_mean(data, confidence=0.95):
    """
    Calculates the confidence interval for a sample mean.
    """
    n = len(data)
    mean = np.mean(data)
    se = stats.sem(data) # Standard error of the mean
    
    ci = stats.t.interval(confidence, df=n-1, loc=mean, scale=se)
    
    print(f"{confidence:.0%} Confidence Interval: ({ci[0]:.4f}, {ci[1]:.4f})")
    print(f"This means we are {confidence:.0%} confident that the true population mean lies within this interval.")

# Example usage
sample = np.random.normal(loc=20, scale=5, size=100)
confidence_interval_mean(sample)
```

-----

In [None]:
#### **10. Calculate Margin of Error**

```python
import numpy as np
from scipy import stats

def margin_of_error(data, confidence=0.95):
    """
    Calculates the margin of error for a given confidence level.
    """
    n = len(data)
    se = stats.sem(data)
    
    moe = se * stats.t.ppf((1 + confidence) / 2., n-1)
    
    print(f"Margin of Error: {moe:.4f}")
    return moe

# Example usage
sample = np.random.normal(loc=50, scale=10, size=40)
moe = margin_of_error(sample)
```

-----

In [None]:
#### **11. Bayesian Inference with Bayes' Theorem**

```python
def bayesian_inference(prior, likelihood, evidence):
    """
    Simple implementation of Bayes' Theorem.
    """
    posterior = (likelihood * prior) / evidence
    return posterior

# Example: Medical Diagnosis
# P(A) = Prior probability of having the disease
# P(B|A) = Probability of testing positive given the disease (sensitivity)
# P(B) = Overall probability of testing positive

p_disease = 0.01
p_positive_given_disease = 0.99 # Sensitivity
p_positive_given_no_disease = 0.05 # False positive rate
p_no_disease = 1 - p_disease

# P(B) = P(B|A)P(A) + P(B|not A)P(not A)
p_positive = (p_positive_given_disease * p_disease) + (p_positive_given_no_disease * p_no_disease)

# Calculate P(Disease | Positive Test)
p_disease_given_positive = bayesian_inference(p_disease, p_positive_given_disease, p_positive)

print(f"Posterior probability of having the disease given a positive test: {p_disease_given_positive:.4f}")
```

-----

In [None]:
#### **12. Chi-square Test for Independence**

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed data in a contingency table
# Rows: Smoker, Non-smoker; Columns: Lung Cancer, No Lung Cancer
observed = np.array([[50, 10], [20, 120]])

chi2, p, dof, expected = chi2_contingency(observed)

print(f"Chi-square statistic: {chi2:.4f}")
print(f"P-value: {p:.4f}")
print(f"Degrees of freedom: {dof}")
print("Expected frequencies:\n", expected)

if p < 0.05:
    print("\nReject the null hypothesis: There is a significant association between smoking and lung cancer.")
else:
    print("\nFail to reject the null hypothesis.")
```

-----

In [None]:
#### **13. Calculate Expected Frequencies for Chi-square Test**

```python
import numpy as np

def expected_frequencies(observed_table):
    """
    Calculates expected frequencies for a Chi-square test.
    """
    row_totals = observed_table.sum(axis=1)
    col_totals = observed_table.sum(axis=0)
    grand_total = observed_table.sum()
    
    expected_table = np.outer(row_totals, col_totals) / grand_total
    return expected_table

# Example usage
observed = np.array([[30, 10], [15, 25]])
expected = expected_frequencies(observed)
print("Observed Frequencies:\n", observed)
print("\nExpected Frequencies:\n", expected)
```

-----

In [None]:
#### **14. Goodness-of-Fit Test**

```python
from scipy.stats import chisquare

# Observed frequencies of dice rolls
observed = [9, 11, 8, 12, 10, 10] # Total 60 rolls
# Expected frequencies for a fair die
expected = [10, 10, 10, 10, 10, 10]

chi2, p = chisquare(f_obs=observed, f_exp=expected)

print(f"Chi-square statistic: {chi2:.4f}")
print(f"P-value: {p:.4f}")

if p < 0.05:
    print("Reject the null hypothesis: The die is likely not fair.")
else:
    print("Fail to reject the null hypothesis: The die appears to be fair.")
```

-----

In [None]:
#### **15. Visualize Chi-square Distribution**

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import chi2

# Degrees of freedom
df_values = [2, 5, 10, 20]
x = np.linspace(0, 30, 1000)

plt.figure(figsize=(10, 6))
for df in df_values:
    plt.plot(x, chi2.pdf(x, df), label=f'df = {df}')

plt.title('Chi-square Distribution')
plt.xlabel('Chi-square value')
plt.ylabel('Probability Density')
plt.legend()
plt.grid(True)
plt.ylim(0, 0.5)
plt.show()
print("Characteristics: The Chi-square distribution is right-skewed and its shape depends on the degrees of freedom. As df increases, it approaches a normal distribution.")
```

-----

In [None]:
#### **16. F-test for Comparing Variances**

```python
import numpy as np
from scipy.stats import f

def f_test_variances(sample1, sample2):
    """
    Performs an F-test to compare the variances of two samples.
    """
    var1, var2 = np.var(sample1, ddof=1), np.var(sample2, ddof=1)
    n1, n2 = len(sample1), len(sample2)
    
    f_stat = var1 / var2
    df1, df2 = n1 - 1, n2 - 1
    
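    # Two-tailed p-value: double the smaller tail probability (capped at 1)
    p_value = min(1.0, 2 * min(f.cdf(f_stat, df1, df2), f.sf(f_stat, df1, df2)))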
    
    print(f"F-statistic: {f_stat:.4f}")
    print(f"P-value: {p_value:.4f}")

# Example usage
groupA = np.random.normal(loc=10, scale=3, size=20)
groupB = np.random.normal(loc=10, scale=5, size=20)
f_test_variances(groupA, groupB)
```

-----

In [None]:
#### **17. ANOVA Test**

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Sample data for three groups
data = {'values': np.concatenate([np.random.normal(10, 2, 20),
                                  np.random.normal(15, 2, 20),
                                  np.random.normal(12, 2, 20)]),
        'group': ['A']*20 + ['B']*20 + ['C']*20}
df = pd.DataFrame(data)

# Fit ANOVA model
model = ols('values ~ group', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

print(anova_table)
print("\nInterpretation: Look at the P-value (PR(>F)). If it's less than your significance level, it means there is a significant difference between the means of the groups.")
```

-----

In [None]:
#### **18. One-Way ANOVA with Plot**

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import f_oneway

# Sample data
group1 = np.random.normal(20, 5, 30)
group2 = np.random.normal(25, 5, 30)
group3 = np.random.normal(22, 5, 30)

f_stat, p_value = f_oneway(group1, group2, group3)
print(f"F-statistic: {f_stat:.4f}, P-value: {p_value:.4f}")

# Plotting
df = pd.DataFrame({'Group1': group1, 'Group2': group2, 'Group3': group3})
plt.figure(figsize=(8, 6))
sns.boxplot(data=df)
plt.title('One-Way ANOVA: Comparison of Group Means')
plt.ylabel('Value')
plt.show()
```

-----

In [None]:
#### **19. Check ANOVA Assumptions**

```python
import numpy as np
import pandas as pd
from scipy.stats import shapiro, levene

def check_anova_assumptions(data, group_col, value_col):
    """
    Checks normality and homogeneity of variances.
    """
    groups = data[group_col].unique()
    
    # Normality (Shapiro-Wilk test)
    print("Normality Check (Shapiro-Wilk):")
    for group in groups:
        stat, p = shapiro(data[data[group_col] == group][value_col])
        print(f"Group {group}: p-value = {p:.4f}")
        if p < 0.05: print("  -> Not normal")

    # Homogeneity of variances (Levene's test)
    print("\nHomogeneity of Variances (Levene's test):")
    samples = [data[data[group_col] == group][value_col] for group in groups]
    stat, p = levene(*samples)
    print(f"p-value = {p:.4f}")
    if p < 0.05: print("  -> Variances are not equal")

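# Example: rebuild the three-group data from Q17, because `df` is overwritten by later cells
anova_df = pd.DataFrame({
    'values': np.concatenate([np.random.normal(10, 2, 20),
                              np.random.normal(15, 2, 20),
                              np.random.normal(12, 2, 20)]),
    'group': ['A']*20 + ['B']*20 + ['C']*20})
check_anova_assumptions(anova_df, 'group', 'values')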
```

-----

In [None]:
#### **20. Two-Way ANOVA**

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Sample data with two factors
data = {'factor1': ['A']*20 + ['B']*20 + ['A']*20 + ['B']*20,
        'factor2': ['X']*40 + ['Y']*40,
        'value': np.concatenate([np.random.normal(10, 2, 20),
                                 np.random.normal(15, 2, 20),
                                 np.random.normal(12, 2, 20),
                                 np.random.normal(18, 2, 20)])}
df = pd.DataFrame(data)

# Fit Two-Way ANOVA model
model = ols('value ~ C(factor1) + C(factor2) + C(factor1):C(factor2)', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

print(anova_table)

# Visualize interaction
sns.pointplot(data=df, x='factor1', y='value', hue='factor2', dodge=True)
plt.title('Interaction Plot for Two-Way ANOVA')
plt.show()
```

-----

In [None]:
#### **21. Visualize F-distribution**

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import f

dfn_values = [5, 10, 20]
dfd = 10
x = np.linspace(0, 5, 1000)

plt.figure(figsize=(10, 6))
for dfn in dfn_values:
    plt.plot(x, f.pdf(x, dfn, dfd), label=f'dfn={dfn}, dfd={dfd}')

plt.title('F-distribution')
plt.xlabel('F-value')
plt.ylabel('Probability Density')
plt.legend()
plt.grid(True)
plt.show()
print("Use in hypothesis testing: The F-distribution is used to find the P-value in ANOVA and other F-tests, helping to determine if the null hypothesis should be rejected.")
```

-----

In [None]:
#### **22. One-Way ANOVA with Boxplots**

(This is similar to Q18, but a more explicit example is provided here)

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import f_oneway

# Data for different groups
data = {'A': np.random.normal(100, 15, 50),
        'B': np.random.normal(110, 15, 50),
        'C': np.random.normal(105, 15, 50)}
df = pd.DataFrame(data)

f_stat, p_value = f_oneway(df['A'], df['B'], df['C'])
print(f"F-statistic: {f_stat:.4f}, P-value: {p_value:.4f}")

# Visualize with boxplots
plt.figure(figsize=(10, 7))
sns.boxplot(data=df)
plt.title('Group Means Comparison using Boxplots')
plt.ylabel('Score')
plt.xlabel('Group')
plt.show()
```

-----

In [None]:
#### **23. Simulate Normal Data and Test Means**

```python
import numpy as np
from scipy.stats import ttest_1samp

# Simulate data from a normal distribution
np.random.seed(42)
simulated_data = np.random.normal(loc=50, scale=10, size=100)
hypothesized_mean = 52

# Perform hypothesis test (one-sample t-test)
t_stat, p_value = ttest_1samp(simulated_data, hypothesized_mean)

print(f"T-statistic: {t_stat:.4f}")
print(f"P-value: {p_value:.4f}")

if p_value < 0.05:
    print(f"Reject the null hypothesis: The mean of the simulated data is significantly different from {hypothesized_mean}.")
else:
    print("Fail to reject the null hypothesis.")
```

-----

In [None]:
#### **24. Hypothesis Test for Population Variance**

```python
import numpy as np
from scipy.stats import chi2

def test_population_variance(sample_data, pop_variance, alpha=0.05):
    """
    Performs a Chi-square test for population variance.
    """
    n = len(sample_data)
    sample_var = np.var(sample_data, ddof=1)
    
    chi2_stat = (n - 1) * sample_var / pop_variance
    
    # Two-tailed test
    p_value = 2 * min(chi2.cdf(chi2_stat, n - 1), 1 - chi2.cdf(chi2_stat, n - 1))

    print(f"Chi-square statistic: {chi2_stat:.4f}")
    print(f"P-value: {p_value:.4f}")
    
    if p_value < alpha:
        print(f"Reject the null hypothesis: The sample variance is significantly different from {pop_variance}.")
    else:
        print("Fail to reject the null hypothesis.")

# Example
data = np.random.normal(loc=0, scale=5, size=30)
test_population_variance(data, pop_variance=25) # 5^2
```

-----

In [None]:
#### **25. Z-test for Comparing Proportions**

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

# Data: number of successes and number of observations
count = np.array([50, 80]) # Successes in group 1 and 2
nobs = np.array([100, 120]) # Total observations in group 1 and 2

z_stat, p_value = proportions_ztest(count, nobs)

print(f"Z-statistic: {z_stat:.4f}")
print(f"P-value: {p_value:.4f}")

if p_value < 0.05:
    print("Reject the null hypothesis: The proportions are significantly different.")
else:
    print("Fail to reject the null hypothesis.")
```

-----

In [None]:
#### **26. F-test for Comparing Variances with Visualization**

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import f, probplot

def f_test_and_visualize(sample1, sample2):
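    var1, var2 = np.var(sample1, ddof=1), np.var(sample2, ddof=1)
    # Put the larger variance in the numerator so that F >= 1,
    # and order the degrees of freedom to match
    if var1 >= var2:
        f_stat, df1, df2 = var1 / var2, len(sample1) - 1, len(sample2) - 1
    else:
        f_stat, df1, df2 = var2 / var1, len(sample2) - 1, len(sample1) - 1

    # Two-tailed p-value: double the upper-tail probability (capped at 1)
    p_value = min(1.0, 2 * f.sf(f_stat, df1, df2))
    print(f"F-statistic: {f_stat:.4f}, P-value: {p_value:.4f}")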

    # Visualization (Q-Q plots)
    plt.figure(figsize=(12, 5))
    plt.subplot(1, 2, 1)
    probplot(sample1, dist="norm", plot=plt)
    plt.title('Q-Q Plot for Sample 1')
    plt.subplot(1, 2, 2)
    probplot(sample2, dist="norm", plot=plt)
    plt.title('Q-Q Plot for Sample 2')
    plt.show()

# Example
s1 = np.random.normal(10, 2, 50)
s2 = np.random.normal(10, 3, 50)
f_test_and_visualize(s1, s2)
```

-----

In [None]:
#### **27. Chi-square Goodness of Fit with Simulated Data**

```python
from scipy.stats import chisquare, norm
import numpy as np

# Simulate data from a normal distribution
np.random.seed(1)
sim_data = np.random.normal(loc=10, scale=2, size=100)

# Create bins and get observed frequencies
observed_freq, bins = np.histogram(sim_data, bins=10)

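# Get expected frequencies from a normal distribution.
# The bins only span the observed range, so the bin probabilities sum to slightly
# less than 1; rescale so the expected and observed totals match, as chisquare requires
expected_prob = norm.cdf(bins[1:], loc=10, scale=2) - norm.cdf(bins[:-1], loc=10, scale=2)
expected_freq = expected_prob / expected_prob.sum() * observed_freq.sum()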

chi2_stat, p_value = chisquare(f_obs=observed_freq, f_exp=expected_freq)

print(f"Chi-square statistic: {chi2_stat:.4f}")
print(f"P-value: {p_value:.4f}")

if p_value < 0.05:
    print("Reject the null hypothesis: The data does not follow a normal distribution.")
else:
    print("Fail to reject the null hypothesis: The data appears to follow a normal distribution.")
```