# **Statistics Part 2**

---

#### **What is hypothesis testing in statistics?**
Hypothesis testing is a statistical method used to make inferences or draw conclusions about a population based on sample data. It involves two competing hypotheses: the **null hypothesis (H₀)** and the **alternative hypothesis (H₁)**. The goal is to determine whether there is enough statistical evidence to reject the null hypothesis.

---

#### **What is the null hypothesis, and how does it differ from the alternative hypothesis?**
- **Null hypothesis (H₀)**: This is the hypothesis that there is no effect or difference in the population. It assumes that any observed effect in the sample is due to chance.
- **Alternative hypothesis (H₁)**: This is the hypothesis that there is an effect or difference in the population. It represents the claim that the researcher is seeking evidence for.

Example:
- H₀: The mean height of men is 5'8".
- H₁: The mean height of men is not 5'8".

---

#### **What is the significance level in hypothesis testing, and why is it important?**
The **significance level (α)** is the probability of rejecting the null hypothesis when it is actually true (Type 1 error). It represents the threshold for deciding whether the test results are statistically significant. Common significance levels are 0.05, 0.01, and 0.10. For a fixed sample size, a smaller α value reduces the chance of making a Type 1 error but increases the chance of a Type 2 error.
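
As a small sketch of how α becomes a decision threshold, the snippet below computes the two-tailed critical Z-values for the three common significance levels. It assumes a Z-test whose statistic follows the standard normal distribution; the variable names are illustrative.

```python
from scipy import stats

# Two-tailed critical values: reject H0 when |Z| exceeds the value for the chosen alpha
critical_values = {}
for alpha in (0.10, 0.05, 0.01):
    # Split alpha evenly between the two tails of the standard normal distribution
    critical_values[alpha] = stats.norm.ppf(1 - alpha / 2)
    print(f"alpha = {alpha:.2f} -> reject H0 if |Z| > {critical_values[alpha]:.3f}")
```

A smaller α pushes the threshold further into the tails, so rejecting a true null hypothesis becomes less likely while missing a real effect becomes more likely.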

---

#### **What does a P-value represent in hypothesis testing?**
The **P-value** is the probability of obtaining a test statistic at least as extreme as the one observed, under the assumption that the null hypothesis is true. A smaller P-value indicates stronger evidence against the null hypothesis. If the P-value is less than the significance level (α), you reject the null hypothesis.

---

#### **How do you interpret the P-value in hypothesis testing?**
- **P-value < α (significance level)**: Reject the null hypothesis. There is enough evidence to support the alternative hypothesis.
- **P-value ≥ α**: Fail to reject the null hypothesis. There is insufficient evidence to support the alternative hypothesis.
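
The decision rule above can be sketched in a few lines. This is a minimal illustration, assuming a two-tailed Z-test; the helper names `two_tailed_p_value` and `decide` and the test statistic of 2.1 are hypothetical.

```python
from scipy import stats

def two_tailed_p_value(z_stat: float) -> float:
    """Probability of a standard normal value at least as extreme as z_stat."""
    return 2 * stats.norm.sf(abs(z_stat))

def decide(p_value: float, alpha: float = 0.05) -> str:
    """Apply the decision rule: reject H0 only when p_value < alpha."""
    return "Reject H0" if p_value < alpha else "Fail to reject H0"

z_stat = 2.1  # hypothetical test statistic
p_value = two_tailed_p_value(z_stat)

print(f"P-value: {p_value:.4f}")
print(f"At alpha = 0.05: {decide(p_value, alpha=0.05)}")
print(f"At alpha = 0.01: {decide(p_value, alpha=0.01)}")
```

The same P-value leads to different decisions at different significance levels, which is why α should be fixed before looking at the data.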

---

#### **What are Type 1 and Type 2 errors in hypothesis testing?**
- **Type 1 error (α)**: Rejecting the null hypothesis when it is actually true (false positive).
- **Type 2 error (β)**: Failing to reject the null hypothesis when it is actually false (false negative).
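
Both error rates can be estimated by simulation. The sketch below assumes a two-tailed Z-test with a known population standard deviation; the means, sample size, and the helper `rejection_rate` are made up for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

def rejection_rate(true_mean, null_mean=50.0, sigma=10.0, n=30,
                   alpha=0.05, trials=20_000):
    """Fraction of simulated two-tailed Z-tests that reject H0: mu = null_mean."""
    samples = rng.normal(true_mean, sigma, size=(trials, n))
    z = (samples.mean(axis=1) - null_mean) / (sigma / np.sqrt(n))
    p = 2 * stats.norm.sf(np.abs(z))
    return np.mean(p < alpha)

# H0 is true: every rejection is a false positive (Type 1 error)
type_1_rate = rejection_rate(true_mean=50.0)
# H0 is false: every non-rejection is a false negative (Type 2 error)
type_2_rate = 1 - rejection_rate(true_mean=54.0)

print(f"Estimated Type 1 error rate: {type_1_rate:.3f}")
print(f"Estimated Type 2 error rate: {type_2_rate:.3f}")
```

The estimated Type 1 rate should land close to the chosen α, while the Type 2 rate depends on how far the true mean is from the null value and on the sample size.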

---

#### **What is the difference between a one-tailed and a two-tailed test in hypothesis testing?**
- **One-tailed test**: Tests for the possibility of an effect in one direction (either greater than or less than a certain value).
  - Example: H₀: µ ≤ 5, H₁: µ > 5
- **Two-tailed test**: Tests for the possibility of an effect in both directions (greater than or less than a certain value).
  - Example: H₀: µ = 5, H₁: µ ≠ 5
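
To see how the choice of tails changes the result, the sketch below computes both P-values for the same statistic. The Z-statistic of 1.8 is a hypothetical value, and the hypotheses match the two examples above.

```python
from scipy import stats

z_stat = 1.8  # hypothetical Z-statistic

# One-tailed (right): only large positive values count as evidence for H1: mu > 5
p_right_tailed = stats.norm.sf(z_stat)
# Two-tailed: extreme values in either direction count as evidence for H1: mu != 5
p_two_tailed = 2 * stats.norm.sf(abs(z_stat))

print(f"One-tailed P-value: {p_right_tailed:.4f}")
print(f"Two-tailed P-value: {p_two_tailed:.4f}")
```

With α = 0.05, this statistic is significant under the one-tailed test but not under the two-tailed test, so the direction of the alternative should be chosen before collecting data.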

---

#### **What is the Z-test, and when is it used in hypothesis testing?**
The **Z-test** is used to determine whether there is a significant difference between the sample mean and the population mean, especially when the sample size is large (typically n > 30) and the population standard deviation is known. It follows the standard normal distribution.

---

#### **How do you calculate the Z-score, and what does it represent in hypothesis testing?**
The **Z-score** is calculated as:

\[
Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}
\]

Where:
- \(\bar{X}\) is the sample mean,
- \(\mu\) is the population mean,
- \(\sigma\) is the population standard deviation,
- \(n\) is the sample size.

The Z-score represents how many standard deviations the sample mean is from the population mean.
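
A quick worked example of the formula, using made-up numbers (sample mean 52, population mean 50, population standard deviation 10, sample size 100); the function name `z_score` is illustrative.

```python
import math

def z_score(sample_mean, population_mean, population_std, n):
    """Z = (sample mean - population mean) / (sigma / sqrt(n))."""
    standard_error = population_std / math.sqrt(n)
    return (sample_mean - population_mean) / standard_error

# Standard error is 10 / sqrt(100) = 1, so Z = (52 - 50) / 1
z = z_score(sample_mean=52.0, population_mean=50.0, population_std=10.0, n=100)
print(f"Z = {z:.2f}")
```

Here the sample mean sits two standard errors above the population mean.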

---

#### **What is the T-distribution, and when should it be used instead of the normal distribution?**
The **T-distribution** is used when the sample size is small (n ≤ 30) and/or the population standard deviation is unknown. It is similar to the normal distribution but has heavier tails to account for the additional variability due to small sample sizes.
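
The heavier tails show up as larger critical values. The sketch below compares two-tailed 95% critical values of the T-distribution with the normal value; the degrees of freedom chosen are arbitrary examples.

```python
from scipy import stats

# Two-tailed 95% critical value of the standard normal distribution
z_critical = stats.norm.ppf(0.975)

# The same quantile of the T-distribution for a few degrees of freedom
t_critical = {df: stats.t.ppf(0.975, df) for df in (5, 30, 1000)}

print(f"Normal: {z_critical:.3f}")
for df, value in t_critical.items():
    print(f"T (df = {df}): {value:.3f}")
```

The T critical value shrinks toward the normal one as the degrees of freedom grow, which is the sense in which the two distributions converge for large samples.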

---

#### **What is the difference between a Z-test and a T-test?**
- **Z-test**: Used when the sample size is large (n > 30) and the population standard deviation is known.
- **T-test**: Used when the sample size is small (n ≤ 30) and/or the population standard deviation is unknown.

---

#### **What is the T-test, and how is it used in hypothesis testing?**
The **T-test** is used to determine whether a sample mean differs significantly from a hypothesized value (one-sample) or whether the means of two groups differ significantly (independent or paired). It is used when the population standard deviation is unknown, which matters most when the sample size is small.

---

#### **What is the relationship between Z-test and T-test in hypothesis testing?**
The Z-test and T-test are both used for hypothesis testing, but the Z-test is appropriate when the sample size is large or when the population standard deviation is known. The T-test is used when the sample size is small, and the population standard deviation is unknown. As the sample size increases, the T-distribution approaches the normal distribution, and the T-test becomes similar to the Z-test.

---

#### **What is a confidence interval, and how is it used to interpret statistical results?**
A **confidence interval (CI)** is a range of values used to estimate a population parameter. The interval has an associated confidence level (e.g., 95%): if the sampling procedure were repeated many times, about that percentage of the resulting intervals would contain the true parameter.
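
As a minimal sketch, a 95% confidence interval for a mean can be built from the sample mean, the standard error, and a T critical value. The measurements below are made up, and the approach assumes the data are roughly normal with an unknown population standard deviation.

```python
import numpy as np
from scipy import stats

data = np.array([12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7])  # made-up measurements
confidence = 0.95

n = len(data)
sample_mean = data.mean()
standard_error = data.std(ddof=1) / np.sqrt(n)  # ddof=1 gives the sample standard deviation

# Critical value of the T-distribution with n - 1 degrees of freedom
t_critical = stats.t.ppf(1 - (1 - confidence) / 2, df=n - 1)

lower = sample_mean - t_critical * standard_error
upper = sample_mean + t_critical * standard_error

print(f"Sample mean: {sample_mean:.3f}")
print(f"95% CI: ({lower:.3f}, {upper:.3f})")
```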

---

#### **What is the margin of error, and how does it affect the confidence interval?**
The **margin of error** is the maximum expected distance between the sample estimate and the population parameter at a given confidence level; it equals half the width of the confidence interval. A larger margin of error results in a wider interval, indicating more uncertainty, and increasing the sample size reduces it.
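
One common way to compute the margin of error for a mean, assuming a known population standard deviation and a normal sampling distribution, is the critical Z-value times the standard error. The function name and the numbers below are illustrative.

```python
import math
from scipy import stats

def margin_of_error(population_std, n, confidence=0.95):
    """Critical Z-value times the standard error of the mean."""
    z_critical = stats.norm.ppf(1 - (1 - confidence) / 2)
    return z_critical * population_std / math.sqrt(n)

# Same standard deviation, increasing sample sizes
margins = {n: margin_of_error(population_std=10.0, n=n) for n in (25, 100, 400)}

for n, margin in margins.items():
    print(f"n = {n}: margin of error = {margin:.2f}")
```

Because the margin scales with 1/√n, quadrupling the sample size halves the margin of error, while raising the confidence level widens it.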

---

#### **How is Bayes' Theorem used in statistics, and what is its significance?**
**Bayes' Theorem** is used to update the probability estimate for a hypothesis based on new evidence. It is significant because it allows for the incorporation of prior knowledge into statistical inference.

Formula:
\[
P(H|E) = \frac{P(E|H) \cdot P(H)}{P(E)}
\]
Where:
- \(P(H|E)\) is the posterior probability (probability of hypothesis given the evidence),
- \(P(E|H)\) is the likelihood (probability of evidence given the hypothesis),
- \(P(H)\) is the prior probability (initial probability of the hypothesis),
- \(P(E)\) is the marginal likelihood (probability of the evidence).
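
A worked example helps show why the prior matters. The sketch below uses made-up numbers for a diagnostic test (1% prevalence, 95% sensitivity, 5% false positive rate) and expands \(P(E)\) with the law of total probability; the function name `posterior` is illustrative.

```python
def posterior(prior, likelihood, false_positive_rate):
    """P(H|E) via Bayes' Theorem, with P(E) expanded over H and not-H."""
    # P(E) = P(E|H) * P(H) + P(E|not H) * P(not H)
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence

# H: the patient has the disease, E: the test is positive
p_disease_given_positive = posterior(prior=0.01, likelihood=0.95,
                                     false_positive_rate=0.05)
print(f"P(disease | positive test) = {p_disease_given_positive:.3f}")
```

Even with an accurate test, the posterior stays well below 50% here because the disease is rare, so most positive results come from the much larger healthy group.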

---

#### **What is the Chi-square distribution, and when is it used?**
The **Chi-square distribution** is used in hypothesis testing, particularly in tests of independence and goodness of fit. It is used to assess whether observed frequencies differ significantly from expected frequencies.
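
For a test of independence, `scipy.stats.chi2_contingency` computes the expected frequencies and the P-value from a contingency table. The 2×2 table below is made up for illustration; note that SciPy applies Yates' continuity correction to 2×2 tables by default.

```python
import numpy as np
from scipy import stats

# Rows: two groups, columns: two outcomes (made-up counts)
observed = np.array([[30, 10],
                     [20, 40]])

chi2_stat, p_value, dof, expected = stats.chi2_contingency(observed)

print(f"Chi-square statistic: {chi2_stat:.3f}")
print(f"Degrees of freedom: {dof}")
print(f"P-value: {p_value:.5f}")
print("Expected frequencies under independence:")
print(expected)
```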

---

#### **What is the Chi-square goodness of fit test, and how is it applied?**
The **Chi-square goodness of fit test** is used to determine whether sample data match an expected distribution. It compares the observed frequencies with the expected frequencies, and the resulting test statistic is compared against the Chi-square distribution.
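
A minimal sketch with `scipy.stats.chisquare`, assuming 60 made-up rolls of a die that is expected to be fair (10 rolls per face):

```python
from scipy import stats

observed = [8, 9, 12, 11, 10, 10]  # made-up counts for faces 1 to 6
expected = [10] * 6                # a fair die over 60 rolls

# Statistic is the sum of (observed - expected)^2 / expected over all categories
chi2_stat, p_value = stats.chisquare(f_obs=observed, f_exp=expected)

print(f"Chi-square statistic: {chi2_stat:.3f}")
print(f"P-value: {p_value:.3f}")

if p_value < 0.05:
    print("Reject the null hypothesis. The counts do not fit the expected distribution.")
else:
    print("Fail to reject the null hypothesis. The counts are consistent with a fair die.")
```

The observed and expected counts must sum to the same total, and the degrees of freedom equal the number of categories minus one.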

---

#### **What is the F-distribution, and when is it used in hypothesis testing?**
The **F-distribution** is used in tests that compare two variances, including the **ANOVA** test, which compares the variance between groups with the variance within groups. It is also used in regression analysis to assess model fit. The F-distribution is positively skewed and depends on two degrees of freedom (for the numerator and denominator).

---

#### **What is an ANOVA test, and what are its assumptions?**
The **ANOVA (Analysis of Variance)** test is used to compare the means of three or more groups to determine if at least one group mean is significantly different from the others. The assumptions include:
1. Normality: The data in each group should be normally distributed.
2. Independence: Observations within each group should be independent.
3. Homogeneity of variances: The variance within each group should be approximately equal.
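
A one-way ANOVA can be run with `scipy.stats.f_oneway`. The three groups below are made-up scores, and the sketch takes the assumptions listed above as given rather than checking them.

```python
from scipy import stats

# Made-up scores for three independent groups
group_a = [20, 21, 19, 22, 20]
group_b = [24, 25, 23, 26, 24]
group_c = [20, 22, 21, 19, 21]

# F is the ratio of between-group variance to within-group variance
f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)

print(f"F-statistic: {f_stat:.3f}")
print(f"P-value: {p_value:.5f}")

if p_value < 0.05:
    print("Reject the null hypothesis. At least one group mean differs.")
else:
    print("Fail to reject the null hypothesis. No significant difference in group means.")
```

A significant result says only that some mean differs; identifying which one requires a follow-up (post hoc) comparison.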

---

#### **What are the different types of ANOVA tests?**
- **One-way ANOVA**: Compares means across three or more groups based on one factor.
- **Two-way ANOVA**: Compares means across groups based on two factors and examines the interaction between them.
- **Repeated measures ANOVA**: Used when the same subjects are measured multiple times.

---

#### **What is the F-test, and how does it relate to hypothesis testing?**
The **F-test** is used to compare two variances to determine if they are significantly different. It is often used in the context of ANOVA and regression analysis.
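
A variance-ratio F-test can be computed directly from the F-distribution. This is a sketch under the assumption that both samples are independent and roughly normal (the test is sensitive to departures from normality); the data are made up.

```python
import numpy as np
from scipy import stats

group_a = np.array([10, 12, 9, 11, 13, 8, 14, 10], dtype=float)
group_b = np.array([10.5, 11.0, 10.0, 11.5, 10.8, 10.2, 11.2, 10.6])

# Sample variances (ddof=1) and their ratio
var_a = group_a.var(ddof=1)
var_b = group_b.var(ddof=1)
f_stat = var_a / var_b
df_a, df_b = len(group_a) - 1, len(group_b) - 1

# Two-tailed P-value: double the smaller tail area of the F-distribution
tail = min(stats.f.cdf(f_stat, df_a, df_b), stats.f.sf(f_stat, df_a, df_b))
p_value = min(1.0, 2 * tail)

print(f"Variance A: {var_a:.3f}, Variance B: {var_b:.3f}")
print(f"F-statistic: {f_stat:.3f}")
print(f"P-value: {p_value:.5f}")
```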

---

# **Practical**

---

#### **Write a Python program to perform a Z-test for comparing a sample mean to a known population mean and interpret the results**

```python
import numpy as np
import scipy.stats as stats

# Sample data
sample_data = np.random.normal(50, 10, 100)  # mean=50, std=10, sample size=100

# Parameters
population_mean = 50
sample_mean = np.mean(sample_data)
sample_std = np.std(sample_data, ddof=1)
n = len(sample_data)

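# Z-statistic calculation (the sample standard deviation stands in for the
# population value, a large-sample approximation)
z_stat = (sample_mean - population_mean) / (sample_std / np.sqrt(n))
# Two-tailed p-value from the standard normal distribution
p_value = 2 * stats.norm.sf(abs(z_stat))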

print(f'Z-statistic: {z_stat}')
print(f'P-value: {p_value}')

if p_value < 0.05:
    print("Reject the null hypothesis.")
else:
    print("Fail to reject the null hypothesis.")
```

---



#### **Simulate random data to perform hypothesis testing and calculate the corresponding P-value using Python**

```python
import numpy as np
from scipy import stats

# Simulate random sample data
np.random.seed(42)
sample_data = np.random.normal(100, 15, 50)  # mean=100, std=15, sample size=50

# Known population mean
population_mean = 105

# Perform one-sample t-test
t_stat, p_value = stats.ttest_1samp(sample_data, population_mean)

print(f"T-statistic: {t_stat}")
print(f"P-value: {p_value}")

# Interpretation of the result
if p_value < 0.05:
    print("Reject the null hypothesis. The sample mean is significantly different from the population mean.")
else:
    print("Fail to reject the null hypothesis. The sample mean is not significantly different from the population mean.")
```

In this example, we simulate a random sample of 50 data points from a normal distribution with a mean of 100 and a standard deviation of 15. We then perform a one-sample t-test comparing the sample mean to a population mean of 105. The p-value helps us decide whether the sample mean is significantly different from the population mean.

---

#### **Implement a one-sample Z-test using Python to compare the sample mean with the population mean**

```python
import numpy as np
from scipy import stats

# Sample data
sample_data = np.random.normal(50, 12, 100)  # mean=50, std=12, sample size=100

# Known population parameters
population_mean = 50
population_std = 12  # Known standard deviation

# Sample mean and sample size
sample_mean = np.mean(sample_data)
n = len(sample_data)

# Z-test calculation
z_stat = (sample_mean - population_mean) / (population_std / np.sqrt(n))
p_value = 2 * (1 - stats.norm.cdf(abs(z_stat)))

print(f"Z-statistic: {z_stat}")
print(f"P-value: {p_value}")

# Interpretation of the result
if p_value < 0.05:
    print("Reject the null hypothesis. The sample mean is significantly different from the population mean.")
else:
    print("Fail to reject the null hypothesis. The sample mean is not significantly different from the population mean.")
```

This Python script performs a one-sample Z-test, comparing the sample mean to a population mean, where the population standard deviation is known.

---

#### **Perform a two-tailed Z-test using Python and visualize the decision region on a plot**

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Simulate random data
np.random.seed(42)
sample_data = np.random.normal(50, 10, 100)  # mean=50, std=10, sample size=100

# Known population mean
population_mean = 50

# Parameters for the Z-test
sample_mean = np.mean(sample_data)
sample_std = np.std(sample_data, ddof=1)
n = len(sample_data)

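# Z-statistic calculation (the sample standard deviation stands in for the
# population value, a large-sample approximation)
z_stat = (sample_mean - population_mean) / (sample_std / np.sqrt(n))
# Two-tailed p-value from the standard normal distribution
p_value = 2 * stats.norm.sf(abs(z_stat))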

# Plot the Z-distribution with the rejection region
x = np.linspace(-4, 4, 1000)
y = stats.norm.pdf(x, 0, 1)
plt.plot(x, y)
plt.fill_between(x, 0, y, where=(x < -1.96) | (x > 1.96), color='red', alpha=0.5)
plt.axvline(x=-1.96, color='red', linestyle='dashed')
plt.axvline(x=1.96, color='red', linestyle='dashed')
plt.title("Two-tailed Z-test Decision Region")
plt.show()

print(f"Z-statistic: {z_stat}")
print(f"P-value: {p_value}")
```

In this script, we conduct a two-tailed Z-test and plot the rejection region (areas outside the ±1.96 Z-scores, corresponding to α = 0.05) on the standard normal curve. The red shaded area represents the rejection region; the null hypothesis is rejected if the printed Z-statistic falls within it.

---

#### **Create a Python function that calculates and visualizes Type 1 and Type 2 errors during hypothesis testing**

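```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Function to calculate and plot Type 1 and Type 2 errors for a right-tailed Z-test
def plot_errors(pop_mean, alt_mean, pop_std, n, alpha=0.05):
    # Standard error of the sample mean
    se = pop_std / np.sqrt(n)

    # Critical value on the scale of the sample mean (reject H0 above it)
    critical_value = pop_mean + stats.norm.ppf(1 - alpha) * se

    # Type 1 error: area of the H0 distribution to the right of the critical value
    type_1_error = stats.norm.sf(critical_value, pop_mean, se)
    # Type 2 error: area of the H1 distribution to the left of the critical value
    type_2_error = stats.norm.cdf(critical_value, alt_mean, se)

    # Sampling distributions of the mean under H0 and H1
    x = np.linspace(min(pop_mean, alt_mean) - 4*se, max(pop_mean, alt_mean) + 4*se, 1000)
    y_null = stats.norm.pdf(x, pop_mean, se)
    y_alt = stats.norm.pdf(x, alt_mean, se)

    plt.plot(x, y_null, label="Sampling distribution under H0")
    plt.plot(x, y_alt, label="Sampling distribution under H1")

    # Shade both error regions
    plt.fill_between(x, 0, y_null, where=(x > critical_value), color='red', alpha=0.5, label="Type 1 Error Area")
    plt.fill_between(x, 0, y_alt, where=(x <= critical_value), color='blue', alpha=0.3, label="Type 2 Error Area")

    plt.axvline(critical_value, color='red', linestyle='dashed', label="Critical Value")
    plt.legend()
    plt.title(f"Type 1 and Type 2 Errors (alpha = {alpha})")

    plt.show()

    print(f"Type 1 Error Area: {type_1_error:.4f}")
    print(f"Type 2 Error Area: {type_2_error:.4f}")

# Parameters
pop_mean = 50  # Mean under the null hypothesis
alt_mean = 53  # Assumed true mean under the alternative hypothesis
pop_std = 12
n = 100
alpha = 0.05

plot_errors(pop_mean, alt_mean, pop_std, n, alpha)
```

In this example, we plot the sampling distribution of the mean under the null and alternative hypotheses for a right-tailed test. The **Type 1 error** (red) is the probability of rejecting the null hypothesis when it is actually true, and the **Type 2 error** (blue) is the probability of failing to reject it when the alternative mean is the true mean. The **critical value** is calculated for the given significance level (α).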

---

#### **Write a Python program to perform an independent T-test and interpret the results**

```python
import numpy as np
from scipy import stats

# Sample data for two groups
group_1 = np.random.normal(50, 10, 30)  # mean=50, std=10, sample size=30
group_2 = np.random.normal(55, 12, 30)  # mean=55, std=12, sample size=30

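# Perform independent T-test (equal_var=False selects Welch's T-test, since the
# two groups are simulated with different standard deviations)
t_stat, p_value = stats.ttest_ind(group_1, group_2, equal_var=False)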

print(f"T-statistic: {t_stat}")
print(f"P-value: {p_value}")

# Interpretation of the result
if p_value < 0.05:
    print("Reject the null hypothesis. There is a significant difference between the two groups.")
else:
    print("Fail to reject the null hypothesis. There is no significant difference between the two groups.")
```

Here, we perform an independent T-test to compare the means of two independent groups. We then interpret the p-value to determine if the differences are statistically significant.

---

### **Further Practical Questions:**

---

#### **Perform a paired sample T-test using Python and visualize the comparison results**

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Sample data (before and after treatment)
before = np.random.normal(55, 8, 30)
after = before + np.random.normal(2, 5, 30)  # simulate improvement

# Perform paired T-test
t_stat, p_value = stats.ttest_rel(before, after)

print(f"T-statistic: {t_stat}")
print(f"P-value: {p_value}")

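# Visualizing the comparison using boxplot (tick labels are set with xticks
# because the boxplot `labels` keyword is deprecated in recent Matplotlib releases)
plt.boxplot([before, after])
plt.xticks([1, 2], ["Before", "After"])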
plt.title("Paired Sample T-test: Before vs After")
plt.show()

# Interpretation of the result
if p_value < 0.05:
    print("Reject the null hypothesis. The treatment has a significant effect.")
else:
    print("Fail to reject the null hypothesis. The treatment has no significant effect.")
```

This script performs a **paired sample T-test**, where the same subjects are measured before and after a treatment. A boxplot is used to visualize the comparison between the "before" and "after" groups.

---
