
# Statistics Part 2 - Theoretical Answers

## 1. What is hypothesis testing in statistics?
Hypothesis testing is a statistical method used to determine whether there is enough evidence in a sample of data to infer that a certain condition is true for the entire population. It helps in making decisions based on data rather than assumptions.

## 2. What is the null hypothesis, and how does it differ from the alternative hypothesis?
- **Null Hypothesis (H₀)**: It represents the default assumption that there is no effect, no difference, or no relationship in the population.
- **Alternative Hypothesis (H₁ or Ha)**: It contradicts the null hypothesis and represents the effect, difference, or relationship that we want to test.

Example:  
H₀: The new drug has no effect on blood pressure.  
H₁: The new drug affects blood pressure.

## 3. What is the significance level in hypothesis testing, and why is it important?
The significance level (denoted as **α**) is the probability of rejecting the null hypothesis when it is actually true. Common values are 0.05 (5%) or 0.01 (1%). A lower α reduces the risk of false positives but may increase false negatives.

## 4. What does a P-value represent in hypothesis testing?
The **P-value** measures the probability of obtaining a test result at least as extreme as the observed data, assuming the null hypothesis is true.

- **Small P-value (≤ α)**: Strong evidence against H₀, leading to its rejection.
- **Large P-value (> α)**: Weak evidence against H₀, so we fail to reject it.

## 5. How do you interpret the P-value in hypothesis testing?
- **P-value ≤ α (commonly 0.05)**: Strong evidence against the null hypothesis → Reject H₀.
- **P-value > α**: Insufficient evidence to reject H₀ → Fail to reject H₀.

Example: If a test results in **P = 0.03**, we reject H₀ at a 5% significance level.
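
The decision rule above can be sketched in a few lines. This is a minimal illustration, not a definitive implementation: the function names `two_tailed_p_value` and `decide` and the test statistic 2.17 are assumptions made for the example, and the standard library's `statistics.NormalDist` is used only to keep the sketch dependency-free.

```python
from statistics import NormalDist


def two_tailed_p_value(z):
    """Two-tailed P-value for a z statistic under the standard normal."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


def decide(p_value, alpha=0.05):
    """Reject H0 when the P-value is at most alpha."""
    return "Reject H0" if p_value <= alpha else "Fail to reject H0"


z = 2.17  # hypothetical test statistic
p = two_tailed_p_value(z)
print(f"p = {p:.4f} -> {decide(p)}")
```

With this statistic the P-value is close to 0.03, so H₀ is rejected at the 5% level but would not be rejected at the 1% level.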

## 6. What are Type 1 and Type 2 errors in hypothesis testing?
- **Type 1 Error (False Positive)**: Rejecting H₀ when it is actually true. (False alarm)
- **Type 2 Error (False Negative)**: Failing to reject H₀ when it is actually false. (Missed detection)

Example:  
- Type 1: A healthy patient is diagnosed with a disease.
- Type 2: A sick patient is not diagnosed with the disease.
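
A small simulation can make the Type 1 error rate concrete. The sketch below assumes a two-tailed Z-test with a known population standard deviation; the function name `type1_error_rate`, the population parameters (mean 50, standard deviation 5), and the seed are all illustrative choices. Because H₀ is true in every simulated sample, the share of rejections should land near α.

```python
import random
from statistics import NormalDist


def type1_error_rate(alpha=0.05, n=30, trials=2000, seed=0):
    """Estimate how often a two-tailed z-test rejects a true H0."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(trials):
        # H0 is true here: the data really come from N(50, 5)
        sample = [rng.gauss(50, 5) for _ in range(n)]
        sample_mean = sum(sample) / n
        z = (sample_mean - 50) / (5 / n ** 0.5)
        if abs(z) > z_crit:
            rejections += 1  # a false alarm (Type 1 error)
    return rejections / trials


rate = type1_error_rate()
print(f"Estimated Type 1 error rate: {rate:.3f}")
```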

## 7. What is the difference between a one-tailed and a two-tailed test in hypothesis testing?
- **One-tailed test**: Tests if a parameter is **either greater or less** than a certain value (but not both).
- **Two-tailed test**: Tests if a parameter is **different** (either greater or smaller) from a certain value.

Example:  
- One-tailed: "New medicine reduces blood pressure" (only one direction).
- Two-tailed: "New medicine affects blood pressure" (either increase or decrease).
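
The difference between the two kinds of test shows up directly in the P-value. The sketch below is an illustration under assumed numbers: the function name `z_p_values` and the statistic -1.8 are hypothetical.

```python
from statistics import NormalDist


def z_p_values(z):
    """Return (left-tailed, right-tailed, two-tailed) P-values for a z statistic."""
    cdf = NormalDist().cdf(z)
    left = cdf                  # H1: parameter is less than the reference value
    right = 1 - cdf             # H1: parameter is greater than the reference value
    two = 2 * min(left, right)  # H1: parameter differs in either direction
    return left, right, two


left, right, two = z_p_values(-1.8)  # hypothetical statistic: pressure went down
print(f"left-tailed: {left:.4f}, two-tailed: {two:.4f}")
```

For this statistic the left-tailed P-value falls below 0.05 while the two-tailed P-value does not, which is why the direction of the test should be fixed before looking at the data.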

## 8. What is the Z-test, and when is it used in hypothesis testing?
A **Z-test** is used to compare a sample mean to a known population mean when:
1. The population variance is known.
2. The sample size is large (n ≥ 30 is a common rule of thumb) or the population is normally distributed.

Example: Comparing the average test scores of students to a national average.

## 9. How do you calculate the Z-score, and what does it represent in hypothesis testing?
The **Z-score** formula:  
\[Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}
\]

Where:  
- \( \bar{X} \) = sample mean  
- \( \mu \) = population mean  
- \( \sigma \) = population standard deviation  
- \( n \) = sample size  

The Z-score represents how many standard errors (\( \sigma / \sqrt{n} \)) the sample mean is from the population mean.
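
A worked example of the formula, using made-up numbers: 36 students average 52 on a test whose national mean is 50 with a population standard deviation of 6. The function name `z_score` and all values are assumptions for illustration.

```python
import math


def z_score(sample_mean, population_mean, population_std, n):
    """Z statistic for a sample mean when the population std is known."""
    standard_error = population_std / math.sqrt(n)
    return (sample_mean - population_mean) / standard_error


# Standard error = 6 / sqrt(36) = 1, so Z = (52 - 50) / 1
z = z_score(sample_mean=52, population_mean=50, population_std=6, n=36)
print(z)  # 2.0
```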

## 10. What is the T-distribution, and when should it be used instead of the normal distribution?
The **T-distribution** is similar to the normal distribution but is used when:
1. The sample size is **small (n < 30)**.
2. The population standard deviation is **unknown**.

It has heavier tails than the normal distribution and approaches it as the sample size increases.
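
One way to see this convergence is to compare two-tailed critical values. The sketch below uses `scipy.stats.norm.ppf` and `scipy.stats.t.ppf`; the function name `critical_values` and the chosen sample sizes are illustrative assumptions.

```python
from scipy.stats import norm, t


def critical_values(confidence=0.95, sample_sizes=(5, 10, 30, 100, 1000)):
    """Two-tailed critical values of the t-distribution next to the normal one."""
    tail = (1 + confidence) / 2
    z_crit = norm.ppf(tail)
    t_crits = {n: t.ppf(tail, df=n - 1) for n in sample_sizes}
    return z_crit, t_crits


z_crit, t_crits = critical_values()
print(f"normal: {z_crit:.3f}")
for n, value in t_crits.items():
    print(f"n = {n:>4}: t critical = {value:.3f}")
```

The t critical value is noticeably larger than the normal one for small samples and shrinks toward it as n grows.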

## 11. What is the difference between a Z-test and a T-test?
| Z-test | T-test |
|--------|--------|
| Used when population variance is **known**. | Used when population variance is **unknown**. |
| Sample size is **large (n ≥ 30)**. | Typically used when sample size is **small (n < 30)**, though valid for any n. |
| Uses normal distribution. | Uses T-distribution. |

## 12. What is a confidence interval, and how is it used to interpret statistical results?
A **confidence interval (CI)** is a range of values that is likely to contain the true population parameter. It is calculated as:
\[CI = \text{sample mean} \pm (\text{critical value} \times \text{standard error})
\]

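For example, a 95% CI of [45, 55] means we are 95% confident that the population mean is between 45 and 55: if the sampling were repeated many times, about 95% of the intervals built this way would contain the true mean.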
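
The formula can be applied to summary statistics directly. This sketch assumes a t-based critical value (population standard deviation unknown); the function name `mean_confidence_interval` and the summary numbers (mean 50, sample standard deviation 12, n = 25) are hypothetical.

```python
import math

from scipy.stats import t


def mean_confidence_interval(sample_mean, sample_std, n, confidence=0.95):
    """t-based confidence interval for a mean from summary statistics."""
    critical_value = t.ppf((1 + confidence) / 2, df=n - 1)
    standard_error = sample_std / math.sqrt(n)
    margin = critical_value * standard_error
    return sample_mean - margin, sample_mean + margin


low, high = mean_confidence_interval(50, 12, 25)
print(f"95% CI: [{low:.2f}, {high:.2f}]")
```

With these inputs the interval comes out at roughly [45, 55], matching the example above.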

## 13. What is an ANOVA test, and what are its assumptions?
**ANOVA (Analysis of Variance)** tests whether there are significant differences among multiple group means.

**Assumptions of ANOVA:**  
1. **Normality** – Data within each group should be approximately normally distributed.  
2. **Independence** – Observations should be independent.  
3. **Equal Variance (Homogeneity of variance)** – Variances in different groups should be similar.
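
As a rough sketch of how this looks in practice, the example below runs Levene's test for equal variances and then a one-way ANOVA with `scipy.stats.levene` and `scipy.stats.f_oneway`. The three groups are made-up measurements chosen so that the group means clearly differ while the spreads are the same.

```python
from scipy.stats import f_oneway, levene

# Hypothetical measurements for three groups
group_a = [20, 21, 19, 22, 20]
group_b = [25, 27, 26, 24, 26]
group_c = [30, 29, 31, 32, 30]

# Equal-variance check: a large p-value gives no sign of unequal variances
levene_stat, levene_p = levene(group_a, group_b, group_c)

# One-way ANOVA: H0 says all group means are equal
f_stat, p_value = f_oneway(group_a, group_b, group_c)

print(f"Levene p = {levene_p:.3f}")
print(f"F = {f_stat:.1f}, p = {p_value:.2e}")
```

Independence is not checked here; it depends on how the data were collected rather than on a test statistic.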

## 14. What is the F-test, and how does it relate to hypothesis testing?
The **F-test** is used to compare two variances and check if they are significantly different.

- **H₀**: The variances are equal.
- **H₁**: The variances are different.

Formula:
\[F = \frac{\text{Variance of group 1}}{\text{Variance of group 2}}
\]
If \( F \) is far enough from 1 in either direction (beyond the critical values of the F-distribution), we reject the null hypothesis.
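
A minimal two-sided version of this test is sketched below, assuming both samples come from normal populations. The function name `two_sided_f_test` and the sample values are illustrative; the P-value doubles the smaller tail of the F-distribution so that both a very high and a very low ratio count as evidence against H₀.

```python
import numpy as np
from scipy.stats import f


def two_sided_f_test(sample1, sample2):
    """F-test for H0: equal variances against H1: different variances."""
    f_stat = np.var(sample1, ddof=1) / np.var(sample2, ddof=1)
    df1, df2 = len(sample1) - 1, len(sample2) - 1
    cdf = f.cdf(f_stat, df1, df2)
    # Double the smaller tail so both directions are covered
    p_value = min(1.0, 2 * min(cdf, 1 - cdf))
    return f_stat, p_value


# Hypothetical samples: same mean, very different spread
tight = [48, 50, 52, 49, 51, 50, 47, 53]   # sample variance 4
spread = [40, 60, 45, 55, 50, 35, 65, 50]  # sample variance 100

f_stat, p_value = two_sided_f_test(tight, spread)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```

Swapping the two samples inverts the F statistic but leaves the two-sided P-value unchanged.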

## 14. What is the margin of error, and how does it affect the confidence interval?
The margin of error (ME) is the half-width of a confidence interval: the critical value multiplied by the standard error. A larger sample size reduces ME, making the interval narrower and more precise.
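
The effect of sample size can be sketched as follows. The example assumes a known standard deviation of 10 and a normal (z) critical value; the function name `margin_of_error` here is a standalone illustration and the numbers are hypothetical.

```python
import math
from statistics import NormalDist


def margin_of_error(std, n, confidence=0.95):
    """Margin of error for a mean: critical value times standard error."""
    z_crit = NormalDist().inv_cdf((1 + confidence) / 2)
    return z_crit * std / math.sqrt(n)


for n in (25, 100, 400):
    print(f"n = {n:>3}: ME = {margin_of_error(10, n):.2f}")
```

Because the standard error scales with 1/√n, quadrupling the sample size halves the margin of error.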

## 15. How is Bayes' Theorem used in statistics, and what is its significance?
Bayes' Theorem helps update probabilities based on new evidence. Formula:

\[P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}
\]

It is used in spam filtering, medical diagnosis, and machine learning.
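
A medical-diagnosis example shows why the update matters. The sketch assumes a disease with 1% prevalence, a test that detects 95% of true cases, and a 5% false positive rate; the function name `posterior` and its parameter names are illustrative.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(disease | positive test) via Bayes' Theorem."""
    # P(B): total probability of a positive test
    evidence = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / evidence


p = posterior(prior=0.01, sensitivity=0.95, false_positive_rate=0.05)
print(f"P(disease | positive) = {p:.3f}")
```

Even with a fairly accurate test, the probability of disease after a positive result is only about 16%, because the disease is rare to begin with.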

## 16. What is the Chi-square distribution, and when is it used?
The Chi-square distribution is used to test relationships between categorical variables. It is used in:

- **Chi-square goodness-of-fit test**: checks whether data fits an expected distribution.
- **Chi-square test for independence**: tests whether two categorical variables are related.

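
A test for independence on a 2×2 table might look like the sketch below, which uses `scipy.stats.chi2_contingency`. The observed counts are made up for illustration. Note that SciPy applies Yates' continuity correction by default for 2×2 tables.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts: rows are two groups, columns are two outcomes
observed = np.array([[20, 30],
                     [10, 40]])

chi2, p_value, dof, expected = chi2_contingency(observed)

print(f"chi2 = {chi2:.3f}, dof = {dof}, p = {p_value:.4f}")
print("Expected counts under independence:")
print(expected)
```

The expected counts (15 and 35 in each row) are what the table would look like if group and outcome were unrelated; the statistic measures how far the observed counts are from them.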
## 17. What is an ANOVA test, and what are its assumptions?
ANOVA (Analysis of Variance) compares means across multiple groups. Assumptions:

- The data is normally distributed.
- The groups have equal variance.
- The samples are independent.

## 18. What are the different types of ANOVA tests?
- **One-way ANOVA** – Compares means across one independent variable.
- **Two-way ANOVA** – Compares means across two independent variables.
- **Repeated Measures ANOVA** – Compares means when the same subjects are tested multiple times.

## 19. What is the F-test, and how does it relate to hypothesis testing?
The F-test compares variances between two or more groups to test if they are significantly different. It is commonly used in ANOVA tests.

## 24. What are the different types of ANOVA tests?
- **One-way ANOVA** – Compares means of multiple groups for one independent variable.
- **Two-way ANOVA** – Compares means of groups across two independent variables.
- **Repeated Measures ANOVA** – Compares means when the same subjects are measured multiple times.


# Statistics Part 2 - Practical Solutions



import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
from scipy.stats import norm, t, chisquare, chi2_contingency, f_oneway
import pandas as pd

# Q1: Perform a Z-test

def z_test(sample, population_mean, population_std):
    sample_mean = np.mean(sample)
    n = len(sample)
    z_score = (sample_mean - population_mean) / (population_std / np.sqrt(n))
    p_value = 2 * (1 - norm.cdf(abs(z_score)))
    return z_score, p_value

# Q2: Hypothesis testing with random data

def simulate_hypothesis_test():
    sample_data = np.random.normal(loc=50, scale=5, size=30)
    t_stat, p_value = stats.ttest_1samp(sample_data, 50)
    return t_stat, p_value

# Q3: One-sample Z-test

def one_sample_z_test(sample, population_mean, population_std):
    return z_test(sample, population_mean, population_std)

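# Q4: Two-tailed Z-test with visualization

def two_tailed_z_test(sample, population_mean, population_std):
    z_score, p_value = z_test(sample, population_mean, population_std)
    x = np.linspace(-4, 4, 1000)
    y = norm.pdf(x, 0, 1)
    plt.plot(x, y, label="Standard normal")
    # Critical values for a two-tailed test at alpha = 0.05
    plt.axvline(x=-1.96, color='r', linestyle='--', label="Critical values (±1.96)")
    plt.axvline(x=1.96, color='r', linestyle='--')
    # Observed test statistic
    plt.axvline(x=z_score, color='g', label=f"Observed z = {z_score:.2f}")
    plt.legend()
    plt.show()
    return z_score, p_value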

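# Q5: Visualizing Type 1 & Type 2 errors

def visualize_type1_type2():
    x = np.linspace(-4, 5, 1000)
    y1, y2 = norm.pdf(x, 0, 1), norm.pdf(x, 1, 1)
    crit = norm.ppf(0.95)  # right-tailed critical value at alpha = 0.05
    plt.plot(x, y1, label="H0")
    plt.plot(x, y2, label="H1")
    # Type 1 error: area under H0 beyond the critical value
    plt.fill_between(x, y1, where=x >= crit, alpha=0.3, label="Type 1 (alpha)")
    # Type 2 error: area under H1 below the critical value
    plt.fill_between(x, y2, where=x < crit, alpha=0.3, label="Type 2 (beta)")
    plt.axvline(x=crit, color='k', linestyle='--')
    plt.legend()
    plt.show()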

# Q6: Independent T-test

def independent_t_test():
    group1, group2 = np.random.normal(50, 5, 30), np.random.normal(52, 5, 30)
    return stats.ttest_ind(group1, group2)

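# Q7: Paired T-test

def paired_t_test():
    # Paired data: each "after" value is the same subject's "before" value plus a shift and noise
    before = np.random.normal(50, 5, 30)
    after = before + np.random.normal(2, 2, 30)
    return stats.ttest_rel(before, after)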

# Q8: Compare Z-test and T-test

def compare_z_t_test():
    sample = np.random.normal(50, 5, 30)
    z_result = z_test(sample, 50, 5)
    t_stat, p_value = stats.ttest_1samp(sample, 50)
    return z_result, (t_stat, p_value)

# Q9: Confidence interval

def confidence_interval(sample, confidence=0.95):
    sample_mean, sample_std, n = np.mean(sample), np.std(sample, ddof=1), len(sample)
    t_score = t.ppf((1 + confidence) / 2, df=n-1)
    moe = t_score * (sample_std / np.sqrt(n))
    return (sample_mean - moe, sample_mean + moe)

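# Q10: Margin of error

def margin_of_error(sample, confidence=0.95):
    # The margin of error is the half-width of the confidence interval
    lower, upper = confidence_interval(sample, confidence)
    return (upper - lower) / 2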

# Q11: Bayesian inference

def bayes_theorem(prior_A, prob_B_given_A, prob_B_given_not_A):
    prob_not_A = 1 - prior_A
    prob_B = (prob_B_given_A * prior_A) + (prob_B_given_not_A * prob_not_A)
    return (prob_B_given_A * prior_A) / prob_B

# Q12: Chi-square test for independence

def chi_square_test():
    observed = np.array([[20, 30], [10, 40]])
    return chi2_contingency(observed)

# Q13: Expected frequencies for Chi-square test

def expected_frequencies():
    observed = np.array([[25, 35], [15, 25]])
    _, _, _, expected = chi2_contingency(observed)
    return expected

# Q14: Goodness-of-fit test

def goodness_of_fit():
    observed, expected = np.array([50, 30, 20]), np.array([40, 40, 20])
    return chisquare(observed, expected)

# Running all functions and saving results to a DataFrame
if __name__ == "__main__":
    sample_data = np.random.normal(50, 5, 30)
    results = {
        "Z-Test": z_test(sample_data, 50, 5),
        "Hypothesis Test": simulate_hypothesis_test(),
        "One-Sample Z-Test": one_sample_z_test(sample_data, 50, 5),
        "Two-Tailed Z-Test": two_tailed_z_test(sample_data, 50, 5),
        "Independent T-Test": independent_t_test(),
        "Paired T-Test": paired_t_test(),
        "Compare Z vs T-Test": compare_z_t_test(),
        "Confidence Interval": confidence_interval(sample_data),
        "Margin of Error": margin_of_error(sample_data),
        "Bayes' Theorem": bayes_theorem(0.01, 0.95, 0.05),
        "Chi-Square Test": chi_square_test(),
        "Expected Frequencies": expected_frequencies(),
        "Goodness-of-Fit Test": goodness_of_fit()
    }
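    # The results have different shapes (scalars, tuples, arrays), so store each
    # one as text in a single "Result" column instead of forcing two rows.
    df_results = pd.DataFrame(
        {"Result": [str(value) for value in results.values()]},
        index=list(results.keys()),
    )
    print(df_results)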


   

# Q15: Perform an F-test to compare variances of two samples
def f_test(sample1, sample2):
    f_stat = np.var(sample1, ddof=1) / np.var(sample2, ddof=1)
    df1, df2 = len(sample1) - 1, len(sample2) - 1
    # Two-sided p-value, matching H1: the variances are different
    cdf = stats.f.cdf(f_stat, df1, df2)
    p_value = min(1.0, 2 * min(cdf, 1 - cdf))
    return f_stat, p_value

# Q16: Perform an ANOVA test to compare means of multiple groups
def anova_test(*groups):
    return f_oneway(*groups)

# Q17: Perform a one-way ANOVA test and visualize results
def one_way_anova_visualized(*groups):
    stat, p_value = f_oneway(*groups)
    plt.boxplot(groups)
    plt.xticks(range(1, len(groups) + 1), [f'Group {i+1}' for i in range(len(groups))])
    plt.title("One-Way ANOVA Comparison")
    plt.show()
    return stat, p_value

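# Q18: Check ANOVA assumptions (normality, independence, equal variance)
def check_anova_assumptions(*groups):
    # Shapiro-Wilk p-value per group: a small p-value suggests non-normal data
    normality = [stats.shapiro(group)[1] for group in groups]
    # Levene p-value: a small p-value suggests unequal variances
    variances = stats.levene(*groups)[1]
    # Independence follows from the study design and is not tested here
    return normality, variances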

# Q19: Perform a two-way ANOVA test
def two_way_anova(df, factor1, factor2, response):
    import statsmodels.api as sm
    from statsmodels.formula.api import ols
    model = ols(f'{response} ~ C({factor1}) + C({factor2}) + C({factor1}):C({factor2})', data=df).fit()
    return sm.stats.anova_lm(model, typ=2)

# Q20: Visualize F-distribution
def visualize_f_distribution():
    x = np.linspace(0, 5, 1000)
    y = stats.f.pdf(x, dfn=10, dfd=20)
    plt.plot(x, y, label="F-Distribution")
    plt.fill_between(x, y, alpha=0.3)
    plt.legend()
    plt.show()

# Q21: Perform a hypothesis test for population variance
def chi_square_variance_test(sample, hypothesized_variance):
    n = len(sample)
    sample_var = np.var(sample, ddof=1)
    chi2_stat = (n - 1) * sample_var / hypothesized_variance
    # Right-tailed p-value: tests H1 that the population variance exceeds the hypothesized value
    p_value = 1 - stats.chi2.cdf(chi2_stat, df=n-1)
    return chi2_stat, p_value

# Q22: Perform a Z-test for comparing proportions
def z_test_proportions(p1, p2, n1, n2):
    p_pool = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = np.sqrt(p_pool * (1 - p_pool) * (1/n1 + 1/n2))
    z_score = (p1 - p2) / se
    p_value = 2 * (1 - norm.cdf(abs(z_score)))
    return z_score, p_value

# Q23: Implement an F-test for comparing variances
def f_test_variances(sample1, sample2):
    return f_test(sample1, sample2)

# Q24: Perform a Chi-square test for goodness of fit
def chi_square_goodness_of_fit(observed, expected):
    return chisquare(observed, expected)

# Q25: Simulate random data from a normal distribution and perform hypothesis testing
def simulate_and_test():
    sample_data = np.random.normal(50, 5, 30)
    return stats.ttest_1samp(sample_data, 50)

# Q26: Perform a two-way ANOVA test with visualization
def two_way_anova_visual(df, factor1, factor2, response):
    result = two_way_anova(df, factor1, factor2, response)
    result.plot(kind='bar')
    plt.title("Two-Way ANOVA Results")
    plt.show()
    return result

# Q27: Perform a one-way ANOVA test and visualize results with boxplots
def one_way_anova_boxplot(*groups):
    stat, p_value = f_oneway(*groups)
    plt.boxplot(groups)
    plt.xticks(range(1, len(groups) + 1), [f'Group {i+1}' for i in range(len(groups))])
    plt.title("One-Way ANOVA with Boxplots")
    plt.show()
    return stat, p_value

# Running all functions if executed as script
if __name__ == "__main__":
    sample1 = np.random.normal(50, 5, 30)
    sample2 = np.random.normal(52, 5, 30)
    results = {
        "F-Test": f_test(sample1, sample2),
        "One-Way ANOVA": one_way_anova_visualized(sample1, sample2),
        "ANOVA Assumptions": check_anova_assumptions(sample1, sample2),
        "Chi-Square Variance Test": chi_square_variance_test(sample1, 25),
        "Z-Test Proportions": z_test_proportions(0.5, 0.4, 100, 100),
        "Chi-Square Goodness of Fit": chi_square_goodness_of_fit([50, 30, 20], [40, 40, 20]),
        "Simulated Hypothesis Test": simulate_and_test()
    }
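    # Not every entry is a (statistic, p-value) pair (the ANOVA assumptions entry
    # holds p-values only), so store each result as text in one "Result" column.
    df_results = pd.DataFrame(
        {"Result": [str(value) for value in results.values()]},
        index=list(results.keys()),
    )
    print(df_results)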

