# Applied Statistics and Inference - Assignment (DA-AG-006)

---


## Question 1
**What are Type I and Type II errors in hypothesis testing, and how do they impact decision-making?**

**Answer (concise):**
- **Type I error (False Positive):** Rejecting the null hypothesis \(H_0\) when it is actually true. The probability of a Type I error is \(\alpha\) (the significance level).
- **Type II error (False Negative):** Failing to reject \(H_0\) when it is actually false, i.e., when the alternative hypothesis \(H_a\) is true. The probability of a Type II error is \(\beta\), and \(1-\beta\) is the power of the test.

**Impact on decision-making:**
- Choosing a smaller \(\alpha\) reduces the chance of Type I error but typically increases \(\beta\) (more Type II errors) unless sample size is increased.
- The trade-off matters: in medical testing, a Type I error may mean incorrectly declaring a treatment effective; a Type II error may miss a real beneficial effect. Choose \(\alpha\) and sample size according to the context and consequences.

**Tip for exams:** define both errors, write the symbols (\(\alpha,\beta\)), and give a short practical example.
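The trade-off between \(\alpha\) and \(\beta\) can be checked by simulation. The sketch below is illustrative only: the sample size, the known \(\sigma\), and the true mean under \(H_a\) (`true_shift`) are assumed values, not part of the assignment. It runs many two-sided Z-tests of \(H_0: \mu = 0\) and counts how often each kind of error occurs.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha = 0.05
n = 25            # sample size per simulated study (assumed for illustration)
sigma = 1.0       # population standard deviation, assumed known
true_shift = 0.5  # hypothetical true mean when H_a holds
n_sims = 20_000   # number of simulated studies

z_crit = stats.norm.ppf(1 - alpha / 2)  # two-sided critical value

def rejection_rate(true_mean):
    """Fraction of simulated studies in which H0: mu = 0 is rejected."""
    samples = rng.normal(loc=true_mean, scale=sigma, size=(n_sims, n))
    z = samples.mean(axis=1) / (sigma / np.sqrt(n))
    return np.mean(np.abs(z) > z_crit)

type1_rate = rejection_rate(0.0)    # H0 true: every rejection is a Type I error
power = rejection_rate(true_shift)  # H_a true: every rejection is correct
type2_rate = 1 - power              # H_a true: every non-rejection is a Type II error

print(f"Estimated Type I error rate: {type1_rate:.3f} (target alpha = {alpha})")
print(f"Estimated Type II error rate: {type2_rate:.3f} (power = {power:.3f})")
```

Re-running with a smaller `alpha` should raise the estimated Type II rate, while a larger `n` should lower it.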


## Question 2
**What is the P-value in hypothesis testing, and how should it be interpreted in the context of the null hypothesis?**

**Answer (concise):**
- The **p-value** is the probability of obtaining a test statistic at least as extreme as the one actually observed, assuming the null hypothesis \(H_0\) is true.
- **Interpretation:** A small p-value (below the chosen \(\alpha\)) indicates that the observed data would be unlikely if \(H_0\) were true, which is evidence against \(H_0\) and leads us to reject it. A large p-value means the data are consistent with \(H_0\), so we do not reject it; this is not proof that \(H_0\) is true.

**Important:** The p-value is _not_ the probability that \(H_0\) is true, nor the probability that the result occurred by chance alone. Always report the p-value together with the chosen \(\alpha\) when stating conclusions.
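The definition can be made concrete with a short sketch. The observed statistic `z_obs = 2.1` is a hypothetical value chosen for illustration. The analytical two-sided p-value is compared with a Monte Carlo estimate: the fraction of Z statistics drawn under \(H_0\) that are at least as extreme as `z_obs`.

```python
import numpy as np
from scipy import stats

z_obs = 2.1   # hypothetical observed Z statistic (illustrative value)
alpha = 0.05

# Analytical two-sided p-value: P(|Z| >= |z_obs|) when H0 is true
p_two_sided = 2 * stats.norm.sf(abs(z_obs))

# Monte Carlo check of the same definition: draw Z statistics under H0
rng = np.random.default_rng(0)
z_null = rng.standard_normal(200_000)
p_simulated = np.mean(np.abs(z_null) >= abs(z_obs))

print(f"Analytical p-value: {p_two_sided:.4f}")
print(f"Simulated p-value:  {p_simulated:.4f}")
print("Reject H0" if p_two_sided < alpha else "Do not reject H0")
```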


## Question 3
**Explain the difference between a Z-test and a T-test, including when to use each.**

**Answer (concise):**
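- **Z-test:** used when the population standard deviation \(\sigma\) is known and the sample mean is (approximately) normally distributed, either because the population is normal or because the sample is large (commonly \(n\geq30\)). The test statistic follows the standard normal distribution.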
- **T-test:** used when \(\sigma\) is unknown and must be estimated from the sample (use sample standard deviation \(s\)). The test statistic follows Student's t-distribution with \(n-1\) degrees of freedom; for small samples the t-distribution accounts for extra uncertainty.

**Practical rule:** Use a t-test in most real situations because \(\sigma\) is rarely known. As the sample size grows, the t-distribution approaches the standard normal, so the two tests give nearly identical results.
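A quick numerical check of this convergence, as a sketch: the two-sided 95% critical values of the t-distribution are compared with the z critical value for a few illustrative sample sizes.

```python
from scipy import stats

conf = 0.95
upper_prob = 1 - (1 - conf) / 2          # 0.975 for a two-sided 95% test
z_crit = stats.norm.ppf(upper_prob)

# t critical values for one-sample tests with df = n - 1
t_crits = {n: stats.t.ppf(upper_prob, df=n - 1) for n in (5, 10, 30, 100, 1000)}

print(f"z critical value: {z_crit:.3f}")
for n, t_crit in t_crits.items():
    print(f"n = {n:4d}: t critical = {t_crit:.3f}, excess over z = {t_crit - z_crit:.3f}")
```

The excess over the z value is large for small \(n\) and becomes negligible once \(n\) reaches the hundreds.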


## Question 4
**What is a confidence interval, and how does the margin of error influence its width and interpretation?**

**Answer (concise):**
- A **confidence interval (CI)** for a parameter (e.g., population mean) is a range constructed from sample data that, under repeated sampling, contains the true parameter a specified proportion of the time (confidence level, e.g., 95%).
- **Margin of error (ME)** is half the width of the CI. For a mean: \(\text{ME} = z_{\alpha/2} \cdot \dfrac{s}{\sqrt{n}}\) (or use t critical value when appropriate).

**Influence:** Larger ME -> wider interval -> less precision. ME decreases with larger sample size \(n\) and smaller variability \(s\), and depends on chosen confidence level (higher level -> larger critical value -> larger ME).

**Interpretation tip:** A 95% CI does not mean there is a 95% probability the true parameter lies in the specific interval; it means the method produces intervals containing the true parameter in 95% of repeated samples.
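Both points can be illustrated with a small sketch. The helper `margin_of_error` is a hypothetical function written for this example (t-based, since \(\sigma\) is treated as unknown), and the population parameters in the coverage check are assumed values. The first part shows how ME changes with \(n\) and the confidence level; the second estimates how often a 95% interval contains the true mean under repeated sampling.

```python
import numpy as np
from scipy import stats

def margin_of_error(s, n, conf=0.95):
    """t-based margin of error for a mean, given sample std s and sample size n."""
    t_crit = stats.t.ppf(1 - (1 - conf) / 2, df=n - 1)
    return t_crit * s / np.sqrt(n)

# ME shrinks as n grows and widens with the confidence level (s = 10 is illustrative)
for n in (25, 100, 400):
    print(f"n = {n:3d}: ME(95%) = {margin_of_error(10, n):.3f}, "
          f"ME(99%) = {margin_of_error(10, n, conf=0.99):.3f}")

# Coverage check: how often does a 95% interval contain the true mean?
rng = np.random.default_rng(0)
true_mean, true_sd, n, n_sims = 50.0, 10.0, 25, 10_000  # assumed population
samples = rng.normal(true_mean, true_sd, size=(n_sims, n))
means = samples.mean(axis=1)
me = margin_of_error(samples.std(axis=1, ddof=1), n)   # one ME per simulated sample
coverage = np.mean((means - me <= true_mean) & (true_mean <= means + me))
print(f"Empirical coverage of 95% intervals: {coverage:.3f}")
```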


## Question 5
**Describe the purpose and assumptions of an ANOVA test. How does it extend hypothesis testing to more than two groups?**

**Answer (concise):**
- **Purpose:** ANOVA (Analysis of Variance) tests whether three or more group means are equal by comparing between-group variability to within-group variability.
- **Assumptions:** independence of observations, normally distributed residuals within each group, and equal variances (homoscedasticity) across groups.
- **Extension:** Instead of multiple pairwise t-tests (which increase Type I error), ANOVA tests a single null hypothesis:
  \(H_0:\ \mu_1 = \mu_2 = \dots = \mu_k\) against the alternative that at least one mean differs. If ANOVA is significant, follow-up pairwise tests or post-hoc procedures (e.g., Tukey) identify which groups differ.
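As a sketch of the idea, the example below uses three small hypothetical groups (the values are invented for illustration). It computes the F statistic by hand from the between-group and within-group sums of squares, then confirms it with `scipy.stats.f_oneway`. The last line approximates the inflated family-wise Type I error of running three pairwise tests, treating them as independent for simplicity.

```python
import numpy as np
from scipy import stats

# Hypothetical measurements for three groups (illustrative values only)
group_a = np.array([23, 25, 21, 24, 22])
group_b = np.array([27, 29, 26, 28, 30])
group_c = np.array([22, 24, 23, 21, 25])
groups = [group_a, group_b, group_c]

# Manual F statistic: between-group mean square over within-group mean square
all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k, n_total = len(groups), all_values.size
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))

# Same test via SciPy
f_stat, p_value = stats.f_oneway(*groups)
print(f"Manual F = {f_manual:.3f}, SciPy F = {f_stat:.3f}, p-value = {p_value:.5f}")

# Family-wise error rate for three pairwise tests at alpha = 0.05
# (approximation that treats the tests as independent)
alpha, n_pairs = 0.05, 3
fwer = 1 - (1 - alpha) ** n_pairs
print(f"Approximate family-wise error rate: {fwer:.3f}")
```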


## Question 6
**Write a Python program to perform a one-sample Z-test and interpret the result for a given dataset.**

**Approach:**
- Use the formula for the one-sample Z-statistic: \(Z = (\bar{x} - \mu_0)/(\sigma/\sqrt{n})\), where \(\mu_0\) is the null mean and \(\sigma\) is the population standard deviation (assumed known).
- Calculate p-value from the standard normal distribution.

**Example dataset and code below.**


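```python
# Question 6 - One-sample Z-test example
import math

import numpy as np
from scipy import stats

# Example data (sample of size 10)
data = np.array([52, 49, 47, 55, 50, 53, 48, 51, 54, 49])
n = len(data)
xbar = data.mean()

# Null hypothesis H0: mu = 50; population sigma is assumed known (sigma = 3.0)
mu0 = 50.0
sigma = 3.0
alpha = 0.05  # significance level

# Z statistic
z_stat = (xbar - mu0) / (sigma / math.sqrt(n))
# Two-sided p-value; norm.sf gives the upper-tail probability 1 - cdf
p_value = 2 * stats.norm.sf(abs(z_stat))

print(f'Sample mean = {xbar:.2f}, Z = {z_stat:.4f}, p-value = {p_value:.4f}')
if p_value < alpha:
    print('Result: Reject H0; the sample mean differs significantly from 50.')
else:
    print('Result: Do not reject H0; the data are consistent with a mean of 50.')
```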


## Question 7
**Simulate a dataset from a binomial distribution (n = 10, p = 0.5) using NumPy and plot the histogram.**

**Approach:** Use `numpy.random.binomial` to generate many observations and plot a frequency histogram.


```python
# Question 7 - Binomial simulation and histogram
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(1)
trials = 10
p = 0.5
# Simulate 1000 observations of the number of successes in 10 trials
samples = np.random.binomial(n=trials, p=p, size=1000)

# Frequency table (optional)
values, counts = np.unique(samples, return_counts=True)
print('Value : Count')
for v, c in zip(values, counts):
    print(f'{v:2d} : {c:3d}')

# Histogram with one unit-width bin centred on each possible outcome 0..10
plt.figure(figsize=(7, 4))
plt.hist(samples, bins=np.arange(-0.5, trials + 1.5), rwidth=0.8, edgecolor='black')
plt.title('Histogram of Binomial(n=10, p=0.5) - 1000 samples')
plt.xlabel('Number of successes')
plt.ylabel('Frequency')
plt.xticks(range(0, trials + 1))
plt.show()
```


## Question 8
**Generate multiple samples from a non-normal distribution and implement the Central Limit Theorem using Python.**

**Approach:**
- Draw many samples (here, 2000 samples) of size n from an exponential distribution (which is skewed).
- Compute the sample means and show that their distribution approaches normal as n increases.

We will demonstrate for n=5 and n=50 and compare histograms.


```python
# Question 8 - CLT demonstration with Exponential distribution
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(2)
pop_scale = 1.0  # exponential scale parameter (mean = 1, std = 1)
num_samples = 2000

def sample_means(n):
    """Draw num_samples samples of size n and return their means."""
    samples = np.random.exponential(scale=pop_scale, size=(num_samples, n))
    return samples.mean(axis=1)

means_by_n = {n: sample_means(n) for n in (5, 50)}

for n, means in means_by_n.items():
    # The CLT predicts mean = pop_scale and std = pop_scale / sqrt(n)
    print(f'n={n}: mean of sample means = {means.mean():.4f}, '
          f'std = {means.std(ddof=1):.4f} (theory: {pop_scale / np.sqrt(n):.4f})')

for n, means in means_by_n.items():
    plt.figure(figsize=(6, 3))
    plt.hist(means, bins=30, edgecolor='black')
    plt.title(f'Distribution of Sample Means (n={n}) - Exponential population')
    plt.xlabel('Sample mean')
    plt.ylabel('Frequency')
    plt.show()
```
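The approach to normality can also be quantified rather than judged by eye. The sketch below is an addition to the histogram comparison, with an assumed larger number of replicates (20,000 rather than 2000) to stabilise the estimates. It compares the standard deviation and skewness of the sample means with their theoretical values for an exponential population with scale 1: \(1/\sqrt{n}\) and \(2/\sqrt{n}\) respectively.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
num_samples = 20_000  # assumed replicate count, larger than in the demonstration

summary = {}
for n in (5, 50):
    means = rng.exponential(scale=1.0, size=(num_samples, n)).mean(axis=1)
    summary[n] = {
        "std": means.std(ddof=1),
        "std_theory": 1.0 / np.sqrt(n),   # sigma / sqrt(n) with sigma = 1
        "skew": stats.skew(means),
        "skew_theory": 2.0 / np.sqrt(n),  # skewness of a mean of n exponentials
    }
    s = summary[n]
    print(f"n = {n:2d}: std = {s['std']:.3f} (theory {s['std_theory']:.3f}), "
          f"skewness = {s['skew']:.3f} (theory {s['skew_theory']:.3f})")
```

A normal distribution has zero skewness, so the drop in skewness from n=5 to n=50 is a numerical sign of the convergence the histograms show.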


## Question 9
**Write a Python function to calculate and visualize the confidence interval for a sample mean.**

**Approach:**
- Function accepts sample data, confidence level, and whether population sigma is known.
- Computes CI using z (if sigma known) or t (if sigma unknown).
- Visualizes the mean with error bar.


```python
# Question 9 - Function to compute and plot CI for sample mean
import math

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

def ci_for_mean(data, conf=0.95, sigma_known=False, pop_sigma=None):
    """Return the (lower, upper) CI for the mean and plot it as an error bar."""
    data = np.asarray(data, dtype=float)
    n = len(data)
    xbar = data.mean()
    upper_prob = 1 - (1 - conf) / 2
    if sigma_known:
        if pop_sigma is None:
            raise ValueError('pop_sigma is required when sigma_known=True')
        # sigma known: z-based interval
        se = pop_sigma / math.sqrt(n)
        critical = stats.norm.ppf(upper_prob)
        dist = 'z'
    else:
        # sigma unknown: estimate it with s and use the t-distribution
        se = data.std(ddof=1) / math.sqrt(n)
        critical = stats.t.ppf(upper_prob, df=n - 1)
        dist = 't'
    margin = critical * se
    lower = xbar - margin
    upper = xbar + margin

    # Print and plot
    print(f'n={n}, method={dist}-based, mean={xbar:.4f}, CI=({lower:.4f}, {upper:.4f})')
    plt.figure(figsize=(6, 1.8))
    plt.errorbar(0, xbar, yerr=margin, fmt='o', capsize=8)
    plt.xlim(-1, 1)
    plt.xticks([])
    plt.title(f'{conf:.0%} CI for sample mean (n={n})')
    plt.ylabel('Mean value')
    plt.show()
    return (lower, upper)

# Example usage
sample_data = [12, 15, 14, 16, 13, 14, 15, 17, 11, 14]
ci_for_mean(sample_data, conf=0.95, sigma_known=False)
```


## Question 10
**Perform a Chi-square goodness-of-fit test using Python to compare observed and expected distributions, and explain the outcome.**

**Approach:**
- Given observed category counts and expected probabilities, compute expected counts and apply chi-square test: \(\chi^2 = \sum (O_i - E_i)^2/E_i\).
- Use `scipy.stats.chisquare` to compute statistic and p-value.


```python
# Question 10 - Chi-square goodness-of-fit example
import numpy as np
from scipy import stats

# Example: observed counts of dice rolls (6 faces) from an experiment
observed = np.array([18, 22, 17, 15, 20, 8])  # total 100 rolls
# If the die is fair, expected probabilities are equal
expected_probs = np.array([1/6] * 6)
expected_counts = expected_probs * observed.sum()

chi2_stat, p_val = stats.chisquare(f_obs=observed, f_exp=expected_counts)

print('Observed counts:', observed)
print('Expected counts (fair die):', np.round(expected_counts, 2))
print(f'Chi-square stat = {chi2_stat:.4f}, p-value = {p_val:.4f}')

if p_val < 0.05:
    print('Result: Reject the null hypothesis of a fair die (significant difference).')
else:
    print('Result: Do not reject the null hypothesis (insufficient evidence against fairness).')
```
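To explain the outcome, it helps to reproduce the statistic by hand. The sketch below applies \(\chi^2 = \sum (O_i - E_i)^2/E_i\) directly to the same observed counts, uses \(k-1 = 5\) degrees of freedom, and compares the statistic with the 5% critical value. The variable names are chosen for this illustration.

```python
import numpy as np
from scipy import stats

observed = np.array([18, 22, 17, 15, 20, 8])  # same counts as in the example
expected = np.full(observed.size, observed.sum() / observed.size)  # fair die

# Chi-square statistic computed directly from the formula
chi2_manual = ((observed - expected) ** 2 / expected).sum()
df = observed.size - 1                         # k - 1 degrees of freedom
p_manual = stats.chi2.sf(chi2_manual, df)      # upper-tail probability
critical = stats.chi2.ppf(0.95, df)            # 5% critical value

print(f"Chi-square = {chi2_manual:.4f}, df = {df}, p-value = {p_manual:.4f}")
print(f"5% critical value = {critical:.3f}")
print("Statistic exceeds critical value:", bool(chi2_manual > critical))
```

The statistic falls below the critical value, so the deviations from the expected counts, including the low count for face 6, are within what chance alone could produce for a fair die in 100 rolls.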


---

*End of assignment solutions.*