# Statistics Part 2

Q1 What is hypothesis testing in statistics?

Hypothesis testing is a statistical method used to make decisions or inferences about a population based on sample data. It helps determine whether there is enough evidence to support a certain claim or assumption.

Here's how it works:
1. **State the hypotheses**: You define two hypotheses—
   - **Null hypothesis (H₀)**: This represents the default assumption or status quo (e.g., "There is no difference between two treatments").
   - **Alternative hypothesis (H₁ or Hₐ)**: This suggests an effect or difference exists (e.g., "Treatment A is more effective than Treatment B").

2. **Choose a significance level (α)**: Typically, a threshold like 0.05 or 0.01 is set. This represents the probability of rejecting the null hypothesis when it is actually true (Type I error).

3. **Collect and analyze data**: A statistical test (e.g., t-test, chi-square test, ANOVA) is used to compare observed data against what would be expected under H₀.

4. **Compute the test statistic and p-value**: The test statistic measures how far the observed data fall from what H₀ predicts. The p-value is the probability of obtaining results at least as extreme as the observed ones if H₀ were true.

5. **Make a decision**:
   - If the p-value is smaller than α, you reject H₀, suggesting evidence supports H₁.
   - If the p-value is larger, you fail to reject H₀, meaning there isn’t enough evidence to support H₁.
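
The five steps above can be sketched in Python. This is only an illustration: the sample values, the hypothesized mean `mu_0`, and the choice of a one-sample t-test are assumptions made for the example, not part of the notes.

```python
from scipy import stats

# Step 1: H0: population mean = mu_0, H1: population mean != mu_0
# (hypothetical data and hypothesized mean, chosen for illustration)
sample = [52.1, 49.8, 53.4, 51.2, 50.9, 54.0, 52.7, 51.5]
mu_0 = 50.0

# Step 2: choose the significance level
alpha = 0.05

# Steps 3-4: run a one-sample t-test to get the test statistic and p-value
t_stat, p_value = stats.ttest_1samp(sample, popmean=mu_0)

# Step 5: apply the decision rule
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"t = {t_stat:.3f}, p = {p_value:.4f} -> {decision}")
```

For this particular sample the test rejects H₀; with a different sample or a different α the decision could change.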

Q2 What is the null hypothesis, and how does it differ from the alternative hypothesis?

The **null hypothesis (H₀)** is the default assumption that there is no effect, no difference, or no relationship between variables. It represents the status quo or what is assumed to be true unless there is enough evidence to conclude otherwise. For example, in a medical trial, H₀ could be: *"This new drug has no effect on blood pressure."*

The **alternative hypothesis (H₁ or Hₐ)**, on the other hand, proposes that there is an effect, a difference, or a relationship. It challenges the null hypothesis and suggests that something significant is happening. In the same medical trial example, H₁ could be: *"This new drug lowers blood pressure."*

Key Differences:
1. **Purpose**:  
   - H₀ assumes no change or relationship exists.  
   - H₁ suggests that there is a change or effect.  

2. **Decision Making**:  
   - If statistical evidence strongly contradicts H₀, it is rejected in favor of H₁.  
   - If evidence is insufficient, H₀ is not rejected, meaning the data do not provide enough evidence to support H₁.

3. **Risk of Error**:  
   - Rejecting H₀ when it is actually true is a **Type I error** (false positive).  
   - Failing to reject H₀ when H₁ is actually true is a **Type II error** (false negative).  

Q3 What is the significance level in hypothesis testing, and why is it important?

The **significance level (α)** in hypothesis testing is the probability of rejecting the null hypothesis (H₀) when it is actually true. It sets a threshold for determining whether the observed results are statistically significant.

Why Is It Important?
1. **Controls Error Risk**: Since hypothesis testing involves probability, there’s always a chance of making a mistake. Setting a significance level helps manage the risk of **Type I error** (incorrectly rejecting H₀ when it’s true).
2. **Defines Statistical Confidence**: A lower α (e.g., 0.01) demands stronger evidence before rejecting H₀, while a higher α (e.g., 0.10) allows more flexibility in making conclusions.
3. **Standardization in Research**: Common values like **0.05 (5%)** or **0.01 (1%)** are widely used to ensure consistency across scientific studies and data analysis.
4. **Influences Decision Making**: If the **p-value** (computed probability) is **less than α**, we reject H₀, suggesting strong evidence for the alternative hypothesis (H₁). If **p ≥ α**, we fail to reject H₀.

Q4 What does a p-value represent in hypothesis testing?

The **p-value** in hypothesis testing represents the probability of obtaining results at least as extreme as the observed ones, assuming the **null hypothesis (H₀)** is true. It helps assess whether the observed data provides strong enough evidence to reject H₀ in favor of the **alternative hypothesis (H₁).**

How to interpret a p-value:
- **Small p-value (≤ α, such as 0.05 or 0.01)**: Strong evidence against H₀ → **Reject H₀** (supporting H₁).
- **Large p-value (> α)**: Not enough evidence to reject H₀ → **Fail to reject H₀** (no strong support for H₁).

Example:

Imagine a study tests whether a new fertilizer improves plant growth.
- **H₀**: The fertilizer has no effect.
- **H₁**: The fertilizer increases plant growth.
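- Researchers set **α = 0.05** and obtain **p = 0.02**.

Because p = 0.02 is below α = 0.05, the researchers reject H₀ and conclude that the data support the claim that the fertilizer increases plant growth.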
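
As a rough illustration of where such a p-value comes from, the sketch below converts a test statistic into a right-tailed p-value using the standard normal distribution. The value `z_observed = 2.05` is hypothetical; it was picked only because it gives a p-value close to 0.02.

```python
from statistics import NormalDist

alpha = 0.05
z_observed = 2.05  # hypothetical test statistic, not taken from the example

# Right-tailed p-value: P(Z >= z_observed) when H0 is true
p_value = 1 - NormalDist().cdf(z_observed)

reject_h0 = p_value < alpha
print(f"p = {p_value:.4f}, reject H0: {reject_h0}")
```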

Q5 How do you interpret the p-value in hypothesis testing?

Interpreting the **p-value** in hypothesis testing helps determine whether the observed data provides enough evidence to reject the **null hypothesis (H₀)** in favor of the **alternative hypothesis (H₁).** Here's how to approach it:

 **Interpreting the p-value:**
1. **Low p-value (≤ α, e.g., 0.05 or 0.01)**  
   - Suggests that the observed data is unlikely to have occurred under H₀.  
   - Provides strong evidence **against** H₀ → **Reject H₀** → Supports H₁.  

2. **High p-value (> α)**  
   - Means the data are consistent with H₀, so there is not enough evidence to reject it (this does not prove that H₀ is true).  
   - **Fail to reject H₀** → Not enough support for H₁.  

 **Example Interpretation:**  
Imagine researchers test whether a new teaching method improves student performance.  
- **H₀**: The teaching method has no effect.  
- **H₁**: The teaching method improves performance.  
- **Significance level (α) = 0.05**  
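- **Obtained p-value = 0.02**  

Since 0.02 < 0.05, H₀ is rejected: the data provide evidence that the teaching method improves performance.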

Q6 What are Type I and Type II errors in hypothesis testing?

In hypothesis testing, **Type I and Type II errors** refer to mistakes in the decision-making process when evaluating a hypothesis. These errors occur because hypothesis testing is based on probabilities, meaning there's always a chance of making the wrong conclusion.

 **Type I Error (False Positive)**
- Occurs when the **null hypothesis (H₀) is rejected when it is actually true**.
- Essentially, you detect an effect or difference when none exists.
- Probability of making a Type I error is **α** (the significance level, e.g., 0.05 or 0.01).

**Example:**  
A medical test wrongly detects a disease in a healthy person.

 **Type II Error (False Negative)**
- Happens when **H₀ is not rejected when the alternative hypothesis (H₁) is actually true**.
- You fail to detect a real effect or difference.
- Probability of making a Type II error is **β**, and (1 - β) represents the **power** of the test (the ability to detect real effects).

**Example:**  
A medical test fails to detect a disease in a sick person.

 **Balancing Errors**
- For a fixed sample size, lowering **α** reduces Type I errors but increases the risk of Type II errors.
- Increasing the sample size lowers the Type II error rate at a given **α**, so both error risks can be kept small, leading to more reliable conclusions.
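
The trade-off can be made concrete with a small power calculation. The sketch below assumes a right-tailed one-sample Z-test with known σ; the function name and the numbers (true effect of 2 units, σ = 10) are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def power_right_tailed_z(effect, sigma, n, alpha):
    """Power (1 - beta) of a right-tailed one-sample Z-test.

    effect is the true mean minus the mean stated by H0 (assumed positive).
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha)   # rejection threshold under H0
    shift = effect * sqrt(n) / sigma         # how far H1 moves the statistic
    return 1 - std_normal.cdf(z_crit - shift)

for alpha in (0.05, 0.01):
    for n in (25, 100):
        power = power_right_tailed_z(effect=2.0, sigma=10.0, n=n, alpha=alpha)
        print(f"alpha={alpha}, n={n}: power={power:.3f}, beta={1 - power:.3f}")
```

In the printed table, moving from α = 0.05 to α = 0.01 at the same n raises β, while moving from n = 25 to n = 100 at the same α lowers it.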

Q7 What is the difference between a one-tailed and a two-tailed test in hypothesis testing?

The key difference between a **one-tailed** and a **two-tailed** test in hypothesis testing lies in the directionality of the test.

**One-Tailed Test** (Directional)
- Used when the research hypothesis specifies **a direction** of the effect.
- It tests for an **increase or decrease** in a parameter, not both.
- The critical region (rejection area for H₀) is only on **one side** of the distribution.

**Example**:

A company claims its new battery lasts **more than** 10 hours.  
- **H₀**: The battery lasts **≤ 10 hours**.  
- **H₁**: The battery lasts **> 10 hours** (right-tailed).  
- The test will only check for whether the battery life is significantly **greater** than 10 hours.

**Two-Tailed Test** (Non-Directional)
- Used when the research hypothesis does **not** specify a direction.
- It checks for **any significant difference**—whether the parameter is higher or lower.
- The critical region is **split between both tails** of the distribution.

**Example**:  
A pharmaceutical company wants to test if a drug **affects** blood pressure in **either** direction.  
- **H₀**: The drug has **no effect** on blood pressure.  
- **H₁**: The drug **changes** blood pressure (either increases or decreases).  
- The test examines **both** possibilities (higher or lower blood pressure).

**Choosing Between the Two**
- Use a **one-tailed test** if you're only interested in a specific direction (e.g., "greater than" or "less than").
- Use a **two-tailed test** when looking for **any change** or difference (e.g., "not equal to").
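
The practical effect of the choice shows up in the p-value. The sketch below uses a hypothetical Z statistic of 1.80 and computes both versions; the statistic is an assumption for illustration only.

```python
from statistics import NormalDist

alpha = 0.05
z = 1.80  # hypothetical test statistic
std_normal = NormalDist()

# Right-tailed test: only large positive values count as evidence against H0
p_right = 1 - std_normal.cdf(z)

# Two-tailed test: extreme values in either direction count
p_two = 2 * (1 - std_normal.cdf(abs(z)))

print(f"one-tailed p = {p_right:.4f}, two-tailed p = {p_two:.4f}")
```

With this statistic the one-tailed test rejects H₀ at α = 0.05 while the two-tailed test does not, which is why the direction should be fixed before looking at the data.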

Q8 What is the Z-test, and when is it used in hypothesis testing?

The **Z-test** is a statistical test used to determine whether there is a significant difference between sample and population means or between two sample means, assuming the data follows a normal distribution. It relies on the **Z-score**, which measures how far a sample statistic deviates from the expected population parameter.

 **When is the Z-test used?**
A Z-test is applied when:
1. **Sample Size is Large** (typically **n ≥ 30**).  
2. **Data is Normally Distributed** (or the sample size is large enough for the Central Limit Theorem to apply).  
3. **Population Variance is Known** (or assumed).  
4. **Comparing a Sample Mean to a Population Mean** or **Comparing Two Independent Samples**.  

 **Types of Z-tests**
1. **One-sample Z-test**: Used to compare a sample mean with a known population mean.  
   *Example*: Checking if the average height of students in a school differs from the national average.  
2. **Two-sample Z-test**: Used to compare the means of two independent samples.  
   *Example*: Comparing the average test scores of students from two different schools.  
3. **Z-test for Proportions**: Used when dealing with categorical data and comparing proportions.  
   *Example*: Testing whether the proportion of people preferring a new product differs from past preferences.  

**Formula for One-Sample Z-Test**
\[
Z = \frac{\bar{X} - \mu}{\frac{\sigma}{\sqrt{n}}}
\]
Where:
- \( \bar{X} \) = sample mean  
- \( \mu \) = population mean  
- \( \sigma \) = population standard deviation  
- \( n \) = sample size  
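
A minimal sketch of the one-sample formula in Python follows. The function name and the input values (sample mean 105, population mean 100, σ = 15, n = 36) are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def one_sample_z_test(sample_mean, pop_mean, pop_std, n):
    """Return the Z statistic and the two-sided p-value."""
    z = (sample_mean - pop_mean) / (pop_std / sqrt(n))
    p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_two_sided

z, p = one_sample_z_test(sample_mean=105.0, pop_mean=100.0, pop_std=15.0, n=36)
print(f"Z = {z:.2f}, p = {p:.4f}")  # Z = 2.00, p = 0.0455
```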

Q9 How do you calculate the Z-score, and what does it represent in hypothesis testing?

The **Z-score**, also known as the **standard score**, measures how far a data point or sample mean deviates from the population mean in terms of **standard deviations**. It helps determine whether the observed data is significantly different from the expected norm.

**Formula for Z-score:**
For a single data point:
\[
Z = \frac{X - \mu}{\sigma}
\]
For a sample mean in hypothesis testing:
\[
Z = \frac{\bar{X} - \mu}{\frac{\sigma}{\sqrt{n}}}
\]
Where:
- \( X \) = individual data value
- \( \bar{X} \) = sample mean  
- \( \mu \) = population mean  
- \( \sigma \) = population standard deviation  
- \( n \) = sample size  

**What Does the Z-score Represent?**
1. **Measures Distance from Mean**:  
   - Positive Z-score → Data is above the mean  
   - Negative Z-score → Data is below the mean  

2. **Indicates Significance in Hypothesis Testing**:  
   - If the Z-score is extreme (far from zero), it suggests the sample mean significantly differs from the population mean.  
   - The Z-score is compared against critical values from the **Z-table** (often **±1.96** for a two-tailed test at α = 0.05, i.e., a 95% confidence level).  
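
For a single data point, the calculation and the comparison with the critical value might look like the sketch below. The score of 85 and the population parameters (μ = 100, σ = 15) are hypothetical.

```python
from statistics import NormalDist

def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean."""
    return (x - mu) / sigma

z = z_score(x=85.0, mu=100.0, sigma=15.0)

# Two-tailed critical value at alpha = 0.05 (the familiar 1.96)
z_crit = NormalDist().inv_cdf(0.975)

is_extreme = abs(z) > z_crit
print(f"z = {z:.2f}, critical value = ±{z_crit:.2f}, extreme: {is_extreme}")
```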

Q10 What is the T-distribution, and when should it be used instead of the normal distribution?

The **T-distribution** (Student's T-distribution) is a probability distribution that is similar to the **normal distribution** but has heavier tails. This means it accounts for more variability in data, making it useful when dealing with small sample sizes or unknown population variances.

**When to Use the T-Distribution Instead of the Normal Distribution?**
1. **Small Sample Size (\( n < 30 \))**:  
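   - With a small sample, the sample standard deviation is a noisy estimate of \( \sigma \), so the standardized sample mean no longer follows the normal distribution closely. The T-distribution accounts for this extra uncertainty through its degrees of freedom (\( n - 1 \)).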
   
2. **Unknown Population Standard Deviation (\( \sigma \))**:  
   - If the **population standard deviation** is unknown, we estimate it using the **sample standard deviation (s)**. The T-distribution provides better estimates in such cases.
   
3. **More Variability and Uncertainty**:  
   - Since the T-distribution has wider tails than the normal distribution, it accounts for more variability and **reduces the risk of underestimating uncertainty** when making conclusions.

**Key Differences Between T-Distribution and Normal Distribution**

| Feature            | Normal Distribution | T-Distribution |
|-------------------|----------------|--------------|
| **Shape**         | Bell-shaped, thin tails | Bell-shaped, but heavier tails |
| **Usage**         | Large samples (\( n \geq 30 \)) | Small samples (\( n < 30 \)) |
| **Standard Deviation** | Known (\( \sigma \)) | Unknown, estimated from sample |
| **Tail Thickness** | Thinner tails | Thicker tails (accounts for more variability) |
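
The heavier tails can be seen by comparing critical values. The sketch below uses SciPy (an assumption about the environment) to print the two-tailed 5% critical value of the T-distribution for a few degrees of freedom next to the normal value.

```python
from scipy import stats

# Two-tailed critical value at alpha = 0.05 for the standard normal
z_crit = stats.norm.ppf(0.975)

# Same quantile for T-distributions with increasing degrees of freedom
t_crits = {df: stats.t.ppf(0.975, df) for df in (5, 15, 30, 1000)}

print(f"normal: {z_crit:.4f}")
for df, t_crit in t_crits.items():
    print(f"t (df={df}): {t_crit:.4f}")
```

The T critical values are larger than 1.96 for small df and approach it as df grows, which is why the two distributions give nearly identical answers for large samples.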

Q11 What is the difference between a Z-test and a T-test?

The **Z-test** and **T-test** are both statistical tests used to compare sample data to a population or between two samples. The key difference lies in when they should be applied.

**Key Differences:**

| Feature | Z-test | T-test |
|---------|-------|-------|
| **Sample Size** | Large (\( n \geq 30 \)) | Small (\( n < 30 \)) |
| **Population Standard Deviation (\( \sigma \))** | Known | Unknown (uses sample standard deviation \( s \)) |
| **Distribution Used** | Normal distribution | T-distribution (heavier tails for more variability) |
| **Typical Use Cases** | Comparing means or proportions in large datasets | Comparing means in small datasets where population parameters are unknown |

**When to Use Each:**
- **Use a Z-test** when the sample is large, normally distributed, and the population standard deviation is known.
- **Use a T-test** when the sample is small and/or the population standard deviation is unknown.

 **Example Scenarios:**

**Z-test:**  
A researcher wants to test whether the average height of students in a university (large sample) differs from the national average. The **population standard deviation is known**, so a Z-test is appropriate.

**T-test:**  
A company tests whether a new training program improves employee productivity. They only have **15 employees**, and the population standard deviation is unknown. A **T-test** is more suitable.

Q12 What is the T-test, and how is it used in hypothesis testing?

The **T-test** is a statistical test used to determine whether there is a significant difference between the means of two groups. It is particularly useful when dealing with **small sample sizes (\( n < 30 \))** and when the **population standard deviation (\( \sigma \)) is unknown**.

**How the T-test is Used in Hypothesis Testing:**
1. **State the Hypotheses**:  
   - **Null Hypothesis (H₀)**: Assumes no difference between group means.  
   - **Alternative Hypothesis (H₁)**: Suggests there is a significant difference.

2. **Choose the Type of T-test**:  
   - **One-sample T-test**: Compares a sample mean to a known population mean.  
   - **Independent (two-sample) T-test**: Compares means of two independent groups.  
   - **Paired T-test**: Compares means from the same group before and after treatment.

3. **Calculate the T-score** (shown here for the one-sample case):
\[
T = \frac{\bar{X} - \mu}{\frac{s}{\sqrt{n}}}
\]
Where:
- \( \bar{X} \) = sample mean  
- \( \mu \) = population mean under H₀  
- \( s \) = sample standard deviation  
- \( n \) = sample size  

For two-sample tests, the numerator becomes the difference \( \bar{X}_1 - \bar{X}_2 \) and the denominator becomes the standard error of that difference.

4. **Compare the T-score to the Critical Value from the T-table**:  
   - If the **absolute value of the computed T-value exceeds the critical value** (two-tailed test), reject **H₀**.  
   - Otherwise, **fail to reject H₀**.
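
As an illustration of the independent two-sample case, the sketch below computes the T statistic by hand and with SciPy. It uses Welch's form, where the standard error is \( \sqrt{s_1^2/n_1 + s_2^2/n_2} \) and equal variances are not assumed. The two groups are hypothetical data.

```python
from math import sqrt
from statistics import mean, variance
from scipy import stats

# Hypothetical measurements for two independent groups
group_a = [23.1, 25.4, 24.8, 26.0, 22.9, 25.1]
group_b = [20.2, 22.5, 21.1, 23.0, 21.8, 20.9]

# Manual T statistic: difference of means over its standard error
se = sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
t_manual = (mean(group_a) - mean(group_b)) / se

# Same test via SciPy; equal_var=False selects Welch's t-test
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)

print(f"manual t = {t_manual:.3f}, scipy t = {t_stat:.3f}, p = {p_value:.4f}")
```

SciPy returns the p-value directly, so the comparison with a table value can be replaced by checking whether `p_value` is below α.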


Q13 What is the relationship between Z-test and T-test in hypothesis testing?

The **Z-test** and **T-test** are closely related statistical tests used in hypothesis testing to compare sample data with population parameters or between two sample groups. While both tests assess whether observed differences are statistically significant, their application differs based on sample size and available data.

**Relationship Between Z-test and T-test**
1. **Purpose**:  
   - Both tests evaluate whether a sample mean significantly differs from a population mean or another sample mean.
   - They help determine whether to **reject the null hypothesis (H₀)** in favor of the alternative hypothesis (H₁).

2. **Underlying Distribution**:  
   - **Z-test** relies on the **normal distribution** (used when the sample size is large).  
   - **T-test** uses the **T-distribution**, which accounts for **greater variability** in smaller samples.  

3. **Sample Size Dependency**:  
   - **Z-test** is appropriate when **\( n \geq 30 \)** (large sample).  
   - **T-test** is used for **\( n < 30 \)** (small sample) because it adjusts for **sample variability** with heavier tails.  

4. **Known vs. Unknown Population Standard Deviation (\( \sigma \))**:  
   - **Z-test** is used when the **population standard deviation** (\( \sigma \)) is known.  
   - **T-test** is used when **\( \sigma \)** is unknown and must be estimated from sample data.  

 **Common Usage of Both Tests**

| Feature | Z-test | T-test |
|---------|-------|--------|
| **Distribution Used** | Normal | T-distribution |
| **Sample Size** | Large (\( n \geq 30 \)) | Small (\( n < 30 \)) |
| **Standard Deviation (\( \sigma \))** | Known | Unknown (estimated from sample) |
| **Use Case** | Large datasets, population-level analysis | Small datasets, experimental studies |

Q14 What is a confidence interval, and how is it used to interpret statistical results?

A **confidence interval (CI)** is a range of values that provides an estimate of where a population parameter (such as a mean or proportion) is likely to fall, based on sample data. It reflects the uncertainty in statistical analysis and helps determine the reliability of results.

**How Confidence Intervals Are Used in Statistical Interpretation**

1. **Indicates Precision**  
   - A **narrow** confidence interval suggests precise estimates.  
   - A **wide** confidence interval indicates more uncertainty.  

2. **Helps Make Decisions**  
   - If a CI does **not** include a specific value (e.g., zero for differences), it suggests a statistically **significant** result.  
   - If a CI **includes** the null hypothesis value, the result might not be statistically significant.  

3. **Provides a Confidence Level**  
   - Common values: **90%, 95%, and 99%** confidence levels.  
   - A **95% CI** means: "If we repeated the study many times and built an interval each time, about 95% of those intervals would contain the true population parameter."  

**Confidence Interval Formula (for Mean)**

\[
CI = \bar{X} \pm Z \times \frac{\sigma}{\sqrt{n}}
\]
Where:
- \( \bar{X} \) = sample mean  
- \( Z \) = critical value from Z-table (based on confidence level)  
- \( \sigma \) = population standard deviation (or estimated from sample)  
- \( n \) = sample size  
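
The formula translates directly into code. In the sketch below the function name and the inputs (sample mean 50, σ = 10, n = 100) are hypothetical, and σ is assumed known so that a Z critical value applies.

```python
from math import sqrt
from statistics import NormalDist

def z_confidence_interval(sample_mean, pop_std, n, confidence=0.95):
    """Confidence interval for a mean when the population std is known."""
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z_crit * pop_std / sqrt(n)
    return sample_mean - margin, sample_mean + margin

low, high = z_confidence_interval(sample_mean=50.0, pop_std=10.0, n=100)
print(f"95% CI: ({low:.2f}, {high:.2f})")  # 95% CI: (48.04, 51.96)
```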

Q15 What is the margin of error, and how does it affect the confidence interval?

The **margin of error (MoE)** is the half-width of a confidence interval: the maximum amount by which the sample estimate is expected to differ from the true population parameter at the chosen confidence level. It quantifies the uncertainty in a statistical result and directly determines how wide or narrow the **confidence interval (CI)** is.

**How the Margin of Error Affects the Confidence Interval**

1. **Defines the Range of Possible Values**  
   - The confidence interval is calculated as:  
     \[
     CI = \bar{X} \pm \text{Margin of Error}
     \]
   - A **larger MoE** means the confidence interval is wider, indicating **more uncertainty** in the estimate.
   - A **smaller MoE** means the confidence interval is narrower, suggesting **greater precision** in the estimate.

2. **Factors Influencing the Margin of Error**
   - **Confidence Level**: Higher confidence levels (e.g., 99% vs. 95%) lead to a **larger MoE**.
   - **Sample Size (\( n \))**: A larger sample size reduces variability, leading to a **smaller MoE**.
   - **Standard Deviation (\( \sigma \))**: More variability in data increases MoE.
   - **Critical Value (Z or t-score)**: Based on statistical tables for normal or T-distributions.

**Formula for Margin of Error**

\[
\text{MoE} = Z \times \frac{\sigma}{\sqrt{n}}
\]
Where:
- \( Z \) = critical value from Z-table (for normal distribution)
- \( \sigma \) = population standard deviation (or estimated from sample)
- \( n \) = sample size

**Example Interpretation**

Imagine a survey estimates that **60% of customers prefer a new product**, with a **margin of error of ±3%** at **95% confidence**.

- **Confidence Interval**: (57%, 63%)  
- This means researchers are **95% confident** that the true percentage of customer preference falls within this range.
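
A margin of about ±3% is what a sample of roughly 1,000 respondents would give. The sketch below checks this; the sample size is an assumption, since the survey example does not state it. For a proportion, the standard deviation in the MoE formula is \( \sqrt{\hat{p}(1 - \hat{p})} \).

```python
from math import sqrt
from statistics import NormalDist

def proportion_margin_of_error(p_hat, n, confidence=0.95):
    """Margin of error for a sample proportion (normal approximation)."""
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z_crit * sqrt(p_hat * (1 - p_hat) / n)

moe_95 = proportion_margin_of_error(0.60, n=1000)                   # baseline
moe_99 = proportion_margin_of_error(0.60, n=1000, confidence=0.99)  # higher confidence
moe_big = proportion_margin_of_error(0.60, n=4000)                  # larger sample

print(f"95%, n=1000: ±{moe_95:.3f}")
print(f"99%, n=1000: ±{moe_99:.3f}")
print(f"95%, n=4000: ±{moe_big:.3f}")
```

Raising the confidence level widens the margin, while quadrupling the sample size halves it.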

Q16 How is Bayes' Theorem used in statistics, and what is its significance?

Bayes' Theorem is a fundamental concept in statistics that describes how to update the probability of an event based on new evidence. It is widely used in probability theory, machine learning, and decision-making processes.

**Bayes' Theorem Formula**

\[
P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}
\]
Where:
- \( P(A|B) \) = Probability of **A** given that **B** has occurred (posterior probability).
- \( P(B|A) \) = Probability of **B** given **A** (likelihood).
- \( P(A) \) = Prior probability of **A**.
- \( P(B) \) = Probability of observing **B** (marginal probability).

**Significance of Bayes' Theorem**

1. **Updates Beliefs with New Information**  
   - It helps refine probabilities based on observed data.
   - Used in **medical diagnosis**, where prior knowledge of disease prevalence is updated with new test results.

2. **Decision-Making Under Uncertainty**  
   - Applied in **risk analysis, finance, and machine learning models** to evaluate uncertainty.

3. **Spam Filtering & AI Predictions**  
   - Used in email spam filters, where prior probabilities of spam emails are adjusted based on characteristics like keywords.

4. **Genetics & Medicine**  
   - Helps assess the probability of genetic disorders based on family history and test results.

**Example Application: Medical Diagnosis**

A doctor wants to determine whether a patient has a rare disease.  
- **Prior Probability**: The disease affects 1% of the population.  
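- **Test Accuracy**: The test detects the disease in 95% of people who have it (sensitivity), but it also returns a positive result for 5% of people who do not have it (false positive rate).  
- **Bayes' Theorem** helps the doctor update the probability based on a **positive test result**: P(disease | positive) = (0.95 × 0.01) / (0.95 × 0.01 + 0.05 × 0.99) ≈ 0.16. Even after a positive test, the chance of disease is only about 16%, because the disease is rare.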
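
The update for this example can be written as a short function. The function and parameter names are hypothetical; the numbers are the ones given in the example.

```python
def posterior_probability(prior, sensitivity, false_positive_rate):
    """P(disease | positive test) via Bayes' Theorem."""
    # P(B): total probability of a positive test, over sick and healthy people
    p_positive = sensitivity * prior + false_positive_rate * (1 - prior)
    # P(A|B) = P(B|A) * P(A) / P(B)
    return sensitivity * prior / p_positive

posterior = posterior_probability(prior=0.01, sensitivity=0.95,
                                  false_positive_rate=0.05)
print(f"P(disease | positive test) = {posterior:.3f}")  # 0.161
```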

Q17 What is the Chi-square distribution, and when is it used?

The **Chi-square distribution** is a probability distribution widely used in statistics for hypothesis testing, especially when dealing with categorical data. It arises when summing the squares of independent standard normal variables and is primarily used in goodness-of-fit tests, independence tests, and variance analysis.

**Key Characteristics of Chi-square Distribution:**

- **Skewed Distribution:** Positively skewed, but becomes more symmetric as degrees of freedom increase.
- **Non-Negative Values:** Since it represents squared quantities, Chi-square values are always **≥ 0**.
- **Degrees of Freedom (df):** Determines its shape and depends on the number of categories or independent variables.

**When is the Chi-square Distribution Used?**

1. **Chi-square Goodness-of-Fit Test:**  
   - Checks if a sample distribution matches an expected distribution.  
   - Example: Determining whether observed customer preferences follow predicted proportions.

2. **Chi-square Test for Independence:**  
   - Assesses whether two categorical variables are independent.  
   - Example: Testing if gender influences product preference.

3. **Chi-square Test for Variance:**  
   - Determines if a population variance differs from a specified value.  
   - Example: Checking whether production quality variance meets industry standards.

**Chi-square Formula for Independence Test**

\[
\chi^2 = \sum \frac{(O - E)^2}{E}
\]
Where:
- \( O \) = Observed frequency  
- \( E \) = Expected frequency  

**Example Application:**

Suppose a company wants to check if customer satisfaction is **independent** of location. They survey **customers across multiple cities** and conduct a **Chi-square test** to see if satisfaction levels differ significantly.
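
A sketch of such a test with SciPy follows. The contingency table is hypothetical (two satisfaction levels across three cities), and the manual sum is included only to show that `chi2_contingency` applies the formula above to the expected counts it derives from the row and column totals.

```python
from scipy.stats import chi2_contingency

# Hypothetical counts: rows = satisfied / not satisfied, columns = three cities
observed = [
    [90, 60, 70],
    [30, 40, 30],
]

chi2_stat, p_value, dof, expected = chi2_contingency(observed)

# Same statistic from the formula: sum of (O - E)^2 / E over all cells
chi2_manual = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

print(f"chi2 = {chi2_stat:.3f}, df = {dof}, p = {p_value:.4f}")
```

Here the degrees of freedom are (rows − 1) × (columns − 1) = 2.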

Q18 What is the Chi-square goodness of fit test, and how is it applied?

The **Chi-square goodness-of-fit test** is a statistical test used to determine whether a sample distribution differs significantly from an expected or theoretical distribution. It helps assess how well observed data matches a predicted pattern.

**How the Chi-square Goodness-of-Fit Test Works:**

1. **Define Hypotheses**  
   - **Null Hypothesis (H₀)**: The observed distribution **matches** the expected distribution.  
   - **Alternative Hypothesis (H₁)**: The observed distribution **does not** match the expected distribution.  

2. **Collect Data and Define Expected Values**  
   - The expected values are determined based on a known theoretical distribution or historical data.  

3. **Apply the Chi-square Formula:**  
\[
\chi^2 = \sum \frac{(O - E)^2}{E}
\]
Where:
- \( O \) = Observed frequency  
- \( E \) = Expected frequency  

4. **Compare the Chi-square Statistic to the Critical Value:**  
   - Use a **Chi-square distribution table** to find the critical value for a chosen **significance level (α, usually 0.05)** and **degrees of freedom (df = categories - 1)**.  
   - If **computed \( \chi^2 \)** is greater than the table value, **reject H₀**, meaning the data does not fit the expected distribution.  
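
The steps can be illustrated with a fair-die check. The observed counts below are hypothetical (60 rolls), and SciPy is assumed to be available.

```python
from scipy.stats import chi2, chisquare

observed = [8, 12, 10, 9, 11, 10]   # hypothetical counts from 60 die rolls
expected = [10] * 6                 # a fair die: 60 / 6 per face
alpha = 0.05

# Step 3: chi-square statistic (and its p-value)
chi2_stat, p_value = chisquare(f_obs=observed, f_exp=expected)

# Step 4: critical value for df = categories - 1
critical_value = chi2.ppf(1 - alpha, df=len(observed) - 1)
reject_h0 = chi2_stat > critical_value

print(f"chi2 = {chi2_stat:.2f}, critical = {critical_value:.2f}, reject H0: {reject_h0}")
```

The statistic here is 1.0, far below the critical value of about 11.07, so these counts are consistent with a fair die.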

Q19 What is the F-distribution, and when is it used in hypothesis testing?

The **F-distribution** is a probability distribution that arises frequently in statistical analysis, particularly in hypothesis testing involving **variance comparisons**. It is **right-skewed** and only takes **positive values** since it deals with squared quantities.

**When is the F-Distribution Used?**

The F-distribution is used when comparing **ratios of variances** in different scenarios:

1. **Analysis of Variance (ANOVA)**  
   - Determines whether the means of multiple groups are **significantly different**.
   - Example: A company tests whether employee productivity differs across three departments.

2. **F-test for Comparing Variances**  
   - Checks if two population variances are **equal**.
   - Example: Comparing variability in product defects between two manufacturing processes.

3. **Regression Analysis**  
   - Helps evaluate the **overall significance** of regression models.
   - Example: Assessing whether multiple predictors explain variation in sales data.

**Formula for the F-statistic**

\[
F = \frac{\text{Variance of Sample 1}}{\text{Variance of Sample 2}}
\]
Interpretation:
- A **large F-value** suggests significant differences in variances.
- Critical values are obtained from the **F-distribution table**, based on **degrees of freedom** (\( df \)).

The **F-distribution** plays a crucial role in statistical hypothesis testing whenever variance-based comparisons are required.

Q20 What is an ANOVA test, and what are its assumptions?

The **Analysis of Variance (ANOVA)** test is a statistical method used to compare the means of multiple groups to determine if there is a **significant difference** among them. Unlike a t-test, which compares two groups, ANOVA allows for comparison across three or more groups.

 **How ANOVA Works:**

1. **Form Hypotheses:**  
   - **Null Hypothesis (H₀)**: All group means are equal (no significant difference).  
   - **Alternative Hypothesis (H₁)**: At least one group mean is different.  

2. **Calculate the F-statistic:**  
   - Measures the variance **between** groups and compares it to variance **within** groups.  
   - A large F-value suggests that group means **differ significantly**.

3. **Compare to Critical Value:**  
   - If the computed **F-value** exceeds the critical value from the F-distribution table, we reject **H₀**.

**Types of ANOVA Tests**

- **One-way ANOVA:** Compares group means across the levels of **one independent variable** (factor).  
  Example: Examining whether three different teaching methods impact student performance.  
- **Two-way ANOVA:** Analyzes the effects of **two independent variables** simultaneously, including their interaction.  
  Example: Studying how **teaching method and gender** affect student performance.  

**Assumptions of ANOVA**

To ensure valid results, ANOVA relies on these assumptions:
1. **Independence:** Observations are independent of each other.  
2. **Normality:** Data in each group follows a **normal distribution** (especially important for small sample sizes).  
3. **Homogeneity of Variance (Homoscedasticity):** The variance across groups should be approximately **equal** (tested using Levene’s test).  

If assumptions are violated, alternative methods like **Welch’s ANOVA** or **Kruskal-Wallis test** may be used.
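
A one-way ANOVA might be sketched as follows. The scores for the three teaching methods are hypothetical, and the manual calculation is shown only to make the between-group and within-group variances explicit.

```python
from statistics import mean
from scipy.stats import f_oneway

# Hypothetical test scores under three teaching methods
method_a = [78, 82, 85, 80, 79]
method_b = [88, 90, 86, 91, 89]
method_c = [75, 72, 78, 74, 76]
groups = [method_a, method_b, method_c]

# Manual F statistic: between-group variance over within-group variance
grand_mean = mean(x for g in groups for x in g)
ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
df_between = len(groups) - 1
df_within = sum(len(g) for g in groups) - len(groups)
f_manual = (ss_between / df_between) / (ss_within / df_within)

# Same test via SciPy
f_stat, p_value = f_oneway(*groups)

print(f"manual F = {f_manual:.2f}, scipy F = {f_stat:.2f}, p = {p_value:.6f}")
```

A significant result only says that at least one mean differs; it does not say which one.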

Q21 What are the different types of ANOVA tests?

There are several types of **ANOVA (Analysis of Variance) tests**, each designed for different experimental setups and comparisons:

**1. One-Way ANOVA**
- Compares **means of three or more groups** based on a **single independent variable**.
- Used when testing the effect of one factor on a dependent variable.
- *Example:* Comparing the effectiveness of **three different teaching methods** on student performance.

**2. Two-Way ANOVA**
- Compares the means of groups based on **two independent variables**.
- Examines whether each factor has a significant effect and whether they interact.
- *Example:* Studying how **exercise type and diet** influence weight loss.

**3. Repeated Measures ANOVA**
- Used when measuring **the same subjects multiple times** under different conditions.
- Helps analyze changes **over time** within the same participants.
- *Example:* Tracking **employee productivity** before, during, and after a training program.

**4. Mixed-Design ANOVA**
- Combines elements of **repeated measures** and **two-way ANOVA**.
- Used for studies that involve **both within-subject and between-subject comparisons**.
- *Example:* Testing how **a new medication affects different age groups** over time.

**5. MANOVA (Multivariate ANOVA)**
- Extends ANOVA to compare multiple **dependent variables** simultaneously.
- *Example:* Evaluating how a marketing campaign influences **brand awareness and customer satisfaction** at the same time.

Q22 What is the F-test, and how does it relate to hypothesis testing?

The **F-test** is a statistical test used to compare the **variances** of two datasets or to evaluate the significance of multiple regression models. It plays a crucial role in hypothesis testing, particularly in **ANOVA (Analysis of Variance)** and **variance comparison tests**.

**How the F-Test Relates to Hypothesis Testing**

1. **Compares Variances**:  
   - The F-test examines whether two populations have the **same variance** or whether the variance among multiple groups is statistically significant.  

2. **Forms Hypotheses**:  
   - **Null Hypothesis (H₀):** The variances of the two groups are equal.  
   - **Alternative Hypothesis (H₁):** The variances are different.  

3. **Computes the F-statistic**:  
\[
F = \frac{ \text{Variance of Group 1} }{ \text{Variance of Group 2} }
\]
   - If **F > critical value** (from the F-distribution table), H₀ is rejected, indicating a significant variance difference.

4. **Used in ANOVA and Regression Analysis**:  
   - In **ANOVA**, the F-test determines if group means differ significantly.  
   - In **linear regression**, the F-test checks if independent variables meaningfully explain the variation in the dependent variable.

**Example Application**

A factory tests whether two machines produce products with **equal variance in defect rates**. If the F-test finds a significant difference, managers may adjust quality control processes.
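
The factory example can be sketched in Python as below. This is an illustration under stated assumptions: the defect-rate data are simulated, the function name `f_test_equal_variances` is hypothetical, and the test assumes both samples come from approximately normal populations (the variance-ratio F-test is sensitive to departures from normality; `scipy.stats.levene` is a more robust alternative).

```python
import numpy as np
from scipy import stats

def f_test_equal_variances(sample1, sample2, alpha=0.05):
    """Two-tailed F-test for equality of two population variances."""
    s1 = np.asarray(sample1, dtype=float)
    s2 = np.asarray(sample2, dtype=float)

    # Sample variances (ddof=1) and their ratio
    var1 = np.var(s1, ddof=1)
    var2 = np.var(s2, ddof=1)
    f_stat = var1 / var2

    # Degrees of freedom for numerator and denominator
    df1, df2 = len(s1) - 1, len(s2) - 1

    # Two-tailed p-value: double the smaller tail probability
    lower_tail = stats.f.cdf(f_stat, df1, df2)
    p_value = 2 * min(lower_tail, 1 - lower_tail)

    return f_stat, p_value, p_value < alpha

# Simulated defect rates for two machines (assumed values for illustration)
rng = np.random.default_rng(42)
machine_a = rng.normal(loc=2.0, scale=0.5, size=40)
machine_b = rng.normal(loc=2.0, scale=1.5, size=40)

f_stat, p_value, reject = f_test_equal_variances(machine_a, machine_b)
print(f"F-statistic: {f_stat:.3f}")
print(f"P-value: {p_value:.6f}")
print("Reject H0 (variances differ)" if reject else "Fail to reject H0")
```

Doubling the smaller tail probability makes the result independent of which machine is placed in the numerator, so the larger-variance-on-top convention is not needed here.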



...
 **Practical Part-1**

#Q1 Write a Python program to generate a random variable and display its value ?

import random

# Generate a random integer between 1 and 100
random_variable = random.randint(1, 100)

# Display the random variable
print("Generated random variable:", random_variable)
...
#Q2 Generate a discrete uniform distribution using Python and plot the probability mass function (PMF)?
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import randint

# Define the range of the discrete uniform distribution
low, high = 1, 10  # Values from 1 to 10
rv = randint(low, high+1)  # Discrete uniform distribution

# Generate possible values
x = np.arange(low, high+1)
pmf_values = rv.pmf(x)

# Plot the PMF
plt.bar(x, pmf_values, color='blue', alpha=0.7)
plt.xlabel('Value')
plt.ylabel('Probability')
plt.title('Probability Mass Function (PMF) of Discrete Uniform Distribution')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
...
#Q3 Write a Python function to calculate the probability distribution function (PDF) of a Bernoulli distribution?
def bernoulli_pdf(x, p):
    """
    Calculate the probability mass function (PMF) of a Bernoulli distribution.
    A Bernoulli variable is discrete, so the "PDF" asked for in the question is a PMF.

    Parameters:
    x (int): The outcome (0 or 1).
    p (float): The probability of success (0 ≤ p ≤ 1).

    Returns:
    float: Probability associated with the given outcome.
    """
    if x not in [0, 1]:
        raise ValueError("x must be either 0 or 1 for a Bernoulli distribution.")
    if not (0 <= p <= 1):
        raise ValueError("Probability p must be between 0 and 1.")

    return p if x == 1 else 1 - p

# Example usage:
p_success = 0.7
print("P(X=1):", bernoulli_pdf(1, p_success))
print("P(X=0):", bernoulli_pdf(0, p_success))
...
#Q4 Write a Python script to simulate a binomial distribution with n=10 and p=0.5, then plot its histogram ?
import numpy as np
import matplotlib.pyplot as plt

# Define parameters for the binomial distribution
n = 10  # Number of trials
p = 0.5  # Probability of success
size = 1000  # Number of experiments

# Generate binomially distributed random values
binomial_samples = np.random.binomial(n, p, size)

# Plot histogram (bin edges at k - 0.5 so each bar is centered on an integer count)
plt.hist(binomial_samples, bins=np.arange(n+2) - 0.5, density=True, color='blue', alpha=0.7, edgecolor='black')
plt.xlabel('Number of successes')
plt.ylabel('Relative frequency')
plt.title(f'Binomial Distribution Histogram (n={n}, p={p})')
plt.xticks(range(n+1))
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
...
#Q5 Create a Poisson distribution and visualize it using Python ?
import numpy as np
import matplotlib.pyplot as plt

# Define parameters for the Poisson distribution
lambda_value = 5  # Mean number of occurrences (λ)
size = 1000  # Number of experiments

# Generate Poisson distributed random values
poisson_samples = np.random.poisson(lambda_value, size)

# Plot histogram (bin edges at k - 0.5 give every integer count, including the maximum, its own bin)
plt.hist(poisson_samples, bins=np.arange(max(poisson_samples)+2) - 0.5, density=True, color='purple', alpha=0.7, edgecolor='black')
plt.xlabel('Number of occurrences')
plt.ylabel('Probability')
plt.title(f'Poisson Distribution Histogram (λ={lambda_value})')
plt.xticks(range(0, max(poisson_samples)+1))
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
...
#Q6 Write a Python program to calculate and plot the cumulative distribution function (CDF) of a discrete uniform distribution ?
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import randint

# Define the range of the discrete uniform distribution
low, high = 1, 10  # Values from 1 to 10
rv = randint(low, high+1)  # Discrete uniform distribution

# Generate possible values
x = np.arange(low, high+1)
cdf_values = rv.cdf(x)

# Plot the CDF as a step function (the CDF of a discrete variable jumps at each value)
plt.step(x, cdf_values, where='post', marker='o', color='red')
plt.xlabel('Value')
plt.ylabel('Cumulative Probability')
plt.title('Cumulative Distribution Function (CDF) of Discrete Uniform Distribution')
plt.grid(True)
plt.show()
...
#Q7  Generate a continuous uniform distribution using NumPy and visualize it ?
import numpy as np
import matplotlib.pyplot as plt

# Generate random samples from a continuous uniform distribution
low, high = 0, 1  # Range [0, 1]
size = 1000  # Number of samples
uniform_samples = np.random.uniform(low, high, size)

# Plot histogram
plt.hist(uniform_samples, bins=30, density=True, color='green', alpha=0.7, edgecolor='black')
plt.xlabel('Value')
plt.ylabel('Density')
plt.title('Continuous Uniform Distribution Histogram')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
...
#Q8 Simulate data from a normal distribution and plot its histogram ?
import numpy as np
import matplotlib.pyplot as plt

# Define parameters for the normal distribution
mean = 0  # Mean (μ)
std_dev = 1  # Standard deviation (σ)
size = 1000  # Number of samples

# Generate normally distributed random values
normal_samples = np.random.normal(mean, std_dev, size)

# Plot histogram
plt.hist(normal_samples, bins=30, density=True, color='blue', alpha=0.7, edgecolor='black')
plt.xlabel('Value')
plt.ylabel('Density')
plt.title('Normal Distribution Histogram')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
...
#Q9 Write a Python function to calculate Z-scores from a dataset and plot them ?
import numpy as np
import matplotlib.pyplot as plt

def calculate_z_scores(data):
    """
    Calculate Z-scores for a given dataset.

    Parameters:
    data (list or np.array): The dataset.

    Returns:
    np.array: The Z-scores of the dataset.
    """
    data = np.asarray(data, dtype=float)  # Accept plain lists as well as arrays
    mean = np.mean(data)
    std_dev = np.std(data, ddof=1)  # Using ddof=1 for sample standard deviation
    return (data - mean) / std_dev

# Generate a random dataset
np.random.seed(42)
data = np.random.normal(loc=50, scale=10, size=100)  # Normal distribution with mean=50, std=10

# Calculate Z-scores
z_scores = calculate_z_scores(data)

# Plot histogram of Z-scores
plt.hist(z_scores, bins=20, density=True, color='blue', alpha=0.7, edgecolor='black')
plt.xlabel('Z-score')
plt.ylabel('Density')
plt.title('Histogram of Z-scores')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
...
#Q10 Implement the Central Limit Theorem (CLT) using Python for a non-normal distribution ?
import numpy as np
import matplotlib.pyplot as plt

# Define parameters for the exponential distribution
lambda_value = 1  # Rate parameter for the exponential distribution
population_size = 10000  # Number of data points
sample_size = 30  # Size of each sample
num_samples = 1000  # Number of samples

# Generate a non-normal population (exponential distribution)
population = np.random.exponential(scale=1/lambda_value, size=population_size)

# Generate sample means
sample_means = [np.mean(np.random.choice(population, sample_size, replace=False)) for _ in range(num_samples)]

# Plot histogram of sample means
plt.hist(sample_means, bins=30, density=True, color='blue', alpha=0.7, edgecolor='black')
plt.xlabel('Sample Mean')
plt.ylabel('Density')
plt.title(f'Central Limit Theorem (CLT) Demonstration\n{num_samples} Samples of Size {sample_size}')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
...
#Q11 Simulate multiple samples from a normal distribution and verify the Central Limit Theorem ?
import numpy as np
import matplotlib.pyplot as plt

# Define parameters for the normal distribution
mean = 50  # Population mean (μ)
std_dev = 15  # Population standard deviation (σ)
population_size = 100000  # Large population size
sample_size = 30  # Size of each sample
num_samples = 1000  # Number of samples

# Generate normally distributed population
population = np.random.normal(mean, std_dev, population_size)

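# Generate sample means
sample_means = [np.mean(np.random.choice(population, sample_size, replace=False)) for _ in range(num_samples)]

# Numerical check: sample means should center on μ with spread close to σ/√n
print(f"Mean of sample means: {np.mean(sample_means):.2f} (μ = {mean})")
print(f"Std of sample means: {np.std(sample_means, ddof=1):.2f} (σ/√n = {std_dev / np.sqrt(sample_size):.2f})")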

# Plot histogram of sample means
plt.hist(sample_means, bins=30, density=True, color='blue', alpha=0.7, edgecolor='black')
plt.xlabel('Sample Mean')
plt.ylabel('Density')
plt.title(f'Central Limit Theorem (CLT) Verification\n{num_samples} Samples of Size {sample_size}')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
...
#Q12 Write a Python function to calculate and plot the standard normal distribution (mean = 0, std = 1) ?
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

def plot_standard_normal_distribution():
    """
    Generate and plot the standard normal distribution (mean=0, std=1).
    """
    x = np.linspace(-4, 4, 1000)  # Generate values from -4 to 4
    y = norm.pdf(x, 0, 1)  # Compute the PDF of the standard normal distribution

    # Plot the standard normal distribution
    plt.plot(x, y, color='blue', label='Standard Normal Distribution')
    plt.fill_between(x, y, alpha=0.2, color='blue')
    plt.xlabel('Value')
    plt.ylabel('Probability Density')
    plt.title('Standard Normal Distribution (μ=0, σ=1)')
    plt.legend()
    plt.grid(True)
    plt.show()

# Call the function to plot the distribution
plot_standard_normal_distribution()
...
#Q13 Generate random variables and calculate their corresponding probabilities using the binomial distribution ?
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import binom

# Define parameters for the binomial distribution
n = 10  # Number of trials
p = 0.5  # Probability of success
size = 1000  # Number of random variables

# Generate random variables from the binomial distribution
binomial_samples = np.random.binomial(n, p, size)

# Calculate probability mass function (PMF) for all possible values
x = np.arange(n+1)  # Possible values (0 to n)
pmf_values = binom.pmf(x, n, p)

# Plot histogram of generated random variables (bin edges at k - 0.5 so bars line up with the integer outcomes 0..n)
plt.hist(binomial_samples, bins=np.arange(n+2) - 0.5, density=True, color='blue', alpha=0.7, edgecolor='black', label="Simulated Data")

# Overlay the theoretical PMF
plt.plot(x, pmf_values, marker='o', linestyle='-', color='red', label="Theoretical PMF")

plt.xlabel('Number of Successes')
plt.ylabel('Probability')
plt.title(f'Binomial Distribution (n={n}, p={p})')
plt.legend()
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
...
#Q14 Write a Python program to calculate the Z-score for a given data point and compare it to a standard normal distribution ?
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

def calculate_z_score(x, mean, std_dev):
    """
    Calculate the Z-score for a given data point.

    Parameters:
    x (float): The data point.
    mean (float): The mean of the dataset.
    std_dev (float): The standard deviation of the dataset.

    Returns:
    float: The Z-score.
    """
    return (x - mean) / std_dev

# Example dataset parameters
mean = 50
std_dev = 10
data_point = 65  # Example data point

# Calculate Z-score
z_score = calculate_z_score(data_point, mean, std_dev)
print(f"Z-score for data point {data_point}: {z_score:.2f}")

# Generate standard normal distribution
x_values = np.linspace(-4, 4, 1000)
y_values = norm.pdf(x_values, 0, 1)

# Plot standard normal distribution
plt.plot(x_values, y_values, color='blue', label="Standard Normal Distribution")

# Highlight the calculated Z-score
plt.axvline(z_score, color='red', linestyle='dashed', label=f"Z-score = {z_score:.2f}")
plt.fill_between(x_values, y_values, where=(x_values <= z_score), color='red', alpha=0.3)

plt.xlabel('Z-score')
plt.ylabel('Probability Density')
plt.title('Comparison of Z-score with Standard Normal Distribution')
plt.legend()
plt.grid(True)
plt.show()
...
#Q15 Implement hypothesis testing using Z-statistics for a sample dataset ?
import numpy as np
import scipy.stats as stats

def z_test(sample_data, pop_mean, pop_std, alpha=0.05):

    # Calculate sample mean and sample size
    sample_mean = np.mean(sample_data)
    n = len(sample_data)

    # Compute Z-statistic
    z_stat = (sample_mean - pop_mean) / (pop_std / np.sqrt(n))

    # Compute critical value (for two-tailed test)
    z_critical = stats.norm.ppf(1 - alpha / 2)

    # Print results
    print(f"Sample Mean: {sample_mean:.2f}")
    print(f"Z-Statistic: {z_stat:.2f}")
    print(f"Critical Value (Zα/2): ±{z_critical:.2f}")

    # Hypothesis test decision
    if abs(z_stat) > z_critical:
        print("Result: Reject the null hypothesis (H0).")
    else:
        print("Result: Fail to reject the null hypothesis (H0).")

# Example usage
np.random.seed(42)
sample_data = np.random.normal(loc=52, scale=5, size=40)  # Sample dataset
pop_mean = 50  # Hypothesized mean
pop_std = 5  # Known population standard deviation

z_test(sample_data, pop_mean, pop_std)
...
#Q16 Create a confidence interval for a dataset using Python and interpret the result ?
import numpy as np
import scipy.stats as stats

def confidence_interval(data, confidence=0.95):
    """
    Calculate the confidence interval for a dataset.

    Parameters:
    data (list or np.array): The dataset.
    confidence (float): Confidence level (default: 0.95).

    Returns:
    tuple: Lower and upper bounds of the confidence interval.
    """
    sample_mean = np.mean(data)
    sample_std = np.std(data, ddof=1)  # Sample standard deviation (ddof=1 for unbiased estimate)
    n = len(data)

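    # Critical z-value (normal approximation, reasonable for n >= 30;
    # for small samples use stats.t.ppf(1 - (1 - confidence) / 2, df=n - 1) instead)
    z_critical = stats.norm.ppf(1 - (1 - confidence) / 2)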

    # Compute margin of error
    margin_of_error = z_critical * (sample_std / np.sqrt(n))

    # Compute confidence interval
    lower_bound = sample_mean - margin_of_error
    upper_bound = sample_mean + margin_of_error

    return lower_bound, upper_bound

# Example dataset
np.random.seed(42)
sample_data = np.random.normal(loc=100, scale=15, size=50)  # Sample dataset

# Calculate 95% confidence interval
ci_lower, ci_upper = confidence_interval(sample_data)
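print(f"95% Confidence Interval: ({ci_lower:.2f}, {ci_upper:.2f})")

# Interpretation: the procedure, not this single interval, has 95% coverage
print("Interpretation: intervals built this way contain the true population mean in about 95% of repeated samples.")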
...
#Q17 Generate data from a normal distribution, then calculate and interpret the confidence interval for its mean ?
import numpy as np
import scipy.stats as stats

# Step 1: Generate data from a normal distribution
np.random.seed(42)  # For reproducibility
mean = 100  # True mean
std_dev = 15  # True standard deviation
size = 50  # Sample size

# Generate random samples
data = np.random.normal(mean, std_dev, size)

# Step 2: Calculate Confidence Interval
sample_mean = np.mean(data)
sample_std = np.std(data, ddof=1)  # Using ddof=1 for an unbiased estimate
confidence = 0.95  # Confidence level

# Compute the critical Z-score
z_critical = stats.norm.ppf(1 - (1 - confidence) / 2)

# Calculate the margin of error
margin_of_error = z_critical * (sample_std / np.sqrt(size))

# Compute confidence interval bounds
lower_bound = sample_mean - margin_of_error
upper_bound = sample_mean + margin_of_error

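# Print results
print(f"Sample Mean: {sample_mean:.2f}")
print(f"95% Confidence Interval: ({lower_bound:.2f}, {upper_bound:.2f})")

# Interpretation: check whether the interval covers the true mean used to generate the data
print(f"True mean {mean} inside the interval: {lower_bound <= mean <= upper_bound}")
print("About 95% of intervals built this way from repeated samples would contain the true mean.")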
...
#Q18 Write a Python script to calculate and visualize the probability density function (PDF) of a normal distribution ?
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

# Define parameters for the normal distribution
mean = 0  # Mean (μ)
std_dev = 1  # Standard deviation (σ)

# Generate values along the x-axis
x_values = np.linspace(-4, 4, 1000)

# Calculate the probability density function (PDF)
pdf_values = norm.pdf(x_values, mean, std_dev)

# Plot the normal distribution
plt.plot(x_values, pdf_values, color='blue', label='Normal Distribution PDF')
plt.fill_between(x_values, pdf_values, alpha=0.2, color='blue')
plt.xlabel('Value')
plt.ylabel('Probability Density')
plt.title('Probability Density Function (PDF) of Normal Distribution')
plt.legend()
plt.grid(True)
plt.show()
...
#Q19 Use Python to calculate and interpret the cumulative distribution function (CDF) of a Poisson distribution ?
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import poisson

# Define parameters
lambda_value = 5  # Mean number of occurrences
x_values = np.arange(0, 15)  # Possible values of X

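# Calculate CDF: each entry is P(X <= x) for the matching x
cdf_values = poisson.cdf(x_values, lambda_value)

# Interpretation: probability of observing at most λ occurrences
print(f"P(X <= {lambda_value}) = {poisson.cdf(lambda_value, lambda_value):.4f}")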

# Plot the CDF
plt.step(x_values, cdf_values, where='post', color='red', label='Poisson CDF')
plt.xlabel('Number of occurrences (x)')
plt.ylabel('Cumulative Probability')
plt.title(f'Poisson CDF (λ={lambda_value})')
plt.grid(True)
plt.legend()
plt.show()
...
#Q20 Simulate a random variable using a continuous uniform distribution and calculate its expected value ?
import numpy as np

# Define parameters for the uniform distribution
a, b = 0, 10  # Range [0, 10]
size = 1000  # Number of random samples

# Generate random variable samples
uniform_samples = np.random.uniform(a, b, size)

# Calculate expected value
expected_value = (a + b) / 2

# Print results
print(f"Generated {size} samples from U({a}, {b})")
print(f"Expected Value (Mean): {expected_value:.2f}")
print(f"Sample Mean: {np.mean(uniform_samples):.2f}")
...
#Q21 Write a Python program to compare the standard deviations of two datasets and visualize the difference ?
import numpy as np
import matplotlib.pyplot as plt

# Generate two datasets with different spreads
np.random.seed(42)
dataset1 = np.random.normal(loc=50, scale=5, size=100)  # Low standard deviation
dataset2 = np.random.normal(loc=50, scale=15, size=100)  # High standard deviation

# Calculate standard deviations
std_dev1 = np.std(dataset1, ddof=1)  # Sample standard deviation (ddof=1)
std_dev2 = np.std(dataset2, ddof=1)

# Print standard deviation results
print(f"Standard Deviation of Dataset 1: {std_dev1:.2f}")
print(f"Standard Deviation of Dataset 2: {std_dev2:.2f}")

# Visualize the datasets
plt.hist(dataset1, bins=20, alpha=0.7, label=f'Dataset 1 (s={std_dev1:.2f})', color='blue', edgecolor='black')
plt.hist(dataset2, bins=20, alpha=0.7, label=f'Dataset 2 (s={std_dev2:.2f})', color='red', edgecolor='black')

plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('Comparison of Standard Deviations')
plt.legend()
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
...
#Q22 Calculate the range and interquartile range (IQR) of a dataset generated from a normal distribution ?
import numpy as np

# Generate data from a normal distribution
np.random.seed(42)
data = np.random.normal(loc=100, scale=15, size=100)  # Mean=100, StdDev=15, Sample size=100

# Calculate Range
data_range = np.max(data) - np.min(data)

# Calculate Interquartile Range (IQR)
q1 = np.percentile(data, 25)
q3 = np.percentile(data, 75)
iqr = q3 - q1

# Print results
print(f"Range: {data_range:.2f}")
print(f"Interquartile Range (IQR): {iqr:.2f}")
...
#Q23 Implement Z-score normalization on a dataset and visualize its transformation ?
import numpy as np
import matplotlib.pyplot as plt

# Generate a dataset
np.random.seed(42)
data = np.random.normal(loc=50, scale=15, size=100)  # Original dataset

# Compute Z-score normalization
mean = np.mean(data)
std_dev = np.std(data, ddof=1)  # Sample standard deviation
normalized_data = (data - mean) / std_dev

# Plot before and after normalization
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.hist(data, bins=20, color='blue', alpha=0.7, edgecolor='black')
plt.xlabel('Original Values')
plt.ylabel('Frequency')
plt.title('Original Data Distribution')

plt.subplot(1, 2, 2)
plt.hist(normalized_data, bins=20, color='red', alpha=0.7, edgecolor='black')
plt.xlabel('Z-score Normalized Values')
plt.ylabel('Frequency')
plt.title('Z-score Normalized Data Distribution')

plt.tight_layout()
plt.show()
...
#Q24 Write a Python function to calculate the skewness and kurtosis of a dataset generated from a normal distribution.?
import numpy as np
import scipy.stats as stats

def calculate_skewness_kurtosis(data):
    """
    Calculate skewness and kurtosis of a dataset.

    Parameters:
    data (list or np.array): The dataset.

    Returns:
    tuple: Skewness and kurtosis values.
    """
    skewness = stats.skew(data)
    kurtosis = stats.kurtosis(data, fisher=True)  # Fisher=True gives excess kurtosis (subtracts 3)
    return skewness, kurtosis

# Generate data from a normal distribution
np.random.seed(42)
data = np.random.normal(loc=100, scale=15, size=1000)

# Compute skewness and kurtosis
skewness, kurtosis = calculate_skewness_kurtosis(data)
print(f"Skewness: {skewness:.2f}")
print(f"Kurtosis (Excess): {kurtosis:.2f}")
...

...
**Practical Part-2**
#Q1 Write a Python program to perform a Z-test for comparing a sample mean to a known population mean and interpret the results ?
import numpy as np
import scipy.stats as stats

def z_test(sample_data, pop_mean, pop_std, alpha=0.05):
    """
    Perform a Z-test for a sample mean compared to a known population mean.

    Parameters:
    sample_data (list or np.array): Sample dataset.
    pop_mean (float): Population mean (hypothesized).
    pop_std (float): Population standard deviation (σ).
    alpha (float): Significance level (default: 0.05).

    Returns:
    None (prints hypothesis test results).
    """

    # Compute sample statistics
    sample_mean = np.mean(sample_data)
    sample_size = len(sample_data)

    # Calculate Z-statistic
    z_stat = (sample_mean - pop_mean) / (pop_std / np.sqrt(sample_size))

    # Compute critical value for two-tailed test
    z_critical = stats.norm.ppf(1 - alpha / 2)

    # Print results
    print(f"Sample Mean: {sample_mean:.2f}")
    print(f"Z-Statistic: {z_stat:.2f}")
    print(f"Critical Value (±Zα/2): {z_critical:.2f}")

    # Hypothesis test decision
    if abs(z_stat) > z_critical:
        print("Result: Reject the null hypothesis (H0).")
    else:
        print("Result: Fail to reject the null hypothesis (H0).")

# Example usage
np.random.seed(42)
sample_data = np.random.normal(loc=52, scale=5, size=40)  # Sample dataset
pop_mean = 50  # Hypothesized population mean
pop_std = 5  # Known population standard deviation

z_test(sample_data, pop_mean, pop_std)
...
#Q2 Simulate random data to perform hypothesis testing and calculate the corresponding P-value using Python ?
import numpy as np
import scipy.stats as stats

# Step 1: Generate two random datasets (simulating two populations)
np.random.seed(42)
sample1 = np.random.normal(loc=50, scale=10, size=30)  # Mean = 50, StdDev = 10, Sample size = 30
sample2 = np.random.normal(loc=55, scale=10, size=30)  # Mean = 55, StdDev = 10, Sample size = 30

# Step 2: Perform independent t-test for hypothesis testing
t_statistic, p_value = stats.ttest_ind(sample1, sample2)

# Step 3: Print results
print(f"T-statistic: {t_statistic:.2f}")
print(f"P-value: {p_value:.4f}")

# Step 4: Interpretation
alpha = 0.05  # Significance level
if p_value < alpha:
    print("Result: Reject the null hypothesis (H0). The two samples are significantly different.")
else:
    print("Result: Fail to reject the null hypothesis (H0). No significant difference between the samples.")
...
#Q3 Implement a one-sample Z-test using Python to compare the sample mean with the population mean ?
import numpy as np
import scipy.stats as stats

def one_sample_z_test(sample_data, pop_mean, pop_std, alpha=0.05):
    """
    Perform a one-sample Z-test.

    Parameters:
    sample_data (list or np.array): Sample dataset.
    pop_mean (float): Population mean (hypothesized).
    pop_std (float): Known population standard deviation (σ).
    alpha (float): Significance level (default: 0.05).

    Returns:
    None (prints hypothesis test results).
    """

    # Compute sample statistics
    sample_mean = np.mean(sample_data)
    sample_size = len(sample_data)

    # Calculate Z-statistic
    z_stat = (sample_mean - pop_mean) / (pop_std / np.sqrt(sample_size))

    # Compute critical value for two-tailed test
    z_critical = stats.norm.ppf(1 - alpha / 2)

    # Compute P-value
    p_value = 2 * (1 - stats.norm.cdf(abs(z_stat)))  # Two-tailed test

    # Print results
    print(f"Sample Mean: {sample_mean:.2f}")
    print(f"Z-Statistic: {z_stat:.2f}")
    print(f"Critical Value (±Zα/2): {z_critical:.2f}")
    print(f"P-value: {p_value:.4f}")

    # Hypothesis test decision
    if abs(z_stat) > z_critical:
        print("Result: Reject the null hypothesis (H0). The sample mean is significantly different from the population mean.")
    else:
        print("Result: Fail to reject the null hypothesis (H0). No significant difference detected.")

# Example usage
np.random.seed(42)
sample_data = np.random.normal(loc=52, scale=5, size=40)  # Sample dataset
pop_mean = 50  # Hypothesized population mean
pop_std = 5  # Known population standard deviation

one_sample_z_test(sample_data, pop_mean, pop_std)
...
#Q4 Perform a two-tailed Z-test using Python and visualize the decision region on a plot ?
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

def two_tailed_z_test(sample_data, pop_mean, pop_std, alpha=0.05):

    # Compute sample statistics
    sample_mean = np.mean(sample_data)
    sample_size = len(sample_data)

    # Calculate Z-statistic
    z_stat = (sample_mean - pop_mean) / (pop_std / np.sqrt(sample_size))

    # Compute critical values for a two-tailed test
    z_critical = stats.norm.ppf(1 - alpha / 2)

    # Compute P-value
    p_value = 2 * (1 - stats.norm.cdf(abs(z_stat)))  # Two-tailed test

    # Print results
    print(f"Sample Mean: {sample_mean:.2f}")
    print(f"Z-Statistic: {z_stat:.2f}")
    print(f"Critical Value (±Zα/2): {z_critical:.2f}")
    print(f"P-value: {p_value:.4f}")

    # Decision rule
    if abs(z_stat) > z_critical:
        print("Result: Reject the null hypothesis (H0).")
    else:
        print("Result: Fail to reject the null hypothesis (H0).")

    # Visualization of decision region
    x_values = np.linspace(-4, 4, 1000)
    y_values = stats.norm.pdf(x_values, 0, 1)

    plt.plot(x_values, y_values, color='blue', label="Standard Normal Distribution")
    plt.axvline(-z_critical, color='red', linestyle='dashed', label=f"Critical Value (-Zα/2 = {-z_critical:.2f})")
    plt.axvline(z_critical, color='red', linestyle='dashed', label=f"Critical Value (Zα/2 = {z_critical:.2f})")
    plt.axvline(z_stat, color='black', linestyle='solid', label=f"Z-Statistic ({z_stat:.2f})")
    plt.fill_between(x_values, y_values, where=(x_values < -z_critical) | (x_values > z_critical), color='red', alpha=0.3)

    plt.xlabel("Z-score")
    plt.ylabel("Probability Density")
    plt.title("Two-Tailed Z-Test Decision Region")
    plt.legend()
    plt.grid(True)
    plt.show()

np.random.seed(42)
sample_data = np.random.normal(loc=52, scale=5, size=40)  # Sample dataset
pop_mean = 50  # Hypothesized population mean
pop_std = 5  # Known population standard deviation

two_tailed_z_test(sample_data, pop_mean, pop_std)
...
#Q5 Create a Python function that calculates and visualizes Type 1 and Type 2 errors during hypothesis testing ?
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

def visualize_type1_type2_errors(mu_null, mu_alt, sigma, alpha=0.05, n=30):
    """
    Computes and visualizes Type 1 and Type 2 error probabilities for a
    one-tailed (upper-tail) Z-test. Assumes mu_alt > mu_null.

    Parameters:
    mu_null (float): Hypothesized population mean (H0).
    mu_alt (float): Actual population mean (H1).
    sigma (float): Population standard deviation.
    alpha (float): Significance level (default: 0.05).
    n (int): Sample size.
    """

    # Compute critical value for Type 1 error (false positive)
    z_critical = stats.norm.ppf(1 - alpha)
    critical_value = mu_null + z_critical * (sigma / np.sqrt(n))

    # Compute Type 2 error probability (beta)
    beta = stats.norm.cdf(critical_value, loc=mu_alt, scale=sigma / np.sqrt(n))
    power = 1 - beta  # Power of the test

    # Generate the sampling distributions of the mean (their spread is the standard error σ/√n)
    se = sigma / np.sqrt(n)
    x_values = np.linspace(mu_null - 4*se, mu_alt + 4*se, 1000)
    y_null = stats.norm.pdf(x_values, mu_null, se)
    y_alt = stats.norm.pdf(x_values, mu_alt, se)

    # Plot distributions
    plt.plot(x_values, y_null, label="Null Distribution (H0)", color='blue')
    plt.plot(x_values, y_alt, label="Alternative Distribution (H1)", color='red')

    # Highlight Type 1 Error Region
    plt.fill_between(x_values, y_null, where=(x_values >= critical_value), color='blue', alpha=0.3, label="Type 1 Error (α)")

    # Highlight Type 2 Error Region
    plt.fill_between(x_values, y_alt, where=(x_values < critical_value), color='red', alpha=0.3, label="Type 2 Error (β)")

    plt.axvline(critical_value, color='black', linestyle="dashed", label="Critical Value")
    plt.xlabel("Value")
    plt.ylabel("Probability Density")
    plt.title("Type 1 and Type 2 Error Visualization")
    plt.legend()
    plt.grid(True)
    plt.show()

    print(f"Critical Value: {critical_value:.2f}")
    print(f"Type 1 Error (α): {alpha:.2f}")
    print(f"Type 2 Error (β): {beta:.2f}")
    print(f"Test Power (1 - β): {power:.2f}")

# Example usage
visualize_type1_type2_errors(mu_null=50, mu_alt=55, sigma=10, alpha=0.05, n=30)
...
#Q6 Write a Python program to perform an independent T-test and interpret the results ?
import numpy as np
import scipy.stats as stats

def independent_t_test(sample1, sample2, alpha=0.05):
    """
    Perform an independent T-test to compare two sample means.

    Parameters:
    sample1 (list or np.array): First dataset.
    sample2 (list or np.array): Second dataset.
    alpha (float): Significance level (default: 0.05).

    Returns:
    None (prints test results).
    """

    # Perform an independent T-test
    t_statistic, p_value = stats.ttest_ind(sample1, sample2)

    # Print results
    print(f"T-statistic: {t_statistic:.2f}")
    print(f"P-value: {p_value:.4f}")

    # Interpretation
    if p_value < alpha:
        print("Result: Reject the null hypothesis (H0). There is a significant difference between the two samples.")
    else:
        print("Result: Fail to reject the null hypothesis (H0). No significant difference detected.")

# Example usage
np.random.seed(42)
sample1 = np.random.normal(loc=50, scale=10, size=30)  # Group 1
sample2 = np.random.normal(loc=55, scale=10, size=30)  # Group 2

independent_t_test(sample1, sample2)
...
#Q7  Perform a paired sample T-test using Python and visualize the comparison results ?
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

def paired_t_test(pre_data, post_data, alpha=0.05):
    """
    Perform a paired sample T-test and visualize the comparison.

    Parameters:
    pre_data (list or np.array): First dataset (before intervention).
    post_data (list or np.array): Second dataset (after intervention).
    alpha (float): Significance level (default: 0.05).

    Returns:
    None (prints test results and shows visualization).
    """

    # Compute paired differences
    differences = np.array(post_data) - np.array(pre_data)

    # Perform the paired sample T-test (post first, so the sign of t matches Post - Pre)
    t_statistic, p_value = stats.ttest_rel(post_data, pre_data)

    # Print results
    print(f"Mean of Differences: {np.mean(differences):.2f}")
    print(f"T-statistic: {t_statistic:.2f}")
    print(f"P-value: {p_value:.4f}")

    # Decision rule
    if p_value < alpha:
        print("Result: Reject the null hypothesis (H0). The two samples are significantly different.")
    else:
        print("Result: Fail to reject the null hypothesis (H0). No significant difference detected.")

    # Visualization: Histogram of differences
    plt.figure(figsize=(12, 5))

    plt.subplot(1, 2, 1)
    plt.hist(differences, bins=15, color='purple', alpha=0.7, edgecolor='black')
    plt.xlabel("Difference (Post - Pre)")
    plt.ylabel("Frequency")
    plt.title("Histogram of Differences")

    # Visualization: Boxplot comparison
    plt.subplot(1, 2, 2)
    plt.boxplot([pre_data, post_data], patch_artist=True)
    plt.xticks([1, 2], ["Pre-Test", "Post-Test"])  # Label the boxes without the deprecated labels= argument
    plt.title("Boxplot Comparison of Paired Samples")

    plt.tight_layout()
    plt.show()

# Example usage: Simulate pre and post test scores
np.random.seed(42)
pre_scores = np.random.normal(loc=50, scale=10, size=30)  # Pre-test scores
post_scores = pre_scores + np.random.normal(loc=2, scale=5, size=30)  # Post-test scores with slight increase

paired_t_test(pre_scores, post_scores)
...
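A paired t-test is equivalent to a one-sample t-test on the per-pair differences against a mean of zero. The sketch below checks this numerically. The data are simulated purely for illustration, and the variable names are assumptions rather than part of the example above.

```python
import numpy as np
import scipy.stats as stats

# Illustrative simulated data (assumed, not from a real study)
rng = np.random.default_rng(0)
pre = rng.normal(loc=50, scale=10, size=30)
post = pre + rng.normal(loc=2, scale=5, size=30)

# Paired test on (post, pre)
t_paired, p_paired = stats.ttest_rel(post, pre)

# One-sample test on the differences against a mean of 0
t_diff, p_diff = stats.ttest_1samp(post - pre, popmean=0)

print(f"Paired:      t={t_paired:.4f}, p={p_paired:.4f}")
print(f"Differences: t={t_diff:.4f}, p={p_diff:.4f}")
```

Both lines should report the same statistic and p-value, which is why the histogram of differences is the most direct picture of what the paired test evaluates.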
#Q8 Simulate data and perform both Z-test and T-test, then compare the results using Python ?
import numpy as np
import scipy.stats as stats

def perform_tests(sample_data, pop_mean, pop_std_known, alpha=0.05):
    """
    Run a one-sample Z-test (known σ) and a one-sample T-test (unknown σ) on the same data.

    Prints both test statistics, p-values, and decisions so they can be compared.
    """
    sample_mean = np.mean(sample_data)
    n = len(sample_data)

    # Perform Z-test (assuming known population standard deviation)
    z_stat = (sample_mean - pop_mean) / (pop_std_known / np.sqrt(n))
    z_p_value = 2 * stats.norm.sf(abs(z_stat))  # Two-tailed test; sf avoids the precision loss of 1 - cdf

    # Perform T-test (assuming unknown population standard deviation)
    t_stat, t_p_value = stats.ttest_1samp(sample_data, pop_mean)

    # Print results
    print("=== Z-Test (Known σ) ===")
    print(f"Z-Statistic: {z_stat:.2f}")
    print(f"P-value: {z_p_value:.4f}")
    print("Decision:", "Reject H0" if z_p_value < alpha else "Fail to reject H0")

    print("\n=== T-Test (Unknown σ) ===")
    print(f"T-Statistic: {t_stat:.2f}")
    print(f"P-value: {t_p_value:.4f}")
    print("Decision:", "Reject H0" if t_p_value < alpha else "Fail to reject H0")

# Example usage: Generate random dataset
np.random.seed(42)
sample_data = np.random.normal(loc=52, scale=5, size=30)  # Sample dataset
pop_mean = 50  # Hypothesized population mean
pop_std_known = 5  # Known population standard deviation (for Z-test)

perform_tests(sample_data, pop_mean, pop_std_known)
...
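One way to compare the two tests is through their two-tailed critical values: the t critical value is larger than the z critical value for small samples and approaches it as n grows. A minimal sketch, with sample sizes chosen arbitrarily for illustration:

```python
import scipy.stats as stats

alpha = 0.05
z_crit = stats.norm.ppf(1 - alpha / 2)

# Two-tailed t critical values for a few illustrative sample sizes
t_crits = {}
for n in (5, 30, 1000):
    t_crits[n] = stats.t.ppf(1 - alpha / 2, df=n - 1)
    print(f"n={n:>4}: t critical = {t_crits[n]:.4f}, z critical = {z_crit:.4f}")
```

This is why the two tests tend to give similar decisions at n = 30 but can disagree for very small samples.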
#Q9 Write a Python function to calculate the confidence interval for a sample mean and explain its significance ?
import numpy as np
import scipy.stats as stats

def confidence_interval(data, confidence=0.95):
    """
    Calculate the confidence interval for a sample mean.

    Parameters:
    data (list or np.array): The dataset.
    confidence (float): Confidence level (default: 0.95).

    Returns:
    tuple: Lower and upper bounds of the confidence interval.
    """
    sample_mean = np.mean(data)
    sample_std = np.std(data, ddof=1)  # ddof=1 for unbiased sample standard deviation
    n = len(data)

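    # Critical z-value for the given confidence level (large-sample normal approximation;
    # with a small n and unknown σ, a t critical value with n-1 degrees of freedom is more appropriate)
    z_critical = stats.norm.ppf(1 - (1 - confidence) / 2)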

    # Compute margin of error
    margin_of_error = z_critical * (sample_std / np.sqrt(n))

    # Compute confidence interval bounds
    lower_bound = sample_mean - margin_of_error
    upper_bound = sample_mean + margin_of_error

    return lower_bound, upper_bound

# Example usage
np.random.seed(42)
sample_data = np.random.normal(loc=100, scale=15, size=50)  # Simulated dataset

# Compute confidence interval
ci_lower, ci_upper = confidence_interval(sample_data)
print(f"95% Confidence Interval: ({ci_lower:.2f}, {ci_upper:.2f})")
...
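On significance: a 95% confidence interval is built so that, over repeated sampling, about 95% of such intervals would contain the true population mean. When the population standard deviation is unknown, a t-based interval is usually preferred, especially for small samples. The sketch below is an alternative to the function above, not a replacement for it; `t_confidence_interval` is a hypothetical name and the data are simulated.

```python
import numpy as np
import scipy.stats as stats

def t_confidence_interval(data, confidence=0.95):
    """Confidence interval for the mean using the t-distribution (hypothetical helper)."""
    data = np.asarray(data, dtype=float)
    n = data.size
    mean = data.mean()
    sem = stats.sem(data)  # Standard error of the mean (uses ddof=1 by default)
    t_critical = stats.t.ppf(1 - (1 - confidence) / 2, df=n - 1)
    margin = t_critical * sem
    return mean - margin, mean + margin

rng = np.random.default_rng(0)
small_sample = rng.normal(loc=100, scale=15, size=10)  # Illustrative small sample
t_low, t_high = t_confidence_interval(small_sample)
print(f"95% t-based CI: ({t_low:.2f}, {t_high:.2f})")
```

For the same data, this interval is wider than the z-based one, reflecting the extra uncertainty from estimating the standard deviation.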
#Q10 Write a Python program to calculate the margin of error for a given confidence level using sample data ?
import numpy as np
import scipy.stats as stats

def margin_of_error(data, confidence=0.95):
    """
    Calculate the margin of error for a given confidence level using sample data.

    Parameters:
    data (list or np.array): Sample dataset.
    confidence (float): Confidence level (default: 0.95).

    Returns:
    float: The calculated margin of error.
    """
    sample_std = np.std(data, ddof=1)  # Unbiased sample standard deviation
    n = len(data)

    # Critical z-value for the confidence level (large-sample normal approximation)
    z_critical = stats.norm.ppf(1 - (1 - confidence) / 2)

    # Compute margin of error
    moe = z_critical * (sample_std / np.sqrt(n))
    return moe

# Example usage
np.random.seed(42)
sample_data = np.random.normal(loc=100, scale=15, size=50)  # Simulated dataset

# Calculate margin of error
moe_value = margin_of_error(sample_data)
print(f"Margin of Error (95% Confidence): ±{moe_value:.2f}")
...
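The margin-of-error formula can also be inverted to estimate the sample size needed for a target margin E: n = (z·σ / E)², rounded up. The sketch below assumes a planning value for σ is available; `required_sample_size` is a hypothetical helper and the inputs are illustrative.

```python
import math
import scipy.stats as stats

def required_sample_size(sigma, target_moe, confidence=0.95):
    """Smallest n whose z-based margin of error does not exceed target_moe (hypothetical helper)."""
    z_critical = stats.norm.ppf(1 - (1 - confidence) / 2)
    return math.ceil((z_critical * sigma / target_moe) ** 2)

# Assumed planning values: sigma = 15, desired margin of error = 3
n_needed = required_sample_size(sigma=15, target_moe=3)
print(f"Required sample size: {n_needed}")
```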
#Q11 Implement a Bayesian inference method using Bayes' Theorem in Python and explain the process ?
import numpy as np

def bayesian_inference(prior, likelihood, marginal):
    """
    Compute Bayesian inference using Bayes' Theorem.

    Parameters:
    prior (float): Prior probability (P(H)).
    likelihood (float): Likelihood of data given hypothesis (P(D|H)).
    marginal (float): Marginal probability of data (P(D)).

    Returns:
    float: Posterior probability (P(H|D)).
    """
    posterior = (likelihood * prior) / marginal
    return posterior

# Example scenario:
# Let's say we're diagnosing a disease:
# P(Disease) = 0.01 (Prior probability of having the disease)
# P(Positive Test | Disease) = 0.95 (Likelihood)
# P(Positive Test) = 0.05 (Marginal probability based on overall test outcomes)

prior = 0.01
likelihood = 0.95
marginal = 0.05

posterior = bayesian_inference(prior, likelihood, marginal)
print(f"Posterior Probability of Disease Given Positive Test: {posterior:.4f}")
...
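In practice the marginal probability P(D) is rarely given directly. It can be derived with the law of total probability: P(D) = P(D|H)·P(H) + P(D|¬H)·P(¬H). The sketch below shows that step for the diagnostic example; the false positive rate of 0.05 is an assumed value, and `posterior_from_rates` is a hypothetical helper.

```python
def posterior_from_rates(prior, sensitivity, false_positive_rate):
    """Posterior P(H|D), deriving P(D) from the law of total probability (hypothetical helper)."""
    # P(D) = P(D|H) * P(H) + P(D|not H) * P(not H)
    marginal = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / marginal

# Assumed inputs: 1% prevalence, 95% sensitivity, 5% false positive rate
posterior_ltp = posterior_from_rates(prior=0.01, sensitivity=0.95, false_positive_rate=0.05)
print(f"Posterior with derived marginal: {posterior_ltp:.4f}")
```

Even with a fairly accurate test, the posterior stays well below 50% here because the disease is rare, so most positive results come from the much larger healthy group.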
#Q12 Perform a Chi-square test for independence between two categorical variables in Python ?
import numpy as np
import scipy.stats as stats

# Define observed frequency table (contingency table)
# Example: Survey on product preference based on gender
#            Product A  Product B
# Male        30        10
# Female      20        40

observed = np.array([[30, 10], [20, 40]])

# Perform Chi-square test for independence
# (SciPy applies Yates' continuity correction by default for 2x2 tables)
chi2_stat, p_value, dof, expected = stats.chi2_contingency(observed)

# Print results
print(f"Chi-square Statistic: {chi2_stat:.2f}")
print(f"Degrees of Freedom: {dof}")
print(f"P-value: {p_value:.4f}")

# Decision rule
alpha = 0.05  # Significance level
if p_value < alpha:
    print("Result: Reject the null hypothesis (H0). There is a significant association between the variables.")
else:
    print("Result: Fail to reject the null hypothesis (H0). No significant evidence of an association between the variables.")

# Print expected frequencies
print("\nExpected Frequencies:")
print(expected)
...
#Q13  Write a Python program to calculate the expected frequencies for a Chi-square test based on observed data ?
import numpy as np
import scipy.stats as stats

def calculate_expected_frequencies(observed):
    """
    Compute expected frequencies for a Chi-square test based on observed data.

    Parameters:
    observed (np.array): Contingency table with observed frequencies.

    Returns:
    np.array: Expected frequencies table.
    """
    # Perform Chi-square test to get expected frequencies
    _, _, _, expected = stats.chi2_contingency(observed)
    return expected

# Example observed frequency table
#            Product A  Product B
# Male        30        10
# Female      20        40
observed_data = np.array([[30, 10], [20, 40]])

# Calculate expected frequencies
expected_frequencies = calculate_expected_frequencies(observed_data)

# Print results
print("Observed Frequencies:")
print(observed_data)
print("\nExpected Frequencies:")
print(np.round(expected_frequencies, 2))  # Rounded for readability
...
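The expected frequencies can also be computed by hand: under independence, each cell's expected count is (row total × column total) / grand total. The sketch below applies that formula to the same table and compares the resulting uncorrected statistic with SciPy's, using `correction=False` to switch off the Yates correction that SciPy applies to 2x2 tables by default.

```python
import numpy as np
import scipy.stats as stats

observed = np.array([[30, 10], [20, 40]])

row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)
grand_total = observed.sum()

# Expected count per cell under independence
expected_manual = np.outer(row_totals, col_totals) / grand_total

# Uncorrected Pearson chi-square statistic
chi2_manual = ((observed - expected_manual) ** 2 / expected_manual).sum()

print(expected_manual)
print(f"Uncorrected chi-square statistic: {chi2_manual:.4f}")
```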
#Q14 Perform a goodness-of-fit test using Python to compare the observed data to an expected distribution ?
import numpy as np
import scipy.stats as stats

def goodness_of_fit_test(observed, expected, alpha=0.05):
    """
    Perform a Chi-square goodness-of-fit test.

    Parameters:
    observed (list or np.array): Observed frequencies.
    expected (list or np.array): Expected frequencies.
    alpha (float): Significance level (default: 0.05).

    Returns:
    None (prints test results).
    """
    # Perform Chi-square goodness-of-fit test
    chi2_stat, p_value = stats.chisquare(observed, expected)

    # Print results
    print(f"Chi-square Statistic: {chi2_stat:.2f}")
    print(f"P-value: {p_value:.4f}")

    # Decision rule
    if p_value < alpha:
        print("Result: Reject the null hypothesis (H0). The observed data does NOT fit the expected distribution.")
    else:
        print("Result: Fail to reject the null hypothesis (H0). No significant evidence that the observed data deviates from the expected distribution.")

# Example observed and expected frequencies
observed_data = np.array([25, 30, 45])  # Sample observed counts
expected_data = np.array([33, 33, 34])  # Hypothetical expected distribution

goodness_of_fit_test(observed_data, expected_data)
...
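For reference, the goodness-of-fit statistic is the sum of (observed − expected)² / expected over the categories, with k − 1 degrees of freedom when no parameters are estimated from the data. A minimal sketch reproducing the calculation by hand for the same counts:

```python
import numpy as np
import scipy.stats as stats

observed = np.array([25, 30, 45])
expected = np.array([33, 33, 34])

# Pearson chi-square statistic computed term by term
chi2_manual = np.sum((observed - expected) ** 2 / expected)
dof = len(observed) - 1

# Upper-tail probability of the chi-square distribution
p_manual = stats.chi2.sf(chi2_manual, dof)

print(f"Chi-square statistic: {chi2_manual:.4f}, df={dof}, p-value: {p_manual:.4f}")
```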
#Q15 Create a Python script to simulate and visualize the Chi-square distribution and discuss its characteristics ?
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats

def visualize_chi_square(df_values, num_samples=1000):
    """
    Simulate and visualize the Chi-square distribution for different degrees of freedom.

    Parameters:
    df_values (list): List of degrees of freedom values.
    num_samples (int): Number of random samples (default: 1000).
    """
    x_values = np.linspace(0, max(df_values) * 3, 1000)  # X-axis range

    plt.figure(figsize=(10, 6))

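    for df in df_values:
        # Theoretical density
        (line,) = plt.plot(x_values, stats.chi2.pdf(x_values, df), label=f"df={df}")
        # Simulated draws, shown in the same color as the density curve
        samples = np.random.chisquare(df, size=num_samples)
        plt.hist(samples, bins=50, density=True, alpha=0.3, color=line.get_color())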

    plt.xlabel("Value")
    plt.ylabel("Probability Density")
    plt.title("Chi-square Distribution for Different Degrees of Freedom")
    plt.legend()
    plt.grid(True)
    plt.show()

# Example Usage: Simulate Chi-square distributions for df = [2, 5, 10]
visualize_chi_square(df_values=[2, 5, 10])
...
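On its characteristics: a chi-square variable with k degrees of freedom is the sum of k squared standard normal variables, so it is non-negative and right-skewed, with mean k and variance 2k, and it becomes more symmetric as k grows. The sketch below checks the mean and variance by simulation; the seed, degrees of freedom, and sample count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
df = 5
num_samples = 200_000

# Sum of df squared standard normals gives one chi-square draw per row
z = rng.standard_normal(size=(num_samples, df))
chi2_samples = (z ** 2).sum(axis=1)

sample_mean = chi2_samples.mean()
sample_var = chi2_samples.var(ddof=1)

print(f"Simulated mean: {sample_mean:.3f} (theory: {df})")
print(f"Simulated variance: {sample_var:.3f} (theory: {2 * df})")
```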
#Q16 Implement an F-test using Python to compare the variances of two random samples ?
import numpy as np
import scipy.stats as stats

def f_test(sample1, sample2, alpha=0.05):
    """
    Perform an F-test to compare variances of two samples.

    Parameters:
    sample1 (np.array): First dataset.
    sample2 (np.array): Second dataset.
    alpha (float): Significance level (default: 0.05).

    Returns:
    None (prints test results).
    """

    # Compute sample variances
    var1 = np.var(sample1, ddof=1)  # Unbiased variance (ddof=1)
    var2 = np.var(sample2, ddof=1)

    # Compute F-statistic with the larger variance in the numerator,
    # keeping each variance paired with its own degrees of freedom
    if var1 >= var2:
        f_stat, df1, df2 = var1 / var2, len(sample1) - 1, len(sample2) - 1
    else:
        f_stat, df1, df2 = var2 / var1, len(sample2) - 1, len(sample1) - 1

    # Compute P-value
    p_value = 2 * min(stats.f.cdf(f_stat, df1, df2), 1 - stats.f.cdf(f_stat, df1, df2))  # Two-tailed test

    # Print results
    print(f"F-statistic: {f_stat:.2f}")
    print(f"Degrees of Freedom: ({df1}, {df2})")
    print(f"P-value: {p_value:.4f}")

    # Decision rule
    if p_value < alpha:
        print("Result: Reject the null hypothesis (H0). The variances are significantly different.")
    else:
        print("Result: Fail to reject the null hypothesis (H0). No significant difference in variances.")

# Example usage
np.random.seed(42)
sample1 = np.random.normal(loc=100, scale=15, size=30)  # Sample 1
sample2 = np.random.normal(loc=100, scale=25, size=30)  # Sample 2 (higher variance)

f_test(sample1, sample2)
...
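The variance-ratio F-test is sensitive to departures from normality. Levene's test, available as `scipy.stats.levene`, is a commonly used and more robust alternative. The sketch below is offered as a complement to the F-test above, not as the method used there; the data are simulated with deliberately different spreads.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(0)
a = rng.normal(loc=100, scale=5, size=200)   # Illustrative low-variance sample
b = rng.normal(loc=100, scale=25, size=200)  # Illustrative high-variance sample

# center="median" gives the Brown-Forsythe variant, which is robust to skewed data
lev_stat, lev_p = stats.levene(a, b, center="median")

print(f"Levene statistic: {lev_stat:.2f}, p-value: {lev_p:.4g}")
```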
#Q17  Write a Python program to perform an ANOVA test to compare means between multiple groups and interpret the results ?
import numpy as np
import scipy.stats as stats

def anova_test(*groups, alpha=0.05):
    """
    Perform an ANOVA test to compare means across multiple groups.

    Parameters:
    groups (tuple of np.array): Multiple datasets.
    alpha (float): Significance level (default: 0.05).

    Returns:
    None (prints test results).
    """
    # Perform one-way ANOVA
    f_statistic, p_value = stats.f_oneway(*groups)

    # Print results
    print(f"F-statistic: {f_statistic:.2f}")
    print(f"P-value: {p_value:.4f}")

    # Interpretation
    if p_value < alpha:
        print("Result: Reject the null hypothesis (H0). At least one group mean is significantly different.")
    else:
        print("Result: Fail to reject the null hypothesis (H0). No significant difference detected.")

# Example usage: Generate 3 sample groups
np.random.seed(42)
group1 = np.random.normal(loc=50, scale=10, size=30)
group2 = np.random.normal(loc=55, scale=10, size=30)
group3 = np.random.normal(loc=60, scale=10, size=30)

anova_test(group1, group2, group3)
...
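To interpret the F-statistic, it helps to see where it comes from: ANOVA splits the total variation into a between-group part and a within-group part, and F is the ratio of their mean squares. A minimal sketch of that decomposition on simulated groups (seed and group parameters are illustrative):

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(0)
groups = [rng.normal(loc=m, scale=10, size=30) for m in (50, 55, 60)]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)            # Number of groups
n_total = all_values.size  # Total number of observations

# Between-group and within-group sums of squares
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# Mean squares and their ratio
ms_between = ss_between / (k - 1)
ms_within = ss_within / (n_total - k)
f_manual = ms_between / ms_within
p_manual = stats.f.sf(f_manual, k - 1, n_total - k)

print(f"F = {f_manual:.4f}, p = {p_manual:.4g}")
```

A large F means the group means vary more than the spread within groups would explain by chance.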
#Q18 Perform a one-way ANOVA test using Python to compare the means of different groups and plot the results ?
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

def one_way_anova_test(*groups, alpha=0.05):
    """
    Perform a one-way ANOVA test to compare means across multiple groups and visualize the results.

    Parameters:
    groups (tuple of np.array): Multiple datasets.
    alpha (float): Significance level (default: 0.05).

    Returns:
    None (prints test results and displays boxplots).
    """
    # Perform one-way ANOVA
    f_statistic, p_value = stats.f_oneway(*groups)

    # Print results
    print(f"F-statistic: {f_statistic:.2f}")
    print(f"P-value: {p_value:.4f}")

    # Interpretation
    if p_value < alpha:
        print("Result: Reject the null hypothesis (H0). At least one group mean is significantly different.")
    else:
        print("Result: Fail to reject the null hypothesis (H0). No significant difference detected.")

    # Visualization: Boxplot comparison
    plt.figure(figsize=(8, 6))
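    plt.boxplot(groups, patch_artist=True)
    # Set labels via xticks; the `labels` keyword is deprecated in newer Matplotlib
    plt.xticks(list(range(1, len(groups) + 1)), [f"Group {i+1}" for i in range(len(groups))])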
    plt.title("Comparison of Group Means using ANOVA")
    plt.xlabel("Groups")
    plt.ylabel("Values")
    plt.grid(True)
    plt.show()

# Example usage: Generate 3 sample groups
np.random.seed(42)
group1 = np.random.normal(loc=50, scale=10, size=30)
group2 = np.random.normal(loc=55, scale=10, size=30)
group3 = np.random.normal(loc=60, scale=10, size=30)

one_way_anova_test(group1, group2, group3)
...
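A significant ANOVA only says that at least one mean differs; it does not say which. One simple follow-up, shown here as a sketch rather than a recommended procedure, is pairwise t-tests with a Bonferroni adjustment, where each p-value is multiplied by the number of comparisons. The group means below are illustrative.

```python
from itertools import combinations

import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(0)
groups = {
    "Group 1": rng.normal(loc=50, scale=10, size=30),
    "Group 2": rng.normal(loc=55, scale=10, size=30),
    "Group 3": rng.normal(loc=70, scale=10, size=30),
}

pairs = list(combinations(groups, 2))
adjusted_p = {}
for name_a, name_b in pairs:
    _, p = stats.ttest_ind(groups[name_a], groups[name_b])
    # Bonferroni: scale by the number of comparisons, capped at 1
    adjusted_p[(name_a, name_b)] = min(1.0, p * len(pairs))
    print(f"{name_a} vs {name_b}: adjusted p = {adjusted_p[(name_a, name_b)]:.4g}")
```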
#Q19  Write a Python function to check the assumptions (normality, independence, and equal variance) for ANOVA ?
import numpy as np
import scipy.stats as stats

def check_anova_assumptions(*groups, alpha=0.05):
    """
    Check the testable ANOVA assumptions (normality and equal variance).
    Independence cannot be tested here and must be ensured by the study design.

    Parameters:
    groups (tuple of np.array): Multiple datasets (samples).
    alpha (float): Significance level (default: 0.05).

    Returns:
    None (prints assumption test results).
    """

    # Normality Check (Shapiro-Wilk test)
    print("=== Normality Check ===")
    for i, group in enumerate(groups):
        stat, p_value = stats.shapiro(group)
        print(f"Group {i+1}: P-value = {p_value:.4f} → {'Normally Distributed' if p_value > alpha else 'Not Normal'}")

    # Equal Variance Check (Levene’s test)
    print("\n=== Equal Variance Check ===")
    stat, p_value = stats.levene(*groups)
    print(f"Levene's Test P-value = {p_value:.4f} → {'Equal Variance' if p_value > alpha else 'Unequal Variance'}")

    print("\nAssumptions Summary:")
    print("- Normality: checked per group with the Shapiro-Wilk test (see results above).")
    print("- Equal variance: checked across groups with Levene's test (see result above).")
    print("- Independence: not testable here; it must be ensured by the study design.")

# Example Usage
np.random.seed(42)
group1 = np.random.normal(loc=50, scale=10, size=30)
group2 = np.random.normal(loc=55, scale=10, size=30)
group3 = np.random.normal(loc=60, scale=10, size=30)

check_anova_assumptions(group1, group2, group3)
...
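When the normality check fails, one common fallback is the Kruskal-Wallis H-test (`scipy.stats.kruskal`), a rank-based alternative to one-way ANOVA. The sketch below uses deliberately skewed simulated data; the distributions and shifts are assumptions chosen for illustration.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(0)
# Skewed (exponential) groups with shifted locations
g1 = rng.exponential(scale=1.0, size=40)
g2 = rng.exponential(scale=1.0, size=40) + 2.0
g3 = rng.exponential(scale=1.0, size=40) + 4.0

h_stat, kw_p = stats.kruskal(g1, g2, g3)
print(f"Kruskal-Wallis H: {h_stat:.2f}, p-value: {kw_p:.4g}")
```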
#Q20 Perform a two-way ANOVA test using Python to study the interaction between two factors and visualize the results ?
import numpy as np
import pandas as pd
import scipy.stats as stats
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
import seaborn as sns

# Step 1: Simulate a dataset
np.random.seed(42)

# Create categorical factors
factor_A = np.random.choice(['Low', 'High'], size=100)
factor_B = np.random.choice(['Type 1', 'Type 2'], size=100)

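# Generate dependent variable (continuous data) with additive main effects of both factors
# (no interaction term is simulated, so the interaction row would typically be non-significant)
dependent_var = np.random.normal(loc=50, scale=10, size=100) + (factor_A == 'High') * 5 + (factor_B == 'Type 2') * -3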

# Create DataFrame
df = pd.DataFrame({'Factor_A': factor_A, 'Factor_B': factor_B, 'Dependent_Var': dependent_var})

# Step 2: Perform Two-Way ANOVA
model = smf.ols('Dependent_Var ~ C(Factor_A) + C(Factor_B) + C(Factor_A):C(Factor_B)', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
print("\n=== Two-Way ANOVA Results ===\n")
print(anova_table)

# Step 3: Visualization of Group Means using Boxplots
plt.figure(figsize=(10, 5))
sns.boxplot(x='Factor_A', y='Dependent_Var', hue='Factor_B', data=df)
plt.title('Interaction Effect of Factor A & Factor B on Dependent Variable')
plt.xlabel('Factor A')
plt.ylabel('Dependent Variable')
plt.legend(title='Factor B')
plt.show()
...
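In a 2x2 design, the interaction can be read as a difference of differences: the effect of Factor A at one level of Factor B minus its effect at the other level. The sketch below computes that contrast from cell means using noise-free illustrative data; `interaction_contrast` is a hypothetical helper, not part of statsmodels.

```python
import numpy as np

def interaction_contrast(factor_a, factor_b, y,
                         a_levels=("Low", "High"), b_levels=("Type 1", "Type 2")):
    """Difference of differences of the four cell means (hypothetical helper)."""
    factor_a, factor_b, y = map(np.asarray, (factor_a, factor_b, y))

    def cell_mean(a, b):
        return y[(factor_a == a) & (factor_b == b)].mean()

    low, high = a_levels
    b1, b2 = b_levels
    # (effect of A within b2) minus (effect of A within b1)
    return (cell_mean(high, b2) - cell_mean(low, b2)) - (cell_mean(high, b1) - cell_mean(low, b1))

# Noise-free illustrative data: two observations per cell
a = np.array(["Low", "Low", "High", "High"] * 2)
b = np.array(["Type 1", "Type 2"] * 4)
additive = 50 + 5 * (a == "High") - 3 * (b == "Type 2")
with_interaction = additive + 4 * ((a == "High") & (b == "Type 2"))

print("Additive data contrast:", interaction_contrast(a, b, additive))
print("Interaction data contrast:", interaction_contrast(a, b, with_interaction))
```

A contrast near zero corresponds to roughly parallel lines in an interaction plot; the ANOVA interaction row tests whether the contrast differs from zero beyond what noise would produce.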
#Q21 Write a Python program to visualize the F-distribution and discuss its use in hypothesis testing ?
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats

def visualize_f_distribution(df1_values, df2_values, x_limit=5):
    """
    Visualize the F-distribution for different degrees of freedom.

    Parameters:
    df1_values (list): Numerator degrees of freedom.
    df2_values (list): Denominator degrees of freedom.
    x_limit (float): Upper limit for X-axis.
    """
    x_values = np.linspace(0, x_limit, 1000)

    plt.figure(figsize=(10, 6))

    for df1, df2 in zip(df1_values, df2_values):
        f_dist = stats.f.pdf(x_values, df1, df2)
        plt.plot(x_values, f_dist, label=f"F-distribution (df1={df1}, df2={df2})")

    plt.xlabel("F-value")
    plt.ylabel("Probability Density")
    plt.title("F-distribution for Different Degrees of Freedom")
    plt.legend()
    plt.grid(True)
    plt.show()

# Example Usage: Simulate different F-distributions
visualize_f_distribution(df1_values=[2, 5, 10], df2_values=[20, 30, 50])
...
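In hypothesis testing, the F-distribution supplies the critical value and p-value for tests such as ANOVA and the variance-ratio test: H0 is rejected when the observed F falls in the upper tail. A minimal sketch, where the observed F of 4.2 and the degrees of freedom are hypothetical values:

```python
import scipy.stats as stats

alpha = 0.05
df1, df2 = 2, 20

# Upper-tail critical value: reject H0 when the observed F exceeds it
f_critical = stats.f.ppf(1 - alpha, df1, df2)

observed_f = 4.2  # Hypothetical observed F-statistic
p_value = stats.f.sf(observed_f, df1, df2)
reject = observed_f > f_critical

print(f"Critical F: {f_critical:.4f}")
print(f"P-value for F={observed_f}: {p_value:.4f}")
print("Decision:", "Reject H0" if reject else "Fail to reject H0")
```

Comparing F to the critical value and comparing the p-value to α are two views of the same decision rule.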
#Q22 Perform a one-way ANOVA test in Python and visualize the results with boxplots to compare group means ?
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns

def one_way_anova_test(*groups, alpha=0.05):
    """
    Perform a one-way ANOVA test to compare means across multiple groups and visualize the results.

    Parameters:
    groups (tuple of np.array): Multiple datasets.
    alpha (float): Significance level (default: 0.05).

    Returns:
    None (prints test results and displays boxplots).
    """
    # Perform one-way ANOVA
    f_statistic, p_value = stats.f_oneway(*groups)

    # Print results
    print(f"F-statistic: {f_statistic:.2f}")
    print(f"P-value: {p_value:.4f}")

    # Interpretation
    if p_value < alpha:
        print("Result: Reject the null hypothesis (H0). At least one group mean is significantly different.")
    else:
        print("Result: Fail to reject the null hypothesis (H0). No significant difference detected.")

    # Create a DataFrame for visualization
    import pandas as pd
    data = []
    for i, group in enumerate(groups):
        for value in group:
            data.append([f"Group {i+1}", value])

    df = pd.DataFrame(data, columns=["Group", "Value"])

    # Visualization: Boxplot comparison
    plt.figure(figsize=(8, 6))
    sns.boxplot(x="Group", y="Value", data=df, palette="Set2")
    plt.title("Comparison of Group Means using ANOVA")
    plt.xlabel("Groups")
    plt.ylabel("Values")
    plt.grid(True)
    plt.show()

# Example usage: Generate 3 sample groups
np.random.seed(42)
group1 = np.random.normal(loc=50, scale=10, size=30)
group2 = np.random.normal(loc=55, scale=10, size=30)
group3 = np.random.normal(loc=60, scale=10, size=30)

one_way_anova_test(group1, group2, group3)
...
#Q23 Simulate random data from a normal distribution, then perform hypothesis testing to evaluate the means ?
import numpy as np
import scipy.stats as stats

def hypothesis_testing(sample_data, pop_mean, alpha=0.05):
    """
    Perform a one-sample T-test to evaluate the sample mean compared to a population mean.

    Parameters:
    sample_data (np.array): Simulated dataset.
    pop_mean (float): Hypothesized population mean.
    alpha (float): Significance level (default: 0.05).

    Returns:
    None (prints test results).
    """
    sample_mean = np.mean(sample_data)
    sample_std = np.std(sample_data, ddof=1)
    n = len(sample_data)

    # Perform one-sample T-test
    t_statistic, p_value = stats.ttest_1samp(sample_data, pop_mean)

    # Print results
    print(f"Sample Mean: {sample_mean:.2f}")
    print(f"T-statistic: {t_statistic:.2f}")
    print(f"P-value: {p_value:.4f}")

    # Decision rule
    if p_value < alpha:
        print("Result: Reject the null hypothesis (H0). The sample mean is significantly different.")
    else:
        print("Result: Fail to reject the null hypothesis (H0). No significant difference detected.")

# Example usage: Simulate data from a normal distribution
np.random.seed(42)
sample_data = np.random.normal(loc=52, scale=5, size=30)  # Mean = 52, StdDev = 5, Sample size = 30
pop_mean = 50  # Hypothesized population mean

hypothesis_testing(sample_data, pop_mean)
...
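The one-sample t-statistic is the distance between the sample mean and the hypothesized mean, measured in standard errors: t = (x̄ − μ₀) / (s / √n). The sketch below computes it by hand on simulated data and takes the two-tailed p-value from the t-distribution with n − 1 degrees of freedom.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=52, scale=5, size=30)  # Illustrative simulated sample
mu0 = 50  # Hypothesized population mean

n = sample.size
standard_error = sample.std(ddof=1) / np.sqrt(n)
t_manual = (sample.mean() - mu0) / standard_error
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)  # Two-tailed p-value

print(f"t = {t_manual:.4f}, p = {p_manual:.4f}")
```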
#Q24 Perform a hypothesis test for population variance using a Chi-square distribution and interpret the results ?
import numpy as np
import scipy.stats as stats

def chi_square_variance_test(sample_data, pop_variance, alpha=0.05):
    """
    Perform a Chi-square hypothesis test for population variance.

    Parameters:
    sample_data (np.array): Sample dataset.
    pop_variance (float): Hypothesized population variance.
    alpha (float): Significance level (default: 0.05).

    Returns:
    None (prints test results).
    """
    n = len(sample_data)
    sample_variance = np.var(sample_data, ddof=1)  # Unbiased sample variance
    chi_square_stat = (n - 1) * sample_variance / pop_variance

    # Compute P-value (two-tailed test)
    p_value = 2 * min(stats.chi2.cdf(chi_square_stat, df=n-1), 1 - stats.chi2.cdf(chi_square_stat, df=n-1))

    # Print results
    print(f"Sample Variance: {sample_variance:.2f}")
    print(f"Chi-square Statistic: {chi_square_stat:.2f}")
    print(f"Degrees of Freedom: {n-1}")
    print(f"P-value: {p_value:.4f}")

    # Decision rule
    if p_value < alpha:
        print("Result: Reject the null hypothesis (H0). The variance is significantly different from the hypothesized value.")
    else:
        print("Result: Fail to reject the null hypothesis (H0). No significant difference detected.")

# Example usage: Simulate a dataset
np.random.seed(42)
sample_data = np.random.normal(loc=50, scale=5, size=30)  # Sample dataset
hypothesized_variance = 25  # Hypothesized population variance (e.g., σ²=5²)

chi_square_variance_test(sample_data, hypothesized_variance)
...
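The same chi-square pivot gives a confidence interval for the population variance: ((n−1)s² / χ²₁₋α/₂, (n−1)s² / χ²α/₂). The sketch below assumes normally distributed data, as the test itself does; `variance_confidence_interval` is a hypothetical helper.

```python
import numpy as np
import scipy.stats as stats

def variance_confidence_interval(data, confidence=0.95):
    """Chi-square based confidence interval for the variance (hypothetical helper)."""
    data = np.asarray(data, dtype=float)
    n = data.size
    s2 = data.var(ddof=1)
    alpha = 1 - confidence
    # Dividing by the upper quantile gives the lower bound, and vice versa
    lower = (n - 1) * s2 / stats.chi2.ppf(1 - alpha / 2, df=n - 1)
    upper = (n - 1) * s2 / stats.chi2.ppf(alpha / 2, df=n - 1)
    return lower, upper

rng = np.random.default_rng(0)
data = rng.normal(loc=50, scale=5, size=30)  # Illustrative simulated sample
var_low, var_high = variance_confidence_interval(data)
print(f"95% CI for variance: ({var_low:.2f}, {var_high:.2f})")
```

If the hypothesized variance lies inside this interval, the two-tailed test at the matching α would not reject H0.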
#Q25  Write a Python script to perform a Z-test for comparing proportions between two datasets or groups ?
import numpy as np
import scipy.stats as stats

def z_test_proportions(successes1, size1, successes2, size2, alpha=0.05):
    """
    Perform a Z-test for comparing proportions between two groups.

    Parameters:
    successes1 (int): Number of successes in group 1.
    size1 (int): Sample size of group 1.
    successes2 (int): Number of successes in group 2.
    size2 (int): Sample size of group 2.
    alpha (float): Significance level (default: 0.05).

    Returns:
    None (prints test results).
    """

    # Compute sample proportions
    p1 = successes1 / size1
    p2 = successes2 / size2

    # Compute pooled proportion
    p_pooled = (successes1 + successes2) / (size1 + size2)

    # Compute Z-statistic
    z_stat = (p1 - p2) / np.sqrt(p_pooled * (1 - p_pooled) * (1/size1 + 1/size2))

    # Compute P-value for two-tailed test
    p_value = 2 * (1 - stats.norm.cdf(abs(z_stat)))

    # Print results
    print(f"Proportion 1 (p1): {p1:.4f}")
    print(f"Proportion 2 (p2): {p2:.4f}")
    print(f"Z-statistic: {z_stat:.2f}")
    print(f"P-value: {p_value:.4f}")

    # Decision rule
    if p_value < alpha:
        print("Result: Reject the null hypothesis (H0). The proportions are significantly different.")
    else:
        print("Result: Fail to reject the null hypothesis (H0). No significant difference detected.")

# Example usage
z_test_proportions(successes1=45, size1=200, successes2=60, size2=220)
...
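A confidence interval for the difference in proportions complements the test by showing the size of the effect. The sketch below uses the unpooled standard error, which is the usual choice for the interval (the pooled version is used only under H0 in the test). `proportion_diff_ci` is a hypothetical helper, applied to the same counts as the example above.

```python
import math
import scipy.stats as stats

def proportion_diff_ci(successes1, size1, successes2, size2, confidence=0.95):
    """Normal-approximation CI for p1 - p2 with unpooled standard error (hypothetical helper)."""
    p1 = successes1 / size1
    p2 = successes2 / size2
    se = math.sqrt(p1 * (1 - p1) / size1 + p2 * (1 - p2) / size2)
    z_critical = stats.norm.ppf(1 - (1 - confidence) / 2)
    diff = p1 - p2
    return diff - z_critical * se, diff + z_critical * se

ci_low, ci_high = proportion_diff_ci(45, 200, 60, 220)
print(f"95% CI for p1 - p2: ({ci_low:.4f}, {ci_high:.4f})")
```

An interval that contains zero is consistent with failing to reject H0 at the corresponding significance level.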
#Q26 Implement an F-test for comparing the variances of two datasets, then interpret and visualize the results ?
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

def f_test_and_visualize(sample1, sample2, alpha=0.05):
    """
    Perform an F-test to compare variances of two datasets and visualize results.

    Parameters:
    sample1 (np.array): First dataset.
    sample2 (np.array): Second dataset.
    alpha (float): Significance level (default: 0.05).

    Returns:
    None (prints test results and displays histogram visualization).
    """

    # Compute sample variances
    var1 = np.var(sample1, ddof=1)
    var2 = np.var(sample2, ddof=1)

    # Compute F-statistic with the larger variance in the numerator,
    # keeping each variance paired with its own degrees of freedom
    if var1 >= var2:
        f_stat, df1, df2 = var1 / var2, len(sample1) - 1, len(sample2) - 1
    else:
        f_stat, df1, df2 = var2 / var1, len(sample2) - 1, len(sample1) - 1

    # Compute P-value for two-tailed test
    p_value = 2 * min(stats.f.cdf(f_stat, df1, df2), 1 - stats.f.cdf(f_stat, df1, df2))

    # Print results
    print(f"Sample Variance 1: {var1:.2f}")
    print(f"Sample Variance 2: {var2:.2f}")
    print(f"F-statistic: {f_stat:.2f}")
    print(f"P-value: {p_value:.4f}")
    print(f"Degrees of Freedom: ({df1}, {df2})")

    # Decision rule
    if p_value < alpha:
        print("Result: Reject the null hypothesis (H0). The variances are significantly different.")
    else:
        print("Result: Fail to reject the null hypothesis (H0). No significant difference in variances.")

    # Visualization: Histogram comparison
    plt.figure(figsize=(8, 6))
    plt.hist(sample1, bins=15, alpha=0.6, color='blue', label="Sample 1", edgecolor='black')
    plt.hist(sample2, bins=15, alpha=0.6, color='red', label="Sample 2", edgecolor='black')
    plt.xlabel("Value")
    plt.ylabel("Frequency")
    plt.title("Comparison of Distributions (F-test)")
    plt.legend()
    plt.grid(True)
    plt.show()

# Example usage: Generate two sample datasets
np.random.seed(42)
sample1 = np.random.normal(loc=100, scale=15, size=30)  # Sample 1
sample2 = np.random.normal(loc=100, scale=25, size=30)  # Sample 2 (higher variance)

f_test_and_visualize(sample1, sample2)
...
#Q27 Perform a Chi-square test for goodness of fit with simulated data and analyze the results ?
import numpy as np
import scipy.stats as stats

def chi_square_goodness_of_fit(observed, expected, alpha=0.05):
    """
    Perform a Chi-square goodness-of-fit test.

    Parameters:
    observed (np.array): Observed frequencies.
    expected (np.array): Expected frequencies.
    alpha (float): Significance level (default: 0.05).

    Returns:
    None (prints test results).
    """
    # Perform Chi-square goodness-of-fit test
    chi2_stat, p_value = stats.chisquare(observed, expected)

    # Print results
    print(f"Observed Frequencies: {observed}")
    print(f"Expected Frequencies: {expected}")
    print(f"Chi-square Statistic: {chi2_stat:.2f}")
    print(f"P-value: {p_value:.4f}")

    # Decision rule
    if p_value < alpha:
        print("Result: Reject the null hypothesis (H0). The observed data does NOT fit the expected distribution.")
    else:
        print("Result: Fail to reject the null hypothesis (H0). No significant evidence that the observed data deviates from the expected distribution.")

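# Example simulated data: draw 100 observations across 3 categories
np.random.seed(42)
expected_probs = np.array([0.33, 0.33, 0.34])  # Hypothesized category probabilities
observed_data = np.random.multinomial(100, expected_probs)  # Simulated observed counts
expected_data = expected_probs * observed_data.sum()  # Expected counts under H0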

chi_square_goodness_of_fit(observed_data, expected_data)



...