Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact
the validity of the results.

Q2. What are the three types of ANOVA, and in what situations would each be used?

Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual
sum of squares (SSR) in a one-way ANOVA using Python?

Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02.
What can you conclude about the differences between the groups, and how would you interpret these
results?

Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential
consequences of using different methods to handle missing data?

Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide
an example of a situation where a post-hoc test might be necessary.

Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from
50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python
to determine if there are any significant differences between the mean weight loss of the three diets.
Report the F-statistic and p-value, and interpret the results.

Q10. A company wants to know if there are any significant differences in the average time it takes to
complete a task using three different software programs: Program A, Program B, and Program C. They
randomly assign 30 employees to one of the programs and record the time it takes each employee to
complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or
interaction effects between the software programs and employee experience level (novice vs.
experienced). Report the F-statistics and p-values, and interpret the results.

Q11. An educational researcher is interested in whether a new teaching method improves student test
scores. They randomly assign 100 students to either the control group (traditional teaching method) or the
experimental group (new teaching method) and administer a test at the end of the semester. Conduct a
two-sample t-test using Python to determine if there are any significant differences in test scores
between the two groups. If the results are significant, follow up with a post-hoc test to determine which
group(s) differ significantly from each other.

Q12. A researcher wants to know if there are any significant differences in the average daily sales of three
retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store
on those days. Conduct a repeated measures ANOVA using Python to determine if there are any
significant differences in sales between the three stores. If the results are significant, follow up with a
post-hoc test to determine which store(s) differ significantly from each other.

## Answers

Q1. ANOVA Assumptions and Violations:

Assumptions:

Normality: Residuals (errors) are normally distributed.

Homogeneity of variance: Variances of groups are equal.

Independence: Observations are independent of each other.

Violations:

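Non-normality: Heavily skewed or heavy-tailed residuals (e.g., task times with a long right tail) can make p-values inaccurate, especially with small samples.

Heterogeneity of variance: If one group is much more variable than the others, the F-statistic can be inflated or deflated, distorting the Type I error rate, particularly when group sizes are unequal.

Dependence: Measuring the same subjects repeatedly, or sampling clustered units such as students within one classroom, underestimates the error variance and makes conclusions unreliable.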

Mitigation:

Transformations: Logarithmic or square root transformations can help achieve normality.

Welch's ANOVA: More robust to unequal variances.

Non-parametric tests: Kruskal-Wallis test doesn't require normality.
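
The checks and fallbacks above can be sketched with SciPy. This is a minimal illustration on simulated data: the group sizes, means, and seed are assumptions made up for the example, and Shapiro-Wilk and Levene are just one common choice of tests.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical data: three groups of 20 observations each
groups = [rng.normal(loc=m, scale=1.0, size=20) for m in (5.0, 5.5, 6.0)]

# Normality: Shapiro-Wilk on the residuals (observation minus its group mean)
residuals = np.concatenate([g - g.mean() for g in groups])
shapiro_stat, shapiro_p = stats.shapiro(residuals)

# Homogeneity of variance: Levene's test across the groups
levene_stat, levene_p = stats.levene(*groups)

# Non-parametric fallback if normality looks doubtful: Kruskal-Wallis
kw_stat, kw_p = stats.kruskal(*groups)

print(f"Shapiro-Wilk p   = {shapiro_p:.3f}")
print(f"Levene p         = {levene_p:.3f}")
print(f"Kruskal-Wallis p = {kw_p:.3f}")
```

A small p-value from Shapiro-Wilk or Levene suggests the corresponding assumption may not hold. Independence cannot be checked this way; it has to come from the study design.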

Q2. Types of ANOVA:

One-way ANOVA: Compares means of several independent groups (e.g., three dietary groups).

Two-way ANOVA: Tests the main effects of two factors on one outcome, plus their interaction (e.g., software programs and experience levels).

Repeated measures ANOVA: Analyzes data where the same subjects are measured on multiple occasions under different conditions (e.g., daily 
sales of three stores).

Q3. Partitioning of Variance:

Total sum of squares (SST): Total variation in the data.

Explained sum of squares (SSE): Variation explained by the group differences.

Residual sum of squares (SSR): Variation due to random error.

Importance:

Helps understand the proportion of variance attributable to group differences and random error.

Used to calculate the F-statistic for testing null hypothesis (no difference between groups).
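
As a small numeric check of the partition, the sketch below uses toy data invented for illustration, verifies that SST = SSE + SSR, and builds the F-statistic from the two components. The variable names are assumptions; SSE and SSR follow the naming used in this document (explained and residual).

```python
import numpy as np
from scipy import stats

# Toy data, made up for illustration
groups = [np.array([1.0, 2.0, 3.0]),
          np.array([2.0, 3.0, 4.0]),
          np.array([5.0, 6.0, 7.0])]
all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Total, explained (between-group), and residual (within-group) sums of squares
sst = np.sum((all_values - grand_mean) ** 2)
sse = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ssr = sum(np.sum((g - g.mean()) ** 2) for g in groups)

# F is the ratio of the explained mean square to the residual mean square
k, n = len(groups), len(all_values)
f_stat = (sse / (k - 1)) / (ssr / (n - k))
p_value = stats.f.sf(f_stat, k - 1, n - k)

print("SST:", sst, "SSE + SSR:", sse + ssr)
print("F by hand:", f_stat, "F from scipy:", stats.f_oneway(*groups).statistic)
```

For this toy data the partition is 32 = 26 + 6, so F = (26 / 2) / (6 / 6) = 13, which agrees with `scipy.stats.f_oneway`.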

In [None]:
# Q.4
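import numpy as np

def calculate_sums_of_squares(data, groups):
    """Return (SST, SSE, SSR) for a one-way ANOVA.

    SSE is the explained (between-group) sum of squares and SSR is the
    residual (within-group) sum of squares, following the naming in Q4.
    """
    data = np.asarray(data, dtype=float)
    groups = np.asarray(groups)
    grand_mean = data.mean()

    # Total variation of every observation around the grand mean
    SST = np.sum((data - grand_mean) ** 2)

    # Between-group variation: group size times squared distance
    # of the group mean from the grand mean
    SSE = sum(
        np.sum(groups == g) * (data[groups == g].mean() - grand_mean) ** 2
        for g in np.unique(groups)
    )

    # Whatever is not explained by the groups is residual variation
    SSR = SST - SSE
    return SST, SSE, SSR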

In [None]:
# Q.5
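import statsmodels.api as sm
from statsmodels.formula.api import ols

# your_data: a pandas DataFrame with columns outcome, factor1, factor2
model = ols('outcome ~ C(factor1) * C(factor2)', data=your_data).fit()

# Rows C(factor1) and C(factor2) are the main effects; C(factor1):C(factor2) is the interaction
print(sm.stats.anova_lm(model, typ=2))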
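
To show what the main-effect and interaction terms measure, here is a hand-rolled sketch for a balanced design only (equal replicates in every cell), written with NumPy and SciPy. The function name `two_way_anova_balanced`, the simulated task times, and the 3 x 2 x 5 layout are assumptions for illustration, not part of the original answer.

```python
import numpy as np
from scipy import stats

def two_way_anova_balanced(y):
    """y has shape (a, b, n): a levels of factor A, b of factor B, n replicates per cell."""
    a, b, n = y.shape
    grand = y.mean()
    mean_a = y.mean(axis=(1, 2))   # marginal means of factor A
    mean_b = y.mean(axis=(0, 2))   # marginal means of factor B
    cell = y.mean(axis=2)          # cell means, shape (a, b)

    ss = {
        "A": b * n * np.sum((mean_a - grand) ** 2),
        "B": a * n * np.sum((mean_b - grand) ** 2),
        # Interaction: what the cell means deviate by beyond the two main effects
        "A:B": n * np.sum((cell - mean_a[:, None] - mean_b[None, :] + grand) ** 2),
    }
    df = {"A": a - 1, "B": b - 1, "A:B": (a - 1) * (b - 1)}
    ss_resid = np.sum((y - cell[:, :, None]) ** 2)
    df_resid = a * b * (n - 1)
    ms_resid = ss_resid / df_resid

    table = {}
    for term in ss:
        f_stat = (ss[term] / df[term]) / ms_resid
        table[term] = (f_stat, stats.f.sf(f_stat, df[term], df_resid))
    return table, ss, ss_resid

# Simulated task times: 3 programs x 2 experience levels x 5 employees per cell
rng = np.random.default_rng(0)
y = rng.normal(loc=20.0, scale=2.0, size=(3, 2, 5))
y[:, 1, :] -= 3.0  # hypothetical effect: experienced employees finish faster

table, ss, ss_resid = two_way_anova_balanced(y)
for term, (f_stat, p) in table.items():
    print(f"{term}: F = {f_stat:.2f}, p = {p:.4f}")
```

For unbalanced data the sums of squares no longer add up this cleanly, and a model-based approach such as a fitted linear model with an ANOVA table is the safer route.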

Q6. Interpreting ANOVA Results (F = 5.23, p-value = 0.02):

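The F-statistic (5.23) is the ratio of the between-group mean square to the within-group mean square, so the variation between group means is about five times what random error alone would produce.

The p-value (0.02) is less than 0.05, so we reject the null hypothesis of equal group means at the 5% significance level.

We can conclude that at least one group mean differs significantly from the others. The test alone does not say which groups differ or how large the difference is.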

Further analysis: Post-hoc tests (e.g., Tukey's HSD) are needed to identify which specific groups differ significantly.
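
The p-value is the upper-tail area of the F distribution, so it depends on the degrees of freedom as well as on F. The question does not state them; the sketch below assumes 3 groups and 15 observations in total (df = 2 and 12) purely for illustration.

```python
from scipy import stats

f_stat = 5.23
df_between, df_within = 2, 12  # assumed: 3 groups, 15 observations in total

# Upper-tail probability of the F distribution at the observed statistic
p_value = stats.f.sf(f_stat, df_between, df_within)
print(f"p = {p_value:.4f}")
```

Under these assumed degrees of freedom the p-value comes out at about 0.023, in line with the reported 0.02. Other degrees of freedom would give a different p-value for the same F.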

Q7. Handling Missing Data in Repeated Measures ANOVA:

Deletion: Simplest but can lead to loss of information and reduced power.

Mean imputation: Impute missing values with the mean value of the variable.

Last observation carried forward (LOCF): Use the previous value for missing observations.

Consequences:

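Deletion reduces power and biases results unless the data are missing completely at random.

Mean imputation shrinks the variance, and LOCF assumes no change after the last observed value; both can introduce bias depending on the missing data mechanism.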
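
The three approaches can be compared on a small wide-format array (rows are subjects, columns are measurement occasions). The scores below are made up for illustration, and the NumPy-only implementation is a sketch rather than a recommended pipeline.

```python
import numpy as np

# Made-up scores: 4 subjects measured on 3 occasions, NaN marks a missing value
scores = np.array([
    [5.0, 6.0, 7.0],
    [4.0, np.nan, 6.0],
    [6.0, 7.0, np.nan],
    [5.0, 5.0, 6.0],
])

# Listwise deletion: keep only subjects with complete data
complete = scores[~np.isnan(scores).any(axis=1)]

# Mean imputation: replace each missing value with its occasion (column) mean
col_means = np.nanmean(scores, axis=0)
mean_imputed = np.where(np.isnan(scores), col_means, scores)

# LOCF: carry each subject's previous observation forward
locf = scores.copy()
for t in range(1, locf.shape[1]):
    missing = np.isnan(locf[:, t])
    locf[missing, t] = locf[missing, t - 1]

print("Subjects kept after deletion:", complete.shape[0])
print("Mean-imputed:\n", mean_imputed)
print("LOCF:\n", locf)
```

Deletion keeps only two of the four subjects here, which shows how quickly power is lost. A value missing at the first occasion has nothing to carry forward and stays missing under this LOCF sketch.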

Q8. Common Post-hoc Tests and Applications:

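Tukey's HSD: Compares all possible pairs of means while controlling the family-wise error rate. It is designed for equal sample sizes; the Tukey-Kramer variant handles unequal sizes.

Scheffe's test: More conservative than Tukey's HSD. It is suitable for complex contrasts beyond simple pairwise comparisons and remains valid with unequal sample sizes.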

Bonferroni correction: Adjusts p-values for multiple comparisons, controlling for Type I error inflation.

Example: Use post-hoc tests after ANOVA to determine which dietary groups differ in weight loss.
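
The diet example can be sketched as a one-way ANOVA followed by Bonferroni-corrected pairwise t-tests. The data are simulated: the group sizes (17, 17, 16), means, spread, and seed are assumptions for illustration, not results from a real study.

```python
from itertools import combinations

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Simulated weight loss in kg; sizes and means are hypothetical
diets = {
    "A": rng.normal(2.0, 1.5, size=17),
    "B": rng.normal(4.0, 1.5, size=17),
    "C": rng.normal(6.0, 1.5, size=16),
}

# Omnibus test: is at least one mean different?
f_stat, p_value = stats.f_oneway(*diets.values())
print(f"F = {f_stat:.2f}, p = {p_value:.4g}")

# Post-hoc: pairwise t-tests with Bonferroni adjustment
pairs = list(combinations(diets, 2))
adjusted = {}
for g1, g2 in pairs:
    _, p_raw = stats.ttest_ind(diets[g1], diets[g2])
    adjusted[(g1, g2)] = min(1.0, p_raw * len(pairs))
    print(f"{g1} vs {g2}: adjusted p = {adjusted[(g1, g2)]:.4g}")
```

Multiplying each raw p-value by the number of comparisons is the simplest adjustment. Tukey's HSD is usually less conservative when all pairwise comparisons are of interest.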


Q10. The Isolation Forest algorithm implicitly detects global outliers by assigning higher anomaly scores to data points that are easier to isolate (i.e., require shorter path lengths across the trees).

Data points with significantly shorter path lengths compared to the average path length across the trees are likely further away from the majority of the data and are considered potential global outliers.
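
The path-length idea can be sketched without a library. This is a simplified illustration, not the full Isolation Forest algorithm: it draws fresh random splits for each query, uses all of the data instead of subsamples, and skips the score normalization. The function names, the data, and the depth cap of 8 are assumptions.

```python
import numpy as np

def path_length(x, X, rng, depth=0, max_depth=8):
    """Count the random axis-aligned splits needed to isolate x within X."""
    if len(X) <= 1 or depth >= max_depth:
        return depth
    feature = rng.integers(X.shape[1])
    lo, hi = X[:, feature].min(), X[:, feature].max()
    if lo == hi:
        return depth
    split = rng.uniform(lo, hi)
    left = X[:, feature] < split
    # Keep only the points that fall on the same side of the split as x
    same_side = left if x[feature] < split else ~left
    return path_length(x, X[same_side], rng, depth + 1, max_depth)

rng = np.random.default_rng(0)
cluster = rng.normal(0.0, 1.0, size=(256, 2))
outlier = np.array([8.0, 8.0])
X = np.vstack([cluster, outlier])

def mean_path(x, n_trees=200):
    """Average path length of x over many random partitionings."""
    return np.mean([path_length(x, X, rng) for _ in range(n_trees)])

# Use the cluster point closest to the origin as a typical inlier
inlier = cluster[np.argmin(np.linalg.norm(cluster, axis=1))]
outlier_depth = mean_path(outlier)
inlier_depth = mean_path(inlier)
print(f"outlier: {outlier_depth:.2f} splits, inlier: {inlier_depth:.2f} splits")
```

The far-away point is cut off after only a few splits, while the central point needs many more, which is why short average paths translate into high anomaly scores.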

Q11. Local outlier detection:

Fraud detection: Analyzing individual transactions for deviations from user's typical spending behavior.

Sensor network anomaly detection: Identifying unusual readings from a specific device compared to historical data.

Image anomaly detection: Detecting anomalies within specific image regions (e.g., unusual textures or objects).

Global outlier detection:

Credit card fraud detection: Identifying transactions with extremely high amounts or unusual locations.

Weather anomaly detection: Detecting extreme temperature or pressure readings across a large region.

Stock market anomaly detection: Identifying stocks with significant price fluctuations compared to market trends.

Q12. How does the epsilon parameter affect the performance of DBSCAN in detecting anomalies?

The epsilon parameter (eps) in DBSCAN controls the size of the neighborhood considered around a data point. It directly impacts the algorithm's ability to detect anomalies:

Smaller epsilon:

Creates smaller, tighter clusters and labels more points as noise, which helps capture local anomalies that deviate from their immediate neighbors.

May flag normal points in sparse regions as anomalies (false positives), because fewer points meet the density requirement.

Larger epsilon:

May miss local anomalies, because points that sit just outside a dense region get absorbed into the nearby cluster.

Can be more effective in capturing global anomalies that stand out significantly from the overall data distribution.
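
The effect of eps on which points end up as noise can be sketched with a NumPy-only helper. `dbscan_noise_mask` is a hypothetical function written for this example; it only reproduces DBSCAN's noise rule (a point is noise if it is neither a core point nor within eps of one) and does not build clusters. The data and parameter values are assumptions.

```python
import numpy as np

def dbscan_noise_mask(X, eps, min_samples):
    """Boolean mask of the points DBSCAN would label as noise."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    within = dist <= eps                       # neighborhoods include the point itself
    core = within.sum(axis=1) >= min_samples   # core points have dense neighborhoods
    # Core points and border points lie within eps of at least one core point
    reachable = (within & core[None, :]).any(axis=1)
    return ~reachable

rng = np.random.default_rng(0)
cluster = rng.normal(0.0, 0.5, size=(100, 2))
local_anomaly = np.array([[2.5, 2.5]])     # just outside the dense region
global_anomaly = np.array([[10.0, 10.0]])  # far from everything
X = np.vstack([cluster, local_anomaly, global_anomaly])

noise_small = dbscan_noise_mask(X, eps=0.3, min_samples=5)
noise_large = dbscan_noise_mask(X, eps=5.0, min_samples=5)

print("eps=0.3 noise points:", noise_small.sum())
print("eps=5.0 noise points:", noise_large.sum())
print("local anomaly flagged (small, large):", noise_small[-2], noise_large[-2])
print("global anomaly flagged (small, large):", noise_small[-1], noise_large[-1])
```

With the small eps both anomalies are flagged, possibly along with some sparse cluster points. With the large eps the nearby point is absorbed into the cluster and only the distant one remains noise.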