In [1]:
# Jai Maa Saraswati :

Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact
the validity of the results.

# Assumptions of ANOVA :-

* Independence: the observations are independent of one another, both within and across groups.
  * Example of violation: Conducting ANOVA on time-series data where each data point depends on the previous one (autocorrelation).
  
* Normality: the residuals (the values within each group) are approximately normally distributed.
  * Example of violation: Conducting ANOVA on data that is heavily skewed or has extreme outliers that distort the distribution.
  
* Homogeneity of Variance: all groups have approximately the same variance.
  * Example of violation: Conducting ANOVA on data where one group has a much larger variance than the others, leading to an unequal spread of residuals.
  
# Example of Violations And Their Impact :

* Non-independence: If observations are not independent (e.g., repeated measures on the same subjects over time), the assumption of independent observations is violated. This typically leads to underestimated standard errors and inflated Type I error rates.

* Non-normality: Violations occur when the residuals do not follow a normal distribution. This can lead to inaccurate p-values and confidence intervals, especially with small samples. In extreme cases, ANOVA may not be appropriate, and a transformation of the data or a non-parametric test (such as Kruskal-Wallis) may be necessary.

* Heterogeneity of Variance: When the assumption of homogeneity of variance is violated (i.e., the variances of the residuals are not equal across groups), the F-test in ANOVA becomes unreliable, particularly when group sizes are unequal. This can result in incorrect p-values and confidence intervals, and therefore incorrect conclusions about group differences.
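
The normality and equal-variance assumptions can be screened in code before running the ANOVA. A minimal sketch, assuming three small made-up groups (`g1`, `g2`, `g3` are hypothetical) and using `scipy.stats.shapiro` on the residuals and `scipy.stats.levene` across the groups. Independence has to be judged from the study design rather than from a test like these.

```python
import numpy as np
from scipy import stats

# Hypothetical example data: three groups, made up for illustration
g1 = np.array([10.1, 11.3, 9.8, 10.7, 10.4])
g2 = np.array([12.0, 12.9, 11.6, 12.4, 13.1])
g3 = np.array([9.0, 9.9, 8.7, 9.4, 9.6])
groups = [g1, g2, g3]

# Residuals = each observation minus its own group mean
residuals = np.concatenate([g - g.mean() for g in groups])

# Shapiro-Wilk: the null hypothesis is that the residuals are normally distributed
shapiro_stat, shapiro_p = stats.shapiro(residuals)

# Levene: the null hypothesis is that all groups share the same variance
levene_stat, levene_p = stats.levene(*groups)

print(f"Shapiro-Wilk p-value: {shapiro_p:.4f}")
print(f"Levene p-value: {levene_p:.4f}")
```

A small p-value from either test is a warning sign for the corresponding assumption, not proof that ANOVA is unusable.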

Q2. What are the three types of ANOVA, and in what situations would each be used?

# TYPES OF ANOVA :-

* ONE WAY - ANOVA :
* Usage => One-way ANOVA is used when we have one independent variable (factor) with two or more levels (groups), typically three or more since two groups reduce to a t-test, and we want to determine whether the group means differ significantly.
    
* Examples:
  * (i) Testing whether there is a difference in test scores among students from three different schools.
  * (ii) Analyzing the effect of different doses of a drug (low, medium, high) on a medical condition.
  * (iii) Comparing the average sales performance across multiple regions.
              
* TWO WAY - ANOVA :
* Usage => Two-way ANOVA is used when we have two independent variables (factors) and we want to examine the main effect of each factor on the dependent variable, as well as whether the two factors interact.
     
* Example situations:
  * (i) Evaluating the effects of both gender and treatment (factor A: gender, factor B: treatment) on recovery time after surgery.
  * (ii) Assessing the impact of temperature (factor A: low, medium, high) and humidity (factor B: low, medium, high) on plant growth.
  * (iii) Studying the effects of education level (factor A: high school, college, graduate) and income level (factor B: low, medium, high) on job satisfaction.
                        
* FACTORIAL ANOVA :

* Usage => Factorial ANOVA generalizes two-way ANOVA to two or more independent variables (factors), each with multiple levels. Every combination of levels is studied, so the main effects and the interactions among the factors can all be tested.
    
* Example situations:
  * (i) Studying the effects of different types of exercise (factor A: aerobic, strength training, flexibility) and different diets (factor B: low carb, balanced, high protein) on weight loss.
  * (ii) Analyzing the effects of age group (factor A: young, middle-aged, elderly) and gender (factor B: male, female) on cognitive function.

Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

# PARTITIONING OF VARIANCE IN ANOVA :-

* Partitioning of variance in ANOVA is the process of dividing the total observed variability into components that can be attributed to different sources or factors.
  
* Total Variance (Total Sum of Squares, SST):
  
  SST represents the total variability or dispersion in the data, regardless of any specific factors or treatments. It is calculated as the sum of squared deviations of each data point from the overall mean.
  
* Between-Group Variance (Between-Group Sum of Squares, SSB):

  SSB captures the variability among different groups or levels of a categorical variable (factor). It measures how much the group means differ from the overall mean of all observations.
  
* Within-Group Variance (Within-Group Sum of Squares, SSW):

  SSW represents the variability within each group or treatment level. It measures how much individual data points deviate from their respective group means.
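
The three quantities above are tied together by a single identity. In the sketch below, $k$ groups, $n_j$ observations in group $j$, $N$ observations in total, the group mean $\bar{y}_j$ and the overall mean $\bar{y}$ are notation introduced here for illustration:

```latex
\underbrace{\sum_{j=1}^{k}\sum_{i=1}^{n_j}\left(y_{ij}-\bar{y}\right)^2}_{SST}
=
\underbrace{\sum_{j=1}^{k} n_j\left(\bar{y}_j-\bar{y}\right)^2}_{SSB}
+
\underbrace{\sum_{j=1}^{k}\sum_{i=1}^{n_j}\left(y_{ij}-\bar{y}_j\right)^2}_{SSW},
\qquad
F = \frac{SSB/(k-1)}{SSW/(N-k)}
```

The F-statistic compares the two pieces after each is divided by its degrees of freedom.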
  
# Importance of Understanding Partitioning of Variance:-

* Identifying Sources of Variation: By partitioning variance, ANOVA helps to identify which factors or variables contribute significantly to the variation observed in the data. This is crucial in experimental design and hypothesis testing to understand what factors are influencing the outcome.

* Quantifying Effects: It quantifies the relative importance of different sources of variation. For instance, SSB quantifies how much of the total variability is due to differences between groups, which is essential in assessing the effectiveness of treatments or interventions.

* Interpreting Results: Knowing the partitioning helps in interpreting the results correctly. For example, a large SSB relative to SSW (after each is divided by its degrees of freedom) gives a large F-statistic, which suggests that the factor being tested (e.g., different treatments) has a significant effect on the outcome variable.

* Improving Experimental Design: Insight into variance components aids in designing future experiments. It guides decisions on which factors to control or manipulate to maximize the sensitivity and reliability of experimental results.

Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual
sum of squares (SSR) in a one-way ANOVA using Python?

In [5]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Example data: Suppose we have a DataFrame 'df' with 'group' and 'value' columns
# Replace this with actual data setup

data = {
     
    'group' : ['A','A','B','B','C','C'],
    'value' :  [10, 12, 15, 18, 9, 11]
}

df = pd.DataFrame(data)

# STEP 1 : Fit the One-Way ANOVA Model:

model = ols('value ~ group' , data=df).fit()

# STEP 2 : Calculate the Total sum of Squares(SST) :
# SST = sum((y_i - y_mean)^2)
y_mean = df['value'].mean()
df['y_diff_sq'] = (df['value'] - y_mean) ** 2
SST = df['y_diff_sq'].sum()

# STEP 3 : Calculate the Explained sum of Squares(SSE):
# SSE = sum((y_hat_i - y_mean)^2)
# where y_hat_i is the predicted value from the ANOVA model
df['y_hat'] = model.predict(df)
df['y_hat_diff_sq'] = (df['y_hat'] - y_mean ) **2
SSE = df['y_hat_diff_sq'].sum()

# STEP 4 : Calculate the Residual sum of Squares(SSR):
# SSR = sum((y_i - y_hat_i)**2)
df['residual_sq'] = (df['value'] - df['y_hat']) ** 2
SSR = df['residual_sq'].sum()

# Print the results
print(f"Total Sum of Squares (SST): {SST}")
print(f"Explained Sum of Squares (SSE): {SSE}")
print(f"Residual Sum of Squares (SSR): {SSR}")

# Extract SSE, SSR and SST from the ANOVA table.
# The 'group' row holds the explained (between-group) sum of squares and
# the 'Residual' row holds the residual (within-group) sum of squares.
anova_table = sm.stats.anova_lm(model, typ=1)
SSE_from_anova = anova_table['sum_sq']['group']
SSR_from_anova = anova_table['sum_sq']['Residual']
SST_from_anova = SSE_from_anova + SSR_from_anova  # SST = SSE + SSR

print(f"\nTotal Sum of Squares (SST) from ANOVA: {SST_from_anova:.2f}")
print(f"Explained Sum of Squares (SSE) from ANOVA: {SSE_from_anova:.2f}")
print(f"Residual Sum of Squares (SSR) from ANOVA: {SSR_from_anova:.2f}")

Total Sum of Squares (SST): 57.5
Explained Sum of Squares (SSE): 49.000000000000014
Residual Sum of Squares (SSR): 8.5

Total Sum of Squares (SST) from ANOVA: 57.50
Explained Sum of Squares (SSE) from ANOVA: 49.00
Residual Sum of Squares (SSR) from ANOVA: 8.50


Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

In [9]:
!pip install statsmodels pandas
import pandas as pd 
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Example Data :
data = {
   'factor1' : ['A', 'A', 'B', 'B', 'A', 'A', 'B', 'B'],#represents the levels of the first factor (categorical variable 1)
   'factor2' : ['X', 'Y', 'X', 'Y', 'X', 'Y', 'X', 'Y'],# represents the levels of the second factor (categorical variable 2) 
   'response' : [5, 3, 7, 6, 4, 2, 8, 9] # represents the dependent variable.
}

df = pd.DataFrame(data)

# Fit the two-way ANOVA model with both main effects and their interaction:
formula = 'response ~ C(factor1) + C(factor2) + C(factor1):C(factor2)'
model = ols(formula, data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

# Print the ANOVA table:
print(anova_table)

                       sum_sq   df          F   PR(>F)
C(factor1)               32.0  1.0  21.333333  0.00989
C(factor2)                2.0  1.0   1.333333  0.31250
C(factor1):C(factor2)     2.0  1.0   1.333333  0.31250
Residual                  6.0  4.0        NaN      NaN
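
The sums of squares behind the main effects and the interaction can also be computed by hand from the marginal means and the cell means. A minimal sketch using the same example data, and assuming a balanced design (the same number of observations in every cell), which is what makes these simple formulas valid:

```python
import pandas as pd

df = pd.DataFrame({
    'factor1': ['A', 'A', 'B', 'B', 'A', 'A', 'B', 'B'],
    'factor2': ['X', 'Y', 'X', 'Y', 'X', 'Y', 'X', 'Y'],
    'response': [5, 3, 7, 6, 4, 2, 8, 9],
})

grand_mean = df['response'].mean()
n_per_cell = 2             # balanced design: 2 observations per factor1 x factor2 cell
a_levels, b_levels = 2, 2  # number of levels of factor1 and factor2

# Marginal means for each factor and the mean of every cell
mean_f1 = df.groupby('factor1')['response'].mean()
mean_f2 = df.groupby('factor2')['response'].mean()
cell_mean = df.groupby(['factor1', 'factor2'])['response'].mean()

# Main effects: squared distance of the marginal means from the grand mean
ss_f1 = n_per_cell * b_levels * ((mean_f1 - grand_mean) ** 2).sum()
ss_f2 = n_per_cell * a_levels * ((mean_f2 - grand_mean) ** 2).sum()

# Interaction: what is left of each cell mean after removing both main effects
ss_inter = 0.0
for (f1, f2), m in cell_mean.items():
    ss_inter += n_per_cell * (m - mean_f1[f1] - mean_f2[f2] + grand_mean) ** 2

# Residual: spread of the observations around their own cell mean
cell_mean_per_row = df.groupby(['factor1', 'factor2'])['response'].transform('mean')
ss_resid = ((df['response'] - cell_mean_per_row) ** 2).sum()

print(f"SS factor1: {ss_f1}, SS factor2: {ss_f2}")
print(f"SS interaction: {ss_inter}, SS residual: {ss_resid}")
```

Each F-statistic is then the effect's sum of squares divided by its degrees of freedom, over the residual sum of squares divided by the residual degrees of freedom.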


Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02.
What can you conclude about the differences between the groups, and how would you interpret these
results?

# F-STATISTIC :

* The F-statistic is the ratio of the variance between groups (due to the treatment effect) to the variance within groups (due to random error). A larger F-statistic suggests that the group means differ more from each other relative to the variability within each group.
  
* In this case, the F-statistic is 5.23, which indicates that the variation between the group means is about five times the variation within the groups.

# P-VALUE :

* The p-value associated with the F-statistic is used to determine the statistical significance of the observed difference. It is the probability of observing a result at least this extreme under the null hypothesis (which assumes that there is no difference between the group means).
  
* The obtained p-value is 0.02. This means that if there were no difference between the groups (the null hypothesis is true), there would be only a 2% chance of observing an F-statistic of 5.23 or larger.

# CONCLUSION AND INTERPRETATION :

* Statistical Significance : With a p-value of 0.02, which is less than the commonly chosen significance level of 0.05, we conclude that there is sufficient evidence to reject the null hypothesis.

* Difference Between Groups : The significant p-value (less than 0.05) indicates that there are likely genuine differences between at least some of the group means in the population.

* Practical Significance : While statistical significance indicates that there are differences, it does not quantify the size or importance of these differences. Further analyses or comparisons may be needed to assess the practical significance of the findings.

* Direction of Difference : The ANOVA itself does not tell you which specific groups differ from each other. Post-hoc tests (such as Tukey's HSD test, Scheffé's test, etc.) can be performed to identify which pairs of groups show significant differences.
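
The link between an F-statistic and its p-value can be reproduced with the survival function of the F distribution. The question does not state the degrees of freedom, so the values below (3 groups, 30 observations) are purely hypothetical and the resulting p-value is not expected to equal 0.02 exactly:

```python
from scipy import stats

f_stat = 5.23
k, n_total = 3, 30        # hypothetical: 3 groups, 30 observations in total
df_between = k - 1        # numerator degrees of freedom
df_within = n_total - k   # denominator degrees of freedom

# p-value = P(F >= f_stat) when the null hypothesis of equal group means is true
p_value = stats.f.sf(f_stat, df_between, df_within)

alpha = 0.05
reject_null = p_value < alpha
print(f"p-value: {p_value:.4f}, reject H0 at alpha = {alpha}: {reject_null}")
```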



Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential
consequences of using different methods to handle missing data?

# 1. IDENTIFYING THE MISSING DATA MECHANISM :

* Missing Completely at Random (MCAR): The probability of missingness is unrelated to any observed or unobserved variables.
* Missing at Random (MAR): The probability of missingness depends only on other observed variables, not on the missing value itself.
* Missing Not at Random (MNAR): The probability of missingness depends on the unobserved value of the variable being studied.

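# 2. Methods to Handle Missing Data :

* a. Complete Case Analysis (Listwise Deletion)
     Approach: Exclude any subject with missing data on any of the repeated measurements.
     Consequence: Sample size and statistical power drop, and the results are biased unless the data are MCAR.
     
* b. Pairwise Deletion
     Approach: Use all available data for each pair of variables in the analysis, excluding cases only where data is missing for the specific variables being analyzed.
     Consequence: Different comparisons rest on different subsets of subjects, which can make the results inconsistent with each other.
     
* c. Mean Imputation
     Approach: Replace missing values with the mean of the observed values for that variable.
     Consequence: The variance is underestimated and relationships between time points are distorted, so standard errors come out too small.

* d. Multiple Imputation
     Approach: Generate several plausible values for each missing data point, based on observed data and assumptions about the distribution of missing data.
     Consequence: Valid under MAR and reflects the uncertainty of the imputed values, at the cost of extra modelling and computation.
      
* e. Maximum Likelihood Estimation
     Approach: Incorporate missing data directly into the estimation process using statistical models that account for missingness (for example, mixed-effects models).
     Consequence: Uses every available observation and is valid under MAR, but the analysis is no longer a classical repeated measures ANOVA.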
      
# 3. Choosing the Right Method

* Considerations: The choice of method should be guided by the nature and extent of missing data, the missing data mechanism (MCAR, MAR, MNAR), sample size, and the specific analysis being performed.

* Best Practices: Multiple imputation or maximum likelihood estimation are generally preferred when the assumed missing data mechanism can be reasonably justified and computational resources allow.
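
A minimal sketch of how the two simplest methods behave, on a made-up wide-format table (one row per subject, one column per time point; all values are hypothetical). It shows that listwise deletion shrinks the sample, while mean imputation keeps every subject but shrinks the variance of the imputed column:

```python
import numpy as np
import pandas as pd

# Hypothetical repeated measures: 6 subjects measured at 3 time points
wide = pd.DataFrame({
    't1': [5.0, 6.0, 7.0, 5.5, 6.5, 7.5],
    't2': [5.5, np.nan, 7.5, 6.0, 7.0, np.nan],
    't3': [6.0, 7.0, np.nan, 6.5, 7.5, 8.5],
}, index=pd.Index(range(1, 7), name='subject'))

# a. Listwise deletion: keep only subjects with no missing time point
complete_cases = wide.dropna()

# c. Mean imputation: fill each column's gaps with that column's observed mean
mean_imputed = wide.fillna(wide.mean())

print(f"Subjects before: {len(wide)}, after listwise deletion: {len(complete_cases)}")
print(f"Variance of t2, observed values only: {wide['t2'].var():.3f}")
print(f"Variance of t2, after mean imputation: {mean_imputed['t2'].var():.3f}")
```

Here half of the subjects are lost to listwise deletion, and the imputed `t2` column looks less variable than the observed data.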

Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide
an example of a situation where a post-hoc test might be necessary.

* After conducting an ANOVA and finding a significant difference among the group means, post-hoc tests are used to determine which specific groups differ from each other.
  
# Tukey's Honestly Significant Difference (HSD) :

* When to use: Tukey's HSD test is used to compare all pairs of group means when the sample sizes are equal (or nearly equal) and the variances are equal across groups (the homogeneity of variances assumption is met).
* Example: In a study comparing the effectiveness of three different therapies for depression, after conducting an ANOVA and finding a significant difference among the therapies, Tukey's HSD can be used to identify which therapies show significantly different means.

# BONFERRONI CORRECTION :

* When to use: Bonferroni correction is used when you want to control the family-wise error rate (the probability of making at least one Type I error across all comparisons).
* Example: In a clinical trial with four different dosage levels of a drug, if the ANOVA shows a significant difference in treatment effectiveness, Bonferroni correction can be applied to compare each pair of dosage levels while controlling for the overall Type I error rate.

# SIDAK CORRECTION :

* When to use: Sidak correction is similar to Bonferroni correction but slightly less conservative. It is exact when the comparisons are independent, which makes it a reasonable choice when conducting many comparisons.
* Example: In an educational study comparing the performance of students in different teaching methods, if ANOVA indicates significant differences, Sidak correction can be used to compare specific pairs of teaching methods to see which ones lead to significantly different outcomes.

# DUNCAN'S NEW MULTIPLE RANGE TEST :

* When to use: Duncan's new multiple range test is a stepwise procedure that is more powerful than Tukey's HSD but gives weaker protection against Type I errors. Like Tukey's HSD, it assumes equal variances across groups, so it is not a remedy for unequal variances.
* Example: In an agricultural study comparing the yields of several different fertilizers, ANOVA may reveal significant differences in yield. Duncan's test can then be applied to determine which specific pairs of fertilizers lead to significantly different yields.

# Scheffé's method :

* When to use: Scheffé's method is very conservative. It is suited to testing any contrast among the group means (including complex, non-pairwise comparisons) and can be used with unequal sample sizes, but it still assumes equal variances.
* Example: In a social science study comparing the effectiveness of different intervention programs across schools, if ANOVA indicates significant differences, Scheffé's method can be used to conduct all pairwise comparisons while controlling for Type I error rate.


# EXAMPLE SCENARIO :

* Let's say a researcher conducts an experiment to determine whether different types of exercise (aerobic, strength training, and flexibility) have different effects on cardiovascular fitness (measured by VO2 max). The researcher collects data from three groups of participants who undergo each type of exercise program for 12 weeks. After conducting an ANOVA on the VO2 max scores post-intervention, the researcher finds a significant overall effect (p < 0.05).

* At this point, a post-hoc test is necessary to determine which specific exercise types lead to significantly different VO2 max scores. Here, the researcher might choose Tukey's HSD test if the assumptions of equal variances and equal sample sizes are met. Otherwise the choice depends on the characteristics of the data: Bonferroni or Sidak corrections can be applied to any set of pairwise tests, Scheffé's method suits complex contrasts, and a procedure designed for unequal variances (for example Games-Howell) is more appropriate when variance homogeneity fails.
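
A minimal sketch of this scenario, assuming made-up VO2 max values for the three exercise groups (all numbers are hypothetical). It runs Tukey's HSD with `pairwise_tukeyhsd` from statsmodels and, for comparison, Bonferroni-adjusted pairwise t-tests with `multipletests`:

```python
from itertools import combinations

import numpy as np
from scipy.stats import ttest_ind
from statsmodels.stats.multicomp import pairwise_tukeyhsd
from statsmodels.stats.multitest import multipletests

# Hypothetical VO2 max scores after 12 weeks (made up for illustration)
scores = {
    'aerobic':     [42.1, 44.3, 43.0, 45.2, 44.0, 43.5],
    'strength':    [38.4, 39.9, 37.8, 40.1, 39.0, 38.7],
    'flexibility': [36.0, 37.2, 35.5, 36.8, 37.0, 36.1],
}

# Flatten into one array of values and one array of matching group labels
values = np.concatenate([scores[g] for g in scores])
labels = np.repeat(list(scores), [len(v) for v in scores.values()])

# Tukey's HSD: all pairwise comparisons with family-wise error control
tukey = pairwise_tukeyhsd(endog=values, groups=labels, alpha=0.05)
print(tukey)

# Bonferroni: raw pairwise t-test p-values, then adjusted for the 3 comparisons
pairs = list(combinations(scores, 2))
raw_p = [ttest_ind(scores[a], scores[b]).pvalue for a, b in pairs]
reject, adj_p, _, _ = multipletests(raw_p, alpha=0.05, method='bonferroni')
for (a, b), p, r in zip(pairs, adj_p, reject):
    print(f"{a} vs {b}: adjusted p = {p:.4f}, significant = {r}")
```

Passing `method='sidak'` to `multipletests` gives the Sidak correction instead.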





Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from
50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python
to determine if there are any significant differences between the mean weight loss of the three diets.
Report the F-statistic and p-value, and interpret the results.



In [4]:
import numpy as np
from scipy.stats import f_oneway

# Data for each diet group (small illustrative sample, not all 50 participants)
diet_A = np.array([3.2, 2.7, 4.1, 3.4, 2.8])
diet_B = np.array([1.9, 2.4, 2.8, 2.5, 2.1])
diet_C = np.array([4.5, 3.7, 4.1, 5.0, 4.4])

# Perform the One-Way ANOVA :
f_statistics , p_value = f_oneway(diet_A , diet_B , diet_C)

# Print the Result :
print(f"F-Statistics:{f_statistics:.4f}")
print(f"P-Value:{p_value:.4f}")


# Interpretation of Result :

# F-Statistic => the ratio of variation between groups to variation within groups.
# The larger it is, the more the diet means differ relative to the spread within each diet.

# P-Value => the probability of an F-statistic at least this large if all diet means were equal.
# A small p-value (typically less than 0.05) suggests that at least one diet mean differs
# from the others. Here p is about 0.0001, so the null hypothesis of equal means is rejected.


F-Statistics:22.4963
P-Value:0.0001
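
The F-statistic reported by `f_oneway` can be reproduced from the between-group and within-group sums of squares, which also yields an effect size. A minimal sketch using the same three diet arrays; eta squared (SSB / SST) is an addition here and not something the question asks for:

```python
import numpy as np
from scipy.stats import f_oneway

diet_A = np.array([3.2, 2.7, 4.1, 3.4, 2.8])
diet_B = np.array([1.9, 2.4, 2.8, 2.5, 2.1])
diet_C = np.array([4.5, 3.7, 4.1, 5.0, 4.4])
groups = [diet_A, diet_B, diet_C]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Between-group and within-group sums of squares
ssb = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)

df_between = len(groups) - 1
df_within = len(all_values) - len(groups)

# F = mean square between / mean square within
f_manual = (ssb / df_between) / (ssw / df_within)

# Eta squared: share of the total variability explained by diet
eta_squared = ssb / (ssb + ssw)

print(f"Manual F: {f_manual:.4f}")
print(f"Eta squared: {eta_squared:.3f}")
```

A post-hoc test would still be needed to say which diets differ from each other.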


Q10. A company wants to know if there are any significant differences in the average time it takes to
complete a task using three different software programs: Program A, Program B, and Program C. They
randomly assign 30 employees to one of the programs and record the time it takes each employee to
complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or
interaction effects between the software programs and employee experience level (novice vs.
experienced). Report the F-statistics and p-values, and interpret the results.

In [None]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
import numpy as np

# Example data (illustrative: 49 timings rather than the 30 employees in the question):
time = np.array([10.2, 11.1, 9.8, 12.5, 10.9, 11.8, 9.5, 10.3, 10.7, 12.0, 11.5, 11.2, 9.3,
                 10.0, 9.7, 11.6, 10.8, 11.4, 9.9, 11.1, 11.4, 11.2, 12.1, 10.6, 12.3, 9.7,
                 10.5, 10.9, 12.2, 11.7, 13.2, 12.8, 13.5, 12.4, 13.0, 12.9, 12.1, 13.4, 13.3,
                 12.7, 12.5, 12.6, 12.3, 13.1, 12.0, 13.6, 12.2, 13.8, 12.6])
n_obs = len(time)

data = {
    # Labels are cycled to match the number of timings, so that every
    # Software x Experience combination is represented.
    'Software': np.resize(['A', 'B', 'C'], n_obs),
    'Experience': np.resize(['Novice', 'Experienced'], n_obs),
    'Time': time
}

# Convert the data into a DataFrame:
df = pd.DataFrame(data)

# Perform the two-way ANOVA (main effects plus interaction):
model = ols('Time ~ C(Software) + C(Experience) + C(Software):C(Experience)', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

# Print the ANOVA table
print(anova_table)

Q11. An educational researcher is interested in whether a new teaching method improves student test
scores. They randomly assign 100 students to either the control group (traditional teaching method) or the
experimental group (new teaching method) and administer a test at the end of the semester. Conduct a
two-sample t-test using Python to determine if there are any significant differences in test scores
between the two groups. If the results are significant, follow up with a post-hoc test to determine which
group(s) differ significantly from each other.

In [16]:
import numpy as np
from scipy.stats import ttest_ind

# Generate the example data (small illustrative sample of 5 scores per group):
np.random.seed(42) # For reproducibility
control_score = np.random.normal(loc=75,scale=5,size=5) # Control group scores
experimental_score = np.random.normal(loc=78,scale=5,size=5) # Experimental group scores

# Perform the two-sample t-test:
t_statistics , p_value = ttest_ind(control_score , experimental_score)

# Print the Result :
print(f"T-Statistics :{t_statistics:.4f}")
print(f"P-Value : {p_value:.4f}")

T-Statistics :-1.1921
P-Value : 0.2674


# Interpretation of the T-Test Result :

* T-statistic: This value indicates how large the difference is relative to the variation in the data. A larger absolute value indicates a larger difference between the means of the two groups.

* P-value: This value indicates the probability of observing such an extreme difference (or more extreme) under the null hypothesis that the two groups have equal means. A small p-value (typically less than 0.05) suggests that the difference in means between the control and experimental groups is statistically significant.

* In this example the p-value is 0.2674, which is greater than 0.05, so we fail to reject the null hypothesis. With only 5 scores per group, the test has little power to detect a difference.

# Step 2: Post-Hoc Test (if significant) :-

* If the result of the two-sample t-test is significant (i.e., p-value < 0.05), there is a significant difference in test scores between the two groups.

* Since we have only two groups (control and experimental), no separate post-hoc test is needed: a significant t-test already identifies the only pair that can differ. The follow-up is to compare the means and report descriptive statistics such as means, standard deviations, and confidence intervals. We can also visualize the data using box plots or histograms to further explore the differences between the groups.

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Calculate the mean and the sample standard deviation (ddof=1):
control_mean = np.mean(control_score)
experimental_mean = np.mean(experimental_score)
control_std = np.std(control_score, ddof=1)
experimental_std = np.std(experimental_score, ddof=1)

# Print the mean and standard deviation:
print(f"Control Group Mean: {control_mean:.2f}, Std = {control_std:.2f}")
print(f"Experimental Group Mean: {experimental_mean:.2f}, Std = {experimental_std:.2f}")

# Box plot for comparison (one column per group)
scores = pd.DataFrame({'Control': control_score, 'Experimental': experimental_score})
sns.boxplot(data=scores)
plt.xlabel('Group')
plt.ylabel('Test Score')
plt.title("Comparison of Test Scores Between Control & Experimental Groups")
plt.show()
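
Two further quantities are often reported alongside the t-test: Welch's version of the test, which does not assume equal variances, and an effect size. A minimal sketch that regenerates the same example scores; Cohen's d with a pooled standard deviation is a choice made for this sketch, not something the question asks for:

```python
import numpy as np
from scipy.stats import ttest_ind

np.random.seed(42)  # same seed and draw order as the example data
control_score = np.random.normal(loc=75, scale=5, size=5)
experimental_score = np.random.normal(loc=78, scale=5, size=5)

# Welch's t-test: drops the equal-variance assumption
welch_t, welch_p = ttest_ind(control_score, experimental_score, equal_var=False)

# Cohen's d using the pooled sample standard deviation (ddof=1)
n1, n2 = len(control_score), len(experimental_score)
pooled_var = ((n1 - 1) * control_score.var(ddof=1)
              + (n2 - 1) * experimental_score.var(ddof=1)) / (n1 + n2 - 2)
cohens_d = (experimental_score.mean() - control_score.mean()) / np.sqrt(pooled_var)

print(f"Welch t: {welch_t:.4f}, p: {welch_p:.4f}")
print(f"Cohen's d: {cohens_d:.2f}")
```

With equal group sizes the Welch t-statistic equals the ordinary one; only the degrees of freedom, and therefore the p-value, change.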

Q12. A researcher wants to know if there are any significant differences in the average daily sales of three
retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store
on those days. Conduct a repeated measures ANOVA using Python to determine if there are any
significant differences in sales between the three stores. If the results are significant, follow up with a post-hoc test to determine which store(s) differ significantly from each other.

In [None]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.formula.api import ols 

# Example data:
np.random.seed(42)  # for reproducibility
data = {
    'Day': np.tile(np.arange(1, 31), 3),       # days 1..30, repeated for each store
    'Store': np.repeat(['A', 'B', 'C'], 30),   # store label repeated for each day
    # example sales: one mean per store (100, 110, 95), repeated to length 90
    'Sales': np.random.normal(loc=np.repeat([100, 110, 95], 30), scale=10)
}

# Convert into a DataFrame:
df = pd.DataFrame(data)

# Perform the repeated measures ANOVA.
# Day is included as a blocking factor (the repeated "subject"), so the
# Store effect is tested against the day-adjusted residual variation.
model = ols('Sales ~ C(Store) + C(Day)', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

# Print the ANOVA table
print(anova_table)

# Interpretation of Repeated Measures ANOVA Results :-

* The ANOVA table (anova_table) provides sums of squares, F-statistics, and p-values for the Store effect and the Day (blocking) effect. The model has no Store x Day interaction term: with one observation per store per day, that interaction cannot be separated from the residual.

* Look at the p-value in the C(Store) row to determine if there are significant differences in daily sales between the three stores (Store factor).
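
statsmodels also provides a dedicated repeated measures class, `AnovaRM`. A minimal sketch, assuming `Day` plays the role of the subject (each day is measured once per store) and regenerating comparable example data with hypothetical store means of 100, 110 and 95:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

np.random.seed(42)  # for reproducibility
df = pd.DataFrame({
    'Day': np.tile(np.arange(1, 31), 3),       # days 1..30, once per store
    'Store': np.repeat(['A', 'B', 'C'], 30),   # store label for each block of days
    # hypothetical sales: store means 100, 110, 95 with a common spread of 10
    'Sales': np.random.normal(loc=np.repeat([100, 110, 95], 30), scale=10),
})

# Day is the repeated "subject"; Store is the within-subject factor
rm_results = AnovaRM(data=df, depvar='Sales', subject='Day', within=['Store']).fit()
print(rm_results.anova_table)
```

For a balanced design like this one, the F-test for `Store` should agree with the blocked OLS model that includes `C(Day)`.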

# Step 2: Post-Hoc Test (if significant) :

In [None]:
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Perform Tukey's HSD post-hoc test:
tukey_results = pairwise_tukeyhsd(df['Sales'], df['Store'])

# Print the summary of the post-hoc test:
print(tukey_results)

# Interpretation of Post-Hoc Test Results:-

* The tukey_results object provides a summary table that includes the differences between group means, confidence intervals, and whether each comparison is statistically significant.

* Look at the output to determine which specific pairs of stores have significantly different average daily sales.
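
Tukey's HSD treats the three stores as independent groups. Because the same 30 days are observed for every store, paired comparisons are a natural alternative. A minimal sketch, assuming regenerated example data (hypothetical store means) and Bonferroni-adjusted paired t-tests:

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy.stats import ttest_rel
from statsmodels.stats.multitest import multipletests

np.random.seed(42)  # for reproducibility
df = pd.DataFrame({
    'Day': np.tile(np.arange(1, 31), 3),
    'Store': np.repeat(['A', 'B', 'C'], 30),
    # hypothetical sales with store means 100, 110, 95
    'Sales': np.random.normal(loc=np.repeat([100, 110, 95], 30), scale=10),
})

# One row per day and one column per store, so the values line up by day
wide = df.pivot(index='Day', columns='Store', values='Sales')

# Paired t-test for every pair of stores
pairs = list(combinations(wide.columns, 2))
raw_p = [ttest_rel(wide[a], wide[b]).pvalue for a, b in pairs]

# Bonferroni adjustment across the three pairwise comparisons
reject, adj_p, _, _ = multipletests(raw_p, alpha=0.05, method='bonferroni')
for (a, b), p, r in zip(pairs, adj_p, reject):
    print(f"Store {a} vs Store {b}: adjusted p = {p:.4f}, significant = {r}")
```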