# Pwskills

## Data Science Master

### Statistics Advance Assignment

## Q1
Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact
the validity of the results.

ANOVA (Analysis of Variance) is a statistical technique used to compare the means of three or more groups. Its results are only valid when certain assumptions hold. The assumptions are as follows:

Independence: The observations should be independent of each other, both within and between groups. This assumption is violated when observations are correlated. For example, if the same participants are measured repeatedly over time, their measurements are not independent.

Normality: The data within each group (equivalently, the model residuals) should follow an approximately normal distribution. This assumption is violated when the data are strongly skewed, heavy-tailed, or contain extreme outliers.

Homogeneity of Variance: The variances of the groups being compared should be equal. This assumption is violated when the variability differs substantially between groups, for example when one group has a much larger variance than the others.

Interval or Ratio Data: ANOVA assumes that the dependent variable is measured on an interval or ratio scale. This assumption is violated when the dependent variable is nominal or ordinal. For example, if the dependent variable is categorical with only two categories, ANOVA is not appropriate.

Violations of these assumptions affect the validity of ANOVA results in different ways:

Violation of independence: If observations are not independent, such as when data are collected from the same participant multiple times, the standard errors are typically underestimated and the Type I error rate is inflated. In such cases, repeated measures ANOVA or mixed-effects models are more appropriate.

Violation of normality: If the data within groups do not follow a normal distribution, the p-values and confidence intervals produced by ANOVA may be unreliable, particularly with small samples. Non-parametric tests, such as the Kruskal-Wallis test, can be used instead.

Violation of homogeneity of variance: When the group variances are unequal, especially in combination with unequal group sizes, the F-test can become too liberal or too conservative and lead to incorrect conclusions. Modified versions of ANOVA, such as Welch's ANOVA, can be used instead.

Violation of interval or ratio data: If the dependent variable is measured on a nominal or ordinal scale, ANOVA is not appropriate. In such cases, methods designed for categorical outcomes, such as the chi-square test or ordinal logistic regression, should be used.

It is important to assess these assumptions before conducting ANOVA and to consider alternative analysis methods if they are violated.
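
As a rough illustration of how two of these assumptions could be checked before running ANOVA, the sketch below applies the Shapiro-Wilk test (normality) and Levene's test (homogeneity of variance) from SciPy. The three groups are simulated and purely hypothetical; real data would take their place.

```python
import numpy as np
from scipy import stats

# Hypothetical data: three groups simulated from normal distributions
rng = np.random.default_rng(0)
groups = [rng.normal(loc=mu, scale=1.0, size=30) for mu in (5.0, 5.5, 6.0)]

# Normality within each group: Shapiro-Wilk test (null hypothesis: data are normal)
shapiro_pvalues = [stats.shapiro(g).pvalue for g in groups]

# Homogeneity of variance: Levene's test (null hypothesis: equal variances)
levene_stat, levene_p = stats.levene(*groups)

print("Shapiro-Wilk p-values:", np.round(shapiro_pvalues, 3))
print("Levene p-value:", round(levene_p, 3))
```

A small p-value from either test suggests that the corresponding assumption may not hold. These tests are only a guide, and plots such as Q-Q plots of the residuals are commonly used alongside them.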

## Q2
Q2. What are the three types of ANOVA, and in what situations would each be used?

The three types of ANOVA are:

One-Way ANOVA: One-Way ANOVA is used when there is a single categorical independent variable (also known as a factor), typically with three or more levels, and a continuous dependent variable. It is used to determine if there are any statistically significant differences in the means of the dependent variable across the different levels of the independent variable. For example, a researcher may use One-Way ANOVA to compare the mean scores of three different treatment groups on a psychological test.

Two-Way ANOVA: Two-Way ANOVA is used when there are two categorical independent variables (factors) and a continuous dependent variable. It helps to assess the main effect of each factor as well as their interaction effect. A continuous predictor is not a factor; it would instead be included as a covariate, which turns the analysis into ANCOVA. For example, a researcher may use Two-Way ANOVA to examine the effects of gender (categorical) and age group (categorical) on a measure of cognitive ability.

Factorial ANOVA: Factorial ANOVA generalizes Two-Way ANOVA and is used when there are two or more categorical independent variables (factors) and a continuous dependent variable. It allows for the examination of main effects and of interaction effects between multiple factors. For example, a researcher may use Factorial ANOVA to investigate the effects of treatment type, dosage level, and patient age group (all categorical) on a biological outcome measure.

These different types of ANOVA are used based on the specific research design and the number of independent variables involved. One-Way ANOVA is appropriate when there is a single factor, Two-Way ANOVA is used when there are two factors, and Factorial ANOVA is used when there are two or more factors. It is important to select the appropriate type of ANOVA based on the research question and study design to obtain accurate and meaningful results.

## Q3
Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

The partitioning of variance in ANOVA refers to the decomposition of the total variance observed in the data into different components associated with various sources of variability. This partitioning allows for a better understanding of how much of the total variability can be attributed to different factors in the analysis.

In ANOVA, the total variability in the data, measured by the total sum of squares, is divided into two main components: the variability between groups (also called the "between-group variance" or "explained variance") and the variability within groups (also called the "within-group variance" or "unexplained variance"). The two component sums of squares add up exactly to the total sum of squares.

The variance between groups represents the differences in means among the groups being compared. It provides information about the extent to which the independent variable(s) explain the variation in the dependent variable. If the between-group variance is large relative to the within-group variance, it suggests that the independent variable(s) have a significant effect on the dependent variable.

The variance within groups represents the variability or differences within each group that cannot be accounted for by the independent variable(s). It includes random variation and measurement error. If the within-group variance is large, it indicates that there is substantial variability within each group, and the independent variable(s) may have less impact on the dependent variable.

Understanding the partitioning of variance is important for several reasons:

Assessing the significance of the independent variable(s): By comparing the between-group variance to the within-group variance, ANOVA determines whether the differences between the groups are statistically significant. This helps in evaluating the impact of the independent variable(s) on the dependent variable.

Interpreting the effect size: The partitioning of variance provides information about the proportion of total variability in the dependent variable that can be attributed to the independent variable(s). This allows for the calculation of effect sizes, such as eta-squared or partial eta-squared, which quantify the magnitude of the effect.

Identifying potential sources of variability: By partitioning the variance, ANOVA helps to identify which factors or variables contribute most to the variation in the dependent variable. This can guide further investigation and potentially suggest areas for intervention or further research.

Designing future studies: Understanding the partitioning of variance can inform the design of future studies by providing insights into the expected magnitude of the effects and the required sample size to detect them.

In summary, the partitioning of variance in ANOVA provides valuable information about the significance and impact of the independent variable(s) on the dependent variable, allowing for a deeper understanding of the relationships between variables and guiding subsequent analyses and research decisions.
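
The partition described above can be checked numerically. The following is a minimal sketch on small, hypothetical data (the values and variable names are illustrative only): it computes the total, between-group, and within-group sums of squares, confirms that the components add up to the total, and derives the F-statistic and eta-squared from them.

```python
import numpy as np
from scipy import stats

# Hypothetical data: three groups of unequal size
groups = [
    np.array([4.0, 5.0, 6.0, 5.0]),
    np.array([6.0, 7.0, 8.0, 7.0, 7.0]),
    np.array([9.0, 8.0, 10.0]),
]
all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Total, between-group and within-group sums of squares
ss_total = np.sum((all_values - grand_mean) ** 2)
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(np.sum((g - g.mean()) ** 2) for g in groups)

# The partition: total = between + within
assert np.isclose(ss_total, ss_between + ss_within)

# F is the ratio of mean squares; eta-squared is the explained proportion
df_between = len(groups) - 1
df_within = len(all_values) - len(groups)
f_manual = (ss_between / df_between) / (ss_within / df_within)
eta_squared = ss_between / ss_total

# Cross-check against SciPy's one-way ANOVA
f_scipy, p_value = stats.f_oneway(*groups)
assert np.isclose(f_manual, f_scipy)
print(f"F = {f_manual:.3f}, p = {p_value:.4f}, eta^2 = {eta_squared:.3f}")
```

The cross-check against `stats.f_oneway` shows that the F-statistic reported by the library is the ratio of the two mean squares obtained from the partition.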

## Q4
Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual
sum of squares (SSR) in a one-way ANOVA using Python?

In [1]:
import numpy as np
from scipy import stats


In [None]:
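# `data` holds the observations and `groups` the group label of each observation;
# both are assumed to be equal-length sequences defined beforehand.
data = np.array(data)
groups = np.array(groups)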

In [None]:
# Calculate the overall mean
overall_mean = np.mean(data)

# Calculate the sum of squares total (SST)
sst = np.sum((data - overall_mean) ** 2)

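# Calculate the sum of squares explained (SSE): each group's squared deviation
# from the overall mean, weighted by the group size
labels = np.unique(groups)
group_means = np.array([np.mean(data[groups == g]) for g in labels])
group_sizes = np.array([np.sum(groups == g) for g in labels])
sse = np.sum(group_sizes * (group_means - overall_mean) ** 2)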

# Calculate the sum of squares residual (SSR)
ssr = sst - sse


## Q5
Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

In [4]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols


In [5]:
formula = 'dependent_var ~ independent_var1 * independent_var2'


In [None]:
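# `df` is assumed to be a pandas DataFrame containing the three columns named in the formula
model = ols(formula, data=df).fit()
# anova_lm defaults to Type I (sequential) sums of squares; typ=2 is a common choice for unbalanced designs
anova_table = sm.stats.anova_lm(model)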


In [None]:
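# Sum of squares attributed to each main effect and to the interaction
main_effect_1 = anova_table['sum_sq']['independent_var1']
main_effect_2 = anova_table['sum_sq']['independent_var2']
interaction_effect = anova_table['sum_sq']['independent_var1:independent_var2']
# The F-statistics and p-values for the same terms are in the 'F' and 'PR(>F)' columns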
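
To make clear what the ANOVA table contains, the sketch below computes the main-effect and interaction sums of squares by hand with NumPy. It assumes a balanced design (the same number of replicates in every cell), for which these formulas are exact; the data are simulated and the factor names "A" and "B" are hypothetical.

```python
import numpy as np
from scipy import stats

# Hypothetical balanced design: 2 levels of factor A x 3 levels of factor B,
# 4 replicates per cell; y[i, j, k] is replicate k in cell (i, j)
rng = np.random.default_rng(1)
a, b, n = 2, 3, 4
y = rng.normal(loc=10.0, scale=1.0, size=(a, b, n))
y[1] += 2.0  # build in a main effect of factor A

grand = y.mean()
mean_a = y.mean(axis=(1, 2))  # one mean per level of A
mean_b = y.mean(axis=(0, 2))  # one mean per level of B
mean_cell = y.mean(axis=2)    # one mean per (A, B) cell

# Sums of squares for the main effects, the interaction and the residual
ss_a = b * n * np.sum((mean_a - grand) ** 2)
ss_b = a * n * np.sum((mean_b - grand) ** 2)
ss_ab = n * np.sum((mean_cell - mean_a[:, None] - mean_b[None, :] + grand) ** 2)
ss_within = np.sum((y - mean_cell[:, :, None]) ** 2)
ss_total = np.sum((y - grand) ** 2)
assert np.isclose(ss_total, ss_a + ss_b + ss_ab + ss_within)

# F-statistic and p-value for each term, using the within-cell mean square
df_within = a * b * (n - 1)
ms_within = ss_within / df_within
effects = {}
for name, ss, df in [("A", ss_a, a - 1), ("B", ss_b, b - 1), ("A:B", ss_ab, (a - 1) * (b - 1))]:
    f_stat = (ss / df) / ms_within
    p_val = stats.f.sf(f_stat, df, df_within)
    effects[name] = (ss, f_stat, p_val)
    print(f"{name}: SS = {ss:.3f}, F = {f_stat:.3f}, p = {p_val:.4f}")
```

For unbalanced data these simple formulas no longer partition the total sum of squares cleanly, which is one reason to rely on a model-based routine such as `anova_lm` in practice.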


## Q6
Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02.
What can you conclude about the differences between the groups, and how would you interpret these
results?

In the given scenario, where a one-way ANOVA resulted in an F-statistic of 5.23 and a p-value of 0.02, we can draw the following conclusions:

Differences between groups: The F-statistic of 5.23 means that the between-group mean square is 5.23 times the within-group mean square, i.e. the group means vary more than would be expected from the variability within groups alone. Whether this ratio is statistically significant is determined by the p-value.

Interpretation of results: With a p-value of 0.02, which is below the commonly used significance level of 0.05, we reject the null hypothesis that all group means are equal. We conclude that at least one group mean differs from the others.

Additionally, the p-value of 0.02 is the probability of obtaining an F-statistic of 5.23 or larger if the null hypothesis were true. In other words, if there were no differences between the group means, a result at least this extreme would be expected in only 2% of samples.

It's important to note that the one-way ANOVA does not indicate which specific groups are different from each other; it only confirms the presence of overall group differences. To identify the specific group differences, post-hoc tests (e.g., Tukey's test, Bonferroni correction, etc.) or planned comparisons can be conducted.

In summary, based on the given F-statistic and p-value, we can conclude that there are significant differences between the groups being compared. These results indicate that the independent variable (the factor) has a statistically significant effect on the dependent variable, and the groups are not equal in terms of their means.
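
The link between the F-statistic and its p-value can be sketched with SciPy's F distribution. The question does not state the degrees of freedom, so the design below (3 groups, 30 observations) is a hypothetical assumption, and the resulting p-value is not expected to reproduce the reported 0.02 exactly.

```python
from scipy import stats

f_statistic = 5.23

# Hypothetical design (not given in the question): 3 groups, 30 observations
df_between = 3 - 1
df_within = 30 - 3

# Survival function of the F distribution: P(F >= f_statistic) under the null hypothesis
p_value = stats.f.sf(f_statistic, df_between, df_within)
print(f"p = {p_value:.4f}")
```

The same F-statistic gives a different p-value for different degrees of freedom, which is why both should be reported together.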




## Q7
Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential
consequences of using different methods to handle missing data?

Handling missing data in a repeated measures ANOVA requires careful consideration, as the presence of missing data can potentially introduce biases and affect the validity of the results. There are different approaches to handle missing data in this context, and the choice of method can have consequences. Here are a few common methods and their potential consequences:

Complete Case Analysis (Listwise deletion):

This approach involves excluding any participant with missing data on any variable included in the analysis.

Consequence: Listwise deletion reduces the sample size and the statistical power. It can also introduce bias if the missingness is related to the variables being analyzed, which limits the generalizability of the results.

Pairwise Deletion:

With this approach, a case is excluded only from the specific comparisons that involve its missing values, so each comparison uses all the data available for it.

Consequence: Pairwise deletion can result in different sample sizes for different comparisons, potentially affecting the precision and power of the analysis. It may also lead to biased estimates if the missingness is not random or is related to the variables being compared.

Imputation:

Imputation methods involve replacing missing values with estimated values. Common imputation techniques include mean imputation, last observation carried forward (LOCF), multiple imputation, etc.

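Consequence: Imputation retains the sample size and can reduce bias. However, single imputation methods such as mean imputation and LOCF treat the imputed values as if they had been observed, which understates the variability in the data and produces standard errors that are too small, distorting hypothesis tests and confidence intervals. Multiple imputation addresses this by propagating the uncertainty of the imputed values into the final estimates. The accuracy of any imputation method depends on the assumptions made about the missing data mechanism.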

Mixed-effects models:

Mixed-effects models, such as linear mixed models, can handle missing data by utilizing all available data while estimating the fixed effects and accounting for within-subject dependencies.

Consequence: Mixed-effects models allow for more efficient use of available data, preserve statistical power, and can provide unbiased estimates under the missing at random (MAR) assumption. However, the validity of the results relies on correctly specifying the random effects structure and assumptions about the missing data mechanism.

It is essential to carefully evaluate the missing data pattern, understand the underlying mechanisms causing missingness, and select an appropriate method based on the missing data assumptions, research objectives, and potential consequences. Additionally, sensitivity analyses and comparisons between different missing data methods can help assess the robustness and consistency of the results.
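
The difference between two of these approaches can be seen on a small example. The sketch below uses a hypothetical wide-format array (rows are subjects, columns are time points, `np.nan` marks a missing measurement) and compares listwise deletion with mean imputation.

```python
import numpy as np

# Hypothetical wide-format data: rows are subjects, columns are time points
scores = np.array([
    [5.0, 6.0, 7.0],
    [4.0, np.nan, 6.0],
    [6.0, 7.0, 8.0],
    [5.0, 5.0, np.nan],
    [7.0, 8.0, 9.0],
])

# Listwise deletion: keep only subjects with no missing values
complete = scores[~np.isnan(scores).any(axis=1)]

# Mean imputation: fill each gap with the mean of the observed values at that time point
column_means = np.nanmean(scores, axis=0)
imputed = np.where(np.isnan(scores), column_means, scores)

print("Subjects kept after listwise deletion:", complete.shape[0], "of", scores.shape[0])
print("Per-time-point variance, listwise:", complete.var(axis=0, ddof=1))
print("Per-time-point variance, imputed: ", imputed.var(axis=0, ddof=1))
```

Listwise deletion discards two of the five subjects here, while mean imputation keeps all five but places the filled-in values exactly at the column mean, which is how it understates variability.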





## Q8
Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide
an example of a situation where a post-hoc test might be necessary.

After conducting an ANOVA and finding a significant overall effect, post-hoc tests are often employed to identify specific group differences. Some common post-hoc tests include:

Tukey's Honestly Significant Difference (HSD) test:

Tukey's HSD test is widely used to compare all possible pairs of group means. It was designed for balanced designs (equal sample sizes); the Tukey-Kramer extension handles unequal sample sizes.
It controls the familywise error rate, making it suitable for situations where multiple pairwise comparisons are conducted. It provides simultaneous confidence intervals to assess the differences between groups.

Bonferroni correction:

The Bonferroni correction is a conservative method that adjusts the significance level for each individual comparison to maintain an overall familywise error rate.
The significance level is divided by the number of comparisons, so the method works best for a small number of planned comparisons and becomes increasingly conservative, losing power, as the number of comparisons grows.

Scheffe's test:

Scheffe's test is a conservative post-hoc test that allows for comparisons between any sets of means, including complex contrasts such as the average of two groups versus a third.
It controls the familywise error rate across all possible contrasts, so it is typically used when complex or unplanned comparisons are of interest. For simple pairwise comparisons it has less power than Tukey's HSD.

Dunnett's test:

Dunnett's test is used when comparing multiple treatment groups against a control group or reference group.
It adjusts for multiple comparisons while maintaining a higher power by focusing only on the relevant comparisons between the treatment groups and the control/reference group.

Fisher's Least Significant Difference (LSD) test:

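Fisher's LSD test is an older post-hoc test that performs ordinary pairwise t-tests using the pooled error variance from the ANOVA, and it does not require equal sample sizes.
It is run only after a significant overall F-test, but it applies no further adjustment for multiple comparisons, so it does not control the familywise error rate when there are more than three groups. It is therefore less conservative than the other post-hoc tests listed here.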

Example situation:

Suppose a researcher conducted a study comparing the effects of four different treatments on pain relief in patients with a specific condition. The ANOVA reveals a statistically significant overall effect, indicating that there are differences in pain relief among the treatment groups. In this case, a post-hoc test would be necessary to determine which specific treatment groups differ significantly from each other. The researcher could employ Tukey's HSD test to conduct pairwise comparisons of the treatment group means and identify the specific group differences. This would provide a comprehensive assessment of the pairwise differences and assist in interpreting the results of the study.
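
The example situation can be sketched in code. The block below simulates pain-relief scores for four hypothetical treatments (the numbers are not real data), runs the overall ANOVA, and then applies Tukey's HSD with `scipy.stats.tukey_hsd`, which is assumed to be available (it was added in SciPy 1.8).

```python
import numpy as np
from scipy import stats

# Hypothetical pain-relief scores for four treatments (simulated, not real data)
rng = np.random.default_rng(2)
treatments = {
    "A": rng.normal(5.0, 1.0, 20),
    "B": rng.normal(5.2, 1.0, 20),
    "C": rng.normal(6.5, 1.0, 20),
    "D": rng.normal(5.1, 1.0, 20),
}

# Omnibus test first, then all pairwise comparisons with Tukey's HSD
f_stat, p_value = stats.f_oneway(*treatments.values())
result = stats.tukey_hsd(*treatments.values())

# result.pvalue[i, j] is the adjusted p-value for comparing group i with group j
names = list(treatments)
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        print(f"{names[i]} vs {names[j]}: adjusted p = {result.pvalue[i, j]:.4f}")
```

Pairs with an adjusted p-value below the chosen significance level are the ones that differ significantly.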

## Q9
Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from
50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python
to determine if there are any significant differences between the mean weight loss of the three diets.
Report the F-statistic and p-value, and interpret the results.

In [None]:
import numpy as np
from scipy import stats


In [None]:
data_a = np.array([1.5, 2.0, 1.8, ...])  # Replace ... with the actual data
data_b = np.array([2.2, 1.9, 2.5, ...])  # Replace ... with the actual data
data_c = np.array([1.0, 1.5, 0.8, ...])  # Replace ... with the actual data

all_data = [data_a, data_b, data_c]


In [None]:
f_statistic, p_value = stats.f_oneway(*all_data)


In [None]:
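alpha = 0.05  # Set the significance level

# Report the F-statistic and p-value, as the question asks
print(f"F-statistic: {f_statistic:.3f}, p-value: {p_value:.4f}")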

if p_value < alpha:
    print("The one-way ANOVA results are statistically significant.")
    print("There is evidence of significant differences between the mean weight loss of the three diets.")
else:
    print("The one-way ANOVA results are not statistically significant.")
    print("There is no evidence of significant differences between the mean weight loss of the three diets.")


## Q10
Q10. A company wants to know if there are any significant differences in the average time it takes to
complete a task using three different software programs: Program A, Program B, and Program C. They
randomly assign 30 employees to one of the programs and record the time it takes each employee to
complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or
interaction effects between the software programs and employee experience level (novice vs.
experienced). Report the F-statistics and p-values, and interpret the results.

In [None]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols


In [None]:
data = pd.read_csv('data.csv')  # Replace 'data.csv' with the actual file name or provide the DataFrame directly

# Verify the structure of the DataFrame
print(data.head())


In [None]:
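# C(...) marks both factors as categorical; '*' adds main effects plus the interaction
model = ols('Time ~ C(Program) * C(Experience)', data=data).fit()
# Type II sums of squares do not depend on the order in which the factors are listed
anova_table = sm.stats.anova_lm(model, typ=2)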


In [None]:
print(anova_table)


## Q11
Q11. An educational researcher is interested in whether a new teaching method improves student test
scores. They randomly assign 100 students to either the control group (traditional teaching method) or the
experimental group (new teaching method) and administer a test at the end of the semester. Conduct a
two-sample t-test using Python to determine if there are any significant differences in test scores
between the two groups. If the results are significant, follow up with a post-hoc test to determine which
group(s) differ significantly from each other.

In [None]:
import numpy as np
from scipy import stats


In [None]:
control_scores = np.array([80, 85, 90, ...])  # Replace ... with the actual control group scores
experimental_scores = np.array([90, 92, 88, ...])  # Replace ... with the actual experimental group scores


In [None]:
t_statistic, p_value = stats.ttest_ind(control_scores, experimental_scores)


In [None]:
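alpha = 0.05  # Set the significance level

# Report the t-statistic and p-value
print(f"t-statistic: {t_statistic:.3f}, p-value: {p_value:.4f}")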

if p_value < alpha:
    print("The two-sample t-test results are statistically significant.")
    print("There is evidence of significant differences in test scores between the control and experimental groups.")
    print("With only two groups, no post-hoc test is needed: the t-test itself shows that these two groups differ.")
else:
    print("The two-sample t-test results are not statistically significant.")
    print("There is no evidence of significant differences in test scores between the control and experimental groups.")


## Q12
Q12. A researcher wants to know if there are any significant differences in the average daily sales of three
retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store
on those days. Conduct a repeated measures ANOVA using Python to determine if there are any
significant differences in sales between the three stores. If the results are significant, follow up with a
post-hoc test to determine which store(s) differ significantly from each other.

In [None]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.anova import AnovaRM


In [None]:
data = pd.read_csv('data.csv')  # Replace 'data.csv' with the actual file name or provide the DataFrame directly

# Verify the structure of the DataFrame
print(data.head())


In [None]:
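# `data` must be in long format with one row per (Day, Store) pair.
# Each day is the repeated "subject"; Store is the within-subject factor.
rm_anova = AnovaRM(data, depvar='Sales', subject='Day', within=['Store'])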
results = rm_anova.fit()


In [None]:
print(results)
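
If the repeated measures ANOVA is significant, one possible post-hoc follow-up is a set of paired t-tests between stores with a Bonferroni adjustment. The sketch below is one such approach, not the only one; the sales figures are simulated and hypothetical, with one value per store for each of 30 days.

```python
from itertools import combinations

import numpy as np
from scipy import stats

# Hypothetical sales: one array per store, aligned by day (simulated, not real data)
rng = np.random.default_rng(3)
day_effect = rng.normal(0.0, 5.0, size=30)  # day-to-day variation shared by all stores
sales = {
    "A": 100.0 + day_effect + rng.normal(0.0, 2.0, 30),
    "B": 103.0 + day_effect + rng.normal(0.0, 2.0, 30),
    "C": 100.5 + day_effect + rng.normal(0.0, 2.0, 30),
}

pairs = list(combinations(sales, 2))
adjusted = {}
for first, second in pairs:
    # Paired test because the same days are observed for both stores
    t_stat, p_raw = stats.ttest_rel(sales[first], sales[second])
    # Bonferroni: multiply by the number of comparisons, capped at 1
    adjusted[(first, second)] = min(p_raw * len(pairs), 1.0)
    print(f"{first} vs {second}: t = {t_stat:.3f}, adjusted p = {adjusted[(first, second)]:.4f}")
```

The paired test is used because the same days are measured for every store, which matches the repeated measures design of the main analysis.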
