# ANOVA

### by

## Jeff Gross


based on SAS e-learning

## Graphical Analysis of Associations

<img src="files/twosample.png">

<img src="files/ANOVA.png">

<img src="files/ANOVA_flowchart.png">

<img src="files/boxplot_1.png">

In [2]:
libname statdata "/folders/myfolders/ECST131"; 
libname library "/folders/myfolders/ECST131";

### Assumptions for Two-Sample t-Test
#### 1.Independent observations (a random, representative sample with data collected correctly; this assumption is taken to be true)
#### 2.Normality
#### 3.Homogeneity of variance (use the F-test to check this assumption formally)
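
As a worked check of the F-test for equal variances, the folded F statistic is the larger sample variance divided by the smaller one. The standard deviations used here are the ones PROC TTEST reports for the German data later in this notebook; the symbols $s_1$ and $s_2$ are notation introduced for this illustration.

$$
F = \frac{\max(s_1^2,\, s_2^2)}{\min(s_1^2,\, s_2^2)} = \frac{14.8535^2}{8.6166^2} \approx \frac{220.63}{74.25} \approx 2.97
$$

The numerator has 15 - 1 = 14 degrees of freedom and the denominator has 13 - 1 = 12, which matches the Folded F row of the Equality of Variances table.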

<img src="files/F_test.png">

<img src="files/F_test_1.png">

<img src="files/F_test_2.png">

<img src="files/F_test_3.png">

## Using PROC TTEST to Compare Means

### Task:  Determine the effectiveness of a new type of foreign language teaching technique on student grammar skills. 

### Result: 
#### Normality: The data appear approximately normal based on the histogram, probability plot, skewness, and kurtosis.
#### Equal variance: The folded F test has a p-value of .066 > .05, so we fail to reject the null hypothesis that the variances are equal.
#### Pooled t-test: The p-value of .1788 > .05, so we fail to reject the null hypothesis that the means are equal. Therefore, there is not enough evidence to conclude that the new teaching technique is better than the old teaching technique.

In [None]:
title 'Descriptive Statistics';
proc univariate data=Statdata.German noprint;
   var Change;
   histogram Change / normal(mu=est sigma=est noprint);
   inset min max skewness kurtosis / position=ne;
   probplot Change / normal(mu=est sigma=est);
   inset min max skewness kurtosis;
run;
title;

In [4]:
title "One-Sided t-Test Comparing Test Performance with and without teaching technique";
title2 'One-Sided t-Test';
proc ttest data=Statdata.German plots(only shownull)=interval h0=0 sides=L;
    class Group; 
    var Change; 
run; 
title;

Group,N,Mean,Std Dev,Std Err,Minimum,Maximum
Control,13.0,6.9677,8.6166,2.3898,-6.24,19.41
Treatment,15.0,11.3587,14.8535,3.8352,-17.33,32.92
Diff (1-2),,-4.391,12.372,4.6882,,

Group,Method,Mean,95% CL Mean (Lower),95% CL Mean (Upper),Std Dev,95% CL Std Dev (Lower),95% CL Std Dev (Upper)
Control,,6.9677,1.7607,12.1747,8.6166,6.1789,14.2238
Treatment,,11.3587,3.1331,19.5843,14.8535,10.8747,23.4255
Diff (1-2),Pooled,-4.391,-Infty,3.6052,12.372,9.7432,16.955
Diff (1-2),Satterthwaite,-4.391,-Infty,3.3545,,,

Method,Variances,DF,t Value,Pr < t
Pooled,Equal,26.0,-0.94,0.1788
Satterthwaite,Unequal,22.947,-0.97,0.1707

Equality of Variances
Method,Num DF,Den DF,F Value,Pr > F
Folded F,14,12,2.97,0.066
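
The pooled row of the t-test output can be reproduced by hand from the summary statistics above. This is a worked check using the reported group sizes, means, and standard deviations; the symbols $n_i$, $\bar{x}_i$, $s_i$, and $s_p$ are notation introduced here.

$$
s_p^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2} = \frac{12\,(8.6166)^2 + 14\,(14.8535)^2}{26} \approx 153.07, \qquad s_p \approx 12.372
$$

$$
t = \frac{\bar{x}_1 - \bar{x}_2}{s_p\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}} = \frac{6.9677 - 11.3587}{12.372\sqrt{\dfrac{1}{13} + \dfrac{1}{15}}} \approx \frac{-4.391}{4.688} \approx -0.94
$$

The result agrees with the pooled standard deviation (12.372), the standard error of the difference (4.6882), and the t value (-0.94) on 26 degrees of freedom.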


## One Way ANOVA

### With only two groups, one-way ANOVA is equivalent to the pooled two-sample t-test
### In that case, F statistic = (t statistic)^2
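
A minimal sketch of this equivalence, assuming the same Statdata.German data set and two-level Group variable used in the PROC TTEST step above. Fitting the one-way model in PROC GLM should give an F value equal to the square of the pooled t value, roughly (-0.94)^2 ≈ 0.88. Because the F test is two-sided, its p-value corresponds to the two-sided pooled t-test, not to the one-sided `sides=L` test shown earlier.

```sas
/* One-way ANOVA on the two-group German data (illustrative sketch). */
/* With two groups, the overall F test is the pooled t-test squared. */
title 'One-Way ANOVA on Two Groups';
proc glm data=statdata.german;
   class Group;
   model Change=Group;
run;
quit;
title;
```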


<img src="files/ANOVA_1.png">

<img src="files/ANOVA_2.png">

<img src="files/ANOVA_3.png">

### SSM+SSE=SST

#### SSM: variability explained by the type of medication (you want the explained piece of the total to be large relative to the piece you cannot explain)
#### SSE: variability not explained by the type of medication
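
In symbols, the decomposition for a one-way layout can be written as follows. The notation is introduced here for illustration: $y_{ij}$ is observation $j$ in group $i$, $\bar{y}_i$ is the mean of group $i$ with $n_i$ observations, and $\bar{y}$ is the overall mean.

$$
\underbrace{\sum_i \sum_j (y_{ij} - \bar{y})^2}_{SST}
= \underbrace{\sum_i n_i\,(\bar{y}_i - \bar{y})^2}_{SSM}
+ \underbrace{\sum_i \sum_j (y_{ij} - \bar{y}_i)^2}_{SSE}
$$

The F statistic compares the two pieces per degree of freedom. For $k$ groups and $N$ observations:

$$
F = \frac{SSM/(k-1)}{SSE/(N-k)}
$$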

<img src="files/SS.png">

### Assumptions:

#### 1.Independent observations (do you have a good random sample?)
#### 2.error terms are normally distributed (are residuals normal?)
#### 3.error terms have equal variances across treatments (Levene's test: if p>.05 then fail to reject null hypothesis of equal variances)

### Task:  Are the average sales significantly different for 4 advertising types:  local newspaper ads, local radio ads, in-store salespeople, and in-store displays?

### Result:
#### BoxPlot: In-store display mean is lower than the others. In-store display has a positive outlier, and local radio has outliers in both directions.
#### Normality: The histogram and Q-Q plot both show that the residuals seem normally distributed.
#### Homogeneity of Variance: Levene's Test for Homogeneity of Variance shows a p-value (0.3532) greater than alpha (.05). Therefore, you do not reject the hypothesis of equal variances across advertising types.
#### Overall F test: The overall F value (13.48) from the analysis of variance table has a p-value less than .0001.
#### Conclusion:  At least one treatment mean is different from one other treatment mean. At this point, it is not known which means are significantly different.

In [5]:
proc means data=statdata.ads printalltypes n mean 
           std skewness kurtosis;
   var Sales;
   class Ad;
   title 'Descriptive Statistics of Sales by Ad Type';
run;

proc sgplot data=statdata.ads;
   vbox Sales / category=Ad datalabel=Sales;
   title 'Box Plots of Sales by Ad Type';
run;
title;

proc glm data=statdata.ads plots(only)=diagnostics(unpack);
   class Ad;
   model Sales=Ad;
   means Ad / hovtest;
   title 'Testing for Equality of Ad Type on Sales';
run;
quit;
title;

Analysis Variable : Sales
N Obs,N,Mean,Std Dev,Skewness,Kurtosis
144,144,66.8194444,13.5278282,-0.2547089,-0.1295813

Analysis Variable : Sales
Ad,N Obs,N,Mean,Std Dev,Skewness,Kurtosis
display,36,36,56.5555556,11.6188134,0.345647,0.0256814
paper,36,36,73.2222222,9.7339204,-0.0474705,-0.5475341
people,36,36,66.6111111,13.4976776,-0.5998808,-0.2130516
radio,36,36,70.8888889,12.9676031,-0.2172278,1.6565242

Class Level Information
Class,Levels,Values
Ad,4,display paper people radio

Number of Observations Read,144
Number of Observations Used,144

Source,DF,Sum of Squares,Mean Square,F Value,Pr > F
Model,3,5866.08333,1955.36111,13.48,<.0001
Error,140,20303.22222,145.02302,,
Corrected Total,143,26169.30556,,,

R-Square,Coeff Var,Root MSE,Sales Mean
0.224159,18.02252,12.04255,66.81944

Source,DF,Type I SS,Mean Square,F Value,Pr > F
Ad,3,5866.083333,1955.361111,13.48,<.0001

Source,DF,Type III SS,Mean Square,F Value,Pr > F
Ad,3,5866.083333,1955.361111,13.48,<.0001

Levene's Test for Homogeneity of Sales Variance: ANOVA of Squared Deviations from Group Means
Source,DF,Sum of Squares,Mean Square,F Value,Pr > F
Ad,3,154637,51545.6,1.1,0.3532
Error,140,6586668,47047.6,,

Level of Ad,N,Sales Mean,Sales Std Dev
display,36,56.5555556,11.6188134
paper,36,73.2222222,9.7339204
people,36,66.6111111,13.4976776
radio,36,70.8888889,12.9676031
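
As a worked check, the entries in the PROC GLM tables above can be reproduced from the reported sums of squares and degrees of freedom. All numbers come from the output above, rounded to two decimals.

$$
\begin{aligned}
SSM + SSE &= 5866.08 + 20303.22 \approx 26169.31 = SST \\
F &= \frac{SSM/3}{SSE/140} = \frac{1955.36}{145.02} \approx 13.48 \\
R^2 &= \frac{SSM}{SST} = \frac{5866.08}{26169.31} \approx 0.224 \\
\text{Root MSE} &= \sqrt{145.02} \approx 12.04 \\
F_{\text{Levene}} &= \frac{154637/3}{6586668/140} = \frac{51545.6}{47047.6} \approx 1.10
\end{aligned}
$$

So advertising type explains about 22% of the variability in sales, and the Levene F value near 1 is consistent with the large p-value (0.3532) for the equal-variance test.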


## ANOVA with Data from a Randomized Block Design

### Levene's test for homogeneity of variance (equal variances) is available only for one-way ANOVA
### If the F value for the blocking factor is greater than 1, then it helped to add the blocking factor to the model. Adding the blocking factor decreased the unexplained variability of the response.

<img src="files/rand_block.png">

### Task: Were the average sales significantly different for four advertising types:  local newspaper ads, local radio ads, in-store salespeople, and in-store displays in 36 locations across the U.S.?

### Result: 
#### Normality: The Q-Q Plot of Residuals indicates that the normality assumption for ANOVA is met. 
#### The p-value for Ad (<.0001) indicates that there was some difference in sales among the advertising campaign types when controlling for Area. 
#### The large (statistically significant) F-value for Area gives evidence that the area of the country was a useful factor on which to block. It explains a significant amount of the variability, and helps improve the model.

In [7]:
title 'ANOVA for Randomized Block Design';
proc glm data=statdata.ads1 plots(only)=diagnostics(unpack);
   class Ad Area;
   model Sales=Ad Area;
run;
quit;
title;

Class Level Information
Class,Levels,Values
Ad,4,display paper people radio
Area,18,1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18

Number of Observations Read,144
Number of Observations Used,144

Source,DF,Sum of Squares,Mean Square,F Value,Pr > F
Model,20,15131.38889,756.56944,8.43,<.0001
Error,123,11037.91667,89.73916,,
Corrected Total,143,26169.30556,,,

R-Square,Coeff Var,Root MSE,Sales Mean
0.578211,14.17712,9.473076,66.81944

Source,DF,Type I SS,Mean Square,F Value,Pr > F
Ad,3,5866.083333,1955.361111,21.79,<.0001
Area,17,9265.305556,545.017974,6.07,<.0001

Source,DF,Type III SS,Mean Square,F Value,Pr > F
Ad,3,5866.083333,1955.361111,21.79,<.0001
Area,17,9265.305556,545.017974,6.07,<.0001
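
The effect of blocking can be checked by comparing this output with the one-way ANOVA output earlier in the notebook. The corrected total and the Ad sum of squares are identical in the two tables, so the comparison below assumes both runs use the same Sales values. The subscripts are labels introduced here.

$$
\begin{aligned}
SSE_{\text{block}} &= SSE_{\text{one-way}} - SS_{\text{Area}} = 20303.22 - 9265.31 \approx 11037.92 \\
MSE_{\text{block}} &= \frac{11037.92}{140 - 17} = \frac{11037.92}{123} \approx 89.74 \quad (\text{was } 145.02) \\
F_{\text{Ad}} &= \frac{1955.36}{89.74} \approx 21.79 \quad (\text{was } 13.48) \\
R^2 &= \frac{15131.39}{26169.31} \approx 0.578 \quad (\text{was } 0.224)
\end{aligned}
$$

Area moves about 9265 units of variability out of the error term, so the same Ad mean square is tested against a smaller error mean square.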


## ANOVA Post Hoc Tests

### The more tests you perform, the higher the probability that you conclude at least once that a difference exists when there really isn't one. The chance of making a Type I error increases each time you perform a statistical test.

<img src="files/post_hoc.png">

<img src="files/EER.png">

### CER (comparisonwise error rate) is the probability of a Type I error on a single pairwise t-test.

### EER (experimentwise error rate) is the probability of making at least one Type I error when performing a whole set of comparisons.  It takes into consideration the number of comparisons, so it increases as the number of tests increases.
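
A small sketch of how the EER grows with the number of comparisons. It assumes the comparisons are independent, in which case EER = 1 - (1 - CER)^c, where c is the number of pairwise comparisons. The data set name `eer_table` and its variable names are illustrative and do not come from the course material.

```sas
/* EER for all pairwise comparisons among 2 to 6 groups,        */
/* assuming independent tests, each run at CER = 0.05.          */
data eer_table;
   cer = 0.05;
   do groups = 2 to 6;
      comparisons = comb(groups, 2);      /* number of pairs          */
      eer = 1 - (1 - cer)**comparisons;   /* P(at least one Type I)   */
      output;
   end;
run;

title 'EER by Number of Groups (CER = 0.05)';
proc print data=eer_table noobs;
run;
title;
```

With four groups, as in the advertising example, there are 6 pairwise comparisons and the EER under this assumption is 1 - 0.95^6, or about 0.265.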

### Methods that control the EER:

#### Tukey's method (Honestly Significant Difference test): for all pairwise comparisons only
#### Dunnett's method: compares a control with each of the other treatments

### Task: Perform a post hoc test and look at the individual differences among means for the advertising types.

### Result: 
#### Tukey: The Tukey comparisons show significant differences between Display and all other types of advertising, and between Paper and People (p=0.0190). 

#### Dunnett: Dunnett's method showed that all other advertising campaigns resulted in significantly better average sales (statistically different) than Display.

In [9]:
ods select LSMeans Diff MeanPlot DiffPlot ControlPlot;
proc glm data=statdata.ads1;
   class Ad Area;
   model Sales=Ad Area;
   lsmeans Ad / pdiff=all adjust=tukey;
   lsmeans Ad / pdiff=controlu('display') adjust=dunnett;
   title 'Pairwise Differences for Ad Types on Sales';
run;
quit;
title;

Ad,Sales LSMEAN,LSMEAN Number
display,56.5555556,1
paper,73.2222222,2
people,66.6111111,3
radio,70.8888889,4

Least Squares Means for effect Ad: Pr > |t| for H0: LSMean(i)=LSMean(j), Dependent Variable: Sales
i/j,1,2,3,4
1,,<.0001,<.0001,<.0001
2,<.0001,,0.0190,0.7233
3,<.0001,0.0190,,0.2268
4,<.0001,0.7233,0.2268,

Ad,Sales LSMEAN,H0:LSMean=Control Pr > t
display,56.5555556,
paper,73.2222222,<.0001
people,66.6111111,<.0001
radio,70.8888889,<.0001
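
As an illustration of what sits behind one cell of the Tukey table, the unadjusted t statistic for Paper versus People can be worked out by hand. This sketch assumes the balanced design shown in the output (36 observations per ad type) and uses the error mean square of 89.739 from the randomized block model.

$$
t = \frac{73.2222 - 66.6111}{\sqrt{89.739\left(\dfrac{1}{36} + \dfrac{1}{36}\right)}} \approx \frac{6.611}{2.233} \approx 2.96
$$

SAS then applies the Tukey adjustment, which is based on the studentized range distribution, to arrive at the adjusted p-value reported in the table (0.0190). That step is not reproduced here.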


## Two-Way ANOVA with Interactions

### n-way ANOVA: ANOVA with n predictor variables

<img src="files/interact.png">

<img src="files/two_way.png">

<img src="files/interact_1.png">

<img src="files/interact_2.png">

<img src="files/store_out.png">

<img src="files/proc_plm.png">


### Task: Consider an experiment to test three different brands of concrete to determine whether an additive makes the cement in the concrete stronger.

### Result:
#### Interaction term between Additive and Brand of Concrete:
In the plot of stratified means, the difference between the standard and reinforced means (standard minus reinforced) for Graystone is about -5.38, whereas the mean difference for Consolidated is -3.2 and for EZ Mix is -2.86. Therefore, it appears that the difference between concretes using standard and reinforced cements differs by brand. In other words, it appears that there is an interaction between the Additive and the Brand of concrete. That means that an interaction term in the ANOVA model would be appropriate to assess the statistical significance of the interaction.

There is no significant interaction between Additive and Brand, p-value (.4862) > .05, even though the plot shows slightly different slopes among the three brands of concrete. The interaction term can be removed, and if the additive type is significant, it can be concluded that there is a difference in additive types.

#### Additive versus Standard Concrete
The test for Additive is still significant after the interaction term is removed. There is a difference between standard and reinforced. The estimates of the two least squares means are found in the Least Squares Means results of the analysis titled "Analyze the Effects of Additive and Brand on Concrete Strength without Interaction". A reinforced additive in the concrete seems to add more strength than a standard additive does. The mean difference is about 3.8.

In [3]:
proc means data=statdata.concrete mean var std printalltypes;
   class Brand Additive;
   var Strength;
   output out=means mean=Strength_Mean;
   title 'Selected Descriptive Statistics for Concrete Data Set';
run;

proc sgplot data=means;
   where _TYPE_=3;
   scatter x=Additive y=Strength_Mean / group=Brand 
           markerattrs=(size=10);
   xaxis integer;
   title 'Plot of Stratified Means in Concrete Data Set';
run;
title;

proc glm data=statdata.concrete;
   class Additive Brand;
   model Strength=Additive Brand Additive*Brand;
   title 'Analyze the Effects of Additive and Brand';
   title2 'on Concrete Strength';
run;
quit;
title;

proc glm data=statdata.concrete;
   class Additive Brand;
   model Strength=Additive Brand;
   lsmeans Additive;
   title 'Analyze the Effects of Additive and Brand';
   title2 'on Concrete Strength without Interaction';
run;
quit;
title;

Analysis Variable : Strength
N Obs,Mean,Variance,Std Dev
30,26.0,11.7537931,3.4283805

Analysis Variable : Strength
Additive,N Obs,Mean,Variance,Std Dev
reinforced,15,27.9066667,7.6606667,2.7677909
standard,15,24.0933333,8.896381,2.9826802

Analysis Variable : Strength
Brand,N Obs,Mean,Variance,Std Dev
Consolidated,10,24.2,6.3888889,2.5276251
EZ Mix,10,25.83,10.3067778,3.2104171
Graystone,10,27.97,13.2334444,3.6377802

Analysis Variable : Strength
Brand,Additive,N Obs,Mean,Variance,Std Dev
Consolidated,reinforced,5,25.8,5.63,2.3727621
Consolidated,standard,5,22.6,2.345,1.5313393
EZ Mix,reinforced,5,27.26,3.843,1.9603571
EZ Mix,standard,5,24.4,14.235,3.7729299
Graystone,reinforced,5,30.66,1.793,1.3390295
Graystone,standard,5,25.28,9.892,3.145155

Class Level Information
Class,Levels,Values
Additive,2,reinforced standard
Brand,3,Consolidated EZ Mix Graystone

Number of Observations Read,30
Number of Observations Used,30

Source,DF,Sum of Squares,Mean Square,F Value,Pr > F
Model,5,189.908,37.9816,6.04,0.0009
Error,24,150.952,6.2896667,,
Corrected Total,29,340.86,,,

R-Square,Coeff Var,Root MSE,Strength Mean
0.557144,9.645849,2.507921,26.0

Source,DF,Type I SS,Mean Square,F Value,Pr > F
Additive,1,109.0613333,109.0613333,17.34,0.0003
Brand,2,71.498,35.749,5.68,0.0095
Additive*Brand,2,9.3486667,4.6743333,0.74,0.4862

Source,DF,Type III SS,Mean Square,F Value,Pr > F
Additive,1,109.0613333,109.0613333,17.34,0.0003
Brand,2,71.498,35.749,5.68,0.0095
Additive*Brand,2,9.3486667,4.6743333,0.74,0.4862

Class Level Information
Class,Levels,Values
Additive,2,reinforced standard
Brand,3,Consolidated EZ Mix Graystone

Number of Observations Read,30
Number of Observations Used,30

Source,DF,Sum of Squares,Mean Square,F Value,Pr > F
Model,3,180.5593333,60.1864444,9.76,0.0002
Error,26,160.3006667,6.1654103,,
Corrected Total,29,340.86,,,

R-Square,Coeff Var,Root MSE,Strength Mean
0.529717,9.550094,2.483024,26.0

Source,DF,Type I SS,Mean Square,F Value,Pr > F
Additive,1,109.0613333,109.0613333,17.69,0.0003
Brand,2,71.498,35.749,5.8,0.0083

Source,DF,Type III SS,Mean Square,F Value,Pr > F
Additive,1,109.0613333,109.0613333,17.69,0.0003
Brand,2,71.498,35.749,5.8,0.0083

Additive,Strength LSMEAN
reinforced,27.9066667
standard,24.0933333
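
The two PROC GLM runs above can be tied together with a short worked check. When the interaction term is dropped, its sum of squares and degrees of freedom are pooled into the error term. All numbers come from the output tables above; the subscripts are labels introduced here.

$$
\begin{aligned}
F_{\text{Additive*Brand}} &= \frac{9.3487/2}{6.2897} = \frac{4.6743}{6.2897} \approx 0.74 \\
SSE_{\text{no interaction}} &= 150.952 + 9.3487 \approx 160.3007, \qquad df = 24 + 2 = 26 \\
MSE_{\text{no interaction}} &= \frac{160.3007}{26} \approx 6.1654 \\
F_{\text{Additive}} &= \frac{109.0613}{6.1654} \approx 17.69 \\
\bar{y}_{\text{reinforced}} - \bar{y}_{\text{standard}} &= 27.9067 - 24.0933 \approx 3.81
\end{aligned}
$$

The last line is the mean difference of about 3.8 quoted in the results.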


<img src="files/rand_block_1.png">