# DSO105 Int Stats Page 4 and 5 revamp

In [None]:
## in R

In [None]:
## load libraries


install.packages('rcompanion')
install.packages('car')
install.packages('readxl')

library(tidyverse)
library(IDPmisc)
library(rcompanion)
library(car)
library(readxl)

In [None]:
## load data for one way between subjects Anova in R

In [None]:
cc = read.csv('../../datasets/PlayData/BankChurners.csv')

In [None]:
head(cc)

#### list out all column value counts to get an idea of your data and what to look for

In [None]:
unique(cc[c("Card_Category")])
## let's see if there is a difference in age among the 4 card categories.

In [None]:
## make sure your DV is continuous (age is already)

## One Way Between Subjects ANOVAs in R

Now that you have a basic idea about what an ANOVA is, you will learn how to create ANOVAs in R, starting with the One Way ANOVA.

## Load Libraries

ANOVAs are part of base R, so you only need three additional libraries: dplyr (loaded here as part of tidyverse) for some data wrangling, rcompanion to check the assumption of normality, and car in case you need to run an ANOVA that corrects for a violation of homogeneity of variance.

## Question Setup

With this data, you will answer the question:

*Is there a difference in age among the 4 card categories?*

**In order to answer this question, your x, or independent variable, will be the Card_Category, which has four levels: Blue, Silver, Gold, and Platinum. Your y, or dependent variable, will be the Customer_Age. As with all ANOVAs, the IV will be categorical, and the DV will be continuous.**

## Data Wrangling

Depending on the data that you've been given, it may need some wrangling!

## Filter the Data and Remove Missing Values

In this case we do not need to filter any of our categorical variables, because we are using all four card levels. All that is left is to drop the missing data and make sure our DV (Customer_Age) is numeric.

In [None]:
cc1 = na.omit(cc)

In [None]:
head(cc1)

In [None]:
str(cc1)
## Customer_Age is an integer so we are good to proceed!
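To make the wrangling step concrete, here is a minimal sketch of what na.omit() does and how you might confirm the variable types afterwards. The `toy` data frame and its values are hypothetical stand-ins, not rows from the credit card file.

```r
# Hypothetical toy data standing in for the credit card file
toy <- data.frame(
  Customer_Age  = c(45L, NA, 51L, 38L, 60L),
  Card_Category = c("Blue", "Silver", NA, "Gold", "Blue")
)

# na.omit() drops every row that has a missing value in any column
toy_clean <- na.omit(toy)
nrow(toy)        # 5 rows before
nrow(toy_clean)  # 3 rows after

# Confirm the DV is numeric (integers count as numeric)
is.numeric(toy_clean$Customer_Age)

# Store the IV as a factor so R treats it as categorical
toy_clean$Card_Category <- factor(toy_clean$Card_Category)
levels(toy_clean$Card_Category)
```

The same checks should carry over to cc1 by swapping in the real data frame name.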

## Test Assumptions

Before you go any further, it's important to test for assumptions. If the assumptions for ANOVA are not met and you proceed anyway, you run the risk of biasing your results.

## Normality

You only need to test for the normality of the dependent variable, since the IV is categorical.

In [None]:
plotNormalHistogram(cc1$Customer_Age)
## This looks approx. normally distributed - alright!

## Homogeneity of Variance

You can test for homogeneity of variance easily using either Bartlett's test or Fligner's test. Bartlett's test is appropriate when your data is normally distributed, and Fligner's test is a non-parametric alternative for when it is not. No matter which test you are using, you are looking for a non-significant result. The null hypothesis for both of these is that the groups have equal variance, so you'd like to have a p value of > .05. You have already determined your data is normally distributed, so ordinarily you would just perform Bartlett's test, but just for learning purposes, you'll try both here.

### Bartlett's Test

To do Bartlett's test, use the function bartlett.test(). The first argument is a formula: the y variable, then a tilde, then the x variable. The second argument, data=, is where you specify the name of your dataset.

In [None]:
bartlett.test(Customer_Age ~ Card_Category, data = cc1)

The p value associated with this test is < .05, which means that unfortunately, you have violated the assumption of homogeneity of variance.

In [None]:
## this kind of makes sense, as cc customers of all ages have various levels of credit, 
## so there is more reason to believe that there is heterogeneity of variance

### Fligner's Test

To perform Fligner's test, use the function fligner.test(). As with Bartlett's test, the first argument is a formula (the y variable, then a tilde, then the x variable), and the data= argument is where you specify the name of your dataset.

In [None]:
fligner.test(Customer_Age ~ Card_Category, data = cc1)

In [None]:
## still shows violation of assumption of homogeneity of variance
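If you would rather check the result in code than read it off the printout, both tests return an object with a p.value element. The sketch below uses PlantGrowth, a small dataset that ships with R, as a stand-in so that it runs without the credit card file; the 0.05 cutoff is the same one used throughout this lesson.

```r
# PlantGrowth ships with R: plant weight for three treatment groups
bart <- bartlett.test(weight ~ group, data = PlantGrowth)
flig <- fligner.test(weight ~ group, data = PlantGrowth)

alpha <- 0.05

# TRUE means the test was non-significant, so the assumption is met
bart_ok <- bart$p.value > alpha
flig_ok <- flig$p.value > alpha

c(bartlett = bart_ok, fligner = flig_ok)
```

Replacing the formula and data with Customer_Age ~ Card_Category and cc1 would give the same kind of TRUE/FALSE summary for the lesson data.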

## Correcting for Violations of Homogeneity of Variance

There are two ways that you can correct for a violation of homogeneity of variance. The first is a Box-Cox transformation of your data. The second is to run a slightly different type of ANOVA, one designed to tolerate unequal variances, such as Welch's one-way test or an ANOVA with a heteroscedasticity-corrected F test. You'll learn about the second approach in this lesson.
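For reference, Welch's one-way test is available in base R as oneway.test() with var.equal = FALSE. The sketch below runs it on simulated data; the `sim` data frame, its group labels, and the chosen means and standard deviations are all made up for illustration.

```r
set.seed(42)

# Simulated scores for three groups with deliberately unequal spread
sim <- data.frame(
  group = factor(rep(c("A", "B", "C"), each = 30)),
  score = c(rnorm(30, mean = 50, sd = 2),
            rnorm(30, mean = 52, sd = 8),
            rnorm(30, mean = 51, sd = 15))
)

# var.equal = FALSE requests the Welch version of the one-way test
welch <- oneway.test(score ~ group, data = sim, var.equal = FALSE)

welch$method   # describes which version of the test was run
welch$p.value
```

With the lesson data, the call would take Customer_Age ~ Card_Category and data = cc1 instead.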

## Sample Size

An ANOVA requires a sample size of at least 20 per independent variable. In this case, you only have one independent variable, so as long as you have at least 20 cases, you are fine. Looking at the data, the n is 10127, so this assumption is met!
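A quick way to check sample size is to count the rows, and it can also help to count the cases in each group. This is a small sketch on the built-in PlantGrowth data; the threshold of 20 is the rule of thumb stated above.

```r
# Total number of cases
n_total <- nrow(PlantGrowth)

# Number of cases in each level of the independent variable
n_per_group <- table(PlantGrowth$group)

n_total
n_per_group

# Rule of thumb from this lesson: at least 20 cases per independent variable
n_total >= 20
```

For the lesson data you could run nrow(cc1) and table(cc1$Card_Category) in the same way.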

## Independence

There is no statistical test for the assumption of independence.

In [None]:
#### Page 5

## Computing ANOVAs with Equal Variance (Met Homogeneity of Variance Assumption)

In this case, your data did not meet this assumption, but for the purposes of learning, you'll be shown what to do if you had.

Below is the code to run a one-way ANOVA in R. You can give your ANOVA a name; this one is named cc1ANOVA. Then you want to use the function aov(). The argument for this function is your y variable, which is continuous, followed by a tilde and then your x variable, which is categorical. Remember that the tilde reads as "by," so you can think of this as analyzing age by card category.

In [None]:
cc1ANOVA <- aov(cc1$Customer_Age ~ cc1$Card_Category)

To see the results, call the summary() function on your named ANOVA:

In [None]:
summary(cc1ANOVA)

The first column of the output is the Df, or degrees of freedom. The value in the row for your category is calculated as # of Levels - 1, so that is always a good gut check. Next, you have columns for the Sum Sq and Mean Sq; these are just some of the calculations that went into getting your F value, which is the test statistic for an ANOVA. The real meat that you want to pay attention to is the F value itself and the associated p value next to it. As with the other tests you have run, if the p value is less than .05, the test was significant. If you ever need a reminder of that, you can look at the stars and the Signif. codes down at the bottom - there's one star listed, so it is significant at .05.

In [None]:
## the star and signif codes do not appear in R/Jup Lab
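When the stars are not printed, you can pull the numbers straight out of the summary table instead. The sketch below does this on the built-in PlantGrowth data, which stands in for the credit card file; the column names "Df", "F value", and "Pr(>F)" are the ones summary() uses for an aov model.

```r
fit <- aov(weight ~ group, data = PlantGrowth)

# summary() returns a list; the first element is the ANOVA table
tab <- summary(fit)[[1]]

df_group <- tab[1, "Df"]       # degrees of freedom for the grouping variable
f_value  <- tab[1, "F value"]  # test statistic
p_value  <- tab[1, "Pr(>F)"]   # p value for the F test

# Gut check: Df should equal the number of levels minus 1
df_group == nlevels(PlantGrowth$group) - 1

# TRUE means the ANOVA is significant at .05
p_value < 0.05
```

The same indexing should work on summary(cc1ANOVA)[[1]].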

## Computing ANOVAs with Unequal Variance (Violated Homogeneity of Variance Assumption)

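If you need to correct for violating the assumption of homogeneity of variance, you can run an ANOVA that was designed to tolerate that violation. The approach shown here creates a linear model first and then uses the Anova() function from the car package with a heteroscedasticity-corrected (White-adjusted) covariance matrix. It serves the same purpose as Welch's one-way test, although it is not the same calculation.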

In [None]:
## this is the ANOVA we would like to run since we have heterogeneity of variance

First, create and name a linear model that uses the same setup as the ANOVA with equal variance. Then, call the Anova() function on that named model, include the argument type= and set it to "II" because this is a between subjects ANOVA, and then use the argument white.adjust=TRUE. Note that the argument name is a lowercase type=; R is case sensitive, so Type= is not recognized as the same argument. This last part, setting white.adjust= to TRUE, is what makes this ANOVA appropriate when you have unequal variance.

In [None]:
ANOVA <- lm(Customer_Age ~ Card_Category, data = cc1)
Anova(ANOVA, type="II", white.adjust=TRUE)

In [None]:
## need to run in R to see the star and signif codes

## Post Hocs

Now the problem with an ANOVA is that you have multiple groups. When you found significance with a t-test, you were able to just look at the means and you knew where the significant difference lay. You knew what was higher, and what was lower. But with an ANOVA, you can't just look at the means right away, because the F and associated p value just let you know that there is a difference between at least one pair of the four categories. In your example, the mean age could be different between Blue and Silver, Blue and Gold, Blue and Platinum, Silver and Gold, Silver and Platinum, Gold and Platinum, or some combination of those six pairs!

That's where post hocs come in. They are specifically designed to test all the pairs between your data, which is why they are also often known as pairwise comparisons. This is done with t-tests. But the inherent problem with using multiple t-tests is that the more analyses you run, the more you increase your chances of Type I error. So you're more likely to find something significant when it really isn't. So, typically a post hoc will apply a correction, or adjustment, so that the t-tests become more stringent, and you are therefore correcting for doing multiple t-tests by applying stricter criteria. That way your Type I error doesn't run rampant!
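To see how quickly Type I error can grow, here is a rough back-of-the-envelope calculation. It assumes the tests are independent, which pairwise comparisons are not exactly, so treat the result as an approximation rather than an exact error rate.

```r
alpha <- 0.05   # Type I error rate for a single t-test
k     <- 4      # number of card categories

# Number of pairwise comparisons among k groups
m <- choose(k, 2)
m   # 6

# Chance of at least one false positive across m independent tests
fwer <- 1 - (1 - alpha)^m
round(fwer, 3)   # 0.265
```

Under that assumption, running six uncorrected t-tests raises the chance of at least one false positive from 5% to roughly 26%.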

There are many different corrections you can apply, but the most common ones you'll hear about are Tukey, Bonferroni, Holm, and Scheffe, all named after the people who came up with them, by the way. They differ in how much correction they apply: Tukey is generally among the least strict for pairwise comparisons, Scheffe is the most strict, and Holm is a stepwise version of Bonferroni that is never stricter than Bonferroni itself. The pairwise.t.test() function used below does not offer Tukey or Scheffe adjustments, so you'll stick to exploring the difference between no correction at all and a Bonferroni correction.

## Computing Post Hocs with No Adjustment

Use the pairwise.t.test() function, with the arguments of the two variables you are crossing (the dependent variable first, then the grouping variable), and the argument p.adjust=, which R matches to the full argument name p.adjust.method=. To show you why a correction is necessary, you will start out with a value of "none", which means that no correction is being made for Type I error. Here is the code for computing a post hoc in R:

In [None]:
pairwise.t.test(cc1$Customer_Age, cc1$Card_Category, p.adjust="none")
## no correction for Type 1 Error

The matrix above shows the p value for each t-test between pairs of levels of your independent variable. Reading this, you can see that there was not a significant difference in age between Gold and Blue, Platinum and Blue, Gold and Platinum, Gold and Silver, or Silver and Platinum. There is, however, a significant difference between the age of customers using Blue and Silver cards, at p = .048.

## Computing Post Hocs with Bonferroni Adjustment

You may be pretty pleased with finding a significant difference in age between card categories. But guess what? That difference may not really exist, because by running six t-tests (one for each pair of the four card categories), you may have increased your Type I error. So, it is typically better to stick with some form of correction, like Bonferroni. It is simple, and it gets the job done!

In [None]:
pairwise.t.test(cc1$Customer_Age, cc1$Card_Category, p.adjust="bonferroni")
## here we are correcting for Type 1 Error
## taking into account the several t-tests we are running

We see that our significant p-value is no longer significant.

We also see that the p-values that were already non-significant moved even further from significance. Blue and Gold, for example, went from 0.213 to 1.00.

Since a p value can only be between 0 and 1, that's the end of the line; it is as non-significant as something gets. This has just demonstrated why it's important to always, always, apply a correction to your post hocs!
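The arithmetic behind the Bonferroni adjustment is simple enough to do by hand: each p value is multiplied by the number of comparisons and capped at 1. The sketch below uses the two uncorrected p values quoted above (0.048 for Blue and Silver, 0.213 for Blue and Gold) and assumes six comparisons, one per pair of the four card categories. The vector names are just labels for illustration.

```r
# Uncorrected p values quoted in the text
raw_p <- c(blue_silver = 0.048, blue_gold = 0.213)

# Four card categories give six pairwise comparisons
n_tests <- choose(4, 2)

# Bonferroni by hand: multiply by the number of tests, cap at 1
by_hand <- pmin(raw_p * n_tests, 1)

# The same adjustment using base R's p.adjust()
built_in <- p.adjust(raw_p, method = "bonferroni", n = n_tests)

by_hand
built_in
```

The Blue and Silver value becomes 0.288, which is no longer below .05, and the Blue and Gold value hits the cap of 1.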

## Computing Post Hocs When You've Violated the Assumption of Homogeneity of Variance

There is an easy solution to computing post hocs when you have violated the assumption of homogeneity of variance. You'll use the same code as above, but include the argument pool.sd = FALSE at the end. Like this:

In [None]:
pairwise.t.test(cc1$Customer_Age, cc1$Card_Category, p.adjust="bonferroni", pool.sd = FALSE)

This provides a very similar output, the only difference being that it was calculated with non-pooled standard deviations, as noted at the top.

As you can see, once you've corrected for this violation, your pairwise comparisons are still not significant, but the p-values are more accurate.
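To compare the pooled and non-pooled versions side by side, you can store each result and inspect its method and p.value elements. This sketch uses simulated data with deliberately unequal spread; the `sim` data frame and its group labels are hypothetical.

```r
set.seed(1)

# Four simulated groups with increasingly large standard deviations
sim <- data.frame(
  group = factor(rep(c("A", "B", "C", "D"), each = 25)),
  score = c(rnorm(25, mean = 50, sd = 2),
            rnorm(25, mean = 51, sd = 6),
            rnorm(25, mean = 52, sd = 10),
            rnorm(25, mean = 50, sd = 15))
)

# Default: one pooled standard deviation shared by all groups
pooled <- pairwise.t.test(sim$score, sim$group,
                          p.adjust.method = "bonferroni")

# pool.sd = FALSE: each pair uses its own standard deviations
unpooled <- pairwise.t.test(sim$score, sim$group,
                            p.adjust.method = "bonferroni",
                            pool.sd = FALSE)

pooled$method
unpooled$method

# Matrix of adjusted p values, one cell per pair of groups
round(unpooled$p.value, 3)
```

The method element is the same note that appears at the top of the printed output.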

## Determine Means and Draw Conclusions

If you had found a significant difference after correction, you would want to then finish interpreting the results and draw some conclusions. To do that, you need to examine the means! Again, dplyr nicely comes to the rescue.

In [None]:
cc1Means = cc1 %>% group_by(Card_Category) %>% summarize(Mean = mean(Customer_Age))

In [None]:
cc1Means

Had the corrected post hoc been significant with this data, you would have been able to look at the means and say that the Platinum card had a significantly higher average customer age than Gold or Silver cards. But, looking at these means, which are extremely close, it makes sense that this significant finding would "wash out" after Bonferroni correction. Is 45.5 years really all that different from 47.5 years?
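When judging whether a difference in means matters in practice, it can help to look at the group sizes and standard deviations next to the means. Here is a minimal sketch in base R, so it runs without extra packages, using the built-in PlantGrowth data as a stand-in for the credit card file.

```r
# Group size, mean, and standard deviation for each level of the IV
group_summary <- data.frame(
  group = levels(PlantGrowth$group),
  n     = as.vector(tapply(PlantGrowth$weight, PlantGrowth$group, length)),
  mean  = as.vector(tapply(PlantGrowth$weight, PlantGrowth$group, mean)),
  sd    = as.vector(tapply(PlantGrowth$weight, PlantGrowth$group, sd))
)

group_summary

# Gap between the largest and smallest group mean
diff(range(group_summary$mean))
```

A gap that is small relative to the standard deviations is a hint that even a significant result may not mean much in practice.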

In [None]:
.libPaths()

In [None]:
install.packages('libpaths')

In [None]:
findPkgAll <- function(pkg)
  unlist(lapply(.libPaths(), function(lib)
           find.package(pkg, lib, quiet=TRUE, verbose=FALSE)))

In [None]:
findPkgAll("MASS")
findPkgAll("knitr")

In [None]:
install.packages('tidyverse')

In [None]:
findPkgAll("tidyverse")

In [None]:
findPkgAll('readxl')

In [None]:
border = read_xlsx('C:/Users/nolan/OneDrive/Documents/GitHub/CurriculumPlayground/datasets/BorderCrossing.xlsx')

In [None]:
## readxl reader functions: read_excel(), read_xls(), read_xlsx()