# One-Sided T.Test
## Exercise Instructions

* Complete all cells as instructed, replacing any ??? with the appropriate code

* Execute Jupyter **Kernel** > **Restart & Run All** and ensure that all code blocks run without error

## Two sample hypothesis testing for comparing two means

### One-sided tests

In the last exercise we performed a two-sample, two-sided t-test. Now suppose that your hypothesis specifies which mean you expect to be larger. For example, suppose you want to test whether the mean price in Somerst is **larger** than the mean price in Old Town.

Null hypothesis:  
$H_0: \mu(Somerst) = \mu(OldTown)$

*Note: Some statisticians write this hypothesis as*
$H_0: \mu(Somerst) \le \mu(OldTown)$

Alternative hypothesis (one-sided):  
$H_A: \mu(Somerst) \gt \mu(OldTown)$

This **one-sided** test has only one rejection region, in the upper tail of the distribution of the difference in group means. At the 0.05 significance level, the probability in this tail is 0.05 when the null hypothesis is true.
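As a rough illustration of the single rejection region, the critical values can be read off the t distribution with `qt()`. The degrees of freedom below (`df_example <- 30`) are a made-up value chosen only for illustration; they are not taken from the housing data.

```r
# Hypothetical degrees of freedom, for illustration only
df_example <- 30
alpha <- 0.05

# One-sided (upper tail): all of alpha sits in one tail
crit_one_sided <- qt(1 - alpha, df = df_example)

# Two-sided: alpha is split between both tails
crit_two_sided <- qt(1 - alpha / 2, df = df_example)

cat("One-sided critical t:", round(crit_one_sided, 3), "\n")
cat("Two-sided critical t:", round(crit_two_sided, 3), "\n")
```

The one-sided cutoff is smaller because the whole 0.05 is placed in the upper tail, so a one-sided test rejects for less extreme t statistics in the hypothesized direction.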

For one-sided tests, *the order of the vectors in t.test() is important,* because R interprets the alternative "greater" as (first group listed) > (second group listed). See "A note on formulas" at the end of this notebook.

In [2]:
# Load libraries
library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 3.2.0     ✔ purrr   0.2.5
✔ tibble  2.1.3     ✔ dplyr   0.8.3
✔ tidyr   0.8.1     ✔ stringr 1.3.1
✔ readr   1.1.1     ✔ forcats 0.3.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()


In [3]:
# Load data: ames_housing.csv
# Convert character columns to factor
# Select columns of interest: Neighborhood, SalePrice
# Store in df_ames_housing
# Hint: read_csv(), mutate_if(), select(), is.character(), as.factor()
df_ames_housing <- "ames_housing.csv" %>% 
   read_csv() %>% 
   mutate_if(is.character, as.factor) %>% 
   select(Neighborhood, SalePrice)

Parsed with column specification:
cols(
  .default = col_character(),
  Order = col_double(),
  Lot_Frontage = col_integer(),
  Lot_Area = col_double(),
  Overall_Qual = col_integer(),
  Overall_Cond = col_integer(),
  Year_Built = col_double(),
  Year_Remod_Add = col_double(),
  Mas_Vnr_Area = col_integer(),
  BsmtFin_SF_1 = col_double(),
  BsmtFin_SF_2 = col_integer(),
  Bsmt_Unf_SF = col_integer(),
  Total_Bsmt_SF = col_integer(),
  Z_1st_Flr_SF = col_double(),
  Z_2nd_Flr_SF = col_double(),
  Low_Qual_Fin_SF = col_integer(),
  Gr_Liv_Area = col_integer(),
  Bsmt_Full_Bath = col_integer(),
  Bsmt_Half_Bath = col_integer(),
  Full_Bath = col_integer(),
  Half_Bath = col_integer()
  # ... with 17 more columns
)
See spec(...) for full column specifications.


In [4]:
# Explore data structure
# Data: AmesHousing
df_ames_housing %>% glimpse() %>% summary()

Observations: 2,930
Variables: 2
$ Neighborhood <fct> NAmes, NAmes, NAmes, NAmes, Gilbert, Gilbert, StoneBr, S…
$ SalePrice    <dbl> 215000, 105000, 172000, 244000, 189900, 195500, 213500, …


  Neighborhood    SalePrice     
 NAmes  : 443   Min.   : 12789  
 CollgCr: 267   1st Qu.:129500  
 OldTown: 239   Median :160000  
 Edwards: 194   Mean   :180796  
 Somerst: 182   3rd Qu.:213500  
 NridgHt: 166   Max.   :755000  
 (Other):1439                   

# Create subsets for two neighborhoods
We now create two samples, one per neighborhood, and use them in a two-sample, one-sided hypothesis test. Under the null hypothesis, the two neighborhoods have the same mean sale price.

In [5]:
# Filter data for Neighborhood equal Somerst
# Store in dataframe df_somerst
df_somerst <- df_ames_housing %>% filter( Neighborhood == 'Somerst')

# Explore results
df_somerst %>% glimpse()

Observations: 182
Variables: 2
$ Neighborhood <fct> Somerst, Somerst, Somerst, Somerst, Somerst, Somerst, So…
$ SalePrice    <dbl> 216000, 221500, 204500, 215200, 262500, 254900, 271500, …


Notice the number of rows is reduced and the Neighborhood is as expected.

In [6]:
# Filter data for Neighborhood equal OldTown
# Store in dataframe df_old_town
df_old_town <- df_ames_housing %>% filter( Neighborhood == "OldTown" )

# Explore results
df_old_town %>% glimpse()

Observations: 239
Variables: 2
$ Neighborhood <fct> OldTown, OldTown, OldTown, OldTown, OldTown, OldTown, Ol…
$ SalePrice    <dbl> 144000, 80400, 96500, 109500, 115000, 143000, 107400, 80…


Notice the number of rows is reduced and the Neighborhood is as expected.

# Mean and standard deviation of each series
Look at the mean and standard deviation of each series to see how close they are to each other.

In [7]:
# Get the mean and standard deviation of Somerst
# Round to 1 decimal place
# Hint: mean(), sd(), round()
somerst_mean <- df_somerst$SalePrice %>% mean() %>% round(1)
somerst_sd <- df_somerst$SalePrice %>% sd() %>% round(1)

# Print results
cat(str_c("Somerst: mean = ", somerst_mean, " and standard deviation = ", somerst_sd))

Somerst: mean = 229707.3 and standard deviation = 57437.4

In [8]:
# Get the mean and standard deviation of OldTown
# Round to 1 decimal place
# Hint: mean(), sd(), round()
old_town_mean <- df_old_town$SalePrice %>% mean() %>% round(1)
old_town_sd <- df_old_town$SalePrice %>% sd() %>% round(1)

# Print results
cat(str_c("OldTown: mean = ", old_town_mean, " and standard deviation = ", old_town_sd))

OldTown: mean = 123991.9 and standard deviation = 44327.1

Notice that the mean for Somerst is almost double the mean for OldTown, while the standard deviations (about 57,000 and 44,000) are both well below the gap between the means (about 106,000). With samples this large, that is a good indicator that the difference in means will be statistically significant.

# t.test {stats}	R Documentation
Student's t-Test
## Description
Performs one and two sample t-tests on vectors of data.

## Usage
t.test(x, y = NULL,
       alternative = c("two.sided", "less", "greater"),
       mu = 0, paired = FALSE, var.equal = FALSE,
       conf.level = 0.95, ...)
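As a small sketch of how this interface behaves, the example below runs a one-sided test on two simulated vectors and pulls individual pieces out of the result. The vectors `x` and `y` are made-up data for illustration only; the components accessed are those of the `htest` object that `t.test()` returns.

```r
# Simulated samples, for illustration only (not the housing data)
set.seed(1)
x <- rnorm(40, mean = 12, sd = 2)
y <- rnorm(50, mean = 10, sd = 2)

# One-sided Welch test: is mean(x) greater than mean(y)?
result <- t.test(x, y, alternative = "greater", conf.level = 0.95)

# The result is a list of class "htest"; its pieces can be used directly
result$statistic   # t statistic
result$parameter   # degrees of freedom
result$p.value     # p-value
result$conf.int    # confidence interval (upper end is Inf here)
result$estimate    # sample means of x and y
```

Because `var.equal` defaults to `FALSE`, this is the Welch version of the test, which is why the output in this notebook is titled "Welch Two Sample t-test".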



# Perform a one-sided t-test
Perform a one-sided t-test using the two samples, Somerst and OldTown. The result indicates whether the data support the claim that the mean sale price in Somerst is higher than in OldTown. Because the test is one-sided, the alternative hypothesis states a direction: the mean price in Somerst is greater than the mean price in OldTown.

In [9]:
# Perform a two-sample, one-sided t-test, on SalePrice for Somerst and OldTown
# Alternate hypothesis: Somerst (first sample) is greater than OldTown (second sample)
# Confidence level: 95%
# Hint: t.test()
cat("\n**** Somerst > OldTown ****")
t.test( df_somerst$SalePrice , df_old_town$SalePrice, conf.level = 0.95 , alternative = "greater" )

# Run the two-sided test for comparison
cat("\n**** Somerst != OldTown ****")
t.test(df_somerst$SalePrice, df_old_town$SalePrice , conf.level = 0.95 , alternative = "two.sided" )


**** Somerst > OldTown ****


	Welch Two Sample t-test

data:  df_somerst$SalePrice and df_old_town$SalePrice
t = 20.595, df = 330.68, p-value < 2.2e-16
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 97248.63      Inf
sample estimates:
mean of x mean of y 
 229707.3  123991.9 



**** Somerst != OldTown ****


	Welch Two Sample t-test

data:  df_somerst$SalePrice and df_old_town$SalePrice
t = 20.595, df = 330.68, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  95617.93 115812.94
sample estimates:
mean of x mean of y 
 229707.3  123991.9 


# Interpreting the results of the t.test output

## t statistic, degrees of freedom, p-value
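Notice that the t-statistic and degrees of freedom are the same as for the two-sided test. The printed p-value also looks the same, but only because both values fall below the smallest number R displays (2.2e-16); the one-sided p-value is actually half the two-sided one. The t-statistic is large and the p-value is below 0.05, so we reject the null hypothesis.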
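To see where these numbers come from, here is a minimal sketch that recomputes the Welch t-statistic and degrees of freedom from the rounded means, standard deviations, and sample sizes printed earlier in this notebook. The variable names are illustrative, and because the inputs are rounded the results should agree with the `t.test()` output only approximately.

```r
# Summary statistics copied (rounded) from the earlier cells
mean_s <- 229707.3; sd_s <- 57437.4; n_s <- 182   # Somerst
mean_o <- 123991.9; sd_o <- 44327.1; n_o <- 239   # OldTown

# Squared standard error contributed by each sample
v_s <- sd_s^2 / n_s
v_o <- sd_o^2 / n_o

# Welch t statistic: difference in means over its standard error
t_stat <- (mean_s - mean_o) / sqrt(v_s + v_o)

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch <- (v_s + v_o)^2 / (v_s^2 / (n_s - 1) + v_o^2 / (n_o - 1))

cat("t  =", round(t_stat, 3), "\n")
cat("df =", round(df_welch, 2), "\n")
```

Neither quantity depends on the `alternative` argument, which is why every test in this notebook reports the same t and df (up to sign).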


## p-value < 0.05 rejects the null hypothesis
This 0.05 threshold is the significance level, which equals 1 minus the confidence level (1 - 0.95 = 0.05). A p-value below this threshold means the result is statistically significant at the 95% confidence level.
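The p-values can be sketched directly from the t distribution with `pt()`, using the t statistic and degrees of freedom printed in the output above. This is only an illustration of how the one-sided and two-sided p-values relate, not a replacement for `t.test()`.

```r
# Values copied from the t.test() output above
t_stat <- 20.595
df <- 330.68
alpha <- 1 - 0.95   # significance level implied by conf.level = 0.95

# One-sided ("greater"): area in the upper tail beyond t
p_greater <- pt(t_stat, df = df, lower.tail = FALSE)

# Two-sided: area in both tails beyond |t|
p_two_sided <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)

cat("p (greater)   =", format(p_greater, digits = 3), "\n")
cat("p (two-sided) =", format(p_two_sided, digits = 3), "\n")
cat("Reject H0 at alpha = 0.05:", p_greater < alpha, "\n")
```

Both values are far smaller than 2.2e-16, so R prints them identically even though the two-sided p-value is twice the one-sided one.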

## Alternative hypothesis
The output reminds us of our alternative hypothesis: the true difference in means is *greater* than 0. This is a one-sided test.

## Confidence interval
The output then gives a confidence interval for the **difference** in population means. For a significant result (like this one), the confidence interval does not contain 0. Because the test is one-sided, R calculates only one end of the CI, which is all we need to determine whether the interval contains zero; the other end is shown as positive or negative infinity. In this case, the lower end of the CI is 97249 and the upper end is infinity, so the interval does not contain 0. Again, we reject the null hypothesis.
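The interval bounds can be approximated by hand, which shows why the one-sided lower bound (97249) sits above the two-sided lower bound (95618). The sketch below uses the rounded summary statistics and degrees of freedom printed earlier, so it should match the `t.test()` output only to within rounding; the variable names are illustrative.

```r
# Values copied (rounded) from earlier cells and the t.test() output
diff_means <- 229707.3 - 123991.9
se <- sqrt(57437.4^2 / 182 + 44327.1^2 / 239)   # Welch standard error
df <- 330.68

# One-sided 95% interval ("greater"): only a lower bound is computed
lower_one_sided <- diff_means - qt(0.95, df = df) * se

# Two-sided 95% interval: 2.5% in each tail, so both bounds are finite
two_sided <- diff_means + c(-1, 1) * qt(0.975, df = df) * se

cat("One-sided CI: [", round(lower_one_sided), ", Inf)\n")
cat("Two-sided CI: [", round(two_sided[1]), ",", round(two_sided[2]), "]\n")
```

The one-sided interval uses the 0.95 quantile rather than the 0.975 quantile, so its single bound lies closer to the observed difference.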

## Mean of each sample
The final line of the output just shows the means of the two samples. As expected, they are the same for both one-sided and two-sided tests.

# One-sided t.test from the other side
It makes sense that the test of $\mu(Somerst) \gt \mu(OldTown)$ was significant, because the sample mean for Somerst is so much larger than the sample mean for Old Town. What happens if we test the opposite alternative hypothesis?

Null hypothesis:  
$H_0: \mu(Somerst) = \mu(OldTown)$

*Note: Some statisticians write this hypothesis as*
$H_0: \mu(Somerst) \ge \mu(OldTown)$

Alternative hypothesis (one-sided):  
$H_A: \mu(Somerst) \lt \mu(OldTown)$

In [10]:
# Perform a two-sample, one-sided t-test, on SalePrice for Somerst and OldTown
# Alternate hypothesis: Somerst (first sample) is LESS than OldTown (second sample)
# Confidence level: 95%
# Hint: t.test()
cat("\n**** Somerst < OldTown ****")
t.test(df_somerst$SalePrice, df_old_town$SalePrice, conf.level = 0.95, alternative = "less" )

cat("\n**** Somerst != OldTown ****")
# Run the two-sided test for comparison
t.test(df_somerst$SalePrice, df_old_town$SalePrice, conf.level = 0.95, alternative = "two.sided" )

cat("\n**** Somerst > OldTown ****")
# Run the one-sided test where Somerst > OldTown for comparison
t.test(df_somerst$SalePrice, df_old_town$SalePrice, conf.level = 0.95, alternative = "greater" )


**** Somerst < OldTown ****


	Welch Two Sample t-test

data:  df_somerst$SalePrice and df_old_town$SalePrice
t = 20.595, df = 330.68, p-value = 1
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
     -Inf 114182.2
sample estimates:
mean of x mean of y 
 229707.3  123991.9 



**** Somerst != OldTown ****


	Welch Two Sample t-test

data:  df_somerst$SalePrice and df_old_town$SalePrice
t = 20.595, df = 330.68, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  95617.93 115812.94
sample estimates:
mean of x mean of y 
 229707.3  123991.9 



**** Somerst > OldTown ****


	Welch Two Sample t-test

data:  df_somerst$SalePrice and df_old_town$SalePrice
t = 20.595, df = 330.68, p-value < 2.2e-16
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 97248.63      Inf
sample estimates:
mean of x mean of y 
 229707.3  123991.9 


# Interpreting the results of the t.test output

## t statistic, degrees of freedom, p-value
Notice that the t-statistic and degrees of freedom are the same for all three tests. The p-value for the "less" test is different, as expected, because the observed difference points in the opposite direction from this alternative hypothesis.


## p-value < 0.05 rejects the null hypothesis
This 0.05 threshold is 1 - the confidence level which is 0.95 or 95%. Anything below this value means it is statistically significant for this confidence level. Since the p-value is above 0.05, in fact it is 1, we *fail to reject* this null hypothesis. 

## Alternative hypothesis
The output reminds us of our alternative hypothesis: the true difference in means is *less* than 0. This is a one-sided test.

## Confidence interval
The lower end of the CI is negative infinity and the upper end is 114182, so the CI **does** contain zero. We fail to reject the null hypothesis; the evidence isn't consistent with the alternative hypothesis that the population mean price for Somerst is **less** than the population mean price for Old Town. 
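As a rough check on the "less" results, the p-value and the single finite bound can be recomputed from the t distribution. The inputs below are the rounded figures printed earlier in this notebook, so agreement with the `t.test()` output is approximate; the variable names are illustrative.

```r
# Values copied (rounded) from earlier cells and the t.test() output
t_stat <- 20.595
df <- 330.68
diff_means <- 229707.3 - 123991.9
se <- sqrt(57437.4^2 / 182 + 44327.1^2 / 239)   # Welch standard error

# "less": the p-value is the area in the LOWER tail, below the observed t
p_less <- pt(t_stat, df = df)

# One-sided 95% interval ("less"): only an upper bound is computed
upper_one_sided <- diff_means + qt(0.95, df = df) * se

cat("p (less)     =", p_less, "\n")
cat("One-sided CI: (-Inf,", round(upper_one_sided, 1), "]\n")
```

Almost the entire t distribution lies below t = 20.595, which is why the p-value prints as 1, and the upper bound is positive, which is why the interval contains 0.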

## Mean of each sample
The final line of the output just shows the means of the two samples. As expected, they are the same for all tests.

# Swapping the order of sample one and sample two
For a two-sided test, the order in which the samples are given to t.test() does not change the conclusion; it only flips the sign of the t-statistic and of the confidence interval. However, when a one-sided test is performed, the order does make a difference. The alternative hypothesis is stated from the perspective of the first sample, the first argument to the t.test() function.

Let's try the same code as before, simply swapping the order provided to t.test.

In [12]:
# Perform a two-sample, one-sided t-test, on SalePrice swapping the order of the samples
# OldTown is first and Somerst is second. 
# Alternate hypothesis: OldTown (first sample) is LESS than Somerst (second sample)
# Confidence level: 95%
# Hint: t.test()
cat("\n**** OldTown < Somerst ****")
t.test(df_old_town$SalePrice, df_somerst$SalePrice , conf.level = 0.95, alternative = "less")

cat("\n**** OldTown != Somerst ****")
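# Run the two-sided test for comparison
# Note: the samples are left in the original order here (Somerst first);
# for a two-sided test, swapping them only flips the signs of t and the interval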
t.test( df_somerst$SalePrice , df_old_town$SalePrice, conf.level = 0.95, alternative = "two.sided")

cat("\n**** OldTown > Somerst ****")
# Run the one-sided test where OldTown > Somerst for comparison
t.test(df_old_town$SalePrice, df_somerst$SalePrice ,  conf.level = 0.95, alternative = "greater")


**** OldTown < Somerst ****


	Welch Two Sample t-test

data:  df_old_town$SalePrice and df_somerst$SalePrice
t = -20.595, df = 330.68, p-value < 2.2e-16
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
      -Inf -97248.63
sample estimates:
mean of x mean of y 
 123991.9  229707.3 



**** OldTown != Somerst ****


	Welch Two Sample t-test

data:  df_somerst$SalePrice and df_old_town$SalePrice
t = 20.595, df = 330.68, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  95617.93 115812.94
sample estimates:
mean of x mean of y 
 229707.3  123991.9 



**** OldTown > Somerst ****


	Welch Two Sample t-test

data:  df_old_town$SalePrice and df_somerst$SalePrice
t = -20.595, df = 330.68, p-value = 1
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 -114182.2       Inf
sample estimates:
mean of x mean of y 
 123991.9  229707.3 


# Results of swapping order of samples

* OldTown <  Somerst is supported, with a p-value near 0 and a confidence interval NOT containing 0
* OldTown != Somerst is supported, with a p-value near 0 and a confidence interval NOT containing 0
* OldTown >  Somerst is not supported, with a p-value near 1 and a confidence interval containing 0
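The symmetry in these results can be checked on any pair of vectors. The sketch below uses simulated data (the vectors `a` and `b` are made up for illustration and are not the housing samples) to show that swapping the sample order while also swapping "less" for "greater" tests the same hypothesis.

```r
# Simulated samples, for illustration only (not the housing data)
set.seed(42)
a <- rnorm(30, mean = 5, sd = 1)
b <- rnorm(35, mean = 6, sd = 1)

# "a is less than b" and "b is greater than a" are the same hypothesis
test_ab <- t.test(a, b, alternative = "less")
test_ba <- t.test(b, a, alternative = "greater")

# The t statistics differ only in sign; the p-values are identical
c(t_ab = unname(test_ab$statistic), t_ba = unname(test_ba$statistic))
c(p_ab = test_ab$p.value, p_ba = test_ba$p.value)
```

Swapping the order without changing `alternative` would instead test the opposite hypothesis, which is the situation shown in the OldTown > Somerst output above.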

# Code Summary
Let's summarize the code for a two-sample, one-sided t.test for both greater and less. Let's add the two-sided test as well for completeness.

In [19]:
# Load libraries
library( tidyverse )

# Load data: ames_housing.csv
df_ames_housing <- "ames_housing.csv" %>% 
   read_csv(progress=FALSE) %>% 
   mutate_if(is.character, as.factor ) %>% 
   select(Neighborhood, SalePrice )

df_ames_housing %>% glimpse()

# Get two samples
df_old_town <- df_ames_housing %>% filter( Neighborhood == "OldTown" )
df_somerst <- df_ames_housing %>% filter (Neighborhood == "Somerst")

# Perform t.test
t.test(df_somerst$SalePrice, df_old_town$SalePrice, alternative="greater") # Greater
t.test(df_somerst$SalePrice, df_old_town$SalePrice, alternative="less") # Less
t.test(df_somerst$SalePrice, df_old_town$SalePrice, alternative="two.sided") # Two sided

Parsed with column specification:
cols(
  .default = col_character(),
  Order = col_double(),
  Lot_Frontage = col_integer(),
  Lot_Area = col_double(),
  Overall_Qual = col_integer(),
  Overall_Cond = col_integer(),
  Year_Built = col_double(),
  Year_Remod_Add = col_double(),
  Mas_Vnr_Area = col_integer(),
  BsmtFin_SF_1 = col_double(),
  BsmtFin_SF_2 = col_integer(),
  Bsmt_Unf_SF = col_integer(),
  Total_Bsmt_SF = col_integer(),
  Z_1st_Flr_SF = col_double(),
  Z_2nd_Flr_SF = col_double(),
  Low_Qual_Fin_SF = col_integer(),
  Gr_Liv_Area = col_integer(),
  Bsmt_Full_Bath = col_integer(),
  Bsmt_Half_Bath = col_integer(),
  Full_Bath = col_integer(),
  Half_Bath = col_integer()
  # ... with 17 more columns
)
See spec(...) for full column specifications.


Observations: 2,930
Variables: 2
$ Neighborhood <fct> NAmes, NAmes, NAmes, NAmes, Gilbert, Gilbert, StoneBr, S…
$ SalePrice    <dbl> 215000, 105000, 172000, 244000, 189900, 195500, 213500, …



	Welch Two Sample t-test

data:  df_somerst$SalePrice and df_old_town$SalePrice
t = 20.595, df = 330.68, p-value < 2.2e-16
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 97248.63      Inf
sample estimates:
mean of x mean of y 
 229707.3  123991.9 



	Welch Two Sample t-test

data:  df_somerst$SalePrice and df_old_town$SalePrice
t = 20.595, df = 330.68, p-value = 1
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
     -Inf 114182.2
sample estimates:
mean of x mean of y 
 229707.3  123991.9 



	Welch Two Sample t-test

data:  df_somerst$SalePrice and df_old_town$SalePrice
t = 20.595, df = 330.68, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  95617.93 115812.94
sample estimates:
mean of x mean of y 
 229707.3  123991.9 
