# Two-Sided T.Test
## Exercise Instructions

* Complete all cells as instructed, replacing any ??? with the appropriate code

* Execute Jupyter **Kernel** > **Restart & Run All** and ensure that all code blocks run without error

## Two sample hypothesis testing for comparing two means
### Two-sided tests
In L07-1, we saw a **two-sided** hypothesis test for inference about population means of two groups, based on samples from the two groups. It's two-sided because the null hypothesis is rejected if the first sample mean is sufficiently larger OR sufficiently smaller than the second sample mean. In other words, we are **not** specifying which mean we expect to be larger. 

Null hypothesis:  
$H_0: \mu_A = \mu_B$

(Note that this is the same as $\mu_A - \mu_B = 0$).

Alternative hypothesis (two-sided):  
$H_A: \mu_A \ne \mu_B$ 

**Assumptions:**
* Two normally distributed independent populations
* Population variances are unknown and not assumed to be equal   

As we saw in L07-1, the **t-statistic** for comparing two means is:  
$t=\frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_1^2/n_1+s_2^2/n_2}}$
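
As a rough check, the formula can be evaluated directly from summary statistics. The sketch below plugs in the rounded sample means, standard deviations, and sample sizes that appear in this notebook's output for Somerst (group 1) and OldTown (group 2). The variable names are illustrative, and because the inputs are rounded the result is only approximately equal to what t.test() computes from the raw data.

```r
# Summary statistics taken from the notebook output (rounded values)
x_bar_1 <- 229707.3; s_1 <- 57437.4; n_1 <- 182   # Somerst
x_bar_2 <- 123991.9; s_2 <- 44327.1; n_2 <- 239   # OldTown

# Standard error of the difference in sample means (unequal variances)
se_diff <- sqrt(s_1^2 / n_1 + s_2^2 / n_2)

# t-statistic: difference in sample means divided by its standard error
t_stat <- (x_bar_1 - x_bar_2) / se_diff
print(round(t_stat, 3))
```

The printed value should be close to the t = 20.595 reported by t.test() further down.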

The two-sided test has "rejection regions" in the two tails of the t-distribution of the test statistic. Extreme positive values of the t-statistic are in the upper rejection region, and extreme negative values are in the lower rejection region. Typically the "cutoff" values of these regions are set to make the probability in each region 0.025, so the total probability is 0.05.
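
The cutoff values can be looked up with the quantile function qt(). This is a minimal sketch, assuming a total rejection probability of 0.05 and, purely for illustration, the degrees of freedom (330.68) and t-statistic (20.595) that t.test() reports later in this notebook.

```r
alpha  <- 0.05    # total probability in the two rejection regions
df     <- 330.68  # degrees of freedom (taken from the t.test() output)
t_stat <- 20.595  # observed t-statistic (taken from the t.test() output)

# Cutoffs that leave alpha / 2 = 0.025 in each tail of the t-distribution
lower_cutoff <- qt(alpha / 2, df)
upper_cutoff <- qt(1 - alpha / 2, df)
print(c(lower = lower_cutoff, upper = upper_cutoff))

# The null hypothesis is rejected if t falls in either rejection region
reject_null <- t_stat < lower_cutoff || t_stat > upper_cutoff
print(reject_null)
```

With this many degrees of freedom the cutoffs are close to the familiar normal values of about -1.96 and 1.96.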

If the p-value for the test is **less than** .05, we **reject** the null hypothesis (which is our default hypothesis). Otherwise, we **fail to reject** the null hypothesis. The "fail to reject" language indicates that the data don't provide enough evidence to reject the default. In L07-1, we used the R function pt() to determine the p-value for the two-sided test. 
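
As a reminder of how pt() yields a two-sided p-value, here is a small sketch. It assumes the t-statistic and degrees of freedom are already known, and uses the values from the t.test() output shown later in this notebook for illustration.

```r
t_stat <- 20.595  # t-statistic (taken from the t.test() output)
df     <- 330.68  # degrees of freedom (taken from the t.test() output)

# Two-sided p-value: probability in both tails beyond |t|
p_value <- 2 * pt(-abs(t_stat), df)

# Decision at the 0.05 threshold
alpha <- 0.05
if (p_value < alpha) {
  cat("Reject the null hypothesis\n")
} else {
  cat("Fail to reject the null hypothesis\n")
}
```

Using -abs(t_stat) makes the calculation work whether the observed statistic is positive or negative.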

A very commonly used R function for performing hypothesis tests is **t.test()**. Let's see how to do the example from L07-1 with t.test().

**Example:**
Question: Is there a statistically significant difference in **mean sale prices** between Old Town & Somerst neighborhoods?  
$H_0: \mu(OldTown) = \mu(Somerst)$  
$H_A: \mu(OldTown) \ne \mu(Somerst)$

In [2]:
# Load libraries
library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 3.2.0     ✔ purrr   0.2.5
✔ tibble  2.1.3     ✔ dplyr   0.8.3
✔ tidyr   0.8.1     ✔ stringr 1.3.1
✔ readr   1.1.1     ✔ forcats 0.3.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()


In [3]:
# Load data: ames_housing.csv
# Convert character columns to factor
# Select columns of interest: Neighborhood, SalePrice
# Store in df_ames_housing
# Hint: read_csv(), mutate_if(), select(), is.character(), as.factor()
df_ames_housing <- "ames_housing.csv" %>% 
   read_csv() %>% 
   mutate_if(is.character, as.factor) %>% 
   select(Neighborhood, SalePrice)

Parsed with column specification:
cols(
  .default = col_character(),
  Order = col_double(),
  Lot_Frontage = col_integer(),
  Lot_Area = col_double(),
  Overall_Qual = col_integer(),
  Overall_Cond = col_integer(),
  Year_Built = col_double(),
  Year_Remod_Add = col_double(),
  Mas_Vnr_Area = col_integer(),
  BsmtFin_SF_1 = col_double(),
  BsmtFin_SF_2 = col_integer(),
  Bsmt_Unf_SF = col_integer(),
  Total_Bsmt_SF = col_integer(),
  Z_1st_Flr_SF = col_double(),
  Z_2nd_Flr_SF = col_double(),
  Low_Qual_Fin_SF = col_integer(),
  Gr_Liv_Area = col_integer(),
  Bsmt_Full_Bath = col_integer(),
  Bsmt_Half_Bath = col_integer(),
  Full_Bath = col_integer(),
  Half_Bath = col_integer()
  # ... with 17 more columns
)
See spec(...) for full column specifications.


In [5]:
# Explore data structure
# Data: AmesHousing
df_ames_housing %>% glimpse() %>% summary()

Observations: 2,930
Variables: 2
$ Neighborhood <fct> NAmes, NAmes, NAmes, NAmes, Gilbert, Gilbert, StoneBr, S…
$ SalePrice    <dbl> 215000, 105000, 172000, 244000, 189900, 195500, 213500, …


  Neighborhood    SalePrice     
 NAmes  : 443   Min.   : 12789  
 CollgCr: 267   1st Qu.:129500  
 OldTown: 239   Median :160000  
 Edwards: 194   Mean   :180796  
 Somerst: 182   3rd Qu.:213500  
 NridgHt: 166   Max.   :755000  
 (Other):1439                   

# Create subsets for two neighborhoods
We create two samples, one per neighborhood, and use them to test the null hypothesis that the mean sale price is the same in both neighborhoods. This is a two-sample, two-sided hypothesis test.

In [6]:
# Filter data for Neighborhood equal to Somerst
# Store in dataframe df_somerst
df_somerst <- df_ames_housing %>% filter(Neighborhood == "Somerst")

# Explore results
df_somerst %>% glimpse() %>% summary()

Observations: 182
Variables: 2
$ Neighborhood <fct> Somerst, Somerst, Somerst, Somerst, Somerst, Somerst, So…
$ SalePrice    <dbl> 216000, 221500, 204500, 215200, 262500, 254900, 271500, …


  Neighborhood   SalePrice     
 Somerst:182   Min.   :139000  
 Blmngtn:  0   1st Qu.:185000  
 Blueste:  0   Median :225500  
 BrDale :  0   Mean   :229707  
 BrkSide:  0   3rd Qu.:259375  
 ClearCr:  0   Max.   :468000  
 (Other):  0                   

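Notice that the number of rows is reduced from 2,930 to 182 and that every remaining row belongs to Somerst. The other neighborhoods still appear in the summary because they remain as factor levels, each with a count of 0.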

In [7]:
# Filter data for Neighborhood equal to OldTown
# Store in dataframe df_old_town
df_old_town <- df_ames_housing %>% filter(Neighborhood == "OldTown")

# Explore results
df_old_town %>% glimpse() %>% summary()

Observations: 239
Variables: 2
$ Neighborhood <fct> OldTown, OldTown, OldTown, OldTown, OldTown, OldTown, Ol…
$ SalePrice    <dbl> 144000, 80400, 96500, 109500, 115000, 143000, 107400, 80…


  Neighborhood   SalePrice     
 OldTown:239   Min.   : 12789  
 Blmngtn:  0   1st Qu.:103350  
 Blueste:  0   Median :119900  
 BrDale :  0   Mean   :123992  
 BrkSide:  0   3rd Qu.:140000  
 ClearCr:  0   Max.   :475000  
 (Other):  0                   

Notice that the number of rows is reduced from 2,930 to 239 and that every remaining row belongs to OldTown.

# Mean and standard deviation of each series
Look at the mean and standard deviation of each series to see how close they are to each other.

In [9]:
# Get the mean and standard deviation of Somerst
# Round to 1 decimal place
# Hint: mean(), sd(), round()
somerst_mean <- df_somerst$SalePrice %>% mean() %>% round(1)
somerst_sd <- df_somerst$SalePrice %>% sd() %>% round(1)

# Print results
cat(str_c("Somerst: mean = ", somerst_mean, " and standard deviation = ", somerst_sd))

Somerst: mean = 229707.3 and standard deviation = 57437.4

In [11]:
# Get the mean and standard deviation of OldTown
# Round to 1 decimal place
# Hint: mean(), sd(), round()
old_town_mean <- df_old_town$SalePrice %>% mean() %>% round(1)
old_town_sd <- df_old_town$SalePrice %>% sd() %>% round(1)

# Print results
cat(str_c("OldTown: mean = ", old_town_mean, " and standard deviation = ", old_town_sd))

OldTown: mean = 123991.9 and standard deviation = 44327.1

Notice that the mean of Somerst (229,707.3) is almost double the mean of OldTown (123,991.9), while the two standard deviations are of similar size and both are smaller than the difference in means. That is a good indicator that the difference in means will be statistically significant.

# t.test()
The t.test() function performs Student's t-test. It can perform one-sample and two-sample tests, and each of these can be one-sided or two-sided.

In [12]:
# Display help for t.test()
? t.test

# t.test {stats}	R Documentation
Student's t-Test
## Description
Performs one and two sample t-tests on vectors of data.

## Usage
t.test(x, y = NULL,
       alternative = c("two.sided", "less", "greater"),
       mu = 0, paired = FALSE, var.equal = FALSE,
       conf.level = 0.95, ...)
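
The arguments in the usage above select which kind of test is run. The sketch below is illustrative only: it uses simulated vectors x and y (not the Ames housing data) and chooses the argument values simply to show each variant.

```r
set.seed(1)  # make the simulated data reproducible

# Simulated samples (illustrative only, not the Ames housing data)
x <- rnorm(30, mean = 10, sd = 2)
y <- rnorm(30, mean = 12, sd = 2)

# One-sample, two-sided test of H0: the population mean of x equals 10
one_sample <- t.test(x, mu = 10)

# Two-sample, one-sided test of HA: the mean of x is less than the mean of y
one_sided <- t.test(x, y, alternative = "less")

# Two-sample, two-sided test (the case covered in this notebook)
two_sided <- t.test(x, y, alternative = "two.sided")

print(one_sample$method)
print(one_sided$alternative)
print(two_sided$alternative)
```

Each call returns an object whose components (such as method, alternative, statistic, and p.value) can be inspected individually.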



# Perform a two-sided t-test
Perform a two-sided t-test using two samples, Somerst and OldTown. The result of this test will indicate whether the mean sale prices of the two neighborhoods are statistically different from each other.

In [13]:
# Perform a two sample, two-sided t-test, on SalePrice for Somerst and OldTown
# Confidence level: 95%
# Hint: t.test()
cat("\n**** Somerst != OldTown ****")
t.test(df_somerst$SalePrice, df_old_town$SalePrice, conf.level = 0.95, alternative = "two.sided")


**** Somerst != OldTown ****


	Welch Two Sample t-test

data:  df_somerst$SalePrice and df_old_town$SalePrice
t = 20.595, df = 330.68, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  95617.93 115812.94
sample estimates:
mean of x mean of y 
 229707.3  123991.9 


# Interpreting the results of the t.test output
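Let's look at the output from t.test(). It's titled "Welch Two Sample t-test" because t.test() does not assume equal population variances by default (var.equal = FALSE), and the (complicated) expression for the degrees of freedom in that case is called the Welch (or Satterthwaite) approximation. This is also why the degrees of freedom (330.68) are not a whole number.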
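
For reference, the Welch (Satterthwaite) degrees of freedom can be computed by hand. This is a sketch that assumes the standard form of the approximation and uses the rounded standard deviations and sample sizes from this notebook's output, so the result matches the t.test() output only approximately.

```r
# Summary statistics taken from the notebook output (rounded values)
s_1 <- 57437.4; n_1 <- 182   # Somerst
s_2 <- 44327.1; n_2 <- 239   # OldTown

# Estimated variance of each sample mean
v_1 <- s_1^2 / n_1
v_2 <- s_2^2 / n_2

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch <- (v_1 + v_2)^2 / (v_1^2 / (n_1 - 1) + v_2^2 / (n_2 - 1))
print(round(df_welch, 2))
```

The printed value should be close to the df = 330.68 shown in the output above.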

## t statistic, degrees of freedom, p-value
The first row shows the value of the t-statistic (large and positive), its degrees of freedom, and its p-value. For this test, the p-value is very small: it is reported as less than 2.2e-16, which is how R prints a p-value that is too small to display meaningfully. It's much less than .05, so we **reject** the null hypothesis that the mean sale prices in the Old Town & Somerst neighborhoods are the same.

## p-value < 0.05 rejects the null hypothesis
The 0.05 threshold is the significance level, which equals 1 minus the confidence level (1 - 0.95 = 0.05). A p-value below this threshold means the result is statistically significant at the 95% confidence level.
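
The relationship between the confidence level and the threshold can be made explicit in code. The sketch below uses simulated data (group_a and group_b are made-up samples, not the Ames housing data) and reads the p-value from the object that t.test() returns.

```r
set.seed(42)  # make the simulated data reproducible

# Simulated samples (illustrative only, not the Ames housing data)
group_a <- rnorm(50, mean = 10, sd = 2)
group_b <- rnorm(50, mean = 13, sd = 2)

conf_level <- 0.95
alpha <- 1 - conf_level   # threshold for the p-value, 0.05 here

result <- t.test(group_a, group_b, conf.level = conf_level)

# Compare the p-value with the threshold
is_significant <- result$p.value < alpha
print(is_significant)
```

Changing conf_level to 0.99 would lower the threshold to 0.01, making the test harder to pass.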

## Alternative hypothesis
The output reminds us of our alternative hypothesis: the true difference in population means is *not equal* to 0. This is a two-sided test.

## Confidence interval
The output then gives a 95% confidence interval for the **difference** in population means (Somerst minus OldTown here, because Somerst was passed as the first argument). For a significant result (like this one), the confidence interval does not contain 0.
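
The interval can be reproduced approximately by hand as the difference in sample means plus or minus a t-quantile times the standard error. This sketch assumes the rounded summary statistics and the Welch degrees of freedom from this notebook's output, so small rounding differences from the t.test() result are expected.

```r
# Values taken from the notebook output (rounded)
x_bar_1 <- 229707.3; s_1 <- 57437.4; n_1 <- 182   # Somerst
x_bar_2 <- 123991.9; s_2 <- 44327.1; n_2 <- 239   # OldTown
df <- 330.68                                      # Welch degrees of freedom

# Standard error of the difference in sample means
se_diff <- sqrt(s_1^2 / n_1 + s_2^2 / n_2)

# Margin of error for a 95% confidence interval
margin <- qt(0.975, df) * se_diff

# Lower and upper limits of the interval
ci <- (x_bar_1 - x_bar_2) + c(-1, 1) * margin
print(round(ci))

# TRUE when the interval excludes 0, which agrees with rejecting H0
print(ci[1] > 0 || ci[2] < 0)
```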

## Mean of each sample
The final line of the output just shows the means of the two samples. 

# t.test() default values
t.test() defaults to a two-sided test with a confidence level of 0.95, and it does not assume equal variances (var.equal = FALSE). If these defaults match the test you are performing, it isn't necessary to specify the parameters, although you can for readability if you prefer.
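
A quick way to confirm the defaults is to run the test with and without the arguments spelled out and compare the results. The sketch below uses simulated vectors x and y (illustrative only, not the Ames housing data).

```r
set.seed(7)  # make the simulated data reproducible

# Simulated samples (illustrative only, not the Ames housing data)
x <- rnorm(40, mean = 5, sd = 1)
y <- rnorm(40, mean = 5.5, sd = 1.5)

# Rely on the defaults
implicit <- t.test(x, y)

# Spell out the same settings explicitly
explicit <- t.test(x, y, alternative = "two.sided",
                   conf.level = 0.95, var.equal = FALSE)

# Both calls should give the same p-value and confidence interval
same_p  <- isTRUE(all.equal(implicit$p.value, explicit$p.value))
same_ci <- isTRUE(all.equal(implicit$conf.int, explicit$conf.int))
print(c(same_p = same_p, same_ci = same_ci))
```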

# Code Summary
Let's summarize the code for a two-sample, two-sided t.test.

In [16]:
# Load libraries
library(tidyverse)

# Load data: ames_housing.csv
df_ames_housing <- "ames_housing.csv" %>% 
   read_csv(progress = FALSE) %>% 
   mutate_if(is.character, as.factor) %>% 
   select(Neighborhood, SalePrice)

# Get two samples
df_old_town <- df_ames_housing %>% filter(Neighborhood == "OldTown")
df_somerst <- df_ames_housing %>% filter(Neighborhood == "Somerst")

# Perform t.test
t.test(df_somerst$SalePrice, df_old_town$SalePrice, conf.level = 0.95, alternative = "two.sided")


	Welch Two Sample t-test

data:  df_somerst$SalePrice and df_old_town$SalePrice
t = 20.595, df = 330.68, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  95617.93 115812.94
sample estimates:
mean of x mean of y 
 229707.3  123991.9 
