# T.Test With Factors
## Exercise Instructions

* Complete all cells as instructed, replacing any ??? with the appropriate code

* Execute Jupyter **Kernel** > **Restart & Run All** and ensure that all code blocks run without error

## Running multiple t-tests for factor columns

In the last two exercises, we performed two-sample tests using two continuous variables for SalePrice. Now suppose you want to do this with one continuous variable and a factor. The factor splits the continuous variable into two samples, one per level, so the factor must have exactly two levels.
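As a quick illustration of the two-level requirement, here is a minimal sketch on made-up data. The `toy` data frame and its `price` and `group` columns are hypothetical and are not part of the Ames data.

```r
# Hypothetical toy data: one continuous column and a three-level factor
toy <- data.frame(
  price = c(100, 120, 110, 130, 200, 220, 210, 230, 300, 310, 320, 330),
  group = factor(rep(c("A", "B", "C"), each = 4))
)
nlevels(toy$group)

# With three levels, t.test() stops with an error; capture its message
three_level <- tryCatch(
  t.test(price ~ group, data = toy),
  error = function(e) conditionMessage(e)
)
three_level

# Keeping only two of the levels lets the test run
two_level <- t.test(price ~ group, data = subset(toy, group %in% c("A", "B")))
two_level$estimate
```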

In [2]:
# Load libraries
library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 3.2.0     ✔ purrr   0.2.5
✔ tibble  2.1.3     ✔ dplyr   0.8.3
✔ tidyr   0.8.1     ✔ stringr 1.3.1
✔ readr   1.1.1     ✔ forcats 0.3.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()


In [4]:
# Load data: ames_housing.csv
# Convert character columns to factor
# Select columns of interest: Neighborhood, Central_Air, SalePrice
# Store in df_ames_housing
# Hint: read_csv(), mutate_if(), select(), is.character(), as.factor()
df_ames_housing <- "ames_housing.csv" %>% 
   read_csv() %>% 
   mutate_if(is.character, as.factor) %>% 
   select(Neighborhood, Central_Air, SalePrice)

Parsed with column specification:
cols(
  .default = col_character(),
  Order = col_double(),
  Lot_Frontage = col_integer(),
  Lot_Area = col_double(),
  Overall_Qual = col_integer(),
  Overall_Cond = col_integer(),
  Year_Built = col_double(),
  Year_Remod_Add = col_double(),
  Mas_Vnr_Area = col_integer(),
  BsmtFin_SF_1 = col_double(),
  BsmtFin_SF_2 = col_integer(),
  Bsmt_Unf_SF = col_integer(),
  Total_Bsmt_SF = col_integer(),
  Z_1st_Flr_SF = col_double(),
  Z_2nd_Flr_SF = col_double(),
  Low_Qual_Fin_SF = col_integer(),
  Gr_Liv_Area = col_integer(),
  Bsmt_Full_Bath = col_integer(),
  Bsmt_Half_Bath = col_integer(),
  Full_Bath = col_integer(),
  Half_Bath = col_integer()
  # ... with 17 more columns
)
See spec(...) for full column specifications.


In [5]:
# Explore data structure
# Data: AmesHousing
df_ames_housing %>% 
   glimpse() %>% 
   summary()

Observations: 2,930
Variables: 3
$ Neighborhood <fct> NAmes, NAmes, NAmes, NAmes, Gilbert, Gilbert, StoneBr, S…
$ Central_Air  <fct> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y,…
$ SalePrice    <dbl> 215000, 105000, 172000, 244000, 189900, 195500, 213500, …


  Neighborhood  Central_Air   SalePrice     
 NAmes  : 443   N: 196      Min.   : 12789  
 CollgCr: 267   Y:2734      1st Qu.:129500  
 OldTown: 239               Median :160000  
 Edwards: 194               Mean   :180796  
 Somerst: 182               3rd Qu.:213500  
 NridgHt: 166               Max.   :755000  
 (Other):1439                               

# Filter to only two factor levels
Since t.test wants two factor levels, any column with two factor levels will work. Let's recreate the Neighborhood test this way.

In [6]:
# Create a dataframe filtered to two neighborhoods Somerst and OldTown
# Name the variable df_neighborhoods
# Hint: filter(), %in%, c()
df_neighborhoods <- df_ames_housing %>% 
   filter(Neighborhood %in% c("Somerst", "OldTown"))

# Explore the result
df_neighborhoods %>% glimpse() %>% summary()

Observations: 421
Variables: 3
$ Neighborhood <fct> Somerst, Somerst, Somerst, Somerst, Somerst, Somerst, So…
$ Central_Air  <fct> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y,…
$ SalePrice    <dbl> 216000, 221500, 204500, 215200, 262500, 254900, 271500, …


  Neighborhood Central_Air   SalePrice     
 OldTown:239   N: 56       Min.   : 12789  
 Somerst:182   Y:365       1st Qu.:115000  
 Blmngtn:  0               Median :151000  
 Blueste:  0               Mean   :169693  
 BrDale :  0               3rd Qu.:220000  
 BrkSide:  0               Max.   :475000  
 (Other):  0                               

Notice that the summary of Neighborhood still lists levels from the original data, with zero counts for all but the two we filtered for. This won't be a problem for t.test, but if you want to remove these empty levels, you can use fct_drop().
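To see what dropping empty levels looks like, here is a small sketch on a made-up factor (the `neighborhood` vector is hypothetical). It uses base R's `droplevels()` so it runs without extra packages; `forcats::fct_drop()` should give the same result on a factor.

```r
# Hypothetical toy factor with three levels
neighborhood <- factor(c("OldTown", "Somerst", "NAmes", "OldTown", "Somerst"))

# Filtering the values keeps the unused level "NAmes" with a count of zero
kept <- neighborhood[neighborhood %in% c("OldTown", "Somerst")]
table(kept)

# droplevels() removes levels that no longer occur in the data
cleaned <- droplevels(kept)
levels(cleaned)
```

In a dplyr pipeline, the same idea could be applied after the filter() step with something like `mutate(Neighborhood = fct_drop(Neighborhood))`.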

# Perform t.test on Neighborhoods
Using our filtered dataframe with only two neighborhoods, let's run t.test on it. Since both samples are in the same dataframe, we can use the formula form of t.test. It takes a data parameter for the dataframe, so the formula can reference the columns without the '$' or retyping the dataframe name.
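For comparison, the sketch below runs the same test both ways on made-up data (the `toy` data frame is hypothetical): once by passing two vectors extracted with '$', and once with the formula form.

```r
# Hypothetical toy data: a continuous column and a two-level factor
toy <- data.frame(
  price = c(100, 120, 110, 130, 200, 220, 210, 250),
  group = factor(rep(c("A", "B"), each = 4))
)

# Two-vector form: each sample is extracted with '$' and subsetting
by_vectors <- t.test(toy$price[toy$group == "A"], toy$price[toy$group == "B"])

# Formula form: response ~ grouping factor, with the data frame passed once
by_formula <- t.test(price ~ group, data = toy)

# Both calls run the same Welch test, so the results match
all.equal(unname(by_vectors$statistic), unname(by_formula$statistic))
all.equal(by_vectors$p.value, by_formula$p.value)
```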

In [10]:
# Perform t.test on SalePrice ~ Neighborhood
# Use the default two.sided, 95% confidence level
t.test(SalePrice ~ Neighborhood, data = df_neighborhoods)


	Welch Two Sample t-test

data:  SalePrice by Neighborhood
t = -20.595, df = 330.68, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -115812.94  -95617.93
sample estimates:
mean in group OldTown mean in group Somerst 
             123991.9              229707.3 


# Analyzing results
Notice that the sample estimates identify which group is the first sample (OldTown) and which is the second (Somerst). The order follows the factor's level order. It matters when interpreting a one-sided test, but not for this two-sided test.

So, what is the result of the test? The p-value is < 0.05 and the 95% confidence interval does not contain 0, so we reject the null hypothesis of equal means in favor of the alternative: the mean SalePrice in OldTown is not the same as the mean SalePrice in Somerst. Of course, we already knew that.
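To show why the sample order matters for a one-sided test, here is a minimal sketch on made-up data (the `toy` data frame is hypothetical). With the formula form, the first factor level is the first sample, so `alternative = "less"` asks whether the first level's mean is lower than the second's.

```r
# Hypothetical toy data: group "A" has clearly lower prices than group "B"
toy <- data.frame(
  price = c(100, 120, 110, 130, 200, 220, 210, 250),
  group = factor(rep(c("A", "B"), each = 4))
)

# The first level listed here is treated as the first sample
levels(toy$group)

# "less": is mean(A) < mean(B)?  "greater": is mean(A) > mean(B)?
less    <- t.test(price ~ group, data = toy, alternative = "less")
greater <- t.test(price ~ group, data = toy, alternative = "greater")

less$p.value      # small: the data support mean(A) < mean(B)
greater$p.value   # large: no support for mean(A) > mean(B)
```

If the opposite order is wanted, `relevel()` can be used to change which level comes first.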

# t.test for Sale Price ~ Central Air
Let's try another one. I noticed that Central_Air is also in our dataframe and it has only two levels (values), Y and N. 

Is Central Air a statistically significant feature for sale price? Let's find out.

In [11]:
# Perform t.test on SalePrice ~ Central_Air
# Use the default two.sided, 95% confidence level
t.test(SalePrice ~ Central_Air, data = df_ames_housing)


	Welch Two Sample t-test

data:  SalePrice by Central_Air
t = -27.433, df = 336.06, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -90625.69 -78498.92
sample estimates:
mean in group N mean in group Y 
       101890.5        186452.8 


# Analyzing results
Notice that the sample estimates identify which group is the first sample (N) and which is the second (Y), where N and Y indicate whether the home has a central air conditioner. The order matters when interpreting a one-sided test, but not for this two-sided test.

So, what is the result of the test? The p-value is < 0.05 and the 95% confidence interval does not contain 0, so we reject the null hypothesis of equal means in favor of the alternative: the mean SalePrice of homes without central air is not the same as the mean SalePrice of homes with it. That makes sense. Having central air would likely affect the sale price of the home, and this test gives strong evidence that the difference in mean sale price is not due to random chance alone.
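The same decision can be read from the test result in code rather than from the printed output. The sketch below uses made-up data (the `toy` data frame is hypothetical) and the `p.value`, `conf.int`, and `estimate` components of the object that t.test returns.

```r
# Hypothetical toy data: a continuous column and a two-level factor
toy <- data.frame(
  price = c(100, 120, 110, 130, 200, 220, 210, 250),
  group = factor(rep(c("N", "Y"), each = 4))
)

result <- t.test(price ~ group, data = toy)

result$p.value    # p-value as a number
result$conf.int   # 95% confidence interval for the difference in means
result$estimate   # the two group means

# Decision at the 5% level: reject the null hypothesis if p < 0.05
reject_null <- result$p.value < 0.05

# Equivalent check for the two-sided test: the 95% interval excludes 0
excludes_zero <- result$conf.int[1] > 0 || result$conf.int[2] < 0

c(reject_null = reject_null, excludes_zero = excludes_zero)
```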

# Code Summary
Let's summarize the code for a two-sample, two-level factor t.test.

In [12]:
# Load libraries
library(tidyverse)

# Load data: ames_housing.csv
df_ames_housing <- "ames_housing.csv" %>% 
   read_csv() %>% 
   mutate_if(is.character, as.factor) %>% 
   select(Neighborhood, Central_Air, SalePrice)

# Filter data to get to two levels in the factor
df_neighborhoods <- df_ames_housing %>% 
   filter(Neighborhood %in% c("Somerst", "OldTown"))

# Perform t.test
t.test(SalePrice ~ Neighborhood, data = df_neighborhoods)
t.test(SalePrice ~ Central_Air, data = df_ames_housing)

Parsed with column specification:
cols(
  .default = col_character(),
  Order = col_double(),
  Lot_Frontage = col_integer(),
  Lot_Area = col_double(),
  Overall_Qual = col_integer(),
  Overall_Cond = col_integer(),
  Year_Built = col_double(),
  Year_Remod_Add = col_double(),
  Mas_Vnr_Area = col_integer(),
  BsmtFin_SF_1 = col_double(),
  BsmtFin_SF_2 = col_integer(),
  Bsmt_Unf_SF = col_integer(),
  Total_Bsmt_SF = col_integer(),
  Z_1st_Flr_SF = col_double(),
  Z_2nd_Flr_SF = col_double(),
  Low_Qual_Fin_SF = col_integer(),
  Gr_Liv_Area = col_integer(),
  Bsmt_Full_Bath = col_integer(),
  Bsmt_Half_Bath = col_integer(),
  Full_Bath = col_integer(),
  Half_Bath = col_integer()
  # ... with 17 more columns
)
See spec(...) for full column specifications.



	Welch Two Sample t-test

data:  SalePrice by Neighborhood
t = -20.595, df = 330.68, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -115812.94  -95617.93
sample estimates:
mean in group OldTown mean in group Somerst 
             123991.9              229707.3 



	Welch Two Sample t-test

data:  SalePrice by Central_Air
t = -27.433, df = 336.06, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -90625.69 -78498.92
sample estimates:
mean in group N mean in group Y 
       101890.5        186452.8 
