# Homework 3: Hypothesis Testing

**!!! IMPORTANT, DO NOT PROCEED BEFORE COMPLETING THE STEP BELOW !!!**

If you haven't already, please make a copy of this notebook and save it to your Google Drive. This is imperative so that your work is saved as you go.

**Due Date**: Thursday, April 24th, at 11:59 pm.

**Submission Instructions**:
- Download the notebook: Go to File --> Download --> Download .ipynb.
- Upload the notebook: Click the Files icon (left side, under the Key icon) --> Click the Upload icon (leftmost of the 4 icons) --> Select the file you just downloaded.
- Run the last cell in this notebook.
- Find the new pdf file in the same location as your uploaded notebook.
- Click the 3 vertical dots for this pdf file --> Click Download.
- IMPORTANT: check that your pdf file has not cut off any work from your notebook.
- Upload the pdf to Gradescope.

**Learning Outcomes**:
- Understand how to formulate and conduct hypothesis tests.
- Interpret results of hypothesis tests.
- Determine which type of test is appropriate for various contexts.

## Poll responses

Remember to respond to all polls created by your group members! (~20 polls) [See assignments here](https://docs.google.com/spreadsheets/d/1NnGaO8c4BHo3Naqw0fyIrykgKp9I0Ipxx5rdpgWbSxg/edit?gid=0#gid=0)

## Set up

Run the cell below to import the libraries and packages we are going to use.

In [1]:
import numpy as np
import pandas as pd
from scipy import stats

## Exercise 0

We will continue building on the M&Ms experiment you began in class last week.

Let's again define the following:

- $p$: the *population* proportion of M&Ms that are primary colored

- $\hat{p}$: the *sample* proportion of M&Ms that are primary colored

**Part (a):** First, recall and print your count of M&Ms from your in-class sample (Homework 2, Exercise 0) and your original point estimate for $p$ (Homework 2, Exercise 1).

In [2]:
# Code here!
# --------------------------------- #

# --------------------------------- #

## Exercise 1: Setting up a hypothesis test

We will now use hypothesis tests to help us understand whether primary colored M&Ms (blue, red, and yellow) are as likely to be observed as non-primary colored M&Ms.

**Part (a):** Formulate a hypothesis test to test whether primary colored M&Ms are as likely to be observed as non-primary colored M&Ms. State the null and alternative hypotheses for your test clearly, and define all population parameters used.


---

Answer here!


---



**Part (b):** What does rejecting the null and failing to reject the null mean in the context of the problem?


---

Answer here!


---


**Part (c):** What would a type I error be in this context? What would a type II error be?


---

Answer here!


---


## Exercise 2: Conducting a hypothesis test

**Part (a):** Calculate the standard error of the sampling distribution of $\hat{p}$ under the null hypothesis.

In [3]:
# Code here!
# --------------------------------- #

# --------------------------------- #

**Part (b):** Calculate a p-value for your hypothesis test from Exercise 1 using your point estimate and the standard error from part (a).

In [4]:
# Code here!
# --------------------------------- #

# --------------------------------- #
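As a rough illustration of the mechanics in parts (a) and (b), here is a minimal sketch of a two-sided one-proportion z-test. The counts (`33` out of `60`) are hypothetical, not your class data, and the null value `p0 = 0.5` and the two-sided alternative are assumptions made for the example; use the hypotheses you wrote in Exercise 1.

```python
import numpy as np
from scipy import stats

# Hypothetical sample, NOT your class data: 33 primary colored M&Ms out of 60.
n = 60
primary_count = 33
p_hat = primary_count / n

# Null value assumed for this example: primary and non-primary equally likely.
p0 = 0.5

# Standard error of p-hat, computed with the null value p0 (not with p_hat).
se_null = np.sqrt(p0 * (1 - p0) / n)

# Standardize, then take both tails of N(0, 1) for a two-sided alternative.
z = (p_hat - p0) / se_null
p_value = 2 * stats.norm.sf(abs(z))

print(f"p_hat = {p_hat:.3f}, SE = {se_null:.4f}, z = {z:.3f}, p-value = {p_value:.3f}")
```

Note that the standard error uses `p0` because the sampling distribution is evaluated under the null hypothesis.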

**Part (c):** How would you interpret this p-value in words? Using the p-value you calculated and a significance level of $\alpha = 0.05$, what conclusion would you draw about whether primary colored and non-primary colored M&Ms are equally likely to be observed? **(Answer in 2 sentences max.)**


---

Answer here!


---


## Exercise 3

Assume that the point estimate in your sample is $\hat{p} = 0.55$. What is the minimum sample size needed to conclude that primary colored M&Ms are more common than non-primary colored M&Ms at a significance level of $\alpha = 0.05$? (Hint: you will first need to transform the estimator so that its approximate null distribution is the standard normal, $N(0, 1)$.)

In [5]:
# Code here!
# --------------------------------- #

# --------------------------------- #
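The transformation mentioned in the hint is the usual standardization of a sample proportion. As a sketch, with $p_0$ denoting the null value of $p$ and $n$ the sample size (both symbols are introduced here for illustration):

$$Z = \frac{\hat{p} - p_0}{\sqrt{p_0 (1 - p_0) / n}} \;\approx\; N(0, 1) \quad \text{under } H_0.$$

For a fixed $\hat{p} > p_0$, $Z$ grows in proportion to $\sqrt{n}$, so for a one-sided "greater than" alternative the question becomes: what is the smallest $n$ for which $Z$ exceeds the upper-$\alpha$ critical value of $N(0, 1)$?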

## Ames housing data


For exercises 4-6, we will use the Ames Housing Dataset, a well-known dataset in the field of machine learning and data analysis that contains numerical and categorical attributes of residential homes in Ames, Iowa.

In [6]:
# Run this cell to retrieve and preview the Ames housing data.
homes_url = "https://github.com/melindaleung/Ames-Iowa-Housing-Dataset/blob/master/data/ames%20iowa%20housing.csv?raw=true"
homes = pd.read_csv(homes_url)
homes.head()

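| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |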


Please see [here](https://github.com/melindaleung/Ames-Iowa-Housing-Dataset/blob/master/data/data%20description.txt) for additional information about the dataset, including a comprehensive list of variables and definitions. *Note some columns are hidden from view in the preview above.*

## Exercise 4

We want to test whether the presence of a fireplace is associated with when a home sells: specifically, whether it is associated with selling in the winter months (which we define as October through March, inclusive).



> First, let's transform the data to make this analysis simpler. We will add a new column "SoldInWinter" to the dataframe that has value "Y" if a home sold between October and March (inclusive) and "N" otherwise. Then we'll add another column "HasFireplace" to the dataframe that has value "Y" if a home has 1 or more fireplaces and "N" otherwise.



In [7]:
# Run this cell to transform month sold and fireplace data
homes['SoldInWinter'] = homes['MoSold'].apply(lambda x: 'Y' if x in [10, 11, 12, 1, 2, 3] else 'N')
homes['HasFireplace'] = homes['Fireplaces'].apply(lambda x: 'Y' if x > 0 else 'N')
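To see what the two new columns look like, here is a small sketch that applies the same transformation to a toy dataframe and then cross-tabulates the result. The dataframe `toy` and its values are hypothetical, not taken from the Ames data; `pd.crosstab` is only used to count rows in each combination.

```python
import pandas as pd

# Toy stand-in for the housing data (hypothetical values, not from the Ames dataset).
toy = pd.DataFrame({
    "MoSold": [1, 6, 11, 7, 3, 10],
    "Fireplaces": [0, 1, 2, 0, 1, 0],
})

# Same rules as the cell above: October through March counts as winter.
winter_months = [10, 11, 12, 1, 2, 3]
toy["SoldInWinter"] = toy["MoSold"].apply(lambda m: "Y" if m in winter_months else "N")
toy["HasFireplace"] = toy["Fireplaces"].apply(lambda f: "Y" if f > 0 else "N")

# Count how many homes fall in each (HasFireplace, SoldInWinter) combination.
counts = pd.crosstab(toy["HasFireplace"], toy["SoldInWinter"])
print(counts)
```

A table of counts like this is a convenient starting point when both variables are categorical.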

**Part (a):** What type of hypothesis test should be used? Justify your answer in one sentence.


---

Answer here!


---


**Part (b):** What are the null and alternative hypotheses of the test in context of this problem?


---

Answer here!


---


**Part (c):** Report your test statistic and p-value for this test (you must calculate the test statistic manually without using existing hypothesis testing functions).

In [8]:
# Code here!
# --------------------------------- #

# --------------------------------- #
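If the test you chose is a chi-square test of independence, the manual calculation can be sketched as below. The 2x2 table `observed` is hypothetical, and the choice of test is an assumption made for the example; only the chi-square distribution function `stats.chi2.sf` is used, not a ready-made testing function.

```python
import numpy as np
from scipy import stats

# Hypothetical 2x2 table of counts (rows: one variable, columns: the other).
observed = np.array([[30, 20],
                     [20, 30]])

row_totals = observed.sum(axis=1, keepdims=True)  # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)  # shape (1, 2)
total = observed.sum()

# Expected counts if the two variables were independent.
expected = row_totals @ col_totals / total

# Sum of (observed - expected)^2 / expected over all cells.
chi2_stat = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# Upper-tail probability of the chi-square distribution.
p_value = stats.chi2.sf(chi2_stat, dof)
print(f"chi2 = {chi2_stat:.3f}, dof = {dof}, p-value = {p_value:.4f}")
```

This sketch applies no continuity correction. If you later compare against `scipy.stats.chi2_contingency`, keep in mind that it applies Yates' correction to 2x2 tables by default, so the statistics will differ unless you pass `correction=False`.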

**Part (d):** What is your conclusion? Use a significance level of $\alpha = 0.05$.


---

Answer here!


---


## Exercise 5

Continuing with the Ames housing data, test whether the average lot size in Ames is different from 10,000 square feet.

**Part (a):** What type of hypothesis test should be used? Justify your answer in one sentence.


---

Answer here!


---


**Part (b):** What are the null and alternative hypotheses of the test in context of this problem?


---

Answer here!


---



**Part (c):** Report your test statistic and p-value for this test (you must calculate the test statistic manually without using existing hypothesis testing functions).

In [9]:
# Code here!
# --------------------------------- #

# --------------------------------- #
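If a one-sample t-test is the test you chose, a manual calculation might look like the sketch below. The array `sample` and the hypothesized mean `mu_0` are hypothetical placeholders, not the Ames lot areas, and the two-sided alternative is an assumption of the example.

```python
import numpy as np
from scipy import stats

# Hypothetical measurements (not the Ames lot areas) and a hypothesized mean.
sample = np.array([10.5, 9.8, 11.2, 10.9, 9.6, 10.4])
mu_0 = 10.0

n = sample.size
sample_mean = sample.mean()
sample_sd = sample.std(ddof=1)  # ddof=1 gives the sample standard deviation

# t statistic: distance of the sample mean from mu_0 in standard-error units.
t_stat = (sample_mean - mu_0) / (sample_sd / np.sqrt(n))

# Two-sided p-value from the t distribution with n - 1 degrees of freedom.
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 1)
print(f"t = {t_stat:.3f}, p-value = {p_value:.3f}")
```

With a sample as large as the housing data, the t distribution is very close to the standard normal, so a z-based p-value would be nearly identical.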

**Part (d):** What is your conclusion? Use a significance level of $\alpha = 0.05$.


---

Answer here!


---


## Exercise 6

Continuing with the Ames housing data, test whether the mean sale price of homes with vinyl siding exterior is different from the mean sale price of homes with brick face exterior (as defined by the 'Exterior1st' field).

**Part (a):** What type of hypothesis test should be used? Justify your answer in one sentence.


---

Answer here!


---


**Part (b):** What are the null and alternative hypotheses of the test in context of this problem?


---

Answer here!


---


**Part (c):** Report your test statistic and p-value for this test (you must calculate the test statistic manually without using existing hypothesis testing functions).

In [10]:
# Code here!
# --------------------------------- #

# --------------------------------- #
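If a two-sample t-test for a difference in means is the test you chose, one way to compute it by hand is sketched below. The arrays `group_a` and `group_b` are hypothetical, not Ames sale prices, and the example assumes the unequal-variance (Welch) version with a two-sided alternative; a pooled-variance version would use a different standard error and degrees of freedom.

```python
import numpy as np
from scipy import stats

# Hypothetical groups (not Ames sale prices).
group_a = np.array([210.0, 185.0, 240.0, 198.0, 225.0, 205.0])
group_b = np.array([190.0, 175.0, 230.0, 160.0, 200.0])

n_a, n_b = group_a.size, group_b.size
var_a, var_b = group_a.var(ddof=1), group_b.var(ddof=1)

# Standard error of the difference in means, without assuming equal variances.
se_diff = np.sqrt(var_a / n_a + var_b / n_b)
t_stat = (group_a.mean() - group_b.mean()) / se_diff

# Welch-Satterthwaite approximation to the degrees of freedom.
dof = (var_a / n_a + var_b / n_b) ** 2 / (
    (var_a / n_a) ** 2 / (n_a - 1) + (var_b / n_b) ** 2 / (n_b - 1)
)

# Two-sided p-value from the t distribution with the approximate dof.
p_value = 2 * stats.t.sf(abs(t_stat), df=dof)
print(f"t = {t_stat:.3f}, dof = {dof:.2f}, p-value = {p_value:.3f}")
```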

**Part (d):** What is your conclusion? Use a significance level of $\alpha = 0.05$.


---

Answer here!


---


## Converting to PDF

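Use the cell below to convert your notebook to PDF, following the submission instructions at the beginning of the notebook. The cell assumes the uploaded notebook is named `HW3.ipynb` and sits in `/content`; if the file has a different name or location, the later steps fail with a "No such file or directory" error.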

In [11]:
!apt-get update -qq > /dev/null
!apt-get install -qq --fix-missing pandoc texlive-latex-base texlive-latex-extra > /dev/null
!jupyter nbconvert --to latex "/content/HW3.ipynb" > /dev/null
!sed -i 's/❗/!/g' /content/HW3.tex
!pdflatex -interaction=nonstopmode -halt-on-error "/content/HW3.tex" > /dev/null

