# Session 3 Notebook - Hypothesis Testing

## Exercise 1

*Your company’s research group surveys 500 women more than 40 years old to test the hypothesis that 28% of women prefer to have daily check-in meetings with their boss.*


*Should the hypothesis be rejected at the 5% significance level if 151 of the women respond affirmatively?*

Step 1: Import the necessary library

In [1]:
from statsmodels.stats.proportion import proportions_ztest

Step 2: Set the variables for the sample size, the sample success and the null hypothesis. 

In [None]:
sample_success = 151
sample_size = 500
null_hypothesis = 0.28

Step 3: Run the test using the proportions_ztest function from the imported library 

In [None]:
# check our sample against Ho: the question asks whether the 28% claim should be rejected, so Ha != Ho
# for Ha < Ho use alternative='smaller'
# for Ha != Ho use alternative='two-sided'
# for Ha > Ho use alternative='larger'
proportions_ztest(count=sample_success, nobs=sample_size, value=null_hypothesis, alternative='two-sided')
# the first value in the output is the z score, the second is the p value
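
To see what the call above computes, here is a minimal sketch of the one-sample z-test for a proportion using only the standard library. The helper name `one_sample_prop_z` and its `use_null_var` flag are hypothetical. As far as I understand it, `proportions_ztest` uses the sample proportion in the standard error by default, while many textbooks use the null proportion, so the sketch computes both.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_prop_z(successes, n, p0, use_null_var=False):
    """Two-sided z-test for one proportion (hypothetical helper)."""
    p_hat = successes / n
    # choose which proportion feeds the standard error
    p_var = p0 if use_null_var else p_hat
    standard_error = sqrt(p_var * (1 - p_var) / n)
    z = (p_hat - p0) / standard_error
    # two-sided p value: probability in both tails beyond |z|
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


z_sample, p_sample = one_sample_prop_z(151, 500, 0.28)
z_null, p_null = one_sample_prop_z(151, 500, 0.28, use_null_var=True)
print('sample-proportion SE: z = %0.3f, p = %0.3f' % (z_sample, p_sample))
print('null-proportion SE:   z = %0.3f, p = %0.3f' % (z_null, p_null))
```

Both p values are well above 0.05, so either convention leads to the same conclusion: the 28% hypothesis is not rejected at the 5% level.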

## Group Exercise 1: 

According to the February 2008 Federal Trade Commission report on consumer fraud and identity theft, 23% of all complaints in 2007 were for identity theft. In that year, Alaska had 321 complaints of identity theft out of 1,432 consumer complaints ("Consumer fraud and," 2008). 

Does this data provide enough evidence to show that Alaska had a lower proportion of identity theft than 23%? Test at the 5% level.


Step 1: Import the necessary library

In [None]:
from statsmodels.stats.proportion import proportions_ztest

Step 2: Set the variables for the sample size, the sample success and the null hypothesis. 

In [3]:
sample_success = 321
sample_size = 1432
null_hypothesis = 0.23

Step 3: Run the test using the proportions_ztest function from the imported library 

In [5]:
# we want to check Ha < Ho, so use alternative='smaller'
proportions_ztest(count=sample_success, nobs=sample_size, value=null_hypothesis, alternative='smaller')
# Z Score and P Value

(-0.5297466765209304, 0.2981437903628086)

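If the p value is lower than the significance level, reject the null hypothesis. Here the p value (0.298) is higher than 0.05, so we fail to reject the null hypothesis: the data does not show that Alaska's proportion of identity theft complaints was below 23%.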
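
The decision rule can be written as a small helper: reject Ho only when the p value is below the significance level. The function name `decide` is hypothetical; the p value is the one recorded in the output above.

```python
def decide(p_value, significance):
    """Return the test conclusion for a p value (hypothetical helper)."""
    if p_value < significance:
        return 'reject Ho'
    return 'fail to reject Ho'


# p value recorded above for the Alaska identity theft test
alaska_p_value = 0.2981437903628086
print(decide(alaska_p_value, 0.05))
```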

## Exercise 2:

Suppose that early in an election campaign a telephone poll of 800 registered voters shows 460 in favor of a particular candidate. Just before election day, a second poll shows only 520 of 1000 registered voters expressing the same preference. 

At the 10% significance level, is there sufficient evidence that the candidate's popularity has decreased?

Step 1: Import the necessary libraries 

In [20]:
from statsmodels.stats.proportion import proportions_ztest
import numpy as np

Step 2: Define the sample success and sample size for each of your two samples.

In [None]:
# note - the samples do not need to be the same size
sample_success_a, sample_size_a = (460, 800)
sample_success_b, sample_size_b = (520, 1000)


Step 3: Collect the successes and sample sizes into arrays

In [None]:
# poll a (early in the campaign) comes first, poll b (just before election day) second
successes = np.array([sample_success_a, sample_success_b])
samples = np.array([sample_size_a, sample_size_b])


Step 4: Use the proportions_ztest function to find your p value.

In [None]:
# note, no need for a Ho value here - with two samples, Ho is that the two proportions are equal
# the question asks about a decrease (Ha: p_a > p_b), which is the one-sided alternative='larger'
# the two-sided call is kept because it is the one that produced the recorded output
proportions_ztest(count=successes, nobs=samples, alternative='two-sided')

(2.328219763531173, 0.019900437267002254)
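
Because the question asks whether popularity has decreased, a one-sided test is the closer match. The following is a minimal sketch of that calculation with the standard library only. The helper name `two_prop_z_larger` is hypothetical, and it assumes a pooled standard error, which reproduces the z score of about 2.328 recorded for this exercise.

```python
from math import sqrt
from statistics import NormalDist


def two_prop_z_larger(success_a, n_a, success_b, n_b):
    """Pooled z-test of Ho: p_a == p_b against Ha: p_a > p_b (hypothetical helper)."""
    p_a, p_b = success_a / n_a, success_b / n_b
    # under Ho both polls share one proportion, estimated from the combined data
    pooled = (success_a + success_b) / (n_a + n_b)
    standard_error = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / standard_error
    # one-sided p value: upper tail only
    p_value = 1 - NormalDist().cdf(z)
    return z, p_value


z, p_value = two_prop_z_larger(460, 800, 520, 1000)
print('z = %0.3f, one-sided p = %0.4f' % (z, p_value))
```

The one-sided p value is half the two-sided one, and both are below 0.10, so the conclusion at the 10% level is the same: there is evidence that the candidate's popularity has decreased.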

## Group Exercise 2:

An automobile manufacturer tries two distinct assembly procedures. In a sample of 350 cars coming off the line using the first procedure, there are 28 with major defects. A sample of 500 autos coming from the second line shows 32 with defects. 

Is the difference significant at the 10% significance level? 


Step 1: Import the necessary libraries 

In [6]:
from statsmodels.stats.proportion import proportions_ztest
import numpy as np

Step 2: define the sample success and sample size for each of your two samples. 

In [7]:
# note - the samples do not need to be the same size
sample_success_a, sample_size_a = (28, 350)
sample_success_b, sample_size_b = (32, 500)

Step 3: Collect the successes and sample sizes into arrays

In [8]:
# the question asks whether the two lines differ, so Ha != Ho (two-sided)
successes = np.array([sample_success_a, sample_success_b])
samples = np.array([sample_size_a, sample_size_b])

Step 4: Use the proportions_ztest function to find your p value. 

In [9]:
proportions_ztest(count=successes, nobs=samples,  alternative='two-sided')

(0.8963121819021319, 0.3700860547245475)

## Exercise 3:

A manufacturer claims that a new brand of air-conditioning unit uses only 6.5 kilowatts of electricity per day. A consumer agency believes that the true figure is higher and runs a test on a sample of size 50. 

If the sample mean is 7.0 kilowatts with a standard deviation of 1.4, should the manufacturer’s claim be rejected at a significance level of 5%?  

Step 1: Import the necessary libraries

In [10]:
import numpy as np
from scipy.stats import truncnorm
from statsmodels.stats.weightstats import DescrStatsW as stats
# can we assume anything from our sample?


Step 2: Define the significance and the null hypothesis values

In [11]:
significance = 0.05

null_hypothesis = 6.5

Step 3: define the sample 

In [12]:
# Normally, in the real world, you would process an entire sample (i.e. sample_a)
# But for this test, we'll generate a sample from this shape, where:
# - sample mean/dev are used to define the normal distribution
# - size is how large the sample will be
sample_mean_a, sample_dev_a, sample_size_a = (7.0, 1.4, 50)


Step 4: Run the test!

In [13]:
# here - for our test - we're generating a random sample of daily usage readings
# these are drawn from a normal distribution centred on the sample mean
# note: no seed is set, so the generated sample (and the output below) changes on every run,
# and its own mean and deviation will not exactly equal 7.0 and 1.4
sample_a = np.random.normal(loc=sample_mean_a, scale=sample_dev_a, size=sample_size_a)
# Get the stat data
(t_stat, p_value, degree_of_freedom) = stats(sample_a).ttest_mean(null_hypothesis, 'larger')
# report
print('t_stat: %0.3f, p_value: %0.3f' % (t_stat, p_value))

t_stat: 0.453, p_value: 0.326
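
Since the exercise gives the sample mean, standard deviation and size directly, the t statistic can also be computed from those summary figures without simulating a sample. This is a minimal sketch, not the notebook's method: the critical value is an assumption taken from a standard t table (one-sided, 5%, 49 degrees of freedom, about 1.677).

```python
from math import sqrt

sample_mean, sample_dev, sample_size = (7.0, 1.4, 50)
null_hypothesis = 6.5

# standard error of the mean and the resulting t statistic
standard_error = sample_dev / sqrt(sample_size)
t_stat = (sample_mean - null_hypothesis) / standard_error
degrees_of_freedom = sample_size - 1

# assumption: one-sided 5% critical value for 49 degrees of freedom from a t table
t_critical = 1.677

print('t_stat: %0.3f, df: %d' % (t_stat, degrees_of_freedom))
print('reject Ho' if t_stat > t_critical else 'fail to reject Ho')
```

With the summary figures as given, the t statistic is about 2.525, which exceeds the critical value, so the manufacturer's claim would be rejected at the 5% level. A simulated sample can land elsewhere because its own mean and deviation vary from run to run.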


## Group Exercise 3:

A local chamber of commerce claims that the mean sale price for homes in the city is $90,000. A real estate salesperson notes eight recent sales of $75,000, $102,000, $82,000, $87,000, $77,000, $93,000, $98,000, and $68,000. How strong is the evidence to reject the chamber of commerce claim with a 5% significance level? 


Step 1: Import the necessary libraries

Step 2: Define the significance and null hypothesis

Step 3: define the sample 

Step 4: Run the test!

## Exercise 4:

A city council member claims that male and female officers wait equal times for promotion in the police department. 
A women’s spokesperson, however, believes women must wait longer than men. 
If five men waited 8, 7, 10, 5 and 7 years, while four women waited 9, 5, 12, and 8 years, what conclusion should be drawn? 

Step 1: Import the necessary libraries:

In [21]:
from scipy.stats import truncnorm
from statsmodels.stats.weightstats import ttest_ind

Step 2: Define the significance and the null hypothesis 

In [22]:
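# can we assume anything from our sample?
# the question does not state a significance level; 0.025 is the value chosen here
significance = 0.025
# Ho: women and men wait the same time on average (difference in means = 0)
# Ha: women wait longer than men
# the men's sample mean is 7.4 years; null_hypothesis is kept for reference only,
# the two-sample test below compares the samples directly and does not use it
null_hypothesis = 7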

Step 3: Define the sample 

In [23]:
# Normally, in the real world, you would process an entire sample (i.e. sample_a)
# But for this test, we'll generate a sample from this shape, where:
# - min/max is the range of available options
# - sample mean/dev are used to define the normal distribution
# - size is how large the sample will be
min, max = (5, 12)
# v1 = men (8, 7, 10, 5, 7): mean 7.4, sample standard deviation about 1.8
sample_mean_v1, sample_dev_v1, sample_size_v1 = (7.4, 1.8, 5)
# v2 = women (9, 5, 12, 8): mean 8.5, sample standard deviation about 2.9
sample_mean_v2, sample_dev_v2, sample_size_v2 = (8.5, 2.9, 4)


Step 4: Generate the random samples of durations for samples 1 and 2

In [24]:

# here - for our test - we're generating a random sample of durations for each group
# these are drawn from a normal distribution truncated to min/max, centred on the mean
# truncnorm expects the bounds in standard deviations from loc, hence (bound - mean) / dev
sample_v1 = truncnorm(
    (min - sample_mean_v1) / sample_dev_v1,
    (max - sample_mean_v1) / sample_dev_v1,
    loc=sample_mean_v1,
    scale=sample_dev_v1).rvs(sample_size_v1)

sample_v2 = truncnorm(
    (min - sample_mean_v2) / sample_dev_v2,
    (max - sample_mean_v2) / sample_dev_v2,
    loc=sample_mean_v2,
    scale=sample_dev_v2).rvs(sample_size_v2)

Step 5: Get and report the p value

In [25]:
(t_stat, p_value, degree_of_freedom) = ttest_ind(sample_v2, sample_v1, alternative='larger')
# report
print('t_stat: %0.3f, p_value: %0.3f' % (t_stat, p_value))

t_stat: 0.871, p_value: 0.206
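
The exercise lists the actual waiting times, so the test can also be run on the real data instead of a simulated sample. This is a minimal sketch using only the standard library, assuming a pooled-variance two-sample t-test (both groups share one variance). The critical value is an assumption taken from a standard t table (one-sided, 2.5%, 7 degrees of freedom, about 2.365), matching the significance of 0.025 set earlier.

```python
from math import sqrt
from statistics import mean, variance

men = [8, 7, 10, 5, 7]
women = [9, 5, 12, 8]

n_men, n_women = len(men), len(women)
degrees_of_freedom = n_men + n_women - 2

# pooled variance: weighted average of the two sample variances
pooled_var = ((n_men - 1) * variance(men)
              + (n_women - 1) * variance(women)) / degrees_of_freedom
standard_error = sqrt(pooled_var * (1 / n_men + 1 / n_women))

# Ha: women wait longer, so the difference is women minus men
t_stat = (mean(women) - mean(men)) / standard_error

# assumption: one-sided 2.5% critical value for 7 degrees of freedom from a t table
t_critical = 2.365

print('t_stat: %0.3f, df: %d' % (t_stat, degrees_of_freedom))
print('reject Ho' if t_stat > t_critical else 'fail to reject Ho')
```

The t statistic on the real data is about 0.702, well below the critical value, so these samples do not give enough evidence that women wait longer than men.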


## Group Exercise 4

A store manager wishes to determine whether there is a significant difference between two trucking firms with regard to the handling of egg cartons. In a simple random sample of 200 cartons on one firm’s truck there was an average of 0.7 broken eggs per carton with a standard deviation of 0.31, while a sample of 300 cartons on the second firm’s truck showed an average of 0.775 broken eggs per carton with a standard deviation of 0.42. 

Is the difference between averages significant at a significance level of 5%?  


Step 1: Import the necessary libraries:

Step 2: Define the significance and the null hypothesis 

Step 3: Define the sample 

Step 4: Generate the random samples of broken-egg counts for samples 1 and 2

Step 5: Get and report the p value