# Notebook Imports

In [1]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
from math import factorial as f
import seaborn as sns

%matplotlib inline

# Problem Statement 1:
    In each of the following situations, state whether it is a correctly stated hypothesis testing problem and why?
    1. 𝐻0: 𝜇 = 25, 𝐻1: 𝜇 ≠ 25 
    2. 𝐻0: 𝜎 > 10, 𝐻1: 𝜎 = 10 
    3. 𝐻0: 𝑥 = 50, 𝐻1: 𝑥 ≠ 50 
    4. 𝐻0: 𝑝 = 0.1, 𝐻1: 𝑝 = 0.5 
    5. 𝐻0: 𝑠 = 30, 𝐻1: 𝑠 > 30

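## Ans:
    1. 𝐻0: 𝜇 = 25, 𝐻1: 𝜇 ≠ 25 -> Correct: both hypotheses refer to a population parameter (𝜇), the null contains the equality, and together they cover all possible values.
    2. 𝐻0: 𝜎 > 10, 𝐻1: 𝜎 = 10 -> Incorrect: the equality belongs in the null hypothesis, not the alternative, and values below 10 are not covered.
    3. 𝐻0: 𝑥 = 50, 𝐻1: 𝑥 ≠ 50 -> Incorrect: 𝑥 is a sample quantity; hypotheses must be stated about a population parameter.
    4. 𝐻0: 𝑝 = 0.1, 𝐻1: 𝑝 = 0.5 -> Incorrect: the hypotheses name only two values, so 𝐻1 is not the complement of 𝐻0.
    5. 𝐻0: 𝑠 = 30, 𝐻1: 𝑠 > 30 -> Incorrect: 𝑠 is a sample statistic (the sample standard deviation), not a population parameter.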

# Problem Statement 2:
    The college bookstore tells prospective students that the average cost of its textbooks is Rs. 52 with a standard deviation of Rs. 4.50. 
    A group of smart statistics students thinks that the average cost is higher. 
    To test the bookstore’s claim against their alternative, the students will select a random sample of size 100. 
    Assume that the mean from their random sample is Rs. 52.80. 
    Perform a hypothesis test at the 5% level of significance and state your decision.

In [2]:
# H0: mu = 52 (bookstore's claim); H1: mu > 52 (students' alternative)

mu = 52
sigma = 4.5

s_mean = 52.8
n = 100

# Standard Error
se = sigma / (n**0.5)

# Z value (sample size is large)
z_test = (s_mean - mu) / se
print("Z value for test sample =", round(z_test, 2))

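# Critical Z value for an upper-tailed test at SL -> alpha = 0.05
z = stats.norm.ppf(1 - 0.05)
print("Z value for given Significant Limit =", round(z, 2))

Z value for test sample = 1.78
Z value for given Significant Limit = 1.64


- The alternative is one-sided (the mean cost is higher), so the rejection region is the upper tail. As the Z value for the test (1.78) is greater than the critical value (1.64), we reject the Null Hypothesis, i.e., there is evidence at the 5% level that the average cost is higher than Rs. 52.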
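
The same decision can be reached through a p-value instead of a critical value. The snippet below is a minimal sketch, assuming a hypothetical helper named `one_sample_z_test` (it is not a SciPy function) with an illustrative `alternative` argument; the numbers are the ones given in the problem.

```python
from math import sqrt
from scipy import stats

def one_sample_z_test(sample_mean, mu0, sigma, n, alternative="two-sided"):
    """Hypothetical helper: z statistic and p-value for a one-sample z-test."""
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    if alternative == "greater":      # H1: mu > mu0 (upper tail)
        p = stats.norm.sf(z)
    elif alternative == "less":       # H1: mu < mu0 (lower tail)
        p = stats.norm.cdf(z)
    else:                             # H1: mu != mu0 (both tails)
        p = 2 * stats.norm.sf(abs(z))
    return z, p

# Bookstore problem: H1 is "mean cost is higher", so use the upper tail
z_stat, p_value = one_sample_z_test(52.8, 52, 4.5, 100, alternative="greater")
print(f"z = {z_stat:.2f}, p-value = {p_value:.4f}")
```

H0 is rejected whenever the p-value is smaller than the chosen significance level, which is equivalent to the test statistic falling in the critical region.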

In [3]:
#

# Problem Statement 3:
    A certain chemical pollutant in the Genesee River has been constant for several years with mean μ = 34 ppm (parts per million) and standard deviation σ = 8 ppm. 
    A group of factory representatives whose companies discharge liquids into the river is now claiming that they have lowered the average with improved filtration devices. 
    A group of environmentalists will test to see if this is true at the 1% level of significance. Assume that their sample of size 50 gives a mean of 32.5 ppm. 
    Perform a hypothesis test at the 1% level of significance and state your decision.

In [4]:
# H0: mu = 34 ppm (no change); H1: mu < 34 ppm (factories' claim)
mu = 34
sigma = 8

s_mean = 32.5
n = 50

# Standard Error
se = sigma / (n**0.5)

# Z value (sample size is large)
z_test = (s_mean - mu) / se
print("Z value for test sample =", round(z_test, 2))

# Critical Z value for a lower-tailed test at SL = 1% -> alpha = 0.01
z = stats.norm.ppf(0.01)
print("Z value for given Significant Limit =", round(z, 2))

Z value for test sample = -1.33
Z value for given Significant Limit = -2.33


- As the Z value for the test (-1.33) is greater than the critical value (-2.33), it does not fall in the lower-tail rejection region. We do not reject the Null Hypothesis, i.e., there is not enough evidence at the 1% level that the factories have lowered the mean pollutant level below 34 ppm.

In [5]:
#

# Problem Statement 4:
    Based on population figures and other general information on the U.S. population, suppose it has been estimated that, on average, a family of four in the U.S. spends about $1135 annually on dental expenditures. 
    Suppose further that a regional dental association wants to test to determine if this figure is accurate for their area of country. 
    To test this, 22 families of 4 are randomly selected from the population in that area of the country and a log is kept of the family’s dental expenditure for one year. 
    The resulting data are given below. 
    Assuming, that dental expenditure is normally distributed in the population, use the data and an alpha of 0.5 to test the dental association’s hypothesis.
1008, 812, 1117, 1323, 1308, 1415, 831, 1021, 1287, 851, 930, 730, 699, 872, 913, 944, 954, 987, 1695, 995, 1003, 994

In [6]:
# H0: on average, a family of four in the U.S. spends about $1135 annually on dental expenditures.
mu = 1135

sample = np.array([1008, 812, 1117, 1323, 1308, 1415, 831, 1021, 1287, 851, 930, 
          730, 699, 872, 913, 944, 954, 987, 1695, 995, 1003, 994])

s_mean = np.mean(sample)
s_std = np.std(sample, ddof=1)  # sample standard deviation (n-1 in the denominator)
n = len(sample)  # n-1 = degree of freedom

# Standard Error
se = s_std / (n**0.5)

# t Test - small sample size and unknown population variance
t_test = (s_mean - mu) / se
print("Value of t Test Statistic for sample =", round(t_test, 2))

# stats.ttest_1samp(sample, mu)

t_val = stats.t.ppf(0.25, n - 1)   # 2 tail test: alpha/2 = 0.25 in each tail
print("t Test value for given SL =", round(t_val, 2))

Value of t Test Statistic for sample = -2.02
t Test value for given SL = -0.69


- The t statistic for the sample (-2.02) is below the lower critical value (-0.69), so it lies in the rejection region of the two-tailed test. We therefore reject the null hypothesis at alpha = 0.5.
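
As a cross-check on the manual calculation, SciPy's `stats.ttest_1samp` runs the same one-sample t-test directly on the raw data. It uses the sample standard deviation with n-1 in the denominator and returns a two-sided p-value. This is a sketch under the problem's stated data, not a replacement for the working above.

```python
import numpy as np
from scipy import stats

sample = np.array([1008, 812, 1117, 1323, 1308, 1415, 831, 1021, 1287, 851, 930,
                   730, 699, 872, 913, 944, 954, 987, 1695, 995, 1003, 994])

# Two-sided one-sample t-test against the hypothesized mean of $1135
result = stats.ttest_1samp(sample, popmean=1135)
print(f"t = {result.statistic:.2f}, two-sided p-value = {result.pvalue:.3f}")
```

The null hypothesis is rejected when the two-sided p-value is smaller than the chosen alpha.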

In [7]:
#

# Problem Statement 5:
    In a report prepared by the Economic Research Department of a major bank the Department manager maintains that the average annual family income on Metropolis is $48,432. 
    What do you conclude about the validity of the report if a random sample of 400 families shows an average income of $48,574 with a standard deviation of 2000?

In [8]:
# H0: average annual family income on Metropolis is $48,432.

mu = 48432

n = 400
s_mean = 48574
s_std = 2000

se = s_std / (n**0.5)

# 2 tail Z test (n > 30 & distribution is unknown)
z_test = (s_mean - mu) / se
print("Z value for test sample =", round(z_test, 2))

# let SL=5% -> alpha=0.05
z = stats.norm.ppf(0.025)
print("Z value for given Significant Limit =", round(z, 2))

Z value for test sample = 1.42
Z value for given Significant Limit = -1.96


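- As |Z| for the test (1.42) is smaller than the two-tailed critical value (1.96), the statistic does not fall in the rejection region. We do not reject the Null Hypothesis, i.e., at the assumed 5% level the sample is consistent with the reported average annual family income of $48,432.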

In [9]:
#

# Problem Statement 6:
    Suppose that in past years the average price per square foot for warehouses in the United States has been $32.28. 
    A national real estate investor wants to determine whether that figure has changed now. 
    The investor hires a researcher who randomly samples 19 warehouses that are for sale across the United States and finds that the mean price per square foot is $31.67, with a standard deviation of $1.29. 
    Assume that the prices of warehouse footage are normally distributed in population. 
    If the researcher uses a 5% level of significance, what statistical conclusion can be reached? What are the hypotheses?

In [10]:
# H0: mu = 32.28 (the average price per square foot is unchanged); H1: mu != 32.28 (it has changed)

mu = 32.28

#Sample
n = 19     # < 30
s_mean = 31.67
s_std = 1.29

se = s_std / (n**0.5)

# t-Test - small sample size and unknown population variance
t_test = (s_mean - mu) / se
print("Value of t Test Statistic for sample =", t_test)

# stats.ttest_1samp(sample, mu)

t_val = stats.t.ppf(0.025, 18)   # 2 tail test
print("t Test value for given SL =", t_val)

Value of t Test Statistic for sample = -2.06118477175179
t Test value for given SL = -2.10092204024096


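- The t statistic for the sample (-2.06) lies between the critical values (-2.10 and +2.10), so it does not fall in the rejection region. We fail to reject the null hypothesis: at the 5% level there is not enough evidence that the average price per square foot has changed from $32.28.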
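
A two-tailed test at the 5% level is equivalent to checking whether the hypothesized mean lies inside the 95% confidence interval around the sample mean. The sketch below uses `stats.t.interval` with the summary statistics from the problem; the variable names are illustrative.

```python
from math import sqrt
from scipy import stats

n, s_mean, s_std, mu0 = 19, 31.67, 1.29, 32.28
se = s_std / sqrt(n)

# 95% confidence interval for the population mean (t distribution, n-1 df)
low, high = stats.t.interval(0.95, n - 1, loc=s_mean, scale=se)
print(f"95% CI for the mean: ({low:.2f}, {high:.2f})")
print("Hypothesized mean inside interval:", low <= mu0 <= high)
```

If the hypothesized mean falls inside the interval, the two-tailed test does not reject H0 at the matching significance level.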

In [11]:
#

# Problem Statement 7:
    Fill in the blank spaces in the table and draw your conclusions from it.

![title](img/Stats2_07.png)

In [12]:
#

In [13]:
#

# Problem Statement 8:
Find the t-score for a sample size of 16 taken from a population with mean 10 when the sample mean is 12 and the sample standard deviation is 1.5.

In [14]:
n = 16
mu = 10
s_mean = 12
s_std = 1.5

se = s_std / (n**0.5)

t_test = (s_mean - mu) / se
print("t-score for sample =", t_test)

t-score for sample = 5.333333333333333


In [15]:
#

# Problem Statement 9:
    Find the t-score below which we can expect 99% of sample means will fall if samples of size 16 are taken from a normally distributed population.

In [16]:
stats.t.ppf(0.99, 15)

2.602480294995493

In [17]:
#

# Problem Statement 10:
    If a random sample of size 25 drawn from a normal population gives a mean of 60 and a standard deviation of 4, find the range of t-scores where we can expect to find the middle 95% of all sample means. 
    Compute the probability that (−𝑡0.05 <𝑡<𝑡0.10).

In [18]:
n = 25

mu = 60
std = 4

stats.t.ppf(0.025, 24)

-2.063898561628021

- range of t-scores where we can expect to find the middle 95% of all sample means : -2.06 to +2.06


In [19]:
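# Compute the probability that (−𝑡0.05 <𝑡<𝑡0.10)
# t0.05 and t0.10 are the values with upper-tail areas 0.05 and 0.10
lower = -stats.t.ppf(1 - 0.05, 24)   # -t0.05
upper = stats.t.ppf(1 - 0.10, 24)    #  t0.10
round(stats.t.cdf(upper, 24) - stats.t.cdf(lower, 24), 4)

0.85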

In [20]:
#

# Problem Statement 11:
    Two-tailed test for difference between two population means
    Is there evidence to conclude that the number of people travelling from Bangalore to Chennai is different from the number of people travelling from Bangalore to Hosur in a week, given the following:
    
    Population 1: Bangalore to Chennai 
    n1 = 1200
    x1 = 452
    s1 = 212
    
    Population 2: Bangalore to Hosur 
    n2 = 800 
    x2 = 523
    s2 = 185

In [21]:
# H0: Number of people same in both cases
n1 = 1200
x1 = 452
s1 = 212

n2 = 800 
x2 = 523
s2 = 185

se = ((s1**2 / n1) + (s2**2 / n2)) ** 0.5

# Z test - large sample size
z_test = (x2 - x1) / se
print("Z value for test sample =", round(z_test, 2))

# let 5% SL - 2 tailed
z = stats.norm.ppf(0.025)
print(f"Z value ranges for given Significant Limit (Critical Regions) = <{round(z, 2)} & >{-round(z, 2)}")

Z value for test sample = 7.93
Z value ranges for given Significant Limit (Critical Regions) = <-1.96 & >1.96


### As the Z value for sample lies in the critical region, Null Hypothesis will be rejected.
- Therefore, there is evidence that the number of people travelling from Bangalore to Chennai is different from the number of people travelling from Bangalore to Hosur in a week.

In [22]:
#

# Problem Statement 12:
    Is there evidence to conclude that the number of people preferring Duracell battery is different from the number of people preferring Energizer battery, given the following: 
    
    Population 1: Duracell
    n1 = 100
    x1 = 308
    s1 = 84
    
    Population 2: Energizer 
    n2 = 100
    x2 = 254
    s2 = 67

In [23]:
# H0: Number of people same in both cases
n1 = 100
x1 = 308
s1 = 84

n2 = 100 
x2 = 254
s2 = 67

se = ((s1**2 / n1) + (s2**2 / n2)) ** 0.5

# Z test - large sample size
z_test = (x2 - x1) / se
print("Z value for test sample =", round(z_test, 2))

# let 5% SL - 2 tailed
z = stats.norm.ppf(0.025)
print(f"Z value ranges for given Significant Limit (Critical Regions) = <{round(z, 2)} & >{-round(z, 2)}")

Z value for test sample = -5.03
Z value ranges for given Significant Limit (Critical Regions) = <-1.96 & >1.96


### As the Z value for sample lies in the critical region, Null Hypothesis will be rejected.
- Therefore there is evidence that the number of people preferring Duracell battery is different from the number of people preferring Energizer battery.
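
The two large-sample comparisons above follow the same recipe, so they can be wrapped in one function that also reports a two-sided p-value. This is a minimal sketch; `two_sample_z_test` is a hypothetical helper name, not a library function, and it computes the difference as x1 - x2, so the sign of z is the opposite of the cells above (the magnitude and the decision are unaffected).

```python
from math import sqrt
from scipy import stats

def two_sample_z_test(x1, s1, n1, x2, s2, n2):
    """Hypothetical helper: z statistic and two-sided p-value for H0: mu1 = mu2."""
    se = sqrt(s1**2 / n1 + s2**2 / n2)   # standard error of the difference
    z = (x1 - x2) / se
    p = 2 * stats.norm.sf(abs(z))        # area in both tails
    return z, p

# Bangalore-Chennai vs Bangalore-Hosur travellers
z_travel, p_travel = two_sample_z_test(452, 212, 1200, 523, 185, 800)
# Duracell vs Energizer
z_battery, p_battery = two_sample_z_test(308, 84, 100, 254, 67, 100)

print(f"travel:  z = {z_travel:.2f}, p-value = {p_travel:.2e}")
print(f"battery: z = {z_battery:.2f}, p-value = {p_battery:.2e}")
```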

In [24]:
#

# Problem Statement 13:
    Pooled estimate of the population variance
    Does the data provide sufficient evidence to conclude that average percentage increase in the price of sugar differs when it is sold at two different prices? 
    
    Population 1: 
    Price of sugar = Rs. 27.50 
    n1 = 14
    x1 = 0.317%
    s1 = 0.12%
    
    Population 2: 
    Price of sugar = Rs. 20.00 
    n2 = 9
    x2 = 0.21%
    s2 = 0.11%

In [25]:
# H0: Both samples come from same population
# Population 1: Price of sugar = Rs. 27.50 
n1 = 14
x1 = 0.317
s1 = 0.12

# Population 2: Price of sugar = Rs. 20.00 
n2 = 9
x2 = 0.21
s2 = 0.11

# Pooled estimate of the population variance
sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
se = (sp2 * (1 / n1 + 1 / n2)) ** 0.5

# t-Test - small sample size and unknown population variance
t_test = (x2 - x1) / se
print("t Test value for test sample =", round(t_test, 2))

# let 5% SL - 2 tailed
t = stats.t.ppf(0.025, df = n1+n2-2)
print(f"t Test value ranges for given Significant Limit (Critical Regions) = <{round(t, 2)} & >{-round(t, 2)}")

t Test value for test sample = -2.15
t Test value ranges for given Significant Limit (Critical Regions) = <-2.08 & >2.08


### As the t-Test value for sample lies in the critical region, Null Hypothesis will be rejected.
- The data provide sufficient evidence to conclude that the average percentage increase in the price of sugar differs when it is sold at two different prices.
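
SciPy can run a pooled-variance t-test from summary statistics alone through `stats.ttest_ind_from_stats`. The sketch below passes `equal_var=True` to request the pooled estimate with n1 + n2 - 2 degrees of freedom. The statistic is computed as mean1 - mean2, so its sign depends on the order in which the two populations are passed.

```python
from scipy import stats

# Summary statistics from the sugar-price problem (values in percent)
res = stats.ttest_ind_from_stats(
    mean1=0.317, std1=0.12, nobs1=14,   # Population 1: Rs. 27.50
    mean2=0.21,  std2=0.11, nobs2=9,    # Population 2: Rs. 20.00
    equal_var=True,                     # pooled variance estimate
)
print(f"t = {res.statistic:.2f}, two-sided p-value = {res.pvalue:.3f}")
```

Setting `equal_var=False` instead would run Welch's test, which does not pool the variances.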

In [26]:
#

# Problem Statement 14:
    The manufacturers of compact disk players want to test whether a small price reduction is enough to increase sales of their product. Is there evidence that the small price reduction is enough to increase sales of compact disk players? 
    
    Population 1: Before reduction
    n1 = 15
    x1 = Rs. 6598 
    s1 = Rs. 844 
    
    Population 2: After reduction 
    n2 = 12 
    x2 = Rs. 6870
    s2 = Rs. 669

In [27]:
# H0: Both samples come from same population - Small Reduction not enough to increase sales of product
# Population 1: Before Reduction
n1 = 15
x1 = 6598 
s1 = 844

# Population 2: After reduction 
n2 = 12 
x2 = 6870
s2 = 669

se = ((s1**2 / n1) + (s2**2 / n2)) ** 0.5

# t-Test - small sample size and unknown population variance
t_test = (x2 - x1) / se
print("t Test value for test sample =", round(t_test, 2))

# let 5% SL - 1 tailed (H1: sales after the reduction are higher)
t = stats.t.ppf(0.95, df = n1+n2-2)
print(f"t Test value range for given Significant Limit (Critical Region) = >{round(t, 2)}")

t Test value for test sample = 0.93
t Test value range for given Significant Limit (Critical Region) = >1.71


- The t Test value (0.93) lies outside the critical region, therefore we fail to reject the null hypothesis, i.e. there is not enough evidence that the small price reduction increased sales.

In [28]:
#

# Problem Statement 15:
    Comparisons of two population proportions when the hypothesized difference is zero 
    Carry out a two-tailed test of the equality of banks’ share of the car loan market in 1980 and 1995.
    
    Population 1: 1980
    n1 = 100
    x1 = 53
    𝑝 1 = 0.53 
    
    Population 2: 1995 
    n2 = 100
    x2 = 43
    𝑝 2 = 0.43

In [29]:
# H0 : Banks have the same market share in 1980 & 1995 (p1 = p2)
# Population 1: 1980
n1 = 100
x1 = 53
p1 = x1 / n1   # 0.53

# Population 2: 1995 
n2 = 100
x2 = 43
p2 = x2 / n2   # 0.43

# Combined Sample Proportion
p = (x1 + x2) / (n1 + n2)

se = ((p * (1 - p) / n1) + (p * (1 - p) / n2)) ** 0.5
# print(se)

# Z test - large sample size
z_test = (p2 - p1) / se
print("Z value for test sample =", round(z_test, 2))

# let 5% SL - 2 tailed
z = stats.norm.ppf(0.025)
print(f"Z value ranges for given Significant Limit (Critical Regions) = <{round(z, 2)} & >{-round(z, 2)}")

Z value for test sample = -1.42
Z value ranges for given Significant Limit (Critical Regions) = <-1.96 & >1.96


### The Z value for the test data (-1.42) lies outside the critical region.
- We fail to reject the Null Hypothesis: at the 5% level there is not enough evidence that the banks' share of the car loan market differs between 1980 and 1995.

In [30]:
#

# Problem Statement 16:
    Carry out a one-tailed test to determine whether the population proportion of traveler’s check buyers who buy at least $2500 in checks when sweepstakes prizes are offered is at least 10% higher than the proportion of such buyers when no sweepstakes are on.
    
    Population 1: With sweepstakes
    n1 = 300
    x1 = 120
    p1 = 0.40
    
    Population 2: No sweepstakes 
    n2 = 700 
    x2 = 140
    p2 = 0.20

In [31]:
# H0 : p1 - p2 <= 0.10 (proportion of buyers with sweepstakes is at most 10 percentage points higher than with no sweepstakes)
# H1 : p1 - p2 > 0.10 (proportion of buyers with sweepstakes is more than 10 percentage points higher than with no sweepstakes)

# Population 1: With sweepstakes
n1 = 300
x1 = 120
p1 = 0.40

# Population 2: No sweepstakes 
n2 = 700 
x2 = 140
p2 = 0.20

d = 0.1

# The hypothesized difference is non-zero, so the standard error uses the
# individual sample proportions rather than a pooled proportion
se = ((p1 * (1 - p1) / n1) + (p2 * (1 - p2) / n2)) ** 0.5

z_test = (p1 - p2 - d) / se
print("Z value for test sample =", round(z_test, 2))

# let 5% SL - 1 tailed - only checking for higher values
z = stats.norm.ppf(0.05)
print(f"Z value range for given Significant Limit (Critical Regions) = >{-round(z, 2)}")

Z value for test sample = 3.12
Z value range for given Significant Limit (Critical Regions) = >1.64


### Z value for test samples lies in the critical region, therefore we will reject the null hypothesis.
### i.e. the proportion of buyers with sweepstakes is more than 10 percentage points higher than that with no sweepstakes
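
The two proportion problems differ in how the standard error is built: a pooled proportion is appropriate when the hypothesized difference is zero, while the individual sample proportions are used when it is non-zero. The sketch below captures that rule in a hypothetical helper, `two_proportion_z` (an illustrative name, not a library function), and applies it to the sweepstakes data.

```python
from math import sqrt
from scipy import stats

def two_proportion_z(x1, n1, x2, n2, diff=0.0):
    """Hypothetical helper: z statistic for H0: p1 - p2 = diff."""
    p1, p2 = x1 / n1, x2 / n2
    if diff == 0:
        # Hypothesized difference is zero: pool the two samples
        p = (x1 + x2) / (n1 + n2)
        se = sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    else:
        # Non-zero hypothesized difference: use each sample's own proportion
        se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return (p1 - p2 - diff) / se

# Sweepstakes problem, upper-tailed test with hypothesized difference 0.10
z_sweep = two_proportion_z(120, 300, 140, 700, diff=0.10)
p_sweep = stats.norm.sf(z_sweep)   # upper-tail p-value
print(f"z = {z_sweep:.2f}, p-value = {p_sweep:.4f}")
```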

In [32]:
#

# Problem Statement 17:
    A die is thrown 132 times with the following results: 
    Number turned up: 1, 2, 3, 4, 5, 6 
    Frequency: 16, 20, 25, 14, 29, 28
    Is the die unbiased? Consider the degrees of freedom as p − 1.

In [33]:
# H0: Die is unbiased
# Observed frequency
obs = np.array([16, 20, 25, 14, 29, 28])

# Expected Frequency
exp = np.array([132/6] * 6)

chi2_stat, chi2_pval = stats.chisquare(obs, exp)

# Let SL = 0.05
print('chi-square statistic =',chi2_stat)
print('p value for chi2 =', chi2_pval)

chi-square statistic = 9.0
p value for chi2 = 0.1090641579497725


## As the p value(0.109) > SL(0.05) -> fail to reject H0: no evidence that the die is biased
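
For reference, the goodness-of-fit statistic is the sum of (observed - expected)² / expected over the categories, compared against a chi-square distribution with (number of categories - 1) degrees of freedom. The sketch below recomputes the die example by hand with NumPy as a check on `stats.chisquare`; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

obs = np.array([16, 20, 25, 14, 29, 28])
exp = np.full(obs.size, obs.sum() / obs.size)   # 22 expected per face

# Chi-square statistic: sum of squared deviations scaled by expected counts
chi2_manual = ((obs - exp) ** 2 / exp).sum()
df = obs.size - 1                               # p - 1 = 5

p_manual = stats.chi2.sf(chi2_manual, df)       # upper-tail probability
critical = stats.chi2.ppf(0.95, df)             # critical value at alpha = 0.05

print(f"chi2 = {chi2_manual:.2f}, p-value = {p_manual:.3f}, critical = {critical:.2f}")
```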

In [34]:
#

# Problem Statement 18:
In a certain town, there are about one million eligible voters. A simple random sample of 10,000 eligible voters was chosen to study the relationship between gender and participation in the last election. The results are summarized in the following 2X2 (read two by two) contingency table:


![title](img/Stats2_18.png)

 We would want to check whether being a man or a woman (columns) is independent of having voted in the last election (rows). In other words, is “gender and voting independent”?

In [35]:
# H0: Gender & Voting Independent 

# Observed Data
obs_men_voted = 2792
obs_men_not_voted = 1486
obs_women_voted = 3591
obs_women_not_voted = 2131

obs = np.array([obs_men_voted, obs_men_not_voted, obs_women_voted, obs_women_not_voted]) 

total_men = obs_men_voted + obs_men_not_voted
total_women = obs_women_voted + obs_women_not_voted

total_voted = obs_men_voted + obs_women_voted
total_not_voted = obs_men_not_voted + obs_women_not_voted

total = total_not_voted + total_voted

ratio_voted = total_voted / total
ratio_not_voted = total_not_voted / total

# Expected Data
exp_men_voted = ratio_voted * total_men
exp_men_not_voted = ratio_not_voted * total_men
exp_women_voted = ratio_voted * total_women
exp_women_not_voted = ratio_not_voted * total_women

exp = np.array([exp_men_voted, exp_men_not_voted, exp_women_voted, exp_women_not_voted])

# 2x2 table: degrees of freedom = (2-1)*(2-1) = 1, so ddof = (k - 1) - 1 = 2 for k = 4 cells
chi2_stat, chi2_pval = stats.chisquare(obs, exp, ddof=2)

# Let SL = 0.05
print('chi-square statistic =',chi2_stat)
print('p value for chi2 =', round(chi2_pval, 3))

chi-square statistic = 6.660455899328042
p value for chi2 = 0.01


## p value of chi2(0.01) < SL(0.05), therefore reject null hypothesis.
## gender and voting are not independent
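
The expected counts and degrees of freedom for a contingency table can also be obtained in one call with `stats.chi2_contingency`. The sketch below arranges the observed counts as rows = voted / did not vote and columns = men / women, and passes `correction=False` so that the Yates continuity correction (applied by default to 2x2 tables) is switched off and the statistic matches the manual calculation.

```python
import numpy as np
from scipy import stats

# Rows: voted, did not vote; columns: men, women
table = np.array([[2792, 3591],
                  [1486, 2131]])

chi2, p, dof, expected = stats.chi2_contingency(table, correction=False)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p-value = {p:.4f}")
print(expected.round(2))
```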

In [36]:
#

# Problem Statement 19:
    A sample of 100 voters are asked which of four candidates they would vote for in an election. The number supporting each candidate is given below:
    
    Higgins - 41
    Reardon - 19
    White.  - 24
    Charlton - 16
    
    Do the data suggest that all candidates are equally popular? [Chi-Square = 14.96, with 3 df, 𝑝 < 0.05].

In [37]:
# H0: All candidates equally popular
obs = np.array([41, 19, 24, 16])
exp = np.array([100/4]*4)

# 4 categories -> df = 4 - 1 = 3, which is the default (ddof=0)
chi2_stat, chi2_pval = stats.chisquare(obs, exp)

print('chi-square statistic =',chi2_stat)
print('p value for chi2 =', round(chi2_pval, 3))

chi-square statistic = 14.959999999999999
p value for chi2 = 0.002


- Chi2 critical value for alpha = 0.05 with df=3 is 7.815. Chi2 for the test is 14.96 (>7.815), and the p value (0.002) < 0.05.
- Hence H0 will be rejected, i.e. the candidates are not all equally popular.

In [38]:
#

# Problem Statement 20:
    Children of three ages are asked to indicate their preference for three photographs of adults. 
    Do the data suggest that there is a significant relationship between age and photograph preference? 
    What is wrong with this study? [Chi-Square = 29.6, with 4 df: 𝑝 < 0.05].
    
![title](img/Stats2_20.png)

In [39]:
# H0: There is no relationship between age and photograph.
stats.chi2.ppf(q=0.95, df=4)

9.487729036781154

- chi2 for the test (29.6) > critical chi2 at alpha = 0.05 with df = 4 (9.488). Reject the Null Hypothesis.
- Therefore, there is a significant relationship between age and photograph preference.

In [40]:
#

# Problem Statement 21:

![title](img/Stats2_21.png)

In [41]:
# H0: There is no difference between Support & No Support.
stats.chi2.ppf(q=0.95, df=1)

3.841458820694124

- Reject the Null Hypothesis -> there is a difference between the "support" and "no support" conditions in the frequency with which individuals are likely to conform.

In [42]:
#

# Problem Statement 22:

![title](img/Stats2_22.png)

In [43]:
# H0: There is no relationship between height and leadership qualities
stats.chi2.ppf(q=0.99, df=2)

9.21034037197618

- chi2 for the test (10.71) > critical chi2 at alpha = 0.01 with df = 2 (9.2103): Reject the Null Hypothesis.
- There is a relationship between height and leadership qualities.

In [44]:
#

# Problem Statement 23:
    Each respondent in the Current Population Survey of March 1993 was classified as employed, unemployed, or outside the labor force. The results for men in California age 35-
 
![title](img/Stats2_23.png)

In [45]:
# H0: No relation between marital status and employment status
# Observed Values (3x3 contingency table, flattened)
obs = np.array([679, 63, 42, 103, 10, 18, 114, 20, 25]).reshape(3, 3)

# Expected values under independence: row total * column total / grand total
chi2_stat, chi2_pval, dof, exp = stats.chi2_contingency(obs)

round(chi2_stat, 1)

31.6

In [46]:
# Let SL = 0.05; df = (3-1)*(3-1) = 4
stats.chi2.ppf(q=0.95, df=dof)

9.487729036781154

- chi2 for the test (31.6) > critical chi2 (9.488): Reject the Null Hypothesis.
- There is very strong evidence of a relationship between marital status and employment status.