## Problem Statement 1:
In each of the following situations, state whether it is a correctly stated hypothesis testing problem, and explain why.  
1. H0: μ = 25, H1: μ ≠ 25
2. H0: σ > 10, H1: σ = 10
3. H0: x = 50, H1: x ≠ 50
4. H0: p = 0.1, H1: p = 0.5
5. H0: s = 30, H1: s > 30

### Answer
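1. $H_0$ contains `=`, $H_1$ states the contrary (`≠`) for the same value, and both refer to the population parameter μ. Therefore the hypothesis is stated correctly.
2. $H_0$ contains an inequality symbol (`>`) and $H_1$ contains `=`. The equality must appear in $H_0$, so the hypothesis is incorrectly stated.
3. The hypothesis is stated in terms of a sample statistic (x) rather than a population parameter. Therefore the hypothesis is incorrectly stated.
4. Both $H_0$ and $H_1$ are equality statements and their values do not match, so $H_1$ is not the contrary of $H_0$. Therefore the hypothesis is incorrectly stated.
5. $H_0$ contains `=` and $H_1$ contains an inequality (`>`), but s is a sample statistic rather than a population parameter. Therefore the hypothesis is incorrectly stated.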

## Problem Statement 2:
The college bookstore tells prospective students that the average cost of its textbooks is Rs. 52 with a standard deviation of Rs. 4.50. A group of smart statistics students thinks that the average cost is higher. To test the bookstore’s claim against their alternative, the students will select a random sample of size 100. Assume that the mean from their random sample is Rs. 52.80. Perform a hypothesis test at the 5% level of significance and state your decision.

In [60]:
import numpy as np
#from statsmodels.stats.weightstats import ztest
from scipy.stats import norm, t, ttest_1samp


# Population Mean is 'mu' and standard deviation is 'sd'
mu = 52
sd = 4.5
# Sample mean is 'x', 'n' is sample size.
n = 100
x = 52.8
# Standard error of the sample mean: sd / sqrt(n)
s = sd/(n**.5) # Standard Error

#sample = s * np.random.randn(n) + x

# Null Hypothesis (H0): mu = 52
# Alternate Hypothesis (H1): mu > 52

# Significance Level
alpha = 0.05

# Decision Rule: As alternative claim is greater than the average.
# So its a Right tailed test.
z_alpha = norm.ppf(1 - alpha)#, loc=x, scale=s)
print(f'Z alpha Score: {z_alpha}')

# Test Stats
#z, p = ztest(sample, value=mu, alternative='larger')
z = (x - mu) / s
p = norm.sf(z)
print(f'Z test Score: {z}\nP value: {p}')

# Results
if p < alpha:
    print("Reject Null Hypothesis")
    print("Average cost is higher than the bookstore's claim")
else:
    print("Fail to reject Null Hypothesis")
    print("There is not enough evidence that the average cost is higher than the bookstore's claim")

Z alpha Score: 1.6448536269514722
Z test Score: 1.7777777777777715
P value: 0.03772017981340073
Reject Null Hypothesis
Average cost is higher than the bookstore's claim


## Problem Statement 3:
A certain chemical pollutant in the Genesee River has been constant for several years with mean μ = 34 ppm (parts per million) and standard deviation σ = 8 ppm. A group of factory representatives whose companies discharge liquids into the river is now claiming that they have lowered the average with improved filtration devices. A group of environmentalists will test to see if this is true at the 1% level of significance. Assume that their sample of size 50 gives a mean of 32.5 ppm. Perform a hypothesis test at the 1% level of significance and state your decision.

In [61]:
# Population Mean is 'mu' and standard deviation is 'sd'
mu = 34
sd = 8
# Sample mean is 'x', 'n' is sample size.
n = 50
x = 32.5
# Standard error of the sample mean: sd / sqrt(n)
s = sd/(n**.5) # Standard Error

#sample = s * np.random.randn(n) + x

# Null Hypothesis (H0): mu = 34
# Alternate Hypothesis (H1): mu < 34

# Significance Level
alpha = 0.01

# Decision Rule: The average discharge of chemical pollutant has been  lowered.
# So its a Left tailed test.
z_alpha = norm.ppf(alpha)
print(f'Z alpha Score: {z_alpha}')

# Test Stats
#z, p = ztest(sample, value=mu, alternative='smaller')
z = (x - mu) / s
p = norm.cdf(z) # Left tailed test: P(Z <= z)
print(f'Z test Score: {z:.4f}\nP value: {p:.4f}')

# Results
if p < alpha:
    print("Reject Null Hypothesis")
    print("The average pollutant infusion into the river has been reduced")
else:
    print("Fail to reject Null Hypothesis")
    print("There is not enough evidence that the average pollutant infusion into the river has been reduced")

Z alpha Score: -2.3263478740408408
Z test Score: -1.3258
P value: 0.0924
Fail to reject Null Hypothesis
There is not enough evidence that the average pollutant infusion into the river has been reduced


## Problem Statement 4:
Based on population figures and other general information on the U.S. population, suppose it has been estimated that, on average, a family of four in the U.S. spends about $1135 annually on dental expenditures. Suppose further that a regional dental association wants to test whether this figure is accurate for their area of the country. To test this, 22 families of four are randomly selected from the population in that area of the country and a log is kept of each family’s dental expenditure for one year. The resulting data are given below. Assuming that dental expenditure is normally distributed in the population, use the data and an alpha of 0.5 to test the dental association’s hypothesis.  
1008, 812, 1117, 1323, 1308, 1415, 831, 1021, 1287, 851, 930, 730, 699, 872, 913, 944, 954, 987, 1695, 995, 1003, 994

In [12]:
# Population Mean is 'mu'
# 'n' is sample size.
n = 22
mu = 1135

data = [
    1008, 812, 1117, 1323, 1308, 1415, 831, 1021, 1287, 851,
    930, 730, 699, 872, 913, 944, 954, 987, 1695, 995, 1003, 994
       ]

# Null Hypothesis (H0): mu = 1135
# Alternate Hypothesis (H1): mu != 1135 ## mu > 1135 or mu < 1135

# Significance Level (0.5 as given in the problem statement; 0.05 is the conventional choice.
# Since p is about 0.056, the null hypothesis would not be rejected at alpha = 0.05.)
alpha = 0.5

# Decision Rule: To determine the accuracy of a figure, it is tested for lower as well as higher values.
# So it's a Two tailed test. The population standard deviation is unknown and n is small,
# so the critical values come from the t distribution with n - 1 degrees of freedom.
t_alpha = t.ppf(alpha/2, df=n-1)
print(f'Critical Region: {t_alpha:.3f}, {-t_alpha:.3f}')

# Test Stats
t_stat, p = ttest_1samp(data, mu)
print(f't test Score: {t_stat:.3f}\nP value: {p}')

# Results
if p < alpha:
    print("Reject Null Hypothesis")
    print("The average dental expense as stated is inaccurate")
else:
    print("Fail to reject Null Hypothesis")
    print("The sample is consistent with the stated average dental expense")

Critical Region: -0.686, 0.686
t test Score: -2.023
P value: 0.0559738319464585
Reject Null Hypothesis
The average dental expense as stated is inaccurate


## Problem Statement 5:
In a report prepared by the Economic Research Department of a major bank, the Department manager maintains that the average annual family income in Metropolis is `$ 48,432`.
What do you conclude about the validity of the report if a random sample of 400 families shows an average income of `$ 48,574` with a standard deviation of `$ 2,000`?

In [None]:
# Population Mean is 'mu'
# 'n' is sample size, 'x' is sample mean, 's' is sample standard deviation
mu = 48432
n = 400
x = 48574
s = 2000

# Standard error of the sample mean
se = s / (n**.5)

# Null Hypothesis (H0): mu = 48432
# Alternate Hypothesis (H1): mu != 48432 ## mu > 48432 or mu < 48432

# Significance Level (not given in the problem; 5% is assumed here)
alpha = 0.05

# Decision Rule: To determine the accuracy of a figure, it is tested for lower as well as higher values.
# So it's a Two tailed test. With n = 400 the normal approximation is used.
z_alpha = norm.ppf(alpha/2)
print(f'Critical Region: {z_alpha}, {-z_alpha}')

# Test Stats (computed from the summary statistics, since no raw data is given)
z = (x - mu) / se
p = 2 * norm.sf(abs(z))
print(f'Z test Score: {z}\nP value: {p}')

# Results
if p < alpha:
    print("Reject Null Hypothesis")
    print("The average annual family income stated in the report is inaccurate")
else:
    print("Fail to reject Null Hypothesis")
    print("The sample is consistent with the average annual family income stated in the report")

## Problem Statement 6:
Suppose that in past years the average price per square foot for warehouses in the United States has been `$32.28`. A national real estate investor wants to determine whether that figure has changed now. The investor hires a researcher who randomly samples 19 warehouses that are for sale across the United States and finds that the mean price per square foot is `$31.67`, with a standard deviation of `$1.29`. Assume that the prices of warehouse footage are normally distributed in the population. If the researcher uses a 5\% level of significance, what statistical conclusion can be reached? What are the hypotheses?
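
One way to set this up is sketched below. It assumes a two-tailed one-sample t-test ($H_0$: μ = 32.28 against $H_1$: μ ≠ 32.28), because the investor asks whether the figure "has changed". The variable names are illustrative.

```python
import math
from scipy import stats

mu0 = 32.28      # historical mean price per square foot (value under H0)
x_bar = 31.67    # sample mean
s = 1.29         # sample standard deviation
n = 19           # sample size
alpha = 0.05     # 5% level of significance from the problem statement

df = n - 1
se = s / math.sqrt(n)                      # standard error of the mean
t_stat = (x_bar - mu0) / se                # one-sample t statistic
t_crit = stats.t.ppf(1 - alpha / 2, df)    # two-tailed critical value
p_value = 2 * stats.t.sf(abs(t_stat), df)  # two-tailed p-value

print(f"t = {t_stat:.3f}, critical value = ±{t_crit:.3f}, p = {p_value:.4f}")
reject = abs(t_stat) > t_crit
print("Reject H0" if reject else "Fail to reject H0")
```

Under these assumptions the statistic (about -2.06) falls just inside the critical value (about ±2.10), so the sample does not show a change at the 5% level.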

## Problem Statement 7:
Fill in the blank spaces in the table and draw your conclusions from it. 

|Acceptance Region of distribution|Sample Size| $\alpha$ | $\beta$ at $\mu$ = 52 | $\beta$ at $\mu$ = 50.5 |
|:----------------|:---------:|:--------:|:---------------------:|----------------------:|
|$48.5 < \bar{x} < 51.5$|10| | | |
|$48 < \bar{x} < 52$|10| | | |
|$48.81 < \bar{x} < 51.9$|10| | | |
|$48.42 < \bar{x} < 51.58$|10| | | |

## Problem Statement 8:
Find the t-score for a sample size of 16 taken from a population with mean 10 when the sample mean is 12 and the sample standard deviation is 1.5.
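
The t-score here follows directly from $t = (\bar{x} - \mu) / (s / \sqrt{n})$. A minimal sketch, with illustrative variable names:

```python
import math

mu = 10      # population mean
x_bar = 12   # sample mean
s = 1.5      # sample standard deviation
n = 16       # sample size

# t = (sample mean - population mean) / standard error
t_score = (x_bar - mu) / (s / math.sqrt(n))
print(f"t = {t_score:.4f} with {n - 1} degrees of freedom")
```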

## Problem Statement 9:
Find the t-score below which we can expect 99% of sample means will fall if samples of size 16 are taken from a normally distributed population.
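
A short sketch of one way to look this value up: the 99th percentile of the t distribution with n - 1 = 15 degrees of freedom, using `scipy.stats.t.ppf`.

```python
from scipy import stats

n = 16
df = n - 1

# t value with 99% of the distribution below it
t_99 = stats.t.ppf(0.99, df)
print(f"t(0.99, df={df}) = {t_99:.3f}")
```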

## Problem Statement 10:
If a random sample of size 25 drawn from a normal population gives a mean of 60 and a standard deviation of 4, find the range of t-scores where we can expect to find the middle 95% of all sample means. Compute the probability that $-t_{0.05} < t < t_{0.10}$.
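
Both parts can be sketched with the t distribution on n - 1 = 24 degrees of freedom. The sketch assumes the usual convention that $t_{\alpha}$ denotes the point with upper-tail area α, so $-t_{0.05}$ has 5% of the distribution below it and $t_{0.10}$ has 10% above it.

```python
from scipy import stats

n = 25
df = n - 1

# Middle 95%: 2.5% in each tail
t_hi = stats.t.ppf(0.975, df)
t_lo = -t_hi
print(f"Middle 95% of t-scores: {t_lo:.4f} to {t_hi:.4f}")

# P(-t_0.05 < t < t_0.10), assuming t_alpha marks an upper-tail area of alpha
lower = stats.t.ppf(0.05, df)   # -t_0.05
upper = stats.t.ppf(0.90, df)   # t_0.10
prob = stats.t.cdf(upper, df) - stats.t.cdf(lower, df)
print(f"P(-t_0.05 < t < t_0.10) = {prob:.2f}")
```

The probability is simply 1 - 0.05 - 0.10, whatever the degrees of freedom.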

## Problem Statement 11:
Two-tailed test for difference between two population means  
Is there evidence to conclude that the number of people travelling from Bangalore to Chennai is different from the number of people travelling from Bangalore to Hosur in a week, given the following:
> Population 1: Bangalore to Chennai 
>> $n_1$ = 1200  
>> $x_1$ = 452  
>> $s_1$ = 212

> Population 2: Bangalore to Hosur 
>> $n_2$ = 800  
>> $x_2$ = 523  
>> $s_2$ = 185
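
With samples this large, a two-sample z-test on the difference of means is one reasonable approach. The sketch below assumes a 5% significance level (none is given in the problem) and uses the standard library's `statistics.NormalDist` in place of `scipy.stats.norm`.

```python
import math
from statistics import NormalDist

n1, x1, s1 = 1200, 452, 212   # Bangalore to Chennai
n2, x2, s2 = 800, 523, 185    # Bangalore to Hosur
alpha = 0.05                  # assumed; not stated in the problem

# H0: mu1 - mu2 = 0, H1: mu1 - mu2 != 0
se = math.sqrt(s1**2 / n1 + s2**2 / n2)   # standard error of the difference
z = (x1 - x2) / se
p_value = 2 * NormalDist().cdf(-abs(z))   # two-tailed p-value

print(f"z = {z:.3f}, p = {p_value:.3g}")
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```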

## Problem Statement 12:
Is there evidence to conclude that the number of people preferring Duracell battery is different from the number of people preferring Energizer battery, given the following:
> Population 1: Duracell
>> $n_1$ = 100  
>> $x_1$ = 308  
>> $s_1$ = 84

> Population 2: Energizer
>> $n_2$ = 100  
>> $x_2$ = 254  
>> $s_2$ = 67

## Problem Statement 13:
Pooled estimate of the population variance  
Does the data provide sufficient evidence to conclude that average percentage increase in the price of sugar differs when it is sold at two different prices?
> Population 1: Price of sugar = ₹27.50
>> $n_1$ = 14  
>> $x_1$ = 0.317%  
>> $s_1$ = 0.12%  

> Population 2: Price of sugar = ₹20.00
>> $n_2$ = 9  
>> $x_2$ = 0.21%  
>> $s_2$ = 0.11%  
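
The pooled-variance t-test named in the heading can be sketched as follows. It assumes equal population variances and a 5% significance level (not given in the problem). The second group's mean and standard deviation are taken to be 0.21% and 0.11%.

```python
import math
from scipy import stats

n1, x1, s1 = 14, 0.317, 0.12   # price = 27.50
n2, x2, s2 = 9, 0.21, 0.11     # price = 20.00
alpha = 0.05                   # assumed; not stated in the problem

df = n1 + n2 - 2
# Pooled estimate of the common population variance
sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / df
se = math.sqrt(sp2 * (1 / n1 + 1 / n2))

# H0: mu1 - mu2 = 0, H1: mu1 - mu2 != 0
t_stat = (x1 - x2) / se
p_value = 2 * stats.t.sf(abs(t_stat), df)

print(f"pooled variance = {sp2:.5f}, t = {t_stat:.3f}, df = {df}, p = {p_value:.4f}")
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```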

## Problem Statement 14:
The manufacturers of compact disk players want to test whether a small price reduction is enough to increase sales of their product. Is there evidence that the small price reduction is enough to increase sales of compact disk players?
> Population 1: Before reduction 
>> $n_1$ = 15  
>> $x_1$ = ₹6598  
>> $s_1$ = ₹844

> Population 2: After reduction 
>> $n_2$ = 12  
>> $x_2$ = ₹6870  
>> $s_2$ = ₹669
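
A sketch of a one-tailed pooled t-test for this problem, using `scipy.stats.ttest_ind_from_stats` for the statistic. It assumes equal variances, a 5% significance level (not given), and the alternative $H_1$: mean after reduction > mean before reduction. The one-sided p-value is computed by hand from the t distribution.

```python
from scipy import stats

n_before, x_before, s_before = 15, 6598, 844
n_after, x_after, s_after = 12, 6870, 669
alpha = 0.05   # assumed; not stated in the problem

# Pooled-variance (equal_var=True) two-sample t statistic from summary data
res = stats.ttest_ind_from_stats(
    mean1=x_after, std1=s_after, nobs1=n_after,
    mean2=x_before, std2=s_before, nobs2=n_before,
    equal_var=True,
)
t_stat = res.statistic
df = n_before + n_after - 2

# Right-tailed p-value for H1: mu_after > mu_before
p_one_sided = stats.t.sf(t_stat, df)

print(f"t = {t_stat:.3f}, df = {df}, one-sided p = {p_one_sided:.3f}")
print("Reject H0" if p_one_sided < alpha else "Fail to reject H0")
```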

## Problem Statement 15:
Comparisons of two population proportions when the hypothesized difference is zero  
Carry out a two-tailed test of the equality of banks’ share of the car loan market in 1980 and 1995.
> Population 1: 1980
>> $n_1$ = 100  
>> $x_1$ = 53  
>> $\hat{p}_1$ = 0.53

> Population 2: 1995
>> $n_2$ = 100  
>> $x_2$ = 43  
>> $\hat{p}_2$ = 0.43
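
When the hypothesized difference is zero, the two sample proportions are pooled to estimate the standard error. The sketch below assumes $n_1 = 100$ (so that $x_1 / n_1$ matches $\hat{p}_1 = 0.53$), takes $\hat{p}_2 = x_2 / n_2 = 0.43$, and assumes a 5% significance level. It uses the standard library's `statistics.NormalDist` for the normal distribution.

```python
import math
from statistics import NormalDist

n1, x1 = 100, 53   # 1980 (n1 = 100 is an assumption consistent with p1 = 0.53)
n2, x2 = 100, 43   # later year
alpha = 0.05       # assumed; not stated in the problem

p1, p2 = x1 / n1, x2 / n2
p_pool = (x1 + x2) / (n1 + n2)   # pooled proportion under H0: p1 = p2
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))

z = (p1 - p2) / se
p_value = 2 * NormalDist().cdf(-abs(z))   # two-tailed p-value

print(f"pooled p = {p_pool:.2f}, z = {z:.3f}, p = {p_value:.3f}")
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```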

## Problem Statement 16:
Carry out a one-tailed test to determine whether the population proportion of traveler’s check buyers who buy at least \$2500 in checks when sweepstakes prizes are offered is at least 10\% higher than the proportion of such buyers when no sweepstakes are on.
> Population 1: With sweepstakes
>> $n_1$ = 300  
>> $x_1$ = 120  
>> $\hat{p}_1$ = 0.4

> Population 2: No sweepstakes
>> $n_2$ = 700  
>> $x_2$ = 140  
>> $\hat{p}_2$= 0.2
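
Because the hypothesized difference is not zero, the proportions are not pooled. A sketch assuming $H_0$: $p_1 - p_2 \le 0.10$ against $H_1$: $p_1 - p_2 > 0.10$, with `statistics.NormalDist` from the standard library for the normal tail area:

```python
import math
from statistics import NormalDist

n1, x1 = 300, 120   # with sweepstakes
n2, x2 = 700, 140   # no sweepstakes
d0 = 0.10           # hypothesized difference under H0

p1, p2 = x1 / n1, x2 / n2
# Unpooled standard error, since H0 does not claim p1 = p2
se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

z = (p1 - p2 - d0) / se
p_value = 1 - NormalDist().cdf(z)   # right-tailed p-value

print(f"z = {z:.3f}, p = {p_value:.4f}")
```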

## Problem Statement 17:
A die is thrown 132 times with the following results:

|Number Turned up|Frequency|
|:--------------:|:-------:|
|1|16|
|2|20|
|3|25|
|4|14|
|5|29|
|6|28|

Is the die unbiased? Take the degrees of freedom as $k - 1$, where $k = 6$ is the number of categories.
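
A chi-square goodness-of-fit test against equal expected frequencies (132 / 6 = 22 per face) is one way to check this. The sketch assumes a 5% significance level, which the problem does not state.

```python
from scipy import stats

observed = [16, 20, 25, 14, 29, 28]   # frequencies for faces 1..6
alpha = 0.05                           # assumed; not stated in the problem

# With no expected frequencies given, chisquare assumes all categories are equally likely
chi2_stat, p_value = stats.chisquare(observed)
df = len(observed) - 1
chi2_crit = stats.chi2.ppf(1 - alpha, df)

print(f"chi2 = {chi2_stat:.2f}, df = {df}, critical = {chi2_crit:.2f}, p = {p_value:.3f}")
print("Reject H0 (die is biased)" if chi2_stat > chi2_crit else "Fail to reject H0")
```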

## Problem Statement 18:
In a certain town, there are about one million eligible voters. A simple random sample of 10,000 eligible voters was chosen to study the relationship between gender and participation in the last election. The results are summarized in the following 2X2 (read two by two) contingency table:

||Men|Women|
|:---|:---:|:---:|
|Voted|2792|3591|
|Not Voted|1486|2131|

We would want to check whether being a man or a woman (columns) is independent of having voted in the last election (rows). In other words, is “gender and voting independent”?
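
Independence in a contingency table can be checked with `scipy.stats.chi2_contingency`. The sketch below turns off the Yates continuity correction (SciPy applies it by default to 2×2 tables) so that the result is the plain Pearson statistic, and assumes a 5% significance level.

```python
from scipy import stats

#          Men   Women
table = [[2792, 3591],    # Voted
         [1486, 2131]]    # Not Voted
alpha = 0.05              # assumed; not stated in the problem

chi2_stat, p_value, dof, expected = stats.chi2_contingency(table, correction=False)

print(f"chi2 = {chi2_stat:.2f}, df = {dof}, p = {p_value:.4f}")
print("Expected counts under independence:")
print(expected.round(1))
print("Reject independence" if p_value < alpha else "Fail to reject independence")
```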

## Problem Statement 19:
A sample of 100 voters are asked which of four candidates they would vote for in an election. The number supporting each candidate is given below:

|Higgins|Reardon|White|Charlton|
|:---:|:---:|:---:|:---:|
|41|19|24|16|

Do the data suggest that all candidates are equally popular? \[$\chi^2 = 14.96$, with 3 df, $p < 0.05$ \].

## Problem Statement 20:
Children of three ages are asked to indicate their preference for three photographs of adults. Do the data suggest that there is a significant relationship between age and photograph preference? What is wrong with this study? \[$\chi^2 = 29.6$, with 4 df, $p < 0.05$ \].

| Age of Child | Photograph A | Photograph B | Photograph C |
|:--------------|:----:|:----:|:----:|
| 5-6 years    | 18 | 22 | 20 |
| 7-8 years    | 2  | 28 | 40 |
| 9-10 years   | 20 | 10 | 40 |

## Problem Statement 21:

A study of conformity using the Asch paradigm involved two conditions: one where one confederate supported the true judgement and another where no confederate gave the correct response.

|             | Support | No Support |
|:------------|:-------:|:----------:|
| Conform     | 18      | 40         |
| Not Conform | 32      | 10         |

Is there a significant difference between the "support" and "no support" conditions in the frequency with which individuals are likely to conform?
\[$\chi^2 = 19.87$, with 1 df, $p < 0.05$ \]
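
The quoted statistic can be checked with `scipy.stats.chi2_contingency`. The sketch computes both versions, because SciPy applies the Yates continuity correction to 2×2 tables by default. The quoted 19.87 corresponds to the uncorrected Pearson statistic.

```python
from scipy import stats

#          Support  No Support
table = [[18, 40],   # Conform
         [32, 10]]   # Not Conform

# Plain Pearson chi-square (no continuity correction)
chi2_plain, p_plain, dof, expected = stats.chi2_contingency(table, correction=False)
# SciPy's default for 2x2 tables applies the Yates correction
chi2_yates, p_yates, _, _ = stats.chi2_contingency(table)

print(f"Pearson: chi2 = {chi2_plain:.2f}, df = {dof}, p = {p_plain:.2g}")
print(f"Yates:   chi2 = {chi2_yates:.2f}, p = {p_yates:.2g}")
```

Either way the p-value is far below 0.05, so the conclusion does not depend on the correction here.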

## Problem Statement 22:

We want to test whether short people differ with respect to their leadership qualities (Genghis Khan, Adolf Hitler and Napoleon were all stature-deprived, and how many midget MP's are there?) The following table shows the frequencies with which 43 short people and 52 tall people were categorized as "leaders", "followers" or as "unclassifiable". Is there a relationship between height and leadership qualities?
\[$\chi^2 = 10.71$, with 2 df, $p < 0.01$ \]

| Height       | Short | Tall |
|:-------------|:-----:|:----:|
| Leader       | 12    | 32   |
| Follower     | 22    | 14   |
| Unclassified | 9     | 6    |

## Problem Statement 23:

Each respondent in the Current Population Survey of March 1993 was classified as employed, unemployed, or outside the labor force. The results for men in California age 35-44 can be cross-tabulated by marital status, as follows:

|                    | Married | Widowed,<br>Divorced or<br>Separated | Never Married |
|--------------------|:-------:|:------------------------------------:|:-------------:|
| Employed           |   679   |                  103                 |      114      |
| Unemployed         |    63   |                  10                  |       20      |
| Not in Labor Force |    42   |                  18                  |       25      |

Men of different marital status seem to have different distributions of labor force status. Or is this just chance variation? (you may assume the table results from a simple random sample.)
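
A chi-square test of independence is one way to separate a real association from chance variation here. A minimal sketch with `scipy.stats.chi2_contingency`, assuming the table comes from a simple random sample as the problem allows:

```python
from scipy import stats

#          Married  Wid/Div/Sep  Never Married
table = [[679, 103, 114],   # Employed
         [63, 10, 20],      # Unemployed
         [42, 18, 25]]      # Not in Labor Force

chi2_stat, p_value, dof, expected = stats.chi2_contingency(table)

print(f"chi2 = {chi2_stat:.1f}, df = {dof}, p = {p_value:.2g}")
print("Expected counts under independence:")
print(expected.round(1))
```

The degrees of freedom are (3 - 1) × (3 - 1) = 4. A p-value this small would indicate that the differences between marital-status groups are unlikely to be chance variation alone.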