### Importing the necessary libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as stats

In [None]:
# to suppress warnings
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Mount Google Colab drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Q1. Samy, Product Manager of K2 Jeans, wants to launch a product line in a new market area. A survey of a random sample of 400 households in that market showed a mean income of 30000 dollars per household. The standard deviation based on an earlier pilot study is 8000 dollars. Samy strongly believes that the product line will be adequately profitable only in markets where the mean household income is greater than 29000 dollars. Samy wants your help in deciding whether the product line should be introduced in the new market. Perform statistical analysis with a significance level of 0.05 and conclude.

**Null hypothesis**: H0:  mu <= 29000   
**Alternate hypothesis**: H1: mu > 29000

Perform a one-sample Z-test (univariate, continuous and normal, 1 sample, known variance).

This is a one-tailed test to determine if the mean income is significantly greater than 29000 dollars.
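
As a worked check, the test statistic can be computed by hand from the figures given in the question. The symbols $\bar{x}$, $\mu_0$, $\sigma$, and $n$ are notation introduced here for the sample mean, the hypothesized mean, the known standard deviation, and the sample size:

$$
z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}} = \frac{30000 - 29000}{8000 / \sqrt{400}} = \frac{1000}{400} = 2.5
$$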

In [None]:
# Define our variables
mu, Sigma = 29000, 8000
xbar, n = 30000, 400
alpha = 0.05

# Calculate the test-statistic
test_stat = (xbar - mu) / (Sigma / np.sqrt(n))

# Calculate the p-value
p_value = 1 - stats.norm.cdf(test_stat)

# Print the results
print("Test statistic:", test_stat)
print("p-value:", p_value)


Test statistic: 2.5
p-value: 0.006209665325776159


## **Insight**

The p-value of 0.006 is less than 0.05, so we reject the null hypothesis.

## **Conclusion**

There is enough statistical evidence to conclude that the mean household income is greater than 29000 dollars at the 5% significance level. The market therefore meets Samy's profitability criterion, and the product line can be introduced.
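
The same decision can also be reached with a critical-value comparison instead of a p-value. The sketch below is an illustrative alternative, not part of the original analysis; the names `mu_0`, `z_crit`, and `reject_h0` are introduced here for the example.

```python
import numpy as np
from scipy import stats

# Figures from the question (mu_0 is the hypothesized mean)
mu_0, sigma = 29000, 8000
xbar, n = 30000, 400
alpha = 0.05

# Test statistic for the one-sample Z-test
z_stat = (xbar - mu_0) / (sigma / np.sqrt(n))

# Right-tailed critical value: reject H0 when z_stat exceeds it
z_crit = stats.norm.ppf(1 - alpha)
reject_h0 = z_stat > z_crit

print("Test statistic:", z_stat)
print("Critical value:", round(z_crit, 4))
print("Reject H0:", reject_h0)
```

For a right-tailed test at the 5% level, the critical value is roughly 1.645, so a test statistic of 2.5 falls in the rejection region, which agrees with the p-value approach.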

---




# One-sample t-test

### Q2. The average mass of all acorns is 10 g. The masses of 20 acorns collected from a forest, subjected to acid rain from a coal power plant, are m = 8.8, 6.6, 9.5, 11.2, 10.2, 7.4, 8.0, 9.6, 9.9, 9.0, 7.6, 7.4, 10.4, 11.1, 8.5, 10.0, 11.6, 10.7, 10.3, and 7.0 g. Is there enough statistical evidence to conclude that the average mass of this sample is different from the average mass of acorns with a significance level of 0.05?

**a) Formulate the null hypothesis and alternate hypothesis**

**Null hypothesis**: H0:  mu = 10   
**Alternate hypothesis**: H1: mu != 10

Perform a one-sample t-test (univariate, continuous and normal, 1 sample, unknown variance).

This is a two-tailed test because we are testing if the sample mean is different (either greater or smaller) from the population mean.

**b) Calculate the test statistics and based on the p-value provide a conclusion.**

In [None]:
# import the required functions
from scipy.stats import ttest_1samp

# Define our variables
x = [8.8, 6.6, 9.5, 11.2, 10.2, 7.4, 8.0, 9.6, 9.9, 9.0, 7.6, 7.4, 10.4, 11.1, 8.5, 10.0, 11.6, 10.7, 10.3, 7.0]
x = np.array(x)
mu = 10
alpha = 0.05

# calculate the test statistic and p-value
test_stat, p_value = ttest_1samp(x, popmean = 10, alternative = 'two-sided')

# print the results
print('The test statistic is ', test_stat)
print('The p-value is ', p_value)


The test statistic is  -2.2491611580763973
The p-value is  0.03655562279112415


---

## **Insight**

The p-value of 0.037 is less than 0.05, so we reject the null hypothesis.

## **Conclusion**

There is enough statistical evidence, at the 5% significance level, to conclude that the mean mass of acorns from this forest differs from the 10 g average mass of all acorns.
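
To show where the test statistic comes from, the sketch below recomputes it by hand as (sample mean - hypothesized mean) / (sample standard deviation / sqrt(n)) and compares it with the `ttest_1samp` result. The names `mu_0`, `t_manual`, and `p_manual` are introduced here for illustration.

```python
import numpy as np
from scipy import stats

x = np.array([8.8, 6.6, 9.5, 11.2, 10.2, 7.4, 8.0, 9.6, 9.9, 9.0,
              7.6, 7.4, 10.4, 11.1, 8.5, 10.0, 11.6, 10.7, 10.3, 7.0])
mu_0 = 10  # hypothesized population mean

n = x.size
xbar = x.mean()
s = x.std(ddof=1)  # sample standard deviation (n - 1 in the denominator)

# t statistic and two-tailed p-value with n - 1 degrees of freedom
t_manual = (xbar - mu_0) / (s / np.sqrt(n))
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# Reference values from SciPy
t_scipy, p_scipy = stats.ttest_1samp(x, popmean=mu_0)

print("manual:", t_manual, p_manual)
print("scipy: ", t_scipy, p_scipy)
```

The two-tailed p-value doubles the upper-tail probability of |t| because deviations in either direction count as evidence against the null hypothesis.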

---

# Independent (unpaired) two-sample t-test

### Q3. The mass of N<sub>1</sub>=20 acorns from oak trees upwind from a coal power plant and N<sub>2</sub>=30 acorns from oak trees downwind from the same coal power plant is measured. Is the mass of acorns from trees downwind different from the ones from upwind?

**Note:**
- The sample sizes are not equal, but we will assume that the two populations have equal variances, as the pooled two-sample t-test requires.
- Since the significance level is not provided, we assume it to be 0.05.

#### sample upwind:
x1 = [10.8, 10.0, 8.2, 9.9, 11.6, 10.1, 11.3, 10.3, 10.7, 9.7,
      7.8, 9.6, 9.7, 11.6, 10.3, 9.8, 12.3, 11.0, 10.4, 10.4]

#### sample downwind:
x2 = [7.8, 7.5, 9.5, 11.7, 8.1, 8.8, 8.8, 7.7, 9.7, 7.0,
      9.0, 9.7, 11.3, 8.7, 8.8, 10.9, 10.3, 9.6, 8.4, 6.6,
      7.2, 7.6, 11.5, 6.6, 8.6, 10.5, 8.4, 8.5, 10.2, 9.2]

**a) Formulate null hypothesis and alternate hypothesis.**

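**Null hypothesis**: H0:  mu1 = mu2   
**Alternate hypothesis**: H1: mu1 != mu2

Here mu1 and mu2 are the population mean masses of acorns from upwind and downwind trees, respectively; the hypotheses concern these population means, not the samples x1 and x2 themselves.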

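Perform a two-sample t-test (univariate, continuous and normal, 2 independent samples, unknown but equal variances). This is a two-tailed test because we are testing whether the means differ in either direction.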

**b) Calculate the test statistic and based on the p-value provide a conclusion.**

In [None]:
# Define our variables
x1 = np.array([10.8, 10.0, 8.2, 9.9, 11.6, 10.1, 11.3, 10.3, 10.7, 9.7, 7.8, 9.6, 9.7, 11.6, 10.3, 9.8, 12.3, 11.0, 10.4, 10.4])
x2 = np.array([7.8, 7.5, 9.5, 11.7, 8.1, 8.8, 8.8, 7.7, 9.7, 7.0, 9.0, 9.7, 11.3, 8.7, 8.8, 10.9, 10.3, 9.6, 8.4, 6.6, 7.2, 7.6, 11.5, 6.6, 8.6, 10.5, 8.4, 8.5, 10.2, 9.2])

# find the means and standard deviations of the two samples
# (note: ndarray.std() defaults to ddof=0; pass ddof=1 for the sample standard deviation)
print('The mean upwind is ' + str(round(x1.mean(), 2)))
print('The mean downwind is ' + str(round(x2.mean(), 2)))

print('The standard deviation upwind is ' + str(round(x1.std(), 2)))
print('The standard deviation downwind is ' + str(round(x2.std(), 2)))

# calculate the test statistic and p-value
t, p_value = stats.ttest_ind(x2, x1)
print("tstats = ",t, ", p_value = ", p_value)

The mean upwind is 10.28
The mean downwind is 8.94
The standard deviation upwind is 1.04
The standard deviation downwind is 1.39
tstats =  -3.5981947686898033 , p_value =  0.0007560337478801464


---

## **Insight**

The p-value of 0.0008 is less than 0.05, so we reject the null hypothesis.

## **Conclusion**

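There is enough statistical evidence, at the 5% significance level, to conclude that the mean mass of acorns from downwind trees differs from that of acorns from upwind trees. In these samples, the downwind mean (8.94) is lower than the upwind mean (10.28).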
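
The analysis above assumes equal population variances. One way to probe that assumption, sketched below, is Levene's test, with Welch's t-test (`equal_var=False`) as a fallback that does not require equal variances. Neither step is part of the original solution; they are shown only as an optional robustness check.

```python
import numpy as np
from scipy import stats

x1 = np.array([10.8, 10.0, 8.2, 9.9, 11.6, 10.1, 11.3, 10.3, 10.7, 9.7,
               7.8, 9.6, 9.7, 11.6, 10.3, 9.8, 12.3, 11.0, 10.4, 10.4])
x2 = np.array([7.8, 7.5, 9.5, 11.7, 8.1, 8.8, 8.8, 7.7, 9.7, 7.0,
               9.0, 9.7, 11.3, 8.7, 8.8, 10.9, 10.3, 9.6, 8.4, 6.6,
               7.2, 7.6, 11.5, 6.6, 8.6, 10.5, 8.4, 8.5, 10.2, 9.2])

# Levene's test: H0 is that both populations have equal variances
levene_stat, levene_p = stats.levene(x1, x2)
print("Levene p-value:", levene_p)

# Welch's t-test: does not assume equal variances
welch_t, welch_p = stats.ttest_ind(x2, x1, equal_var=False)
print("Welch t =", welch_t, ", p_value =", welch_p)
```

If the Levene p-value is above the chosen significance level, the equal-variance assumption is not contradicted by the data; if the pooled and Welch results agree, the conclusion does not hinge on that assumption.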

---



# Paired samples t-test

### Q4. The average mass of acorns from the same N=30 trees downwind of a power plant is measured before (x<sub>1</sub>) and after (x<sub>2</sub>) the power plant converts burning coal to burning natural gas. Does the mass of the acorns change after the conversion from coal to natural gas?

**Note**: Since the significance level is not provided, we assume it to be 0.05.

### sample before conversion to natural gas
x1 = np.array([10.8, 6.4, 8.3, 7.6, 11.4, 9.9, 10.6, 8.7, 8.1, 10.9,
      11.0, 11.8, 7.3, 9.6, 9.3, 9.9, 9.0, 9.5, 10.6, 10.3,
      8.8, 12.3, 8.9, 10.5, 11.6, 7.6, 8.9, 10.4, 10.2, 8.8])

### sample after conversion to natural gas
x2 = np.array([10.1, 6.9, 8.6, 8.8, 12.1, 11.3, 12.4, 9.3, 9.3, 10.8,
      12.4, 11.5, 7.4, 10.0, 11.1, 10.6, 9.4, 9.5, 10.0, 10.0,
      9.7, 13.5, 9.6, 11.6, 11.7, 7.9, 8.6, 10.8, 9.5, 9.6])

**a) Formulate null hypothesis and alternate hypothesis.**

**Null hypothesis**: H0:  mu1 = mu2   
**Alternate hypothesis**: H1: mu1 != mu2

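Perform a paired-samples t-test (univariate, continuous and normal, 2 dependent samples, unknown variance). The samples are paired because the same 30 trees are measured before and after the conversion. This is a two-tailed test because we are testing for a change in either direction.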

**b) Calculate the test statistic and based on the p-value provide a conclusion.**

In [None]:
# import the required functions
from scipy.stats import ttest_rel

# define variables
x1 = np.array([10.8, 6.4, 8.3, 7.6, 11.4, 9.9, 10.6, 8.7, 8.1, 10.9, 11.0, 11.8, 7.3, 9.6, 9.3, 9.9, 9.0, 9.5, 10.6, 10.3, 8.8, 12.3, 8.9, 10.5, 11.6, 7.6, 8.9, 10.4, 10.2, 8.8])
x2 = np.array([10.1, 6.9, 8.6, 8.8, 12.1, 11.3, 12.4, 9.3, 9.3, 10.8, 12.4, 11.5, 7.4, 10.0, 11.1, 10.6, 9.4, 9.5, 10.0, 10.0, 9.7, 13.5, 9.6, 11.6, 11.7, 7.9, 8.6, 10.8, 9.5, 9.6])

# find the p-value
test_stat, p_value = ttest_rel(x1, x2)

# print the results
print('The p-value is ', p_value)

The p-value is  0.0005168689824684378



---

## **Insight**

The p-value of 0.0005 is less than 0.05, so we reject the null hypothesis.

## **Conclusion**

There is enough statistical evidence, at the 5% significance level, to conclude that the mean mass of the acorns changed after the conversion from coal to natural gas.
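
A paired t-test is equivalent to a one-sample t-test on the per-tree differences against a mean of 0. The sketch below illustrates this; the name `d` and the choice to compute after minus before are assumptions made here for readability. Reversing the order flips the sign of the statistic but leaves the two-tailed p-value unchanged.

```python
import numpy as np
from scipy import stats

x1 = np.array([10.8, 6.4, 8.3, 7.6, 11.4, 9.9, 10.6, 8.7, 8.1, 10.9,
               11.0, 11.8, 7.3, 9.6, 9.3, 9.9, 9.0, 9.5, 10.6, 10.3,
               8.8, 12.3, 8.9, 10.5, 11.6, 7.6, 8.9, 10.4, 10.2, 8.8])
x2 = np.array([10.1, 6.9, 8.6, 8.8, 12.1, 11.3, 12.4, 9.3, 9.3, 10.8,
               12.4, 11.5, 7.4, 10.0, 11.1, 10.6, 9.4, 9.5, 10.0, 10.0,
               9.7, 13.5, 9.6, 11.6, 11.7, 7.9, 8.6, 10.8, 9.5, 9.6])

# Per-tree change in mass: after minus before
d = x2 - x1
print("Mean difference:", round(d.mean(), 2))

# One-sample t-test on the differences against 0
t_diff, p_diff = stats.ttest_1samp(d, popmean=0)

# Paired t-test with the same ordering (after, before)
t_rel, p_rel = stats.ttest_rel(x2, x1)

print("one-sample on differences:", t_diff, p_diff)
print("paired t-test:            ", t_rel, p_rel)
```

Looking at the differences also gives the direction of the change, which the p-value alone does not convey.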

---