### One sample t-test

The t-test is used when the population standard deviation $\sigma$ is unknown and is estimated by the sample standard deviation *S*.

$t\text{-statistic} = \frac{\bar{X} - \mu}{S / \sqrt{n}}$
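
As a rough check of the formula, the statistic can be computed by hand and compared with SciPy's result. The sample values and the hypothesized mean below are made up purely for illustration; they are not taken from any dataset in this notebook.

```python
import numpy as np
from scipy import stats

# Hypothetical sample values, made up for illustration only
sample = np.array([480.0, 510.0, 455.0, 530.0, 495.0, 470.0, 520.0, 465.0])
mu_0 = 500.0  # hypothesized population mean (assumed for this sketch)

n = sample.size
x_bar = sample.mean()
s = sample.std(ddof=1)  # sample standard deviation S (n - 1 in the denominator)

# t-statistic exactly as in the formula: (X_bar - mu) / (S / sqrt(n))
t_manual = (x_bar - mu_0) / (s / np.sqrt(n))

# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# The same test via SciPy, for comparison
t_scipy, p_scipy = stats.ttest_1samp(sample, mu_0)

print(f"manual: t = {t_manual:.4f}, p = {p_manual:.4f}")
print(f"scipy:  t = {t_scipy:.4f}, p = {p_scipy:.4f}")
```

Note `ddof=1`: NumPy's default `std()` divides by $n$, whereas *S* in the formula is the sample standard deviation, which divides by $n - 1$.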


For example, Aravind Productions (AP) is a newly formed movie production house based in Mumbai, India. AP is interested in understanding the cost of producing a Bollywood movie. The industry believes that a production house will require at least INR 500 million (50 crore) on average. Bollywood movie production costs are assumed to follow a normal distribution. The production costs of 40 Bollywood movies, in millions of rupees, are given in the *bollywoodmovies.csv* file. Conduct an appropriate hypothesis test at $\alpha$ = 0.05 to check whether the belief about the average production cost is correct.


In [None]:
import pandas as pd
# Read the CSV file into a DataFrame
df_bollywood_movies = pd.read_csv('bollywoodmovies.csv')
print(df_bollywood_movies)

$H_{0}$: $\mu =$ 500

$H_{A}$: $\mu \ne$ 500


*scipy.stats.ttest_1samp()* can be used to perform this test. Its two main parameters are:

- a : array_like - sample observations
- popmean : float - expected value under the null hypothesis.

In [None]:
# Perform t test
from scipy import stats
# Assuming 'production_cost' is the column that holds the production costs
t_statistic, p_value = stats.ttest_1samp(df_bollywood_movies['production_cost'], 500)
print(f"T-statistic: {t_statistic}, P-value: {p_value}")

# Make inference for p value if alpha is 0.05
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: The mean production cost is significantly different from 500.")
else:
    print("Fail to reject the null hypothesis: The mean production cost is not significantly different from 500.")
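
The test above is two-sided. Because the problem statement phrases the belief as "at least INR 500 million", a one-sided formulation ($H_{0}$: $\mu \ge$ 500 against $H_{A}$: $\mu <$ 500) is another reasonable reading. The sketch below shows how that could be run with the `alternative` argument of `ttest_1samp` (available in SciPy 1.6 and later). It uses synthetic data generated with assumed parameters, not the contents of *bollywoodmovies.csv*.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for the 40 production costs (NOT the real data);
# the mean and spread used here are arbitrary assumptions
rng = np.random.default_rng(0)
costs = rng.normal(loc=480.0, scale=60.0, size=40)

# Two-sided test, as in the cell above
t_stat, p_two_sided = stats.ttest_1samp(costs, 500)

# One-sided test: the alternative is that the true mean is less than 500
t_stat_less, p_less = stats.ttest_1samp(costs, 500, alternative='less')

print(f"t = {t_stat:.4f}")
print(f"two-sided p = {p_two_sided:.4f}")
print(f"one-sided p = {p_less:.4f}")
```

The t-statistic is the same in both calls; only the p-value changes. When the statistic is negative, the one-sided p-value is half the two-sided one.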

### Two sample t-test

A two-sample t-test is used to test the difference between two population means when the population standard deviations are unknown and are estimated from the samples.
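
To make the mechanics concrete, here is a minimal sketch of the pooled-variance (equal-variance) version of the statistic, which is what `scipy.stats.ttest_ind` computes by default. The two small samples are made up for illustration only.

```python
import numpy as np
from scipy import stats

# Hypothetical measurements for two independent groups (illustrative only)
group_1 = np.array([7.9, 6.4, 9.1, 8.2, 7.0, 8.8])
group_2 = np.array([5.9, 6.7, 5.2, 6.0, 7.1, 5.5, 4.9])

n1, n2 = group_1.size, group_2.size
var1 = group_1.var(ddof=1)  # sample variances (n - 1 in the denominator)
var2 = group_2.var(ddof=1)

# Pooled variance: weighted average of the two sample variances
pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)

# Standard error of the difference between the two sample means
std_err = np.sqrt(pooled_var * (1 / n1 + 1 / n2))

t_manual = (group_1.mean() - group_2.mean()) / std_err
dof = n1 + n2 - 2
p_manual = 2 * stats.t.sf(abs(t_manual), df=dof)  # two-sided p-value

# SciPy's default (equal_var=True) should agree with the manual result
t_scipy, p_scipy = stats.ttest_ind(group_1, group_2)

print(f"manual: t = {t_manual:.4f}, p = {p_manual:.4f}")
print(f"scipy:  t = {t_scipy:.4f}, p = {p_scipy:.4f}")
```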

For example, a company claims that children (in the age group between 7 and 12) who drink its health drink will grow taller than children who do not. The data in Table 6.10 show the average increase in height over a one-year period for two groups: one drinking the health drink and the other not drinking it. At $\alpha$ = 0.05, test whether the increase in height for the children who drink the health drink is different from that of the children who do not.

In [20]:
# Read healthdrink data from excel file
df_healthdrink_yes = pd.read_excel('healthdrink.xlsx', sheet_name='healthdrink_yes')
df_healthdrink_no = pd.read_excel('healthdrink.xlsx', sheet_name='healthdrink_no')
print(df_healthdrink_yes)
print(df_healthdrink_no)

    height_increase
0               8.6
1               5.8
2              10.2
3               8.5
4               6.8
..              ...
74              6.5
75              8.1
76              7.2
77              8.8
78              9.8

[79 rows x 1 columns]
    height_increase
0               5.3
1               9.0
2               5.7
3               5.5
4               5.4
..              ...
75              4.4
76              5.3
77              6.2
78              7.4
79              4.2

[80 rows x 1 columns]


In [23]:
# Perform 2 sample t test on df_healthdrink_yes and df_healthdrink_no
t_statistic, p_value = stats.ttest_ind(df_healthdrink_yes['height_increase'], df_healthdrink_no['height_increase'])
print(f"T-statistic: {t_statistic}, P-value: {p_value}")



T-statistic: 8.131675069083359, P-value: 1.197698592263946e-13
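
Since the reported p-value (about 1.2e-13) is far below $\alpha$ = 0.05, the null hypothesis of equal mean height increase is rejected. The sketch below shows the same decision step in code, together with Welch's variant (`equal_var=False`), which does not assume equal variances in the two groups. It runs on synthetic data with assumed means and spread, not on *healthdrink.xlsx*, so its numbers will differ from the output above.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for the two groups (NOT the healthdrink.xlsx data);
# group sizes match the notebook, means and spread are assumptions
rng = np.random.default_rng(42)
group_yes = rng.normal(loc=8.0, scale=1.5, size=79)
group_no = rng.normal(loc=6.0, scale=1.5, size=80)

# Pooled (equal-variance) test: the default behaviour of ttest_ind
t_pooled, p_pooled = stats.ttest_ind(group_yes, group_no)

# Welch's t-test: drops the equal-variance assumption
t_welch, p_welch = stats.ttest_ind(group_yes, group_no, equal_var=False)

alpha = 0.05
for name, t, p in [("pooled", t_pooled, p_pooled), ("Welch", t_welch, p_welch)]:
    decision = "reject H0" if p < alpha else "fail to reject H0"
    print(f"{name}: t = {t:.4f}, p = {p:.3g} -> {decision}")
```

When the two groups have similar sizes and variances, the pooled and Welch results are usually close; Welch's version is a common choice when there is doubt about equal variances.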
