## <font color='blue'>Two-sample Z-Test</font>

In today's lecture, we will learn about the **Two-sample Z-Test**, which is used to compare the means of two groups.

Let's get started.

### Problem No 1
##### <font color='purple'>Imagine you're working for a renowned institution like ICMR, WHO, or FDA, and your task is to determine whether two different medicines, **M1** and **M2**, are equally effective in recovery time.</font>

In this journey of discovery, we will explore two fundamental ways to answer this question:
- Using a 95% confidence interval and
- Conducting a hypothesis test.

> <font color='purple'>We want to test whether the population mean recovery times for medicines **M1** and **M2** are equal.</font>

Let's set up our hypotheses:
- Null Hypothesis ($H_0$): $μ_1 = μ_2$
- Alternative Hypothesis ($H_a$): $μ_1 ≠ μ_2$

  - $μ_1$: Represents the population mean for group 1.
  - $μ_2$: Represents the population mean for group 2.

For medicine M1, we have data from 100 patients,
- Denoted as: $x_{1,1}$, $x_{1,2}$, $x_{1,3}$, and so on, up to $x_{1,100}$.

Similarly, for medicine M2, we have data from 90 patients,
- Denoted as: $x_{2,1}$, $x_{2,2}$, $x_{2,3}$, and so forth, up to $x_{2,90}$.


We compute the sample mean of each group, represented as $\bar{x}_1$ (for M1) and $\bar{x}_2$ (for M2).

With n1 = 100 and n2 = 90, we can compute our test statistic, Z, using the following formula:

$Z = \frac{(\bar{x}_1 - \bar{x}_2) - (μ_1 - μ_2)}{\sqrt{\frac{σ_1^2}{n_1} + \frac{σ_2^2}{n_2}}}$
  - Z: The z-score, a standard normal variable used to determine the probability of the observed difference between the two samples.
  - $\bar{x}_1$: The sample mean of the first sample.
  - $\bar{x}_2$: The sample mean of the second sample.
  - σ₁: The standard deviation of the first population.
  - σ₂: The standard deviation of the second population.
  - n₁: The size of the first sample.
  - n₂: The size of the second sample.
  - $μ_1$: Mean of the first population
  - $μ_2$: Mean of the second population
    - $μ_1 - μ_2 = 0$, represents the null hypothesis, which is the  assumption that there is no difference between the average values (means) of the two populations we're comparing.
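
The formula above can be sketched directly in NumPy. This is only a minimal illustration: the helper name `two_sample_z`, its `null_diff` parameter, and the toy arrays are assumptions made for this example, and the sample variances are used in place of $σ_1^2$ and $σ_2^2$. The toy samples are far too small for a real Z-test; they are here only so the arithmetic is easy to check by hand.

```python
import numpy as np

def two_sample_z(x1, x2, null_diff=0.0):
    """Z statistic for the difference of two means (unpooled standard error)."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    # Sample variances (ddof=1) stand in for the unknown sigma_1^2 and sigma_2^2
    se = np.sqrt(x1.var(ddof=1) / x1.size + x2.var(ddof=1) / x2.size)
    # null_diff is (mu_1 - mu_2) under H0, which is 0 in our problem
    return (x1.mean() - x2.mean() - null_diff) / se

# Toy data (hypothetical): means 14 and 15, both sample variances equal 10
a = [10, 12, 14, 16, 18]
b = [11, 13, 15, 17, 19]

z = two_sample_z(a, b)
print("Z =", z)
```

Here the standard error is $\sqrt{10/5 + 10/5} = 2$, so the statistic is $(14 - 15)/2 = -0.5$.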



Now, there are two parts to working with the test statistic:

<font color='purple'>**Part 1:** Interpreting the Test Statistic:</font>

- Let's analyze the first part of our test statistic.
- **If our null hypothesis is true** (i.e. there's no difference in the effectiveness of the two medicines), our test statistic will be **close to 0.**
- Conversely, if the alternative hypothesis is correct, our test statistic will be **significantly different from 0**.
  - It will have either a large positive value or a large negative value

<br>

Population standard deviations (σ) represent true values but are often unknown.
  - Therefore, in such cases, **sample standard deviations** (S1 and S2) can be used as reliable estimations (S1 ≈ σ1, S2 ≈ σ2).
  - The reliability of these estimations **improves with larger sample sizes** (n1 and n2), **ideally exceeding 30**.
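
As a small illustration of estimating σ with S: `np.std` divides by $n$ by default, so `ddof=1` is needed to get the usual sample standard deviation. The values below are made up for this sketch.

```python
import numpy as np

# Hypothetical sample, chosen so the numbers are easy to verify by hand
sample = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

std_divide_by_n = np.std(sample)       # divides by n
s_sample = np.std(sample, ddof=1)      # divides by n - 1 (the sample estimate S)

print("Divide by n:    ", std_divide_by_n)
print("Divide by n - 1:", s_sample)
```

The two values get closer as the sample size grows, which is one reason larger samples make S a better stand-in for σ.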

<br>

<font color='purple'>**Part 2:** Consider the distribution of the test statistic under the Null Hypothesis:</font>

- It follows a **normal distribution** with a mean of 0 and a standard deviation of 1.
- This distribution is written as `N(0, 1)` and is called the standard normal distribution.
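
Because the statistic is standard normal under $H_0$, we can look up how large $|Z|$ must be before we call it "significantly different from 0". A minimal sketch using `scipy.stats.norm.ppf`; the helper name `critical_value` is an assumption for this example.

```python
from scipy.stats import norm

def critical_value(alpha):
    """Two-tailed critical value: H0 is rejected when |Z| exceeds it."""
    # alpha is split equally between the two tails
    return norm.ppf(1 - alpha / 2)

z_crit_95 = critical_value(0.05)
z_crit_99 = critical_value(0.01)

print("alpha = 0.05 ->", round(z_crit_95, 3))  # 1.96
print("alpha = 0.01 ->", round(z_crit_99, 3))  # 2.576
```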

Coming back to the problem at hand,

##### <font color='green'>**STEP 1:**</font>

> **What should be the null and alternate hypothesis?**

- Null Hypothesis ($H_0$): The mean recovery times for medicines M1 and M2 are the same (i.e., $μ_1 = μ_2$)

- Alternative Hypothesis ($H_a$): The mean recovery times for medicines M1 and M2 are not the same (i.e., $μ_1 ≠ μ_2$)

##### <font color='green'>**STEP 2:**</font>

We have given **M1** for **100** patients and **M2** for **90** patients.

> **What is the distribution?**

- Normal distribution.

##### <font color='green'>**STEP 3:**</font>

> **Is the team looking for an effect towards the left side or right side or two-tailed?**

- Two-tailed test

##### Generating random recovery times for both medicines M1 & M2

In [None]:
import random
import numpy as np

# Set a random seed for reproducibility
# Setting random seed to 123 for consistent, deterministic generation of recovery times across groups.
random.seed(123)  # You can use any integer as the seed

# Create an empty list to store the recovery times
M1_data = []

# Generate 100 random recovery times
# The underscore _ in the loop is a placeholder, signifying indifference to the counter value and solely focusing on executing the code block 100 times.
for _ in range(100):
    recovery_time = random.uniform(5.0, 20.0)  # Generating values between 5 and 20
    M1_data.append(np.round(recovery_time,0)) # Rounding recovery time to whole numbers

# Print the generated data
print("M1_data:",M1_data)

# Create an empty list to store the recovery times
M2_data = []

# Generate 90 random recovery times
for _ in range(90):
    recovery_time = random.uniform(5.0, 30.0) # Generating values between 5 and 30
    M2_data.append(np.round(recovery_time,0)) # Rounding recovery time to whole numbers

# Print the generated data
print("M2_data:",M2_data)

M1_data: [6.0, 6.0, 11.0, 7.0, 19.0, 6.0, 13.0, 10.0, 18.0, 7.0, 10.0, 10.0, 9.0, 5.0, 12.0, 6.0, 14.0, 6.0, 10.0, 12.0, 19.0, 6.0, 7.0, 17.0, 5.0, 19.0, 14.0, 9.0, 18.0, 17.0, 10.0, 17.0, 8.0, 14.0, 13.0, 17.0, 10.0, 11.0, 17.0, 13.0, 15.0, 15.0, 15.0, 18.0, 12.0, 15.0, 11.0, 5.0, 16.0, 8.0, 17.0, 19.0, 15.0, 8.0, 10.0, 11.0, 6.0, 12.0, 14.0, 10.0, 8.0, 6.0, 14.0, 5.0, 10.0, 8.0, 13.0, 16.0, 19.0, 15.0, 20.0, 19.0, 14.0, 9.0, 12.0, 8.0, 20.0, 12.0, 16.0, 6.0, 7.0, 8.0, 9.0, 10.0, 9.0, 12.0, 11.0, 7.0, 6.0, 10.0, 6.0, 11.0, 5.0, 13.0, 6.0, 15.0, 11.0, 20.0, 9.0, 12.0]
M2_data: [24.0, 13.0, 14.0, 21.0, 29.0, 27.0, 15.0, 23.0, 22.0, 14.0, 11.0, 7.0, 27.0, 6.0, 18.0, 18.0, 9.0, 21.0, 28.0, 24.0, 15.0, 8.0, 29.0, 18.0, 21.0, 25.0, 15.0, 7.0, 8.0, 22.0, 14.0, 27.0, 24.0, 30.0, 22.0, 13.0, 25.0, 6.0, 18.0, 27.0, 18.0, 8.0, 18.0, 19.0, 14.0, 27.0, 14.0, 12.0, 18.0, 10.0, 13.0, 15.0, 28.0, 7.0, 10.0, 10.0, 30.0, 27.0, 10.0, 19.0, 7.0, 13.0, 14.0, 18.0, 28.0, 25.0, 18.0, 25.0, 17.0, 23.0, 24.0,

##### <font color='green'>**STEP 4:**</font>

We perform the two-sample Z-test and calculate the p-value.

[Documentation](https://www.statsmodels.org/devel/generated/statsmodels.stats.weightstats.ztest.html)

In [None]:
# import a library to perform a Z-test
# Assumes independent samples
from statsmodels.stats import weightstats as stests
from scipy import stats

To do that, we can use the function `statsmodels.stats.weightstats.ztest()`.

Besides passing the two different sample's data and specifying the kind of tailed test (left/ right/ two sided) we wish to perform, there is another important parameter: `value`
- `value = 0` represents the null hypothesis for the z-test.
- It defines the **expected difference between the means of the two samples** under the assumption that they are equal.
- In other words, it assumes that there is no difference in the population means of the two groups.
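
To see what `value` does, here is a sketch that reimplements the statistic by hand instead of calling the library. It assumes a pooled variance estimate, which is what the statsmodels documentation lists as the default (`usevar='pooled'`), so the result can differ slightly from the unpooled formula shown earlier. The function name `pooled_z` and the toy numbers are hypothetical.

```python
import numpy as np
from scipy.stats import norm

def pooled_z(x1, x2, value=0.0):
    """Two-sample Z statistic with a pooled variance; `value` is mu_1 - mu_2 under H0."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    n1, n2 = x1.size, x2.size
    pooled_var = ((n1 - 1) * x1.var(ddof=1) + (n2 - 1) * x2.var(ddof=1)) / (n1 + n2 - 2)
    se = np.sqrt(pooled_var * (1 / n1 + 1 / n2))
    z = (x1.mean() - x2.mean() - value) / se
    p = 2 * norm.sf(abs(z))  # two-sided p-value
    return z, p

# Toy data (hypothetical): observed difference in means is exactly 2
x1 = [12, 14, 16, 18, 20]
x2 = [10, 12, 14, 16, 18]

z0, p0 = pooled_z(x1, x2, value=0)  # H0: mu_1 - mu_2 = 0
z2, p2 = pooled_z(x1, x2, value=2)  # H0: mu_1 - mu_2 = 2

print("value=0:", z0, p0)
print("value=2:", z2, p2)
```

With `value=2` the hypothesized difference equals the observed one, so the statistic is 0 and the p-value is 1. With `value=0` the same data gives a non-zero statistic.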


In [None]:
# value = 0 represents the null hypothesis for the z-test.
# It defines the expected difference between the means of the two samples under the assumption that they are equal.
# In other words, it assumes that there is no difference in the population means of the two groups.

z_score, pval = stests.ztest(x1 = M1_data, x2 = M2_data, value = 0, alternative = 'two-sided')

# print the test statistic and corresponding p-value
print("Z-score: ", z_score)
print("p-value: ", pval)

Z-score:  -7.68917478890992
p-value:  1.4808703984296164e-14


Note that this function was able to calculate the **Z statistic** and **pvalue** on its own.
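
For intuition, the two-sided p-value can be recovered from the Z statistic alone. The sketch below assumes the Z value printed above and compares two ways of computing the tail area: `1 - norm.cdf(...)` subtracts two nearly equal numbers when $|Z|$ is large, while the survival function `norm.sf(...)` avoids that subtraction.

```python
from scipy.stats import norm

z_score = -7.68917478890992  # Z statistic reported in the output above

p_naive = 2 * (1 - norm.cdf(abs(z_score)))  # loses precision for tiny tail areas
p_sf = 2 * norm.sf(abs(z_score))            # survival function: 1 - cdf, computed directly

print("1 - cdf:", p_naive)
print("sf:     ", p_sf)
```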

##### <font color='green'>**STEP 5:**</font>

We set $α = 0.01$, which corresponds to a 99% confidence level.

In [None]:
alpha = 0.01

if pval < alpha:
  print("Reject the null hypothesis, (i.e, The recovery time of two medicines are different)")
else:
  print("Fail to reject the null hypothesis (i.e., there is not enough evidence that the recovery times of the two medicines differ)")

Reject the null hypothesis, (i.e, The recovery time of two medicines are different)


Since the p-value ($\approx 1.48 \times 10^{-14}$) is less than $α = 0.01$, we can **reject the null hypothesis**.

So, we can report that the mean recovery times of the two medicines M1 & M2 are different.

---

### Problem No. 2

> **If there is already a function defined that automatically computes the test stat, why do we need to remember the z-score formula for 2 Samples?**

##### <font color='green'>**When no data array is provided, the formula-based approach is necessary for solving the problem.**</font>

**Let's look into another example:**

A car manufacturer conducted a study to compare the fuel efficiency of two different engine types: Engine X and Engine Y.

They collected data from two groups: Group X and Group Y.

- In Group X, a random sample of 50 cars with Engine X had an average fuel efficiency of 30 miles per gallon (mpg) with a standard deviation of 3 mpg.

- In Group Y, a random sample of 60 cars with Engine Y had an average fuel efficiency of 32 mpg with a standard deviation of 2.5 mpg.

The significance level (α) is set at 0.05. Can it be concluded that one engine type is more fuel-efficient than the other?

Based on the problem, we define our hypotheses as:
- **Null Hypothesis (H₀):** The average fuel efficiency of cars with Engine X is equal to the average fuel efficiency of cars with Engine Y.
- **Alternative Hypothesis (H₁):** The average fuel efficiency of cars with Engine X is not equal to the average fuel efficiency of cars with Engine Y.


In [None]:
import numpy as np
from scipy import stats

# Given data
sample_mean_X = 30 # Average fuel efficiency for Group X (Engine X)
sample_mean_Y = 32 # Average fuel efficiency for Group Y (Engine Y)
sample_std_X = 3 # Standard deviation for Group X
sample_std_Y = 2.5 # Standard deviation for Group Y
significance_level = 0.05
sample_size_X = 50 # Sample size for Group X
sample_size_Y = 60 # Sample size for Group Y

In [None]:
# Define the function to calculate the test statistic (the p-value is computed separately below)
def TwoSampZTest(samp_mean_1, samp_mean_2, samp_std_1, samp_std_2, n1, n2):
  # Calculate the test statistic
  denominator = np.sqrt((samp_std_1**2 / n1) + (samp_std_2**2 / n2))
  z_score = (samp_mean_1 - samp_mean_2) / denominator
  return z_score


# Calculate the z-score using the function
z_score = TwoSampZTest(sample_mean_X, sample_mean_Y, sample_std_X, sample_std_Y, sample_size_X, sample_size_Y)

# Calculate the two-tailed p-value
p_value = 2 * (1 - stats.norm.cdf(abs(z_score)))

# A two-tailed z-test considers deviations from the null hypothesis in both positive and negative directions.
# Using abs(z_score) ignores the sign of the deviation, focusing solely on its magnitude, ensuring we consider both directions and calculate the accurate two-tailed p-value.

# Compare the p-value to the significance level
if p_value < significance_level:
  conclusion = "Reject the null hypothesis. Engine Y is more fuel-efficient."
else:
  conclusion = "Fail to reject the null hypothesis. No significant difference in fuel efficiency."

print(f'z-score: {z_score:.4f}')
print(f'p-value: {p_value:.4f}')
print('Conclusion:', conclusion)

z-score: -3.7518
p-value: 0.0002
Conclusion: Reject the null hypothesis. Engine Y is more fuel-efficient.
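
The same question can also be approached with a confidence interval, the other method mentioned at the start of the lecture. The sketch below builds a 95% interval for $μ_X - μ_Y$ from the summary statistics given in the problem; the variable names are chosen for this example only.

```python
import numpy as np
from scipy.stats import norm

# Summary statistics from the problem statement
mean_x, mean_y = 30, 32
std_x, std_y = 3, 2.5
n_x, n_y = 50, 60
confidence = 0.95

diff = mean_x - mean_y
se = np.sqrt(std_x**2 / n_x + std_y**2 / n_y)   # same denominator as the Z statistic
z_crit = norm.ppf(1 - (1 - confidence) / 2)     # two-tailed critical value

ci_low = diff - z_crit * se
ci_high = diff + z_crit * se

print(f"95% CI for mu_X - mu_Y: ({ci_low:.3f}, {ci_high:.3f})")
print("Interval contains 0:", ci_low <= 0 <= ci_high)
```

If the interval does not contain 0, the two-tailed test at α = 0.05 rejects the null hypothesis, so both approaches lead to the same decision.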


#### **Conditions for using the Two Sample Z-Test:**

We must ensure that certain conditions are met for a z-test to be applicable:

  - The populations must have **finite** means (μ) and standard deviations (σ).
  - We need to know σ1 and σ2, or be able to estimate them with the sample standard deviations S1 and S2.
    - This indirectly implies that the sample sizes n1 and n2 should not be too small, and should preferably be greater than 30.
  - The data in each population must be **continuous**, not discrete.
  - The data in each population should be **approximately normally distributed.**
  - n1 and n2 don't have to be equal (i.e., they can be the same or different).

---

# **Question**

```
Determine whether there is a statistically significant difference in the average heights of plants grown with fertilizer X and fertilizer Y.
Group A (Fertilizer X):
Heights = [162, 164, 168, 170, 174, 176, 180, 182, 186, 188, 192, 194, 198, 200, 204, 206, 210, 212, 216, 218, 222, 224, 228, 230, 234, 236, 240, 242, 246, 248, 252, 254, 258, 260, 264, 266, 270]

Group B (Fertilizer Y):
Heights = [158, 162, 166, 170, 174, 178, 182, 186, 190, 194, 198, 202, 206, 210, 214, 218, 222, 226, 230, 234, 238, 242, 246, 250, 254, 258, 262, 266, 270, 274, 278, 282, 286, 290, 294, 298, 302]

Significance Level (α): 0.1
```

**Options:**

- [ ] Yes, there is enough evidence to conclude there is a difference in the average heights of plants.
- [x] No, there is not enough evidence to conclude there is a difference in the average heights of plants.
- [ ] The p-value is not relevant in this context.
- [ ] The sample sizes are insufficient to make any conclusions.

**Solution**

In [None]:
# Null Hypothesis (H0): The average heights of plants grown with fertilizers X and Y are equal (μ₁ = μ₂).
# Alternative Hypothesis (Ha): The average heights of plants grown with fertilizers X and Y are different (μ₁ ≠ μ₂).

from statsmodels.stats import weightstats as stests

# Group A heights
heights_a = [162, 164, 168, 170, 174, 176, 180, 182, 186, 188, 192, 194, 198, 200, 204, 206, 210, 212, 216, 218, 222, 224, 228, 230, 234, 236, 240, 242, 246, 248, 252, 254, 258, 260, 264, 266, 270]

# Group B heights
heights_b = [158, 162, 166, 170, 174, 178, 182, 186, 190, 194, 198, 202, 206, 210, 214, 218, 222, 226, 230, 234, 238, 242, 246, 250, 254, 258, 262, 266, 270, 274, 278, 282, 286, 290, 294, 298, 302]

# Significance level
alpha = 0.1

# Perform the z-test
z_stat, p_value = stests.ztest(heights_a, heights_b, value=0, alternative='two-sided')

# Print the z-statistic and p-value
print("z-statistic:", z_stat)
print("p-value:", p_value)

# Decision
if p_value < alpha:
    print("Reject the null hypothesis. There is a statistically significant difference in the average heights of plants grown with fertilizer X and Y.")
else:
    print("Fail to reject the null hypothesis. There is no statistically significant difference in the average heights of plants grown with fertilizer X and Y.")

z-statistic: -1.6280691715301856
p-value: 0.10351021950900992
Fail to reject the null hypothesis. There is no statistically significant difference in the average heights of plants grown with fertilizer X and Y.


---

# **Question:**

```
A researcher is studying the satisfaction level of customers after implementing a new customer service system.
They collected survey responses from 250 customers and found that 65 of them were dissatisfied with the new system.
The researcher wants to test the null hypothesis that no more than 30% of customers are dissatisfied with the new system.
Use the p-value technique to test the claim with a significance level of α = 0.05.
```

**Options:**
- [ ] Reject the null hypothesis.
- [x] Fail to reject the null hypothesis.
- [ ] The p-value is irrelevant in this context.
- [ ] The null hypothesis cannot be tested with the given data.

**Solution:**

In [None]:
import numpy as np
from scipy.stats import norm

# Null Hypothesis (H0): The proportion of customers dissatisfied with the new system is less than or equal to 30%.(p ≤ 0.30)
# Alternative Hypothesis (H1): The proportion of customers dissatisfied with the new system is greater than 30%.(p > 0.30).

n = 250 # Sample size
x = 65 # Number of customers dissatisfied with the new system
p_hat = x/n # Sample proportion
p = 0.30 # Hypothesized proportion

# Calculate test statistic value for one sample proportion test
Z = (p_hat - p) / np.sqrt((p * (1 - p)) / n)
print('Test statistic:',Z)

# Calculate the p-value for the test statistic
p_value = 1 - norm.cdf(Z)
print('p-value:', p_value)

# Define the significance level
alpha = 0.05

# Make a decision based on the p-value and significance level
if p_value < alpha:
  print('Reject the null hypothesis.')
else:
  print('Fail to reject the null hypothesis.')

Test statistic: -1.3801311186847078
p-value: 0.9162268612556912
Fail to reject the null hypothesis.


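##### Comparing two proportions using the formula

When two sample proportions are compared, the test statistic uses the pooled proportion $\hat{p} = \frac{x_1 + x_2}{n_1 + n_2}$:

- $Z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p} (1 - \hat{p}) \left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$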

In [None]:
import numpy as np
import scipy.stats as stats

# Step 1: Define the data
# Data for the old web page (Web Page A)
n_1 = 1000 # visits_1
x_1 = 50 # conversions_1

# Data for the new web page (Web Page B)
n_2 = 500 # visits_2
x_2 = 30 # conversions_2

# Step 2: Define the hypotheses
# Null Hypothesis (H0): Conversion rates are the same.
# Alternative Hypothesis (Ha): Conversion rates are different.
# This is a two-tailed test.
p_1_hat = x_1 / n_1
p_2_hat = x_2 / n_2

# p_hat: is the combined sample proportion for both web pages.
p_hat = (x_1 + x_2) / (n_1 + n_2)

# Step 3: Calculate the test statistic (Z)
Z = (p_1_hat - p_2_hat) / np.sqrt(p_hat * (1 - p_hat) * ((1 / n_1) + (1 / n_2)))

# Step 4: Interpret the test statistic
# Z follows a standard normal distribution. We will calculate the two-tailed p-value next.

# Step 5: Calculate the p-value
p_value = 2 * (1 - stats.norm.cdf(np.abs(Z)))

# Print the results
print(f"Z = {Z}")
print(f"P-value = {p_value}")

Z = -0.8125338562826986
P-value = 0.4164853677823288


---

# **Question**

```
A company introduces a new feature in its mobile App that allows users to subscribe to a premium service.
They want to evaluate if the introduction of this feature has led to an increase in the number of premium users.
They collect data from two different time periods: before the feature was introduced (Group A) and after the feature was introduced (Group B).
Which test should you use to determine if the new feature has significantly increased the number of premium users?
```

**Options:**
- [ ] One Sample Z-Test
- [ ] Two Sample Z-Test
- [ ] One Sample Z-Proportion Test
- [x] Two Sample Z-Proportion Test

**Solution:**

- In this scenario, you are comparing the proportion of users who subscribe to the premium service before and after the introduction of the new feature.
  - You have two sets of data:
    - Group A (before the feature was introduced) and
    - Group B (after the feature was introduced).
  - To determine if the new feature has significantly increased the proportion of premium users, you need to compare proportions from two independent groups.

- The one-sample tests (Options A and C) are used when you are comparing the mean of a single group to a known value or when you are testing a proportion within a single group.

- The two-sample z-test (Option B) is typically used when you are comparing means from two independent groups, which is not the appropriate test for comparing proportions.

Therefore, the correct choice here is the Two Sample Z-Proportion Test, which allows you to compare proportions from two independent groups and assess if the new feature has significantly increased the subscription rate.

---

# **Question**

```
A shoe manufacturer claims that their new running shoes make people run faster.
To test this claim, they select two groups: Group A wears the new shoes, and Group B wears the old ones.
After a 4-week trial, you find that Group A improved their running speed by 15%, while Group B improved by only 10%.
Which test should you use to determine if the new shoes are more effective?
```

**Options:**
- [ ] One Sample Z-Test
- [x] Two Sample Z-Test
- [ ] One Sample Z-Proportion Test
- [ ] Two Sample Z-Proportion Test

**Solution:**

- In this scenario, you are comparing two groups (Group A and Group B) to determine if the new running shoes are more effective in improving running speed.
  - Since we have data from two different groups and we are interested in comparing the means (improvement percentages), a two-sample z-test is appropriate.

- The one-sample tests (Option A and Option C) are used when you have a single group and either compare its mean to a known value (one-sample z-test) or test a proportion within it (one-sample z-proportion test).

- The two-sample z-proportion test (Option D) is used when you want to compare proportions from two independent groups, which is not the case in this scenario.

Therefore, the correct choice here is a Two Sample Z-Test, where you would compare the improvement percentages of Group A (new shoes) and Group B (old shoes) to determine if the new shoes are more effective in making people run faster.
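
To make this concrete, here is a sketch of the right-tailed two-sample Z-test for this scenario. The question only gives the mean improvements (15% and 10%); the standard deviations and group sizes below are assumed purely for illustration and are not part of the question.

```python
import numpy as np
from scipy.stats import norm

mean_a, mean_b = 15.0, 10.0  # mean improvement in % (from the question)
std_a, std_b = 8.0, 7.0      # ASSUMED standard deviations (not given in the question)
n_a, n_b = 40, 40            # ASSUMED group sizes (not given in the question)

# H0: mu_A = mu_B, Ha: mu_A > mu_B (new shoes are more effective)
se = np.sqrt(std_a**2 / n_a + std_b**2 / n_b)
z = (mean_a - mean_b) / se
p_value = norm.sf(z)  # right-tailed: area to the right of Z

print(f"Z = {z:.4f}, p-value = {p_value:.4f}")
```

Because the alternative is directional, only the right tail is used and the p-value is not doubled.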

---

# T-Test Examples

### <font color='purple'>Cricket Example</font>

In [None]:
!wget --no-check-certificate https://drive.google.com/uc?id=1bvVVbWUu6JKQDol0xwj3pqsTs4Qxm_oj -O Sachin_ODI.csv

In [None]:
import pandas as pd

df = pd.read_csv('/content/Sachin_ODI.csv')
df

Unnamed: 0,runs,NotOut,mins,bf,fours,sixes,sr,Inns,Opp,Ground,Date,Winner,Won,century
0,13,0,30,15,3,0,86.66,1,New Zealand,Napier,1995-02-16,New Zealand,False,False
1,37,0,75,51,3,1,72.54,2,South Africa,Hamilton,1995-02-18,South Africa,False,False
2,47,0,65,40,7,0,117.50,2,Australia,Dunedin,1995-02-22,India,True,False
3,48,0,37,30,9,1,160.00,2,Bangladesh,Sharjah,1995-04-05,India,True,False
4,4,0,13,9,1,0,44.44,2,Pakistan,Sharjah,1995-04-07,Pakistan,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
355,14,0,34,15,2,0,93.33,2,Australia,Sydney,2012-02-26,Australia,False,False
356,39,0,45,30,5,0,130.00,2,Sri Lanka,Hobart,2012-02-28,India,True,False
357,6,0,25,19,1,0,31.57,1,Sri Lanka,Dhaka,2012-03-13,India,True,False
358,114,0,205,147,12,1,77.55,1,Bangladesh,Dhaka,2012-03-16,Bangladesh,False,True


Now based on this dataset, we will analyze and answer a few questions using our statistical tools.

#### <font color='purple'>Batting pattern in first and second Innings</font>

First, let's look at the respective means.

In [None]:
df.groupby('Inns')['runs'].mean()

Inns
1    46.670588
2    40.173684
Name: runs, dtype: float64

> <font color='purple'>**What do you think, is it a coincidence or is it significant?**</font>

Seems like a hypothesis we can test using T-test!

Let's find out.

<br>

> <font color='purple'>**What will be the null and alternate hypothesis?**</font>

Let the average runs scored in the first and second innings be $μ_1$ and $μ_2$ respectively.

- $H_0: μ_1 = μ_2$

For the alternate hypothesis, we have a sense that maybe the runs scored in the first innings are greater than in the second innings.

So we can set it like:
- $H_a: μ_1 > μ_2$

In [None]:
df_first_innings = df[df['Inns'] == 1]
df_second_innings = df[df['Inns'] == 2]

<font color='purple'>Performing T-test</font>

In [None]:
from scipy.stats import ttest_ind

t_stat, pvalue = ttest_ind(df_first_innings['runs'], df_second_innings['runs'], alternative = "greater")
t_stat, pvalue

(1.4612016295532178, 0.07241862097379981)

In [None]:
alpha = 0.05 # 95% confidence

if pvalue < alpha:
  print('Reject H0')
  print('First innings is better')
else:
  print ('Fail to Reject H0')
  print('Difference we observe is just chance')

Fail to Reject H0
Difference we observe is just chance
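
To see what `ttest_ind` computes with its default settings (equal variances assumed), the sketch below rebuilds the pooled t statistic and the right-tailed p-value by hand and compares them with SciPy. The data here is synthetic, generated only for this check; it is not the cricket dataset.

```python
import numpy as np
from scipy import stats

# Synthetic samples (hypothetical), standing in for two groups of scores
rng = np.random.default_rng(0)
first = rng.normal(loc=47, scale=40, size=100)
second = rng.normal(loc=40, scale=38, size=120)

n1, n2 = first.size, second.size
pooled_var = ((n1 - 1) * first.var(ddof=1) + (n2 - 1) * second.var(ddof=1)) / (n1 + n2 - 2)
t_manual = (first.mean() - second.mean()) / np.sqrt(pooled_var * (1 / n1 + 1 / n2))
p_manual = stats.t.sf(t_manual, df=n1 + n2 - 2)  # right-tailed, Ha: mu_1 > mu_2

t_scipy, p_scipy = stats.ttest_ind(first, second, alternative="greater")

print("manual:", t_manual, p_manual)
print("scipy: ", t_scipy, p_scipy)
```

The formula mirrors the Z statistic, but the p-value comes from a t distribution with $n_1 + n_2 - 2$ degrees of freedom instead of the standard normal.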


#### <font color="purple">What if we want to look at the batting pattern when the team won vs lost?</font>

In [None]:
df.groupby('Won')['runs'].mean()

Won
False    35.130682
True     51.000000
Name: runs, dtype: float64

This looks like a large difference.

Let's check whether it is statistically significant using a T-test.

In [None]:
df_won = df[df['Won'] == True]
df_lost = df[df['Won'] == False]

> <font color='purple'>**What are the null and alternate hypotheses?**</font>

Let $μ_1$ and $μ_2$ be the average runs scored in matches that were won and lost, respectively.

- $H_0: μ_1 = μ_2$, i.e. no difference in batting, irrespective of win or loss
- $H_a: μ_1 > μ_2$, i.e. better batting when the match is won

With this, let's perform the test

In [None]:
t_stat, pvalue = ttest_ind(df_won['runs'], df_lost['runs'], alternative = "greater")
t_stat, pvalue

(3.628068563969343, 0.00016353077486826558)

In [None]:
alpha = 0.05 # 95% confidence

if pvalue < alpha:
  print('Reject H0')
  print('Better scores when team won the match')
else:
  print ('Fail to Reject H0')
  print('Difference we observe is just chance')

Reject H0
Better scores when team won the match


The obtained p-value was very small, so a difference this large is highly unlikely to be a coincidence.

Therefore, at the 5% significance level:
- There is significant evidence that the mean `runs` is higher in matches that were won (`Won`).
- There is no significant evidence that the mean `runs` is higher in the first innings (`Inns`).

---

### <font color='purple'>Drug Recovery Time Example</font>

Suppose there are 2 competing companies that have created a drug for tackling the same disease.

A test was conducted using these 2 drugs on a group of people, and the results are given in the following data.

This data contains the **number of days** an individual took to recover from illness, using the mentioned drug.

<font color='purple'>Which drug is more effective?</font>

In [None]:
!wget --no-check-certificate https://drive.google.com/uc?id=1aTrYo2_PIeYcg8Fvpr5m2GYAXiPvvOmz -O drug_1_recovery.csv

In [None]:
import pandas as pd
d1 = pd.read_csv('/content/drug_1_recovery.csv')
d1

Unnamed: 0,drug_1
0,8.824208
1,7.477745
2,7.557121
3,7.981314
4,6.827716
...,...
95,6.890506
96,7.725759
97,6.848016
98,7.969997


In [None]:
d1.mean()

drug_1    7.104917
dtype: float64

Now, for Drug 2

In [None]:
!wget --no-check-certificate https://drive.google.com/uc?id=1YgAgnzkfiCFz_kSO6BPPLCYAB2K5VwXG -O drug_2_recovery.csv

In [None]:
d2 = pd.read_csv('/content/drug_2_recovery.csv')
d2

Unnamed: 0,drug_2
0,9.565974
1,7.492915
2,8.738418
3,7.635235
4,4.125593
...,...
115,7.861993
116,8.233510
117,5.876257
118,7.789454


In [None]:
d2.mean()

drug_2    8.073423
dtype: float64

This presents a similar problem to what we've seen till now.

> <font color='purple'>**What will be the Null and Alternate Hypothesis?**</font>

We observe that the recovery time with drug 1 seems better (fewer days).

So we define a hypothesis as:
- $H_0: μ_1 = μ_2$
- $H_a: μ_1 < μ_2$

Based on this we can perform Two sample T-test.

In [None]:
from scipy.stats import ttest_ind

t_stat, pvalue = ttest_ind(d1, d2, alternative = "less")
t_stat, pvalue

(array([-5.32112438]), array([1.27713574e-07]))

In [None]:
alpha = 0.05 # 95% confidence

if pvalue < alpha:
  print('Reject H0')
  print('First drug has less recovery time.')
else:
  print ('Fail to Reject H0')
  print('Both have same recovery time')

Reject H0
First drug has less recovery time.
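
Note that the result above came back as one-element arrays because whole DataFrames were passed to `ttest_ind`. Passing the columns themselves returns plain scalars, which are easier to print and compare. The sketch below uses synthetic data with assumed means and spreads, not the actual drug files; only the column names `drug_1` and `drug_2` are taken from the data above.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

# Synthetic recovery times (hypothetical), only to demonstrate the call
rng = np.random.default_rng(42)
d1_demo = pd.DataFrame({"drug_1": rng.normal(7.1, 1.0, size=100)})
d2_demo = pd.DataFrame({"drug_2": rng.normal(8.1, 1.5, size=120)})

# Select the columns (Series) instead of passing the DataFrames
t_stat, pvalue = ttest_ind(d1_demo["drug_1"], d2_demo["drug_2"], alternative="less")

print("t-statistic:", float(t_stat))
print("p-value:", float(pvalue))
```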


---