# Hypothesis testing 

Hypothesis testing in statistics is a method used to make inferences or draw conclusions about a population based on sample data. The idea is to assess whether there is enough evidence in a sample to support or reject a claim (or hypothesis) about a population parameter.

In [2]:
# imports 
import numpy as np 
import pandas as pd 
import seaborn as sns

In [3]:
# data loading
df = pd.read_csv('./data/student-mat.csv',delimiter=';')
df.head()

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,...,4,3,4,1,1,3,6,5,6,6
1,GP,F,17,U,GT3,T,1,1,at_home,other,...,5,3,3,1,1,3,4,5,5,6
2,GP,F,15,U,LE3,T,1,1,at_home,other,...,4,3,2,2,3,3,10,7,8,10
3,GP,F,15,U,GT3,T,4,2,health,services,...,3,2,2,1,1,5,2,15,14,15
4,GP,F,16,U,GT3,T,3,3,other,other,...,4,3,2,1,2,5,4,6,10,10


## Understanding the Null Hypothesis

    Null Hypothesis: There is no relationship between A and B
    Example: "There is no relationship between this flu medication and a reduced recovery time from the flu."

    The null hypothesis is usually denoted as H₀

    Alternative Hypothesis: There is a relationship between A and B. This is the claim the experiment is designed to find evidence for.
    Example: "This flu medication reduces recovery time for the flu."

    The alternative hypothesis is usually denoted as H₁



## p-value
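p-value: The probability of observing a test statistic at least as extreme as the one observed, by random chance alone, assuming that the null hypothesis is true. For a right-tailed test "extreme" means at least as large; for a two-tailed test it means at least as far from the null value in either direction.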
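To make the definition concrete, here is a minimal sketch that computes a right-tailed p-value by hand and compares it with SciPy. The data are synthetic and the names `scores` and `mu0` are assumptions made for illustration; they are not part of the student dataset.

```python
import numpy as np
from scipy import stats

# Synthetic scores (illustrative assumption, not the student data)
rng = np.random.default_rng(0)
scores = rng.normal(loc=11.0, scale=4.0, size=50)
mu0 = 10.0  # mean claimed by the null hypothesis

# t statistic: distance of the sample mean from mu0, in standard errors
n = scores.size
t_manual = (scores.mean() - mu0) / (scores.std(ddof=1) / np.sqrt(n))

# Right-tailed p-value: P(T >= t_manual) under H0, with n - 1 degrees of freedom
p_manual = stats.t.sf(t_manual, df=n - 1)

# Cross-check against SciPy's one-sample t-test
t_scipy, p_scipy = stats.ttest_1samp(scores, mu0, alternative="greater")
print(f"manual: t = {t_manual:.4f}, p = {p_manual:.4f}")
print(f"scipy:  t = {t_scipy:.4f}, p = {p_scipy:.4f}")
```

Both lines should print the same numbers, since the built-in test evaluates the same tail probability.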


## alpha value
α (alpha value): The threshold at which you're willing to say that the null hypothesis is implausible enough to be rejected.

If you set an alpha value of 0.05, you're essentially saying "I'm okay with rejecting the null hypothesis in favor of my alternative hypothesis when results at least as extreme as mine would occur less than 5% of the time if the null hypothesis were true."


Reject H₀ in favor of H₁, if p < α

Fail to reject H₀, if p ≥ α


An alpha value can be any value between 0 and 1. However, the most common alpha value in data science is 0.05.

### How to Interpret

Assume:

    Null Hypothesis (H₀): The average score <= 10.5

    Alternative Hypothesis (H₁): The average score > 10.5

**Step-by-step interpretation:**

    p_value < 0.05   →  Reject H₀ (statistically significant)

    p_value >= 0.05  →  Fail to reject H₀ (not statistically significant)


In [4]:
# hypothesis testing
def hypothesis_test(h0,h1,p_value,alpha=0.05):
    if p_value < alpha:
        print(f"Reject H₀ → {h1}")
    else:
        print(f"Fail to reject H₀ → {h0}")

### Type I Error (False Positive)

A **Type I Error** occurs when we **reject the null hypothesis (H₀)** even though it is actually **true**.

#### Definition:
> A Type I Error is a **false positive**: we detect an effect or difference that doesn't actually exist.

#### Probability:
- The probability of making a Type I Error is denoted by **alpha (α)**.
- Common values:  
  - α = 0.05 → 5% chance of Type I Error  
  - α = 0.01 → 1% chance  
  - α = 0.10 → 10% chance
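The link between α and the Type I error rate can be checked with a small simulation. This is only a sketch on synthetic data: every simulated experiment is drawn from a population where H₀ is true, so any rejection is a false positive. The population parameters and the number of experiments are arbitrary assumptions.

```python
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(42)
alpha = 0.05
n_experiments = 2000

# Each row is one experiment of 30 observations; H0 (mean = 10) is TRUE
samples = rng.normal(loc=10.0, scale=3.0, size=(n_experiments, 30))

# Two-tailed one-sample t-test per experiment (axis=1 tests each row)
_, p_values = ttest_1samp(samples, 10.0, axis=1)

# Fraction of experiments that wrongly reject H0
type_i_rate = np.mean(p_values < alpha)
print(f"Observed Type I error rate: {type_i_rate:.3f} (alpha = {alpha})")
```

The observed rate should land close to α, up to simulation noise.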



### Type II Error (False Negative)

A **Type II Error** occurs when we **fail to reject the null hypothesis (H₀)** even though it is actually **false**.

#### Definition:
> A Type II Error is a **false negative**: we fail to detect a real effect or difference that **actually exists**.

#### Probability:
- The probability of making a Type II Error is denoted by **beta (β)**.
- Unlike α (which we set), β depends on:
  - Sample size
  - Effect size
  - Significance level (α)
  - Variability in the data
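The dependence of β on sample size can be illustrated with a simulation. This is a rough sketch on synthetic data where H₀ is false by construction; the true mean, null mean, and standard deviation below are assumptions chosen only to make the trend visible.

```python
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(7)
alpha = 0.05
true_mean, null_mean, sd = 11.0, 10.0, 3.0  # H0 (mean = 10) is FALSE here

def estimate_beta(n, n_experiments=2000):
    """Estimate beta as the share of experiments that fail to reject a false H0."""
    samples = rng.normal(true_mean, sd, size=(n_experiments, n))
    _, p_values = ttest_1samp(samples, null_mean, axis=1)
    return np.mean(p_values >= alpha)

betas = {n: estimate_beta(n) for n in (10, 30, 100)}
for n, beta in betas.items():
    print(f"n = {n:3d}: beta ~ {beta:.2f}, power ~ {1 - beta:.2f}")
```

Larger samples should give a smaller β (and therefore higher power, 1 − β) for the same effect size and α.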

## One-Tail and Two-Tail Tests



### One-tailed test (directional)

You test whether the mean lies in one specific direction from the reference value: either greater than it or less than it, with the direction chosen before looking at the data.

    H₀: μ ≤ μ₀ (or μ ≥ μ₀)

    H₁: μ > μ₀ (or μ < μ₀)

    Used when: You have a specific directional hypothesis.

    Example:
    "Students who took extra classes scored higher than the average."

#### Right-tailed test (testing if sample mean is greater than population mean):

    H₀ (null): μ_sample ≤ μ_population

    H₁ (alternative): μ_sample > μ_population

Use this when you want to detect an increase.

In [5]:
# get the population mean
population_mean = df['G3'].mean()
population_mean

np.float64(10.415189873417722)

In [6]:
# sample 100 random rows
sample = df["G3"].sample(n=100)    
sample.mean()

np.float64(10.72)

In [7]:
from scipy.stats import ttest_1samp

# One-sample t-test comparing the sample mean to the population mean.
# alternative="greater" makes this a right-tailed test, so p_value is
# already one-sided and needs no manual halving.
t_stat, p_value = ttest_1samp(sample, population_mean, alternative="greater")

print(f"t_stat = {t_stat:.2f}, p_value = {p_value:.4f}")


t_stat = 0.64, p_value = 0.2619


In [8]:
h0 = "The sample mean <= pop mean"
h1 = "The sample mean > pop mean"
 
print(f"sample mean = {sample.mean()} ,population mean = {population_mean}",)
print(f't_stat = {t_stat} , p_val_right = {p_value}')
hypothesis_test(h0,h1,p_value)

sample mean = 10.72 ,population mean = 10.415189873417722
t_stat = 0.6396367093784506 , p_val_right = 0.26194311389997826
Fail to reject H₀ → The sample mean <= pop mean


#### Left-tailed test (testing if sample mean is less than population mean):

    H₀ (null): μ_sample ≥ μ_population

    H₁ (alternative): μ_sample < μ_population

Use this when you're checking for a decrease.

In [9]:
t_stat, p_value = ttest_1samp(sample,population_mean,alternative="less")

In [10]:

# No manual conversion is needed: alternative="less" already returns the left-tailed p-value.

In [11]:
h1 = "The sample mean < population mean"
h0 = "The sample mean >= population mean"

print(f"sample mean = {sample.mean()} ,population mean = {population_mean}",)
print(f't_stat = {t_stat} , p_val_left = {p_value}')
hypothesis_test(h0,h1,p_value)

sample mean = 10.72 ,population mean = 10.415189873417722
t_stat = 0.6396367093784506 , p_val_left = 0.7380568861000217
Fail to reject H₀ → The sample mean >= population mean


In [12]:
# Examples

### Two-tailed test (default in ttest_1samp)

You're checking if the sample mean is different (could be higher or lower) than the population mean.

    H₀ (Null): μ = μ₀

    H₁ (Alternative): μ ≠ μ₀

Used when: You don't know the direction of the effect.

    Example:
    "Is the average grade different from 10?"

    (You don’t know if it’s higher or lower — just different.)

In general, "≠" in the alternative hypothesis and "=" in the null hypothesis indicate that this is a two-tailed test.
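The one-tailed and two-tailed p-values of a t-test are tied together by the symmetry of the t distribution. The sketch below checks that relationship on synthetic data; `sample` and `mu0` here are illustrative assumptions, separate from the notebook's variables.

```python
import numpy as np
from scipy.stats import ttest_1samp

# Synthetic sample (illustrative assumption)
rng = np.random.default_rng(1)
sample = rng.normal(loc=10.5, scale=4.0, size=100)
mu0 = 10.0

_, p_two = ttest_1samp(sample, mu0)                            # two-tailed (default)
_, p_greater = ttest_1samp(sample, mu0, alternative="greater")  # right-tailed
_, p_less = ttest_1samp(sample, mu0, alternative="less")        # left-tailed

# The two one-tailed p-values cover the whole distribution between them
print(f"p_greater + p_less = {p_greater + p_less:.4f}")

# The two-tailed p-value is twice the smaller one-tailed p-value
print(f"p_two = {p_two:.4f}, 2 * min(one-tailed) = {2 * min(p_greater, p_less):.4f}")
```

The same relationship holds for the p-values reported in this notebook: the right-tailed value 0.2619 doubles to the two-tailed value 0.5239.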

In [13]:
# Perform two-tailed one-sample t-test (alternative defaults to "two-sided")
t_stat, p_value = ttest_1samp(sample, population_mean)

print(f"T-statistic: {t_stat:.2f}")
print(f"P-value: {p_value:.4f}")

T-statistic: 0.64
P-value: 0.5239


In [14]:
h1 = "The sample mean != population mean"
h0 = "The sample mean = population mean"

print(f"sample mean = {sample.mean()} ,population mean = {population_mean}",)
print(f't_stat = {t_stat} , p_value = {p_value}')
hypothesis_test(h0,h1,p_value)

sample mean = 10.72 ,population mean = 10.415189873417722
t_stat = 0.6396367093784506 , p_value = 0.5238862277999565
Fail to reject H₀ → The sample mean = population mean


### Two-Sample T-Test (Independent Samples)

You're checking if **two independent groups** have **different means** (could be higher or lower).

#### Two-tailed test (default in `ttest_ind`):

You're testing for any difference — **either** group could have a higher mean.

    H₀ (Null): μ₁ = μ₂  
    H₁ (Alternative): μ₁ ≠ μ₂

Used when:  
You want to know **if there's a difference**, but you don't know in **which direction**.

    Example:
    "Do students in class A and class B have different average scores?"
    (You're not assuming which class scores higher — you're just checking if they differ.)

In general:  
- Use this when you expect a **difference**, but you're **not predicting** which group will score higher.
- It’s the default test type in `scipy.stats.ttest_ind`.
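By default `ttest_ind` assumes both groups share the same variance. When that assumption is doubtful, Welch's t-test (`equal_var=False`) is a common alternative. The sketch below compares the two on synthetic groups with deliberately unequal sizes and spreads; the group parameters are assumptions for illustration only.

```python
import numpy as np
from scipy.stats import ttest_ind

# Two synthetic groups with different sizes and variances (illustrative assumption)
rng = np.random.default_rng(3)
group_a = rng.normal(loc=10.0, scale=2.0, size=40)
group_b = rng.normal(loc=11.0, scale=6.0, size=120)

# Student's t-test: pools the two variances (the default)
t_student, p_student = ttest_ind(group_a, group_b)

# Welch's t-test: does not assume equal variances
t_welch, p_welch = ttest_ind(group_a, group_b, equal_var=False)

print(f"Student: t = {t_student:.3f}, p = {p_student:.4f}")
print(f"Welch:   t = {t_welch:.3f}, p = {p_welch:.4f}")
```

When group sizes and variances are similar the two versions give nearly the same answer; they diverge as the groups become more unbalanced.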


Q: Do male and female students perform differently in G3?

In [15]:
from scipy.stats import ttest_ind

male_scores = df[df["sex"] == "M"]["G3"]
female_scores = df[df["sex"] == "F"]["G3"]

(female_scores.mean(),male_scores.mean())

(np.float64(9.966346153846153), np.float64(10.914438502673796))

In [16]:

# Perform two-sample t-test
t_stat, p_value = ttest_ind(male_scores,female_scores)
print(f"T-statistic: {t_stat:.2f}")
print(f"P-value: {p_value:.4f}")


T-statistic: 2.06
P-value: 0.0399


In [17]:
h1 = "male_scores != female score in g3"
h0 = "male_scores = female score in g3"

print(f"Male scores = {male_scores.mean()} ,Female Scores = {female_scores.mean()}")
print(f't_stat = {t_stat} , p_value = {p_value}')
hypothesis_test(h0,h1,p_value)


Male scores = 10.914438502673796 ,Female Scores = 9.966346153846153
t_stat = 2.061992815503971 , p_value = 0.039865332341527636
Reject H₀ → male_scores != female score in g3


In [None]:
# Do male students perform better than female students in G3?
# Right-tailed test
h1 = "male_scores > female score in g3"
h0 = "male_scores <= female score in g3"

t_stat,p_value = ttest_ind(male_scores,female_scores,alternative="greater")

print(f"Male scores = {male_scores.mean()} ,Female Scores = {female_scores.mean()}")
print(f't_stat = {t_stat} , p_val_right = {p_value}')
hypothesis_test(h0,h1,p_value)

Male scores = 10.914438502673796 ,Female Scores = 9.966346153846153
t_stat = 2.061992815503971 , p_val_right = 0.019932666170763818
Reject H₀ → male_scores > female score in g3


In [33]:
# Left-tailed test
# Do male students perform worse than female students in G3?

h1 = "male_scores < female score in g3"
h0 = "male_scores >= female score in g3"

t_stat,p_value = ttest_ind(male_scores,female_scores,alternative="less")

print(f"Male scores = {male_scores.mean()} ,Female Scores = {female_scores.mean()}",)
print(f't_stat = {t_stat} , p_val_left = {p_value}')
hypothesis_test(h0,h1,p_value) 

Male scores = 10.914438502673796 ,Female Scores = 9.966346153846153
t_stat = 2.061992815503971 , p_val_left = 0.9800673338292362
Fail to reject H₀ → male_scores >= female score in g3


### Paired T-Test (Dependent Samples)

You're checking if the **means of two related (paired) groups** — such as the same students' grades in two periods — are **significantly different**.

#### Two-tailed test (default in `ttest_rel`):

You're testing for any difference between paired values — the difference could go **either way**.

    H₀ (Null): μ_d = 0  
    H₁ (Alternative): μ_d ≠ 0

Where:  
- μ_d = the **mean of the differences** between G3 and G1 scores (i.e., G3 - G1, matching the argument order in `ttest_rel(df["G3"], df["G1"])`)

Used when:  
You're interested in whether students' grades changed between two time points.

    Example:
    "Did students' final grades (G3) differ from their first period grades (G1)?"
    (You're comparing G1 and G3 for the same students, and you're open to either improvement or decline.)

In general:  
- Use this when comparing **before and after** scores for the **same students**.
- `G1` and `G3` are **paired** since they belong to the same individual.
- This is done using `scipy.stats.ttest_rel`.
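A paired t-test is the same calculation as a one-sample t-test on the per-student differences, tested against 0. The sketch below checks this on synthetic paired scores; `first` and `final` are hypothetical stand-ins for G1 and G3, not the real columns.

```python
import numpy as np
from scipy.stats import ttest_rel, ttest_1samp

# Synthetic paired scores for the same 60 students (illustrative assumption)
rng = np.random.default_rng(11)
first = rng.normal(loc=11.0, scale=3.0, size=60)
final = first + rng.normal(loc=-0.5, scale=1.5, size=60)

# Paired t-test on (final, first): tests the mean of final - first
t_paired, p_paired = ttest_rel(final, first)

# Equivalent one-sample t-test on the differences against 0
diffs = final - first
t_diff, p_diff = ttest_1samp(diffs, 0.0)

# Swapping the arguments flips the sign of t but not the two-tailed p-value
t_swapped, p_swapped = ttest_rel(first, final)

print(f"paired:     t = {t_paired:.4f}, p = {p_paired:.4f}")
print(f"one-sample: t = {t_diff:.4f}, p = {p_diff:.4f}")
print(f"swapped:    t = {t_swapped:.4f}, p = {p_swapped:.4f}")
```

The sign of the t statistic therefore depends on the argument order, which matters as soon as the test becomes one-tailed.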


In [34]:
from scipy.stats import ttest_rel

t_stat, p_value = ttest_rel(df["G3"], df["G1"])
print(f"T-statistic: {t_stat:.2f}")
print(f"P-value: {p_value:.4f}")


T-statistic: -3.55
P-value: 0.0004


In [21]:
# Define new hypotheses
h0 = "Students' G3 performance is the same as their G1 performance"
h1 = "Students' G3 performance is different from their G1 performance"

# Perform hypothesis testing
# (single quotes inside the f-string keep this valid on Python versions before 3.12)
print(f"G3 scores Mean = {df['G3'].mean()} , G1  Scores Mean = {df['G1'].mean()}")
print(f"T-statistic = {t_stat}, P-value = {p_value}")
hypothesis_test(h0, h1, p_value)

G3 scores Mean = 10.415189873417722 , G1  Scores Mean = 10.90886075949367
T-statistic = -3.5517031247185855, P-value = 0.00042906738658041643
Reject H₀ → Students' G3 performance is different from their G1 performance


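In [None]:
# right-tailed test (H₁: mean of G3 - G1 > 0)
t_stat, p_value = ttest_rel(df["G3"], df["G1"], alternative="greater")

In [None]:
# left-tailed test (H₁: mean of G3 - G1 < 0)
t_stat, p_value = ttest_rel(df["G3"], df["G1"], alternative="less")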

### One-Way ANOVA (Analysis of Variance)

You're checking if the **means of a numeric variable** (e.g., total grades) **differ across more than two groups** — such as levels of mother's education.

#### Hypotheses:

    H₀ (Null): μ₁ = μ₂ = μ₃ = ... = μₖ  
    H₁ (Alternative): At least one group mean is different

Used when:  
You're comparing **more than two groups** and want to know if **group membership affects** the outcome.

    Example:
    "Does a student’s mother's education level affect their total grade (G1 + G2 + G3)?"
    (We're testing if students whose mothers have different education levels score differently on average.)

In general:  
- Use when your **independent variable is categorical** (e.g., `Medu` with 5 levels, 0 to 4)
- Use when your **dependent variable is numeric** (e.g., total grade)


In [22]:
from scipy.stats import f_oneway
import pandas as pd


# Create a total grade column
df['total_grade'] = df['G1'] + df['G2'] + df['G3']

# Group total grades by mother's education level (Medu)
groups = df.groupby('Medu')['total_grade'].apply(list)


In [35]:
groups

Medu
0                                         [24, 43, 46]
1    [16, 25, 28, 29, 27, 19, 29, 27, 24, 46, 8, 38...
2    [35, 34, 46, 38, 27, 23, 35, 21, 40, 28, 27, 2...
3    [26, 53, 44, 28, 16, 33, 30, 41, 35, 34, 36, 2...
4    [17, 44, 45, 17, 27, 42, 31, 42, 41, 28, 42, 4...
Name: total_grade, dtype: object

In [36]:
# Perform one-way ANOVA
f_stat, p_value = f_oneway(*groups)

print(f"F-statistic: {f_stat:.2f}")
print(f"p-value: {p_value:.4f}")

F-statistic: 6.65
p-value: 0.0000


In [24]:
# Calculate average scores for different levels of mother's education
average_scores_by_education = df.groupby('Medu')['total_grade'].mean()

# Print the results
print(average_scores_by_education)

Medu
0    37.666667
1    27.593220
2    30.650485
3    31.353535
4    35.519084
Name: total_grade, dtype: float64


In [25]:
# Define new hypotheses (the ANOVA above was run on total_grade = G1 + G2 + G3)
h0 = "Mother's level of education has no effect on the total grade (all group means are equal)"
h1 = "Mother's level of education affects the total grade (at least one group mean is different)"

# Perform hypothesis testing
print(f"F-statistic = {f_stat}, P-value = {p_value}")
hypothesis_test(h0, h1, p_value)


F-statistic = 6.646720488531874, P-value = 3.4959926061031416e-05
Reject H₀ → Mother's level of education affects the total grade (at least one group mean is different)


### Chi-Square Test of Independence

You're checking if there is a **relationship between two categorical variables** — like study time level and internet access.

#### Hypotheses:

    H₀ (Null): The two variables are independent (no association)  
    H₁ (Alternative): The two variables are dependent (there is an association)

Used when:  
You want to know if one **categorical variable** is associated with another. The test detects association only; it does not show that one variable causes the other.

    Example:
    "Is there a relationship between internet access and high study time?"
    (Are students with internet more likely to study more, or is there no link?)

In general:

    Use Chi-Square when comparing counts/frequencies between categorical groups

    This test helps you understand whether the distribution of one variable depends on another
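Under independence, the expected count in each cell is (row total × column total) / grand total, and the chi-square statistic sums the squared gaps between observed and expected counts. The sketch below works through this on a hypothetical 2×2 table of counts (made-up numbers, not the student data). It also shows that `chi2_contingency` applies Yates' continuity correction by default when the table has one degree of freedom.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 table of observed counts (illustrative assumption)
observed = np.array([[20, 30],
                     [45, 105]])

# Expected counts if the two variables were independent
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected_manual = row_totals @ col_totals / observed.sum()

# Pearson chi-square statistic without continuity correction
chi2_manual = ((observed - expected_manual) ** 2 / expected_manual).sum()

# SciPy without and with Yates' correction (the default for 2x2 tables)
chi2_plain, p_plain, dof, expected = chi2_contingency(observed, correction=False)
chi2_yates, p_yates, _, _ = chi2_contingency(observed)

print(f"manual: chi2 = {chi2_manual:.4f}")
print(f"scipy (no correction): chi2 = {chi2_plain:.4f}, p = {p_plain:.4f}, dof = {dof}")
print(f"scipy (Yates default): chi2 = {chi2_yates:.4f}, p = {p_yates:.4f}")
```

The corrected statistic is smaller than the uncorrected one, so the default is the more conservative of the two.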





In [26]:
from scipy.stats import chi2_contingency

# Create a binary column for high vs low study time
df["high_study"] = df["studytime"].apply(lambda x: "High" if x >= 3 else "Low")

# Build the contingency table
contingency = pd.crosstab(df["internet"], df["high_study"])

# Perform Chi-Square test of independence
chi2, p, dof, expected = chi2_contingency(contingency)

# Output
print(f"Chi2: {chi2:.2f}, P-value: {p:.4f}")

Chi2: 2.42, P-value: 0.1200


In [27]:
# Define hypotheses
h0 = "There is no association between internet access and high study time (independent)"
h1 = "There is an association between internet access and high study time (dependent)"

# Perform hypothesis testing
print(f"Chi2 = {chi2:.2f}, Degrees of Freedom = {dof}, P-value = {p:.4f}")
hypothesis_test(h0, h1, p)

Chi2 = 2.42, Degrees of Freedom = 1, P-value = 0.1200
Fail to reject H₀ → There is no association between internet access and high study time (independent)


### Two-Sample T-Test (Independent Samples): Alcohol Consumption

You're checking if the **final grade (G3)** differs between **students with high alcohol consumption** and **students with low alcohol consumption**.

#### Two-tailed test (default in `ttest_ind`):

You're testing for any difference — **either** group could have a higher mean.

    H₀ (Null): μ₁ = μ₂  
    H₁ (Alternative): μ₁ ≠ μ₂

Where:
- μ₁ = mean final grade (G3) for students with high alcohol consumption
- μ₂ = mean final grade (G3) for students with low alcohol consumption

Used when:  
You're interested in knowing if there's a difference in grades based on alcohol consumption.

    Example:
    "Do students with high alcohol consumption (Walc > 3) perform differently than those with low alcohol consumption (Walc <= 3)?"

In general:  
- Use this test to compare the means of **two independent groups**.
- The `high_alc` and `low_alc` groups are **independent**, and you're comparing their **mean grades**.

In [28]:
from scipy.stats import ttest_ind

# Create groups based on alcohol consumption
high_alc = df[df["Walc"] > 3]["G3"]
low_alc = df[df["Walc"] <= 3]["G3"]

# Perform two-sample t-test
t_stat, p_value = ttest_ind(high_alc, low_alc)

# Output results
print(f"T-statistic: {t_stat:.2f}, P-value: {p_value:.4f}")

T-statistic: -1.23, P-value: 0.2191


In [29]:
print(f"High Alcohol Consumption Mean: {high_alc.mean():.2f}")
print(f"Low Alcohol Consumption Mean: {low_alc.mean():.2f}")

High Alcohol Consumption Mean: 9.85
Low Alcohol Consumption Mean: 10.56


In [30]:
# hypothesis testing

h0 = "Alcohol consumption has no effect on final scores"
h1 = "Alcohol consumption has an effect on final scores"
print(f"T-statistic = {t_stat:.2f}, P-value = {p_value:.4f}")
hypothesis_test(h0, h1, p_value)

T-statistic = -1.23, P-value = 0.2191
Fail to reject H₀ → Alcohol consumption has no effect on final scores


### Cohen’s d (Effect Size)

You're calculating **Cohen's d**, which is a measure of the **standardized difference** between the means of two groups. It tells you how large the effect of the independent variable is on the dependent variable, in terms of standard deviations.

#### Formula for Cohen’s d:

    Cohen's d = (Mean of Group 1 - Mean of Group 2) / Pooled Standard Deviation

Where:
- **Mean of Group 1**: Mean score for the first group (e.g., male scores)
- **Mean of Group 2**: Mean score for the second group (e.g., female scores)
- **Pooled Standard Deviation**: A single estimate of the within-group standard deviation, built by combining the variances of both groups

Used when:  
You're interested in knowing **how big the difference** is between two groups, not just whether it's statistically significant.

    Example:
    "How big is the difference in scores between male and female students?"

In general:  
- **Cohen’s d** is useful for understanding the **magnitude of an effect**.
- It’s often used alongside a t-test or other statistical tests to give context to the significance.
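When the two groups have different sizes, the pooled standard deviation is usually weighted by each group's degrees of freedom. The helper below is a minimal sketch of that version; `cohens_d` is a hypothetical function name introduced here, and the two small lists are made-up values for illustration.

```python
import numpy as np

def cohens_d(group1, group2):
    """Cohen's d using the sample-size-weighted pooled standard deviation."""
    g1 = np.asarray(group1, dtype=float)
    g2 = np.asarray(group2, dtype=float)
    n1, n2 = g1.size, g2.size
    # Weight each sample variance by its degrees of freedom (n - 1)
    pooled_var = ((n1 - 1) * g1.var(ddof=1) + (n2 - 1) * g2.var(ddof=1)) / (n1 + n2 - 2)
    return (g1.mean() - g2.mean()) / np.sqrt(pooled_var)

# Tiny made-up example: means 13 and 10, both sample variances equal to 2.5
a = [12, 14, 11, 15, 13]
b = [10, 9, 12, 11, 8]
d = cohens_d(a, b)
print(f"Cohen's d: {d:.3f}")
```

For equal group sizes this reduces to the square root of the plain average of the two variances; swapping the groups only flips the sign of d.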


In [31]:

mean_diff = male_scores.mean() - female_scores.mean()

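# Calculate pooled standard deviation as the root of the average of the two
# sample variances (an unweighted form that is exact only for equal group sizes)
pooled_std = np.sqrt((male_scores.std(ddof=1) ** 2 + female_scores.std(ddof=1) ** 2) / 2)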

# Calculate Cohen's d
cohen_d = mean_diff / pooled_std

# Output the result
print(f"Cohen's d: {cohen_d:.2f}")

Cohen's d: 0.21


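Interpretation of Cohen's d (conventional rules of thumb, based on the absolute value):

    |d| ≈ 0.2 → Small effect size (small difference between the groups)

    |d| ≈ 0.5 → Medium effect size (moderate difference)

    |d| ≥ 0.8 → Large effect size (substantial difference)

Here d = 0.21, so the male/female difference in G3 is small even though the two-tailed t-test above found it statistically significant (p ≈ 0.04).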

In general:

    Cohen’s d helps you understand how big the difference is between the two groups, beyond just the statistical significance.

    This value provides insight into whether the difference is meaningful in practical terms, not just in statistical terms.