### One-Sample T-test

The one-sample t-test is used to determine whether the mean of a single sample differs significantly from a known or hypothesized population mean.

#### When to Apply:
- Use when you have one group of observations and you want to compare its mean to a known or hypothesized value.

#### Code Sample:

In [2]:
import numpy as np
from scipy import stats

In [12]:
# Sample data
sample_data = np.random.normal(loc=10, scale=2, size=100)
# sample_data.std()
sample_data

array([14.13847744,  8.51130873, 10.76301342,  8.15389352, 10.80776807,
       12.07471374, 11.82758479, 12.24298422, 11.35386995,  8.71263928,
       10.53061582, 13.6356925 , 11.67003486,  9.3772728 , 11.55588383,
        9.15269299,  7.28008418,  6.97007295,  6.74660512,  8.07520447,
        9.70096757, 10.16537656, 13.11903482, 11.12762602, 12.31883808,
        9.4554467 , 11.06915818, 13.04106853,  9.82112348,  6.08792657,
       10.87655977,  7.63359499,  8.07955384, 11.27889987, 12.20410083,
       10.45596289,  8.49299958, 13.50109259, 12.42045815,  5.39056633,
       12.20477402,  8.58373144,  8.01972275, 11.16061705,  9.53521536,
        9.37970425,  6.91973087,  8.30849975,  7.98213748,  8.99594372,
       13.10408774,  8.96512782,  8.4913333 ,  8.78241597, 10.69176743,
       11.61655308,  7.1144428 , 12.0299466 ,  9.109632  ,  8.04515345,
       12.91218481,  8.68743242, 10.19931155, 10.30559005,  8.6377382 ,
       10.8023848 ,  9.75864923, 10.81804328,  7.59509095, 12.57

In [13]:
# Hypothesized population mean
pop_mean = 9

In [14]:
# Perform one-sample t-test
t_statistic, p_value = stats.ttest_1samp(sample_data, pop_mean)
p_value

6.856568666890962e-06

In [10]:
# Interpret results
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis. The mean of the sample is significantly different from the population mean.")
else:
    print("Fail to reject the null hypothesis. There is no significant difference in means.")

Reject the null hypothesis. The mean of the sample is significantly different from the population mean.
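
To make the computation behind `stats.ttest_1samp` concrete, the sketch below recomputes the statistic by hand as t = (sample mean - hypothesized mean) / (s / sqrt(n)) and compares it with SciPy's result. The names `demo_sample` and `mu0` and the seed are illustrative assumptions, not part of the cells above.

```python
import numpy as np
from scipy import stats

# Illustrative data; the seed is arbitrary and only makes the sketch repeatable.
rng = np.random.default_rng(0)
demo_sample = rng.normal(loc=10, scale=2, size=100)
mu0 = 9  # hypothesized population mean (assumed for this sketch)

n = demo_sample.size
sample_std = demo_sample.std(ddof=1)  # sample standard deviation (n - 1 in the denominator)
t_manual = (demo_sample.mean() - mu0) / (sample_std / np.sqrt(n))
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)  # two-sided p-value

t_scipy, p_scipy = stats.ttest_1samp(demo_sample, mu0)
print(t_manual, t_scipy)
print(p_manual, p_scipy)
```

The two pairs of numbers should agree up to floating-point error, which shows that the default test is two-sided and uses n - 1 degrees of freedom.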


### Two-Sample T-test

The two-sample t-test compares the means of two independent groups to determine if they are significantly different from each other.

#### When to Apply:
- Use when you have two independent groups of observations and you want to compare their means.

#### Code Sample:

In [21]:
# Sample data for two groups
group1 = np.random.normal(loc=10, scale=2, size=100)
group2 = np.random.normal(loc=12, scale=2, size=100)
group1.mean()

10.026935738490176

In [22]:
group2.mean()

12.205749405809673

In [18]:
# Perform two-sample t-test
t_statistic, p_value = stats.ttest_ind(group1, group2)
p_value

1.6245199125611691e-12

In [8]:
# Interpret results
if p_value < alpha:
    print("Reject the null hypothesis. The means of the two groups are significantly different.")
else:
    print("Fail to reject the null hypothesis. There is no significant difference in means.")

Reject the null hypothesis. The means of the two groups are significantly different.
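
By default `stats.ttest_ind` assumes that both groups share the same population variance (`equal_var=True`). When that assumption is doubtful, Welch's variant (`equal_var=False`) is a common alternative. The sketch below uses illustrative data; `demo_a`, `demo_b`, and the seed are assumptions made for this example only.

```python
import numpy as np
from scipy import stats

# Two illustrative groups with clearly different spreads.
rng = np.random.default_rng(0)
demo_a = rng.normal(loc=10, scale=2, size=100)
demo_b = rng.normal(loc=12, scale=4, size=100)

# Welch's t-test: does not pool the two variances.
t_welch, p_welch = stats.ttest_ind(demo_a, demo_b, equal_var=False)

# The same statistic by hand: difference in means over the unpooled standard error.
standard_error = np.sqrt(demo_a.var(ddof=1) / demo_a.size + demo_b.var(ddof=1) / demo_b.size)
t_manual = (demo_a.mean() - demo_b.mean()) / standard_error

print(t_welch, t_manual, p_welch)
```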


### F-test (Variance Ratio Test)

The F-test compares the variances of two samples by taking the ratio of their sample variances and checking it against the F distribution. It assumes both samples come from normally distributed populations.

#### When to Apply:
- Use when you want to compare the variances of two samples.

#### Code Sample:

In [25]:
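# Perform F-test (variance ratio test).
# stats.f_oneway is one-way ANOVA and compares means, so the ratio of
# sample variances is tested against the F distribution directly.
var1, var2 = group1.var(ddof=1), group2.var(ddof=1)
F_statistic = var1 / var2
dfn, dfd = len(group1) - 1, len(group2) - 1
# Two-sided p-value for H0: the two variances are equal
p_value = 2 * min(stats.f.cdf(F_statistic, dfn, dfd), stats.f.sf(F_statistic, dfn, dfd))
group1.var()

4.385625050093605

In [26]:
group2.var()

4.573971543400904

In [24]:
# Interpret results
if p_value < alpha:
    print("Reject the null hypothesis. The variances of the two groups are significantly different.")
else:
    print("Fail to reject the null hypothesis. There is no significant difference in variances.")

Fail to reject the null hypothesis. There is no significant difference in variances.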
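
A point that is easy to miss: `stats.f_oneway` is a one-way ANOVA and tests whether group means are equal, so it is not a variance test. The sketch below contrasts it with a direct variance ratio test on two illustrative samples that share a mean but differ in spread. The helper `variance_ratio_test`, the sample names, and the seed are hypothetical and introduced only for this example; `stats.levene` is included as a more robust alternative when normality is doubtful.

```python
import numpy as np
from scipy import stats

def variance_ratio_test(a, b):
    """Two-sided F-test for equal variances (hypothetical helper; assumes normal data)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    f_stat = a.var(ddof=1) / b.var(ddof=1)
    dfn, dfd = a.size - 1, b.size - 1
    p = 2 * min(stats.f.cdf(f_stat, dfn, dfd), stats.f.sf(f_stat, dfn, dfd))
    return f_stat, min(p, 1.0)

# Same mean, different standard deviations (1 versus 3).
rng = np.random.default_rng(0)
same_mean_a = rng.normal(loc=10, scale=1, size=200)
same_mean_b = rng.normal(loc=10, scale=3, size=200)

f_var, p_var = variance_ratio_test(same_mean_a, same_mean_b)
f_anova, p_anova = stats.f_oneway(same_mean_a, same_mean_b)       # tests means, not variances
levene_stat, p_levene = stats.levene(same_mean_a, same_mean_b)    # robust test of equal variances

print("variance ratio:", f_var, p_var)
print("one-way ANOVA :", f_anova, p_anova)
print("Levene        :", levene_stat, p_levene)
```

The variance ratio and Levene p-values respond to the difference in spread, while the ANOVA p-value only reflects how far apart the two sample means happen to be.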


### ANOVA (Analysis of Variance)

ANOVA is used to compare the means of three or more independent groups to determine if at least one group is different from the others.

#### When to Apply:
- Use when you have three or more independent groups of observations and you want to compare their means.

#### Code Sample:

In [32]:
# Sample data for three groups
group3 = np.random.normal(loc=14, scale=2, size=100)
group3.var()

4.249777240975971

In [33]:
# Perform ANOVA
F_statistic, p_value = stats.f_oneway(group1, group2, group3)
p_value

3.0881881412105206e-31

In [31]:
# Interpret results
if p_value < alpha:
    print("Reject the null hypothesis. At least one group mean is significantly different.")
else:
    print("Fail to reject the null hypothesis. There is no significant difference in group means.")

Reject the null hypothesis. At least one group mean is significantly different.
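
The F statistic reported by `stats.f_oneway` is the ratio of between-group to within-group mean squares. The sketch below rebuilds it from sums of squares and checks it against SciPy. `demo_groups` and the seed are illustrative assumptions that mirror the three simulated groups above without reusing their values.

```python
import numpy as np
from scipy import stats

# Three illustrative groups with means 10, 12, and 14.
rng = np.random.default_rng(0)
demo_groups = [rng.normal(loc=mu, scale=2, size=100) for mu in (10, 12, 14)]

all_values = np.concatenate(demo_groups)
grand_mean = all_values.mean()
k = len(demo_groups)        # number of groups
n_total = all_values.size   # total number of observations

# Between-group and within-group sums of squares.
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in demo_groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in demo_groups)

f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))
p_manual = stats.f.sf(f_manual, k - 1, n_total - k)

f_scipy, p_scipy = stats.f_oneway(*demo_groups)
print(f_manual, f_scipy)
print(p_manual, p_scipy)
```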


### Chi-Square Test

The Chi-Square test of independence is used to determine if there is a significant association between two categorical variables.

#### When to Apply:
- Use when you have categorical data and you want to test if there is a relationship between two variables.

#### Code Sample:

In [44]:
# Create contingency table
observed = np.array([[30, 20, 10],
                     [30, 20, 20]])
observed

array([[30, 20, 10],
       [30, 20, 20]])

In [42]:
# Perform Chi-Square test
chi2_stat, p_value, dof, expected = stats.chi2_contingency(observed)
p_value

0.27535818451244715

In [43]:
# Interpret results
if p_value < alpha:
    print("Reject the null hypothesis. There is a significant association between the two variables.")
else:
    print("Fail to reject the null hypothesis. There is no significant association.")

Fail to reject the null hypothesis. There is no significant association.
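
As a check on what `stats.chi2_contingency` computes, the sketch below derives the expected counts, the statistic, and the p-value by hand for the same `observed` table. The variable names ending in `_manual` are introduced here for illustration. For this 2x3 table the degrees of freedom are 2, so SciPy applies no continuity correction and the two results should match.

```python
import numpy as np
from scipy import stats

observed = np.array([[30, 20, 10],
                     [30, 20, 20]])

# Expected counts under independence: (row total * column total) / grand total.
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected_manual = row_totals * col_totals / observed.sum()

chi2_manual = ((observed - expected_manual) ** 2 / expected_manual).sum()
dof_manual = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_manual = stats.chi2.sf(chi2_manual, dof_manual)

chi2_scipy, p_scipy, dof_scipy, expected_scipy = stats.chi2_contingency(observed)
print(chi2_manual, chi2_scipy)
print(p_manual, p_scipy)
```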


### Z-test

The Z-test is used to compare a sample mean to a population mean when the population standard deviation is known.

#### When to Apply:
- Use when the population standard deviation is known and you want to compare a sample mean to the population mean.

#### Code Sample:

In [45]:
# Sample data
sample_data = np.random.normal(loc=10, scale=2, size=100)

In [46]:
# Population parameters
pop_mean = 9
pop_std = 2

In [47]:
# Perform Z-test
z_score = (np.mean(sample_data) - pop_mean) / (pop_std / np.sqrt(len(sample_data)))
z_score

5.163942582318182

In [48]:
# Calculate two-sided p-value (the alternative is "different from", so both tails count)
p_value = 2 * stats.norm.sf(abs(z_score))
print(f"{p_value:.3e}")

2.418e-07

In [39]:
# Interpret results
if p_value < alpha:
    print("Reject the null hypothesis. The mean of the sample is significantly different from the population mean.")
else:
    print("Fail to reject the null hypothesis. There is no significant difference in means.")

Reject the null hypothesis. The mean of the sample is significantly different from the population mean.
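
A "different from" alternative is two-sided, so the tail probability `stats.norm.sf(abs(z))` has to be doubled; on its own it is a one-sided p-value. The helper below wraps both cases. `z_test` and its `two_sided` flag are hypothetical names introduced for this sketch, not a SciPy API.

```python
import numpy as np
from scipy import stats

def z_test(sample, pop_mean, pop_std, two_sided=True):
    """One-sample z-test with a known population standard deviation (illustrative helper)."""
    sample = np.asarray(sample, dtype=float)
    standard_error = pop_std / np.sqrt(sample.size)
    z = (sample.mean() - pop_mean) / standard_error
    tail = stats.norm.sf(abs(z))  # probability in one tail beyond |z|
    return z, (2 * tail if two_sided else tail)

# Tiny worked case: mean 10, n = 4, so z = (10 - 9) / (2 / 2) = 1.
z_demo, p_two_sided = z_test([10, 10, 10, 10], pop_mean=9, pop_std=2)
_, p_one_sided = z_test([10, 10, 10, 10], pop_mean=9, pop_std=2, two_sided=False)
print(z_demo, p_two_sided, p_one_sided)
```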


### Step 1: Load the Dataset

In [49]:
import pandas as pd

In [50]:
# Load the dataset
transport_data = pd.read_csv('synthetic_data.csv')

In [51]:
# Display the first few rows of the dataset
# print(transport_data.head())
transport_data.head()

| Unnamed: 0 | Date | Time | Stop/Station | Passenger_Count | Vehicle_ID | Latitude | Longitude | Temperature (°C) | Precipitation (mm) | Humidity (%) | Age_Group | Gender | Feedback |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2023-11-16 | 12:03 | Johor Bahru | 46 | TRAIN82 | 3.906935 | 106.068464 | 11 | 3 | 63 | 18-24 | Male | Driver was friendly |
| 1 | 2023-07-14 | 05:07 | Cameron Highlands | 17 | TRAIN65 | 4.227106 | 118.407191 | 3 | 3 | 74 | 25-40 | Female | Seats were uncomfortable |
| 2 | 2023-09-22 | 14:11 | Ipoh | 91 | TRAIN38 | 6.819556 | 101.272984 | 27 | 1 | 81 | 40-60 | Male | Delay in departure |
| 3 | 2022-07-12 | 09:11 | Penang | 41 | BUS245 | 3.627521 | 106.22699 | 1 | 7 | 98 | 25-40 | Female | Driver was friendly |
| 4 | 2023-12-09 | 16:59 | Kuching | 53 | BUS958 | 1.418952 | 117.050925 | 15 | 9 | 71 | 40-60 | Male | Service was excellent |


### Step 2: One-Sample T-test

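Let's test whether the mean of the 'Passenger_Count' column is significantly different from a specific value (e.g., 50). Note that the cell below tests against 55.751, which is the sample mean itself (see Step 7), so the t-statistic is zero and the p-value is 1.0 by construction. Substitute a value of practical interest to get an informative test.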

In [56]:
from scipy import stats

In [57]:
# One-sample t-test
# The hypothesized mean used here equals the sample mean, so p = 1.0 by construction.
hypothesized_mean = 55.751
t_statistic, p_value = stats.ttest_1samp(transport_data['Passenger_Count'], hypothesized_mean)
p_value

1.0

In [58]:
# Interpret results
alpha = 0.05
if p_value < alpha:
    print(f"Reject the null hypothesis. The mean of Passenger_Count is significantly different from {hypothesized_mean}.")
else:
    print("Fail to reject the null hypothesis. There is no significant difference in means.")

Fail to reject the null hypothesis. There is no significant difference in means.
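
A p-value of exactly 1.0 is what a one-sample t-test returns whenever the hypothesized mean equals the sample mean, because the numerator of the t-statistic is then zero. The sketch below illustrates this with only the five `Passenger_Count` values displayed in Step 1, used as stand-in data rather than the full dataset. The name `demo_counts` is an assumption made for this example.

```python
import numpy as np
from scipy import stats

# The five Passenger_Count values shown in the Step 1 preview (stand-in data only).
demo_counts = np.array([46, 17, 91, 41, 53])

# Testing against the sample's own mean: the numerator of t is zero.
t_self, p_self = stats.ttest_1samp(demo_counts, demo_counts.mean())

# Testing against an externally chosen value gives an informative result.
t_50, p_50 = stats.ttest_1samp(demo_counts, 50)

print(t_self, p_self)
print(t_50, p_50)
```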


### Step 3: Two-Sample T-test

Let's compare the mean 'Temperature (°C)' for two different 'Feedback' groups (e.g., "Driver was friendly" and "Seats were uncomfortable").

In [59]:
# Separate data into two groups based on feedback
group1 = transport_data[transport_data['Feedback'] == "Driver was friendly"]['Temperature (°C)']
group2 = transport_data[transport_data['Feedback'] == "Seats were uncomfortable"]['Temperature (°C)']

In [62]:
# Perform two-sample t-test
t_statistic, p_value = stats.ttest_ind(group1, group2)
group1.mean()

14.754901960784315

In [63]:
group2.mean()

14.236559139784946

In [61]:
# Interpret results
if p_value < alpha:
    print("Reject the null hypothesis. The means of the two feedback groups are significantly different.")
else:
    print("Fail to reject the null hypothesis. There is no significant difference in means.")

Fail to reject the null hypothesis. There is no significant difference in means.


### Step 4: F-test (Variance Ratio Test)

Let's compare the variances of 'Passenger_Count' for two different 'Age_Group' categories (e.g., "18-24" and "25-40").

In [64]:
# Separate data into two groups based on age group
group1 = transport_data[transport_data['Age_Group'] == "18-24"]['Passenger_Count']
group2 = transport_data[transport_data['Age_Group'] == "25-40"]['Passenger_Count']

In [72]:
# Perform F-test (variance ratio test).
# stats.f_oneway is one-way ANOVA and compares means, so the ratio of
# sample variances is tested against the F distribution directly.
var1, var2 = group1.var(), group2.var()  # pandas uses ddof=1 by default
F_statistic = var1 / var2
dfn, dfd = len(group1) - 1, len(group2) - 1
# Two-sided p-value for H0: the two variances are equal
p_value = 2 * min(stats.f.cdf(F_statistic, dfn, dfd), stats.f.sf(F_statistic, dfn, dfd))
group1.var()

652.189561176155

In [75]:
# Interpret results
if p_value < alpha:
    print("Reject the null hypothesis. The variances of the two age groups are significantly different.")
else:
    print("Fail to reject the null hypothesis. There is no significant difference in variances.")


### Step 5: ANOVA (Analysis of Variance)

Let's compare the means of 'Temperature (°C)' for different 'Stop/Station' categories.

In [82]:
# Perform ANOVA
F_statistic, p_value = stats.f_oneway(*[group['Temperature (°C)'] for name, group in transport_data.groupby('Stop/Station')])
p_value

0.6561408487926885

In [83]:
# Interpret results
if p_value < alpha:
    print("Reject the null hypothesis. At least one Stop/Station mean is significantly different.")
else:
    print("Fail to reject the null hypothesis. There is no significant difference in Stop/Station means.")

Fail to reject the null hypothesis. There is no significant difference in Stop/Station means.
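
The one-line `groupby` expression used for the ANOVA call is dense, so the sketch below unpacks it on a small stand-in DataFrame. The values in `demo` are invented for illustration and are not taken from `synthetic_data.csv`; only the column names follow the dataset.

```python
import pandas as pd
from scipy import stats

# Hypothetical stand-in data with the same column names as the dataset.
demo = pd.DataFrame({
    "Stop/Station": ["Ipoh", "Ipoh", "Ipoh",
                     "Penang", "Penang", "Penang",
                     "Kuching", "Kuching", "Kuching"],
    "Temperature (°C)": [27, 25, 30, 1, 4, 3, 15, 14, 18],
})

# groupby yields (name, sub-DataFrame) pairs; keep one temperature array per station.
samples = {
    name: group["Temperature (°C)"].to_numpy()
    for name, group in demo.groupby("Stop/Station")
}

# f_oneway takes each group as a separate positional argument, hence the * unpacking.
F_demo, p_demo = stats.f_oneway(*samples.values())
print(sorted(samples), F_demo, p_demo)
```

Keeping the groups in a dictionary also makes it easy to inspect group sizes before running the test.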


### Step 6: Chi-Square Test

Let's test the association between 'Age_Group' and 'Gender'.

In [84]:
# Create contingency table
contingency_table = pd.crosstab(transport_data['Age_Group'], transport_data['Gender'])
contingency_table

| Age_Group | Female | Male |
|---|---|---|
| 18-24 | 149 | 186 |
| 25-40 | 144 | 159 |
| 40-60 | 173 | 189 |


In [86]:
# Perform Chi-Square test
chi2_stat, p_value, dof, expected = stats.chi2_contingency(contingency_table)
p_value

0.6323697792457594

In [85]:
# Interpret results
if p_value < alpha:
    print("Reject the null hypothesis. There is a significant association between Age_Group and Gender.")
else:
    print("Fail to reject the null hypothesis. There is no significant association.")

Fail to reject the null hypothesis. There is no significant association.


### Step 7: Z-test

Let's compare the mean 'Passenger_Count' to a population mean of 60, assuming a population standard deviation of 15.

In [90]:
# Sample data
sample_data = transport_data['Passenger_Count']

# Population parameters
pop_mean = 60
pop_std = 15
sample_mean = sample_data.mean()
n = sample_data.count()
print(f"mean of sample data is {sample_mean} and length of sample data is {n}")

mean of sample data is 55.751 and length of sample data is 1000


In [97]:
# Perform Z-test
z_score = (sample_data.mean() - pop_mean) / (pop_std / np.sqrt(len(sample_data)))
z_score

-8.957678518703634

In [98]:
# Calculate two-sided p-value (the alternative is "different from", so both tails count)
p_value = 2 * stats.norm.sf(abs(z_score))

In [99]:
# Interpret results
if p_value < alpha:
    print("Reject the null hypothesis. The mean of the sample is significantly different from the population mean.")
else:
    print("Fail to reject the null hypothesis. There is no significant difference in means.")

Reject the null hypothesis. The mean of the sample is significantly different from the population mean.
