# **HYPOTHESIS TESTING**
Hypothesis testing is a statistical method for assessing whether a pattern observed in the data is likely to reflect a real effect or could plausibly have arisen by chance. In this project, it is used to examine relationships and differences among weather variables and air quality measures, with a significance level of 0.05 for every test.

In [2]:
import pandas as pd
df=pd.read_csv(r"C:\Users\NOOR AL MUSABAH\Desktop\eda-project\data\cleaned_day2.csv",parse_dates=["Date"])
df.head()

Unnamed: 0,City,Country,Date,PM2.5,PM10,NO2,SO2,CO,O3,Temperature,Humidity,Wind Speed
0,Bangkok,Thailand,2023-03-19,86.57,25.19,99.88,30.63,4.46,36.29,17.67,59.35,13.76
1,Istanbul,Turkey,2023-02-16,50.63,97.39,48.14,8.71,3.4,144.16,3.46,67.51,6.36
2,Rio de Janeiro,Brazil,2023-11-13,130.21,57.22,98.51,9.92,0.12,179.31,25.29,29.3,12.87
3,Mumbai,India,2023-03-16,119.7,130.52,10.96,33.03,7.74,38.65,23.15,99.97,7.71
4,Paris,France,2023-04-04,55.2,36.62,76.85,21.85,2.0,67.09,16.02,90.28,14.16


In [5]:
df.dtypes


City                     object
Country                  object
Date             datetime64[ns]
PM2.5                   float64
PM10                    float64
NO2                     float64
SO2                     float64
CO                      float64
O3                      float64
Temperature             float64
Humidity                float64
Wind Speed              float64
HumidityLevel          category
dtype: object

## **Chi-Square Test**
- A Chi-Square Test of Independence was conducted to examine whether humidity levels vary across different days of the week.  
- H₀: Humidity level is independent of the day of the week.  
- H₁: Humidity level is dependent on the day of the week. 


In [28]:

from scipy.stats import chi2_contingency

# Derive the weekday name and bin humidity (%) into three levels.
df['DayOfWeekName'] = df['Date'].dt.day_name()
df['HumidityLevel'] = pd.cut(df['Humidity'], bins=[0,30,60,100], labels=['Low','Medium','High'])

# Cross-tabulate humidity level by weekday and run the test of independence.
contingency = pd.crosstab(df['HumidityLevel'], df['DayOfWeekName'])
chi2, p, dof, expected = chi2_contingency(contingency)
print(f"Chi-square: chi2={chi2:.2f}, p-value={p:.4f}")

if p < 0.05:
    print("Humidity level distribution depends on day of week")
else:
    print("Humidity level distribution is independent of day of week")



Chi-square: chi2=5.22, p-value=0.9501
Humidity level distribution is independent of day of week



- Since the p-value (0.9501) is greater than 0.05, we fail to reject the null hypothesis.
- There is no statistically significant association between humidity level and day of the week in this dataset.
- This does not prove that humidity is independent of the weekday; it only means the data provide no evidence of a dependence.
- The result is plausible, because humidity is driven by weather conditions such as temperature, rainfall, and atmospheric pressure rather than by the calendar day.
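The chi-square result above can be complemented with two checks: whether the expected cell counts are large enough for the chi-square approximation to hold, and how strong the association is (Cramér's V). The following is a minimal sketch on synthetic data, not the project's dataset; the `demo` frame, the random seed, and the sample size are assumptions made only for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Synthetic stand-in data (assumption): humidity and weekday are drawn independently.
rng = np.random.default_rng(0)
n = 2000
days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
demo = pd.DataFrame({
    "Humidity": rng.uniform(10, 100, size=n),
    "DayOfWeekName": rng.choice(days, size=n),
})
demo["HumidityLevel"] = pd.cut(demo["Humidity"], bins=[0, 30, 60, 100],
                               labels=["Low", "Medium", "High"])

table = pd.crosstab(demo["HumidityLevel"], demo["DayOfWeekName"])
chi2, p, dof, expected = chi2_contingency(table)

# Common rule of thumb: expected counts should be at least 5 in every cell.
min_expected = expected.min()

# Cramér's V: effect size between 0 (no association) and 1 (perfect association).
n_obs = table.to_numpy().sum()
k = min(table.shape) - 1
cramers_v = np.sqrt(chi2 / (n_obs * k))

print(f"dof={dof}, min expected count={min_expected:.1f}, Cramér's V={cramers_v:.3f}")
```

For a 3 x 7 table the degrees of freedom are (3 - 1) x (7 - 1) = 12. A Cramér's V close to 0 would support the reading that any association is negligible in size, not merely non-significant.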

In [20]:

df.head()

Unnamed: 0,City,Country,Date,PM2.5,PM10,NO2,SO2,CO,O3,Temperature,Humidity,Wind Speed,HumidityLevel,DayOfWeekName
0,Bangkok,Thailand,2023-03-19,86.57,25.19,99.88,30.63,4.46,36.29,17.67,59.35,13.76,Medium,Sunday
1,Istanbul,Turkey,2023-02-16,50.63,97.39,48.14,8.71,3.4,144.16,3.46,67.51,6.36,High,Thursday
2,Rio de Janeiro,Brazil,2023-11-13,130.21,57.22,98.51,9.92,0.12,179.31,25.29,29.3,12.87,Low,Monday
3,Mumbai,India,2023-03-16,119.7,130.52,10.96,33.03,7.74,38.65,23.15,99.97,7.71,High,Thursday
4,Paris,France,2023-04-04,55.2,36.62,76.85,21.85,2.0,67.09,16.02,90.28,14.16,High,Tuesday


## **Independent T-Test**
- An independent two-sample t-test was conducted to examine whether mean NO₂ levels differ between India and the USA.
- H₀: There is no difference in the mean NO₂ levels between India and the USA.
- H₁: There is a difference in the mean NO₂ levels between India and the USA.

In [23]:
from scipy.stats import ttest_ind

# Select the NO2 readings for each country.
india = df[df['Country'] == 'India']['NO2']
usa = df[df['Country'] == 'USA']['NO2']

# Two-sample t-test; by default ttest_ind assumes equal variances in both groups.
t_stat, p = ttest_ind(india, usa)
print("T-statistic:", t_stat)
print("p-value:", p)
if p < 0.05:
    print("Significant difference in NO2 between India and USA")
else:
    print("No significant difference in NO2 between India and USA")

T-statistic: 0.7013957637675777
p-value: 0.483164864260359
No significant difference in NO2 between India and USA


 
- Since the p-value (0.483) is greater than 0.05, we fail to reject the null hypothesis (H₀).
- There is no statistically significant difference in the mean NO₂ levels between India and the USA.
- This comparison therefore gives no evidence that mean NO₂ levels differ between these two countries; other country pairs were not tested.
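The t-test above uses SciPy's default, which assumes equal variances in both groups. When group sizes or variances differ, Welch's variant (`equal_var=False`) is a common alternative, and Cohen's d gives a sense of how large the difference is. The sketch below uses synthetic groups, not the India and USA samples; `group_a`, `group_b`, and the `cohens_d` helper are hypothetical names introduced for illustration.

```python
import numpy as np
from scipy.stats import ttest_ind

# Synthetic groups (assumption): same mean, different spread and size.
rng = np.random.default_rng(1)
group_a = rng.normal(loc=50, scale=10, size=300)
group_b = rng.normal(loc=50, scale=25, size=120)

student = ttest_ind(group_a, group_b)                 # equal variances assumed
welch = ttest_ind(group_a, group_b, equal_var=False)  # Welch's t-test

def cohens_d(x, y):
    """Standardized mean difference using the pooled sample standard deviation."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * np.var(x, ddof=1) + (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2)
    return (np.mean(x) - np.mean(y)) / np.sqrt(pooled_var)

print(f"Student: t={student.statistic:.3f}, p={student.pvalue:.4f}")
print(f"Welch:   t={welch.statistic:.3f}, p={welch.pvalue:.4f}")
print(f"Cohen's d={cohens_d(group_a, group_b):.3f}")
```

If both variants agree and the effect size is small, the conclusion of no meaningful difference is more robust than a single p-value suggests.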


In [62]:
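# Simplified composite score: weighted sum of pollutant concentrations (weights sum to 1).
# This is a project-specific proxy, not an official AQI calculation.
df['AQI'] = (df['PM2.5']*0.52 + df['PM10']*0.28 + df['NO2']*0.12 + df['SO2']*0.08).round(0)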


In [63]:
df

Unnamed: 0,City,Country,Date,PM2.5,PM10,NO2,SO2,CO,O3,Temperature,Humidity,Wind Speed,AQI
0,Bangkok,Thailand,2023-03-19,86.57,25.19,99.88,30.63,4.46,36.29,17.67,59.35,13.76,67.0
1,Istanbul,Turkey,2023-02-16,50.63,97.39,48.14,8.71,3.40,144.16,3.46,67.51,6.36,60.0
2,Rio de Janeiro,Brazil,2023-11-13,130.21,57.22,98.51,9.92,0.12,179.31,25.29,29.30,12.87,96.0
3,Mumbai,India,2023-03-16,119.70,130.52,10.96,33.03,7.74,38.65,23.15,99.97,7.71,103.0
4,Paris,France,2023-04-04,55.20,36.62,76.85,21.85,2.00,67.09,16.02,90.28,14.16,50.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,Johannesburg,South Africa,2023-09-16,147.85,184.34,90.33,34.93,2.81,191.45,-1.92,65.22,15.48,142.0
9996,Berlin,Germany,2023-12-05,12.22,121.49,49.04,5.66,2.10,184.56,-9.81,12.16,10.75,47.0
9997,Moscow,Russia,2023-11-26,44.07,143.62,8.41,32.58,0.69,167.68,39.35,53.95,4.56,67.0
9998,Berlin,Germany,2023-02-03,67.43,96.79,43.23,29.19,6.01,148.50,26.21,58.54,2.71,70.0
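The `AQI` column above is a weighted sum of four pollutant concentrations. As a sanity check, the sketch below recomputes the score by hand for three rows shown in the output. The `composite_score` helper is a hypothetical name; the weights and row values are copied from the notebook. Note that official indices (for example the US EPA AQI) are built from per-pollutant breakpoint tables and take the maximum sub-index, so this score should be read as a simplified proxy.

```python
# Weights taken from the notebook's AQI formula.
weights = {"PM2.5": 0.52, "PM10": 0.28, "NO2": 0.12, "SO2": 0.08}

# Pollutant values copied from rows 0, 3, and 4 of the displayed DataFrame.
bangkok = {"PM2.5": 86.57, "PM10": 25.19, "NO2": 99.88, "SO2": 30.63}
mumbai = {"PM2.5": 119.70, "PM10": 130.52, "NO2": 10.96, "SO2": 33.03}
paris = {"PM2.5": 55.20, "PM10": 36.62, "NO2": 76.85, "SO2": 21.85}

def composite_score(row, weights):
    """Weighted sum of pollutant concentrations, rounded to the nearest integer."""
    return round(sum(row[name] * w for name, w in weights.items()))

for city, row in [("Bangkok", bangkok), ("Mumbai", mumbai), ("Paris", paris)]:
    print(city, composite_score(row, weights))
# Bangkok 67
# Mumbai 103
# Paris 50
```

These values match the `AQI` column in the output above (67.0, 103.0, and 50.0).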


In [42]:
df['AQI_Category'] = pd.cut(df['AQI'], 
                           bins=[0,50,100,150,200,300,500], 
                           labels=['Good','Moderate','Unhealthy(Sensitive)','Unhealthy','Very Unhealthy','Hazardous'],
                           include_lowest=True)

In [44]:
df.head()

Unnamed: 0,City,Country,Date,PM2.5,PM10,NO2,SO2,CO,O3,Temperature,Humidity,Wind Speed,HumidityLevel,DayOfWeekName,AQI,AQI_Category
0,Bangkok,Thailand,2023-03-19,86.57,25.19,99.88,30.63,4.46,36.29,17.67,59.35,13.76,Medium,Sunday,67.0,Moderate
1,Istanbul,Turkey,2023-02-16,50.63,97.39,48.14,8.71,3.4,144.16,3.46,67.51,6.36,High,Thursday,60.0,Moderate
2,Rio de Janeiro,Brazil,2023-11-13,130.21,57.22,98.51,9.92,0.12,179.31,25.29,29.3,12.87,Low,Monday,96.0,Moderate
3,Mumbai,India,2023-03-16,119.7,130.52,10.96,33.03,7.74,38.65,23.15,99.97,7.71,High,Thursday,103.0,Unhealthy(Sensitive)
4,Paris,France,2023-04-04,55.2,36.62,76.85,21.85,2.0,67.09,16.02,90.28,14.16,High,Tuesday,50.0,Good


In [48]:
df['AQI_Category'].value_counts()

AQI_Category
Moderate                6068
Unhealthy(Sensitive)    2198
Good                    1734
Unhealthy                  0
Very Unhealthy             0
Hazardous                  0
Name: count, dtype: int64
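The zero counts above appear because `pd.cut` creates a categorical column that keeps every label, including bins with no observations. The sketch below shows how the bin edges behave on a few hand-picked scores (an assumption, chosen to sit on the edges): intervals are closed on the right, and `include_lowest=True` lets a score of exactly 0 fall into the first bin.

```python
import pandas as pd

labels = ['Good', 'Moderate', 'Unhealthy(Sensitive)', 'Unhealthy', 'Very Unhealthy', 'Hazardous']

# Hand-picked scores on and around the bin edges (illustrative values only).
scores = pd.Series([0, 50, 51, 100, 101, 150])
cats = pd.cut(scores, bins=[0, 50, 100, 150, 200, 300, 500],
              labels=labels, include_lowest=True)

print(cats.tolist())
# ['Good', 'Good', 'Moderate', 'Moderate', 'Unhealthy(Sensitive)', 'Unhealthy(Sensitive)']

# value_counts on a categorical keeps unused categories with a count of 0.
counts = cats.value_counts()

# Dropping unused categories leaves only the bins that actually occur.
observed = cats.cat.remove_unused_categories()
print(list(observed.cat.categories))
# ['Good', 'Moderate', 'Unhealthy(Sensitive)']
```

Removing unused categories before grouping or plotting avoids empty groups in later steps.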

## **ANOVA Test**
- A one-way ANOVA test was conducted to examine whether mean temperature differs across AQI categories (Good, Moderate, Unhealthy(Sensitive)).
- H₀: The average temperature is the same across all AQI categories.
- H₁: At least one AQI category has a different average temperature.

In [50]:
from scipy.stats import f_oneway
good_temp = df[df['AQI_Category'] == 'Good']['Temperature']
moderate_temp = df[df['AQI_Category'] == 'Moderate']['Temperature']
unhealthy_temp = df[df['AQI_Category'] == 'Unhealthy(Sensitive)']['Temperature']

f_stat, p_value = f_oneway(good_temp, moderate_temp, unhealthy_temp)

print("F-statistic:", round(f_stat, 3))
print("p-value:", round(p_value, 6))
if p_value < 0.05:
    print("We REJECT H₀: Temperatures differ between AQI categories")
else:
    print("We FAIL to reject H₀: No significant difference in temperatures")

F-statistic: 0.006
p-value: 0.993575
We FAIL to reject H₀: No significant difference in temperatures


- Since the p-value (0.994) is greater than 0.05, we fail to reject the null hypothesis.
- There is no statistically significant difference in mean temperature between the three AQI categories.
- Only three categories were compared because no observations fall into the Unhealthy, Very Unhealthy, or Hazardous bins.
- This is reasonable, because temperature is driven by weather patterns and seasons rather than by the AQI category.
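An F-statistic near zero, as reported above, means that almost none of the variation in temperature lies between the groups. Eta squared (between-group sum of squares divided by total sum of squares) expresses this as a proportion. The sketch below uses synthetic groups with unequal sizes, not the project's data; the group sizes, seed, and the `eta_squared` helper are assumptions for illustration.

```python
import numpy as np
from scipy.stats import f_oneway

# Synthetic groups (assumption): three groups drawn from the same distribution.
rng = np.random.default_rng(2)
groups = [rng.normal(loc=20, scale=8, size=n) for n in (170, 600, 220)]

f_stat, p_value = f_oneway(*groups)

def eta_squared(groups):
    """Share of total variance explained by group membership (0 to 1)."""
    all_values = np.concatenate(groups)
    grand_mean = all_values.mean()
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_total = ((all_values - grand_mean) ** 2).sum()
    return ss_between / ss_total

eta2 = eta_squared(groups)
print(f"F={f_stat:.3f}, p={p_value:.4f}, eta squared={eta2:.5f}")
```

An eta squared close to 0 indicates that group membership explains a negligible share of the variance, which is consistent with failing to reject H₀.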

In [53]:
df.to_csv("final_cleaned_day4.csv",index=False)

# **INSIGHTS**
- Implementing air quality monitoring systems and pollution control measures tailored to city-specific trends could improve public health.
- The cleaned dataset contains 0 duplicates, 0 missing values, and 0 outliers, so it is ready for further analysis.
- Most measurements (6,068 of 10,000) fall in the Moderate AQI category, which is acceptable for most people but can be risky for sensitive groups such as people with asthma.
- Although O₃ and PM10 show the highest concentrations, PM2.5 mitigation should be prioritized because of its greater health impact: fine particles penetrate deep into the lungs and bloodstream and are associated with heart attacks, strokes, and premature deaths.
- The overall distribution of pollutant levels resembles a realistic urban pollution pattern.
- The monthly global average PM2.5 levels show moderate seasonal fluctuation, with peaks in January and August and a notable dip in September.
- Although extreme pollution events appear limited, the average PM2.5 concentration falls within the moderate pollution category, which still poses significant long-term health risks, particularly for sensitive groups such as children, the elderly, and individuals with respiratory conditions.
- NO₂ is mainly produced by fuel combustion, especially in vehicles, power plants, and industry.
- Air pollutant levels are largely independent of each other and of Temperature and Humidity in this dataset.

## **RECOMMENDATIONS**
- Plant high-efficiency trees (Neem, Peepal, Silver Birch) along urban roadsides. Urban trees are estimated to remove 5-15 tons of PM2.5 per km² annually, which could help move Moderate AQI toward Good AQI.
- Electrify 20% of the vehicle fleet, targeting a 15% reduction in PM2.5.
- Increase the share of trips made by public transport.
- Encourage cycling and walking by investing in the supporting infrastructure.
- Enforce environmental protection laws.
- Air pollution control requires technological innovation, strict regulations, sustainable urban planning, and public awareness.