# DAY 1: DATASET UNDERSTANDING & PROFILING  
## Global Air Quality EDA Project  

## Business Context  

**Domain:** Environmental Science / Public Health  
**Dataset:** Global city air quality measurements (~10,000 records)  
**Objective:** Analyze global air pollution data to identify pollution patterns, seasonal variations, and city-level differences.


## Dataset Overview  

The dataset contains air quality measurements collected from major cities worldwide. It includes key pollutant indicators such as:

- **PM2.5 & PM10** – Fine and coarse particulate matter concentrations  
- **NO₂, SO₂, CO, O₃** – Major gaseous pollutants  
- **Temperature, Humidity, Wind Speed** – Supporting meteorological variables  

These indicators help evaluate environmental conditions and understand factors influencing pollution levels.


## Problem Statement  

Air pollution is one of the most significant global environmental and public health challenges. Rapid urbanization, industrial growth, and transportation expansion have led to increased pollutant concentrations in urban areas.

However, pollution levels vary across cities and time periods, making it challenging for stakeholders to:

- Identify high-risk regions  
- Monitor pollutant fluctuations  
- Compare environmental conditions across locations  
- Support data-driven policy decisions  

This project aims to convert raw environmental measurements into actionable insights that assist environmental agencies, urban planners, and public health authorities in making informed decisions.


In [17]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

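# Load the dataset (local path on the author's machine)
df = pd.read_csv(r"C:\Users\NOOR AL MUSABAH\Downloads\global_air_quality_data_10000.csv")
# Report the dimensions, preview the first rows, and list each column's dtype
print(f"\n Dataset Shape : {df.shape[0]:,} rows & {df.shape[1]} columns")
print("\n First 5 rows :")
display(df.head())
print("Data Types : ")
print(df.dtypes)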




 Dataset Shape : 10,000 rows & 12 columns

 First 5 rows :


| | City | Country | Date | PM2.5 | PM10 | NO2 | SO2 | CO | O3 | Temperature | Humidity | Wind Speed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Bangkok | Thailand | 2023-03-19 | 86.57 | 25.19 | 99.88 | 30.63 | 4.46 | 36.29 | 17.67 | 59.35 | 13.76 |
| 1 | Istanbul | Turkey | 2023-02-16 | 50.63 | 97.39 | 48.14 | 8.71 | 3.4 | 144.16 | 3.46 | 67.51 | 6.36 |
| 2 | Rio de Janeiro | Brazil | 2023-11-13 | 130.21 | 57.22 | 98.51 | 9.92 | 0.12 | 179.31 | 25.29 | 29.3 | 12.87 |
| 3 | Mumbai | India | 2023-03-16 | 119.7 | 130.52 | 10.96 | 33.03 | 7.74 | 38.65 | 23.15 | 99.97 | 7.71 |
| 4 | Paris | France | 2023-04-04 | 55.2 | 36.62 | 76.85 | 21.85 | 2.0 | 67.09 | 16.02 | 90.28 | 14.16 |


Data Types : 
City            object
Country         object
Date            object
PM2.5          float64
PM10           float64
NO2            float64
SO2            float64
CO             float64
O3             float64
Temperature    float64
Humidity       float64
Wind Speed     float64
dtype: object


### **Information about the Dataset**

In [19]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   City         10000 non-null  object 
 1   Country      10000 non-null  object 
 2   Date         10000 non-null  object 
 3   PM2.5        10000 non-null  float64
 4   PM10         10000 non-null  float64
 5   NO2          10000 non-null  float64
 6   SO2          10000 non-null  float64
 7   CO           10000 non-null  float64
 8   O3           10000 non-null  float64
 9   Temperature  10000 non-null  float64
 10  Humidity     10000 non-null  float64
 11  Wind Speed   10000 non-null  float64
dtypes: float64(9), object(3)
memory usage: 937.6+ KB


### **Data Type Conversion**

In [25]:
df['Date']=pd.to_datetime(df['Date'],errors='coerce')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   City         10000 non-null  object        
 1   Country      10000 non-null  object        
 2   Date         10000 non-null  datetime64[ns]
 3   PM2.5        10000 non-null  float64       
 4   PM10         10000 non-null  float64       
 5   NO2          10000 non-null  float64       
 6   SO2          10000 non-null  float64       
 7   CO           10000 non-null  float64       
 8   O3           10000 non-null  float64       
 9   Temperature  10000 non-null  float64       
 10  Humidity     10000 non-null  float64       
 11  Wind Speed   10000 non-null  float64       
dtypes: datetime64[ns](1), float64(9), object(2)
memory usage: 937.6+ KB


### **Statistical Summary about the Dataset**

In [20]:
print(df.describe())

              PM2.5          PM10           NO2           SO2            CO  \
count  10000.000000  10000.000000  10000.000000  10000.000000  10000.000000   
mean      77.448439    104.438161     52.198649     25.344490      5.047984   
std       41.927871     55.062396     27.320490     14.091194      2.852625   
min        5.020000     10.000000      5.010000      1.000000      0.100000   
25%       41.185000     57.137500     28.347500     13.190000      2.560000   
50%       77.725000    103.690000     52.100000     25.350000      5.090000   
75%      113.392500    152.265000     75.705000     37.500000      7.480000   
max      149.980000    200.000000    100.000000     49.990000     10.000000   

                 O3  Temperature      Humidity    Wind Speed  
count  10000.000000  10000.00000  10000.000000  10000.000000  
mean     106.031643     14.89715     55.078579     10.231636  
std       55.081345     14.44380     25.982232      5.632628  
min       10.040000    -10.00000    

### **Unique Values per Column**

In [22]:
df.nunique().sort_values(ascending=False)

PM10           7767
O3             7741
PM2.5          7214
NO2            6162
Humidity       6010
Temperature    4319
SO2            4276
Wind Speed     1940
CO              990
Date            336
City             20
Country          19
dtype: int64
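
The counts show 20 unique cities but only 19 unique countries, which suggests that at least one country contributes more than one city. A minimal sketch of how that could be confirmed, using hypothetical placeholder rows (`CityA`, `CityB`, and so on are not from the dataset):

```python
import pandas as pd

# Hypothetical rows for illustration; on the real data use df[["City", "Country"]].
sample = pd.DataFrame({
    "City": ["CityA", "CityB", "CityC", "CityD", "CityA"],
    "Country": ["CountryX", "CountryX", "CountryY", "CountryZ", "CountryX"],
})

# Count distinct cities within each country.
cities_per_country = sample.groupby("Country")["City"].nunique()

# Keep only the countries represented by more than one city.
multi_city = cities_per_country[cities_per_country > 1]
print(multi_city)
```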

### **Column Classification**

In [42]:
numerical_columns=df.select_dtypes(include=['int64','float64']).columns
categorical_columns=df.select_dtypes(include=['object']).columns
date_columns=df.select_dtypes(include=['datetime64[ns]']).columns
print(f"\nNumerical Columns : ({len(numerical_columns)})")
print(numerical_columns)

print(f"\nCategorical Columns : ({len(categorical_columns)})")
print(categorical_columns)
print(f"\nDatetime Column : ({len(date_columns)})")
print(date_columns)


Numerical Columns : (9)
Index(['PM2.5', 'PM10', 'NO2', 'SO2', 'CO', 'O3', 'Temperature', 'Humidity',
       'Wind Speed'],
      dtype='object')

Categorical Columns : (2)
Index(['City', 'Country'], dtype='object')

Datetime Column : (1)
Index(['Date'], dtype='object')


### **Missing Value Analysis**

In [46]:
missing_count=df.isnull().sum()
print(missing_count)

City           0
Country        0
Date           0
PM2.5          0
PM10           0
NO2            0
SO2            0
CO             0
O3             0
Temperature    0
Humidity       0
Wind Speed     0
dtype: int64
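
When a dataset does contain gaps, raw counts are easier to interpret alongside percentages. The following is a sketch on a hypothetical two-column sample with one injected `NaN`; the column names mirror the dataset, but the missing value is made up:

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one deliberately missing PM2.5 reading.
sample = pd.DataFrame({
    "PM2.5": [86.57, np.nan, 130.21, 119.7],
    "CO": [4.46, 3.4, 0.12, 7.74],
})

# isnull().mean() gives the fraction of missing values per column.
missing_summary = pd.DataFrame({
    "missing_count": sample.isnull().sum(),
    "missing_pct": (sample.isnull().mean() * 100).round(2),
})
print(missing_summary)
```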


### **Checking Duplicates**

In [49]:
duplicate_count=df.duplicated().sum()
print("Total Duplicate Records : ",duplicate_count)

Total Duplicate Records :  0


In [54]:
pollutant_cols = ['CO', 'PM2.5', 'PM10', 'NO2', 'O3']

for col in pollutant_cols:
    print(col, "Negative Values:", (df[col] < 0).sum())



CO Negative Values: 0
PM2.5 Negative Values: 0
PM10 Negative Values: 0
NO2 Negative Values: 0
O3 Negative Values: 0
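
The loop above covers five pollutants; SO2 is also a pollutant column in this dataset. A vectorized variant that includes it might look like the sketch below, shown on a hypothetical two-row sample with one injected negative SO2 reading (`all_pollutants` is an illustrative name, not from the notebook):

```python
import pandas as pd

# Hypothetical sample with one deliberately invalid negative SO2 value.
sample = pd.DataFrame({
    "CO": [4.46, 3.40],
    "PM2.5": [86.57, 50.63],
    "PM10": [25.19, 97.39],
    "NO2": [99.88, 48.14],
    "SO2": [30.63, -1.0],
    "O3": [36.29, 144.16],
})

all_pollutants = ["CO", "PM2.5", "PM10", "NO2", "SO2", "O3"]

# Compare every pollutant column against zero at once and count the hits per column.
negative_counts = (sample[all_pollutants] < 0).sum()
print(negative_counts)
```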


In [56]:
print("Temperature below -50°C:", (df['Temperature'] < -50).sum())
print("Temperature above 60°C:", (df['Temperature'] > 60).sum())


Temperature below -50°C: 0
Temperature above 60°C: 0


In [58]:
df[pollutant_cols].describe()


| | CO | PM2.5 | PM10 | NO2 | O3 |
|---|---|---|---|---|---|
| count | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 |
| mean | 5.047984 | 77.448439 | 104.438161 | 52.198649 | 106.031643 |
| std | 2.852625 | 41.927871 | 55.062396 | 27.32049 | 55.081345 |
| min | 0.1 | 5.02 | 10.0 | 5.01 | 10.04 |
| 25% | 2.56 | 41.185 | 57.1375 | 28.3475 | 58.38 |
| 50% | 5.09 | 77.725 | 103.69 | 52.1 | 106.055 |
| 75% | 7.48 | 113.3925 | 152.265 | 75.705 | 153.9825 |
| max | 10.0 | 149.98 | 200.0 | 100.0 | 200.0 |


In [60]:
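# Only a cell's last expression is displayed, so combine the checks into one filter.
extreme_mask = ((df['PM2.5'] > 1000) | (df['CO'] > 50)
                | (df['Temperature'] > 60) | (df['Temperature'] < -50))
df[extreme_mask]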


| | City | Country | Date | PM2.5 | PM10 | NO2 | SO2 | CO | O3 | Temperature | Humidity | Wind Speed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|


## Data Quality Assessment

A structured validation process was conducted to ensure data reliability and analytical accuracy.

### Missing Values & Duplicates
All 12 columns hold 10,000 non-null values, and `df.duplicated()` found no duplicate records, so the dataset is structurally complete.

### Negative Value Validation
The pollutant columns CO, PM2.5, PM10, NO2, and O3 were examined for invalid negative values, and none were found.  
SO2 was not included in that check, but its minimum of 1.0 in the statistical summary shows it contains no negative readings either.

### Range Validation
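Descriptive statistics were reviewed to identify unrealistic extreme values.  
The observed maxima (PM2.5 149.98, PM10 200.0, NO2 100.0, O3 200.0, CO 10.0) sit far below the screening thresholds applied above (PM2.5 > 1000, CO > 50), so no pollutant reading was flagged as implausible.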


### Temperature Validation
No temperature reading falls below -50°C or above 60°C, and the recorded minimum is -10°C, so all values lie within a realistic atmospheric range.
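
The range and temperature checks above can be gathered into one reusable helper. This is a sketch under stated assumptions: the `bounds` dictionary and the `count_out_of_range` function are hypothetical names, and the limits simply mirror the screening thresholds used in the notebook cells rather than any regulatory standard.

```python
import pandas as pd

# Screening limits taken from the notebook's own checks (not regulatory standards).
bounds = {
    "PM2.5": (0, 1000),
    "CO": (0, 50),
    "Temperature": (-50, 60),
}


def count_out_of_range(frame, limits):
    """Count values outside the inclusive [low, high] interval for each column.

    Missing values are counted as out of range because between() returns
    False for NaN.
    """
    counts = {}
    for col, (low, high) in limits.items():
        counts[col] = int((~frame[col].between(low, high)).sum())
    return pd.Series(counts, name="out_of_range")


# Hypothetical sample; the last row violates the PM2.5 and Temperature limits.
sample = pd.DataFrame({
    "PM2.5": [86.57, 50.63, 1200.0],
    "CO": [4.46, 3.40, 0.12],
    "Temperature": [17.67, 3.46, 75.0],
})

result = count_out_of_range(sample, bounds)
print(result)
```

Keeping the limits in a single dictionary makes it straightforward to extend the check to other columns without adding another notebook cell for each one.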


### Conclusion

The dataset demonstrates high data integrity with no major structural or logical inconsistencies.  
It is suitable for further exploratory analysis and trend identification.
