<img src="https://devra.ai/analyst/notebook/2271/image.jpg" style="width: 100%; height: auto;" />

<div style="text-align:center; border-radius:15px; padding:15px; color:white; margin:0; font-family: 'Orbitron', sans-serif; background: #2E0249; background: #11001C; box-shadow: 0px 4px 8px rgba(0, 0, 0, 0.3); overflow:hidden; margin-bottom: 1em;"><div style="font-size:150%; color:#FEE100"><b>Air Quality and Health Outcomes Analysis</b></div><div>This notebook was created with the help of <a href="https://devra.ai/ref/kaggle" style="color:#6666FF">Devra AI</a></div></div>Air quality and respiratory health outcomes data present plenty of challenges and opportunities to uncover hidden correlations. We dive into the intricacies of pollutant levels, weather factors, and hospital admissions to reveal stories that challenge our assumptions about urban living. If you find this notebook useful, please consider upvoting it.

## Table of Contents
- [Data Loading and Preprocessing](#Data-Loading-and-Preprocessing)
- [Data Cleaning and Preprocessing](#Data-Cleaning-and-Preprocessing)
- [Exploratory Data Analysis](#Exploratory-Data-Analysis)
- [Predictive Modeling](#Predictive-Modeling)
- [Conclusion](#Conclusion)

In [1]:
# Import necessary libraries and suppress warnings
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # set Agg backend for matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns

# For model building
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.inspection import permutation_importance

sns.set(style='whitegrid')

# Additional inline configuration
plt.switch_backend('Agg')  # if only plt is used, ensure using Agg backend

## Data Loading and Preprocessing

We start by loading the dataset from the provided CSV. Note that the file encoding is ISO-8859-1 and the delimiter is a comma.

In [2]:
# Load the data
file_path = 'D:/DSMLAI(insaid)/ML/DATA SETS/air_quality_health_dataset.csv'
try:
    df = pd.read_csv(file_path, encoding='ISO-8859-1', delimiter=',')
    print('Data loaded successfully.')
except Exception as e:
    print('Error loading data:', e)

# Display the first few rows as a sanity check
df.head()

Data loaded successfully.


Unnamed: 0,city,date,aqi,pm2_5,pm10,no2,o3,temperature,humidity,hospital_admissions,population_density,hospital_capacity
0,Los Angeles,2020-01-01,65,34.0,52.7,2.2,38.5,33.5,33,5,Rural,1337
1,Beijing,2020-01-02,137,33.7,31.5,36.7,27.5,-1.6,32,4,Urban,1545
2,London,2020-01-03,266,43.0,59.6,30.4,57.3,36.4,25,10,Suburban,1539
3,Mexico City,2020-01-04,293,33.7,37.9,12.3,42.7,-1.0,67,10,Urban,552
4,Delhi,2020-01-05,493,50.3,34.8,31.2,35.6,33.5,72,9,Suburban,1631


## Data Cleaning and Preprocessing

In this section we address common data quality issues. Note the following steps:

- For the `date` column, we convert the string representation to a datetime type. This is critical for time-series analyses later.
- We check for missing values and decide on appropriate imputations or dropping strategies.
- For columns with unexpected data types (for instance, `population_density` which is encoded as a string), further conversion logic might be necessary.

The dry truth is: good data cleaning is 80% of the work. Let us get this done.

In [3]:
# Convert the 'date' column to datetime
df['date'] = pd.to_datetime(df['date'], errors='coerce')

# Check for missing values
missing_values = df.isnull().sum()
print('Missing values in each column:')
print(missing_values)

# Handling missing values (basic strategy): drop rows with critical missing data
df.dropna(subset=['date', 'aqi', 'pm2_5', 'hospital_admissions'], inplace=True)

# Convert population_density to a numeric value if possible; if not possible, leave it as is
try:
    df['population_density_numeric'] = pd.to_numeric(df['population_density'], errors='coerce')
    print('Converted population_density to numeric where possible.')
except Exception as e:
    print('Error converting population_density:', e)

# For any remaining non-numeric entries in population_density_numeric, we can fill them with the median
if 'population_density_numeric' in df.columns:
    median_val = df['population_density_numeric'].median()
    df['population_density_numeric'].fillna(median_val, inplace=True)

# Final sanity check
print('Data types after processing:')
print(df.dtypes)

Missing values in each column:
city                   0
date                   0
aqi                    0
pm2_5                  0
pm10                   0
no2                    0
o3                     0
temperature            0
humidity               0
hospital_admissions    0
population_density     0
hospital_capacity      0
dtype: int64
Converted population_density to numeric where possible.
Data types after processing:
city                                  object
date                          datetime64[ns]
aqi                                    int64
pm2_5                                float64
pm10                                 float64
no2                                  float64
o3                                   float64
temperature                          float64
humidity                               int64
hospital_admissions                    int64
population_density                    object
hospital_capacity                      int64
population_density_numeric     

## Exploratory Data Analysis

Now we explore potential relationships in the data. In this section, we generate several plots:

- Correlation heatmap for numeric features (only if four or more numeric columns exist).
- Pair plot to assess distributions and relationships between key variables.
- Histograms for distribution of individual numeric variables.
- Count plot (pie chart style) for categorical features (e.g., city).

Each plot is a step towards a more complete picture of the pollutant-health connection. Grab your binoculars.

In [6]:
# Extract numeric columns for correlation analysis
numeric_df = df.select_dtypes(include=[np.number])

# Generate a correlation heatmap if there are 4 or more numeric columns
if numeric_df.shape[1] >= 4:
    plt.figure(figsize=(10, 8))
    corr = numeric_df.corr()
    sns.heatmap(corr, annot=True, cmap='coolwarm', fmt='.2f')
    plt.title('Correlation Heatmap')
    plt.tight_layout()
    plt.savefig('D:/DSMLAI(insaid)/ML/practice/AIR-QUALITY/EDA-images/correlation_heatmap.png')
    plt.show()
else:
    print('Not enough numeric columns for a correlation heatmap.')

# Pair Plot of key variables
sns.pairplot(df[['aqi', 'pm2_5', 'pm10', 'hospital_admissions']])
plt.suptitle('Pair Plot of Selected Variables', y=1.02)
plt.savefig('D:/DSMLAI(insaid)/ML/practice/AIR-QUALITY/EDA-images/pairplot.png')
plt.show()

# Histograms for numeric features
numeric_cols = numeric_df.columns
for col in numeric_cols:
    plt.figure(figsize=(6, 4))
    sns.histplot(df[col].dropna(), kde=True, bins=30)
    plt.title(f'Histogram of {col}')
    plt.savefig(f'D:/DSMLAI(insaid)/ML/practice/AIR-QUALITY/EDA-images/histogram_{col}.png')
    plt.show()

# Count plot for the categorical 'city' feature
plt.figure(figsize=(10, 6))
sns.countplot(data=df, x='city', order=df['city'].value_counts().index)
plt.title('Count Plot of Cities')
plt.xticks(rotation=45, ha='right')
plt.savefig('D:/DSMLAI(insaid)/ML/practice/AIR-QUALITY/EDA-images/countplot_cities.png')
plt.show()

## Predictive Modeling

Given that hospital admissions in the context of air quality metrics is a pressing health indicator, we develop a predictor to see if we can forecast hospital admissions using air quality data and related features. We select a subset of features that are both plausible and available in the dataset. In our case, the features include:

- aqi
- pm2_5
- pm10
- no2
- o3
- temperature
- humidity
- hospital_capacity

We use a RandomForestRegressor as our predictive model. While simple linear models are elegant, we prefer the brute force of tree ensembles for capturing non-linearities. We then evaluate the model using R2 and Mean Squared Error on a hold-out test set.

In [7]:
# Define the target and predictors
target = 'hospital_admissions'
features = ['aqi', 'pm2_5', 'pm10', 'no2', 'o3', 'temperature', 'humidity', 'hospital_capacity']

# Drop any rows with missing values in the selected feature columns
model_df = df.dropna(subset=features + [target]).copy()

# Extract features X and target y
X = model_df[features]
y = model_df[target]

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and fit the RandomForestRegressor
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Calculate performance metrics
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f'Mean Squared Error: {mse:.2f}')
print(f'R^2 Score: {r2:.2f}')

# Permutation importance to understand feature contributions
perm_importance = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=42)

feature_importance = pd.Series(perm_importance.importances_mean, index=features)
feature_importance = feature_importance.sort_values()

plt.figure(figsize=(8, 6))
plt.barh(feature_importance.index, feature_importance.values, color='skyblue')
plt.xlabel('Permutation Importance')
plt.title('Feature Importance')
plt.savefig('feature_importance.png')
plt.show()

Mean Squared Error: 11.88
R^2 Score: 0.13


## Conclusion

This notebook takes us through loading, cleaning, and exploring air quality data in relation to public health outcomes. Our exploratory analysis highlights several key relationships, and the predictive modeling section demonstrates that a RandomForestRegressor can capture the complexity of these interactions to predict hospital admissions with a reasonable degree of accuracy. 

Merits of this approach include:

- A comprehensive data cleaning methodology that handles date conversions and missing values robustly.
- Multiple visualization methods that offer varied perspectives on the data.
- A predictive model that uses permutation importance to highlight valuable features.

Future analyses might incorporate time-series forecasting elements, integrate more granular geographical information, or experiment with other machine learning algorithms to further improve prediction accuracy.

If you found this notebook useful, please consider upvoting it.