# The AireLibre Dataset

The AireLibre dataset provides environmental and weather-related data collected from various sensors across Paraguay. The dataset includes the following columns:


| Column Name | Description |
|-------------|-------------|
| sensor      | Sensor model (for example, PMS7003) |
| source      | Unique identifier of the sensor device |
| description | Description of the sensor, typically the name of its location |
| version     | Version of the sensor or data |
| pm1dot0     | PM1.0 particulate matter concentration |
| pm2dot5     | PM2.5 particulate matter concentration |
| pm10        | PM10 particulate matter concentration |
| humidity    | Relative humidity |
| temperature | Temperature (in degrees Celsius) |
| pressure    | Atmospheric pressure (in hPa) |
| co2         | CO2 concentration (in ppm) |
| longitude   | Longitude of the sensor location |
| latitude    | Latitude of the sensor location |
| recorded    | Timestamp of the recorded data |

In [19]:
import pandas as pd
import numpy as np

# Define the column names based on the API response
column_names = [
    "sensor", "source", "description", "version", 
    "pm1dot0", "pm2dot5", "pm10", "humidity", 
    "temperature", "pressure", "co2", 
    "longitude", "latitude", "recorded"
]

# Explicit data types for every column except 'recorded'
dtype_mapping = {
    "sensor": "string",
    "source": "string",
    "description": "string",
    "version": "string",
    "pm1dot0": "float64",
    "pm2dot5": "float64",
    "pm10": "float64",
    "humidity": "float64",
    "temperature": "float64",
    "pressure": "float64",
    "co2": "float64",
    "longitude": "float64",
    "latitude": "float64",
}

# Load the CSV without headers, assigning column names and types at parse time.
# Typing 'source' as a string up front means IDs that look numeric
# (such as '356690' or '4e86') cannot be inferred as numbers.
df = pd.read_csv('../data/raw/airelibre_data.csv', header=None, names=column_names, dtype=dtype_mapping)

# Convert 'recorded' to timestamp with timezone
df['recorded'] = pd.to_datetime(df['recorded'], utc=True)

df.head()

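|   | sensor | source | description | version | pm1dot0 | pm2dot5 | pm10 | humidity | temperature | pressure | co2 | longitude | latitude | recorded |
|---|--------|--------|-------------|---------|---------|---------|------|----------|-------------|----------|-----|-----------|----------|----------|
| 0 | PMS7003 | 356690 | Barrio San Roque | 0.3.2 | 1.0 | 1.0 | 1.0 | NaN | NaN | NaN | NaN | -57.629712 | -25.291214 | 2023-11-04 00:00:00+00:00 |
| 1 | PMS7003 | d87553 | FP-UNA SAN LORENZO | 0.3.2 | 17.0 | 33.0 | 33.0 | NaN | NaN | NaN | NaN | -57.513368 | -25.3354 | 2023-11-06 00:00:00+00:00 |
| 2 | PMS7003 | 7c4984 | Benjamín Aceval | 0.3.2 | 7.0 | 8.0 | 8.0 | NaN | NaN | NaN | NaN | -57.557254 | -24.981153 | 2023-11-10 00:00:00+00:00 |
| 3 | PMS7003 | 17a34d | FCA - CAMPUS UNP | 0.3.2 | 1.0 | 1.0 | 1.0 | NaN | NaN | NaN | NaN | -58.288459 | -26.879844 | 2023-11-10 00:00:00+00:00 |
| 4 | PMS7003 | b27800 | San Rafael | 0.3.2 | 7.0 | 15.0 | 15.0 | NaN | NaN | NaN | NaN | -57.476 | -25.298 | 2023-11-28 00:00:00+00:00 |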


# Missing Data

Several columns in the dataset contain missing values:

* Weather data: The temperature, humidity, and pressure columns are empty for most sensors. Filling them requires data from external weather sources.

* CO2 data: The co2 column is also more than 60% empty and is dropped together with the weather columns.

* Pollution data: The pollution-related columns (pm1dot0, pm2dot5, pm10) are mostly populated and carry the air quality measurements.

The absence of weather data does not affect the air quality measurements, but models focused on regional weather patterns will require complementary weather data from external sources.
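
One way to attach complementary weather data is a nearest-timestamp join. The sketch below is only an illustration under stated assumptions: the `weather` table, its hourly cadence, and the 30-minute tolerance are hypothetical and do not come from the AireLibre API, and the reading values are toy numbers.

```python
import pandas as pd

# Toy sensor readings (illustrative values, not taken from the dataset)
readings = pd.DataFrame({
    "recorded": pd.to_datetime(
        ["2023-11-04 00:05", "2023-11-04 01:10", "2023-11-04 05:00"], utc=True
    ),
    "pm2dot5": [1.0, 3.0, 5.0],
})

# Hypothetical hourly observations from an external weather source
weather = pd.DataFrame({
    "recorded": pd.to_datetime(["2023-11-04 00:00", "2023-11-04 01:00"], utc=True),
    "temperature": [24.5, 23.9],
    "humidity": [70.0, 74.0],
})

# merge_asof requires both frames to be sorted by the join key
merged = pd.merge_asof(
    readings.sort_values("recorded"),
    weather.sort_values("recorded"),
    on="recorded",
    direction="nearest",              # closest observation before or after
    tolerance=pd.Timedelta("30min"),  # no match if the gap exceeds 30 minutes
)
print(merged)
```

Readings with no weather observation within the tolerance keep missing values instead of receiving a distant, misleading measurement. If weather data were available per region, the `by` argument of `pd.merge_asof` could restrict matches to the same region.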

In [20]:
# Maximum share of missing values a column may have before it is dropped
threshold = 0.6

# Find columns with more than 60% missing data
missing_data_ratio = df.isnull().mean()
columns_to_drop = missing_data_ratio[missing_data_ratio > threshold].index

# Drop the columns
df_cleaned = df.drop(columns=columns_to_drop)

# Display the remaining columns
print(f"Dropped columns: {list(columns_to_drop)}")
print(f"Remaining columns: {df_cleaned.columns.tolist()}")


Dropped columns: ['humidity', 'temperature', 'pressure', 'co2']
Remaining columns: ['sensor', 'source', 'description', 'version', 'pm1dot0', 'pm2dot5', 'pm10', 'longitude', 'latitude', 'recorded']


In [21]:
# One row per sensor ID and location
unique_sensors = df[['source', 'latitude', 'longitude']].drop_duplicates()
unique_sensors.reset_index(drop=True, inplace=True)

# Count the readings reported by each sensor ID and attach them to the locations
sensor_readings_count = df.groupby('source').size().reset_index(name='readings_count')
unique_sensors_with_counts = unique_sensors.merge(sensor_readings_count, on='source', how='left')

# Reading count of the most active sensor
max_readings = unique_sensors_with_counts['readings_count'].max()

# Calculate 80% of the maximum number of readings
threshold = max_readings * 0.8

# Filter the sensors that have at least 80% of the maximum readings
filtered_sensors = unique_sensors_with_counts[unique_sensors_with_counts['readings_count'] >= threshold]
print(filtered_sensors)

# Optionally, save the filtered list to a CSV file
filtered_sensors.to_csv('../reports/working_sensors_80p.csv', index=False)

# Number of filtered sensors
num_filtered_sensors = filtered_sensors.shape[0]
print(f"Number of sensors with at least 80% of the maximum readings: {num_filtered_sensors}")

    source   latitude  longitude  readings_count
0   356690 -25.291214 -57.629712          250811
1   d87553 -25.335400 -57.513368          229713
2   7c4984 -24.981153 -57.557254          238473
6   dab551 -25.321224 -57.598298          258552
7   e48dde -26.892240 -57.020990          258080
8     8eb9 -25.313345 -57.621712          259426
10  62a828 -27.313142 -55.848649          253187
11   2104a -25.303082 -57.629141          252377
13  991219 -25.378970 -57.144951          248940
14    4e86 -25.365339 -57.482567          218782
17  9b5c52 -26.857200 -58.298800          241236
19  d0caf2 -25.281869 -57.610068          228609
20  216b9a -25.194156 -57.521369          243077
21  41ce3d -25.265753 -57.514637          253009
Number of sensors with at least 80% of the maximum readings: 14
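
The rule applied in the cell above keeps a sensor when its reading count is at least 80% of the count of the most active sensor. As a minimal sketch, the same rule can be written as a small function. The function name `sensors_above_ratio` and the low-count entry are hypothetical; the other two counts are copied from the output above.

```python
def sensors_above_ratio(readings_count, ratio=0.8):
    """Keep sensors whose count is at least `ratio` times the largest count."""
    cutoff = ratio * max(readings_count.values())
    return {sensor: n for sensor, n in readings_count.items() if n >= cutoff}

counts = {
    "8eb9": 259426,              # most active sensor in the output above
    "4e86": 218782,              # least active sensor that still passes
    "hypothetical_low": 120000,  # made-up sensor that falls below the cutoff
}

kept = sensors_above_ratio(counts)
print(sorted(kept))
```

Because the cutoff is relative to the busiest sensor, it adapts to the length of the collection period, but a single sensor with an unusually long history raises the bar for all the others.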


# Sensor Locations and Clustering

The AireLibre sensors are distributed across different regions of Paraguay, with a concentration in the Asunción Metropolitan Area (9 out of 14 sensors). Other notable locations with sensors include:

* Villa Hayes
* Caacupé
* Encarnación
* San Ignacio (Misiones)
* Pilar

## Regional Model Considerations

* Metropolitan Area: The higher density of sensors in the Asunción Metropolitan Area makes regional statistical models feasible there. With more data points covering the same area, a localized model can improve prediction accuracy.

* Other regions: Replicating this strategy elsewhere in Paraguay requires additional sensors. The other locations have too few sensors to capture regional variation in air quality, so a more widely distributed network is needed.

In [24]:
import folium
from geopy.distance import geodesic
from sklearn.cluster import DBSCAN
import numpy as np
from IPython.display import display  
import pandas as pd

# Work on a copy so that adding the 'cluster' column does not write to a slice
# of unique_sensors_with_counts (this avoids pandas' SettingWithCopyWarning)
df_sensors = filtered_sensors.copy()

def calculate_distance_matrix(df):
    """Return the symmetric matrix of pairwise geodesic distances in km."""
    coords = df[['latitude', 'longitude']].to_numpy()
    distance_matrix = np.zeros((len(coords), len(coords)))

    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            dist = geodesic(coords[i], coords[j]).km
            distance_matrix[i, j] = dist
            distance_matrix[j, i] = dist

    return distance_matrix

distance_matrix = calculate_distance_matrix(df_sensors)

# eps is expressed in km because the precomputed distances are in km.
# With min_samples=1 every sensor is a core point, so no sensor is labeled -1.
db = DBSCAN(eps=10, min_samples=1, metric='precomputed')
df_sensors['cluster'] = db.fit_predict(distance_matrix)

cluster_colors = {
    -1: 'blue',  # Outliers (not produced when min_samples=1)
    0: 'red',
    1: 'green',
    2: 'darkorange',
    3: 'purple',
    4: 'darkblue',
    5: 'yellow'
}

# Initialize the map centered around the mean location of all sensors
# (named sensor_map to avoid shadowing the built-in map)
sensor_map = folium.Map(location=[df_sensors['latitude'].mean(), df_sensors['longitude'].mean()], zoom_start=10)

for _, row in df_sensors.iterrows():
    cluster = row['cluster']
    color = cluster_colors.get(cluster, 'gray')  # Default to gray for unknown clusters
    folium.CircleMarker(
        location=[row['latitude'], row['longitude']],
        radius=6,
        color=color,
        fill=True,
        fill_color=color,
        fill_opacity=0.6,
        popup=f"Sensor: {row['source']}<br>Readings: {row['readings_count']}"
    ).add_to(sensor_map)

# Display the map inline in the notebook
display(sensor_map)

# Save the map as an HTML file for later use
sensor_map.save('../reports/figures/sensor_clusters_map.html')

# Save the updated DataFrame with clusters to a CSV file
df_sensors.to_csv('../reports/clustered_sensors.csv', index=False)

# Print confirmation messages
print("Map saved as 'sensor_clusters_map.html'.")
print("Cluster information saved to 'clustered_sensors.csv'.")

Map saved as 'sensor_clusters_map.html'.
Cluster information saved to 'clustered_sensors.csv'.
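
With `min_samples=1`, DBSCAN treats every point as a core point, so its clusters are the connected components of the graph that links any two sensors at most `eps` apart. The sketch below illustrates that idea with the standard library only. It is not the notebook's method: it swaps the geodesic distance for the haversine formula on a spherical Earth (an assumption that changes distances by well under 1%), and the helper names `haversine_km` and `eps_components` are hypothetical. The coordinates are copied from the filtered sensor table above.

```python
from collections import Counter
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (spherical approximation)

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))

def eps_components(points, eps_km):
    """Label each point with the connected component of the 'within eps_km' graph."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Union every pair of points that lies within eps_km of each other
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if haversine_km(points[i], points[j]) <= eps_km:
                parent[find(i)] = find(j)

    # Renumber component roots as 0, 1, 2, ... in order of first appearance
    ids = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(len(points))]

# (latitude, longitude) of the 14 filtered sensors
sensors = [
    (-25.291214, -57.629712), (-25.335400, -57.513368), (-24.981153, -57.557254),
    (-25.321224, -57.598298), (-26.892240, -57.020990), (-25.313345, -57.621712),
    (-27.313142, -55.848649), (-25.303082, -57.629141), (-25.378970, -57.144951),
    (-25.365339, -57.482567), (-26.857200, -58.298800), (-25.281869, -57.610068),
    (-25.194156, -57.521369), (-25.265753, -57.514637),
]

labels = eps_components(sensors, eps_km=10)
sizes = Counter(labels)
print(f"{len(sizes)} clusters, largest has {max(sizes.values())} sensors")
```

On these coordinates the sketch finds one group of nine sensors around Asunción and five isolated sensors, which matches the 9 out of 14 noted earlier. Because membership only requires a chain of neighbors, two sensors in the same cluster can be much more than 10 km apart.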
