# Vulnerability Score Analysis

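This notebook processes census sections (seccions censals) from Barcelona, assigns each section its nearest weather station, and merges daily weather, consumption, leak-incident, and socioeconomic data into a single dataset with one row per section per day.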

In [1]:
import geopandas as gpd
import pandas as pd


## 1. Load Census Sections Data

Load the Barcelona census sections from CSV. The file contains polygon geometries in WKT format (WGS84 coordinate system).


In [2]:
# Load CSV with WKT geometry
df = gpd.read_file("data/BarcelonaCiutat_SeccionsCensals.csv", GEOM_POSSIBLE_NAMES="geometria_wgs84", KEEP_GEOM_COLUMNS="NO")

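# read_file already parses the WKT column; re-wrap to set the CRS explicitly,
# since a CSV carries no CRS metadata
gdf = gpd.GeoDataFrame(df, geometry="geometry", crs="EPSG:4326")

# Approximate centroids in WGS84, kept as lat/lon reference columns.
# GeoPandas warns that centroids in a geographic CRS are inexact; the distance
# calculations in Section 4 recompute them after reprojecting to UTM.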
gdf["centroid"] = gdf.geometry.centroid

# Extract latitude and longitude from centroids
gdf["centroid_lat"] = gdf["centroid"].y
gdf["centroid_lon"] = gdf["centroid"].x

gdf.head()




Unnamed: 0,codi_districte,nom_districte,codi_barri,nom_barri,codi_aeb,codi_seccio_censal,geometria_etrs89,geometry,centroid,centroid_lat,centroid_lon
0,1,Ciutat Vella,1,el Raval,1,1,"POLYGON ((431076.9025 4581077.3095, 431058.164...","POLYGON ((2.17575 41.37827, 2.17552 41.37865, ...",POINT (2.17722 41.37432),41.374324,2.177219
1,1,Ciutat Vella,1,el Raval,1,2,"POLYGON ((431023.5455 4581164.3265, 430990.550...","POLYGON ((2.1751 41.37905, 2.1747 41.37951, 2....",POINT (2.17391 41.37793),41.377927,2.173914
2,1,Ciutat Vella,1,el Raval,2,3,"POLYGON ((430778.3455 4580930.5395, 430766.851...","POLYGON ((2.1722 41.37692, 2.17206 41.37696, 2...",POINT (2.17199 41.37576),41.375757,2.171985
3,1,Ciutat Vella,1,el Raval,2,4,"POLYGON ((430564.2645 4581104.2995, 430496.863...","POLYGON ((2.16962 41.37847, 2.16882 41.37784, ...",POINT (2.16924 41.37642),41.376416,2.169238
4,1,Ciutat Vella,1,el Raval,3,5,"POLYGON ((430905.0315 4581350.0725, 430874.963...","POLYGON ((2.17366 41.38071, 2.1733 41.38113, 2...",POINT (2.17277 41.37884),41.378836,2.172773


## 2. Create SECCIO_CENSAL Identifier

Create a unique identifier for each census section by concatenating:
- Prefix: `080193` (Barcelona municipality code)
- District code: `codi_districte`
- Section code: `codi_seccio_censal`

This creates a standardized identifier format: `080193 + codi_districte + codi_seccio_censal`
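
The concatenation only yields the 11-character identifiers shown in the printed outputs (for example `08019301001`) when both code columns are already zero-padded strings. A minimal sketch of a more defensive variant follows; the helper name `build_seccio_censal` and the padding widths (2 digits for the district, 3 for the section) are assumptions inferred from those printed identifiers, not something the source data dictionary confirms.

```python
import pandas as pd


def build_seccio_censal(districte: pd.Series, seccio: pd.Series,
                        prefix: str = "080193") -> pd.Series:
    """Build prefix + 2-digit district + 3-digit section as a string ID."""
    # Cast to string and left-pad with zeros, so integer inputs (1, 1) and
    # string inputs ("01", "001") produce the same identifier.
    # Widths 2 and 3 are assumptions inferred from the printed IDs.
    d = districte.astype(str).str.zfill(2)
    s = seccio.astype(str).str.zfill(3)
    return prefix + d + s


# Toy input mixing integers and strings (illustrative values only)
demo = pd.DataFrame({
    "codi_districte": [1, "10"],
    "codi_seccio_censal": [1, "139"],
})
demo["SECCIO_CENSAL"] = build_seccio_censal(
    demo["codi_districte"], demo["codi_seccio_censal"]
)
print(demo)
```

Keeping the identifier as a string matters for the later merges: if one table stores it as text and another as an integer, the leading zero is lost and the keys no longer match.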


In [3]:
# Create SECCIO_CENSAL field: prefix 080193 + codi_districte + codi_seccio_censal
gdf["SECCIO_CENSAL"] = "080193" + gdf["codi_districte"].astype(str) + gdf["codi_seccio_censal"].astype(str)

# Verify the identifier creation
gdf[["codi_districte", "codi_seccio_censal", "SECCIO_CENSAL"]].head()


Unnamed: 0,codi_districte,codi_seccio_censal,SECCIO_CENSAL
0,1,1,8019301001
1,1,2,8019301002
2,1,3,8019301003
3,1,4,8019301004
4,1,5,8019301005


In [4]:
# Verify uniqueness of SECCIO_CENSAL
num_unique = gdf["SECCIO_CENSAL"].nunique()
print(f"Number of unique values in SECCIO_CENSAL: {num_unique}")
print(f"Total number of rows: {len(gdf)}")


Number of unique values in SECCIO_CENSAL: 1068
Total number of rows: 1068


**Note:** This corresponds to the number of census sections in Barcelona (1068), confirming that our identifier field is correct and unique for each section.

## 3. Load Weather Stations

Load the weather stations for which we have complete data for 2023 and 2024. 

**Note:** We exclude the Zoo weather station (X2) for our simplified version of the map, since it was dismantled in October 2024.

The three stations used, with coordinates given as (longitude, latitude) in WGS84, are:
- **D5**: (2.12379, 41.41864)
- **X4**: (2.16775, 41.38390)
- **X8**: (2.10540, 41.37919)


In [5]:
# Create GeoDataFrame with weather station locations
stations = gpd.GeoDataFrame(
    {
        "name": ["D5", "X4", "X8"],
        "lat": [41.41864, 41.38390, 41.37919],
        "lon": [2.12379, 2.16775, 2.10540],
    },
    geometry=gpd.points_from_xy(
        [2.12379, 2.16775, 2.10540],  # longitude (x)
        [41.41864, 41.38390, 41.37919],  # latitude (y)
    ),
    crs="EPSG:4326"  # WGS84 coordinate system
)

stations

Unnamed: 0,name,lat,lon,geometry
0,D5,41.41864,2.12379,POINT (2.12379 41.41864)
1,X4,41.3839,2.16775,POINT (2.16775 41.3839)
2,X8,41.37919,2.1054,POINT (2.1054 41.37919)


## 4. Assign Weather Stations to Census Sections

For each census section, we find the nearest weather station using spatial joins. This process involves:

1. **Reproject to UTM (EPSG:25831)**: Convert both datasets to a projected coordinate system (ETRS89 / UTM zone 31N) so that distances are computed in meters rather than degrees
2. **Calculate centroids**: Compute the centroid of each census section polygon in the projected CRS
3. **Find nearest station**: Use `sjoin_nearest` to find the closest weather station to each centroid
4. **Merge results**: Add the assigned station name and distance to the original GeoDataFrame


In [6]:
# Step 1: Reproject to UTM for accurate distance calculations
gdf_utm = gdf.to_crs(epsg=25831)  # UTM Zone 31N (Spain)
stations_utm = stations.to_crs(epsg=25831)

# Step 2: Calculate centroids in projected CRS (more accurate than WGS84)
gdf_utm["centroid"] = gdf_utm.geometry.centroid
centroids = gdf_utm.set_geometry("centroid")

# Step 3: Find nearest weather station to each census section centroid
nearest = gpd.sjoin_nearest(
    centroids,
    stations_utm[["name", "geometry"]],
    how="left",
    distance_col="dist_m"  # Distance in meters
)

# Step 4: Merge results back to original GeoDataFrame
# Merge on index: sjoin_nearest with how="left" keeps the index of `centroids`
# (a centroid exactly equidistant from two stations would appear twice)
gdf = gdf.merge(
    nearest[["name", "dist_m"]],
    left_index=True,
    right_index=True,
    how="left"
)
gdf = gdf.rename(columns={"name": "WEATHER_STATION"})

# Display results
print(f"Number of census sections assigned to stations: {gdf['WEATHER_STATION'].notna().sum()}")
print(f"\nDistribution of stations:")
print(gdf["WEATHER_STATION"].value_counts())
print(f"\nDistance statistics (meters):")
print(gdf["dist_m"].describe())

Number of census sections assigned to stations: 1068

Distribution of stations:
WEATHER_STATION
X4    625
D5    338
X8    105
Name: count, dtype: int64

Distance statistics (meters):
count    1068.000000
mean     2988.696968
std      1492.095398
min       121.156567
25%      1914.985512
50%      2611.524009
75%      4095.279686
max      7152.228289
Name: dist_m, dtype: float64


## 5. Preview Results

Preview the final GeoDataFrame with assigned weather stations:


In [7]:
# Preview the final GeoDataFrame
gdf[["SECCIO_CENSAL", "nom_districte", "nom_barri", "WEATHER_STATION", "dist_m"]].head(10)

Unnamed: 0,SECCIO_CENSAL,nom_districte,nom_barri,WEATHER_STATION,dist_m
0,8019301001,Ciutat Vella,el Raval,X4,1325.66282
1,8019301002,Ciutat Vella,el Raval,X4,839.842507
2,8019301003,Ciutat Vella,el Raval,X4,970.927625
3,8019301004,Ciutat Vella,el Raval,X4,840.1558
4,8019301005,Ciutat Vella,el Raval,X4,701.76885
5,8019301006,Ciutat Vella,el Raval,X4,544.606912
6,8019301007,Ciutat Vella,el Raval,X4,560.63126
7,8019301008,Ciutat Vella,el Raval,X4,675.929543
8,8019301009,Ciutat Vella,el Raval,X4,726.465578
9,8019301010,Ciutat Vella,el Raval,X4,563.659021
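
As an independent sanity check on the projected distances, the first row can be recomputed with a great-circle (haversine) formula directly from the WGS84 coordinates. This is only a sketch: the function name `haversine_m` and the spherical Earth radius of 6,371 km are assumptions, and the notebook itself measures distances in EPSG:25831, so small differences of a few meters are expected.

```python
import math


def haversine_m(lon1: float, lat1: float, lon2: float, lat2: float,
                radius_m: float = 6_371_000.0) -> float:
    """Great-circle distance in meters between two lon/lat points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * radius_m * math.asin(math.sqrt(a))


# Centroid of the first census section (from the Section 1 output) and
# station X4 (from Section 3), both as (longitude, latitude)
centroid = (2.177219, 41.374324)
station_x4 = (2.16775, 41.38390)

d = haversine_m(*centroid, *station_x4)
print(f"Great-circle distance: {d:.1f} m")
```

The result should land close to the `dist_m` value of about 1325.7 m reported for that section; a large discrepancy would point to swapped longitude/latitude or a wrong CRS.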


## 6. Merge Daily Weather Data with Census Sections

Load the cleaned weather data and merge it with census sections. The weather data is in long format (one row per station-date-variable), so we need to pivot it to wide format (one row per station-date with columns for each variable).
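
The long-to-wide reshaping can be illustrated on a toy table. The sketch below uses `DataFrame.pivot`, which raises an error if a station-date-variable combination appears more than once, so it doubles as a duplicate check; the notebook itself uses `pivot_table` with `aggfunc='first'`. The variable names `temp_mean` and `humidity_mean` and the X4 humidity value are placeholders for illustration.

```python
import pandas as pd

# Long format: one row per station-date-variable (toy values)
long_df = pd.DataFrame({
    "CODI_ESTACIO": ["D5", "D5", "X4", "X4"],
    "DATA_LECTURA": pd.to_datetime(["2023-01-04"] * 4),
    "NOM_VARIABLE": ["temp_mean", "humidity_mean", "temp_mean", "humidity_mean"],
    "VALOR_NUM": [11.7, 66.0, 13.0, 70.0],
})

# Explicit duplicate check on the would-be (row, column) key
key = ["CODI_ESTACIO", "DATA_LECTURA", "NOM_VARIABLE"]
n_dupes = int(long_df.duplicated(subset=key).sum())

# Wide format: one row per station-date, one column per variable
wide_df = long_df.pivot(
    index=["CODI_ESTACIO", "DATA_LECTURA"],
    columns="NOM_VARIABLE",
    values="VALOR_NUM",
).reset_index()
wide_df.columns.name = None  # drop the leftover "NOM_VARIABLE" axis label

print(f"Duplicates: {n_dupes}")
print(wide_df)
```

With `pivot_table(aggfunc='first')`, duplicates would be silently collapsed instead, which is why verifying their absence beforehand is worthwhile.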

### Date Range Filtering

For simplicity, we filter the weather data to the period from **2023-01-04 to 2024-12-31**. This represents the temporal overlap between all our data sources:
- **Weather data**: Available from 2021-01-01 to 2025-11-16
- **Consumption data**: Available from 2023-01-04 onwards
- **Leak incidents**: Available from 2023 onwards

By focusing on this common period, we ensure all data sources are available for analysis while maintaining a substantial time range for our vulnerability score calculations.
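
The common window can be derived mechanically as the latest start and the earliest end across sources. The sketch below is illustrative: the exact leak start date (taken here as 2023-01-01) is an assumption, open-ended ranges are represented as `None`, and the 2024-12-31 cap is the analysis end chosen above.

```python
from datetime import date

# (start, end) per source; None marks an open-ended range
sources = {
    "weather": (date(2021, 1, 1), date(2025, 11, 16)),
    "consumption": (date(2023, 1, 4), None),
    "leaks": (date(2023, 1, 1), None),  # assumption: "from 2023 onwards"
}
analysis_end = date(2024, 12, 31)  # chosen cap for the analysis period

# Overlap = latest start, earliest known end (bounded by the cap)
overlap_start = max(start for start, _ in sources.values())
known_ends = [end for _, end in sources.values() if end is not None]
overlap_end = min(known_ends + [analysis_end])

# Inclusive day count of the overlap window
n_days = (overlap_end - overlap_start).days + 1

print(f"Overlap: {overlap_start} to {overlap_end} ({n_days} days)")
```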


In [8]:
# Load weather data and filter to only the stations we use (D5, X4, X8)
weather = pd.read_parquet("clean/weather_clean.parquet")

# Filter to only the stations assigned to census sections
stations_to_keep = ['D5', 'X4', 'X8']
weather = weather[weather['CODI_ESTACIO'].isin(stations_to_keep)].copy()

# Drop NOM_ESTACIO since we already have CODI_ESTACIO
if 'NOM_ESTACIO' in weather.columns:
    weather = weather.drop(columns=['NOM_ESTACIO'])

# Filter to date range: 2023-01-04 to 2024-12-31 (temporal overlap with consumption and leaks)
date_start = pd.to_datetime('2023-01-04')
date_end = pd.to_datetime('2024-12-31')
reading_dates = pd.to_datetime(weather['DATA_LECTURA'])
weather = weather[reading_dates.between(date_start, date_end)].copy()

print(f"Weather data shape (after filtering): {weather.shape}")
print(f"\nStations in weather data: {sorted(weather['CODI_ESTACIO'].unique())}")
print(f"\nDate range: {weather['DATA_LECTURA'].min()} to {weather['DATA_LECTURA'].max()}")
print(f"\nNumber of variables: {weather['NOM_VARIABLE'].nunique()}")
print(f"\nSample of weather data:")
weather.head()


Weather data shape (after filtering): (48021, 9)

Stations in weather data: ['D5', 'X4', 'X8']

Date range: 2023-01-04 00:00:00 to 2024-12-31 00:00:00

Number of variables: 22

Sample of weather data:


Unnamed: 0,ID,CODI_ESTACIO,DATA_LECTURA,CODI_VARIABLE,NOM_VARIABLE,VALOR,UNITAT,HORA _TU,VALOR_NUM
54232,D51000012304,D5,2023-01-04,1.0,Temperatura mitjana diària,117,°C,,11.7
54233,D51001012304,D5,2023-01-04,1.001,Temperatura màxima diària + hora,177,°C,14:14:00,17.7
54234,D51002012304,D5,2023-01-04,1.002,Temperatura mínima diària + hora,91,°C,06:06:00,9.1
54235,D51003012304,D5,2023-01-04,1.003,Temperatura mitjana diària clàssica,134,°C,,13.4
54236,D51004012304,D5,2023-01-04,1.004,Amplitud tèrmica diària,86,°C,,8.6


In [9]:
# Pivot weather data from long to wide format
# Each row will be a unique combination of station (CODI_ESTACIO) and date (DATA_LECTURA)
# Each variable (NOM_VARIABLE) becomes a column with its VALOR_NUM value
# Note: No duplicate measurements - each station-date-variable has only one value

weather_daily = weather.pivot_table(
    index=['CODI_ESTACIO', 'DATA_LECTURA'],
    columns='NOM_VARIABLE',
    values='VALOR_NUM',
    aggfunc='first'  # Since there are no duplicates, 'first' is sufficient
).reset_index()

# Flatten column names (remove multi-index if any)
weather_daily.columns.name = None

print(f"Weather daily shape: {weather_daily.shape}")
print(f"Number of station-date combinations: {len(weather_daily)}")
print(f"Number of weather variables: {len(weather_daily.columns) - 2}")  # Subtract CODI_ESTACIO and DATA_LECTURA
print(f"\nColumns: {list(weather_daily.columns[:10])}...")  # Show first 10 columns
weather_daily.head()


Weather daily shape: (2184, 24)
Number of station-date combinations: 2184
Number of weather variables: 22

Columns: ['CODI_ESTACIO', 'DATA_LECTURA', 'Amplitud tèrmica diària', 'Direcció de la ratxa màx. diària de vent 10 m', 'Direcció mitjana diària del vent 10 m (m. 1)', 'Evapotranspiració de referència', 'Humitat relativa mitjana diària', 'Humitat relativa màxima diària + data', 'Humitat relativa mínima diària + data', 'Irradiació solar global diària']...


Unnamed: 0,CODI_ESTACIO,DATA_LECTURA,Amplitud tèrmica diària,Direcció de la ratxa màx. diària de vent 10 m,Direcció mitjana diària del vent 10 m (m. 1),Evapotranspiració de referència,Humitat relativa mitjana diària,Humitat relativa màxima diària + data,Humitat relativa mínima diària + data,Irradiació solar global diària,...,Precipitació màxima en 30 min (diària)+ hora,Pressió atmosfèrica mitjana diària,Pressió atmosfèrica màxima diària + hora,Pressió atmosfèrica mínima diària + hora,Ratxa màxima diària del vent 10 m + hora,Temperatura mitjana diària,Temperatura mitjana diària clàssica,Temperatura màxima diària + hora,Temperatura mínima diària + hora,Velocitat mitjana diària del vent 10 m (esc.)
0,D5,2023-01-04,8.6,338.0,335.0,1.09,66.0,87.0,39.0,8.9,...,0.0,983.1,984.7,982.3,7.6,11.7,13.4,17.7,9.1,3.5
1,D5,2023-01-05,7.6,304.0,263.0,1.18,53.0,67.0,35.0,9.2,...,0.0,979.0,982.7,975.8,8.7,11.7,12.5,16.3,8.7,3.5
2,D5,2023-01-06,7.7,310.0,284.0,1.02,59.0,77.0,47.0,9.2,...,0.0,975.5,976.9,974.7,10.4,10.0,11.1,14.9,7.2,3.9
3,D5,2023-01-07,6.1,295.0,267.0,1.13,67.0,100.0,46.0,9.6,...,0.0,971.3,974.7,968.7,14.7,9.5,10.3,13.3,7.2,5.8
4,D5,2023-01-08,3.8,274.0,266.0,0.58,72.0,86.0,56.0,3.2,...,0.1,964.4,968.8,961.1,11.9,11.9,11.7,13.6,9.8,4.5


In [10]:
# Merge weather data with census sections
# Each census section gets the weather data from its assigned station (WEATHER_STATION)
gdf_daily = gdf.merge(
    weather_daily,
    left_on="WEATHER_STATION",
    right_on="CODI_ESTACIO",
    how="left"
)

# Drop WEATHER_STATION since it's the same as CODI_ESTACIO (which comes from weather data)
gdf_daily = gdf_daily.drop(columns=['WEATHER_STATION'])

print(f"Final GeoDataFrame shape: {gdf_daily.shape}")
print(f"\nNumber of census sections: {gdf_daily['SECCIO_CENSAL'].nunique()}")
print(f"\nNumber of unique dates: {gdf_daily['DATA_LECTURA'].nunique()}")
print(f"\nDate range: {gdf_daily['DATA_LECTURA'].min()} to {gdf_daily['DATA_LECTURA'].max()}")

# Show sample
gdf_daily.head(10)


Final GeoDataFrame shape: (777504, 37)

Number of census sections: 1068

Number of unique dates: 728

Date range: 2023-01-04 00:00:00 to 2024-12-31 00:00:00


Unnamed: 0,codi_districte,nom_districte,codi_barri,nom_barri,codi_aeb,codi_seccio_censal,geometria_etrs89,geometry,centroid,centroid_lat,...,Precipitació màxima en 30 min (diària)+ hora,Pressió atmosfèrica mitjana diària,Pressió atmosfèrica màxima diària + hora,Pressió atmosfèrica mínima diària + hora,Ratxa màxima diària del vent 10 m + hora,Temperatura mitjana diària,Temperatura mitjana diària clàssica,Temperatura màxima diària + hora,Temperatura mínima diària + hora,Velocitat mitjana diària del vent 10 m (esc.)
0,1,Ciutat Vella,1,el Raval,1,1,"POLYGON ((431076.9025 4581077.3095, 431058.164...","POLYGON ((2.17575 41.37827, 2.17552 41.37865, ...",POINT (2.17722 41.37432),41.374324,...,0.0,1028.5,1030.2,1027.7,4.1,13.0,13.5,16.8,10.2,0.9
1,1,Ciutat Vella,1,el Raval,1,1,"POLYGON ((431076.9025 4581077.3095, 431058.164...","POLYGON ((2.17575 41.37827, 2.17552 41.37865, ...",POINT (2.17722 41.37432),41.374324,...,0.0,1024.2,1028.0,1021.2,5.4,12.5,13.1,17.3,8.8,1.1
2,1,Ciutat Vella,1,el Raval,1,1,"POLYGON ((431076.9025 4581077.3095, 431058.164...","POLYGON ((2.17575 41.37827, 2.17552 41.37865, ...",POINT (2.17722 41.37432),41.374324,...,0.0,1020.8,1022.4,1019.9,5.8,11.8,11.9,16.0,7.7,0.9
3,1,Ciutat Vella,1,el Raval,1,1,"POLYGON ((431076.9025 4581077.3095, 431058.164...","POLYGON ((2.17575 41.37827, 2.17552 41.37865, ...",POINT (2.17722 41.37432),41.374324,...,0.0,1016.6,1020.1,1013.6,10.2,12.5,13.0,16.7,9.3,1.7
4,1,Ciutat Vella,1,el Raval,1,1,"POLYGON ((431076.9025 4581077.3095, 431058.164...","POLYGON ((2.17575 41.37827, 2.17552 41.37865, ...",POINT (2.17722 41.37432),41.374324,...,0.1,1008.9,1013.7,1005.5,6.8,14.2,14.0,16.2,11.7,1.8
5,1,Ciutat Vella,1,el Raval,1,1,"POLYGON ((431076.9025 4581077.3095, 431058.164...","POLYGON ((2.17575 41.37827, 2.17552 41.37865, ...",POINT (2.17722 41.37432),41.374324,...,0.0,1013.2,1020.7,1006.8,14.5,14.5,14.4,17.1,11.7,4.2
6,1,Ciutat Vella,1,el Raval,1,1,"POLYGON ((431076.9025 4581077.3095, 431058.164...","POLYGON ((2.17575 41.37827, 2.17552 41.37865, ...",POINT (2.17722 41.37432),41.374324,...,0.0,1023.0,1025.2,1020.0,8.6,12.2,12.6,15.7,9.4,1.6
7,1,Ciutat Vella,1,el Raval,1,1,"POLYGON ((431076.9025 4581077.3095, 431058.164...","POLYGON ((2.17575 41.37827, 2.17552 41.37865, ...",POINT (2.17722 41.37432),41.374324,...,0.0,1020.8,1023.9,1019.0,9.4,13.0,13.2,16.4,9.9,1.2
8,1,Ciutat Vella,1,el Raval,1,1,"POLYGON ((431076.9025 4581077.3095, 431058.164...","POLYGON ((2.17575 41.37827, 2.17552 41.37865, ...",POINT (2.17722 41.37432),41.374324,...,0.0,1023.1,1025.6,1020.5,7.5,12.2,12.8,16.0,9.6,1.9
9,1,Ciutat Vella,1,el Raval,1,1,"POLYGON ((431076.9025 4581077.3095, 431058.164...","POLYGON ((2.17575 41.37827, 2.17552 41.37865, ...",POINT (2.17722 41.37432),41.374324,...,0.0,1021.7,1024.5,1019.6,5.1,12.7,13.5,16.8,10.2,1.1


### Current Structure of `gdf_daily`

At this point, `gdf_daily` contains:

**Rows (Observations):**
- One row per **census section** per **date** (from 2023-01-04 to 2024-12-31)
- Total rows = 1068 census sections × 728 dates = 777,504
- Each row represents a unique combination of a census section and a date

**Columns (Variables):**

1. **Census Section Information:**
   - `SECCIO_CENSAL`: Unique identifier (format: 080193 + district + section)
   - `codi_districte`, `nom_districte`: District codes and names
   - `codi_barri`, `nom_barri`: Neighborhood codes and names
   - `codi_aeb`, `codi_seccio_censal`: Additional identifiers
   - `geometry`: Polygon geometry of the census section
   - `centroid`, `centroid_lat`, `centroid_lon`: Geographic centroids

2. **Weather Station Assignment:**
   - `CODI_ESTACIO`: Assigned weather station code (D5, X4, or X8)
   - `dist_m`: Distance to nearest weather station (in meters)

3. **Weather Variables (23 columns):**
   - `DATA_LECTURA`: Date of the weather reading
   - The 22 daily weather variables from the stations (temperature, precipitation, humidity, pressure, wind, etc.)

**Next Steps:**
We will now merge consumption, leak incidents, and socioeconomic data to enrich this dataset further.


## 7. Merge Additional Data Sources

Now we'll merge consumption, leak incidents, and socioeconomic data with the weather-enriched census sections. Each dataset needs to be aggregated appropriately to match the daily structure of `gdf_daily`.


### 7.1 Load and Merge Socioeconomic Data (IST)

The socioeconomic data (IST - Índex socioeconòmic territorial) is a static factor that we will keep constant across years (for now) for each census section. We'll load it and merge directly by `SECCIO_CENSAL`.
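
If a later version of the file contained several years, one value per section would have to be chosen explicitly. A minimal sketch of selecting the most recent year per `SECCIO_CENSAL` is shown below; the 2021 row and its value are hypothetical, added only to demonstrate the selection.

```python
import pandas as pd

# Toy IST table; the 2021 row is hypothetical
socio_demo = pd.DataFrame({
    "any": [2021, 2022, 2022],
    "SECCIO_CENSAL": ["08019301001", "08019301001", "08019301002"],
    "concepte": ["Índex socioeconòmic territorial"] * 3,
    "valor": [84.0, 85.7, 75.8],
})

# Sort by year so the last row per section is the most recent one
latest = (
    socio_demo.sort_values("any")
    .drop_duplicates(subset="SECCIO_CENSAL", keep="last")
    .loc[:, ["SECCIO_CENSAL", "valor"]]
    .rename(columns={"valor": "ist"})
    .sort_values("SECCIO_CENSAL")
    .reset_index(drop=True)
)
print(latest)
```

With a single year, as in the current data, this reduces to one row per section and gives the same result as taking the first value.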


In [11]:
# Load socioeconomic data
socio = pd.read_parquet("clean/socio_clean.parquet")

print(f"Socioeconomic data shape: {socio.shape}")
print(f"Years: {sorted(socio['any'].unique())}")
print(f"Number of unique census sections: {socio['SECCIO_CENSAL'].nunique()}")
print(f"\nSample socioeconomic data:")
socio.head()


Socioeconomic data shape: (1068, 4)
Years: [2022]
Number of unique census sections: 1068

Sample socioeconomic data:


Unnamed: 0,any,SECCIO_CENSAL,concepte,valor
0,2022,8019301001,Índex socioeconòmic territorial,85.7
1,2022,8019301002,Índex socioeconòmic territorial,75.8
2,2022,8019301003,Índex socioeconòmic territorial,73.7
3,2022,8019301004,Índex socioeconòmic territorial,81.8
4,2022,8019301005,Índex socioeconòmic territorial,79.1


In [12]:
# Since IST is static (constant across years), we'll take one value per SECCIO_CENSAL
# Filter to get the IST value (concepte = "Índex socioeconòmic territorial")
socio_ist = socio[socio['concepte'] == 'Índex socioeconòmic territorial'].copy()

# Select SECCIO_CENSAL and valor, rename valor to ist
socio_ist = socio_ist[['SECCIO_CENSAL', 'valor']].copy()
socio_ist = socio_ist.rename(columns={'valor': 'ist'})

# The file contains a single year (2022), so each section has exactly one value;
# first() just guarantees one row per SECCIO_CENSAL. With several years, sort by
# 'any' beforehand so that the intended year is the one selected.
socio_ist = socio_ist.groupby('SECCIO_CENSAL')['ist'].first().reset_index()

print(f"Socioeconomic IST shape: {socio_ist.shape}")
print(f"Number of unique census sections: {socio_ist['SECCIO_CENSAL'].nunique()}")
print(f"\nIST statistics:")
print(socio_ist['ist'].describe())
print(f"\nSample IST data:")
socio_ist.head()


Socioeconomic IST shape: (1068, 2)
Number of unique census sections: 1068

IST statistics:
count    1068.000000
mean      108.917603
std        14.578281
min        51.700000
25%       101.200000
50%       111.100000
75%       118.200000
max       138.000000
Name: ist, dtype: float64

Sample IST data:


Unnamed: 0,SECCIO_CENSAL,ist
0,8019301001,85.7
1,8019301002,75.8
2,8019301003,73.7
3,8019301004,81.8
4,8019301005,79.1


### 7.2 Load and Aggregate Consumption Data

Consumption data is split across multiple parquet files. We load all of them, then sum `CONSUMO_REAL` by `SECCIO_CENSAL` and `FECHA` (date) to obtain total daily consumption per census section.


In [13]:
import glob
import os

# Load all consumption parquet files
consum_files = glob.glob("clean/split_consum_bcn/consum_clean_bcn_part_*.parquet")
consum_files.sort()  # Ensure consistent order

print(f"Found {len(consum_files)} consumption files")

# Load and concatenate all consumption files
# (a generator avoids rebinding `df`, which still holds the census sections from Section 1)
consum = pd.concat(
    (pd.read_parquet(path) for path in consum_files),
    ignore_index=True,
)

# Filter to start at 2023-01-04 (temporal overlap with weather and leaks)
date_start = pd.to_datetime('2023-01-04')
consum = consum[pd.to_datetime(consum['FECHA']) >= date_start].copy()

print(f"\nTotal consumption records (after filtering): {len(consum)}")
print(f"Date range: {consum['FECHA'].min()} to {consum['FECHA'].max()}")
print(f"Number of unique census sections: {consum['SECCIO_CENSAL'].nunique()}")
print(f"\nSample consumption data:")
consum.head()


Found 18 consumption files

Total consumption records (after filtering): 5014350
Date range: 2023-01-04 00:00:00 to 2024-12-31 00:00:00
Number of unique census sections: 621

Sample consumption data:


Unnamed: 0,POLIZA_SUMINISTRO,FECHA,CONSUMO_REAL,SECCIO_CENSAL,US_AIGUA_GEST,DATA_INST_COMP
730,VECWAVDUULZDSBOP,2023-01-04,2070,8019303025,C,2016-04-25
731,VECWAVDUULZDSBOP,2023-01-05,1938,8019303025,C,2016-04-25
732,VECWAVDUULZDSBOP,2023-01-06,4,8019303025,C,2016-04-25
733,VECWAVDUULZDSBOP,2023-01-07,53,8019303025,C,2016-04-25
734,VECWAVDUULZDSBOP,2023-01-08,7,8019303025,C,2016-04-25


In [14]:
# Aggregate to one metric per SECCIO_CENSAL per day: total consumption
consum_daily = consum.groupby(['SECCIO_CENSAL', 'FECHA'], as_index=False).agg(
    CONSUMO_TOTAL=('CONSUMO_REAL', 'sum')
)

# Rename FECHA for consistency
consum_daily = consum_daily.rename(columns={'FECHA': 'DATA_LECTURA'})

print(f"Consumption daily shape: {consum_daily.shape}")
print(f"Date range: {consum_daily['DATA_LECTURA'].min()} to {consum_daily['DATA_LECTURA'].max()}")
consum_daily.head()


Consumption daily shape: (450195, 3)
Date range: 2023-01-04 00:00:00 to 2024-12-31 00:00:00


Unnamed: 0,SECCIO_CENSAL,DATA_LECTURA,CONSUMO_TOTAL
0,8019301001,2023-01-04,4948
1,8019301001,2023-01-05,5259
2,8019301001,2023-01-06,5006
3,8019301001,2023-01-07,6301
4,8019301001,2023-01-08,5428


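### 7.3 Load and Aggregate Leak Incidents

Load the leak incidents, restrict them to the common date range, and count incidents per census section per day.


In [15]: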
# Load leak incidents data
leaks = pd.read_parquet("clean/fuites_clean_bcn.parquet")

# Filter to date range: 2023-01-04 to 2024-12-31 (temporal overlap with weather and consumption)
date_start = pd.to_datetime('2023-01-04')
date_end = pd.to_datetime('2024-12-31')
leak_dates = pd.to_datetime(leaks['CREATED_MENSAJE'])
leaks = leaks[leak_dates.between(date_start, date_end)].copy()

print(f"Leak incidents shape (after filtering): {leaks.shape}")
print(f"Date range: {leaks['CREATED_MENSAJE'].min()} to {leaks['CREATED_MENSAJE'].max()}")
print(f"Number of unique census sections: {leaks['SECCIO_CENSAL'].nunique()}")

# Note: Not all dates will have leaks - this is normal
# When merged with gdf_daily, days without leaks will have NUM_FUITES = 0

print(f"\nSample leak data:")
leaks.head()


Leak incidents shape (after filtering): (1243, 5)
Date range: 2023-01-04 to 2024-12-30
Number of unique census sections: 428

Sample leak data:


Unnamed: 0,POLISSA_SUBM,CREATED_MENSAJE,CODIGO_MENSAJE,US_AIGUA_SUBM,SECCIO_CENSAL
0,KWHZ5UG2ZKENUFC2,2023-12-03,FUITA,DOMÈSTIC,8019305059
1,GVXPU34GVXQUIWFK,2023-08-10,FUITA,DOMÈSTIC,8019310139
2,GVXPU34GVXQUIWFK,2023-06-10,FUITA,DOMÈSTIC,8019310139
3,I7GGTJ6C6FMR5ARW,2024-09-06,FUITA,DOMÈSTIC,8019302087
4,I7GGTJ6C6FMR5ARW,2024-11-13,FUITA,DOMÈSTIC,8019302087


In [16]:
# Convert CREATED_MENSAJE to datetime if needed
leaks['CREATED_MENSAJE'] = pd.to_datetime(leaks['CREATED_MENSAJE'])

# Aggregate: count total number of leaks per census section per day
# This counts ALL leak incidents (rows), including multiple leaks from different contracts (POLISSA_SUBM) 
# on the same day in the same census section. Each row is one leak incident.
leaks_daily = leaks.groupby(['SECCIO_CENSAL', 'CREATED_MENSAJE'], as_index=False).agg(
    NUM_FUITES=('POLISSA_SUBM', 'count'),  # Count all leak incidents (rows) - sums all leaks
)

# Rename for consistency
leaks_daily = leaks_daily.rename(columns={'CREATED_MENSAJE': 'DATA_LECTURA'})

print(f"Leaks daily shape: {leaks_daily.shape}")
print(f"Date range: {leaks_daily['DATA_LECTURA'].min()} to {leaks_daily['DATA_LECTURA'].max()}")
print(f"\nExample: Days with multiple leaks from different contracts:")
print(leaks_daily[leaks_daily['NUM_FUITES'] > 1].head())
leaks_daily.head()

Leaks daily shape: (1235, 3)
Date range: 2023-01-04 00:00:00 to 2024-12-30 00:00:00

Example: Days with multiple leaks from different contracts:
    SECCIO_CENSAL DATA_LECTURA  NUM_FUITES
44    08019301020   2024-05-07           2
55    08019301021   2023-11-08           2
56    08019301021   2024-01-10           2
465   08019303025   2024-12-18           2
685   08019305025   2024-07-18           2


Unnamed: 0,SECCIO_CENSAL,DATA_LECTURA,NUM_FUITES
0,8019301002,2024-06-19,1
1,8019301002,2024-08-16,1
2,8019301004,2023-03-22,1
3,8019301004,2024-03-04,1
4,8019301004,2024-06-25,1


### 7.4 Merge All Data Sources

Now we'll merge consumption, leaks, and socioeconomic data with the weather-enriched `gdf_daily` GeoDataFrame.
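
Because each merge is expected to add columns without adding rows, it can help to let pandas enforce that. The sketch below, on toy data, uses the `validate` and `indicator` arguments of `DataFrame.merge`; the frame names `panel` and `leaks_daily_demo` are placeholders, not objects from this notebook.

```python
import pandas as pd

# Toy section-day panel (left side of the merge)
panel = pd.DataFrame({
    "SECCIO_CENSAL": ["08019301002", "08019301002", "08019301004"],
    "DATA_LECTURA": pd.to_datetime(["2024-06-19", "2024-06-20", "2023-03-22"]),
})

# Toy daily leak counts (right side): one row per section-day with leaks
leaks_daily_demo = pd.DataFrame({
    "SECCIO_CENSAL": ["08019301002", "08019301004"],
    "DATA_LECTURA": pd.to_datetime(["2024-06-19", "2023-03-22"]),
    "NUM_FUITES": [1, 1],
})

merged = panel.merge(
    leaks_daily_demo,
    on=["SECCIO_CENSAL", "DATA_LECTURA"],
    how="left",
    validate="one_to_one",  # raises MergeError if keys repeat on either side
    indicator=True,         # adds a `_merge` column: 'both' or 'left_only'
)

# Section-days without a matching leak row get a count of 0
merged["NUM_FUITES"] = merged["NUM_FUITES"].fillna(0).astype(int)

print(merged)
print(merged["_merge"].value_counts())
```

The `_merge` column also makes key mismatches visible: if the identifier were a string on one side and an integer on the other, every row would show up as `left_only`.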


In [17]:
# Step 1: Merge socioeconomic data (IST) - static factor
# Merge on SECCIO_CENSAL only (no date needed since it's constant)
gdf_daily = gdf_daily.merge(
    socio_ist,
    on='SECCIO_CENSAL',
    how='left'
)

print(f"After IST merge: {gdf_daily.shape}")
print(f"IST records matched: {gdf_daily['ist'].notna().sum()}")

# Step 2: Merge consumption data
# Merge on SECCIO_CENSAL and DATA_LECTURA (date)
gdf_daily = gdf_daily.merge(
    consum_daily,
    on=['SECCIO_CENSAL', 'DATA_LECTURA'],
    how='left'
)

print(f"\nAfter consumption merge: {gdf_daily.shape}")
print(f"Consumption records matched: {gdf_daily['CONSUMO_TOTAL'].notna().sum()}")

# Step 3: Merge leak incidents data
# Merge on SECCIO_CENSAL and DATA_LECTURA (date)
gdf_daily = gdf_daily.merge(
    leaks_daily,
    on=['SECCIO_CENSAL', 'DATA_LECTURA'],
    how='left'
)

# Fill NaN with 0 for leak counts (no leaks = 0)
gdf_daily['NUM_FUITES'] = gdf_daily['NUM_FUITES'].fillna(0).astype(int)

print(f"\nAfter leaks merge: {gdf_daily.shape}")
print(f"Days with leaks: {(gdf_daily['NUM_FUITES'] > 0).sum()}")

print(f"\nFinal columns: {len(gdf_daily.columns)}")
print(f"\nSample of final merged data:")
gdf_daily.head(10)


After IST merge: (777504, 38)
IST records matched: 777504

After consumption merge: (777504, 39)
Consumption records matched: 450195

After leaks merge: (777504, 40)
Days with leaks: 1235

Final columns: 40

Sample of final merged data:


Unnamed: 0,codi_districte,nom_districte,codi_barri,nom_barri,codi_aeb,codi_seccio_censal,geometria_etrs89,geometry,centroid,centroid_lat,...,Pressió atmosfèrica mínima diària + hora,Ratxa màxima diària del vent 10 m + hora,Temperatura mitjana diària,Temperatura mitjana diària clàssica,Temperatura màxima diària + hora,Temperatura mínima diària + hora,Velocitat mitjana diària del vent 10 m (esc.),ist,CONSUMO_TOTAL,NUM_FUITES
0,1,Ciutat Vella,1,el Raval,1,1,"POLYGON ((431076.9025 4581077.3095, 431058.164...","POLYGON ((2.17575 41.37827, 2.17552 41.37865, ...",POINT (2.17722 41.37432),41.374324,...,1027.7,4.1,13.0,13.5,16.8,10.2,0.9,85.7,4948.0,0
1,1,Ciutat Vella,1,el Raval,1,1,"POLYGON ((431076.9025 4581077.3095, 431058.164...","POLYGON ((2.17575 41.37827, 2.17552 41.37865, ...",POINT (2.17722 41.37432),41.374324,...,1021.2,5.4,12.5,13.1,17.3,8.8,1.1,85.7,5259.0,0
2,1,Ciutat Vella,1,el Raval,1,1,"POLYGON ((431076.9025 4581077.3095, 431058.164...","POLYGON ((2.17575 41.37827, 2.17552 41.37865, ...",POINT (2.17722 41.37432),41.374324,...,1019.9,5.8,11.8,11.9,16.0,7.7,0.9,85.7,5006.0,0
3,1,Ciutat Vella,1,el Raval,1,1,"POLYGON ((431076.9025 4581077.3095, 431058.164...","POLYGON ((2.17575 41.37827, 2.17552 41.37865, ...",POINT (2.17722 41.37432),41.374324,...,1013.6,10.2,12.5,13.0,16.7,9.3,1.7,85.7,6301.0,0
4,1,Ciutat Vella,1,el Raval,1,1,"POLYGON ((431076.9025 4581077.3095, 431058.164...","POLYGON ((2.17575 41.37827, 2.17552 41.37865, ...",POINT (2.17722 41.37432),41.374324,...,1005.5,6.8,14.2,14.0,16.2,11.7,1.8,85.7,5428.0,0
5,1,Ciutat Vella,1,el Raval,1,1,"POLYGON ((431076.9025 4581077.3095, 431058.164...","POLYGON ((2.17575 41.37827, 2.17552 41.37865, ...",POINT (2.17722 41.37432),41.374324,...,1006.8,14.5,14.5,14.4,17.1,11.7,4.2,85.7,4970.0,0
6,1,Ciutat Vella,1,el Raval,1,1,"POLYGON ((431076.9025 4581077.3095, 431058.164...","POLYGON ((2.17575 41.37827, 2.17552 41.37865, ...",POINT (2.17722 41.37432),41.374324,...,1020.0,8.6,12.2,12.6,15.7,9.4,1.6,85.7,4487.0,0
7,1,Ciutat Vella,1,el Raval,1,1,"POLYGON ((431076.9025 4581077.3095, 431058.164...","POLYGON ((2.17575 41.37827, 2.17552 41.37865, ...",POINT (2.17722 41.37432),41.374324,...,1019.0,9.4,13.0,13.2,16.4,9.9,1.2,85.7,4972.0,0
8,1,Ciutat Vella,1,el Raval,1,1,"POLYGON ((431076.9025 4581077.3095, 431058.164...","POLYGON ((2.17575 41.37827, 2.17552 41.37865, ...",POINT (2.17722 41.37432),41.374324,...,1020.5,7.5,12.2,12.8,16.0,9.6,1.9,85.7,5323.0,0
9,1,Ciutat Vella,1,el Raval,1,1,"POLYGON ((431076.9025 4581077.3095, 431058.164...","POLYGON ((2.17575 41.37827, 2.17552 41.37865, ...",POINT (2.17722 41.37432),41.374324,...,1019.6,5.1,12.7,13.5,16.8,10.2,1.1,85.7,4747.0,0


## 8. Verify Dataset Structure

Let's verify that `gdf_daily` has the expected structure:
- 1068 census sections
- One row per census section per day from 2023-01-04 to 2024-12-31


In [18]:
# Calculate expected number of rows
date_start = pd.to_datetime('2023-01-04')
date_end = pd.to_datetime('2024-12-31')
expected_days = (date_end - date_start).days + 1  # +1 to include both start and end dates
expected_census_sections = 1068
expected_rows = expected_census_sections * expected_days

# Actual values
actual_rows = len(gdf_daily)
actual_census_sections = gdf_daily['SECCIO_CENSAL'].nunique()
actual_dates = gdf_daily['DATA_LECTURA'].nunique()
actual_date_range = (gdf_daily['DATA_LECTURA'].min(), gdf_daily['DATA_LECTURA'].max())

# Check for duplicates (should be 0)
duplicates = gdf_daily.duplicated(subset=['SECCIO_CENSAL', 'DATA_LECTURA']).sum()

print("=== Dataset Structure Verification ===\n")
print(f"Expected rows: {expected_rows:,}")
print(f"Actual rows: {actual_rows:,}")
print(f"Match: {'✓' if actual_rows == expected_rows else '✗'}\n")

print(f"Expected census sections: {expected_census_sections}")
print(f"Actual census sections: {actual_census_sections}")
print(f"Match: {'✓' if actual_census_sections == expected_census_sections else '✗'}\n")

print(f"Expected days: {expected_days}")
print(f"Actual unique dates: {actual_dates}")
print(f"Match: {'✓' if actual_dates == expected_days else '✗'}\n")

print(f"Expected date range: {date_start.date()} to {date_end.date()}")
print(f"Actual date range: {actual_date_range[0].date()} to {actual_date_range[1].date()}")
print(f"Match: {'✓' if actual_date_range[0].date() == date_start.date() and actual_date_range[1].date() == date_end.date() else '✗'}\n")

print(f"Duplicate rows (SECCIO_CENSAL + DATA_LECTURA): {duplicates}")
print(f"No duplicates: {'✓' if duplicates == 0 else '✗'}\n")

# Check if we have exactly one row per census section per date
if actual_rows == expected_rows and duplicates == 0:
    print("✓ Dataset structure is correct: one row per census section per day")
else:
    print("✗ Dataset structure needs review")
    
# Show sample of date distribution
print(f"\nSample: First few dates for one census section:")
sample_seccio = gdf_daily['SECCIO_CENSAL'].iloc[0]
print(gdf_daily[gdf_daily['SECCIO_CENSAL'] == sample_seccio][['SECCIO_CENSAL', 'DATA_LECTURA']].head(10))


=== Dataset Structure Verification ===

Expected rows: 777,504
Actual rows: 777,504
Match: ✓

Expected census sections: 1068
Actual census sections: 1068
Match: ✓

Expected days: 728
Actual unique dates: 728
Match: ✓

Expected date range: 2023-01-04 to 2024-12-31
Actual date range: 2023-01-04 to 2024-12-31
Match: ✓

Duplicate rows (SECCIO_CENSAL + DATA_LECTURA): 0
No duplicates: ✓

✓ Dataset structure is correct: one row per census section per day

Sample: First few dates for one census section:
  SECCIO_CENSAL DATA_LECTURA
0   08019301001   2023-01-04
1   08019301001   2023-01-05
2   08019301001   2023-01-06
3   08019301001   2023-01-07
4   08019301001   2023-01-08
5   08019301001   2023-01-09
6   08019301001   2023-01-10
7   08019301001   2023-01-11
8   08019301001   2023-01-12
9   08019301001   2023-01-13
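
The row-count check above confirms the totals, but when a check fails it helps to see which section/date pairs are absent. The following is a minimal sketch on toy data, not part of the original pipeline; the helper name `find_missing_pairs` is hypothetical, and it assumes one expected row per section per calendar day.

```python
import pandas as pd

def find_missing_pairs(df, section_col, date_col, start, end):
    """Return the (section, date) pairs that are absent from df for a daily range."""
    all_dates = pd.date_range(start, end, freq="D")
    sections = df[section_col].unique()
    # Full grid of every section crossed with every expected date
    full_index = pd.MultiIndex.from_product(
        [sections, all_dates], names=[section_col, date_col]
    )
    present = pd.MultiIndex.from_frame(df[[section_col, date_col]])
    return full_index.difference(present)

# Toy data: section "B" has no row for 2023-01-05
toy = pd.DataFrame({
    "SECCIO_CENSAL": ["A", "A", "A", "B", "B"],
    "DATA_LECTURA": pd.to_datetime(
        ["2023-01-04", "2023-01-05", "2023-01-06", "2023-01-04", "2023-01-06"]
    ),
})
missing = find_missing_pairs(
    toy, "SECCIO_CENSAL", "DATA_LECTURA", "2023-01-04", "2023-01-06"
)
print(list(missing))
```

On the real data the same call would take `gdf_daily` together with `date_start` and `date_end`; an empty result corresponds to the "one row per census section per day" outcome reported above.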


## 8.5. Compute Section-Level Normalized Metrics (Based on Correlation Analysis)

Following the correlation analysis, we compute section-level metrics that each carry distinct information for the vulnerability assessment:

1. **`leaks_per_year`**: Historical leak frequency per section, used in place of `total_leaks` because the two are highly correlated. The denominator is the number of distinct calendar years in which the section recorded at least one leak.
2. **`leaks_per_1000_meters`**: Leak density normalized by network size (leakiness relative to the amount of infrastructure)
3. **`consumption_per_meter`**: Mean daily consumption per meter (intensity of use, as opposed to total consumption)
4. **`num_meters`**: Number of water meters per section, counted as distinct `POLIZA_SUMINISTRO` values (network size indicator)

These normalized metrics capture different aspects of vulnerability as identified in the correlation analysis.
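
As an illustration of the arithmetic behind these metrics, here is a small sketch on made-up numbers. The helper `safe_ratio` and the toy values are assumptions for the example only; the column names follow the ones used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical per-section totals (not real data)
toy = pd.DataFrame({
    "SECCIO_CENSAL": ["A", "B", "C"],
    "num_meters": [10, 4, 0],
    "total_leaks": [5, 0, 2],
    "n_years_obs": [2, 0, 1],
    "mean_daily_consumption": [2500.0, 800.0, 0.0],
})

def safe_ratio(num, den):
    """Divide two Series, returning NaN wherever the denominator is zero."""
    return num / den.where(den != 0, np.nan)

toy["leaks_per_year"] = safe_ratio(toy["total_leaks"], toy["n_years_obs"])
toy["leaks_per_1000_meters"] = safe_ratio(toy["total_leaks"], toy["num_meters"] / 1000)
toy["consumption_per_meter"] = safe_ratio(toy["mean_daily_consumption"], toy["num_meters"])
print(toy)
```

Section A has 5 leaks over 10 meters, i.e. 500 leaks per 1000 meters. With so few meters per section, a single leak produces a very large density, which is worth keeping in mind when reading the upper tail of this metric.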


In [19]:
# Compute section-level normalized metrics based on correlation analysis findings

# 1. Compute number of meters per section from consumption data
num_meters_per_section = (
    consum.groupby('SECCIO_CENSAL')['POLIZA_SUMINISTRO']
    .nunique()
    .reset_index()
    .rename(columns={'POLIZA_SUMINISTRO': 'num_meters'})
)

# 2. Compute leaks_per_year: total leaks divided by the number of distinct
#    calendar years in which the section recorded at least one leak
#    (years with no leaks are not counted in the denominator)
leaks['year'] = pd.to_datetime(leaks['CREATED_MENSAJE']).dt.year
leaks_stats = (
    leaks.groupby('SECCIO_CENSAL')
    .agg(
        total_leaks=('POLISSA_SUBM', 'count'),
        first_leak_date=('CREATED_MENSAJE', 'min'),
        last_leak_date=('CREATED_MENSAJE', 'max'),
        n_years_obs=('year', 'nunique')
    )
    .reset_index()
)

# Calculate leaks_per_year (avoid division by zero)
leaks_stats['n_years_obs'] = leaks_stats['n_years_obs'].replace(0, pd.NA)
leaks_stats['leaks_per_year'] = (
    leaks_stats['total_leaks'] / leaks_stats['n_years_obs']
)

# 3. Compute mean daily consumption per section (for consumption_per_meter calculation)
mean_daily_consumption_per_section = (
    consum_daily.groupby('SECCIO_CENSAL')['CONSUMO_TOTAL']
    .mean()
    .reset_index()
    .rename(columns={'CONSUMO_TOTAL': 'mean_daily_consumption'})
)

# 4. Merge metrics and compute normalized values
section_metrics = (
    num_meters_per_section
    .merge(mean_daily_consumption_per_section, on='SECCIO_CENSAL', how='outer')
    .merge(leaks_stats[['SECCIO_CENSAL', 'total_leaks', 'leaks_per_year']], on='SECCIO_CENSAL', how='left')
)

# Fill missing num_meters with 0 for sections without consumption data
section_metrics['num_meters'] = section_metrics['num_meters'].fillna(0)

# Compute consumption_per_meter (mean daily consumption per meter)
section_metrics['consumption_per_meter'] = (
    section_metrics['mean_daily_consumption'] / 
    section_metrics['num_meters'].replace(0, pd.NA)
)

# Compute leaks_per_1000_meters (leak density normalized by network size)
section_metrics['leaks_per_1000_meters'] = (
    section_metrics['total_leaks'] / 
    (section_metrics['num_meters'].replace(0, pd.NA) / 1000)
)

# Replace NaN with 0 (e.g. sections with no recorded leaks, or with zero meters)
section_metrics['leaks_per_year'] = section_metrics['leaks_per_year'].fillna(0)
section_metrics['leaks_per_1000_meters'] = section_metrics['leaks_per_1000_meters'].fillna(0)
section_metrics['consumption_per_meter'] = section_metrics['consumption_per_meter'].fillna(0)
section_metrics['mean_daily_consumption'] = section_metrics['mean_daily_consumption'].fillna(0)

# Merge section-level metrics into gdf_daily
# (left join: sections that do not appear in section_metrics get NaN in the new columns)
gdf_daily = gdf_daily.merge(
    section_metrics[['SECCIO_CENSAL', 'num_meters', 'leaks_per_year', 'leaks_per_1000_meters', 
                     'consumption_per_meter', 'mean_daily_consumption']],
    on='SECCIO_CENSAL',
    how='left'
)

print("Section-level normalized metrics computed and merged:")
print(f"  - num_meters: {section_metrics['num_meters'].notna().sum()} sections")
print(f"  - leaks_per_year: {section_metrics['leaks_per_year'].notna().sum()} sections")
print(f"  - leaks_per_1000_meters: {section_metrics['leaks_per_1000_meters'].notna().sum()} sections")
print(f"  - consumption_per_meter: {section_metrics['consumption_per_meter'].notna().sum()} sections")
print(f"\nMetric statistics:")
print(section_metrics[['num_meters', 'leaks_per_year', 'leaks_per_1000_meters', 'consumption_per_meter']].describe())


Section-level normalized metrics computed and merged:
  - num_meters: 621 sections
  - leaks_per_year: 621 sections
  - leaks_per_1000_meters: 621 sections
  - consumption_per_meter: 621 sections

Metric statistics:
       num_meters  leaks_per_year  leaks_per_1000_meters  \
count  621.000000      621.000000             621.000000   
mean    11.159420        1.108696             375.688443   
std     17.932904        1.681212            1021.586851   
min      1.000000        0.000000               0.000000   
25%      3.000000        0.000000               0.000000   
50%      6.000000        1.000000              19.607843   
75%     13.000000        1.500000             285.714286   
max    256.000000       12.500000           11000.000000   

       consumption_per_meter  
count             621.000000  
mean              403.279008  
std               725.556966  
min                34.158621  
25%               189.955172  
50%               234.735861  
75%               323.7333

In [20]:
gdf_daily.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 777504 entries, 0 to 777503
Data columns (total 45 columns):
 #   Column                                         Non-Null Count   Dtype         
---  ------                                         --------------   -----         
 0   codi_districte                                 777504 non-null  object        
 1   nom_districte                                  777504 non-null  object        
 2   codi_barri                                     777504 non-null  object        
 3   nom_barri                                      777504 non-null  object        
 4   codi_aeb                                       777504 non-null  object        
 5   codi_seccio_censal                             777504 non-null  object        
 6   geometria_etrs89                               777504 non-null  object        
 7   geometry                                       777504 non-null  geometry      
 8   centroid                            

## 9. Calculate Daily Vulnerability Scores

Calculate **two separate vulnerability scores** per census section per day:

1. **Rainfall Vulnerability Score**: For intense rainfall episodes
   - Precipitation extremes and anomalies
   - High humidity conditions
   - Infrastructure vulnerability (leaks, drainage issues)
   - Socioeconomic factors

2. **Heatwave Vulnerability Score**: For heatwave episodes
   - Temperature extremes and anomalies
   - Low humidity conditions
   - Socioeconomic factors (elderly, poor housing conditions)
   - Infrastructure factors

Each score ranges from 0 (lowest vulnerability) to 100 (highest vulnerability), allowing us to identify the most vulnerable census sections for each type of weather event separately.


In [21]:
import numpy as np
from sklearn.preprocessing import MinMaxScaler

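# Identify candidate weather variables: every column not in the exclusion list below.
# Note: the section-level metrics merged in 8.5 (num_meters, leaks_per_year, ...) are not
# excluded, so they also appear in this list; later cells select weather columns by name.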
weather_cols = [col for col in gdf_daily.columns if col not in [
    'codi_districte', 'nom_districte', 'codi_barri', 'nom_barri', 'codi_aeb', 
    'codi_seccio_censal', 'geometria_etrs89', 'geometry', 'centroid', 
    'centroid_lat', 'centroid_lon', 'SECCIO_CENSAL', 'dist_m', 'CODI_ESTACIO', 
    'DATA_LECTURA', 'ist', 'CONSUMO_TOTAL', 'NUM_FUITES'
]]

print(f"Available weather variables ({len(weather_cols)}):")
for i, col in enumerate(weather_cols, 1):
    print(f"{i}. {col}")


Available weather variables (27):
1. Amplitud tèrmica diària
2. Direcció de la ratxa màx. diària de vent 10 m
3. Direcció mitjana diària del vent 10 m (m. 1)
4. Evapotranspiració de referència
5. Humitat relativa mitjana diària
6. Humitat relativa màxima diària + data
7. Humitat relativa mínima diària + data
8. Irradiació solar global diària
9. Precipitació acumulada diària
10. Precipitació acumulada diària (8-8 h)
11. Precipitació màxima en 1 h (diària) + hora
12. Precipitació màxima en 1 min (diària) + hora
13. Precipitació màxima en 30 min (diària)+ hora
14. Pressió atmosfèrica mitjana diària
15. Pressió atmosfèrica màxima diària + hora
16. Pressió atmosfèrica mínima diària + hora
17. Ratxa màxima diària del vent 10 m + hora
18. Temperatura mitjana diària
19. Temperatura mitjana diària clàssica
20. Temperatura màxima diària + hora
21. Temperatura mínima diària + hora
22. Velocitat mitjana diària del vent 10 m (esc.)
23. num_meters
24. leaks_per_year
25. leaks_per_1000_meters
26. con

In [22]:
# Identify key weather variables for vulnerability assessment
# These are the most relevant for heatwaves and intense rainfall

# Temperature-related variables (for heatwave vulnerability)
temp_vars = [col for col in weather_cols if any(x in col.lower() for x in ['temperatura', 'tèrmica', 'evapotranspiració'])]

# Precipitation-related variables (for rainfall vulnerability)
precip_vars = [col for col in weather_cols if any(x in col.lower() for x in ['precipitació', 'pluja', 'pluviositat'])]

# Other relevant variables
humidity_vars = [col for col in weather_cols if 'humitat' in col.lower()]
pressure_vars = [col for col in weather_cols if 'pressió' in col.lower()]
wind_vars = [col for col in weather_cols if 'vent' in col.lower()]

print("Temperature-related variables:", temp_vars)
print("\nPrecipitation-related variables:", precip_vars)
print("\nHumidity variables:", humidity_vars)
print("\nPressure variables:", pressure_vars)
print("\nWind variables:", wind_vars)


Temperature-related variables: ['Amplitud tèrmica diària', 'Evapotranspiració de referència', 'Temperatura mitjana diària', 'Temperatura mitjana diària clàssica', 'Temperatura màxima diària + hora', 'Temperatura mínima diària + hora']

Precipitation-related variables: ['Precipitació acumulada diària', 'Precipitació acumulada diària (8-8 h)', 'Precipitació màxima en 1 h (diària) + hora', 'Precipitació màxima en 1 min (diària) + hora', 'Precipitació màxima en 30 min (diària)+ hora']

Humidity variables: ['Humitat relativa mitjana diària', 'Humitat relativa màxima diària + data', 'Humitat relativa mínima diària + data']

Pressure variables: ['Pressió atmosfèrica mitjana diària', 'Pressió atmosfèrica màxima diària + hora', 'Pressió atmosfèrica mínima diària + hora']

Wind variables: ['Direcció de la ratxa màx. diària de vent 10 m', 'Direcció mitjana diària del vent 10 m (m. 1)', 'Ratxa màxima diària del vent 10 m + hora', 'Velocitat mitjana diària del vent 10 m (esc.)']


In [23]:
# Create a copy for calculations
gdf_vuln = gdf_daily.copy()

# Extract year and month for seasonal analysis
gdf_vuln['year'] = gdf_vuln['DATA_LECTURA'].dt.year
gdf_vuln['month'] = gdf_vuln['DATA_LECTURA'].dt.month
gdf_vuln['day_of_year'] = gdf_vuln['DATA_LECTURA'].dt.dayofyear

# Initialize vulnerability score components
print("Calculating vulnerability score components...")


Calculating vulnerability score components...


In [24]:
# 1. SOCIOECONOMIC VULNERABILITY COMPONENT
# Lower IST = higher vulnerability (inverse relationship)
# Normalize IST to 0-1 scale where 1 = most vulnerable (lowest IST)
ist_min = gdf_vuln['ist'].min()
ist_max = gdf_vuln['ist'].max()
gdf_vuln['vuln_socio'] = 1 - ((gdf_vuln['ist'] - ist_min) / (ist_max - ist_min))
print(f"Socioeconomic vulnerability: min={gdf_vuln['vuln_socio'].min():.3f}, max={gdf_vuln['vuln_socio'].max():.3f}")


Socioeconomic vulnerability: min=0.000, max=1.000
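
The inverted min-max scaling used for the socioeconomic component can be sketched in isolation as follows. The function name `inverted_min_max` and the sample IST values are hypothetical; the guard for a constant input is an addition that the cell above does not include.

```python
import pandas as pd

def inverted_min_max(values: pd.Series) -> pd.Series:
    """Scale to [0, 1] so the smallest input maps to 1 and the largest to 0."""
    lo, hi = values.min(), values.max()
    if hi == lo:
        # Constant input: nothing to rank on, so return 0 everywhere
        return pd.Series(0.0, index=values.index)
    return 1 - (values - lo) / (hi - lo)

ist = pd.Series([70.0, 85.0, 100.0, 130.0])  # hypothetical IST values
vuln = inverted_min_max(ist)
print(vuln.tolist())
```

Because the scaling uses the observed minimum and maximum, the resulting values are relative to the sections in this dataset and are not comparable across datasets with a different IST range.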


In [25]:
# 2. INFRASTRUCTURE VULNERABILITY COMPONENT
# Based on correlation analysis findings - using normalized metrics that provide distinct information

# 2a. Leak frequency vulnerability: use leaks_per_year (from correlation analysis)
# This captures historical leak frequency per section (instead of total_leaks which is highly correlated)
if 'leaks_per_year' in gdf_vuln.columns:
    leaks_per_year_max = gdf_vuln['leaks_per_year'].max()
    if leaks_per_year_max > 0:
        gdf_vuln['vuln_leaks_frequency'] = gdf_vuln['leaks_per_year'] / leaks_per_year_max
    else:
        gdf_vuln['vuln_leaks_frequency'] = 0
else:
    gdf_vuln['vuln_leaks_frequency'] = 0

# 2b. Leak density vulnerability: use leaks_per_1000_meters (from correlation analysis)
# This captures leakiness relative to network size - provides distinct information from frequency
if 'leaks_per_1000_meters' in gdf_vuln.columns:
    leaks_density_max = gdf_vuln['leaks_per_1000_meters'].max()
    if leaks_density_max > 0:
        gdf_vuln['vuln_leaks_density'] = gdf_vuln['leaks_per_1000_meters'] / leaks_density_max
    else:
        gdf_vuln['vuln_leaks_density'] = 0
else:
    gdf_vuln['vuln_leaks_density'] = 0

# 2c. Daily leak indicator: recent leaks increase vulnerability
gdf_vuln['vuln_leaks_daily'] = (gdf_vuln['NUM_FUITES'] > 0).astype(float)

# 2d. Consumption intensity vulnerability: use consumption_per_meter (from correlation analysis)
# This captures intensity of use per meter - distinct from total consumption which is highly correlated with network size
if 'consumption_per_meter' in gdf_vuln.columns:
    consum_intensity_max = gdf_vuln['consumption_per_meter'].max()
    if consum_intensity_max > 0:
        gdf_vuln['vuln_consumption_intensity'] = gdf_vuln['consumption_per_meter'] / consum_intensity_max
    else:
        gdf_vuln['vuln_consumption_intensity'] = 0
    gdf_vuln['vuln_consumption_intensity'] = gdf_vuln['vuln_consumption_intensity'].fillna(0)
else:
    gdf_vuln['vuln_consumption_intensity'] = 0

# Combine infrastructure components as an equal-weight average (0.25 each).
# Each metric is kept because it provides distinct information (per the correlation analysis):
# - leaks_per_year: historical frequency (0.25)
# - leaks_per_1000_meters: leak density, distinct dimension (0.25)
# - leaks_daily: immediate vulnerability indicator (0.25)
# - consumption_per_meter: intensity of use (0.25)
gdf_vuln['vuln_infrastructure'] = (
    0.25 * gdf_vuln['vuln_leaks_frequency'] + 
    0.25 * gdf_vuln['vuln_leaks_density'] + 
    0.25 * gdf_vuln['vuln_leaks_daily'] + 
    0.25 * gdf_vuln['vuln_consumption_intensity']
)

print(f"Infrastructure vulnerability (using normalized metrics): min={gdf_vuln['vuln_infrastructure'].min():.3f}, max={gdf_vuln['vuln_infrastructure'].max():.3f}")
print(f"  - Leak frequency component: min={gdf_vuln['vuln_leaks_frequency'].min():.3f}, max={gdf_vuln['vuln_leaks_frequency'].max():.3f}")
print(f"  - Leak density component: min={gdf_vuln['vuln_leaks_density'].min():.3f}, max={gdf_vuln['vuln_leaks_density'].max():.3f}")
print(f"  - Consumption intensity component: min={gdf_vuln['vuln_consumption_intensity'].min():.3f}, max={gdf_vuln['vuln_consumption_intensity'].max():.3f}")


Infrastructure vulnerability (using normalized metrics): min=0.001, max=0.704
  - Leak frequency component: min=0.000, max=1.000
  - Leak density component: min=0.000, max=1.000
  - Consumption intensity component: min=0.000, max=1.000


In [26]:
# 3. WEATHER VULNERABILITY COMPONENT
# Calculate vulnerability based on extreme weather conditions

# 3a. Temperature extremes (for heatwave vulnerability)
# Find temperature-related columns
temp_cols = [col for col in weather_cols if 'temperatura' in col.lower() or 'tèrmica' in col.lower()]

if temp_cols:
    # Use maximum temperature if available, otherwise use thermal amplitude
    if any('màxima' in col.lower() for col in temp_cols):
        temp_col = [col for col in temp_cols if 'màxima' in col.lower()][0]
    elif 'Amplitud tèrmica diària' in temp_cols:
        temp_col = 'Amplitud tèrmica diària'
    else:
        temp_col = temp_cols[0]
    
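    # Temperature anomaly: deviation from the section's mean for that calendar month
    # (pooled over all years in the data).
    # Note: the absolute value is taken below, so unusually cold days score as high
    # as unusually hot ones.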
    temp_seasonal = gdf_vuln.groupby(['SECCIO_CENSAL', 'month'])[temp_col].transform('mean')
    temp_anomaly = gdf_vuln[temp_col] - temp_seasonal
    temp_anomaly_max = temp_anomaly.abs().max()
    if temp_anomaly_max > 0:
        gdf_vuln['vuln_temp'] = (temp_anomaly / temp_anomaly_max).abs()
        gdf_vuln['vuln_temp'] = (gdf_vuln['vuln_temp'] - gdf_vuln['vuln_temp'].min()) / (gdf_vuln['vuln_temp'].max() - gdf_vuln['vuln_temp'].min())
    else:
        gdf_vuln['vuln_temp'] = 0
    print(f"Temperature vulnerability using: {temp_col}")
else:
    gdf_vuln['vuln_temp'] = 0
    print("No temperature variables found")

print(f"Temperature vulnerability: min={gdf_vuln['vuln_temp'].min():.3f}, max={gdf_vuln['vuln_temp'].max():.3f}")


Temperature vulnerability using: Temperatura màxima diària + hora
Temperature vulnerability: min=0.000, max=1.000
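
The temperature cell scores the absolute deviation from the monthly mean. For a heatwave-specific indicator, one alternative would be to keep only positive deviations, as the precipitation component does. The sketch below compares the two on toy data; it is an illustration under that assumption, not the method used in this notebook, and the column `tmax` is a hypothetical stand-in for the maximum-temperature column.

```python
import pandas as pd

toy = pd.DataFrame({
    "SECCIO_CENSAL": ["A"] * 4,
    "DATA_LECTURA": pd.to_datetime(
        ["2023-07-01", "2023-07-02", "2023-07-03", "2023-07-04"]
    ),
    "tmax": [28.0, 30.0, 36.0, 26.0],  # hypothetical daily maxima
})
toy["month"] = toy["DATA_LECTURA"].dt.month

# Monthly mean per section, broadcast back to every row
monthly_mean = toy.groupby(["SECCIO_CENSAL", "month"])["tmax"].transform("mean")
anomaly = toy["tmax"] - monthly_mean  # -2, 0, 6, -4 around a mean of 30

# Two-sided: cold and hot deviations both count
toy["vuln_abs"] = anomaly.abs() / anomaly.abs().max()
# One-sided: only hotter-than-usual days count
hot = anomaly.clip(lower=0)
toy["vuln_hot_only"] = hot / hot.max()
print(toy[["tmax", "vuln_abs", "vuln_hot_only"]])
```

In the toy data the coldest day (26 °C) gets a two-sided score of about 0.67 but a one-sided score of 0. On real data both variants need a guard for a zero maximum, as the original cell has.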


In [27]:
# 3b. Precipitation extremes (for rainfall vulnerability)
precip_cols = [col for col in weather_cols if 'precipitació' in col.lower() or 'pluja' in col.lower() or 'pluviositat' in col.lower()]

if precip_cols:
    # Use the first match, which here is the daily accumulated precipitation
    precip_col = precip_cols[0]
    
    # Precipitation anomaly: deviation from the section's mean for that calendar month
    precip_seasonal = gdf_vuln.groupby(['SECCIO_CENSAL', 'month'])[precip_col].transform('mean')
    precip_anomaly = gdf_vuln[precip_col] - precip_seasonal
    precip_anomaly_max = precip_anomaly.max()  # Only positive anomalies matter for vulnerability
    if precip_anomaly_max > 0:
        gdf_vuln['vuln_precip'] = np.maximum(0, precip_anomaly) / precip_anomaly_max
        gdf_vuln['vuln_precip'] = (gdf_vuln['vuln_precip'] - gdf_vuln['vuln_precip'].min()) / (gdf_vuln['vuln_precip'].max() - gdf_vuln['vuln_precip'].min() + 1e-10)
    else:
        gdf_vuln['vuln_precip'] = 0
    print(f"Precipitation vulnerability using: {precip_col}")
else:
    gdf_vuln['vuln_precip'] = 0
    print("No precipitation variables found")

print(f"Precipitation vulnerability: min={gdf_vuln['vuln_precip'].min():.3f}, max={gdf_vuln['vuln_precip'].max():.3f}")


Precipitation vulnerability using: Precipitació acumulada diària
Precipitation vulnerability: min=0.000, max=1.000


In [28]:
# 3c. Humidity extremes (relevant for both heatwaves and rainfall)
humidity_cols = [col for col in weather_cols if 'humitat' in col.lower()]

if humidity_cols:
    # Use mean humidity
    if any('mitjana' in col.lower() for col in humidity_cols):
        hum_col = [col for col in humidity_cols if 'mitjana' in col.lower()][0]
    else:
        hum_col = humidity_cols[0]
    
    # Extreme low humidity (heatwave) or extreme high humidity (rainfall)
    hum_mean = gdf_vuln[hum_col].mean()
    hum_std = gdf_vuln[hum_col].std()
    if hum_std > 0:
        # Vulnerability increases with deviation from normal (both high and low)
        hum_anomaly = abs(gdf_vuln[hum_col] - hum_mean) / hum_std
        gdf_vuln['vuln_humidity'] = (hum_anomaly - hum_anomaly.min()) / (hum_anomaly.max() - hum_anomaly.min() + 1e-10)
    else:
        gdf_vuln['vuln_humidity'] = 0
    print(f"Humidity vulnerability using: {hum_col}")
else:
    gdf_vuln['vuln_humidity'] = 0
    print("No humidity variables found")

print(f"Humidity vulnerability: min={gdf_vuln['vuln_humidity'].min():.3f}, max={gdf_vuln['vuln_humidity'].max():.3f}")


Humidity vulnerability using: Humitat relativa mitjana diària
Humidity vulnerability: min=0.000, max=1.000
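
Since `vuln_humidity` is built from the absolute z-score, very dry and very humid days both score high. If a direction-specific indicator is wanted (humid for rainfall, dry for heatwaves), the z-score can be split by sign. This is a sketch with made-up values, not the notebook's computation.

```python
import pandas as pd

hum = pd.Series([40.0, 60.0, 80.0])  # hypothetical mean daily relative humidity (%)
z = (hum - hum.mean()) / hum.std()   # sample standard deviation (ddof=1)

two_sided = z.abs()           # what an absolute-deviation score measures
high_only = z.clip(lower=0)   # only more-humid-than-usual days
low_only = (-z).clip(lower=0) # only drier-than-usual days
print(pd.DataFrame({"z": z, "two_sided": two_sided,
                    "high_only": high_only, "low_only": low_only}))
```

Each of the three series can then be min-max scaled in the same way as the other components.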


In [29]:
# Combine weather components
# Weight: temperature (heatwaves) and precipitation (rainfall) are most important
gdf_vuln['vuln_weather'] = (
    0.4 * gdf_vuln['vuln_temp'] + 
    0.4 * gdf_vuln['vuln_precip'] + 
    0.2 * gdf_vuln['vuln_humidity']
)

print(f"Weather vulnerability: min={gdf_vuln['vuln_weather'].min():.3f}, max={gdf_vuln['vuln_weather'].max():.3f}")


Weather vulnerability: min=0.001, max=0.794


In [30]:
# 4. CALCULATE TWO SEPARATE VULNERABILITY SCORES

# 4a. RAINFALL VULNERABILITY SCORE
# Focus on: precipitation, high humidity, infrastructure (leaks/drainage), socioeconomic
# Weights:
# - Socioeconomic: 0.25 (lower IST = more vulnerable to flooding impacts)
# - Infrastructure: 0.35 (leaks and drainage issues are critical for rainfall)
# - Weather (precipitation + humidity): 0.40 (precipitation is primary, humidity secondary)

gdf_vuln['vuln_weather_rainfall'] = (
    0.7 * gdf_vuln['vuln_precip'] +  # Precipitation is most important for rainfall
    0.3 * gdf_vuln['vuln_humidity']  # Humidity anomaly; note vuln_humidity is two-sided (very dry days score high too)
)

gdf_vuln['VULNERABILITY_SCORE_RAINFALL'] = (
    0.25 * gdf_vuln['vuln_socio'] + 
    0.35 * gdf_vuln['vuln_infrastructure'] + 
    0.40 * gdf_vuln['vuln_weather_rainfall']
)

# Scale to 0-100
gdf_vuln['VULNERABILITY_SCORE_RAINFALL'] = gdf_vuln['VULNERABILITY_SCORE_RAINFALL'] * 100

# 4b. HEATWAVE VULNERABILITY SCORE
# Focus on: temperature extremes, low humidity, socioeconomic (elderly, poor housing), infrastructure
# Weights:
# - Socioeconomic: 0.35 (higher weight - elderly, poor housing more vulnerable to heat)
# - Infrastructure: 0.20 (less critical for heatwaves than rainfall)
# - Weather (temperature + low humidity): 0.45 (temperature is primary, low humidity secondary)

# For heatwaves, low humidity increases vulnerability (inverse of high humidity vulnerability)
# Create a low humidity indicator (extreme low humidity = high vulnerability)
# Find the humidity column used earlier
humidity_cols = [col for col in weather_cols if 'humitat' in col.lower()]
if humidity_cols and 'vuln_humidity' in gdf_vuln.columns:
    # Use mean humidity column (same as used for vuln_humidity calculation)
    if any('mitjana' in col.lower() for col in humidity_cols):
        hum_col_heat = [col for col in humidity_cols if 'mitjana' in col.lower()][0]
    else:
        hum_col_heat = humidity_cols[0]
    
    # Low humidity vulnerability: days with very low humidity score higher
    hum_mean = gdf_vuln[hum_col_heat].mean()
    hum_std = gdf_vuln[hum_col_heat].std()
    if hum_std > 0:
        # Low humidity (below mean) increases vulnerability
        low_hum_anomaly = np.maximum(0, (hum_mean - gdf_vuln[hum_col_heat]) / hum_std)
        low_hum_max = low_hum_anomaly.max()
        if low_hum_max > 0:
            gdf_vuln['vuln_low_humidity'] = low_hum_anomaly / low_hum_max
            gdf_vuln['vuln_low_humidity'] = (gdf_vuln['vuln_low_humidity'] - gdf_vuln['vuln_low_humidity'].min()) / (gdf_vuln['vuln_low_humidity'].max() - gdf_vuln['vuln_low_humidity'].min() + 1e-10)
        else:
            gdf_vuln['vuln_low_humidity'] = 0
    else:
        gdf_vuln['vuln_low_humidity'] = 0
else:
    gdf_vuln['vuln_low_humidity'] = 0

gdf_vuln['vuln_weather_heatwave'] = (
    0.7 * gdf_vuln['vuln_temp'] +  # Temperature is most important for heatwaves
    0.3 * gdf_vuln['vuln_low_humidity']  # Low humidity also relevant
)

gdf_vuln['VULNERABILITY_SCORE_HEATWAVE'] = (
    0.35 * gdf_vuln['vuln_socio'] + 
    0.20 * gdf_vuln['vuln_infrastructure'] + 
    0.45 * gdf_vuln['vuln_weather_heatwave']
)

# Scale to 0-100
gdf_vuln['VULNERABILITY_SCORE_HEATWAVE'] = gdf_vuln['VULNERABILITY_SCORE_HEATWAVE'] * 100

print("=== Rainfall Vulnerability Score Summary ===")
print(f"Rainfall Vulnerability Score Statistics:")
print(gdf_vuln['VULNERABILITY_SCORE_RAINFALL'].describe())
print(f"\nScore range: {gdf_vuln['VULNERABILITY_SCORE_RAINFALL'].min():.2f} - {gdf_vuln['VULNERABILITY_SCORE_RAINFALL'].max():.2f}")

print("\n=== Heatwave Vulnerability Score Summary ===")
print(f"Heatwave Vulnerability Score Statistics:")
print(gdf_vuln['VULNERABILITY_SCORE_HEATWAVE'].describe())
print(f"\nScore range: {gdf_vuln['VULNERABILITY_SCORE_HEATWAVE'].min():.2f} - {gdf_vuln['VULNERABILITY_SCORE_HEATWAVE'].max():.2f}")

print("\n=== Top 10 Most Vulnerable Days (Rainfall) ===")
top_rainfall = gdf_vuln.nlargest(10, 'VULNERABILITY_SCORE_RAINFALL')[['SECCIO_CENSAL', 'DATA_LECTURA', 'VULNERABILITY_SCORE_RAINFALL', 'vuln_socio', 'vuln_infrastructure', 'vuln_weather_rainfall']]
print(top_rainfall)

print("\n=== Top 10 Most Vulnerable Days (Heatwave) ===")
top_heatwave = gdf_vuln.nlargest(10, 'VULNERABILITY_SCORE_HEATWAVE')[['SECCIO_CENSAL', 'DATA_LECTURA', 'VULNERABILITY_SCORE_HEATWAVE', 'vuln_socio', 'vuln_infrastructure', 'vuln_weather_heatwave']]
print(top_heatwave)


=== Rainfall Vulnerability Score Summary ===
Rainfall Vulnerability Score Statistics:
count    452088.000000
mean         12.284599
std           5.153437
min           0.371834
25%           8.761352
50%          11.414874
75%          14.846508
max          60.563999
Name: VULNERABILITY_SCORE_RAINFALL, dtype: float64

Score range: 0.37 - 60.56

=== Heatwave Vulnerability Score Summary ===
Heatwave Vulnerability Score Statistics:
count    452088.000000
mean         19.232190
std           7.747571
min           0.356493
25%          13.766415
50%          18.152824
75%          23.567344
max          68.421016
Name: VULNERABILITY_SCORE_HEATWAVE, dtype: float64

Score range: 0.36 - 68.42

=== Top 10 Most Vulnerable Days (Rainfall) ===
       SECCIO_CENSAL DATA_LECTURA  VULNERABILITY_SCORE_RAINFALL  vuln_socio  \
183209   08019303025   2024-04-29                     60.563999    0.660487   
17953    08019301026   2024-04-29                     54.703997    1.000000   
735033   080193100
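
Both score summaries report a `count` of 452,088, which is lower than the 777,504 rows in the dataset. That count equals 621 × 728, which would be consistent with scores being non-null only for the 621 sections that received section-level metrics, with the remaining sections left as NaN after the left merge. A minimal sketch for checking merge coverage, on toy data with a hypothetical section "C" that has no metrics:

```python
import pandas as pd

daily = pd.DataFrame({
    "SECCIO_CENSAL": ["A", "A", "B", "B", "C", "C"],
    "DATA_LECTURA": pd.to_datetime(["2023-01-04", "2023-01-05"] * 3),
})
section_metrics = pd.DataFrame({
    "SECCIO_CENSAL": ["A", "B"],
    "leaks_per_year": [1.5, 0.0],
})

# indicator=True adds a _merge column telling where each row found a match
merged = daily.merge(section_metrics, on="SECCIO_CENSAL", how="left", indicator=True)
unmatched = merged.loc[merged["_merge"] == "left_only", "SECCIO_CENSAL"].unique()
n_scored_rows = merged["leaks_per_year"].notna().sum()

print("Sections without metrics:", list(unmatched))
print("Rows with a non-null metric:", n_scored_rows)
```

Whether unmatched sections should be filled with 0, imputed, or excluded from the ranking is a modelling decision; the check only makes the gap visible.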

### Vulnerability Score Components

#### Rainfall Vulnerability Score (0-100)

Calculated as a weighted combination of:

1. **Socioeconomic Vulnerability (25%)**: Based on IST index
   - Lower IST = higher vulnerability to flooding impacts
   
2. **Infrastructure Vulnerability (35%)**: Based on normalized metrics from correlation analysis:
   - **Leak frequency** (`leaks_per_year`) - 25%: Historical leak frequency per section
   - **Leak density** (`leaks_per_1000_meters`) - 25%: Leakiness relative to network size (distinct dimension)
   - **Daily leak incidents** - 25%: Immediate vulnerability indicator
   - **Consumption intensity** (`consumption_per_meter`) - 25%: Intensity of use per meter (distinct from total consumption)
   - Higher weight because leaks/drainage are critical for rainfall
   - Uses normalized metrics that provide distinct information (based on correlation analysis findings)

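3. **Weather Vulnerability (40%)**: Based on:
   - Precipitation anomalies (70%): positive deviations of daily precipitation from the section's monthly mean
   - Humidity anomalies (30%): deviation of mean daily humidity from the overall mean, in either direction as currently computed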

#### Heatwave Vulnerability Score (0-100)

Calculated as a weighted combination of:

1. **Socioeconomic Vulnerability (35%)**: Based on IST index
   - Lower IST = higher vulnerability (elderly, poor housing conditions)
   - Higher weight than rainfall because socioeconomic factors are more critical for heat
   
2. **Infrastructure Vulnerability (20%)**: Based on normalized metrics from correlation analysis:
   - **Leak frequency** (`leaks_per_year`) - 25%: Historical leak frequency per section
   - **Leak density** (`leaks_per_1000_meters`) - 25%: Leakiness relative to network size (distinct dimension)
   - **Daily leak incidents** - 25%: Immediate vulnerability indicator
   - **Consumption intensity** (`consumption_per_meter`) - 25%: Intensity of use per meter (distinct from total consumption)
   - Lower weight than rainfall (less critical for heatwaves)
   - Uses normalized metrics that provide distinct information (based on correlation analysis findings)

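3. **Weather Vulnerability (45%)**: Based on:
   - Temperature anomalies (70%): absolute deviation of daily maximum temperature from the section's monthly mean
   - Low humidity conditions (30%): mean daily humidity below the overall mean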

Both scores range from **0 (lowest vulnerability)** to **100 (highest vulnerability)**.
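
The weighted combination described above can be written as a small helper. This is a sketch with toy component values; the function `composite_score` and the dictionary names are hypothetical, while the weights are the ones listed in this section.

```python
import pandas as pd

RAINFALL_WEIGHTS = {"socio": 0.25, "infrastructure": 0.35, "weather": 0.40}
HEATWAVE_WEIGHTS = {"socio": 0.35, "infrastructure": 0.20, "weather": 0.45}

def composite_score(components: pd.DataFrame, weights: dict) -> pd.Series:
    """Weighted sum of [0, 1] components, scaled to 0-100."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    total = sum(w * components[name] for name, w in weights.items())
    return 100 * total

# Toy component values, each already scaled to [0, 1]
toy = pd.DataFrame({
    "socio": [1.0, 0.2],
    "infrastructure": [0.5, 0.0],
    "weather": [1.0, 0.1],
})
rain = composite_score(toy, RAINFALL_WEIGHTS)
heat = composite_score(toy, HEATWAVE_WEIGHTS)
print(rain.tolist(), heat.tolist())
```

A score of 100 requires every component to equal 1 on the same row. Since the infrastructure component is itself an average of four sub-scores, observed maxima well below 100 are expected.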


In [31]:
# Update gdf_daily with vulnerability scores
gdf_daily = gdf_vuln.copy()

# Verify the scores were added
print(f"Final dataset shape: {gdf_daily.shape}")
print(f"Rainfall vulnerability score column present: {'VULNERABILITY_SCORE_RAINFALL' in gdf_daily.columns}")
print(f"Heatwave vulnerability score column present: {'VULNERABILITY_SCORE_HEATWAVE' in gdf_daily.columns}")
print(f"\nSample rows with vulnerability scores:")
sample_cols = ['SECCIO_CENSAL', 'DATA_LECTURA', 'VULNERABILITY_SCORE_RAINFALL', 'VULNERABILITY_SCORE_HEATWAVE', 'ist', 'NUM_FUITES', 'CONSUMO_TOTAL']
print(gdf_daily[sample_cols].head(10))


Final dataset shape: (777504, 63)
Rainfall vulnerability score column present: True
Heatwave vulnerability score column present: True

Sample rows with vulnerability scores:
  SECCIO_CENSAL DATA_LECTURA  VULNERABILITY_SCORE_RAINFALL  \
0   08019301001   2023-01-04                     15.427258   
1   08019301001   2023-01-05                     17.319935   
2   08019301001   2023-01-06                     17.608716   
3   08019301001   2023-01-07                     17.897496   
4   08019301001   2023-01-08                     16.164813   
5   08019301001   2023-01-09                     20.496522   
6   08019301001   2023-01-10                     18.763838   
7   08019301001   2023-01-11                     17.319935   
8   08019301001   2023-01-12                     18.186277   
9   08019301001   2023-01-13                     19.052619   

   VULNERABILITY_SCORE_HEATWAVE   ist  NUM_FUITES  CONSUMO_TOTAL  
0                     25.684880  85.7           0         4948.0  
1        

In [32]:
# Distribution of vulnerability scores by date (to identify vulnerable periods)
vuln_rainfall_by_date = gdf_daily.groupby('DATA_LECTURA')['VULNERABILITY_SCORE_RAINFALL'].agg(['mean', 'std', 'max']).reset_index()
vuln_rainfall_by_date = vuln_rainfall_by_date.sort_values('mean', ascending=False)

vuln_heatwave_by_date = gdf_daily.groupby('DATA_LECTURA')['VULNERABILITY_SCORE_HEATWAVE'].agg(['mean', 'std', 'max']).reset_index()
vuln_heatwave_by_date = vuln_heatwave_by_date.sort_values('mean', ascending=False)

print("Top 20 dates with highest average RAINFALL vulnerability scores:")
print(vuln_rainfall_by_date.head(20))

print("\nTop 20 dates with highest average HEATWAVE vulnerability scores:")
print(vuln_heatwave_by_date.head(20))


Top 20 dates with highest average RAINFALL vulnerability scores:
    DATA_LECTURA       mean       std        max
481   2024-04-29  39.624435  4.794450  60.563999
613   2024-09-08  28.855467  5.445242  47.281608
670   2024-11-04  28.822160  4.602291  49.009729
524   2024-06-11  27.722218  4.437939  43.704028
430   2024-03-09  27.232552  4.182529  43.623578
661   2024-10-26  27.168693  4.195881  44.199666
235   2023-08-27  26.966477  4.116423  43.590688
254   2023-09-15  26.001440  4.221826  43.097318
708   2024-12-12  24.639798  4.248608  46.041309
496   2024-05-14  21.852776  4.205216  38.012290
474   2024-04-22  21.077598  4.138873  37.865593
665   2024-10-30  20.792432  4.346861  36.752950
452   2024-03-31  20.585490  4.275480  37.855726
625   2024-09-20  20.373004  4.504327  41.488541
141   2023-05-25  20.122433  4.559007  43.710053
637   2024-10-02  20.063090  4.113614  36.510912
17    2023-01-21  19.940981  4.209081  37.127653
285   2023-10-16  19.915526  4.285906  35.968929
116 

In [33]:
# Distribution of vulnerability scores by census section (to identify most vulnerable areas)
vuln_rainfall_by_section = gdf_daily.groupby('SECCIO_CENSAL')['VULNERABILITY_SCORE_RAINFALL'].agg(['mean', 'std', 'max']).reset_index()
vuln_rainfall_by_section = vuln_rainfall_by_section.merge(
    gdf_daily[['SECCIO_CENSAL', 'nom_districte', 'nom_barri']].drop_duplicates(),
    on='SECCIO_CENSAL',
    how='left'
)
vuln_rainfall_by_section = vuln_rainfall_by_section.sort_values('mean', ascending=False)

vuln_heatwave_by_section = gdf_daily.groupby('SECCIO_CENSAL')['VULNERABILITY_SCORE_HEATWAVE'].agg(['mean', 'std', 'max']).reset_index()
vuln_heatwave_by_section = vuln_heatwave_by_section.merge(
    gdf_daily[['SECCIO_CENSAL', 'nom_districte', 'nom_barri']].drop_duplicates(),
    on='SECCIO_CENSAL',
    how='left'
)
vuln_heatwave_by_section = vuln_heatwave_by_section.sort_values('mean', ascending=False)

print("Top 20 census sections with highest average RAINFALL vulnerability scores:")
print(vuln_rainfall_by_section.head(20))

print("\nTop 20 census sections with highest average HEATWAVE vulnerability scores:")
print(vuln_heatwave_by_section.head(20))


Top 20 census sections with highest average RAINFALL vulnerability scores:
     SECCIO_CENSAL       mean       std        max   nom_districte  \
24     08019301026  29.041701  2.951780  54.703997    Ciutat Vella   
1009   08019310089  28.812021  2.967048  54.474317      Sant Martí   
33     08019301035  28.648883  3.087830  54.190986    Ciutat Vella   
7      08019301008  28.031496  2.985340  53.669754    Ciutat Vella   
839    08019309015  26.994262  2.939779  52.668577     Sant Andreu   
251    08019303025  26.762036  3.414345  60.563999  Sants-Montjuïc   
1005   08019310085  25.474291  3.005364  51.136586      Sant Martí   
521    08019306024  24.880559  3.632095  53.557103          Gràcia   
1      08019301002  24.333644  2.978243  49.983920    Ciutat Vella   
9      08019301010  24.242611  2.946413  49.904907    Ciutat Vella   
16     08019301017  24.214142  2.988312  49.852399    Ciutat Vella   
19     08019301020  24.018893  3.324578  49.404746    Ciutat Vella   
6      08019301

## 10. Save Results

Save the final dataset with vulnerability scores for use in mapping and analysis.


In [34]:
# Save the dataset with vulnerability scores as Parquet (recommended for large datasets)
output_file = "clean/vulnerability_daily.parquet"
gdf_daily.to_parquet(output_file, index=False)
print(f"Saved vulnerability dataset to: {output_file}")
print(f"Dataset contains {len(gdf_daily):,} rows (census sections × days)")
print(f"Date range: {gdf_daily['DATA_LECTURA'].min().date()} to {gdf_daily['DATA_LECTURA'].max().date()}")
print(f"\nRainfall vulnerability score range: {gdf_daily['VULNERABILITY_SCORE_RAINFALL'].min():.2f} - {gdf_daily['VULNERABILITY_SCORE_RAINFALL'].max():.2f}")
print(f"Heatwave vulnerability score range: {gdf_daily['VULNERABILITY_SCORE_HEATWAVE'].min():.2f} - {gdf_daily['VULNERABILITY_SCORE_HEATWAVE'].max():.2f}")


Saved vulnerability dataset to: clean/vulnerability_daily.parquet
Dataset contains 777,504 rows (census sections × days)
Date range: 2023-01-04 to 2024-12-31

Rainfall vulnerability score range: 0.37 - 60.56
Heatwave vulnerability score range: 0.36 - 68.42


## 11. Using Vulnerability Scores for Mapping

The dataset now contains two separate vulnerability scores:

- **`VULNERABILITY_SCORE_RAINFALL`**: Use this for mapping vulnerability to intense rainfall episodes
- **`VULNERABILITY_SCORE_HEATWAVE`**: Use this for mapping vulnerability to heatwave episodes

### Mapping Recommendations:

1. **For Rainfall Maps**:
   - Filter to dates with high precipitation
   - Use `VULNERABILITY_SCORE_RAINFALL` for color scaling
   - Consider using a blue color scale (darker = more vulnerable)

2. **For Heatwave Maps**:
   - Filter to dates with high temperatures
   - Use `VULNERABILITY_SCORE_HEATWAVE` for color scaling
   - Consider using a red/orange color scale (darker = more vulnerable)

3. **Dual Visualization**:
   - Create side-by-side maps for the same date
   - Or use a bivariate color scheme to show both vulnerabilities simultaneously
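
The bivariate option in item 3 can be sketched as follows. This is a minimal illustration, not part of the notebook's pipeline: it assumes a 3×3 scheme based on terciles, and the helper names `tercile_breaks` and `bivariate_class` as well as the sample scores are hypothetical. Each section receives a label such as `3-1` (high rainfall vulnerability, low heatwave vulnerability), which can then be mapped to one of nine colors.

```python
from bisect import bisect_right
from statistics import quantiles


def tercile_breaks(values):
    """Return the two cut points that split the values into three bins."""
    return quantiles(values, n=3)


def bivariate_class(rain, heat, rain_breaks, heat_breaks):
    """Map a (rainfall, heatwave) score pair to a label such as '3-1'.

    The first digit is the rainfall bin (1 = low, 3 = high);
    the second digit is the heatwave bin.
    """
    r = bisect_right(rain_breaks, rain) + 1
    h = bisect_right(heat_breaks, heat) + 1
    return f"{r}-{h}"


# Hypothetical scores for five census sections on a single date
rain_scores = [5.0, 12.0, 20.0, 33.0, 47.0]
heat_scores = [40.0, 8.0, 22.0, 15.0, 60.0]

rain_breaks = tercile_breaks(rain_scores)
heat_breaks = tercile_breaks(heat_scores)

classes = [
    bivariate_class(r, h, rain_breaks, heat_breaks)
    for r, h in zip(rain_scores, heat_scores)
]
print(classes)
```

With the real data, the two score lists would come from the rows of `gdf_daily` for a single `DATA_LECTURA`, and the resulting labels could be stored in a new column and used as a categorical color key. Terciles are only one possible choice; fixed thresholds would keep the classes comparable across dates.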
