Objective: Prepare and clean the dengue incidence dataset for geospatial analysis in QGIS or Google Earth Engine.

Key Steps:

Loaded the official dataset using pandas.

Checked for missing values, notably in Age, AgeUnit, and SEROTYPE_DENV.

Verified the columns relevant for spatial and temporal analysis (e.g., GCODENAME, CASE_YEAR, CASE_MONTH, SEROTYPE_DENV).

Converted dates and filtered rows (if needed).

Saved a cleaned CSV version for import into GIS tools.


## Data Preprocessing and Cleaning (Updated)

We successfully loaded the dengue incidence dataset from the Puerto Rico Department of Health using pandas.

While cleaning the data:
- We dropped rows missing critical fields like `Age`, `Sex`, `CASE_YEAR`, `CASE_MONTH`, and `GCODENAME`.
- We also converted the `OnsetDate` and `DateOfDeath` columns to datetime values, coercing unparseable entries to missing values.



In [3]:
import pandas as pd

# Load the dataset
df = pd.read_excel("CasosDengue_2020-2025_SanJuan_04.16.2025.xlsx", engine="openpyxl")

# Preview the structure
print("Columns in the dataset:", df.columns.tolist())
print("\nMissing values:\n", df.isnull().sum())

# Drop rows with missing essential values
df_cleaned = df.dropna(subset=["Age", "Sex", "CASE_YEAR", "CASE_MONTH", "GCODENAME"]).copy()

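# Convert date columns to datetime; unparseable values become NaT.
# Plain column assignment is safe here because df_cleaned is an explicit copy,
# and it lets the column take the datetime dtype.
df_cleaned["OnsetDate"] = pd.to_datetime(df_cleaned["OnsetDate"], errors="coerce")
df_cleaned["DateOfDeath"] = pd.to_datetime(df_cleaned["DateOfDeath"], errors="coerce")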

# Save cleaned file
df_cleaned.to_csv("dengue_sanjuan_cleaned.csv", index=False)

print("\nCleaned dataset saved as 'dengue_sanjuan_cleaned.csv'. Ready for spatial analysis!")


Columns in the dataset: ['Id', 'GCODENAME', 'AGE_GROUP', 'Age', 'AgeUnit', 'Sex', 'CASE_YEAR', 'CASE_MONTH', 'CASE_CDC_DB_WEEK', 'PRDH_REGION', 'SEROTYPE_DENV', 'CaseType', 'OnsetDate', 'DateOfDeath', 'Arbovirus', 'CaseStatus']

Missing values:
 Id                     0
GCODENAME              0
AGE_GROUP              0
Age                    1
AgeUnit                1
Sex                    0
CASE_YEAR              0
CASE_MONTH             0
CASE_CDC_DB_WEEK       0
PRDH_REGION            0
SEROTYPE_DENV        329
CaseType               0
OnsetDate              0
DateOfDeath         2227
Arbovirus              0
CaseStatus             0
dtype: int64

Cleaned dataset saved as 'dengue_sanjuan_cleaned.csv'. Ready for spatial analysis!
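As a quick sanity check on the cleaning step, the sketch below counts how many rows `dropna` removes, how many dates fail to parse, and how many cases fall in each municipality, year, and month. It runs on a small made-up sample that only borrows the dataset's column names; the values, and the names `sample`, `rows_dropped`, `unparsed_dates`, and `monthly_counts`, are illustrative assumptions and not part of the original notebook.

```python
import pandas as pd

# Hypothetical sample rows that mimic the dataset's column names (not real data)
sample = pd.DataFrame({
    "GCODENAME": ["San Juan(incl. Rio Piedras)"] * 4,
    "Age": [13.0, None, 6.0, 64.0],
    "Sex": ["Male", "Female", "Male", "Female"],
    "CASE_YEAR": [2020, 2020, 2020, 2021],
    "CASE_MONTH": [1, 1, 1, 3],
    "OnsetDate": ["2020-01-08", "2020-01-17", "not a date", "2021-03-02"],
})

# Same essential fields as in the cleaning cell above
essential = ["Age", "Sex", "CASE_YEAR", "CASE_MONTH", "GCODENAME"]
cleaned = sample.dropna(subset=essential).copy()
rows_dropped = len(sample) - len(cleaned)

# Coerce dates and count the values that could not be parsed
cleaned["OnsetDate"] = pd.to_datetime(cleaned["OnsetDate"], errors="coerce")
unparsed_dates = int(cleaned["OnsetDate"].isna().sum())

# Case counts per municipality, year, and month
monthly_counts = (
    cleaned.groupby(["GCODENAME", "CASE_YEAR", "CASE_MONTH"])
    .size()
    .reset_index(name="case_count")
)

print("Rows dropped:", rows_dropped)
print("Unparsed onset dates:", unparsed_dates)
print(monthly_counts)
```

A table like `monthly_counts` has one row per area and period, which is usually easier to join to a boundary layer than the case-level records.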


## Step 4: Initial Mapping
I created a basic folium map centered on San Juan, Puerto Rico, to visualize the general study area. Since the current dataset does not contain exact coordinates for each dengue case, I placed a single marker indicating the overall research location. In later steps, I will spatially join the dengue data with geographic boundaries for deeper analysis in QGIS.


In [5]:
import folium
import pandas as pd

# Load the cleaned dengue dataset
df = pd.read_csv("dengue_sanjuan_cleaned.csv")

# Create a base map centered roughly around San Juan, PR
san_juan_coords = [18.4655, -66.1057]  # Latitude, Longitude of San Juan
map_sj = folium.Map(location=san_juan_coords, zoom_start=11)

# Optional: if your data had coordinates, you would add points here.
# Right now, we don't have latitude and longitude in the dataset.
# We can just put a marker showing "San Juan - Dengue Study Area"

folium.Marker(
    location=san_juan_coords,
    popup="San Juan Study Area: Dengue Fever",
    icon=folium.Icon(color='red', icon='info-sign')
).add_to(map_sj)

# Show the map
map_sj


To obtain the administrative boundaries for San Juan, I accessed the U.S. Census Bureau’s TIGER/Line Shapefiles FTP site. I selected the 2024 shapefiles and navigated to the COUSUB/ folder, which contains county subdivision boundaries. I downloaded the tl_2024_72_cousub.zip file (Puerto Rico shapefile) and extracted it for use in my GIS analysis. This shapefile will allow me to map and analyze dengue cases by geographic area.
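The manual download could also be scripted. The sketch below builds the file URL from the year and the state FIPS code (72 for Puerto Rico) and defines a helper that downloads and extracts the archive. The base URL and folder layout are assumptions based on how the Census Bureau site is commonly organized, and `cousub_url` and `download_and_extract` are hypothetical helper names; check the URL in a browser before relying on it.

```python
import zipfile
from pathlib import Path
from urllib.request import urlretrieve

# Assumed base URL for the TIGER/Line download site (verify before use)
TIGER_BASE = "https://www2.census.gov/geo/tiger"


def cousub_url(year: int, state_fips: str) -> str:
    """Build the assumed URL of a county subdivision shapefile archive."""
    file_name = f"tl_{year}_{state_fips}_cousub.zip"
    return f"{TIGER_BASE}/TIGER{year}/COUSUB/{file_name}"


def download_and_extract(url: str, dest_dir: str) -> Path:
    """Download a zip archive and extract it into dest_dir."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    zip_path = dest / url.rsplit("/", 1)[-1]
    urlretrieve(url, zip_path)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
    return dest


url = cousub_url(2024, "72")
print(url)

# Uncomment to actually fetch the file (requires network access):
# download_and_extract(url, "shapefiles")
```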


## Phase 1 - Step 3: Data Preprocessing and Integration into QGIS

After preparing my cleaned dengue dataset (dengue_sanjuan_cleaned.csv) in Jupyter Notebook, I moved to QGIS to integrate the epidemiological data with the geographic shapefile.

First, I loaded the Puerto Rico county subdivision shapefile (tl_2024_72_cousub) into QGIS.

Next, I used the "Data Source Manager" in QGIS to load my cleaned dengue dataset as a delimited text layer (CSV format).

During the CSV upload, I made sure of the following:

Selected the correct file (dengue_sanjuan_cleaned.csv).

Confirmed that "First record has field names" was checked.

Recognized that this dengue dataset does not yet have latitude/longitude coordinates, so I chose "No geometry (attribute-only table)" at this step.

Kept the coordinate system as EPSG:4326 - WGS 84 for consistency when ready to create mapped outputs later.

This setup will allow me to later join the dengue cases to the geographic layer (municipal boundaries) based on matching fields like GCODENAME (municipality name) or other codes.
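One thing to check before the join is whether the key values match exactly. In the dataset, GCODENAME appears as "San Juan(incl. Rio Piedras)", which is unlikely to equal a plain name in a boundary layer. The sketch below shows one possible way to normalize the name and test the join in pandas before doing it in QGIS. The `boundaries` table, its `NAME` and `area_id` columns, and the `join_name` helper column are hypothetical stand-ins for the shapefile's attribute table, and the case rows are made up.

```python
import pandas as pd

# Hypothetical case rows using the dataset's GCODENAME format (not real data)
cases = pd.DataFrame({
    "GCODENAME": ["San Juan(incl. Rio Piedras)"] * 3,
    "CASE_YEAR": [2020, 2020, 2021],
})

# Hypothetical attribute table standing in for the boundary layer
boundaries = pd.DataFrame({
    "NAME": ["San Juan", "Bayamón"],
    "area_id": ["area_1", "area_2"],
})

# Strip a trailing parenthetical such as "(incl. Rio Piedras)" from the name
cases["join_name"] = (
    cases["GCODENAME"]
    .str.replace(r"\s*\(.*\)$", "", regex=True)
    .str.strip()
)

# Count cases per normalized name
counts = cases.groupby("join_name").size().reset_index(name="case_count")

# Left join keeps every boundary row; areas without cases get a count of 0
joined = boundaries.merge(counts, left_on="NAME", right_on="join_name", how="left")
joined["case_count"] = joined["case_count"].fillna(0).astype(int)

print(joined[["NAME", "area_id", "case_count"]])
```

If many boundary rows end up with a count of 0 after a join like this, the two key columns probably differ in spelling or level of geography, and that is worth resolving before setting up the join in QGIS.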

In [6]:
import pandas as pd

# Reload the correct Excel file
df = pd.read_excel("CasosDengue_2020-2025_SanJuan_04.16.2025.xlsx", engine="openpyxl")

# Drop rows missing important info
df_cleaned = df.dropna(subset=["Age", "Sex", "CASE_YEAR", "CASE_MONTH", "GCODENAME"])

# Save again
df_cleaned.to_csv("dengue_sanjuan_cleaned.csv", index=False)

print("Saved cleaned CSV!")


Saved cleaned CSV!


In [7]:
df.head()

Unnamed: 0,Id,GCODENAME,AGE_GROUP,Age,AgeUnit,Sex,CASE_YEAR,CASE_MONTH,CASE_CDC_DB_WEEK,PRDH_REGION,SEROTYPE_DENV,CaseType,OnsetDate,DateOfDeath,Arbovirus,CaseStatus
0,1,San Juan(incl. Rio Piedras),10-14,13.0,Year,Male,2020,1,2,San Juan,Dengue 1 Indentified,ConfirmedDengue,2020-01-08,,DEN,Confirmed
1,2,San Juan(incl. Rio Piedras),10-14,13.0,Year,Female,2020,1,3,San Juan,Dengue Unspecified Subtype,ConfirmedDengue,2020-01-17,,DEN,Confirmed
2,3,San Juan(incl. Rio Piedras),05-09,6.0,Year,Male,2020,1,3,San Juan,Dengue 1 Indentified,ConfirmedDengue,2020-01-20,,DEN,Confirmed
3,4,San Juan(incl. Rio Piedras),05-09,6.0,Year,Female,2020,1,3,San Juan,Dengue 1 Indentified,ConfirmedDengue,2020-01-15,,DEN,Confirmed
4,5,San Juan(incl. Rio Piedras),60-79,64.0,Year,Male,2020,1,2,San Juan,Dengue 1 Indentified,ConfirmedDengue,2020-01-10,,DEN,Confirmed
