# AtlasBR: Unifying Brazilian Spatial Data

**A Guide to Reproducible Urban Analytics**

Welcome to **AtlasBR**. This library addresses the fragmentation of Brazilian socio-economic data. Instead of writing custom SQL for BigQuery, parsing inconsistent FTP schemas, and manually fixing geometries, you use a single, domain-driven interface to load:

1.  **Census 2010/2022** (Demographics, Income, Race, Built Environment)
2.  **RAIS** (Formal Employment)
3.  **CNES** (Health Infrastructure)
4.  **INEP** (School Census)

**Key Features Demonstrated:**

  * **Employment:** Filling the "public sector gap" in RAIS by injecting data from Schools and Hospitals.
  * **Hybrid Geocoding:** Combining Lat/Lon (Schools), CEPs (Firms), and Tracts (Census).
  * **Spatial Harmonization:** Re-aggregating data to H3 grids for comparison.


## 1\. Setup & Configuration

First, make sure the library can be imported. If you are running this notebook from the `tutorials/` folder without having installed the package, the following cell adds the project source to your Python path.



In [1]:
import sys
import os
from pathlib import Path

# --- DEVELOPER SETUP (Optional) ---
# If running locally without 'pip install', we add the '../src' folder to path.
current_path = Path(os.getcwd())
if current_path.name == "tutorials":
    # Go up one level to root, then into 'src' (if using src-layout) or just root (flat-layout)
    root_dir = current_path.parent
    src_dir = root_dir / "src"
    
    if src_dir.exists():
        sys.path.append(str(src_dir))
    else:
        sys.path.append(str(root_dir))

import atlasbr
import pandas as pd
import logging
from getpass import getpass

# Enable library logging (shows progress emojis)
atlasbr.configure_logging(level=logging.INFO)

print(f"✅ AtlasBR version {getattr(atlasbr, '__version__', 'dev')} loaded.")

✅ AtlasBR version 0.1.0 loaded.


### 1.1 Authentication

AtlasBR requires a Google Cloud Project ID to query the *Base dos Dados* data lake.

In [2]:
# --- 🔑 SECURE AUTHENTICATION ---
# Strategy: Try environment variable first, fallback to prompt.

project_id = os.getenv("GOOGLE_CLOUD_PROJECT") or os.getenv("GCLOUD_PROJECT_ID")

if not project_id:
    # This masks your input so secrets aren't saved in the notebook
    project_id = getpass("Enter Google Cloud Project ID: ")

try:
    atlasbr.set_billing_id(project_id.strip())
    print("✅ Billing ID configured.")
except Exception as e:
    print(f"❌ Configuration failed: {e}")



✅ Billing ID configured.


## 2\. The Census: Demographics & Tracts

The core of urban analysis is the Census. AtlasBR fetches geometry (Tracts), attributes (Income/Pop), and cleans the result (UTM projection + Urban Clipping) in one go.
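The UTM projection step can be illustrated with a small sketch. It assumes a WGS 84 / UTM target CRS in the southern hemisphere; the CRS that AtlasBR actually selects is not stated here, and the function names are hypothetical.

```python
import math


def utm_zone(lon_deg: float) -> int:
    """Return the UTM zone number (1-60) for a longitude in degrees."""
    zone = int(math.floor((lon_deg + 180.0) / 6.0)) + 1
    return min(max(zone, 1), 60)  # clamp so that lon = 180 maps to zone 60


def wgs84_utm_south_epsg(zone: int) -> int:
    """EPSG code for WGS 84 / UTM in the southern hemisphere (327xx)."""
    return 32700 + zone


# Niterói sits at roughly 43.1° W (approximate value, for illustration only).
zone = utm_zone(-43.1)
epsg = wgs84_utm_south_epsg(zone)
print(zone, epsg)
```

A projected CRS in metres is what makes area and distance calculations meaningful; with a standard `GeoDataFrame`, `gdf.to_crs(epsg=epsg)` would apply such a code.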


### 2.1 Basic Usage

We request data for **Niterói** and **São Gonçalo** (RJ). The library resolves these names to IBGE IDs automatically.

In [3]:
# Load Census 2010 (Basic + Income), clipped to urban footprint
gdf_census = atlasbr.load_census(
    places=["Niterói, RJ", "São Gonçalo, RJ"],
    year=2010,
    themes=["basic", "income"],
    clip_urban=True  # ✂️ Cuts away forests/water automatically
)

print(f"Loaded {len(gdf_census)} tracts.")
gdf_census.head()

2025-11-27 20:33:20,609 - atlasbr - INFO - Fetching municipality metadata from geobr...
2025-11-27 20:33:21,240 - atlasbr - INFO - 🔄 Resolved 2 inputs into 2 unique municipalities.
2025-11-27 20:33:21,241 - atlasbr - INFO - Fetching Census Tracts for 2 municipalities (Year 2010)...
2025-11-27 20:33:24,297 - atlasbr - INFO -     ✂️  Clipping to Urban Area...
2025-11-27 20:33:24,298 - atlasbr - INFO - Downloading Urban Areas (Epoch 2005) from IBGE...
2025-11-27 20:33:27,343 - atlasbr - INFO -        -> Retained 2857 tracts after clip.
2025-11-27 20:33:27,344 - atlasbr - INFO -     📦 Loading theme: 'basic'...


    ☁️  Fetching 2 columns from basedosdados.br_ibge_censo_demografico.setor_censitario_basico_2010...
Downloading: 100%|██████████|

2025-11-27 20:33:28,786 - atlasbr - INFO -     📦 Loading theme: 'income'...



    ☁️  Fetching 1 columns from basedosdados.br_ibge_censo_demografico.setor_censitario_basico_2010...
Downloading: 100%|██████████|

2025-11-27 20:33:30,086 - atlasbr - INFO - ✅ Loaded Census 2010 for 2 municipalities.



Loaded 2857 tracts.


| id_setor_censitario | geometry | domicilios | habitantes | rendimento_medio | year |
|---|---|---|---|---|---|
| 330490410000059 | POLYGON ((709015.408 7471735.281, 709025.75 74... | 146.0 | 447.0 | 427.26 | 2010 |
| 330490410000060 | POLYGON ((707823.471 7469367.663, 707783.152 7... | 88.0 | 281.0 | 637.48 | 2010 |
| 330490410000061 | POLYGON ((707890.616 7468196.71, 707778.589 74... | 236.0 | 823.0 | 411.77 | 2010 |
| 330490410000062 | POLYGON ((710628.29 7469819.425, 710529.25 746... | 174.0 | 541.0 | 446.02 | 2010 |
| 330490410000063 | POLYGON ((707839.414 7467573.461, 707811.875 7... | 358.0 | 1061.0 | 498.55 | 2010 |
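The 15-digit tract identifiers in the index follow IBGE's hierarchical coding: state (2 digits), municipality (7 digits including the state prefix), district, subdistrict, and tract. The helper below is a hypothetical sketch, not part of AtlasBR, showing how to split one.

```python
def split_tract_id(tract_id) -> dict:
    """Split a 15-digit IBGE census tract code into its components."""
    code = str(tract_id)
    if len(code) != 15 or not code.isdigit():
        raise ValueError(f"Expected a 15-digit tract code, got {code!r}")
    return {
        "uf": code[:2],             # state code
        "municipality": code[:7],   # 7-digit IBGE municipality code
        "district": code[7:9],
        "subdistrict": code[9:11],
        "tract": code[11:],
    }


parts = split_tract_id(330490410000059)
print(parts["uf"], parts["municipality"])
```

Grouping tracts by the first seven digits is a quick way to check which municipality each row belongs to when several places are loaded at once.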


### 2.2 Interactive Mapping

The output is a standard `GeoDataFrame`, ready for `explore()`.

In [4]:
# Map Average Income
gdf_census.explore(
    column="rendimento_medio",
    cmap="RdYlBu",
    tiles="CartoDB positron",
    scheme="NaturalBreaks",  # classification scheme; requires the mapclassify package
    style_kwds={"stroke": False},
    legend_kwds={"caption": "Average Income (2010)"}
)

## 3\. The "RAIS+" Pipeline: Unified Employment

Official RAIS data (formal employment) covers the private sector well but often omits public servants (*estatutários*).

AtlasBR offers a **Federated Pipeline** (`include_public_sector=True`). It:

1.  Fetches private jobs (RAIS).
2.  Fetches schools (INEP) and health units (CNES).
3.  Harmonizes columns and assigns proxy CNAE codes.
4.  Merges everything into one geospatial table.
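The harmonize-and-merge steps above can be sketched in plain Python. This is only an illustration under stated assumptions: every field name except `tipo_estabelecimento` is hypothetical, the records are made up, and the proxy codes use CNAE 2.0 divisions 85 (education) and 86 (human health), which may differ from what AtlasBR assigns internally.

```python
# Assumed proxy CNAE divisions per injected source (CNAE 2.0: 85 = education, 86 = human health).
PROXY_CNAE = {"Escola (INEP)": "85", "Saude (CNES)": "86"}


def harmonize(records, source_label, jobs_field):
    """Map source-specific records onto a shared, RAIS-like schema."""
    return [
        {
            "id": rec["id"],
            "tipo_estabelecimento": source_label,
            "cnae_divisao": PROXY_CNAE[source_label],
            "vinculos": rec[jobs_field],  # job count taken from the source's own field
        }
        for rec in records
    ]


# Made-up records, for illustration only.
rais = [{"id": "r1", "tipo_estabelecimento": 1, "cnae_divisao": "47", "vinculos": 12}]
schools = [{"id": "s1", "docentes": 55}]
health = [{"id": "h1", "profissionais": 30}]

unified = (
    rais
    + harmonize(schools, "Escola (INEP)", "docentes")
    + harmonize(health, "Saude (CNES)", "profissionais")
)
print(len(unified))
```

Tagging each injected row with its source keeps the merged table auditable, which is what makes the later breakdown by `tipo_estabelecimento` possible.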

In [5]:
# Load Unified Jobs (Private + Public)
gdf_jobs = atlasbr.load_rais(
    places=["Niterói, RJ"],
    year=2022,
    include_public_sector=True, # <--- The Magic Switch
    geocode=True                # <--- Calculates lat/lon via CEP or source coords
)

print(f"✅ Total Establishments: {len(gdf_jobs)}")

    🏭 Fetching RAIS 2022 from Base dos Dados...
Downloading: 100%|██████████|

2025-11-27 20:33:41,792 - atlasbr - INFO -     🌍 Geocoding RAIS via CEP...



    📍 Fetching CEP coordinates from Base dos Dados...
Downloading: 100%|██████████|


2025-11-27 20:33:43,267 - atlasbr - INFO -     ➕ Injecting Public Sector (Schools & Health) for 2022...


    🎓 Fetching Schools 2022 from Base dos Dados...
Downloading: 100%|██████████|

2025-11-27 20:33:44,366 - atlasbr - INFO -     🌍 Converting 258 schools to geometry...
2025-11-27 20:33:44,368 - atlasbr - INFO - ✅ Loaded 258 schools.



    🏥 Fetching CNES 9/2022 from Base dos Dados...
Downloading: 100%|██████████|
    📍 Fetching CEP coordinates from Base dos Dados...
Downloading: 100%|██████████|

2025-11-27 20:33:47,305 - atlasbr - INFO -     🌍 Geocoding 99 healthcare units via CEP...
2025-11-27 20:33:47,316 - atlasbr - INFO - ✅ Loaded 99 CNES units (Geolocated).
2025-11-27 20:33:47,334 - atlasbr - INFO -        -> Integrated 134 schools and 99 health units.
2025-11-27 20:33:47,358 - atlasbr - INFO - ✅ Loaded 24834 total establishments.



✅ Total Establishments: 24834


### 3.1 Inspecting the Integration

We can verify the source of the data. Notice how "Escola (INEP)" and "Saude (CNES)" appear alongside standard RAIS types.

In [6]:
# Breakdown by source type
gdf_jobs["tipo_estabelecimento"].value_counts()

tipo_estabelecimento
1                23329
2                 1025
3                  247
Escola (INEP)      134
Saude (CNES)        99
Name: count, dtype: int64

In [7]:
# Map the public-sector establishments (schools in blue, health units in red)
m = gdf_jobs[gdf_jobs["tipo_estabelecimento"] == "Escola (INEP)"].explore(
    color="blue", name="Schools"
)
m = gdf_jobs[gdf_jobs["tipo_estabelecimento"] == "Saude (CNES)"].explore(
    m=m, color="red", name="Health"
)
m # Display map

## 4\. Specialized Datasets

Sometimes you need the raw, domain-specific metrics (e.g., number of MRI machines or classroom counts) rather than just job counts.

### 4.1 Schools (INEP)

Note that Schools use high-precision Lat/Lon coordinates natively, unlike the CEP centroids used for businesses.
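The difference between native coordinates and CEP centroids can be pictured as a fallback rule: use the source's own latitude and longitude when present, otherwise look up the postal code. The sketch below is hypothetical (function name, record fields, and the lookup values are all illustrative) and does not reflect AtlasBR's internal implementation.

```python
def resolve_point(record, cep_centroids):
    """Return (lat, lon, method), preferring native coordinates over CEP centroids."""
    lat, lon = record.get("lat"), record.get("lon")
    if lat is not None and lon is not None:
        return (lat, lon, "source")
    # Normalize the CEP to digits only, e.g. "24020-000" -> "24020000".
    cep = "".join(ch for ch in str(record.get("cep", "")) if ch.isdigit())
    if cep in cep_centroids:
        return (*cep_centroids[cep], "cep")
    return (None, None, "unresolved")


# Illustrative lookup table; the coordinates are placeholders, not real centroid data.
centroids = {"24020000": (-22.89, -43.12)}

school = {"lat": -22.9012, "lon": -43.1034}
firm = {"cep": "24020-000"}
print(resolve_point(school, centroids)[2], resolve_point(firm, centroids)[2])
```

Keeping the method alongside each point makes it easy to filter out centroid-based locations when an analysis needs building-level precision.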

In [8]:
gdf_schools = atlasbr.load_schools(
    places=["Niterói, RJ"],
    year=2023,
    gcp_billing=project_id,
    as_gdf=True
)

cols = ["rede", "quantidade_matricula_fundamental", "quantidade_docente_educacao_basica"]
gdf_schools[cols].head()

    🎓 Fetching Schools 2023 from Base dos Dados...
Downloading: 100%|██████████|

2025-11-27 20:33:48,625 - atlasbr - INFO -     🌍 Converting 249 schools to geometry...
2025-11-27 20:33:48,629 - atlasbr - INFO - ✅ Loaded 249 schools.





|  | rede | quantidade_matricula_fundamental | quantidade_docente_educacao_basica |
|---|---|---|---|
| 0 | Publica | 0 | 55 |
| 1 | Publica | 0 | 44 |
| 2 | Publica | 221 | 50 |
| 3 | Privada | 0 | 14 |
| 4 | Privada | 78 | 17 |


### 4.2 Health Units (CNES)

CNES data includes specific infrastructure metrics like bed counts.

In [9]:
gdf_health = atlasbr.load_cnes(
    places=["Niterói, RJ"],
    year=2023,
    month=9,
    gcp_billing=project_id,
    geocode=True
)

gdf_health[["total_leitos_internacao", "total_consultorios"]].head()

    🏥 Fetching CNES 9/2023 from Base dos Dados...
Downloading: 100%|██████████|
    📍 Fetching CEP coordinates from Base dos Dados...
Downloading: 100%|██████████|

2025-11-27 20:33:51,549 - atlasbr - INFO -     🌍 Geocoding 108 healthcare units via CEP...
2025-11-27 20:33:51,565 - atlasbr - INFO - ✅ Loaded 108 CNES units (Geolocated).





|  | total_leitos_internacao | total_consultorios |
|---|---|---|
| 0 | 0 | 19 |
| 1 | 0 | 4 |
| 2 | 0 | 1 |
| 3 | 0 | 12 |
| 4 | 37 | 37 |


## 5\. Spatial Harmonization (H3 Grids)

Comparing Census Tracts with other data is difficult because tract boundaries are irregular polygons that rarely align with other spatial units. AtlasBR simplifies this by offering **H3 Interpolation** directly in the loader.

It uses **Tobler's Areal Weighting** to transfer data from Tracts to a regular Hexagonal Grid.
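To make the idea concrete, here is a minimal areal-weighting sketch that uses axis-aligned rectangles instead of real tract and hexagon geometries. It is not AtlasBR's implementation; the function names and numbers are illustrative. Counts (extensive variables) are split by the share of each source area that overlaps the target, while averages (intensive variables) are weighted by the overlap area itself.

```python
def rect_area(r):
    """Area of a rectangle given as (xmin, ymin, xmax, ymax)."""
    return (r[2] - r[0]) * (r[3] - r[1])


def overlap_area(a, b):
    """Area of the intersection of two axis-aligned rectangles (0 if disjoint)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(w, 0.0) * max(h, 0.0)


def areal_weighting(sources, target):
    """sources: list of (rect, count, mean_value). Returns (count, mean) for target."""
    count = 0.0
    weighted_sum = 0.0
    total_overlap = 0.0
    for rect, src_count, src_mean in sources:
        inter = overlap_area(rect, target)
        count += src_count * inter / rect_area(rect)  # extensive: share of source area
        weighted_sum += src_mean * inter              # intensive: weight by overlap
        total_overlap += inter
    mean = weighted_sum / total_overlap if total_overlap else None
    return count, mean


tracts = [
    ((0, 0, 2, 2), 400, 500.0),   # (geometry, habitantes, rendimento_medio)
    ((2, 0, 4, 2), 100, 1000.0),
]
cell = (1, 0, 3, 2)  # target cell covering half of each tract
print(areal_weighting(tracts, cell))  # → (250.0, 750.0)
```

The method assumes values are spread uniformly inside each source polygon, which is one reason clipping tracts to the urban footprint beforehand can improve the result.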


In [10]:

# Load Census data, but output as H3 hexagons (resolution 9)
gdf_hex = atlasbr.load_census(
    places=["Niterói, RJ"],
    year=2010,
    themes=["income", "basic"],
    clip_urban=True,
    geometry="h3",  # <--- Change output format
    h3_res=9        # ~0.1 km² hexagons on average
)

print(f"⬢ Generated {len(gdf_hex)} hexagons.")
gdf_hex.explore(column="rendimento_medio", tiles="CartoDB DarkMatter")

2025-11-27 20:33:51,661 - atlasbr - INFO - 🔄 Resolved 1 inputs into 1 unique municipalities.
2025-11-27 20:33:51,663 - atlasbr - INFO - Fetching Census Tracts for 1 municipalities (Year 2010)...
2025-11-27 20:33:51,860 - atlasbr - INFO -     ✂️  Clipping to Urban Area...
2025-11-27 20:33:52,333 - atlasbr - INFO -        -> Retained 907 tracts after clip.
2025-11-27 20:33:52,334 - atlasbr - INFO -     📦 Loading theme: 'income'...


    ☁️  Fetching 1 columns from basedosdados.br_ibge_censo_demografico.setor_censitario_basico_2010...
Downloading: 100%|██████████|

2025-11-27 20:33:53,422 - atlasbr - INFO -     📦 Loading theme: 'basic'...



    ☁️  Fetching 2 columns from basedosdados.br_ibge_censo_demografico.setor_censitario_basico_2010...
Downloading: 100%|██████████|

2025-11-27 20:33:54,363 - atlasbr - INFO -     ⬢  Aggregating to H3 Grid (Res 9)...





TypeError: cell_to_boundary() got an unexpected keyword argument 'geo_json'
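This `TypeError` is consistent with an installed `h3` release whose `cell_to_boundary` no longer accepts a `geo_json` keyword. To the best of my knowledge that is the case in the stable h3-py 4.x API, where the function returns `(lat, lng)` pairs, but please check the version you have installed. A workaround under that assumption is to call it without the keyword and reorder the coordinates yourself; the helper below is a hypothetical sketch, not AtlasBR code.

```python
def boundary_to_lnglat_ring(boundary):
    """Convert ((lat, lng), ...) pairs into a closed [(lng, lat), ...] ring.

    GeoJSON and shapely expect x/y (lng/lat) order and a ring whose first
    and last vertices are identical.
    """
    ring = [(lng, lat) for lat, lng in boundary]
    if ring and ring[0] != ring[-1]:
        ring.append(ring[0])
    return ring


# Assumed usage with h3-py 4.x (not executed here):
#   boundary = h3.cell_to_boundary(cell)        # no geo_json keyword
#   polygon = shapely.geometry.Polygon(boundary_to_lnglat_ring(boundary))

sample = ((1.0, 2.0), (3.0, 4.0), (5.0, 6.0))  # dummy (lat, lng) vertices
print(boundary_to_lnglat_ring(sample))
```

Alternatively, pinning `h3` to the version the library was developed against should avoid the mismatch.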

## 6\. Cache Management

To avoid re-downloading large files, AtlasBR caches data locally. You can manage this cache easily.
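As a generic illustration of how such a local cache can work, the sketch below keys each download by a hash of its query parameters and stores the result as a JSON file. It does not show AtlasBR's cache API; the function names, file layout, and the `fake_fetch` stand-in are assumptions made for the example.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def cache_path(cache_dir, **params):
    """Derive a stable file name from the query parameters."""
    key = json.dumps(params, sort_keys=True, ensure_ascii=False)
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]
    return Path(cache_dir) / f"{digest}.json"


def cached_fetch(cache_dir, fetch, **params):
    """Return cached data if present; otherwise call fetch(**params) and store it."""
    path = cache_path(cache_dir, **params)
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    data = fetch(**params)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(data), encoding="utf-8")
    return data


def clear_cache(cache_dir):
    """Delete all cached files and return how many were removed."""
    removed = 0
    for f in Path(cache_dir).glob("*.json"):
        f.unlink()
        removed += 1
    return removed


calls = []


def fake_fetch(place, year):
    """Stand-in for an expensive download; records each real call."""
    calls.append((place, year))
    return {"place": place, "year": year}


with tempfile.TemporaryDirectory() as tmp:
    first = cached_fetch(tmp, fake_fetch, place="Niterói, RJ", year=2010)
    second = cached_fetch(tmp, fake_fetch, place="Niterói, RJ", year=2010)
    removed = clear_cache(tmp)
print(len(calls), removed)
```

Hashing the sorted parameters means the same request always maps to the same file, so a repeated call is served from disk instead of triggering another download.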