# Sanity check on A2

This notebook sanity-checks the A2 datasets: it verifies that each dataset is well-formed and that its data is consistent. Run it only after the code in [`a2.py`](/src/airflow/dags/pipelines/a2.py) has been run.

**Note**:
- This notebook corresponds to the data validation step of the A2 pipeline, which can be accessed in the Streamlit application.
- The code in this notebook uses Plotly graphs, which might not render correctly when you open the notebook for the first time. You may need to run the code once for the graphs to render correctly.
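
The requirement that `a2.py` has already been run can be checked before any table is loaded. The following is a minimal sketch, assuming the pipeline writes each dataset as a Delta table under `../data_zones/02_formatted` (a Delta table keeps its transaction log in a `_delta_log` directory). The helper `missing_delta_tables` is hypothetical and not part of the project; the table names are the ones loaded later in this notebook.

```python
from pathlib import Path

# Table names taken from the cells in this notebook.
EXPECTED_TABLES = ("income", "idealista", "cultural_sites")


def missing_delta_tables(data_path: Path, tables=EXPECTED_TABLES) -> list[str]:
    """Return the tables that have no Delta transaction log under data_path.

    Hypothetical helper: it only checks that `<table>/_delta_log` exists,
    not that the table content is valid.
    """
    return [t for t in tables if not (data_path / t / "_delta_log").is_dir()]


if __name__ == "__main__":
    missing = missing_delta_tables(Path("../data_zones/02_formatted"))
    if missing:
        print(f"Run a2.py first; missing tables: {missing}")
```

An empty list means all three table directories are present, so the cells below should be able to open them.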

In [1]:
import os
import sys

from pathlib import Path

import deltalake as dl
import polars as pl

# Make the parent directory importable so that the `src` package can be found
sys.path.append(os.path.abspath(".."))

from src.utils.eda_dashboard import inspect_dataframe_interactive

DATA_PATH = Path("../data_zones/02_formatted")

In [2]:
# Read with deltalake first, then convert to polars; otherwise type conversion errors occur
dt_income = dl.DeltaTable(str(DATA_PATH / "income"))
df_income = pl.from_arrow(dt_income.to_pyarrow_table())

In [3]:
_ = inspect_dataframe_interactive(df_income)

📊 DATAFRAME - SUMMARY
Shape: 811 rows × 9 columns
Memory usage: 0.05 MB

Column Types:
  Numeric: 5
  Categorical: 3
  Date: 1
  Other: 0

Data Quality:
  Missing values: 8 (0.11%)
  Duplicate rows: 0 (0.00%)

✅ Data appears to be in good shape!
Summary overview:
shape: (9, 10)
Columns: statistic, year, district_code, district_name, neighborhood_code, neighborhood_name, population, income_index_bcn_100, load_timestamp, source_dataset
(statistics rows truncated in the exported output)


In [4]:
dt_idealista = dl.DeltaTable(str(DATA_PATH / "idealista"))
df_idealista = pl.from_arrow(dt_idealista.to_pyarrow_table())

In [5]:
_ = inspect_dataframe_interactive(df_idealista)

📊 DATAFRAME - SUMMARY
Shape: 20,978 rows × 22 columns
Memory usage: 3.93 MB

Column Types:
  Numeric: 7
  Categorical: 11
  Date: 1
  Other: 3

Data Quality:
  Missing values: 12,110 (2.62%)
  Duplicate rows: 10,468 (49.90%)

💡 Key Insights:
  ⚠️  Significant duplicates (49.9%)
Summary overview:
shape: (9, 23)
Columns: statistic, property_code, property_url, district, neighborhood, address, municipality, latitude, longitude, property_type, detailed_type, size_m2, rooms, bathrooms, floor, st…
(remaining columns and statistics rows truncated in the exported output)


In [6]:
dt_cs = dl.DeltaTable(str(DATA_PATH / "cultural_sites"))
df_cs = pl.from_arrow(dt_cs.to_pyarrow_table())

In [7]:
_ = inspect_dataframe_interactive(df_cs)

📊 DATAFRAME - SUMMARY
Shape: 871 rows × 12 columns
Memory usage: 0.11 MB

Column Types:
  Numeric: 3
  Categorical: 8
  Date: 1
  Other: 0

Data Quality:
  Missing values: 1,409 (13.48%)
  Duplicate rows: 0 (0.00%)

💡 Key Insights:
  ⚠️  High missing data rate (13.5%)
Summary overview:
shape: (9, 13)
Columns: statistic, site_id, facility_name, institution_name, district_code, district, neighborhood, latitude, longitude, category, facility_type, load_timestamp, source_dataset
(statistics rows truncated in the exported output)
