# Sanity check on A3

This notebook sanity-checks the A3 datasets: it verifies that they are well-formed and that the data is consistent. Run it only after the code in [`a3.py`](/src/airflow/dags/pipelines/a3.py) has been run.

**Note**:
- This corresponds to the data validation step in the A3 pipeline, which can be accessed in the Streamlit application.
- The code in this notebook uses Plotly graphs, which might not render correctly when you open the notebook for the first time. You may need to run the code once for the graphs to render correctly.
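Because the notebook depends on the output of `a3.py`, a quick guard before loading anything can make a missing pipeline run obvious. The sketch below is only an illustration: `find_delta_tables` is a hypothetical helper, not part of the project, and it assumes each table is a sub-folder containing a `_delta_log` directory, which is how Delta Lake lays out a table on disk.

```python
from pathlib import Path


def find_delta_tables(root: Path) -> list[str]:
    """Return the names of sub-folders of `root` that look like Delta tables.

    Hypothetical helper: a folder counts as a table if it holds a
    `_delta_log` directory (the Delta Lake transaction log).
    """
    if not root.is_dir():
        return []
    return sorted(
        folder.name
        for folder in root.iterdir()
        if (folder / "_delta_log").is_dir()
    )


# Same relative path as DATA_PATH in the notebook.
tables = find_delta_tables(Path("../data_zones/03_exploitation"))
if not tables:
    print("No Delta tables found: run the A3 pipeline first.")
```

Placeholder files such as `.gitkeep` are skipped naturally, since they have no `_delta_log` directory.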

In [1]:
import os
import sys

from pathlib import Path

import deltalake as dl
import polars as pl

sys.path.append(os.path.abspath(os.path.join("..")))

from src.utils.eda_dashboard import inspect_dataframe_interactive

DATA_PATH = Path("../data_zones/03_exploitation")

In [None]:
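dfs = []
# Table folders contain an underscore in their name; the pattern skips .gitkeep.
# Sorting keeps the load order deterministic across file systems.
for folder in sorted(DATA_PATH.glob("*_*")):
    print(folder.name)
    dt = dl.DeltaTable(str(folder))
    df = pl.from_arrow(dt.to_pyarrow_table())
    dfs.append(df)
    # Also expose each table as a notebook variable, e.g. df_income_quintiles.
    globals()[f"df_{folder.name}"] = df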

cultural_category_analytics
cultural_district_analytics
cultural_neighborhood_analytics
income_quintiles
integrated_analytics
property_analytics
property_type_analytics
socioeconomic_district_analytics
socioeconomic_neighborhood_analytics


In [4]:
for df in dfs:
    inspect_dataframe_interactive(df)

📊 DATAFRAME - SUMMARY
Shape: 28 rows × 3 columns
Memory usage: 0.00 MB

Column Types:
  Numeric: 1
  Categorical: 2
  Date: 0
  Other: 0

Data Quality:
  Missing values: 0 (0.00%)
  Duplicate rows: 0 (0.00%)

💡 Key Insights:
  ⚠️  Small dataset
Summary overview:
shape: (9, 4)
┌────────────┬─────────────────────┬──────────────────────┬────────────────┐
│ statistic  ┆ district            ┆ category             ┆ category_count │
│ ---        ┆ ---                 ┆ ---                  ┆ ---            │
│ str        ┆ str                 ┆ str                  ┆ f64            │
╞════════════╪═════════════════════╪══════════════════════╪════════════════╡
│ count      ┆ 28                  ┆ 28                   ┆ 28.0           │
│ null_count ┆ 0                   ┆ 0                    ┆ 0.0            │
│ mean       ┆ null                ┆ null                 ┆ 21.107143      │
│ std        ┆ null                ┆ null                 ┆ 23.633528      │
│ min        ┆ Ciutat Vella   

[Interactive DataFrame Inspector widget]

📊 DATAFRAME - SUMMARY
Shape: 10 rows × 7 columns
Memory usage: 0.00 MB

Column Types:
  Numeric: 5
  Categorical: 2
  Date: 0
  Other: 0

Data Quality:
  Missing values: 0 (0.00%)
  Duplicate rows: 0 (0.00%)

💡 Key Insights:
  ⚠️  Small dataset
Summary overview:
shape: (9, 8)
┌────────────┬─────────────────────┬────────────────┬──────────────────────┬───────────────────┬───────────────────────┬──────────────────┬─────────────────────────────────┐
│ statistic  ┆ district            ┆ analysis_level ┆ total_cultural_sites ┆ unique_categories ┆ unique_facility_types ┆ total_population ┆ cultural_sites_per_1000_reside… │
│ ---        ┆ ---                 ┆ ---            ┆ ---                  ┆ ---               ┆ ---                   ┆ ---              ┆ ---                             │
│ str        ┆ str                 ┆ str            ┆ f64                  ┆ f64               ┆ f64                   ┆ f64              ┆ f64                             │
╞════════════╪═════════════

[Interactive DataFrame Inspector widget]

📊 DATAFRAME - SUMMARY
Shape: 68 rows × 8 columns
Memory usage: 0.00 MB

Column Types:
  Numeric: 5
  Categorical: 3
  Date: 0
  Other: 0

Data Quality:
  Missing values: 1 (0.18%)
  Duplicate rows: 0 (0.00%)

💡 Key Insights:
  ⚠️  Small dataset
Summary overview:
shape: (9, 9)
┌────────────┬─────────────────────┬─────────────────┬────────────────┬──────────────────────┬───────────────────┬───────────────────────┬──────────────┬─────────────────────────────────┐
│ statistic  ┆ district            ┆ neighborhood    ┆ analysis_level ┆ total_cultural_sites ┆ unique_categories ┆ unique_facility_types ┆ population   ┆ cultural_sites_per_1000_reside… │
│ ---        ┆ ---                 ┆ ---             ┆ ---            ┆ ---                  ┆ ---               ┆ ---                   ┆ ---          ┆ ---                             │
│ str        ┆ str                 ┆ str             ┆ str            ┆ f64                  ┆ f64               ┆ f64                   ┆ f64          ┆ f64  

[Interactive DataFrame Inspector widget]
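The "Missing values" percentage in the summary above appears to be computed over all cells (rows × columns) rather than over rows: 1 missing value in a 68 × 8 frame is 1 / 544 ≈ 0.18%. This is inferred from the printed figures, not from the source of `inspect_dataframe_interactive`, so treat the following as a sketch of that assumption; `missing_percentage` is a hypothetical name.

```python
def missing_percentage(missing: int, rows: int, columns: int) -> float:
    """Share of missing cells in a rows x columns frame, as a percentage.

    Assumption: the denominator is the total number of cells.
    """
    cells = rows * columns
    if cells == 0:
        return 0.0
    return round(100 * missing / cells, 2)


# The 68 x 8 frame summarized above, with 1 missing value.
print(missing_percentage(1, 68, 8))  # 0.18
```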

📊 DATAFRAME - SUMMARY
Shape: 73 rows × 4 columns
Memory usage: 0.00 MB

Column Types:
  Numeric: 1
  Categorical: 3
  Date: 0
  Other: 0

Data Quality:
  Missing values: 0 (0.00%)
  Duplicate rows: 0 (0.00%)

💡 Key Insights:
  ⚠️  Small dataset
Summary overview:
shape: (9, 5)
┌────────────┬─────────────────────┬───────────────────┬──────────────────────┬─────────────────┐
│ statistic  ┆ district_name       ┆ neighborhood_name ┆ income_index_bcn_100 ┆ income_quintile │
│ ---        ┆ ---                 ┆ ---               ┆ ---                  ┆ ---             │
│ str        ┆ str                 ┆ str               ┆ f64                  ┆ str             │
╞════════════╪═════════════════════╪═══════════════════╪══════════════════════╪═════════════════╡
│ count      ┆ 73                  ┆ 73                ┆ 73.0                 ┆ 73              │
│ null_count ┆ 0                   ┆ 0                 ┆ 0.0                  ┆ 0               │
│ mean       ┆ null                ┆ 

[Interactive DataFrame Inspector widget]

📊 DATAFRAME - SUMMARY
Shape: 14 rows × 17 columns
Memory usage: 0.00 MB

Column Types:
  Numeric: 12
  Categorical: 4
  Date: 1
  Other: 0

Data Quality:
  Missing values: 6 (2.52%)
  Duplicate rows: 0 (0.00%)

💡 Key Insights:
  ⚠️  Small dataset
Summary overview:
shape: (9, 18)
┌────────────┬─────────────────────┬───────────────────────────────┬──────────────────┬───────────────┬──────────────────┬──────────────────┬─────────────────────┬─────────────┬──────────────────────┬──────────────┬──────────────────────┬─────────────────────────────────┬─────────────────┬─────────────────────┬──────────────────────┬──────────────────────┬─────────────────────────────────┐
│ statistic  ┆ district            ┆ neighborhood                  ┆ total_properties ┆ avg_price_eur ┆ median_price_eur ┆ avg_price_per_m2 ┆ median_price_per_m2 ┆ avg_size_m2 ┆ income_index_bcn_100 ┆ population   ┆ total_cultural_sites ┆ cultural_sites_per_1000_reside… ┆ income_quintile ┆ affordability_ratio ┆ attractiveness

[Interactive DataFrame Inspector widget]

📊 DATAFRAME - SUMMARY
Shape: 85 rows × 14 columns
Memory usage: 0.01 MB

Column Types:
  Numeric: 10
  Categorical: 3
  Date: 1
  Other: 0

Data Quality:
  Missing values: 23 (1.93%)
  Duplicate rows: 0 (0.00%)

💡 Key Insights:
  ⚠️  Small dataset
Summary overview:
shape: (9, 15)
┌────────────┬─────────────────────┬───────────────────────────────┬────────────────┬──────────────────┬───────────────┬──────────────────┬──────────────────┬─────────────────────┬─────────────┬───────────┬─────────────────────┬──────────────────┬──────────────────┬─────────────────────────────────┐
│ statistic  ┆ district            ┆ neighborhood                  ┆ analysis_level ┆ total_properties ┆ avg_price_eur ┆ median_price_eur ┆ avg_price_per_m2 ┆ median_price_per_m2 ┆ avg_size_m2 ┆ avg_rooms ┆ price_per_m2_stddev ┆ min_price_per_m2 ┆ max_price_per_m2 ┆ created_at                      │
│ ---        ┆ ---                 ┆ ---                           ┆ ---            ┆ ---              ┆ ---         

[Interactive DataFrame Inspector widget]

📊 DATAFRAME - SUMMARY
Shape: 73 rows × 5 columns
Memory usage: 0.00 MB

Column Types:
  Numeric: 3
  Categorical: 2
  Date: 0
  Other: 0

Data Quality:
  Missing values: 0 (0.00%)
  Duplicate rows: 0 (0.00%)

💡 Key Insights:
  ⚠️  Small dataset
Summary overview:
shape: (9, 6)
┌────────────┬─────────────────────┬───────────────┬────────────┬──────────────────────────┬─────────────────────────────┐
│ statistic  ┆ district            ┆ property_type ┆ type_count ┆ avg_price_per_m2_by_type ┆ median_price_per_m2_by_type │
│ ---        ┆ ---                 ┆ ---           ┆ ---        ┆ ---                      ┆ ---                         │
│ str        ┆ str                 ┆ str           ┆ f64        ┆ f64                      ┆ f64                         │
╞════════════╪═════════════════════╪═══════════════╪════════════╪══════════════════════════╪═════════════════════════════╡
│ count      ┆ 73                  ┆ 73            ┆ 73.0       ┆ 73.0                     ┆ 73.0           

[Interactive DataFrame Inspector widget]

📊 DATAFRAME - SUMMARY
Shape: 10 rows × 12 columns
Memory usage: 0.00 MB

Column Types:
  Numeric: 10
  Categorical: 2
  Date: 0
  Other: 0

Data Quality:
  Missing values: 0 (0.00%)
  Duplicate rows: 0 (0.00%)

💡 Key Insights:
  ⚠️  Small dataset
Summary overview:
shape: (9, 13)
┌────────────┬─────────────────────┬───────────────┬────────────────────┬──────────────────┬─────────────────────┬──────────────────────────┬──────────────────┬──────────────────┬──────────────────┬─────────────────────────────┬──────────────────────┬────────────────┐
│ statistic  ┆ district_name       ┆ district_code ┆ neighborhood_count ┆ avg_income_index ┆ median_income_index ┆ income_inequality_stddev ┆ min_income_index ┆ max_income_index ┆ total_population ┆ avg_neighborhood_population ┆ income_inequality_cv ┆ analysis_level │
│ ---        ┆ ---                 ┆ ---           ┆ ---                ┆ ---              ┆ ---                 ┆ ---                      ┆ ---              ┆ ---              ┆ --

[Interactive DataFrame Inspector widget]

📊 DATAFRAME - SUMMARY
Shape: 73 rows × 8 columns
Memory usage: 0.00 MB

Column Types:
  Numeric: 5
  Categorical: 3
  Date: 0
  Other: 0

Data Quality:
  Missing values: 0 (0.00%)
  Duplicate rows: 0 (0.00%)

💡 Key Insights:
  ⚠️  Small dataset
Summary overview:
shape: (9, 9)
┌────────────┬─────────────────────┬───────────────────┬───────────────┬───────────────────┬──────────────────────┬──────────────┬───────────┬────────────────┐
│ statistic  ┆ district_name       ┆ neighborhood_name ┆ district_code ┆ neighborhood_code ┆ income_index_bcn_100 ┆ population   ┆ data_year ┆ analysis_level │
│ ---        ┆ ---                 ┆ ---               ┆ ---           ┆ ---               ┆ ---                  ┆ ---          ┆ ---       ┆ ---            │
│ str        ┆ str                 ┆ str               ┆ f64           ┆ f64               ┆ f64                  ┆ f64          ┆ f64       ┆ str            │
╞════════════╪═════════════════════╪═══════════════════╪═══════════════╪═══════════

[Interactive DataFrame Inspector widget]