# Flows distribution on a road network

Initialize the Sedona context, passing a Spark option that raises the driver memory limit to 20 GB. The datasets processed below are large, and the default limit is too small for them.

In [1]:
from libadalina_core.sedona_configuration import init_sedona_context

init_sedona_context(spark_configs={
    "spark.driver.memory": "20g"
})



Import the packages needed to read the datasets from disk, and set the base path of the datasets from the `SAMPLES_DIR` environment variable.


In [2]:
import pathlib
import os

from libadalina_core.readers import geopackage_to_dataframe
from libadalina_core.graph_extraction.readers import OpenStreetMapReader, RoadTypes
import pandas as pd

base_path = pathlib.Path(os.environ.get("SAMPLES_DIR", ""))
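
Note that with the default value `""`, `pathlib.Path("")` resolves to the current working directory, so an unset `SAMPLES_DIR` silently makes every dataset path relative to wherever the notebook runs. A minimal sketch of a stricter variant follows; the helper name `resolve_samples_dir` is hypothetical and not part of libadalina.

```python
import os
import pathlib


def resolve_samples_dir(env_var: str = "SAMPLES_DIR") -> pathlib.Path:
    """Return the directory named by env_var, failing early if it is unusable."""
    raw = os.environ.get(env_var, "")
    if not raw:
        # An empty value would otherwise fall back to the working directory.
        raise RuntimeError(f"Environment variable {env_var} is not set")
    path = pathlib.Path(raw).expanduser()
    if not path.is_dir():
        raise FileNotFoundError(f"{env_var} points to a missing directory: {path}")
    return path
```

Calling `base_path = resolve_samples_dir()` in place of the assignment above would surface a misconfigured environment before any dataset is read.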

We also import the `Timing` utility to measure the running time of each step, and the garbage collector module `gc` to force garbage collection, since the following operations are memory-intensive.


In [3]:
from libadalina_analytics.utils import Timing
import gc
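
For readers without libadalina at hand, the behavior of `Timing` as used in this notebook (a label containing `{}` that is filled with the elapsed seconds) can be approximated with the standard library. This is a hypothetical stand-in named `timing`, written under that assumption; the actual implementation in `libadalina_analytics.utils` may differ.

```python
import time
from contextlib import contextmanager


@contextmanager
def timing(label: str):
    """Print `label` with '{}' replaced by the elapsed wall-clock seconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the timed block raises, so the duration is always reported.
        print(label.format(time.perf_counter() - start))


with timing("Summing: {}"):
    total = sum(range(1_000_000))
```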

In this step, we read two datasets:
1. The OpenStreetMap road map of the Lombardia region
2. The GeoPackage containing the population grid of Lombardia

The population dataset is optional for graph building. In this example it is joined with the road map to enrich each graph edge with the number of people living within a 5 km buffer of that edge.

Additionally, the graph is automatically enriched with distance information for each edge.
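
The cell below passes `proportional='geometry_right'` to the aggregation, which the text does not explain. One plausible reading, stated here as an assumption rather than as libadalina's documented behavior, is area-weighted aggregation: each population cell contributes its value `T` scaled by the fraction of the cell's area that falls inside the edge buffer. The sketch illustrates that idea with axis-aligned rectangles standing in for real geometries; all names in it are hypothetical.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    @property
    def area(self) -> float:
        return (self.xmax - self.xmin) * (self.ymax - self.ymin)


def overlap_area(a: Rect, b: Rect) -> float:
    """Area of the intersection of two axis-aligned rectangles (0 if disjoint)."""
    width = min(a.xmax, b.xmax) - max(a.xmin, b.xmin)
    height = min(a.ymax, b.ymax) - max(a.ymin, b.ymin)
    return max(width, 0.0) * max(height, 0.0)


def proportional_sum(buffer: Rect, cells: list[tuple[Rect, float]]) -> float:
    """Sum cell values, each weighted by the share of the cell inside the buffer."""
    return sum(value * overlap_area(buffer, cell) / cell.area for cell, value in cells)


buffer = Rect(0, 0, 10, 10)           # stand-in for the buffer around one edge
cells = [
    (Rect(0, 0, 5, 5), 100.0),        # fully inside the buffer
    (Rect(8, 0, 12, 4), 40.0),        # half of its area is inside
    (Rect(20, 20, 25, 25), 70.0),     # fully outside
]
population = proportional_sum(buffer, cells)
```

Under this reading, a cell straddling the buffer boundary is not counted in full, which avoids overestimating the population near the edge.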

In [4]:
from libadalina_core.graph_extraction.builders import build_graph
from libadalina_core.spatial_operators import AggregationFunction, AggregationType

with Timing('Building graph: {}'):
    osm_df = OpenStreetMapReader(RoadTypes.MAIN_ROADS).read(str(base_path / 'road_maps' / 'Lombardia.gpkg'))
    population = geopackage_to_dataframe(
        str(base_path / "population-north-italy" / "Lombardia.gpkg"),
        "dataframe"
    )[['T', 'geometry']]

    graph = build_graph(osm_df,
                        name='milan_road',
                        joined_df=population,
                        buffer_radius_meters=5000, # 5km buffer around the roads to consider population
                        aggregate_functions=[
                            AggregationFunction("T", AggregationType.SUM, 'population', proportional='geometry_right')
                        ]
                        )

    del osm_df
    del population
    gc.collect()
    print("Created graph with", graph.number_of_nodes(), "nodes and", graph.number_of_edges(), "edges")



Created graph with 977287 nodes and 1749309 edges
Building graph: 278.51437854766846


We read two datasets for the flow distribution analysis:

1. **Transportation Flows Dataset**:
   - CSV file containing flow data between different regions of Lombardia
   - Contains information about transportation demand between origin-destination pairs

2. **Regional Shapes Dataset**:
   - GeoPackage file containing geometric shapes of Lombardia regions
   - Used to map flows between specific geographic areas

In this example, we analyze the flows from the zones of the Milano area to the zones of the Bergamo area.

The `flows_distribution_algorithm` is then applied with:
- Population-based cost weight: 0.7
- Distance-based cost weight: 0.3

The output is the `flows` DataFrame enriched with one path for each origin-destination pair of zones.
For each path we report the nodes traversed (`path`), its geometry (`geometry`), its cost (`path_cost`), its distance (`distance`), and the population involved (`population`).
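
To make the role of the two weights concrete, here is a minimal sketch assuming that the cost of an edge is the weighted sum `0.7 * population + 0.3 * distance` (with `cost_per_unit=1`) and that each origin-destination pair is routed along the cheapest path. Whether libadalina normalizes the attributes or routes flows differently is not stated in this notebook, so the toy graph and the functions `edge_cost` and `cheapest_path` are purely illustrative.

```python
import heapq

# (weight, cost_per_unit) per edge attribute, mirroring the GraphCost arguments.
COSTS = {"population": (0.7, 1.0), "distance": (0.3, 1.0)}


def edge_cost(attrs: dict[str, float]) -> float:
    """Weighted sum of the edge attributes (assumed combination rule)."""
    return sum(weight * unit * attrs[name] for name, (weight, unit) in COSTS.items())


def cheapest_path(graph, source, target):
    """Dijkstra's algorithm; `graph` maps node -> {neighbor: attrs}."""
    queue = [(0.0, source, [source])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == target:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, attrs in graph.get(node, {}).items():
            if neighbor not in visited:
                heapq.heappush(queue, (cost + edge_cost(attrs), neighbor, path + [neighbor]))
    return float("inf"), []


toy = {
    "A": {"B": {"population": 100, "distance": 10},
          "C": {"population": 10, "distance": 30}},
    "B": {"D": {"population": 100, "distance": 10}},
    "C": {"D": {"population": 10, "distance": 30}},
}
cost, path = cheapest_path(toy, "A", "D")
```

In this toy graph the route through `C` is three times longer than the route through `B`, yet it is selected: with a weight of 0.7 on population, avoiding densely populated edges outweighs the extra distance.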


In [6]:
from libadalina_analytics.flows_distribution.algorithms.flows_distribution_algorithm import flows_distribution_algorithm, GraphCost

with Timing('Computing flows distribution: {}'):
    flows = pd.read_csv(base_path / "flows" / "matrice_od_2016_merci.csv", sep=',')
    # Total demand of each origin-destination pair: sum of the three flow columns
    flows['demand'] = flows['N1'] + flows['N2'] + flows['N3']

    shapes = geopackage_to_dataframe(
        str(base_path / "flows" / "Shape_Matrice_OD2016_-_Veicoli_commerciali_e_pesanti_-_Zone_interne_20250816.gpkg"),
        'dataframe'
    )

    source_node_ids = [215, 218, 219, 220, 221, 222]  # IDs of the zones in the Milano area
    destination_node_ids = [10, 11, 12]  # IDs of the zones in the Bergamo area

    result = flows_distribution_algorithm(
        graph, shapes, flows,
        graph_costs=[
            GraphCost(name='population', cost_per_unit=1, weight=0.7),
            GraphCost(name='distance', cost_per_unit=1, weight=0.3),
        ],
        shapes_id_column='ID_Z_IIL',
        flows_origin_id_column='Z_IIL_O',
        flows_destination_id_column='Z_IIL_D',
        flows_demand_column='demand',
        sources=source_node_ids,
        destinations=destination_node_ids,
    )
    print(result.columns)
    print(result.head())


Index(['Z_IIL_O', 'Z_IIL_D', 'Z_IIL_O_NOME', 'Z_IIL_D_NOME', 'N1', 'N2', 'N3',
       'demand', 'path', 'geometry', 'path_cost', 'population', 'distance'],
      dtype='object')
   Z_IIL_O  Z_IIL_D        Z_IIL_O_NOME                   Z_IIL_D_NOME     N1  \
0      219       10  MILANO 5-MILANO 16            BERGAMO 1-BERGAMO 7  21.20   
1      219       12  MILANO 5-MILANO 16  BERGAMO 2-BERGAMO 3-BERGAMO 4  20.15   
2      220       11   MILANO 6-MILANO 8            BERGAMO 5-BERGAMO 6  20.71   
3      220       12   MILANO 6-MILANO 8  BERGAMO 2-BERGAMO 3-BERGAMO 4   0.00   
4      221       10  MILANO 9-MILANO 10            BERGAMO 1-BERGAMO 7   0.00   

     N2    N3  demand                                               path  \
0  2.72  3.71   27.63  [60129574356, 120259116670, 42949705128, 34359...   
1  2.84  3.64   26.63  [60129574356, 120259116670, 42949705128, 34359...   
2  3.14  5.04   28.89  [8589936430, 42949674304, 42949676808, 3435974...   
3  3.18  4.44    7.62  [8589936