File format v4 and big graph lib re-write #49

Merged
sebastiankreutzer merged 30 commits into tudasc:devel from sebastiankreutzer:feat/graphlib_refactor
Jul 31, 2025

@sebastiankreutzer sebastiankreutzer commented Jul 7, 2025

File format v4 and big graph library re-write

This merge request introduces format v4, alongside a number of changes in the graph lib implementation.
Major parts of this PR are:

  • Introduction of format v4 to replace v3, addressing the remaining issues of v3 and providing a more compact JSON representation than all prior format versions.
  • Re-design of the graph library for improved performance and API consistency.
  • Introduction of merge policies (see RFC: Introduce merge policies #42) to handle different merge semantics.

I'm happy for any and all feedback, including:

  • Overall design and potential limitations
  • Bikeshedding of function/type names
  • How this may be split up into reviewable chunks

Note: This patch is complete (including unit and integration tests), with the following notable exceptions:

  • VersionThreeMCG{Reader,Writer}Test has not been adjusted to account for format changes after review. I will update this once we agree on the final format. I/O is still tested via the v2 tests (they depend on the v4 reader/writer).
  • CGConverter and CGMerge2 are functional. All other tools need to be updated. The required changes are fairly easy to apply.
  • There are some left-over debugging messages (mostly commented out) that I will remove at a later point. Please disregard these.

For performance results, skip to the very end.

Format v4

Format v4 combines ideas from both v2 and v3 to provide a JSON representation that is both compact and readable.
V4 completely replaces v3, as keeping both versions while maintaining a consistent internal representation is not feasible.
The following example shows the same call graph in v2, v4 and v3 (use the horizontal scroll bar to see everything):

The three listings below are, in order: Version 2, Version 4 (with names as identifiers), and Version 3 (to be deprecated).
{
    "_CG": {
        "foo": {
            "callees": [],
            "callers": [
                "main"
            ],
            "doesOverride": false,
            "hasBody": true,
            "isVirtual": false,
            "meta": {
                "fileProperties": {
                    "origin": "foo.cpp",
                    "systemInclude": false
                },
                "numStatements": 0
            },
            "overriddenBy": [],
            "overrides": []
        },
        "main": {
            "callees": [
                "foo"
            ],
            "callers": [],
            "doesOverride": false,
            "hasBody": true,
            "isVirtual": false,
            "meta": {
                "fileProperties": {
                    "origin": "main.cpp",
                    "systemInclude": false
                },
                "numStatements": 3
            },
            "overriddenBy": [],
            "overrides": []
        }
    },
    "_MetaCG": {
        "generator": {
            "name": "CGCollector",
            "sha": "9350298c64ad5a9aef50494f73caad08c38c1a51",
            "version": "0.7"
        },
        "version": "2.0"
    }
}

Version 4 (with names as identifiers):
{
    "_CG": {
        "foo": {
            "functionName": "foo",
            "origin": "foo.cpp",
            "callees": {},
            "hasBody": true,
            "meta": {
                "fileProperties": {
                    "systemInclude": false
                },
                "numStatements": 0,
                "overrideMD": {
                    "overriddenBy": [],
                    "overrides": []
                }
            }
        },
        "main": {
            "functionName" : "main",
            "origin": "main.cpp",
            "callees": {
                "foo": null
            },
            "hasBody": true,
            "meta": {
                "fileProperties": {
                    "systemInclude": false
                },
                "numStatements": 3
            }
        }
    },
    "_MetaCG": {
        "generator": {
            "name": "CGCollector",
            "sha": "a12c2adc9482ef0062fd27f104ab4dfd679cb91f",
            "version": "0.2"
        },
        "version": "4.0"
    }
}

Version 3 (to be deprecated):
{
    "_CG": {
        "edges": [
            [
                [
                    13570296075836900426,
                    6503480063870808269
                ],
                null
            ]
        ],
        "nodes": [
            [
                6503480063870808269,
                {
                    "functionName": "foo",
                    "origin": "foo.cpp",
                    "hasBody": true,
                    "meta": {
                        "fileProperties": {
                            "systemInclude": false
                        },
                        "numStatements": 0,
                        "overrideMD": {
                            "overriddenBy": [],
                            "overrides": []
                        }
                    }
                }
            ],
            [
                13570296075836900426,
                {
                    "functionName": "main",
                    "origin": "main.cpp",
                    "hasBody": true,
                    "meta": {
                        "fileProperties": {
                            "systemInclude": false
                        },
                        "numStatements": 3
                    }
                }
            ]
        ]
    },
    "_MetaCG": {
        "generator": {
            "name": "MetaCG",
            "sha": "2ce9326f676cfbc802b3924f66dfd97576058b30",
            "version": "0.7"
        },
        "version": "3.0"
    }
}

Here are the main changes compared to the previous formats:

  • The generator can choose arbitrary string IDs to refer to each node. These IDs have no relevance for the in-memory representation.
  • Call edges are stored in the callees field within each node, similar to v2. Each edge can carry metadata (null by default).
    • Compared to a global edges container like in v3, this improves human readability and reduces storage requirements (the caller is implicit).
    • Unlike in v2, callers are not stored explicitly, which reduces redundancy. This simplifies manual editing, as only one field needs to be modified, and consistency is always maintained.
  • There is no dedicated debug format. Instead, the user can choose to use (uniquified) function names as identifiers during export.
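
Because v4 stores only callees, any caller list has to be derived when it is needed. The following is a minimal sketch of that derivation, assuming a simplified in-memory view of the `_CG` object that maps each node identifier to its callee identifiers. The `CalleeMap` alias and the `invertCallees` function are hypothetical names for illustration and are not part of MetaCG.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical, simplified view of the v4 "_CG" object:
// node identifier -> identifiers of its callees (edge metadata omitted).
using CalleeMap = std::map<std::string, std::vector<std::string>>;

// Derives the caller lists that v2 stored explicitly.
// Returns: node identifier -> identifiers of its callers.
CalleeMap invertCallees(const CalleeMap& callees) {
  CalleeMap callers;
  for (const auto& [caller, targets] : callees) {
    callers[caller];  // ensure every node has an entry, even without callers
    for (const auto& callee : targets) {
      callers[callee].push_back(caller);
    }
  }
  return callers;
}
```

For the two-node example above, inverting `{"foo": [], "main": ["foo"]}` yields `main` as the only caller of `foo`, which matches the `callers` fields of the v2 listing.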

To conserve space, integer identifiers can be used instead of function names:

The two listings below are, in order: Version 4 (with names as identifiers) and Version 4 (with integers as identifiers).
{
    "_CG": {
        "foo": {
            "functionName": "foo",
            "origin": "foo.cpp",
            "callees": {},
            "hasBody": true,
            "meta": {
                "fileProperties": {
                    "systemInclude": false
                },
                "numStatements": 0,
                "overrideMD": {
                    "overriddenBy": [],
                    "overrides": []
                }
            }
        },
        "main": {
            "functionName" : "main",
            "origin": "main.cpp",
            "callees": {
                "foo": null
            },
            "hasBody": true,
            "meta": {
                "fileProperties": {
                    "systemInclude": false
                },
                "numStatements": 3
            }
        }
    },
    "_MetaCG": {
        "generator": {
            "name": "CGCollector",
            "sha": "a12c2adc9482ef0062fd27f104ab4dfd679cb91f",
            "version": "0.2"
        },
        "version": "4.0"
    }
}

Version 4 (with integers as identifiers):
{
    "_CG": {
        "0": {
            "functionName": "foo",
            "origin": "foo.cpp",
            "callees": null,
            "hasBody": true,
            "meta": {
                "fileProperties": {
                    "systemInclude": false
                },
                "numStatements": 0,
                "overrideMD": {
                    "overriddenBy": [],
                    "overrides": []
                }
            }
        },
        "1": {
            "functionName": "main",
            "origin": "main.cpp",
            "callees": {
                "0": null
            },
            "hasBody": true,
            "meta": {
                "fileProperties": {
                    "systemInclude": false
                },
                "numStatements": 3
            }
        }
    },
    "_MetaCG": {
        "generator": {
            "name": "MetaCG",
            "sha": "9350298c64ad5a9aef50494f73caad08c38c1a51",
            "version": "0.7"
        },
        "version": "4.0"
    }
}
In this example, the size reduction is only 16%, but for larger call graphs with long function names the savings can be substantial.

Potential additional features, not yet implemented:

  • Add a debug metadata field. This field can be used to export the list of callers for easier look up during manual inspection. This metadata would be ignored when reading, and re-generated when writing, if requested.

Graph library

The graph library was adjusted to handle the new format and to accommodate a wider variety of use cases (e.g., @pearzt's Flip project).

Node identification and lookup

Nodes are identified by an integer ID unique to the containing call graph.
This ID is generally not equal to the string identifier used in the JSON file, but is instead assigned by the call graph when creating the node.
It is explicitly allowed to create multiple nodes with the same function name and/or origin, enabling a correct representation of functions with internal linkage (e.g., functions declared static).

We maintain a hash map to simplify look-up by name. The function const NodeList& getNodes(const std::string&) can be used to get a list of functions matching the given name. CgNode* getFirstNode(const std::string&) is added as a convenience function to look up a single node, e.g. if we know that only one such node exists.
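
As an illustration of the lookup described above, here is a minimal sketch of a name index. It assumes a stripped-down `CgNode` and a hypothetical container class `NameIndexedGraph`. Only the names `getNodes`, `getFirstNode`, `NodeList` and `CgNode` come from this PR, and everything else (including returning `nullptr` for a missing name) is an assumption.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <unordered_map>
#include <vector>

using NodeId = std::size_t;

// Stripped-down node for illustration only.
struct CgNode {
  NodeId id;
  std::string functionName;
  std::string origin;
};

// Hypothetical container; the real Callgraph is more involved.
class NameIndexedGraph {
 public:
  using NodeList = std::vector<CgNode*>;

  // Several nodes may share a name (e.g. static functions in different files).
  CgNode& insert(const std::string& name, const std::string& origin) {
    nodes.push_back(std::make_unique<CgNode>(CgNode{nodes.size(), name, origin}));
    CgNode* node = nodes.back().get();
    byName[name].push_back(node);
    return *node;
  }

  // All nodes with the given name; an empty list if there are none.
  const NodeList& getNodes(const std::string& name) const {
    static const NodeList empty;
    auto it = byName.find(name);
    return it == byName.end() ? empty : it->second;
  }

  // First match, or nullptr (assumed behavior) if the name is unknown.
  CgNode* getFirstNode(const std::string& name) const {
    const NodeList& matches = getNodes(name);
    return matches.empty() ? nullptr : matches.front();
  }

 private:
  std::vector<std::unique_ptr<CgNode>> nodes;  // owns the nodes, stable addresses
  std::unordered_map<std::string, NodeList> byName;
};
```

Inserting two `helper` nodes with different origins gives two distinct IDs. `getNodes("helper")` then returns both, while `getFirstNode("helper")` returns the one inserted first.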

Node creation

If there are two or more in-memory call graphs, nodes in these call graphs may have identical IDs.
To avoid issues with unclear ownership and validity of a node, nodes must therefore always be tied to a specific call graph.
This was implemented by making the CgNode constructor private. Nodes can now only be created by calling Callgraph::insert or Callgraph::getOrInsertNode.
This also guarantees that there are no nodes with invalid or uninitialized IDs, as the IDs are set in the constructor and are marked const.
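
The ownership rule above can be sketched with a private constructor and a friend declaration. This is a simplified illustration, not the actual MetaCG classes. The member layout and the single-argument `insert` signature are assumptions.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <type_traits>
#include <utility>
#include <vector>

using NodeId = std::size_t;

class Callgraph;

class CgNode {
 public:
  const NodeId id;                 // set once in the constructor, never changes
  const std::string functionName;

  CgNode(const CgNode&) = delete;
  CgNode& operator=(const CgNode&) = delete;

 private:
  friend class Callgraph;          // only the owning graph may construct nodes
  CgNode(NodeId nodeId, std::string name) : id(nodeId), functionName(std::move(name)) {}
};

// Code outside Callgraph cannot construct a node on its own.
static_assert(!std::is_constructible<CgNode, NodeId, std::string>::value,
              "CgNode must only be created through Callgraph");

class Callgraph {
 public:
  CgNode& insert(std::string name) {
    // std::make_unique cannot reach the private constructor, so call new here.
    nodes.push_back(std::unique_ptr<CgNode>(new CgNode(nextId++, std::move(name))));
    return *nodes.back();
  }

 private:
  NodeId nextId = 0;               // IDs are unique only within this graph
  std::vector<std::unique_ptr<CgNode>> nodes;
};
```

Two separate `Callgraph` objects each start counting at 0 in this sketch, so their nodes can share IDs. This is why a node is only meaningful together with the graph that created it.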

Edges

There are no major changes w.r.t. edge representation. Edges are stored centrally in the call graph, as before.
Instead of function names/hashes, edges are defined by integer node IDs. There is a convenience function bool addEdge(const std::string&, const std::string&) to simplify edge creation by name.
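
A minimal sketch of this edge model follows, assuming edges are kept as a central set of (caller ID, callee ID) pairs. The class name `EdgeSketch` and the `hasEdge` helper are hypothetical. Returning `false` for unknown or ambiguous names is an assumption made for this sketch, and the real implementation may report such errors differently.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using NodeId = std::size_t;

class EdgeSketch {
 public:
  NodeId insert(const std::string& name) {
    NodeId id = nextId++;
    byName[name].push_back(id);
    return id;
  }

  // Primary interface: an edge is a pair of node IDs, stored centrally.
  bool addEdge(NodeId caller, NodeId callee) {
    if (caller >= nextId || callee >= nextId) return false;  // unknown node
    return edges.emplace(caller, callee).second;             // false if duplicate
  }

  // Convenience overload: only usable when both names are unambiguous.
  bool addEdge(const std::string& callerName, const std::string& calleeName) {
    auto caller = byName.find(callerName);
    auto callee = byName.find(calleeName);
    if (caller == byName.end() || callee == byName.end()) return false;
    if (caller->second.size() != 1 || callee->second.size() != 1) return false;
    return addEdge(caller->second.front(), callee->second.front());
  }

  bool hasEdge(NodeId caller, NodeId callee) const {
    return edges.count({caller, callee}) > 0;
  }

 private:
  NodeId nextId = 0;
  std::unordered_map<std::string, std::vector<NodeId>> byName;
  std::set<std::pair<NodeId, NodeId>> edges;
};
```

When a name matches several nodes, the caller looks up the candidates by name and uses the ID-based overload for each intended target.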

I/O

Unlike in v3, there is no direct mapping between the JSON format and the internal representation.
This makes the implementation of the reader and writer more involved, but gives us full control over the import/export process. It also made it possible to provide more helpful error messages when the input JSON contains errors.

To convert between the string identifiers in the format and the internal node IDs, mapping classes have been introduced (StrToNodeMapping for the reader, NodeToStrMapping for the writer). These are used by the reader/writer to maintain a consistent mapping between the two identifiers and to translate node references (e.g. in the edge field).
This mapping data structure is only required during reading/writing and is discarded afterward.
The only use outside the reader/writer itself is in the creation and export of metadata (see below).
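
To make the role of the two mappings concrete, here is a rough sketch of what such classes could look like. The class names carry a `Sketch` suffix because the interfaces shown here (`bind`, `lookup`, `idFor`) are assumptions and are not taken from the actual `StrToNodeMapping` and `NodeToStrMapping` classes. The writer side assigns compact integer identifiers in order of first use.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <unordered_map>

using NodeId = std::size_t;

// Reader side: JSON string identifier -> internal node ID.
class StrToNodeMappingSketch {
 public:
  void bind(const std::string& strId, NodeId node) { ids[strId] = node; }

  // Empty result signals a dangling reference in the input file.
  std::optional<NodeId> lookup(const std::string& strId) const {
    auto it = ids.find(strId);
    if (it == ids.end()) return std::nullopt;
    return it->second;
  }

 private:
  std::unordered_map<std::string, NodeId> ids;
};

// Writer side: internal node ID -> string identifier used in the file.
class NodeToStrMappingSketch {
 public:
  // Assigns "0", "1", ... in order of first use and reuses earlier results.
  const std::string& idFor(NodeId node) {
    auto it = names.find(node);
    if (it == names.end()) {
      it = names.emplace(node, std::to_string(names.size())).first;
    }
    return it->second;
  }

 private:
  std::unordered_map<NodeId, std::string> names;
};
```

Both objects are meant to be short-lived: they exist for one read or write pass and are then dropped, which matches the lifetime described above.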

Merging

The implementation of the call graph merge functionality received a major overhaul.
In the current upstream version, functions are always merged by name and there is no customization of merge semantics.
This patch introduces so-called merge policies to enable greater control.

These are based on the following model:

Merge model
When merging graph A into graph B, the merge transfers all nodes from A into B.
For each node in A, one of two main cases applies:

  1. Node does not match any node in B: node can be cloned and inserted as a new node in B.
  2. Node does match a node in B: There are two possible actions:
    2.1 Keep existing node in B: nothing to be done
    2.2 Replace with node from A: keep the existing node in B but overwrite with properties from A

Merge policies are introduced to concretize the semantics of this model.
Relevant declarations:

struct MergeAction { // abbreviated definition
  NodeId targetNode;
  bool replace;
};

struct MergePolicy {
  virtual std::optional<MergeAction> findMatchingNode(const Callgraph& targetCG, const CgNode& sourceNode) const = 0;
};

For each node in the source graph A, the policy defines which of the outlined cases applies. The function returns:

  • An empty value, if there is no matching node (case 1)
  • An action with replace=false, if there is a match and the existing node should be kept (case 2.1).
  • An action with replace=true, if there is a match and the existing node should be overwritten (case 2.2).

The policy is passed to Callgraph::merge, which executes the merge accordingly. This returns a MergeRecorder object, used to look up the performed actions and to map from node IDs from graph A into the merged graph B.
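
The following sketch shows how a policy could drive a merge under this model. `MergeAction` and `MergePolicy` follow the declarations above. The simplified `Callgraph` and `CgNode`, the `MergeByNameSketch` rule (replace a declaration in B with a definition from A) and the free function `mergeInto` are assumptions for illustration. The returned map stands in for the ID lookup that `MergeRecorder` provides.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

using NodeId = std::size_t;

struct CgNode {
  NodeId id;
  std::string functionName;
  bool hasBody;
};

// Simplified stand-in for the real Callgraph: nodes only, no edges or metadata.
struct Callgraph {
  std::vector<CgNode> nodes;
  NodeId insert(const std::string& name, bool hasBody) {
    nodes.push_back({nodes.size(), name, hasBody});
    return nodes.back().id;
  }
};

struct MergeAction {
  NodeId targetNode;
  bool replace;
};

struct MergePolicy {
  virtual ~MergePolicy() = default;
  virtual std::optional<MergeAction> findMatchingNode(const Callgraph& targetCG,
                                                      const CgNode& sourceNode) const = 0;
};

// Assumed rule: match by name; overwrite only if A has a body and B does not.
struct MergeByNameSketch : MergePolicy {
  std::optional<MergeAction> findMatchingNode(const Callgraph& targetCG,
                                              const CgNode& sourceNode) const override {
    for (const CgNode& candidate : targetCG.nodes) {
      if (candidate.functionName == sourceNode.functionName) {
        return MergeAction{candidate.id, sourceNode.hasBody && !candidate.hasBody};
      }
    }
    return std::nullopt;  // case 1: no match
  }
};

// Merges source (A) into target (B); returns A's node ID -> B's node ID.
std::unordered_map<NodeId, NodeId> mergeInto(Callgraph& target, const Callgraph& source,
                                             const MergePolicy& policy) {
  std::unordered_map<NodeId, NodeId> mapping;
  for (const CgNode& node : source.nodes) {
    std::optional<MergeAction> action = policy.findMatchingNode(target, node);
    if (!action) {
      mapping[node.id] = target.insert(node.functionName, node.hasBody);  // case 1
      continue;
    }
    if (action->replace) {
      target.nodes[action->targetNode].hasBody = node.hasBody;  // case 2.2
    }
    mapping[node.id] = action->targetNode;  // case 2.1 leaves B's node untouched
  }
  return mapping;
}
```

Keeping the matching decision in the policy and the mutation in the merge routine means a different policy, such as one based on linkage, can be swapped in without touching the merge loop.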

Note that metadata are in charge of defining their own merge semantics for cases 2.1 and 2.2. More details below.

The current implementation comes with the MergeByName policy, which models the behavior of the previously used merge strategy.
As a follow-up, I suggest we add an ELFMerge policy that models the actual linking of object files more closely. To do this, we need a linkage metadata with values external, internal and weak, and define the policy based on this linkage type. This will finally allow us to accurately handle merge semantics of static and inline functions, regardless of using a source or IR level generator.

Metadata

There are no fundamental changes to how metadata is registered and instantiated.
However, the changes to the format and the merge semantics required some adjustments.

I/O
When creating from JSON, the metadata needs to know how to map from string identifiers to internal node IDs. Therefore, the constructor of each metadata takes an additional StrToNodeMapping& parameter. This is only relevant if the metadata references other nodes (like in OverrideMD) and can be ignored in the majority of cases.

Similarly, the conversion to JSON requires a NodeToStrMapping that is now added as a parameter to toJson.

Mapping
When cloning a node from graph A into graph B, any node IDs referenced in the metadata become potentially invalid.
We added the applyMapping(const GraphMapping&) method that lets the metadata correctly update its internal node references.
For the majority of metadata that do not reference other nodes, this method can be left empty.
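
As a sketch of what `applyMapping` might do for metadata that references other nodes, consider a simplified override record. `GraphMapping` is modeled here as a plain map from source-graph IDs to target-graph IDs, which is an assumption, as is the struct name `OverrideInfoSketch` and the choice to leave unmapped references unchanged.

```cpp
#include <cstddef>
#include <unordered_map>
#include <vector>

using NodeId = std::size_t;

// Assumed shape: node ID in the source graph -> node ID in the target graph.
using GraphMapping = std::unordered_map<NodeId, NodeId>;

// Simplified metadata that references other nodes by ID.
struct OverrideInfoSketch {
  std::vector<NodeId> overrides;
  std::vector<NodeId> overriddenBy;

  // Rewrites all stored node references after the node moved to another graph.
  void applyMapping(const GraphMapping& mapping) {
    remap(overrides, mapping);
    remap(overriddenBy, mapping);
  }

 private:
  static void remap(std::vector<NodeId>& refs, const GraphMapping& mapping) {
    for (NodeId& ref : refs) {
      auto it = mapping.find(ref);
      if (it != mapping.end()) ref = it->second;  // unmapped IDs stay as they are
    }
  }
};
```

Metadata without node references has nothing to rewrite, which is why an empty method body is sufficient in those cases.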

Merging
Previously, what should happen during a merge of two metadata objects was often not clear, as there was no information about the performed action.
Now, the executed MergeAction is passed as a parameter to allow the metadata to handle the merge accordingly.
Additionally, we pass a GraphMapping to map node references of the metadata object that is merged in. Again, this is only relevant if the metadata itself references other nodes.

Please refer to OverrideMD.h as an example for a non-trivial metadata implementation using mappings.

Performance results

The following is a performance comparison with devel (both measured on my laptop).

Test name           devel (ms)   This PR (ms)   Speedup
insert edges        148          109            1.36×
insert nodes        25           15             1.67×
read V2 format      28,547       457            62.45×
write V2 format     754          735            1.03×
read V3/4 format    211 (V3)     233 (V4)       0.91×
write V3/4 format   492 (V3)     417 (V4)       1.18×

Closes #42

@sebastiankreutzer

As a further note: I would like to get rid of the MCGManager completely, as it no longer has any real purpose besides a string-to-Callgraph lookup.
If there are no objections, I will do this after this PR has gone through.

@sebastiankreutzer

The last commit adds methods to delete nodes and edges.

@sebastiankreutzer sebastiankreutzer force-pushed the feat/graphlib_refactor branch 2 times, most recently from 8f4d37c to 2fba539 Compare July 9, 2025 09:03
@TimHeldmann

To-dos from today's meeting:

Decided to rename edges to callees.

Discussed whether getFirstNode should print an error if more than one node exists.
Decided to keep it as is (no warning), but possibly implement a getSingleNode() function later, which will warn if there were multiple hits.

@pearzt pearzt mentioned this pull request Jul 10, 2025
@sebastiankreutzer

To-dos from today's meeting:

Decided to rename edges to callees.

Discussed whether getFirstNode should print an error if more than one node exists. Decided to keep it as is (no warning), but possibly implement a getSingleNode() function later, which will warn if there were multiple hits.

Addressed in the last two commits.

@sebastiankreutzer

With the last commit, PGIS builds correctly and all unit and integration tests succeed 🥳

@sebastiankreutzer sebastiankreutzer marked this pull request as ready for review July 30, 2025 15:12
@sebastiankreutzer sebastiankreutzer changed the title from "[Draft] File format v4 and big graph lib re-write" to "File format v4 and big graph lib re-write" Jul 30, 2025

@TimHeldmann TimHeldmann left a comment

Looks good to me.
I will open an issue regarding the V3 compatibility reader/writer facility and implement it if I have time to do so.


@jplehr jplehr left a comment

Some initial drive-by comments.

MCGLogger::instance().getErrConsole()->error("Source ID {} does not exist in graph: Unrecoverable graph error",
parentID);
abort();
bool Callgraph::addEdge(const std::string& callerName, const std::string& calleeName) {

Do I understand correctly that we cannot add an edge when there is more than one node with either of the two given names?

Member Author

Only for this utility function. You can always list the nodes that match a given function name and add an edge to these nodes directly.

@sebastiankreutzer sebastiankreutzer merged commit d05b53b into tudasc:devel Jul 31, 2025
3 checks passed