File format v4 and big graph lib re-write by sebastiankreutzer · Pull Request #49 · tudasc/MetaCG

sebastiankreutzer · 2025-07-07T12:16:35Z

File format v4 and big graph library re-write

This merge request introduces format v4, alongside a number of changes in the graph lib implementation.
Major parts of this PR are:

Introduction of format v4 to replace v3, addressing the remaining issues of v3 and providing a more compact JSON representation than all prior format versions.
Re-design of the graph library for improved performance and API consistency.
Introduction of merge policies (see RFC: Introduce merge policies #42) to handle different merge semantics.

I'm happy for any and all feedback, including:

Overall design and potential limitations
Bikeshedding of function/type names
How this may be split up into reviewable chunks

Note: This patch is complete (including unit and integration tests), with the following notable exceptions:

VersionThreeMCG{Reader,Writer}Test has not been adjusted to account for format changes after review. I will update this once we agree on the final format. I/O is still tested via the v2 tests (they depend on the v4 reader/writer).
CGConverter and CGMerge2 are functional. All other tools need to be updated. Required changes are fairly easy to apply.
There are some left-over debugging messages (mostly commented-out) that I will remove at a later point, please disregard these.

For performance results, skip to the very end.

Format v4

Format v4 combines ideas from both v2 and v3 to provide a JSON representation that is both compact and readable.
V4 completely replaces v3, as keeping both versions while maintaining a consistent internal representation is not feasible.
The following example shows the same call graph in v2, v4 and v3 (use the horizontal scroll bar to see everything):

In order: Version 2, Version 4 (with names as identifiers), Version 3 (to be deprecated)

{
    "_CG": {
        "foo": {
            "callees": [],
            "callers": [
                "main"
            ],
            "doesOverride": false,
            "hasBody": true,
            "isVirtual": false,
            "meta": {
                "fileProperties": {
                    "origin": "foo.cpp",
                    "systemInclude": false
                },
                "numStatements": 0
            },
            "overriddenBy": [],
            "overrides": []
        },
        "main": {
            "callees": [
                "foo"
            ],
            "callers": [],
            "doesOverride": false,
            "hasBody": true,
            "isVirtual": false,
            "meta": {
                "fileProperties": {
                    "origin": "main.cpp",
                    "systemInclude": false
                },
                "numStatements": 3
            },
            "overriddenBy": [],
            "overrides": []
        }
    },
    "_MetaCG": {
        "generator": {
            "name": "CGCollector",
            "sha": "9350298c64ad5a9aef50494f73caad08c38c1a51",
            "version": "0.7"
        },
        "version": "2.0"
    }
}

{
    "_CG": {
        "foo": {
            "functionName": "foo",
            "origin": "foo.cpp",
            "callees": {},
            "hasBody": true,
            "meta": {
                "fileProperties": {
                    "systemInclude": false
                },
                "numStatements": 0,
                "overrideMD": {
                    "overriddenBy": [],
                    "overrides": []
                }
            }
        },
        "main": {
            "functionName" : "main",
            "origin": "main.cpp",
            "callees": {
                "foo": null
            },
            "hasBody": true,
            "meta": {
                "fileProperties": {
                    "systemInclude": false
                },
                "numStatements": 3
            }
        }
    },
    "_MetaCG": {
        "generator": {
            "name": "CGCollector",
            "sha": "a12c2adc9482ef0062fd27f104ab4dfd679cb91f",
            "version": "0.2"
        },
        "version": "4.0"
    }
}

{
    "_CG": {
        "edges": [
            [
                [
                    13570296075836900426,
                    6503480063870808269
                ],
                null
            ]
        ],
        "nodes": [
            [
                6503480063870808269,
                {
                    "functionName": "foo",
                    "origin": "foo.cpp",
                    "hasBody": true,
                    "meta": {
                        "fileProperties": {
                            "systemInclude": false
                        },
                        "numStatements": 0,
                        "overrideMD": {
                            "overriddenBy": [],
                            "overrides": []
                        }
                    }
                }
            ],
            [
                13570296075836900426,
                {
                    "functionName": "main",
                    "origin": "main.cpp",
                    "hasBody": true,
                    "meta": {
                        "fileProperties": {
                            "systemInclude": false
                        },
                        "numStatements": 3
                    }
                }
            ]
        ]
    },
    "_MetaCG": {
        "generator": {
            "name": "MetaCG",
            "sha": "2ce9326f676cfbc802b3924f66dfd97576058b30",
            "version": "0.7"
        },
        "version": "3.0"
    }
}

Here are the main changes compared to the previous formats:

The generator can choose arbitrary string IDs to refer to each node. These IDs have no relevance for the in-memory representation.
Call edges are stored in the callees field within each node, similar to v2. Each edge can carry metadata (null by default).
- Compared to a global edges container as in v3, this improves human readability and reduces storage requirements (the caller is implicit).
- Unlike in v2, callers are not stored explicitly, which reduces redundancy. This simplifies manual editing, as only one field needs to be modified, and consistency is always maintained.
There is no dedicated debug format. Instead, the user can choose to use (uniquified) function names as identifiers during export.
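
Since callers are no longer stored in the file, a consumer that needs them has to derive them from the callees fields. The following is a minimal sketch of that inversion, not code from this PR: `CalleeMap`, `deriveCallers` and `deriveCallersExample` are hypothetical names, and edge metadata is left out.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical in-memory mirror of the v4 "callees" fields:
// caller identifier -> list of callee identifiers (edge metadata omitted).
using CalleeMap = std::map<std::string, std::vector<std::string>>;

// Invert the callee lists to recover the callers of every node.
CalleeMap deriveCallers(const CalleeMap& callees) {
  CalleeMap callers;
  for (const auto& entry : callees) {
    callers[entry.first];  // nodes without callers still get an (empty) entry
    for (const auto& callee : entry.second) {
      callers[callee].push_back(entry.first);
    }
  }
  return callers;
}

// Usage with the two-node example graph from above (main calls foo).
void deriveCallersExample() {
  CalleeMap callees;
  callees["foo"];                    // foo calls nothing
  callees["main"].push_back("foo");  // main calls foo
  const CalleeMap callers = deriveCallers(callees);
  assert(callers.at("foo") == std::vector<std::string>{"main"});
  assert(callers.at("main").empty());
}
```

Because the caller lists are computed from a single source of truth, they cannot get out of sync with the callees after a manual edit.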

To conserve space, integer identifiers can be used instead of function names:

In order: Version 4 (with names as identifiers), Version 4 (with integers as identifiers)

{
    "_CG": {
        "foo": {
            "functionName": "foo",
            "origin": "foo.cpp",
            "callees": {},
            "hasBody": true,
            "meta": {
                "fileProperties": {
                    "systemInclude": false
                },
                "numStatements": 0,
                "overrideMD": {
                    "overriddenBy": [],
                    "overrides": []
                }
            }
        },
        "main": {
            "functionName" : "main",
            "origin": "main.cpp",
            "callees": {
                "foo": null
            },
            "hasBody": true,
            "meta": {
                "fileProperties": {
                    "systemInclude": false
                },
                "numStatements": 3
            }
        }
    },
    "_MetaCG": {
        "generator": {
            "name": "CGCollector",
            "sha": "a12c2adc9482ef0062fd27f104ab4dfd679cb91f",
            "version": "0.2"
        },
        "version": "4.0"
    }
}

{
    "_CG": {
        "0": {
            "functionName": "foo",
            "origin": "foo.cpp",
            "callees": null,
            "hasBody": true,
            "meta": {
                "fileProperties": {
                    "systemInclude": false
                },
                "numStatements": 0,
                "overrideMD": {
                    "overriddenBy": [],
                    "overrides": []
                }
            }
        },
        "1": {
            "functionName": "main",
            "origin": "main.cpp",
            "callees": {
                "0": null
            },
            "hasBody": true,
            "meta": {
                "fileProperties": {
                    "systemInclude": false
                },
                "numStatements": 3
            }
        }
    },
    "_MetaCG": {
        "generator": {
            "name": "MetaCG",
            "sha": "9350298c64ad5a9aef50494f73caad08c38c1a51",
            "version": "0.7"
        },
        "version": "4.0"
    }
}

In this example, the size reduction is only 16%, but for larger call graphs with long function names this can be very effective.

Potential additional features, not yet implemented:

Add a debug metadata field. This field can be used to export the list of callers for easier look up during manual inspection. This metadata would be ignored when reading, and re-generated when writing, if requested.

Graph library

The graph library was adjusted to handle the new format and to accommodate a wider variety of use-cases (e.g., @pearzt's Flip project).

Node identification and lookup

Nodes are identified by an integer ID unique to the containing call graph.
This ID is generally not equal to the string identifier used in the JSON file, but is instead assigned by the call graph when creating the node.
It is explicitly allowed to create multiple nodes with the same function name and/or origin, enabling a correct representation of functions with internal linkage (e.g., functions declared static).

We maintain a hash map to simplify look-up by name. The function const NodeList& getNodes(const std::string&) can be used to get a list of functions matching the given name. CgNode* getFirstNode(const std::string&) is added as a convenience function to look up a single node, e.g. if we know that only one such node exists.
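
To illustrate the lookup semantics, here is a simplified sketch of a name index that allows several nodes per name. `NameIndex` is a hypothetical stand-in and not the actual Callgraph class; in particular, it returns pointers to IDs where the real API returns CgNode*.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

using NodeId = std::size_t;

// Simplified stand-in for the call graph's name index (not the MetaCG classes).
class NameIndex {
 public:
  // Registers a new node under `name` and returns its freshly assigned ID.
  NodeId insert(const std::string& name) {
    const NodeId id = nextId++;
    byName[name].push_back(id);
    return id;
  }

  // All nodes with the given name; an empty list if there is none.
  const std::vector<NodeId>& getNodes(const std::string& name) const {
    static const std::vector<NodeId> empty;
    const auto it = byName.find(name);
    return it == byName.end() ? empty : it->second;
  }

  // First node with the given name, or nullptr if there is none.
  const NodeId* getFirstNode(const std::string& name) const {
    const auto& nodes = getNodes(name);
    return nodes.empty() ? nullptr : &nodes.front();
  }

 private:
  NodeId nextId = 0;
  std::unordered_map<std::string, std::vector<NodeId>> byName;
};

// Two functions with internal linkage may share a name but get distinct IDs.
void nameIndexExample() {
  NameIndex index;
  const NodeId a = index.insert("helper");  // e.g. static helper() in a.cpp
  const NodeId b = index.insert("helper");  // e.g. static helper() in b.cpp
  assert(a != b);
  assert(index.getNodes("helper").size() == 2);
  assert(*index.getFirstNode("helper") == a);
  assert(index.getFirstNode("missing") == nullptr);
}
```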

Node creation

If there are two or more in-memory call graphs, nodes in these call graphs may have identical IDs.
To avoid issues with unclear ownership and validity of a node, nodes must therefore always be tied to a specific call graph.
This was implemented by making the CgNode constructor private. Nodes can now only be created by calling Callgraph::insert or Callgraph::getOrInsertNode.
This also guarantees that there are no nodes with invalid or uninitialized IDs, as the IDs are set in the constructor and are marked const.
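
The ownership pattern can be sketched as follows. This is an illustration of the idea only: `Node` and `Graph` are simplified, hypothetical stand-ins for CgNode and Callgraph, and the real classes carry more state.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

using NodeId = std::size_t;
class Graph;  // simplified stand-in for Callgraph

class Node {
 public:
  const NodeId id;         // fixed at construction, never reassigned
  const std::string name;

 private:
  friend class Graph;  // only the owning graph may construct nodes
  Node(NodeId id, std::string name) : id(id), name(std::move(name)) {}
};

class Graph {
 public:
  // Creates a node owned by this graph; the ID is unique within this graph only.
  Node& insert(const std::string& name) {
    // std::make_unique cannot reach the private constructor, hence plain new.
    nodes.push_back(std::unique_ptr<Node>(new Node(nextId++, name)));
    return *nodes.back();
  }

 private:
  NodeId nextId = 0;
  std::vector<std::unique_ptr<Node>> nodes;
};

void nodeCreationExample() {
  Graph g1, g2;
  Node& a = g1.insert("foo");
  Node& b = g2.insert("foo");
  assert(a.id == b.id);  // identical IDs, but in different graphs
  assert(&a != &b);      // still two distinct nodes
  // Node n(0, "foo");   // would not compile: the constructor is private
}
```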

Edges

There are no major changes w.r.t. edge representation. Edges are stored centrally in the call graph, as before.
Instead of function names/hashes, edges are defined by integer node IDs. There is a convenience function bool addEdge(const std::string&, const std::string&) to simplify edge creation by name.
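
A rough sketch of how ID-based edges and the by-name helper could fit together is shown below. `EdgeGraph` is hypothetical; the sketch assumes that the by-name overload fails when a name is unknown or ambiguous, and it does not validate IDs in the ID-based overload.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using NodeId = std::size_t;

// Hypothetical, simplified graph: edges are stored centrally as ID pairs.
class EdgeGraph {
 public:
  NodeId insert(const std::string& name) {
    const NodeId id = nextId++;
    byName[name].push_back(id);
    return id;
  }

  // Primary interface: returns false if the edge already exists.
  bool addEdge(NodeId caller, NodeId callee) {
    return edges.insert(std::make_pair(caller, callee)).second;
  }

  // Convenience overload (assumed behavior): only succeeds if both names
  // resolve to exactly one node each.
  bool addEdge(const std::string& callerName, const std::string& calleeName) {
    const auto caller = byName.find(callerName);
    const auto callee = byName.find(calleeName);
    if (caller == byName.end() || callee == byName.end()) return false;
    if (caller->second.size() != 1 || callee->second.size() != 1) return false;
    return addEdge(caller->second.front(), callee->second.front());
  }

  bool hasEdge(NodeId caller, NodeId callee) const {
    return edges.count(std::make_pair(caller, callee)) == 1;
  }

 private:
  NodeId nextId = 0;
  std::unordered_map<std::string, std::vector<NodeId>> byName;
  std::set<std::pair<NodeId, NodeId>> edges;
};

void edgeExample() {
  EdgeGraph g;
  const NodeId mainId = g.insert("main");
  const NodeId fooId = g.insert("foo");
  const bool addedByName = g.addEdge("main", "foo");
  assert(addedByName && g.hasEdge(mainId, fooId));

  const NodeId helperA = g.insert("helper");
  g.insert("helper");  // second node with the same name
  const bool ambiguous = g.addEdge("main", "helper");
  assert(!ambiguous);  // name is not unique, use IDs instead
  const bool addedById = g.addEdge(mainId, helperA);
  assert(addedById && g.hasEdge(mainId, helperA));
}
```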

I/O

Unlike in v3, there is no direct mapping between the JSON format and the internal representation.
This makes the implementation of the reader and writer more involved, but allows us to have full control over the import/export process. This also made it possible to provide more helpful error messages when there are errors in the input JSON.

To convert between the string identifiers in the format and the internal node IDs, mapping classes have been introduced (StrToNodeMapping for the reader, NodeToStrMapping for the writer). These are used by the reader/writer to maintain a consistent mapping between the two identifiers and to translate node references (e.g. in the edge field).
This mapping data structure is only required during reading/writing and is discarded afterward.
The only use outside the reader/writer itself is in the creation and export of metadata (see below).
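
As a rough illustration of the reader side, the sketch below resolves string identifiers to node IDs in two passes. It is not the actual StrToNodeMapping implementation: `StrToNodeSketch` and `readEdges` are hypothetical, and the input is a plain map instead of parsed JSON.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using NodeId = std::size_t;

// Hypothetical reader-side mapping from file identifiers to in-memory IDs.
class StrToNodeSketch {
 public:
  void add(const std::string& fileId, NodeId node) { ids[fileId] = node; }

  NodeId resolve(const std::string& fileId) const {
    const auto it = ids.find(fileId);
    if (it == ids.end()) {
      throw std::runtime_error("Unknown node identifier: " + fileId);
    }
    return it->second;
  }

 private:
  std::unordered_map<std::string, NodeId> ids;
};

// Input: file identifier -> callee identifiers, as in the "callees" field.
// Output: edges expressed in freshly assigned in-memory node IDs.
std::set<std::pair<NodeId, NodeId>> readEdges(
    const std::map<std::string, std::vector<std::string>>& file) {
  StrToNodeSketch mapping;
  NodeId next = 0;
  // Pass 1: create one node per entry so that forward references resolve.
  for (const auto& entry : file) mapping.add(entry.first, next++);
  // Pass 2: translate every callee reference into an ID pair.
  std::set<std::pair<NodeId, NodeId>> edges;
  for (const auto& entry : file) {
    for (const auto& callee : entry.second) {
      edges.insert(std::make_pair(mapping.resolve(entry.first),
                                  mapping.resolve(callee)));
    }
  }
  return edges;  // the mapping is discarded here, as described above
}
```

Funneling every reference through `resolve` gives a single place to report dangling identifiers in the input.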

Merging

The implementation of the call graph merge functionality received a major overhaul.
In the current upstream version, functions are always merged by name and there is no customization of merge semantics.
This patch introduces so-called merge policies to enable greater control.

These are based on the following model:

Merge model
When merging graph A into graph B, the merge transfers all nodes from A into B.
For each node in A, there are two main cases:

1. The node does not match any node in B: it is cloned and inserted as a new node in B.
2. The node matches a node in B. There are two possible actions:
2.1 Keep the existing node in B: nothing to be done.
2.2 Replace with the node from A: keep the existing node in B, but overwrite its properties with those from A.

Merge policies are introduced to concretize the semantics of this model.
Relevant declarations:

struct MergeAction { // abbreviated definition
  NodeId targetNode;
  bool replace;
};

struct MergePolicy {
  virtual std::optional<MergeAction> findMatchingNode(const Callgraph& targetCG, const CgNode& sourceNode) const = 0;
};

For each node in the source graph A, the policy defines which of the outlined cases applies. The function returns:

An empty value, if there is no matching node (case 1)
An action with replace=false, if there is a match and the existing node should be kept (case 2.1).
An action with replace=true, if there is a match and the existing node should be overwritten (case 2.2).

The policy is passed to Callgraph::merge, which executes the merge accordingly. This returns a MergeRecorder object, used to look up the performed actions and to map from node IDs from graph A into the merged graph B.
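
To make the model more concrete, here is a self-contained toy version of a policy-driven merge (C++17, for std::optional). Everything prefixed with `Toy` and the `ByNamePreferDefinition` policy are hypothetical; the replace rule (prefer a node with a body over one without) is an assumption for illustration and not necessarily what MergeByName does. Edges, metadata and the MergeRecorder are omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

using NodeId = std::size_t;

struct ToyNode {
  std::string name;
  bool hasBody;
};
struct ToyGraph {
  std::vector<ToyNode> nodes;  // index in the vector doubles as the NodeId
};

struct MergeAction {
  NodeId targetNode;
  bool replace;
};

struct ToyMergePolicy {
  virtual ~ToyMergePolicy() = default;
  virtual std::optional<MergeAction> findMatchingNode(
      const ToyGraph& targetCG, const ToyNode& sourceNode) const = 0;
};

// Matches by name; replaces only if the source has a body and the target
// does not (illustrative rule, assumed for this sketch).
struct ByNamePreferDefinition : ToyMergePolicy {
  std::optional<MergeAction> findMatchingNode(
      const ToyGraph& targetCG, const ToyNode& sourceNode) const override {
    for (NodeId id = 0; id < targetCG.nodes.size(); ++id) {
      const ToyNode& candidate = targetCG.nodes[id];
      if (candidate.name == sourceNode.name) {
        return MergeAction{id, sourceNode.hasBody && !candidate.hasBody};
      }
    }
    return std::nullopt;  // case 1: no match
  }
};

// Merges `source` (A) into `target` (B). Returns, for each node of A in
// order, the ID it ended up with in B.
std::vector<NodeId> mergeInto(ToyGraph& target, const ToyGraph& source,
                              const ToyMergePolicy& policy) {
  std::vector<NodeId> mapping;
  for (const ToyNode& node : source.nodes) {
    const std::optional<MergeAction> action =
        policy.findMatchingNode(target, node);
    if (!action) {  // case 1: clone into B
      target.nodes.push_back(node);
      mapping.push_back(target.nodes.size() - 1);
    } else {
      if (action->replace) {  // case 2.2: overwrite properties
        target.nodes[action->targetNode] = node;
      }  // case 2.1: nothing to do
      mapping.push_back(action->targetNode);
    }
  }
  return mapping;
}
```

The returned vector plays the role that the recorded node mapping has in the description above: it tells the caller where each node of A can be found in B after the merge.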

Note that metadata are in charge of defining their own merge semantics for cases 2.1 and 2.2. More details below.

The current implementation comes with the MergeByName policy, which models the behavior of the previously used merge strategy.
As a follow-up, I suggest we add an ELFMerge policy that models the actual linking of object files more closely. To do this, we need a linkage metadata with values external, internal and weak, and define the policy based on this linkage type. This will finally allow us to accurately handle merge semantics of static and inline functions, regardless of using a source or IR level generator.

Metadata

There are no fundamental changes to how metadata is registered and instantiated.
However, the changes to the format and the merge semantics required some adjustments.

I/O
When creating from JSON, the metadata needs to know how to map from string identifiers to internal node IDs. Therefore, the constructor of each metadata takes an additional StrToNodeMapping& parameter. This is only relevant if the metadata references other nodes (like in OverrideMD) and can be ignored in the majority of cases.

Similarly, the conversion to JSON requires a NodeToStrMapping that is now added as a parameter to toJson.

Mapping
When cloning a node from graph A into graph B, any node IDs referenced in the metadata become potentially invalid.
We added the applyMapping(const GraphMapping&) method that lets the metadata correctly update its internal node references.
For the majority of metadata that do not reference other nodes, this method can be left empty.
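
A minimal sketch of what such a remapping could look like for a metadata type that stores node references is given below. `GraphMappingSketch` and `OverrideInfoSketch` are hypothetical simplifications; the real GraphMapping and OverrideMD interfaces may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

using NodeId = std::size_t;

// Hypothetical stand-in for GraphMapping:
// node ID in the source graph -> node ID in the target graph.
using GraphMappingSketch = std::unordered_map<NodeId, NodeId>;

// Simplified metadata that references other nodes by ID.
struct OverrideInfoSketch {
  std::vector<NodeId> overrides;
  std::vector<NodeId> overriddenBy;

  // Rewrites all stored references to IDs valid in the target graph.
  // at() throws std::out_of_range if a referenced node was not mapped.
  void applyMapping(const GraphMappingSketch& mapping) {
    for (NodeId& id : overrides) id = mapping.at(id);
    for (NodeId& id : overriddenBy) id = mapping.at(id);
  }
};

void applyMappingExample() {
  OverrideInfoSketch md;
  md.overrides.push_back(2);  // refers to node 2 of the source graph

  GraphMappingSketch mapping;
  mapping[2] = 7;  // node 2 of the source became node 7 in the target

  md.applyMapping(mapping);
  assert(md.overrides == std::vector<NodeId>{7});
  assert(md.overriddenBy.empty());
}
```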

Merging
Previously, what should happen during a merge of two metadata objects was often not clear, as there was no information about the performed action.
Now, the executed MergeAction is passed as a parameter to allow the metadata to handle the merge accordingly.
Additionally, we pass a GraphMapping to map node references of the metadata object that is merged in. Again, this is only relevant if the metadata itself references other nodes.

Please refer to OverrideMD.h as an example for a non-trivial metadata implementation using mappings.

Performance results

The following is a performance comparison with devel (both measured on my laptop).

Test Name	`devel` (ms)	This PR (ms)	Speedup (×)
insert edges	148	109	1.36×
insert nodes	25	15	1.67×
read V2 format	28,547	457	62.45×
write V2 format	754	735	1.03×
read V3/4 format	211 (V3)	233 (V4)	0.91×
write V3/4 format	492 (V3)	417 (V4)	1.18×

Closes #42

sebastiankreutzer · 2025-07-08T15:27:06Z

As a further note: I would like to completely get rid of the MCGManager, as it doesn't have any real purpose anymore besides a string → Callgraph lookup.
If there are no objections, I will do this after this PR has gone through.

sebastiankreutzer · 2025-07-08T15:47:37Z

The last commit adds methods to delete nodes and edges

TimHeldmann · 2025-07-10T08:49:21Z

ToDos from today's meeting:

Decided to rename edges to callees

Discussed whether getFirstNode should print an error if more than one node exists.
Decided to keep as is (no warning), but possibly implement getSingleNode() function later, which will warn if there were multiple hits.

sebastiankreutzer · 2025-07-10T16:51:00Z

> ToDos from today's meeting:
>
> Decided to rename edges to callees
>
> Discussed whether getFirstNode should print an error if more than one node exists. Decided to keep as is (no warning), but possibly implement getSingleNode() function later, which will warn if there were multiple hits.

Addressed in the last two commits

sebastiankreutzer · 2025-07-11T13:13:07Z

With the last commit, PGIS builds correctly and all unit and integration tests succeed 🥳

Formatting

TimHeldmann

Looks good to me.
I will open an issue wrt. the V3 compatibility reader/writer facility and implement them if I have time to do so.

jplehr

Some initial drive-by comments.

jplehr · 2025-07-30T16:31:49Z

graph/src/Callgraph.cpp

-    MCGLogger::instance().getErrConsole()->error("Source ID {} does not exist in graph: Unrecoverable graph error",
-                                                 parentID);
-    abort();
+bool Callgraph::addEdge(const std::string& callerName, const std::string& calleeName) {


Do I understand correctly that we cannot insert an edge when there is more than 1 node with any of the two given names?

Only for this utility function. You can always list the nodes that match a given function name and add an edge to these nodes directly.

sebastiankreutzer requested review from TimHeldmann, jplehr and pearzt July 7, 2025 12:16

sebastiankreutzer force-pushed the feat/graphlib_refactor branch 2 times, most recently from 8f4d37c to 2fba539 Compare July 9, 2025 09:03

pearzt mentioned this pull request Jul 10, 2025

Remove PGIS #51

Closed

sebastiankreutzer force-pushed the feat/graphlib_refactor branch from 231cb1a to bd29e40 Compare July 11, 2025 12:56

sebastiankreutzer force-pushed the feat/graphlib_refactor branch from bd29e40 to a483abc Compare July 14, 2025 08:16

This was referenced Jul 17, 2025

Feat/origin aware callgraph #34

Closed

Linting for integration test files #38

Open

sebastiankreutzer added 15 commits July 29, 2025 16:32

Start graph lib refactor

65e44ac

[WIP] Continue graph lib refactor

84d1419

[WIP] Make it build

df65e93

[WIP] Fix V2 reader tests

705b513

[WIP] Fix V2 writer tests

d461332

[WIP} Fixing more stuff

7e0e288

[WIP] All unit tests succeeding

4d768cd

[WIP] Refactor merge, introduce merge policies

44668d4

Make integration tests work

42724d8

[WIP] Fix ground truth origin field

3b79060

[WIP] Continue fixing integration tests

101098d

[WIP] Adjust perf tests for V4

cc473bb

[WIP] Fix all tests

802b37d

[WIP] Remove old code

3922656

[WIP] Apply formatting

3b5cdfd

sebastiankreutzer and others added 11 commits July 29, 2025 16:32

[WIP] Fix accidentatlly deleted line

9e6560a

[WIP] Add methods to delete nodes and edges

551fbf6

Extend unit testing and fix some issues

36a6c39

[WIP] Formatting

a23ec19

[WIP] Change json entry from 'edges' to 'callees'

f655303

Add getSingleNode function

ea45193

[WIP] Start upgrading PGIS

3ea05ce

Update pymetacg for graphlib rewrite / v4

001c091

Add V4 reader/writer tests

03c3a6c

Formatting

Add tests for reading and writing of node-referencing metadata

d4b6f7b

Update CGPatch for v4 rewrite

87a9556

sebastiankreutzer force-pushed the feat/graphlib_refactor branch from 4cabca3 to 87a9556 Compare July 29, 2025 14:33

Clean up remaining debug output etc.

7e40959

sebastiankreutzer marked this pull request as ready for review July 30, 2025 15:12

sebastiankreutzer changed the title ~~[Draft] File format v4 and big graph lib re-write~~ File format v4 and big graph lib re-write Jul 30, 2025

TimHeldmann approved these changes Jul 30, 2025

jplehr reviewed Jul 30, 2025

jplehr reviewed Jul 31, 2025

pearzt approved these changes Jul 31, 2025

Address reviewer comments and add missing API documentation

ce27d33

sebastiankreutzer force-pushed the feat/graphlib_refactor branch from a319b4a to ce27d33 Compare July 31, 2025 12:53

sebastiankreutzer added 2 commits July 31, 2025 15:46

[NFC] Update READMEs

8dcc0b9

fixup! Address reviewer comments and add missing API documentation

5c20e6d

sebastiankreutzer merged commit d05b53b into tudasc:devel Jul 31, 2025
3 checks passed

4 participants

sebastiankreutzer commented Jul 7, 2025 • edited by pearzt