File format v4 and big graph lib re-write #49
sebastiankreutzer merged 30 commits into tudasc:devel
|
As a further note: I would like to completely get rid of the |
|
The last commit adds methods to delete nodes and edges |
8f4d37c to 2fba539
|
ToDos from today's meeting: Decided to rename Discussed whether |
Addressed in the last two commits |
231cb1a to bd29e40
|
With the last commit, PGIS builds correctly and all unit and integration tests succeed 🥳 |
bd29e40 to a483abc
Formatting
4cabca3 to 87a9556
TimHeldmann
left a comment
Looks good to me.
I will open an issue wrt. the V3 compatibility reader/writer facility and implement them if I have time to do so.
jplehr
left a comment
Some initial drive-by comments.
MCGLogger::instance().getErrConsole()->error("Source ID {} does not exist in graph: Unrecoverable graph error",
                                             parentID);
abort();
bool Callgraph::addEdge(const std::string& callerName, const std::string& calleeName) {
Do I understand correctly that we cannot insert an edge when there is more than one node with either of the two given names?
Only for this utility function. You can always list the nodes that match a given function name and add an edge to these nodes directly.
a319b4a to ce27d33
File format v4 and big graph library re-write
This merge request introduces format v4, alongside a number of changes in the graph lib implementation.
Major parts of this PR are:
I'm happy for any and all feedback, including:
Note: This patch is complete (including unit and integration tests), with the following notable exceptions:
- `VersionThreeMCG{Reader,Writer}Test` has not been adjusted to account for format changes after review. I will update this once we agree on the final format. I/O is still tested via the v2 tests (they depend on the v4 reader/writer).
- `CGConverter` and `CGMerge2` are functional. All other tools need to be updated. Required changes are fairly easy to apply.
For performance results, skip to the very end.
Format v4
Format v4 combines ideas from both v2 and v3 to provide a JSON representation that is both compact and readable.
V4 completely replaces v3, as keeping both versions while maintaining a consistent internal representation is not feasible.
The following example shows the same call graph in v2, v4 and v3 (use the horizontal scroll bar to see everything):
v2:
{ "_CG": { "foo": { "callees": [], "callers": [ "main" ], "doesOverride": false, "hasBody": true, "isVirtual": false, "meta": { "fileProperties": { "origin": "foo.cpp", "systemInclude": false }, "numStatements": 0 }, "overriddenBy": [], "overrides": [] }, "main": { "callees": [ "foo" ], "callers": [], "doesOverride": false, "hasBody": true, "isVirtual": false, "meta": { "fileProperties": { "origin": "main.cpp", "systemInclude": false }, "numStatements": 3 }, "overriddenBy": [], "overrides": [] } }, "_MetaCG": { "generator": { "name": "CGCollector", "sha": "9350298c64ad5a9aef50494f73caad08c38c1a51", "version": "0.7" }, "version": "2.0" } }
v4:
{ "_CG": { "foo": { "functionName": "foo", "origin": "foo.cpp", "callees": {}, "hasBody": true, "meta": { "fileProperties": { "systemInclude": false }, "numStatements": 0, "overrideMD": { "overriddenBy": [], "overrides": [] } } }, "main": { "functionName": "main", "origin": "main.cpp", "callees": { "foo": null }, "hasBody": true, "meta": { "fileProperties": { "systemInclude": false }, "numStatements": 3 } } }, "_MetaCG": { "generator": { "name": "CGCollector", "sha": "a12c2adc9482ef0062fd27f104ab4dfd679cb91f", "version": "0.2" }, "version": "4.0" } }
v3:
{ "_CG": { "edges": [ [ [ 13570296075836900426, 6503480063870808269 ], null ] ], "nodes": [ [ 6503480063870808269, { "functionName": "foo", "origin": "foo.cpp", "hasBody": true, "meta": { "fileProperties": { "systemInclude": false }, "numStatements": 0, "overrideMD": { "overriddenBy": [], "overrides": [] } } } ], [ 13570296075836900426, { "functionName": "main", "origin": "main.cpp", "hasBody": true, "meta": { "fileProperties": { "systemInclude": false }, "numStatements": 3 } } ] ] }, "_MetaCG": { "generator": { "name": "MetaCG", "sha": "2ce9326f676cfbc802b3924f66dfd97576058b30", "version": "0.7" }, "version": "3.0" } }
Here are the main changes compared to the previous formats:
- Edges are stored in the `callees` field within each node, similar to v2. Each edge can carry metadata (by default `null`).
To conserve space, integer identifiers can be used instead of function names:
Function names as identifiers:
{ "_CG": { "foo": { "functionName": "foo", "origin": "foo.cpp", "callees": {}, "hasBody": true, "meta": { "fileProperties": { "systemInclude": false }, "numStatements": 0, "overrideMD": { "overriddenBy": [], "overrides": [] } } }, "main": { "functionName": "main", "origin": "main.cpp", "callees": { "foo": null }, "hasBody": true, "meta": { "fileProperties": { "systemInclude": false }, "numStatements": 3 } } }, "_MetaCG": { "generator": { "name": "CGCollector", "sha": "a12c2adc9482ef0062fd27f104ab4dfd679cb91f", "version": "0.2" }, "version": "4.0" } }
Integer identifiers:
{ "_CG": { "0": { "functionName": "foo", "origin": "foo.cpp", "callees": null, "hasBody": true, "meta": { "fileProperties": { "systemInclude": false }, "numStatements": 0, "overrideMD": { "overriddenBy": [], "overrides": [] } } }, "1": { "functionName": "main", "origin": "main.cpp", "callees": { "0": null }, "hasBody": true, "meta": { "fileProperties": { "systemInclude": false }, "numStatements": 3 } } }, "_MetaCG": { "generator": { "name": "MetaCG", "sha": "9350298c64ad5a9aef50494f73caad08c38c1a51", "version": "0.7" }, "version": "4.0" } }
Potential additional features, not yet implemented:
- `debug` metadata field. This field can be used to export the list of callers for easier lookup during manual inspection. This metadata would be ignored when reading, and re-generated when writing, if requested.
Graph library
The graph library was adjusted to handle the new format and to accommodate a wider variety of use cases (e.g., @pearzt's Flip project).
Node identification and lookup
Nodes are identified by an integer ID unique to the containing call graph.
This ID is generally not equal to the string identifier used in the JSON file, but is instead assigned by the call graph when creating the node.
It is explicitly allowed to create multiple nodes with the same function name and/or origin, enabling a correct representation of functions with internal linkage (e.g., functions declared `static`).
We maintain a hash map to simplify lookup by name. The function `const NodeList& getNodes(const std::string&)` can be used to get a list of functions matching the given name. `CgNode* getFirstNode(const std::string&)` is added as a convenience function to look up a single node, e.g. if we know that only one such node exists.
Node creation
If there are two or more in-memory call graphs, nodes in these call graphs may have identical IDs.
To avoid issues with unclear ownership and validity of a node, nodes must therefore always be tied to a specific call graph.
This was implemented by making the `CgNode` constructor private. Nodes can now only be created by calling `Callgraph::insert` or `Callgraph::getOrInsertNode`.
This also guarantees that there are no nodes with invalid or uninitialized IDs, as the IDs are set in the constructor and are marked `const`.
Edges
There are no major changes w.r.t. edge representation. Edges are stored centrally in the call graph, as before.
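The rules described so far (graph-assigned integer IDs, several nodes per function name, node creation only through the owning graph, centrally stored edges) can be illustrated with a small toy model. This is only a sketch under stated assumptions: `ToyGraph` and `ToyNode` are hypothetical stand-ins and do not reproduce the actual `Callgraph` and `CgNode` classes.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <set>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

class ToyGraph;

// Hypothetical stand-in for CgNode: the ID is const and set in the constructor.
class ToyNode {
 public:
  const std::size_t id;
  const std::string name;
  const std::string origin;

 private:
  friend class ToyGraph;  // only the owning graph may create nodes
  ToyNode(std::size_t nodeId, std::string fn, std::string file)
      : id(nodeId), name(std::move(fn)), origin(std::move(file)) {}
};

class ToyGraph {
 public:
  // Always creates a new node, even if the function name already exists.
  ToyNode& insert(const std::string& name, const std::string& origin) {
    nodes.push_back(std::unique_ptr<ToyNode>(new ToyNode(nodes.size(), name, origin)));
    byName[name].push_back(nodes.back().get());
    return *nodes.back();
  }

  // All nodes sharing the given function name (empty if there are none).
  std::vector<ToyNode*> getNodes(const std::string& name) const {
    auto it = byName.find(name);
    if (it == byName.end()) return {};
    return it->second;
  }

  // Edges are stored centrally and refer to nodes by ID only.
  // Returns false for unknown IDs and for edges that already exist.
  bool addEdge(std::size_t caller, std::size_t callee) {
    if (caller >= nodes.size() || callee >= nodes.size()) return false;
    return edges.emplace(caller, callee).second;
  }

  std::size_t edgeCount() const { return edges.size(); }

 private:
  std::vector<std::unique_ptr<ToyNode>> nodes;
  std::unordered_map<std::string, std::vector<ToyNode*>> byName;
  std::set<std::pair<std::size_t, std::size_t>> edges;
};
```

Inserting `helper` twice with different origins yields two distinct nodes, which is how two `static` functions of the same name could coexist; a by-name edge helper would then have to decide what to do when a name is ambiguous.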
Instead of function names/hashes, edges are defined by integer node IDs. There is a convenience function `bool addEdge(const std::string&, const std::string&)` to simplify edge creation by name.
I/O
Unlike in v3, there is no direct mapping between the JSON format and the internal representation.
This makes the implementation of the reader and writer more involved, but gives us full control over the import/export process. It also made it possible to provide more helpful error messages when there are errors in the input JSON.
To convert between the string identifiers in the format and the internal node IDs, mapping classes have been introduced (`StrToNodeMapping` for the reader, `NodeToStrMapping` for the writer). These are used by the reader/writer to maintain a consistent mapping between the two identifiers and to translate node references (e.g. in the `edge` field).
This mapping data structure is only required during reading/writing and is discarded afterward.
The only use outside the reader/writer itself is in the creation and export of metadata (see below).
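To make the role of the reader-side mapping more concrete, here is a minimal sketch of resolving the string identifiers of a v4-style `callees` table to internal IDs. It rests on several assumptions: `ToyStrToNodeMapping`, `RawNode` and `resolveEdges` are hypothetical names, JSON parsing is left out entirely, and the real `StrToNodeMapping` interface may look different.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the reader-side mapping: JSON key -> internal ID.
class ToyStrToNodeMapping {
 public:
  // Registers a JSON key and assigns the next free internal ID.
  std::size_t registerKey(const std::string& key) {
    auto res = ids.emplace(key, ids.size());
    if (!res.second) throw std::runtime_error("duplicate node key: " + key);
    return res.first->second;
  }

  // Translates a node reference; fails loudly on dangling references.
  std::size_t lookup(const std::string& key) const {
    auto it = ids.find(key);
    if (it == ids.end()) throw std::runtime_error("unknown node key: " + key);
    return it->second;
  }

 private:
  std::map<std::string, std::size_t> ids;
};

// Already-parsed node entry: its JSON key and the keys listed under "callees".
struct RawNode {
  std::string key;
  std::vector<std::string> callees;
};

// Pass 1 registers all nodes, pass 2 resolves edges, so forward references work.
std::vector<std::pair<std::size_t, std::size_t>> resolveEdges(const std::vector<RawNode>& raw) {
  ToyStrToNodeMapping mapping;
  for (const auto& n : raw) mapping.registerKey(n.key);

  std::vector<std::pair<std::size_t, std::size_t>> edges;
  for (const auto& n : raw)
    for (const auto& c : n.callees)
      edges.emplace_back(mapping.lookup(n.key), mapping.lookup(c));
  return edges;  // the mapping goes out of scope here; it is only needed while reading
}
```

Because the internal IDs are assigned in registration order, the JSON keys (`"0"`, `"1"`, or function names) never need to coincide with the internal IDs, and a reference to a key that was never defined surfaces as an explicit error.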
Merging
The implementation of the call graph merge functionality received a major overhaul.
In the current upstream version, functions are always merged by name and there is no customization of merge semantics.
This patch introduces so-called merge policies to enable greater control.
These are based on the following model:
Merge model
When merging graph `A` into graph `B`, the merge transfers all nodes from `A` into `B`.
For each node in `A`, there are two main cases for how this can happen:
1. There is no matching node in `B`: the node can be cloned and inserted as a new node in `B`.
2. There is a matching node in `B`:
2.1 Keep existing node in `B`: nothing to be done
2.2 Replace with node from `A`: keep the existing node in `B` but overwrite with properties from `A`
Merge policies are introduced to concretize the semantics of this model.
Relevant declarations:
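As an illustration of the model only, and not a reproduction of the actual declarations, a merge policy could be sketched like this. Every name below (`ToyNode`, `ToyDecision`, `ToyMergeAction`, `ToyMergePolicy`, `ByNamePreferBody`, `mergeInto`) is hypothetical, nodes are reduced to a name and a `hasBody` flag, and the rule of letting a definition overwrite a declaration is an assumption made for this example rather than the documented behavior of any shipped policy.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct ToyNode {
  std::string name;
  bool hasBody;
};

// The three outcomes of the merge model: case 1, case 2.1 and case 2.2.
enum class ToyDecision { InsertNew, KeepExisting, ReplaceExisting };

struct ToyMergeAction {
  ToyDecision decision;
  std::size_t target;  // index of the matching node in B; unused for InsertNew
};

struct ToyMergePolicy {
  virtual ~ToyMergePolicy() = default;
  virtual ToyMergeAction decide(const std::vector<ToyNode>& b, const ToyNode& src) const = 0;
};

// Matches by name; a definition from A overwrites a mere declaration in B.
struct ByNamePreferBody : ToyMergePolicy {
  ToyMergeAction decide(const std::vector<ToyNode>& b, const ToyNode& src) const override {
    for (std::size_t i = 0; i < b.size(); ++i) {
      if (b[i].name != src.name) continue;
      bool replace = src.hasBody && !b[i].hasBody;
      return {replace ? ToyDecision::ReplaceExisting : ToyDecision::KeepExisting, i};
    }
    return {ToyDecision::InsertNew, 0};
  }
};

// Merges A into B and returns, for each node of A, its index in the merged B.
std::vector<std::size_t> mergeInto(std::vector<ToyNode>& b, const std::vector<ToyNode>& a,
                                   const ToyMergePolicy& policy) {
  std::vector<std::size_t> aToB;
  for (const auto& src : a) {
    ToyMergeAction action = policy.decide(b, src);
    if (action.decision == ToyDecision::InsertNew) {
      b.push_back(src);  // case 1: clone into B
      aToB.push_back(b.size() - 1);
      continue;
    }
    if (action.decision == ToyDecision::ReplaceExisting) b[action.target] = src;  // case 2.2
    aToB.push_back(action.target);  // case 2.1 leaves B's node untouched
  }
  return aToB;
}
```

The returned index vector plays the role of a record of the merge: it lets later steps translate references that were valid in `A` into references that are valid in the merged `B`.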
For each node in the source graph `A`, the policy defines which of the outlined cases applies. The function returns:
- [...] (case `1`).
- `replace=false`, if there is a match and the existing node should be kept (case `2.1`).
- `replace=true`, if there is a match and the existing node should be overwritten (case `2.2`).
The policy is passed to `Callgraph::merge`, which executes the merge accordingly. This returns a `MergeRecorder` object, used to look up the performed actions and to map node IDs from graph `A` into the merged graph `B`.
Note that metadata are in charge of defining their own merge semantics for cases `2.1` and `2.2`. More details below.
The current implementation comes with the `MergeByName` policy, which models the behavior of the previously used merge strategy.
As a follow-up, I suggest we add an `ELFMerge` policy that models the actual linking of object files more closely. To do this, we need a `linkage` metadata with values `external`, `internal` and `weak`, and define the policy based on this linkage type. This will finally allow us to accurately handle merge semantics of `static` and `inline` functions, regardless of using a source or IR level generator.
Metadata
There are no fundamental changes to how metadata is registered and instantiated.
However, the changes to the format and the merge semantics required some adjustments.
I/O
When creating from JSON, the metadata needs to know how to map from string identifiers to internal node IDs. Therefore, the constructor of each metadata takes an additional `StrToNodeMapping&` parameter. This is only relevant if the metadata references other nodes (like in `OverrideMD`) and can be ignored in the majority of cases.
Similarly, the conversion to JSON requires a `NodeToStrMapping` that is now added as a parameter to `toJson`.
Mapping
When cloning a node from graph `A` into graph `B`, any node IDs referenced in the metadata become potentially invalid.
We added the `applyMapping(const GraphMapping&)` method that lets the metadata correctly update its internal node references.
For the majority of metadata that do not reference other nodes, this method can be left empty.
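For a metadata object that does reference other nodes, such an update could look roughly like the following sketch. `ToyGraphMapping` and `ToyOverrideMD` are hypothetical simplifications; the real `GraphMapping` and `OverrideMD` types are not shown here and may differ. Dropping references that have no counterpart in the target graph is an assumption of this sketch, not a documented rule.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hypothetical mapping: node ID in source graph A -> node ID in target graph B.
using ToyGraphMapping = std::unordered_map<std::size_t, std::size_t>;

// Simplified stand-in for a metadata object that stores references to other nodes.
struct ToyOverrideMD {
  std::vector<std::size_t> overrides;
  std::vector<std::size_t> overriddenBy;

  // Rewrites all stored node IDs so that they are valid in the target graph.
  void applyMapping(const ToyGraphMapping& mapping) {
    remap(overrides, mapping);
    remap(overriddenBy, mapping);
  }

 private:
  static void remap(std::vector<std::size_t>& ids, const ToyGraphMapping& mapping) {
    std::vector<std::size_t> mapped;
    for (std::size_t id : ids) {
      auto it = mapping.find(id);
      if (it != mapping.end()) mapped.push_back(it->second);  // unmapped IDs are dropped
    }
    ids.swap(mapped);
  }
};
```

Metadata without node references would simply provide an empty `applyMapping`, since there is nothing to translate.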
Merging
Previously, what should happen during a merge of two metadata objects was often not clear, as there was no information about the performed action.
Now, the executed `MergeAction` is passed as a parameter to allow the metadata to handle the merge accordingly.
Additionally, we pass a `GraphMapping` to map node references of the metadata object that is merged in. Again, this is only relevant if the metadata itself references other nodes.
Please refer to `OverrideMD.h` as an example of a non-trivial metadata implementation using mappings.
Performance results
The following is a performance comparison with `devel` (both measured on my laptop).
[Performance table not preserved; the only surviving column header is `devel` (ms).]
Closes #42