This folder contains a proof-of-concept (POC) implementation of module metrics tracking and enforcement. The following commands run the scanner across the entire first-party codebase and merge the results. All commands are assumed to run at the root of the checkout, inside a correctly activated Python virtual env.
Run `modules_poc/mod_scanner.py --dump-modules` to produce a `modules.yaml` file in the current directory. This file is a multi-level map from module name to team name to directory path to list of file names. For unassigned files it uses `__NONE__` as the module name, and for unowned files it uses `__NO_OWNER__` as the team, both of which conveniently sort first. For owned files it uses the part of the team name after `@10gen/` with `-` replaced with `_` to be friendlier to querying. In cases where multiple teams own a file, the file is duplicated to each team's list.
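The layout described above can be sketched in a few lines of Python. This is only an illustration of the map's shape, not the scanner's actual code: `build_modules_map`, `team_key`, and the `(path, module, owners)` record format are hypothetical names introduced here.

```python
from collections import defaultdict

NO_MODULE = "__NONE__"      # module name used for unassigned files
NO_OWNER = "__NO_OWNER__"   # team name used for unowned files
TEAM_PREFIX = "@10gen/"


def team_key(owner):
    """Turn an owner handle like '@10gen/server-programmability' into a map key."""
    if owner is None:
        return NO_OWNER
    if owner.startswith(TEAM_PREFIX):
        owner = owner[len(TEAM_PREFIX):]
    return owner.replace("-", "_")


def build_modules_map(records):
    """Build module -> team -> directory -> [file names].

    `records` is a hypothetical input format: an iterable of
    (path, module_or_None, list_of_owner_handles) tuples.
    """
    out = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for path, module, owners in records:
        directory, _, filename = path.rpartition("/")
        # A file with several owners is duplicated into each team's list.
        for owner in owners or [None]:
            out[module or NO_MODULE][team_key(owner)][directory].append(filename)
    # Convert to plain dicts with sorted keys at every level.
    return {
        mod: {
            team: {d: sorted(files) for d, files in sorted(dirs.items())}
            for team, dirs in sorted(teams.items())
        }
        for mod, teams in sorted(out.items())
    }
```

Because `_` sorts before lowercase letters, the `__NONE__` and `__NO_OWNER__` keys come first whenever the other module and team names start with a lowercase letter.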
This file can be viewed directly in VSCode. The yaml plugin's breadcrumbs and folding are very helpful. `yq` (`jq` for yaml) is also a powerful tool. Here are a few examples using it, some of which produce enough output to be worth opening in VSCode:
# list of teams
yq '[.[] | keys] | add | sort | unique[]' -r modules.yaml
# unassigned files owned by server-programmability
yq '.__NONE__.server_programmability' modules.yaml
# files owned by server-programmability across all modules (or lack thereof)
yq '.[] |= (.server_programmability | values)' modules.yaml
# assigned files owned by server-programmability outside of the core module
yq '.[] |= (.server_programmability | values) | del(.core) | del(.__NONE__)' modules.yaml
# assigned files owned by server-programmability in modules that don't start with core
yq '.[] |= (.server_programmability | values) | with_entries(select(.key | startswith("core") | not)) | del(.__NONE__)' modules.yaml
# unowned files as a flat list
yq '.[].__NO_OWNER__ | values | to_entries | map("\(.key)/\(.value[])") | .[] ' modules.yaml -r | sort
# unowned files grouped by directory
yq '[.[].__NO_OWNER__ | to_entries? | .[]] | group_by(.key) | map({key: .[0].key, value: ([.[].value] | add | sort)}) | from_entries' modules.yaml
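For readers less familiar with `jq` syntax, here is a rough Python equivalent of two of the queries above (the list of teams and the flat list of unowned files). It is a sketch that assumes `modules.yaml` has already been loaded into nested dicts by a YAML parser; the helper names are made up for this example.

```python
def team_names(modules):
    """Sorted, de-duplicated team names across all modules."""
    return sorted({team for teams in modules.values() for team in teams})


def unowned_files(modules):
    """Flat, sorted list of 'directory/file' paths that have no owner."""
    paths = []
    for teams in modules.values():
        # Modules without unowned files simply have no __NO_OWNER__ entry.
        for directory, files in teams.get("__NO_OWNER__", {}).items():
            paths.extend(f"{directory}/{name}" for name in files)
    return sorted(paths)
```

Both helpers only read the map, so they can be applied to the same loaded data as many times as needed.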
This will build the `merged_decls.json` file in the current directory:
buildscripts/poetry_sync.sh # make sure the python env has the right packages installed
find bazel-out/ -name '*.mod_scanner_decls.json*' -delete # get rid of old data files
bazel build --config=mod-scanner "//src/mongo/..."
python modules_poc/merge_decls.py
`merge_decls.py` takes an optional `--intra-module` flag if you want to include intra-module accesses and declarations that are only used from within their module. Typically you don't, so it defaults to omitting them.
If you only wish to include the files linked into a given executable, replace the `bazel build` command with the following commands:
TARGET="//src/mongo/db:mongod"
bazel cquery --config=mod-scanner "filter(//src/mongo, kind(cc_*, deps($TARGET)))" | awk '{print $1}' > targets.file
bazel build --config=mod-scanner --target_pattern_file=targets.file
Once you have produced a `merged_decls.json` file, you can browse it by running `modules_poc/browse.py`. It will show the available keybindings on the right, which can be toggled by pressing `?`. If you are running from a VSCode or neovim terminal, you can press `g` to go to any location in your editor. You can also press `p` to toggle an embedded preview of the location for the currently selected line (you probably want to hide the help when doing this). You can press Tab ↹ to switch between the tree and preview.
The browser is primarily intended to assist in labeling public APIs, so the files are sorted with the most unlabeled declarations ("unknowns") first. Only declarations that are used outside of their module are counted and shown. You can search for a file by pressing `f`, or press `m` to filter the files by module.
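The "most unknowns first" ordering can be approximated as below. This is a sketch rather than the browser's real logic, and it rests on two assumptions that the text does not confirm: that each declaration's `loc` has the form `path:line:col`, and that an unlabeled declaration has a `visibility` of `"UNKNOWN"`.

```python
from collections import Counter


def files_by_unknowns(decls):
    """Return (path, unknown_count) pairs, most unknowns first.

    Assumes (hypothetically) that decl["loc"] looks like "path:line:col"
    and that unlabeled declarations carry visibility "UNKNOWN".
    """
    counts = Counter()
    for decl in decls:
        path = decl["loc"].split(":", 1)[0]
        # Adding 0 still registers the file, so fully labeled files are listed too.
        counts[path] += 1 if decl.get("visibility", "UNKNOWN") == "UNKNOWN" else 0
    # Sort by descending count, then by path for a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))
```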
As an advanced feature, you can pass a custom file to `browse.py` and it will use it rather than the default `merged_decls.json`. It still needs to have the same shape as the original. This works best with `jq` to do advanced filtering. For example, here is a command that will only show declarations where some TUs will only see a forward declaration from another module, and will assume that that module is the owner (we need to fix this):
./modules_poc/browse.py <(jq '[.[] | select(.other_mods)]' merged_decls.json)
In general, your `jq` query should be of the form `[.[] | select( SOME QUERY )]` to avoid breaking the format expectations. For more advanced analysis, using `jq` directly is a good idea.
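If a filter is easier to express in Python than in `jq`, the same shape-preserving pattern can be sketched as follows. It assumes, based on the `[.[] | select( ... )]` form above, that the top level of `merged_decls.json` is a JSON array of declaration objects; `filter_decls` and `has_other_mods` are illustrative names, not part of the POC.

```python
import json


def filter_decls(decls, predicate):
    """Keep the top-level list shape, dropping decls that fail the predicate."""
    return [decl for decl in decls if predicate(decl)]


def has_other_mods(decl):
    """Mirror jq's select(.other_mods): only null and false count as falsy."""
    value = decl.get("other_mods")
    return value is not None and value is not False


def write_filtered(src_path, dst_path, predicate):
    """Read a decls file, filter it, and write a file with the same shape."""
    with open(src_path) as src:
        decls = json.load(src)
    with open(dst_path, "w") as dst:
        json.dump(filter_decls(decls, predicate), dst)
```

The file written by `write_filtered` can then be passed to `browse.py` in place of the default `merged_decls.json`.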
Run the following command to upload:
python modules_poc/upload.py $MONGO_URI # fill this in
If the upload fails with a connection error and you need to update the IP whitelist for your virtual workstation, `curl -4 wtfismyip.com/text` is a good way to see your public IP address.
You can also scan a single file, which is useful when iterating on this. You can either pass it the same flags used to compile, or pass it just a cpp file and it will figure out the flags from your `compile_commands.json`. It will create a file called `decls.yaml` in the current directory when run this way.
python modules_poc/mod_scanner.py src/mongo/bson/bsonobj.cpp
- Once we no longer have errant forward declarations in the wrong module, we can make the processing a lot faster by having the scanner only write out things that are used across modules (or in the `__NONE__` module if we still want to track that).
- We should explore whether using the indexing API (e.g. `clang_indexSourceFile`) will yield better results. In particular, there is a flag to opt in to visiting all implicit instantiations, which I think is currently a blind spot. Unfortunately it isn't exposed in the Python API yet, so we would need to add it there first.
- Other interesting options would be a clang-tidy plugin or a clang plugin. We already have a lot of infrastructure to support clang-tidy plugins, but they will ignore any lines with `// NOLINT` comments. A clang plugin is particularly interesting if it will be able to run inside of clangd so we can show warnings when accessing an unfortunately-public API across modules (we may want to mark it as deprecated in that case) and errors when accessing private.
- Parallelize the merge script. Right now it is single-threaded. We can use `multiprocessing` or similar to parallelize it. It should use a queue of input files and have workers merge them into a local `all_decls` map, and then the main thread should merge the results from each worker. If we just pass the path to the workers, this will also parallelize reading the files and parsing the json.
- We should try to report `loc` as the header declaring the entity, but right now it will report the cpp file where it is defined. This is currently important since we use the definition to decide the canonical location and module when merging. This may cause issues for free functions if the namespace is marked public in the header. The latter issue can be worked around when merging by using the visibility from other files if the current visibility is unknown. But we should pick the right location for `loc` regardless.
- Try to collapse template instantiations that clang's `specialized_template` helper fails to. I don't know why it fails to, but it seems to on many of our templates. Maybe we should just merge all declarations with the same `loc`? If we do that, we should try to prefer decls where `kind` is a template.
- Split "unfortunately public" into 3 categories:
  - `NEEDS_REPLACEMENT`: The current API isn't ideal for a public API, but consumers need its functionality. It is on the module owner to provide a better API.
  - `USE_INSTEAD("replacement")`: A replacement for this API has been provided that external consumers should switch to. It is on the module consumer to update their code.
  - `CURRENTLY_USED`: This is a marker we can put on code as we improve the scanner if it finds new usages of private APIs that were hidden in older versions. The module owner should examine these and decide if they should be public or marked for replacement.
- Browser enhancements:
  - Add some search functionality to the browser, at least for files.
  - Add a way to show a flat list of all decls in a file, rather than the nested view.
  - Include `spelling` in the output from the scanner, and use that to highlight that part of the decl in the list.
  - Have the scanner include the source ranges (extents) of each usage so that we can highlight correctly in the viewer. It should also include the name of the entity performing the usage. Probably best to use something like `pretty_location()` but with `spelling` rather than `display_name`.
  - Make it easier to filter and sort the decls for exploration. One way would be to use jq expressions. Need to be careful that the data is still in the original shape. Could take two expressions, one that goes into a `select()` and another that goes into a `sort_by()`. For now you can do `browse.py <(jq '[.[] | select( ... )]' merged_decls.json)`, but we should come up with something better.
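As a starting point for the "parallelize the merge script" item in the list above, here is one possible shape for it. It is a sketch under several assumptions, not the existing `merge_decls.py` logic: each input file is assumed to be plain JSON containing a list of declaration objects, declarations are keyed on `loc`, and the per-declaration merge rule (concatenating a hypothetical `usages` list) is a placeholder for the real one. It also uses `Pool.imap_unordered` over chunks of paths instead of an explicit queue.

```python
import json
from multiprocessing import Pool


def merge_into(all_decls, decl):
    """Placeholder merge rule (hypothetical): key on `loc`, concatenate `usages`."""
    existing = all_decls.setdefault(decl["loc"], decl)
    if existing is not decl:
        existing.setdefault("usages", []).extend(decl.get("usages", []))


def merge_files(paths):
    """Worker: read, parse, and merge a chunk of files into a local map."""
    local_decls = {}
    for path in paths:
        with open(path) as f:
            for decl in json.load(f):
                merge_into(local_decls, decl)
    return local_decls


def merge_maps(maps):
    """Main process: fold the workers' local maps into one map."""
    all_decls = {}
    for local_decls in maps:
        for decl in local_decls.values():
            merge_into(all_decls, decl)
    return all_decls


def parallel_merge(paths, workers=4, chunk_size=16):
    """Only paths are sent to workers, so file reading and JSON parsing
    happen in parallel as well."""
    chunks = [paths[i:i + chunk_size] for i in range(0, len(paths), chunk_size)]
    with Pool(workers) as pool:
        return merge_maps(pool.imap_unordered(merge_files, chunks))
```

On platforms where `multiprocessing` starts workers with spawn rather than fork, `parallel_merge` should be called from under an `if __name__ == "__main__":` guard. Whether this is a net win depends on how large the local maps are, since each one is pickled back to the main process.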