One way to remove bots that were not excluded by the blacklist is to set a hard size threshold: if the number of unique names in an identity exceeds, say, 100, we do something, e.g. drop it completely or split it.
This issue is about plotting how our quality metrics depend on the size threshold at which we drop.
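To make the threshold idea concrete, here is a minimal sketch of the drop/split step. The function name, the `drop` flag, and the toy data are illustrative, not part of the actual pipeline:

```python
def apply_size_threshold(identities, max_names=100, drop=True):
    """Apply a hard size threshold to merged identities.

    Identities whose set of unique names exceeds `max_names` are assumed
    to be bots (or merge errors) and are either dropped entirely or
    split back into one singleton identity per name.
    """
    result = []
    for names in identities:
        unique = set(names)
        if len(unique) <= max_names:
            result.append(unique)
        elif not drop:
            # split: every name becomes its own identity again
            result.extend({name} for name in unique)
        # drop=True: the oversized identity is discarded entirely
    return result


# Toy example: one normal identity and one suspiciously large one.
identities = [{"alice", "alice w"}, {f"bot-{i}" for i in range(150)}]
print(len(apply_size_threshold(identities, max_names=100)))             # drops the big one
print(len(apply_size_threshold(identities, max_names=100, drop=False)))  # splits it instead
```

Plotting precision/recall against `max_names` then tells us which of the two policies loses less signal.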
1. Collect `repository_id`, `commit_hash`, `commit_author_name`, and `commit_author_email` using gitbase and a Python client on 22 open-source stacks.
2. Iterate through the 2019-05-01 GHTorrent dump and map every commit hash to the GHTorrent id of its author.
3. For each org, create a CSV file with `repository_id`, `author_email`, `author_name`, `author_id`.
4. For each org, create 10 different identity matching tables in Parquet format by running `match-identities` with the `--cache` option pointing to the previous CSV file (with the `author_id` column dropped), using 10 different values for the `MaxIdentities` parameter: [1, 5, 10, 20, 30, 40, 50, 100, 200, 500].
5. For each org and each identity table generated (so 22×10), build 2 identity graphs: (1) from the GHTorrent identity mapping, (2) from our own identity matching.
6. Compute precision and recall using the following definitions of false positives and false negatives.
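The last step can be sketched as pairwise precision/recall between the two clusterings: a false positive is a pair of records merged by our matching but not by GHTorrent, and a false negative is a pair merged by GHTorrent but not by us. The function names and toy clusters below are illustrative, not taken from the evaluation code:

```python
from itertools import combinations


def merged_pairs(clusters):
    """All unordered pairs of records that share a cluster."""
    return {frozenset(p) for c in clusters for p in combinations(sorted(c), 2)}


def precision_recall(predicted, truth):
    """Pairwise precision/recall between two identity clusterings.

    False positive: pair merged in `predicted` but not in `truth`.
    False negative: pair merged in `truth` but not in `predicted`.
    """
    pred_pairs, true_pairs = merged_pairs(predicted), merged_pairs(truth)
    tp = len(pred_pairs & true_pairs)
    precision = tp / len(pred_pairs) if pred_pairs else 1.0
    recall = tp / len(true_pairs) if true_pairs else 1.0
    return precision, recall


# Toy example: we merged "c" into the {a, b} identity by mistake.
pred = [{"a", "b", "c"}, {"d"}]
truth = [{"a", "b"}, {"c"}, {"d"}]
p, r = precision_recall(pred, truth)  # precision 1/3, recall 1.0
```

Running this for each of the 22×10 (org, `MaxIdentities`) combinations yields the points for the quality-vs-threshold plot.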