
Study how the quality depends on the hard identity size limit #30

Closed
vmarkovtsev opened this issue Jul 25, 2019 · 1 comment

@vmarkovtsev
Collaborator

One way to erase bots that were not excluded by the blacklist is to set a hard size threshold: if the number of unique names in an identity is bigger than, say, 100, we do something with it, e.g. drop it completely or split it.

This issue is about plotting how our quality metrics depend on the size threshold when we drop.
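A minimal sketch of what the hard limit could look like, assuming each identity is represented as a set of (name, email) pairs; the function name and the default threshold are illustrative, not taken from the codebase:

```python
# Minimal sketch of a hard identity size limit; the representation (a list of
# identities, each a set of (name, email) pairs) and the names are illustrative.

def apply_hard_limit(identities, max_unique_names=100, drop=True):
    """Separate identities that merged more unique names than the threshold."""
    kept, oversized = [], []
    for identity in identities:
        unique_names = {name for name, _email in identity}
        if len(unique_names) > max_unique_names:
            oversized.append(identity)  # likely a bot or an over-merged cluster
        else:
            kept.append(identity)
    # Either drop the oversized identities completely or return them for splitting.
    return kept if drop else (kept, oversized)
```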

@warenlg
Contributor

warenlg commented Aug 27, 2019

I completed this task with the following steps:

  1. Collect repository_id, commit_hash, commit_author_name, and commit_author_email with gitbase and a Python client over 22 open-source stacks.
  2. Iterate through the 2019-05-01 GHTorrent dump and map every commit hash to the GHTorrent id of its author.
  3. For each org, create a CSV file with repository_id, author_email, author_name, and author_id.
  4. For each org, create 10 different identity matching tables in Parquet format by running match-identities with the --cache option pointing to the previous CSV file (with the author_id column dropped), using 10 different values of the MaxIdentities parameter: [1, 5, 10, 20, 30, 40, 50, 100, 200, 500].
  5. For each org and each generated identity table (so 22×10), build two identity graphs: (1) from the GHTorrent identity mapping and (2) from our own identity matching.
  6. Compute precision and recall using the following definitions of false positives and false negatives (a sketch is given after this list).
    • FP = set(pred_graph.edges) - set.intersection(set(pred_graph.edges), set(ght_graph.edges))
    • FN = set(ght_graph.edges) - set.intersection(set(pred_graph.edges), set(ght_graph.edges))
  7. Plot the precision and recall curves as a function of MaxIdentities for each org.
  8. Conclude that MaxIdentities=20 is a good trade-off.
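A minimal sketch of steps 5–6, assuming each identity table has been loaded as a dict mapping an author key (e.g. name plus email) to the id of the identity it was merged into; networkx, the function names, and the data layout here are illustrative, not the exact code used for the experiment:

```python
# Sketch of building the identity graphs and computing precision/recall
# from the FP/FN definitions above. Assumed input: {author_key: identity_id}.
from itertools import combinations

import networkx as nx


def build_identity_graph(author_to_identity):
    """Connect every pair of author keys that were merged into the same identity."""
    graph = nx.Graph()
    by_identity = {}
    for author, identity_id in author_to_identity.items():
        graph.add_node(author)
        by_identity.setdefault(identity_id, []).append(author)
    for authors in by_identity.values():
        graph.add_edges_from(combinations(sorted(authors), 2))
    return graph


def precision_recall(pred_graph, ght_graph):
    """Precision and recall over graph edges, following the definitions above."""
    pred_edges = {tuple(sorted(e)) for e in pred_graph.edges}
    true_edges = {tuple(sorted(e)) for e in ght_graph.edges}
    tp = pred_edges & true_edges
    fp = pred_edges - true_edges
    fn = true_edges - pred_edges
    precision = len(tp) / (len(tp) + len(fp)) if pred_edges else 0.0
    recall = len(tp) / (len(tp) + len(fn)) if true_edges else 0.0
    return precision, recall
```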

(Figure: idmatching_pr_curves — precision and recall vs. MaxIdentities for each org)
