
Study how the quality depends on the hard identity size limit #30

Closed
vmarkovtsev opened this issue Jul 25, 2019 · 1 comment

@vmarkovtsev
Collaborator

One way to erase bots that were not excluded by the blacklist is to set a hard size threshold: if the number of unique names in an identity is bigger than, say, 100, we do something with it, e.g. drop it completely or split it.

This issue is about plotting how our quality metrics depend on the size threshold when we drop.
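A minimal sketch of what the hard limit could look like, assuming each identity is represented as a set of (name, email) pairs; the function name and the default threshold are illustrative, not taken from the codebase:

```python
# Minimal sketch of a hard identity size limit; the representation (a list of
# identities, each a set of (name, email) pairs) and the names are illustrative.

def apply_hard_limit(identities, max_unique_names=100, drop=True):
    """Separate identities that merged more unique names than the threshold."""
    kept, oversized = [], []
    for identity in identities:
        unique_names = {name for name, _email in identity}
        if len(unique_names) > max_unique_names:
            oversized.append(identity)  # likely a bot or an over-merged cluster
        else:
            kept.append(identity)
    # Either drop the oversized identities completely or return them for splitting.
    return kept if drop else (kept, oversized)
```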

@warenlg
Contributor

warenlg commented Aug 27, 2019

I completed this task with the following steps:

  1. Collect repository_id, commit_hash, commit_author_name, and commit_author_email with gitbase and a Python client over 22 open-source stacks.
  2. Iterate through the 2019-05-01 GHTorrent dump and map every commit hash to the GHTorrent id of its author.
  3. For each org, create a CSV file with repository_id, author_email, author_name, and author_id.
  4. For each org, create 10 different identity matching tables in Parquet format by running match-identities with the --cache option pointing to the previous CSV file (with the author_id column dropped), using 10 different values of the MaxIdentities parameter: [1, 5, 10, 20, 30, 40, 50, 100, 200, 500].
  5. For each org and each generated identity table (so 22×10), build two identity graphs: (1) from the GHTorrent identity mapping and (2) from our own identity matching.
  6. Compute precision and recall using the following definitions of false positives and false negatives (a sketch is given after this list).
    • FP = set(pred_graph.edges) - set.intersection(set(pred_graph.edges), set(ght_graph.edges))
    • FN = set(ght_graph.edges) - set.intersection(set(pred_graph.edges), set(ght_graph.edges))
  7. Plot the precision and recall curves as a function of MaxIdentities for each org.
  8. Conclude that MaxIdentities=20 is a good trade-off.
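A minimal sketch of steps 5–6, assuming each identity table has been loaded as a dict mapping an author key (e.g. name plus email) to the id of the identity it was merged into; networkx, the function names, and the data layout here are illustrative, not the exact code used for the experiment:

```python
# Sketch of building the identity graphs and computing precision/recall
# from the FP/FN definitions above. Assumed input: {author_key: identity_id}.
from itertools import combinations

import networkx as nx


def build_identity_graph(author_to_identity):
    """Connect every pair of author keys that were merged into the same identity."""
    graph = nx.Graph()
    by_identity = {}
    for author, identity_id in author_to_identity.items():
        graph.add_node(author)
        by_identity.setdefault(identity_id, []).append(author)
    for authors in by_identity.values():
        graph.add_edges_from(combinations(sorted(authors), 2))
    return graph


def precision_recall(pred_graph, ght_graph):
    """Precision and recall over graph edges, following the definitions above."""
    pred_edges = {tuple(sorted(e)) for e in pred_graph.edges}
    true_edges = {tuple(sorted(e)) for e in ght_graph.edges}
    tp = pred_edges & true_edges
    fp = pred_edges - true_edges
    fn = true_edges - pred_edges
    precision = len(tp) / (len(tp) + len(fp)) if pred_edges else 0.0
    recall = len(tp) / (len(tp) + len(fn)) if true_edges else 0.0
    return precision, recall
```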

(Figure: idmatching_pr_curves — precision and recall vs. MaxIdentities for each org)
