Find a way to distinguish regular users from bots #9
Comments
@warenlg did an MVP of this feature (https://src-d.slack.com/archives/C7USX021L/p1563778058004300): Why don't we remove bots, CI automation, etc. from the identity matching table with a simple regexp? Right now, I might have 10% bots in the cloudfoundry identities I'm working with for the demo, e.g.,
I removed 2.5k rows out of 15k in total by excluding name identities matching
@EgorBu, as discussed, I am assigning this issue to you. Whichever approach you choose, please create a list of the filtered bots so we can also review them by eye and check that we do not filter anything unrelated.
Thanks K for filing the issue
Related to #30
Current pattern:
Problems with regexp:
Some French and Chinese names/surnames may look like bots to the regexp.
@EgorBu Regarding French and Chinese names: GitHub profiles often contain the country code. You can take the "users" table from GHTorrent and drop from the "bots" list any entry that has a country assigned.
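A minimal sketch of that country check, assuming the regexp candidates are available as a list of logins and the GHTorrent "users" rows as dictionaries. The function name `drop_located_users` and the column names `login` and `country_code` are assumptions for illustration and should be checked against the actual dump.

```python
def drop_located_users(candidates, users):
    """Split regexp-flagged logins by whether the profile has a country.

    candidates: iterable of logins flagged as bots by the regexp.
    users: iterable of dicts with 'login' and 'country_code' keys
           (assumed layout of the GHTorrent "users" table).
    Returns (kept, released): logins still treated as bots, and logins
    released because a country is assigned to the profile.
    """
    country_by_login = {u["login"]: u.get("country_code") for u in users}
    kept, released = [], []
    for login in candidates:
        if country_by_login.get(login):  # non-empty country: likely a person
            released.append(login)
        else:  # no country, or login not present in the table
            kept.append(login)
    return kept, released
```

Returning the released logins separately keeps a reviewable list of everything the rule overrides.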
Ideas:
Updates:
Next steps:
There are several problems that may affect the quality:
Hypotheses to check:
Papers:
We can take a rule-based approach as a benchmark: the email contains the word `bot` or `no-reply`. However, there are emails like `tensorflow-gardener@tensorflow.org` that are hard to find this way, so some ML should be applied to find them. Commit-time-series features can be used.
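The rule-based benchmark could look roughly like the sketch below. It assumes that "bot" must appear as a separate token (so a surname such as "abbott" is not flagged) and that `noreply` counts the same as `no-reply`. Both choices, the function names, and the example addresses other than `tensorflow-gardener@tensorflow.org` are assumptions, not the pattern currently in use.

```python
import re

# Assumption: "bot" counts only as a standalone token, not inside a longer word.
BOT_TOKEN = re.compile(r"(?:^|[^a-z0-9])bot(?:[^a-z0-9]|$)")
# Assumption: accept both "no-reply" and "noreply".
NO_REPLY = re.compile(r"no-?reply")


def looks_like_bot(email: str) -> bool:
    """Return True if the email matches the rule-based benchmark."""
    lowered = email.lower()
    return bool(BOT_TOKEN.search(lowered) or NO_REPLY.search(lowered))


def split_identities(emails):
    """Return (bots, humans) so the filtered list can be reviewed by eye."""
    bots = [e for e in emails if looks_like_bot(e)]
    humans = [e for e in emails if not looks_like_bot(e)]
    return bots, humans


if __name__ == "__main__":
    bots, humans = split_identities([
        "ci-bot@example.com",                   # hypothetical address
        "someone@users.noreply.github.com",     # hypothetical address
        "abbott@example.com",                   # hypothetical address
        "tensorflow-gardener@tensorflow.org",   # missed by the rules
    ])
    print("bots:", bots)
    print("humans:", humans)
```

The rules do not catch `tensorflow-gardener@tensorflow.org`, which is the kind of miss that motivates the ML approach.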