
Find a way to distinguish regular users from bots #9

Open
zurk opened this issue Jul 8, 2019 · 7 comments
Labels: enhancement (New feature or request)

zurk commented Jul 8, 2019

We can take a rule-based approach as a benchmark: the email contains the word "bot" or "no-reply". However, there are emails like tensorflow-gardener@tensorflow.org that are hard to catch this way, so some ML should be applied to find them. Commit time-series features can be used.
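
A minimal sketch of such a rule-based benchmark, assuming identities come as (name, email) pairs; the pattern is illustrative only, not the one adopted later in this thread:

    import re

    # Hypothetical baseline: flag identities whose email contains "bot" or "no-reply".
    BOT_EMAIL_RE = re.compile(r"bot|no-?reply", re.IGNORECASE)

    def looks_like_bot(name: str, email: str) -> bool:
        """Rule-based benchmark; misses cases like tensorflow-gardener@tensorflow.org."""
        return bool(BOT_EMAIL_RE.search(email))

    print(looks_like_bot("cf mega bot", "cf-mega-bot@noreply.github.com"))   # True
    print(looks_like_bot("gardener", "tensorflow-gardener@tensorflow.org"))  # False, a miss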

zurk added the enhancement (New feature or request) label on Jul 8, 2019

zurk commented Jul 22, 2019

@warenlg built an MVP of this feature (https://src-d.slack.com/archives/C7USX021L/p1563778058004300):

Why don't we remove bots, CI, automated stuff, etc. from the identity matching table with a simple regexp? Right now, I might have 10% bots in the Cloud Foundry identities I'm working with for the demo, e.g.,

["log cache ci", "metric store ci", "loggregator ci",
                  "pivotal publication toolsmiths", "cf-infra-bot",
                  "cloud foundry buildpacks team robot",
                 "garden windows", "final release builder",
                 "pipeline", "flintstone ci", "capi ci",
                  "container networking bot", "cf mega bot",
                 "routing-ci", "cf bpm", "uaa identity bot",
                 "pcf backup & restore ci",
                 "ci bot", "cfcr ci bot", "cfcr"]

I removed 2.5k rows out of 15k in total by excluding name identities matching [^a-zA-Z]ci[^a-zA-Z]|bot$|pipeline|release|routing
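
A minimal sketch of that filter, assuming the identities sit in a pandas DataFrame with a "name" column (the column name is an assumption):

    import pandas as pd

    # Hypothetical input: one row per identity, with a "name" column.
    identities = pd.DataFrame({"name": ["routing-ci", "cf mega bot", "jane doe"]})

    pattern = r"[^a-zA-Z]ci[^a-zA-Z]|bot$|pipeline|release|routing"
    is_bot = identities["name"].str.contains(pattern, regex=True)

    kept = identities[~is_bot]       # regular users
    removed = identities[is_bot]     # candidate bots, worth a manual review
    print(removed["name"].tolist())  # ['routing-ci', 'cf mega bot']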

@EgorBu as discussed, I am assigning this issue to you.

Regardless of the approach you choose, please create a list of the filtered bots so we can review them manually and verify that we do not filter out anything unrelated.


warenlg commented Jul 22, 2019

Thanks K for filing the issue


vmarkovtsev commented Jul 25, 2019

Related to #30


EgorBu commented Jul 31, 2019

Current pattern: r"[^a-zA-Z|]ci\W|[\s-]ci\W|ci[\s-]|[\s-]ci[\s-]|bot$|pipeline|release|routing"

Problems with the regexp:

("cici jiayi shen", "jiayis.18@intl.zju.edu.cn"),
("daniel adrian bohbot", "daniel.bohbot@gmail.com"),
("horaci macias", "hmacias@avaya.com"),
("melvindebot", "44030121+melvindebot@users.noreply.github.com"),
("daniel obot", "danobot@hotmail.com")

Some French and Chinese names/surnames may look like bots to the regexp.
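
A quick check of these false positives against the current pattern (a sketch; the names are taken from the list above):

    import re

    PATTERN = re.compile(
        r"[^a-zA-Z|]ci\W|[\s-]ci\W|ci[\s-]|[\s-]ci[\s-]|bot$|pipeline|release|routing")

    # All of these are real people, yet every one of them matches the pattern.
    false_positives = ["cici jiayi shen", "daniel adrian bohbot", "horaci macias",
                       "melvindebot", "daniel obot"]
    for name in false_positives:
        print(name, "->", bool(PATTERN.search(name)))  # True for each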

vmarkovtsev commented

@EgorBu Regarding French and Chinese names, GitHub profiles often contain the country code. You can take the "users" table from GHTorrent and remove the "bots" that have any country assigned.
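
A minimal sketch of that cleanup, assuming a CSV dump of the GHTorrent "users" table with "login" and "country_code" columns (the file name and exact column names are assumptions):

    import pandas as pd

    # Hypothetical dump of the GHTorrent "users" table.
    users = pd.read_csv("ghtorrent_users.csv", usecols=["login", "country_code"])

    # A profile with a country assigned is a strong hint there is a human behind it.
    has_country = users["country_code"].notna()
    humans = set(users.loc[has_country, "login"].str.lower())

    # bot_candidates is the output of the regexp filter (hypothetical variable).
    bot_candidates = ["routing-ci", "cici-jiayi-shen"]
    cleaned = [b for b in bot_candidates if b.lower() not in humans]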


EgorBu commented Aug 1, 2019

Ideas:

  • use a regexp to find highly probable bots (19k found in 1300M rows of author.date, author.email, author.name, committer.date, committer.email, committer.name)
  • calculate the author/committer fraction; its distributions for regular users and bots may differ
  • contribution activity (times, counts, repositories); its distributions for regular users and bots may differ
  • entropy of commit messages, the idea being that bots heavily reuse a few message patterns (see the sketch after this list)
  • overlap between the name and the repository contributed to most
  • a pretrained (or trained on our dataset) NN model to extract message embeddings + clustering of the messages; if a user's messages always come from one, two, or three clusters, it could be a signal of a bot
  • a pretrained (or trained on our dataset) NN model to extract email/name embeddings + classification/clustering; it could be a good approach because we have quite a lot of bot names
  • use the statistical features, messages, and emails/names as input for an NN to make embeddings (triplet loss to pull bot embeddings closer to each other) + k-nearest-neighbor search / classification
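
A minimal sketch of the commit-message entropy feature mentioned above, measured over a user's message strings (the token-level formulation is an assumption):

    import math
    from collections import Counter

    def message_entropy(messages: list[str]) -> float:
        """Shannon entropy of the token distribution across a user's commit messages.

        Bots that reuse a few templates ("update dependency X") score low;
        humans writing varied messages score high.
        """
        tokens = [tok for msg in messages for tok in msg.lower().split()]
        counts = Counter(tokens)
        total = sum(counts.values())
        return -sum(c / total * math.log2(c / total) for c in counts.values())

    bot_msgs = ["update dependency foo", "update dependency bar"] * 50
    human_msgs = ["fix race in scheduler", "refactor config loading", "add retry logic"]
    print(message_entropy(bot_msgs) < message_entropy(human_msgs))  # True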

Updates:

  • launched the pipeline to extract statistics for bots; it is slow (should take ~50 hours)
  • downloaded the message dataset; reading about entropy measurements and other possible approaches
  • reading and thinking about ideas, coding

Next steps:

  • I will rewrite the pipeline to use Spark, since the task matches the map-reduce paradigm (see the sketch after this list)
  • re-save the datasets as Parquet/CSV
  • launch the pipeline for statistics
  • launch the pipeline for entropy
  • overlap between the name and the repository contributed to most
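
A minimal sketch of the Spark rewrite for the statistics step, assuming one commit per row with flattened author_email / author_date / repository columns (the column names and paths are assumptions):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("bot-stats").getOrCreate()

    # Hypothetical input: one row per commit.
    commits = spark.read.parquet("commits.parquet")

    # Map-reduce shape: group by identity, aggregate activity statistics.
    stats = (commits
             .groupBy("author_email")
             .agg(F.count("*").alias("n_commits"),
                  F.countDistinct("repository").alias("n_repos"),
                  F.min("author_date").alias("first_commit"),
                  F.max("author_date").alias("last_commit")))

    stats.write.mode("overwrite").parquet("author_stats.parquet")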


EgorBu commented Oct 11, 2019

There are several problems that may affect the quality:

  1. Noisy labels
    • false positives from the regexp, like "abbot", "julia jenkins", and so on
    • false negatives: undetected bots (gardener@tensorflow, for example)
  2. The model input doesn't contain the information required to make a correct prediction
    • false negatives: the email doesn't contain bot-related info but the name does, e.g. egor@bla.ru / "Egor's bot for deployment"
  3. The name doesn't contain the information required to label it as a bot
    • false positives: the email contains bot-related info but the name doesn't, e.g. egors-bot-deploy@bla.ru / "Egorka", so it will be labeled as not a bot while the email tells us it is one
  4. Metrics. Deduplication:
    • deduplication is done by several fields; if the repository name is included, the resulting quality can be found here: https://gist.github.com/EgorBu/a333409dfc12f89ac5fa1dc71461a3c0
    • it is higher than the current one, which probably means that standard bot names are much more frequent, and in most cases standard names are detected with high quality
  5. Metrics. Usage:
    • we still don't have a clear understanding of how the model should be applied (for each commit, for each identity, etc.); the metrics should be selected based on usage
  6. Dataset:
    • another possible reason that the quality was higher here is some dataset issues

Hypotheses to check

  • metrics: clarify how to measure quality
  • Dataset (see the sketch after this list)
    • select a row in the dataset
    • split the dataset into 2 parts, before that row and after it
    • assign labels (0 for before, 1 for after)
    • train a classifier; if the quality is better than random, something is fishy with the dataset
  • false positives (the email contains bot-related info but the name doesn't) and false negatives (the email doesn't contain bot-related info but the name does)
    • labels & predictions should be computed
    • extract features separately from names & emails
    • find nearest neighbors by name
    • find nearest neighbors by email
    • several situations are possible:
      • labels & predictions are the same among the nearest neighbors by name & email: perfect
      • labels among the nearest neighbors by name are not the same: possible regexp mistakes?
      • predictions are not the same among the nearest neighbors by email: check it
      • labels & predictions are not the same among the nearest neighbors by name & email: possible regexp mistake?
  • the model overfits to regexp mistakes
    • hypothesis: the number of mistakes is not that big
    • train several models on different chunks of the data, which reduces the number of mistakes in each chunk
    • make the models vote when predicting
    • focus on samples where predictions and labels differ
    • focus on samples where the models' predictions differ
  • features are not good enough
    • BPE could split "abot" into [a, bot], which makes it almost impossible for the model to differentiate one class from the other
      • use a token splitter to split victor@abot.fr into [victor, abot, fr]
      • add a feature that highlights whether something is in the exception list
      • don't extract BPE features from exceptions
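
A minimal sketch of the dataset check from the "Dataset" bullet above, a before/after split test on hashed name+email features (the featurization and the threshold are assumptions; only the split-and-train idea comes from the list):

    from sklearn.feature_extraction.text import HashingVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    def dataset_is_fishy(rows: list[str], split_at: int) -> bool:
        """rows are "name email" strings in their original dataset order.

        Label each row by whether it comes before or after split_at and try to
        predict that label. Far-above-random quality means the two halves differ
        systematically, i.e. something is fishy with the dataset ordering.
        """
        X = HashingVectorizer(analyzer="char_wb", ngram_range=(2, 4),
                              n_features=2**16).transform(rows)
        y = [0 if i < split_at else 1 for i in range(len(rows))]
        score = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                                cv=5, scoring="roc_auc").mean()
        return score > 0.6  # the threshold is an arbitrary assumption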

Papers:
