
Find a way to distinguish regular users from bots #9

Open
zurk opened this issue Jul 8, 2019 · 7 comments
Labels: enhancement (New feature or request)

zurk commented Jul 8, 2019

We can take a rule-based approach as a benchmark: the email contains the word "bot" or "no-reply". However, there are emails like tensorflow-gardener@tensorflow.org that are hard to catch this way, so some ML should be applied to find them. Commit time-series features can be used.
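
A minimal sketch of such a rule-based benchmark, assuming identities come as (name, email) pairs; the pattern is illustrative only, not the one adopted later in this thread:

    import re

    # Hypothetical baseline: flag identities whose email contains "bot" or "no-reply".
    BOT_EMAIL_RE = re.compile(r"bot|no-?reply", re.IGNORECASE)

    def looks_like_bot(name: str, email: str) -> bool:
        """Rule-based benchmark; misses cases like tensorflow-gardener@tensorflow.org."""
        return bool(BOT_EMAIL_RE.search(email))

    print(looks_like_bot("cf mega bot", "cf-mega-bot@noreply.github.com"))   # True
    print(looks_like_bot("gardener", "tensorflow-gardener@tensorflow.org"))  # False, a miss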

zurk added the enhancement (New feature or request) label on Jul 8, 2019

zurk commented Jul 22, 2019

@warenlg built an MVP of this feature (https://src-d.slack.com/archives/C7USX021L/p1563778058004300):

Why don't we remove bots, CI, automated stuff, etc. from the identity matching table with a simple regexp? Right now, I might have 10% bots in the Cloud Foundry identities I'm working with for the demo, e.g.,

["log cache ci", "metric store ci", "loggregator ci",
                  "pivotal publication toolsmiths", "cf-infra-bot",
                  "cloud foundry buildpacks team robot",
                 "garden windows", "final release builder",
                 "pipeline", "flintstone ci", "capi ci",
                  "container networking bot", "cf mega bot",
                 "routing-ci", "cf bpm", "uaa identity bot",
                 "pcf backup & restore ci",
                 "ci bot", "cfcr ci bot", "cfcr"]

I removed 2.5k rows out of 15k in total by excluding name identities matching [^a-zA-Z]ci[^a-zA-Z]|bot$|pipeline|release|routing
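
A minimal sketch of that filter, assuming the identities sit in a pandas DataFrame with a "name" column (the column name is an assumption):

    import pandas as pd

    # Hypothetical input: one row per identity, with a "name" column.
    identities = pd.DataFrame({"name": ["routing-ci", "cf mega bot", "jane doe"]})

    pattern = r"[^a-zA-Z]ci[^a-zA-Z]|bot$|pipeline|release|routing"
    is_bot = identities["name"].str.contains(pattern, regex=True)

    kept = identities[~is_bot]       # regular users
    removed = identities[is_bot]     # candidate bots, worth a manual review
    print(removed["name"].tolist())  # ['routing-ci', 'cf mega bot']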

@EgorBu as discussed, I am assigning this issue to you.

Regardless of the approach you choose, please create a list of the filtered bots so we can review them manually and verify that we do not filter out anything unrelated.


warenlg commented Jul 22, 2019

Thanks K for filing the issue


vmarkovtsev commented Jul 25, 2019

Related to #30


EgorBu commented Jul 31, 2019

Current pattern: r"[^a-zA-Z|]ci\W|[\s-]ci\W|ci[\s-]|[\s-]ci[\s-]|bot$|pipeline|release|routing"

Problems with the regexp:

("cici jiayi shen", "jiayis.18@intl.zju.edu.cn"),
("daniel adrian bohbot", "daniel.bohbot@gmail.com"),
("horaci macias", "hmacias@avaya.com"),
("melvindebot", "44030121+melvindebot@users.noreply.github.com"),
("daniel obot", "danobot@hotmail.com")

Some French and Chinese names/surnames may look like bots to the regexp.
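
A quick check of these false positives against the current pattern (a sketch; the names are taken from the list above):

    import re

    PATTERN = re.compile(
        r"[^a-zA-Z|]ci\W|[\s-]ci\W|ci[\s-]|[\s-]ci[\s-]|bot$|pipeline|release|routing")

    # All of these are real people, yet every one of them matches the pattern.
    false_positives = ["cici jiayi shen", "daniel adrian bohbot", "horaci macias",
                       "melvindebot", "daniel obot"]
    for name in false_positives:
        print(name, "->", bool(PATTERN.search(name)))  # True for each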

vmarkovtsev commented

@EgorBu Regarding French and Chinese names, GitHub profiles often contain the country code. You can take the "users" table from GHTorrent and remove the "bots" that have any country assigned.
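
A minimal sketch of that cleanup, assuming a CSV dump of the GHTorrent "users" table with "login" and "country_code" columns (the file name and exact column names are assumptions):

    import pandas as pd

    # Hypothetical dump of the GHTorrent "users" table.
    users = pd.read_csv("ghtorrent_users.csv", usecols=["login", "country_code"])

    # A profile with a country assigned is a strong hint there is a human behind it.
    has_country = users["country_code"].notna()
    humans = set(users.loc[has_country, "login"].str.lower())

    # bot_candidates is the output of the regexp filter (hypothetical variable).
    bot_candidates = ["routing-ci", "cici-jiayi-shen"]
    cleaned = [b for b in bot_candidates if b.lower() not in humans]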


EgorBu commented Aug 1, 2019

Ideas:

  • use a regexp to find highly probable bots (19k found in 1300M rows of author.date, author.email, author.name, committer.date, committer.email, committer.name)
  • calculate the author/committer fraction; its distributions for regular users and bots may differ
  • contribution activity (times, counts, repositories); its distributions for regular users and bots may differ
  • entropy of commit messages, the idea being that bots heavily reuse a few message patterns (see the sketch after this list)
  • overlap between the name and the repository contributed to most
  • a pretrained (or trained on our dataset) NN model to extract message embeddings + clustering of the messages; if a user's messages always come from one, two, or three clusters, it could be a signal of a bot
  • a pretrained (or trained on our dataset) NN model to extract email/name embeddings + classification/clustering; it could be a good approach because we have quite a lot of bot names
  • use the statistical features, messages, and emails/names as input for an NN to make embeddings (triplet loss to pull bot embeddings closer to each other) + k-nearest-neighbor search / classification
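
A minimal sketch of the commit-message entropy feature mentioned above, measured over a user's message strings (the token-level formulation is an assumption):

    import math
    from collections import Counter

    def message_entropy(messages: list[str]) -> float:
        """Shannon entropy of the token distribution across a user's commit messages.

        Bots that reuse a few templates ("update dependency X") score low;
        humans writing varied messages score high.
        """
        tokens = [tok for msg in messages for tok in msg.lower().split()]
        counts = Counter(tokens)
        total = sum(counts.values())
        return -sum(c / total * math.log2(c / total) for c in counts.values())

    bot_msgs = ["update dependency foo", "update dependency bar"] * 50
    human_msgs = ["fix race in scheduler", "refactor config loading", "add retry logic"]
    print(message_entropy(bot_msgs) < message_entropy(human_msgs))  # True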

Updates:

  • launched the pipeline to extract statistics for bots; it is slow (should take ~50 hours)
  • downloaded the message dataset; reading about entropy measurements and other possible approaches
  • reading and thinking about ideas, coding

Next steps:

  • I will rewrite the pipeline to use Spark, since the task matches the map-reduce paradigm (see the sketch after this list)
  • re-save the datasets as Parquet/CSV
  • launch the pipeline for statistics
  • launch the pipeline for entropy
  • overlap between the name and the repository contributed to most
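
A minimal sketch of the Spark rewrite for the statistics step, assuming one commit per row with flattened author_email / author_date / repository columns (the column names and paths are assumptions):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("bot-stats").getOrCreate()

    # Hypothetical input: one row per commit.
    commits = spark.read.parquet("commits.parquet")

    # Map-reduce shape: group by identity, aggregate activity statistics.
    stats = (commits
             .groupBy("author_email")
             .agg(F.count("*").alias("n_commits"),
                  F.countDistinct("repository").alias("n_repos"),
                  F.min("author_date").alias("first_commit"),
                  F.max("author_date").alias("last_commit")))

    stats.write.mode("overwrite").parquet("author_stats.parquet")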


EgorBu commented Oct 11, 2019

There are several problems that may affect the quality:

  1. Noisy labels
    • false positives from the regexp, like "abbot", "julia jenkins", and so on
    • false negatives: undetected bots (gardener@tensorflow, for example)
  2. The model input doesn't contain the information required to make a correct prediction
    • false negatives: the email doesn't contain bot-related info but the name does, e.g. egor@bla.ru / "Egor's bot for deployment"
  3. The name doesn't contain the information required to label it as a bot
    • false positives: the email contains bot-related info but the name doesn't, e.g. egors-bot-deploy@bla.ru / "Egorka", so it will be labeled as not a bot while the email tells us it is one
  4. Metrics. Deduplication:
    • deduplication is done by several fields; if the repository name is included, the resulting quality can be found here: https://gist.github.com/EgorBu/a333409dfc12f89ac5fa1dc71461a3c0
    • it is higher than the current one, which probably means that standard bot names are much more frequent, and in most cases standard names are detected with high quality
  5. Metrics. Usage:
    • we still don't have a clear understanding of how the model should be applied (for each commit, for each identity, etc.); the metrics should be selected based on usage
  6. Dataset:
    • another possible reason that the quality was higher here is some dataset issues

Hypotheses to check

  • metrics: clarify how to measure quality
  • Dataset (see the sketch after this list)
    • select a row in the dataset
    • split the dataset into 2 parts, before that row and after it
    • assign labels (0 for before, 1 for after)
    • train a classifier; if the quality is better than random, something is fishy with the dataset
  • false positives (the email contains bot-related info but the name doesn't) and false negatives (the email doesn't contain bot-related info but the name does)
    • labels & predictions should be computed
    • extract features separately from names & emails
    • find nearest neighbors by name
    • find nearest neighbors by email
    • several situations are possible:
      • labels & predictions are the same among the nearest neighbors by name & email: perfect
      • labels among the nearest neighbors by name are not the same: possible regexp mistakes?
      • predictions are not the same among the nearest neighbors by email: check it
      • labels & predictions are not the same among the nearest neighbors by name & email: possible regexp mistake?
  • the model overfits to regexp mistakes
    • hypothesis: the number of mistakes is not that big
    • train several models on different chunks of the data, which reduces the number of mistakes in each chunk
    • make the models vote when predicting
    • focus on samples where predictions and labels differ
    • focus on samples where the models' predictions differ
  • features are not good enough
    • BPE could split "abot" into [a, bot], which makes it almost impossible for the model to differentiate one class from the other
      • use a token splitter to split victor@abot.fr into [victor, abot, fr]
      • add a feature that highlights whether something is in the exception list
      • don't extract BPE features from exceptions
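
A minimal sketch of the dataset check from the "Dataset" bullet above, a before/after split test on hashed name+email features (the featurization and the threshold are assumptions; only the split-and-train idea comes from the list):

    from sklearn.feature_extraction.text import HashingVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    def dataset_is_fishy(rows: list[str], split_at: int) -> bool:
        """rows are "name email" strings in their original dataset order.

        Label each row by whether it comes before or after split_at and try to
        predict that label. Far-above-random quality means the two halves differ
        systematically, i.e. something is fishy with the dataset ordering.
        """
        X = HashingVectorizer(analyzer="char_wb", ngram_range=(2, 4),
                              n_features=2**16).transform(rows)
        y = [0 if i < split_at else 1 for i in range(len(rows))]
        score = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                                cv=5, scoring="roc_auc").mean()
        return score > 0.6  # the threshold is an arbitrary assumption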

Papers:
