Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider relaxing levenshtein distance rules #2058

Open
NuckChorris opened this issue Jul 14, 2019 · 21 comments
Open

Consider relaxing levenshtein distance rules #2058

NuckChorris opened this issue Jul 14, 2019 · 21 comments

Comments

@NuckChorris
Copy link

@NuckChorris NuckChorris commented Jul 14, 2019

I just tried to push a gem named strait (named for a narrow, defensible waterway!), but was told that it is too close to typo-protected gem named train.

This is pretty far away from typo territory. For any short, memorable name (in English especially) it's going to be rather difficult to not be within a close levenshtein distance of an existing gem.

Are typo-squatters an actual real issue? Does anyone accidentally typo gem 'r4ils'? The download counts for all the typo-squatters are... pretty low? I don't have access to the Rubygems database but I could imagine those 200 downloads are somebody downloading all gems in existence or the typo-squatter testing their own typo-squat. If we're absolutely certain this is necessary, then Damerau-Levenshtein might better capture "typos" with a lower threshold, since it supports transpositions.

As it is, it's pretty clear from glancing over the list in #2037 that the current rule is overzealous: it has a minimum 31% false-positive rate! The "legitimate" ones which would be caught by this filter are almost 1% of all total gems. There's clearly-legitimate things like monet vs mongo or samlr vs haml.

But seriously, as somebody who does a lot of similarity-based text processing over larger, messier databases than this: please don't try to automate this. I always come to regret it, every time. Without fail. And adding "popularity checks" has only ever made things worse, not better.

Alternative solutions:

  1. Human reporting! Let them pick a reason from a dropdown, make it two clicks, don't bury it in a link pile like it is right now. I'm down to help on this, it's pretty clear that we need something better than "Post to this feedback forum which is filled with spam and gets 4 posts a month". I know it's hard to do support on a volunteer basis. Let's make it easier!
  2. Publish a feed of recently-created gems with easy reporting links — wikipedia does this and it works great!
@indirect

This comment has been minimized.

Copy link
Member

@indirect indirect commented Jul 14, 2019

@NuckChorris sorry you ran into this! It is, in fact, a serious problem, and we regularly receive emails from developers either saying things like "oh no! I just noticed that rials will root your machine on installation" or things like "oh no! I typoed while installing rails and later discovered my machine has been rooted for a while".

The current rule is overzealous, and we plan to work to tune it. Waiting for people to notice that they have already been rooted by typo-based malware doesn't feel like a viable answer at this point.

@indirect

This comment has been minimized.

Copy link
Member

@indirect indirect commented Jul 14, 2019

(And to be clear, it's less support work for us to whitelist your gem, and everyone else's gem, than it is to keep allowing typo gems.)

@NuckChorris

This comment has been minimized.

Copy link
Author

@NuckChorris NuckChorris commented Jul 14, 2019

Definitely agree that transposition is a reasonable case, but I think perhaps Levenshtein is just a poor fit. It's just a pure mathematical concept of language, not a real understanding of the linguistics at play here

Could we modify Levenshtein-Damerau to weight additions, deletions, transpositions, and substitutions differently? I feel like when you account for English, additions just make new words while transpositions/deletions are the main source of typos, and substitutions are the secondary source

@indirect

This comment has been minimized.

Copy link
Member

@indirect indirect commented Jul 14, 2019

Yes, I think something with higher weight on transpositions, and something with higher weight on similar beginnings would be a better fit!

The paper A Comparison of String Distance Metrics for Name-Matching Tasks seems likely to have useful suggestions on how we could improve our matching; I was planning to try to apply it at some point in the future. If you're a subject-matter expert and interested in helping, you're probably a better person for the job than me. :)

@NuckChorris

This comment has been minimized.

Copy link
Author

@NuckChorris NuckChorris commented Jul 14, 2019

I'm afraid my experience is more in throwing these algorithms at the wall and see what works best for cleaning up data, not really in the academic side of it...

As for the common prefix, Jaro-Winkler might be a good fit since it puts heavier weights on the beginnings and can prefilter in the database

@rubyFeedback

This comment has been minimized.

Copy link

@rubyFeedback rubyFeedback commented Jul 15, 2019

I sort of agree with NuckChorris in the sense of usability issue versus security/protection issues.

I think typo-squatters is an issue but a somewhat minor one overall; the larger situation is how to deal with malicious folks in general. JavaScript has become quite famous in this regard (as a negative example).

While the above example is indeed problematic (strait to be assumed to be too close to train), I think the general aim to make the ruby ecosystem better is a laudable approach - just the overall usability should also be considered.

To the topic of levensthein distance - I am not really good with algorithms but the levensthein distance is used a lot all over everywhere; e. g. sequence alignment in bioinformatics. Third top downloaded gem is diff-lcs (as can be seen from https://rubygems.org/stats leading to https://rubygems.org/gems/diff-lcs); while this does not directly have a lot to do with levensthein as such, it is ultimately quite a similar task to do: to find subpatterns within two strings (e. g. equality or measuring differences). I assume this is quite common in informatics/programming in general, due to real problems - or causing real problems, too, as can be seen here. :D

@yb66

This comment has been minimized.

Copy link

@yb66 yb66 commented Aug 8, 2019

Why not push responsibility onto the installer? (I mean human and Rubygems/Bundler etc) If a check is run on install that says:

Did you know that these gems have names that are very close to others?

strait - did you mean "train"?
rials - did you mean "rails"?

If this is what you meant then please run install again with the --ignore-typoes switch

No need for centralised whitelisting, everyone can upload their gems, everyone gets to make sure they're downloading the correct gems. The checker can even be over zealous, or zealosity (heh) could be a setting, even the whitelists could be handled that end.

/2pence

@indirect

This comment has been minimized.

Copy link
Member

@indirect indirect commented Aug 8, 2019

@yb66 that's too late. by the time a malware gem is installed, it's already run its payload and the computer is already compromised. telling someone after the fact that they made a typo and now their computer needs its hard drive wiped is not something we're okay with as a solution, sorry.

@yb66

This comment has been minimized.

Copy link

@yb66 yb66 commented Aug 8, 2019

@indirect

Fetching gem metadata from https://rubygems.org/.........

Very first line of every install phase, so the payload doesn't even need to be downloaded, let alone run, it just needs for the metadata to be checked for typos.

I'd appreciate it if you dropped the supercilious attitude from your replies to me in future, especially when you've come up with a horrendously poor rendering of what I proposed. Here's the contributor convenant, if you need a reminder.

https://github.com/rubygems/rubygems.org/blob/master/CODE_OF_CONDUCT.md

@yb66

This comment has been minimized.

Copy link

@yb66 yb66 commented Aug 8, 2019

In my annoyance I ran bundler install instead of gem install so that quote doesn't apply but the idea is the same.

I hope you get the consideration your issue deserves and isn't getting, @NuckChorris.

I'm out.

@indirect

This comment has been minimized.

Copy link
Member

@indirect indirect commented Aug 9, 2019

Sorry for making you feel condescended to, that definitely wasn’t my intention.

The phrase “why don’t you just”, combined with a proposal that sounds like it would require a new UX, lots of work, and only help the tiny fraction of users who update their clients quickly, is not a viable solution to this ecosystem problem. If you’d like to offer a more fleshed out proposal, we have an RFC repo for that.

As a side comment telling the maintainers that you know better than us what we should be doing, your suggestion and response to feedback comes across as not actually interested in helping the situation for everyone involved.

@yb66

This comment has been minimized.

Copy link

@yb66 yb66 commented Aug 9, 2019

@indirect Firstly, I didn't use the phrase "why don’t you just". Even if I did, it's normal speech, not an invitation for condescension. It's a discussion.

Secondly, I don't accept your apology because your intentions were quite clear, from your first response and with this "As a side comment":

your suggestion and response to feedback comes across as not actually interested in helping the situation for everyone involved.

This is downright disrespectful, again. I presented an idea of what I believe to be a viable solution. You may disagree but I've yet to see anything from you to suggest otherwise that is valid.

@NuckChorris

This comment has been minimized.

Copy link
Author

@NuckChorris NuckChorris commented Aug 10, 2019

Woah, could we all take a step back? I think nobody intended to come across rudely: @yb66 just suggested that maybe a better solution would be clientside, @indirect misunderstood and thought they meant a warning (not an error!), and things escalated from there...

I do think the clientside solution would seem to solve this quite nicely, but the issue of old clients is definitely a problem.

I was thinking "well maybe we just do server-side blocking for old clients" but that presents a whole new set of problems with "disappearing" gems. This isn't an easy problem to solve, and we should all calm down and acknowledge that none of us have the perfect solution and work together to figure out a better long term solution!

(in the nearer term @indirect, I'd love if you'd whitelist strait for me, I never got a response on the rubygems.org support system! Do you need more help there?)

@yb66

This comment has been minimized.

Copy link

@yb66 yb66 commented Aug 10, 2019

@NuckChorris That's very diplomatic but, as a professional, do you allow other professionals to defame you in public? There's a code of conduct and this project receives money, it's not somebody's weekend hobby.

I wish you all the best with this but if client-side checking is too radical an idea then I don't hold out much hope.

@indirect

This comment has been minimized.

Copy link
Member

@indirect indirect commented Aug 13, 2019

@NuckChorris thanks for following up. I've opened a PR to create an exception list and add strait to it at #2092.

@yb66 I am sincerely apologetic, both for misunderstanding your suggestion and for making you feel condescended to. If you'd like to follow up on your client-side filtering idea, please open an RFC for the team to consider. This ticket isn't the right place to evaluate that kind of proposal. Thanks.

@tcd

This comment has been minimized.

Copy link

@tcd tcd commented Nov 17, 2019

What is currently the correct way to go about requesting that a name be whitelisted?

@NuckChorris

This comment has been minimized.

Copy link
Author

@NuckChorris NuckChorris commented Nov 20, 2019

@tcd you might open a PR adding your project name to the list ;)

@tcd

This comment has been minimized.

Copy link

@tcd tcd commented Nov 21, 2019

@NuckChorris I don't think there is an actual list in the repo. Looks like as of pull #2116, names are added directly to the database using the gemcutter:typo:exception rake task.

@dwradcliffe

This comment has been minimized.

Copy link
Member

@dwradcliffe dwradcliffe commented Nov 21, 2019

@tcd please open a help ticket on https://help.rubygems.org

@tcd

This comment has been minimized.

Copy link

@tcd tcd commented Nov 21, 2019

@dwradcliffe Opened one four days six days a week 10 days ago.

@tcd tcd mentioned this issue Nov 26, 2019
@tcd

This comment has been minimized.

Copy link

@tcd tcd commented Dec 2, 2019

Resolved my problem in a separate issue, sorry for cluttering up this issue with that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
6 participants
You can’t perform that action at this time.