Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Invalidate gem name using levenshtein distance for gems with ten million downloads #2037

Merged
merged 3 commits into from Jul 8, 2019

Conversation

sonalkr132
Copy link
Member

Using the number of downloads as a threshold (as opposed to top 1000 gems) would keep blocked list consistent. 4859 (3% of total) of existing gems have an invalid name, 1505 of which seem legit gems. We can use postgres to store the exclusion list and use a script to update it. I can create a follow-up PR if we should blocking release to existing invalid gems.
The current protected list would be of 683 gems. A larger protected list (say 1000 gems, 5 mil downloads) would mean a more significant portion of gem names would be blocked (4.5% invalid names of total).

@simi
Copy link
Member

simi commented Jun 24, 2019

There was already some discussion before where I have pointed to some actually confusing limitations, but I can't find it now. Anyone any idea where it was discussed?

From attached list I still see the same problem.

For example:

  • mini_magick and mini_magick2, compass-rails and compass-rails31 (there's more examples of similar pattern in list) -> Is this really wrong/typo?
  • guard-concat and guard-compat -> also looks valid for me
  • gistgen and `listen' -> actually totally different for humans (or at least for me)

Is the original purpose for this to make rubygems.org safer? I think there can be a better strategy to achieve that. I can try to elaborate as well if welcomed.

@sonalkr132
Copy link
Member Author

Anyone any idea where it was discussed?

#1809 (comment)

there is 3 letter gem and 4 letter gems now

This PR only blocks new gems where gem name is four or more characters. levenshtein distance threshold for four-letter gems is one and that of five or more is two.

I have prepared the list of existing gem with invalid names so that we can decide if we want to block new releases on them. IMHO, we should block new releases too with an exception of these 1505 gems.

Is this really wrong/typo?

I understand deciding what constitutes as a wrong type is not easy and rules used in this PR won't protect against all typos. However, it seems good enough start and shouldn't have much impact on users.

I can try to elaborate as well if welcomed.

Sure. Any suggestion on better ways of dealing with problems like #2028 is welcome.

end

def protected_gems
Rubygem.by_downloads
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about to use something similar to:

    Rubygem.joins(:gem_download)
      .where("gem_downloads.count > ?", GemTypo::DOWNLOADS_THRESHOLD)
      .where.not(name: @rubygem_name)
      .pluck(:name)
  1. it doesn't do unnecessary order done by by_downloads scope
  2. it filters out @rubygem_name on database level

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks 👍 Updated. Technically, this PR doesn't need this check but it will be needed when we remove :create on validation.

private

def distance_threshold
@rubygem_name.size == GemTypo::SIZE_THRESHOLD ? 1 : 2
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about to "memoize" this to not execute same call on every iteration?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved it to the initializer.

Copy link
Member

@indirect indirect left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a really good place to start and adjust from there 👍

@sonalkr132 sonalkr132 force-pushed the gem_typo_prototype branch 2 times, most recently from 47dc772 to e0699a4 Compare June 25, 2019 19:30
Use downloads count as threshold instead of static list.
Limit validation to new gems.
Update distance threshold as mentioned in segiddins PR.

Existing gem we may consider blocking:
https://gist.github.com/sonalkr132/af05b030af793ce17a69245152d5aa5f
total: 4859 (2.98%)
@sonalkr132
Copy link
Member Author

@simi any issue with merging this?

Copy link
Member

@simi simi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code looks ok to me. 👍

@sonalkr132 sonalkr132 merged commit 87f414c into rubygems:master Jul 8, 2019
@sonalkr132 sonalkr132 deleted the gem_typo_prototype branch July 8, 2019 12:52
@rubygems-deployer rubygems-deployer temporarily deployed to staging July 8, 2019 13:01 Inactive
@sonalkr132 sonalkr132 temporarily deployed to production July 12, 2019 17:29 Inactive
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants