Replies: 23 comments
-
Currently no. There is some work started to send stats from the bundler client back to RubyGems.org but I don't think it's made it very far.
We have a few 3rd party services we use for stats, but nothing that would work well to expose on the website. At the moment we don't have the resources to store and process this data. I'm totally open to ideas and suggestions, although managing another data store seems unlikely right now. |
-
Also, I think the idea is great and I agree it would be valuable to provide that information. |
-
Great! Thanks for the quick response, David.
Why the bundler client? That information isn't available server-side?
I can read up on the infrastructure, assuming the easy to find resources like https://github.com/rubygems/rubygems-infrastructure/wiki/Architecture-Overview are up to date. My naïve thought was that the final destination for the calculated statistics would be whatever data store holds gem profile data, like author names, links to repo, etc. Am I on the right track? |
-
With the CDN/caching layers we have in place (Fastly), our servers only see a fraction of the requests. So we can't use that data. We can send logs from Fastly for every request to something to process and store them, but that's a lot of logs. As for storage, it's simply too much data to store in the postgresql database. We used redis to store similar data before, but it got to be too much data for redis to handle. |
-
So, if I understand correctly, we'll have to parse the logs from Fastly, aggregate that data, write the aggregated statistics to the postgres db, and display those stats on the gem profile. Is that correct?
Yeah, I'm sure. Any chance you can quantify that? Having a rough idea of the size of the raw logs, say, daily, will help with our planning. Just to get the discussion going, let me throw out a few obvious architectures:
Regarding the final aggregated statistics, let's see if we're on the same page; I'm picturing something like the following for each gem.
{
  ruby: {
    :collection_began_at => '2017-01-01',
    :total_downloads_since_collection_began => 3_000,
    '2.3.1' => 1_000,
    '2.2.5' => 1_500,
    '2.1.10' => 500,
    'jruby-1.7.25' => 0,
    'jruby-9.0.5.0' => 0,
    'rbx-3.28' => 0
  },
  rails: {
    :collection_began_at => '2017-03-01',
    :total_downloads_since_collection_began => 1_000,
    '5.0' => 500,
    '4.2' => 300,
    '4.1' => 200
  }
  # possibly other gems in the future
}
This would only add a few kilobytes at most to each gem record in the postgres database. Am I on the right track here? Were you picturing something similar?
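As a rough illustration of how the per-gem structure above might be filled in, here is a minimal Ruby sketch that tallies download events by Ruby version. The event shape (`:ruby_version`) and the method name `ruby_version_stats` are assumptions made up for this example; nothing in the thread specifies how events would actually be represented.

```ruby
# Hypothetical sketch: build the per-gem "ruby" stats hash from raw events.
# Each event is assumed to be a Hash with a :ruby_version key.
def ruby_version_stats(events, collection_began_at)
  # Count how many events reported each Ruby version.
  counts = events.each_with_object(Hash.new(0)) do |event, tally|
    tally[event[:ruby_version]] += 1
  end

  # Metadata keys are symbols; version keys are strings, as in the structure above.
  {
    collection_began_at: collection_began_at,
    total_downloads_since_collection_began: events.size
  }.merge(counts)
end

sample_events = [
  { ruby_version: '2.3.1' },
  { ruby_version: '2.2.5' },
  { ruby_version: '2.3.1' }
]
p ruby_version_stats(sample_events, '2017-01-01')
```

Only the aggregated hash would need to be stored with the gem record; the raw events could be discarded once counted.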
-
I like the idea behind it, so +1. Interestingly enough, at least to me, I never thought about this. I mostly just use the latest ruby and sorta assume that everyone else uses it too. :D Btw, in the event that it may add too much data, how about an approximation? Something like, you know, "About 10% use ruby 1.8.x for this gem." or just some major version data. I mean the data structure is more accurate like: But actually, would not 4.x and 5.x be sorta enough? Better than what we currently have. So: Could be even simpler, to keep track in iterations of 1000 :D 5000 and 2000! Minimal storing for maximum output!
-
The final calculated statistics would only be a few kilobytes in each gem record in the postgres database, which is not a concern. When David says "it's simply too much data to store" I believe he means that tracking every download event (about 1 billion events per year?) in postgres or redis would be too much.
Unfortunately, rails does not follow semver, and makes breaking changes in minor versions, so gem authors will want to know the minor version, 4.2, 5.0, etc. |
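To make the minor-version point concrete, here is a small sketch that groups per-version download counts into "major.minor" series and reports rounded percentages, combining the approximation idea with the need to keep minor versions apart. The method names and the input shape (version string to count) are assumptions for illustration only.

```ruby
# Hypothetical sketch: reduce "4.2.7.1" to its "major.minor" series, "4.2".
def minor_series(version)
  version.split('.').first(2).join('.')
end

# Group raw per-version counts by minor series and convert them to
# rounded percentages of the total.
def share_by_minor_series(counts)
  grouped = counts.each_with_object(Hash.new(0)) do |(version, count), totals|
    totals[minor_series(version)] += count
  end

  total = grouped.values.sum
  return {} if total.zero?

  grouped.map { |series, count| [series, (100.0 * count / total).round] }.to_h
end

p share_by_minor_series('5.0.1' => 300, '5.0.2' => 200, '4.2.7' => 300, '4.1.16' => 200)
```

Storing percentages per minor series keeps the record small while still separating 4.2 from 4.1.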
-
@jaredbeck I would also dearly love to see stats like downloads per day/month/year. I think this would ultimately be better solved by a specialized event store that we can query for the actual data, rather than calculating and storing a tiny subset of it in postgres along with the actual critical gem data. Two examples of existing systems for this include npm statistics via map reduce and cocoapods statistics via Redshift. Since we already have a complete map reduce pipeline set up using Amazon SQS and Shoryuken, we may be able to make everything work by simply adding a little bit to that existing pipeline (although I don't know if Fastly supports logging arbitrary data sent in request headers, which is how Bundler currently reports Ruby version).
-
Yay, with David and André both on board, I'm sure we can do this! #dreamteam
Oh, that's great! I'm reading |
-
Hey @jaredbeck are you still working on this? |
-
Yes. We are still discussing the architecture. There are (at least) two outstanding questions.
|
-
I was working on a small part of this issue #1335 during the summer as part of the Rails Girls Summer of Code Fellowship. Long story short I was not able to finish my portion during the duration of the fellowship and I'm hoping to finish it up.
Bundler might possibly report the rails version since rails is a gem. The code that I wrote is basically a small program that captures the version and system information during the bundler install process (before the request goes to Fastly) and sends that data to the ruby gems server to be represented with Librato. I'm excited to hear more about your thoughts on this.
Librato may or may not address some of those requirements; however, this is a solution I would love to learn how to develop and implement. I am still an early stage programmer. If you're interested, I hope we can work together and see this project through to completion. The project files I was working on will be here; however, in an effort to follow bundler PR conventions, I must reset my development environment and make sure all tests pass. I have my code on my local device and will push it to my repo in a few days.
-
@jaredbeck hey I created a class called gem_analytics_reporter |
-
Hi Ore, Thanks for the contribution but I think we're still waiting to hear from David and André about the architecture and I don't think we're ready to write code yet. |
-
@jaredbeck hey there, sorry it took so long to get back to you about this. David and I talked a bit at RubyConf, and I think we have decided at least enough to move forward. There are a few separate things going on in this ticket, and I'd like to call them out and address them one at a time:
1. Metrics reported by existing Bundler versions: Bundler adds a
2. New Bundler metrics: At some point, we would like to start sending metrics data directly from Bundler to the server, skipping the hacky
3. Gem version download counts: As mentioned in the original post for this ticket, it would be great to have download numbers for individual gem versions. It's possible to capture these numbers from the existing log parsing pipeline. We don't have a data store nailed down yet, but Redshift and Librato are the best candidates. However, version download counts are not the same as usage numbers (see next point).
4. Gem usage numbers: In addition to download numbers, we would like to have stats on how many separate projects use a gem, counted by version. IIRC our last idea about how to do this is to take a SHA256 hash of the git remote URL, and use that to index gem usage for that single project. IMO, knowing that an older version of your gem is in use by a lot of projects is much more useful than knowing that it was downloaded a lot of times. These are (I think) the numbers that would be most useful to the community at large, even more than raw download counts.
Now let me try to answer your questions about processing and storing the numbers:
The current log pipeline counts downloads. I think we can extend it to count downloads per version pretty easily, as long as we have somewhere to send the data. We will also need to extend the existing log processor to get information from older versions of Bundler.
No, Bundler can't also report the Rails version (or any other gem version) in an HTTP request header. That’s not a good place to put data.
I haven't used Redshift that much, but that seems like a reasonable option. The other option that David and I were looking at is Librato, which has already provided us with a donated account (and since they are a metric service/store, we can easily query for aggregates from them as well).
We already pay for our AWS account, and I think we can continue to do that even if we start using Redshift as well.
Yes, that seems like a good idea, and it's probably required if we're going to display these numbers anywhere on RubyGems.org.
@Speculate7 This ticket is probably not a good place to get feedback for your changes, since they are on Bundler; this ticket is about work on RubyGems.org. Please start by getting code review from your RGSoC coaches. After that, you can open a PR against Bundler and get feedback from the Bundler team. 👍
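For readers wondering what "count downloads per version" could look like in practice, here is a minimal Ruby sketch. It assumes the log processor can hand over a list of request paths and that gem downloads use paths of the form `/gems/NAME-VERSION.gem`; the regular expression, the constant `GEM_PATH`, and the method name are illustrative guesses, not the actual pipeline code, and the version/platform split is a simplification.

```ruby
# Hypothetical pattern for gem download paths. The name may contain hyphens;
# the version is assumed to start with a digit and contain no hyphen;
# anything after a further hyphen is treated as a platform suffix.
GEM_PATH = %r{\A/gems/(?<name>.+)-(?<version>\d[^-]*)(?:-(?<platform>.+))?\.gem\z}

# Count downloads per [gem name, version] pair, ignoring non-download requests.
def count_downloads_per_version(paths)
  paths.each_with_object(Hash.new(0)) do |path, counts|
    match = GEM_PATH.match(path)
    next unless match

    counts[[match[:name], match[:version]]] += 1
  end
end

paths = [
  '/gems/rails-5.0.1.gem',
  '/gems/rails-5.0.1.gem',
  '/gems/rails-4.2.7.gem',
  '/gems/nokogiri-1.6.8-x86-mingw32.gem',
  '/api/v1/dependencies'
]
p count_downloads_per_version(paths)
```

The resulting counts are the kind of aggregate that could then be sent on to whichever data store is chosen.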
-
That's great news. I wish I'd been there. Some day.
Absolutely, that would be ideal, and more than I had hoped for!
That's a clever way to anonymously identify projects. 97% of ruby projects use git, according to the Rails Hosting Survey 2016, so I think it's reasonable. |
-
@indirect as always, thank you for the continued advice and support. |
-
Rereading my comment, I realized I left out the (maybe) most important part of my point about calculating gem usage: There's no way for us to capture usage numbers by processing the Fastly logs. However, it should be pretty straightforward to extend the new Bundler metrics system (once it exists) to also send a hash of the git url and gem names and versions. That's the core reason that all of these projects are intertwined. :) |
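A minimal sketch of the idea described here, assuming the client already knows its git remote URL and its resolved gem versions: hash the URL with SHA256 and send the digest alongside the gem names and versions. The method names and the report shape are hypothetical; the thread does not define a wire format.

```ruby
require 'digest'

# Anonymous project identifier: SHA256 of the git remote URL.
# Stripping whitespace is an assumption, so a trailing newline from
# command output does not change the digest.
def project_fingerprint(git_remote_url)
  Digest::SHA256.hexdigest(git_remote_url.strip)
end

# Hypothetical usage report: the fingerprint plus gem names and versions.
def usage_report(git_remote_url, gems)
  { project: project_fingerprint(git_remote_url), gems: gems }
end

report = usage_report('git@example.com:acme/app.git', { 'rails' => '5.0.1', 'rake' => '11.3.0' })
p report
```

Because the same URL always produces the same digest, the server could count distinct projects per gem version without ever seeing the URL itself.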
-
Closing due to inactivity. |
-
I was looking for stats on downloads by gem version over time today, to make a more informed decision about backward compatibility for a gem I'm working on. I read about the issues faced and understand them. I think this thread should remain open as a feature request reminder for some day when there are resources available to support it.
@indirect I agree having gem active usage data is ideal, but if we can't get that, then wouldn't having the download count by version over time be the next best thing? At a high level, what matters in development is roughly knowing the relative impact of making breaking changes. If a gem version's downloads are decreasing over time (think slow long tail), then I can assume the apps have either updated to newer gem versions, or are running but not being updated with newer gem versions, or are no longer running. In each case, I can conclude it's ok to make a breaking change. The question is what the size of the impact is. One might use the recent daily/weekly/monthly download count for a gem version, or take a windowed slice, and compare it with the total downloads of all versions or of newer versions. It's neither perfect nor accurate, but in relative terms the ratio can be helpful, i.e. is it 5% or 40% of apps that will have to make a change someday when they decide to update their gems? For me that is enough.
Is there still interest in having this feature and what are the gating items?
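The ratio described above can be sketched in a few lines of Ruby. This assumes per-version download counts for some recent window are available as a hash of version string to count; the method name `older_version_share` and the sample numbers are invented for illustration.

```ruby
require 'rubygems' # provides Gem::Version for version comparison

# Percentage of recent downloads that went to versions older than `cutoff`,
# i.e. a rough estimate of who would be affected by dropping those versions.
def older_version_share(recent_downloads, cutoff)
  total = recent_downloads.values.sum
  return 0.0 if total.zero?

  cutoff_version = Gem::Version.new(cutoff)
  older = recent_downloads
          .select { |version, _count| Gem::Version.new(version) < cutoff_version }
          .values
          .sum

  (100.0 * older / total).round(1)
end

recent = { '1.0.0' => 50, '1.5.0' => 150, '2.0.0' => 800 }
puts older_version_share(recent, '2.0.0')
```

As the comment says, this is a relative signal rather than an exact measure of usage, since downloads are not the same as running installations.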
-
We don’t currently have the devops capacity to administrate a system that tracks downloads over time. That’s what we had before, and we gave up after it took the site down repeatedly.
I think the new Ruby Toolbox efforts to ingest data dumps will give this information at a weekly or monthly level? Either way I would suggest investigating that first.
If the Ruby Toolbox numbers aren’t what you’re looking for, we need a proposal that would add data warehouse type capacity for us to store downloads over time without needing additional ops overhead to run it, and with graceful degradation that does not hurt the main site if it has issues.
…On Mar 1, 2019, 4:15 PM +0900, Aditya Prakash ***@***.***>, wrote:
Reopened #1439.
-
@indirect I see, so it's the same concerns. I completely understand. I did look at the Ruby Toolbox and bestgems.org charts, but they only show total downloads of all versions over time and total downloads by version over time, which doesn't provide the necessary insight. I'm sure you know what I'm looking for, but I'll add an example of Android version installs to showcase what the OP and I are looking for in gem version stats, for future reference. From this chart I can see that breaking compatibility with versions <=2.2 is pretty safe, but with versions >=2.3 it is not. The Ruby Toolbox chart breakdowns don't provide this level of insight. Perhaps in future either we find a low-overhead way to do it, or something changes that allows us to support this... let's keep this issue open.
-
This has been my GSoC project which is practically complete (aside from some test issues). Mind you, some data (including gem versions) is not instrumented at the moment due to limitations with Datadog that's currently used in rubygems.org. |
-
As a gem author,
I want to see statistics about which ruby versions people are using,
So I can make informed decisions about which rubies to support
As the author of a gem that depends on rails,
I want to see statistics about which rails versions people use,
So I can make informed decisions about which rails versions to support
This would be huge for the ruby community. Compatibility is an important goal of many gems.
I can think of a number of possible UIs for this, but first:
I'd be happy to do the legwork on this if someone with the commit bit is willing to sponsor it.