Implement caching [$500] #43

Closed
stefansundin opened this issue Jul 29, 2020 · 6 comments

@stefansundin
Owner

stefansundin commented Jul 29, 2020

Bounty: https://www.bountysource.com/issues/92164470-implement-caching

The publicly hosted RSS Box has become hard to use recently as more people have started using it. As discussed in other issues, Twitter and Instagram are unusable most of the time.

I have complained that some people are abusing the service (which is true), but there is a way to resolve the issue without requiring everyone to self-host their own instance, and that is to implement a caching system. Currently, every request for a feed will cause a request to go upstream to the service (e.g. Twitter). A caching system would allow the application to reuse previously fetched data. I have been thinking of the best way to add it for a while, but I have not had enough incentive to actually code it. I am hoping that a $500 bounty will provide enough incentive for me to add it quickly.

The way this works is that anyone who is interested in contributing can chip in, and if it is successfully funded, then I can add the feature and get the bounty. I have spent a lot of time working on this project on my own, and made it available to everyone for free. This is an opportunity for you to show your appreciation while also helping to add a very important feature that is desperately needed.

Previously, I have tried to find a good CDN solution to use, but a CDN has a couple of drawbacks, especially cost. By instead coding a caching system directly into the application and making it backed by a regular file-system, it should provide a solution that is free and robust.

The initial features will be:

  • caching system backed by a file-system (configurable path)
  • ability to cache arbitrary data, not only HTTP requests
  • resolve current Twitter and Instagram issues
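
As a rough illustration of the first two features, here is a minimal sketch of a file-backed cache with a configurable directory that can store arbitrary Ruby objects. It is not the RSS Box implementation: the `FileCache` class, its `fetch` method, and the `ttl:` option are hypothetical names chosen for this example.

```ruby
require "digest"
require "fileutils"

# Hypothetical file-backed cache; names are illustrative, not from RSS Box.
class FileCache
  def initialize(dir, ttl: 3600)
    @dir = dir # configurable cache path
    @ttl = ttl # seconds an entry stays fresh
    FileUtils.mkdir_p(@dir)
  end

  # Returns the cached value for key if it is still fresh. Otherwise runs
  # the block, stores its result on disk, and returns it.
  def fetch(key)
    path = path_for(key)
    if File.exist?(path) && Time.now - File.mtime(path) < @ttl
      # Marshal is only safe here because the cache writes these files itself.
      return Marshal.load(File.binread(path))
    end
    value = yield
    # Write to a temp file and rename it, so readers never see a partial file.
    tmp = "#{path}.#{Process.pid}.tmp"
    File.binwrite(tmp, Marshal.dump(value))
    File.rename(tmp, path)
    value
  end

  private

  # Hash the key so that any string maps to a safe file name.
  def path_for(key)
    File.join(@dir, Digest::SHA256.hexdigest(key))
  end
end
```

With this shape, something like `cache.fetch("twitter/some_user") { fetch_from_upstream }` would only reach the upstream service when the entry is missing or older than the TTL (`fetch_from_upstream` is a placeholder here).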

Potential improvements after initial release (not required for bounty):

  • cache size management (e.g. max cache size, LRU cache eviction, etc)
  • add file locking (this will prevent simultaneous requests from causing extra outbound requests to e.g. Twitter and Instagram)
  • throttle abusive/spammy users (the more requests you make, the longer the cache is held, which hopefully disincentivizes spammy users)
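
The file locking idea could look roughly like the sketch below, which uses Ruby's `File#flock` to make sure only one process at a time refreshes a given cache entry. The helper name `with_cache_lock` and the lock-file layout are assumptions made for illustration, and `key` is assumed to already be a safe file name.

```ruby
require "fileutils"

# Hypothetical helper: runs the block while holding an exclusive lock on a
# per-key lock file, so concurrent requests for the same feed wait instead
# of each sending their own request upstream.
def with_cache_lock(lock_dir, key)
  FileUtils.mkdir_p(lock_dir)
  lock_path = File.join(lock_dir, "#{key}.lock")
  File.open(lock_path, File::RDWR | File::CREAT, 0o644) do |f|
    f.flock(File::LOCK_EX) # blocks until any other holder releases the lock
    yield
  end # closing the file releases the lock
end
```

A caller would typically check the cache again after acquiring the lock, since another request may have filled the entry while this one was waiting.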

This is the first time I have tried using bountysource.com. Please let me know what you think about this attempt, and please feel free to discuss it below.

@stefansundin
Owner Author

It is unfortunate that I don't have a good way to get the word out to people. I have put a notice on the website, but I don't think many will see it. I could do something more invasive, such as publishing an item to everyone's feeds, but I think that would make people angry, so that's probably not a good idea.

If you have an idea on how to get the word out, let me know. And tell your friends. Thanks!

@Kikobeats

Hey @stefansundin, thanks for the service!

Can you check https://github.com/Kikobeats/cacheable-response?

I think it could fit the project well. You can use any keyv database (I tend to use Redis, but MongoDB gives you a lot of cheap space), and putting CloudFlare in front will save you money (the free plan is enough).

@stefansundin
Owner Author

Hi @Kikobeats.

Both cacheable-response and keyv are node.js packages so I can't use them in RSS Box.

I'm not able to use CloudFlare on rssbox.herokuapp.com since I do not control the DNS. I have experimented with CloudFlare, but I think built-in caching will be a much better solution.

Thanks for sharing your thoughts.

@onli

onli commented Aug 26, 2020

Hi Stefan
Are you sure a caching system is what you primarily need here? I think you need a throttle system instead (I had a look at the code and did not see one, hope I did not just miss it). Not primarily to throttle abusers, but to throttle your outgoing requests per API/site. That way your central instance might slow down, but you will avoid getting blocked or serving no content at all. When your instance gets slow, that also slows down the clients requesting data.

Only once this is in place will caching data be a helpful way to effectively increase your request limit per API and to limit load.
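
To make the suggestion concrete, here is a minimal sketch of a per-host throttle for outgoing requests. It is not taken from RSS Box or Pipes: the `HostThrottle` class and its `throttle` method are hypothetical, and a fixed minimum interval per host is just one simple policy.

```ruby
# Hypothetical per-host throttle: enforces a minimum interval between
# outgoing requests to the same upstream host.
class HostThrottle
  def initialize(min_interval)
    @min_interval = min_interval    # seconds between requests per host
    @next_slot = Hash.new(0.0)      # host => earliest time of next request
    @mutex = Mutex.new
  end

  # Sleeps if needed so that calls for the same host are spaced at least
  # min_interval seconds apart, then runs the block.
  def throttle(host)
    wait = @mutex.synchronize do
      now = Process.clock_gettime(Process::CLOCK_MONOTONIC)
      slot = [@next_slot[host], now].max
      @next_slot[host] = slot + @min_interval
      slot - now
    end
    sleep(wait) if wait > 0
    yield
  end
end
```

Under load, requests to a busy host queue up and the instance slows down, while requests to other hosts are unaffected.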

I'm doing something comparable in Pipes, with a downloader class that creates throttle objects per site and combines that with a database cache: https://github.com/pipes-digital/pipes/blob/master/downloader.rb. The same goes for the Twitter client, https://github.com/pipes-digital/pipes/blob/master/twitterclient.rb, since we share the same API limit issues.


Implementing some kind of cache should be straightforward for you, either by leveraging something like LRURedux or some other Ruby project, or by hooking into something like Redis. The only limitation should be your server environment, with Heroku in particular ruling out SQLite (but not file system access). Not to disparage the monetization attempt, this is a cool project and you would deserve it, just speaking as a developer here.

If it's about getting some monetary reward for the project, I'd suggest a paid plan for the central instance. I host my own for Pipes, but I'd subscribe to support your effort here.

@stefansundin
Owner Author

In the past few days I have done a lot of cleanup of the code, and today I have pushed the new caching framework that I designed. It is the single biggest change that RSS Box has ever seen. It was quite tricky to get it working in a way that satisfied all of my criteria without making it too complicated or convoluted.

It is now deployed to the Heroku instance and the cache is still filling up, but I can already see a definite improvement. Since Heroku restarts all dynos every day, there will still be a period of time every day when the website is slow. I may be able to tweak it to adjust for that, but it will require some experimentation.

The data is currently cached for at least an hour for most services. Instagram caches its data for four hours at the moment. This will probably be tweaked going forward. It also means that it is now pointless to fetch the same feed more often than once an hour.
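
For illustration only, per-service cache durations like these could be expressed as a small lookup table. The constant and method names below are hypothetical, and the values simply mirror the durations mentioned in this comment.

```ruby
# Hypothetical TTL table: one hour by default, four hours for Instagram.
DEFAULT_TTL = 60 * 60
SERVICE_TTL = { "instagram" => 4 * 60 * 60 }.freeze

# Returns the cache duration in seconds for a service name.
def ttl_for(service)
  SERVICE_TTL.fetch(service, DEFAULT_TTL)
end

# True if data fetched at fetched_at is still within the service's TTL.
def fresh?(fetched_at, service, now = Time.now)
  now - fetched_at < ttl_for(service)
end
```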

I think that the URL resolution feature is now the bottleneck, so I will probably change how that works in the future.

Code changes: ca97124...c7cf1f9

I am a bit sad that not a single person signed up to back this feature. If people had funded this then I would have finished it way earlier. When I created this bounty my hope was that it could be the initiator for a positive feedback cycle for RSS Box. But it sadly did not turn out that way.

@stefansundin
Owner Author

stefansundin commented Jun 13, 2021

I just found a synthetic that I had created a long time ago in AWS CloudWatch Synthetics, and I found this data interesting. It clearly shows that things dramatically improved when I added the caching. Here's the graph for the last 15 months.

[Image: Screen Shot 2021-06-13 at 13 02 37]

The synthetic fetched data from a Twitter user.
