
Media content caching strategy #1847

Closed
Gargron opened this issue Apr 15, 2017 · 121 comments
Labels
expertise wanted (Extra expertise is needed for implementation) · legal (Features related to law-compliance)

Comments

@Gargron
Member

Gargron commented Apr 15, 2017

Right now, Mastodon downloads local copies of:

  • avatar
  • header
  • status media attachments

On these local copies, Mastodon can perform operations like resizing, optimizing, and creating thumbnails that fit Mastodon's UI, because the origin of the content can provide media in very large sizes that would severely impact end users' bandwidth and browser performance if displayed verbatim.
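
To make that processing concrete, here is a minimal sketch in Ruby using the mini_magick gem. It only illustrates the download-and-downscale step, not Mastodon's actual attachment pipeline, and the paths and size limits are made up:

```ruby
# Illustration only: fetch a remote image, store a bounded-size local copy,
# and derive a small thumbnail. Not Mastodon's real attachment code.
require 'open-uri'
require 'fileutils'
require 'mini_magick'

def cache_and_resize(url, dir: 'public/system/cache')
  FileUtils.mkdir_p(dir)
  path = File.join(dir, File.basename(URI.parse(url).path))
  File.binwrite(path, URI.open(url).read)        # local copy of the original

  image = MiniMagick::Image.open(path)
  image.resize '1280x1280>'                      # shrink only if larger than the limit
  image.strip                                    # drop EXIF metadata
  image.write path

  thumb = MiniMagick::Image.open(path)
  thumb.resize '400x400>'                        # UI-sized thumbnail
  thumb.write path.sub(/(\.\w+)\z/, '_small\1')
  path
end
```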

Moreover, bandwidth is not always cheap: it is capped at something like 1 TB/mo on DigitalOcean and is very expensive on Amazon S3, so hotlinking images and videos would severely impact owners of small instances when many users of large instances view their content from public timelines (or even just home timelines through boosts). It does feel fair that an instance's admin is responsible for serving content to their own users, rather than also to users of other instances, who should be their own admins' responsibility.

However, this has storage and legal implications. I would like to hear your thoughts on how this can be improved.

@Gargron added the "expertise wanted", "legal" and "priority - high" labels on Apr 15, 2017
@BjarniRunar

Potentially low hanging fruit: You could reduce (but not eliminate) exposure if caching was disabled (or TTLs set very low, or caching limited to RAM-only on a swapless machine) for content that was tagged #nsfw.

A more complicated option is to feed blocking/reporting actions back into the cache layer, so that data users have flagged as objectionable can be purged quickly. This is a rabbit hole of complexity, but probably worth doing if you intend to keep the cache.
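
To make that second idea concrete, a rough sketch of such a hook, written against Paperclip-style attachments; the method name here is hypothetical, not an existing Mastodon API:

```ruby
# Hypothetical moderation hook: once a reported status is judged objectionable,
# purge its cached media immediately instead of waiting for any TTL to expire.
def purge_cached_media!(status)
  status.media_attachments.each do |attachment|
    attachment.file.destroy   # Paperclip-style: delete the stored file and all styles
    attachment.save!          # persist the cleared attachment columns
  end
end
```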
...

Another benefit of an instance loading media on behalf of its users is that it slightly improves the privacy of the instance's users. Browsing the federated timeline or boosted toots won't automatically leak your IP to an instance you have no pre-existing relationship with.

Yet another benefit: resizing images may disable/thwart exploits based on corrupt data (the instance itself is at higher risk of this though, and browsers are arguably better hardened/tested than the image conversion libraries used server-side).

I see a lot of benefits to what you are currently doing and I think it is the right thing for both users and the health of the network. However, the risk to admins is real and serious. Just my 2c, hope this is helpful. :-)

@tyrosinase

I'm seeing people hotlink to images from Twitter, so that's also a consideration: if you freely allow hotlinking you run into the potential for people to use your instance as free image caching. Kind of a separate-but-related issue.

@spikewilliams

spikewilliams commented Apr 15, 2017

I like the principle that an instance should be responsible for serving content to its own users, but I wonder if there should be a distinction between short-term and long-term storage.

Most of the traffic for any given piece of media will occur within 24 hours. After a certain period - 7 days? 30 days? - it's mostly just being kept around for archival purposes. At that point, it may make sense to revert to hosting by the original instance (problematic if that instance goes offline) or to some third-party host or federation of hosts, and the instance can negotiate retrieval if the image gets requested again.
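
As a sketch of that short-term/long-term split (illustrative names, not an existing Mastodon job): a scheduled task could strip the cached copy from remote attachments older than a cutoff while keeping the record and its remote_url, so the file can be re-fetched or hotlinked later.

```ruby
# Illustrative daily job: drop cached copies of remote media older than the
# retention window; the database row and remote_url remain for re-fetching.
class RemoveStaleRemoteMediaJob
  RETENTION_DAYS = Integer(ENV.fetch('MEDIA_CACHE_RETENTION_DAYS', '7'))

  def perform
    cutoff = Time.now - RETENTION_DAYS * 24 * 60 * 60
    MediaAttachment
      .where.not(remote_url: '')            # only remote media, never local uploads
      .where('created_at < ?', cutoff)
      .find_each do |attachment|
        attachment.file.destroy             # remove the cached file from storage
        attachment.save!
      end
  end
end
```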

I am intrigued by Swarm as a peer-to-peer means of long-term data caching.

@ghost

ghost commented Apr 15, 2017

Maybe you could add another class of trusted instances. Right now you can silence or suspend instances; if you also add the option to trust certain instances, you can apply different policies per class, like this (a rough sketch follows the list):

  • Trusted instances: cache for a longer time and keep all media locally.
  • Normal instances: cache for a shorter time and keep all media locally for a limited period (e.g. 1 month, then hotlink).
  • Silenced instances: cache for a shorter time and hotlink all media. Silence on the federated timeline.
  • Suspended instances: no caching, no hotlinking. Block all communication.
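
A rough sketch of how such per-class policies could be expressed; the trust levels and TTL values are hypothetical, not an existing Mastodon feature:

```ruby
# Hypothetical mapping from an instance's trust class to a caching policy.
CachePolicy = Struct.new(:cache_media, :ttl_days, :hotlink_after_expiry)

POLICIES = {
  trusted:   CachePolicy.new(true,  nil, false),  # cache indefinitely
  normal:    CachePolicy.new(true,  30,  true),   # cache ~1 month, then hotlink
  silenced:  CachePolicy.new(false, nil, true),   # never cache, always hotlink
  suspended: CachePolicy.new(false, nil, false)   # no caching, no hotlinking
}.freeze

def policy_for(trust_level)
  POLICIES.fetch(trust_level, POLICIES[:normal])
end
```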

@vielmetti

If people are storing on an expensive network they might look at self-hosting with Minio as their server back end, and then they can manage that storage queue themselves.

Minio will also federate across servers, so conceptually n servers could set up 2n+1 spindles, connect them all together, and have a shared cached file system.
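
For reference, talking to a self-hosted Minio endpoint uses the ordinary S3 API; a sketch with the aws-sdk-s3 gem, where the endpoint, bucket and credentials are placeholders:

```ruby
# Sketch: an S3-compatible client pointed at a self-hosted Minio endpoint.
require 'aws-sdk-s3'

client = Aws::S3::Client.new(
  endpoint:          'https://minio.example.com:9000',   # placeholder endpoint
  access_key_id:     ENV['MINIO_ACCESS_KEY'],
  secret_access_key: ENV['MINIO_SECRET_KEY'],
  region:            'us-east-1',                        # Minio accepts any region string
  force_path_style:  true                                # path-style URLs for Minio
)

client.put_object(bucket: 'mastodon-media',
                  key: 'cache/example.png',
                  body: File.open('example.png', 'rb'))
```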

@Miaourt

Miaourt commented Apr 15, 2017

I would prefer "no cache at all" for the "normal" ones, since the law doesn't take into account how long we host the content...

@maethor

maethor commented Apr 15, 2017

@Gargron Today, what is the TTL of the cache? It seems to be 14 days, but we are not sure about it.

I believe the current system is good because it protects users, and this should be the absolute priority (technically, but also privacy-wise). I really think hotlinking media would be a bad idea because of privacy. It would be even worse if you hotlinked « bad instances ».

Maybe you could allow admins to configure the TTL? If I don't want any risk, I set a TTL of 1 minute, knowing it will be a little CPU-intensive. Maybe I use 1 day, or 1 week if I am careful. Maybe 1 month if I don't care (a personal instance, for example).

Another idea could be to separate the instance's own media from the local copies. Admins would then have a sense of what is consuming storage.
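
That second idea could start as a small report; a sketch that splits attachment storage into local uploads versus cached remote copies, using columns from Mastodon's media_attachments table (remote_url, file_file_size) but intended purely as an illustration:

```ruby
# Sketch: rough storage accounting so an admin can see what the remote-media
# cache costs compared to the instance's own uploads.
local_bytes  = MediaAttachment.where(remote_url: '').sum(:file_file_size)
remote_bytes = MediaAttachment.where.not(remote_url: '').sum(:file_file_size)

puts format('Local uploads:       %.1f GB', local_bytes.to_f  / 1024**3)
puts format('Cached remote media: %.1f GB', remote_bytes.to_f / 1024**3)
```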

@pfigel
Contributor

pfigel commented Apr 15, 2017

Extending the existing domain blocking feature to allow admins to choose not to cache media content from certain instances (without having to suspend them) could be a viable (and relatively easy-to-implement) approach.

// Edit: Turns out there already is a hidden reject_media domain block type, so that's great news.
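
For context, a simplified sketch of how a "should we cache this remote file?" decision can be built on that reject_media flag; the exact code path in Mastodon may differ:

```ruby
# Simplified check built on the reject_media flag of the DomainBlock model.
require 'addressable/uri'

def cache_remote_media?(remote_url)
  domain = Addressable::URI.parse(remote_url).normalized_host
  block  = DomainBlock.find_by(domain: domain)
  block.nil? || !block.reject_media?
rescue Addressable::URI::InvalidURIError
  false   # refuse to cache anything we cannot even parse
end
```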

@ldidry
Contributor

ldidry commented Apr 15, 2017

We could use a Camo instance: the URL of the image is still yourinstance.tld…, but Camo proxies the request to the actual server that has the image. For caching, we could put a Varnish between Nginx and Camo. It works for images, but I don't know if it will work for mp4.

This way, your instance would never download content from other instances.
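
For anyone unfamiliar with Camo: it rewrites image URLs so they point at your proxy, embedding an HMAC of the original URL so the proxy only fetches URLs your own software signed. A sketch of that URL construction, where the host and shared key are placeholders:

```ruby
# Sketch of a Camo-style proxy URL: HMAC-SHA1 of the original URL with a key
# shared with the Camo server, followed by the hex-encoded URL.
require 'openssl'

CAMO_HOST = 'https://camo.example.com'                 # placeholder proxy host
CAMO_KEY  = ENV.fetch('CAMO_KEY', 'shared-secret')     # placeholder shared key

def camo_url(original_url)
  digest  = OpenSSL::HMAC.hexdigest(OpenSSL::Digest::SHA1.new, CAMO_KEY, original_url)
  encoded = original_url.unpack1('H*')                 # hex-encode the URL
  "#{CAMO_HOST}/#{digest}/#{encoded}"
end

camo_url('https://remote.instance/media/picture.png')
```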

@Miaourt

Miaourt commented Apr 15, 2017

But it's still distributing it...

@maethor

maethor commented Apr 15, 2017

Which is what we want, I think.

@Miaourt

Miaourt commented Apr 15, 2017

@maethor I'm worried about "illegal" content coming "from" my instance, tbh; proxying stuff makes it "on" my instance in the legal sense :/

@Gargron
Member Author

Gargron commented Apr 15, 2017

@maethor Current TTL is "forever". I think this is part of the things that need to be adjusted.

@ldidry With Camo it seems like you would be doing the same as we're doing now, but with more effort, since you'd still need to implement the actual caching on top of it. Perhaps just adding TTLs to the current system would be better?

@gellenburg

gellenburg commented Apr 15, 2017

I'm unorigmoniker@mastodon.social. Moving my arguments over here because I feel they're worth considering.

The solution to address @Technowix's concern is to not cache remote images. That's the only way to address that concern, and the concern that will arise the next time an instance admin posts loli or shota or outright CP.

The solution to address @BjarniRunar's concern about user privacy is to place the onus for user privacy on the user.

Hear me out.

It is my opinion that instance admins don't have a responsibility to protect users' privacy; that responsibility should fall squarely on the user, and here's why.

Each user has a different threat model that they're concerned about. Some might be located in repressive regimes, others might be sharing their family's computer in the living room. Is it appropriate for an instance admin to try to provide protection for users posting from China or that gay teen posting from Iran or Saudi Arabia? What about that home-school kid with strict parents who only want their kid to be exposed to ideas that they approve of? What about that kid who's using a school laptop with spyware installed that they can't shut off? What about the political dissident under surveillance by their Government for views that are contrary to accepted norms?

Each of those threat vectors can be addressed through separate means.

As I replied to Bjarni on .social, I feel that it is up to me and every other user to take privacy into our own hands. Tor is free, and if you can afford it VPNs aren't that expensive.

I mentioned CloudFlare as a viable option for instance admins as it's the most well-known and accessible CDN. It also has the benefit of protecting instances from the "slashdot effect" or if something should go viral, or from DDoS attacks if somebody posts a toot that pisses some group or person off.

In my personal opinion those should definitely be considered as viable solutions and alternatives.

(Edit: a word)

@ldidry
Contributor

ldidry commented Apr 15, 2017

@Gargron Nope, it's not the same. You said:

Right now, Mastodon downloads local copies of:

Camo downloads the images but doesn't store them anywhere. It's just an image proxy. The Varnish I suggested is there to cache the images, but only in memory (well, you can make it cache them on disk, but that's not the default behavior, at least on Debian). If you restart Varnish, you wipe the whole cache. And the cache has a limited size, so new images replace old ones in the cache.

@gellenburg

I should point out that when I say instance admins don't have an onus to protect user privacy, what I mean is that it is the users' responsibility to protect the privacy of their own web-surfing habits.

Instance admins definitely have a responsibility for ensuring that SSL is enabled, is properly configured, that their servers are regularly patched and updated, and that any security vulnerabilities they discover or that are brought to their attention are promptly taken care of.

Instance admins also have a responsibility for ensuring that the software they're running is properly configured and that they take steps to prevent any data leakage incidents (lock down their servers, don't expose configuration files and passwords to the internet, etc.)

Above that, I do not feel it is an instance admin's (or Eugen's) responsibility to try to protect every user from every real or perceived threat that may be out there. If you attempt to apply protection to the lowest common denominator of user on your system, you are not going to be able to provide effective protection for most of your users.

@Gargron
Member Author

Gargron commented Apr 15, 2017

@ldidry But ideally you'd still crop/downsize images for the end user. I just meant that the cache wiping could be made part of the current system, rather than replacing the current system with Camo.

@Tryum

Tryum commented Apr 15, 2017

Hi, this is tryum from the apoil.org instance!

This morning I threw out some ideas; I don't know if they're viable or feasible:

- If content is encrypted in storage and the keys are distributed via another channel (to decrypt the content client-side), does it still expose the admin to legal threats? (Probably state-dependent.)

- Hotlink the media to the source instance, but also distribute it via P2P (i.e. WebTorrent or any WebRTC data-channel tech...): the more viral the toot goes, the more widely distributed the media is, protecting small instances from the slashdot effect.

If those ideas are silly, please be gentle, I'm not a web techy ;)

@norio

norio commented Apr 15, 2017

@Gargron Hi, I'm pawoo.net founder (pixiv inc.).

We understand that mature images uploaded by our users are potentially problematic from a legal point of view, since they could be hosted on servers in other countries.

However, we would like to protect our users' works as much as possible unless they are illegal in Japan. We are caught in a dilemma.

As the pawoo.net admins, we would like to obligate users to flag mature content as NSFW. And as for images on the server, we propose that Mastodon...

  • not store images in the cache if they are NSFW.
  • block NSFW images from other instances, showing just an NSFW label with a link to the original content.

We comply with the law of our country and deal with our own content.

We will spare no technical effort to resolve this problem.

@DanielGilbert

The Varnish I suggested is here to cache the images, but only in memory (well, you can make it cache them on disk, but it's not the default behavior (at least on Debian)).

@MrGilbert@social.gilbert.world here.

It would be illegal to have CP or CP-like images in the non-volatile cache (aka RAM) here in my jurisdiction. Yes, it's hard to prove for law enforcement, but anyways - the laws are there. Although there are somehow some EU laws that prohibit this, they haven't been adopted into local laws.

Furthermore, I don't know if Cloudflare or any other big CDN is doing some kind of "matching" or "scanning" on the media they are delivering, so that might not be an option either, as the "mastodon train" is picking up speed (sorry for that - imagine a little mastodon sitting in a train. It's super-cute).

@Tryum Encryption might be an option. But distributed encryption is somehow complex, I guess.

@norio Cool that you are here! I guess it's not the NSFW per se, It's more the lolicon content, which is somehow problematic in western countries.

@eeeple

eeeple commented Apr 15, 2017

Just my 2 cents :

  • legal considerations are on a country by country basis, making a universal solution almost impossible.
  • there should probably be a legal warning when installing a mastodon instance, warning admins about the potential legal problems which may arise.
  • Mastodon's documentation will probably need to include a legal section where admins can consult their local laws and act accordingly. (maybe even create some kind of TL;DR like https://tldrlegal.com/)
  • caching as it is implemented right now is in its infancy, and could really use more customization. It should be up to the instance's admin to choose the content caching policy.
  • cost is to be taken into consideration, because bandwidth is expensive, and resources are limited, both financially, and technologically.

I hope this may help.

@marcan

marcan commented Apr 15, 2017

This is a fundamental disconnect between the law and technology, and I doubt there is a technical solution. CP laws are so broken around the world that even trying to police CP content can actually cause you legal grief (a team I'm part of was in the past told by a lawyer to stop filtering out known CP content hashes from a system, because that was a legal liability). Mind you, that's for real CP (real children), not loli (drawn content), but the latter is considered equivalent in some jurisdictions...

Good luck with the attempt at a technical fix, but I will be very impressed if you manage to find one. I would suggest getting a lawyer if you want to accurately evaluate the legal implications.

@spikewilliams

@gellenburg

It would be great if we could rely on users to manage their own privacy, but many - probably most - users simply don't have the depth of technical knowledge that would equip them to make good decisions in that regard, much less the skills and time to effectively implement those decisions. They will tend to default to what the platform provides. If the platform wants to protect its users, it should be proactive in providing sensible privacy features.

@DanielGilbert

Maybe, in the short term, as @norio mentioned, it might be an option to not cache NSFW flagged media. Although this would mean that we need to deal with higher traffic as an instance admin - so, at some point, people would start requesting a "block NSFW from my instance" option, to save bandwidth. Which would, in turn, mean some kind of censorship, which we don't want at all.

Gosh, maybe the technical problem is even our smallest one...

@delroth
Contributor

delroth commented Apr 15, 2017

There are images that are not NSFW and will still cause admins in certain jurisdictions legal trouble. Looking at it only from the loli angle is very American-centric.

@EzoeRyou

As I am reading this discussion, I suddenly feel like it's 15 years ago in P2P technology all over again. Why has nobody learned anything from history?

15 years ago, we were so hyped about building distributed P2P mesh networks on the Internet.
We implemented file sharing, chat, forums, blogs, web pages and everything on top of that distributed P2P mesh network.

The result: we face exactly the same issues today.

  1. Copyright infringement, child porn, and other illegal data (Nazi symbols in Germany, for example) spread all over the place.

  2. The distributed cache burdens us as the network grows, so the cost of joining the network in terms of computational power, storage, and bandwidth becomes too expensive for newcomers.

If we seriously tried to solve those problems, we would require thousands of full-time employees, money, politics and hardware comparable to Twitter, Facebook, Google or Microsoft. We would just become one of them.

At that point, there will be other people who think Mastodon is too oppressive to its users, so they will start developing alternatives which promise "a decentralized alternative to existing platforms, it avoids the risks of a single Mastodon community monopolizing your communication."

@dabura667

I agree with @Tryum

If hotlinking media from instance A is unfair when A is on a cheap server and instance B sends 5 million users fetching media from A... then the solution is simple encryption. For each piece of media, store a 4-byte nonce on instance A, then encrypt the media symmetrically with SHA256(nonce || content-identifier) as the key, to be decrypted on the client side.

So the user would fetch the encrypted blob from their instance's cache (or from instance A if it isn't cached), then fetch the 4-byte nonce from the hotlinked instance and decrypt the content locally with AES. A single SHA256 hash and AES decryption would not be too slow, unless they were viewing the content on a potato.

It would at least be slightly better than having an unencrypted CP image on your computer.
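
A minimal sketch of the scheme described above, in Ruby for illustration even though decryption would happen client-side in the actual proposal; all names are hypothetical, and note the legal caveats in the replies below:

```ruby
# Derive an AES key from a short nonce plus a content identifier, so the
# cached/hotlinked blob is opaque without the nonce held by the origin instance.
require 'openssl'
require 'securerandom'
require 'digest'

def encrypt_media(plaintext, content_id)
  nonce  = SecureRandom.random_bytes(4)               # stays on the origin instance
  key    = Digest::SHA256.digest(nonce + content_id)  # 32-byte AES-256 key
  cipher = OpenSSL::Cipher.new('aes-256-cbc')
  cipher.encrypt
  cipher.key = key
  iv = cipher.random_iv
  blob = iv + cipher.update(plaintext) + cipher.final # what gets cached or hotlinked
  [blob, nonce]
end

def decrypt_media(blob, nonce, content_id)
  key      = Digest::SHA256.digest(nonce + content_id)
  iv, data = blob[0, 16], blob[16..-1]
  cipher   = OpenSSL::Cipher.new('aes-256-cbc')
  cipher.decrypt
  cipher.key = key
  cipher.iv  = iv
  cipher.update(data) + cipher.final
end
```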

@marcan

marcan commented Apr 16, 2017

@dabura667 You really want to talk to a lawyer before implementing something like that. The law does not work like technology does. Encrypting content could plausibly be taken as deliberate action to hinder the enforcement of the law in a situation like this. You might be better off with everything in the clear and cooperating with law enforcement if and when they ask (presuming they're interested in the source of the material, not servers it may have incidentally crossed). This isn't a legal opinion, I'm just saying that might be the case and you should really talk to a lawyer to figure that out.

@DanielGilbert

Totally agree with the encryption idea. Although judges could argue that I can easily access the keys and therefore decrypt the data. Are there countries out there where encryption is illegal?

Unfortunately, it's Easter here, and family duty calls. I will have another look on the issue tomorrow.

@marcan

marcan commented Apr 16, 2017

Are there countries out there where encryption is illegal?

@DanielGilbert Yes (for various gradations of "illegal").

@DanielGilbert

That's quite a lot. o.O

Did some small research:

From what I've found so far, I don't need to block pro-actively, but upon request. So a way to block toots from foreign instances might be sufficient for now - plus a solution for the cache, I guess.

@danielcherubini

danielcherubini commented Apr 16, 2017

After some time to think I've realised that a good solution to this problem is more visibility of reporting functionality.

Reporting should work like this. If a user on my instance has something reported, then I as the admin can take responsibility for it. If that post is on a remote instance then it's reported to the instance admin.

I'm not sure if this is what's happening now, but I assume it is.

But I think that if a user on my instance has reported another user on another instance then I also want to know so I can review it and potentially apply a domain block or a user block.

Beyond this, the functionality could be expanded so that other types of blocks are available, such as:

  • media blocks
  • URL blocks
  • Cache blocks

I really think this problem could be handled better if there were better tools available to admins, or even another class of users who act as moderators, which would help large instances.

Why do I think this is a good solution? Because then users will have the power to control what they find offensive, and whole instances can become niche in what they allow.

@andy-twosticks

@nightpool I don't have any links re: prosecuting over cache data, but (a) the police of any country don't tend to pay much attention to any conventions that they do not have to pay attention to and (b) EU conventions will soon not apply in the UK anyway?

Re: encryption: Not sure where people are going with this. UK law allows the police to prosecute anyone who refuses to give up the password, AND allows them to prosecute you anyway if you genuinely don't have it. (Maybe even if they only think your 160k block of random numbers is hiding something illegal, at least in theory.)
Even if you enabled genuine e2e encryption of private toots between users, that would not help the two users -- and the admin would have to trust that no user had ever forgotten to set a toot private?

@furoshiki

furoshiki commented Apr 17, 2017

With #1865, it seems the latest version of Mastodon has resolved this problem. Is a domain-based media blocker the best solution? The pawoo.net team thinks this is a good idea.

@Miaourt

Miaourt commented Apr 17, 2017

Well, it's a bit rough right now, but at least it permits people to communicate :3
Being able to "prevent caching" while "not muted" might be cool too :o

@danielcherubini

I think this topic has steered away from the original point of this issue. In terms of media caching strategies, I don't really have any solid advice other than wanting some degree of control over where we cache from.

This topic is now closer to enhancements of the domain block system. Naturally, I think this is a worthy and important discussion to have. Domain blocks need finer-grained control, so as I suggested before, a good place to start is to add a couple more types of blocks.

  • media block
  • URL block
  • cache block

Cache blocking being the focus of this issue.

@westernotaku

Could this be handled in a way where you don't have to censor lolicon artists? We have suffered enough for no reason at all, just for expressing ourselves in the form of drawings.

Maybe forbid certain IP ranges from viewing certain tags, for example #lolicon? If I'm not mistaken, as much as possession and distribution of lolicon can be illegal in places like Australia, the United Kingdom, Canada, New Zealand and France (not the majority of Western countries, btw), it's not illegal to view it online.

@jack1243star

@westernotaku Please don't do IP or region based restrictions. It won't work with VPN and we suffered enough for no reason at all, just by living in a country without choice.

@Miaourt

Miaourt commented Apr 18, 2017

So, well, what do we do? Right now there is a "bandage" in the form of "block caching, but you must mute".
But the wound is still wide open; we still can't communicate with instances that aren't under the same legal jurisdiction as ours...

@PeterCxy

I do think the problem is more about storage / bandwidth rather than legal issues...

For me, I would just like an archiving mechanism with which old content (e.g. older than a year) can be moved to some other storage, for example a remote FTP server with abundant storage space, or an rclone-encrypted Google Drive Unlimited remote. These less-viewed files could then be safely removed from the main server to make space for newly generated content (while still being viewable on demand). Before we reach an agreement on the legal issues, implementing such a mechanism would, to me, solve the more urgent problems.
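
A sketch of that archiving mechanism: move cached files older than a cutoff into a "cold" directory, which could be an rclone or FTP mount, freeing the hot storage while keeping the files retrievable. The paths and the one-year cutoff are placeholders:

```ruby
# Illustrative cold-storage sweep for old cached media.
require 'fileutils'
require 'find'

HOT_DIR  = 'public/system/media_attachments'          # placeholder hot path
COLD_DIR = '/mnt/cold-storage/media_attachments'      # e.g. an rclone mount
MAX_AGE  = 365 * 24 * 60 * 60                         # one year, in seconds

Find.find(HOT_DIR) do |path|
  next unless File.file?(path)
  next if Time.now - File.mtime(path) < MAX_AGE       # keep recent files hot

  target = File.join(COLD_DIR, path.sub(%r{\A#{Regexp.escape(HOT_DIR)}/?}, ''))
  FileUtils.mkdir_p(File.dirname(target))
  FileUtils.mv(path, target)                          # move the cold file off the main disk
end
```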

@Artoria2e5
Contributor

@DanielGilbert
It would be illegal to have CP or CP-like images in the non-volatile cache (aka RAM) here in my jurisdiction.

Disputing the "aka RAM" part here. "Non-volatile" memory refers to storage devices that retain their information after power loss, which is just the opposite of what people have in the SDRAM slots of their motherboards. Yes, there are non-volatile types of RAM devices, but people would normally call them flash storage.

(This isn't to say that caching such images in some tmpfs ramdisk is always safe -- you may eventually bump into some swap space and accidentally write it onto the disk, for example.)

@DanielGilbert

@Artoria2e5

My fault. I meant "volatile".

https://www.heise.de/newsticker/meldung/Urteil-Kinderpornos-anklicken-ist-strafbar-931446.html

Translation of the relevant part:

"Already looking at child porn on the Internet is punishable. This follows from the existing legal situation and was now confirmed for the first time by an "Oberlandesgericht" (Higher Regional Court). Also the short-term download into the working memory, without a manual storage, brings users into the possession of the files, is stated in the reasoning of the OLG Hamburg from today's Monday."

Now, one might argue that a server cannot look at files - but I don't want to discuss that with any court here in Germany. ;)

@kensoh

kensoh commented Apr 23, 2017

Has the core team considered automatically deleting posts/images/videos after a certain time, e.g. 30 days, 2 weeks, 1 week, etc.? Or a setting for the instance owner to decide? I know that this may be a digression from the discussion here, and contrary to current Mastodon functionality.

I'm raising this because I can imagine the load that instance owners are bearing. Even if an instance owner decides to stop accepting new users, the existing users' new connections with more and more users outside of that instance may already create exponential new storage/bandwidth load on that instance.

As a large majority of instances are basically self-funded, the growth in storage/bandwidth might force some instance owners to pull the plug. That would kind of start a consolidation phase where only the instances with the deepest pockets survive, reducing the diversity of people/ideas/content that Mastodon is so good at.

Also, the appearance of Snapchat, Instagram/Facebook stories etc, may suggest that somehow people are ok with the idea that their created digital content/data do not have to persist and exist permanently. And with the fast moving info-developments now, old posts might not be relevant to someone's followers anyway.

I'm a 2-week old user who believes in the mission of Mastodon.

PS: btw I will just use the chance to say thank you very much to Mastodon maintainers and contributors =) It is just amazing a project of this scale and rapid growth is supported through an open-source community of contributors.

@Gargron
Member Author

Gargron commented Jun 29, 2017

We've implemented some new features since this issue was opened, to help deal with the problem. There are also a couple open issues for more technically specific approaches, so I believe this issue can be closed.
