50MB GitHub max size on tweet.db file #4

Open
zachleat opened this issue Nov 21, 2022 · 12 comments
Labels
bug Something isn't working enhancement New feature or request

Comments

@zachleat
Contributor

GitHub has a 50MB max, which my personal archive tweet.db has hit.

https://docs.github.com/en/repositories/working-with-files/managing-large-files/about-large-files-on-github

We might want to shard this (yearly?) for larger archives
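A minimal sketch of what yearly sharding could look like, assuming a hypothetical `tweets` table whose `created_at` column holds ISO 8601 strings (so the first four characters are the year). The table name, column name, and `tweet-<year>.db` file naming are illustrative assumptions, not the project's actual schema.

```python
import sqlite3
from pathlib import Path


def shard_by_year(source_db: str, out_dir: str) -> list[str]:
    """Copy rows from source_db into one SQLite file per calendar year.

    Assumes a hypothetical table `tweets` with a `created_at` column
    stored as an ISO 8601 string such as "2022-11-21T10:00:00Z".
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    src = sqlite3.connect(source_db)
    paths = []
    try:
        # Reuse the original CREATE TABLE statement so every shard shares the schema.
        create_sql = src.execute(
            "SELECT sql FROM sqlite_master WHERE type = 'table' AND name = 'tweets'"
        ).fetchone()[0]
        years = [
            row[0]
            for row in src.execute(
                "SELECT DISTINCT substr(created_at, 1, 4) FROM tweets ORDER BY 1"
            )
        ]
        for year in years:
            shard_path = out / f"tweet-{year}.db"
            shard_path.unlink(missing_ok=True)  # start each shard from scratch
            shard = sqlite3.connect(shard_path)
            try:
                shard.execute(create_sql)
                cur = src.execute(
                    "SELECT * FROM tweets WHERE substr(created_at, 1, 4) = ?",
                    (year,),
                )
                placeholders = ",".join("?" * len(cur.description))
                shard.executemany(
                    f"INSERT INTO tweets VALUES ({placeholders})", cur
                )
                shard.commit()
            finally:
                shard.close()
            paths.append(str(shard_path))
    finally:
        src.close()
    return paths
```

A very active year could still produce a shard above the size limit, so a real implementation would probably want to check shard sizes after writing them.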

@zachleat zachleat added bug Something isn't working enhancement New feature or request labels Nov 21, 2022
@nhoizey
Contributor

nhoizey commented Nov 21, 2022

My own tweet.db already weighs more than 91MB, so I agree!

@nhoizey
Contributor

nhoizey commented Nov 21, 2022

@zachleat did you consider using Markdown or JSON files for each tweet, instead of an sqlite DB?

It would remove this issue with a single large file.

But I guess it might be more difficult to manage, or slower for the build.

@zachleat
Contributor Author

I think historically I moved to sqlite for performance reasons, yeah, especially around memory in pagination.

BUT I also made a bunch of performance/memory improvements to Eleventy pagination in 2.0 that apply very directly here so… I’m not sure 😅

It wouldn’t be a small change to move away from sqlite though.

@zachleat
Contributor Author

zachleat commented Nov 21, 2022

I do want to mention a short-term workaround here: run builds locally and commit your _site folder output to GitHub for deployment.

This has the side benefit of not requiring your entire twitter history in source control as a nice database for people to use 😅

@nhoizey
Contributor

nhoizey commented Nov 21, 2022

Nice trick indeed! 👍

rknightuk added a commit to rknightuk/hellsite that referenced this issue Nov 21, 2022
@zachleat
Contributor Author

I did want to note one other path forward here that would be a smaller lift than moving away from sqlite. The tweets.db does contain a full copy of the tweet JSON in the database. It may be easier to trim this down? Not completely sure, just throwing another idea out there.
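To make the trimming idea concrete, here is a rough sketch that rewrites each stored JSON blob down to an allow-list of keys and then compacts the file. The `tweets(id, json)` schema and the contents of `KEEP_KEYS` are assumptions for illustration; the real set of fields would depend on what the templates actually read.

```python
import json
import sqlite3

# Hypothetical allow-list of top-level tweet fields worth keeping.
KEEP_KEYS = {
    "id_str",
    "full_text",
    "created_at",
    "in_reply_to_status_id_str",
    "entities",
}


def trim_tweet_json(raw: str, keep=KEEP_KEYS) -> str:
    """Return the tweet JSON with only the allow-listed top-level keys."""
    tweet = json.loads(raw)
    trimmed = {key: value for key, value in tweet.items() if key in keep}
    # Compact separators avoid storing whitespace in every row.
    return json.dumps(trimmed, separators=(",", ":"), ensure_ascii=False)


def trim_database(db_path: str) -> None:
    """Trim every row of a hypothetical tweets(id, json) table in place."""
    con = sqlite3.connect(db_path)
    try:
        rows = con.execute("SELECT id, json FROM tweets").fetchall()
        con.executemany(
            "UPDATE tweets SET json = ? WHERE id = ?",
            [(trim_tweet_json(raw), tweet_id) for tweet_id, raw in rows],
        )
        con.commit()
        # VACUUM rebuilds the file so the freed pages actually shrink it on disk.
        con.execute("VACUUM")
    finally:
        con.close()
```

Without the `VACUUM` step the file would keep its old size, because SQLite reuses freed pages rather than returning them to the filesystem.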

Feels like a yearly sharding might be the least amount of work, tbh.

@nhoizey
Contributor

nhoizey commented Nov 22, 2022

Removing useless parts of the JSON would be nice, but I'm not sure it would be enough for people with a lot of tweets.

Sharding is a good idea. 👍

Would it make some features more difficult, like assembling threads?
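One way to picture the extra work: a thread can cross a year boundary, so each parent lookup has to try every shard rather than a single file. The sketch below walks reply links upward across several open connections. It assumes a hypothetical `tweets(id, in_reply_to_id, text)` table in each shard, which may not match the real schema.

```python
import sqlite3


def find_tweet(shards, tweet_id):
    """Return the first matching (id, in_reply_to_id, text) row across shards, or None."""
    for con in shards:
        row = con.execute(
            "SELECT id, in_reply_to_id, text FROM tweets WHERE id = ?",
            (tweet_id,),
        ).fetchone()
        if row is not None:
            return row
    return None


def assemble_thread(shards, tweet_id):
    """Follow reply links upward from tweet_id and return the thread oldest-first.

    `shards` is a list of open sqlite3 connections, one per yearly file.
    """
    thread = []
    seen = set()  # guards against accidental reply cycles in the data
    current = tweet_id
    while current and current not in seen:
        seen.add(current)
        row = find_tweet(shards, current)
        if row is None:
            break  # parent is not in the archive (for example, someone else's tweet)
        thread.append(row)
        current = row[1]
    return list(reversed(thread))
```

Each hop costs one lookup per shard in the worst case, so thread assembly gets slower as the number of shards grows, but it stays possible.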

@tbroyer

tbroyer commented Nov 24, 2022

I'd argue that maybe the tweet.db might not need to be committed into the repository as it can be rebuilt entirely from the tweets.js without any external dependency. You'd want to cache it in GitHub Actions, but possibly not commit it.
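As a rough illustration of rebuilding the database at build time, the sketch below parses a `tweets.js` file and writes a fresh SQLite file. It assumes the archive file is a JavaScript assignment followed by a JSON array, with each entry optionally wrapped in a `tweet` key, and it uses a hypothetical `tweets(id, created_at, json)` table; the project's own import script may work differently.

```python
import json
import sqlite3


def load_tweets_js(path: str) -> list:
    """Parse an archive file assumed to look like `<assignment> = [ ...JSON array... ]`."""
    with open(path, encoding="utf-8") as handle:
        text = handle.read()
    start = text.index("[")  # skip the JavaScript assignment prefix
    # raw_decode tolerates trailing characters after the array, such as a semicolon.
    entries, _end = json.JSONDecoder().raw_decode(text, start)
    return entries


def build_db(tweets_js: str, db_path: str) -> int:
    """Create or refresh a hypothetical tweets(id, created_at, json) table; return row count."""
    entries = load_tweets_js(tweets_js)
    con = sqlite3.connect(db_path)
    try:
        con.execute(
            "CREATE TABLE IF NOT EXISTS tweets "
            "(id TEXT PRIMARY KEY, created_at TEXT, json TEXT)"
        )
        rows = []
        for entry in entries:
            tweet = entry.get("tweet", entry)  # unwrap {"tweet": {...}} if present
            rows.append(
                (tweet["id_str"], tweet.get("created_at"), json.dumps(tweet))
            )
        con.executemany("INSERT OR REPLACE INTO tweets VALUES (?, ?, ?)", rows)
        con.commit()
        return len(rows)
    finally:
        con.close()
```

With something like this running in CI, the generated database could be listed in `.gitignore` and restored from a build cache instead of being committed.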

Regarding tweets.js, which can grow big too, then it could easily be split into several smaller files if needed.

@AramZS

AramZS commented Dec 23, 2022

Yeah, I've put tweets.js and tweet.db into my gitignore. Alternatively, if you really want to commit those types of large files, I have some in my static sites (videos and such) and I just use https://git-lfs.com.

For reference, my tweet.db for >16k tweets is 243.1MB, which I think is pretty reasonable?

@aarongustafson
Contributor

I’d also vote in favor of annual shards.

@zachleat
Contributor Author

I do want to note, for folks that are checking their _site folders in: I would highly recommend not checking in the _site/_pagefind folder, and running the npx pagefind command on your CI server instead. Here’s what I used for my @zachleat archive: zachleat/tweetback-zachleat@324e965

@cooljeanius

mine is 123MB for >100k tweets


6 participants