Remove old db files #76

Closed
gabrielasd opened this issue Apr 25, 2024 · 5 comments
@gabrielasd
Collaborator

We need to remove the old compiled dataset versions here to avoid confusion/errors (e.g. #70).
Updated DB versions are being stored in a separate repository, AtomDBdata, and there is no reason to have duplicated information.

@gabrielasd
Collaborator Author

gabrielasd commented Apr 25, 2024

Also, in relation to this change (separating the package from the DB), I think we should consider no longer tracking the db folders in this repository (or at least their contents) with git.
@msricher , @marco-2023 what do you think?
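One way the untracking suggested above could be done is sketched here. It is only an illustration: `untrack_dir` is a hypothetical helper, not something from this repository, and the directory passed to it would be whichever db folder is meant. It relies on `git rm --cached`, which drops paths from the index while leaving the files on disk.

```shell
# Stop tracking a directory's contents while keeping the files on disk.
# Hypothetical helper. Usage: untrack_dir <repo> <dir-relative-to-repo>
untrack_dir() {
    repo=$1
    dir=$2
    # --cached removes the paths from the index only; the working tree is untouched.
    git -C "$repo" rm -r --cached --quiet -- "$dir" || return 1
    # Ignore the directory so the files are not re-added by accident.
    printf '%s/\n' "${dir%/}" >> "$repo/.gitignore"
    git -C "$repo" add .gitignore
}
```

The change still has to be committed afterwards. Note that this only affects future commits; the files stay in the existing history.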

@gabrielasd gabrielasd self-assigned this Apr 25, 2024
@marco-2023
Collaborator

We could (and probably should) also remove them from the history using git filter-repo or BFG: https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/removing-sensitive-data-from-a-repository

@gabrielasd
Collaborator Author

Commit d75f912 removed the Slater DB files from datasets/slate/db.

@msricher
Collaborator

msricher commented May 7, 2024

I uploaded a copy of the AtomDB repo, with old db files removed from the git history: https://github.com/msricher/AtomDB_clean

I'll document what I did here. This makes use of the git-filter-repo package (at least that's the name on Arch).

  1. git log --all --name-only --pretty=format:%H -- '*.msg' | grep -E '\.msg$' > msg_files
     This lists every file matching *.msg, commit by commit. Some paths appear more than once because several commits modified the same file, but that doesn't matter. The grep drops the lines containing the commit hashes, and the remaining paths are written to the file msg_files.
  2. for i in $(cat msg_files | xargs); do git filter-repo --invert-paths --path "$i"; done
     This removes each of the files listed in msg_files from the git history.
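For anyone repeating this, the two steps above could probably be collapsed as in the sketch below. This is not what was actually run here. The function names are made up for illustration, and the second function assumes git-filter-repo is installed and that the repository is a fresh clone (git filter-repo otherwise refuses to run unless forced). It uses filter-repo's `--path-glob` option so the whole history is rewritten in one pass instead of once per file.

```shell
# List every distinct path ending in .msg that appears in any commit.
# Hypothetical helper. Usage: list_msg_paths <repo>
list_msg_paths() {
    # An empty --pretty format suppresses commit headers, so only paths
    # (and blank separator lines) are printed; grep drops the blanks.
    git -C "$1" log --all --name-only --pretty=format: -- '*.msg' |
        grep -E '\.msg$' | sort -u
}

# Remove all *.msg paths from history in a single pass.
# Assumes git-filter-repo is installed and <repo> is a fresh clone.
# Hypothetical helper. Usage: purge_msg_files <repo>
purge_msg_files() {
    git -C "$1" filter-repo --invert-paths --path-glob '*.msg'
}
```

Running `list_msg_paths` first gives a chance to review what would be deleted before anything is rewritten.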

Before I force-push this over the main repo, would either of you verify that the result in my repo looks correct? @gabrielasd @marco-2023
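A quick check along these lines might help with the verification. It is a sketch only: `history_is_clean` is a hypothetical name, and it tests just one property, namely that no commit reachable from any ref still touches a *.msg path. It says nothing about whether the rest of the history is intact.

```shell
# Succeeds (exit status 0) only if no commit reachable from any ref
# touches a path ending in .msg.
# Hypothetical helper. Usage: history_is_clean <repo>
history_is_clean() {
    if git -C "$1" log --all --name-only --pretty=format: -- '*.msg' |
        grep -E '\.msg$' > /dev/null; then
        return 1    # at least one .msg path is still in the history
    fi
    return 0
}
```

For example, `history_is_clean path/to/clone && echo clean` run against a clone of the cleaned repository should print `clean` if the rewrite worked.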

@gabrielasd
Collaborator Author

This issue has been resolved as indicated by Michelle.
