# Managing Large Data Files in Bioinformatics Projects


## The Challenge: Large Data Files in Version Control

GitHub has strict file size limitations:
- Individual files must be under 100MB (GitHub already warns at 50MB)
- Repositories have a strict size limit of 100GB
- Recommended repository size is under 5GB for optimal performance

In bioinformatics, we frequently work with large data files such as:
- FASTQ sequencing files (often multiple GB each)
- BAM/SAM alignment files
- Reference genomes

Attempting to commit these files and push them to GitHub will likely result in errors.

Even committing large files to your local repository before pushing to GitHub can create a large headache, because the files stay in the repository history even after you delete them. I personally found this out by making multiple commits with the GRCh38 reference genome among my files.
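One way to catch oversized files before they end up in a commit is to scan the working tree for anything above a size threshold. This is a minimal sketch, assuming a shell with GNU or BSD `find`; the function name `list_large_files` and the 50 MiB default are illustrative choices, not something Git or GitHub prescribes.

```shell
# Print every regular file above a size threshold, skipping Git's own data.
# $1 = directory to scan (default: current directory)
# $2 = threshold in MiB (default: 50)
list_large_files() {
    find "${1:-.}" -name .git -prune -o -type f -size "+${2:-50}M" -print
}

# Scan the current project for files larger than 50 MiB.
list_large_files . 50
```

Anything this prints is a candidate for your `tmp` folder or for an ignore rule.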

## Your options

1. Delete large files before you commit
    - example: keeping all large files in a `tmp` folder and then clearing it before you commit
2. Completely ignore certain types of files so they are never committed
    - example: using a config file like `.gitignore` to prevent certain files or folders from being committed

## My Solution: gitignore

I create a `.gitignore` file so that large files (like FASTQ files) are never committed or pushed to GitHub.

I find that writing a segment of code to clear my `tmp` folder, or manually deleting files, makes me prone to error. Creating a `.gitignore` when setting up my project lets me make adjustments as I need, and I don't have to go back and clean old commits because I forgot to run a Jupyter cell or delete a file that accidentally got placed in a forgotten folder.


## Setting Up a .gitignore

The `.gitignore` file tells Git which files to ignore. 

Here's an example `.gitignore` file for a typical bioinformatics project:
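A minimal sketch of what such a file might contain. The extensions cover the file types mentioned above; the `tmp/` folder name is an assumption, so adjust the patterns to your own layout.

```gitignore
# Raw sequencing reads
*.fastq
*.fastq.gz
*.fq
*.fq.gz

# Alignment files and their indexes
*.sam
*.bam
*.bai

# Reference genomes and their indexes
*.fa
*.fa.gz
*.fasta
*.fai

# Scratch folder for large intermediate files (assumed name)
tmp/

# Jupyter checkpoint folders
.ipynb_checkpoints/
```

Note that patterns like `*.fa` also match small FASTA files. If you want to keep a small test file under version control, add an exception such as `!tests/example.fa` (a hypothetical path) after the pattern.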


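To check that a rule actually matches, `git check-ignore -v` reports which `.gitignore` line is responsible for ignoring a path. Ignore rules only apply to untracked files, so a file that was already added needs `git rm --cached` to stop being tracked. The sketch below runs in a throwaway repository; the file names are empty stand-ins chosen for illustration.

```shell
# Throwaway repository so the commands can be tried safely.
demo=$(mktemp -d) && cd "$demo" && git init -q
mkdir data
: > data/sample_R1.fastq.gz    # empty stand-in for a real FASTQ file
: > data/GRCh38.fa             # empty stand-in for a reference genome

# Simulate the mistake: the genome is staged before any ignore rule exists.
git add data/GRCh38.fa
printf '*.fastq.gz\n*.fa\n' > .gitignore

# Show which .gitignore rule matches an untracked path.
git check-ignore -v data/sample_R1.fastq.gz

# The genome is still staged, because ignore rules skip tracked files.
# Remove it from the index while keeping the file on disk.
git rm --cached -q data/GRCh38.fa
git status --short
```

After the last step, only `.gitignore` itself is left for Git to pick up.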
## But I already have large files committed and am getting an error
If, like me, you accidentally committed a large file, or too many large files over multiple commits, and now you're getting an error, check out [BFG repo cleaner](https://rtyley.github.io/bfg-repo-cleaner/).
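The steps on the BFG page can be sketched roughly as follows. This assumes Java is installed and that you have downloaded the BFG jar yourself; the function name `strip_large_blobs`, the `repo-mirror.git` folder name, and the arguments are hypothetical. Treat it as a starting point and read the BFG documentation before rewriting any real history.

```shell
# Hypothetical helper: strip blobs above a size limit from a repository's history.
# $1 = repository URL or path
# $2 = path to the downloaded BFG jar
# $3 = size limit understood by BFG (default: 100M)
strip_large_blobs() {
    repo_url=$1
    bfg_jar=$2
    limit=${3:-100M}

    # Work on a fresh mirror clone so the original checkout is untouched.
    git clone --mirror "$repo_url" repo-mirror.git || return 1

    # Remove every blob larger than the limit from the history.
    java -jar "$bfg_jar" --strip-blobs-bigger-than "$limit" repo-mirror.git || return 1

    # Expire old references and physically delete the unwanted data.
    (
        cd repo-mirror.git || exit 1
        git reflog expire --expire=now --all &&
        git gc --prune=now --aggressive
    )

    # Pushing rewrites the remote history, so it is left as a manual step:
    #   cd repo-mirror.git && git push
}
```

By default BFG leaves the contents of your latest commit alone, so delete the large file and commit that change before running it. Anyone else working on the repository will need a fresh clone afterwards.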