# Create TAR file of FASTQs and upload to SRA
This Python Jupyter notebook creates a `*.tar` file of the FASTQs and uploads it to the Sequence Read Archive (SRA).

First, import Python modules:

In [1]:
import datetime
import ftplib
import os
import tarfile

import pandas as pd

import yaml

Read in the upload configuration and the table of FASTQ files to upload:

In [2]:
with open('upload_config.yaml') as f:
    config = yaml.safe_load(f)
    
fastqs = pd.read_csv('FASTQs_to_upload.csv')
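
Because building the archive is slow, it may be worth checking the listing first. The later cells use the `filename` and `filename_fullpath` columns of the table, and the final check compares sets of names, so a duplicated name would go unnoticed there. The following is a minimal sketch of such a check; the helper name `check_fastq_listing` is hypothetical and not part of the original notebook.

```python
import os
from collections import Counter


def check_fastq_listing(filenames, fullpaths):
    """Return a list of problems with the listing (empty if none found)."""
    filenames, fullpaths = list(filenames), list(fullpaths)
    problems = []
    # names become archive member names, so each should appear only once
    for name, count in Counter(filenames).items():
        if count > 1:
            problems.append(f"{name} is listed {count} times")
    # every source path must point to an existing file
    for path in fullpaths:
        if not os.path.isfile(path):
            problems.append(f"{path} does not exist")
    return problems
```

In this notebook it could be called as `check_fastq_listing(fastqs['filename'], fastqs['filename_fullpath'])`, stopping if the returned list is not empty.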

Now we need to make a `*.tar` file with all of the files.
Note that this step **will take a long time to run**.

In [3]:
tar_filename = 'SRA_submission.tar'

try:
    with tarfile.open(tar_filename, mode='w') as f:
        for i, tup in enumerate(fastqs.itertuples()):
            print(f"Adding file {i + 1} of {len(fastqs)} to {tar_filename}")
            f.add(tup.filename_fullpath, arcname=tup.filename)
        print(f"Added all files to {tar_filename}")
except BaseException:
    # remove the partial archive on any failure or interruption, then re-raise
    if os.path.isfile(tar_filename):
        os.remove(tar_filename)
    raise

Adding file 1 of 100 to SRA_submission.tar
Adding file 2 of 100 to SRA_submission.tar
Adding file 3 of 100 to SRA_submission.tar
Adding file 4 of 100 to SRA_submission.tar
Adding file 5 of 100 to SRA_submission.tar
Adding file 6 of 100 to SRA_submission.tar
Adding file 7 of 100 to SRA_submission.tar
Adding file 8 of 100 to SRA_submission.tar
Adding file 9 of 100 to SRA_submission.tar
Adding file 10 of 100 to SRA_submission.tar
Adding file 11 of 100 to SRA_submission.tar
Adding file 12 of 100 to SRA_submission.tar
Adding file 13 of 100 to SRA_submission.tar
Adding file 14 of 100 to SRA_submission.tar
Adding file 15 of 100 to SRA_submission.tar
Adding file 16 of 100 to SRA_submission.tar
Adding file 17 of 100 to SRA_submission.tar
Adding file 18 of 100 to SRA_submission.tar
Adding file 19 of 100 to SRA_submission.tar
Adding file 20 of 100 to SRA_submission.tar
Adding file 21 of 100 to SRA_submission.tar
Adding file 22 of 100 to SRA_submission.tar
Adding file 23 of 100 to SRA_submission.t
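
The `arcname` argument used in the cell above sets the name under which a file is stored inside the archive, independent of where it lives on disk. The small sketch below illustrates this on a throwaway file. The helper `add_flat` and the file names are hypothetical, and it uses the base name of each path as the member name, whereas the notebook takes the name from the `filename` column.

```python
import os
import tarfile
import tempfile


def add_flat(tar_path, fullpaths):
    """Write files to an uncompressed tar, storing each under its base name."""
    with tarfile.open(tar_path, mode='w') as tar:
        for path in fullpaths:
            # arcname replaces the on-disk path as the member name
            tar.add(path, arcname=os.path.basename(path))
    with tarfile.open(tar_path) as tar:
        return tar.getnames()


with tempfile.TemporaryDirectory() as tmp:
    nested = os.path.join(tmp, 'run1', 'lane2')
    os.makedirs(nested)
    fastq = os.path.join(nested, 'sample_R1.fastq.gz')
    with open(fastq, 'wb') as fh:
        fh.write(b'@read1\nACGT\n+\nIIII\n')
    names = add_flat(os.path.join(tmp, 'demo.tar'), [fastq])
    print(names)  # ['sample_R1.fastq.gz']
```

Storing flat names like this means the archive unpacks into a single directory, without the directory structure of the source files.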

Check the size of the `*.tar` file to upload and make sure it contains the expected files:

In [4]:
print(f"The size of {tar_filename} is {os.path.getsize(tar_filename) / 1e9:.1f} GB")

with tarfile.open(tar_filename) as f:
    files_in_tar = set(f.getnames())
if files_in_tar == set(fastqs['filename']):
    print(f"{tar_filename} contains all {len(files_in_tar)} expected files.")
else:
    raise ValueError(f"{tar_filename} does not contain exactly the expected files.")

The size of SRA_submission.tar is 17.2 GB
SRA_submission.tar contains all 100 expected files.
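
If the check above fails, the error does not say which files are at fault. A minimal sketch of a more detailed comparison follows; the helper name `diff_tar_members` is hypothetical and not part of the original notebook.

```python
import tarfile


def diff_tar_members(tar_path, expected_names):
    """Return (missing, unexpected) member names as sorted lists."""
    with tarfile.open(tar_path) as tar:
        found = set(tar.getnames())
    expected = set(expected_names)
    # missing: expected but not in the archive; unexpected: the reverse
    return sorted(expected - found), sorted(found - expected)
```

In this notebook it could be called as `missing, unexpected = diff_tar_members(tar_filename, fastqs['filename'])`; both lists are empty when the archive matches the table.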


We now read in the details for the FTP upload to the SRA.
The username and folders come from `upload_config.yaml` (read above), and the password is stored separately in `ftp_password.txt`:

In [5]:
# the following are provided by the SRA wizard instructions
ftp_address = 'ftp-private.ncbi.nlm.nih.gov'
ftp_username = config['ftp_username'].strip()
ftp_account_folder = config['ftp_account_folder'].strip()
ftp_subfolder = config['ftp_subfolder'].strip()
with open('ftp_password.txt') as f:
    ftp_password = f.read().strip()
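
A missing or blank entry in the configuration would otherwise only surface as a `KeyError` or a failed login. The sketch below checks the three keys read above in one place; the helper name `get_ftp_settings` and the `REQUIRED_KEYS` constant are hypothetical, not part of the original notebook.

```python
REQUIRED_KEYS = ('ftp_username', 'ftp_account_folder', 'ftp_subfolder')


def get_ftp_settings(config):
    """Return the stripped FTP settings, raising if any is missing or blank."""
    settings = {}
    for key in REQUIRED_KEYS:
        # treat an absent key, a null value, and whitespace the same way
        value = str(config.get(key) or '').strip()
        if not value:
            raise KeyError(f"no value for {key} in the upload configuration")
        settings[key] = value
    return settings
```

Calling `get_ftp_settings(config)` on the dictionary loaded from `upload_config.yaml` returns a plain dictionary with the same three keys.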

Now create the FTP connection and upload the TAR file.
Note that this takes a while.
If you are worried that the notebook will time out given the size of your file, you can run it as a `slurm` job instead:

In [6]:
print(f"Starting upload at {datetime.datetime.now()}")

with ftplib.FTP(ftp_address) as ftp:
    ftp.login(user=ftp_username,
              passwd=ftp_password,
              )
    ftp.cwd(ftp_account_folder)
    ftp.mkd(ftp_subfolder)
    ftp.cwd(ftp_subfolder)
    with open(tar_filename, 'rb') as f:
        ftp.storbinary(f"STOR {tar_filename}", f)
        
print(f"Finished upload at {datetime.datetime.now()}")

Starting upload at 2023-07-26 12:40:37.927060
Finished upload at 2023-07-26 16:01:32.436859
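
For an upload that runs for hours, two small additions may help: progress reporting, and tolerating a subfolder that already exists when the notebook is re-run. The sketch below relies on the `callback` argument of `ftplib.FTP.storbinary`, which is called with each block of data after it is sent. The names `UploadProgress` and `ensure_remote_dir` are hypothetical, and treating a permanent error from `mkd` as "folder already exists" is an assumption about how the server replies.

```python
import ftplib


class UploadProgress:
    """Callable that counts bytes sent; pass it as the storbinary callback."""

    def __init__(self, total_bytes, report_every=1e9):
        self.total_bytes = total_bytes
        self.report_every = report_every
        self.sent = 0
        self._next_report = report_every

    def __call__(self, block):
        # storbinary passes each block of data after sending it
        self.sent += len(block)
        if self.sent >= self._next_report:
            print(f"Sent {self.sent / 1e9:.1f} of {self.total_bytes / 1e9:.1f} GB")
            self._next_report += self.report_every


def ensure_remote_dir(ftp, dirname):
    """Create dirname on the server if needed, then change into it."""
    try:
        ftp.mkd(dirname)
    except ftplib.error_perm:
        # assumption: a permanent error here means the folder already exists;
        # if it failed for another reason, the cwd below raises instead
        pass
    ftp.cwd(dirname)
```

In the upload cell, `ensure_remote_dir(ftp, ftp_subfolder)` could stand in for the `mkd` and second `cwd` calls, and `callback=UploadProgress(os.path.getsize(tar_filename))` could be passed to `ftp.storbinary`.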
