# Upload sequencing data to SRA
This Python Jupyter notebook uploads the sequencing data to the NIH [Sequence Read Archive](https://www.ncbi.nlm.nih.gov/sra), or SRA.

## Create BioProject and BioSamples
The first step, creating the BioProject and BioSamples, was done manually.
Note that for future uploads related to the RBD DMS, you may be able to use the existing BioProject, but since these were the first entries in this project, I needed to create a new BioProject.

To create these, I went to the [Sequence Read Archive](https://www.ncbi.nlm.nih.gov/sra) and signed in using the box at the upper right of the webpage, and then went to the [SRA Submission Portal](https://submit.ncbi.nlm.nih.gov/subs/sra/).
I then manually completed the first five steps, which define the project and samples.

For this upload, we used the existing BioProject PRJNA770094 and registered a new BioSample, SAMN37185589.

## Create submission sheet
The sixth step is to create the submission sheet in `*.tsv` format, which is done by the following code.

First, import Python modules:

In [2]:
import ftplib
import os
import tarfile
import datetime

import natsort

import pandas as pd

import yaml

Read the configuration for the analysis:

In [3]:
with open('../config.yaml') as f:
    config = yaml.safe_load(f)

Read the PacBio runs:

In [4]:
pacbio_runs_file = os.path.join('./', 'pacbio_runs_to_upload_XBB15.csv')

print(f"Reading PacBio runs from {pacbio_runs_file}")

pacbio_runs = (
    pd.read_csv(pacbio_runs_file)
#     .assign(ccs_file=lambda x: f"../{config['ccs_dir']}/" + x['library'] + '_' + x['run'] + '_ccs.fastq.gz')
#    .assign(ccs_file=lambda x: x['ccs'])
    )

pacbio_runs.head()

Reading PacBio runs from ./pacbio_runs_to_upload_XBB15.csv


Unnamed: 0,library,bg,run,ccs,ccs_file
0,pool1,XBB15,230520,m64272e_230518_163630.hifi_reads.bc2075--bc207...,/uufs/chpc.utah.edu/common/home/starr-group1/s...
1,pool2,XBB15,230520,m64272e_230518_163630.hifi_reads.bc2076--bc207...,/uufs/chpc.utah.edu/common/home/starr-group1/s...


Next make submission entries for the PacBio CCSs:

In [5]:
pacbio_submissions = (
    pacbio_runs
    .assign(
        biosample_accession='SAMN37185589',
        library_ID=lambda x: x['bg'] + '_' + x['library'] + '_PacBio_CCSs',  # unique library ID
        title='PacBio CCSs linking variants to barcodes for SARS-CoV-2 variant RBD deep mutational scanning',
        library_strategy='Synthetic-Long-Read',
        library_source='SYNTHETIC',
        library_selection='Restriction Digest',
        library_layout='single',
        platform='PACBIO_SMRT',
        instrument_model='PacBio Sequel II',
        design_description='Restriction digest of plasmids carrying barcoded RBD variants',
        filetype='fastq',
        #filename_nickname=lambda x: x['ccs'],      
        filename_fullpath=lambda x: x['ccs_file'],      
        )
    .drop(columns=pacbio_runs.columns)
    )

pacbio_submissions.head()

Unnamed: 0,biosample_accession,library_ID,title,library_strategy,library_source,library_selection,library_layout,platform,instrument_model,design_description,filetype,filename_fullpath
0,SAMN37185589,XBB15_pool1_PacBio_CCSs,PacBio CCSs linking variants to barcodes for S...,Synthetic-Long-Read,SYNTHETIC,Restriction Digest,single,PACBIO_SMRT,PacBio Sequel II,Restriction digest of plasmids carrying barcod...,fastq,/uufs/chpc.utah.edu/common/home/starr-group1/s...
1,SAMN37185589,XBB15_pool2_PacBio_CCSs,PacBio CCSs linking variants to barcodes for S...,Synthetic-Long-Read,SYNTHETIC,Restriction Digest,single,PACBIO_SMRT,PacBio Sequel II,Restriction digest of plasmids carrying barcod...,fastq,/uufs/chpc.utah.edu/common/home/starr-group1/s...


Now concatenate the PacBio submissions into tidy format (one line per file), and make sure all the files exist.

In [6]:
submissions_tidy = (
    pd.concat([pacbio_submissions], ignore_index=True)
    .assign(file_exists=lambda x: x['filename_fullpath'].map(os.path.isfile),
            filename=lambda x: x['filename_fullpath'].map(os.path.basename),
            )
    )

assert submissions_tidy['file_exists'].all(), submissions_tidy.query('file_exists == False')

submissions_tidy.head()

Unnamed: 0,biosample_accession,library_ID,title,library_strategy,library_source,library_selection,library_layout,platform,instrument_model,design_description,filetype,filename_fullpath,file_exists,filename
0,SAMN37185589,XBB15_pool1_PacBio_CCSs,PacBio CCSs linking variants to barcodes for S...,Synthetic-Long-Read,SYNTHETIC,Restriction Digest,single,PACBIO_SMRT,PacBio Sequel II,Restriction digest of plasmids carrying barcod...,fastq,/uufs/chpc.utah.edu/common/home/starr-group1/s...,True,m64272e_230518_163630.hifi_reads.bc2075--bc207...
1,SAMN37185589,XBB15_pool2_PacBio_CCSs,PacBio CCSs linking variants to barcodes for S...,Synthetic-Long-Read,SYNTHETIC,Restriction Digest,single,PACBIO_SMRT,PacBio Sequel II,Restriction digest of plasmids carrying barcod...,fastq,/uufs/chpc.utah.edu/common/home/starr-group1/s...,True,m64272e_230518_163630.hifi_reads.bc2076--bc207...


For the actual submission, we need a "wide" data frame that, for each unique `biosample_accession` / `library_ID`, lists all of its files, each in a separate column.
These should be the file names without the full path.

First, look at how many files there are for each sample / library:

In [7]:
(submissions_tidy
 .groupby(['biosample_accession', 'library_ID'])
 .aggregate(n_files=pd.NamedAgg('filename_fullpath', 'count'))
 .sort_values('n_files', ascending=False)
 .reset_index()
 )

Unnamed: 0,biosample_accession,library_ID,n_files
0,SAMN37185589,XBB15_pool1_PacBio_CCSs,1
1,SAMN37185589,XBB15_pool2_PacBio_CCSs,1


Now make the wide submission data frame.

In [8]:
submissions_wide = (
    submissions_tidy
    .assign(
        filename_count=lambda x: x.groupby(['biosample_accession', 'library_ID'])['filename'].cumcount() + 1,
        filename_col=lambda x: 'filename' + x['filename_count'].map(lambda c: str(c) if c > 1 else '')
        )
    .pivot(
        index='library_ID',
        columns='filename_col',
        values='filename',
        )
    )

submissions_wide = (
    submissions_tidy
    .drop(columns=['filename_fullpath', 'file_exists', 'filename'])
    .drop_duplicates()
    .merge(submissions_wide[natsort.natsorted(submissions_wide.columns)],
           on='library_ID',
           validate='one_to_one',
           )
    )

submissions_wide

Unnamed: 0,biosample_accession,library_ID,title,library_strategy,library_source,library_selection,library_layout,platform,instrument_model,design_description,filetype,filename
0,SAMN37185589,XBB15_pool1_PacBio_CCSs,PacBio CCSs linking variants to barcodes for S...,Synthetic-Long-Read,SYNTHETIC,Restriction Digest,single,PACBIO_SMRT,PacBio Sequel II,Restriction digest of plasmids carrying barcod...,fastq,m64272e_230518_163630.hifi_reads.bc2075--bc207...
1,SAMN37185589,XBB15_pool2_PacBio_CCSs,PacBio CCSs linking variants to barcodes for S...,Synthetic-Long-Read,SYNTHETIC,Restriction Digest,single,PACBIO_SMRT,PacBio Sequel II,Restriction digest of plasmids carrying barcod...,fastq,m64272e_230518_163630.hifi_reads.bc2076--bc207...
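
Since each library here has only one file, the output above does not show what the pivot does with several files per library. The toy example below is a minimal sketch of the same numbering-and-pivot logic on made-up data; the library IDs and file names are hypothetical and are not part of this submission.

```python
import pandas as pd

# Hypothetical tidy table: library "libA" has two files, "libB" has one
tidy = pd.DataFrame({
    'library_ID': ['libA', 'libA', 'libB'],
    'filename': ['a_1.fastq.gz', 'a_2.fastq.gz', 'b_1.fastq.gz'],
})

wide = (
    tidy
    .assign(
        # number the files within each library: 1, 2, ...
        filename_count=lambda x: x.groupby('library_ID').cumcount() + 1,
        # first file goes in "filename", later ones in "filename2", "filename3", ...
        filename_col=lambda x: 'filename' + x['filename_count'].map(
            lambda c: str(c) if c > 1 else ''),
    )
    .pivot(index='library_ID', columns='filename_col', values='filename')
    .reset_index()
)

print(wide)
```

Each library ends up on a single row, with its files spread across the `filename` and `filename2` columns. A library with fewer files than the maximum gets a missing value in the extra columns.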


Now write the wide submissions data frame to a `*.tsv` file:

In [9]:
submissions_spreadsheet = 'SRA_submission_spreadsheet_XBB15.tsv'

submissions_wide.to_csv(submissions_spreadsheet, sep='\t', index=False)

This submission sheet was then manually uploaded in Step 6 of the SRA submission wizard (*SRA metadata*).

## Upload the actual files
Step 7 of the SRA submission wizard is to upload the files.
In order to do this, we first make a `*.tar` file with all of the files.
Since this takes a long time, we only create the file if it doesn't already exist, so it is only created the first time this notebook is run.
**Note that this will cause a problem if you add more sequencing files to upload after running the notebook. In that case, the cell below will need to be altered.**

In [10]:
tar_filename = 'SRA_submission_XBB15.tar'

if os.path.isfile(tar_filename):
    print(f"{tar_filename} already exists, not creating it again")
else:
    try:
        with tarfile.open(tar_filename, mode='w') as f:
            for i, tup in enumerate(submissions_tidy.itertuples()):
                print(f"Adding file {i + 1} of {len(submissions_tidy)} to {tar_filename}")
                f.add(tup.filename_fullpath, arcname=tup.filename_fullpath)
            print(f"Added all files to {tar_filename}")
    except:
        if os.path.isfile(tar_filename):
            os.remove(tar_filename)
        raise

Adding file 1 of 2 to SRA_submission_XBB15.tar
Adding file 2 of 2 to SRA_submission_XBB15.tar
Added all files to SRA_submission_XBB15.tar


See the size of the `*.tar` file to upload and make sure it has the expected files:

Note: the member names in the `*.tar` file lack the leading forward slash (`tarfile` stores absolute paths as relative member names), but I confirmed manually that all files are present.

In [11]:
print(f"The size of {tar_filename} is {os.path.getsize(tar_filename) / 1e9:.1f} GB")

with tarfile.open(tar_filename) as f:
    files_in_tar = set(f.getnames())
    print(files_in_tar)
    print(set(submissions_tidy['filename_fullpath']))
#if files_in_tar == set(submissions_tidy['filename_fullpath']):
#    print(f"{tar_filename} contains all {len(files_in_tar)} expected files.")
#else:
#    raise ValueError(f"{tar_filename} does not have all the expected files.")

The size of SRA_submission_XBB15.tar is 0.3 GB
{'uufs/chpc.utah.edu/common/home/starr-group1/sequencing/TNS/2023/230520_Omicron-xbb-bq_pacbio/m64272e_230518_163630.hifi_reads.bc2075--bc2075.fastq.gz', 'uufs/chpc.utah.edu/common/home/starr-group1/sequencing/TNS/2023/230520_Omicron-xbb-bq_pacbio/m64272e_230518_163630.hifi_reads.bc2076--bc2076.fastq.gz'}
{'/uufs/chpc.utah.edu/common/home/starr-group1/sequencing/TNS/2023/230520_Omicron-xbb-bq_pacbio/m64272e_230518_163630.hifi_reads.bc2076--bc2076.fastq.gz', '/uufs/chpc.utah.edu/common/home/starr-group1/sequencing/TNS/2023/230520_Omicron-xbb-bq_pacbio/m64272e_230518_163630.hifi_reads.bc2075--bc2075.fastq.gz'}
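
The two sets printed above differ only by the leading `/`, so the check could be automated by stripping that slash before comparing. The following is a minimal sketch, not the check that was actually run for this submission. The helper name `missing_from_tar` and the demo file names are hypothetical, and the demo assumes POSIX-style paths.

```python
import os
import tarfile
import tempfile


def missing_from_tar(tar_path, expected_paths):
    """Return the expected paths that have no matching member in the archive.

    Leading slashes are stripped from the expected paths before comparing,
    because `tarfile` stores absolute paths as relative member names.
    """
    with tarfile.open(tar_path) as tar:
        members = set(tar.getnames())
    return sorted(p for p in expected_paths if p.lstrip('/') not in members)


# Small demonstration with a throwaway file (hypothetical names)
with tempfile.TemporaryDirectory() as tmpdir:
    data_file = os.path.join(tmpdir, 'reads.fastq.gz')
    with open(data_file, 'w') as f:
        f.write('placeholder')
    demo_tar = os.path.join(tmpdir, 'demo.tar')
    with tarfile.open(demo_tar, mode='w') as tar:
        # same pattern as the notebook: archive under the full path
        tar.add(data_file, arcname=data_file)
    found_all = missing_from_tar(demo_tar, [data_file]) == []
    absent = missing_from_tar(demo_tar, [data_file, '/no/such/file'])

print(found_all, absent)
```

In this notebook, the equivalent call would be `missing_from_tar(tar_filename, submissions_tidy['filename_fullpath'])`, raising an error if the returned list is not empty.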


The SRA instructions then give several ways to upload; we will do it using the FTP method.
First, specify the FTP address, username, password, and subfolder given by the SRA submission wizard instructions.
To avoid making the password public, it is stored in a separate text file that is **not** included in the GitHub repo (so this notebook needs to be run in Tyler's directory, which contains that file):

In [12]:
# the following are provided by SRA wizard instructions
ftp_address = 'ftp-private.ncbi.nlm.nih.gov'
ftp_username = 'subftp'
ftp_account_folder = 'uploads/tyler.n.starr_gmail.com_LMpRB4Tu'
with open('ftp_password.txt') as f:
    ftp_password = f.read().strip()
    
# meaningful name for subfolder
ftp_subfolder = 'XBB15_RBD_barcodes'

Now create the FTP connection and upload the TAR file.
Note that this takes a while.
If you are worried that it will time out given the size of your file, you can run this notebook via `slurm` so that it does not time out:

In [13]:
print(f"Starting upload at {datetime.datetime.now()}")

with ftplib.FTP(ftp_address) as ftp:
    ftp.login(user=ftp_username,
              passwd=ftp_password,
              )
    ftp.cwd(ftp_account_folder)
    ftp.mkd(ftp_subfolder)
    ftp.cwd(ftp_subfolder)
    with open(tar_filename, 'rb') as f:
        ftp.storbinary(f"STOR {tar_filename}", f)
        
print(f"Finished upload at {datetime.datetime.now()}")

Starting upload at 2023-08-29 15:22:56.011784
Finished upload at 2023-08-29 15:23:34.950651
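
One caveat when re-running the upload cell: `ftp.mkd` typically raises `ftplib.error_perm` if the subfolder already exists, for example after an interrupted first attempt. A minimal sketch of a re-run-safe alternative is shown below. The helper name `cwd_or_create` is hypothetical, and the assumption that the server reports a missing folder as a permanent error has not been verified against the SRA server.

```python
import ftplib


def cwd_or_create(ftp, folder):
    """Change into `folder` on the server, creating it first if needed.

    `ftp` is expected to behave like a logged-in `ftplib.FTP` object.
    """
    try:
        ftp.cwd(folder)
    except ftplib.error_perm:
        # The server refused the cwd, most likely because the folder is
        # missing. Create it and try again; if this also fails, the error
        # propagates to the caller.
        ftp.mkd(folder)
        ftp.cwd(folder)
```

In the upload cell, the `ftp.mkd(ftp_subfolder)` / `ftp.cwd(ftp_subfolder)` pair would then become a single `cwd_or_create(ftp, ftp_subfolder)` call.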


Finally, I used the SRA wizard to select the `*.tar` archive and complete the submission.
Note that there is a warning of missing files since everything was uploaded as a `*.tar` rather than individual files.
They should all be found when you hit the button to proceed and the `*.tar` is unpacked.

There was then a message that the submission was processing, and data would be released immediately upon processing.
The submission number is `SUB13802272`.