# **PART 2:** Preparing data for epitope annotation with TCRex

In [1]:
import os
# Set the working directory to the repository directory
os.chdir("/home/sebastiaan/PhD/Repositories/book_chapter/")

Once again, we will need the `pandas` library for handling the data.

In [2]:
import pandas as pd

TCRex sets an upper limit of 50,000 on the number of sequences that a user may upload. Files containing more than 50,000 sequences must therefore be split into smaller chunks that each satisfy this upload limit.

We will start out by writing a function that does just that.

In [16]:
# Function to split data into smaller files
def split_data(
    data: pd.DataFrame,
    fn: str,
    folder: str,
    tcrex_limit: int = 50000
    ):
    """
    Split a table of TCR sequences into smaller files, each containing at most a user-defined number of sequences.
    
    Args:
        - data: table of TCR sequences that needs to be split
        - fn: base name for the output files (the chunk number and '.tsv' are appended)
        - folder: path to the directory where the split files need to be stored
        - tcrex_limit: maximum number of sequences in each split file (default = 50K)
    """
    for i, chunk_start in enumerate(range(0, data.shape[0], tcrex_limit), start = 1):
        # Select chunk
        data_chunk = data[chunk_start:chunk_start + tcrex_limit]
        # Save chunk to target directory
        data_chunk.to_csv(
            os.path.join(folder, '_'.join([fn, str(i)]) +'.tsv'),
            sep = '\t',
            index = False
            )
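
To make the chunking logic concrete, the sketch below applies the same `range`-based slicing to a toy table. The limit of 5 and the made-up sequences are assumptions chosen only to keep the output readable; they do not come from the real data.

```python
import pandas as pd

# Toy table standing in for a repertoire file (made-up sequences)
toy = pd.DataFrame({"CDR3_beta": [f"CASS{i}F" for i in range(12)]})
limit = 5  # hypothetical small limit; the TCRex limit is 50,000

# Every chunk starts at a multiple of the limit
chunk_starts = list(range(0, toy.shape[0], limit))
# Slicing past the end of the table is safe: the last chunk simply holds the remainder
chunks = [toy[start:start + limit] for start in chunk_starts]
chunk_sizes = [len(chunk) for chunk in chunks]

print(chunk_starts)  # [0, 5, 10]
print(chunk_sizes)   # [5, 5, 2]
```

No row is lost or duplicated: the chunk sizes always add up to the number of rows in the original table.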

Next, we will check whether each file exceeds the TCRex limit. If it does, we will split it into smaller chunks (of at most 50,000 sequences) using the function we defined previously. In addition, the columns containing the CDR3, V gene and J gene information need to be renamed so that they satisfy the TCRex input format. The following function takes 6 arguments: the path to the folder containing the source files, the path to the output folder (where the results should be stored), the names of the CDR3, V gene and J gene columns, and the chunk size (50,000 by default, which is the TCRex limit).

In [17]:
def file_splitter(
    source: str,
    destination: str,
    cdr3_colname: str,
    vgene_colname: str,
    jgene_colname: str,
    chunk_size: int = 50000
    ):
    
    # Dictionary containing the names of the columns that should be renamed
    remap_cols = {
        cdr3_colname : "CDR3_beta",
        vgene_colname : "TRBV_gene",
        jgene_colname : "TRBJ_gene"
        }
    
    # If the output folder does not exist, create it (including any missing parent folders)
    os.makedirs(destination, exist_ok = True)

    # List files in source folder
    files = os.listdir(source)
    # Loop over files and split them if they contain more than 50K TCRs
    for fn in files:
        # Read in every file with name fn and rename its columns
        data = pd.read_csv(os.path.join(source, fn), sep = '\t')
        data = data.rename(columns = remap_cols)
        # Calculate the number of TCRs in the file
        nr_rows = data.shape[0]
        # If the number of TCRs exceeds the TCRex limit, split the file fn in smaller files
        if nr_rows > chunk_size:
            print(f"Splitting {fn}")
            # Strip the file extension so that it does not end up in the middle of the chunk file names
            stem = os.path.splitext(fn)[0]
            new_folder = os.path.join(destination, stem)
            if not os.path.exists(new_folder):
                os.mkdir(new_folder)
            # Use the split_data function we defined previously to split the file
            split_data(
                data = data,
                fn = stem,
                tcrex_limit = chunk_size, 
                folder = new_folder
                )
        # If the number of TCRs does not exceed the TCRex limit, write the renamed table to the output folder as a single file
        else:
            data.to_csv(
                os.path.join(destination, fn),
                sep = '\t',
                index = False
                )
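
One caveat about the renaming step: by default, `DataFrame.rename` silently ignores keys that are not present in the table, so a misspelled column name goes unnoticed. The sketch below shows the renaming on a toy table and adds a simple check for the three target columns. The table contents and the extra `duplicate_count` column are made up for illustration.

```python
import pandas as pd

# Toy table with AIRR-style column names (made-up values)
toy = pd.DataFrame({
    "junction_aa": ["CASSLGQAYEQYF"],
    "v_call": ["TRBV7-9"],
    "j_call": ["TRBJ2-7"],
    "duplicate_count": [3],
})

# "j_cal" is a deliberate typo: rename() does not raise an error for it
remap_cols = {
    "junction_aa": "CDR3_beta",
    "v_call": "TRBV_gene",
    "j_cal": "TRBJ_gene",
}
renamed = toy.rename(columns=remap_cols)

# Check which of the target columns are still missing after renaming
required = {"CDR3_beta", "TRBV_gene", "TRBJ_gene"}
missing = sorted(required - set(renamed.columns))

print(list(renamed.columns))  # ['CDR3_beta', 'TRBV_gene', 'j_call', 'duplicate_count']
print(missing)                # ['TRBJ_gene']
```

Columns that are not listed in the dictionary, such as `duplicate_count` here, are left unchanged.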

The only thing left to do now is to run the function with the necessary parameters.

In [18]:
# Define the input directory and the directory that collects all parsed (i.e. split) files
indir = "./data/examples"
outdir = "./data/tcrex_in"

file_splitter(
    source = indir,
    destination = outdir,
    cdr3_colname = "junction_aa",
    vgene_colname = "v_call",
    jgene_colname = "j_call"
    )

Splitting P1_0_clones.txt
Splitting P1_15_clones.txt
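
Before uploading, it can be worth confirming that every file in the output folder respects the row limit and carries the renamed columns. The helper below, `check_tcrex_input`, is a hypothetical sketch and not part of TCRex. It assumes tab-separated files and is demonstrated on a temporary folder with made-up data rather than on the real output.

```python
import os
import tempfile

import pandas as pd

REQUIRED = {"CDR3_beta", "TRBV_gene", "TRBJ_gene"}

def check_tcrex_input(folder, limit=50000):
    """Hypothetical helper: list files that exceed the row limit or lack a required column."""
    problems = []
    # os.walk also visits the subfolders that hold the chunks of split files
    for root, _, files in os.walk(folder):
        for name in sorted(files):
            path = os.path.join(root, name)
            table = pd.read_csv(path, sep="\t")
            if table.shape[0] > limit or not REQUIRED.issubset(table.columns):
                problems.append(path)
    return problems

# Demonstration on a temporary folder with made-up data
with tempfile.TemporaryDirectory() as tmp:
    good = pd.DataFrame({
        "CDR3_beta": ["CASSLGQAYEQYF"] * 3,
        "TRBV_gene": ["TRBV7-9"] * 3,
        "TRBJ_gene": ["TRBJ2-7"] * 3,
    })
    good.to_csv(os.path.join(tmp, "ok.tsv"), sep="\t", index=False)
    # This file lacks the CDR3_beta column and should be flagged
    bad = good.rename(columns={"CDR3_beta": "junction_aa"})
    bad.to_csv(os.path.join(tmp, "bad.tsv"), sep="\t", index=False)
    flagged = [os.path.basename(p) for p in check_tcrex_input(tmp, limit=3)]

print(flagged)  # ['bad.tsv']
```

On the real data, the call would be `check_tcrex_input(outdir)`, and an empty list would indicate that all files pass both checks.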
