### This file contains functions to read bed files containing cell-free DNA fragment coordinates and store them as H5PY files. During the process, the coordinates are split into training, validation and test sets based on their chromosome number
**NOTE**:
- Each bed file corresponds to the donor or recipient cfDNA coordinates for a single patient. Donor cfDNA originates from the transplanted lung and recipient cfDNA originates from the recipient's blood cells.
- H5PY output files are created for each coordinate bed file. The coordinates belonging to train, validation and test sets for a single bed file are stored as three separate datasets within the corresponding H5PY output file.
- In addition to splitting the dataset into three sets, fragments that are too close to the ends of a chromosome are removed, because extending their coordinates to reach the Enformer input size would run past the chromosome boundary. Refer to the comments in `removeCloseToEndFrags` for more information.

**Reason behind splitting based on chromosomes**
The reason is to avoid data leakage between the training, test and validation sets. cfDNA fragment sequences could arise from the same region and hence have a lot of overlap (depending on the sequencing depth). Thus, even if there is no sample overlap between the three sets, there could be sequence overlap arising from two different fragments coming from the same genomic region. One way of mitigating this is to ensure that no chromosomes are shared between the three sets.

*Details on how chromosome-based splitting is done*
We need to split the chromosomes (1-22, X and Y in the code below) between the sets such that the percentage of total samples covered by the chromosomes in each set approximately matches that set's target percentage.
1. For each chromosome, count the number of samples on that chromosome and calculate what percentage of the total samples this is.
2. Starting from chromosome 1, iterate over the chromosomes and accumulate their sample percentages until the training percentage is reached. Repeat the process for the validation set and assign the rest of the chromosomes to the test set. This should result in 3 lists of chromosomes, one each for the training, validation and test sets.
3. For each bed file, assign each sample to one of the three sets based on which chromosome list its chromosome belongs to.

Note that in the code below the test set is instead built from whole patient files that are set aside (see `getTestPatientsList`), so chromosome lists are only created for the training and validation sets.
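
The greedy assignment described in the steps above can be sketched in plain Python. This is a minimal illustration, not the notebook's implementation: the function name `split_chromosomes`, its arguments and the toy percentages are all hypothetical. As in `getChromosomesCoveringPercentSamples`, the chromosome that pushes the running total past the target is included in the set, so the resulting split only approximates the requested percentages.

```python
def split_chromosomes(percent_by_chrom, targets):
    """Greedily assign chromosomes, in order, to consecutive sets.

    percent_by_chrom: list of (chromosome, percentage of samples) pairs.
    targets: target percentage for every set except the last one, which
             receives whatever chromosomes remain.
    (Hypothetical helper, for illustration only.)
    """
    sets = []
    current, covered = [], 0.0
    remaining_targets = list(targets)
    for chrom, percent in percent_by_chrom:
        current.append(chrom)
        covered += percent
        # Close the current set as soon as its target is exceeded.
        if remaining_targets and covered > remaining_targets[0]:
            sets.append(current)
            current, covered = [], 0.0
            remaining_targets.pop(0)
    sets.append(current)  # leftover chromosomes form the last set
    return sets


# Toy percentages (assumed values, not taken from real data).
toy_percentages = [("1", 30.0), ("2", 25.0), ("3", 20.0), ("4", 15.0), ("5", 10.0)]
toy_split = split_chromosomes(toy_percentages, targets=[50.0, 30.0])
print(toy_split)
```

With these toy numbers the first set overshoots its 50% target (it covers 55%), which shows why the realised split can differ from the configured one.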

In [8]:
import math
import numpy as np
import pandas as pd

import sys
import importlib

import h5py
import pysam

#If we don't do this step, then local files like config will not be detected. 
sys.path.insert(0,'/hpc/compgen/projects/fragclass/analysis/mvivekanandan/script/madhu_scripts')

import config
import os

In [9]:
importlib.reload(config)

#Set arguments from config file. 
arguments = {}
arguments["donorFile"] = config.filePaths.get("donorFile")
arguments["inputBedFileFolder"] = config.filePaths.get("inputBedFileFolder")
arguments['coordStoreDirectory'] = config.filePaths.get("coordStoreDirectory")
arguments['snpFilePath'] = config.filePaths.get("snpFile")
arguments['coordsStoreFilePath'] = config.filePaths.get("coordStoreFile")
arguments["refGenomePath"] = config.filePaths.get("refGenomePath")

arguments['testPercent'] = config.dataCreationConfig.get("percentTest")
arguments['validationPercent'] = config.dataCreationConfig.get("percentValidation")
arguments['numColsToExtract'] = config.dataCreationConfig.get("numColsToExtract")
arguments["balanceClassesBeforeCreatingCoordiantes"] = config.dataCreationConfig.get("balanceClassesBeforeCreatingCoordiantes")

#Datasetnames
arguments["trainingCoordsDatasetName"] = config.datasetNames.get("trainingCoords")
arguments["validationCoordsDatasetName"] = config.datasetNames.get("validationCoords")
arguments["testCoordsDatasetName"] = config.datasetNames.get("testCoords")
arguments["trainingLabelsDatasetName"] = config.datasetNames.get("trainingLabels")
arguments["validationLabelsDatasetName"] = config.datasetNames.get("validationLabels")
arguments["testLabelsDatasetName"] = config.datasetNames.get("testLabels")

#For ease of testing. If you only want to test, set this flag to False. Then H5PY coordinate files will not be created. 
WRITE_TO_FILES = True

In [10]:
'''
Input -
    1. dataNumpy is a numpy array with chrom number, start and end coordinates as columns for cfDNA fragment samples (rows)
    2. label - 0/1. The label is 1 if the samples are from a donor file and 0 if they are from a recipient file

Output - A numpy array of shape (nrows, 1) filled with 0's or 1's according to the label, each value representing the label for one sample. Consequently the number of rows in the output array is the same as the number of rows in dataNumpy.
'''
def getLabelsForData(dataNumpy, label):
    nrows, ncols = dataNumpy.shape
    if label == 0:
        return np.zeros(nrows).reshape(nrows, 1)
    if label == 1:
        return np.ones(nrows).reshape(nrows, 1)
    else:
        print(f"Invalid label for data : {label}")
        raise SystemExit(1)

'''
The config file only contains testPercent (what percentage of the total data should be test) and validationPercent (what percentage of the non-test data should be validation).
This function calculates the training percentage from these values and returns the absolute training, validation and test percentages (in terms of the total number of samples).

Output - 3 percentage values (training, validation and test)
'''
def getSampleDistributionPercents():
    testPercent = arguments["testPercent"]
    validationPercent = arguments["validationPercent"]
    
    nonTestPercent = (100 - testPercent)
    absValidationPercent = nonTestPercent * validationPercent/100
    absTrainingPercent = nonTestPercent * (100 - validationPercent)/100
    print(f"Absolute training, validation and test percentages are {absTrainingPercent}, {absValidationPercent} and {testPercent}")
    return (absTrainingPercent, absValidationPercent, testPercent)

'''
Inputs -
1. testPercent - percentage of total samples that should belong to the test set.
2. inputBedFilesDirectory - path to the directory containing input bed files.
Output - File names that should belong to the test set.
Description: Function iterates over files, counting number of samples until the required test percentage is attained. All the patient file names that were counted to reach the test percentage are returned.
'''
def getTestPatientsList(testPercent, inputBedFilesDirectory):
    total_samples = 0
    test_patient_filenames = []

    #Get the total number of samples in the bed files directory
    for filename in os.listdir(inputBedFilesDirectory):
        filepath = os.path.join(inputBedFilesDirectory, filename)
        columnNames  = ["#chrom", "start", "end", "read_id", "mapq", "cigar1", "cigar2"]
        cfdna_frag_df = pd.read_csv(filepath, sep = "\t", names = columnNames, skiprows=11)
        total_samples += len(cfdna_frag_df)
    
    req_test_samples = math.floor(testPercent/100 * total_samples)
    print(f"Number of test samples required : {req_test_samples} and total samples is {total_samples}")    
    
    #Get the filename until which the sample count reaches the test percentage level
    num_samples = 0
    for filename in os.listdir(inputBedFilesDirectory):
        if "recipient" in filename: continue

        donor_file_path = os.path.join(inputBedFilesDirectory, filename)
        recipient_file_path = donor_file_path.replace("donor", "recipient")
        recipient_file_name = filename.replace("donor", "recipient")

        columnNames  = ["#chrom", "start", "end", "read_id", "mapq", "cigar1", "cigar2"]
        donor_cf_dna_df = pd.read_csv(donor_file_path, sep = "\t", names = columnNames, skiprows=11)
        recipient_cf_dna_df = pd.read_csv(recipient_file_path, sep = "\t", names = columnNames, skiprows=11)

        num_samples += len(donor_cf_dna_df) + len(recipient_cf_dna_df)
        test_patient_filenames.append(filename)
        test_patient_filenames.append(recipient_file_name)

        if(num_samples > req_test_samples):
            print(f"Reached the required test samples, at filename {filename} and the current num_samples is {num_samples}")
            break
    
    return test_patient_filenames
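
As a worked example of the arithmetic in `getSampleDistributionPercents` above, the sketch below repeats the same calculation as a standalone function. The function name and the input values (20% test, 10% validation) are assumptions chosen for illustration; the real values come from `config.dataCreationConfig`.

```python
def sample_distribution_percents(test_percent, validation_percent):
    """Convert the two configured percentages into absolute percentages.

    test_percent is a share of all samples; validation_percent is a share
    of the samples that remain once the test set has been removed.
    (Hypothetical standalone version, for illustration only.)
    """
    non_test = 100 - test_percent
    abs_validation = non_test * validation_percent / 100
    abs_training = non_test * (100 - validation_percent) / 100
    return abs_training, abs_validation, test_percent


# Assumed example values: 20% test, 10% of the remainder for validation.
train, valid, test = sample_distribution_percents(20, 10)
print(train, valid, test)  # 72.0 8.0 20
```

The three returned values always sum to 100, because the training and validation shares together make up the non-test remainder.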

The functions in the following block are all involved in splitting the samples into training, validation and test sets based on their chromosome.



In [11]:
'''
Input -
1. inputBedFilesDirectoryPath - directory path to where all the bed files are
2. columnNames - column names in the bed file (to later convert to dataframe)
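3. trainPercent - percentage of total samples in training set
4. testPatients - list of filenames that belong to the test set (these files are excluded when computing chromosome percentages)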

Output - 2 lists of chromosomes - for the training and validation sets.
'''
def getChromosomeListsForTrainingAndValidation(inputBedFilesDirectoryPath, columnNames, trainPercent, testPatients):
    average_percentage_df = getChromosomePercentagesAverage(inputBedFilesDirectoryPath, testPatients, columnNames)
    print(f"Total of percentage of the df is {average_percentage_df['percentage of samples'].sum()}")
    (training_end_index, training_chromosomes) = getChromosomesCoveringPercentSamples(average_percentage_df, trainPercent)
    print(f"Training end index is {training_end_index}")
    validation_chromosomes = average_percentage_df.loc[training_end_index + 1:]["#chrom"].values.tolist()
    return (training_chromosomes, validation_chromosomes)

'''
Input
1. testPatients - list of filenames that belong to the test set (from previous function)
2. inputBedFilesDirectoryPath - directory path to where all the bed files are
3. columnNames - column names in the bed file (to later convert to dataframe)

Output - a dataframe with 2 columns - #chrom and percentage of samples.

Iterates over patient files and calculates the percentage (of total samples in that file) that belong to a chromosome.
For each chromosome, the percentages are added for all the files and later divided by the total number of files to get the percentage of total samples that belong to a chromosome (averaged over all patients)
'''
def getChromosomePercentagesAverage(inputBedFilesDirectoryPath, testPatients, columnNames):
    all_samples_df = pd.DataFrame(columns=['#chrom', "percentage of samples"])

    #Insert chromosome numbers.
    chroms = range(1, 23)
    list_chroms = list(map(lambda chrom: str(chrom), chroms)) + ["X"] + ["Y"]
    all_samples_df["#chrom"] = list_chroms
    all_samples_df["percentage of samples"] = [0] * 24
    
    num_files = 0
    for filename in os.listdir(inputBedFilesDirectoryPath):
        if(filename not in testPatients):
            filepath = os.path.join(inputBedFilesDirectoryPath, filename)

            num_files += 1
            cfdna_frag_df = pd.read_csv(filepath, sep = "\t", names = columnNames, skiprows=11)
            
            #If this string conversion is not done, for some files, #chrom till 14 are not strings. This creates problems while 
            #matching to the string chromosomes from the all_samples_df
            cfdna_frag_df["#chrom"]= cfdna_frag_df["#chrom"].map(str)
            cfdna_chrom_sample_count = cfdna_frag_df.groupby("#chrom").size().reset_index()
            cfdna_chrom_sample_count.columns = ["#chrom", "percentage of samples"]

            #Transform from count to percentage
            total_samples = len(cfdna_frag_df)
            cfdna_chrom_sample_count["percentage of samples"] = cfdna_chrom_sample_count["percentage of samples"].transform(lambda x: x/total_samples * 100)
            # print(f"Printing percentage of samples in file {filename} ...  \n {cfdna_chrom_sample_count.head(25)}", flush=True)

            # For each row of all_samples_df, look up the percentage for the matching #chrom in cfdna_chrom_sample_count
            # and add it to the running sum (see addPercentagesFunction; chromosomes absent from this file add nothing).
            # After the loop, all_samples_df holds the sum of percentages from all files for each chromosome.
            all_samples_df["percentage of samples"] = all_samples_df.apply(lambda x:  addPercentagesFunction(x["#chrom"], x["percentage of samples"], cfdna_chrom_sample_count), axis = 1)


    #Take the average of the percentages sum over all files. 
    all_samples_df["percentage of samples"] =  all_samples_df["percentage of samples"].transform(lambda x: x/num_files)
    
    #Check to see if all the percentages in the final all_samples_df add up to 100. 
    all_samples_avg = all_samples_df["percentage of samples"]
    all_samples_avg_sum = all_samples_avg.sum()                         
    if(round(all_samples_avg_sum) != 100):
        raise Exception(f"********* Something is wrong !! The sum of percentages of all files combined ({all_samples_avg_sum}) is not adding up to 100. \n After averaging, the all samples df is {all_samples_df.head(25)}") 
    
    print(f"Chromosomes percentage df is {all_samples_df}")
    return all_samples_df

'''
Inputs
1. chrom - chromosome number
2. Percentage - percentage of samples arising from #chrom for a patient
3. file_level_chrom_percent_df - df of chromosomes and their sample percentages.

Add a single patient's sample percentage to the chromosomes to sample percentages df.
'''
def addPercentagesFunction(chrom, percentage, file_level_chrom_percent_df):
    percent_to_add = file_level_chrom_percent_df.loc[file_level_chrom_percent_df["#chrom"] == chrom]["percentage of samples"]
    if(percent_to_add.values.size == 0):
        return percentage
    else:
        return percentage + percent_to_add.values[0]

'''
Input -
1. chrom_avg_percent_coverage_df - df of chromosome and average percentage of samples that originate from this chromosome
2. maxPercent - trainingPercent or validationPercent in this case

Given a maxPercentage of samples, it returns a list of chromosomes whose samples when taken together reach that maxPercent

Output -
1. endIndex - index of the last chromosome in the list, so that the next set can start from the following chromosome (TODO: refactor so that it need not be returned)
2. chrom_list - the list of chromosomes whose samples when taken together reach the max percent.
'''
def getChromosomesCoveringPercentSamples(chrom_avg_percent_coverage_df, maxPercent):
    chromosomes_list = []
    percent_covered = 0
    end_index = -1
    for i, row in chrom_avg_percent_coverage_df.iterrows():
        chrom = row["#chrom"]
        avg_percentage = row["percentage of samples"]
        percent_covered = percent_covered + avg_percentage
        chromosomes_list.append(chrom)
        if(percent_covered > maxPercent):
            end_index = i
            break
    if end_index == -1:
        raise Exception("Something is wrong, the individual chromosome percentages are not sufficient to reach the percentage requested")
    
    return (end_index, chromosomes_list)

'''
cfdna_frag_df - df of start, end coordinates and chrom number of cfDNA samples
train_chroms - list of chromosomes whose fragments should make the training set
validation_chroms - list of chromosomes whose fragments should make the validation set

The function splits the cfdna_frag_df into training and validation dfs according to their chromosomes.
'''
def getTrainingAndValidationData(cfdna_frag_df, train_chroms, validation_chroms):
    numColumnsToExtract = arguments["numColsToExtract"]
    training_df = cfdna_frag_df.loc[cfdna_frag_df["#chrom"].isin(train_chroms)].iloc[:, 0:numColumnsToExtract]
    validation_df = cfdna_frag_df.loc[cfdna_frag_df["#chrom"].isin(validation_chroms)].iloc[:, 0:numColumnsToExtract] 
    return (training_df, validation_df)
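
The averaging performed by `getChromosomePercentagesAverage` can be illustrated without pandas. The sketch below is a simplified stand-in under stated assumptions: the function name and the toy inputs are hypothetical, and each "file" is reduced to a list of chromosome labels, one per fragment. Because percentages are computed per file before averaging, every patient file contributes equally regardless of its sequencing depth.

```python
from collections import Counter


def average_chrom_percentages(files):
    """Average per-file chromosome percentages over all files.

    files: list of lists, each inner list holding one chromosome label
           per fragment in that file.
    (Hypothetical helper, for illustration only.)
    """
    totals = Counter()
    for chroms in files:
        counts = Counter(chroms)
        for chrom, count in counts.items():
            # Percentage of this file's fragments that fall on this chromosome.
            totals[chrom] += 100.0 * count / len(chroms)
    # Divide the summed percentages by the number of files.
    return {chrom: total / len(files) for chrom, total in totals.items()}


# Toy data (assumed): two files with four fragments each.
toy_files = [
    ["1", "1", "2", "X"],  # 50% chr1, 25% chr2, 25% chrX
    ["1", "2", "2", "2"],  # 25% chr1, 75% chr2
]
averages = average_chrom_percentages(toy_files)
print(averages)
```

A chromosome that is missing from a file simply adds nothing to its running total, which mirrors the behaviour of `addPercentagesFunction`.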

In [None]:
"""
Enformer expects an input size of 196607 bps. cfDNA fragments are much smaller, so we align each fragment on the reference genome and take sequence on either side, such that the cfDNA fragment is in the middle.
If the fragment is too close to the start or end of the chromosome, we can't extend it. Shifting the window to the right and accounting for the changed bins etc. is too complex,
so we simply discard fragments which are too close to the left or the right end of the chromosome.
"""
def removeCloseToEndFrags(df):
    #Get the length of each chromosome
    refGenomePath = arguments["refGenomePath"]
    refGenome = pysam.FastaFile(refGenomePath)

    chrom_len_map = {}
    chroms = df['#chrom'].unique()
    for i in chroms:
        chrom = str(i)
        length = refGenome.get_reference_length("chr" + chrom)
        chrom_len_map[chrom] = length

    allowed_distance_from_edges = 100000 #length to be extended on either of fragment for enformer input is 196607/2 = 98303.

    #Drop fragments too close to beginning of the chromosome
    df = df.drop(df[df['start'] <= allowed_distance_from_edges].index)
    df.reset_index(inplace=True, drop=True)

    #Drop fragments too close to the end of the chromosome.
    #Look up each row's chromosome length and keep only rows that end far enough from it.
    chrom_lengths = df["#chrom"].map(str).map(chrom_len_map)
    df = df[df["end"] < chrom_lengths - allowed_distance_from_edges].reset_index(drop=True)

    return df
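
To make the reason for discarding edge fragments concrete, the sketch below computes a window of the Enformer input size centered on a fragment and reports when that window would not fit on the chromosome. The centering convention, the function name `centered_window` and the toy chromosome length are assumptions made for illustration; the code that actually extends the coordinates is not part of this notebook.

```python
ENFORMER_INPUT_LENGTH = 196607  # input size stated in the docstring above


def centered_window(start, end, chrom_length, window=ENFORMER_INPUT_LENGTH):
    """Return (window_start, window_end) centered on the fragment, or None
    if the window would run past either end of the chromosome.
    (Hypothetical helper; the centering convention is an assumption.)
    """
    center = (start + end) // 2
    window_start = center - window // 2
    window_end = window_start + window
    if window_start < 0 or window_end > chrom_length:
        return None
    return window_start, window_end


TOY_CHROM_LENGTH = 10_000_000  # assumed length, not a real chromosome

print(centered_window(1_000_000, 1_000_167, TOY_CHROM_LENGTH))  # fits
print(centered_window(50_000, 50_167, TOY_CHROM_LENGTH))        # too close to the start
print(centered_window(9_950_000, 9_950_167, TOY_CHROM_LENGTH))  # too close to the end
```

The fixed margin of 100000 bp used in `removeCloseToEndFrags` is slightly larger than half the window (98303 bp), so any fragment that passes that filter leaves room for the extension on both sides.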

Call the function that splits chromosomes for training, validation and test sets. Store the chromosome lists for each of the three sets in variables for later use.

In [None]:
columnNames  = ["#chrom", "start", "end", "read_id", "mapq", "cigar1", "cigar2"]
inputBedFilesDirectory = arguments["inputBedFileFolder"]
trainingPercent, validationPercent, testPercent = getSampleDistributionPercents()
testPatients = getTestPatientsList(testPercent, inputBedFilesDirectory)
print(f"Test patients list is {testPatients}")

#Preparatory step - decide which chromosomes should be part of the training and validation sets
#while roughly maintaining the train/validation ratio, and making sure chromosomes are not shared between the sets.
#Create the chromosome number vs percentage covered df and use it to get the lists of chromosomes for training and validation data.
training_chromosomes, validation_chromosomes = getChromosomeListsForTrainingAndValidation(inputBedFilesDirectory, columnNames, trainingPercent, testPatients)
print(f"Training and validation chromosomes are {training_chromosomes}, {validation_chromosomes}")

In [12]:
'''
The function that ties all the previous small functions together. It calls functions/has code to do the following

1. Iterate over donor and recipient files to get dfs out of bed files. If class balancing is enabled in the config, make the donor and recipient dfs equal in length by randomly sampling both down to the size of the smaller one
2. Call function to split the dfs into training and validation df based on their chromosome.
3. Call functions to remove fragments that are too close to the end of the chromosome. Also get the test data from the test patient files
4. Write all the data into H5PY files within separate training validation and test dataset.
'''
def fetchCoordinatesAndStore():
    rows_to_skip = 11
    
    columnNames  = ["#chrom", "start", "end", "read_id", "mapq", "cigar1", "cigar2"]
    inputBedFilesDirectory = arguments["inputBedFileFolder"]
    count_test = 0
    count_train_valid = 0
    count_files = 0

    #Iterate over each file, get the training, validation and test dataset, store to H5PY file.
    for filename in os.listdir(inputBedFilesDirectory):
        count_files = count_files + 1
        print(f"Processing filename {filename}")
        filepath = os.path.join(inputBedFilesDirectory, filename)

        #For test patients, don't split based on chromosomes. Test patients data is directly stored into H5PY coordinate files. 
        if filename in testPatients:
            testCoordStoreFilePath = os.path.join(arguments["coordStoreDirectory"], filename.replace('.frag.bed.gz', '') + ".hdf5")
            test_label = 1 if "donor" in filename else 0
            count_test += 1
            cfdna_frag_df = pd.read_csv(filepath,
                        sep = "\t", names = columnNames, skiprows=rows_to_skip)
            numColumnsToExtract = arguments["numColsToExtract"]
            test_data = cfdna_frag_df.iloc[:, 0:numColumnsToExtract]
            test_data = removeCloseToEndFrags(test_data)
            testLabels = getLabelsForData(test_data, test_label)
            if(WRITE_TO_FILES == True):
                with h5py.File(testCoordStoreFilePath, 'w') as h5_file:
                    print(f"Storing into testFile : {testCoordStoreFilePath}, test_data : {len(test_data)} and testLabels: {len(testLabels)}")
                    h5_file.create_dataset(arguments["testCoordsDatasetName"], data=test_data.astype(str).to_numpy(), compression = "gzip", compression_opts=9)
                    h5_file.create_dataset(arguments["testLabelsDatasetName"], data=testLabels, compression = "gzip", compression_opts=9)
            continue
        
        if 'recipient' in filename: continue #Just read donor files
        #For testing only: uncomment the next line to skip training/validation files and check only the test files
        # if 'donor' in filename: continue 

        cfdna_frag_dfs = []
        file_paths = []

        ##Getting number of samples in donor. Append the cfDNA fragment df into a list (It's a 2 element list with only the donor and the recipient for that patient)
        file_paths.append(filepath)
        cfdna_frag_dfs.append(pd.read_csv(file_paths[0],
                        sep = "\t", names = columnNames, skiprows=rows_to_skip))
        donor_sample_length = len(cfdna_frag_dfs[0])

        #Get the number of samples in the recipient
        file_paths.append(file_paths[0].replace('donor', 'recipient'))
        cfdna_frag_dfs.append(pd.read_csv(file_paths[1],
                        sep = "\t", names = columnNames, skiprows=rows_to_skip))
        recipient_sample_length = len(cfdna_frag_dfs[1])

        #get the minimum of the samples between donor and recipient
        min_sample_length = min(donor_sample_length, recipient_sample_length)
        
        #Do the same steps for the donor and the recipient file 
        for i in range(0,2):
            filename = os.path.basename(file_paths[i])

            if arguments["balanceClassesBeforeCreatingCoordiantes"]:
                cfdna_frag_dfs[i] = cfdna_frag_dfs[i].sample(n=min_sample_length, random_state=42, replace=False)

            #Some chromosome values are not proper strings. Convert all #chrom values to strings.
            cfdna_frag_dfs[i]["#chrom"]= cfdna_frag_dfs[i]["#chrom"].map(str)
            train_data, validation_data = getTrainingAndValidationData(cfdna_frag_dfs[i], training_chromosomes, validation_chromosomes)

            #Remove fragments that are too close to the start and end of the chromosome
            train_data = removeCloseToEndFrags(train_data)
            validation_data = removeCloseToEndFrags(validation_data)

            #Get labels for the data
            label = 1 if "donor" in filename else 0
            trainingLabels = getLabelsForData(train_data, label)
            validationLabels = getLabelsForData(validation_data, label)

            # print(f"Printing train data characteristics. Length: {len(train_data)}, Head : {train_data.head()}")
            # print(f"Printing validation data characteristics. Length: {len(validation_data)}, Head : {validation_data.head()}")
            # print(f"Size and head of train labels {len(trainingLabels)}, {trainingLabels[0:10]}")
            # print(f"Size and head of validation labels {len(validationLabels)}, {validationLabels[0:10]}")

            #Store the data into H5PY files as separate datasets.
            if(WRITE_TO_FILES == True):
                coordStoreFilePath = os.path.join(arguments["coordStoreDirectory"], filename.replace('.frag.bed.gz', '') + ".hdf5")
                with h5py.File(coordStoreFilePath, 'w') as h5_file:
                    h5_file.create_dataset(arguments["trainingCoordsDatasetName"], data=train_data.astype(str).to_numpy(), compression = "gzip", compression_opts=9)
                    h5_file.create_dataset(arguments["trainingLabelsDatasetName"], data=trainingLabels, compression = "gzip", compression_opts=9)
                    h5_file.create_dataset(arguments["validationCoordsDatasetName"], data=validation_data.astype(str).to_numpy(), compression = "gzip", compression_opts=9)
                    h5_file.create_dataset(arguments["validationLabelsDatasetName"], data=validationLabels, compression = "gzip", compression_opts=9)
                
            count_train_valid += 1

    print(f"Count test: {count_test} and count train valid : {count_train_valid}")
       

In [None]:
"""
This function reads the newly created H5PY files and verifies that everything is in order. The following checks are performed.
    1. The number of donor and recipient files is the same between the bed files and the coordinate files.
    2. The number of data samples and labels is the same for training and validation.
    3. The donors have label 1 and the recipients have label 0
    4. The numbers of training, validation and test samples add up to the minimum sample count used from the bed files for each coordinate file.
    5. The total number of samples (all 3 sets put together) is the same in the donor and recipient coordinate files.

NOTE: This function is not yet modified to reflect the updated split policy, in which some patient files are set aside as test patients and only the samples from the remaining patients are split into training and validation sets.
"""
def verifyNewCoordinateFiles():

    inputBedFilesDirectory = arguments["inputBedFileFolder"]
    coordStoreDirectory = arguments["coordStoreDirectory"]
    
    bed_files = os.listdir(inputBedFilesDirectory)
    coord_files = os.listdir(coordStoreDirectory)

    #Assertion 1: Assert that the total number of bed files and coordinate files are the same
    assert len(bed_files) == len(coord_files), (f"There are {len(bed_files)} bed files in " + 
        f"{os.path.basename(inputBedFilesDirectory)} and {len(coord_files)} " + 
            f"H5PY files in {os.path.basename(coordStoreDirectory)}. The numbers should match, something is wrong !!")

    #Assertion 2: that the total number of donor and recipient bed and coordinate files are the same
    num_donors_bed = 0
    num_recips_bed = 0
    num_donors_coord = 0
    num_recips_coord = 0

    for i in range(0, len(bed_files)):
        if "donor" in bed_files[i]:
            num_donors_bed += 1
        if "donor" in coord_files[i]:
            num_donors_coord += 1
        if "recipient" in bed_files[i]:
            num_recips_bed += 1
        if "recipient" in coord_files[i]:
            num_recips_coord += 1

    assert num_donors_bed == num_donors_coord, f"Number of donor bed files ({num_donors_bed}) does not match the number of donor coordinate files ({num_donors_coord})"
    assert num_recips_bed == num_recips_coord, f"Number of recipient bed files ({num_recips_bed}) does not match the number of recipient coordinate files ({num_recips_coord})"

    for i in range(0, len(coord_files)):
        columnNames  = ["#chrom", "start", "end", "read_id", "mapq", "cigar1", "cigar2"]
        bed_file = bed_files[i]

        #The kind of assertions would differ because we are not balancing the test class. Assert for the test patients separately.
        if bed_file in testPatients:
            filename = bed_file.replace('.frag.bed.gz', '') + ".hdf5"
            with h5py.File(os.path.join(coordStoreDirectory, filename), 'r') as f:
                test_data_length = len(f["testCoords"][:])
                #Labels are stored as a column of shape (n, 1); flatten before comparing
                test_labels = f["testLabels"][:].ravel()

            assert test_data_length == len(test_labels), (f"For file {filename}, " +
                f"the length of testdata ({test_data_length}) and length of testlabels ({len(test_labels)}) does not match")

            #Assertion 3 for test data : assert that donors have label 1 and recipients have label 0
            expected_label = 1 if "donor" in filename else 0
            assert np.all(test_labels == expected_label), f"For file {filename}, some test samples have incorrect labels"

            #Assertion 4 equivalent for test data (not yet implemented):
            #Assert that the number of donor and recipient coords in the coordinate H5PY file is the same as the 
            #number of donor and recipient coordinates in the bed file 

            #Test files hold only the test datasets, so skip the training/validation checks below
            continue

        #Process only the donor files
        if "recipient" in bed_file: continue 

        print(f"Processing file {bed_file}")
        donor_bed_file = bed_file
        recip_bed_file = donor_bed_file.replace("donor", "recipient")
        donor_coord_file = donor_bed_file.replace('.frag.bed.gz', '') + ".hdf5"
        recip_coord_file = recip_bed_file.replace('.frag.bed.gz', '') + ".hdf5"

        donor_bed_path = os.path.join(inputBedFilesDirectory, donor_bed_file)
        recip_bed_path = os.path.join(inputBedFilesDirectory, recip_bed_file)

        donor_cfdna_df = pd.read_csv(donor_bed_path,
                        sep = "\t", names = columnNames, skiprows=11)
        recip_cfdna_df = pd.read_csv(recip_bed_path,
                        sep = "\t", names = columnNames, skiprows=11)
        
        og_donor_length = len(donor_cfdna_df)
        og_recip_length = len(recip_cfdna_df)

        coord_lengths_donor_recip = []

        for filename in [donor_coord_file, recip_coord_file]:
            coord_path = os.path.join(coordStoreDirectory, filename)

            with h5py.File(coord_path, 'r') as f:
                train_data_length = len(f["trainingCoords"][:])
                validation_data_length = len(f["validationCoords"][:])
                train_labels = f["trainingLabels"][:]
                validation_labels = f["validationLabels"][:]

                #Assertion 2: Assert that the number of data samples and labels are the same for training, validation and test set. 
                assert train_data_length == len(train_labels), (f"For file {filename}," + 
                    f"the length of trainingdata ({train_data_length}) and length of traininglabels ({len(train_labels)}) does not match")
                assert validation_data_length == len(validation_labels), (f"For file {filename}," +
                    f"the length of validation data ({validation_data_length}) and length of validationlabels ({len(validation_labels)}) does not match")

                #Assertion 3: assert that donors have label 1 and recipients have label 0
                #Comparing against a scalar avoids broadcasting the (n, 1) label column against a length-n array
                if "donor" in filename: 
                    assert np.all(train_labels == 1), f"For file {filename}, some training donor samples have incorrect labels"
                    assert np.all(validation_labels == 1), f"For file {filename}, some validation donor samples have incorrect labels"
                
                if "recipient" in filename:
                    assert np.all(train_labels == 0), f"For file {filename}, some training recipient samples have incorrect labels"
                    assert np.all(validation_labels == 0), f"For file {filename}, some validation recipient samples have incorrect labels"

                #Training/validation files contain no test dataset under the patient-level test split
                total_coord_length = train_data_length + validation_data_length
                coord_lengths_donor_recip.append(total_coord_length)
        
        #Assertion 4 & 5: Assert the sum of training, validation and test set in the donor coordinate H5PY file is
        #the same as the minimum of the donor/recipient file samples in the corresponding bed files. 
        #Do the same for the recipient coordinate H5PY file. 
        og_min_length = min(og_donor_length, og_recip_length)
        
        assert coord_lengths_donor_recip[0] == og_min_length, (f"minimum of donor and recipient lengths {og_min_length} " +
            f"does not match the total donor coordinates in the file {donor_coord_file} ({coord_lengths_donor_recip[0]})")

        assert coord_lengths_donor_recip[1] == og_min_length, (f"minimum of donor and recipient lengths {og_min_length} " +
            f"does not match the total recipient coordinates in the file {recip_coord_file} ({coord_lengths_donor_recip[1]})")
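
The label check in Assertion 3 can also be written as a small standalone helper. This is only a minimal sketch, not the notebook's own code: the names `expected_label` and `check_labels` and the example file names are hypothetical, and the sketch assumes the same convention as above (files with "donor" in the name carry label 1, files with "recipient" carry label 0).

```python
import numpy as np

def expected_label(filename):
    """Return 1 for donor files and 0 for recipient files (hypothetical helper)."""
    if "donor" in filename:
        return 1
    if "recipient" in filename:
        return 0
    raise ValueError(f"Cannot infer the expected label from file name: {filename}")

def check_labels(filename, labels):
    """Raise AssertionError if any label differs from the one expected for this file."""
    labels = np.asarray(labels)
    expected = expected_label(filename)
    n_wrong = int((labels != expected).sum())
    assert n_wrong == 0, (
        f"For file {filename}, {n_wrong} of {labels.size} samples do not have label {expected}"
    )

# Example file names are made up for illustration.
check_labels("L1-M1.donor.hdf5", np.ones(5))
check_labels("L1-M1.recipient.hdf5", np.zeros(5))
```

A check of this kind only takes effect with the `assert` keyword. An expression of the form `np.all(...), "message"` on its own builds a tuple and discards it, so it never fails.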

In [None]:
if __name__ == "__main__":
    print(f"Arguments are {arguments}")
    fetchCoordinatesAndStore()
    print(f"================================================================================================================")
    print(f"Finished storing coordinate files, starting verifications... ")
    print(f"================================================================================================================")
    # verifyNewCoordinateFiles()


## NOTE: All functions from this point onwards are for testing if certain parts of the code work well and are not part of the main functionality.

**TESTING:: Get lengths of training and validation data for one file using the methods for splitting based on chromosome**

In [None]:
#Get the training and validation chromosome lists 
columnNames  = ["#chrom", "start", "end", "read_id", "mapq", "cigar1", "cigar2"]

#Skip this if it's already done
inputBedFilesDirectory = arguments["inputBedFileFolder"]
trainingPercent, validationPercent, testPercent = getSampleDistributionPercents()
training_chromosomes, validation_chromosomes = getChromosomeListsForTrainingAndValidation(inputBedFilesDirectory, columnNames, trainingPercent, testPatients)
print(f"Training and validation chromosomes are {training_chromosomes}, {validation_chromosomes}")

columnNames  = ["#chrom", "start", "end", "read_id", "mapq", "cigar1", "cigar2"]
inputBedFilesDirectory = arguments["inputBedFileFolder"]
filename = "L29-M29-5.donor.frag.bed.gz"
filepath = os.path.join(inputBedFilesDirectory, filename)
cfdna_frag_df = pd.read_csv(filepath, sep = "\t", names = columnNames, skiprows=11)
train_chroms = ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11']
validation_chroms = ['12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', 'X', 'Y']
train_data, valid_data = getTrainingAndValidationData(cfdna_frag_df, train_chroms, validation_chroms)
print(len(train_data))
print(len(valid_data))
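
For reference, a chromosome-based split of a fragment table can be sketched as below. This is a hedged illustration of the idea, not the implementation of `getTrainingAndValidationData`: the function name `split_by_chromosome`, the toy data, and the assumption that chromosome names are stored without a "chr" prefix are all made up for this example.

```python
import pandas as pd

def split_by_chromosome(frag_df, train_chroms, validation_chroms, chrom_col="#chrom"):
    """Split fragments into training and validation rows by chromosome membership."""
    # Compare as strings so that integer-typed chromosome columns also match '1', '2', ...
    chroms = frag_df[chrom_col].astype(str)
    train_df = frag_df[chroms.isin(train_chroms)]
    valid_df = frag_df[chroms.isin(validation_chroms)]
    return train_df, valid_df

# Toy fragment table (invented coordinates).
toy = pd.DataFrame({
    "#chrom": ["1", "1", "12", "X", "2"],
    "start": [100, 500, 40, 900, 10],
    "end": [260, 670, 210, 1065, 175],
})
train_df, valid_df = split_by_chromosome(toy, ["1", "2"], ["12", "X"])
print(len(train_df), len(valid_df))  # 3 2
```

Because each row is assigned purely by its chromosome, no genomic region can contribute fragments to both sets, as long as the two chromosome lists do not overlap.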

**Verifying that the numbers of positives, negatives and samples in the coord file are correct with respect to the bed file**

In [None]:
coordStoreDir = arguments["coordStoreDirectory"]
inputBedFilesDirectory = arguments["inputBedFileFolder"]
trainingPercent, validationPercent, testPercent = getSampleDistributionPercents()
testPatients = getTestPatientsList(testPercent, inputBedFilesDirectory)

for filename in os.listdir(coordStoreDir):
    filepath = os.path.join(coordStoreDir, filename)
    inputBedFilesPath = os.path.join(inputBedFilesDirectory, filename.replace('.hdf5', '') + ".frag.bed.gz")
    columnNames  = ["#chrom", "start", "end", "read_id", "mapq", "cigar1", "cigar2"]
    if filename.replace(".hdf5", ".frag.bed.gz") not in testPatients:
        testSamples = 0
        df = pd.read_csv(inputBedFilesPath, sep = "\t", names = columnNames, skiprows=11)
        with h5py.File(filepath, 'r') as f:
            # testSamples += len(f["testCoords"][:])
            # print(f"For filename : {filename}, testSamples: {testSamples}")
            # trainingSamples = len(f["trainingCoords"][:])
            # validationSamples = len(f["validationCoords"][:])
            trainingLabels = f["trainingLabels"][:]
            validationLabels = f["validationLabels"][:]
            pos_train = (trainingLabels == 1).sum()
            neg_train = (trainingLabels == 0).sum()
            pos_valid = (validationLabels == 1).sum()
            neg_valid = (validationLabels == 0).sum()
            print(f"For filename: {filename}, Pos Train: {pos_train}, Neg Train: {neg_train}, Pos valid: {pos_valid} and neg Valid: {neg_valid}")
            #print(f"For filename {filename}, Total samples coord : {trainingSamples + validationSamples}. Total samples bed files : {len(df)}")
            # print(f"Training samples : {trainingSamples} and validationSamples: {validationSamples}")
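
The two small steps used in the loop above, mapping a coordinate file name back to its bed file name and counting positive and negative labels, can be sketched in isolation. The helper names `bed_name_for` and `count_labels` are hypothetical, and the suffix convention (`.hdf5` for coordinate files, `.frag.bed.gz` for bed files) is taken from the file names used in this notebook.

```python
import numpy as np

def bed_name_for(h5_name):
    """Map a coordinate file name to its bed file name (suffix convention assumed)."""
    suffix = ".hdf5"
    if not h5_name.endswith(suffix):
        raise ValueError(f"Expected a name ending in {suffix}, got {h5_name}")
    return h5_name[:-len(suffix)] + ".frag.bed.gz"

def count_labels(labels):
    """Return (number of positives, number of negatives) for a 0/1 label array."""
    labels = np.asarray(labels)
    return int((labels == 1).sum()), int((labels == 0).sum())

print(bed_name_for("L29-M29-5.donor.hdf5"))  # L29-M29-5.donor.frag.bed.gz
print(count_labels(np.array([1, 1, 0, 1])))  # (3, 1)
```

Slicing off the suffix, rather than calling `replace`, avoids touching an "hdf5" substring that happens to appear elsewhere in the name.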

**Verifying test coords for a patient file**

In [47]:
filepath = os.path.join(coordStoreDir, "L21-M23.donor.hdf5")
print(filepath)
with h5py.File(filepath, 'r') as f:
    print(len(f["testCoords"][:]))

/hpc/compgen/projects/fragclass/analysis/mvivekanandan/output/properSplitCoordinateFiles/L21-M23.donor.hdf5
3055


**TESTING:: Test contents of the Coordinate H5PY files**

In [None]:
#Testing the contents of h5py file.
with h5py.File("/hpc/compgen/projects/fragclass/analysis/mvivekanandan/output/coordinateFiles/L4-M36.donor.hdf5", 'r') as f:
    trainingData = f['trainingCoords'][:]
    validationData = f["validationCoords"][:]
    testData = f["testCoords"][:]
    print(len(trainingData), len(validationData), len(testData))
    print(trainingData[0:10, :])
    print(validationData[0:10, :])
    print(testData[0:10, :])
    
    print("Switching to labels")
    trainingLabel = f["trainingLabels"][:]
    validationLabel = f["validationLabels"][:]
    testLabel = f["testLabels"][:]
    print(len(trainingLabel), len(validationLabel), len(testLabel))
    print(trainingLabel[0:5], validationLabel[0:5], testLabel[0:5])

**Move files from test patients to a separate directory**

In [60]:
import shutil
coordStoreDir = arguments["coordStoreDirectory"]
testPatientsDir = "/hpc/compgen/projects/fragclass/analysis/mvivekanandan/output/testCoordFilesEndFragRemoved"
for filename in os.listdir(coordStoreDir):
    filepath = os.path.join(coordStoreDir, filename)
    if filename.replace(".hdf5", ".frag.bed.gz") in testPatients:
        #Move the test patient's coordinate file out of the coordinate store directory
        shutil.move(filepath, os.path.join(testPatientsDir, filename))