## Table of Contents
- [Import libraries](#1)
- [Processing table](#2)
- [Download tables](#3)

<a name='1'></a>
## Import libraries

This cell sets up the environment for data analysis and visualization. It imports pandas for tabular data structures, numpy for numerical operations, tqdm for progress bars, glob and os for file-system access, tarfile for reading archives, re for regular expressions, matplotlib and seaborn for plotting, and matplotlib_venn for Venn diagrams.

Additionally, the script appends a custom directory to sys.path so that two project-specific modules stored there can be imported. Both are pulled in with wildcard imports (from config import * and from functions import *). This is presumably where the constants (such as ETH_PATH_BRCA) and helpers (such as table_processing) used in later cells come from, although the wildcard form hides the origin of each name.

In [1]:
# %load /cluster/home/myurchikova/github/projects2020_ohsu/eth/learning_Master_thesis/TASKS/func/base_imports.py
import pandas as pd
import numpy as np
import tqdm 
import glob
import os
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import tarfile
import re
from matplotlib_venn import venn2, venn2_circles, venn2_unweighted
from matplotlib_venn import venn3, venn3_circles
import sys
sys.path.append(r"/cluster/home/myurchikova/github/projects2020_ohsu/eth/learning_Master_thesis/TASKS/func")
from config import *
from functions import *
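
The effect of appending a directory to sys.path can be reproduced in isolation. The sketch below is illustrative only: the temporary directory, the module name demo_config, and its two constants are hypothetical stand-ins, not the contents of the real config module. It also lists the names a wildcard import would bind, which for a module without __all__ are all names that do not start with an underscore.

```python
import importlib
import os
import sys
import tempfile

# Hypothetical stand-in for the custom "func" directory used in the notebook.
module_dir = tempfile.mkdtemp()
with open(os.path.join(module_dir, "demo_config.py"), "w") as fh:
    fh.write("DEMO_PATH = '/tmp/example.tsv'\n")
    fh.write("DEMO_COLS = ['kmer', 'coord']\n")

# Make the directory importable, as the notebook does for its own modules.
sys.path.append(module_dir)
importlib.invalidate_caches()

# Explicit import: the origin of every name stays visible at the call site.
import demo_config

print(demo_config.DEMO_PATH)

# Names that "from demo_config import *" would copy into the current namespace.
public_names = sorted(n for n in vars(demo_config) if not n.startswith("_"))
print(public_names)
```

Importing the module by name, or importing selected names explicitly, makes it easier to see later which constants come from config and which helpers come from functions.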

<a name='2'></a>
## Processing table

Two empty lists, ETH and OSHU (spelled as in the code), are initialized. Neither is used in any later cell shown here.
pd.read_csv is then called three times:

- eth_df is read from ETH_PATH_BRCA, keeping only the columns listed in COLS_ETH and using a tab ('\t') as the delimiter.
- batch_to_gene is read from BATCH_PATH_BRCA with the default comma delimiter. The column names 'gene_id' and 'batch' are supplied explicitly, so the file is treated as headerless. The resulting DataFrame maps each batch identifier to a gene identifier.
- ohsu_df is read from OHSU_PATH_BRCA, again with a tab as the delimiter.

All three calls pass low_memory=False. This makes pandas parse each file in one pass instead of in chunks, which avoids mixed-type inference within a column at the cost of higher memory use.

In [3]:
ETH = []
OSHU = []

eth_df=pd.read_csv(ETH_PATH_BRCA, usecols = COLS_ETH, sep = '\t', low_memory=False)
batch_to_gene = pd.read_csv(BATCH_PATH_BRCA, names = ['gene_id', 'batch'], low_memory=False)
ohsu_df=pd.read_csv(OHSU_PATH_BRCA, sep = '\t', low_memory=False)

| Unnamed: 0 | kmer | coord | batch | TCGAC8A12P01A11RA11507all | TCGAAOA0JM01A21RA05607all | TCGABHA18V01A11RA12D07all | TCGAA2A0D201A21RA03407all | TCGAA2A0SX01A12RA08407all |
|---|---|---|---|---|---|---|---|---|
| 0 | WYITRSGIA | 92347505:92347506:92349915:92349941:None:None | 56543 | 0.000000 | 0.000000 | 0.000000 | 4.102634 | 0.000000 |
| 1 | WYITRSGIA | 92347505:92347506:92349915:92349941:None:None | 56543 | 0.000000 | 0.000000 | 0.000000 | 4.102634 | 0.000000 |
| 2 | ISSQSRVEK | 92379851:92379859:92493866:92493885:None:None | 56543 | 0.000000 | 0.000000 | 2.474321 | 0.000000 | 0.000000 |
| 3 | RSGDEEKYP | 92600493:92600508:92611313:92611325:None:None | 56543 | 2.922641 | 2.102386 | 1.237161 | 0.000000 | 0.000000 |
| 4 | HLKMKMFQI | 92379850:92379859:92496416:92496434:None:None | 56543 | 2.922641 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 22348810 | DSAVDFTGR | 121183457:121183462:121181326:121181348:None:None | 2858 | 0.000000 | 0.000000 | 3.711482 | 2.735089 | 1.140192 |
| 22348811 | LKWLLLLSL | 121181270:121181278:121180861:121180880:None:None | 2858 | 0.000000 | 0.000000 | 1.237161 | 0.000000 | 0.000000 |
| 22348812 | CLKWCKHPT | 121181270:121181281:121168819:121168835:None:None | 2858 | 0.000000 | 0.000000 | 1.237161 | 0.000000 | 4.560769 |
| 22348813 | KWLLLLSLF | 121181270:121181275:121180858:121180880:None:None | 2858 | 0.000000 | 0.000000 | 1.237161 | 0.000000 | 0.000000 |
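
The three read_csv variants used in this step can be tried on small in-memory tables. This is a minimal sketch with made-up data: cols_eth is a hypothetical stand-in for COLS_ETH, and the column and gene names are invented for illustration.

```python
import io

import pandas as pd

# Toy tab-separated table with a header row and one column we do not need.
eth_text = (
    "kmer\tcoord\tbatch\tsampleA\textra\n"
    "WYITRSGIA\t10:20:30:40:None:None\t1\t0.0\tx\n"
    "ISSQSRVEK\t11:21:31:41:None:None\t2\t2.5\ty\n"
)
cols_eth = ["kmer", "coord", "batch", "sampleA"]  # hypothetical stand-in for COLS_ETH

# usecols keeps only the listed columns; low_memory=False parses the file in one pass.
eth_demo = pd.read_csv(io.StringIO(eth_text), usecols=cols_eth, sep="\t", low_memory=False)

# Headerless, comma-separated mapping: `names` supplies the column labels,
# so the first line is read as data rather than as a header.
batch_text = "ENSG0001.1,1\nENSG0002.1,2\n"
batch_demo = pd.read_csv(io.StringIO(batch_text), names=["gene_id", "batch"])

print(eth_demo)
print(batch_demo)
```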


get_junction_coordinates from table_processing is applied to the 'coord' column of eth_df. Judging by the output of this cell, it adds a strand column and a junction_coordinate column that holds the second and third fields of coord (for example, 1333379:1333505). The progress line in the output shows that this step iterated over about 22.3 million rows and took roughly 1 hour 18 minutes.
pd.merge then joins batch_to_gene with eth_df on the 'batch' column (an inner join by default), which attaches a gene_id to every k-mer row. The result is stored in ETH_df, and the 'batch' column is dropped because it is no longer needed after the merge.
The rename method is called on both ohsu_df and ETH_df with the mapping ETH_COLMNS (spelled this way in the code). In the ETH_df output, the sample columns lose their 'all' suffix (TCGAC8A12P01A11RA11507all becomes TCGAC8A12P01A11RA11507), so both tables end up with matching sample column names.
Finally, ETH_df is the last expression in the cell, so the notebook displays it.

In [4]:
eth_df=table_processing.get_junction_coordinates(eth_df,'coord')
ETH_df = pd.merge(batch_to_gene,eth_df, on=['batch'])
ETH_df.drop(columns=['batch'],inplace=True)
ohsu_df.rename(columns = ETH_COLMNS, inplace = True)
ETH_df.rename(columns = ETH_COLMNS, inplace = True)
ETH_df

22348815it [1:18:07, 4767.27it/s]


Unnamed: 0,gene_id,kmer,coord,TCGAC8A12P01A11RA11507,TCGAAOA0JM01A21RA05607,TCGABHA18V01A11RA12D07,TCGAA2A0D201A21RA03407,TCGAA2A0SX01A12RA08407,strand,junction_coordinate
0,ENSG00000169962.5,TGKTQTTSP,1333364:1333379:1333505:1333517:None:None,11.690565,0.0,2.474321,0.0,9.121539,+,1333379:1333505
1,ENSG00000169962.5,TGKTQTTSP,1333364:1333379:1333505:1333517:None:None,11.690565,0.0,2.474321,0.0,9.121539,+,1333379:1333505
2,ENSG00000169962.5,TGKTQTTSP,1333364:1333379:1333505:1333517:None:None,11.690565,0.0,2.474321,0.0,9.121539,+,1333379:1333505
3,ENSG00000169962.5,DNQVRSPCP,1333115:1333129:1333258:1333271:None:None,0.000000,0.0,1.237161,0.0,0.000000,+,1333129:1333258
4,ENSG00000169962.5,NQTTSPAPF,1333118:1333124:1333505:1333526:None:None,5.845283,0.0,0.000000,0.0,0.000000,+,1333124:1333505
...,...,...,...,...,...,...,...,...,...,...
22348810,ENSG00000166391.15,VELLPEERT,75728985:75728989:75730193:75730216:None:None,0.000000,0.0,0.000000,0.0,5.700962,+,75728989:75730193
22348811,ENSG00000166391.15,NADSILGTP,75728226:75728244:75728789:75728798:None:None,0.000000,0.0,1.237161,0.0,0.000000,+,75728244:75728789
22348812,ENSG00000166391.15,LGTPGANLL,75728241:75728244:75728789:75728813:None:None,0.000000,0.0,1.237161,0.0,0.000000,+,75728244:75728789
22348813,ENSG00000166391.15,LGTPGANLL,75728241:75728244:75728789:75728813:None:None,0.000000,0.0,1.237161,0.0,0.000000,+,75728244:75728789
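
The coordinate extraction, merge, drop and rename steps can be sketched on two rows copied from the output above. Two things here are assumptions: the batch value 7 is made up, and the junction rule (second and third colon-separated fields of coord) is inferred from the plus-strand rows shown. The real get_junction_coordinates also produces a strand column and may treat other cases differently.

```python
import pandas as pd

eth_demo = pd.DataFrame({
    "kmer": ["TGKTQTTSP", "DNQVRSPCP"],
    "coord": [
        "1333364:1333379:1333505:1333517:None:None",
        "1333115:1333129:1333258:1333271:None:None",
    ],
    "batch": [7, 7],  # made-up batch identifier
    "sampleAall": [11.690565, 0.0],
})
batch_demo = pd.DataFrame({"gene_id": ["ENSG00000169962.5"], "batch": [7]})

# Assumed rule: the junction is the 2nd and 3rd field of the colon-separated coord.
parts = eth_demo["coord"].str.split(":", expand=True)
eth_demo["junction_coordinate"] = parts[1] + ":" + parts[2]

# Inner join on 'batch' attaches gene_id to every row; 'batch' is then dropped.
merged = pd.merge(batch_demo, eth_demo, on=["batch"]).drop(columns=["batch"])

# Rename with a dict, analogous to the ETH_COLMNS mapping in the notebook.
merged = merged.rename(columns={"sampleAall": "sampleA"})
print(merged)
```

Splitting with the vectorized str.split is one possible alternative to a row-by-row loop, which matters at the row counts reported in the progress output.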


The rename method replaces the column name 'in-frame_neoepitopes' in ohsu_df with 'kmer', so that the peptide column has the same name as in ETH_df.
Three functions from the table_processing module are then applied to ohsu_df in sequence:

- ohsu_to_eth_coord presumably converts the coordinate convention of the OHSU (Oregon Health & Science University) table to the one used in the ETH table.
- change_column_names presumably standardizes the column names across the two tables.
- preprocess_ohsu performs further preprocessing; its details are not shown in this notebook.

The bare ohsu_df expression between these calls has no visible effect, because a notebook only displays the last expression of a cell.
A new column, 'junction_coordinate', is created by applying a lambda function to the 'jx_shifted' column. The lambda splits each value on semicolons, takes the second and third elements (slice [1:3]), and joins them with a colon, which gives the same start:end format as junction_coordinate in ETH_df.
The column is then printed for inspection. In the recorded run, however, the cell stopped with the ValueError shown after the code, and all following cells are marked In [None], meaning they have no recorded execution.

In [5]:
ohsu_df.rename(columns = {'in-frame_neoepitopes': 'kmer'}, inplace = True)
ohsu_df = table_processing.ohsu_to_eth_coord(ohsu_df)
ohsu_df
ohsu_df = table_processing.change_column_names(ohsu_df)
ohsu_df = table_processing.preprocess_ohsu(ohsu_df)
ohsu_df['junction_coordinate'] = ohsu_df['jx_shifted'].apply(lambda x: ':'.join(x.split(';')[1:3]))
print(ohsu_df['junction_coordinate'])


ValueError: invalid literal for int() with base 10: 'TRPAVAGQC'
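
The junction extraction in this cell can be checked on toy values. The sketch assumes that 'jx_shifted' holds semicolon-separated fields; the values below are made up, since the real column is not shown in the notebook. It also shows a vectorized equivalent of the lambda.

```python
import pandas as pd

# Made-up values; only the semicolon-separated layout is taken from the cell above.
ohsu_demo = pd.DataFrame({"jx_shifted": ["100;200;300;400", "5;6;7;8"]})

# Same expression as in the notebook: keep fields 2 and 3, join them with ':'.
via_apply = ohsu_demo["jx_shifted"].apply(lambda x: ":".join(x.split(";")[1:3]))

# Vectorized equivalent using the pandas string accessor.
parts = ohsu_demo["jx_shifted"].str.split(";", expand=True)
via_vectorized = parts[1] + ":" + parts[2]

print(via_apply.tolist())
print(via_vectorized.tolist())
```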

A new DataFrame is created from ohsu_df by selecting gene_id, kmer, junction_coordinate and the five sample columns. Note that it is assigned to oshu_df (with the letters s and h swapped), so ohsu_df itself keeps all of its columns. The comparison loop further down reads from oshu_df.
The 'strand' and 'coord' columns are then dropped from ETH_df in two separate calls. With inplace=True, ETH_df is modified directly instead of a new DataFrame being returned.

In [None]:
oshu_df=ohsu_df[['gene_id','kmer','junction_coordinate','TCGAC8A12P01A11RA11507', 'TCGAAOA0JM01A21RA05607', 'TCGABHA18V01A11RA12D07', 'TCGAA2A0D201A21RA03407','TCGAA2A0SX01A12RA08407']]
ETH_df.drop(columns=['strand'],inplace=True)
ETH_df.drop(columns=['coord'],inplace=True)
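
The difference between selecting columns into a new variable and dropping columns in place can be seen on a one-row table. The table and its column names below are made up for illustration.

```python
import pandas as pd

table = pd.DataFrame({
    "gene_id": ["g1"],
    "kmer": ["AAA"],
    "strand": ["+"],
    "coord": ["1:2:3:4"],
    "sampleA": [1.0],
})

# Selecting columns creates a new DataFrame; the source table is unchanged.
subset = table[["gene_id", "kmer", "sampleA"]]
columns_before_drop = list(table.columns)

# drop(..., inplace=True) modifies the existing object; several columns fit in one call.
table.drop(columns=["strand", "coord"], inplace=True)

print(columns_before_drop)
print(list(table.columns))
```

Binding the selection to a differently named variable, as the cell above does, therefore leaves two tables in memory with different column sets.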

path_eth_df and path_ohsu_df are built with create_path from SAVE_DIR and a list of directory and file name components.
ETH_df is written to path_eth_df with to_csv. header=True includes the column names in the output file, and sep='\t' makes the file tab-separated. Because the index argument is left at its default, the row index is written as an unnamed first column.
ohsu_df is written to path_ohsu_df with the same parameters. This is the full ohsu_df, not the column subset stored in oshu_df.

In [None]:
path_eth_df=create_path.create_path(SAVE_DIR,[DIR_CSV,DIR_BRCA,NAME_TABLES,NAME_ETH_BRCA])
path_ohsu_df=create_path.create_path(SAVE_DIR,[DIR_CSV,DIR_BRCA,NAME_TABLES,NAME_OHSU_BRCA])
ETH_df.to_csv(path_eth_df, header = True, sep='\t')
ohsu_df.to_csv(path_ohsu_df, header = True, sep='\t')
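
The effect of the to_csv parameters can be checked with a round trip through a temporary file. In this sketch, join_path is a hypothetical stand-in for create_path.create_path, whose implementation is not shown in the notebook (it may, for instance, also create directories), and the table is made up.

```python
import os
import tempfile

import pandas as pd


def join_path(base, parts):
    """Hypothetical stand-in for create_path.create_path: join base and components."""
    return os.path.join(base, *parts)


save_dir = tempfile.mkdtemp()
out_path = join_path(save_dir, ["demo_table.tsv"])

df = pd.DataFrame({"kmer": ["AAA", "BBB"], "sampleA": [1.5, 0.0]})

# Default index=True: the row index is written as a first column without a name.
df.to_csv(out_path, header=True, sep="\t")
with_index = pd.read_csv(out_path, sep="\t")

# index=False: only the data columns are written.
df.to_csv(out_path, header=True, sep="\t", index=False)
without_index = pd.read_csv(out_path, sep="\t")

print(list(with_index.columns))
print(list(without_index.columns))
```

When the file is read back, pandas labels the unnamed index column 'Unnamed: 0'. Passing index=False when writing, or index_col=0 when reading, avoids carrying that column along.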

A dictionary named out_table is initialized with one empty list per output column: the sample name, the k-mer set sizes and contents for each table, and the corresponding sizes and contents for junction coordinates.
A for loop then iterates over the sample columns listed in EXPRESSION, with tqdm providing a progress bar. For each sample, the loop:

- keeps only the rows of ETH_df and oshu_df whose expression in that sample is greater than zero;
- collects the k-mers and the junction coordinates of those rows into sets, and stores the junction coordinate column itself (or the string 'None' if no row is expressed);
- records the size of each set, of the two set differences (ETH without OHSU and OHSU without ETH), and of the intersection;
- stores the underlying elements as well: the coordinate differences, the coordinate intersection, the k-mers of each table, and the shared k-mers.

The 'eth_coor\inter_coor' and 'ohsu_coor\inter_coor' entries subtract the intersection from sets that already exclude it, so they contain the same elements as the corresponding difference entries, only stored as lists.
After the loop, out_table is converted into the DataFrame out_df_original, with one row per sample.

In [None]:
out_table = {
    'sample':[],
    'eth_without_ohsu':[],
    'kmer_eth':[],
    'ohsu_without_eth':[],
    'kmer_ohsu':[],
    'inter':[],
    'kmer_inter':[],
    'full_size_ohsu': [],
    'full_size_eth': [],

    'coord_OHSU':[],
    'coord_ETH':[],
    'size_ohsu_coor' : [], 
    'size_eth_coor' : [], 
    'size_intersection_coor' : [], 
    # Raw strings keep the backslash literal and avoid invalid-escape warnings
    r'size_ohsu\eth_coor' : [], 
    r'size_eth\ohsu_coor' : [],
    r'eth_coor\inter_coor':[],
    r'ohsu_coor\inter_coor':[],
    'inter_coor':[],
    r'ohsu_coor\eth_coor':[],
    r'eth_coor\ohsu_coor':[]

}

for exp_col in tqdm.tqdm(EXPRESSION):
    df1=ETH_df[ETH_df[exp_col] > 0]
    df2=oshu_df[oshu_df[exp_col] > 0]
    eth_kmers=set(df1['kmer'])
    ohsu_kmers=set(df2['kmer'])
    out_table['coord_ETH'].append(df1['junction_coordinate'] if not df1.empty else 'None' )
    out_table['coord_OHSU'].append(df2['junction_coordinate'] if not df2.empty else 'None')
    ser1 = set(df1['junction_coordinate'])
    ser2 = set(df2['junction_coordinate'])
    out_table['size_ohsu_coor'].append(len(ser2))
    out_table['size_eth_coor'].append(len(ser1))
    out_table[r'size_ohsu\eth_coor'].append(len(df_ohsu_filter_coor:=ser2.difference(ser1)))
    out_table[r'size_eth\ohsu_coor'].append(len(df_eth_filter_coor:=ser1.difference(ser2)))
    
    out_table[r'ohsu_coor\eth_coor'].append(df_ohsu_filter_coor)
    out_table[r'eth_coor\ohsu_coor'].append(df_eth_filter_coor)
    
    out_table['size_intersection_coor'].append(len(df_inter_filter_coor:=ser2 & ser1))
    out_table[r'eth_coor\inter_coor'].append(df_eth_without_inter_coor:=list(df_eth_filter_coor.difference(df_inter_filter_coor)))
    out_table[r'ohsu_coor\inter_coor'].append(df_ohsu_without_inter_coor:=list(df_ohsu_filter_coor.difference(df_inter_filter_coor)))
    out_table['inter_coor'].append(list(df_inter_filter_coor))
    
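    # Convert each set to a list first: np.array(a_set) would wrap the whole set in a 0-d object array
    out_table['kmer_eth'].append(np.array(list(eth_kmers)))
    out_table['kmer_ohsu'].append(np.array(list(ohsu_kmers)))
    # append() takes a single item as input and adds it to the end of the list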
    out_table['sample'].append(exp_col)
    out_table['full_size_ohsu'].append(n_ohsu := len(ohsu_kmers))
    out_table['full_size_eth'].append(n_eth:=len(eth_kmers))
    
    out_table['eth_without_ohsu'].append(len(eth_without_ohsu_kmers:=eth_kmers.difference(ohsu_kmers)))
    out_table['ohsu_without_eth'].append(len(ohsu_without_eth_kmers:=ohsu_kmers.difference(eth_kmers)))
    out_table['inter'].append(intersection:=len(intersection_kmers:= (eth_kmers & ohsu_kmers)))
    out_table['kmer_inter'].append(list(intersection_kmers))
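
One detail is worth checking whenever a set of k-mers is stored as a NumPy array: np.array does not iterate over a set, so passing the set directly yields a zero-dimensional object array that holds the set as a single element. The short sketch below uses made-up k-mers to show the difference.

```python
import numpy as np

kmers = {"AAA", "BBB", "CCC"}  # made-up k-mers

# Passing the set directly: a 0-d object array wrapping the set itself.
wrapped = np.array(kmers)

# Converting to a (sorted) list first: a 1-d array with one element per k-mer.
as_array = np.array(sorted(kmers))

print(wrapped.shape, wrapped.dtype)
print(as_array.shape, as_array.tolist())
```

Sorting is optional, but it makes the element order reproducible, which a plain set does not guarantee.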


In [None]:
out_df_original=pd.DataFrame(out_table)
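
The per-sample comparison reduces to a few set operations, which can be followed more easily on a toy example. The tables, the sample name s1 and the k-mers below are made up. The sketch builds one dictionary per sample instead of parallel lists, which is one possible way to keep the fields of a row together.

```python
import pandas as pd

eth_demo = pd.DataFrame({
    "kmer": ["AAA", "BBB", "CCC"],
    "junction_coordinate": ["1:2", "3:4", "5:6"],
    "s1": [1.0, 0.0, 2.0],
})
ohsu_demo = pd.DataFrame({
    "kmer": ["CCC", "DDD"],
    "junction_coordinate": ["5:6", "7:8"],
    "s1": [3.0, 1.0],
})

rows = []
for sample in ["s1"]:  # stand-in for the EXPRESSION list
    # Keep only rows expressed in this sample.
    eth_expr = eth_demo[eth_demo[sample] > 0]
    ohsu_expr = ohsu_demo[ohsu_demo[sample] > 0]
    eth_kmers = set(eth_expr["kmer"])
    ohsu_kmers = set(ohsu_expr["kmer"])
    rows.append({
        "sample": sample,
        "full_size_eth": len(eth_kmers),
        "full_size_ohsu": len(ohsu_kmers),
        "eth_without_ohsu": len(eth_kmers - ohsu_kmers),
        "ohsu_without_eth": len(ohsu_kmers - eth_kmers),
        "inter": len(eth_kmers & ohsu_kmers),
        "kmer_inter": sorted(eth_kmers & ohsu_kmers),
    })

summary = pd.DataFrame(rows)
print(summary)
```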

<a name='3'></a>
## Download tables

path_non_filtering is built with create_path from SAVE_DIR and the components DIR_CSV, DIR_BRCA, NAME_TABLES and NAME_NON_FILTERING_BRCA. BRCA is the TCGA abbreviation for breast invasive carcinoma, which matches the TCGA sample identifiers in the tables above.
out_df_original is then written to that path with to_csv. header=True keeps the column headers, and sep='\t' produces a tab-separated (TSV) file.

In [None]:
path_non_filtering=create_path.create_path(SAVE_DIR,[DIR_CSV,DIR_BRCA,NAME_TABLES,NAME_NON_FILTERING_BRCA])
out_df_original.to_csv(path_non_filtering, header = True, sep='\t')
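
Because several columns of the summary table hold Python containers rather than scalars, it helps to know how such cells survive a text format. The sketch below uses a made-up one-row table to show that a list cell is written as its string representation and can be parsed back with ast.literal_eval. This only works for cells whose text is a valid Python literal such as a list; it would not restore a pandas Series stored in a cell.

```python
import ast
import io

import pandas as pd

summary = pd.DataFrame({
    "sample": ["s1"],
    "inter": [2],
    "kmer_inter": [["CCC", "DDD"]],  # a list stored in a single cell
})

# Write to an in-memory buffer instead of a file, with the notebook's separator.
buffer = io.StringIO()
summary.to_csv(buffer, header=True, sep="\t", index=False)
buffer.seek(0)

reloaded = pd.read_csv(buffer, sep="\t")
raw_cell = reloaded.loc[0, "kmer_inter"]  # the list comes back as plain text

# Parse the text back into a Python list.
reloaded["kmer_inter"] = reloaded["kmer_inter"].apply(ast.literal_eval)
print(type(raw_cell).__name__, reloaded.loc[0, "kmer_inter"])
```

If the nested values need to be restored exactly, a binary format such as pickle (DataFrame.to_pickle) is an alternative to a tab-separated file.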