## Table of Contents
- [Import libraries](#1)
- [Download tables](#2)
- [Process tables](#3)
- [Download plots](#4)
- [Download plots with log10](#5)

<a name='1'></a>
## Import libraries

This cell sets up the environment for data analysis and visualization. It imports pandas for data structures, numpy for numerical operations, tqdm for progress bars, glob and os for file-system access, tarfile and re for archive handling and regular expressions, matplotlib and seaborn for plotting, and matplotlib_venn for Venn diagrams.

The cell also appends a custom directory to the system path so that the project's own modules can be found. These modules are pulled in with wildcard imports (`from config import *` and `from functions import *`), which is presumably where names used later, such as `OVARIAN_ETH`, `BATCH_PATH`, `table_processing` and `create_path`, come from. The `autoreload` extension reloads the modules automatically whenever their source changes.

In [5]:
# %load /cluster/home/myurchikova/github/projects2020_ohsu/eth/learning_Master_thesis/TASKS/func/base_imports.py
import pandas as pd
import numpy as np
import tqdm 
import glob
import os
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import tarfile
import re
from matplotlib_venn import venn2, venn2_circles, venn2_unweighted
from matplotlib_venn import venn3, venn3_circles
import sys
sys.path.append(r"/cluster/home/prelotla/github/projects2020_ohsu/eth/MY_Master_thesis_rerun_LP/TASKS/func/")
%load_ext autoreload
%autoreload 2
from config import *
from functions import *




The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


<a name='2'></a>
## Download tables

Define the constants: sample identifiers, key columns, and the column list and rename map for the ETH table.

In [6]:

samples = ['TCGA-25-1319-01A-01R-1565-13',
               'TCGA-25-1313-01A-01R-1565-13',
               'TCGA-61-2008-01A-02R-1568-13',
               'TCGA-24-1431-01A-01R-1566-13',
               'TCGA-24-2298-01A-01R-1569-13']
MAIN = [
        'kmer',
        'batch',
]
                                 
COLS_ETH = [
            'coord',
            'kmer',
            'batch',
            'TCGA25131901A01R156513all',
            'TCGA25131301A01R156513all',
            'TCGA61200801A02R156813all',
            'TCGA24143101A01R156613all',
            'TCGA24229801A01R156913all',
           ]
ETH_COLMNS= {
             'TCGA25131901A01R156513all':'TCGA25131901A01R156513',
             'TCGA25131301A01R156513all':'TCGA25131301A01R156513',
             'TCGA61200801A02R156813all':'TCGA61200801A02R156813',
             'TCGA24143101A01R156613all':'TCGA24143101A01R156613',
             'TCGA24229801A01R156913all':'TCGA24229801A01R156913',
            }
EXPRESSION=['TCGA25131901A01R156513',
            'TCGA25131301A01R156513',
            'TCGA61200801A02R156813',
            'TCGA24143101A01R156613',
            'TCGA24229801A01R156913',]

The two cells below load the input tables:

- `eth_df = pd.read_csv(OVARIAN_ETH, usecols=COLS_ETH, sep='\t', low_memory=False)` reads the tab-separated file at `OVARIAN_ETH`, keeping only the columns listed in `COLS_ETH` to save memory. `low_memory=False` makes pandas parse the file in one pass rather than in chunks, which avoids mixed-type columns at the cost of higher peak memory.
- `batch_to_gene = pd.read_csv(BATCH_PATH, names=['gene_id', 'batch'], low_memory=False)` reads a comma-separated file from `BATCH_PATH`, treats it as headerless, and names its two columns `gene_id` and `batch`.
- `eth_df.head()` displays the first five rows of `eth_df` as a preview.
- `ohsu_df = pd.read_csv(OVARIAN_OHSU, sep='\t', low_memory=False)` reads the tab-separated OHSU table at `OVARIAN_OHSU` with all of its columns.
- `ohsu_df.head()` previews the first five rows of `ohsu_df`.

The preview of `eth_df` shows the following columns:

- `kmer`: a k-mer, here a 9-residue amino-acid sequence such as `RKSTQMPCT`.
- `coord`: colon-separated genomic coordinates, with `None` in unused fields.
- `batch`: an integer identifier that `batch_to_gene` maps to a `gene_id`.
- Columns whose names start with TCGA (The Cancer Genome Atlas): per-sample expression values.
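
The loading options described above can be tried on in-memory text. This is a minimal sketch with made-up data; the column names `sampleAall` and `unused` and both toy tables are assumptions, not the real files. It also shows why the preview lists `kmer` before `coord` even though `COLS_ETH` starts with `coord`: `usecols` selects columns but keeps the order found in the file.

```python
import io

import pandas as pd

# Hypothetical TSV that mimics the layout of the ETH table (toy values).
tsv_text = (
    "kmer\tcoord\tbatch\tsampleAall\tunused\n"
    "RKSTQMPCT\t92379857:92379859:92611313:92611338:None:None\t56543\t20.98\tx\n"
    "NQTTSPAPF\t1333118:1333124:1333505:1333526:None:None\t7\t0.0\ty\n"
)

# Only the listed columns are parsed; "unused" is skipped entirely.
cols = ["coord", "kmer", "batch", "sampleAall"]
toy_eth = pd.read_csv(io.StringIO(tsv_text), usecols=cols, sep="\t", low_memory=False)

# Hypothetical headerless CSV: names= supplies the column labels,
# so the first line is treated as data rather than as a header.
batch_csv = "ENSG00000169962.5,7\nENSG00000171533.11,56543\n"
toy_batch = pd.read_csv(io.StringIO(batch_csv), names=["gene_id", "batch"])

print(list(toy_eth.columns))
print(toy_batch)
```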

In [7]:
eth_df=pd.read_csv(OVARIAN_ETH, usecols = COLS_ETH, sep = '\t', low_memory=False)
batch_to_gene = pd.read_csv(BATCH_PATH, names = ['gene_id', 'batch'], low_memory=False)
eth_df.head()

Unnamed: 0,kmer,coord,batch,TCGA25131901A01R156513all,TCGA25131301A01R156513all,TCGA61200801A02R156813all,TCGA24143101A01R156613all,TCGA24229801A01R156913all
0,RKSTQMPCT,92379857:92379859:92611313:92611338:None:None,56543,20.982616,8.422932,17.304198,16.069005,19.585264
1,RKSTQMPCT,92379857:92379859:92611313:92611338:None:None,56543,20.982616,8.422932,17.304198,16.069005,19.585264
2,RKSTQMPCT,92379857:92379859:92611313:92611338:None:None,56543,20.982616,8.422932,17.304198,16.069005,19.585264
3,RKSTQMPCT,92379857:92379859:92611313:92611338:None:None,56543,20.982616,8.422932,17.304198,16.069005,19.585264
4,RKSTQMPCT,92379857:92379859:92611313:92611338:None:None,56543,20.982616,8.422932,17.304198,16.069005,19.585264


In [8]:
ohsu_df=pd.read_csv(OVARIAN_OHSU, sep = '\t', low_memory=False)
ohsu_df.head()

Unnamed: 0,jx,TCGA-24-1431-01A-01R-1566-13,TCGA-24-2298-01A-01R-1569-13,TCGA-25-1313-01A-01R-1565-13,TCGA-25-1319-01A-01R-1565-13,TCGA-61-2008-01A-02R-1568-13,TCGA-A2-A0D2-01A-21R-A034-07,TCGA-A2-A0SX-01A-12R-A084-07,TCGA-AO-A0JM-01A-21R-A056-07,TCGA-BH-A18V-01A-11R-A12D-07,...,modified_upstream_txs,in-frame_all-transcript_biexons,in-frame_nonhanging-tx_biexons,in-frame_peptide_sequence,hanging_txs_included_inframe_pepseqs,prefiltered_in-frame_epitopes,prefiltered_inframe_epitope_count,in-frame_neoepitopes,in-frame_neoepitope_count,frame-agnostic_all-transcript_biexons
0,chr10;48726;48803;-,0.0,2.17614,0.842293,2.517914,0.961344,0.0,5.700962,2.102386,1.237161,...,['ENST00000561967.1.MOD.CHR10.48726.48803.MINU...,LTQIGQCGNQIGAKFWEVISDEHAIDSAGTYHGDSHLQLERINVYY...,LTQIGQCGNQIGAKFWEVISDEHAIDSAGTYHGDSHLQLERINVYY...,NVYYNEASGGRYVPRAV;STCTTTRPAVAGQCGA,NVYYNEASGGRYVPRAV;STCTTTRPAVAGQCGA,NVYYNEASG;VYYNEASGG;YYNEASGGR;YNEASGGRY;NEASGG...,17.0,TTRPAVAGQ;TRPAVAGQC;RPAVAGQCG;PAVAGQCGA,4.0,LTQIGQCGNQIGAKFWEVISDEHAIDSAGTYHGDSHLQLERINVYY...
1,chr10;277578;281199;-,2.295572,30.465966,16.003571,10.071656,26.917642,51.9667,53.58904,31.535793,50.723584,...,['ENST00000280886.12.MOD.CHR10.277578.281199.M...,RTELTDANGERHDALYVVGALDEAMELRGMRYHPIDIETSVIRAHK...,RTELTDANGERHDALYVVGALDEAMELRGMRYHPIDIETSVIRAHK...,RAHKSVTECAVFTWTNL,RAHKSVTECAVFTWTNL,RAHKSVTEC;AHKSVTECA;HKSVTECAV;KSVTECAVF;SVTECA...,9.0,,0.0,RTELTDANGERHDALYVVGALDEAMELRGMRYHPIDIETSVIRAHK...
2,chr10;280261;281199;-,0.0,0.0,0.0,0.0,0.961344,0.0,0.0,6.307159,2.474321,...,['ENST00000280886.12.MOD.CHR10.280261.281199.M...,RTELTDANGERHDALYVVGALDEAMELRGMRYHPIDIETSVIRAHK...,RTELTDANGERHDALYVVGALDEAMELRGMRYHPIDIETSVIRAHK...,RAHKSVTEC,RAHKSVTEC,RAHKSVTEC,1.0,,0.0,RTELTDANGERHDALYVVGALDEAMELRGMRYHPIDIETSVIRAHK...
3,chr10;281324;283271;-,3.443358,23.212165,18.530451,15.107483,20.188231,32.821074,33.065578,44.15011,40.8263,...,['ENST00000280886.12.MOD.CHR10.281324.283271.M...,IWVHSAHNASGYFTIYGDESLQSDHFNSRLSFGDTQTIWARTGYLG...,IWVHSAHNASGYFTIYGDESLQSDHFNSRLSFGDTQTIWARTGYLG...,TELTDANGERHDALYVV,TELTDANGERHDALYVV,TELTDANGE;ELTDANGER;LTDANGERH;TDANGERHD;DANGER...,9.0,,0.0,IWVHSAHNASGYFTIYGDESLQSDHFNSRLSFGDTQTIWARTGYLG...
4,chr10;283447;286272;-,1.147786,41.346668,19.372744,6.714437,28.84033,61.539513,62.710579,37.842952,92.787044,...,['ENST00000280886.12.MOD.CHR10.283447.286272.M...,ALRHDRVRLVERGSPHSLPLMESGKILPGVRIIIANPETKGPLGDS...,ALRHDRVRLVERGSPHSLPLMESGKILPGVRIIIANPETKGPLGDS...,PLGDSHLGEIWVHSAH,PLGDSHLGEIWVHSAH,PLGDSHLGE;LGDSHLGEI;GDSHLGEIW;DSHLGEIWV;SHLGEI...,8.0,,0.0,ALRHDRVRLVERGSPHSLPLMESGKILPGVRIIIANPETKGPLGDS...


<a name='3'></a>
## Process tables

The next cell aligns the two tables:

- `eth_df = table_processing.get_junction_coordinates_updated(eth_df, 'coord')` passes `eth_df` to a function from the project's `table_processing` module. Judging by the output, it derives the `strand` and `junction_coordinate` columns from `coord`.
- `ETH_df = pd.merge(batch_to_gene, eth_df, on=['batch'])` joins `batch_to_gene` and `eth_df` on their shared `batch` column (an inner join by default), which attaches a `gene_id` to every k-mer row.
- `ETH_df.drop(columns=['batch'], inplace=True)` removes the `batch` column, which is no longer needed once the gene identifiers are attached.
- `ohsu_df.rename(columns=ETH_COLMNS, inplace=True)` applies the rename map `ETH_COLMNS` (spelled this way in the constants cell) to `ohsu_df`. By default, `rename` ignores keys that are not among a DataFrame's columns, so the call only affects columns that carry the `all` suffix.
- `ETH_df.rename(columns=ETH_COLMNS, inplace=True)` applies the same map to `ETH_df`, stripping the `all` suffix from the sample columns.
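
The merge, drop, and rename steps can be followed on a toy example. This is a sketch with invented values; the names `geneA`, `S1all`, and `not_present` are placeholders. It illustrates two behaviours worth keeping in mind: the default inner join drops rows whose `batch` has no partner, and `rename` silently skips keys that match no column.

```python
import pandas as pd

# Toy stand-ins for batch_to_gene and eth_df (values are made up).
toy_batch = pd.DataFrame({"gene_id": ["geneA", "geneB"], "batch": [1, 2]})
toy_eth = pd.DataFrame(
    {
        "kmer": ["AAAAAAAAA", "CCCCCCCCC", "DDDDDDDDD"],
        "batch": [1, 1, 3],
        "S1all": [0.5, 0.0, 2.0],
    }
)

# Inner join: batch 3 has no gene and batch 2 has no k-mer, so both vanish.
merged = pd.merge(toy_batch, toy_eth, on=["batch"])
merged = merged.drop(columns=["batch"])

# "not_present" matches no column and is ignored without an error.
renamed = merged.rename(columns={"S1all": "S1", "not_present": "ignored"})
print(renamed)
```

If silently dropped rows are a concern, passing `how="left"` or `indicator=True` to `pd.merge` makes unmatched rows visible.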

In [9]:
# LP: Changed ETH table processing table_processing.get_junction_coordinates_updated
eth_df=table_processing.get_junction_coordinates_updated(eth_df,'coord')
ETH_df = pd.merge(batch_to_gene,eth_df, on=['batch'])
ETH_df.drop(columns=['batch'],inplace=True)
ohsu_df.rename(columns = ETH_COLMNS, inplace = True)
ETH_df.rename(columns = ETH_COLMNS, inplace = True)
ETH_df

Unnamed: 0,gene_id,kmer,coord,TCGA25131901A01R156513,TCGA25131301A01R156513,TCGA61200801A02R156813,TCGA24143101A01R156613,TCGA24229801A01R156913,strand,junction_coordinate
0,ENSG00000169962.5,NQTTSPAPF,1333118:1333124:1333505:1333526:None:None,0.000000,0.0,0.0,4.591144,0.72538,+,1333124:1333505
1,ENSG00000169962.5,NQTTSPAPF,1333118:1333124:1333505:1333526:None:None,0.000000,0.0,0.0,4.591144,0.72538,+,1333124:1333505
2,ENSG00000169962.5,DNQTTSPAP,1333115:1333124:1333505:1333523:None:None,0.000000,0.0,0.0,4.591144,0.72538,+,1333124:1333505
3,ENSG00000169962.5,DNQTTSPAP,1333115:1333124:1333505:1333523:None:None,0.000000,0.0,0.0,4.591144,0.72538,+,1333124:1333505
4,ENSG00000169962.5,QTTSPAPFV,1333121:1333124:1333505:1333529:None:None,0.000000,0.0,0.0,4.591144,0.72538,+,1333124:1333505
...,...,...,...,...,...,...,...,...,...,...
46118846,ENSG00000171533.11,KSCTSVEFP,75608108:75608111:75607238:75607262:None:None,4.196523,0.0,0.0,0.000000,0.00000,-,75607262:75608108
46118847,ENSG00000171533.11,KSCTSVEFP,75608108:75608111:75607238:75607262:None:None,4.196523,0.0,0.0,0.000000,0.00000,-,75607262:75608108
46118848,ENSG00000171533.11,KSCTSVEFP,75608108:75608111:75607238:75607262:None:None,4.196523,0.0,0.0,0.000000,0.00000,-,75607262:75608108
46118849,ENSG00000171533.11,KSCTSVEFP,75608108:75608111:75607238:75607262:None:None,4.196523,0.0,0.0,0.000000,0.00000,-,75607262:75608108
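
The source of `get_junction_coordinates_updated` is not shown here. The sketch below is therefore only a guess that reproduces the two `junction_coordinate` values visible in the output above: it sorts the four exon boundaries of a two-exon `coord` string and keeps the middle two. The function name `junction_from_coord` is hypothetical, and three-exon coordinates are deliberately not handled.

```python
def junction_from_coord(coord: str) -> str:
    """Hypothetical helper: middle two of the four sorted exon boundaries."""
    fields = [field for field in coord.split(":") if field != "None"]
    if len(fields) != 4:
        raise ValueError("this sketch only handles two-exon coordinates")
    bounds = sorted(int(field) for field in fields)
    # bounds[1] is the end of the upstream exon, bounds[2] the start of
    # the downstream exon, regardless of strand.
    return f"{bounds[1]}:{bounds[2]}"


plus_strand = junction_from_coord("1333118:1333124:1333505:1333526:None:None")
minus_strand = junction_from_coord("75608108:75608111:75607238:75607262:None:None")
print(plus_strand)   # 1333124:1333505
print(minus_strand)  # 75607262:75608108
```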


The following cell prepares the OHSU table:

- `ohsu_df.rename(columns={'in-frame_neoepitopes': 'kmer'}, inplace=True)` renames the column `in-frame_neoepitopes` to `kmer`, so that both tables use the same name for the peptide column.
- `ohsu_df = table_processing.ohsu_to_eth_coord(ohsu_df)` calls a function from the project's `table_processing` module that appears to convert OHSU coordinates to the ETH convention. In the output, the first junction `chr10;48726;48803;-` ends up as `48725:48803`.
- `ohsu_df = table_processing.change_column_names(ohsu_df)` changes column names in `ohsu_df`. The details are not shown; presumably it brings the sample columns in line with the names in `EXPRESSION`.
- `ohsu_df = table_processing.preprocess_ohsu(ohsu_df)` performs further preprocessing. The repeated index values in the printed output suggest that it expands each junction into one row per k-mer.
- `ohsu_df['junction_coordinate'] = ohsu_df['jx_shifted'].apply(lambda x: ':'.join(x.split(';')[1:3]))` builds the `junction_coordinate` column from `jx_shifted`: each string is split on `;`, the second and third fields (start and end) are kept, and they are joined with `:` to match the ETH format.
- `print(ohsu_df['junction_coordinate'])` prints the new column for inspection.
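
The coordinate extraction from `jx_shifted` can be checked on toy strings. This sketch assumes that `jx_shifted` has the same `chrom;start;end;strand` layout as the `jx` column in the preview; the two example strings are made up to match the printed output. A vectorised variant using the pandas string accessor is shown next to the `apply` version; both should give the same result.

```python
import pandas as pd

# Toy values in the assumed "chrom;start;end;strand" layout.
toy = pd.DataFrame({"jx_shifted": ["chr10;48725;48803;-", "chr10;390095;390274;+"]})

# Same expression as in the notebook: keep fields 1 and 2, join with ":".
toy["junction_coordinate"] = toy["jx_shifted"].apply(
    lambda x: ":".join(x.split(";")[1:3])
)

# Vectorised alternative: split once into columns, then concatenate.
parts = toy["jx_shifted"].str.split(";", expand=True)
toy["junction_vectorised"] = parts[1] + ":" + parts[2]

print(toy)
```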

In [10]:
# LP: Changed table_processing.ohsu_to_eth_coord (See function)
ohsu_df.rename(columns = {'in-frame_neoepitopes': 'kmer'}, inplace = True)
ohsu_df = table_processing.ohsu_to_eth_coord(ohsu_df)
ohsu_df = table_processing.change_column_names(ohsu_df)
ohsu_df = table_processing.preprocess_ohsu(ohsu_df)
ohsu_df['junction_coordinate'] = ohsu_df['jx_shifted'].apply(lambda x: ':'.join(x.split(';')[1:3]))
print(ohsu_df['junction_coordinate'])

0                  48725:48803
0                  48725:48803
0                  48725:48803
0                  48725:48803
29               390095:390274
                  ...         
1344358    109916656:109919488
1344358    109916656:109919488
1344358    109916656:109919488
1344358    109916656:109919488
1344358    109916656:109919488
Name: junction_coordinate, Length: 3182186, dtype: object


In [11]:
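# Keep only the columns needed for the comparison.
# Note the name: this subset is `oshu_df`, while the full table stays in `ohsu_df`.
oshu_df=ohsu_df[['gene_id','kmer','junction_coordinate']+EXPRESSION]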

In [12]:
ETH_df.drop(columns=['strand'],inplace=True)
ETH_df.drop(columns=['coord'],inplace=True)
ETH_df.columns

Index(['gene_id', 'kmer', 'TCGA25131901A01R156513', 'TCGA25131301A01R156513',
       'TCGA61200801A02R156813', 'TCGA24143101A01R156613',
       'TCGA24229801A01R156913', 'junction_coordinate'],
      dtype='object')

<a name='4'></a>
## Plotting data

The cells below first save the processed tables and then compare them sample by sample. The comparison cell initializes a dictionary named `out_table` whose keys each map to a list. The lists collect, per sample, the k-mer sets, the junction coordinates, and their sizes. A `for` loop then iterates over the sample columns in `EXPRESSION`, wrapped in `tqdm` to show a progress bar. For each sample, the loop:

- keeps the rows of `ETH_df` and `oshu_df` whose expression in that sample is greater than zero and collects their k-mers into sets;
- stores the junction coordinates of both tables (the ETH coordinates first pass through `table_processing.separate_ETH_3exons`) together with the sizes of the two coordinate sets, their intersection, and their differences;
- stores the k-mers unique to each table and the k-mers they share, along with the corresponding counts.

After the loop, `out_table` is converted to the DataFrame `out_df_original`, which the final cells write to a tab-separated file.
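
The per-sample set comparison can be illustrated on a small scale. This is a sketch with invented k-mers and a single hypothetical sample column `S1`; it mirrors the filtering and set operations of the loop but leaves out the coordinate bookkeeping.

```python
import pandas as pd

# Toy tables: one k-mer column and one expression column per table.
toy_eth = pd.DataFrame({"kmer": ["AAA", "BBB", "CCC"], "S1": [1.0, 0.0, 2.0]})
toy_ohsu = pd.DataFrame({"kmer": ["CCC", "DDD", "EEE"], "S1": [3.0, 1.0, 0.0]})

rows = []
for sample in ["S1"]:
    # Keep only k-mers that are expressed (> 0) in this sample.
    eth_kmers = set(toy_eth.loc[toy_eth[sample] > 0, "kmer"])
    ohsu_kmers = set(toy_ohsu.loc[toy_ohsu[sample] > 0, "kmer"])
    rows.append(
        {
            "sample": sample,
            "eth_without_ohsu": len(eth_kmers - ohsu_kmers),
            "ohsu_without_eth": len(ohsu_kmers - eth_kmers),
            "inter": len(eth_kmers & ohsu_kmers),
        }
    )

summary = pd.DataFrame(rows)
print(summary)
```

Collecting one dictionary per sample and building the DataFrame at the end avoids keeping many parallel lists in sync, which is one possible alternative to the layout used in the notebook.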

In [13]:
path_eth_df=create_path.create_path(SAVE_DIR,[DIR_CSV,DIR_OVARIAN,NAME_TABLES,NAME_ETH_OVARIAN])
path_ohsu_df=create_path.create_path(SAVE_DIR,[DIR_CSV,DIR_OVARIAN,NAME_TABLES,NAME_OHSU_OVARIAN])
ETH_df.to_csv(path_eth_df, header = True, sep='\t')
ohsu_df.to_csv(path_ohsu_df, header = True, sep='\t')

In [14]:
# LP Updated separated the coordinates
sample=[]
inter = []
kmer_inter=[]
eth_without_ohsu =  []
ohsu_withou_eth =  []
fs_ohsu =  []
kmer_ohsu=[]
fs_eth =  []
kmer_eth=[]
batch =  []
gene_id =  []
LANG='ENG'
OHSU_COLOR ='red' #'red'
ETH_COLOR = 'green'#'green'
OHSU_ETH_COLOR ='yellow' #'yellow'

if LANG == 'ENG':
    title_venn = 'Comparison of unique kmers ({exp_col})'
    title_venn_perc = 'Comparison of unique kmers in percentages ({exp_col})'
    title_barplot = 'Comparison of unique kmers ({exp_col})'
else:
    title_venn = 'Порівняння унікальних kmers ({exp_col})'
    title_venn_perc = 'Порівняння унікальних kmers у відсотках ({exp_col})'
    title_barplot = 'Порівняння унікальних kmers ({exp_col})'

out_table = {
    'sample':sample,
    'eth_without_ohsu':eth_without_ohsu,
    'kmer_eth':kmer_eth,
    'ohsu_without_eth':ohsu_withou_eth,
    'kmer_ohsu':kmer_ohsu,
    'inter':inter,
    'kmer_inter':kmer_inter,
    'full_size_ohsu': fs_ohsu,
    'full_size_eth': fs_eth,

    'coord_OHSU':[],
    'coord_ETH':[],
    'size_ohsu_coor' : [], 
    'size_eth_coor' : [], 
    'size_intersection_coor' : [], 
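    # Raw strings keep the backslash (set-difference notation) literal
    # and avoid invalid-escape warnings; the key text is unchanged.
    r'size_ohsu\eth_coor' : [], 
    r'size_eth\ohsu_coor' : [],
    r'eth_coor\inter_coor':[],
    r'ohsu_coor\inter_coor':[],
    r'eth_coor\ohsu_coor':[],
    r'ohsu_coor\eth_coor':[],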
    'inter_coor':[],
}
for exp_col in tqdm.tqdm(EXPRESSION):
    df1=ETH_df[ETH_df[exp_col] > 0]
    df2=oshu_df[oshu_df[exp_col] > 0]
    eth_kmers=set(df1['kmer'])
    ohsu_kmers=set(df2['kmer'])
    
    # df1, df2 = filter_df_common_kmers(eth_df[['gene_id', 'kmer', 'junction_coordinate']], ohsu_df[['gene_id', 'kmer', 'junction_coordinate']])
    out_table['coord_ETH'].append(df1['junction_coordinate'] if not df1.empty else 'None' )
    out_table['coord_OHSU'].append(df2['junction_coordinate'] if not df2.empty else 'None')
    ser1 = set(table_processing.separate_ETH_3exons(df1['junction_coordinate'])) # LP
    ser2 = set(df2['junction_coordinate'])
    out_table['size_ohsu_coor'].append(len(ser2))
    out_table['size_eth_coor'].append(len(ser1))
    # Raw-string keys: same text as in out_table, with the backslash kept literal.
    out_table[r'size_ohsu\eth_coor'].append(len(df_ohsu_filter_coor:=ser2.difference(ser1)))
    out_table[r'size_eth\ohsu_coor'].append(len(df_eth_filter_coor:=ser1.difference(ser2)))
    out_table['size_intersection_coor'].append(len(df_inter_filter_coor:=ser2 & ser1))
    out_table[r'eth_coor\inter_coor'].append(df_eth_witout_inter_coor:=list(df_eth_filter_coor.difference(df_inter_filter_coor)))
    out_table[r'ohsu_coor\inter_coor'].append(df_ohsu_witout_inter_coor:=list(df_ohsu_filter_coor.difference(df_inter_filter_coor)))
    out_table[r'eth_coor\ohsu_coor'].append(df_eth_filter_coor)
    out_table[r'ohsu_coor\eth_coor'].append(df_ohsu_filter_coor)
    out_table['inter_coor'].append(list(df_inter_filter_coor))
    
    sample.append(exp_col)
    fs_ohsu.append(n_ohsu := len(ohsu_kmers))
    kmer_ohsu.append(list(ohsu_kmers))
    fs_eth.append(n_eth:=len(eth_kmers))
    kmer_eth.append(list(eth_kmers))
    eth_without_ohsu.append(len(eth_without_ohsu_kmers:=eth_kmers.difference(ohsu_kmers)))
    ohsu_withou_eth.append(len(oshu_without_ohsu_kmers:=ohsu_kmers.difference(eth_kmers)))
    inter.append(intersection:=len(intersection_kmers:= (eth_kmers & ohsu_kmers)))
    kmer_inter.append(list(intersection_kmers))
out_df_original=pd.DataFrame(data=out_table)

100%|██████████| 5/5 [07:02<00:00, 84.59s/it]


In [15]:
path_non_filtering=create_path.create_path(SAVE_DIR,[DIR_CSV,DIR_OVARIAN,NAME_TABLES,NAME_NON_FILTERING_OVARIAN])
out_df_original.to_csv(path_non_filtering, header = True, sep='\t')

In [16]:
print(path_non_filtering)

/cluster/work/grlab/projects/projects2020_OHSU/peptides_generation/202301_myurchikova_MT_rerun_LP/DATA/OVARIAN/TABLES/out_df_non_filtering_OVARIAN.csv
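
Several columns of the saved table hold Python lists. When such a table is written to a text file, each list is stored as its string representation, so it comes back as a plain string on reading. The sketch below uses a toy table and an in-memory buffer in place of the real file, and shows one way to recover the lists with `ast.literal_eval`.

```python
import ast
import io

import pandas as pd

# Toy table with a list-valued cell, as in the k-mer columns.
toy = pd.DataFrame({"sample": ["S1"], "kmer_inter": [["AAA", "CCC"]]})

# Write and re-read through an in-memory buffer instead of a file.
buffer = io.StringIO()
toy.to_csv(buffer, header=True, sep="\t")
buffer.seek(0)
loaded = pd.read_csv(buffer, sep="\t", index_col=0)

# The list has become a string such as "['AAA', 'CCC']".
raw_value = loaded.loc[0, "kmer_inter"]
print(type(raw_value))

# literal_eval safely parses the string back into a Python list.
loaded["kmer_inter"] = loaded["kmer_inter"].apply(ast.literal_eval)
restored = loaded.loc[0, "kmer_inter"]
print(restored)
```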
