## Table of Contents
- [Import libraries](#1)
- [Process tables](#2)

<a name='1'></a>
## Import libraries

This cell sets up the environment for data analysis and visualization. It imports pandas for data structures, numpy for numerical operations, tqdm for progress bars, glob and os for file discovery and path handling, tarfile for reading archives, re for regular expressions, matplotlib and seaborn for plotting, and matplotlib_venn for Venn diagrams.

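The cell also appends a custom directory to the system path so that project-specific modules in that directory can be imported. These modules, config and functions, are loaded with wildcard imports (from config import * and from functions import *), so names used later in the notebook but never defined in it, such as TAR_OSHU_OV_NEW, SAVE_DIR, table_processing, and create_path, are expected to come from them. The autoreload extension is enabled so that edits to these modules are picked up without restarting the kernel.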
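Wildcard imports make it hard to see which names a module contributes. As a rough illustration, the sketch below builds a throwaway module in a temporary directory (a hypothetical stand-in for the project's func directory; `demo_config` and its contents are invented for this example), adds that directory to `sys.path`, and lists the names a `from demo_config import *` would bring in when the module defines no `__all__`.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical stand-in for the project's func/ directory (not the real one).
func_dir = Path(tempfile.mkdtemp())
(func_dir / "demo_config.py").write_text(
    "SAVE_DIR = '/tmp/results'\n_PRIVATE = 1\n"
)

# Same mechanism as the notebook: extend the module search path.
if str(func_dir) not in sys.path:
    sys.path.append(str(func_dir))
importlib.invalidate_caches()

# Import the module explicitly instead of with a wildcard.
demo_config = importlib.import_module("demo_config")

# Without __all__, a wildcard import exposes every name that does not
# start with an underscore.
public_names = [n for n in vars(demo_config) if not n.startswith("_")]
print(public_names)
```

Importing the module under its own name and writing `demo_config.SAVE_DIR` keeps the origin of each constant visible in the calling code.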


In [1]:
# %load /cluster/home/myurchikova/github/projects2020_ohsu/eth/learning_Master_thesis/TASKS/func/base_imports.py
import pandas as pd
import numpy as np
import tqdm 
import glob
import os
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import tarfile
import re
from matplotlib_venn import venn2, venn2_circles, venn2_unweighted
from matplotlib_venn import venn3, venn3_circles
import sys
sys.path.append(r"/cluster/home/prelotla/github/projects2020_ohsu/eth/MY_Master_thesis_rerun_LP/TASKS/func/")
%load_ext autoreload
%autoreload 2
from config import *
from functions import *




Matplotlib created a temporary config/cache directory at /scratch/slurm-job.5039281/matplotlib-w9kj1ufu because the default path (/cluster/customapps/biomed/grlab/users/prelotla/.cache/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
Matplotlib is building the font cache; this may take a moment.
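
The warning above recommends pointing MPLCONFIGDIR at a writable directory. A minimal sketch of one way to do that from Python, assuming the system temporary directory is writable on the compute node; the directory name `matplotlib-cache` is an arbitrary choice for this example. The variable has to be set before matplotlib is imported for it to take effect.

```python
import os
import tempfile

# Pick a writable location for Matplotlib's config and font cache.
# "matplotlib-cache" is an arbitrary, illustrative directory name.
mpl_cache = os.path.join(tempfile.gettempdir(), "matplotlib-cache")
os.makedirs(mpl_cache, exist_ok=True)

# Must be set before the first `import matplotlib` in the session.
os.environ["MPLCONFIGDIR"] = mpl_cache

# import matplotlib  # import only after the variable is set
```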


<a name='2'></a>
## Process tables

Configuration and Setup:
The code sets the ETH filtering directory (filter_dir), selects the OHSU tar archive (TAR_OHSU), and lists the five target samples (sample_target).
Colors for the data sources are defined (OHSU_COLOR, ETH_COLOR, OHSU_ETH_COLOR) along with a Venn diagram title template; none of these are used in this cell, so they are presumably intended for later plots.
Data Loading and Preprocessing:
ETH file names are collected with glob (entries starting with G in filter_dir), and OHSU member names are read from the tar archive with tarfile.
Each ETH file is paired with the OHSU archive member whose name contains the ETH base name stripped of its G_ prefix and .gz suffix.
Data Filtering and Compilation:
A regular expression extracts the sample ID from each ETH file name, and only file pairs matching the current target sample are processed.
Both tables are read into DataFrames; OHSU junctions are converted to ETH coordinates (table_processing.ohsu_to_eth_coord) and ETH junction coordinates are extracted (table_processing.get_junction_coordinates_updated).
Data Analysis:
The filter code in the OHSU file name is decoded into foreground and background settings, and a priority (0, 1, 2, or None) is assigned from the last background value.
For both k-mers and junction coordinates, the code computes the sizes of the OHSU set, the ETH set, their intersection, and the two set differences.
Visualization Preparation:
No plot is drawn in this cell; the set sizes and set members it stores are the inputs a Venn diagram would need.
Error Handling and Logging:
The try-except block around file reading and processing is commented out, so any error stops the run.
Setting LOG to True prints the OHSU and ETH k-mer sets for debugging.
Output Generation:
The results for each sample are collected in a DataFrame and concatenated into out_df_filtered.
A later cell writes out_df_filtered to a semicolon-separated CSV file.
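
The set comparison at the core of the analysis can be sketched as follows. This is a simplified illustration on toy k-mer sets, not the notebook's code: `compare_sets` and the result keys ending in `_only` are names chosen for this example, and the k-mers are made up.

```python
def compare_sets(ohsu, eth):
    """Return the overlap statistics for two collections of items."""
    ohsu, eth = set(ohsu), set(eth)
    return {
        "size_ohsu": len(ohsu),
        "size_eth": len(eth),
        "size_intersection": len(ohsu & eth),
        "size_ohsu_only": len(ohsu - eth),  # found by OHSU, missing in ETH
        "size_eth_only": len(eth - ohsu),   # found by ETH, missing in OHSU
    }


# Toy data for illustration only.
ohsu_kmers = ["AAAAAAAAA", "CCCCCCCCC", "GGGGGGGGG"]
eth_kmers = ["CCCCCCCCC", "GGGGGGGGG", "TTTTTTTTT", "ACGTACGTA"]

stats = compare_sets(ohsu_kmers, eth_kmers)
print(stats)
```

The three disjoint counts (intersection plus the two differences) are exactly what a two-set Venn diagram displays.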

In [2]:
# LP always for this notebook
OHSU_OV_NEW=True
if OHSU_OV_NEW:
    TAR_OHSU = TAR_OSHU_OV_NEW
else:
    TAR_OHSU = TAR_OSHU_OV

In [3]:
TEXT_SIZE = 65

In [4]:
# LP get_junction_coordinates_updated
# LP separate ETH coord
# LP rename index column

filter_dir ='/cluster/work/grlab/projects/projects2020_OHSU/peptides_generation/CANCER_eth/commit_c4dd02c_conf2_Frame_cap0_runs/TCGA_Ovarian_374/filtering_samples/filters_19May_order_5ge_wAnnot_GPstar'
             #/cluster/work/grlab/projects/projects2020_OHSU/peptides_generation/CANCER_eth/commit_c4dd02c_conf2_Frame_cap0_runs/TCGA_Breast_1102/filtering_samples/filters_19May_order_5ge_wAnnot_GPstar

#pwd = '/cluster/home/myurchikova/github/projects2020_ohsu/eth/learning_Master_thesis/'

OHSU_COLOR ='red' #'red'
ETH_COLOR = 'green'#'green'
OHSU_ETH_COLOR ='yellow' #'yellow'
LANG = 'ENG'
LOG = False
out_df_filtered=pd.DataFrame()
sample_target=['TCGA25131901A01R156513',
            'TCGA25131301A01R156513',
            'TCGA61200801A02R156813',
            'TCGA24143101A01R156613',
            'TCGA24229801A01R156913',]
# sample_target=['TCGA-C8-A12P-01A-11R-A115-07']
                
if LANG == 'ENG':
    title_venn = ' {sample}'
else:
    title_venn = '{sample}'

# ETH Names
eth_all = glob.glob(os.path.join(filter_dir, 'G*'))

# OHSU Names
with tarfile.open(TAR_OHSU, "r:*") as tar:
    ohsu_all = tar.getnames()

# Get file pairs
file_pair = {}
for idx_eth, eth in enumerate(eth_all):
    pattern = os.path.basename(eth).replace('G_', '').replace('.gz', '') 
    for idx_ohsu, ohsu in enumerate(ohsu_all):
        if pattern in ohsu:
            file_pair[eth] = ohsu

restricts = sample_target
for restrict in restricts:

    df = {'sample' : [], 
      'filter_foreground' : [], 
      'filter_background' : [], 
      'filter': [],
      'size_ohsu' : [], 
      'size_eth' : [], 
      'size_intersection' : [], 
      r'size_ohsu\eth' : [],
      r'size_eth\ohsu' : [],
      r'eth_kmers\inter':[],
      r'ohsu_kmers\inter':[],

      'coord_OHSU':[],
      'coord_ETH':[],
      'size_ohsu_coor' : [], 
      'size_eth_coor' : [], 
      'size_intersection_coor' : [], 
      r'size_ohsu\eth_coor' : [],
      r'size_eth\ohsu_coor' : [],
      r'eth_coor\inter_coor':[],
      r'ohsu_coor\inter_coor':[],
      'inter_coor':[],
      r'eth_coor\ohsu_coor':[],
      r'ohsu_coor\eth_coor':[],
      'priority':[],
          
         }
    with tarfile.open(TAR_OHSU, "r:*") as tar: #OHSU
        for eth, ohsu in file_pair.items(): # ETH
            if (not restrict) or restrict == re.findall(r'G_([\s\S]+?)_', eth)[0].replace('-', ''):  # Restrict to the sample of interest
                # try:
                    df_ohsu = pd.read_csv(tar.extractfile(ohsu), sep="\t")
                    df_ohsu.reset_index(inplace=True)
                    df_ohsu.rename({'index':'kmer', 'kmer': 'jx'}, axis = 1, inplace=True) # LP rename
                    if not df_ohsu.empty: df_ohsu = table_processing.ohsu_to_eth_coord(df_ohsu)
                    if not df_ohsu.empty: df_ohsu['junction_coordinate'] = df_ohsu['jx_shifted'].apply(lambda x: ':'.join(x.split(';')[1:3]))
                   
                    df_eth = pd.read_csv(eth, sep="\t")
                    if not df_eth.empty: df_eth=table_processing.get_junction_coordinates_updated(df_eth,'coord') #LP
                    df1=df_eth
                    df2=df_ohsu
                    df_eth_coor = set(table_processing.separate_ETH_3exons(df1['junction_coordinate'])) if not df1.empty else set([]) # LP
                    df_ohsu_coor = set(df2['junction_coordinate']) if not df2.empty else set([])  # check the OHSU table, not the ETH one
                    df['coord_ETH'].append(df1['junction_coordinate'] if not df1.empty else 'None' )
                    df['coord_OHSU'].append(df2['junction_coordinate'] if not df2.empty else 'None')
                    df_eth = set(df_eth['kmer'])
                    df_ohsu = set(df_ohsu['kmer']) # LP rename
                    name = os.path.basename(ohsu).replace('.tsv', '').split('_')
                    print(restrict,name)
                    df['sample'].append(name[1].replace('-',''))
                    if not OHSU_OV_NEW:
                        df['filter_foreground'].append(name[2])
                        df['filter_background'].append(name[3])
                        if name[3][1] == 'Any':
                            priority=0
                        elif name[3][1] == 10:
                            priority=1
                        elif name[3][1] == 2:
                            priority=2
                        else:
                            priority = None
                        df['priority'].append(priority)
                        df['filter'].append(name[2]+' '+name[3])
                    else:
                        df['filter'].append(name[2])
                        a = []
                        for i in range(5):
                            if name[2][i] == 'A':
                                a.append('Any')
                            elif name[2][i] == 'X':
                                a.append('10')
                            elif name[2][i] == 'N':
                                a.append('None')
                            else:
                                a.append(name[2][i])
                        df['filter_foreground'].append(f'({a[0]}, {a[1]}, {a[2]})')
                        df['filter_background'].append(f'({a[3]}, {a[4]})')
                        if a[4] == 'Any':
                            priority=0
                        elif a[4] == '10':
                            priority=1
                        elif a[4] == '2':
                            priority=2
                        else:
                            priority = None
                        df['priority'].append(priority)
                        print(a)  # `a` exists only in this branch
                    print(priority)
                    df['size_ohsu'].append(len(df_ohsu))
                    df['size_eth'].append(len(df_eth))
                    df[r'size_ohsu\eth'].append(len(df_ohsu_filter:=df_ohsu.difference(df_eth)))
                    df[r'size_eth\ohsu'].append(len(df_eth_filter:=df_eth.difference(df_ohsu)))
                    df['size_intersection'].append(len(df_inter_filter:=df_ohsu & df_eth))
                    df[r'eth_kmers\inter'].append(df_eth_witout_inter:=list(df_eth_filter.difference(df_inter_filter)))
                    df[r'ohsu_kmers\inter'].append(df_ohsu_witout_inter:=list(df_ohsu_filter.difference(df_inter_filter)))

                    
                    df['size_ohsu_coor'].append(len(df_ohsu_coor))
                    df['size_eth_coor'].append(len(df_eth_coor))
                    df[r'size_ohsu\eth_coor'].append(len(df_ohsu_filter_coor:=df_ohsu_coor.difference(df_eth_coor)))
                    df[r'size_eth\ohsu_coor'].append(len(df_eth_filter_coor:=df_eth_coor.difference(df_ohsu_coor)))
                    df['size_intersection_coor'].append(len(df_inter_filter_coor:=df_ohsu_coor & df_eth_coor))
                    df[r'eth_coor\inter_coor'].append(df_eth_witout_inter_coor:=list(df_eth_filter_coor.difference(df_inter_filter_coor)))
                    df[r'ohsu_coor\inter_coor'].append(df_ohsu_witout_inter_coor:=list(df_ohsu_filter_coor.difference(df_inter_filter_coor)))
                    df[r'eth_coor\ohsu_coor'].append(df_eth_witout_inter_coor:=list(df_eth_filter_coor.difference(df_ohsu_filter_coor)))
                    df[r'ohsu_coor\eth_coor'].append(df_ohsu_witout_inter_coor:=list(df_ohsu_filter_coor.difference(df_eth_filter_coor)))
                    
                    df['inter_coor'].append(list(df_inter_filter_coor))
                    if LOG:
                        print("\n\nOHSU\n\n")
                        print(df_ohsu)
                        print("\n\nETH\n\n")
                        print(df_eth)

                # except Exception as e:
                #     print("Error",e)
                #     continue
    df = pd.DataFrame(df)
    if not out_df_filtered.empty:
        out_df_filtered = pd.concat([out_df_filtered, df])
    else:
        out_df_filtered = df

TCGA25131901A01R156513 ['J', 'TCGA-25-1319-01A-01R-1565-13', '0A501GA']
['0', 'Any', '5', '0', '1']
None
TCGA25131901A01R156513 ['J', 'TCGA-25-1319-01A-01R-1565-13', '0A13XGA']
['0', 'Any', '1', '3', '10']
1
TCGA25131901A01R156513 ['J', 'TCGA-25-1319-01A-01R-1565-13', '0AN1XGA']
['0', 'Any', 'None', '1', '10']
1
TCGA25131901A01R156513 ['J', 'TCGA-25-1319-01A-01R-1565-13', '0AN32GA']
['0', 'Any', 'None', '3', '2']
2
TCGA25131901A01R156513 ['J', 'TCGA-25-1319-01A-01R-1565-13', '0253AGA']
['0', '2', '5', '3', 'Any']
0
TCGA25131901A01R156513 ['J', 'TCGA-25-1319-01A-01R-1565-13', '0A51AGA']
['0', 'Any', '5', '1', 'Any']
0
TCGA25131901A01R156513 ['J', 'TCGA-25-1319-01A-01R-1565-13', '02101GA']
['0', '2', '1', '0', '1']
None
TCGA25131901A01R156513 ['J', 'TCGA-25-1319-01A-01R-1565-13', '02501GA']
['0', '2', '5', '0', '1']
None
TCGA25131901A01R156513 ['J', 'TCGA-25-1319-01A-01R-1565-13', '0251XGA']
['0', '2', '5', '1', '10']
1
TCGA25131901A01R156513 ['J', 'TCGA-25-1319-01A-01R-1565-13', '0AN12G

TCGA61200801A02R156813 ['J', 'TCGA-61-2008-01A-02R-1568-13', '0A53XGA']
['0', 'Any', '5', '3', '10']
1
TCGA61200801A02R156813 ['J', 'TCGA-61-2008-01A-02R-1568-13', '0AN1AGA']
['0', 'Any', 'None', '1', 'Any']
0
TCGA61200801A02R156813 ['J', 'TCGA-61-2008-01A-02R-1568-13', '0253AGA']
['0', '2', '5', '3', 'Any']
0
TCGA61200801A02R156813 ['J', 'TCGA-61-2008-01A-02R-1568-13', '0A512GA']
['0', 'Any', '5', '1', '2']
2
TCGA61200801A02R156813 ['J', 'TCGA-61-2008-01A-02R-1568-13', '0A13AGA']
['0', 'Any', '1', '3', 'Any']
0
TCGA61200801A02R156813 ['J', 'TCGA-61-2008-01A-02R-1568-13', '0A13XGA']
['0', 'Any', '1', '3', '10']
1
TCGA61200801A02R156813 ['J', 'TCGA-61-2008-01A-02R-1568-13', '0AN01GA']
['0', 'Any', 'None', '0', '1']
None
TCGA61200801A02R156813 ['J', 'TCGA-61-2008-01A-02R-1568-13', '0213XGA']
['0', '2', '1', '3', '10']
1
TCGA61200801A02R156813 ['J', 'TCGA-61-2008-01A-02R-1568-13', '0A11XGA']
['0', 'Any', '1', '1', '10']
1
TCGA61200801A02R156813 ['J', 'TCGA-61-2008-01A-02R-1568-13', '0213A

TCGA24229801A01R156913 ['J', 'TCGA-24-2298-01A-01R-1569-13', '0AN3XGA']
['0', 'Any', 'None', '3', '10']
1
TCGA24229801A01R156913 ['J', 'TCGA-24-2298-01A-01R-1569-13', '0AN1XGA']
['0', 'Any', 'None', '1', '10']
1
TCGA24229801A01R156913 ['J', 'TCGA-24-2298-01A-01R-1569-13', '0211XGA']
['0', '2', '1', '1', '10']
1
TCGA24229801A01R156913 ['J', 'TCGA-24-2298-01A-01R-1569-13', '02132GA']
['0', '2', '1', '3', '2']
2
TCGA24229801A01R156913 ['J', 'TCGA-24-2298-01A-01R-1569-13', '0251AGA']
['0', '2', '5', '1', 'Any']
0
TCGA24229801A01R156913 ['J', 'TCGA-24-2298-01A-01R-1569-13', '0A13XGA']
['0', 'Any', '1', '3', '10']
1
TCGA24229801A01R156913 ['J', 'TCGA-24-2298-01A-01R-1569-13', '0AN01GA']
['0', 'Any', 'None', '0', '1']
None
TCGA24229801A01R156913 ['J', 'TCGA-24-2298-01A-01R-1569-13', '0A132GA']
['0', 'Any', '1', '3', '2']
2
TCGA24229801A01R156913 ['J', 'TCGA-24-2298-01A-01R-1569-13', '0AN12GA']
['0', 'Any', 'None', '1', '2']
2
TCGA24229801A01R156913 ['J', 'TCGA-24-2298-01A-01R-1569-13', '0A11A
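
Each group of three printed lines above shows a sample with its split file name, the decoded filter values, and the assigned priority. A compact sketch of that decoding, assuming the character mapping visible in the loop (A for Any, X for 10, N for None) and that only the first five characters of the code carry filter settings. `decode_filter`, `CODE_MAP`, and `PRIORITY` are illustrative names and are not part of the project's functions module.

```python
# Character shorthands used in the filter code (taken from the loop above).
CODE_MAP = {"A": "Any", "X": "10", "N": "None"}
# Priority is assigned from the last background value.
PRIORITY = {"Any": 0, "10": 1, "2": 2}


def decode_filter(code):
    """Decode a filter code such as '0A13XGA' into readable settings."""
    values = [CODE_MAP.get(ch, ch) for ch in code[:5]]
    foreground = f"({values[0]}, {values[1]}, {values[2]})"
    background = f"({values[3]}, {values[4]})"
    priority = PRIORITY.get(values[4])  # None when no rule matches
    return foreground, background, priority


for code in ["0A13XGA", "0AN32GA", "0A501GA"]:
    print(code, decode_filter(code))
```

A dictionary lookup replaces the if/elif chains, which keeps the mapping in one place if further shorthands are added.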

In [5]:
OHSU_OV_NEW

True

In [7]:
path_filtering=create_path.create_path(SAVE_DIR,[DIR_CSV,DIR_OVARIAN,NAME_TABLES,NAME_FILTERING_OVARIAN])
out_df_filtered.to_csv(path_filtering,header=True,sep=';')
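
Several columns of out_df_filtered hold Python lists, which to_csv writes as their text representation. When the file is read back, those cells arrive as strings unless they are parsed. A small sketch of a round trip on toy data, assuming the list cells contain only plain strings; the column values here are invented and an in-memory buffer stands in for the real file path.

```python
import ast
import io

import pandas as pd

# Toy frame with one list-valued column, mimicking 'inter_coor'.
toy = pd.DataFrame(
    {"sample": ["S1"], "inter_coor": [["1:100:200", "1:300:400"]]}
)

# Write with the same options as the notebook (semicolon separator).
buffer = io.StringIO()
toy.to_csv(buffer, header=True, sep=";")
buffer.seek(0)

# Parse the list column back into real Python lists while reading.
restored = pd.read_csv(
    buffer,
    sep=";",
    index_col=0,
    converters={"inter_coor": ast.literal_eval},
)
print(restored["inter_coor"].iloc[0])
```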
