# Code for preprocessing with bag

This is the final version of our project code, which processed the whole 1.7 TB dataset in about 1 h 40 min.

In this notebook we create a Dask bag whose elements are lists of file paths.
Each list contains the paths of the output, evolved and log files of the same thread.
The preprocessing function takes one such list as input (the paths of an output file, an evolved file and a logfile) and returns a pandas DataFrame containing the useful information.
Applying this function to the bag with `bag.map()` gives a bag of DataFrames.
Finally, we turn the bag into a Dask DataFrame with `bag.to_dataframe()` and save its content as Parquet files with `.to_parquet()`.

## Importing the libraries

In [None]:
import pandas as pd
import re
import time
import glob

import dask
import dask.dataframe as dd
import dask.bag as db
from dask.distributed import Client, SSHCluster

## Cluster up

In [None]:
cluster = SSHCluster(
            ["bhbh-1", "bhbh-1", "bhbh-2", "bhbh-3", "bhbh-4", "bhbh-5"],  # first host runs the scheduler, the others run workers
            connect_options={"client_keys": "/home/ubuntu/private/tbertola_key.pem"},
            worker_options={"n_workers": 2,   #best set-up from benchmark
                            "nthreads": 2},
            scheduler_options={"port": 8786, "dashboard_address": ":8787"}
            )

2023-06-17 19:55:35,024 - distributed.deploy.ssh - INFO - 2023-06-17 19:55:35,022 - distributed.scheduler - INFO - State start
2023-06-17 19:55:35,032 - distributed.deploy.ssh - INFO - 2023-06-17 19:55:35,031 - distributed.scheduler - INFO -   Scheduler at:   tcp://10.67.22.140:8786
2023-06-17 19:55:35,964 - distributed.deploy.ssh - INFO - 2023-06-17 19:55:35,962 - distributed.nanny - INFO -         Start Nanny at: 'tcp://10.67.22.140:44745'
2023-06-17 19:55:35,977 - distributed.deploy.ssh - INFO - 2023-06-17 19:55:35,976 - distributed.nanny - INFO -         Start Nanny at: 'tcp://10.67.22.140:36305'
2023-06-17 19:55:36,741 - distributed.deploy.ssh - INFO - 2023-06-17 19:55:36,739 - distributed.worker - INFO -       Start worker at:   tcp://10.67.22.140:37095
2023-06-17 19:55:36,742 - distributed.deploy.ssh - INFO - 2023-06-17 19:55:36,740 - distributed.worker - INFO -          Listening to:   tcp://10.67.22.140:37095
2023-06-17 19:55:36,750 - distributed.deploy.ssh - INFO - 2023-06-17

## Client

In [None]:
client = Client(cluster)


+---------+--------+-----------+------------------+
| Package | Client | Scheduler | Workers          |
+---------+--------+-----------+------------------+
| tornado | 6.3.2  | 6.3.2     | {'6.2', '6.3.2'} |
+---------+--------+-----------+------------------+


In [None]:
client

Connection method: Cluster object
Cluster type: distributed.SpecCluster
Dashboard: http://10.67.22.140:8787/status

Scheduler comm: tcp://10.67.22.140:8786
Workers: 6
Total threads: 12
Total memory: 23.28 GiB
Started: Just now

| Comm | Dashboard | Nanny | Total threads | Memory | Local directory |
|---|---|---|---|---|---|
| tcp://10.67.22.140:37095 | http://10.67.22.140:41909/status | tcp://10.67.22.140:44745 | 2 | 3.88 GiB | /tmp/dask-scratch-space/worker-ja4lhh8l |
| tcp://10.67.22.140:43837 | http://10.67.22.140:36123/status | tcp://10.67.22.140:36305 | 2 | 3.88 GiB | /tmp/dask-scratch-space/worker-7ccp4fa7 |
| tcp://10.67.22.21:34523 | http://10.67.22.21:41611/status | tcp://10.67.22.21:46339 | 2 | 3.88 GiB | /tmp/dask-scratch-space/worker-97hrmcje |
| tcp://10.67.22.21:42441 | http://10.67.22.21:38519/status | tcp://10.67.22.21:34979 | 2 | 3.88 GiB | /tmp/dask-scratch-space/worker-_ai0f43m |
| tcp://10.67.22.31:38195 | http://10.67.22.31:37033/status | tcp://10.67.22.31:35159 | 2 | 3.88 GiB | /tmp/dask-scratch-space/worker-k73vlgxe |
| tcp://10.67.22.31:44087 | http://10.67.22.31:46731/status | tcp://10.67.22.31:39763 | 2 | 3.88 GiB | /tmp/dask-scratch-space/worker-ri0wdot6 |


## Creating list of directories

In [None]:
dir_path = '/mnt/bhbh/fiducial_Hrad_5M/sevn_output_*'
dir_list = glob.glob(dir_path)

## Creating the bag list of paths

In [None]:
bag = db.from_sequence([[dir_ + f'/0/output_{thread}.csv', 
                         dir_ + f'/0/evolved_{thread}.dat',
                         dir_ + f'/0/logfile_{thread}.dat'] 
                        for dir_ in dir_list for thread in range(30)], npartitions=30*60)  # one partition per thread (30*60 in total)
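
To see what each element of the bag looks like, the sketch below builds the same kind of path triples with plain Python, without Dask. The directory names and the helper `thread_paths` are hypothetical and only stand in for the result of `glob.glob`; the `/0/output_{thread}.csv`, `/0/evolved_{thread}.dat` and `/0/logfile_{thread}.dat` layout is the one used in the cell above.

```python
def thread_paths(dir_list, n_threads=30):
    """Return one [output, evolved, logfile] path triple per (directory, thread)."""
    return [
        [
            f"{dir_}/0/output_{thread}.csv",
            f"{dir_}/0/evolved_{thread}.dat",
            f"{dir_}/0/logfile_{thread}.dat",
        ]
        for dir_ in dir_list
        for thread in range(n_threads)
    ]


# Hypothetical directory names, used only for illustration.
example_dirs = ["/data/sevn_output_a", "/data/sevn_output_b"]
triples = thread_paths(example_dirs, n_threads=3)

print(len(triples))  # 2 directories x 3 threads = 6 triples
print(triples[0])    # the three paths of thread 0 in the first directory
```

If the glob matches 60 directories, as the `30*60` in the cell suggests, the real list holds 1800 triples, so each partition of the bag contains exactly one triple.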

## Pre-processing function for the bag_of_thread

In [None]:
# this function does the preprocessing
# on three files of the same thread:
# output_{thread}, evolved_{thread}, logfile_{thread}

def preprocessing_bag_of_thread(paths):
    '''
       paths = Python list with the paths of the three files considered: [output, evolved, logfile]
    '''
    
    # list of column names and types to read
    
    # output_{}.csv
    output_column_to_read = ['name', 'Mass_0', 'RemnantType_0',
                             'Mass_1', 'RemnantType_1',
                             'Semimajor','Eccentricity',
                             'GWtime','BWorldtime']
    output_column_type = ['string', 'float64', 'int64',
                          'float64', 'int64',
                          'float64', 'float64',
                          'float64', 'float64']

    # evolved_{}.dat
    evolved_column_to_read = ['name', 'Mass_0',
                              'Z_0', 'SN_0',
                              'Mass_1', 'SN_1',
                              'a', 'e']
    evolved_column_type = ['string', 'float64',
                           'float64', 'string',
                           'float64', 'string',
                           'float64', 'float64']
    
    # further columns to remove at the end 
    drop_list = ['RemnantType_0',  'RemnantType_1']
    
   
    #OUTPUT files processing
    
    output = pd.read_csv(paths[0],                              # read the file
                         usecols=output_column_to_read,         # read only some cols
                         dtype=dict(zip(output_column_to_read,  # specify the types
                                        output_column_type))).\
                rename(columns={'Mass_0':'Mass_0_out',          # rename columns
                                'Mass_1':'Mass_1_out'})         #

    # mask to select only the binary black holes (RemnantType == 6 for both objects) with a defined semimajor axis
    idxBHBH=(output.RemnantType_0==6) & (output.RemnantType_1==6) & (output.Semimajor.notnull())
    
    # apply the mask
    output=output[idxBHBH]
        
    
    #EVOLVED files processing
    
    #extracting the alpha parameter from the path of the file 
    alpha = float(re.findall(r".+(?<=A)(.*)(?=L)",
                             paths[1])[0])
    
    #read the columns we are interested in from the evolved file
    evolved = pd.read_table(paths[1],                               # read file
                            sep=r'\s+',                             # separate by whitespace
                            usecols=evolved_column_to_read,         # read only some columns
                            dtype=dict(zip(evolved_column_to_read,  # specify the types
                                           evolved_column_type)))   #
    #NB: sep=r'\s+' is needed because the columns are separated by a variable number of spaces
    
    #adding the column with the alpha parameter
    evolved['alpha'] = alpha
    
    
    #LOGFILE files processing
    
    logfile = pd.read_csv(paths[2],    # read the file
                          header=None) # there is no header

    
    #Running regexes on the lines of the logfile to extract useful information
    df_RLO = logfile[0].str.extract(r"B;((?:\d*\_)?\d+);(\d+);RLO_BEGIN;").\
                dropna().\
                rename(columns={0:'name', 1:'ID'}).\
                groupby(['name']).\
                size().\
                to_frame(name='RLO').\
                reset_index()                                                 
    
    
    df_CE = logfile[0].str.extract(r"B;((?:\d*\_)?\d+);(\d+);CE;").\
                dropna().\
                rename(columns={0:'name', 1:'ID'}).\
                groupby(['name']).\
                size().\
                to_frame(name='CE').\
                reset_index()                                         
    

    df_BSN = logfile[0].str.extract(r"B;((?:\d*\_)?\d+);(\d+);BSN;").\
                dropna().\
                rename(columns={0:'name', 1:'ID'}).\
                groupby(['name']).\
                size().\
                to_frame(name='BSN').\
                reset_index()

    df_No_Kick = logfile[0].str.extract(r"S;((?:\d*\_)?\d+);(\d+);SN;.+:(0):.+:.+:.+.").\
                dropna().\
                rename(columns={0:'name', 1:'ID', 2: 'No_Kick'}).\
                groupby(['name']).\
                size().\
                to_frame(name='No_Kick').\
                reset_index()

    
    #MERGE
    bhbh = evolved.merge(output, on=['name'], how='inner').\
                    merge(df_RLO, on=['name'], how='left').\
                    merge(df_CE,  on=['name'], how='left').\
                    merge(df_BSN, on=['name'], how='left').\
                    merge(df_No_Kick, on=['name'], how='left').\
                    fillna(value=0).\
                    drop(columns=drop_list)                   
    
    
    #Adding some columns with physical meaning
    bhbh['tdelay'] = bhbh['GWtime'] + bhbh['BWorldtime'] #time delay
    
    #defining the max mass of output
    bhbh['Mass_max_out'] = bhbh['Mass_1_out']
    bhbh['Mass_max_out'] = bhbh['Mass_max_out'].\
                            where(cond=(bhbh['Mass_max_out'] > bhbh['Mass_0_out']),
                                  other=bhbh['Mass_0_out'])

    #defining the mass ratio q = m_min/m_max, so that q <= 1
    bhbh['q'] = bhbh['Mass_1_out']/bhbh['Mass_0_out']
    bhbh['q'] = bhbh['q'].\
                      where(cond=(bhbh['Mass_1_out'] < bhbh['Mass_0_out']),
                      other=bhbh['Mass_0_out']/bhbh['Mass_1_out'])
    
    #defining the Chirp mass
    bhbh['Mass_chirp'] = ((bhbh['Mass_0_out'] * bhbh['Mass_1_out'])**(3/5))/((bhbh['Mass_0_out'] + bhbh['Mass_1_out'])**(1/5))
    
    return bhbh # return the pandas DataFrame
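
The two regex-based steps of the function can be tried in isolation. The sketch below is a simplified, pandas-free version: it applies the same patterns with the standard `re` module and counts events per binary with `collections.Counter`. The example path, the log lines and the helper names (`extract_alpha`, `count_events`) are made up for illustration; the real SEVN directory names and log format may differ.

```python
import re
from collections import Counter

# Same patterns as in the preprocessing function.
ALPHA_PATTERN = r".+(?<=A)(.*)(?=L)"
RLO_PATTERN = re.compile(r"B;((?:\d*\_)?\d+);(\d+);RLO_BEGIN;")


def extract_alpha(path):
    """Return the number found between 'A' and 'L' in the path.

    Both quantifiers are greedy, so this behaves as intended when the path
    contains a single uppercase 'A' followed by a single uppercase 'L'.
    """
    return float(re.findall(ALPHA_PATTERN, path)[0])


def count_events(lines, pattern):
    """Count the matching lines for each binary name (capture group 1)."""
    counts = Counter()
    for line in lines:
        match = pattern.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Hypothetical path and log lines, shaped only to fit the patterns above.
example_path = "/data/sevn_output_Z0.02A0.5L1/0/evolved_0.dat"
log_lines = [
    "B;0_123;0;RLO_BEGIN;1.5",
    "B;0_123;0;RLO_BEGIN;3.0",
    "B;0_456;1;RLO_BEGIN;2.0",
    "B;0_456;1;CE;2.5",
    "S;0_123;0;SN;1.0",
]

print(extract_alpha(example_path))           # 0.5
print(count_events(log_lines, RLO_PATTERN))  # two RLO events for 0_123, one for 0_456
```

In the function itself, `str.extract(...).dropna().groupby('name').size()` does the same counting in a vectorized way, and the left merges followed by `fillna(0)` assign a count of zero to binaries with no matching event.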

## Map the preprocessing function to the bag

In [None]:
%%time
bag_of_df = bag.map(preprocessing_bag_of_thread)

CPU times: user 5.3 ms, sys: 415 µs, total: 5.72 ms
Wall time: 5.72 ms


## Concat the dataframes

In [None]:
%%time
bag_of_dicts = bag_of_df.map(lambda df: df.to_dict(orient='records')).flatten() #transforming to a bag of dictionaries so that .to_dataframe() can be used

CPU times: user 78.3 ms, sys: 374 µs, total: 78.7 ms
Wall time: 82.6 ms


In [None]:
%%time
bhbh = bag_of_dicts.to_dataframe()  #extracting a final dataframe

CPU times: user 468 ms, sys: 27.8 ms, total: 495 ms
Wall time: 52.1 s


## Save to parquet

In [None]:
%%time
bhbh.to_parquet('/mnt/bhbh/bag_all_dataset_2/')

CPU times: user 17 s, sys: 1.91 s, total: 18.9 s
Wall time: 2h 5min 29s


## Compute (if we do not want to save the results)

## Final dataframe

In [None]:
bhbh = dd.read_parquet('/mnt/bhbh/bag_all_dataset_2/part.*.parquet')

In [None]:
bhbh.head()

| | name | Mass_0 | Z_0 | SN_0 | Mass_1 | SN_1 | a | e | alpha | Mass_0_out | ... | Eccentricity | GWtime | BWorldtime | RLO | CE | BSN | tdelay | Mass_max_out | q | Mass_chirp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0_186500805616303 | 24.025 | 0.0004 | rapid_gauNS | 15.586 | rapid_gauNS | 56.1 | 0.00704 | 1.0 | 9.050055 | ... | 0.028653 | 7108.607 | 10.62461 | 2.0 | 0.0 | 2 | 7119.232 | 9.050055 | 0.928217 | 7.58944 |
| 1 | 0_502130275753280 | 39.24 | 0.0004 | rapid_gauNS | 37.312 | rapid_gauNS | 1180.0 | 0.0782 | 1.0 | 38.33631 | ... | 0.006686 | 1274892000.0 | 5.217448 | 2.0 | 0.0 | 2 | 1274892000.0 | 38.33631 | 0.950738 | 32.539204 |
| 2 | 0_201673565337120 | 61.947 | 0.0004 | rapid_gauNS | 30.179 | rapid_gauNS | 5040.0 | 0.221 | 1.0 | 59.19078 | ... | 0.277687 | 1390318000000.0 | 6.160173 | 0.0 | 0.0 | 2 | 1390318000000.0 | 59.19078 | 0.357987 | 30.060437 |
| 3 | 0_929528790266714 | 135.386 | 0.0004 | rapid_gauNS | 78.408 | rapid_gauNS | 20600.0 | 0.754 | 1.0 | 47.06763 | ... | 0.777802 | 446474200000000.0 | 3.399872 | 0.0 | 0.0 | 2 | 446474200000000.0 | 47.06763 | 0.742702 | 35.234364 |
| 4 | 0_583722007414750 | 51.01 | 0.0004 | rapid_gauNS | 50.206 | rapid_gauNS | 2400.0 | 0.299 | 1.0 | 49.44577 | ... | 0.005092 | 3798065000.0 | 4.275696 | 2.0 | 0.0 | 2 | 3798065000.0 | 49.44577 | 0.985839 | 42.73895 |
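
The derived columns can be cross-checked by hand against row 0 of the preview. The sketch below recomputes the chirp mass, `(m0 * m1)**(3/5) / (m0 + m1)**(1/5)`, from `Mass_max_out` and `q`. Since `Mass_1_out` is hidden in the truncated preview, the lighter mass is reconstructed as `q * Mass_max_out`; this is an assumption that follows from how `q` is defined in the preprocessing function. The helper names are illustrative.

```python
def mass_ratio(m0, m1):
    """Mass ratio q = lighter mass / heavier mass, so 0 < q <= 1."""
    return min(m0, m1) / max(m0, m1)


def chirp_mass(m0, m1):
    """Chirp mass (m0 * m1)**(3/5) / (m0 + m1)**(1/5)."""
    return (m0 * m1) ** (3 / 5) / (m0 + m1) ** (1 / 5)


# Values copied from row 0 of the preview.
mass_max_out = 9.050055
q = 0.928217

# Mass_1_out is elided in the preview; reconstruct the lighter mass from q.
mass_min_out = q * mass_max_out

# Should agree with the Mass_chirp value 7.58944 of row 0 up to rounding.
print(round(chirp_mass(mass_max_out, mass_min_out), 3))

# Recovers the q value of row 0 up to floating-point error.
print(mass_ratio(mass_max_out, mass_min_out))
```

Agreement to the printed precision indicates that `q`, `Mass_max_out` and `Mass_chirp` in the saved Parquet files are mutually consistent for this row.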


## Close the cluster

In [None]:
cluster.close()