# Code for preprocessing with bag

This is the final version of our project code, which managed to process the whole 1.7 TB dataset.

In this notebook we create a bag whose elements are lists of file paths.
Each list contains the paths of the output file, the evolved file and the logfile of the same thread.
The preprocessing function takes one such list as input and returns a dataframe containing the useful information.
Applying this function to the bag with `bag.map()` gives a bag of dataframes.
Finally, we convert the bag into a dataframe with `bag.to_dataframe()` and save its content to parquet files with `.to_parquet()`.

## Importing the libraries

In [8]:
import pandas as pd
import re
import time
import glob

import dask
import dask.dataframe as dd
import dask.bag as db
from dask.distributed import Client, SSHCluster

## Cluster up

In [5]:
cluster = SSHCluster(
            ["bhbh-1", 'bhbh-1', "bhbh-2", "bhbh-3", "bhbh-4", "bhbh-5"],
            connect_options={"client_keys": "/path_to_my_key"},
            worker_options={"n_workers": 2,   #best set-up from benchmark
                            "nthreads": 2},
            scheduler_options={"port": 8786, "dashboard_address": ":8787"}
            )

2023-06-16 14:27:19,942 - distributed.deploy.ssh - INFO - 2023-06-16 14:27:19,940 - distributed.scheduler - INFO - State start
2023-06-16 14:27:19,946 - distributed.deploy.ssh - INFO - 2023-06-16 14:27:19,943 - distributed.diskutils - INFO - Found stale lock file and directory '/tmp/dask-scratch-space/worker-2h_ywjgb', purging
2023-06-16 14:27:19,948 - distributed.deploy.ssh - INFO - 2023-06-16 14:27:19,946 - distributed.diskutils - INFO - Found stale lock file and directory '/tmp/dask-scratch-space/worker-limx3khf', purging
2023-06-16 14:27:19,955 - distributed.deploy.ssh - INFO - 2023-06-16 14:27:19,955 - distributed.scheduler - INFO -   Scheduler at:   tcp://10.67.22.140:8786
2023-06-16 14:27:20,949 - distributed.deploy.ssh - INFO - 2023-06-16 14:27:20,948 - distributed.nanny - INFO -         Start Nanny at: 'tcp://10.67.22.140:39625'
2023-06-16 14:27:20,959 - distributed.deploy.ssh - INFO - 2023-06-16 14:27:20,958 - distributed.nanny - INFO -         Start Nanny at: 'tcp://10.67.22

## Client

In [None]:
client = Client(cluster)


+---------+--------+-----------+------------------+
| Package | Client | Scheduler | Workers          |
+---------+--------+-----------+------------------+
| tornado | 6.3.2  | 6.3.2     | {'6.3.2', '6.2'} |
+---------+--------+-----------+------------------+


In [9]:
client

Connection method: Cluster object
Cluster type: distributed.SpecCluster
Dashboard: http://10.67.22.140:8787/status

Dashboard: http://10.67.22.140:8787/status
Workers: 10
Total threads: 20
Total memory: 38.82 GiB

Comm: tcp://10.67.22.140:8786
Workers: 10
Dashboard: http://10.67.22.140:8787/status
Total threads: 20
Started: 2 minutes ago
Total memory: 38.82 GiB

Comm: tcp://10.67.22.140:35551
Total threads: 2
Dashboard: http://10.67.22.140:40171/status
Memory: 3.88 GiB
Nanny: tcp://10.67.22.140:39625
Local directory: /tmp/dask-scratch-space/worker-1cafvr1g

Comm: tcp://10.67.22.140:37909
Total threads: 2
Dashboard: http://10.67.22.140:42351/status
Memory: 3.88 GiB
Nanny: tcp://10.67.22.140:36241
Local directory: /tmp/dask-scratch-space/worker-6j_ogum9

Comm: tcp://10.67.22.21:39687
Total threads: 2
Dashboard: http://10.67.22.21:46267/status
Memory: 3.88 GiB
Nanny: tcp://10.67.22.21:43363
Local directory: /tmp/dask-scratch-space/worker-qhzyuxnd

Comm: tcp://10.67.22.21:46271
Total threads: 2
Dashboard: http://10.67.22.21:46661/status
Memory: 3.88 GiB
Nanny: tcp://10.67.22.21:36587
Local directory: /tmp/dask-scratch-space/worker-8w2c2my9

Comm: tcp://10.67.22.220:33153
Total threads: 2
Dashboard: http://10.67.22.220:39329/status
Memory: 3.88 GiB
Nanny: tcp://10.67.22.220:43119
Local directory: /tmp/dask-scratch-space/worker-0ebfrb6p

Comm: tcp://10.67.22.220:41029
Total threads: 2
Dashboard: http://10.67.22.220:34101/status
Memory: 3.88 GiB
Nanny: tcp://10.67.22.220:40307
Local directory: /tmp/dask-scratch-space/worker-8wnk7309

Comm: tcp://10.67.22.31:33045
Total threads: 2
Dashboard: http://10.67.22.31:39889/status
Memory: 3.88 GiB
Nanny: tcp://10.67.22.31:38253
Local directory: /tmp/dask-scratch-space/worker-bcuu3j7i

Comm: tcp://10.67.22.31:37247
Total threads: 2
Dashboard: http://10.67.22.31:35117/status
Memory: 3.88 GiB
Nanny: tcp://10.67.22.31:36953
Local directory: /tmp/dask-scratch-space/worker-2piqo84f

Comm: tcp://10.67.22.81:45595
Total threads: 2
Dashboard: http://10.67.22.81:41455/status
Memory: 3.88 GiB
Nanny: tcp://10.67.22.81:33405
Local directory: /tmp/dask-scratch-space/worker-k25_pzwe

Comm: tcp://10.67.22.81:46555
Total threads: 2
Dashboard: http://10.67.22.81:43989/status
Memory: 3.88 GiB
Nanny: tcp://10.67.22.81:43163
Local directory: /tmp/dask-scratch-space/worker-v0k2vqb6
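
The totals reported by the client can be cross-checked against the `SSHCluster` call. `SSHCluster` uses the first host in the list for the scheduler and the remaining ones for workers, so `bhbh-1` appears twice because it hosts both. The sketch below is plain arithmetic on the values shown in this notebook; the variable names are only illustrative.

```python
# Cluster specification copied from the SSHCluster call in this notebook.
hosts = ["bhbh-1", "bhbh-1", "bhbh-2", "bhbh-3", "bhbh-4", "bhbh-5"]
n_workers_per_host = 2         # worker_options["n_workers"]
threads_per_worker = 2         # worker_options["nthreads"]
memory_per_worker_gib = 3.88   # value reported for each worker above

# The first host runs the scheduler; the remaining entries run workers.
worker_hosts = hosts[1:]

total_workers = len(worker_hosts) * n_workers_per_host
total_threads = total_workers * threads_per_worker
total_memory_gib = total_workers * memory_per_worker_gib

print(total_workers, total_threads, round(total_memory_gib, 1))
```

This gives 10 workers and 20 threads, matching the client summary, and about 38.8 GiB of memory, in line with the reported 38.82 GiB (the per-worker figure is rounded).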


## Creating list of directories

In [10]:
dir_path = '/mnt/bhbh/fiducial_Hrad_5M/sevn_output_*'
dir_list = glob.glob(dir_path)

## Creating the bag list of paths

In [6]:
bag = db.from_sequence([[dir_ + f'/0/output_{thread}.csv', 
                    dir_ + f'/0/evolved_{thread}.dat',
                    dir_ + f'/0/logfile_{thread}.dat'] for dir_ in dir_list for thread in range(30)], npartitions=30*60)  #number of partitions = number of bag elements (30 threads for each of the 60 directories)
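
To make the structure of the bag elements explicit, here is a minimal sketch that builds the same kind of path lists with the standard library only. The directory names and the helper `thread_paths` are hypothetical, and only 3 threads are used instead of 30 to keep the output short.

```python
from pathlib import PurePosixPath

# Hypothetical directory names, used only to illustrate the structure.
dir_list = ["/data/sevn_output_a", "/data/sevn_output_b"]
n_threads = 3  # the notebook uses 30

def thread_paths(dir_, thread):
    """Return [output, evolved, logfile] paths for one thread of one run."""
    base = PurePosixPath(dir_) / "0"
    return [
        str(base / f"output_{thread}.csv"),
        str(base / f"evolved_{thread}.dat"),
        str(base / f"logfile_{thread}.dat"),
    ]

# One element per (directory, thread) pair, as in the bag above.
path_triples = [thread_paths(d, t) for d in dir_list for t in range(n_threads)]

print(len(path_triples))   # number of directories times number of threads
print(path_triples[0])
```

Each element is exactly what the preprocessing function receives as `paths`: index 0 is the output file, index 1 the evolved file and index 2 the logfile.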

## Pre-processing function for the bag_of_thread

In [1]:
def preprocessing_bag_of_thread(paths):
    
    #lists of columns to read for each file and corresponding type
    output_column_to_read = ['name', 'Mass_0', 'RemnantType_0', 'Mass_1', 'RemnantType_1',
                         'Semimajor','Eccentricity','GWtime','BWorldtime']

    output_column_type = ['string', 'float64', 'int64', 'float64', 'int64',
                      'float64', 'float64', 'float64', 'float64']

    evolved_column_to_read = ['name', 'Mass_0', 'Z_0', 'SN_0', 'Mass_1', 'SN_1', 'a', 'e']


    evolved_column_type = ['string', 'float64', 'float64', 'string', 'float64', 
                      'string', 'float64', 'float64']

    drop_list = ['RemnantType_0',  'RemnantType_1']
    
   
    #Preprocessing OUTPUT
    
    #reading the file
    output = pd.read_csv(paths[0], usecols=output_column_to_read, dtype=dict(zip(output_column_to_read, output_column_type))).\
                rename(columns={'Mass_0':'Mass_0_out', 'Mass_1':'Mass_1_out'})
    
    #mask to select only the binaries we are interested in
    idxBHBH=(output.RemnantType_0==6) & (output.RemnantType_1==6) & (output.Semimajor.notnull())
    output=output[idxBHBH]    
    
    
    #preprocessing EVOLVED
      
    #reading the file
    evolved = pd.read_table(paths[1], sep=r'\s+', usecols=evolved_column_to_read, dtype=dict(zip(evolved_column_to_read, evolved_column_type)))
    
    #extracting alpha with a regex
    alpha = float(re.findall(r".+(?<=A)(.*)(?=L)", paths[1])[0])
    evolved['alpha'] = alpha
    
    
    #preprocessing LOGFILE
    
    logfile = pd.read_csv(paths[2], header=None)
    
    
    #extracting information with regex

    df_RLO = logfile[0].str.extract(r"B;((?:\d*\_)?\d+);(\d+);RLO_BEGIN;").\
                dropna().\
                rename(columns={0:'name', 1:'ID'}).\
                groupby(['name']).\
                size().to_frame(name='RLO').\
                reset_index()

    df_CE = logfile[0].str.extract(r"B;((?:\d*\_)?\d+);(\d+);CE;").\
                dropna().\
                rename(columns={0:'name', 1:'ID'}).\
                groupby(['name']).\
                size().to_frame(name='CE').\
                reset_index()

    df_BSN = logfile[0].str.extract(r"B;((?:\d*\_)?\d+);(\d+);BSN;").\
                dropna().\
                rename(columns={0:'name', 1:'ID'}).\
                groupby(['name']).\
                size().to_frame(name='BSN').\
                reset_index()

    
    #MERGE
    bhbh = evolved.merge(output, on=['name'], how='inner').\
                   merge(df_RLO, on=['name'], how='left').\
                   merge(df_CE,  on=['name'], how='left').\
                   merge(df_BSN, on=['name'], how='left').\
                   fillna(value=0).\
                   drop(columns=drop_list)
    
    
    #add some useful columns
    bhbh['tdelay'] = bhbh['GWtime'] + bhbh['BWorldtime']

    bhbh['Mass_max_out'] = bhbh['Mass_1_out']
    bhbh['Mass_max_out'] = bhbh['Mass_max_out'].\
                            where(cond=(bhbh['Mass_max_out'] > bhbh['Mass_0_out']), other=bhbh['Mass_0_out'])

    bhbh['q'] = bhbh['Mass_1_out']/bhbh['Mass_0_out']
    bhbh['q'] = bhbh['q'].\
                where(cond=(bhbh['Mass_1_out'] < bhbh['Mass_0_out']), other=bhbh['Mass_0_out']/bhbh['Mass_1_out'])

    bhbh['Mass_chirp'] = ((bhbh['Mass_0_out'] * bhbh['Mass_1_out'])**(3/5))/((bhbh['Mass_0_out'] + bhbh['Mass_1_out'])**(1/5))
    
    
    return bhbh #a pandas dataframe
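
The regular expressions used inside the function can be tried in isolation. The sketch below uses only the standard library and mirrors the two patterns: the one that reads `alpha` from the path and the one that counts events per binary in the logfile. The directory name `sevn_output_Z0.0004A0.5L1` and the log lines are hypothetical, chosen only so that the patterns have something to match, and the helpers `extract_alpha` and `count_events` are not part of the notebook.

```python
import re
from collections import Counter

def extract_alpha(path):
    """Return the number written between 'A' and 'L' in the path.

    Both quantifiers are greedy, so this assumes the path contains a single
    uppercase 'A' followed by a single uppercase 'L'.
    """
    return float(re.findall(r".+(?<=A)(.*)(?=L)", path)[0])

def count_events(lines, event):
    """Count, per binary name, the log lines that record the given event."""
    pattern = re.compile(rf"B;((?:\d*\_)?\d+);(\d+);{event};")
    counts = Counter()
    for line in lines:
        match = pattern.search(line)
        if match:
            counts[match.group(1)] += 1  # group 1 is the binary name
    return counts

# Hypothetical inputs: the real directory names and log format may differ.
path = "/data/sevn_output_Z0.0004A0.5L1/0/evolved_3.dat"
log_lines = [
    "B;0_111;0;RLO_BEGIN;t=1.0",
    "B;0_111;0;RLO_BEGIN;t=2.0",
    "B;0_222;1;CE;t=3.0",
    "S;0_111;0;SN;t=4.0",
]

alpha = extract_alpha(path)
rlo_counts = count_events(log_lines, "RLO_BEGIN")
ce_counts = count_events(log_lines, "CE")
bsn_counts = count_events(log_lines, "BSN")

print(alpha, dict(rlo_counts), dict(ce_counts), dict(bsn_counts))
```

Binaries with no matching line simply do not appear in the counter. In the function this corresponds to the left merges followed by `fillna(value=0)`, which set the missing `RLO`, `CE` and `BSN` counts to zero.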

## Map the preprocessing function to the bag

In [8]:
%%time
bag_of_df = bag.map(preprocessing_bag_of_thread)

CPU times: user 142 ms, sys: 1.46 ms, total: 144 ms
Wall time: 143 ms
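
The very short wall time is expected: `bag.map()` is lazy, so this cell only builds the task graph and no file is read yet. A minimal sketch of this behaviour on a toy bag, assuming `dask` is installed; the function `record_and_double` and the `calls` list are hypothetical and exist only to make the evaluation order visible.

```python
import dask.bag as db

calls = []  # records every element the function is actually applied to

def record_and_double(x):
    calls.append(x)
    return 2 * x

bag = db.from_sequence([1, 2, 3], npartitions=2)
mapped = bag.map(record_and_double)   # builds the task graph only
calls_before = list(calls)            # still empty: nothing has run yet

# The synchronous scheduler runs the tasks in this process,
# so the side effect on `calls` is visible here.
result = mapped.compute(scheduler="synchronous")

print(calls_before)
print(result)
```

In the notebook the actual work is triggered later, which is why almost all of the running time shows up in the `to_parquet` cell.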


## Concat the dataframes

In [9]:
%%time
bag_of_dicts = bag_of_df.map(lambda df: df.to_dict(orient='records')).flatten() #transforming to a bag of dictionaries (one per row) to use .to_dataframe()

CPU times: user 5.46 ms, sys: 3.38 ms, total: 8.85 ms
Wall time: 8.9 ms
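
What this transformation does to the data can be seen on two small in-memory dataframes. The sketch below is an analogue with pandas and `itertools`, not the Dask code itself: `chain.from_iterable` plays the role of `.flatten()`, and the two toy dataframes are hypothetical.

```python
from itertools import chain

import pandas as pd

# Two small hypothetical dataframes standing in for two elements of the bag.
df_a = pd.DataFrame({"name": ["0_1", "0_2"], "q": [0.9, 0.5]})
df_b = pd.DataFrame({"name": ["0_3"], "q": [0.7]})

# Each dataframe becomes a list with one dictionary per row.
records_per_df = [df.to_dict(orient="records") for df in (df_a, df_b)]

# Flattening turns the list of lists into a single sequence of rows.
records = list(chain.from_iterable(records_per_df))

# A sequence of row dictionaries can be turned back into one dataframe.
combined = pd.DataFrame(records)

print(records[0])
print(len(combined))
```

After flattening, every element is a single row, which is the shape `.to_dataframe()` expects from a bag.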


In [12]:
%%time
bhbh = bag_of_dicts.to_dataframe()

CPU times: user 639 ms, sys: 9.42 ms, total: 648 ms
Wall time: 22.8 s


## Save to parquet

In [13]:
%%time
bhbh.to_parquet('/mnt/bhbh/bag_all_dataset/')

CPU times: user 14.3 s, sys: 1.45 s, total: 15.8 s
Wall time: 1h 37min 3s


## Compute (if we do not want to save the results)

## Final dataframe

In [18]:
bhbh = dd.read_parquet('/mnt/bhbh/bag_all_dataset/part.*.parquet')

In [19]:
bhbh.head()

| | name | Mass_0 | Z_0 | SN_0 | Mass_1 | SN_1 | a | e | alpha | Mass_0_out | ... | Eccentricity | GWtime | BWorldtime | RLO | CE | BSN | tdelay | Mass_max_out | q | Mass_chirp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0_186500805616303 | 24.025 | 0.0004 | rapid_gauNS | 15.586 | rapid_gauNS | 56.1 | 0.00704 | 1.0 | 9.050055 | ... | 0.028653 | 7108.607 | 10.62461 | 2.0 | 0.0 | 2 | 7119.232 | 9.050055 | 0.928217 | 7.58944 |
| 1 | 0_502130275753280 | 39.24 | 0.0004 | rapid_gauNS | 37.312 | rapid_gauNS | 1180.0 | 0.0782 | 1.0 | 38.33631 | ... | 0.006686 | 1274892000.0 | 5.217448 | 2.0 | 0.0 | 2 | 1274892000.0 | 38.33631 | 0.950738 | 32.539204 |
| 2 | 0_201673565337120 | 61.947 | 0.0004 | rapid_gauNS | 30.179 | rapid_gauNS | 5040.0 | 0.221 | 1.0 | 59.19078 | ... | 0.277687 | 1390318000000.0 | 6.160173 | 0.0 | 0.0 | 2 | 1390318000000.0 | 59.19078 | 0.357987 | 30.060437 |
| 3 | 0_929528790266714 | 135.386 | 0.0004 | rapid_gauNS | 78.408 | rapid_gauNS | 20600.0 | 0.754 | 1.0 | 47.06763 | ... | 0.777802 | 446474200000000.0 | 3.399872 | 0.0 | 0.0 | 2 | 446474200000000.0 | 47.06763 | 0.742702 | 35.234364 |
| 4 | 0_583722007414750 | 51.01 | 0.0004 | rapid_gauNS | 50.206 | rapid_gauNS | 2400.0 | 0.299 | 1.0 | 49.44577 | ... | 0.005092 | 3798065000.0 | 4.275696 | 2.0 | 0.0 | 2 | 3798065000.0 | 49.44577 | 0.985839 | 42.73895 |
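
The last four columns are derived in the preprocessing function: `tdelay` is the sum of `GWtime` and `BWorldtime`, `Mass_max_out` is the larger of the two remnant masses, `q` is the ratio of the smaller to the larger mass, and `Mass_chirp` is (m0·m1)^(3/5) / (m0 + m1)^(1/5). A minimal sketch that recomputes them for a single binary with the standard library only; the helper `derived_columns` is hypothetical and the two masses are made-up values, while `GWtime` and `BWorldtime` are taken from the first row above.

```python
import math

def derived_columns(mass_0, mass_1, gw_time, b_world_time):
    """Recompute the derived quantities for one binary."""
    return {
        "tdelay": gw_time + b_world_time,
        "Mass_max_out": max(mass_0, mass_1),
        "q": min(mass_0, mass_1) / max(mass_0, mass_1),
        "Mass_chirp": (mass_0 * mass_1) ** (3 / 5) / (mass_0 + mass_1) ** (1 / 5),
    }

# GWtime and BWorldtime come from the first row of the table above;
# the masses are hypothetical (equal masses make the result easy to check).
row = derived_columns(mass_0=10.0, mass_1=10.0,
                      gw_time=7108.607, b_world_time=10.62461)

# For equal masses m the chirp mass reduces to m * 2**(-1/5).
expected_chirp = 10.0 * 2 ** (-1 / 5)

print(round(row["tdelay"], 3), row["q"], math.isclose(row["Mass_chirp"], expected_chirp))
```

The recomputed `tdelay` agrees with the value displayed for the first row (7119.232) to the precision shown.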


## Close the cluster

In [21]:
cluster.close()