<div style='background-color:#f7f7f7; padding-top:30px; padding-left:20px; padding-right:20px; padding-bottom:30px'>
    <center>
        <div style='display: block; font-size: 2em; font-weight: bold;'>MAPD-B - Preprocessing of SEVN data for binary black holes mass distribution analysis
        </div>
        <br>
        <i>Tommaso Bertola, Giacomo Di Prima, Giuseppe Viterbo, Marco Zenari</i>
    </center>
</div>

# Introduction: the computational problem

Our aim for this project is to preprocess the data of multiple SEVN simulations of binary systems.

SEVN is a program developed by the Astronomy Department at the University of Padua to simulate the evolution of binary systems. The evolution takes into account different physical phenomena which obey the patterns and laws observed in the Universe, especially those seen in the stellar tracks.

We will focus our attention specifically on those systems that evolve into binary black holes. To study these systems we therefore need to extract, from the whole dataset produced by SEVN, only the information regarding the initial and final conditions of the binary system evolution.

The final goal is to obtain a simple and handy DataFrame listing only those features.

# Data structure

To better understand the problem, we will briefly describe the dataset we are given.

The dataset consists of a number of folders named after some "hyperparameters" given to SEVN while performing the simulations.
In our case, we are given 60 different folders, with names such as `sevn_output_Z0.001A1L1`, `sevn_output_Z0.03A5L1`, ... More specifically, the hyperparameters are the numbers following the letters `Z`, `A` and `L`, and they are carried over to the final output DataFrame for each record (in the preprocessing function shown later, the metallicity through the `Z_0` column and the `A` value through the `alpha` column).
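
As a minimal sketch of how these numbers could be recovered from a folder name, the regular expression below splits a name such as `sevn_output_Z0.001A1L1` into its components. The layout `Z<number>A<number>L<number>` is inferred from the two example names above, and the helper name `parse_hyperparameters` is an assumption made for illustration, not part of SEVN.

```python
import re

# Hypothetical helper: the folder-name layout Z<number>A<number>L<number>
# is inferred from the two example names quoted in the text.
FOLDER_PATTERN = re.compile(r"sevn_output_Z(?P<Z>[\d.]+)A(?P<A>[\d.]+)L(?P<L>[\d.]+)")


def parse_hyperparameters(folder_name):
    """Return the Z, A and L values encoded in a SEVN output folder name."""
    match = FOLDER_PATTERN.search(folder_name)
    if match is None:
        raise ValueError(f"unexpected folder name: {folder_name!r}")
    return {key: float(value) for key, value in match.groupdict().items()}


print(parse_hyperparameters("sevn_output_Z0.001A1L1"))
print(parse_hyperparameters("sevn_output_Z0.03A5L1"))
```

Raising an error on an unexpected name, rather than returning a default, makes a misnamed folder visible immediately instead of silently producing wrong hyperparameters.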

Inside each folder there are three kinds of files:
* `output_{nthread}.csv`
* `logfile_{nthread}.dat`
* `evolved_{nthread}.dat`

where {nthread} is a number ranging from 0 to 29, corresponding to the thread responsible for the computation of those simulations. 
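
Since the three files produced by a thread are processed together, it can be convenient to group their paths by thread number. The sketch below shows one possible way to build these triplets; the helper name `thread_file_triplets` and the folder location used in the usage lines are assumptions made for illustration.

```python
from pathlib import Path

N_THREADS = 30  # thread numbers range from 0 to 29


def thread_file_triplets(folder, n_threads=N_THREADS):
    """Return one [output, evolved, logfile] list of paths per thread."""
    folder = Path(folder)
    return [
        [
            str(folder / f"output_{n}.csv"),
            str(folder / f"evolved_{n}.dat"),
            str(folder / f"logfile_{n}.dat"),
        ]
        for n in range(n_threads)
    ]


# Illustrative location: the actual path of the folder may differ.
triplets = thread_file_triplets("/mnt/bhbh-v/sevn_output_Z0.001A1L1")
print(len(triplets))
print(triplets[0])
```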

On average, each of the `output_{nthread}.csv` files occupies 750MB, each `logfile_{nthread}.dat` 200MB, and each `evolved_{nthread}.dat` 50MB.

In total, each folder occupies between 26 and 31GB of data, for a gross total of around 1.7TB.


## File schema
We briefly report the schema of the three different kinds of files to better explain the parsing process that follows.

### `evolved_{nthread}.dat` schema
These files contain the initial properties of the systems that have been successfully evolved by SEVN.
These are **fixed width** field files whose columns are separated by a variable number of spaces, so we parse them with the `pd.read_table` function, using whitespace as the separator.
The fields we are interested in reading are reported below.

|Column Name| Data type | Description|
|:----|:----|:-----|
|name|string `0_1234....`|Unique in each folder|
|Mass_0|float|Initial mass of star 0 (in Solar masses)|
|Mass_1|float|Initial mass of star 1 (in Solar masses)|
|Z_0|float|Metallicity (both stars have the same)|
|SN_0|string `rapid_gauNS`|Supernova model for star 0|
|SN_1|string `rapid_gauNS`|Supernova model for star 1|
|a|float|Initial semi-major axis of the binary|
|e|float|Initial eccentricity of the binary|
                      
All other fields in the files are discarded in the analysis.
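
The following sketch illustrates how `pd.read_table` can read a whitespace-separated file while keeping only a subset of the columns. The two rows of data and the `extra` column are made up for illustration and only mimic the layout described above; the real files contain more fields.

```python
import io

import pandas as pd

# Made-up sample: columns are separated by a variable number of spaces.
sample = io.StringIO(
    "name              Mass_0   Mass_1   Z_0    SN_0         SN_1         a       e     extra\n"
    "0_123456789012345 30.5     12.25    0.001  rapid_gauNS  rapid_gauNS  150.0   0.3   1\n"
    "0_987654321098765 8.0      6.5      0.001  rapid_gauNS  rapid_gauNS  2000.0  0.05  2\n"
)

columns = ['name', 'Mass_0', 'Z_0', 'SN_0', 'Mass_1', 'SN_1', 'a', 'e']

# sep=r'\s+' treats any run of whitespace as a single separator;
# usecols keeps only the listed columns and discards all the others.
evolved = pd.read_table(sample, sep=r'\s+', usecols=columns, dtype={'name': 'string'})

print(evolved.columns.tolist())
print(evolved.dtypes)
```

Note that `usecols` selects columns but does not reorder them: the resulting DataFrame keeps the column order found in the file.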


### `output_{nthread}.csv` schema
The output files contain the final conditions of the simulations.
Each row is a different binary system, and each column indicates a different feature of the simulation.
The files are in the typical **csv format**.
The following table reports the columns we are interested in.

|Column Name| Data type | Description|
|:----|:----|:-----|
|name |string `0_1234...`| Unique in each folder|
|Mass_0|double| Mass of object 0 (in Solar masses)|
|Mass_1|double| Mass of object 1 (in Solar masses)|
|RemnantType_0|int|Type of object 0 after evolution|
|RemnantType_1|int|Type of object 1 after evolution|
|Semimajor|float|Semi-major axis of the binary (in Solar radii)|
|Eccentricity|float| Eccentricity of the binary|
|GWtime|float|Gravitational wave orbital decay time|
|BWorldtime|int|Time elapsed in the simulation|

All other fields in the files are discarded in the analysis.
                      
### `logfile_{nthread}.dat` schema
Logfiles are **plain text** files, where each row describes a particular astrophysical event.
Each event is uniquely associated with a specific binary system evolution by its `name`.
To recover the relevant information we run a set of regular expressions.
Each regex matches one specific type of event, namely `RLO_BEGIN`, `CE`, or `BSN`, and captures the `name` of the system it belongs to.

An example of the content of these files is given by the following rows:

<div style='background-color:#f7f7f7; padding-top:30px; padding-left:20px; padding-right:20px; padding-bottom:30px'>B;0_474492234654248;0;COLLISION;23.667631;0:9.44621:18.0861:3:1:2.61393:1.21104:1:38.822:0.50739:19.1197:10.6772
B;0_474492234654248;0;MERGER;23.667631;0:9.446e+00:1.906e+00:0.000e+00:3:0:18.1224:1:2.614e+00:0.000e+00:0.000e+00:1:0:1.21104:12.0601:38.822:0.50739
S;0_474492234654248;0;NS;25.471686;5:1.46229:4.64623e+12:63.4538:0.98692
S;0_474492234654248;0;SN;25.471686;10.6025:3.06259:1.65091:1.46229:5:2:453.262:345.701:-197.664:-212.483:-187.853
S;0_641394535500269;0;WD;32.367569;3.2735:2.57868:1.29894:1.29894:3
</div>
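
To make the regex step more concrete, here is a minimal sketch that counts the events of each binary using only the standard library. The log rows are hypothetical: only the `B;<name>;<ID>;<EVENT>;` prefix follows the layout of the rows above, and the helper name `count_events` is an assumption made for illustration.

```python
import re
from collections import Counter

# Hypothetical log rows: only the "B;<name>;<ID>;<EVENT>;" prefix follows the
# layout of the real files, the rest of each row is a placeholder.
log_lines = [
    "B;0_474492234654248;0;RLO_BEGIN;1.5;placeholder",
    "B;0_474492234654248;0;CE;2.0;placeholder",
    "B;0_474492234654248;0;RLO_BEGIN;3.2;placeholder",
    "S;0_474492234654248;0;SN;25.471686;placeholder",
    "B;0_641394535500269;1;RLO_BEGIN;4.1;placeholder",
]


def count_events(lines, event):
    """Count, for each binary name, the rows that log the given event."""
    pattern = re.compile(r"B;((?:\d*_)?\d+);(\d+);" + re.escape(event) + ";")
    counts = Counter()
    for line in lines:
        match = pattern.search(line)
        if match:
            counts[match.group(1)] += 1  # group 1 is the name of the binary
    return counts


print(count_events(log_lines, "RLO_BEGIN"))
print(count_events(log_lines, "CE"))
```

Binaries that never experience a given event simply do not appear in the corresponding counter, which is why a later join with the other tables has to treat missing counts as zero.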

# Tools
To parse the data we leverage distributed computing techniques.
Specifically, we make extensive use of the Python Dask library, which enables us to manage a cluster of workers hosted on CloudVeneto virtual machines.

## Infrastructure configuration
![network_configuration.png](network_configuration.png)

We will be using 6 different virtual machines, `bhbh-{d,1,2,3,4,5}`.
The first, `bhbh-d`, serves as an NFS server, allowing all the other VMs `bhbh-{1,2,3,4,5}` to read, write and process the archived data.

### Setting up `bhbh-d`
After instantiating the VM `bhbh-d` of flavor `cloudveneto.medium`, we attach the volume `bhbh-v` to `/dev/vdb` which is then mounted on `/mnt/bhbh-v`.
To allow the other VMs to see the new volume, we use the software provided by `nfs-kernel-server`.

The NFS server is configured so that `/mnt/bhbh-v` is visible to the other VMs and can therefore be mounted on them.
It is important to note that the volume is shared in `sync` mode.
The reason is that concurrent access from the cluster could compromise the consistency of the data if `async` mode were used.

Finally, the actual data were copied from the `demoblack.fisica.unipd.it` server to the mounted volume. The operation took a few hours, as the total size of the data is around 1.7TB. The copy was carried out with the `rsync` utility, so that it could be resumed should the network fail for any reason.

### Setting up `bhbh-{1,2,3,4,5}`
The VMs that do the actual data crunching are of flavor `cloudveneto.large`, with 8GB of RAM each, for a total amount of 32GB<a name="cite_ref-1"></a>[<sup>[1]</sup>](#cite_note-1).
The following operations are repeated identically on all of these VMs.
After installing the `nfs-utils` package, on

<a name="cite_note-1"></a>1. [^](#cite_ref-1) 32GB is comparable to the size of one folder.

#### Particular set up of bhbh-1


# Preprocessing of the files

In this section we give a brief summary of the operations that the preprocessing performs on the files.
As anticipated, the final goal is to obtain a dataframe containing only the useful information and to save it for further analysis.

In order to obtain a single dataframe we need to read, clean and extract information from each of the three types of files, and finally merge all the information about each binary into one dataframe.
All the preprocessing operations are written inside a function that is then executed lazily with Dask, as a delayed task.
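
As a minimal sketch of what delayed execution looks like, the example below wraps a stand-in function with `dask.delayed` and triggers the computation explicitly. The function `preprocess_thread` and the file names are placeholders introduced here for illustration; they do not reproduce the actual preprocessing.

```python
import dask
from dask import delayed


def preprocess_thread(paths):
    """Stand-in for the real preprocessing: it only reports what it received."""
    return {"n_files": len(paths), "first": paths[0]}


# Hypothetical input: one [output, evolved, logfile] triplet per thread.
all_paths = [
    [f"output_{n}.csv", f"evolved_{n}.dat", f"logfile_{n}.dat"] for n in range(3)
]

# delayed() only builds the task graph: nothing is executed at this point.
tasks = [delayed(preprocess_thread)(paths) for paths in all_paths]

# compute() triggers the execution; when a distributed Client is connected,
# the tasks are sent to the workers of the cluster instead of local threads.
results = dask.compute(*tasks)
print(results)
```

Building one task per thread keeps the tasks independent of each other, so the scheduler is free to distribute them across the workers in any order.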

Several different approaches to the preprocessing have been tried, and in the following section we describe the pros and cons of each of them.
Here we take one preprocessing function as an example, in order to describe the basic operations that we perform.
The differences in the preprocessing function of each attempt will be highlighted in the corresponding section.

```python
import re

import pandas as pd

#this function does the preprocessing on three files of the same thread: output_{thread}, evolved_{thread}, logfile_{thread}

def preprocessing_bag_of_thread(paths):
    
    '''
       paths = list of the paths of the three files considered [output, evolved, logfile]
    '''
    
    #Listing the columns we are interested in from the output file
    output_column_to_read = ['name', 'Mass_0', 'RemnantType_0', 'Mass_1', 'RemnantType_1',
                             'Semimajor', 'Eccentricity', 'GWtime', 'BWorldtime']
    
    #Listing the types of the columns we are interested in (one entry per column above)
    output_column_type = ['string', 'float64', 'int64', 'float64', 'int64',
                          'float64', 'float64', 'float64', 'float64']

    #Listing the columns we are interested in from the evolved file
    evolved_column_to_read = ['name', 'Mass_0', 'Z_0', 'SN_0', 'Mass_1', 'SN_1', 'a', 'e']

    #Listing the types of the columns we are interested in
    evolved_column_type = ['string', 'float64', 'float64', 'string', 'float64', 
                           'string', 'float64', 'float64']

    #final columns to be dropped (they are used for filtering the dataset but are not needed afterwards)
    drop_list = ['RemnantType_0', 'RemnantType_1']
    
    
    #OUTPUT
    
    #reading the columns of the output we are interested in and renaming some of them
    output = pd.read_csv(paths[0], usecols=output_column_to_read,
                         dtype=dict(zip(output_column_to_read, output_column_type))).\
                rename(columns={'Mass_0':'Mass_0_out', 'Mass_1':'Mass_1_out'})

    #mask to select only the binary black holes (remnant type 6 for both objects, still bound)
    idxBHBH = (output.RemnantType_0==6) & (output.RemnantType_1==6) & (output.Semimajor.notnull())
    output = output[idxBHBH]
    
    
    #EVOLVED
    
    #extracting the alpha parameter (the number between 'A' and 'L') from the path of the folder
    alpha = float(re.findall(r".+(?<=A)(.*)(?=L)", paths[1])[0])
    
    #read the columns we are interested in from the evolved file
    #NB: sep=r'\s+' is needed because the columns are separated by a variable number of spaces
    evolved = pd.read_table(paths[1], sep=r'\s+', usecols=evolved_column_to_read,
                            dtype=dict(zip(evolved_column_to_read, evolved_column_type)))
    
    #adding the column with the alpha parameter
    evolved['alpha'] = alpha
    
    
    #LOGFILE
    
    #reading the logfile: assuming the rows contain no commas, each row ends up as a single string in column 0
    logfile = pd.read_csv(paths[2], header=None)

    #Running regexes on the rows of the logfile to extract useful information
    #NB: the chains are wrapped in parentheses because a comment cannot follow a backslash continuation
    
    df_RLO = (
        logfile[0].str.extract(r"B;((?:\d*\_)?\d+);(\d+);RLO_BEGIN;")  #searching for string "RLO_BEGIN"
        .dropna()                               #dropping the rows that do not match
        .rename(columns={0:'name', 1:'ID'})     #rename
        .groupby(['name'])                      #grouping by name
        .size().to_frame(name='RLO')            #and counting the number of RLO
        .reset_index()                          #to have a nice dataframe
    )

    df_CE = (
        logfile[0].str.extract(r"B;((?:\d*\_)?\d+);(\d+);CE;")  #searching for string "CE"
        .dropna()                               #dropping the rows that do not match
        .rename(columns={0:'name', 1:'ID'})     #rename
        .groupby(['name'])                      #grouping by name
        .size().to_frame(name='CE')             #and counting the number of CE
        .reset_index()                          #to have a nice dataframe
    )

    df_BSN = (
        logfile[0].str.extract(r"B;((?:\d*\_)?\d+);(\d+);BSN;")  #searching for string "BSN"
        .dropna()                               #dropping the rows that do not match
        .rename(columns={0:'name', 1:'ID'})     #rename
        .groupby(['name'])                      #grouping by name
        .size().to_frame(name='BSN')            #and counting the number of BSN
        .reset_index()                          #to have a nice dataframe
    )

    
    #MERGE
    bhbh = (
        evolved.merge(output, on=['name'], how='inner')  #inner join on the name between evolved and output
        .merge(df_RLO, on=['name'], how='left')          #left join on the name with df_RLO
        .merge(df_CE,  on=['name'], how='left')          #left join on the name with df_CE
        .merge(df_BSN, on=['name'], how='left')          #left join on the name with df_BSN
        .fillna(value=0)                                 #setting nan to zero (binaries without a given event)
        .drop(columns=drop_list)                         #dropping the columns that are no longer useful
    )
    
    
    #Adding some columns with physical meaning
    bhbh['tdelay'] = bhbh['GWtime'] + bhbh['BWorldtime'] #time delay
    
    #defining the max mass of output: keep Mass_1_out where it is the larger one, Mass_0_out otherwise
    bhbh['Mass_max_out'] = bhbh['Mass_1_out'].\
                            where(cond=(bhbh['Mass_1_out'] > bhbh['Mass_0_out']), other=bhbh['Mass_0_out'])

    #defining the mass ratio q = (smaller mass)/(larger mass), so that q <= 1
    bhbh['q'] = (bhbh['Mass_1_out']/bhbh['Mass_0_out']).\
                where(cond=(bhbh['Mass_1_out'] < bhbh['Mass_0_out']), other=bhbh['Mass_0_out']/bhbh['Mass_1_out'])
    
    #defining the chirp mass: (m0*m1)^(3/5) / (m0+m1)^(1/5)
    bhbh['Mass_chirp'] = ((bhbh['Mass_0_out'] * bhbh['Mass_1_out'])**(3/5))/((bhbh['Mass_0_out'] + bhbh['Mass_1_out'])**(1/5))
    
    return bhbh
```
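
As a quick sanity check of the derived quantities defined at the end of the function, the same formulas can be evaluated on a single pair of masses with plain Python. The function name `derived_quantities` and the example masses are assumptions made for illustration.

```python
def derived_quantities(mass_0, mass_1):
    """Return (maximum mass, mass ratio q <= 1, chirp mass) of a binary."""
    mass_max = max(mass_0, mass_1)
    q = min(mass_0, mass_1) / mass_max
    chirp = (mass_0 * mass_1) ** (3 / 5) / (mass_0 + mass_1) ** (1 / 5)
    return mass_max, q, chirp


# For two equal masses m the chirp mass reduces to m / 2**(1/5).
mass_max, q, chirp = derived_quantities(30.0, 30.0)
print(mass_max, q, chirp)

# Swapping the two objects must not change any of the three quantities.
print(derived_quantities(10.0, 30.0) == derived_quantities(30.0, 10.0))
```

The symmetry under exchange of the two objects is the reason why the function uses `where` to pick the larger mass and the ratio smaller than one, independently of which object is labelled 0 or 1.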

# Our attempts
Here we show the code to 

Load the whole thing at once

FG_new

FG_normal

FG_bag

Brute force approach

## Possible improvements and limitations