# Machine Learning for String Field Theory

H. Erbin, R. Finotello, M. Kudrna, M. Schnabl

---

## Abstract

In the framework of bosonic Open String Field Theory (OSFT), we consider several observables characterised by conformal weight and type, and the position of vacua in the potential for various values of truncated mass level.
We focus on the prediction of the extrapolated value for the level-$\infty$ truncation using Machine Learning (ML) techniques.

## Synopsis

In this notebook we tidy and convert the datasets from their original format to a CSV-like format for training and predictions.

## General Observations

Each entry in the datasets represents one observable in OSFT.
Alongside the features labelling the observable, we also have the values of that observable at different truncation levels.
The ultimate purpose of the analysis is to compute the extrapolated values at level-$\infty$ truncation.
The data is therefore twofold: some variables label the observable, while the values at finite truncation levels are to be compared with the values at level $\infty$.

In [None]:
%load_ext autoreload
%autoreload 2

## Lumps Solutions

We start from the lump solutions: the dataset contains a list of JSON-formatted entries.
We first need to flatten the entries, then make sure there are no empty entries and no duplicates.

In [2]:
from mltools.tidy import TidySet

path = './data/mathematica_lumps.json'
data = TidySet(path, format='json')

In [3]:
data.get_dataframe().info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46 entries, 0 to 45
Data columns (total 21 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   init    46 non-null     object
 1   exp     46 non-null     object
 2   weight  46 non-null     object
 3   type    46 non-null     object
 4   2.      46 non-null     object
 5   3.      46 non-null     object
 6   4.      46 non-null     object
 7   5.      46 non-null     object
 8   6.      46 non-null     object
 9   7.      46 non-null     object
 10  8.      46 non-null     object
 11  9.      46 non-null     object
 12  10.     46 non-null     object
 13  11.     46 non-null     object
 14  12.     46 non-null     object
 15  13.     46 non-null     object
 16  14.     46 non-null     object
 17  15.     46 non-null     object
 18  16.     46 non-null     object
 19  17.     46 non-null     object
 20  18.     46 non-null     object
dtypes: object(21)
memory usage: 7.7+ KB
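
The check for empty entries mentioned above can be sketched with plain pandas. This is only an illustration: the JSON layout and the values below are made up, and `TidySet` may perform these steps differently.

```python
import json
import pandas as pd

# Hypothetical JSON: a list of records whose fields are lists of numbers.
text = '[{"init": [1.0, 1.0], "2.": [0.97, 1.05]}, {"init": [1.05], "2.": [-6.4]}]'

frame = pd.DataFrame.from_records(json.loads(text))

# Count missing cells and cells holding an empty list.
n_missing = int(frame.isna().sum().sum())
n_empty = int(frame.apply(lambda col: col.map(len)).eq(0).sum().sum())

print(frame.shape, n_missing, n_empty)
```

Both counters should be zero before moving on, which matches the non-null counts reported by `info()`.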


We then rename the columns of the truncation levels to be easily recognisable:

In [4]:
import re

data.colrename(lambda s: re.sub('^([0-9]*)[.]$', r'level_\1', s))
data.get_dataframe().info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46 entries, 0 to 45
Data columns (total 21 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   init      46 non-null     object
 1   exp       46 non-null     object
 2   weight    46 non-null     object
 3   type      46 non-null     object
 4   level_2   46 non-null     object
 5   level_3   46 non-null     object
 6   level_4   46 non-null     object
 7   level_5   46 non-null     object
 8   level_6   46 non-null     object
 9   level_7   46 non-null     object
 10  level_8   46 non-null     object
 11  level_9   46 non-null     object
 12  level_10  46 non-null     object
 13  level_11  46 non-null     object
 14  level_12  46 non-null     object
 15  level_13  46 non-null     object
 16  level_14  46 non-null     object
 17  level_15  46 non-null     object
 18  level_16  46 non-null     object
 19  level_17  46 non-null     object
 20  level_18  46 non-null     object
dtypes: object(21)
memo
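
The renaming rule can be checked in isolation on a few column names. The helper name `rename_level` is introduced here for illustration only; the regular expression is the one used in the cell above.

```python
import re

def rename_level(name):
    """Turn names such as '2.' into 'level_2'; leave other names unchanged."""
    return re.sub('^([0-9]*)[.]$', r'level_\1', name)

names = ['init', 'exp', 'weight', 'type', '2.', '10.', '18.']
renamed = [rename_level(n) for n in names]
print(renamed)

# Note: '[0-9]*' also matches an empty run of digits, so a bare '.' would
# become 'level_'. Using '[0-9]+' instead would leave such a name untouched.
edge_case = rename_level('.')
```

Only names made of digits followed by a dot are affected, so the label columns keep their original names.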

We then label each vector entry by its position:

In [5]:
data.addlabel('solutions')

data.get_dataframe().info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46 entries, 0 to 45
Data columns (total 22 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   solutions  46 non-null     object
 1   init       46 non-null     object
 2   exp        46 non-null     object
 3   weight     46 non-null     object
 4   type       46 non-null     object
 5   level_2    46 non-null     object
 6   level_3    46 non-null     object
 7   level_4    46 non-null     object
 8   level_5    46 non-null     object
 9   level_6    46 non-null     object
 10  level_7    46 non-null     object
 11  level_8    46 non-null     object
 12  level_9    46 non-null     object
 13  level_10   46 non-null     object
 14  level_11   46 non-null     object
 15  level_12   46 non-null     object
 16  level_13   46 non-null     object
 17  level_14   46 non-null     object
 18  level_15   46 non-null     object
 19  level_16   46 non-null     object
 20  level_17   46 non-null     object


We can finally stack the entries on top of one another:

In [6]:
data.rowexplode()

shape = data.get_dataframe().shape
data.get_dataframe().info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 778 entries, 0 to 777
Data columns (total 22 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   solutions  778 non-null    int32  
 1   init       778 non-null    float64
 2   exp        778 non-null    float64
 3   weight     778 non-null    float64
 4   type       778 non-null    float64
 5   level_2    778 non-null    float64
 6   level_3    778 non-null    float64
 7   level_4    778 non-null    float64
 8   level_5    778 non-null    float64
 9   level_6    778 non-null    float64
 10  level_7    778 non-null    float64
 11  level_8    778 non-null    float64
 12  level_9    778 non-null    float64
 13  level_10   778 non-null    float64
 14  level_11   778 non-null    float64
 15  level_12   778 non-null    float64
 16  level_13   778 non-null    float64
 17  level_14   778 non-null    float64
 18  level_15   778 non-null    float64
 19  level_16   778 non-null    float64
 20  level_17  
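
As a rough picture of the labelling and stacking steps, the sketch below uses `DataFrame.explode` on a toy frame. It assumes pandas 1.3 or later (needed to explode several columns at once) and made-up values; it is not the implementation of `addlabel` or `rowexplode`.

```python
import pandas as pd

# Toy stand-in for the flattened JSON: every cell holds a list (made-up values).
raw = pd.DataFrame({
    "weight":  [[0.0, 1.0], [0.0, 0.25, 1.0]],
    "level_2": [[0.1, 0.2], [0.3, 0.4, 0.5]],
    "level_3": [[0.11, 0.21], [0.31, 0.41, 0.51]],
})

# Label each row by its position, in the spirit of addlabel('solutions').
raw.insert(0, "solutions", list(range(len(raw))))

# One output row per list element; the scalar label is repeated alongside.
list_cols = ["weight", "level_2", "level_3"]
tidy = raw.explode(list_cols, ignore_index=True)

# Exploded columns keep the object dtype, so cast them to float explicitly.
tidy[list_cols] = tidy[list_cols].astype(float)

print(tidy)
```

The two original rows hold two and three elements respectively, so the stacked frame has five rows, each carrying the label of the row it came from.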

We then remove the duplicates from the dataset:

In [7]:
duplicates = data.dupremove()

print(f'No. of duplicates: {duplicates.shape[0]:d}')
print(f'Fraction of duplicates: {100 * duplicates.shape[0] / shape[0]:.1f}%')

No. of duplicates: 46
Fraction of duplicates: 5.9%


Finally we can save the file:

In [8]:
data.save('./data/lumps.csv', format='csv', index=False)

# print 10 random entries
data.get_dataframe().sample(n=10)

Unnamed: 0,solutions,init,exp,weight,type,level_2,level_3,level_4,level_5,level_6,...,level_9,level_10,level_11,level_12,level_13,level_14,level_15,level_16,level_17,level_18
28,1,1.0001,0.0,2.25045,4.0,0.978332,0.978299,1.051501,1.0516,0.849937,...,1.123636,0.808957,0.808561,1.091685,1.091947,0.824492,0.824061,1.046204,1.046457,0.834898
500,31,0.0,1.0,0.047259,4.0,0.791955,0.894114,0.904821,0.932881,0.937952,...,0.96095,0.962756,0.967571,0.968827,0.972157,0.973095,0.975628,0.976353,0.978251,0.978833
340,22,0.0,0.0,0.0,2.0,0.031164,0.013293,0.007269,0.005884,0.003971,...,0.002369,0.001901,0.001748,0.001465,0.001362,0.001176,0.001103,0.000974,0.000919,0.000825
681,41,2.8,-1.0,1.0,4.0,-0.592813,-0.951379,-0.695346,-0.717707,-0.886287,...,-0.872551,-0.93101,-0.935391,-0.91619,-0.917908,-0.949086,-0.950766,-0.938999,-0.939855,-0.959152
89,5,1.05,0.0,4.41,4.0,-6.427392,-7.275274,39.550827,42.068051,-105.694303,...,202.872096,-436.041351,-451.293342,1106.455475,1144.40018,-2636.515159,-2720.325014,5644.998987,5811.544565,-11004.600467
335,21,0.0,1.0,3.780864,4.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,19.610646,19.796817,-71.792632
625,38,2.65,1.0,4.0,4.0,-6.276573,-7.422184,8.65822,9.521694,-27.320335,...,72.350953,-145.799192,-150.62625,274.262233,282.059102,-472.148752,-484.080578,774.594907,792.222284,-1213.705691
59,3,1.01,0.0,4.0804,4.0,-1.741196,-1.955402,16.041456,16.999501,-48.965051,...,94.857958,-160.678015,-166.628528,325.966235,339.295673,-689.203729,-716.841468,1374.645827,1426.356955,-2532.86207
164,10,1.25,0.0,1.5625,4.0,-2.573824,-2.939078,2.460954,2.64028,-2.614119,...,2.641532,-2.619495,-2.741867,2.417507,2.51944,-2.414282,-2.493915,2.186028,2.25349,-2.137891
60,4,1.025,1.0,0.0,2.0,1.016275,1.015782,1.012352,1.011909,1.010011,...,1.008238,1.007389,1.007205,1.006569,1.006423,1.005926,1.005807,1.005407,1.005308,1.004977


## WZW Model

We then consider the data of the WZW model.
These are stored in two different datasets containing the real and the imaginary parts of the values.
We first need to rename the labels to distinguish the real and imaginary parts, then we process each dataset separately as in the previous case, and finally merge the two.

In [9]:
from mltools.tidy import TidySet

re_path = './data/mathematica_wzw_real.json'
im_path = './data/mathematica_wzw_imaginary.json'
re_data = TidySet(re_path, format='json')
im_data = TidySet(im_path, format='json')

In [10]:
re_data.get_dataframe().info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46 entries, 0 to 45
Data columns (total 19 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   k       46 non-null     object
 1   exp     46 non-null     object
 2   weight  46 non-null     object
 3   j       46 non-null     object
 4   m       46 non-null     object
 5   type    46 non-null     object
 6   2.      46 non-null     object
 7   3.      46 non-null     object
 8   4.      46 non-null     object
 9   5.      46 non-null     object
 10  6.      46 non-null     object
 11  7.      46 non-null     object
 12  8.      46 non-null     object
 13  9.      46 non-null     object
 14  10.     46 non-null     object
 15  11.     15 non-null     object
 16  12.     2 non-null      object
 17  13.     1 non-null      object
 18  14.     1 non-null      object
dtypes: object(19)
memory usage: 7.0+ KB


In [11]:
im_data.get_dataframe().info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46 entries, 0 to 45
Data columns (total 19 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   k       46 non-null     object
 1   exp     46 non-null     object
 2   weight  46 non-null     object
 3   j       46 non-null     object
 4   m       46 non-null     object
 5   type    46 non-null     object
 6   2.      46 non-null     object
 7   3.      46 non-null     object
 8   4.      46 non-null     object
 9   5.      46 non-null     object
 10  6.      46 non-null     object
 11  7.      46 non-null     object
 12  8.      46 non-null     object
 13  9.      46 non-null     object
 14  10.     46 non-null     object
 15  11.     15 non-null     object
 16  12.     2 non-null      object
 17  13.     1 non-null      object
 18  14.     1 non-null      object
dtypes: object(19)
memory usage: 7.0+ KB


We then rename the columns of the truncation levels to be easily recognisable:

In [12]:
import re

re_data.colrename(lambda s: re.sub('^([0-9]*)[.]$', r'level_\1', s))
re_data.colrename(lambda s: re.sub('^(.*)$', r'\1_re', s))
re_data.get_dataframe().info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46 entries, 0 to 45
Data columns (total 19 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   k_re         46 non-null     object
 1   exp_re       46 non-null     object
 2   weight_re    46 non-null     object
 3   j_re         46 non-null     object
 4   m_re         46 non-null     object
 5   type_re      46 non-null     object
 6   level_2_re   46 non-null     object
 7   level_3_re   46 non-null     object
 8   level_4_re   46 non-null     object
 9   level_5_re   46 non-null     object
 10  level_6_re   46 non-null     object
 11  level_7_re   46 non-null     object
 12  level_8_re   46 non-null     object
 13  level_9_re   46 non-null     object
 14  level_10_re  46 non-null     object
 15  level_11_re  15 non-null     object
 16  level_12_re  2 non-null      object
 17  level_13_re  1 non-null      object
 18  level_14_re  1 non-null      object
dtypes: object(19)
memory usage: 7.0

In [13]:
import re

im_data.colrename(lambda s: re.sub('^([0-9]*)[.]$', r'level_\1', s))
im_data.colrename(lambda s: re.sub('^(.*)$', r'\1_im', s))
im_data.get_dataframe().info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46 entries, 0 to 45
Data columns (total 19 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   k_im         46 non-null     object
 1   exp_im       46 non-null     object
 2   weight_im    46 non-null     object
 3   j_im         46 non-null     object
 4   m_im         46 non-null     object
 5   type_im      46 non-null     object
 6   level_2_im   46 non-null     object
 7   level_3_im   46 non-null     object
 8   level_4_im   46 non-null     object
 9   level_5_im   46 non-null     object
 10  level_6_im   46 non-null     object
 11  level_7_im   46 non-null     object
 12  level_8_im   46 non-null     object
 13  level_9_im   46 non-null     object
 14  level_10_im  46 non-null     object
 15  level_11_im  15 non-null     object
 16  level_12_im  2 non-null      object
 17  level_13_im  1 non-null      object
 18  level_14_im  1 non-null      object
dtypes: object(19)
memory usage: 7.0

Since most of the entries for levels 11 to 14 are empty, we discard these columns and exclude them from the analysis.

In [14]:
re_data.coldrop(['level_11_re', 'level_12_re', 'level_13_re', 'level_14_re'])
re_data.get_dataframe().info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46 entries, 0 to 45
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   k_re         46 non-null     object
 1   exp_re       46 non-null     object
 2   weight_re    46 non-null     object
 3   j_re         46 non-null     object
 4   m_re         46 non-null     object
 5   type_re      46 non-null     object
 6   level_2_re   46 non-null     object
 7   level_3_re   46 non-null     object
 8   level_4_re   46 non-null     object
 9   level_5_re   46 non-null     object
 10  level_6_re   46 non-null     object
 11  level_7_re   46 non-null     object
 12  level_8_re   46 non-null     object
 13  level_9_re   46 non-null     object
 14  level_10_re  46 non-null     object
dtypes: object(15)
memory usage: 5.5+ KB


In [15]:
im_data.coldrop(['level_11_im', 'level_12_im', 'level_13_im', 'level_14_im'])
im_data.get_dataframe().info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46 entries, 0 to 45
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   k_im         46 non-null     object
 1   exp_im       46 non-null     object
 2   weight_im    46 non-null     object
 3   j_im         46 non-null     object
 4   m_im         46 non-null     object
 5   type_im      46 non-null     object
 6   level_2_im   46 non-null     object
 7   level_3_im   46 non-null     object
 8   level_4_im   46 non-null     object
 9   level_5_im   46 non-null     object
 10  level_6_im   46 non-null     object
 11  level_7_im   46 non-null     object
 12  level_8_im   46 non-null     object
 13  level_9_im   46 non-null     object
 14  level_10_im  46 non-null     object
dtypes: object(15)
memory usage: 5.5+ KB
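
The columns to discard can also be selected from the non-null counts instead of being listed by hand. The sketch below rebuilds a toy frame with the counts reported above (46, 15, 2, 1 and 1 non-null entries out of 46); the 50% threshold is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

n_rows = 46
non_null = {"level_10_re": 46, "level_11_re": 15, "level_12_re": 2,
            "level_13_re": 1, "level_14_re": 1}

# Toy columns: k filled cells followed by missing values.
toy = pd.DataFrame({c: [1.0] * k + [np.nan] * (n_rows - k)
                    for c, k in non_null.items()})

# Fraction of filled cells per column; drop those below the threshold.
fill = toy.notna().mean()
sparse = fill[fill < 0.5].index.tolist()
toy = toy.drop(columns=sparse)

print(sparse)
```

With these counts the selection returns exactly the four columns for levels 11 to 14.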


We can finally stack the entries on top of one another:

In [16]:
re_data.rowexplode()

re_shape = re_data.get_dataframe().shape
re_data.get_dataframe().info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1680 entries, 0 to 1679
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   k_re         1680 non-null   float64
 1   exp_re       1680 non-null   float64
 2   weight_re    1680 non-null   float64
 3   j_re         1680 non-null   float64
 4   m_re         1680 non-null   float64
 5   type_re      1680 non-null   float64
 6   level_2_re   1680 non-null   float64
 7   level_3_re   1680 non-null   float64
 8   level_4_re   1680 non-null   float64
 9   level_5_re   1680 non-null   float64
 10  level_6_re   1680 non-null   float64
 11  level_7_re   1680 non-null   float64
 12  level_8_re   1680 non-null   float64
 13  level_9_re   1680 non-null   float64
 14  level_10_re  1680 non-null   float64
dtypes: float64(15)
memory usage: 197.0 KB


In [17]:
im_data.rowexplode()

im_shape = im_data.get_dataframe().shape
im_data.get_dataframe().info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1680 entries, 0 to 1679
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   k_im         1680 non-null   int64  
 1   exp_im       1680 non-null   float64
 2   weight_im    1680 non-null   int64  
 3   j_im         1680 non-null   int64  
 4   m_im         1680 non-null   int64  
 5   type_im      1680 non-null   int64  
 6   level_2_im   1680 non-null   float64
 7   level_3_im   1680 non-null   float64
 8   level_4_im   1680 non-null   float64
 9   level_5_im   1680 non-null   float64
 10  level_6_im   1680 non-null   float64
 11  level_7_im   1680 non-null   float64
 12  level_8_im   1680 non-null   float64
 13  level_9_im   1680 non-null   float64
 14  level_10_im  1680 non-null   float64
dtypes: float64(10), int64(5)
memory usage: 197.0 KB


We then need to merge the two datasets into a new dataset containing both real and imaginary parts.

In [18]:
import pandas as pd

df    = pd.merge(re_data.get_dataframe(),
                 im_data.get_dataframe(),
                 how='outer',
                 left_index=True,
                 right_index=True
                )
shape = df.shape
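
Because the merge pairs rows by index, it is worth confirming that the two frames are aligned. The sketch below uses toy frames with made-up values; the `validate` argument is an optional safeguard added here and is not part of the cell above.

```python
import pandas as pd

# Toy real and imaginary parts with already-suffixed columns (made-up values).
re_df = pd.DataFrame({"k_re": [2.0, 3.0], "level_2_re": [0.5, -0.1]})
im_df = pd.DataFrame({"k_im": [0, 0], "level_2_im": [0.0, 0.3]})

# The rows are paired by position, so both frames must share the same index.
assert re_df.index.equals(im_df.index)

merged = pd.merge(re_df, im_df, how="outer",
                  left_index=True, right_index=True,
                  validate="one_to_one")  # raises if an index value repeats

print(merged.shape)
```

When the indices agree, the outer merge neither adds nor loses rows, so the merged frame has as many rows as each input.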

We can then remove the identically vanishing columns and drop the `_re` suffix from the labels whose imaginary part vanishes:

In [19]:
import re

# a column vanishes identically when both its mean and standard deviation are zero
vanishing = (df.mean() == 0.0) & (df.std() == 0.0)
vanishing = [c for c in df.columns if vanishing[c]]

# drop the vanishing columns and strip the '_re' suffix from their counterparts
rename = [re.sub('_im', '_re', c) for c in vanishing]
rename = {c: re.sub('_re', '', c) for c in rename}
df     = df.drop(columns=vanishing).rename(columns=rename)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1680 entries, 0 to 1679
Data columns (total 25 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   k            1680 non-null   float64
 1   exp_re       1680 non-null   float64
 2   weight       1680 non-null   float64
 3   j            1680 non-null   float64
 4   m            1680 non-null   float64
 5   type         1680 non-null   float64
 6   level_2_re   1680 non-null   float64
 7   level_3_re   1680 non-null   float64
 8   level_4_re   1680 non-null   float64
 9   level_5_re   1680 non-null   float64
 10  level_6_re   1680 non-null   float64
 11  level_7_re   1680 non-null   float64
 12  level_8_re   1680 non-null   float64
 13  level_9_re   1680 non-null   float64
 14  level_10_re  1680 non-null   float64
 15  exp_im       1680 non-null   float64
 16  level_2_im   1680 non-null   float64
 17  level_3_im   1680 non-null   float64
 18  level_4_im   1680 non-null   float64
 19  level_
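
The effect of this step is easier to see on a toy frame. The sketch below uses made-up values and a more direct test for vanishing columns (`(toy == 0).all()`) in place of the mean and standard deviation test used above; the two agree when the data contain no missing values.

```python
import re
import pandas as pd

# Toy merged frame: the label 'k' is real, 'level_2' is genuinely complex.
toy = pd.DataFrame({
    "k_re": [2.0, 3.0, 4.0], "level_2_re": [0.5, -0.1, 0.2],
    "k_im": [0, 0, 0],       "level_2_im": [0.0, 0.3, 0.0],
})

# Columns whose entries are all exactly zero.
is_zero = (toy == 0).all()
vanishing = is_zero[is_zero].index.tolist()

# Map the real counterpart of each dropped column to its bare name.
rename = {re.sub(r'_im$', '_re', c): re.sub(r'_im$', '', c) for c in vanishing}

toy = toy.drop(columns=vanishing).rename(columns=rename)
print(list(toy.columns))
```

Here only `k_im` vanishes, so it is dropped and `k_re` becomes `k`, while both parts of `level_2` are kept.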

We then remove the duplicates from the dataset:

In [20]:
duplicate_id = df.duplicated()

duplicates = df.loc[duplicate_id]
df         = df.loc[~duplicate_id]

print(f'No. of duplicates: {duplicates.shape[0]:d}')
print(f'Fraction of duplicates: {100 * duplicates.shape[0] / shape[0]:.1f}%')

No. of duplicates: 13
Fraction of duplicates: 0.8%


Finally we can save the file:

In [21]:
df.to_csv('./data/wzw.csv', index=False)

# print 10 random entries
df.sample(n=10)

Unnamed: 0,k,exp_re,weight,j,m,type,level_2_re,level_3_re,level_4_re,level_5_re,...,exp_im,level_2_im,level_3_im,level_4_im,level_5_im,level_6_im,level_7_im,level_8_im,level_9_im,level_10_im
104,5.0,-1.168038,0.535714,1.5,0.5,4.0,-11.11072,-11.11072,-2.905008,-2.905008,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1454,8.0,-1.205053,0.875,2.5,-1.5,4.0,-0.260347,-1.344743,-1.254738,-0.890034,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1273,7.0,0.398375,0.972222,2.5,-2.5,4.0,0.175493,0.263727,0.292839,0.311392,...,-0.499546,-0.392039,-0.420095,-0.441339,-0.446982,-0.455664,-0.455966,-0.455134,-0.456433,-0.459199
380,4.0,-0.5,0.625,1.5,-1.5,4.0,-0.375988,-0.404428,-0.429662,-0.432209,...,-0.5,-0.267059,-0.300773,-0.369993,-0.38517,-0.408725,-0.41502,-0.427362,-0.431962,-0.439799
477,5.0,-0.86778,1.25,2.5,-2.5,4.0,-0.91662,-0.879819,-0.87983,-0.87169,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
193,7.0,-0.638943,0.222222,1.0,0.0,4.0,-0.668593,-0.673768,-0.653338,-0.650023,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1474,8.0,0.525731,2.0,4.0,1.0,4.0,-2.086209,-2.558573,3.143636,3.596948,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1410,8.0,0.0,1.2,3.0,-1.0,4.0,0.663791,-0.078793,-1.234089,-0.165055,...,-0.707107,-0.464865,-0.814597,-0.993942,-1.32522,-0.935301,-0.444522,-0.960096,-1.523793,-0.948219
617,6.0,-0.594604,1.09375,2.5,1.5,4.0,1.518086,1.6477,-1.022441,-1.188542,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1569,8.0,0.473677,1.575,3.5,-2.5,4.0,0.118032,0.178388,0.452935,0.470849,...,0.196203,-0.268013,0.343933,0.384894,0.11129,0.109368,0.240342,0.245224,0.163229,0.16332
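
A quick round trip shows what `index=False` does when saving. The sketch below writes a toy frame (made-up values) to an in-memory buffer rather than to `./data/wzw.csv`, then reads it back.

```python
import io
import numpy as np
import pandas as pd

# Toy frame with a non-trivial index, as left over after removing duplicates.
toy = pd.DataFrame({"k": [5.0, 8.0], "level_2_re": [-11.11072, 0.663791]},
                   index=[104, 1410])

buffer = io.StringIO()
toy.to_csv(buffer, index=False)   # the row index is not written
buffer.seek(0)
back = pd.read_csv(buffer)

# Same columns and values, but a fresh 0..n-1 index.
same_values = np.allclose(back.to_numpy(), toy.to_numpy())
print(list(back.columns), back.index.tolist(), same_values)
```

Since the original index is discarded, any later analysis should rely on the label columns rather than on row positions.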


## Double Lumps Solutions

We finally consider the double lump solutions.
In this case the dataset consists of a single solution, so there is no need to add a solution counter.

In [22]:
from mltools.tidy import TidySet

path = './data/mathematica_dlumps.json'
data = TidySet(path, format='json')

In [23]:
data.get_dataframe().info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 21 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   init    20 non-null     int64  
 1   weight  20 non-null     float64
 2   type    20 non-null     int64  
 3   2.      20 non-null     float64
 4   3.      20 non-null     float64
 5   4.      20 non-null     float64
 6   5.      20 non-null     float64
 7   6.      20 non-null     float64
 8   7.      20 non-null     float64
 9   8.      20 non-null     float64
 10  9.      20 non-null     float64
 11  10.     20 non-null     float64
 12  11.     20 non-null     float64
 13  12.     20 non-null     float64
 14  13.     20 non-null     float64
 15  14.     20 non-null     float64
 16  15.     20 non-null     float64
 17  16.     20 non-null     float64
 18  17.     20 non-null     float64
 19  18.     20 non-null     float64
 20  exp     20 non-null     float64
dtypes: float64(19), int64(2)
memory usage: 3.

We then rename the columns of the truncation levels to be easily recognisable:

In [24]:
import re

data.colrename(lambda s: re.sub('^([0-9]*)[.]$', r'level_\1', s))
data.get_dataframe().info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 21 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   init      20 non-null     int64  
 1   weight    20 non-null     float64
 2   type      20 non-null     int64  
 3   level_2   20 non-null     float64
 4   level_3   20 non-null     float64
 5   level_4   20 non-null     float64
 6   level_5   20 non-null     float64
 7   level_6   20 non-null     float64
 8   level_7   20 non-null     float64
 9   level_8   20 non-null     float64
 10  level_9   20 non-null     float64
 11  level_10  20 non-null     float64
 12  level_11  20 non-null     float64
 13  level_12  20 non-null     float64
 14  level_13  20 non-null     float64
 15  level_14  20 non-null     float64
 16  level_15  20 non-null     float64
 17  level_16  20 non-null     float64
 18  level_17  20 non-null     float64
 19  level_18  20 non-null     float64
 20  exp       20 non-null     float64


We then remove the duplicates from the dataset:

In [26]:
n_rows     = data.get_dataframe().shape[0]  # row count before removing duplicates
duplicates = data.dupremove()

print(f'No. of duplicates: {duplicates.shape[0]:d}')
print(f'Fraction of duplicates: {100 * duplicates.shape[0] / n_rows:.1f}%')

No. of duplicates: 1
Fraction of duplicates: 5.0%


Finally we can save the file:

In [27]:
data.save('./data/dlumps.csv', format='csv', index=False)

# print 10 random entries
data.get_dataframe().sample(n=10)

Unnamed: 0,init,weight,type,level_2,level_3,level_4,level_5,level_6,level_7,level_8,...,level_10,level_11,level_12,level_13,level_14,level_15,level_16,level_17,level_18,exp
6,0,0.027778,4,0.623568,0.629606,0.679687,0.728737,0.740993,0.780145,0.788949,...,0.818299,0.833743,0.839221,0.849563,0.854017,0.862205,0.8659004,0.871932,0.8750546,0.9552932
16,0,3.361111,4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,23.210856,23.159297,-67.06599,-67.0486,146.9126,117.2284
14,0,2.25,4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-6.656851,9.229395,9.834966,-17.395405,-18.270006,20.195534,21.12974,-28.43968,-29.50866,-19.79894
0,3,0.0,2,2.171332,2.095383,2.061599,2.049103,2.036695,2.03243,2.026098,...,2.020293,2.0192,2.016635,2.015958,2.014117,2.013663,2.012277,2.011954,2.010872,2.000132
12,0,1.361111,4,0.0,0.0,0.0,0.0,-1.021448,-0.647487,0.020622,...,-0.080513,0.070895,-0.11409,-0.121588,0.328101,0.402608,-0.09584362,-0.09570714,0.5101253,0.6600898
11,0,1.0,4,0.0,0.0,2.245192,2.52795,0.584574,0.592287,2.249617,...,1.146863,1.153984,2.153222,2.1874,1.429024,1.433263,2.086919,2.103089,1.585226,2.015004
4,3,9.0,4,-33.620593,-40.373713,198.953725,218.118294,-997.122136,-1067.67844,5001.157827,...,-23157.212916,-24338.986279,105266.161459,110002.605965,-445320.847252,-462954.335124,1711459.0,1772124.0,-6009038.0,-14476750.0
8,0,0.25,4,-1.870796,-1.633202,-1.734322,-1.825242,-1.856547,-1.855997,-1.87273,...,-1.915585,-1.911995,-1.919985,-1.933232,-1.939385,-1.93652,-1.941151,-1.9485,-1.952275,-1.994476
17,0,4.0,4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,37.84559,39.28174,-162.398,-28.42354
15,0,2.777778,4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,2.281499,1.731179,-1.897068,-1.006636,-0.3858528,-1.778739,6.757286,3.616752
