# Machine Learning for String Field Theory

*H. Erbin, R. Finotello, M. Kudrna, M. Schnabl*

## Abstract

In the framework of bosonic **Open String Field Theory** (OSFT), we consider several observables characterised by conformal weight, periodicity of the oscillations and the position of vacua in the potential for various values of truncated mass level.
We focus on the prediction of the extrapolated value for the level-$\infty$ truncation using Machine Learning (ML) techniques.

## Synopsis

In this notebook we tidy the **WZW model** datasets and convert them from their original format to a CSV-like format for training and predictions.

## General Observations

Each entry in the datasets represents one observable in OSFT.
Since these observables are represented by vector entries in the dataset, we build a new dataset flattened over the columns.

Together with the features labelling the observable, we also have the values of the observable at different truncation levels.
The ultimate purpose of the analysis is to compute the extrapolated values at level-$\infty$ truncation.
The data is therefore twofold: some variables label the observable, while the values at finite truncation levels should be compared with the values at $\infty$.

Notice that the values at finite truncation levels are in general complex, while the observables can be made real by taking linear combinations.

In [1]:
%load_ext autoreload
%autoreload 2

import pandas as pd
import numpy as np
import re
import os

In [2]:
# create shortcuts for paths
proot = lambda s: os.path.join('.', s)
pdata = lambda s: os.path.join(proot('data'), s)

## Load the Dataset

In [3]:
df_re = pd.read_json(pdata('mathematica_wzw_real.json'))
df_im = pd.read_json(pdata('mathematica_wzw_imaginary.json'))

In [4]:
df_re.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46 entries, 0 to 45
Data columns (total 19 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   k       46 non-null     object
 1   exp     46 non-null     object
 2   weight  46 non-null     object
 3   j       46 non-null     object
 4   m       46 non-null     object
 5   type    46 non-null     object
 6   2.      46 non-null     object
 7   3.      46 non-null     object
 8   4.      46 non-null     object
 9   5.      46 non-null     object
 10  6.      46 non-null     object
 11  7.      46 non-null     object
 12  8.      46 non-null     object
 13  9.      46 non-null     object
 14  10.     46 non-null     object
 15  11.     15 non-null     object
 16  12.     2 non-null      object
 17  13.     1 non-null      object
 18  14.     1 non-null      object
dtypes: object(19)
memory usage: 7.0+ KB


In [5]:
df_im.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46 entries, 0 to 45
Data columns (total 19 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   k       46 non-null     object
 1   exp     46 non-null     object
 2   weight  46 non-null     object
 3   j       46 non-null     object
 4   m       46 non-null     object
 5   type    46 non-null     object
 6   2.      46 non-null     object
 7   3.      46 non-null     object
 8   4.      46 non-null     object
 9   5.      46 non-null     object
 10  6.      46 non-null     object
 11  7.      46 non-null     object
 12  8.      46 non-null     object
 13  9.      46 non-null     object
 14  10.     46 non-null     object
 15  11.     15 non-null     object
 16  12.     2 non-null      object
 17  13.     1 non-null      object
 18  14.     1 non-null      object
dtypes: object(19)
memory usage: 7.0+ KB


Each dataset contains 46 vector-valued entries, but it is not complete: the columns for levels 11 to 14 are only partially filled.
We need to:

1. remove incomplete entries or variables,
2. rename the truncation level variables to more manageable names,
3. flatten the entries,
4. get the dummy variables for the type of oscillations,
5. create entries for the levels in order to have both the complex formulation and the separate real and imaginary parts.

## Remove Incomplete Variables

In [6]:
df_re = df_re.drop(columns=['11.', '12.', '13.', '14.'])
df_im = df_im.drop(columns=['11.', '12.', '13.', '14.'])
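
The four dropped columns are exactly those with fewer than 46 non-null entries in the `info()` output above. As a minimal sketch, assuming missing cells are loaded as NaN or `None`, the same columns could be found without hard-coding their names. The toy frame below is made up for illustration and is not the WZW data.

```python
import numpy as np
import pandas as pd

# hypothetical toy frame: the '11.' column is only partially filled
toy = pd.DataFrame({
    'k':   [[2, 2], [3, 3]],
    '10.': [[0.1, 0.2], [0.3, 0.4]],
    '11.': [[0.5, 0.6], np.nan],
})

# columns with at least one missing cell
incomplete = toy.columns[toy.isna().any()].tolist()

# keep only the complete columns
complete = toy.drop(columns=incomplete)

print(incomplete)
print(list(complete.columns))
```

Listing the names explicitly, as done in the cell above, has the advantage of failing loudly if the input files change.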

## Rename the Columns

In [7]:
columns = lambda c: re.sub(r'(.*)[.]', r'level_\1', c)
df_re = df_re.rename(columns=columns)
df_im = df_im.rename(columns=columns)
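
To see what the substitution does, it can be applied to a few column names outside of pandas. This is only an illustrative sketch: the helper name `rename_level` is ours, while the pattern is the one used in the cell above.

```python
import re

def rename_level(column: str) -> str:
    """Turn a trailing-dot level name such as '2.' into 'level_2'."""
    # names without a dot do not match and are returned unchanged
    return re.sub(r'(.*)[.]', r'level_\1', column)

original = ['k', 'exp', 'type', '2.', '10.']
renamed = [rename_level(c) for c in original]
print(renamed)
```

Only the names ending with a dot are touched, so the labelling variables keep their original names.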

In [8]:
df_re.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46 entries, 0 to 45
Data columns (total 15 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   k         46 non-null     object
 1   exp       46 non-null     object
 2   weight    46 non-null     object
 3   j         46 non-null     object
 4   m         46 non-null     object
 5   type      46 non-null     object
 6   level_2   46 non-null     object
 7   level_3   46 non-null     object
 8   level_4   46 non-null     object
 9   level_5   46 non-null     object
 10  level_6   46 non-null     object
 11  level_7   46 non-null     object
 12  level_8   46 non-null     object
 13  level_9   46 non-null     object
 14  level_10  46 non-null     object
dtypes: object(15)
memory usage: 5.5+ KB


In [9]:
df_im.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46 entries, 0 to 45
Data columns (total 15 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   k         46 non-null     object
 1   exp       46 non-null     object
 2   weight    46 non-null     object
 3   j         46 non-null     object
 4   m         46 non-null     object
 5   type      46 non-null     object
 6   level_2   46 non-null     object
 7   level_3   46 non-null     object
 8   level_4   46 non-null     object
 9   level_5   46 non-null     object
 10  level_6   46 non-null     object
 11  level_7   46 non-null     object
 12  level_8   46 non-null     object
 13  level_9   46 non-null     object
 14  level_10  46 non-null     object
dtypes: object(15)
memory usage: 5.5+ KB


## Flatten the Entries

In [10]:
df_re = pd.concat([pd.DataFrame({f: df_re[f].iloc[n] for f in df_re}) for n in range(df_re.shape[0])], axis=0)
df_re.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1680 entries, 0 to 49
Data columns (total 15 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   k         1680 non-null   float64
 1   exp       1680 non-null   float64
 2   weight    1680 non-null   float64
 3   j         1680 non-null   float64
 4   m         1680 non-null   float64
 5   type      1680 non-null   float64
 6   level_2   1680 non-null   float64
 7   level_3   1680 non-null   float64
 8   level_4   1680 non-null   float64
 9   level_5   1680 non-null   float64
 10  level_6   1680 non-null   float64
 11  level_7   1680 non-null   float64
 12  level_8   1680 non-null   float64
 13  level_9   1680 non-null   float64
 14  level_10  1680 non-null   float64
dtypes: float64(15)
memory usage: 210.0 KB


In [11]:
df_im = pd.concat([pd.DataFrame({f: df_im[f].iloc[n] for f in df_im}) for n in range(df_im.shape[0])], axis=0)
df_im.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1680 entries, 0 to 49
Data columns (total 15 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   k         1680 non-null   int64  
 1   exp       1680 non-null   float64
 2   weight    1680 non-null   int64  
 3   j         1680 non-null   int64  
 4   m         1680 non-null   int64  
 5   type      1680 non-null   int64  
 6   level_2   1680 non-null   float64
 7   level_3   1680 non-null   float64
 8   level_4   1680 non-null   float64
 9   level_5   1680 non-null   float64
 10  level_6   1680 non-null   float64
 11  level_7   1680 non-null   float64
 12  level_8   1680 non-null   float64
 13  level_9   1680 non-null   float64
 14  level_10  1680 non-null   float64
dtypes: float64(10), int64(5)
memory usage: 210.0 KB
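
The flattening step can be illustrated on a toy frame with list-valued cells. This is a sketch under the assumption that all cells in a row hold lists of the same length; the data is invented. For comparison we also use `DataFrame.explode` with a list of columns, which to our knowledge requires pandas 1.3 or later.

```python
import pandas as pd

# hypothetical toy frame: each cell is a vector, one row per observable
toy = pd.DataFrame({
    'k':       [[2, 2], [3, 3, 3]],
    'level_2': [[0.1, 0.2], [0.3, 0.4, 0.5]],
})

# same construction as in the notebook: one small frame per row, then stack
flat = pd.concat(
    [pd.DataFrame({col: toy[col].iloc[n] for col in toy}) for n in range(toy.shape[0])],
    axis=0,
)

# alternative: explode all columns at once (cells become objects, so cast back)
exploded = toy.explode(list(toy.columns)).astype(float)

print(flat.shape)
print(flat.index.tolist())
```

Note that the index of `flat` restarts for every original row, which is why the flattened frames above report an index running from 0 to 49 for 1680 entries.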


In [12]:
df_re.describe()

| | k | exp | weight | j | m | type | level_2 | level_3 | level_4 | level_5 | level_6 | level_7 | level_8 | level_9 | level_10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1680.0 | 1680.0 | 1680.0 | 1680.0 | 1680.0 | 1680.0 | 1680.0 | 1680.0 | 1680.0 | 1680.0 | 1680.0 | 1680.0 | 1680.0 | 1680.0 | 1680.0 |
| mean | 6.777381 | 0.038664 | 0.904315 | 1.971429 | 0.0 | 3.890476 | -0.077024 | -0.079294 | 0.089836 | 0.078986 | -0.011237 | -0.003376 | 0.075769 | 0.06903 | -0.008013 |
| std | 1.314858 | 0.594829 | 0.588936 | 1.263905 | 1.576801 | 0.455165 | 2.614227 | 2.648262 | 1.286826 | 1.317507 | 1.384471 | 1.436868 | 1.518934 | 1.578406 | 1.722916 |
| min | 2.0 | -1.519671 | 0.0 | 0.0 | -4.0 | 2.0 | -26.284377 | -26.284377 | -8.757038 | -9.252983 | -10.978029 | -11.445648 | -13.721069 | -14.249796 | -24.994666 |
| 25% | 6.0 | -0.437426 | 0.416667 | 1.0 | -1.0 | 4.0 | -0.570279 | -0.601505 | -0.487621 | -0.460953 | -0.530362 | -0.535451 | -0.494487 | -0.48239 | -0.485126 |
| 50% | 7.0 | 0.0 | 0.972222 | 2.0 | 0.0 | 4.0 | 0.0 | 0.0 | 0.018674 | 0.016188 | 0.0 | 0.0 | 0.008034 | 0.006125 | 0.0 |
| 75% | 8.0 | 0.500705 | 1.333333 | 3.0 | 1.0 | 4.0 | 0.586592 | 0.619976 | 0.611741 | 0.585377 | 0.536852 | 0.556819 | 0.597575 | 0.589435 | 0.506648 |
| max | 8.0 | 1.414214 | 2.0 | 4.0 | 4.0 | 4.0 | 35.385221 | 35.385221 | 11.673646 | 11.673646 | 10.978029 | 11.445648 | 18.673134 | 20.051284 | 15.315423 |


In [13]:
df_im.describe()

| | k | exp | weight | j | m | type | level_2 | level_3 | level_4 | level_5 | level_6 | level_7 | level_8 | level_9 | level_10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1680.0 | 1680.0 | 1680.0 | 1680.0 | 1680.0 | 1680.0 | 1680.0 | 1680.0 | 1680.0 | 1680.0 | 1680.0 | 1680.0 | 1680.0 | 1680.0 | 1680.0 |
| mean | 0.0 | -0.001853 | 0.0 | 0.0 | 0.0 | 0.0 | -0.00567 | -0.009452 | -0.001553 | 0.003396 | -0.001987 | -0.014919 | -0.01094 | 0.009722 | 0.006753 |
| std | 0.0 | 0.298647 | 0.0 | 0.0 | 0.0 | 0.0 | 0.234921 | 0.371899 | 0.352832 | 0.353773 | 0.3685 | 0.443906 | 0.471528 | 0.623249 | 0.657285 |
| min | 0.0 | -0.930605 | 0.0 | 0.0 | 0.0 | 0.0 | -1.867961 | -1.969113 | -1.930349 | -2.005075 | -2.287981 | -3.646223 | -4.023181 | -6.650984 | -7.174097 |
| 25% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| max | 0.0 | 0.930605 | 0.0 | 0.0 | 0.0 | 0.0 | 1.867961 | 1.736677 | 1.930349 | 2.005075 | 2.287981 | 3.646223 | 4.023181 | 6.683011 | 7.187127 |


## Remove Identically Vanishing Columns

In [14]:
df_im = df_im.drop(columns=df_im.loc[:,(df_im.mean() == 0) & (df_im.std() == 0)].columns)
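
The cell above requires both a vanishing mean and a vanishing standard deviation, since a zero mean alone would also select symmetric columns. As a small sketch on invented data, the same selection can be cross-checked by testing the values directly.

```python
import pandas as pd

# hypothetical toy frame: 'k' vanishes identically, the others only on average
toy = pd.DataFrame({
    'k':       [0, 0, 0],
    'exp':     [0.0, -0.5, 0.5],
    'level_2': [0.0, 0.3, -0.3],
})

# criterion used in the notebook: zero mean and zero spread
by_moments = toy.columns[(toy.mean() == 0) & (toy.std() == 0)].tolist()

# direct criterion: every value is exactly zero
by_values = toy.columns[(toy == 0).all()].tolist()

print(by_moments, by_values)
```

Both criteria select only the `k` column here, even though all three columns have zero mean.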

In [15]:
df_re.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1680 entries, 0 to 49
Data columns (total 15 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   k         1680 non-null   float64
 1   exp       1680 non-null   float64
 2   weight    1680 non-null   float64
 3   j         1680 non-null   float64
 4   m         1680 non-null   float64
 5   type      1680 non-null   float64
 6   level_2   1680 non-null   float64
 7   level_3   1680 non-null   float64
 8   level_4   1680 non-null   float64
 9   level_5   1680 non-null   float64
 10  level_6   1680 non-null   float64
 11  level_7   1680 non-null   float64
 12  level_8   1680 non-null   float64
 13  level_9   1680 non-null   float64
 14  level_10  1680 non-null   float64
dtypes: float64(15)
memory usage: 210.0 KB


In [16]:
df_im.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1680 entries, 0 to 49
Data columns (total 10 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   exp       1680 non-null   float64
 1   level_2   1680 non-null   float64
 2   level_3   1680 non-null   float64
 3   level_4   1680 non-null   float64
 4   level_5   1680 non-null   float64
 5   level_6   1680 non-null   float64
 6   level_7   1680 non-null   float64
 7   level_8   1680 non-null   float64
 8   level_9   1680 non-null   float64
 9   level_10  1680 non-null   float64
dtypes: float64(10)
memory usage: 144.4 KB


## Merge the Datasets

In [17]:
df_re = df_re.rename(columns=lambda c: re.sub(r'(exp|level_.*)', r'\1_re', c))
df_im = df_im.rename(columns=lambda c: re.sub(r'(exp|level_.*)', r'\1_im', c))
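
The suffix is added only to `exp` and to the level columns, so the labelling variables keep a single, shared name. A quick sketch of the substitution on plain strings (the helper name `add_suffix` is ours, the pattern is the one from the cell above):

```python
import re

def add_suffix(column: str, suffix: str) -> str:
    """Append '_<suffix>' to 'exp' and 'level_*' names, leave the rest alone."""
    return re.sub(r'(exp|level_.*)', r'\1_' + suffix, column)

names = ['k', 'weight', 'type', 'exp', 'level_2', 'level_10']
print([add_suffix(c, 're') for c in names])
print([add_suffix(c, 'im') for c in names])
```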

In [18]:
df = pd.concat([df_re, df_im], axis=1)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1680 entries, 0 to 49
Data columns (total 25 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   k            1680 non-null   float64
 1   exp_re       1680 non-null   float64
 2   weight       1680 non-null   float64
 3   j            1680 non-null   float64
 4   m            1680 non-null   float64
 5   type         1680 non-null   float64
 6   level_2_re   1680 non-null   float64
 7   level_3_re   1680 non-null   float64
 8   level_4_re   1680 non-null   float64
 9   level_5_re   1680 non-null   float64
 10  level_6_re   1680 non-null   float64
 11  level_7_re   1680 non-null   float64
 12  level_8_re   1680 non-null   float64
 13  level_9_re   1680 non-null   float64
 14  level_10_re  1680 non-null   float64
 15  exp_im       1680 non-null   float64
 16  level_2_im   1680 non-null   float64
 17  level_3_im   1680 non-null   float64
 18  level_4_im   1680 non-null   float64
 19  level_5_

## Get Dummy Variables for the Type of Oscillations

In [19]:
df = pd.get_dummies(df, columns=['type'])
df = df.rename(columns={'type_2.0': 'type_2', 'type_4.0': 'type_4'})
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1680 entries, 0 to 49
Data columns (total 26 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   k            1680 non-null   float64
 1   exp_re       1680 non-null   float64
 2   weight       1680 non-null   float64
 3   j            1680 non-null   float64
 4   m            1680 non-null   float64
 5   level_2_re   1680 non-null   float64
 6   level_3_re   1680 non-null   float64
 7   level_4_re   1680 non-null   float64
 8   level_5_re   1680 non-null   float64
 9   level_6_re   1680 non-null   float64
 10  level_7_re   1680 non-null   float64
 11  level_8_re   1680 non-null   float64
 12  level_9_re   1680 non-null   float64
 13  level_10_re  1680 non-null   float64
 14  exp_im       1680 non-null   float64
 15  level_2_im   1680 non-null   float64
 16  level_3_im   1680 non-null   float64
 17  level_4_im   1680 non-null   float64
 18  level_5_im   1680 non-null   float64
 19  level_6_
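
The renaming after `get_dummies` is needed because `type` is stored as a float, so the generated names carry a `.0`. A minimal sketch on invented data:

```python
import pandas as pd

# hypothetical toy frame: 'type' takes the float values 2.0 and 4.0
toy = pd.DataFrame({'k': [2.0, 3.0, 3.0], 'type': [2.0, 4.0, 4.0]})

# one indicator column per distinct value of 'type'
dummies = pd.get_dummies(toy, columns=['type'])
print(list(dummies.columns))

# strip the '.0' left over from the float representation
dummies = dummies.rename(columns={'type_2.0': 'type_2', 'type_4.0': 'type_4'})
print(list(dummies.columns))
```

The dtype of the indicator columns depends on the pandas version (`uint8` in the output shown in this notebook, `bool` in more recent releases as far as we know), so downstream code should not rely on it.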

## Compute Complex Numbers

In [20]:
levels = ['level_' + str(n) for n in range(2, 11)] + ['exp']

# combine real and imaginary parts into complex columns (vectorised, no applymap)
for l in levels:
    df[l] = df[l + '_re'] + 1j * df[l + '_im']

## Reorder the Columns

In [21]:
level_cols = ['level_' + str(n) for n in range(2, 11)]
columns = (['k', 'weight', 'j', 'm', 'type_2', 'type_4'] + level_cols + [c + '_re' for c in level_cols]
           + [c + '_im' for c in level_cols] + ['exp', 'exp_re', 'exp_im'])
df = df[columns]

In [22]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1680 entries, 0 to 49
Data columns (total 36 columns):
 #   Column       Non-Null Count  Dtype     
---  ------       --------------  -----     
 0   k            1680 non-null   float64   
 1   weight       1680 non-null   float64   
 2   j            1680 non-null   float64   
 3   m            1680 non-null   float64   
 4   type_2       1680 non-null   uint8     
 5   type_4       1680 non-null   uint8     
 6   level_2      1680 non-null   complex128
 7   level_3      1680 non-null   complex128
 8   level_4      1680 non-null   complex128
 9   level_5      1680 non-null   complex128
 10  level_6      1680 non-null   complex128
 11  level_7      1680 non-null   complex128
 12  level_8      1680 non-null   complex128
 13  level_9      1680 non-null   complex128
 14  level_10     1680 non-null   complex128
 15  level_2_re   1680 non-null   float64   
 16  level_3_re   1680 non-null   float64   
 17  level_4_re   1680 non-null   float6

## Compute the Angle and Modulus of the Extrapolated Label

In [23]:
# np.angle returns values in (-pi, pi]; dividing by pi rescales them to (-1, 1]
df['exp_angle'] = df['exp'].apply(np.angle) / np.pi
df['exp_mod']   = df['exp'].apply(np.abs)
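
As a quick check of the normalisation, the same two functions can be applied to a handful of complex numbers chosen by hand (they are not taken from the dataset):

```python
import numpy as np

# hand-picked values: a conjugate pair, a negative real and a pure imaginary
z = np.array([1 + 1j, 1 - 1j, -2 + 0j, 3j])

angle = np.angle(z) / np.pi   # rescaled angle, in units of pi
modulus = np.abs(z)

print(angle)
print(modulus)
```

The conjugate pair has the same modulus and opposite angles, which is the property used when looking for duplicates.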

## Remove Duplicates

Duplicates may also arise as complex conjugates. We therefore take the absolute value of every column (the modulus for the complex ones) and then mark as duplicates the entries coming from the same solution (same `k`), with the same weight, the same $\mathrm{SU}(2)$ multiplet (same `j` and same |`m`|), and the same extrapolated label up to complex conjugation (same modulus and same absolute value of the angle).

In [24]:
duplicates_id = df.abs().duplicated(subset=['k', 'weight', 'j', 'm', 'type_2', 'type_4', 'exp_mod', 'exp_angle'])
duplicates = df.loc[duplicates_id]
df = df.loc[~duplicates_id]

In [25]:
print(f'Number of duplicates:   {duplicates_id.sum():d}')
print(f'Fraction of duplicates: {duplicates_id.mean():.3f}')

Number of duplicates:   768
Fraction of duplicates: 0.457
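
The effect of taking absolute values before calling `duplicated` can be seen on a toy frame. This is a sketch with invented rows and a reduced set of columns, not a reproduction of the counts above.

```python
import pandas as pd

# hypothetical rows: the second is the conjugate partner of the first (m -> -m, angle -> -angle)
toy = pd.DataFrame({
    'k':         [2.0, 2.0, 2.0],
    'm':         [1.0, -1.0, 0.0],
    'exp_mod':   [0.5, 0.5, 0.5],
    'exp_angle': [0.25, -0.25, 0.25],
})
subset = ['k', 'm', 'exp_mod', 'exp_angle']

# a plain comparison sees three distinct rows
plain = toy.duplicated(subset=subset)

# after folding the signs, the second row is flagged as a repeat of the first
folded = toy.abs().duplicated(subset=subset)

print(plain.tolist())
print(folded.tolist())
```

By default `duplicated` keeps the first occurrence, so which member of a conjugate pair survives depends on the row order.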


## Save to File

In [26]:
duplicates.to_csv(pdata('wzw_dup.csv'), index=False)
df.to_csv(pdata('wzw.csv'), index=False)
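
When the files are read back, the complex columns deserve some care: to our understanding `to_csv` stores them as text, so `read_csv` would return strings unless a converter is supplied. A minimal round-trip sketch, using an in-memory buffer and invented values instead of the actual files:

```python
import io
import pandas as pd

# hypothetical toy frame with one complex column
toy = pd.DataFrame({'k': [2.0, 3.0], 'exp': [1 + 2j, 3 - 4j]})

# write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# parse the text of the complex column back with Python's complex()
back = pd.read_csv(buffer, converters={'exp': complex})
print(back['exp'].tolist())
```

Alternatively, the separate `_re` and `_im` columns saved alongside the complex ones can be used to rebuild the complex values after loading.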