# Machine Learning for String Field Theory

*H. Erbin, R. Finotello, M. Kudrna, M. Schnabl*

## Abstract

In the framework of bosonic **Open String Field Theory** (OSFT), we consider several observables, characterised by their conformal weight, the periodicity of their oscillations and the position of the vacua in the potential, at various values of the truncated mass level.
We focus on predicting the extrapolated value at level-$\infty$ truncation using Machine Learning (ML) techniques.

## Synopsis

In this notebook we tidy the dataset of the **lump solutions** and convert it from its original format to a CSV-like format suitable for training and prediction.

## General Observations

Each entry in the datasets represents one observable in OSFT.
Since the observables are stored as vector entries in the dataset, we also introduce a new label that identifies the solution vector each observable comes from.

Together with the features labelling the observable, we also have the values of such observable at different truncation levels.
The ultimate purpose of the analysis is to compute the extrapolated values at level-$\infty$ truncation.
The data is therefore twofold: some variables label the observable, while the values at the finite truncation levels are to be compared with the value at level $\infty$.
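
The twofold structure can be illustrated on a toy dataframe. This is only a sketch with made-up numbers: the column names mirror those appearing later in the notebook, and we assume here, without confirmation from the text above, that the `exp` column stores the expected level-$\infty$ value.

```python
import pandas as pd

# Toy rows with made-up numbers (not taken from the lump dataset).
toy = pd.DataFrame({
    "weight":  [0.0, 1.0],
    "type":    [2, 4],
    "level_2": [0.90, 1.20],
    "level_3": [0.95, 1.10],
    "exp":     [1.0, 1.0],   # ASSUMPTION: 'exp' is the level-infinity value
})

level_cols = [c for c in toy.columns if c.startswith("level_")]

descriptors = toy[["weight", "type"]]   # variables labelling the observable
truncations = toy[level_cols]           # values at finite truncation levels
target = toy["exp"]                     # value the truncations are compared with

# Distance between the highest available level and the assumed target.
residual = truncations[level_cols[-1]] - target
print(residual.round(2).tolist())  # [-0.05, 0.1]
```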

In [1]:
%load_ext autoreload
%autoreload 2

import pandas as pd
import numpy as np
import re
import os

In [2]:
# create shortcuts for paths
proot = lambda s: os.path.join('.', s)
pdata = lambda s: os.path.join(proot('data'), s)

## Load the Dataset

In [3]:
df = pd.read_json(pdata('mathematica_lumps.json'))
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46 entries, 0 to 45
Data columns (total 21 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   init    46 non-null     object
 1   exp     46 non-null     object
 2   weight  46 non-null     object
 3   type    46 non-null     object
 4   2.      46 non-null     object
 5   3.      46 non-null     object
 6   4.      46 non-null     object
 7   5.      46 non-null     object
 8   6.      46 non-null     object
 9   7.      46 non-null     object
 10  8.      46 non-null     object
 11  9.      46 non-null     object
 12  10.     46 non-null     object
 13  11.     46 non-null     object
 14  12.     46 non-null     object
 15  13.     46 non-null     object
 16  14.     46 non-null     object
 17  15.     46 non-null     object
 18  16.     46 non-null     object
 19  17.     46 non-null     object
 20  18.     46 non-null     object
dtypes: object(21)
memory usage: 7.7+ KB


The dataset is made of 46 non-null vector entries (the dataset is complete).
We need to:

1. rename the truncation level variables to be human readable,
2. add a label for each solution,
3. flatten the entries,
4. remove the initial point,
5. get the dummy variables for the type of oscillations.

## Rename the Columns

In [4]:
# '2.' -> 'level_2'; names without a trailing dot (init, exp, ...) are left unchanged
columns = lambda c: re.sub(r'(.*)[.]', r'level_\1', c)
df = df.rename(columns=columns)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46 entries, 0 to 45
Data columns (total 21 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   init      46 non-null     object
 1   exp       46 non-null     object
 2   weight    46 non-null     object
 3   type      46 non-null     object
 4   level_2   46 non-null     object
 5   level_3   46 non-null     object
 6   level_4   46 non-null     object
 7   level_5   46 non-null     object
 8   level_6   46 non-null     object
 9   level_7   46 non-null     object
 10  level_8   46 non-null     object
 11  level_9   46 non-null     object
 12  level_10  46 non-null     object
 13  level_11  46 non-null     object
 14  level_12  46 non-null     object
 15  level_13  46 non-null     object
 16  level_14  46 non-null     object
 17  level_15  46 non-null     object
 18  level_16  46 non-null     object
 19  level_17  46 non-null     object
 20  level_18  46 non-null     object
dtypes: object(21)
memo
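
As a quick check of the renaming rule, the same regular expression can be applied to a few sample column names. The names in `samples` are taken from the `df.info()` output above; `rename` and `renamed` are illustrative names, not part of the notebook.

```python
import re

# Same pattern as in the rename cell: the text before the trailing dot
# becomes the level number, prefixed by 'level_'.
rename = lambda c: re.sub(r'(.*)[.]', r'level_\1', c)

samples = ['init', 'exp', '2.', '18.']
renamed = [rename(c) for c in samples]
print(renamed)  # ['init', 'exp', 'level_2', 'level_18']
```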

## Drop Perfect Match

In [5]:
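# drop the first entry; the copy avoids a SettingWithCopyWarning when a column is added later
df = df.iloc[1:].copy()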
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45 entries, 1 to 45
Data columns (total 21 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   init      45 non-null     object
 1   exp       45 non-null     object
 2   weight    45 non-null     object
 3   type      45 non-null     object
 4   level_2   45 non-null     object
 5   level_3   45 non-null     object
 6   level_4   45 non-null     object
 7   level_5   45 non-null     object
 8   level_6   45 non-null     object
 9   level_7   45 non-null     object
 10  level_8   45 non-null     object
 11  level_9   45 non-null     object
 12  level_10  45 non-null     object
 13  level_11  45 non-null     object
 14  level_12  45 non-null     object
 15  level_13  45 non-null     object
 16  level_14  45 non-null     object
 17  level_15  45 non-null     object
 18  level_16  45 non-null     object
 19  level_17  45 non-null     object
 20  level_18  45 non-null     object
dtypes: object(21)
memo

## Add Label

In [6]:
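# length of the longest vector in each row (recent pandas versions provide DataFrame.map in place of applymap)
shapes = list(df.applymap(len).max(axis=1))
# repeat the (1-based) solution index once per vector entry, so that it can be flattened with the other columns
labels = [[k + 1] * n for k, n in enumerate(shapes)]
df['solution'] = labels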
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45 entries, 1 to 45
Data columns (total 22 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   init      45 non-null     object
 1   exp       45 non-null     object
 2   weight    45 non-null     object
 3   type      45 non-null     object
 4   level_2   45 non-null     object
 5   level_3   45 non-null     object
 6   level_4   45 non-null     object
 7   level_5   45 non-null     object
 8   level_6   45 non-null     object
 9   level_7   45 non-null     object
 10  level_8   45 non-null     object
 11  level_9   45 non-null     object
 12  level_10  45 non-null     object
 13  level_11  45 non-null     object
 14  level_12  45 non-null     object
 15  level_13  45 non-null     object
 16  level_14  45 non-null     object
 17  level_15  45 non-null     object
 18  level_16  45 non-null     object
 19  level_17  45 non-null     object
 20  level_18  45 non-null     object
 21  solution  45 non-n
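
The construction of the labels can be followed on a small example. The vector lengths in `shapes` are hypothetical; in the notebook they are computed from the dataset.

```python
# Hypothetical vector lengths for three solutions.
shapes = [3, 1, 2]

# Solution k (counted from 1) gets a list repeating its index once per
# vector entry, matching the length of the other cells in the same row.
labels = [[k + 1] * n for k, n in enumerate(shapes)]
print(labels)  # [[1, 1, 1], [2], [3, 3]]
```

Each label list has the same length as the vectors in its row, which is what allows the row to be expanded into a regular dataframe in the next step.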

## Flatten the Entries

In [7]:
# build one dataframe per solution (one row per vector entry), then stack them vertically
df = pd.concat([pd.DataFrame({f: df[f].iloc[n] for f in df}) for n in range(df.shape[0])], axis=0)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 763 entries, 0 to 19
Data columns (total 22 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   init      763 non-null    float64
 1   exp       763 non-null    float64
 2   weight    763 non-null    float64
 3   type      763 non-null    float64
 4   level_2   763 non-null    float64
 5   level_3   763 non-null    float64
 6   level_4   763 non-null    float64
 7   level_5   763 non-null    float64
 8   level_6   763 non-null    float64
 9   level_7   763 non-null    float64
 10  level_8   763 non-null    float64
 11  level_9   763 non-null    float64
 12  level_10  763 non-null    float64
 13  level_11  763 non-null    float64
 14  level_12  763 non-null    float64
 15  level_13  763 non-null    float64
 16  level_14  763 non-null    float64
 17  level_15  763 non-null    float64
 18  level_16  763 non-null    float64
 19  level_17  763 non-null    float64
 20  level_18  763 non-null    float64
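
The flattening step can be reproduced on a toy dataframe whose cells hold lists. The values below are made up; only the mechanism is the same as in the cell above.

```python
import pandas as pd

# Toy frame: every cell is a list, and all lists in a row share the same length.
toy = pd.DataFrame({
    "weight":   [[0.0, 1.0], [4.0]],
    "level_2":  [[0.5, 0.7], [1.1]],
    "solution": [[1, 1], [2]],
})

# One small dataframe per row, stacked vertically.
flat = pd.concat(
    [pd.DataFrame({col: toy[col].iloc[n] for col in toy}) for n in range(toy.shape[0])],
    axis=0,
)
print(flat.shape)           # (3, 3)
print(flat.index.tolist())  # [0, 1, 0]
```

The index restarts from zero for every solution, which is why the output above reports 763 entries with an index running only from 0 to 19. Passing `ignore_index=True` to `pd.concat` would produce a unique index instead, if one were needed.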

In [8]:
df.describe()

| | init | exp | weight | type | level_2 | level_3 | level_4 | level_5 | level_6 | level_7 | ... | level_10 | level_11 | level_12 | level_13 | level_14 | level_15 | level_16 | level_17 | level_18 | solution |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 763.0 | 763.0 | 763.0 | 763.0 | 763.0 | 763.0 | 763.0 | 763.0 | 763.0 | 763.0 | ... | 763.0 | 763.0 | 763.0 | 763.0 | 763.0 | 763.0 | 763.0 | 763.0 | 763.0 | 763.0 |
| mean | 0.900145 | 0.568807 | 1.86481 | 3.764089 | -1.486604 | -1.646112 | 7.572923 | 8.009875 | -32.442704 | -33.974851 | ... | -707.859825 | -730.665421 | 2880.727031 | 2962.131627 | -10880.473304 | -11156.002595 | 38233.218285 | 39115.222069 | -125611.7 | 24.237221 |
| std | 1.018137 | 0.694124 | 2.31459 | 0.645534 | 4.45984 | 4.903356 | 20.975753 | 22.199837 | 108.036621 | 113.190933 | ... | 2772.330578 | 2859.981462 | 11577.937424 | 11896.896973 | 44115.397542 | 45204.266729 | 155584.08265 | 159093.092977 | 512985.2 | 13.025317 |
| min | 0.0 | -1.0 | 0.0 | 2.0 | -19.74404 | -21.893983 | -0.754568 | -0.782633 | -514.984097 | -538.627792 | ... | -13321.170445 | -13781.246472 | -8.850113 | -12.265769 | -211473.396816 | -216475.644423 | -44.356923 | -66.596211 | -2489024.0 | 1.0 |
| 25% | 0.0 | 0.0 | 0.040825 | 4.0 | -0.68581 | -0.973798 | 0.0 | 0.0 | -0.896869 | -0.915398 | ... | -1.022243 | -1.07355 | 0.001907 | 0.001975 | -2.093628 | -4.152302 | 0.139497 | 0.12457 | -6.801183 | 13.0 |
| 50% | 0.0 | 1.0 | 1.0 | 4.0 | 0.0 | 0.0 | 0.938337 | 0.944864 | 0.0 | 0.002741 | ... | 0.002597 | 0.002644 | 0.99996 | 0.99791 | 0.473076 | 0.550104 | 1.005628 | 1.005427 | 0.8994815 | 25.0 |
| 75% | 1.75 | 1.0 | 2.985594 | 4.0 | 0.912406 | 0.992236 | 1.347502 | 1.486801 | 0.991624 | 1.000029 | ... | 1.000028 | 1.004544 | 3.385307 | 5.228127 | 1.00612 | 1.006436 | 7.344206 | 10.777103 | 1.005195 | 36.0 |
| max | 3.0 | 1.0 | 9.0 | 4.0 | 1.239384 | 1.358098 | 122.931347 | 131.67549 | 2.275741 | 2.712998 | ... | 5.243298 | 6.283092 | 56115.100219 | 57592.69886 | 16.106978 | 23.077325 | 731718.33209 | 748286.961169 | 103.3588 | 45.0 |


## Get Dummy Variables for the Type of Oscillations

In [9]:
df = pd.get_dummies(df, columns=['type'])
df = df.rename(columns={'type_2.0': 'type_2', 'type_4.0': 'type_4'})
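
The effect of the two lines above can be seen on a toy dataframe with made-up values. Since `type` is stored as a float, `pd.get_dummies` names the new columns `type_2.0` and `type_4.0`, hence the renaming.

```python
import pandas as pd

# Toy frame with a float-valued categorical column, as in the flattened dataset.
toy = pd.DataFrame({"weight": [0.0, 1.0, 4.0], "type": [2.0, 4.0, 4.0]})

# One indicator column per distinct value of 'type'.
dummies = pd.get_dummies(toy, columns=["type"])
dummies = dummies.rename(columns={"type_2.0": "type_2", "type_4.0": "type_4"})
print(dummies.columns.tolist())  # ['weight', 'type_2', 'type_4']
```

Depending on the pandas version, the indicator columns are stored as `uint8` (as in the output of this notebook) or as `bool`.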

## Remove Columns and Prepare for Analysis

In [10]:
# reorder the columns and drop `init`
columns = ['solution', 'weight', 'type_2', 'type_4'] + ['level_' + str(n) for n in range(2, 19)] + ['exp']
df = df[columns]
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 763 entries, 0 to 19
Data columns (total 22 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   solution  763 non-null    int64  
 1   weight    763 non-null    float64
 2   type_2    763 non-null    uint8  
 3   type_4    763 non-null    uint8  
 4   level_2   763 non-null    float64
 5   level_3   763 non-null    float64
 6   level_4   763 non-null    float64
 7   level_5   763 non-null    float64
 8   level_6   763 non-null    float64
 9   level_7   763 non-null    float64
 10  level_8   763 non-null    float64
 11  level_9   763 non-null    float64
 12  level_10  763 non-null    float64
 13  level_11  763 non-null    float64
 14  level_12  763 non-null    float64
 15  level_13  763 non-null    float64
 16  level_14  763 non-null    float64
 17  level_15  763 non-null    float64
 18  level_16  763 non-null    float64
 19  level_17  763 non-null    float64
 20  level_18  763 non-null    float64

## Remove Duplicates

In [11]:
# boolean mask: True for every row that repeats an earlier one (the first occurrence is kept)
duplicates_id = df.duplicated()
duplicates = df.loc[duplicates_id]
df = df.loc[~duplicates_id]

In [12]:
print(f'Number of duplicates:   {duplicates_id.sum():d}')
print(f'Fraction of duplicates: {duplicates_id.mean():.3f}')

Number of duplicates:   45
Fraction of duplicates: 0.059
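
The same bookkeeping can be checked on a toy dataframe with made-up values. Note that a row counts as a duplicate only if all of its columns, including `solution`, match an earlier row.

```python
import pandas as pd

toy = pd.DataFrame({
    "solution": [1, 1, 1, 2],
    "level_2":  [0.5, 0.5, 0.7, 0.5],
    "exp":      [1.0, 1.0, 1.0, 1.0],
})

mask = toy.duplicated()   # True for repeats of an earlier row
dups = toy.loc[mask]      # rows that are set aside
unique = toy.loc[~mask]   # rows that are kept

print(mask.tolist())         # [False, True, False, False]
print(f"{mask.mean():.3f}")  # 0.250
```

The mean of the boolean mask is the fraction of duplicated rows; in the notebook this is $45 / 763 \approx 0.059$.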


## Save to File

In [13]:
duplicates.to_csv(pdata('lumps_dup.csv'), index=False)
df.to_csv(pdata('lumps.csv'), index=False)
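
As a minimal sketch of what `index=False` does, the round trip below writes a toy dataframe to an in-memory buffer instead of a file. The values are made up, and the buffer stands in for the CSV files written above.

```python
import io
import pandas as pd

# Toy frame with a repeated index, as produced by the flattening step.
toy = pd.DataFrame({"solution": [1, 1], "exp": [1.0, 0.0]}, index=[0, 0])

buffer = io.StringIO()
toy.to_csv(buffer, index=False)   # the repeated index is not written
buffer.seek(0)

restored = pd.read_csv(buffer)
print(restored.index.tolist())    # [0, 1]
print(restored.columns.tolist())  # ['solution', 'exp']
```

When the file is read back, pandas assigns a fresh unique index, so the repeated index of the flattened dataframe does not propagate to the training stage.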