# Machine Learning for String Field Theory

*H. Erbin, R. Finotello, M. Kudrna, M. Schnabl*

## Abstract

In the framework of bosonic **Open String Field Theory** (OSFT), we consider several observables characterised by conformal weight, periodicity of the oscillations and the position of vacua in the potential for various values of truncated mass level.
We focus on predicting the extrapolated value at level-$\infty$ truncation using Machine Learning (ML) techniques.

## Synopsis

In this notebook we tidy the datasets of the **double lump solutions** and convert them from their original format to CSV files for training and predictions.

## General Observations

Each entry in the datasets represents one observable in OSFT.
Since these observables are represented by vector entries in the dataset, we also introduce a new label which will identify the observable inside its original solution vector.

Alongside the features labelling each observable, the dataset contains the values of that observable at different truncation levels.
The ultimate purpose of the analysis is to compute the extrapolated values at level-$\infty$ truncation.
The data is therefore twofold: some variables label the observable, while the values at the finite truncation levels are to be compared with the values at level $\infty$.
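
The twofold structure described above can be sketched on a toy table. This is only an illustration: the values are made up, the column names follow the `level_N` convention adopted later in this notebook, and it is an assumption that the `exp` column holds the level-$\infty$ value.

```python
import pandas as pd

# Toy frame mimicking the layout of the dataset (all values are made up).
toy = pd.DataFrame({
    'weight': [0.0, 1.0],
    'type': [2, 4],
    'level_2': [0.10, -1.30],
    'level_3': [0.12, -1.42],
    'exp': [0.15, -2.00],  # assumed to be the level-infinity value
})

# Variables that label the observable.
label_cols = ['weight', 'type']
# Values of the observable at finite truncation levels.
level_cols = [c for c in toy.columns if c.startswith('level_')]
# Value the finite-level data should be compared with.
target_col = 'exp'

X = toy[label_cols + level_cols]
y = toy[target_col]
print(X.shape, y.shape)
```

Keeping the three groups of columns in separate lists makes it easy to change which variables are treated as inputs without touching the rest of the code.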

In [1]:
%load_ext autoreload
%autoreload 2

import pandas as pd
import numpy as np
import re
import os

In [2]:
# create shortcuts for paths
proot = lambda s: os.path.join('.', s)
pdata = lambda s: os.path.join(proot('data'), s)

## Load the Dataset

In [3]:
df = pd.read_json(pdata('mathematica_dlumps.json'))
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 21 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   init    20 non-null     int64  
 1   weight  20 non-null     float64
 2   type    20 non-null     int64  
 3   2.      20 non-null     float64
 4   3.      20 non-null     float64
 5   4.      20 non-null     float64
 6   5.      20 non-null     float64
 7   6.      20 non-null     float64
 8   7.      20 non-null     float64
 9   8.      20 non-null     float64
 10  9.      20 non-null     float64
 11  10.     20 non-null     float64
 12  11.     20 non-null     float64
 13  12.     20 non-null     float64
 14  13.     20 non-null     float64
 15  14.     20 non-null     float64
 16  15.     20 non-null     float64
 17  16.     20 non-null     float64
 18  17.     20 non-null     float64
 19  18.     20 non-null     float64
 20  exp     20 non-null     float64
dtypes: float64(19), int64(2)
memory usage: 3.

The dataset contains 20 entries with no null values (the dataset is complete).
We need to:

1. rename the truncation-level columns to human-readable names,
2. remove the initial point (the `init` column),
3. get the dummy variables for the type of oscillations.

## Rename the Columns

In [4]:
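# turn the truncation-level names ('2.', '3.', ...) into 'level_2', 'level_3', ...;
# names without a dot ('init', 'weight', 'type', 'exp') are left unchanged
columns = lambda c: re.sub(r'(.*)[.]', r'level_\1', c)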
df = df.rename(columns=columns)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 21 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   init      20 non-null     int64  
 1   weight    20 non-null     float64
 2   type      20 non-null     int64  
 3   level_2   20 non-null     float64
 4   level_3   20 non-null     float64
 5   level_4   20 non-null     float64
 6   level_5   20 non-null     float64
 7   level_6   20 non-null     float64
 8   level_7   20 non-null     float64
 9   level_8   20 non-null     float64
 10  level_9   20 non-null     float64
 11  level_10  20 non-null     float64
 12  level_11  20 non-null     float64
 13  level_12  20 non-null     float64
 14  level_13  20 non-null     float64
 15  level_14  20 non-null     float64
 16  level_15  20 non-null     float64
 17  level_16  20 non-null     float64
 18  level_17  20 non-null     float64
 19  level_18  20 non-null     float64
 20  exp       20 non-null     float64
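
As a quick sanity check of the renaming rule, the sketch below applies the same substitution to a handful of column names. The `rename_strict` variant is a hypothetical alternative, not part of the original notebook, shown only to make explicit what the pattern is meant to match.

```python
import re

# Same substitution as in the cell above: whatever precedes the last dot
# is captured and re-emitted with a 'level_' prefix.
rename = lambda c: re.sub(r'(.*)[.]', r'level_\1', c)

# Hypothetical stricter variant (not from the notebook): only rewrites
# names made of digits followed by a single trailing dot.
rename_strict = lambda c: re.sub(r'^(\d+)\.$', r'level_\1', c)

raw = ['init', 'weight', 'type', '2.', '10.', 'exp']
renamed = [rename(c) for c in raw]
renamed_strict = [rename_strict(c) for c in raw]

print(renamed)
print(renamed == renamed_strict)
```

The two variants agree on the column names seen here. They differ only on names with a dot in the middle, such as a hypothetical `'a.b'`, which the original pattern would turn into `'level_ab'`.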


In [5]:
df.describe()

| | init | weight | type | level_2 | level_3 | level_4 | level_5 | level_6 | level_7 | level_8 | ... | level_10 | level_11 | level_12 | level_13 | level_14 | level_15 | level_16 | level_17 | level_18 | exp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 20.0 | 20.0 | 20.0 | 20.0 | 20.0 | 20.0 | 20.0 | 20.0 | 20.0 | 20.0 | ... | 20.0 | 20.0 | 20.0 | 20.0 | 20.0 | 20.0 | 20.0 | 20.0 | 20.0 | 20.0 |
| mean | 1.05 | 1.715278 | 3.8 | -2.608983 | -3.087235 | 12.360494 | 13.485368 | -53.858729 | -57.641226 | 258.016894 | ... | -1172.908731 | -1232.022585 | 5292.641517 | 5529.478551 | -22317.746351 | -23199.704152 | 85662.85 | 88697.07 | -300601.1 | -723920.6 |
| std | 1.468081 | 2.189245 | 0.615587 | 8.118042 | 9.802525 | 44.393799 | 48.702023 | 222.478514 | 238.231213 | 1116.816835 | ... | 5174.917644 | 5439.189243 | 23531.62981 | 24590.746496 | 99564.849566 | 103507.763796 | 382673.0 | 396237.8 | 1343627.0 | 3237079.0 |
| min | 0.0 | 0.0 | 2.0 | -33.620593 | -40.373713 | -1.734322 | -1.825242 | -997.122136 | -1067.67844 | -5.550577 | ... | -23157.212916 | -24338.986279 | -7.42388 | -17.395405 | -445320.847252 | -462954.335124 | -67.06599 | -67.0486 | -6009038.0 | -14476750.0 |
| 25% | 0.0 | 0.090278 | 4.0 | -1.304863 | -1.425327 | 0.0 | 0.0 | -1.437197 | -1.412107 | -0.06877 | ... | -1.958771 | -1.390445 | -0.195941 | -0.652932 | -1.970593 | -1.375239 | -0.6939012 | -1.802527 | -8.913318 | -1.995897 |
| 50% | 0.0 | 1.0 | 4.0 | 0.0 | 0.0 | 0.01128 | 0.001635 | 0.0 | 0.0 | 0.010817 | ... | 0.0 | 0.000272 | 1.219497 | 0.424945 | 9.9e-05 | 0.201399 | 1.182381 | 0.4360167 | 0.2550822 | 0.3300225 |
| 75% | 3.0 | 2.381944 | 4.0 | 0.006879 | 0.341444 | 2.107497 | 2.043995 | 0.623679 | 0.639251 | 2.023673 | ... | 1.222807 | 1.71139 | 2.185291 | 2.014058 | 1.580704 | 2.010701 | 6.847626 | 2.034738 | 2.008174 | 1.998589 |
| max | 3.0 | 9.0 | 4.0 | 2.171332 | 2.499696 | 198.953725 | 218.118294 | 2.036695 | 2.03243 | 5001.157827 | ... | 3.761426 | 9.229395 | 105266.161459 | 110002.605965 | 23.210856 | 23.159297 | 1711459.0 | 1772124.0 | 146.9126 | 117.2284 |


## Get Dummy Variables for the Type of Oscillations

In [6]:
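# one-hot encode the type of oscillations
df = pd.get_dummies(df, columns=['type'])
# strip the trailing '.0' from the dummy names (present only if `type` was parsed as float)
df = df.rename(columns={'type_2.0': 'type_2', 'type_4.0': 'type_4'})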
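
The names of the dummy columns depend on the dtype of the encoded column, which is presumably why the cell above includes a rename step. The following sketch, on made-up toy data, shows the two cases: with integer values `pd.get_dummies` already yields `type_2` and `type_4`, while float values yield `type_2.0` and `type_4.0`.

```python
import pandas as pd

# Toy column with the two oscillation types seen in the dataset.
toy_int = pd.DataFrame({'type': [2, 4, 4]})
toy_float = toy_int.astype(float)

dummies_int = pd.get_dummies(toy_int, columns=['type'])
dummies_float = pd.get_dummies(toy_float, columns=['type'])

print(list(dummies_int.columns))
print(list(dummies_float.columns))

# The rename used in the notebook makes both cases end up with the same names;
# on the integer case it simply has no effect.
mapping = {'type_2.0': 'type_2', 'type_4.0': 'type_4'}
dummies_int = dummies_int.rename(columns=mapping)
dummies_float = dummies_float.rename(columns=mapping)
```

Since `df.info()` reports `type` as `int64` here, the rename is a no-op in this notebook, but it keeps the pipeline robust if the column is ever read as float.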

## Remove Columns and Prepare for Analysis

In [7]:
# keep only the columns needed for the analysis (this drops `init`)
columns = ['weight', 'type_2', 'type_4'] + ['level_' + str(n) for n in range(2, 19)] + ['exp']
df = df[columns]
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 21 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   weight    20 non-null     float64
 1   type_2    20 non-null     uint8  
 2   type_4    20 non-null     uint8  
 3   level_2   20 non-null     float64
 4   level_3   20 non-null     float64
 5   level_4   20 non-null     float64
 6   level_5   20 non-null     float64
 7   level_6   20 non-null     float64
 8   level_7   20 non-null     float64
 9   level_8   20 non-null     float64
 10  level_9   20 non-null     float64
 11  level_10  20 non-null     float64
 12  level_11  20 non-null     float64
 13  level_12  20 non-null     float64
 14  level_13  20 non-null     float64
 15  level_14  20 non-null     float64
 16  level_15  20 non-null     float64
 17  level_16  20 non-null     float64
 18  level_17  20 non-null     float64
 19  level_18  20 non-null     float64
 20  exp       20 non-null     float64


## Remove Duplicates

In [8]:
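# flag rows that exactly repeat an earlier row (the first occurrence is not flagged)
duplicates_id = df.duplicated()
# keep the flagged rows aside and remove them from the dataset
duplicates = df.loc[duplicates_id]
df = df.loc[~duplicates_id]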

In [9]:
print(f'Number of duplicates:   {duplicates_id.sum():d}')
print(f'Fraction of duplicates: {duplicates_id.mean():.3f}')

Number of duplicates:   1
Fraction of duplicates: 0.050
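
To make the behaviour of the duplicate mask concrete, here is a minimal sketch on made-up data. With the default `keep='first'`, `DataFrame.duplicated` flags only the later copies of a row, so the first occurrence survives the filtering.

```python
import pandas as pd

# Toy frame in which the third row repeats the first one (values are made up).
toy = pd.DataFrame({
    'weight': [0.0, 1.0, 0.0],
    'exp': [0.5, 2.0, 0.5],
})

mask = toy.duplicated()   # default keep='first': only later copies are flagged
dupes = toy.loc[mask]     # rows set aside as duplicates
unique = toy.loc[~mask]   # rows kept for the analysis

print(f'Number of duplicates:   {mask.sum():d}')
print(f'Fraction of duplicates: {mask.mean():.3f}')
```

The mean of a boolean mask is the fraction of flagged rows, which is how the notebook obtains 1 duplicate out of 20 entries, i.e. a fraction of 0.050.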


## Save to File

In [10]:
duplicates.to_csv(pdata('dlumps_dup.csv'), index=False)
df.to_csv(pdata('dlumps.csv'), index=False)