# Machine Learning for Level Truncation in Open String Field Theory

Harold Erbin, Riccardo Finotello, Matej Kudrna, Martin Schnabl

---

## Abstract

In the framework of bosonic Open String Field Theory (OSFT), we consider several observables characterised by conformal weight and type, and the position of vacua in the potential for various values of truncated mass level. We focus on the prediction of the extrapolated value for the level-$\infty$ truncation using Machine Learning (ML) techniques.

In this notebook we focus on the $\mathrm{SU}(2)$ WZW model and study some basic statistical inference from a linear model.

In [1]:
%load_ext autoreload
%autoreload 2

import os

os.makedirs('./models', exist_ok=True)

## Regression Analysis

We load the tidy dataset:

In [2]:
import pandas as pd

# load the dataset
df = pd.read_csv('./data/data_wzw.csv')

We split the dataset into training and test sets. Because `k` is now one of the training features, we do not split according to it as in the previous case; instead we use a standard random 80/20 split. The ultimate goal is to predict both the real and imaginary parts of `exp`:

In [3]:
from sklearn.model_selection import train_test_split

RAND = 123

df_train, df_test = train_test_split(df, train_size=0.8, shuffle=True, random_state=RAND)

# clean columns and separate the labels
exp_train = df_train[['exp_re', 'exp_im']]
exp_test  = df_test[['exp_re', 'exp_im']]

df_train  = df_train.drop(columns=['exp_re', 'exp_im'])
df_test   = df_test.drop(columns=['exp_re', 'exp_im'])

# check the size
print('Training ratio: {:.2f}%'.format(100 * df_train.shape[0] / df.shape[0]))

Training ratio: 80.00%


In order to get the predictions we prepare the estimator (a simple linear regressor in this case) and save everything we need for later:

In [4]:
from sklearn.linear_model import LinearRegression
import joblib

# define the estimator and save it
# note: the `normalize` argument was removed in scikit-learn 1.2 (its old default was False)
estimator = LinearRegression(fit_intercept=False, n_jobs=-1)
joblib.dump(estimator, './models/lr_prelim.pkl')

# save training and test sets
df_train.to_csv('./data/data_train_80.csv', index=False)
df_test.to_csv('./data/data_test_20.csv', index=False)

exp_train.to_csv('./data/labels_train_80.csv', index=False)
exp_test.to_csv('./data/labels_test_20.csv', index=False)

We then train the model:

In [5]:
!python3 ./scripts/scikit-train.py --train './data/data_train_80.csv' \
                                   --labels './data/labels_train_80.csv' \
                                   --estimator './models/lr_prelim.pkl'

LinearRegression trained in 0.002 seconds.


In [6]:
!python3 ./scripts/scikit-predict.py --test './data/data_test_20.csv' \
                                     --labels './data/labels_test_20.csv' \
                                     --estimator './models/lr_prelim.pkl' \
                                     --output 'lr_prelim'

LinearRegression predicted in 0.001 seconds.
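
The contents of `scikit-train.py` and `scikit-predict.py` are not shown in this notebook. As a rough sketch only, assuming the scripts simply load the pickled estimator with `joblib`, fit or apply it, and report the elapsed time, the two steps could look as follows. The function names `train_estimator` and `predict_with_estimator` and their arguments are hypothetical and are not taken from the scripts.

```python
import time

import joblib
import pandas as pd


def train_estimator(train_csv, labels_csv, estimator_pkl):
    """Fit the pickled estimator on CSV data and overwrite the pickle (hypothetical helper)."""
    X = pd.read_csv(train_csv)
    y = pd.read_csv(labels_csv)
    estimator = joblib.load(estimator_pkl)

    # time only the fit itself, not the file I/O
    start = time.perf_counter()
    estimator.fit(X, y)
    elapsed = time.perf_counter() - start

    # store the fitted estimator in place of the unfitted one
    joblib.dump(estimator, estimator_pkl)
    print('{} trained in {:.3f} seconds.'.format(type(estimator).__name__, elapsed))
    return estimator


def predict_with_estimator(test_csv, estimator_pkl, label_names):
    """Apply a fitted, pickled estimator to CSV features (hypothetical helper)."""
    X = pd.read_csv(test_csv)
    estimator = joblib.load(estimator_pkl)

    # time only the prediction itself
    start = time.perf_counter()
    y_pred = estimator.predict(X)
    elapsed = time.perf_counter() - start

    print('{} predicted in {:.3f} seconds.'.format(type(estimator).__name__, elapsed))
    return pd.DataFrame(y_pred, columns=label_names, index=X.index)
```

Keeping the estimator in a pickle, as done above with `lr_prelim.pkl`, lets the same two helpers serve any scikit-learn estimator that exposes `fit` and `predict`.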


In [7]:
import json

with open('./metrics/{}.json'.format('lr_prelim'), 'r') as f:
    metrics = pd.DataFrame(json.load(f), index=['lr_prelim'])
    
metrics

| | DOF | MSE | MSE 95% CI (lower) | MSE 95% CI (upper) | RMSE | MAE | R2 |
|---|---|---|---|---|---|---|---|
| lr_prelim | 313 | 0.057991 | -0.014972 | 0.072721 | 0.240813 | 0.068925 | 0.828673 |
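
The point estimates in this table follow standard definitions. As an illustration only (the script that produced them is not shown, and the function name `regression_metrics` is hypothetical), they could be computed with `sklearn.metrics` as below. For two-column labels such as `exp_re` and `exp_im`, these functions average uniformly over the columns by default. The DOF and the confidence interval on the MSE are not reproduced here, since the convention used for them is not stated.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE and R2 for (possibly multi-output) predictions (hypothetical helper)."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        'MSE': float(mse),
        'RMSE': float(np.sqrt(mse)),  # RMSE is the square root of the MSE
        'MAE': float(mean_absolute_error(y_true, y_pred)),
        'R2': float(r2_score(y_true, y_pred)),
    }


# toy example: two label columns, three samples, every prediction off by 0.1
y_true = np.array([[1.0, 0.0], [2.0, 0.0], [3.0, 1.0]])
y_pred = y_true + 0.1
toy_metrics = regression_metrics(y_true, y_pred)
```

Note that the RMSE reported above is consistent with this definition: it is the square root of the reported MSE.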


We then perform an analysis of variance (ANOVA) and study the coefficients of the fit:

In [8]:
DATA   = './predictions/lr_prelim.csv'
ESTIM  = './models/lr_prelim.pkl'
OUTPUT = 'lr_prelim_anova'

!python3 ./scripts/scikit-anova.py --data '{DATA}' \
                                   --estimator '{ESTIM}' \
                                   --output '{OUTPUT}'

In [9]:
import pandas as pd

pd.read_csv('./metrics/lr_prelim_anova.csv', index_col=0)

| | coefficients [Re(exp)] | coefficients [Im(exp)] | standard error [Re(exp)] | standard error [Im(exp)] | t statistic [Re(exp)] | t statistic [Im(exp)] | p value (t_obs > \|t\|) [Re(exp)] | p value (t_obs > \|t\|) [Im(exp)] | 95% CI (lower) [Re(exp)] | 95% CI (upper) [Re(exp)] | 95% CI (lower) [Im(exp)] | 95% CI (upper) [Im(exp)] |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| k | 0.002671 | -0.005132 | 0.015213 | 0.001723 | 0.176 | -2.977 | 0.861 | 0.003 | -0.031591 | 0.036933 | -0.009013 | -0.001251 |
| weight | 0.048004 | -0.004869 | 0.031582 | 0.003577 | 1.52 | -1.361 | 0.13 | 0.175 | -0.023125 | 0.119134 | -0.012925 | 0.003188 |
| j | -0.033284 | 0.000351 | 0.014707 | 0.001666 | -2.263 | 0.211 | 0.024 | 0.833 | -0.066408 | -0.000161 | -0.0034 | 0.004103 |
| m | 0.000942 | 0.000328 | 0.012584 | 0.001425 | 0.075 | 0.23 | 0.94 | 0.818 | -0.0274 | 0.029285 | -0.002882 | 0.003538 |
| type | 0.002381 | 0.01048 | 0.041406 | 0.00469 | 0.058 | 2.234 | 0.954 | 0.026 | -0.090874 | 0.095636 | -8.2e-05 | 0.021043 |
| level_2_re | 0.328998 | 0.008762 | 0.006351 | 0.000719 | 51.791 | 12.163 | 0.0 | 0.0 | 0.314693 | 0.343303 | 0.007142 | 0.010382 |
| level_2_im | 0.024593 | -0.034398 | 0.092842 | 0.010516 | 0.265 | -3.271 | 0.791 | 0.001 | -0.184508 | 0.233695 | -0.058081 | -0.010714 |
| level_3_re | -0.349234 | -0.008625 | 0.006274 | 0.000711 | -55.653 | -12.12 | 0.0 | 0.0 | -0.363366 | -0.335103 | -0.010225 | -0.007024 |
| level_3_im | -0.065203 | -0.064036 | 0.052515 | 0.005948 | -1.242 | -10.764 | 0.215 | 0.0 | -0.183478 | 0.053072 | -0.077433 | -0.05064 |
| level_4_re | -0.51624 | 0.014511 | 0.014576 | 0.001651 | -35.415 | 8.785 | 0.0 | 0.0 | -0.549068 | -0.483412 | 0.010793 | 0.01823 |


At the 5% significance level, `m` and `weight` are not significant for either part of the label, so it seems that we could avoid using them (although at least one of them should stay in the set to label the solutions). The imaginary parts of the truncation levels (`level_2_im`, `level_3_im`) are likewise not significant for the real part of the label, so we can avoid using them there, but we definitely need them to predict the imaginary part of the label.
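
The columns of the coefficient table (standard error, t statistic, p value, confidence interval) correspond to textbook ordinary least squares inference. The sketch below shows one way to obtain such quantities for a single label column and a model without intercept, using two-sided p values from a Student t distribution with $n - p$ degrees of freedom. It is an illustration under these assumptions, not the code of `scikit-anova.py`: the function name `ols_inference` and the `alpha` parameter are hypothetical, and the conventions used by the script (for example the confidence level or any multiple-testing correction) may differ.

```python
import numpy as np
from scipy import stats


def ols_inference(X, y, alpha=0.05):
    """Textbook OLS inference for y = X @ beta + noise, without intercept (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape

    # least squares estimate of the coefficients
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)

    # unbiased estimate of the noise variance with n - p degrees of freedom
    dof = n - p
    resid = y - X @ beta
    sigma2 = resid @ resid / dof

    # standard errors from the diagonal of sigma^2 (X^T X)^{-1}
    std_err = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))

    # t statistics and two-sided p values
    t_stat = beta / std_err
    p_value = 2.0 * stats.t.sf(np.abs(t_stat), dof)

    # symmetric (1 - alpha) confidence interval
    half_width = stats.t.ppf(1.0 - alpha / 2.0, dof) * std_err

    return {
        'coef': beta,
        'std_err': std_err,
        't': t_stat,
        'p_value': p_value,
        'ci_lower': beta - half_width,
        'ci_upper': beta + half_width,
        'dof': dof,
    }
```

Calling the function once per label column (first `exp_re`, then `exp_im`) would give one set of columns per label, in the same layout as the table above. A feature is then a candidate for removal when its p value is large for both labels, which is the criterion applied to `m` and `weight`.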