# Level Truncation in String Field Theory

H. Erbin, R. Finotello, M. Kudrna

## Minimal Models

We consider minimal models in String Field Theory and the level truncation of their observables.
We use machine learning techniques to extrapolate the observables to truncation level $L \to \infty$, and compare with the results obtained from fits of polynomials in $\frac{1}{L}$.
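As a point of reference for the polynomial fits mentioned above, the following is a minimal sketch of an extrapolation in $\frac{1}{L}$. The data are synthetic and the polynomial degree is an arbitrary choice made for illustration; neither is taken from the dataset or from the authors' procedure.

```python
import numpy as np

# Synthetic example (assumption, not real data): the values follow
# 1 + 0.5/L - 0.2/L**2 exactly, so the limit for L -> infinity is 1.
levels = np.arange(2, 25)
values = 1.0 + 0.5 / levels - 0.2 / levels**2

# Fit a degree-2 polynomial in x = 1/L (the degree is an illustrative choice).
x = 1.0 / levels
coeffs = np.polyfit(x, values, deg=2)

# L -> infinity corresponds to x -> 0, i.e. the constant term of the fit.
extrapolated = np.polyval(coeffs, 0.0)
print(f'Extrapolated value at L = infinity: {extrapolated:.6f}')
```

The range of `levels` mirrors the truncation levels 2 to 24 that appear as columns in the dataset.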

## Dataset Tidying

In what follows we tidy the data and prepare a dataset with the physical observables as rows, and the weight, type and truncation levels as columns.

In [1]:
import pandas as pd
import numpy as np
import re

## Import the Dataset

We first import the dataset using the `pandas` library.
The original dataset is a Mathematica file which has been converted to JSON in the format `[{column -> value} ... {column -> value}]`, thus the correct *orientation* of the dataset is `records`.
There are two different datasets: the first is the real part of the data, the second represents the imaginary part.
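To illustrate what the `records` orientation looks like, here is a small sketch that reads a hand-written JSON string. The column names are borrowed from the dataset, but the values are invented for illustration.

```python
import io
import pandas as pd

# Hypothetical two-record JSON string mimicking the layout described above
# (the values are invented): each record is one row, each key one column.
raw = '[{"exp": [1.0, 0.5], "type": [2, 4]}, {"exp": [0.3], "type": [4]}]'

toy = pd.read_json(io.StringIO(raw), orient='records')

# Each cell holds a Python list, one entry per observable of that solution.
print(toy.shape)
print(type(toy.loc[0, 'exp']))
```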

In [2]:
df_re = pd.read_json('./data/data_re.json', orient='records')
df_im = pd.read_json('./data/data_im.json', orient='records')

The native datasets are composed of solutions at different radii and, for each of them, of the values of the weight, type and truncation levels of different observables.
There are several incomplete cases, as it is impossible to compute the same number of truncation levels for all the observables.

In [3]:
# check consistency between the two datasets
assert df_re.isna().astype(int).sum().max() == df_im.isna().astype(int).sum().max()

n_incomplete = df_re.isna().astype(int).sum().max()
print(f'There are {n_incomplete:d} incomplete cases in each dataset.')

There are 63 incomplete cases in each dataset.


In [4]:
# check consistency between the two datasets
assert df_re.shape == df_im.shape

df_nrows, df_ncols = df_re.shape
print(f'The datasets are made of {df_nrows:d} rows and {df_ncols:d} columns.')
print(f'Equivalently, they are {df_nrows:d} solutions and {df_ncols:d} features and labels.')

The datasets are made of 65 rows and 26 columns.
Equivalently, they are 65 solutions and 26 features and labels.


Each feature has a different physical interpretation.
The column `exp` contains the extrapolated labels, which will be used as the output of the machine learning algorithms.
The `type` feature is categorical and can be rescaled to $\{ 0,\, 1\}$.

In [5]:
df_re.columns

Index(['exp', 'weight', 'type', '2', '3', '4', '5', '6', '7', '8', '9', '10',
       '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22',
       '23', '24'],
      dtype='object')

In [6]:
df_im.columns

Index(['exp', 'weight', 'type', '2', '3', '4', '5', '6', '7', '8', '9', '10',
       '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22',
       '23', '24'],
      dtype='object')

The `type` column takes only the values $\{ 2,\, 4 \}$, which can be mapped to $\{ 0,\, 1 \}$ by dividing by $2$ and subtracting $1$:

In [7]:
df_re['type'] = df_re['type'].apply(lambda x: [(i // 2) - 1 for i in x])
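The mapping can be checked on a single cell. The list below is a toy value, not taken from the dataset:

```python
# Toy cell value (assumption): a list of types as stored in one row of `type`.
types = [2, 4, 4, 2]

# Integer-divide by 2 and subtract 1, so that 2 -> 0 and 4 -> 1.
rescaled = [(t // 2) - 1 for t in types]
print(rescaled)
```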

We then rename the columns to make them distinguishable: we add the prefix *lev_* to the truncation levels, and the suffixes *_re* and *_im* to set the real and imaginary parts apart.

In [8]:
df_re = df_re.rename(columns=lambda x: re.sub('$', '_re', re.sub('^([0-9]*)$', r'lev_\1', x)))
df_im = df_im.rename(columns=lambda x: re.sub('$', '_im', re.sub('^([0-9]*)$', r'lev_\1', x)))
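The effect of the two substitutions can be seen on a short list of column names. The helper `add_affixes` below is a hypothetical name introduced only for this sketch; it uses `[0-9]+` rather than `[0-9]*` so that an empty name is never matched.

```python
import re

def add_affixes(name, suffix):
    """Prefix purely numeric names with 'lev_', then append the suffix."""
    name = re.sub(r'^([0-9]+)$', r'lev_\1', name)
    return name + suffix

# A few of the column names of the dataset, used as a toy input.
columns = ['exp', 'weight', 'type', '2', '10']
renamed = [add_affixes(c, '_re') for c in columns]
print(renamed)
```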

## Incomplete Cases

We then need to handle the incomplete cases.
We replace every incomplete (`NaN`) entry with a list containing a single zero.
We then zero-pad every list to the largest number of observables per solution in the dataset, so that each incomplete case ends up as a list of zeros.

In [9]:
length_re = df_re.applymap(lambda x: len(x) if isinstance(x, list) else 0).max(axis=1)
length_im = df_im.applymap(lambda x: len(x) if isinstance(x, list) else 0).max(axis=1)

# check if the number of observables is consistent
assert (length_re == length_im).all()
length = length_re # or length_im, since they are the same

In [10]:
# fill the incomplete cases with a list of zeros
df_re = df_re.applymap(lambda x: x if isinstance(x, list) else [0])
df_im = df_im.applymap(lambda x: x if isinstance(x, list) else [0])

# pad the lists to be the maximal length
df_re = df_re.applymap(lambda x: np.pad(x, (0,length.max() - len(x))))
df_im = df_im.applymap(lambda x: np.pad(x, (0,length.max() - len(x))))
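The two steps can be sketched on three toy cells (a full list, a shorter list, and a missing value). The values are invented; `np.pad` defaults to constant padding with zeros.

```python
import numpy as np

# Toy cells (assumption): a complete list, a shorter list and a missing entry.
cells = [[1.5, 2.5, 3.5], [4.0], float('nan')]

# Step 1: replace anything that is not a list with a one-element list of zeros.
as_lists = [c if isinstance(c, list) else [0] for c in cells]

# Step 2: pad every list on the right with zeros up to the maximal length.
max_len = max(len(c) for c in as_lists)
padded = [np.pad(c, (0, max_len - len(c))) for c in as_lists]

for p in padded:
    print(p)
```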

## Stacking the Dataset

We then stack the dataset so that each row contains only one physical observable.
We first add two columns to identify the solution and the observable for ordering purposes (i.e. we assign a number in $[0,\, 64]$ to each solution and a number in $[0,\, 16]$ to each observable).

In [11]:
solution   = []
observable = []
for n in df_re.index:
    solution.append([n] * length.max())
    observable.append(list(range(length.max())))

# add to dataset
df_re['solution']   = solution
df_re['observable'] = observable
df_im['solution']   = solution
df_im['observable'] = observable

We expand each solution into its own `pd.DataFrame`, collect the results in a list, and then stack all of them on top of each other:

In [12]:
def expand_series(df, n):
    '''
    Expand a row of a dataset containing lists into a standalone dataframe.
    
    Needed arguments:
        df: the dataframe (pd.DataFrame),
        n:  the id number of the row (int).
    '''
    
    # select the row
    row = df.iloc[n]
    
    # expand to a series
    row = row.apply(pd.Series)
    
    # transpose the dataframe
    row = row.transpose()
    
    return row

In [13]:
df_re_list = [expand_series(df_re, n) for n in range(df_nrows)]
df_im_list = [expand_series(df_im, n) for n in range(df_nrows)]

We then stack the dataframes using `pandas`:

In [14]:
df_re_tidy = pd.concat(df_re_list)
df_im_tidy = pd.concat(df_im_list)
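For comparison, a similar reshaping can be sketched with `DataFrame.explode`. This is an alternative to the approach used in this notebook, not the method applied above. It assumes pandas 1.3 or later (multi-column `explode`) and lists of equal length within each row; the toy values are invented.

```python
import pandas as pd

# Toy dataset (assumption): two solutions with two observables each.
toy = pd.DataFrame({
    'solution':   [[0, 0], [1, 1]],
    'observable': [[0, 1], [0, 1]],
    'lev_2_re':   [[0.1, 0.2], [0.3, 0.4]],
})

# Explode all list columns at once: one row per (solution, observable) pair.
tidy = toy.explode(list(toy.columns), ignore_index=True)
print(tidy)
```

The exploded columns keep the `object` dtype, so an explicit `astype` is still needed afterwards.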

## Cleaning the Output

We finally clean the output by setting the *dtypes*, reordering the columns, joining the datasets, and removing the duplicates.

In [15]:
# change dtypes
df_re_tidy = df_re_tidy.astype({'type_re': int, 'solution': int, 'observable': int})
df_im_tidy = df_im_tidy.astype({'type_im': int, 'solution': int, 'observable': int})

In [16]:
# join the datasets
df_tidy = df_re_tidy.merge(df_im_tidy, how='inner', on=['solution', 'observable'])

# assert that no incomplete cases are present
assert df_tidy.isna().sum().sum() == 0

# check if shape still matches
assert df_tidy.shape[0] == df_re_tidy.shape[0]
assert df_tidy.shape[0] == df_im_tidy.shape[0]
assert df_tidy.shape[1] == df_re_tidy.shape[1] + df_im_tidy.shape[1] - 2

In [17]:
# select vanishing columns
df_agg_cols = df_tidy.agg([np.sum, np.mean, np.std])
drop_cols   = [key for key in df_agg_cols.columns if (df_agg_cols[key] == 0).all()]
df_tidy     = df_tidy.drop(columns=drop_cols)

In [51]:
# remove vanishing rows
df_agg_rows = df_tidy.drop(columns=['solution', 'observable']).agg([np.sum, np.mean, np.std], axis=1)
drop_rows   = [index for index in df_agg_rows.index if (df_agg_rows.loc[index] == 0).all()]
df_tidy     = df_tidy.drop(index=drop_rows)

In [52]:
# remove duplicates (do not use solution and observable)
col_list = [col for col in df_tidy.columns if col != 'solution' and col != 'observable']
df_tidy  = df_tidy.drop_duplicates(subset=col_list, ignore_index=True)

df_tidy_nrows, df_tidy_ncols = df_tidy.shape

print(f'New shape of the tidy dataset: {df_tidy_nrows:d} samples, {df_tidy_ncols:d} features.')

New shape of the tidy dataset: 802 samples, 34 features.
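The role of the `subset` argument can be seen on a toy dataframe: rows are compared on the value columns only, so two rows with different `solution` labels but identical values count as duplicates. The numbers below are invented.

```python
import pandas as pd

# Toy dataset (assumption): the third row repeats the values of the first
# one but belongs to a different solution.
toy = pd.DataFrame({
    'solution':   [0, 0, 1],
    'observable': [0, 1, 0],
    'weight':     [1.0, 2.0, 1.0],
    'exp':        [0.5, 0.7, 0.5],
})

# Compare rows on every column except the two identifiers.
value_cols = [c for c in toy.columns if c not in ('solution', 'observable')]
deduped = toy.drop_duplicates(subset=value_cols, ignore_index=True)
print(deduped)
```

By default the first occurrence is kept, so the surviving row carries the identifiers of the first solution in which the values appear.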


In [27]:
# restore the names of the shared columns and zero-pad single-digit levels
df_tidy = df_tidy.rename(columns={'exp_re': 'exp', 'weight_re': 'weight', 'type_re': 'type'})
df_tidy = df_tidy.rename(columns=lambda x: re.sub('_([0-9])_', r'_0\1_', x))

In [36]:
# sort the columns
sorted_cols = ['solution', 'observable', 'weight', 'type'] + sorted(df_tidy.filter(regex='lev_')) + ['exp']
df_tidy     = df_tidy[sorted_cols]
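The zero-padding of single-digit levels matters for the sort, since `sorted` compares strings lexicographically. A short sketch on three column names shows the difference:

```python
import re

names = ['lev_2_re', 'lev_10_re', 'lev_9_im']

# Same substitution as in the renaming cell: '_2_' becomes '_02_'.
padded = [re.sub('_([0-9])_', r'_0\1_', n) for n in names]

# Without padding, 'lev_10' sorts before 'lev_2'; with padding the
# lexicographic order agrees with the numerical order of the levels.
print(sorted(names))
print(sorted(padded))
```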

## Save the Dataset

The tidy dataset is now complete and we save it to file (we choose the JSON format for versatility).

In [54]:
df_tidy.to_json('./data/data_tidy.json.gz', orient='records')