# Machine Learning for String Field Theory

In the context of String Field Theory (SFT), we analyse data from different categories of models and study how various observables depend on the truncation level. The goal is to use machine learning (ML) methods to predict the value of the observables without truncating at a finite level.

In [1]:
%load_ext autoreload
%autoreload 2

import os
import re
import pandas as pd
import numpy as np
import json

## Dataset Creation

In this notebook we take the files containing the data of the individual models and combine them into a single tidy dataset ready for processing.

In [2]:
data_dir  = './data_orig'
data_reg  = re.compile(r'.*json$')
data_list = [file for file in os.listdir(data_dir) if data_reg.match(file)]

Data in the list include:

- **minimal models**, both real and imaginary parts including *weight*, *type*, levels 2 through 24 and the extrapolated label (*exp*),
- **lumps** solutions, with the initial point (*init*), *weight*, *type*, levels 2 through 18 and the extrapolated label (*exp*),
- **WZW model**, both real and imaginary parts including the level *k*, *weight*, *type*, $\mathrm{SU}(2)$ quantum numbers *j* and *m* (such that the weight $h = \frac{j ( j + 1 )}{k + 2}$) and levels 2 through 14 and the extrapolated label (*exp*),
- **double lumps** solutions, with the initial point (*init*), *weight*, *type*, levels 2 through 18 and the extrapolated label (*exp*).
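
As a quick check of the weight formula quoted for the WZW model, the following sketch evaluates $h = \frac{j ( j + 1 )}{k + 2}$ with exact fractions. The helper name `wzw_weight` and the sample values of *j* and *k* are chosen here for illustration and do not come from the data files.

```python
from fractions import Fraction

def wzw_weight(j, k):
    """Conformal weight h = j (j + 1) / (k + 2) for spin j at level k."""
    j = Fraction(j)
    return j * (j + 1) / (k + 2)

# illustrative values only (not taken from the dataset)
h_half = wzw_weight(Fraction(1, 2), 2)   # j = 1/2, k = 2
h_one = wzw_weight(1, 2)                 # j = 1,   k = 2
print(h_half, h_one)                     # 3/16 1/2
```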

In general the *init* variable can be discarded as it should not enter the computation of the extrapolated label. Variables which are not present in a specific dataset could be safely replaced with zeros, but for the sake of generality we build a general dataset using `NaN` values (they can be quickly replaced when importing the dataset, for instance, in `pandas`).
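
A minimal sketch of the replacement mentioned above, on a made-up two-row table (the values are placeholders, not data from the models): `fillna` returns a copy with the missing entries set to zero and leaves the original untouched.

```python
import numpy as np
import pandas as pd

# toy table: the first row has no WZW level k (placeholder values)
toy = pd.DataFrame({"weight": [0.5, 1.0], "k": [np.nan, 2.0]})

# replace the missing entries with zeros when this is appropriate
filled = toy.fillna(0.0)
print(filled)
```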

For each dataset in the list we therefore build a tidy dataset before putting everything together.

In [3]:
os.makedirs('./data_upload', exist_ok=True)

model_dict = {'lumps': 0,
              'double-lumps': 1,
              'minimal_models': 2,
              'wzw': 3
             }

with open('./data_upload/model_dict.json', 'w') as f:
    json.dump(model_dict, f)
    
model_dict_inv = {value: key for key, value in model_dict.items()}

with open('./data_upload/model_dict_reverse.json', 'w') as f:
    json.dump(model_dict_inv, f)
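
One detail worth keeping in mind when reading the reverse dictionary back: JSON object keys are always strings, so the integer keys of `model_dict_inv` come back as `"0"`, `"1"`, and so on. The sketch below reproduces the round trip in memory (no file is written) and converts the keys back; the name `restored` is introduced here for illustration.

```python
import json

model_dict = {"lumps": 0, "double-lumps": 1, "minimal_models": 2, "wzw": 3}
model_dict_inv = {value: key for key, value in model_dict.items()}

# serialise and parse back in memory
loaded = json.loads(json.dumps(model_dict_inv))

# the keys are now strings: convert them back to integers
restored = {int(key): value for key, value in loaded.items()}
print(list(loaded.keys()), restored[3])
```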

In [4]:
def tidy_data(file, model, data_dir='./data', rename_levels=True, set_dtype=True, save_path=None, **kwargs):
    '''
    Create a tidy dataset from a JSON file.
    
    Needed arguments:
        file: the name of the JSON file,
        model: a string identifying the physical model.
        
    Optional arguments:
        data_dir:      the root dir containing the file,
        rename_levels: rename the columns containing the truncation levels by adding 'level_' as a prefix,
        set_dtype:     modify dtypes to be stored,
        save_path:     path to a file to save the tidy dataset,
        **kwargs:      additional arguments to pass to DataFrame.to_json.
    
    Returns:
        the tidy Pandas dataset (also written to save_path when it is given).
    '''
    
    # get the file and drop the init variable
    df = pd.read_json(os.path.join(data_dir, file))
    
    if 'init' in df.columns:
        df = df.drop(columns='init')
        
    if rename_levels:
        df = df.rename(columns=lambda c: re.sub(r'^([0-9])$', r'level_0\1', c))
        df = df.rename(columns=lambda c: re.sub(r'^([1-9][0-9])$', r'level_\1', c))
    
    # check if real or imaginary parts
    is_im   = bool(re.match(r'.*_im[.]json$', file))
    
    if is_im:
        # if imaginary part
        df = df.rename(columns=lambda c: re.sub(r'^(.*)$', r'\1_im', c))    
    else:
        # if real part (or not specified)
        df = df.rename(columns=lambda c: re.sub(r'^(.*)$', r'\1_re', c))
        
    # get the name of the columns
    columns = list(df.columns)
        
    # go over the rows and expand them
    df_stack = []
    for n in range(df.shape[0]):

        # get the row (without NaN values)
        row = df.iloc[n].dropna()

        # check whether all the entries of the row are lists
        is_list = np.all(row.apply(lambda entry: isinstance(entry, list)))

        # if so, expand the lists into one row per element
        if is_list:
            row = dict(row)
            row = pd.DataFrame(row)

            # add column to distinguish the solutions
            row['solution'] = n

            # append to the list to be concatenated
            df_stack.append(row)
    
    # concatenate if needed
    if len(df_stack) > 0:
        df = pd.concat(df_stack, axis=0, ignore_index=True)
        
    # finally add a column with the model specification
    df['model'] = model_dict[model]
    
    # reorder the columns
    cols = []
    
    if 'solution' in df.columns:
        cols = ['solution']
        
    cols = cols + [col for col in columns if not bool(re.match(r'^exp.*', col))]
    
    if 'model' in df.columns:
        cols = cols + ['model']
        
    if 'exp_im' in columns:
        cols = cols + ['exp_im']
        
    if 'exp_re' in columns:
        cols = cols + ['exp_re']
        
    df = df[cols]
        
    # reset dtypes for storage
    if set_dtype:
        # solution is a short int (unsigned)
        if 'solution' in df.columns:
            df.loc[:, 'solution'] = pd.Series(df['solution'], dtype=np.uint8)
            
        # the level k is also a short unsigned int if present
        if 'k' in df.columns:
            df.loc[:, 'k'] = pd.Series(df['k'], dtype=np.uint8)
        
        # models are categories
        df.loc[:, 'model']    = pd.Categorical(pd.Series(df['model'], dtype=np.uint8))
        
        # every other column is a float
        for c in columns:
            df.loc[:, c] = pd.Series(df[c], dtype=np.float32)
            
        # the type column is a category as well
        if 'type_re' in df.columns:
            df.loc[:, 'type_re'] = pd.Categorical(pd.Series(df['type_re'], dtype=np.uint8))
        if 'type_im' in df.columns:
            df.loc[:, 'type_im'] = pd.Categorical(pd.Series(df['type_im'], dtype=np.uint8))
        
    # save the tidy dataset if requested, then return it
    if save_path is not None:
        df.to_json(save_path, **kwargs)
        
    return df
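
The row expansion performed inside `tidy_data` can be illustrated on a toy table. The sketch below assumes, as the function does, that each cell holds a list with one entry per observable; the numbers are invented for the example and the column names only mimic those of the real files.

```python
import pandas as pd

# toy input: one row per solution, each cell is a list (made-up values)
raw = pd.DataFrame({
    "weight_re": [[0.0, 1.0], [0.0, 1.0, 4.0]],
    "level_02_re": [[0.9, 0.1], [0.8, 0.2, 0.05]],
    "exp_re": [[1.0, 0.0], [1.0, 0.0, 0.0]],
})

frames = []
for n in range(raw.shape[0]):
    row = raw.iloc[n].dropna()
    block = pd.DataFrame(dict(row))   # one row per list element
    block["solution"] = n             # remember the originating solution
    frames.append(block)

tidy = pd.concat(frames, axis=0, ignore_index=True)
print(tidy)
```

The two input rows become five tidy rows, with `solution` recording which input row each one came from.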

We can finally create the separate datasets:

In [5]:
data_dir_tidy = './data_tidy'
os.makedirs(data_dir_tidy, exist_ok=True)

df_dict = {re.sub(r'[.]json', '', data): tidy_data(file=data,
                                                   model=re.sub(r'_re[.]json|_im[.]json|[.]json', '', data),
                                                   data_dir=data_dir,
                                                   save_path=os.path.join(data_dir_tidy, re.sub(r'[.]json', '', data) + '_tidy.json')
                                                  ) for data in data_list
          }
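
The two regular expressions used above play different roles: the first builds the dictionary key, the second strips the `_re`/`_im` suffix to recover the model name. A small sketch follows; the file names are inferred from the keys used later in the notebook and should be read as assumptions.

```python
import re

# assumed file names, inferred from the keys of df_dict
files = ["lumps.json", "double-lumps.json", "minimal_models_re.json", "wzw_im.json"]

keys = [re.sub(r"[.]json", "", f) for f in files]
models = [re.sub(r"_re[.]json|_im[.]json|[.]json", "", f) for f in files]

for key, model in zip(keys, models):
    print(f"{key:20s} -> {model}")
```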

In the *lumps* dataset we also remove the first `solution` (i.e. solution $= 0$), as its values may spoil the analysis given their straightforward correlation with the labels.

In [6]:
# remove the first entry (copy to avoid assigning to a slice of the original)
df_dict['lumps'] = df_dict['lumps'].loc[df_dict['lumps']['solution'] != 0].copy()

# shift the solution number so that it starts from zero again
df_dict['lumps']['solution'] = df_dict['lumps']['solution'] - 1

## Dataset Unification

We first merge the datasets that come with separate real and imaginary parts, since the two parts belong to the same entries and are not independent datasets:

In [7]:
df_list = [pd.merge(df_dict['minimal_models_re'], df_dict['minimal_models_im'], left_index=True, right_index=True),
           pd.merge(df_dict['wzw_re'], df_dict['wzw_im'], left_index=True, right_index=True)
          ]

# drop redundant columns
df_list = [df.drop(columns=['solution_y', 'model_y']).rename(columns={'solution_x': 'solution', 'model_x': 'model'}) for df in df_list]

# add the other datasets
df_list.append(df_dict['lumps'])
df_list.append(df_dict['double-lumps'])
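
The `_x`/`_y` names dropped above are produced by `pd.merge` itself: when both frames share a column name, the default suffixes are appended. A toy sketch with made-up values shows the mechanism.

```python
import pandas as pd

# toy real and imaginary parts sharing 'solution' and 'model' (made-up values)
re_part = pd.DataFrame({"solution": [0, 0], "level_02_re": [0.9, 0.1], "model": [3, 3]})
im_part = pd.DataFrame({"solution": [0, 0], "level_02_im": [0.0, 0.2], "model": [3, 3]})

# merge row by row on the index: shared columns get the suffixes _x and _y
merged = pd.merge(re_part, im_part, left_index=True, right_index=True)
suffixed = sorted(c for c in merged.columns if c.endswith(("_x", "_y")))

# keep one copy of the shared columns and restore their names
merged = merged.drop(columns=["solution_y", "model_y"])
merged = merged.rename(columns={"solution_x": "solution", "model_x": "model"})
print(suffixed, list(merged.columns))
```

Merging on the index assumes that the real and imaginary files list their entries in the same order.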

We finally concatenate the newly created datasets in order to have a single dataset containing all the information on the data.

In [8]:
df = pd.concat(df_list, axis=0, ignore_index=True)

# change temporarily the dtypes of "type" to float
df.loc[:, 'type_re'] = df['type_re'].astype(np.float32)
df.loc[:, 'type_im'] = df['type_im'].astype(np.float32)

From this dataset we need to remove identically vanishing columns such as `type_im`, `k_im`, etc. We also need to rename the surviving paired columns to avoid showing the suffix *_re* when it is not needed.

In [9]:
vanishing_columns = []
for column in df.columns:
    if column != 'model':
        if df[column].mean(skipna=True) == 0 and df[column].std(skipna=True) == 0:
            vanishing_columns.append(column)

# drop the columns
df = df.drop(columns=vanishing_columns)

# create a dictionary to rename the columns
rename_columns = {re.sub(r'_im$', r'_re', c): re.sub(r'_im$', '', c) for c in vanishing_columns}
df = df.rename(columns=rename_columns)
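
The same drop-and-rename logic can be followed on a toy table with made-up values: `type_im` vanishes identically, so it is dropped and its partner `type_re` loses the suffix, while `level_02_im` has a non-zero entry and is kept together with `level_02_re`.

```python
import re

import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "type_re": [2.0, 4.0, 2.0],
    "type_im": [0.0, 0.0, 0.0],
    "level_02_re": [0.9, 0.1, 0.5],
    "level_02_im": [0.0, 0.2, np.nan],
})

# a column vanishes identically if both its mean and its spread are zero
vanishing = [c for c in toy.columns
             if toy[c].mean(skipna=True) == 0 and toy[c].std(skipna=True) == 0]
toy = toy.drop(columns=vanishing)

# the real partner of a dropped imaginary column no longer needs the suffix
rename = {re.sub(r"_im$", "_re", c): re.sub(r"_im$", "", c) for c in vanishing}
toy = toy.rename(columns=rename)
print(vanishing, list(toy.columns))
```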

We can finally reorder the columns and modify the dtypes accordingly:

In [10]:
df = df[['weight', 'type', 'k', 'solution', 'j', 'm',
         'level_02_re', 'level_02_im',
         'level_03_re', 'level_03_im',
         'level_04_re', 'level_04_im',
         'level_05_re', 'level_05_im',
         'level_06_re', 'level_06_im',
         'level_07_re', 'level_07_im',
         'level_08_re', 'level_08_im',
         'level_09_re', 'level_09_im',
         'level_10_re', 'level_10_im',
         'level_11_re', 'level_11_im',
         'level_12_re', 'level_12_im',
         'level_13_re', 'level_13_im',
         'level_14_re', 'level_14_im',
         'level_15',
         'level_16',
         'level_17',
         'level_18',
         'level_19',
         'level_20',
         'level_21',
         'level_22',
         'level_23',
         'level_24',
         'model',
         'exp_re', 'exp_im'
        ]
       ]

# fill the NaN exp_im values with zeros to avoid mistaking a purely real solution for an incomplete case
df.loc[df['exp_im'].isna(), 'exp_im'] = 0.0

# fill the NaN imaginary parts of the levels with zeros when the corresponding real parts are present
for n in range(2, 15):
    level = 'level_' + f'{n:02}'
    df.loc[~df[level + '_re'].isna() & df[level + '_im'].isna(), level + '_im'] = 0.0

# modify the dtypes
df.loc[:, 'solution'] = pd.Series(df['solution'], dtype='Int8')
df.loc[:, 'type']     = pd.Categorical(pd.Series(df['type'], dtype=np.uint8))
df.loc[:, 'k']        = pd.Series(df['k'], dtype='Int8')
df.loc[:, 'model']    = pd.Categorical(pd.Series(df['model'], dtype=np.uint8))

for c in df.columns:
    if bool(re.match(r'^weight$|^j$|^m$|^level.*|^exp.*', c)):
        df.loc[:, c] = pd.Series(df[c], dtype=np.float32)
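
The conditional fill of the imaginary parts can be checked on a toy column pair (made-up values): a missing imaginary part becomes zero only where the real part is present, and stays `NaN` where the level was not computed at all.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "level_02_re": [0.9, 0.5, np.nan],
    "level_02_im": [0.1, np.nan, np.nan],
})

# real part present, imaginary part missing: the value is purely real
mask = toy["level_02_re"].notna() & toy["level_02_im"].isna()
toy.loc[mask, "level_02_im"] = 0.0
print(toy)
```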

We finally save the dataset to file:

In [11]:
df.to_json('./data_upload/sft_data.json.gz', orient='index')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3300 entries, 0 to 3299
Data columns (total 45 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   weight       3300 non-null   float32 
 1   type         3300 non-null   category
 2   k            1680 non-null   Int8    
 3   solution     3280 non-null   Int8    
 4   j            1680 non-null   float32 
 5   m            1680 non-null   float32 
 6   level_02_re  3300 non-null   float32 
 7   level_02_im  3300 non-null   float32 
 8   level_03_re  3300 non-null   float32 
 9   level_03_im  3300 non-null   float32 
 10  level_04_re  3300 non-null   float32 
 11  level_04_im  3300 non-null   float32 
 12  level_05_re  3300 non-null   float32 
 13  level_05_im  3300 non-null   float32 
 14  level_06_re  3300 non-null   float32 
 15  level_06_im  3300 non-null   float32 
 16  level_07_re  3300 non-null   float32 
 17  level_07_im  3300 non-null   float32 
 18  level_08_re  3300 non-null  

As a last step we write a README file describing what was produced above.

In [12]:
README = '''
description:

  the dataset contains the values of several variables related to observables in String Field Theory for different models, at different finite truncation levels and extrapolated to infinite level

authors:

  M. Kudrna (Charles U., Prague) - original data
  R. Finotello (Torino U.) - dataset
  
dataset:

  columns description:
  
    weight:     conformal weight of the observable (float),
    type:       oscillation periodicity (int, categorical: either 2 or 4),
    k:          level of the WZW model (int, <NA> if not WZW model),
    solution:   identifier of the radius of the solution in the same physical model (int, <NA> for double lumps),
    j:          quantum number of the SU(2) representation (float, NaN if not WZW model),
    m:          quantum number of the SU(2) representation (float, NaN if not WZW model),
    level_*_re: real part of the finite truncation levels (float, NaN if not computed),
    level_*_im: imaginary part of the finite truncation levels (float, NaN if not computed),
    level_*:    real part of the higher finite truncation levels (float, NaN if not computed),
    model:      category of the physical model (int, categorical from 0 to 3, see dictionary below),
    exp_re:     real part of the extrapolated truncation at infinity (float),
    exp_im:     imaginary part of the extrapolated truncation at infinity (float)
    
  description:
      
    content: string field theory observables at different levels of truncation
    
    notes:
    
      - JSON dictionaries to translate categorical models to name of the physical model are provided (see model_dict.json and model_dict_reverse.json in this directory)
      - NaN or <NA> have been left where the values have not been computed for some reason to be distinguishable from a genuine zero (most of the times they can be safely replaced with zeros using Pandas, or similar tools)
    
  rows: 3300
  
  columns: 45
  
  size: ~300KB (gzipped), ~2.8MB (uncompressed)
  
file:

  mime: application/gzip
  name: sft_data.json.gz
  description: gzipped JSON file
'''

with open('./data_upload/sft_data.json.gz.txt', 'w') as f:
    f.write(README)
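
For completeness, a minimal sketch of how a file written with `to_json(..., orient='index')` and a `.gz` extension can be read back. It uses a toy table in a temporary directory rather than the real `sft_data.json.gz`, and relies on pandas inferring the gzip compression from the file extension. As far as this sketch goes, dtypes such as `category` or `float32` should not be expected to survive the round trip and may need to be set again after loading.

```python
import os
import tempfile

import pandas as pd

# toy table standing in for the real dataset (made-up values)
toy = pd.DataFrame({"weight": [0.5, 1.0], "model": [0, 3]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.json.gz")
    toy.to_json(path, orient="index")             # gzip inferred from '.gz'
    loaded = pd.read_json(path, orient="index")   # same orientation when reading

print(loaded)
```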