# Machine Learning for Level Truncation in Bosonic Open String Field Theory

We consider the position of the lumps of the tachyon potential in bosonic open string field theory (OSFT) at a finite mass level truncation. We then extrapolate the predictions to level-$\infty$.

In this notebook we show the preanalysis: we open the dataset and extract the features. We then eliminate the duplicates and show some outlying samples.

## Setup

The analysis is performed on a machine with the following specifications:

In [1]:
!echo "CPU: $(lscpu| awk '/^Model name/ {$1=""; $2=""; print}'| sed 's/^[[:space:]]*//g')"
!echo "GPU: $(lspci| awk '/3D controller/ {$1=""; $2=""; $3=""; print}'| sed 's/^[[:space:]]*//g')"
!echo "RAM: $(free --giga| awk '/^Mem/ {print $2}')GB (avail. now: $(free --giga| awk '/^Mem/ {print $7}')GB)"

CPU: Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
GPU: NVIDIA Corporation GM108M [GeForce 940MX] (rev a2)
RAM: 16GB (avail. now: 10GB)


Computations will be executed using a restricted number of CPU threads:

In [2]:
multi_thread = 4

# sanitise the input
if multi_thread > 8:
    multi_thread = 8

We will use several modules in this Python notebook. We import them early to take a look at their version numbers and to keep track of changes in the installation:

In [3]:
import sys

import numpy             as np
import pandas            as pd
import matplotlib        as mpl
import matplotlib.pyplot as plt
import sklearn           as skl

# Jupyter magics
%load_ext autoreload
%autoreload 2

%matplotlib inline
mpl.rc('axes', labelsize=12) #------- set size of the labels in Matplotlib
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

# check for restrictions and print version number
try:
    assert np.__version__ >= '1.18.0', 'Numpy version should be at least 1.18.0 to avoid conflict with Pandas and PyTables'
    print('Numpy_version: {}'.format(np.__version__))
    
    assert pd.__version__ >= '1.0.0', 'Pandas version should be at least 1.0.0 to use PyTables correctly'
    print('Pandas version: {}'.format(pd.__version__))
    
    assert mpl.__version__ >= '3.1.0', 'Matplotlib version should be at least 3.1.0'
    print('Matplotlib version: {}'.format(mpl.__version__))

    assert skl.__version__ >= '0.22.0', 'Scikit-learn version should be at least 0.22.0 to use newest implementations.'
    print('Scikit-learn version: {}'.format(skl.__version__))
    
except AssertionError as msg:
    print(msg)
    
# fix the random seed
RAND = 42
np.random.seed(RAND)

Numpy_version: 1.18.4
Pandas version: 1.0.3
Matplotlib version: 3.2.1
Scikit-learn version: 0.23.1
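The assertions above compare version strings lexicographically, which can order releases incorrectly (for instance `'1.9.0'` against `'1.18.0'`). A minimal sketch of a numeric comparison follows, assuming plain `major.minor.patch` version strings; the helper `version_tuple` is hypothetical and not part of the notebook.

```python
import re

def version_tuple(version):
    '''
    Convert a version string such as '1.18.4' into a tuple of integers.

    Hypothetical helper: only the leading digits of the first three
    dot-separated fields are kept, so suffixes such as 'rc1' are ignored.
    '''
    parts = []
    for field in version.split('.')[:3]:
        match = re.match(r'\d+', field)
        parts.append(int(match.group()) if match else 0)
    return tuple(parts)

# the string comparison treats 1.9.0 as newer than 1.18.0 ...
print('1.9.0' >= '1.18.0')
# ... while the tuple comparison orders the two releases numerically
print(version_tuple('1.9.0') >= version_tuple('1.18.0'))
```

With such a helper, a check would read `version_tuple(np.__version__) >= (1, 18, 0)`.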


We now create the directory structure and the path names used within the notebook:

In [4]:
from os import path, makedirs

# define directory names
ROOT_DIR = '.' #-------------------------------------------------- root directory
IMG_DIR  = 'img' #------------------------------------------------ images
MOD_DIR  = 'models' #--------------------------------------------- saved models
LOG_DIR  = 'log' #------------------------------------------------ logs
OUT_DIR  = 'output' #--------------------------------------------- saved predictions, relevant output, etc.

DB_NAME = 'data_sft_dict' #--------------------------------------- name of the dataset
DB_FILE = DB_NAME + '.json' #------------------------------------- full name with extension
DB_PATH = path.join(ROOT_DIR, DB_FILE) #-------------------------- full path of the dataset

# define full paths
IMG_PATH = path.join(ROOT_DIR, IMG_DIR)
MOD_PATH = path.join(ROOT_DIR, MOD_DIR)
LOG_PATH = path.join(ROOT_DIR, LOG_DIR)
OUT_PATH = path.join(ROOT_DIR, OUT_DIR)

# create directories if nonexistent (exist_ok=True makes the call a no-op otherwise)
for directory in (IMG_PATH, MOD_PATH, LOG_PATH, OUT_PATH):
    makedirs(directory, exist_ok=True)

Finally, we create a logging session to store debug information:

In [5]:
import logging
from mltools.liblog import create_logfile

path_to_log = path.join(LOG_PATH, DB_NAME + '_preanalysis.log') #--------------- path to the log
log = create_logfile(path_to_log, name=DB_NAME, level=logging.DEBUG) #---------- create log file and session

log.info('\n\n'
         '--------------------------------------------\n'
         '  MACHINE LEARNING FOR LEVEL TRUNCATION IN\n'
         '  BOSONIC OPEN STRING FIELD THEORY\n\n'
         '  (preanalysis)\n'
         '--------------------------------------------\n'
         '  Authors: Harold Erbin, Riccardo Finotello\n'
         '--------------------------------------------\n'
         '  Abstract:\n\n'
         '  We consider the position of the lumps of\n'
         '  the tachyon potential in bosonic open\n'
         '  string field theory at a finite mass level\n'
         '  truncation. We then extrapolate the\n'
         '  predictions for level-$\\infty$ using\n'
         '  machine learning techniques.\n\n'
        )

Rotating existing logs...
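The helper `create_logfile` comes from the local `mltools` package, whose source is not shown in this notebook. As a rough stand-in, a file logger with a similar call signature can be sketched with the standard `logging` module. The function name `create_logfile_sketch`, the message format, and the temporary path are assumptions, and the sketch does not rotate existing logs.

```python
import logging
import os
import tempfile

def create_logfile_sketch(filename, name='logger', level=logging.INFO):
    '''
    Hypothetical stand-in for mltools.liblog.create_logfile (not the original code).
    '''
    logger = logging.getLogger(name)
    logger.setLevel(level)
    logger.propagate = False  # keep messages out of the root logger

    # remove handlers left by previous calls to avoid duplicated lines
    for old_handler in list(logger.handlers):
        logger.removeHandler(old_handler)

    handler = logging.FileHandler(filename, mode='w')
    handler.setFormatter(logging.Formatter('%(asctime)s | %(levelname)s | %(message)s'))
    logger.addHandler(handler)

    return logger

# write the example log to a temporary directory
log_path = os.path.join(tempfile.mkdtemp(), 'example_preanalysis.log')
example_log = create_logfile_sketch(log_path, name='example', level=logging.DEBUG)
example_log.debug('Logging session started.')
```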


## Importing the Database

We import the database containing the positions of the lumps of the tachyon potential:

In [6]:
if path.isfile(DB_PATH):
    df = pd.read_json(DB_PATH)
    
    if not df.empty:
        log.debug('Successfully imported {}'.format(DB_PATH))
    else:
        sys.stderr.write('Database is empty!')
        log.error('Database is empty!')
else:
    sys.stderr.write('Cannot find database!')
    log.error('Cannot find database!')

We then start to analyse the _dtypes_ of the columns to understand what the dataset is made of:

In [7]:
df.dtypes

init      object
exp       object
weight    object
type      object
2         object
3         object
4         object
5         object
6         object
7         object
8         object
9         object
10        object
11        object
12        object
13        object
14        object
15        object
16        object
17        object
18        object
dtype: object

And then we take a look at the first few entries to understand the composition of the dataset:

In [8]:
df.head(3)

Unnamed: 0,init,exp,weight,type,2,3,4,5,6,7,...,9,10,11,12,13,14,15,16,17,18
0,"[1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1]","[1, 0, -1, 1, -1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]","[0, 0, 1, 4, 9, 0, 0.25, 1, 2.25, 4, 0, 0.25, ...","[2, 2, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4]","[1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1]","[1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1]","[1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1]","[1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1]","[1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1]","[1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1]",...,"[1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1]","[1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1]","[1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1]","[1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1]","[1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1]","[1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1]","[1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1]","[1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1]","[1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1]","[1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1]"
1,"[1.0001, 0, 1.0001, 1.0001, 1.0001, 1.0001, 0,...","[1, 0, -1, 1, -1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]","[0, 0, 1, 4, 9, 0, 0.249950007499, 0.999800029...","[2, 2, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4]","[1.000099754465899, -4.382819109385611e-08, 0....","[1.000099754261711, -6.385189815988693e-08, 0....","[1.000099495309939, -1.9972775228453091e-07, 0...","[1.000099494726808, -1.724421881015622e-07, 0....","[1.000099223845491, -3.2173432889712715e-07, 0...","[1.000099222907449, -2.856173963606407e-07, 0....",...,"[1.000098951488667, -3.9768513795577186e-07, 0...","[1.000098684133785, -5.470716468222031e-07, 0....","[1.000098682609473, -5.081256574169557e-07, 0....","[1.000098418312006, -6.559751804689415e-07, 0....","[1.000098416531483, -6.169537248661669e-07, 0....","[1.000098155198292, -7.630807670831046e-07, 0....","[1.000098153176776, -7.242020485026188e-07, 0....","[1.000097894670832, -8.685182838696036e-07, 0....","[1.000097892420157, -8.29888184051414e-07, 0.9...","[1.000097636616163, -9.72346758033437e-07, 0.9..."
2,"[1.001, 0, 1.001, 1.001, 1.001, 1.001, 0, 0, 0...","[1, 0, -1, 1, -1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]","[0, 0, 1, 4, 9, 0, 0.24950074900124802, 0.9980...","[2, 2, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4]","[1.000976232275641, -3.8820163895943e-06, 0.90...","[1.000976049237533, -5.9682788728782085e-06, 0...","[1.000952815386108, -1.689178344782202e-05, 0....","[1.000952352120989, -1.5047680126166993e-05, 0...","[1.000929855311967, -2.5515953552563565e-05, 0...","[1.000929190755388, -2.3264350581688953e-05, 0...",...,"[1.0009075350494, -3.03802946004187e-05, 0.664...","[1.000888236249875, -3.860013055708573e-05, 0....","[1.000887354078869, -3.657126395737419e-05, 0....","[1.000869431716091, -4.387998808033297e-05, 0....","[1.000868490272024, -4.199905525467674e-05, 0....","[1.000851773655717, -4.853491335211209e-05, 0....","[1.000850791489764, -4.6788451470318466e-05, 0...","[1.000835137520457, -5.266383019452565e-05, 0....","[1.000834127701389, -5.1036790056833293e-05, 0...","[1.000819417460843, -5.634321216588246e-05, 0...."


In total we are dealing with a dataset of shape:

In [9]:
df.shape

(46, 21)

The dataset therefore contains different predictions for the position of the lumps of the tachyon potential, corresponding to different choices of the initial point and of other properties. For each of them, data are provided for the 17 mass truncation levels from 2 to 18, together with the extrapolation to level-$\infty$.

## Features Extraction and Manipulation

Before moving to the analysis we need to extract and modify the features in the dataset.

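First of all, we exclude the first entry: in the columns displayed above its values coincide with the initial point at every truncation level, which looks too static and artificially "perfect" to help in the predictions.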

In [10]:
df = df[1:].reset_index(drop=True) #------ drop first entry and reset the index counter (do not include former index as column)
print('New shape of the dataset: {}'.format(df.shape))

log.debug('Dropped first entry.')

New shape of the dataset: (45, 21)


We then check that, within each row, all the elements have the same size:

In [11]:
df_sizes = df.applymap(np.shape)\
             .apply(np.unique, axis=1) #-------------------------------- compute the shape of each element of the dataframe and
#----------------------------------------------------------------------- take the unique values in each row

df_sizes_if_unique = df_sizes.apply(lambda x: np.shape(x) == (1,))\
                             .sum(axis=0) #----------------------------- check if each row contains only one element and
#----------------------------------------------------------------------- sum True values which should equal the size of the dataframe

try:
    assert df_sizes_if_unique == df.shape[0], 'Sizes along rows are not unique!'
    log.info('Sizes along rows are unique.')
except AssertionError as msg:
    log.error(msg)
    print(msg)
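The check above can be illustrated on a toy dataframe. The sketch below uses hypothetical data and a slightly different route (`len` and `nunique` instead of `np.shape` and `np.unique`), under the assumption that every cell holds a flat list.

```python
import pandas as pd

# toy dataframe (hypothetical data): each cell holds a list, as in the dataset
toy = pd.DataFrame({'init': [[1, 0], [1, 0, 2]],
                    'exp':  [[1, 1], [0, 0, 1]],
                    '2':    [[1, 0], [3, 4]]})  # second row: wrong length

# length of the list stored in each cell, column by column
toy_sizes = toy.apply(lambda column: column.map(len))

# a row is consistent when it contains a single distinct length
toy_consistent = toy_sizes.nunique(axis=1) == 1

print(toy_consistent.tolist())
```

The first row passes the check, while the second one is flagged because one of its lists has two elements instead of three.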

We then add one column which identifies the system of solutions each datum belongs to: the index of the system is repeated once for every element stored in the corresponding row.

In [12]:
def pad_index(index, shape):
    '''
    Repeat an index as many times as there are elements in the given shape.
    
    Required arguments:
        index: the index to repeat
        shape: the shape (or the size) of the object to match
        
    Returns:
        the list containing the repeated index
    '''
    
    return [index] * int(np.prod(shape))

def pad_index_list(index_list, shape_list):
    '''
    Pad the entire list of indices.
    
    Required arguments:
        index_list: list of indices to pad
        shape_list: list of shapes of the paddings
        
    Returns:
        the list of padded indices (one list per index)
    '''
    
    # the two lists must describe the same systems
    if len(index_list) != len(shape_list):
        raise ValueError('Indices and shapes must have the same length!')
        
    return [pad_index(index, shape) for index, shape in zip(index_list, shape_list)]

# add the column with the list of indices
df['system'] = pad_index_list(df.index, df_sizes.apply(lambda x: x[0]))
log.debug('Add system feature as reference.')

# reorder the dataframe
df = df[['system', 'init', 'weight', 'type', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', 'exp']]
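As an illustration of what the `system` column contains, the sketch below builds it for a toy dataframe with two systems. The data are hypothetical, and the padding is written inline rather than through `pad_index_list`.

```python
import pandas as pd

# toy dataframe (hypothetical data): two systems with 2 and 3 solutions
toy = pd.DataFrame({'init': [[1.0, 0.0], [2.0, 0.0, 2.0]],
                    'exp':  [[1, 0], [1, 0, -1]]})

# repeat the row index once per element stored in that row
padded = [[index] * len(values) for index, values in toy['init'].items()]
toy['system'] = pd.Series(padded, index=toy.index)

print(toy['system'].tolist())
```

Each solution thus carries the label of the system it comes from, which is preserved once the lists are unpacked into separate rows.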

We then "flatten" the systems: for each system we build a new dataframe with one row per solution, and then concatenate all of them.

In [13]:
df_flat = pd.concat([pd.DataFrame({f: df[f].iloc[n] for f in df}) for n in range(df.shape[0])], axis=0, ignore_index=True)
log.debug('New flattened dataframe has been built.')

# describe the new dataset
df_flat.describe()

Unnamed: 0,system,init,weight,type,2,3,4,5,6,7,...,10,11,12,13,14,15,16,17,18,exp
count,763.0,763.0,763.0,763.0,763.0,763.0,763.0,763.0,763.0,763.0,...,763.0,763.0,763.0,763.0,763.0,763.0,763.0,763.0,763.0,763.0
mean,23.237221,0.900145,1.86481,3.764089,-1.486604,-1.646112,7.572923,8.009875,-32.442704,-33.974851,...,-707.859825,-730.665421,2880.727031,2962.131627,-10880.473304,-11156.002595,38233.218285,39115.222069,-125611.7,0.568807
std,13.025317,1.018137,2.31459,0.645534,4.45984,4.903356,20.975753,22.199837,108.036621,113.190933,...,2772.330578,2859.981462,11577.937424,11896.896973,44115.397542,45204.266729,155584.08265,159093.092977,512985.2,0.694124
min,0.0,0.0,0.0,2.0,-19.74404,-21.893983,-0.754568,-0.782633,-514.984097,-538.627792,...,-13321.170445,-13781.246472,-8.850113,-12.265769,-211473.396816,-216475.644423,-44.356923,-66.596211,-2489024.0,-1.0
25%,12.0,0.0,0.040825,4.0,-0.68581,-0.973798,0.0,0.0,-0.896869,-0.915398,...,-1.022243,-1.07355,0.001907,0.001975,-2.093628,-4.152302,0.139497,0.12457,-6.801183,0.0
50%,24.0,0.0,1.0,4.0,0.0,0.0,0.938337,0.944864,0.0,0.002741,...,0.002597,0.002644,0.99996,0.99791,0.473076,0.550104,1.005628,1.005427,0.8994815,1.0
75%,35.0,1.75,2.985594,4.0,0.912406,0.992236,1.347502,1.486801,0.991624,1.000029,...,1.000028,1.004544,3.385307,5.228127,1.00612,1.006436,7.344206,10.777103,1.005195,1.0
max,44.0,3.0,9.0,4.0,1.239384,1.358098,122.931347,131.67549,2.275741,2.712998,...,5.243298,6.283092,56115.100219,57592.69886,16.106978,23.077325,731718.33209,748286.961169,103.3588,1.0
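The flattening step can be followed on a toy dataframe. The sketch below uses hypothetical data with two systems of 2 and 3 solutions, and applies the same construction as the cell above.

```python
import pandas as pd

# toy dataframe (hypothetical data): each cell holds the list of values of a system
toy = pd.DataFrame({'system': [[0, 0], [1, 1, 1]],
                    'init':   [[1.0, 0.0], [2.0, 0.0, 2.0]],
                    'exp':    [[1, 0], [1, 0, -1]]})

# one small dataframe per system, then stack them with a fresh index
frames = [pd.DataFrame({column: toy[column].iloc[n] for column in toy.columns})
          for n in range(toy.shape[0])]
toy_flat = pd.concat(frames, axis=0, ignore_index=True)

print(toy_flat)
```

The result has one row per solution (5 rows here) and scalar entries in every column, which is what allows `describe()` to compute numerical statistics.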


## Duplicates Search and Outliers Detection

We then analyse the new dataset in search of duplicates and outliers. We drop the former and keep the latter, although we still want to identify them.

In [14]:
df_nodup = df_flat.drop_duplicates(ignore_index=True)
df_nodup.describe()

Unnamed: 0,system,init,weight,type,2,3,4,5,6,7,...,10,11,12,13,14,15,16,17,18,exp
count,718.0,718.0,718.0,718.0,718.0,718.0,718.0,718.0,718.0,718.0,...,718.0,718.0,718.0,718.0,718.0,718.0,718.0,718.0,718.0,718.0
mean,23.314763,0.836629,1.981686,3.749304,-1.649589,-1.81717,7.981259,8.445939,-34.541055,-36.169106,...,-752.288433,-776.523313,3061.210057,3147.716636,-11562.460857,-11855.25868,40629.386999,41566.669584,-133484.3,0.541783
std,13.023788,1.004161,2.337013,0.662688,4.548304,5.005473,21.558408,22.815235,111.03917,116.338079,...,2852.135166,2942.302665,11912.541567,12240.685674,45391.801105,46512.063723,160087.941813,163698.216993,527842.2,0.706857
min,0.0,0.0,0.0,2.0,-19.74404,-21.893983,-0.754568,-0.782633,-514.984097,-538.627792,...,-13321.170445,-13781.246472,-8.850113,-12.265769,-211473.396816,-216475.644423,-44.356923,-66.596211,-2489024.0,-1.0
25%,12.0,0.0,0.15534,4.0,-0.819357,-1.048764,0.0,0.0,-0.924887,-0.94195,...,-1.122517,-2.123912,0.001518,0.001679,-3.214377,-6.185564,0.045728,0.042314,-25.69285,0.0
50%,24.0,0.0,1.0001,4.0,0.0,0.0,0.923914,0.935052,0.0,0.0,...,0.001407,0.001712,0.987326,0.987066,0.001634,0.004821,1.004445,1.001084,0.09732395,1.0
75%,35.0,1.65,3.213367,4.0,0.795133,0.913984,1.408763,1.552803,0.960644,0.984305,...,0.991467,0.996329,4.397037,6.81182,0.997875,1.001248,16.098058,16.13199,1.003658,1.0
max,44.0,3.0,9.0,4.0,1.239384,1.358098,122.931347,131.67549,2.275741,2.712998,...,5.243298,6.283092,56115.100219,57592.69886,16.106978,23.077325,731718.33209,748286.961169,103.3588,1.0
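Note that `drop_duplicates` without arguments treats two rows as duplicates only when they agree in every column, the `system` label included. The sketch below, on hypothetical data, contrasts this with a comparison restricted to the remaining columns through the `subset` argument; it is an illustration, not a step of the analysis.

```python
import pandas as pd

# toy flattened dataframe (hypothetical data)
toy_flat = pd.DataFrame({'system': [0, 0, 0, 1],
                         'init':   [1.0, 1.0, 0.0, 1.0],
                         'exp':    [1, 1, 0, 1]})

# rows are duplicates only if they agree in every column, 'system' included
by_all = toy_flat.drop_duplicates(ignore_index=True)

# ignoring 'system' also merges identical solutions from different systems
columns = [c for c in toy_flat.columns if c != 'system']
by_values = toy_flat.drop_duplicates(subset=columns, ignore_index=True)

print(len(toy_flat), len(by_all), len(by_values))
```

In the toy example the first choice keeps 3 of the 4 rows, the second only 2.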


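We then look for outliers. Since we do not know in advance how many of them are in the dataset, [`sklearn.covariance.EllipticEnvelope`](https://scikit-learn.org/stable/modules/generated/sklearn.covariance.EllipticEnvelope.html), which takes the expected fraction of outliers (`contamination`) as a parameter, is not a good choice. We use instead the _interquartile range_ (IQR): a sample is flagged as an outlier when it lies more than 1.5 IQR below the first quartile or above the third quartile. We also use the _interdecile range_ (IDR), with the same factor applied to the first and ninth deciles, to select extreme outliers.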

In [15]:
def iqr_detection(feature):
    '''
    Compute the interquartile range of a given feature and return the indices of the outliers.
    
    Required arguments:
        feature: the list of data to analyse.
        
    Returns:
        the list of indices of outliers.
    '''
    
    q1, q3 = np.percentile(feature, [25, 75]) #-------------------------------- get values of 1st and 3rd quartile
    iqr    = q3 - q1 #--------------------------------------------------------- compute the interquartile range (IQR)
    
    lower  = q1 - (iqr * 1.5) #------------------------------------------------ lower bound
    upper  = q3 + (iqr * 1.5) #------------------------------------------------ higher bound
    
    return np.where((feature > upper) | (feature < lower))[0].tolist() #------- return indices out of bounds

def idr_detection(feature):
    '''
    Compute the interdecile range and return indices of points outside the limit.
    
    Required arguments:
        feature: the list of data to analyse.
        
    Returns:
        the list of indices of outliers.
    '''
    
    d1, d9 = np.percentile(feature, [10, 90]) #-------------------------------- get values of 1st and 9th decile
    idr    = d9 - d1 #--------------------------------------------------------- compute the interdecile range (IDR)

    lower  = d1 - (idr * 1.5) #------------------------------------------------ lower bound
    upper  = d9 + (idr * 1.5) #------------------------------------------------ higher bound
    
    return np.where((feature > upper) | (feature < lower))[0].tolist() #------- return indices out of bounds

As an example, we consider the highest truncation level:

In [16]:
feature = '18'

print('Number of outlying samples: {:d}'.format(len(iqr_detection(df_nodup[feature])))) #----------------- outliers
print('Number of extreme outlying samples: {:d}'.format(len(idr_detection(df_nodup[feature])))) #--------- extreme outliers

Number of outlying samples: 167
Number of extreme outlying samples: 66
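To make the IQR rule concrete, the sketch below applies the same arithmetic as `iqr_detection` to a small hypothetical sample containing one obvious outlier.

```python
import numpy as np

# hypothetical sample with one obvious outlier
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0])

q1, q3 = np.percentile(values, [25, 75])  # first and third quartile
iqr = q3 - q1                             # interquartile range

lower = q1 - 1.5 * iqr                    # lower bound
upper = q3 + 1.5 * iqr                    # upper bound

# indices of the samples outside the bounds
outliers = np.where((values < lower) | (values > upper))[0].tolist()

print(q1, q3, lower, upper, outliers)
```

Here the quartiles are 3 and 7, so the bounds are -3 and 13 and only the last sample (index 8) is flagged. The IDR rule replaces the quartiles with the 10th and 90th percentiles, which widens the bounds and therefore flags fewer samples, as in the counts above.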


## Saving the Dataset

We finally save the dataset for further analysis.

In [17]:
df_nodup.to_hdf(path.join(ROOT_DIR, 'data_sft_analysis.h5'), key='data_sft')

log.info('Saved dataset from preanalysis.')