# Understanding X-ray Absorption Spectra by Means of Descriptors and Machine Learning Algorithms
### A. A. Guda, S. A. Guda, A. Martini, A.N. Kravtsova, A. Algasov, A. Bugaev, S. P. Kubrin, L. V. Guda, P. Šot, J. A. van Bokhoven, C. Copéret, A. V. Soldatov

This notebook contains code that uses machine learning algorithms to relate intuitive descriptors of spectra (edge position; intensities, positions, and curvatures of minima and maxima) to parameters of the local atomic and electronic structure (coordination numbers, bond distances and angles, and oxidation state). This approach helps overcome the systematic difference between theoretical and experimental spectra. Furthermore, the numerical relations can be expressed as analytical formulas, which provide a simple and fast tool for extracting structural parameters from the spectral shape. The methodology was applied to experimental data for the multicomponent Fe:SiO2 system and reference iron compounds, demonstrating high prediction quality for both theoretical validation sets and experimental data.

### Current folder contains:
- exp subfolder with experimental spectra
- generated subfolder with data generated during calculations and used afterwards
- results subfolder
- samples subfolder with theoretical spectra database
- xyz subfolder with .xyz files used in molecule constructor and samples generation
- exp_true_values.csv file with known parameters for experimental spectra database
- instructions.docx file with install instructions
- paper_calculations.ipynb - this notebook
- Fe_project.py - project file with different settings

## Import [PyFitIt](http://hpc.nano.sfedu.ru/pyfitit)

In [1]:
!pip install jupytext
!pip install lmfit



In [2]:
!pip install "pandas<2.0.0"



In [3]:
# conda activate myenv
!git clone https://github.com/YannamYeswanth/ML-Approach-To-Xanes-Spectroscopy

fatal: destination path 'ML-Approach-To-Xanes-Spectroscopy' already exists and is not an empty directory.


In [4]:
%cd ML-Approach-To-Xanes-Spectroscopy

/content/ML-Approach-To-Xanes-Spectroscopy


In [5]:
!ls

exp		     Fe_project.py  paper_calculations.ipynb  pyfitit	 results  xyz
exp_true_values.csv  generated	    __pycache__		      README.md  samples


In [6]:
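from pyfitit import *
# os, copy and np are used in the cells below; import them explicitly
# instead of relying on the star import above
import copy
import os
import numpy as np
import pandas as pd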

resultFolder = 'results'

## Plot spectra of the calculated samples (Figure 4a)

In [7]:
for cn in range(2,7):
    sample = readSample(f'samples/sample_{cn}')
    proj = loadProject('Fe_project.py', CN=cn, valence=3)
    sample.spectra = smoothLib.smoothDataFrame(proj.FDMNES_smooth, sample.spectra, 'fdmnes',
                                               proj.spectrum, proj.intervals['fit_norm'],
                                               folder=sample.folder, norm=proj.FDMNES_smooth['norm'])
    sample.limit(energyRange=[7100, 7200], inplace=True)
    plotting.plotSample(sample.energy, sample.spectra, color_param=sample.params['r1'],
                        fileName=f'{resultFolder}/sample_adaptive_CN{cn}_wo_pre-edge.png',
                        title=f'Sample for cn={cn}', alpha=0.1)

In [8]:
!pip install jupytext ipykernel ipywidgets tqdm scipy numba cycler statsmodels lmfit matplotlib numpy pandas parsy notebook nbformat pulp scikit_learn seaborn




## Print information about structure parameters of the samples

### Relation between paper notation p1,p2,p3,... and sampled parameters:

- for cn=2: p1=r1, p2=r1+d12, p3=phi1, p4=psi
- for cn=3: p1=phi, p2=r1, p3=r2, p4=function(p1)
- for cn=4: p1=phi1, p2=phi2, p3=r1, p4=r2
- for cn=5: p1=r2, p2=r2+d2, p3=r1, p4=phi2, p5=phi1
- for cn=6: p1=r2, p2=r2+d2, p3=r1, p4=phi2, p5=phi1
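
As a minimal sketch of the cn=2 mapping listed above, the sampled columns `r1`, `d12`, `phi1` and `psi` can be converted to the paper notation with pandas. The helper name `to_paper_notation_cn2` and the two example rows are illustrative and not part of PyFitIt.

```python
import pandas as pd

def to_paper_notation_cn2(params: pd.DataFrame) -> pd.DataFrame:
    """Map sampled cn=2 parameters to the paper notation p1..p4."""
    return pd.DataFrame({
        'p1': params['r1'],
        'p2': params['r1'] + params['d12'],  # second distance = first distance + offset
        'p3': params['phi1'],
        'p4': params['psi'],
    })

# Two illustrative rows at the lower and upper ends of the sampled ranges
example = pd.DataFrame({'d12': [0.0, 0.2], 'phi1': [120.0, 180.0],
                        'psi': [0.0, 70.0], 'r1': [1.7, 2.3]})
paper = to_paper_notation_cn2(example)
print(paper)
```

The other coordination numbers would need their own mapping functions following the same pattern.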

In [9]:
cn = 2
sample = readSample(f'samples/sample_{cn}')
sample.params.describe()

| | d12 | phi1 | psi | r1 |
|---|---|---|---|---|
| count | 170.0 | 170.0 | 170.0 | 170.0 |
| mean | 0.096616 | 149.827244 | 34.150476 | 1.948982 |
| std | 0.061295 | 19.039872 | 22.033443 | 0.17401 |
| min | 0.0 | 120.0 | 0.0 | 1.7 |
| 25% | 0.045896 | 132.019581 | 14.317416 | 1.791991 |
| 50% | 0.096079 | 149.277988 | 34.143792 | 1.943622 |
| 75% | 0.149222 | 168.260434 | 55.229526 | 2.095672 |
| max | 0.2 | 180.0 | 70.0 | 2.3 |


In [10]:
sample.params

| | d12 | phi1 | psi | r1 |
|---|---|---|---|---|
| 0 | 0.000000 | 120.000000 | 0.000000 | 1.700000 |
| 1 | 0.200000 | 180.000000 | 70.000000 | 2.300000 |
| 2 | 0.100000 | 126.000000 | 63.000000 | 2.240000 |
| 3 | 0.180000 | 162.000000 | 49.000000 | 1.760000 |
| 4 | 0.060000 | 174.000000 | 7.000000 | 1.880000 |
| ... | ... | ... | ... | ... |
| 165 | 0.083073 | 121.544267 | 20.110502 | 1.894168 |
| 166 | 0.082792 | 129.964625 | 8.028477 | 1.742104 |
| 167 | 0.139504 | 125.141152 | 4.885180 | 1.735908 |
| 168 | 0.055254 | 129.347137 | 65.580398 | 1.948533 |


## Construct database from samples for different cn and calculate descriptors of structure

In [11]:
# calculate structure parameters: mean and standard deviation of distances to O atoms
def calcDist(params, mol, CN):
    dists = mol.getSortedDists('O')
    return [np.mean(dists[:CN]), np.std(dists[:CN])]

# read pre-calculated samples and replace the individual structure parameters with common ones: cn, valence, mean and std of distances to O atoms
# These samples were calculated by adaptive sampling.
# To get uniformly distributed geometry parameters, the samples can be converted to IHS by interpolation by setting convertToIHS=True
def loadSpectra(energyRange=None, sampleFolder='samples', convertToIHS=False):
    all_data = None
    for CN in range(2,7):
        for valence in [2,3]:
            data = readSample(sampleFolder+os.sep+f'sample_{CN}')
            projFile = 'Fe_project.py'
            project = loadProject(projFile, CN=CN, valence=valence)
            if convertToIHS:
                geometryParamRanges = project.geometryParamRanges
                for r in ['r1', 'r2']:
                    if r in geometryParamRanges: geometryParamRanges[r][0] = 1.8
                data = sampling.convertSampleTo(data, 'IHS', 500, geometryParamRanges, seed=0)
                data.saveToFolder(f'generated/samples/sample_{CN}_IHS_generated')
            n = len(data.params)
            oldParams = data.paramNames
            data.addParam(paramGenerator=lambda params, mol: calcDist(params, mol, CN), paramName=['FeODist','stdDist'], project=project)
            data.addParam(paramName='CN', paramData=np.ones(n)*CN)
            data.addParam(paramName='valence', paramData=np.ones(n)*valence)
            data.delParam(oldParams)
            data.spectra = smoothLib.smoothDataFrame(project.FDMNES_smooth, data.spectra, 'fdmnes', project.spectrum, project.intervals['fit_norm'], folder=data.folder, norm=project.FDMNES_smooth['norm'])
            assert np.all(data.spectra.values<15)
            data.addParam(paramName='name', paramData=np.array([f'cn{CN}v{valence}_{i}' for i in range(n)], dtype=object))
            if all_data is None: all_data = data
            else: all_data.unionWith(data)

    # take all experimental spectra from folder exp
    exp_spectra_names = []
    exp_files = os.listdir('exp')
    trueParams = pd.read_csv(f'exp_true_values.csv', sep=';')
    for f in exp_files:
        name = os.path.splitext(os.path.split(f)[-1])[0]
        exp_spectra_names.append(name)
        sp = readSpectrum(f'exp{os.sep}{f}')
        i = np.where(trueParams['name'].to_numpy() == name)[0]
        if len(i) > 0:
            assert len(i) == 1
            i = i[0]
            true = {col:trueParams.loc[i,col] for col in set(trueParams.columns) if col == 'name' or not np.isnan(trueParams.loc[i,col])}
        else: true = {'name': name}
        all_data.addRow(sp, true)

    all_data.limit(energyRange)
    return all_data, exp_spectra_names

if os.path.exists(f'generated/data_initial.pkl'):
    singleComponentData, exp_spectra_names = utils.load_pkl(f'generated/data_initial.pkl')
    singleComponentDataIHSgen, _ = utils.load_pkl(f'generated/data_initial_IHS_generated.pkl')
else:
    energyRange = [7100, 7200]
    singleComponentData, exp_spectra_names = loadSpectra(energyRange, convertToIHS=False)
    singleComponentDataIHSgen, _ = loadSpectra(energyRange, convertToIHS=True)
    utils.save_pkl((singleComponentData, exp_spectra_names), f'generated/data_initial.pkl')
    utils.save_pkl((singleComponentDataIHSgen, exp_spectra_names), f'generated/data_initial_IHS_generated.pkl')
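
The two structural descriptors `FeODist` and `stdDist` computed by `calcDist` are the mean and the standard deviation of the `CN` shortest Fe-O distances. A minimal NumPy sketch of that calculation follows; it takes a plain list of distances instead of a PyFitIt molecule, and the function name `mean_std_of_nearest` and the example distances are hypothetical.

```python
import numpy as np

def mean_std_of_nearest(dists, cn):
    """Mean and (population) standard deviation of the cn shortest distances."""
    nearest = np.sort(np.asarray(dists, dtype=float))[:cn]
    return float(np.mean(nearest)), float(np.std(nearest))

# Hypothetical Fe-O distances in angstroms: two close oxygens and one distant one
fe_o_dist, std_dist = mean_std_of_nearest([2.5, 3.9, 2.3], cn=2)
print(f'FeODist={fe_o_dist:.2f}, stdDist={std_dist:.2f}')
```

Only the two nearest oxygens enter the statistics here, so the distant atom at 3.9 Å does not affect the result.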


## Plot experimental spectra

In [12]:
singleComponentData.params


| | FeODist | stdDist | CN | valence | name |
|---|---|---|---|---|---|
| 0 | 1.70 | 1.570092e-16 | 2.0 | 2.00 | cn2v2_0 |
| 1 | 2.40 | 1.000000e-01 | 2.0 | 2.00 | cn2v2_1 |
| 2 | 2.29 | 5.000000e-02 | 2.0 | 2.00 | cn2v2_2 |
| 3 | 1.85 | 9.000000e-02 | 2.0 | 2.00 | cn2v2_3 |
| 4 | 1.91 | 3.000000e-02 | 2.0 | 2.00 | cn2v2_4 |
| ... | ... | ... | ... | ... | ... |
| 3179 | | | 5.0 | 3.00 | yoderite |
| 3180 | | | | 2.33 | Zhamanshinite-1 |
| 3181 | | | | | Zhamanshinite-3 |
| 3182 | | | | 2.84 | Zhamanshinite-4 |


In [13]:
print(exp_spectra_names)

['a-Fe2O3', 'akermanite', 'andradite', 'Aouelloul glass', 'Australite-1', 'Australite-2', 'Cambodianite', 'Darwin glass', 'Fe23O4', 'Fe2SiO4', 'Fe@SiO2_1', 'Fe@SiO2_2', 'Fe@SiO2_3', 'Fe@SiO2_4', 'Fe@SiO2_5', 'FeII fumarate', 'FeII siloxide', 'FeIII siloxide', 'g-Fe2O3', 'glass_0.01', 'glass_0.28', 'glass_0.54', 'glass_0.67', 'glass_0.87', 'grandidierite', 'h1', 'h10', 'h2', 'h3', 'h4', 'h5', 'h6', 'h7', 'h8', 'Indoshinite', 'Irghisite-1', 'Irghisite-2', 'Irghizite', 'kirschsteinite', 'Lybian desert glass', 'Moldavite-1', 'Moldavite-2', 'Muong-Nong-1', 'Muong-Nong-2', 'NaFeSi2O6', 'Ries suevite', 'siderite', 'staurolite', 'Wabar glass', 'yoderite', 'Zhamanshinite-1', 'Zhamanshinite-3', 'Zhamanshinite-4', 'Zhamanshinite-5']


In [14]:
toPlotNames = ['FeIII siloxide', 'Darwin glass', 'Zhamanshinite-3', 'akermanite', 'Fe@SiO2_1', 'kirschsteinite']
toPlot = tuple()
for name in toPlotNames:
    sp = readSpectrum(f'exp{os.sep}{name}.txt')
    toPlot += (sp.energy, sp.intensity, name)
plotting.plotToFile(*toPlot, fileName='results/exp_spectra.png', xlim=[7100, 7200])

## Calculate descriptors of spectra

In [15]:
def calc_descriptors(sample, usePcaPrebuildData, pcaPrebuildDataFile):
    efermi = {'type':'efermi', 'columnName': 'Edge'}
    stableMin = {'type': 'stableExtrema', 'extremaType': 'min', 'energyInterval': [7135,7190],
                 'plotFolderPrefix': 'generated/stable_extrema', 'columnName': 'Pit'}
    stableMax = copy.deepcopy(stableMin)
    stableMax['extremaType'] = 'max'
    stableMax['energyInterval'] = [7120,7150]
    stableMax['columnName'] = 'WL'
    relPcaPrebuildDataFile = os.path.split(pcaPrebuildDataFile)[0]+os.sep+'rel_'+os.path.split(pcaPrebuildDataFile)[-1]
    sample, goodSpectrumIndices = descriptor.addDescriptors(sample, [stableMin, stableMax, efermi,
         {'type':'pca', 'usePcaPrebuildData':usePcaPrebuildData, 'fileName':pcaPrebuildDataFile, 'columnName': 'PCA'},
         {'type':'rel_pca', 'usePcaPrebuildData':usePcaPrebuildData, 'fileName':relPcaPrebuildDataFile, 'columnName': 'rPCA'}])
    d = sample.params
    sample.addParam(paramName='Pit_e-WL_e', paramData=d['Pit_e'] - d['WL_e'])
    sample.addParam(paramName='WL-Pit_slope', paramData=(d['WL_i'] - d['Pit_i'])/(d['Pit_e'] - d['WL_e']))
    return sample, goodSpectrumIndices

# if the descriptors were already calculated, just load them
if os.path.exists(f'generated/data.pkl'):
    singleComponentData, _ = utils.load_pkl(f'generated/data.pkl')
    singleComponentDataIHSgen, _ = utils.load_pkl(f'generated/data_IHS_gen.pkl')
else:
    singleComponentData, _ = calc_descriptors(singleComponentData, usePcaPrebuildData=False, pcaPrebuildDataFile='generated/pca_data.pkl')
    singleComponentDataIHSgen, _ = calc_descriptors(singleComponentDataIHSgen, usePcaPrebuildData=False, pcaPrebuildDataFile='generated/pca_data_IHS_gen.pkl')
    utils.save_pkl((singleComponentData, exp_spectra_names), f'generated/data.pkl')
    utils.save_pkl((singleComponentDataIHSgen, exp_spectra_names), f'generated/data_IHS_gen.pkl')

# divide sample into two parts: known (theory) and unknown (experiments)
singleComponentData, exp_sample = singleComponentData.splitUnknown()
singleComponentDataIHSgen, _ = singleComponentDataIHSgen.splitUnknown()
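
To illustrate what the white-line (`WL`) and pit (`Pit`) descriptors and the two derived columns mean, here is a simplified NumPy sketch on a synthetic spectrum. It uses a plain argmax/argmin inside the same energy intervals as `calc_descriptors`; PyFitIt's `stableExtrema` descriptor is more involved, so this is an illustration of the definitions rather than a reimplementation. The helper `extremum_in_interval` and the synthetic spectrum are assumptions.

```python
import numpy as np

def extremum_in_interval(energy, intensity, interval, kind):
    """Return (energy, intensity) of the max or min inside an energy interval."""
    mask = (energy >= interval[0]) & (energy <= interval[1])
    pick = np.argmax if kind == 'max' else np.argmin
    idx = pick(intensity[mask])
    return energy[mask][idx], intensity[mask][idx]

# Synthetic spectrum: smooth edge step, a white-line peak near 7133 eV, a dip near 7160 eV
energy = np.linspace(7100, 7200, 1001)
step = 1 / (1 + np.exp(-(energy - 7120) / 2))
intensity = (step
             + 0.3 * np.exp(-((energy - 7133) / 6) ** 2)
             - 0.1 * np.exp(-((energy - 7160) / 8) ** 2))

wl_e, wl_i = extremum_in_interval(energy, intensity, [7120, 7150], 'max')
pit_e, pit_i = extremum_in_interval(energy, intensity, [7135, 7190], 'min')

# Derived descriptors, using the same formulas as in calc_descriptors
pit_minus_wl = pit_e - wl_e                      # 'Pit_e-WL_e'
wl_pit_slope = (wl_i - pit_i) / (pit_e - wl_e)   # 'WL-Pit_slope'
print(f'WL at {wl_e:.1f} eV, Pit at {pit_e:.1f} eV, slope {wl_pit_slope:.4f}')
```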

## Plot descriptors (Figure 5)

In [16]:
singleComponentData.params

| | FeODist | stdDist | CN | valence | name | Pit_e | Pit_i | Pit_d2 | WL_e | WL_i | ... | Edge_e | Edge_slope | PCA1 | PCA2 | PCA3 | rPCA1 | rPCA2 | rPCA3 | Pit_e-WL_e | WL-Pit_slope |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.700000 | 1.570092e-16 | 2.0 | 2.0 | cn2v2_0 | 7174.213 | 0.967816 | 0.000608 | 7151.813 | 1.062284 | ... | 7118.668741 | 0.256780 | -12.435555 | -0.058096 | -1.196916 | -12.421397 | -0.949698 | -1.034476 | 22.40 | 0.004217 |
| 1 | 2.400000 | 1.000000e-01 | 2.0 | 2.0 | cn2v2_1 | 7140.723 | 0.963677 | 0.000479 | 7127.823 | 0.994570 | ... | 7116.935878 | 0.996525 | -12.644639 | -1.564105 | -1.440974 | -12.646377 | 0.416936 | -0.802519 | 12.90 | 0.002395 |
| 2 | 2.290000 | 5.000000e-02 | 2.0 | 2.0 | cn2v2_2 | 7143.323 | 0.944085 | 0.001392 | 7128.323 | 1.027793 | ... | 7116.947962 | 1.046825 | -12.691263 | -1.699015 | -1.244123 | -12.704728 | 0.470324 | -0.664177 | 15.00 | 0.005581 |
| 3 | 1.850000 | 9.000000e-02 | 2.0 | 2.0 | cn2v2_3 | 7163.113 | 0.965425 | 0.000570 | 7137.263 | 1.070923 | ... | 7117.594873 | 0.500304 | -12.656435 | -0.569995 | -0.975222 | -12.591461 | -0.671262 | -0.636916 | 25.85 | 0.004081 |
| 4 | 1.910000 | 3.000000e-02 | 2.0 | 2.0 | cn2v2_4 | 7161.613 | 0.949821 | 0.000829 | 7133.833 | 1.054984 | ... | 7117.424328 | 0.613579 | -12.640514 | -0.838924 | -0.779109 | -12.580626 | -0.504029 | -0.414721 | 27.78 | 0.003786 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3125 | 1.832873 | 1.103374e-01 | 6.0 | 3.0 | cn6v3_329 | 7171.513 | 0.916502 | 0.001033 | 7141.583 | 1.201670 | ... | 7126.460153 | 0.372814 | -11.981932 | 2.182922 | -0.853656 | -12.912126 | -1.003195 | 0.462919 | 29.93 | 0.009528 |
| 3126 | 1.901619 | 2.805156e-02 | 6.0 | 3.0 | cn6v3_330 | 7167.003 | 0.856987 | 0.002131 | 7137.653 | 1.213377 | ... | 7126.017947 | 0.552063 | -12.140503 | 2.258461 | -0.006777 | -13.040728 | 0.039036 | 1.146553 | 29.35 | 0.012143 |
| 3127 | 2.097283 | 1.629862e-01 | 6.0 | 3.0 | cn6v3_331 | 7157.013 | 0.931806 | 0.000745 | 7135.043 | 1.138063 | ... | 7123.274360 | 0.687751 | -12.451801 | 0.718879 | 0.732433 | -12.890194 | 0.372966 | 0.230560 | 21.97 | 0.009388 |
| 3128 | 1.970500 | 7.075401e-02 | 6.0 | 3.0 | cn6v3_332 | 7162.103 | 0.879912 | 0.001385 | 7137.733 | 1.210162 | ... | 7124.217189 | 0.518175 | -12.386361 | 1.552315 | 0.269205 | -12.991053 | -0.231615 | 0.789254 | 24.37 | 0.013551 |


In [17]:
###### IITI ########
for combination in [['WL_d2', 'Pit_e'], ['Edge_e', 'WL_d2'], ['Edge_e', 'Pit_e'], ['rPCA2', 'rPCA3'], ['WL_e', 'Pit_e-WL_e'], ['PCA2', 'rPCA3']]:
    descriptor.plot_descriptors_2d_SVM_rbf(singleComponentData.params, combination, ['CN'], folder_prefix=f'{resultFolder}/scatter_plots',
                                   unknown=exp_sample.params, markersize=50, textsize=0, alpha=0.3, plot_only=True, doNotPlotRemoteCount=6)
for combination in [['WL_e', 'Pit_e-WL_e'], ['PCA2', 'rPCA3']]:
    descriptor.plot_descriptors_2d_SVM_rbf(singleComponentData.params, combination, ['valence'], folder_prefix=f'{resultFolder}/scatter_plots',
                                   unknown=exp_sample.params, markersize=50, textsize=0, alpha=0.3, plot_only=True, doNotPlotRemoteCount=6)

Try predict by: ['WL_d2', 'Pit_e']
Best model params:  {'C': 10} Delta =0.006070287539936103
CN - classification score: 0.65-0.65

Try predict by: ['Edge_e', 'WL_d2']
Best model params:  {'C': 10} Delta =0.051757188498402606
CN - classification score: 0.67-0.67

Try predict by: ['Edge_e', 'Pit_e']
Best model params:  {'C': 10} Delta =0.15367412140575076
CN - classification score: 0.68-0.68

Try predict by: ['rPCA2', 'rPCA3']
Best model params:  {'C': 10} Delta =0.021086261980830634
CN - classification score: 0.61-0.61

Try predict by: ['WL_e', 'Pit_e-WL_e']
Best model params:  {'C': 1} Delta =0.003833865814696469
CN - classification score: 0.33-0.33

Try predict by: ['PCA2', 'rPCA3']
Best model params:  {'C': 10} Delta =0.03099041533546326
CN - classification score: 0.49-0.49

Try predict by: ['WL_e', 'Pit_e-WL_e']
Best model params:  {'C': 10} Delta =0.00383386581469658
valence - classification score: 0.88-0.88

Try predict by: ['PCA2', 'rPCA3']
Best model params:  {'C': 10} Delta =0.012140575079872207
valence - classification score: 0.83-0.83
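
The output above reports a best `C` for each descriptor pair, which suggests a search over the SVM regularization strength. How `descriptor.plot_descriptors_2d_SVM_rbf` does this internally is not shown in this notebook, so the following is only a minimal scikit-learn sketch under assumptions: synthetic blobs stand in for the descriptor table, and the `C` grid, the scaling step and the 5-fold cross-validation are illustrative choices.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for two descriptors (columns of X) and a class label such as CN
X, y = make_blobs(n_samples=300, centers=3, cluster_std=2.0, random_state=0)

# Scale the descriptors, then fit an RBF-kernel SVM; search over C by cross-validation
model = make_pipeline(StandardScaler(), SVC(kernel='rbf'))
search = GridSearchCV(model, {'svc__C': [0.1, 1, 10]}, cv=5, scoring='accuracy')
search.fit(X, y)

best_C = search.best_params_['svc__C']
best_score = search.best_score_
print(f'Best C: {best_C}, cross-validated accuracy: {best_score:.2f}')
```

Scaling matters for RBF kernels because descriptors such as energies (thousands of eV) and curvatures (around 1e-3) differ by orders of magnitude.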


In [18]:
###### IITI ########
for combination in [['WL_d2', 'Pit_e'], ['Edge_e', 'WL_d2'], ['Edge_e', 'Pit_e'], ['rPCA2', 'rPCA3'], ['WL_e', 'Pit_e-WL_e'], ['PCA2', 'rPCA3']]:
    descriptor.plot_descriptors_2d_SVMl(singleComponentData.params, combination, ['CN'], folder_prefix=f'{resultFolder}/scatter_plots',
                                   unknown=exp_sample.params, markersize=50, textsize=0, alpha=0.3, plot_only=True, doNotPlotRemoteCount=6)
for combination in [['WL_e', 'Pit_e-WL_e'], ['PCA2', 'rPCA3']]:
    descriptor.plot_descriptors_2d_SVMl(singleComponentData.params, combination, ['valence'], folder_prefix=f'{resultFolder}/scatter_plots',
                                   unknown=exp_sample.params, markersize=50, textsize=0, alpha=0.3, plot_only=True, doNotPlotRemoteCount=6)

Try predict by: ['WL_d2', 'Pit_e']
Best model params:  {'C': 10} Delta =0.001916932907348179
CN - classification score: 0.60-0.60

Try predict by: ['Edge_e', 'WL_d2']
Best model params:  {'C': 1} Delta =0.00191693290734829
Try predict by: ['Edge_e', 'Pit_e']
Best model params:  {'C': 10} Delta =0.0012779552715655451
CN - classification score: 0.33-0.33

Try predict by: ['rPCA2', 'rPCA3']
Best model params:  {'C': 10} Delta =0.00031948881789140016
CN - classification score: 0.42-0.42

Try predict by: ['WL_e', 'Pit_e-WL_e']
Best model params:  {'C': 10} Delta =0.003514376996805124
CN - classification score: 0.28-0.28

Try predict by: ['PCA2', 'rPCA3']
Best model params:  {'C': 1} Delta =0.003833865814696469
CN - classification score: 0.39-0.39

Try predict by: ['WL_e', 'Pit_e-WL_e']
Best model params:  {'C': 10} Delta =0.0012779552715654896
Try predict by: ['PCA2', 'rPCA3']
Best model params:  {'C': 10} Delta =0.0012779552715654896
valence - classification score: 0.76-0.76
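
The linear-kernel scores reported above tend to be lower than those of the RBF-kernel run. One possible reason, sketched here on synthetic data rather than on the notebook's descriptors, is that the classes are not linearly separable in a two-dimensional descriptor plane. The dataset (`make_circles`) and the fixed `C=10` are assumptions chosen only to make the effect visible.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Two concentric rings: classes that no straight line can separate
X, y = make_circles(n_samples=400, noise=0.05, factor=0.5, random_state=0)

# Mean 5-fold cross-validated accuracy for each kernel
linear_acc = cross_val_score(SVC(kernel='linear', C=10), X, y, cv=5).mean()
rbf_acc = cross_val_score(SVC(kernel='rbf', C=10), X, y, cv=5).mean()
print(f'linear: {linear_acc:.2f}, rbf: {rbf_acc:.2f}')
```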


In [19]:
for combination in [['WL_d2', 'Pit_e'], ['Edge_e', 'WL_d2'], ['Edge_e', 'Pit_e'], ['rPCA2', 'rPCA3'], ['WL_e', 'Pit_e-WL_e'], ['PCA2', 'rPCA3']]:
    descriptor.plot_descriptors_2d(singleComponentData.params, combination, ['CN'], folder_prefix=f'{resultFolder}/scatter_plots',
                                   unknown=exp_sample.params, markersize=50, textsize=0, alpha=0.3, plot_only=True, doNotPlotRemoteCount=6)
for combination in [['WL_e', 'Pit_e-WL_e'], ['PCA2', 'rPCA3']]:
    descriptor.plot_descriptors_2d(singleComponentData.params, combination, ['valence'], folder_prefix=f'{resultFolder}/scatter_plots',
                                   unknown=exp_sample.params, markersize=50, textsize=0, alpha=0.3, plot_only=True, doNotPlotRemoteCount=6)

Try predict by: ['WL_d2', 'Pit_e']
Best model params:  {'n_estimators': 40, 'min_samples_leaf': 4} Delta =0.024281150159744413
CN - classification score: 0.64-0.64

Try predict by: ['Edge_e', 'WL_d2']
Best model params:  {'n_estimators': 40, 'min_samples_leaf': 4} Delta =0.02332268370607027
CN - classification score: 0.68-0.68

Try predict by: ['Edge_e', 'Pit_e']
Best model params:  {'n_estimators': 40, 'min_samples_leaf': 1} Delta =0.017252396166134165
CN - classification score: 0.69-0.69

Try predict by: ['rPCA2', 'rPCA3']
Best model params:  {'n_estimators': 40, 'min_samples_leaf': 4} Delta =0.029392971246006483
CN - classification score: 0.61-0.61

Try predict by: ['WL_e', 'Pit_e-WL_e']
Best model params:  {'n_estimators': 40, 'min_samples_leaf': 1} Delta =0.0019169329073482344
CN - classification score: 0.35-0.35

Try predict by: ['PCA2', 'rPCA3']
Best model params:  {'n_estimators': 40, 'min_samples_leaf': 4} Delta =0.042811501597444124
CN - classification score: 0.50-0.50

Try predict by: ['WL_e', 'Pit_e-WL_e']
Best model params:  {'n_estimators': 40, 'min_samples_leaf': 4} Delta =0.013418530351437696
valence - classification score: 0.87-0.87

Try predict by: ['PCA2', 'rPCA3']
Best model params:  {'n_estimators': 40, 'min_samples_leaf': 4} Delta =0.02460063897763587
valence - classification score: 0.82-0.82


## Find best descriptor subsets of size 1, 2, 3, 4 (Table 3)

In [20]:
# quality is estimated by cv_count-fold cross-validation, repeated m times
# quality = accuracy for classification, R2 score for regression
###### IITI ########
def getQuality(label_name, combination):
    print('Try to predict', label_name, 'by', combination)
    qualityResult = descriptor.getQuality_ANN4(singleComponentDataIHSgen.params, combination, [label_name], cv_count=10, m=5, printDebug=True)
    print(f"quality = {qualityResult[label_name]['quality']:.3f}±{qualityResult[label_name]['quality_std']:.3f}")

getQuality(label_name='CN', combination=['Pit_i'])
getQuality(label_name='CN', combination=['WL_i'])
getQuality(label_name='CN', combination=['Edge_e'])
getQuality(label_name='CN', combination=['WL_d2'])
getQuality(label_name='CN', combination=['PCA3'])
getQuality(label_name='CN', combination=['WL-Pit_slope'])

# getQuality(label_name='CN', combination=['Edge_slope'])
# getQuality(label_name='CN', combination=['Pit_e'])
# getQuality(label_name='CN', combination=['WL-Pit_slope'])
# getQuality(label_name='CN', combination=['rPCA3'])
# getQuality(label_name='CN', combination=['WL_e'])


getQuality(label_name='CN', combination=['Edge_e', 'WL_d2'])
getQuality(label_name='CN', combination=['Edge_e', 'Edge_slope'])
getQuality(label_name='CN', combination=['WL_d2', 'Pit_e'])

getQuality(label_name='CN', combination=['Edge_e', 'WL_e', 'rPCA3'])
getQuality(label_name='CN', combination=['Edge_e', 'Edge_slope', 'WL_d2'])

getQuality(label_name='CN', combination=['Edge_e', 'Edge_slope', 'WL_e', 'Pit_e'])

# ===================================================

# getQuality(label_name='FeODist', combination=['Pit_e'])
getQuality(label_name='FeODist', combination=['Pit_e-WL_e'])
getQuality(label_name='FeODist', combination=['Edge_slope'])
getQuality(label_name='FeODist', combination=['WL_e'])
getQuality(label_name='FeODist', combination=['rPCA2'])
getQuality(label_name='FeODist', combination=['PCA3'])

getQuality(label_name='FeODist', combination=['rPCA2', 'rPCA3'])
getQuality(label_name='FeODist', combination=['WL_d2', 'rPCA3'])
getQuality(label_name='FeODist', combination=['Edge_e', 'Pit_e'])

getQuality(label_name='FeODist', combination=['PCA2', 'PCA3', 'rPCA3'])
getQuality(label_name='FeODist', combination=['Edge_e', 'WL_e', 'Pit_e'])

getQuality(label_name='FeODist', combination=['Edge_e', 'WL_e', 'Pit_d2', 'Pit_e'])

# =====================================================================

getQuality(label_name='stdDist', combination=['Pit_i'])
getQuality(label_name='stdDist', combination=['Pit_d2'])
getQuality(label_name='stdDist', combination=['Edge_e'])
getQuality(label_name='stdDist', combination=['rPCA3'])
getQuality(label_name='stdDist', combination=['rPCA2'])
getQuality(label_name='stdDist', combination=['PCA3'])

getQuality(label_name='stdDist', combination=['WL_d2', 'Pit_i'])
getQuality(label_name='stdDist', combination=['Pit_i', 'rPCA2'])
getQuality(label_name='stdDist', combination=['PCA3', 'Pit_i'])

getQuality(label_name='stdDist', combination=['Edge_e', 'WL_d2', 'Pit_i'])
getQuality(label_name='stdDist', combination=['PCA2', 'PCA3', 'Pit_i'])

getQuality(label_name='stdDist', combination=['Edge_e', 'WL_d2', 'WL_i', 'Pit_i'])

# =====================================================================

getQuality(label_name='valence', combination=['PCA1'])
getQuality(label_name='valence', combination=['Edge_e'])
getQuality(label_name='valence', combination=['WL_e'])
getQuality(label_name='valence', combination=['PCA2'])
getQuality(label_name='valence', combination=['rPCA1'])
getQuality(label_name='valence', combination=['PCA3'])

getQuality(label_name='valence', combination=['PCA1', 'rPCA3'])
getQuality(label_name='valence', combination=['Edge_e', 'WL_i'])
getQuality(label_name='valence', combination=['WL_e', 'Pit_e'])

getQuality(label_name='valence', combination=['Edge_e', 'WL_d2', 'Pit_e-WL_e'])
getQuality(label_name='valence', combination=['PCA2', 'rPCA2', 'rPCA3'])

getQuality(label_name='valence', combination=['Edge_e', 'WL_e', 'Pit_e-WL_e', 'Pit_i'])

Try to predict CN by ['Pit_i']


TypeError: Model.fit() got an unexpected keyword argument 'nb_epoch'
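
The `TypeError` above arises because `nb_epoch` is the Keras 1 name of the argument that Keras 2 and later call `epochs`. The failing call sits inside `descriptor.getQuality_ANN4`, whose source is not shown in this notebook, so the most direct fix is presumably to rename the keyword there. As a minimal sketch of that rename, assuming the keyword arguments are available as a dict, the hypothetical helper `modernize_fit_kwargs` translates the legacy name; `DummyModel` merely stands in for a Keras model.

```python
def modernize_fit_kwargs(kwargs):
    """Return a copy of kwargs with the legacy Keras 1 'nb_epoch' renamed to 'epochs'."""
    updated = dict(kwargs)
    if 'nb_epoch' in updated:
        # Keep an explicit 'epochs' value if both names were given
        updated.setdefault('epochs', updated['nb_epoch'])
        del updated['nb_epoch']
    return updated

class DummyModel:
    """Stand-in for a Keras model: fit() accepts 'epochs' but not 'nb_epoch'."""
    def fit(self, X, y, epochs=1, batch_size=32):
        return {'epochs': epochs, 'batch_size': batch_size}

legacy_kwargs = {'nb_epoch': 50, 'batch_size': 16}
history = DummyModel().fit([[0.0]], [0], **modernize_fit_kwargs(legacy_kwargs))
print(history)
```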

In [None]:
# quality is estimated by cv_count-fold cross-validation, repeated m times
# quality = accuracy for classification, R2 score for regression
###### IITI ########
def getQuality(label_name, combination):
    print('Try to predict', label_name, 'by', combination)
    qualityResult = descriptor.getQuality_ANN3(singleComponentDataIHSgen.params, combination, [label_name], cv_count=10, m=5, printDebug=True)
    print(f"quality = {qualityResult[label_name]['quality']:.3f}±{qualityResult[label_name]['quality_std']:.3f}")

getQuality(label_name='CN', combination=['Pit_i'])
getQuality(label_name='CN', combination=['WL_i'])
getQuality(label_name='CN', combination=['Edge_e'])
getQuality(label_name='CN', combination=['WL_d2'])
getQuality(label_name='CN', combination=['PCA3'])
getQuality(label_name='CN', combination=['WL-Pit_slope'])

# getQuality(label_name='CN', combination=['Edge_slope'])
# getQuality(label_name='CN', combination=['Pit_e'])
# getQuality(label_name='CN', combination=['WL-Pit_slope'])
# getQuality(label_name='CN', combination=['rPCA3'])
# getQuality(label_name='CN', combination=['WL_e'])


getQuality(label_name='CN', combination=['Edge_e', 'WL_d2'])
getQuality(label_name='CN', combination=['Edge_e', 'Edge_slope'])
getQuality(label_name='CN', combination=['WL_d2', 'Pit_e'])

getQuality(label_name='CN', combination=['Edge_e', 'WL_e', 'rPCA3'])
getQuality(label_name='CN', combination=['Edge_e', 'Edge_slope', 'WL_d2'])

getQuality(label_name='CN', combination=['Edge_e', 'Edge_slope', 'WL_e', 'Pit_e'])

# ===================================================

# getQuality(label_name='FeODist', combination=['Pit_e'])
getQuality(label_name='FeODist', combination=['Pit_e-WL_e'])
getQuality(label_name='FeODist', combination=['Edge_slope'])
getQuality(label_name='FeODist', combination=['WL_e'])
getQuality(label_name='FeODist', combination=['rPCA2'])
getQuality(label_name='FeODist', combination=['PCA3'])

getQuality(label_name='FeODist', combination=['rPCA2', 'rPCA3'])
getQuality(label_name='FeODist', combination=['WL_d2', 'rPCA3'])
getQuality(label_name='FeODist', combination=['Edge_e', 'Pit_e'])

getQuality(label_name='FeODist', combination=['PCA2', 'PCA3', 'rPCA3'])
getQuality(label_name='FeODist', combination=['Edge_e', 'WL_e', 'Pit_e'])

getQuality(label_name='FeODist', combination=['Edge_e', 'WL_e', 'Pit_d2', 'Pit_e'])

# =====================================================================

getQuality(label_name='stdDist', combination=['Pit_i'])
getQuality(label_name='stdDist', combination=['Pit_d2'])
getQuality(label_name='stdDist', combination=['Edge_e'])
getQuality(label_name='stdDist', combination=['rPCA3'])
getQuality(label_name='stdDist', combination=['rPCA2'])
getQuality(label_name='stdDist', combination=['PCA3'])

getQuality(label_name='stdDist', combination=['WL_d2', 'Pit_i'])
getQuality(label_name='stdDist', combination=['Pit_i', 'rPCA2'])
getQuality(label_name='stdDist', combination=['PCA3', 'Pit_i'])

getQuality(label_name='stdDist', combination=['Edge_e', 'WL_d2', 'Pit_i'])
getQuality(label_name='stdDist', combination=['PCA2', 'PCA3', 'Pit_i'])

getQuality(label_name='stdDist', combination=['Edge_e', 'WL_d2', 'WL_i', 'Pit_i'])

# =====================================================================

getQuality(label_name='valence', combination=['PCA1'])
getQuality(label_name='valence', combination=['Edge_e'])
getQuality(label_name='valence', combination=['WL_e'])
getQuality(label_name='valence', combination=['PCA2'])
getQuality(label_name='valence', combination=['rPCA1'])
getQuality(label_name='valence', combination=['PCA3'])

getQuality(label_name='valence', combination=['PCA1', 'rPCA3'])
getQuality(label_name='valence', combination=['Edge_e', 'WL_i'])
getQuality(label_name='valence', combination=['WL_e', 'Pit_e'])

getQuality(label_name='valence', combination=['Edge_e', 'WL_d2', 'Pit_e-WL_e'])
getQuality(label_name='valence', combination=['PCA2', 'rPCA2', 'rPCA3'])

getQuality(label_name='valence', combination=['Edge_e', 'WL_e', 'Pit_e-WL_e', 'Pit_i'])
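
The cells in this section evaluate hand-picked descriptor combinations with repeated cross-validation. As a minimal sketch of how the best subset of each size could be found automatically, the code below enumerates all subsets with `itertools.combinations` and scores each one with scikit-learn's `RepeatedKFold`. Everything here is an assumption for illustration: the function `best_subsets`, the synthetic table, and `LinearRegression` as a fast stand-in for the models behind `descriptor.getQuality_*`.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

def best_subsets(data, descriptor_names, label, sizes=(1, 2), cv_count=5, m=2):
    """For each subset size, return (subset, mean R2, std R2) of the best subset."""
    cv = RepeatedKFold(n_splits=cv_count, n_repeats=m, random_state=0)
    best = {}
    for k in sizes:
        for subset in combinations(descriptor_names, k):
            scores = cross_val_score(LinearRegression(), data[list(subset)],
                                     data[label], cv=cv, scoring='r2')
            if k not in best or scores.mean() > best[k][1]:
                best[k] = (subset, scores.mean(), scores.std())
    return best

# Synthetic table: the label depends on d1 and d2 only; d3 is pure noise
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(200, 3)), columns=['d1', 'd2', 'd3'])
data['label'] = 2.0 * data['d1'] - 1.0 * data['d2'] + 0.1 * rng.normal(size=200)

result = best_subsets(data, ['d1', 'd2', 'd3'], 'label')
for k, (subset, mean, std) in result.items():
    print(f'size {k}: {subset} quality = {mean:.3f}±{std:.3f}')
```

Exhaustive enumeration grows combinatorially with the number of descriptors, so for the roughly twenty descriptors in this notebook it is only practical for small subset sizes.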

In [None]:
# quality is estimated by cv_count-fold cross-validation, repeated m times
# quality = accuracy for classification, R2 score for regression
###### IITI ########
def getQuality(label_name, combination):
    print('Try to predict', label_name, 'by', combination)
    qualityResult = descriptor.getQuality_SVM_rbf(singleComponentDataIHSgen.params, combination, [label_name], cv_count=10, m=5, printDebug=True)
    print(f"quality = {qualityResult[label_name]['quality']:.3f}±{qualityResult[label_name]['quality_std']:.3f}")

getQuality(label_name='CN', combination=['Pit_i'])
getQuality(label_name='CN', combination=['WL_i'])
getQuality(label_name='CN', combination=['Edge_e'])
getQuality(label_name='CN', combination=['WL_d2'])
getQuality(label_name='CN', combination=['PCA3'])
getQuality(label_name='CN', combination=['WL-Pit_slope'])

# getQuality(label_name='CN', combination=['Edge_slope'])
# getQuality(label_name='CN', combination=['Pit_e'])
# getQuality(label_name='CN', combination=['WL-Pit_slope'])
# getQuality(label_name='CN', combination=['rPCA3'])
# getQuality(label_name='CN', combination=['WL_e'])


getQuality(label_name='CN', combination=['Edge_e', 'WL_d2'])
getQuality(label_name='CN', combination=['Edge_e', 'Edge_slope'])
getQuality(label_name='CN', combination=['WL_d2', 'Pit_e'])

getQuality(label_name='CN', combination=['Edge_e', 'WL_e', 'rPCA3'])
getQuality(label_name='CN', combination=['Edge_e', 'Edge_slope', 'WL_d2'])

getQuality(label_name='CN', combination=['Edge_e', 'Edge_slope', 'WL_e', 'Pit_e'])

# ===================================================

# getQuality(label_name='FeODist', combination=['Pit_e'])
getQuality(label_name='FeODist', combination=['Pit_e-WL_e'])
getQuality(label_name='FeODist', combination=['Edge_slope'])
getQuality(label_name='FeODist', combination=['WL_e'])
getQuality(label_name='FeODist', combination=['rPCA2'])
getQuality(label_name='FeODist', combination=['PCA3'])

getQuality(label_name='FeODist', combination=['rPCA2', 'rPCA3'])
getQuality(label_name='FeODist', combination=['WL_d2', 'rPCA3'])
getQuality(label_name='FeODist', combination=['Edge_e', 'Pit_e'])

getQuality(label_name='FeODist', combination=['PCA2', 'PCA3', 'rPCA3'])
getQuality(label_name='FeODist', combination=['Edge_e', 'WL_e', 'Pit_e'])

getQuality(label_name='FeODist', combination=['Edge_e', 'WL_e', 'Pit_d2', 'Pit_e'])

# =====================================================================

getQuality(label_name='stdDist', combination=['Pit_i'])
getQuality(label_name='stdDist', combination=['Pit_d2'])
getQuality(label_name='stdDist', combination=['Edge_e'])
getQuality(label_name='stdDist', combination=['rPCA3'])
getQuality(label_name='stdDist', combination=['rPCA2'])
getQuality(label_name='stdDist', combination=['PCA3'])

getQuality(label_name='stdDist', combination=['WL_d2', 'Pit_i'])
getQuality(label_name='stdDist', combination=['Pit_i', 'rPCA2'])
getQuality(label_name='stdDist', combination=['PCA3', 'Pit_i'])

getQuality(label_name='stdDist', combination=['Edge_e', 'WL_d2', 'Pit_i'])
getQuality(label_name='stdDist', combination=['PCA2', 'PCA3', 'Pit_i'])

getQuality(label_name='stdDist', combination=['Edge_e', 'WL_d2', 'WL_i', 'Pit_i'])

# =====================================================================

getQuality(label_name='valence', combination=['PCA1'])
getQuality(label_name='valence', combination=['Edge_e'])
getQuality(label_name='valence', combination=['WL_e'])
getQuality(label_name='valence', combination=['PCA2'])
getQuality(label_name='valence', combination=['rPCA1'])
getQuality(label_name='valence', combination=['PCA3'])

getQuality(label_name='valence', combination=['PCA1', 'rPCA3'])
getQuality(label_name='valence', combination=['Edge_e', 'WL_i'])
getQuality(label_name='valence', combination=['WL_e', 'Pit_e'])

getQuality(label_name='valence', combination=['Edge_e', 'WL_d2', 'Pit_e-WL_e'])
getQuality(label_name='valence', combination=['PCA2', 'rPCA2', 'rPCA3'])

getQuality(label_name='valence', combination=['Edge_e', 'WL_e', 'Pit_e-WL_e', 'Pit_i'])

In [None]:
# Quality is estimated by cv_count-fold cross-validation, repeated m times.
# quality = accuracy for classification, R2 score for regression
def getQuality(label_name, combination):
    print('Try to predict', label_name, 'by', combination)
    qualityResult = descriptor.getQuality(singleComponentDataIHSgen.params, combination, [label_name], cv_count=10, m=5, printDebug=True)
    print(f"quality = {qualityResult[label_name]['quality']:.3f}±{qualityResult[label_name]['quality_std']:.3f}")

getQuality(label_name='CN', combination=['Pit_i'])
getQuality(label_name='CN', combination=['WL_i'])
getQuality(label_name='CN', combination=['Edge_e'])
getQuality(label_name='CN', combination=['WL_d2'])
getQuality(label_name='CN', combination=['PCA3'])
getQuality(label_name='CN', combination=['WL-Pit_slope'])

# getQuality(label_name='CN', combination=['Edge_slope'])
# getQuality(label_name='CN', combination=['Pit_e'])
# getQuality(label_name='CN', combination=['WL-Pit_slope'])
# getQuality(label_name='CN', combination=['rPCA3'])
# getQuality(label_name='CN', combination=['WL_e'])


getQuality(label_name='CN', combination=['Edge_e', 'WL_d2'])
getQuality(label_name='CN', combination=['Edge_e', 'Edge_slope'])
getQuality(label_name='CN', combination=['WL_d2', 'Pit_e'])

getQuality(label_name='CN', combination=['Edge_e', 'WL_e', 'rPCA3'])
getQuality(label_name='CN', combination=['Edge_e', 'Edge_slope', 'WL_d2'])

getQuality(label_name='CN', combination=['Edge_e', 'Edge_slope', 'WL_e', 'Pit_e'])

# ===================================================

# getQuality(label_name='FeODist', combination=['Pit_e'])
getQuality(label_name='FeODist', combination=['Pit_e-WL_e'])
getQuality(label_name='FeODist', combination=['Edge_slope'])
getQuality(label_name='FeODist', combination=['WL_e'])
getQuality(label_name='FeODist', combination=['rPCA2'])
getQuality(label_name='FeODist', combination=['PCA3'])

getQuality(label_name='FeODist', combination=['rPCA2', 'rPCA3'])
getQuality(label_name='FeODist', combination=['WL_d2', 'rPCA3'])
getQuality(label_name='FeODist', combination=['Edge_e', 'Pit_e'])

getQuality(label_name='FeODist', combination=['PCA2', 'PCA3', 'rPCA3'])
getQuality(label_name='FeODist', combination=['Edge_e', 'WL_e', 'Pit_e'])

getQuality(label_name='FeODist', combination=['Edge_e', 'WL_e', 'Pit_d2', 'Pit_e'])

# =====================================================================

getQuality(label_name='stdDist', combination=['Pit_i'])
getQuality(label_name='stdDist', combination=['Pit_d2'])
getQuality(label_name='stdDist', combination=['Edge_e'])
getQuality(label_name='stdDist', combination=['rPCA3'])
getQuality(label_name='stdDist', combination=['rPCA2'])
getQuality(label_name='stdDist', combination=['PCA3'])

getQuality(label_name='stdDist', combination=['WL_d2', 'Pit_i'])
getQuality(label_name='stdDist', combination=['Pit_i', 'rPCA2'])
getQuality(label_name='stdDist', combination=['PCA3', 'Pit_i'])

getQuality(label_name='stdDist', combination=['Edge_e', 'WL_d2', 'Pit_i'])
getQuality(label_name='stdDist', combination=['PCA2', 'PCA3', 'Pit_i'])

getQuality(label_name='stdDist', combination=['Edge_e', 'WL_d2', 'WL_i', 'Pit_i'])

# =====================================================================

getQuality(label_name='valence', combination=['PCA1'])
getQuality(label_name='valence', combination=['Edge_e'])
getQuality(label_name='valence', combination=['WL_e'])
getQuality(label_name='valence', combination=['PCA2'])
getQuality(label_name='valence', combination=['rPCA1'])
getQuality(label_name='valence', combination=['PCA3'])

getQuality(label_name='valence', combination=['PCA1', 'rPCA3'])
getQuality(label_name='valence', combination=['Edge_e', 'WL_i'])
getQuality(label_name='valence', combination=['WL_e', 'Pit_e'])

getQuality(label_name='valence', combination=['Edge_e', 'WL_d2', 'Pit_e-WL_e'])
getQuality(label_name='valence', combination=['PCA2', 'rPCA2', 'rPCA3'])

getQuality(label_name='valence', combination=['Edge_e', 'WL_e', 'Pit_e-WL_e', 'Pit_i'])
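
The comments in the cells above describe the quality estimate as cv_count-fold cross-validation repeated m times, scored by R2 for regression. The following is a minimal sketch of that procedure in plain Python. It is not PyFitIt's implementation: the least-squares line stands in for the actual model, the data are synthetic, and the helper names (`r2_score`, `fit_line`, `repeated_cv_quality`) as well as the choice to score pooled out-of-fold predictions are assumptions made for illustration.

```python
import random
from statistics import mean, pstdev

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_mean = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def fit_line(x, y):
    """Least-squares line y = a*x + b (a stand-in for the real ML model)."""
    x_mean, y_mean = mean(x), mean(y)
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    a = sxy / sxx
    return a, y_mean - a * x_mean

def repeated_cv_quality(x, y, cv_count=10, m=5, seed=0):
    """Return (mean, std) of R2 over m repetitions of cv_count-fold CV."""
    rng = random.Random(seed)
    scores = []
    for _ in range(m):
        idx = list(range(len(x)))
        rng.shuffle(idx)                      # new random split on every repetition
        folds = [idx[k::cv_count] for k in range(cv_count)]
        y_pred = [0.0] * len(x)
        for test_idx in folds:
            test_set = set(test_idx)
            train = [i for i in idx if i not in test_set]
            a, b = fit_line([x[i] for i in train], [y[i] for i in train])
            for i in test_idx:                # predict only the held-out points
                y_pred[i] = a * x[i] + b
        scores.append(r2_score(y, y_pred))    # assumption: score pooled predictions
    return mean(scores), pstdev(scores)

# Synthetic data: one descriptor, one label, noisy linear relation
data_rng = random.Random(1)
x = [data_rng.uniform(0.0, 1.0) for _ in range(200)]
y = [2.0 * xi + 1.0 + data_rng.gauss(0.0, 0.05) for xi in x]
quality, quality_std = repeated_cv_quality(x, y)
print(f"quality = {quality:.3f}+/-{quality_std:.3f}")
```

Repeating the split m times with different shuffles gives a spread of scores, which is what the standard deviation printed next to the quality reflects.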

## Analytical relations between descriptors of spectra and descriptors of structure (Table 4)

In [None]:
label_names = ['CN', 'FeODist', 'stdDist', 'valence']
features = sorted(list(set(singleComponentDataIHSgen.paramNames) - set(label_names) - {'PCA1','PCA2','PCA3','rPCA1','rPCA2','rPCA3','name'}))
descriptor.getAnalyticFormulasForGivenFeatures(singleComponentDataIHSgen.params, features, label_names, l1_ratio=1,
                                               output_file=f'{resultFolder}/analytic_labels_by_features.txt')
descriptor.getAnalyticFormulasForGivenFeatures(singleComponentDataIHSgen.params, sorted(list(set(label_names) - {'valence'})), features,
                                               output_file=f'{resultFolder}/analytic_features_by_labels.txt')
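
To illustrate what an analytical relation between descriptors and a structural label can look like, here is a hedged sketch that fits a linear formula by ordinary least squares and prints it as text. How `getAnalyticFormulasForGivenFeatures` builds its formulas is not shown in this notebook (the `l1_ratio` argument suggests a regularized fit), so plain least squares is a swapped-in technique. The function names, the feature names `feat_a` and `feat_b`, and the data are hypothetical.

```python
import random

def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):                   # back substitution
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def fit_linear_formula(rows, target, feature_names, label_name):
    """Least-squares fit of label = c0 + sum(c_j * feature_j).

    Returns the coefficients and a human-readable formula string.
    """
    X = [[1.0] + list(row) for row in rows]        # prepend intercept column
    p = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    Xty = [sum(r[i] * t for r, t in zip(X, target)) for i in range(p)]
    coefs = solve_linear_system(XtX, Xty)          # normal equations
    terms = [f"{coefs[0]:.3f}"]
    terms += [f"{c:+.3f}*{name}" for c, name in zip(coefs[1:], feature_names)]
    return coefs, f"{label_name} = " + " ".join(terms)

# Synthetic, noise-free data with hypothetical feature names
data_rng = random.Random(0)
rows = [(data_rng.uniform(0.0, 1.0), data_rng.uniform(0.0, 1.0)) for _ in range(100)]
target = [1.5 + 2.0 * a - 0.5 * b for a, b in rows]
coefs, formula = fit_linear_formula(rows, target, ['feat_a', 'feat_b'], 'label')
print(formula)
```

A formula of this kind can be evaluated by hand from measured descriptor values, which is the practical appeal of analytical relations over a black-box model.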

## Building sample of mixtures

In [None]:
# Do not rebuild the samples if saved copies exist
if os.path.exists('generated/mix_data.pkl') and os.path.exists('generated/mix_data_IHS_generated.pkl'):
    mixtureData = utils.load_pkl('generated/mix_data.pkl')
    mixtureDataIHSgen = utils.load_pkl('generated/mix_data_IHS_generated.pkl')
else:
    # generate mixtures of random spectra from singleComponentData with random concentrations
    # target features (labels) are calculated as weighted average
    label_names = ['CN', 'FeODist', 'stdDist', 'valence']
    mixtureData = mixture.generateMixtureOfSample(size=5000, componentCount=2, sample=singleComponentData,
        label_names=label_names, addDescrFunc=lambda sample: calc_descriptors(sample,True,'generated/pca_data.pkl'),
        randomSeed=1, componentNameColumn='name')
    mixtureDataIHSgen = mixture.generateMixtureOfSample(size=5000, componentCount=2, sample=singleComponentDataIHSgen,
        label_names=label_names, addDescrFunc=lambda sample: calc_descriptors(sample,True,'generated/pca_data_IHS_gen.pkl'),
        randomSeed=1, componentNameColumn='name')
    utils.save_pkl(mixtureData, f'generated/mix_data.pkl')
    utils.save_pkl(mixtureDataIHSgen, f'generated/mix_data_IHS_generated.pkl')
    mixtureData.params.to_csv(f'generated/mix_data.csv', index=False, sep=' ')
    mixtureDataIHSgen.params.to_csv(f'generated/mix_data_IHS_generated.csv', index=False, sep=' ')
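
The comments in the cell above state that mixtures are built from randomly chosen spectra with random concentrations and that the labels of a mixture are weighted averages of the component labels. A minimal sketch of that idea follows. It is not the code of `mixture.generateMixtureOfSample`: the function name `generate_mixtures`, the way concentrations are drawn (uniform random numbers normalized to sum to 1), and the toy spectra and labels are assumptions for illustration only.

```python
import random

def generate_mixtures(spectra, labels, size, component_count=2, seed=1):
    """Mix randomly chosen spectra with random concentrations summing to 1.

    spectra: list of intensity lists on a common energy grid
    labels:  list of dicts with the structural labels of each spectrum
    """
    rng = random.Random(seed)
    n_points = len(spectra[0])
    mixtures = []
    for _ in range(size):
        chosen = rng.sample(range(len(spectra)), component_count)
        raw = [rng.random() for _ in chosen]
        total = sum(raw)
        conc = [r / total for r in raw]            # assumption: normalized uniforms
        # The mixture spectrum is the concentration-weighted sum of components
        mixed_spectrum = [sum(c * spectra[i][k] for c, i in zip(conc, chosen))
                          for k in range(n_points)]
        # The labels are averaged with the same weights
        mixed_labels = {name: sum(c * labels[i][name] for c, i in zip(conc, chosen))
                        for name in labels[0]}
        mixtures.append({'components': chosen, 'concentrations': conc,
                         'spectrum': mixed_spectrum, 'labels': mixed_labels})
    return mixtures

# Toy values, not taken from the notebook's database
spectra = [[0.0, 0.5, 1.2, 1.0],
           [0.0, 0.3, 1.5, 1.0],
           [0.1, 0.6, 1.1, 0.9]]
labels = [{'CN': 4, 'valence': 2},
          {'CN': 6, 'valence': 3},
          {'CN': 5, 'valence': 3}]
mixtures = generate_mixtures(spectra, labels, size=5)
for mix in mixtures:
    print(mix['components'], [round(c, 2) for c in mix['concentrations']], mix['labels'])
```

Because the labels are averaged linearly, a mixture's CN or valence is generally non-integer, so these targets are treated as continuous quantities.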

## Scatter plots of descriptors calculated for mixtures (Figure 8)

In [None]:
###### IITI ########
descriptor.plot_descriptors_2d_SVM_rbf(mixtureData.params, ['WL_d2', 'Pit_e'], ['CN'], folder_prefix=f'{resultFolder}/scatter_mix_plots',
                               unknown=exp_sample.params, markersize=50, textsize=0, alpha=0.3, plot_only=True, doNotPlotRemoteCount=0)
descriptor.plot_descriptors_2d_SVM_rbf(mixtureData.params, ['WL_d2', 'Pit_e'], ['FeODist'], folder_prefix=f'{resultFolder}/scatter_mix_plots',
                               unknown=exp_sample.params, markersize=50, textsize=0, alpha=0.3, plot_only=True, doNotPlotRemoteCount=0)
descriptor.plot_descriptors_2d_SVM_rbf(mixtureData.params, ['WL_e', 'Pit_e-WL_e'], ['valence'], folder_prefix=f'{resultFolder}/scatter_mix_plots',
                               unknown=exp_sample.params, markersize=50, textsize=0, alpha=0.3, plot_only=True, doNotPlotRemoteCount=0)

In [None]:
descriptor.plot_descriptors_2d(mixtureData.params, ['WL_d2', 'Pit_e'], ['CN'], folder_prefix=f'{resultFolder}/scatter_mix_plots',
                               unknown=exp_sample.params, markersize=50, textsize=0, alpha=0.3, plot_only=True, doNotPlotRemoteCount=0)
descriptor.plot_descriptors_2d(mixtureData.params, ['WL_d2', 'Pit_e'], ['FeODist'], folder_prefix=f'{resultFolder}/scatter_mix_plots',
                               unknown=exp_sample.params, markersize=50, textsize=0, alpha=0.3, plot_only=True, doNotPlotRemoteCount=0)
descriptor.plot_descriptors_2d(mixtureData.params, ['WL_e', 'Pit_e-WL_e'], ['valence'], folder_prefix=f'{resultFolder}/scatter_mix_plots',
                               unknown=exp_sample.params, markersize=50, textsize=0, alpha=0.3, plot_only=True, doNotPlotRemoteCount=0)

## Comparison between cross-validation quality calculated for selected combinations of descriptors (Tables 5, 6 and Figure 9)

In [None]:
!pip install openpyxl

In [None]:
# Table 5
table = 5
descriptor.descriptor_quality(mixtureData.params, ['CN'], ['Edge_e', 'Pit_e-WL_e', 'WL_d2'], feature_subset_size=3, cv_parts_count=10, cv_repeat=3,
                              unknown_data=exp_sample.params, textColumn='name', folder=f'{resultFolder}/table_{table}_CN')
descriptor.descriptor_quality(mixtureData.params, ['FeODist'], ['Edge_e', 'WL_e', 'Pit_e'], feature_subset_size=3, cv_parts_count=10, cv_repeat=3,
                              unknown_data=exp_sample.params, textColumn='name', folder=f'{resultFolder}/table_{table}_FeODist')
descriptor.descriptor_quality(mixtureData.params, ['valence'], ['Edge_e', 'Pit_e', 'WL_i'], feature_subset_size=3, cv_parts_count=10, cv_repeat=3,
                              unknown_data=exp_sample.params, textColumn='name', folder=f'{resultFolder}/table_{table}_valence')

# Table 6 and Figure 9
table = 6
descriptor.descriptor_quality(mixtureData.params, ['CN'], ['WL_e', 'Pit_e', 'rPCA2'], feature_subset_size=3, cv_parts_count=10, cv_repeat=3,
                              unknown_data=exp_sample.params, textColumn='name', folder=f'{resultFolder}/table_{table}_CN')
descriptor.descriptor_quality(mixtureData.params, ['FeODist'], ['WL_i', 'Pit_i', 'rPCA2'], feature_subset_size=3, cv_parts_count=10, cv_repeat=3,
                              unknown_data=exp_sample.params, textColumn='name', folder=f'{resultFolder}/table_{table}_FeODist')
descriptor.descriptor_quality(mixtureData.params, ['valence'], ['Edge_e', 'WL_e', 'PCA3'], feature_subset_size=3, cv_parts_count=10, cv_repeat=3,
                              unknown_data=exp_sample.params, textColumn='name', folder=f'{resultFolder}/table_{table}_valence')
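
Comparing combinations of descriptors amounts to scoring each candidate feature subset with the same validation procedure and ranking the results. The sketch below shows this with `itertools.combinations` on synthetic data. It is an illustration under stated assumptions, not the internals of `descriptor.descriptor_quality`: the leave-one-out 1-nearest-neighbour scorer is a swapped-in toy model, and the names `loo_nearest_neighbor_r2`, `rank_feature_subsets`, `f1`, `f2`, `f3` and `label` are hypothetical.

```python
import random
from itertools import combinations

def loo_nearest_neighbor_r2(rows, target):
    """Leave-one-out R2 of a 1-nearest-neighbour regressor (toy scoring model)."""
    preds = []
    for i, row in enumerate(rows):
        nearest = min((j for j in range(len(rows)) if j != i),
                      key=lambda j: sum((a - b) ** 2 for a, b in zip(rows[j], row)))
        preds.append(target[nearest])
    t_mean = sum(target) / len(target)
    ss_res = sum((t - p) ** 2 for t, p in zip(target, preds))
    ss_tot = sum((t - t_mean) ** 2 for t in target)
    return 1.0 - ss_res / ss_tot

def rank_feature_subsets(table, feature_names, label_name, subset_size):
    """Score every feature subset of the given size; return (score, subset), best first."""
    target = [rec[label_name] for rec in table]
    results = []
    for subset in combinations(feature_names, subset_size):
        rows = [[rec[name] for name in subset] for rec in table]
        results.append((loo_nearest_neighbor_r2(rows, target), subset))
    return sorted(results, reverse=True)

# Synthetic table: the label depends on f1 and f2 only; f3 is pure noise
data_rng = random.Random(0)
table = []
for _ in range(150):
    rec = {'f1': data_rng.random(), 'f2': data_rng.random(), 'f3': data_rng.random()}
    rec['label'] = rec['f1'] + 2.0 * rec['f2']
    table.append(rec)

ranking = rank_feature_subsets(table, ['f1', 'f2', 'f3'], 'label', subset_size=2)
for score, subset in ranking:
    print(f"{subset}: R2 = {score:.3f}")
```

The subset containing both informative features ranks first, while subsets that include the noise feature score markedly lower.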

## Relative-to-constant-prediction error calculated for the training set of 600 spectra with CN=6 (Table S1)

In [None]:
v = 3
cn = 6
method = 'RBF'  # options: 'Ridge', 'Ridge Quadric', 'Extra Trees', 'RBF', 'LightGBM'
proj = loadProject('Fe_project.py', CN=cn, valence=v)
# this function builds an ML model that predicts spectra from structure parameters
estimator = constructInverseEstimator(method, proj, proj.FDMNES_smooth, CVcount=10,
                                      folderToSaveCVresult=f'{resultFolder}/{method}_v{v}_cn{cn}_CV')
sample = readSample('samples/sample_' + str(cn))
estimator.fit(sample)