# Training models using data from Citrination

In [1]:
from citrination_client import CitrinationClient
from saxskit.saxs_models import get_data_from_Citrination
from saxskit.saxs_models import train_classifiers, train_regressors
from time import time

import warnings
warnings.filterwarnings("ignore")

In [2]:
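# Read the Citrination API key from a local text file and strip the trailing newline
with open("../../citrination_api_key_ssrl.txt", "r") as g:
    a_key = g.readline().strip()

# Client used for all requests to the SLAC Citrination site
cl = CitrinationClient(site='https://slac.citrination.com', api_key=a_key)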

SAXSKIT provides two options for training:
* training from scratch
* updating existing models using additional data

"Training from scratch" is useful for initial training or when we have a lot of new data (more than 30%). In this case it is recommended to use "hyper_parameters_search = True".

Updating existing models is recommended when we have some new data (less than 30%). Updating existing models takes significantly less time than "training from scratch".
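
The 30% rule of thumb can be written as a small helper. This is only a sketch: `choose_training_mode` is a hypothetical function, not part of saxskit, and it assumes that the percentage is measured relative to the amount of data the existing models were trained on, which the text above does not specify.

```python
def choose_training_mode(n_existing, n_new, threshold=0.30):
    """Suggest 'scratch' or 'update' from the share of new samples.

    Assumption: the share is n_new / n_existing. A share of exactly
    `threshold` is treated as 'update', which is also an assumption.
    """
    if n_existing == 0:
        # Nothing was trained before, so this is the initial training.
        return "scratch"
    share_of_new = n_new / n_existing
    return "scratch" if share_of_new > threshold else "update"


print(choose_training_mode(n_existing=1000, n_new=100))
print(choose_training_mode(n_existing=1000, n_new=500))
```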



## Training from "scratch"

Let's assume that initially we have only two datasets: 1 and 15. We want to use them to create the models.

#### Step 1. Get data from Citrination

In [3]:
data = get_data_from_Citrination(client=cl, dataset_id_list=[1, 15])  # [1, 15] is a list of dataset ids

**data** is a pandas data frame that contains:
* experiment_id - It is used for grouping when creating cross-validation folds during training. Samples from the same experiment are often very similar, so we should avoid having samples from the same experiment in both the training and validation sets
* 20 features: 'Imax_over_Imean', 'Imax_sharpness', 'I_fluctuation',
       'logI_fluctuation', 'logI_max_over_std', 'r_fftIcentroid', 'r_fftImax',
       'q_Icentroid', 'q_logIcentroid', 'pearson_q', 'pearson_q2',
       'pearson_expq', 'pearson_invexpq', 'I0_over_Imean', 'I0_curvature',
       'q_at_half_I0', 'q_at_Iq4_min1', 'pIq4_qwidth', 'pI_qvertex',
       'pI_qwidth'
* 4 True / False labels: 'unidentified', 'guinier_porod', 'spherical_normal',
       'diffraction_peaks'. If a sample has 'unidentified = True', it also has "False" for all other labels
* 10 continuous labels: 'I0_floor', 'G_gp', 'rg_gp', 'D_gp', 'I0_sphere',
       'r0_sphere', 'sigma_sphere', 'I_pkcenter', 'q_pkcenter', 'pk_hwhm'. Some samples have "None" for some of these labels. For example, only samples with 'spherical_normal = True' have a value for 'sigma_sphere'
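
The two rules about labels can be checked on plain Python rows, which could be obtained from the data frame with `data.to_dict("records")`. The sketch below is illustrative only: the helper names and the toy rows are made up, and it assumes a missing label may appear as either `None` or NaN.

```python
import math

# Names of the True / False labels, taken from the list above.
CLASS_LABELS = ["unidentified", "guinier_porod", "spherical_normal", "diffraction_peaks"]


def is_missing(value):
    """True for None or NaN (assumed encodings of an absent label)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def labels_are_consistent(row):
    """Check the rule: 'unidentified = True' implies all other labels are False."""
    if row["unidentified"]:
        return not any(row[name] for name in CLASS_LABELS if name != "unidentified")
    return True


def rows_for_target(rows, flag, target):
    """Keep rows usable for a regression target: flag is True and target is present."""
    return [row for row in rows if row[flag] and not is_missing(row[target])]


# Toy rows with made-up values, only to exercise the helpers.
rows = [
    {"experiment_id": "A", "unidentified": False, "guinier_porod": False,
     "spherical_normal": True, "diffraction_peaks": False, "sigma_sphere": 0.05},
    {"experiment_id": "B", "unidentified": True, "guinier_porod": False,
     "spherical_normal": False, "diffraction_peaks": False, "sigma_sphere": None},
    {"experiment_id": "C", "unidentified": False, "guinier_porod": True,
     "spherical_normal": False, "diffraction_peaks": False, "sigma_sphere": float("nan")},
]

usable = rows_for_target(rows, "spherical_normal", "sigma_sphere")
print(all(labels_are_consistent(row) for row in rows))
print([row["experiment_id"] for row in usable])
```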

In [4]:
data.head()

| | experiment_id | Imax_over_Imean | Imax_sharpness | I_fluctuation | logI_fluctuation | logI_max_over_std | r_fftIcentroid | r_fftImax | q_Icentroid | q_logIcentroid | ... | I0_floor | G_gp | rg_gp | D_gp | I0_sphere | r0_sphere | sigma_sphere | I_pkcenter | q_pkcenter | pk_hwhm |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 191 | R1 | 18.6673 | 1.03119 | 0.00109872 | 3.59799 | 3.02239 | 0.107509 | 0.00185529 | 0.0653252 | -0.239971 | ... | 0.240559 | | | | 1454.74 | 33.6663 | 0.0521611 | | | |
| 125 | R1 | 18.7091 | 1.03054 | 0.00109695 | 4.94637 | 3.04098 | 0.107607 | 0.00185529 | 0.0651664 | -0.216147 | ... | 0.235522 | | | | 1573.42 | 33.7184 | 0.0453875 | | | |
| 817 | R5 | 18.6484 | 1.03019 | 0.00108724 | 4.96878 | 3.13777 | 0.107417 | 0.00185529 | 0.065241 | -0.0914662 | ... | 0.276159 | | | | 1969.04 | 33.916 | 0.0413327 | | | |
| 1185 | R12 | 42.4029 | 2.15578 | 0.0022293 | 5.55918 | 3.34202 | 0.0663934 | 0.00171821 | 0.106922 | 1.26546 | ... | 0.0 | | | | | | | | | |
| 647 | R4 | 75.0283 | 2.95092 | 0.00235101 | 51.0107 | 3.98897 | 0.0885655 | 0.00185529 | 0.0924768 | -1.70073 | ... | 0.0 | | | | | | | | | |


#### Step 2. Train Classifiers

In [5]:
t0 = time()
train_classifiers(data, hyper_parameters_search=True)
# Scalers and models will be saved in 'saxskit/saxskit/modeling_data/scalers_and_models.yml'
# Accuracy will be saved in 'saxskit/saxskit/modeling_data/accuracy.txt'
# We can pass yaml_filename='file_name.yml' as an additional parameter to save the scalers and models in that file
print("Training took about", (time()-t0)/60, " minutes.")

Training took about 0.9615498185157776  minutes.


In [7]:
with open("../saxskit/modeling_data/accuracy.txt", "r") as g:
    accuracy = g.readline()
    
accuracy

"{'unidentified': 0.98467737860489402, 'spherical_normal': 0.97380536424454611, 'guinier_porod': 0.81333902893165577, 'diffraction_peaks': 0.97396326849628756}"

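The line read from the file looks like the printed form of a Python dictionary, so it can be turned back into a dictionary for easier inspection. The sketch below assumes that format and uses the string shown in the output above; `ast.literal_eval` is used because it only evaluates literals.

```python
import ast

# The line read from accuracy.txt, copied from the output above.
accuracy = ("{'unidentified': 0.98467737860489402, "
            "'spherical_normal': 0.97380536424454611, "
            "'guinier_porod': 0.81333902893165577, "
            "'diffraction_peaks': 0.97396326849628756}")

# Parse the dictionary literal safely (no arbitrary code is executed).
scores = ast.literal_eval(accuracy.strip())

# List the labels from the lowest to the highest accuracy.
for label, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{label:>18}: {score:.3f}")

# Label whose classifier has the lowest reported accuracy.
weakest = min(scores, key=scores.get)
print("Lowest accuracy:", weakest)
```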
Since data from the same experiment is often highly correlated, the "Leave N Group Out" technique is used to calculate accuracy. Data from two experiments is excluded from training and used as the testing set. For example, if we have experiments 1, 2, 3, 4, and 5:
* train the model on 1, 2, 3; test on 4, 5
* train the model on 1, 2, 5; test on 3, 4
* try all combinations...
* calculate average accuracy
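
The enumeration of splits described above can be sketched with the standard library. This is an illustration of the idea, not saxskit's internal code; `leave_n_groups_out` is a hypothetical helper. scikit-learn provides `LeavePGroupsOut` for the same pattern, but whether saxskit uses it is not stated here.

```python
from itertools import combinations


def leave_n_groups_out(groups, n_out=2):
    """Yield (train_groups, test_groups) for every way of holding out n_out groups."""
    unique = sorted(set(groups))
    for test in combinations(unique, n_out):
        train = [g for g in unique if g not in test]
        yield train, list(test)


# Five experiments with two held out each time.
splits = list(leave_n_groups_out([1, 2, 3, 4, 5], n_out=2))
for train, test in splits:
    print("train on", train, "; test on", test)
print(len(splits), "splits in total")
```

The reported accuracy would then be the mean of the scores obtained on the test part of each split.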

#### Step 3. Train Regression models

In [8]:
t0 = time()
train_regressors(data, hyper_parameters_search=True)
# Scalers and models will be saved in 'saxskit/saxskit/modeling_data/scalers_and_models_regression.yml'
# Accuracy will be saved in 'saxskit/saxskit/modeling_data/accuracy_regression.txt'
# We can pass yaml_filename='file_name.yml' as an additional parameter to save the scalers and models in that file
print("Training took about", (time()-t0)/60, " minutes.")

Training took about 16.191103851795198  minutes.


In [10]:
with open("../saxskit/modeling_data/accuracy_regression.txt", "r") as g:
    accuracy = g.readline()
    
accuracy

"{'r0_sphere': 0.29207195697128335, 'sigma_sphere': 0.61194005642492144, 'rg_gp': 0.27678231141481924}"

For the regression models, the "Leave N Group Out" technique is also used. The accuracy is calculated as the mean absolute error divided by the standard deviation, so lower values indicate a better model.
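
A minimal sketch of this metric is shown below. It assumes the standard deviation is the population standard deviation of the true values in the test set; the text does not say which standard deviation saxskit uses, and `normalized_mae` is a hypothetical name.

```python
from statistics import pstdev


def normalized_mae(y_true, y_pred):
    """Mean absolute error divided by the standard deviation of y_true.

    Assumption: population standard deviation of the true values.
    """
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
    return mae / pstdev(y_true)


# Made-up values, only to show the call.
y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
y_pred = [1.5, 2.5, 2.5, 4.5, 4.5]
print(normalized_mae(y_true, y_pred))
```

A value near 1 means the typical error is about as large as the natural spread of the label, while a value near 0 means the predictions are close to the true values.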