# Training models using data from Citrination.

To train the models using data from https://slac.citrination.com, **the user must have**:
* access to a SLAC Citrination account
* an api_key

In [1]:
from citrination_client import CitrinationClient
from saxskit.saxskit.saxs_models import get_data_from_Citrination
from saxskit.saxskit.saxs_models import train_classifiers, train_regressors
from time import time

import warnings
warnings.filterwarnings("ignore")

In [2]:
# Read the API key from the first line of the key file
with open("citrination_api_key_ssrl.txt", "r") as g:
    a_key = g.readline().strip()

cl = CitrinationClient(site='https://slac.citrination.com', api_key=a_key)

SAXSKIT provides two options for training:
* training from scratch
* updating existing models using additional data

Training from scratch is useful for initial training or when we have a lot of new data (more than 30%). It is recommended to use "hyper_parameters_search = True".

Updating existing models is recommended when we have some new data (less than 30%). Updating existing models takes significantly less time than training from scratch.
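
As a rough illustration of the 30% rule of thumb, the choice between the two options can be written as a small helper. This is only a sketch: `choose_training_mode` is a hypothetical function, not part of saxskit, and it assumes the percentage is measured as the number of new samples relative to the number of samples the existing models were trained on. The text does not say what to do at exactly 30%, so the sketch arbitrarily treats that case as an update.

```python
def choose_training_mode(n_existing, n_new, threshold=0.30):
    """Suggest 'scratch' or 'update' from sample counts (hypothetical helper).

    Assumption: the fraction of new data is n_new / n_existing.
    """
    if n_existing <= 0:
        # No previously used data, so there are no models to update
        return "scratch"
    fraction_new = n_new / n_existing
    return "scratch" if fraction_new > threshold else "update"


print(choose_training_mode(1000, 400))  # 40% new data
print(choose_training_mode(1000, 100))  # 10% new data
```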



## Training from "scratch"

Let's assume that initially we have only two datasets: 1 and 15. We want to use them to create the models.

#### Step 1. Get data from Citrination

In [3]:
t0 = time()
data = get_data_from_Citrination(client=cl, dataset_id_list=[1, 15])  # [1, 15] is a list of dataset ids
print("Getting data took about", (time()-t0)/60, " minutes.")

Getting data took about 1.8684906045595804  minutes.


**data** is a pandas data frame that contains:
* experiment_id - It is used to group samples when creating cross-validation folds during training. Samples from the same experiment are often very similar, so we should avoid having samples from the same experiment in both the training and validation sets.
* 20 features: 'Imax_over_Imean', 'Imax_sharpness', 'I_fluctuation',
       'logI_fluctuation', 'logI_max_over_std', 'r_fftIcentroid', 'r_fftImax',
       'q_Icentroid', 'q_logIcentroid', 'pearson_q', 'pearson_q2',
       'pearson_expq', 'pearson_invexpq', 'I0_over_Imean', 'I0_curvature',
       'q_at_half_I0', 'q_at_Iq4_min1', 'pIq4_qwidth', 'pI_qvertex',
       'pI_qwidth'
* 4 True / False labels: 'unidentified', 'guinier_porod', 'spherical_normal',
       'diffraction_peaks'. If a sample has 'unidentified = True', it also has "False" for all other labels.
* 10 continuous labels: 'I0_floor', 'G_gp', 'rg_gp', 'D_gp', 'I0_sphere',
       'r0_sphere', 'sigma_sphere', 'I_pkcenter', 'q_pkcenter', 'pk_hwhm'. Some samples have "None" for some of these labels. For example, only samples with 'spherical_normal = True' have a value for 'sigma_sphere'.
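
The rule for the True / False labels can be checked before training. The sketch below is an illustration only: `labels_are_consistent` is a hypothetical helper, not a saxskit function, and it works on plain dictionaries so that it runs without any data from Citrination.

```python
# The three labels that must all be False when 'unidentified' is True
SHAPE_LABELS = ("guinier_porod", "spherical_normal", "diffraction_peaks")


def labels_are_consistent(row):
    """Return True if 'unidentified = True' comes with False for all other labels."""
    if row["unidentified"]:
        return not any(row[name] for name in SHAPE_LABELS)
    return True


# Made-up example rows; the last one violates the rule
rows = [
    {"unidentified": True, "guinier_porod": False,
     "spherical_normal": False, "diffraction_peaks": False},
    {"unidentified": False, "guinier_porod": True,
     "spherical_normal": True, "diffraction_peaks": False},
    {"unidentified": True, "guinier_porod": False,
     "spherical_normal": True, "diffraction_peaks": False},
]

bad_rows = [i for i, row in enumerate(rows) if not labels_are_consistent(row)]
print(bad_rows)
```

With the pandas data frame, the same function could probably be applied row by row with `data.apply(labels_are_consistent, axis=1)`, assuming the label columns hold booleans.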

In [4]:
data.head()

| Unnamed: 0 | experiment_id | Imax_over_Imean | Imax_sharpness | I_fluctuation | logI_fluctuation | logI_max_over_std | r_fftIcentroid | r_fftImax | q_Icentroid | q_logIcentroid | ... | I0_floor | G_gp | rg_gp | D_gp | I0_sphere | r0_sphere | sigma_sphere | I_pkcenter | q_pkcenter | pk_hwhm |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 810 | R5 | 18.7303 | 1.03298 | 0.00109113 | 5.27887 | 3.30742 | 0.10774 | 0.00185529 | 0.0658953 | 0.0119629 | ... | 0.434391 | | | | 1978.14 | 33.8903 | 0.0402607 | | | |
| 1454 | R13 | 17.4753 | 1.0282 | 0.00102252 | 3.09034 | 2.75816 | 0.10391 | 0.00171821 | 0.0664384 | -1.86571 | ... | 0.128828 | | | | 1042.52 | 32.8963 | 0.0431634 | | | |
| 361 | R2 | 43.8719 | 1.49365 | 0.00216948 | 6.51334 | 3.55511 | 0.0503532 | 0.00185529 | 0.071455 | -0.404089 | ... | 0.0 | | | | | | | | | |
| 1052 | R7 | 53.3159 | 1.91001 | 0.0022556 | 7.60686 | 3.8161 | 0.0572842 | 0.00185529 | 0.0818391 | -0.285929 | ... | 0.0 | | | | | | | | | |
| 1204 | R12 | 39.2003 | 2.03524 | 0.00225863 | 11.0873 | 3.54075 | 0.0636123 | 0.00171821 | 0.110507 | 2.23699 | ... | 0.0 | | | | | | | | | |


#### Step 2. Train Classifiers

In [5]:
t0 = time()
train_classifiers(data,  hyper_parameters_search = True)
# scalers and models will be saved in 'saxskit/saxskit/modeling_data/scalers_and_models.yml'
# accuracy will be saved in 'saxskit/saxskit/modeling_data/accuracy.txt'
# We can pass yaml_filename='file_name.yml' as an additional parameter to save the scalers and models in that file
print("Training took about", (time()-t0)/60, " minutes.")

Training took about 0.8094613353411356  minutes.


In [6]:
with open("saxskit/saxskit/modeling_data/accuracy.txt", "r") as g:
    accuracy = g.readline()
    
accuracy

"{'unidentified': 0.98475775909557894, 'spherical_normal': 0.97246868669787134, 'guinier_porod': 0.8033718515852486, 'diffraction_peaks': 0.97284273694727297}"

The "Leave N Group Out" technique is used to calculate accuracy. Data from two experiments is excluded from training and used as the testing set. For example, if we have experiments 1, 2, 3, 4, and 5:
* train the model on 1, 2, 3; test on 4, 5
* train the model on 1, 2, 5; test on 3, 4
* try all combinations...
* calculate average accuracy
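
The splitting scheme above can be sketched in plain Python. This is an illustration of the idea, not the saxskit implementation: `leave_n_groups_out` is a hypothetical helper and the experiment ids are made up. scikit-learn offers the same kind of splitter as `LeavePGroupsOut`, but whether saxskit uses it internally is not stated here.

```python
from itertools import combinations


def leave_n_groups_out(groups, n_out=2):
    """Yield (train_indices, test_indices, test_groups) for every way of
    holding out n_out distinct groups (hypothetical helper)."""
    unique_groups = sorted(set(groups))
    for test_groups in combinations(unique_groups, n_out):
        held_out = set(test_groups)
        train_idx = [i for i, g in enumerate(groups) if g not in held_out]
        test_idx = [i for i, g in enumerate(groups) if g in held_out]
        yield train_idx, test_idx, test_groups


# One experiment id per sample (made-up data with five experiments)
experiment_ids = [1, 1, 2, 3, 3, 4, 5]

splits = list(leave_n_groups_out(experiment_ids, n_out=2))
print(len(splits))  # 5 experiments, 2 held out: 10 combinations

for train_idx, test_idx, test_groups in splits[:3]:
    print("test on experiments", test_groups, "-> test samples", test_idx)
```

A model would be trained and scored once per split, and the reported accuracy is the average of those scores. Because whole experiments are held out, no experiment contributes samples to both the training and the testing side of a split.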

#### Step 3. Train Regression models

In [7]:
t0 = time()
train_regressors(data,  hyper_parameters_search = True)
# scalers and models will be saved in 'saxskit/saxskit/modeling_data/scalers_and_models_regression.yml'
# accuracy will be saved in 'saxskit/saxskit/modeling_data/accuracy_regression.txt'
# We can pass yaml_filename='file_name.yml' as an additional parameter to save the scalers and models in that file
print("Training took about", (time()-t0)/60, " minutes.")

Training took about 14.804892500241598  minutes.


In [8]:
with open("saxskit/saxskit/modeling_data/accuracy_regression.txt", "r") as g:
    accuracy = g.readline()
    
accuracy

"{'r0_sphere': 0.29176775356808121, 'sigma_sphere': 0.6121171720411267, 'rg_gp': 0.27909422401837675}"

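The accuracy files shown above hold a single line that looks like a Python dict literal. Assuming that format, the line can be turned back into a dictionary with `ast.literal_eval` from the standard library, which makes the individual scores easier to inspect. The string below is copied from the output above so that the sketch runs without the file.

```python
import ast

# Line as it appears in accuracy_regression.txt (copied from the output above)
accuracy_line = ("{'r0_sphere': 0.29176775356808121, "
                 "'sigma_sphere': 0.6121171720411267, "
                 "'rg_gp': 0.27909422401837675}")

# literal_eval only accepts Python literals, so it is safer than eval()
scores = ast.literal_eval(accuracy_line.strip())

for label in sorted(scores):
    print(f"{label:>12}: {scores[label]:.3f}")
```
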
For the regression models, the "Leave N Group Out" technique is also used. The accuracy is calculated as the mean absolute error divided by the standard deviation, so lower values indicate better models.
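
A minimal sketch of this metric follows. It rests on two assumptions that the text does not confirm: the standard deviation is taken over the true label values, and it is the population standard deviation. `normalized_mae` is a hypothetical name, not a saxskit function, and the numbers are made up.

```python
from statistics import pstdev


def normalized_mae(y_true, y_pred):
    """Mean absolute error divided by the standard deviation of y_true
    (assumption: population standard deviation of the true values)."""
    abs_errors = [abs(t - p) for t, p in zip(y_true, y_pred)]
    mae = sum(abs_errors) / len(abs_errors)
    return mae / pstdev(y_true)


y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
y_pred = [1.5, 2.5, 2.5, 4.5, 4.5]  # every prediction is off by 0.5

score = normalized_mae(y_true, y_pred)
print(round(score, 4))  # 0.5 / sqrt(2)
```

Under these assumptions, a model that always predicts the mean of the true values scores at most 1, so values well below 1 suggest the model has learned something beyond the average.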