# `Linear Chain CRF` experimentation environment

In this environment the *feature lists* are built from all of the available linguistic information: letters, POS tags, and positional context.

### General parameters

* Maximum iterations = 50
* K = 3 (cross-validation folds)

### Parameters per model

* `linearCRF_reg.crfsuite`
    * l1 = 0.1
    * l2 = 0.001
* `linearCRF_noreg.crfsuite`
    * l1 = 0
    * l2 = 0
* `linearCRF_l1_zero.crfsuite`
    * l1 = 0
    * l2 = 0.001
* `linearCRF_l2_zero.crfsuite`
    * l1 = 0.1
    * l2 = 0
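
The four configurations listed above can be written down as a single mapping. This is a minimal sketch: `VARIANTS` and `build_params` are hypothetical names introduced here, while the keys `L1`, `L2` and `max-iter` are the ones read by the parameter cells of this notebook.

```python
# Hypothetical helper: one entry per model variant listed above.
MAX_ITER = 50

VARIANTS = {
    "reg": {"L1": 0.1, "L2": 1e-3},      # L1 and L2 both active
    "noreg": {"L1": 0.0, "L2": 0.0},     # no regularization
    "l1_zero": {"L1": 0.0, "L2": 1e-3},  # L2 only
    "l2_zero": {"L1": 0.1, "L2": 0.0},   # L1 only
}


def build_params(variant, max_iter=MAX_ITER):
    """Return a fresh hyperparameter dict for the given variant."""
    return {**VARIANTS[variant], "max-iter": max_iter}


for name in VARIANTS:
    print(name, build_params(name))
```

Returning a fresh dict per call avoids carrying the `path` entry of one fold or variant over into the next.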



## Importing Python libraries

In [1]:
import os
import sys  
import random
import time
import pycrfsuite
import numpy as np
from sklearn.model_selection import KFold
from utils import (get_corpus, WordsToLetter, accuracy_score, model_trainer,
                   model_tester, write_report, eval_labeled_positions,
                   bio_classification_report)

## Helper functions

In [2]:
def get_feature_lists(sent):
    ''' Rules that build the feature lists used for training

    :param sent: One sentence as a list of words, where each word is a
        list of `[letter, POS, BIO-label]` triples
    :type sent: list
    :return: Flat list with one feature list per letter of the sentence
    :rtype: list
    '''
    featurelist = []
    senlen = len(sent)
    # each word in a sentence
    for i in range(senlen):
        word = sent[i]
        wordlen = len(word)
        lettersequence = ''
        # each letter in a word
        for j in range(wordlen):
            letter = word[j][0]
            # gathering previous letters
            lettersequence += letter
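            # digits get no feature list of their own; the append at the end
            # of this loop then reuses the list built for the previous letter
            if not letter.isdigit():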
                features = [
                    'bias',
                    'letterLowercase=' + letter.lower(),
                    'postag=' + word[j][1],
                ]
                # Word position flag: 'EOS' for the last word of the
                # sentence, 'BOS' for every other word
                if i == senlen -1:
                    features.append("EOS")
                else:
                    features.append("BOS")

                # Pos tag sequence (Don't get pos tag if sentence is 1 word long)
                if i > 0 and senlen > 1:
                    features.append('prevpostag=' + sent[i-1][0][1])
                    if i != senlen-1:
                        features.append('nxtpostag=' + sent[i+1][0][1])
                    else:
                        features.append('EOS')
                else:
                    features.append('BOS')
                    #Don't get pos tag if sentence is 1 word long
                    if i != senlen-1:
                        features.append('nxtpostag=' + sent[i+1][0][1])

                # Position of letter in word
                if j == 0:
                    features.append('BOW')
                elif j == wordlen-1:
                    features.append('EOW')
                else:
                    features.append('letterposition=-%s' % str(wordlen-1-j))

                # Letter sequences before letter
                if j >= 4:
                    features.append('prev4letters=' + lettersequence[j-4:j].lower() + '>')
                if j >= 3:
                    features.append('prev3letters=' + lettersequence[j-3:j].lower() + '>')
                if j >= 2:
                    features.append('prev2letters=' + lettersequence[j-2:j].lower() + '>')
                if j >= 1:
                    features.append('prevletter=' + lettersequence[j-1:j].lower() + '>')

                # letter sequences after letter
                if j <= wordlen-2:
                    nxtlets = word[j+1][0]
                    features.append('nxtletter=<' + nxtlets.lower())
                if j <= wordlen-3:
                    nxtlets += word[j+2][0]
                    features.append('nxt2letters=<' + nxtlets.lower())
                if j <= wordlen-4:
                    nxtlets += word[j+3][0]
                    features.append('nxt3letters=<' + nxtlets.lower())
                if j <= wordlen-5:
                    nxtlets += word[j+4][0]
                    features.append('nxt4letters=<' + nxtlets.lower())
            featurelist.append(features)
    return featurelist


def get_labels(sent, flag=0):
    labels = []
    for word in sent:
        for letter in word:
            labels.append(letter[2])
    return labels


def sent2features(data):
    return [get_feature_lists(sent) for sent in data]


def sent2labels(data):
    return [get_labels(sent) for sent in data]
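
The letter-context rules in `get_feature_lists` are easier to follow in isolation. The sketch below restates only the previous/next letter windows, using the same feature names; `letter_windows` is a hypothetical helper and the input word is made up for illustration, not taken from the corpus.

```python
def letter_windows(letters, j, max_n=4):
    """Return the previous/next letter n-gram features for position j."""
    feats = []
    # Preceding context, longest window first
    for n in range(max_n, 0, -1):
        if j >= n:
            name = "prevletter" if n == 1 else f"prev{n}letters"
            feats.append(f"{name}={''.join(letters[j - n:j]).lower()}>")
    # Following context, shortest window first
    for n in range(1, max_n + 1):
        if j + n <= len(letters) - 1:
            name = "nxtletter" if n == 1 else f"nxt{n}letters"
            feats.append(f"{name}=<{''.join(letters[j + 1:j + 1 + n]).lower()}")
    return feats


# Hypothetical six-letter word; position 4 is the letter 'e'
print(letter_windows(list("Abcdef"), 4))
```

The `>` and `<` markers keep a preceding window distinguishable from a following window with the same letters.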

### Train and test functions

In [3]:
def model_trainer(train_data, hyper):
    """ Train a model and save it to disk

    Trains a model according to the given hyperparameters and saves it as
    a file that `pycrfsuite` can load

    Parameters
    ----------
    train_data : list
        Training sentences
    hyper : dict
        Hyperparameters; the keys read here are `L1`, `L2`, `max-iter`
        and `path` (output file of the trained model)

    Returns
    -------
    train_time : float
        Training time in seconds
    """
    X_train = sent2features(train_data)
    y_train = sent2labels(train_data)
    
    # Train the model
    trainer = pycrfsuite.Trainer(verbose=True)

    for xseq, yseq in zip(X_train, y_train):
        trainer.append(xseq, yseq)

    # Set training parameters. L-BFGS is default. Using Elastic Net (L1 + L2)
    trainer.set_params({
            'c1': hyper['L1'],  # coefficient for L1 penalty
            'c2': hyper['L2'],  # coefficient for L2 penalty
            'max_iterations': hyper['max-iter']  # upper bound on iterations
        })
    # The program saves the trained model to a file:
    start = time.time()
    trainer.train(hyper['path'])
    end = time.time()
    train_time = end - start
    return train_time


def model_tester(test_data, model_path):
    """ Test a pretrained model

    Receives the test data and runs it through a previously trained model

    Parameters
    ----------
    test_data : list
        Test sentences
    model_path : str
        Path to the trained model file

    Returns
    -------
    y_test : list
        Gold labels
    y_pred : list
        Labels predicted by the model
    """
    X_test = sent2features(test_data)
    y_test = sent2labels(test_data)

    # ### Make Predictions
    tagger = pycrfsuite.Tagger()
    # Passing model to tagger
    tagger.open(model_path)  
    # Tagging task using the model
    y_pred = [tagger.tag(xseq) for xseq in X_test]
    # Closing tagger
    tagger.close()
    return y_test, y_pred
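
The `c1` and `c2` values passed to the trainer weight the L1 and L2 penalties. As a rough sketch, assuming the penalty has the usual elastic-net form `c1 * ||w||_1 + c2 * ||w||_2^2` (the exact scaling inside CRFsuite is not shown in this notebook), the effect of the four variants on a toy weight vector looks like this. `elastic_net_penalty` and the weights are hypothetical.

```python
def elastic_net_penalty(weights, c1, c2):
    """Penalty added to the training loss under the assumed elastic-net form."""
    l1_norm = sum(abs(w) for w in weights)
    l2_norm_sq = sum(w * w for w in weights)
    return c1 * l1_norm + c2 * l2_norm_sq


toy_weights = [0.5, -2.0, 0.0]  # made-up feature weights

for name, c1, c2 in [("reg", 0.1, 1e-3), ("noreg", 0.0, 0.0),
                     ("l1_zero", 0.0, 1e-3), ("l2_zero", 0.1, 0.0)]:
    print(name, elastic_net_penalty(toy_weights, c1, c2))
```

With these coefficients the L1 term dominates the penalty, which is consistent with the `reg` and `l2_zero` runs behaving alike in the logs.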

## Loading the full corpus

In [4]:
corpus = get_corpus('corpus_otomi_mod', '../corpora/') + \
         get_corpus('corpus_otomi_hard', '../corpora/')
letter_corpus = WordsToLetter(corpus)
dataset = np.array(letter_corpus, dtype=object)

## Base parameters

In [5]:
models_path = 'models/'
env_name = "linearCRF"
max_iter = 50
k = 3
kf = KFold(n_splits=k, shuffle=True)
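
With `k = 3`, `KFold` splits the sentences into three test folds whose sizes differ by at most one. The sketch below reproduces that size rule in plain Python; `fold_sizes` is a hypothetical helper, and the total of 1769 sentences is inferred from the `len train: 1179 len test: 590` line in the training output.

```python
def fold_sizes(n_samples, n_splits):
    """Test-fold sizes: the first n_samples % n_splits folds get one extra item."""
    base, extra = divmod(n_samples, n_splits)
    return [base + 1 if i < extra else base for i in range(n_splits)]


n_samples = 1179 + 590  # inferred from the logged train/test sizes

for fold, test_size in enumerate(fold_sizes(n_samples, 3), start=1):
    print(f"fold {fold}: train={n_samples - test_size} test={test_size}")
```

Because the notebook calls `KFold` with `shuffle=True` and no `random_state`, the fold membership changes between runs; passing a fixed `random_state` would make the splits repeatable.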

## Parameters for `linearCRF_reg.crfsuite`

In [6]:
params = {"L1": 0.1, "L2": 1e-3, "max-iter": max_iter}
variant = "reg"

### Training and tests

In [7]:
%%time
i = 0
full_time = 0
accuracy_set = []
for train_index, test_index in kf.split(dataset):
    i += 1
    train_data, test_data = dataset[train_index], dataset[test_index]
    model_name = f"{env_name}_{variant}_k_{i}.crf"
    params['path'] = os.path.join(models_path, env_name, model_name)
    print("*"*50)
    print(f"Entrenando nuevo modelo '{model_name}' | K = {i}")
    print(f"len train: {len(train_data)} len test: {len(test_data)}")
    print("*"*50)
    train_time = model_trainer(train_data, params)
    full_time += train_time
    print("*"*50)
    print(f"Tiempo de entrenamiento: {train_time}[s] | {train_time / 60}[m]")
    print("Test del modelo")
    y_test, y_pred = model_tester(test_data, params['path'])
    accuracy_set.append(accuracy_score(y_test, y_pred))
    print(f"Partial accuracy: {accuracy_set[i - 1]}\n")
    # Reports
    eval_labeled_positions(y_test, y_pred)
    print(bio_classification_report(y_test, y_pred))
print("\n\nAccuracy Set -->", accuracy_set)
train_time_format = str(round(full_time / 60, 2)) + "[m]"
print(f"\nTime>> {train_time_format}")
train_size = len(train_data)
test_size = len(test_data)
params['k-folds'] = k
write_report(model_name, train_size, test_size, accuracy_set, train_time_format,
             params)

**************************************************
Entrenando nuevo modelo 'linearCRF_reg_k_1.crf' | K = 1
len train: 1179 len test: 590
**************************************************
Feature generation
type: CRF1d
feature.minfreq: 0.000000
feature.possible_states: 0
feature.possible_transitions: 0
0....1....2....3....4....5....6....7....8....9....10
Number of features: 27761
Seconds required: 0.061

L-BFGS optimization
c1: 0.100000
c2: 0.001000
num_memories: 6
max_iterations: 50
epsilon: 0.000010
stop: 10
delta: 0.000010
linesearch: MoreThuente
linesearch.max_iterations: 20

***** Iteration #1 *****
Loss: 118511.480241
Feature norm: 1.000000
Error norm: 26058.712638
Active features: 27671
Line search trials: 1
Line search step: 0.000037
Seconds required for this iteration: 3.480

***** Iteration #2 *****
Loss: 116190.593865
Feature norm: 6.174299
Error norm: 31707.272925
Active features: 27599
Line search trials: 4
Line search step: 0.125000
Seconds required for this iteration: 7.

Partial accuracy: 0.9596802716276438

		Precision	Recall	f1-score	Instances

stem		0.96		0.97	0.97		2559
2.cnt		0.89		1.00	0.94		8
p.loc		0.67		0.29	0.40		7
psd		0.96		0.99	0.98		124
1.cnt		1.00		0.96	0.98		23
pl.exc		0.93		0.94	0.93		67
lim		0.99		0.99	0.99		117
prag		1.00		0.99	1.00		118
dem		0.95		0.93	0.94		59
3.icp		0.98		0.99	0.99		108
1.pss		1.00		1.00	1.00		60
ila		0.88		1.00	0.94		46
1.icp		0.99		1.00	0.99		88
ctrf		0.93		0.91	0.92		77
det.pl		0.96		0.98	0.97		90
1.pot		1.00		0.99	0.99		78
1.cpl		0.99		1.00	0.99		75
lig		0.89		0.78	0.83		93
3.pot		0.99		1.00	1.00		104
det		1.00		0.99	0.99		236
3.cnt		0.91		1.00	0.95		30
3.cpl		0.99		1.00	0.99		146
3.pls		0.85		0.94	0.89		18
3.obj		0.83		0.75	0.79		20
pl		0.95		0.79	0.86		47
muy		0.94		0.89	0.91		18
dim		0.98		0.98	0.98		48
1.obj		0.90		0.90	0.90		42
3.pss		0.80		1.00	0.89		12
1.pls		0.50		0.50	0.50		2
dual.exc		0.94		0.89	0.91		18
1.icp.irr		1.00		1.00	1.00		6
loc		0.95		0.88	0.91		24
3.prf		0.91		1.00	0.95		10
1.prf		0.97		1.00	0.98		28
med		0.71		1.00	0.83		10
1



               precision    recall  f1-score   support

      B-1.cnt       1.00      0.96      0.98        23
      I-1.cnt       1.00      0.96      0.98        46
      B-1.cpl       0.99      1.00      0.99        75
      I-1.cpl       0.99      1.00      0.99        75
  B-1.cpl.irr       1.00      0.57      0.73         7
  I-1.cpl.irr       1.00      0.57      0.73         7
      B-1.enf       0.73      0.67      0.70        12
      I-1.enf       0.73      0.67      0.70        12
      B-1.icp       0.99      1.00      0.99        88
      I-1.icp       0.99      1.00      0.99        88
  B-1.icp.irr       1.00      1.00      1.00         6
  I-1.icp.irr       1.00      1.00      1.00        12
      B-1.irr       1.00      1.00      1.00         1
      I-1.irr       1.00      1.00      1.00         2
      B-1.obj       0.90      0.90      0.90        42
      I-1.obj       0.90      0.90      0.90        42
      B-1.pls       0.50      0.50      0.50         2
      I-1

Partial accuracy: 0.9659782989149458

		Precision	Recall	f1-score	Instances

3.pot		1.00		1.00	1.00		80
stem		0.96		0.97	0.97		2545
1.icp		0.99		1.00	0.99		97
lim		0.99		1.00	1.00		130
psd		0.99		0.99	0.99		143
3.icp		0.99		0.98	0.99		115
pl		0.98		0.79	0.87		52
3.cpl		1.00		0.99	1.00		154
loc		1.00		1.00	1.00		21
2.pss		1.00		1.00	1.00		6
2.icp		1.00		1.00	1.00		34
dem		0.93		0.93	0.93		80
det		0.99		0.97	0.98		269
det.pl		1.00		0.99	0.99		88
prag		0.98		0.99	0.99		123
1.prf		0.96		1.00	0.98		23
pl.exc		0.99		0.95	0.97		88
lig		0.98		0.78	0.87		83
mod		0.90		0.83	0.86		23
aum		1.00		0.60	0.75		5
1.obj		0.97		0.83	0.90		42
1.pss		1.00		1.00	1.00		52
3.icp.irr		1.00		0.90	0.95		20
dim		0.98		0.88	0.93		51
ila		1.00		0.98	0.99		50
ctrf		0.92		0.97	0.95		75
3.pss		0.75		0.86	0.80		14
1.pot		0.98		0.98	0.98		85
3.imp		0.83		1.00	0.91		5
it		1.00		0.88	0.94		26
muy		0.86		0.71	0.77		17
3.cnt		0.93		1.00	0.96		25
1.cpl		1.00		1.00	1.00		60
prt		0.83		0.28	0.42		18
3.pls		1.00		1.00	1.00		18




                  precision    recall  f1-score   support

         B-1.cnt       1.00      1.00      1.00        19
         I-1.cnt       1.00      1.00      1.00        38
         B-1.cpl       1.00      1.00      1.00        60
         I-1.cpl       1.00      1.00      1.00        60
     B-1.cpl.irr       1.00      1.00      1.00         5
     I-1.cpl.irr       1.00      1.00      1.00         5
         B-1.enf       0.75      0.43      0.55         7
         I-1.enf       0.75      0.43      0.55         7
         B-1.icp       0.99      1.00      0.99        97
         I-1.icp       0.99      1.00      0.99        97
     B-1.icp.irr       1.00      1.00      1.00         4
     I-1.icp.irr       1.00      1.00      1.00         8
         B-1.obj       0.97      0.83      0.90        42
         I-1.obj       0.97      0.83      0.90        42
         B-1.pls       1.00      0.29      0.44         7
         I-1.pls       1.00      0.29      0.44         7
         B-1.

Partial accuracy: 0.9615355944837868

		Precision	Recall	f1-score	Instances

3.cpl		0.96		1.00	0.98		144
ctrf		0.95		0.97	0.96		89
stem		0.96		0.96	0.96		2396
2.icp		0.94		1.00	0.97		31
2.cnt		1.00		1.00	1.00		6
det.pl		0.99		0.99	0.99		91
3.icp		0.93		0.94	0.94		118
1.pls		0.25		1.00	0.40		1
pl.exc		0.91		0.98	0.95		64
3.pot		1.00		1.00	1.00		62
lim		0.99		0.99	0.99		123
psd		0.97		0.97	0.97		146
muy		0.87		1.00	0.93		20
1.obj		0.98		0.96	0.97		48
1.pss		0.98		0.97	0.98		61
1.pot		0.98		1.00	0.99		60
det		1.00		0.99	1.00		228
prag		0.97		0.99	0.98		116
dem		0.93		0.95	0.94		56
lig		0.93		0.76	0.84		111
1.irr		1.00		1.00	1.00		1
3.obj		0.94		0.65	0.77		26
3.pss		0.80		1.00	0.89		8
pl		0.94		0.74	0.83		39
ila		0.95		0.97	0.96		39
dim		0.98		0.98	0.98		43
3.pls		1.00		0.86	0.92		28
1.icp		0.99		1.00	0.99		85
1.icp.irr		0.75		1.00	0.86		3
loc		0.87		0.96	0.92		28
dual.exc		0.95		0.95	0.95		20
1.cnt		1.00		1.00	1.00		25
1.cpl		1.00		1.00	1.00		65
int		1.00		1.00	1.00		8
it		1.00		1.00	1.00

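
The three partial accuracies printed above for the `reg` variant can be condensed into a mean and a spread across folds. This is a small sketch using only the standard library; the values are copied from the `Partial accuracy` lines of this run, and the variable names are illustrative.

```python
from statistics import mean, stdev

# Partial accuracies of folds 1-3 for the `reg` variant, as printed above
accuracy_set = [0.9596802716276438, 0.9659782989149458, 0.9615355944837868]

mean_acc = mean(accuracy_set)
spread = stdev(accuracy_set)  # sample standard deviation over the folds

print(f"mean accuracy: {mean_acc:.4f}")
print(f"std deviation: {spread:.4f}")
```

With only three folds the standard deviation is a coarse indicator, so it is best read alongside the per-label reports rather than on its own.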


## Parameters for `linearCRF_noreg.crfsuite`

In [8]:
params = {"L1": 0.0, "L2": 0.0, "max-iter": max_iter}
variant = "noreg"

### Training and tests

In [9]:
%%time
i = 0
full_time = 0
accuracy_set = []
for train_index, test_index in kf.split(dataset):
    i += 1
    train_data, test_data = dataset[train_index], dataset[test_index]
    model_name = f"{env_name}_{variant}_k_{i}.crf"
    params['path'] = os.path.join(models_path, env_name, model_name)
    print("*"*50)
    print(f"Entrenando nuevo modelo '{model_name}' | K = {i}")
    print(f"len train: {len(train_data)} len test: {len(test_data)}")
    print("*"*50)
    train_time = model_trainer(train_data, params)
    full_time += train_time
    print("*"*50)
    print(f"Tiempo de entrenamiento: {train_time}[s] | {train_time / 60}[m]")
    print("Test del modelo")
    y_test, y_pred = model_tester(test_data, params['path'])
    accuracy_set.append(accuracy_score(y_test, y_pred))
    print(f"Partial accuracy: {accuracy_set[i - 1]}\n")
    # Reports
    eval_labeled_positions(y_test, y_pred)
    print(bio_classification_report(y_test, y_pred))
print("\n\nAccuracy Set -->", accuracy_set)
train_time_format = str(round(full_time / 60, 2)) + "[m]"
print(f"\nTime>> {train_time_format}")
train_size = len(train_data)
test_size = len(test_data)
params['k-folds'] = k
write_report(model_name, train_size, test_size, accuracy_set, train_time_format,
             params)

**************************************************
Entrenando nuevo modelo 'linearCRF_noreg_k_1.crf' | K = 1
len train: 1179 len test: 590
**************************************************
Feature generation
type: CRF1d
feature.minfreq: 0.000000
feature.possible_states: 0
feature.possible_transitions: 0
0....1....2....3....4....5....6....7....8....9....10
Number of features: 28250
Seconds required: 0.059

L-BFGS optimization
c1: 0.000000
c2: 0.000000
num_memories: 6
max_iterations: 50
epsilon: 0.000010
stop: 10
delta: 0.000010
linesearch: MoreThuente
linesearch.max_iterations: 20

***** Iteration #1 *****
Loss: 107101.253611
Feature norm: 5.000000
Error norm: 31561.631898
Active features: 28250
Line search trials: 2
Line search step: 0.000186
Seconds required for this iteration: 5.354

***** Iteration #2 *****
Loss: 70570.237489
Feature norm: 3.576359
Error norm: 9321.798882
Active features: 28250
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 1.



Feature generation
type: CRF1d
feature.minfreq: 0.000000
feature.possible_states: 0
feature.possible_transitions: 0
0....1....2....3....4....5....6....7....8....9....10
Number of features: 28294
Seconds required: 0.058

L-BFGS optimization
c1: 0.000000
c2: 0.000000
num_memories: 6
max_iterations: 50
epsilon: 0.000010
stop: 10
delta: 0.000010
linesearch: MoreThuente
linesearch.max_iterations: 20

***** Iteration #1 *****
Loss: 106507.838942
Feature norm: 5.000000
Error norm: 31100.945247
Active features: 28294
Line search trials: 2
Line search step: 0.000187
Seconds required for this iteration: 5.864

***** Iteration #2 *****
Loss: 70851.431498
Feature norm: 3.558058
Error norm: 9807.205344
Active features: 28294
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 1.955

***** Iteration #3 *****
Loss: 66352.167029
Feature norm: 4.312576
Error norm: 5565.142949
Active features: 28294
Line search trials: 1
Line search step: 1.000000
Seconds required for t



Feature generation
type: CRF1d
feature.minfreq: 0.000000
feature.possible_states: 0
feature.possible_transitions: 0
0....1....2....3....4....5....6....7....8....9....10
Number of features: 28534
Seconds required: 0.062

L-BFGS optimization
c1: 0.000000
c2: 0.000000
num_memories: 6
max_iterations: 50
epsilon: 0.000010
stop: 10
delta: 0.000010
linesearch: MoreThuente
linesearch.max_iterations: 20

***** Iteration #1 *****
Loss: 106625.666734
Feature norm: 5.000000
Error norm: 31242.506653
Active features: 28534
Line search trials: 2
Line search step: 0.000182
Seconds required for this iteration: 6.214

***** Iteration #2 *****
Loss: 70745.311210
Feature norm: 3.590099
Error norm: 9734.140693
Active features: 28534
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 2.054

***** Iteration #3 *****
Loss: 66232.276305
Feature norm: 4.332195
Error norm: 5497.248900
Active features: 28534
Line search trials: 1
Line search step: 1.000000
Seconds required for t



## Parameters for `linearCRF_l1_zero.crfsuite`

In [10]:
params = {"L1": 0.0, "L2": 1e-3, "max-iter": max_iter}
variant = "l1_zero"

### Training and tests

In [11]:
%%time
i = 0
full_time = 0
accuracy_set = []
for train_index, test_index in kf.split(dataset):
    i += 1
    train_data, test_data = dataset[train_index], dataset[test_index]
    model_name = f"{env_name}_{variant}_k_{i}.crf"
    params['path'] = os.path.join(models_path, env_name, model_name)
    print("*"*50)
    print(f"Entrenando nuevo modelo '{model_name}' | K = {i}")
    print(f"len train: {len(train_data)} len test: {len(test_data)}")
    print("*"*50)
    train_time = model_trainer(train_data, params)
    full_time += train_time
    print("*"*50)
    print(f"Tiempo de entrenamiento: {train_time}[s] | {train_time / 60}[m]")
    print("Test del modelo")
    y_test, y_pred = model_tester(test_data, params['path'])
    accuracy_set.append(accuracy_score(y_test, y_pred))
    print(f"Partial accuracy: {accuracy_set[i - 1]}\n")
    # Reports
    eval_labeled_positions(y_test, y_pred)
    print(bio_classification_report(y_test, y_pred))
print("\n\nAccuracy Set -->", accuracy_set)
train_time_format = str(round(full_time / 60, 2)) + "[m]"
print(f"\nTime>> {train_time_format}")
train_size = len(train_data)
test_size = len(test_data)
params['k-folds'] = k
write_report(model_name, train_size, test_size, accuracy_set, train_time_format,
             params)

**************************************************
Entrenando nuevo modelo 'linearCRF_l1_zero_k_1.crf' | K = 1
len train: 1179 len test: 590
**************************************************
Feature generation
type: CRF1d
feature.minfreq: 0.000000
feature.possible_states: 0
feature.possible_transitions: 0
0....1....2....3....4....5....6....7....8....9....10
Number of features: 28300
Seconds required: 0.056

L-BFGS optimization
c1: 0.000000
c2: 0.001000
num_memories: 6
max_iterations: 50
epsilon: 0.000010
stop: 10
delta: 0.000010
linesearch: MoreThuente
linesearch.max_iterations: 20

***** Iteration #1 *****
Loss: 105796.274260
Feature norm: 5.000000
Error norm: 30993.978148
Active features: 28300
Line search trials: 2
Line search step: 0.000188
Seconds required for this iteration: 5.360

***** Iteration #2 *****
Loss: 69818.295239
Feature norm: 3.566140
Error norm: 9382.363023
Active features: 28300
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 



Feature generation
type: CRF1d
feature.minfreq: 0.000000
feature.possible_states: 0
feature.possible_transitions: 0
0....1....2....3....4....5....6....7....8....9....10
Number of features: 27781
Seconds required: 0.057

L-BFGS optimization
c1: 0.000000
c2: 0.001000
num_memories: 6
max_iterations: 50
epsilon: 0.000010
stop: 10
delta: 0.000010
linesearch: MoreThuente
linesearch.max_iterations: 20

***** Iteration #1 *****
Loss: 105665.942539
Feature norm: 5.000000
Error norm: 31089.869060
Active features: 27781
Line search trials: 2
Line search step: 0.000185
Seconds required for this iteration: 5.480

***** Iteration #2 *****
Loss: 69463.737937
Feature norm: 3.588252
Error norm: 9221.597580
Active features: 27781
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 1.881

***** Iteration #3 *****
Loss: 65288.629143
Feature norm: 4.308549
Error norm: 5446.528735
Active features: 27781
Line search trials: 1
Line search step: 1.000000
Seconds required for t



Feature generation
type: CRF1d
feature.minfreq: 0.000000
feature.possible_states: 0
feature.possible_transitions: 0
0....1....2....3....4....5....6....7....8....9....10
Number of features: 28965
Seconds required: 0.057

L-BFGS optimization
c1: 0.000000
c2: 0.001000
num_memories: 6
max_iterations: 50
epsilon: 0.000010
stop: 10
delta: 0.000010
linesearch: MoreThuente
linesearch.max_iterations: 20

***** Iteration #1 *****
Loss: 108795.438765
Feature norm: 5.000000
Error norm: 31837.745475
Active features: 28965
Line search trials: 2
Line search step: 0.000183
Seconds required for this iteration: 6.220

***** Iteration #2 *****
Loss: 72629.088744
Feature norm: 3.568959
Error norm: 10097.313712
Active features: 28965
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 2.082

***** Iteration #3 *****
Loss: 67951.270202
Feature norm: 4.327143
Error norm: 5651.443689
Active features: 28965
Line search trials: 1
Line search step: 1.000000
Seconds required for 



## Parameters for `linearCRF_l2_zero.crfsuite`

In [12]:
params = {"L1": 0.1, "L2": 0.0, "max-iter": max_iter}
variant = "l2_zero"

### Training and tests

In [13]:
%%time
i = 0
full_time = 0
accuracy_set = []
for train_index, test_index in kf.split(dataset):
    i += 1
    train_data, test_data = dataset[train_index], dataset[test_index]
    model_name = f"{env_name}_{variant}_k_{i}.crf"
    params['path'] = os.path.join(models_path, env_name, model_name)
    print("*"*50)
    print(f"Entrenando nuevo modelo '{model_name}' | K = {i}")
    print(f"len train: {len(train_data)} len test: {len(test_data)}")
    print("*"*50)
    train_time = model_trainer(train_data, params)
    full_time += train_time
    print("*"*50)
    print(f"Tiempo de entrenamiento: {train_time}[s] | {train_time / 60}[m]")
    print("Test del modelo")
    y_test, y_pred = model_tester(test_data, params['path'])
    accuracy_set.append(accuracy_score(y_test, y_pred))
    print(f"Partial accuracy: {accuracy_set[i - 1]}\n")
    # Reports
    eval_labeled_positions(y_test, y_pred)
    print(bio_classification_report(y_test, y_pred))
print("\n\nAccuracy Set -->", accuracy_set)
train_time_format = str(round(full_time / 60, 2)) + "[m]"
print(f"\nTime>> {train_time_format}")
train_size = len(train_data)
test_size = len(test_data)
params['k-folds'] = k
write_report(model_name, train_size, test_size, accuracy_set, train_time_format,
             params)

**************************************************
Entrenando nuevo modelo 'linearCRF_l2_zero_k_1.crf' | K = 1
len train: 1179 len test: 590
**************************************************
Feature generation
type: CRF1d
feature.minfreq: 0.000000
feature.possible_states: 0
feature.possible_transitions: 0
0....1....2....3....4....5....6....7....8....9....10
Number of features: 28098
Seconds required: 0.057

L-BFGS optimization
c1: 0.100000
c2: 0.000000
num_memories: 6
max_iterations: 50
epsilon: 0.000010
stop: 10
delta: 0.000010
linesearch: MoreThuente
linesearch.max_iterations: 20

***** Iteration #1 *****
Loss: 121671.666597
Feature norm: 1.000000
Error norm: 26752.120805
Active features: 27983
Line search trials: 1
Line search step: 0.000036
Seconds required for this iteration: 3.679

***** Iteration #2 *****
Loss: 76553.662124
Feature norm: 3.744126
Error norm: 20955.874718
Active features: 27960
Line search trials: 5
Line search step: 0.062500
Seconds required for this iteration:



Feature generation
type: CRF1d
feature.minfreq: 0.000000
feature.possible_states: 0
feature.possible_transitions: 0
0....1....2....3....4....5....6....7....8....9....10
Number of features: 28289
Seconds required: 0.057

L-BFGS optimization
c1: 0.100000
c2: 0.000000
num_memories: 6
max_iterations: 50
epsilon: 0.000010
stop: 10
delta: 0.000010
linesearch: MoreThuente
linesearch.max_iterations: 20

***** Iteration #1 *****
Loss: 121761.857160
Feature norm: 1.000000
Error norm: 26335.211051
Active features: 28228
Line search trials: 1
Line search step: 0.000037
Seconds required for this iteration: 4.019

***** Iteration #2 *****
Loss: 77408.054586
Feature norm: 3.808274
Error norm: 21790.763268
Active features: 28153
Line search trials: 5
Line search step: 0.062500
Seconds required for this iteration: 10.078

***** Iteration #3 *****
Loss: 67120.386975
Feature norm: 3.837122
Error norm: 10314.077982
Active features: 27008
Line search trials: 1
Line search step: 1.000000
Seconds required fo



Feature generation
type: CRF1d
feature.minfreq: 0.000000
feature.possible_states: 0
feature.possible_transitions: 0
0....1....2....3....4....5....6....7....8....9....10
Number of features: 28387
Seconds required: 0.057

L-BFGS optimization
c1: 0.100000
c2: 0.000000
num_memories: 6
max_iterations: 50
epsilon: 0.000010
stop: 10
delta: 0.000010
linesearch: MoreThuente
linesearch.max_iterations: 20

***** Iteration #1 *****
Loss: 118773.480316
Feature norm: 1.000000
Error norm: 25676.661690
Active features: 28282
Line search trials: 1
Line search step: 0.000038
Seconds required for this iteration: 3.527

***** Iteration #2 *****
Loss: 118395.273709
Feature norm: 6.278887
Error norm: 31776.355958
Active features: 28226
Line search trials: 4
Line search step: 0.125000
Seconds required for this iteration: 7.039

***** Iteration #3 *****
Loss: 67183.764586
Feature norm: 5.059998
Error norm: 16654.305778
Active features: 27158
Line search trials: 1
Line search step: 1.000000
Seconds required fo

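
The training logs above are easier to compare across variants once the per-iteration numbers are extracted. The following sketch parses the `Iteration` blocks with a regular expression; `parse_iterations` is a hypothetical helper, and the sample text is copied from the first `l2_zero` fold.

```python
import re

# One match per iteration block: iteration number, loss, active feature count
ITERATION_RE = re.compile(
    r"\*{5} Iteration #(\d+) \*{5}\s+"
    r"Loss: ([\d.]+)\s+"
    r"Feature norm: [\d.]+\s+"
    r"Error norm: [\d.]+\s+"
    r"Active features: (\d+)"
)


def parse_iterations(log_text):
    """Return one dict per iteration block found in a CRFsuite training log."""
    return [
        {"iteration": int(number), "loss": float(loss), "active": int(active)}
        for number, loss, active in ITERATION_RE.findall(log_text)
    ]


sample_log = """
***** Iteration #1 *****
Loss: 121671.666597
Feature norm: 1.000000
Error norm: 26752.120805
Active features: 27983

***** Iteration #2 *****
Loss: 76553.662124
Feature norm: 3.744126
Error norm: 20955.874718
Active features: 27960
"""

rows = parse_iterations(sample_log)
for row in rows:
    print(row)
```

In this excerpt the active feature count drops from 27983 to 27960 with `c1 = 0.1`, whereas in the logged runs with `c1 = 0` it stays equal to the total number of features.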
