## Training models

In this notebook we walk through the full training pipeline for our face recognition models and search for the parameters that give the best results.

### Importing libraries

In [None]:
# Install pandasql, used below to run SQL queries over DataFrames
!pip install pandasql

import json
import os
import sys

import numpy as np
import pandas as pd
from pandasql import sqldf

# Make the project's helper scripts importable
sys.path.append('/opt/workspace/src/python_scripts/')
from ops_face_recognition import get_pipeline_results, train_pipelines
from ops_files_operations import read_pickle_file, read_json_file

### Training a single pipeline

Given specific parameters, such as the seed, the test sample size and the SVM parameters, we can train a single pipeline and get its results.

In [None]:
seed = 3195
test_sample = 0.3
faces_folder = './datasets/actor_faces'
C=1.0
#gamma=1
#degree=2
kernel='linear'
pipeline_results = get_pipeline_results(
    faces_folder=faces_folder,
    seed=seed,
    test_sample=test_sample,
    kernel=kernel,
    C=C,
    # gamma and degree are not used by the linear kernel
    # gamma=gamma,
    # degree=degree,
)
print('Pipeline results:')
print(json.dumps(pipeline_results, indent=4))
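To illustrate why fixing `seed` and `test_sample` makes a run repeatable, here is a minimal standard-library sketch of a seeded train/test split. It is not the code inside `get_pipeline_results`, whose internals are not shown in this notebook; `split_indices` is a hypothetical helper, and treating `test_sample` as the held-out fraction is an assumption.

```python
import random


def split_indices(n_items, test_sample, seed):
    """Hypothetical helper: split range(n_items) into train/test index lists.

    Assumes test_sample is the fraction of items held out for testing.
    """
    indices = list(range(n_items))
    # A dedicated Random instance keeps the shuffle independent of global state
    random.Random(seed).shuffle(indices)
    n_test = round(n_items * test_sample)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(n_items=10, test_sample=0.3, seed=3195)
print(len(train_idx), len(test_idx))  # 7 3

# The same seed always reproduces the same split
assert split_indices(10, 0.3, 3195) == (train_idx, test_idx)
```

Changing the seed reshuffles the items, so the measured accuracy can differ from run to run even with identical SVM parameters.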

### Iterating over multiple variables to find best models

We can also establish a range of values for each input parameter and search for the combination that gives the best results.

In [None]:
faces_folder = './datasets/actor_faces'
pipelines_folder = './models/pipelines'
test_sample = 0.3
C_values = [1,10,100,1000]
gammas = [0.1, 1, 10, 100]
kernels = ['linear', 'rbf', 'poly']
degrees = [2, 3, 4]
iterations = 8

pipelines_dict, pipelines_metadata = train_pipelines(
    faces_folder=faces_folder,
    test_sample=test_sample,
    kernels=kernels,
    C_values=C_values,
    gammas=gammas,
    degrees=degrees,
    iterations=iterations,
    save_to_pickle=True,
    output_folder=pipelines_folder
)
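To get a feel for the size of this search, the sketch below enumerates the parameter grid with `itertools.product`. How `train_pipelines` actually builds its combinations is not shown here, so `enumerate_grid` is a hypothetical helper. It assumes that parameters a kernel ignores are dropped and that `iterations` is the number of times each combination is retrained.

```python
from itertools import product

C_values = [1, 10, 100, 1000]
gammas = [0.1, 1, 10, 100]
kernels = ['linear', 'rbf', 'poly']
degrees = [2, 3, 4]
iterations = 8


def enumerate_grid(kernels, C_values, gammas, degrees):
    """Hypothetical helper: list one dict per parameter combination.

    Assumption: parameters a kernel ignores are dropped, so 'linear' varies
    only C, 'rbf' varies C and gamma, and 'poly' varies C, gamma and degree.
    """
    grid = []
    for kernel in kernels:
        kernel_gammas = gammas if kernel in ('rbf', 'poly') else [None]
        kernel_degrees = degrees if kernel == 'poly' else [None]
        for C, gamma, degree in product(C_values, kernel_gammas, kernel_degrees):
            grid.append({'kernel': kernel, 'C': C, 'gamma': gamma, 'degree': degree})
    return grid


grid = enumerate_grid(kernels, C_values, gammas, degrees)
print(len(grid))               # 68 combinations under this assumption
print(len(grid) * iterations)  # 544 training runs if each one is repeated 8 times
```

If every kernel were instead crossed with every value of every parameter, the grid would hold 3 × 4 × 4 × 3 = 144 combinations, so the count obtained from `parameters_uuid` later on is a quick way to check which scheme is in use.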

### Analyzing results

After iterating over several hyperparameter combinations, we analyze how the different configurations performed.

#### Loading Pandas DataFrame

In [None]:
pipelines_metadata_path = './models/pipelines/pipelines_metadata.json'
pipelines_metadata = read_json_file(pipelines_metadata_path)
# Load the pickled results referenced by the metadata entry at index 2
pipelines_dict = read_pickle_file(pipelines_metadata[2]['pickle_path'])
pipelines_df = pd.DataFrame(pipelines_dict)

#### Count the different combinations of parameters available

In [None]:
pipelines_df['parameters_uuid'].nunique()

#### Aggregating by parameters_uuid

We average the accuracy and the top-3 accuracy over all runs that share a `parameters_uuid`, and then rank the configurations on each of the two averages. The sum of both ranks is the overall rank (the `composed_rank` column), and the configurations with the lowest overall rank are the best-performing models.

In [None]:
aggregated_df = sqldf("""
    select 
        parameters_uuid, 
        kernel, 
        C, 
        gamma, 
        degree, 
        avg(accuracy) as avg_accuracy,
        avg(accuracy_top3) as avg_accuracy_top3
        
    from pipelines_df 
    group by 1,2,3,4,5 
    order by 2,3,4,5
    """)

In [None]:
ranked_df = sqldf("""
    with ranked as (
        select
            *,
            rank() over (order by avg_accuracy desc) as accuracy_rank,
            rank() over (order by avg_accuracy_top3 desc) as accuracy_top3_rank
        from aggregated_df
    )

    select
        *,
        accuracy_rank + accuracy_top3_rank as composed_rank
    from ranked
    order by composed_rank
    limit 5
""")

ranked_df
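The ranking logic can be traced by hand on a toy example. The sketch below reimplements the tie behavior of SQL `rank()` in plain Python and sums the two ranks. The accuracy values are made up for illustration and `sql_rank` is a hypothetical helper, not part of the project code.

```python
def sql_rank(values):
    """Rank values in descending order like SQL rank(): ties share the
    lowest position and the following position is skipped."""
    ordered = sorted(values, reverse=True)
    # list.index returns the first position of each value in the sorted list
    return [ordered.index(v) + 1 for v in values]


# Made-up averages for four hypothetical configurations
avg_accuracy = [0.91, 0.91, 0.88, 0.85]
avg_accuracy_top3 = [0.97, 0.99, 0.99, 0.95]

accuracy_rank = sql_rank(avg_accuracy)            # [1, 1, 3, 4]
accuracy_top3_rank = sql_rank(avg_accuracy_top3)  # [3, 1, 1, 4]
composed_rank = [a + b for a, b in zip(accuracy_rank, accuracy_top3_rank)]
print(composed_rank)  # [4, 2, 4, 8]
```

Here the second configuration wins with a composed rank of 2. The first and third tie at 4, which shows that the composed rank alone does not always give a unique ordering: rows with equal `composed_rank` come back from the query in no guaranteed order unless a tie-breaking column is added to the `order by`.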