# Training machines to smell - Part 3: Model set up, training and evaluation

## 1. Overview

This notebook provides the code for model set up, training and evaluation.

## 2. Models set up

As mentioned in the project proposal, the baseline (state-of-the-art) model for this task is random forest. Details about the hyperparameters are not provided, since it was a competition. Thus, this model will be implemented for both tasks, ``task 1: predict intensity`` and ``task 2: valence and descriptors``, and the results compared to the baseline result of the competition.


Here I will implement two models for both tasks: random forest and kernel ridge regression (KRR). The first was chosen because the winner of the competition used random forest (I do not know whether alone or in an ensemble). The second was chosen because it is a model of great interest (linear least squares with l2-norm regularization combined with the kernel trick). I will also try using both as an ensemble by averaging their predictions.

Both models will be implemented via ``scikit-learn``.
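To make "linear least squares with l2-norm regularization and the kernel trick" concrete, here is a minimal NumPy sketch of kernel ridge regression in its dual form, followed by the averaging ensemble. It is an illustration only, not the training script used in this project: the RBF kernel choice, the helper names (`rbf_kernel`, `krr_fit`, `krr_predict`, `average_ensemble`) and the synthetic data are all assumptions.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq_dists = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

def krr_fit(X, y, alpha, gamma):
    """Solve (K + alpha * I) c = y for the dual coefficients c."""
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + alpha * np.eye(len(X)), y)

def krr_predict(X_train, coef, X_new, gamma):
    """Predict as a kernel-weighted sum over the training points."""
    return rbf_kernel(X_new, X_train, gamma) @ coef

def average_ensemble(pred_a, pred_b):
    """Ensemble by taking the element-wise mean of two prediction vectors."""
    return (np.asarray(pred_a) + np.asarray(pred_b)) / 2.0

# Toy data (synthetic, for illustration only)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(40, 3))
y_toy = np.sin(X_toy[:, 0]) + 0.1 * rng.normal(size=40)

coef = krr_fit(X_toy, y_toy, alpha=0.05, gamma=0.5)
krr_pred = krr_predict(X_toy, coef, X_toy, gamma=0.5)
other_pred = np.full_like(krr_pred, y_toy.mean())  # stand-in for a second model
ensemble_pred = average_ensemble(krr_pred, other_pred)
```

Here `alpha` is the l2 penalty strength and `gamma` controls how quickly the kernel similarity decays with distance; these are the two hyperparameters tuned for the KRR later in this notebook.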

## 2.1 Task 1 - Predict intensity

Import packages

In [2]:
import pandas as pd
import sagemaker
import os
import time  # needed for time.sleep while waiting for tuner results
import sklearn
import matplotlib.pyplot as plt
import numpy as np
from sagemaker.sklearn.estimator import SKLearn

Get role, session and bucket

In [3]:
role = sagemaker.get_execution_role()
sagemaker_session = sagemaker.Session()
bucket = sagemaker_session.default_bucket()

## Random Forest Estimator

For random forest, the features can be used directly as they appear; no rescaling or preprocessing is needed. So the RF will be trained using ``X_train.csv`` and ``y_train.csv`` as they are.

## Upload data to S3

In [None]:
if not os.path.isdir('task1/rf_train'):
    os.mkdir('task1/rf_train')
    
if not os.path.isdir('task1/rf_val'):
    os.mkdir('task1/rf_val')

Copying files

In [None]:
from shutil import copyfile
copyfile("task1/X_train.csv", "task1/rf_train/X_train.csv")
copyfile("task1/y_train.csv", "task1/rf_train/y_train.csv")
copyfile("task1/X_ldb.csv", "task1/rf_val/X_ldb.csv")  # validation files go to the validation folder
copyfile("task1/y_ldb.csv", "task1/rf_val/y_ldb.csv")

In [None]:
prefix = 'task1'
train_folder = 'rf_train'
val_folder = 'rf_val'

Creating variable with path to data

In [None]:
train_path = os.path.join(prefix, train_folder)
val_path = os.path.join(prefix, val_folder)
print(train_path)
print('\n')
print(val_path)

Uploading data

In [None]:
train_input = sagemaker_session.upload_data(train_path, key_prefix = train_path)
val_input = sagemaker_session.upload_data(val_path, key_prefix = val_path)

I created these two variables so I did not have to upload the data several times.

In [None]:
train_input = 's3://sagemaker-us-east-1-004057822769/task1/rf_train'
val_input = 's3://sagemaker-us-east-1-004057822769/task1/rf_val'

In [None]:
print(train_input,'\n',val_input)

## Launching training job

Instantiating the estimator

In [None]:
FRAMEWORK_VERSION = "0.23-1"
train_fname = 'train_rf.py'
script_dir = os.path.join(prefix, 'source_sklearn')
output_path = 's3://{}/{}'.format(bucket, prefix)

rf = SKLearn(output_path = output_path,
    source_dir = script_dir,
    entry_point = train_fname,
    framework_version = FRAMEWORK_VERSION,
    instance_type="ml.m4.xlarge",
    role=role,
    metric_definitions=[
                   {'Name': 'validation r_mean', 'Regex': 'r_mean for validation set is: ([0-9.]+).*$'}
                ],
    sagemaker_session=sagemaker_session)
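The `metric_definitions` entry tells SageMaker which number to pull out of the training logs. The sketch below applies the same pattern with Python's `re` module to a few log lines. The log lines are hypothetical (the training script is not shown here), and treating `re.search` as equivalent to SageMaker's own matching is an assumption.

```python
import re

# Same pattern as in metric_definitions above
pattern = r'r_mean for validation set is: ([0-9.]+).*$'

def extract_metric(log_line):
    """Return the captured metric as a float, or None if the line does not match."""
    match = re.search(pattern, log_line)
    return float(match.group(1)) if match else None

print(extract_metric('r_mean for validation set is: 0.2260'))   # 0.226
print(extract_metric('r_mean for validation set is: -0.0500'))  # None
print(extract_metric('some unrelated log line'))                # None
```

Note that the character class `[0-9.]` has no minus sign, so under this sketch a negative mean correlation would not be captured. Adding an optional `-?` in front of the class would be one way to handle that case.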

Fitting the model

In [None]:
rf.fit({'train':train_input, 'validation':val_input})

Instantiating a predictor object with ``deploy`` method

In [None]:
predictor = rf.deploy(instance_type='ml.c5.large', initial_instance_count=1)

Calculating task 1 metric

In [None]:
X_ldb = pd.read_csv('task1/X_ldb.csv')
y_ldb = pd.read_csv('task1/y_ldb.csv')

In [None]:
y_pred = np.ravel(predictor.predict(X_ldb.iloc[:, 1:].values)) #keep features 2D, flatten only the predictions

Defining function to calculate metric

In [None]:
def task_1_metric(y_true, y_pred, label = 'INTENSITY/STRENGTH'):
    """ Calculate DREAM challenge metric for intensity target for random forest model
        
        y_true: pandas dataframe with two columns: subject # and intensity values
        y_pred: 1D numpy array with predictions.
        label: string with the column name that holds intensity values in y_true
        return: tuple (all_r, mean(all_r)). The first is a list with 49 Pearson coefficients
        the second is the average of these values.
    """
    
    all_r = [] #Initialize an empty list
    individual_ids = y_true.loc[:,'subject #'].unique() #retrieve subject ids {1,...,49}
    
    #loop through subject ids
    for ind in individual_ids: 
        ind_df = y_true.loc[y_true['subject #'] == ind, [label]] #dataframe with individual id "ind" and intensity
        ind_indexes = ind_df.index.to_list() # retrieve ind_df indices
        
        y_ind = ind_df.values #transform dataframe in a numpy array
        y_pred_new = y_pred[ind_indexes] #get rows in predictions array filtered with ind_indexes
        
        r_ind = np.corrcoef(x = y_ind, y = y_pred_new, rowvar = False)[0,1] #calculate Pearson coefficient
        
        all_r.append(r_ind) #Append Pearson to the list of Pearsons [r1,r2,...,r49]
        
    return all_r, np.mean(all_r)

In [None]:
all_r, mean_r = task_1_metric(y_ldb, y_pred, label = 'INTENSITY/STRENGTH')
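As a sanity check of what the metric measures, the following sketch computes the same per-subject mean Pearson correlation on a tiny hand-made example. It uses boolean masks on plain NumPy arrays instead of DataFrame index labels. The helper name `mean_pearson_by_subject` and the toy numbers are illustrative assumptions, not part of the challenge code.

```python
import numpy as np

def mean_pearson_by_subject(subject_ids, y_true, y_pred):
    """Pearson r between y_true and y_pred within each subject, then the mean."""
    subject_ids = np.asarray(subject_ids)
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    all_r = []
    for sid in np.unique(subject_ids):
        mask = subject_ids == sid  # rows that belong to this subject
        all_r.append(np.corrcoef(y_true[mask], y_pred[mask])[0, 1])
    return all_r, float(np.mean(all_r))

# Two toy subjects: predictions perfectly aligned for subject 1,
# perfectly reversed for subject 2.
subjects = [1, 1, 1, 2, 2, 2]
y_true_toy = [10.0, 20.0, 30.0, 5.0, 15.0, 25.0]
y_pred_toy = [1.0, 2.0, 3.0, 3.0, 2.0, 1.0]

all_r_toy, mean_r_toy = mean_pearson_by_subject(subjects, y_true_toy, y_pred_toy)
print(all_r_toy, mean_r_toy)
```

The two correlations cancel out, which shows that the metric rewards getting the ordering right within every subject rather than on average over the pooled data. Selecting rows by mask also avoids relying on the DataFrame index matching row positions in the prediction array, which `task_1_metric` assumes.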

## Launching hyperparameter tuning job

In [None]:
# we use the Hyperparameter Tuner
from sagemaker.tuner import IntegerParameter, CategoricalParameter, ContinuousParameter

# Define exploration boundaries
hyperparameter_ranges = {
    'min_samples_leaf': IntegerParameter(2, 8),
    'min_samples_split': IntegerParameter(4, 16),
    'estimators': IntegerParameter(30, 70)                    
                        }

# create Optimizer
Optimizer = sagemaker.tuner.HyperparameterTuner(
    estimator = rf,
    hyperparameter_ranges=hyperparameter_ranges,
    objective_type='Maximize',
    objective_metric_name='validation r_mean',
    metric_definitions=[
                   {'Name': 'validation r_mean', 'Regex': 'r_mean for validation set is: ([0-9.]+).*$'}
                ],
    max_jobs = 10,
    max_parallel_jobs = 5)

In [None]:
Optimizer.fit({'train':train_input, 'validation':val_input})
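For each tuning job, SageMaker passes the sampled hyperparameters to the entry-point script as command-line arguments. Since `train_rf.py` is not shown in this notebook, the sketch below is only a guess at how such a script could read them with `argparse`. The function name and the default values are hypothetical; the argument names mirror the keys of `hyperparameter_ranges`.

```python
import argparse

def parse_hyperparameters(argv=None):
    """Parse the hyperparameters a tuning job would pass on the command line."""
    parser = argparse.ArgumentParser()
    # Argument names mirror the keys in hyperparameter_ranges; defaults are hypothetical
    parser.add_argument('--estimators', type=int, default=50)
    parser.add_argument('--min_samples_leaf', type=int, default=2)
    parser.add_argument('--min_samples_split', type=int, default=4)
    # parse_known_args ignores any extra arguments the platform may add
    args, _ = parser.parse_known_args(argv)
    return args

args = parse_hyperparameters(
    ['--estimators', '42', '--min_samples_leaf', '3', '--min_samples_split', '8']
)
print(args.estimators, args.min_samples_leaf, args.min_samples_split)
```

The parsed values would then be forwarded to the model constructor inside the training script.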

In [None]:
# get tuner results in a df
results = Optimizer.analytics().dataframe()
while results.empty:
    time.sleep(1)
    results = Optimizer.analytics().dataframe()


In [None]:
cols = ['estimators', 'min_samples_leaf', 'min_samples_split', 'FinalObjectiveValue', 'TrainingJobName']
results[cols].to_csv('task1/rf_tunning.csv', header = True)

Now, I will simply attach the training job with the highest performance to the estimator called ``rf_best``.

In [None]:
results = pd.read_csv('task1/rf_tunning.csv')

In [None]:
#results.to_latex(index = False)

In [None]:
job_name = results.loc[results['FinalObjectiveValue'] == results['FinalObjectiveValue'].max(), 'TrainingJobName'].values[0]
print(job_name)

In [None]:
FRAMEWORK_VERSION = "0.23-1"
train_fname = 'train_rf.py'
script_dir = os.path.join(prefix, 'source_sklearn')
output_path = 's3://{}/{}'.format(bucket, prefix)

rf_best = SKLearn(output_path = output_path,
    source_dir = script_dir,
    entry_point = train_fname,
    framework_version = FRAMEWORK_VERSION,
    instance_type="ml.m4.xlarge",
    role=role,
    metric_definitions=[
                   {'Name': 'validation r_mean', 'Regex': 'r_mean for validation set is: ([0-9.]+).*$'}
                ],
    sagemaker_session=sagemaker_session)

### Deploying

Attaching the job to the estimator

In [None]:
rf_attached = rf_best.attach(job_name)

Instantiating a predictor

In [None]:
predictor = rf_attached.deploy(instance_type='ml.c5.large', initial_instance_count=1)

Loading test data

In [31]:
X_test = pd.read_csv('task1/X_test.csv')
y_test = pd.read_csv('task1/y_test.csv')

Splitting the data into small chunks to avoid excessive run time.

In [None]:
split_X = np.split(X_test.iloc[:, 1:].values[0:3300,:], 10)
last_x = X_test.iloc[:, 1:].values[3300:,:]

In [None]:
y_pred = np.array([predictor.predict(x) for x in split_X])
y_pred2 = predictor.predict(last_x)

In [None]:
y_pred = y_pred.reshape(y_pred.shape[0]*y_pred.shape[1])

Final vector of predictions

In [None]:
y_predf = np.concatenate((y_pred , y_pred2), axis = 0)
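The manual split into ten equal chunks plus a remainder can also be written with `np.array_split`, which accepts row counts that are not divisible by the number of chunks. This is a sketch of an alternative, not the code that produced the results in this notebook. `fake_predict` is a stand-in for `predictor.predict`, and the array size is arbitrary.

```python
import numpy as np

def predict_in_chunks(predict_fn, X, n_chunks=10):
    """Call predict_fn on n_chunks row-wise pieces of X and stitch the results."""
    chunks = np.array_split(X, n_chunks)  # pieces may differ in length by one row
    return np.concatenate([np.asarray(predict_fn(c)) for c in chunks], axis=0)

def fake_predict(x):
    """Stand-in for predictor.predict: one value per input row."""
    return x.sum(axis=1)

X_demo = np.arange(3303 * 2, dtype=float).reshape(3303, 2)  # row count not divisible by 10
y_demo = predict_in_chunks(fake_predict, X_demo)
print(y_demo.shape)
```

Because the pieces are concatenated along the first axis, the same helper works for 1D outputs (task 1) and 2D outputs with one column per target (task 2).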

Calculating challenge metric

In [None]:
all_r, mean_r = task_1_metric(y_test, y_predf, label = 'INTENSITY/STRENGTH')

In [None]:
print('mean pearson is', mean_r)

## Kernel Ridge Regressor

For the KRR, I will have to use the preprocessed data to fit the model, since the original feature space is too large.

## Upload data to s3

In [5]:
prefix = 'task1'
folder = 'krr_train'

In [6]:
path = os.path.join(prefix, folder)
print(path)

task1/krr_train


In [6]:
s3_path = sagemaker_session.upload_data(path, key_prefix = path)

In [7]:
print(s3_path)

s3://sagemaker-us-east-1-004057822769/task1/krr_train


In [7]:
s3_path = 's3://sagemaker-us-east-1-004057822769/task1/krr_train'

## Launching training job

Instantiating the estimator. Now with a function ``create_model()``

In [106]:
def create_model(FRAMEWORK_VERSION = "0.23-1", train_fname = 'train_krr.py', script_dir = os.path.join(prefix, 'source_sklearn'),  output_path = 's3://{}/{}'.format(bucket, prefix)):
    """ Creates a scikit-learn custom estimator
        
        FRAMEWORK_VERSION: a string that refers to the scikit-learn version
        train_fname: string with train file name.
        script_dir: string with directory where train_fname is located
        output_path: string with location on s3 bucket where model artifacts will be saved
        return: Estimator object
    """

    model = SKLearn(output_path = output_path,
                source_dir = script_dir,
                entry_point = train_fname,
                framework_version = FRAMEWORK_VERSION,
                instance_type="ml.m4.xlarge",
                role=role,
                metric_definitions=[
                   {'Name': 'validation r_mean', 'Regex': 'r_mean for validation set is: ([0-9.]+).*$'}
                ],
                sagemaker_session=sagemaker_session)
    return model

In [48]:
krr = create_model()

In [44]:
krr.fit({'train':s3_path})

2021-03-16 21:33:58 Starting - Starting the training job...
2021-03-16 21:34:21 Starting - Launching requested ML instances
ProfilerReport-1615930438: InProgress
.........
2021-03-16 21:35:44 Starting - Preparing the instances for training......
2021-03-16 21:36:44 Downloading - Downloading input data...
2021-03-16 21:37:24 Training - Downloading the training image..
2021-03-16 21:37:38,432 sagemaker-containers INFO     Imported framework sagemaker_sklearn_container.training
2021-03-16 21:37:38,435 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2021-03-16 21:37:38,446 sagemaker_sklearn_container.training INFO     Invoking user training script.

2021-03-16 21:37:44 Training - Training image download completed. Training in progress.
2021-03-16 21:37:45,858 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2021-03-16 21:37:45,874 sagemaker-training-toolkit INFO     No GPUs detected (

## Hyperparameter Tuning

In [126]:
# we use the Hyperparameter Tuner
from sagemaker.tuner import IntegerParameter, CategoricalParameter, ContinuousParameter

# Define exploration boundaries
def hypertuner(model, hyperparameter_ranges):
    """ Creates an optimizer object 
        
        model: Estimator object to use
        hyperparameter_ranges: dictionary with parameter names as keys and their ranges and data types as values
        return: Optimizer object
        
    """
    # create Optimizer
    Optimizer = sagemaker.tuner.HyperparameterTuner(
    estimator = model,
    hyperparameter_ranges=hyperparameter_ranges,
    objective_type='Maximize',
    objective_metric_name='validation r_mean',
    metric_definitions=[
                   {'Name': 'validation r_mean', 'Regex': 'r_mean for validation set is: ([0-9.]+).*$'}
                ],
    max_jobs = 10,
    max_parallel_jobs = 5)
    
    return Optimizer

Instantiating the hypertuner

In [50]:
hyperparameter_ranges = {
    'alpha': ContinuousParameter(0.01, 0.1),
    'gamma': ContinuousParameter(0.1, 1)                  
                        }
krr_optimizer = hypertuner(krr, hyperparameter_ranges)

Fitting the hyperparameter tuner.

In [51]:
krr_optimizer.fit({'train':s3_path})



In [52]:
# get tuner results in a df
results = krr_optimizer.analytics().dataframe()
while results.empty:
    time.sleep(1)
    results = krr_optimizer.analytics().dataframe()

In [56]:
cols = ['alpha', 'gamma', 'FinalObjectiveValue', 'TrainingJobName']
results[cols].to_csv('task1/krr_tunning.csv', header = True)

Exporting as LaTeX for the report

In [89]:
results.sort_values(by = ['FinalObjectiveValue'], ascending = False)[cols][0:3].to_latex(index = False)

'\\begin{tabular}{rrrl}\n\\toprule\n    alpha &  gamma &  FinalObjectiveValue &                                TrainingJobName \\\\\n\\midrule\n 0.032118 &    0.1 &             0.226011 &  sagemaker-scikit-lea-210316-2150-014-c240c60f \\\\\n 0.030319 &    0.1 &             0.226011 &  sagemaker-scikit-lea-210316-2150-015-a6028516 \\\\\n 0.079896 &    0.1 &             0.226005 &  sagemaker-scikit-lea-210316-2150-010-1aaf4551 \\\\\n\\bottomrule\n\\end{tabular}\n'

### Deploying

Retrieving job name with the highest final objective value

In [None]:
job_name = results.loc[results['FinalObjectiveValue'] == results['FinalObjectiveValue'].max(), 'TrainingJobName'].values[0]
print(job_name)

In [60]:
krr = create_model()

Attaching training job to estimator

In [61]:
krr_best = krr.attach(job_name)


2021-03-16 22:09:33 Starting - Preparing the instances for training
2021-03-16 22:09:33 Downloading - Downloading input data
2021-03-16 22:09:33 Training - Training image download completed. Training in progress.
2021-03-16 22:09:33 Uploading - Uploading generated training model
2021-03-16 22:09:33 Completed - Training job completed


Instantiating predictor and deploying model

In [62]:
predictor = krr_best.deploy(instance_type='ml.c5.large', initial_instance_count=1)

---------------!

Loading test data

In [66]:
test_data = pd.read_csv('task1/krr_train/test.csv', header = None)

In [67]:
y_test = test_data.iloc[:, 0:2]
X_test = test_data.iloc[:, 2:]

Splitting the input to avoid excessive runtimes

In [72]:
split_X = np.split(X_test.values[0:3300,:], 10)
last_x = X_test.values[3300:,:]

In [74]:
y_pred = np.array([predictor.predict(x) for x in split_X])
y_pred2 = predictor.predict(last_x)

In [75]:
y_pred = y_pred.reshape(y_pred.shape[0]*y_pred.shape[1])

In [76]:
y_predf = np.concatenate((y_pred , y_pred2), axis = 0)

Creating new metric function

In [85]:
def task_1_metric_krr(y_true, y_pred, label = 0):
    """ Calculate DREAM challenge metric for intensity target for kernel ridge regressor
        
        y_true: pandas dataframe with two unnamed columns: intensity values (column 0) and subject ids (column 1).
        y_pred: 1D numpy array with predictions.
        label: int with the column index that holds intensity values in y_true
        return: tuple (all_r, mean(all_r)). The first is a list with 49 Pearson coefficients
        the second is the average of these values.
    """
    
    all_r = [] #Initialize an empty list
    col_index = 1 #column that contains subject ids
    individual_ids = y_true.iloc[:, col_index].unique() #retrieve subject ids {1,...,49}
    
    #loop through subject ids
    for ind in individual_ids:
        ind_df = y_true.loc[y_true.iloc[:, col_index] == ind].iloc[:, label] #dataframe with individual id "ind" and intensity
        ind_indexes = ind_df.index.to_list() # retrieve ind_df indices as a list
        
        y_ind = ind_df.values #transform dataframe in a numpy array
        y_pred_new = y_pred[ind_indexes] #get rows in predictions array filtered with ind_indexes
        
        r_ind = np.corrcoef(x = y_ind, y = y_pred_new, rowvar = False)[0,1] #calculate Pearson coefficient
        
        all_r.append(r_ind) #Append Pearson to the list of Pearsons [r1,r2,...,r49]
        
    return all_r, np.mean(all_r)

In [86]:
all_r, mean_r = task_1_metric_krr(y_test, y_predf)

In [87]:
mean_r

0.3774739351413951

## 2.2 Task 2 - Predict valence and odor descriptors

## Upload data to s3

In [5]:
if not os.path.isdir('task2/rf_train'):
    os.mkdir('task2/rf_train')
    
if not os.path.isdir('task2/rf_val'):
    os.mkdir('task2/rf_val')

Copying features and target files to the task2/rf folders

In [7]:
from shutil import copyfile
copyfile("task1/X_train.csv", "task2/rf_train/X_train.csv")
copyfile("task2/y_train.csv", "task2/rf_train/y_train.csv")
copyfile("task1/X_ldb.csv", "task2/rf_val/X_ldb.csv")
copyfile("task2/y_ldb.csv", "task2/rf_val/y_ldb.csv")

'task2/rf_val/y_ldb.csv'

In [9]:
prefix = 'task2'
train_folder = 'rf_train'
val_folder = 'rf_val'

In [10]:
train_path = os.path.join(prefix, train_folder)
val_path = os.path.join(prefix, val_folder)
print(train_path)
print('\n')
print(val_path)

task2/rf_train


task2/rf_val


In [11]:
train_input = sagemaker_session.upload_data(train_path, key_prefix = train_path)
val_input = sagemaker_session.upload_data(val_path, key_prefix = val_path)

In [109]:
print(train_input , '\n', val_input)

s3://sagemaker-us-east-1-004057822769/task2/rf_train 
 s3://sagemaker-us-east-1-004057822769/task2/rf_val


In [12]:
if not os.path.isdir('task2/source_sklearn'):
    os.mkdir('task2/source_sklearn')

## Preparing Estimator

Instantiating estimator

In [127]:
rf2 = create_model(FRAMEWORK_VERSION = "0.23-1", train_fname = 'train_rf.py', 
             script_dir = os.path.join(prefix, 'source_sklearn'),  
             output_path = 's3://{}/{}'.format(bucket, prefix))

Defining task 2 metric function to be used in test set

In [179]:
#Defining metric function
def task_2_metric(y_true, y_pred):
    """ Calculate DREAM challenge metric for valence and odor descriptor targets for
        random forest
        
        y_true: pandas dataframe with 21 columns: subject #, valence and 19 odor descriptors.
        y_pred: 2D numpy array with predictions (valence in column 0, odor descriptors in columns 1 to 19).
        return: float with the sum of the mean Pearson for valence and the mean Pearson for odor descriptors
    """
    r_val = [] #Initialize an empty list to hold Pearson for valence
    r_odors = [] #Initialize an empty list to hold Pearson for 19 odor descriptors
    
    col_name = 'subject #' #Column that contains subject ids
    individual_ids = y_true.loc[:,col_name].unique() #retrieve subject unique ids {1,...,49}
    
    #Valence loop-----------------------------------------
     #loop through subject ids
    for ind in individual_ids:
        ind_df = y_true.loc[y_true[col_name] == ind, ['VALENCE/PLEASANTNESS']] #dataframe with individual id "ind" and valence
        ind_indexes = ind_df.index.to_list() # retrieve ind_df indices
        
        epsilon = np.random.randn(len(ind_df))*0.0001 #Small random values to add to columns with zero variance
        
        y_ind = ind_df.values.ravel().astype(float) #transform dataframe in a 1D float numpy array so epsilon can be added in place
        
        #Conditional to check if the variability is 0
        if np.var(y_ind) == 0.0:
            y_ind += abs(epsilon) #Add small random number if True
            
        y_pred_new = y_pred[ind_indexes, 0] #get rows in predictions array filtered with ind_indexes
        
        r_ind = np.corrcoef(x = y_ind, y = y_pred_new, rowvar = False)[0,1] #calculate Pearson coefficient
        
        r_val.append(r_ind) #Append Pearson to the list of Pearsons [r1,r2,...,r49]
        
    r_mean_val = np.mean(r_val) #Calculate mean pearson
    
    #Odors loop-------------------------------------------------------
    
    for col in range(2,21): #Column indexes of odor descriptors values
        print(col)
        for ind in individual_ids:
            
            ind_df = y_true.loc[y_true[col_name] == ind].iloc[:, col] #indexes for individual
            ind_indexes = ind_df.index.to_list() #Transform indexes to list
            
            epsilon = np.random.randn(len(ind_df))*0.0001 #Random noise to avoid error with columns with 0 variance
            
            y_ind = ind_df.values.astype(float)  #Get true outputs for individuals as floats so epsilon can be added in place
            
            if np.var(y_ind) == 0.0: #Conditional to check if the variability is 0
                y_ind += abs(epsilon) #Add small random number if True
                
            y_pred_new = y_pred[ind_indexes, col-1] #Get predictions for individuals
        
            r_ind = np.corrcoef(x = y_ind, y = y_pred_new, rowvar = False)[0,1] #Calculate correlation
        
            r_odors.append(r_ind) #append values to r_odors list
    print(r_mean_val) # Logging
    print(np.mean(r_odors)) #Logging
    return r_mean_val +  np.mean(r_odors) #Return the sum of the two metrics
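The small random `epsilon` in the function above is there because the Pearson coefficient is undefined when one of the vectors has zero variance. The sketch below reproduces that situation on made-up numbers; the arrays and the seed are illustrative assumptions.

```python
import numpy as np

constant = np.zeros(5)                        # e.g. a subject who rated every sample 0
preds = np.array([0.1, 0.4, 0.2, 0.9, 0.3])   # made-up predictions

# With zero variance the correlation is 0/0, which NumPy reports as nan
with np.errstate(invalid='ignore', divide='ignore'):
    r_constant = np.corrcoef(constant, preds)[0, 1]

# Adding a tiny positive jitter gives the vector non-zero variance
rng = np.random.default_rng(0)
jittered = constant + np.abs(rng.standard_normal(5)) * 1e-4
r_jittered = np.corrcoef(jittered, preds)[0, 1]

print(r_constant, r_jittered)
```

The jitter keeps `np.mean` from returning `nan`, but the resulting coefficient only reflects the correlation between the predictions and random noise. An alternative would be to skip such subjects or to average with `np.nanmean`; which option matches the official challenge scoring is not something this notebook establishes.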

## Running training job to check it is correctly configured

Fitting the estimator

In [120]:
rf2.fit({'train':train_input, 'validation':val_input})

2021-03-17 19:55:45 Starting - Starting the training job...
ProfilerReport-1616010945: InProgress
..............................
2021-03-17 20:01:14 Starting - Launching requested ML instances..................
2021-03-17 20:04:19 Starting - Preparing the instances for training...........................
2021-03-17 20:08:42 Downloading - Downloading input data.........
2021-03-17 20:10:23 Training - Downloading the training image..
2021-03-17 20:10:32,621 sagemaker-containers INFO     Imported framework sagemaker_sklearn_container.training
2021-03-17 20:10:32,624 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2021-03-17 20:10:32,636 sagemaker_sklearn_container.training INFO     Invoking user training script.

2021-03-17 20:10:55 Training - Training image download completed. Training in progress.
2021-03-17 20:10:40,660 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2021-03-17 2

## Hyperparameter Tuning

Instantiating ``optimizer`` object

In [128]:
hyperparameter_ranges = {
    'min_samples_leaf': IntegerParameter(2, 15),
    'min_samples_split': IntegerParameter(4, 50),
    'estimators': IntegerParameter(20, 70)                    
                        }

rf_optimizer = hypertuner(rf2, hyperparameter_ranges)

Fitting the optimizer

In [None]:
rf_optimizer.fit({'train':train_input, 'validation':val_input})


Retrieving results

In [132]:
# get tuner results in a df
results = rf_optimizer.analytics().dataframe()
while results.empty:
    time.sleep(1)
    results = rf_optimizer.analytics().dataframe()

Saving results

In [134]:
cols = ['estimators', 'min_samples_leaf', 'min_samples_split', 'FinalObjectiveValue', 'TrainingJobName']
results[cols].to_csv('task2/rf_tunning.csv', header = True)

In [141]:
sorted_results = results[cols].sort_values(by = ['FinalObjectiveValue'], ascending = False)

In [142]:
sorted_results

| | estimators | min_samples_leaf | min_samples_split | FinalObjectiveValue | TrainingJobName |
|---|---|---|---|---|---|
| 9 | 68.0 | 12.0 | 17.0 | 0.461973 | sagemaker-scikit-lea-210317-2022-001-bbda7e89 |
| 0 | 68.0 | 3.0 | 4.0 | 0.459336 | sagemaker-scikit-lea-210317-2022-010-8871460c |
| 4 | 64.0 | 10.0 | 41.0 | 0.456904 | sagemaker-scikit-lea-210317-2022-006-216d11e3 |
| 6 | 37.0 | 12.0 | 17.0 | 0.454657 | sagemaker-scikit-lea-210317-2022-004-54abbf48 |
| 5 | 48.0 | 9.0 | 35.0 | 0.454008 | sagemaker-scikit-lea-210317-2022-005-06d22863 |
| 2 | 34.0 | 4.0 | 23.0 | 0.451621 | sagemaker-scikit-lea-210317-2022-008-989509a4 |
| 3 | 23.0 | 12.0 | 14.0 | 0.450059 | sagemaker-scikit-lea-210317-2022-007-0dc852b2 |
| 7 | 62.0 | 15.0 | 5.0 | 0.445999 | sagemaker-scikit-lea-210317-2022-003-656288f0 |
| 8 | 63.0 | 14.0 | 30.0 | 0.445827 | sagemaker-scikit-lea-210317-2022-002-0ffa76ab |
| 1 | 40.0 | 9.0 | 47.0 | 0.441251 | sagemaker-scikit-lea-210317-2022-009-d2ce40e0 |


Converting to LaTeX code for the report

In [143]:
sorted_results.head(3).to_latex(index = False)

'\\begin{tabular}{rrrrl}\n\\toprule\n estimators &  min\\_samples\\_leaf &  min\\_samples\\_split &  FinalObjectiveValue &                                TrainingJobName \\\\\n\\midrule\n       68.0 &              12.0 &               17.0 &             0.461973 &  sagemaker-scikit-lea-210317-2022-001-bbda7e89 \\\\\n       68.0 &               3.0 &                4.0 &             0.459336 &  sagemaker-scikit-lea-210317-2022-010-8871460c \\\\\n       64.0 &              10.0 &               41.0 &             0.456904 &  sagemaker-scikit-lea-210317-2022-006-216d11e3 \\\\\n\\bottomrule\n\\end{tabular}\n'

### Deploying

In [144]:
job_name = results.loc[results['FinalObjectiveValue'] == results['FinalObjectiveValue'].max(), 'TrainingJobName'].values[0]
print(job_name)

sagemaker-scikit-lea-210317-2022-001-bbda7e89


In [145]:
rf2_best = create_model(FRAMEWORK_VERSION = "0.23-1", train_fname = 'train_rf.py', 
             script_dir = os.path.join(prefix, 'source_sklearn'),  
             output_path = 's3://{}/{}'.format(bucket, prefix))

Attaching best model job name to estimator

In [148]:
rf2_best = rf2_best.attach(job_name)


2021-03-17 20:43:03 Starting - Preparing the instances for training
2021-03-17 20:43:03 Downloading - Downloading input data
2021-03-17 20:43:03 Training - Training image download completed. Training in progress.
2021-03-17 20:43:03 Uploading - Uploading generated training model
2021-03-17 20:43:03 Completed - Training job completed


Instantiating a predictor with ``Estimator.deploy()`` method

In [149]:
predictor = rf2_best.deploy(instance_type='ml.c5.large', initial_instance_count=1)

-------------!

Loading data

In [183]:
X_test = pd.read_csv('task1/X_test.csv')
y_test = pd.read_csv('task2/y_test.csv')

In [184]:
print(X_test.shape)

(2992, 4870)


Splitting data to avoid excessive run time

In [186]:
split_X = np.split(X_test.iloc[:, 1:].values[0:2990,:], 10)
last_x = X_test.iloc[:, 1:].values[2990:,:]

Predicting

In [187]:
y_pred = np.array([predictor.predict(x) for x in split_X])
y_pred2 = predictor.predict(last_x)

Reshaping ``y_pred`` to a 2D array

In [188]:
y_pred = y_pred.reshape(y_pred.shape[0]*y_pred.shape[1], y_pred.shape[2])

Concatenating the first 2990 predictions with the last ones

In [189]:
y_predf = np.concatenate((y_pred , y_pred2), axis = 0)

In [190]:
print(y_predf.shape, y_test.shape)

(2992, 20) (2992, 21)


Calculating metric

In [192]:
mean_r = task_2_metric(y_test, y_predf)

2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
0.35621828233417147
0.1711582233673975


In [193]:
mean_r

0.5273765057015689