## Training Sklearn models

In this notebook we train, test, and use several machine learning models from sklearn.

We start by importing the needed libraries (without being careful to keep only those that are actually used), downloading the datasets locally if needed, and using the `fleet.model_builder.splitters` package to apply a scaffold split to the dataset.

In [1]:
from dotenv import load_dotenv
load_dotenv('../.env')
load_dotenv('../.env.secret')


from typing import Literal, Union, Dict, List

import os
import datetime
import seaborn as sns
import pandas as pd
from pathlib import Path
from humps import camel
from pydantic import BaseModel
import numpy as np
import sklearn.base
import mlflow.sklearn
import mlflow
import mlflow.tracking

from fleet.base_schemas import BaseModelFunctions
from fleet.model_builder.utils import get_references_dict
from fleet.dataset_schemas import DatasetConfigBuilder, DatasetConfig
from fleet import data_types
from fleet.utils import data
from fleet.yaml_model import YAML_Model
from fleet.model_builder.utils import get_class_from_path_string
from fleet.model_builder import splitters
from fleet.metrics import Metrics
from fleet import model_functions
from fleet.scikit_.schemas import SklearnModelSchema, SklearnModelSpec
from fleet.scikit_.model_functions import SciKitFunctions
from fleet.mlflow import log_sklearn_model_and_create_version

os.environ['MLFLOW_TRACKING_URI'] = 'http://localhost:5000'

### Loading the dataset

In [2]:
! [ ! -f HIV.csv ] && wget https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/HIV.csv
! [ ! -f SAMPL.csv ] && wget https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/SAMPL.csv
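
The shell one-liners above depend on `wget` and a POSIX shell. A portable alternative can be sketched in plain Python. `download_if_missing` is a hypothetical helper written for illustration; it is not part of fleet.

```python
from pathlib import Path
from urllib.request import urlretrieve


def download_if_missing(url: str, dest: str) -> bool:
    """Download `url` to `dest` unless the file already exists.

    Returns True when a download happened, False when the local file was reused.
    """
    path = Path(dest)
    if path.exists():
        return False
    urlretrieve(url, str(path))
    return True


# Example usage with a URL from the cell above (requires network access):
# download_if_missing(
#     "https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/SAMPL.csv",
#     "SAMPL.csv",
# )
```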

In [3]:

sampl_df = pd.read_csv('SAMPL.csv')

if 'step' not in sampl_df.columns:
    splitters.apply_split_indexes(
        sampl_df,
        split_type="scaffold",
        split_column="smiles",
        split_target="80-10-10")
    sampl_df.to_csv('SAMPL.csv', index=False)
sampl_df

| Unnamed: 0 | iupac | smiles | expt | calc | step |
|---|---|---|---|---|---|
| 0 | 4-methoxy-N,N-dimethyl-benzamide | CN(C)C(=O)c1ccc(cc1)OC | -11.01 | -9.625 | 1 |
| 1 | methanesulfonyl chloride | CS(=O)(=O)Cl | -4.87 | -6.219 | 1 |
| 2 | 3-methylbut-1-ene | CC(C)C=C | 1.83 | 2.452 | 1 |
| 3 | 2-ethylpyrazine | CCc1cnccn1 | -5.45 | -5.809 | 3 |
| 4 | heptan-1-ol | CCCCCCCO | -4.21 | -2.917 | 1 |
| ... | ... | ... | ... | ... | ... |
| 637 | methyl octanoate | CCCCCCCC(=O)OC | -2.04 | -3.035 | 1 |
| 638 | pyrrolidine | C1CCNC1 | -5.48 | -4.278 | 3 |
| 639 | 4-hydroxybenzaldehyde | c1cc(ccc1C=O)O | -8.83 | -10.050 | 1 |
| 640 | 1-chloroheptane | CCCCCCCCl | 0.29 | 1.467 | 1 |
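
To illustrate how a split such as `"80-10-10"` can produce a `step` column, here is a minimal sketch of a group-aware split. It assumes the scaffold of each molecule has already been computed (the `scaffolds` list is made up), that steps 1, 2, and 3 mean train, validation, and test, and that groups are assigned largest-first. None of this is confirmed to be how `splitters.apply_split_indexes` works internally.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def grouped_split(groups: Sequence[Hashable], split_target: str = "80-10-10") -> List[int]:
    """Return one step per row (1=train, 2=validation, 3=test; assumed mapping).

    Rows that share a group key (for example a scaffold) always get the same step.
    """
    train_pct, val_pct, _test_pct = (int(part) for part in split_target.split("-"))
    n_rows = len(groups)
    train_cutoff = n_rows * train_pct / 100
    val_cutoff = n_rows * (train_pct + val_pct) / 100

    # Collect the row indexes belonging to each group.
    members: Dict[Hashable, List[int]] = defaultdict(list)
    for row, key in enumerate(groups):
        members[key].append(row)

    steps = [0] * n_rows
    n_train = n_val = 0
    # Largest groups first, so the big scaffolds land in the training set.
    for rows in sorted(members.values(), key=len, reverse=True):
        if n_train + len(rows) <= train_cutoff:
            step, n_train = 1, n_train + len(rows)
        elif n_train + n_val + len(rows) <= val_cutoff:
            step, n_val = 2, n_val + len(rows)
        else:
            step = 3
        for row in rows:
            steps[row] = step
    return steps


# Hypothetical scaffold keys for ten molecules.
scaffolds = ["benzene"] * 8 + ["pyrazine"] + ["pyrrolidine"]
steps = grouped_split(scaffolds)
```

Because whole groups move together, structurally similar molecules never appear on both sides of the split, which typically makes validation and test scores lower than with a random split.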


## Defining the YAML model and dataset configs

In the cell below, we parse the model and dataset configurations.
A few things to note:
1. The `featureColumns` names must match the column names in the downloaded CSV.
2. Non-numeric features must be featurized; here we use one of the molfeat featurizers.
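
In the configs of this section, values such as `$smiles` and `$MolFPFeaturizer` appear to refer by name to a dataset column or to the output of an earlier featurizer. The sketch below shows one way such references could be resolved. `resolve_references` is a hypothetical helper for illustration, and fleet's actual mechanism may differ.

```python
from typing import Any, Dict


def resolve_references(args: Dict[str, Any], available: Dict[str, Any]) -> Dict[str, Any]:
    """Replace every "$name" value in `args` with `available["name"]`.

    Values that are not "$"-prefixed strings are passed through unchanged.
    """
    resolved = {}
    for key, value in args.items():
        if isinstance(value, str) and value.startswith("$"):
            name = value[1:]
            if name not in available:
                raise KeyError(f"{value!r} does not match any column or featurizer output")
            resolved[key] = available[name]
        else:
            resolved[key] = value
    return resolved


# Made-up data standing in for the dataset columns.
columns = {"smiles": ["CCCC", "CCO"], "expt": [0.5, -1.2]}
forward_args = resolve_references({"X": "$smiles"}, columns)
```

A lookup of this kind fails early when a name in the YAML does not match a CSV column, which is why the first note above matters.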


In [4]:
from fleet import model_functions

sampl_dataset_config = """

name: SAMPL
featureColumns:
    - name: smiles
      dataType:
        domainKind: smiles
targetColumns:    
    - name: expt
      dataType:
        domainKind: numeric
featurizers:
    - name: MolFPFeaturizer
      type: molfeat.trans.fp.FPVecFilteredTransformer
      forward_args:
          X: $smiles
"""

rf_model_config = """
model:
    type: sklearn.ensemble.RandomForestRegressor
    fitArgs:
        X: $MolFPFeaturizer
        y: $expt
"""

specs = [
    SklearnModelSpec(
        name=name,
        dataset=DatasetConfig.from_yaml_str(dataset_config_yaml),
        spec=SklearnModelSchema.from_yaml_str(model_config_yaml)
    )
    for name, dataset_config_yaml, model_config_yaml in [
         ('rf sampl', sampl_dataset_config, rf_model_config)
    ]
]
specs

[SklearnModelSpec(framework='sklearn', name='rf sampl', dataset=DatasetConfig(name='SAMPL', target_columns=[ColumnConfig(name='expt', data_type=NumericDataType(domain_kind='numeric'))], feature_columns=[ColumnConfig(name='smiles', data_type=SmileDataType(domain_kind='smiles'))], featurizers=[FPVecFilteredTransformerConfig(name='MolFPFeaturizer', constructor_args=FPVecFilteredTransformerConstructorArgs(del_invariant=None, length=None), type='molfeat.trans.fp.FPVecFilteredTransformer', forward_args={'X': '$smiles'})], transforms=[]), spec=SklearnModelSchema(model=RandomForestRegressorConfig(type='sklearn.ensemble.RandomForestRegressor', task_type=['regressor'], constructor_args=RandomForestRegressorConstructorArgs(n_estimators=50, max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=1.0, max_leaf_nodes=None, min_impurity_decrease=0.0, bootstrap=True, oob_score=False, n_jobs=None, ccp_alpha=0.0, max_samples=None), fit_args={'X': '$MolFPFeatur

## Training, validation and testing

Now we're ready to train the models and see how well they predict on this dataset.

In [5]:
import mlflow.sklearn


# Store the ModelFunctions instances in this list to
# test prediction and model persistence separately
trained_model_functions = []

for spec in specs:
    functions = SciKitFunctions(
        spec=spec,
        dataset=sampl_df,
    ) 
    train_metrics = functions.train()
    print(train_metrics)
    val_metrics = functions.val()
    print(val_metrics)
    test_metrics = functions.test()
    print(test_metrics)
    
    trained_model_functions.append(functions)




{'train/mse/expt': 0.4300687909126282, 'train/mae/expt': 0.3930722177028656, 'train/ev/expt': 0.9607484143798277, 'train/mape/expt': 582.1995849609375, 'train/R2/expt': 0.9599645733833313, 'train/pearson/expt': 0.9842469096183777}
{'val/mse/expt': 32.4252815246582, 'val/mae/expt': 3.6422805786132812, 'val/ev/expt': 0.36854564723505845, 'val/mape/expt': 0.6994104981422424, 'val/R2/expt': 0.12269556522369385, 'val/pearson/expt': 0.6623643040657043}
{'test/mse/expt': 18.023508071899414, 'test/mae/expt': 3.3117551803588867, 'test/ev/expt': 0.297905984045984, 'test/mape/expt': 0.5709939002990723, 'test/R2/expt': -0.3636363744735718, 'test/pearson/expt': 0.545944094657898}
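
To help read these numbers, the sketch below computes MSE, MAE, and R2 by hand on made-up values. It assumes the `mse`, `mae`, and `R2` keys above follow the standard definitions, which the notebook does not state. Under that assumption, a negative R2 such as the test value above means the predictions fit the targets worse than a constant prediction of their mean.

```python
from typing import Sequence


def mse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up targets and two made-up sets of predictions.
targets = [1.0, 2.0, 3.0]
mean_baseline = [2.0, 2.0, 2.0]   # always predicts the mean: R2 is 0
reversed_preds = [3.0, 2.0, 1.0]  # worse than the mean: R2 is negative

baseline_r2 = r2(targets, mean_baseline)
reversed_r2 = r2(targets, reversed_preds)
```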


## Using the models

The next cell shows a problem we currently have at prediction time: the call prints the featurized column and then fails with a `ValueError`.

In [6]:
for functions in trained_model_functions:
    functions.predict(pd.DataFrame({
        'smiles': ['CCCC'] 
    }))

0    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
Name: MolFPFeaturizer, dtype: object


ValueError: setting an array element with a sequence.
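
One plausible cause, which the notebook does not confirm, is that the featurizer output is an object-dtype column in which each element is a whole fingerprint vector, and that this column is cast directly to a numeric array. The sketch below reproduces that error with NumPy alone on made-up vectors and shows that stacking the vectors into a 2D matrix avoids it.

```python
import numpy as np

# Object-dtype column: each element is its own fingerprint vector,
# mimicking the `MolFPFeaturizer` Series printed above (values are made up).
column = np.empty(2, dtype=object)
column[0] = [0, 1, 0, 1]
column[1] = [1, 0, 0, 1]

# Casting the object column directly to float fails, because every
# element is a sequence rather than a scalar.
try:
    np.asarray(column, dtype=float)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# Stacking the per-row vectors yields the (n_samples, n_features)
# matrix that sklearn estimators expect.
features = np.stack(list(column)).astype(float)
```

With a pandas Series, the equivalent of the last step would be `np.stack(series.tolist())`.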

## Logging the model and using it with the mlflow API

In [None]:

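for functions in trained_model_functions:
    # NOTE: `result` is not defined anywhere in this notebook. It presumably
    # holds the return value of `log_sklearn_model_and_create_version`
    # (imported above); that call is missing from this cell.
    model_version = result.mlflow_model_version
    # Registry URI of the form models:/<name>/<version>
    model_uri = f'models:/{model_version.name}/{model_version.version}'
    model = mlflow.sklearn.load_model(model_uri)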