# Train ML

> A collection of machine learning tools

## Overview

A collection of utilities for training, evaluating, and deploying scikit-learn models for kinase substrate specificity prediction.

---

**Data Splitting**

`get_splits` - Creates cross-validation splits using stratified, grouped, or stratified-grouped KFold. Grouped splitting prevents data leakage by keeping related samples (e.g., kinases from the same subfamily) in the same fold.

```python
splits = get_splits(
    df=pspa_info,        # DataFrame containing metadata for splitting
    stratified=None,     # column name for stratified sampling (each fold keeps similar class proportions)
    group='subfamily',   # column name for group splitting (train/test never share groups)
    nfold=5,             # number of cross-validation folds
    seed=123,            # random seed for reproducibility
)
```
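
One way to confirm that a grouped split behaves as described is to check that no group label appears on both sides of any fold. The helper below is a hypothetical sanity check, not part of this module. It assumes each split is a `(train_indices, test_indices)` pair of positional indices and that `groups` is a plain sequence aligned with the rows used to build the splits.

```python
def find_group_leakage(splits, groups):
    """Return (fold, shared_groups) for every fold whose train and test sets share a group.

    Hypothetical helper for illustration; `splits` is a list of
    (train_indices, test_indices) pairs of positional indices.
    """
    leaks = []
    for fold, (train_idx, test_idx) in enumerate(splits):
        train_groups = {groups[i] for i in train_idx}
        test_groups = {groups[i] for i in test_idx}
        shared = train_groups & test_groups
        if shared:
            leaks.append((fold, sorted(shared)))
    return leaks


# Toy data: six samples from three made-up subfamilies.
groups = ["A", "A", "B", "B", "C", "C"]

clean_splits = [([2, 3, 4, 5], [0, 1]), ([0, 1, 4, 5], [2, 3]), ([0, 1, 2, 3], [4, 5])]
leaky_splits = [([1, 2, 3, 4, 5], [0])]  # subfamily "A" is on both sides

print(find_group_leakage(clean_splits, groups))  # []
print(find_group_leakage(leaky_splits, groups))  # [(0, ['A'])]
```

With the output of `get_splits`, a call such as `find_group_leakage(splits, pspa_info['subfamily'].tolist())` should return an empty list when `group='subfamily'` was used.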

`split_data` - Splits a dataframe into train/test features and targets based on a single split tuple from `get_splits`.

```python
X_train, y_train, X_test, y_test = split_data(
    df=df,               # full DataFrame with features and targets
    feat_col=feat_col,   # list of feature column names (e.g., T5 embeddings)
    target_col=target_col,  # list of target column names (e.g., PSSM values)
    split=splits[0],     # tuple of (train_indices, test_indices)
)
```

---

**Model Training**

`train_ml` - Fits a single sklearn model on one train/test split and returns the test targets together with the predictions on the test set. Optionally saves the trained model.

```python
y_test, y_pred = train_ml(
    df=df,                    # DataFrame with features and targets
    feat_col=feat_col,        # feature column names
    target_col=target_col,    # target column names
    split=splits[0],          # single split tuple (train_idx, test_idx)
    model=LinearRegression(), # any sklearn-compatible model
    save='models/lr_fold0.joblib',  # path to save model (None to skip)
    params={},                # extra kwargs passed to model.fit()
)
```

`train_ml_cv` - Performs full cross-validation across all splits, returning out-of-fold (OOF) predictions for the entire dataset.

```python
oof = train_ml_cv(
    df=df,                    # DataFrame with features and targets
    feat_col=feat_col,        # feature column names
    target_col=target_col,    # target column names
    splits=splits,            # list of split tuples from get_splits
    model=Ridge(alpha=1.0),   # sklearn model (the same instance is refit on each fold)
    save='ridge',             # base name for saved models (written to models/ridge_0.joblib, etc.)
    params={},                # extra kwargs for model.fit()
)
```
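
The idea behind out-of-fold predictions can be shown in a few lines of plain Python: every sample is predicted exactly once, by a model that never saw it during fitting. This is a minimal sketch rather than the module's implementation. The function name `oof_predictions` is hypothetical, and the "model" is simply the mean of the training targets, standing in for any estimator with `fit` and `predict`.

```python
from statistics import mean


def oof_predictions(targets, splits):
    """Predict every sample with a model fitted only on the other folds.

    The "model" is just the mean of the training targets, a stand-in
    for any estimator with fit/predict.
    """
    oof = [None] * len(targets)
    fold_of = [None] * len(targets)
    for fold, (train_idx, test_idx) in enumerate(splits):
        fitted_mean = mean(targets[i] for i in train_idx)  # "fit" on the training rows only
        for i in test_idx:
            oof[i] = fitted_mean   # "predict" the held-out rows
            fold_of[i] = fold      # remember which fold produced the prediction
    return oof, fold_of


targets = [1.0, 2.0, 3.0, 4.0]
splits = [([2, 3], [0, 1]), ([0, 1], [2, 3])]

oof, fold_of = oof_predictions(targets, splits)
print(oof)      # [3.5, 3.5, 1.5, 1.5]
print(fold_of)  # [0, 0, 1, 1]
```

The `fold_of` list plays the same role as the `nfold` column that `train_ml_cv` attaches to its output.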

---

**Post-Processing**

`post_process` - Cleans raw PSSM predictions by clipping negatives to zero, cleaning position zero, and normalizing each position to sum to 1.

```python
pssm_clean = post_process(
    pssm_df=raw_pssm,    # raw PSSM DataFrame (positions × amino acids)
)
```
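
The clipping and normalization steps can be illustrated without any dependencies. The sketch below is not the module's code: `clip_and_normalize` is a hypothetical name, the PSSM is represented as a nested dict instead of a DataFrame, the values are made up, and the position-zero cleaning is deliberately omitted. Leaving an all-zero position as zeros is a choice made for this example only.

```python
def clip_and_normalize(pssm):
    """Clip negatives to zero, then rescale each position so its values sum to 1.

    `pssm` maps position -> {amino acid -> value}. The position-zero cleaning
    that `post_process` also performs is intentionally left out of this sketch.
    """
    cleaned = {}
    for position, column in pssm.items():
        clipped = {aa: max(value, 0.0) for aa, value in column.items()}
        total = sum(clipped.values())
        # Leave an all-zero position as zeros rather than dividing by zero.
        cleaned[position] = {aa: (v / total if total > 0 else 0.0) for aa, v in clipped.items()}
    return cleaned


raw = {
    -1: {"A": 0.5, "S": -0.25, "T": 1.5},  # made-up values
    1: {"A": 1.0, "S": 1.0, "T": 2.0},
}
clean = clip_and_normalize(raw)
print(clean[-1])  # {'A': 0.25, 'S': 0.0, 'T': 0.75}
print(clean[1])   # {'A': 0.25, 'S': 0.25, 'T': 0.5}
```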

`post_process_oof` - Applies `post_process` to all rows in an OOF prediction DataFrame.

```python
oof_clean = post_process_oof(
    oof_ml=oof,          # OOF DataFrame from train_ml_cv
    target_col=target_col,  # target column names to process
)
```

---

**Scoring**

`get_score` - Computes a per-sample score between target and prediction using a custom function.

```python
scores = get_score(
    target=df[target_col],  # ground truth DataFrame
    pred=oof[target_col],   # predictions DataFrame
    func=js_divergence_flat,  # scoring function (target_row, pred_row) -> float
)
```

**Convenience partials** - Pre-configured scorers:

```python
jsd_scores = get_score_jsd(target=df[target_col], pred=oof)  # Jensen-Shannon divergence
kld_scores = get_score_kld(target=df[target_col], pred=oof)  # KL divergence
ce_scores  = get_score_ce(target=df[target_col], pred=oof)   # Cross-entropy
```
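
For readers unfamiliar with the metrics, here is a minimal sketch of Jensen-Shannon divergence between two discrete distributions. It is not the `js_divergence_flat` implementation from katlas, which may use a different log base or aggregate over PSSM positions. The function names below are hypothetical and log base 2 is an assumption, chosen so that the result lies between 0 and 1.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits, bounded in [0, 1]."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # midpoint distribution
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


print(js_divergence([0.5, 0.5], [0.5, 0.5]))  # 0.0
print(js_divergence([1.0, 0.0], [0.0, 1.0]))  # 1.0
```

Unlike KL divergence, this quantity is symmetric in its arguments and stays finite when one distribution has zeros where the other does not, which makes it convenient for comparing predicted and measured PSSMs.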

---

**Inference**

`predict_ml` - Loads a saved model and generates predictions on new data.

```python
predictions = predict_ml(
    df=new_data,              # DataFrame containing features
    feat_col=feat_col,        # feature column names (must match training)
    target_col=target_col,    # column names for output DataFrame
    model_pth='models/ridge_0.joblib',  # path to saved model
)
```

---

**Typical Workflow**

```python
# 1. Prepare splits (group by subfamily to prevent leakage)
splits = get_splits(df=info, group='subfamily', nfold=5)

# 2. Train with cross-validation
oof = train_ml_cv(df=df, feat_col=feat_col, target_col=target_col, 
                  splits=splits, model=Ridge(), save='ridge')

# 3. Post-process predictions
oof = post_process_oof(oof_ml=oof, target_col=target_col)

# 4. Evaluate
info['jsd'] = get_score_jsd(target=df[target_col], pred=oof)
info['nfold'] = oof['nfold']  # fold assignment added by train_ml_cv
print(f"Mean JSD per fold:\n{info.groupby('nfold').jsd.mean()}")

# 5. Deploy
pred = predict_ml(df=test_df, feat_col=feat_col, target_col=target_col,
                  model_pth='models/ridge_0.joblib')
```

## Setup

In [None]:
#| default_exp train

In [None]:
#| export
# katlas
from katlas.data import Data
from katlas.pssm.core import *
from katlas.pssm.compare import *
# from katlas.feature import *
from functools import partial

# essentials
import pandas as pd, numpy as np
from joblib import dump, load
import math,matplotlib.pyplot as plt
from pathlib import Path

# scipy
from scipy.stats import spearmanr, pearsonr

# sklearn
from sklearn.model_selection import StratifiedKFold, GroupKFold, StratifiedGroupKFold
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import *
from sklearn.svm import *
from sklearn.ensemble import *

from sklearn import set_config
set_config(transform_output="pandas")

## Splitter

In [None]:
#| export
def get_splits(df: pd.DataFrame, # df contains info for split
               stratified: str=None, # colname for stratified kfold; each fold keeps similar class proportions
               group: str=None, # colname for group kfold; train and test never share a group
               nfold: int=5, # number of folds
               seed: int=123, # random seed; not used in group-only mode, where GroupKFold does not shuffle
               ):
    
    "Split samples in a dataframe with the StratifiedKFold, GroupKFold, or StratifiedGroupKFold method"
    def _log(colname):
        print(kf)
        split=splits[0]
        # splits hold positional indices, so select rows with iloc rather than loc
        print(f'# {colname} in train set: {df.iloc[split[0]][colname].unique().shape[0]}')
        print(f'# {colname} in test set: {df.iloc[split[1]][colname].unique().shape[0]}')
        
    splits = []
    if stratified is not None and group is None:
        kf = StratifiedKFold(nfold, shuffle=True, random_state=seed)
        for split in kf.split(df.index, df[stratified]):
            splits.append(split)
            
        _log(stratified)
        
    elif group is not None and stratified is None:
        kf = GroupKFold(nfold)
        for split in kf.split(df.index, groups=df[group]):
            splits.append(split)
            
        _log(group)
        
    elif stratified is not None and group is not None:
        kf = StratifiedGroupKFold(nfold, shuffle=True, random_state=seed)
        for split in kf.split(df.index, groups=df[group], y=df[stratified]):
            splits.append(split)
            
        _log(stratified)

    else:
        raise ValueError("Either 'stratified' or 'group' argument must be provided.")
    
    return splits

In [None]:
# df=pd.read_parquet('paper/kinase_domain/train/pspa_t5.parquet')

In [None]:
# info=Data.get_kinase_info()

# info = info[info.pseudo=='0']

# info = info[info.kd_ID.notna()]

# subfamily_map = info[['kd_ID','subfamily']].drop_duplicates().set_index('kd_ID')['subfamily']

# pspa_info = pd.DataFrame(df.index.tolist(),columns=['kinase'])

# pspa_info['subfamily'] = pspa_info.kinase.map(subfamily_map)

# splits = get_splits(pspa_info, group='subfamily',nfold=5)

# split0 = splits[0]

In [None]:
# df=df.reset_index()

In [None]:
# df.columns

In [None]:
# # column name of feature and target
# feat_col = df.columns[df.columns.str.startswith('T5_')]
# target_col = df.columns[~df.columns.isin(feat_col)][1:]

In [None]:
# feat_col

In [None]:
# target_col

In [None]:
#| export
def split_data(df: pd.DataFrame, # dataframe of values
               feat_col: list, # feature columns
               target_col: list, # target columns
               split: tuple # one split from splits, as (train_idx, test_idx)
               ):
    "Given split tuple, split dataframe into X_train, y_train, X_test, y_test"
    
    # split holds positional indices from KFold.split, so select rows with iloc
    X_train = df.iloc[split[0]][feat_col]
    y_train = df.iloc[split[0]][target_col]
    
    X_test = df.iloc[split[1]][feat_col]
    y_test = df.iloc[split[1]][target_col]
    
    return X_train, y_train, X_test, y_test

In [None]:
# X_train, y_train, X_test, y_test = split_data(df,feat_col, target_col, split0)

In [None]:
# X_train.shape,y_train.shape,X_test.shape,y_test.shape

## Trainer

In [None]:
#| export
def train_ml(df, # dataframe of values
             feat_col, # feature columns
             target_col, # target columns
             split, # one split in splits
             model,  # a sklearn model
             save = None, # file (.joblib) to save, e.g. 'model.joblib'
             params=None, # dict parameters for model.fit from sklearn
            ):
    
    "Fit and predict using sklearn model format, return target and pred of the test set."
    
    # split data
    X_train, y_train, X_test, y_test = split_data(df, feat_col, target_col, split)
    
    # Fit the model
    # for a single target column, sklearn estimators generally expect a 1-D y, e.g. y_train.to_numpy().ravel()
    model.fit(X_train, y_train, **(params or {}))
    
    if save is not None:
        # Save the fitted model to a file
        dump(model, save)
        
    # Predict train (kept for inspection; not returned)
    y_train_pred = model.predict(X_train) # X_train is a dataframe, y_train_pred is a numpy array
    
    # Predict test
    y_pred = model.predict(X_test) # X_test is a dataframe, y_pred is a numpy array

    # Make dataframe
    y_pred = pd.DataFrame(y_pred,index=y_test.index, columns = y_test.columns)
    
    return y_test, y_pred

In [None]:
# model = LinearRegression()

# ## Uncomment to run with saving model
# # target,pred = train_ml(df, feat_col, target_col, split0, model,'model.joblib')

# # Run without saving model
# target,pred = train_ml(df, feat_col, target_col, split0, model)

# pred.head()

## Cross-Validation

In [None]:
#| export
def train_ml_cv( df, # dataframe of values
                 feat_col, # feature columns
                 target_col,  # target columns
                 splits, # splits
                 model, # sklearn model
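                 save = None, # base model name, e.g., 'LR'; fold k is saved to models/LR_k.joblib (models/ must exist)
                 params = None, # act as kwargs, for model.fit
                ):
    
    "Cross-validate over the given splits and return out-of-fold predictions with an `nfold` column"
    
    OOF = []
    
    for fold, split in enumerate(splits):
        # print(f'------ fold: {fold} --------')
        
        # Build the per-fold path in a separate variable so `save` keeps the base name across folds
        save_path = f'models/{save}_{fold}.joblib' if save is not None else None
            
        target, pred = train_ml(df, feat_col, target_col, split, model, save_path, params=params)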
        
        pred['nfold'] = fold
        OOF.append(pred)
        
    # Concatenate OOF from each fold to a new dataframe
    oof = pd.concat(OOF).sort_index()
    
    return oof

In [None]:
# oof = train_ml_cv(df,feat_col,target_col,splits=splits,model=model)

## Score

In [None]:
#| export
def post_process(pssm_df):
    "Clip negative values to 0, clean all but the last three values in position zero, and normalize each position"
    pssm = pssm_df.copy()
    pssm = pssm.clip(lower=0)
    return clean_zero_normalize(pssm)

In [None]:
# pssm = post_process(recover_pssm(oof.iloc[0,:-1].sort_values()))

In [None]:
# pssm.sum()

In [None]:
#| export
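def post_process_oof(oof_ml,target_col):
    "Apply `post_process` to the target columns of every row in an OOF dataframe"
    def _clean_row(r):
        pssm = post_process(recover_pssm(r[target_col]))
        return pd.Series(flatten_pssm(pssm, column_wise=False))
    oof = oof_ml.copy()
    oof[target_col] = oof.apply(_clean_row, axis=1)
    return oof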

In [None]:
# oof = post_process_oof(oof,target_col)

In [None]:
#| export
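def get_score(target,pred,func):
    "Score each row of `target` against the matching row of `pred` with `func`; return a Series indexed like `target`"
    distance = [func(target.loc[i],pred.loc[i,target.columns]) for i in target.index]
    return pd.Series(distance,index=target.index)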

In [None]:
#| export
get_score_jsd = partial(get_score,func=js_divergence_flat)

In [None]:
#| export
get_score_kld = partial(get_score,func=kl_divergence_flat)

In [None]:
# target = df[target_col].copy()

In [None]:
# pspa_info['jsd'] =get_score_jsd(target,oof)
# pspa_info['kld'] =get_score_kld(target,oof)

In [None]:
# pspa_info['jsd']

In [None]:
# pspa_info['kld']

In [None]:
#| export
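def calculate_ce(target_series,pred_series):
    "Cross-entropy between a target and a predicted flattened PSSM: summed within each column of the recovered PSSM, then averaged across columns"
    return float((-(np.log(recover_pssm(pred_series+EPSILON))*(recover_pssm(target_series))).sum()).mean())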
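
Written out, the expression in `calculate_ce` corresponds to the formula below. Here $T$ and $\hat{P}$ denote `recover_pssm` applied to the target and to the prediction, $r$ and $c$ index the rows and columns of the recovered matrix, $C$ is the number of columns, $\epsilon$ is `EPSILON`, and $\log$ is the natural logarithm. Which axis holds positions and which holds amino acids depends on `recover_pssm` and is not assumed here.

$$
\mathrm{CE} = \frac{1}{C} \sum_{c=1}^{C} \left( -\sum_{r} T_{rc} \, \log\left(\hat{P}_{rc} + \epsilon\right) \right)
$$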

In [None]:
#| export
get_score_ce = partial(get_score,func=calculate_ce)

In [None]:
# pspa_info['ce'] =get_score_ce(target,oof)

In [None]:
# pspa_info['ce']

In [None]:
# pspa_info['nfold'] = oof['nfold']

In [None]:
# pspa_info.groupby('nfold').jsd.mean()

## Predictor

In [None]:
#| export
def predict_ml(df, # Dataframe that contains features
               feat_col, # feature columns
               target_col=None, # column names for the output dataframe
               model_pth = 'model.joblib' # path to the saved model
              ):
    
    "Make predictions based on trained model."
    
    test = df[feat_col]
    
    model = load(model_pth)
    
    pred = model.predict(test)
    
    pred_df = pd.DataFrame(pred,index=df.index,columns=target_col)
    
    return pred_df

Uncomment the cell below to run it if you have a saved model at `model_pth`:

In [None]:
# pred2 = predict_ml(X_test,feat_col, target_col, model_pth = 'model.joblib')
# pred2.head()
# # or, without target_col (output columns get default integer names):
# predict_ml(df.iloc[split0[1]], feat_col, model_pth='model.joblib')

## Export -

In [None]:
#| hide
import nbdev; nbdev.nbdev_export()