# The Bernstein Bears CRP Submission 1

## Theme: Scaled TunA with PyTorch-Lightning, Optuna, and LGBMRegressor

## The Gist
1. Fine-tune a RoBERTa sequence classification model from Hugging Face
2. Scale target values with scikit-learn's StandardScaler
3. Use PyTorch as the model framework (tried PyTorch Lightning, but it is still buggy)
4. Optional: Find the best hyperparameters for the RoBERTa model with Optuna
5. Optional: Run each Optuna trial through k-fold validation and return the mean score as the trial score
6. Train the RoBERTa model with the best params on all train data and run inference
7. Join the RoBERTa inference output with the original text data and use the textstat library to create new features
8. Run feature selection with a support vector regressor (SVR) from scikit-learn
9. Select the top features (these should be the RoBERTa logits) and manually add the text standard category
10. Scale inputs and targets again to form a new LGBM dataset
11. Run the LGBM dataset through an LGBMRegressor hyperparameter search with a second Optuna study
12. Apply the best params to LGBMRegressor and apply it to all train data for predictions
13. Strip extra fields from the test data and apply LGBMRegressor to the test data
14. Apply the scaler's inverse transformation to the predicted target values
15. Submit final predictions as submission.csv
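
Steps 2 and 14 form a round trip: targets are standardized before training, and predictions are mapped back to the original scale before submission. The sketch below illustrates that round trip without any dependencies. It is not the notebook's code: `fit_scaler`, `transform`, and `inverse_transform` are hypothetical helpers that mimic what `StandardScaler.fit_transform` and `StandardScaler.inverse_transform` do for a single column (population standard deviation, with a scale of 1 for a constant column), and the numbers are made up.

```python
from statistics import fmean, pstdev


def fit_scaler(values):
    """Return (mean, std) for one column, using the population std."""
    mean = fmean(values)
    std = pstdev(values)
    if std == 0:
        # A constant column would divide by zero; fall back to a scale of 1.
        std = 1.0
    return mean, std


def transform(values, mean, std):
    """Standardize values to zero mean and unit variance."""
    return [(v - mean) / std for v in values]


def inverse_transform(scaled, mean, std):
    """Map standardized values back to the original scale."""
    return [s * std + mean for s in scaled]


# Toy targets (made-up numbers, not competition data).
targets = [-2.0, -1.0, 0.0, 1.0, 2.0]
mean, std = fit_scaler(targets)
scaled = transform(targets, mean, std)

# Pretend a model trained on the scaled targets produced these predictions.
scaled_predictions = [0.5 * s for s in scaled]

# The submission must be in the original units, so undo the scaling last.
predictions = inverse_transform(scaled_predictions, mean, std)
print(predictions)
```

The key design point is that the same fitted mean and standard deviation are reused for the inverse step, so the scaler fitted on the train targets has to be kept around until the final predictions are written.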

## Some Key Learnings
1. Scaling values for both inputs and targets is important
2. The text standard category is quite powerful in an LGBM regressor as a categorical feature
3. Most of the textstat features seem to be poor predictors of the target value
4. K-fold validation inside the Optuna study works nicely for the ML model
5. K-fold validation inside Optuna for deep learning works but consumes a ton of memory
6. To reduce the memory footprint, be sure to set output hidden states to false for RoBERTa
7. Also try to avoid deep copies or copies of a pandas df, while still avoiding SettingWithCopy issues
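
Learnings 4 and 5 refer to scoring each hyperparameter trial by its mean validation score across folds. The following is a minimal, dependency-free sketch of that idea, not the notebook's implementation: `kfold_indices` and `trial_score` are hypothetical names, the "model" is a stand-in that predicts a shrunken training mean, and the folds are interleaved without shuffling to keep the example deterministic. In the notebook's setting, scikit-learn's `KFold` would produce the splits and an Optuna objective would return the mean score.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error, the metric used for this competition."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, valid_idx) pairs using interleaved, unshuffled folds."""
    indices = list(range(n_samples))
    folds = [indices[k::n_splits] for k in range(n_splits)]
    for k, valid_idx in enumerate(folds):
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train_idx, valid_idx


def trial_score(ys, shrinkage, n_splits=5):
    """Hypothetical objective: mean validation RMSE over all folds for one trial."""
    fold_scores = []
    for train_idx, valid_idx in kfold_indices(len(ys), n_splits):
        # Stand-in model: predict the training mean scaled by the hyperparameter.
        train_mean = sum(ys[i] for i in train_idx) / len(train_idx)
        y_valid = [ys[i] for i in valid_idx]
        fold_scores.append(rmse(y_valid, [shrinkage * train_mean] * len(y_valid)))
    return sum(fold_scores) / len(fold_scores)


# Toy targets and a tiny grid standing in for the sampler's suggestions.
ys = [0.1 * i for i in range(20)]
scores = {shrinkage: trial_score(ys, shrinkage) for shrinkage in (0.0, 0.5, 1.0)}
best = min(scores, key=scores.get)
print(best, scores[best])
```

Each trial fits one model per fold, which is cheap for a gradient-boosted regressor but means several full fine-tuning runs per trial for a transformer. That is consistent with the memory pressure noted in learning 5.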

In [None]:
%%capture

# install necessary libraries from input
# import progressbar library for offline usage
!ls ../input/progresbar2local
!pip install progressbar2 --no-index --find-links=file:///kaggle/input/progresbar2local/progressbar2

# import text stat library for additional ml data prep
!ls ../input/textstat-local
!pip install textstat --no-index --find-links=file:///kaggle/input/textstat-local/textstat 

In [None]:
FAST_DEV_RUN = False
USE_CHECKPOINT = True
USE_HIDDEN_IN_RGR = False
N_FEATURES_TO_USE_HEAD = 7
N_FEATURES_TO_USE_TAIL = None
# in this kernel, run train on all data to maximize score on held out data but use what we learned about optimal parameters
# set to 16 bit precision to cut compute requirements/increase batch size capacity
USE_16_BIT_PRECISION = True
# set a seed value for consistent experimentation; optional, else leave as None
SEED_VAL = 42
# set a train-validation split, e.g. .8 means 80% of train data for training and 20% for the validation set
TRAIN_VALID_SPLIT = .8 # if None, then don't split
# set hyperparameters learned from tuning: https://www.kaggle.com/justinchae/tune-roberta-pytorch-lightning-optuna
MAX_EPOCHS = 4
BATCH_SIZE = 16
GRADIENT_CLIP_VAL = 0.18318092164684585
LEARNING_RATE = 3.613894271216525e-05
TOKENIZER_MAX_LEN = 363
WARMUP_STEPS = 292
WEIGHT_DECAY = 0.004560699842170359

In [None]:
import kaggle_config
from kaggle_config import (WORKFLOW_ROOT, DATA_PATH, CACHE_PATH, FIG_PATH, 
                           MODEL_PATH, ANALYSIS_PATH, KAGGLE_INPUT, 
                           CHECKPOINTS_PATH, LOGS_PATH)

INPUTS, DEVICE = kaggle_config.run()
KAGGLE_TRAIN_PATH = kaggle_config.get_train_path(INPUTS)
KAGGLE_TEST_PATH = kaggle_config.get_test_path(INPUTS)

import pytorch_lightning as pl
from pytorch_lightning import loggers as pl_loggers
from pytorch_lightning import seed_everything
from pytorch_lightning.callbacks import ModelCheckpoint, EarlyStopping
from pytorch_lightning.tuner.batch_size_scaling import scale_batch_size
from pytorch_lightning.tuner.lr_finder import _LRFinder, lr_find

import torchmetrics

import optuna
from optuna.integration import PyTorchLightningPruningCallback
from optuna.samplers import TPESampler, RandomSampler, CmaEsSampler
from optuna.visualization import (plot_intermediate_values
                                  , plot_optimization_history
                                  , plot_param_importances)

import optuna.integration.lightgbm as lgb
import lightgbm as lgm

from sklearn.model_selection import KFold, cross_val_score, RepeatedKFold, train_test_split
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.decomposition import PCA
from sklearn.feature_selection import RFE, f_regression, mutual_info_regression, SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error

import math

import textstat

import torch
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from torch.utils.data.dataset import random_split

import tensorflow as tf

from transformers import (RobertaForSequenceClassification
                          , RobertaTokenizer
                          , AdamW
                          , get_linear_schedule_with_warmup)

import os
import pandas as pd
import numpy as np

import gc
from functools import partial

from typing import List, Dict
from typing import Optional
from argparse import ArgumentParser

import random

if SEED_VAL is not None:
    random.seed(SEED_VAL)
    np.random.seed(SEED_VAL)
    seed_everything(SEED_VAL)
    
NUM_DATALOADER_WORKERS = os.cpu_count()

try: 
    resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='')
    tf.config.experimental_connect_to_cluster(resolver)
    tf.tpu.experimental.initialize_tpu_system(resolver)
    n_tpus = len(tf.config.list_logical_devices('TPU'))
except ValueError:
    n_tpus = 0

ACCELERATOR_TYPE = {}
ACCELERATOR_TYPE.update({'gpus': torch.cuda.device_count() if torch.cuda.is_available() else None})
ACCELERATOR_TYPE.update({'tpu_cores': n_tpus if n_tpus > 0 else None})
# still debugging how to best toggle between tpu and gpu; there's too much code to configure to work simply
print("ACCELERATOR_TYPE:\n", ACCELERATOR_TYPE)

PRETTRAINED_ROBERTA_BASE_MODEL_PATH = "/kaggle/input/pre-trained-roberta-base"
PRETRAINED_ROBERTA_BASE_TOKENIZER_PATH = "/kaggle/input/tokenizer-roberta"
PRETRAINED_ROBERTA_BASE_TOKENIZER = RobertaTokenizer.from_pretrained(PRETRAINED_ROBERTA_BASE_TOKENIZER_PATH)

TUNED_CHECKPOINT_PATH = "/kaggle/input/best-crp-ckpt-4/crp_roberta_trial_4.ckpt"
# from: https://www.kaggle.com/justinchae/crp-regression-with-roberta-and-lightgbm
TUNED_BEST_ROBERTA_PATH = "/kaggle/input/my-best-tuned-roberta"
# from: https://www.kaggle.com/justinchae/the-bernstein-bears-crp-submission-2
MODEL_42 = "/kaggle/input/model-42"
MODEL_0 = "/kaggle/input/model-0"
MODEL_21 = "/kaggle/input/model-21"

In [None]:
"""Implementing Lightning instead of torch.nn.Module
"""
class LitRobertaLogitRegressor(pl.LightningModule):
    def __init__(self, pre_trained_path: str
                     , output_hidden_states: bool = False
                     , num_labels: int = 1
                     , layer_1_output_size: int = 64
                     , layer_2_output_size: int = 1
                     , learning_rate: float = 1e-5
                     , task_name: Optional[str] = None
                     , warmup_steps: int = 100
                     , weight_decay: float = 0.0
                     , adam_epsilon: float = 1e-8
                     , batch_size: Optional[int] = None
                     , train_size: Optional[int] = None
                     , max_epochs: Optional[int] = None
                     , n_gpus: Optional[int] = 0
                     , n_tpus: Optional[int] = 0 
                     , accumulate_grad_batches = None
                     , tokenizer = None
                     , do_decode = False
                ):
        """refactored from: https://www.kaggle.com/justinchae/my-bert-tuner and https://www.kaggle.com/justinchae/roberta-tuner
        """
        super(LitRobertaLogitRegressor, self).__init__()
        
        # this saves class params as self.hparams
        self.save_hyperparameters()
        
        self.model = RobertaForSequenceClassification.from_pretrained(self.hparams.pre_trained_path
                                                                      , output_hidden_states=self.hparams.output_hidden_states
                                                                       , num_labels=self.hparams.num_labels
                                                                        )

        # n_gpus may be None when no GPU is available, so fall back to a multiplier of 1
        self.accelerator_multiplier = n_gpus if n_gpus else 1
        
        self.config = self.model.config
        self.parameters = self.model.parameters
        self.save_pretrained = self.model.save_pretrained
        # these layers are not currently used, tbd in future iteration
        self.layer_1 = torch.nn.Linear(768, layer_1_output_size)
        self.layer_2 = torch.nn.Linear(layer_1_output_size, layer_2_output_size)
        self.tokenizer = tokenizer
        self.do_decode = do_decode
        self.output_hidden_states = output_hidden_states
        
        def rmse_loss(x, y):
            criterion = F.mse_loss
            loss = torch.sqrt(criterion(x, y))
            return loss
        
        # TODO: enable toggle for various loss funcs and torchmetrics package
        self.loss_func = rmse_loss
#         self.eval_func = rmse_loss   
        
    def setup(self, stage=None) -> None:
        if stage == 'fit':
            # when this class is called by trainer.fit, this stage runs and so on
            # Calculate total steps
            tb_size = self.hparams.batch_size * self.accelerator_multiplier
            ab_size = self.hparams.accumulate_grad_batches * float(self.hparams.max_epochs)
            self.total_steps = (self.hparams.train_size // tb_size) // ab_size
        
    def extract_logit_only(self, input_ids, attention_mask) -> np.ndarray:
        output = self.model(input_ids=input_ids, attention_mask=attention_mask)
        logit = output.logits
        logit = logit.cpu().numpy().astype(float)
        return logit
    
    def extract_hidden_only(self, input_ids, attention_mask) -> np.ndarray:
        # pass the attention mask (not the input ids) as the attention mask
        output = self.model(input_ids=input_ids, attention_mask=attention_mask)
        hidden_states = output.hidden_states
        x = torch.stack(hidden_states[-4:]).sum(0)
        m1 = torch.nn.Sequential(self.layer_1
                                 , self.layer_2
                                 , torch.nn.Flatten())
        x = m1(x)
        x = torch.squeeze(x).cpu().numpy()
        
        return x
        
    def forward(self, input_ids, attention_mask) -> torch.Tensor:
        output = self.model(input_ids=input_ids, attention_mask=attention_mask)
        x = output.logits
        return x
    
    def training_step(self, batch, batch_idx: int) -> torch.Tensor:
        # refactored from: https://www.kaggle.com/justinchae/epoch-utils
        labels, encoded_batch, kaggle_ids = batch
        input_ids = encoded_batch['input_ids']
        attention_mask = encoded_batch['attention_mask']
        # per docs, keep train step separate from forward call
        output = self.model(input_ids=input_ids, attention_mask=attention_mask)
        y_hat = output.logits
        # quick reshape to align labels to predictions
        labels = labels.view(-1, 1)
        loss = self.loss_func(y_hat, labels)
        self.log('train_loss', loss)
        return loss
    
    def validation_step(self, batch, batch_idx: int) -> torch.Tensor:
        # refactored from: https://www.kaggle.com/justinchae/epoch-utils
        labels, encoded_batch, kaggle_ids = batch
        input_ids = encoded_batch['input_ids']
        attention_mask = encoded_batch['attention_mask']
        # this self call is calling the forward method
        y_hat = self(input_ids, attention_mask)
        # quick reshape to align labels to predictions
        labels = labels.view(-1, 1)
        loss = self.loss_func(y_hat, labels)
        self.log('val_loss', loss)
        return loss
    
    def predict(self, batch, batch_idx: int, dataloader_idx: int = None):
        # creating this predict method overrides the pl predict method
        target, encoded_batch, kaggle_ids = batch
        
        input_ids = encoded_batch['input_ids']
        attention_mask = encoded_batch['attention_mask']
        # this self call is calling the forward method
        y_hat = self(input_ids, attention_mask)
        # convert to numpy then list like struct to zip with ids
        y_hat = y_hat.cpu().numpy().ravel()
        # customizing the predict behavior to account for unique ids
        if self.tokenizer is not None and self.do_decode:
            target = target.cpu().numpy().ravel() if len(target) > 0 else None
            
            excerpt = self.tokenizer.batch_decode(input_ids.cpu().numpy()
                                            , skip_special_tokens=True
                                            , clean_up_tokenization_spaces=True)
            if self.output_hidden_states:   
                hidden_states = self.extract_hidden_only(input_ids=input_ids
                                                             , attention_mask=attention_mask)
            else:
                hidden_states = None
            
            if target is not None:
                predictions = list(zip(kaggle_ids
                                       , target
                                       , y_hat
#                                        , hidden_states
                                      ))
                predictions = pd.DataFrame(predictions, columns=['id'
                                                                 , 'target'
                                                                 , 'logit'
#                                                                  , 'hidden_states'
                                                                ])
            else:
                predictions = list(zip(kaggle_ids
                                       , y_hat
#                                        , hidden_states
                                      ))
                predictions = pd.DataFrame(predictions, columns=['id'
                                                                 , 'logit'
#                                                                  , 'hidden_states'
                                                                ])
                
        else:
            predictions = list(zip(kaggle_ids, y_hat))
            predictions = pd.DataFrame(predictions, columns=['id', 'target'])

        return predictions
    
    def configure_optimizers(self) -> torch.optim.Optimizer:
        # Reference: https://pytorch-lightning.readthedocs.io/en/latest/notebooks/lightning_examples/text-transformers.html
        model = self.model
        
        no_decay = ["bias", "LayerNorm.weight"]
        
        optimizer_grouped_parameters = [
            {
                "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
                "weight_decay": self.hparams.weight_decay,
            },
            {
                "params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)],
                "weight_decay": 0.0,
            },
        ]
        
        optimizer = AdamW(optimizer_grouped_parameters, lr=self.hparams.learning_rate, eps=self.hparams.adam_epsilon)
        
        scheduler = get_linear_schedule_with_warmup(
            optimizer, num_warmup_steps=self.hparams.warmup_steps, num_training_steps=self.total_steps
        )
        scheduler = {'scheduler': scheduler, 'interval': 'step', 'frequency': 1}
        
        return [optimizer], [scheduler]

In [None]:
def my_collate_fn(batch
                 , tokenizer
                 , max_length: int = 100
                 , return_tensors: str = 'pt'
                 , padding: str = "max_length"
                 , truncation: bool = True
                 ):
    # source: https://www.kaggle.com/justinchae/nn-utils
    labels = []
    batch_texts = []
    kaggle_ids = []

    for (_label, batch_text, kaggle_id) in batch:
        if _label is not None:
            labels.append(_label)
        
        batch_texts.append(batch_text)
        kaggle_ids.append(kaggle_id)
    
            
    # only build a label tensor when the batch actually carried labels (test data has none)
    if labels:
        labels = torch.tensor(labels, dtype=torch.float)
    
    encoded_batch = tokenizer(batch_texts
                              , return_tensors=return_tensors
                              , padding=padding
                              , max_length=max_length
                              , truncation=truncation)

    return labels, encoded_batch, kaggle_ids


class CommonLitDataset(Dataset):
    def __init__(self
                 , df
                 , text_col: str = 'excerpt'
                 , label_col: str = 'target'
                 , kaggle_id: str = 'id'
                 , sample_size: Optional[int] = None
                ):
        self.df = df if sample_size is None else df.sample(sample_size)
        self.text_col = text_col
        self.label_col = label_col
        self.kaggle_id = kaggle_id
        self.num_labels = len(df[label_col].unique()) if label_col in df.columns else None
        # source: https://www.kaggle.com/justinchae/nn-utils
        
    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        text = row[self.text_col]
        kaggle_id = row[self.kaggle_id]
        
        # return None as the label when the frame has no label column (e.g. test data)
        if self.label_col in self.df.columns:
            target = row[self.label_col]
            return target, text, kaggle_id     
        else:
            return None, text, kaggle_id


class CommonLitDataModule(pl.LightningDataModule):
    def __init__(self
                 , tokenizer
                 , train_path
                 , collate_fn=None
                 , max_length: int = 280
                 , batch_size: int = 16
                 , valid_path: Optional[str] = None
                 , test_path: Optional[str] = None
                 , train_valid_split: float = .6
                 , dtypes=None
                 , shuffle_dataloader: bool = True
                 , num_dataloader_workers: int = NUM_DATALOADER_WORKERS
                 , kfold: Optional[dict] = None):
        super(CommonLitDataModule, self).__init__()
        self.tokenizer = tokenizer
        self.train_path = train_path
        self.valid_path = valid_path
        self.test_path = test_path
        self.train_valid_split = train_valid_split
        self.dtypes = {'id': str} if dtypes is None else dtypes
        self.train_size = None
        self.train_df, self.train_data = None, None
        self.valid_df, self.valid_data = None, None
        self.test_df, self.test_data = None, None
        if collate_fn is not None:
            self.collate_fn = partial(collate_fn
                                      , tokenizer=tokenizer
                                      , max_length=max_length) 
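        else:
            # fall back to the default collate function; bind only the configuration,
            # since the DataLoader supplies the batch itself as the positional argument
            self.collate_fn = partial(my_collate_fn
                                      , tokenizer=tokenizer
                                      , max_length=max_length)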
            
        self.shuffle_dataloader = shuffle_dataloader
        self.batch_size = batch_size
        self.num_dataloader_workers = num_dataloader_workers
        # refactored from: https://www.kaggle.com/justinchae/nn-utils
    
    def _strip_extraneous(self, df):
        strip_cols = ['url_legal', 'license']
        if all(col in df.columns for col in strip_cols):
            extraneous_data = strip_cols
            return df.drop(columns=extraneous_data)
        else: 
            return df
    
    def prepare(self, prep_type=None):
        if prep_type == 'train':
            # creates just an instance of the train data as a pandas df
            self.train_df = self.train_path if isinstance(self.train_path, pd.DataFrame) else pd.read_csv(self.train_path, dtype=self.dtypes)
            self.train_df = self._strip_extraneous(self.train_df)
            
        if prep_type == 'train_stage_2':
            self.train_df = self.train_path if isinstance(self.train_path, pd.DataFrame) else pd.read_csv(self.train_path, dtype=self.dtypes)
            self.train_df = self._strip_extraneous(self.train_df)
            self.train_size = int(len(self.train_df))
            self.train_data = CommonLitDataset(df=self.train_df)
        
    def setup(self, stage: Optional[str] = None) -> None:
        if stage == 'fit':
            # when this class is called by trainer.fit, this stage runs and so on
            self.train_df = self.train_path if isinstance(self.train_path, pd.DataFrame) else pd.read_csv(self.train_path, dtype=self.dtypes)
            self.train_df = self._strip_extraneous(self.train_df)
            self.train_size = int(len(self.train_df))
            self.train_data = CommonLitDataset(df=self.train_df)
        
            if self.train_valid_split is not None and self.valid_path is None:
                self.train_size = int(len(self.train_df) * self.train_valid_split)
                self.train_data, self.valid_data = random_split(self.train_data, [self.train_size, len(self.train_df) - self.train_size])
            elif self.valid_path is not None:
                self.valid_df = self.valid_path if isinstance(self.valid_path, pd.DataFrame) else pd.read_csv(self.valid_path, dtype=self.dtypes)
                self.valid_data = CommonLitDataset(df=self.valid_df)
            
        if stage == 'predict':           
            self.test_df = self.test_path if isinstance(self.test_path, pd.DataFrame) else pd.read_csv(self.test_path, dtype=self.dtypes)
            self.test_df = self._strip_extraneous(self.test_df)
            self.test_data = CommonLitDataset(df=self.test_df)
            
            self.train_df = self.train_path if isinstance(self.train_path, pd.DataFrame) else pd.read_csv(self.train_path, dtype=self.dtypes)
            self.train_df = self._strip_extraneous(self.train_df)
            self.train_size = int(len(self.train_df))
            self.train_data = CommonLitDataset(df=self.train_df)
    
    def kfold_data(self):
        # TODO: wondering how to integrate kfolds into the datamodule
        pass
    
    def train_dataloader(self) -> DataLoader:
        return DataLoader(self.train_data
                          , batch_size=self.batch_size
                          , shuffle=self.shuffle_dataloader
                          , collate_fn=self.collate_fn
                          , num_workers=self.num_dataloader_workers
                          , pin_memory=True
                          )
    def val_dataloader(self) -> DataLoader:
        if self.valid_data is None:
            return None
        else:
            return DataLoader(self.valid_data
                              , batch_size=self.batch_size
                              , shuffle=False
                              , collate_fn=self.collate_fn
                              , num_workers=self.num_dataloader_workers
                              , pin_memory=True
                              )
    def predict_dataloader(self) -> DataLoader:
        if self.test_data is None:
            return None
        else:
            return DataLoader(self.test_data
                              , batch_size=self.batch_size
                              , shuffle=False
                              , collate_fn=self.collate_fn
                              , num_workers=self.num_dataloader_workers
                              , pin_memory=True
                              ) 
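
The data module wraps its collate function with `functools.partial` before handing it to each `DataLoader`. PyTorch's `DataLoader` calls `collate_fn` with a single positional argument, the list of samples, so only configuration such as the tokenizer and maximum length should be pre-bound. The sketch below shows this with a hypothetical stand-in, `toy_collate`, which "tokenizes" by splitting on whitespace. It is an illustration under those assumptions, not the notebook's `my_collate_fn`.

```python
from functools import partial


def toy_collate(batch, tokenizer, max_length=100):
    """Hypothetical stand-in for a collate function over (label, text, id) samples."""
    labels = [label for label, _, _ in batch if label is not None]
    encoded = [tokenizer(text)[:max_length] for _, text, _ in batch]
    ids = [sample_id for _, _, sample_id in batch]
    return labels, encoded, ids


# Pre-bind only the configuration; `batch` stays free for the loader to supply.
collate = partial(toy_collate, tokenizer=str.split, max_length=3)

batch = [(0.5, "a b c d e", "id-1"), (None, "f g", "id-2")]
labels, encoded, ids = collate(batch)
print(labels, encoded, ids)

# Binding `batch` as a keyword collides with the positional argument at call time.
broken = partial(toy_collate, batch=16, tokenizer=str.split)
try:
    broken(batch)
    error_message = None
except TypeError as err:
    error_message = str(err)
print(error_message)
```

Keeping the batch out of the `partial` call means the same bound function can be reused unchanged for the train, validation, and predict loaders.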

In [None]:
def add_textstat_features(df):
    # adding the text standard seems to boost the accuracy score a bit
    df['text_standard'] = df['excerpt'].apply(lambda x: textstat.text_standard(x))
    df['text_standard_category'] = df['text_standard'].astype('category').cat.codes
    
    # counting ratio of difficult words by lexicon count
    df['difficult_words_ratio'] = df['excerpt'].apply(lambda x: textstat.difficult_words(x))
    df['difficult_words_ratio'] = df.apply(lambda x: x['difficult_words_ratio'] / textstat.lexicon_count(x['excerpt']), axis=1)
                                           
    df['syllable_ratio'] = df['excerpt'].apply(lambda x: textstat.syllable_count(x))
    df['syllable_ratio'] = df.apply(lambda x: x['syllable_ratio'] / textstat.lexicon_count(x['excerpt']), axis=1) 
                                    

    ### You can add/remove any feature below and it will be used in training and test
    df['coleman_liau_index'] = df['excerpt'].apply(lambda x: textstat.coleman_liau_index(x))
    df['flesch_reading_ease'] = df['excerpt'].apply(lambda x: textstat.flesch_reading_ease(x))
    df['smog_index'] = df['excerpt'].apply(lambda x: textstat.smog_index(x))
    df['gunning_fog'] = df['excerpt'].apply(lambda x: textstat.gunning_fog(x))
    df['flesch_kincaid_grade'] = df['excerpt'].apply(lambda x: textstat.flesch_kincaid_grade(x))
    df['automated_readability_index'] = df['excerpt'].apply(lambda x: textstat.automated_readability_index(x))
    df['dale_chall_readability_score'] = df['excerpt'].apply(lambda x: textstat.dale_chall_readability_score(x))
    df['linsear_write_formula'] = df['excerpt'].apply(lambda x: textstat.linsear_write_formula(x))
    ###
    df = df.drop(columns=['excerpt', 'text_standard'])
    return df

In [None]:
def process_hidden_states(df, drop_hidden_states=False):
    # for convenience, moving hidden states to the far right of the df
    if drop_hidden_states:
        # errors='ignore' keeps this a no-op when the column is already absent
        df.drop(columns=['hidden_states'], inplace=True, errors='ignore')
        return df
    
    elif "hidden_states" in df.columns:
        df['hidden_state'] = df['hidden_states']
        df.drop(columns=['hidden_states'], inplace=True)

        temp = df['hidden_state'].apply(pd.Series)
        temp = temp.rename(columns = lambda x: 'hidden_state_' + str(x))
        df = pd.concat([df, temp], axis=1)
        df.drop(columns=['hidden_state'], inplace=True)

        return df
    else:
        print("hidden_states not found in dataframe, skipping process_hidden_states")
        return df

In [None]:
datamodule = CommonLitDataModule(collate_fn=my_collate_fn
                                         , tokenizer=PRETRAINED_ROBERTA_BASE_TOKENIZER
                                         , train_path=KAGGLE_TRAIN_PATH
                                         , test_path=KAGGLE_TEST_PATH
                                         , max_length=TOKENIZER_MAX_LEN
                                         , batch_size=BATCH_SIZE
                                         , train_valid_split=TRAIN_VALID_SPLIT
                                          )
# manually calling this stage since we need some params to set up model initially
datamodule.setup(stage='fit')

if USE_CHECKPOINT:
    
#     model = LitRobertaLogitRegressor.load_from_checkpoint(TUNED_CHECKPOINT_PATH)
    trainer = pl.Trainer(gpus=ACCELERATOR_TYPE['gpus']
                     , tpu_cores=ACCELERATOR_TYPE['tpu_cores']
                     )
    
    model = LitRobertaLogitRegressor(pre_trained_path=TUNED_BEST_ROBERTA_PATH
                                          , train_size=datamodule.train_size
                                          , batch_size=datamodule.batch_size
                                          , output_hidden_states=USE_HIDDEN_IN_RGR
                                          , n_gpus=ACCELERATOR_TYPE['gpus']
                                          , accumulate_grad_batches=trainer.accumulate_grad_batches
                                          , learning_rate=LEARNING_RATE
                                          , warmup_steps=WARMUP_STEPS
                                          , max_epochs=MAX_EPOCHS
                                          , tokenizer=datamodule.tokenizer
                                          )
    
    trainer = pl.Trainer(gpus=ACCELERATOR_TYPE['gpus']
                     , tpu_cores=ACCELERATOR_TYPE['tpu_cores'])
    
else:

    checkpoint_filename = 'crp_roberta_trial_main'
    checkpoint_save = ModelCheckpoint(dirpath=CHECKPOINTS_PATH
                                      , filename=checkpoint_filename
                                      )

    early_stopping_callback = EarlyStopping(monitor='val_loss'
                                            , patience=2
                                            )

    trainer = pl.Trainer(max_epochs=MAX_EPOCHS
                         , gpus=ACCELERATOR_TYPE['gpus']
                         , tpu_cores=ACCELERATOR_TYPE['tpu_cores']
                         , precision=16 if USE_16_BIT_PRECISION else 32
                         , default_root_dir=CHECKPOINTS_PATH
                         , gradient_clip_val=GRADIENT_CLIP_VAL
                         , stochastic_weight_avg=True
                         , callbacks=[checkpoint_save
                                     , early_stopping_callback
                                     ]
                         , fast_dev_run=FAST_DEV_RUN
                        )

    model = LitRobertaLogitRegressor(pre_trained_path=PRETTRAINED_ROBERTA_BASE_MODEL_PATH
                                          , train_size=datamodule.train_size
                                          , batch_size=datamodule.batch_size
                                          , n_gpus=trainer.gpus
                                          , n_tpus=trainer.tpu_cores
                                          , max_epochs=trainer.max_epochs
                                          , accumulate_grad_batches=trainer.accumulate_grad_batches
                                          , learning_rate=LEARNING_RATE
                                          , warmup_steps=WARMUP_STEPS
                                          , tokenizer=datamodule.tokenizer
                                          )

    trainer.fit(model, datamodule=datamodule)

    # let's also save the tuned roberta state which our model wraps around 
    model_file_name = "tuned_roberta_model"
    model_file_path = os.path.join(MODEL_PATH, model_file_name)
    model.save_pretrained(model_file_path)

    # clean up memory
    torch.cuda.empty_cache()
    gc.collect()

In [None]:
# freeze the model for prediction
model.eval()
model.freeze()
datamodule.shuffle_dataloader = False
datamodule.setup(stage='predict')

model.do_decode = True

# run predict on the train data (the train dataloader is passed below)
train_data_stage_20 = trainer.predict(model=model, dataloaders=datamodule.train_dataloader())
train_data_stage_20 = pd.concat(train_data_stage_20).reset_index(drop=True)

train_data_stage_20 = pd.merge(left=train_data_stage_20 
                                , right=datamodule.train_df.drop(columns=['standard_error', 'target'])
                                , left_on='id'
                                , right_on='id')

print(train_data_stage_20)
# TODO: test whether we need to save and upload the fine-tuned state of roberta or if pytorch lightning checkpoints take care of it all

In [None]:
def free_mem():                          
    torch.cuda.empty_cache()
    gc.collect()

In [None]:
free_mem()

In [None]:
# apply other models

trainer = pl.Trainer(gpus=ACCELERATOR_TYPE['gpus']
                 , tpu_cores=ACCELERATOR_TYPE['tpu_cores']
                 )

model_42 = LitRobertaLogitRegressor(pre_trained_path=MODEL_42
                                      , train_size=datamodule.train_size
                                      , batch_size=datamodule.batch_size
                                      , output_hidden_states=USE_HIDDEN_IN_RGR
                                      , n_gpus=ACCELERATOR_TYPE['gpus']
                                      , accumulate_grad_batches=trainer.accumulate_grad_batches
                                      , learning_rate=LEARNING_RATE
                                      , warmup_steps=WARMUP_STEPS
                                      , max_epochs=MAX_EPOCHS
                                      , tokenizer=datamodule.tokenizer
                                      )

model_42.do_decode = True
# freeze the model for prediction
model_42.eval()
model_42.freeze()

In [None]:
# run predict on the training data with model_42
train_data_stage_21 = trainer.predict(model=model_42, dataloaders=datamodule.train_dataloader())
train_data_stage_21 = pd.concat(train_data_stage_21).reset_index(drop=True)
train_data_stage_21.rename(columns={'logit': 'logit_42'}, inplace=True)

train_data_stage_two = pd.merge(left=train_data_stage_20 
                                , right=train_data_stage_21[['id', 'logit_42']]
                                , left_on='id'
                                , right_on='id')

print(train_data_stage_two)

In [None]:
free_mem()

In [None]:
# apply the next pre-trained model (MODEL_0)

trainer = pl.Trainer(gpus=ACCELERATOR_TYPE['gpus']
                 , tpu_cores=ACCELERATOR_TYPE['tpu_cores']
                 )

model_0 = LitRobertaLogitRegressor(pre_trained_path=MODEL_0
                                      , train_size=datamodule.train_size
                                      , batch_size=datamodule.batch_size
                                      , output_hidden_states=USE_HIDDEN_IN_RGR
                                      , n_gpus=ACCELERATOR_TYPE['gpus']
                                      , accumulate_grad_batches=trainer.accumulate_grad_batches
                                      , learning_rate=LEARNING_RATE
                                      , warmup_steps=WARMUP_STEPS
                                      , max_epochs=MAX_EPOCHS
                                      , tokenizer=datamodule.tokenizer
                                      )

model_0.do_decode = True
# freeze the model for prediction
model_0.eval()
model_0.freeze()

In [None]:
# run predict on the training data with model_0
train_data_stage_22 = trainer.predict(model=model_0, dataloaders=datamodule.train_dataloader())
train_data_stage_22 = pd.concat(train_data_stage_22).reset_index(drop=True)
train_data_stage_22.rename(columns={'logit': 'logit_0'}, inplace=True)

train_data_stage_two = pd.merge(left=train_data_stage_two 
                                , right=train_data_stage_22[['id', 'logit_0']]
                                , left_on='id'
                                , right_on='id')

print(train_data_stage_two)

In [None]:
free_mem()

In [None]:
# apply the next pre-trained model (MODEL_21)

trainer = pl.Trainer(gpus=ACCELERATOR_TYPE['gpus']
                 , tpu_cores=ACCELERATOR_TYPE['tpu_cores']
                 )

model_21 = LitRobertaLogitRegressor(pre_trained_path=MODEL_21
                                      , train_size=datamodule.train_size
                                      , batch_size=datamodule.batch_size
                                      , output_hidden_states=USE_HIDDEN_IN_RGR
                                      , n_gpus=ACCELERATOR_TYPE['gpus']
                                      , accumulate_grad_batches=trainer.accumulate_grad_batches
                                      , learning_rate=LEARNING_RATE
                                      , warmup_steps=WARMUP_STEPS
                                      , max_epochs=MAX_EPOCHS
                                      , tokenizer=datamodule.tokenizer
                                      )

model_21.do_decode = True
# freeze the model for prediction
model_21.eval()
model_21.freeze()

In [None]:
# run predict on the training data with model_21
train_data_stage_23 = trainer.predict(model=model_21, dataloaders=datamodule.train_dataloader())
train_data_stage_23 = pd.concat(train_data_stage_23).reset_index(drop=True)
train_data_stage_23.rename(columns={'logit': 'logit_21'}, inplace=True)

train_data_stage_two = pd.merge(left=train_data_stage_two 
                                , right=train_data_stage_23[['id', 'logit_21']]
                                , left_on='id'
                                , right_on='id')

print(train_data_stage_two)

In [None]:
free_mem()

In [None]:
train_data_stage_three = add_textstat_features(train_data_stage_two)

label_data = train_data_stage_three[['id']].copy(deep=True)

train_data = train_data_stage_three.drop(columns=['id', 'target', 'text_standard_category']).copy(deep=True)

train_data_cols =  list(train_data.columns)

target_data = train_data_stage_three[['target']].copy(deep=True)

scaler = StandardScaler()
train_data_scaled = scaler.fit_transform(train_data)
train_data_scaled = pd.DataFrame(train_data_scaled, columns=train_data_cols)

TARGET_SCALER = StandardScaler()
target_data_scaled = TARGET_SCALER.fit_transform(target_data)
target_data_scaled = pd.DataFrame(target_data_scaled, columns=['target'])

regr = SVR(kernel='linear')
regr.fit(train_data_scaled, target_data_scaled['target'])

In [None]:
print("   Assessment of Features   ")
print("R2 Score: ", regr.score(train_data_scaled, target_data_scaled['target']))
print("RMSE Score: ", math.sqrt(mean_squared_error(target_data_scaled['target'], regr.predict(train_data_scaled))))

# for a linear kernel, regr.coef_ is an array of shape (1, n_features)
feats_coef = list(zip(train_data_cols, regr.coef_[0]))
feature_analysis = pd.DataFrame(feats_coef
                               , columns=['feature_col', 'coef_val'])

# coefficients are kept signed; apply .abs() to coef_val here to rank by magnitude instead
feature_analysis = feature_analysis.sort_values('coef_val',ascending = False)

feature_analysis.plot.barh(x='feature_col', y='coef_val', title="Comparison of Features and Importance")

# select the top n features for use in final regression approach
best_n_features = feature_analysis.head(N_FEATURES_TO_USE_HEAD)['feature_col'].to_list()
# the opposite
if N_FEATURES_TO_USE_TAIL is not None:
    worst_n_features = feature_analysis.tail(N_FEATURES_TO_USE_TAIL)['feature_col'].to_list()
    best_n_features.extend(worst_n_features)

# manually adding this categorical feature in
if 'text_standard_category' not in best_n_features:
    best_n_features.append('text_standard_category')
    
# manually adding this roberta logit feature in
if 'logit_0' not in best_n_features:
    best_n_features.append('logit_0')
    
# manually adding this roberta logit feature in
if 'logit_21' not in best_n_features:
    best_n_features.append('logit_21')

# drop duplicates while keeping a stable column order (a set would reorder the columns)
best_n_features = list(dict.fromkeys(best_n_features))
train_data = train_data_stage_three[best_n_features]
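
The selection logic above (rank by signed coefficient, keep the head and optionally the tail, then force in a few columns) can be sketched without any dependencies. This is only an illustration: `select_features` and the coefficient values below are hypothetical and not part of the notebook.

```python
def select_features(coefs, n_head, n_tail=None, always_keep=()):
    """Rank features by signed coefficient and keep both ends of the ranking."""
    # Sort from the most positive to the most negative coefficient.
    ranked = [name for name, _ in sorted(coefs.items(), key=lambda kv: kv[1], reverse=True)]
    selected = ranked[:n_head]
    if n_tail:
        # The tail holds the most negative coefficients.
        selected += ranked[-n_tail:]
    selected += list(always_keep)
    # dict.fromkeys drops duplicates while keeping first-seen order.
    return list(dict.fromkeys(selected))


# Hypothetical coefficients, for illustration only.
example_coefs = {"logit": 0.9, "syllable_count": 0.05, "logit_42": 0.4, "lexicon_count": -0.3}
chosen = select_features(example_coefs, n_head=2, n_tail=1,
                         always_keep=["text_standard_category", "logit"])
print(chosen)
```

Keeping the order stable matters here because the same column list is reused later to scale and score the test data.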

In [None]:
DATASET = train_data.copy(deep=True)
DATASET['target'] = target_data_scaled['target']
DATASET['id'] = label_data['id']

if 'text_standard_category' in best_n_features:
    drop_cols = ['id', 'target', 'text_standard_category']
else:
    drop_cols = ['id', 'target']
    
temp_cols = list(DATASET.drop(columns=drop_cols).columns)


DATASET_scaled = DATASET[temp_cols]

scaler = StandardScaler()
DATASET_scaled = scaler.fit_transform(DATASET_scaled)
DATASET_scaled = pd.DataFrame(DATASET_scaled, columns=temp_cols)

DATASET_scaled[drop_cols] = DATASET[drop_cols] 
print(DATASET_scaled)

In [None]:
# https://medium.com/optuna/lightgbm-tuner-new-optuna-integration-for-hyperparameter-optimization-8b7095e99258
# https://www.kaggle.com/corochann/optuna-tutorial-for-hyperparameter-optimization

RGR_MODELS = []

def objective(trial: optuna.trial.Trial
              , n_folds=10
              , shuffle=True
             ):
    
    # each hyperparameter gets its own unique Optuna name so it is sampled
    # independently and recorded in study.best_trial.params
    params = {'metric': 'rmse'
              , 'boosting_type': 'gbdt'
              , 'verbose': -1
              , 'num_leaves': trial.suggest_int('num_leaves', 4, 512)
              , 'max_depth': trial.suggest_int('max_depth', 4, 512)
              , 'max_bin': trial.suggest_int('max_bin', 4, 512)
              , 'min_data_in_leaf': trial.suggest_int('min_data_in_leaf', 64, 512)
              , "bagging_fraction": trial.suggest_uniform('bagging_fraction', 0.1, 1.0)
              , "bagging_freq": trial.suggest_int('bagging_freq', 5, 10)
              , "feature_fraction": trial.suggest_uniform('feature_fraction', 0.4, 1.0)
              , 'learning_rate': trial.suggest_float('learning_rate', .0005, .01)
              , 'n_estimators': trial.suggest_int('n_estimators', 10, 10000)
              , 'lambda_l1': trial.suggest_loguniform('lambda_l1', 1e-8, 10.0)
              , 'lambda_l2': trial.suggest_loguniform('lambda_l2', 1e-8, 10.0)
             }
     
    fold = KFold(n_splits=n_folds
               , shuffle=shuffle
               , random_state=SEED_VAL if shuffle else None
                )
    
    valid_score = 0
    
    best_model_tracker = {}

    for fold_idx, (train_idx, valid_idx) in enumerate(fold.split(range(len(DATASET_scaled)))):
        train_data = DATASET_scaled.iloc[train_idx].drop(columns=['id', 'target']).copy(deep=True)
        train_target = DATASET_scaled[['target']].iloc[train_idx].copy(deep=True)

        valid_data = DATASET_scaled.iloc[valid_idx].drop(columns=['id', 'target']).copy(deep=True)
        valid_target = DATASET_scaled[['target']].iloc[valid_idx].copy(deep=True)

        lgbm_train = lgm.Dataset(train_data, label=train_target
                                 , categorical_feature=['text_standard_category'] if 'text_standard_category' in best_n_features else None
                                )
        lgbm_valid = lgm.Dataset(valid_data, label=valid_target
                                 , categorical_feature=['text_standard_category'] if 'text_standard_category' in best_n_features else None
                                )
        
        curr_model = lgm.train(params,
                          train_set=lgbm_train,
                          valid_sets=[lgbm_train, lgbm_valid],
                          verbose_eval=-1,
                         )
        
        valid_pred = curr_model.predict(valid_data, num_iteration=curr_model.best_iteration)
        
        best_score = curr_model.best_score['valid_1']['rmse']
        
        best_model_tracker.update({best_score: curr_model})
            
        valid_score += best_score
    
#     best_model_score = min([k for k, v in best_model_tracker.items()])
#     best_model = best_model_tracker[best_model_score]

#     RGR_MODELS.append(best_model) 
    
#     RGR_MODELS.append({best_model_score: best_model})
    
#     worst_rgr_model_idx = max([d.keys[0] for d in RGR_MODELS])
    
#     RGR_MODELS[worst_rgr_model_idx] = {best_model_score: None}
    
    score = valid_score / n_folds
    return score
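
The objective returns the mean validation RMSE across folds as the trial score. A dependency-free sketch of that idea follows. It assumes a stand-in "model" that predicts the training-fold mean instead of LightGBM, and the helper names (`kfold_indices`, `mean_cv_rmse`) are hypothetical.

```python
import math
import random


def kfold_indices(n_samples, n_folds, seed=0):
    """Yield (train_idx, valid_idx) pairs; each sample is validated exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    for k in range(n_folds):
        valid = indices[k::n_folds]
        valid_set = set(valid)
        train = [i for i in indices if i not in valid_set]
        yield train, valid


def rmse(y_true, y_pred):
    """Root mean squared error of two equally long sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mean_cv_rmse(targets, n_folds=5, seed=0):
    """Average the validation RMSE of a mean-value baseline across folds."""
    fold_scores = []
    for train_idx, valid_idx in kfold_indices(len(targets), n_folds, seed):
        # Stand-in model: predict the training-fold mean for every validation row.
        train_mean = sum(targets[i] for i in train_idx) / len(train_idx)
        valid_true = [targets[i] for i in valid_idx]
        fold_scores.append(rmse(valid_true, [train_mean] * len(valid_true)))
    return sum(fold_scores) / len(fold_scores)


targets = [float(i % 7) for i in range(50)]  # hypothetical target values
score = mean_cv_rmse(targets, n_folds=5)
print(f"mean validation RMSE over 5 folds: {score:.4f}")
```

Averaging over folds gives each trial a less noisy score than a single split, at the cost of training one model per fold.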

In [None]:
study = optuna.create_study(storage='sqlite:///lgm-study.db')
study.optimize(objective, n_trials=256)

plot_optimization_history(study).show()
print("Best Trial: ", study.best_trial, '\n')

In [None]:
# use the study parameters to create and train a lgbm regressor
lgm_train_data = DATASET_scaled.drop(columns=['id']).copy(deep=True)
x_features = lgm_train_data.loc[:, lgm_train_data.columns != 'target']
y_train = lgm_train_data[['target']]

lgm_train_set_full = lgm.Dataset(data=x_features
                                 , categorical_feature=['text_standard_category'] if 'text_standard_category' in best_n_features else None
                                 , label=y_train)

gbm = lgm.train(study.best_trial.params,
                lgm_train_set_full,
                )

In [None]:
model.do_decode = True

trainer = pl.Trainer(gpus=ACCELERATOR_TYPE['gpus']
                     , tpu_cores=ACCELERATOR_TYPE['tpu_cores']
                     )

# run predict on the test data
submission_stage_10 = trainer.predict(model=model
                                     , dataloaders=datamodule.predict_dataloader())

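submission_stage_10 = pd.concat(submission_stage_10).reset_index(drop=True)
# start the merged frame here so the later cells still work if an optional model is skipped
submission_stage_1 = submission_stage_10
print("   Submission Stage 10: After RoBERTA\n")
print(submission_stage_10)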

In [None]:
if 'logit_42' in best_n_features:
    model_42.do_decode = True

    trainer = pl.Trainer(gpus=ACCELERATOR_TYPE['gpus']
                         , tpu_cores=ACCELERATOR_TYPE['tpu_cores']
                         )

    # run predict on the test data
    submission_stage_11 = trainer.predict(model=model_42
                                         , dataloaders=datamodule.predict_dataloader())

    submission_stage_11 = pd.concat(submission_stage_11).reset_index(drop=True)
    submission_stage_11.rename(columns={'logit': 'logit_42'}, inplace=True)

    submission_stage_1 = pd.merge(left=submission_stage_10
                                 , right=submission_stage_11
                                 , left_on='id'
                                 , right_on='id'
                                 , how='left')

    print("   Submission Stage 11: After RoBERTA\n")
    print(submission_stage_1)

In [None]:
if "logit_0" in best_n_features:

    model_0.do_decode = True

    trainer = pl.Trainer(gpus=ACCELERATOR_TYPE['gpus']
                         , tpu_cores=ACCELERATOR_TYPE['tpu_cores']
                         )

    # run predict on the test data
    submission_stage_12 = trainer.predict(model=model_0
                                         , dataloaders=datamodule.predict_dataloader())

    submission_stage_12 = pd.concat(submission_stage_12).reset_index(drop=True)
    submission_stage_12.rename(columns={'logit': 'logit_0'}, inplace=True)

    submission_stage_1 = pd.merge(left=submission_stage_1
                                 , right=submission_stage_12
                                 , left_on='id'
                                 , right_on='id'
                                 , how='left')

    print("   Submission Stage 12: After RoBERTA\n")
    print(submission_stage_1)

In [None]:
if "logit_21" in best_n_features:
    model_21.do_decode = True

    trainer = pl.Trainer(gpus=ACCELERATOR_TYPE['gpus']
                         , tpu_cores=ACCELERATOR_TYPE['tpu_cores']
                         )

    # run predict on the test data
    submission_stage_13 = trainer.predict(model=model_21
                                         , dataloaders=datamodule.predict_dataloader())

    submission_stage_13 = pd.concat(submission_stage_13).reset_index(drop=True)
    submission_stage_13.rename(columns={'logit': 'logit_21'}, inplace=True)

    submission_stage_1 = pd.merge(left=submission_stage_1
                                 , right=submission_stage_13
                                 , left_on='id'
                                 , right_on='id'
                                 , how='left')

    print("   Submission Stage 13: After RoBERTA\n")
    print(submission_stage_1)

In [None]:
submission_stage_2 = pd.merge(left=submission_stage_1
                             , right=datamodule.test_df
                             , left_on='id'
                             , right_on='id'
                             , how='left')

submission_stage_2 = add_textstat_features(submission_stage_2)

feature_cols = list(submission_stage_2.drop(columns=['id']).copy(deep=True).columns)

predict_data = submission_stage_2.drop(columns=['id']).copy(deep=True)
predict_data = predict_data[best_n_features]

# scale every column except the categorical one, in the same order used at training time
temp_cols = [col for col in predict_data.columns if col != 'text_standard_category']

predict_data_scaled = predict_data[temp_cols]
predict_data_scaled = scaler.transform(predict_data_scaled)
predict_data_scaled = pd.DataFrame(predict_data_scaled, columns=temp_cols)
if 'text_standard_category' in best_n_features:
    predict_data_scaled['text_standard_category'] = predict_data['text_standard_category']

In [None]:
submission = submission_stage_2[['id']].copy(deep=True)

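submission['target'] = gbm.predict(predict_data_scaled)
# recent scikit-learn versions expect 2D input here, so pass a one-column frame and flatten the result
submission['target'] = TARGET_SCALER.inverse_transform(submission[['target']]).ravel()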

print("   Final Stage After LGBM\n")
print(submission)
submission.to_csv('submission.csv', index=False)
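
The last step undoes the target scaling so predictions land back on the original scale. A minimal dependency-free sketch of that round trip is below. It assumes the usual standardization formula `z = (x - mean) / std` with the population standard deviation, which is what `StandardScaler` uses; the helper names and target values are hypothetical.

```python
import statistics


def fit_standardizer(values):
    """Return (mean, std) using the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    # Guard against a zero spread so the division below stays defined.
    return mean, (std if std > 0 else 1.0)


def transform(values, mean, std):
    """Map raw values to standardized scores."""
    return [(v - mean) / std for v in values]


def inverse_transform(scaled, mean, std):
    """Map standardized scores back to the original scale."""
    return [z * std + mean for z in scaled]


raw_targets = [-2.5, -1.0, 0.3, 1.2]  # hypothetical target values
mean, std = fit_standardizer(raw_targets)
scaled = transform(raw_targets, mean, std)
restored = inverse_transform(scaled, mean, std)
print(restored)
```

The same fitted mean and standard deviation must be used in both directions, which is why the notebook keeps `TARGET_SCALER` around until the final cell.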


#### Helpful Resources

* Optuna Docs: [https://optuna.readthedocs.io/en/stable/index.html](https://optuna.readthedocs.io/en/stable/index.html)

* PyTorch Lightning Docs: [https://pytorch-lightning.readthedocs.io/en/latest/](https://pytorch-lightning.readthedocs.io/en/latest/)

* For learning rate tuning: [https://medium.com/pytorch/using-optuna-to-optimize-pytorch-hyperparameters-990607385e36](https://medium.com/pytorch/using-optuna-to-optimize-pytorch-hyperparameters-990607385e36)

* For PyTorch Lightning Precision: [https://pytorch-lightning.readthedocs.io/en/stable/advanced/amp.html](https://pytorch-lightning.readthedocs.io/en/stable/advanced/amp.html)

* For PyTorch Lightning Early Stopping: [https://pytorch-lightning.readthedocs.io/en/latest/common/early_stopping.html](https://pytorch-lightning.readthedocs.io/en/latest/common/early_stopping.html)

* For PyTorch Lightning Checkpointing: [https://pytorch-lightning.readthedocs.io/en/stable/common/weights_loading.html](https://pytorch-lightning.readthedocs.io/en/stable/common/weights_loading.html)

* BERT Example from PyTorch Lightning: [https://pytorch-lightning.readthedocs.io/en/stable/advanced/transfer_learning.html](https://pytorch-lightning.readthedocs.io/en/stable/advanced/transfer_learning.html)

* Fine-Tuning a Transformer from PyTorch Lightning: [https://pytorch-lightning.readthedocs.io/en/latest/notebooks/lightning_examples/text-transformers.html](https://pytorch-lightning.readthedocs.io/en/latest/notebooks/lightning_examples/text-transformers.html)

* Example of Optuna with PyTorch Lightning: [https://github.com/optuna/optuna-examples/blob/main/pytorch/pytorch_lightning_simple.py](https://github.com/optuna/optuna-examples/blob/main/pytorch/pytorch_lightning_simple.py)

* For PyTorch Lightning Logging: [https://pytorch-lightning.readthedocs.io/en/stable/extensions/logging.html](https://pytorch-lightning.readthedocs.io/en/stable/extensions/logging.html)

* For Predict Mode with PyTorch Lightning: [https://pytorch-lightning.readthedocs.io/en/latest/starter/introduction_guide.html](https://pytorch-lightning.readthedocs.io/en/latest/starter/introduction_guide.html)

* Restoring Checkpoints and Continuing Training: [https://pytorch-lightning.readthedocs.io/en/latest/common/weights_loading.html?highlight=checkpoint#checkpoint-loading](https://pytorch-lightning.readthedocs.io/en/latest/common/weights_loading.html?highlight=checkpoint#checkpoint-loading)

* Gradient Clipping in PyTorch Lightning: [https://pytorch-lightning.readthedocs.io/en/stable/advanced/training_tricks.html?highlight=memory#advanced-gpu-optimizations](https://pytorch-lightning.readthedocs.io/en/stable/advanced/training_tricks.html?highlight=memory#advanced-gpu-optimizations)

* How to approach trial suggestions in Optuna: [https://optuna.readthedocs.io/en/stable/reference/generated/optuna.trial.Trial.html?highlight=suggest#optuna.trial.Trial.suggest_int](https://optuna.readthedocs.io/en/stable/reference/generated/optuna.trial.Trial.html?highlight=suggest#optuna.trial.Trial.suggest_int)

* Guidance on Early Stopping Callbacks with PyTorch Lightning: [https://pytorch-lightning.readthedocs.io/en/latest/api/pytorch_lightning.callbacks.early_stopping.html#pytorch_lightning.callbacks.early_stopping.EarlyStopping](https://pytorch-lightning.readthedocs.io/en/latest/api/pytorch_lightning.callbacks.early_stopping.html#pytorch_lightning.callbacks.early_stopping.EarlyStopping)

* For Reproducible Optuna Studies: [https://optuna.readthedocs.io/en/stable/faq.html#how-can-i-obtain-reproducible-optimization-results](https://optuna.readthedocs.io/en/stable/faq.html#how-can-i-obtain-reproducible-optimization-results)

* Guidance on Optuna Pruners: [https://optuna.readthedocs.io/en/stable/reference/generated/optuna.pruners.HyperbandPruner.html#optuna.pruners.HyperbandPruner](https://optuna.readthedocs.io/en/stable/reference/generated/optuna.pruners.HyperbandPruner.html#optuna.pruners.HyperbandPruner)

* More guidance on which Optuna Pruners to use based on ML task: [https://optuna.readthedocs.io/en/stable/tutorial/10_key_features/003_efficient_optimization_algorithms.html?highlight=memory#activating-pruners](https://optuna.readthedocs.io/en/stable/tutorial/10_key_features/003_efficient_optimization_algorithms.html?highlight=memory#activating-pruners)

* For PyTorch Lightning on TPUs: [https://www.kaggle.com/justusschock/pytorch-on-tpu-with-pytorch-lightning](https://www.kaggle.com/justusschock/pytorch-on-tpu-with-pytorch-lightning)

* For Neptune to PyTorch Lightning Integration: [https://docs.neptune.ai/integrations-and-supported-tools/model-training/pytorch-lightning](https://docs.neptune.ai/integrations-and-supported-tools/model-training/pytorch-lightning)