# Capturing Models with Scrybe
This notebook builds several traditional (non-deep-learning) models, including XGBoost and LightGBM regressors. It shows how Scrybe, with a single line of code, automatically captures detailed information about each model, including:
* Feature names
* Variable importance
* Hyperparameters
* Metrics
* Lineage 

We are using data from the House Price Prediction challenge for this tutorial.

## Scrybe Installation

*Skip if Scrybe package is already installed*

The Scrybe Python package is hosted on a private pip server protected by a username and password. When you signed up with Scrybe, you should have received a username and password for the package installation.

In the following cell, replace `username` and `password` with the provided username and password. 

----

> If an incorrect username or password is provided, the command will **wait/hang**, asking for a username. In that case, stop the execution with **Kernel &rarr; Interrupt**, fix the username/password, and rerun the cell.

In [None]:
pip install --extra-index-url http://username:password@15.206.48.113:80/simple/ --trusted-host 15.206.48.113 --upgrade scrybe
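
To decide whether the installation cell can be skipped, one option is to check whether the package is already importable. The following is a minimal sketch using only the standard library; the helper name `is_installed` is ours and is not part of Scrybe.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if a top-level package can be found in this environment."""
    return importlib.util.find_spec(package_name) is not None


if is_installed("scrybe"):
    print("scrybe is already installed; the install cell can be skipped.")
else:
    print("scrybe not found; run the install cell above.")
```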

## Scrybe Initialization

You need to `import scrybe` at the beginning of your notebook or Python script and initialize it using your access key. You can find the access key on the Scrybe dashboard.

> If you are using Scrybe on-premise, change `host_url` to point to your deployment. 

In [1]:
import scrybe
scrybe.init(project_name="Sample Project", user_access_key='aa0e0c5c-3138-45b8-9db5-1fb51b536836', host_url='3.6.105.91:5001')
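
If the notebook is shared or committed to version control, you may prefer not to hard-code the access key. The sketch below reads it from the environment instead. The variable names `SCRYBE_ACCESS_KEY` and `SCRYBE_HOST_URL` and the helper `scrybe_init_kwargs` are assumptions made for this example, not Scrybe conventions; only the `scrybe.init` keyword arguments are taken from the cell above.

```python
import os


def scrybe_init_kwargs(project_name: str, default_host: str = "3.6.105.91:5001") -> dict:
    """Build keyword arguments for scrybe.init from environment variables.

    SCRYBE_ACCESS_KEY and SCRYBE_HOST_URL are hypothetical names chosen for
    this sketch; Scrybe itself does not define them.
    """
    access_key = os.environ.get("SCRYBE_ACCESS_KEY")
    if not access_key:
        raise RuntimeError("Set SCRYBE_ACCESS_KEY before initializing Scrybe.")
    return {
        "project_name": project_name,
        "user_access_key": access_key,
        "host_url": os.environ.get("SCRYBE_HOST_URL", default_host),
    }


# Usage (requires the scrybe package):
# import scrybe
# scrybe.init(**scrybe_init_kwargs("Sample Project"))
```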

## Scrybe Labels
With Scrybe, you can group different categories of models/experiments by using the `scrybe.set_label` API. It accepts a string or an array of strings, which are attached as tags to all models, plots, and other artifacts created in this process. These tags let you filter artifacts easily when looking for information on the dashboard.

In this tutorial, we will add two labels: a model version string ("v2") and an experiment identifier ("Traditional"). In the other tutorial, where we build deep learning models, we use the same version string with a different experiment identifier.

In [2]:
scrybe.set_label(["v2", "Traditional"])

## Model Training
You are now fully set up with Scrybe experiment tracking. Beyond this point, Scrybe will automatically:

* Capture any models which get trained 
* Track model predictions and log metrics computed on them
* Print a URL for each model which can be shared with your team to view/comment upon. 

The rest of the notebook is regular model training code. We start by loading pre-transformed train/test datasets into Pandas DataFrames and then build the following five models:

* Lasso
* RandomForestRegressor
* ExtraTreesRegressor
* XGBRegressor
* LGBMRegressor
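
Each model is scored on the test set with three metrics from `sklearn.metrics`: mean squared error, mean absolute error, and R². As a reminder of what those numbers measure, here is a dependency-free sketch of the three formulas. The function names and the toy values are ours; the notebook itself uses the scikit-learn implementations.

```python
def mse(y_true, y_pred):
    """Mean squared error: the average of the squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error: the average of the absolute residuals."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - (residual SS / total SS)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy values, not taken from the house price data.
y_true = [1.0, 2.0, 3.0]
y_pred = [1.0, 2.0, 4.0]
print(f"MSE: {mse(y_true, y_pred):.4f}")
print(f"MAE: {mae(y_true, y_pred):.4f}")
print(f"R2:  {r2(y_true, y_pred):.4f}")
```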

In [3]:
import lightgbm as lgb
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import GridSearchCV

In [4]:
train_set = pd.read_csv('https://raw.githubusercontent.com/scrybe-ml/tutorials/master/data/train_set.csv')
test_set = pd.read_csv('https://raw.githubusercontent.com/scrybe-ml/tutorials/master/data/test_set.csv')

y = train_set['target'].copy()
del train_set['target']
y_test = test_set['target']
del test_set['target']

In [5]:
def select_features(df, model_type):
    to_drop = [col for col in df.columns if 'NoGrg' in col]  # dropping dummies that are redundant
    to_drop += [col for col in df.columns if 'NoBsmt' in col]

    if model_type == 'lasso':
        to_drop += [col for col in df.columns if 'BsmtExposure' in col]
        to_drop += [col for col in df.columns if 'BsmtCond' in col]
        to_drop += [col for col in df.columns if 'ExterCond' in col]
        to_drop += [col for col in df.columns if 'HouseStyle' in col]
        to_drop += [col for col in df.columns if 'LotShape' in col]
        to_drop += [col for col in df.columns if 'LotFrontage' in col]
        to_drop += [col for col in df.columns if 'GarageYrBlt' in col]
        to_drop += [col for col in df.columns if 'GarageType' in col]
        to_drop += ['OpenPorchSF', '3SsnPorch']
    if model_type == 'forest':
        to_drop += [col for col in df.columns if 'BsmtExposure' in col]
        to_drop += [col for col in df.columns if 'BsmtCond' in col]
        to_drop += [col for col in df.columns if 'ExterCond' in col]
        to_drop += ['OpenPorchSF', '3SsnPorch']
    if model_type == 'xgb':
        to_drop += [col for col in df.columns if 'BsmtExposure' in col]
        to_drop += [col for col in df.columns if 'BsmtCond' in col]
        to_drop += [col for col in df.columns if 'ExterCond' in col]
    if model_type == 'lgb':
        to_drop += [col for col in df.columns if 'LotFrontage' in col]
        to_drop += [col for col in df.columns if 'HouseStyle' in col]
        to_drop += ['MisBsm']

    for col in to_drop:
        try:
            del df[col]
        except KeyError:
            pass

    return df


models = [('lasso', Lasso(alpha=0.01)),
          ('forest', RandomForestRegressor(n_estimators=10)),
          ('xtree', ExtraTreesRegressor(n_estimators=10)),
          ('xgb', xgb.XGBRegressor(n_estimators=10, objective='reg:squarederror')),
          ('lgb', lgb.LGBMRegressor(n_estimators=10))]

for model in models:
    train = train_set.copy()
    test = test_set.copy()
    print(model[0])

    # Feature subselection
    train = select_features(df=train, model_type=model[0])
    test = select_features(df=test, model_type=model[0])

    model_obj = model[1]
    model_obj.fit(train, y)
    preds = model_obj.predict(test)

    print(f'Test set MSE: {round(mean_squared_error(y_test, preds), 4)}')
    print(f'Test set MAE: {round(mean_absolute_error(y_test, preds), 4)}')
    print(f'Test set R2: {round(r2_score(y_test, preds), 4)}')
    # The shortlist criterion is applied to the rounded test-set MSE.
    if round(mean_squared_error(y_test, preds), 2) <= 0.2:
        scrybe.bookmark(obj=model_obj, obj_name=model[0], msg="Shortlisted models with MSE <= 0.2")

lasso
Scrybe dashboard URL for model_obj:Lasso: http://dashboard.scrybe.ml/#/dashboard/projects/61/models/7e502bc2-c39d-4aa3-a586-91b16a89b78b?client_id=true
Test set MSE: 0.1306
Test set MAE: 0.255
Test set R2: 0.8694
forest
Scrybe dashboard URL for model_obj:RandomForestRegressor: http://dashboard.scrybe.ml/#/dashboard/projects/61/models/86f68f8d-8e06-4d51-8f3a-58a5f2ab26f7?client_id=true
Test set MSE: 0.1763
Test set MAE: 0.2882
Test set R2: 0.8237
xtree
Scrybe dashboard URL for model_obj:ExtraTreesRegressor: http://dashboard.scrybe.ml/#/dashboard/projects/61/models/010c936d-c4dc-40c8-906a-8d4969db3f7d?client_id=true
Test set MSE: 0.2193
Test set MAE: 0.3397
Test set R2: 0.7807
xgb
Scrybe dashboard URL for model_obj:XGBRegressor: http://dashboard.scrybe.ml/#/dashboard/projects/61/models/fd4cbd7d-f676-4a04-96a3-919b1c8b48da?client_id=true
Test set MSE: 0.3844
Test set MAE: 0.4369
Test set R2: 0.6156
lgb
Scrybe dashboard URL for model_obj:LGBMRegressor: http://dashboard.scrybe.ml/#/da
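
The `select_features` helper above repeats the same list comprehension for every substring. One way to make the per-model rules easier to audit is to keep them in a lookup table. The sketch below works on column names only, so it runs without the dataset. `COMMON_SUBSTRINGS`, `DROP_RULES`, and `columns_to_drop` are names introduced for this example, and the sample column names are made up; the rules themselves are copied from `select_features`.

```python
# Substrings dropped for every model type (redundant dummies).
COMMON_SUBSTRINGS = ["NoGrg", "NoBsmt"]

# Per-model rules: "substrings" match anywhere in a column name,
# "exact" entries must match the whole column name.
DROP_RULES = {
    "lasso": {
        "substrings": ["BsmtExposure", "BsmtCond", "ExterCond", "HouseStyle",
                       "LotShape", "LotFrontage", "GarageYrBlt", "GarageType"],
        "exact": ["OpenPorchSF", "3SsnPorch"],
    },
    "forest": {
        "substrings": ["BsmtExposure", "BsmtCond", "ExterCond"],
        "exact": ["OpenPorchSF", "3SsnPorch"],
    },
    "xgb": {
        "substrings": ["BsmtExposure", "BsmtCond", "ExterCond"],
        "exact": [],
    },
    "lgb": {
        "substrings": ["LotFrontage", "HouseStyle"],
        "exact": ["MisBsm"],
    },
}


def columns_to_drop(columns, model_type):
    """Return the columns that the rules for `model_type` would remove."""
    rules = DROP_RULES.get(model_type, {})  # e.g. 'xtree' has no extra rules
    substrings = COMMON_SUBSTRINGS + rules.get("substrings", [])
    exact = set(rules.get("exact", []))
    return [col for col in columns
            if col in exact or any(s in col for s in substrings)]


# Made-up column names, for illustration only.
example_columns = ["GrLivArea", "BsmtCond_TA", "NoGrg", "OpenPorchSF", "HouseStyle_1Story"]
for model_type in ["forest", "lgb", "xtree"]:
    print(model_type, columns_to_drop(example_columns, model_type))
```

With a DataFrame, `df.drop(columns=columns_to_drop(df.columns, model_type))` would return a new frame and leave the input untouched, whereas `select_features` deletes columns in place.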

## Bookmarking Models
You might have noticed that the training loop above includes the following code snippet:

```python
if round(mean_squared_error(y_test, preds), 2) <= 0.2:
    scrybe.bookmark(obj=model_obj, obj_name=model[0], msg="Shortlisted models with MSE <= 0.2")
```
        
This lets you programmatically bookmark models that meet your own criteria. When you return to the Scrybe dashboard, you can then easily shortlist the bookmarked models.
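
Note that the condition is applied to the rounded test-set MSE. A threshold on RMSE (the square root of MSE) would be stricter for values below 1, so the choice of metric changes which models get bookmarked. The sketch below applies both criteria to the MSE values printed in the training output; the `lgb` model is left out because its output is cut off. The variable names are ours.

```python
import math

# Test-set MSE values copied from the training output above.
test_mse = {"lasso": 0.1306, "forest": 0.1763, "xtree": 0.2193, "xgb": 0.3844}

THRESHOLD = 0.2

# Criterion used in the training loop: rounded MSE <= threshold.
by_mse = [name for name, value in test_mse.items()
          if round(value, 2) <= THRESHOLD]

# Alternative criterion: rounded RMSE <= threshold.
by_rmse = [name for name, value in test_mse.items()
           if round(math.sqrt(value), 2) <= THRESHOLD]

print("Shortlisted by MSE: ", by_mse)
print("Shortlisted by RMSE:", by_rmse)
```

With these values, the MSE criterion selects `lasso` and `forest`, while the RMSE criterion selects none of the four models.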

## Grid Search Summary
The next part of this tutorial uses `GridSearchCV` to tune our `LGBMRegressor` model. Its purpose is to familiarize you with the features Scrybe provides specifically for hyperparameter search.

Scrybe captures all models trained as part of the grid search but displays only the best estimator in the model listing table. When you view the details of the best estimator (by clicking on the displayed URL or using `scrybe.peek` shown below), you will find two useful pieces of information: 

* Grid search summary plot: A parallel coordinates plot with all modified hyperparams and the tuning metric as axes
* A listing of all models trained within the grid search

In [6]:
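# The estimator is passed to GridSearchCV directly (not inside a Pipeline),
# so the parameter names must not carry a step prefix such as 'lgb__'.
params = {
    'learning_rate': [0.1, 0.2],
    'n_estimators': [5, 10, 20],
}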

grid_search = GridSearchCV(lgb.LGBMRegressor(), param_grid=params, cv=2, scoring="neg_mean_squared_error")
train = train_set.copy()
test = test_set.copy()
grid_search.fit(train, y=y)
best_estimator = grid_search.best_estimator_
preds = best_estimator.predict(test)
print(f'[Best estimator] Test set MSE: {round(mean_squared_error(y_test, preds), 4)}')
print(f'[Best estimator] Test set MAE: {round(mean_absolute_error(y_test, preds), 4)}')
print(f'[Best estimator] Test set R2: {round(r2_score(y_test, preds), 4)}')

Scrybe dashboard URL for best_estimator:LGBMRegressor: http://dashboard.scrybe.ml/#/dashboard/projects/61/models/1c675049-2a95-4aa2-8716-2ea5334af093?client_id=true
[Best estimator] Test set MSE: 0.1655
[Best estimator] Test set MAE: 0.2861
[Best estimator] Test set R2: 0.8345
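
To get a feel for how many candidates a grid search summary should contain, it can help to enumerate the grid by hand. The sketch below does this with the standard library for the two value lists and `cv=2` used above. The hyperparameter names are written as plain `LGBMRegressor` argument names, and the fit count assumes scikit-learn's default `refit=True`; how Scrybe groups these fits in its listing is not something this sketch can tell you.

```python
from itertools import product

# Same value lists as the grid search above.
param_grid = {
    "learning_rate": [0.1, 0.2],
    "n_estimators": [5, 10, 20],
}
cv_folds = 2

# Cartesian product of all value lists: one dict per candidate.
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

for candidate in candidates:
    print(candidate)

# Each candidate is fitted once per fold, plus one final refit of the best
# candidate on the full training set (GridSearchCV's default refit=True).
n_fits = len(candidates) * cv_folds + 1
print(f"{len(candidates)} candidates, {n_fits} fits in total")
```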


## Scrybe Peek
You can always click the displayed model URL to view the model details in the Scrybe dashboard and share them with your team, but you can also load the model page directly in the notebook with `scrybe.peek`. This gives you access to Scrybe's rich model details without leaving the notebook. To share the model with a team member, add a comment that addresses them with an "@" callout. They will be notified and can review the model in full detail in their own time.

In [7]:
scrybe.peek(best_estimator)

http://dashboard.scrybe.ml/#/dashboard/projects/61/models/1c675049-2a95-4aa2-8716-2ea5334af093?client_id=true
