# Logistic Regression with Grid Search (scikit-learn)

<a href="https://colab.research.google.com/github/VertaAI/modeldb-client/blob/master/workflows/demos/sklearn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# restart your notebook if prompted on Colab
try:
    import verta
except ModuleNotFoundError:
    !pip install verta

This example features:
- **scikit-learn**'s `LogisticRegression` model
- **scikit-learn**'s `GridSearchCV` utility for performing grid search and cross-validation
- **verta**'s Python client logging the grid search results
- **verta**'s Python client retrieving the best run from the grid search to calculate full training accuracy
- predictions against a deployed model
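
The cells that follow expand the hyperparameter grid by hand with `itertools.product` rather than calling `GridSearchCV`. For comparison, here is a minimal sketch of `GridSearchCV` itself on synthetic data. The dataset, the grid values, and `cv=3` are illustrative assumptions, not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the census data (assumption, for illustration only).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Hypothetical grid: 2 values of C x 2 values of max_iter = 4 candidates.
param_grid = {"C": [0.01, 1.0], "max_iter": [100, 200]}

# Each candidate is scored with 3-fold cross-validation
# (the estimator's default score, mean accuracy for classifiers).
search = GridSearchCV(LogisticRegression(solver="lbfgs"), param_grid, cv=3)
search.fit(X, y)

print("Candidates tried:", len(search.cv_results_["params"]))
print("Best hyperparameters:", search.best_params_)
print("Best mean CV accuracy: {:.4f}".format(search.best_score_))
```

Unlike the manual loop used later, this keeps all candidates inside one object, so logging each candidate as a separate experiment run would need an extra pass over `search.cv_results_`.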

In [2]:
HOST = "app.verta.ai"

PROJECT_NAME = "Census Income Classification - MIT Class"
EXPERIMENT_NAME = "Logistic Regression"

In [3]:
import os
# placeholders: substitute your own Verta credentials and keep real keys out of shared notebooks
os.environ['VERTA_EMAIL'] = '<your-email>'
os.environ['VERTA_DEV_KEY'] = '<your-developer-key>'
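
Credentials written into a notebook cell are easy to leak when the notebook is shared. One alternative is to export the two variables in the shell that launches Jupyter and only check for them here. The helper `missing_env_vars` below is a hypothetical name, not part of verta; it is a sketch of that check.

```python
import os

def missing_env_vars(names, environ=os.environ):
    """Return the names in `names` that are unset or empty in `environ`."""
    return [name for name in names if not environ.get(name)]

# Variable names taken from the cell above.
REQUIRED = ["VERTA_EMAIL", "VERTA_DEV_KEY"]

missing = missing_env_vars(REQUIRED)
if missing:
    print("Set these before creating the client:", ", ".join(missing))
else:
    print("Credentials found in the environment.")
```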

## Imports

In [4]:
import warnings
from sklearn.exceptions import ConvergenceWarning
warnings.filterwarnings("ignore", category=ConvergenceWarning)
warnings.filterwarnings("ignore", category=FutureWarning)

import itertools
from multiprocessing import Pool
import os
import time

import six

import numpy as np
import pandas as pd

import sklearn
from sklearn import model_selection
from sklearn import linear_model
from sklearn import metrics

In [5]:
try:
    import wget
except ModuleNotFoundError:
    !pip install wget  # you may need pip3
    import wget

---

# Log Workflow

## Instantiate Client

In [6]:
from verta import Client
from verta.utils import ModelAPI

client = Client(HOST)
proj = client.set_project(PROJECT_NAME)
expt = client.set_experiment(EXPERIMENT_NAME)

set email from environment
set developer key from environment
connection successfully established
set existing Project: Census Income Classification - MIT Class
set existing Experiment: Logistic Regression


## Prepare Data

In [7]:
DATASET_PATH = "./"

train_data_filename = DATASET_PATH + "census-train.csv"
test_data_filename = DATASET_PATH + "census-test.csv"

In [8]:
df_train = pd.read_csv(train_data_filename)
X_train = df_train.iloc[:,:-1]
y_train = df_train.iloc[:, -1]


df_train.head()

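| Unnamed: 0 | age | capital-gain | capital-loss | hours-per-week | workclass_local-gov | workclass_private | workclass_self-emp-inc | workclass_self-emp-not-inc | workclass_state-gov | workclass_without-pay | ... | occupation_handlers-cleaners | occupation_machine-op-inspct | occupation_other-service | occupation_priv-house-serv | occupation_prof-specialty | occupation_protective-serv | occupation_sales | occupation_tech-support | occupation_transport-moving | >50k |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 44 | 0 | 0 | 40 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 21 | 0 | 0 | 40 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 53 | 7298 | 0 | 60 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 3 | 49 | 0 | 0 | 40 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 53 | 0 | 1485 | 40 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |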
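
A column named `Unnamed: 0` is what pandas typically produces when a CSV was written with its row index and then read back without `index_col`. With `iloc[:, :-1]`, that column stays among the features. The sketch below uses a small made-up CSV (not the census data) to show how `index_col=0` would keep it out of the feature matrix; whether the census files should be read this way is an assumption to verify against how they were written.

```python
import io
import pandas as pd

# Tiny made-up CSV with a leading unnamed index column, mimicking the preview above.
csv_text = ",age,hours-per-week,>50k\n0,44,40,0\n1,21,40,0\n2,53,60,1\n"

# Default read: the index column becomes a feature called "Unnamed: 0".
df_default = pd.read_csv(io.StringIO(csv_text))
print(df_default.iloc[:, :-1].columns.tolist())

# Treat the first column as the row index so it is not a feature.
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)
X = df_indexed.iloc[:, :-1]
y = df_indexed.iloc[:, -1]
print(X.columns.tolist())
```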


## Prepare Hyperparameters

In [9]:
hyperparam_candidates = {
    'C': [1e-6, 1e-4, 1e-2, 1e2, 1e4],
    'solver': ['lbfgs'],
    'max_iter': [15, 28],
}
hyperparam_sets = [dict(zip(hyperparam_candidates.keys(), values))
                   for values
                   in itertools.product(*hyperparam_candidates.values())]
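
To make the expansion concrete, the following sketch repeats the cell above and inspects the result. `itertools.product` yields every combination of the candidate values, with the last key varying fastest, so the grid contains 5 × 1 × 2 = 10 hyperparameter sets.

```python
import itertools

# Same candidate grid as in the cell above.
hyperparam_candidates = {
    'C': [1e-6, 1e-4, 1e-2, 1e2, 1e4],
    'solver': ['lbfgs'],
    'max_iter': [15, 28],
}
hyperparam_sets = [dict(zip(hyperparam_candidates.keys(), values))
                   for values
                   in itertools.product(*hyperparam_candidates.values())]

print(len(hyperparam_sets))  # 5 * 1 * 2 = 10 combinations
print(hyperparam_sets[0])    # first combination
print(hyperparam_sets[1])    # only max_iter changes between neighbours
```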

## Run Validation

In [10]:
import json

def run_experiment(hyperparams):
    # create object to track experiment run
    run = client.set_experiment_run()
    
    # create validation split
    (X_val_train, X_val_test,
     y_val_train, y_val_test) = model_selection.train_test_split(X_train, y_train,
                                                                 test_size=0.2,
                                                                 shuffle=True)

    # log hyperparameters
    run.log_hyperparameters(hyperparams)
    print(hyperparams)
    
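    # create and train model on the training portion of the split only,
    # so the validation rows stay unseen until scoring
    model = linear_model.LogisticRegression(**hyperparams)
    model.fit(X_val_train, y_val_train)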
    run.log_attribute("features", X_train.columns.tolist())
    run.log_attribute("resource_requirements", json.dumps({"memory" : "512Mi", "cpu" : "100m"}))
    
    # calculate and log validation accuracy
    val_acc = model.score(X_val_test, y_val_test)
    run.log_metric("val_acc", val_acc)
    run.log_tags(["log_regr"])
    print("Validation accuracy: {:.4f}".format(val_acc))
    
    # create deployment artifacts
    model_api = ModelAPI(X_train, y_train)
    requirements = six.StringIO("scikit-learn=={}".format(sklearn.__version__))
    #requirements = 'requirements.txt'
    
    # save and log model
    run.log_model_for_deployment(model, model_api, requirements, X_train, y_train)
    
for hyperparams in hyperparam_sets:
    run_experiment(hyperparams)

created new ExperimentRun: Run 44415728946718970733
{'C': 1e-06, 'solver': 'lbfgs', 'max_iter': 15}
Validation accuracy: 0.7988
upload complete (model.pkl)
upload complete (model_api.json)
upload complete (requirements.txt)
upload complete (train_data.csv)
created new ExperimentRun: Run 44415728946784318366
{'C': 1e-06, 'solver': 'lbfgs', 'max_iter': 28}
Validation accuracy: 0.7902
upload complete (model.pkl)
upload complete (model_api.json)
upload complete (requirements.txt)
upload complete (train_data.csv)
created new ExperimentRun: Run 44415728946839121037
{'C': 0.0001, 'solver': 'lbfgs', 'max_iter': 15}
Validation accuracy: 0.7942
upload complete (model.pkl)
upload complete (model_api.json)
upload complete (requirements.txt)
upload complete (train_data.csv)
created new ExperimentRun: Run 44415728946894350758
{'C': 0.0001, 'solver': 'lbfgs', 'max_iter': 28}
Validation accuracy: 0.7942
upload complete (model.pkl)
upload complete (model_api.json)
upload complete (requirements.txt)
upl
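
A validation accuracy is only meaningful if the rows used for scoring were not seen during fitting. As an illustration of what a shuffled 80/20 hold-out split does, here is a plain-Python sketch. `holdout_split` is a hypothetical helper, not scikit-learn's implementation, and the fixed seed is an assumption added for reproducibility.

```python
import random

def holdout_split(n_rows, test_size=0.2, seed=0):
    """Return (train_idx, val_idx): a shuffled, disjoint split of range(n_rows)."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_val = int(round(n_rows * test_size))
    return indices[n_val:], indices[:n_val]

train_idx, val_idx = holdout_split(10)
print(len(train_idx), len(val_idx))

# No row index appears on both sides, so a score computed on val_idx
# reflects rows the model has not been fitted on.
print(set(train_idx).isdisjoint(val_idx))
```

Because the notebook's `train_test_split` call passes no `random_state`, each run draws a different split, which may explain part of the variation between the logged accuracies.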

In [11]:
sklearn.__version__

'0.21.3'

---

# Revisit Workflow

## Retrieve Best Run

In [12]:
best_run = expt.expt_runs.sort("metrics.val_acc", descending=True)[0]
print("Validation Accuracy: {:.4f}".format(best_run.get_metric("val_acc")))

best_hyperparams = best_run.get_hyperparameters()
print("Hyperparameters: {}".format(best_hyperparams))

KeyError: 'no metric found with key val_acc'
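
One plausible cause of this `KeyError` is that the experiment already existed (note the `set existing Experiment` message earlier) and contains runs that never logged `val_acc`. The following plain-Python sketch uses stand-in run records rather than the verta API to show the idea of skipping runs without the metric before picking the best one. `best_run_with_metric` is a hypothetical helper, and the record layout is an assumption.

```python
def best_run_with_metric(runs, key):
    """Return the run with the highest `key` metric, ignoring runs that lack it."""
    candidates = [run for run in runs if key in run["metrics"]]
    if not candidates:
        raise KeyError("no run has metric {!r}".format(key))
    return max(candidates, key=lambda run: run["metrics"][key])

# Stand-in records; the accuracy values are borrowed from the log above.
runs = [
    {"name": "old run", "metrics": {}},
    {"name": "run A", "metrics": {"val_acc": 0.7988}},
    {"name": "run B", "metrics": {"val_acc": 0.7902}},
]

best = best_run_with_metric(runs, "val_acc")
print(best["name"], best["metrics"]["val_acc"])
```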

In [None]:
import verta
verta.__version__

## Train on Full Dataset

In [None]:
model = linear_model.LogisticRegression(**best_hyperparams, multi_class='auto')
model.fit(X_train, y_train)

In [None]:
from platform import python_version

print(python_version())

## Calculate Accuracy on Full Training Set

In [None]:
train_acc = model.score(X_train, y_train)
print("Training accuracy: {:.4f}".format(train_acc))
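
For classifiers, scikit-learn's `score` returns mean accuracy: the fraction of samples whose predicted label equals the true label. A minimal plain-Python sketch of that computation, using made-up labels, follows.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Made-up labels: 4 of the 5 predictions match.
y_true = [0, 0, 1, 0, 1]
y_pred = [0, 1, 1, 0, 1]
print("Training accuracy: {:.4f}".format(accuracy(y_true, y_pred)))
```

Accuracy measured on the same rows the model was fitted on tends to be optimistic, so it is best read alongside the validation accuracy rather than in place of it.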

---