# Storing Machine Learning Run Metadata Using SQLite

Author: Travis Jefferies<br>
Last Updated: 05/07/2019<br>

This notebook walks through building a flexible relational model for storing the metadata of a machine learning training or deployment run. The model is implemented in SQLite using a parallel ETL approach: data is held in an in-memory database for on-demand processing during training and deployment runs, and archived to disk for reproducibility and audit purposes. Recording as much detail as possible about each run supports metric tracking, model assessment, and keeping models relevant over time. Other pros and cons of this implementation technique are also explained.

## Import Libraries

In [1]:
import sqlalchemy
import sqlite3
from sqlite3 import Error
import pandas as pd
import numpy as np
from functools import partial
np.random.seed(0)
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_classification
from datetime import datetime
import time

In [2]:
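def create_connection():
    """Open an in-memory SQLite database, print the sqlite3 module version, then close it."""
    conn = None
    try:
        conn = sqlite3.connect(':memory:')
        print(sqlite3.version)
    except Error as e:
        print(e)
    finally:
        # conn stays None if connect() raised, so guard the close
        if conn is not None:
            conn.close()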

## Load data

In [3]:
features, target = make_classification(n_samples=1000, n_features=15, n_informative=6, n_classes=20)

## Store current time as `str`

We'll use this later in a variety of ways.

In [4]:
now = datetime.now()
now = now.strftime("%Y%m%d%H%M")
now

'201905072022'
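
As a rough illustration of two such uses, the sketch below converts the timestamp string to an integer, which is how it can serve as a row id, and parses it back into a `datetime`. The names `run_id_str`, `run_id`, and `run_started` are hypothetical, and the literal value is the one shown in the output above. Note that at minute resolution, two runs started within the same minute would produce the same id.

```python
from datetime import datetime

# Literal copied from the cell output above; in the notebook this would be `now`
run_id_str = "201905072022"

# The string contains only digits, so it converts cleanly to an integer key
run_id = int(run_id_str)

# The same format string recovers the original timestamp (to the minute)
run_started = datetime.strptime(run_id_str, "%Y%m%d%H%M")
print(run_id, run_started.isoformat())
```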

## Train model and tune hyperparameters using `GridSearchCV`

In [6]:
# Create a random forest Classifier. By convention, clf means 'Classifier'
clf = RandomForestClassifier(random_state=0)

# Fit the classifier on the synthetic features and target.
# Scoring on the same data gives training accuracy, not a held-out estimate.
clf.fit(features, target)
train_untuned_accuracy = clf.score(features, target)
print('rfc untuned accuracy: {}'.format(train_untuned_accuracy))

param_grid = { 
    'n_estimators': [20, 40],
    # 'auto' was deprecated in scikit-learn 1.1 and removed in 1.3; use 'sqrt' there
    'max_features': ['auto', 'log2'],
    'max_depth': [10,20]
}


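# Record wall-clock start and end of the grid search as human-readable strings
t = time.asctime(time.localtime(time.time()))
# return_train_score=True keeps 'mean_train_score' in cv_results_ (off by default since scikit-learn 0.21)
CV_rfc = GridSearchCV(estimator=clf, param_grid=param_grid, cv=10, return_train_score=True)
CV_rfc.fit(features, target)
print(CV_rfc.best_params_)
# refit=True (the default) already retrains the best model on all the data
e = time.asctime(time.localtime(time.time()))
train_tuned_accuracy = CV_rfc.score(features, target)
print('tuned accuracy: {}'.format(train_tuned_accuracy))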



rfc untuned accuracy: 0.994
{'max_depth': 10, 'max_features': 'auto', 'n_estimators': 40}
tuned accuracy: 1.0




## `SQLite` class

The `SQLite` class opens the SQLite database connection; the module-level `query_sqlite_db` helper executes queries against it.<br>
Under the hood, both use the `sqlite3` Python library, and `pandas` reads query results back into DataFrames.

In [7]:
class SQLite:

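    def __init__(self, db=None):
        """
        Open a connection to the database file `db`, or to an in-memory
        database when `db` is None.
        """
        if db:
            assert isinstance(db, str)
            # endswith() also handles paths that contain other dots
            assert db.endswith('.db')
        self.create_connection(db)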
    
    
    def create_connection(self, db=None):
        """ create a database connection to a SQLite database """
        try:
            if db:
                self.conn = sqlite3.connect(db)
                print(sqlite3.version)
            else:
                self.conn = sqlite3.connect(':memory:')
                print(sqlite3.version)
        except Error as e:
            print(e)
    
    def close_conn(self):
        self.conn.close()
            
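def query_sqlite_db(conn, query):
    """
    Execute a single SQL statement on `conn`, printing any sqlite3 error.
    """
    cur = None
    try:
        cur = conn.cursor()
        cur.execute(query)
    except Error as e:
        print(e)
    finally:
        # cur stays None if cursor() raised, so guard the close
        if cur is not None:
            cur.close()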
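
The helper above runs a statement but returns nothing and takes no parameters. A minimal sketch of a variant that binds values with `?` placeholders and returns the fetched rows might look like this. The function name `run_query` and the `Demo` table are hypothetical and not part of the notebook.

```python
import sqlite3
from contextlib import closing


def run_query(conn, query, params=()):
    """Execute `query` with bound parameters and return all fetched rows."""
    # closing() guarantees the cursor is closed even if execute() raises
    with closing(conn.cursor()) as cur:
        cur.execute(query, params)
        return cur.fetchall()


conn = sqlite3.connect(":memory:")
run_query(conn, "CREATE TABLE Demo(id INTEGER PRIMARY KEY, name TEXT)")
# The embedded quotes would break a str.format()-built statement; placeholders handle them
run_query(conn, "INSERT INTO Demo VALUES (?, ?)", (1, "O'Brien's model"))
rows = run_query(conn, "SELECT id, name FROM Demo")
conn.close()
print(rows)
```

Binding values this way avoids manual quoting and lets errors propagate to the caller instead of being printed and swallowed.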

## Model metadata

In [8]:
# start_dt/end_dt hold time.asctime() strings, so declare them TEXT
create_sql = """CREATE TABLE Model(id INTEGER PRIMARY KEY, name TEXT, type TEXT, start_dt TEXT, end_dt TEXT)"""
insert_sql = """INSERT INTO Model VALUES({},'{}','{}','{}','{}')""".format(
    now, type(CV_rfc.estimator).__name__, str(type(CV_rfc.estimator))[8:-2], t, e)

s = SQLite()
query_sqlite_db(s.conn, create_sql)
query_sqlite_db(s.conn, insert_sql)
df = pd.read_sql_query('select * from Model',s.conn)

2.6.0


In [9]:
df.head()

| | id | name | type | start_dt | end_dt |
|---|---|---|---|---|---|
| 0 | 201905072022 | RandomForestClassifier | sklearn.ensemble.forest.RandomForestClassifier | Tue May 7 20:23:20 2019 | Tue May 7 20:23:30 2019 |
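
The `start_dt` and `end_dt` values above are `time.asctime()` strings, which read well but do not sort chronologically and are not understood by SQLite's date functions. One possible alternative, sketched here under the assumption that ISO 8601 strings are acceptable, lets SQLite compute the run duration directly. The `ModelIso` table and the hard-coded timestamps (copied from the output above) are illustrative only.

```python
import sqlite3
from datetime import datetime

# Timestamps copied from the output above; a real run would use datetime.now()
start = datetime(2019, 5, 7, 20, 23, 20)
end = datetime(2019, 5, 7, 20, 23, 30)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ModelIso(id INTEGER PRIMARY KEY, start_dt TEXT, end_dt TEXT)"
)
# isoformat(sep=" ") yields 'YYYY-MM-DD HH:MM:SS', a format SQLite's date functions accept
conn.execute(
    "INSERT INTO ModelIso VALUES (?, ?, ?)",
    (201905072022, start.isoformat(sep=" "), end.isoformat(sep=" ")),
)

# strftime('%s', ...) converts each timestamp to Unix seconds inside SQLite
(duration_secs,) = conn.execute(
    "SELECT CAST(strftime('%s', end_dt) AS INTEGER)"
    " - CAST(strftime('%s', start_dt) AS INTEGER) FROM ModelIso"
).fetchone()
conn.close()
print(duration_secs)
```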


## Extending the concept to .pkl files

Now let's extend the concept above to pickled objects generated during the machine learning lifecycle, stored here as BLOB columns rather than as separate .pkl files. We'll use the `best_params_`, `cv_results_`, and `best_estimator_` attributes of the `GridSearchCV` object to illustrate.

In [10]:
CV_rfc.best_estimator_

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=10, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=40, n_jobs=None,
            oob_score=False, random_state=0, verbose=0, warm_start=False)

In [11]:
import pickle

# reference: https://stackoverflow.com/a/2340858

pdata0 = pickle.dumps(CV_rfc.best_params_, pickle.HIGHEST_PROTOCOL)
pdata1 = pickle.dumps(CV_rfc.cv_results_['params'], pickle.HIGHEST_PROTOCOL)
pdata2 = pickle.dumps(CV_rfc.cv_results_['mean_test_score'], pickle.HIGHEST_PROTOCOL)
pdata3 = pickle.dumps(CV_rfc.cv_results_['mean_train_score'], pickle.HIGHEST_PROTOCOL)
pdata4 = pickle.dumps(CV_rfc.cv_results_['mean_fit_time'], pickle.HIGHEST_PROTOCOL)
pdata5 = pickle.dumps(CV_rfc.cv_results_['mean_score_time'], pickle.HIGHEST_PROTOCOL)
pdata6 = pickle.dumps(CV_rfc.best_estimator_, pickle.HIGHEST_PROTOCOL)



In [12]:
create_ModelTrainCV_sql = """CREATE TABLE ModelTrainCV(id INTEGER PRIMARY KEY, name TEXT, type TEXT, start_dt TEXT, end_dt TEXT,
    optimal_model_params BLOB, all_models_params BLOB, all_models_test_scores BLOB, all_models_train_scores BLOB,
    all_models_fit_time_secs BLOB, all_models_score_time_secs BLOB, optimal_model BLOB)"""
curr = s.conn.cursor()
curr.execute(create_ModelTrainCV_sql)
# ? placeholders bind the values; sqlite3.Binary wraps the pickled bytes for the BLOB columns
curr.execute("INSERT INTO ModelTrainCV VALUES (?,?,?,?,?,?,?,?,?,?,?,?)",
             (now, type(CV_rfc.estimator).__name__, str(type(CV_rfc.estimator))[8:-2], t, e) +
             tuple(sqlite3.Binary(p) for p in (pdata0, pdata1, pdata2, pdata3, pdata4, pdata5, pdata6)))
df = pd.read_sql_query('select id, all_models_score_time_secs from ModelTrainCV', s.conn)

In [13]:
df.head()

| | id | all_models_score_time_secs |
|---|---|---|
| 0 | 201905072022 | `b'\x80\x04\x95\xca\x00\x00\x00\x00\x00\x00\x00...` |


In [14]:
pickle.loads(df['all_models_score_time_secs'][0])

array([0.0025784 , 0.00430191, 0.00259876, 0.00440335, 0.00276504,
       0.00532231, 0.0028295 , 0.00483508])
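
Pickled BLOBs round-trip arbitrary Python objects, but they are opaque to SQL, and `pickle.loads` should only be called on data you trust. For plain dictionaries such as `best_params_`, JSON text is a possible alternative. The sketch below stores the same dictionary both ways and reads it back; the `ParamsDemo` table and the hard-coded parameter values (taken from the output shown earlier) are assumptions for illustration.

```python
import json
import pickle
import sqlite3

# Values copied from the printed best_params_ earlier in the notebook
best_params = {"max_depth": 10, "max_features": "auto", "n_estimators": 40}

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ParamsDemo(id INTEGER PRIMARY KEY, params_json TEXT, params_pkl BLOB)"
)
# Python bytes objects are stored as BLOBs; str values are stored as TEXT
conn.execute(
    "INSERT INTO ParamsDemo VALUES (?, ?, ?)",
    (1, json.dumps(best_params), pickle.dumps(best_params, pickle.HIGHEST_PROTOCOL)),
)

params_json, params_pkl = conn.execute(
    "SELECT params_json, params_pkl FROM ParamsDemo WHERE id = 1"
).fetchone()
conn.close()

from_json = json.loads(params_json)    # human-readable and inspectable in SQL tools
from_pickle = pickle.loads(params_pkl)  # only safe for data you wrote yourself
print(from_json == from_pickle)
```

JSON only covers simple types, so fitted estimators and NumPy arrays would still need pickling or another serializer.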

## Write in-memory database to disk

In [None]:
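s.conn.commit()

# write database to disk

c2 = sqlite3.connect('mydb.db')
with c2:  # the context manager commits on success and rolls back on error
    for line in s.conn.iterdump():
        # iterdump() emits 'BEGIN TRANSACTION;' and 'COMMIT;'; skip both and let python handle the transactions
        if line not in ('BEGIN TRANSACTION;', 'COMMIT;'):
            c2.execute(line)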

In [None]:
s.close_conn()
c2.close()
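
On Python 3.7 or newer, `sqlite3.Connection.backup` offers another way to copy an in-memory database to disk without replaying the SQL dump. The sketch below is a standalone illustration with a hypothetical one-row `Model` table and a temporary file path, not the notebook's actual database.

```python
import os
import sqlite3
import tempfile

# Stand-in for the notebook's in-memory database
mem = sqlite3.connect(":memory:")
mem.execute("CREATE TABLE Model(id INTEGER PRIMARY KEY, name TEXT)")
mem.execute("INSERT INTO Model VALUES (?, ?)", (201905072022, "RandomForestClassifier"))
mem.commit()

# Hypothetical archive location inside a fresh temporary directory
path = os.path.join(tempfile.mkdtemp(), "archive_demo.db")
disk = sqlite3.connect(path)

# backup() copies every page of the source database into the target connection
mem.backup(disk)
disk.close()
mem.close()

# Reopen the file to confirm the data survived the copy
check = sqlite3.connect(path)
archived = check.execute("SELECT id, name FROM Model").fetchall()
check.close()
print(archived)
```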