
The Challenge

Model Definitions by ML Specialists

Scientific research has produced many model definitions, typically in the form of Python classes that call the TensorFlow or PyTorch APIs, or as a set of XGBoost parameters kept in a text file.
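
For instance, a model definition calling the TensorFlow API can be as small as a Keras subclass, and an XGBoost "model definition" can be little more than a set of training parameters. The class name, layer sizes, and parameter values below are illustrative placeholders, not taken from any particular project.

# A hypothetical model definition calling the TensorFlow (Keras) API.
import tensorflow as tf

class TinyClassifier(tf.keras.Model):
  def __init__(self):
    super().__init__()
    self.dense = tf.keras.layers.Dense(2, activation="softmax")  # two-class output

  def call(self, inputs):
    return self.dense(inputs)

# A hypothetical XGBoost "model definition": just a set of training parameters.
xgb_params = {"objective": "binary:logistic", "max_depth": 4, "eta": 0.1}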

End-to-end ML Pipelines by Production Engineers

We want to reuse these models in production. To do so, we need to preprocess large training and test datasets, set the right hyperparameters, run distributed training jobs that take the model definition and the preprocessed training data on computing clusters, and evaluate the trained model parameters on the preprocessed test data. Together, these steps are usually called an end-to-end machine learning pipeline.

SQL vs. Python

In practice, a model definition often takes dozens of lines of Python code, while the end-to-end steps usually take hundreds, if not thousands, of lines of Python code.

SQLFlow allows users to write the end-to-end pipeline in SQL with a few syntax extensions. Such a rewrite often reduces thousands of lines of Python code to a few lines of SQL.

A Conceptual Side-by-side Comparison

Here are some code snippets that exhibit the value of SQLFlow for developers. Consider a model definition written as a Python class by a machine learning specialist:

import torch

class MyModelDefinition(torch.nn.Module):
  def __init__(self):
    super().__init__()
    self.layers = ...  # e.g., a stack of torch.nn layers
  def forward(self, inputs):
    return self.layers(inputs)

End-to-End Pipeline in Python

End-to-end applications of such model definitions often include business intelligence, market analysis, ranking results of search engines, advertising, or recommendation systems. Companies running such applications hire software engineers who write the end-to-end pipeline in Python, Java, C++, or other languages.

An oversimplified example of an end-to-end pipeline in Python might look like the following.

def preprocess_data(db, sql_preprocess_stmts):
  # Run the preprocessing SQL statements and write the result to a temporary table.
  tmp_table = mysql.run_sql_stmts(db, sql_preprocess_stmts)
  return tmp_table

def dump_data(db, tmp_table, dataset_file):
  # Export the temporary table row by row into a file on shared storage.
  tbl = db.open(tmp_table)
  for row in tbl:
    append_to_file(dataset_file, row)

def do_some_statistics(dataset_file):
  for row in open(dataset_file):
    ... compute mean, covariance, vocabulary list for each column ...
  return statistics

def transform_data_to_features(dataset_file, statistics):
  processed_file = dataset_file + ".features"
  for row in open(dataset_file):
    ... normalize numeric columns, encode categorical columns, append to processed_file ...
  return processed_file

def train_model(model_definition, processed_data, hyperparameters):
  # Launch a distributed training job on the cluster and return the path to the
  # trained model parameters.
  job = kubeflow.create_tensorflow_job(model_definition, processed_data, hyperparameters)
  job.run()
  return job.workdir() + "/trained_model_parameters"

def end_to_end_train_MyModelDefinition(database, hyperparameters):
  db = mysql.open(database)
  tmp_table = preprocess_data(db, "SELECT * FROM tb1, tb2 WHERE ...")
  dump_data(db, tmp_table, "/mnt/storage/dataset.txt")
  stats = do_some_statistics("/mnt/storage/dataset.txt")
  processed_file = transform_data_to_features("/mnt/storage/dataset.txt", stats)
  trained_model_file = train_model(MyModelDefinition, processed_file, hyperparameters)
  return trained_model_file
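
As a rough illustration, a caller would invoke this hand-written pipeline along the following lines; the database name and hyperparameter values are made-up placeholders.

# Hypothetical invocation of the pipeline above.
hyperparameters = {"learning_rate": 0.01, "batch_size": 256, "epochs": 10}
model_file = end_to_end_train_MyModelDefinition("analytics_db", hyperparameters)
print("trained model parameters written to", model_file)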

End-to-End Pipeline in SQL

In SQLFlow, we can write such a long pipeline in SQL with extended syntax in just a few lines:

SELECT * FROM tb1, tb2 WHERE ... 
  TO TRAIN MyModelDefinition WITH hyperparameters INTO trained_model_file;

SQLFlow translates the above one-statement SQL program into the much longer Python program, drawing on the expertise of many ML specialists. More precisely, SQLFlow outputs a workflow rather than a Python program. Each step of this workflow runs as a Docker container on a Kubernetes cluster.
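
To make "each step runs as a Docker container on a Kubernetes cluster" concrete, here is a minimal sketch of launching one containerized step with the official kubernetes Python client. This only illustrates the execution model and is not SQLFlow's actual workflow engine; the image name and command in the usage comment are assumptions.

# Minimal sketch: submit one workflow step as a Kubernetes Job (illustration only).
from kubernetes import client, config

def submit_step(name, image, command):
  config.load_kube_config()  # assumes a reachable cluster and a local kubeconfig
  container = client.V1Container(name=name, image=image, command=command)
  pod_spec = client.V1PodSpec(restart_policy="Never", containers=[container])
  job = client.V1Job(
    metadata=client.V1ObjectMeta(name=name),
    spec=client.V1JobSpec(template=client.V1PodTemplateSpec(spec=pod_spec)),
  )
  client.BatchV1Api().create_namespaced_job(namespace="default", body=job)

# e.g., submit_step("train", "some/training-image:latest", ["python", "train.py"])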

Real Cases

Train a Credit Model for Fintech