Red Hat's framework for running multiple data science models against input data sets.
Operationalizing outputs from data science efforts is a tricky art. With so much data science relying on cutting-edge techniques, it can be difficult to balance those techniques with the need for stable infrastructure. Running a static analysis to produce a spreadsheet or slide deck is fairly straightforward, but it is also prone to human error, time constraints, and limited resources. If you're trying to make real-time decisions based on data science models, the engine that processes those models has to be reliable.
Our original use case for this package was deploying lead scoring models: given the information we know about a person (based on marketing data), how should they be prioritized when passed to sales? If we implement a model for scoring leads and the infrastructure breaks (due to deploying a new model, uncaught exceptions, etc.), there are immediate downstream impacts on other groups who depend on this information.
Challenges with other platforms:
- Limited modeling options; usually only simple "if-then" statements
- Slow processing times
- Lack of flexibility to change model types quickly
- Lack of version control
To address these challenges, we built the `mlsm` package.
Using a Python framework offers the following solutions:
- Limited modeling options; usually only simple "if-then" statements
  - Python offers direct access to data science standards, such as linear regression, random forests, and machine learning
- Slow processing times
  - By running Python on a server, we can add more resources to speed up processing as needed
- Lack of flexibility to quickly deploy new models
  - With a standardized architecture designed for multiple concurrent models, deploying new models is easy
- Lack of version control
  - By deploying with a Git-based system, we can easily view changes and roll back to older code; the architecture for multiple models also allows easier comparison of outputs against live data
`Model` is a high-level wrapper for functions that run models. This not only puts functions in a framework where they can be applied more consistently; it also facilitates storage of results by model name/version, improves validation of input data, and simplifies any script that actually applies models.
The inputs for executing a function, `data` and `results`, are expected to be a single record (the function `RunModelsAll` is used to apply models across multiple records).
`Model` also collects a `fields` dict, used to validate input values before execution (future enhancements will include more extensive unit/case/exception handling).
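To make this concrete, here is a minimal sketch of the wrapper idea. The class shape, parameter names, and the `validate`/`run` methods are illustrative assumptions, not the package's actual API:

```python
# Minimal sketch of the Model wrapper idea; names and signatures are
# illustrative assumptions, not the mlsm package's actual API.
class Model:
    def __init__(self, name, version, fields, func, status="live"):
        self.name = name        # results are stored by name/version
        self.version = version
        self.fields = fields    # dict of expected input fields -> types
        self.func = func        # the underlying model function
        self.status = status    # e.g. 'draft' for models under test

    def validate(self, data):
        # Check that every declared field is present with the right type.
        for field, ftype in self.fields.items():
            if not isinstance(data.get(field), ftype):
                raise ValueError("bad or missing field: %s" % field)

    def run(self, data):
        self.validate(data)
        return self.func(data)

# Example: a toy lead-scoring function wrapped as a Model.
def score_lead(data):
    return min(100, data["web_visits_30d"] * 5 + data["employee_count"] // 100)

lead_score = Model(
    name="lead_score",
    version="1.2.0",
    fields={"employee_count": int, "web_visits_30d": int},
    func=score_lead,
)
```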
`SummaryModel` is an extension of `Model`: it expects the output of previously run `Model`s to be passed (via `results`) for the purpose of running a second-level computation. For example, if three `Model`s are run, a `SummaryModel` could then be used to determine which score 'wins' (that is, which score is passed back to business users).
When passed to `RunModelsAll`, every `SummaryModel` will be run after all `Model`s have run.
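A sketch of the same idea, building on the `Model` sketch above (names remain illustrative assumptions):

```python
# Sketch of the SummaryModel idea: it consumes the results of previously
# run Models rather than raw input data (builds on the Model sketch above).
class SummaryModel(Model):
    def run(self, results):
        # `results` maps "name:version" keys to model outputs, so the
        # field validation used for raw input data is skipped here.
        return self.func(results)

# Example: pass the highest score back to business users.
winner = SummaryModel(
    name="winner",
    version="1.0.0",
    fields={},
    func=lambda results: max(results.values()),
)
```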
- An ETL script prepares data for processing (not part of the `mlsm` package)
- Each record is a dict
- Each record has a unique identifier (which can later be passed as a parameter to `RunModelsAll`)
- Data for each record is structured in sub-dictionaries (see the example after this list):
  - `data`: top-level dict
  - `Model.name`: sub-level
  - `Model.version`: sub-level; data values contained here
- When running a `Model`, the data passed to the underlying function is taken from the path `data[Model.name][Model.version]`
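A hypothetical record illustrating this layout; every field name and value here is invented for illustration:

```python
# Hypothetical record produced by the ETL step (names/values invented).
record = {
    "id": "lead-00042",              # unique identifier
    "data": {
        "lead_score": {              # Model.name
            "1.2.0": {               # Model.version -> input values
                "employee_count": 3500,
                "web_visits_30d": 12,
            }
        }
    },
}

# A Model with name="lead_score" and version="1.2.0" receives:
inputs = record["data"]["lead_score"]["1.2.0"]
```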
- For each record passed (`records`), as sketched in code after this list:
  - For each `Model` passed (`models`):
    - Execute model
  - For each `SummaryModel` passed (`summaryModels`):
    - Execute summary model
- If DB parameters are passed:
  - Store output `results` object in MongoDB
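The loop might look like the following, reusing the `Model`/`SummaryModel` sketches above; the function signature and the pymongo storage step are illustrative, not the package's actual `RunModelsAll`:

```python
# Sketch of the execution flow described above (signature and storage
# details are assumptions, not the actual RunModelsAll implementation).
from pymongo import MongoClient

def run_models_all(records, models, summary_models, mongo_uri=None):
    outputs = []
    for record in records:
        results = {}
        for model in models:
            inputs = record["data"][model.name][model.version]
            key = "%s:%s" % (model.name, model.version)
            results[key] = model.run(inputs)
        # Summary models run only after all Models have run.
        for summary in summary_models:
            key = "%s:%s" % (summary.name, summary.version)
            results[key] = summary.run(results)
        outputs.append({"record_id": record["id"], "results": results})
    if mongo_uri and outputs:
        MongoClient(mongo_uri)["mlsm"]["results"].insert_many(outputs)
    return outputs
```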
We use Red Hat's OpenShift (Python 3.3 cartridge), with an additional install script to run Anaconda for more advanced models.
We follow this procedure for creating and deploying models (this is also the workflow for passing models from data scientists to engineers for deployment):
- Create a new local git repo (and a remote for backup)
- Use git-flow for develop/release/feature management
- When a new model is ready to be deployed, push to remote and inform engineering team with commit ID
- If testing a model from the `develop` branch, `Model.status='draft'` must be set
- Pull model repo to local
- Add to deployment code repo
- Import the model into the script which runs `RunModelsAll` (a sketch of such a script follows this list)
- Add any additional requirements to existing ETL scripts
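As an illustration of the import step, the deployment script might look like this; the module name `lead_models` is invented, and the objects reuse the sketches above:

```python
# Hypothetical deployment script showing the import step; the module
# name lead_models and its contents are invented for illustration.
from lead_models import lead_score, winner  # models pulled from the model repo

records = []  # populated by the existing ETL scripts (not part of mlsm)

run_models_all(
    records,
    models=[lead_score],
    summary_models=[winner],
    mongo_uri="mongodb://localhost:27017",
)
```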
We are also standardizing this implementation with Jenkins, so data scientists can have their latest models pulled in directly from the remote.