Merged
12 changes: 12 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,17 @@
# Release History

## 1.0.4

### New Features
- Model Registry: Added support for saving/loading/deploying TensorFlow models (`tensorflow.Module`).
- Model Registry: Added support for saving/loading/deploying MLflow PyFunc models (`mlflow.pyfunc.PyFuncModel`).
- Model Development: Input dataframes can now be joined against data loaded from staged files.
- Model Development: Added support for non-English languages.

### Bug Fixes

- Model Registry: Fixed an issue where model dependencies were incorrectly reported as unresolvable on certain platforms.

## 1.0.3 (2023-07-14)

### Behavior Changes
8 changes: 7 additions & 1 deletion README.md
@@ -3,6 +3,7 @@
Snowpark ML is a set of tools including SDKs and underlying infrastructure to build and deploy machine learning models. With Snowpark ML, you can pre-process data, train, manage and deploy ML models all within Snowflake, using a single SDK, and benefit from Snowflake’s proven performance, scalability, stability and governance at every stage of the Machine Learning workflow.

## Key Components of Snowpark ML

The Snowpark ML Python SDK provides a number of APIs to support each stage of an end-to-end Machine Learning development and deployment process, and includes two key components.

### Snowpark ML Development [Public Preview]
@@ -16,6 +17,7 @@ A collection of python APIs to enable efficient model development directly in Sn
### Snowpark ML Ops [Private Preview]

Snowpark MLOps complements the Snowpark ML Development API, and provides model management capabilities along with integrated deployment into Snowflake. Currently, the API consists of

1. FileSet API: FileSet provides a Python fsspec-compliant API for materializing data into a Snowflake internal stage from a query or Snowpark Dataframe along with a number of convenience APIs.

1. Model Registry: A Python API for managing models within Snowflake, which also supports deployment of ML models into Snowflake Warehouses as vectorized UDFs.
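Since FileSet is described as fsspec-compliant, it exposes the familiar filesystem-style surface (`open`, `ls`, `exists`). A toy, dict-backed stand-in below illustrates what that surface looks like; it is purely illustrative and is not the real `snowflake.ml` FileSet API:

```python
import io

class ToyFileSet:
    """Dict-backed stand-in mimicking the fsspec-style surface
    (open/ls/exists) that an fsspec-compliant FileSet exposes.
    Illustrative only; not the real snowflake.ml FileSet API."""

    def __init__(self, files):
        # files: mapping of path -> bytes content
        self._files = dict(files)

    def ls(self, prefix=""):
        # List paths under a prefix, like fsspec's AbstractFileSystem.ls.
        return sorted(p for p in self._files if p.startswith(prefix))

    def exists(self, path):
        return path in self._files

    def open(self, path, mode="rb"):
        # Return a file-like object for reading, as fsspec does.
        if path not in self._files:
            raise FileNotFoundError(path)
        return io.BytesIO(self._files[path])

fs = ToyFileSet({"train/part-0.csv": b"a,b\n1,2\n"})
with fs.open("train/part-0.csv") as f:
    header = f.readline()
```

Because the interface is fsspec-shaped, anything that consumes fsspec file-like objects (pandas, PyArrow, etc.) can read from such a FileSet without special handling.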
@@ -25,15 +27,19 @@ During PrPr, we are iterating on the API without backward compatibility guarantees.
- [Documentation](https://docs.snowflake.com/developer-guide/snowpark-ml)

## Getting started

### Have your Snowflake account ready

If you don't have a Snowflake account yet, you can [sign up for a 30-day free trial account](https://signup.snowflake.com/).

### Create a Python virtual environment
Python 3.8 is required. You can use [miniconda](https://docs.conda.io/en/latest/miniconda.html), [anaconda](https://www.anaconda.com/), or [virtualenv](https://docs.python.org/3/tutorial/venv.html) to create a Python 3.8 virtual environment.

Python versions 3.8, 3.9, and 3.10 are supported. You can use [miniconda](https://docs.conda.io/en/latest/miniconda.html), [anaconda](https://www.anaconda.com/), or [virtualenv](https://docs.python.org/3/tutorial/venv.html) to create a virtual environment.

To have the best experience when using this library, [creating a local conda environment with the Snowflake channel](https://docs.snowflake.com/en/developer-guide/udf/python/udf-python-packages.html#local-development-and-testing) is recommended.

### Install the library to the Python virtual environment

```
pip install snowflake-ml-python
```
6 changes: 4 additions & 2 deletions bazel/get_affected_targets.sh
@@ -28,8 +28,10 @@ help() {
echo "Running ${PROG}"

bazel="bazel"
current_revision=$(git rev-parse HEAD)
pr_revision=${current_revision}
current_revision=$(git symbolic-ref --short -q HEAD \
|| git describe --tags --exact-match 2> /dev/null \
|| git rev-parse --short HEAD)
pr_revision=$(git rev-parse HEAD)
output_path="/tmp/affected_targets/targets"
workspace_path=$(pwd)

3 changes: 3 additions & 0 deletions bazel/mypy/CREDITS.md
@@ -0,0 +1,3 @@
Special thanks to [bazel-mypy-integration](https://github.com/bazel-contrib/bazel-mypy-integration).

This package has been forked from that repo and modified to cater to the specific needs of this Snowflake repo.
263 changes: 138 additions & 125 deletions bazel/mypy/mypy.bzl
@@ -1,54 +1,52 @@
"Public API"

load("@bazel_skylib//lib:shell.bzl", "shell")
load("@bazel_skylib//lib:sets.bzl", "sets")
load("//bazel/mypy:rules.bzl", "MyPyStubsInfo")

MyPyAspectInfo = provider(
"TODO: documentation",
fields = {
"out": "mypy output.",
"cache": "cache generated by mypy.",
"exe": "Used to pass the rule implementation built exe back to calling aspect.",
"out": "Used to pass the dummy output file back to calling aspect.",
},
)

# We don't support stubs (pyi) yet.
PY_EXTENSIONS = ["py"]
PY_RULES = ["py_binary", "py_library", "py_test", "py_wheel", "py_package"]
# Switch to True only during debugging and development.
# All releases should have this as False.
DEBUG = False

VALID_EXTENSIONS = ["py", "pyi"]

DEFAULT_ATTRS = {
"_mypy_sh": attr.label(
"_template": attr.label(
default = Label("//bazel/mypy:mypy.sh.tpl"),
allow_single_file = True,
),
"_mypy": attr.label(
"_mypy_cli": attr.label(
default = Label("//bazel/mypy:mypy"),
executable = True,
cfg = "host",
cfg = "exec",
),
"_mypy_config": attr.label(
default = Label("//:mypy.ini"),
allow_single_file = True,
),
"_debug": attr.bool(
default = False,
)
}

# See https://github.com/python/mypy/pull/4759 for what `cache_map_triples` mean.
def _sources_to_cache_map_triples(cache_files, dep_cache_files):
def _sources_to_cache_map_triples(srcs):
triples_as_flat_list = []
for d in (cache_files, dep_cache_files):
for src, (meta, data) in d.items():
triples_as_flat_list.extend([
shell.quote(src.path),
shell.quote(meta.path),
shell.quote(data.path),
])
for f in srcs:
f_path = f.path
triples_as_flat_list.extend([
shell.quote(f_path),
shell.quote("{}.meta.json".format(f_path)),
shell.quote("{}.data.json".format(f_path)),
])
return triples_as_flat_list
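The rewritten helper above maps each source file to a (source, meta-cache, data-cache) triple, flattened into one shell-quoted list. A plain-Python mirror of the same mapping, for illustration only:

```python
import shlex

def sources_to_cache_map_triples(src_paths):
    """Mirror of the Starlark helper: for every source file, emit
    (source, meta-cache, data-cache) as a flat, shell-quoted list.
    Illustrative sketch, not the Bazel rule itself."""
    flat = []
    for p in src_paths:
        flat.extend([
            shlex.quote(p),                        # the source file
            shlex.quote("{}.meta.json".format(p)), # mypy's meta cache
            shlex.quote("{}.data.json".format(p)), # mypy's data cache
        ])
    return flat

triples = sources_to_cache_map_triples(["pkg/a.py"])
```

Each triple tells mypy (via `--cache-map`) exactly where to read and write its incremental-cache files for that source, instead of using a shared `.mypy_cache` directory.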

def _flatten_cache_dict(cache_files):
result = []
for meta, data in cache_files.values():
result.append(meta)
result.append(data)
return result
def _is_external_dep(dep):
return dep.label.workspace_root.startswith("external/")

def _is_external_src(src_file):
return src_file.path.startswith("external/")
@@ -57,127 +55,142 @@ def _extract_srcs(srcs):
direct_src_files = []
for src in srcs:
for f in src.files.to_list():
if f.extension in PY_EXTENSIONS and not _is_external_src(f):
if f.extension in VALID_EXTENSIONS:
direct_src_files.append(f)
return direct_src_files

# Overview
# This aspect does the following:
# - Create an action to run mypy against the sources of `target`
# - input of this action:
# - source files of `target` and source files of all its deps.
# - cache files produced by checking its deps.
# - output of this action:
# - mypy stderr+stdout in a file
# - cache files produced by checking the source files of `target`
# - this action depends on actions created for the deps, so that it always
# has access to cache files produced by those actions.
# - Propagate the output of this action along the `deps` edge of the build graph.
# - Produces an OutputGroup which contains the output of all the actions created
# along the build graph, so that one can use the bazel command line to mark all
# those actions as required and make them run.
def _mypy_aspect_impl(target, ctx):
if (ctx.rule.kind not in PY_RULES or
ctx.label.workspace_root.startswith("external")):
return []
def _extract_transitive_deps(deps):
transitive_deps = []
for dep in deps:
if MyPyStubsInfo not in dep and PyInfo in dep and not _is_external_dep(dep):
transitive_deps.append(dep[PyInfo].transitive_sources)
return transitive_deps

def _extract_stub_deps(deps):
# Need to add the .py files AND the .pyi files that are
# deps of the rule
stub_files = []
for dep in deps:
if MyPyStubsInfo in dep:
for stub_srcs_target in dep[MyPyStubsInfo].srcs:
for src_f in stub_srcs_target.files.to_list():
if src_f.extension == "pyi":
stub_files.append(src_f)
return stub_files

def _extract_imports(imports, label):
# NOTE: Bazel's implementation of this for py_binary, py_test is at
# src/main/java/com/google/devtools/build/lib/bazel/rules/python/BazelPythonSemantics.java
mypypath_parts = []
for import_ in imports:
if import_.startswith("/"):
# buildifier: disable=print
print("ignoring invalid absolute path '{}'".format(import_))
elif import_ in ["", "."]:
mypypath_parts.append(label.package)
else:
mypypath_parts.append("{}/{}".format(label.package, import_))
return mypypath_parts
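The `_extract_imports` helper turns a rule's `imports` attribute into MYPYPATH entries rooted at the target's package, skipping invalid absolute paths. The same logic in plain Python, as an illustrative sketch (the real Starlark version prints a warning instead of silently skipping):

```python
def extract_imports(imports, package):
    """Plain-Python mirror of _extract_imports: convert a rule's
    `imports` attribute into MYPYPATH entries relative to `package`.
    Illustrative sketch only."""
    parts = []
    for imp in imports:
        if imp.startswith("/"):
            # Absolute paths are invalid; the real rule warns and ignores them.
            continue
        if imp in ("", "."):
            parts.append(package)
        else:
            parts.append("{}/{}".format(package, imp))
    return parts

parts = extract_imports([".", "src", "/abs"], "tools/lib")
```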

def _mypy_rule_impl(ctx):
base_rule = ctx.rule
debug = ctx.attr._debug
mypy_config_file = ctx.file._mypy_config

# Get the cache files generated by running mypy against the deps.
dep_cache_files = {}
for dep in ctx.rule.attr.deps:
if MyPyAspectInfo in dep:
dep_cache_files.update(dep[MyPyAspectInfo].cache)
mypy_config_file = ctx.file._mypy_config

mypypath_parts = []
direct_src_files = []
transitive_srcs_depsets = []
stub_files = []

if hasattr(base_rule.attr, "srcs"):
direct_src_files = _extract_srcs(base_rule.attr.srcs)

# It's possible that this target does not have srcs (py_wheel, for example).
# However, if the user requests type-checking of a py_wheel, we should make sure
# its transitive Python deps get checked.
if direct_src_files:
# There are source files in this target to check. The check will result in
# cache files. Request bazel to allocate those files now.
cache_files = {}
for src in direct_src_files:
meta_file = ctx.actions.declare_file("{}.meta.json".format(src.basename))
data_file = ctx.actions.declare_file("{}.data.json".format(src.basename))
cache_files[src] = (meta_file, data_file)


# The mypy stdout, which is expected to be produced by mypy_script.
mypy_out = ctx.actions.declare_file("%s_mypy_out" % ctx.rule.attr.name)
# The script to invoke mypy against this target.
mypy_script = ctx.actions.declare_file(
"%s_mypy_script" % ctx.rule.attr.name,
)

# Generated files are located in a different root dir than source files.
# Thus we need to let mypy know where to find both kinds, in case both
# are present in one analysis.
src_root_paths = sets.to_list(
sets.make(
[f.root.path for f in dep_cache_files.keys()] +
[f.root.path for f in cache_files.keys()]),
)

all_src_files = direct_src_files + list(dep_cache_files.keys())
if hasattr(base_rule.attr, "deps"):
transitive_srcs_depsets = _extract_transitive_deps(base_rule.attr.deps)
stub_files = _extract_stub_deps(base_rule.attr.deps)

if hasattr(base_rule.attr, "imports"):
mypypath_parts = _extract_imports(base_rule.attr.imports, ctx.label)

final_srcs_depset = depset(transitive = transitive_srcs_depsets +
[depset(direct = direct_src_files)])
src_files = [f for f in final_srcs_depset.to_list() if not _is_external_src(f)]
if not src_files:
return None

mypypath_parts += [src_f.dirname for src_f in stub_files]
mypypath = ":".join(mypypath_parts)

out = ctx.actions.declare_file("%s_dummy_out" % ctx.rule.attr.name)
exe = ctx.actions.declare_file(
"%s_mypy_exe" % ctx.rule.attr.name,
)

# Compose a list of the files needed for use. Note that aspect rules can use
# the project version of mypy; however, other rules should fall back on their
# relative runfiles.
runfiles = ctx.runfiles(files = src_files + stub_files + [mypy_config_file])

src_root_paths = sets.to_list(
sets.make([f.root.path for f in src_files]),
)

ctx.actions.expand_template(
template = ctx.file._template,
output = exe,
substitutions = {
"{MYPY_BIN}": ctx.executable._mypy.path,
"{CACHE_MAP_TRIPLES}": " ".join(_sources_to_cache_map_triples(cache_files, dep_cache_files)),
"{MYPY_EXE}": ctx.executable._mypy_cli.path,
"{MYPY_ROOT}": ctx.executable._mypy_cli.root.path,
"{CACHE_MAP_TRIPLES}": " ".join(_sources_to_cache_map_triples(src_files)),
"{PACKAGE_ROOTS}": " ".join([
"--package-root " + shell.quote(path or ".")
for path in src_root_paths
]),
"{SRCS}": " ".join([
shell.quote(f.path)
for f in all_src_files
for f in src_files
]),
"{VERBOSE_OPT}": "--verbose" if debug else "",
"{VERBOSE_BASH}": "set -x" if debug else "",
"{OUTPUT}": mypy_out.path,
"{ADDITIONAL_MYPYPATH}": ":".join([p for p in src_root_paths if p]),
"{MYPY_INI}": mypy_config_file.path,
}
ctx.actions.expand_template(
template = ctx.file._mypy_sh,
output = mypy_script,
substitutions = substitutions,
is_executable = True,
)

# We want mypy to follow imports, so all the source files of the dependencies
# are needed altogether to check this target.
ctx.actions.run(
outputs = [mypy_out] + _flatten_cache_dict(cache_files),
inputs = depset(
all_src_files +
[mypy_config_file] +
_flatten_cache_dict(dep_cache_files) # cache generated by analyzing deps
),
tools = [ctx.executable._mypy],
executable = mypy_script,
mnemonic = "MyPy",
progress_message = "Type-checking %s" % ctx.label,
use_default_shell_env = True,
)
dep_cache_files.update(cache_files)
transitive_mypy_outs = []
for dep in ctx.rule.attr.deps:
if OutputGroupInfo in dep:
if hasattr(dep[OutputGroupInfo], "mypy"):
transitive_mypy_outs.append(dep[OutputGroupInfo].mypy)
"{VERBOSE_OPT}": "--verbose" if DEBUG else "",
"{VERBOSE_BASH}": "set -x" if DEBUG else "",
"{OUTPUT}": out.path if out else "",
"{MYPYPATH_PATH}": mypypath if mypypath else "",
"{MYPY_INI_PATH}": mypy_config_file.path,
},
is_executable = True,
)
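The `expand_template` call above fills the placeholders in `mypy.sh.tpl` by literal substring replacement. A rough Python analogue of what Bazel does here (the template string and substitution values below are made up for illustration):

```python
def expand_template(template, substitutions):
    """Rough Python analogue of ctx.actions.expand_template: replace
    each placeholder key with its value via literal substring
    replacement, which is what Bazel's implementation performs."""
    out = template
    for key, value in substitutions.items():
        out = out.replace(key, value)
    return out

# Hypothetical template and values, just to show the mechanism.
script = expand_template(
    "{MYPY_EXE} {VERBOSE_OPT} --config {MYPY_INI_PATH}",
    {
        "{MYPY_EXE}": "bazel-bin/mypy",
        "{VERBOSE_OPT}": "--verbose",
        "{MYPY_INI_PATH}": "mypy.ini",
    },
)
```

Because the replacement is literal, placeholder names are chosen (here `{...}`-delimited) so they cannot collide with ordinary shell text in the template.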

return [
DefaultInfo(executable = exe, runfiles = runfiles),
MyPyAspectInfo(exe = exe, out = out),
]

def _mypy_aspect_impl(_, ctx):
if (ctx.rule.kind not in ["py_binary", "py_library", "py_test", "mypy_test"] or
ctx.label.workspace_root.startswith("external")):
return []

providers = _mypy_rule_impl(
ctx
)
if not providers:
return []

info = providers[0]
aspect_info = providers[1]

ctx.actions.run(
outputs = [aspect_info.out],
inputs = info.default_runfiles.files,
tools = [ctx.executable._mypy_cli],
executable = aspect_info.exe,
mnemonic = "MyPy",
progress_message = "Type-checking %s" % ctx.label,
use_default_shell_env = True,
)
return [
OutputGroupInfo(
# We may not need to run mypy against this target, but we request that
# all its dependencies be checked, recursively, by demanding the output
# of those checks.
mypy = depset([mypy_out] if direct_src_files else [], transitive=transitive_mypy_outs),
mypy = depset([aspect_info.out]),
),
MyPyAspectInfo(out = mypy_out if direct_src_files else None, cache = dep_cache_files),
]
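In the variant that propagates results transitively (the pre-change code above), each target's `mypy` output group contains its own output plus those of everything it depends on. A small Python sketch of that accumulation, using a recursive walk where Bazel would use depsets to share structure instead of copying:

```python
def collect_transitive_outputs(target, deps, outputs):
    """Sketch of transitive OutputGroup accumulation: a target
    contributes its own mypy output plus, recursively, the outputs
    of all its dependencies. Illustrative; Bazel uses depsets."""
    collected = set(outputs.get(target, []))
    for dep in deps.get(target, []):
        collected |= collect_transitive_outputs(dep, deps, outputs)
    return collected

# Hypothetical two-target graph: app depends on lib.
deps = {"app": ["lib"], "lib": []}
outputs = {"app": ["app_mypy_out"], "lib": ["lib_mypy_out"]}
result = collect_transitive_outputs("app", deps, outputs)
```

Requesting the `mypy` output group on a top-level target is then enough to force the type-check actions for its whole dependency subgraph to run.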

mypy_aspect = aspect(