Bugfixes
- Fixed an issue in the dagstermill library that caused solid config fetch to be non-deterministic.
- Fixed an issue in the K8sScheduler where multiple pipeline runs were kicked off for each scheduled execution.
New
- Added ADLS2 storage plugin for Spark DataFrame (Thanks @sd2k!)
- Added feature in the Dagit Playground to automatically remove extra configuration that does not conform to a pipeline’s config schema.
- [Dagster-Celery/Celery-K8s/Celery-Docker] Added Celery worker names and pods to the logs for each step execution
Community contributions
- Re-enabled dagster-azure integration tests in dagster-databricks tests (Thanks @sd2k!)
- Moved dict_without_keys from dagster-pandas into dagster.utils (Thanks @DavidKatz-il)
- Moved Dask DataFrame read/to options under read/to keys (Thanks @kinghuang)
Bugfixes
- Fixed helper for importing data from GCS paths into Bigquery (Thanks @grabangomb (https://github.com/grabangomb)!)
- Postgres event storage now waits to open a thread to watch runs until it is needed
Experimental
- Added version computation function for DagsterTypeLoader. (Actual versioning will be supported in 0.10.0)
- Added version attribute to solid and SolidDefinition. (Actual versioning will be supported in 0.10.0)
New
- UI improvements to the backfill partition selector
- Enabled sorting of steps by failure in the partition run matrix in Dagit
Bugfixes
- [dagstermill] fixes an issue with output notebooks and s3 storage
- [dagster_celery] bug fixed in pythonpath calculation (thanks @enima2648!)
- [dagster_pandas] marked create_structured_dataframe_type and ConstraintWithMetadata as experimental APIs
- [dagster_k8s] reduced default job backoff limit to 0
Docs
- Various docs site improvements
Breaking Changes
- When using the
configured
API on a solid or composite solid, a new solid name must be provided. - The image used by the K8sScheduler to launch scheduled executions is now specified under the “scheduler” section of the Helm chart (previously under “pipeline_run” section).
New
- Added an experimental mode that speeds up interactions in dagit by launching a gRPC server on startup for each repository location in your workspace. To enable it, add the following to your
dagster.yaml
:
opt_in:
local_servers: true
- Intermediate Storage and System Storage now default to the first provided storage definition when no configuration is provided. Previously, it would be necessary to provide a run config for storage whenever providing custom storage definitions, even if that storage required no run configuration. Now, if the first provided storage definition requires no run configuration, the system will default to using it.
- Added a timezone picker to Dagit, and made all timestamps timezone-aware
- Added solid_config to hook context which provides the access to the config schema variable of the corresponding solid.
- Hooks can be directly set on
PipelineDefinition
or@pipeline
, e.g.@pipeline(hook_defs={hook_a})
. It will apply the hooks on every single solid instance within the pipeline. - Added Partitions tab for partitioned pipelines, with new backfill selector.
Breaking Changes
- Removed deprecated
--env
flag from CLI - The
--host
CLI param has been renamed to--grpc_host
to avoid conflict with the dagit--host
param.
New
- Descriptions for solid inputs and outputs will now be inferred from doc blocks if available (thanks @AndersonReyes !)
- Various documentation improvements (thanks @jeriscc !)
- Load inputs from pyspark dataframes (thanks @davidkatz-il !)
- Added step-level run history for partitioned schedules on the schedule view
- Added great_expectations integration, through the
dagster_ge
library. Example usage is under a new example, calledge_example
, and documentation for the library can be found under the libraries section of the api docs. PythonObjectDagsterType
can now take a tuple of types as well as a single type, more closely mirroringisinstance
and allowing Union types to be represented in Dagster.- The
configured
API can now be used on all definition types (includingCompositeDefinition
). Example usage has been updated in the configuration documentation. - Updated Helm chart to include auto-generated user code configmap in user code deployment by default
Bugfixes
- Databricks now checks intermediate storage instead of system storage
- Fixes a bug where applying hooks on a pipeline with composite solids would flatten the top-level solids. Now applying hooks on pipelines or composite solids means attaching hooks to every single solid instance within the pipeline or the composite solid.
- Fixes the GraphQL playground hosted by dagit
- Fixes a bug where K8s CronJobs were stopped unnecessarily during schedule reconciliation
Experimental
- New
dagster-k8s/config
tag that lets users pass in custom configuration to the KubernetesJob
,Job
metadata,JobSpec
,PodSpec
, andPodTemplateSpec
metadata.- This allows users to specify settings like eviction policy annotations and node affinities.
- Example:
@solid( tags = { 'dagster-k8s/config': { 'container_config': { 'resources': { 'requests': { 'cpu': '250m', 'memory': '64Mi' }, 'limits': { 'cpu': '500m', 'memory': '2560Mi' }, } }, 'pod_template_spec_metadata': { 'annotations': { "cluster-autoscaler.kubernetes.io/safe-to-evict": "true"} }, 'pod_spec_config': { 'affinity': { 'nodeAffinity': { 'requiredDuringSchedulingIgnoredDuringExecution': { 'nodeSelectorTerms': [{ 'matchExpressions': [{ 'key': 'beta.kubernetes.io/os', 'operator': 'In', 'values': ['windows', 'linux'], }] }] } } } }, }, }, ) def my_solid(context): context.log.info('running')
Breaking Changes
- The
--env
flag no longer works for thepipeline launch
orpipeline execute
commands. Use--config
instead. - The
pipeline execute
command no longer accepts the--workspace
argument. To execute pipelines in a workspace, usepipeline launch
instead.
New
- Added
ResourceDefinition.mock_resource
helper for magic mocking resources. Example usage can be found here - Remove the
row_count
metadata entry from the Dask DataFrame type check (thanks @kinghuang!) - Add
orient
to the config options when materializing a Dask DataFrame tojson
(thanks @kinghuang!)
Bugfixes
- Fixed a bug where applying
configured
to a solid definition would overwrite inputs from run config. - Fixed a bug where pipeline tags would not apply to solid subsets.
- Improved error messages for repository-loading errors in CLI commands.
- Fixed a bug where pipeline execution error messages were not being surfaced in Dagit.
Bugfixes
- Fixes an issue in the
dagster-k8s-celery
executor when executing solid subsets
Breaking Changes
- Deprecated the
IntermediateStore
API.IntermediateStorage
now wraps an ObjectStore, andTypeStoragePlugin
now accepts anIntermediateStorage
instance instead of anIntermediateStore
instance. (Noe thatIntermediateStore
andIntermediateStorage
are both internal APIs that are used in some non-core libraries).
Breaking Changes
- The
dagit
key is no longer part of the instance configuration schema and must be removed fromdagster.yaml
files before they can be used. -d
can no longer be used as a command-line argument to specify a mode. Use--mode
instead.- Use
--preset
instead of--preset-name
to specify a preset to thepipeline launch
command. - We have removed the
config
argument to theConfigMapping
,@composite_solid
,@solid
,SolidDefinition
,@executor
,ExecutorDefinition
,@logger
,LoggerDefinition
,@resource
, andResourceDefinition
APIs, which we deprecated in 0.8.0. Useconfig_schema
instead.
New
- Python 3.8 is now fully supported.
-d
or--working-directory
can be used to specify a working directory in any command that takes in a-f
or--python_file
argument.- Removed the deprecation of
create_dagster_pandas_dataframe_type
. This is the currently supported API for custom pandas data frame type creation. - Removed gevent dependency from dagster
- New
configured
API for predefining configuration for various definitions: https://docs.dagster.io/overview/configuration/#configured - Added hooks to enable success and failure handling policies on pipelines. This enables users to set up policies on all solids within a pipeline or on a per solid basis. Example usage can be found here
- New instance level view of Scheduler and running schedules
- dagster-graphql is now only required in dagit images.
Breaking Changes
AssetMaterializations
no longer accepts adagster_type
argument. This reverts the change billed as "AssetMaterializations
can now have type information attached as metadata." in the previous release.
New
- Added new GCS and Azure file manager resources
AssetMaterializations
can now have type information attached as metadata. See the materializations tutorial for more- Added verification for resource arguments (previously only validated at runtime)
Bugfixes
- Fixed bug with order-dependent python module resolution seen with some packages (e.g. numpy)
- Fixed bug where Airflow's
context['ts']
was not passed properly - Fixed a bug in celery-k8s when using
task_acks_late: true
that resulted in a409 Conflict error
from Kubernetes. The creation of a Kubernetes Job will now be aborted if another Job with the same name exists - Fixed a bug with composite solid output results when solids are skipped
- Hide the re-execution button in Dagit when the pipeline is not re-executable in the currently loaded repository
Docs
- Fixed code example in the advanced scheduling doc (Thanks @wingyplus!)
- Various other improvements
New
CeleryK8sRunLauncher
supports termination of pipeline runs. This can be accessed via the “Terminate” button in Dagit’s Pipeline Run view or via “Cancel” in Dagit’s All Runs page. This will terminate the run master K8s Job along with all running step job K8s Jobs; steps that are still in the Celery queue will not create K8s Jobs. The pipeline and all impacted steps will be marked as failed. We recommend implementing resources as context managers and we will execute the finally block upon termination.K8sRunLauncher
supports termination of pipeline runs.AssetMaterialization
events display the asset key in the Runs view.- Added a new "Actions" button in Dagit to allow to cancel or delete mulitple runs.
Bugfixes
- Fixed an issue where
DagsterInstance
was leaving database connections open due to not being garbage collected. - Fixed an issue with fan-in inputs skipping when upstream solids have skipped.
- Fixed an issue with getting results from composites with skippable outputs in python API.
- Fixed an issue where using
Enum
in resource config schemas resulted in an error.
New
- The new
configured
API makes it easy to create configured versions of resources. - Deprecated the
Materialization
event type in favor of the newAssetMaterialization
event type, which requires theasset_key
parameter. Solids yieldingMaterialization
events will continue to work as before, though theMaterialization
event will be removed in a future release. - We are starting to deprecate "system storages" - instead of pipelines having a system storage
definition which creates an intermediate storage, pipelines now directly have an intermediate
storage definition.
- We have added an
intermediate_storage_defs
argument toModeDefinition
, which accepts a list ofIntermediateStorageDefinition
s, e.g.s3_plus_default_intermediate_storage_defs
. As before, the default includes an in-memory intermediate and a local filesystem intermediate storage. - We have deprecated
system_storage_defs
argument toModeDefinition
in favor ofintermediate_storage_defs
.system_storage_defs
will be removed in 0.10.0 at the earliest. - We have added an
@intermediate_storage
decorator, which makes it easy to define intermediate storages. - We have added
s3_file_manager
andlocal_file_manager
resources to replace the file managers that previously lived inside system storages. The airline demo has been updated to include an example of how to do this: https://github.com/dagster-io/dagster/blob/0.8.8/examples/airline_demo/airline_demo/solids.py#L171.
- We have added an
- The help panel in the dagit config editor can now be resized and toggled open or closed, to enable easier editing on smaller screens.
Bugfixes
- Opening new Dagit browser windows maintains your current repository selection. #2722
- Pipelines with the same name in different repositories no longer incorrectly share playground state. #2720
- Setting
default_value
config on a field now works as expected. #2725 - Fixed rendering bug in the dagit run reviewer where yet-to-be executed execution steps were rendered on left-hand side instead of the right.
Breaking Changes
- Loading python modules reliant on the working directory being on the PYTHONPATH is no longer
supported. The
dagster
anddagit
CLI commands no longer add the working directory to the PYTHONPATH when resolving modules, which may break some imports. Explicitly installed python packages can be specified in workspaces using thepython_package
workspace yaml config option. Thepython_module
config option is deprecated and will be removed in a future release.
New
- Dagit can be hosted on a sub-path by passing
--path-prefix
to the dagit CLI. #2073 - The
date_partition_range
util function now accepts an optionalinclusive
boolean argument. By default, the function does not return include the partition for which the end time of the date range is greater than the current time. Ifinclusive=True
, then the list of partitions returned will include the extra partition. MultiDependency
or fan-in inputs will now only cause the solid step to skip if all of the fanned-in inputs upstream outputs were skipped
Bugfixes
- Fixed accidental breaking change with
input_hydration_config
arguments - Fixed an issue with yaml merging (thanks @shasha79!)
- Invoking
alias
on a solid output will produce a useful error message (thanks @iKintosh!) - Restored missing run pagination controls
- Fixed error resolving partition-based schedules created via dagster schedule decorators (e.g.
daily_schedule
) for certain workspace.yaml formats
Breaking Changes
- The
dagster-celery
module has been broken apart to manage dependencies more coherently. There are now three modules:dagster-celery
,dagster-celery-k8s
, anddagster-celery-docker
. - Related to above, the
dagster-celery worker start
command now takes a required-A
parameter which must point to theapp.py
file within the appropriate module. E.g if you are using thecelery_k8s_job_executor
then you must use the-A dagster_celery_k8s.app
option when using thecelery
ordagster-celery
cli tools. Similar for thecelery_docker_executor
:-A dagster_celery_docker.app
must be used. - Renamed the
input_hydration_config
andoutput_materialization_config
decorators todagster_type_
anddagster_type_materializer
respectively. Renamed DagsterType'sinput_hydration_config
andoutput_materialization_config
arguments toloader
andmaterializer
respectively.
New
-
New pipeline scoped runs tab in Dagit
-
Add the following Dask Job Queue clusters: moab, sge, lsf, slurm, oar (thanks @DavidKatz-il!)
-
K8s resource-requirements for run coordinator pods can be specified using the
dagster-k8s/ resource_requirements
tag on pipeline definitions:@pipeline( tags={ 'dagster-k8s/resource_requirements': { 'requests': {'cpu': '250m', 'memory': '64Mi'}, 'limits': {'cpu': '500m', 'memory': '2560Mi'}, } }, ) def foo_bar_pipeline():
-
Added better error messaging in dagit for partition set and schedule configuration errors
-
An initial version of the CeleryDockerExecutor was added (thanks @mrdrprofuroboros!). The celery workers will launch tasks in docker containers.
-
Experimental: Great Expectations integration is currently under development in the new library dagster-ge. Example usage can be found here
Breaking Changes
- Python 3.5 is no longer under test.
Engine
andExecutorConfig
have been deleted in favor ofExecutor
. Instead of the@executor
decorator decorating a function that returns anExecutorConfig
it should now decorate a function that returns anExecutor
.
New
- The python built-in
dict
can be used as an alias forPermissive()
within a config schema declaration. - Use
StringSource
in theS3ComputeLogManager
configuration schema to support using environment variables in the configuration (Thanks @mrdrprofuroboros!) - Improve Backfill CLI help text
- Add options to spark_df_output_schema (Thanks @DavidKatz-il!)
- Helm: Added support for overriding the PostgreSQL image/version used in the init container checks.
- Update celery k8s helm chart to include liveness checks for celery workers and flower
- Support step level retries to celery k8s executor
Bugfixes
- Improve error message shown when a RepositoryDefinition returns objects that are not one of the allowed definition types (Thanks @sd2k!)
- Show error message when
$DAGSTER_HOME
environment variable is not an absolute path (Thanks @AndersonReyes!) - Update default value for
staging_prefix
in theDatabricksPySparkStepLauncher
configuration to be an absolute path (Thanks @sd2k!) - Improve error message shown when Databricks logs can't be retrieved (Thanks @sd2k!)
- Fix errors in documentation fo
input_hydration_config
(Thanks @joeyfreund!)
Bugfix
- Reverted changed in 0.8.3 that caused error during run launch in certain circumstances
- Updated partition graphs on schedule page to select most recent run
- Forced reload of partitions for partition sets to ensure not serving stale data
New
- Added reload button to dagit to reload current repository
- Added option to wipe a single asset key by using
dagster asset wipe <asset_key>
- Simplified schedule page, removing ticks table, adding tags for last tick attempt
- Better debugging tools for launch errors
Breaking Changes
-
Previously, the
gcs_resource
returned aGCSResource
wrapper which had a singleclient
property that returned agoogle.cloud.storage.client.Client
. Now, thegcs_resource
returns the client directly.To update solids that use the
gcp_resource
, change:context.resources.gcs.client
To:
context.resources.gcs
New
- Introduced a new Python API
reexecute_pipeline
to reexecute an existing pipeline run. - Performance improvements in Pipeline Overview and other pages.
- Long metadata entries in the asset details view are now scrollable.
- Added a
project
field to thegcs_resource
indagster_gcp
. - Added new CLI command
dagster asset wipe
to remove all existing asset keys.
Bugfix
- Several Dagit bugfixes and performance improvements
- Fixes pipeline execution issue with custom run launchers that call
executeRunInProcess
. - Updates
dagster schedule up
output to be repository location scoped
Bugfix
- Fixes issues with
dagster instance migrate
. - Fixes bug in
launch_scheduled_execution
that would mask configuration errors. - Fixes bug in dagit where schedule related errors were not shown.
- Fixes JSON-serialization error in
dagster-k8s
when specifying per-step resources.
New
- Makes
label
optional parameter for materializations withasset_key
specified. - Changes
Assets
page to have a typeahead selector and hierarchical views based on asset_key path. - dagster-ssh
- adds SFTP get and put functions to
SSHResource
, replacing sftp_solid.
- adds SFTP get and put functions to
Docs
- Various docs corrections
Bugfix
- Fixed a file descriptor leak that caused
OSError: [Errno 24] Too many open files
when enough temporary files were created. - Fixed an issue where an empty config in the Playground would unexpectedly be marked as invalid YAML.
- Removed "config" deprecation warnings for dask and celery executors.
New
- Improved performance of the Assets page.
Major Changes
Please see the 080_MIGRATION.md
migration guide for details on updating existing code to be
compatible with 0.8.0
-
Workspace, host and user process separation, and repository definition Dagit and other tools no longer load a single repository containing user definitions such as pipelines into the same process as the framework code. Instead, they load a "workspace" that can contain multiple repositories sourced from a variety of different external locations (e.g., Python modules and Python virtualenvs, with containers and source control repositories soon to come).
The repositories in a workspace are loaded into their own "user" processes distinct from the "host" framework process. Dagit and other tools now communicate with user code over an IPC mechanism. This architectural change has a couple of advantages:
- Dagit no longer needs to be restarted when there is an update to user code.
- Users can use repositories to organize their pipelines, but still work on all of their repositories using a single running Dagit.
- The Dagit process can now run in a separate Python environment from user code so pipeline dependencies do not need to be installed into the Dagit environment.
- Each repository can be sourced from a separate Python virtualenv, so teams can manage their dependencies (or even their own Python versions) separately.
We have introduced a new file format,
workspace.yaml
, in order to support this new architecture. The workspace yaml encodes what repositories to load and their location, and supersedes therepository.yaml
file and associated machinery.As a consequence, Dagster internals are now stricter about how pipelines are loaded. If you have written scripts or tests in which a pipeline is defined and then passed across a process boundary (e.g., using the
multiprocess_executor
or dagstermill), you may now need to wrap the pipeline in thereconstructable
utility function for it to be reconstructed across the process boundary.In addition, rather than instantiate the
RepositoryDefinition
class directly, users should now prefer the@repository
decorator. As part of this change, the@scheduler
and@repository_partitions
decorators have been removed, and their functionality subsumed under@repository
.
-
Dagit organization The Dagit interface has changed substantially and is now oriented around pipelines. Within the context of each pipeline in an environment, the previous "Pipelines" and "Solids" tabs have been collapsed into the "Definition" tab; a new "Overview" tab provides summary information about the pipeline, its schedules, its assets, and recent runs; the previous "Playground" tab has been moved within the context of an individual pipeline. Related runs (e.g., runs created by re-executing subsets of previous runs) are now grouped together in the Playground for easy reference. Dagit also now includes more advanced support for display of scheduled runs that may not have executed ("schedule ticks"), as well as longitudinal views over scheduled runs, and asset-oriented views of historical pipeline runs.
-
Assets Assets are named materializations that can be generated by your pipeline solids, which support specialized views in Dagit. For example, if we represent a database table with an asset key, we can now index all of the pipelines and pipeline runs that materialize that table, and view them in a single place. To use the asset system, you must enable an asset-aware storage such as Postgres.
-
Run launchers The distinction between "starting" and "launching" a run has been effaced. All pipeline runs instigated through Dagit now make use of the
RunLauncher
configured on the Dagster instance, if one is configured. Additionally, run launchers can now support termination of previously launched runs. If you have written your own run launcher, you may want to update it to support termination. Note also that as of 0.7.9, the semantics ofRunLauncher.launch_run
have changed; this method now takes therun_id
of an existing run and should no longer attempt to create the run in the instance. -
Flexible reexecution Pipeline re-execution from Dagit is now fully flexible. You may re-execute arbitrary subsets of a pipeline's execution steps, and the re-execution now appears in the interface as a child run of the original execution.
-
Support for historical runs Snapshots of pipelines and other Dagster objects are now persisted along with pipeline runs, so that historial runs can be loaded for review with the correct execution plans even when pipeline code has changed. This prepares the system to be able to diff pipeline runs and other objects against each other.
-
Step launchers and expanded support for PySpark on EMR and Databricks We've introduced a new
StepLauncher
abstraction that uses the resource system to allow individual execution steps to be run in separate processes (and thus on separate execution substrates). This has made extensive improvements to our PySpark support possible, including the option to execute individual PySpark steps on EMR using theEmrPySparkStepLauncher
and on Databricks using theDatabricksPySparkStepLauncher
Theemr_pyspark
example demonstrates how to use a step launcher. -
Clearer names What was previously known as the environment dictionary is now called the
run_config
, and the previousenvironment_dict
argument to APIs such asexecute_pipeline
is now deprecated. We renamed this argument to focus attention on the configuration of the run being launched or executed, rather than on an ambiguous "environment". We've also renamed theconfig
argument to all use definitions to beconfig_schema
, which should reduce ambiguity between the configuration schema and the value being passed in some particular case. We've also consolidated and improved documentation of the valid types for a config schema. -
Lakehouse We're pleased to introduce Lakehouse, an experimental, alternative programming model for data applications, built on top of Dagster core. Lakehouse allows developers to define data applications in terms of data assets, such as database tables or ML models, rather than in terms of the computations that produce those assets. The
simple_lakehouse
example gives a taste of what it's like to program in Lakehouse. We'd love feedback on whether this model is helpful! -
Airflow ingest We've expanded the tooling available to teams with existing Airflow installations that are interested in incrementally adopting Dagster. Previously, we provided only injection tools that allowed developers to write Dagster pipelines and then compile them into Airflow DAGs for execution. We've now added ingestion tools that allow teams to move to Dagster for execution without having to rewrite all of their legacy pipelines in Dagster. In this approach, Airflow DAGs are kept in their own container/environment, compiled into Dagster pipelines, and run via the Dagster orchestrator. See the
airflow_ingest
example for details!
Breaking Changes
-
dagster
-
The
@scheduler
and@repository_partitions
decorators have been removed. Instances ofScheduleDefinition
andPartitionSetDefinition
belonging to a repository should be specified using the@repository
decorator instead. -
Support for the Dagster solid selection DSL, previously introduced in Dagit, is now uniform throughout the Python codebase, with the previous
solid_subset
arguments (--solid-subset
in the CLI) being replaced bysolid_selection
(--solid-selection
). In addition to the names of individual solids, this argument now supports selection queries like*solid_name++
(i.e.,solid_name
, all of its ancestors, its immediate descendants, and their immediate descendants). -
The built-in Dagster type
Path
has been removed. -
PartitionSetDefinition
names, including those defined by aPartitionScheduleDefinition
, must now be unique within a single repository. -
Asset keys are now sanitized for non-alphanumeric characters. All characters besides alphanumerics and
_
are treated as path delimiters. Asset keys can also be specified usingAssetKey
, which accepts a list of strings as an explicit path. If you are running 0.7.10 or later and using assets, you may need to migrate your historical event log data for asset keys from previous runs to be attributed correctly. Thisevent_log
data migration can be invoked as follows:from dagster.core.storage.event_log.migration import migrate_event_log_data from dagster import DagsterInstance migrate_event_log_data(instance=DagsterInstance.get())
-
The interface of the
Scheduler
base class has changed substantially. If you've written a custom scheduler, please get in touch! -
The partitioned schedule decorators now generate
PartitionSetDefinition
names using the schedule name, suffixed with_partitions
. -
The
repository
property onScheduleExecutionContext
is no longer available. If you were using this property to pass toScheduler
instance methods, this interface has changed significantly. Please see theScheduler
class documentation for details. -
The CLI option
--celery-base-priority
is no longer available for the command:dagster pipeline backfill
. Use the tags option to specify the celery priority, (e.g.dagster pipeline backfill my_pipeline --tags '{ "dagster-celery/run_priority": 3 }'
-
The
execute_partition_set
API has been removed. -
The deprecated
is_optional
parameter toField
andOutputDefinition
has been removed. Useis_required
instead. -
The deprecated
runtime_type
property onInputDefinition
andOutputDefinition
has been removed. Usedagster_type
instead. -
The deprecated
has_runtime_type
,runtime_type_named
, andall_runtime_types
methods onPipelineDefinition
have been removed. Usehas_dagster_type
,dagster_type_named
, andall_dagster_types
instead. -
The deprecated
all_runtime_types
method onSolidDefinition
andCompositeSolidDefinition
has been removed. Useall_dagster_types
instead. -
The deprecated
metadata
argument toSolidDefinition
and@solid
has been removed. Usetags
instead. -
The graphviz-based DAG visualization in Dagster core has been removed. Please use Dagit!
-
-
dagit
dagit-cli
has been removed, anddagit
is now the only console entrypoint.
-
dagster-aws
- The AWS CLI has been removed.
dagster_aws.EmrRunJobFlowSolidDefinition
has been removed.
-
dagster-bash
- This package has been renamed to dagster-shell. The
bash_command_solid
andbash_script_solid
solid factory functions have been renamed tocreate_shell_command_solid
andcreate_shell_script_solid
.
- This package has been renamed to dagster-shell. The
-
dagster-celery
- The CLI option
--celery-base-priority
is no longer available for the command:dagster pipeline backfill
. Use the tags option to specify the celery priority, (e.g.dagster pipeline backfill my_pipeline --tags '{ "dagster-celery/run_priority": 3 }'
- The CLI option
-
dagster-dask
- The config schema for the
dagster_dask.dask_executor
has changed. The previous config should now be nested under the keylocal
.
- The config schema for the
-
dagster-gcp
- The
BigQueryClient
has been removed. Usebigquery_resource
instead.
- The
-
dagster-dbt
- The dagster-dbt package has been removed. This was inadequate as a reference integration, and will be replaced in 0.8.x.
-
dagster-spark
dagster_spark.SparkSolidDefinition
has been removed - usecreate_spark_solid
instead.- The
SparkRDD
Dagster type, which only worked with an in-memory engine, has been removed.
-
dagster-twilio
- The
TwilioClient
has been removed. Usetwilio_resource
instead.
- The
New
-
dagster
- You may now set
asset_key
on anyMaterialization
to use the new asset system. You will also need to configure an asset-aware storage, such as Postgres. Thelongitudinal_pipeline
example demonstrates this system. - The partitioned schedule decorators now support an optional
end_time
. - Opt-in telemetry now reports the Python version being used.
- You may now set
-
dagit
- Dagit's GraphQL playground is now available at
/graphiql
as well as at/graphql
.
- Dagit's GraphQL playground is now available at
-
dagster-aws
- The
dagster_aws.S3ComputeLogManager
may now be configured to override the S3 endpoint and associated SSL settings. - Config string and integer values in the S3 tooling may now be set using either environment variables or literals.
- The
-
dagster-azure
- We've added the dagster-azure package, with support for Azure Data Lake Storage Gen2; you can
use the
adls2_system_storage
or, for direct access, theadls2_resource
resource. (Thanks @sd2k!)
- We've added the dagster-azure package, with support for Azure Data Lake Storage Gen2; you can
use the
-
dagster-dask
- Dask clusters are now supported by
dagster_dask.dask_executor
. For full support, you will need to install extras withpip install dagster-dask[yarn, pbs, kube]
. (Thanks @DavidKatz-il!)
- Dask clusters are now supported by
-
dagster-databricks
- We've added the dagster-databricks package, with support for running PySpark steps on Databricks
clusters through the
databricks_pyspark_step_launcher
. (Thanks @sd2k!)
- We've added the dagster-databricks package, with support for running PySpark steps on Databricks
clusters through the
-
dagster-gcp
- Config string and integer values in the BigQuery, Dataproc, and GCS tooling may now be set using either environment variables or literals.
-
dagster-k8s
- Added the
CeleryK8sRunLauncher
to submit execution plan steps to Celery task queues for execution as k8s Jobs. - Added the ability to specify resource limits on a per-pipeline and per-step basis for k8s Jobs.
- Many improvements and bug fixes to the dagster-k8s Helm chart.
- Added the
-
dagster-pandas
- Config string and integer values in the dagster-pandas input and output schemas may now be set using either environment variables or literals.
-
dagster-papertrail
- Config string and integer values in the
papertrail_logger
may now be set using either environment variables or literals.
- Config string and integer values in the
-
dagster-pyspark
- PySpark solids can now run on EMR, using the
emr_pyspark_step_launcher
, or on Databricks using the new dagster-databricks package. Theemr_pyspark
example demonstrates how to use a step launcher.
- PySpark solids can now run on EMR, using the
-
dagster-snowflake
- Config string and integer values in the
snowflake_resource
may now be set using either environment variables or literals.
- Config string and integer values in the
-
dagster-spark
dagster_spark.create_spark_solid
now accepts arequired_resource_keys
argument, which enables setting up a step launcher for Spark solids, like theemr_pyspark_step_launcher
.
Bugfix
dagster pipeline execute
now sets a non-zero exit code when pipeline execution fails.
Bugfix
- Enabled
NoOpComputeLogManager
to be configured as thecompute_logs
implementation indagster.yaml
- Suppressed noisy error messages in logs from skipped steps
New
- Improve dagster scheduler state reconciliation.
New
- Dagit now allows re-executing arbitrary step subset via step selector syntax, regardless of whether the previous pipeline failed or not.
- Added a search filter for the root Assets page
- Adds tooltip explanations for disabled run actions
- The last output of the cron job command created by the scheduler is now stored in a file. A new
dagster schedule logs {schedule_name}
command will show the log file for a given schedule. This helps uncover errors like missing environment variables and import errors. - The Dagit schedule page will now show inconsistency errors between schedule state and the cron
tab that were previously only displayed by the
dagster schedule debug
command. As before, these errors can be resolve usingdagster schedule up
Bugfix
- Fixes an issue with config schema validation on Arrays
- Fixes an issue with initializing K8sRunLauncher when configured via
dagster.yaml
- Fixes a race condition in Airflow injection logic that happens when multiple Operators try to create PipelineRun entries simultaneously.
- Fixed an issue with schedules that had invalid config not logging the appropriate error.
Breaking Changes
dagster pipeline backfill
command no longer takes amode
flag. Instead, it uses the mode specified on thePartitionSetDefinition
. Similarly, the runs created from the backfill also use thesolid_subset
specified on thePartitionSetDefinition
BugFix
- Fixes a bug where using solid subsets when launching pipeline runs would fail config validation.
- (dagster-gcp) allow multiple "bq_solid_for_queries" solids to co-exist in a pipeline
- Improve scheduler state reconciliation with dagster-cron scheduler.
dagster schedule
debug command will display issues related to missing crob jobs, extraneous cron jobs, and duplicate cron jobs. Runningdagster schedule up
will fix any issues.
New
- The dagster-airflow package now supports loading Airflow dags without depending on initialized Airflow db
- Improvements to the longitudinal partitioned schedule view, including live updates, run filtering, and better default states.
- Added user warning for dagster library packages that are out of sync with the core
dagster
package.
Bugfix
- We now only render the subset of an execution plan that has actually executed, and persist that subset information along with the snapshot.
- @pipeline and @composite_solid now correctly capture
__doc__
from the function they decorate. - Fixed a bug with using solid subsets in the Dagit playground
Bugfix
- Fixed an issue with strict snapshot ID matching when loading historical snapshots, which caused errors on the Runs page when viewing historical runs.
- Fixed an issue where
dagster_celery
had introduced a spurious dependency ondagster_k8s
(#2435) - Fixed an issue where our Airflow, Celery, and Dask integrations required S3 or GCS storage and prevented use of filesystem storage. Filesystem storage is now also permitted, to enable use of these integrations with distributed filesystems like NFS (#2436).
New
RepositoryDefinition
now takesschedule_defs
andpartition_set_defs
directly. The loading scheme for these definitions viarepository.yaml
under thescheduler:
andpartitions:
keys is deprecated and expected to be removed in 0.8.0.- Mark published modules as python 3.8 compatible.
- The dagster-airflow package supports loading all Airflow DAGs within a directory path, file path, or Airflow DagBag.
- The dagster-airflow package supports loading all 23 DAGs in Airflow example_dags folder and
execution of 17 of them (see:
make_dagster_repo_from_airflow_example_dags
). - The dagster-celery CLI tools now allow you to pass additional arguments through to the underlying
celery CLI, e.g., running
dagster-celery worker start -n my-worker -- --uid=42
will pass the--uid
flag to celery. - It is now possible to create a
PresetDefinition
that has no environment defined. - Added
dagster schedule debug
command to help debug scheduler state. - The
SystemCronScheduler
now verifies that a cron job has been successfully been added to the crontab when turning a schedule on, and shows an error message if unsuccessful.
Breaking Changes
- A
dagster instance migrate
is required for this release to support the new experimental assets view. - Runs created prior to 0.7.8 will no longer render their execution plans as DAGs. We are only rendering execution plans that have been persisted. Logs are still available.
Path
is no longer valid in config schemas. Usestr
ordagster.String
instead.- Removed the
@pyspark_solid
decorator - its functionality, which was experimental, is subsumed by requiring a StepLauncher resource (e.g. emr_pyspark_step_launcher) on the solid.
Dagit
- Merged "re-execute", "single-step re-execute", "resume/retry" buttons into one "re-execute" button with three dropdown selections on the Run page.
Experimental
- Added new
asset_key
string parameter to Materializations and created a new “Assets” tab in Dagit to view pipelines and runs associated with these keys. The API and UI of these asset-based are likely to change, but feedback is welcome and will be used to inform these changes. - Added an
emr_pyspark_step_launcher
that enables launching PySpark solids in EMR. The "simple_pyspark" example demonstrates how it’s used.
Bugfix
- Fixed an issue when running Jupyter notebooks in a Python 2 kernel through dagstermill with Dagster running in Python 3.
- Improved error messages produced when dagstermill spins up an in-notebook context.
- Fixed an issue with retrieving step events from
CompositeSolidResult
objects.
Breaking Changes
- If you are launching runs using
DagsterInstance.launch_run
, this method now takes a run id instead of an instance ofPipelineRun
. Additionally,DagsterInstance.create_run
andDagsterInstance.create_empty_run
have been replaced byDagsterInstance.get_or_create_run
andDagsterInstance.create_run_for_pipeline
. - If you have implemented your own
RunLauncher
, there are two required changes:RunLauncher.launch_run
takes a pipeline run that has already been created. You should remove any calls toinstance.create_run
in this method.- Instead of calling
startPipelineExecution
(defined in thedagster_graphql.client.query.START_PIPELINE_EXECUTION_MUTATION
) in the run launcher, you should callstartPipelineExecutionForCreatedRun
(defined indagster_graphql.client.query.START_PIPELINE_EXECUTION_FOR_CREATED_RUN_MUTATION
). - Refer to the
RemoteDagitRunLauncher
for an example implementation.
New
- Improvements to preset and solid subselection in the playground. An inline preview of the pipeline instead of a modal when doing subselection, and the correct subselection is chosen when selecting a preset.
- Improvements to the log searching. Tokenization and autocompletion for searching messages types and for specific steps.
- You can now view the structure of pipelines from historical runs, even if that pipeline no longer exists in the loaded repository or has changed structure.
- Historical execution plans are now viewable, even if the pipeline has changed structure.
- Added metadata link to raw compute logs for all StepStart events in PipelineRun view and Step view.
- Improved error handling for the scheduler. If a scheduled run has config errors, the errors are persisted to the event log for the run and can be viewed in Dagit.
Bugfix
- No longer manually dispose sqlalchemy engine in dagster-postgres
- Made boto3 dependency in dagster-aws more flexible (#2418)
- Fixed tooltip UI cleanup in partitioned schedule view
Documentation
- Brand new documentation site, available at https://docs.dagster.io
- The tutorial has been restructured to multiple sections, and the examples in intro_tutorial have been rearranged to separate folders to reflect this.
Breaking Changes
- The
execute_pipeline_with_mode
andexecute_pipeline_with_preset
APIs have been dropped in favor of new top level arguments toexecute_pipeline
,mode
andpreset
. - The use of
RunConfig
to pass options toexecute_pipeline
has been deprecated, andRunConfig
will be removed in 0.8.0. - The
execute_solid_within_pipeline
andexecute_solids_within_pipeline
APIs, intended to support tests, now take new top level argumentsmode
andpreset
.
New
- The dagster-aws Redshift resource now supports providing an error callback to debug failed queries.
- We now persist serialized execution plans for historical runs. They will render correctly even if the pipeline structure has changed or if it does not exist in the current loaded repository.
- Clicking on a pipeline tag in the
Runs
view will apply that tag as a filter.
Bugfix
- Fixed a bug where telemetry logger would create a log file (but not write any logs) even when telemetry was disabled.
Experimental
- The dagster-airflow package supports ingesting Airflow dags and running them as dagster pipelines
(see:
make_dagster_pipeline_from_airflow_dag
). This is in the early experimentation phase. - Improved the layout of the experimental partition runs table on the
Schedules
detailed view.
Documentation
- Fixed a grammatical error (Thanks @flowersw!)
Breaking Changes
-
The default sqlite and
dagster-postgres
implementations have been altered to extract the eventstep_key
field as a column, to enable faster per-step queries. You will need to rundagster instance migrate
to update the schema. You may optionally migrate your historical event log data to extract thestep_key
using themigrate_event_log_data
function. This will ensure that your historical event log data will be captured in future step-key based views. Thisevent_log
data migration can be invoked as follows:from dagster.core.storage.event_log.migration import migrate_event_log_data from dagster import DagsterInstance migrate_event_log_data(instance=DagsterInstance.get())
-
We have made pipeline metadata serializable and persist that along with run information. While there are no user-facing features to leverage this yet, it does require an instance migration. Run
dagster instance migrate
. If you have already run the migration for theevent_log
changes above, you do not need to run it again. Any unforeseen errors related to the newsnapshot_id
in theruns
table or the newsnapshots
table are related to this migration. -
dagster-pandas
ColumnTypeConstraint
has been removed in favor ofColumnDTypeFnConstraint
andColumnDTypeInSetConstraint
.
New
- You can now specify that dagstermill output notebooks be yielded as an output from dagstermill solids, in addition to being materialized.
- You may now set the extension on files created using the
FileManager
machinery. - dagster-pandas typed
PandasColumn
constructors now support pandas 1.0 dtypes. - The Dagit Playground has been restructured to make the relationship between Preset, Partition Sets, Modes, and subsets more clear. All of these buttons have be reconciled and moved to the left side of the Playground.
- Config sections that are required but not filled out in the Dagit playground are now detected and labeled in orange.
- dagster-celery config now support using
env:
to load from environment variables.
Bugfix
- Fixed a bug where selecting a preset in
dagit
would not populate tags specified on the pipeline definition. - Fixed a bug where metadata attached to a raised
Failure
was not displayed in the error modal indagit
. - Fixed an issue where reimporting dagstermill and calling
dagstermill.get_context()
outside of the parameters cell of a dagstermill notebook could lead to unexpected behavior. - Fixed an issue with connection pooling in dagster-postgres, improving responsiveness when using the Postgres-backed storages.
Experimental
- Added a longitudinal view of runs for on the
Schedule
tab for scheduled, partitioned pipelines. Includes views of run status, execution time, and materializations across partitions. The UI is in flux and is currently optimized for daily schedules, but feedback is welcome.
Breaking Changes
default_value
inField
no longer accepts native instances of python enums. Instead the underlying string representation in the config system must be used.default_value
inField
no longer accepts callables.- The
dagster_aws
imports have been reorganized; you should now import resources fromdagster_aws.<AWS service name>
.dagster_aws
providess3
,emr
,redshift
, andcloudwatch
modules. - The
dagster_aws
S3 resource no longer attempts to model the underlying boto3 API, and you can now just use any boto3 S3 API directly on a S3 resource, e.g.context.resources.s3.list_objects_v2
. (#2292)
New
- New
Playground
view indagit
showing an interactive config map - Improved storage and UI for showing schedule attempts
- Added the ability to set default values in
InputDefinition
- Added CLI command
dagster pipeline launch
to launch runs using a configuredRunLauncher
- Added ability to specify pipeline run tags using the CLI
- Added a
pdb
utility toSolidExecutionContext
to help with debugging, available within a solid ascontext.pdb
- Added
PresetDefinition.with_additional_config
to allow for config overrides - Added resource name to log messages generated during resource initialization
- Added grouping tags for runs that have been retried / reexecuted.
Bugfix
- Fixed a bug where date range partitions with a specified end date was clipping the last day
- Fixed an issue where some schedule attempts that failed to start would be marked running forever.
- Fixed the
@weekly
partitioned schedule decorator - Fixed timezone inconsistencies between the runs view and the schedules view
- Integers are now accepted as valid values for Float config fields
- Fixed an issue when executing dagstermill solids with config that contained quote characters.
dagstermill
- The Jupyter kernel to use may now be specified when creating dagster notebooks with the
--kernel
flag.
dagster-dbt
dbt_solid
now has aNothing
input to allow for sequencing
dagster-k8s
- Added
get_celery_engine_config
to select celery engine, leveraging Celery infrastructure
Documentation
- Improvements to the airline and bay bikes demos
- Improvements to our dask deployment docs (Thanks jswaney!!)
New
-
Added the
IntSource
type, which lets integers be set from environment variables in config. -
You may now set tags on pipeline definitions. These will resolve in the following cases:
- Loading in the playground view in Dagit will pre-populate the tag container.
- Loading partition sets from the preset/config picker will pre-populate the tag container with the union of pipeline tags and partition tags, with partition tags taking precedence.
- Executing from the CLI will generate runs with the pipeline tags.
- Executing programmatically using the
execute_pipeline
api will create a run with the union of pipeline tags andRunConfig
tags, withRunConfig
tags taking precedence. - Scheduled runs (both launched and executed) will have the union of pipeline tags and the schedule tags function, with the schedule tags taking precedence.
-
Output materialization configs may now yield multiple Materializations, and the tutorial has been updated to reflect this.
-
We now export the
SolidExecutionContext
in the public API so that users can correctly type hint solid compute functions.
Dagit
- Pipeline run tags are now preserved when resuming/retrying from Dagit.
- Scheduled run stats are now grouped by partition.
- A "preparing" section has been added to the execution viewer. This shows steps that are in progress of starting execution.
- Markers emitted by the underlying execution engines are now visualized in the Dagit execution timeline.
Bugfix
- Resume/retry now works as expected in the presence of solids that yield optional outputs.
- Fixed an issue where dagster-celery workers were failing to start in the presence of config
values that were
None
. - Fixed an issue with attempting to set
threads_per_worker
on Dask distributed clusters.
dagster-postgres
- All postgres config may now be set using environment variables in config.
dagster-aws
- The
s3_resource
now exposes alist_objects_v2
method corresponding to the underlying boto3 API. (Thanks, @basilvetas!) - Added the
redshift_resource
to access Redshift databases.
dagster-k8s
- The
K8sRunLauncher
config now includes theload_kubeconfig
andkubeconfig_file
options.
Documentation
- Fixes and improvements.
Dependencies
- dagster-airflow no longer pins its werkzeug dependency.
Community
-
We've added opt-in telemetry to Dagster so we can collect usage statistics in order to inform development priorities. Telemetry data will motivate projects such as adding features in frequently-used parts of the CLI and adding more examples in the docs in areas where users encounter more errors.
We will not see or store solid definitions (including generated context) or pipeline definitions (including modes and resources). We will not see or store any data that is processed within solids and pipelines.
If you'd like to opt in to telemetry, please add the following to
$DAGSTER_HOME/dagster.yaml
:telemetry: enabled: true
-
Thanks to @basilvetas and @hspak for their contributions!
New
- It is now possible to use Postgres to back schedule storage by configuring
dagster_postgres.PostgresScheduleStorage
on the instance. - Added the
execute_pipeline_with_mode
API to allow executing a pipeline in test with a specific mode without having to specifyRunConfig
. - Experimental support for retries in the Celery executor.
- It is now possible to set run-level priorities for backfills run using the Celery executor by
passing
--celery-base-priority
todagster pipeline backfill
. - Added the
@weekly
schedule decorator.
Deprecations
- The
dagster-ge
library has been removed from this release due to drift from the underlying Great Expectations implementation.
dagster-pandas
PandasColumn
now includes anis_optional
flag, replacing the previousColumnExistsConstraint
.- You can now pass the
ignore_missing_values flag
toPandasColumn
in order to apply column constraints only to the non-missing rows in a column.
dagster-k8s
- The Helm chart now includes provision for an Ingress and for multiple Celery queues.
Documentation
- Improvements and fixes.
New
- It is now possible to configure a Dagit instance to disable executing pipeline runs in a local subprocess.
- Resource initialization, teardown, and associated failure states now emit structured events visible in Dagit. Structured events for pipeline errors and multiprocess execution have been consolidated and rationalized.
- Support Redis queue provider in
dagster-k8s
Helm chart. - Support external postgresql in
dagster-k8s
Helm chart.
Bugfix
- Fixed an issue with inaccurate timings on some resource initializations.
- Fixed an issue that could cause the multiprocess engine to spin forever.
- Fixed an issue with default value resolution when a config value was set using
SourceString
. - Fixed an issue when loading logs from a pipeline belonging to a different repository in Dagit.
- Fixed an issue with where the CLI command
dagster schedule up
would fail in certain scenarios with theSystemCronScheduler
.
Pandas
- Column constraints can now be configured to permit NaN values.
Dagstermill
- Removed a spurious dependency on sklearn.
Docs
- Improvements and fixes to docs.
- Restored dagster.readthedocs.io.
Experimental
- An initial implementation of solid retries, throwing a
RetryRequested
exception, was added. This API is experimental and likely to change.
Other
- Renamed property
runtime_type
todagster_type
in definitions. The following are deprecated and will be removed in a future version.InputDefinition.runtime_type
is deprecated. UseInputDefinition.dagster_type
instead.OutputDefinition.runtime_type
is deprecated. UseOutputDefinition.dagster_type
instead.CompositeSolidDefinition.all_runtime_types
is deprecated. UseCompositeSolidDefinition.all_dagster_types
instead.SolidDefinition.all_runtime_types
is deprecated. UseSolidDefinition.all_dagster_types
instead.PipelineDefinition.has_runtime_type
is deprecated. UsePipelineDefinition.has_dagster_type
instead.PipelineDefinition.runtime_type_named
is deprecated. UsePipelineDefinition.dagster_type_named
instead.PipelineDefinition.all_runtime_types
is deprecated. UsePipelineDefinition.all_dagster_types
instead.
Docs
- New docs site at docs.dagster.io.
- dagster.readthedocs.io is currently stale due to availability issues.
New
- Improvements to S3 Resource. (Thanks @dwallace0723!)
- Better error messages in Dagit.
- Better font/styling support in Dagit.
- Changed
OutputDefinition
to takeis_required
rather thanis_optional
argument. This is to remain consistent with changes toField
in 0.7.1 and to avoid confusion with python's typing and dagster's definition ofOptional
, which indicates None-ability, rather than existence.is_optional
is deprecated and will be removed in a future version. - Added support for Flower in dagster-k8s.
- Added support for environment variable config in dagster-snowflake.
Bugfixes
- Improved performance in Dagit waterfall view.
- Fixed bug when executing solids downstream of a skipped solid.
- Improved navigation experience for pipelines in Dagit.
- Fixed for the dagster-aws CLI tool.
- Fixed issue starting Dagit without DAGSTER_HOME set on windows.
- Fixed pipeline subset execution in partition-based schedules.
Dagit
- Dagit now looks up an available port on which to run when the default port is not available. (Thanks @rparrapy!)
dagster_pandas
- Hydration and materialization are now configurable on
dagster_pandas
dataframes.
dagster_aws
- The
s3_resource
no longer uses an unsigned session by default.
Bugfixes
- Type check messages are now displayed in Dagit.
- Failure metadata is now surfaced in Dagit.
- Dagit now correctly displays the execution time of steps that error.
- Error messages now appear correctly in console logging.
- GCS storage is now more robust to transient failures.
- Fixed an issue where some event logs could be duplicated in Dagit.
- Fixed an issue when reading config from an environment variable that wasn't set.
- Fixed an issue when loading a repository or pipeline from a file target on Windows.
- Fixed an issue where deleted runs could cause the scheduler page to crash in Dagit.
Documentation
- Expanded and improved docs and error messages.
Breaking Changes
There are a substantial number of breaking changes in the 0.7.0 release.
Please see 070_MIGRATION.md
for instructions regarding migrating old code.
Scheduler
-
The scheduler configuration has been moved from the
@schedules
decorator toDagsterInstance
. Existing schedules that have been running are no longer compatible with current storage. To migrate, remove thescheduler
argument on all@schedules
decorators:instead of:
@schedules(scheduler=SystemCronScheduler) def define_schedules(): ...
Remove the
scheduler
argument:@schedules def define_schedules(): ...
Next, configure the scheduler on your instance by adding the following to
$DAGSTER_HOME/dagster.yaml
:scheduler: module: dagster_cron.cron_scheduler class: SystemCronScheduler
Finally, if you had any existing schedules running, delete the existing
$DAGSTER_HOME/schedules
directory and rundagster schedule wipe && dagster schedule up
to re-instatiate schedules in a valid state. -
The
should_execute
andenvironment_dict_fn
argument toScheduleDefinition
now have a required first argumentcontext
, representing theScheduleExecutionContext
Config System Changes
-
In the config system,
Dict
has been renamed toShape
;List
toArray
;Optional
toNoneable
; andPermissiveDict
toPermissive
. The motivation here is to clearly delineate config use cases versus cases where you are using types as the inputs and outputs of solids as well as python typing types (for mypy and friends). We believe this will be clearer to users in addition to simplifying our own implementation and internal abstractions.Our recommended fix is not to use
Shape
andArray
, but instead to use our new condensed config specification API. This allow one to use bare dictionaries instead ofShape
, lists with one member instead ofArray
, bare types instead ofField
with a single argument, and python primitive types (int
,bool
etc) instead of the dagster equivalents. These result in dramatically less verbose config specs in most cases.So instead of
from dagster import Shape, Field, Int, Array, String # ... code config=Shape({ # Dict prior to change 'some_int' : Field(Int), 'some_list: Field(Array[String]) # List prior to change })
one can instead write:
config={'some_int': int, 'some_list': [str]}
No imports and much simpler, cleaner syntax.
-
config_field
is no longer a valid argument onsolid
,SolidDefinition
,ExecutorDefintion
,executor
,LoggerDefinition
,logger
,ResourceDefinition
,resource
,system_storage
, andSystemStorageDefinition
. Useconfig
instead. -
For composite solids, the
config_fn
no longer takes aConfigMappingContext
, and the context has been deleted. To upgrade, remove the first argument toconfig_fn
.So instead of
@composite_solid(config={}, config_fn=lambda context, config: {})
one must instead write:
@composite_solid(config={}, config_fn=lambda config: {})
-
Field
takes ais_required
rather than ais_optional
argument. This is to avoid confusion with python's typing and dagster's definition ofOptional
, which indicates None-ability, rather than existence.is_optional
is deprecated and will be removed in a future version.
Required Resources
-
All solids, types, and config functions that use a resource must explicitly list that resource using the argument
required_resource_keys
. This is to enable efficient resource management during pipeline execution, especially in a multiprocessing or remote execution environment. -
The
@system_storage
decorator now requires argumentrequired_resource_keys
, which was previously optional.
Dagster Type System Changes
dagster.Set
anddagster.Tuple
can no longer be used within the config system.- Dagster types are now instances of
DagsterType
, rather than a class than inherits fromRuntimeType
. Instead of dynamically generating a class to create a custom runtime type, just create an instance of aDagsterType
. The type checking function is now an argument to theDagsterType
, rather than an abstract method that has to be implemented in a subclass. RuntimeType
has been renamed toDagsterType
is now an encouraged API for type creation.- Core type check function of DagsterType can now return a naked
bool
in addition to aTypeCheck
object. type_check_fn
onDagsterType
(formerlytype_check
andRuntimeType
, respectively) now takes a first argumentcontext
of typeTypeCheckContext
in addition to the second argument ofvalue
.define_python_dagster_type
has been eliminated in favor ofPythonObjectDagsterType
.dagster_type
has been renamed tousable_as_dagster_type
.as_dagster_type
has been removed and similar capabilities added asmake_python_type_usable_as_dagster_type
.PythonObjectDagsterType
andusable_as_dagster_type
no longer take atype_check
argument. If a custom type_check is needed, useDagsterType
.- As a consequence of these changes, if you were previously using
dagster_pyspark
ordagster_pandas
and expecting Pyspark or Pandas types to work as Dagster types, e.g., in type annotations to functions decorated with@solid
to indicate that they are input or output types for a solid, you will need to callmake_python_type_usable_as_dagster_type
from your code in order to map the Python types to the Dagster types, or just use the Dagster types (dagster_pandas.DataFrame
instead ofpandas.DataFrame
) directly.
Other
- We no longer publish base Docker images. Please see the updated deployment docs for an example Dockerfile off of which you can work.
step_metadata_fn
has been removed fromSolidDefinition
&@solid
.SolidDefinition
&@solid
now takestags
and enforces that values are strings or are safely encoded as JSON.metadata
is deprecated and will be removed in a future version.resource_mapper_fn
has been removed fromSolidInvocation
.
New
-
Dagit now includes a much richer execution view, with a Gantt-style visualization of step execution and a live timeline.
-
Early support for Python 3.8 is now available, and Dagster/Dagit along with many of our libraries are now tested against 3.8. Note that several of our upstream dependencies have yet to publish wheels for 3.8 on all platforms, so running on Python 3.8 likely still involves building some dependencies from source.
-
dagster/priority
tags can now be used to prioritize the order of execution for the built-in in-process and multiprocess engines. -
dagster-postgres
storages can now be configured with separate arguments and environment variables, such as:run_storage: module: dagster_postgres.run_storage class: PostgresRunStorage config: postgres_db: username: test password: env: ENV_VAR_FOR_PG_PASSWORD hostname: localhost db_name: test
-
Support for
RunLauncher
s onDagsterInstance
allows for execution to be "launched" outside of the Dagit/Dagster process. As one example, this is used bydagster-k8s
to submit pipeline execution as a Kubernetes Job. -
Added support for adding tags to runs initiated from the
Playground
view in Dagit. -
Added
@monthly_schedule
decorator. -
Added
Enum.from_python_enum
helper to wrap Python enums for config. (Thanks @kdungs!) -
[dagster-bash] The Dagster bash solid factory now passes along
kwargs
to the underlying solid construction, and now has a singleNothing
input by default to make it easier to create a sequencing dependency. Also, logs are now buffered by default to make execution less noisy. -
[dagster-aws] We've improved our EMR support substantially in this release. The
dagster_aws.emr
library now provides anEmrJobRunner
with various utilities for creating EMR clusters, submitting jobs, and waiting for jobs/logs. We also now provide aemr_pyspark_resource
, which together with the new@pyspark_solid
decorator makes moving pyspark execution from your laptop to EMR as simple as changing modes. [dagster-pandas] Addedcreate_dagster_pandas_dataframe_type
,PandasColumn
, andConstraint
API's in order for users to create custom types which perform column validation, dataframe validation, summary statistics emission, and dataframe serialization/deserialization. -
[dagster-gcp] GCS is now supported for system storage, as well as being supported with the Dask executor. (Thanks @habibutsu!) Bigquery solids have also been updated to support the new API.
Bugfix
- Ensured that all implementations of
RunStorage
clean up pipeline run tags when a run is deleted. Requires a storage migration, usingdagster instance migrate
. - The multiprocess and Celery engines now handle solid subsets correctly.
- The multiprocess and Celery engines will now correctly emit skip events for steps downstream of failures and other skips.
- The
@solid
and@lambda_solid
decorators now correctly wrap their decorated functions, in the sense offunctools.wraps
. - Performance improvements in Dagit when working with runs with large configurations.
- The Helm chart in
dagster_k8s
has been hardened against various failure modes and is now compatible with Helm 2. - SQLite run and event log storages are more robust to concurrent use.
- Improvements to error messages and to handling of user code errors in input hydration and output materialization logic.
- Fixed an issue where the Airflow scheduler could hang when attempting to load dagster-airflow pipelines.
- We now handle our SQLAlchemy connections in a more canonical way (thanks @zzztimbo!).
- Fixed an issue using S3 system storage with certain custom serialization strategies.
- Fixed an issue leaking orphan processes from compute logging.
- Fixed an issue leaking semaphores from Dagit.
- Setting the
raise_error
flag inexecute_pipeline
now actually raises user exceptions instead of a wrapper type.
Documentation
- Our docs have been reorganized and expanded (thanks @habibutsu, @vatervonacht, @zzztimbo). We'd love feedback and contributions!
Thank you Thank you to all of the community contributors to this release!! In alphabetical order: @habibutsu, @kdungs, @vatervonacht, @zzztimbo.
Bugfix
- Improved SQLite concurrency issues, uncovered while using concurrent nodes in Airflow
- Fixed sqlalchemy warnings (thanks @zzztimbo!)
- Fixed Airflow integration issue where a Dagster child process triggered a signal handler of a parent Airflow process via a process fork
- Fixed GCS and AWS intermediate store implementations to be compatible with read/write mode serialization strategies
- Improve test stability
Documentation
- Improved descriptions for setting up the cron scheduler (thanks @zzztimbo!)
New
- Added the dagster-github library, a community contribution from @Ramshackle-Jamathon and @k-mahoney!
dagster-celery
- Simplified and improved config handling.
- An engine event is now emitted when the engine fails to connect to a broker.
Bugfix
- Fixes a file descriptor leak when running many concurrent dagster-graphql queries (e.g., for backfill).
- The
@pyspark_solid
decorator now handles inputs correctly. - The handling of solid compute functions that accept kwargs but which are decorated with explicit input definitions has been rationalized.
- Fixed race conditions in concurrent execution using SQLite event log storage with concurrent execution, uncovered by upstream improvements in the Python inotify library we use.
Documentation
- Improved error messages when using system storages that don't fulfill executor requirements.
New
- We are now more permissive when specifying configuration schema in order make constructing configuration schema more concise.
- When specifying the value of scalar inputs in config, one can now specify that value directly as
the key of the input, rather than having to embed it within a
value
key.
Breaking
- The implementation of SQL-based event log storages has been consolidated,
which has entailed a schema change. If you have event logs stored in a
Postgres- or SQLite-backed event log storage, and you would like to maintain
access to these logs, you should run
dagster instance migrate
. To check what event log storages you are using, rundagster instance info
. - Type matches on both sides of an
InputMapping
orOutputMapping
are now enforced.
New
- Dagster is now tested on Python 3.8
- Added the dagster-celery library, which implements a Celery-based engine for parallel pipeline execution.
- Added the dagster-k8s library, which includes a Helm chart for a simple Dagit installation on a Kubernetes cluster.
Dagit
- The Explore UI now allows you to render a subset of a large DAG via a new solid
query bar that accepts terms like
solid_name+*
and+solid_name+
. When viewing very large DAGs, nothing is displayed by default and*
produces the original behavior. - Performance improvements in the Explore UI and config editor for large pipelines.
- The Explore UI now includes a zoom slider that makes it easier to navigate large DAGs.
- Dagit pages now render more gracefully in the presence of inconsistent run storage and event logs.
- Improved handling of GraphQL errors and backend programming errors.
- Minor display improvements.
dagster-aws
- A default prefix is now configurable on APIs that use S3.
- S3 APIs now parametrize
region_name
andendpoint_url
.
dagster-gcp
- A default prefix is now configurable on APIs that use GCS.
dagster-postgres
- Performance improvements for Postgres-backed storages.
dagster-pyspark
- Pyspark sessions may now be configured to be held open after pipeline execution completes, to enable extended test cases.
dagster-spark
spark_outputs
must now be specified when initializing aSparkSolidDefinition
, rather than in config.- Added new
create_spark_solid
helper and newspark_resource
. - Improved EMR implementation.
Bugfix
- Fixed an issue retrieving output values using
SolidExecutionResult
(e.g., in test) for dagster-pyspark solids. - Fixes an issue when expanding composite solids in Dagit.
- Better errors when solid names collide.
- Config mapping in composite solids now works as expected when the composite solid has no top level config.
- Compute log filenames are now guaranteed not to exceed the POSIX limit of 255 chars.
- Fixes an issue when copying and pasting solid names from Dagit.
- Termination now works as expected in the multiprocessing executor.
- The multiprocessing executor now executes parallel steps in the expected order.
- The multiprocessing executor now correctly handles solid subsets.
- Fixed a bad error condition in
dagster_ssh.sftp_solid
. - Fixed a bad error message giving incorrect log level suggestions.
Documentation
- Minor fixes and improvements.
Thank you Thank you to all of the community contributors to this release!! In alphabetical order: @cclauss, @deem0n, @irabinovitch, @pseudoPixels, @Ramshackle-Jamathon, @rparrapy, @yamrzou.
Breaking
- The
selector
argument toPipelineDefinition
has been removed. This API made it possible to construct aPipelineDefinition
in an invalid state. UsePipelineDefinition.build_sub_pipeline
instead.
New
- Added the
dagster_prometheus
library, which exposes a basic Prometheus resource. - Dagster Airflow DAGs may now use GCS instead of S3 for storage.
- Expanded interface for schedule management in Dagit.
Dagit
- Performance improvements when loading, displaying, and editing config for large pipelines.
- Smooth scrolling zoom in the explore tab replaces the previous two-step zoom.
- No longer depends on internet fonts to run, allowing fully offline dev.
- Typeahead behavior in search has improved.
- Invocations of composite solids remain visible in the sidebar when the solid is expanded.
- The config schema panel now appears when the config editor is first opened.
- Interface now includes hints for autocompletion in the config editor.
- Improved display of solid inputs and output in the explore tab.
- Provides visual feedback while filter results are loading.
- Better handling of pipelines that aren't present in the currently loaded repo.
Bugfix
- Dagster Airflow DAGs previously could crash while handling Python errors in DAG logic.
- Step failures when running Dagster Airflow DAGs were previously not being surfaced as task failures in Airflow.
- Dagit could previously get into an invalid state when switching pipelines in the context of a solid subselection.
frozenlist
andfrozendict
now pass Dagster's parameter type checks forlist
anddict
.- The GraphQL playground in Dagit is now working again.
Nits
- Dagit now prints its pid when it loads.
- Third-party dependencies have been relaxed to reduce the risk of version conflicts.
- Improvements to docs and example code.
Breaking
- The interface for type checks has changed. Previously the
type_check_fn
on a custom type was required to return None (=passed) or else raiseFailure
(=failed). Now, atype_check_fn
may returnTrue
/False
to indicate success/failure in the ordinary case, or else return aTypeCheck
. The newsuccess
field onTypeCheck
now indicates success/failure. This obviates the need for thetypecheck_metadata_fn
, which has been removed. - Executions of individual composite solids (e.g. in test) now produce a
CompositeSolidExecutionResult
rather than aSolidExecutionResult
. dagster.core.storage.sqlite_run_storage.SqliteRunStorage
has moved todagster.core.storage.runs.SqliteRunStorage
. Any persisteddagster.yaml
files should be updated with the new classpath.is_secret
has been removed fromField
. It was not being used to any effect.- The
environmentType
andconfigTypes
fields have been removed from the dagster-graphqlPipeline
type. TheconfigDefinition
field onSolidDefinition
has been renamed toconfigField
.
Bugfix
PresetDefinition.from_files
is now guaranteed to give identical results across all Python minor versions.- Nested composite solids with no config, but with config mapping functions, now behave as expected.
- The dagster-airflow
DagsterKubernetesPodOperator
has been fixed. - Dagit is more robust to changes in repositories.
- Improvements to Dagit interface.
New
- dagster_pyspark now supports remote execution on EMR with the
@pyspark_solid
decorator.
Nits
- Documentation has been improved.
- The top level config field
features
in thedagster.yaml
will no longer have any effect. - Third-party dependencies have been relaxed to reduce the risk of version conflicts.
- Scheduler errors are now visible in Dagit
- Run termination button no longer persists past execution completion
- Fixes run termination for multiprocess execution
- Fixes run termination on Windows
dagit
no longer prematurely returns control to terminal on Windowsraise_on_error
is now available on theexecute_solid
test utilitycheck_dagster_type
added as a utility to help test type checks on custom types- Improved support in the type system for
Set
andTuple
types - Allow composite solids with config mapping to expose an empty config schema
- Simplified graphql API arguments to single-step re-execution to use
retryRunId
,stepKeys
execution parameters instead of areexecutionConfig
input object - Fixes missing step-level stdout/stderr from dagster CLI
-
Adds a
type_check
parameter toPythonObjectType
,as_dagster_type
, and@as_dagster_type
to enable custom type checks in place of defaultisinstance
checks. See documentation here: https://dagster.readthedocs.io/en/latest/sections/learn/tutorial/types.html#custom-type-checks -
Improved the type inference experience by automatically wrapping bare python types as dagster types.
-
Reworked our tutorial (now with more compelling/scary breakfast cereal examples) and public API documentation. See the new tutorial here: https://dagster.readthedocs.io/en/latest/sections/learn/tutorial/index.html
-
New solids explorer in Dagit allows you to browse and search for solids used across the repository.
-
Enabled solid dependency selection in the Dagit search filter.
- To select a solid and its upstream dependencies, search
+{solid_name}
. - To select a solid and its downstream dependents, search
{solid_name}+
. - For both search
+{solid_name}+
.
For example. In the Airline demo, searching
+join_q2_data
will get the following: - To select a solid and its upstream dependencies, search
-
Added a terminate button in Dagit to terminate an active run.
-
Added an
--output
flag todagster-graphql
CLI. -
Added confirmation step for
dagster run wipe
anddagster schedule wipe
commands (Thanks @shahvineet98). -
Fixed a wrong title in the
dagster-snowflake
library README (Thanks @Step2Web).
- Changed composition functions
@pipeline
and@composite_solid
to automatically give solids aliases with an incrementing integer suffix when there are conflicts. This removes to the need to manually alias solid definitions that are used multiple times. - Add
dagster schedule wipe
command to delete all schedules and remove all schedule cron jobs execute_solid
test util now works on composite solids.- Docs and example improvements: https://dagster.readthedocs.io/
- Added
--remote
flag todagster-graphql
for querying remote Dagit servers. - Fixed issue with duplicate run tag autocomplete suggestions in Dagit (#1839)
- Fixed Windows 10 / py3.6+ bug causing pipeline execution failures
- Fixed an issue where Dagster public images tagged
latest
on Docker Hub were erroneously published with an older version of Dagster (#1814) - Fixed an issue where the most recent scheduled run was not displayed in Dagit (#1815)
- Fixed a bug with the
dagster schedule start --start-all
command (#1812) - Added a new scheduler command to restart a schedule:
dagster schedule restart
. Also added a flag to restart all running schedules:dagster schedule restart --restart-all-running
.
New
This major release includes features for scheduling, operating, and executing pipelines that elevate Dagit and dagster from a local development tool to a deployable service.
DagsterInstance
introduced as centralized system to control run, event, compute log, and local intermediates storage.- A
Scheduler
abstraction has been introduced along side an initial implementation ofSystemCronScheduler
indagster-cron
. dagster-aws
has been extended with a CLI for deploying dagster to AWS. This can spin up a Dagit node and all the supporting infrastructure—security group, RDS PostgreSQL instance, etc.—without having to touch the AWS console, and for deploying your code to that instance.- Dagit
Runs
: a completely overhauled Runs history page. Includes the ability toRetry
,Cancel
, andDelete
pipeline runs from the new runs page.Scheduler
: a page for viewing and interacting with schedules.Compute Logs
: stdout and stderr are now viewable on a per execution step basis in each run. This is available in real time for currently executing runs and for historical runs.- A
Reload
button in the top right in Dagit restarts the web-server process and updates the UI to reflect repo changes, including DAG structure, solid names, type names, etc. This replaces the previous file system watching behavior.
Breaking Changes
--log
and--log-dir
no longer supported as CLI args. Existing runs and events stored via these flags are no longer compatible with current storage.raise_on_error
moved from in process executor config to argument to arguments in python API methods such asexecute_pipeline
- Fixes an issue using custom types for fan-in dependencies with intermediate storage.
- Fixes an issue running some Dagstermill notebooks on Windows.
- Fixes a transitive dependency issue with Airflow.
- Bugfixes, performance improvements, and better documentation.
- Fixed an issue with specifying composite output mappings (#1674)
- Added support for specifying Dask worker resources (#1679)
- Fixed an issue with launching Dagit on Windows
- Execution details are now configurable. The new top-level
ExecutorDefinition
and@executor
APIs are used to define in-process, multiprocess, and Dask executors, and may be used by users to define new executors. Like loggers and storage, executors may be added to aModeDefinition
and may be selected and configured through theexecution
field in the environment dict or YAML, including through Dagit. Executors may no longer be configured through theRunConfig
. - The API of dagster-dask has changed. Pipelines are now executed on Dask using the
ordinary
execute_pipeline
API, and the Dask executor is configured through the environment. (See the dagster-dask README for details.) - Added the
PresetDefinition.from_files
API for constructing a preset from a list of environment files (replacing the old usage of this class).PresetDefinition
may now be directly instantiated with an environment dict. - Added a prototype integration with dbt.
- Added a prototype integration with Great Expectations.
- Added a prototype integration with Papertrail.
- Added the dagster-bash library.
- Added the dagster-ssh library.
- Added the dagster-sftp library.
- Loosened the PyYAML compatibility requirement.
- The dagster CLI no longer takes a
--raise-on-error
or--no-raise-on-error
flag. Set this option in executor config. - Added a
MarkdownMetadataEntryData
class, so events yielded from client code may now render markdown in their metadata. - Bug fixes, documentation improvements, and improvements to error display.
- Dagit now accepts parameters via environment variables prefixed with
DAGIT_
, e.g.DAGIT_PORT
. - Fixes an issue with reexecuting Dagstermill notebooks from Dagit.
- Bug fixes and display improvments in Dagit.
- Reworked the display of structured log information and system events in Dagit, including support for structured rendering of client-provided event metadata.
- Dagster now generates events when intermediates are written to filesystem and S3 storage, and these events are displayed in Dagit and exposed in the GraphQL API.
- Whitespace display styling in Dagit can now be toggled on and off.
- Bug fixes, display nits and improvements, and improvements to JS build process, including better display for some classes of errors in Dagit and improvements to the config editor in Dagit.
- Pinned RxPY to 1.6.1 to avoid breaking changes in 3.0.0 (py3-only).
- Most definition objects are now read-only, with getters corresponding to the previous properties.
- The
valueRepr
field has been removed fromExecutionStepInputEvent
andExecutionStepOutputEvent
. - Bug fixes and Dagit UX improvements, including SQL highlighting and error handling.
- Added top-level
define_python_dagster_type
function. - Renamed
metadata_fn
totypecheck_metadata_fn
in all runtime type creation APIs. - Renamed
result_value
andresult_values
tooutput_value
andoutput_values
onSolidExecutionResult
- Dagstermill: Reworked public API now contains only
define_dagstermill_solid
,get_context
,yield_event
,yield_result
,DagstermillExecutionContext
,DagstermillError
, andDagstermillExecutionError
. Please see the new guide for details. - Bug fixes, including failures for some dagster CLI invocations and incorrect handling of Airflow timestamps.