Skip to content

deequ.NET is a port of the awslabs/deequ library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.

samueleresca/deequ.net

Repository files navigation

deequ.NET

deequ.NET codecov Nuget NuGet

⚠️Warning: The library is still in alpha, and it is not fully tested.

deequ.NET is a port of the awslabs/deequ library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets. deequ.NET runs on dotnet/spark.

Requirements and Installation

deequ.NET runs on Apache Spark and depends on dotnet/spark. Therefore it is required to install the following dependencies locally:

It is also necessary to install the Microsoft.Spark.Worker on your local machine and configure the path into the PATH env var. For a detailed instructions, see dotnet/spark - Getting started

Usage

The following example implements a set of checks on some records and it submits the execution using the spark-submit command.

  • Use the dotnet CLI to create a console application:

    dotnet new console -o DeequExample
  • Install Microsoft.Spark and the deequ Nuget packages into the project:

    cd DeequExample
    
    dotnet add package Microsoft.Spark
    dotnet add package deequ
  • Replace the contents of the Program.cs file with the following code:

    using deequ;
    using deequ.Checks;
    using deequ.Extensions;
    using Microsoft.Spark.Sql;
    
    namespace DeequExample
    {
        class Program
        {
            static void Main(string[] args)
            {
                SparkSession spark = SparkSession.Builder().GetOrCreate();
                DataFrame data = spark.Read().Json("inventory.json");
    
                data.Show();
    
                VerificationResult verificationResult = new VerificationSuite()
                    .OnData(data)
                    .AddCheck(
                        new Check(CheckLevel.Error, "integrity checks")
                            .HasSize(value => value == 5)
                            .IsComplete("id")
                            .IsUnique("id")
                            .IsComplete("productName")
                            .IsContainedIn("priority", new[] { "high", "low" })
                            .IsNonNegative("numViews")
                    )
                    .AddCheck(
                        new Check(CheckLevel.Warning, "distribution checks")
                            .ContainsURL("description", value => value >= .5)
                    )
                    .Run();
    
                verificationResult.Debug();
            }
        }
    }
  • Use the dotnet CLI to build the application:

    dotnet build

Running the example

  • Open your terminal and navigate into your app folder.

    cd <your-app-output-directory>
  • Create inventory.json with the following content:

    {"id":1, "productName":"Thingy A", "description":"awesome thing. http://thingb.com", "priority":"high", "numViews":0}
    {"id":2, "productName":"Thingy B", "description":"available at http://thingb.com","priority":null, "numViews":0}
    {"id":3, "productName":"Thingy C", "description": null, "priority":"low", "numViews":5}
    {"id":4, "productName":"Thingy D", "description": "checkout https://thingd.ca", "priority":"low","numViews": 10}
    {"id":5, "productName":"Thingy E", "description":null, "priority":"high","numViews": 12}
  • Run your app.

    spark-submit \
        --class org.apache.spark.deploy.dotnet.DotnetRunner \
        --master local \
        microsoft-spark-2.4.x-<version>.jar \
    dotnet DeequExample.dll

    Note: This command requires Apache Spark in your PATH environment variable to be able to use spark-submit. For detailed instructions, you can see Building .NET for Apache Spark from Source on Ubuntu.

  • The output of the application should look similar to the output below:

    
         _                         _   _ ______ _______
        | |                       | \ | |  ____|__   __|
      __| | ___  ___  __ _ _   _  |  \| | |__     | |
     / _` |/ _ \/ _ \/ _` | | | | | . ` |  __|    | |
    | (_| |  __/  __/ (_| | |_| |_| |\  | |____   | |
     \__,_|\___|\___|\__, |\__,_(_)_| \_|______|  |_|
                        | |
                        |_|
    
    
    
    Success
    

More examples

The following list shows more examples/showcases of the deequ.NET API:

Credits

Citation

Sebastian Schelter, Dustin Lange, Philipp Schmidt, Meltem Celikel, Felix Biessmann, and Andreas Grafberger. 2018. Automating large-scale data quality verification. Proc. VLDB Endow. 11, 12 (August 2018), 1781-1794.

About

deequ.NET is a port of the awslabs/deequ library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.

Topics

Resources

Stars

Watchers

Forks

Languages