deequ.NET


⚠️ Warning: The library is still in alpha and is not fully tested.

deequ.NET is a port of the awslabs/deequ library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets. deequ.NET runs on dotnet/spark.

Requirements and Installation

deequ.NET runs on Apache Spark and depends on dotnet/spark, so the following dependencies must be installed locally:

You also need to install Microsoft.Spark.Worker on your local machine and add its location to the PATH environment variable. For detailed instructions, see dotnet/spark - Getting started.
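Before building the example, it can help to confirm that the required tools are reachable through PATH. The following is a minimal sketch, not part of deequ.NET: `check_tool` is a hypothetical helper, and the worker executable name is assumed to match the package name `Microsoft.Spark.Worker`.

```shell
#!/bin/sh
# Report whether a tool is reachable through PATH.
# Prints "ok: <name>" and returns 0, or prints "missing: <name>" and returns 1.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "ok: $1"
    else
        echo "missing: $1"
        return 1
    fi
}

# Tool names are taken from this README; the worker executable name is an
# assumption and may differ on your platform (for example, a .exe suffix).
for tool in dotnet spark-submit Microsoft.Spark.Worker; do
    check_tool "$tool" || true
done
```

Any line that starts with `missing:` points at a dependency that still has to be installed or added to PATH.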

Usage

The following example defines a set of checks on a few records and runs them with the spark-submit command.

  • Use the dotnet CLI to create a console application:

    dotnet new console -o DeequExample
  • Install the Microsoft.Spark and deequ NuGet packages into the project:

    cd DeequExample
    
    dotnet add package Microsoft.Spark
    dotnet add package deequ
  • Replace the contents of the Program.cs file with the following code:

    using deequ;
    using deequ.Checks;
    using deequ.Extensions;
    using Microsoft.Spark.Sql;
    
    namespace DeequExample
    {
        class Program
        {
            static void Main(string[] args)
            {
                SparkSession spark = SparkSession.Builder().GetOrCreate();
                DataFrame data = spark.Read().Json("inventory.json");
    
                data.Show();
    
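                // Each Check groups constraints under a severity level and a description.
                VerificationResult verificationResult = new VerificationSuite()
                    .OnData(data)
                    .AddCheck(
                        new Check(CheckLevel.Error, "integrity checks")
                            .HasSize(value => value == 5)   // expect exactly 5 rows
                            .IsComplete("id")               // no null ids
                            .IsUnique("id")                 // no duplicate ids
                            .IsComplete("productName")      // no null product names
                            .IsContainedIn("priority", new[] { "high", "low" })
                            .IsNonNegative("numViews")      // no negative view counts
                    )
                    .AddCheck(
                        new Check(CheckLevel.Warning, "distribution checks")
                            // at least half of the descriptions should contain a URL
                            .ContainsURL("description", value => value >= .5)
                    )
                    .Run();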
    
                verificationResult.Debug();
            }
        }
    }
  • Use the dotnet CLI to build the application:

    dotnet build

Running the example

  • Open your terminal and navigate into your app folder.

    cd <your-app-output-directory>
  • Create inventory.json with the following content:

    {"id":1, "productName":"Thingy A", "description":"awesome thing. http://thingb.com", "priority":"high", "numViews":0}
    {"id":2, "productName":"Thingy B", "description":"available at http://thingb.com","priority":null, "numViews":0}
    {"id":3, "productName":"Thingy C", "description": null, "priority":"low", "numViews":5}
    {"id":4, "productName":"Thingy D", "description": "checkout https://thingd.ca", "priority":"low","numViews": 10}
    {"id":5, "productName":"Thingy E", "description":null, "priority":"high","numViews": 12}
  • Run your app.

    spark-submit \
        --class org.apache.spark.deploy.dotnet.DotnetRunner \
        --master local \
        microsoft-spark-2.4.x-<version>.jar \
    dotnet DeequExample.dll

    Note: This command requires Apache Spark to be on your PATH environment variable so that spark-submit can be found. For detailed instructions, see Building .NET for Apache Spark from Source on Ubuntu.

  • The output of the application should look similar to the output below:

    
         _                         _   _ ______ _______
        | |                       | \ | |  ____|__   __|
      __| | ___  ___  __ _ _   _  |  \| | |__     | |
     / _` |/ _ \/ _ \/ _` | | | | | . ` |  __|    | |
    | (_| |  __/  __/ (_| | |_| |_| |\  | |____   | |
     \__,_|\___|\___|\__, |\__,_(_)_| \_|______|  |_|
                        | |
                        |_|
    
    
    
    Success
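As a rough cross-check of why the size and URL checks pass on the sample data, the row count and the share of records containing a URL can be recomputed with standard shell tools. This is only a sketch: it recreates the sample records in a temporary file and uses a simple `https?://` pattern, which is an assumption and not how deequ.NET detects URLs.

```shell
#!/bin/sh
# Recreate the sample records from this README in a temporary file.
data=$(mktemp)
cat > "$data" <<'EOF'
{"id":1, "productName":"Thingy A", "description":"awesome thing. http://thingb.com", "priority":"high", "numViews":0}
{"id":2, "productName":"Thingy B", "description":"available at http://thingb.com","priority":null, "numViews":0}
{"id":3, "productName":"Thingy C", "description": null, "priority":"low", "numViews":5}
{"id":4, "productName":"Thingy D", "description": "checkout https://thingd.ca", "priority":"low","numViews": 10}
{"id":5, "productName":"Thingy E", "description":null, "priority":"high","numViews": 12}
EOF

# Count the rows (one JSON record per line).
total=$(grep -c '' "$data")
# Count the rows that contain an http:// or https:// link.
with_url=$(grep -c -E 'https?://' "$data")

echo "rows: $total"
echo "rows with a URL: $with_url"

# Integer form of with_url / total >= 0.5, avoiding floating point.
if [ $((with_url * 2)) -ge "$total" ]; then
    echo "URL ratio meets the 0.5 threshold"
else
    echo "URL ratio is below the 0.5 threshold"
fi

rm -f "$data"
```

On the sample data this reports 5 rows, 3 of which contain a URL, so the ratio is 0.6. That satisfies both `HasSize(value => value == 5)` and the `value >= .5` threshold passed to `ContainsURL`.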
    

More examples

The following list shows more examples/showcases of the deequ.NET API:

Credits

Citation

Sebastian Schelter, Dustin Lange, Philipp Schmidt, Meltem Celikel, Felix Biessmann, and Andreas Grafberger. 2018. Automating large-scale data quality verification. Proc. VLDB Endow. 11, 12 (August 2018), 1781-1794.
