Joydeep Banik Roy edited this page Nov 30, 2022 · 20 revisions

Problem

When writing tests for Spark code, we tend to write a lot of boilerplate just to create a test Spark dataframe initialised with some test data. Not only are such test sets hard to read, they also do not adhere to property-based testing standards.

Solution

We needed a utility that would

🥇 require less boilerplate code

🥇 offer an easily extensible interface for your custom use cases

🥇 make it easy to build out-of-the-box support for the most common attributes in your data/project

🥇 promote the use of property-based tests

This utility is based on the spark-testing-base library by Holden Karau.

Usage

If you simply want to get started, use one of the inbuilt column types:

  • primitives - stringColumn, intColumn, boolColumn, doubleColumn & longColumn
  • complex - mapColumn, arrayColumn
  • custom - supported, see Wiki

Wrap the columns in a SparkDataframe and call the method getOne, which returns a randomly generated dataframe:

val demographics = SparkDataframe(
  // custom column - see the User Defined Columns Wiki
  adid(),
  // base data type - see Supported Column Types Wiki
  dataColumn("country", DString, AlwaysPresent, List("BRA", "CAN", "MEX", "COL", "USA", "ARG", null, " ", "", "null")),
  stringColumn("device_os", List("android", "ios", "null")).withJunk,
  dataColumn("gender", DString, AlwaysUniform, List("Male", "Female", "Other", "", "null")),
  age(),
  intColumn("Person_Income", List("3455", "5600", "4500"))
)
implicit val sc = spark
val df = demographics.getOne()
df.get.show()

+------------------------------------+------------------+---------+------+---+-----------------------+
|adid                                |country           |device_os|gender|age|Person_Income          |
+------------------------------------+------------------+---------+------+---+-----------------------+
|a4d38240-9df5-4651-93da-d098260678c8|                  |null     |Other |98 |3455                   |
|3dc59790-3c97-458d-9a3f-4586c6aa9afb|MEX               |         |Male  |22 |5600                   |
|5e9a507e-ffaa-4ab9-8662-5adf58c8a1fc|CAN               |null     |null  |23 |4500                   |
|710fa030-51a2-4083-9f0c-c7cb9fd5c6fd|                  |         |null  |98 |3455                   |
|cd34b062-858c-4ec0-99dd-21787fbb5fff|CAN               |null     |Male  |56 |4500                   |
|b5673d48-75cc-41b1-811a-c06d954db352|COL               |android  |Female|34 |4500                   |
|f5fd533b-6215-4709-b0c9-2c207d68f115|MEX               |null     |null  |50 |4500                   |
|c09f8380-5366-4c1e-9ac0-2b3fce149119|MEX               |         |Female|12 |4500                   |
|bfd83eca-68e4-49b6-9cfc-eb8a0442d889|ARG               |ios      |null  |28 |4500                   |
|5e9a507e-ffaa-4ab9-8662-5adf58c8a1fc|null              |ios      |Other |12 |4500                   |
|9d7eb68a-94ea-438a-a55d-94ee11834a65|USA               |junkValue|null  |37 |5600                   |
|710fa030-51a2-4083-9f0c-c7cb9fd5c6fd|ARG               |ios      |Male  |41 |5600                   |
|cd34b062-858c-4ec0-99dd-21787fbb5fff|MEX               |android  |      |97 |5600                   |
|7f298550-d50b-44bb-a147-7598242f9cf7|CAN               |null     |Other |18 |5600                   |
|9fcc1ac8-5320-40f6-9736-ab606e561b6e|                  |null     |Female|12 |5600                   |
|e64efe2c-17b2-4dbf-b4c1-5bf38673e08c|MEX               |ios      |null  |28 |4500                   |
|3dc59790-3c97-458d-9a3f-4586c6aa9afb|ARG               |         |      |23 |3455                   |
|ca2bb12a-87f8-4256-8446-d601c5e0360d|USA               |null     |null  |22 |3455                   |
|9fcc1ac8-5320-40f6-9736-ab606e561b6e|null              |junkValue|Other |68 |4500                   |
|9fcc1ac8-5320-40f6-9736-ab606e561b6e|                  |ios      |Male  |41 |4500                   |
+------------------------------------+------------------+---------+------+---+-----------------------+

Property-Based tests

Write using forAll

The library is most useful when these definitions are used in property-based tests: obtain an arbitrary generator from the definition and pass it to forAll, like this

forAll(demographics.getArbitraryGenerator(), configParams = minSize(1), minSuccessful(60)) { df =>
  // Write your tests here
}
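To make the idea behind forAll concrete, here is a minimal, library-independent sketch of a property-based check: draw many random inputs and look for one that violates an invariant. It uses only the Scala standard library. Every name in it (PropertySketch, Person, genPerson, cleanCountry, firstCounterexample) is hypothetical and is not part of this utility or of ScalaTest; a real test would use forAll with the generator shown above.

```scala
import scala.util.Random

// Hypothetical sketch of a property-based check; not this utility's API.
object PropertySketch {

  // One generated record, standing in for a row of the demographics dataframe.
  final case class Person(country: String, age: Int)

  // Candidate values, in the style of the value lists used above.
  val countries: Vector[String] =
    Vector("BRA", "CAN", "MEX", "COL", "USA", "ARG", "", " ")

  // Draw one random record: a country from the list and an age in [0, 100).
  def genPerson(rng: Random): Person =
    Person(countries(rng.nextInt(countries.size)), rng.nextInt(100))

  // Code under test: trim the country and map null or blank values to None.
  def cleanCountry(p: Person): Option[String] =
    Option(p.country).map(_.trim).filter(_.nonEmpty)

  // Run `property` against `runs` random inputs and return the first
  // counterexample, or None if every input satisfied the property.
  def firstCounterexample[A](gen: Random => A, runs: Int, seed: Long)(
      property: A => Boolean): Option[A] = {
    val rng = new Random(seed)
    Iterator.continually(gen(rng)).take(runs).find(a => !property(a))
  }

  def main(args: Array[String]): Unit = {
    // Property: a cleaned country is never blank and never padded.
    val failure = firstCounterexample(genPerson, runs = 60, seed = 42L) { p =>
      cleanCountry(p).forall(c => c.nonEmpty && c == c.trim)
    }
    println(failure.fold("no counterexample found")(p => s"failed for $p"))
  }
}
```

The point of the value lists containing "", " " and similar entries is the same as in the demographics definition: the messy cases are part of every run, so the property is exercised against them without hand-writing each case.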

Schema

You can also get the Spark schema from the demographics definition:

println(demographics.getSchema())
/*
StructType(
StructField(adid,StringType,true), StructField(country,StringType,true), 
StructField(device_os,StringType,true), StructField(gender,StringType,true), 
StructField(age,IntegerType,true), StructField(Person_Income,IntegerType,true))
*/
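One possible use of the schema is to assert that a definition still produces the columns a job expects. The sketch below assumes that getSchema() returns a Spark StructType, as the printed output suggests, and that spark-sql is on the classpath. SchemaSketch, expectedSchema and mismatches are hypothetical names introduced for illustration; only the org.apache.spark.sql.types classes are real Spark API.

```scala
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Hypothetical helper for comparing schemas; not part of this utility.
object SchemaSketch {

  // Expected schema, transcribed from the printed output above.
  val expectedSchema: StructType = StructType(Seq(
    StructField("adid", StringType, nullable = true),
    StructField("country", StringType, nullable = true),
    StructField("device_os", StringType, nullable = true),
    StructField("gender", StringType, nullable = true),
    StructField("age", IntegerType, nullable = true),
    StructField("Person_Income", IntegerType, nullable = true)
  ))

  // Names of expected fields that are missing from `actual`, or present
  // with a different type or nullability.
  def mismatches(actual: StructType, expected: StructType): Seq[String] =
    expected.fields.toSeq.filterNot(f => actual.fields.contains(f)).map(_.name)
}
```

Under those assumptions, a test could assert that SchemaSketch.mismatches(demographics.getSchema(), SchemaSketch.expectedSchema) is empty, which gives a more readable failure than comparing the two StructType values directly.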

Examples

Please take a look at the examples directory. Users can arrange their often-used columns in a similar fashion and host them in a common library to be used across organizations.