Home
While writing tests for Spark code, we tend to write a lot of boilerplate just to create a test Spark dataframe initialised with some test data. Not only are these test sets hard to read, they also do not adhere to property-based testing standards.
We needed a utility that would:
🥇 require less boilerplate code
🥇 offer an easily extensible interface for your custom use-cases
🥇 provide out-of-the-box support for the most common attributes in your data/project
🥇 promote the use of property-based tests
This utility is based on the spark-testing-base library by Holden Karau.
If you simply want to get started, use one of the inbuilt column types:
- primitives - stringColumn, intColumn, boolColumn, doubleColumn & longColumn
- complex - mapColumn, arrayColumn
- custom - supported, see the Wiki

Wrap them in a SparkDataframe like this and call the method getOne, which returns a randomly-generated dataframe:
val demographics = SparkDataframe(
  // custom column - see the User Defined Columns Wiki
  adid(),
  // base data type - see the Supported Column Types Wiki
  dataColumn("country", DString, AlwaysPresent, List("BRA", "CAN", "MEX", "COL", "USA", "ARG", null, " ", "", "null")),
  stringColumn("device_os", List("android", "ios", "null")).withJunk,
  dataColumn("gender", DString, AlwaysUniform, List("Male", "Female", "Other", "", "null")),
  age(),
  intColumn("Person_Income", List("3455", "5600", "4500"))
)

implicit val sc = spark
val df = demographics.getOne()
df.get.show()
+------------------------------------+------------------+---------+------+---+-----------------------+
|adid |country |device_os|gender|age|Person_Income |
+------------------------------------+------------------+---------+------+---+-----------------------+
|a4d38240-9df5-4651-93da-d098260678c8| |null |Other |98 |3455 |
|3dc59790-3c97-458d-9a3f-4586c6aa9afb|MEX | |Male |22 |5600 |
|5e9a507e-ffaa-4ab9-8662-5adf58c8a1fc|CAN |null |null |23 |4500 |
|710fa030-51a2-4083-9f0c-c7cb9fd5c6fd| | |null |98 |3455 |
|cd34b062-858c-4ec0-99dd-21787fbb5fff|CAN |null |Male |56 |4500 |
|b5673d48-75cc-41b1-811a-c06d954db352|COL |android |Female|34 |4500 |
|f5fd533b-6215-4709-b0c9-2c207d68f115|MEX |null |null |50 |4500 |
|c09f8380-5366-4c1e-9ac0-2b3fce149119|MEX | |Female|12 |4500 |
|bfd83eca-68e4-49b6-9cfc-eb8a0442d889|ARG |ios |null |28 |4500 |
|5e9a507e-ffaa-4ab9-8662-5adf58c8a1fc|null |ios |Other |12 |4500 |
|9d7eb68a-94ea-438a-a55d-94ee11834a65|USA |junkValue|null |37 |5600 |
|710fa030-51a2-4083-9f0c-c7cb9fd5c6fd|ARG |ios |Male |41 |5600 |
|cd34b062-858c-4ec0-99dd-21787fbb5fff|MEX |android | |97 |5600 |
|7f298550-d50b-44bb-a147-7598242f9cf7|CAN |null |Other |18 |5600 |
|9fcc1ac8-5320-40f6-9736-ab606e561b6e| |null |Female|12 |5600 |
|e64efe2c-17b2-4dbf-b4c1-5bf38673e08c|MEX |ios |null |28 |4500 |
|3dc59790-3c97-458d-9a3f-4586c6aa9afb|ARG | | |23 |3455 |
|ca2bb12a-87f8-4256-8446-d601c5e0360d|USA |null |null |22 |3455 |
|9fcc1ac8-5320-40f6-9736-ab606e561b6e|null |junkValue|Other |68 |4500 |
|9fcc1ac8-5320-40f6-9736-ab606e561b6e| |ios |Male |41 |4500 |
+------------------------------------+------------------+---------+------+---+-----------------------+
But this library is at its best when you use these objects in property-based tests, obtaining arbitrary generators like this:

forAll(demographics.getArbitraryGenerator(), configParams = minSize(1), minSuccessful(60)) { df =>
  // Write your tests here
}
You can also get the Spark schema from the demographics definition:

println(demographics.getSchema())
/*
StructType(
StructField(adid,StringType,true), StructField(country,StringType,true),
StructField(device_os,StringType,true), StructField(gender,StringType,true),
StructField(age,IntegerType,true), StructField(Person_Income,IntegerType,true))
*/
Please take a look at the examples directory. You can arrange your often-used columns in a similar fashion and host them in a common library to be shared across your organization.
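As a rough sketch of that idea, the adid() and age() helpers used in the example above could plausibly live in a shared object like the one below. Note this is an assumption modeled on the dataColumn and stringColumn calls shown on this page, not the library's confirmed API; the object name CommonColumns and the helper bodies are hypothetical.

```scala
// Hypothetical shared test-column definitions, modeled on the calls shown
// above; actual signatures may differ in the library.
object CommonColumns {
  // An "age"-style integer column drawing from a fixed pool of test values
  // (assumed: DInt mirrors the DString type tag used above)
  def age() =
    dataColumn("age", DInt, AlwaysPresent, List("12", "18", "23", "41", "98"))

  // A device-OS column that also injects junk values, as in the example above
  def deviceOs() =
    stringColumn("device_os", List("android", "ios", "null")).withJunk
}
```

Hosting such an object in a common test-utilities module would let every team build dataframes like SparkDataframe(CommonColumns.age(), CommonColumns.deviceOs()) without redefining the columns in each test suite.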