Add tests for generate_constraints, example for generate_constraints in Constraints_Suite.ipynb
MilenaTrajanoska committed Jan 25, 2022
1 parent 21af833 commit eb06ace
Showing 3 changed files with 515 additions and 44 deletions.
334 changes: 332 additions & 2 deletions examples/Constraints2.ipynb → examples/Constraints_Suite.ipynb
@@ -103,7 +103,7 @@
},
{
"cell_type": "code",
"execution_count": 4,
"execution_count": 5,
"id": "bfd0b5a7",
"metadata": {},
"outputs": [],
@@ -1991,7 +1991,7 @@
},
{
"cell_type": "code",
"execution_count": 45,
"execution_count": 6,
"id": "db57da2c",
"metadata": {},
"outputs": [],
@@ -2001,6 +2001,8 @@
" \"last_name\": [\"Doe\", \"Doe\", \"Smith\", \"Jones\"],\n",
" \"username\": [\"jd123\", \"jane.doe@example.com\", \"bobsmith\", \"_anna_\"],\n",
" \"email\": [\"john.doe@example.com\", \"jane.doe@example.com\", \"bob.smith@example.com\", \"anna_jones@example.com\"],\n",
" \"followers\": [1525, 12268, 51343, 867],\n",
" \"points\": [23.4, 123.2, 432.22, 32.1],\n",
"})"
]
},
@@ -2032,6 +2034,334 @@
 "# we expect 1 out of 4 evaluations of the constraint to be a failure, since Jane Doe's email is the same as their username\n",
"format_report(dc.report())"
]
},
{
"cell_type": "markdown",
"id": "a7da9d24",
"metadata": {},
"source": [
"# Generate default constraints for data set"
]
},
{
"cell_type": "markdown",
"id": "a88c6ede",
"metadata": {},
"source": [
 "Let's log the users data frame from the previous example without any constraints. We will then call the generate_constraints method on the resulting WhyLogs dataset profile to generate default constraints."
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "b938730f",
"metadata": {},
"outputs": [],
"source": [
"profile = session.log_dataframe(users, \"test.data\")"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "a552cdce",
"metadata": {
"scrolled": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{\n",
" \"properties\": {\n",
" \"schemaMajorVersion\": 1,\n",
" \"schemaMinorVersion\": 2,\n",
" \"sessionId\": \"8222b610-9472-4bfb-92f5-a56a49cd8199\",\n",
" \"sessionTimestamp\": \"1643116248232\",\n",
" \"dataTimestamp\": \"1643112751681\",\n",
" \"tags\": {\n",
" \"name\": \"test.data\"\n",
" },\n",
" \"metadata\": {}\n",
" },\n",
" \"summaryConstraints\": {\n",
" \"first_name\": {\n",
" \"constraints\": [\n",
" {\n",
" \"name\": \"summary column_values_type EQ STRING\",\n",
" \"firstField\": \"column_values_type\",\n",
" \"value\": 5.0,\n",
" \"op\": \"EQ\",\n",
" \"verbose\": false,\n",
" \"quantileValue\": 0.0\n",
" },\n",
" {\n",
" \"name\": \"summary unique_count BTWN 3 and 5\",\n",
" \"firstField\": \"unique_count\",\n",
" \"op\": \"BTWN\",\n",
" \"between\": {\n",
" \"lowerValue\": 3.0,\n",
" \"upperValue\": 5.0\n",
" },\n",
" \"verbose\": false,\n",
" \"quantileValue\": 0.0\n",
" },\n",
" {\n",
" \"name\": \"summary most_common_value IN {'Bob', 'Anna', 'John', 'Jane'}\",\n",
" \"firstField\": \"most_common_value\",\n",
" \"op\": \"IN\",\n",
" \"referenceSet\": [\n",
" \"Bob\",\n",
" \"Anna\",\n",
" \"John\",\n",
" \"Jane\"\n",
" ],\n",
" \"verbose\": false,\n",
" \"quantileValue\": 0.0\n",
" }\n",
" ]\n",
" },\n",
" \"followers\": {\n",
" \"constraints\": [\n",
" {\n",
" \"name\": \"summary min GE 0/None\",\n",
" \"firstField\": \"min\",\n",
" \"value\": 0.0,\n",
" \"op\": \"GE\",\n",
" \"verbose\": false,\n",
" \"quantileValue\": 0.0\n",
" },\n",
" {\n",
" \"name\": \"summary mean BTWN -7308.11238882488 and 40309.612388824884\",\n",
" \"firstField\": \"mean\",\n",
" \"op\": \"BTWN\",\n",
" \"between\": {\n",
" \"lowerValue\": -7308.11238882488,\n",
" \"upperValue\": 40309.612388824884\n",
" },\n",
" \"verbose\": false,\n",
" \"quantileValue\": 0.0\n",
" },\n",
" {\n",
" \"name\": \"summary column_values_type EQ INTEGRAL\",\n",
" \"firstField\": \"column_values_type\",\n",
" \"value\": 3.0,\n",
" \"op\": \"EQ\",\n",
" \"verbose\": false,\n",
" \"quantileValue\": 0.0\n",
" },\n",
" {\n",
" \"name\": \"summary unique_count BTWN 3 and 5\",\n",
" \"firstField\": \"unique_count\",\n",
" \"op\": \"BTWN\",\n",
" \"between\": {\n",
" \"lowerValue\": 3.0,\n",
" \"upperValue\": 5.0\n",
" },\n",
" \"verbose\": false,\n",
" \"quantileValue\": 0.0\n",
" },\n",
" {\n",
" \"name\": \"summary most_common_value IN {'51343', '867', '1525', '12268'}\",\n",
" \"firstField\": \"most_common_value\",\n",
" \"op\": \"IN\",\n",
" \"referenceSet\": [\n",
" \"51343\",\n",
" \"867\",\n",
" \"1525\",\n",
" \"12268\"\n",
" ],\n",
" \"verbose\": false,\n",
" \"quantileValue\": 0.0\n",
" }\n",
" ]\n",
" },\n",
" \"last_name\": {\n",
" \"constraints\": [\n",
" {\n",
" \"name\": \"summary column_values_type EQ STRING\",\n",
" \"firstField\": \"column_values_type\",\n",
" \"value\": 5.0,\n",
" \"op\": \"EQ\",\n",
" \"verbose\": false,\n",
" \"quantileValue\": 0.0\n",
" },\n",
" {\n",
" \"name\": \"summary unique_count BTWN 2 and 4\",\n",
" \"firstField\": \"unique_count\",\n",
" \"op\": \"BTWN\",\n",
" \"between\": {\n",
" \"lowerValue\": 2.0,\n",
" \"upperValue\": 4.0\n",
" },\n",
" \"verbose\": false,\n",
" \"quantileValue\": 0.0\n",
" },\n",
" {\n",
" \"name\": \"summary most_common_value IN {'Jones', 'Doe', 'Smith'}\",\n",
" \"firstField\": \"most_common_value\",\n",
" \"op\": \"IN\",\n",
" \"referenceSet\": [\n",
" \"Jones\",\n",
" \"Doe\",\n",
" \"Smith\"\n",
" ],\n",
" \"verbose\": false,\n",
" \"quantileValue\": 0.0\n",
" }\n",
" ]\n",
" },\n",
" \"email\": {\n",
" \"constraints\": [\n",
" {\n",
" \"name\": \"summary column_values_type EQ STRING\",\n",
" \"firstField\": \"column_values_type\",\n",
" \"value\": 5.0,\n",
" \"op\": \"EQ\",\n",
" \"verbose\": false,\n",
" \"quantileValue\": 0.0\n",
" },\n",
" {\n",
" \"name\": \"summary unique_count BTWN 3 and 5\",\n",
" \"firstField\": \"unique_count\",\n",
" \"op\": \"BTWN\",\n",
" \"between\": {\n",
" \"lowerValue\": 3.0,\n",
" \"upperValue\": 5.0\n",
" },\n",
" \"verbose\": false,\n",
" \"quantileValue\": 0.0\n",
" },\n",
" {\n",
" \"name\": \"summary most_common_value IN {'bob.smith@example.com', 'john.doe@example.com', 'jane.doe@example.com', 'anna_jones@example.com'}\",\n",
" \"firstField\": \"most_common_value\",\n",
" \"op\": \"IN\",\n",
" \"referenceSet\": [\n",
" \"bob.smith@example.com\",\n",
" \"john.doe@example.com\",\n",
" \"jane.doe@example.com\",\n",
" \"anna_jones@example.com\"\n",
" ],\n",
" \"verbose\": false,\n",
" \"quantileValue\": 0.0\n",
" }\n",
" ]\n",
" },\n",
" \"points\": {\n",
" \"constraints\": [\n",
" {\n",
" \"name\": \"summary min GE 0/None\",\n",
" \"firstField\": \"min\",\n",
" \"value\": 0.0,\n",
" \"op\": \"GE\",\n",
" \"verbose\": false,\n",
" \"quantileValue\": 0.0\n",
" },\n",
" {\n",
" \"name\": \"summary mean BTWN -38.98552432358383 and 344.44552432358387\",\n",
" \"firstField\": \"mean\",\n",
" \"op\": \"BTWN\",\n",
" \"between\": {\n",
" \"lowerValue\": -38.98552432358383,\n",
" \"upperValue\": 344.44552432358387\n",
" },\n",
" \"verbose\": false,\n",
" \"quantileValue\": 0.0\n",
" },\n",
" {\n",
" \"name\": \"summary column_values_type EQ FRACTIONAL\",\n",
" \"firstField\": \"column_values_type\",\n",
" \"value\": 2.0,\n",
" \"op\": \"EQ\",\n",
" \"verbose\": false,\n",
" \"quantileValue\": 0.0\n",
" },\n",
" {\n",
" \"name\": \"summary most_common_value IN {'123.2', '432.22', '32.1', '23.4'}\",\n",
" \"firstField\": \"most_common_value\",\n",
" \"op\": \"IN\",\n",
" \"referenceSet\": [\n",
" \"123.2\",\n",
" \"432.22\",\n",
" \"32.1\",\n",
" \"23.4\"\n",
" ],\n",
" \"verbose\": false,\n",
" \"quantileValue\": 0.0\n",
" }\n",
" ]\n",
" },\n",
" \"username\": {\n",
" \"constraints\": [\n",
" {\n",
" \"name\": \"summary column_values_type EQ STRING\",\n",
" \"firstField\": \"column_values_type\",\n",
" \"value\": 5.0,\n",
" \"op\": \"EQ\",\n",
" \"verbose\": false,\n",
" \"quantileValue\": 0.0\n",
" },\n",
" {\n",
" \"name\": \"summary unique_count BTWN 3 and 5\",\n",
" \"firstField\": \"unique_count\",\n",
" \"op\": \"BTWN\",\n",
" \"between\": {\n",
" \"lowerValue\": 3.0,\n",
" \"upperValue\": 5.0\n",
" },\n",
" \"verbose\": false,\n",
" \"quantileValue\": 0.0\n",
" },\n",
" {\n",
" \"name\": \"summary most_common_value IN {'jd123', 'bobsmith', '_anna_', 'jane.doe@example.com'}\",\n",
" \"firstField\": \"most_common_value\",\n",
" \"op\": \"IN\",\n",
" \"referenceSet\": [\n",
" \"jd123\",\n",
" \"bobsmith\",\n",
" \"_anna_\",\n",
" \"jane.doe@example.com\"\n",
" ],\n",
" \"verbose\": false,\n",
" \"quantileValue\": 0.0\n",
" }\n",
" ]\n",
" }\n",
" },\n",
" \"valueConstraints\": {}\n",
"}\n"
]
}
],
"source": [
"auto_constraints = profile.generate_constraints()\n",
"print(message_to_json(auto_constraints.to_protobuf()))"
]
},
{
"cell_type": "markdown",
"id": "77ea23ed",
"metadata": {},
"source": [
 "For columns with inferred type STRING, generate_constraints produces three constraints: columnValuesTypeEqualsConstraint, which requires the column type to be STRING; columnUniqueValueCountBetweenConstraint, which requires the number of unique values in the column to lie between unique_count - 1 and unique_count + 1, where unique_count is taken from the current data frame; and columnMostCommonValueInSetConstraint, which takes the set of the 5 most common values and requires the most common value in the column to be in that set."
]
},
{
"cell_type": "markdown",
"id": "3f683579",
"metadata": {},
"source": [
 "Columns with inferred type FRACTIONAL or INTEGRAL, such as 'points' and 'followers' respectively, get numeric constraints where they apply to the column: minimum value greater than or equal to 0, maximum value less than 0, and mean in the range [mean - stddev, mean + stddev]. In addition, columnValuesTypeEqualsConstraint and columnMostCommonValueInSetConstraint are generated for both types, while columnUniqueValueCountBetweenConstraint is generated only for INTEGRAL columns."
]
},
{
"cell_type": "markdown",
"id": "5cd56524",
"metadata": {},
"source": [
"No constraints are generated for columns which have an inferred type of NULL."
]
}
],
"metadata": {
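
A note on the generated bounds: the sketch below recomputes, in plain Python, the rules described in the notebook's markdown cells for the `followers` and `last_name` columns. It does not call whylogs. The helper names (`mean_bounds`, `unique_count_bounds`, `most_common_set`) are hypothetical, and it assumes a sample standard deviation and an exact distinct count, whereas the library may estimate these values from its profile sketches.

```python
import statistics
from collections import Counter

# Column values copied from the `users` data frame in the notebook.
followers = [1525, 12268, 51343, 867]
last_name = ["Doe", "Doe", "Smith", "Jones"]


def mean_bounds(values):
    """Return (mean - stddev, mean + stddev).

    Assumption: stddev is the sample standard deviation (n - 1 denominator).
    """
    mean = statistics.mean(values)
    stddev = statistics.stdev(values)
    return mean - stddev, mean + stddev


def unique_count_bounds(values):
    """Return (unique_count - 1, unique_count + 1).

    Assumption: an exact distinct count stands in for the profile's estimate.
    """
    unique_count = len(set(values))
    return unique_count - 1, unique_count + 1


def most_common_set(values, k=5):
    """Return the set of up to k most frequent values."""
    return {value for value, _ in Counter(values).most_common(k)}


lower, upper = mean_bounds(followers)
print("followers mean bounds:", lower, upper)
print("last_name unique_count bounds:", unique_count_bounds(last_name))
print("last_name most common values:", most_common_set(last_name))
```

Under these assumptions, the `followers` bounds agree with the `mean BTWN` values in the notebook output up to floating point rounding, and `last_name` yields `(2, 4)`, matching `unique_count BTWN 2 and 4`.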