# TDDA: Test-Driven Data Analysis

[TDDA](https://github.com/tdda/tdda) checks input data (such as NumPy arrays or pandas DataFrames) against a set of constraints that are stored in a JSON file.

* `tdda.referencetest` supports the creation of reference tests based on either unittest or pytest.
* `tdda.constraints` is used to retrieve constraints from a (pandas) DataFrame, write them out as JSON and check whether records satisfy the constraints in the constraints file. It also supports tables in a variety of relational databases.
* `tdda.rexpy` is a tool for automatically deriving regular expressions from a column in a pandas DataFrame or from a (Python) list of examples.

## 1. Imports

In [1]:
import pandas as pd
import numpy as np
from tdda.constraints import discover_df, verify_df

In [2]:
df = pd.read_csv('https://raw.githubusercontent.com/kjam/data-cleaning-101/master/data/iot_example.csv')

## 2. Check data

With [pandas.DataFrame.sample](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html) we display ten randomly selected rows:

In [3]:
df.sample(10)

| | timestamp | username | temperature | heartrate | build | latest | note |
|---|---|---|---|---|---|---|---|
| 51603 | 2017-01-22T02:55:23 | sherry06 | 28 | 83 | 5c1d8967-fcfc-1a8d-2f62-5036da241848 | 1 | sleep |
| 119933 | 2017-02-18T10:09:06 | psnyder | 9 | 83 | d4e5846a-0b8e-ecd1-5346-a680c8271524 | 1 | test |
| 67013 | 2017-01-28T06:57:26 | hayesthomas | 22 | 78 | a5cca8fd-e6aa-ddd9-9980-8d32077ca099 | 0 | update |
| 5554 | 2017-01-03T17:25:28 | dianajohnson | 29 | 80 | 7e30f6b8-4e2f-025b-515d-4f2593e7ce08 | 1 | |
| 118950 | 2017-02-18T00:42:42 | katherinefaulkner | 17 | 79 | 71613d5f-72fd-ee43-a27c-5f93cc693be1 | 1 | interval |
| 50388 | 2017-01-21T15:19:28 | diazgregory | 20 | 68 | 6ef03856-0470-1664-f749-4fd59572efda | 0 | wake |
| 88116 | 2017-02-05T17:38:11 | thomas62 | 10 | 74 | 7c19890c-ef1b-75a0-acfa-efdf21ac90b6 | 0 | |
| 64332 | 2017-01-27T05:17:04 | kanderson | 28 | 81 | 0b94e0ba-ecee-0b76-8b53-191f93f12404 | 1 | sleep |
| 48896 | 2017-01-21T00:55:48 | heidi76 | 28 | 74 | c3fd9b2a-2900-ced7-e721-ff7940419a13 | 0 | update |
| 143209 | 2017-02-27T17:28:19 | johnsonmiguel | 9 | 74 | 785fc5b8-7be8-1a01-ddbe-c0581d8c5d5f | 0 | test |


And with [pandas.DataFrame.dtypes](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dtypes.html) we display the data types for the individual columns:

In [4]:
df.dtypes

timestamp      object
username       object
temperature     int64
heartrate       int64
build          object
latest          int64
note           object
dtype: object

## 3. Creating a constraints object

With `discover_df` a constraints object can be created from the DataFrame.

In [5]:
constraints = discover_df(df)

In [6]:
constraints

<tdda.constraints.base.DatasetConstraints at 0x7fe58e48dcd0>

In [7]:
constraints.fields

Fields([('timestamp',
         <tdda.constraints.base.FieldConstraints at 0x7fe58e48dfd0>),
        ('username',
         <tdda.constraints.base.FieldConstraints at 0x7fe58e4ab280>),
        ('temperature',
         <tdda.constraints.base.FieldConstraints at 0x7fe58e4ab5e0>),
        ('heartrate',
         <tdda.constraints.base.FieldConstraints at 0x7fe58e4ab940>),
        ('build', <tdda.constraints.base.FieldConstraints at 0x7fe58e4abca0>),
        ('latest', <tdda.constraints.base.FieldConstraints at 0x7fe58e4b0040>),
        ('note', <tdda.constraints.base.FieldConstraints at 0x7fe58e4b0370>)])

## 4. Writing the constraints into a file

In [8]:
with open('../../data/ignore-iot_constraints.tdda', 'w') as f:
    f.write(constraints.to_json())

If we take a closer look at the file, we can see that, for example, the `timestamp` column is expected to contain strings of exactly 19 characters, and the `temperature` column is expected to contain integers between 5 and 29.

In [9]:
cat ../../data/ignore-iot_constraints.tdda

{
    "creation_metadata": {
        "local_time": "2021-11-20 16:16:01",
        "utc_time": "2021-11-20 15:15:01",
        "creator": "TDDA 1.0.32",
        "host": "eve.local",
        "user": "veit",
        "n_records": 146397,
        "n_selected": 146397
    },
    "fields": {
        "timestamp": {
            "type": "string",
            "min_length": 19,
            "max_length": 19,
            "max_nulls": 0,
            "no_duplicates": true
        },
        "username": {
            "type": "string",
            "min_length": 3,
            "max_length": 21,
            "max_nulls": 0
        },
        "temperature": {
            "type": "int",
            "min": 5,
            "max": 29,
            "sign": "positive",
            "max_nulls": 0
        },
        "heartrate": {
            "type": "int",
            "min": 60,
            "max": 89,
            "sign": "positive",
            "max_nulls": 0
        },
        ...

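Because the constraints file is plain JSON, it can also be inspected programmatically. The following is a minimal sketch using only the standard library. The embedded string copies the `timestamp` and `temperature` entries of the generated file instead of reading the file itself, and `describe_field` is a hypothetical helper, not part of TDDA.

```python
import json

# Two field entries copied from the generated constraints file, embedded so
# that the example is self-contained. In practice you would json.load() the
# .tdda file instead.
constraints_json = """
{
    "fields": {
        "timestamp": {
            "type": "string",
            "min_length": 19,
            "max_length": 19,
            "max_nulls": 0,
            "no_duplicates": true
        },
        "temperature": {
            "type": "int",
            "min": 5,
            "max": 29,
            "sign": "positive",
            "max_nulls": 0
        }
    }
}
"""


def describe_field(name, field):
    """Hypothetical helper: summarise one field's constraints as text."""
    parts = [f"{key}={value}" for key, value in field.items()]
    return f"{name}: " + ", ".join(parts)


fields = json.loads(constraints_json)["fields"]
for name, field in fields.items():
    print(describe_field(name, field))
```

Since the file is ordinary JSON, it can be reviewed like any other text file before it is used for verification.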
## 5. Checking data frames

To check a DataFrame against these constraints, we first read a new CSV file with pandas and display ten sample records:

In [10]:
new_df = pd.read_csv('iot_example_with_nulls.csv')
new_df.sample(10)

| | timestamp | username | temperature | heartrate | build | latest | note |
|---|---|---|---|---|---|---|---|
| 34897 | 2017-01-15T10:33:45 | waltersann | 19.0 | 76 | 9a55a840-e586-4cc4-375f-00db11ad6157 | | interval |
| 46490 | 2017-01-20T01:59:35 | dunlaprobert | | 63 | | 0.0 | |
| 48329 | 2017-01-20T19:33:15 | heidi31 | 16.0 | 64 | e14014b4-b96b-82dd-5e9b-a4fea08839b4 | | interval |
| 23625 | 2017-01-10T22:15:30 | kurtcain | 28.0 | 73 | 66e31ec0-2e6c-9882-cbf5-8d572cd18bf1 | 1.0 | |
| 114909 | 2017-02-16T10:01:53 | frankbates | 22.0 | 75 | 9afa2b75-0f44-b530-4ab1-fb29beac6443 | | interval |
| 40464 | 2017-01-17T16:01:21 | rbaker | | 71 | c6a27614-1632-885b-1e3c-b1e0441b231d | 1.0 | test |
| 110461 | 2017-02-14T15:30:22 | carpenterashlee | 23.0 | 85 | c45944a9-1c69-8692-d6a2-c3462dd6b4d3 | 0.0 | |
| 79579 | 2017-02-02T07:49:53 | alexistucker | 8.0 | 61 | f787577b-1080-ac9d-e871-40db40c7225f | 0.0 | |
| 68692 | 2017-01-28T23:09:11 | hallmaria | 12.0 | 62 | f6b642b7-6fdf-d772-34de-f8e8da949ff1 | 0.0 | |
| 4142 | 2017-01-03T03:56:31 | veronicalamb | 18.0 | 76 | | 0.0 | update |
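Some cells in this sample are empty. As a quick first look, the missing values per column can be counted with `DataFrame.isna().sum()`. The sketch below uses a small hypothetical DataFrame named `sample` as a stand-in for `new_df`; its values are illustrative only.

```python
import numpy as np
import pandas as pd

# Small hypothetical stand-in for new_df; the values are illustrative only.
sample = pd.DataFrame({
    "temperature": [19.0, np.nan, 16.0, 28.0],
    "heartrate": [76, 63, 64, 73],
    "build": ["9a55a840", None, "e14014b4", "66e31ec0"],
    "latest": [np.nan, 0.0, np.nan, 1.0],
})

# Count missing values per column and keep only the columns that have any.
null_counts = sample.isna().sum()
print(null_counts[null_counts > 0])
```

Applied to `new_df`, the same two lines would show which columns contain missing values and how many.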


Several fields contain missing values, which pandas displays as `NaN`. To analyse this systematically, we apply [verify_df](https://tdda.readthedocs.io/en/v1.0.31/constraints.html#tdda.constraints.verify_df) to the new DataFrame. On the returned verification object, `passes` is the number of constraints that passed and `failures` is the number that failed.

In [11]:
v = verify_df(new_df, '../../data/ignore-iot_constraints.tdda')

In [12]:
v

<tdda.constraints.pd.constraints.PandasVerification at 0x7fe57a173f70>

In [13]:
v.passes

30

In [14]:
v.failures

3
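To make concrete what a pass or a failure means, the following pure-Python sketch checks one column against three kinds of constraint. It is an illustration only and not TDDA's implementation: `check_field` and the sample values are hypothetical, while the constraint values are those recorded for `temperature` in the constraints file. The sketch assumes that nulls are ignored for the `min` and `max` checks.

```python
def check_field(values, constraints):
    """Return {constraint_name: passed} for one column of values.

    Only max_nulls, min and max are handled; this is an illustration,
    not TDDA's implementation.
    """
    non_null = [v for v in values if v is not None]
    results = {}
    if "max_nulls" in constraints:
        n_nulls = len(values) - len(non_null)
        results["max_nulls"] = n_nulls <= constraints["max_nulls"]
    if "min" in constraints:
        results["min"] = all(v >= constraints["min"] for v in non_null)
    if "max" in constraints:
        results["max"] = all(v <= constraints["max"] for v in non_null)
    return results


# Hypothetical column values; None stands in for a missing reading.
temperature = [19, None, 16, 28, 22]
# Constraint values taken from the generated constraints file.
temperature_constraints = {"min": 5, "max": 29, "max_nulls": 0}

results = check_field(temperature, temperature_constraints)
passes = sum(results.values())       # number of constraints satisfied
failures = len(results) - passes     # number of constraints violated
print(results, passes, failures)
```

Each constraint counts once per field, regardless of how many records violate it, which is why a single missing value is enough to turn `max_nulls` into a failure here.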

We can also display which constraints passed and failed in which columns:

In [15]:
print(str(v))

FIELDS:

timestamp: 0 failures  5 passes  type ✓  min_length ✓  max_length ✓  max_nulls ✓  no_duplicates ✓

username: 0 failures  4 passes  type ✓  min_length ✓  max_length ✓  max_nulls ✓

temperature: 1 failure  4 passes  type ✓  min ✓  max ✓  sign ✓  max_nulls ✗

heartrate: 0 failures  5 passes  type ✓  min ✓  max ✓  sign ✓  max_nulls ✓

build: 1 failure  4 passes  type ✓  min_length ✓  max_length ✓  max_nulls ✗  no_duplicates ✓

latest: 1 failure  4 passes  type ✓  min ✓  max ✓  sign ✓  max_nulls ✗

note: 0 failures  4 passes  type ✓  min_length ✓  max_length ✓  allowed_values ✓

SUMMARY:

Constraints passing: 30
Constraints failing: 3


Alternatively, we can display these results in tabular form:

In [16]:
v.to_frame()

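| | field | failures | passes | type | min | min_length | max | max_length | sign | max_nulls | no_duplicates | allowed_values |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | timestamp | 0 | 5 | True | | True | | True | | True | True | |
| 1 | username | 0 | 4 | True | | True | | True | | True | | |
| 2 | temperature | 1 | 4 | True | True | | True | | True | False | | |
| 3 | heartrate | 0 | 5 | True | True | | True | | True | True | | |
| 4 | build | 1 | 4 | True | | True | | True | | False | True | |
| 5 | latest | 1 | 4 | True | True | | True | | True | False | | |
| 6 | note | 0 | 4 | True | | True | | True | | | | True |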
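Because the tabular result is an ordinary DataFrame, it can be filtered like any other. The sketch below narrows the table to the fields with at least one failed constraint. To stay runnable without TDDA or the original data, it rebuilds three columns of the table above by hand in a hypothetical DataFrame named `report`; with the real result you would filter the return value of `v.to_frame()` the same way.

```python
import pandas as pd

# Hand-copied subset of the verification table above (three of its columns),
# so that the example runs without TDDA or the original data.
report = pd.DataFrame({
    "field": ["timestamp", "username", "temperature", "heartrate",
              "build", "latest", "note"],
    "failures": [0, 0, 1, 0, 1, 1, 0],
    "passes": [5, 4, 4, 5, 4, 4, 4],
})

# Keep only the fields with at least one failed constraint.
failing = report[report["failures"] > 0]
print(failing["field"].tolist())
```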
