## Addressing dirty data with ticdat

Dirty data is an unloved and often overlooked challenge when building analytical models. A typical assumption is that the input data to a model will somehow magically be clean. In reality, there are many reasons why dirty data might be passed as input to your engine. The data might be munged together from different systems, the requirements of your data model might be poorly understood, or a user might simply be pushing your model to its limits via what-if analysis. Regardless of the cause, a professional engine will respond gracefully when passed input data that violates basic integrity checks.

`ticdat` allows a data scientist to define data integrity checks for four categories of problems (in addition to checking for the correct table and field names).
 1. Duplicate rows (i.e. duplicate primary key entries in the same table).
 1. Data type failures. This checks each column for correct data type, legal ranges for numeric data, acceptable flagging strings, nulls present only for columns that allow null, etc.
 1. Foreign key failures, which check that each record of a child table can cross-reference into the appropriate parent table.
 1. Data predicate failures. This checks each row for conditions more complex than the data type failure checks. For example, a maximum column cannot be smaller than the corresponding minimum column.
 
For a `ticdat` app deployed on Enframe, a subsection of the input tables is dedicated to diagnosing data integrity problems. This subsection is populated whenever an app is solved. There is also an integrity "Action" that can be launched to look for data integrity problems independently of the solve process.

For a data scientist working offline, `ticdat` provides bulk-query routines that can be used from within a notebook. We briefly tour these routines below. Please consult the docstrings for more information regarding their utility.

In [1]:
import ticdat
from diet import input_schema

First, we quickly check that the csv files in `diet_sample_data` represent clean data. The `ticdat` bulk query routines all return "falsey" results on clean data sets. 

In [2]:
dat = input_schema.csv.create_pan_dat("diet_sample_data")
any(_ for _ in [input_schema.find_duplicates(dat),
                input_schema.find_data_type_failures(dat),
                input_schema.find_foreign_key_failures(dat),
                input_schema.find_data_row_failures(dat)])

False
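
When the data is not clean, it can help to roll the four results into one report rather than a single boolean. The helper below is a minimal sketch and not part of `ticdat`; `summarize_failures` is a hypothetical name, and it assumes only that each routine returns a mapping whose values support `len` (as pandas DataFrames do). Plain lists stand in for DataFrames so the sketch runs on its own.

```python
def summarize_failures(results):
    """Count failing rows per check.

    `results` maps a label of your choosing to the mapping returned by one
    bulk-query routine. Only `len` is called on the values, so DataFrames
    and plain lists both work.
    """
    summary = {}
    for check_name, failures in results.items():
        for key, rows in failures.items():
            summary[(check_name, str(key))] = len(rows)
    return summary


# Stand-in data shaped like the routines' results (lists instead of DataFrames).
example = {
    "duplicates": {"nutrition_quantities": [("milk", "fat"), ("milk", "fat")]},
    "foreign keys": {},
}
report = summarize_failures(example)
print(report)
```

In a notebook, the same helper could presumably be fed `{"duplicates": input_schema.find_duplicates(dat), ...}`; an empty report would then correspond to the "falsey" result above.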

Next, we examine the `diet_dirty_sample_data` data set, which has been deliberately seeded with dirty data.

In [3]:
dat = input_schema.csv.create_pan_dat("diet_dirty_sample_data")
input_schema.find_duplicates(dat).keys()

dict_keys(['nutrition_quantities'])

`ticdat` is telling us that there are duplicated rows in the Nutrition Quantities table. 

Here are the duplicated rows.

In [4]:
input_schema.find_duplicates(dat, keep=False)["nutrition_quantities"]

Unnamed: 0,Food,Category,Quantity
16,milk,fat,2.5
36,milk,fat,1.0


Specifically, the amount of fat in milk is defined twice, with conflicting values. This can be easily confirmed by manually inspecting the "nutrition_quantities.csv" file in the "diet_dirty_sample_data" directory. In a real-world data set, manual inspection would be impractical and such a duplication would be easily overlooked.
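
To make the idea concrete, here is a plain-Python sketch of what a duplicate check amounts to. It is an illustration, not `ticdat`'s implementation; the function name `find_duplicate_keys` is hypothetical, and the primary key is assumed to be the (Food, Category) pair, as the output above suggests. The rows are a toy subset of the table.

```python
from collections import defaultdict

# Toy subset of nutrition_quantities: (Food, Category, Quantity).
rows = [
    ("milk", "fat", 2.5),
    ("pizza", "fat", 12.0),
    ("milk", "fat", 1.0),
]


def find_duplicate_keys(rows, key_len=2):
    """Group rows by their first `key_len` fields; keep keys seen more than once."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[:key_len]].append(row)
    return {key: group for key, group in grouped.items() if len(group) > 1}


duplicates = find_duplicate_keys(rows)
print(duplicates)
```

Returning every row that shares a key, rather than only the later occurrences, mirrors the `keep=False` call above: both conflicting quantities are shown, so a person can decide which one is right.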

In [5]:
input_schema.find_data_type_failures(dat).keys()

dict_keys([TableField(table='nutrition_quantities', field='Quantity')])

In [6]:
input_schema.find_data_type_failures(dat)['nutrition_quantities', 'Quantity']

Unnamed: 0,Food,Category,Quantity
22,chicken,fat,
30,macaroni,calories,


`ticdat` is telling us that two rows have bad values in the Quantity field of the Nutrition Quantities table. In both cases, the problem is `NaN` (or equivalently `None`, i.e. null) where a number is expected. As before, these two errant rows can easily be double-checked by manually examining "nutrition_quantities.csv".
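
The null check described here can be sketched in plain Python as follows. This is an illustration under stated assumptions, not `ticdat`'s own logic: `is_bad_quantity` is a hypothetical name, the Quantity field is assumed to be a required number, and any range limits the real schema might impose are ignored. One missing value is written as `None` to show that it is treated like `NaN`.

```python
import math

# Toy subset of nutrition_quantities: (Food, Category, Quantity).
rows = [
    ("chicken", "fat", float("nan")),
    ("pizza", "sodium", 820.0),
    ("macaroni", "calories", None),
]


def is_bad_quantity(value):
    """True when the value is missing (None or NaN) or is not a number at all."""
    if value is None or isinstance(value, bool):
        return True
    if not isinstance(value, (int, float)):
        return True
    return math.isnan(value)


bad_rows = [row for row in rows if is_bad_quantity(row[2])]
print([(food, category) for food, category, _ in bad_rows])
```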

In [7]:
input_schema.find_foreign_key_failures(dat, verbosity="Low").keys()

dict_keys([('nutrition_quantities', 'foods', ('Food', 'Name'))])

In [8]:
_ = input_schema.find_foreign_key_failures(dat, verbosity="Low")
_['nutrition_quantities', 'foods', ('Food', 'Name')]

Unnamed: 0,Food,Category,Quantity
12,pizza,sodium,820.0
13,pizza,protein,15.0
14,pizza,calories,320.0
25,pizza,fat,12.0


Here, `ticdat` is telling us that there are 4 records in the Nutrition Quantities table that fail to cross-reference with the Foods table. In all 4 cases, it is specifically the "pizza" string in the Food field that fails to find a match in the Name field of the Foods table. If you manually examine "foods.csv", you can see this problem arose because the Foods table was altered to have a "pizza pie" entry instead of a "pizza" entry.
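
A foreign key check boils down to a membership test of child values against parent keys. The sketch below shows that test in plain Python; it is not how `ticdat` implements it, `find_orphans` is a hypothetical name, and the parent and child rows are toy stand-ins rather than the full sample data.

```python
# Toy stand-ins: parent key values from foods, child rows from nutrition_quantities.
food_names = {"milk", "chicken", "macaroni", "pizza pie"}
rows = [
    ("pizza", "sodium", 820.0),
    ("milk", "fat", 2.5),
    ("pizza", "protein", 15.0),
]


def find_orphans(child_rows, parent_keys, field_index=0):
    """Return child rows whose referencing field has no match among the parent keys."""
    return [row for row in child_rows if row[field_index] not in parent_keys]


orphans = find_orphans(rows, food_names)
print(orphans)
```

Note that the match is exact: "pizza" and "pizza pie" are different strings, so every child row that still says "pizza" is reported.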

In [9]:
input_schema.find_data_row_failures(dat).keys()

dict_keys([TablePredicateName(table='categories', predicate_name='Min Max Check')])

In [10]:
input_schema.find_data_row_failures(dat)['categories', 'Min Max Check']

Unnamed: 0,Name,Min Nutrition,Max Nutrition
2,fat,70,65.0


Here, `ticdat` is telling us that the "Min Max Check" (i.e. the check that `row["Max Nutrition"] >= row["Min Nutrition"]`) failed for the "fat" record of the Categories table. This is easily verified by manual inspection of "categories.csv". 
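
A row predicate is simply a function that takes one row and returns a truthy value when the row is acceptable. The plain-Python sketch below applies the condition quoted above to two rows. It does not use `ticdat`; the "fat" row comes from the output above, while the "calories" row carries illustrative values added only to show a passing case.

```python
categories = [
    {"Name": "fat", "Min Nutrition": 70, "Max Nutrition": 65.0},
    # Illustrative values, included only as a row that passes the check.
    {"Name": "calories", "Min Nutrition": 1800, "Max Nutrition": 2200},
]


def min_max_check(row):
    """The predicate: a row passes when its maximum is at least its minimum."""
    return row["Max Nutrition"] >= row["Min Nutrition"]


failures = [row["Name"] for row in categories if not min_max_check(row)]
print(failures)
```

Because the predicate compares two fields of the same row, it cannot be expressed as a per-column data type rule, which is why it falls into its own category of check.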