# YData Quality - Erroneous Data Tutorial
Time-to-Value: 4 minutes

This notebook is a tutorial on how the ydata_quality package detects erroneous data values.

**Structure:**

1. Load dataset
2. Distort dataset
3. Instantiate the Data Quality engine
4. Run the quality checks
5. Assess the warnings
6. (Extra) Detailed overview

In [1]:
import pandas as pd
import numpy as np
from ydata_quality.erroneous_data import ErroneousDataIdentifier

## Load the example dataset
We will use a transformed version of the "macrodata" dataset available from the statsmodels package.

In [2]:
df = pd.read_csv('../datasets/transformed/macrodata.csv')
df.head(5)

Unnamed: 0,year,quarter,realgdp,realcons,realinv,realgovt,realdpi,cpi,m1,tbilrate,unemp,pop,infl,realint
0,1959.0,1.0,2710.349,1707.4,286.898,470.045,1886.9,28.98,139.7,2.82,5.8,177.146,0.0,0.0
1,1959.0,2.0,2778.801,1733.7,310.859,481.301,1919.7,29.15,141.7,3.08,5.1,177.83,2.34,0.74
2,1959.0,3.0,2775.488,1751.8,289.226,491.26,1916.4,29.35,140.5,3.82,5.3,178.657,2.74,1.09
3,1959.0,4.0,2785.204,1753.7,299.356,484.052,1931.3,29.37,140.0,4.33,5.6,179.386,0.27,4.06
4,1960.0,1.0,2847.699,1770.5,331.722,462.199,1955.5,29.54,139.6,3.5,5.2,!,2.31,1.19
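
The `!` in the `pop` column of the last row above is one of the erroneous values present in this transformed dataset. As a rough illustration, independent of ydata_quality, the sketch below shows one way such entries could be spotted with plain pandas: coerce each non-numeric column to numbers and list whatever fails to parse. The miniature frame and the helper name `non_numeric_entries` are hypothetical and only stand in for the real data.

```python
import pandas as pd

# Hypothetical miniature of the dataset: `pop` holds numeric strings plus one '!' flag.
sample = pd.DataFrame({
    "unemp": [5.8, 5.1, 5.3],
    "pop": ["177.146", "177.83", "!"],
})


def non_numeric_entries(frame: pd.DataFrame) -> dict:
    """Return, per non-numeric column, the values that fail numeric parsing."""
    found = {}
    for column in frame.columns:
        if pd.api.types.is_numeric_dtype(frame[column]):
            continue  # Already numeric, nothing to inspect.
        parsed = pd.to_numeric(frame[column], errors="coerce")
        # Entries that were present originally but became NaN after coercion.
        bad = frame.loc[parsed.isna() & frame[column].notna(), column]
        if not bad.empty:
            found[column] = bad.tolist()
    return found


suspects = non_numeric_entries(sample)
print(suspects)
```

A check like this only catches flags that break numeric parsing. Numeric sentinels such as 999999999 parse cleanly, which is why a list of known erroneous values is still useful.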


## Create the engine
Each engine bundles the checks and tests of one suite. To create an Erroneous Data Identifier, you provide:
- df: target DataFrame, for which we will run the test suite
- ed_extensions (optional): list of additional values to flag as erroneous data (ED), extending the predefined set.

In [3]:
edv_extensions = ['a_custom_edv', 999999999, '!', '', 'UNKNOWN']
edi = ErroneousDataIdentifier(df=df, ed_extensions=edv_extensions)  # Note we are passing our ED extensions here

### Full Evaluation
The easiest way to run the data quality analysis is to call `.evaluate()`, which returns a list of warnings for each quality check.

In [4]:
results = edi.evaluate()

## Check the status
After running the data quality checks, you can check the warnings for each individual test. The warnings are sorted by priority and carry additional details that can provide better insights for Data Scientists.

In [5]:
edi.report()

	[FLATLINES] Found 1 flatline events with a minimun length of 5 among the columns {'cpi'}. (Priority 2: usage allowed, limited human intelligibility)
	[PREDEFINED ERRONEOUS DATA] Found 30 ED values in the dataset. (Priority 2: usage allowed, limited human intelligibility)


### Quality Warning

In [6]:
# Get a sample warning
sample_warning = edi.get_warnings()[1]

In [7]:
# Check the details
sample_warning.test, sample_warning.description, sample_warning.priority

('Predefined Erroneous Data',
 'Found 30 ED values in the dataset.',
 <Priority.P2: 2>)
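
Picking a warning by position, as in `edi.get_warnings()[1]`, depends on the order in which warnings happen to be returned. A possible alternative is to select by the `test` attribute. The sketch below uses hypothetical stand-in classes (`StandInWarning`, `StandInPriority`) that expose only the three attributes shown above, so that it runs without the library; the test name `'Flatlines'` is likewise an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class StandInPriority(Enum):
    """Hypothetical stand-in for the library's priority enum."""
    P1 = 1
    P2 = 2


@dataclass
class StandInWarning:
    """Hypothetical stand-in exposing the attributes shown above."""
    test: str
    description: str
    priority: StandInPriority


def find_warnings(warnings, test_name):
    """Return the warnings whose `test` attribute equals `test_name`."""
    return [warning for warning in warnings if warning.test == test_name]


# Stand-in objects; the 'Flatlines' entry and its description are placeholders.
stand_ins = [
    StandInWarning("Flatlines", "placeholder description", StandInPriority.P2),
    StandInWarning(
        "Predefined Erroneous Data",
        "Found 30 ED values in the dataset.",
        StandInPriority.P2,
    ),
]

matches = find_warnings(stand_ins, "Predefined Erroneous Data")
print(matches[0].description)
```

Assuming the real warning objects behave like the stand-ins, the same helper could be called as `find_warnings(edi.get_warnings(), 'Predefined Erroneous Data')`.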

In [8]:
# Retrieve the relevant data from the warning
sample_warning_data = sample_warning.data

In [9]:
sample_warning_data

Unnamed: 0,cpi,pop,m1
unknown,8,0,0
!,0,10,0
999999999,0,0,12


## Full Test Suite
In this section, you will find a detailed overview of the available tests in the erroneous data module of ydata_quality.

### Flatlines

A flatline is a run of consecutive identical values in a given column, where "consecutive" follows the order of the index.

The return is a DataFrame mapping all the flatline events found in each column of the dataframe.

And by the way, did you notice in the report printout that our flatlines evaluation did not return one of the flatlines added in the dataset corruption step?
> df.loc[50:53, 'realdpi'] = df['realdpi'][50]

By default, flatlines are detected only for sequences with a minimum length of 5. The assignment above added a flatline of length 4, so it was not returned.
Running flatlines explicitly lets us pass non-default arguments. The argument "th" sets the minimum flatline length, which we can now set to 4.

Also notice that our demo dataset has quarterly data, so each year appears 4 times in sequence (once per quarter).
The argument skip lets us exclude the passed columns from evaluation.
Let's put both arguments to use to retrieve all relevant flatlines.


In [10]:
flatlines_out = edi.flatlines(th=4, skip=['year'])

In [11]:
flatlines_out['realdpi']  # Printing found flatlines just for the 'realdpi' column

starts,length,ends
50,4,53
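
To make the role of the minimum length concrete, here is a minimal pandas-only sketch of flatline detection on a single column. It is not the library's implementation; the helper name `find_flatlines` and the toy series are hypothetical, and only the `th` parameter name is borrowed from the call above.

```python
import pandas as pd


def find_flatlines(series: pd.Series, th: int = 5) -> pd.DataFrame:
    """List runs of at least `th` consecutive equal values in `series`."""
    # A new run starts wherever a value differs from the previous one.
    # NaN never equals NaN, so each missing value forms its own run of length 1.
    run_id = series.ne(series.shift()).cumsum()
    rows = []
    for _, run in series.groupby(run_id):
        if len(run) >= th:
            rows.append(
                {"starts": run.index[0], "length": len(run), "ends": run.index[-1]}
            )
    return pd.DataFrame(rows, columns=["starts", "length", "ends"]).set_index("starts")


# Toy column with one run of four identical values (index 1 to 4).
toy = pd.Series([1.0, 2.0, 2.0, 2.0, 2.0, 3.0, 3.0])

default_hits = find_flatlines(toy)        # th=5: the run of 4 is too short
relaxed_hits = find_flatlines(toy, th=4)  # th=4: the run is reported
print(relaxed_hits)
```

As with the `realdpi` example, the run of length 4 is only reported once the threshold is lowered from 5 to 4.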


### Predefined Erroneous Data Values
Sometimes data is erroneous without being easily detectable as such.
For example, some flags for missing data might not be parsed as NaN by pandas.
To detect these cases we added a set of predefined erroneous data values and give you the means to extend it, as demonstrated above during instantiation of the ErroneousDataIdentifier.

In [12]:
edi.predefined_erroneous_data()

Unnamed: 0,cpi,pop,m1
unknown,8,0,0
!,0,10,0
999999999,0,0,12
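
The shape of this output (flag values as rows, affected columns as columns, counts as cells) can be approximated with plain pandas. The sketch below is an illustration under simplifying assumptions, not the library's logic: it uses exact matching via `Series.isin`, so it does not reproduce any case normalization the library may apply, and the helper name `count_flagged_values` and the miniature frame are hypothetical.

```python
import pandas as pd


def count_flagged_values(frame: pd.DataFrame, flags) -> pd.DataFrame:
    """Count, per column, how often each flag value appears in `frame`."""
    counts = pd.DataFrame(
        {
            column: [int(frame[column].isin([flag]).sum()) for flag in flags]
            for column in frame.columns
        },
        index=[str(flag) for flag in flags],
    )
    # Keep only the flags that were found at least once.
    return counts.loc[counts.sum(axis=1) > 0]


# Hypothetical miniature frame with three kinds of flags.
sample = pd.DataFrame({
    "cpi": ["28.98", "unknown", "unknown"],
    "pop": ["177.146", "!", "178.657"],
    "m1": [139.7, 999999999, 999999999],
})
flags = ["unknown", "!", 999999999, "a_custom_edv"]

result = count_flagged_values(sample, flags)
print(result)
```

Once such flags are known, one option is to pass them to `pandas.read_csv` through its `na_values` parameter so they are parsed as NaN at load time.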
