# YData Quality - Valued Missing Values Tutorial
Time-to-Value: 4 minutes

This notebook is a tutorial on how the ydata_quality package detects valued missing values.

**Structure:**

1. Load dataset
2. Distort dataset
3. Instantiate the Data Quality engine
4. Run the quality checks
5. Assess the warnings
6. (Extra) Detailed overview

In [1]:
import pandas as pd
import numpy as np
from ydata_quality.valued_missing_values import VMVIdentifier

## Load the example dataset
We will use a transformed version of the "macrodata" dataset available from the statsmodels package.

In [2]:
df = pd.read_csv('../datasets/transformed/macrodata.csv')

## Create the engine
Each engine contains the checks and tests for each suite. To create a Valued Missing Values Identifier, you provide:
- df: target DataFrame, for which we will run the test suite
- vmv_extensions (optional): list of additional values to treat as valued missing values, extending the predefined set.

In [3]:
vmv_extensions = ['a_custom_VMV', 'another_VMV', 999999999, '!', '', 'UNKNOWN']
vmvi = VMVIdentifier(df=df, vmv_extensions=vmv_extensions)  # Note we are passing our VMV extensions

### Full Evaluation
The easiest way to assess the data quality analysis is to run `.evaluate()` which returns a list of warnings for each quality check. 

In [4]:
results = vmvi.evaluate()

## Check the status
After running the data quality checks, you can check the warnings for each individual test. The warnings are sorted by priority and carry additional details that can give Data Scientists better insight.

In [5]:
vmvi.report()

	[PREDEFINED VALUED MISSING VALUES] Found 55 vmvs in the dataset. (Priority 2: usage allowed, limited human intelligibility)
	[FLATLINES] Found 1 flatline events with a minimun length of 5 among the columns {'cpi'}. (Priority 2: usage allowed, limited human intelligibility)


### Quality Warning

In [6]:
# Get a sample warning
sample_warning = vmvi.get_warnings()[1]

In [7]:
# Check the details
sample_warning.test, sample_warning.description, sample_warning.priority

('Predefined Valued Missing Values',
 'Found 67 vmvs in the dataset.',
 <Priority.P2: 2>)
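
Picking a warning by position, as above, depends on the order in which the checks happen to be reported. A minimal sketch of selecting by test name instead, relying only on the `test`, `description` and `priority` attributes shown above. The `SimpleNamespace` objects are stand-ins so that the snippet runs without the library (their `priority` is a plain integer, not the real `Priority` enum), and `select_warnings` is a hypothetical helper, not part of ydata_quality:

```python
from types import SimpleNamespace


def select_warnings(warnings, test_name):
    """Return the warnings whose `test` attribute equals `test_name`."""
    return [w for w in warnings if w.test == test_name]


# Stand-ins that mimic the attributes printed above (not real library objects).
fake_warnings = [
    SimpleNamespace(test='Flatlines',
                    description='illustrative flatlines description',
                    priority=2),
    SimpleNamespace(test='Predefined Valued Missing Values',
                    description='Found 67 vmvs in the dataset.',
                    priority=2),
]

selected = select_warnings(fake_warnings, 'Predefined Valued Missing Values')
print([w.description for w in selected])
```

With the real engine, the first argument would presumably be the list returned by `vmvi.get_warnings()`.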

In [8]:
# Retrieve the relevant data from the warning
sample_warning_data = sample_warning.data

In [9]:
sample_warning_data

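|  | unemp | pop | realinv | cpi | m1 | infl |
|---|---|---|---|---|---|---|
| (empty string) | 10 | 0 | 0 | 0 | 0 | 0 |
| a_custom_vmv | 0 | 0 | 0 | 14 | 0 | 0 |
| ! | 0 | 0 | 0 | 0 | 9 | 0 |
| another_vmv | 0 | 0 | 0 | 0 | 0 | 10 |
| unknown | 0 | 13 | 0 | 0 | 0 | 0 |
| 999999999 | 0 | 0 | 11 | 0 | 0 | 0 |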


## Full Test Suite
In this section, you will find a detailed overview of the available tests in the valued missing values module of ydata_quality.

### Flatlines

We consider flatlines as sequences (order according to index matters) of the same value in a given column.

The method returns a DataFrame mapping all the flatline events found in each column of the DataFrame.

Did you notice in the report printout that the flatlines evaluation did not return one of the flatlines added in the dataset corruption step?
> df.loc[50:53, 'realdpi'] = df['realdpi'][50]

By default, flatlines is run with a minimum flatline length of 5. The line above added a flatline of length 4, so it was not returned.
By running flatlines explicitly we can pass non-default arguments. The argument "th" sets the minimum flatline length, which we can set to 4.
Also notice that our demo dataset holds quarterly data, so each year appears 4 times in sequence (once per quarter).
The argument "skip" lets us exclude the passed columns from the evaluation.
Let's put both arguments to use to retrieve all relevant flatlines.


In [10]:
flatlines_out = vmvi.flatlines(th=4, skip=['year'])

In [11]:
flatlines_out['realdpi']  # Printing found flatlines just for the 'realdpi' column

| starts | length | ends |
|---|---|---|
| 50 | 4 | 53 |
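
The idea behind this check can be sketched without the library: group consecutive equal values into runs and keep the runs that reach the threshold. The snippet below is an illustration under that assumption, not the ydata_quality implementation. `find_flatlines` is a hypothetical helper, and it reports 0-based positions rather than index labels:

```python
from itertools import groupby


def find_flatlines(values, th=5):
    """Return (start, length, end) for each run of equal consecutive values
    whose length is at least `th`. Positions are 0-based and `end` is inclusive."""
    events = []
    position = 0
    for _, run in groupby(values):
        length = len(list(run))
        if length >= th:
            events.append((position, length, position + length - 1))
        position += length
    return events


series = [3.0, 3.1, 7.5, 7.5, 7.5, 7.5, 2.2, 2.2]
print(find_flatlines(series, th=5))  # [] since no run reaches length 5
print(find_flatlines(series, th=4))  # [(2, 4, 5)], the run of 7.5
```

Lowering the threshold from 5 to 4 is what makes the shorter run show up, which mirrors the effect of passing `th=4` above.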


### Predefined Valued Missing Values
Sometimes data is missing without being detected as such.
For example, some flags for missing data are not parsed as NaN by Pandas.
To detect these cases we added a set of predefined Valued Missing Values and give you the means to extend it, as demonstrated above during instantiation of the VMVIdentifier.
The method can also be called explicitly:

In [12]:
vmvi.predefined_valued_missing_values()

Unnamed: 0,tbilrate,pop,m1,cpi,realcons,Unnamed: 6,Unnamed: 7
,10,0,0,0,0,8,0.0
!,0,10,0,0,14,0,0.0
!,0,0,0,0,9,0,
another_vmv,0,0,0,0,0,10,
a_custom_vmv,14,0,13,0,0,0,0.0
999999999,0,0,11,0,0,0,
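
To see why such values slip through, here is a small pandas-only sketch. The CSV content and the `demo_vmvs` list are invented for illustration, and the counting logic is an assumption about how such a check could work, not the library's code. Columns are read as strings so that a numeric flag like 999999999 can be matched literally:

```python
import io

import pandas as pd

# Invented demo data: "NA" is a marker pandas knows, the other flags are not.
csv_text = """unemp,pop,note
5.8,UNKNOWN,ok
NA,177.1,!
999999999,180.0,ok
"""

# dtype=str keeps every cell as text; default NA markers are still parsed as NaN.
df_demo = pd.read_csv(io.StringIO(csv_text), dtype=str)

# Only the "NA" cell is recognised as missing.
print(df_demo.isna().sum())

# Hypothetical list of valued missing values for this demo.
demo_vmvs = ['UNKNOWN', '!', '999999999']

# One row per VMV, one column per feature, each cell a count of occurrences.
vmv_counts = pd.DataFrame(
    {vmv: df_demo.isin([vmv]).sum() for vmv in demo_vmvs}
).T
print(vmv_counts)
```

The resulting frame has the same shape as the output above: valued missing values as rows, columns of the dataset as columns.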
