# YData Quality - Valued Missing Values Tutorial
Time-to-Value: 4 minutes

This notebook provides a tutorial on the ydata_quality package functionality for detecting valued missing values.

**Structure:**

1. Load dataset
2. Distort dataset
3. Instantiate the Data Quality engine
4. Run the quality checks
5. Assess the warnings
6. (Extra) Detailed overview

In [1]:
import statsmodels.api as sm
import numpy as np
from ydata_quality.valued_missing_values import VMVIdentifier

## Load the example dataset
We will use a dataset available from the statsmodels package.

In [2]:
df = sm.datasets.macrodata.load_pandas().data

## Inspect dataset
Inspect the data to find ways to corrupt it and test the package's functionality.

In [3]:
df.head(15)

Unnamed: 0,year,quarter,realgdp,realcons,realinv,realgovt,realdpi,cpi,m1,tbilrate,unemp,pop,infl,realint
0,1959.0,1.0,2710.349,1707.4,286.898,470.045,1886.9,28.98,139.7,2.82,5.8,177.146,0.0,0.0
1,1959.0,2.0,2778.801,1733.7,310.859,481.301,1919.7,29.15,141.7,3.08,5.1,177.83,2.34,0.74
2,1959.0,3.0,2775.488,1751.8,289.226,491.26,1916.4,29.35,140.5,3.82,5.3,178.657,2.74,1.09
3,1959.0,4.0,2785.204,1753.7,299.356,484.052,1931.3,29.37,140.0,4.33,5.6,179.386,0.27,4.06
4,1960.0,1.0,2847.699,1770.5,331.722,462.199,1955.5,29.54,139.6,3.5,5.2,180.007,2.31,1.19
5,1960.0,2.0,2834.39,1792.9,298.152,460.4,1966.1,29.55,140.2,2.68,5.2,180.671,0.14,2.55
6,1960.0,3.0,2839.022,1785.8,296.375,474.676,1967.8,29.75,140.9,2.36,5.6,181.528,2.7,-0.34
7,1960.0,4.0,2802.616,1788.2,259.764,476.434,1966.6,29.84,141.1,2.29,6.3,182.287,1.21,1.08
8,1961.0,1.0,2819.264,1787.7,266.405,475.854,1984.5,29.81,142.1,2.37,6.8,182.992,-0.4,2.77
9,1961.0,2.0,2872.005,1814.3,286.246,480.328,2014.4,29.92,142.9,2.29,7.0,183.691,1.47,0.81


## Distort the original dataset
Apply transformations to highlight the data quality functionalities.

In [4]:
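def corrupt_dataset(df, vmv_extensions):
    # One random column per VMV (the 'year' and 'quarter' time index columns are left out)
    random_columns = np.random.randint(2, df.shape[1], size=len(vmv_extensions))
    # Between 10 and 14 random row positions per predefined VMV
    random_indexes = [np.random.randint(df.shape[0], size=np.random.randint(10, 15)) for _ in vmv_extensions]
    for i, vmv in enumerate(vmv_extensions):
        col = df.columns[random_columns[i]]
        if isinstance(vmv, str):
            # Cast up front so that recent pandas versions accept strings in a float column
            df[col] = df[col].astype(object)
        df.iloc[random_indexes[i], random_columns[i]] = vmv
    # Creating flatlines (.loc slices are inclusive: 21 rows in 'cpi', 4 rows in 'realdpi')
    df.loc[5:25, 'cpi'] = df.loc[5, 'cpi']
    df.loc[50:53, 'realdpi'] = df.loc[50, 'realdpi']
    return df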

vmv_extensions = ['a_custom_VMV', 'another_VMV', 999999999, '!', '', 'UNKNOWN']
df = corrupt_dataset(df, vmv_extensions)

In [5]:
# Inspect changes (random VMVs added and flatline in cpi)
df.head(15)

Unnamed: 0,year,quarter,realgdp,realcons,realinv,realgovt,realdpi,cpi,m1,tbilrate,unemp,pop,infl,realint
0,1959.0,1.0,2710.349,1707.4,286.898,470.045,1886.9,28.98,139.7,2.82,5.8,177.146,0.0,0.0
1,1959.0,2.0,,1733.7,310.859,481.301,1919.7,29.15,141.7,3.08,5.1,177.83,2.34,0.74
2,1959.0,3.0,2775.488,1751.8,289.226,491.26,1916.4,29.35,140.5,3.82,UNKNOWN,178.657,2.74,1.09
3,1959.0,4.0,2785.204,1753.7,299.356,484.052,1931.3,29.37,140.0,4.33,5.6,179.386,0.27,4.06
4,1960.0,1.0,,1770.5,331.722,462.199,1955.5,29.54,139.6,3.5,5.2,180.007,2.31,1.19
5,1960.0,2.0,2834.39,1792.9,298.152,460.4,1966.1,29.55,140.2,2.68,5.2,180.671,0.14,2.55
6,1960.0,3.0,2839.022,1785.8,296.375,474.676,1967.8,29.55,140.9,2.36,5.6,181.528,2.7,-0.34
7,1960.0,4.0,2802.616,1788.2,259.764,476.434,1966.6,29.55,141.1,2.29,6.3,182.287,1.21,1.08
8,1961.0,1.0,2819.264,1787.7,266.405,475.854,1984.5,29.55,142.1,2.37,6.8,182.992,-0.4,2.77
9,1961.0,2.0,2872.005,1814.3,286.246,480.328,2014.4,29.55,142.9,2.29,999999999.0,183.691,1.47,0.81
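
To see why entries like these are called *valued* missing values, here is a minimal sketch on a toy Series. The Series and its values are made up for illustration and are not taken from the macrodata set. Standard pandas null checks only flag true nulls such as `NaN`, so placeholders like `'UNKNOWN'` or `999999999` pass through unnoticed.

```python
import numpy as np
import pandas as pd

# Toy column (hypothetical data): real readings, one true null, three placeholders
toy = pd.Series([5.8, np.nan, 'UNKNOWN', 999999999, ''])

# isna() flags only the true null; the placeholders count as regular values
null_mask = toy.isna()
print(null_mask.tolist())

# Mixing strings with numbers makes pandas store the column as object dtype
print(toy.dtype)
```

Only the second entry is reported as missing, which is the gap the Valued Missing Values checks are meant to close.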


## Create the engine
Each engine contains the checks and tests for each suite. To create a Valued Missing Values Identifier, you provide:
- df: target DataFrame, for which we will run the test suite
- vmv_extensions (optional): list of additional values to treat as valued missing values, extending the predefined set.

In [6]:
vmvi = VMVIdentifier(df=df, vmv_extensions=vmv_extensions)  # Note we are passing our VMV extensions here

### Full Evaluation
The easiest way to assess the data quality analysis is to run `.evaluate()`, which returns a list of warnings for each quality check.

In [7]:
results = vmvi.evaluate()

## Check the status
After running the data quality checks, you can check the warnings for each individual test. The warnings are sorted by priority and carry additional details that can provide better insights for Data Scientists.

In [8]:
vmvi.report()

[FLATLINES] Found 1 flatline events with a minimun length of 5 among the columns {'cpi'}. (Priority 2: usage allowed, limited human intelligibility)
[PREDEFINED VALUED MISSING VALUES] Found 69 vmvs in the dataset. (Priority 2: usage allowed, limited human intelligibility)


### Quality Warning

In [9]:
# Get a sample warning
sample_warning = list(vmvi.warnings)[1]

In [10]:
# Check the details
sample_warning.test, sample_warning.description, sample_warning.priority

('Predefined Valued Missing Values',
 'Found 69 vmvs in the dataset.',
 <Priority.P2: 2>)

In [11]:
# Retrieve the relevant data from the warning
sample_warning_data = sample_warning.data

In [12]:
sample_warning_data

Unnamed: 0,realint,realinv,cpi,realgdp,unemp
,0,0,0,13,0
!,0,10,0,0,0
unknown,0,0,0,0,11
another_vmv,0,0,10,0,0
a_custom_vmv,11,0,0,0,0
999999999,0,0,0,0,14


## Full Test Suite
In this section, you will find a detailed overview of the available tests in the valued missing values module of ydata_quality.

### Flatlines

We consider a flatline to be a sequence of consecutive rows, in index order, that hold the same value in a given column.

The return is a DataFrame mapping all the flatline events found in each column of the dataframe.

And by the way, did you notice in the report printout that our flatlines evaluation did not return one of the flatlines added in the dataset corruption step?
> df.loc[50:53, 'realdpi'] = df['realdpi'][50]

By default, flatlines only reports sequences with a minimum length of 5. The assignment above created a flatline of length 4, so it was not returned.
Running flatlines explicitly lets us pass non-default arguments. The argument `th` sets the minimum flatline length, which we can set to 4.
Also notice that our demo dataset holds quarterly data, so each year appears 4 times in sequence (once per quarter).
The argument `skip` lets us exclude the passed columns from evaluation.
Let's put both arguments to use to retrieve all relevant flatlines.


In [13]:
flatlines_out = vmvi.flatlines(th=4, skip=['year'])

In [14]:
flatlines_out['realdpi']  # Printing found flatlines just for the 'realdpi' column

starts,length,ends
50,4,53
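
The idea of a flatline and the effect of the minimum-length threshold can be sketched in plain pandas. This is an illustration only, not the ydata_quality implementation: the helper `find_flatlines` and the toy Series are hypothetical, and only the parameter name `th` is borrowed from the call above.

```python
import pandas as pd

def find_flatlines(series, th=5):
    """Return runs of at least `th` consecutive equal values, indexed by start label."""
    # A new run starts wherever a value differs from the previous one
    run_id = series.ne(series.shift()).cumsum()
    labels = series.index.to_series()
    runs = pd.DataFrame({
        'starts': labels.groupby(run_id).first(),
        'ends': labels.groupby(run_id).last(),
        'length': series.groupby(run_id).size(),
    })
    # Keep only the runs that reach the minimum length
    return runs[runs['length'] >= th].set_index('starts')[['length', 'ends']]

# Toy data (hypothetical): the value 2.0 repeats over index labels 1 to 4
toy = pd.Series([1.0, 2.0, 2.0, 2.0, 2.0, 3.0, 3.0])
print(find_flatlines(toy, th=4))  # one run: starts at 1, length 4, ends at 4
print(find_flatlines(toy, th=5))  # empty: no run reaches length 5
```

As with the `realdpi` column, a run of length 4 only shows up once the threshold is lowered from 5 to 4.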


### Predefined Valued Missing Values
Sometimes data is missing without being detected as such.
For example, some flags for missing data might not be parsed as NaN by pandas.
To detect these cases, we added a set of predefined Valued Missing Values and give you the means to extend it, as demonstrated above during instantiation of the VMVIdentifier.
The method can also be called explicitly:

In [15]:
vmvi.predefined_valued_missing_values()

Unnamed: 0,realint,realinv,cpi,realgdp,unemp
,0,0,0,13,0
!,0,10,0,0,0
unknown,0,0,0,0,11
another_vmv,0,0,10,0,0
a_custom_vmv,11,0,0,0,0
999999999,0,0,0,0,14
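
For intuition, a count table like the one above could be produced with plain pandas along the following lines. This is a rough sketch under assumptions, not the library's code: the helper `count_vmvs` and the toy DataFrame are hypothetical, and the case-insensitive matching is only inferred from the lowercase labels in the output above.

```python
import pandas as pd

def _normalize(value):
    # Lowercase strings; leave numbers and nulls untouched
    return value.lower() if isinstance(value, str) else value

def count_vmvs(df, vmvs):
    """Count how often each placeholder value occurs in each column."""
    targets = [_normalize(v) for v in vmvs]
    normalized = df.apply(lambda col: col.map(_normalize))
    counts = pd.DataFrame(
        {col: [int((normalized[col] == t).sum()) for t in targets] for col in df.columns},
        index=targets,
    )
    # Keep only the columns where at least one placeholder was found
    return counts.loc[:, counts.sum() > 0]

# Toy data (hypothetical): 'a' holds text flags, 'b' a numeric sentinel, 'c' is clean
toy = pd.DataFrame({
    'a': [1.0, 'UNKNOWN', 'unknown', 4.0],
    'b': [999999999.0, 2.0, 3.0, 999999999.0],
    'c': [1.0, 2.0, 3.0, 4.0],
})
vmv_counts = count_vmvs(toy, ['UNKNOWN', 999999999])
print(vmv_counts)
```

Here column `c` is dropped from the result because it contains none of the listed values, mirroring how the table above only lists the affected columns.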
