# YData Quality - Missings Tutorial
Time-to-Value: 4 minutes

This notebook is a tutorial on the Missing Values functionality of the ydata_quality package.

**Structure:**

1. Load dataset
2. Distort dataset
3. Instantiate the Data Quality engine
4. Run the quality checks
5. Assess the warnings
6. (Extra) Detailed overview

In [1]:
# Import the dataset source and the missings engine
import statsmodels.api as sm
from ydata_quality.missings import MissingsProfiler

## Load the example dataset
We will use a dataset available from the statsmodels package.

In [2]:
df = sm.datasets.get_rdataset('baseball', 'plyr').data

## Create the engine
Each engine contains the checks and tests for each suite. To create a MissingsProfiler, you provide:
- df: target DataFrame, for which we will run the test suite
- random_state (optional): seed used to make the results reproducible

In [3]:
mp = MissingsProfiler(df=df, random_state=42)

### Full Evaluation
The easiest way to run the data quality analysis is to call `.evaluate()`, which returns a list of warnings for each quality check.

In [4]:
results = mp.evaluate()

## Check the status
After running the data quality checks, you can check the warnings for each individual test.

In [5]:
mp.report()

	[MISSINGNESS PREDICTION] Found 9 features with prediction performance of missingness above threshold (0.8). (Priority 2: usage allowed, limited human intelligibility)
	[HIGH MISSINGS] Found 4 columns with more than 20.0% of missing values. (Priority 3: minor impact, aesthetic)
	[HIGH MISSING CORRELATIONS] Found 9 feature pairs with correlation of missing values higher than defined threshold (0.5). (Priority 3: minor impact, aesthetic)


### Quality Warning

In [6]:
# Get a sample warning
sample_warning = mp.get_warnings()[0]

In [7]:
# Check the details
sample_warning.test, sample_warning.description, sample_warning.priority

('Missingness Prediction',
 'Found 9 features with prediction performance of missingness above threshold (0.8).',
 <Priority.P2: 2>)

In [8]:
# Retrieve the relevant data from the warning
sample_warning_data = sample_warning.data

## Full Test Suite
In this section, you will find a detailed overview of the available tests in the missings module of ydata_quality.

### Null Count

Counts the number of nulls/missings in a DataFrame. It can be calculated for:
- Specific column (entity defined) or all columns (entity=None)
- Count of nulls (normalize=False) or ratio of rows (normalize=True)

In [9]:
mp.null_count()

lg        65
rbi       12
sb       250
cs      4525
so      1305
ibb     7528
hbp      377
sh       960
sf      7390
gidp    5272
dtype: int64

In [10]:
mp.nulls_higher_than(th=0.1)

cs      0.208535
ibb     0.346928
sf      0.340569
gidp    0.242961
dtype: float64
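
For comparison, the same two quantities can be approximated in plain pandas. This is a minimal sketch on a hypothetical toy DataFrame, not the ydata_quality implementation; dropping columns without missing values and using a strict `>` comparison against the threshold are assumptions based on the outputs shown above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the baseball dataset)
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [np.nan, 2.0, 3.0, 4.0],
    "c": [1.0, 2.0, 3.0, 4.0],
})

# Absolute count of missing values per column
null_counts = toy.isna().sum()
# Ratio of rows with a missing value per column
null_ratios = toy.isna().mean()

# Assumption: only columns with at least one missing value are reported
null_counts = null_counts[null_counts > 0]

# Columns whose missing ratio exceeds a threshold (strict comparison assumed)
th = 0.3
high_missing = null_ratios[null_ratios > th]

print(null_counts)
print(high_missing)
```

Here `isna().sum()` plays the role of a non-normalized count and `isna().mean()` the role of a normalized one.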

### Correlation of Missings
Calculates the correlation between missing feature values. High correlation between missing values signals that data absence may not be completely at random. It is provided as:
- Missing Correlations: full matrix of correlations between missing feature values;
- High Missing Correlations: missing correlations filtered by a given threshold.

In [11]:
mp.missing_correlations()

| | lg | rbi | sb | cs | so | ibb | hbp | sh | sf | gidp |
|---|---|---|---|---|---|---|---|---|---|---|
| lg | 1.0 | -0.001289 | -0.005918 | -0.028136 | -0.013866 | 0.075205 | 0.412222 | 0.254769 | 0.076273 | 0.096756 |
| rbi | -0.001289 | 1.0 | -0.00254 | 0.045827 | 0.09299 | 0.032274 | 0.176903 | 0.109332 | 0.032732 | 0.041522 |
| sb | -0.005918 | -0.00254 | 1.0 | 0.210326 | -0.02731 | 0.148125 | 0.811914 | 0.501793 | 0.150227 | 0.190571 |
| cs | -0.028136 | 0.045827 | 0.210326 | 1.0 | 0.49138 | 0.704262 | 0.20175 | 0.383291 | 0.714259 | 0.626763 |
| so | -0.013866 | 0.09299 | -0.02731 | 0.49138 | 1.0 | 0.345846 | 0.009386 | 0.072834 | 0.350768 | 0.446525 |
| ibb | 0.075205 | 0.032274 | 0.148125 | 0.704262 | 0.345846 | 1.0 | 0.181698 | 0.29519 | 0.986003 | 0.776136 |
| hbp | 0.412222 | 0.176903 | 0.811914 | 0.20175 | 0.009386 | 0.181698 | 1.0 | 0.614607 | 0.184284 | 0.233897 |
| sh | 0.254769 | 0.109332 | 0.501793 | 0.383291 | 0.072834 | 0.29519 | 0.614607 | 1.0 | 0.299381 | 0.379781 |
| sf | 0.076273 | 0.032732 | 0.150227 | 0.714259 | 0.350768 | 0.986003 | 0.184284 | 0.299381 | 1.0 | 0.787165 |
| gidp | 0.096756 | 0.041522 | 0.190571 | 0.626763 | 0.446525 | 0.776136 | 0.233897 | 0.379781 | 0.787165 | 1.0 |


In [12]:
mp.high_missing_correlations(th=0.8)

features
ibb_sf    0.986003
hbp_sb    0.811914
Name: missings_corr, dtype: float64
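
To make the idea concrete, the correlation of missings can be sketched with pandas alone: turn each column into a 0/1 missing indicator, correlate the indicators, and keep the unique pairs above a threshold. The toy data and variable names below are hypothetical, and the library may differ in details such as the correlation method or how pairs are labelled.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: x and y tend to be missing on the same rows
toy = pd.DataFrame({
    "x": [1.0, np.nan, 3.0, np.nan, 5.0, 6.0],
    "y": [np.nan, np.nan, 3.0, np.nan, 5.0, 6.0],
    "z": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
    "w": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# 1 where the value is missing, 0 otherwise
indicators = toy.isna().astype(int)
# Drop columns with no missing values: a constant indicator has no defined correlation
indicators = indicators.loc[:, indicators.any()]

# Pearson correlation between the missing indicators
corr = indicators.corr()

# Keep each pair once (upper triangle, diagonal excluded), then filter by threshold
th = 0.5
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack()
high = pairs[pairs > th].sort_values(ascending=False)

print(corr)
print(high)
```

In this toy example only the `x`/`y` pair passes the threshold, because `y` is missing on every row where `x` is missing.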

### Prediction of Missingness
The ability to easily predict missing values for a given feature with a baseline model indicates that the process causing the missing values may not be completely at random.

In [13]:
mp.predict_missings(['so', 'lg'])

so    0.914115
lg    0.993898
Name: predict_missings, dtype: float64

In [14]:
mp.predict_missings()

lg      0.993898
rbi     0.722921
sb      0.964801
cs      0.832594
so      0.914115
ibb     0.866497
hbp     0.963099
sh      0.963187
sf      0.866573
gidp    0.860869
Name: predict_missings, dtype: float64
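
The sketch below illustrates the general idea of predicting missingness with a baseline model: build a 0/1 target that marks where a feature is missing, train a simple classifier on the remaining features, and score it on held-out rows. The logistic regression, the ROC AUC metric, and the toy data are assumptions chosen for illustration; they are not taken from the ydata_quality internals.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 1000

# Hypothetical toy data: "stat" was not recorded before 1950,
# so its absence depends on "year" and is not completely at random
toy = pd.DataFrame({
    "year": rng.integers(1900, 2000, size=n).astype(float),
    "games": rng.normal(size=n),
})
toy["stat"] = np.where(toy["year"] < 1950, np.nan, rng.normal(size=n))

# Target: 1 where "stat" is missing, 0 otherwise
is_missing = toy["stat"].isna().astype(int)
features = toy.drop(columns="stat")

X_train, X_test, y_train, y_test = train_test_split(
    features, is_missing, test_size=0.3, random_state=42, stratify=is_missing
)

# Baseline model: scaled features followed by logistic regression
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

# Score on held-out rows; values near 0.5 would mean missingness looks random
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"Missingness prediction score: {auc:.3f}")
```

A score well above chance, as here, suggests that the other features carry information about when the value is absent.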

### Performance Drop
Testing the performance drop when feature values are missing helps data scientists better understand the downstream impact of missing values.
When normalized, the performance is measured as a ratio over a baseline performance metric achieved for the whole dataset.

In [15]:
mp.target = 'ab'
mp.performance_drop(normalize=True)

| | lg | rbi | sb | cs | so | ibb | hbp | sh | sf | gidp |
|---|---|---|---|---|---|---|---|---|---|---|
| missing | 1.038829 | 1.046662 | 1.044594 | 1.0397 | 1.040884 | 1.040173 | 1.042719 | 1.042521 | 1.039939 | 1.039254 |
| valued | 0.999868 | 0.999964 | 0.999508 | 0.989739 | 0.997459 | 0.979011 | 0.999218 | 0.998125 | 0.979836 | 0.987577 |
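
As a rough illustration of how such a normalized table can arise, the sketch below fits one baseline model on all rows, then compares its error on the rows where a feature is missing and on the rows where it is valued, each divided by the error on the whole dataset. The mean imputation, the least-squares model, the mean absolute error metric, and the in-sample evaluation are all assumptions made to keep the example short; ydata_quality may use a different model and metric.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 2000

# Hypothetical toy data: the target depends on both features
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2.0 * x1 + 3.0 * x2 + rng.normal(scale=0.1, size=n)
toy = pd.DataFrame({"x1": x1, "x2": x2, "y": y})

# Blank out x2 on roughly 30% of the rows
toy.loc[rng.random(n) < 0.3, "x2"] = np.nan
is_missing = toy["x2"].isna().to_numpy()

# Baseline model: mean-impute the features, then ordinary least squares with an intercept
feats = toy[["x1", "x2"]]
X = np.column_stack([np.ones(n), feats.fillna(feats.mean()).to_numpy()])
target = toy["y"].to_numpy()
coef, *_ = np.linalg.lstsq(X, target, rcond=None)

# Absolute error per row, evaluated in-sample for brevity
abs_err = np.abs(X @ coef - target)
baseline = abs_err.mean()

# Error on each subset as a ratio over the whole-dataset error
drop = pd.Series(
    {
        "missing": abs_err[is_missing].mean() / baseline,
        "valued": abs_err[~is_missing].mean() / baseline,
    },
    name="x2",
)
print(drop)
```

With an error metric, a ratio above 1 on the missing rows means the model does worse there than on the dataset as a whole. With a score where higher is better, the reading would be reversed.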


### Advanced - Custom Warning
For custom warnings, we can implement a QualityWarning from scratch based on the outputs of the Performance Drop and store it in the original MissingsProfiler engine.

In [16]:
from ydata_quality.core import QualityWarning

In [17]:
# Define a new custom QualityWarning
new_warning = QualityWarning(
    category='Missings',
    test='Performance Drop',
    description='Found severe differences in performance between missing and non-missing feature values.',
    priority=2,  # 0 critical, 1 heavy, 2 medium, 3 minor
    data=mp.performance_drop(normalize=True),
)

In [18]:
# Store to the original data quality engine
mp.store_warning(new_warning)

In [19]:
# Retrieve the custom warning from the Performance Drop
perf_drop_warnings = mp.get_warnings(test='Performance Drop')