# YData Quality - Missings Tutorial
Time-to-Value: 4 minutes

This notebook provides a tutorial on the Missing Values functionality of the ydata_quality package.

**Structure:**

1. Load dataset
2. Distort dataset
3. Instantiate the Data Quality engine
4. Run the quality checks
5. Assess the warnings
6. (Extra) Detailed overview

In [None]:
# Import the example dataset source and the missings engine
import statsmodels.api as sm
from ydata_quality.missings import MissingsProfiler

## Load the example dataset
We will use a dataset available from the statsmodels package.

In [None]:
df = sm.datasets.get_rdataset('baseball', 'plyr').data

## Create the engine
Each engine contains the checks and tests for its suite. To create a `MissingsProfiler`, you provide:
- df: target DataFrame, for which we will run the test suite
- optional arguments: see the `MissingsProfiler` docstring for the ones accepted by your installed version

In [None]:
mp = MissingsProfiler(df=df)

### Full Evaluation
The easiest way to run the full data quality analysis is to call `.evaluate()`, which returns a list of warnings for each quality check.

In [None]:
results = mp.evaluate()

## Check the status
After running the data quality checks, you can check the warnings for each individual test. The warnings are a dictionary of {test: result}.

In [None]:
mp.report()

## Full Test Suite
In this section, you will find a detailed overview of the tests available in the missings module of ydata_quality.

### Null Count

Count the number of nulls/missings for a DataFrame. Can be calculated for:
- Specific column (entity defined) or all columns (entity=None)
- Count of nulls (as_pct=False) or ratio of rows (as_pct=True)

In [None]:
mp.null_count()

In [None]:
mp.nulls_higher_than(th=0.1)
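
To make the two checks above concrete, here is a minimal sketch of the same ideas in plain pandas. It is not the library's implementation; the toy DataFrame, its column names, and the threshold value are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the baseball data (hypothetical columns)
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": ["x", "y", None, "z"],
    "c": [1, 2, 3, 4],
})

null_counts = toy.isna().sum()   # absolute number of missing values per column
null_ratios = toy.isna().mean()  # share of rows with a missing value per column

# Keep only the columns whose missing ratio exceeds a threshold
th = 0.3
above_th = null_ratios[null_ratios > th]
print(above_th)
```

The ratio view is usually the more useful one when columns are compared against a single threshold, since it does not depend on the number of rows.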

### Correlation of Missings
Calculates the correlation between missing feature values. High correlation between the missing values of two features signals that the data may not be missing completely at random. It is provided as:
- Missing Correlations: full matrix of correlations between missing feature values;
- High Missing Correlations: missing correlations filtered by a given threshold.

In [None]:
mp.missing_correlations()

In [None]:
mp.high_missing_correlations(th=0.8)
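
As a rough illustration of what a missing-value correlation measures, the sketch below builds 0/1 missingness indicators with pandas and correlates them. This is an assumed, simplified approach, not necessarily how `MissingsProfiler` computes it; the toy data and column names are hypothetical.

```python
import numpy as np
import pandas as pd

# "a" and "b" are missing in the same rows; "c" is missing elsewhere; "d" is complete
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan, 5.0, 6.0],
    "b": [1.0, np.nan, 3.0, np.nan, 5.0, 6.0],
    "c": [np.nan, 2.0, 3.0, 4.0, 5.0, np.nan],
    "d": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# 1 where a value is missing, 0 otherwise
indicators = toy.isna().astype(int)
# Columns without any missing value have a constant indicator, so drop them
indicators = indicators.loc[:, indicators.any()]

corr = indicators.corr()  # full matrix of missingness correlations

# Keep each pair once (upper triangle) and filter by absolute correlation
th = 0.8
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack()
high = pairs[pairs.abs() >= th]
print(high)
```

Here only the pair (`a`, `b`) passes the threshold, because the two columns are always missing together.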

### Prediction of Missingness
The ability to easily predict missing values for a given feature with a baseline model indicates that the process causing the missing values may not be completely at random.

In [None]:
mp.predict_missings(col=['so', 'lg'])

In [None]:
mp.predict_missings()
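
The idea can be sketched outside the library with scikit-learn: turn the missingness of one column into a binary target and check how well a simple classifier predicts it from the other features. The logistic regression baseline, the ROC AUC metric, and the synthetic data below are all assumptions made for illustration, not the library's own choices.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    "x": rng.normal(size=n),
    "noise": rng.normal(size=n),
    "y": rng.normal(size=n),
})
# Make "y" missing whenever "x" is large, so missingness depends on an observed feature
toy.loc[toy["x"] > 0.5, "y"] = np.nan

is_missing = toy["y"].isna().astype(int)  # binary target: 1 if "y" is missing
features = toy[["x", "noise"]]            # fully observed predictors

# Cross-validated ROC AUC of a baseline classifier
auc = cross_val_score(
    LogisticRegression(), features, is_missing, cv=5, scoring="roc_auc"
).mean()
print(f"Mean ROC AUC for predicting missingness of 'y': {auc:.2f}")
```

An AUC close to 0.5 would be consistent with values missing completely at random, while a score well above that suggests the missingness is related to other observed features.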

### Performance Drop
Testing the performance drop when feature values are missing enables data scientists to better understand the downstream impact of missing values.

In [None]:
mp.target = 'ab'
mp.performance_drop()
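
A hand-rolled version of this comparison might look like the sketch below: fit a baseline model, then compare its error on rows where a feature was missing against rows where it was observed. The linear model, mean imputation, MAE metric, and synthetic data are assumptions for illustration and may differ from what `performance_drop()` does internally.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
n = 400
toy = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
toy["target"] = 3 * toy["x1"] + toy["x2"] + rng.normal(scale=0.1, size=n)

# Remove x1 in roughly a quarter of the rows (after the target was computed)
toy.loc[rng.random(n) < 0.25, "x1"] = np.nan

train, test = toy.iloc[:300], toy.iloc[300:]
fill = train["x1"].mean()  # imputation value learned on the training rows only
X_train = train[["x1", "x2"]].fillna(fill)
X_test = test[["x1", "x2"]].fillna(fill)

model = LinearRegression().fit(X_train, train["target"])
pred = model.predict(X_test)
y_test = test["target"].to_numpy()

# Split the test rows by whether x1 was originally missing
miss_mask = test["x1"].isna().to_numpy()
mae_missing = mean_absolute_error(y_test[miss_mask], pred[miss_mask])
mae_observed = mean_absolute_error(y_test[~miss_mask], pred[~miss_mask])
print(f"MAE when x1 is missing:  {mae_missing:.2f}")
print(f"MAE when x1 is observed: {mae_observed:.2f}")
```

Because `x1` carries most of the signal in this synthetic target, the error on rows where it is missing should be clearly higher, which is the kind of gap the performance drop check is meant to surface.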