# YData Quality - Missings Tutorial
Time-to-Value: 4 minutes

This notebook is a tutorial on the missing values functionality of the ydata_quality package.

**Structure:**

1. Load dataset
2. Distort dataset
3. Instantiate the Data Quality engine
4. Run the quality checks
5. Assess the warnings
6. (Extra) Detailed overview

In [1]:
# statsmodels provides the example dataset; MissingsProfiler runs the missing-value checks
import statsmodels.api as sm
from ydata_quality.missings import MissingsProfiler

## Load the example dataset
We will use a dataset available from the statsmodels package.

In [2]:
df = sm.datasets.get_rdataset('baseball', 'plyr').data

## Create the engine
Each engine contains the checks and tests for its suite. To create a `MissingsProfiler`, you provide:
- df: target DataFrame, for which we will run the test suite

In [3]:
mp = MissingsProfiler(df=df)

### Full Evaluation
The easiest way to assess data quality is to run `.evaluate()`, which returns the warnings raised by each quality check.

In [4]:
mp.null_count()

lg        65
rbi       12
sb       250
cs      4525
so      1305
ibb     7528
hbp      377
sh       960
sf      7390
gidp    5272
dtype: int64

In [5]:
results = mp.evaluate()

## Check the status
After running the data quality checks, you can check the warnings for each individual test. The warnings are a dictionary of {test: result}.

In [6]:
mp.report()

[HIGH MISSING CORRELATIONS] Found 9 feature pairs with correlation of missing values higher than defined threshold (0.5). (Priority 3: minor impact, aesthetic)
[MISSINGNESS PREDICTION] Found 9 features with prediction performance of missingness above threshold (0.8). (Priority 2: usage allowed, limited human intelligibility)
[HIGH MISSINGS] Found 4 columns with more than 20.0% of missing values. (Priority 3: minor impact, aesthetic)


## Full Test Suite
In this section, you will find a detailed overview of the tests available in the missings module of ydata_quality.

### Null Count

Counts the number of null (missing) values in a DataFrame. The count can be computed for:
- Specific column (entity defined) or all columns (entity=None)
- Count of nulls (as_pct=False) or ratio of rows (as_pct=True)

In [7]:
mp.null_count()

lg        65
rbi       12
sb       250
cs      4525
so      1305
ibb     7528
hbp      377
sh       960
sf      7390
gidp    5272
dtype: int64

In [8]:
mp.nulls_higher_than(th=0.1)

cs      0.208535
ibb     0.346928
sf      0.340569
gidp    0.242961
dtype: float64
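As a rough cross-check, the same quantities can be computed with plain pandas. The sketch below uses a small made-up DataFrame (`toy`, with hypothetical columns `x`, `y`, `z`) and is not the `MissingsProfiler` implementation. It only assumes that the counts are per-column null counts and that the threshold is applied to the ratio of missing rows, which is what the outputs above suggest.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: x has 1 missing value, y has 3, z has none.
toy = pd.DataFrame({
    "x": [1.0, np.nan, 3.0, 4.0],
    "y": [np.nan, np.nan, np.nan, 4.0],
    "z": [1, 2, 3, 4],
})

# Absolute number of missing values per column, keeping only columns with any.
counts = toy.isna().sum()
counts = counts[counts > 0]

# Ratio of missing rows per column (the mean of a boolean mask).
ratios = toy.isna().mean()

# Columns whose missing ratio exceeds a threshold (0.5 here, chosen arbitrarily).
high = ratios[ratios > 0.5]

print(counts)
print(high)
```

Only `y` passes the 0.5 threshold here, because three of its four rows are missing.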

### Correlation of Missings
Calculates the correlation between the missingness of different features. A high correlation between missing values signals that data absence may not be completely at random. Two views are provided:
- Missing Correlations: full matrix of correlations between missing feature values;
- High Missing Correlations: missing correlations filtered by a given threshold.

In [9]:
mp.missing_correlations()

| | lg | rbi | sb | cs | so | ibb | hbp | sh | sf | gidp |
|---|---|---|---|---|---|---|---|---|---|---|
| lg | 1.0 | -0.001289 | -0.005918 | -0.028136 | -0.013866 | 0.075205 | 0.412222 | 0.254769 | 0.076273 | 0.096756 |
| rbi | -0.001289 | 1.0 | -0.00254 | 0.045827 | 0.09299 | 0.032274 | 0.176903 | 0.109332 | 0.032732 | 0.041522 |
| sb | -0.005918 | -0.00254 | 1.0 | 0.210326 | -0.02731 | 0.148125 | 0.811914 | 0.501793 | 0.150227 | 0.190571 |
| cs | -0.028136 | 0.045827 | 0.210326 | 1.0 | 0.49138 | 0.704262 | 0.20175 | 0.383291 | 0.714259 | 0.626763 |
| so | -0.013866 | 0.09299 | -0.02731 | 0.49138 | 1.0 | 0.345846 | 0.009386 | 0.072834 | 0.350768 | 0.446525 |
| ibb | 0.075205 | 0.032274 | 0.148125 | 0.704262 | 0.345846 | 1.0 | 0.181698 | 0.29519 | 0.986003 | 0.776136 |
| hbp | 0.412222 | 0.176903 | 0.811914 | 0.20175 | 0.009386 | 0.181698 | 1.0 | 0.614607 | 0.184284 | 0.233897 |
| sh | 0.254769 | 0.109332 | 0.501793 | 0.383291 | 0.072834 | 0.29519 | 0.614607 | 1.0 | 0.299381 | 0.379781 |
| sf | 0.076273 | 0.032732 | 0.150227 | 0.714259 | 0.350768 | 0.986003 | 0.184284 | 0.299381 | 1.0 | 0.787165 |
| gidp | 0.096756 | 0.041522 | 0.190571 | 0.626763 | 0.446525 | 0.776136 | 0.233897 | 0.379781 | 0.787165 | 1.0 |

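One way to picture what such a matrix measures is to correlate the boolean "is missing" masks of the columns. The following is a minimal pandas sketch on a made-up DataFrame (`toy` and the helper `missing_corr` are hypothetical names), assuming Pearson correlation over the masks and that columns without any missing value are left out. It is not taken from the package source.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: a and b are missing on the same rows,
# c is missing exactly where a is present, d is never missing.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan, 5.0, 6.0],
    "b": [10.0, np.nan, 30.0, np.nan, 50.0, 60.0],
    "c": [np.nan, 2.0, np.nan, 4.0, np.nan, np.nan],
    "d": [1, 2, 3, 4, 5, 6],
})

def missing_corr(df: pd.DataFrame) -> pd.DataFrame:
    """Pearson correlation between the missingness masks of the columns."""
    mask = df.isna()
    # A column with no missing values has a constant mask, so drop it.
    mask = mask.loc[:, mask.any()]
    return mask.astype(float).corr()

corr = missing_corr(toy)
print(corr)
```

In this toy case `a` and `b` correlate perfectly (their values go missing together), while `a` and `c` are perfectly anti-correlated.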

In [10]:
mp.high_missing_correlations(th=0.8)

| | index | variable | value | sorted_pairs |
|---|---|---|---|---|
| 58 | sf | ibb | 0.986003 | ibb_sf |
| 26 | hbp | sb | 0.811914 | hbp_sb |

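Filtering a correlation matrix down to the pairs above a threshold can be sketched as follows. The matrix below copies the `sb`, `ibb`, `hbp` and `sf` entries from the full matrix shown earlier; the helper `high_pairs` is a hypothetical name and only illustrates the idea (each unordered pair reported once, named with its two features sorted alphabetically), not the package's own code.

```python
from itertools import combinations

import pandas as pd

cols = ["sb", "ibb", "hbp", "sf"]
# Subset of the missing-correlation matrix printed earlier in this notebook.
corr = pd.DataFrame(
    [
        [1.0, 0.148125, 0.811914, 0.150227],
        [0.148125, 1.0, 0.181698, 0.986003],
        [0.811914, 0.181698, 1.0, 0.184284],
        [0.150227, 0.986003, 0.184284, 1.0],
    ],
    index=cols,
    columns=cols,
)

def high_pairs(corr: pd.DataFrame, th: float) -> list:
    """Return (pair_name, correlation) for each unordered pair above th."""
    pairs = []
    # Iterating over sorted names visits every unordered pair exactly once.
    for a, b in combinations(sorted(corr.columns), 2):
        value = float(corr.loc[a, b])
        if value > th:
            pairs.append((f"{a}_{b}", value))
    # Strongest correlations first.
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)

result = high_pairs(corr, th=0.8)
print(result)
```

With `th=0.8` this keeps the same two pairs as the output above, `ibb_sf` and `hbp_sb`.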

### Prediction of Missingness
If a baseline model can easily predict whether a given feature is missing, the process causing the missing values may not be completely at random.

In [11]:
mp.predict_missings(col=['so', 'lg'])

{'so': 0.8980369132761978, 'lg': 0.9961467324290999}
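To make the idea concrete, here is a minimal sketch that scores how well a simple classifier predicts the missingness mask of a column from another feature. Everything in it is an assumption made for illustration: the synthetic data, the helper `missingness_auc`, the choice of scikit-learn's `LogisticRegression`, and ROC AUC as the metric. The source does not state which baseline model or metric `predict_missings` uses.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 400
toy = pd.DataFrame({
    "x": rng.normal(size=n),
    "y": rng.normal(size=n),
    "z": rng.normal(size=n),
})
# y goes missing whenever x is large: missingness depends on another feature.
toy.loc[toy["x"] > 0.5, "y"] = np.nan
# z goes missing on a random 30% of rows: missingness is unrelated to x.
toy.loc[rng.random(n) < 0.3, "z"] = np.nan

def missingness_auc(df: pd.DataFrame, col: str, features: list) -> float:
    """Cross-validated ROC AUC for predicting whether `col` is missing."""
    target = df[col].isna().astype(int)
    scores = cross_val_score(
        LogisticRegression(), df[features], target, cv=5, scoring="roc_auc"
    )
    return float(scores.mean())

auc_y = missingness_auc(toy, "y", ["x"])
auc_z = missingness_auc(toy, "z", ["x"])
print(f"y: {auc_y:.3f}  z: {auc_z:.3f}")
```

A score close to 1 for `y` reflects that its missingness is driven by `x`, whereas the score for `z` should stay near chance level.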

In [12]:
mp.predict_missings()

{'lg': 0.9961467324290999,
 'rbi': 0.7229208301306688,
 'sb': 0.9634839632608511,
 'cs': 0.8326044514256971,
 'so': 0.8980369132761978,
 'ibb': 0.8664950760966876,
 'hbp': 0.9631568181544718,
 'sh': 0.9664952978056426,
 'sf': 0.866573017058454,
 'gidp': 0.8608684022651725}
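The scores above tie back to the `[MISSINGNESS PREDICTION]` warning in the report, which counts the features scoring above a 0.8 threshold. The short snippet below copies the printed scores and applies that threshold by hand; the strict `>` comparison is an assumption.

```python
# Scores copied from the output of mp.predict_missings() above.
scores = {
    "lg": 0.9961467324290999,
    "rbi": 0.7229208301306688,
    "sb": 0.9634839632608511,
    "cs": 0.8326044514256971,
    "so": 0.8980369132761978,
    "ibb": 0.8664950760966876,
    "hbp": 0.9631568181544718,
    "sh": 0.9664952978056426,
    "sf": 0.866573017058454,
    "gidp": 0.8608684022651725,
}

threshold = 0.8  # threshold quoted in the report warning
flagged = {col: score for col, score in scores.items() if score > threshold}

print(len(flagged), sorted(flagged))
```

Nine of the ten features are flagged, which matches the count in the warning; only `rbi` stays below the threshold.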

### Performance Drop
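Calculates the drop in performance on the target when the values of a specific column are missing. For each column, the result reports one score for the rows where that column is missing (`missing`) and one for the rows where it has a value (`valued`).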

In [19]:
mp.target = 'ab'

In [20]:
mp.performance_drop()

{'lg': {'missing': 862.0865059674587, 'valued': 606.2528927899764},
 'rbi': {'missing': 1168.7201400097863, 'valued': 606.6857910611225},
 'sb': {'missing': 919.3489502674844, 'valued': 603.6746227805031},
 'cs': {'missing': 832.9330970826526, 'valued': 548.7537445878879},
 'so': {'missing': 1212.2542250783617, 'valued': 569.5000508725395},
 'ibb': {'missing': 781.329293658966, 'valued': 516.1003343120956},
 'hbp': {'missing': 959.1039321312575, 'valued': 600.6756626768799},
 'sh': {'missing': 1073.6962665431095, 'valued': 586.5386043710178},
 'sf': {'missing': 787.9208312425715, 'valued': 515.8380897177242},
 'gidp': {'missing': 900.6760588771198, 'valued': 514.2116560990816}}
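
To compare columns at a glance, the dictionary above can be reduced to the ratio of the `missing` score to the `valued` score. The snippet below copies the printed values; the ratio is only an illustrative summary, not something the package reports, and how to read it depends on the metric the profiler uses for the target.

```python
# Values copied from the output of mp.performance_drop() above.
results = {
    "lg": {"missing": 862.0865059674587, "valued": 606.2528927899764},
    "rbi": {"missing": 1168.7201400097863, "valued": 606.6857910611225},
    "sb": {"missing": 919.3489502674844, "valued": 603.6746227805031},
    "cs": {"missing": 832.9330970826526, "valued": 548.7537445878879},
    "so": {"missing": 1212.2542250783617, "valued": 569.5000508725395},
    "ibb": {"missing": 781.329293658966, "valued": 516.1003343120956},
    "hbp": {"missing": 959.1039321312575, "valued": 600.6756626768799},
    "sh": {"missing": 1073.6962665431095, "valued": 586.5386043710178},
    "sf": {"missing": 787.9208312425715, "valued": 515.8380897177242},
    "gidp": {"missing": 900.6760588771198, "valued": 514.2116560990816},
}

# Ratio of the score on rows where the column is missing to the score
# on rows where it is valued, sorted from largest to smallest gap.
ratios = {col: r["missing"] / r["valued"] for col, r in results.items()}
ranked = sorted(ratios.items(), key=lambda item: item[1], reverse=True)

for col, ratio in ranked:
    print(f"{col:>5}: {ratio:.2f}")
```

In these results `so` shows the largest gap between the two groups and `lg` the smallest.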