# YData Quality - Labelling (Numerical) Tutorial
Time-to-Value: 4 minutes

This notebook is a tutorial on the functionality the ydata_quality package provides for numerical labels.

**Structure:**

1. Load dataset
2. Corrupt the dataset
3. Instantiate the Data Quality engine
4. Run the quality checks
5. Assess the warnings
6. (Extra) Detailed overview

In [1]:
import pandas as pd
from ydata_quality.labelling import label_inspector_dispatch

## Load the example dataset
We will use a transformed version of the "Guerry" dataset available from the statsmodels package.
Records without a label and labels that occur in a single record have been added to the dataset.

In [2]:
df = pd.read_csv('../datasets/transformed/guerry_num_label.csv')
df.head()

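| | dept | Region | Department | Crime_pers | Crime_prop | Literacy | Donations | Infants | Suicides | MainCity | ... | Crime_parents | Infanticide | Donation_clergy | Lottery | Desertion | Instruction | Prostitutes | Distance | Area | Pop1831 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | E | Ain | 28870 | 15890 | 37 | NaN | 33120 | 35039 | 2:Med | ... | 71 | 60 | 69 | 41 | 55 | 46 | 13 | 218.372 | 5762 | 346.03 |
| 1 | 2 | N | Aisne | 26226 | 5521 | 51 | 50000.0 | 14572 | 12831 | 2:Med | ... | 4 | 82 | 36 | 38 | 82 | 24 | 327 | 65.945 | 7369 | 513.0 |
| 2 | 3 | C | Allier | 26747 | 7925 | 13 | 10973.0 | 17044 | 114121 | 2:Med | ... | 46 | 42 | 76 | 66 | 16 | 85 | 34 | 161.927 | 7340 | 298.26 |
| 3 | 4 | E | Basses-Alpes | 12935 | 7289 | 46 | 2733.0 | 23018 | 14238 | 1:Sm | ... | 70 | 12 | 37 | 80 | 32 | 29 | 2 | 351.399 | 6925 | 155.9 |
| 4 | 5 | E | Hautes-Alpes | 17488 | 8174 | 69 | 6962.0 | 23076 | 16171 | 1:Sm | ... | 22 | 23 | 64 | 79 | 35 | 7 | 1 | 320.28 | 5549 | 129.1 |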
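
The transformed CSV already contains the corrupted records, so no corruption code is needed in this notebook. As a rough illustration of what such a corruption step could look like, the sketch below blanks out one label and inflates another on a toy frame. The toy values and the choice of rows are assumptions made for illustration; this is not the script that produced `guerry_num_label.csv`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a clean dataset; the values are made up for illustration.
clean = pd.DataFrame({
    "Department": ["A", "B", "C", "D"],
    "Donations": [5098.0, 8901.0, 10973.0, 2733.0],
})

corrupted = clean.copy()
# Blank out one label to simulate a record without a label.
corrupted.loc[0, "Donations"] = np.nan
# Inflate another label to simulate an extreme value.
corrupted.loc[1, "Donations"] = 50000.0

n_missing = int(corrupted["Donations"].isna().sum())
print(f"Records without a label: {n_missing}")
```

Working on a copy keeps the clean frame available for comparison after the quality checks have run.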


## Create the engine
Each engine bundles the checks and tests of one suite. To create a Label Inspector, you provide:
- `df`: the target DataFrame on which the test suite will run
- `label`: the name of the column to use as the label; in this case it points to a numerical label

In [3]:
li = label_inspector_dispatch(df=df, label='Donations', random_state=24)

### Full Evaluation
The easiest way to run the data quality analysis is to call `.evaluate()`. It returns the results of the tests, stores the warnings, and prints a summary of the data quality warnings found during the analysis.

In [4]:
results = li.evaluate()


Priority 1 - heavy impact expected:
	* [LABELS - MISSING LABELS] Found 1 instances with missing labels.
	* [LABELS - TEST NORMALITY] The label distribution failed to pass a normality test as-is and following a battery of transforms. It is possible that the data originates from an exotic distribution, there is heavy outlier presence or it is multimodal. Addressing this issue might prove critical for regressor performance.
Priority 2 - usage allowed, limited human intelligibility:
	* [LABELS - OUTLIER DETECTION] Found 3 potential outliers across the full dataset. A distance bigger than 3.0 standard deviations of intra-cluster distances to the respective centroids was used to define the potential outliers.



## Full Test Suite
In this section, you will find a detailed overview of the available tests in the labelling module of ydata_quality.

### Missing Labels

Return records with a missing label.

In [5]:
li.missing_labels()

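| | dept | Region | Department | Crime_pers | Crime_prop | Literacy | Donations | Infants | Suicides | MainCity | ... | Crime_parents | Infanticide | Donation_clergy | Lottery | Desertion | Instruction | Prostitutes | Distance | Area | Pop1831 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | E | Ain | 28870 | 15890 | 37 | NaN | 33120 | 35039 | 2:Med | ... | 71 | 60 | 69 | 41 | 55 | 46 | 13 | 218.372 | 5762 | 346.03 |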
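
For intuition, the same kind of result can be reproduced with plain pandas by filtering on `isna()`. This is a minimal sketch on a toy frame; the helper `rows_with_missing_label` is a hypothetical name and is not part of ydata_quality.

```python
import numpy as np
import pandas as pd


def rows_with_missing_label(df: pd.DataFrame, label: str) -> pd.DataFrame:
    """Return the rows of df whose label column is NaN."""
    return df[df[label].isna()]


# Toy frame mirroring the first records shown above.
toy = pd.DataFrame({
    "Department": ["Ain", "Aisne", "Allier"],
    "Donations": [np.nan, 50000.0, 10973.0],
})

missing = rows_with_missing_label(toy, "Donations")
print(missing)
```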


### Outlier Detection

Get a dictionary of all potential outliers, detected either over the full dataset (example 1) or within clusters of the label feature (example 2). A record is flagged when its distance to the median of the dataset, or of its cluster, exceeds a threshold expressed in standard deviations of those distances (the `th` argument).

In [6]:
# Example 1
li.outlier_detection(th=3)  # The top 3 valued labels were flagged as potential outliers

{'full_dataset':     dept Region Department  Crime_pers  Crime_prop  Literacy  Donations  \
 1      2      N      Aisne       26226        5521        51    50000.0   
 12    14      N   Calvados       17577        4500        52    27830.0   
 85   200    NaN      Corse        2199        4589        49    37015.0   
 
     Infants  Suicides MainCity  ...  Crime_parents  Infanticide  \
 1     14572     12831    2:Med  ...              4           82   
 12     8983     31807    2:Med  ...             57           56   
 85    24743     37016    2:Med  ...             81            2   
 
     Donation_clergy  Lottery  Desertion  Instruction  Prostitutes  Distance  \
 1                36       38         82           24          327    65.945   
 12               11       13         12           22          194   117.487   
 85               84       83          9           25            1   539.213   
 
     Area  Pop1831  
 1   7369   513.00  
 12  5548   494.70  
 85  8680   195.41 

In [7]:
# Example 2
li.outlier_detection(th=3, use_clusters=True)  # Search for potential outliers within each label cluster

{}
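
The full-dataset rule described above can be sketched in a few lines of pandas. This is one possible reading of that description, assuming a label is flagged when its absolute distance to the median exceeds `th` times the standard deviation of those distances; the helper `flag_outliers` is hypothetical and the library's actual implementation may differ.

```python
import pandas as pd


def flag_outliers(labels: pd.Series, th: float = 3.0) -> pd.Series:
    """Return labels whose distance to the median exceeds th standard deviations.

    Assumption: the standard deviation is computed over the distances themselves.
    """
    labels = labels.dropna()
    distances = (labels - labels.median()).abs()
    return labels[distances > th * distances.std()]


# Toy labels: eleven values close together and one extreme value.
toy_labels = pd.Series(
    [10.0, 11.0, 12.0, 9.0, 10.0, 11.0, 10.0, 12.0, 9.0, 11.0, 10.0, 100.0]
)
flagged = flag_outliers(toy_labels, th=3.0)
print(flagged)
```

A cluster-based variant would apply the same rule separately inside each cluster, which can change which records are flagged because every cluster has its own median and spread.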

### Normality Test

Test the labels for normality, that is, whether the data might originate from a normal distribution at a pre-specified confidence level.
If the test fails, a battery of transforms is applied to the label and the test is repeated after each transform.
A low priority warning is raised if a transform is required to pass the test; a high priority warning is raised if none of the transforms yields a positive normality test.

In [8]:
li.test_normality()
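
The test-then-transform procedure can be illustrated with SciPy. The sketch below is an assumption-laden stand-in: it uses the Shapiro-Wilk test and a hypothetical battery of two transforms (log and square root, both requiring positive values), which may not match the test or the transforms ydata_quality applies internally.

```python
import numpy as np
from scipy import stats


def first_normalizing_transform(values, p_th=0.05):
    """Return (name, p_value) of the first transform that passes Shapiro-Wilk.

    Returns (None, None) when neither the raw values nor any transform passes.
    """
    values = np.asarray(values, dtype=float)
    values = values[~np.isnan(values)]  # ignore missing labels
    # Hypothetical battery; "identity" tests the labels as-is.
    transforms = {"identity": lambda x: x, "log": np.log, "sqrt": np.sqrt}
    for name, transform in transforms.items():
        _, p_value = stats.shapiro(transform(values))
        if p_value > p_th:
            return name, p_value
    return None, None


# Deterministic, strongly right-skewed toy labels: exponentiated normal quantiles.
probs = np.linspace(0.005, 0.995, 100)
skewed_labels = np.exp(stats.norm.ppf(probs))

name, p_value = first_normalizing_transform(skewed_labels)
print(name)
```

In terms of the warnings described above, a result of `"identity"` corresponds to no warning, any other transform name to the low priority case, and `None` to the high priority case.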

