# YData Quality - Labelling (Categorical) Tutorial
Time-to-Value: 4 minutes

This notebook is a tutorial on the ydata_quality package functionality for categorical labels.

**Structure:**

1. Load dataset
2. Corrupt the dataset
3. Instantiate the Data Quality engine
4. Run the quality checks
5. Assess the warnings
6. (Extra) Detailed overview

In [1]:
# Update the imports
import statsmodels.api as sm
from ydata_quality.labelling import LabelInspector

## Load the example dataset
We will use a dataset available from the statsmodels package.

In [2]:
df = sm.datasets.get_rdataset('Guerry', 'HistData').data

In [3]:
df.head()

Unnamed: 0,dept,Region,Department,Crime_pers,Crime_prop,Literacy,Donations,Infants,Suicides,MainCity,...,Crime_parents,Infanticide,Donation_clergy,Lottery,Desertion,Instruction,Prostitutes,Distance,Area,Pop1831
0,1,E,Ain,28870,15890,37,5098,33120,35039,2:Med,...,71,60,69,41,55,46,13,218.372,5762,346.03
1,2,N,Aisne,26226,5521,51,8901,14572,12831,2:Med,...,4,82,36,38,82,24,327,65.945,7369,513.0
2,3,C,Allier,26747,7925,13,10973,17044,114121,2:Med,...,46,42,76,66,16,85,34,161.927,7340,298.26
3,4,E,Basses-Alpes,12935,7289,46,2733,23018,14238,1:Sm,...,70,12,37,80,32,29,2,351.399,6925,155.9
4,5,E,Hautes-Alpes,17488,8174,69,6962,23076,16171,1:Sm,...,22,23,64,79,35,7,1,320.28,5549,129.1


## Corrupt the dataset
To give the checks something to find, we corrupt the label column: one record loses its label, and another receives a label that no other record shares.

In [4]:
label = 'MainCity'

df.loc[0, label] = None
df.loc[1, label] = 'A lonesome label'
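
The effect of these two assignments can be checked on a small stand-in frame. This is a minimal sketch using plain pandas; the `toy` frame and its values are illustrative and are not the Guerry data.

```python
import pandas as pd

# Illustrative stand-in for the real dataset (values are made up).
toy = pd.DataFrame(
    {"MainCity": ["2:Med", "2:Med", "2:Med", "2:Med", "1:Sm", "1:Sm", "3:Lg", "3:Lg"]}
)
label = "MainCity"

toy.loc[0, label] = None                # a record without a label
toy.loc[1, label] = "A lonesome label"  # a class with a single record

n_missing = int(toy[label].isna().sum())   # number of unlabelled records
class_counts = toy[label].value_counts()   # missing values are excluded by default

print(n_missing)
print(class_counts)
```

After the corruption, one record is unlabelled and one class has exactly one member, which are the two conditions the missing-label and few-label checks look for.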

## Create the engine
Each engine contains the checks and tests for each suite. To create a Label Inspector, you provide:
- df: target DataFrame, for which we will run the test suite
- label: name of the column to be used as the label; in this case it refers to a categorical label

In [5]:
li = LabelInspector(df=df, label='MainCity')

### Full Evaluation
The easiest way to run the data quality analysis is to call `.evaluate()`, which returns a list of warnings for each quality check.

In [6]:
results = li.evaluate()

## Check the status
After running the data quality checks, you can check the warnings for each individual test. The warnings are a dictionary of {test: result}.

In [7]:
li.report()

	[MISSING LABELS] Found 1 instances with missing labels. (Priority 1: heavy impact expected)
	[ONE VS REST PERFORMANCE] Classes {'2:Med', '3:Lg'} performed under the 61.8% AUROC threshold. The threshold was defined as an average of all classifiers with 0% slack. (Priority 2: usage allowed, limited human intelligibility)
	[UNBALANCED CLASSES] Classes {'A lonesome label'} are under-represented each having less than 10.0% of total instances. Classes {'2:Med'} are over-represented each having more than 40.0% of total instances (Priority 2: usage allowed, limited human intelligibility)
	[FEW LABELS] Found 1 labels with 1 or less records. (Priority 2: usage allowed, limited human intelligibility)
	[OUTLIER DETECTION] Found 74 potential outliers across 3 classes. A distance bigger than 3 standard deviations of intra-cluster distances to the respective centroids was used to define the potential outliers. (Priority 2: usage allowed, limited human intelligibility)


### Quality Warning

In [8]:
# Get a sample warning
sample_warning = li.warnings[2]

In [9]:
# Check the details
sample_warning.test, sample_warning.description, sample_warning.priority

('Unbalanced Classes',
 "Classes {'A lonesome label'} are under-represented each having less than 10.0% of total instances. Classes {'2:Med'} are over-represented each having more than 40.0% of total instances",
 <Priority.P2: 2>)

In [10]:
# Retrieve the relevant data from the warning
sample_warning_data = sample_warning.data

## Full Test Suite
In this section, you will find a detailed overview of the available tests in the labelling module of ydata_quality.

### Missing Labels

Return records with a missing label.

In [11]:
li.missing_labels()

Unnamed: 0,dept,Region,Department,Crime_pers,Crime_prop,Literacy,Donations,Infants,Suicides,MainCity,...,Crime_parents,Infanticide,Donation_clergy,Lottery,Desertion,Instruction,Prostitutes,Distance,Area,Pop1831
0,1,E,Ain,28870,15890,37,5098,33120,35039,,...,71,60,69,41,55,46,13,218.372,5762,346.03


### Few Labels

Find label classes with few records (at or below a count threshold, which defaults to 1).

In [12]:
li.few_labels()

A lonesome label    1
Name: Label counts, dtype: int64
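
The idea behind this check can be sketched with plain pandas. This is not the library's implementation; the `count_th` name is hypothetical, and the inclusive comparison is an assumption based on the "1 or less records" wording in the report.

```python
import pandas as pd

# Illustrative labels (made up for this sketch).
labels = pd.Series(
    ["2:Med", "2:Med", "1:Sm", "1:Sm", "A lonesome label"], name="MainCity"
)
count_th = 1  # hypothetical name; assumed to mirror the default threshold

counts = labels.value_counts()
few = counts[counts <= count_th]  # classes with count_th records or fewer

print(few)
```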

### One vs Rest Performance

Obtain a One vs Rest summary of performance across all label classes. Store a warning for all classes with performance below an implied threshold.

In [13]:
li.one_vs_rest_performance(slack=0.1)

2:Med    0.614286
1:Sm     0.660000
3:Lg     0.596591
dtype: float64
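
The report describes the threshold as an average over all classifiers, adjusted by a slack. The sketch below shows one way such a threshold could be applied to the scores printed above. The `below_threshold` helper is hypothetical, and the multiplicative form `mean * (1 - slack)` is an assumption, not a confirmed detail of the library.

```python
import pandas as pd

# AUROC per class, copied from the output above.
scores = pd.Series({"2:Med": 0.614286, "1:Sm": 0.660000, "3:Lg": 0.596591})


def below_threshold(scores: pd.Series, slack: float = 0.0) -> pd.Series:
    """Hypothetical helper: return classes scoring under mean * (1 - slack)."""
    threshold = scores.mean() * (1 - slack)  # assumed form of the slack adjustment
    return scores[scores < threshold]


flagged_no_slack = below_threshold(scores, slack=0.0)
flagged_with_slack = below_threshold(scores, slack=0.1)

print(flagged_no_slack)
print(flagged_with_slack)
```

Under this assumption, a slack of 0 flags every class below the plain average, while a larger slack lowers the bar so that fewer classes are flagged.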

### Unbalanced Classes

Get a list of all classes that are over- or under-represented in the dataset. The imbalance thresholds are defined implicitly through a slack parameter, measured against a fair (uniform) distribution of records per class.

In [14]:
li.unbalanced_classes(slack=0.3)

Index(['2:Med', 'A lonesome label'], dtype='object')
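
A fair-share comparison of this kind can be sketched as follows. The `unbalanced` helper and its `relative_slack` parameter are hypothetical, and the class counts are illustrative. With four classes, a relative slack of 0.6 around the fair share of 25% gives the 10.0% and 40.0% bounds quoted in the report; how the library's own `slack` argument maps onto those bounds is not shown in this notebook.

```python
import pandas as pd

# Illustrative class counts (made up for this sketch).
counts = pd.Series({"2:Med": 64, "1:Sm": 10, "3:Lg": 10, "A lonesome label": 1})


def unbalanced(counts: pd.Series, relative_slack: float) -> pd.Index:
    """Hypothetical helper: classes whose share deviates too far from 1 / n_classes."""
    ratios = counts / counts.sum()            # share of labelled records per class
    fair_share = 1 / len(counts)              # share under a uniform distribution
    lower = fair_share * (1 - relative_slack)
    upper = fair_share * (1 + relative_slack)
    return ratios[(ratios < lower) | (ratios > upper)].index


flagged = unbalanced(counts, relative_slack=0.6)
print(flagged)
```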

### Outlier Detection

Get a dictionary of all potential outliers obtained over each class of the label feature. Outliers are defined based on a threshold of standard deviations of intra-cluster (class specific) distances to the cluster centroid.
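
The description above can be illustrated with a small sketch. It reads the rule literally, flagging a record when its Euclidean distance to the class centroid exceeds `th` times the standard deviation of that class's distances. The `centroid_outliers` helper, the choice of Euclidean distance, and the toy data are all assumptions; the library may preprocess features or define the cutoff differently.

```python
import numpy as np
import pandas as pd


def centroid_outliers(features: pd.DataFrame, labels: pd.Series, th: float = 3.0) -> dict:
    """Hypothetical helper: per class, rows farther than th * std(distances) from the centroid."""
    outliers = {}
    for cls, group in features.groupby(labels):  # rows with a missing label are skipped
        centroid = group.mean(axis=0)
        distances = np.sqrt(((group - centroid) ** 2).sum(axis=1))
        flagged = group[distances > th * distances.std()]  # assumed cutoff
        if not flagged.empty:
            outliers[cls] = flagged
    return outliers


# Toy data: class "a" has one point far from the rest, class "b" is constant.
features = pd.DataFrame({"x": [0.0] * 9 + [10.0] + [5.0] * 4})
labels = pd.Series(["a"] * 10 + ["b"] * 4)

result = centroid_outliers(features, labels, th=3.0)
print(result)
```

Returning a dictionary keyed by class mirrors the shape of the output shown for `outlier_detection`, where each class maps to a DataFrame of flagged records.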

In [15]:
li.outlier_detection(th=3)

{'2:Med':     dept Region    Department  Crime_pers  Crime_prop  Literacy  Donations  \
 6      8      N      Ardennes       35203        8847        67       6400   
 8     10      E          Aube       19602        4086        59       3608   
 9     11      S          Aude       15647       10431        34       2582   
 10    12      S       Aveyron        8236        6731        31       3211   
 12    14      N      Calvados       17577        4500        52      27830   
 ..   ...    ...           ...         ...         ...       ...        ...   
 81    86      W        Vienne       15010        4710        25       8922   
 82    87      C  Haute-Vienne       16256        6402        13      13817   
 83    88      E        Vosges       18835        9044        62       4040   
 84    89      C         Yonne       18006        6516        47       4276   
 85   200    NaN         Corse        2199        4589        49      37015   
 
     Infants  Suicides MainCity  ...  Cri