# YData Quality - Duplicates Tutorial
Time-to-Value: 4 minutes

This notebook is a tutorial on the ydata_quality package functionality for detecting duplicate values.

**Structure:**

1. Load dataset
2. Distort dataset
3. Instantiate the Data Quality engine
4. Run the quality checks
5. Assess the warnings
6. (Extra) Detailed overview

In [1]:
import statsmodels.api as sm
from ydata_quality.duplicates import DuplicateChecker

## Load the example dataset
We will use a dataset available from the statsmodels package.

In [2]:
df = sm.datasets.get_rdataset('Guerry', 'HistData').data

## Distort the original dataset
We deliberately introduce duplicate rows and a duplicate column so that the data quality checks have something to detect.

In [3]:
# Duplicate the first 20 rows
# (pd.concat is used because DataFrame.append was removed in pandas 2.0)
import pandas as pd
df = pd.concat([df, df[:20]], ignore_index=True)

In [4]:
# Duplicate the dept column
df["dept2"] = df["dept"]
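
To see what these two distortions do, here is a minimal sketch on a made-up four-row frame. The names `toy` and `distorted` and all values are illustrative assumptions, not part of the Guerry data.

```python
import pandas as pd

# Toy stand-in for the dataset (values are made up for illustration).
toy = pd.DataFrame({"dept": [1, 2, 3, 4], "Region": ["E", "N", "C", "E"]})

# Append a copy of the first two rows; ignore_index=True renumbers the index 0..5.
distorted = pd.concat([toy, toy[:2]], ignore_index=True)

# Add a column that mirrors an existing one.
distorted["dept2"] = distorted["dept"]

# Count rows that repeat an earlier row.
n_repeated_rows = int(distorted.duplicated().sum())
print(distorted.shape, n_repeated_rows)  # (6, 3) 2
```

The repeated rows sit at the end of the frame with fresh index labels, which is why the duplicates reported later in this tutorial start at index 86.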

## Create the engine
Each engine bundles the checks and tests of one suite. To create a DuplicateChecker, you provide:
- df: the target DataFrame on which the test suite will run
- entities (optional): a list of feature names to group by when checking for entity duplicates.

In [5]:
dc = DuplicateChecker(df=df, entities=['Region', 'MainCity'])

### Full Evaluation
The easiest way to run the data quality analysis is to call `.evaluate()`, which runs every quality check and returns a dictionary with one entry per check.

In [6]:
results = dc.evaluate()
results.keys()

dict_keys(['exact_duplicates', 'entity_duplicates', 'duplicate_columns'])

## Check the status
After running the data quality checks, you can inspect the warnings raised by each individual test. Every warning is assigned a priority and carries additional details that give Data Scientists better insight into the issue.

In [7]:
dc.report()

[ENTITY DUPLICATES] Found 20 duplicates after grouping by entities. (Priority 2: usage allowed, limited human intelligibility)
[EXACT DUPLICATES] Found 20 instances with exact duplicate feature values. (Priority 2: usage allowed, limited human intelligibility)
[DUPLICATE COLUMNS] Found 1 columns with exactly the same feature values as other columns. (Priority 1: heavy impact expected)


### Quality Warning

In [8]:
# Get a sample warning
sample_warning = list(dc.warnings)[1]

In [9]:
# Check the details
sample_warning.test, sample_warning.description, sample_warning.priority

('Exact Duplicates',
 'Found 20 instances with exact duplicate feature values.',
 <Priority.P2: 2>)

In [21]:
# Retrieve the relevant data from the warning
sample_warning_data = sample_warning.data

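| | dept | Region | Department | Crime_pers | Crime_prop | Literacy | Donations | Infants | Suicides | MainCity | ... | Infanticide | Donation_clergy | Lottery | Desertion | Instruction | Prostitutes | Distance | Area | Pop1831 | dept2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 86 | 1 | E | Ain | 28870 | 15890 | 37 | 5098 | 33120 | 35039 | 2:Med | ... | 60 | 69 | 41 | 55 | 46 | 13 | 218.372 | 5762 | 346.03 | 1 |
| 87 | 2 | N | Aisne | 26226 | 5521 | 51 | 8901 | 14572 | 12831 | 2:Med | ... | 82 | 36 | 38 | 82 | 24 | 327 | 65.945 | 7369 | 513.0 | 2 |
| 88 | 3 | C | Allier | 26747 | 7925 | 13 | 10973 | 17044 | 114121 | 2:Med | ... | 42 | 76 | 66 | 16 | 85 | 34 | 161.927 | 7340 | 298.26 | 3 |
| 89 | 4 | E | Basses-Alpes | 12935 | 7289 | 46 | 2733 | 23018 | 14238 | 1:Sm | ... | 12 | 37 | 80 | 32 | 29 | 2 | 351.399 | 6925 | 155.9 | 4 |
| 90 | 5 | E | Hautes-Alpes | 17488 | 8174 | 69 | 6962 | 23076 | 16171 | 1:Sm | ... | 23 | 64 | 79 | 35 | 7 | 1 | 320.28 | 5549 | 129.1 | 5 |
| 91 | 7 | S | Ardeche | 9474 | 10263 | 27 | 3188 | 42117 | 52547 | 1:Sm | ... | 47 | 67 | 70 | 19 | 62 | 1 | 279.413 | 5529 | 340.73 | 7 |
| 92 | 8 | N | Ardennes | 35203 | 8847 | 67 | 6400 | 16106 | 26198 | 2:Med | ... | 85 | 49 | 31 | 62 | 9 | 83 | 105.694 | 5229 | 289.62 | 8 |
| 93 | 9 | S | Ariege | 6173 | 9597 | 18 | 3542 | 22916 | 123625 | 1:Sm | ... | 28 | 63 | 75 | 22 | 77 | 3 | 385.313 | 4890 | 253.12 | 9 |
| 94 | 10 | E | Aube | 19602 | 4086 | 59 | 3608 | 18642 | 10989 | 2:Med | ... | 54 | 9 | 28 | 86 | 15 | 207 | 83.244 | 6004 | 246.36 | 10 |
| 95 | 11 | S | Aude | 15647 | 10431 | 34 | 2582 | 20225 | 66498 | 2:Med | ... | 35 | 27 | 50 | 63 | 48 | 1 | 370.949 | 6139 | 270.13 | 11 |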


## Full Test Suite
In this section, you will find a detailed overview of the available tests in the duplicates module of ydata_quality.

### Exact Duplicates

Exact duplicates are rows whose feature values are identical to those of another row in the DataFrame.

The check returns a DataFrame containing only the repeated instances; the original (i.e. first seen) rows are not included.
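
The "repeats only, first occurrence excluded" behaviour can be illustrated with plain pandas. This is a minimal sketch on a made-up frame using `DataFrame.duplicated`; it is not necessarily how ydata_quality implements the check, and `toy`, `repeat_mask`, and `exact_dups` are illustrative names.

```python
import pandas as pd

# Made-up data: rows 3 and 4 repeat rows 0 and 1.
toy = pd.DataFrame({
    "dept": [1, 2, 3, 1, 2],
    "Region": ["E", "N", "C", "E", "N"],
})

# keep="first" flags every repeat of a row except its first occurrence.
repeat_mask = toy.duplicated(keep="first")

# Select only the repeated rows; the originals (rows 0 and 1) are left out.
exact_dups = toy[repeat_mask]
print(list(exact_dups.index))  # [3, 4]
```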

In [11]:
exact_duplicates_out = dc.exact_duplicates()
exact_duplicates_out.head()

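| | dept | Region | Department | Crime_pers | Crime_prop | Literacy | Donations | Infants | Suicides | MainCity | ... | Infanticide | Donation_clergy | Lottery | Desertion | Instruction | Prostitutes | Distance | Area | Pop1831 | dept2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 86 | 1 | E | Ain | 28870 | 15890 | 37 | 5098 | 33120 | 35039 | 2:Med | ... | 60 | 69 | 41 | 55 | 46 | 13 | 218.372 | 5762 | 346.03 | 1 |
| 87 | 2 | N | Aisne | 26226 | 5521 | 51 | 8901 | 14572 | 12831 | 2:Med | ... | 82 | 36 | 38 | 82 | 24 | 327 | 65.945 | 7369 | 513.0 | 2 |
| 88 | 3 | C | Allier | 26747 | 7925 | 13 | 10973 | 17044 | 114121 | 2:Med | ... | 42 | 76 | 66 | 16 | 85 | 34 | 161.927 | 7340 | 298.26 | 3 |
| 89 | 4 | E | Basses-Alpes | 12935 | 7289 | 46 | 2733 | 23018 | 14238 | 1:Sm | ... | 12 | 37 | 80 | 32 | 29 | 2 | 351.399 | 6925 | 155.9 | 4 |
| 90 | 5 | E | Hautes-Alpes | 17488 | 8174 | 69 | 6962 | 23076 | 16171 | 1:Sm | ... | 23 | 64 | 79 | 35 | 7 | 1 | 320.28 | 5549 | 129.1 | 5 |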


### Entity Duplicates
We define an _entity_ as any feature for which a groupby-aggregation would make sense (e.g. categoricals).

Entity duplicates exist when exactly the same rows appear within a group after grouping by a given entity. Entity duplicates are by definition exact duplicates, but this perspective lets you isolate the grouping of interest (i.e. the groupby in which the duplicates occur).

You can either pass the entities to check directly or fall back to the entities set when the DuplicateChecker was created.
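
One way to picture this definition is to group a frame by the entity and look for repeated rows inside each group. The sketch below does that with plain pandas on made-up data. The helper `duplicates_by_entity` and its return shape are assumptions for illustration and may differ from what ydata_quality actually returns.

```python
import pandas as pd

# Made-up data: row 3 repeats row 1, and row 4 repeats row 0.
toy = pd.DataFrame({
    "Region": ["E", "N", "E", "N", "E"],
    "MainCity": ["2:Med", "2:Med", "1:Sm", "2:Med", "2:Med"],
    "Literacy": [37, 51, 46, 51, 37],
})


def duplicates_by_entity(df, entity):
    """Return {group key: repeated rows} for every group that contains repeats.

    `entity` is a column name, or a list of column names for a composed entity.
    """
    found = {}
    for key, group in df.groupby(entity):
        repeats = group[group.duplicated(keep="first")]
        if not repeats.empty:
            found[key] = repeats
    return found


# Single entity: keys are the entity values.
by_region = duplicates_by_entity(toy, "Region")

# Composed entity: keys are tuples, one per combination of values.
by_region_and_city = duplicates_by_entity(toy, ["Region", "MainCity"])
```

Here `by_region` has entries for `"E"` and `"N"`, while the composed grouping reports the same rows under the combinations `("E", "2:Med")` and `("N", "2:Med")`.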

In [12]:
given_entity_duplicates_out = dc.entity_duplicates('MainCity')

In [13]:
dc.entities = ['Region']
entity_duplicates_out = dc.entity_duplicates()

In [14]:
# If the entities are not specified, the test will be skipped.
dc.entities = []
dc.entity_duplicates()

[ENTITY DUPLICATES] There are no entities defined to run the analysis. Skipping the test.


In [15]:
# When passed a composed entity (a list of features), duplicates are grouped by the combination of their values
dc.entities = [['Region', 'MainCity']]
composed_entity_duplicates_out = dc.entity_duplicates()

### Column Duplicates
We define a column duplicate as any column that contains exactly the same feature values as another column in the same DataFrame.

In [16]:
dc.duplicate_columns()

{'dept': 'dept2'}
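
For intuition, a result of this shape can be reproduced with a pairwise column comparison in plain pandas. This is a rough sketch under stated assumptions: `find_duplicate_columns` is a hypothetical helper, not the library's implementation, and it relies on `Series.equals`, which requires matching values and dtypes.

```python
from itertools import combinations

import pandas as pd

# Made-up data: dept2 mirrors dept.
toy = pd.DataFrame({
    "dept": [1, 2, 3],
    "Region": ["E", "N", "C"],
    "dept2": [1, 2, 3],
})


def find_duplicate_columns(df):
    """Map each column to the first later column holding identical values."""
    dups = {}
    for left, right in combinations(df.columns, 2):
        # Series.equals compares values and dtypes, and treats NaNs in the
        # same position as equal.
        if left not in dups and df[left].equals(df[right]):
            dups[left] = right
    return dups


print(find_duplicate_columns(toy))  # {'dept': 'dept2'}
```

The pairwise loop is quadratic in the number of columns, which is fine for a frame of this size but may be slow on very wide data.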