# YData Quality - Duplicates Tutorial
Time-to-Value: 4 minutes

This notebook provides a tutorial on the ydata_quality package functionality for detecting duplicate values.

**Structure:**

1. Load dataset
2. Distort dataset
3. Instantiate the Data Quality engine
4. Run the quality checks
5. Assess the warnings
6. (Extra) Detailed overview

In [1]:
import pandas as pd
from ydata_quality.duplicates import DuplicateChecker

## Load the example dataset
We will use a transformed version of the "Guerry" dataset available from the statsmodels package.

In [2]:
df = pd.read_csv('../datasets/transformed/guerry_histdata.csv')

## Create the engine
Each engine contains the checks and tests for each suite. To create a DuplicateChecker, you provide:
- df: target DataFrame, for which we will run the test suite
- entities (optional): list of feature names for which checking duplicates after grouping-by is applicable.

In [3]:
dc = DuplicateChecker(df=df, entities=['Region', 'MainCity'])

### Full Evaluation
The easiest way to assess the data quality analysis is to run `.evaluate()`, which runs every quality check, prints the resulting warnings grouped by priority, and returns the results of each check keyed by test name.

In [4]:
results = dc.evaluate()
results.keys()


Priority 1 - heavy impact expected:
	* [DUPLICATES - DUPLICATE COLUMNS] Found 1 columns with exactly the same feature values as other columns.
Priority 2 - usage allowed, limited human intelligibility:
	* [DUPLICATES - ENTITY DUPLICATES] Found 20 duplicates after grouping by entities.
	* [DUPLICATES - EXACT DUPLICATES] Found 20 instances with exact duplicate feature values.



dict_keys(['exact_duplicates', 'entity_duplicates', 'duplicate_columns'])

## Check the status
After running the data quality checks, you can inspect the warnings raised by each individual test. The warnings are sorted by priority and carry additional details that can give Data Scientists better insight into each issue.

In [None]:
# Retrieve the warnings
warnings = dc.get_warnings()

## Full Test Suite
In this section, you will find a detailed overview of the available tests in the duplicates module of ydata_quality.

### Exact Duplicates

Exact duplicates are rows whose feature values are identical to those of at least one other row.

The method returns a DataFrame containing the duplicate instances, excluding the original (i.e. first seen) rows.

In [None]:
exact_duplicates_out = dc.exact_duplicates()
exact_duplicates_out.head()
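To make the "first seen" rule concrete, here is a minimal sketch in plain pandas. It is not the ydata_quality implementation; it only illustrates the behaviour described above with `DataFrame.duplicated(keep="first")`. The toy DataFrame and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data: rows 2 and 4 repeat rows 0 and 1 exactly.
toy = pd.DataFrame({
    "Region": ["E", "N", "E", "S", "N"],
    "MainCity": ["Med", "Sm", "Med", "Lg", "Sm"],
    "Wealth": [73, 22, 73, 61, 22],
})

# keep="first" leaves the first occurrence unmarked and flags every later repeat.
dup_mask = toy.duplicated(keep="first")

# Only the repeated instances are kept; the first-seen rows are excluded.
exact_dups = toy[dup_mask]
print(exact_dups)
```

In this sketch the first-seen rows (index 0 and 1) are left out, and only the later copies are returned.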

### Entity Duplicates
We define an _entity_ as any feature for which a groupby-aggregation would make sense (e.g. categoricals).

Entity duplicates exist when rows are exactly the same after grouping by a given entity. Entity duplicates are by definition exact duplicates, but this perspective lets you isolate the grouping of interest (i.e. the groupby for which duplicates exist).

You can either pass the entities to check for duplicates explicitly or default to the entities set when the DuplicateChecker was initialized.

In [None]:
given_entity_duplicates_out = dc.entity_duplicates('MainCity')

In [None]:
dc.entities = ['Region']
entity_duplicates_out = dc.entity_duplicates()

In [None]:
# If the entities are not specified, the test will be skipped.
dc.entities = []
dc.entity_duplicates()

In [None]:
# When passed a composed entity, get the duplicates grouped by value intersection
dc.entities = [['Region', 'MainCity']]
composed_entity_duplicates_out = dc.entity_duplicates()
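The idea of looking for duplicates within each entity group can be sketched in plain pandas. This is an illustration under stated assumptions, not the library's code: `entity_duplicates_sketch` is a hypothetical helper, the toy data is invented, and the return format (a dict keyed by group value) is chosen only for readability.

```python
import pandas as pd

def entity_duplicates_sketch(df, entity):
    """Hypothetical helper: map each entity value to its duplicated rows.

    `entity` is a column name, or a list of two or more column names
    for a composed entity.
    """
    found = {}
    for value, group in df.groupby(entity):
        # Within one group, flag rows that repeat an earlier row exactly.
        dups = group[group.duplicated(keep="first")]
        if not dups.empty:
            found[value] = dups
    return found

# Hypothetical toy data: row 2 repeats row 0; row 4 shares only the Region.
toy = pd.DataFrame({
    "Region": ["E", "N", "E", "S", "E"],
    "MainCity": ["Med", "Sm", "Med", "Lg", "Lg"],
    "Wealth": [73, 22, 73, 61, 73],
})

by_region = entity_duplicates_sketch(toy, "Region")
by_pair = entity_duplicates_sketch(toy, ["Region", "MainCity"])
```

Grouping by a single column reports the duplicates under that column's value, while the composed entity reports them under the combination of values, which mirrors the single and composed cases shown in the cells above.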

### Column Duplicates
We define a column duplicate as any column that contains exactly the same feature values as another column in the same DataFrame.

In [None]:
dc.duplicate_columns()
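For intuition, the definition above can be checked by hand with a pairwise comparison in plain pandas. This is a sketch only and makes no claim about how `duplicate_columns()` is implemented or what it returns; the column names and values are invented.

```python
from itertools import combinations

import pandas as pd

# Hypothetical toy data: "Wealth_copy" repeats "Wealth" value for value.
toy = pd.DataFrame({
    "Wealth": [73, 22, 61],
    "Wealth_copy": [73, 22, 61],
    "Literacy": [37, 51, 13],
})

# Series.equals compares values and dtype and ignores the column name.
dup_pairs = [
    (a, b)
    for a, b in combinations(toy.columns, 2)
    if toy[a].equals(toy[b])
]
print(dup_pairs)
```

The pairwise loop is quadratic in the number of columns, which is fine for a small illustration but may be slow on very wide DataFrames.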