# PyRasgo Duplicate Data

This notebook explains how to identify and handle duplicate rows with `pyrasgo`.

### Packages

This tutorial uses:
* [pandas](https://pandas.pydata.org/docs/)
* [PyRasgo](https://app.gitbook.com/@rasgo/s/rasgo-docs/pyrasgo-0.1/dataframe-prep)

In [1]:
import pandas as pd
import pyrasgo

## Connect to Rasgo

NB: The cell below does not run yet, because this functionality has not been built.

In [None]:
api_key = pyrasgo.register(email='<your email>')
rasgo = pyrasgo.connect(api_key)

## Creating the data

For this example, we will create a dataframe that contains multiple duplicated rows.

In [3]:
df = pd.DataFrame({'A': ['A']*2 + ['A', 'A', 'B', 'A', 'B']*3 + ['A', 'A', 'B'],
                   'B': ['A']*2 + ['A', 'a', 'B', 'A', 'b']*3 + ['A', 'a', 'B'],
                   'C': ['A']*2 + ['A', 'B', 'C']*5 + ['A', 'A', 'B'],
                   'D': ['A']*2 + ['A', 'a', 'B']*5 + ['A', 'A', 'B']
                  })
df

index,A,B,C,D
0,A,A,A,A
1,A,A,A,A
2,A,A,A,A
3,A,a,B,a
4,B,B,C,B
5,A,A,A,A
6,B,b,B,a
7,A,A,C,B
8,A,a,A,A
9,B,B,B,a


## Identify duplicates

### Duplicate in all columns

The function `evaluate.duplicate_rows` returns every row that duplicates an earlier row across all columns. The first occurrence of each duplicated row is not returned.

In [4]:
dups = rasgo.evaluate.duplicate_rows(df)
dups

index,A,B,C,D
1,A,A,A,A
2,A,A,A,A
5,A,A,A,A
10,A,A,C,B
15,A,A,B,a
17,A,A,A,A
18,A,a,A,A
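
PyRasgo's internals are not shown in this notebook, so the following is only a plain-pandas sketch of the same idea, assuming "duplicate" means a row that repeats an earlier row in every column. It rebuilds the example dataframe and uses `DataFrame.duplicated`; the name `dups_pd` is introduced here for illustration.

```python
import pandas as pd

# Same example dataframe as in "Creating the data".
df = pd.DataFrame({'A': ['A']*2 + ['A', 'A', 'B', 'A', 'B']*3 + ['A', 'A', 'B'],
                   'B': ['A']*2 + ['A', 'a', 'B', 'A', 'b']*3 + ['A', 'a', 'B'],
                   'C': ['A']*2 + ['A', 'B', 'C']*5 + ['A', 'A', 'B'],
                   'D': ['A']*2 + ['A', 'a', 'B']*5 + ['A', 'A', 'B']})

# duplicated() flags each row that repeats an earlier row across all columns;
# with the default keep='first', the first occurrence is not flagged.
dups_pd = df[df.duplicated()]
print(dups_pd.index.tolist())  # → [1, 2, 5, 10, 15, 17, 18]
```

On this data the sketch selects the same seven row labels as the output above.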


### Duplicate in selected columns

When given a list of columns, `evaluate.duplicate_rows` identifies duplicates based on just those columns (in this case, **A** and **B**). Currently, this function looks for duplicate values in each column individually and returns any row that has a duplicate in any of the specified columns. A row can therefore be returned even when its combination of **A** and **B** has not appeared before, as with row 3 below.

In [5]:
dups = rasgo.evaluate.duplicate_rows(df, ['A', 'B'])
dups

index,A,B,C,D
1,A,A,A,A
2,A,A,A,A
3,A,a,B,a
5,A,A,A,A
6,B,b,B,a
7,A,A,C,B
8,A,a,A,A
9,B,B,B,a
10,A,A,C,B
11,B,b,A,A
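
To make the per-column behavior described above concrete, here is a hedged pandas sketch that contrasts it with the more common joint check on the combination of columns. It is an approximation of the described behavior, not PyRasgo's actual implementation; `per_column_mask` and `joint_mask` are names made up for this example.

```python
import pandas as pd

# Same example dataframe as in "Creating the data".
df = pd.DataFrame({'A': ['A']*2 + ['A', 'A', 'B', 'A', 'B']*3 + ['A', 'A', 'B'],
                   'B': ['A']*2 + ['A', 'a', 'B', 'A', 'b']*3 + ['A', 'a', 'B'],
                   'C': ['A']*2 + ['A', 'B', 'C']*5 + ['A', 'A', 'B'],
                   'D': ['A']*2 + ['A', 'a', 'B']*5 + ['A', 'A', 'B']})
cols = ['A', 'B']

# Per-column check (assumed reading of the text above): flag a row if its
# value in ANY listed column already appeared earlier in that same column.
per_column_mask = pd.concat([df[c].duplicated() for c in cols], axis=1).any(axis=1)

# Joint check: flag a row only if its (A, B) combination appeared earlier.
joint_mask = df.duplicated(subset=cols)

print(df[per_column_mask].index.tolist()[:10])  # → [1, 2, 3, 5, 6, 7, 8, 9, 10, 11]
print(int(per_column_mask.sum()), int(joint_mask.sum()))  # → 18 16
```

The per-column mask flags row 3 because `A` has already appeared in column **A**, even though the pair (`A`, `a`) is new. The joint check does not flag it.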


## Delete duplicates

### Delete only if all columns are duplicated

The function `prune.duplicate_rows` removes duplicate rows from the data, keeping the first occurrence of each.

In [6]:
dedup_df = rasgo.prune.duplicate_rows(df)
dedup_df

Dropping 7 rows


index,A,B,C,D
0,A,A,A,A
3,A,a,B,a
4,B,B,C,B
6,B,b,B,a
7,A,A,C,B
8,A,a,A,A
9,B,B,B,a
11,B,b,A,A
12,A,A,B,a
13,A,a,C,B
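
For comparison, a minimal plain-pandas sketch of the same pruning step, assuming the intent is to drop rows that repeat an earlier row in every column. It uses `DataFrame.drop_duplicates`; `dedup_pd` is a name chosen for this example, and the sketch is not PyRasgo's implementation.

```python
import pandas as pd

# Same example dataframe as in "Creating the data".
df = pd.DataFrame({'A': ['A']*2 + ['A', 'A', 'B', 'A', 'B']*3 + ['A', 'A', 'B'],
                   'B': ['A']*2 + ['A', 'a', 'B', 'A', 'b']*3 + ['A', 'a', 'B'],
                   'C': ['A']*2 + ['A', 'B', 'C']*5 + ['A', 'A', 'B'],
                   'D': ['A']*2 + ['A', 'a', 'B']*5 + ['A', 'A', 'B']})

# drop_duplicates() keeps the first occurrence of each distinct row
# (keep='first' is the default) and preserves the original index labels.
dedup_pd = df.drop_duplicates()

print(len(df) - len(dedup_pd))  # → 7
print(dedup_pd.index.tolist())  # → [0, 3, 4, 6, 7, 8, 9, 11, 12, 13, 14, 16, 19]
```

The count of dropped rows agrees with the "Dropping 7 rows" message above.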


### Delete only if specified columns are duplicated

When given a list of columns, `prune.duplicate_rows` removes duplicates based on just those columns (in this case, **A** and **B**), keeping only the first occurrence. Currently, this function looks for duplicate values in each column individually and removes any row that has a duplicate in any of the specified columns.

In [7]:
dedup_df = rasgo.prune.duplicate_rows(df, ['A', 'B'])
dedup_df

Dropping 34 rows


index,A,B,C,D
0,A,A,A,A
4,B,B,C,B
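
The per-column pruning described above removes far more rows than a check on the combination of **A** and **B** would. The sketch below, written in plain pandas under the assumption that "duplicate in any column" means the value appeared earlier in that column, shows both variants side by side. The names `pruned_per_column` and `pruned_joint` are illustrative, and this is not PyRasgo's implementation.

```python
import pandas as pd

# Same example dataframe as in "Creating the data".
df = pd.DataFrame({'A': ['A']*2 + ['A', 'A', 'B', 'A', 'B']*3 + ['A', 'A', 'B'],
                   'B': ['A']*2 + ['A', 'a', 'B', 'A', 'b']*3 + ['A', 'a', 'B'],
                   'C': ['A']*2 + ['A', 'B', 'C']*5 + ['A', 'A', 'B'],
                   'D': ['A']*2 + ['A', 'a', 'B']*5 + ['A', 'A', 'B']})
cols = ['A', 'B']

# Per-column pruning: drop a row if its value in any listed column
# already appeared earlier in that same column.
per_column_mask = pd.concat([df[c].duplicated() for c in cols], axis=1).any(axis=1)
pruned_per_column = df[~per_column_mask]

# Joint pruning: drop a row only if its (A, B) combination appeared earlier.
pruned_joint = df.drop_duplicates(subset=cols)

print(pruned_per_column.index.tolist())  # → [0, 4]
print(pruned_joint.index.tolist())       # → [0, 3, 4, 6]
```

If the goal is one row per distinct (**A**, **B**) pair, the joint variant is the one to use. The per-column variant keeps a row only when every listed column holds a value not seen before.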
