# Handling Duplicates

## Detecting and Removing Duplicates

### Detecting Duplicates

Detecting duplicates in a dataset is crucial for ensuring data integrity and accuracy. Duplicates can arise from data entry errors, merging of datasets, or other data processing steps. Pandas provides several functions to detect duplicates in a DataFrame.

#### Using `duplicated()`
The `duplicated()` method returns a Boolean Series that marks a row as `True` when it repeats an earlier row; the first occurrence of each row is marked `False`. By default it compares all columns, but it can be restricted to specific columns.


In [6]:
import pandas as pd

# Example DataFrame with duplicates
data = {'Name': ['Alice', 'Bob', 'Alice', 'David', 'Alice'],
        'Age': [24, 27, 24, 29, 24],
        'Score': [85.5, 90.0, 85.5, 88.0, 85.5]}

df = pd.DataFrame(data)

# Detecting duplicate rows
duplicates = df.duplicated()
print(duplicates)

0    False
1    False
2     True
3    False
4     True
dtype: bool
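
Because the result is a Boolean Series, it can also be summarized directly. The following is a minimal sketch that rebuilds the same example DataFrame so it runs on its own; the names `mask`, `n_duplicates`, and `has_duplicates` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Same example data as above, rebuilt so this cell is self-contained
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Alice', 'David', 'Alice'],
                   'Age': [24, 27, 24, 29, 24],
                   'Score': [85.5, 90.0, 85.5, 88.0, 85.5]})

mask = df.duplicated()

# True counts as 1, so the sum is the number of rows flagged as duplicates
n_duplicates = int(mask.sum())

# any() answers the yes/no question "does this DataFrame contain duplicates?"
has_duplicates = bool(mask.any())

print(n_duplicates)
print(has_duplicates)
```

A quick count like this is often enough to decide whether any cleanup is needed before inspecting individual rows.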


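#### Finding Duplicate Rows

To retrieve the duplicate rows themselves, use the Boolean Series returned by `duplicated()` as a mask for boolean indexing. With the default settings, the first occurrence of each row is not included in the result.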

In [7]:
# Finding duplicate rows
duplicate_rows = df[df.duplicated()]
print(duplicate_rows)

    Name  Age  Score
2  Alice   24   85.5
4  Alice   24   85.5
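
To inspect every copy of a repeated row, including the first one, `duplicated()` accepts `keep=False`, which flags all occurrences. A short sketch on the same example data (the variable name `all_copies` is illustrative):

```python
import pandas as pd

# Same example data as above, rebuilt so this cell is self-contained
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Alice', 'David', 'Alice'],
                   'Age': [24, 27, 24, 29, 24],
                   'Score': [85.5, 90.0, 85.5, 88.0, 85.5]})

# keep=False marks every member of a duplicate group, not just the repeats
all_copies = df[df.duplicated(keep=False)]
print(all_copies)
```

Here the result contains the rows at index 0, 2, and 4, which makes it easier to review a whole group of duplicates side by side.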


#### Detecting Duplicates in Specific Columns

You can pass the `subset` parameter to check for duplicates in specific columns only. This is useful when a few columns identify a record and differences in the remaining columns should be ignored.

In [8]:
# Detecting duplicates based on specific columns
duplicates_specific = df.duplicated(subset=['Name', 'Age'])
print(duplicates_specific)

0    False
1    False
2     True
3    False
4     True
dtype: bool
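
In the example above, the result matches the full-row check because the repeated rows are identical in every column. The sketch below uses a small hypothetical DataFrame (`df_scores`, not part of the original data) in which one repeated person has a different score, so the two checks disagree.

```python
import pandas as pd

# Hypothetical data: Alice appears twice with the same age but different scores
df_scores = pd.DataFrame({'Name': ['Alice', 'Bob', 'Alice'],
                          'Age': [24, 27, 24],
                          'Score': [85.5, 90.0, 91.0]})

# Comparing all columns: no row is an exact copy of an earlier one
full_row = df_scores.duplicated()

# Comparing only Name and Age: the second Alice row is flagged
by_identity = df_scores.duplicated(subset=['Name', 'Age'])

print(full_row.tolist())
print(by_identity.tolist())
```

Which columns belong in `subset` depends on what identifies a record in your data; `Name` and `Age` are only an assumption made for this illustration.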


### Handling Duplicates

Once duplicates are detected, the usual way to handle them is to drop the repeated rows, keeping either the first or the last occurrence of each.

#### Dropping Duplicate Rows

The `drop_duplicates()` function removes duplicate rows from a DataFrame. By default, it keeps the first occurrence and drops the subsequent ones.

In [9]:
# Dropping duplicate rows
df_no_duplicates = df.drop_duplicates()
print(df_no_duplicates)

    Name  Age  Score
0  Alice   24   85.5
1    Bob   27   90.0
3  David   29   88.0
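
`drop_duplicates()` returns a new DataFrame and leaves the original unchanged, and the surviving rows keep their original index labels (0, 1, 3 above). As a hedged sketch, the call below combines `subset` with `ignore_index=True` to renumber the result; `ignore_index` assumes pandas 1.0 or later, and the variable name `deduped` is illustrative.

```python
import pandas as pd

# Same example data as above, rebuilt so this cell is self-contained
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Alice', 'David', 'Alice'],
                   'Age': [24, 27, 24, 29, 24],
                   'Score': [85.5, 90.0, 85.5, 88.0, 85.5]})

# Treat rows with the same Name as duplicates and renumber the result from 0
deduped = df.drop_duplicates(subset=['Name'], ignore_index=True)
print(deduped)

# The original DataFrame still has all five rows
print(len(df))
```

On older pandas versions, calling `reset_index(drop=True)` on the result gives the same renumbering.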


#### Keeping the Last Occurrence

You can change the `keep` parameter to keep the last occurrence of each duplicate.

In [10]:
# Dropping duplicates, keeping the last occurrence
df_keep_last = df.drop_duplicates(keep='last')
print(df_keep_last)

    Name  Age  Score
1    Bob   27   90.0
3  David   29   88.0
4  Alice   24   85.5
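
The `keep` parameter also accepts `False`, which drops every row that has a duplicate instead of keeping one representative. A minimal sketch on the same example data (the variable name `df_unique_only` is illustrative):

```python
import pandas as pd

# Same example data as above, rebuilt so this cell is self-contained
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Alice', 'David', 'Alice'],
                   'Age': [24, 27, 24, 29, 24],
                   'Score': [85.5, 90.0, 85.5, 88.0, 85.5]})

# keep=False removes all copies of any repeated row
df_unique_only = df.drop_duplicates(keep=False)
print(df_unique_only)
```

Only the Bob and David rows remain here, since all three Alice rows are copies of one another. This option fits cases where a repeated record is considered unreliable and should be excluded entirely.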
