## **Handling Duplicate Values**

Duplicate records are common in real-world datasets and can distort analysis and model performance, for example by inflating counts and skewing averages. This notebook demonstrates how to identify and handle them using pandas.

## **Import Required Libraries**

In [1]:
import pandas as pd

## **Create Sample Dataset with Duplicates**

In [3]:
df = pd.DataFrame({
    'EmployeeID': [101, 102, 103, 104, 102, 105, 103],
    'Department': ['HR', 'IT', 'Finance', 'HR', 'IT', 'Finance', 'Finance'],
    'Salary': [50000, 60000, 70000, 52000, 60000, 72000, 70000],
    'JoiningDate': ['2020-01-10', '2019-03-15', '2021-06-01', '2020-01-10', '2019-03-15', '2022-08-20', '2021-06-01']
})

df

   EmployeeID Department  Salary JoiningDate
0         101         HR   50000  2020-01-10
1         102         IT   60000  2019-03-15
2         103    Finance   70000  2021-06-01
3         104         HR   52000  2020-01-10
4         102         IT   60000  2019-03-15
5         105    Finance   72000  2022-08-20
6         103    Finance   70000  2021-06-01


## **Identify Duplicate Records**

## **1. Check duplicates for the entire DataFrame**

In [4]:
df.duplicated()

0    False
1    False
2    False
3    False
4     True
5    False
6     True
dtype: bool

## **2. View only duplicate rows**

In [5]:
df[df.duplicated()]

   EmployeeID Department  Salary JoiningDate
4         102         IT   60000  2019-03-15
6         103    Finance   70000  2021-06-01


## **3. Check duplicates based on specific columns**

In [6]:
df.duplicated(subset=['EmployeeID'])

0    False
1    False
2    False
3    False
4     True
5    False
6     True
dtype: bool
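
In the sample data every repeated EmployeeID is also a full-row duplicate, so this output matches the whole-DataFrame check. The two checks diverge once a key repeats with different values elsewhere in the row. A minimal sketch of that case, using a hypothetical `records` frame that is not part of the dataset above:

```python
import pandas as pd

# Hypothetical records: employee 102 appears twice with different salaries.
records = pd.DataFrame({
    'EmployeeID': [101, 102, 102],
    'Department': ['HR', 'IT', 'IT'],
    'Salary': [50000, 60000, 65000],
})

# Full-row check: no row repeats exactly, so nothing is flagged.
full_row_flags = records.duplicated()

# Key-based check: the second row for EmployeeID 102 is flagged.
key_flags = records.duplicated(subset=['EmployeeID'])

print(full_row_flags.tolist())  # [False, False, False]
print(key_flags.tolist())       # [False, False, True]
```

Which check is appropriate depends on what counts as "the same record" in the data at hand.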

## **Understanding duplicated() Parameters**

- **keep='first'** (default): Flags every occurrence except the first as a duplicate
- **keep='last'**: Flags every occurrence except the last as a duplicate
- **keep=False**: Flags every occurrence of a repeated value, including the first and the last

In [7]:
df.duplicated(subset=['EmployeeID'], keep='last')

0    False
1     True
2     True
3    False
4    False
5    False
6    False
dtype: bool

In [8]:
df.duplicated(subset=['EmployeeID'], keep=False)

0    False
1     True
2     True
3    False
4     True
5    False
6     True
dtype: bool
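
Because keep=False flags every member of a duplicate group, it is handy for reviewing the groups side by side before deciding what to drop. A small sketch of that inspection step, rebuilding the sample DataFrame so the block runs on its own (the `duplicate_groups` name is only illustrative):

```python
import pandas as pd

# Same sample data as earlier in the notebook.
df = pd.DataFrame({
    'EmployeeID': [101, 102, 103, 104, 102, 105, 103],
    'Department': ['HR', 'IT', 'Finance', 'HR', 'IT', 'Finance', 'Finance'],
    'Salary': [50000, 60000, 70000, 52000, 60000, 72000, 70000],
    'JoiningDate': ['2020-01-10', '2019-03-15', '2021-06-01', '2020-01-10',
                    '2019-03-15', '2022-08-20', '2021-06-01'],
})

# Select every row whose EmployeeID occurs more than once, then sort so
# that rows sharing an ID sit next to each other.
duplicate_groups = (
    df[df.duplicated(subset=['EmployeeID'], keep=False)]
    .sort_values('EmployeeID', kind='stable')
)

print(duplicate_groups)
```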

## **Remove Duplicate Records**

## **1. Remove full row duplicates**

In [None]:
df.drop_duplicates()
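
Note that drop_duplicates() returns a new DataFrame and leaves df unchanged, so the result has to be assigned to be kept. The sketch below illustrates this, assuming pandas 1.0 or later for the ignore_index option, which renumbers the surviving rows from 0 (the `deduped` name is only illustrative):

```python
import pandas as pd

# Same sample data as earlier in the notebook.
df = pd.DataFrame({
    'EmployeeID': [101, 102, 103, 104, 102, 105, 103],
    'Department': ['HR', 'IT', 'Finance', 'HR', 'IT', 'Finance', 'Finance'],
    'Salary': [50000, 60000, 70000, 52000, 60000, 72000, 70000],
    'JoiningDate': ['2020-01-10', '2019-03-15', '2021-06-01', '2020-01-10',
                    '2019-03-15', '2022-08-20', '2021-06-01'],
})

# Assign the result; the original df still holds all 7 rows afterwards.
deduped = df.drop_duplicates(ignore_index=True)

print(len(df), len(deduped))   # 7 5
print(deduped.index.tolist())  # [0, 1, 2, 3, 4]
```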

## **2. Remove duplicates based on a column**

In [None]:
df.drop_duplicates(subset=['EmployeeID'])

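## **3. Keep the last occurrence**

keep='last' retains the last occurrence in row order. That is the most recent record only when the rows are already sorted chronologically.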

In [None]:
df.drop_duplicates(subset=['EmployeeID'], keep='last')
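
When rows are not stored in chronological order, sorting by a date column first makes keep='last' select the most recent record per key. A minimal sketch, using a hypothetical `history` frame and a hypothetical `UpdatedOn` column that are not part of the sample dataset:

```python
import pandas as pd

# Hypothetical history: employee 102 has two records, and the newer one
# happens to appear first in the table.
history = pd.DataFrame({
    'EmployeeID': [102, 101, 102],
    'Salary': [65000, 50000, 60000],
    'UpdatedOn': ['2023-05-01', '2020-01-10', '2019-03-15'],
})
history['UpdatedOn'] = pd.to_datetime(history['UpdatedOn'])

# Without sorting, keep='last' keeps whichever row comes last in the table,
# which here is the older record for employee 102.
naive = history.drop_duplicates(subset=['EmployeeID'], keep='last')

# Sorting by date first makes the last occurrence the most recent one.
latest = (
    history.sort_values('UpdatedOn', kind='stable')
    .drop_duplicates(subset=['EmployeeID'], keep='last')
    .sort_values('EmployeeID')
)

print(naive['Salary'].tolist())   # [50000, 60000]
print(latest['Salary'].tolist())  # [50000, 65000]
```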

## **4. Count Duplicate Records**

In [None]:
df.duplicated().sum()
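
The sum works because True counts as 1, so it reports how many rows are flagged as repeats. To see which keys are repeated and how often, one option is value_counts() on the key column. A short sketch, rebuilding the sample DataFrame so the block runs on its own (`n_full`, `id_counts`, and `repeated_ids` are illustrative names):

```python
import pandas as pd

# Same sample data as earlier in the notebook.
df = pd.DataFrame({
    'EmployeeID': [101, 102, 103, 104, 102, 105, 103],
    'Department': ['HR', 'IT', 'Finance', 'HR', 'IT', 'Finance', 'Finance'],
    'Salary': [50000, 60000, 70000, 52000, 60000, 72000, 70000],
    'JoiningDate': ['2020-01-10', '2019-03-15', '2021-06-01', '2020-01-10',
                    '2019-03-15', '2022-08-20', '2021-06-01'],
})

# Number of rows flagged as full-row repeats (first occurrences excluded).
n_full = int(df.duplicated().sum())

# Occurrences per EmployeeID, filtered to IDs that appear more than once.
id_counts = df['EmployeeID'].value_counts()
repeated_ids = id_counts[id_counts > 1].sort_index()

print(n_full)                  # 2
print(repeated_ids.to_dict())  # {102: 2, 103: 2}
```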

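## **Best Practices**

- Always check for duplicates before starting analysis
- Let business logic decide which record to keep (first, last, or most recent by date)
- Identifier columns such as EmployeeID should ideally be unique; repeated IDs usually signal a data-quality issue
- Removing duplicates blindly may discard valid records, so inspect the flagged rows before dropping them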
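
The points above can be combined into a small audit step: set aside the rows that will be removed, drop them, then verify that the key is unique. This is one possible sketch rather than a prescribed workflow, and the `removed` and `cleaned` names are only illustrative:

```python
import pandas as pd

# Same sample data as earlier in the notebook.
df = pd.DataFrame({
    'EmployeeID': [101, 102, 103, 104, 102, 105, 103],
    'Department': ['HR', 'IT', 'Finance', 'HR', 'IT', 'Finance', 'Finance'],
    'Salary': [50000, 60000, 70000, 52000, 60000, 72000, 70000],
    'JoiningDate': ['2020-01-10', '2019-03-15', '2021-06-01', '2020-01-10',
                    '2019-03-15', '2022-08-20', '2021-06-01'],
})

# Keep a copy of the rows about to be dropped so they can be reviewed later.
removed = df[df.duplicated(subset=['EmployeeID'])]

# Drop repeats, keeping the first occurrence of each EmployeeID.
cleaned = df.drop_duplicates(subset=['EmployeeID'])

# Sanity checks: the key is now unique and no rows are unaccounted for.
assert cleaned['EmployeeID'].is_unique
assert len(cleaned) + len(removed) == len(df)

print(len(removed), len(cleaned))  # 2 5
```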

## **Conclusion**

Handling duplicates is a critical data-cleaning step. The correct approach depends on whether duplicates are true errors or valid repeated events.



