In [1]:
import pandas as pd
import numpy as np

In [2]:
students = [('Sri',    34, 'Sydeny'),
            ('Sai',    30, 'Delhi'),
            ('Anand',  16, 'Singapore'),
            ('Steve',  30, 'Delhi'),
            ('Anand',   16, 'Singapore'),
            ('Priti',    30, 'Mumbai'),
            ('popcorn.ai', 40, 'Delhi'),
            ('popcorn.ai', 30, 'Delhi'),
            ('Anand',   16, 'Singapore'),
            ]

In [3]:
# Create a DataFrame object
df = pd.DataFrame(students, columns=['Name', 'Age', 'City'])
df

         Name  Age       City
0         Sri   34     Sydeny
1         Sai   30      Delhi
2       Anand   16  Singapore
3       Steve   30      Delhi
4       Anand   16  Singapore
5       Priti   30     Mumbai
6  popcorn.ai   40      Delhi
7  popcorn.ai   30      Delhi
8       Anand   16  Singapore


#### Find duplicate rows based on all columns

In [4]:
df.duplicated()

0    False
1    False
2    False
3    False
4     True
5    False
6    False
7    False
8     True
dtype: bool

In [5]:
df.duplicated(keep='last')

0    False
1    False
2     True
3    False
4     True
5    False
6    False
7    False
8    False
dtype: bool

In [6]:
# Select duplicate rows except the last occurrence based on all columns
# (keep='last' leaves the last occurrence unmarked and flags the earlier ones)
duplicateRowsDF = df[df.duplicated(keep='last')]

print("Duplicate Rows except last occurrence based on all columns are :")
duplicateRowsDF

Duplicate Rows except last occurrence based on all columns are :

    Name  Age       City
2  Anand   16  Singapore
4  Anand   16  Singapore


In [7]:
# Select duplicate rows except the first occurrence based on all columns
# (keep='first' leaves the first occurrence unmarked and flags the later ones)
duplicateRowsDF = df[df.duplicated(keep='first')]

print("Duplicate Rows except first occurrence based on all columns are :")
duplicateRowsDF

Duplicate Rows except first occurrence based on all columns are :

    Name  Age       City
4  Anand   16  Singapore
8  Anand   16  Singapore

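Both `keep='first'` and `keep='last'` hide one member of each duplicate group. To inspect every row that has an identical twin, `keep=False` marks all occurrences. A minimal sketch on the same data; the name `all_dupes` is only illustrative.

```python
import pandas as pd

students = [('Sri', 34, 'Sydeny'), ('Sai', 30, 'Delhi'),
            ('Anand', 16, 'Singapore'), ('Steve', 30, 'Delhi'),
            ('Anand', 16, 'Singapore'), ('Priti', 30, 'Mumbai'),
            ('popcorn.ai', 40, 'Delhi'), ('popcorn.ai', 30, 'Delhi'),
            ('Anand', 16, 'Singapore')]
df = pd.DataFrame(students, columns=['Name', 'Age', 'City'])

# keep=False flags every row that is repeated anywhere in the frame,
# including the first and last occurrence
all_dupes = df[df.duplicated(keep=False)]
print(all_dupes)
```

This is handy for eyeballing a duplicate group before deciding which copy to keep.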

#### Find duplicate rows based on selected columns

In [8]:
# Select duplicate rows (all but the first occurrence) based on one column
duplicateRowsDF = df[df.duplicated(['Name'], keep='first')]

print("Duplicate Rows based on a single column are:", duplicateRowsDF, sep='\n')

Duplicate Rows based on a single column are:
         Name  Age       City
4       Anand   16  Singapore
7  popcorn.ai   30      Delhi
8       Anand   16  Singapore


In [9]:
# Select duplicate rows (all but the first occurrence) based on a list of column names
duplicateRowsDF = df[df.duplicated(['Age', 'City'], keep='first')]
 
print("Duplicate Rows based on 2 columns are:", duplicateRowsDF, sep='\n')

Duplicate Rows based on 2 columns are:
         Name  Age       City
3       Steve   30      Delhi
4       Anand   16  Singapore
7  popcorn.ai   30      Delhi
8       Anand   16  Singapore

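Finding duplicates is usually followed by counting and dropping them. The sketch below uses `DataFrame.drop_duplicates()`, which accepts the same `subset` and `keep` arguments as `duplicated()`. The variable names (`n_repeats`, `deduped`, `by_name`) are illustrative, and keeping the last row per `Name` is an assumption about what "latest record wins" might mean for this data.

```python
import pandas as pd

students = [('Sri', 34, 'Sydeny'), ('Sai', 30, 'Delhi'),
            ('Anand', 16, 'Singapore'), ('Steve', 30, 'Delhi'),
            ('Anand', 16, 'Singapore'), ('Priti', 30, 'Mumbai'),
            ('popcorn.ai', 40, 'Delhi'), ('popcorn.ai', 30, 'Delhi'),
            ('Anand', 16, 'Singapore')]
df = pd.DataFrame(students, columns=['Name', 'Age', 'City'])

# Count fully repeated rows (first occurrences are not counted)
n_repeats = int(df.duplicated().sum())

# Drop exact duplicates, keeping the first occurrence of each row
deduped = df.drop_duplicates()

# Key-based de-duplication: one row per Name, keeping the last one seen
by_name = df.drop_duplicates(subset=['Name'], keep='last')

print(n_repeats, len(deduped), len(by_name))
```

Note that `drop_duplicates()` returns a new frame and preserves the original index labels; add `.reset_index(drop=True)` if a clean 0..n-1 index is needed.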

Question: Would we handle duplicates in pandas, or at the source (the extract/transform stage of the ETL pipeline)?

#### Best Practices for Handling Duplicates in Datasets (AI/ML)

1. **Identify Duplicates**
   - Use methods like `DataFrame.duplicated()` in Pandas to identify duplicate rows.

   - Consider duplicates across all columns or across a specific subset of columns (the `subset` argument), depending on the context.

2. **Understand the Cause of Duplicates**
   - Analyze why duplicates exist: data entry errors, merging multiple datasets, etc.

3. **Determine the Impact on the Model**
   - Assess how duplicates affect model performance, e.g., biasing the model toward classes in which duplicates are more frequent, or inflating evaluation scores when the same row appears in both the training and test sets.


4. **Decide Whether to Remove or Retain Duplicates**
   - In some cases, duplicates may be legitimate (e.g., repeated measurements) and should be retained.

   - In other cases, removing duplicates may reduce noise and improve model generalization.


5. **De-duplication Strategies**
   - **Exact Match**: Remove exact duplicates.

   - **Near Duplicates**: Consider records that are similar but not identical (e.g., spelling errors, minor differences). Use techniques like fuzzy matching.
     
   - **Key-based Deduplication**: Deduplicate based on specific key columns, ignoring others.

6. **Handling Duplicates in Time-Series Data** (important)
   - In time-series datasets, duplicates may occur due to sensor errors or repeated events. Decide whether to aggregate them (e.g., average), keep a single reading, or retain them all, based on context.


7. **Handling Duplicates in Text Data** (important)
   - In NLP datasets, duplicates (e.g., repeated sentences or documents) can distort the model. Use hashing to catch exact repeats, and techniques like TF-IDF vectors with cosine similarity to identify near-duplicates before removing them.

8. **Evaluate Model Performance Before and After De-duplication**
   - Compare model metrics (accuracy, F1-score, etc.) with and without duplicates to understand their impact.


9. **Document Your Decisions**
   - Keep a record of decisions regarding duplicates, including why certain duplicates were retained or removed, for reproducibility and transparency.

10. **Consider Domain-specific Implications**
    - Duplicates may have different meanings in different domains (e.g., customer data vs. medical records). Tailor your approach accordingly.

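For the near-duplicate strategy in item 5, a rough sketch using only the standard library's `difflib.SequenceMatcher`. The city list, the `similarity` helper, and the 0.8 threshold are all assumptions chosen for illustration; a real dataset would need a threshold tuned on labelled examples, and dedicated fuzzy-matching libraries may scale better.

```python
from difflib import SequenceMatcher
from itertools import combinations

def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Hypothetical values, including one misspelling of the same city
cities = ['Sydney', 'Sydeny', 'Delhi', 'New Delhi', 'Mumbai', 'Singapore']

THRESHOLD = 0.8  # assumed cut-off, not a recommended default

# Compare every pair once and keep those above the threshold
near_dupes = [(a, b) for a, b in combinations(cities, 2)
              if similarity(a, b) >= THRESHOLD]
print(near_dupes)
```

Pairwise comparison is quadratic in the number of distinct values, so it is usually applied to the unique values of a column rather than to every row.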

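For the time-series case in item 6, two common options are sketched below: averaging readings that share a timestamp, or keeping only the last reading per timestamp. The `readings` frame and its column names are hypothetical; which option is right depends on why the duplicates occurred.

```python
import pandas as pd

# Hypothetical sensor log with repeated timestamps
readings = pd.DataFrame({
    'timestamp': pd.to_datetime(['2024-01-01 00:00', '2024-01-01 00:00',
                                 '2024-01-01 00:05', '2024-01-01 00:10',
                                 '2024-01-01 00:10']),
    'temperature': [20.0, 22.0, 21.0, 19.0, 19.0],
})

# Option 1: aggregate duplicates by averaging per timestamp
averaged = readings.groupby('timestamp', as_index=False)['temperature'].mean()

# Option 2: keep only the last reading recorded for each timestamp
latest = readings.drop_duplicates(subset='timestamp', keep='last')

print(averaged)
print(latest)
```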

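For the text case in item 7, TF-IDF on its own only produces vectors; near-duplicates are found by comparing those vectors, typically with cosine similarity. The sketch below hand-rolls a small TF-IDF (with a smoothed IDF) to stay dependency-free. The documents, helper names, and the 0.75 threshold are assumptions; in practice a library implementation such as scikit-learn's `TfidfVectorizer` would likely be used instead.

```python
import math
from collections import Counter
from itertools import combinations

# Hypothetical corpus: the first two documents differ by one word
docs = [
    "the cat sat on the mat",
    "the cat sat on a mat",
    "stock prices fell sharply today",
]

def tfidf_vectors(texts):
    """Return one {term: weight} dict per text using a smoothed IDF."""
    tokenized = [t.lower().split() for t in texts]
    n = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            term: count * (math.log((1 + n) / (1 + doc_freq[term])) + 1)
            for term, count in tf.items()
        })
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

THRESHOLD = 0.75  # assumed cut-off for "near-duplicate"

vectors = tfidf_vectors(docs)
similar_pairs = [(i, j) for i, j in combinations(range(len(docs)), 2)
                 if cosine(vectors[i], vectors[j]) >= THRESHOLD]
print(similar_pairs)
```

Once similar pairs are found, one document from each pair can be dropped, or the pair can be reviewed manually if the corpus is small.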

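One concrete way duplicates can distort evaluation (items 3 and 8) is when the same row lands in both the training and test split. A minimal check, assuming two hypothetical frames `train` and `test` with identical columns: an inner merge on all shared columns returns the overlapping rows, and a left merge with `indicator=True` removes them from the test set.

```python
import pandas as pd

# Hypothetical splits that share one identical row
train = pd.DataFrame({'Name': ['Sri', 'Sai', 'Anand'],
                      'Age': [34, 30, 16],
                      'City': ['Sydney', 'Delhi', 'Singapore']})
test = pd.DataFrame({'Name': ['Anand', 'Priti'],
                     'Age': [16, 30],
                     'City': ['Singapore', 'Mumbai']})

# With no `on` argument, merge joins on every column the frames share,
# so the result holds rows that appear in both splits
leaked = train.merge(test, how='inner')

# Keep only the test rows that have no match in train
flagged = test.merge(train.drop_duplicates(), how='left', indicator=True)
test_clean = flagged[flagged['_merge'] == 'left_only'].drop(columns='_merge')

print(leaked)
print(test_clean)
```

Comparing model metrics on `test` versus `test_clean` gives a rough sense of how much the overlap was inflating the scores.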