## Handle duplicate rows in a Pandas DataFrame

Create a hardcoded Pandas DataFrame df that contains some duplicate rows, then use several methods to find and handle those duplicates.

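First, use the duplicated() method to find duplicate rows, comparing either all columns or only a subset of columns. It returns a boolean Series that is True for every row repeating an earlier row; by default the first occurrence is not flagged.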
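In the DataFrame used in this notebook, the subset check and the all-columns check flag the same row, so the difference is easy to miss. The following is a minimal sketch on a separate, hypothetical DataFrame (sample and its values are assumptions made for illustration) where the two checks disagree:

```python
import pandas as pd

# Hypothetical sample data: row 3 matches rows 1 and 2 on A and B but differs in C.
sample = pd.DataFrame({'A': [1, 2, 2, 2],
                       'B': ['a', 'b', 'b', 'b'],
                       'C': [10, 20, 20, 99]})

# Compare every column: only row 2 repeats an earlier row exactly.
mask_all = sample.duplicated()

# Compare A and B only: rows 2 and 3 both repeat row 1.
mask_subset = sample.duplicated(subset=['A', 'B'])

print(mask_all.tolist())     # [False, False, True, False]
print(mask_subset.tolist())  # [False, False, True, True]
```

Indexing with either mask, as in sample[mask_subset], returns the flagged rows themselves.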

Then use the drop_duplicates() method to remove the duplicate rows, again comparing either all columns or only a subset of columns. It returns a new DataFrame and leaves df unchanged.
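As a rough illustration of how the subset argument changes what gets dropped, here is a small sketch on hypothetical data (sample and its values are assumptions, not part of the notebook's df). It also uses ignore_index=True, which assumes pandas 1.0 or later, to relabel the remaining rows:

```python
import pandas as pd

# Hypothetical sample data: rows 1 and 2 share A and B but differ in C.
sample = pd.DataFrame({'A': [1, 2, 2, 3],
                       'B': ['a', 'b', 'b', 'c'],
                       'C': [10, 20, 99, 30]})

# No row matches another on every column, so nothing is dropped.
deduped_all = sample.drop_duplicates()

# Comparing A and B only, row 2 counts as a duplicate of row 1.
deduped_subset = sample.drop_duplicates(subset=['A', 'B'])

# ignore_index=True relabels the surviving rows 0..n-1 (pandas >= 1.0).
deduped_reindexed = sample.drop_duplicates(subset=['A', 'B'], ignore_index=True)

print(len(deduped_all))                  # 4
print(deduped_subset.index.tolist())     # [0, 1, 3]
print(deduped_reindexed.index.tolist())  # [0, 1, 2]
```

Without ignore_index, the result keeps the original row labels, which is why the notebook's deduplicated output can show gaps in the index.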

Also use the keep parameter to choose which occurrence of each duplicated row survives: keep='first' (the default) retains the first occurrence, and keep='last' retains the last.
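When duplicates are exact copies, keep='first' and keep='last' differ only in which index label survives. The choice matters more when deduplicating on a subset, because the retained row then determines the values of the other columns. A minimal sketch, using hypothetical data (sample is an assumption for illustration):

```python
import pandas as pd

# Hypothetical sample data: rows 0 and 2 share A and B but carry different C values.
sample = pd.DataFrame({'A': [1, 2, 1],
                       'B': ['a', 'b', 'a'],
                       'C': [10, 20, 30]})

# Keep the first row seen for each (A, B) pair.
first = sample.drop_duplicates(subset=['A', 'B'], keep='first')

# Keep the last row seen for each (A, B) pair.
last = sample.drop_duplicates(subset=['A', 'B'], keep='last')

print(first['C'].tolist())  # [10, 20]
print(last['C'].tolist())   # [20, 30]
```

In both cases the surviving rows stay in their original order.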

Finally, use the drop_duplicates() method with keep=False to drop every occurrence of a duplicated row, keeping only the rows that appear exactly once.
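The keep=False option is also accepted by duplicated(), where it flags every member of a duplicate group instead of only the later ones. The sketch below, on hypothetical data (sample is an assumption for illustration), shows the two uses side by side:

```python
import pandas as pd

# Hypothetical sample data: rows 1 and 3 are identical.
sample = pd.DataFrame({'A': [1, 2, 3, 2],
                       'B': ['a', 'b', 'c', 'b']})

# keep=False flags every member of a duplicate group, including the first one.
all_copies = sample[sample.duplicated(keep=False)]

# keep=False drops every member, leaving only rows that occur exactly once.
unique_only = sample.drop_duplicates(keep=False)

print(all_copies.index.tolist())   # [1, 3]
print(unique_only.index.tolist())  # [0, 2]
```

Viewing all_copies before dropping anything can help when deciding which occurrence, if any, should be kept.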

The printed output shows the original DataFrame, the duplicate rows found using all columns and using a subset of columns, and each version of the DataFrame after the duplicates have been handled.

In [1]:
import pandas as pd

# create a hardcoded Pandas DataFrame with some duplicate rows
df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 2],
                   'B': ['a', 'b', 'c', 'd', 'e', 'b'],
                   'C': [10, 20, 30, 40, 50, 20]})

# find the duplicate rows based on all columns
duplicate_rows = df[df.duplicated()]

# find the duplicate rows based on a subset of columns
duplicate_rows_subset = df[df.duplicated(subset=['A', 'B'])]

# drop the duplicate rows based on all columns
df_deduped = df.drop_duplicates()

# drop the duplicate rows based on a subset of columns
df_deduped_subset = df.drop_duplicates(subset=['A', 'B'])

# keep only the first occurrence of each duplicate row
df_first_occurrence = df.drop_duplicates(keep='first')

# keep only the last occurrence of each duplicate row
df_last_occurrence = df.drop_duplicates(keep='last')

# keep only the rows that are not duplicates
df_no_duplicates = df.drop_duplicates(keep=False)

# print the resulting DataFrames
print('Original DataFrame:\n', df)
print('\nDuplicate rows:\n', duplicate_rows)
print('\nDuplicate rows based on subset of columns:\n', duplicate_rows_subset)
print('\nDeduplicated DataFrame based on all columns:\n', df_deduped)
print('\nDeduplicated DataFrame based on subset of columns:\n', df_deduped_subset)
print('\nDataFrame with only first occurrence of each duplicate row:\n', df_first_occurrence)
print('\nDataFrame with only last occurrence of each duplicate row:\n', df_last_occurrence)
print('\nDataFrame with no duplicate rows:\n', df_no_duplicates)


Original DataFrame:
    A  B   C
0  1  a  10
1  2  b  20
2  3  c  30
3  4  d  40
4  5  e  50
5  2  b  20

Duplicate rows:
    A  B   C
5  2  b  20

Duplicate rows based on subset of columns:
    A  B   C
5  2  b  20

Deduplicated DataFrame based on all columns:
    A  B   C
0  1  a  10
1  2  b  20
2  3  c  30
3  4  d  40
4  5  e  50

Deduplicated DataFrame based on subset of columns:
    A  B   C
0  1  a  10
1  2  b  20
2  3  c  30
3  4  d  40
4  5  e  50

DataFrame with only first occurrence of each duplicate row:
    A  B   C
0  1  a  10
1  2  b  20
2  3  c  30
3  4  d  40
4  5  e  50

DataFrame with only last occurrence of each duplicate row:
    A  B   C
0  1  a  10
2  3  c  30
3  4  d  40
4  5  e  50
5  2  b  20

DataFrame with no duplicate rows:
    A  B   C
0  1  a  10
2  3  c  30
3  4  d  40
4  5  e  50
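As a quick cross-check of the output above, the sketch below counts the duplicates in the same df and confirms that filtering with the negated duplicated() mask gives the same result as drop_duplicates(). The variable names n_duplicates and same_result are introduced here for illustration only:

```python
import pandas as pd

# Same data as the notebook's df.
df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 2],
                   'B': ['a', 'b', 'c', 'd', 'e', 'b'],
                   'C': [10, 20, 30, 40, 50, 20]})

# Number of rows that repeat an earlier row.
n_duplicates = int(df.duplicated().sum())

# Negating the mask keeps the same rows that drop_duplicates() keeps.
same_result = df[~df.duplicated()].equals(df.drop_duplicates())

print(n_duplicates)  # 1
print(same_result)   # True
```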
