The PySpark API provides two functions to handle duplicate rows:
1. distinct()       : Drops rows that are duplicated across all columns of the DataFrame.
2. dropDuplicates() : Drops rows that are duplicated on a selected subset of (one or more) columns.

## Initialize SparkSession

In [3]:
from __future__ import print_function
import findspark
findspark.init()
import pyspark
from pyspark.sql import SparkSession

In [5]:
spark = SparkSession.builder.appName("Handle Duplicates").getOrCreate()

## Create Dataframe with sample data

Create a DataFrame with some fully duplicated rows and some values repeated in only a few columns.

In [8]:
columns = ["first", "last", "address", "city", "state"]
data = [
    ("satish", "kumar", "EC", "Bangalore", "KAR"),
    ("Teja", "", "EC", "Bangalore", "KAR"),
    ("Ramya", "Sree", "EC", "Bangalore", "KAR"),
    ("satish", "kumar", "TR", "Yemmiganur", "AP"),
    ("Ramya", "Sree", "IN", "Anantapur", "AP"),
    ("satish", "kumar", "EC", "Bangalore", "KAR"),
    ("Ramya", "Sree", "EC", "Bangalore", "KAR"),
    ("satish", "kumar", "EC", "Bangalore", "KAR"),
]

In [9]:
df = spark.createDataFrame(data=data, schema=columns)
df.printSchema()

root
 |-- first: string (nullable = true)
 |-- last: string (nullable = true)
 |-- address: string (nullable = true)
 |-- city: string (nullable = true)
 |-- state: string (nullable = true)



In [10]:
df.show(truncate=False)

+------+-----+-------+----------+-----+
|first |last |address|city      |state|
+------+-----+-------+----------+-----+
|satish|kumar|EC     |Bangalore |KAR  |
|Teja  |     |EC     |Bangalore |KAR  |
|Ramya |Sree |EC     |Bangalore |KAR  |
|satish|kumar|TR     |Yemmiganur|AP   |
|Ramya |Sree |IN     |Anantapur |AP   |
|satish|kumar|EC     |Bangalore |KAR  |
|Ramya |Sree |EC     |Bangalore |KAR  |
|satish|kumar|EC     |Bangalore |KAR  |
+------+-----+-------+----------+-----+



## Drop duplicate rows (all columns) from DataFrame

As you can see, the DataFrame has 8 records in total, 3 of which are exact copies of another row. Applying distinct() to the DataFrame should therefore return 5 rows.
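The row arithmetic can be cross-checked without Spark. The following is a minimal plain-Python sketch that assumes the same eight tuples as the sample data; `Counter` and the variable names are illustrative choices and not part of the PySpark API.

```python
from collections import Counter

# Same eight records as the sample DataFrame
data = [
    ("satish", "kumar", "EC", "Bangalore", "KAR"),
    ("Teja", "", "EC", "Bangalore", "KAR"),
    ("Ramya", "Sree", "EC", "Bangalore", "KAR"),
    ("satish", "kumar", "TR", "Yemmiganur", "AP"),
    ("Ramya", "Sree", "IN", "Anantapur", "AP"),
    ("satish", "kumar", "EC", "Bangalore", "KAR"),
    ("Ramya", "Sree", "EC", "Bangalore", "KAR"),
    ("satish", "kumar", "EC", "Bangalore", "KAR"),
]

# Count how often each complete row occurs
counts = Counter(data)

total_rows = len(data)                      # all records
distinct_rows = len(counts)                 # unique full rows
extra_copies = total_rows - distinct_rows   # rows that distinct() would drop

print(total_rows, distinct_rows, extra_copies)
```

Here the ("satish", "kumar", "EC", ...) row occurs three times and the ("Ramya", "Sree", "EC", ...) row twice, which accounts for the three extra copies.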

In [11]:
distinct_df = df.distinct()
distinct_df.count()

5

In [12]:
distinct_df.show(truncate=False)

+------+-----+-------+----------+-----+
|first |last |address|city      |state|
+------+-----+-------+----------+-----+
|Teja  |     |EC     |Bangalore |KAR  |
|satish|kumar|EC     |Bangalore |KAR  |
|Ramya |Sree |EC     |Bangalore |KAR  |
|Ramya |Sree |IN     |Anantapur |AP   |
|satish|kumar|TR     |Yemmiganur|AP   |
+------+-----+-------+----------+-----+



## Drop duplicate rows based on selected (one or multiple) columns

In [13]:
distinct_df = df.dropDuplicates(["first","last"])
distinct_df.show(truncate=False)

+------+-----+-------+---------+-----+
|first |last |address|city     |state|
+------+-----+-------+---------+-----+
|satish|kumar|EC     |Bangalore|KAR  |
|Teja  |     |EC     |Bangalore|KAR  |
|Ramya |Sree |EC     |Bangalore|KAR  |
+------+-----+-------+---------+-----+



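Here dropDuplicates() drops the records that have duplicate values in the columns 'first' and 'last', keeping one row per combination. Which of the duplicate rows is kept is not guaranteed, so the values in the remaining columns can differ between runs.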