# Duplicates

Merging different data sources can create duplicates, which add a lot of bias to the data. Imagine running a telco marketing campaign and targeting the same person multiple times. Always handle duplicates before any other data preparation step to avoid unexpected results. Let's use the Iris dataset to see how to handle duplicated values.

In [130]:
from vertica_ml_python import *
vdf = vDataFrame("iris")
print(vdf)

| | SepalLengthCm | Species | PetalWidthCm | PetalLengthCm | SepalWidthCm |
|---|---|---|---|---|---|
| 0 | 4.3 | Iris-setosa | 0.1 | 1.1 | 3.0 |
| 1 | 4.4 | Iris-setosa | 0.2 | 1.4 | 2.9 |
| 2 | 4.4 | Iris-setosa | 0.2 | 1.3 | 3.0 |
| 3 | 4.4 | Iris-setosa | 0.2 | 1.3 | 3.2 |
| 4 | 4.5 | Iris-setosa | 0.3 | 1.3 | 2.3 |
| | ... | ... | ... | ... | ... |


<object>  Name: iris, Number of rows: 150, Number of columns: 5


To find all the duplicated rows, use the `duplicated` method.

In [131]:
vdf.duplicated()

| | SepalLengthCm | Species | PetalWidthCm | PetalLengthCm | SepalWidthCm | occurrence |
|---|---|---|---|---|---|---|
| 0 | 4.9 | Iris-setosa | 0.1 | 1.5 | 3.1 | 3 |
| 1 | 5.8 | Iris-virginica | 1.9 | 5.1 | 2.7 | 2 |


<object>  Name: Duplicated Rows (total = 3), Number of rows: 2, Number of columns: 6

On this dataset, the method returns flowers that share exactly the same characteristics. That does not make them true duplicates: two distinct flowers can have identical measurements, so there is no need to drop them.
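
To make the `occurrence` column and the `total = 3` figure in the output above concrete, here is a minimal sketch in plain Python. It does not use `vertica_ml_python`: the helper names `find_duplicates` and `count_redundant` are hypothetical, and the sample is a toy list built from values shown in the outputs above, not the full Iris table. It also assumes that `total` counts the redundant copies, which is consistent with occurrences of 3 and 2.

```python
from collections import Counter

# Toy sample, not the full Iris table. Column order follows the output above:
# (SepalLengthCm, Species, PetalWidthCm, PetalLengthCm, SepalWidthCm).
rows = [
    (4.3, "Iris-setosa", 0.1, 1.1, 3.0),
    (4.9, "Iris-setosa", 0.1, 1.5, 3.1),
    (4.9, "Iris-setosa", 0.1, 1.5, 3.1),
    (5.8, "Iris-virginica", 1.9, 5.1, 2.7),
    (4.9, "Iris-setosa", 0.1, 1.5, 3.1),
    (5.8, "Iris-virginica", 1.9, 5.1, 2.7),
]


def find_duplicates(rows):
    """Return {row: occurrence} for every row that appears more than once."""
    counts = Counter(rows)
    return {row: n for row, n in counts.items() if n > 1}


def count_redundant(duplicates):
    """Number of rows to remove so that each duplicated row is kept once."""
    return sum(n - 1 for n in duplicates.values())


dups = find_duplicates(rows)
for row, n in dups.items():
    print(row, "occurrence =", n)
print("total =", count_redundant(dups))
```

A row counts as a duplicate only when every column matches, which is why flowers with identical measurements and species are reported even though they may be distinct specimens.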

If you do want to remove the duplicates, use the `drop_duplicates` method.

In [132]:
vdf.drop_duplicates()

3 element(s) was/were filtered


| | SepalLengthCm | Species | PetalWidthCm | PetalLengthCm | SepalWidthCm |
|---|---|---|---|---|---|
| 0 | 4.3 | Iris-setosa | 0.1 | 1.1 | 3.0 |
| 1 | 4.4 | Iris-setosa | 0.2 | 1.4 | 2.9 |
| 2 | 4.4 | Iris-setosa | 0.2 | 1.3 | 3.0 |
| 3 | 4.4 | Iris-setosa | 0.2 | 1.3 | 3.2 |
| 4 | 4.5 | Iris-setosa | 0.3 | 1.3 | 2.3 |
| | ... | ... | ... | ... | ... |


<object>  Name: iris, Number of rows: 147, Number of columns: 5

This method adds an advanced analytic function to the generated SQL, which is quite expensive to compute. Prefer calling it after aggregating the data, when fewer rows remain, to avoid heavy computations during the rest of the process.
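
The source does not show the SQL that the library generates. As an illustration only, the sketch below assumes a common deduplication pattern based on the `ROW_NUMBER()` analytic function, and runs it on an in-memory SQLite database as a stand-in for Vertica. The table name `flowers`, its two columns, and the helper `dedup_with_row_number` are all hypothetical.

```python
import sqlite3


def dedup_with_row_number(rows):
    """Keep one row per (sepal_length, species) group using ROW_NUMBER().

    Assumed pattern, not the library's actual generated SQL.
    Window functions require SQLite 3.25 or later.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE flowers (sepal_length REAL, species TEXT)")
    conn.executemany("INSERT INTO flowers VALUES (?, ?)", rows)
    query = """
        SELECT sepal_length, species
        FROM (
            SELECT sepal_length, species,
                   -- Number the rows inside each group of identical values.
                   ROW_NUMBER() OVER (
                       PARTITION BY sepal_length, species
                   ) AS rn
            FROM flowers
        )
        WHERE rn = 1              -- keep only the first row of each group
        ORDER BY sepal_length, species
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


sample = [
    (4.9, "Iris-setosa"),
    (4.9, "Iris-setosa"),
    (5.8, "Iris-virginica"),
]
print(dedup_with_row_number(sample))
```

With this pattern the database has to group every row by all the partitioning columns before it can filter, which gives an idea of why running the step on already aggregated, smaller data is cheaper.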

Let's see in the next lesson how to handle outliers.