![title](Header__0009_1.png "Header")
___
# Chapter 2 - Data Preparation Basics
## Segment 3 - Removing duplicates

In [1]:
import numpy as np
import pandas as pd

from pandas import Series, DataFrame

### Removing duplicates

In [2]:
DF_obj = DataFrame({'column 1': [1, 1, 2, 2, 3, 3, 3],
                  'column 2': ['a', 'a', 'b', 'b', 'c', 'c', 'c'],
                  'column 3': ['A', 'A', 'B', 'B', 'C', 'C', 'C']})
DF_obj

Unnamed: 0,column 1,column 2,column 3
0,1,a,A
1,1,a,A
2,2,b,B
3,2,b,B
4,3,c,C
5,3,c,C
6,3,c,C


In [3]:
# object_name.duplicated()
# ( WHAT THIS DOES )
# The .duplicated() method checks each row in the DataFrame and returns True or False to 
# indicate whether it is a duplicate of a row found earlier in the DataFrame.
DF_obj.duplicated()

0    False
1     True
2    False
3     True
4    False
5     True
6     True
dtype: bool

So, looking at our results here, we see that a False value was returned for row zero, and that makes sense, since there are no rows that came before it. 

But let's look at a row that returned a value of True, row six (going by the index labels). If we look at row four, we can see that row six is a duplicate of it, yet row four returned a value of False. In other words, it is not counted as a duplicate. That's because row four was the first row to contain that exact combination of values. Any subsequent rows that have the same combination of values are counted as duplicates and return a True value.
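The "first occurrence wins" behavior described above is the default of the `keep` parameter of `.duplicated()`. As a minimal sketch, assuming the same seven-row data frame (the variable names `df`, `first_flags`, `last_flags`, `all_flags`, and `n_duplicates` are just illustrative), here is how the other `keep` settings change which rows get flagged:

```python
import pandas as pd

# Same data as DF_obj above; 'df' is an illustrative name.
df = pd.DataFrame({'column 1': [1, 1, 2, 2, 3, 3, 3],
                   'column 2': ['a', 'a', 'b', 'b', 'c', 'c', 'c'],
                   'column 3': ['A', 'A', 'B', 'B', 'C', 'C', 'C']})

# Default (keep='first'): the first occurrence is False, later copies are True.
first_flags = df.duplicated()

# keep='last': the last occurrence is False, earlier copies are True.
last_flags = df.duplicated(keep='last')

# keep=False: every row that has a copy anywhere is flagged True.
all_flags = df.duplicated(keep=False)

# Summing the boolean Series counts the rows flagged as duplicates.
n_duplicates = int(first_flags.sum())
print(n_duplicates)
```

Counting the flags this way is a quick check of how many rows a later `drop_duplicates()` call would remove.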

Now that we've found the duplicate records, let's look at how we can drop them. 

In [4]:
# object_name.drop_duplicates()
# ( WHAT THIS DOES )
# To drop all duplicate rows, just call the drop_duplicates() method off of the DataFrame.
DF_obj.drop_duplicates()

Unnamed: 0,column 1,column 2,column 3
0,1,a,A
2,2,b,B
4,3,c,C


So, the second row, the one with an index value of one, was dropped, and that makes sense, because it's a duplicate of the first row in the data frame. 

And then, the next row that was dropped was the fourth row (index three), which makes sense because it's a duplicate of the third row (index two), and so on. So yes, all of our duplicate rows have been dropped from our data frame. 

I also want to show you how to drop records based on column values. In order to do that, I want to make a small change to our data frame.
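One thing worth noting is that `drop_duplicates()` returns a new data frame and leaves the original untouched, and the surviving rows keep their original index labels (0, 2, 4 in the output above). A minimal sketch, assuming the same data and using the illustrative names `df`, `deduped`, and `renumbered`:

```python
import pandas as pd

# Same data as DF_obj above; 'df' is an illustrative name.
df = pd.DataFrame({'column 1': [1, 1, 2, 2, 3, 3, 3],
                   'column 2': ['a', 'a', 'b', 'b', 'c', 'c', 'c'],
                   'column 3': ['A', 'A', 'B', 'B', 'C', 'C', 'C']})

# drop_duplicates() returns a new DataFrame; df itself still has 7 rows.
deduped = df.drop_duplicates()

# The kept rows retain their original labels, so the index has gaps.
print(deduped.index.tolist())

# reset_index(drop=True) renumbers the rows 0..n-1 and discards the old labels.
renumbered = deduped.reset_index(drop=True)
print(renumbered)
```

If you want to keep the de-duplicated result, assign it to a name as shown here; calling the method without assignment only displays the result.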

In [5]:
# Let's reuse the code that we used to create the data frame, changing the sixth value 
# in 'column 3' from 'C' to 'D' for the purpose of our demonstration. 
DF_obj = DataFrame({'column 1': [1, 1, 2, 2, 3, 3, 3],
                  'column 2': ['a', 'a', 'b', 'b', 'c', 'c', 'c'],
                  'column 3': ['A', 'A', 'B', 'B', 'C', 'D', 'C']})
DF_obj

Unnamed: 0,column 1,column 2,column 3
0,1,a,A
1,1,a,A
2,2,b,B
3,2,b,B
4,3,c,C
5,3,c,D
6,3,c,C


In [6]:
# object_name.drop_duplicates(['column_name'])
# ( WHAT THIS DOES )
# To drop rows based on duplicates in only one column Series, just call the drop_duplicates() 
# method off of the DataFrame, and pass in the label-index of the column you want the de-duplication 
# to be based on. For each value in that column, the first row is kept and later rows with the same value are dropped.
DF_obj.drop_duplicates(['column 3'])

Unnamed: 0,column 1,column 2,column 3
0,1,a,A
2,2,b,B
4,3,c,C
5,3,c,D


And just as we predicted, it dropped the rows that had the index values one, three, and six. Now we have no duplicates in column three.

Now that I've shown you how to drop duplicates from your data, I just want to highlight that it's really important to check your data for duplicates, and remove them if you find them. Now it's time to move on to data concatenation and transformation.
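Before moving on, here is one supplementary sketch of column-based de-duplication. It assumes the modified data frame from In [5] and uses the `subset` and `keep` parameters of `drop_duplicates()`; the names `df`, `by_col3`, `by_col1`, and `by_col3_last` are illustrative only. It shows that the choice of column, and of which occurrence to keep, changes which rows survive:

```python
import pandas as pd

# Same data as the modified DF_obj (sixth value of 'column 3' is 'D').
df = pd.DataFrame({'column 1': [1, 1, 2, 2, 3, 3, 3],
                   'column 2': ['a', 'a', 'b', 'b', 'c', 'c', 'c'],
                   'column 3': ['A', 'A', 'B', 'B', 'C', 'D', 'C']})

# De-duplicate on 'column 3' only (same as passing the list positionally).
by_col3 = df.drop_duplicates(subset=['column 3'])

# De-duplicate on 'column 1' instead: the row containing 'D' is now dropped,
# because its 'column 1' value (3) already appeared in an earlier row.
by_col1 = df.drop_duplicates(subset=['column 1'])

# keep='last' retains the last row for each 'column 3' value instead of the first.
by_col3_last = df.drop_duplicates(subset=['column 3'], keep='last')

print(by_col3.index.tolist())
print(by_col1.index.tolist())
print(by_col3_last.index.tolist())
```

Which column to base the de-duplication on depends on what counts as "the same record" in your data, so it is worth checking the result against the full-row version before discarding anything.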