Duplicate data is very common when working with time series. It can happen for a variety of reasons, and I've run into it more than once and tried different approaches to eliminate the duplicate values. There's a [gem of a solution on Stack Overflow](https://stackoverflow.com/questions/13035764/remove-rows-with-duplicate-indices-pandas-dataframe-and-timeseries#34297689) and I thought it would be helpful to walk through the possible solutions to this issue.

To keep things simple, I'll just work with a ```Series``` of floating point data. This could be anything, but we could pretend it's something that's manually maintained, like an earnings estimate for a stock, a temperature reading, or sales for a store on a given date.

In [17]:
import pandas as pd
import numpy as np

items = pd.Series(np.random.random_sample(10) * 100, pd.date_range('2020-01-01', periods=10))

items

2020-01-01    78.535174
2020-01-02    66.960187
2020-01-03    14.711618
2020-01-04    90.651161
2020-01-05     3.127869
2020-01-06    40.417499
2020-01-07    46.791960
2020-01-08    40.818973
2020-01-09    85.778448
2020-01-10    73.401114
Freq: D, dtype: float64

At this point, we have 10 periods of data, and every value in the index (a ```DatetimeIndex``` with 10 days) is unique. Since the values are random, yours will differ, and the outputs later in this post appear to come from a separate run, so they won't match the numbers above either. Now let's say that corrected data shows up in the same source file. I'll do something a bit contrived here and concatenate two ```Series``` that share some of the same dates, but in real life you can imagine a number of ways that your sources could end up with duplicated data for the same timestamp.

In [4]:
corrected = pd.Series(np.random.random_sample(3) * 75, pd.date_range('2020-01-04', periods=3))

combined = pd.concat([items, corrected])
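
To see what the duplication looks like, here is a small sketch with fixed values (the numbers and the names ```original```, ```fix``` and ```both``` are made up for illustration, not the random data above). Selecting a repeated date by label returns more than one row:

```python
import pandas as pd

# Hypothetical fixed values so the result is reproducible
original = pd.Series([10.0, 20.0, 30.0], index=pd.date_range('2020-01-01', periods=3))
fix = pd.Series([25.0], index=pd.date_range('2020-01-02', periods=1))

both = pd.concat([original, fix])

# Label-based selection on a duplicated timestamp returns every matching row
dupes = both.loc[pd.Timestamp('2020-01-02')]
print(dupes)
```

This is the kind of surprise that makes duplicate timestamps worth cleaning up before doing any further work.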

Now, how do we get rid of this duplicated data? Let's say that we want to only keep the most recent data in our file, assuming that it was a correction or updated value that we prefer to use. Instead of going right to the accepted solution on Stack Overflow, I'm going to work through the pandas documentation to see what the possible solutions are, and hopefully end up in the same place!

First, let's see if we can answer the question of whether our data has duplicate items in the index. In the pandas docs, we see a few promising candidates, including a [```duplicated```](https://pandas.pydata.org/docs/reference/api/pandas.Index.duplicated.html) method and a ```has_duplicates``` property. Let's start with ```has_duplicates``` and see if it reports what we expect.

In [6]:
combined.index.has_duplicates

True
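
As a small aside, the check can be paired with a count of how many entries are repeats. This sketch uses a tiny hand-written index (an assumption for illustration) and the ```is_unique``` property, which is the opposite of ```has_duplicates```:

```python
import pandas as pd

# Hypothetical index with one repeated date
idx = pd.DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-02', '2020-01-03'])

print(idx.has_duplicates)   # True when any label repeats
print(idx.is_unique)        # the logical opposite of has_duplicates

# duplicated() flags repeats, so summing the flags counts them
n_repeats = int(idx.duplicated().sum())
print(n_repeats)
```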

Now the methods to look at are ```duplicated``` and ```drop_duplicates```. The ```duplicated``` method returns an array of boolean values, where ```True``` marks an entry as a duplicate. The ```keep``` argument controls which occurrence is treated as the original (and so is not flagged): the first (the default) or the last. With ```drop_duplicates```, you get an ```Index``` returned with the duplicates already removed, and you can pass in the same ```keep``` argument with the same meaning.

In [7]:
combined.index.duplicated(keep='last')

array([False, False, False,  True,  True,  True, False, False, False,
       False, False, False, False])

In [8]:
combined.index.drop_duplicates(keep='last')

DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-07',
               '2020-01-08', '2020-01-09', '2020-01-10', '2020-01-04',
               '2020-01-05', '2020-01-06'],
              dtype='datetime64[ns]', freq=None)
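
To make the effect of ```keep``` concrete, here is a minimal sketch on a short string index (chosen only because it is easy to read; the same applies to a ```DatetimeIndex```). Besides ```'first'``` and ```'last'```, ```keep=False``` flags every occurrence of a repeated label:

```python
import pandas as pd

idx = pd.Index(['a', 'b', 'a', 'c', 'a'])

first = idx.duplicated(keep='first')  # flags every occurrence after the first
last = idx.duplicated(keep='last')    # flags every occurrence before the last
every = idx.duplicated(keep=False)    # flags all occurrences of a repeated label

print(first)
print(last)
print(every)
```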

Ok, so what do we do now with these two options? The boolean array from the first option can be used to pick out values, but its ```True``` values mark the ones we want to drop. That's easy to fix: just invert it with a ```~```.

In [9]:
~combined.index.duplicated(keep='last')

array([ True,  True,  True, False, False, False,  True,  True,  True,
        True,  True,  True,  True])

That mask can be used to select the values we want out of the ```Series```, and gets us to a good solution. Because the corrected values were appended at the end, the result is not in chronological order, so we also sort the index.

In [10]:
combined[~combined.index.duplicated(keep='last')].sort_index()

2020-01-01    32.709476
2020-01-02    60.135948
2020-01-03    63.326407
2020-01-04    40.548428
2020-01-05     1.234698
2020-01-06    23.512759
2020-01-07    56.705172
2020-01-08    29.921226
2020-01-09    73.158245
2020-01-10    68.840243
dtype: float64
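
If you do this often, the mask-and-sort pattern could be wrapped in a small function. The helper name ```drop_duplicate_index``` and the fixed sample values below are my own invention for illustration, not something pandas provides:

```python
import pandas as pd


def drop_duplicate_index(obj, keep='last'):
    """Hypothetical helper: drop rows whose index label repeats, then sort by index."""
    return obj[~obj.index.duplicated(keep=keep)].sort_index()


# Made-up values so the outcome is easy to check by eye
original = pd.Series([1.0, 2.0, 3.0, 4.0], index=pd.date_range('2020-01-01', periods=4))
fix = pd.Series([20.0, 30.0], index=pd.date_range('2020-01-02', periods=2))

deduped = drop_duplicate_index(pd.concat([original, fix]))
print(deduped)
```

Passing ```keep='first'``` instead would keep the original values and discard the corrections.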

Now if we want to use the second method, ```drop_duplicates```, we need to find a way to use it to grab the values we want to keep out of our ```Series```. This is a bit more complicated. First, we can use the ```reset_index``` method, which is a handy way to take the index (in our case a ```DatetimeIndex```) and temporarily turn it into a column of a ```DataFrame``` with a new regular, non-repeating index.

In [11]:
combined.reset_index()

        index          0
0  2020-01-01  32.709476
1  2020-01-02  60.135948
2  2020-01-03  63.326407
3  2020-01-04  49.435518
4  2020-01-05  75.190352
5  2020-01-06  11.834493
6  2020-01-07  56.705172
7  2020-01-08  29.921226
8  2020-01-09  73.158245
9  2020-01-10  68.840243
10 2020-01-04  40.548428
11 2020-01-05   1.234698
12 2020-01-06  23.512759


Now we can use ```drop_duplicates```, but we'll use the ```DataFrame``` version, which adds a ```subset``` argument that restricts the duplicate check to certain columns (here, our new 'index' column). Since we now have a ```DataFrame``` and not a ```Series```, we turn the 'index' column back into the index with ```set_index``` and select the column ```0``` to get a ```Series``` again. This gives us the same result as the earlier method, but in a much more roundabout way.

In [12]:
combined.reset_index().drop_duplicates(subset='index', keep='last').set_index('index')[0].sort_index()

index
2020-01-01    32.709476
2020-01-02    60.135948
2020-01-03    63.326407
2020-01-04    40.548428
2020-01-05     1.234698
2020-01-06    23.512759
2020-01-07    56.705172
2020-01-08    29.921226
2020-01-09    73.158245
2020-01-10    68.840243
Name: 0, dtype: float64
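
The default column names 'index' and ```0``` are a little cryptic. One possible variation, sketched here with made-up values and the hypothetical names 'date' and 'value', is to name the index and the ```Series``` first so the chain reads more clearly:

```python
import pandas as pd

# Made-up values for illustration
original = pd.Series([1.0, 2.0, 3.0], index=pd.date_range('2020-01-01', periods=3))
fix = pd.Series([20.0], index=pd.date_range('2020-01-02', periods=1))

# 'value' and 'date' are hypothetical names chosen for readability
both = pd.concat([original, fix]).rename('value').rename_axis('date')

deduped = (
    both.reset_index()                                # columns: 'date', 'value'
        .drop_duplicates(subset='date', keep='last')  # keep the last row per date
        .set_index('date')['value']                   # back to a Series
        .sort_index()
)
print(deduped)
```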

One other way to do this is to use ```groupby``` with an aggregation function (in this case ```last```) to select the values we want. This method gives us sorted output for free and also looks simple.

In [13]:
combined.groupby(combined.index).last()

2020-01-01    32.709476
2020-01-02    60.135948
2020-01-03    63.326407
2020-01-04    40.548428
2020-01-05     1.234698
2020-01-06    23.512759
2020-01-07    56.705172
2020-01-08    29.921226
2020-01-09    73.158245
2020-01-10    68.840243
dtype: float64
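
One difference worth knowing about: as far as I know, ```last``` on a ```groupby``` returns the last non-null value in each group, so the two approaches can disagree when a "correction" is itself missing. A minimal sketch with made-up values, assuming that default behavior:

```python
import numpy as np
import pandas as pd

original = pd.Series([1.0, 2.0], index=pd.date_range('2020-01-01', periods=2))
# Hypothetical "correction" that blanks out the 2020-01-02 value
fix = pd.Series([np.nan], index=pd.date_range('2020-01-02', periods=1))
both = pd.concat([original, fix])

# The mask keeps the last row for each date, even if it is NaN
by_mask = both[~both.index.duplicated(keep='last')].sort_index()

# groupby(...).last() skips NaN by default and falls back to the earlier value
by_group = both.groupby(both.index).last()

print(by_mask)
print(by_group)
```

Which behavior you want depends on whether a missing value in the later data should count as a correction.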

What's the best way to do this? Like the answer on Stack Overflow, I prefer the first method for readability, but the last is also pretty simple. One good argument for choosing the first method is speed.

In [14]:
%timeit combined[~combined.index.duplicated(keep='last')].sort_index()

271 µs ± 10.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [15]:
%timeit combined.reset_index().drop_duplicates(subset='index', keep='last').set_index('index')[0].sort_index()

1.6 ms ± 62.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [16]:
%timeit combined.groupby(combined.index).last()

595 µs ± 7.57 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
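
The timings above are for a 13-row ```Series```, so it may be worth repeating the comparison on something larger. Here is a rough sketch using the standard library ```timeit``` module outside of a notebook. The data size, the ```freq='min'``` spacing and the function names are arbitrary choices of mine, and the numbers will depend on your machine and pandas version:

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = pd.Series(
    rng.random(100_000),
    index=pd.date_range('2000-01-01', periods=100_000, freq='min'),
)
fixes = base.iloc[::10] * 2   # hypothetical corrections for every tenth timestamp
big = pd.concat([base, fixes])


def with_mask():
    return big[~big.index.duplicated(keep='last')].sort_index()


def with_groupby():
    return big.groupby(big.index).last()


for fn in (with_mask, with_groupby):
    seconds = timeit.timeit(fn, number=10)
    print(f'{fn.__name__}: {seconds / 10 * 1e3:.2f} ms per call')
```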


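On this small example, the first method took roughly half the time of ```groupby``` and about a sixth of the time of the ```reset_index``` approach. Well, after digging through all that, I hope you understand a bit more about how to remove duplicate items from a ```Series``` or ```DataFrame``` and why some methods might be better choices than others.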
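
Everything above used a ```Series```, but the same mask works on a ```DataFrame``` because it only looks at the index. A minimal sketch, with made-up column names and values:

```python
import pandas as pd

# Hypothetical frame with one repeated date
idx = pd.DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-02'])
df = pd.DataFrame({'estimate': [1.0, 2.0, 2.5], 'source': ['a', 'a', 'b']}, index=idx)

# Whole rows are kept or dropped based on their index label
deduped = df[~df.index.duplicated(keep='last')]
print(deduped)
```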