
This file explores duplicate values seen in the Iceberg Tracking Beacon Database (ITDB).

Derek Mueller 

Several beacons have data with repeated positions at consecutive dates/times, which seems suspicious. In my experience, there is almost always some jitter in GPS data and a lot of jitter in ARGOS data. Examples (beacon ID, model, total number of positions):

* 2018_300234066545280 FT-2000 6058
* 2009_300034012571050 ICEB-I-XA 14710
* 2017_300234063516450 iCALIB 2429
  
Here is a bit of the 300034012571050 data. Note that the distance and direction rounding was turned off to generate this. 

RECREATE!!

| datetime_data       | latitude | longitude | temperature_air | distance  | speed | direction |
|---------------------|----------|-----------|-----------------|-----------|-------|-----------|
| 2017-07-25 18:00:00 | 76.3194  | -75.0602  | 5.4             | 1607.1913 | 0.446 | 92.36     |
| 2017-07-25 19:00:00 | 76.3194  | -75.0602  | 7.0             | 0         | 0.0   | 180       |
| 2017-07-25 20:00:00 | 76.3194  | -75.0602  | 9.8             | 0         | 0.0   | 180       |
| 2017-07-25 21:00:00 | 76.295   | -74.993   | 9.8             | 3251.8739 | 0.903 | 146.86    |
| 2017-07-25 22:00:00 | 76.295   | -74.993   | 9.3             | 0         | 0.0   | 180       |
| 2017-07-25 23:00:00 | 76.295   | -74.993   | 8.8             | 0         | 0.0   | 180       |
| 2017-07-26 00:00:00 | 76.2778  | -74.9708  | 9.7             | 2007.9831 | 0.557 | 162.97    |
| 2017-07-26 01:00:00 | 76.2778  | -74.9708  | 8.9             | 0         | 0.0   | 180       |
| 2017-07-26 02:00:00 | 76.2778  | -74.9708  | 5.4             | 0         | 0     | 180       |

The following notebook will review duplicates in the ITDB. 

In [11]:
# imports

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import pyproj


In [12]:

def count_decimal_places(value):
    if "." in str(value):  # Check if there is a decimal point
        return len(
            str(value).split(".")[-1]
        )  # Count the characters after the decimal point
    return 0
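
Note that `str()` on a float drops trailing zeros and switches to scientific notation for very small values (e.g. `1e-05`), so the count from `count_decimal_places` reflects how the value was parsed, not necessarily how it was recorded. A minimal sketch of an alternative using the standard-library `decimal` module follows; the name `count_decimal_places_decimal` is hypothetical and not part of the original analysis.

```python
from decimal import Decimal


def count_decimal_places_decimal(value):
    """Count decimal places via Decimal, which also handles scientific notation."""
    exponent = Decimal(str(value)).as_tuple().exponent
    # NaN and infinity carry a non-integer exponent marker; report 0 for them
    if not isinstance(exponent, int):
        return 0
    # a negative exponent is the number of digits after the decimal point
    return max(0, -exponent)


print(count_decimal_places_decimal(76.3194))
print(count_decimal_places_decimal(1e-05))
```

Trailing zeros are still lost once a value has been parsed as a float, so this only addresses the scientific-notation case.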


In [13]:

def lon_precision_v_distance(lat):
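    """
    Generate the distance represented by a change of 1 in each decimal place of longitude.

    Parameters
    ----------
    lat : float
        A valid latitude (decimal degrees)

    Returns
    -------
    pandas.DataFrame
        One row per decimal place (1 to 7) with the corresponding distance in metres (distance_m)

    """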
    geodesic = pyproj.Geod(ellps="WGS84")

    lons = [
        -90.1,
        -90.2,
        -80.01,
        -80.02,
        -70.001,
        -70.002,
        -60.0001,
        -60.0002,
        -50.00001,
        -50.00002,
        -40.000001,
        -40.000002,
        -30.0000001,
        -30.0000002,
    ]

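    decimal_places = pd.Series(range(1, int(len(lons) / 2) + 1))
    # distance from each longitude to the previous one in the list (the first is NaN)
    az, baz, dist = geodesic.inv(
        [np.nan] + lons[:-1],
        [lat] * len(lons),
        lons,
        [lat] * len(lons),
    )
    dist = pd.Series(dist)
    # keep the smaller half of the distances: these are the steps within each pair,
    # not the large jumps between pairs
    distance_m = dist[dist < dist.quantile(0.51)].reset_index(drop=True)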

    return pd.DataFrame({"decimal_places": decimal_places, "distance_m": distance_m})
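
As a cross-check on the geodesic calculation, the length of a small step along a parallel can be approximated in closed form from the WGS84 ellipsoid, where the radius of the parallel at latitude φ is a·cos φ / sqrt(1 − e² sin² φ). The sketch below is an approximation under the assumption that such a small step can be treated as an arc along the parallel; `lon_step_distance_m` is a hypothetical helper, not part of the original notebook.

```python
import math

WGS84_A = 6378137.0  # semi-major axis (m)
WGS84_F = 1 / 298.257223563  # flattening
WGS84_E2 = WGS84_F * (2 - WGS84_F)  # first eccentricity squared


def lon_step_distance_m(lat, decimal_places):
    """Approximate distance (m) for a change of 1 in the given longitude decimal place."""
    phi = math.radians(lat)
    # radius of the parallel (circle of constant latitude) on the ellipsoid
    radius = WGS84_A * math.cos(phi) / math.sqrt(1 - WGS84_E2 * math.sin(phi) ** 2)
    step_deg = 10.0 ** (-decimal_places)
    return radius * math.radians(step_deg)


for lat in (80, 70, 60):
    print(lat, round(lon_step_distance_m(lat, 1), 1))
```

The values should agree closely with the `distance_m` column returned by `lon_precision_v_distance` for the same latitude.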


In [14]:
### MAIN

# To do this analysis, the rounding of distance, direction and speed were commented out in the speed() function so the distance and direction could be examined. 

# export the database to csv and read it in.  
df = pd.read_csv("/home/dmueller/Desktop/cis_iceberg_beacon_database_0.3/TestDuplicates/alldata.csv")

print(f"The database has {len(df)} iceberg positions")

# make a duplicate indicator
df["dup"] = 0
df.loc[(df["speed"] == 0) & (df["direction"] == 180), "dup"] = 1

print(
    f"{df.dup.sum()} or {df.dup.sum()/len(df):.2%} of these positions are duplicates, where there is no apparent movement of the iceberg"
)

# Apply function to the column to get decimal places
df["lat_d"] = df["latitude"].apply(count_decimal_places)
df["lon_d"] = df["longitude"].apply(count_decimal_places)




The database has 971749 iceberg positions
108506 or 11.17% of these positions are duplicates, where there is no apparent movement of the iceberg
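
The `dup` flag relies on the `speed` and `direction` columns computed elsewhere. As a cross-check, repeated positions could also be flagged directly by comparing each record with the previous one in the same track. The sketch below assumes the `beacon_id`, `datetime_data`, `latitude` and `longitude` columns seen elsewhere in this file; the function name `flag_repeat_positions` and the small example table are made up for illustration.

```python
import pandas as pd


def flag_repeat_positions(data):
    """Return a 0/1 Series marking rows whose position equals the previous row of the same beacon."""
    ordered = data.sort_values(["beacon_id", "datetime_data"])
    previous = ordered.groupby("beacon_id")[["latitude", "longitude"]].shift()
    same = (ordered["latitude"] == previous["latitude"]) & (
        ordered["longitude"] == previous["longitude"]
    )
    # reindex so the result lines up with the input row order
    return same.astype(int).reindex(data.index)


# illustrative rows only, not taken from the database export
example = pd.DataFrame(
    {
        "beacon_id": ["a", "a", "a", "b"],
        "datetime_data": pd.to_datetime(
            ["2017-07-25 18:00", "2017-07-25 19:00", "2017-07-25 20:00", "2017-07-25 18:00"]
        ),
        "latitude": [76.3194, 76.3194, 76.2950, 76.3194],
        "longitude": [-75.0602, -75.0602, -74.9930, -75.0602],
    }
)
example["dup_check"] = flag_repeat_positions(example)
print(example)
```

The first record of each track has no predecessor and is never flagged, so a different beacon reporting the same coordinates is not counted as a repeat.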


The main question is: 

Are the duplicates there because there is actually no movement, or are they artifacts in the data?

The precision of the data makes a big difference when detecting duplicates. For more info, see https://xkcd.com/2170/, but note that latitude has a big effect on the distance represented by each decimal place of longitude. For example:

In [15]:
print(
    f"At 80 deg N, the longitude decimal place affects distance as follows: \n\n {lon_precision_v_distance(80)}"
)
print(
    f"At 70 deg N, the longitude decimal place affects distance as follows: \n\n {lon_precision_v_distance(70)}"
)
print(
    f"At 60 deg N, the longitude decimal place affects distance as follows: \n\n {lon_precision_v_distance(60)}"
)
print(
    f"At 50 deg N, the longitude decimal place affects distance as follows: \n\n {lon_precision_v_distance(50)}"
)


At 80 deg N, the longitude decimal place affects distance as follows: 

    decimal_places   distance_m
0               1  1939.348314
1               2   193.934855
2               3    19.393486
3               4     1.939349
4               5     0.193935
5               6     0.019393
6               7     0.001939
At 70 deg N, the longitude decimal place affects distance as follows: 

    decimal_places   distance_m
0               1  3818.653700
1               2   381.865412
2               3    38.186541
3               4     3.818654
4               5     0.381865
5               6     0.038187
6               7     0.003819
At 60 deg N, the longitude decimal place affects distance as follows: 

    decimal_places   distance_m
0               1  5579.999626
1               2   558.000015
2               3    55.800002
3               4     5.580000
4               5     0.558000
5               6     0.055800
6               7     0.005580
At 50 deg N, the longitude decimal pl

Given that the location precision of a single-frequency (L1) GPS receiver is typically +/- 3 to 10 m, it is quite possible that movement below ~15 m would not be detectable. Therefore, we cannot expect to separate duplicates that are legitimate from those that are caused by artifacts at a precision of four decimal places or fewer (SD <= 4).


The number of recorded latitudes by precision (number of decimal places)

In [16]:
df.groupby("lat_d").size()

lat_d
1       4962
2      34735
3     294872
4     254199
5     111197
6     175977
7        861
8      36462
9          1
10      1250
13     47420
14      9785
15        28
dtype: int64

The number of recorded longitudes by precision (number of decimal places)

In [17]:
df.groupby("lon_d").size()

lon_d
1       1929
2      35004
3     277710
4     271701
5     123268
6     176370
7        929
8      36090
9          1
10      1253
12       732
13     46618
14       116
15        28
dtype: int64

The percentage of records flagged as duplicates, by precision (NaN means no records at that precision were flagged):

In [18]:
df.loc[df["dup"] == 1].groupby("lat_d").size() / df.groupby("lat_d").size() * 100

lat_d
1      4.373237
2      5.360587
3      8.993733
4     25.347071
5     12.303389
6      0.346636
7           NaN
8      3.123800
9           NaN
10          NaN
13          NaN
14     0.459888
15          NaN
dtype: float64

In [19]:
df.loc[df["dup"] == 1].groupby("lon_d").size() / df.groupby("lon_d").size() * 100

lon_d
1      3.110420
2     16.475260
3      3.948003
4     27.809982
5     11.771100
6      0.327153
7           NaN
8      2.928789
9           NaN
10     0.079808
12          NaN
13          NaN
14     8.620690
15          NaN
dtype: float64
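
The NaN entries arise because precision groups with no flagged records are missing from the numerator. Since `dup` is a 0/1 indicator, the same percentages can be obtained from its group mean, which yields 0 rather than NaN for those groups. A small sketch with made-up data (the column names follow the notebook; the values are illustrative only):

```python
import pandas as pd

# illustrative values only
toy = pd.DataFrame(
    {
        "lat_d": [3, 3, 4, 4, 4, 7],
        "dup": [0, 1, 1, 1, 0, 0],
    }
)

# the mean of a 0/1 indicator is the fraction of flagged rows in each group
dup_percent = toy.groupby("lat_d")["dup"].mean() * 100
print(dup_percent)
```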

Next, figure out which beacon tracks have the most duplicates:


In [20]:
# find the total number of positions by track
total_counts = df.groupby("beacon_id").size().reset_index()
total_counts.columns = ["beacon_id", "total_n"]

# find the number of duplicates by track
dup_counts = df.loc[df.dup == 1].groupby("beacon_id").size().reset_index()
dup_counts.columns = ["beacon_id", "dups_n"]

# get the beacon models
mf = pd.read_csv("/home/dmueller/Desktop/cis_iceberg_beacon_database_0.3/database/metadata.csv")
mf = mf[["beacon_id", "beacon_model"]]

# get the median precision for the duplicates in the track
dp_median = df.loc[df.dup == 1].groupby("beacon_id").agg({'lon_d':'median','lat_d':'median'}).reset_index()
#df_median = df.groupby("beacon_id").agg({'lon_d':'median','lat_d':'median'}).reset_index()
#dp_dup_stats = dup.groupby("beacon_id").agg({"lon_d": ["min","mean","max"], "lat_d":["min","mean","max"]}).reset_index()
#dp_df_stats = df.groupby("beacon_id").agg({"lon_d": ["min","mean","max"], "lat_d":["min","mean","max"]}).reset_index()


# merge dfs to make one with duplicate counts, total counts, beacon model and percent 
totdf = pd.merge(mf, total_counts, how="left")
dupdf = pd.merge(totdf, dup_counts, how="left")
# get the % of track that has duplicates
dupdf["dups_percent"] = dupdf["dups_n"] / dupdf["total_n"] * 100
# merge with the median precision
dupdp = pd.merge(dupdf, dp_median, how="left")
# sort
dupdp.sort_values("dups_percent", inplace=True, ascending=False)
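
The per-track counts and percentages built through separate group-bys and merges could also be computed in one pass with named aggregation. This is only a sketch of an alternative, using made-up rows and the hypothetical name `summary`; it omits the beacon model and median-precision columns.

```python
import pandas as pd

# illustrative rows only
toy = pd.DataFrame(
    {
        "beacon_id": ["a", "a", "a", "a", "b", "b"],
        "dup": [1, 1, 0, 1, 0, 0],
    }
)

# one group-by gives both the total and the number of flagged positions per track
summary = (
    toy.groupby("beacon_id")
    .agg(total_n=("dup", "count"), dups_n=("dup", "sum"))
    .reset_index()
)
summary["dups_percent"] = summary["dups_n"] / summary["total_n"] * 100
summary = summary.sort_values("dups_percent", ascending=False)
print(summary)
```

One difference from the merge-based approach is that tracks without any duplicates get a count of 0 here instead of NaN.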



In [21]:
# show every track with more than 5% duplicates by raising the display row limit
n = pd.get_option("display.max_rows")
pd.set_option("display.max_rows", len(dupdp.loc[dupdp["dups_percent"] > 5]))
dupdp.loc[dupdp["dups_percent"] > 5]
# pd.set_option("display.max_rows", n)

|     | beacon_id            | beacon_model     | total_n | dups_n  | dups_percent | lon_d | lat_d |
|-----|----------------------|------------------|---------|---------|--------------|-------|-------|
| 115 | 2018_300234066545280 | FT-2000          | 6058    | 4938.0  | 81.51205     | 5.0   | 5.0   |
| 15  | 2009_300034012571050 | ICEB-I-XA        | 14710   | 11250.0 | 76.478586    | 5.0   | 5.0   |
| 109 | 2017_300234063516450 | iCALIB           | 2429    | 1840.0  | 75.751338    | 4.0   | 4.0   |
| 52  | 2012_300234010132070 | SVP-I-BXGS-LP    | 5174    | 3488.0  | 67.413993    | 4.0   | 4.0   |
| 70  | 2015_300234061762030 | iCALIB           | 3427    | 2298.0  | 67.055734    | 4.0   | 4.0   |
| 106 | 2017_300234062325760 | SVP-I-BXGSA-L-AD | 461     | 282.0   | 61.171367    | 4.0   | 3.0   |
| 108 | 2017_300234062328750 | SVP-I-BXGSA-L-AD | 3172    | 1923.0  | 60.624212    | 4.0   | 4.0   |
| 62  | 2014_300234060544160 | SVP-I-BXGS-LP    | 3377    | 1923.0  | 56.944033    | 4.0   | 4.0   |
| 170 | 2023_300534064036660 | iCALIB           | 8220    | 4504.0  | 54.793187    | 4.0   | 4.0   |
| 68  | 2015_300234060104820 | SVP-I-BXGS-LP    | 9719    | 4690.0  | 48.255993    | 4.0   | 4.0   |
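
Because the line that restores `display.max_rows` is commented out in the cell that produced this table, the changed setting persists for the rest of the session. One alternative is `pd.option_context`, which restores the previous value automatically when the block exits. A sketch with a stand-in DataFrame (`table` is hypothetical):

```python
import pandas as pd

table = pd.DataFrame({"dups_percent": range(20)})
before = pd.get_option("display.max_rows")

# the option is changed only inside the with-block
with pd.option_context("display.max_rows", None):
    inside = pd.get_option("display.max_rows")
    print(table.loc[table["dups_percent"] > 5])

after = pd.get_option("display.max_rows")
```

Setting the option to `None` removes the row limit, so the length of the filtered table does not need to be computed first.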
