# Google Smartphone Decimeter Challenge with LightGBM

This solution builds on several public Kaggle notebooks (see References), with my own changes. I am treating it as a learning experience.

## Notes

 If you’re outside, with open sky, the GPS accuracy from your phone is about five meters, and that’s been constant for a while. With raw GNSS measurements from the phones, this can now improve dramatically.
 
### The GNSS problem description
Source: https://github.com/commaai/laika

GNSS satellites orbit the earth broadcasting signals that allow the receiver to determine the distance to each satellite. These satellites have known orbits and so their positions are known. This makes determining the receiver's position a basic 3-dimensional trilateration problem. In practice observed distances to each satellite will be measured with some offset that is caused by the receiver's clock error. This offset also needs to be determined, making it a 4-dimensional trilateration problem. 

<img src= "https://camo.githubusercontent.com/0d85f5131c63442f8e7b46de7dab8040a7d693effd5e611ebed25be0b7600a32/68747470733a2f2f75706c6f61642e77696b696d656469612e6f72672f77696b6970656469612f636f6d6d6f6e732f7468756d622f632f63332f33737068657265732e7376672f36323270782d33737068657265732e7376672e706e67"  alt ="GNSS" style="width:400px;height:400px;">

Since this problem is generally overdetermined (more than four satellites for the four-unknown problem), there are a variety of methods to compute a position estimate from the measurements. A basic weighted least squares solver is adequate for experimental purposes, but it is far from optimal because the system is dynamic; a Bayesian estimator such as a Kalman filter is therefore the preferred estimator.
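
To make the least squares idea concrete, here is a minimal sketch of the 4-dimensional solve using plain Gauss-Newton iterations in NumPy. The satellite coordinates, receiver position, and clock bias are made-up numbers, `solve_position` is a hypothetical helper and not part of laika, and all measurements are weighted equally (noise-free pseudoranges).

```python
import numpy as np

def solve_position(sat_pos, pseudoranges, n_iter=20):
    """Gauss-Newton solve for receiver (x, y, z) and clock bias b, all in meters."""
    state = np.zeros(4)  # start at the Earth's center with zero clock bias
    for _ in range(n_iter):
        delta = sat_pos - state[:3]             # receiver-to-satellite vectors
        ranges = np.linalg.norm(delta, axis=1)  # geometric distances
        residuals = pseudoranges - (ranges + state[3])
        # Jacobian of the predicted pseudorange w.r.t. (x, y, z, b)
        jac = np.hstack([-delta / ranges[:, None], np.ones((len(ranges), 1))])
        step, *_ = np.linalg.lstsq(jac, residuals, rcond=None)
        state = state + step
    return state

# Made-up geometry: six satellites (ECEF-like coordinates, meters)
sat_pos = np.array([
    [20e6, 10e6, 5e6],
    [18e6, -12e6, 8e6],
    [22e6, 3e6, -11e6],
    [15e6, 15e6, 12e6],
    [24e6, -5e6, -6e6],
    [12e6, -8e6, 19e6],
])
true_state = np.array([4.0e6, 3.0e6, 3.9e6, 3000.0])  # x, y, z, clock bias

# Every pseudorange carries the same clock-bias offset
pseudoranges = np.linalg.norm(sat_pos - true_state[:3], axis=1) + true_state[3]

estimate = solve_position(sat_pos, pseudoranges)
print(estimate)
```

With real data the measurements are noisy, so each row would be weighted (for example by signal strength), which is the "weighted" part of weighted least squares.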

However, the above description is over-simplified. Getting accurate distance estimates to satellites and the satellite's position from the receiver observations is not trivial. This is what we call processing of the GNSS observables and it is this procedure laika is designed to make easy.
 
 ### How Wi-Fi RTT positioning works
 - Send a burst of Wi-Fi round-trip-time (RTT) ranging transactions to each access point so that the system can calculate ranging statistics, such as the mean and the variance.
 
 <img src= "https://www.gpsworld.com/wp-content/uploads/2018/07/Android-Figure-3.jpg" alt ="Wifi distance">
 
 Wi-Fi RTT principles, basic concept. Image by Frank van Diggelen, Roy Want and Wei Wang
 
 - Take the ranges to four separate access points; if those ranges were exact, they would define four circles that intersect at a single point. In practice, because each range contains error, a maximum likelihood position is calculated using a least squares multilateration algorithm.
 
 - Further refine this position by repeating the process, particularly as the phone moves, and then calculate trajectory using filtering techniques, such as Kalman filtering, to optimize the estimate.
 
  <img src= "https://www.gpsworld.com/wp-content/uploads/2018/07/Android-Figure-4.jpg" alt ="Workflow">
  
  Wi-Fi Workflow. Image by: Frank van Diggelen, Roy Want and Wei Wang.
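
The first two steps above can be sketched as follows. This is only an illustration under my own assumptions: the access point coordinates, the burst offsets, and the helper `multilaterate` are made up, and the circle equations are linearized by subtracting the first one from the rest before solving with least squares.

```python
import numpy as np

def multilaterate(anchors, ranges):
    """Linearized least-squares 2D position from ranges to known anchors."""
    # Subtracting the first circle equation from the others removes the quadratic terms
    A = 2.0 * (anchors[1:] - anchors[0])
    b = (ranges[0] ** 2 - ranges[1:] ** 2
         + np.sum(anchors[1:] ** 2, axis=1) - np.sum(anchors[0] ** 2))
    position, *_ = np.linalg.lstsq(A, b, rcond=None)
    return position

# Hypothetical access points (meters) and a phone at (3, 4)
anchors = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]])
true_position = np.array([3.0, 4.0])
true_ranges = np.linalg.norm(anchors - true_position, axis=1)

# A burst of four range measurements per access point (offsets are made up)
offsets = np.array([-0.3, 0.1, 0.2, 0.0])
bursts = true_ranges[:, None] + offsets[None, :]

mean_ranges = bursts.mean(axis=1)        # ranging statistic: mean
var_ranges = bursts.var(axis=1, ddof=1)  # ranging statistic: sample variance

position = multilaterate(anchors, mean_ranges)
print(position, var_ranges)
```

The variances are not used here; in a weighted solver they would set how much each access point contributes.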

 
 
### Steps for the solution from Sohier Dane
- Smoothing out the baseline estimates
- Integrating readings from other phone instruments, like the accelerometer.
- Satellite triangulation using the `*_derived.csv` files.
- Building triangulations directly from the raw GNSS logs.
- Incorporating external data for controls like satellite readings from base stations in the area.

### References: 
- https://www.kaggle.com/tensorchoko/google-smartphone-lightgbm
- https://www.kaggle.com/jeongyoonlee/google-smartphone-decimeter-eda-keras-tpu
- Discussion topics from Sohier Dane
- Google I/O https://www.gpsworld.com/how-to-achieve-1-meter-accuracy-in-android/
- GPS Survey Workshop video https://www.youtube.com/watch?v=vOJ3u7Zd_i0
- Hardware: Centimeter Positioning with a Smartphone-Quality GNSS Antenna https://www.youtube.com/watch?v=rCOvklUB5vQ

In [None]:
!pip install simdkalman

In [None]:
%matplotlib inline
from matplotlib import pyplot as plt
import numpy as np # linear algebra
from pathlib import Path
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from scipy import sparse
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
import seaborn as sns
import simdkalman
import tensorflow as tf
from tensorflow import keras
from tqdm.notebook import tqdm
from warnings import simplefilter

simplefilter('ignore')
plt.style.use('fivethirtyeight')
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 100)

In [None]:
model_name = 'nn_v2'

data_dir = Path('../input/google-smartphone-decimeter-challenge')
train_file = data_dir / 'baseline_locations_train.csv'
test_file = data_dir / 'baseline_locations_test.csv'
sample_file = data_dir / 'sample_submission.csv'

build_dir = Path('./build')
build_dir.mkdir(parents=True, exist_ok=True)
predict_val_file = build_dir / f'{model_name}.val.txt'
predict_tst_file = build_dir / f'{model_name}.tst.txt'
submission_file = 'submission.csv'

cname_col = 'collectionName'
pname_col = 'phoneName'
phone_col = 'phone'
ts_col = 'millisSinceGpsEpoch'
dt_col = 'datetime'
lat_col = 'latDeg'
lon_col = 'lngDeg'

lrate = .01 #.001
batch_size = 32 #1024
epochs = 2000#100
n_stop = 10
n_fold = 5
seed = 42#77

In [None]:
train = pd.read_csv(train_file)
test = pd.read_csv(test_file)

In [None]:
train.groupby('collectionName').apply(lambda x: x['phoneName'].unique())

In [None]:
test.groupby('collectionName').apply(lambda x: x['phoneName'].unique())

In [None]:
train.phoneName.unique()

In [None]:
f = open('../input/google-smartphone-decimeter-challenge/train/2020-05-14-US-MTV-1/Pixel4/supplemental/Pixel4_GnssLog.20o', 'r')
data = f.readlines()
f.close()
data[:20]

In [None]:
f = open('../input/google-smartphone-decimeter-challenge/train/2020-05-14-US-MTV-1/Pixel4/supplemental/SPAN_Pixel4_10Hz.nmea', 'r')
data = f.readlines()
f.close()
data[:10]

In [None]:
ground = pd.read_csv('../input/google-smartphone-decimeter-challenge/train/2020-05-14-US-MTV-1/Pixel4/ground_truth.csv')
ground.head(5)

In [None]:
f = open('../input/google-smartphone-decimeter-challenge/train/2020-05-14-US-MTV-1/Pixel4/Pixel4_GnssLog.txt', 'r')
data = f.readlines()
f.close()
data[:10]

In [None]:
derived = pd.read_csv('../input/google-smartphone-decimeter-challenge/train/2020-05-14-US-MTV-1/Pixel4/Pixel4_derived.csv')
derived

In [None]:
derived_unique = list(derived.millisSinceGpsEpoch.unique())
len(derived_unique)

In [None]:
ground_unique = list(ground.millisSinceGpsEpoch.unique())
len(ground_unique)

In [None]:
import json
with open('../input/google-smartphone-decimeter-challenge/metadata/accumulated_delta_range_state_bit_map.json', 'r') as json_file:
    adr_state_bit_map = json.load(json_file)
adr_state_bit_map

In [None]:
import json
with open('../input/google-smartphone-decimeter-challenge/metadata/raw_state_bit_map.json', 'r') as json_file:
    raw_state_bit_map = json.load(json_file)
raw_state_bit_map

In [None]:
pd.read_csv('../input/google-smartphone-decimeter-challenge/metadata/constellation_type_mapping.csv').head(5)

In [None]:
# from https://www.kaggle.com/sohier/loading-gnss-logs
def gnss_log_to_dataframes(path):
    print('Loading ' + path, flush=True)
    gnss_section_names = {'Raw','UncalAccel', 'UncalGyro', 'UncalMag', 'Fix', 'Status', 'OrientationDeg'} # Where do these section names come from?
    with open(path) as f_open:
        datalines = f_open.readlines()

    datas = {k: [] for k in gnss_section_names}
    gnss_map = {k: [] for k in gnss_section_names}
    for dataline in datalines:
        is_header = dataline.startswith('#')
        dataline = dataline.strip('#').strip().split(',')
        # skip over notes, version numbers, etc
        if is_header and dataline[0] in gnss_section_names:
            gnss_map[dataline[0]] = dataline[1:]
        elif not is_header:
            datas[dataline[0]].append(dataline[1:])

    results = dict()
    for k, v in datas.items():
        results[k] = pd.DataFrame(v, columns=gnss_map[k])
    # pandas doesn't properly infer types from these lists by default
    for k, df in results.items():
        for col in df.columns:
            if col == 'CodeType':
                continue
            results[k][col] = pd.to_numeric(results[k][col])

    return results

In [None]:
gnss_section_names = {'Raw','UncalAccel', 'UncalGyro', 'UncalMag', 'Fix', 'Status', 'OrientationDeg'}
datas = {k: [] for k in gnss_section_names}
gnss_map = {k: [] for k in gnss_section_names}
datas

In [None]:
results = dict()
for k, v in datas.items():
     results[k] = pd.DataFrame(v, columns=gnss_map[k])
results

In [None]:
# from https://www.kaggle.com/dannellyz/start-here-simple-folium-heatmap-for-geo-data
import folium
from folium import plugins


def simple_folium(df:pd.DataFrame, lat_col:str, lon_col:str):
    #Preprocess
    #Drop rows that do not have lat/lon
    df = df[df[lat_col].notnull() & df[lon_col].notnull()]

    # Convert lat/lon to (n, 2) nd-array format for heatmap
    # Then send to list
    df_locs = list(df[[lat_col, lon_col]].values)

    # Create the folium.Map object
    fol_map = folium.Map([df[lat_col].median(), df[lon_col].median()])

    # plot heatmap
    heat_map = plugins.HeatMap(df_locs)
    print(heat_map)
    fol_map.add_child(heat_map)

    # plot markers
    markers = plugins.MarkerCluster(locations = df_locs)
    fol_map.add_child(markers)

    #Add Layer Control
    folium.LayerControl().add_to(fol_map)

    return fol_map

In [None]:
# from https://www.kaggle.com/jpmiller/baseline-from-host-data
# simplified haversine distance
def calc_haversine(lat1, lon1, lat2, lon2):
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2

    c = 2 * np.arcsin(a**0.5)
    dist = 6_367_000 * c
    return dist
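
As a quick sanity check on the haversine formula, one degree of latitude should come out at roughly 111 km. The sketch below restates the function so that it runs on its own; `haversine_m` and the sample coordinates are my own, and the `np.clip` guard against rounding is an addition that the cell above does not have. Note that the radius of 6,367 km used in this notebook is slightly smaller than the commonly quoted mean Earth radius of about 6,371 km, a difference of under 0.1%.

```python
import numpy as np

EARTH_RADIUS_M = 6_367_000  # same value as in the notebook

def haversine_m(lat1, lon1, lat2, lon2, radius=EARTH_RADIUS_M):
    """Great-circle distance in meters between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    # Clip guards against values slightly outside [0, 1] from rounding
    return 2 * radius * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))

one_deg_lat = haversine_m(37.0, -122.0, 38.0, -122.0)
one_deg_lon = haversine_m(37.0, -122.0, 37.0, -121.0)
same_point = haversine_m(37.0, -122.0, 37.0, -122.0)
print(one_deg_lat, one_deg_lon, same_point)
```

A degree of longitude is shorter than a degree of latitude away from the equator, which is why errors in `latDeg` and `lngDeg` are not directly comparable in degrees.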

In [None]:
# from https://www.kaggle.com/emaerthin/demonstration-of-the-kalman-filter
T = 1.0 
state_transition = np.array([[1, 0, T, 0, 0.5 * T ** 2, 0], [0, 1, 0, T, 0, 0.5 * T ** 2], [0, 0, 1, 0, T, 0],
                             [0, 0, 0, 1, 0, T], [0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 1]])
process_noise = np.diag([1e-5, 1e-5, 5e-6, 5e-6, 1e-6, 1e-6]) + np.ones((6, 6)) * 1e-9
observation_model = np.array([[1, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0]])
observation_noise = np.diag([5e-5, 5e-5]) + np.ones((2, 2)) * 1e-9

kf = simdkalman.KalmanFilter(
        state_transition = state_transition,
        process_noise = process_noise,
        observation_model = observation_model,
        observation_noise = observation_noise)

def apply_kf_smoothing(df, kf_=kf):
    unique_paths = df[phone_col].unique()
    for phone in tqdm(unique_paths):
        data = df.loc[df[phone_col] == phone][[lat_col, lon_col]].values
        data = data.reshape(1, len(data), 2)
        smoothed = kf_.smooth(data)
        df.loc[df[phone_col] == phone, lat_col] = smoothed.states.mean[0, :, 0]
        df.loc[df[phone_col] == phone, lon_col] = smoothed.states.mean[0, :, 1]
    return df
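
The `state_transition` matrix above encodes a constant-acceleration motion model. As I read it, the state order is `[lat, lon, v_lat, v_lon, a_lat, a_lon]`, since the observation model picks out the first two entries. The sketch below rebuilds the same matrix term by term and applies it once; `constant_acceleration_transition` is a hypothetical helper and the state values are made up.

```python
import numpy as np

def constant_acceleration_transition(T=1.0):
    """Transition matrix for state [lat, lon, v_lat, v_lon, a_lat, a_lon]."""
    F = np.eye(6)
    F[0, 2] = F[1, 3] = T             # position += velocity * T
    F[2, 4] = F[3, 5] = T             # velocity += acceleration * T
    F[0, 4] = F[1, 5] = 0.5 * T ** 2  # position += 0.5 * acceleration * T^2
    return F

F = constant_acceleration_transition(T=1.0)
state = np.array([0.0, 0.0, 1.0, 2.0, 2.0, 4.0])
next_state = F @ state  # one prediction step, before any measurement update
print(next_state)
```

The filter alternates this prediction with a correction toward each observed `(latDeg, lngDeg)` pair; the process and observation noise matrices control how strongly the track is smoothed.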

In [None]:
trn = pd.read_csv(train_file)
print(trn.shape)
trn.head()

In [None]:
tst = pd.read_csv(test_file)
print(tst.shape)
tst.head()

In [None]:
sub = pd.read_csv(sample_file)
print(sub.shape)
sub.head()

In [None]:
cname = trn[cname_col][0]
cname

In [None]:
pname = trn[pname_col][0]
pname

In [None]:
path =str(data_dir / 'train' / cname / pname / f'{pname}_GnssLog.txt')
with open(path) as f_open:
        datalines = f_open.readlines()

In [None]:
    for dataline in datalines:
        is_header = dataline.startswith('#')
        dataline = dataline.strip('#').strip().split(',')
        break

In [None]:
datalines[:10]

In [None]:
for col in [cname_col, pname_col]:
    print(f'# of unique {col:>14s} in training: {trn[col].nunique():4d}')
    print(f'# of unique {col:>14s}     in test: {tst[col].nunique():4d}')

In [None]:
trn[pname_col].value_counts()

In [None]:
tst[pname_col].value_counts()

In [None]:
print(f'# of unique phone in training: {trn[phone_col].nunique():4d}')
print(f'    # of unique phone in test: {tst[phone_col].nunique():4d}')

In [None]:
trn[phone_col].value_counts()[:10]

In [None]:
tst[phone_col].value_counts()[:10]

In [None]:
trn_phones = set(trn[phone_col])  # `x in Series` tests the index, so compare against the values
overlapping_phones = [x for x in tst[phone_col] if x in trn_phones]
print(len(overlapping_phones))

In [None]:
tst[ts_col].min(), tst[ts_col].max()

In [None]:
dt_offset = pd.to_datetime('1980-01-06 00:00:00')
print(dt_offset)
dt_offset_in_ms = int(dt_offset.value / 1e6)
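
The offset above turns `millisSinceGpsEpoch` into a calendar time by counting from the GPS epoch (1980-01-06). One caveat: GPS time does not include leap seconds, so the result runs ahead of UTC. The sketch below uses only the standard library; `gps_millis_to_utc` is a hypothetical helper, the sample timestamp is made up, and the 18-second offset is the GPS-UTC difference in effect since 2017, which I am assuming applies to this data.

```python
from datetime import datetime, timedelta, timezone

GPS_EPOCH = datetime(1980, 1, 6, tzinfo=timezone.utc)
GPS_UTC_LEAP_SECONDS = 18  # assumption: offset valid for data recorded after 2017

def gps_millis_to_utc(millis, leap_seconds=GPS_UTC_LEAP_SECONDS):
    """Convert milliseconds since the GPS epoch to a UTC datetime."""
    return GPS_EPOCH + timedelta(milliseconds=millis) - timedelta(seconds=leap_seconds)

sample_millis = 1_273_529_463_442  # made-up timestamp for illustration
gps_time = gps_millis_to_utc(sample_millis, leap_seconds=0)  # what the notebook computes
utc_time = gps_millis_to_utc(sample_millis)                  # leap-second corrected
print(gps_time, utc_time)
```

For this notebook the distinction does not matter, because the timestamps are only used as join keys and for display.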

In [None]:
trn[dt_col] = pd.to_datetime(trn[ts_col] + dt_offset_in_ms, unit='ms')
tst[dt_col] = pd.to_datetime(tst[ts_col] + dt_offset_in_ms, unit='ms')
print(f'Training data range: {trn[dt_col].min()} - {trn[dt_col].max()}')
print(f'    Test data range: {tst[dt_col].min()} - {tst[dt_col].max()}')

In [None]:
latlon_trn = trn[[lat_col, lon_col]].round(3)
latlon_trn['counts'] = 1
latlon_trn = latlon_trn.groupby([lat_col, lon_col]).sum().reset_index()
latlon_trn.head()

In [None]:
    #def simple_folium(df:pd.DataFrame, lat_col:str, lon_col:str):
    simple_folium(latlon_trn, lat_col, lon_col)
    df = pd.DataFrame(latlon_trn)

    #Preprocess
    #Drop rows that do not have lat/lon
    df = df[df[lat_col].notnull() & df[lon_col].notnull()]
    df


In [None]:

    # Convert lat/lon to (n, 2) nd-array format for heatmap
    # Then send to list
    df_locs = list(df[[lat_col, lon_col]].values)
    

In [None]:
df[lat_col].median() # median

In [None]:
    fol_map = folium.Map([df[lat_col].median(), df[lon_col].median()])
    fol_map

In [None]:
    # plot heatmap
    heat_map = plugins.HeatMap(df_locs)
    print(heat_map)

In [None]:
    fol_map.add_child(heat_map)
    fol_map

In [None]:

    # plot markers
    markers = plugins.MarkerCluster(locations = df_locs)
    

In [None]:
fol_map.add_child(markers)

In [None]:
    #Add Layer Control
    folium.LayerControl().add_to(fol_map)

In [None]:
simple_folium(latlon_trn, lat_col, lon_col)

In [None]:
latlon_tst = tst[[lat_col, lon_col]].round(3)
latlon_tst

In [None]:
latlon_tst['counts'] = 1
latlon_tst = latlon_tst.groupby([lat_col, lon_col]).sum().reset_index()
latlon_tst

In [None]:
simple_folium(latlon_tst, lat_col, lon_col)

In [None]:
cname = trn[cname_col][0]
cname

In [None]:
pname = trn[pname_col][0]
pname

In [None]:
dfs = gnss = gnss_log_to_dataframes(str(data_dir / 'train' / cname / pname / f'{pname}_GnssLog.txt'))
print(dfs.keys())

In [None]:
df_raw = dfs['Raw']
print(df_raw.shape)
df_raw.head()

In [None]:
df_raw.info()

In [None]:
df_raw['ArrivalTime'] = df_raw['TimeNanos'] - df_raw['FullBiasNanos'] - df_raw['BiasNanos']
print(df_raw['ArrivalTime'].describe())
df_raw['ArrivalTime'].hist(bins=20)

In [None]:
print(df_raw['BiasUncertaintyNanos'].describe())
df_raw['BiasUncertaintyNanos'].hist(bins=20)

In [None]:
print(df_raw['ReceivedSvTimeUncertaintyNanos'].describe())
df_raw['ReceivedSvTimeUncertaintyNanos'].hist(bins=20)

In [None]:
print(df_raw.AccumulatedDeltaRangeUncertaintyMeters.describe())
df_raw.AccumulatedDeltaRangeUncertaintyMeters.hist(bins=20)

In [None]:
print(df_raw.Cn0DbHz.describe())
df_raw.Cn0DbHz.hist(bins=20)

In [None]:
df_raw = df_raw.loc[
    ~pd.isnull(df_raw.FullBiasNanos) &
    (df_raw.BiasUncertaintyNanos < 100) &
    (df_raw.ArrivalTime > 0) &
    (df_raw.ConstellationType != 0) &
    ~pd.isnull(df_raw.TimeNanos) &
    (df_raw.State != 3) & (df_raw.State != 14) & (df_raw.State != 7) & (df_raw.State != 15) &
    (df_raw.ReceivedSvTimeUncertaintyNanos < 100) &
    (df_raw.AccumulatedDeltaRangeUncertaintyMeters < 0.3) &
    (df_raw.Cn0DbHz > 20)
]
print(df_raw.shape)

In [None]:
df_raw

In [None]:
derived = pd.read_csv(data_dir / 'train' / cname / pname / f'{pname}_derived.csv')
print(derived.shape)
derived.head()

In [None]:
derived.info()

In [None]:
derived = derived.loc[derived.constellationType != 0]
print(derived.shape)

In [None]:
derived

In [None]:
derived['correctedPrM'] = (derived['rawPrM'] + derived['satClkBiasM'] - derived['isrbM'] - 
                           derived['ionoDelayM'] - derived['tropoDelayM'])
sns.pairplot(data=derived, vars=['correctedPrM', 'rawPrM'], height=3)
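
Written out, the corrected pseudorange computed in this cell is the following (the symbols are my own shorthand for the `*_derived.csv` columns, all in meters):

$$
\rho_{\text{corrected}} = \rho_{\text{raw}} + b_{\text{satClk}} - \text{ISRB} - I_{\text{iono}} - T_{\text{tropo}}
$$

Here $\rho_{\text{raw}}$ is `rawPrM`, $b_{\text{satClk}}$ is `satClkBiasM`, $\text{ISRB}$ is `isrbM`, $I_{\text{iono}}$ is `ionoDelayM`, and $T_{\text{tropo}}$ is `tropoDelayM`.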

In [None]:
derived[dt_col] = pd.to_datetime(derived[ts_col] + dt_offset_in_ms, unit='ms')
print(f'Data range for {cname}/{pname}: {derived[dt_col].min()} - {derived[dt_col].max()}')

In [None]:
derived[['constellationType', 'svid', 'signalType']].value_counts()

In [None]:
derived[[ts_col, 'constellationType', 'correctedPrM']].groupby([ts_col, 'constellationType']).agg(['mean', 'std', 'count']).describe()

In [None]:
derived.loc[derived.constellationType == 1][[ts_col, 'svid', 'correctedPrM']].groupby([ts_col, 'svid']).agg(['mean', 'std', 'count']).describe()

In [None]:
pd.read_csv('../input/google-smartphone-decimeter-challenge/metadata/constellation_type_mapping.csv')

In [None]:
derived.loc[derived.signalType == 'GPS_L1'][[ts_col, 'svid', 'correctedPrM']].groupby([ts_col, 'svid']).agg(['mean', 'std', 'count'])

In [None]:
derived.loc[derived.signalType == 'GPS_L1'][[ts_col, 'svid', 'correctedPrM']].groupby([ts_col, 'svid']).agg(['mean', 'std', 'count']).describe()

In [None]:
derived.loc[derived.signalType == 'GPS_L1'][[ts_col, 'svid']].drop_duplicates().groupby([ts_col]).agg(['mean', 'std', 'count']).describe()

In [None]:
gps_l1 = derived.loc[derived.signalType == 'GPS_L1'][[ts_col, 'svid', 'correctedPrM']].drop_duplicates([ts_col, 'svid'])
print(gps_l1.shape)
gps_l1.head()

In [None]:
label = pd.read_csv(data_dir / 'train' / cname / pname / 'ground_truth.csv')
print(label.shape)
label.head()

In [None]:
label[dt_col] = pd.to_datetime(label[ts_col] + dt_offset_in_ms, unit='ms')
print(f'Labels range for {cname}/{pname}: {label[dt_col].min()} - {label[dt_col].max()}')

In [None]:
cname = trn[cname_col][10]
pname = trn[pname_col][10]
derived2 = pd.read_csv(data_dir / 'train' / cname / pname / f'{pname}_derived.csv')
label2 = pd.read_csv(data_dir / 'train' / cname / pname / 'ground_truth.csv')
print(f"Derived data starts at: {pd.to_datetime(derived2[ts_col].min() + dt_offset_in_ms, unit='ms')}")
print(f"  Label data starts at: {pd.to_datetime(label2[ts_col].min() + dt_offset_in_ms, unit='ms')}")

In [None]:
trn.sort_values([phone_col, ts_col], inplace=True)

In [None]:
trn[['prev_lat']] = trn[lat_col].shift().where(trn[phone_col].eq(trn[phone_col].shift()))
trn[['prev_lat']] 

In [None]:
trn[['prev_lon']] = trn[lon_col].shift().where(trn[phone_col].eq(trn[phone_col].shift()))
trn[['prev_lon']]
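
The `shift().where(...)` idiom above takes the previous row's value and blanks it wherever the phone changes, so the first fix of each phone gets `NaN`. A small sketch with made-up data shows that, on a frame sorted by phone and time, it matches a per-group shift:

```python
import pandas as pd

toy = pd.DataFrame({
    'phone': ['a', 'a', 'a', 'b', 'b'],
    'millisSinceGpsEpoch': [1000, 2000, 3000, 1000, 2000],
    'latDeg': [37.0, 37.1, 37.2, 40.0, 40.1],
})
toy = toy.sort_values(['phone', 'millisSinceGpsEpoch'])

# Notebook idiom: shift everything, then blank rows where the phone changed
prev_where = toy['latDeg'].shift().where(toy['phone'].eq(toy['phone'].shift()))

# Equivalent per-group shift
prev_group = toy.groupby('phone')['latDeg'].shift()

print(pd.DataFrame({'where': prev_where, 'group': prev_group}))
```

The two only agree when the rows are sorted by phone first, which is why the `sort_values` call comes before these cells.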

In [None]:
tst.sort_values([phone_col, ts_col], inplace=True)

In [None]:
tst[['prev_lat']] = tst[lat_col].shift().where(tst[phone_col].eq(tst[phone_col].shift()))
tst[['prev_lat']] 

In [None]:
tst[['prev_lon']] = tst[lon_col].shift().where(tst[phone_col].eq(tst[phone_col].shift()))
tst.head()

In [None]:
# from https://www.kaggle.com/jpmiller/baseline-from-host-data
label_files = (data_dir / 'train').rglob('ground_truth.csv')
label_files

In [None]:
cols = [phone_col, ts_col, lat_col, lon_col]

df_list = []
for t in tqdm(label_files, total=73):
    label = pd.read_csv(t, usecols=[cname_col, pname_col, ts_col, lat_col, lon_col])
    df_list.append(label)
   

In [None]:
df_label = pd.concat(df_list, ignore_index=True)
df_label

In [None]:
df_list[0].head()  # peek at the first ground-truth frame

In [None]:
df_label[phone_col] = df_label[cname_col] + '_' + df_label[pname_col]
df_label

In [None]:
df = df_label.merge(trn[cols + ['prev_lat', 'prev_lon']], how='inner', on=[phone_col, ts_col], 
                    suffixes=('_gt', '')).drop([cname_col, pname_col], axis=1) # suffixes: appended to column names that appear in both frames
df

In [None]:
df['sSinceGpsEpoch'] = df[ts_col] // 1000 # floor division: milliseconds to whole seconds
print(df.shape)
df.head()

In [None]:
df_tst = sub[[phone_col, ts_col]].merge(tst[[phone_col, ts_col, lat_col, lon_col, 'prev_lat', 'prev_lon']], 
                                        how='left', on=[phone_col, ts_col], suffixes=('', '_basepred'))
df_tst

In [None]:
df_tst['sSinceGpsEpoch'] = df_tst[ts_col] // 1000
print(df_tst.shape)
df_tst.head()

In [None]:
derived_files = (data_dir / 'train').rglob('*_derived.csv')
cols = [ts_col, 'svid', 'correctedPrM']
derived_files

In [None]:
df_list = []
for t in tqdm(derived_files, total=73):
    derived = pd.read_csv(t).drop_duplicates([ts_col, 'svid'])
    derived['correctedPrM'] = (derived['rawPrM'] + derived['satClkBiasM'] - derived['isrbM'] - 
                               derived['ionoDelayM'] - derived['tropoDelayM'])
    df_list.append(derived[[cname_col, pname_col, ts_col, 'svid', 'correctedPrM']])

In [None]:
df_derived = pd.concat(df_list, ignore_index=True)
df_derived

In [None]:
df_derived[phone_col] = df_derived[cname_col] + '_' + df_derived[pname_col]
df_derived[phone_col] 

In [None]:
df_derived.drop([cname_col, pname_col], axis=1, inplace=True)

print(df_derived.shape)
df_derived.head()

In [None]:
df_derived_pivot = pd.pivot_table(df_derived, 
                                  values='correctedPrM', 
                                  index=[phone_col, ts_col],
                                  columns=['svid'],
                                  aggfunc=np.mean)
df_derived_pivot 

In [None]:
df_derived_pivot.columns = [f'svid_{x}' for x in df_derived_pivot.columns]
df_derived_pivot.columns

In [None]:
df_derived_pivot.reset_index(inplace=True)
df_derived_pivot

In [None]:
df_derived_pivot['sSinceGpsEpoch'] = df_derived_pivot[ts_col] // 1000

print(df_derived_pivot.shape)
df_derived_pivot.head()

In [None]:
df = df.merge(df_derived_pivot, how='left', on=[phone_col, 'sSinceGpsEpoch'], suffixes=['', '_2'])
df.drop(['sSinceGpsEpoch', ts_col + '_2'], axis=1, inplace=True)
print(df.shape)
df.head()

In [None]:
df['d_lat'] = df['latDeg_gt'] - df[lat_col]
df['d_lon'] = df['lngDeg_gt'] - df[lon_col]
df[['d_lat', 'd_lon']].describe()

In [None]:
derived_files = (data_dir / 'test').rglob('*_derived.csv')
cols = [ts_col, 'svid', 'correctedPrM']
derived_files 

In [None]:
df_list = []
for t in tqdm(derived_files, total=48):
    derived = pd.read_csv(t)
    derived['sSinceGpsEpoch'] = derived[ts_col] // 1000
    derived.drop_duplicates(['sSinceGpsEpoch', 'svid'], inplace=True)
    derived['correctedPrM'] = (derived['rawPrM'] + derived['satClkBiasM'] - derived['isrbM'] - 
                               derived['ionoDelayM'] - derived['tropoDelayM'])
    df_list.append(derived[[cname_col, pname_col, 'sSinceGpsEpoch', 'svid', 'correctedPrM']])
    

In [None]:
df_derived = pd.concat(df_list, ignore_index=True)
df_derived

In [None]:
df_derived[phone_col] = df_derived[cname_col] + '_' + df_derived[pname_col]
df_derived.drop([cname_col, pname_col], axis=1, inplace=True)
df_derived

In [None]:
df_derived_pivot = pd.pivot_table(df_derived, 
                                  values='correctedPrM', 
                                  index=[phone_col, 'sSinceGpsEpoch'],
                                  columns=['svid'],
                                  aggfunc=np.mean)
df_derived_pivot

In [None]:
df_derived_pivot.columns = [f'svid_{x}' for x in df_derived_pivot.columns]
df_derived_pivot.reset_index(inplace=True)
df_derived_pivot

In [None]:
df_tst = df_tst.merge(df_derived_pivot, how='left', 
                      on=[phone_col, 'sSinceGpsEpoch']).drop(['sSinceGpsEpoch'], axis=1)
print(df_tst.shape)
df_tst.head()

In [None]:
df_tst.describe()

In [None]:
feature_cols = [x for x in df_tst.columns if x not in [phone_col, ts_col]]
target_cols = ['d_lat', 'd_lon']
input_dim = len(feature_cols)
output_dim = len(target_cols)

In [None]:
feature_cols 

In [None]:
scaler = StandardScaler()
label_scaler = StandardScaler()
scaler.fit(pd.concat([df[feature_cols], df_tst[feature_cols]], axis=0).fillna(0).values)
X = scaler.transform(df[feature_cols].fillna(0).values)
X_tst = scaler.transform(df_tst[feature_cols].fillna(0).values)
Y = label_scaler.fit_transform(df[target_cols].values)
print(X.shape, Y.shape, X_tst.shape)

In [None]:
def scheduler(epoch, lr, warmup=5):
    if epoch < warmup:
        return lr * 1.5
    else:
        return lr * tf.math.exp(-.1) # decay the learning rate every epoch

In [None]:
import optuna 
import optuna.integration.lightgbm as lgbo
import lightgbm as lgb
'''
params = { 'objective': 'mse', 'metric': 'mse' }
Y = pd.DataFrame(Y,columns={'data','data2'})

lgb_train1 = lgb.Dataset(X, Y.data)
lgb_valid1 = lgb.Dataset(X, Y.data)
model1 = lgbo.train(params, lgb_train1, valid_sets=[lgb_valid1], verbose_eval=False, num_boost_round=100, early_stopping_rounds=5) 
model1.params["learning_rate"] = 0.01
model1.params["early_stopping_round"] = 100
model1.params["num_iterations"] = 8000
model1.params
'''

In [None]:
params1= {'objective': 'mse',
 'metric': 'l2',
 'feature_pre_filter': False,
 'lambda_l1': 0.0,
 'lambda_l2': 0.0,
 'num_leaves': 253,
 'feature_fraction': 0.8999999999999999,
 'bagging_fraction': 0.8540227553324429,
 'bagging_freq': 2,
 'min_child_samples': 5,
 'num_iterations': 8000,
 'early_stopping_round': 100,
 'learning_rate': 0.01}

In [None]:
'''
lgb_train2 = lgb.Dataset(X, Y.data2)
lgb_valid2 = lgb.Dataset(X, Y.data2)
model2 = lgbo.train(params, lgb_train2, valid_sets=[lgb_valid2], verbose_eval=False, num_boost_round=100, early_stopping_rounds=5) 
model2.params["learning_rate"] = 0.01
model2.params["early_stopping_round"] = 100
model2.params["num_iterations"] = 8000
model2.params
'''

In [None]:
params2 = {'objective': 'mse',
 'metric': 'l2',
 'feature_pre_filter': False,
 'lambda_l1': 0.0,
 'lambda_l2': 0.0,
 'num_leaves': 242,
 'feature_fraction': 0.8839999999999999,
 'bagging_fraction': 0.9480235884535055,
 'bagging_freq': 3,
 'min_child_samples': 5,
 'num_iterations': 8000,
 'early_stopping_round': 100,
 'learning_rate': 0.01}

In [None]:
params = {'objective': 'mse',
 'metric': 'mse',
 'num_iterations': 8000,
 'early_stopping_round': 100,
 'learning_rate': 0.001}

In [None]:

from sklearn.multioutput import MultiOutputRegressor
import lightgbm as lgb

params={'learning_rate': 0.02, #0.01
        'objective':'mae', 
        'metric':'mae',
        'num_leaves': 9, #9 @@
        'verbose': 0,
        'bagging_fraction': 0.8, #0.7; LightGBM only applies bagging when bagging_freq > 0
        'feature_fraction': 0.8 #0.7
       }
reg = MultiOutputRegressor(lgb.LGBMRegressor(**params, n_estimators=2000))

cv = KFold(n_splits=n_fold, shuffle=True, random_state=seed)

P = np.zeros_like(Y, dtype=float)
P_tst = np.zeros((X_tst.shape[0], output_dim), dtype=float)
#Y = pd.DataFrame(Y,columns={'data','data2'})
for i, (i_trn, i_val) in enumerate(cv.split(X), 1):
    print(f'Training for CV #{i}')
    #lgb_train1 = lgb.Dataset(X[i_trn], Y.data[i_trn])
    #lgb_valid1 = lgb.Dataset(X[i_val], Y.data[i_val])
    
    #lgb_train2 = lgb.Dataset(X[i_trn], Y.data2[i_trn])
    #lgb_valid2 = lgb.Dataset(X[i_val], Y.data2[i_val])
    
    reg.fit(X[i_trn], Y[i_trn])
    #model1 = lgb.train(params1, lgb_train1, valid_sets=[lgb_valid1], verbose_eval=100)
    #model2 = lgb.train(params2, lgb_train2, valid_sets=[lgb_valid2], verbose_eval=100)
    
    #a =model1.predict(X[i_val])
    #b =model2.predict(X[i_val])
    #tt = pd.DataFrame(columns=['a','b'],index=range(len(a)))
    #tt.a = a
    #tt.b = b
   
    tt = reg.predict(X[i_val])
    P[i_val] = label_scaler.inverse_transform(tt)
    
    
    #a =model1.predict(X_tst)
    #b =model2.predict(X_tst)
    #tt = pd.DataFrame(columns=['a','b'],index=range(len(a)))
    #tt.a = a
    #tt.b = b
    tt = reg.predict(X_tst)

    P_tst += label_scaler.inverse_transform(tt) / n_fold
    
    distance_i = calc_haversine(df.latDeg_gt.values[i_val], 
                                df.lngDeg_gt.values[i_val], 
                                P[i_val, 0] + df.latDeg.values[i_val], 
                                P[i_val, 1] + df.lngDeg.values[i_val])  # keep per-row distances so the percentiles are meaningful
    print(f'CV #{i}: {np.percentile(distance_i, [50, 95])}')


In [None]:
'''
cv = KFold(n_splits=n_fold, shuffle=True, random_state=seed)

P = np.zeros_like(Y, dtype=float)
P_tst = np.zeros((X_tst.shape[0], output_dim), dtype=float)
Y = pd.DataFrame(Y,columns={'data','data2'})
for i, (i_trn, i_val) in enumerate(cv.split(X), 1):
    print(f'Training for CV #{i}')
    lgb_train1 = lgb.Dataset(X[i_trn], Y.data[i_trn])
    lgb_valid1 = lgb.Dataset(X[i_val], Y.data[i_val])
    
    lgb_train2 = lgb.Dataset(X[i_trn], Y.data2[i_trn])
    lgb_valid2 = lgb.Dataset(X[i_val], Y.data2[i_val])
    

    model1 = lgb.train(params1, lgb_train1, valid_sets=[lgb_valid1], verbose_eval=100)
    model2 = lgb.train(params2, lgb_train2, valid_sets=[lgb_valid2], verbose_eval=100)
    
    a =model1.predict(X[i_val])
    b =model2.predict(X[i_val])
    tt = pd.DataFrame(columns=['a','b'],index=range(len(a)))
    tt.a = a
    tt.b = b
   
    P[i_val] = label_scaler.inverse_transform(tt)
    
    
    a =model1.predict(X_tst)
    b =model2.predict(X_tst)
    tt = pd.DataFrame(columns=['a','b'],index=range(len(a)))
    tt.a = a
    tt.b = b

    P_tst += label_scaler.inverse_transform(tt) / n_fold
    
    distance_i = calc_haversine(df.latDeg_gt.values[i_val], 
                                df.lngDeg_gt.values[i_val], 
                                P[i_val, 0] + df.latDeg.values[i_val], 
                                P[i_val, 1] + df.lngDeg.values[i_val]).mean()
    print(f'CV #{i}: {np.percentile(distance_i, [50, 95])}')
    '''

In [None]:
print(P.mean(axis=0), P_tst.mean(axis=0))
np.savetxt(predict_val_file, P, delimiter=',', fmt='%.6f')
np.savetxt(predict_tst_file, P_tst, delimiter=',', fmt='%.6f')

In [None]:
distance = calc_haversine(df.latDeg_gt, df.lngDeg_gt, P[:, 0] + df.latDeg, P[:, 1] + df.lngDeg)
print(f'CV All: {np.percentile(distance, [50, 95])}')
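
The percentiles above pool every row together. As I understand the competition metric, it is computed per phone instead: average each phone's 50th and 95th percentile error, then take the mean across phones. A minimal sketch of that scoring, with made-up errors and a hypothetical helper `percentile_score`:

```python
import numpy as np
import pandas as pd

def percentile_score(errors_m, phones):
    """Mean over phones of the average of each phone's 50th and 95th percentile error."""
    frame = pd.DataFrame({'phone': phones, 'err': errors_m})
    p50 = frame.groupby('phone')['err'].quantile(0.50)
    p95 = frame.groupby('phone')['err'].quantile(0.95)
    return float(((p50 + p95) / 2).mean())

# Made-up distance errors in meters for two phones
errors = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 10.0, 10.0, 10.0])
phones = ['a'] * 5 + ['b'] * 3

score = percentile_score(errors, phones)
print(score)
```

In this notebook the equivalent call would pass `distance` and `df[phone_col]`, so that a few long traces do not dominate the summary.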

In [None]:
# P is aligned with the current row order of df, so add it before sorting
df_smoothed = df.copy()
df_smoothed[lat_col] = df[lat_col] + P[:, 0]
df_smoothed[lon_col] = df[lon_col] + P[:, 1]
df_smoothed.sort_values([phone_col, ts_col], inplace=True)
df_smoothed = apply_kf_smoothing(df_smoothed)
distance = calc_haversine(df_smoothed.latDeg_gt, df_smoothed.lngDeg_gt, df_smoothed.latDeg, df_smoothed.lngDeg)
print(f'CV All (smoothed): {np.percentile(distance, [50, 95])}')

In [None]:
distance_tst = calc_haversine(df_tst.latDeg, df_tst.lngDeg, P_tst[:, 0] + df_tst.latDeg, P_tst[:, 1] + df_tst.lngDeg)
print(f'Test shift from baseline: {np.percentile(distance_tst, [50, 95])}')

In [None]:
distance_tst

In [None]:
# P_tst is aligned with the current row order of df_tst, so add it before sorting
df_tst_smoothed = df_tst.copy()
df_tst_smoothed[lat_col] = df_tst_smoothed[lat_col] + P_tst[:, 0]
df_tst_smoothed[lon_col] = df_tst_smoothed[lon_col] + P_tst[:, 1]
df_tst_smoothed.sort_values([phone_col, ts_col], inplace=True)
df_tst_smoothed

In [None]:
df_tst_smoothed = apply_kf_smoothing(df_tst_smoothed)
df_tst_smoothed

In [None]:
distance_tst = calc_haversine(df_tst.latDeg, df_tst.lngDeg, df_tst_smoothed.latDeg, df_tst_smoothed.lngDeg)
print(f'Test shift from baseline (smoothed): {np.percentile(distance_tst, [50, 95])}')

In [None]:
df_tst_smoothed[[phone_col, ts_col, lat_col, lon_col]].to_csv(submission_file, index=False)

In [None]:
submission = df_tst_smoothed[[phone_col, ts_col, lat_col, lon_col]].copy()  # copy so new columns can be added safely

In [None]:
def get_removedevice(input_df: pd.DataFrame, device: str) -> pd.DataFrame:
    input_df['index'] = input_df.index
    input_df = input_df.sort_values('millisSinceGpsEpoch')
    input_df.index = input_df['millisSinceGpsEpoch'].values

    output_df = pd.DataFrame() 
    for _, subdf in input_df.groupby('collectionName'):

        phones = subdf['phoneName'].unique()

        if (len(phones) == 1) or (device not in phones):
            output_df = pd.concat([output_df, subdf])
            continue

        origin_df = subdf.copy()
        
        _index = subdf['phoneName']==device
        subdf.loc[_index, 'latDeg'] = np.nan
        subdf.loc[_index, 'lngDeg'] = np.nan
        subdf = subdf.interpolate(method='index', limit_area='inside')

        _index = subdf['latDeg'].isnull()
        subdf.loc[_index, 'latDeg'] = origin_df.loc[_index, 'latDeg'].values
        subdf.loc[_index, 'lngDeg'] = origin_df.loc[_index, 'lngDeg'].values

        output_df = pd.concat([output_df, subdf])

    output_df.index = output_df['index'].values
    output_df = output_df.sort_index()

    del output_df['index']
    
    return output_df

In [None]:
submission['collectionName'] = submission['phone'].map(lambda x: x.split('_')[0])
submission['phoneName'] = submission['phone'].map(lambda x: x.split('_')[1])
submission = get_removedevice(submission, 'SamsungS20Ultra')

submission = submission.drop(columns=['collectionName', 'phoneName'])
submission.to_csv('submission.csv', index=False)