# Google Smartphone Decimeter Challenge by Mean(Interpolation)

This solution builds on several notebooks posted on Kaggle, with my own changes on top. I am treating it as a learning experience.

## Notes

 If you’re outside, with open sky, the GPS accuracy from your phone is about five meters, and that’s been constant for a while. With raw GNSS measurements from the phones, this can now improve dramatically.
 
### The GNSS problem description
Source: the laika README, https://github.com/commaai/laika

GNSS satellites orbit the earth broadcasting signals that allow the receiver to determine the distance to each satellite. These satellites have known orbits and so their positions are known. This makes determining the receiver's position a basic 3-dimensional trilateration problem. In practice observed distances to each satellite will be measured with some offset that is caused by the receiver's clock error. This offset also needs to be determined, making it a 4-dimensional trilateration problem. 
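
As a sketch of the measurement model described above (the symbols are my own notation and are not taken from laika): let $\mathbf{x}$ be the receiver position, $\mathbf{s}_i$ the known position of satellite $i$, $\delta t$ the receiver clock error, $c$ the speed of light, and $\varepsilon_i$ the remaining measurement error. Each observed distance is then approximately

```latex
\rho_i = \lVert \mathbf{x} - \mathbf{s}_i \rVert + c\,\delta t + \varepsilon_i,
\qquad i = 1, \dots, n
```

The unknowns are the three components of $\mathbf{x}$ plus $\delta t$, which is why at least four satellites are needed.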

<img src= "https://camo.githubusercontent.com/0d85f5131c63442f8e7b46de7dab8040a7d693effd5e611ebed25be0b7600a32/68747470733a2f2f75706c6f61642e77696b696d656469612e6f72672f77696b6970656469612f636f6d6d6f6e732f7468756d622f632f63332f33737068657265732e7376672f36323270782d33737068657265732e7376672e706e67"  alt ="GNSS" style="width:400px;height:400px;">

Since this problem is generally overdetermined (more than 4 satellites to solve the 4D problem), there are a variety of methods to compute a position estimate from the measurements. One can use a basic weighted least squares solver for experimental purposes. This is far from optimal because the system is dynamic, which makes a Bayesian estimator like a Kalman filter the preferred choice.
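
To illustrate the weighted least squares approach mentioned above, here is a minimal Gauss-Newton sketch in NumPy. It is not laika's solver: the function name `solve_position_wls`, the synthetic satellite geometry, and the 1000 m clock offset are all assumptions made up for this example, and atmospheric and satellite clock corrections are ignored.

```python
import numpy as np

def solve_position_wls(sat_pos, pseudoranges, weights=None, n_iter=20):
    """Estimate [x, y, z, clock_bias] in metres with Gauss-Newton weighted least squares."""
    sat_pos = np.asarray(sat_pos, dtype=float)
    pseudoranges = np.asarray(pseudoranges, dtype=float)
    n = len(pseudoranges)
    w = np.ones(n) if weights is None else np.asarray(weights, dtype=float)
    sqrt_w = np.sqrt(w)
    state = np.zeros(4)  # start at the earth's centre with zero clock bias
    for _ in range(n_iter):
        diff = state[:3] - sat_pos                # vectors from satellites to receiver
        ranges = np.linalg.norm(diff, axis=1)     # geometric ranges for the current guess
        residual = pseudoranges - (ranges + state[3])
        # Jacobian: unit line-of-sight vectors plus a column of ones for the clock bias
        J = np.hstack([diff / ranges[:, None], np.ones((n, 1))])
        step, *_ = np.linalg.lstsq(J * sqrt_w[:, None], residual * sqrt_w, rcond=None)
        state = state + step
        if np.linalg.norm(step) < 1e-6:
            break
    return state

# Synthetic, noise-free example (all numbers are hypothetical)
truth = np.array([6_371_000.0, 0.0, 0.0, 1000.0])  # receiver position and clock bias in metres
dirs = np.array([[1, 0, 0], [1, 1, 0], [1, -1, 0],
                 [1, 0, 1], [1, 0, -1], [1, 1, 1]], dtype=float)
sats = 26_600_000.0 * dirs / np.linalg.norm(dirs, axis=1, keepdims=True)
rho = np.linalg.norm(sats - truth[:3], axis=1) + truth[3]
est = solve_position_wls(sats, rho)
```

With six satellites and four unknowns the system is overdetermined, so the solver returns the least squares fit. Passing `weights` lets noisier measurements count less.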

However, the above description is over-simplified. Getting accurate distance estimates to satellites and the satellite's position from the receiver observations is not trivial. This is what we call processing of the GNSS observables and it is this procedure laika is designed to make easy.
 
 ### How Wi-Fi RTT positioning works
 - Send a burst of Wi-Fi RTT (round-trip time) ranging transactions between the phone and an access point; from the burst, the system can calculate ranging statistics such as the mean and the variance.
 
 <img src= "https://www.gpsworld.com/wp-content/uploads/2018/07/Android-Figure-3.jpg" alt ="Wifi distance">
 
 Wi-Fi RTT principles, basic concept. Image by Frank van Diggelen, Roy Want and Wei Wang
 
 - Take the ranges to separate access points; if the ranges were exact, four of them would define four circles that intersect at a single point. In practice, because each range has some error, a maximum likelihood position is calculated using a least squares multilateration algorithm.
 
 - Further refine this position by repeating the process, particularly as the phone moves, and then calculate trajectory using filtering techniques, such as Kalman filtering, to optimize the estimate.
 
  <img src= "https://www.gpsworld.com/wp-content/uploads/2018/07/Android-Figure-4.jpg" alt ="Workflow">
  
  Wi-Fi Workflow. Image by: Frank van Diggelen, Roy Want and Wei Wang.
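
The first two steps above can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the Android implementation: `summarize_burst` and `multilaterate_2d` are hypothetical helper names, the access point coordinates are made up, and the problem is reduced to a flat 2D plane. Under independent Gaussian range errors, the least squares solution coincides with the maximum likelihood position.

```python
import numpy as np

def summarize_burst(ranges_m):
    """Mean and sample variance of one burst of range measurements (metres)."""
    r = np.asarray(ranges_m, dtype=float)
    return r.mean(), r.var(ddof=1)

def multilaterate_2d(ap_xy, ranges_m, n_iter=20):
    """Least squares 2D position from ranges to known access points (Gauss-Newton)."""
    ap_xy = np.asarray(ap_xy, dtype=float)
    ranges_m = np.asarray(ranges_m, dtype=float)
    pos = ap_xy.mean(axis=0)  # start from the centroid of the access points
    for _ in range(n_iter):
        diff = pos - ap_xy
        dist = np.linalg.norm(diff, axis=1)
        residual = ranges_m - dist
        J = diff / dist[:, None]  # derivative of each distance with respect to pos
        step, *_ = np.linalg.lstsq(J, residual, rcond=None)
        pos = pos + step
        if np.linalg.norm(step) < 1e-9:
            break
    return pos

# Four hypothetical access points at the corners of a 10 m square
aps = np.array([[0.0, 0.0], [10.0, 0.0], [10.0, 10.0], [0.0, 10.0]])
true_pos = np.array([3.0, 4.0])
exact_ranges = np.linalg.norm(aps - true_pos, axis=1)
est_pos = multilaterate_2d(aps, exact_ranges)
burst_mean, burst_var = summarize_burst([4.9, 5.0, 5.1])
```

In a real pipeline the mean of each burst would be used as the range, and the variance could serve as a weight in the least squares step.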

 
 
### Steps for the solution from Sohier Dane
- Smoothing out the baseline estimates.
- Integrating readings from other phone instruments, like the accelerometer.
- Satellite triangulation using the `*derived.csv` files.
- Building triangulations directly from the raw GNSS logs.
- Incorporating external data for controls, like satellite readings from base stations in the area.

### References: 
- https://www.kaggle.com/t88take/gsdc-phones-mean-prediction
- Discussion topics from Sohier Dane
- Google I/O https://www.gpsworld.com/how-to-achieve-1-meter-accuracy-in-android/
- GPS Survey Workshop video https://www.youtube.com/watch?v=vOJ3u7Zd_i0
- Hardware: Centimeter Positioning with a Smartphone-Quality GNSS Antenna https://www.youtube.com/watch?v=rCOvklUB5vQ

In [None]:
import os
import pandas as pd
import numpy as np
from scipy.interpolate import interp1d
import matplotlib.pyplot as plt
from matplotlib_venn import venn2, venn2_circles
import seaborn as sns
from tqdm.notebook import tqdm
import pathlib
import plotly
import plotly.express as px

In [None]:
def calc_haversine(lat1, lon1, lat2, lon2):
    """Calculates the great circle distance between two points
    on the earth. Inputs are array-like and specified in decimal degrees.
    """
    RADIUS = 6_367_000
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat/2)**2 + \
        np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
    dist = 2 * RADIUS * np.arcsin(a**0.5)
    return dist

In [None]:
def visualize_traffic(df, center, zoom=9):
    """Plot GPS fixes on a map, one colour per phone."""
    fig = px.scatter_mapbox(df,
                            
                            # plotly reads the (lat, lon) coordinates from these columns
                            lat="latDeg",
                            lon="lngDeg",
                            
                            # each phone gets its own colour
                            color="phoneName",
                            
                            zoom=zoom,
                            center=center,
                            height=600,
                            width=800)
    fig.update_layout(mapbox_style='stamen-terrain')
    fig.update_layout(margin={"r": 0, "t": 0, "l": 0, "b": 0})
    fig.update_layout(title_text="GPS traffic")
    fig.show()
    
def visualize_collection(df, collection):
    """Plot all phones of one collection, centred on the mean position."""
    target_df = df[df['collectionName']==collection].copy()
    lat_center = target_df['latDeg'].mean()
    lng_center = target_df['lngDeg'].mean()
    center = {"lat":lat_center, "lon":lng_center}
    
    visualize_traffic(target_df, center)

In [None]:
# directory setting
INPUT = '../input/google-smartphone-decimeter-challenge'

In [None]:
base_train = pd.read_csv(INPUT + '/' + 'baseline_locations_train.csv')
base_test = pd.read_csv(INPUT + '/' + 'baseline_locations_test.csv')
sample_sub = pd.read_csv(INPUT + '/' + 'sample_submission.csv')

In [None]:
# ground_truth
p = pathlib.Path(INPUT)
gt_files = list(p.glob('train/*/*/ground_truth.csv'))
print('ground_truth.csv count : ', len(gt_files))

gts = []
for gt_file in tqdm(gt_files):
    gts.append(pd.read_csv(gt_file))
ground_truth = pd.concat(gts)

display(ground_truth.head())

In [None]:
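def add_distance_diff(df):
    """Add haversine distances (metres) to the previous and next fix of the same phone.

    Assumes rows are sorted by phone and then by time. Modifies df in place.
    """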
    df['latDeg_prev'] = df['latDeg'].shift(1)
    df['latDeg_next'] = df['latDeg'].shift(-1)
    df['lngDeg_prev'] = df['lngDeg'].shift(1)
    df['lngDeg_next'] = df['lngDeg'].shift(-1)
    df['phone_prev'] = df['phone'].shift(1)
    df['phone_next'] = df['phone'].shift(-1)
    
    df['dist_prev'] = calc_haversine(df['latDeg'], df['lngDeg'], df['latDeg_prev'], df['lngDeg_prev'])
    df['dist_next'] = calc_haversine(df['latDeg'], df['lngDeg'], df['latDeg_next'], df['lngDeg_next'])
    
    df.loc[df['phone']!=df['phone_prev'], ['latDeg_prev', 'lngDeg_prev', 'dist_prev']] = np.nan
    df.loc[df['phone']!=df['phone_next'], ['latDeg_next', 'lngDeg_next', 'dist_next']] = np.nan
    
    return df

In [None]:
# Reject outliers: blank out fixes that are more than th metres from both the previous and the next fix
train_ro = add_distance_diff(base_train)
th = 50
train_ro.loc[((train_ro['dist_prev'] > th) & (train_ro['dist_next'] > th)), ['latDeg', 'lngDeg']] = np.nan
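
To make the rejection rule concrete, here is a small sketch on planar coordinates in metres (a simplification; the cell above uses haversine distances on latitude and longitude). `flag_spikes` is a hypothetical helper that is not part of the notebook. A point is flagged only when it is far from both neighbours, so a single spike is removed while a genuine jump in the track is kept.

```python
import numpy as np

def flag_spikes(xy, th):
    """Return a boolean mask of points farther than th from BOTH neighbours."""
    xy = np.asarray(xy, dtype=float)
    step = np.linalg.norm(np.diff(xy, axis=0), axis=1)  # distance between consecutive points
    dist_prev = np.concatenate([[np.nan], step])        # first point has no previous neighbour
    dist_next = np.concatenate([step, [np.nan]])        # last point has no next neighbour
    # Comparisons with NaN are False, so the first and last points are never flagged
    return (dist_prev > th) & (dist_next > th)

track = [(0, 0), (1, 0), (200, 0), (3, 0), (4, 0)]  # the third point is a spike
mask = flag_spikes(track, th=50)
```

Only the third point is flagged here; its neighbours each have one short distance, so they survive.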

In [None]:
!pip install simdkalman

In [None]:
import simdkalman

In [None]:
T = 1.0  # time step between consecutive samples
# State vector: [lat, lng, v_lat, v_lng, a_lat, a_lng] (constant-acceleration model)
state_transition = np.array([[1, 0, T, 0, 0.5 * T ** 2, 0],
                             [0, 1, 0, T, 0, 0.5 * T ** 2],
                             [0, 0, 1, 0, T, 0],
                             [0, 0, 0, 1, 0, T],
                             [0, 0, 0, 0, 1, 0],
                             [0, 0, 0, 0, 0, 1]])
process_noise = np.diag([1e-5, 1e-5, 5e-6, 5e-6, 1e-6, 1e-6]) + np.ones((6, 6)) * 1e-9
observation_model = np.array([[1, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0]])
observation_noise = np.diag([5e-5, 5e-5]) + np.ones((2, 2)) * 1e-9

kf = simdkalman.KalmanFilter(
        state_transition = state_transition,
        process_noise = process_noise,
        observation_model = observation_model,
        observation_noise = observation_noise)

def apply_kf_smoothing(df, kf_=kf):
    unique_paths = df[['collectionName', 'phoneName']].drop_duplicates().to_numpy()
    for collection, phone in unique_paths:
        cond = np.logical_and(df['collectionName'] == collection, df['phoneName'] == phone)
        data = df[cond][['latDeg', 'lngDeg']].to_numpy()
        data = data.reshape(1, len(data), 2)
        smoothed = kf_.smooth(data)
        df.loc[cond, 'latDeg'] = smoothed.states.mean[0, :, 0]
        df.loc[cond, 'lngDeg'] = smoothed.states.mean[0, :, 1]
    return df
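
For readers new to Kalman filtering, the predict and update steps can be sketched in plain NumPy. This is not how simdkalman is implemented, and it is simpler than the cell above: it tracks one coordinate with a constant-velocity model and runs only the forward filter, not the smoother. The function name `kalman_filter_1d` and the noise values `q` and `r` are assumptions chosen for illustration. NaN observations are treated as missing, which is how rejected outliers get filled in from the motion model.

```python
import numpy as np

def kalman_filter_1d(z, dt=1.0, q=1e-3, r=1.0):
    """Forward Kalman filter for [position, velocity]; NaN observations are skipped."""
    z = np.asarray(z, dtype=float)
    F = np.array([[1.0, dt], [0.0, 1.0]])  # state transition
    H = np.array([[1.0, 0.0]])             # only position is observed
    Q = q * np.eye(2)                      # process noise
    R = np.array([[r]])                    # observation noise
    x = np.array([0.0 if np.isnan(z[0]) else z[0], 0.0])
    P = np.eye(2) * 1e3                    # large initial uncertainty
    out = np.empty(len(z))
    for k, zk in enumerate(z):
        # Predict: propagate the state and its covariance one step
        x = F @ x
        P = F @ P @ F.T + Q
        if not np.isnan(zk):
            # Update: correct the prediction with the observation
            innovation = zk - (H @ x)
            S = H @ P @ H.T + R
            K = P @ H.T @ np.linalg.inv(S)
            x = x + K @ innovation
            P = (np.eye(2) - K @ H) @ P
        out[k] = x[0]
    return out

obs = np.arange(10, dtype=float)  # a steady ramp
obs[5] = np.nan                   # one missing sample
filtered = kalman_filter_1d(obs)
```

The missing sample is bridged by the prediction step alone, so the output has no gaps.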

In [None]:
cols = ['collectionName', 'phoneName', 'millisSinceGpsEpoch', 'latDeg', 'lngDeg']
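# Copy the column subset so the smoothing writes to its own frame, not to a view of train_ro
train_ro_kf = apply_kf_smoothing(train_ro[cols].copy())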

In [None]:
def make_lerp_data(df):
    '''
    Generate interpolated lat,lng values for different phone times in the same collection.
    '''
    org_columns = df.columns
    
    # Generate a combination of time x collection x phone and combine it with the original data (generate records to be interpolated)
    time_list = df[['collectionName', 'millisSinceGpsEpoch']].drop_duplicates()
    phone_list = df[['collectionName', 'phoneName']].drop_duplicates()
    tmp = time_list.merge(phone_list, on='collectionName', how='outer')
    
    lerp_df = tmp.merge(df, on=['collectionName', 'millisSinceGpsEpoch', 'phoneName'], how='left')
    lerp_df['phone'] = lerp_df['collectionName'] + '_' + lerp_df['phoneName']
    lerp_df = lerp_df.sort_values(['phone', 'millisSinceGpsEpoch'])
    
    # linear interpolation
    lerp_df['latDeg_prev'] = lerp_df['latDeg'].shift(1)
    lerp_df['latDeg_next'] = lerp_df['latDeg'].shift(-1)
    lerp_df['lngDeg_prev'] = lerp_df['lngDeg'].shift(1)
    lerp_df['lngDeg_next'] = lerp_df['lngDeg'].shift(-1)
    lerp_df['phone_prev'] = lerp_df['phone'].shift(1)
    lerp_df['phone_next'] = lerp_df['phone'].shift(-1)
    lerp_df['time_prev'] = lerp_df['millisSinceGpsEpoch'].shift(1)
    lerp_df['time_next'] = lerp_df['millisSinceGpsEpoch'].shift(-1)
    # Leave only records to be interpolated
    lerp_df = lerp_df[(lerp_df['latDeg'].isnull())&(lerp_df['phone']==lerp_df['phone_prev'])&(lerp_df['phone']==lerp_df['phone_next'])].copy()
    
    # calc lerp
#     lerp_df_copy = lerp_df.copy()
#     lerp_df_copy = lerp_df_copy.drop_duplicates(subset = ["millisSinceGpsEpoch"])

#     x = lerp_df_copy['millisSinceGpsEpoch']
    
#     y_lat = lerp_df_copy['latDeg']
#     y_lng = lerp_df_copy['lngDeg']
#     interpolation_function_lat = interp1d(x, y_lat, kind='cubic')
#     interpolation_function_lng = interp1d(x, y_lng, kind='cubic')
#     lerp_df['latDeg'] = interpolation_function_lat(lerp_df["millisSinceGpsEpoch"])
#     lerp_df['lngDeg'] = interpolation_function_lng(lerp_df["millisSinceGpsEpoch"])
    latDeg_next = lerp_df['latDeg_next']
    latDeg_prev = lerp_df['latDeg_prev']
    
    lngDeg_prev = lerp_df['lngDeg_prev']
    lngDeg_next = lerp_df['lngDeg_next']
    
    millisSinceGpsEpoch = lerp_df['millisSinceGpsEpoch']
    time_prev = lerp_df['time_prev']
    time_next = lerp_df['time_next']
    
    time_ratio = (millisSinceGpsEpoch - time_prev) / (time_next - time_prev)
#     time_ratio_cosine = (1 - np.cos(time_ratio * np.pi))/2.0
    
#     lerp_df['latDeg'] = latDeg_prev + ((latDeg_next - latDeg_prev) * time_ratio_cosine) 
#     lerp_df['lngDeg'] = lngDeg_prev + ((lngDeg_next - lngDeg_prev) * time_ratio_cosine) 

    lerp_df['latDeg'] = latDeg_prev + ((latDeg_next - latDeg_prev) * time_ratio) 
    lerp_df['lngDeg'] = lngDeg_prev + ((lngDeg_next - lngDeg_prev) * time_ratio) 
    
    
    # Leave only the data that has a complete set of previous and next data.
    lerp_df = lerp_df[~lerp_df['latDeg'].isnull()]
    
    return lerp_df[org_columns]
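
The interpolation formula used above can be checked on a single pair of samples. The sketch below uses made-up timestamps and latitudes, and `lerp` is a hypothetical helper name; it also shows that `np.interp` gives the same value for the linear case.

```python
import numpy as np

def lerp(t, t_prev, t_next, v_prev, v_next):
    """Linearly interpolate a value at time t between two known samples."""
    time_ratio = (t - t_prev) / (t_next - t_prev)
    return v_prev + (v_next - v_prev) * time_ratio

# A quarter of the way from t=1000 ms to t=2000 ms
lat_manual = lerp(1250.0, 1000.0, 2000.0, 37.0, 37.002)
lat_numpy = np.interp(1250.0, [1000.0, 2000.0], [37.0, 37.002])
```

Both give a latitude a quarter of the way between the two samples.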

In [None]:
def calc_mean_pred(df, lerp_df):
    '''
    Make a prediction based on the average of the predictions of phones in the same collection.
    '''
    add_lerp = pd.concat([df, lerp_df])
    mean_pred_result = add_lerp.groupby(['collectionName', 'millisSinceGpsEpoch'])[['latDeg', 'lngDeg']].mean().reset_index()
    mean_pred_df = df[['collectionName', 'phoneName', 'millisSinceGpsEpoch']].copy()
    mean_pred_df = mean_pred_df.merge(mean_pred_result[['collectionName', 'millisSinceGpsEpoch', 'latDeg', 'lngDeg']], on=['collectionName', 'millisSinceGpsEpoch'], how='left')
    return mean_pred_df

In [None]:
train_lerp = make_lerp_data(train_ro_kf)
train_mean_pred = calc_mean_pred(train_ro_kf, train_lerp)

In [None]:
tmp1 = train_ro_kf.copy()
tmp2 = train_mean_pred.copy()
tmp2['phoneName'] = tmp2['phoneName'] + '_MEAN'
tmp3 = ground_truth.copy()
tmp3['phoneName'] = tmp3['phoneName'] + '_GT'
tmp = pd.concat([tmp1, tmp2, tmp3])
visualize_collection(tmp, '2020-05-14-US-MTV-1')

In [None]:
def percentile50(x):
    return np.percentile(x, 50)
def percentile95(x):
    return np.percentile(x, 95)

In [None]:
def get_train_score(df, gt):
    gt = gt.rename(columns={'latDeg':'latDeg_gt', 'lngDeg':'lngDeg_gt'})
    df = df.merge(gt, on=['collectionName', 'phoneName', 'millisSinceGpsEpoch'], how='inner')
    # calc_distance_error
    df['err'] = calc_haversine(df['latDeg_gt'], df['lngDeg_gt'], df['latDeg'], df['lngDeg'])
    # calc_evaluate_score
    df['phone'] = df['collectionName'] + '_' + df['phoneName']
    res = df.groupby('phone')['err'].agg([percentile50, percentile95])
    res['p50_p95_mean'] = (res['percentile50'] + res['percentile95']) / 2 
    score = res['p50_p95_mean'].mean()
    return score
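
As a worked example of the metric computed above, the sketch below scores two hypothetical phones with made-up error values. For each phone it averages the 50th and 95th percentile errors, then takes the mean across phones. `phone_score` is an illustrative name and not part of the notebook.

```python
import numpy as np

def phone_score(errors_m):
    """Mean of the 50th and 95th percentile distance errors for one phone."""
    return (np.percentile(errors_m, 50) + np.percentile(errors_m, 95)) / 2

errors_by_phone = {
    "phoneA": np.arange(0, 101, dtype=float),  # errors 0, 1, ..., 100 m
    "phoneB": np.full(10, 2.0),                # a constant 2 m error
}
score = np.mean([phone_score(e) for e in errors_by_phone.values()])
```

Here phoneA scores (50 + 95) / 2 = 72.5 and phoneB scores 2.0, so the overall score is 37.25.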

In [None]:
print('kf + reject_outlier : ', get_train_score(train_ro_kf, ground_truth))
print('+ phones_mean_pred : ', get_train_score(train_mean_pred, ground_truth))

In [None]:
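base_test = add_distance_diff(base_test)
# Note: a 100 km threshold effectively disables outlier rejection on the test set (train used 50 m)
th = 100000
base_test.loc[((base_test['dist_prev'] > th) & (base_test['dist_next'] > th)), ['latDeg', 'lngDeg']] = np.nan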

test_kf = apply_kf_smoothing(base_test)

test_lerp = make_lerp_data(test_kf)
test_mean_pred = calc_mean_pred(test_kf, test_lerp)

# Assigned by index, so this assumes test_mean_pred rows are in the same order as sample_sub
sample_sub['latDeg'] = test_mean_pred['latDeg']
sample_sub['lngDeg'] = test_mean_pred['lngDeg']
sample_sub.to_csv('submission.csv', index=False)