In this notebook I run EDA on a per-device basis. I then show how to improve accuracy by discarding the location estimates of a low-accuracy device and interpolating them from the other devices in the same collection.

Score (Public Leaderboard, lower is better)  
[Baseline post-processing by outlier correction](https://www.kaggle.com/dehokanta/baseline-post-processing-by-outlier-correction) by [dehokanta](https://www.kaggle.com/dehokanta): 6.164  
This notebook: 6.089

**An improvement of 0.075!**

# EDA

Let's look at the accuracy of each device. As the plots and summary statistics show, accuracy varies from device to device.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
from glob import glob


def get_groundtruth(path: Path) -> pd.DataFrame:
    """Load every ground_truth.csv under train/ and prefix the target columns with 't_'."""
    # read all files first and concatenate once, instead of growing a DataFrame in a loop
    dfs = [
        pd.read_csv(csv_path)
        for csv_path in sorted(path.glob('train/*/*/ground_truth.csv'))
    ]
    output_df = pd.concat(dfs, ignore_index=True)

    # rename the target columns so they do not clash with the baseline columns on merge
    _columns = ['latDeg', 'lngDeg', 'heightAboveWgs84EllipsoidM']
    output_df = output_df.rename(columns={col: 't_' + col for col in _columns})
    return output_df


def calc_haversine(lat1, lon1, lat2, lon2):
    """Calculates the great circle distance between two points
    on the earth. Inputs are array-like and specified in decimal degrees.
    """
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2

    c = 2 * np.arcsin(a**0.5)
    dist = 6_367_000 * c
    return dist


def check_score(input_df: pd.DataFrame) -> pd.DataFrame:
    """Add a per-row error column 'meter' and print the mean error and the score."""
    output_df = input_df.copy()

    # calc_haversine is vectorized, so pass whole columns instead of a row-wise apply
    output_df['meter'] = calc_haversine(
        input_df['latDeg'], input_df['lngDeg'],
        input_df['t_latDeg'], input_df['t_lngDeg']
    )

    meter_score = output_df['meter'].mean()
    print(f'error meter: {meter_score}')

    # score: mean over all phones of the 50th and 95th percentile errors
    scores = []
    for _, meters in output_df.groupby('phone')['meter']:
        scores.extend(np.percentile(meters, [50, 95]))

    score = sum(scores) / len(scores)
    print(f'score: {score}')

    return output_df
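
To make the score printed by `check_score` concrete, here is a minimal sketch on made-up data. The phone labels `A` and `B` and their error values are hypothetical; the calculation mirrors the loop above (50th and 95th percentile of the error per phone, averaged over all phones).

```python
import pandas as pd

# Hypothetical per-sample errors in meters for two phones.
toy = pd.DataFrame({
    'phone': ['A'] * 5 + ['B'] * 5,
    'meter': [1.0, 2.0, 3.0, 4.0, 5.0, 2.0, 2.0, 2.0, 2.0, 12.0],
})

# 50th and 95th percentile of the error for each phone (one row per phone).
per_phone = toy.groupby('phone')['meter'].quantile([0.5, 0.95]).unstack()

# Average the two percentiles per phone, then average over phones.
toy_score = per_phone.mean(axis=1).mean()

print(per_phone)
print(f'score: {toy_score}')
```

Phone `B` has the lower median error but a single large outlier, so its 95th percentile dominates its contribution. This is why a device with occasional large errors can hurt the score even when it is usually accurate.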

In [None]:
# read data
BASE_DIR = Path('../input/google-smartphone-decimeter-challenge')
train_base = pd.read_csv(BASE_DIR / 'baseline_locations_train.csv')
test_base = pd.read_csv(BASE_DIR / 'baseline_locations_test.csv')

# merge ground truth
train_base = train_base.merge(
    get_groundtruth(BASE_DIR),
    on=['collectionName', 'phoneName', 'millisSinceGpsEpoch']
)

In [None]:
# check score
train_base = check_score(train_base)

In [None]:
# show boxplot
for name, df in train_base.groupby('collectionName'):    
    sns.boxplot(data=df, x='phoneName', y='meter', width=0.5)
    plt.title(name)
    plt.ylim(0, 20)
    plt.show()

In [None]:
# show describe
train_base.groupby('phoneName')['meter'].describe().T

Looking at the data, you can see that the error differs between devices even within the same collectionName.  
SamsungS20Ultra has comparatively large errors, so let's discard its location estimates and interpolate them from the other devices instead.
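
The device was picked here by looking at the plots. As a rough alternative, the sketch below ranks devices by how much worse their median error is than the best device in the same collection. The toy values and the `excess` criterion are my own assumptions for illustration, not part of the method in this notebook; on real data `toy` would be replaced by `train_base`.

```python
import pandas as pd

# Hypothetical per-sample errors; in the notebook this would be train_base.
toy = pd.DataFrame({
    'collectionName': ['c1'] * 4 + ['c2'] * 4,
    'phoneName': ['Pixel4', 'Pixel4', 'SamsungS20Ultra', 'SamsungS20Ultra'] * 2,
    'meter': [1.0, 2.0, 4.0, 6.0, 1.5, 2.5, 5.0, 7.0],
})

# Median error per collection (rows) and phone (columns).
collection_median = (
    toy.groupby(['collectionName', 'phoneName'])['meter'].median().unstack()
)

# How much worse each phone is than the best phone in the same collection.
excess = collection_median.sub(collection_median.min(axis=1), axis=0)

# Average that gap over collections and pick the phone with the largest gap.
excess_mean = excess.mean()
worst = excess_mean.idxmax()

print(excess_mean)
print(f'candidate to interpolate: {worst}')
```

Comparing within a collection matters because the collections differ in difficulty. A phone that only appears on hard routes would otherwise look worse than it is.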

# Method

In [None]:
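def get_removedevice(input_df: pd.DataFrame, device: str) -> pd.DataFrame:
    """Replace `device`'s lat/lng with values interpolated in time from the
    other phones in the same collection."""
    # work on a copy so the caller's DataFrame is not modified
    input_df = input_df.copy()
    input_df['index'] = input_df.index
    input_df = input_df.sort_values('millisSinceGpsEpoch')
    input_df.index = input_df['millisSinceGpsEpoch'].values

    # only the position columns are interpolated
    target_cols = ['latDeg', 'lngDeg']
    output_dfs = []
    for _, subdf in input_df.groupby('collectionName'):
        subdf = subdf.copy()
        phones = subdf['phoneName'].unique()

        # nothing to interpolate from, or the device is not in this collection
        if (len(phones) == 1) or (device not in phones):
            output_dfs.append(subdf)
            continue

        origin_df = subdf.copy()

        # blank out the device's positions and fill them by time-weighted
        # linear interpolation between the remaining phones
        _index = subdf['phoneName'] == device
        subdf.loc[_index, target_cols] = np.nan
        subdf[target_cols] = subdf[target_cols].interpolate(
            method='index', limit_area='inside'
        )

        # rows outside the other phones' time range stay NaN; restore the originals
        _index = subdf['latDeg'].isnull()
        subdf.loc[_index, target_cols] = origin_df.loc[_index, target_cols].values

        output_dfs.append(subdf)

    # restore the original row order
    output_df = pd.concat(output_dfs)
    output_df.index = output_df['index'].values
    output_df = output_df.sort_index()

    del output_df['index']

    return output_df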
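
The core of the function above is `interpolate(method='index', limit_area='inside')` on a frame indexed by timestamp. Here is a minimal sketch on made-up data; the phone labels `good` and `bad`, the timestamps and the coordinates are hypothetical.

```python
import numpy as np
import pandas as pd

# Latitudes from two hypothetical phones, indexed by timestamp in milliseconds.
toy = pd.DataFrame(
    {
        'phoneName': ['good', 'bad', 'good', 'bad'],
        'latDeg': [37.0000, 37.0100, 37.0004, 37.0200],
    },
    index=[0, 250, 1000, 1500],
)

# Discard the low-accuracy phone's values...
toy.loc[toy['phoneName'] == 'bad', 'latDeg'] = np.nan

# ...and fill them linearly, weighted by the distance between index values.
# limit_area='inside' fills only gaps surrounded by valid values.
filled = toy['latDeg'].interpolate(method='index', limit_area='inside')

print(filled)
```

The row at 250 ms sits a quarter of the way between the two `good` rows, so it receives a quarter of their latitude difference. The row at 1500 ms lies after the last `good` row and stays NaN, which is why the function then restores the original values for such rows.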

In [None]:
train_remove = get_removedevice(train_base, 'SamsungS20Ultra')

In [None]:
# Baseline for comparison, from check_score(train_base):
#     error meter: 3.846848374995186
#     score: 5.287970649047862

train_remove = check_score(train_remove)

As shown above, this simple step improves the accuracy.  
Because it is pure post-processing, it can also be applied where no train data (ground truth) is available. To try it out, let's apply it to the output of [Baseline post-processing by outlier correction](https://www.kaggle.com/dehokanta/baseline-post-processing-by-outlier-correction) created by [dehokanta](https://www.kaggle.com/dehokanta).

In [None]:
# baseline submission to post-process
submission = pd.read_csv('../input/baseline-post-processing-by-outlier-correction/submission.csv')
# 'phone' has the form '<collectionName>_<phoneName>'
submission['collectionName'] = submission['phone'].map(lambda x: x.split('_')[0])
submission['phoneName'] = submission['phone'].map(lambda x: x.split('_')[1])
submission = get_removedevice(submission, 'SamsungS20Ultra')

# drop the helper columns before writing the file
submission = submission.drop(columns=['collectionName', 'phoneName'])
submission.to_csv('submission.csv', index=False)
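
The test data has no ground truth, so one rough sanity check is to measure how far the post-processing moved each point. The sketch below is my own suggestion under stated assumptions, not part of the original pipeline: `before` and `after` are hypothetical stand-ins for the submission before and after `get_removedevice`, and `haversine_m` is a local copy of the distance formula so the example is self-contained.

```python
import numpy as np
import pandas as pd


def haversine_m(lat1, lon1, lat2, lon2, radius_m=6_367_000):
    """Great-circle distance in meters; inputs are in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius_m * np.arcsin(np.sqrt(a))


# Hypothetical stand-ins for the submission before and after post-processing.
before = pd.DataFrame({
    'phone': ['c1_Pixel4', 'c1_SamsungS20Ultra'],
    'latDeg': [37.0000, 37.0003],
    'lngDeg': [-122.0000, -122.0000],
})
after = before.copy()
after.loc[1, 'latDeg'] = 37.0001  # pretend this row was re-interpolated

# Distance each row moved, summarized per phone.
shift_m = haversine_m(before['latDeg'], before['lngDeg'],
                      after['latDeg'], after['lngDeg'])
summary = shift_m.groupby(before['phone']).describe()

print(summary)
```

Rows of untouched phones should show a shift of exactly zero. Very large shifts on the interpolated phone would be worth inspecting before submitting.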

It is possible to process only the submission file, but be sure to run the same processing on the train data as well and check the score.  
Thank you for reading to the end. I look forward to your questions and comments.

# Reference

I used the following notebook as a reference when creating this one. Thank you very much!

* [Baseline post-processing by outlier correction](https://www.kaggle.com/dehokanta/baseline-post-processing-by-outlier-correction) created by [dehokanta](https://www.kaggle.com/dehokanta).