### Evaluating the training data in three regions

Using [this discussion](https://www.kaggle.com/c/google-smartphone-decimeter-challenge/discussion/245160) as a guide, we divided the training data into three areas (highways, streets with trees, and downtown city streets) and evaluated the baseline locations in each area.

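The results are as follows. Each score is the mean of the 50th- and 95th-percentile distance errors per phone, in meters, averaged over all phones in the area.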

highway : 3.4528073895024414

tree : 6.173261717576203

downtown : 19.432900281799608

Thank you for sharing this discussion with us.

https://www.kaggle.com/c/google-smartphone-decimeter-challenge/discussion/245160

I used the following code as a reference for the evaluation script.

Thank you for sharing code with us.

https://www.kaggle.com/t88take/gsdc-phones-mean-prediction#evaluate-train-score

In [None]:
import pandas as pd
import pathlib
from tqdm.notebook import tqdm
import numpy as np
import matplotlib.pyplot as plt

In [None]:
df_train = pd.read_csv("../input/google-smartphone-decimeter-challenge/baseline_locations_train.csv")
df_test = pd.read_csv("../input/google-smartphone-decimeter-challenge/baseline_locations_test.csv")

In [None]:
train_collectionName = df_train["collectionName"].unique()

In [None]:
# The first 21 collections (indices 0-20, in order of appearance) are treated as highway.
df_train_highway = df_train[df_train['collectionName'].isin(train_collectionName[:21])]

In [None]:
# Collections at these positions are treated as streets with trees.
df_train_tree = df_train[df_train['collectionName'].isin(train_collectionName[[21, 22, 24, 25, 27]])]

In [None]:
# Collections at these positions are treated as downtown.
df_train_downtown = df_train[df_train['collectionName'].isin(train_collectionName[[23, 26, 28]])]
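
The split above depends on the order in which `unique()` returns the collection names. As a hedged alternative, the same idea can be sketched with an explicit name-to-region mapping. The collection names and the `REGION_BY_COLLECTION` dictionary below are hypothetical placeholders, not the real names from the competition data.

```python
import pandas as pd

# Hypothetical collection names, used only to make the example runnable.
REGION_BY_COLLECTION = {
    "drive-A": "highway",
    "drive-B": "tree",
    "drive-C": "downtown",
}

df = pd.DataFrame({
    "collectionName": ["drive-A", "drive-A", "drive-B", "drive-C"],
    "latDeg": [37.0, 37.1, 37.2, 37.3],
})

# Label every row with its region; names missing from the mapping become NaN.
df["region"] = df["collectionName"].map(REGION_BY_COLLECTION)

# Split into one DataFrame per region.
frames = {region: part for region, part in df.groupby("region")}
print({region: len(part) for region, part in frames.items()})
```

With a mapping like this, the region assignment no longer changes if the row order of the CSV changes.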

In [None]:
# ground_truth
p = pathlib.Path("../input/google-smartphone-decimeter-challenge")
gt_files = list(p.glob('train/*/*/ground_truth.csv'))
print('ground_truth.csv count : ', len(gt_files))

gts = []
for gt_file in tqdm(gt_files):
    gts.append(pd.read_csv(gt_file))
ground_truth = pd.concat(gts)

In [None]:
def calc_haversine(lat1, lon1, lat2, lon2):
    """Calculates the great circle distance between two points
    on the earth. Inputs are array-like and specified in decimal degrees.
    """
    RADIUS = 6_367_000
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat/2)**2 + \
        np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
    dist = 2 * RADIUS * np.arcsin(a**0.5)
    return dist
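
As a quick sanity check of the distance function, one degree of latitude along a meridian should equal `RADIUS * pi / 180`, roughly 111 km. The sketch below repeats the function so that it runs on its own; the coordinates are arbitrary illustrative values.

```python
import numpy as np

def calc_haversine(lat1, lon1, lat2, lon2):
    # Same formula and radius as the function above.
    RADIUS = 6_367_000
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    a = np.sin((lat2 - lat1) / 2) ** 2 + \
        np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    return 2 * RADIUS * np.arcsin(a ** 0.5)

# One degree of latitude at constant longitude.
one_degree = calc_haversine(37.0, -122.0, 38.0, -122.0)
expected = 6_367_000 * np.pi / 180
print(one_degree, expected)

# Identical points should give zero distance.
zero = calc_haversine(37.0, -122.0, 37.0, -122.0)
print(zero)
```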

In [None]:
def percentile50(x):
    return np.percentile(x, 50)
def percentile95(x):
    return np.percentile(x, 95)

In [None]:
def get_train_score(df, gt):
    gt = gt.rename(columns={'latDeg':'latDeg_gt', 'lngDeg':'lngDeg_gt'})
    df = df.merge(gt, on=['collectionName', 'phoneName', 'millisSinceGpsEpoch'], how='inner')
    # calc_distance_error
    df['err'] = calc_haversine(df['latDeg_gt'], df['lngDeg_gt'], df['latDeg'], df['lngDeg'])
    # calc_evaluate_score
    df['phone'] = df['collectionName'] + '_' + df['phoneName']
    res = df.groupby('phone')['err'].agg([percentile50, percentile95])
    # Mean of the 50th and 95th percentile errors for each phone.
    res['p50_p95_mean'] = (res['percentile50'] + res['percentile95']) / 2
    score = res['p50_p95_mean'].mean()
    return score,df
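
To make the scoring step concrete, here is a minimal sketch of the same aggregation on synthetic errors. The two phones and their error values are made up for illustration.

```python
import numpy as np
import pandas as pd

def percentile50(x):
    return np.percentile(x, 50)

def percentile95(x):
    return np.percentile(x, 95)

# Synthetic per-sample errors in meters for two hypothetical phones.
df = pd.DataFrame({
    "phone": ["a"] * 5 + ["b"] * 5,
    "err": [1, 2, 3, 4, 5, 10, 20, 30, 40, 50],
})

# Per-phone 50th and 95th percentiles, then their mean per phone.
res = df.groupby("phone")["err"].agg([percentile50, percentile95])
res["p50_p95_mean"] = (res["percentile50"] + res["percentile95"]) / 2

# The final score is the average of the per-phone values.
score = res["p50_p95_mean"].mean()
print(res)
print(score)
```

Phone `a` scores (3 + 4.8) / 2 = 3.9 and phone `b` scores (30 + 48) / 2 = 39, so the overall score is 21.45.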

In [None]:
score_highway,df_highway = get_train_score(df_train_highway, ground_truth)
score_tree,df_tree = get_train_score(df_train_tree, ground_truth)
score_downtown,df_downtown = get_train_score(df_train_downtown, ground_truth)

In [None]:
print('highway :', score_highway)
print('tree :', score_tree)
print('downtown :', score_downtown)

In [None]:
c1, c2, c3 = "blue", "green", "red"
fig, ax = plt.subplots(nrows=1, ncols=3, figsize=(25, 5))
# Draw one histogram per region with the same bins and y-limits for easy comparison.
for a, df_region, title, color in zip(ax,
                                      [df_highway, df_tree, df_downtown],
                                      ['highway', 'tree', 'downtown'],
                                      [c1, c2, c3]):
    a.hist(df_region.err, bins=range(100), color=color)
    a.set_title(title)
    a.set_xlabel('error', fontsize=20)
    a.set_ylabel('freq')
    a.set_ylim(0, 30000)

In [None]:
fig = plt.figure(figsize = (10,5))
plt.hist([df_highway.err, df_tree.err, df_downtown.err], stacked=True, bins=range(100),
         color=[c1, c2, c3], label=["highway", "tree", "downtown"])
plt.xlabel('error', fontsize=15)
plt.legend()  # show the region labels passed to hist
plt.show()

**The downtown data is small in number but has a large error. Because the evaluation averages a per-phone score (the mean of the 50th- and 95th-percentile errors) over all phones, it may be effective to focus on the data where the error is large, even if there is little of it.**
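
The effect of per-phone averaging can be illustrated with a small sketch. The two phones, their sample counts, and their error values are hypothetical; they only show how a short, high-error trace can weigh as much as a long, low-error one.

```python
import numpy as np

def phone_score(err):
    """Mean of the 50th and 95th percentile errors for one phone."""
    return (np.percentile(err, 50) + np.percentile(err, 95)) / 2

# Hypothetical phones: many low-error samples versus few high-error samples.
errors_by_phone = {
    "highway_phone": np.full(1000, 2.0),
    "downtown_phone": np.full(10, 20.0),
}

# Averaging over all samples is dominated by the long highway trace.
sample_mean = np.concatenate(list(errors_by_phone.values())).mean()

# Averaging the per-phone scores weights both phones equally.
competition_style = np.mean([phone_score(e) for e in errors_by_phone.values()])

print(sample_mean, competition_style)
```

In this toy case the sample-weighted mean stays near 2 m while the per-phone average is 11 m, so the 10 downtown samples account for most of the score.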