In [None]:
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

data_path = Path("../input/google-smartphone-decimeter-challenge")

# Problem Definition
## Goal

This competition, hosted by the Android GPS team, was presented at the ION GNSS+ 2021 Conference. They seek to advance research in smartphone GNSS positioning accuracy and help people better navigate the world around them.

Global Navigation Satellite System (GNSS) provides raw signals, which the GPS chipset uses to compute a position. Current mobile phones only offer 3-5 meters of positioning accuracy. While useful in many cases, it can create a “jumpy” experience. For many use cases, the results are neither precise nor stable enough to be reliable.

The objective of the competition is for teams across the world to use data collected from the host team’s own Android phones to compute location down to decimeter or even centimeter resolution, if possible. **We have access to precise ground truth, raw GPS measurements, and assistance data from nearby GPS stations to train and test our submissions.**

The submission file looks like this:

In [None]:
import pandas as pd

sub = pd.read_csv(data_path / "sample_submission.csv")
sub.head()

**latDeg** and **lngDeg** are our targets.

We can use GNSS tracking data and a variety of sensor data to improve our solution.

## About dataset 

Google released 60+ datasets collected from phones in the Android GPS team, together with corrections from SwiftNavigation Inc. and Verizon Inc. These datasets were collected on highways in the US San Francisco Bay Area in the summer of 2020.

We are given data from actual drives with Android devices mounted in cars, as shown in the following figures.

![](https://raw.githubusercontent.com/tasotasoso/kaggle_media/main/Android_smartphones_high_accuracy_GNSS_datasets/fig3_fig4.JPG)

<font size="1">The figures come from <I>Fu, Guoyu (Michael), Khider, Mohammed, van Diggelen, Frank, "Android Raw GNSS Measurement Datasets for Precise Positioning," Proceedings of the 33rd International Technical Meeting of the Satellite Division of The Institute of Navigation (ION GNSS+ 2020), September 2020, pp. 1925-1937.
[https://doi.org/10.33012/2020.17628](https://www.ion.org/publications/abstract.cfm?articleID=17628)</I></font>

We can see more detail of data collection process at [Android smartphones high accuracy GNSS datasets](https://www.kaggle.com/google/android-smartphones-high-accuracy-datasets).

Data collection trials are organized by collectionName, and under each collectionName there is one folder per device. Each device folder holds the data collected from that device, the ground truth, and supplemental data. The supplemental data contains the raw measurements.

# Basic Data Exploration

The first step in any machine learning project is to familiarize yourself with the data. We'll use the pandas library for this. pandas is the primary tool data scientists use for exploring and manipulating data. Most people abbreviate pandas in their code as `pd`. We do this with the command:

In [None]:
import pandas as pd

## Reading data files

### CSV data files

Being able to create a DataFrame or Series by hand is handy. But, most of the time, we won't actually be creating our own data by hand. Instead, we'll be working with data that already exists.

Data can be stored in any of a number of different forms and formats. By far the most basic of these is the humble CSV file. When you open a CSV file you get something that looks like this:

```
Value A,Value B,Value C
30,21,9
35,34,1
41,11,11
```

So a CSV file is a table of values separated by commas. Hence the name: "Comma-Separated Values", or CSV.
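
To make this concrete, here is a minimal sketch that reads a small CSV held in memory. The `csv_text` string and the `toy` name are made up for illustration; `io.StringIO` simply lets `pd.read_csv()` treat the string like a file.

```python
import io

import pandas as pd

# A tiny CSV document: one header line, then one line per record.
csv_text = "Value A,Value B,Value C\n30,21,9\n35,34,1\n41,11,11\n"

# read_csv accepts any file-like object, not only a path on disk.
toy = pd.read_csv(io.StringIO(csv_text))

print(toy.shape)    # (rows, columns)
print(toy.dtypes)   # pandas infers integer columns here
```

Each comma-separated line becomes one row, and the first line supplies the column names.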

Let's now set aside our toy datasets and see what a real dataset looks like when we read it into a DataFrame. We'll use the `pd.read_csv()` function to read the data into a DataFrame. The input data is at the file path **`../input/google-smartphone-decimeter-challenge`**. We will use `[train/test]/[drive_id]/[phone_name]/[phone_name]_derived.csv` as the organized data, and `ground_truth.csv` as the reference.

In [None]:
# read the data and store data in DataFrame
sample_trail_derived = pd.read_csv(data_path / "train/2020-05-14-US-MTV-1/Pixel4/Pixel4_derived.csv")
sample_trail_truth = pd.read_csv(data_path / "train/2020-05-14-US-MTV-1/Pixel4/ground_truth.csv")

We can use the `shape` attribute to check how large the resulting DataFrame is:

In [None]:
sample_trail_derived.shape

So our new DataFrame has 55,218 records split across 20 different columns. That's around 1.1 million entries!

We can examine the contents of the resultant DataFrame using the `head()` command, which grabs the first five rows:

In [None]:
sample_trail_derived.head()

We can also have a look at the first rows of the ground truth DataFrame:

In [None]:
sample_trail_truth.head()

### Read complex data files (GNSS Log)

The phone's logs have been generated by the [GnssLogger App](https://play.google.com/store/apps/details?id=com.google.android.apps.location.gps.gnsslogger&hl=en_US&gl=US). Each GNSS log file contains several sub-datasets:
* **Raw** - The raw GNSS measurements of one GNSS signal (each satellite may have 1-2 signals for L5-enabled smartphones), collected from the Android API [GnssMeasurement](https://developer.android.com/reference/android/location/GnssMeasurement).
* **Status** - The status of a GNSS signal, as collected from the Android API [GnssStatus](https://developer.android.com/reference/android/location/GnssStatus.Callback).
* **UncalAccel** - Readings from the uncalibrated accelerometer, as collected from the Android API [Sensor#TYPE_ACCELEROMETER_UNCALIBRATED](https://developer.android.com/reference/android/hardware/Sensor#TYPE_ACCELEROMETER_UNCALIBRATED).
* **UncalGyro** - Readings from the uncalibrated gyroscope, as collected from the Android API [Sensor#TYPE_GYROSCOPE_UNCALIBRATED](https://developer.android.com/reference/android/hardware/Sensor#TYPE_GYROSCOPE_UNCALIBRATED).
* **UncalMag** - Readings from the uncalibrated magnetometer as collected from the Android API [Sensor#STRING_TYPE_MAGNETIC_FIELD_UNCALIBRATED](https://developer.android.com/reference/android/hardware/Sensor#STRING_TYPE_MAGNETIC_FIELD_UNCALIBRATED).
* **OrientationDeg** - Each row represents an estimated device orientation, collected from Android API [SensorManager#getOrientation](https://developer.android.com/reference/android/hardware/SensorManager#getOrientation%28float%5B%5D,%20float%5B%5D%29). This message is only available in logs collected since March 2021.

In [None]:
# Adapted from https://www.kaggle.com/sohier/loading-gnss-logs
def gnss_log_to_dataframes(path):
    print(f'Loading {path}', flush=True)
    gnss_section_names = {'Raw','UncalAccel', 'UncalGyro', 'UncalMag', 'Fix', 'Status', 'OrientationDeg'}
    with open(path) as f_open:
        datalines = f_open.readlines()

    datas = {k: [] for k in gnss_section_names}
    gnss_map = {k: [] for k in gnss_section_names}
    for dataline in datalines:
        is_header = dataline.startswith('#')
        dataline = dataline.strip('#').strip().split(',')
        # skip over notes, version numbers, etc
        if is_header and dataline[0] in gnss_section_names:
            gnss_map[dataline[0]] = dataline[1:]
        elif not is_header:
            datas[dataline[0]].append(dataline[1:])

    results = dict()
    for k, v in datas.items():
        results[k] = pd.DataFrame(v, columns=gnss_map[k])
    # pandas doesn't properly infer types from these lists by default
    for k, df in results.items():
        for col in df.columns:
            if col == 'CodeType':
                continue
            results[k][col] = pd.to_numeric(results[k][col])

    return results

In [None]:
sample_trail_gnss_log = gnss_log_to_dataframes(data_path / "train/2020-05-14-US-MTV-1/Pixel4/Pixel4_GnssLog.txt")
sample_trail_gnss_log.keys()

We've got a dictionary with one key per sub-dataset. We can have a look at the first rows of the *Raw* dataset.

In [None]:
raw_gnss = sample_trail_gnss_log['Raw']
raw_gnss.head()

## Indexing, Selecting & Assigning

Selecting specific values of a pandas DataFrame or Series to work on is an implicit step in almost any data operation we'll run, so one of the first things you need to learn in working with data in Python is how to go about selecting the data points relevant to us quickly and effectively.

For example, to access the `constellationType` column of the `derived` data we can use:

In [None]:
sample_trail_derived.constellationType # Or sample_trail_derived['constellationType']

To select the first row of data in a DataFrame, we may use the following:

In [None]:
sample_trail_derived.iloc[0]

## Conditional selection

So far we've been indexing various strides of data, using structural properties of the DataFrame itself. To do *interesting* things with the data, however, we often need to ask questions based on conditions. 

For example, suppose that we're interested specifically in Galileo E1 signal types.

We can start by checking whether each entry is a Galileo E1 signal:

In [None]:
sample_trail_derived.signalType == 'GAL_E1'

This operation produced a Series of True/False booleans based on the signal type of each record. This result can then be used inside of loc to select the relevant data:

In [None]:
sample_trail_derived.loc[sample_trail_derived.signalType == 'GAL_E1']

This DataFrame has ~11,000 rows. The original had ~55,000. That means that around 20% of the entries have signal type Galileo E1.

If we also want to restrict the selection to satellite ids 13 or 15, we can use the ampersand (`&`) to bring the two conditions together:

In [None]:
sample_trail_derived.loc[(sample_trail_derived.signalType == 'GAL_E1') & (sample_trail_derived.svid.isin([13, 15]))]

Suppose we want to filter for any Galileo E1 or GPS L5 signal type. For this we use a pipe (`|`):

In [None]:
sample_trail_derived.loc[(sample_trail_derived.signalType == 'GAL_E1') | (sample_trail_derived.signalType == 'GPS_L5')]

## Summary Functions

Pandas provides many simple "summary functions" which restructure the data in some useful way. For example, let's use the `describe()` method to review the carrier-to-noise density in dB-Hz:

In [None]:
raw_gnss.Cn0DbHz.describe()

This method generates a high-level summary of the attributes of the given column. It is type-aware, meaning that its output changes based on the data type of the input. The output above only makes sense for numerical data; for string data here's what we get:

In [None]:
sample_trail_derived.signalType.describe()

If you want to get some particular simple summary statistic about a column in a DataFrame or a Series, there is usually a helpful pandas function that makes it happen. 

For example, to see the mean carrier-to-noise density across all raw measurements, we can use the `mean()` function:

In [None]:
raw_gnss.Cn0DbHz.mean()

To see a list of unique values we can use the `unique()` function:

In [None]:
sample_trail_derived.signalType.unique()

To see a list of unique values _and_ how often they occur in the dataset, we can use the `value_counts()` method:

In [None]:
sample_trail_derived.signalType.value_counts()

## Data Types and Missing Values

The data type for a column in a DataFrame or a Series is known as the **dtype**.

We can use the `dtype` property to grab the type of a specific column. For instance, we can get the dtype of the `Cn0DbHz` column in the `Raw` DataFrame:

In [None]:
raw_gnss.Cn0DbHz.dtype

Alternatively, the `info` method returns the `dtype` of _every_ column in the DataFrame and the number of non-empty values:

In [None]:
sample_trail_derived.info()

## Data Visualization

### Visualize a track with Plotly


In [None]:
import plotly.express as px

# from https://www.kaggle.com/nayuts/let-s-visualize-dataset-to-understand
def visualize_trafic(df, center, zoom=9):
    fig = px.scatter_mapbox(df,
                            
                            # Here, plotly gets, (x,y) coordinates
                            lat="latDeg",
                            lon="lngDeg",
                            
                            #Here, plotly detects color of series
                            color="phoneName",
                            labels="phoneName",
                            
                            zoom=zoom,
                            center=center,
                            height=600,
                            width=800)
    # 'open-street-map' tiles need no Mapbox access token
    fig.update_layout(mapbox_style='open-street-map')
    fig.update_layout(margin={"r": 0, "t": 0, "l": 0, "b": 0})
    fig.update_layout(title_text="GPS traffic")
    fig.show()

In [None]:
center = {"lat":37.423576, "lon":-122.094132}
visualize_trafic(sample_trail_truth, center)

Let's now plot data that share the same collectionName to see if they have the same ground truth.

In [None]:
sample_trail_2_truth = pd.read_csv(data_path / "train/2020-05-14-US-MTV-1/Pixel4XLModded/ground_truth.csv")

# Since plotly colors the points by the phoneName column,
# we can visualize multiple series by simply concatenating the DataFrames.
sample_trail_truth_combined = pd.concat([sample_trail_truth, sample_trail_2_truth])

center = {"lat":37.423576, "lon":-122.094132}
visualize_trafic(sample_trail_truth_combined, center)

### Visualize multiple tracks with GeoPandas

In the previous step we saw how to use plotly to map data on OpenStreetMap. Since the train data alone contains a fair amount of tracking data, we will now use geopandas to get a quick overview as a regular plot.

First, we'll download the boundary shapes of the Bay Area.

In [None]:
import geopandas as gpd
from geopandas import GeoDataFrame
import requests
from shapely.geometry import Point, shape
import shapely.wkt

# Download the boundary data (JSON rows) for the US San Francisco Bay Area.
r = requests.get("https://data.sfgov.org/api/views/wamw-vt4s/rows.json?accessType=DOWNLOAD")
r.raise_for_status()

# Parse the JSON body of the response.
data = r.json()

# Each row stores its geometry as WKT text in column 8.
shapes = [shapely.wkt.loads(d[8]) for d in data["data"]]

# Use only the 6th and 7th objects. Multi-part geometries are split into
# their parts via `.geoms`, which works in both Shapely 1.x and 2.x.
polygons = []
for shp in shapes[5:7]:
    polygons.extend(getattr(shp, "geoms", [shp]))

# Convert the list of polygons to a geopandas GeoDataFrame.
gdf_bayarea = GeoDataFrame(geometry=polygons)

For each collectionName, we read the ground truth files into a format that is convenient for visualization: each file is converted to a geopandas GeoDataFrame as it is read.

In [None]:
import glob

collectionNames = [item.split("/")[-1] for item in glob.glob("../input/google-smartphone-decimeter-challenge/train/*")]

gdfs = []
for collectionName in collectionNames:
    gdfs_each_collectionName = []
    csv_paths = glob.glob(f"../input/google-smartphone-decimeter-challenge/train/{collectionName}/*/ground_truth.csv")
    for csv_path in csv_paths:
        df_gt = pd.read_csv(csv_path)
        df_gt["geometry"] = [Point(lngDeg, latDeg) for lngDeg, latDeg in zip(df_gt["lngDeg"], df_gt["latDeg"])]
        gdfs_each_collectionName.append(GeoDataFrame(df_gt))
    gdfs.append(gdfs_each_collectionName)
    
colors = ['blue', 'green', 'purple', 'orange']

Now, let's visualize the tracks. Some of them are too small to be seen when drawn on the map, so next to each map we also plot the track on its own.

In [None]:
import matplotlib.pyplot as plt

for collectionName, gdfs_each_collectionName in zip(collectionNames, gdfs):
    fig, axs = plt.subplots(1, 2, figsize=(15, 5))
    gdf_bayarea.plot(figsize=(10,10), color='none', edgecolor='gray', zorder=3, ax=axs[0])
    for i, gdf in enumerate(gdfs_each_collectionName):
        g1 = gdf.plot(color=colors[i], ax=axs[0])
        g1.set_title(f"Phone track of {collectionName} with map")
        g2 = gdf.plot(color=colors[i], ax=axs[1])
        g2.set_title(f"Phone track of {collectionName}")

Several tracks with different collectionName values have the same shape. Overlaying them makes their positional relationship easier to understand. There are two roads extending from the northwest to the southeast, and the tracks seem to either follow those roads all the time or occasionally leave them. The tracks wandering around the grid-like streets seem to have been collected farther southeast than those roads.

In [None]:
fig, ax = plt.subplots(figsize=(15, 5))

for collectionName, gdfs_each_collectionName in zip(collectionNames, gdfs):   
    for i, gdf in enumerate(gdfs_each_collectionName):
        gdf.plot(color=colors[i], ax=ax, markersize=5, alpha=0.5)

With geopandas it's easy to see where the tracks are in relation to each other, but it's hard to see the details and geographic context, so let's look at them with plotly as well.

In [None]:
all_tracks = pd.DataFrame()

for collectionName, gdfs_each_collectionName in zip(collectionNames, gdfs):   
    for i, gdf in enumerate(gdfs_each_collectionName):
        all_tracks = pd.concat([all_tracks, gdf])
        # Tracks with the same collectionName share the same ground truth, so one per collection is enough
        break
        
fig = px.scatter_mapbox(all_tracks,
                            
                        # Here, plotly gets, (x,y) coordinates
                        lat="latDeg",
                        lon="lngDeg",
                            
                        #Here, plotly detects color of series
                        color="collectionName",
                        labels="collectionName",
                            
                        zoom=9,
                        center={"lat":37.423576, "lon":-122.094132},
                        height=600,
                        width=800)
# 'open-street-map' tiles need no Mapbox access token
fig.update_layout(mapbox_style='open-street-map')
fig.update_layout(margin={"r": 0, "t": 0, "l": 0, "b": 0})
fig.update_layout(title_text="GPS traffic")
fig.show()

### Visualize Heatmap for Geo-Data with Folium

In [None]:
# from https://www.kaggle.com/dannellyz/start-here-simple-folium-heatmap-for-geo-data
import folium
from folium import plugins


def simple_folium(df:pd.DataFrame, lat_col:str, lon_col:str):
    """
    Description
    ----------
        Returns a simple Folium HeatMap with Markers
    ----------
    Parameters
    ----------
        df : pandas DataFrame, required
            The DataFrame with the data to map
        lat_col : str, required
            The name of the column with latitude
        lon_col : str, required
            The name of the column with longitude
    """
    #Preprocess
    #Drop rows that do not have lat/lon
    df = df[df[lat_col].notnull() & df[lon_col].notnull()]

    # Convert lat/lon to (n, 2) nd-array format for heatmap
    # Then send to list
    df_locs = list(df[[lat_col, lon_col]].values)

    #Set up folium map
    fol_map = folium.Map([df[lat_col].median(), df[lon_col].median()])

    # plot heatmap
    heat_map = plugins.HeatMap(df_locs)
    fol_map.add_child(heat_map)

    # plot markers
    markers = plugins.MarkerCluster(locations = df_locs)
    fol_map.add_child(markers)

    #Add Layer Control
    folium.LayerControl().add_to(fol_map)

    return fol_map

First, let's see how the estimated (baseline) locations in the training and test data look. The ground truth for the training data is available per phone in `{collectionName}/{phoneName}/ground_truth.csv`.

In [None]:
train_baseline_locations = pd.read_csv(data_path / 'baseline_locations_train.csv')
latlon_trn = train_baseline_locations[['latDeg', 'lngDeg']].round(3)
latlon_trn['counts'] = 1
latlon_trn = latlon_trn.groupby(['latDeg', 'lngDeg']).sum().reset_index()
latlon_trn.head()

Let's see the heatmap for the training data.

In [None]:
simple_folium(latlon_trn, 'latDeg', 'lngDeg')

Let's see the heatmap for the test data too.

In [None]:
test_baseline_locations = pd.read_csv(data_path / 'baseline_locations_test.csv')
latlon_tst = test_baseline_locations[['latDeg', 'lngDeg']].round(3)
latlon_tst['counts'] = 1
latlon_tst = latlon_tst.groupby(['latDeg', 'lngDeg']).sum().reset_index()

simple_folium(latlon_tst, 'latDeg', 'lngDeg')

## Feature Engineering

The goal of feature engineering is simply to make our data better suited to the problem at hand. We might perform feature engineering to:
- improve a model's predictive performance
- reduce computational or data needs
- improve interpretability of the results

### A Guiding Principle of Feature Engineering

For a feature to be useful, it must have a relationship to the target that your model is able to learn. Linear models, for instance, are only able to learn linear relationships. So, when using a linear model, your goal is to transform the features to make their relationship to the target linear.

Whatever relationships our model can't learn, we can provide ourselves through transformations. As we develop our feature set, we should think about what information our model could use to achieve its best performance.

#### Corrected Pseudo Range

Let's calculate a corrected pseudorange (i.e. a closer approximation to the geometric range from the phone to the satellite) as described in the data description:
```
correctedPrM = rawPrM + satClkBiasM - isrbM - ionoDelayM - tropoDelayM
```
"The baseline locations are computed using correctedPrM and the satellite positions, using a standard Weighted Least Squares (WLS) solver, with the phone's position (x, y, z), clock bias (t), and isrbM for each unique signal type as states for each epoch."

In [None]:
sample_trail_derived['correctedPrM'] = (sample_trail_derived.rawPrM + sample_trail_derived.satClkBiasM - sample_trail_derived.isrbM - 
                        sample_trail_derived.ionoDelayM - sample_trail_derived.tropoDelayM)
sample_trail_derived.head()

#### Previous location

We can add previous latitude and longitude estimates as features.

In [None]:
train_baseline_locations.columns

In [None]:
# Sort by phone and time so that shift() returns the previous epoch of the same phone.
train_baseline_locations.sort_values(['phone', 'millisSinceGpsEpoch'], inplace=True)
# groupby('phone') keeps the shift inside each phone; the first row of each phone gets NaN.
train_baseline_locations['prev_lat'] = train_baseline_locations.groupby('phone')['latDeg'].shift()
train_baseline_locations['prev_lon'] = train_baseline_locations.groupby('phone')['lngDeg'].shift()

test_baseline_locations.sort_values(['phone', 'millisSinceGpsEpoch'], inplace=True)
test_baseline_locations['prev_lat'] = test_baseline_locations.groupby('phone')['latDeg'].shift()
test_baseline_locations['prev_lon'] = test_baseline_locations.groupby('phone')['lngDeg'].shift()

test_baseline_locations.head()
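
To see what the previous-location feature looks like at the boundary between two phones, here is a small sketch on made-up data. The `toy` frame and its values are hypothetical; only the column names mirror the baseline files.

```python
import pandas as pd

# Two hypothetical phones with unsorted timestamps.
toy = pd.DataFrame({
    "phone": ["a", "a", "b", "b", "b"],
    "millisSinceGpsEpoch": [2, 1, 3, 1, 2],
    "latDeg": [10.2, 10.1, 20.3, 20.1, 20.2],
})

# Sort by phone and time, then shift within each phone.
toy = toy.sort_values(["phone", "millisSinceGpsEpoch"]).reset_index(drop=True)
toy["prev_lat"] = toy.groupby("phone")["latDeg"].shift()

print(toy)
```

The first epoch of each phone has no predecessor, so its `prev_lat` is NaN rather than a value leaked from the other phone. A model that consumes this feature has to handle those missing values.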

# Create a Machine Learning Model

In this example we will apply a Kalman Filter to improve the baseline slightly. 

![Drag Racing](https://simdkalman.readthedocs.io/en/latest/_images/example.png)

Please read the documentation if you would like to learn more about this implementation of the Kalman filter: https://simdkalman.readthedocs.io/en/latest/

In [None]:
!pip install simdkalman

from pathlib import Path
import numpy as np
import pandas as pd
import simdkalman
from tqdm.notebook import tqdm

## Define Model


In [None]:
# from https://www.kaggle.com/jpmiller/baseline-from-host-data
T = 1.0
state_transition = np.array([[1, 0, T, 0, 0.5 * T ** 2, 0], [0, 1, 0, T, 0, 0.5 * T ** 2], [0, 0, 1, 0, T, 0],
                             [0, 0, 0, 1, 0, T], [0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 1]])
process_noise = np.diag([1e-5, 1e-5, 5e-6, 5e-6, 1e-6, 1e-6]) + np.ones((6, 6)) * 1e-9
observation_model = np.array([[1, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0]])
observation_noise = np.diag([5e-5, 5e-5]) + np.ones((2, 2)) * 1e-9

kf = simdkalman.KalmanFilter(
        state_transition = state_transition,
        process_noise = process_noise,
        observation_model = observation_model,
        observation_noise = observation_noise)

def apply_kf_smoothing(df, kf_=kf):
    unique_paths = df[['collectionName', 'phoneName']].drop_duplicates().to_numpy()
    for collection, phone in tqdm(unique_paths):
        cond = np.logical_and(df['collectionName'] == collection, df['phoneName'] == phone)
        data = df[cond][['latDeg', 'lngDeg']].to_numpy()
        data = data.reshape(1, len(data), 2)
        smoothed = kf_.smooth(data)
        df.loc[cond, 'latDeg'] = smoothed.states.mean[0, :, 0]
        df.loc[cond, 'lngDeg'] = smoothed.states.mean[0, :, 1]
    return df
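
The matrices above are easier to read with a small numeric check. The sketch below assumes the state vector is ordered as [lat, lng, lat velocity, lng velocity, lat acceleration, lng acceleration]. That ordering is our reading of `state_transition` and `observation_model`, not something the original author states, and the state values are hypothetical.

```python
import numpy as np

T = 1.0  # time step between epochs, same value as in the cell above

# Same constant-acceleration transition matrix, written one row per line.
F = np.array([
    [1, 0, T, 0, 0.5 * T**2, 0],
    [0, 1, 0, T, 0, 0.5 * T**2],
    [0, 0, 1, 0, T, 0],
    [0, 0, 0, 1, 0, T],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 1],
])
# The observation model keeps only the two position components.
H = np.array([[1, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0]])

# Hypothetical state: position in degrees, velocity and acceleration in degrees per step.
x = np.array([37.0, -122.0, 1e-5, 2e-5, 2e-6, 0.0])

x_next = F @ x         # predicted state one step later
observed = H @ x_next  # what the filter compares against the baseline lat/lng

print(x_next)
print(observed)
```

Each position advances by velocity times `T` plus half the acceleration times `T` squared, and each velocity advances by acceleration times `T`. Under this reading, the filter treats the baseline `latDeg` and `lngDeg` as noisy observations of the first two state components.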

# Model Validation

We've built a model. But how good is it?

Now we will learn to use model validation to measure the quality of our model. Measuring model quality is the key to iteratively improving your models.

## What is Model Validation
We'll want to evaluate almost every model we ever build. In most (though not all) applications, the relevant measure of model quality is predictive accuracy. In other words, will the model's predictions be close to what actually happens?

Many people make a huge mistake when measuring predictive accuracy. They make predictions with their training data and compare those predictions to the target values in the training data. We'll see the problem with this approach and how to solve it in a moment, but let's think about how we'd do this first.

We first need to summarize the model quality in an understandable way. If we compare predicted and actual phone positions, we'll likely find a mix of good and bad predictions. Looking through a list of hundreds of thousands of predicted and actual values would be pointless. We need to summarize this into a single metric.

There are many metrics for summarizing model quality, but we'll start with one called **Mean Absolute Error** (also called **MAE**). Let's break down this metric starting with the last word, error.

The prediction error for each position is: <br>
```
error = haversine_distance(actual_position, predicted_position)
```

So, if a phone is at 50° 03′ 59″ N, 005° 42′ 53″ W and our predicted position is 50° 03′ 59″ N, 005° 42′ 53.1″ W, the error is about 1.98 m.

With the MAE metric, we take the absolute value of each error. This converts each error to a positive number. We then take the average of those absolute errors. This is our measure of model quality. In plain English, this can be stated as:

> On average, our predictions are off by about X.
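
The worked example above can be checked numerically. This is a minimal sketch: `dms_to_deg` and `haversine_m` are helper names introduced here for illustration, the Earth radius of 6,367,000 m is an assumption matching the simplified haversine used in this notebook, and the three errors fed into the MAE are made up.

```python
import numpy as np

def dms_to_deg(degrees, minutes, seconds):
    """Convert degrees, minutes, seconds to decimal degrees."""
    return degrees + minutes / 60 + seconds / 3600

def haversine_m(lat1, lon1, lat2, lon2, radius_m=6_367_000):
    """Great-circle distance in meters between points given in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    a = np.sin((lat2 - lat1) / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    return 2 * radius_m * np.arcsin(np.sqrt(a))

lat = dms_to_deg(50, 3, 59)             # north latitude is positive
lon_true = -dms_to_deg(5, 42, 53)       # west longitude is negative
lon_pred = -dms_to_deg(5, 42, 53.1)

err = haversine_m(lat, lon_true, lat, lon_pred)
print(f"single error: {err:.2f} m")

# MAE over a few hypothetical per-epoch errors in meters.
errors_m = np.array([2.0, 0.5, 3.5])
mae = np.mean(np.abs(errors_m))
print(f"MAE: {mae:.2f} m")
```

A shift of 0.1 arc-seconds in longitude at this latitude corresponds to roughly two meters, which is the scale of error the competition asks us to reduce.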

MAE is not the only measure that can be used to summarize model quality. For example, in the case of the Google Smartphone Decimeter Challenge, submissions are scored on the mean of the **50th and 95th percentile distance errors**.
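
As a rough sketch of such a percentile-based score, the function below averages the 50th and 95th percentile errors for each phone and then takes the mean across phones. The per-phone grouping is our recollection of the competition's evaluation rules and is not stated in this notebook, so treat it as an assumption. `percentile_score`, the column names `phone` and `dist`, and the toy data are all hypothetical.

```python
import pandas as pd

def percentile_score(df, phone_col="phone", dist_col="dist"):
    """Mean over phones of the average of each phone's 50th and 95th percentile error."""
    # One row per phone, with columns 0.5 and 0.95.
    per_phone = df.groupby(phone_col)[dist_col].quantile([0.5, 0.95]).unstack()
    return float(per_phone.mean(axis=1).mean())

# Hypothetical distance errors in meters for two phones.
toy_errors = pd.DataFrame({
    "phone": ["A"] * 5 + ["B"] * 3,
    "dist": [1.0, 2.0, 3.0, 4.0, 5.0, 2.0, 2.0, 2.0],
})

score = percentile_score(toy_errors)
print(f"score: {score:.2f} m")
```

Unlike the mean, the 95th percentile term penalizes a track for its worst epochs, so a few large outliers can dominate the score.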

In [None]:
# Simplified haversine distance
def haversine_distance(lat1: float, lon1: float, lat2: float, lon2: float):
    """Calculates the great circle distance between two points
    on the earth. Inputs are array-like and specified in decimal degrees.
    """
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2

    c = 2 * np.arcsin(a**0.5)
    dist = 6_367_000 * c
    return dist

## Calculate Google Baseline model accuracy as reference point

In order to evaluate how good our model is, as a reference we can analyze the accuracy of the baseline solution provided by Google. The baseline locations are computed using a corrected pseudorange (i.e. a closer approximation to the geometric range from the phone to the satellite) and the satellite positions, using a standard Weighted Least Squares (WLS) solver, with the phone's position (x, y, z), clock bias (t), and the Inter-Signal Range Bias in meters (isrbM) for each unique signal type as states for each epoch.

In [None]:
# from https://www.kaggle.com/jpmiller/baseline-from-host-data
truths = (data_path / 'train').rglob('ground_truth.csv')

df_list = []
cols = ['collectionName', 'phoneName', 'millisSinceGpsEpoch', 'latDeg', 'lngDeg']

for t in tqdm(truths, total=73):
    df_phone = pd.read_csv(t, usecols=cols)  
    df_list.append(df_phone)
df_truth = pd.concat(df_list, ignore_index=True)

baseline_locations_train = pd.read_csv(data_path / 'baseline_locations_train.csv', usecols=cols)
baseline_predictions = df_truth.merge(baseline_locations_train, how='inner', on=cols[:3], suffixes=('_truth', '_basepred'))

baseline_predictions['dist'] = haversine_distance(baseline_predictions.latDeg_truth, baseline_predictions.lngDeg_truth, 
    baseline_predictions.latDeg_basepred, baseline_predictions.lngDeg_basepred)

display(baseline_predictions[:5])

In [None]:
print(f'Mean error of the baseline locations on the train dataset: {baseline_predictions.dist.mean():.3f}m')
print(f'50th and 95th Percentile Distance Error: {np.percentile(baseline_predictions.dist, [50, 95])}m')

## Calculate the Mean Absolute Error of our Model

In [None]:
df_basepreds_kf = apply_kf_smoothing(pd.read_csv('../input/google-smartphone-decimeter-challenge/baseline_locations_train.csv', usecols=cols))
df_all = df_truth.merge(df_basepreds_kf, how='inner', on=cols[:3], suffixes=('_truth', '_basepred'))

df_all['dist'] = haversine_distance(df_all.latDeg_truth, df_all.lngDeg_truth, df_all.latDeg_basepred, df_all.lngDeg_basepred)

print(f'Mean error of our model on train dataset: {df_all.dist.mean():.3f}m')
print(f'50th and 95th Percentile Distance Error: {np.percentile(df_all.dist, [50, 95])}m')

Our model accuracy is slightly better than the baseline but there are many ways to improve this model, such as experimenting to find better features or different model types. 

## The Problem with "In-Sample" Scores

The measure we just computed can be called an "in-sample" score. We used a single "sample" of measures for both building the model and evaluating it. But since the model was derived from the training data, the model will appear accurate in the training data.

Since a model's practical value comes from making predictions on new data, we measure performance on data that wasn't used to build the model. The most straightforward way to do this is to exclude some data from the model-building process, and then use that data to test the model's accuracy on data it hasn't seen before. This data is called **validation data**.
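
One way to set validation data aside here is to hold out whole collections, so that every drive is either used for building the model or for validating it, never both. The sketch below is one possible approach rather than the method used in this notebook. `split_by_collection`, its parameters, and the toy data are hypothetical; only the `collectionName` column comes from the dataset.

```python
import numpy as np
import pandas as pd

def split_by_collection(df, valid_fraction=0.2, seed=0):
    """Hold out a random subset of collections as validation data."""
    names = df["collectionName"].unique()
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(names)
    # Keep at least one collection for validation.
    n_valid = max(1, int(round(len(shuffled) * valid_fraction)))
    valid_names = set(shuffled[:n_valid])
    is_valid = df["collectionName"].isin(valid_names)
    return df[~is_valid], df[is_valid]

# Five hypothetical drives with two rows each.
toy = pd.DataFrame({
    "collectionName": np.repeat([f"drive-{i}" for i in range(5)], 2),
    "dist": np.arange(10, dtype=float),
})

train_part, valid_part = split_by_collection(toy)
print(sorted(valid_part["collectionName"].unique()))
```

Splitting by collection rather than by row matters because consecutive epochs of one drive are highly correlated. A split like this becomes relevant once parameters such as the Kalman filter's noise matrices are tuned against the ground truth.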

# Submit results

In [None]:
test_base = pd.read_csv('../input/google-smartphone-decimeter-challenge/baseline_locations_test.csv')

output = pd.read_csv('../input/google-smartphone-decimeter-challenge/sample_submission.csv')

kf_smoothed_baseline = apply_kf_smoothing(test_base)
output = output.assign(
    latDeg = kf_smoothed_baseline.latDeg,
    lngDeg = kf_smoothed_baseline.lngDeg
)
output.to_csv('submission.csv', index=False)

# References
This notebook is based on:
- https://www.kaggle.com/dansbecker/model-validation
- https://www.kaggle.com/emaerthin/demonstration-of-the-kalman-filter
- https://www.kaggle.com/jpmiller/baseline-from-host-data
- https://www.kaggle.com/nayuts/let-s-visualize-dataset-to-understand
- https://www.kaggle.com/dansbecker/basic-data-exploration
- https://www.kaggle.com/dannellyz/start-here-simple-folium-heatmap-for-geo-data
- https://www.kaggle.com/jeongyoonlee/google-smartphone-decimeter-eda-keras-tpu
- https://www.kaggle.com/residentmario/creating-reading-and-writing
- https://www.kaggle.com/residentmario/indexing-selecting-assigning
- https://www.kaggle.com/residentmario/summary-functions-and-maps
- https://www.kaggle.com/residentmario/data-types-and-missing-values
- https://www.kaggle.com/ryanholbrook/what-is-feature-engineering