### Disclaimer
* Credits go to https://www.kaggle.com/museas for revisiting the cost minimisation solution. 
* I came across this method in https://www.kaggle.com/mehrankazeminia/1-3-indoor-navigation-cost-minimization-floor, whose author I thank here for their contribution.
* The submission file used was from the best public ensemble, published in this notebook: https://www.kaggle.com/saurabhbagchi/ensembling-best-performing-notebooks

### Solution follows



**The step estimation module provided by the host works well, but it sometimes points to strange positions. The host's module uses TYPE_ROTATION_VECTOR. Combining it with a heading estimate based on TYPE_MAGNETIC_FIELD can bring the estimated walking direction closer to the true one.**



The score improved by 0.018 in this submission:

4.545 → 4.527

This notebook demonstrates a post-processing strategy for the
[Indoor Location & Navigation](https://www.kaggle.com/c/indoor-location-navigation)
competition.

To combine the machine learning (Wi-Fi feature) predictions with sensor data (acceleration, attitude heading),
I define the cost function as follows:
$$
L(X_{1:N}) = \sum_{i=1}^{N} \alpha_i \| X_i - \hat{X}_i \|^2 + \sum_{i=1}^{N-1} \beta_i \| (X_{i+1} - X_{i}) - \Delta \hat{X}_i \|^2
$$
where $\hat{X}_i$ is the absolute position predicted by machine learning and $\Delta \hat{X}_i$ is the relative displacement predicted from sensor data.

Since the cost function is quadratic, the optimal $X$ is the solution of the linear system $Q X = c$,
where $Q$ and $c$ are derived from the cost function above.
The matrix $Q$ is tridiagonal, but its inverse is generally dense, so
each machine learning prediction is corrected by *all* machine learning predictions and sensor data.
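
A sketch of where $Q$ and $c$ come from, using matrix names that mirror the `correct_path` code in this notebook. Let $A = \mathrm{diag}(\alpha_1, \dots, \alpha_N)$, $B = \mathrm{diag}(\beta_1, \dots, \beta_{N-1})$, and let $D$ be the $(N-1) \times N$ difference matrix with $(D X)_i = X_{i+1} - X_i$. For each coordinate (the x and y columns are handled independently), the cost reads
$$
L(X) = (X - \hat{X})^\top A \, (X - \hat{X}) + (D X - \Delta \hat{X})^\top B \, (D X - \Delta \hat{X})
$$
Setting the gradient to zero,
$$
\nabla_X L = 2 A (X - \hat{X}) + 2 D^\top B \, (D X - \Delta \hat{X}) = 0
$$
gives the linear system
$$
\underbrace{(A + D^\top B D)}_{Q} \, X = \underbrace{A \hat{X} + D^\top B \, \Delta \hat{X}}_{c}
$$
The term $D^\top B D$ only couples index $i$ with $i \pm 1$, which is why $Q$ is tridiagonal.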

The optimal hyperparameters ($\alpha$ and $\beta$) can be estimated from the expected errors of the machine learning predictions and of the sensor data,
or simply tuned against the public score.
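
To make the effect of $\alpha$ and $\beta$ concrete, here is a minimal sketch on toy data. The helper name `smooth_path` and all numbers are assumptions made for illustration; only the weights $8.1^{-2}$ and $0.3^{-2}$ are borrowed from the code further down.

```python
import numpy as np
import scipy.sparse
import scipy.sparse.linalg


def smooth_path(xy_hat, delta_xy_hat, alpha, beta):
    """Minimise the quadratic cost for one path (hypothetical helper name).

    xy_hat       : (N, 2) absolute positions from the ML model
    delta_xy_hat : (N-1, 2) displacements between consecutive points from sensors
    alpha, beta  : weights of shape (N,) and (N-1,)
    """
    n = xy_hat.shape[0]
    A = scipy.sparse.diags(alpha)
    B = scipy.sparse.diags(beta)
    # D is the (N-1) x N forward-difference matrix: (D @ X)[i] = X[i+1] - X[i]
    D = scipy.sparse.diags([-np.ones(n - 1), np.ones(n - 1)], [0, 1], shape=(n - 1, n))
    Q = (A + D.T @ B @ D).tocsc()
    c = A @ xy_hat + D.T @ (B @ delta_xy_hat)
    return scipy.sparse.linalg.spsolve(Q, c)


# Toy data: the sensors say "1 m east per step", the ML prediction has one outlier in y.
xy_hat = np.array([[0.0, 0.0], [1.0, 3.0], [2.0, 0.0], [3.0, 0.0]])
delta_xy_hat = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])
alpha = np.full(4, 8.1 ** -2)   # weak trust in absolute positions
beta = np.full(3, 0.3 ** -2)    # strong trust in relative motion
xy_star = smooth_path(xy_hat, delta_xy_hat, alpha, beta)
print(np.round(xy_star, 2))
```

Because $\beta$ is much larger than $\alpha$ here, the outlier is spread over the whole path instead of producing a 3 m jump at a single point.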

## References
+ [Simple 99% Accurate Floor Model](https://www.kaggle.com/nigelhenry/simple-99-accurate-floor-model)
+ [Indoor Location Competition 2.0 (Sample Data and Code)](https://github.com/location-competition/indoor-location-competition-20)

In [None]:
!git clone --depth 1 https://github.com/location-competition/indoor-location-competition-20 indoor_location_competition_20
!rm -rf indoor_location_competition_20/data

In [None]:
import multiprocessing
import numpy as np
import pandas as pd
import scipy.interpolate
import scipy.sparse
import scipy.sparse.linalg  # spsolve is used in correct_path
from tqdm import tqdm

from indoor_location_competition_20.io_f import read_data_file
import indoor_location_competition_20.compute_f as compute_f

In [None]:
INPUT_PATH = '../input/indoor-location-navigation'

In [None]:
def compute_rel_positions(acce_datas, ahrs_datas):
    step_timestamps, step_indexs, step_acce_max_mins = compute_f.compute_steps(acce_datas)
    headings = compute_f.compute_headings(ahrs_datas)
    stride_lengths = compute_f.compute_stride_length(step_acce_max_mins)
    step_headings = compute_f.compute_step_heading(step_timestamps, headings)
    rel_positions = compute_f.compute_rel_positions(stride_lengths, step_headings)
    return rel_positions

In [None]:
import math

order = 6
# fs = 50.0  # sample rate, Hz
# cutoff = 3

fs = 100
cutoff = 3.667  # desired cutoff frequency of the filter, Hz

step_distance = 0.75
w_height = 1.7
m_trans = -2

In [None]:
from scipy.signal import butter, lfilter

def butter_lowpass(cutoff, fs, order=5):
    nyq = 0.5 * fs
    normal_cutoff = cutoff / nyq
    b, a = butter(order, normal_cutoff, btype='low', analog=False)
    return b, a

def butter_lowpass_filter(data, cutoff, fs, order=5):
    b, a = butter_lowpass(cutoff, fs, order=order)
    y = lfilter(b, a, data)
    return y
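
As a quick sanity check of what this filter does, the sketch below feeds two synthetic sine waves through a Butterworth low-pass with the same `order`, `fs` and `cutoff` values as above. The test frequencies (1 Hz and 20 Hz) are arbitrary choices for illustration, not values taken from the competition data.

```python
import numpy as np
from scipy.signal import butter, lfilter

fs = 100          # sample rate in Hz (value used in this notebook)
cutoff = 3.667    # cutoff frequency in Hz (value used in this notebook)
order = 6

b, a = butter(order, cutoff / (0.5 * fs), btype='low', analog=False)

t = np.arange(0, 5, 1 / fs)                 # 5 s of synthetic samples
slow = np.sin(2 * np.pi * 1.0 * t)          # 1 Hz: well below the cutoff
fast = np.sin(2 * np.pi * 20.0 * t)         # 20 Hz: stand-in for sensor jitter

slow_out = lfilter(b, a, slow)
fast_out = lfilter(b, a, fast)

# Skip the first 2 s so the filter's start-up transient does not dominate.
print("1 Hz amplitude after filtering :", np.abs(slow_out[200:]).max())
print("20 Hz amplitude after filtering:", np.abs(fast_out[200:]).max())
```

Note that `lfilter` is a causal filter, so the output is delayed relative to the input; this shifts the detected crossings slightly in time but does not change their count.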

In [None]:
def peak_accel_threshold(data, timestamps, threshold):
    d_acc = []
    last_state = 'below'
    crest_troughs = 0
    crossings = []

    for i, datum in enumerate(data):
        
        current_state = last_state
        if datum < threshold:
            current_state = 'below'
        elif datum > threshold:
            current_state = 'above'

        # Compare strings by value; `is` only tests object identity.
        if current_state != last_state:
            if current_state == 'above':
                crossing = [timestamps[i], threshold]
                crossings.append(crossing)
            else:
                crossing = [timestamps[i], threshold]
                crossings.append(crossing)

            crest_troughs += 1
        last_state = current_state
    return np.array(crossings)
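
The same idea can be written without an explicit loop. The following is a sketch of a vectorised variant, not a drop-in replacement: the name `threshold_crossings` is hypothetical, it returns only the timestamps, and a sample exactly equal to the threshold counts as "below" instead of keeping the previous state.

```python
import numpy as np


def threshold_crossings(data, timestamps, threshold):
    """Return the timestamps at which `data` switches sides of `threshold`.

    Hypothetical vectorised variant: a sample exactly equal to the threshold
    is treated as 'below' rather than inheriting the previous state.
    """
    data = np.asarray(data)
    timestamps = np.asarray(timestamps)
    # Prepend False so the state before the first sample is 'below'.
    above = np.concatenate([[False], data > threshold])
    changed = above[1:] != above[:-1]
    return timestamps[changed]


# Three seconds of a 1 Hz oscillation sampled at 100 Hz, timestamps in ms.
ts = np.arange(300) * 10
signal = np.sin(2 * np.pi * ts / 1000.0)
crossing_times = threshold_crossings(signal, ts, threshold=0.5)
print(len(crossing_times), "crossings ->", len(crossing_times) / 2, "steps")
```

Each oscillation crosses the threshold once upwards and once downwards, which is why the step count is half the number of crossings.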

**Blending method: the stride lengths from both estimators are halved and the steps from both are kept. This doubles the number of steps, but only the distance moved matters here.**
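
A minimal sketch of this blending on made-up numbers: each row is `[timestamp_ms, dx, dy]`, and the two arrays stand in for the outputs of the two step estimators. The total displacement of the blended stream is the average of the two individual totals.

```python
import numpy as np

# Each row is [timestamp_ms, dx, dy]; both arrays are made-up toy values.
rel_a = np.array([[1000, 0.0, 0.8], [2000, 0.0, 0.8]])   # e.g. rotation-vector based
rel_b = np.array([[1100, 0.2, 0.6], [2100, 0.2, 0.6]])   # e.g. magnetometer based

half_a = rel_a.copy()
half_b = rel_b.copy()
half_a[:, 1:] /= 2   # halve the stride of every step, keep the timestamps
half_b[:, 1:] /= 2

blended = np.vstack([half_a, half_b])
blended = blended[np.argsort(blended[:, 0])]   # order the merged steps by time

print(blended)
print("total displacement:", blended[:, 1:].sum(axis=0))
```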

In [None]:
def steps_compute_rel_positions(sample_file):
    
    mix_acce = np.sqrt(sample_file.acce[:,1:2]**2 + sample_file.acce[:,2:3]**2 + sample_file.acce[:,3:4]**2)
    mix_acce = np.concatenate([sample_file.acce[:,0:1], mix_acce], 1)
    mix_df = pd.DataFrame(mix_acce)
    mix_df.columns = ["timestamp","acce"]
    
    filtered = butter_lowpass_filter(mix_df["acce"], cutoff, fs, order)

    threshold = filtered.mean() * 1.1
    crossings = peak_accel_threshold(filtered, mix_df["timestamp"], threshold)

    step_sum = len(crossings)/2
    distance = w_height * 0.4 * step_sum

    mag_df = pd.DataFrame(sample_file.magn)
    mag_df.columns = ["timestamp","x","y","z"]
    
    acce_df = pd.DataFrame(sample_file.acce)
    acce_df.columns = ["timestamp","ax","ay","az"]
    
    mag_df = pd.merge(mag_df,acce_df,on="timestamp")
    mag_df = mag_df.dropna()  # dropna() returns a new frame, so keep the result
    
    time_di_list = []

    for i in mag_df.iterrows():

        gx,gy,gz = i[1][1],i[1][2],i[1][3]
        ax,ay,az = i[1][4],i[1][5],i[1][6]

        roll = math.atan2(ay,az)
        pitch = math.atan2(-1*ax , (ay * math.sin(roll) + az * math.cos(roll)))

        q = m_trans - math.degrees(math.atan2(
            (gz*math.sin(roll)-gy*math.cos(roll)),(gx*math.cos(pitch) + gy*math.sin(roll)*math.sin(pitch) + gz*math.sin(pitch)*math.cos(roll))
        )) -90
        if q <= 0:
            q += 360
        time_di_list.append((i[1][0],q))

    d_list = [x[1] for x in time_di_list]
    
    steps = []
    step_time = []
    di_dict = dict(time_di_list)

    for n,i in enumerate(crossings[:,:1]):
        if n % 2 == 1:
            continue
        direct_now = di_dict[i[0]]
        dx = math.sin(math.radians(direct_now))
        dy = math.cos(math.radians(direct_now))
#         print(int(n/2+1),"th step / x:",dx,"/ y:",dy,"/ angle:",direct_now)
        steps.append((i[0],dx,dy))
        step_time.append(i[0])
    
    # Build the relative positions once, after all steps have been collected
    # (the result is the same as rebuilding them on every loop iteration).
    # Time between consecutive steps in seconds; the first step gets a nominal 5 s.
    step_dtime = np.diff(step_time)/1000
    step_dtime = step_dtime.tolist()
    step_dtime.insert(0,5)

    rel_position = []

    wp_idx = 0
#     print("WP:",round(sample_file.waypoint[0,1],3),round(sample_file.waypoint[0,2],3),sample_file.waypoint[0,0])
#     print("------------------")
    for p,i in enumerate(steps):
        # Stride length as a fraction of body height, chosen from the step interval.
        step_distance = 0
        if step_dtime[p] >= 1:
            step_distance = w_height*0.25
        elif step_dtime[p] >= 0.75:
            step_distance = w_height*0.3
        elif step_dtime[p] >= 0.5:
            step_distance = w_height*0.4
        elif step_dtime[p] >= 0.35:
            step_distance = w_height*0.45
        elif step_dtime[p] >= 0.2:
            step_distance = w_height*0.5
        else:
            step_distance = w_height*0.4

#         step_x += i[1]*step_distance
#         step_y += i[2]*step_distance

        rel_position.append([i[0], i[1]*step_distance, i[2]*step_distance])
#     print(rel_position)
    
    return np.array(rel_position)
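
The heading computation inside the loop above can be isolated for inspection. The sketch below mirrors that formula for a single sample; the function name `magnetic_heading_deg` is hypothetical, `offset_deg` plays the role of `m_trans`, and the sensor values in the example are made up.

```python
import math


def magnetic_heading_deg(ax, ay, az, mx, my, mz, offset_deg=-2.0):
    """Heading in degrees in (0, 360], mirroring the formula used above.

    `offset_deg` plays the role of `m_trans`; the function name and the
    example inputs below are illustrative assumptions.
    """
    # Roll and pitch are estimated from the accelerometer reading.
    roll = math.atan2(ay, az)
    pitch = math.atan2(-ax, ay * math.sin(roll) + az * math.cos(roll))
    # The magnetometer reading is rotated by roll and pitch before taking the angle.
    yaw = math.atan2(
        mz * math.sin(roll) - my * math.cos(roll),
        mx * math.cos(pitch)
        + my * math.sin(roll) * math.sin(pitch)
        + mz * math.sin(pitch) * math.cos(roll),
    )
    q = offset_deg - math.degrees(yaw) - 90
    if q <= 0:
        q += 360
    return q


# Accelerometer reading along +z only, magnetic field along the device's +y axis.
heading = magnetic_heading_deg(0.0, 0.0, 9.81, 0.0, 30.0, 0.0)
dx, dy = math.sin(math.radians(heading)), math.cos(math.radians(heading))
print(heading, dx, dy)
```

With zero roll and pitch the expression reduces to `offset_deg - degrees(atan2(-my, mx)) - 90`, so the example isolates the effect of the offset.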

In [None]:
def correct_path(args):
    path, path_df = args
    
    T_ref  = path_df['timestamp'].values
    xy_hat = path_df[['x', 'y']].values
    
    example = read_data_file(f'{INPUT_PATH}/test/{path}.txt')

    rel_positions1 = compute_rel_positions(example.acce, example.ahrs)
    rel_positions2 = steps_compute_rel_positions(example)
    rel1 = rel_positions1.copy()
    rel2 = rel_positions2.copy()
    rel1[:,1:] = rel_positions1[:,1:] / 2
    rel2[:,1:] = rel_positions2[:,1:] / 2
    rel_positions = np.vstack([rel1,rel2])
    rel_positions = rel_positions[np.argsort(rel_positions[:, 0])]
    
    if T_ref[-1] > rel_positions[-1, 0]:
        rel_positions = [np.array([[0, 0, 0]]), rel_positions, np.array([[T_ref[-1], 0, 0]])]
    else:
        rel_positions = [np.array([[0, 0, 0]]), rel_positions]
    rel_positions = np.concatenate(rel_positions)
    
    T_rel = rel_positions[:, 0]
    delta_xy_hat = np.diff(scipy.interpolate.interp1d(T_rel, np.cumsum(rel_positions[:, 1:3], axis=0), axis=0)(T_ref), axis=0)

    N = xy_hat.shape[0]
    delta_t = np.diff(T_ref)
    alpha = (8.1)**(-2) * np.ones(N)
    beta  = (0.3 + 0.3 * 1e-3 * delta_t)**(-2)
    A = scipy.sparse.spdiags(alpha, [0], N, N)
    B = scipy.sparse.spdiags( beta, [0], N-1, N-1)
    D = scipy.sparse.spdiags(np.stack([-np.ones(N), np.ones(N)]), [0, 1], N-1, N)

    Q = A + (D.T @ B @ D)
    c = (A @ xy_hat) + (D.T @ (B @ delta_xy_hat))
    xy_star = scipy.sparse.linalg.spsolve(Q, c)

    return pd.DataFrame({
        'site_path_timestamp' : path_df['site_path_timestamp'],
        'floor' : path_df['floor'],
        'x' : xy_star[:, 0],
        'y' : xy_star[:, 1],
    })
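
The step that turns per-step displacements into displacements between waypoint timestamps (`delta_xy_hat`) can be checked on a toy example. The numbers below are made up; the calls are the same `np.cumsum`, `scipy.interpolate.interp1d` and `np.diff` used in `correct_path`.

```python
import numpy as np
import scipy.interpolate

# Toy step stream: [timestamp_ms, dx, dy]; the first row is a zero step at t=0.
rel_positions = np.array([
    [0,    0.0, 0.0],
    [1000, 1.0, 0.0],
    [2000, 1.0, 0.0],
    [3000, 1.0, 0.0],
])
T_ref = np.array([500, 2500])   # timestamps of two waypoints to predict

T_rel = rel_positions[:, 0]
cum_xy = np.cumsum(rel_positions[:, 1:3], axis=0)             # position along the path
xy_at_ref = scipy.interpolate.interp1d(T_rel, cum_xy, axis=0)(T_ref)
delta_xy_hat = np.diff(xy_at_ref, axis=0)                      # displacement between waypoints
print(delta_xy_hat)
```

Padding the step stream with a zero row at time 0 (and at the last waypoint time when needed), as `correct_path` does, keeps every waypoint timestamp inside the interpolation range.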

In [None]:
sub = pd.read_csv('../input/indoor-loc-and-nav-subs/submission_4475.csv')  # ('../input/simple-99-accurate-floor-model/submission.csv')
tmp = sub['site_path_timestamp'].apply(lambda s : pd.Series(s.split('_')))
sub['site'] = tmp[0]
sub['path'] = tmp[1]
sub['timestamp'] = tmp[2].astype(float)

processes = multiprocessing.cpu_count()
with multiprocessing.Pool(processes=processes) as pool:
    dfs = pool.imap_unordered(correct_path, sub.groupby('path'))
    dfs = tqdm(dfs)
    dfs = list(dfs)
sub = pd.concat(dfs).sort_values('site_path_timestamp')
sub.to_csv('first-submission.csv', index=False)
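
Splitting `site_path_timestamp` with `apply(lambda s: pd.Series(...))` builds one Series per row, which can be slow on large frames. A vectorised alternative is sketched below on made-up IDs; it assumes, as the code above does, that the site and path IDs themselves contain no underscore.

```python
import pandas as pd

sub_demo = pd.DataFrame({
    'site_path_timestamp': [
        'siteA_path1_0000000000009',
        'siteA_path1_0000000009017',
        'siteB_path2_0000000000250',
    ]
})
# expand=True returns one column per piece; n=2 allows at most two splits.
parts = sub_demo['site_path_timestamp'].str.split('_', n=2, expand=True)
sub_demo['site'] = parts[0]
sub_demo['path'] = parts[1]
sub_demo['timestamp'] = parts[2].astype(float)
print(sub_demo)
```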

# Postprocessing with leaked feature

In the discussion section, I proposed [the possibility of a leak caused by raw timestamps](https://www.kaggle.com/c/indoor-location-navigation/discussion/228898).

In short:
* The last waypoints of some paths are exactly the same as the first waypoints of other paths. I think this is because they were split from a single measurement.
* The raw timestamps carry information about the floor. I think this is because the measurements were taken floor by floor.

In this notebook, I demonstrate the existence of the leak by showing that postprocessing based only on the raw timestamps, without any sensor information, gives results that are good or even better.

The postprocessing algorithm is simple.
* For x and y of the first waypoint, I use those of the temporally nearest endpoint in the training dataset.
* For x and y of the last waypoint, I use those of the temporally nearest start point in the training dataset.
* For the floor of all waypoints, I use the floor of the temporally nearest endpoint and start point in the training dataset if the two agree.

I have not decided whether to use this leak. Should I try to build a model that is valid in real life, or just try to win this competition? I hope the competition host recreates the test dataset without any leaks.

I use some code, data, and ideas from the following notebooks. Thank you very much.
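
The first rule can be sketched as follows on made-up metadata. The helper name `nearest_previous_endpoint` is hypothetical, and the sketch leaves out the `gap` correction that `leak_postprocessing` adds to the test start time; the threshold default of 5500 is the `start_threshold` default used there.

```python
import pandas as pd

# Made-up training metadata: one row per training path of the same site.
train_demo = pd.DataFrame({
    'path_id':  ['a', 'b', 'c'],
    'end_time': [1000, 5000, 9000],
    'end_wp_x': [10.0, 20.0, 30.0],
    'end_wp_y': [1.0, 2.0, 3.0],
})


def nearest_previous_endpoint(train_meta_site, start_time, threshold=5500):
    """Return the latest training path that ended before `start_time`,
    or None if there is none within `threshold` (hypothetical helper)."""
    earlier = train_meta_site[train_meta_site.end_time < start_time]
    if len(earlier) == 0:
        return None
    nearest = earlier.loc[earlier.end_time.idxmax()]
    if start_time - nearest.end_time >= threshold:
        return None
    return nearest


hit = nearest_previous_endpoint(train_demo, start_time=6000)
miss = nearest_previous_endpoint(train_demo, start_time=20000)
print(hit.end_wp_x, hit.end_wp_y, miss)
```

When a match is found, its `end_wp_x` and `end_wp_y` replace the predicted coordinates of the test path's first waypoint; the rule for the last waypoint is the mirror image using training start times.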
* https://www.kaggle.com/kenmatsu4/feature-store-for-indoor-location-navigation
* https://www.kaggle.com/mehrankazeminia/3-3-g6-indoor-navigation-snap-to-grid
* https://www.kaggle.com/jiweiliu/fix-the-timestamps-of-test-data-using-dask

[update] I found a bug in the calculation of the raw timestamp and fixed it. The score got slightly worse.
[update2] I found another bug in the function that loads the test data and fixed it. The score then got slightly better.

In [None]:
import json
import re
import gc
import pickle
import itertools
import pandas as pd
import numpy as np
from glob import glob
from datetime import datetime as dt
from pathlib import Path
from tqdm import tqdm
import datetime
ts_conv = np.vectorize(datetime.datetime.fromtimestamp) # ut(10 digit) -> date

# pandas settings -----------------------------------------
pd.set_option("display.max_colwidth", 100)
pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)
pd.options.display.float_format = '{:,.5f}'.format

# Graph drawing -------------------------------------------
import matplotlib
from matplotlib import font_manager
import matplotlib.pyplot as plt
import matplotlib.cm as cm
from matplotlib import rc
from matplotlib_venn import venn2, venn2_circles
from matplotlib import animation as ani
from IPython.display import Image
from pylab import imread

plt.rcParams["patch.force_edgecolor"] = True
from IPython.display import display # Allows the use of display() for DataFrames
import seaborn as sns
sns.set(style="whitegrid", palette="muted", color_codes=True)
sns.set_style("whitegrid", {'grid.linestyle': '--'})
red = sns.xkcd_rgb["light red"]
green = sns.xkcd_rgb["medium green"]
blue = sns.xkcd_rgb["denim blue"]

%matplotlib inline
%config InlineBackend.figure_format='retina'

# ML -------------------------------------------
from sklearn.preprocessing import LabelEncoder


import dill
from collections import defaultdict, OrderedDict
from scipy.spatial import distance

In [None]:
def unpickle(filename):
    with open(filename, 'rb') as fo:
        p = pickle.load(fo)
    return p

def to_pickle(filename, obj):
    with open(filename, 'wb') as f:
        pickle.dump(obj, f, -1)



class FeatureStore():
    
    # necessary to re-check
    floor_convert = {'1F' :  0, '2F' : 1, '3F' : 2, '4F' : 3, '5F' : 4, 
                     '6F' : 5, '7F' : 6, '8F' : 7, '9F' : 8,
                     'B'  : -1, 'B1' : -1, 'B2' : -2, 'B3' : -3, 
                     'BF' : -1, 'BM' : -1, 
                     'F1' : 0, 'F2' : 1, 'F3' : 2, 'F4' : 3, 'F5' : 4, 
                     'F6' : 5, 'F7' : 6, 'F8' : 7, 'F9' : 8, 'F10': 9,
                     'L1' : 0, 'L2' : 1, 'L3' : 2, 'L4' : 3, 'L5' : 4, 
                     'L6' : 5, 'L7' : 6, 'L8' : 7, 'L9' : 8, 'L10': 9, 
                     'L11': 10,
                     'G'  : 0, 'LG1': 0, 'LG2': 1, 'LM' : 0, 'M'  : 0, 
                     'P1' : 0, 'P2' : 1,}
    
    df_types = ['accelerometer',
                'accelerometer_uncalibrated',
                'beacon',
                'gyroscope',
                'gyroscope_uncalibrated',
                'magnetic_field',
                'magnetic_field_uncalibrated',
                'rotation_vector',
                'waypoint',
                'wifi']
    
    # https://github.com/location-competition/indoor-location-competition-20
    df_type_cols = {'accelerometer': ["timestamp", "x", "y", "z", "accuracy"],
                'accelerometer_uncalibrated': ["timestamp", "x", "y", "z", 
                                               "x2", "y2", "z2", "accuracy" ],
                'beacon': ["timestamp", "uuid", "major_id", "minor_id", "tx_power", 
                           "rssi", "distance", "mac_addr", "timestamp2"],
                'gyroscope': ["timestamp", "x", "y", "z", "accuracy"],
                'gyroscope_uncalibrated': ["timestamp", "x", "y", "z", 
                                           "x2", "y2", "z2", "accuracy" ],
                'magnetic_field': ["timestamp", "x", "y", "z", "accuracy"],
                'magnetic_field_uncalibrated': ["timestamp", "x", "y", "z", 
                                                "x2", "y2", "z2", "accuracy" ],
                'rotation_vector': ["timestamp", "x", "y", "z", "accuracy"],
                'waypoint': ["timestamp", "x", "y"],
                'wifi': ["timestamp", "ssid", "bssid","rssi","frequency",
                         "last_seen_timestamp",]}

    dtype_dict = {}
    dtype_dict["accelerometer"] = {"timestamp":int, "x":float, "y":float, "z":float, 
                                   "accuracy":int}
    dtype_dict["accelerometer_uncalibrated"] = {"timestamp":int, "x":float, "y":float, 
                                                "z":float, "x2":float, "y2":float, 
                                                "z2":float, "accuracy":int}
    dtype_dict["beacon"] = {"timestamp":int, "uuid":str, "major_id":str, 
                            "minor_id":str, "tx_power":int,  "rssi":int, 
                            "distance":float, "mac_addr":str, "timestamp2":int}
    dtype_dict["gyroscope"] = {"timestamp":int, "x":float, "y":float, "z":float, 
                               "accuracy":int}
    dtype_dict["gyroscope_uncalibrated"] = {"timestamp":int, "x":float, "y":float, 
                                            "z":float, "x2":float, "y2":float, 
                                            "z2":float, "accuracy":int}
    dtype_dict["magnetic_field"] = {"timestamp":int, "x":float, "y":float, 
                                    "z":float, "accuracy":int}
    dtype_dict["magnetic_field_uncalibrated"] = {"timestamp":int, "x":float, 
                                                 "y":float, "z":float, "x2":float, 
                                                 "y2":float, "z2":float, "accuracy":int}
    dtype_dict["rotation_vector"] = {"timestamp":int, "x":float, "y":float, 
                                     "z":float, "accuracy":int}
    dtype_dict["waypoint"] = {"timestamp":int, "x":float, "y":float, "z":float}
    dtype_dict["wifi"] = {"timestamp":int, "ssid":str, "bssid":str,
                          "rssi":int,"frequency":int, "last_seen_timestamp":int}

    def __init__(self, site_id, floor, path_id, 
                 input_path="../input/indoor-location-navigation/",
                 save_path="../mid"):
        self.site_id = site_id.strip()
        self.floor = floor.strip()
        self.n_floor = self.floor_convert[self.floor]
        self.path_id = path_id.strip()
        
        self.input_path = input_path
        assert Path(input_path).exists(), f"input_path does not exist: {input_path}"
        
        self.save_path = save_path
        Path(save_path).mkdir(parents=True, exist_ok=True)
        
        self.site_info = SiteInfo(site_id=self.site_id, floor=self.floor, input_path=self.input_path)
        
    def _flatten(self, l):
        return list(itertools.chain.from_iterable(l))
    
    def multi_line_spliter(self, s):
        matches = re.finditer("TYPE_", s)
        matches_positions = [match.start() for match in matches]
        split_idx = [0] + [matches_positions[i]-14 for i in range(1, len(matches_positions))] + [len(s)]
        return [s[split_idx[i]:split_idx[i+1]] for i in range(len(split_idx)-1)]
    
    def load_df(self, ):
        path = str(Path(self.input_path)/f"train/{self.site_id}/{self.floor}/{self.path_id}.txt")
        with open(path) as f:
            data = f.readlines()
        
        modified_data = []
        for s in data:
            if s.count("TYPE_")>1:
                lines = self.multi_line_spliter(s)
                modified_data.extend(lines)
            else:
                modified_data.append(s)
        del data
        self.meta_info_len = len([d for d in modified_data if d[0]=="#"])
        self.meta_info_df = pd.DataFrame([m.replace("\n", "").split(":") 
                                          for m in self._flatten([d.split("\t") 
                                                                  for d in modified_data if d[0]=="#"]) if m!="#"])

        data_df = pd.DataFrame([d.replace("\n", "").split("\t") for d in modified_data if d[0]!="#"])
        for dt in self.df_types:
            # select data type
            df_s = data_df[data_df[1]==f"TYPE_{dt.upper()}"]
            if len(df_s)==0:
                setattr(self, dt, pd.DataFrame(columns=self.df_type_cols[dt]))
            else:
                # remove empty cols
                na_info = df_s.isna().sum(axis=0) == len(df_s)
                df_s = df_s[[i for i in na_info[na_info==False].index if i!=1]].reset_index(drop=True)
                
                if len(df_s.columns)!=len(self.df_type_cols[dt]):
                    df_s.columns = self.df_type_cols[dt][:len(df_s.columns)]
                else:
                    df_s.columns = self.df_type_cols[dt]
            
                # set dtype          
                for c in df_s.columns:
                    df_s[c] = df_s[c].astype(self.dtype_dict[dt][c])
                                     
                # set DataFrame to attr
                setattr(self, dt, df_s)
    
    def get_site_info(self, keep_raw=False):
        self.site_info.get_site_info(keep_raw=keep_raw)
            
    def load_all_data(self, keep_raw=False):     
        self.load_df()
        self.get_site_info(keep_raw=keep_raw)
        
    def __getitem__(self, item):
        if item in self.df_types:
            return getattr(self, item)
        else:
            return None
    
    def save(self, ):
        # to be implemented
        pass
    
    
class SiteInfo():
    def __init__(self, site_id, floor, input_path="../input/indoor-location-navigation/"):
        self.site_id = site_id
        self.floor = floor
        self.input_path = input_path
        assert Path(input_path).exists(), f"input_path does not exist: {input_path}"
        
    def get_site_info(self, keep_raw=False):
        floor_info_path = f"{self.input_path}/metadata/{self.site_id}/{self.floor}/floor_info.json"
        with open(floor_info_path, "r") as f:
            self.floor_info = json.loads(f.read())
            self.site_height = self.floor_info["map_info"]["height"]
            self.site_width = self.floor_info["map_info"]["width"]
            if not keep_raw:
                del self.floor_info
            
        geojson_map_path = f"{self.input_path}/metadata/{self.site_id}/{self.floor}/geojson_map.json"
        with open(geojson_map_path, "r") as f:
            self.geojson_map = json.loads(f.read())
            self.map_type = self.geojson_map["type"]
            self.features = self.geojson_map["features"]
            
            self.floor_coordinates = self.features[0]["geometry"]["coordinates"]
            self.store_coordinates = [self.features[i]["geometry"]["coordinates"] 
                                          for i in range(1, len(self.features))]
                
            if not keep_raw:
                del self.geojson_map
    
    def show_site_image(self):
        path = f"{self.input_path}/metadata/{self.site_id}/{self.floor}/floor_image.png"
        plt.imshow(imread(path), extent=[0, self.site_width, 0, self.site_height])

    def draw_polygon(self, size=8, only_floor=False):

        fig = plt.figure()
        ax = plt.subplot(111)
            
        xmax, xmin, ymax, ymin = self._draw(self.floor_coordinates, ax, calc_minmax=True)
        if not only_floor:
            self._draw(self.store_coordinates, ax, fill=True)
        plt.legend([])
        
        xrange = xmax - xmin
        yrange = ymax - ymin
        ratio = yrange / xrange
        
        self.x_size = size
        self.y_size = size*ratio

        fig.set_figwidth(size)
        fig.set_figheight(size*ratio)
        # plt.show()
        return ax
        
    def _draw(self, coordinates, ax, fill=False, calc_minmax=False):
        xmax, ymax = -np.inf, -np.inf
        xmin, ymin = np.inf, np.inf
        for i in range(len(coordinates)):
            ndim = np.ndim(coordinates[i])
            if ndim==2:
                corrd_df = pd.DataFrame(coordinates[i])
                if fill:
                    ax.fill(corrd_df[0], corrd_df[1], alpha=0.7)
                else:
                    corrd_df.plot.line(x=0, y=1, style="-", ax=ax)
                        
                if calc_minmax:
                    xmax = max(xmax, corrd_df[0].max())
                    xmin = min(xmin, corrd_df[0].min())

                    ymax = max(ymax, corrd_df[1].max())
                    ymin = min(ymin, corrd_df[1].min())
            elif ndim==3:
                for j in range(len(coordinates[i])):
                    corrd_df = pd.DataFrame(coordinates[i][j])
                    if fill:
                        ax.fill(corrd_df[0], corrd_df[1], alpha=0.6)
                    else:
                        corrd_df.plot.line(x=0, y=1, style="-", ax=ax)
                        
                    if calc_minmax:
                        xmax = max(xmax, corrd_df[0].max())
                        xmin = min(xmin, corrd_df[0].min())

                        ymax = max(ymax, corrd_df[1].max())
                        ymin = min(ymin, corrd_df[1].min())
            else:
                assert False, f"ndim of coordinates should be 2 or 3: {ndim}"
        if calc_minmax:
            return xmax, xmin, ymax, ymin
        else:
            return None
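
The `-14` offset in `multi_line_spliter` is easiest to understand on an example. The sketch below is a stand-alone copy under a hypothetical name, run on a made-up line in which the newline between two records is missing. It assumes every record starts with a 13-digit millisecond timestamp followed by a tab, i.e. 14 characters in front of each `TYPE_` marker.

```python
import re


def split_fused_records(s):
    """Split a line in which several records ran together (hypothetical
    stand-alone copy of `multi_line_spliter`).

    Assumes every record starts with a 13-digit timestamp plus a tab, i.e.
    14 characters in front of each "TYPE_" marker.
    """
    starts = [m.start() for m in re.finditer("TYPE_", s)]
    split_idx = [0] + [pos - 14 for pos in starts[1:]] + [len(s)]
    return [s[split_idx[i]:split_idx[i + 1]] for i in range(len(split_idx) - 1)]


# Made-up example: the newline between two records is missing.
fused = "1578474564146\tTYPE_WAYPOINT\t1.0\t2.01578474564200\tTYPE_WIFI\tssid\tbssid"
records = split_fused_records(fused)
for record in records:
    print(repr(record))
```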

In [None]:
# train_meta_data
train_meta = glob("../input/indoor-location-navigation/train/*/*/*")
train_meta_org = pd.DataFrame(train_meta)
train_meta = train_meta_org[0].str.split("/", expand=True)[[4, 5, 6]]
train_meta.columns = ["site_id", "floor", "path_id"]
train_meta["path_id"] = train_meta["path_id"].str.replace(".txt", "", regex=False)  # literal match, not a regex
train_meta["path"] = train_meta_org[0]
#train_meta.head()

In [None]:
def pickle_dump_dill(obj, path):
    with open(path, mode='wb') as f:
        dill.dump(obj, f)


def pickle_load_dill(path):
    with open(path, mode='rb') as f:
        data = dill.load(f)
        return data

In [None]:
sample_sub = pd.read_csv('../input/indoor-location-navigation/sample_submission.csv')
test_sites = sample_sub.site_path_timestamp.apply(lambda x: pd.Series(x.split("_")))[0].unique().tolist()

test_meta = sample_sub["site_path_timestamp"].apply(
    lambda x: pd.Series(x.split("_")))
test_meta.columns = ["site_id", "path_id", "timestamp"]
test_meta=test_meta.drop('timestamp', axis=1)
test_meta = test_meta.drop_duplicates(subset=["site_id", "path_id"]).reset_index(drop=True)

# Get first and last waypoints in train dataset

In [None]:
create_train_meta_sub=False
if create_train_meta_sub:
    train_meta_sub=train_meta[train_meta['site_id'].isin(test_sites)].reset_index(drop=True)
    train_meta_sub['start_time']=0
    train_meta_sub['end_time']=0
    train_meta_sub['start_wp_time']=0
    train_meta_sub['start_wp_x']=0
    train_meta_sub['start_wp_y']=0
    train_meta_sub['end_wp_time']=0
    train_meta_sub['end_wp_x']=0
    train_meta_sub['end_wp_y']=0
    train_meta_sub['n_floor']=0
    for i in tqdm(range(len(train_meta_sub))):
        t = train_meta_sub.iloc[i]
        n_floor = FeatureStore.floor_convert[t.floor]
        feature = FeatureStore(
            site_id=t.site_id, floor=t.floor, path_id=t.path_id)
        feature.load_all_data() 
        start_time=int(feature.meta_info_df[feature.meta_info_df[0]=='startTime'][1])
        end_time=int(feature.meta_info_df[feature.meta_info_df[0]=='endTime'][1])
        train_meta_sub.loc[i,'start_time']=start_time
        train_meta_sub.loc[i,'start_wp_time']=feature.waypoint.iloc[0]['timestamp']
        train_meta_sub.loc[i,'start_wp_x']=feature.waypoint.iloc[0]['x']
        train_meta_sub.loc[i,'start_wp_y']=feature.waypoint.iloc[0]['y']
        train_meta_sub.loc[i,'end_time']=end_time
        train_meta_sub.loc[i,'end_wp_time']=feature.waypoint.iloc[-1]['timestamp']
        train_meta_sub.loc[i,'end_wp_x']=feature.waypoint.iloc[-1]['x']
        train_meta_sub.loc[i,'end_wp_y']=feature.waypoint.iloc[-1]['y']
        train_meta_sub.loc[i,'n_floor']=feature.n_floor
    train_meta_sub.to_csv('train_meta_sub.csv', index=False)
else:
    train_meta_sub = pd.read_csv('../input/indoor-public/train_meta_sub.csv')

In [None]:
train_meta_sub[train_meta_sub.site_id=='5d2709b303f801723c327472'][['path_id','site_id','n_floor','start_time','start_wp_x','start_wp_y','end_time','end_wp_x','end_wp_y']].sort_values(['site_id','n_floor','start_time'])[:50]

You can see that the last waypoints of some paths are exactly the same as the first waypoints of other paths in the training dataset.
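
Rather than reading the table by eye, the chained paths can be listed with a join on exact coordinates. This is a sketch on made-up rows that reuse the column names of `train_meta_sub`; the names `prev_path` and `next_path` are introduced here for illustration.

```python
import pandas as pd

meta_demo = pd.DataFrame({
    'path_id':    ['p1', 'p2', 'p3'],
    'start_wp_x': [0.0, 5.0, 40.0],
    'start_wp_y': [0.0, 7.0, 40.0],
    'end_wp_x':   [5.0, 9.0, 41.0],
    'end_wp_y':   [7.0, 9.0, 42.0],
})

ends = meta_demo[['path_id', 'end_wp_x', 'end_wp_y']].rename(
    columns={'path_id': 'prev_path', 'end_wp_x': 'x', 'end_wp_y': 'y'})
starts = meta_demo[['path_id', 'start_wp_x', 'start_wp_y']].rename(
    columns={'path_id': 'next_path', 'start_wp_x': 'x', 'start_wp_y': 'y'})

# Inner join on exact coordinates: each row links a path to the one that continues it.
chained = ends.merge(starts, on=['x', 'y'])
chained = chained[chained.prev_path != chained.next_path]
print(chained)
```

On the real metadata it would make sense to join on `site_id` and `n_floor` as well, so that coincidentally equal coordinates on different floors are not linked.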

In [None]:
import seaborn as sns
for test_site in test_sites:
    plt.figure()
    plt.title(test_site)
    sns.boxplot(x='floor', y='start_time', data=train_meta_sub[train_meta_sub.site_id==test_site])

You can see that the raw timestamps carry information about the floor in the training dataset.

In [None]:
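def multi_line_spliter(s):
    # Module-level copy of FeatureStore.multi_line_spliter so that read_txt can call it.
    # Assumes each record starts with a 13-digit timestamp and a tab (14 characters before "TYPE_").
    matches_positions = [match.start() for match in re.finditer("TYPE_", s)]
    split_idx = [0] + [pos - 14 for pos in matches_positions[1:]] + [len(s)]
    return [s[split_idx[i]:split_idx[i+1]] for i in range(len(split_idx)-1)]


def read_txt(file):
    with open(file) as f:
        txt = f.readlines()

    modified_data = []
    for s in txt:
        # A line with more than one "TYPE_" holds several records fused together.
        if s.count("TYPE_") > 1:
            lines = multi_line_spliter(s)
            modified_data.extend(lines)
        else:
            modified_data.append(s)
    return modified_data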


def _flatten(l):
    return list(itertools.chain.from_iterable(l))


def get_feature_test(site_id, path_id, input_path, sample_sub):
    file = f"{input_path}/test/{path_id}.txt"
    content = read_txt(file)
    data_df = pd.DataFrame([d.replace("\n", "").split("\t")
                            for d in content if d[0] != "#"])
    data_dict = OrderedDict()
    for dt in FeatureStore.df_types:
        # select data type
        df_s = data_df[data_df[1] == f"TYPE_{dt.upper()}"]
        if len(df_s) == 0:
            setattr(data_dict, dt, pd.DataFrame(
                columns=FeatureStore.df_type_cols[dt]))
        else:
            # remove empty cols
            na_info = df_s.isna().sum(axis=0) == len(df_s)
            df_s = df_s[[i for i in na_info[na_info ==
                                            False].index if i != 1]].reset_index(drop=True)

            if len(df_s.columns) != len(FeatureStore.df_type_cols[dt]):
                df_s.columns = FeatureStore.df_type_cols[dt][:len(
                    df_s.columns)]
            else:
                df_s.columns = FeatureStore.df_type_cols[dt]

            # set dtype
            for c in df_s.columns:
                df_s[c] = df_s[c].astype(FeatureStore.dtype_dict[dt][c])
            setattr(data_dict, dt, df_s)
    data_dict.meta_info_df = pd.DataFrame([m.replace("\n", "").split(":")
                                           for m in _flatten([d.split("\t")
                                                              for d in content if d[0] == "#"]) if m != "#"])
    startTime_ind = int(np.where(data_dict.meta_info_df[0] == 'startTime')[0])
    endTime_ind = int(np.where(data_dict.meta_info_df[0] == 'endTime')[0])
    data_dict.meta_info_df.loc[startTime_ind,
                               1] = data_dict.meta_info_df.loc[startTime_ind+1, 0]
    data_dict.meta_info_df.loc[endTime_ind,
                               1] = data_dict.meta_info_df.loc[endTime_ind+1, 0]

    data_dict.waypoint['timestamp'] = sample_sub[sample_sub.path_id ==
                                                 path_id].timestamp.values.astype(int)
    data_dict.waypoint['x'] = 0
    data_dict.waypoint['y'] = 0
    data_dict.n_floor = 0
    data_dict.site_id = site_id
    return data_dict

# Postprocessing based on leaked feature.

In [None]:
def leak_postprocessing(submission_df,train_meta, postprocess_start=True, postprocess_end=True, postprocess_floor=True,start_threshold=5500,end_threshold=5500):
    out_df=submission_df.copy()
    out_df[["site_id", "path_id", "timestamp"]] = out_df["site_path_timestamp"].apply(
        lambda x: pd.Series(x.split("_")))
    start_counter = 0
    end_counter = 0
    floor_counter = 0
    input_path='/kaggle/input/indoor-location-navigation/'
    sample_sub = pd.read_csv(f"{input_path}/sample_submission.csv")
    sample_sub = sample_sub["site_path_timestamp"].apply(
        lambda x: pd.Series(x.split("_")))
    sample_sub.columns = ["site_id", "path_id", "timestamp"]
    out_df_unique=out_df.drop_duplicates(
    subset=["site_id", "path_id"]).reset_index(drop=True)
    for i in tqdm(range(len(out_df_unique.path_id))):
        t = out_df_unique.iloc[i]
        site_id=t.site_id
        path_id=t.path_id
        feature = get_feature_test(site_id, path_id, input_path, sample_sub)
        if feature.meta_info_df[feature.meta_info_df[0] == 'startTime'][1].values == None:
            start_time = int(np.nanmin([feature.accelerometer.timestamp.min(
            ), feature.wifi.timestamp.min(), feature.beacon.timestamp.min()]))
        else:
            # .iloc[0] extracts the scalar; int() on a one-element Series is deprecated.
            start_time = int(
                feature.meta_info_df[feature.meta_info_df[0] == 'startTime'][1].iloc[0])
        if (len(feature.meta_info_df[feature.meta_info_df[0] == 'endTime']) == 0) or (feature.meta_info_df[feature.meta_info_df[0] == 'endTime'][1].values == None):
            end_time = int(np.nanmax([feature.accelerometer.timestamp.max(
            ), feature.wifi.timestamp.max(), feature.beacon.timestamp.max()]))
        else:
            # .iloc[0] extracts the scalar; int() on a one-element Series is deprecated.
            end_time = int(
                feature.meta_info_df[feature.meta_info_df[0] == 'endTime'][1].iloc[0])
        if len(feature.beacon) > 0:
            gap = feature.beacon.loc[0, 'timestamp2'] - \
                feature.beacon.loc[0, 'timestamp']
        else:
            gap = (feature.wifi.last_seen_timestamp.values -
                   feature.wifi.timestamp.values).max()+210.14426803816337  # from mean gap
        site_id = feature.site_id
        train_meta_site = train_meta[train_meta.site_id == site_id]
        
        #postprocess start point based on leakage
        train_meta_site_end = train_meta_site[(
            start_time+gap) > train_meta_site.end_time]
        if len(train_meta_site_end) > 0:
            nearest_endpoint = train_meta_site_end.loc[train_meta_site_end.end_time.idxmax(
            )]
            if postprocess_start and (start_time + gap - nearest_endpoint.end_time < start_threshold):
                out_df.loc[(out_df.path_id == path_id) & (out_df.timestamp == 
                    out_df[out_df.path_id == path_id].timestamp.min()), 'x'] = nearest_endpoint.end_wp_x
                out_df.loc[(out_df.path_id == path_id) & (out_df.timestamp == 
                    out_df[out_df.path_id == path_id].timestamp.min()), 'y'] = nearest_endpoint.end_wp_y
                start_counter += 1
        
        #postprocess end point based on leakage
        train_meta_site_start = train_meta_site[train_meta_site.start_time > (
            end_time+gap)]
        if len(train_meta_site_start) > 0:
            nearest_startpoint = train_meta_site_start.loc[train_meta_site_start.start_time.idxmin(
            )]
            if postprocess_end and (nearest_startpoint.start_time - end_time - gap < end_threshold):
                out_df.loc[(out_df.path_id == path_id) & (out_df.timestamp == 
                    out_df[out_df.path_id == path_id].timestamp.max()), 'x'] = nearest_startpoint.start_wp_x
                out_df.loc[(out_df.path_id == path_id) & (out_df.timestamp == 
                    out_df[out_df.path_id == path_id].timestamp.max()), 'y'] = nearest_startpoint.start_wp_y
                end_counter += 1
                
        #postprocess floor based on leakage
        if postprocess_floor:
            if (len(train_meta_site_end) > 0) and (len(train_meta_site_start) > 0) and (nearest_endpoint.n_floor == nearest_startpoint.n_floor):
                out_df.loc[(out_df.path_id == path_id),
                                  'floor'] = nearest_endpoint.n_floor
                floor_counter += (out_df.path_id == path_id).sum()

            # uncomment this section if you want to postprocess all floor predictions
            # elif (len(train_meta_site_end) > 0) and (len(train_meta_site_start) > 0):
            #     diff_start_time = start_time - nearest_endpoint.end_time
            #     diff_end_time = nearest_startpoint.start_time - end_time
            #     if diff_start_time < diff_end_time:
            #         out_df.loc[(out_df.path_id == path_id),
            #                           'floor'] = nearest_endpoint.n_floor
            #         floor_counter += (out_df.path_id == path_id).sum()
            #     if diff_end_time < diff_start_time:
            #         out_df.loc[(out_df.path_id == path_id),
            #                           'floor'] = nearest_startpoint.n_floor
            #         floor_counter += (out_df.path_id == path_id).sum()
            # elif len(train_meta_site_end) > 0:
            #     out_df.loc[(out_df.path_id == path_id),
            #                       'floor'] = nearest_endpoint.n_floor
            #     floor_counter += (out_df.path_id == path_id).sum()
            # elif len(train_meta_site_start) > 0:
            #     out_df.loc[(out_df.path_id == path_id),
            #                       'floor'] = nearest_startpoint.n_floor
            #     floor_counter += (out_df.path_id == path_id).sum()
            
    print(str(start_counter) + ' start points are postprocessed.')
    print(str(end_counter) + ' end points are postprocessed.')
    print(str(floor_counter) + ' floors are postprocessed.')
    out_df = out_df.drop(
        ["site_id", "path_id", "timestamp"], axis=1)
    return out_df
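
The start-point correction in the function above boils down to a nearest-in-time lookup against the training metadata. The following is a minimal sketch of that lookup on toy data, not the notebook's actual pipeline: the helper name `snap_start` and all values in the toy table are hypothetical, and only the column names (`end_time`, `end_wp_x`, `end_wp_y`) are taken from the code above.

```python
import pandas as pd

# Toy training metadata for one site (all values are made up for illustration).
train_meta_site = pd.DataFrame({
    "end_time": [1000, 5000, 9000],
    "end_wp_x": [10.0, 20.0, 30.0],
    "end_wp_y": [1.0, 2.0, 3.0],
})


def snap_start(start_time, gap, train_meta_site, threshold=5500):
    """Return (x, y) of the latest training end point that precedes
    start_time + gap by less than `threshold`, or None if there is none."""
    # Keep only training paths that ended before the (gap-corrected) test start.
    earlier = train_meta_site[(start_time + gap) > train_meta_site.end_time]
    if len(earlier) == 0:
        return None
    # Among those, pick the one that ended last, i.e. the closest in time.
    nearest = earlier.loc[earlier.end_time.idxmax()]
    if start_time + gap - nearest.end_time < threshold:
        return float(nearest.end_wp_x), float(nearest.end_wp_y)
    return None


print(snap_start(start_time=5200, gap=300, train_meta_site=train_meta_site))
print(snap_start(start_time=20000, gap=300, train_meta_site=train_meta_site))
```

In the first call the training path ending at 5000 is close enough, so its end waypoint is returned; in the second call the closest preceding path ended too long ago and nothing is replaced. The end-point correction mirrors this with `start_time`, `idxmin`, and `end_threshold`.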

In [None]:
# submission_df = pd.read_csv('../input/3-3-g6-indoor-navigation-snap-to-grid/submission_snap_to_grid.csv')
submission_df = pd.read_csv('first-submission.csv')

I used the submission file of the current best public notebook for demonstration.

In [None]:
submission_df_leak_start = leak_postprocessing(submission_df,train_meta_sub, postprocess_start=True, postprocess_end=False, postprocess_floor=False)
submission_df_leak_start.to_csv(
    'submission_df_leak_start.csv', index=False)

I postprocessed start waypoints only here.

In [None]:
submission_df_leak_end = leak_postprocessing(submission_df,train_meta_sub, postprocess_start=False, postprocess_end=True, postprocess_floor=False)
submission_df_leak_end.to_csv(
    'submission_df_leak_end.csv', index=False)

I postprocessed end waypoints only here.

In [None]:
submission_df_leak_floor = leak_postprocessing(submission_df,train_meta_sub, postprocess_start=False, postprocess_end=False, postprocess_floor=True)
submission_df_leak_floor.to_csv(
    'submission_df_leak_floor.csv', index=False)

I postprocessed the floor only here.

In [None]:
submission_df_leak_all = leak_postprocessing(submission_df,train_meta_sub, postprocess_start=True, postprocess_end=True, postprocess_floor=True)
submission_df_leak_all.to_csv(
    'submission_df_leak_all.csv', index=False)

I postprocessed start points, end points, and floors here.

In [None]:
submission_df_leak_start_end = leak_postprocessing(submission_df,train_meta_sub, postprocess_start=True, postprocess_end=True, postprocess_floor=False)
submission_df_leak_start_end.to_csv(
    'submission_df_leak_start_end.csv', index=False)

I postprocessed start and end points here.

The LB scores are as follows:
* Original: 4.789
* Start point: 4.751
* End point: 4.750
* Floor: 4.995
* All: 4.924
* Start and end point: 4.718

Although postprocessing the floor makes the LB score slightly worse (4.789 → 4.995), it is surprising that the score stays comparable even though about 80% of the floor predictions were replaced by values obtained without any sensor information. If I uncomment the section in the postprocessing function and postprocess all floor predictions, the LB score becomes 8.7xx. That is still not so bad, considering that an LB score < 15 means the mean floor prediction error is < 1.
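
The "LB score < 15" remark follows from the shape of the metric. As a rough sketch, assuming the competition metric is the mean over waypoints of the horizontal distance error plus a penalty of 15 per floor of error, the score can be computed as below. The function name `lb_score` and the sample values are hypothetical.

```python
import math

FLOOR_PENALTY = 15  # assumed penalty added per floor of error


def lb_score(preds, truths):
    """Mean of (horizontal distance + FLOOR_PENALTY * |floor error|).

    Each item in preds and truths is a (floor, x, y) tuple.
    """
    total = 0.0
    for (f_hat, x_hat, y_hat), (f, x, y) in zip(preds, truths):
        total += math.hypot(x_hat - x, y_hat - y) + FLOOR_PENALTY * abs(f_hat - f)
    return total / len(preds)


truths = [(0, 0.0, 0.0), (1, 10.0, 10.0)]
preds = [(0, 3.0, 4.0), (2, 10.0, 10.0)]
print(lb_score(preds, truths))  # (5 + 15) / 2 = 10.0
```

Because the distance term is never negative, the mean floor error can be at most the score divided by 15, so a score below 15 implies a mean floor error below 1.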

Any comments are welcome.