<a id = "top"></a>

## Indoor Location and Navigation

Your smartphone goes everywhere with you—whether driving to the grocery store or shopping for holiday gifts. With your permission, apps can use your location to provide contextual information. You might get driving directions, find a store, or receive alerts for nearby promotions. These handy features are enabled by GPS, which requires outdoor exposure for the best accuracy. Yet, there are many times when you’re inside large structures, such as a shopping mall or event center. Accurate indoor positioning, based on public sensors and user permission, allows for a great location-based experience even when you aren’t outside.

Current positioning solutions have poor accuracy, particularly in multi-level buildings, or generalize poorly to small datasets. Additionally, GPS was built for a time before smartphones. Today’s use cases often require more granularity than is typically available indoors.

In this competition, your task is to predict the indoor position of smartphones based on real-time sensor data, provided by indoor positioning technology company XYZ10 in partnership with Microsoft Research. You'll locate devices using “active” localization data, which is made available with the cooperation of the user. Unlike passive localization methods (e.g. radar, camera), the data provided for this competition requires explicit user permission. You'll work with a dataset of nearly 30,000 traces from over 200 buildings.

If successful, you’ll contribute to research with broad-reaching possibilities, including industries like manufacturing, retail, and autonomous devices. With more accurate positioning, existing location-based apps could even be improved. Perhaps you’ll even see the benefits yourself the next time you hit the mall.

<a id = "contents"></a>

### Notebook Contents

In this notebook I'll do some Exploratory Data Analysis, and I plan to add new sections over the coming weeks.

The sections are: 

0. [**File Structure explained**](#files)<br>
    0.1. [*Visual Explanation of our data*](#visual)

1. [**metadata**](#meta_head)<br>
    
    1.1. [*Intro*](#meta_intro)<br>
    
    1.2. [*Geospatial Intro*](#meta_geo)<br> 
  
2. [**train**](#train_head)<br>

    2.1. [*Basic Statistics for Sites, Floors, Paths*](#train_stats)<br>
    
    2.2. [*Data explanation and relationships*](#train_expl)<br>
    
3. [**test and submission**](#test_and_sub_head)<br>


##### Props to: 

[Leonie](https://www.kaggle.com/iamleonie/intro-to-indoor-location-navigation), always a lot to learn from her, [indoor location competition Git](https://github.com/location-competition/indoor-location-competition-20/blob/master/io_f.py), [flaticon](https://www.flaticon.com/) and [imgur](https://imgur.com/) for visualizations.



In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
pd.options.display.max_columns = 50
pd.options.display.max_colwidth  = 200
import os

from dataclasses import dataclass
import colorama
from colorama import Fore, Back, Style
import folium
import json
import geopandas as gpd

import re
import pyproj
from pyproj import Proj, transform

from shapely.ops import cascaded_union
import matplotlib
import matplotlib.pyplot as plt
plt.rcParams.update({'figure.max_open_warning': 0})
plt.style.use('fivethirtyeight')
import seaborn as sns # visualization
import warnings # Suppress warnings
warnings.filterwarnings('ignore')

import plotly.graph_objs as go
from PIL import Image

from tqdm import tqdm

metadata_path = '/kaggle/input/indoor-location-navigation/metadata/'
train_path = '/kaggle/input/indoor-location-navigation/train/'
test_path = '/kaggle/input/indoor-location-navigation/test/'

y_ = Fore.YELLOW
r_ = Fore.RED
g_ = Fore.GREEN
b_ = Fore.BLUE
m_ = Fore.MAGENTA
c_ = Fore.CYAN
sr_ = Style.RESET_ALL

color_dict = {'site': c_, 'floor': y_, 'path': b_}

test_structure = {test_path: ['path_1.txt','path_2.txt','path_3.txt','...', 'path_n.txt']}

metadata_structure = {metadata_path: 
                               {'site_1': {'floor_1': ['geojson_map.json', 'floor_info.json', 'floor_image.png'],
                                           'floor_2': ['geojson_map.json', 'floor_info.json', 'floor_image.png']},
                                'site_2': {'basement': ['geojson_map.json', 'floor_info.json', 'floor_image.png'],
                                           'floor_1': ['geojson_map.json', 'floor_info.json', 'floor_image.png']},
                               }
                     }

train_structure = {train_path: 
                               {'site_1': {'floor_1': ['path_1.txt', 'path_2.txt'],
                                           'floor_2': ['path_1.txt', 'path_2.txt', 'path_3.txt']},
                                'site_2': {'basement': ['path_1.txt'],
                                           'floor_1': ['path_1.txt', 'path_2.txt']},
                               }
                     }

def pretty(d, indent=0, max_enum = 10):
    for enum, (key, value) in enumerate(d.items()):
        if enum < max_enum:
            if ((len(str(key)) < 5) or (any(x in str(key) for x in ['floor', 'basement']))) and ('site' not in str(key)):
                print('\t'*indent, color_dict['floor'] + str(key)) 
            
            elif ((len(str(key)) > 5)):
                print('\t'*indent, color_dict['site'] + str(key)) 
            
            else:
                print('\t' * indent + str(key))
            if isinstance(value, dict):
                pretty(value, indent+1)
            else:
                if (len(value)>0) & (any(x in str(value) for x in ['.json', '.txt', '.png'])):
                    print("""{0}{1}{2}""".format('\t'*(indent+1), color_dict['path'], str(value)))
                else: 
                    print('\t' * (indent+1) + str(value))
        print(Style.RESET_ALL)
                    
def create_dict(metadata_path, max_enum = 1000, files_enum = None):
    
    metadata_dict = {}
    sites = os.listdir(metadata_path)
    metadata_dict[metadata_path] = sites
    sites_path = list(map(lambda x: os.path.join(metadata_path, x), sites))
    sites_dict = {}
    for sites_enum, site_path in enumerate(sites_path):
        
        if sites_enum<max_enum:
            
            site_floors = os.listdir(site_path)
            floors_path = list(map(lambda x: os.path.join(site_path, x), site_floors)) 
            
            floor_dict = {}
            for floor_enum, floor in enumerate(floors_path): 
                if floor_enum<max_enum:
                    if files_enum:
                        floor_dict[site_floors[floor_enum]] = len(os.listdir(floor)[:files_enum])
                    else:
                        floor_dict[site_floors[floor_enum]] = len(os.listdir(floor))
                        
            sites_dict[sites[sites_enum]] = floor_dict
                    
                    
    return {metadata_path: sites_dict}
                    
# copy from https://github.com/location-competition/indoor-location-competition-20/blob/master/io_f.py

@dataclass
class ReadData:
    acce: np.ndarray
    acce_uncali: np.ndarray
    gyro: np.ndarray
    gyro_uncali: np.ndarray
    magn: np.ndarray
    magn_uncali: np.ndarray
    ahrs: np.ndarray
    wifi: np.ndarray
    ibeacon: np.ndarray
    waypoint: np.ndarray


def read_data_file(data_filename):
    acce = []
    acce_uncali = []
    gyro = []
    gyro_uncali = []
    magn = []
    magn_uncali = []
    ahrs = []
    wifi = []
    ibeacon = []
    waypoint = []

    with open(data_filename, 'r', encoding='utf-8') as file:
        lines = file.readlines()

    for line_data in lines:
        line_data = line_data.strip()
        if not line_data or line_data[0] == '#':
            continue

        line_data = line_data.split('\t')

        if line_data[1] == 'TYPE_WAYPOINT':
            waypoint.append([int(line_data[0]), float(line_data[2]), float(line_data[3])])
            continue
       
        if line_data[1] == 'TYPE_ACCELEROMETER':
            acce.append([int(line_data[0]), float(line_data[2]), float(line_data[3]), float(line_data[4])])
            continue
        
        if line_data[1] == 'TYPE_ACCELEROMETER_UNCALIBRATED':
            acce_uncali.append([int(line_data[0]), float(line_data[2]), float(line_data[3]), float(line_data[4])])
            continue
        
        if line_data[1] == 'TYPE_GYROSCOPE':
            gyro.append([int(line_data[0]), float(line_data[2]), float(line_data[3]), float(line_data[4])])
            continue

        if line_data[1] == 'TYPE_GYROSCOPE_UNCALIBRATED':
            gyro_uncali.append([int(line_data[0]), float(line_data[2]), float(line_data[3]), float(line_data[4])])
            continue
        
        if line_data[1] == 'TYPE_MAGNETIC_FIELD':
            magn.append([int(line_data[0]), float(line_data[2]), float(line_data[3]), float(line_data[4])])
            continue

        if line_data[1] == 'TYPE_MAGNETIC_FIELD_UNCALIBRATED':
            magn_uncali.append([int(line_data[0]), float(line_data[2]), float(line_data[3]), float(line_data[4])])
            continue

        if line_data[1] == 'TYPE_ROTATION_VECTOR':
            ahrs.append([int(line_data[0]), float(line_data[2]), float(line_data[3]), float(line_data[4])])
            continue

        if line_data[1] == 'TYPE_WIFI':
            sys_ts = line_data[0]
            ssid = line_data[2]
            bssid = line_data[3]
            rssi = line_data[4]
            lastseen_ts = line_data[6]
            wifi_data = [sys_ts, ssid, bssid, rssi, lastseen_ts]
            wifi.append(wifi_data)
            continue

        if line_data[1] == 'TYPE_BEACON':
            ts = line_data[0]
            uuid = line_data[2]
            major = line_data[3]
            minor = line_data[4]
            rssi = line_data[6]
            ibeacon_data = [ts, '_'.join([uuid, major, minor]), rssi]
            ibeacon.append(ibeacon_data)
            continue
        
    
    acce = np.array(acce)
    acce_uncali = np.array(acce_uncali)
    gyro = np.array(gyro)
    gyro_uncali = np.array(gyro_uncali)
    magn = np.array(magn)
    magn_uncali = np.array(magn_uncali)
    ahrs = np.array(ahrs)
    wifi = np.array(wifi)
    ibeacon = np.array(ibeacon)
    waypoint = np.array(waypoint)
    
    print('Acce shape:', acce.shape)
    print('acce_uncali shape:', acce_uncali.shape)
    print('gyro shape:', gyro.shape)
    print('gyro_uncali shape:', gyro_uncali.shape)
    print('magn shape:', magn.shape)
    print('magn_uncali shape:', magn_uncali.shape)
    print('ahrs shape:', ahrs.shape)
    print('wifi shape:', wifi.shape)
    print('ibeacon shape:', ibeacon.shape)
    print('Waypoint shape:', waypoint.shape)
    
    return ReadData(acce, acce_uncali, gyro, gyro_uncali, magn, magn_uncali, ahrs, wifi, ibeacon, waypoint)

def visualize_trajectory(trajectory, floor_plan_filename, width_meter, 
                         height_meter, title=None, mode='lines + markers + text', show=False):
    """
    Copied from https://github.com/location-competition/indoor-location-competition-20/blob/master/visualize_f.py

    """
    fig = go.Figure()

    # add trajectory
    size_list = [6] * trajectory.shape[0]
    size_list[0] = 10
    size_list[-1] = 10

    color_list = ['rgba(4, 174, 4, 0.5)'] * trajectory.shape[0]
    color_list[0] = 'rgba(12, 5, 235, 1)'
    color_list[-1] = 'rgba(235, 5, 5, 1)'

    position_count = {}
    text_list = []
    for i in range(trajectory.shape[0]):
        if str(trajectory[i]) in position_count:
            position_count[str(trajectory[i])] += 1
        else:
            position_count[str(trajectory[i])] = 0
        text_list.append('        ' * position_count[str(trajectory[i])] + f'{i}')
    text_list[0] = 'Start 0'
    text_list[-1] = f'End {trajectory.shape[0] - 1}'

    fig.add_trace(
        go.Scattergl(
            x=trajectory[:, 0],
            y=trajectory[:, 1],
            mode=mode,
            marker=dict(size=size_list, color=color_list),
            line=dict(shape='linear', color='lightgrey', width=3, dash='dash'),
            text=text_list,
            textposition="top center",
            name='trajectory',
        ))

    # add floor plan
    floor_plan = Image.open(floor_plan_filename)
    fig.update_layout(images=[
        go.layout.Image(
            source=floor_plan,
            xref="x",
            yref="y",
            x=0,
            y=height_meter,
            sizex=width_meter,
            sizey=height_meter,
            sizing="contain",
            opacity=1,
            layer="below",
        )
    ])

    # configure
    fig.update_xaxes(autorange=False, range=[0, width_meter])
    fig.update_yaxes(autorange=False, range=[0, height_meter], scaleanchor="x", scaleratio=1)
    fig.update_layout(
        title=go.layout.Title(
            text=title or "No title.",
            xref="paper",
            x=0,
        ),
        autosize=True,
        width=800,
        height=  800 * height_meter / width_meter,
        template="plotly_white",
    )

    if show:
        fig.show()

    return fig

def visualize_train_trajectory(path):
    """
    Edited from 
    https://www.kaggle.com/ihelon/indoor-location-exploratory-data-analysis
    """
    _id, floor = path.split("/")[:2]
    
    train_floor_data = read_data_file(f"../input/indoor-location-navigation/train/{path}")
    with open(f"../input/indoor-location-navigation/metadata/{_id}/{floor}/floor_info.json") as f:
        train_floor_info = json.load(f)

    return visualize_trajectory(
        train_floor_data.waypoint[:, 1:3], 
        f"../input/indoor-location-navigation/metadata/{_id}/{floor}/floor_image.png",
        train_floor_info["map_info"]["width"], 
        train_floor_info["map_info"]["height"],
        f"Visualization of {path}"
    )

<a id = 'files'></a>
<h3> Files Structure</h3>


- **train**: training path files, organized by site and floor; each path file contains the data of a **single** **path** on a **single** **floor**.<br>

- **test**: test path files, organized by site and floor; each path file contains the data of a single path on a single floor, but **without the waypoint (x, y) data**; the task of this competition is, for a given site-path file, to predict the floor and waypoint locations at the timestamps given in the sample_submission.csv file.<br>

- **metadata**: floor metadata folder, organized by site and floor, which includes the following for each floor: <br>
<style>
    ul {
  padding-left: 15px;
}
</style>
<ol>
  <li>floor_image.png</li>
  <li>floor_info.json</li>
  <li>geojson_map.json</li>
</ol>

- **sample_submission.csv**: a sample submission file in the correct format; each row has a unique id which contains a site id, a path id, and the timestamp within the trace for which to make a prediction; see the Evaluation page for the required integer mapping of floor names <br>


I'll start by visually explaining the relationships between our data. 

<a id = "visual"></a>
#### Visual Explanation of our Data

<img src="https://i.imgur.com/2bAH1Rl.png">

<img src="https://i.imgur.com/C8SEyiF.png"></img>

<a id ="meta_head"></a>
### metadata

<a id ="meta_intro"></a>
#### Intro

For each site-floor we have 3 files: 

<img src="https://i.imgur.com/vWzZyQ0.png"></img>
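
The `floor_info.json` file stores the floor size in meters under `map_info` (the `width` and `height` keys read by `visualize_train_trajectory` above), while `floor_image.png` is measured in pixels. Below is a minimal sketch of how a position in meters could be mapped onto image pixels. It assumes the image covers the whole floor extent and that image rows grow downward; `meters_to_pixels` is a hypothetical helper, not part of the competition code.

```python
def meters_to_pixels(x_m, y_m, width_m, height_m, img_w_px, img_h_px):
    """Map a floor position in meters to (column, row) pixel coordinates.

    Assumption: the image spans [0, width_m] x [0, height_m], with the
    floor origin at the bottom-left corner of the image.
    """
    col = x_m / width_m * img_w_px
    # Image rows start at the top, so the y axis has to be flipped.
    row = (height_m - y_m) / height_m * img_h_px
    return col, row


# Illustrative numbers only: a 100 m x 50 m floor drawn on a 1000 x 500 px image.
bottom_left = meters_to_pixels(0, 0, 100, 50, 1000, 500)
top_right = meters_to_pixels(100, 50, 100, 50, 1000, 500)
print(bottom_left, top_right)
```

The flip on the y axis mirrors what `visualize_trajectory` does when it anchors the floor plan at `y=height_meter`.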

In [None]:
pretty(metadata_structure)

Let's see an example for site `5cd56c0ce2acfd2d33b6ab27`

In [None]:
site_name_ = '5cd56c0ce2acfd2d33b6ab27'
site_path = os.path.join(metadata_path, site_name_)
site_structure = {site_path: {'B1': ['geojson_map.json', 'floor_info.json', 'floor_image.png'],
                              'F3': ['geojson_map.json', 'floor_info.json', 'floor_image.png'],
                              'F2': ['geojson_map.json', 'floor_info.json', 'floor_image.png']}}
pretty(site_structure)

And floor `B1`

In [None]:
floor_info = pd.read_json(os.path.join(site_path, 'B1/floor_info.json'))
floor_image = plt.imread(os.path.join(site_path, 'B1/floor_image.png'))
floor_geo = (gpd.GeoDataFrame.from_features(
                        pd.read_json(os.path.join(site_path, 'B1/geojson_map.json'))['features'])
                     .assign(site_name=site_name_))
print('Floor Info')
display(floor_info)

fig, axes = plt.subplots(1, 2, figsize = (16, 10))
ax = axes.ravel()
floor_geo['geometry'].plot(ax=ax[0], color = 'red')
ax[0].set_title('Floor {} polygon'.format('B1'))
ax[1].imshow(floor_image)
ax[1].set_title('Floor {} image'.format('B1'))
fig.suptitle('Floor Polygon and corresponding Floor Image')

For each site-floor there is a MultiPolygon, which is the same for all floors in that site: in fact, it is just the convex hull of the union of the floor polygons.

In [None]:
fig, axes = plt.subplots(2, 3, figsize = (20, 12))
ax = axes.ravel()

single_poly_df = (floor_geo.loc[floor_geo.geometry.apply(lambda x: x.geom_type == 'Polygon')]
                 .reset_index(drop = True))

for j in range(len(single_poly_df)):
    single_poly_df.iloc[[j]].plot(ax = ax[j])
    ax[j].set_title("Polygon {}".format(j+1))
    
polygons = []
boundary = gpd.GeoSeries(cascaded_union(single_poly_df.geometry.tolist()))
boundary.plot(color = 'red', ax = ax[4])
ax[4].set_title('Polygon Unions')
floor_geo.iloc[[0]]['geometry'].plot(ax = ax[5], color = 'orange')
ax[5].set_title('MultiPolygon')
plt.suptitle('Floor {} at Site {} Polygons'.format('B1', site_name_ ))

<a id = "meta_geo"></a>

### Geospatial Intro
#### Where are our Polygons in the World?

Here I'll try to show where the polygons we are dealing with in this challenge (i.e. parking lots, malls, airports) are located.

Here I load the GeoJSON information for all the sites:

In [None]:
geo_dfs = []
geo_cols = ["geometry","Vr","category","name","code","floor_num", 'sid',
            "type","id","version","display","point","points","doors", "site_name"]

problematic_sites = []
for site in os.listdir(metadata_path):
    site_path = os.path.join(metadata_path, site)
    for floor in os.listdir(site_path):
        floor_path = os.path.join(site_path, floor)
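        try:
            geo_df = (gpd.GeoDataFrame.from_features(
                        pd.read_json(os.path.join(floor_path, 'geojson_map.json'))['features'])
                     .assign(site_name=site))
        except Exception:
            # Skip unreadable floors so a stale geo_df from a previous iteration is not appended again
            problematic_sites+=[site]
            continue
        geo_dfs.append(geo_df)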
problematic_sites=list(set(problematic_sites))
full_geo_df = pd.concat(geo_dfs, axis = 0, ignore_index = True)

In [None]:
full_geo_df[['geometry', 'point', 'site_name']].sample()

Points are in **epsg:3857** coordinates, as you can check [here](https://xserver2-dashboard.cloud.ptvgroup.com/dashboard/Content/TechnicalConcepts/Basics/DSC_About_CoordinateSystems.htm). 

Let's convert to standard *Latitude* and *Longitude* and plot them on a map. 
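
For reference, the conversion from EPSG:3857 (Web Mercator) to latitude and longitude has a closed form. The sketch below uses only the standard library and the spherical Web Mercator formulas; `web_mercator_to_lat_lon` is a hypothetical helper written for illustration, and the `pyproj` transformer used in this notebook remains the practical choice.

```python
import math

EARTH_RADIUS_M = 6378137.0  # sphere radius used by EPSG:3857 (WGS84 semi-major axis)


def web_mercator_to_lat_lon(x, y):
    """Convert EPSG:3857 coordinates in meters to (lat, lon) in degrees."""
    lon = math.degrees(x / EARTH_RADIUS_M)
    lat = math.degrees(2.0 * math.atan(math.exp(y / EARTH_RADIUS_M)) - math.pi / 2.0)
    return lat, lon


# The projection origin maps to latitude 0, longitude 0.
print(web_mercator_to_lat_lon(0.0, 0.0))
```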

In [None]:
def get_lat_lon(point, proj = pyproj.Transformer.from_crs(3857, 4326, always_xy=True)):
    try:
        x1, y1 = point[0], point[1]
        lon, lat = proj.transform(x1, y1)
        return lat, lon
    except:
        return np.nan

def get_point(x, i=0):
    try:
        return x[i]
    except:
        return np.nan
    
full_geo_df_sample = full_geo_df.sample(500).reset_index(drop = True)
full_geo_df_sample['lat_lon'] = full_geo_df_sample.point.apply(get_lat_lon)
full_geo_df_sample['lat'] = full_geo_df_sample['lat_lon'].apply(lambda x: get_point(x,0))
full_geo_df_sample['lon'] = full_geo_df_sample['lat_lon'].apply(lambda x: get_point(x,1))

#### Map for some of our sites

In [None]:
m = folium.Map(location=[30.7444062,121.1146543], tiles='openstreetmap', zoom_start = 8)

for j in range(len(full_geo_df_sample)):
    try:
        folium.Marker(location=[full_geo_df_sample['lat'][j],
                                full_geo_df_sample['lon'][j]],
                        popup=full_geo_df_sample['site_name'][j],
                        icon = folium.Icon(prefix = 'fa', icon = "map-pin", color = 'blue'),
                        fill_color='#132b5e', num_sides=3, radius=5).add_to(m)
    except:
        continue
m

I'll keep customizing this analysis over the next few days.

<a id = "train_head"></a>

### Train

In [None]:
pretty(train_structure)

<a id = "train_stats"></a>

#### Basic Statistics for Sites, Floors, Paths

Let's see the number of paths per floor/site. 

In [None]:
train_dict = create_dict(train_path)[train_path]
train_path_df = pd.DataFrame.from_dict(train_dict, orient = 'index')

# Count the zero entries directly: summing the masked values would always give 0
assert (train_path_df == 0).sum().sum() == 0, "Floor present in Site, but no path available"

train_path_df['number_of_floors'] = train_path_df.apply(lambda x: ~x.isna()).sum(axis = 1)

train_path_df = (train_path_df.reset_index(drop= False).rename(columns = {'index': 'site'})
 .melt(ignore_index = True, id_vars = ['site', 'number_of_floors'], var_name = 'floor',
      value_name = 'number_of_paths'))

train_path_df = train_path_df.loc[~train_path_df.number_of_paths.isna()].reset_index(drop = True)

display(train_path_df.sample(3))

Here I add metadata to retrieve the floor numbers.

In [None]:
floor_meta_info = (full_geo_df.loc[~full_geo_df.floor_num.isna()]
                   [['site_name', 'name', 'floor_num']].reset_index(drop = True))
train_path_df_plus_meta = (train_path_df.merge(floor_meta_info, 
                           left_on = ['site', 'floor'], right_on = ['site_name', 'name']))

In [None]:
fig, axes = plt.subplots(2, 2, figsize = (20, 12))
ax = axes.ravel()
plot_df = train_path_df[['site', 'number_of_floors']].drop_duplicates(ignore_index = True)
ax[0]= plt.subplot2grid((2, 2), (0, 0), colspan=1)
plot_df.number_of_floors.hist(ax = ax[0], bins = 50, color = '#2695f0')
ax[0].set_title('Number of Floors per Site distribution')

ax[1]= plt.subplot2grid((2, 2), (0, 1))
train_path_df.number_of_paths.hist(ax = ax[1], bins = 30, color = '#2695f0')
ax[1].set_title('Number of Paths per Floor and Site distribution')

ax[2] = plt.subplot2grid((2, 2), (1, 0), colspan=1)

plot_df_3 = (train_path_df_plus_meta.groupby('floor_num').agg({'number_of_paths': ['sum', 'mean']}).reset_index())
plot_df_3.columns = ['floor_num', 'total_paths', 'avg_paths']
plot_df_3['avg_paths'] = round(plot_df_3['avg_paths'], 3)
#plot_df_3 = plot_df_3.melt(id_vars = 'floor_num')

plot_df_3.plot(kind = 'bar', x = 'floor_num', y = 'total_paths', ax = ax[2], color = '#f0b326')
#plot_df_3.plot(kind = 'bar', x = 'floor_num', y = 'avg_paths', ax = ax[2])
ax[2].set_title('Total Number of Paths per Floor Number')

ax[3] = plt.subplot2grid((2, 2), (1, 1), colspan=1)
plot_df_3.plot(kind = 'bar', x = 'floor_num', y = 'avg_paths', ax = ax[3], color = '#f0b326')
ax[3].set_title('Average Number of Paths per Floor Number')

ax[0].set_xlabel('N_floors')
ax[1].set_xlabel('N_paths')
ax[2].set_xlabel('Floor_num')
ax[3].set_xlabel('Floor_num')

ax[2].get_legend().remove()
ax[3].get_legend().remove()

<a id = "train_expl"></a>

#### Data explanation and relationships

I suggest reading the official competition [Git README](https://github.com/location-competition/indoor-location-competition-20). I'll report the key information here.

> Each trace (*.txt) corresponds to an indoor path between position p1 and p2 walked by a site-surveyor. During the walk, site-surveyor is holding an Android smartphone flat in front of his body, and a sensor data recording app is running on the device to collect IMU (accelerometer, gyroscope) and geomagnetic field (magnetometer) readings, as well as WiFi and Bluetooth iBeacon scanning results.
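
Each non-comment line of a trace is tab-separated: a timestamp, a record type such as `TYPE_ACCELEROMETER`, and then the values, as the `read_data_file` function above shows. A minimal sketch of parsing one such line follows; `parse_trace_line` is a hypothetical helper and the sample line is made up for illustration.

```python
def parse_trace_line(line):
    """Parse one trace line into a dict, or return None for blanks and comments."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    fields = line.split("\t")
    return {"timestamp": int(fields[0]), "type": fields[1], "values": fields[2:]}


# Made-up line, only to show the layout.
sample_line = "1000\tTYPE_ACCELEROMETER\t0.1\t0.2\t9.8"
record = parse_trace_line(sample_line)
print(record)
```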

And, regarding timestamps:

> In specific, we use SensorEvent.timestamp for sensor data and system time for WiFi and Bluetooth scans.

So we probably won't find the same timestamps for phone sensor data (accelerometer, gyroscope, magnetic field, ahrs) as for WiFi/Beacon data (Beacon here means Bluetooth, for those wondering).
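
One common way to relate two series with different timestamps is to match each record to the nearest timestamp in the other series. The sketch below does this with the standard library `bisect` module. It assumes both series are expressed on a common clock and that the reference list is sorted; `nearest_timestamp` is a hypothetical helper and the numbers are made up.

```python
from bisect import bisect_left


def nearest_timestamp(sorted_ts, target):
    """Return the value in sorted_ts closest to target (ties go to the earlier one)."""
    pos = bisect_left(sorted_ts, target)
    if pos == 0:
        return sorted_ts[0]
    if pos == len(sorted_ts):
        return sorted_ts[-1]
    before, after = sorted_ts[pos - 1], sorted_ts[pos]
    return before if target - before <= after - target else after


sensor_ts = [100, 120, 140, 160]  # made-up sensor timestamps, sorted
wifi_ts = [95, 131, 158, 400]     # made-up scan timestamps
aligned = [nearest_timestamp(sensor_ts, t) for t in wifi_ts]
print(aligned)
```

In pandas, `pd.merge_asof` offers a similar nearest-key join for whole DataFrames.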

#### Visual Explanation of Train Data

<img src = "https://i.imgur.com/kFufSTR.png"></img>

Let's load path data for site `5cd56c0ce2acfd2d33b6ab27`, floor `F2` and path `5d09b22fcfb49b00085466a0`.

In [None]:
site_floor_path = "5cd56c0ce2acfd2d33b6ab27/F2/5d09b22fcfb49b00085466a0.txt"

sample_file = read_data_file(os.path.join(train_path, site_floor_path))

In [None]:
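# Note: read_data_file stores [sys_ts, ssid, bssid, rssi, lastseen_ts], so the first column is the
# system timestamp of the scan and the last one is the last-seen timestamp; the names below are loose labels.
sample_wifi = pd.DataFrame(sample_file.wifi, columns = ['ts_last_seen', 'wifi_id_1', 'wifi_id_2', 'rssi', 'ts_first_seen'])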
sample_wifi[['ts_first_seen', 'ts_last_seen']] = sample_wifi[['ts_first_seen', 'ts_last_seen']].astype(int)
sample_beacon = pd.DataFrame(sample_file.ibeacon, columns = ['timestamp', 'beacon_id', 'rssi'])
sample_beacon['timestamp'] = sample_beacon['timestamp'].astype(int)

print("Wifi Data")
display(sample_wifi.sample(3))
print("IBeacon Data")
display(sample_beacon)

In [None]:
sample_acce =  pd.DataFrame(sample_file.acce, columns = ['timestamp', 'acce_x', 'acce_y', 'acce_z'])
sample_acce_uncali =  pd.DataFrame(sample_file.acce_uncali, columns = ['timestamp', 'acce_x', 'acce_y', 'acce_z'])
sample_gyro =  pd.DataFrame(sample_file.gyro, columns = ['timestamp', 'gyro_x', 'gyro_y', 'gyro_z'])
sample_gyro_uncali =  pd.DataFrame(sample_file.gyro_uncali, columns = ['timestamp', 'gyro_x', 'gyro_y', 'gyro_z'])
sample_magn =  pd.DataFrame(sample_file.magn, columns = ['timestamp', 'magn_x', 'magn_y', 'magn_z'])
sample_magn_uncali =  pd.DataFrame(sample_file.magn_uncali, columns = ['timestamp', 'magn_x', 'magn_y', 'magn_z'])
sample_ahrs =  pd.DataFrame(sample_file.ahrs, columns = ['timestamp', 'ahrs_x', 'ahrs_y', 'ahrs_z'])
sensor_df = sample_acce.copy()

for df in [sample_acce_uncali, sample_gyro, sample_gyro_uncali, sample_magn, sample_magn_uncali, sample_ahrs]:
    
    assert len(sensor_df) == len(sensor_df.merge(df, on = 'timestamp'))
    sensor_df = (sensor_df.merge(df, on = 'timestamp', suffixes = ("", "_uncali")))

sensor_df['timestamp'] = sensor_df['timestamp'].astype(int)

print("Phone Sensors Data")
display(sensor_df.sample(3))

In [None]:
sample_waypoint = pd.DataFrame(sample_file.waypoint, columns = ['timestamp', 'x', 'y'])
sample_waypoint['timestamp'] = sample_waypoint['timestamp'].astype(int)

print("Waypoint Data")
display(sample_waypoint.sample(min(len(sample_waypoint), 3)))

**The 3 data sources (phone sensors, phone signals and waypoint) are not time aligned**. Check the following plot:

In [None]:
all_timestamps = list(set(sample_waypoint.timestamp.tolist() + sensor_df.timestamp.tolist() +
                          sample_wifi.ts_first_seen.tolist()+sample_wifi.ts_last_seen.tolist()+sample_beacon.timestamp.tolist()))

all_timestamps_df = pd.DataFrame({'timestamp': all_timestamps}).sort_values('timestamp', ignore_index = True)

all_timestamps_plus_data_df = (all_timestamps_df.merge(sample_waypoint[['timestamp']], how = 'left', indicator = True).rename({'_merge': 'waypoint'}, axis = 1)
                               .replace({'left_only': 0, 'both': 1})
                  .merge(sensor_df[['timestamp']], how = 'left', indicator = True).rename({'_merge': 'phone_sensors'}, axis = 1)
                               .replace({'left_only': 2, 'both': 3})
                  .merge(sample_wifi[['ts_first_seen']].rename({'ts_first_seen': 'timestamp'},axis=1), how = 'left', indicator = True)
                               .rename({'_merge': 'wifi_first'}, axis = 1)
                               .replace({'left_only': 4, 'both': 5})
                  .merge(sample_wifi[['ts_last_seen']].rename({'ts_last_seen': 'timestamp'},axis=1), how = 'left', indicator = True)
                               .rename({'_merge': 'wifi_last'}, axis = 1)
                               .replace({'left_only': 6, 'both': 7})
                  .merge(sample_beacon[['timestamp']], how = 'left', indicator = True).rename({'_merge': 'beacon'}, axis = 1)
                               .replace({'left_only': 8, 'both': 9})
                  .drop_duplicates(ignore_index = True))

fig, ax = plt.subplots(1, 1, figsize = (16, 10))
all_timestamps_plus_data_df.loc[all_timestamps_plus_data_df.timestamp > 1560914051576].head(2000).set_index('timestamp').plot(ax = ax)
ax.set_yticklabels(['on', 'off', 'on', 'off', 'on', 'off', 'on', 'off', 'on', 'off', 'on'])
ax.legend(loc='upper left', bbox_to_anchor=(1, 0.8))
plt.suptitle('Each signal timestamp data: how the series are unaligned')

<a id = "test_and_sub_head"></a>

### test and submission

I have provided a visual explanation above of how test, train and submission data relate. For the floor mapping, check the [evaluation](https://www.kaggle.com/c/indoor-location-navigation/overview/evaluation) page.
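
As a rough sketch, floor names can be turned into integers with a small helper. The convention below (`F1` or `1F` maps to 0, `F2` or `2F` to 1, `B1` or `1B` to -1) is an assumption that should be verified against the Evaluation page, which is the authoritative source; `floor_name_to_int` is a hypothetical helper and does not handle other naming schemes.

```python
import re


def floor_name_to_int(name):
    """Map names like 'F1', '2F', 'B1' to integers (assumed convention, see above)."""
    match = re.fullmatch(r"([FB])(\d+)|(\d+)([FB])", name.strip().upper())
    if match is None:
        raise ValueError(f"Unrecognized floor name: {name!r}")
    letter = match.group(1) or match.group(4)
    number = int(match.group(2) or match.group(3))
    # Assumption: ground floor F1 is 0, basements count down from -1.
    return number - 1 if letter == "F" else -number


for floor_name in ["F1", "F2", "B1", "2F", "1B"]:
    print(floor_name, floor_name_to_int(floor_name))
```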

In [None]:
pretty(test_structure)

Unlike the training set, here we have just a list of paths, with no _site_ or _floor_ information.

Let's read one of the paths, `5694e13f4bb0bac39806b5ae`


In [None]:
sample_test_path = read_data_file(os.path.join(test_path, '5694e13f4bb0bac39806b5ae.txt'))

Everything is the same as for train data, except we don't have waypoint data (which is the target we are trying to predict). Let's get the sample submission data.

In [None]:
sub = pd.read_csv('/kaggle/input/indoor-location-navigation/sample_submission.csv')
sub[['site', 'path', 'timestamp']] = sub['site_path_timestamp'].str.split('_', expand=True)
display(sub.head(3))

We can already see that we may be asked to predict the floor and the x, y coordinates at several timestamps for the same path at the same site.
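
To see how many predictions each path needs, the submission ids can be grouped by site and path. Here is a minimal standard-library sketch, assuming ids follow the `site_path_timestamp` layout split above; `timestamps_by_path` is a hypothetical helper and the ids are made up.

```python
from collections import defaultdict


def timestamps_by_path(row_ids):
    """Group the requested timestamps by (site, path)."""
    grouped = defaultdict(list)
    for row_id in row_ids:
        site, path, timestamp = row_id.split("_")
        grouped[(site, path)].append(int(timestamp))
    return dict(grouped)


# Made-up ids, only to show the grouping.
example_ids = ["siteA_path1_10", "siteA_path1_25", "siteA_path2_7"]
grouped_ids = timestamps_by_path(example_ids)
print(grouped_ids)
```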

Let's now retrieve the corresponding path (`046cfa46be49fc10834815c6`) from test.

In [None]:
@dataclass
class ReadDataDf:
    sensor_df: pd.DataFrame
    wifi: pd.DataFrame
    ibeacon: pd.DataFrame
    waypoint: pd.DataFrame


def read_data_file_df(data_filename):
    acce = []
    acce_uncali = []
    gyro = []
    gyro_uncali = []
    magn = []
    magn_uncali = []
    ahrs = []
    wifi = []
    ibeacon = []
    waypoint = []

    with open(data_filename, 'r', encoding='utf-8') as file:
        lines = file.readlines()

    for line_data in lines:
        line_data = line_data.strip()
        if not line_data or line_data[0] == '#':
            continue

        line_data = line_data.split('\t')

        if line_data[1] == 'TYPE_WAYPOINT':
            waypoint.append([int(line_data[0]), float(line_data[2]), float(line_data[3])])
            continue
       
        if line_data[1] == 'TYPE_ACCELEROMETER':
            acce.append([int(line_data[0]), float(line_data[2]), float(line_data[3]), float(line_data[4])])
            continue
        
        if line_data[1] == 'TYPE_ACCELEROMETER_UNCALIBRATED':
            acce_uncali.append([int(line_data[0]), float(line_data[2]), float(line_data[3]), float(line_data[4])])
            continue
        
        if line_data[1] == 'TYPE_GYROSCOPE':
            gyro.append([int(line_data[0]), float(line_data[2]), float(line_data[3]), float(line_data[4])])
            continue

        if line_data[1] == 'TYPE_GYROSCOPE_UNCALIBRATED':
            gyro_uncali.append([int(line_data[0]), float(line_data[2]), float(line_data[3]), float(line_data[4])])
            continue
        
        if line_data[1] == 'TYPE_MAGNETIC_FIELD':
            magn.append([int(line_data[0]), float(line_data[2]), float(line_data[3]), float(line_data[4])])
            continue

        if line_data[1] == 'TYPE_MAGNETIC_FIELD_UNCALIBRATED':
            magn_uncali.append([int(line_data[0]), float(line_data[2]), float(line_data[3]), float(line_data[4])])
            continue

        if line_data[1] == 'TYPE_ROTATION_VECTOR':
            ahrs.append([int(line_data[0]), float(line_data[2]), float(line_data[3]), float(line_data[4])])
            continue

        if line_data[1] == 'TYPE_WIFI':
            sys_ts = line_data[0]
            ssid = line_data[2]
            bssid = line_data[3]
            rssi = line_data[4]
            lastseen_ts = line_data[6]
            wifi_data = [sys_ts, ssid, bssid, rssi, lastseen_ts]
            wifi.append(wifi_data)
            continue

        if line_data[1] == 'TYPE_BEACON':
            ts = line_data[0]
            uuid = line_data[2]
            major = line_data[3]
            minor = line_data[4]
            rssi = line_data[6]
            ibeacon_data = [ts, '_'.join([uuid, major, minor]), rssi]
            ibeacon.append(ibeacon_data)
            continue
            
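    def create_df(array, cols):
        # Fall back to an empty frame when the file had no rows of this type
        # (an empty array cannot be shaped into the requested columns).
        try:
            return pd.DataFrame(array, columns = cols)
        except Exception:
            return pd.DataFrame(columns = cols)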
    
    acce = create_df(np.array(acce), cols = ['timestamp', 'acce_x', 'acce_y', 'acce_z'])
    acce_uncali = create_df(np.array(acce_uncali), cols = ['timestamp', 'acce_uncali_x', 'acce_uncali_y', 'acce_uncali_z'])
    gyro = create_df(np.array(gyro), cols = ['timestamp', 'gyro_x', 'gyro_y', 'gyro_z'])
    gyro_uncali = create_df(np.array(gyro_uncali), cols = ['timestamp', 'gyro_uncali_x', 'gyro_uncali_y', 'gyro_uncali_z'])
    magn = create_df(np.array(magn), cols = ['timestamp', 'magn_x', 'magn_y', 'magn_z'])
    magn_uncali = create_df(np.array(magn_uncali), cols = ['timestamp', 'magn_uncali_x', 'magn_uncali_y', 'magn_uncali_z'])
    ahrs = create_df(np.array(ahrs), cols = ['timestamp', 'ahrs_x', 'ahrs_y', 'ahrs_z'])
    
    sensor_df = acce.copy()
    for df in [acce_uncali, gyro, gyro_uncali, magn, magn_uncali, ahrs]:
        current_len = len(sensor_df)
        if len(df) == 0:
            continue
        sensor_df = (sensor_df.merge(df, on = 'timestamp', suffixes = ("", "_uncali")))
        assert current_len == len(sensor_df)
        
    sensor_df['timestamp'] = sensor_df['timestamp'].astype(int)
    
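    # Naming caveat: 'ts_last_seen' holds each row's system timestamp (sys_ts above),
    # while 'ts_first_seen' holds the last-seen timestamp read from the file (lastseen_ts).
    wifi = create_df(wifi, cols = ['ts_last_seen', 'wifi_id_1', 'wifi_id_2', 'rssi', 'ts_first_seen'])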
    wifi[['ts_first_seen', 'ts_last_seen']] = wifi[['ts_first_seen', 'ts_last_seen']].astype(int)
    ibeacon = create_df(ibeacon, cols = ['timestamp', 'beacon_id', 'rssi'])
    ibeacon['timestamp'] = ibeacon['timestamp'].astype(int)
    
    waypoint = create_df(np.array(waypoint), cols = ['timestamp', 'x', 'y'])
    waypoint['timestamp'] = waypoint['timestamp'].astype(int)
    
    return ReadDataDf(sensor_df, wifi, ibeacon, waypoint)

In [None]:
test_path_046cfa46be49fc10834815c6 = read_data_file_df(os.path.join(test_path, '046cfa46be49fc10834815c6.txt'))
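# .copy() so the column assignment below modifies an independent frame, not a slice of `sub`
sub_path_046cfa46be49fc10834815c6 = sub.loc[sub.path == '046cfa46be49fc10834815c6'].copy()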
sub_path_046cfa46be49fc10834815c6['timestamp'] = sub_path_046cfa46be49fc10834815c6['timestamp'].astype(int)
display(sub_path_046cfa46be49fc10834815c6)

In [None]:
all_timestamps = list(set(test_path_046cfa46be49fc10834815c6.sensor_df.timestamp.tolist() + 
                          test_path_046cfa46be49fc10834815c6.ibeacon.timestamp.tolist() +
                          #test_path_046cfa46be49fc10834815c6.wifi.ts_first_seen.tolist()+
                          test_path_046cfa46be49fc10834815c6.wifi.ts_last_seen.tolist()))

all_timestamps_df = pd.DataFrame({'timestamp': all_timestamps}).sort_values('timestamp', ignore_index = True)

all_timestamps_plus_data_df = (all_timestamps_df
                  .merge(test_path_046cfa46be49fc10834815c6.sensor_df[['timestamp']], how = 'left', indicator = True)
                               .rename({'_merge': 'phone_sensors'}, axis = 1)
                               .replace({'left_only': 2, 'both': 3})
                  .merge(test_path_046cfa46be49fc10834815c6.wifi[['ts_last_seen']].rename({'ts_last_seen': 'timestamp'},axis=1), 
                                 how = 'left', indicator = True)
                               .rename({'_merge': 'wifi'}, axis = 1)
                               .replace({'left_only': 4, 'both': 5})
                  .merge(test_path_046cfa46be49fc10834815c6.ibeacon[['timestamp']], how = 'left', indicator = True).rename({'_merge': 'beacon'}, axis = 1)
                               .replace({'left_only': 6, 'both': 7})
                  .drop_duplicates(ignore_index = True))

fig, ax = plt.subplots(1, 1, figsize = (16, 10))
all_timestamps_plus_data_df.set_index('timestamp').plot(ax = ax)
for m, timestamp in enumerate(sub_path_046cfa46be49fc10834815c6.timestamp.tolist()):
    # ymin/ymax are fractions of the axes height (0 to 1), not data coordinates
    ax.axvline(timestamp, alpha = 0.5, ymin = 0, ymax = 1, linestyle = ":", color = 'blue')
    ax.text(timestamp-1700, 7.3, "prediction {}".format(m+1), size = 10, alpha = 0.5, rotation = 30)
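# Pin the ticks to the encoded values: 2/4/6 = no reading at this timestamp ('left_only'),
# 3/5/7 = reading present ('both'), for phone sensors, wifi and beacon respectively.
ax.set_yticks([2, 3, 4, 5, 6, 7])
ax.set_yticklabels(['off', 'on', 'off', 'on', 'off', 'on'])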
ax.legend(loc='upper left', bbox_to_anchor=(1, 0.8))
plt.suptitle('Path vs Waypoint Timestamp for path 046cfa46be49fc10834815c6', fontsize = 20)
#plt.text(x=100.8, y=7.4, s="When we are ask to predict waypoint, wrt to path data", fontsize=14)

I would assume that, for each path, we can also use data recorded after a waypoint's timestamp to predict that waypoint.
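
To make that idea concrete, here is a minimal sketch of how readings on both sides of a prediction timestamp could be collected. The helper name `window_around`, the `half_window_ms` parameter and the toy data are my own assumptions for illustration, not part of the competition code; it only relies on the frame having a `timestamp` column, like the ones returned by `read_data_file_df`.

```python
import pandas as pd


def window_around(df, target_ts, half_window_ms=2000, ts_col="timestamp"):
    """Return the rows of df whose timestamp lies within +/- half_window_ms of target_ts."""
    lo, hi = target_ts - half_window_ms, target_ts + half_window_ms
    # between() is inclusive on both ends by default
    out = df.loc[df[ts_col].between(lo, hi)].copy()
    # Negative offsets are readings before the target, positive ones come after it
    out["offset_ms"] = out[ts_col] - target_ts
    return out


# Toy data, made up for illustration: one reading every 500 ms
toy = pd.DataFrame({
    "timestamp": [0, 500, 1000, 1500, 2000, 2500, 3000],
    "rssi": [-80, -78, -75, -70, -72, -77, -81],
})

around = window_around(toy, target_ts=1500, half_window_ms=1000)
print(around)
```

The same call could be applied to the `ibeacon` or `wifi` frames of a path (passing `ts_col` for the wifi timestamp column) once for every prediction timestamp. The window size is a free choice here; I have not checked which value works best.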

My ideas end here for now. I'll update the notebook over the next few weeks if anything else comes to mind. Please tell me what you think!
