# Indoor location and navigation competition EDA

This is a starting EDA to understand the available data, building on other great public work that I have seen (please upvote [this one](https://www.kaggle.com/andradaolteanu/indoor-navigation-complete-data-understanding/data#notebook-container), [this one](https://www.kaggle.com/iamleonie/intro-to-indoor-location-navigation) and [this one](https://www.kaggle.com/c/indoor-location-navigation/discussion/215445)).

## The task  
Given a path file, the goal is to predict the floor and the waypoint location (x,y) at the timestamp provided in the `sample_submission.csv` file.
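
To make the target concrete, here is a minimal sketch of splitting a submission id into its parts. It assumes the id column is called `site_path_timestamp` and joins site, path and timestamp with underscores; the rows below are made up for illustration and are not taken from the real file.

```python
import pandas as pd

# Hypothetical rows mimicking the assumed `site_path_timestamp` id layout.
sub = pd.DataFrame({
    "site_path_timestamp": ["siteA_path1_0000000000009", "siteA_path1_0000000009017"],
    "floor": [0, 0],
    "x": [75.0, 75.0],
    "y": [75.0, 75.0],
})

# Split the id into site, path and timestamp (milliseconds, assumed).
parts = sub["site_path_timestamp"].str.split("_", expand=True)
parts.columns = ["site", "path", "timestamp"]
parts["timestamp"] = parts["timestamp"].astype("int64")

sub = pd.concat([parts, sub], axis=1)
print(sub[["site", "path", "timestamp"]])
```

Each row then says which path file to read and at which timestamp a floor and an (x,y) prediction are needed.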

In [None]:
import numpy as np 
import pandas as pd 
import os
import json

import seaborn as sns
from matplotlib import pyplot as plt
import matplotlib.dates as mdates

from datetime import datetime
from functools import reduce


## Understanding the input data  
- __train__ : path to training data, organized by site and floor. Each file within each floor folder contains the data relative to a single path on that floor.
- __test__ : path to test examples. Each file here represents a single path on a single floor, and it does not have the waypoint (x,y) data since that is part of what we want to predict
- __metadata__ : organized by site and floor, and containing for each of them `floor_image.png`, `floor_info.json` and `geojson_map.json`


For each example, the following signals are recorded at each position:
- accelerometer
- magnetic field
- gyroscope
- rotation vector
- WiFi
- Bluetooth iBeacon
- waypoint locations (x,y)
- each example belongs to a particular floor of a particular building

In [None]:
train_path = "../input/indoor-location-navigation/train"
test_path = "../input/indoor-location-navigation/test"
sites_path = "../input/indoor-location-navigation/metadata"

In [None]:
#checking the number of files in training
_, train_dir_names, _ = next(os.walk(train_path))

train_filenames = []
n_floors = 0
for tdn in train_dir_names: #for each training example folder (so for each site)
    _, sub_dir, _ = next(os.walk(train_path + '/' + tdn))
    n_floors += len(sub_dir)
    for sd in sub_dir: #for each sub-folder (so for each floor of the site)
        _, _, filenames = next(os.walk(train_path + '/' + tdn + '/' + sd)) #list all the files
    
        train_filenames += filenames
        
print("There are {} training examples over a total of {} sites and {} floors".format(
    len(train_filenames), len(train_dir_names), n_floors))

#checking the number of test examples
_, test_dir_names, tst_filenames = next(os.walk(test_path))

print("There are {} test examples".format(len(tst_filenames)))

#counting the sites for which metadata is provided
_, meta_dir_names, _ = next(os.walk(sites_path))
print("Metadata provided for {} sites".format(len(meta_dir_names)))
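
The same counting can be sketched with `pathlib`, which avoids building paths by string concatenation. The helper name `count_train_paths` is my own, and the tiny directory tree is synthetic so the snippet runs without the competition data. Unlike the cell above, it only counts `.txt` files.

```python
from pathlib import Path
import tempfile

def count_train_paths(root):
    """Return (n_sites, n_floors, n_files) for a site/floor/file.txt tree."""
    root = Path(root)
    sites = [p for p in root.iterdir() if p.is_dir()]
    floors = [f for s in sites for f in s.iterdir() if f.is_dir()]
    files = [t for f in floors for t in f.glob("*.txt")]
    return len(sites), len(floors), len(files)

# Tiny synthetic tree: two sites, one floor each, three path files in total.
with tempfile.TemporaryDirectory() as tmp:
    for site, floor, name in [("s1", "F1", "a"), ("s1", "F1", "b"), ("s2", "B1", "c")]:
        d = Path(tmp) / site / floor
        d.mkdir(parents=True, exist_ok=True)
        (d / f"{name}.txt").write_text("")
    counts = count_train_paths(tmp)

print(counts)
```

On the real data the call would be `count_train_paths(train_path)`.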

## Using the code from the competition repo 

Re-structuring the data from scratch can be a serious pain, since a single example can have 14210 lines and the content of these lines varies from example to example.  
We can instead download the [competition repo](https://github.com/location-competition/indoor-location-competition-20), upload it as a dataset and copy it into the working directory using `cp -r path/* ./`. In this way we can access the code of the repo and use `read_data_file` from `io_f` to read the information correctly.
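
To give an idea of why hand-parsing is tedious, here is a rough sketch that only counts the record types in a path file. It assumes each data line is tab-separated as timestamp, record type, values, and that header lines start with `#`. The text below is synthetic, not a real file, and `count_record_types` is a hypothetical helper.

```python
from collections import Counter

# Synthetic lines in the assumed raw layout (tab-separated).
raw = "\n".join([
    "#\tstartTime:1560000000000",
    "1560000000001\tTYPE_WAYPOINT\t10.0\t20.0",
    "1560000000005\tTYPE_ACCELEROMETER\t0.1\t0.2\t9.8\t2",
    "1560000000005\tTYPE_GYROSCOPE\t0.0\t0.0\t0.1\t3",
    "1560000000025\tTYPE_ACCELEROMETER\t0.2\t0.1\t9.7\t2",
])

def count_record_types(text):
    """Count how many lines of each record type appear in the text."""
    counts = Counter()
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank and header lines
        fields = line.split("\t")
        counts[fields[1]] += 1
    return counts

type_counts = count_record_types(raw)
print(type_counts)
```

Every record type has a different number of value columns, which is exactly the bookkeeping that `read_data_file` takes care of.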

In [None]:
!cp -r ../input/indoorlocationcompetition20master/indoor-location-competition-20-master/* ./

In [None]:
from io_f import read_data_file

#input example
example_path = "../input/indoor-location-navigation/train/5cd56b83e2acfd2d33b5cab0/2F/5cf61eed731d5200089a627f.txt"


example = read_data_file(example_path)

#list of the fields of the object returned by read_data_file in io_f
print("'acce' refers to the accelerometer data - shape {}".format(example.acce.shape))
print("'acce_uncali' refers to the uncalibrated accelerometer data - shape {}".format(example.acce_uncali.shape))
print("'gyro' refers to the gyroscope data - shape {}".format(example.gyro.shape))
print("'gyro_uncali' refers to the uncalibrated gyroscope data - shape {}".format(example.gyro_uncali.shape))
print("'magn' refers to the magnetic field data - shape {}".format(example.magn.shape))
print("'magn_uncali' refers to the uncalibrated magnetic field data - shape {}".format(example.magn_uncali.shape))
print("'ahrs' refers to the rotation vector data - shape {}".format(example.ahrs.shape))
print("'wifi' refers to the wifi data - shape {}".format(example.wifi.shape))
print("'ibeacon' refers to the bluetooth data - shape {}".format(example.ibeacon.shape))
print("'waypoint' refers to the x,y position coordinates - shape {}".format(example.waypoint.shape))

[Here](https://www.kaggle.com/iamleonie/intro-to-indoor-location-navigation#Inertial-Measurement-Unit-(IMU)) is a great explanation of Inertial Measurement Unit (IMU) data, consisting of signals coming from the **accelerometer**, **gyroscope** and **magnetometer**. As you can see above, the data shape is the same for all of them.  
The first item of each row is a timestamp in milliseconds, so it is converted to a datetime using `datetime.fromtimestamp` on the value divided by 1000.
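
As a possible alternative, the conversion can be done in a vectorized way with `pd.to_datetime(..., unit="ms")`. Note that `datetime.fromtimestamp` returns local time while `pd.to_datetime` interprets the value as UTC, so the two can differ by the timezone offset. The timestamps below are made up.

```python
import pandas as pd

# Made-up timestamps in milliseconds.
ts_ms = pd.Series([0, 1_000, 1_500])

# Vectorized conversion: no per-row lambda needed.
as_datetime = pd.to_datetime(ts_ms, unit="ms")

# Seconds elapsed since the first sample, often handier than wall-clock time.
elapsed_s = (ts_ms - ts_ms.iloc[0]) / 1000

print(as_datetime)
print(elapsed_s)
```

For plotting a single path, the elapsed seconds are usually enough and avoid any timezone question.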

In [None]:
#example of waypoint data
exmp_waypoint_df = pd.DataFrame(example.waypoint, columns = ['timestamp', 'x', 'y'])
exmp_waypoint_df['timestamp'] = exmp_waypoint_df['timestamp'].apply(lambda x: datetime.fromtimestamp(x/1000))
exmp_waypoint_df

In [None]:
#example of accelerometer data (gyroscope, magnetometer, their uncalibrated versions and the rotation vector can be handled in the same way)
exmp_acce_df = pd.DataFrame(example.acce, columns = ['timestamp', 'x', 'y', 'z'])
exmp_acce_df['timestamp'] = exmp_acce_df['timestamp'].apply(lambda x: datetime.fromtimestamp(x/1000))
exmp_acce_df

In [None]:
#merging IMU data together
def IMU_df_gen(obs_path, datetimeconvert = True):
    
    """
    This function creates a dataframe with IMU data for a single observation.
    """
    
    obs = read_data_file(obs_path)
    
    acce = pd.DataFrame(obs.acce, columns = ['timestamp', 'acce_x', 'acce_y', 'acce_z'])
    acce_uncali = pd.DataFrame(obs.acce_uncali, columns = ['timestamp', 'acce_uncali_x', 'acce_uncali_y', 'acce_uncali_z'])
    
    gyro = pd.DataFrame(obs.gyro, columns = ['timestamp', 'gyro_x', 'gyro_y', 'gyro_z'])
    gyro_uncali = pd.DataFrame(obs.gyro_uncali, columns = ['timestamp', 'gyro_uncali_x', 'gyro_uncali_y', 'gyro_uncali_z'])
    
    magn = pd.DataFrame(obs.magn, columns = ['timestamp', 'magn_x', 'magn_y', 'magn_z'])
    magn_uncali = pd.DataFrame(obs.magn_uncali, columns = ['timestamp', 'magn_uncali_x', 'magn_uncali_y', 'magn_uncali_z'])
    
    ahrs = pd.DataFrame(obs.ahrs, columns = ['timestamp', 'ahrs_x', 'ahrs_y', 'ahrs_z'])
    
    #merging the dfs
    dfs = [acce, acce_uncali, gyro, gyro_uncali, magn, magn_uncali, ahrs]
    
    IMU_df = reduce(lambda  left,right: pd.merge(left,right,on=['timestamp'], how='outer'), dfs)
    
    
    if datetimeconvert:
        IMU_df['timestamp'] = IMU_df.timestamp.apply(lambda x: datetime.fromtimestamp(int(x)/1000))
    
    return IMU_df

IMU_df = IMU_df_gen(example_path)
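
One thing to keep in mind with the outer merge: if two sensors do not share exactly the same timestamps, the result gets extra rows filled with NaN. A possible alternative, sketched here on made-up data, is `pd.merge_asof`, which matches each row to the nearest timestamp within a tolerance. The 2 ms tolerance is an arbitrary choice for illustration.

```python
import pandas as pd

# Made-up readings whose timestamps (ms) are close but not identical.
acce = pd.DataFrame({"timestamp": [0, 20, 40], "acce_x": [0.1, 0.2, 0.3]})
gyro = pd.DataFrame({"timestamp": [1, 21, 45], "gyro_x": [1.0, 2.0, 3.0]})

# Outer merge on exact timestamps: no timestamp matches, so 6 rows with NaN.
outer = pd.merge(acce, gyro, on="timestamp", how="outer")

# Nearest-timestamp merge: one row per accelerometer reading.
aligned = pd.merge_asof(
    acce.sort_values("timestamp"),
    gyro.sort_values("timestamp"),
    on="timestamp",
    direction="nearest",
    tolerance=2,  # max allowed gap in ms (arbitrary for this sketch)
)

print(len(outer), len(aligned))
print(aligned)
```

The last accelerometer reading has no gyroscope sample within 2 ms, so its `gyro_x` stays NaN instead of being matched to a distant value.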

In [None]:
fig, acce_axs = plt.subplots(nrows=3, ncols=1)

plt.subplots_adjust(left=None, bottom=1, right=None, top=2, wspace=None, hspace=None)

acce_x_plot = IMU_df.plot(x = 'timestamp', y=['acce_x', 'acce_uncali_x'], figsize = (20,3), title = 'acce_x vs acce_uncali_x', ax = acce_axs[0])
acce_y_plot = IMU_df.plot(x = 'timestamp', y=['acce_y', 'acce_uncali_y'], figsize = (20,3), title = 'acce_y vs acce_uncali_y', ax = acce_axs[1])
acce_z_plot = IMU_df.plot(x = 'timestamp', y=['acce_z', 'acce_uncali_z'], figsize = (20,3), title = 'acce_z vs acce_uncali_z', ax = acce_axs[2])
        


for a in range(len(acce_axs)):
    if a < 2:
        acce_axs[a].axes.get_xaxis().set_visible(False)
    else:
        acce_axs[a].xaxis.set_major_locator(mdates.SecondLocator(interval = 5))

In [None]:
fig, magn_axs = plt.subplots(nrows=3, ncols=1)

plt.subplots_adjust(left=None, bottom=1, right=None, top=2, wspace=None, hspace=None)

#comparing each calibrated magnetometer axis with its uncalibrated version
magn_x_plot = IMU_df.plot(x = 'timestamp', y=['magn_x', 'magn_uncali_x'], figsize = (20,3), title = 'magn_x vs magn_uncali_x', ax = magn_axs[0])
magn_y_plot = IMU_df.plot(x = 'timestamp', y=['magn_y', 'magn_uncali_y'], figsize = (20,3), title = 'magn_y vs magn_uncali_y', ax = magn_axs[1])
magn_z_plot = IMU_df.plot(x = 'timestamp', y=['magn_z', 'magn_uncali_z'], figsize = (20,3), title = 'magn_z vs magn_uncali_z', ax = magn_axs[2])
for a in range(len(magn_axs)):
    if a < 2:
        magn_axs[a].axes.get_xaxis().set_visible(False)
    else:
        magn_axs[a].xaxis.set_major_locator(mdates.SecondLocator(interval = 5))

In [None]:
fig, gyro_axs = plt.subplots(nrows=3, ncols=1)

plt.subplots_adjust(left=None, bottom=1, right=None, top=2, wspace=None, hspace=None)

gyro_x_plot = IMU_df.plot(x = 'timestamp', y=['gyro_x', 'gyro_uncali_x'], figsize = (20,3), title = 'gyro_x vs gyro_uncali_x', ax = gyro_axs[0])
gyro_y_plot = IMU_df.plot(x = 'timestamp', y=['gyro_y', 'gyro_uncali_y'], figsize = (20,3), title = 'gyro_y vs gyro_uncali_y', ax = gyro_axs[1])
gyro_z_plot = IMU_df.plot(x = 'timestamp', y=['gyro_z', 'gyro_uncali_z'], figsize = (20,3), title = 'gyro_z vs gyro_uncali_z', ax = gyro_axs[2])
for a in range(len(gyro_axs)):
    if a < 2:
        gyro_axs[a].axes.get_xaxis().set_visible(False)
    else:
        gyro_axs[a].xaxis.set_major_locator(mdates.SecondLocator(interval = 5))

## Floors mapping  
I am treating the '**LG**' floors as '**lower ground**', so they map to the same levels as the '**B**' floors.  
The same goes for '**BF**' (basement floor).  
'**P**' levels usually refer to parking floors below the ground floor, so they are mapped to negative levels as well.  
I have interpreted '**G**' as the **ground floor**, so it is the same as '**F1**'.

*Classifying BM (I guess basement mezzanine), M (mezzanine) and LM is probably more complicated.
I'll probably just drop them (giving them the value 99 in the dictionary for now).*

In [None]:
#grouping the floors by type and counting them

floortypes = []

for tdn in train_dir_names: #for each training example folder (so for each site)
    _, sub_dir, _ = next(os.walk(train_path + '/' + tdn))
    for sd in sub_dir: #for each sub-folder (so for each floor of the site)
        floortypes.append(sd)

floors = pd.Series(floortypes)
        
plt.figure(figsize = (20,10))
sns.countplot(x = floors, color = 'lightblue') #passing the series by keyword, as recent seaborn versions expect

In [None]:
floors.unique()

In [None]:
#floors mapping

floors_dict = {'B3': -3,
               'B2': -2, 'LG2':-2, 'P2':-2,
               'B1': -1, 'LG1':-1, 'P1':-1, 'B':-1, 'BF':-1,
              'F1':0, '1F':0, 'L1':0, 'G':0,
              'F2':1, '2F':1, 'L2':1,
              'F3':2, '3F':2, 'L3':2,
              'F4':3, '4F':3, 'L4':3,
              'F5':4, '5F':4, 'L5':4,
              'F6':5, '6F':5, 'L6':5,
              'F7':6, '7F':6, 'L7':6,
              'F8':7, '8F':7, 'L8':7,
              'F9':8, '9F':8, 'L9':8,
              'F10':9, '10F':9, 'L10':9,
              'L11':10,
              'LM':99, 'M':99, 'BM':99}

floors_mapped = floors.apply(lambda x:floors_dict[x])

plt.figure(figsize = (20,10))
sns.countplot(x = floors_mapped, color='lightblue') #passing the series by keyword, as recent seaborn versions expect
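
Instead of listing every name by hand, the same convention can be sketched as a small parsing function. `parse_floor` is a hypothetical helper that encodes my interpretation above (ground floor = 0, 'B'/'LG'/'P' below ground) and returns `None` for the mezzanine names instead of 99.

```python
import re

def parse_floor(name):
    """Map a floor folder name to an integer level (ground floor = 0), or None."""
    name = name.upper()
    # Below ground with a number: B1, LG2, P1, ...
    m = re.fullmatch(r"(?:B|LG|P)(\d+)", name)
    if m:
        return -int(m.group(1))
    # Below ground without a number.
    if name in {"B", "BF"}:
        return -1
    if name == "G":
        return 0
    # Above ground: F1, L1 or 1F all mean level 0.
    m = re.fullmatch(r"(?:F|L)(\d+)|(\d+)F", name)
    if m:
        return int(m.group(1) or m.group(2)) - 1
    return None  # mezzanines (M, LM, BM) and anything unrecognised

for floor_name in ["B2", "LG1", "BF", "G", "F1", "1F", "L3", "10F", "LM"]:
    print(floor_name, parse_floor(floor_name))
```

The advantage is that an unseen name such as 'F12' would still be handled, while the dictionary lookup would raise a `KeyError`.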

## Visualizing a single path from a single example

In [None]:
from visualize_f import visualize_trajectory, visualize_heatmap

exmp_trajectory = example.waypoint #this returns timestamp, x,y
exmp_trajectory = exmp_trajectory[:, 1:] #keeping only coords x,y
exmp_base_path = "/".join(example_path.split("/")[:4]).replace("/train", "/metadata")
exmp_site = example_path.split("/")[4] #getting the site folder
exmp_floor = example_path.split("/")[5] #getting the floor

exmp_floor_img = "{}/{}/{}/floor_image.png".format(exmp_base_path, exmp_site, exmp_floor)

exmp_floor_json = "{}/{}/{}/floor_info.json".format(exmp_base_path, exmp_site, exmp_floor)
with open(exmp_floor_json) as jsonf:
    json_data = json.load(jsonf)
    
width_m = json_data['map_info']['width']
height_m = json_data['map_info']['height']

visualize_trajectory(trajectory = exmp_trajectory,
                     floor_plan_filename = exmp_floor_img,
                     width_meter = width_m,
                     height_meter = height_m,
                     title = 'single path from single example')
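
Besides plotting, a couple of simple numbers can be derived from the waypoints of a path. The sketch below computes the total path length and the duration on a made-up array with the same `timestamp, x, y` layout as `example.waypoint`, assuming the coordinates are in meters and the timestamps in milliseconds.

```python
import numpy as np

# Made-up waypoints: timestamp (ms), x, y.
waypoints = np.array([
    [0, 0.0, 0.0],
    [1000, 3.0, 4.0],
    [2000, 3.0, 10.0],
])

xy = waypoints[:, 1:]  # keep only the coordinates

# Euclidean distance between consecutive waypoints.
step_lengths = np.linalg.norm(np.diff(xy, axis=0), axis=1)
path_length_m = step_lengths.sum()

# Time between the first and the last waypoint, in seconds.
duration_s = (waypoints[-1, 0] - waypoints[0, 0]) / 1000

print(step_lengths, path_length_m, duration_s)
```

This is a straight-line approximation between waypoints, so the real walked distance can be longer.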

## Visualizing the magnetic strength on a single path

In [None]:
from main import calibrate_magnetic_wifi_ibeacon_to_position
from main import extract_magnetic_strength

# Extracting the magnetic strength
mwifi_data = calibrate_magnetic_wifi_ibeacon_to_position([example_path])
magnetic_strength = extract_magnetic_strength(mwifi_data)

heat_positions = np.array(list(magnetic_strength.keys()))
heat_values = np.array(list(magnetic_strength.values()))

visualize_heatmap(heat_positions,
                  heat_values,
                     floor_plan_filename = exmp_floor_img,
                     width_meter = width_m,
                     height_meter = height_m,
                     title = 'magnetic strength from single example')
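
By magnetic strength I mean the magnitude of the magnetic field vector, sqrt(x² + y² + z²). I have not checked exactly how the repo aggregates it per position, so the snippet below is only a sketch of the basic formula on made-up readings with the `timestamp, x, y, z` layout.

```python
import numpy as np

# Made-up magnetometer readings: timestamp (ms), x, y, z.
magn = np.array([
    [0, 3.0, 4.0, 0.0],
    [20, 0.0, 6.0, 8.0],
])

# Magnitude of the field vector for each reading.
strength = np.sqrt((magn[:, 1:4] ** 2).sum(axis=1))

# A single summary value for a group of readings.
mean_strength = strength.mean()

print(strength, mean_strength)
```

The magnitude does not depend on how the phone is oriented, which makes it easier to compare across positions than the single axes.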

## Visualizing WiFi access points from a single example

In [None]:
from main import extract_wifi_rssi, extract_wifi_count

# Get WiFi data
wifi_rssi = extract_wifi_rssi(mwifi_data)
print(f'This floor has {len(wifi_rssi.keys())} wifi aps (access points).')

wifi_counts = extract_wifi_count(mwifi_data)
heat_positions = np.array(list(wifi_counts.keys()))
heat_values = np.array(list(wifi_counts.values()))
# filter out positions where no wifi was detected
mask = heat_values != 0
heat_positions = heat_positions[mask]
heat_values = heat_values[mask]

# The heatmap
visualize_heatmap(heat_positions, 
                  heat_values, 
                  floor_plan_filename = exmp_floor_img,
                    width_meter = width_m,
                    height_meter = height_m,
                  colorbar_title='count', 
                  title='wifi access on a single example')
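
The raw WiFi records can also be summarised directly with pandas. The sketch below assumes the columns of `example.wifi` are timestamp, ssid, bssid, rssi and last seen timestamp. I have not verified this order against the repo, so check it before reusing the column names. The rows are made up.

```python
import pandas as pd

# Made-up WiFi scans in the assumed column layout.
wifi = pd.DataFrame(
    [
        [1000, "ssid_a", "bssid_1", -50, 990],
        [1000, "ssid_a", "bssid_2", -70, 995],
        [3000, "ssid_a", "bssid_1", -60, 2990],
    ],
    columns=["timestamp", "ssid", "bssid", "rssi", "last_seen_timestamp"],
)

# Number of distinct access points seen on the path.
n_aps = wifi["bssid"].nunique()

# Average signal strength per access point.
mean_rssi = wifi.groupby("bssid")["rssi"].mean()

print(n_aps)
print(mean_rssi)
```

On real data, `pd.DataFrame(example.wifi, columns=[...])` would give the starting frame, with the rssi column cast to a numeric type first if it is read as a string.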