# Data Wrangling
In this section, I wrangle the pathway data so that the output is a set of dataframes, one for each sensor type.  
I define reusable functions along the way. They live in hidden cells, so you may need to unhide them.  

I also fix the data quality issue reported by the dataset provider, where a newline character is missing between sensor readings.  
An example of the issue is shown below; the whole string occurs on a single line:  

> 1560916208644	TYPE_BEACON	bd1b5cf6d9f4f7bcb796b62cc831b6c81b1aa6ae	356a192b7913b04c54574d18c28d46e6395428ab	356a192b7913b04c54574d18c28d46e6395428ab	-60	-97	36.627261007490404	778c2c52390b3513c1510c2fe7579c1011d250bb1560916208852	TYPE_WIFI	b1e32753c8cfd3624253d16d9bc944d917c451e4	8760dd3789b36258dea5d2b3687be70eb2163310	-77	2452	1560916206584

As you can see, a single line lists more than one sensor type; in this example, TYPE_BEACON and TYPE_WIFI. I assume that when this happens, both sensor readings occurred at the same timestamp.
To clean these problematic lines, the code does the following:
- Split the line into multiple lines, one for each sensor type found in that line.
- Assume that all of the split lines share the timestamp of the problematic line.
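
The two steps above can be sketched in isolation as follows. This is a minimal illustration, not the notebook's actual function: `split_fused_line` is a hypothetical helper name, and the input line is a shortened, made-up example rather than real data.

```python
def split_fused_line(line):
    """Split one tab-separated line holding several sensor readings into one line per reading.

    Assumption (same as in the text above): every reading on the line shares
    the timestamp found in the first field of the line.
    """
    tokens = line.rstrip('\n').split('\t')
    # positions of the sensor type markers, e.g. TYPE_BEACON, TYPE_WIFI
    type_idx = [i for i, token in enumerate(tokens) if token.startswith('TYPE_')]
    if len(type_idx) <= 1:
        return [line.rstrip('\n')]  # nothing to split

    timestamp = tokens[0]
    # each reading runs from its TYPE_ marker up to (not including) the next marker
    bounds = type_idx + [len(tokens)]
    return ['\t'.join([timestamp] + tokens[start:end])
            for start, end in zip(bounds[:-1], bounds[1:])]


# shortened, made-up fused line (not real data)
fused = '100\tTYPE_BEACON\tuuid\t-60\tmac105\tTYPE_WIFI\tssid\tbssid\t-77'
parts = split_fused_line(fused)
for part in parts:
    print(part.split('\t'))
```

Note that the field just before the second `TYPE_` marker (`mac105` in this made-up line) still carries the characters that were fused onto it. The sketch leaves that field untouched and only restores the line break.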

I will also create the following columns to be able to identify every sensor reading across the many files:
- `map_id`: the ID of the folder
- `floor`: the floor level from the folder name
- `trip_id`: the ID of the text file
- `waypoint_seq`: the sequence number for each waypoint path (unique only within one trip)

I believe that having the three IDs (`map_id`, `floor`, `trip_id`) in each row, together with the timestamp, lets us uniquely identify every sensor reading across the many files.
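
As a small sketch of how these IDs can be derived, assuming the folder layout `<map_id>/<floor>/<trip_id>.txt`: the helper name `ids_from_path` and the example path are hypothetical and only serve the illustration.

```python
import os

def ids_from_path(file_path):
    """Derive (map_id, floor, trip_id) from a path shaped like '<map_id>/<floor>/<trip_id>.txt'."""
    floor_dir, filename = os.path.split(file_path)
    trip_id = os.path.splitext(filename)[0]    # file name without extension
    map_dir, floor = os.path.split(floor_dir)  # parent folder is the floor
    map_id = os.path.basename(map_dir)         # grandparent folder is the map
    return map_id, floor, trip_id

# hypothetical path, for illustration only
example = os.path.join('train', 'map_a', 'F1', 'trip_001.txt')
print(ids_from_path(example))
```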

The waypoint sequence number makes it possible to find all the sensor readings recorded along one waypoint path, that is, between one waypoint and the next. This lets us query the sensor readings that belong to a specific path segment. The sequence number follows the order of the waypoints in the data (sorted by timestamp) and starts from 0.
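
The numbering rule can be sketched with the standard library alone. This is an illustrative version under the assumption that a reading belongs to waypoint `k` when its timestamp lies between waypoint `k` (inclusive) and waypoint `k+1` (exclusive). The function name `assign_waypoint_seq` and the timestamps are made up.

```python
from bisect import bisect_right

def assign_waypoint_seq(reading_timestamps, waypoint_timestamps):
    """Return the waypoint sequence number (starting at 0) for each reading timestamp.

    Readings before the first waypoint get 0, and readings at or after the
    last waypoint get the last sequence number.
    """
    # the segment boundaries are the 2nd, 3rd, ... waypoint timestamps
    boundaries = sorted(waypoint_timestamps)[1:]
    # number of boundaries at or before the reading = index of its segment
    return [bisect_right(boundaries, ts) for ts in reading_timestamps]

waypoints = [10, 20, 30]                # made-up waypoint timestamps
readings = [5, 10, 19, 20, 25, 30, 99]  # made-up sensor reading timestamps
seqs = assign_waypoint_seq(readings, waypoints)
print(seqs)  # -> [0, 0, 0, 1, 1, 2, 2]
```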

In [None]:
import numpy as np
import pandas as pd
from glob import glob
import os
from tqdm import tqdm
from collections.abc import Iterable
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

Below is the function that reads the pathway data and produces the dataframes. The function will:
- Fix the reported data quality issue by restoring the missing newline characters that separate multiple sensor readings.
- Create the new column `waypoint_seq` to help separate the sensor readings into paths based on pairs of waypoints.
- Create the IDs mentioned above.

In [None]:
def read_pathway_data(file_path):
    '''
    Read a pathway text file and split it into separate dataframes, one per sensor type.
    This function fixes the "missing new line" data quality problem
    and adds the ID columns (`map_id`, `floor`, `trip_id`, `waypoint_seq`) to each sensor reading.
    Returns the combined dataframe and a dictionary of dataframes keyed by sensor type.
    '''
    
    # Create a new csv file with all the comment lines removed and the data quality issue fixed, then read the csv file using pandas
    # because the number of columns varies for each row, there will be NaN values at around the last columns for some rows
    # this is expected and to be cleaned at the end of this function.
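    with open(file_path, 'r', encoding='utf-8') as file:
        # keep data lines only and make sure every line ends with a newline,
        # so that lines appended later are not glued to the last line of the file
        lines = np.array([line if line.endswith('\n') else line + '\n'
                          for line in file if not line.startswith('#')])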
    
    # clean the missing new line data quality issue by adding new line and copying the same timestamp
    new_lines_to_add = []
    line_idx_to_remove = []
    for line_idx, line in enumerate(lines):
        tokens = line.split('\t')
        type_idx = [i for i, token in enumerate(tokens) if token.startswith('TYPE_')]
        
        if len(type_idx) > 1:
            line_idx_to_remove.append(line_idx)
            timestamp = tokens[0]
            for j, start_idx in enumerate(type_idx):
                end_idx = -1 if j == (len(type_idx)-1) else type_idx[j+1]-1
                tokens_sub = tokens[start_idx:end_idx+1] if end_idx > -1 else tokens[start_idx:]
                new_line = '\t'.join([timestamp] + tokens_sub)
                if not new_line.endswith('\n'):
                    new_line += '\n'
                new_lines_to_add.append(new_line)
    
    lines = np.delete(lines, line_idx_to_remove)
    lines = np.append(lines, new_lines_to_add)
    
    # rewrite cleaned lines into new csv file
    with open('temp.csv', 'w') as tmp_csv:
        for line in lines:
            tmp_csv.write(line)
    
    # import the cleaned CSV file
    df = pd.read_csv('temp.csv', sep='\t', names=np.arange(9), header=None, low_memory=False) # need high memory due to mixed data type in each column
    os.remove('temp.csv')
    
    # rename columns
    df = df.rename(columns={0: 'timestamp', 1: 'sensor_type'})
    
    # create the ID column based on timestamp and textfile name
    path, filename = os.path.split(file_path)
    trip_id  = os.path.splitext(filename)[0]
    path, floor = os.path.split(path)
    path, map_id = os.path.split(path)
    
    df['trip_id'] = trip_id
    df['floor'] = floor
    df['map_id'] = map_id
    
    # create waypoint sequence number
    waypoint_timestamps = np.append(df.query('sensor_type == "TYPE_WAYPOINT"').timestamp.sort_values().values[1:], -1)
    start_timestamp = 0
    df['waypoint_seq'] = -1
    for i, end_timestamp in enumerate(waypoint_timestamps):
        if end_timestamp > -1:
            df.loc[(df.timestamp >= start_timestamp) & (df.timestamp < end_timestamp), 'waypoint_seq'] = i
        else:
            df.loc[(df.timestamp >= start_timestamp), 'waypoint_seq'] = i
        start_timestamp = end_timestamp
    
    # split the dataframe into multiple dataframes, one for each sensor type
    df_dict = {}
    for sensor_type in df.sensor_type.unique():
        df_sub = df.query(f'sensor_type == "{sensor_type}"')
        if sensor_type == 'TYPE_WAYPOINT':
            df_sub = (df_sub.rename(columns={2: 'x', 3: 'y'})
                      .loc[:, ['map_id', 'floor', 'trip_id', 'timestamp', 'waypoint_seq', 'sensor_type', 'x', 'y']]
                      .astype({'x': float, 'y': float})
                     )
        elif sensor_type == 'TYPE_WIFI':
            df_sub = (df_sub.rename(columns={2: 'ssid', 3: 'bssid', 4: 'rssi', 5: 'frequency', 6: 'last_seen_timestamp'})
                      .astype({'last_seen_timestamp': int})
                      .loc[:, ['map_id', 'floor', 'trip_id', 'timestamp', 'waypoint_seq', 'sensor_type', 'ssid', 'bssid', 'rssi', 'frequency', 'last_seen_timestamp']]
                     )
        elif sensor_type == 'TYPE_BEACON':
            df_sub = (df_sub.rename(columns={2: 'uuid', 3: 'major_id', 4: 'minor_id', 5: 'tx_power', 6: 'rssi', 7: 'distance', 8: 'mac_address'})
                      .astype({'uuid': str,
                               'major_id': str,
                               'minor_id': str,
                               'tx_power': float,
                               'rssi': float,
                               'distance': float,
                               'mac_address': str
                              })
                      .loc[:, ['map_id', 'floor', 'trip_id', 'timestamp', 'waypoint_seq', 'sensor_type', 'uuid', 'major_id', 'minor_id', 'tx_power', 'rssi',
                              'distance', 'mac_address']]
                     )
        else:
            df_sub = (df_sub.rename(columns={2: 'x', 3: 'y', 4: 'z'})
                      .loc[:, ['map_id', 'floor', 'trip_id', 'timestamp', 'waypoint_seq', 'sensor_type', 'x', 'y', 'z']]
                      .astype({'x': float, 'y': float, 'z': float})
                     )
            
        df_dict[sensor_type] = df_sub.reset_index(drop=True)
    
    return df, df_dict

In [None]:
train_file_paths = glob('../input/indoor-location-navigation/train/*/*/*')
df_raw, df_dict = read_pathway_data(train_file_paths[4]) # this is the path with TYPE_BEACON data

Preview of the dataframes:

In [None]:
for sensor_type in df_raw.sensor_type.unique():
    print('-------------------------------')
    print(sensor_type)
    print('-------------------------------')
    display(df_dict[sensor_type].head())

Column information for each dataframe:

In [None]:
for sensor_type in df_raw.sensor_type.unique():
    print('-------------------------------')
    print(sensor_type)
    print('-------------------------------')
    df_dict[sensor_type].info()
    print('\n')

# Convert Text Files to CSV

Below I will:
- Define a function that collects the training data of one specific map into CSV files, one file per sensor type.
- Run the conversion using that function.
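
When the data is converted file by file, one way to keep memory use low is to append each file's rows to the CSV as soon as they are read. The sketch below illustrates that pattern with `DataFrame.to_csv`. The helper `append_to_csv`, the `written` set, and the two tiny dataframes are assumptions made for this illustration only.

```python
import os
import tempfile
import pandas as pd

def append_to_csv(df, out_filepath, written):
    """Write `df` to `out_filepath`: create the file on first use, append afterwards.

    `written` is a set of file paths already created during this run
    (a helper structure introduced for this sketch).
    """
    first_write = out_filepath not in written
    df.to_csv(out_filepath,
              mode='w' if first_write else 'a',  # overwrite once, then append
              header=first_write,                # write the header only once
              index=False)
    written.add(out_filepath)

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, 'TYPE_WIFI.csv')
    written = set()
    # two made-up chunks standing in for the readings of two trip files
    append_to_csv(pd.DataFrame({'timestamp': [1, 2], 'rssi': [-70, -65]}), out_path, written)
    append_to_csv(pd.DataFrame({'timestamp': [3], 'rssi': [-80]}), out_path, written)
    combined = pd.read_csv(out_path)

print(combined)
```

Tracking the first write per output file, rather than per input file, keeps the header correct even when a sensor type does not appear in the first trip file.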

In [None]:
def is_iterable(x):
    """Return True if `x` is a non-string iterable, such as a list or tuple of sensor types."""
    # a single string like 'TYPE_WIFI' must count as one sensor type, not as a sequence of characters
    return isinstance(x, Iterable) and not isinstance(x, (str, bytes))

def to_csv_whole_map(map_dir, sensor_type=None, out_dir='', low_memory=True):
    """Export the whole map specified in `map_dir` into CSV files, one CSV file for each sensor type."""
    # This function assumes the folder structure being '<map_dir>/<floor>/<trip text file>'
    selected_train_files = glob(os.path.join(map_dir, '*', '*'))
    
    # create out dir if it does not exist
    if len(out_dir) > 0:
        os.makedirs(out_dir, exist_ok=True)
    
    # wrap a single sensor type in a list; None is kept as is and means "all sensor types found in each file"
    if sensor_type is not None and not is_iterable(sensor_type):
        sensor_type = [sensor_type]
        
    # initialize df_out_dict if high memory approach is chosen
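    if not low_memory: df_out_dict = {}
    
    # sensor types already written to disk; the low memory approach uses this
    # to decide on the write mode and on whether to write the header
    written_types = set()
    
    for file_path in tqdm(selected_train_files, desc='Converting text files'):
        df_raw, df_dict = read_pathway_data(file_path)
        
        s_types = df_raw.sensor_type.unique() if sensor_type is None else sensor_type
        
        for s_type in s_types:
            # skip sensor types that do not occur in this file
            if s_type not in df_dict:
                continue
            
            if low_memory:
                # export and append one by one if low_memory approach is chosen;
                # the first write of each sensor type creates the file with its header
                first_write = s_type not in written_types
                out_filepath = os.path.join(out_dir, s_type + '.csv')
                df_dict[s_type].to_csv(out_filepath,
                                       mode=('w' if first_write else 'a'),
                                       header=first_write,
                                       index=False)
                written_types.add(s_type)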
            else:
                # collect all dataframes and export in a single shot if high memory is possible
                df_out_list = df_out_dict.get(s_type)
                if df_out_list is None:
                    df_out_dict[s_type] = []
                    df_out_list = df_out_dict[s_type]
                df_out_list.append(df_dict[s_type])
    
    # the final step of high memory approach
    if not low_memory:
        for s_type in tqdm(df_out_dict.keys(), desc='Exporting to CSV files'):
            out_filepath = os.path.join(out_dir, s_type + '.csv')
            pd.concat(df_out_dict[s_type]).to_csv(out_filepath, index=False)

In [None]:
# # UNCOMMENT TO RUN THIS
# to_csv_whole_map('../input/indoor-location-navigation/train/5a0546857ecc773753327266', low_memory=False)

# Data Exploration

## Import Data

Below I import the CSV files into dataframes, collected in a dictionary keyed by the lowercase sensor name (for example `wifi`).

In [None]:
sensor_types = ['TYPE_WAYPOINT', 'TYPE_MAGNETIC_FIELD', 'TYPE_GYROSCOPE',
                'TYPE_ACCELEROMETER', 'TYPE_WIFI', 'TYPE_ROTATION_VECTOR', 'TYPE_BEACON']
indexes = ['map_id', 'floor', 'trip_id', 'waypoint_seq', 'timestamp']

df = {}
for s_type in sensor_types:
    key = s_type.replace('TYPE_', '').lower() # for convenience when getting the df
    df[key] = (pd.read_csv(f'{s_type}.csv')
               .set_index(indexes)
               .drop(columns='sensor_type')
               .sort_index()
              )
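
With a multi-level index like the one set above, rows of one map and floor can be selected by passing the leading index levels as a tuple. The sketch below shows this on a tiny made-up dataframe; the IDs and coordinates are illustrative assumptions, not values from the dataset.

```python
import pandas as pd

# made-up waypoint rows, for illustration only
rows = pd.DataFrame({
    'map_id': ['map_a', 'map_a', 'map_a'],
    'floor': ['B1', 'B1', 'F1'],
    'trip_id': ['t1', 't2', 't3'],
    'waypoint_seq': [0, 0, 0],
    'timestamp': [100, 200, 300],
    'x': [1.0, 2.0, 3.0],
    'y': [4.0, 5.0, 6.0],
})
indexed = (rows.set_index(['map_id', 'floor', 'trip_id', 'waypoint_seq', 'timestamp'])
           .sort_index())

# select one floor of one map; the selected levels are dropped from the result's index
b1 = indexed.loc[('map_a', 'B1'), :]
num_trips = b1.index.get_level_values('trip_id').nunique()
print(b1)
print(f'Number of trips: {num_trips}')
```

Sorting the index first lets pandas look up the tuple efficiently and avoids performance warnings on partial selections.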

## Waypoints

In [None]:
b1_waypoints = df['waypoint'].loc['5a0546857ecc773753327266', 'B1']
sns.scatterplot(data=b1_waypoints, x='x', y='y', hue='trip_id', legend=False);
num_trips = b1_waypoints.index.get_level_values('trip_id').unique().shape[0]
plt.title(f'Number of trips: {num_trips}');

## Wifi

In [None]:
df['wifi']