# What to do

The [competition page](https://www.kaggle.com/c/indoor-location-navigation/data) describes the following data quality problem.

>A note on data quality: In the training files, you may find occasionally that a line is missing the ending newline character, causing it to run on to the next line. It is up to you how you want to handle this issue. This issue is not found in the test data.

In this notebook, I look at several path files to understand the file format, then rearrange the data into a handier format while addressing the data quality problem above.

In [None]:
import glob
import os
import os.path
import random
from typing import Tuple, List, Union

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# for dirname, _, filenames in os.walk('/kaggle/input'):
#     for filename in filenames:
#         print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# 1st sample

The first sample is "/kaggle/input/indoor-location-navigation/train/5cd56c0ce2acfd2d33b6ab27/B1/5d09a625bd54340008acddb9.txt". It shows the basic format of a path file.

In [None]:
sample_filepath = "/kaggle/input/indoor-location-navigation/train/5cd56c0ce2acfd2d33b6ab27/B1/5d09a625bd54340008acddb9.txt"
with open(sample_filepath, 'r') as f:  # the with block closes the file automatically
    sample_content = f.readlines()

In [None]:
sample_content[:20]

In [None]:
sample_content[-20:]

It seems that:
- The top 7 rows are the header and the final row is the footer.
  - Each row contains one or more fields separated by tabs.
  - The field format is _name + ":" + value_.
- The other rows contain sensor data in a uniform format.

According to the [official guide](https://github.com/location-competition/indoor-location-competition-20), each row has 4-10 columns: the first is the Unix time (milliseconds), the second is the data type, and the rest are the body. We can obtain the ground truth (x, y) labels from the third and fourth columns of rows whose data type is "TYPE_WAYPOINT".
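
As a minimal sketch of the waypoint extraction described above, the helper below collects (timestamp, x, y) from "TYPE_WAYPOINT" rows. The function name `extract_waypoints` and the example values are made up for illustration, and the sketch assumes that every line holds exactly one record, so run-on lines are not handled here.

```python
from typing import List, Tuple


def extract_waypoints(lines: List[str]) -> List[Tuple[int, float, float]]:
    """Collect (timestamp_ms, x, y) from TYPE_WAYPOINT rows (hypothetical helper)."""
    waypoints = []
    for line in lines:
        if line.startswith('#'):
            continue  # header or footer row, not sensor data
        fields = line.strip().split('\t')
        if len(fields) >= 4 and fields[1] == 'TYPE_WAYPOINT':
            # 1st column: Unix time (ms), 3rd and 4th columns: x and y
            waypoints.append((int(fields[0]), float(fields[2]), float(fields[3])))
    return waypoints


# Illustrative lines only; the numbers are not taken from a real path file.
example_lines = [
    '#\tstartTime:1560830841553\n',
    '1560830841553\tTYPE_WAYPOINT\t10.5\t20.25\n',
    '1560830841600\tTYPE_ACCELEROMETER\t0.1\t0.2\t9.8\t2\n',
]
print(extract_waypoints(example_lines))
```

If a waypoint row were fused with the following record, the fourth field would silently parse to a wrong number or fail, so this kind of extraction is safer after the run-on lines have been separated.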

# 2nd sample

Next, try "/kaggle/input/indoor-location-navigation/train/5da138b74db8ce0c98bd4774/F3/5db299ab5741f4000680a7d3.txt" to understand the format better.

In [None]:
sample_filepath2 = "/kaggle/input/indoor-location-navigation/train/5da138b74db8ce0c98bd4774/F3/5db299ab5741f4000680a7d3.txt"
with open(sample_filepath2, 'r') as f:  # the with block closes the file automatically
    sample_content2 = f.readlines()

In [None]:
sample_content2[:20]

Only 2 header rows this time, which indicates that the number of header rows is not fixed. A leading "#" probably marks a line as header or footer.

In [None]:
sample_content2[-20:]

# 3rd sample

The third one is "/kaggle/input/indoor-location-navigation/train/5cd56b83e2acfd2d33b5cab0/B2/5cf72539e9d9c9000852f45b.txt".

In [None]:
sample_filepath3 =  "/kaggle/input/indoor-location-navigation/train/5cd56b83e2acfd2d33b5cab0/B2/5cf72539e9d9c9000852f45b.txt"
with open(sample_filepath3, 'r') as f:  # the with block closes the file automatically
    sample_content3 = f.readlines()

In [None]:
sample_content3[:20]

In [None]:
sample_content3[-20:]

From this I notice that some files do not have a footer.

# 4th sample

Check "/kaggle/input/indoor-location-navigation/train/5cd56b90e2acfd2d33b5e33f/F1/5d0868bdbb84450008f569ca.txt". This file shows an example of the data quality problem.

In [None]:
sample_filepath4 = "/kaggle/input/indoor-location-navigation/train/5cd56b90e2acfd2d33b5e33f/F1/5d0868bdbb84450008f569ca.txt"
with open(sample_filepath4, 'r') as f:  # the with block closes the file automatically
    sample_content4 = f.readlines()

In [None]:
sample_content4[:20]

In [None]:
sample_content4[-20:]

Now let's look at the data quality problem itself.

In [None]:
# show line 482, 483, 484, please watch 483 carefully!
sample_content4[482:485]

Line 483 is very long. This is the data quality problem: the "\n" at the end of the line is missing, so the next record runs on. The same problem occurs on 56 lines in this file.
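
One possible way to split such a run-on line is sketched below, purely as an illustration. It assumes that every record starts with a 13-digit Unix time in milliseconds, followed by a tab and a data type beginning with "TYPE_". The names `RECORD_START` and `split_run_on`, as well as the example values, are hypothetical.

```python
import re
from typing import List

# Assumption: a record begins with exactly 13 digits, a tab, then "TYPE_".
RECORD_START = re.compile(r'\d{13}\tTYPE_')


def split_run_on(line: str) -> List[str]:
    """Split a line that may contain several fused records (hypothetical helper)."""
    text = line.strip()
    starts = [match.start() for match in RECORD_START.finditer(text)]
    if not starts or starts[0] != 0:
        # Keep any leading text that does not look like a record start.
        starts.insert(0, 0)
    bounds = starts + [len(text)]
    return [text[begin:end] for begin, end in zip(bounds, bounds[1:])]


# Illustrative records only; the second one is fused onto the first.
first = '1560830842594\tTYPE_WAYPOINT\t1.5\t2'
second = '1560830842600\tTYPE_ACCELEROMETER\t0.1\t0.2\t9.8\t2'
for record in split_run_on(first + second + '\n'):
    print(repr(record))
```

Because the pattern requires the 13 digits to sit directly before the tab, a trailing digit of the previous value is left with the previous record. Unlike a column-count check, this sketch also separates two short fused records whose combined column count stays at 10 or below.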

In [None]:
# Which lines have more than 10 columns?
sensor_data4 = sample_content4[7:-1]  # assumes 7 header rows and a footer row
indices = []
for i, line in enumerate(sensor_data4):
    fields = line.strip().split('\t')
    if len(fields) > 10:
        indices.append(str(i))
print(f'{len(indices)} lines contain 2 or more records')
print(f'Lines where the problem occurred: {", ".join(indices)}')

In [None]:
sensor_data4[994]

In [None]:
sensor_data4[1497]

# 5th sample

Now take a look at "/kaggle/input/indoor-location-navigation/test/52ad8c760ff9978d0949deed.txt". It shows that the header format is sometimes inconsistent.

In [None]:
sample_filepath5 = "/kaggle/input/indoor-location-navigation/test/52ad8c760ff9978d0949deed.txt"
with open(sample_filepath5, 'r') as f:  # the with block closes the file automatically
    sample_content5 = f.readlines()

In [None]:
sample_content5[0]

Here the field name and value are separated not by ":" but by "\t".

# Brief summary

To summarize, I should keep in mind the following characteristics and quality issues of path files:
- The data quality problem described on the competition page.
- The number of header rows is not fixed.
- A file may have no footer.
- In the header, the field name and value are not always separated by ":"; sometimes they are separated by "\t".
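
The points above can be checked per file with a small helper along the following lines. This is only a sketch: `describe_path_file` and its result keys are hypothetical names, the "more than 10 columns" rule is the same rough heuristic used earlier, and the example lines are made up.

```python
from typing import Dict, List


def describe_path_file(lines: List[str], max_columns: int = 10) -> Dict[str, object]:
    """Summarize the structure of one path file given as a list of lines."""
    # Header rows are the leading lines that start with '#'.
    n_header_rows = 0
    for line in lines:
        if not line.startswith('#'):
            break
        n_header_rows += 1

    # A footer exists only if a line after the header is last and starts with '#'.
    has_footer = len(lines) > n_header_rows and lines[-1].startswith('#')
    body = lines[n_header_rows:-1] if has_footer else lines[n_header_rows:]

    # Rough run-on detection: more columns than a single record should have.
    n_run_on_lines = sum(
        1 for line in body if len(line.strip().split('\t')) > max_columns)

    # A header field without ':' suggests that name and value are tab-separated.
    header_fields = [
        field
        for line in lines[:n_header_rows]
        for field in line.lstrip('#').strip().split('\t')
        if field
    ]
    has_tab_separated_header = any(':' not in field for field in header_fields)

    return {
        'n_header_rows': n_header_rows,
        'has_footer': has_footer,
        'n_run_on_lines': n_run_on_lines,
        'has_tab_separated_header': has_tab_separated_header,
    }


# Illustrative content: tab-separated header field, one run-on line, no footer.
example = [
    '#\tstartTime\t1559699278686\n',
    '#\tSiteID:5cd56b83e2acfd2d33b5cab0\n',
    '1560830842594\tTYPE_WAYPOINT\t1.0\t2.0\n',
    '1560830842594\tTYPE_ACCELEROMETER\t0.1\t0.2\t0.3\t2'
    '1560830842595\tTYPE_GYROSCOPE\t0.1\t0.2\t0.3\t2\n',
]
print(describe_path_file(example))
```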


# Convert path file addressing the data quality problem

Now I would like to make the path files handier.

- Separate each .txt path file into 3 files: header, footer, and body.
  - Convert the header and footer data into simple .json files, and
  - convert the body data into a .parquet file.
- Address the data quality problem.

The solution code is in the next cell.

In [None]:
DATA_TYPES = ('TYPE_ACCELEROMETER',
              'TYPE_MAGNETIC_FIELD',
              'TYPE_GYROSCOPE',
              'TYPE_ROTATION_VECTOR',
              'TYPE_MAGNETIC_FIELD_UNCALIBRATED',
              'TYPE_GYROSCOPE_UNCALIBRATED',
              'TYPE_ACCELEROMETER_UNCALIBRATED',
              'TYPE_WIFI',
              'TYPE_BEACON',
              'TYPE_WAYPOINT')




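def count_header_row(content: List[str]) -> int:
    '''Return the number of leading header rows (lines starting with "#").'''
    for i, line in enumerate(content):
        if not line.startswith('#'):
            return i
    # Every line starts with '#', so the whole content is header.
    return len(content)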


def separate_line_if_needed(line: str) -> list:
    '''Separate multiple data rows in a single file line.

    In a path file, each line is expected to represent a single data row.
    Sometimes, however, a line does not have the line feed at its end,
    so multiple data rows end up on the same line. This function addresses
    that data quality problem.
    The detection is rough: if the line has more than 10 columns, it is
    treated as a run-on line and split. Otherwise it is assumed to have
    no data quality problem.

    Parameters
    ----------
    line: str
        Single line of a sensor data file. It sometimes represents a single
        data row, sometimes multiple.

    Return
    ------
    list of line(s): list
        If there are multiple data rows in line, separate them so that each
        element represents a single data row. If not, return [line].
        
    Note
    ----
    This solution assumes that the 1st column (Unix time) and 2nd column
    (data type) do not have any quality problem such as a missing value,
    an invalid value, a wrong length, and so on.
    '''

    max_columns = 10
    field_sep = '\t'
    fields = [field for field in line.strip().split(field_sep)]
    idx_data_type = [i for i, field in enumerate(fields) if field in DATA_TYPES]
    if len(fields) <= max_columns:
        # No problem! line represents single record.
        return [line]

    # 1. Identify where the 1st row actually ends. The hint is where the 2nd row's data type starts.
    # 2. Separate line into 2 parts: "1st row" and "Others".
    # 3. Add "1st row" from step 2 to the list of lines.
    # 4. Separate "Others" into single line(s) by the same process recursively.
    lines = []
    len_unix_time = 13  # Unix time milliseconds, such as '1560830841553'
    
    # 1.
    # where 2nd row's data type starts ?
    idx_second_data_type = idx_data_type[1]
    almost_first_row = field_sep.join(fields[:idx_second_data_type])  # 1st row + 2nd row's first column
    pos_first_row_end = len(almost_first_row) - len_unix_time  # Where 1st row actually ends
    # 2.
    first_row = line[:pos_first_row_end]
    others = line[pos_first_row_end:]
    # 3.
    lines.append(first_row)
    # 4.
    for l in others.splitlines():
        lines += separate_line_if_needed(l)

    return lines


def to_dict(header_or_footer: Union[str, list]) -> Union[None, dict]:
    '''Convert header/footer string into dictionary object.
    
    In a path file, the format of header and footer is as follows:
    - a header or footer line starts with the fixed string '#\t'
    - a header or footer is composed of pairs of field name and value,
      e.g. '#\tSiteID:5cd56b83e2acfd2d33b5cab0\tSiteName:日月光中心'
    Field name and value are generally separated by ':'. Sometimes, however,
    they are separated by '\t'. This function partially addresses that
    inconsistent format.
    
    Parameters
    ----------
    header_or_footer: str or list
        header or footer represented by a string, or a list of them.
        
    Return
    ------
    dictionary or None:
        Return None if header_or_footer is not actually a header/footer,
        otherwise a dictionary (key: field name, value: field value).
    '''
    
    result_dict = {}
    if isinstance(header_or_footer, str):
        header_or_footer = [header_or_footer]
    
    header_or_footer = [l for l in header_or_footer if l.startswith('#\t')]
    if not header_or_footer:
        return None

    for line in header_or_footer:
        # We do not need the first 2 chars '#\t' and the line feed.
        fields = line[2:].strip().split('\t')
        skip = False
        for i, field in enumerate(fields):
            if skip:
                skip = False
                continue
            try:
                # Split on the first ':' only, so a value containing ':' stays intact.
                name, value = field.split(':', 1)
            except ValueError:
                # Field name and value might be separated by "\t"
                name, value = fields[i], fields[i + 1]
                skip = True
            result_dict[name] = value
    return result_dict

In [None]:
assert(count_header_row(sample_content) == 7)
sample_content[:10]

In [None]:
assert(count_header_row(sample_content2) == 2)
sample_content2[:10]

In [None]:
assert(count_header_row(sample_content3) == 7)
sample_content3[:10]

In [None]:
single = '1560830842594\tTYPE_BEACON\t3rd_column\t4th_column\t5th_column\t6th_column\t7th_column\t8th_column\t9th_column\t10th_column'
assert(separate_line_if_needed(single) == [single])

single2 = '1560830842594\tTYPE_WAYPOINT\t3rd_column\t4th_column'
assert(separate_line_if_needed(single2) == [single2])

not_single = single + single2
assert(separate_line_if_needed(not_single) == [single, single2])

not_single2 = single + single2 + single + single2
assert(separate_line_if_needed(not_single2) == [single, single2, single, single2])

very_long = ''.join([single] * 100)
assert(separate_line_if_needed(very_long) == [single] * 100)

In [None]:
header1 = (
    '#\tstartTime:1559699278686\n',
    '#\tSiteID:5cd56b83e2acfd2d33b5cab0\tSiteName:日月光中心\tFloorId:5cd56b86e2acfd2d33b5d1f6\tFloorName:B2\n')
assert(to_dict(header1) == {'startTime': '1559699278686',
                            'SiteID': '5cd56b83e2acfd2d33b5cab0',
                            'SiteName': '日月光中心',
                            'FloorId':'5cd56b86e2acfd2d33b5d1f6',
                            'FloorName': 'B2'})
header2 = (
    '#\tstartTime\t1559699278686\n',
    '#\tSiteID:5cd56b83e2acfd2d33b5cab0\tSiteName:日月光中心\tFloorId:5cd56b86e2acfd2d33b5d1f6\tFloorName:B2\n')
assert(to_dict(header2) == {'startTime': '1559699278686',
                            'SiteID': '5cd56b83e2acfd2d33b5cab0',
                            'SiteName': '日月光中心',
                            'FloorId':'5cd56b86e2acfd2d33b5d1f6',
                            'FloorName': 'B2'})

header3 = (
    '#\tstartTime:1559699278686\n',
    '#\tSiteID:5cd56b83e2acfd2d33b5cab0\tSiteName\t日月光中心\tFloorId:5cd56b86e2acfd2d33b5d1f6\tFloorName:B2\n')
assert(to_dict(header3) == {'startTime': '1559699278686',
                            'SiteID': '5cd56b83e2acfd2d33b5cab0',
                            'SiteName': '日月光中心',
                            'FloorId':'5cd56b86e2acfd2d33b5d1f6',
                            'FloorName': 'B2'})

footer1 = '#\tendTime\t0000000201304\n'
assert(to_dict(footer1) == {'endTime': '0000000201304'})
footer2 = '#\tendTime:0000000201304\n'
assert(to_dict(footer2) == {'endTime': '0000000201304'})
footer3 = '1559699299331\tTYPE_ROTATION_VECTOR\t0.17401376\t-0.0031055238'  # Not a footer
assert(to_dict(footer3) is None)

The assertions pass, so the helper functions seem to work as expected.

Now convert the path file data into a handier format: header and footer into .json files, and path data into a .parquet file.
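
The output layout can be sketched as a mapping from an input path to the three output paths. This is an illustration under stated assumptions: `output_paths` is a hypothetical helper, it assumes the "/"-separated Kaggle input paths shown earlier, and it treats a file as training data only when the fourth path component from the end is "train".

```python
from pathlib import PurePosixPath
from typing import Dict


def output_paths(filepath: str) -> Dict[str, str]:
    """Map one input path file to its body/header/footer output paths."""
    path = PurePosixPath(filepath)
    parts = path.parts
    if len(parts) >= 4 and parts[-4] == 'train':
        # train/<site>/<floor>/<path file>.txt
        body_dir = PurePosixPath('train') / parts[-3] / parts[-2]
    else:
        # test files are not grouped by site or floor
        body_dir = PurePosixPath('test')
    stem = path.stem  # file name without the '.txt' extension
    return {
        'body': str(body_dir / f'{stem}.parquet'),
        'header': str(body_dir / 'header' / f'{stem}.json'),
        'footer': str(body_dir / 'footer' / f'{stem}.json'),
    }


train_example = ('/kaggle/input/indoor-location-navigation/train/'
                 '5cd56c0ce2acfd2d33b6ab27/B1/5d09a625bd54340008acddb9.txt')
test_example = '/kaggle/input/indoor-location-navigation/test/52ad8c760ff9978d0949deed.txt'
for name, value in output_paths(train_example).items():
    print(name, value)
for name, value in output_paths(test_example).items():
    print(name, value)
```

Keeping the header and footer in sibling directories of the body files means a glob such as `train/*/*/header/*.json` picks up all headers at once.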

In [None]:
!ls

In [None]:
%%time
import json
from time import time
from logging import getLogger, INFO, WARNING, FileHandler, StreamHandler, Formatter

# preparation for logging
logger = getLogger(__name__)
logger.setLevel(INFO)
formatter = Formatter('%(levelname)s : %(asctime)s : %(message)s')
filehandler = FileHandler('notebook.log')
filehandler.setLevel(INFO)
filehandler.setFormatter(formatter)
logger.addHandler(filehandler)
streamhandler = StreamHandler()
streamhandler.setLevel(WARNING)
streamhandler.setFormatter(formatter)
logger.addHandler(streamhandler)

logger.info('Start!')

train_filepaths = glob.glob("/kaggle/input/indoor-location-navigation/train/*/*/*.txt", recursive=True)
test_filepaths = glob.glob("/kaggle/input/indoor-location-navigation/test/*.txt", recursive=True)
filepaths = train_filepaths + test_filepaths

n_files = len(filepaths)
logger.info('{} path files found'.format(n_files))
since = time()

for i, filepath in enumerate(filepaths):
    logger.info('Start processing "{}"'.format(filepath))

    with open(filepath, 'r') as f:  # the with block closes the file automatically
        content = f.readlines()

    # arrange footer into dictionary object
    footer = content[-1]
    arranged_footer = to_dict(footer)  

    # arrange header into dictionary object
    n_header_rows = count_header_row(content)
    header = content[:n_header_rows]
    arranged_header = to_dict(header)
    
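    # arrange body into pd.DataFrame object
    # Drop the last line only when it really is a footer; otherwise it is a data row.
    body = content[n_header_rows:-1] if arranged_footer is not None else content[n_header_rows:]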
    len_columns = 10
    columns = [f'column{i + 1}' for i in range(len_columns)]
    data = []
    for line in body:
        cleansed_line = separate_line_if_needed(line)  # data cleansing
        for line_ in cleansed_line:
            fields = [field for field in line_.strip().split('\t')]
            if len(fields) < len_columns:
                # padding
                fields += [np.nan] * (len_columns - len(fields))
            data.append(fields)
    arranged_body = pd.DataFrame(data=data, columns=columns)
    
    # prepare directory to save dataset
    train_or_test = filepath.split(os.path.sep)[-4]
    if train_or_test == 'train':
        site = filepath.split(os.path.sep)[-3]
        floor = filepath.split(os.path.sep)[-2]
        body_directory = os.path.join(train_or_test, site, floor)
    else:
        body_directory = 'test'
    header_directory = os.path.join(body_directory, 'header')
    footer_directory = os.path.join(body_directory, 'footer')
    for directory in (body_directory, header_directory, footer_directory):
        os.makedirs(directory, exist_ok=True)
    
    # save dataset:
    ## footer
    basename = filepath.split(os.path.sep)[-1][:-4]
    footer_filepath = os.path.join(footer_directory, basename + '.json')
    if arranged_footer is not None:
        with open(footer_filepath, 'w') as f:
            json.dump(arranged_footer, f, indent=4)
    ## header
    header_filepath = os.path.join(header_directory, basename + '.json')
    with open(header_filepath, 'w') as f:
        json.dump(arranged_header, f, indent=4)
    ## body
    body_filepath = os.path.join(body_directory, basename + '.parquet')
    arranged_body.to_parquet(body_filepath, index=False)

    # logging
    if arranged_footer is None:
        logger.warning('Separate "{}" into 2 files: header="{}", body="{}", no footer file' \
                    .format(filepath, header_filepath, body_filepath))
    else:
        logger.info('Separate "{}" into 3 files: header="{}", footer="{}", body="{}"' \
                    .format(filepath, header_filepath, footer_filepath, body_filepath))
    logger.info('{}/{} files have been processed ({} seconds passed)'
                .format(i + 1, n_files, time() - since))

logger.info('Complete!')

In [None]:
!ls

In [None]:
headers = []
for header_filepath in glob.glob('train/*/*/header/*.json'):
    with open(header_filepath, 'r') as f:
        headers.append(json.load(f))
pd.DataFrame(headers)

In [None]:
headers = []
for header_filepath in glob.glob('test/header/*.json'):
    with open(header_filepath, 'r') as f:
        headers.append(json.load(f))
pd.DataFrame(headers)

In [None]:
footers = []
for footer_filepath in glob.glob('train/*/*/footer/*.json'):
    with open(footer_filepath, 'r') as f:
        footers.append(json.load(f))
pd.DataFrame(footers)

In [None]:
footers = []
for footer_filepath in glob.glob('test/footer/*.json'):
    with open(footer_filepath, 'r') as f:
        footers.append(json.load(f))
pd.DataFrame(footers)

In [None]:
%%time
!zip -rm "train.zip" "train"

In [None]:
%%time
!zip -rm "test.zip" "test"

In [None]:
!ls

In [None]:
print('Complete!')