# Consideration of floor
## Summary
A BSSID is an identifier attached to a physical WiFi access point. Since WiFi access points are fixed to the building, they do not
move. Therefore a BSSID, in whichever path it appears in the dataset, is uniquely tied to one floor number.

From the training data we know which person (= a path) received a WiFi signal from which WiFi access point (= BSSID; hereafter we use 'BSSID' and 'WiFi access point' interchangeably). We also know which floor this person was on at the time of the signal reception. In this way, we can assign one floor to each BSSID.

One problem is that a person sometimes picks up a WiFi signal from a floor other than the one they are on, because WiFi signals can pass through walls and floors, and through open parts of the building where the floor is removed to connect adjacent stories.

However, if we select only the signal-reception events with strong signals (= large RSSI), the person should be close to the WiFi access point, and therefore the chances are high that the person is on the same floor as this WiFi access point. By looking only at the floor information of such strong-signal reception events, we can tie a BSSID to a particular floor with better accuracy.

Once we have created a table that records the floor of each BSSID, we can use it to assign a floor to a path in the test data by looking at the set of BSSIDs that show up along the path.

Again, the set of BSSIDs in a particular path could be contaminated by WiFi access points on other floors. We again look only at the signal-reception events with strong signals (= large RSSI) to filter out unexpected signal receptions from other floors.

In this notebook, the data analysis goes as follows.

1. Pick one site.
2. List all BSSID-RSSI pairs (= reception events), regardless of the path on which they occur.
3. Group the list by BSSID, and select the 8 pairs with the strongest RSSI (the number 8 is arbitrary).
4. Look up the floor on which each selected pair occurred.
5. Each BSSID now has up to 8 floor entries. Let them vote, and assign the most frequent floor as the location of this BSSID.
6. Using a validation set, predict the floor of each path from the BSSIDs that show up along it.
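
The steps above can be sketched without any library, as a rough illustration only. The event tuples, the function names (`build_bssid_floor_table`, `predict_path_floor`, `most_common`) and all values are hypothetical and are not part of the notebook's code; the notebook itself uses datatable and pandas below.

```python
from collections import Counter, defaultdict

# Toy reception events: (path, bssid, rssi, floor). All values are made up.
train_events = [
    ("p1", "ap_a", -40, 0), ("p1", "ap_a", -45, 0), ("p2", "ap_a", -80, 1),
    ("p2", "ap_b", -42, 1), ("p2", "ap_b", -50, 1), ("p1", "ap_b", -85, 0),
]

def most_common(values):
    """Return the most frequent value (ties go to the value seen first)."""
    return Counter(values).most_common(1)[0][0]

def build_bssid_floor_table(events, n_votes=8):
    """Steps 2-5: for each BSSID, its n_votes strongest events vote on a floor."""
    by_bssid = defaultdict(list)
    for _path, bssid, rssi, floor in events:
        by_bssid[bssid].append((rssi, floor))
    table = {}
    for bssid, pairs in by_bssid.items():
        strongest = sorted(pairs, key=lambda rf: rf[0], reverse=True)[:n_votes]
        table[bssid] = most_common(floor for _rssi, floor in strongest)
    return table

def predict_path_floor(path_events, table, n_votes=8):
    """Step 6: the strongest events on a path vote with their BSSID's floor.

    Assumes at least one BSSID on the path is present in the table.
    """
    known = [(rssi, table[bssid]) for bssid, rssi in path_events if bssid in table]
    strongest = sorted(known, key=lambda rf: rf[0], reverse=True)[:n_votes]
    return most_common(floor for _rssi, floor in strongest)

bssid_floor = build_bssid_floor_table(train_events, n_votes=2)
predicted = predict_path_floor(
    [("ap_b", -48), ("ap_a", -82), ("ap_b", -55)], bssid_floor, n_votes=2
)
print(bssid_floor, predicted)
```

In this toy case the weak reception of `ap_a` from another floor is outvoted because only the strongest events are allowed to vote.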

## Conclusion
There is one parameter we had to adjust: the number of signal-reception events included in step 6. Once this number was tuned, the floor prediction was perfect for the first building we tried (= zero error).
 
The optimal number of votes in step 6 likely differs from building to building. This makes sense, as the range of a WiFi signal depends on the structure of the building. The parameter is worth tuning, because errors in the floor prediction cost more than errors in the two-dimensional position.

## Acknowledgement
I used the data published by @kokitanisaka
https://www.kaggle.com/kokitanisaka/create-unified-wifi-features-example

This notebook is inspired by the work of @nigelhenry
https://www.kaggle.com/nigelhenry/simple-99-accurate-floor-model

---
## Data Processing

Set up packages, notebook-wide parameters and data directory.

In [None]:
import numpy as np
import pandas as pd
from pathlib import Path
import os
from datetime import datetime

from sklearn.model_selection import StratifiedKFold, GroupKFold
from sklearn.preprocessing import StandardScaler, LabelEncoder

from datatable import (dt, f, by)
from IPython.display import display, HTML

display(HTML(
    '<style>'
        '#notebook { padding-top:0px !important; } '
        '.container { width:100% !important; } '
        '.end_space { min-height:0px !important; } '
        '</style>'
        ))

# ========================================================
pd.options.display.max_rows = 999
dt.options.display.max_nrows = 999
# ========================================================

path = Path('../input/indoorunifiedwifids')

Read the data created by @kokitanisaka. We use datatable instead of pandas for speed.

In [None]:
# ========================================================
# reading data

data = dt.fread(path/'train_all.csv')
test_data = dt.fread(path/'test_all.csv')

The content of `data` looks like this.

In [None]:
data

Long text strings are cumbersome to work with, so we convert them to numbers. First, we need the names of the BSSID and RSSI columns in the datatable.

In [None]:
bx = [i for i in data.names if i.startswith('bssid_')]
rx = [i for i in data.names if i.startswith('rssi_')]
bx, rx


Collect all BSSID names that occur in the training and the test dataset.

In [None]:
wifi_bssids = dt.unique(dt.rbind(data[:, bx], test_data[:, bx]))
wifi_bssids

In [None]:
wifi_bssids.nrows

There are 61307 unique BSSIDs in the dataset. Prepare label encoders that convert strings to numbers, and apply the encoders to `data`.

In [None]:
le = LabelEncoder()
le.fit(wifi_bssids)

le_site = LabelEncoder()
le_site.fit(data[:, 'site_id'])

le_path = LabelEncoder()
le_path.fit(data[:, 'path'])

# ========================================================

data[:, 'site_id'] = le_site.transform(data[:, 'site_id'])
data[:, 'path'] = le_path.transform(data[:, 'path'])

for i in bx:
    data[:, i] = le.transform(data[:, i])

Now the data looks easier to read.

In [None]:
# from sklearn.model_selection import StratifiedKFold, GroupKFold
data


In [None]:
data.shape

In [None]:
data.names

__Let us pick one site.__ When we put the notebook into production, this is the part we have to change and loop over. We pick the first site, where `site_id == 0`.

In [None]:
d0 = data[f.site_id==0, :]
d0

BSSID part of data.

In [None]:
d0[:,bx]

RSSI part of data.

In [None]:
d0[:,rx]

Now we pair up these two tables. It is like laying one table over the other and pairing the overlapping cells. The resulting list will be long: there are 100 `bx` columns (and 100 matching `rx` columns), and the number of rows in `d0` is 9296. The expected length of the list is therefore 929600.
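
As a toy illustration of this pairing, here is a minimal NumPy sketch. The arrays `bssid_wide`, `rssi_wide`, `floor` and `path` are hypothetical stand-ins for the corresponding parts of `d0`, with made-up values; this is not the code the notebook runs, only a picture of what the resulting long list looks like.

```python
import numpy as np

# Toy stand-ins for d0[:, bx] and d0[:, rx]: 2 rows x 3 WiFi columns.
bssid_wide = np.array([[8, 107, 129],
                       [8, 129, 300]])
rssi_wide = np.array([[-45, -60, -71],
                      [-50, -66, -80]])
floor = np.array([0, 1])    # one floor label per row
path = np.array([10, 11])   # one path id per row

n_rows, n_cols = bssid_wide.shape

# Flatten row by row, so that element k of every column belongs to
# the same reception event.
long_table = np.column_stack([
    bssid_wide.ravel(),
    rssi_wide.ravel(),
    np.repeat(floor, n_cols),  # repeat each row's floor once per WiFi column
    np.repeat(path, n_cols),
])

print(long_table.shape)  # (n_rows * n_cols, 4)
```

Each row of `long_table` is one (BSSID, RSSI, floor, path) reception event, so a table with 9296 rows and 100 WiFi columns would give 929600 events in the same way.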

In [None]:
len(bx), len(rx), d0.nrows

In [None]:
tp1 = datetime.now()
tpx0 = datetime.now()

for i in range(d0.nrows):
    dbx = d0[i,bx]
    drx = d0[i,rx]

    dbx = dt.Frame(dbx.to_numpy().T)
    drx = dt.Frame(drx.to_numpy().T)
#    print(f'dbx {dbx.nrows}') 

    dfx = dt.repeat(dt.Frame({'floor':[d0[i,'floor']]}), dbx.nrows)
    dpx = dt.repeat(dt.Frame({'path':[d0[i,'path']]}), dbx.nrows)

    cx = dt.cbind([dbx, drx, dfx, dpx])
    if i ==0 :  
        dx = cx
    else:
#        dx = dt.rbind([cx, dx])
        dx.rbind(cx)
        
    if i % 1000 == 0:
        tpx = datetime.now()
        print(f'{i:-5} \033[32m{tpx}\033[0m {dx.nrows:-8} \033[91m{tpx-tpx0}\033[0m')
        tpx0 = tpx

dx.names = ['BSSID', 'RSSI', 'floor', 'path']    
tp2 = datetime.now()

print(f'took \033[1;32m{tp2-tp1}\033[0m')



In [None]:
dx

We now convert `dx` to a pandas data frame.

In [None]:
df = dx.sort([f.BSSID, f.RSSI]).to_pandas()

The data frame is split into a training set (`ti`) and a validation set (`vi`), grouped by path so that all events from one path stay in the same set.

In [None]:
N_SPLIT=5
gkf = GroupKFold(n_splits=N_SPLIT)
ti, vi = next(gkf.split(df, df['floor'], groups=df['path']))
df_ti = df.iloc[ti]
df_vi = df.iloc[vi]

Make a list of unique BSSIDs.

In [None]:
u_bssid = df_ti['BSSID'].unique()
u_bssid, len(u_bssid)

Let us look at the first 3 BSSIDs, 8, 107, and 129.

In [None]:
x_df = pd.DataFrame(columns=['BSSID_S', 'BSSID', 'floor'])
for i in u_bssid[:3]:

    print(f'\033[31mBSSID \033[0m {i:5}')

    x = df_ti.loc[df_ti['BSSID']==i].sort_values(['RSSI'], ascending=False).iloc[0:8]
    xb = le.inverse_transform([i])[0]

    x_floor = x['floor'].mode().values[0]
    x_a = pd.DataFrame({'BSSID_S':[xb], 'BSSID':[i], 'floor':[x_floor]})

    print(x)
    print(f'\033[32mfloor voted: \033[0m{x_floor}')
    print()

    if i == u_bssid[0]: 
        x_df = x_a
    else: 
        x_df = pd.concat([x_df, x_a])



In the case of BSSID #8, the 8 events that received the strongest signals from this WiFi access point come from 4 different paths (10844, 10728, 10734, 10733). All voted for floor 0, so we conclude that BSSID #8 is located on the first floor, F1. The vote for BSSID #129 is not unanimous, but the majority of its 8 strongest receptions point to floor 1. Note that '8 strongest signal receptions' is an arbitrary choice, and there is room for experimentation.
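
The same per-BSSID vote can also be written with a pandas `groupby` instead of an explicit loop. The following is only a sketch on a toy table: `events`, `N_VOTES` and `vote_floor` are hypothetical names, the values are made up, and the `-999` filter mirrors the one used in the full loop below.

```python
import pandas as pd

# Toy long-format table with the same columns as df_ti.
events = pd.DataFrame({
    "BSSID": [8, 8, 8, 129, 129, 129, 129],
    "RSSI":  [-40, -45, -90, -50, -52, -55, -95],
    "floor": [0, 0, 1, 1, 2, 1, 2],
})

N_VOTES = 3  # number of strongest receptions allowed to vote (arbitrary)

def vote_floor(floors):
    """Most frequent floor; mode() sorts its result, so the smallest floor wins a tie."""
    return floors.mode().iloc[0]

top_events = (
    events[events["RSSI"] != -999]           # drop rows without a valid RSSI
    .sort_values("RSSI", ascending=False)
    .groupby("BSSID")
    .head(N_VOTES)                           # N_VOTES strongest rows per BSSID
)
bssid_floor = top_events.groupby("BSSID")["floor"].agg(vote_floor)
print(bssid_floor)
```

On a large table this avoids repeated filtering of the data frame for every BSSID, which may make it faster than the loop, although this was not measured here.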

We will run the code for the whole `u_bssid`. The text BSSID names ('BSSID_S') are added.

In [None]:
x_df = pd.DataFrame(columns=['BSSID_S', 'BSSID', 'floor'])
tp1 = datetime.now()
tpx0 = datetime.now()

for i_b, i in enumerate(u_bssid):

    xb = le.inverse_transform([i])[0]
    x0 = df_ti.loc[(df_ti['BSSID'] == i) & (df_ti['RSSI'] != -999)]

    if len(x0) == 0:
        # no valid reception for this BSSID: 999 marks "floor unknown"
        # (np.nan would be an alternative)
        x_floor = 999

    else: 
        p = np.min([8, len(x0)])

        x  = x0.sort_values(['RSSI'], ascending=False).iloc[0:p]
        x_floor = x['floor'].mode().values[0]

    x_a = pd.DataFrame({'BSSID_S':[xb], 'BSSID':[i], 'floor':[x_floor]})

    if i_b == 0: 
        x_df = x_a
    else: 
        x_df = pd.concat([x_df, x_a])

    if i_b % 1000 == 0:
        tpx = datetime.now()
        print(f'{i:-8} \033[32m{tpx}\033[0m {len(x_df): 8} \033[91m{tpx-tpx0}\033[0m')
        tpx0 = tpx

tp2 = datetime.now()
print(f'took \033[1;32m{tp2-tp1}\033[0m')




The contents of `x_df`.

In [None]:
x_df

In [None]:
len(x_df)

This is the table that locates each BSSID on a certain floor. 

Now we will start working with the validation set. First, collect all unique paths in the validation dataset. 

In [None]:
u_path = df_vi['path'].unique()
u_path, len(u_path)

To refresh our memory of what is in `df_vi`:

In [None]:
df_vi

Let us take a look at the first 3 paths, 10678, 10659 and 352.

In [None]:
f_p = np.zeros(0)
f_t = np.zeros(0)

for i in u_path[:3]: 
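    x = df_vi.loc[df_vi['path']==i]

    x_mer = x.merge(x_df.drop(['BSSID_S'], axis=1), left_on='BSSID', right_on='BSSID', suffixes=('','_pred'))
    # use the 8 strongest receptions explicitly; `p` left over from the earlier loop may be smaller than 8
    x_mer_top = x_mer.sort_values(['RSSI'], ascending=False)[0:8]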

    p1 = x_mer_top['floor_pred'].mode().values[0]
    f_p = np.hstack((f_p, p1))
    f_t = np.hstack((f_t, x_mer['floor'].mode().values[0]))

    print(f'\033[31mpath \033[0m{i:6}\033[0m')
    print(f'{x_mer_top}')
    print(f'\033[32mfloor voted: \033[0m{p1}')
    print()


The BSSIDs observed along path 10678 all belong to WiFi access points located on floor -1, if we pick the 8 strongest signal receptions. One can safely attribute the path to floor B1.
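
The per-path vote can be wrapped in a small helper, sketched here under some assumptions. The function name `predict_path_floor`, its arguments and the toy values are hypothetical; unlike the cells in this notebook, the helper returns `None` when none of the path's BSSIDs appears in the lookup table, and it does not treat the sentinel floor 999 specially.

```python
import pandas as pd

def predict_path_floor(path_events, bssid_floor, n_votes=8):
    """Vote on a path's floor using its n_votes strongest known BSSIDs.

    path_events: DataFrame with columns 'BSSID' and 'RSSI' for one path.
    bssid_floor: DataFrame with columns 'BSSID' and 'floor' (lookup table).
    Returns None when no BSSID on the path is in the lookup table.
    """
    merged = path_events.merge(bssid_floor, on="BSSID", how="inner")
    if merged.empty:
        return None
    strongest = merged.nlargest(n_votes, "RSSI")
    return strongest["floor"].mode().iloc[0]

# Toy data (made-up values): BSSID 555 is not in the lookup table.
bssid_floor = pd.DataFrame({"BSSID": [8, 107, 129], "floor": [-1, -1, 0]})
path_events = pd.DataFrame({"BSSID": [8, 107, 129, 555],
                            "RSSI": [-41, -47, -88, -30]})

print(predict_path_floor(path_events, bssid_floor, n_votes=2))
```

Keeping `n_votes` as an explicit argument makes it easy to scan over different values, as is done next.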

We now look at how many signal-reception events give the most accurate assignment of a floor to a path. `p` is varied from 1 to 31 below. With `p=26`, the code makes no mistake in the assignment of the floor.

In [None]:
for p in np.arange(1,32,1):

    f_p = np.zeros(0)
    f_t = np.zeros(0)

    for i in u_path:
        x = df_vi.loc[df_vi['path']==i]

        x_mer = x.merge(x_df.drop(['BSSID_S'], axis=1), left_on='BSSID', right_on='BSSID', suffixes=('','_pred'))
        x_mer_top = x_mer.sort_values(['RSSI'], ascending=False)[0:p]

        f_p = np.hstack((f_p, x_mer_top['floor_pred'].mode().values[0]))
        f_t = np.hstack((f_t, x_mer['floor'].mode().values[0]))

    print(f'{p:-2} {(f_p==f_t).mean():.4f}')



Note that this number `p` varies according to which site we are looking at. Sometimes we need `p > 100` to get the right answers.
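
Since `p` has to be tuned per site, one possible way to pick it is to keep the smallest `p` that reaches the best validation accuracy. This is a sketch only: `best_n_votes`, `site_scans` and the accuracy values are hypothetical placeholders, not results from this notebook.

```python
def best_n_votes(accuracy_by_p):
    """Return the smallest p that reaches the highest validation accuracy."""
    best_acc = max(accuracy_by_p.values())
    return min(p for p, acc in accuracy_by_p.items() if acc == best_acc)

# Hypothetical scan results per site: {p: validation accuracy}.
site_scans = {
    "site_a": {1: 0.91, 8: 0.97, 26: 1.00, 31: 1.00},
    "site_b": {1: 0.88, 8: 0.96, 26: 0.95},
}

best_p = {site: best_n_votes(scan) for site, scan in site_scans.items()}
print(best_p)
```

Preferring the smallest `p` among ties is a design choice: it keeps the vote restricted to the strongest receptions, which are the ones most likely to come from the same floor.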
