# Simple 99% accurate Bayesian floor inference

This is my first public notebook on Kaggle ;). Please vote if you find it interesting!

*Note: Code from this notebook produced 100% accurate floor predictions with the more advanced private dataset I used in the actual competition (wifi + beacons + magnetic data). Here I'm using a public dataset for the sake of simplicity.*


For each waypoint the floor could be predicted using any multiclass classification model, where features are wifi RSSI values and the class is a floor number. The model presented in this notebook improves on top of it by utilizing the fact that all waypoints of the same path are always located on the same floor.
The most obvious postprocessing to enforce this is some kind of majority voting, e.g. choosing the most frequently predicted floor for each path. This might work, but there is a better solution rooted in

<img src="https://i.imgflip.com/59ykp9.jpg" alt="drawing" width="400"/>


First recall that multiclass classification models with softmax output layers and cross entropy loss functions produce probabilities of each class along with the most probable class. Let's look at an example of such predictions for an imaginary path:


| Path      | Timestamp     | B1 | F1 | F2  |  Max probability floor |
| ----------- | ----------- | -- | -- | --  |  -------- |
| path1       | 100         |0.1 | 0.3 | 0.6 |  **F2** |
| path1       | 5000        |0.1 | 0.4 | 0.1 |  F1 |
| path1      | 10000        |0.3 | 0.3 | 0.4 | **F2** |

Majority voting produces floor F2, which we are going to challenge with the following observations.

Let's assume each waypoint prediction is an independent random event with probability $p_{ij}$, where $i$ is the waypoint number within a single path ($1$ to $n$) and $j$ is the floor. Then, according to Bayes' theorem, the probability of floor $f_j$ given the set of predictions $p_{ij}$ is:

$$Pr(f_j|p_{ij}) = Pr(p_{ij}|f_j) * Pr(f_j) / Pr(p_{ij})$$

Given independence of waypoint predictions and the fact that $Pr(p_{ij})$ doesn't actually depend on the floor number, the formula can be transformed to:

$$Pr(f_j|p_{ij}) \propto \prod\limits_{i=1}^{n} p_{ij} * Pr(f_j)$$

But what is $Pr(f_j)$? In Bayesian terminology it is called the "prior", and in our case it can be calculated simply as the normalized frequency of floor occurrences in the training data. Let $Pr(f_j)$ be $0.33$ for each of the floors in the example above.
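
As a rough illustration of how such a prior could be computed, here is a minimal sketch on a made-up training table. The column names `path` and `f` follow the feature files used later in this notebook; the toy rows themselves are hypothetical. Each path is counted once, which mirrors what `get_priors` does further down.

```python
import pandas as pd

# Hypothetical training rows: several waypoints per path, one floor per path.
train = pd.DataFrame({
    "path": ["p1", "p1", "p2", "p3", "p3", "p4"],
    "f":    [0,    0,    1,    1,    1,    2],
})

# Count every path once, so long paths do not dominate the prior.
per_path = train[["path", "f"]].drop_duplicates()

# Normalized frequency of each floor among the paths.
priors = per_path["f"].value_counts(normalize=True).to_dict()
print(priors)
```

Here floor `1` appears in two of the four paths, so its prior is twice that of the other floors.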


For the example above, the final floor probability (up to normalization) is:

| Floor | Probability |
| ----- | -------------------------- |
| B1    | 0.1 * 0.1 * 0.3 * 0.33 = 0.00099  |
| **F1**    | **0.3 * 0.4 * 0.3 * 0.33 = 0.01188**  |
| F2    | 0.6 * 0.1 * 0.4 * 0.33 = 0.00792  |

As you can see, the most probable floor, F1, differs from the result of majority voting, F2, and is likely more accurate.
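
The arithmetic in the two tables above can be checked with a few lines of NumPy. This is only a sketch on the toy numbers: the array layout (rows are waypoints, columns are floors) and the variable names are my own choices for illustration.

```python
import numpy as np

floors = np.array(["B1", "F1", "F2"])

# Rows are the three waypoints of path1, columns follow `floors`
# (values copied from the prediction table above).
proba = np.array([
    [0.1, 0.3, 0.6],
    [0.1, 0.4, 0.1],
    [0.3, 0.3, 0.4],
])
priors = np.array([0.33, 0.33, 0.33])

# Majority voting: take the most probable floor per waypoint, then the most frequent one.
votes = floors[proba.argmax(axis=1)]
labels, counts = np.unique(votes, return_counts=True)
majority_floor = labels[counts.argmax()]

# Bayesian combination: per-floor product over waypoints, times the prior.
posterior = proba.prod(axis=0) * priors
bayes_floor = floors[posterior.argmax()]

print("majority voting:", majority_floor)
print("bayesian:", bayes_floor, posterior)
```

The single confident vote for F1 combined with its decent scores at the other two waypoints outweighs the two narrow wins of F2.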

For more details search for Naive Bayes classifiers.
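
One practical caveat: a product of many small probabilities can underflow to zero in floating point if a path has a lot of waypoints. A common workaround is to sum logarithms instead, which leaves the argmax unchanged. The sketch below is not part of the code used in this notebook; `combine_log_space` and its `eps` clipping value are hypothetical choices made for illustration.

```python
import numpy as np

def combine_log_space(proba, priors, eps=1e-12):
    """Return the column index maximizing sum_i log(p_ij) + log(prior_j)."""
    # Clip to avoid log(0) when the classifier outputs an exact zero.
    log_post = np.log(np.clip(proba, eps, 1.0)).sum(axis=0) + np.log(priors)
    return int(np.argmax(log_post))

# An artificially long path: 2000 waypoints, each slightly favoring the middle floor.
proba = np.tile([0.3, 0.4, 0.3], (2000, 1))
priors = np.array([1 / 3, 1 / 3, 1 / 3])

# The plain product underflows to 0.0 for every floor, so argmax is meaningless.
naive = np.prod(proba, axis=0) * priors

# The log-space version still identifies the favored floor.
best = combine_log_space(proba, priors)
print(naive, best)
```

For paths of moderate length the plain product is usually fine; the log form just removes the dependence on path length.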

## Cross validation

In [None]:
import pandas as pd
import re 
import glob
import os
import numpy as np

def buildings():    
    return np.unique([re.match('([^_]*)_.*', os.path.basename(f)).group(1) 
                      for f in glob.glob('/kaggle/input/indoor-navigation-and-location-wifi-features/*')])


def train_building_features(building):
    return pd.read_csv(f'/kaggle/input/indoor-navigation-and-location-wifi-features/{building}_train.csv')

def parse_site_path_timestamp(spt):
    return np.array(spt.str.split('_').tolist()).transpose()
    
def test_building_features(building):
    df = pd.read_csv(f'/kaggle/input/indoor-navigation-and-location-wifi-features/{building}_test.csv')
    df['site'], df['path'], df['timestamp'] = parse_site_path_timestamp(df['site_path_timestamp'])
    return df


print('List of used buildings', buildings())
    

In [None]:
import xgboost 
import numpy as np

import zlib

def get_priors(df):
    df = df[['f', 'path']].drop_duplicates().groupby('f').count()
    return (df / df['path'].sum()).to_dict()['path']

def train_floor_model(df, eta=0.06, colsample_bytree=0.25, max_depth=25, n_estimators=100):
    model = xgboost.XGBClassifier(eval_metric='mlogloss', 
                                  eta=eta, 
                                  colsample_bytree=colsample_bytree, 
                                  max_depth=max_depth,
                                  n_estimators=n_estimators)    
    model.fit(df.select_dtypes(include=int).drop(columns=['f']), df['f'])
    return model
        


def predict_floor(df, model, priors):
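    # classes_ lists the floor labels in predict_proba column order
    # (avoids relying on the private _le attribute, which newer xgboost versions lack)
    floors = model.classes_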
    def predict_path_floor(df_path):
        """
        This method implements the logic outlined above. 
        1. The probability matrix is generated by predict_proba
        2. Products for each floor are generated with np.prod
        3. We choose the best floor using np.argmax
        """
        proba = model.predict_proba(df_path.select_dtypes(include=int).drop(columns='f', errors='ignore'))
        prod1 = np.prod(proba, axis=0)
        prod2 = [p * priors[int(floors[idx])] for idx, p in enumerate(prod1)]
        return floors[np.argmax(prod2)]
        
    df = df.set_index('path').copy()
    df['predicted_floor'] = df.groupby('path').apply(lambda df_path: predict_path_floor(df_path))
    return df

 
def train_test_split(df, fold=0, nfolds=10):
    is_holdout = df['path'].apply(lambda path: zlib.crc32(path.encode()) % nfolds == fold)
    return df[~is_holdout], df[is_holdout]


In [None]:
%%time
from tqdm import tqdm
import warnings
warnings.filterwarnings("ignore")

NFOLDS = 10

# add more folds to achieve higher confidence
for fold in [7]:   
    df_result = []
    for building in tqdm(buildings(), total=len(buildings())):
        df_train, df_test = train_test_split(train_building_features(building).drop(columns=['x', 'y']), fold, NFOLDS)        
        priors = get_priors(df_train)    
        model = train_floor_model(df_train)    
        pred = predict_floor(df_test, model, priors)[['f', 'predicted_floor']]
        df_result.append(pred)

    df_result = pd.concat(df_result)
    print(f'Floor prediction cross validation fold {fold} accuracy: ', (df_result['f'] == df_result['predicted_floor']).mean())
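
A note on the split used above: folds are assigned by hashing the path name, so all waypoints of a path land in the same fold and the holdout never shares a path with the training set. The sketch below illustrates this on made-up path names; `path_fold` is a hypothetical helper that repeats the expression inside `train_test_split`.

```python
import zlib
import pandas as pd

def path_fold(path, nfolds=10):
    """Deterministic fold id for a path name (same input gives the same fold on every run)."""
    return zlib.crc32(path.encode()) % nfolds

# Hypothetical waypoints: three paths with several rows each.
toy = pd.DataFrame({
    "path": ["path_a"] * 3 + ["path_b"] * 2 + ["path_c"] * 4,
    "f":    [0] * 3 + [1] * 2 + [0] * 4,
})
toy["fold"] = toy["path"].map(path_fold)

# Every path maps to exactly one fold, so no path straddles train and holdout.
folds_per_path = toy.groupby("path")["fold"].nunique()
print(folds_per_path)
```

Using `zlib.crc32` rather than Python's built-in `hash()` keeps the assignment stable across sessions, since string hashes are salted per process by default.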

## Produce test predictions

In [None]:
df_test = pd.read_csv('/kaggle/input/indoor-location-navigation/sample_submission.csv')
df_test['site'], df_test['path'], df_test['timestamp'] = parse_site_path_timestamp(df_test['site_path_timestamp'])
df_test = df_test.set_index('path')
df_test

In [None]:
%%time

for site in tqdm(buildings(), total=len(buildings())):
    df_features = train_building_features(site)
    df_features_test = test_building_features(site)
    model = train_floor_model(df_features)
    prior = get_priors(df_features)
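    # Keep one row per path. A value-based drop_duplicates() ignores the index
    # and would keep only one path per distinct floor.
    df_pred = predict_floor(df_features_test, model, prior)[['predicted_floor']]
    df_pred = df_pred[~df_pred.index.duplicated()]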
    df_test.loc[df_pred.index, 'floor'] = df_pred['predicted_floor']

In [None]:
df_sub = df_test[['site_path_timestamp', 'floor', 'x', 'y']]
df_sub.to_csv('submission.csv', index=False)
df_sub