# Meeting around

For any given pair of uids, determine when and where they could have met each other as
they moved through the building. Please state your assumptions about what would constitute
a “meeting.” Note that the coordinates can be assumed to be 1 unit = 1 meter.

## Design Rationale

When I am about to work with data, I usually split the work into two parts: Exploration and Optimisation.

### Exploration

I try to get some intuition about the dataset. What is it about? What are the features? Is it clean, or do I need to pre-process it somehow?

In terms of tooling, I know two ecosystems: R and Python. I like and know Python better, so I will go with it, which is why you are reading this notebook. I could probably do the same job with R Markdown, but I would be less efficient.

### Optimisation

Once I have a proper sense of the solution, I will implement it in a more optimised way and in a faster language. The choice of technology is often driven by company policy: if everybody codes in Java, it will be Java. Here the choice is mine, so I will pick Scala because I really like the functional programming paradigm.

> **What follows is the Exploration part**

## Prerequisites

* Python 3 (https://www.python.org/download/releases/3.0/)
* Pandas (http://pandas.pydata.org/)

## Load the libraries

In [1]:
import pandas as pd

## Load the dataset

In [2]:
df = pd.read_csv('reduced.csv')

In [3]:
# Gives an overview of the dataset
df.describe()

|       | x | y | floor |
|-------|---|---|-------|
| count | 2228820.0 | 2228820.0 | 2228820.0 |
| mean  | 94.075081 | 75.104579 | 2.146474 |
| std   | 19.266067 | 11.735871 | 0.80917 |
| min   | 43.702972 | 42.494525 | 1.0 |
| 25%   | 81.227639 | 66.223634 | 1.0 |
| 50%   | 104.203185 | 71.151381 | 2.0 |
| 75%   | 108.025986 | 86.434109 | 3.0 |
| max   | 115.081924 | 102.755955 | 3.0 |


In [9]:
df.head()

|   | timestamp | x | y | floor | uid |
|---|-----------|---|---|-------|-----|
| 0 | 2014-07-19T16:00:06.071Z | 103.79211 | 71.504194 | 1 | 600dfbe2 |
| 1 | 2014-07-19T16:00:06.074Z | 110.33613 | 100.682839 | 1 | 5e7b40e1 |
| 2 | 2014-07-19T16:00:06.076Z | 110.066315 | 86.488736 | 1 | 285d22e4 |
| 3 | 2014-07-19T16:00:06.076Z | 103.78499 | 71.456331 | 1 | 74d917a1 |
| 4 | 2014-07-19T16:00:06.076Z | 109.09495 | 92.824487 | 1 | 3c3649fb |


In [10]:
# number of people in the building
len(df.uid.unique())

12991

In [14]:
print('the dataset starts on {0} and ends on {1}'.format(df.timestamp.iloc[0], df.timestamp.iloc[-1]))

the dataset starts on 2014-07-19T16:00:06.071Z and ends on 2014-07-20T15:59:58.853Z
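Reading the first and last rows only gives the true time span if the file is sorted by timestamp. A minimal sketch of how that could be checked is shown below; the helper name `time_span` is hypothetical and not part of the original notebook, and it assumes the `timestamp` column holds ISO 8601 strings in one consistent format, as in the rows shown above.

```python
import pandas as pd


def time_span(frame):
    """Hypothetical helper: earliest and latest timestamp, plus a sortedness flag."""
    # Parse the ISO 8601 strings into timezone-aware datetimes.
    ts = pd.to_datetime(frame["timestamp"])
    # is_monotonic_increasing is True when every value is >= the previous one.
    return ts.min(), ts.max(), bool(ts.is_monotonic_increasing)
```

Calling `time_span(df)` would return the start, the end, and whether the rows are already in chronological order, so the result does not depend on the row order of the CSV.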


In [16]:
unique_floors = df.floor.unique()
print('there are {0} floors {1}'.format(len(unique_floors), unique_floors))

there are 3 floors [1 2 3]


## My assumptions
TBD

## Helpers

In [18]:
def get_subset(df, a, b):
    """
    Returns the rows of the DataFrame that belong to user a or user b
    (the data are copied so that the source DataFrame stays untouched)
    :param df: dataset
    :type df: pandas.core.frame.DataFrame
    :param a: a first user id
    :type a: str
    :param b: a second user id
    :type b: str
    """
    assert a != b, "uids should be different"
    return df[(df.uid == a) | (df.uid == b)].copy()

from math import pow, sqrt

def calc_coord_dist(x1, y1, x2, y2):
    """
    Returns the Euclidean distance between two pairs of coordinates
    (we could also have taken the Manhattan distance)
    :param x1: x coordinate of first pair
    :type x1: float
    :param y1: y coordinate of first pair
    :type y1: float
    :param x2: x coordinate of second pair
    :type x2: float
    :param y2: y coordinate of second pair
    :type y2: float

    """
    return sqrt(pow(x2 - x1, 2) + pow(y2 - y1, 2))
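
The helper above works on one pair of points at a time. As a sketch only, a vectorised counterpart could compute the same Euclidean distance for whole columns at once; `calc_coord_dist_vec` is a hypothetical name, and NumPy is an extra dependency that the prerequisites do not list explicitly (it is installed alongside Pandas).

```python
import numpy as np


def calc_coord_dist_vec(x1, y1, x2, y2):
    """Hypothetical vectorised variant: accepts scalars, lists, arrays or Series."""
    dx = np.asarray(x2) - np.asarray(x1)
    dy = np.asarray(y2) - np.asarray(y1)
    # np.hypot computes sqrt(dx**2 + dy**2) element-wise.
    return np.hypot(dx, dy)


# Sanity check with a 3-4-5 right triangle.
print(calc_coord_dist_vec(0.0, 0.0, 3.0, 4.0))  # 5.0
print(calc_coord_dist_vec([0.0, 1.0], [0.0, 1.0], [3.0, 1.0], [4.0, 2.0]))  # [5. 1.]
```

Since 1 unit is 1 meter, the returned values can be read directly as meters.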

## Extract a subset of the dataset for 2 users
On the one hand, it helps me get a better intuition of the dataset.
On the other hand, it makes the next operations cheaper in both memory and processing time, because the number of rows N is much smaller.
(I know I still have "df" in memory, but let's assume I do not ;) )

In [6]:
a = '600dfbe2'
b = '5e7b40e1'

In [7]:
subset = get_subset(df, a, b)

In [8]:
# sanity check: every column must be True (the subset holds exactly the rows of a plus the rows of b)
subset.count() == df[df.uid == a].count() + df[df.uid == b].count()

timestamp    True
x            True
y            True
floor        True
uid          True
dtype: bool
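
With the two-user subset in hand, one possible next step is to pair each sample of user a with the sample of user b that is nearest in time and then keep the pairs that are close in space. The sketch below is only an illustration under assumptions that are mine, not the notebook's: a "meeting" means both users are on the same floor, within `max_dist` meters of each other, with timestamps at most `max_gap` apart. The function name `find_candidate_meetings` and both default thresholds are hypothetical.

```python
import numpy as np
import pandas as pd


def find_candidate_meetings(subset, a, b, max_dist=2.0, max_gap="5s"):
    """Hypothetical helper: samples where a and b were close in space and time."""
    # Parse the ISO 8601 strings so that time differences can be computed.
    data = subset.assign(timestamp=pd.to_datetime(subset["timestamp"]))
    left = data[data["uid"] == a].sort_values("timestamp")
    right = data[data["uid"] == b].sort_values("timestamp")

    # For each sample of a, take the nearest-in-time sample of b on the same
    # floor, but only if it lies within max_gap (assumed threshold).
    pairs = pd.merge_asof(
        left, right,
        on="timestamp", by="floor",
        tolerance=pd.Timedelta(max_gap), direction="nearest",
        suffixes=("_a", "_b"),
    ).dropna(subset=["x_b", "y_b"])  # drop samples of a with no match for b

    # Euclidean distance in meters between the paired positions.
    dist = np.hypot(pairs["x_b"] - pairs["x_a"], pairs["y_b"] - pairs["y_a"])
    return pairs.assign(dist=dist)[dist <= max_dist]
```

Calling `find_candidate_meetings(subset, a, b)` would return one row per sample of a that satisfies these assumed criteria, with the timestamp and position of a, the matched position of b, the floor, and the distance. The thresholds would need to be revisited once the assumptions section is filled in.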