#Facebook Challenge 5 - Predicting Check-in
##Notebook #1: Data Exploration
###Introduction:
This Facebook Challenge is concerned with predicting the business associated with each check-in event. 
As described by the competition host, the businesses are distributed within a square area of 10km by 10km. 
The participants are given a set of training data, which includes: x, y (the coordinates of the check-in), 
time (the time at which the check-in event occurs), accuracy (definition unknown), and place_id (the business, which is the target to predict). 
In this first notebook, the nature of the training data will be investigated using statistical and visualisation tools.


###Overview:
To investigate the training data, the first step is to load it. Pandas will be used as the 
primary tool for this exercise.

In [None]:
import numpy as np 
import pandas as pd 


training_data = pd.read_csv("../input/train.csv")
test_data = pd.read_csv("../input/test.csv")

training_data.head()


It can be seen from the first few rows of data that, as expected, the x and y values are 
floating point numbers between 0 and 10 (this will be checked later), corresponding to the check-in coordinates
within the 10km by 10km area in which the businesses are located. The accuracy also looks like a 
number, and at first glance it might be quoted as a percentage (i.e. values
of magnitude between 0 and 100). Both time and place_id seem to be very large integers. In the next sections, each of these 
features will be investigated in detail. 

###Time:

Time is the first feature to be investigated, as it is likely the most important feature after the x,y coordinates. Each
business has different opening hours and experiences different busy and quiet periods, depending on 
the nature of the business. For example, a night club would probably see a significantly higher customer influx (hence more check-in
events) on Friday and Saturday nights, whereas a bakery or cafe would probably be busier on weekday mornings. 
This can potentially be extended to months of the year: a ski shop is likely to have more customers just before winter, 
while a surf shop would have more customers just before or during summer. Therefore, the time of check-in is likely to be key to distinguishing 
between businesses that are located close together. 

However, unlike the x,y coordinates, the unit of the time values provided in the data is ambiguous. Compared to the x,y coordinates, much more 
investigation is therefore needed before the time feature can be used. That is the goal of this section. 

The first investigation is a comparison between the time records of the training data and the test data:


In [None]:
training_data.time.describe()

In [None]:
test_data.time.describe()

The first thing to note from these two summaries is that the test data times follow the training data times: all events in
the test data set happen after those in the training data set. This is not surprising, as the competition details mention that the test data and 
the training data are divided in time. It means that the raw time (or a scaled version of it) cannot be used directly for fitting a predictive model, because 
the time ranges of the test and training data sets are different. The time needs to be converted into a form that is common to both data sets: a cyclical expression 
of the time, i.e. time of the day, day of the week, and/or time of the year. This is consistent with the need for a cyclical measure of time to represent 
the time dependence of business activity described above.
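
The comparison of the two `describe()` outputs can also be made explicit with a small check. The following is only a sketch: the helper name `time_ranges_overlap` and the toy frames are made up for illustration and stand in for `training_data` and `test_data`.

```python
import pandas as pd


def time_ranges_overlap(train_time: pd.Series, test_time: pd.Series) -> bool:
    """Return True if the [min, max] ranges of the two time columns overlap."""
    return bool(
        train_time.max() >= test_time.min()
        and test_time.max() >= train_time.min()
    )


# Tiny made-up frames (hypothetical values, not taken from the competition data)
toy_train = pd.DataFrame({"time": [1, 50, 120, 300]})
toy_test = pd.DataFrame({"time": [301, 350, 420]})

# False here means the test period lies entirely outside the training period
print(time_ranges_overlap(toy_train.time, toy_test.time))
```

On the real data, the same call with `training_data.time` and `test_data.time` would indicate whether the two periods are cleanly separated.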

From the extremely large values of the time record, it can be concluded that the time cannot be recorded in months, days or hours. The only plausible units
are minutes, seconds and milliseconds (nanoseconds are too short for human activities). To find out which, a histogram of a selected range of the data 
can be plotted, first assuming that time is recorded in minutes:

In [None]:
# Choose an arbitrary time range and split it into bins of roughly one day each (assuming minutes)
data_range = [400000, 500000]
no_of_days_train = (data_range[1] - data_range[0]) / (60 * 24)
training_data.time.hist(bins = int(no_of_days_train), range = data_range)

There is no clear pattern when the training data is divided this way. But how about the test data?

In [None]:
# Choose an arbitrary time range and split it into bins of roughly one day each (assuming minutes)
data_range = [800000, 900000]
no_of_days_test = (data_range[1] - data_range[0]) / (60 * 24)
test_data.time.hist(bins = int(no_of_days_test), range = data_range)

There is a very clear pattern here: a periodicity with a spike occurring every 7 bins.
Since each bin represents roughly a day, this corresponds well with the occurrence of weekends. Hence, it can be concluded that 
the time data is represented in minutes. Note that this is consistent with what can be found on your Facebook 
page: if you go to security settings -> where you logged in, you will find that your log-ins are recorded
with date and time to the nearest minute. So time is recorded in minutes.
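
Assuming time is indeed in minutes, the cyclical features discussed earlier could be derived along the following lines. This is a minimal sketch rather than a final feature set: the function name `add_cyclical_time_features` and the column names are made up, and because the origin of the time axis is unknown, `hour_of_day == 0` is not necessarily midnight and `day_of_week == 0` is not necessarily Monday.

```python
import numpy as np
import pandas as pd

MINUTES_PER_DAY = 60 * 24


def add_cyclical_time_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with cyclical features derived from a 'time' column in minutes."""
    out = df.copy()
    out["minute_of_day"] = out["time"] % MINUTES_PER_DAY
    out["hour_of_day"] = out["minute_of_day"] // 60
    out["day_of_week"] = (out["time"] // MINUTES_PER_DAY) % 7
    # sin/cos encoding keeps 23:59 and 00:00 close together in feature space
    angle = 2 * np.pi * out["minute_of_day"] / MINUTES_PER_DAY
    out["tod_sin"] = np.sin(angle)
    out["tod_cos"] = np.cos(angle)
    return out


# Made-up time stamps for illustration only
toy = pd.DataFrame({"time": [0, 90, 1440, 1440 * 8 + 61]})
features = add_cyclical_time_features(toy)
print(features[["time", "hour_of_day", "day_of_week"]])
```

Because these features wrap around, they take the same range of values in the training and test periods, unlike the raw time.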

###place_id
place_id identifies the business at which each check-in took place. Let's first look at the basic statistics.

In [None]:
training_data.place_id.describe()

It can be seen that the place_id values are big numbers. Furthermore, since there are about 30 million entries, but the 
ids span a range of several billion, it can be concluded that the place_id values are discrete labels. In fact, the number of unique ids is:

In [None]:
training_data.place_id.nunique()

So there are just over 100k unique ids. This immediately points to the need to partition the data, as predicting
100k classes with one single model would be difficult. Finally, the code below gives the most popular businesses in the 
area:

In [None]:
training_data.place_id.value_counts()[:10]

It seems that there is no dominant business that overwhelms the others. Again, this is what is expected.
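
One simple way to partition the data, as suggested above, is to split the 10km by 10km area into a regular grid and treat each cell separately. The sketch below is only one possible scheme; the function name `assign_grid_cell`, the 10 x 10 grid size and the toy data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def assign_grid_cell(df: pd.DataFrame, n_cells_x: int = 10, n_cells_y: int = 10,
                     size: float = 10.0) -> pd.DataFrame:
    """Return a copy of df with a 'cell_id' column for a regular grid over [0, size]^2."""
    out = df.copy()
    # Clip so that points lying exactly on the upper edge fall into the last cell
    ix = np.clip((out["x"] * n_cells_x / size).astype(int), 0, n_cells_x - 1)
    iy = np.clip((out["y"] * n_cells_y / size).astype(int), 0, n_cells_y - 1)
    out["cell_id"] = iy * n_cells_x + ix
    return out


# Made-up check-ins for illustration only
toy = pd.DataFrame({
    "x": [0.5, 9.99, 10.0, 3.2],
    "y": [0.5, 0.1, 10.0, 8.3],
    "place_id": [1, 2, 3, 4],
})
gridded = assign_grid_cell(toy)
print(gridded[["x", "y", "cell_id"]])
```

Counting `place_id.nunique()` per `cell_id` on the real data would then show how many classes a per-cell model has to distinguish.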

### x, y
Again, starting with basic statistics, we have:

In [None]:
training_data.x.describe()

In [None]:
training_data.y.describe()

This confirms that x and y are floats between 0 and 10. To investigate the relationship between 
x, y and place_id in more detail, let's select a particular place_id and analyse its data:

In [None]:
X1 = training_data.query('place_id == 8772469670')
X1.x.describe()

In [None]:
X1.y.describe()

It can be seen from above that the spread in y is significantly smaller than the spread in x. This suggests that, at least for this place, y will be a 
better predictor of the business than x. This information can be visualised as follows:
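
Whether this holds beyond a single place can be checked by computing the spread per place_id. The sketch below uses synthetic data with made-up place ids and spreads, purely to show the calculation; on the real data the same `groupby` would be applied to `training_data`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic check-ins: two made-up places with a wide x spread and a narrow y spread
toy = pd.DataFrame({
    "place_id": np.repeat([111, 222], 500),
    "x": np.concatenate([rng.normal(3.0, 0.6, 500), rng.normal(7.0, 0.5, 500)]),
    "y": np.concatenate([rng.normal(8.3, 0.02, 500), rng.normal(2.1, 0.01, 500)]),
})

# Standard deviation of x and y for every place, and their ratio
spread = toy.groupby("place_id")[["x", "y"]].std()
spread["x_to_y_ratio"] = spread["x"] / spread["y"]
print(spread)
```

A ratio well above 1 for most places would support treating y as the more informative coordinate.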

In [None]:
from matplotlib import pyplot as plt
from matplotlib import patches

# Mean and standard deviation of the check-in coordinates for this place
mean_x = np.mean(X1.x)
mean_y = np.mean(X1.y)
std_x = np.std(X1.x)
std_y = np.std(X1.y)

fig = plt.figure()
ax1 = fig.add_subplot(111)
ax1.axis([0,10,8.15,8.45])
ax1.set_xlabel('x')
ax1.set_ylabel('y')
ax1.set_title('distribution of instances corresponding to place_id = 8772469670')
# Rectangle covering +/- 2 standard deviations around the mean in both x and y
ax1.add_patch(patches.Rectangle((mean_x - 2*std_x, mean_y - 2*std_y), 4*std_x, 4*std_y, fill = False))
ax1.scatter(X1.x, X1.y, s = 40, marker = 'x')

The graph above shows the distribution of instances corresponding to place_id = 8772469670. It can be seen clearly 
that the y-values are much tighter (plotted within a 0.3 range) compared to the x-values, which are spread across
the whole 0 - 10 axis. The small rectangle in the plot shows the area within +/- 2 standard deviations of
the mean. Now let's look at the data points within the plotted area that do not belong to this place_id:

In [None]:
X2 = training_data.query('x < 4 and x > 2 and y > 8.15 and y < 8.45 and place_id != 8772469670')
X2.place_id.nunique()

In [None]:
len(X2)

It can be seen that there are 70k data points of 3000+ different place_ids within this small area. To make it 
more dramatic, here is the plot of all these data points, with the rectangle representing the area within 2 
standard deviations of the mean of place_id = 8772469670 still drawn:

In [None]:
fig = plt.figure()
ax1 = fig.add_subplot(111)
ax1.axis([2,4,8.25,8.35])
ax1.set_xlabel('x')
ax1.set_ylabel('y')
ax1.set_title('distribution of instances of other place_ids')
# Rectangle covering +/- 2 standard deviations around the mean of place_id = 8772469670
ax1.add_patch(patches.Rectangle((mean_x - 2*std_x, mean_y - 2*std_y), 4*std_x, 4*std_y, fill = False))
ax1.scatter(X2.x, X2.y, s = 40, marker = 'x')

Of course, it is expected that the time data will help to separate these data points more finely. However, 
the plot shows how closely spaced these data points are in x and y, and how difficult it will be to model them. 

### Accuracy
Accuracy is the last feature that needs to be explored. Starting with basic statistics:


In [None]:
training_data.accuracy.describe()

It can be seen that the accuracy seems to be mostly under 100, but the max is not. Is it possible that it is 
indeed a percentage measure, and anything above 100 is an error in the data? If it is indeed a percentage score, then 
we expect data points with high accuracy to cluster together within the same place_id. This can be investigated
with a scatter plot, again using place_id = 8772469670 as a sample:

In [None]:
import matplotlib.colors as colors

# define the colormap
cmap = plt.cm.jet
## extract all colors from the .jet map
cmaplist = [cmap(i) for i in range(cmap.N)]
## create the new map
cmap = cmap.from_list('Custom cmap', cmaplist, cmap.N)
bounds = np.array([0, 20, 40, 60, 80, 100])
norm = colors.BoundaryNorm(boundaries=bounds, ncolors=cmap.N)
X1_sample = X1.sample(500)
fig = plt.figure()
ax1 = fig.add_subplot(111)
ax1.axis([2,4,8.25,8.35])
ax1.set_xlabel('x')
ax1.set_ylabel('y')
ax1.set_title('distribution instances of place_id = 8772469670, colored by accuracy')
# Rectangle covering +/- 2 standard deviations around the mean of this place_id
ax1.add_patch(patches.Rectangle((mean_x - 2*std_x, mean_y - 2*std_y), 4*std_x, 4*std_y, fill = False))
ax1.scatter(X1_sample.x, X1_sample.y, s = 40, marker = 'x', c = X1_sample.accuracy, cmap = cmap, norm = norm)

It can be seen that there is no clear clustering of colours. The data is everywhere. In particular, high accuracy
scores (red and orange) can be found both inside and outside the 2-standard deviation area, and similarly for low accuracy scores.
So at this point, it is unclear what the accuracy is. However, a quick search online reveals that, in the usual case of 
detecting IP locations, accuracy is quoted as the confidence that the true location is within a certain distance. So an accuracy
of 10 would mean that there is, for example, a 99% confidence that the actual location is within 10km (or m) of the quoted coordinates.
If that is true, then a lower accuracy value actually means that the x,y coordinates are more accurate. 

Unfortunately, due to time constraints, the accuracy value cannot be investigated further here. 
More information will probably reveal itself when we perform the actual modelling. 
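
For a later follow-up, one way to test the "distance radius" interpretation would be to compare each check-in's distance from the centre of its place with its accuracy value. The sketch below is an assumption-laden illustration: the helper names, the use of the per-place median as the "centre", the bin edges and the toy data are all made up, and the toy numbers say nothing about the real data.

```python
import numpy as np
import pandas as pd


def distance_to_place_centre(df: pd.DataFrame) -> pd.Series:
    """Distance of each check-in from the median x,y of its own place_id."""
    centres = df.groupby("place_id")[["x", "y"]].transform("median")
    return np.hypot(df["x"] - centres["x"], df["y"] - centres["y"])


def median_distance_by_accuracy_bin(df: pd.DataFrame, bins) -> pd.Series:
    """Median distance to the place centre within each accuracy bin."""
    dist = distance_to_place_centre(df)
    acc_bin = pd.cut(df["accuracy"], bins=bins)
    return dist.groupby(acc_bin, observed=False).median()


# Made-up check-ins for illustration only
toy = pd.DataFrame({
    "place_id": [1, 1, 1, 2, 2, 2],
    "x": [1.0, 1.0, 4.0, 5.0, 5.0, 5.0],
    "y": [1.0, 1.0, 5.0, 5.0, 5.0, 6.0],
    "accuracy": [10, 20, 90, 15, 30, 80],
})
result = median_distance_by_accuracy_bin(toy, bins=[0, 50, 100])
print(result)
```

If the median distance on the real data grows steadily with the accuracy value, that would support reading accuracy as a distance radius; if it shrinks, the opposite reading would be more plausible.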