# Summary

We're given the prompt from Facebook: 
> The goal of this competition is to predict which place a person would like to check in to. 
For the purposes of this competition, Facebook created an artificial world consisting of more than 100,000 places located in a 10 km by 10 km square. 
For a given set of coordinates, your task is to return a ranked list of the most likely places. 
Data was fabricated to resemble location signals coming from mobile devices, giving you a flavor of what it takes to work with real data complicated by inaccurate and noisy values. Inconsistent and erroneous location data can disrupt experience for services like Facebook Check In.

Rather than ask, “How do I model this artificial world?” let’s ask a different question, “How were these data generated?”
Treating this as an analysis problem, rather than a modeling problem,
could lead to deeper insights than tuning model hyperparameters.
Plus, it's a fun challenge. 

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats    
import scipy.special as sps
%matplotlib inline

In [None]:
# Read in the data using the first column as the index
df = pd.read_csv('../input/train.csv', index_col=0)

In [None]:
df.head()

In [None]:
df.describe()

# Tour of the variables
Let's pause and see what we're dealing with. 

## row_id
The primary key for our data. Nothing to see here...

## x, y
We're told this is a 10 km x 10 km square and, indeed, x and y run between 0 and 10, so it's probably fair to assume the units of x and y are kilometers. 
The precision of x and y is 0.0001 km, which is 10 cm (that's about 4 inches in Menlo Park).
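
As a quick sanity check on that conversion, here is the arithmetic spelled out. It assumes the units really are kilometers, which is only inferred from the prompt.

```python
# Unit check, assuming x and y are in kilometers (inferred, not confirmed).
CM_PER_KM = 100000       # 1 km = 100,000 cm
CM_PER_INCH = 2.54       # exact by definition

precision_km = 0.0001                      # smallest step in x and y
precision_cm = precision_km * CM_PER_KM    # 10 cm
precision_in = precision_cm / CM_PER_INCH  # a little under 4 in

print('{:.1f} cm = {:.2f} in'.format(precision_cm, precision_in))
```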

## accuracy
This is interesting. `accuracy` varies between 1 and 1033 with an average of 82.8. 
It's given as an integer which is odd. 

## time
Again an integer. Based on previous analyses, we suspect its units are minutes. 
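
If the minutes guess holds, daily and weekly cycles should appear once `time` is folded into hours and days. Below is a minimal sketch of that folding on a toy frame. The `toy` values and the derived column names (`hour`, `weekday`, `week`) are made up for illustration, and the start of the clock is unknown, so hour 0 and weekday 0 are arbitrary labels.

```python
import pandas as pd

# Toy values standing in for df.time; the real column comes from train.csv.
toy = pd.DataFrame({'time': [0, 59, 60, 1439, 1440, 10080]})

MIN_PER_HOUR = 60
MIN_PER_DAY = 24 * MIN_PER_HOUR   # 1440
MIN_PER_WEEK = 7 * MIN_PER_DAY    # 10080

# Assumption: time counts minutes from an unknown starting point,
# so hour 0 need not be midnight and weekday 0 need not be Monday.
toy['hour'] = (toy.time // MIN_PER_HOUR) % 24
toy['weekday'] = (toy.time // MIN_PER_DAY) % 7
toy['week'] = toy.time // MIN_PER_WEEK

print(toy)
```

Plotting check-in counts against `hour` for a single place would be one way to test the guess: a clear 24-hour rhythm would support minutes as the unit.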

## place_id
`place_id` ranges between $10^9$ and $10^{10}$. 
There are $9\times10^9$ possible unique values, but the $2.9\times10^7$ check-ins are spread over only a little more than 100,000 places, so about 0.001% of the available IDs are used. 
It's an odd choice because the largest IDs overflow the 32-bit SQL `INT` data type: $9.99\times10^{9}/2^{32} \approx 2.3$. 
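
The overflow can be checked directly. This sketch assumes `INT` means a 32-bit integer and `BIGINT` a 64-bit one, as in most SQL dialects, and uses the largest 10-digit number as a stand-in for the largest `place_id`.

```python
# Compare the largest 10-digit place_id with common integer limits.
max_place_id = 9999999999          # largest 10-digit value, about 9.99e9
int32_signed_max = 2**31 - 1       # 2,147,483,647
int32_unsigned_max = 2**32 - 1     # 4,294,967,295
int64_signed_max = 2**63 - 1       # typical BIGINT upper bound

ratio = max_place_id / float(2**32)
fits_int32 = max_place_id <= int32_signed_max
fits_uint32 = max_place_id <= int32_unsigned_max
fits_int64 = max_place_id <= int64_signed_max

print('ratio to 2^32: {:.1f}'.format(ratio))
print('fits in signed 32-bit:   {}'.format(fits_int32))
print('fits in unsigned 32-bit: {}'.format(fits_uint32))
print('fits in signed 64-bit:   {}'.format(fits_int64))
```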

In [None]:
nb_total = df.place_id.count()
nb_unique = df.place_id.nunique()

print('Number place_ids: {}'.format(nb_total))
print('Unique place_ids: {}'.format(nb_unique))
print('Average check-ins per place: {:.1f}'.format(float(nb_total) / nb_unique))

On average, there are 269 check-ins per place. 

In [None]:
from pandas.plotting import scatter_matrix  # pandas.tools.plotting was removed in newer pandas
df_sample = df[df.place_id == 4823777529]
scatter_matrix(df_sample.drop('place_id', axis=1), diagonal='kde', figsize=(11,11))

`x` and `y` look Gaussian, but `accuracy` and `time` are something more complicated. The variables appear uncorrelated, and the scatter in `x` is much greater than in `y`.

In [None]:
shape, scale = 1.34, 199.38  # gamma shape and scale, used only by the commented-out fit below

# Number of check-ins per place
s = df.place_id.value_counts().values

# Display the histogram of the samples, along with the probability density function:
ax = df.place_id.value_counts().plot.kde()
ax.set_xlim(0, 2000)

# `normed` was removed from matplotlib; `density=True` is the replacement
count, bins, ignored = plt.hist(s, 100, density=True)
#y = bins**(shape-1)*(np.exp(-bins/scale) /
#                      (sps.gamma(shape)*scale**shape))
#y = stats.gamma.pdf(bins, a=.6, loc=.999, scale=2.0549)
#rv = stats.maxwell(loc=-249.6547, scale=336.860199)
# stats.frechet_r was removed from SciPy; weibull_min is the same distribution
rv = stats.weibull_min(1.1, loc=0.89, scale=280)
y = rv.pdf(bins)
ax.plot(bins, y, linewidth=2, color='r')

You can swap out distributions fairly easily but I haven't found one that fits yet. 
Let me know if you figure it out!
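
One way to make the swapping less manual is to fit several candidates by maximum likelihood and rank them by the Kolmogorov-Smirnov statistic. The sketch below is only a starting point, not a verdict on the real data. The helper `rank_fits` is hypothetical, the candidate list is arbitrary, and `toy_counts` is synthetic data drawn from a gamma distribution in place of `df.place_id.value_counts().values`.

```python
import numpy as np
import scipy.stats as stats

def rank_fits(samples, candidates):
    """Fit each named scipy.stats distribution by maximum likelihood
    and rank the fits by KS statistic (smaller means closer)."""
    results = []
    for name in candidates:
        dist = getattr(stats, name)
        params = dist.fit(samples)  # shape parameters, then loc and scale
        ks_stat, _ = stats.kstest(samples, dist.cdf, args=params)
        results.append((name, ks_stat, params))
    return sorted(results, key=lambda r: r[1])

# Synthetic stand-in for the per-place check-in counts (an assumption,
# not the competition data).
rng = np.random.RandomState(0)
toy_counts = rng.gamma(shape=1.34, scale=199.38, size=2000)

candidates = ['gamma', 'lognorm', 'weibull_min', 'expon']
ranked = rank_fits(toy_counts, candidates)
for name, ks_stat, params in ranked:
    print('{:12s} KS = {:.4f}'.format(name, ks_stat))
```

Because the parameters are estimated from the same samples, the KS p-values would be too optimistic, so only the statistic is used here, and only for ranking. To try it on the real counts, pass `s` instead of `toy_counts`.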