# Summary

We're given the prompt from Facebook: 
> The goal of this competition is to predict which place a person would like to check in to. 
For the purposes of this competition, Facebook created an artificial world consisting of more than 100,000 places located in a 10 km by 10 km square. 
For a given set of coordinates, your task is to return a ranked list of the most likely places. 
Data was fabricated to resemble location signals coming from mobile devices, giving you a flavor of what it takes to work with real data complicated by inaccurate and noisy values. Inconsistent and erroneous location data can disrupt experience for services like Facebook Check In.

Rather than ask, “How do I model this artificial world?” let’s ask a different question, “How were these data generated?”
Treating this as an analysis problem, rather than a modeling problem,
could lead to deeper insights than tuning model hyperparameters.
Plus, it's a fun challenge. 

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats    
import scipy.special as sps
%matplotlib inline

In [None]:
df = pd.read_csv('../input/train.csv', index_col=0)
print(df.describe())

In [None]:
df.head()

# Tour of the variables
Let's pause and see what we're dealing with. 

## row_id
The primary key for our data. Nothing to see here...

## x, y
We're told this is a 10 km by 10 km square and, indeed, x and y both run between 0 and 10, so it’s probably fair to assume the units of x and y are kilometers. 
The precision of x and y is 0.0001 km, which is 10 cm (that’s about 4 inches in Menlo Park).
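
The precision claim can be checked rather than eyeballed. The snippet below is a minimal sketch: `has_decimal_precision` is a hypothetical helper (not part of the original notebook), and `toy` is a made-up stand-in for the real `x` and `y` columns so the cell runs without the competition data.

```python
import numpy as np
import pandas as pd

def has_decimal_precision(values, decimals=4):
    """Return True if rounding to `decimals` places leaves every value unchanged."""
    arr = np.asarray(values, dtype=float)
    return bool(np.all(np.abs(arr - np.round(arr, decimals)) < 1e-9))

# Toy stand-in for df[['x', 'y']]; these numbers are invented for illustration.
toy = pd.DataFrame({'x': [0.7941, 5.9567, 8.3078],
                    'y': [9.0809, 4.7968, 7.0407]})

print(has_decimal_precision(toy.x, decimals=4))  # values fit in 4 decimals
print(has_decimal_precision(toy.x, decimals=3))  # but not in 3

# Convert the smallest step from kilometers to centimeters (assumes x, y are in km).
resolution_cm = 10.0 ** -4 * 1000 * 100
print(round(resolution_cm, 6))
```

On the real data, calling the helper on `df.x` and `df.y` with `decimals=4` and then `decimals=3` would show whether 0.0001 km really is the finest step.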

## accuracy
This is interesting. The accuracy ranges from 1 to 1033, with a mean of 82. 
It's given as an integer, which is odd. 

## time
Again an integer. Based on previous analyses, we think the units are minutes. 
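
If the minutes hypothesis holds, daily and weekly cycles should show up once `time` is folded into calendar-like features. The sketch below assumes `time` counts minutes from an arbitrary origin; `add_time_features` and the derived column names are hypothetical, and `toy` is a made-up frame used only so the cell runs on its own.

```python
import pandas as pd

MINUTES_PER_DAY = 60 * 24

def add_time_features(frame):
    """Derive day, hour-of-day and day-of-week, ASSUMING `time` is in minutes."""
    out = frame.copy()
    out['day'] = out['time'] // MINUTES_PER_DAY
    out['hour_of_day'] = (out['time'] % MINUTES_PER_DAY) // 60
    out['day_of_week'] = out['day'] % 7  # day 0 is an arbitrary weekday
    return out

# Toy stand-in for df[['time']]; values chosen to hit hour, day and week boundaries.
toy = pd.DataFrame({'time': [0, 59, 60, 1440, 10081]})
features = add_time_features(toy)
print(features)
```

A histogram of `hour_of_day` on the real data that looks flat would count against the minutes interpretation; a clear 24-hour rhythm would support it.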

## place_id



The label we're asked to predict. Let's count the distinct places and see how many check-ins each one has.

In [None]:
# Number of distinct places in the training set
print(df.place_id.nunique())

In [None]:
# Histogram of check-ins per place: how many places have a given number of check-ins
ax = df.place_id.value_counts().hist(bins=500)
ax.set(ylabel="Number of places", xlabel="Check-ins per place")

In [None]:
shape, scale = 1.34, 199.38  # gamma shape (k) and scale (theta); mean = shape * scale
s = df.place_id.value_counts().values  # check-ins per place
#s = np.random.gamma(shape, scale, 1000)

# Display the histogram of the samples, along with the gamma probability density function.
# `density=True` replaces `normed=True`, which newer Matplotlib releases no longer accept.
count, bins, ignored = plt.hist(s, 50, density=True)
y = bins**(shape-1)*(np.exp(-bins/scale) /
                      (sps.gamma(shape)*scale**shape))

plt.plot(bins, y, linewidth=2, color='r')
plt.ylim(0,.006)
plt.show()
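
The shape and scale above are hard-coded, and the notebook does not say where they came from. One plausible way to obtain such values is a maximum-likelihood gamma fit with the location pinned at zero. This is a sketch under that assumption, not necessarily the method used here. To stay self-contained it fits a synthetic sample drawn with the same parameters (mirroring the commented-out `np.random.gamma` line) rather than the real counts.

```python
import numpy as np
import scipy.stats as stats

# Synthetic stand-in for the check-ins-per-place counts, drawn from a gamma
# distribution with the shape and scale quoted in the notebook.
rng = np.random.default_rng(0)
sample = rng.gamma(shape=1.34, scale=199.38, size=20000)

# Maximum-likelihood fit; floc=0 fixes the location so only shape and scale are estimated.
fit_shape, fit_loc, fit_scale = stats.gamma.fit(sample, floc=0)

print("shape:", fit_shape)
print("scale:", fit_scale)
print("implied mean (shape * scale):", fit_shape * fit_scale)
```

On the real data, replacing `sample` with `df.place_id.value_counts().values` would show how close a fitted gamma comes to 1.34 and 199.38.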