# Sightseeing in New York City
**Extracting patterns from geolocated venues and events**

Machine learning, and in particular clustering algorithms, can be used to determine which geographical areas are commonly visited and “checked into” by a given user and which areas are not. Such geographical analyses enable a wide range of services, from location-based recommenders to advanced security systems, and in general provide a more personalized user experience. 

I will use these techniques to provide two flavours of predictive analytics:

First, I will build a simple recommender system that suggests the trending venues in a given area. In particular, k-means clustering can be applied to the dataset of geolocated events to partition the map into regions. For each region, we can rank the venues by number of visits. With this information, we can recommend venues and landmarks such as Times Square or the Empire State Building depending on the location of the user.
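As a minimal sketch of this idea, the snippet below partitions a handful of check-ins with scikit-learn's `KMeans` and ranks the venues inside each region by visit count. The check-ins, the `recommend` helper, and the choice of two regions are all hypothetical and only serve to illustrate the approach; they are not taken from the Gowalla data.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical check-ins (venue, lon, lat); illustrative values, not Gowalla data.
checkins = pd.DataFrame(
    [
        ("Times Square", -73.9855, 40.7580),
        ("Times Square", -73.9855, 40.7580),
        ("Times Square", -73.9855, 40.7580),
        ("Bryant Park", -73.9832, 40.7536),
        ("Statue of Liberty", -74.0445, 40.6892),
        ("Statue of Liberty", -74.0445, 40.6892),
        ("Battery Park", -74.0170, 40.7033),
    ],
    columns=["venue", "lon", "lat"],
)

# Partition the map into k regions based on the event coordinates.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
checkins["region"] = km.fit_predict(checkins[["lon", "lat"]].to_numpy())

# Count the check-ins per venue and sort them within each region.
ranking = (
    checkins.groupby(["region", "venue"]).size()
    .rename("visits").reset_index()
    .sort_values(["region", "visits"], ascending=[True, False])
)

def recommend(lon, lat, top=1):
    """Return the most visited venues of the region that contains (lon, lat)."""
    region = km.predict(np.array([[lon, lat]]))[0]
    return ranking[ranking["region"] == region]["venue"].head(top).tolist()

print(recommend(-73.98, 40.75))
print(recommend(-74.04, 40.69))
```

The cluster labels themselves are arbitrary, which is why the lookup goes through `km.predict` rather than a hard-coded region number.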

Second, I’ll determine geographical areas that are specific and personal to each user. In particular, I will use a density-based clustering technique such as DBSCAN to extract the areas where a user usually goes. This analysis can be used to determine whether a given data point is an _outlier_ with respect to the areas where a user normally checks in, and therefore to score a "novelty" or "anomaly" factor for the location of a given event.
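To make the outlier idea concrete, here is a small sketch on made-up coordinates. It relies on the fact that scikit-learn's `DBSCAN` assigns the label `-1` to points that do not belong to any dense area. The six points are hypothetical, and the `eps` value assumes the same rough meters-to-degrees conversion used later in this notebook.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical check-ins (lon, lat) of one user: five close together, one far away.
points = np.array([
    [-73.9900, 40.7500],
    [-73.9905, 40.7503],
    [-73.9898, 40.7496],
    [-73.9902, 40.7507],
    [-73.9895, 40.7501],
    [-73.9500, 40.8000],
])

# eps is expressed in degrees: about 260 m, using ~111 km per degree of latitude.
eps = 260 / 111034.61
labels = DBSCAN(eps=eps, min_samples=3).fit(points).labels_

# Points labelled -1 are noise, i.e. outliers with respect to the dense areas.
outliers = points[labels == -1]
print(labels)
print(outliers)
```

Here the five nearby points form one cluster and the distant point is flagged as noise, which is the signal a "novelty" score can build on.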

We will analyze these events using a public dataset shared by Gowalla, containing venue check-ins registered between 2008 and 2010. This notebook will cover some typical data science steps:

  - data acquisition
  - data preparation
  - data exploration
  
Thereafter, we will dive into two unsupervised learning techniques, *k-means* and *DBSCAN* clustering, used respectively for recommending popular venues and for determining outliers.

## Imports

In [1]:
%matplotlib inline

# utils
import os
import re
import urllib

# images on the notebook
from PIL import Image

# time
import pytz as tz
from datetime import datetime

# cassandra driver
from cassandra.cluster import Cluster
from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement

# data exploration
import numpy as np
np.random.seed(1337)

import pandas as pd

In [2]:
# init
datadir = './data'

# connect to cassandra
CASSANDRA_NODES = ['127.0.0.1']

cluster = Cluster(CASSANDRA_NODES)
session = cluster.connect()

In [3]:
#matplotlib
import matplotlib.pyplot as plt

plt.rcParams['figure.figsize'] = (20.0, 20.0)
plt.rcParams.update({'font.size': 12})
plt.rcParams['xtick.major.pad']='5'
plt.rcParams['ytick.major.pad']='5'

plt.style.use('ggplot')

### Prepare cassandra statements
We are going to read the events that belong to a specific user.

In [14]:
# prepared statement for getting the coordinates of all events of a given user
cql_prepared = session.prepare("SELECT lon, lat from lbsn.events where uid= ?")

## Determining user-specific regions

In [83]:
from sklearn.cluster import DBSCAN
from scipy.spatial import ConvexHull
from scipy.spatial import Delaunay

def clusters(uid, radius=260):

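    # convert the radius from meters to degrees: one degree of latitude
    # at 40deg latitude is 111034.61 meters (a degree of longitude is shorter
    # there, about 85 km, so this is only an approximation)
    eps = radius/111034.61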

    #user events
    rows = session.execute(cql_prepared.bind([uid]))
    
    data = pd.DataFrame(list(rows))
    db = DBSCAN(eps=eps, min_samples=3).fit(data)
    
    data['cl'] = db.labels_
    return data

    
def regions(data):
    """
    Return the convex hull vertices of every DBSCAN cluster in `data`
    """
    hulls = []
    for cl, group in data.groupby('cl'):
        # label -1 marks noise points, which do not form a region
        if cl < 0:
            continue
        points = group[['lon','lat']].values
        try:
            hull = ConvexHull(points)
            hulls.append(points[hull.vertices])
        except Exception:
            # degenerate clusters (e.g. collinear points) have no 2D hull
            pass
    return hulls

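def in_region(p, convexhull):
    """
    Test if points in `p` are in `convexhull`
    """
    # triangulation of convex hull vertices;
    # find_simplex returns -1 for points outside the triangulation
    try:
        tri = Delaunay(convexhull)
        return tri.find_simplex(p)>=0
    except Exception:
        # if the triangulation fails, treat the point as inside (no alert)
        return True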
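The `clusters` function converts its radius from meters to degrees with a single constant. The sketch below shows why that is an approximation: the length of a degree of longitude shrinks with latitude. It assumes a spherical Earth with a mean radius of 6,371 km, so its figures differ slightly from the 111034.61 constant used above, and both helper names are hypothetical.

```python
import math

EARTH_RADIUS_M = 6371000.0  # assumption: spherical Earth, mean radius

def meters_per_degree(lat_deg):
    """Return (meters per degree of latitude, meters per degree of longitude)."""
    per_deg_lat = math.pi * EARTH_RADIUS_M / 180.0
    # Meridians converge towards the poles, hence the cosine factor.
    per_deg_lon = per_deg_lat * math.cos(math.radians(lat_deg))
    return per_deg_lat, per_deg_lon

def radius_to_degrees(radius_m, lat_deg):
    """Convert a radius in meters to (degrees of latitude, degrees of longitude)."""
    per_deg_lat, per_deg_lon = meters_per_degree(lat_deg)
    return radius_m / per_deg_lat, radius_m / per_deg_lon

lat_m, lon_m = meters_per_degree(40.0)
d_lat, d_lon = radius_to_degrees(260, 40.0)
print(round(lat_m), round(lon_m))
print(d_lat, d_lon)
```

In practice this means that a fixed `eps` in degrees reaches about 260 m north-south but only roughly 200 m east-west around New York City.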
        

In [84]:
def location_alert(uid, lon, lat):
    """
    Determine if the given point is within any of the user's convex hulls.
    If not, it returns True, meaning the location is unusual for this user.
    """
    
    hulls = regions(clusters(uid))
    
    # the point is "known" if it falls inside at least one region
    result = any(in_region([lon, lat], hull) for hull in hulls)
        
    return (not result)

In [85]:
location_alert(22, -73.99, 40.75)


False

In [86]:
location_alert(22, -73.99, 43.75)

True
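The two calls above need a running Cassandra instance. To try the geometric part on its own, here is a minimal sketch that applies the same `ConvexHull` and `Delaunay` steps to a hypothetical cluster of five points. The coordinates and the `inside` helper are made up for illustration.

```python
import numpy as np
from scipy.spatial import ConvexHull, Delaunay

# Hypothetical cluster of check-ins (lon, lat): four corners and one interior point.
cluster_points = np.array([
    [-74.00, 40.74],
    [-73.98, 40.74],
    [-73.98, 40.76],
    [-74.00, 40.76],
    [-73.99, 40.75],
])

# Keep only the points that lie on the boundary of the cluster.
hull = ConvexHull(cluster_points)
hull_vertices = cluster_points[hull.vertices]

def inside(point, vertices):
    """True if `point` falls in the convex region spanned by `vertices`."""
    # find_simplex returns -1 when the point is outside the triangulation.
    return bool(Delaunay(vertices).find_simplex(point) >= 0)

print(inside([-73.995, 40.75], hull_vertices))  # point within the square
print(inside([-73.99, 43.75], hull_vertices))   # point three degrees further north
```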

### Build the REST service

In [None]:
import json

from flask import Flask
app = Flask("location_alert")

@app.route("/location/alert/<int:uid>/<lon>,<lat>")
def alert_api(uid,lon, lat):
    result = location_alert(uid,float(lon), float(lat))
    return json.dumps(result)

app.run()

Try http://localhost:5000/location/alert/22/-73.99,40.75
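The route can also be exercised without starting a server, using Flask's built-in test client. The sketch below is self-contained: `location_alert_stub` is a hypothetical stand-in that replaces the Cassandra-backed `location_alert` with a fixed bounding box, so only the URL parsing and the JSON response are illustrated.

```python
import json

from flask import Flask

app = Flask("location_alert_demo")

def location_alert_stub(uid, lon, lat):
    """Hypothetical stand-in: alert when the point is outside a fixed box."""
    return not (-74.05 <= lon <= -73.90 and 40.68 <= lat <= 40.88)

@app.route("/location/alert/<int:uid>/<lon>,<lat>")
def alert_api(uid, lon, lat):
    # lon and lat arrive as strings and are converted before the check
    return json.dumps(location_alert_stub(uid, float(lon), float(lat)))

# The test client issues requests against the app in-process.
client = app.test_client()
known = json.loads(client.get("/location/alert/22/-73.99,40.75").get_data(as_text=True))
unusual = json.loads(client.get("/location/alert/22/-73.99,43.75").get_data(as_text=True))
print(known, unusual)
```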