# Visualizing Grab transportation data from 2013
The goal is to visualize grab bookings with respect to the geographical area. Here, we try to do this by clustering scattered GPS points to see trends.

In [1]:
import math
import pandas as pd
import matplotlib.pyplot as plt
import mpld3
import gmplot
import numpy as np
import hdbscan
from sklearn.cluster import DBSCAN
from geopy.distance import great_circle
from geopy.distance import distance
from geopy.distance import Distance
from geopy import Point
from shapely.geometry import MultiPoint
%matplotlib inline
mpld3.enable_notebook()

Some dataset details:
- `source` is the channel whereby the booking was made
- `pick_up_distance` is the distance of the driver from the passenger. This is measured via road distance and not by straight line.
- `state`: either `CANCELLED` (booking cancelled by the user), `COMPLETED` (whole trip was completed), or `UNALLOCATED` (no car could be allocated to the user)

In [2]:
df = pd.read_csv('original.csv')
df.head()

Unnamed: 0,source,created_at_local,pick_up_latitude,pick_up_longitude,drop_off_latitude,drop_off_longitude,city,fare,pick_up_distance,state
0,ADR,2013-09-22 23:46:18.000000,14.604348,120.998654,14.53737,120.994423,Metro Manila,281.875,0.389894,CANCELLED
1,T47,2013-11-04 03:51:59.000000,14.590099,121.082645,14.508611,121.019444,Metro Manila,413.125,2.20977,COMPLETED
2,T47,2013-11-21 05:21:24.000000,14.582707,121.061458,14.537752,121.001379,Metro Manila,277.5,2.70291,COMPLETED
3,ADR,2013-09-16 20:53:34.000000,14.585812,121.060171,14.575915,121.085487,Metro Manila,220.625,0.321403,CANCELLED
4,IOS,2013-09-10 23:49:16.000000,14.55201,121.05126,14.63021,120.99592,Metro Manila,378.125,0.667067,COMPLETED


Check if we have any irregular NAs in the dataset

In [3]:
df.isnull().sum()

source                     0
created_at_local           0
pick_up_latitude           0
pick_up_longitude          0
drop_off_latitude          0
drop_off_longitude         0
city                       0
fare                       0
pick_up_distance      120532
state                      0
dtype: int64

NAs at `pick_up_distance` should come from unallocated trips

In [4]:
df[df["state"]=="UNALLOCATED"].isnull().sum()

source                     0
created_at_local           0
pick_up_latitude           0
pick_up_longitude          0
drop_off_latitude          0
drop_off_longitude         0
city                       0
fare                       0
pick_up_distance      120532
state                      0
dtype: int64

### Clustering the points
There is not much analysis we can do when all we have are a bunch of GPS coordinates. To build relationships or to correlate points, we need some kind of method to group them together. For example, we can group points in the same city/town/postal code. However, grouping by geographic boundaries wouldn't be practical resource-wise. Moreover, such boundaries sometimes do not reflect proximity: for instance, if a popular mall lies along a city boundary, bookings coming from that mall would be assigned to 2 different cities, even though all those bookings are within the mall.

To get a more natural grouping of points, we use a clustering algorithm called **DBSCAN**. We can cluster points based on how close they are to each other, such that points near each other are likely to be grouped in the same cluster.

Prepare the coordinates for DBSCAN

In [5]:
pick_up_coords = df.as_matrix(columns=['pick_up_latitude', 'pick_up_longitude'])
drop_off_coords = df.as_matrix(columns=['drop_off_latitude', 'drop_off_longitude'])
# Concatenate pick-up and drop-off coordinates; first half pick-up, latter half drop-off
coords = np.concatenate((pick_up_coords, drop_off_coords))

#### Clustering coordinates using DBSCAN
Some parameters of interest:
- `eps_km` - This is the maximum distance of points within a cluster. As this we increase this parameter, we get fewer clusters and with generally larger area. We determined 50 meters to be a good enough setting.
- `min_samples` - The minimum no. of samples to be considered for a point to be considered a core point.

In [6]:
eps_km = 0.05  # Max neighborhood distance of DBSCAN in km
eps_rad = eps_km/6371.  # Convert km to radians (source: http://geoffboeing.com/2014/08/clustering-to-reduce-spatial-data-set-size/)
min_samples = 3  # Min no. of samples to be considered a core point

# Perform DBSCAN
db = DBSCAN(eps=eps_rad, min_samples=min_samples, algorithm='ball_tree', metric='haversine').fit(np.radians(coords))

Transform the outputs from DBSCAN

In [7]:
# Map row # in coords -> cluster #
cluster_labels = db.labels_

# Separate cluster labels into pick up and drop off
cluster_labels_pick_up = cluster_labels[:len(cluster_labels)//2]
cluster_labels_drop_off = cluster_labels[len(cluster_labels)//2:]

# Count the total # of clusters
num_clusters = len(set(cluster_labels))

# Map cluster # -> points inside the cluster
clusters = pd.Series([coords[cluster_labels == n] for n in range(num_clusters)])
print('Number of clusters: '+str(num_clusters))

Number of clusters: 9355


After clustering, we now have the grouping of the closest points. However, these clusters are merely groupings; they do not have a 2D representation yet. We need to get the center of each cluster:

In [8]:
def get_centermost_point(cluster):
    """Compute the centermost point of a cluster of points."""
    if (len(cluster) == 0):
        return
    centroid = (MultiPoint(cluster).centroid.x, MultiPoint(cluster).centroid.y)
    centermost_point = min(cluster, key=lambda point: great_circle(point, centroid).m)
    return tuple(centermost_point)

In [9]:
# Map row # in clusters -> center (lat, long)
centermost_points = clusters.map(get_centermost_point)

Now we can add the clusters and their centers to our original dataframe

In [10]:
# Assign center points in dataframe
df['pick_up'] = [centermost_points[lbl] if lbl != -1 else np.nan for lbl in cluster_labels_pick_up]
df['drop_off'] = [centermost_points[lbl] if lbl != -1 else np.nan for lbl in cluster_labels_drop_off]
df['pick_up_lbl'] = cluster_labels_pick_up
df['drop_off_lbl'] = cluster_labels_drop_off

### Analyzing the clusters
With the clustering done, we now have a way to associate together points that are "close" to each other.
We are now free to do analysis on these clusters with relevant metrics.

For now, we are going to focus on the **Allocation Rate (AR)**: the rate of successful matching we can do. This is defined by `AR = Allocated Bookings / Total Bookings`. This metric is important, since it tells us how well we can match demand. By calculating the `AR` for each cluster, we will be able to know which locations lack available Grab vehicles.

We can now compute the AR for each cluster.

In [11]:
# Compute the total and unallocated bookings for each cluster
clusters_total_bookings = [len(df[df['pick_up_lbl'] == i]) for i in range(num_clusters)]
clusters_unallocated_bookings = [len(df[(df['pick_up_lbl']==i) & (df['state']=='UNALLOCATED')]) for i in range(num_clusters)]

In [12]:
def get_allocation_rate(cluster_lbl):
    """Compute the allocation rate of a cluster given the cluster label."""
    total_bookings = clusters_total_bookings[cluster_lbl]
    if (total_bookings == 0):
        return
    unallocated_bookings = clusters_unallocated_bookings[cluster_lbl]
    # Compute the allocation rate
    return (total_bookings - unallocated_bookings) / total_bookings

In [13]:
# Compute the allocation rate for each cluster
clusters_ar = [get_allocation_rate(i) for i in range(num_clusters)]

#### Visualizing the clusters
To visualize the points, we plot each cluster in a map. We also define some functions and constants for our plot's aesthetics.

In [14]:
PT_SIZE_SCALE = 4  # Scaling for point sizes

def get_hue(pct):
    """Creates a color between green and red, tinted according to pct. 100% is green; 0% is red"""
    n = pct*100.
    R = round((255 * n) / 100.)
    G = round((255 * (100 - n)) / 100.)
    B = 0
    return '#%02x%02x%02x' % (R, G, B)

#### Drawing a cluster
For each cluster that we can draw in the map, we change the ff. visual properties:
- **Location**: the center of the cluster should be the approximate center of all the bookings in the vicinity
- **Color**: red means very low allocation rate (hard to get a booking), green means high allocation rate (easy to get a booking)
- **Size**: a large cluster should represent that a lot of bookings are made in that cluster

In [15]:
def draw_cluster(cluster_lbl):
    """Draws a cluster on the map given the cluster label."""
    center = centermost_points[cluster_lbl]
    allocation_rate = clusters_ar[cluster_lbl]
    total_bookings = clusters_total_bookings[cluster_lbl]
    if (allocation_rate is None):
        return
    # Determine color based on allocation rate
    color = get_hue(allocation_rate)
    # Determine size based on sqrt(total bookings): since area grows in a quadratic manner,
    # we get the square root of the metric to represent the size accurately
    point_size = math.sqrt(total_bookings)*PT_SIZE_SCALE
    gmap.scatter([center[0]], [center[1]], size=point_size, marker=False, color=color)

In [22]:
MAP_INITIAL_ZOOM = 14  # Initial zoom level of the map
MAP_INITIAL_POS = [14.556423, 121.025291]  # Initial center position of the map

# Draw the clusters on a google map
gmap = gmplot.GoogleMapPlotter(MAP_INITIAL_POS[0], MAP_INITIAL_POS[1], MAP_INITIAL_ZOOM)
for i in range(num_clusters):
    draw_cluster(i)
gmap.draw('out.html')

### A screenshot of the final output
![Output Screenshot](out_screenshot.png)

**Some quick observations about the data:**
- In the figure above, we can see that the Makati, BGC, and Ortigas areas have relatively big clusters. **Green** 🍏 clusters mean high allocation rate (easy to book), while **red** 🍎 ones mean low allocation rates (hard to find a booking).
- Looking at the bigger picture, we find notable red spots in:
    - 🏬 Trinoma - Quezon City
    - 🏙 Tivoli Garden Residences - Mandaluyong
    - 🏞 Riverfront Residences - Pasig
    - 🌏 SM Mall of Asia - Pasay
    - 🛫 NAIA Terminal 4 - Pasay
- From the point of view of Grab, big red clusters show potential problems in their matching algorithms.
- From the point of view of a Grab user however, this map can be useful in finding where it is best to book Grab (or where not to).

It is important to note, however, that these clusters **do not represent geographic area** from which the bookings happened, but instead represent dense clusters of bookings.
