# Uber Data Exploration

This data gives the times and places of Uber pickups around New York City for several months in 2014.
I'll start by looking at the data from April 2014.

In [None]:
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import scipy as sp

%matplotlib inline

df = pd.read_csv('../input/uber-raw-data-apr14.csv')


df.head()

There's not much here. Just a timestamp, latitude, longitude, and a 'Base.'

To start out, I'll add a bunch of timestamp transformations that should be useful. Things like day and hour can be useful as categorical variables, but I'll also add floating point times down to minute granularity.
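
As a rough sketch of the transformations described above, the same columns can be built in vectorized form with the pandas `.dt` accessor. The three sample rows are made up for illustration; only the `Date/Time` column name comes from the data. Note that pandas numbers the days Monday=0 through Sunday=6, so Saturday and Sunday are 5 and 6.

```python
import pandas as pd

# Tiny made-up stand-in for the Uber file (a Tuesday, a Saturday, a Sunday)
sample = pd.DataFrame({'Date/Time': ['4/1/2014 0:11:00',
                                     '4/5/2014 17:45:00',
                                     '4/6/2014 23:30:00']})

ts = pd.to_datetime(sample['Date/Time'])
features = pd.DataFrame({
    'DayOfWeek': ts.dt.dayofweek,                   # Monday=0 ... Sunday=6
    'Day': ts.dt.day,
    'Hour': ts.dt.hour,
    'DayTime': ts.dt.hour + ts.dt.minute / 60.0,    # fractional hour of the day
    'MonthTime': ts.dt.day + ts.dt.hour / 24.0 + ts.dt.minute / (24.0 * 60),
})
features['IsWeekend'] = features['DayOfWeek'] >= 5  # Saturday or Sunday
print(features)
```

This avoids the Python-level loop over every timestamp, which may matter for the larger monthly files.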

Something else that could be fun to do would be to look at census data. You can get the census block using this code, which uses an online tool from the FCC (it's extremely slow, though, so there are probably better ways to do this):
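
One possible way to cut down on the number of slow lookups is to query each distinct coordinate only once and join the results back. The sketch below assumes this is acceptable for the analysis. `lookup_blocks`, `fake_lookup`, and the rounding precision are hypothetical names and choices for illustration, and `fake_lookup` stands in for the real web request so the example runs offline.

```python
import pandas as pd

def lookup_blocks(df, lookup, ndigits=4):
    """Call lookup(lat, lon) once per distinct rounded coordinate, then join back."""
    coords = df[['Lat', 'Lon']].round(ndigits)
    unique = coords.drop_duplicates().reset_index(drop=True)
    unique['block'] = [lookup(lat, lon) for lat, lon in zip(unique['Lat'], unique['Lon'])]
    # A left merge keeps one output row per input row, in the original order
    return coords.merge(unique, on=['Lat', 'Lon'], how='left')

# Hypothetical stand-in for the slow request; it just records how often it is called
calls = []
def fake_lookup(lat, lon):
    calls.append((lat, lon))
    return 'block_%d' % len(calls)

pickups = pd.DataFrame({'Lat': [40.7690, 40.7690, 40.7267],
                        'Lon': [-73.9549, -73.9549, -74.0345]})
result = lookup_blocks(pickups, fake_lookup)
print(result)
print(len(calls), 'lookups for', len(pickups), 'rows')
```

How much this helps depends on how often coordinates repeat in the real data, which I have not measured.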



In [None]:
#! /usr/bin/env python3

from xml.etree import ElementTree
import requests
import pandas as pd

def getdata(lat,lon):
  ## Ask the FCC block-lookup service about a single coordinate
  page = requests.get("http://data.fcc.gov/api/block/2010/find?latitude=%f&longitude=%f"%(lat,lon))
  root = ElementTree.fromstring(page.content)

  status = root.attrib['status']
  block = root[0].attrib['FIPS']
  county = root[1].attrib['name']
  state = root[2].attrib['code']

  data= {'lat':lat,'lon':lon,'status':status,'block':block,'county':county,'state':state}

  return data

def read_uber_data(inf='uber-raw-data-apr14.csv',outf='uber_apr14_blocks.csv'):

  df = pd.read_csv(inf)
  ## as_matrix() has been removed from pandas; .values gives the same array
  lats = df.Lat.values
  lons = df.Lon.values
  ## The with-block closes the file even if a request fails partway through
  with open(outf,'w') as f:
    f.write('status,block,county,state\n')
    for count,(lat,lon) in enumerate(zip(lats,lons)):
      ## Print progress every 10 rows, since each lookup is slow
      if (count%10 ==0 ):
        print(count)
      data = getdata(lat,lon)
      f.write( str(data['status'])+','+str(data['block'])+','+str(data['county'])+','+str(data['state'])+'\n')

In [None]:
def TransformTimestamp(df):
    df['Date/Time'] = pd.to_datetime(df['Date/Time'])
    df['DayOfWeek'] = [x.dayofweek for x in df['Date/Time']]
    df['Day'] = [x.day for x in df['Date/Time']]
    df['Hour'] = [x.hour for x in df['Date/Time']]
    df['DayTime'] = [x.hour+x.minute/60. for x in df['Date/Time']]
    df['MonthTime'] = [x.day + x.hour/24. + x.minute/(24.*60) for x in df['Date/Time']]
    ## pandas numbers days Monday=0 ... Sunday=6, so the weekend is 5 and 6
    df['IsWeekend'] = df.DayOfWeek >= 5
    return df
df = TransformTimestamp(df)

Now that I've done that, I can start plotting some basic things about the time distributions.

In [None]:
## Create a figure since we'll have several subfigures
fig = plt.figure(figsize=(12,12))
ax = fig.add_subplot(221)

## First, total rides by day of the week.
ax.hist(df.DayOfWeek,bins=7,range=[0,7],alpha=0.8)
ax.set_xlabel('Day of Week')
ax.set_xticks([0.5,1.5,2.5,3.5,4.5,5.5,6.5])
## pandas' dayofweek starts at Monday=0
ax.set_xticklabels(['Mo','Tu','We','Th','Fr','Sa','Su'])
ax.set_ylabel('# of Rides in April 2014')
ax.set_xlim([0,7])

## Next, let's look at the distributions every 15 min throughout the day
ax = fig.add_subplot(222)
ax.hist(df.DayTime,bins=96,range=[0,24],alpha=0.8)
ax.set_xlabel('Time of Day [hrs]')
ax.set_ylabel('# of Rides in April 2014')
ax.set_xlim([0,24])

## Finally, let's look at the # of rides every 2 hours for the whole month
ax = fig.add_subplot(212)
ax.hist(df.MonthTime,bins=30*12,range=[1,31],alpha=0.8)
ax.set_xlabel('Time in Month [days]')
ax.set_ylabel('# of Rides in April 2014')
ax.set_xlim([1,31])

fig = plt.figure(figsize=[12,6])
plt.hist(df.MonthTime,bins=30,range=[1,31],alpha=0.8)
plt.xlabel('Day [days]')
plt.ylabel('# of Rides in April 2014')
plt.xlim([1,31])
plt.show()

We can see a number of features here. There are clearly more rides on weekdays than on weekends. We would expect this as long as a substantial fraction of rides come from commuters.

We also see several peaks throughout the day. The morning rush hour starts a little after 5 am and ends around 9 am. The afternoon rush hour is much larger, and there is actually a broad peak extending from around noon to 3 am. The true rush hour is probably around 4-8 pm, with a second small peak near 9 pm. That second peak is probably too small to interpret.

Finally, in the plot of the whole month, we see that there are large day-to-day fluctuations. The last day of the month has a huge number of trips for some reason, so it would be good to investigate that. We also see that on the first Monday of the month (the 7th), there is an unusually large amount of Uber activity compared to other Mondays.
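
Because these observations depend on matching dates to days of the week, a quick calendar check may be worthwhile. This small sketch uses only the standard library. `date.weekday()` follows the same convention as pandas' `dayofweek`, with Monday=0 and Sunday=6.

```python
import datetime

# The two dates that stand out in the whole-month plot
weekday_index = {}
for day in (7, 30):
    d = datetime.date(2014, 4, day)
    weekday_index[day] = d.weekday()        # Monday=0 ... Sunday=6
    print(d.isoformat(), d.weekday(), d.strftime('%A'))
```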

Now, I'll compare the weekend to non-weekend time distributions.

In [None]:
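## Newer seaborn uses height= instead of size=; newer matplotlib uses density= instead of normed=
fg = sns.FacetGrid(df,col='IsWeekend',height=5,sharex=True,xlim=[0,24])
fg.map(plt.hist,'DayTime',alpha=0.8,density=True,bins=96,range=[0,24])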
for ax in fg.axes.flat:
    ax.set_xlabel('Time of Day [hr]')

plt.show()

We see that the rush hour structure still persists even on weekends but is not quite as clear. The second peak in the afternoon rush hour has disappeared on the weekend, suggesting that it is business-related in some way.

There is an odd dip in the weekend plot around 8pm.

We also see a huge difference in the number of trips taken in the early morning, where there are many fewer trips on weekdays. Clearly, this feature is going to be related to the number of people out socializing at night. Interestingly, there are more trips taken at night but before midnight on weekdays. We would generally expect the values at 24:00 and 0:00 to match but this is clearly not the case on weekends. I would guess that this is due to Friday being counted as a weekday here.
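
If the mismatch really comes from how late-night rides are assigned to days, one way to test that guess would be to relabel each ride with the evening it belongs to. This is only a sketch: the 5 am cutoff and the names `night_of` and `is_fri_or_sat` are my own assumptions, and the three timestamps are made up.

```python
import pandas as pd

# Made-up timestamps: late Friday evening, early Saturday, and early Monday
ts = pd.Series(pd.to_datetime(['2014-04-04 23:30',
                               '2014-04-05 02:15',
                               '2014-04-07 01:00']))

# Assumed cutoff: a "day" runs from 5 am to 5 am, so early-morning rides
# are counted with the previous evening
shifted = ts - pd.Timedelta(hours=5)
night_of = shifted.dt.dayofweek             # Monday=0 ... Sunday=6

# Flag rides that belong to a Friday or Saturday under this relabeling
is_fri_or_sat = night_of.isin([4, 5])
print(pd.DataFrame({'ts': ts, 'night_of': night_of, 'is_fri_or_sat': is_fri_or_sat}))
```

With this relabeling, a 2 am Saturday pickup is grouped with Friday night, while a 1 am Monday pickup is grouped with Sunday.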


In [None]:
### If we're going to make any histogram type plots, it would be good to have square bins
### Away from the equator, degrees in latitude and longitude are different
### (i.e. the Jacobian on a spherical surface is sin(theta) )

latitude_nyc = 40.75 * np.pi/180
aspect = 1/np.sin(0.5*np.pi - latitude_nyc)
aspect

# Some visualizations

First we'll look at a few plots of the whole city.
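
The `aspect` factor computed above is what keeps the 2D histogram bins roughly square on the ground. A small worked sketch of that arithmetic follows, using only the standard library; `square_ybins` is a hypothetical helper name. At latitude 40.75 degrees, a degree of longitude covers only cos(latitude) times the distance of a degree of latitude, so the latitude axis needs about 1/cos(latitude) times as many bins per degree.

```python
import math

lat_deg = 40.75                                   # approximate latitude of NYC
aspect = 1.0 / math.cos(math.radians(lat_deg))    # same value as 1/sin(pi/2 - latitude)

def square_ybins(xbins, xlims, ylims):
    """Number of latitude bins that makes the bins roughly square on the ground."""
    dx = xlims[1] - xlims[0]                      # longitude span in degrees
    dy = ylims[1] - ylims[0]                      # latitude span in degrees
    return int(round(xbins * aspect * dy / dx))

print(round(aspect, 4))
print(square_ybins(250, [-74.2, -73.7], [40.6, 40.9]))
```

Rounding to an integer matters because the histogram functions expect a whole number of bins.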

In [None]:
def WholeCityPlot(df,ax=None,xbins=250):
    if ax is None:
        fig = plt.figure(figsize=(10,10))
        ax = fig.add_subplot(111)
    ax.grid(False)

    xlims = [-74.2,-73.7]
    ylims = [40.6,40.9]
    ## hist2d needs an integer number of bins
    ybins = int(round(xbins * aspect*(ylims[1]-ylims[0] )/(xlims[1]-xlims[0])))
    hist = ax.hist2d(df.Lon,df.Lat,bins=[xbins,ybins],range=[xlims,ylims],cmap='inferno',norm=mpl.colors.LogNorm())
    
WholeCityPlot(df)
plt.show()

We can see a number of hot spots here. Midtown Manhattan is clearly a huge bright spot. It also looks like there's a spot near the Meatpacking District. The three airports are also clearly visible. Most of the outer boroughs are pretty lightly covered with spots. The areas close to the waterfront in Queens and Brooklyn are bright. In the Bronx, there seem to be spots around Riverdale and Yankee Stadium. Hoboken and the waterfront area of Jersey City are also brighter.

Now, how do the weekend and weekday plots differ?

In [None]:
fig = plt.figure(figsize=[10,5])
ax = fig.add_subplot(121)
WholeCityPlot(df[df['IsWeekend'] == True ],ax=ax)
ax = fig.add_subplot(122)
WholeCityPlot(df[df['IsWeekend'] == False],ax=ax)

In [None]:
def MapNoAirports(df,ax=None,xbins=200):
    if ax is None:
        fig = plt.figure(figsize=(10,10))
        ax = fig.add_subplot(111)
    ax.grid(False)

    xlims = [-74.05,-73.9]
    ylims = [40.6,40.9]
    ## hist2d needs an integer number of bins
    ybins = int(round(xbins * aspect*(ylims[1]-ylims[0] )/(xlims[1]-xlims[0])))
    hist = ax.hist2d(df.Lon,df.Lat,bins=[xbins,ybins],range=[xlims,ylims],cmap='inferno',norm=mpl.colors.LogNorm())
    
MapNoAirports(df)

In [None]:
def ManhattanMap(df,ax=None,xbins=150):
    if ax is None:
        fig = plt.figure(figsize=(10,10))
        ax = fig.add_subplot(111)
    ax.grid(False)

    xlims = [-74.03,-73.92]
    ylims = [40.69,40.88]
    ## hist2d needs an integer number of bins
    ybins = int(round(xbins * aspect*(ylims[1]-ylims[0] )/(xlims[1]-xlims[0])))
    hist = ax.hist2d(df.Lon,df.Lat,bins=[xbins,ybins],range=[xlims,ylims],cmap='inferno',norm=mpl.colors.LogNorm())
    
fig = plt.figure(figsize=[10,10])
ax = fig.add_subplot(221)
ax.set_title('Weekend')
ManhattanMap(df[df['IsWeekend'] == True ],ax=ax,xbins=100)
ax = fig.add_subplot(222)
ax.set_title('Weekday')
ManhattanMap(df[df['IsWeekend'] == False],ax=ax,xbins=100)
ax = fig.add_subplot(223)
ax.set_title('Late Night')
ManhattanMap(df[(df.DayTime<5)],ax=ax,xbins=100)
ax = fig.add_subplot(224)
ax.set_title('Not Late Night')
ManhattanMap(df[(df.DayTime>=5)],ax=ax,xbins=100)



In [None]:
fig = plt.figure(figsize=[10,5])
ax = fig.add_subplot(121)
ax.set_title('Morning Commute')
ManhattanMap(df[ (df.DayTime > 6)&(df.DayTime<9) ],ax=ax,xbins=80)
ax = fig.add_subplot(122)
ax.set_title('Afternoon Commute')
ManhattanMap(df[(df.DayTime>16)&(df.DayTime<19)],ax=ax,xbins=80)

# Fan behavior around Yankee Stadium

April is the first month of baseball, so let's see if the data shows anything about the season. I'll look at Yankee Stadium, which is in the Bronx just across from Manhattan. Very little of the Bronx seems to get much traffic, so it is likely that a lot of the traffic around Yankee Stadium is related to games.

It might also be interesting to see how this compares to Citi Field, where the Mets play. I also wonder if there are noticeable effects around Madison Square Garden from events. MSG is on top of Penn Station, so there will be a huge background of regular travelers if we want to look there.

In [None]:
YankeeStadium = [40.8294,-73.9267]
## -73.9319, -73.9213
## 40.8247 40.834
df_ys = df[(df.Lat > 40.824 ) & (df.Lat < 40.837) & (df.Lon > -73.9319) & (df.Lon < -73.92)]
fig = plt.figure(figsize=[10,10])
ax = fig.add_subplot(221)
ax.scatter(df_ys.Lon,df_ys.Lat)
ax.set_xlim([-73.9319,-73.92])
ax.set_ylim([40.824,40.837])
ax.set_xlabel('Lon [deg]')
ax.set_ylabel('Lat [deg]')

ax = fig.add_subplot(222)
ax.hist(df_ys.MonthTime,range=[1,31],bins=30)
ax.set_xlim([1,31])
ax.set_xlabel('Day')
ax.set_ylabel('Count')

ax = fig.add_subplot(212)
ax.hist(df_ys.MonthTime,range=[7,17],bins=10*24)
ax.set_xlim([7,17])
ax.set_xlabel('Day')
ax.set_ylabel('Count')

plt.show()

The top left plot shows the locations of pickups in the vicinity of Yankee Stadium (around (-73.926,40.83)).
We see two main clusters of pickups, along Jerome Ave and along E 161st between the stadium and the Grand Concourse. I was surprised to see that there was no cluster near the Metro North station toward the bottom left of the plot. I should eventually label these to make the geographic features clearer.

In the top right, we see a clear effect from game days. Mysteriously, there were very few Uber pickups on April 8th even though there was a game. Otherwise, all the game days are obvious. 

The bottom plot looks at the period from April 7th to April 16th. We see sharp peaks corresponding to the games. We also see that there was a double header on the 16th.

Now, let's look at some night games.

In [None]:
def YankeesNightGame(df,day):
    fig = plt.figure(figsize=[10,5])
    ax = fig.add_subplot(121)
    ax.hist( (df.MonthTime-day) * 24 ,range=[18,25],bins=40)
    ax = fig.add_subplot(122)
    ## Use the df passed in, not the global df_ys, for both conditions
    df_game = df[(df.MonthTime>(day+0.75)) & (df.MonthTime<(day+0.95))]
    ax.scatter(df_game.Lon,df_game.Lat,c=df_game.DayTime,cmap='jet',marker='.')
    ax.set_xlim([-73.9319,-73.92])
    ax.set_ylim([40.824,40.837])
    
YankeesNightGame(df_ys,9)
YankeesNightGame(df_ys,10)
YankeesNightGame(df_ys,11)
YankeesNightGame(df_ys,16)
YankeesNightGame(df_ys,25)
YankeesNightGame(df_ys,27)
YankeesNightGame(df_ys,29)

plt.show()

I've plotted histograms of the pickup times and scatterplots (with color plotted as time) of the positions of pickups for several games. The statistics are poor, but it looks as if there's no real trend for when in the game the most pickups occur. We would generally expect the peak to be immediately following the game, but it's not clear that this is what we see. Several of the plots make it look as if there is a cluster of late pickups near the stadium on E 161st. However, this is not the case for all of these, so I would say that more data or at least more analysis is needed to see if there are differences between the two populations. If there really is often a cluster of late pickups on E 161st, it may be players or staff leaving after the game. It would be interesting to get data like this from a variety of stadiums and arenas to see which venues have fans that tend to stick around after the game is over.

# Upper East Side/Spanish Harlem Boundary

One very noteworthy feature that we can see in the maps is the sharp boundary between the Upper East Side and Spanish Harlem, which traditionally lies at E 96th Street. We can see a small number of pickups in Central Park along the 97th St transverse, showing that this boundary is still more or less the same.

People have very often complained that traditional taxis are often hesitant to venture into areas like Harlem (I don't think this is nearly as common as it used to be though). So, why does Uber also seem to show this behavior?

Uber's full dataset would presumably show a lot more, but there are a number of possible reasons for this. Perhaps people in poorer neighborhoods are simply less likely to have the Uber app, either from not having a smartphone or a credit card. Maybe in 2014 Uber hadn't been able to make inroads outside working professionals. With some census data, it would be interesting to compare the number of pickups for different residential areas. Could there be Uber access disparities in poorer neighborhoods? What about minority-dominated neighborhoods? What about areas with few professional workers or with different mixes of transit options? These would require additional datasets, which I have downloaded but will not discuss further here.

For now, let's give a quick look at the area around E 96th St.

In [None]:
def UESHarlemMap(df,ax=None,xbins=50):
    if ax is None:
        fig = plt.figure(figsize=(10,10))
        ax = fig.add_subplot(111)
    ax.grid(False)

    xlims = [-73.96,-73.93]
    ylims = [40.75,40.85]
    ## hist2d needs an integer number of bins
    ybins = int(round(xbins * aspect*(ylims[1]-ylims[0] )/(xlims[1]-xlims[0])))
    hist = ax.hist2d(df.Lon,df.Lat,bins=[xbins,ybins],range=[xlims,ylims],cmap='inferno',norm=mpl.colors.LogNorm(vmin=1))
    
UESHarlemMap(df,None,25)
#plt.show()

There is a very sharp boundary here, especially considering that this is a logarithmic color map. I'm getting errors when trying to display the color bar, so I've left it out so far.
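
For reference, here is one way a color bar might be attached to a plot like this. It is a sketch on synthetic points, not a diagnosis of the errors mentioned above. `ax.hist2d` returns a tuple whose last element is the image object, and that object can be passed to `fig.colorbar`. The `cmin=1` argument leaves empty bins blank so the logarithmic scale never sees a zero.

```python
import matplotlib
matplotlib.use('Agg')            # headless backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
import numpy as np

rng = np.random.default_rng(0)   # synthetic points standing in for pickups
lon = rng.normal(-73.95, 0.005, 5000)
lat = rng.normal(40.78, 0.02, 5000)

fig, ax = plt.subplots(figsize=(5, 8))
# hist2d returns (counts, xedges, yedges, image)
counts, xedges, yedges, image = ax.hist2d(
    lon, lat, bins=[25, 110], range=[[-73.96, -73.93], [40.75, 40.85]],
    cmap='inferno', norm=mcolors.LogNorm(), cmin=1)
cbar = fig.colorbar(image, ax=ax)
cbar.set_label('Pickups per bin')
```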

Now, what happens if we look at the time distributions of pickups in the Upper East Side and in Harlem? For simplicity, I've just chosen boxes rather than trying to tilt the regions to fit the Manhattan grid orientation.
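
Since the same kind of box selection is used in several places, it could be wrapped in a small helper. This is a minimal sketch; `in_box` is a hypothetical name, and the three sample rows are made up. The box limits are the ones used for the two regions in this notebook.

```python
import pandas as pd

def in_box(df, lat_range, lon_range):
    """Rows whose pickup lies strictly inside an axis-aligned lat/lon box."""
    mask = ((df['Lat'] > lat_range[0]) & (df['Lat'] < lat_range[1]) &
            (df['Lon'] > lon_range[0]) & (df['Lon'] < lon_range[1]))
    return df[mask]

# Made-up pickups: one in each box and one outside both
pickups = pd.DataFrame({'Lat': [40.775, 40.800, 40.750],
                        'Lon': [-73.948, -73.948, -73.990]})

ues = in_box(pickups, (40.77, 40.78), (-73.95, -73.946))
harlem = in_box(pickups, (40.79, 40.81), (-73.95, -73.946))
print(len(ues), len(harlem))
```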

In [None]:
df_ues = df[ (df.Lat > 40.77) & (df.Lat < 40.78) & (df.Lon >-73.95) & (df.Lon < -73.946)]
df_sh = df[ (df.Lat > 40.79) & (df.Lat < 40.81) & (df.Lon > -73.95 ) & (df.Lon < -73.946)]

fig = plt.figure()
## density=True replaces the normed=True argument removed from newer matplotlib
plt.hist(df_ues.DayTime,alpha=0.5,density=True,label='Upper East Side',range=[0,24],bins=24)
plt.hist(df_sh.DayTime,alpha=0.5,color='r',density=True,label='Harlem',range=[0,24],bins=24)
plt.legend(loc='upper right')
plt.xlim([0,24])


#plt.show()

Interestingly, not only are there far more pickups on the Upper East Side compared to nearby parts of Harlem, but the time distributions are very different. The Harlem pickups are much more uniform, with two broad peaks during rush hour. The afternoon rush hour is also later than that in the Upper East Side. This might point to different work schedules for commuters in the two areas.
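
To put a number on "very different," one option would be a two-sample Kolmogorov-Smirnov test on the two `DayTime` samples. This is a technique I am swapping in for illustration, not something done in the analysis above. The sketch uses `scipy.stats.ks_2samp` on synthetic data, and it ignores the fact that time of day wraps around at midnight.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Synthetic stand-ins for the DayTime columns of two neighborhoods
area_a = rng.normal(18.0, 2.0, 400) % 24     # pickups concentrated in the evening
area_b = rng.uniform(0.0, 24.0, 400)         # pickups spread evenly over the day

result = stats.ks_2samp(area_a, area_b)
print('KS statistic: %.3f, p-value: %.3g' % (result.statistic, result.pvalue))
```

A small p-value would suggest the two samples do not come from the same distribution, though with real data the sample sizes in the two boxes would need checking first.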

# Manhattan Uber Traffic Animation

Now, I'll create an animated HTML5 image to show how Uber traffic changes throughout the day in the region of Manhattan where most pickups occur. I'll make a new image every 30 minutes, so I've reduced the number of bins to get rid of some of the noise.
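
With one image every 30 minutes there are 48 frames per day. As a small sketch of how a frame index maps to the time window shown in its title, using a hypothetical helper `frame_window`:

```python
def frame_window(i, minutes_per_frame=30):
    """Return ('HH:MM', 'HH:MM') for the start and end of frame i."""
    start = i * minutes_per_frame
    end = start + minutes_per_frame
    # divmod splits minutes since midnight into (hours, minutes)
    def fmt(m):
        return '%02d:%02d' % divmod(m, 60)
    return fmt(start), fmt(end)

for i in (0, 1, 47):
    print('Uber Pickups between %s and %s' % frame_window(i))
```

Working in integer minutes avoids the floating-point rounding that fractional hours can introduce.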

In [None]:
### Note: This does not work in all browsers, seems like it doesn't work in Kaggle due to missing libraries.
## Uncomment the last two lines if running on your own computer.

class ManhattanAnimator:
    def __init__(self,df,nbins):
        self._df = df
        self._nbins = nbins
        self._fig = plt.figure(figsize=(8,8))
        self._ax = self._fig.add_subplot(111)
        self._xlims = [-74.03,-73.94]
        self._ylims = [40.69,40.825]

        aspect =  1.3200187714761737

        ## np.histogram2d needs an integer number of bins
        ybins = int(round(self._nbins * aspect*(self._ylims[1]-self._ylims[0] )/(self._xlims[1]-self._xlims[0])))
        self._hists = []
        self._max = 0
        
        for i in range(48):
            ## Select the pickups in the i-th half-hour window
            window = df[(df.DayTime>=i*0.5) & (df.DayTime <(i+1)*0.5)]
            hist,xedges,yedges = np.histogram2d(window.Lon,window.Lat,range=[self._xlims,self._ylims],bins=[self._nbins,ybins])
            self._max = max(self._max, np.max(hist))
            self._hists.append( hist )
        ## Use the maximum over all frames (not just the last one) for the color scale
        self._im = self._ax.imshow(self._hists[0].transpose()[::-1],interpolation='nearest',cmap='inferno',norm=mpl.colors.LogNorm(vmin=1,vmax=self._max),extent=[self._xlims[0],self._xlims[1],self._ylims[0],self._ylims[1]],animated=True)
        self._ax.set_xlim(self._xlims)
        self._ax.set_ylim(self._ylims)
        self._ax.set_xlabel('Longitude [deg]')
        self._ax.set_ylabel('Latitude [deg]')
        self._ax.set_title("Uber Pickups between 0:00 and 0:30")
        self._ax.grid(False)

    def init_anim(self):
        #self._line.set_data([], [])
        return (self._im,)
    def animate(self,i):

        self._im.set_array(self._hists[i].transpose()[::-1])
        self._ax.set_title("Uber Pickups between %02i:%02i and %02i:%02i"%(int(i*0.5),int( 60*((i*0.5)%1) ),int((i+1)*0.5),int(60*(((i+1)*0.5)%1)+0.0001))) 

        return (self._im,)
    def run_animation(self):
        from matplotlib import animation, rc
        from IPython.display import HTML, Image
        ## The avconv writer has been removed from newer matplotlib; ffmpeg is the usual choice
        mpl.rcParams['animation.writer'] = 'ffmpeg'
        mpl.rcParams['animation.html'] = 'html5'
        ## Build the animation from the figure and image created in __init__
        anim = animation.FuncAnimation(self._fig, self.animate, init_func=self.init_anim,frames=48, interval=1000, blit=True)
        plt.close(self._fig)
        #anim.save('manhattan_uber.gif', writer='imagemagick', fps=1)
        #Image(url='manhattan_uber.gif')
        return HTML(anim.to_html5_video()) 

    
#a = ManhattanAnimator(df,25)
#a.run_animation()

First, it looks like the last two images of my animation aren't displaying properly. Not sure why. 

There are a lot of features to see here. Late at night, we see two main clusters that stay bright. I believe these are the Meatpacking District (lots of clubs, shuts down around 4:00) and the Lower East Side (around St Marks Pl - tons of bars, shuts down around 2:30-3:00). Now, looking at the commute, the morning commute starts with a lot of activity around the Upper West Side, Upper East Side, and a broad area including Greenwich Village, Lower East Side, etc. The Village and nearby areas stay bright throughout the day while the UWS and UES darken after the morning rush hour.

Midtown is the site of the most activity, and it's interesting to see how much more active Midtown is compared to the Financial District. One thing I noticed is that the center of Uber activity in Midtown shifts from somewhere probably around Madison or Park over to Times Square after around 8 pm. This may suggest either that offices around Times Square keep different hours or that night activity there is dominated by tourists.

# Anomalous Number of Rides on April 7 and 30
Recall how we saw that April 7th and April 30th seemed to have too many rides. Let's see if we can see any evidence for where that might have come from.

In [None]:
fig = plt.figure(figsize=[10,5])
ax = fig.add_subplot(121)
ax.set_title('April 1-6,8-29')
ManhattanMap(df[ (df.Day != 7)&(df.Day!=30) ],ax=ax,xbins=40)

ax = fig.add_subplot(122)
ax.set_title('April 7')
ManhattanMap(df[(df.Day == 7)],ax=ax,xbins=40)

fig = plt.figure(figsize=[10,5])
ax = fig.add_subplot(121)
ax.set_title('April 1-6,8-29')
ManhattanMap(df[ (df.Day != 7)&(df.Day!=30) ],ax=ax,xbins=40)

ax = fig.add_subplot(122)
ax.set_title('April 30')
ManhattanMap(df[(df.Day == 30)],ax=ax,xbins=40)


Nothing too obvious here. April 7th is a Monday, so maybe we should really compare to other Mondays. I also haven't looked at the Base variable at all, so maybe something with the data changed and the increase on the 30th is actually artificial. What we should really do here is look at the ratio of a typical day to one of these days.
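
The ratio idea could be sketched roughly as follows, with NumPy only and synthetic points in place of the real pickups. `ratio_map` is a hypothetical helper. It divides one day's 2D histogram by the per-day average of a reference set and leaves bins with no reference pickups as NaN. The 28 reference days correspond to April without the 7th and the 30th.

```python
import numpy as np

def ratio_map(lon_day, lat_day, lon_ref, lat_ref, n_ref_days, bins, extent):
    """Per-bin ratio of one day's pickups to the average reference day."""
    day, _, _ = np.histogram2d(lon_day, lat_day, bins=bins, range=extent)
    ref, _, _ = np.histogram2d(lon_ref, lat_ref, bins=bins, range=extent)
    typical = ref / n_ref_days
    ratio = np.full(day.shape, np.nan)            # NaN where the reference is empty
    np.divide(day, typical, out=ratio, where=typical > 0)
    return ratio

rng = np.random.default_rng(1)
extent = [[-74.03, -73.92], [40.69, 40.88]]       # same box as ManhattanMap
# Synthetic data: 28 reference days at 1000 pickups each, one busy day at 2000
lon_ref, lat_ref = rng.uniform(-74.03, -73.92, 28000), rng.uniform(40.69, 40.88, 28000)
lon_day, lat_day = rng.uniform(-74.03, -73.92, 2000), rng.uniform(40.69, 40.88, 2000)

ratio = ratio_map(lon_day, lat_day, lon_ref, lat_ref, n_ref_days=28,
                  bins=[10, 23], extent=extent)
print(np.nanmedian(ratio))
```

On real data, bins where the ratio departs strongly from the citywide ratio would be the places to look for whatever caused the extra rides.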

## More things to do

Some other things that would be nice to look at (I haven't done these yet) include the distributions of pickups at the airports. How do the different airports compare? Maybe they don't have the same schedules. How does the number of Uber pickups compare to the total passenger traffic? It might be hard to find data on the number of people whose final destination was one of the airports. Also, how do the different terminals compare?

As I mentioned earlier, connecting this data to census data could be fascinating since we might get clues about what kinds of people use Uber, what kinds of neighborhoods are underserved, and more.