I started looking at the data yesterday and had a lot of fun with the Google Maps API. The notebook won't run properly on the Kaggle site because it needs geocoder for one step. You can install it with


    conda install geocoder

or if you're not using Anaconda

    pip install geocoder

The notebook also sends requests to the Google Maps API, which I couldn't get working in the Kaggle Docker container.

The environ mapping (imported from os) is a way to supply the API key from an environment variable without revealing it to the world. Wherever it is used in a function call, you can replace

    api_key=environ["GOOGLE_API_KEY"]

with the key string

    api_key='my_secret_string'

So excuse me for all the error messages. If you download the notebook, install geocoder and sort out the Google API key, it will do its thing.

N.B. You get 2,500 API calls without a key before getting blocked for the day.
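
If the environment variable isn't set, environ["GOOGLE_API_KEY"] raises a KeyError. Here is a minimal sketch of a gentler lookup, assuming the same variable name as above (the helper name load_api_key is made up for illustration):

```python
from os import environ

def load_api_key(var_name="GOOGLE_API_KEY"):
    """Return the key stored in the environment variable, or None if it is unset."""
    return environ.get(var_name)

api_key = load_api_key()
if api_key is None:
    print("GOOGLE_API_KEY is not set, so requests would go out without a key")
```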




In [None]:
import pandas as pd
import seaborn as sns
import json, requests
from os import environ
%matplotlib inline
import geocoder



http://www.nyc.gov/html/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf

### Exploratory steps

In [None]:
df = pd.read_csv('../input/train.csv')
df.head()

In [None]:
df.dtypes

### Datetimes to datetime objects 

In [None]:
df['pickup_datetime'] = pd.to_datetime(df['pickup_datetime'])

df['dropoff_datetime'] = pd.to_datetime(df['dropoff_datetime'])

### What is the store and forward flag?

In [None]:
df['store_and_fwd_flag'].unique()

### Convert 'store_and_fwd_flag' to ints for easier play

In [None]:
df['store_and_fwd_flag'] = df['store_and_fwd_flag'].map({'Y':1,'N':0})

### Check 'id' is a unique identifier

In [None]:
len(df['id'].unique()) 

In [None]:
len(df['id'].unique()) == df.shape[0]

In [None]:
df['id'].str.startswith('id').sum() # every one starts with 'id'

As the 'id' field is still unique without the 'id' string at the start, I'll remove it to explore

In [None]:
df['id'] = df['id'].apply(lambda x: int(x[2:]))

Is it still unique?

In [None]:
df['id'].count() == len(df['id'].unique())
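
The same uniqueness checks can be written more compactly with pandas' is_unique property. A small sketch on made-up ids (the toy frame is an assumption, not the real data):

```python
import pandas as pd

# Toy ids in the same 'id' + digits format as the training data
toy = pd.DataFrame({'id': ['id0000001', 'id0000002', 'id0000003']})

print(toy['id'].is_unique)

# Strip the leading 'id' with the vectorised string accessor, then cast to int
numeric_id = toy['id'].str[2:].astype(int)
print(numeric_id.is_unique)
```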

In [None]:
df['id'].min()

In [None]:
df['id'].max()

How many vendors?

In [None]:
df['vendor_id'].unique()

In [None]:
df['vendor_id'][df['vendor_id'] == 1].count() # How many vendor 1?

In [None]:
df['vendor_id'].count() - df['vendor_id'][df['vendor_id'] == 1].count() # How many vendor 2?

In [None]:
sns.barplot(x = [1,2],y = [678342,780302])
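
Rather than typing the two counts in by hand, they can be computed in one go with value_counts. A sketch on toy data (the vendor ids below are invented):

```python
import pandas as pd

toy = pd.DataFrame({'vendor_id': [1, 2, 2, 1, 2]})

# One row per vendor, sorted by vendor id rather than by frequency
vendor_counts = toy['vendor_id'].value_counts().sort_index()
print(vendor_counts)
```

The result could then feed the plot directly, for example sns.barplot(x=vendor_counts.index, y=vendor_counts.values).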

What is the spread of passenger numbers?

In [None]:
sns.boxplot(df['passenger_count'])

That's interesting, there are some trips with 0 passengers. Trips with 4 to 9 passengers are rarer.

![](http://m8.i.pbase.com/g1/26/574826/2/116238228.Wsb0Eo5S.jpg)

Let's look at the proportions

In [None]:
from collections import Counter

In [None]:
passenger_counts = Counter(df['passenger_count'])
passenger_counts

In [None]:
pd.DataFrame({'Count':passenger_counts}).plot(kind='bar',title='Number of passengers count frequency')

A single passenger is by far the most common situation. There are still 200,000 trips with 2 passengers but we don't have an even representation of these groups in the sample. Also wondering what type of taxi can seat more than 4 plus the driver.
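
To see proportions rather than raw counts, value_counts(normalize=True) gives each group's share of the trips. A sketch on toy passenger counts (invented values):

```python
import pandas as pd

toy = pd.DataFrame({'passenger_count': [1, 1, 1, 2]})

# Fraction of trips for each passenger count, ordered by passenger count
proportions = toy['passenger_count'].value_counts(normalize=True).sort_index()
print(proportions)
```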

### I wonder if the trip duration is the same as the difference between the pickup and dropoff time

In [None]:
df['trip_duration_delta'] = df['dropoff_datetime'] - df['pickup_datetime']

In [None]:
trip_delta = df['trip_duration_delta']

In [None]:
trip_delta.sort_values().head(20)

That's intriguing, lots of trips lasting for 1 second (or I'm thinking that's the smallest value that isn't 0)

In [None]:
trip_delta.sort_values().tail()

Wow! Those are some looooong trips!

In [None]:
trip_delta[trip_delta > '1 days 00:00:00']

Who hires a taxi for three weeks???? Who are these people?

![](https://s-media-cache-ak0.pinimg.com/736x/f4/7c/c2/f47cc2622993b33e30d05c78149fa082--ny-fashion-week-new-york-fashion.jpg)

In [None]:
df.iloc[355003]

Not explained by pickup and dropoff

In [None]:
df.iloc[trip_delta[trip_delta > '1 days 00:00:00'].index.values]
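
The lookup above works because the frame still has its default 0..n-1 index, so labels and positions coincide. A sketch of the same filter with an explicit pd.Timedelta and a boolean mask, which doesn't rely on that (toy data, invented timestamps):

```python
import pandas as pd

toy = pd.DataFrame({
    'pickup_datetime': pd.to_datetime(['2016-01-01 10:00:00', '2016-01-01 12:00:00']),
    'dropoff_datetime': pd.to_datetime(['2016-01-01 10:15:00', '2016-01-03 12:00:00']),
})
toy_delta = toy['dropoff_datetime'] - toy['pickup_datetime']

# The boolean mask selects rows by label, so it works for any index
long_trips = toy.loc[toy_delta > pd.Timedelta(days=1)]
print(long_trips)
```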

I wonder which vendor is #1? 1= Creative Mobile Technologies, LLC; 2= VeriFone Inc.

http://creativemobiletech.com/about-cmt-2/

Their website doesn't give much away. A booking company.

------------

### I'd like to compare the trip duration I've calculated against the trip duration column supplied

In [None]:
delta_seconds = df['trip_duration_delta'].dt.total_seconds()

In [None]:
delta_seconds.head()

In [None]:
df['trip_duration'].head()

In [None]:
df[delta_seconds != df['trip_duration']]

### So there's a difference here!

In [None]:
delta_seconds = df['trip_duration_delta'].dt.total_seconds().apply(int) # int() truncates the float seconds to whole seconds

In [None]:
df[delta_seconds != df['trip_duration']]

Relax everyone! Just a rounding error...
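
Another way to make this comparison is to allow a small tolerance instead of requiring exact equality. A sketch on toy rows (the timestamps and the 1 second tolerance are my own choices for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    'pickup_datetime': pd.to_datetime(['2016-03-14 17:24:55', '2016-06-12 00:43:35']),
    'dropoff_datetime': pd.to_datetime(['2016-03-14 17:32:30', '2016-06-12 00:54:38']),
    'trip_duration': [455, 663],
})

toy_seconds = (toy['dropoff_datetime'] - toy['pickup_datetime']).dt.total_seconds()

# Flag only rows where the two durations differ by more than one second
mismatch = (toy_seconds - toy['trip_duration']).abs() > 1
print(mismatch.sum())
```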

I'd like to look at distance between pickup and dropoff, but direct distance between two points isn't the same as driving through Manhattan. Google Maps will give distance and an estimate of driving time based on historic data.
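
For reference, the direct ("as the crow flies") distance can be approximated with the haversine formula. This is a sketch only: it assumes a spherical Earth with a mean radius of 6371 km, and the function name is my own.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in kilometres between two (lat, lng) points in degrees."""
    lat1, lng1, lat2, lng2 = map(radians, (lat1, lng1, lat2, lng2))
    dlat = lat2 - lat1
    dlng = lng2 - lng1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlng / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

# One degree of longitude along the equator
print(haversine_km(0.0, 0.0, 0.0, 1.0))
```

This gives a lower bound on the driving distance, which is why the road-based estimate from Google Maps is more interesting here.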

### Let's look at the first trip as a test

In [None]:
df.loc[0,['pickup_latitude','pickup_longitude']].values # First pickup for testing

Get the address with geocoder

In [None]:
g = geocoder.google([40.767936706542969, -73.982154846191392],method='reverse')
g.address

### I'll setup a routine for taking the coordinates and sending a query to google maps

In [None]:
pulat,pulng,dolat,dolng = df.loc[0,['pickup_latitude','pickup_longitude','dropoff_latitude','dropoff_longitude']]

In [None]:
print(pulat,pulng,dolat,dolng)

To make it a bit more manageable, I'll break the API call into pieces and put it back together later. The 'origins' and 'destinations' are the pieces of information that vary.

In [None]:
google = "http://maps.googleapis.com/maps/api/distancematrix/json?"
geo = "origins={a},{b}&destinations={c},{d}"
reply_type = "&mode=driving&language=en-EN&sensor=false"

In [None]:
# The geo piece requires the coordinates for pickup and destination

q = google+geo.format(a=pulat,b=pulng,c=dolat,d=dolng)+reply_type
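
An alternative to gluing the query string together by hand is to let the standard library build it from a dict. A sketch, not what the notebook actually runs: the function name is hypothetical, only the parameters already used above are included, and no request is sent.

```python
from urllib.parse import urlencode

def build_distance_matrix_url(pulat, pulng, dolat, dolng):
    """Assemble the Distance Matrix URL for one pickup/dropoff pair."""
    base = "https://maps.googleapis.com/maps/api/distancematrix/json"
    params = {
        "origins": f"{pulat},{pulng}",
        "destinations": f"{dolat},{dolng}",
        "mode": "driving",
    }
    # safe="," keeps the comma between latitude and longitude readable
    return base + "?" + urlencode(params, safe=",")

print(build_distance_matrix_url(40.7, -73.9, 40.8, -74.0))
```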

In [None]:
result= json.loads(requests.get(q).text)
result

The piece I'm interested in is the duration in seconds, which is nested pretty deep.

In [None]:
result['rows'][0]['elements'][0]['duration']['value']
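
That chain of lookups raises an error as soon as one reply has no route or comes back empty. A defensive sketch, with a hypothetical helper name and hand-written dicts standing in for real API replies:

```python
def extract_duration_seconds(result):
    """Return the driving duration in seconds, or None if the reply has no duration."""
    try:
        return result['rows'][0]['elements'][0]['duration']['value']
    except (KeyError, IndexError, TypeError):
        return None

# Hand-written stand-ins shaped like the reply shown above
good_reply = {'rows': [{'elements': [{'status': 'OK',
                                      'duration': {'text': '8 mins', 'value': 480}}]}]}
no_route_reply = {'rows': [{'elements': [{'status': 'ZERO_RESULTS'}]}]}

print(extract_duration_seconds(good_reply))
print(extract_duration_seconds(no_route_reply))
```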

So this trip is df.loc[0]

In [None]:
df.loc[0,'trip_duration']

Not bad! I wonder how close google is with driving time predictions?

In [None]:
def get_google_estimate_now(pulat,pulng,dolat,dolng):
    google = "http://maps.googleapis.com/maps/api/distancematrix/json?"
    geo = "origins={a},{b}&destinations={c},{d}"
    reply_type = "&mode=driving&language=en-EN&sensor=false"
    q = google+geo.format(a=pulat,b=pulng,c=dolat,d=dolng)+reply_type
    result= json.loads(requests.get(q).text)
    return result 

I've called this get_google_estimate_now because google maps is estimating the time for traffic conditions as they are now

In [None]:
x = df.loc[0,['pickup_latitude','pickup_longitude','dropoff_latitude','dropoff_longitude']]

In [None]:
get_google_estimate_now(*x)

Let's try a small number of trips

In [None]:
small_df = df.loc[0:3,['pickup_latitude','pickup_longitude','dropoff_latitude','dropoff_longitude']]

In [None]:
for vals in small_df.values:
    print(get_google_estimate_now(*vals))

In [None]:
df.loc[0:3,['trip_duration']]

Let's look at the first 1000 rows and plot some results

In [None]:
google_estimate = []
df_coordinates = df.loc[:1000,['pickup_latitude','pickup_longitude','dropoff_latitude','dropoff_longitude']]

for vals in df_coordinates.values:
    google_estimate.append(get_google_estimate_now(*vals))
    print(google_estimate[-1:])

Printing out each result would have slowed that down, but it's a cheap way to see if it's still processing.

Google are going to get tired of me bashing their API without registering. I'd better get a key before doing much more.

In [None]:
len(google_estimate)

Oh that's right, .loc is up to "and including" the last index, not the normal up to "but not including" the last index that .iloc gives you. Oh well! One more won't hurt.
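
A tiny demonstration of that difference on a toy frame:

```python
import pandas as pd

toy = pd.DataFrame({'a': range(10)})

# .loc slices by label and includes the end label
print(len(toy.loc[:3]))

# .iloc slices by position and excludes the end position
print(len(toy.iloc[:3]))
```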

In [None]:
google_estimate[1]

In [None]:
estimates = [est['rows'][0]['elements'][0]['duration']['value'] for est in google_estimate]

In [None]:
estimates[:10]

In [None]:
df['trip_duration'][:10]

In [None]:
first_thousand = pd.DataFrame({'trip_time':df['trip_duration'][:len(estimates)],'est_time':pd.Series(estimates)})

In [None]:
first_thousand.head()

In [None]:
first_thousand.plot(x = 'trip_time',
                    y= 'est_time',
                    kind='scatter',
                    title='correlation between trip time and google estimate')

One outlier there making a fool out of me! Let's see who that is...

In [None]:
df['trip_duration'][:len(estimates)][df['trip_duration'][:len(estimates)]> 80000]

In [None]:
first_thousand.iloc[531]

Ok, I'll drop that row and replot

In [None]:
first_thousand.drop(531).plot(x = 'est_time',
                              y='trip_time',
                              kind='scatter',
                              title = 'correlation between trip time and google estimate')

OK, that's not bad. Some are way off but there is a reasonable amount of correlation. I wonder if this can be improved?
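
To put a number on "reasonable amount of correlation", pandas can compute the Pearson coefficient between the two columns. A sketch on invented values, using the same column names as first_thousand:

```python
import pandas as pd

toy = pd.DataFrame({'trip_time': [100, 200, 300],
                    'est_time': [110, 190, 310]})

# Pearson correlation between actual and estimated times (1.0 is a perfect linear fit)
r = toy['trip_time'].corr(toy['est_time'])
print(round(r, 3))
```

On the real data the same call would be first_thousand.drop(531)['trip_time'].corr(first_thousand.drop(531)['est_time']).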

Google Maps also allows a departure time and has historic data. I wonder if this can give a little more accuracy. The documentation for the API states that the departure time must be in the future, so I can't use 'pickup_datetime' as is. I'll get something going and work that bit out later.

In [None]:
import time
import datetime

# google maps API will take a departure time in unix time format
# unix datetime is seconds since 1st Jan 1970

def dt2ut(dt): 
    
    epoch = pd.to_datetime('1970-01-01')
    
    return (dt - epoch).total_seconds()

def format_query(pulat,pulng,dolat,dolng,unixtime,api_key):
    
    # Requests that carry an API key must go over https
    google = "https://maps.googleapis.com/maps/api/distancematrix/json?"
    geo = "origins={a},{b}&destinations={c},{d}"
    dep_time = "&departure_time={e}" # renamed so it doesn't shadow the time module
    reply_type = "&mode=driving&language=en-EN&sensor=false"
    key = '&key={f}' # the leading '&' separates the key from the previous parameter
    q = google+geo.format(a=pulat,b=pulng,c=dolat,d=dolng)+\
    dep_time.format(e=int(unixtime))+reply_type+key.format(f=api_key)
    
    return q

def get_google_estimate_future(pulat,pulng,dolat,dolng,deptime,api_key):
    
    unixtime = dt2ut(deptime)
    variables = [pulat,pulng,dolat,dolng,unixtime,api_key] 
    q = format_query(*variables)
    result= json.loads(requests.get(q).text)
    driving_results = result
    
    return driving_results
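
A quick sanity check of the Unix time conversion, with one caveat: dt2ut treats the timestamp as if it were UTC, while the pickup times are presumably New York local time. The sketch below shows both readings; the 'America/New_York' zone and the local_dt2ut helper are my assumptions, not something the dataset documents.

```python
import pandas as pd

def dt2ut(dt):
    """Seconds since 1970-01-01, treating a naive timestamp as UTC."""
    epoch = pd.to_datetime('1970-01-01')
    return (dt - epoch).total_seconds()

def local_dt2ut(dt, tz='America/New_York'):
    """Seconds since the epoch, treating a naive timestamp as local time in tz."""
    return dt.tz_localize(tz).timestamp()

sample = pd.Timestamp('2000-01-01 00:00:00')
print(dt2ut(sample))
print(local_dt2ut(sample))
```

The two values differ by the UTC offset, so the choice matters if the estimate should reflect rush hour rather than a few hours either side of it.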

In [None]:
test_dt = df.loc[0,['pickup_latitude','pickup_longitude','dropoff_latitude','dropoff_longitude', 'pickup_datetime']]

In [None]:
get_google_estimate_future(*test_dt, api_key=environ["GOOGLE_API_KEY"])

OK! Still working. Maybe I could ask about the drive time for the same date but in the future, changing the year for example. I'm going to put a pin in that for now and look at the influence of the day of the week the trip was taken on.
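
One possible way to get a future departure time, sketched here as an idea rather than something I've tested against the API: push the pickup time forward by whole weeks until it is in the future, which keeps the weekday and time of day intact (changing only the year would usually land on a different weekday). The helper name is hypothetical.

```python
import pandas as pd

def shift_to_future(dt, now):
    """Move dt forward by whole weeks until it falls after now."""
    weeks = (now - dt) // pd.Timedelta(weeks=1) + 1
    return dt + pd.Timedelta(weeks=max(weeks, 0))

pickup = pd.Timestamp('2016-03-14 17:24:55')
future = shift_to_future(pickup, now=pd.Timestamp.now())
print(pickup.day_name(), future.day_name(), future)
```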

-----------

In [None]:
pd.DatetimeIndex

### Influence of days of the week.

pd.DatetimeIndex.dayofweek

The day of the week with Monday=0, Sunday=6

In [None]:
df['day_of_week'] = df['pickup_datetime'].dt.dayofweek
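
A small check of the numbering on two known dates, using the .dt accessor on a toy Series:

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(['2016-03-14', '2016-03-20']))

# dayofweek counts from Monday=0 to Sunday=6
print(dates.dt.dayofweek.tolist())

# day_name() gives the readable equivalent
print(dates.dt.day_name().tolist())
```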

In [None]:
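# value_counts().sort_index() keeps the counts in day order (0=Mon ... 6=Sun);
# Counter(...).values() comes back in first-seen order, which can mislabel the days
trips_by_day = df['day_of_week'].value_counts().sort_index()
trips_by_day.index = ['Mon','Tue','Wed','Thur','Fri','Sat','Sun']
trips_by_day.plot(title='Number of trips by weekday', figsize=(10,8))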

In [None]:
ax = df['trip_duration'].groupby(df['day_of_week']).median().plot(title='Median length of trip by weekday',
                                                                  figsize=(10,8))
ax.set_xticks(range(7)) # pin one tick per day so the labels line up
ax.set_xticklabels(['Mon','Tue','Wed','Thur','Fri','Sat','Sun'])

The difference between the mean trip time and the median trip time makes me think that there are very long trips toward Friday and Saturday that are pulling the mean higher.
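
Putting the mean and the median side by side makes that kind of skew easy to spot. A sketch on toy data (invented durations, with one very long trip on day 0):

```python
import pandas as pd

toy = pd.DataFrame({'day_of_week': [0, 0, 0, 1, 1],
                    'trip_duration': [100, 200, 3000, 100, 300]})

# Mean and median per weekday; a large gap points to a few very long trips
summary = toy.groupby('day_of_week')['trip_duration'].agg(['mean', 'median'])
print(summary)
```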

-------------