This notebook demonstrates how to use the [Distance Matrix API](https://developers.google.com/maps/documentation/distance-matrix/overview), a Google service that provides travel distance and time for a matrix of origins and destinations, to get more realistic distances than the conventional method gives.

## 1. Conventional Method vs Distance Matrix API

### Conventional method
The conventional method, a.k.a. the [Haversine formula](https://en.wikipedia.org/wiki/Haversine_formula), is an accurate way of computing the great-circle distance between two points on the surface of a sphere from their latitudes and longitudes. One of its primary applications is navigation. You can find an implementation in this [excellent notebook](https://www.kaggle.com/madhurisivalenka/cleansing-eda-modelling-lgbm-xgboost-starters) by Sivalenka.
However, the formula returns the shortest distance over the surface between two points, a path that, as you know, we can almost never take in everyday traffic.
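
For reference, the formula can be written as follows, with $\varphi_1, \varphi_2$ the two latitudes and $\lambda_1, \lambda_2$ the two longitudes in radians, and $R$ the radius of the sphere (the code later in this notebook uses 6371 km for the Earth):

$$
a = \sin^2\left(\frac{\varphi_2 - \varphi_1}{2}\right) + \cos\varphi_1 \cos\varphi_2 \sin^2\left(\frac{\lambda_2 - \lambda_1}{2}\right), \qquad c = 2\,\operatorname{atan2}\left(\sqrt{a}, \sqrt{1 - a}\right), \qquad d = R\,c
$$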

### Distance Matrix API
If you want more accurate distances from lat/lon pairs, one option you should consider is the `Distance Matrix API`. It calculates distances based on the Google Maps routing algorithm, which many of us use in daily life. Whether or not you trust Google Maps, it should give you more realistic results than the `Haversine formula`. Moreover, it lets you choose from a list of travel modes, such as driving, walking, and transit, and returns the travel duration for the given mode.
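
To make the travel-mode option concrete, here is a minimal sketch of the HTTP request that sits behind the API. The helper `build_distance_matrix_url`, the placeholder key, and the coordinates are all made up for illustration; the rest of this notebook uses the `googlemaps` client instead of building URLs by hand.

```python
from urllib.parse import urlencode

# Public HTTP endpoint of the Distance Matrix API (JSON output).
BASE_URL = "https://maps.googleapis.com/maps/api/distancematrix/json"

def build_distance_matrix_url(origins, destinations, mode="driving", key="YOUR_API_KEY"):
    """Build a request URL. Hypothetical helper, not part of any library."""
    def fmt(points):
        # Multiple points are separated by '|', each written as 'lat,lng'.
        return "|".join("{},{}".format(lat, lng) for lat, lng in points)

    params = {
        "origins": fmt(origins),
        "destinations": fmt(destinations),
        "mode": mode,  # driving, walking, bicycling, or transit
        "key": key,    # placeholder, replace with your own key
    }
    return BASE_URL + "?" + urlencode(params)

# Made-up coordinates, only to show the shape of the URL.
url = build_distance_matrix_url([(40.7214, -73.8443)], [(40.7123, -73.8416)], mode="walking")
print(url)
```

Only the `mode` parameter changes between travel types; the origins and destinations stay the same.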

However, the `Distance Matrix API` has two major drawbacks. The first is that it is time-consuming: it takes about 20 to 23 seconds to process 1,000 inputs on my computer, so be prepared for a long run time if your dataset is large.

The second, which I think is the biggest disadvantage, is that you only get a limited number of free requests to this API. Google gives you a $300 free trial for its products; after you use up the trial credit, you need to pay for additional access. For reference, I used up my free trial, and it covered about 650 thousand samples. So make sure you know what you are doing before you jump in.
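
To put the two figures above together, here is a rough back-of-the-envelope estimate of the run time for about 650,000 samples at 20 to 23 seconds per 1,000 inputs. It assumes throughput stays constant, which may not hold on a different machine or network.

```python
# Figures quoted above: 20 to 23 seconds per 1,000 inputs, about 650,000 samples.
SECONDS_PER_1000_LOW = 20
SECONDS_PER_1000_HIGH = 23
N_SAMPLES = 650_000

def estimated_hours(n_samples, seconds_per_1000):
    """Linear extrapolation of the measured throughput to n_samples."""
    return n_samples / 1000 * seconds_per_1000 / 3600

low = estimated_hours(N_SAMPLES, SECONDS_PER_1000_LOW)
high = estimated_hours(N_SAMPLES, SECONDS_PER_1000_HIGH)
print("Estimated run time: {:.1f} to {:.1f} hours".format(low, high))
```

Under that assumption the full run takes roughly 3.6 to 4.2 hours.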

### Let's get started!

In [None]:
# install package 
!pip install googlemaps

In [None]:
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns
from tqdm import tqdm
import time
import gc

pd.options.display.float_format = '{:,.2f}'.format

In [None]:
train = pd.read_csv('../input/new-york-city-taxi-fare-prediction/train.csv', nrows = 700_000) 
print(train.shape)
train.head()

In [None]:
# Load the test set
test = pd.read_csv('../input/new-york-city-taxi-fare-prediction/test.csv')
print(test.shape)
test.describe()

### NaN Check

In [None]:
train.isna().sum()

## 2. Preprocessing
| Column | Action |
| --- | --- |
| fare_amount | keep values between 0 and 500 |
| longitude | drop outliers and NaN |
| latitude | drop outliers and NaN |
| passenger_count | drop values > 6 (to align with the test data) |

In [None]:
# Remove outliers and NANs

# Select fare_amount between 0 and 500
train = train[train.fare_amount.between(0,500)]

# Select pickup_longitude and dropoff_longitude between -75 and -71
train = train[(train.pickup_longitude.between(-75,-71)) & (train.dropoff_longitude.between(-75,-71))]

# Select pickup_latitude and dropoff_latitude between 38 and 42
train = train[(train.pickup_latitude.between(38,42)) & (train.dropoff_latitude.between(38,42))]

# Select passenger_count less than or equal to 6
train = train[train.passenger_count <= 6]

# Drop rows where either dropoff coordinate is missing
train = train.dropna(subset=['dropoff_longitude', 'dropoff_latitude'])


# Google can't find a route for this instance
#train.drop(index=[6817], inplace=True)

train = train.reset_index(drop=True)

In [None]:
print('Data After preprocessing:')
train.describe()

## 3. Distance Matrix API
Let's import the package that we just installed.

In [None]:
import googlemaps 

The second step is to enter your API key and set up the `Client`.

To get the API key, go to [Getting started with Google Maps Platform](https://developers.google.com/maps/gmp-get-started).


In [None]:
# Enter the key you got from Google (API_key is a placeholder; I removed mine here)
gmaps = googlemaps.Client(key=API_key) 

Next, convert the coordinates whose distance you want to calculate into tuples.

In [None]:
pickup =  tuple(train.loc[0, ['pickup_latitude','pickup_longitude']])
dropoff = tuple(train.loc[0, ['dropoff_latitude','dropoff_longitude']])

print("Pickup at ({:.5f}, {:.5f});\nDropoff at ({:.5f}, {:.5f})".format(pickup[0], pickup[1], dropoff[0], dropoff[1]))

Once that is done, use the following code to send a request to Google.

In [None]:
matrix = gmaps.distance_matrix(pickup, dropoff, mode='driving') # driving mode

# Print these results separately
print("Pickup address: {}".format(matrix['origin_addresses'][0]))
print("Dropoff address: {}".format(matrix['destination_addresses'][0]))
print("Distance: {} m".format(matrix['rows'][0]['elements'][0]['distance']['value']))
print("Duration: {} sec".format(matrix['rows'][0]['elements'][0]['duration']['value']))
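
The cell above reads `distance` and `duration` directly, which fails with a `KeyError` when Google cannot find a route, because the element then carries only a `status` such as `ZERO_RESULTS`. A minimal sketch of a safer accessor follows; `extract_pair` is a hypothetical helper, and the two dictionaries are hand-written stand-ins shaped like API responses, not real output.

```python
def extract_pair(matrix):
    """Return (distance_m, duration_s) for a 1x1 response, or (None, None)."""
    element = matrix['rows'][0]['elements'][0]
    if element.get('status') != 'OK':
        # e.g. 'ZERO_RESULTS' or 'NOT_FOUND': no distance/duration keys present
        return None, None
    return element['distance']['value'], element['duration']['value']

# Hand-written stand-ins shaped like API responses (values are made up).
ok_response = {'rows': [{'elements': [{
    'status': 'OK',
    'distance': {'text': '1.4 km', 'value': 1375},
    'duration': {'text': '5 mins', 'value': 300},
}]}]}
failed_response = {'rows': [{'elements': [{'status': 'ZERO_RESULTS'}]}]}

print(extract_pair(ok_response))      # (1375, 300)
print(extract_pair(failed_response))  # (None, None)
```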

Wow! It shows that the distance between them is 1375 meters and that the trip takes about 5 minutes.

Let's make a simple comparison with the `Haversine formula`. I borrowed and modified the following code from [this notebook](https://www.kaggle.com/madhurisivalenka/cleansing-eda-modelling-lgbm-xgboost-starters).

In [None]:
def haversine_distance(lat1, long1, lat2, long2):
    R = 6371  #radius of earth in kilometers
    phi1 = np.radians(lat1)
    phi2 = np.radians(lat2)
    
    delta_phi = np.radians(lat2-lat1)
    delta_lambda = np.radians(long2-long1)
    
    #a = sin²((φB - φA)/2) + cos φA . cos φB . sin²((λB - λA)/2)
    a = np.sin(delta_phi / 2.0) ** 2 + np.cos(phi1) * np.cos(phi2) * np.sin(delta_lambda / 2.0) ** 2
    
    #c = 2 * atan2( √a, √(1−a) )
    c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1-a))
    
    #d = R*c
    d = (R * c) #in kilometers
    return d

distance_hd = haversine_distance(pickup[0], pickup[1], dropoff[0], dropoff[1])
print("Distance: {:.2f} m".format(distance_hd*1000))
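
As a quick sanity check of the formula, one degree of longitude along the equator should equal 1/360 of the circumference of the sphere. The sketch below re-implements the same computation with the standard library only; `haversine_km` is a name chosen for this example, not the function defined above.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371):
    """Scalar, standard-library version of the Haversine formula."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    delta_phi = math.radians(lat2 - lat1)
    delta_lambda = math.radians(lon2 - lon1)
    a = (math.sin(delta_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(delta_lambda / 2) ** 2)
    return radius_km * 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))

# One degree of longitude along the equator is 1/360 of the circumference.
expected_km = 2 * math.pi * 6371 / 360
computed_km = haversine_km(0.0, 0.0, 0.0, 1.0)
print("{:.2f} km vs {:.2f} km".format(computed_km, expected_km))
```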

Here is the difference! 

The distance suggested by the `Distance Matrix API` is 1375 m, while the distance suggested by the `Haversine formula` is 1030.76 m.

It is not surprising that the `Haversine formula` underestimates distance in this kind of problem, since it assumes you can always take a straight path to the destination.
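
The gap can be expressed as a ratio of road distance to straight-line distance. The snippet below only does the arithmetic on the two numbers reported above for this single trip; it says nothing about other trips.

```python
# Values reported above for the first training row.
road_m = 1375          # Distance Matrix API, driving
straight_m = 1030.76   # Haversine formula

detour_ratio = road_m / straight_m
extra_pct = (detour_ratio - 1) * 100
print("Road distance is {:.2f}x the straight-line distance (+{:.0f}%)".format(detour_ratio, extra_pct))
```

For this trip the driving route is about a third longer than the straight line.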


In the next section, I run a grand comparison between the `Distance Matrix API` and the `Haversine formula`. Given the free-trial limit, I calculate distances for as much of the training data as I can, then compare the two methods based on their correlation coefficients with the target value, `fare_amount`.

### Let's move on!

## 4. Grand Comparison
You can use the following code to calculate the distance and duration for the whole dataset. I do not run this block here since it is very time-consuming. Instead, I use data that I computed on my local machine.

### Distance Matrix API

In [None]:
def parse_value(matrix, name):
    # Fill with NaN if we don't get the data back (status != 'OK').
    # This happens when Google can't find a route for the given lat/lon pairs.
    # Element [i][i] pairs the i-th pickup with the i-th dropoff (the diagonal).

    res = [
        matrix[i]['elements'][i][name]['value'] if matrix[i]['elements'][i]['status']=='OK' else np.nan
        for i in range(len(matrix))
    ]
    return res


starttime = time.time()

n = train.shape[0]
DISTANCE = []
DURATION = []
i=0

gmaps = googlemaps.Client(key=API_key) #enter the key you got from Google. I removed mine here
while i < n:
    pickup = train.loc[i:i+9, ['pickup_latitude','pickup_longitude']]
    pickup = [tuple(x) for x in pickup.to_numpy()]
    
    dropoff = train.loc[i:i+9, ['dropoff_latitude','dropoff_longitude']]
    dropoff = [tuple(x) for x in dropoff.to_numpy()]
    
    matrix = gmaps.distance_matrix(pickup, dropoff, mode='driving')['rows']
    
    # gmaps returns a full matrix of distance and duration values, so I select only the diagonal elements
    distance = parse_value(matrix, 'distance')
    duration = parse_value(matrix, 'duration')
    
    DISTANCE.extend(distance)
    DURATION.extend(duration) 
    
    if i%990 == 0:
        print("Complete: {}({:.2f}%)".format(i, (i/n)*100))
        print("time: {:.2f} minutes".format((time.time()-starttime)/60))
    
    i+=10
    
train['G_Distance'] = DISTANCE
train['G_Duration'] = DURATION

print("Total time: {:.2f} minutes".format((time.time()-starttime)/60))
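
One thing worth keeping in mind about the loop above: each request sends 10 pickups and 10 dropoffs, so the response holds a full 10 x 10 matrix even though only the 10 diagonal elements are kept. Google's documentation defines the number of elements in a request as origins times destinations. Assuming usage is counted per element (an assumption worth checking against the current pricing page), the batch size directly affects how far a quota goes. The sketch below just does that arithmetic; `elements_in_request` is a hypothetical helper.

```python
def elements_in_request(n_origins, n_destinations):
    """A Distance Matrix response has one element per origin/destination pair."""
    return n_origins * n_destinations

batch_size = 10
returned = elements_in_request(batch_size, batch_size)  # full 10 x 10 matrix
used = batch_size                                       # only the diagonal is kept
print("Elements returned per request:", returned)
print("Elements actually used:", used)
print("Share used: {:.0%}".format(used / returned))
```

A smaller batch size returns fewer unused elements per request, at the cost of sending more requests.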

In [None]:
train = pd.read_csv('../input/datacsv/data.csv')
train.rename(columns={'distance':'G_Distance','duration':'G_Duration'}, inplace=True)
print(train.shape)
train.head(3)

### Haversine Formula

In [None]:
# Calculate Haversine Distance for the entire dataset

def haversine_distance(data):
    R = 6371
    phi1 = np.radians(data['pickup_latitude'])
    phi2 = np.radians(data['dropoff_latitude'])
    
    delta_phi = np.radians(data['dropoff_latitude'] - data['pickup_latitude'])
    delta_lambda = np.radians(data['dropoff_longitude'] - data['pickup_longitude'])
    
    a = np.sin(delta_phi / 2.0) ** 2 + np.cos(phi1) * np.cos(phi2) * np.sin(delta_lambda / 2.0) ** 2
    
    c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1-a))
    d = (R * c) #in kilometers
    data['H_Distance'] = d*1000
    return data


train = haversine_distance(train)
train.head()

### Correlation Coefficients

In [None]:
# Take a look at the correlation matrix
train[['fare_amount','G_Distance','G_Duration','H_Distance']].corr()

Google Maps wins the competition! The correlation coefficient between the `Distance Matrix API` distance and the fare is at the top of the board, while the `Haversine formula` still performs very decently.
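
For readers who want to see what is being compared, `DataFrame.corr()` computes the Pearson correlation coefficient by default. Here is a minimal standard-library sketch of that coefficient; the toy values are made up only to exercise the function and are not taken from the dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up toy values, only to exercise the function.
distance_m = [1000, 2000, 3000, 4000]
fare = [5.0, 7.5, 10.0, 12.5]   # exactly linear in distance
print(pearson(distance_m, fare))
```

A value close to 1 means the fare rises almost linearly with the distance feature, which is why a higher coefficient is read here as a better distance estimate.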

## Summary
Here is a summary of the pros and cons of the two methods. For the `Haversine formula`:
* **Pros**
    1. **Budget**: It is FREE!!!
    2. **Efficiency**: Easy to calculate.
    3. **Decent Baseline**: It is always worth trying this method since it gives a good approximation of the real distance.
* **Cons**
    1. **Accuracy**: Although it seems to perform well in this task, it is hard to say how it will do on other tasks, because it only approximates real traffic scenarios.
    
    
For the `Distance Matrix API`:
* **Pros**
    1. **Accuracy**: The `Distance Matrix API` provides values that are closer to reality.
    2. **Additional Information**: It also provides additional information, such as travel time (duration) and address names.
* **Cons**
    1. **Time-Consuming**: It is much slower than the `Haversine formula`.
    2. **Limited Number of Requests**: You will be charged once you exceed a certain number of requests.