In [19]:
import pandas as pd
import numpy as np
from math import *

**Loading the data**

In [20]:
df = pd.read_excel('clinics.xls')

In [21]:
df.head()

Unnamed: 0,bizID,bizCat,bizCatSub,bizName,bizAddr,bizCity,bizState,bizZip,bizPhone,bizFax,...,bizURL,locAreaCode,locFIPS,locTimeZone,locDST,locLat,locLong,locMSA,locPMSA,locCounty
0,1,Clinics,Clinics,Hino Ronald H MD,98-151 Pali Momi Street Suite 142,Aiea,HI,96701,(808)487-2477,,...,,808,15003,PST-2,N,21.398,-157.8981,3320.0,,Honolulu
1,2,Clinics,Clinics,Farmer Joesph F Md,1225 Breckenridge Drive,Little Rock,AR,72205,(501)225-2594,,...,,501,5119,CST,Y,34.7495,-92.3533,4400.0,,Pulaski
2,3,Clinics,Clinics & Medical Centers,Najjar Fadi Md,1155 West Linda Avenue Suite B,Hermiston,OR,97838,(541)289-1122,,...,,541,41059,PST,Y,45.8456,-119.2817,,,Umatilla
3,4,Clinics,Clinics & Medical Centers,Kittson Memorial Upper Level Nursing Home,1010 South Birch Avenue,Hallock,MN,56728,(218)843-2525,,...,,218,27069,CST,Y,48.7954,-97.009,,,Kittson
4,5,Clinics,Clinics & Medical Centers,Thompson Robert B Md,100 North Eagle Creek Drive,Lexington,KY,40509,(859)258-4000,,...,www.lexingtonclinic.com,859,21067,EST,Y,37.9935,-84.3712,4280.0,,Fayette


**Define the haversian function**

In [22]:
def haversine(lat1, lon1, lat2, lon2):
    miles_constant = 3959
    lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1 
    dlon = lon2 - lon1 
    a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
    c = 2 * np.arcsin(np.sqrt(a)) 
    mi = miles_constant * c
    return mi

**Computing Iterrows execution time**

In [37]:
%%timeit
# Haversine applied on rows via iteration
haversine_series = []
for index, row in df.iterrows():
    haversine_series.append(haversine(40.671, -73.985,\
                                      row['locLat'], row['locLong']))
df['distance'] = haversine_series

2.2 ms ± 501 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)


**Interpretation:** The iterrows applied on rows to calculate the distance between two points using haversine formula took 2.2 milli seconds of execution time which is more than the below apply function's execution time.

**Apply Haversine on rows**

Timing "apply"

In [24]:
%timeit df['distance'] =\
df.apply(lambda row: haversine(40.671, -73.985,\
                               row['locLat'], row['locLong']), axis=1)

1.03 ms ± 103 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


**Interpretation:** The apply function applied on rows to calculate the distance between two points using haversine formula took 1.03 milli seconds of execution time which is more than the below pandas series execution time. When compared with R, the same code using apply function took 8.9 micro seconds which is very less compared to that of Python.

In [25]:
%load_ext line_profiler

The line_profiler extension is already loaded. To reload it, use:
  %reload_ext line_profiler


Profiling "apply"

In [26]:
# Haversine applied on rows
%lprun -f haversine 
df.apply(lambda row: haversine(40.671, -73.985,\
                               row['locLat'], row['locLong']), axis=1)

0     4959.870106
1     1081.593651
2     2276.098880
3     1255.154292
4      584.683573
5      716.061202
6      804.302067
7     1202.595720
8      261.704147
9      723.619081
10    2500.406105
11    1183.572872
12     914.094552
13     128.899163
14     165.299805
15    1550.139984
16     862.008471
17     668.250535
18      96.684878
19    2453.917624
20    1021.420462
21     909.430381
22     760.908611
23    1625.149470
24    2115.580042
25     353.324858
26     861.054055
27     207.952978
28    1057.569755
29    1140.134525
dtype: float64

Timer unit: 1e-07 s

Total time: 0 s
File: C:\Users\pvspa\AppData\Local\Temp\ipykernel_16332\1616544886.py
Function: haversine at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
     1                                           def haversine(lat1, lon1, lat2, lon2):
     2                                               miles_constant = 3959
     3                                               lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
     4                                               dlat = lat2 - lat1 
     5                                               dlon = lon2 - lon1 
     6                                               a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
     7                                               c = 2 * np.arcsin(np.sqrt(a)) 
     8                                               mi = miles_constant * c
     9                                               return mi

**Vectorized implementation of Haversine applied on Pandas series**

Timing vectorized implementation

In [27]:
# Vectorized implementation of Haversine applied on Pandas series
%timeit df['distance'] = haversine(40.671, -73.985,\
                                   df['locLat'], df['locLong'])

1.06 ms ± 19.9 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


**Interpretation:** The pandas series vectorization applied on columns to calculate the distance between two points using haversine formula took 1.06 milli seconds of execution time which is more than the below numpy array's execution time. When compared with R, the same code to vectorize took 100 microseconds which is less than that of Python.

Profiling vectorized implementation

In [28]:
# Vectorized implementation profile
%lprun -f haversine 
df['distance'] = haversine(40.671, -73.985,\
                              df['locLat'], df['locLong'])

Timer unit: 1e-07 s

Total time: 0 s
File: C:\Users\pvspa\AppData\Local\Temp\ipykernel_16332\1616544886.py
Function: haversine at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
     1                                           def haversine(lat1, lon1, lat2, lon2):
     2                                               miles_constant = 3959
     3                                               lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
     4                                               dlat = lat2 - lat1 
     5                                               dlon = lon2 - lon1 
     6                                               a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
     7                                               c = 2 * np.arcsin(np.sqrt(a)) 
     8                                               mi = miles_constant * c
     9                                               return mi

**Vectorized implementation of Haversine applied on NumPy arrays**

Timing vectorized implementation

In [29]:
# Vectorized implementation of Haversine applied on NumPy arrays
%timeit df['distance'] = haversine(40.671, -73.985,\
                         df['locLat'].values, df['locLong'].values)

144 μs ± 19.5 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


**Interpretation:** The numpy array's vectorization applied on columns to calculate the distance between two points using haversine formula took 144 micro seconds of execution time which is less than all the above-mentioned iterrows, apply, pandas series execution times. 

In [30]:
%%timeit
# Convert pandas arrays to NumPy ndarrays
np_lat = df['locLat'].values
np_lon = df['locLong'].values

7.72 μs ± 515 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)


Profiling vectorized implementation

In [31]:
%lprun -f haversine 
df['distance'] = haversine(40.671, -73.985,\
                        df['locLat'].values, df['locLong'].values)

Timer unit: 1e-07 s

Total time: 0 s
File: C:\Users\pvspa\AppData\Local\Temp\ipykernel_16332\1616544886.py
Function: haversine at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
     1                                           def haversine(lat1, lon1, lat2, lon2):
     2                                               miles_constant = 3959
     3                                               lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
     4                                               dlat = lat2 - lat1 
     5                                               dlon = lon2 - lon1 
     6                                               a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
     7                                               c = 2 * np.arcsin(np.sqrt(a)) 
     8                                               mi = miles_constant * c
     9                                               return mi

**Cythonize that loop**

Load the cython extension

In [32]:
#pip install cython
%load_ext cython
import cython

The cython extension is already loaded. To reload it, use:
  %reload_ext cython


Run unaltered Haversine through Cython

In [33]:
%%cython -a
# Haversine cythonized (no other edits)
import numpy as np
cpdef haversine_cy(lat1, lon1, lat2, lon2):
    miles_constant = 3959
    lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1 
    dlon = lon2 - lon1 
    a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
    c = 2 * np.arcsin(np.sqrt(a)) 
    mi = miles_constant * c
    return mi

Timing

In [34]:
%timeit df['distance'] =\
       df.apply(lambda row: haversine_cy(40.671, -73.985,\
                row['locLat'], row['locLong']), axis=1)

1.32 ms ± 285 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


**Interpretation:** The cythonized version of apply to calculate the distance between two points using haversine formula took 1.32 milli seconds of execution time which is approximately 180 times more than the vectorization through numpy arrays execution time. 

Redefine Haversine with data types and C libraries

In [35]:
%%cython -a
# Haversine cythonized
from libc.math cimport sin, cos, acos, asin, sqrt

cdef deg2rad_cy(float deg):
    cdef float rad
    rad = 0.01745329252*deg
    return rad
    
cpdef haversine_cy_dtyped(float lat1, float lon1, float lat2, float lon2):
    cdef: 
        float dlon
        float dlat
        float a
        float c
        float mi
    
    lat1, lon1, lat2, lon2 = map(deg2rad_cy, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1 
    dlon = lon2 - lon1 
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a)) 
    mi = 3959 * c
    return mi

Timing

In [36]:
%timeit df['distance'] =\
df.apply(lambda row: haversine_cy_dtyped(40.671, -73.985,\
                              row['locLat'], row['locLong']), axis=1)

727 μs ± 38.3 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


**Interpretation:** The cythonized version of apply to calculate the distance between two points using redefined haversine formula to include all the float type of variables as per C programming, took 727 micro seconds of execution time which is approximately 180 times more than the vectorization through numpy arrays execution time. 