In [105]:
import pandas as pd
import numpy as np
from math import *

In [166]:
df = pd.read_excel('clinics.xls')
df.head(5)

Unnamed: 0,bizID,bizCat,bizCatSub,bizName,bizAddr,bizCity,bizState,bizZip,bizPhone,bizFax,...,bizURL,locAreaCode,locFIPS,locTimeZone,locDST,locLat,locLong,locMSA,locPMSA,locCounty
0,1,Clinics,Clinics,Hino Ronald H MD,98-151 Pali Momi Street Suite 142,Aiea,HI,96701,(808)487-2477,,...,,808,15003,PST-2,N,21.398,-157.8981,3320.0,,Honolulu
1,2,Clinics,Clinics,Farmer Joesph F Md,1225 Breckenridge Drive,Little Rock,AR,72205,(501)225-2594,,...,,501,5119,CST,Y,34.7495,-92.3533,4400.0,,Pulaski
2,3,Clinics,Clinics & Medical Centers,Najjar Fadi Md,1155 West Linda Avenue Suite B,Hermiston,OR,97838,(541)289-1122,,...,,541,41059,PST,Y,45.8456,-119.2817,,,Umatilla
3,4,Clinics,Clinics & Medical Centers,Kittson Memorial Upper Level Nursing Home,1010 South Birch Avenue,Hallock,MN,56728,(218)843-2525,,...,,218,27069,CST,Y,48.7954,-97.009,,,Kittson
4,5,Clinics,Clinics & Medical Centers,Thompson Robert B Md,100 North Eagle Creek Drive,Lexington,KY,40509,(859)258-4000,,...,www.lexingtonclinic.com,859,21067,EST,Y,37.9935,-84.3712,4280.0,,Fayette


Haversine

In [156]:
def haversine(lat1, lon1, lat2, lon2):
    miles_constant = 3959
    lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1 
    dlon = lon2 - lon1 
    a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
    c = 2 * np.arcsin(np.sqrt(a)) 
    mi = miles_constant * c
    return mi

Iterrows Haversine

In [158]:
%%timeit
haversine_series = []
for index, row in df.iterrows():
    haversine_series.append(haversine(40.671, -73.985, row['locLat'], row['locLong']))
#cProfile.run("df['distance'] = haversine_series")
df['distance'] = haversine_series

4.18 ms ± 145 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)


The %%timeit output of 4.18 ms ± 145 μs per loop (mean ± std. dev. of 7 runs, 100 loops each) indicates that the Haversine distance calculation for each row using the iterrows() method took, on average, 4.18 milliseconds per loop. The standard deviation of 145 microseconds suggests minor variability in execution times across the 7 benchmarking runs, which is typical due to factors like CPU load and memory access.

Timing apply on the Haversine rows

In [160]:
%timeit df['distance'] =\
df.apply(lambda row: haversine(40.671, -73.985,\
                               row['locLat'], row['locLong']), axis=1)

2.07 ms ± 58.6 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)


The %%timeit output of 2.07 ms ± 58.6 μs per loop (mean ± std. dev. of 7 runs, 100 loops each) indicates that using the apply() method to calculate the Haversine distance for each row in the DataFrame took an average of 2.07 milliseconds per loop. The standard deviation of 58.6 microseconds shows that there is less variability in the execution times compared to the iterrows() method.

Profiling "apply"

In [114]:
!pip install line-profiler
%load_ext line_profiler

The line_profiler extension is already loaded. To reload it, use:
  %reload_ext line_profiler


In [162]:
# Haversine applied on rows
%lprun -f haversine
df.apply(lambda row: haversine(40.671, -73.985,\
                               row['locLat'], row['locLong']), axis=1)

0     4959.870106
1     1081.593651
2     2276.098880
3     1255.154292
4      584.683573
5      716.061202
6      804.302067
7     1202.595720
8      261.704147
9      723.619081
10    2500.406105
11    1183.572872
12     914.094552
13     128.899163
14     165.299805
15    1550.139984
16     862.008471
17     668.250535
18      96.684878
19    2453.917624
20    1021.420462
21     909.430381
22     760.908611
23    1625.149470
24    2115.580042
25     353.324858
26     861.054055
27     207.952978
28    1057.569755
29    1140.134525
dtype: float64

Timer unit: 1e-07 s

Total time: 0 s
File: C:\Users\supra\AppData\Local\Temp\ipykernel_21064\1616544886.py
Function: haversine at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
     1                                           def haversine(lat1, lon1, lat2, lon2):
     2                                               miles_constant = 3959
     3                                               lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
     4                                               dlat = lat2 - lat1 
     5                                               dlon = lon2 - lon1 
     6                                               a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
     7                                               c = 2 * np.arcsin(np.sqrt(a)) 
     8                                               mi = miles_constant * c
     9                                               return mi

Vectorized implementation of Haversine applied on Pandas series

Timing vectorized implementation

In [164]:
%%timeit
df['distance'] = haversine(40.671, -73.985, df['locLat'], df['locLong'])

2.38 ms ± 104 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)


The %%timeit output of 2.38 ms ± 104 μs per loop (mean ± std. dev. of 7 runs, 100 loops each) indicates that the vectorized implementation of the Haversine function applied directly on the Pandas Series took an average of 2.38 milliseconds per loop. The standard deviation of 104 microseconds shows relatively low variability in execution times across the runs. This method is more efficient than the previous approaches (using iterrows() and apply()), as it leverages vectorized operations which are optimized for performance in Pandas.

Profiling vectorized implementation

In [119]:
# Vectorized implementation profile
%lprun -f haversine 
haversine(40.671, -73.985,\
                              df['locLat'], df['locLong'])

0     4959.870106
1     1081.593651
2     2276.098880
3     1255.154292
4      584.683573
5      716.061202
6      804.302067
7     1202.595720
8      261.704147
9      723.619081
10    2500.406105
11    1183.572872
12     914.094552
13     128.899163
14     165.299805
15    1550.139984
16     862.008471
17     668.250535
18      96.684878
19    2453.917624
20    1021.420462
21     909.430381
22     760.908611
23    1625.149470
24    2115.580042
25     353.324858
26     861.054055
27     207.952978
28    1057.569755
29    1140.134525
dtype: float64

Timer unit: 1e-07 s

Total time: 0 s
File: C:\Users\supra\AppData\Local\Temp\ipykernel_21064\1616544886.py
Function: haversine at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
     1                                           def haversine(lat1, lon1, lat2, lon2):
     2                                               miles_constant = 3959
     3                                               lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
     4                                               dlat = lat2 - lat1 
     5                                               dlon = lon2 - lon1 
     6                                               a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
     7                                               c = 2 * np.arcsin(np.sqrt(a)) 
     8                                               mi = miles_constant * c
     9                                               return mi

Vectorized implementation of Haversine applied on NumPy arrays
Timing vectorized implementation

In [121]:
%%timeit
# Vectorized implementation of Haversine applied on NumPy arrays
df['distance'] = haversine(40.671, -73.985, df['locLat'].values, df['locLong'].values)

237 μs ± 9.71 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


The %%timeit output of 237 μs ± 9.71 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each) indicates that the vectorized implementation of the Haversine function applied on NumPy arrays took an average of only 237 microseconds per loop, with a standard deviation of 9.71 microseconds, showing very low variability. This approach is significantly faster than both the apply() method and iterrows() in pandas, highlighting the efficiency of NumPy arrays for handling element-wise operations.

In [122]:
%%timeit
# Convert pandas arrays to NumPy ndarrays
np_lat = df['locLat'].values
np_lon = df['locLong'].values

14.8 μs ± 316 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)


Profiling vectorized implementation

In [124]:
%lprun -f haversine 
df['distance'] = haversine(40.671, -73.985, 
                           df['locLat'].values, df['locLong'].values)

Timer unit: 1e-07 s

Total time: 0 s
File: C:\Users\supra\AppData\Local\Temp\ipykernel_21064\1616544886.py
Function: haversine at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
     1                                           def haversine(lat1, lon1, lat2, lon2):
     2                                               miles_constant = 3959
     3                                               lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
     4                                               dlat = lat2 - lat1 
     5                                               dlon = lon2 - lon1 
     6                                               a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
     7                                               c = 2 * np.arcsin(np.sqrt(a)) 
     8                                               mi = miles_constant * c
     9                                               return mi

Cythonize that loop

In [126]:
pip install cython




In [127]:
%load_ext cython

The cython extension is already loaded. To reload it, use:
  %reload_ext cython


In [168]:
%%cython -a

# Haversine cythonized 
import numpy as np
cpdef haversine_cy(lat1, lon1, lat2, lon2):
    miles_constant = 3959
    lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1 
    dlon = lon2 - lon1 
    a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
    c = 2 * np.arcsin(np.sqrt(a)) 
    mi = miles_constant * c
    return mi

Content of stdout:
_cython_magic_11589e0f8be124aa812437b04ca6db1b0d0f1a85.c
   Creating library C:\Users\supra\.ipython\cython\Users\supra\.ipython\cython\_cython_magic_11589e0f8be124aa812437b04ca6db1b0d0f1a85.cp312-win_amd64.lib and object C:\Users\supra\.ipython\cython\Users\supra\.ipython\cython\_cython_magic_11589e0f8be124aa812437b04ca6db1b0d0f1a85.cp312-win_amd64.exp
Generating code
Finished generating code

In [129]:
%timeit df['distance'] =\
       df.apply(lambda row: haversine_cy(40.671, -73.985,\
                row['locLat'], row['locLong']), axis=1)

2.02 ms ± 44.6 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)


The %%timeit output of 2.02 ms ± 44.6 μs per loop (mean ± std. dev. of 7 runs, 100 loops each) indicates that using the cythonized Haversine function applied via apply() to each row in the DataFrame took an average of 2.02 milliseconds per loop. The standard deviation of 44.6 microseconds suggests low variability in execution times across the runs. This result shows that the cythonized version of the Haversine function provides an improvement in performance compared to the regular Python implementation, though it still doesn’t outperform the vectorized NumPy approach (which took 237 microseconds per loop)

Redefine Haversine with data types and C libraries

In [131]:
%%cython -a
# Haversine cythonized
from libc.math cimport sin, cos, acos, asin, sqrt

cdef deg2rad_cy(float deg):
    cdef float rad
    rad = 0.01745329252*deg
    return rad
    
cpdef haversine_cy_dtyped(float lat1, float lon1, float lat2, float lon2):
    cdef: 
        float dlon
        float dlat
        float a
        float c
        float mi
    
    lat1, lon1, lat2, lon2 = map(deg2rad_cy, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1 
    dlon = lon2 - lon1 
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a)) 
    mi = 3959 * c
    return mi

In [132]:
%timeit df['distance'] =\
df.apply(lambda row: haversine_cy_dtyped(40.671, -73.985,\
                              row['locLat'], row['locLong']), axis=1)

1.25 ms ± 15.7 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


The %%timeit output of 1.25 ms ± 15.7 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each) indicates that using the Haversine function with data types and C libraries (i.e., a further optimized version with typed data and leveraging Cython for faster execution) to calculate the distance for each row took an average of 1.25 milliseconds per loop. The standard deviation of 15.7 microseconds shows minimal variability in execution times across the 7 benchmarking runs. This approach provides a significant improvement in performance compared to the non-optimized versions, including both the regular Python implementation and the cythonized version without data type optimization. While still not as fast as the vectorized NumPy approach (which took 237 microseconds per loop).