# More on Merging: Geospatial Analysis
Geospatial analysis is the gathering, display, and manipulation of imagery, GPS, satellite photography and historical data, described explicitly in terms of geographic coordinates or implicitly, in terms of a street address, postal code, or forest stand identifier as they are applied to geographic models. In this recipe, we will analyze such data from the city of Chicago. We will use Pandas to analyze this data.

In [2]:
import pandas as pd

## Landmark Data
The City of Chicago has a register of landmarks in the city, we can load in this data as a CSV:

In [3]:
landmarks = pd.read_csv('Individual_Landmarks_-_Map.csv')
landmarks['rowid'] = range(len(landmarks)) #unique number
landmarks[:10]

Unnamed: 0,LANDMARK NAME,ID,ADDRESS,DATE BUILT,ARCHITECT,LANDMARK DESIGNATION DATE,LATITUDE,LONGITUDE,LOCATION,rowid
0,Vassar Swiss Underwear Company Building,L-265,2543 - 2545 W Diversey Av,,,07/30/2008,41.931627,-87.6921,"(41.9316266084, -87.6921000957)",0
1,Mathilde Eliel House,L- 89,4122 S Ellis Av,1886,Adler & Sullivan,10/02/1991,41.819256,-87.602788,"(41.819255751, -87.6027879992)",1
2,Manhattan Building,L-139,431 S Dearborn St,1891,William LeBaron Jenney,07/07/1978,41.876066,-87.628964,"(41.8760657234, -87.6289644505)",2
3,Machinery Hall at Illinois Institute of Techno...,L- 12,100 W 33rd St,1901,"Patton, Fisher & Miller",05/26/2004,41.835161,-87.629221,"(41.8351614122, -87.6292212235)",3
4,Melissa Ann Elam House,L- 88,4726 S Dr Martin Luther King Jr Dr,1903,Henry L. Newhouse,03/21/1979,41.80853,-87.617204,"(41.808529769, -87.6172043949)",4
5,(Former) Pioneer Trust and Savings Bank Building,L-318,4000 W. North Ave.,1924,Karl M. Vitzthum,06/06/2012,41.910192,-87.726617,"(41.9101921054, -87.7266173415)",5
6,DuPont-Whitehouse House,L- 85,3558 S Artesian Av,1876,Oscar Cobb & Co.,04/16/1996,41.828582,-87.686594,"(41.8285816489, -87.6865936818)",6
7,Montgomery Ward & Co. Catalog House,L-149,618 W Chicago Av,1907-08,"Richard E. Schmidt, Garden and Martin",05/17/2000,41.897437,-87.643695,"(41.8974368676, -87.6436954155)",7
8,Vorwaerts Turner Hall,L-286,2431 W. Roosevelt Rd,,,09/03/2009,41.866152,-87.687247,"(41.8661518103, -87.6872469444)",8
9,City Hall-County Building,L- 71,121 N LaSalle St / 118 N Clark St,1905-08,Holabird and Roche,01/21/1982,41.883843,-87.631655,"(41.8838425425, -87.6316552814)",9


The new type of data that we see in this dataset is latitude and longitude data. We use the library `geopy` to do manipulations of such numbers. For example, if we want to calculate the distance between landmarks, we can do the following:

In [5]:
import geopy.distance

def distance_between(i,j):
    coords_i = landmarks['LATITUDE'][i],landmarks['LONGITUDE'][i]
    coords_j = landmarks['LATITUDE'][j],landmarks['LONGITUDE'][j]
    
    return geopy.distance.geodesic(coords_i, coords_j).mi

distance_between(14,16)

1.6453285238852118

Now, suppose, we would like to answer queries that identify nearby landmarks to you. So you provide the function your current latitude and longitude and it returns all landmarks within a distance of you. Naively, we could do the following:

In [6]:
import datetime

#take a coordinate for me
#find all landmarks within a distance of me
def find_naive(me, landmarks, distance=0.5):
    start = datetime.datetime.now()
    
    rtn = []
    
    N = len(landmarks)
    
    for i in range(N):
        coords_i = landmarks['LATITUDE'][i],landmarks['LONGITUDE'][i]
        
        if geopy.distance.geodesic(me, coords_i).mi < distance:
            rtn.append(i)
    
    print('Elapsed Time find_naive() ', (datetime.datetime.now()-start).total_seconds())
    
    return landmarks.loc[rtn]

#john crerar library: 41.790524,-87.6050427
find_naive((41.790524,-87.6050427), landmarks, distance=0.25)

Elapsed Time find_naive()  0.120969


Unnamed: 0,LANDMARK NAME,ID,ADDRESS,DATE BUILT,ARCHITECT,LANDMARK DESIGNATION DATE,LATITUDE,LONGITUDE,LOCATION,rowid
215,Site of the 1st Self-Sustain Cont. Nuclear Chain,L-184,East Side of S Ellis Ave between 56th & 57th St,1942,"Commern. sculpture,""Nuclear Energy"" Henry Moore",10/27/1971,41.792162,-87.60087,"(41.7921621211, -87.6008698951)",215
252,American School of Correspondence,L- 46,850 E 58th St,1907,Pond & Pond,04/15/1995,41.789754,-87.604104,"(41.789753922, -87.6041036284)",252


How do we make this search faster? Let's break the city up into sectors and do a sector-by-sector search.

To do so, let's create some dummy columns that bin the latitude and longitude into sectors.

In [9]:
landmarks['lat_bins'], latbins = pd.cut(x=landmarks['LATITUDE'], labels=False, bins=5, retbins=True)
landmarks['long_bins'], longbins = pd.cut(x=landmarks['LONGITUDE'], labels=False, bins=5, retbins=True)
landmarks[:10]

Unnamed: 0,LANDMARK NAME,ID,ADDRESS,DATE BUILT,ARCHITECT,LANDMARK DESIGNATION DATE,LATITUDE,LONGITUDE,LOCATION,rowid,lat_bins,long_bins
0,Vassar Swiss Underwear Company Building,L-265,2543 - 2545 W Diversey Av,,,07/30/2008,41.931627,-87.6921,"(41.9316266084, -87.6921000957)",0,3,2
1,Mathilde Eliel House,L- 89,4122 S Ellis Av,1886,Adler & Sullivan,10/02/1991,41.819256,-87.602788,"(41.819255751, -87.6027879992)",1,2,3
2,Manhattan Building,L-139,431 S Dearborn St,1891,William LeBaron Jenney,07/07/1978,41.876066,-87.628964,"(41.8760657234, -87.6289644505)",2,2,3
3,Machinery Hall at Illinois Institute of Techno...,L- 12,100 W 33rd St,1901,"Patton, Fisher & Miller",05/26/2004,41.835161,-87.629221,"(41.8351614122, -87.6292212235)",3,2,3
4,Melissa Ann Elam House,L- 88,4726 S Dr Martin Luther King Jr Dr,1903,Henry L. Newhouse,03/21/1979,41.80853,-87.617204,"(41.808529769, -87.6172043949)",4,2,3
5,(Former) Pioneer Trust and Savings Bank Building,L-318,4000 W. North Ave.,1924,Karl M. Vitzthum,06/06/2012,41.910192,-87.726617,"(41.9101921054, -87.7266173415)",5,3,1
6,DuPont-Whitehouse House,L- 85,3558 S Artesian Av,1876,Oscar Cobb & Co.,04/16/1996,41.828582,-87.686594,"(41.8285816489, -87.6865936818)",6,2,2
7,Montgomery Ward & Co. Catalog House,L-149,618 W Chicago Av,1907-08,"Richard E. Schmidt, Garden and Martin",05/17/2000,41.897437,-87.643695,"(41.8974368676, -87.6436954155)",7,3,3
8,Vorwaerts Turner Hall,L-286,2431 W. Roosevelt Rd,,,09/03/2009,41.866152,-87.687247,"(41.8661518103, -87.6872469444)",8,2,2
9,City Hall-County Building,L- 71,121 N LaSalle St / 118 N Clark St,1905-08,Holabird and Roche,01/21/1982,41.883843,-87.631655,"(41.8838425425, -87.6316552814)",9,3,3


We can now do something kind of cool, we can reindex this dataset by the defined bins so we can quickly pull up those landmarks inside the bins. This is how you build an inverted index in Pandas!

In [10]:
landmark_index = landmarks.groupby(['lat_bins','long_bins'])['rowid'].apply(list)
landmark_index

lat_bins  long_bins
0         2                                     [19, 116, 144, 221, 280]
          3                                                [12, 75, 118]
          4                            [11, 31, 100, 133, 188, 234, 311]
1         1                                                        [142]
          2                                 [35, 85, 183, 218, 232, 314]
          3            [26, 32, 45, 58, 73, 77, 79, 103, 132, 146, 17...
          4              [64, 67, 74, 165, 212, 228, 236, 264, 277, 289]
2         1                                          [34, 113, 159, 176]
          2            [6, 8, 10, 18, 27, 110, 127, 140, 199, 204, 21...
          3            [1, 2, 3, 4, 37, 39, 43, 60, 61, 65, 68, 72, 8...
3         0                            [71, 78, 102, 123, 181, 244, 316]
          1                                    [5, 42, 54, 93, 115, 167]
          2            [0, 16, 17, 22, 28, 47, 83, 87, 90, 98, 106, 1...
          3            [7, 9, 1

Now, we can write a more sophisticed find function that looks up only those elements in sector that contains your point.

In [11]:
def find_binned(me, landmarks, index, bins, distance=0.5):
    start = datetime.datetime.now()
    
    rtn = []
    
    #find the long and lat bin
    lat_bin_me = pd.cut(x=pd.Series(me[0]), labels=False, bins=bins[0])[0]
    long_bin_me = pd.cut(x=pd.Series(me[1]),labels=False, bins=bins[1])[0]
    
    for i in index[lat_bin_me, long_bin_me]:
        coords_i = landmarks['LATITUDE'][i],landmarks['LONGITUDE'][i]
        
        if geopy.distance.geodesic(me, coords_i).mi < distance:
            rtn.append(i)
    
    print('Elapsed Time find_binned() ', (datetime.datetime.now()-start).total_seconds())

    return landmarks.loc[rtn]

find_binned((41.790524,-87.6050427), landmarks, landmark_index, (latbins,longbins))

Elapsed Time find_binned()  0.016067


Unnamed: 0,LANDMARK NAME,ID,ADDRESS,DATE BUILT,ARCHITECT,LANDMARK DESIGNATION DATE,LATITUDE,LONGITUDE,LOCATION,rowid,lat_bins,long_bins
32,Frederick C. Robie House,L-176,5757 S Woodlawn Av,1909,Frank Lloyd Wright,09/15/1971,41.78992,-87.59597,"(41.7899203248, -87.5959702794)",32,1,3
73,Keck-Gottschalk-Keck Apartments,L-126,5551 S University Av,1937,George and William Keck,08/03/1994,41.793551,-87.597704,"(41.793550861, -87.5977036617)",73,1,3
202,Lorado Taft's Midway Studios,L-192,6016 S Ingleside Dr,1890-1929,"Arch ?, recon:1929;O F Johnson Add:1964;E D Dart",12/01/1993,41.785526,-87.603205,"(41.7855264091, -87.6032052025)",202,1,3
215,Site of the 1st Self-Sustain Cont. Nuclear Chain,L-184,East Side of S Ellis Ave between 56th & 57th St,1942,"Commern. sculpture,""Nuclear Energy"" Henry Moore",10/27/1971,41.792162,-87.60087,"(41.7921621211, -87.6008698951)",215,1,3
243,Rockefeller Memorial Chapel,L-178,1156-80 E 59th St,1925-28,Bertram Grosvenor Goodhue Associates,11/03/2004,41.788526,-87.597044,"(41.7885259092, -87.5970443072)",243,1,3
252,American School of Correspondence,L- 46,850 E 58th St,1907,Pond & Pond,04/15/1995,41.789754,-87.604104,"(41.789753922, -87.6041036284)",252,1,3


You get exactly the same results, a lot faster!! This technique works well if you are interested in searching for landmarks relatively close to you. The second that you cross bin boundaries it no longer works. Let's try to understand just how accurate binning chicago up into 25 sectors is. To do so, we need to compare the selected values with the binned algorithm with that of the naive algorithm.

We can do this calculation with a join. Luckily Pandas as a join feature called merge which allows us to find where two datasets intersect. In our case, we will use rowid to do this merge.

In [12]:
result_actual = find_naive((41.790524,-87.6050427), landmarks)
result_heuristic = find_binned((41.790524,-87.6050427), landmarks, landmark_index, (latbins,longbins))

common = result_actual.merge(result_heuristic,on='rowid')

print('Rows in result actual',len(result_actual))
print('Rows in result heuristic',len(result_heuristic))
print('Rows common to both', len(common) )


for dist in range(5,30,5):
    result_actual = find_naive((41.790524,-87.6050427), landmarks, distance=dist/10.0)
    result_heuristic = find_binned((41.790524,-87.6050427), landmarks, landmark_index, (latbins,longbins), distance=dist/10.0)
    common = result_actual.merge(result_heuristic,on='rowid')
    
    print('--- At Distance',dist/10.0,'---')
    tp = len(result_actual)
    print('Missing Rows',len(result_actual) - len(common))
    print()
    print()


Elapsed Time find_naive()  0.154456
Elapsed Time find_binned()  0.009367
Rows in result actual 6
Rows in result heuristic 6
Rows common to both 6
Elapsed Time find_naive()  0.111035
Elapsed Time find_binned()  0.009344
--- At Distance 0.5 ---
Missing Rows 0


Elapsed Time find_naive()  0.134071
Elapsed Time find_binned()  0.016006
--- At Distance 1.0 ---
Missing Rows 0


Elapsed Time find_naive()  0.119813
Elapsed Time find_binned()  0.011146
--- At Distance 1.5 ---
Missing Rows 7


Elapsed Time find_naive()  0.099728
Elapsed Time find_binned()  0.010806
--- At Distance 2.0 ---
Missing Rows 12


Elapsed Time find_naive()  0.082415
Elapsed Time find_binned()  0.007749
--- At Distance 2.5 ---
Missing Rows 21




## Chicago Parks Data
Now, we're going to try to link this data to Chicago parks data to see how many landmarks are in or very close to Chicago Parks. We have a similar CSV dataset of park facilities and their corresponding latitudes and longitudes:

In [13]:
parks = pd.read_csv('CPD_Facilities.csv')
parks[:10]

Unnamed: 0,OBJECTID,PARK_NO,PARK,the_geom,FACILITY_N,FACILITY_T,X_COORD,Y_COORD,GISOBJID
0,1066,9,HAMILTON (ALEXANDER),POINT (-87.63769762611605 41.76299921071406),CULTURAL CENTER,SPECIAL,-87.637698,41.762999,2494
1,1067,9,HAMILTON (ALEXANDER),POINT (-87.63792902987225 41.76281652333733),GYMNASIUM,INDOOR,-87.637929,41.762817,2495
2,1068,9,HAMILTON (ALEXANDER),POINT (-87.63691359952921 41.76084938932824),BASEBALL JR/SOFTBALL,OUTDOOR,-87.636914,41.760849,2496
3,1069,9,HAMILTON (ALEXANDER),POINT (-87.63832013450852 41.76200535544225),BASEBALL JR/SOFTBALL,OUTDOOR,-87.63832,41.762005,2497
4,1070,9,HAMILTON (ALEXANDER),POINT (-87.63805916837423 41.760473845106304),BASEBALL JR/SOFTBALL,OUTDOOR,-87.638059,41.760474,2498
5,1071,9,HAMILTON (ALEXANDER),POINT (-87.63674047085885 41.76111277250026),BASEBALL JR/SOFTBALL,OUTDOOR,-87.63674,41.761113,2499
6,1072,9,HAMILTON (ALEXANDER),POINT (-87.6359054976673 41.761924615184846),POOL (OUTDOOR),OUTDOOR,-87.635906,41.761925,2500
7,1073,9,HAMILTON (ALEXANDER),POINT (-87.63686292205551 41.762619835913384),PLAYGROUND,OUTDOOR,-87.636863,41.76262,2501
8,1074,9,HAMILTON (ALEXANDER),POINT (-87.63660514607702 41.76312824964004),SPRAY FEATURE,OUTDOOR,-87.636605,41.763128,2502
9,1075,9,HAMILTON (ALEXANDER),POINT (-87.63842974936185 41.76119855500161),BASEBALL SR,OUTDOOR,-87.63843,41.761199,2503


Now, let's bin the data like before. This time let's use finer bins:

In [14]:
parks['lat_bins'], latbins = pd.cut(x=parks['Y_COORD'], labels=False, bins=25, retbins=True)
parks['long_bins'], longbins = pd.cut(x=parks['X_COORD'], labels=False, bins=25, retbins=True)

landmarks = pd.read_csv('Individual_Landmarks_-_Map.csv')
landmarks['rowid'] = range(len(landmarks))
landmarks[:10]
landmarks['lat_bins'], latbins = pd.cut(x=landmarks['LATITUDE'], labels=False, bins=latbins, retbins=True)
landmarks['long_bins'], longbins = pd.cut(x=landmarks['LONGITUDE'], labels=False, bins=longbins, retbins=True)

parks[:5]

Unnamed: 0,OBJECTID,PARK_NO,PARK,the_geom,FACILITY_N,FACILITY_T,X_COORD,Y_COORD,GISOBJID,lat_bins,long_bins
0,1066,9,HAMILTON (ALEXANDER),POINT (-87.63769762611605 41.76299921071406),CULTURAL CENTER,SPECIAL,-87.637698,41.762999,2494,7,16
1,1067,9,HAMILTON (ALEXANDER),POINT (-87.63792902987225 41.76281652333733),GYMNASIUM,INDOOR,-87.637929,41.762817,2495,7,15
2,1068,9,HAMILTON (ALEXANDER),POINT (-87.63691359952921 41.76084938932824),BASEBALL JR/SOFTBALL,OUTDOOR,-87.636914,41.760849,2496,7,16
3,1069,9,HAMILTON (ALEXANDER),POINT (-87.63832013450852 41.76200535544225),BASEBALL JR/SOFTBALL,OUTDOOR,-87.63832,41.762005,2497,7,15
4,1070,9,HAMILTON (ALEXANDER),POINT (-87.63805916837423 41.760473845106304),BASEBALL JR/SOFTBALL,OUTDOOR,-87.638059,41.760474,2498,7,15


In [15]:
landmarks[:5]

Unnamed: 0,LANDMARK NAME,ID,ADDRESS,DATE BUILT,ARCHITECT,LANDMARK DESIGNATION DATE,LATITUDE,LONGITUDE,LOCATION,rowid,lat_bins,long_bins
0,Vassar Swiss Underwear Company Building,L-265,2543 - 2545 W Diversey Av,,,07/30/2008,41.931627,-87.6921,"(41.9316266084, -87.6921000957)",0,18,11
1,Mathilde Eliel House,L- 89,4122 S Ellis Av,1886.0,Adler & Sullivan,10/02/1991,41.819256,-87.602788,"(41.819255751, -87.6027879992)",1,11,18
2,Manhattan Building,L-139,431 S Dearborn St,1891.0,William LeBaron Jenney,07/07/1978,41.876066,-87.628964,"(41.8760657234, -87.6289644505)",2,15,16
3,Machinery Hall at Illinois Institute of Techno...,L- 12,100 W 33rd St,1901.0,"Patton, Fisher & Miller",05/26/2004,41.835161,-87.629221,"(41.8351614122, -87.6292212235)",3,12,16
4,Melissa Ann Elam House,L- 88,4726 S Dr Martin Luther King Jr Dr,1903.0,Henry L. Newhouse,03/21/1979,41.80853,-87.617204,"(41.808529769, -87.6172043949)",4,10,17


Then, we can apply a join to combine the dataset on rows that have the same bin value:

In [16]:
parks_and_landmarks = parks.merge(landmarks,on=['lat_bins','long_bins'])
parks_and_landmarks[:5]

Unnamed: 0,OBJECTID,PARK_NO,PARK,the_geom,FACILITY_N,FACILITY_T,X_COORD,Y_COORD,GISOBJID,lat_bins,...,LANDMARK NAME,ID,ADDRESS,DATE BUILT,ARCHITECT,LANDMARK DESIGNATION DATE,LATITUDE,LONGITUDE,LOCATION,rowid
0,1085,106,HAMLIN (HANNIBAL),POINT (-87.68030436443362 41.93675714691855),BASKETBALL COURT,OUTDOOR,-87.680304,41.936757,2513,19,...,(Former) Schlitz BreweryTied-House@2159 W.Belmont,L-305,2159 W. Belmont Ave.,1903-1904,Charles Thisslew,07/06/2011,41.93925,-87.683033,"(41.9392502017, -87.6830331298)",248
1,1086,106,HAMLIN (HANNIBAL),POINT (-87.67917786526876 41.93654865287884),FOOTBALL/SOCCER COMBO FLD,OUTDOOR,-87.679178,41.936549,2514,19,...,(Former) Schlitz BreweryTied-House@2159 W.Belmont,L-305,2159 W. Belmont Ave.,1903-1904,Charles Thisslew,07/06/2011,41.93925,-87.683033,"(41.9392502017, -87.6830331298)",248
2,1088,106,HAMLIN (HANNIBAL),POINT (-87.68032910446819 41.93604307713284),DOG FRIENDLY AREA,OUTDOOR,-87.680329,41.936043,2516,19,...,(Former) Schlitz BreweryTied-House@2159 W.Belmont,L-305,2159 W. Belmont Ave.,1903-1904,Charles Thisslew,07/06/2011,41.93925,-87.683033,"(41.9392502017, -87.6830331298)",248
3,1089,106,HAMLIN (HANNIBAL),POINT (-87.67982968492478 41.937481110855266),FITNESS CENTER,INDOOR,-87.67983,41.937481,2517,19,...,(Former) Schlitz BreweryTied-House@2159 W.Belmont,L-305,2159 W. Belmont Ave.,1903-1904,Charles Thisslew,07/06/2011,41.93925,-87.683033,"(41.9392502017, -87.6830331298)",248
4,1090,106,HAMLIN (HANNIBAL),POINT (-87.68024830957748 41.93708947936905),GYMNASIUM,INDOOR,-87.680248,41.937089,2518,19,...,(Former) Schlitz BreweryTied-House@2159 W.Belmont,L-305,2159 W. Belmont Ave.,1903-1904,Charles Thisslew,07/06/2011,41.93925,-87.683033,"(41.9392502017, -87.6830331298)",248


In [17]:
def filter(parks_and_landmarks):
    rtn = []
    
    N = len(landmarks)
    
    for i in range(N):
        coords_i = parks_and_landmarks['LATITUDE'][i],parks_and_landmarks['LONGITUDE'][i]
        coords_j = parks_and_landmarks['Y_COORD'][i],parks_and_landmarks['X_COORD'][i]
        
        if geopy.distance.geodesic(coords_j, coords_i).km < 0.1: #100m
            rtn.append(i)
    
    return parks_and_landmarks.loc[rtn]

filter(parks_and_landmarks).groupby(['PARK', 'LANDMARK NAME'])['PARK', 'LANDMARK NAME'].count()

  from ipykernel import kernelapp as app


Unnamed: 0_level_0,Unnamed: 1_level_0,PARK,LANDMARK NAME
PARK,LANDMARK NAME,Unnamed: 2_level_1,Unnamed: 3_level_1
CARMEN,(Former) Schlitz Brewery-Tied House,1,1
CARMEN,Myron Bachman House,1,1
HASAN (ELLIOT),6901 Oglesby Cooperative Apartment Building,1,1
JACKSON (ANDREW),63rd Street Bathing Pavilion,3,3


## Linking to a larger dataset

Suppose, we want to understand what types of businesses are close to the landmarks, we can use the business register from the city of chicago. This particular dataset is significantly larger than the other two (not suprisingly)!

In [18]:
business = pd.read_csv('Business_Licenses_-_Current_Active.csv')
business[:5]

  interactivity=interactivity, compiler=compiler, result=result)


Unnamed: 0,ID,LICENSE ID,ACCOUNT NUMBER,SITE NUMBER,LEGAL NAME,DOING BUSINESS AS NAME,ADDRESS,CITY,STATE,ZIP CODE,...,LICENSE TERM START DATE,LICENSE TERM EXPIRATION DATE,LICENSE APPROVED FOR ISSUANCE,DATE ISSUED,LICENSE STATUS,LICENSE STATUS CHANGE DATE,SSA,LATITUDE,LONGITUDE,LOCATION
0,2664681-20190513,2664681,398472,1,"ISLAND PARTY HUT, LLC",ISLAND PARTY HUT,355 E RIVERWALK SOUTH 1,CHICAGO,IL,60601,...,05/13/2019,05/12/2020,05/13/2019,05/13/2019,AAI,,,41.887806,-87.617882,"(41.88780560547408, -87.61788171685379)"
1,2664546-20190513,2664546,405793,1,"TINY LOUNGE ON THE RIVERWALK, L.L.C.",TINY TAPP,55-71 W RIVERWALK 1ST,CHICAGO,IL,60601,...,05/13/2019,05/12/2020,05/13/2019,05/13/2019,AAI,,,41.886952,-87.629875,"(41.88695199839048, -87.62987459293426)"
2,2699019-20191122,2699019,417235,4,ROBERT BAUSCH,Bob's Belgian Hot Chocolate,18 S ADDISON,BENSONVILLE,IL,60106,...,11/22/2019,05/13/2020,11/14/2019,11/22/2019,AAI,,,,,
3,2699024-20191114,2699024,324455,15,Frieder Frotscher,German Grill,50 W WASHINGTON ST,CHICAGO,IL,60602,...,11/14/2019,05/13/2020,11/14/2019,11/14/2019,AAI,,,41.883322,-87.629775,"(41.88332219914999, -87.62977509664286)"
4,2699286-20191114,2699286,324457,16,Uli Koretz,Sweet Swabian,50 W WASHINGTON ST,CHICAGO,IL,60602,...,11/14/2019,05/13/2020,11/14/2019,11/14/2019,AAI,,,41.883322,-87.629775,"(41.88332219914999, -87.62977509664286)"


In [19]:
business['lat_bins'], latbins = pd.cut(x=business['LATITUDE'], labels=False, bins=latbins, retbins=True)
business['long_bins'], longbins = pd.cut(x=business['LONGITUDE'], labels=False, bins=longbins, retbins=True)
business[:5]

Unnamed: 0,ID,LICENSE ID,ACCOUNT NUMBER,SITE NUMBER,LEGAL NAME,DOING BUSINESS AS NAME,ADDRESS,CITY,STATE,ZIP CODE,...,LICENSE APPROVED FOR ISSUANCE,DATE ISSUED,LICENSE STATUS,LICENSE STATUS CHANGE DATE,SSA,LATITUDE,LONGITUDE,LOCATION,lat_bins,long_bins
0,2664681-20190513,2664681,398472,1,"ISLAND PARTY HUT, LLC",ISLAND PARTY HUT,355 E RIVERWALK SOUTH 1,CHICAGO,IL,60601,...,05/13/2019,05/13/2019,AAI,,,41.887806,-87.617882,"(41.88780560547408, -87.61788171685379)",16.0,17.0
1,2664546-20190513,2664546,405793,1,"TINY LOUNGE ON THE RIVERWALK, L.L.C.",TINY TAPP,55-71 W RIVERWALK 1ST,CHICAGO,IL,60601,...,05/13/2019,05/13/2019,AAI,,,41.886952,-87.629875,"(41.88695199839048, -87.62987459293426)",15.0,16.0
2,2699019-20191122,2699019,417235,4,ROBERT BAUSCH,Bob's Belgian Hot Chocolate,18 S ADDISON,BENSONVILLE,IL,60106,...,11/14/2019,11/22/2019,AAI,,,,,,,
3,2699024-20191114,2699024,324455,15,Frieder Frotscher,German Grill,50 W WASHINGTON ST,CHICAGO,IL,60602,...,11/14/2019,11/14/2019,AAI,,,41.883322,-87.629775,"(41.88332219914999, -87.62977509664286)",15.0,16.0
4,2699286-20191114,2699286,324457,16,Uli Koretz,Sweet Swabian,50 W WASHINGTON ST,CHICAGO,IL,60602,...,11/14/2019,11/14/2019,AAI,,,41.883322,-87.629775,"(41.88332219914999, -87.62977509664286)",15.0,16.0


Now that we've binned up the business latitudes and longitudes, let's try a bunch of 3-way merges and see how long they take:

In [20]:
start = datetime.datetime.now()
(parks.merge(landmarks,on=['lat_bins','long_bins'])).merge(business, on=['lat_bins','long_bins'])
print('Elapsed Time join(join(parks, landmarks), business) ', (datetime.datetime.now()-start).total_seconds())

Elapsed Time join(join(parks, landmarks), business)  4.442228


In [21]:
start = datetime.datetime.now()
(landmarks.merge(business,on=['lat_bins','long_bins'])).merge(parks, on=['lat_bins','long_bins'])
print('Elapsed Time join(join(landmarks, business), parks) ', (datetime.datetime.now()-start).total_seconds())

Elapsed Time join(join(landmarks, business), parks)  1.800296


In [22]:
start = datetime.datetime.now()
(business.merge(parks,on=['lat_bins','long_bins'])).merge(landmarks, on=['lat_bins','long_bins'])
print('Elapsed Time join(join(parks, business), landmarks) ', (datetime.datetime.now()-start).total_seconds())

Elapsed Time join(join(parks, business), landmarks)  2.952085
