Geospatial Analysis - Finding the Closest Warehouse
Scenario: A logistics company has a list of their retail stores and a separate list of their distribution warehouses. They need to determine the closest warehouse for each store.

Your Task: For each store, calculate the distance to every warehouse and then determine which warehouse is the closest. The final DataFrame should be the original stores table with two new columns: closest_warehouse_id and distance_km. You will need to implement the Haversine formula to calculate the distance between two lat/long points on a sphere. This is a perfect task for a custom NumPy function.
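As a reference point for the formula, here is a minimal scalar sketch. It assumes a spherical Earth with a mean radius of 6,371 km. The name haversine_km and the test coordinates are illustrative only and are not part of the solution cell. A quarter of the equator makes a convenient check, because the expected distance is simply one quarter of the circumference.

```python
import math

EARTH_RADIUS_KM = 6_371  # assumed mean Earth radius (spherical approximation)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.atan2(math.sqrt(a), math.sqrt(1 - a))


# Two points on the equator, 90 degrees of longitude apart.
quarter = haversine_km(0.0, 0.0, 0.0, 90.0)
print(round(quarter, 1))  # 10007.5, i.e. (pi / 2) * 6371
```
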

In [1]:
import numpy as np
import pandas as pd

stores = pd.read_csv('https://raw.githubusercontent.com/vlad-gby/ds_5_mini-projects/refs/heads/main/02_geospacial_analysis/stores.csv')
warehouses = pd.read_csv('https://raw.githubusercontent.com/vlad-gby/ds_5_mini-projects/refs/heads/main/02_geospacial_analysis/warehouses.csv')

def calc_dist(combinations):
    """Return the Haversine distance in km for each store-warehouse row."""
    lat1 = np.radians(combinations['store_lat'])
    long1 = np.radians(combinations['store_lon'])
    lat2 = np.radians(combinations['wh_lat'])
    long2 = np.radians(combinations['wh_lon'])

    radius = 6_371  # mean Earth radius in km
    lat_diff = lat2 - lat1
    long_diff = long2 - long1

    # Haversine formula:
    #   a = sin²(Δlat/2) + cos(lat1) * cos(lat2) * sin²(Δlon/2)
    #   c = 2 * atan2(√a, √(1−a))
    #   d = R * c

    a = np.sin(lat_diff/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(long_diff/2)**2
    c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))
    d = radius * c

    return d

# Build every store-warehouse pair
combinations = stores.merge(warehouses, how='cross')
combinations['dist'] = calc_dist(combinations)
# Keep the nearest warehouse for each store, with the column names the task asks for
closest_rows = combinations.groupby('store_id')['dist'].idxmin()
best_dist = combinations.loc[closest_rows, ['store_id', 'warehouse_id', 'dist']].rename(
    columns={'warehouse_id': 'closest_warehouse_id', 'dist': 'distance_km'}
)
# Merge the result back into the original stores table
stores = stores.merge(best_dist, on='store_id')

print(stores)

  store_id  store_lat  store_lon closest_warehouse_id  distance_km
0       S1    45.4642     9.1900                   W1    12.379630
1       S2    41.9028    12.4964                   W2   241.805063
2       S3    43.7696    11.2558                   W2    10.900462


The analysis identifies the closest warehouse for each retail store: S1 is closest to W1 (about 12.4 km away), while S2 and S3 are both closest to W2 (about 241.8 km and 10.9 km, respectively).

Implementing the Haversine formula as a vectorized NumPy function computes the great-circle distance for every store-warehouse pair in a single pass. A cross merge in pandas builds all possible pairings, and .groupby() combined with .idxmin() selects, for each store, the row with the minimum distance. Because the formula treats the Earth as a sphere with a radius of 6,371 km, the results approximate the true geodesic distances.
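The same result can also be reached without materializing the cross merge, by broadcasting the coordinates into a stores-by-warehouses distance matrix and taking the row-wise argmin. The sketch below is one possible variant, not the method used in the cell above. The tables stores_demo and warehouses_demo and the function haversine_matrix are hypothetical and exist only for illustration; the column names follow the ones used in calc_dist.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; column names mirror those used in calc_dist.
stores_demo = pd.DataFrame({
    'store_id': ['A', 'B'],
    'store_lat': [0.0, 10.0],
    'store_lon': [0.0, 10.0],
})
warehouses_demo = pd.DataFrame({
    'warehouse_id': ['X', 'Y'],
    'wh_lat': [0.0, 10.0],
    'wh_lon': [1.0, 11.0],
})


def haversine_matrix(lat1, lon1, lat2, lon2, radius=6_371):
    """Return an (n_stores, n_warehouses) matrix of distances in km."""
    # Stores vary along rows, warehouses along columns.
    lat1, lon1 = np.radians(lat1)[:, None], np.radians(lon1)[:, None]
    lat2, lon2 = np.radians(lat2)[None, :], np.radians(lon2)[None, :]
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius * np.arctan2(np.sqrt(a), np.sqrt(1 - a))


dist = haversine_matrix(
    stores_demo['store_lat'].to_numpy(), stores_demo['store_lon'].to_numpy(),
    warehouses_demo['wh_lat'].to_numpy(), warehouses_demo['wh_lon'].to_numpy(),
)
nearest = dist.argmin(axis=1)  # column index of the closest warehouse per store
result = stores_demo.assign(
    closest_warehouse_id=warehouses_demo['warehouse_id'].to_numpy()[nearest],
    distance_km=dist[np.arange(len(stores_demo)), nearest],
)
print(result)
```

Both approaches compute the same number of distances. The matrix form avoids the intermediate DataFrame with one row per pair, which may matter once the store and warehouse lists grow large.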