## Speed comparison of Shapely vs PyGEOS


### Shapely
Shapely 2.0 is a major release featuring a complete refactor of the internals and new vectorized (element-wise) array operations. It provides considerable performance improvements, based on the developments in the PyGEOS package.  
For more information about the performance improvements, see the release notes for [Version 2.0.0 (2022-12-12)](https://shapely.readthedocs.io/en/stable/release/2.x.html#version-2-0-0-2022-12-12).
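
As a minimal sketch of what "vectorized (element-wise) array operations" means in practice, the example below builds an array of points in one call and computes all distances to a reference point in another. It assumes Shapely 2.0 or later and NumPy are installed; the sample coordinates are made up for illustration.

```python
import numpy as np
import shapely

# Assumed sample data: five points along the x-axis
coords = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [4.0, 0.0]])

# One call creates a NumPy array of Point geometries
points = shapely.points(coords)

# One call computes the distance from every point to a single reference point
origin = shapely.Point(0.0, 0.0)
distances = shapely.distance(points, origin)

print(distances)
```

No Python-level loop over the geometries is needed; the loop runs inside the compiled extension.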

### PyGEOS
PyGEOS is a Python library that wraps the GEOS library through a C extension and exposes the GEOS functions as NumPy ufuncs. It provides a high-level interface to GEOS, supporting operations on single geometries as well as vectorized operations on arrays of geometries. PyGEOS has been used as an optional, faster backend by other Python libraries such as GeoPandas, and its functionality has been merged into Shapely 2.0.  
For more information, see the [PyGEOS documentation](https://pygeos.readthedocs.io/en/latest/).

PyGEOS can be used as a replacement for Shapely 1.x, providing a faster and more efficient implementation of the geometric operations. However, it is not fully integrated with GeoPandas.  
For example, when creating a GeoDataFrame, Shapely geometries are accepted together with a CRS, while PyGEOS geometries are not.  
PyGEOS geometries can, however, be added as a new column to a GeoDataFrame that is either empty or already contains valid Shapely geometries.

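#### Compatibility of Shapely and PyGEOS

Shapely 2.0 and later incorporates the PyGEOS functionality for faster operations.
The binary wheels of Shapely 2.0 bundle the GEOS library, so a fresh install does not need a separate GEOS installation.
Whether GeoPandas is configured to use a separately installed PyGEOS can be checked by running the following code (in GeoPandas versions that still support PyGEOS):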
```python
import geopandas as gpd

print(gpd.options.use_pygeos)
```
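
The GEOS version that Shapely itself was built against can also be inspected directly. The following is a small sketch assuming Shapely 2.0 or later, which exposes the version as module-level attributes:

```python
import shapely

# Version of the Shapely package itself
print("Shapely:", shapely.__version__)

# Version of the GEOS library Shapely is linked against, as a tuple and as a string
print("GEOS (tuple): ", shapely.geos_version)
print("GEOS (string):", shapely.geos_version_string)
```

Comparing this output with the GEOS version reported in a PyGEOS warning shows whether the two libraries disagree.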

Because Shapely and PyGEOS are each linked against their own copy of the GEOS library, the two can end up with different GEOS versions.  
When both Shapely and PyGEOS are installed in the same environment, the following warning can be raised:

```text
UserWarning: The Shapely GEOS version (3.11.3-CAPI-1.17.3) is incompatible with the GEOS version PyGEOS was compiled with (3.10.4-CAPI-1.16.2). Conversions between both will be slow.
  compat.set_use_pygeos(value)
```
As the warning states, conversions between the two will be slow when the GEOS versions differ.  
It is therefore recommended to let GeoPandas work with a single backend, so that no conversions are needed.
Setting `use_pygeos` to `False` makes GeoPandas use the Shapely geometries directly instead of PyGEOS:
```python
gpd.options.use_pygeos = False
```

**NOTE: PYGEOS IS DEPRECATED -> MOVE OVER TO SHAPELY 2.0 INSTEAD**

The cells below test uninstalling PyGEOS and running the code with Shapely 2.0 only.
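
Before rerunning the benchmarks, it can help to confirm that PyGEOS is really gone from the environment. The snippet below is a sketch using only the standard library; `is_installed` is a hypothetical helper name introduced here for illustration.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if a top-level package can be found in the current environment."""
    return importlib.util.find_spec(package_name) is not None


print("pygeos installed: ", is_installed("pygeos"))
print("shapely installed:", is_installed("shapely"))
```

After a successful `pip uninstall pygeos`, the first line should report `False`.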


Tests:
- generate random points
- compute distance matrix between points

In [24]:
# pip uninstall pygeos

In [25]:
# pip show pygeos

In [26]:
import geopandas as gpd
import numpy as np
import time
from shapely.geometry import Point
# import pygeos

In [27]:
gpd.options.use_pygeos = False
# print("Check if geopandas uses shapely's pygeos:", gpd.options.use_pygeos) # speed up spatial operations

In [28]:
pip show shapely

Name: shapely
Version: 2.0.3
Summary: Manipulation and analysis of geometric objects
Home-page: 
Author: Sean Gillies
Author-email: 
License: BSD 3-Clause
Location: c:\ProgramData\miniconda3\envs\MCLP_env\Lib\site-packages
Requires: numpy
Required-by: geopandas, osmnx
Note: you may need to restart the kernel to use updated packages.


In [29]:
# $ pip install pytest  # or shapely[test]
# $ pytest --pyargs shapely.tests

In [30]:
import shapely
import pytest

# pytest --pyargs shapely.tests


In [31]:
# pytest shapely[test]

In [32]:
# pytest --pyargs shapely.tests;


In [41]:
num_points = 1000000  # Number of points to generate

# SHAPELY POINTS GENERATION AND GDF CONVERSION
# Start timing
start_time_shapely = time.time()

# Generate Shapely points
shapely_points = [Point(np.random.uniform(0, 10), np.random.uniform(0, 10)) for _ in range(num_points)]

# Convert to GeoDataFrame
gdf_shapely = gpd.GeoDataFrame(geometry=shapely_points, crs="EPSG:4326")

# End timing
end_time_shapely = time.time()



# SHAPELY ALTERNATIVE METHOD

# Start timing
start_time_shapely2 = time.time()



# Generate random coordinates using NumPy's vectorization capabilities
x_coords = np.random.uniform(0, 10, size=num_points)
y_coords = np.random.uniform(0, 10, size=num_points)

# Combine the coordinates into one (num_points, 2) array
# (assumed step: point_data was not defined in the original cell)
point_data = np.column_stack((x_coords, y_coords))

# Unpack the combined coordinates back into separate arrays
x_data, y_data = point_data.T  # Transpose to access columns as rows

# Create a GeoDataFrame directly from the separate x and y coordinates
gdf_shapely2 = gpd.GeoDataFrame(geometry=gpd.points_from_xy(x_data, y_data), crs="EPSG:4326")


# End timing
end_time_shapely2 = time.time()


# # PYGEOS POINTS GENERATION AND GDF CONVERSION
# # Start timing
# start_time_pygeos = time.time()

# # Generate PyGEOS points
# pygeos_points = pygeos.points(np.random.uniform(0, 10, size=(num_points, 2)))

# # should create the Geodataframe with converted to shapely geometry, then add the pygeos geometry in new column
# # Convert to Shapely for the GeoDataFrame compatibility
# gdf_pygeos = gpd.GeoDataFrame(pygeos.to_shapely(pygeos_points), columns=["geometry"], crs="EPSG:4326")
# gdf_pygeos["geometry_pygeos"] = pygeos_points


# # # Convert to GeoDataFrame - leveraging direct conversion to Shapely for the GeoDataFrame compatibility
# # gdf_pygeos = gpd.GeoDataFrame()
# # gdf_pygeos["geometry_pygeos"] = pygeos_points

# # 

# # End timing
# end_time_pygeos = time.time()

# COMPARING PERFORMANCE
time_shapely = end_time_shapely - start_time_shapely
time_shapely2 = end_time_shapely2 - start_time_shapely2

print(f"Shapely: {time_shapely} seconds")
print(f"Shapely2: {time_shapely2} seconds")

# gdf_pygeos.head()


Shapely: 16.89991593360901 seconds
Shapely2: 0.9531147480010986 seconds


### Test 1: Point generation and geodataframe creation
For `num_points = 1,000,000`, the results were:
- Shapely: 15.396 seconds
- PyGEOS: 0.307 seconds

However, ``shapely`` geometries work well with ``geopandas`` and can be plotted easily, while ``pygeos`` geometries have to be converted to ``shapely`` before they can be plotted.

### Test 2: Point generation and geodataframe creation with both geometry types
#### PyGEOS was forced to add both PyGEOS and Shapely geometries to the geodataframe

For `num_points = 1,000,000`, the results were:
- Shapely: 15.275 seconds
- PyGEOS: 3.100 seconds

This means that PyGEOS is about 5 times faster than Shapely for this task, even when the geometries have to be converted back to Shapely for geodataframe creation.

With both Shapely and PyGEOS installed, the Shapely-only run took 51 seconds: the conflicting GEOS versions made the conversions slow.
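
The timings in this notebook come from single runs measured with `time.time()`, so they can vary noticeably between runs. A sketch of a slightly more robust approach is shown below: it uses `time.perf_counter()`, a high-resolution clock intended for measuring intervals, and keeps the best of several repeats. `best_of` is a hypothetical helper name, and the summed range is only a stand-in workload.

```python
import time


def best_of(func, repeats=3):
    """Call func() `repeats` times; return (fastest elapsed seconds, last result)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func()
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    return best, result


# Stand-in workload; replace the lambda with the point generation to be measured
elapsed, total = best_of(lambda: sum(range(100_000)))
print(f"Best of 3: {elapsed:.6f} seconds")
```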

### Next comparison: Cost matrix computation
#### Cost matrix, using the same `scipy.spatial.distance_matrix` function

In [23]:
# COST MATRIX COMPUTATION
import pygeos  # requires PyGEOS to be installed in the environment
from scipy.spatial import distance_matrix

# Assumes pygeos_points exists and was created as shown previously
# Generate a separate random set of Shapely points of the same size for the comparison
# (these are not conversions of pygeos_points)
shapely_points = [Point(xy) for xy in np.random.uniform(0, 10, size=(num_points, 2))]

# PYGEOS Cost Matrix Computation
start_time_pygeos = time.time()
coords_pygeos = pygeos.get_coordinates(pygeos_points)
cost_matrix_pygeos = distance_matrix(coords_pygeos, coords_pygeos)
end_time_pygeos = time.time()

# SHAPELY Cost Matrix Computation (Using scipy's distance_matrix for a fair comparison)
start_time_shapely = time.time()
coords_shapely = np.array([(point.x, point.y) for point in shapely_points])
cost_matrix_shapely = distance_matrix(coords_shapely, coords_shapely)
end_time_shapely = time.time()

# Performance Comparison
time_pygeos = end_time_pygeos - start_time_pygeos
time_shapely = end_time_shapely - start_time_shapely

print(f"PyGEOS Cost Matrix Computation Time: {time_pygeos} seconds")
print(f"Shapely Cost Matrix Computation Time: {time_shapely} seconds")


PyGEOS Cost Matrix Computation Time: 0.16904711723327637 seconds
Shapely Cost Matrix Computation Time: 0.18500208854675293 seconds
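
The two timings are close because most of the work happens inside `distance_matrix`; the libraries differ only in how the coordinates are extracted. With Shapely 2.0 the list comprehension over `point.x` and `point.y` can be replaced by a single vectorized call, mirroring `pygeos.get_coordinates`. The sketch below assumes Shapely 2.0 or later and uses a small made-up sample.

```python
import numpy as np
import shapely

# Assumed sample data: five random points in the same range as the benchmark
rng = np.random.default_rng(0)
xy = rng.uniform(0, 10, size=(5, 2))
points = shapely.points(xy)

# Extract all coordinates at once as an (n, 2) float array
coords = shapely.get_coordinates(points)

print(coords.shape)
```

The resulting `coords` array can be passed to `scipy.spatial.distance_matrix` in the same way as `coords_pygeos` above.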


#### Cost matrix, using their own functions

In [24]:
import numpy as np
import time

# Placeholder for the cost matrices
cost_matrix_shapely = np.zeros((len(shapely_points), len(shapely_points)))
cost_matrix_pygeos = np.zeros((len(pygeos_points), len(pygeos_points)))

# SHAPELY COST MATRIX
start_time_shapely = time.time()
for i, point_i in enumerate(shapely_points):
    for j, point_j in enumerate(shapely_points):
        if i != j:  # Avoid computing distance to itself
            cost_matrix_shapely[i, j] = point_i.distance(point_j)
end_time_shapely = time.time()

# PYGEOS COST MATRIX
start_time_pygeos = time.time()
# Efficiently compute distances using PyGEOS
for i in range(num_points):
    cost_matrix_pygeos[i, :] = pygeos.distance(pygeos_points[i], pygeos_points)

# Ensure diagonal is 0 (distance to itself)
np.fill_diagonal(cost_matrix_pygeos, 0)
end_time_pygeos = time.time()

# COMPARING PERFORMANCE
time_shapely = end_time_shapely - start_time_shapely
time_pygeos = end_time_pygeos - start_time_pygeos

print(f"Shapely Cost Matrix Calculation: {time_shapely} seconds")
print(f"PyGEOS Cost Matrix Calculation: {time_pygeos} seconds")


Shapely Cost Matrix Calculation: 48.9489643573761 seconds
PyGEOS Cost Matrix Calculation: 1.9496917724609375 seconds


### Cost matrix, using their own functions - results
For `num_points = 2000`, the results were:
- Shapely: 48.948 seconds
- PyGEOS: 1.949 seconds
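
Since PyGEOS is deprecated, the same vectorized approach can be written with Shapely 2.0 alone. The sketch below is not part of the measured benchmark; it assumes Shapely 2.0 or later and relies on NumPy broadcasting to compute the full matrix in one call. The three sample points are made up so that the distances are easy to check by hand.

```python
import numpy as np
import shapely

# Assumed sample data: three collinear points spaced 5 units apart
coords = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
points = shapely.points(coords)

# Broadcasting a column of points against a row of points yields an (n, n) matrix
cost_matrix = shapely.distance(points[:, np.newaxis], points[np.newaxis, :])

print(cost_matrix)
```

The result is a dense n-by-n array, so memory use grows quadratically with the number of points.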
