transform much slower in 2.0.1 than 1.9.6 #187

Closed

docmarionum1 opened this issue Mar 11, 2019 · 21 comments
@docmarionum1

I'm using Python 3.6.7 on Ubuntu 18.04.1. I installed each version of pyproj via pip install pyproj==x.y.z.

I'm using pyproj via geopandas. I recently upgraded from 1.9.5.1 to 2.0.1 and noticed that calls to to_crs, which uses pyproj.transform via shapely.ops.transform, became much slower.

It seems like there is a large overhead on the call to transform now. When calling transform with large arrays, the difference is less pronounced, but when called on individual coordinates it is quite large.

I tested with individual x, y pairs; arrays of 10 elements each, to simulate usage with a "normal"-sized geometry; and arrays of 1,000,000 elements each, which is unrealistic unless you're working only with points rather than more complex geometry.

This was the setup:

import random

import numpy as np
import pyproj

proj_in = pyproj.Proj({'init': 'epsg:2263'}, preserve_units=True)
proj_out = pyproj.Proj({'init': 'epsg:4326'}, preserve_units=True)

Testing on individual coordinate pairs, 2.0.1 is roughly 600x slower than 1.9.5.1:

%%timeit -n50 -r50
pyproj.transform(proj_in, proj_out, random.randint(80000, 120000), random.randint(200000, 250000))
  • 1.9.5.1 - 13.9 µs ± 5.82 µs per loop
  • 1.9.6 - 14.6 µs ± 6.52 µs per loop
  • 2.0.0 - 945 µs ± 152 µs per loop
  • 2.0.1 - 8.77 ms ± 915 µs per loop

For arrays of 10 coordinates each:

%%timeit -n50 -r50
pyproj.transform(proj_in, proj_out, np.random.randint(80000, 120000, 10), np.random.randint(200000, 250000, 10))
  • 1.9.6 - 25.1 µs ± 16.8 µs per loop
  • 2.0.1 - 8.81 ms ± 798 µs per loop

And for arrays of 1,000,000:

%%timeit -n5 -r5
pyproj.transform(proj_in, proj_out, np.random.randint(80000, 120000, 1000000), np.random.randint(200000, 250000, 1000000))
  • 1.9.6 - 689 ms ± 7.57 ms per loop
  • 2.0.1 - 1.18 s ± 24.9 ms per loop
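A quick back-of-envelope check on these numbers (just arithmetic on the timings above, not a new measurement) suggests the extra time is a fixed per-call overhead rather than a per-point cost:

# Rough model: time_per_call = fixed_overhead + n_points * per_point_cost
single_196, single_201 = 14.6e-6, 8.77e-3   # seconds per call, single pair
big_196, big_201 = 0.689, 1.18              # seconds per call, 1,000,000 points

fixed_overhead = single_201 - single_196        # ~8.76 ms per call
extra_per_point = (big_201 - big_196) / 1e6     # ~0.49 µs per point
print(f"fixed overhead:  ~{fixed_overhead * 1e3:.2f} ms per call")
print(f"extra per point: ~{extra_per_point * 1e6:.2f} µs")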
@jswhit (Collaborator) commented Mar 11, 2019

I wonder if the C lib is just slower? I don't see that much has changed in the pyproj.transform interface.

@jswhit (Collaborator) commented Mar 12, 2019

Looks like most of the extra time is spent setting up the data structures that the new pyproj.transform needs (specifically the TransProj object). That's why the speed difference shrinks as the arrays get larger.

@jswhit (Collaborator) commented Mar 12, 2019

We could allow a TransProj instance to be created outside of transform and then re-used for multiple calls. Something like this:

import random

import pyproj

proj_in = pyproj.Proj({'init': 'epsg:2263'}, preserve_units=True)
proj_out = pyproj.Proj({'init': 'epsg:4326'}, preserve_units=True)
proj_trans = pyproj.TransProj(proj_in, proj_out)
pyproj.transform2(proj_trans, random.randint(80000, 120000), random.randint(200000, 250000))

The cost of pyproj.transform2 should then be quite similar to pyproj.transform in 1.9.6.

@docmarionum1 (Author)

If most of the added overhead is in the instantiation of TransProj, then yeah, that would be perfect.

@snowman2 (Member)

Thoughts on using a class-based approach? This would mean that the TransProj could be a property of the class and would not need to be created by the user.

import random

from pyproj import Transformer, Proj

proj_in = Proj({'init': 'epsg:2263'}, preserve_units=True)
proj_out = Proj({'init': 'epsg:4326'}, preserve_units=True)
transformer = Transformer(proj_in, proj_out)
transformer.transform(random.randint(80000, 120000), random.randint(200000, 250000))

Plus, it would be fun to have a class called Transformer.

@jswhit (Collaborator) commented Mar 12, 2019

I like it! Preserves backwards compatibility too.

@snowman2 (Member) commented Mar 12, 2019

Sounds great. I think going this route could also make it possible to support custom PROJ pipelines:

transformer = Transformer.from_proj(proj_in, proj_out)
transformer = Transformer.from_pipeline("PROJ pipeline projection string")
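For example, a minimal sketch of what from_pipeline usage could look like (the pipeline string is taken from the PROJ documentation examples; it converts lon/lat in degrees to UTM zone 32 coordinates):

from pyproj import Transformer

# Custom PROJ pipeline: degrees -> radians -> UTM zone 32
transformer = Transformer.from_pipeline(
    "+proj=pipeline "
    "+step +proj=unitconvert +xy_in=deg +xy_out=rad "
    "+step +proj=utm +zone=32 +ellps=GRS80"
)
x, y = transformer.transform(12.0, 55.0)  # lon, lat in degrees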

@snowman2 (Member)

New version times:

import numpy as np
from pyproj import Transformer

transformer = Transformer.from_proj(2263, 4326)

Test 1:

%%timeit -n50 -r50
transformer.transform(np.random.randint(80000, 120000), np.random.randint(200000, 250000))
36.4 µs ± 7.27 µs per loop (mean ± std. dev. of 50 runs, 50 loops each)

Test 2:

%%timeit -n50 -r50
transformer.transform(np.random.randint(80000, 120000, 10), np.random.randint(200000, 250000, 10))
The slowest run took 4.05 times longer than the fastest. This could mean that an intermediate result is being cached.
63.5 µs ± 33.6 µs per loop (mean ± std. dev. of 50 runs, 50 loops each)

%%timeit
transformer.transform(np.random.randint(80000, 120000, 10), np.random.randint(200000, 250000, 10))
30.8 µs ± 1.01 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Test 3:

%%timeit -n5 -r5
transformer.transform(np.random.randint(80000, 120000, 1000000), np.random.randint(200000, 250000, 1000000))
1.94 s ± 21.9 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)

But this is on my machine. Curious to know how it performs on yours once 2.1.0 is released.
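For anyone re-running this, here is a minimal sketch (assuming pyproj >= 2.1, where Transformer.from_proj is available) that separates the one-time setup cost from the per-call cost:

import timeit

# One-time cost of building the transformer:
setup_cost = timeit.timeit(
    "Transformer.from_proj(2263, 4326)",
    setup="from pyproj import Transformer",
    number=100) / 100

# Per-call cost once the transformer exists:
call_cost = timeit.timeit(
    "t.transform(100000.0, 225000.0)",
    setup="from pyproj import Transformer; t = Transformer.from_proj(2263, 4326)",
    number=10000) / 10000

print(f"one-time setup: {setup_cost * 1e6:.1f} µs")
print(f"per transform:  {call_cost * 1e6:.1f} µs")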

@snowman2 (Member)

pyproj==2.1.0 is released with the fix. Mind giving it a go?

@snowman2 (Member)

Closing as the issue has been addressed. Thanks for the report!

@andrerav

Hi @snowman2, we are still seeing severe performance degradation when using transform in pyproj 2.1.1 compared to 1.9.6. The change in execution time seems to be 2-3 orders of magnitude. We are still investigating, but so far the only fix has been to revert to 1.9.6.

@snowman2 (Member)

Just to be sure, are you following the recommendation here: https://pyproj4.github.io/pyproj/html/optimize_transformations.html?
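The gist of that page, as a minimal sketch: build the Transformer once and reuse it, rather than calling pyproj.transform, which re-creates the transformation on every call. (Transformer.from_crs and the always_xy flag are used the same way in the example later in this thread.)

from pyproj import Transformer

# Build the transformer once, outside any loop or DataFrame.apply():
transformer = Transformer.from_crs("EPSG:4326", "EPSG:25832", always_xy=True)

# ...then reuse it for every point:
for lon, lat in [(10.0, 50.0), (10.1, 50.1)]:
    x, y = transformer.transform(lon, lat)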

@snowman2 (Member)

If so, it might be related to #128.

@docmarionum1 (Author)

Got a chance to test out the new Transformer functionality and it brings the runtime back down to 1.9.6 levels. Thanks for the quick fix @snowman2!

@snowman2 (Member)

Fantastic, this is good news!

@RadMagnus commented Apr 29, 2020

@andrerav were you able to fix this without reverting?

I just upgraded pyproj {1.9.6 (defaults/win-64) -> 2.6.0 (conda-forge/win-64)}

I need to iterate over dataframes with hundreds of thousands of rows. In 1.9.6 this would take about 30 s. After upgrading, the performance drop was orders of magnitude. After applying https://pyproj4.github.io/pyproj/stable/advanced_examples.html#advanced-examples, performance increased a little but is still at 14 it/s, so it would take ~2 h. I also read #128 but could not solve it from there.

Edit: fixed link

@snowman2 (Member)

See: geopandas/geopandas#1400

@jorisvandenbossche (Contributor)

@RadMagnus can you show some code that you are using? Maybe we can spot something that might be wrong (or, ideally, a reproducible example that is slow for you).

@RadMagnus commented Apr 30, 2020

I must apologise, I can't seem to reproduce my error from yesterday.
I again followed https://pyproj4.github.io/pyproj/stable/advanced_examples.html#advanced-examples and this time it worked flawlessly.
As I can't roll back to 1.9.6, I cannot reproduce that version's computational speed running the old code. However, as @docmarionum1 already mentioned, speed is back up to 1.9.6 levels. I will post the old & new code with reproducible examples nonetheless.
Thank you for the quick replies!

Edit: I know the apply & lambda construction is not the fastest solution, and @snowman2 posted a vectorized version here: https://gis.stackexchange.com/a/334307 (a minimal sketch of it follows below). For my problem size, apply still feels sufficiently fast.
Edit 2: Worked this into my project and want to emphasize the always_xy=True flag in
transformer = Transformer.from_crs(crs_1, crs_2, always_xy=True).
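A sketch of that vectorized approach (my reading of the linked answer; it passes whole columns to Transformer.transform instead of applying row by row):

import numpy as np
import pandas as pd
from pyproj import Transformer

df = pd.DataFrame(np.random.rand(100000, 2), columns=['longitude', 'latitude'])
transformer = Transformer.from_crs(4326, 25832, always_xy=True)

# Transform entire columns in one call:
xx, yy = transformer.transform(df['longitude'].values, df['latitude'].values)
df['x_25832'], df['y_25832'] = xx, yy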

Code from 1.9.6

running at ~12 it/s in 2.6.0

# original imports
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point
from pyproj import transform, Proj
import tqdm
tqdm.tqdm.pandas()

# create test data
import numpy as np

np.random.seed(444)
df = pd.DataFrame(np.random.rand(1000,2))
df.index = pd.date_range('2020/01/01', freq = 's', periods = df.shape[0])
df = df.rename(columns={0:'longitude', 1:'latitude'})

# lon, lat are given in epsg:4326 and need to be epsg:25832
# specify projections, the 1.9.6 way:
inProj = Proj(init='epsg:4326')
outProj = Proj(init='epsg:25832')

# turn df into geodataframe in epsg_25832 projection
def createGeoDf(df, inProj, outProj):
    # transform coordinates using 'def coordinateTransform' and store as tuple in series
    df['epsg_25832'] = df.progress_apply(lambda x: coordinateTransform(x['longitude'], x['latitude'], inProj, outProj),axis=1)
    # turn coordinates into geodataframeable geometries using shapely Points
    points = df.progress_apply(lambda row: Point(row.epsg_25832[0], row.epsg_25832[1]), axis=1)
    # create gdf & set crs
    geo_df = gpd.GeoDataFrame(df, geometry=points)
    geo_df.crs = {'init': 'epsg:25832'}       
    return geo_df

# project car coordinates from inProj (epsg:4326) to outProj (epsg:25832)
def coordinateTransform(longitude, latitude, inProj, outProj):
    trans_x, trans_y = transform(inProj,outProj,longitude,latitude)
    return trans_x, trans_y

def main():
    geo_df = createGeoDf(df, inProj, outProj)

if __name__ == "__main__":
    main()    

Code for 2.6.0

running at ~29,375 it/s, so as fast as in 1.9.6 🙂

# original imports
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point
from pyproj import Transformer
import tqdm
tqdm.tqdm.pandas()

# create test data
import numpy as np
np.random.seed(444)

df = pd.DataFrame(np.random.rand(100000,2))
df.index = pd.date_range('2020/01/01', freq = 's', periods = df.shape[0])
df = df.rename(columns={0:'longitude', 1:'latitude'})

# lon, lat are given in epsg:4326 and need to be epsg:25832
# specify projections, the 2.6.0 way:
transformer = Transformer.from_crs(4326, 25832, always_xy = True)

# turn df into geodataframe in epsg_25832 projection
def createGeoDf(df, transformer):
    # transform coordinates using 'def coordinateTransform' and store as tuple in series
    df['epsg_25832'] = df.progress_apply(
        lambda x: coordinateTransform(x['longitude'], x['latitude'], transformer),axis=1)
    # turn coordinates into geodataframeable geometries using shapely Points
    points = df.progress_apply(
        lambda row: Point(row.epsg_25832[0], row.epsg_25832[1]), axis=1)
    # create gdf & set crs
    geo_df = gpd.GeoDataFrame(df, geometry=points)
    geo_df.crs = {'init': 'epsg:25832'}       
    return geo_df

# project car coordinates from epsg:4326 to epsg:25832
def coordinateTransform(longitude, latitude, transformer):
    trans_coord = transformer.transform(longitude, latitude)
    return trans_coord[0], trans_coord[1]

def main():
    geo_df = createGeoDf(df, transformer)

if __name__ == "__main__":
    main()    

@jorisvandenbossche (Contributor)

@RadMagnus good to hear that it is solved!

BTW, what you are doing is very similar to what GeoDataFrame.to_crs provides. But I suppose you are doing the projection manually first, before creating the GeoDataFrame, because this might be faster?

That should generally be true (if you do it on the full array instead of applying over the rows), but with your code above there is actually almost no difference, based on a quick check:

Using your createGeoDf function:

In [29]: %%timeit df = pd.DataFrame(np.random.randn(10000, 2), columns=['longitude', 'latitude']) 
    ...: createGeoDf(df, transformer) 
    ...: 
556 ms ± 7.74 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Using geopandas to convert to different CRS:

In [33]: %%timeit df = pd.DataFrame(np.random.randn(10000, 2), columns=['longitude', 'latitude']) 
    ...: points = df.apply(lambda row: Point(row.longitude, row.latitude), axis=1) 
    ...: geo_df = gpd.GeoDataFrame(df, geometry=points, crs={'init': 'epsg:25832'}) 
    ...: geo_df.to_crs(epsg=25832) 
    ...:  
559 ms ± 7.85 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

And using the upcoming geopandas release with the optional pygeos dependency (currently in master), it's actually much faster:

In [5]: %%timeit df = pd.DataFrame(np.random.randn(10000, 2), columns=['longitude', 'latitude'])  
   ...: geo_df = geopandas.GeoDataFrame(geometry=geopandas.points_from_xy(df['longitude'], df['latitude']), crs="EPSG:4326") 
   ...: geo_df.to_crs(epsg=25832) 
   ...:  
69 ms ± 4.84 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

@RadMagnus

@jorisvandenbossche

I honestly was just not aware of GeoDataFrame.to_crs in this context. I had the coordinates before having a GeoDataFrame, so converting first and then creating the gdf was just a little more obvious to me.

I actually looked into pygeos based on your talk https://jorisvandenbossche.github.io/talks/2019_FOSS4GBE_pygeos/#39 before responding to this issue, mainly to speed up point-in-polygon operations. It looked very promising, but then I figured you were going to incorporate it into geopandas 0.8 and decided to wait for that release before rewriting non-urgent code twice. I'm thrilled to see those speedups happening. Thanks for all the effort!
