transform much slower in 2.0.1 than 1.9.6 #187

Closed

docmarionum1 opened this issue Mar 11, 2019 · 21 comments
@docmarionum1

I'm using Python 3.6.7 on Ubuntu 18.04.1. I installed each version of pyproj via pip install pyproj==x.y.z.

I'm using pyproj via geopandas. I recently upgraded from 1.9.5.1 to 2.0.1 and noticed that calls to to_crs, which uses pyproj.transform via shapely.ops.transform, became much slower.

It seems like there is a large overhead on the call to transform now. When calling transform with large arrays, the difference is less pronounced, but when called on individual coordinates it is quite large.

I tested with individual x, y pairs; arrays of 10 elements each, to simulate usage with a "normal"-sized geometry; and arrays of 1,000,000 elements each, which is unrealistic unless you're working only with points rather than more complex geometry.

This was the setup:

import random

import numpy as np
import pyproj

proj_in = pyproj.Proj({'init': 'epsg:2263'}, preserve_units=True)
proj_out = pyproj.Proj({'init': 'epsg:4326'}, preserve_units=True)

Testing on individual coordinate pairs, 2.0.1 is roughly 600x slower than 1.9.5.1:

%%timeit -n50 -r50
pyproj.transform(proj_in, proj_out, random.randint(80000, 120000), random.randint(200000, 250000))
  • 1.9.5.1 - 13.9 µs ± 5.82 µs per loop
  • 1.9.6 - 14.6 µs ± 6.52 µs per loop
  • 2.0.0 - 945 µs ± 152 µs per loop
  • 2.0.1 - 8.77 ms ± 915 µs per loop

For arrays of 10 coordinates each:

%%timeit -n50 -r50
pyproj.transform(proj_in, proj_out, np.random.randint(80000, 120000, 10), np.random.randint(200000, 250000, 10))
  • 1.9.6 - 25.1 µs ± 16.8 µs per loop
  • 2.0.1 - 8.81 ms ± 798 µs per loop

And for arrays of 1,000,000:

%%timeit -n5 -r5
pyproj.transform(proj_in, proj_out, np.random.randint(80000, 120000, 1000000), np.random.randint(200000, 250000, 1000000))
  • 1.9.6 - 689 ms ± 7.57 ms per loop
  • 2.0.1 - 1.18 s ± 24.9 ms per loop
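A quick back-of-envelope check on these numbers (just arithmetic on the timings above, not a new measurement) suggests the extra time is a fixed per-call overhead rather than a per-point cost:

# Rough model: time_per_call = fixed_overhead + n_points * per_point_cost
single_196, single_201 = 14.6e-6, 8.77e-3   # seconds per call, single pair
big_196, big_201 = 0.689, 1.18              # seconds per call, 1,000,000 points

fixed_overhead = single_201 - single_196        # ~8.76 ms per call
extra_per_point = (big_201 - big_196) / 1e6     # ~0.49 µs per point
print(f"fixed overhead:  ~{fixed_overhead * 1e3:.2f} ms per call")
print(f"extra per point: ~{extra_per_point * 1e6:.2f} µs")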
@jswhit (Collaborator) commented Mar 11, 2019

I wonder if the C lib is just slower? I don't see that much has changed in the pyproj.transform interface.

@jswhit (Collaborator) commented Mar 12, 2019

Looks like most of the extra time is spent setting up the data structures that the new pyproj.transform needs (specifically the TransProj object). That's why the speed difference shrinks as the arrays get larger.

@jswhit (Collaborator) commented Mar 12, 2019

We could allow a TransProj instance to be created outside of transform and then re-used for multiple calls. Something like this:

import random

import pyproj

proj_in = pyproj.Proj({'init': 'epsg:2263'}, preserve_units=True)
proj_out = pyproj.Proj({'init': 'epsg:4326'}, preserve_units=True)
proj_trans = pyproj.TransProj(proj_in, proj_out)
pyproj.transform2(proj_trans, random.randint(80000, 120000), random.randint(200000, 250000))

The cost of pyproj.transform2 should then be quite similar to pyproj.transform in 1.9.6.

@docmarionum1 (Author)

If most of the added overhead is in the instantiation of TransProj, then yeah, that would be perfect.

@snowman2 (Member)

Thoughts on using a class-based approach? This would mean that the TransProj could be a property of the class and would not need to be created by the user.

import random

from pyproj import Transformer, Proj

proj_in = Proj({'init': 'epsg:2263'}, preserve_units=True)
proj_out = Proj({'init': 'epsg:4326'}, preserve_units=True)
transformer = Transformer(proj_in, proj_out)
transformer.transform(random.randint(80000, 120000), random.randint(200000, 250000))

Plus, it would be fun to have a class called Transformer.

@jswhit (Collaborator) commented Mar 12, 2019

I like it! Preserves backwards compatibility too.

@snowman2 (Member) commented Mar 12, 2019

Sounds great. I think going this route could also make it possible to support custom PROJ pipelines:

transformer = Transformer.from_proj(proj_in, proj_out)
transformer = Transformer.from_pipeline("PROJ pipeline projection string")
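For example, a minimal sketch of what from_pipeline usage could look like (the pipeline string is taken from the PROJ documentation examples; it converts lon/lat in degrees to UTM zone 32 coordinates):

from pyproj import Transformer

# Custom PROJ pipeline: degrees -> radians -> UTM zone 32
transformer = Transformer.from_pipeline(
    "+proj=pipeline "
    "+step +proj=unitconvert +xy_in=deg +xy_out=rad "
    "+step +proj=utm +zone=32 +ellps=GRS80"
)
x, y = transformer.transform(12.0, 55.0)  # lon, lat in degrees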

@snowman2 (Member)

New version times:

import numpy as np
from pyproj import Transformer

transformer = Transformer.from_proj(2263, 4326)

Test 1:

%%timeit -n50 -r50
transformer.transform(np.random.randint(80000, 120000), np.random.randint(200000, 250000))
36.4 µs ± 7.27 µs per loop (mean ± std. dev. of 50 runs, 50 loops each)

Test 2:

%%timeit -n50 -r50
transformer.transform(np.random.randint(80000, 120000, 10), np.random.randint(200000, 250000, 10))
The slowest run took 4.05 times longer than the fastest. This could mean that an intermediate result is being cached.
63.5 µs ± 33.6 µs per loop (mean ± std. dev. of 50 runs, 50 loops each)

%%timeit
transformer.transform(np.random.randint(80000, 120000, 10), np.random.randint(200000, 250000, 10))
30.8 µs ± 1.01 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Test 3:

%%timeit -n5 -r5
transformer.transform(np.random.randint(80000, 120000, 1000000), np.random.randint(200000, 250000, 1000000))
1.94 s ± 21.9 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)

But this is on my machine. Curious to know how it performs on yours once 2.1.0 is released.
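For anyone re-running this, here is a minimal sketch (assuming pyproj >= 2.1, where Transformer.from_proj is available) that separates the one-time setup cost from the per-call cost:

import timeit

# One-time cost of building the transformer:
setup_cost = timeit.timeit(
    "Transformer.from_proj(2263, 4326)",
    setup="from pyproj import Transformer",
    number=100) / 100

# Per-call cost once the transformer exists:
call_cost = timeit.timeit(
    "t.transform(100000.0, 225000.0)",
    setup="from pyproj import Transformer; t = Transformer.from_proj(2263, 4326)",
    number=10000) / 10000

print(f"one-time setup: {setup_cost * 1e6:.1f} µs")
print(f"per transform:  {call_cost * 1e6:.1f} µs")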

@snowman2 (Member)

pyproj==2.1.0 is released with the fix. Mind giving it a go?

@snowman2 (Member)

Closing as the issue has been addressed. Thanks for the report!

@andrerav

Hi @snowman2, we are still seeing severe performance degradation when using transform in pyproj 2.1.1 compared to 1.9.6. The change in execution time seems to be 2-3 orders of magnitude. We are still investigating, but so far the only fix has been to revert to 1.9.6.

@snowman2 (Member)

Just to be sure, are you following the recommendation here: https://pyproj4.github.io/pyproj/html/optimize_transformations.html?
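The gist of that page, as a minimal sketch: build the Transformer once and reuse it, rather than calling pyproj.transform, which re-creates the transformation on every call. (Transformer.from_crs and the always_xy flag are used the same way in the example later in this thread.)

from pyproj import Transformer

# Build the transformer once, outside any loop or DataFrame.apply():
transformer = Transformer.from_crs("EPSG:4326", "EPSG:25832", always_xy=True)

# ...then reuse it for every point:
for lon, lat in [(10.0, 50.0), (10.1, 50.1)]:
    x, y = transformer.transform(lon, lat)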

@snowman2 (Member)

If so, it might be related to #128.

@docmarionum1 (Author)

Got a chance to test out the new Transformer functionality and it brings the runtime back down to 1.9.6 levels. Thanks for the quick fix @snowman2!

@snowman2 (Member)

Fantastic, this is good news!

@RadMagnus commented Apr 29, 2020

@andrerav were you able to fix this without reverting?

I just upgraded pyproj {1.9.6 (defaults/win-64) -> 2.6.0 (conda-forge/win-64)}

I need to iterate over dataframes with hundreds of thousands of rows. In 1.9.6 this would take about 30 s. After upgrading, the performance drop was orders of magnitude. After applying https://pyproj4.github.io/pyproj/stable/advanced_examples.html#advanced-examples, performance increased a little but is still at 14 it/s, so it would take ~2 h. I also read #128 but could not solve it from there.

Edit: fixed link

@snowman2 (Member)

See: geopandas/geopandas#1400

@jorisvandenbossche (Contributor)

@RadMagnus can you show some code that you are using? Maybe we can spot something that might be wrong (or, ideally, a reproducible example that is slow for you).

@RadMagnus commented Apr 30, 2020

I must apologise, I can't seem to reproduce my error from yesterday.
I again followed https://pyproj4.github.io/pyproj/stable/advanced_examples.html#advanced-examples and this time it worked flawlessly.
As I can't roll back to 1.9.6, I cannot reproduce that version's computational speed running the old code. However, as @docmarionum1 already mentioned, speed is back up to 1.9.6 levels. I will post the old & new code with reproducible examples nonetheless.
Thank you for the quick replies!

Edit: I know the apply & lambda construction is not the fastest solution, and @snowman2 posted a vectorized version here: https://gis.stackexchange.com/a/334307 (a minimal sketch of it follows below). For my problem size, apply still feels sufficiently fast.
Edit 2: Worked this into my project and want to emphasize the always_xy=True flag in
transformer = Transformer.from_crs(crs_1, crs_2, always_xy=True).
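A sketch of that vectorized approach (my reading of the linked answer; it passes whole columns to Transformer.transform instead of applying row by row):

import numpy as np
import pandas as pd
from pyproj import Transformer

df = pd.DataFrame(np.random.rand(100000, 2), columns=['longitude', 'latitude'])
transformer = Transformer.from_crs(4326, 25832, always_xy=True)

# Transform entire columns in one call:
xx, yy = transformer.transform(df['longitude'].values, df['latitude'].values)
df['x_25832'], df['y_25832'] = xx, yy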

Code from 1.9.6

running at ~12 it/s in 2.6.0

# original imports
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point
from pyproj import transform, Proj
import tqdm
tqdm.tqdm.pandas()

# create test data
import numpy as np

np.random.seed(444)
df = pd.DataFrame(np.random.rand(1000,2))
df.index = pd.date_range('2020/01/01', freq = 's', periods = df.shape[0])
df = df.rename(columns={0:'longitude', 1:'latitude'})

# lon, lat are given in epsg:4326 and need to be epsg:25832
# specify projections, the 1.9.6 way:
inProj = Proj(init='epsg:4326')
outProj = Proj(init='epsg:25832')

# turn df into geodataframe in epsg_25832 projection
def createGeoDf(df, inProj, outProj):
    # transform coordinates using 'def coordinateTransform' and store as tuple in series
    df['epsg_25832'] = df.progress_apply(lambda x: coordinateTransform(x['longitude'], x['latitude'], inProj, outProj),axis=1)
    # turn coordinates into geodataframeable geometries using shapely Points
    points = df.progress_apply(lambda row: Point(row.epsg_25832[0], row.epsg_25832[1]), axis=1)
    # create gdf & set crs
    geo_df = gpd.GeoDataFrame(df, geometry=points)
    geo_df.crs = {'init': 'epsg:25832'}       
    return geo_df

# project car coordinates from inProj (epsg:4326) to outProj (epsg:25832)
def coordinateTransform(longitude, latitude, inProj, outProj):
    trans_x, trans_y = transform(inProj,outProj,longitude,latitude)
    return trans_x, trans_y

def main():
    geo_df = createGeoDf(df, inProj, outProj)

if __name__ == "__main__":
    main()    

Code for 2.6.0

running at ~29,375 it/s, so as fast as in 1.9.6 🙂

# original imports
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point
from pyproj import Transformer
import tqdm
tqdm.tqdm.pandas()

# create test data
import numpy as np
np.random.seed(444)

df = pd.DataFrame(np.random.rand(100000,2))
df.index = pd.date_range('2020/01/01', freq = 's', periods = df.shape[0])
df = df.rename(columns={0:'longitude', 1:'latitude'})

# lon, lat are given in epsg:4326 and need to be epsg:25832
# specify projections, the 2.6.0 way:
transformer = Transformer.from_crs(4326, 25832, always_xy = True)

# turn df into geodataframe in epsg_25832 projection
def createGeoDf(df, transformer):
    # transform coordinates using 'def coordinateTransform' and store as tuple in series
    df['epsg_25832'] = df.progress_apply(
        lambda x: coordinateTransform(x['longitude'], x['latitude'], transformer),axis=1)
    # turn coordinates into geodataframeable geometries using shapely Points
    points = df.progress_apply(
        lambda row: Point(row.epsg_25832[0], row.epsg_25832[1]), axis=1)
    # create gdf & set crs
    geo_df = gpd.GeoDataFrame(df, geometry=points)
    geo_df.crs = {'init': 'epsg:25832'}       
    return geo_df

# project car coordinates from epsg:4326 to epsg:25832
def coordinateTransform(longitude, latitude, transformer):
    trans_coord = transformer.transform(longitude, latitude)
    return trans_coord[0], trans_coord[1]

def main():
    geo_df = createGeoDf(df, transformer)

if __name__ == "__main__":
    main()    

@jorisvandenbossche (Contributor)

@RadMagnus good to hear that it is solved!

BTW, what you are doing is very similar to what GeoDataFrame.to_crs provides. But I suppose you are doing the projection manually first, before creating the GeoDataFrame, because this might be faster?

That should generally be true (if you do it on the full array instead of applying over the rows), but with your code above there is actually almost no difference, based on a quick check:

Using your createGeoDf function:

In [29]: %%timeit df = pd.DataFrame(np.random.randn(10000, 2), columns=['longitude', 'latitude']) 
    ...: createGeoDf(df, transformer) 
    ...: 
556 ms ± 7.74 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Using geopandas to convert to different CRS:

In [33]: %%timeit df = pd.DataFrame(np.random.randn(10000, 2), columns=['longitude', 'latitude']) 
    ...: points = df.apply(lambda row: Point(row.longitude, row.latitude), axis=1) 
    ...: geo_df = gpd.GeoDataFrame(df, geometry=points, crs={'init': 'epsg:25832'}) 
    ...: geo_df.to_crs(epsg=25832) 
    ...:  
559 ms ± 7.85 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

And using the upcoming geopandas release with the optional pygeos dependency (currently in master), it's actually much faster:

In [5]: %%timeit df = pd.DataFrame(np.random.randn(10000, 2), columns=['longitude', 'latitude'])  
   ...: geo_df = geopandas.GeoDataFrame(geometry=geopandas.points_from_xy(df['longitude'], df['latitude']), crs="EPSG:4326") 
   ...: geo_df.to_crs(epsg=25832) 
   ...:  
69 ms ± 4.84 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

@RadMagnus

@jorisvandenbossche

I honestly was just not aware of GeoDataFrame.to_crs in this context. I had the coordinates before having a GeoDataFrame, so converting first and then creating the gdf was just a little more obvious to me.

I actually looked into pygeos based on your talk https://jorisvandenbossche.github.io/talks/2019_FOSS4GBE_pygeos/#39 before responding to this issue, mainly to speed up point-in-polygon operations. It looked very promising, but then I figured you were going to incorporate it into geopandas 0.8 and decided to wait for that release before rewriting non-urgent code twice. I'm thrilled to see those speedups happening. Thanks for all the effort!
