transform much slower in 2.0.1 than 1.9.6 #187
Comments
I wonder if the C lib is just slower? Don't see that much has changed in the pyproj.transform interface. |
Looks like most of the extra time is in setting up the data structures that the new |
We could allow for a
import random
import pyproj
proj_in = pyproj.Proj({'init': 'epsg:2263'}, preserve_units=True)
proj_out = pyproj.Proj({'init': 'epsg:4326'}, preserve_units=True)
proj_trans = pyproj.TransProj(proj_in, proj_out)
pyproj.transform2(proj_trans, random.randint(80000, 120000), random.randint(200000, 250000))
The cost of |
If most of the added overhead is in the instantiation of |
Thoughts on using a class-based approach? This would mean that the TransProj could be a property on the class and would not need to be generated by the user.
import random
from pyproj import Transformer, Proj
proj_in = Proj({'init': 'epsg:2263'}, preserve_units=True)
proj_out = Proj({'init': 'epsg:4326'}, preserve_units=True)
transformer = Transformer(proj_in, proj_out)
transformer.transform(random.randint(80000, 120000), random.randint(200000, 250000))
Plus, it would be fun to have a class called |
I like it! Preserves backwards compatibility too. |
Sounds great. I think going this route could also make supporting custom proj pipelines possible:
transformer = Transformer.from_proj(proj_in, proj_out)
transformer = Transformer.from_pipeline("PROJ pipeline projection string") |
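To illustrate the reasoning in the comments above (the expensive part is building the transformation object, not applying it), here is a minimal, standard-library-only sketch. `SlowSetupTransformer`, `transform_once`, `time_calls`, and the 1 ms setup delay are made up for illustration; they stand in for the idea of a reusable transformer and say nothing about pyproj's actual internals or timings.

```python
import time


class SlowSetupTransformer:
    """Hypothetical stand-in (not pyproj): costly to build, cheap to call."""

    instances_built = 0

    def __init__(self, scale, setup_seconds=0.001):
        # Simulates the one-off cost of building internal data structures.
        time.sleep(setup_seconds)
        SlowSetupTransformer.instances_built += 1
        self.scale = scale

    def transform(self, x, y):
        return x * self.scale, y * self.scale


def transform_once(scale, x, y):
    """Function-style API: rebuilds the transformer on every call."""
    return SlowSetupTransformer(scale).transform(x, y)


def time_calls(func, n):
    """Total wall-clock seconds for n calls of func(i, i)."""
    start = time.perf_counter()
    for i in range(n):
        func(i, i)
    return time.perf_counter() - start


points = 200
per_call = time_calls(lambda x, y: transform_once(2.0, x, y), points)
reusable = SlowSetupTransformer(2.0)  # setup cost paid exactly once
reused = time_calls(reusable.transform, points)
print(f"rebuild every call: {per_call:.4f}s, build once: {reused:.4f}s")
```

The gap between the two numbers grows with the number of calls, which would be consistent with single coordinate pairs suffering most and large arrays suffering least.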
New version times:
import numpy as np
from pyproj import Transformer
transformer = Transformer.from_proj(2263, 4326)
Test 1:
%%timeit -n50 -r50
transformer.transform(np.random.randint(80000, 120000), np.random.randint(200000, 250000))
Test 2:
%%timeit -n50 -r50
transformer.transform(np.random.randint(80000, 120000, 10), np.random.randint(200000, 250000, 10))
%%timeit
transformer.transform(np.random.randint(80000, 120000, 10), np.random.randint(200000, 250000, 10))
Test 3:
%%timeit -n5 -r5
transformer.transform(np.random.randint(80000, 120000, 1000000), np.random.randint(200000, 250000, 1000000))
But, this is on my machine. Curious to know how it performs on yours when 2.1.0 is released. |
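For anyone who wants to repeat these measurements outside IPython, a rough plain-Python counterpart of `%%timeit -n50 -r50` can be built on the standard `timeit` module. This is only a sketch: `bench` is a hypothetical helper name, and the workload below is a placeholder rather than a pyproj call.

```python
import timeit


def bench(func, number=50, repeat=50):
    """Time a zero-argument callable: `repeat` runs of `number` calls each.

    Returns (best, mean) in seconds per single call.
    """
    totals = timeit.repeat(func, number=number, repeat=repeat)
    per_call = [total / number for total in totals]
    return min(per_call), sum(per_call) / len(per_call)


# Placeholder workload; swap in the call you actually want to measure.
best, mean = bench(lambda: sum(range(1000)))
print(f"best {best * 1e6:.2f} us, mean {mean * 1e6:.2f} us per call")
```

Note that in the tests above the random inputs are generated inside the timed statement, so their cost is included; creating the inputs once before calling `bench` would isolate the transform itself.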
pyproj==2.1.0 is released with the fix. Mind giving it a go? |
Closing as the issue has been addressed. Thanks for the report! |
Hi @snowman2, we are still seeing severe performance degradation when using transform in pyproj 2.1.1 compared to 1.9.6. The increase in execution time appears to be 2-3 orders of magnitude. We are still investigating, but the only fix so far has been to revert to 1.9.6. |
Just to be sure, you are following the recommendation here: https://pyproj4.github.io/pyproj/html/optimize_transformations.html? |
If so, it might be related to #128. |
Got a chance to test out the new Transformer functionality and it brings the runtime back down to 1.9.6 levels. Thanks for the quick fix @snowman2! |
Fantastic, this is good news! |
@andrerav could you fix this without reverting? I just upgraded pyproj and need to iterate over dataframes with 100ks of rows. In 1.9.6 this would take about 30s. After upgrading, performance dropped by orders of magnitude. After applying https://pyproj4.github.io/pyproj/stable/advanced_examples.html#advanced-examples performance increased a little but is still at 14it/s, so it would take ~2h. I also read #128 but could not solve it from there. edit: fixed link |
@RadMagnus can you show some code that you are using? Maybe we can spot something that might be wrong (or even ideally a reproducible example that is slow for you) |
I must apologise, I can't seem to reproduce my error from yesterday. Edit: I know the apply & lambda construction is not the fastest solution and @snowman2 posted a vectorized version here: https://gis.stackexchange.com/a/334307. For my problem size apply still feels sufficiently fast.
Code from 1.9.6, running at ~12it/s in 2.6.0:
# original imports
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point
from pyproj import transform, Proj
import tqdm
tqdm.tqdm.pandas()
# create test data
import numpy as np
np.random.seed(444)
df = pd.DataFrame(np.random.rand(1000, 2))
df.index = pd.date_range('2020/01/01', freq='s', periods=df.shape[0])
df = df.rename(columns={0: 'longitude', 1: 'latitude'})
# lon, lat are given in epsg:4326 and need to be epsg:25832
# specify projections, the 1.9.6 way:
inProj = Proj(init='epsg:4326')
outProj = Proj(init='epsg:25832')
# turn df into geodataframe in epsg_25832 projection
def createGeoDf(df, inProj, outProj):
    # transform coordinates using 'def coordinateTransform' and store as tuple in series
    df['epsg_25832'] = df.progress_apply(
        lambda x: coordinateTransform(x['longitude'], x['latitude'], inProj, outProj), axis=1)
    # turn coordinates into geodataframeable geometries using shapely Points
    points = df.progress_apply(lambda row: Point(row.epsg_25832[0], row.epsg_25832[1]), axis=1)
    # create gdf & set crs
    geo_df = gpd.GeoDataFrame(df, geometry=points)
    geo_df.crs = {'init': 'epsg:25832'}
    return geo_df
# project car coordinates from inProj (epsg:4326) to outProj (epsg:25832)
def coordinateTransform(longitude, latitude, inProj, outProj):
    trans_x, trans_y = transform(inProj, outProj, longitude, latitude)
    return trans_x, trans_y
def main():
    geo_df = createGeoDf(df, inProj, outProj)
if __name__ == "__main__":
    main()
Code for 2.6.0, running at 29375it/s, so as fast as in 1.9.6 🙂
# original imports
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point
from pyproj import Transformer
import tqdm
tqdm.tqdm.pandas()
# create test data
import numpy as np
np.random.seed(444)
df = pd.DataFrame(np.random.rand(100000, 2))
df.index = pd.date_range('2020/01/01', freq='s', periods=df.shape[0])
df = df.rename(columns={0: 'longitude', 1: 'latitude'})
# lon, lat are given in epsg:4326 and need to be epsg:25832
# specify projections, the 2.6.0 way:
transformer = Transformer.from_crs(4326, 25832, always_xy=True)
# turn df into geodataframe in epsg_25832 projection
def createGeoDf(df, transformer):
    # transform coordinates using 'def coordinateTransform' and store as tuple in series
    df['epsg_25832'] = df.progress_apply(
        lambda x: coordinateTransform(x['longitude'], x['latitude'], transformer), axis=1)
    # turn coordinates into geodataframeable geometries using shapely Points
    points = df.progress_apply(
        lambda row: Point(row.epsg_25832[0], row.epsg_25832[1]), axis=1)
    # create gdf & set crs
    geo_df = gpd.GeoDataFrame(df, geometry=points)
    geo_df.crs = {'init': 'epsg:25832'}
    return geo_df
# project car coordinates from epsg:4326 to epsg:25832 using the shared transformer
def coordinateTransform(longitude, latitude, transformer):
    trans_coord = transformer.transform(longitude, latitude)
    return trans_coord[0], trans_coord[1]
def main():
    geo_df = createGeoDf(df, transformer)
if __name__ == "__main__":
    main()
|
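As a follow-up to the row-by-row `apply` code above, a vectorized variant hands both coordinate columns to the transformer in a single call. The sketch below assumes only that the transformer exposes `transform(xs, ys)` and returns two sequences, as the pyproj `Transformer` used above does for array input. `transform_columns` and `OffsetTransformer` are hypothetical names; the latter is a stand-in so the example runs without pyproj installed.

```python
def transform_columns(transformer, xs, ys):
    """Transform whole coordinate sequences in one call.

    `transformer` is any object exposing transform(xs, ys) that returns
    two sequences, e.g. a pyproj Transformer (assumption: array-like input).
    """
    new_xs, new_ys = transformer.transform(xs, ys)
    return list(zip(new_xs, new_ys))


class OffsetTransformer:
    """Hypothetical stand-in that shifts coordinates by a fixed offset."""

    def __init__(self, dx, dy):
        self.dx, self.dy = dx, dy

    def transform(self, xs, ys):
        return [x + self.dx for x in xs], [y + self.dy for y in ys]


pairs = transform_columns(OffsetTransformer(10, 20), [1, 2, 3], [4, 5, 6])
print(pairs)  # [(11, 24), (12, 25), (13, 26)]
```

With the dataframe from the code above, the equivalent call would presumably be `transform_columns(transformer, df['longitude'].to_numpy(), df['latitude'].to_numpy())`, so pyproj is invoked once for all rows instead of once per row.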
@RadMagnus good to hear that it is solved! BTW, what you are doing is very similar to what That should be generally true (if you do it on the full array instead of applying over the rows), but with your code above, there is actually almost no difference, based on a quick check: Using your
Using geopandas to convert to different CRS:
And using the upcoming geopandas release with the optional pygeos dependency (currently in master), it's actually much faster:
|
I honestly was just not aware of I actually looked into pygeos based on your talk https://jorisvandenbossche.github.io/talks/2019_FOSS4GBE_pygeos/#39 before responding to this issue, mainly to speed up Point in Polygon operations. It looked very promising, but then I figured you are going to incorporate it in geopandas 0.8 and decided to wait for that release before rewriting non-urgent code twice. I'm thrilled to see those speedups happening. Thanks for all the effort! |
I'm using python 3.6.7 on ubuntu 18.04.1. I installed each version of pyproj via pip install pyproj==x.y.z.
I'm using pyproj via geopandas. I recently upgraded from 1.9.5.1 to 2.0.1 and noticed that calls to to_crs, which uses pyproj.transform via shapely.ops.transform, got much slower.
It seems like there is a large overhead on the call to transform now. When calling transform with large arrays, the difference is less pronounced, but when called on individual coordinates it is quite large.
I tested with individual x,y pairs; arrays of 10 elements each to simulate usage with a "normal" sized geometry; and a test of 1,000,000 elements each, which is unrealistic unless you're working just with points and not more complex geometry.
This was the setup:
Testing on Individual coordinate pairs, 2.0.1 is ~1000x slower than 1.9.5.1:
For arrays of 10 coordinates each:
And for arrays of 1,000,000: