
Increase the default arraysize #475

Open
pmantica1 opened this issue Sep 3, 2020 · 6 comments

@pmantica1

I recently noticed a 10x performance discrepancy when transitioning from a SQL Server to an Oracle Database with almost the same data. I later found out that this discrepancy was due to low network throughput, and I was able to fix it by increasing the cx_Oracle arraysize.

Even though there is some great documentation on performance tuning, I think it would be best if users never encountered this performance issue in the first place. So I think it might be a good idea to increase the default arraysize from 100. I know this is a tradeoff between memory and throughput, and I am not really qualified to say what a good number would be here. But I still wanted to bring this to your attention since I think it could improve the library's user experience.
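
As a rough sketch of the kind of change involved (the connection details, table name, and the value of 10000 here are placeholders, not my exact setup):

```python
import cx_Oracle

# Placeholder credentials/DSN and query; only the arraysize line matters here.
connection = cx_Oracle.connect(user="scott", password="tiger", dsn="dbhost/orclpdb1")
cursor = connection.cursor()
cursor.arraysize = 10000          # default is 100; larger values fetch more rows per round trip
cursor.execute("SELECT * FROM my_table")
rows = cursor.fetchall()
```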

@cjbj
Member

cjbj commented Sep 3, 2020

Thanks for the data point. What size did you end up using? What was the SQL Server default? And how is performance now in comparison?

If anyone else has comments and data, please let us know.

The DB API suggested a default arraysize of 1, which is way too small. In cx_Oracle 4.4 @anthony-tuininga changed it to 50, and later this was upped to 100 in 5.3 to align with other APIs, e.g. in PHP and Node.js.

A lot of people seem to be doing single-row selects, where obviously a smaller arraysize is going to be better. Other than that, every user seems to have a different use case. When we ran some artificial benchmarks to help choose a default prefetchrows value for 8.0, there were so many factors like network costs, row width, DB load, machine speeds etc. that it was difficult to make a generalization.
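
For anyone who lands here, both values can be set per cursor before the execute (a rough sketch only; the connection object and the chosen values are illustrative, and the prefetchrows attribute needs cx_Oracle 8.0 or later):

```python
# Illustrative values only, not recommendations.
cursor = connection.cursor()
cursor.prefetchrows = 1000   # rows returned with the execute round trip (cx_Oracle 8.0+)
cursor.arraysize = 1000      # rows fetched per subsequent round trip
cursor.execute("SELECT id, name FROM big_table")
for row in cursor:
    pass                     # process each row
```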

@pmantica1
Author

pmantica1 commented Sep 4, 2020

No problem and thanks a lot for the prompt reply. I think I am going to use an arraysize of 10000.

Here are some more specifics on the data/querying.

The query I used for comparison selects the first 100,000 rows of a table with 9 columns. (This table is essentially the same in both the SQL Server and the Oracle Database I am using.)

Here are the results with different array sizes when I was targeting the Oracle Database:

| Array size | Time (seconds) |
| ---------- | -------------- |
| 100        | 85.52          |
| 200        | 44.30          |
| 400        | 23.15          |
| 800        | 14.49          |
| 1600       | 12.32          |
| 3200       | 11.78          |
| 6400       | 9.46           |
| 12800      | 9.16           |
| 25600      | 8.65           |
| 51200      | 7.23           |

In comparison, I was able to retrieve the data from SQL Server in 4 seconds. I got the same performance results when using pymssql and pyodbc, two different MSSQL Python drivers. pymssql cursors do not have an arraysize attribute. pyodbc cursors have an arraysize of 1, though the query performance did not change when I varied the pyodbc arraysize.

I looked at the pyodbc code and, from what I could tell, the arraysize parameter has no effect on how the packets/rows are batched and buffered.
https://github.com/mkleehammer/pyodbc/blob/03f762c7bd8549093694dffa4d1c59c016fbc8fe/src/cursor.cpp#L1066

I also looked at the implementation of fetchmany for pymssql, and it also does not seem like the size parameter affects the packet/row buffering.
https://github.com/pymssql/pymssql/blob/master/src/pymssql.pyx#L519

So it seems to me like it is a peculiarity of cx_oracle that the arraysize is tied to how packets/rows are buffered and batched.

Here is how I retrieved the data.

cx_oracle:

```python
import time

for i in range(10):
    start = time.time()
    cur = cx_oracle_connection.cursor()   # cx_Oracle connection created elsewhere
    cur.arraysize = 100 * 2**i            # 100, 200, 400, ..., 51200
    [row for row in cur.execute(query)]   # iterate so that every row is actually fetched
    print(f"Array size: {cur.arraysize}, time: {time.time() - start}")
```

pymssql/pyodbc:

```python
import time

start = time.time()
cursor = connection.cursor()              # pymssql or pyodbc connection created elsewhere
cursor.execute(mssql_version_of_query)
[row for row in cursor]                   # iterate so that every row is actually fetched
print(time.time() - start)
```

@pmantica1
Author

https://pkg.go.dev/github.com/godror/godror?tab=doc
The Oracle Go driver seems to have a DefaultArraySize of 1024.

@cjbj
Member

cjbj commented Sep 4, 2020

Thanks for the details and for showing interest in making improvements.

This discussion is about the cost of fetching data across the network, so your timings could/should be for the fetch phase, not the execute phase. Of course, the prefetchrows value will play a role, and it is used during the execute phase, so comparisons become murky. Also, by including the execute phase, the DB configuration, indexes, DB host disk speed and buffering could all play a role. Overall, although I appreciate you saw better 'nearly out of the box' performance with one DB, a direct vendor-to-vendor comparison would justify deeper analysis (including network config), and is beyond the scope of cx_Oracle itself. (But feel free to tell us more, as it's always good to know.)
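
For example, timing the two phases separately would look roughly like this (a sketch; the connection, query and arraysize value are placeholders):

```python
import time

cursor = connection.cursor()
cursor.arraysize = 1000

start = time.time()
cursor.execute(query)          # parse/execute; prefetched rows may already arrive here
execute_time = time.time() - start

start = time.time()
rows = cursor.fetchall()       # the remaining network round trips happen here
fetch_time = time.time() - start

print(f"execute: {execute_time:.3f}s, fetch: {fetch_time:.3f}s")
```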

> https://pkg.go.dev/github.com/godror/godror?tab=doc
> The Oracle Go driver seems to have a DefaultArraySize of 1024.

That is for PL/SQL array sizing and not related to query fetch tuning. The godror query internal buffer sizes for fetcharraysize and prefetchrows are documented just above (DPI_DEFAULT_*), and default to the same 100 & 2 that cx_Oracle, node-oracledb etc use. They all use the same ODPI code underneath.

Overall, yes a bigger arraysize (and prefetchrows) will help big transfers, but what is the common case for all users (who don't read the tuning doc) and balances memory allocation costs & sizes (right through the Oracle stack and DB) against network costs?!

Out of interest did you check memory and network use with the higher arraysize? Did you tune your SDU sizes? See the links in https://cx-oracle.readthedocs.io/en/latest/user_guide/tuning.html
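
If you want to experiment with SDU from the Python side, one option is to pass a full connect descriptor as the dsn (a sketch only; the host, service name, credentials and the 65535 value are placeholders, and the effective SDU is negotiated with the server's own settings):

```python
import cx_Oracle

# Placeholders throughout; see the tuning documentation linked above for details.
dsn = """(DESCRIPTION=
            (SDU=65535)
            (ADDRESS=(PROTOCOL=TCP)(HOST=dbhost.example.com)(PORT=1521))
            (CONNECT_DATA=(SERVICE_NAME=orclpdb1)))"""

connection = cx_Oracle.connect(user="scott", password="tiger", dsn=dsn)
```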

@pmantica1
Author

Fair point about the execution time.

If the execution time is just the time to run cursor.execute(query), then I found it to be a negligible 0.08 seconds when running against both the SQL Server and the Oracle Database.

Thanks for pointing that out about the Go buffer sizes.

I definitely agree that this is a hard problem to optimize for. I don't think I have enough expertise to provide a good answer here; I just wanted to provide a data point. Also, out of personal curiosity, do you mind explaining what happens if you have an arraysize of 10k but execute a query that returns only 100 rows? Is there higher latency/memory usage?

I used htop to manually inspect the memory usage and did not see a significant bump, i.e. a >100 megabyte jump, even with the larger arraysize. (I know that "significant" is highly subjective and relative here.) The network usage when using an arraysize of 52000 was close to 1 megabyte per second, which is pretty close to my internet speed limit, so I do not see the need to optimize any further.

@thoo

thoo commented Dec 29, 2020

I got way more than a 10x performance improvement. I set it to 10,000,000.

```python
import cx_Oracle

class Connection(cx_Oracle.Connection):
    def cursor(self):
        # Every cursor created from this connection gets the large arraysize.
        cursor = super(Connection, self).cursor()
        cursor.arraysize = 10000000
        return cursor
```
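
In case it helps anyone else, the subclass is then used in place of a plain connection (connection details below are placeholders):

```python
# Placeholder credentials/DSN; every cursor from this connection starts with the large arraysize.
connection = Connection(user="scott", password="tiger", dsn="dbhost/orclpdb1")
cursor = connection.cursor()
```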
