String performance issue using single quotes #70306
There appears to be a significant performance difference between the following two statements, and I am unable to explain it.
s = "{0},".format(columnvalue) # fast
s = "'{0}',".format(columnvalue) # ~30x slower
So far, no luck trying to find other statements to improve performance, such as: |
Please show us how you're measuring the performance. Also, please show the output of "python -V". |
My initial observations with my Python script using:
s = "{0},".format(columnvalue) # fast
Processed ~360MB of data from 2:16PM - 2:51PM (35 minutes, ~10MB/min).
One particular 6.5MB file took ~1 minute.
When I changed this line of code to:
My Python environment is:
I did some more testing with a very simplified piece of code, but it is not conclusive. There is, however, a noticeable jump when I introduce the single quotes: from 0m2.410s to 0m3.875s.
$ python -V
Python 3.5.1
//
$ time python test.py
real 0m2.410s
// s='test'
$ time python test2.py
real 0m2.510s
// s='test'
$ time python test3.py
real 0m3.875s |
I see a small difference, but I think it's related to the fact that in the first example you're concatenating 2 strings (',' and the result of {0}), and in the 2nd example it's 3 strings ("'", {0}, "',"):
$ echo '",{0}".format(x)'
",{0}".format(x)
$ python -m timeit -s 'x=4' '",{0}".format(x)'
1000000 loops, best of 3: 0.182 usec per loop
$ echo '"'\''{0}'\'',".format(x)'
"'{0}',".format(x)
$ python -m timeit -s 'x=4' '"'\''{0}'\'',".format(x)'
1000000 loops, best of 3: 0.205 usec per loop
If you see a factor of 30x difference in your code, I suspect it's not related to str.format(), but some other processing in your code. |
I cannot replicate that performance difference under Linux. There's a small difference (about 0.1 second per million iterations, or a tenth of a microsecond) on my computer, but I don't think that's meaningful:
Please re-run your tests using the timeit module, and see if you can still see a consistent difference with and without single quotes. Perhaps this is specific to Windows? Otherwise, I can only suggest that the timing difference is unrelated to the difference in quotes in the script. Are you sure that this is absolutely the only change between the run that took one minute and the run that took 30 minutes? No other changes to the script, running on the same data file, on the same disk, no difference in what other processes are running? (E.g. if one run is fighting for disk access with, say, a memory-hungry anti-virus scan, that would easily explain the difference.) |
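For anyone wanting to follow the timeit suggestion from a script rather than the command line, here is a minimal sketch. The helper name time_format and the sample value are made up for illustration, and the lambda adds a small constant overhead to both measurements, so only the relative difference is meaningful.

```python
import timeit

columnvalue = "test"  # assumed sample value; real column data may differ


def time_format(fmt, value, number=200000, repeat=5):
    """Return the best observed time per fmt.format(value) call, in microseconds."""
    timer = timeit.Timer(lambda: fmt.format(value))
    # Taking the minimum of several repeats reduces noise from other processes.
    best = min(timer.repeat(repeat=repeat, number=number))
    return best / number * 1e6


if __name__ == "__main__":
    for fmt in ("{0},", "'{0}',"):
        usec = time_format(fmt, columnvalue)
        print("{0!r:12} {1:.3f} usec per call".format(fmt, usec))
```

If the two numbers differ by a fraction of a microsecond rather than by a factor of 30, the slowdown most likely comes from somewhere else in the script.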
Eric, Steven, thank you for your feedback so far. I am using Windows 7 on an Intel i7. Steven suggested that this phenomenon might be specific to Windows, and I agree, that is what it is looking like. Or is Python doing something in the background? The Python script is straightforward: a loop reads a line from a CSV file, splits the column values and saves each value as '<value>' to another file, basically building an SQL statement. Because I can reproduce this performance difference at will by alternating which line I comment out, I believe it cannot be the HDD, AV or something else outside the Python script interfering. I repeated the simplified test that I ran earlier on a Linux system, this time on my Windows system. I will try some more tests and copy your examples.
import time
loopcount = 10000000
# Using string value
s="test 1"
v="test 1"
start_ms = int(round(time.time() * 1000))
for x in range(loopcount):
    y = "{0}".format(v)
end_ms = int(round(time.time() * 1000))
print("Start {0}: {1}".format(s,start_ms))
print("End {0}: {1}".format(s,end_ms))
print("Diff {0}: {1} ms\n\n".format(s,end_ms-start_ms))
# Start test 1: 1452828394523
# End test 1: 1452828397957
# Diff test 1: 3434 ms
s="test 2"
v="test 2"
start_ms = int(round(time.time() * 1000))
for x in range(loopcount):
    y = "'%s'" % (v)
end_ms = int(round(time.time() * 1000))
print("Start {0}: {1}".format(s,start_ms))
print("End {0}: {1}".format(s,end_ms))
print("Diff {0}: {1} ms\n\n".format(s,end_ms-start_ms))
# Start test 2: 1452828397957
# End test 2: 1452828401233
# Diff test 2: 3276 ms
s="test 3"
v="test 3"
start_ms = int(round(time.time() * 1000))
for x in range(loopcount):
    y = "'{0}'".format(v)
end_ms = int(round(time.time() * 1000))
print("Start {0}: {1}".format(s,start_ms))
print("End {0}: {1}".format(s,end_ms))
print("Diff {0}: {1} ms\n\n".format(s,end_ms-start_ms))
# Start test 3: 1452828401233
# End test 3: 1452828406320
# Diff test 3: 5087 ms
# Using integer value
s="test 4"
v=123456
start_ms = int(round(time.time() * 1000))
for x in range(loopcount):
    y = "{0}".format(v)
end_ms = int(round(time.time() * 1000))
print("Start {0}: {1}".format(s,start_ms))
print("End {0}: {1}".format(s,end_ms))
print("Diff {0}: {1} ms\n\n".format(s,end_ms-start_ms))
# Start test 4: 1452828406320
# End test 4: 1452828411378
# Diff test 4: 5058 ms
s="test 5"
v=123456
start_ms = int(round(time.time() * 1000))
for x in range(loopcount):
    y = "'%s'" % (v)
end_ms = int(round(time.time() * 1000))
print("Start {0}: {1}".format(s,start_ms))
print("End {0}: {1}".format(s,end_ms))
print("Diff {0}: {1} ms\n\n".format(s,end_ms-start_ms))
# Start test 5: 1452828411378
# End test 5: 1452828415264
# Diff test 5: 3886 ms
s="test 6"
v=123456
start_ms = int(round(time.time() * 1000))
for x in range(loopcount):
    y = "'{0}'".format(v)
end_ms = int(round(time.time() * 1000))
print("Start {0}: {1}".format(s,start_ms))
print("End {0}: {1}".format(s,end_ms))
print("Diff {0}: {1} ms\n\n".format(s,end_ms-start_ms))
# Start test 6: 1452828415264
# End test 6: 1452828421292
# Diff test 6: 6028 ms |
Eric, I just tried your examples. Test1: Eric's results: Test2: Eric's results: |
Eric, Steven, during further testing I was not able to find any real evidence that the statement I was focused on had a real performance issue. As I did more testing I noticed that appending data to the file slowed down. The file grew initially in ~30-50KB increments, around 500KB it had slowed down to ~3-5KB/s, and around 1MB the file grew at ~1KB/s. I found this odd, and because Steven had mentioned other processes, I started looking at some other statements. After quite a lot of trial and error, I was able to use single quotes and increase my performance to acceptable levels. Can you explain to me why there was a performance penalty in example 2? Did conv.escape_string() change something about columnvalue, so that adding a single quote before and after it introduced some odd behavior when writing to file? I am not an expert on Python and remember reading something about dynamic typing.
Example 1: Fast performance, variable s is not enclosed in single quotes
Example 2: Slow performance, variable s is prefixed and suffixed with single quotes
Example 3: Fast performance, variable columnvalue is prefixed and suffixed with single quotes |
I implemented an overkill optimization in the _PyUnicodeWriter API used by str.format() and str%args. If the result is the input, the string is not copied by value, but by reference.
>>> x="hello"
>>> ("%s" % x) is x
True
>>> ("{}".format(x)) is x
True
If the format string adds something before/after, the string is duplicated:
>>> ("x=%s" % x) is x
False
>>> ("x={}".format(x)) is x
False
The optimization is implemented in _PyUnicodeWriter_WriteStr(): https://hg.python.org/cpython/file/32a4e7b337c9/Objects/unicodeobject.c#l13604 |
The performance of instructions like ("x=%s" % x) or ("x={}".format(x)) depends on the length of the string. Maybe poostenr has some very long strings? -- Closed issues related to the _PyUnicodeWriter API:
I'm listing these issues because they contain microbenchmark scripts, if you would like to play with them. |
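To see how string length affects the comparison Victor describes, a rough sketch along these lines could be used. The helper names best_usec and measure and the payload are made up for illustration, and returning the input unchanged for "{}" is a CPython implementation detail that should not be relied upon.

```python
import timeit


def best_usec(func, number=20000, repeat=5):
    """Best observed time per call of func, in microseconds."""
    best = min(timeit.repeat(func, number=number, repeat=repeat))
    return best / number * 1e6


def measure(length, number=20000):
    """Time "{}" (result equals the input) against "'{}'" (result must be built)."""
    x = "a" * length  # hypothetical payload; real column values will differ
    plain = best_usec(lambda: "{}".format(x), number=number)
    quoted = best_usec(lambda: "'{}'".format(x), number=number)
    return plain, quoted


if __name__ == "__main__":
    for length in (10, 1000, 100000):
        plain, quoted = measure(length)
        print("len={0:>6}  plain={1:.3f} usec  quoted={2:.3f} usec".format(
            length, plain, quoted))
```

The quoted variant has to build a new string, so its cost should grow with the input length, while the plain variant can stay roughly flat when the shortcut applies.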
On Fri, Jan 15, 2016 at 07:56:39AM +0000, poostenr wrote:
How are you appending to the file? In the code snippets below you merely
Or perhaps your disk is just badly fragmented, and as the file gets
I don't know. What's conv.escape_string? Where does it come from, and
Perhaps you should run the profiler and see where your code is spending |
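As a concrete starting point for the profiler suggestion, here is a minimal cProfile sketch. The function convert_rows is a hypothetical stand-in for the CSV-to-SQL loop described in this thread, not the poster's actual code, and profile_call is likewise an illustrative helper name.

```python
import cProfile
import io
import pstats


def convert_rows(rows):
    """Hypothetical stand-in: wrap every comma-separated value in single quotes."""
    out = io.StringIO()
    for row in rows:
        out.write(",".join("'{0}'".format(value) for value in row.split(",")))
        out.write("\n")
    return out.getvalue()


def profile_call(func, *args):
    """Run func under cProfile and return (result, text report of the top entries)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    report = io.StringIO()
    stats = pstats.Stats(profiler, stream=report)
    # Sort by cumulative time so the most expensive call chains appear first.
    stats.sort_stats("cumulative").print_stats(10)
    return result, report.getvalue()


if __name__ == "__main__":
    sample_rows = ["a,b,c"] * 10000  # made-up input data
    _, report = profile_call(convert_rows, sample_rows)
    print(report)
```

A whole script can also be profiled without changing it by running python -m cProfile -s cumulative script.py, which should show whether the time goes into formatting, escaping, or file writes.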
Thank you for your feedback, Victor and Steven. I just copied my scripts and 360MB of CSV files over to Linux. I apologize for not being able to provide the entire code. I am opening a file like this: I write to file: I'm using a library to escape the string before saving to file. I appreciate the feedback and ideas so far. I never suspected that inserting the two single quotes would cause such a performance problem. I noticed it when I parsed ~40GB of data and it took almost a week to complete instead of my expected 6-7 hrs. Today, I wasn't expecting such a big difference between running my script on Linux and on Windows. If I discover anything else, I will post an update. |
poostenr, this is demonstrably not a problem with CPython, which is what this bug tracker is about. There are a few options available on the internet if you need help with your code: mailing lists and IRC are among them. |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.