String performance issue using single quotes #70306

Closed
poostenr mannequin opened this issue Jan 14, 2016 · 13 comments
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs); performance (Performance or resource usage)

Comments

poostenr mannequin commented Jan 14, 2016

BPO 26118
Nosy @vstinner, @ericvsmith, @stevendaprano

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = None
closed_at = <Date 2016-01-15.17:01:30.144>
created_at = <Date 2016-01-14.22:47:44.428>
labels = ['interpreter-core', 'invalid', 'performance']
title = 'String performance issue using single quotes'
updated_at = <Date 2016-01-15.17:01:30.142>
user = 'https://bugs.python.org/poostenr'

bugs.python.org fields:

activity = <Date 2016-01-15.17:01:30.142>
actor = 'SilentGhost'
assignee = 'none'
closed = True
closed_date = <Date 2016-01-15.17:01:30.144>
closer = 'SilentGhost'
components = ['Interpreter Core']
creation = <Date 2016-01-14.22:47:44.428>
creator = 'poostenr'
dependencies = []
files = []
hgrepos = []
issue_num = 26118
keywords = []
message_count = 13.0
messages = ['258243', '258247', '258252', '258256', '258260', '258264', '258265', '258275', '258276', '258277', '258287', '258308', '258310']
nosy_count = 6.0
nosy_names = ['vstinner', 'eric.smith', 'steven.daprano', 'SilentGhost', 'poostenr', 'ubehera']
pr_nums = []
priority = 'normal'
resolution = 'not a bug'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'performance'
url = 'https://bugs.python.org/issue26118'
versions = ['Python 3.5']

poostenr mannequin commented Jan 14, 2016

There appears to be a significant performance difference between the following two statements, and I am unable to explain it.

s = "{0},".format(columnvalue)   # fast
s = "'{0}',".format(columnvalue) # ~30x slower

So far I have had no luck finding alternative statements that avoid the slowdown, such as:
s = "\'{0}\',".format(columnvalue)
s = "'" + "%s" %(columnvalue) + "'"+","
s = "{0}{1}{2},".format("'",columnvalue,"'")

poostenr (mannequin) added the OS-windows and performance labels Jan 14, 2016
@ericvsmith (Member)

Please show us how you're measuring the performance. Also, please show the output of "python -V".

ericvsmith added the interpreter-core label and removed the OS-windows label Jan 14, 2016
poostenr mannequin commented Jan 14, 2016

My initial observations with my Python script using:

s = "{0},".format(columnvalue)   # fast
Processed ~360MB of data from 2:16PM - 2:51PM (35 minutes, ~10MB/min)
One particular file 6.5MB took ~1 minute.

When I changed this line of code to:
s = "'{0}',".format(columnvalue) # ~30x slower (1 min. vs 30 min.)
Same particular file of 6.5MB took ~30 minutes (228KB/min).

My Python environment is:
C:\Data>python -V
Python 3.5.1 :: Anaconda 2.4.1 (32-bit)

I did some more testing with a very simplified piece of code. It is not conclusive, but there is a significant jump, from 0m2.410s to 0m3.875s, once I introduce the single quotes.

$ python -V
Python 3.5.1

//
// s='test'
// for x in range(10000000):
// y = "{0}".format(s)

$ time python test.py

real 0m2.410s
user 0m2.356s
sys 0m0.048s

// s='test'
// for x in range(10000000):
// y = "'%s'" % (s)

$ time python test2.py 

real 0m2.510s
user 0m2.453s
sys 0m0.051s

// s='test'
// for x in range(10000000):
// y = "'{0}'".format(s)

$ time python test3.py 

real 0m3.875s
user 0m3.819s
sys 0m0.048s

@ericvsmith (Member)

I see a small difference, but I think it's related to the fact that in the first example you're concatenating 2 strings (',' and the result of {0}), and in the 2nd example it's 3 strings ("'", {0}, "',"):

$ echo '",{0}".format(x)'
",{0}".format(x)
$ python -m timeit -s 'x=4' '",{0}".format(x)'
1000000 loops, best of 3: 0.182 usec per loop

$ echo '"'\''{0}'\'',".format(x)'
"'{0}',".format(x)
$ python -m timeit -s 'x=4' '"'\''{0}'\'',".format(x)'
1000000 loops, best of 3: 0.205 usec per loop

If you see a factor of 30x difference in your code, I suspect it's not related to str.format(), but some other processing in your code.
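A quick way to see the effect of the number of pieces is to time format strings that add zero, one, and two literal fragments around the replacement field. A minimal sketch (the test value is arbitrary):

# sketch: cost of str.format() as literal fragments are added around
# the replacement field; the test value is arbitrary
import timeit

setup = "x = 4"
statements = [
    '"{0}".format(x)',        # replacement field only
    '"{0},".format(x)',       # one trailing literal
    '"\'{0}\',".format(x)',   # literals on both sides
]
for stmt in statements:
    best = min(timeit.repeat(stmt, setup=setup, number=1000000, repeat=5))
    print("{0:<22} {1:.3f} s per 1,000,000 calls".format(stmt, best))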

@stevendaprano (Member)

I cannot replicate that performance difference under Linux. There's a small difference (about 0.1 second per million iterations, or a tenth of a microsecond) on my computer, but I don't think that's meaningful:

py> from timeit import Timer
py> t1 = Timer('"{0},".format(999)')
py> t2 = Timer('"\'{0}\',".format(999)')
py> min(t1.repeat())
4.671058462932706
py> min(t2.repeat())
4.774653751403093

Please re-run your tests using the timeit module, and see if you can still see a consistent difference with and without single quotes. Perhaps this is specific to Windows?

Otherwise, I can only suggest that the timing difference is unrelated to the difference in quotes in the script. Are you sure that this is absolutely the only change between the run that took one minute and the run that took 30 minutes? No other changes to the script, running on the same data file, on the same disk, no difference in what other processes are running? (E.g. if one run is fighting for disk access with, say, a memory-hungry anti-virus scan, that would easily explain the difference.)
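For completeness, the same comparison as a standalone script (the test value is arbitrary) can be run unchanged on both Windows and Linux, which avoids the shell-quoting differences between cmd.exe and bash:

# sketch of the timeit comparison suggested above; the test value is arbitrary
from timeit import repeat

setup = "columnvalue = 'test'"
no_quotes = min(repeat('"{0},".format(columnvalue)',
                       setup=setup, number=1000000, repeat=5))
with_quotes = min(repeat('"\'{0}\',".format(columnvalue)',
                         setup=setup, number=1000000, repeat=5))
print("without quotes: {0:.3f} s".format(no_quotes))
print("with quotes:    {0:.3f} s".format(with_quotes))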

poostenr mannequin commented Jan 15, 2016

Eric, Steven, thank you for your feedback so far.

I am using Windows7, Intel i7.
That one particular file of 6.5MB took ~1 minute on my machine.
When I ran that same test on Linux with Python 3.5.1, it took about 3 seconds. I was amazed to see a 20x difference.

Steven suggested that this phenomenon might be specific to Windows, and I agree that is what it looks like. Or is Python doing something in the background?

The Python script is straightforward: a loop reads a line from a CSV file, splits the column values, and saves each value as '<value>' to another file, basically building an SQL statement.
I had no issues until I added the encapsulating single quotes around the value.

Because I can reproduce this performance difference at will by alternating which line I comment out, I believe it cannot be the HDD, AV, or anything else outside the Python script interfering.

I repeated the simplified test that I ran earlier on the Linux system, this time on my Windows system.
I don't see anything spectacular.
I am just puzzled that using one statement or the other causes such a huge performance impact somehow.

I will try some more tests and copy your examples.

import time
loopcount = 10000000

# Using string value
s="test 1"
v="test 1"
start_ms = int(round(time.time() * 1000))
for x in range (loopcount):
    y = "{0}".format(v)
end_ms = int(round(time.time() * 1000))
print("Start {0}: {1}".format(s,start_ms))
print("End   {0}: {1}".format(s,end_ms))
print("Diff  {0}: {1} ms\n\n".format(s,end_ms-start_ms))
# Start test 1: 1452828394523
# End   test 1: 1452828397957
# Diff  test 1: 3434 ms


s="test 2"
v="test 2"
start_ms = int(round(time.time() * 1000))
for x in range (loopcount):
    y = "'%s'" % (v)
end_ms = int(round(time.time() * 1000))
print("Start {0}: {1}".format(s,start_ms))
print("End   {0}: {1}".format(s,end_ms))
print("Diff  {0}: {1} ms\n\n".format(s,end_ms-start_ms))
# Start test 2: 1452828397957
# End   test 2: 1452828401233
# Diff  test 2: 3276 ms


s="test 3"
v="test 3"
start_ms = int(round(time.time() * 1000))
for x in range (loopcount):
    y = "'{0}'".format(v)
end_ms = int(round(time.time() * 1000))
print("Start {0}: {1}".format(s,start_ms))
print("End   {0}: {1}".format(s,end_ms))
print("Diff  {0}: {1} ms\n\n".format(s,end_ms-start_ms))
# Start test 3: 1452828401233
# End   test 3: 1452828406320
# Diff  test 3: 5087 ms

# Using integer value
s="test 4"
v=123456
start_ms = int(round(time.time() * 1000))
for x in range (loopcount):
    y = "{0}".format(v)
end_ms = int(round(time.time() * 1000))
print("Start {0}: {1}".format(s,start_ms))
print("End   {0}: {1}".format(s,end_ms))
print("Diff  {0}: {1} ms\n\n".format(s,end_ms-start_ms))
# Start test 4: 1452828406320
# End   test 4: 1452828411378
# Diff  test 4: 5058 ms


s="test 5"
v=123456
start_ms = int(round(time.time() * 1000))
for x in range (loopcount):
    y = "'%s'" % (v)
end_ms = int(round(time.time() * 1000))
print("Start {0}: {1}".format(s,start_ms))
print("End   {0}: {1}".format(s,end_ms))
print("Diff  {0}: {1} ms\n\n".format(s,end_ms-start_ms))
# Start test 5: 1452828411378
# End   test 5: 1452828415264
# Diff  test 5: 3886 ms

s="test 6"
v=123456
start_ms = int(round(time.time() * 1000))
for x in range (loopcount):
    y = "'{0}'".format(v)
end_ms = int(round(time.time() * 1000))
print("Start {0}: {1}".format(s,start_ms))
print("End   {0}: {1}".format(s,end_ms))
print("Diff  {0}: {1} ms\n\n".format(s,end_ms-start_ms))
# Start test 6: 1452828415264
# End   test 6: 1452828421292
# Diff  test 6: 6028 ms
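As a side note, time.perf_counter() (available since Python 3.3) is usually a better clock for this kind of loop timing than time.time(), which can have coarse resolution on Windows. A sketch of the same measurement:

# sketch: the "test 3" loop timed with time.perf_counter() instead of
# millisecond-resolution time.time()
import time

loopcount = 10000000
v = "test 3"
start = time.perf_counter()
for x in range(loopcount):
    y = "'{0}'".format(v)
elapsed = time.perf_counter() - start
print("Diff test 3: {0:.0f} ms".format(elapsed * 1000))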

poostenr mannequin commented Jan 15, 2016

Eric,

I just tried your examples.
The loop count is 100x higher, but the per-loop results are about a factor of 10 lower.

Test1:
My results:
C:\Data>python -m timeit -s 'x=4' '",{0}".format(x)'
100000000 loops, best of 3: 0.0116 usec per loop

Eric's results:
$ python -m timeit -s 'x=4' '",{0}".format(x)'
1000000 loops, best of 3: 0.182 usec per loop

Test2:
My results:
C:\Data>python -m timeit -s 'x=4' '"\'{0}\',".format(x)'
100000000 loops, best of 3: 0.0122 usec per loop

Eric's results:
$ python -m timeit -s 'x=4' '"'\''{0}'\'',".format(x)'
1000000 loops, best of 3: 0.205 usec per loop

poostenr mannequin commented Jan 15, 2016

Eric, Steven,

During further testing I was not able to find any real evidence that the statement I was focused on had a performance issue of its own.

As I did more testing I noticed that appending data to the file slowed down: the file initially grew in ~30-50KB increments, around 500KB it had slowed to ~3-5KB/s, and around 1MB it grew at only ~1KB/s. I found this odd, and because Steven had mentioned other processes, I started looking at some other statements.

After quite a lot of trial and error, I was able to use single quotes and increase my performance to acceptable levels.
Example 3 below is how I resolved it.

Can you explain to me why there was a performance penalty in example 2?
Python must be doing something under the hood that I am overlooking.

Did conv.escape_string() change something about columnvalue, so that adding a single quote before and after it introduced some odd behavior when writing to the file? I am not an expert on Python and remember reading something about dynamic typing.

Example 1: Fast performance, variable s is not encapsulated with single quotes
6.5MB parsed in ~1 minute.
for key in listkeys:
    keyvalue = self.recordstats[key]
    fieldtype = keyvalue[0]
    columnvalue = record[key]
    columnvalue = conv.escape_string(columnvalue)
    if (count > 1):
        s = "{0},".format(columnvalue) # No single quotes
    else:
        s = "{0},".format(columnvalue) # No single quotes
    count -= 1
    Append s to file.

Example 2: Slow performance, pre- and post-fixed variable s with single quotes
6.5MB parsed in 35 minutes.
for key in listkeys:
    keyvalue = self.recordstats[key]
    fieldtype = keyvalue[0]
    columnvalue = record[key]
    columnvalue = conv.escape_string(columnvalue)
    if (count > 1):
        s = "'{0}',".format(columnvalue) # Added single quotes
    else:
        s = "'{0}',".format(columnvalue) # Added single quotes
    count -= 1
    Append s to file.

Example 3: Fast performance, variable columnvalue is pre- and post-fixed with single quotes
6.5MB parsed in ~45 seconds.
for key in listkeys:
    keyvalue = self.recordstats[key]
    fieldtype = keyvalue[0]
    columnvalue = record[key]
    columnvalue = conv.escape_string("'" + columnvalue + "'") # Moved single quotes to this statement.
    if (count > 1):
        s = "{0},".format(columnvalue)
    else:
        s = "{0},".format(columnvalue)
    count -= 1
    Append s to file.

@vstinner (Member)

I implemented an overkill optimization in the _PyUnicodeWriter API used by str.format() and str%args. If the result is the input string, the string is not copied by value; a reference to it is returned instead.

>>> x="hello"
>>> ("%s" % x) is x
True
>>> ("{}".format(x)) is x
True

If the format string adds something before/after, the string is duplicated:

>>> ("x=%s" % x) is x
False
>>> ("x={}".format(x)) is x
False

The optimization is implemented in _PyUnicodeWriter_WriteStr():

https://hg.python.org/cpython/file/32a4e7b337c9/Objects/unicodeobject.c#l13604
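The effect of that no-copy path can be observed from pure Python. A sketch (string lengths chosen arbitrarily) that times both forms on inputs of increasing length should show the copy-free form staying roughly flat while the quoted form grows with the string length:

# sketch: timing the no-copy path ("{}".format(s) returns s itself)
# against the copying path ("'{}'".format(s)); lengths are arbitrary
import timeit

for length in (10, 1000, 100000):
    setup = "s = 'x' * {0}".format(length)
    no_copy = min(timeit.repeat('"{}".format(s)',
                                setup=setup, number=10000, repeat=5))
    copied = min(timeit.repeat('"\'{}\'".format(s)',
                               setup=setup, number=10000, repeat=5))
    print("len={0:>6}: no copy {1:.4f} s, copy {2:.4f} s".format(
        length, no_copy, copied))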

@vstinner (Member)

If you see a factor of 30x difference in your code, I suspect it's not related to str.format(), but some other processing in your code.

The performance of instructions like ("x=%s" % x) or ("x={}".format(x)) depends on the length of the string. Maybe poostenr has some very long strings?

--

Closed issues related to _PyUnicodeWriter API:

  • issue bpo-14716: Use unicode_writer API for str.format()
  • issue bpo-14687: Optimize str%tuple for the PEP-393
  • issue bpo-14744: Use _PyUnicodeWriter API in str.format() internals
  • issue bpo-16147: Rewrite PyUnicode_FromFormatV() to use the _PyUnicodeWriter API
  • issue bpo-17694: Enhance _PyUnicodeWriter API to control minimum buffer length without overallocation
  • issue bpo-19513: Use PyUnicodeWriter instead of PyAccu in repr(tuple) and repr(list)
  • issue bpo-19646: Use PyUnicodeWriter in repr(dict)

I'm listing these issues because they contain microbenchmark scripts, in case you would like to play with them.

@stevendaprano (Member)

On Fri, Jan 15, 2016 at 07:56:39AM +0000, poostenr wrote:

As I did more testing I noticed that appending data to the file slowed
down. The file grew initially with ~30-50KB increments and around
500KB it had slowed down to ~3-5KB/s, until around 1MB the file grew
at ~1KB/s.

How are you appending to the file? In the code snippets below you merely
say "append s to file", which is not Python code and could hide a
multitude of sins. You're asking us to diagnose slow performance in code
we can't see.

Or perhaps your disk is just badly fragmented, and as the file gets
bigger, performance gets worse.

Did conv.escape_string() change something about columnvalue

I don't know. What's conv.escape_string? Where does it come from, and
what does it do? Again, you're asking us to diagnose code we can't see.

Perhaps you should run the profiler and see where your code is spending
most of its time.
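A minimal way to do that from the command line, following the shell examples earlier in this thread ("myscript.py" is a placeholder for the actual script name):

$ python -m cProfile -s cumulative myscript.py

or, from inside the code, around just the suspect section (process_file() is a placeholder):

import cProfile
cProfile.run('process_file()', sort='cumulative')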

poostenr mannequin commented Jan 15, 2016

Thank you for your feedback Victor and Steven.

I just copied my scripts and 360MB of CSV files over to Linux.
The entire process finished in 4 minutes exactly, using the original python scripts.
So there is something different between my environments.
If it were a fragmentation issue, I would expect performance on the Windows system to always be slow. But I can influence the performance by alternating between the two original statements:
s = "{0},".format(columnvalue) # fast
s = "'{0}',".format(columnvalue) # ~30x slower

I apologize for not being able to provide the entire code.
There is too much code to post at this time.

I am opening a file like this:
#logger = open(filename, rw, buffering, encoding)
logger = open('output.sql', 'a', 1, 'iso-8859-1')

I write to file:
logger.write(text+'\n')

I'm using a library to escape the string before saving to file.
import pymysql.converters as conv
<...>
for key in listkeys:
    keyvalue = self.recordstats[key]
    fieldtype = keyvalue[0]
    columnvalue = record[key]
    columnvalue = conv.escape_string(columnvalue)
    if (count > 1):
        s = "{0},".format(columnvalue) # No single quotes
    else:
        s = "{0},".format(columnvalue) # No single quotes
    count -= 1
    logger.write(s+'\n')

I appreciate the feedback and ideas so far.
Trying the profiler is on my list to see if it provides more insight.
I am not using Anaconda3 on Linux. Perhaps that has an impact somehow?

I never suspected that inserting the two single quotes would cause such a performance problem. I noticed it when I parsed ~40GB of data and it took almost a week to complete instead of my expected 6-7 hours.
Just the other day I decided to remove the single quotes because it was the only thing left that I had changed; I had dismissed that change for the past two weeks because I assumed it couldn't be causing the performance problem.

Today, I wasn't expecting such a big difference between running my script on Linux or Windows.

If I discover anything else, I will post an update.
When I get the chance I can remove redundant code and post the source.
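In the meantime, a self-contained sketch along these lines (the output file name, column values, and loop counts are all made up, and pymysql is assumed to be installed) would let others reproduce the timing of just the write loop with and without the quotes:

# sketch of a stand-alone reproducer for the write loop; file name,
# column values, and loop counts are made up
import time
import pymysql.converters as conv

columns = ["value{0}".format(i) for i in range(20)]
start = time.perf_counter()
logger = open('output.sql', 'a', 1, 'iso-8859-1')
for _ in range(100000):
    for columnvalue in columns:
        columnvalue = conv.escape_string(columnvalue)
        s = "'{0}',".format(columnvalue)   # toggle the single quotes here
        logger.write(s + '\n')
logger.close()
print("elapsed: {0:.1f} s".format(time.perf_counter() - start))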

SilentGhost mannequin commented Jan 15, 2016

poostenr, this is demonstrably not a problem with CPython, which is what this bug tracker is about. There are a few options available on the internet if you need help with your code: mailing lists and IRC are among them.

SilentGhost (mannequin) closed this as completed Jan 15, 2016
SilentGhost (mannequin) added the invalid label Jan 15, 2016
ezio-melotti transferred this issue from another repository Apr 10, 2022