PERF: Removed the GIL from parts of the TextReader class #11272

jdeschenes · 2015-10-09T16:08:15Z

The GIL was removed around the tokenizer functions and the conversion function(_string_convert excluded).

Benchmark:

Data Generation:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(1000000,10))
df.to_csv('test.csv')

Benchmark Code:

import pandas as pd
from pandas.util.testing import test_parallel

def f():
    for i in range(4):
        pd.read_csv('test.csv', index_col=0)

@test_parallel(4)
def g():
    pd.read_csv('test.csv', index_col=0)

Before:

In [4]: %timeit pd.read_csv('test.csv', index_col=0)
1 loops, best of 3: 2.3 s per loop

In [7]: %timeit f()
1 loops, best of 3: 9.15 s per loop

In [8]: %timeit g()
1 loops, best of 3: 9.25 s per loop

After:

In [6]: %timeit pd.read_csv('test.csv', index_col=0)
1 loops, best of 3: 2.35 s per loop

In [9]: %timeit f()
1 loops, best of 3: 9.55 s per loop

In [10]: %timeit g()
1 loops, best of 3: 4.38 s per loop

jreback · 2015-10-09T16:28:50Z

asv_bench/benchmarks/gil.py

+        if (not have_real_test_parallel):
+            raise NotImplementedError
+        # Using the values
+        self.data = '0.1213700904466425978256438611,0.0525708283766902484401839501,0.4174092731488769913994474336\n        0.4096341697147408700274695547,0.1587830198973579909349496119,0.1292545832485494372576795285\n        0.8323255650024565799327547210,0.9694902427379478160318626578,0.6295047811546814475747169126\n        0.4679375305798131323697930383,0.2963942381834381301075609371,0.5268936082160610157032465394\n        0.6685382761849776311890991564,0.6721207066140679753374342908,0.6519975277021627935170045020\n        '


I would just do use random here. e.g. DataFrame(np.random.randn(10000,10)) (or whatever makes this a reasonable time

also can you add a tests that uses object and another that has a datetime parse in the index_col

jreback · 2015-10-09T17:49:05Z

some windows cythoning errors.

odd they don't show up for you, what platform are you testing on?

[pandas3.5] C:\Users\Jeff Reback\pandas3.5>git checkout -b jdeschenes-nogil_csv master
Switched to a new branch 'jdeschenes-nogil_csv'

[pandas3.5] C:\Users\Jeff Reback\pandas3.5>git pull https://github.com/jdeschenes/pandas.git nogil_csv
remote: Counting objects: 7, done.
remote: Compressing objects: 100% (7/7), done.
remote: Total 7 (delta 0), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (7/7), done.
From https://github.com/jdeschenes/pandas
 * branch            nogil_csv  -> FETCH_HEAD
Updating 26db172..2da353e
Fast-forward
 asv_bench/benchmarks/gil.py |  23 ++++
 pandas/parser.pyx           | 269 +++++++++++++++++++++++++++++++-------------
 2 files changed, 213 insertions(+), 79 deletions(-)

[pandas3.5] C:\Users\Jeff Reback\pandas3.5>make
python setup.py build_ext --inplace
make: python: Command not found
Makefile:2: recipe for target `tseries' failed
make: *** [tseries] Error 127

[pandas3.5] C:\Users\Jeff Reback\pandas3.5>python setup.py build_ext --inplace
running build_ext
skipping 'pandas\index.c' Cython extension (up-to-date)
skipping 'pandas\tslib.c' Cython extension (up-to-date)
skipping 'pandas\hashtable.c' Cython extension (up-to-date)
skipping 'pandas\algos.c' Cython extension (up-to-date)
cythoning pandas\parser.pyx to pandas\parser.c

Error compiling Cython file:
------------------------------------------------------------
...
            # in the hash table
            if k != na_hashset.n_buckets:
                na_count[0] += 1
                data[0] = NA
            else:
                data[0] = parser.converter(word, &p_end, parser.decimal, parser.sci,
                                         ^
------------------------------------------------------------

pandas\parser.pyx:1538:42: Calling gil-requiring function not allowed without gil

Error compiling Cython file:
------------------------------------------------------------
...
                        data[0] = NA
            data += 1
    else:
        for i in range(lines):
            COLITER_NEXT(it, word)
            data[0] = parser.converter(word, &p_end, parser.decimal, parser.sci,
                                     ^
------------------------------------------------------------

pandas\parser.pyx:1557:38: Calling gil-requiring function not allowed without gil

Error compiling Cython file:
------------------------------------------------------------
...
                na_count[0] += 1
                data[0] = NA
                data += 1
                continue

            error = to_boolean(word, data)
                             ^
------------------------------------------------------------

pandas\parser.pyx:1680:30: Calling gil-requiring function not allowed without gil

Error compiling Cython file:
------------------------------------------------------------
...
            data += 1
    else:
        for i in range(lines):
            COLITER_NEXT(it, word)

            error = to_boolean(word, data)
                             ^
------------------------------------------------------------

pandas\parser.pyx:1688:30: Calling gil-requiring function not allowed without gil

Error compiling Cython file:
------------------------------------------------------------
...
            if k != false_hashset.n_buckets:
                data[0] = 0
                data += 1
                continue

            error = to_boolean(word, data)
                             ^
------------------------------------------------------------

pandas\parser.pyx:1758:30: Calling gil-requiring function not allowed without gil

Error compiling Cython file:
------------------------------------------------------------
...
            if k != false_hashset.n_buckets:
                data[0] = 0
                data += 1
                continue

            error = to_boolean(word, data)
                             ^
------------------------------------------------------------

pandas\parser.pyx:1778:30: Calling gil-requiring function not allowed without gil
building 'pandas.parser' extension
C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\amd64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -Ipandas/src/klib -Ipandas/src -IC:\Miniconda\envs\pandas3.5\lib\site-packages\numpy\core\in
clude -IC:\Miniconda\envs\pandas3.5\include -IC:\Miniconda\envs\pandas3.5\include "-IC:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\INCLUDE" "-IC:\Program Files (x86)\Microsoft Visual Studio 1
4.0\VC\ATLMFC\INCLUDE" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.6\include\um" "-IC:\Program Files (x86)\Windows Kits\10\in
clude\10.0.10240.0\shared" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\winrt" /Tcpandas\parser.c /Fobuild\temp.win
-amd64-3.5\Release\pandas\parser.obj
parser.c
pandas\parser.c(1): fatal error C1189: #error:  Do not use this file, it is the result of a failed Cython compilation.
error: command 'C:\\Program Files (x86)\\Microsoft Visual Studio 14.0\\VC\\BIN\\amd64\\cl.exe' failed with exit status 2

[pandas3.5] C:\Users\Jeff Reback\pandas3.5>

jdeschenes · 2015-10-09T17:52:03Z

I fixed a few issues with the build

jreback · 2015-10-09T18:01:36Z

builds ,couple of errors on windows

[pandas3.5] C:\Users\Jeff Reback\pandas3.5>nosetests -A "not network" pandas\io\tests\test_parsers.py
...........S.........................................................................F......................................S....................S..........................S...........................
................................F........S..................................................................................................................................S...........................
..............................................
======================================================================
FAIL: test_parse_bools (pandas.io.tests.test_parsers.TestCParserHighMemory)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Users\Jeff Reback\pandas3.5\pandas\io\tests\test_parsers.py", line 1177, in test_parse_bools
    self.assertEqual(data['A'].dtype, np.bool_)
AssertionError: dtype('O') != <class 'numpy.bool_'>

======================================================================
FAIL: test_parse_bools (pandas.io.tests.test_parsers.TestCParserLowMemory)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Users\Jeff Reback\pandas3.5\pandas\io\tests\test_parsers.py", line 1177, in test_parse_bools
    self.assertEqual(data['A'].dtype, np.bool_)
AssertionError: dtype('O') != <class 'numpy.bool_'>

----------------------------------------------------------------------
Ran 446 tests in 12.620s

FAILED (SKIP=6, failures=2)

jdeschenes · 2015-10-09T20:47:44Z

This was an issue with python3 and not limited to windows.

jreback · 2015-10-09T21:43:38Z

asv_bench/benchmarks/gil.py

+    def time_nogil_read_csv(self):
+        @test_parallel(num_threads=2)
+        def run(arr):
+            read_csv('__test__.csv', sep=',', header=None, float_precision=None)


is their a cleanup function on these? (e.g. to remove the created csvs)?

@qwhelan ?

@jdeschenes @qwhelan ?

There isn't. I had a look at a few tests and they don't seem to cleanup either... see test.h5

jreback · 2015-10-13T12:31:25Z

@jdeschenes

couple of questions for you above.
can you post the asv results here.
pls add a note in the performance section (use this PR number)
squash

jdeschenes · 2015-10-13T14:08:42Z

@jreback

Where do I need to add information in the performance section? Is it in the what's new file in the documentation?

jreback · 2015-10-13T14:17:38Z

whatsnew/v0.17.1 (Performance section)

jreback · 2015-10-25T15:31:07Z

@jdeschenes looks good. can you

add a note in whatsnew (performance section)
rebase
squash

jreback · 2015-11-02T12:05:50Z

can you update according to comments

mrocklin · 2015-11-03T15:09:08Z

What's the status on this @jdeschenes? I'd like to include this work in a talk happening tomorrow. It'd be awesome to be able to say that this was in master rather than in a branch.

jreback · 2015-11-03T16:42:54Z

you can say slated for 0.17.1 :)

jdeschenes · 2015-11-03T20:24:03Z

I will get the final changes tomorrow.

jdeschenes · 2015-11-04T18:31:59Z

@jreback, The changes have been implemented. Let me know if there is anything else that needs to be done.

jreback · 2015-11-04T18:50:19Z

pandas/parser.pyx

@@ -1452,6 +1449,18 @@ cdef _to_fw_string(parser_t *parser, int col, int line_start,
    result = np.empty(line_end - line_start, dtype='|S%d' % width)
    data = <char*> result.data

+    with nogil:
+        _to_fw_string_internal(parser, col, line_start, line_end, width, data)


call this: _to_fw_string_nogil instead

jreback · 2015-11-04T18:54:22Z

@jdeschenes thanks, just some small comments. ping when pushed. pls also post a short benchmark in the top of the PR (you can just run before/after in ipython via timeit if you want), mainly for posterity.

The GIL was released around the tokenizer functions and the conversion function(_string_convert excluded).

jdeschenes · 2015-11-04T19:44:19Z

@jreback: Added the benchmarks.

jreback · 2015-11-04T19:49:35Z

looks good

ping when green

mrocklin · 2015-11-04T20:17:35Z

Ping

PERF: Removed the GIL from parts of the TextReader class

jreback · 2015-11-04T20:28:50Z

thanks @jdeschenes

and @mrocklin for the pings!

Merge pull request pandas-dev#11272 from jdeschenes/nogil_csv

jreback added Performance Memory or execution speed performance IO CSV read_csv, to_csv labels Oct 9, 2015

jreback added this to the 0.17.1 milestone Oct 9, 2015

jreback reviewed Oct 9, 2015
View reviewed changes

jdeschenes force-pushed the nogil_csv branch from 605e4e0 to 94d4f0d Compare October 27, 2015 21:13

jdeschenes force-pushed the nogil_csv branch 2 times, most recently from a5ff16a to 9a4f845 Compare November 4, 2015 18:29

jreback reviewed Nov 4, 2015
View reviewed changes

PERF: Released the GIL from parts of the TextReader class

091df3e

The GIL was released around the tokenizer functions and the conversion function(_string_convert excluded).

jdeschenes force-pushed the nogil_csv branch from 9a4f845 to 091df3e Compare November 4, 2015 19:07

jreback added a commit that referenced this pull request Nov 4, 2015

Merge pull request #11272 from jdeschenes/nogil_csv

774411c

PERF: Removed the GIL from parts of the TextReader class

jreback merged commit 774411c into pandas-dev:master Nov 4, 2015

jdeschenes deleted the nogil_csv branch November 4, 2015 21:04

khs26 pushed a commit to khs26/pandas that referenced this pull request Nov 6, 2015

Merge pull request #1 from pydata/master

fd4fc5f

Merge pull request pandas-dev#11272 from jdeschenes/nogil_csv

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PERF: Removed the GIL from parts of the TextReader class #11272

PERF: Removed the GIL from parts of the TextReader class #11272

jdeschenes commented Oct 9, 2015

jreback Oct 9, 2015

jreback commented Oct 9, 2015

jdeschenes commented Oct 9, 2015

jreback commented Oct 9, 2015

jdeschenes commented Oct 9, 2015

jreback Oct 9, 2015

jreback Oct 25, 2015

jdeschenes Oct 27, 2015

jreback commented Oct 13, 2015

jdeschenes commented Oct 13, 2015

jreback commented Oct 13, 2015

jreback commented Oct 25, 2015

jreback commented Nov 2, 2015

mrocklin commented Nov 3, 2015

jreback commented Nov 3, 2015

jdeschenes commented Nov 3, 2015

jdeschenes commented Nov 4, 2015

jreback Nov 4, 2015

jreback commented Nov 4, 2015

jdeschenes commented Nov 4, 2015

jreback commented Nov 4, 2015

mrocklin commented Nov 4, 2015

jreback commented Nov 4, 2015

PERF: Removed the GIL from parts of the TextReader class #11272

PERF: Removed the GIL from parts of the TextReader class #11272

Conversation

jdeschenes commented Oct 9, 2015

Benchmark:

Data Generation:

Benchmark Code:

Before:

After:

jreback Oct 9, 2015

Choose a reason for hiding this comment

jreback commented Oct 9, 2015

jdeschenes commented Oct 9, 2015

jreback commented Oct 9, 2015

jdeschenes commented Oct 9, 2015

jreback Oct 9, 2015

Choose a reason for hiding this comment

jreback Oct 25, 2015

Choose a reason for hiding this comment

jdeschenes Oct 27, 2015

Choose a reason for hiding this comment

jreback commented Oct 13, 2015

jdeschenes commented Oct 13, 2015

jreback commented Oct 13, 2015

jreback commented Oct 25, 2015

jreback commented Nov 2, 2015

mrocklin commented Nov 3, 2015

jreback commented Nov 3, 2015

jdeschenes commented Nov 3, 2015

jdeschenes commented Nov 4, 2015

jreback Nov 4, 2015

Choose a reason for hiding this comment

jreback commented Nov 4, 2015

jdeschenes commented Nov 4, 2015

jreback commented Nov 4, 2015

mrocklin commented Nov 4, 2015

jreback commented Nov 4, 2015