Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF: Removed the GIL from parts of the TextReader class #11272

Merged
merged 1 commit into from
Nov 4, 2015

Conversation

jdeschenes
Copy link
Contributor

The GIL was removed around the tokenizer functions and the conversion function(_string_convert excluded).

Benchmark:

Data Generation:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(1000000,10))
df.to_csv('test.csv')

Benchmark Code:

import pandas as pd
from pandas.util.testing import test_parallel

def f():
    for i in range(4):
        pd.read_csv('test.csv', index_col=0)

@test_parallel(4)
def g():
    pd.read_csv('test.csv', index_col=0)

Before:

In [4]: %timeit pd.read_csv('test.csv', index_col=0)
1 loops, best of 3: 2.3 s per loop

In [7]: %timeit f()
1 loops, best of 3: 9.15 s per loop

In [8]: %timeit g()
1 loops, best of 3: 9.25 s per loop

After:

In [6]: %timeit pd.read_csv('test.csv', index_col=0)
1 loops, best of 3: 2.35 s per loop

In [9]: %timeit f()
1 loops, best of 3: 9.55 s per loop

In [10]: %timeit g()
1 loops, best of 3: 4.38 s per loop

@jreback jreback added Performance Memory or execution speed performance IO CSV read_csv, to_csv labels Oct 9, 2015
@jreback jreback added this to the 0.17.1 milestone Oct 9, 2015
if (not have_real_test_parallel):
raise NotImplementedError
# Using the values
self.data = '0.1213700904466425978256438611,0.0525708283766902484401839501,0.4174092731488769913994474336\n 0.4096341697147408700274695547,0.1587830198973579909349496119,0.1292545832485494372576795285\n 0.8323255650024565799327547210,0.9694902427379478160318626578,0.6295047811546814475747169126\n 0.4679375305798131323697930383,0.2963942381834381301075609371,0.5268936082160610157032465394\n 0.6685382761849776311890991564,0.6721207066140679753374342908,0.6519975277021627935170045020\n '
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would just do use random here. e.g. DataFrame(np.random.randn(10000,10)) (or whatever makes this a reasonable time

also can you add a tests that uses object and another that has a datetime parse in the index_col

@jreback
Copy link
Contributor

jreback commented Oct 9, 2015

some windows cythoning errors.

odd they don't show up for you, what platform are you testing on?

[pandas3.5] C:\Users\Jeff Reback\pandas3.5>git checkout -b jdeschenes-nogil_csv master
Switched to a new branch 'jdeschenes-nogil_csv'

[pandas3.5] C:\Users\Jeff Reback\pandas3.5>git pull https://github.com/jdeschenes/pandas.git nogil_csv
remote: Counting objects: 7, done.
remote: Compressing objects: 100% (7/7), done.
remote: Total 7 (delta 0), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (7/7), done.
From https://github.com/jdeschenes/pandas
 * branch            nogil_csv  -> FETCH_HEAD
Updating 26db172..2da353e
Fast-forward
 asv_bench/benchmarks/gil.py |  23 ++++
 pandas/parser.pyx           | 269 +++++++++++++++++++++++++++++++-------------
 2 files changed, 213 insertions(+), 79 deletions(-)

[pandas3.5] C:\Users\Jeff Reback\pandas3.5>make
python setup.py build_ext --inplace
make: python: Command not found
Makefile:2: recipe for target `tseries' failed
make: *** [tseries] Error 127

[pandas3.5] C:\Users\Jeff Reback\pandas3.5>python setup.py build_ext --inplace
running build_ext
skipping 'pandas\index.c' Cython extension (up-to-date)
skipping 'pandas\tslib.c' Cython extension (up-to-date)
skipping 'pandas\hashtable.c' Cython extension (up-to-date)
skipping 'pandas\algos.c' Cython extension (up-to-date)
cythoning pandas\parser.pyx to pandas\parser.c

Error compiling Cython file:
------------------------------------------------------------
...
            # in the hash table
            if k != na_hashset.n_buckets:
                na_count[0] += 1
                data[0] = NA
            else:
                data[0] = parser.converter(word, &p_end, parser.decimal, parser.sci,
                                         ^
------------------------------------------------------------

pandas\parser.pyx:1538:42: Calling gil-requiring function not allowed without gil

Error compiling Cython file:
------------------------------------------------------------
...
                        data[0] = NA
            data += 1
    else:
        for i in range(lines):
            COLITER_NEXT(it, word)
            data[0] = parser.converter(word, &p_end, parser.decimal, parser.sci,
                                     ^
------------------------------------------------------------

pandas\parser.pyx:1557:38: Calling gil-requiring function not allowed without gil

Error compiling Cython file:
------------------------------------------------------------
...
                na_count[0] += 1
                data[0] = NA
                data += 1
                continue

            error = to_boolean(word, data)
                             ^
------------------------------------------------------------

pandas\parser.pyx:1680:30: Calling gil-requiring function not allowed without gil

Error compiling Cython file:
------------------------------------------------------------
...
            data += 1
    else:
        for i in range(lines):
            COLITER_NEXT(it, word)

            error = to_boolean(word, data)
                             ^
------------------------------------------------------------

pandas\parser.pyx:1688:30: Calling gil-requiring function not allowed without gil

Error compiling Cython file:
------------------------------------------------------------
...
            if k != false_hashset.n_buckets:
                data[0] = 0
                data += 1
                continue

            error = to_boolean(word, data)
                             ^
------------------------------------------------------------

pandas\parser.pyx:1758:30: Calling gil-requiring function not allowed without gil

Error compiling Cython file:
------------------------------------------------------------
...
            if k != false_hashset.n_buckets:
                data[0] = 0
                data += 1
                continue

            error = to_boolean(word, data)
                             ^
------------------------------------------------------------

pandas\parser.pyx:1778:30: Calling gil-requiring function not allowed without gil
building 'pandas.parser' extension
C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\amd64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -Ipandas/src/klib -Ipandas/src -IC:\Miniconda\envs\pandas3.5\lib\site-packages\numpy\core\in
clude -IC:\Miniconda\envs\pandas3.5\include -IC:\Miniconda\envs\pandas3.5\include "-IC:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\INCLUDE" "-IC:\Program Files (x86)\Microsoft Visual Studio 1
4.0\VC\ATLMFC\INCLUDE" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.6\include\um" "-IC:\Program Files (x86)\Windows Kits\10\in
clude\10.0.10240.0\shared" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\winrt" /Tcpandas\parser.c /Fobuild\temp.win
-amd64-3.5\Release\pandas\parser.obj
parser.c
pandas\parser.c(1): fatal error C1189: #error:  Do not use this file, it is the result of a failed Cython compilation.
error: command 'C:\\Program Files (x86)\\Microsoft Visual Studio 14.0\\VC\\BIN\\amd64\\cl.exe' failed with exit status 2

[pandas3.5] C:\Users\Jeff Reback\pandas3.5>

@jdeschenes
Copy link
Contributor Author

I fixed a few issues with the build

@jreback
Copy link
Contributor

jreback commented Oct 9, 2015

builds ,couple of errors on windows

[pandas3.5] C:\Users\Jeff Reback\pandas3.5>nosetests -A "not network" pandas\io\tests\test_parsers.py
...........S.........................................................................F......................................S....................S..........................S...........................
................................F........S..................................................................................................................................S...........................
..............................................
======================================================================
FAIL: test_parse_bools (pandas.io.tests.test_parsers.TestCParserHighMemory)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Users\Jeff Reback\pandas3.5\pandas\io\tests\test_parsers.py", line 1177, in test_parse_bools
    self.assertEqual(data['A'].dtype, np.bool_)
AssertionError: dtype('O') != <class 'numpy.bool_'>

======================================================================
FAIL: test_parse_bools (pandas.io.tests.test_parsers.TestCParserLowMemory)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Users\Jeff Reback\pandas3.5\pandas\io\tests\test_parsers.py", line 1177, in test_parse_bools
    self.assertEqual(data['A'].dtype, np.bool_)
AssertionError: dtype('O') != <class 'numpy.bool_'>

----------------------------------------------------------------------
Ran 446 tests in 12.620s

FAILED (SKIP=6, failures=2)

@jdeschenes
Copy link
Contributor Author

This was an issue with python3 and not limited to windows.

def time_nogil_read_csv(self):
@test_parallel(num_threads=2)
def run(arr):
read_csv('__test__.csv', sep=',', header=None, float_precision=None)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is their a cleanup function on these? (e.g. to remove the created csvs)?

@qwhelan ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There isn't. I had a look at a few tests and they don't seem to cleanup either... see test.h5

@jreback
Copy link
Contributor

jreback commented Oct 13, 2015

@jdeschenes

  • couple of questions for you above.
  • can you post the asv results here.
  • pls add a note in the performance section (use this PR number)
  • squash

@jdeschenes
Copy link
Contributor Author

@jreback

Where do I need to add information in the performance section? Is it in the what's new file in the documentation?

@jreback
Copy link
Contributor

jreback commented Oct 13, 2015

whatsnew/v0.17.1 (Performance section)

@jreback
Copy link
Contributor

jreback commented Oct 25, 2015

@jdeschenes looks good. can you

  • add a note in whatsnew (performance section)
  • rebase
  • squash

@jreback
Copy link
Contributor

jreback commented Nov 2, 2015

can you update according to comments

@mrocklin
Copy link
Contributor

mrocklin commented Nov 3, 2015

What's the status on this @jdeschenes? I'd like to include this work in a talk happening tomorrow. It'd be awesome to be able to say that this was in master rather than in a branch.

@jreback
Copy link
Contributor

jreback commented Nov 3, 2015

you can say slated for 0.17.1 :)

@jdeschenes
Copy link
Contributor Author

I will get the final changes tomorrow.

@jdeschenes jdeschenes force-pushed the nogil_csv branch 2 times, most recently from a5ff16a to 9a4f845 Compare November 4, 2015 18:29
@jdeschenes
Copy link
Contributor Author

@jreback, The changes have been implemented. Let me know if there is anything else that needs to be done.

@@ -1452,6 +1449,18 @@ cdef _to_fw_string(parser_t *parser, int col, int line_start,
result = np.empty(line_end - line_start, dtype='|S%d' % width)
data = <char*> result.data

with nogil:
_to_fw_string_internal(parser, col, line_start, line_end, width, data)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

call this: _to_fw_string_nogil instead

@jreback
Copy link
Contributor

jreback commented Nov 4, 2015

@jdeschenes thanks, just some small comments. ping when pushed. pls also post a short benchmark in the top of the PR (you can just run before/after in ipython via timeit if you want), mainly for posterity.

The GIL was released around the tokenizer functions and the conversion function(_string_convert excluded).
@jdeschenes
Copy link
Contributor Author

@jreback: Added the benchmarks.

@jreback
Copy link
Contributor

jreback commented Nov 4, 2015

looks good

ping when green

@mrocklin
Copy link
Contributor

mrocklin commented Nov 4, 2015

Ping

jreback added a commit that referenced this pull request Nov 4, 2015
PERF: Removed the GIL from parts of the TextReader class
@jreback jreback merged commit 774411c into pandas-dev:master Nov 4, 2015
@jreback
Copy link
Contributor

jreback commented Nov 4, 2015

thanks @jdeschenes

and @mrocklin for the pings!

@jdeschenes jdeschenes deleted the nogil_csv branch November 4, 2015 21:04
khs26 pushed a commit to khs26/pandas that referenced this pull request Nov 6, 2015
Merge pull request pandas-dev#11272 from jdeschenes/nogil_csv
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO CSV read_csv, to_csv Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants