Slow import of files with text escape #80

Closed
omenSi opened this issue Aug 28, 2023 · 16 comments

Comments

@omenSi
Collaborator

omenSi commented Aug 28, 2023

Hello,

I was wondering: could there be an issue with file_reader when using text_qualifier? It shockingly reduces import performance...

I've made a test script for this, showing the difference in import time between pandas, numpy and tablite.

tablite:

from tablite import Table
from tablite.config import Config
import numpy as np
import time
from pandas import DataFrame
import pandas

# Create a wide and short table

wide_df = {k: np.random.random_sample(size = 3000) for k in range(0, 1000)}
df_to_save = DataFrame.from_dict(wide_df)
df_to_save.to_csv("./test.csv")
# Generated table size == 56473KB


start = time.time()
tbl = Table.from_file("./test.csv", guess_datatypes=False, text_qualifier='"', delimiter=',')
end = time.time()
print(end - start)


# OUTPUT
importing: consolidating 'test.csv': 100.00%|██████████| [1:18:05<00:00]  
4685.573677778244

pandas:

start = time.time()
new_df = pandas.read_csv("./test.csv", escapechar='"', delimiter=',')
end = time.time()
print(end - start)

# OUTPUT
0.45365166664123535

numpy:

Disclaimer: the comparison to numpy is not 100% analytically correct, since there is no text escape.

start = time.time()
arr = np.genfromtxt('./test.csv', delimiter=',', dtype=None)
end = time.time()
print(end - start)

# OUTPUT
5.538379430770874

At first I thought there could be overhead from multiprocessing being enabled, since there are a lot of short columns, but that is not the case: using Config.MULTIPROCESSING_MODE = Config.FALSE made the performance even worse. I got impatient, since after 31 minutes tablite had only imported 73.68%, and the progress bar had started at 70%.

importing: saving 'test1.csv' to disk: 73.68%|███████▎  | [31:30<3:37:32]
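
For reference, the single-process run above was driven by the config flag quoted in the text. A minimal sketch of that setup (same file and options as before):

import time
from tablite import Table
from tablite.config import Config

Config.MULTIPROCESSING_MODE = Config.FALSE  # disable multiprocessing, as described above

start = time.time()
tbl = Table.from_file("./test.csv", guess_datatypes=False, text_qualifier='"', delimiter=',')
print(time.time() - start)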

When used without the text_qualifier option, tablite imports the same file in 2 minutes 21 seconds (which is still considerably slower than pandas):

start = time.time()
tbl = Table.from_file("./test.csv", guess_datatypes=False, delimiter=',')
end = time.time()
print(end - start)

# OUTPUT
importing: consolidating 'test.csv': 100.00%|██████████| [02:21<00:00]
3636 deleted page '/tmp/tablite-tmp/pid-3636/pages/3004.npy
3636 deleted page '/tmp/tablite-tmp/pid-3636/pages/3005.npy
3636 deleted page '/tmp/tablite-tmp/pid-3636/pages/3006.npy
...
3636 deleted page '/tmp/tablite-tmp/pid-3636/pages/4004.npy
141.9796497821808
@root-11
Owner

root-11 commented Aug 28, 2023

I was thinking that perhaps we could use this:
https://github.com/juancarlospaco/faster-than-csv
but I didn't finish packaging it together with creating the numpy file format in:
https://nbviewer.org/github/root-11/root-11.github.io/blob/master/content/reading_numpys_fileformat.ipynb

what are your thoughts?

@omenSi
Collaborator Author

omenSi commented Aug 28, 2023

I think if it's memory safe (no risk of out-of-memory exceptions), then perhaps it could be considered as an option.

At first glance, I can see that the lib you linked returns the whole table as a list, which could mean that the user's device runs out of memory when the list simply won't fit in RAM.

I think it is more practical if the tool, just like tablite, fails only when a single column does not fit in RAM, not the whole table.

If my understanding of the tablite core is incorrect, please correct me.

@root-11
Owner

root-11 commented Aug 28, 2023

I would read tablite.config.Config.page_size (int) rows at a time and thereby slice the read operations to prevent an OOMError.
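
A minimal sketch of that chunked-read idea (illustrative only, not tablite's actual reader; the page size would come from tablite.config.Config.page_size as mentioned above):

from itertools import islice

def iter_pages(lines, page_size):
    # page_size would be tablite.config.Config.page_size, as mentioned above
    while True:
        page = list(islice(lines, page_size))
        if not page:
            break
        yield page  # each page is small enough to process without risking OOM

# usage sketch:
# with open("./test.csv") as f:
#     header = next(f)
#     for page in iter_pages(f, page_size):
#         ...  # parse and paginate one chunk at a time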

@realratchet
Collaborator

realratchet commented Aug 28, 2023

The current implementation itself also has a lot of room for improvement.

In a few places the file is re-read from the beginning every time we want to go to a certain line, when we could file.seek() instead, since we can save the line offsets during the initial loop through the file.
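
A minimal sketch of that offset-caching idea (illustrative only, not the current reader):

def index_line_offsets(path):
    # single pass over the file, recording the byte offset where each line starts
    offsets = []
    pos = 0
    with open(path, "rb") as f:
        for raw in f:
            offsets.append(pos)
            pos += len(raw)
    return offsets

def read_line(path, offsets, lineno, encoding="utf-8"):
    # jump straight to the requested line instead of re-reading from the top
    with open(path, "rb") as f:
        f.seek(offsets[lineno])
        return f.readline().decode(encoding)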

The text escape uses Python loops, which are slow on their own, and there are constant memory re-allocations when working with strings. The effect of this is 35 ms per line (on my machine). If we compare that with csv.reader using a custom dialect, it can do a line in 0.17 ms.

It could probably be even faster, because I assume there's setup overhead that could be cut further if the entire file (or a chunk of it) were processed instead of going line by line.

We can play around with dialect settings and see what passes all the existing tests.
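
One way to reproduce that kind of per-line measurement (a rough sketch; the 35 ms and 0.17 ms figures above are from realratchet's machine, and the sample line here is the one from the test discussed below):

import csv
import io
import timeit

line = '"1000294";"S2417DG 24"" LED monitor (210-AJWM)";"47.120,000";"CM3";3'

def parse_line(s):
    # csv.reader is implemented in C, so the per-line parsing cost stays small
    return next(csv.reader(io.StringIO(s), delimiter=";", quotechar='"'))

n = 10_000
total = timeit.timeit(lambda: parse_line(line), number=n)
print(f"{total / n * 1000:.4f} ms per line")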

@omenSi
Collaborator Author

omenSi commented Aug 28, 2023

To improve the performance of the current text escaping module, we could, as @realratchet mentioned, use the standard csv library reader:

file_reader_utils.py#L103

    def _call_3(self, s):  # looks for qoutes.
        words = []
        # qoute = False
        # ix = 0
        # while ix < len(s):
        #     c = s[ix]
        #     if c == self.qoute:
        #         qoute = not qoute
        #     if qoute:
        #         ix += 1
        #         continue
        #     if c == self.delimiter:
        #         word, s = s[:ix], s[ix + self._delimiter_length :]
        #         word = word.lstrip(self.qoute).rstrip(self.qoute)
        #         words.append(word)
        #         ix = -1
        #     ix += 1
        # if s:
        #     s = s.lstrip(self.qoute).rstrip(self.qoute)
        #     words.append(s)

        class MyDialect(csv.Dialect):
            delimiter = self.delimiter
            quotechar = self.qoute
            escapechar = '\\'
            doublequote = True
            quoting = csv.QUOTE_MINIMAL
            skipinitialspace = False
            lineterminator = "\n"

        dia = MyDialect
        parsed_words = list(csv.reader(StringIO(s), dialect=dia))[0]
        words.extend(parsed_words)
        return words

With this improvement we can achieve fast text escaping, since csv.reader is implemented in C.

However, there is one test (test_filereader_formats.py/test_text_escape) which fails, because its data does not follow the RFC 4180 standard (https://www.rfc-editor.org/rfc/rfc4180#page-2, point 7).

7.  If double-quotes are used to enclose fields, then a double-quote
       appearing inside a field must be escaped by preceding it with
       another double quote

The test:

te = text_escape('"1000294";"S2417DG 24"" LED monitor (210-AJWM)";"47.120,000";"CM3";3')
assert te == ["1000294", 'S2417DG 24"" LED monitor (210-AJWM)', "47.120,000", "CM3", "3"]

For the test to be correct, its data should follow the CSV standard: every double quote appearing inside a field has to be escaped by doubling it:

te = text_escape('"1000294";"S2417DG 24"""" LED monitor (210-AJWM)";"47.120,000";"CM3";3')
assert te == ["1000294", 'S2417DG 24"" LED monitor (210-AJWM)', "47.120,000", "CM3", "3"]

Then the test passes.
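
For illustration, the standard library parses the corrected line into exactly the expected fields (a quick check, not part of the test suite):

import csv
from io import StringIO

line = '"1000294";"S2417DG 24"""" LED monitor (210-AJWM)";"47.120,000";"CM3";3'
row = next(csv.reader(StringIO(line), delimiter=";", quotechar='"'))
print(row)
# ['1000294', 'S2417DG 24"" LED monitor (210-AJWM)', '47.120,000', 'CM3', '3']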

@omenSi
Collaborator Author

omenSi commented Aug 28, 2023

If you agree that the test is incorrect, I can make a PR.

@root-11
Owner

root-11 commented Aug 28, 2023 via email

@realratchet
Collaborator

Ovidijus would have to give concrete figures after the csv.reader implementation, but 35 ms to 0.17 ms is a huge jump, since that's where we're spending most of our time.

As for breaking the file format, I don't think we should care, because we either break properly exported files or we support only one bad exporter. The only side effect is not a broken import; the imported result is just slightly different. As Ovidijus and I discussed, producing bad imports for exporters that correctly follow the standard, in order to support broken exporters, adds no value.

@root-11
Owner

root-11 commented Aug 28, 2023 via email

@realratchet
Collaborator

It would never raise an error in either case. We lose compliance with every exporter that follows the CSV standard in favor of supporting the quotes of one that does not, since those double quotes would just be consumed and produce an empty sequence. As I'm not sure which one is more important, you'd have to make the choice.

However, if following that specific exporter's format is important, tablite then loses a lot of its value as a public repository.

If we drop the csv.reader idea and keep the existing functionality, we can still get some speedup by not reallocating memory as frequently, since heap allocation and GC are expensive. Even a non-interpreted language would choke on heap allocations and deallocations in a loop.

As for the sniffer, I don't think we experimented with it; we only did line-by-line parsing.

@root-11
Owner

root-11 commented Aug 28, 2023 via email

@omenSi
Collaborator Author

omenSi commented Aug 29, 2023

If we look at the speed impact, the same file (renamed so tablite does not use the cache) with the same options took 3 minutes 13 seconds (previously 1 hour 18 minutes):

start = time.time()
tbl = Table.from_file("./test2.csv", guess_datatypes=False, text_qualifier='"', delimiter=',')
end = time.time()
print(end - start)

# OUTPUT

importing: reading 'test2.csv' bytes: 0.00%|          | [00:00<?]
importing: consolidating 'test2.csv': 100.00%|██████████| [03:13<00:00]
193.13518619537354

@realratchet
Collaborator

We got it down to something like 7s, still WIP.

@root-11
Owner

root-11 commented Aug 29, 2023 via email

@realratchet
Collaborator

test.csv 1000 columns wide, 3000 rows.
Running 10 tests averaged 5.76 seconds on my machine. Only a single core is used. No datatype inference.
Running 10 tests averaged 2 minutes 39 seconds on my machine. Only a single core is used. With datatype inference.

When it comes to datatype inference, I went the naive route: inferring datatypes during pagination, since that step already loads the data into numpy anyway.

There can still be SOME improvements, but because we support non-primitive datatypes (datetime, date, time), using numpy as the data format makes this extremely challenging. It requires fully understanding how numpy saves objects, since as far as I can tell it uses contiguous blocks when reading anything, and because it uses pickling we would also need to understand the pickle format fully, so that when the file is read to the end we could seek back to the necessary offset and overwrite the pickle headers, etc.
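
For context, a small demonstration of the behaviour described above: primitive dtypes are stored as a raw contiguous block after the .npy header, while object-dtype arrays (needed for datetime/date/time columns) fall back to pickling (plain numpy, not tablite code):

import numpy as np
from datetime import datetime

primitives = np.arange(10, dtype=np.float64)
np.save("/tmp/primitives.npy", primitives)                 # raw contiguous bytes after the header

objects = np.array([datetime(2023, 8, 29), None], dtype=object)
np.save("/tmp/objects.npy", objects, allow_pickle=True)    # payload is a pickle stream
loaded = np.load("/tmp/objects.npy", allow_pickle=True)    # must opt in to unpickling
print(loaded.dtype)  # object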

And even then I don't think it would add much speed when it comes to datatype inference, because I found that Python's string parsing is just slow, so it's probably not worth bothering with; using native code would likely improve it more.

We'll need to do a lot more experiments just to make sure things are in order, but all tests pass, so it's promising.

@realratchet
Collaborator

gesaber_data.csv: 10,935,992 lines

Old tablite took 522 seconds; the experimental tablite changes take 63 seconds. That's roughly an 8x speedup.
