IO position not updated while reading a CSV file #195

konzinov · 2020-11-28T15:20:00Z

While reading CSV files the pos method always return first line position.

It seems the issue appeared since this commit (found using git bisect)

To illustrate it i made a test file.

I also made a fork on my own repo with the test file and a csv file (in test/csv/proof) and tried to fix it.

My solution is to pass down the @input instance from Parser to Scanner and update position while iterating within each_line method.

I wanted to make sure it was an issue (or maybe a bad use from me) before opening a Pull Request

I am using Ruby 2.7.2

Best regards,
Emmanuel KONZI.

The text was updated successfully, but these errors were encountered:

kou · 2020-11-28T23:16:59Z

Could you show your use case of CSV#pos?

konzinov · 2020-11-28T23:43:08Z

Yes. Basically It looks like this:

#file = File.expand_path(File.join(File.dirname(__FILE__), file_name))
CSV.open(file, 'rb', {col_sep: ';', encoding: 'ISO-8859-1'}) do |csv|
  pos = csv.pos #or .tell
  csv.each { |_| positions << pos; pos = csv.pos }
end

I got a full example of the test in a gist

kou · 2020-11-29T20:42:33Z

Thanks.
Could you also provide why do you want to use CSV#pos and how did you use CSV#pos in real use case?

kou · 2020-11-29T20:44:29Z

Could you also show the positions value of the example script on your environment?

konzinov · 2020-11-29T22:13:05Z

Thanks.
Could you also provide why do you want to use CSV#pos and how did you use CSV#pos in real use case?

In my application i have an autocomplete dropdown for some cities zipcode from a csv file.

So to be able to search effectively in that file we use the CSV#pos to build an index for the cities name chunks.

For example: We know in that file Moscow is at position 45 and Monrovia at the position 367. Our index is built with the chunks of Moscow and Monrovia like this:

M =>45, Mo => 45, Mos => 45, Mosc => 45, ... Moscow =>45
M => 367, Mo => 367, Mon => 367, ..., Monrovia => 367

So when someone enters M or Mo we can propose line Moscow and Monrovia based on the fact that we know M or Mo are on the lines 45 and 367.

Does it make sense ?

konzinov · 2020-11-29T22:16:00Z

Could you also show the positions value of the example script on your environment?

@kou here is an output of the script.

➜  csv git:(master) ✗ ruby test/csv/proof/proof_test.rb

Loaded suite test/csv/proof/proof_test
Started
0
450
450
450
450
450
450
450
450
450
450
F
==============================================================================================================================================================================================================================================================================
     12:     positions = @csv_reader.read_positions
     13:     puts positions
     14:     # Before commit 8505ff0 this test was passing. But now position stays at the first line.
  => 15:     assert_equal positions, positions.uniq, "Not unique"
     16:   end
     17: 
     18:   class CsvReader
test/csv/proof/proof_test.rb:15:in `test_read_positions'
Not unique
<[0, 450, 450, 450, 450, 450, 450, 450, 450, 450, 450]> expected but was
<[0, 450]>

diff:
? [0, 450, 450, 450, 450, 450, 450, 450, 450, 450, 450]
Failure: test_read_positions(ProofTest)
==============================================================================================================================================================================================================================================================================

Finished in 0.007358 seconds.
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 tests, 1 assertions, 1 failures, 0 errors, 0 pendings, 0 omissions, 0 notifications
0% passed
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
135.91 tests/s, 135.91 assertions/s

kou · 2020-11-30T00:23:46Z

How about using CSV#lineno instead of CSV#pos for your use case?

We may read a large chunk at once for performance instead of reading each line. So I don't want to recommend users that they expect CSV#pos returns the position of the current parsing line.

konzinov · 2020-11-30T09:18:06Z

Yes we could have used lineno but we would have to change the way we look into the file. For now we are using CSV#seek (which we found pretty convenient).

Please could you show me some valid examples of CSV#pos usage ?

If CSV#pos is not expected to return the position then we need to update the docs or check if other io-delegated methods keep working.

kou · 2020-12-01T04:48:37Z

How about caching a parsed CSV object in memory instead of using CSV#pos and CSV#seek? It'll be faster.

I don't think CSV#pos is useful for parsing generally. And CSV#seek isn't safe for parsing generally. For example, it doesn't work for a CSV file that has BOM. Because we need to read the first some bytes for the file. It also doesn't work for a case that needs to read headers the CSV file. The current implementation also doesn't work if CSV#seek is called after starting parsing. If we support it, we need to care it like CSV#rewind.

konzinov · 2020-12-03T22:33:01Z

Yes it will be faster for sure but we got a really big csv file so am wondering about memory growth issues. Anyway thanks for the advice, we will try to experiment it.

Then i guess the changes about CSV#pos behaviour are known from you and is not an issue from what i can see. Thanks for all the replies. Can i close the issue ?

But am still waiting for some valid example of those io-delegated methods (such as CSV#pos)

kou · 2020-12-05T22:28:55Z

I think that it's time you switch to other approach such as using DB from CSV in your situation.

I close this.

I don't think most of those IO-delegated methods are useful. They still exist for historical reasons. They may be useful with CSV.open but I'm not sure...

kou closed this as completed Dec 5, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

IO position not updated while reading a CSV file #195

IO position not updated while reading a CSV file #195

konzinov commented Nov 28, 2020 •

edited

Loading

kou commented Nov 28, 2020

konzinov commented Nov 28, 2020 •

edited

Loading

kou commented Nov 29, 2020

kou commented Nov 29, 2020

konzinov commented Nov 29, 2020 •

edited

Loading

konzinov commented Nov 29, 2020

kou commented Nov 30, 2020

konzinov commented Nov 30, 2020 •

edited

Loading

kou commented Dec 1, 2020

konzinov commented Dec 3, 2020 •

edited

Loading

kou commented Dec 5, 2020

IO position not updated while reading a CSV file #195

IO position not updated while reading a CSV file #195

Comments

konzinov commented Nov 28, 2020 • edited Loading

kou commented Nov 28, 2020

konzinov commented Nov 28, 2020 • edited Loading

kou commented Nov 29, 2020

kou commented Nov 29, 2020

konzinov commented Nov 29, 2020 • edited Loading

konzinov commented Nov 29, 2020

kou commented Nov 30, 2020

konzinov commented Nov 30, 2020 • edited Loading

kou commented Dec 1, 2020

konzinov commented Dec 3, 2020 • edited Loading

kou commented Dec 5, 2020

konzinov commented Nov 28, 2020 •

edited

Loading

konzinov commented Nov 28, 2020 •

edited

Loading

konzinov commented Nov 29, 2020 •

edited

Loading

konzinov commented Nov 30, 2020 •

edited

Loading

konzinov commented Dec 3, 2020 •

edited

Loading