Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IO position not updated while reading a CSV file #195

Closed
konzinov opened this issue Nov 28, 2020 · 11 comments
Closed

IO position not updated while reading a CSV file #195

konzinov opened this issue Nov 28, 2020 · 11 comments

Comments

@konzinov
Copy link

konzinov commented Nov 28, 2020

While reading CSV files the pos method always return first line position.

It seems the issue appeared since this commit (found using git bisect)

To illustrate it i made a test file.

I also made a fork on my own repo with the test file and a csv file (in test/csv/proof) and tried to fix it.

My solution is to pass down the @input instance from Parser to Scanner and update position while iterating within each_line method.

I wanted to make sure it was an issue (or maybe a bad use from me) before opening a Pull Request

I am using Ruby 2.7.2

Best regards,
Emmanuel KONZI.

@kou
Copy link
Member

kou commented Nov 28, 2020

Could you show your use case of CSV#pos?

@konzinov
Copy link
Author

konzinov commented Nov 28, 2020

Yes. Basically It looks like this:

#file = File.expand_path(File.join(File.dirname(__FILE__), file_name))
CSV.open(file, 'rb', {col_sep: ';', encoding: 'ISO-8859-1'}) do |csv|
  pos = csv.pos #or .tell
  csv.each { |_| positions << pos; pos = csv.pos }
end

I got a full example of the test in a gist

@kou
Copy link
Member

kou commented Nov 29, 2020

Thanks.
Could you also provide why do you want to use CSV#pos and how did you use CSV#pos in real use case?

@kou
Copy link
Member

kou commented Nov 29, 2020

Could you also show the positions value of the example script on your environment?

@konzinov
Copy link
Author

konzinov commented Nov 29, 2020

Thanks.
Could you also provide why do you want to use CSV#pos and how did you use CSV#pos in real use case?

In my application i have an autocomplete dropdown for some cities zipcode from a csv file.

So to be able to search effectively in that file we use the CSV#pos to build an index for the cities name chunks.

For example: We know in that file Moscow is at position 45 and Monrovia at the position 367. Our index is built with the chunks of Moscow and Monrovia like this:

M =>45, Mo => 45, Mos => 45, Mosc => 45, ... Moscow =>45
M => 367, Mo => 367, Mon => 367, ..., Monrovia => 367

So when someone enters M or Mo we can propose line Moscow and Monrovia based on the fact that we know M or Mo are on the lines 45 and 367.

Does it make sense ?

@konzinov
Copy link
Author

Could you also show the positions value of the example script on your environment?

@kou here is an output of the script.

➜  csv git:(master) ✗ ruby test/csv/proof/proof_test.rb

Loaded suite test/csv/proof/proof_test
Started
0
450
450
450
450
450
450
450
450
450
450
F
==============================================================================================================================================================================================================================================================================
     12:     positions = @csv_reader.read_positions
     13:     puts positions
     14:     # Before commit 8505ff0 this test was passing. But now position stays at the first line.
  => 15:     assert_equal positions, positions.uniq, "Not unique"
     16:   end
     17: 
     18:   class CsvReader
test/csv/proof/proof_test.rb:15:in `test_read_positions'
Not unique
<[0, 450, 450, 450, 450, 450, 450, 450, 450, 450, 450]> expected but was
<[0, 450]>

diff:
? [0, 450, 450, 450, 450, 450, 450, 450, 450, 450, 450]
Failure: test_read_positions(ProofTest)
==============================================================================================================================================================================================================================================================================

Finished in 0.007358 seconds.
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 tests, 1 assertions, 1 failures, 0 errors, 0 pendings, 0 omissions, 0 notifications
0% passed
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
135.91 tests/s, 135.91 assertions/s

@kou
Copy link
Member

kou commented Nov 30, 2020

How about using CSV#lineno instead of CSV#pos for your use case?

We may read a large chunk at once for performance instead of reading each line. So I don't want to recommend users that they expect CSV#pos returns the position of the current parsing line.

@konzinov
Copy link
Author

konzinov commented Nov 30, 2020

Yes we could have used lineno but we would have to change the way we look into the file. For now we are using CSV#seek (which we found pretty convenient).

Please could you show me some valid examples of CSV#pos usage ?

If CSV#pos is not expected to return the position then we need to update the docs or check if other io-delegated methods keep working.

@kou
Copy link
Member

kou commented Dec 1, 2020

How about caching a parsed CSV object in memory instead of using CSV#pos and CSV#seek? It'll be faster.

I don't think CSV#pos is useful for parsing generally. And CSV#seek isn't safe for parsing generally. For example, it doesn't work for a CSV file that has BOM. Because we need to read the first some bytes for the file. It also doesn't work for a case that needs to read headers the CSV file. The current implementation also doesn't work if CSV#seek is called after starting parsing. If we support it, we need to care it like CSV#rewind.

@konzinov
Copy link
Author

konzinov commented Dec 3, 2020

Yes it will be faster for sure but we got a really big csv file so am wondering about memory growth issues. Anyway thanks for the advice, we will try to experiment it.

Then i guess the changes about CSV#pos behaviour are known from you and is not an issue from what i can see. Thanks for all the replies. Can i close the issue ?

But am still waiting for some valid example of those io-delegated methods (such as CSV#pos)

@kou
Copy link
Member

kou commented Dec 5, 2020

I think that it's time you switch to other approach such as using DB from CSV in your situation.

I close this.

I don't think most of those IO-delegated methods are useful. They still exist for historical reasons. They may be useful with CSV.open but I'm not sure...

@kou kou closed this as completed Dec 5, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

2 participants