Skip to content

Conversation

@nyamsprod
Copy link
Member

@nyamsprod nyamsprod commented Sep 5, 2016

Introduction

Using this library to extract data from multiple CSV at once is complicated with the fact that the query features are coupled to the Reader class.

Proposal

Describe the new feature

This proposal introduces:

  • a Query class to extract some records from the CSV data. This is an immutable value object.
  • a Records class to manipulate the resulting records.

before:

<?php

use League\Csv\Reader;

$csv = Reader::createFromPath('/path/to/your/iso8859-1.csv');
$csv->setInputEncoding('iso-8859-1');

$csv->addFilter($filter)
    ->setOffset(10)
    ->setLimit(3);
foreach ($csv->fetch() as $row) {
    $row; //no conversion
}

$csv->addFilter($filter)
    ->setOffset(10)
    ->setLimit(3);
$res = $csv->fetchAll();  //no conversion

$csv->addFilter($filter)
    ->setOffset(10)
    ->setLimit(3);
$col2 = $csv->fetchColumn(2);   //no conversion

$csv->addFilter($filter)
    ->setOffset(10)
    ->setLimit(3);
$xml = $csv->toXML();  //converted into UTF-8

after:

<?php

use League\Csv\Query;
use League\Csv\Reader;

$csv = Reader::createFromPath('/path/to/your/iso8859-1.csv');
$csv->setInputEncoding('iso-8859-1');
$query = (new Query())
    ->addFilter($filter)
    ->setOffset(10)
    ->setLimit(3);
$records = $query($csv);
foreach ($records as $row) {
    $row; //converted into UTF-8
}
$res = $records->fetchAll();  //converted into UTF-8
$col2 = $records->fetchColumn(2);  //converted into UTF-8
$xml = $records->toXML();  //converted into UTF-8
  • You can now issue queries with The Writer object directly.
  • setInputEncoding will now have an effect when extracting records from the CSV
  • setInputEncoding will use stream filter if possible or will fallback to using mb_* to convert the found records on the fly.

To ease creating a Records class, a new method on the AbstractCsv::getRecords(Query $query = null) is added to instantiate a Records class from a Csv object.

<?php

use League\Csv\Query;
use League\Csv\Reader;

$csv = Reader::createFromPath('/path/to/your/iso8859-1.csv');
$csv->setInputEncoding('iso-8859-1');
$query = (new Query())
    ->addFilter($filter)
    ->setOffset(10)
    ->setLimit(3);
$records = $csv->getRecords($query);

Backward Incompatible Changes

  • All query filters methods attached to the Reader class are now attached to the Query class.
  • All extracting methods attached to the Reader class are now attached to the Records class.
  • All conversion methods attached to the AbstractCsv class are now attached to the Records class.
  • The following extracting methods are removed:
    • each;
    • fetch ;
    • fetchPairsWithoutDuplicates;
  • The optional callable parameter from the extract methods is removed

Targeted release version

version 9.X

Open issues

  • Should we add as requirement in the composer.json the iconv extension, to normalize how data conversion is done with/without stream filter ?
  • AbstractCsv::getRecords is a proposed named, but it could be renamed AbstractCsv::query ?
  • Should we use __call to partially keep the current stable API. We could still access the extract method and would help migration from 8.x version to 9.x version ?

@dequis
Copy link

dequis commented Sep 6, 2016

Thanks! Using setInputEncoding that way looks much more comfortable, and the immutable query object is neat.

One potential issue I see with the cell-oriented transcoding method (which isn't introduced by this PR but wasn't used by everything before) is that it does splitting by ascii bytes first, and not all encodings are ascii compatible.

If you have a UTF-16 encoded document CSV with 㬬 (U+3B2C) in it, the csv splitter see two bytes ,;, pick one of those as the delimiter, add an extra unwanted column to the csv, and the cell-oriented transcoder won't even hear about one of those chars, probably throwing a truncated data error.

So the stream filters should be used whenever possible.

Other API ideas:

  1. Would it be a bad idea to keep (partial) compatibility with the "old" api as a __call method that creates a new Query and calls the corresponding method in that object? For example $csv->setOffset(10) doing return (new Query())->setOffset(10) transparently. Might be too much magic for your taste.
  2. Instead of $query = (new Query()), how about $query = $csv->query()?
  3. The Query object could be initialized with a reference to the reader, like new Query($csv) or $csv->query() calling new Query(this). This way, the Query object could have a get() method that does ``$this->csv->getRecords($this)`.

So you could go from:

$query = (new Query())
    ->addFilter($filter)
    ->setOffset(10)
    ->setLimit(3);
$records = $csv->getRecords($query);

To:

$records = $csv->query()
    ->addFilter($filter)
    ->setOffset(10)
    ->setLimit(3)
    ->get();

@nyamsprod
Copy link
Member Author

nyamsprod commented Sep 6, 2016

If you have a UTF-16 encoded document CSV with 㬬 (U+3B2C) in it, the csv splitter see two bytes ,;, pick one of those as the delimiter, add an extra unwanted column to the csv, and the cell-oriented transcoder won't even hear about one of those chars, probably throwing a truncated data error. So the stream filters should be used whenever possible.

The latest commit to the PR just does that 👍 . It uses stream whenever possible or else fallback to cell by cell conversion.

Would it be a bad idea to keep (partial) compatibility with the "old" api as a __call method that creates a new Query and calls the corresponding method in that object? For example $csv->setOffset(10) doing return (new Query())->setOffset(10) transparently. Might be too much magic for your taste.

As you mentioned it would be partial compatibility and too much magic IMHO

Instead of $query = (new Query()), how about $query = $csv->query()?

The goal is to have the Query being totally decoupled from the Csv so that if I have to treat 100 files at once I would to something like this

use League\Csv\Query;
use League\Csv\Reader;
use League\Csv\Records;

$query = (new Query())
    ->setOffset(3)
    ->setLimit(12)
    ->addFilter($callable)
;

foreach ($csv_list as $csv) {
    $data = $csv->getRecords($query)->fetchAssoc();
    ....
}

The Query object could be initialized with a reference to the reader, like new Query($csv) or $csv->query() calling new Query(this). This way, the Query object could have a get() method.

Again this goes against the decoupling. And looking at your chaining one could easily loose what is returned when ? What the new API provide is an alternative like this:

use League\Csv\Query;
use League\Csv\Reader;

$query = (new Query())
    ->setOffset(3)
    ->setLimit(12)
    ->addFilter($callable)
;

$csv = new Reader::CreateFromPath('/path/to/file.csv')
    ->setDelimiter(";")
    ->setInputEncoding("ISO-8859-1")
;

$records = new Records($csv, $query);

And yes $csv->getRecords($query) is an alias of new Records($csv, $query)

@frankdejonge
Copy link
Member

@nyamsprod this looks like a nice improvement! One thing that feels a little out of place, is the Query name. But I'm not sure with what else it could be replaced.

@nyamsprod
Copy link
Member Author

@frankdejonge the other option would maybe be 'Filter', I don't know... naming thing is hard :)

@frankdejonge
Copy link
Member

@nyamsprod it seems more like a specification of some sort... right?

@frankdejonge
Copy link
Member

And yeah, naming things is really hard.

@nyamsprod
Copy link
Member Author

nyamsprod commented Sep 8, 2016

@frankdejonge @dequis @cordoval how about renaming the classes as follow

  • Query to Statement to mimic how things are named in SQL language
  • Records to RecordSet since the Csv RFC calls each row a record.
  • AbstractCsv::getRecords to AbstractCsv::select

Which would gives the following result:

<?php

use League\Csv\Reader;
use League\Csv\Statement;

$csv = Reader::createFromPath('/path/to/your/iso8859-1.csv');
$csv->setInputEncoding('iso-8859-1');
$csv->setHeader(0); //specify the record to be use as header - can be an array too
$stmt = (new Statement())
    ->addFilter($filter)
    ->setOffset(10)
    ->setLimit(3);
$records = $csv->select($stmt); 
$records = $stmt->process($csv); //or to enable the use of the same statement on different CSV objects
foreach ($records as $row) {
    $row; //converted into UTF-8
}
$res = $records->fetchAll();  //converted into UTF-8 with the header fields combined
$col2 = $records->fetchColumn(2);  //converted into UTF-8
$xml = $records->toXML();  //converted into UTF-8

By decoupling the extract features from the reader we can now:

- Apply the same query to multiple CSV files without redefining them each time
- Apply auto-conversion on any CSV without stream filter usage (see issue #177)
- Auto remove the BOM sequence if present
- Set the CSV header more easily
- Selecting records can now be done on a League\Csv\Writer object too

This is a major BC break so it is schedule for the next major version

- Dropping support for PHP 5.5
- Updating PHPUnit to PHPUnit 5.5
- Require iconv extension
@nyamsprod
Copy link
Member Author

@dequis @frankdejonge I've reconsidered your idea of using __call and implemented it on the Reader class. to view the full v9.x proposed new API you can look at this gist

- Adding getter in the Statement object
- RecordSet constructor expects a Reader and a Statement objects
- To have on consistent select/recordset behavior the stream filter
  mode can no longer be changed:
     - STREAM_FILTER_READ for the Reader class
     - STREAM_FILTER_WRITE for the Writer class
- The stream filter API is simplified
- Conversion and filtering of the Reader::getIterator is no longer done in
  the Statement class but in the RecordSet class.
@nyamsprod
Copy link
Member Author

This PR is closed in favor of PR #210

@nyamsprod nyamsprod closed this Feb 1, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants