May be a BOM issue? #184

Closed
harikt opened this Issue Oct 1, 2016 · 10 comments

Comments

Projects
None yet
2 participants
@harikt
Contributor

harikt commented Oct 1, 2016

Issue summary

Hi,

I have been playing with importing a csv file. It has came to my notice that there is some sort of issue with μ character. May be it is something else I am wrong.

System informations

Information Description
League\Csv version 8.1.1
PHP/HHVM version PHP 7.0.8-0ubuntu0.16.04.1 (cli) ( NTS )
OS Platform ubuntu0.16.04.1

Standalone code, or other way to reproduce the problem

You can download the csv file from http://www.data-explorer.com/content/data/standard-model-of-particle-physics-csv.zip

<?php
require __DIR__ . '/vendor/autoload.php';

use League\Csv\Reader;

$file = "/path/to/physics/file.csv";

$reader = Reader::createFromPath($file);
$reader->stripBom(true);

echo "<pre>";

//get the first row, usually the CSV header
$headers = $reader->fetchOne(0);
print_r($headers);

$reader->addFilter(function ($row, $index) {
    return $index > 0;
});

$rows = $reader->fetch();

foreach ($rows as $row) {
    $content = array_combine($headers, $row);
    $json = json_encode($content);

    if ($json === FALSE) {
        echo var_dump($content);
        exit;
    }
}

Expected result

I am not sure what would be the appropriate character for μ .

Actual result

array(21) {
  ["ID"]=>
  string(3) "9"
  ["Name"]=>
  string(9) "Muon"
  ["Symbol"]=>
  string(3) "��"
 ....
}

There is actually more on the same csv which have similar issues.

Thank you.

@harikt harikt changed the title from May be a BOM issue with μ character ? to May be a BOM issue? Oct 1, 2016

@nyamsprod

This comment has been minimized.

Show comment
Hide comment
@nyamsprod

nyamsprod Oct 1, 2016

Member

@harikt there's nothing wrong with the library. By default the library thinks you are using a UTF-8 document. In your case the document is not UTF-8. Here's what I did:

require 'vendor/autoload.php';

use League\Csv\Reader;

$file = __DIR__."/file.csv";

$reader = Reader::createFromPath($file);
$input_bom = $reader->getInputBOM();
//in your case $input_bom === Reader::BOM_UTF16_LE
//so I've just added this following line (I assume that the iconv extension is installed on your server)
$reader->appendStreamFilter('convert.iconv.UTF-16/UTF-8'); //converting to UTF-8
$reader->stripBOM(true);
echo "<pre>";
//get the first row, usually the CSV header
$headers = $reader->fetchOne(0);
$rows = $reader->setOffset(1)->fetchAssoc($headers); //Simpler than using the Reader::addFilter method
foreach ($rows as $row) {
    echo json_encode($row, JSON_PRETTY_PRINT), PHP_EOL;
}
Member

nyamsprod commented Oct 1, 2016

@harikt there's nothing wrong with the library. By default the library thinks you are using a UTF-8 document. In your case the document is not UTF-8. Here's what I did:

require 'vendor/autoload.php';

use League\Csv\Reader;

$file = __DIR__."/file.csv";

$reader = Reader::createFromPath($file);
$input_bom = $reader->getInputBOM();
//in your case $input_bom === Reader::BOM_UTF16_LE
//so I've just added this following line (I assume that the iconv extension is installed on your server)
$reader->appendStreamFilter('convert.iconv.UTF-16/UTF-8'); //converting to UTF-8
$reader->stripBOM(true);
echo "<pre>";
//get the first row, usually the CSV header
$headers = $reader->fetchOne(0);
$rows = $reader->setOffset(1)->fetchAssoc($headers); //Simpler than using the Reader::addFilter method
foreach ($rows as $row) {
    echo json_encode($row, JSON_PRETTY_PRINT), PHP_EOL;
}
@harikt

This comment has been minimized.

Show comment
Hide comment
@harikt

harikt Oct 1, 2016

Contributor

@nyamsprod Thank you for your quick help.

A few things I noticed are

require 'vendor/autoload.php';

use League\Csv\Reader;

$file = __DIR__."/file.csv";

$reader = Reader::createFromPath($file);
$reader->appendStreamFilter('convert.iconv.UTF-16/UTF-8');
$reader->stripBOM(true);
$headers = $reader->fetchOne(0);

Adding $reader->appendStreamFilter('convert.iconv.UTF-16/UTF-8'); creates a warning as below.

PHP Notice:  Uninitialized string offset: 0 in /var/www/projects/vendor/league/csv/src/Modifier/QueryFilter.php on line 225

Notice: Uninitialized string offset: 0 in /var/www/projects/vendor/league/csv/src/Modifier/QueryFilter.php on line 22

Also closely looking the id keys got missing. See below.

{
    "": "17",

Thank you

Contributor

harikt commented Oct 1, 2016

@nyamsprod Thank you for your quick help.

A few things I noticed are

require 'vendor/autoload.php';

use League\Csv\Reader;

$file = __DIR__."/file.csv";

$reader = Reader::createFromPath($file);
$reader->appendStreamFilter('convert.iconv.UTF-16/UTF-8');
$reader->stripBOM(true);
$headers = $reader->fetchOne(0);

Adding $reader->appendStreamFilter('convert.iconv.UTF-16/UTF-8'); creates a warning as below.

PHP Notice:  Uninitialized string offset: 0 in /var/www/projects/vendor/league/csv/src/Modifier/QueryFilter.php on line 225

Notice: Uninitialized string offset: 0 in /var/www/projects/vendor/league/csv/src/Modifier/QueryFilter.php on line 22

Also closely looking the id keys got missing. See below.

{
    "": "17",

Thank you

@nyamsprod

This comment has been minimized.

Show comment
Hide comment
@nyamsprod

nyamsprod Oct 1, 2016

Member

@harikt strange I failed to reproduce this bug and I'm also using the 16.04 LTS

<?php

error_reporting(-1);
ini_set('display_errors', '1');

require 'vendor/autoload.php';

use League\Csv\Reader;

$file = __DIR__."/test.csv";

$reader = Reader::createFromPath($file);
$reader->appendStreamFilter('convert.iconv.UTF-16/UTF-8');
$reader->stripBOM(true);

echo "<pre>";
foreach ($reader->fetchAssoc(0) as $row) {
    echo json_encode($row, JSON_PRETTY_PRINT), PHP_EOL;
}
Member

nyamsprod commented Oct 1, 2016

@harikt strange I failed to reproduce this bug and I'm also using the 16.04 LTS

<?php

error_reporting(-1);
ini_set('display_errors', '1');

require 'vendor/autoload.php';

use League\Csv\Reader;

$file = __DIR__."/test.csv";

$reader = Reader::createFromPath($file);
$reader->appendStreamFilter('convert.iconv.UTF-16/UTF-8');
$reader->stripBOM(true);

echo "<pre>";
foreach ($reader->fetchAssoc(0) as $row) {
    echo json_encode($row, JSON_PRETTY_PRINT), PHP_EOL;
}
@harikt

This comment has been minimized.

Show comment
Hide comment
@harikt

harikt Oct 1, 2016

Contributor

I hope you could re-produce this now. This is when you call $reader->getInputBOM(); .

<?php

error_reporting(-1);
ini_set('display_errors', '1');

require 'vendor/autoload.php';

use League\Csv\Reader;

$file = __DIR__."/test.csv";

$reader = Reader::createFromPath($file);
$input_bom = $reader->getInputBOM();
if ( $input_bom === Reader::BOM_UTF16_LE || $input_bom === Reader::BOM_UTF16_BE ) {
    echo "BOM UTF-16 " . PHP_EOL;
    $reader->appendStreamFilter('convert.iconv.UTF-16/UTF-8');
}
$reader->stripBOM(true);
foreach ($reader->fetchAssoc(0) as $row) {
    // echo json_encode($row, JSON_PRETTY_PRINT), PHP_EOL;
}
Contributor

harikt commented Oct 1, 2016

I hope you could re-produce this now. This is when you call $reader->getInputBOM(); .

<?php

error_reporting(-1);
ini_set('display_errors', '1');

require 'vendor/autoload.php';

use League\Csv\Reader;

$file = __DIR__."/test.csv";

$reader = Reader::createFromPath($file);
$input_bom = $reader->getInputBOM();
if ( $input_bom === Reader::BOM_UTF16_LE || $input_bom === Reader::BOM_UTF16_BE ) {
    echo "BOM UTF-16 " . PHP_EOL;
    $reader->appendStreamFilter('convert.iconv.UTF-16/UTF-8');
}
$reader->stripBOM(true);
foreach ($reader->fetchAssoc(0) as $row) {
    // echo json_encode($row, JSON_PRETTY_PRINT), PHP_EOL;
}
@nyamsprod

This comment has been minimized.

Show comment
Hide comment
@nyamsprod

nyamsprod Oct 1, 2016

Member

@harikt thanks I found the issue ... I'll release a patch to fix this when I have time 👍

Member

nyamsprod commented Oct 1, 2016

@harikt thanks I found the issue ... I'll release a patch to fix this when I have time 👍

@harikt

This comment has been minimized.

Show comment
Hide comment
@harikt

harikt Oct 1, 2016

Contributor

Good to hear you was able to reproduce the same.

And thank you very much for the valuable support you have provided.

I wasn't able to find the reason to send a PR. Sorry about that :( .

Contributor

harikt commented Oct 1, 2016

Good to hear you was able to reproduce the same.

And thank you very much for the valuable support you have provided.

I wasn't able to find the reason to send a PR. Sorry about that :( .

@harikt

This comment has been minimized.

Show comment
Hide comment
@harikt

harikt Oct 1, 2016

Contributor

Also in the case as I mentioned when calling $reader->getInputBOM() the ID was null on json_encode data.

Contributor

harikt commented Oct 1, 2016

Also in the case as I mentioned when calling $reader->getInputBOM() the ID was null on json_encode data.

@nyamsprod

This comment has been minimized.

Show comment
Hide comment
@nyamsprod

nyamsprod Oct 3, 2016

Member

Also in the case as I mentioned when calling $reader->getInputBOM() the ID was null on json_encode data.

I've looked into this issue and in fact again there's no issue. The thing is if you use stream filter you don't need to explicitly call the stripBOM method.
The conversion is done prior to generate the array row from the CSV document so the BOM sequence is already removed. (I didn't know this, since I've never really used the stream filter on BOM document).

So the code should be:

<?php

error_reporting(-1);
ini_set('display_errors', '1');

require 'vendor/autoload.php';

use League\Csv\Reader;

$file = __DIR__."/file.csv";

$reader = Reader::createFromPath($file);
$input_bom = $reader->getInputBOM();
if ($input_bom === Reader::BOM_UTF16_LE || $input_bom === Reader::BOM_UTF16_BE) {
    $reader->appendStreamFilter('convert.iconv.UTF-16/UTF-8');
}
foreach ($reader->fetchAssoc(0) as $row) {
    echo json_encode($row, JSON_PRETTY_PRINT), PHP_EOL;
}

The stripBOM method should only be used if stream filter can not be used to convert your CSV document to UTF-8.

I should document this into the library documentation.. or you could do it 👍

Member

nyamsprod commented Oct 3, 2016

Also in the case as I mentioned when calling $reader->getInputBOM() the ID was null on json_encode data.

I've looked into this issue and in fact again there's no issue. The thing is if you use stream filter you don't need to explicitly call the stripBOM method.
The conversion is done prior to generate the array row from the CSV document so the BOM sequence is already removed. (I didn't know this, since I've never really used the stream filter on BOM document).

So the code should be:

<?php

error_reporting(-1);
ini_set('display_errors', '1');

require 'vendor/autoload.php';

use League\Csv\Reader;

$file = __DIR__."/file.csv";

$reader = Reader::createFromPath($file);
$input_bom = $reader->getInputBOM();
if ($input_bom === Reader::BOM_UTF16_LE || $input_bom === Reader::BOM_UTF16_BE) {
    $reader->appendStreamFilter('convert.iconv.UTF-16/UTF-8');
}
foreach ($reader->fetchAssoc(0) as $row) {
    echo json_encode($row, JSON_PRETTY_PRINT), PHP_EOL;
}

The stripBOM method should only be used if stream filter can not be used to convert your CSV document to UTF-8.

I should document this into the library documentation.. or you could do it 👍

nyamsprod added a commit that referenced this issue Oct 3, 2016

@harikt

This comment has been minimized.

Show comment
Hide comment
@harikt

harikt Oct 3, 2016

Contributor

Thank you @nyamsprod .

I will look into sending a PR regarding the same. Please give me some time.

Thank you.

Contributor

harikt commented Oct 3, 2016

Thank you @nyamsprod .

I will look into sending a PR regarding the same. Please give me some time.

Thank you.

@nyamsprod

This comment has been minimized.

Show comment
Hide comment
@nyamsprod

nyamsprod Oct 27, 2016

Member

version 8.1.2 is released with the fix

Member

nyamsprod commented Oct 27, 2016

version 8.1.2 is released with the fix

@nyamsprod nyamsprod closed this Oct 27, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment