parseLBS fails to parse (valid?) xml #63

Closed
legrostdg opened this Issue Sep 28, 2015 · 14 comments


legrostdg commented Sep 28, 2015

I'm trying to parse a (maybe not very well formatted) RSS feed with xml-conduit:

> res <- simpleHttp "http://www.semencespaysannes.org/rss.php"
> parseLBS def res
Left Data.Conduit.Text.decode: Error decoding stream of UTF-8 bytes. Error encountered in stream at offset 98. Encountered at byte sequence "\233sea"

legrostdg changed the title from "parseLBS fails to parse xml" to "parseLBS fails to parse (valid?) xml" on Sep 28, 2015

Owner

snoyberg commented Sep 28, 2015

Can you confirm that that URL contains proper UTF-8 data?

Author

legrostdg commented Sep 29, 2015

$ curl --silent http://www.semencespaysannes.org/rss.php | file -
/dev/stdin: XML 1.0 document, ISO-8859 text, with very long lines, with CRLF, LF line terminators

This seems to match the header:

<?xml version="1.0" encoding="iso-8859-1"?>

So, no, it is not utf8 data. But shouldn't other encodings be supported?

Owner

snoyberg commented Sep 29, 2015

This isn't something handled by the default decoding function. You could either:

  1. First decode the ByteString to Text, and then use the Text decoding functions.
  2. Enhance the decodeUtf function to handle iso-8859-1.
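
The first option could look roughly like the sketch below. It assumes the caller already knows the feed is ISO-8859-1; the name `parseLatin1LBS` is made up for illustration and is not part of xml-conduit. It relies on `decodeLatin1` from the text package and `parseText` from `Text.XML`.

```haskell
import           Control.Exception       (SomeException)
import qualified Data.ByteString.Lazy    as L
import qualified Data.Text.Lazy.Encoding as TLE
import           Text.XML                (Document, def, parseText)

-- | Hypothetical helper (not part of xml-conduit).
-- Decode the bytes as ISO-8859-1 first; every byte maps to a code point,
-- so this step cannot fail. The resulting lazy Text is then handed to the
-- Text-based parser instead of the ByteString-based parseLBS.
parseLatin1LBS :: L.ByteString -> Either SomeException Document
parseLatin1LBS = parseText def . TLE.decodeLatin1
```

In the GHCi session from the original report this would be used as `parseLatin1LBS res` in place of `parseLBS def res`. It does not help with feeds in other encodings, which still need to be told apart somehow.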
Author

legrostdg commented Sep 30, 2015

My issue is that I need to parse feeds encoded with "iso-8859-1", and others with "utf8". So I guess that I would need to find a way to parse the encoding part of the xml header...

Where is the decodeUtf you are talking about in your second solution? Is this Data.Conduit.Text.decodeUtf in conduit-extra? I guess it would be better to enable "iso-8859-1" support for everyone using xml-conduit.

Owner

snoyberg commented Oct 1, 2015

Author

legrostdg commented Oct 1, 2015

OK, thanks. I've looked at detectUtf and did a little research on encoding detection: as I understand it, the current method works well for detecting UTF-8, UTF-16 and UTF-32, but can't be adapted to detect latin1. I guess what needs to be added is a fallback detection method based on the "encoding" attribute of the XML declaration.

A working implementation seems to be http://haddock.stackage.org/lts-3.7/tagstream-conduit-0.5.5.3/Text-HTML-TagStream-Text.html#v:tokenStreamBS but I failed to adapt detectUtf so that it only parses the XML document up to the end of the declaration token. Maybe you could have a look or give me some hints on how to implement this? Or maybe the encoding detection part of tagstream-conduit could be released as a separate library and be used here?
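
To make the fallback idea concrete, here is a minimal sketch of reading the "encoding" attribute from the declaration. It is not the tagstream-conduit code and not what xml-conduit does; `declaredEncoding` is a hypothetical name, it works on a strict ByteString rather than a stream, and it assumes an ASCII-compatible encoding (UTF-16/32 would have to be recognised beforehand).

```haskell
import qualified Data.ByteString.Char8 as B8
import           Data.Char             (toLower)

-- | Hypothetical helper: return the lower-cased value of the "encoding"
-- pseudo-attribute of a leading XML declaration, if there is one.
declaredEncoding :: B8.ByteString -> Maybe String
declaredEncoding bytes
  | not (B8.pack "<?xml" `B8.isPrefixOf` bytes) = Nothing  -- no declaration at all
  | B8.null afterKey                            = Nothing  -- declaration without encoding
  | otherwise                                   = quoted value
  where
    -- The declaration ends at the first "?>"; never look past it.
    (decl, _)     = B8.breakSubstring (B8.pack "?>") bytes
    -- Everything from the word "encoding" onwards (empty if it is absent).
    (_, afterKey) = B8.breakSubstring (B8.pack "encoding") decl
    -- Skip the keyword itself (8 characters), then whitespace and '='.
    value         = B8.dropWhile (`elem` " \t\r\n=") (B8.drop 8 afterKey)
    -- The value must be wrapped in matching single or double quotes.
    quoted rest = case B8.uncons rest of
      Just (q, body)
        | q == '"' || q == '\'' ->
            Just (map toLower (B8.unpack (B8.takeWhile (/= q) body)))
      _ -> Nothing
```

A caller could then pick a decoder based on the result, for example Latin-1 for `Just "iso-8859-1"` and UTF-8 when the result is `Nothing`.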

Author

legrostdg commented Oct 1, 2015

Another way would be for detectUtf to default to latin1, as utf8 should have been detected before. Very hacky, though :-)...
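
As a rough illustration of that fallback (a sketch only, using the text package on a strict ByteString rather than the streaming detectUtf, with the made-up name `decodeUtf8OrLatin1`):

```haskell
import qualified Data.ByteString    as B
import qualified Data.Text          as T
import qualified Data.Text.Encoding as TE

-- | Hypothetical helper: try strict UTF-8 first; if the bytes are not valid
-- UTF-8, reinterpret them as Latin-1, where every byte is a valid code point.
decodeUtf8OrLatin1 :: B.ByteString -> T.Text
decodeUtf8OrLatin1 bytes =
  case TE.decodeUtf8' bytes of
    Right text -> text                    -- valid UTF-8, keep it
    Left _     -> TE.decodeLatin1 bytes   -- fall back to Latin-1
```

The hacky part is that Latin-1 bytes which happen to form valid UTF-8 sequences would be decoded as UTF-8, and that input in any other 8-bit encoding would silently be read as Latin-1.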

snoyberg added a commit that referenced this issue Oct 1, 2015

Owner

snoyberg commented Oct 1, 2015

I just pushed a commit to master that should address this, can you test it?

Author

legrostdg commented Oct 2, 2015

Sure! I read your code, it looks great! But it seems like I've reached my stack limits... I added my local clone of the xml-conduit repo to my project's stack.yaml, and "stack build" compiles until the installation of xml-conduit:

(~/src/myproject)$ cat stack.yaml
flags: {}
packages:
- '.'
- location: '../xml-conduit'
  subdirs:
  - 'xml-conduit'
extra-deps: []
resolver: lts-3.0
(~/src/myproject)$ stack build
xml-conduit-1.3.2: configure
xml-conduit-1.3.2: build
xml-conduit-1.3.2: install
Progress: 1/2
--  While building package xml-conduit-1.3.2 using:
      /home/myuser/.stack/setup-exe-cache/setup-Simple-Cabal-1.22.4.0-x86_64-linux-ghc-7.10.2 --builddir=.stack-work/dist/x86_64-linux/Cabal-1.22.4.0/ install
    Process exited with code: ExitFailure 1
    Logs have been written to: /home/myuser/src/myproject/.stack-work/logs/xml-conduit-1.3.2.log

    Configuring xml-conduit-1.3.2...
    Preprocessing library xml-conduit-1.3.2...
    [1 of 7] Compiling Text.XML.Cursor.Generic ( Text/XML/Cursor/Generic.hs, .stack-work/dist/x86_64-linux/Cabal-1.22.4.0/build/Text/XML/Cursor/Generic.o )
    [2 of 7] Compiling Text.XML.Stream.Token ( Text/XML/Stream/Token.hs, .stack-work/dist/x86_64-linux/Cabal-1.22.4.0/build/Text/XML/Stream/Token.o )

    Text/XML/Stream/Token.hs:19:1: Warning:
        The import of ‘Data.Monoid’ is redundant
          except perhaps to import instances from ‘Data.Monoid’
        To import instances alone, use: import Data.Monoid()
    [3 of 7] Compiling Text.XML.Stream.Render ( Text/XML/Stream/Render.hs, .stack-work/dist/x86_64-linux/Cabal-1.22.4.0/build/Text/XML/Stream/Render.o )

    Text/XML/Stream/Render.hs:29:1: Warning:
        The import of ‘Control.Monad.Trans.Resource’ is redundant
          except perhaps to import instances from ‘Control.Monad.Trans.Resource’
        To import instances alone, use: import Control.Monad.Trans.Resource()

    Text/XML/Stream/Render.hs:30:1: Warning:
        The import of ‘Data.ByteString’ is redundant
          except perhaps to import instances from ‘Data.ByteString’
        To import instances alone, use: import Data.ByteString()

    Text/XML/Stream/Render.hs:40:1: Warning:
        The import of ‘Data.Monoid’ is redundant
          except perhaps to import instances from ‘Data.Monoid’
        To import instances alone, use: import Data.Monoid()

    Text/XML/Stream/Render.hs:54:1: Warning:
        Top-level binding with no type signature:
          renderBytes :: forall (m :: * -> *) (base :: * -> *).
                         (transformers-base-0.4.4:Control.Monad.Base.MonadBase base m,
                          primitive-0.6:Control.Monad.Primitive.PrimMonad base) =>
                         RenderSettings -> ConduitM Event ByteString m ()

    Text/XML/Stream/Render.hs:63:1: Warning:
        Top-level binding with no type signature:
          renderText :: forall (m :: * -> *) (base :: * -> *).
                        (MonadThrow m,
                         transformers-base-0.4.4:Control.Monad.Base.MonadBase base m,
                         primitive-0.6:Control.Monad.Primitive.PrimMonad base) =>
                        RenderSettings -> ConduitM Event Text m ()

    Text/XML/Stream/Render.hs:100:18: Warning:
        This binding for ‘attr’ shadows the existing binding
          defined at Text/XML/Stream/Render.hs:364:1

    Text/XML/Stream/Render.hs:344:25: Warning:
        This binding for ‘content’ shadows the existing binding
          defined at Text/XML/Stream/Render.hs:351:1
    [4 of 7] Compiling Text.XML.Stream.Parse ( Text/XML/Stream/Parse.hs, .stack-work/dist/x86_64-linux/Cabal-1.22.4.0/build/Text/XML/Stream/Parse.o )

    Text/XML/Stream/Parse.hs:155:1: Warning:
        The import of ‘Applicative, <$>’
        from module ‘Control.Applicative’ is redundant

    Text/XML/Stream/Parse.hs:564:5: Warning:
        This binding for ‘parseText'’ shadows the existing binding
          defined at Text/XML/Stream/Parse.hs:360:1

    Text/XML/Stream/Parse.hs:697:20: Warning:
        This binding for ‘attr’ shadows the existing binding
          defined at Text/XML/Stream/Parse.hs:904:1
    [5 of 7] Compiling Text.XML.Unresolved ( Text/XML/Unresolved.hs, .stack-work/dist/x86_64-linux/Cabal-1.22.4.0/build/Text/XML/Unresolved.o )

    Text/XML/Unresolved.hs:54:1: Warning:
        The import of ‘Control.Applicative’ is redundant
          except perhaps to import instances from ‘Control.Applicative’
        To import instances alone, use: import Control.Applicative()

    Text/XML/Unresolved.hs:128:1: Warning:
        Top-level binding with no type signature:
          renderBytes :: forall a (m :: * -> *) (base :: * -> *).
                         (transformers-base-0.4.4:Control.Monad.Base.MonadBase base m,
                          primitive-0.6:Control.Monad.Primitive.PrimMonad base) =>
                         R.RenderSettings -> Document -> ConduitM a ByteString m ()

    Text/XML/Unresolved.hs:131:1: Warning:
        Top-level binding with no type signature:
          renderText :: forall a (m :: * -> *) (base :: * -> *).
                        (MonadThrow m,
                         transformers-base-0.4.4:Control.Monad.Base.MonadBase base m,
                         primitive-0.6:Control.Monad.Primitive.PrimMonad base) =>
                        R.RenderSettings -> Document -> ConduitM a Text m ()
    [6 of 7] Compiling Text.XML         ( Text/XML.hs, .stack-work/dist/x86_64-linux/Cabal-1.22.4.0/build/Text/XML.o )

    Text/XML.hs:111:1: Warning:
        The import of ‘Control.Monad.Trans.Resource’ is redundant
          except perhaps to import instances from ‘Control.Monad.Trans.Resource’
        To import instances alone, use: import Control.Monad.Trans.Resource()

    Text/XML.hs:119:1: Warning:
        The import of ‘Data.Monoid’ is redundant
          except perhaps to import instances from ‘Data.Monoid’
        To import instances alone, use: import Data.Monoid()

    Text/XML.hs:284:1: Warning:
        Top-level binding with no type signature:
          renderBytes :: forall a (m :: * -> *) (base :: * -> *).
                         (transformers-base-0.4.4:Control.Monad.Base.MonadBase base m,
                          primitive-0.6:Control.Monad.Primitive.PrimMonad base) =>
                         R.RenderSettings -> Document -> ConduitM a ByteString m ()
    [7 of 7] Compiling Text.XML.Cursor  ( Text/XML/Cursor.hs, .stack-work/dist/x86_64-linux/Cabal-1.22.4.0/build/Text/XML/Cursor.o )
    In-place registering xml-conduit-1.3.2...
    Installing library in
    /home/myuser/src/xml-conduit/.stack-work/install/x86_64-linux/lts-3.7/7.10.2/lib/x86_64-linux-ghc-7.10.2/xml-conduit-1.3.2-1nwHfPMDb4LJd58sdbNPmA
    setup-Simple-Cabal-1.22.4.0-x86_64-linux-ghc-7.10.2:
    '/home/myuser/.stack/programs/x86_64-linux/ghc-7.10.2/bin/ghc' exited with an
    error:
    Bad interface file:
    .stack-work/dist/x86_64-linux/Cabal-1.22.4.0/build/Text/XML/Stream/Parse.hi
    Something is amiss; requested module
    xml-conduit-1.3.2@xmlco_1nwHfPMDb4LJd58sdbNPmA:Text.XML.Stream.Parse differs
    from name found in the interface file
    xmlco_LhNN2CDAfHp6PBvxqT7DQA:Text.XML.Stream.Parse
Owner

snoyberg commented Oct 2, 2015

Try a stack clean first; it looks like GHC created a problematic .hi file at some point during the build. I don't know why that happens.

Author

legrostdg commented Oct 2, 2015

Wow, thanks for this super fast response! I confirm that latin1 is now supported. Thanks again!

@legrostdg legrostdg closed this Oct 2, 2015

Author

legrostdg commented Oct 8, 2015

Could you do a release with this fix?

Owner

snoyberg commented Oct 9, 2015

1.3.2 is already released

Author

legrostdg commented Oct 9, 2015

OK, thanks! I didn't find it on https://github.com/snoyberg/xml/releases
