Using xpath or css selectors with htmerl #2

Closed
lpil opened this issue Feb 10, 2023 · 6 comments

lpil commented Feb 10, 2023

Hello!

Thanks for another fab parser!

Is there a way to use XPath or CSS selectors with htmerl? If not, do you have a recommended way to get certain elements from the list of SAX events?

Thanks,
Louis

Owner

zadean commented Feb 10, 2023

Hey! Thanks! 🤗

Owner

zadean commented Feb 10, 2023

This is an idea I have played with for a while, but I never really found a simple way to do it well. 😄

There is xqerl, but that comes with its own compiler and more. It lets you use XPath on HTML, XML, or JSON documents via XQuery. That's actually cool, since you aren't limited to just XPath and have access to higher-order functions, maps, arrays, and all the XSD data types. It may be overkill for path-based extraction, though. I've never gone down the CSS selector path.

I've built some DSLs for just this (not open source), but eventually just did it all in Rust later anyway. 😉

I imagine there's an elegant way of turning path-like strings into simple functions based on their level and placement in the document, but haven't taken the time to actually build it!

Might be a fun project for someone out there with more time on their hands than I have right now. 😺

Owner

zadean commented Feb 11, 2023

Maybe a minimal example of a simple path extraction could be:

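-module(htmerl_example).

-export([run/0]).

run() ->
    Html =
        <<"<html><body><p>Check</p>nothing here<p>this <b>bold garbage</b></p>g"
          "arbage<p>out!</p></body></html>">>,
    XPath = <<"html/body/p">>,
    %% Reverse the target path so it can be compared directly with the
    %% stack of open elements, which has the innermost element first.
    Path =
        lists:reverse(
            binary:split(XPath, <<"/">>, [global])),
    %% State is {CurrentStack, TargetPath, CollectedText}.
    Opts = [{event_fun, fun xpath/3}, {user_state, {[], Path, []}}],
    {ok, TextList, []} = htmerl:sax(Html, Opts),
    TextList.

%% Collect text only while the open-element stack equals the target path.
xpath({characters, Text}, _LineNum, {Path, Path, Acc}) ->
    {Path, Path, [Text | Acc]};
%% Closing an element pops it from the stack.
xpath({endElement, _Ns, Ln, _}, _LineNum, {[Ln | Path], XPath, Acc}) ->
    {Path, XPath, Acc};
%% Opening an element pushes its local name onto the stack.
xpath({startElement, _Ns, Ln, _, _Atts}, _LineNum, {Path, XPath, Acc}) ->
    {[Ln | Path], XPath, Acc};
%% At the end of the document, return the text in document order.
xpath(endDocument, _LineNum, {_Path, _XPath, Acc}) ->
    lists:reverse(Acc);
xpath(_Event, _LineNum, State) ->
    State.

1> htmerl_example:run().
[<<"Check">>,<<"this">>,<<"out!">>]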
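The example above matches one exact path. As a rough sketch of the "path string into a function" idea mentioned earlier in the thread, the path could be compiled into a fun that tests the open-element stack, which leaves room for something like a `*` wildcard step. The module name `path_match`, the functions `compile/1` and `demo/0`, and the `*` syntax are all hypothetical and not part of htmerl; this uses only the Erlang standard library.

```erlang
-module(path_match).

-export([compile/1, demo/0]).

%% Turn a path such as <<"html/*/p">> into a fun that tests a stack of
%% open element names (innermost element first, as in the example above).
%% Hypothetical helper: not part of htmerl.
compile(Path) ->
    Steps = lists:reverse(binary:split(Path, <<"/">>, [global])),
    fun(Stack) -> match(Steps, Stack) end.

%% Both lists exhausted together: every step matched an open element.
match([], []) ->
    true;
%% <<"*">> accepts any single element name at this depth.
match([<<"*">> | Steps], [_ | Stack]) ->
    match(Steps, Stack);
%% A literal step must equal the element name at the same depth.
match([Name | Steps], [Name | Stack]) ->
    match(Steps, Stack);
%% Different depth or a name mismatch.
match(_, _) ->
    false.

%% Small demonstration with hand-written stacks.
demo() ->
    Matches = compile(<<"html/*/p">>),
    {Matches([<<"p">>, <<"body">>, <<"html">>]),
     Matches([<<"b">>, <<"p">>, <<"body">>, <<"html">>])}.
```

To use this with the SAX example, the `characters` clause would presumably call the compiled fun on the current stack in a `case` expression instead of pattern-matching `{Path, Path, Acc}`, since a fun cannot be called in a guard.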

Author

lpil commented Feb 11, 2023

Hey!

xqerl looks fab, but I was looking for something a bit more lightweight and simpler, I think. Love that example you've got there, super handy. I might see if I can support a subset of CSS selectors or some such, enough to be useful for most basic tasks :)

Maybe that example could be stuck in an example folder or the README or such for future users?

Owner

zadean commented Feb 12, 2023

I added the example in the README with #3 😸

Author

lpil commented Feb 12, 2023

Thank you 💜

@lpil lpil closed this as completed Feb 12, 2023