Parse for CSS URLS in HTML style elements #143

rockdaboot · 2017-03-09T09:30:56Z

Needs a unit test + work in html_url.c.

Example code:

<div class="example" style="background: url(/storage/example/bg.png) top center no-repeat #b4392d;"></div>

The text was updated successfully, but these errors were encountered:

rootkea · 2017-03-20T12:16:04Z

I think there can be two approaches to achieve this:

use wget_css_parse_file()
Now, this can be achieved by either saving the complete string of style attribute e.g. for above example it would be "background: url(/storage/example/bg.png) top center no-repeat #b4392d;" in a temporary file and then passing it to wget_css_parse_file() or

by piping it to another process invoking wget_css_parse_file(). So no temporary file but involves process synchronization.

use wget_css_parse_buffer()
Or simply passing the complete string of style attribute to wget_css_parse_buffer()

I am inclined towards way 2 for obvious reasons! :)

BTW this brings me to my next question "Is libwget documented somewhere with all of its functions?" This would immensely help rather than going through libwget by hand.

rootkea · 2017-03-20T12:41:23Z

To locate style attribute we can write a small parsing routine or better can use the xmlParser from libwget but again I'm kinda lost in libwget. Particularly what parseXml() does exactly? This is important to understand to grab the parsing logic enough to locate style attribute

juaristi · 2017-03-20T13:53:02Z

Maybe @rockdaboot can raise a bit more light here but as I see it, parseXML() should handle all the attributes quite decently, including style. It's a SAX-like parser: you give him a callback function, and it calls it whenever it finds something interesting.

Once you have that, as you already guessed, you'd need to feed it to the CSS parser. If there's something to be modified, that should be in the CSS code.

Regarding the docs, we keep them inline, and then generate HTML output with Doxygen. But as you probably already saw, not all functions are documented. make should generate the HTMLs and place them in docs/html.

rockdaboot · 2017-03-20T13:57:23Z

Regarding parseXML(): It is a stateless xml / html scanner not using memory allocations. Instead it uses a callback function. (Sorry, currently the function is not documented.)

But look at libwget/html_url.c as an example. Here the callback function is _html_get_url(). For a better understanding, uncomment L104 (info_print), rebuild all and execute examples/print_html_urls.

To parse the style attributes, you possibly can extend this function.

rockdaboot · 2017-03-20T14:05:15Z

As @juaristi suggests, once you have the value for the attribute 'style', you could use wget_css_parse_buffer() to avoid a temp file. That would be similar to src/wget.c/css_parse().

Maybe you first create a small .html file including style (with a URL) and than run this with examples/print_html_urls. The URL inside your style attribute won't be printed... then slowly start with your code or your debugging print outs, step by step.

rockdaboot · 2017-03-20T14:09:28Z

"Is libwget documented somewhere with all of its functions?"

Not yet all (any additions to the docs are highly welcome).
But after a make, there should be doc/html/index.html which you can view in a browser. Under files/file list you can see which source files have been documented. xml.c hasn't yet.

rootkea · 2017-03-21T06:33:26Z

make should generate the HTMLs and place them in docs/html

@juaristi Oh! I completely missed that. Thanks.

uncomment L104 (info_print)

@rockdaboot I did that. But then it says:

html_url.c:104:53: error: 'dir' undeclared (first use in this function)
   info_printf("%02X %s %s '%.*s' %zd %zd\n", flags, dir, attr, (int) len, val, len, pos);

I can see that dir is a character pointer. So will char * dir; work? Or is it suppose to be a character array? (in which case declaring it as a character pointer will lead to SegFault)

And also since we are passing it to a printing function; it seems that dir needs to be initialized. What should be the initializing value of dir?

BTW I just declared it as a char * and build went successful. But it's an uninitialized pointer. I hope I didn't leave door open for illegal memory access.

Maybe you first create a small .html file including style (with a URL) and than run this with examples/print_html_urls. The URL inside your style attribute won't be printed... then slowly start with your code or your debugging print outs, step by step.

Yes. That's how I'm proceeding.

rootkea · 2017-03-21T06:49:32Z

And when I did that info_printf() prints nothing. Nothing at all. As if it's commented.

Just to make sure that make built the updated file I added dummy printfs before and after info_printf(). They get printed and there's nothing printed by info_printf(). I guess char * dir was a bad idea.

rockdaboot · 2017-03-21T09:40:59Z

Sorry, that comment was a bit old... I just pushed a corrected version (it must be 'tag' instead of 'dir').

And when I did that info_printf() prints nothing. Nothing at all. As if it's commented.

Yup. Because the app doesn't set a 'logger' for 'info' messages (or any other kind of messages).

You see in comments (in print_html_urls.c):

	wget_global_init(
		WGET_DEBUG_STREAM, stderr,
		WGET_ERROR_STREAM, stderr,
		WGET_INFO_STREAM, stdout,
		NULL);

Remove the comments to get full output. Just comment out the line with WGET_DEBUG_STREAM and you won't see any debug output (=wget_debug_printf()) any more. Instead of STREAM you can also redirect different kinds of output to a function (_FUNC) or a file on disk (_FILE).

rootkea · 2017-03-21T11:48:15Z

I am trying hard to follow parseXML() but a bit stuck.

More specifically:

What are XML hints? (hints from XML_CONTEXT structure)
What does getToken() return? In other words, what is considered as token by getToken()?

rockdaboot · 2017-03-21T12:00:58Z

The parser parses both HTML and XML, both are somewhat different. If hints is set to HTML, the parser parses HTML.

But really... don't bother with that code (you need a decent understanding of XML and HTML definition). Just use the code as stated above. If there is something unexpected or a bug, report it to me and I'll fix that. Well, of course you can go through it. But I have the feeling that wastes your time.

rockdaboot · 2017-03-21T12:06:14Z

BTW, what IDE are you using ? You should not simply use an editor but something more mighty like Eclipse or Netbeans. For example in Netbeans, you simply right-click on 'hints', select 'show usage' and immediately can see that the only value that ever becomes checked for is XML_HINT_HTML (you get a list of all usages within the whole project).

rootkea · 2017-03-21T12:39:18Z

@rockdaboot I'm using VIM with ctags. I'm not an advanced VIM user though.

About Netbeans, that sounds awesome. Without going through the whole project and jumping between files. I should use it.

BTW what IDE will you recommend? Is it Netbeans?

rockdaboot · 2017-03-21T12:53:52Z

I use Netbeans. But I believe that Eclipse does the same or even more for C/C++. I have lot's of Java projects, so I stick with Netbeans.

rootkea · 2017-03-21T14:58:08Z

@rockdaboot I think I got it. I added style attribute to attrs array. And then I'm doing

if (!wget_strcasecmp_ascii(attr, "style")) {
//skip white spaces
//Parse till ' (I need to check HTML5 spec about ' and '' or no quotes)
//Now store the further parsed string in a buffer till matching quote is found (In case of no quotes check for space character)
//pass this buffer to wget_css_parse_buffer()
}

Am I in correct direction?
if yes then, what should be the size of buffer as we don't know the number of characters before hand. Even if we want to allocate memory dynamically still we need to know how many bytes we should allocate.

rootkea · 2017-03-21T15:02:15Z

Or should I first just walk over to find the end of string and then use that count as no. of bytes to be allocated?
I'm going with this for now but it involves parsing the string twice so not efficient.

rockdaboot · 2017-03-21T15:29:37Z

len is the length of the value (that is what you are processing), e.g. with style="css-xxx", value points to the beginning 'c' (AFAIR) and count would be 7. Just from my memory (you have to test it): Since the parser works inline, you have to malloc 7 + 1, copy the data, set last byte to 0. Not efficient, but first goal is to get it working. If it turns out to be a CPU hog, we optimize it later.

I am not sure, but I think the quote has already been parsed by then, also leading white space should not bother the CSS parser (if it does, we have to fix it there).

So, you are absolutely on the right way :-)

rootkea · 2017-03-21T16:04:10Z

Ok. Now I have a CSS string in buffer and now I need callback_uri and callback_encoding function pointers to call wget_css_parse_buffer().

Now I can copy those two functions from print_css_urls.c to the current working file i.e. libwget/html_url.c but it's absurd.

Since libget just provides an API we should ask user to provide these two callback functions. It seems that we need to extend wget_html_get_urls_inline to accept these two callback functions as arguments. What do you think?

rootkea · 2017-03-21T16:12:21Z

On a side note, is there a css parsing function like wget_html_get_urls_inline() which doesn't require callback functions?

rockdaboot · 2017-03-21T16:13:20Z

For now, forget callback_encoding, just use the encoding of the HTML document.

Also, feel free to copy callback_uri into libwget/html_url.c... remember, first get it going, optimizing later (except optimizing is straight forward).

I am just in a hurry right now... so can't take a deeper look.
Also, for me to review early, create an own branch, push it to your wget2 repo and create a pull request with name "WIP: ...", so we know it is work-in-progress. You can always add commits or remove / change them (git push -f). And Github allows me to add inline comments.

rootkea · 2017-03-21T16:16:53Z

Also why are we using callback functions in wget_css_parse_file()? Why can't we store the results somewhere like in css-contextmay be?(similar to what wget_html_get_urls_inline does)

rockdaboot · 2017-03-21T16:33:12Z

Maybe we can amend the CSS parsing so it works like the HTML parsing. Create an issue for discussion !

rootkea · 2017-04-21T13:24:32Z

Shouldn't this be closed too? Though I'm yet to really look into the approach @MichaelHeerklotz took and compare it with what I was thinking, it works as expected. Or am I missing something?

rockdaboot added the enhancement label Mar 9, 2017

rootkea mentioned this issue Mar 21, 2017

WIP : Parse for CSS URLS in HTML style elements #145

Closed

rockdaboot closed this as completed Apr 21, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Parse for CSS URLS in HTML style elements #143

Parse for CSS URLS in HTML style elements #143

rockdaboot commented Mar 9, 2017

rootkea commented Mar 20, 2017 •

edited

rootkea commented Mar 20, 2017

juaristi commented Mar 20, 2017 •

edited

rockdaboot commented Mar 20, 2017 •

edited

rockdaboot commented Mar 20, 2017

rockdaboot commented Mar 20, 2017

rootkea commented Mar 21, 2017

rootkea commented Mar 21, 2017 •

edited

rockdaboot commented Mar 21, 2017

rootkea commented Mar 21, 2017 •

edited

rockdaboot commented Mar 21, 2017

rockdaboot commented Mar 21, 2017

rootkea commented Mar 21, 2017

rockdaboot commented Mar 21, 2017

rootkea commented Mar 21, 2017

rootkea commented Mar 21, 2017 •

edited

rockdaboot commented Mar 21, 2017 •

edited

rootkea commented Mar 21, 2017 •

edited

rootkea commented Mar 21, 2017

rockdaboot commented Mar 21, 2017

rootkea commented Mar 21, 2017 •

edited

rockdaboot commented Mar 21, 2017

rootkea commented Apr 21, 2017 •

edited

Parse for CSS URLS in HTML style elements #143

Parse for CSS URLS in HTML style elements #143

Comments

rockdaboot commented Mar 9, 2017

rootkea commented Mar 20, 2017 • edited

rootkea commented Mar 20, 2017

juaristi commented Mar 20, 2017 • edited

rockdaboot commented Mar 20, 2017 • edited

rockdaboot commented Mar 20, 2017

rockdaboot commented Mar 20, 2017

rootkea commented Mar 21, 2017

rootkea commented Mar 21, 2017 • edited

rockdaboot commented Mar 21, 2017

rootkea commented Mar 21, 2017 • edited

rockdaboot commented Mar 21, 2017

rockdaboot commented Mar 21, 2017

rootkea commented Mar 21, 2017

rockdaboot commented Mar 21, 2017

rootkea commented Mar 21, 2017

rootkea commented Mar 21, 2017 • edited

rockdaboot commented Mar 21, 2017 • edited

rootkea commented Mar 21, 2017 • edited

rootkea commented Mar 21, 2017

rockdaboot commented Mar 21, 2017

rootkea commented Mar 21, 2017 • edited

rockdaboot commented Mar 21, 2017

rootkea commented Apr 21, 2017 • edited

rootkea commented Mar 20, 2017 •

edited

juaristi commented Mar 20, 2017 •

edited

rockdaboot commented Mar 20, 2017 •

edited

rootkea commented Mar 21, 2017 •

edited

rootkea commented Mar 21, 2017 •

edited

rootkea commented Mar 21, 2017 •

edited

rockdaboot commented Mar 21, 2017 •

edited

rootkea commented Mar 21, 2017 •

edited

rootkea commented Mar 21, 2017 •

edited

rootkea commented Apr 21, 2017 •

edited