Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parse for CSS URLS in HTML style elements #143

Closed
rockdaboot opened this issue Mar 9, 2017 · 23 comments
Closed

Parse for CSS URLS in HTML style elements #143

rockdaboot opened this issue Mar 9, 2017 · 23 comments

Comments

@rockdaboot
Copy link
Owner

Needs a unit test + work in html_url.c.

Example code:

<div class="example" style="background: url(/storage/example/bg.png) top center no-repeat #b4392d;"></div>
@rootkea
Copy link
Contributor

rootkea commented Mar 20, 2017

I think there can be two approaches to achieve this:

  1. use wget_css_parse_file()
    Now, this can be achieved by either saving the complete string of style attribute e.g. for above example it would be "background: url(/storage/example/bg.png) top center no-repeat #b4392d;" in a temporary file and then passing it to wget_css_parse_file() or

by piping it to another process invoking wget_css_parse_file(). So no temporary file but involves process synchronization.

  1. use wget_css_parse_buffer()
    Or simply passing the complete string of style attribute to wget_css_parse_buffer()

I am inclined towards way 2 for obvious reasons! :)

BTW this brings me to my next question "Is libwget documented somewhere with all of its functions?" This would immensely help rather than going through libwget by hand.

@rootkea
Copy link
Contributor

rootkea commented Mar 20, 2017

To locate style attribute we can write a small parsing routine or better can use the xmlParser from libwget but again I'm kinda lost in libwget. Particularly what parseXml() does exactly? This is important to understand to grab the parsing logic enough to locate style attribute

@juaristi
Copy link
Collaborator

juaristi commented Mar 20, 2017

Maybe @rockdaboot can raise a bit more light here but as I see it, parseXML() should handle all the attributes quite decently, including style. It's a SAX-like parser: you give him a callback function, and it calls it whenever it finds something interesting.

Once you have that, as you already guessed, you'd need to feed it to the CSS parser. If there's something to be modified, that should be in the CSS code.

Regarding the docs, we keep them inline, and then generate HTML output with Doxygen. But as you probably already saw, not all functions are documented. make should generate the HTMLs and place them in docs/html.

@rockdaboot
Copy link
Owner Author

rockdaboot commented Mar 20, 2017

Regarding parseXML(): It is a stateless xml / html scanner not using memory allocations. Instead it uses a callback function. (Sorry, currently the function is not documented.)

But look at libwget/html_url.c as an example. Here the callback function is _html_get_url(). For a better understanding, uncomment L104 (info_print), rebuild all and execute examples/print_html_urls.

To parse the style attributes, you possibly can extend this function.

@rockdaboot
Copy link
Owner Author

As @juaristi suggests, once you have the value for the attribute 'style', you could use wget_css_parse_buffer() to avoid a temp file. That would be similar to src/wget.c/css_parse().

Maybe you first create a small .html file including style (with a URL) and than run this with examples/print_html_urls. The URL inside your style attribute won't be printed... then slowly start with your code or your debugging print outs, step by step.

@rockdaboot
Copy link
Owner Author

"Is libwget documented somewhere with all of its functions?"

Not yet all (any additions to the docs are highly welcome).
But after a make, there should be doc/html/index.html which you can view in a browser. Under files/file list you can see which source files have been documented. xml.c hasn't yet.

@rootkea
Copy link
Contributor

rootkea commented Mar 21, 2017

make should generate the HTMLs and place them in docs/html

@juaristi Oh! I completely missed that. Thanks.

uncomment L104 (info_print)

@rockdaboot I did that. But then it says:

html_url.c:104:53: error: 'dir' undeclared (first use in this function)
   info_printf("%02X %s %s '%.*s' %zd %zd\n", flags, dir, attr, (int) len, val, len, pos);

I can see that dir is a character pointer. So will char * dir; work? Or is it suppose to be a character array? (in which case declaring it as a character pointer will lead to SegFault)

And also since we are passing it to a printing function; it seems that dir needs to be initialized. What should be the initializing value of dir?

BTW I just declared it as a char * and build went successful. But it's an uninitialized pointer. I hope I didn't leave door open for illegal memory access.

Maybe you first create a small .html file including style (with a URL) and than run this with examples/print_html_urls. The URL inside your style attribute won't be printed... then slowly start with your code or your debugging print outs, step by step.

Yes. That's how I'm proceeding.

@rootkea
Copy link
Contributor

rootkea commented Mar 21, 2017

And when I did that info_printf() prints nothing. Nothing at all. As if it's commented.

Just to make sure that make built the updated file I added dummy printfs before and after info_printf(). They get printed and there's nothing printed by info_printf(). I guess char * dir was a bad idea.

@rockdaboot
Copy link
Owner Author

Sorry, that comment was a bit old... I just pushed a corrected version (it must be 'tag' instead of 'dir').

And when I did that info_printf() prints nothing. Nothing at all. As if it's commented.

Yup. Because the app doesn't set a 'logger' for 'info' messages (or any other kind of messages).

You see in comments (in print_html_urls.c):

	wget_global_init(
		WGET_DEBUG_STREAM, stderr,
		WGET_ERROR_STREAM, stderr,
		WGET_INFO_STREAM, stdout,
		NULL);

Remove the comments to get full output. Just comment out the line with WGET_DEBUG_STREAM and you won't see any debug output (=wget_debug_printf()) any more. Instead of STREAM you can also redirect different kinds of output to a function (_FUNC) or a file on disk (_FILE).

@rootkea
Copy link
Contributor

rootkea commented Mar 21, 2017

I am trying hard to follow parseXML() but a bit stuck.

More specifically:

  1. What are XML hints? (hints from XML_CONTEXT structure)
  2. What does getToken() return? In other words, what is considered as token by getToken()?

@rockdaboot
Copy link
Owner Author

The parser parses both HTML and XML, both are somewhat different. If hints is set to HTML, the parser parses HTML.

But really... don't bother with that code (you need a decent understanding of XML and HTML definition). Just use the code as stated above. If there is something unexpected or a bug, report it to me and I'll fix that. Well, of course you can go through it. But I have the feeling that wastes your time.

@rockdaboot
Copy link
Owner Author

BTW, what IDE are you using ? You should not simply use an editor but something more mighty like Eclipse or Netbeans. For example in Netbeans, you simply right-click on 'hints', select 'show usage' and immediately can see that the only value that ever becomes checked for is XML_HINT_HTML (you get a list of all usages within the whole project).

@rootkea
Copy link
Contributor

rootkea commented Mar 21, 2017

@rockdaboot I'm using VIM with ctags. I'm not an advanced VIM user though.

About Netbeans, that sounds awesome. Without going through the whole project and jumping between files. I should use it.

BTW what IDE will you recommend? Is it Netbeans?

@rockdaboot
Copy link
Owner Author

I use Netbeans. But I believe that Eclipse does the same or even more for C/C++. I have lot's of Java projects, so I stick with Netbeans.

@rootkea
Copy link
Contributor

rootkea commented Mar 21, 2017

@rockdaboot I think I got it. I added style attribute to attrs array. And then I'm doing

if (!wget_strcasecmp_ascii(attr, "style")) {
//skip white spaces
//Parse till ' (I need to check HTML5 spec about ' and '' or no quotes)
//Now store the further parsed string in a buffer till matching quote is found (In case of no quotes check for space character)
//pass this buffer to wget_css_parse_buffer()
}
  1. Am I in correct direction?
  2. if yes then, what should be the size of buffer as we don't know the number of characters before hand. Even if we want to allocate memory dynamically still we need to know how many bytes we should allocate.

@rootkea
Copy link
Contributor

rootkea commented Mar 21, 2017

Or should I first just walk over to find the end of string and then use that count as no. of bytes to be allocated?
I'm going with this for now but it involves parsing the string twice so not efficient.

@rockdaboot
Copy link
Owner Author

rockdaboot commented Mar 21, 2017

len is the length of the value (that is what you are processing), e.g. with style="css-xxx", value points to the beginning 'c' (AFAIR) and count would be 7. Just from my memory (you have to test it): Since the parser works inline, you have to malloc 7 + 1, copy the data, set last byte to 0. Not efficient, but first goal is to get it working. If it turns out to be a CPU hog, we optimize it later.

I am not sure, but I think the quote has already been parsed by then, also leading white space should not bother the CSS parser (if it does, we have to fix it there).

So, you are absolutely on the right way :-)

@rootkea
Copy link
Contributor

rootkea commented Mar 21, 2017

Ok. Now I have a CSS string in buffer and now I need callback_uri and callback_encoding function pointers to call wget_css_parse_buffer().

Now I can copy those two functions from print_css_urls.c to the current working file i.e. libwget/html_url.c but it's absurd.

Since libget just provides an API we should ask user to provide these two callback functions. It seems that we need to extend wget_html_get_urls_inline to accept these two callback functions as arguments. What do you think?

@rootkea
Copy link
Contributor

rootkea commented Mar 21, 2017

On a side note, is there a css parsing function like wget_html_get_urls_inline() which doesn't require callback functions?

@rockdaboot
Copy link
Owner Author

For now, forget callback_encoding, just use the encoding of the HTML document.

Also, feel free to copy callback_uri into libwget/html_url.c... remember, first get it going, optimizing later (except optimizing is straight forward).

I am just in a hurry right now... so can't take a deeper look.
Also, for me to review early, create an own branch, push it to your wget2 repo and create a pull request with name "WIP: ...", so we know it is work-in-progress. You can always add commits or remove / change them (git push -f). And Github allows me to add inline comments.

@rootkea
Copy link
Contributor

rootkea commented Mar 21, 2017

Also why are we using callback functions in wget_css_parse_file()? Why can't we store the results somewhere like in css-contextmay be?(similar to what wget_html_get_urls_inline does)

@rockdaboot
Copy link
Owner Author

Maybe we can amend the CSS parsing so it works like the HTML parsing. Create an issue for discussion !

@rootkea
Copy link
Contributor

rootkea commented Apr 21, 2017

Shouldn't this be closed too? Though I'm yet to really look into the approach @MichaelHeerklotz took and compare it with what I was thinking, it works as expected. Or am I missing something?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants