HTMLParser incorrectly handles cdata elements. #57567
The HTML tag at the bottom of this page is correctly identified as having CDATA-like properties and triggers set_cdata_mode(). Because of those properties, the only way to end the data segment is with a closing </script> tag; no other tag can close it. Currently, in CDATA mode, HTMLParser uses the regular expression re.compile(r'<(/|\Z)') to find the end of the script element. This script tag, however, sets a variable whose data contains "</b>", which terminates the script element prematurely. I have written and tested the following patch on my system:

class html_patch(HTMLParser.HTMLParser):
    # Internal -- sets the proper tag terminator based on cdata element type
    def set_cdata_mode(self, tag):
        # We check if the tag is either a style or a script
        # based on self.CDATA_CONTENT_ELEMENTS
        if tag == "style":
            self.interesting = endtagfind_style
        elif tag == "script":
            self.interesting = endtagfind_script
        else:
            self.error("Unknown cdata type:" + tag)  # should never happen
        self.cdata_tag = tag

This cdata tag isn't parsed properly by HTMLParser, but it works fine in a browser:

var wideAds = '<table width="100%" cellspacing="0" cellpadding="8"' +
'border="0" class="lhcl_search_ad_tbl"><tbody><tr>';
for(var i = 0; i < google_num_ads; i++){
if (google_ads[i].type=="text/wide"){
wideAds += '<td>';
wideAds += '<table width="100%" cellspacing="0" ' +
'cellpadding="0" border="0"><tbody><tr><td>';
wideAds+='<a href="' + google_ads[i].url + '" ' +
'onmouseout="window.status=\'\';return true" ' +
'onmouseover="window.status=\'go to ' +
google_ads[i].visible_url + '\';return true" ' +
'style="text-decoration:none">' +
'<span style="text-decoration:underline">' +
'<b>' + google_ads[i].line1 + '</b><br></span>' +
'<span style="color:#000000">' +
google_ads[i].line2 + '<br></span>' +
'<span style="color:#008000">' +
google_ads[i].visible_url + '</span></a><BR>';
wideAds += '</td>';
if (i == google_num_ads - 1) {
wideAds += '<td valign="top">' +
'<div id="lhid_search_ad_spl">Sponsored Links</div></td>';
}
wideAds += '</tr></tbody></table>';
wideAds += '</td>';
}
}
wideAds += '</tr></tbody></table>';
document.getElementById("lhid_search_ad_unit").innerHTML = wideAds;
}
google_afs_query = 'test';
google_afs_ad = 'w3'; // specify the number of ads you are requesting
google_afs_client = 'google-picasa_js'; // substitute your client ID
// google_afs_channel = ''; // enter your comma-separated channel IDs
google_afs_ie = 'utf8'; // select input encoding scheme
google_afs_oe = 'utf8'; // select output encoding scheme
google_afs_adsafe = 'high'; // specify level for filtering non-family-safe ads
google_afs_adtest = 'off'; // ** set parameter to off before launch to production
document.write('<' +
'script src="https://www.google.com/afsonline/show_afs_ads.js"></' +
'script>');
</script> |
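A much smaller input exercising the same condition can be sketched as follows. This is only an illustration, not the reporter's test case: it assumes Python 3, where the module is html.parser rather than the Python 2 HTMLParser module used in the report, and the EventRecorder class and SCRIPT_BODY string are hypothetical names introduced here. On a parser that treats script content as CDATA, the "</b>" inside the string should not produce tag events, and the concatenated data should equal the script body.

```python
from html.parser import HTMLParser  # Python 3 name; the report uses Python 2's HTMLParser


class EventRecorder(HTMLParser):
    """Hypothetical helper: records tag events and raw data chunks."""

    def __init__(self):
        super().__init__()
        self.events = []   # ("start" | "end", tag) tuples in arrival order
        self.chunks = []   # data pieces, possibly more than one per element

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.chunks.append(data)


# Script body containing markup-like text, including a closing </b> tag.
SCRIPT_BODY = "var s = '<b>bold</b><br>';"

recorder = EventRecorder()
recorder.feed("<script>" + SCRIPT_BODY + "</script>")
recorder.close()

# Join the chunks, since the parser may deliver the data in several pieces.
script_text = "".join(recorder.chunks)
print(recorder.events)
print(script_text == SCRIPT_BODY)
```

If the tags inside the string were parsed as markup, extra "b" or "br" events would show up in recorder.events.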
Have you tried with the latest 2.7? (see msg147170) |
Yes, I am running Python 2.7.2. On Sun, Nov 6, 2011 at 12:52 PM, Ezio Melotti <report@bugs.python.org> wrote:
|
This one should also have a priority change. Tested on Python 2.7.3. --Mike On Sun, Nov 6, 2011 at 12:54 PM, Michael Brooks <report@bugs.python.org> wrote:
|
Has anyone else been able to verify this? On Mon, Nov 7, 2011 at 7:46 AM, Michael Brooks <report@bugs.python.org> wrote:
|
I'm working on it, but a minimal example seems to work fine. (P.S. there's no need to quote the previous message(s) while replying) |
It seems to me that the arguments are parsed correctly, but handle_data is called multiple times between handle_starttag and handle_endtag. |
OK, so until you fix this bug, I'll be overriding HTMLParser with my fix. Thanks. On Thu, Nov 17, 2011 at 9:24 AM, Ezio Melotti <report@bugs.python.org> wrote:
|
It already behaves like a browser, it just gives you data in chunks instead of calling handle_data() only once at the end. The documentation is not clear about this though. It says that feed() can be called several times, but it doesn't say that handle_data() (and possibly other methods) might get called more than once. This seems to always be the case while calling feed() several times. |
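Since handle_data() may be called more than once for a single element, a handler that needs the complete text can buffer the pieces and join them when the end tag arrives. The following is a minimal sketch of that idea, assuming Python 3's html.parser; the TextCollector class is a hypothetical name, and it deliberately ignores nesting (a real handler would likely keep a stack of buffers).

```python
from html.parser import HTMLParser  # Python 3 name for the Python 2 HTMLParser module


class TextCollector(HTMLParser):
    """Hypothetical helper: joins data chunks per element (no nesting support)."""

    def __init__(self):
        super().__init__()
        self._buffer = []  # chunks seen since the last start tag
        self.texts = []    # (tag, full text) pairs, one per closed element

    def handle_starttag(self, tag, attrs):
        self._buffer = []

    def handle_data(self, data):
        # May fire several times per element, e.g. once per feed() call.
        self._buffer.append(data)

    def handle_endtag(self, tag):
        self.texts.append((tag, "".join(self._buffer)))
        self._buffer = []


collector = TextCollector()
# Feed the document in arbitrary pieces, as a network reader might.
for piece in ("<p>Hello, ", "wor", "ld</p>"):
    collector.feed(piece)
collector.close()

print(collector.texts)
```

How many chunks the parser produces is an implementation detail, so the sketch relies only on the joined result.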
Oh, then there is a misunderstanding. No browser will parse the HTML Thu, Nov 17, 2011 at 11:17 AM, Ezio Melotti <report@bugs.python.org> wrote:
|
Attached patch should solve the issue. |
New changeset 91163aa3d5b4 by Ezio Melotti in branch '2.7':
New changeset 0a32e7e3aa1f by Ezio Melotti in branch '3.2':
New changeset e12d2b9c88ef by Ezio Melotti in branch 'default': |
This should be fixed now, let me know if you find other problems with the parser. |