def get_image_type(url):
    for ending in ['jpg', 'jpeg', '.gif' '.png']:
        if url.endswith(ending):
            return ending
    else:
        try:
            f, temp_file_name = tempfile.mkstemp()
            urllib.urlretrieve(url, temp_file_name)
            image_type = imghdr.what(temp_file_name)
            return image_type
        except IOError:
            return None
This single method in chapter.py has 3 bugs:
It lacks url = url.lower(). Since the extension is sometimes uppercase, the check misses it, which causes a redundant HTTP request to detect the image type.
'.gif' '.png'] is missing a comma, so Python concatenates the two literals into '.gif.png' and neither .gif nor .png is ever matched. '.bmp' is also missing, which imghdr will not recognize.
The check should be changed to if url.endswith(ending) or ((ending + '?') in url):, or else it misses images with ?parameters. Such a URL can itself be an HTML page containing an inner img src, which imghdr will not recognize, even though ePUB editors and web browsers are able to render it.
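To make the three fixes concrete, here is a minimal sketch of only the extension check, written for Python 3. The names IMAGE_ENDINGS and guess_image_ending are hypothetical and not part of the library; the download-and-imghdr fallback is left out on purpose.

```python
# Endings as in the original list, with the comma restored and '.bmp' added.
IMAGE_ENDINGS = ['jpg', 'jpeg', '.gif', '.png', '.bmp']

# Adjacent string literals are concatenated, which is what hid the bug:
assert ['.gif' '.png'] == ['.gif.png']


def guess_image_ending(url):
    """Return the matching ending, or None when the URL alone gives no hint."""
    url = url.lower()  # extensions are sometimes uppercase
    for ending in IMAGE_ENDINGS:
        # Match a trailing extension, or one followed by ?parameters.
        if url.endswith(ending) or (ending + '?') in url:
            return ending
    return None  # the caller would fall back to downloading the file
```

With this, a URL such as http://example.com/pic.gif?w=100 is classified from the URL alone, without an extra HTTP request.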
The second place is constants.py: it seems that neither the 'code' nor the 'pre' tag is included. This causes the sample code in https://security.googleblog.com/2009/03/reducing-xss-by-way-of-automatic.html to get dropped, but sample code is important. Also, <style> needs to be supported, or else the caller can't control the padding between images, e.g. 'style': ['display', 'padding', 'max-height', 'max-width'],
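A rough sketch of what I mean, in Python 3. I have not mirrored the real layout of constants.py here: SUPPORTED_TAGS, ALLOWED_STYLE_PROPERTIES and filter_style are hypothetical names used only to illustrate the idea of whitelisting 'code', 'pre' and a few style properties.

```python
# Hypothetical whitelist; the real constants.py may be structured differently.
SUPPORTED_TAGS = {
    'img': ['src', 'style'],
    'code': [],  # keep inline sample code
    'pre': [],   # keep preformatted sample code
}

# Style properties the caller is allowed to control (from the example above).
ALLOWED_STYLE_PROPERTIES = ['display', 'padding', 'max-height', 'max-width']


def filter_style(style_value):
    """Keep only whitelisted declarations from an inline style string."""
    kept = []
    for declaration in style_value.split(';'):
        name, sep, value = declaration.partition(':')
        name = name.strip().lower()
        # Skip empty fragments and properties that are not whitelisted.
        if sep and name in ALLOWED_STYLE_PROPERTIES:
            kept.append('%s: %s' % (name, value.strip()))
    return '; '.join(kept)
```

For example, filter_style('padding: 4px; position: fixed') keeps the padding and drops the position declaration.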
The third place is chapter.py, which should support setting a timeout. Otherwise it waits forever, when it should instead get a chance to skip to the next chapter.
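Something along these lines is what I have in mind, sketched for Python 3 with the standard library only. fetch_or_skip and its opener parameter are hypothetical names, not existing library code; the opener argument is there just so the behaviour can be exercised without a network.

```python
import urllib.request


def fetch_or_skip(url, timeout=10, opener=urllib.request.urlopen):
    """Return the response body, or None if the request fails or times out."""
    try:
        # urlopen accepts a timeout in seconds for blocking operations.
        with opener(url, timeout=timeout) as response:
            return response.read()
    except OSError:
        # URLError and socket timeouts are both OSError subclasses;
        # returning None lets the caller skip to the next chapter.
        return None
```

A caller looping over chapter URLs could then skip any chapter for which this returns None instead of hanging.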
def _get_image_urls(self):
    image_nodes = self._content_tree.find_all('img')
    raw_image_urls = [node['src'] for node in image_nodes if node.has_attr('src')]
    full_image_urls = [urlparse.urljoin(self.url, image_url) for image_url in raw_image_urls]
    image_nodes_filtered = [node for node in image_nodes if node.has_attr('src')]
    return zip(image_nodes_filtered, full_image_urls)
The urlparse.urljoin call does not handle the case where self.url is a local file, so the variable full_image_urls will hold the wrong values for any images of a local file.
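One possible workaround, sketched in Python 3 (where urlparse.urljoin lives in urllib.parse). This assumes the failing case is a base given as a plain filesystem path rather than a URL; resolve_image_url is a hypothetical helper, not existing library code.

```python
from pathlib import Path
from urllib.parse import urljoin, urlsplit


def resolve_image_url(base, src):
    """Join src against base, treating a non-URL base as a local file path."""
    if urlsplit(base).scheme not in ('http', 'https', 'file'):
        # Assumption: anything else (no scheme, or a drive letter parsed as
        # a scheme) is a filesystem path; convert it to a file:// URI first.
        base = Path(base).absolute().as_uri()
    return urljoin(base, src)
```

The list comprehension in _get_image_urls could then call such a helper in place of urlparse.urljoin, so relative image paths resolve next to the local chapter file.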