Bug fix suggestion #9

limkokhole · 2018-08-07T12:40:05Z

chapter.py:

def get_image_type(url):
    for ending in ['jpg', 'jpeg', '.gif' '.png']:
        if url.endswith(ending):
            return ending
    else:
        try:
            f, temp_file_name = tempfile.mkstemp()
            urllib.urlretrieve(url, temp_file_name)
            image_type = imghdr.what(temp_file_name)
            return image_type
        except IOError:
            return None

This single method has 3 bugs:

url = url.lower() is missing. Extensions are sometimes uppercase, so the check fails and a redundant HTTP request is made just to detect the image type.
'.gif' '.png'] is missing a comma, so the two literals are concatenated into '.gif.png' and neither .gif nor .png is ever matched. '.bmp' is also missing, which imghdr will not recognize.
The check should be changed to if url.endswith(ending) or ((ending + '?') in url):. Otherwise it misses image URLs with ?parameters. Such a URL may itself return an HTML page containing an inner img src, which imghdr will not recognize even though ePUB editors and web browsers are able to render it.
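A minimal sketch of how the three points above could be addressed together, written for Python 3 (the quoted code is Python 2). The names image_type_from_url and KNOWN_EXTENSIONS are hypothetical and not part of pypub. It strips the query string with urlparse instead of the (ending + '?') check suggested above, and it returns the extension without the leading dot, which is also an assumption.

```python
from urllib.parse import urlparse

# Adjacent string literals are concatenated, so the original list has
# three entries and the last one is '.gif.png'.
assert ['jpg', 'jpeg', '.gif' '.png'] == ['jpg', 'jpeg', '.gif.png']

# Hypothetical constant: extensions accepted without downloading the file.
KNOWN_EXTENSIONS = ('.jpg', '.jpeg', '.gif', '.png', '.bmp')


def image_type_from_url(url):
    """Return the extension (no dot) if the URL path ends in a known one, else None."""
    # urlparse separates the path from any '?query' or '#fragment' part,
    # and lower() makes the comparison case-insensitive.
    path = urlparse(url).path.lower()
    for ending in KNOWN_EXTENSIONS:
        if path.endswith(ending):
            return ending.lstrip('.')
    return None
```

A caller would fall back to downloading the file and inspecting its contents only when this returns None.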

The second place is constants.py: it seems that neither the 'code' nor the pre tag is included. This causes the sample code in https://security.googleblog.com/2009/03/reducing-xss-by-way-of-automatic.html to be dropped, but sample code is important. <style> also needs to be supported, or else the caller can't control the padding between images, e.g. 'style': ['display', 'padding', 'max-height', 'max-width'],

The third place is chapter.py, which should support setting a timeout. Otherwise it waits forever, when it should get a chance to skip to the next chapter:

$ grep -n requests\.g pypub/chapter.py
70:            requests_object = requests.get(image_url, headers=request_headers)
241:            request_object = requests.get(url, headers=self.request_headers, allow_redirects=False)
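One way the two calls above could be given a timeout, as a rough sketch only. The helper fetch_or_skip and the DEFAULT_TIMEOUT value are hypothetical. The timeout argument of requests.get and the requests.exceptions.RequestException base class are part of the Requests library.

```python
import requests

# Hypothetical default: (connect, read) timeouts in seconds.
# The right values depend on the sites being fetched.
DEFAULT_TIMEOUT = (5, 30)


def fetch_or_skip(url, headers=None, timeout=DEFAULT_TIMEOUT):
    """Return the response, or None if the request fails or times out."""
    try:
        return requests.get(url, headers=headers, timeout=timeout)
    except requests.exceptions.RequestException:
        # Timeout and ConnectionError are both subclasses of RequestException.
        return None
```

A caller that gets None back can log the URL and move on to the next image or chapter instead of blocking.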


dazuraz · 2018-10-30T00:14:31Z

One more bug in the following method:

def _get_image_urls(self):
    image_nodes = self._content_tree.find_all('img')
    raw_image_urls = [node['src'] for node in image_nodes if node.has_attr('src')]
    full_image_urls = [urlparse.urljoin(self.url, image_url) for image_url in raw_image_urls]
    image_nodes_filtered = [node for node in image_nodes if node.has_attr('src')]
    return zip(image_nodes_filtered, full_image_urls)

urlparse.urljoin does not handle the case where self.url is a local file, so full_image_urls will hold the wrong values for any local image files.
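A possible workaround, sketched for Python 3 under the assumption that a base without an http or https scheme is a local file path. The helper resolve_image_reference is hypothetical and not part of pypub.

```python
import os
from urllib.parse import urljoin, urlparse

WEB_SCHEMES = ("http", "https")


def resolve_image_reference(base, src):
    """Resolve an img src against a chapter location (web URL or local path)."""
    if urlparse(base).scheme in WEB_SCHEMES:
        # Remote chapter: urljoin handles relative and absolute src values.
        return urljoin(base, src)
    if urlparse(src).scheme in WEB_SCHEMES:
        # Local chapter that references a remote image: keep the URL as-is.
        return src
    # Assumption: base is a local file path, so join src with its directory.
    return os.path.normpath(os.path.join(os.path.dirname(base), src))
```

For a Windows path such as C:\book\ch1.html, urlparse reports the scheme as 'c', which is why the sketch tests for web schemes explicitly rather than for an empty scheme.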

wcember closed this as completed Nov 17, 2019
