Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug fix suggestion #9

Closed
limkokhole opened this issue Aug 7, 2018 · 1 comment
Closed

Bug fix suggestion #9

limkokhole opened this issue Aug 7, 2018 · 1 comment

Comments

@limkokhole
Copy link

limkokhole commented Aug 7, 2018

chapter.py:

def get_image_type(url):
    for ending in ['jpg', 'jpeg', '.gif' '.png']:
        if url.endswith(ending):
            return ending
    else:
        try:
            f, temp_file_name = tempfile.mkstemp()
            urllib.urlretrieve(url, temp_file_name)
            image_type = imghdr.what(temp_file_name)
            return image_type
        except IOError:
return None

This single method has 3 bugs:

  1. Lack of url = url.lower() since sometime extension can be uppercaser, it causes redundant http request to detect the image type.
  2. '.gif' '.png'] missing a comma, so ".gif .png" causes .png and .gif never met. Also missing '.bmp' which imghdr will not recognize.
  3. The checking should change to if url.endswith(ending) or ((ending + '?') in url):, or else it missing images with ?parameters which itself is a html contains inner img src, and the imghdr will not recognize it but ePUB editor and web browser able to render it.

Second place is constants.py, seems like both 'code' and pre tags not included. It causes sample code in https://security.googleblog.com/2009/03/reducing-xss-by-way-of-automatic.html get drop, but sample code is important. Also <style> need to support or else the caller can't control the padding between images, e.g. 'style': ['display', 'padding', 'max-height', 'max-width'],

Third place is chapter.py should support set timeout or else it wait forever but it should give a chance skip to next chapter:

$ grep -n requests\.g pypub/chapter.py
70:            requests_object = requests.get(image_url, headers=request_headers)
241:            request_object = requests.get(url, headers=self.request_headers, allow_redirects=False)
@dazuraz
Copy link

dazuraz commented Oct 30, 2018

One more bug in the following method:

def _get_image_urls(self):
    image_nodes = self._content_tree.find_all('img')
    raw_image_urls = [node['src'] for node in image_nodes if node.has_attr('src')]
    full_image_urls = [urlparse.urljoin(self.url, image_url) for image_url in raw_image_urls]
    image_nodes_filtered = [node for node in image_nodes if node.has_attr('src')]
    return zip(image_nodes_filtered, full_image_urls)

Method urlparse.urljoin does not handle the case where self.url is a local file. Parameter full_image_urls will have the wrong values for any local file images.

@wcember wcember closed this as completed Nov 17, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants