[shadow] Bug with Links using javascript:... in href attribute #18

Closed
trentm opened this issue Mar 7, 2011 · 0 comments

This is a shadow issue for Issue 18 on Google Code (from which this project was moved).
Added 2008-04-30T09:28:35.000Z by joh...@gmail.com. Closed (Fixed).
Labels: Type-Defect, Priority-High.
Please make updates to the bug there.

Original description

Hello,

I produce text files from HTML files using the html2text Python script from
Aaron Swartz (http://www.aaronsw.com/2002/html2text/). When using it on the
page http://www.sptimes.ru/index.php?action_id=1&i_number=1241 there are
some links your regexp doesn't match (see attached file, e.g. links number
10 and 86).

I have slightly adapted the expression:
  a) the line with # url = \2: 
      Allow any character in the URL (including spaces)
  b) the line with # title = \3: 
      Allow empty title strings (i.e. an empty pair of quotes or brackets)

My version of the regular expression:
_link_def_re = re.compile(r"""
            ^[ ]{0,%d}\[(.+)\]: # id = \1
              [ \t]*
              \n?               # maybe *one* newline
              [ \t]*
            <?(.+?)>?          # url = \2
              [ \t]*
              \n?               # maybe one newline
              [ \t]*
            (?:
                (?<=\s)         # lookbehind for whitespace
                ['"(]
                (.*?)           # title = \3 (allow empty titles)
                ['")]
                [ \t]*
            )?  # title is optional
            (?:\n+|\Z)
            """ % less_than_tab, re.X | re.M)
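
To illustrate what the two changes allow, here is a minimal sketch that compiles the adapted expression on its own and runs it over two made-up link definitions. The sample text, the helper name `find_link_defs`, and the value `less_than_tab = 3` (assuming a tab width of 4) are illustrative assumptions, not taken from the attached file or from the library's code.

```python
import re

# Assumption: less_than_tab is one less than a tab width of 4.
less_than_tab = 3

# The adapted expression from this report, compiled standalone.
link_def_re = re.compile(r"""
    ^[ ]{0,%d}\[(.+)\]: # id = \1
      [ \t]*
      \n?               # maybe *one* newline
      [ \t]*
    <?(.+?)>?           # url = \2 (any character, spaces included)
      [ \t]*
      \n?               # maybe one newline
      [ \t]*
    (?:
        (?<=\s)         # lookbehind for whitespace
        ['"(]
        (.*?)           # title = \3 (may be empty)
        ['")]
        [ \t]*
    )?  # title is optional
    (?:\n+|\Z)
    """ % less_than_tab, re.X | re.M)


def find_link_defs(text):
    """Hypothetical helper: return (id, url, title) for each definition found."""
    return [m.groups() for m in link_def_re.finditer(text)]


# Made-up definitions: a URL containing a space, and an empty title.
sample = (
    '[10]: javascript:void window.print()\n'
    '[86]: http://example.com/a b.html ""\n'
)

for link_id, url, title in find_link_defs(sample):
    print(link_id, repr(url), repr(title))
```

Both definitions match: the first keeps the space inside its URL and has no title, the second has an empty title. One caveat of the looser URL group: because the URL is matched lazily and may contain spaces, a URL that ends in a quoted or parenthesized fragment preceded by whitespace (for example a second quoted argument in a javascript: call) may be split so that the tail is captured as the title.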

Maybe you could consider whether it is worth adding this to the code.

regards
Johannes Fitz 