This is a *shadow issue* for Issue 18 on Google Code (from which this project was moved).
Added 2008-04-30T09:28:35.000Z by joh...@gmail.com. Closed (Fixed).
Labels: Type-Defect, Priority-High.
Please make updates to the bug there.
Original description
Hello,
I produce text files from HTML files using the html2text Python script from
Aaron Swartz (http://www.aaronsw.com/2002/html2text/). When using it on the
page http://www.sptimes.ru/index.php?action_id=1&i_number=1241 there are
some links your regexp doesn't match (see attached file, e.g. link numbers
10 and 86).
I have slightly adapted the expression:
a) the line with # url = \2:
Allow any character in the URL (including spaces)
b) the line with # title = \3:
Allow empty title strings (i.e. empty brackets or quotes)
My version of the regular expression:
_link_def_re = re.compile(r"""
^[ ]{0,%d}\[(.+)\]: # id = \1
[ \t]*
\n? # maybe *one* newline
[ \t]*
<?(.+?)>? # url = \2
[ \t]*
\n? # maybe one newline
[ \t]*
(?:
(?<=\s) # lookbehind for whitespace
['"(]
(.*?) # title = \3 (allow empty titles)
['")]
[ \t]*
)? # title is optional
(?:\n+|\Z)
""" % less_than_tab, re.X | re.M)
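For anyone who wants to try the modified pattern in isolation, here is a minimal sketch. It assumes `less_than_tab` is 3 (a tab width of 4 minus one), and the `parse_link_def` helper and the example.com inputs are hypothetical stand-ins, not part of the library or the attached file.

```python
import re

# Assumption: tab width of 4, so less_than_tab is 3.
less_than_tab = 3

# The pattern as posted above, reproduced so the snippet runs on its own.
link_def_re = re.compile(r"""
    ^[ ]{0,%d}\[(.+)\]: # id = \1
    [ \t]*
    \n?                 # maybe *one* newline
    [ \t]*
    <?(.+?)>?           # url = \2 (any character, including spaces)
    [ \t]*
    \n?                 # maybe one newline
    [ \t]*
    (?:
        (?<=\s)         # lookbehind for whitespace
        ['"(]
        (.*?)           # title = \3 (allow empty titles)
        ['")]
        [ \t]*
    )?                  # title is optional
    (?:\n+|\Z)
    """ % less_than_tab, re.X | re.M)


def parse_link_def(text):
    """Hypothetical helper: return (id, url, title) for the first match, or None."""
    match = link_def_re.search(text)
    return match.groups() if match else None


# Hypothetical link definitions exercising both changes.
print(parse_link_def('[10]: http://example.com/a b ""\n'))     # space in URL, empty title
print(parse_link_def('[86]: http://example.com/x y\n'))        # space in URL, no title
print(parse_link_def('[1]: <http://example.com/> (Title)\n'))  # ordinary definition
```

In the first case the title group is an empty string rather than `None`, which is the behaviour change (b) is meant to allow. One trade-off worth keeping in mind: because the URL group now accepts spaces, the pattern relies on the lazy quantifier plus the trailing title and end-of-line parts to decide where the URL stops.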
Maybe you could consider whether it is worth adding this to the code.
regards
Johannes Fitz