<!DOCTYPE html PUBLIC
"-//W3C//DTD XHTML 1.1 plus MathML 2.0 plus SVG 1.1//EN"
"http://www.w3.org/2002/04/xhtml-math-svg/xhtml-math-svg.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<script type="text/javascript" src="docs.js"></script>
<link rel="stylesheet" type="text/css" href="docs.css"/>
<title>Venus Normalization</title>
</head>
<body>
<h2>Normalization</h2>
<p>Venus builds on, and extends, the <a
href="http://www.feedparser.org/">Universal Feed Parser</a> and <a
href="http://code.google.com/p/html5lib/">html5lib</a> to
convert all feeds into Atom 1.0, with well-formed XHTML, encoded as UTF-8,
meaning that you don't have to worry about funky feeds, tag soup, or character
encoding.</p>
<h3>Encoding</h3>
<p>Input data in feeds may be encoded in a variety of formats, most commonly
ASCII, ISO-8859-1, Windows-1252, and UTF-8. Additionally, many feeds make use of
the wide range of
<a href="http://www.w3.org/TR/html401/sgml/entities.html">character entity
references</a> provided by HTML. Each is converted to UTF-8, an encoding
which is a proper superset of ASCII, supports the entire range of Unicode
characters, and is one of
<a href="http://www.w3.org/TR/2006/REC-xml-20060816/#charsets">only two</a>
encodings required to be supported by all conformant XML processors.</p>
<p>Encoding problems are among the more common feed errors, and every
attempt is made to correct common errors, such as the inclusion of
the so-called
<a href="http://www.fourmilab.ch/webtools/demoroniser/">moronic</a> versions
of smart-quotes. In rare cases where individual characters cannot be
converted to valid UTF-8 or into
<a href="http://www.w3.org/TR/xml/#charsets">characters allowed in XML 1.0
documents</a>, such characters will be replaced with the Unicode
<a href="http://www.fileformat.info/info/unicode/char/fffd/index.htm">Replacement character</a>, with a title that describes the original character whenever possible.</p>
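<p>The replacement step described above can be sketched roughly as follows. This is a minimal illustration in modern Python 3, not Venus's own code: the function name <code>to_clean_utf8</code> and its parameters are hypothetical, and the sketch leaves out the descriptive title for replaced characters.</p>

```python
import re

# Anything outside the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML_CHARS = re.compile(
    "[^\u0009\u000a\u000d\u0020-\ud7ff\ue000-\ufffd\U00010000-\U0010ffff]"
)

REPLACEMENT_CHARACTER = "\ufffd"


def to_clean_utf8(raw, declared_encoding="utf-8"):
    """Decode raw bytes, replace unusable characters, and re-encode as UTF-8.

    Hypothetical helper for illustration only.
    """
    # errors="replace" substitutes U+FFFD for bytes that cannot be decoded.
    text = raw.decode(declared_encoding, errors="replace")
    # Most control characters are not allowed in XML 1.0 documents.
    text = _INVALID_XML_CHARS.sub(REPLACEMENT_CHARACTER, text)
    return text.encode("utf-8")
```

<p>For example, <code>to_clean_utf8(b"caf\xe9", "iso-8859-1")</code> yields the UTF-8 bytes for "café", while a NUL byte in the input comes back as U+FFFD.</p>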
<p>In order to support the widest range of inputs, use of Python 2.3 or later,
as well as the installation of the Python <code>iconvcodec</code>, is
recommended.</p>
<h3>HTML</h3>
<p>A number of different normalizations of HTML are performed. For starters,
the HTML is
<a href="http://www.feedparser.org/docs/html-sanitization.html">sanitized</a>,
meaning that HTML tags and attributes that could introduce JavaScript or
other security risks are removed.</p>
<p>Then,
<a href="http://www.feedparser.org/docs/resolving-relative-links.html">relative
links are resolved</a> within the HTML. This is also done for links
elsewhere in the feed.</p>
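<p>Resolving a relative link amounts to joining it with the base URI in effect. A minimal sketch using Python 3's standard library follows; the helper name <code>resolve_link</code> and the example URLs are illustrative assumptions, and this is not the Universal Feed Parser's own implementation.</p>

```python
from urllib.parse import urljoin


def resolve_link(base, href):
    """Resolve href against base; absolute links are returned unchanged."""
    return urljoin(base, href)


# Hypothetical feed location used as the base URI.
BASE = "http://example.org/blog/feed.xml"

resolved = [
    resolve_link(BASE, "images/a.png"),            # relative to the feed's directory
    resolve_link(BASE, "/about"),                  # relative to the site root
    resolve_link(BASE, "http://other.example/x"),  # already absolute
]
```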
<p>Finally, unmatched tags are closed. This is done with
<a href="http://code.google.com/p/html5lib/">knowledge of the semantics of HTML</a>. Additionally, a
<a href="http://golem.ph.utexas.edu/~distler/blog/archives/000165.html#sanitizespec">large
subset of MathML</a>, as well as a
<a href="http://www.w3.org/TR/SVGMobile/">tiny profile of SVG</a>,
is also supported.</p>
<h3>Atom 1.0</h3>
<p>The Universal Feed Parser also
<a href="http://www.feedparser.org/docs/content-normalization.html">normalizes the content of feeds</a>. This involves a
<a href="http://www.feedparser.org/docs/reference.html">large number of elements</a>; the best place to start is to look at the
<a href="http://www.feedparser.org/docs/annotated-examples.html">annotated examples</a>. Among other things, a wide variety of
<a href="http://www.feedparser.org/docs/date-parsing.html">date formats</a>
are converted into
<a href="http://www.ietf.org/rfc/rfc3339.txt">RFC 3339</a> formatted dates.</p>
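<p>To give a feel for the date conversion, here is a minimal sketch that turns one common input format (RFC 822 style dates, as used by RSS) into an RFC 3339 timestamp in UTC. It uses only the Python 3 standard library rather than the Universal Feed Parser's date handling; the function name is hypothetical, and treating dates without a zone as UTC is an assumption of this sketch.</p>

```python
from datetime import timezone
from email.utils import parsedate_to_datetime


def rfc822_to_rfc3339(value):
    """Convert an RFC 822 style date string to an RFC 3339 UTC timestamp."""
    parsed = parsedate_to_datetime(value)
    if parsed.tzinfo is None:
        # Assumption: a date without a usable zone is treated as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
```

<p>With this sketch, <code>"Thu, 01 Jan 2004 19:48:21 +0200"</code> becomes <code>2004-01-01T17:48:21Z</code>.</p>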
<p>If an entry has no <a href="http://www.feedparser.org/docs/reference-entry-id.html">id</a>, an attempt is made to synthesize one using (in order):</p>
<ul>
<li><a href="http://www.feedparser.org/docs/reference-entry-link.html">link</a></li>
<li><a href="http://www.feedparser.org/docs/reference-entry-title.html">title</a></li>
<li><a href="http://www.feedparser.org/docs/reference-entry-summary.html">summary</a></li>
<li><a href="http://www.feedparser.org/docs/reference-entry-content.html">content</a></li>
</ul>
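<p>The fallback order in the list above can be sketched as below. How Venus actually derives an id from each field is not described here, so the details are assumptions: this sketch uses the link as is and hashes the other fields, and both the function name and the <code>urn:md5:</code> prefix are hypothetical.</p>

```python
import hashlib


def synthesize_id(entry):
    """Return the entry's id, or build one from the first usable field."""
    if entry.get("id"):
        return entry["id"]
    if entry.get("link"):
        # Assumption: a link is unique enough to serve as the id directly.
        return entry["link"]
    for field in ("title", "summary", "content"):
        value = entry.get(field)
        if value:
            digest = hashlib.md5(value.encode("utf-8")).hexdigest()
            return "urn:md5:" + digest  # hypothetical id scheme
    return None  # nothing to build an id from
```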
<p>If no <a href="http://www.feedparser.org/docs/reference-feed-updated.html">updated</a> date is found in an entry, the updated date from
the feed is used. If no updated date is found in either the feed or
the entry, the current time is substituted.</p>
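<p>The same rule expressed as a small sketch. It assumes entries and feeds are plain dictionaries carrying a parsed time under an <code>updated_parsed</code> key, in the style of the Universal Feed Parser; the function name and the <code>now</code> parameter are hypothetical.</p>

```python
import time


def effective_updated(entry, feed, now=None):
    """Pick the updated time for an entry: entry, then feed, then current time."""
    if entry.get("updated_parsed"):
        return entry["updated_parsed"]
    if feed.get("updated_parsed"):
        return feed["updated_parsed"]
    # Neither the entry nor the feed carries a date: substitute the current time.
    return now if now is not None else time.gmtime()
```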
<h3 id="overrides">Overrides</h3>
<p>All of the above describes what Venus does automatically, either directly
or through its dependencies. There are a number of errors which cannot
be corrected automatically, and for these, there are configuration parameters
that can be used to help.</p>
<ul>
<li><code>ignore_in_feed</code> allows you to list any number of elements
or attributes which are to be ignored in feeds. This is often handy in the
case of feeds where the <code>author</code>, <code>id</code>,
<code>updated</code> or <code>xml:lang</code> values can't be trusted.</li>
<li><code>title_type</code>, <code>summary_type</code>,
<code>content_type</code> allow you to override the
<a href="http://www.feedparser.org/docs/reference-entry-title_detail.html#reference.entry.title_detail.type"><code>type</code></a>
attributes on these elements.</li>
<li><code>name_type</code> does something similar for
<a href="http://www.feedparser.org/docs/reference-entry-author_detail.html#reference.entry.author_detail.name">author names</a>.</li>
<li><code>future_dates</code> allows you to specify how to deal with dates which are in the future.
<ul style="margin:0">
<li><code>ignore_date</code> will cause the date to be ignored (and will therefore default to the time the entry was first seen) until the feed is updated and the time indicated is past, at which point the entry will be updated with the new date.</li>
<li><code>ignore_entry</code> will cause the entire entry containing the future date to be ignored until the date is past.</li>
<li>Anything else (i.e., the default) will leave the date as is, causing the entries that contain these dates to sort to the top of the planet until the time passes.</li>
</ul>
</li>
<li><code>xml_base</code> will adjust the <code>xml:base</code> values in effect for each of the text constructs in the feed (things like <code>title</code>, <code>summary</code>, and <code>content</code>). Other elements in the feed (most notably, <code>link</code>) are not affected by this value.
<ul style="margin:0">
<li><code>feed_alternate</code> will replace the <code>xml:base</code> in effect with the value of the <code>alternate</code> <code>link</code> found either in the enclosed <code>source</code> or enclosing <code>feed</code> element.</li>
<li><code>entry_alternate</code> will replace the <code>xml:base</code> in effect with the value of the <code>alternate</code> <code>link</code> found in this entry.</li>
<li>Any other value will be treated as a <a href="http://www.ietf.org/rfc/rfc3986.txt">URI reference</a>. These values may be relative or absolute. If relative, the <code>xml:base</code> values in each text construct will each be adjusted separately using the specified value.</li>
</ul>
</li>
</ul>
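<p>The three <code>future_dates</code> behaviours described in the list above can be sketched as follows. This is an illustration under stated assumptions, not Venus's implementation: entries are plain dictionaries with an <code>updated_parsed</code> time tuple, the function name is hypothetical, and for <code>ignore_date</code> the sketch simply drops the date rather than substituting the time the entry was first seen.</p>

```python
import time


def apply_future_dates(entries, policy, now=None):
    """Return the entries to publish after applying a future_dates policy."""
    now = now if now is not None else time.gmtime()
    kept = []
    for entry in entries:
        updated = entry.get("updated_parsed")
        # Compare year..second only, so weekday and yearday fields are ignored.
        in_future = updated is not None and tuple(updated)[:6] > tuple(now)[:6]
        if in_future and policy == "ignore_entry":
            continue  # skip the whole entry until its date has passed
        if in_future and policy == "ignore_date":
            entry = dict(entry)          # leave the caller's entry untouched
            del entry["updated_parsed"]  # the date is ignored for now
        kept.append(entry)  # any other policy leaves the date as is
    return kept
```

<p>With any other policy value the entries pass through unchanged, which is why future-dated entries then sort to the top.</p>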
</body>
</html>