Docs and README Update for 2.0.0 (#277)
* docs and version update:
- add docs for compatibility features
- add docs for memento
- update rewriter docs
- bump version to 2.0.0, update README, and changelist
ikreymer committed Jan 12, 2018
1 parent 36b9bdf commit 0c24f8a
Showing 8 changed files with 284 additions and 21 deletions.
8 changes: 8 additions & 0 deletions CHANGES.rst
@@ -1,3 +1,11 @@
pywb 2.0.0 changelist
~~~~~~~~~~~~~~~~~~~~~

See the docs at https://pywb.readthedocs.org for more info.

**TODO: more detailed changelist**


pywb 0.33.2 changelist
~~~~~~~~~~~~~~~~~~~~~~

12 changes: 7 additions & 5 deletions README.rst
@@ -1,5 +1,5 @@
pywb 2.0 beta
=============
Webrecorder pywb 2.0.0
======================

.. image:: https://travis-ci.org/ikreymer/pywb.svg?branch=master
:target: https://travis-ci.org/ikreymer/pywb
@@ -21,7 +21,7 @@ that is used by other web archives, including the traditional "Wayback Machine"
New Features
^^^^^^^^^^^^

The 2.0 beta release includes a major overhaul of pywb and introduces the following new features, including:
The 2.0 release includes a major overhaul of pywb and introduces the following new features:

* Dynamic multi-collection configuration system with no-restart updates.

@@ -37,6 +37,8 @@ The 2.0 beta release includes a major overhaul of pywb and introduces the follow

* Significantly improved client-side rewriting to handle most modern web sites.

* Improved 'calendar' query UI, grouping results by year and month, and updated replay banner.


Please see the `full documentation <https://pywb.readthedocs.org>`_ for more detailed info on all these features.

@@ -48,7 +50,7 @@ A few key features are high on list of priorities, but have not yet been impleme

* Url Exclusion System

* New Default UI (calendar and banner)
* UI Improvements

If you are interested in contributing, especially to any of these areas, please let us know!

@@ -64,7 +66,7 @@ To run and install locally you can:

* Run Wayback with ``wayback`` (see docs for info on how to setup collections)

* Build docs locally with: ``cd docs; make html``. (The docs will be built in `./_build/html/index.html`)
* Build docs locally with: ``cd docs; make html``. (The docs will be built in ``./_build/html/index.html``)


Consult the local or `online docs <https://pywb.readthedocs.org>`_ for latest usage and configuration details.
2 changes: 1 addition & 1 deletion docs/index.rst
@@ -17,7 +17,7 @@ A subset of features provides the basic functionality of a "Wayback Machine".
manual/usage
manual/configuring
manual/architecture
manual/cdxserver_api
manual/apis
code/pywb


10 changes: 10 additions & 0 deletions docs/manual/apis.rst
@@ -0,0 +1,10 @@
APIs
====

pywb supports the following APIs:

.. toctree::

   cdxserver_api
   memento

62 changes: 56 additions & 6 deletions docs/manual/configuring.rst
@@ -5,8 +5,10 @@ Configuring the Web Archive

pywb offers an extensible YAML based configuration format via a main ``config.yaml`` at the root of each web archive.

Framed vs Frameless Replay vs HTTPS proxy
-----------------------------------------
.. _framed_vs_frameless:

Framed vs Frameless Replay
--------------------------

pywb supports several modes for serving archived web content.

@@ -19,8 +21,6 @@ With **frameless replay**, the archived content is loaded directly, and a banner

In this mode, the content is served directly at ``http://my-archive.example.com/<coll name>/http://example.com/``

(pywb can also supports HTTP/S **proxy mode** which requires additional setup. See :ref:`https-proxy` for more details).

For security reasons, we recommend running pywb in framed mode, because a malicious site
`could tamper with the banner <http://labs.rhizome.org/presentations/security.html#/13>`_

@@ -31,6 +31,9 @@ To disable framed replay add:
``framed_replay: false`` to your config.yaml


Note: pywb also supports HTTP/S **proxy mode** which requires additional setup. See :ref:`https-proxy` for more details.


Directory Structure
-------------------

@@ -220,6 +223,8 @@ This configures the ``/live/`` route to point to the live web.
This collection can be useful for testing, or even more powerful, when combined with recording.


.. _auto-all:

Auto "All" Aggregate Collection
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

@@ -236,7 +241,7 @@ Collection Provenance
"""""""""""""""""""""

When using the auto-all collection, it is possible to determine the original collection of each resource by looking at the ``Link`` header metadata
if Memento API is enabled. The header will include the extra ``rel="collection"``, specifying the collection::
if :ref:`memento-api` is enabled. The header will include the extra ``collection`` field, specifying the collection::

    Link: <http://example.com/>; rel="original", <http://localhost:8080/all/mp_/http://example.com/>; rel="timegate", <http://localhost:8080/all/timemap/link/http://example.com/>; rel="timemap"; type="application/link-format", <http://localhost:8080/all/20170920185327mp_/http://example.com/>; rel="memento"; datetime="Wed, 20 Sep 2017 18:20:19 GMT"; collection="coll-1"

@@ -254,7 +259,7 @@ Identifiying the Collections
""""""""""""""""""""""""""""

When using the "all" collection, it is possible to determine the actual collection of each url by looking at the ``Link`` header metadata,
which in addition to memento relations, include the extra ``rel="collection"``, specifying the collection::
which in addition to memento relations, includes the extra ``collection=`` field, specifying the collection::

    Link: <http://example.com/>; rel="original", <http://localhost:8080/all/mp_/http://example.com/>; rel="timegate", <http://localhost:8080/all/timemap/link/http://example.com/>; rel="timemap"; type="application/link-format", <http://localhost:8080/all/20170920185327mp_/http://example.com/>; rel="memento"; datetime="Wed, 20 Sep 2017 18:20:19 GMT"; collection="coll-1"
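
A client could read the ``collection`` value from such a header with a small parser. This is only a sketch: ``parse_link_header`` is a hypothetical helper (not a pywb function), and it assumes every parameter value is double-quoted, as in the example header above. Note that the ``datetime`` value contains a comma, so the header cannot simply be split on commas:

```python
import re

# One link: "<target>" followed by zero or more ; name="value" parameters.
LINK_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+="[^"]*")*)')
PARAM_RE = re.compile(r'([\w-]+)="([^"]*)"')


def parse_link_header(value):
    """Return a list of dicts, one per link, with its url and parameters."""
    links = []
    for target, params in LINK_RE.findall(value):
        entry = dict(PARAM_RE.findall(params))
        entry["url"] = target
        links.append(entry)
    return links


# The example header from the text above (without the "Link: " prefix).
header = (
    '<http://example.com/>; rel="original", '
    '<http://localhost:8080/all/mp_/http://example.com/>; rel="timegate", '
    '<http://localhost:8080/all/timemap/link/http://example.com/>; '
    'rel="timemap"; type="application/link-format", '
    '<http://localhost:8080/all/20170920185327mp_/http://example.com/>; '
    'rel="memento"; datetime="Wed, 20 Sep 2017 18:20:19 GMT"; '
    'collection="coll-1"'
)

mementos = [link for link in parse_link_header(header) if link.get("rel") == "memento"]
print(mementos[0]["collection"])
```

Only the ``memento`` link carries the ``collection`` parameter in this example; the other relations are returned without it.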

@@ -465,3 +470,48 @@ See the `wsgiprox README <https://github.com/webrecorder/wsgiprox/blob/master/RE

For more information on custom certificate authority (CA) installation, the `mitmproxy certificate page <http://docs.mitmproxy.org/en/stable/certinstall.html>`_ provides a good overview for installing a custom CA on different platforms.


Compatibility: Redirects, Memento, Flash video overrides
--------------------------------------------------------

Exact Timestamp Redirects
^^^^^^^^^^^^^^^^^^^^^^^^^

By default, pywb does not redirect urls to the 'canonical' representation of a url with the exact timestamp.

For example, when requesting ``/my-coll/2017js_/http://example.com/example.js`` but the actual timestamp of the resource is ``2017010203000400``,
there is not a redirect to ``/my-coll/2017010203000400js_/http://example.com/example.js``. Instead, this 'canonical' url is returned in
the ``Content-Location`` header. This behavior is recommended for performance reasons as it avoids an extra roundtrip to the server for a redirect.

However, if the classic redirect behavior is desired, it can be enabled by adding::

    redirect_to_exact: true

to the config. This will force any url to be redirected to the exact url, and is consistent with previous behavior and other wayback machine implementations,
at the expense of additional network traffic.
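
The replay url structure used in the example above can be illustrated with a short sketch. The helper ``parse_replay_path`` and its regular expression are hypothetical (they are not part of pywb) and assume that a modifier is a run of lowercase letters followed by ``_``, such as ``js_`` or ``mp_``:

```python
import re

# /<coll>/<timestamp><modifier>/<url>, where the timestamp part is optional.
REPLAY_PATH_RE = re.compile(
    r"^/(?P<coll>[^/]+)/"
    r"(?:(?P<timestamp>\d+)(?P<modifier>[a-z]+_)?/)?"
    r"(?P<url>.+)$"
)


def parse_replay_path(path):
    """Split a replay path into its parts, or return None if it does not match."""
    match = REPLAY_PATH_RE.match(path)
    return match.groupdict() if match else None


requested = parse_replay_path("/my-coll/2017js_/http://example.com/example.js")
canonical = parse_replay_path("/my-coll/2017010203000400js_/http://example.com/example.js")

# Both paths refer to the same url; only the timestamp differs.
print(requested["timestamp"], canonical["timestamp"])
```

With the default behavior, a client that requested the first path would find the second one in ``Content-Location``; with ``redirect_to_exact: true`` it would be redirected there instead.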


Memento Protocol
^^^^^^^^^^^^^^^^

:ref:`memento-api` support is enabled by default, and works with no-timestamp-redirect and classic redirect behaviors.

However, Memento API support can be disabled by adding::

    enable_memento: false


Flash Video Override
^^^^^^^^^^^^^^^^^^^^

A custom system to override Flash video with a custom download via ``youtube-dl`` and replay with a custom player was enabled in previous versions of pywb.
However, this system was not widely used and is in need of maintenance. The system is less necessary now that most video is HTML5 based.
For these reasons, this system, previously enabled by including the script ``/static/vidrw.js``, is disabled by default.

To enable the previous behavior, add to the config::

    enable_flash_video_rewrite: true

The system may be revamped in the future and enabled by default, but for now, it is provided for compatibility reasons.
87 changes: 87 additions & 0 deletions docs/manual/memento.rst
@@ -0,0 +1,87 @@
.. _memento-api:

Memento API
===========

pywb supports the Memento Protocol as specified in `RFC 7089 <https://tools.ietf.org/html/rfc7089>`_ and provides API endpoints
for Memento Timemaps and Timegates per collection.

Memento support is enabled by default and can be controlled via the ``enable_memento: true|false`` setting in ``config.yaml``.


TimeMap API
-----------

The timemap API is available at ``/<coll>/timemap/<type>/<url>`` for any pywb collection ``<coll>`` and ``<url>`` in the collection.

The timemap (URL-T) can be provided in several output formats, as specified by the ``<type>`` param:

* ``link`` -- returns an ``application/link-format`` response, as required by the `Memento spec <https://tools.ietf.org/html/rfc7089#section-5>`_
* ``cdxj`` -- returns a timemap in the native CDXJ format.
* ``json`` -- returns the timemap in newline-delimited JSON (NDJSON) format.


Although not required by the Memento spec, the Link output produced by timemap also includes the extra ``collection=`` field, specifying
the collection of each url. This is especially useful when accessing the timemap for the special :ref:`auto-all` to view a timemap across
multiple collections in a single response.


The Timemap API is implemented as a subset of the :ref:`cdx-server-api` and should produce the same result as the equivalent CDX server query.

For example, the timemap query:
``http://localhost:8080/pywb/timemap/link/http://example.com/`` is equivalent to the CDX server query:
``http://localhost:8080/pywb/cdx?url=http://example.com/&output=link``
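
The mapping between the two query forms can be sketched as a small function. ``timemap_to_cdx_query`` is a hypothetical helper written for illustration, not part of pywb; it leaves the target url unencoded to mirror the example above, and only the ``link`` output is confirmed by the text as having a CDX equivalent:

```python
def timemap_to_cdx_query(timemap_url):
    """Rewrite <prefix>/timemap/<type>/<url> as <prefix>/cdx?url=<url>&output=<type>."""
    prefix, marker, rest = timemap_url.partition("/timemap/")
    if not marker:
        raise ValueError("not a timemap url: " + timemap_url)

    output, slash, target = rest.partition("/")
    if not slash or not target:
        raise ValueError("missing type or target url: " + timemap_url)

    # The target url is passed through as-is, matching the example in the text.
    return "{0}/cdx?url={1}&output={2}".format(prefix, target, output)


print(timemap_to_cdx_query("http://localhost:8080/pywb/timemap/link/http://example.com/"))
```

A target url that carries its own query string would likely need to be percent-encoded (for example with ``urllib.parse.quote``) before being placed in the ``url`` parameter.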


TimeGate API
------------

The TimeGate API for any pywb collection is available at ``/<coll>/<url>``, e.g. ``/my-coll/http://example.com/``

The timegate can either be a non-redirecting timegate (200-style negotiation), which returns the URL-M response directly, or a redirecting timegate (302-style negotiation), which redirects to a URL-M.

.. _memento-no-redirect:

Non-Redirecting TimeGate (Memento Pattern 2.2)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This behavior is consistent with `Memento Pattern 2.2 <https://tools.ietf.org/html/rfc7089#section-4.2.2>`_ and is the default behavior.

To avoid an extra redirect, the TimeGate returns the requested memento directly (200-style negotiation) without redirecting to its canonical, timestamped url.
The 'canonical' URL-M is included in the ``Content-Location`` header and should be used to reference the memento in the future.


(For HTML Mementos, the rewriting system also injects the url and timestamp into the page so that it can be displayed to the user). This behavior optimizes network traffic by avoiding unneeded redirects.


Redirecting TimeGate (Memento Pattern 2.1)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This behavior is consistent with `Memento Pattern 2.1 <https://tools.ietf.org/html/rfc7089#section-4.2.1>`_.

To enable this behavior, add ``redirect_to_exact: true`` to the config.

In this mode, the TimeGate always issues a 302 to redirect a request to the "canonical" URL-M memento. The ``Location`` header is always present
with the redirect.

As this approach always includes a redirect, use of this system is discouraged when the intent is to render mementos. However, this approach is useful when the goal is to determine the URL-M and to provide backwards compatibility.
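
A client that needs to work with either TimeGate style could determine the URL-M roughly as follows. This is a minimal sketch, not pywb code: both helper names are hypothetical, the status code and headers are assumed to come from whatever HTTP library is in use (with redirects not followed automatically), and ``Accept-Datetime`` is the request header defined by RFC 7089:

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def accept_datetime_header(moment):
    """Build the Accept-Datetime request header from an aware UTC datetime."""
    return {"Accept-Datetime": format_datetime(moment, usegmt=True)}


def resolve_memento_url(status, headers):
    """Return the URL-M reported by a TimeGate response, or None if absent."""
    lowered = {name.lower(): value for name, value in headers.items()}
    if status == 302:
        # Redirecting TimeGate: the URL-M is the redirect target.
        return lowered.get("location")
    if status == 200:
        # Non-redirecting TimeGate: the memento is served directly and
        # its canonical url is reported in Content-Location.
        return lowered.get("content-location")
    return None


request_headers = accept_datetime_header(
    datetime(2017, 9, 20, 18, 20, 19, tzinfo=timezone.utc)
)
print(request_headers)
```

Header names are compared case-insensitively, since HTTP libraries differ in how they normalize them.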


URL-M Headers
-------------

When serving a URL-M (any archived url), the following additional headers are included in accordance with the Memento spec:

* ``Vary: accept-datetime`` is included as required
* ``Link`` header with at least ``original``, ``timegate`` and ``timemap`` relations
* ``Content-Location`` is included if using :ref:`memento-no-redirect` behavior

(Note: the ``Content-Location`` header may also be included for a fuzzy-matched response, where the actual/canonical url differs from the requested url due to an inexact match.)







