163 changes: 147 additions & 16 deletions docs/manual/access-control.rst
Original file line number Diff line number Diff line change
@@ -1,16 +1,89 @@
.. _access-control:

Access Control System
---------------------
Embargo and Access Control
--------------------------

The access controls system allows for a flexible configuration of rules to allow,
block or exclude access to individual urls by longest-prefix match.
The embargo system allows for date-based rules to block access to captures based on their capture dates.

The access controls system provides additional URL-based rules to allow, block or exclude access to specific URL prefixes or exact URLs.

The embargo and access control rules are configured per collection.

Embargo Settings
================

The embargo system allows restricting access to all URLs within a collection based on the timestamp of each URL.
Access to these resources is 'embargoed' until the date range is adjusted or the time interval passes.

The embargo can be used to disallow access to captures based on the following criteria:

- Captures before an exact date
- Captures after an exact date
- Captures newer than a time interval
- Captures older than a time interval

Embargo Before/After Exact Date
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To block access to all captures before or after a specific date, use the ``before`` or ``after`` embargo blocks
with a specific timestamp.

For example, the following blocks access to all URLs captured before 2020-12-26 in the collection ``embargo-before``::

    embargo-before:
      index_paths: ...
      archive_paths: ...
      embargo:
        before: '20201226'


The following blocks access to all URLs captured on or after 2020-12-26 in collection ``embargo-after``::

    embargo-after:
      index_paths: ...
      archive_paths: ...
      embargo:
        after: '20201226'

Embargo By Time Interval
^^^^^^^^^^^^^^^^^^^^^^^^

The embargo can also be set for a relative time interval, consisting of years, months, weeks and/or days.


For example, the following blocks access to all URLs newer than 1 year::

    embargo-newer:
      ...
      embargo:
        newer:
          years: 1



The following blocks access to all URLs older than 1 year, 2 months, 3 weeks and 4 days::

    embargo-older:
      ...
      embargo:
        older:
          years: 1
          months: 2
          weeks: 3
          days: 4


Any combination of years, months, weeks and days can be used (as long as at least one is provided) for the ``newer`` or ``older`` embargo settings.
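The four embargo modes above can be sketched in Python as follows. This is only an illustration of the rules as described, not pywb's implementation: the function names are hypothetical, ``before``/``after`` are assumed to be 8-digit dates, and a year and a month are approximated as 365 and 30 days.

```python
from datetime import datetime, timedelta


def interval_to_timedelta(years=0, months=0, weeks=0, days=0):
    # Assumption for this sketch: 1 year = 365 days, 1 month = 30 days.
    return timedelta(days=years * 365 + months * 30 + weeks * 7 + days)


def is_embargoed(capture_ts, embargo, now):
    """Return True if a capture with a 14-digit timestamp falls under the embargo.

    `embargo` mirrors the YAML ``embargo`` block, e.g. {"before": "20201226"}
    or {"newer": {"years": 1}}.
    """
    capture = datetime.strptime(capture_ts, "%Y%m%d%H%M%S")
    if "before" in embargo:
        # captures strictly before the date are blocked
        return capture < datetime.strptime(embargo["before"], "%Y%m%d")
    if "after" in embargo:
        # captures on or after the date are blocked
        return capture >= datetime.strptime(embargo["after"], "%Y%m%d")
    if "newer" in embargo:
        return capture > now - interval_to_timedelta(**embargo["newer"])
    if "older" in embargo:
        return capture < now - interval_to_timedelta(**embargo["older"])
    return False
```

Passing ``now`` explicitly keeps the check deterministic; a real deployment would evaluate it against the current time.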


Access Control Settings
=======================

Access Control Files (.aclj)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Access controls are set in one or more access control json files (.aclj), sorted in reverse alphabetical order.
To determine the best match, a binary search is used (similar to CDXJ) lookup and then the best match is found forward.
URL-based access controls are set in one or more access control JSON files (.aclj), sorted in reverse alphabetical order.
To determine the best match, a binary search is used (similar to CDXJ lookup) and then the best match is found forward.

An .aclj file may look as follows::

@@ -22,32 +95,71 @@ An .aclj file may look as follows::

Each JSON entry contains an ``access`` field and the original ``url`` field that was used to convert to the SURT (if any).

The prefix consists of a SURT key and a ``-`` (currently reserved for a timestamp/date range field to be added later)
The JSON entry may also contain a ``user`` field, as explained below.

The prefix consists of a SURT key and a ``-`` (currently reserved for a timestamp/date range field to be added later).

Given these rules, a user would:

* be allowed to visit ``http://httpbin.org/anything/something`` (allow)
* but would receive an 'access blocked' error message when viewing ``http://httpbin.org/`` (block)
* would receive a 404 not found error when viewing ``http://httpbin.org/anything`` (exclude)
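The longest-prefix behaviour behind these three outcomes can be sketched as below. The rule list is a hypothetical reconstruction from the outcomes listed above, the linear scan stands in for the binary search the docs describe, and the ``allow`` fallback is an assumption.

```python
# Hypothetical rules, kept in reverse alphabetical order as in an .aclj file.
RULES = [
    ("org,httpbin)/anything/something", "allow"),
    ("org,httpbin)/anything", "exclude"),
    ("org,httpbin)/", "block"),
]


def find_access(url_key, rules, default="allow"):
    """Return the access type of the longest rule key that prefixes url_key.

    Because the rules are reverse-sorted, a longer key sorts before any
    shorter key it extends, so the first prefix hit is the longest match.
    """
    for key, access in rules:
        if url_key.startswith(key):
            return access
    return default
```

For instance, ``find_access("org,httpbin)/anything/other", RULES)`` falls through the first rule and matches the ``exclude`` rule, since ``/anything`` is the longest matching prefix.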


Access Types: allow, block, exclude
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Access Types: allow, block, exclude, allow_ignore_embargo
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The available access types are as follows:

- ``exclude`` - when matched, results are excluded from the index, as if they do not exist. User will receive a 404.
- ``block`` - when matched, results are not excluded from the index, marked with ``access: block``, but access to the actual is blocked. User will see a 451
- ``allow`` - full access to the index and the resource.
- ``block`` - when matched, results are not excluded from the index, but access to the actual content is blocked. User will see a 451.
- ``allow`` - full access to the index and the resource, but may be overridden by an embargo.
- ``allow_ignore_embargo`` - full access to the index and resource, overriding any embargo settings.

The difference between ``exclude`` and ``block`` is that when blocked, the user can be notified that access is blocked, while
with exclude, no trace of the resource is presented to the user.

The use of ``allow`` is useful to provide access to more specific resources within a broader block/exclude rule.
The use of ``allow`` is useful to provide access to more specific resources within a broader block/exclude rule, while ``allow_ignore_embargo``
can be used to override any embargo settings.

If both embargo settings and access control rules are present, the embargo restrictions are checked first and take precedence,
unless the ``allow_ignore_embargo`` option is used to override the embargo.


User-Based Access Controls
^^^^^^^^^^^^^^^^^^^^^^^^^^

The access control rules can further be customized by specifying different permissions for different 'users'. Since pywb does not have a user system,
a special header, ``X-Pywb-ACL-User``, can be used to indicate a specific user.

This setting is designed to allow a more privileged user to access additional content or override an embargo.

For example, the following access control settings restrict access to ``https://example.com/restricted/`` by default, but allow access for the ``staff`` user::

com,example)/restricted - {"access": "allow", "user": "staff"}
com,example)/restricted - {"access": "block"}


Combined with the embargo settings, this can also be used to override the embargo for internal organizational users, while keeping the embargo for general access::

com,example)/restricted - {"access": "allow_ignore_embargo", "user": "staff"}
com,example)/restricted - {"access": "allow"}

To make this work, pywb must be running behind an Apache or Nginx system that is configured to set ``X-Pywb-ACL-User: staff`` based on certain settings.

For example, this header may be set based on IP range, or based on password authentication.

Further examples of how to set this header will be provided in the deployments section.

**Note: Do not use the user-based rules without configuring proper authentication on an Apache or Nginx frontend to set or remove this header, otherwise the 'X-Pywb-ACL-User' can easily be faked.**

See the :ref:`config-acl-header` section in Usage for examples on how to configure this header.
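As a rough illustration of how a ``user`` field narrows a rule, the sketch below picks the first matching rule that either has no ``user`` or names the requesting user. The function name and the exact precedence are assumptions made for this example, not a description of pywb internals.

```python
# The two rules from the first example above, in file order.
RULES = [
    ("com,example)/restricted", {"access": "allow", "user": "staff"}),
    ("com,example)/restricted", {"access": "block"}),
]


def find_access_for_user(url_key, rules, acl_user="", default="allow"):
    """Return the access type for url_key as seen by acl_user.

    acl_user stands for the value of the X-Pywb-ACL-User header
    (empty when the frontend did not set it).
    """
    for key, rule in rules:
        if not url_key.startswith(key):
            continue
        rule_user = rule.get("user")
        # A rule without a user applies to everyone; otherwise the user must match.
        if rule_user is None or rule_user == acl_user:
            return rule["access"]
    return default
```

With these rules, a request tagged ``staff`` resolves to ``allow``, while a request without the header skips the user-specific rule and resolves to ``block``.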


Access Error Messages
^^^^^^^^^^^^^^^^^^^^^

The special error code 451 is used to indicate that a resource has been blocked (access setting ``block``)
The special error code 451 is used to indicate that a resource has been blocked (access setting ``block``).

The `error.html <https://github.com/webrecorder/pywb/blob/master/pywb/templates/error.html>`_ template contains a special message for this access and can be customized further.

@@ -61,7 +173,7 @@ The .aclj files need not ever be added or edited manually.

The pywb ``wb-manager`` utility has been extended to provide tools for adding, removing and checking access control rules.

The access rules are written to ``<collection>/acl/access-rules.acl`` for a given collection ``<collection>`` for automatic collections.
The access rules are written to ``<collection>/acl/access-rules.aclj`` for a given collection ``<collection>`` for automatic collections.

For example, to add the first line to an ACL file ``access.aclj``, one could run::

@@ -73,6 +185,11 @@ The URL supplied can be a URL or a SURT prefix. If a SURT is supplied, it is use
wb-manager acl add <collection> com, allow


A specific user for user-based rules can also be specified, for example to add ``allow_ignore_embargo`` for user ``staff`` only, run::

wb-manager acl add <collection> http://httpbin.org/anything/something allow_ignore_embargo -u staff


By default, access control rules apply to a prefix of a given URL or SURT.

To have the rule apply only to the exact match, use::
@@ -104,7 +221,7 @@ Access Controls for Custom Collections

For manually configured collections, there are additional options for configuring access controls.
The access control files can be specified explicitly using the ``acl_paths`` key and allow specifying multiple ACL files,
and allowing sharing access control files between different collections.
and allow sharing access control files between different collections.

Single ACLJ::

@@ -134,7 +251,21 @@ When finding the best rule from multiple ``.aclj`` files, each file is binary se
set merge-sorted to find the best match (very similar to the CDXJ index lookup).

Note: It might make sense to separate ``allows.aclj`` and ``blocks.aclj`` into individual files for organizational reasons,
but there is no specific need to keep more than one access control files.
but there is no specific need to keep more than one access control file.

Finally, ACLJ and embargo settings combined for the same collection might look as follows::

    collections:
      test:
        ...
        embargo:
          newer:
            days: 366

        acl_paths:
          - ./path/to/allows.aclj
          - ./path/to/blocks.aclj


Default Access
^^^^^^^^^^^^^^
23 changes: 12 additions & 11 deletions docs/manual/cdxserver_api.rst
@@ -19,7 +19,7 @@ For example, the following query might return the first 10 results from host ``h
http://localhost:8080/coll/cdx?url=http://example.com/*&page=1&filter=mime:text/html&limit=10


By default, the api endpoint is available at ``/<coll>/cdx`` for every collection.
By default, the api endpoint is available at ``/<coll>/cdx`` for a collection named ``<coll>``.

The setting can be changed by setting ``cdx_api_endpoint`` in ``config.yaml``.

@@ -36,9 +36,10 @@ API Reference
^^^^^^^

| The only required parameter to the cdx server api is the url, ex:
| ``http://localhost:8080/coll-cdx?url=example.com``
| ``http://localhost:8080/coll/cdx?url=example.com``
will return a list of captures for ‘example.com’
will return a list of captures for ‘example.com’ in the collection
``coll`` (see above regarding per-collection api endpoints).


``from, to``
@@ -50,7 +51,7 @@ given date/time range (inclusive).
Timestamps may be <=14 digits and will be padded to either lower or
upper bound.

| For example, ``...coll-cdx?url=example.com&from=2014&to=2014`` will
| For example, ``...?url=example.com&from=2014&to=2014`` will
return results of ``example.com`` that
| have a timestamp between ``20140101000000`` and ``20141231235959``
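The padding rule can be illustrated with a small helper. This is a sketch rather than the cdx server's own code: ``pad_timestamp`` is a hypothetical name, and it assumes the partial timestamp consists of whole fields (4, 6, 8, 10, 12 or 14 digits).

```python
import calendar


def pad_timestamp(ts, upper=False):
    """Pad a partial timestamp to 14 digits, to its lower or upper bound."""
    if not (ts.isdigit() and 4 <= len(ts) <= 14 and len(ts) % 2 == 0):
        raise ValueError("expected 4 to 14 digits in whole fields: %r" % ts)

    year = ts[:4]
    month = ts[4:6] or ("12" if upper else "01")

    if ts[6:8]:
        day = ts[6:8]
    elif upper:
        # last day of the given month, accounting for leap years
        day = "%02d" % calendar.monthrange(int(year), int(month))[1]
    else:
        day = "01"

    time_part = ts[8:]
    time_pad = "235959" if upper else "000000"
    return year + month + day + time_part + time_pad[len(time_part):]
```

With this helper, ``from=2014&to=2014`` corresponds to the range ``pad_timestamp("2014")`` through ``pad_timestamp("2014", upper=True)``.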
@@ -75,11 +76,11 @@ The cdx server supports the following ``matchType``
As a shortcut, instead of specifying a separate ``matchType`` parameter,
wildcards may be used in the url:

- ``...coll-cdx?url=http://example.com/path/*`` is equivalent to
``...coll-cdx?url=http://example.com/path/&matchType=prefix``
- ``...?url=http://example.com/path/*`` is equivalent to
``...?url=http://example.com/path/&matchType=prefix``

- ``...coll-cdx?url=*.example.com`` is equivalent to
``...coll-cdx?url=example.com&matchType=domain``
- ``...?url=*.example.com`` is equivalent to
``...?url=example.com&matchType=domain``
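The two wildcard shortcuts amount to a simple rewrite of the ``url`` parameter. The helper below is a hypothetical sketch of that equivalence; it assumes ``exact`` is the match type when no wildcard is present.

```python
def expand_wildcard(url):
    """Translate a wildcard url into an explicit (url, matchType) pair."""
    if url.startswith("*."):
        # *.example.com -> domain match on example.com
        return url[2:], "domain"
    if url.endswith("*"):
        # http://example.com/path/* -> prefix match on http://example.com/path/
        return url[:-1], "prefix"
    # Assumption: no wildcard means an exact match.
    return url, "exact"
```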

*Note: if you are using legacy cdx index files which are not
SURT-ordered, the ``domain`` option will not be available. if this is
@@ -141,10 +142,10 @@ The ``filter`` param can be specified multiple times to filter by
specific fields in the cdx index. Field names correspond to the fields
returned in the JSON output. Filters can be specified as follows:

- ``...coll-cdx?url=example.com/*&filter==mime:text/html&filter=!=status:200``
- ``...?url=example.com/*&filter==mime:text/html&filter=!=status:200``
Return captures from example.com/\* where mime is text/html and http
status is not 200.
- ``...coll-cdx?url=example.com&matchType=domain&filter=~url:.*\.php$``
- ``...?url=example.com&matchType=domain&filter=~url:.*\.php$``
Return captures from the domain example.com which URL ends in
``.php``.
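Since ``filter`` may be repeated, a client has to encode the query as a list of pairs rather than a dict. A minimal sketch using only the standard library follows; ``build_cdx_query`` is a hypothetical helper and the endpoint URL in the usage note is the example one from this page.

```python
from urllib.parse import urlencode, parse_qsl


def build_cdx_query(endpoint, url, filters=(), **params):
    """Build a cdx query URL with zero or more repeated ``filter`` params."""
    pairs = [("url", url)]
    pairs.extend(sorted(params.items()))
    # each filter becomes its own filter=... pair, in the order given
    pairs.extend(("filter", f) for f in filters)
    return endpoint + "?" + urlencode(pairs)
```

For example, ``build_cdx_query("http://localhost:8080/coll/cdx", "example.com/*", filters=["=mime:text/html", "!=status:200"])`` yields a percent-encoded URL carrying both filters.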

@@ -182,7 +183,7 @@ the following modifiers:


``fields``
^^^^^^
^^^^^^^^^^

The ``fields`` param can be used to specify which fields to include in the
output. The standard available fields are usually: ``urlkey``,
18 changes: 18 additions & 0 deletions docs/manual/configuring.rst
@@ -266,6 +266,7 @@ The full set of configurable options (with their default settings) is as follows
rollover_idle_secs: 600
filename_template: my-warc-{timestamp}-{hostname}-{random}.warc.gz
source_filter: live
enable_put_custom_record: false

The required ``source_coll`` setting specifies the source collection from which to load content that will be recorded.
Most likely this will be the :ref:`live-web` collection, which should also be defined.
@@ -341,6 +342,23 @@ When any dedup_policy, pywb can also access the dedup Redis index, along with an
This feature is still experimental but should generally work. Additional options for working with the Redis Dedup index will be added in the future.


.. _put-custom-record:

Adding Custom Resource Records
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

pywb now also supports adding custom data to a WARC ``resource`` record. This can be used to add custom resources, such as screenshots, logs, error messages,
etc., that are not normally captured as part of recording, but are still useful to store in WARCs.

To add a custom resource, simply call ``PUT /<coll>/record`` with the data to be added as the request body and the type of the data specified as the content-type. The ``url`` can be specified as a query param.

For example, adding a custom record ``file:///my-custom-resource`` containing ``Some Custom Data`` can be done using ``curl`` as follows::

curl -XPUT "localhost:8080/my-web-archive/record?url=file:///my-custom-resource" --data "Some Custom Data"


This feature is only available if ``enable_put_custom_record: true`` is set in the recorder config.
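The same request can be prepared from Python with the standard library. This is a sketch, not part of pywb: ``build_put_record_request`` is a hypothetical helper, the host and collection are the ones from the ``curl`` example, and ``text/plain`` is an assumed content type.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_put_record_request(base, coll, target_url, data, content_type="text/plain"):
    """Prepare (but do not send) a PUT /<coll>/record request."""
    # the target url goes in the query string and must be percent-encoded
    query = urlencode({"url": target_url})
    return Request(
        "%s/%s/record?%s" % (base.rstrip("/"), coll, query),
        data=data,
        headers={"Content-Type": content_type},
        method="PUT",
    )


req = build_put_record_request(
    "http://localhost:8080", "my-web-archive",
    "file:///my-custom-resource", b"Some Custom Data",
)
# urllib.request.urlopen(req) would send it to a running pywb instance.
```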


.. _auto-fetch:

152 changes: 152 additions & 0 deletions docs/manual/localization.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,152 @@
.. _localizaation:

Localization / Multi-lingual Support
------------------------------------

pywb supports configuring different language locales and loading different language translations, and dynamically switching languages.

pywb can extract all text from templates and generate CSV files for translation and convert them back into a binary format used for localization/internationalization.

(pywb uses the `Babel library <http://babel.pocoo.org/en/latest/>`_ which extends the `standard Python i18n system <https://docs.python.org/3/library/gettext.html>`_)

To ensure all localization related dependencies are installed, first run::

pip install pywb[i18n]

Locales to use are configured in the ``config.yaml``.

The command-line ``wb-manager`` utility provides a way to manage locales for translation, including extracting text for translation and updating translated text.


Adding a Locale and Extracting Text
===================================

To add a new locale for translation and automatically extract all text that needs to be translated, run::

wb-manager i18n extract <loc>

The ``<loc>`` can be one or more supported two-letter locales or CLDR language codes. To list available codes, you can run ``pybabel --list-locales``.

Localization data is placed in the ``i18n`` directory, and translatable strings can be found in ``i18n/translations/<locale>/LC_MESSAGES/messages.csv``.

Each CSV file looks as follows, listing each source string and an empty string for the translated version::

"location","source","target"
"pywb/templates/banner.html:6","Live on",""
"pywb/templates/banner.html:8","Calendar icon",""
"pywb/templates/banner.html:9 pywb/templates/query.html:45","View All Captures",""
"pywb/templates/banner.html:10 pywb/templates/header.html:4","Language:",""
"pywb/templates/banner.html:11","Loading...",""
...


This CSV can then be passed to translators to translate the text.

(The extraction parameters are configured to load data from ``pywb/templates/*.html`` in ``babel.ini``)


For example, the following will generate translation strings for ``es`` and ``pt`` locales::

wb-manager i18n extract es pt


The translatable text can then be found in ``i18n/translations/es/LC_MESSAGES/messages.csv`` and ``i18n/translations/pt/LC_MESSAGES/messages.csv``.


The CSV files should be updated with a translation for each string in the ``target`` column.

The extract command adds any new strings without overwriting existing translations, so after running the update command to compile translated strings (described below), it is safe to run the extract command again.
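Before compiling, it can be handy to check which strings still lack a translation. The following sketch reads a ``messages.csv`` in the layout shown above and lists the sources with an empty ``target``; the helper name is hypothetical and the file path in the usage note is only an example.

```python
import csv
import io


def untranslated(csv_text):
    """Return the source strings whose ``target`` column is still empty."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row["source"] for row in reader if not row["target"].strip()]
```

A typical call would read the file first, for example ``untranslated(open("i18n/translations/es/LC_MESSAGES/messages.csv", encoding="utf-8").read())``.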


Updating Locale Catalog
=======================

Once the text has been translated, and the CSV files updated, simply run::

wb-manager i18n update <loc>

This will parse the CSVs and compile the translated string tables for use with pywb.


Specifying locales in pywb
==========================

To enable the locales in pywb, one or more locales can be added to the ``locales`` key in ``config.yaml``, ex::

    locales:
      - en
      - es

Single Language Default Locale
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

pywb can be configured with a default, single-language locale, by setting the ``default_locale`` property in ``config.yaml``::


    default_locale: es
    locales:
      - es


With this configuration, pywb will automatically use the ``es`` locale for all text strings in pywb pages.

pywb will also set the ``<html lang="es">`` so that the browser will recognize the correct locale.


Multi-language Translations
~~~~~~~~~~~~~~~~~~~~~~~~~~~

If more than one locale is specified, pywb will automatically show a language switching UI at the top of collection and search pages, with an option
for each locale listed. To include English as an option, it should also be added as a locale (and no strings translated). For example::

    locales:
      - en
      - es
      - pt

will configure pywb to show a language switch option on all pages.


Localized Collection Paths
==========================

When localization is enabled, pywb supports a locale prefix for accessing each collection in a specific language.
If pywb has a collection ``my-web-archive``, then:

* ``/my-web-archive/`` - loads UI with default language (set via ``default_locale``)
* ``/en/my-web-archive/`` - loads UI with ``en`` locale
* ``/es/my-web-archive/`` - loads UI with ``es`` locale
* ``/pt/my-web-archive/`` - loads UI with ``pt`` locale

The language switch options work by changing the locale prefix for the same page.
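The prefix handling can be pictured with the sketch below. Both helpers are hypothetical and only mirror the path scheme listed above; they are not pywb's routing code.

```python
def split_locale(path, locales, default_locale=None):
    """Split an optional locale prefix off a request path.

    "/es/my-web-archive/" -> ("es", "/my-web-archive/")
    "/my-web-archive/"    -> (default_locale, "/my-web-archive/")
    """
    parts = path.split("/", 2)
    if len(parts) >= 3 and parts[1] in locales:
        return parts[1], "/" + parts[2]
    return default_locale, path


def switch_locale(path, new_locale, locales):
    """Build the link a language switcher would use for the same page."""
    _, rest = split_locale(path, locales)
    return "/" + new_locale + rest
```

For example, ``switch_locale("/es/my-web-archive/", "pt", ["en", "es", "pt"])`` returns ``/pt/my-web-archive/``.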

Listing and Removing Locales
============================

To list the locales that have previously been added, you can run ``wb-manager i18n list``.

To disable a locale from being used in pywb, simply remove it from the ``locales`` key in ``config.yaml``.

To remove data for a locale permanently, you can run: ``wb-manager i18n remove <loc>``. This will remove the locale directory on disk.

To remove all localization data, you can manually delete the ``i18n`` directory.


UI Templates: Adding Localizable Text
=====================================

Text that can be translated, localizable text, can be marked as such directly in the UI templates:

1. By wrapping the text in ``{% trans %}``/``{% endtrans %}`` tags. For example::

{% trans %}Collection {{ coll }} Search Page{% endtrans %}

2. Short-hand by calling a special ``_()`` function, which can be used in attributes or more dynamically. For example::

... title="{{ _('Enter a URL to search for') }}">


These methods can be used in all UI templates and are supported by the Jinja2 templating system.

See :ref:`ui-customizations` for a list of all available UI templates.

2 changes: 1 addition & 1 deletion docs/manual/owb-to-pywb-deploy.rst
@@ -51,7 +51,7 @@ See the :ref:`nginx-deploy` and :ref:`apache-deploy` sections for more info on d
Working Docker Compose Examples
-------------------------------

The pywb `Deployment Examples <https://github.com/webrecorder/pywb/blob/docs/sample-deploy/>`_ include working examples of deploying pywb with Nginx, Apache and OutbackCDX
The pywb `Deployment Examples <https://github.com/webrecorder/pywb/blob/main/sample-deploy/>`_ include working examples of deploying pywb with Nginx, Apache and OutbackCDX
in Docker using Docker Compose, widely available container orchestration tools.

See `Installing Docker <https://docs.docker.com/get-docker/>`_ and `Installing Docker Compose <https://docs.docker.com/compose/install/>`_ for instructions on how to install these tools.
9 changes: 5 additions & 4 deletions docs/manual/rewriter.rst
@@ -92,7 +92,7 @@ Configuring Rewriters
---------------------

pywb provides customizable rewriting based on content-type, the available types are configured
in the :py:mod:`pywb.rewriter.default_rewriter`, which specifies rewriter classes per known type,
in the :py:mod:`pywb.rewrite.default_rewriter`, which specifies rewriter classes per known type,
and mapping of content-types to rewriters.


@@ -118,6 +118,7 @@ JS Rewriting
The JS rewriter is applied to inline ``<script>`` blocks, or inline attribute js, and any files determined to be javascript (based on content type and ``js_`` modifier).

The default JS rewriter does not rewrite any links. Instead, the JS rewriter performs limited regular expression rewriting of the following:

* ``postMessage`` calls
* certain ``this`` property accessors
* specific ``location =`` assignment
@@ -126,7 +127,7 @@ Then, the entire script block is wrapped in a special code block to be executed

The server-side rewriting is to aid the client-side execution of wrapped code.

For more information, see :py:mod:`pywb.rewriter.regex_rewriters.JSWombatProxyRewriterMixin`
For more information, see :py:mod:`pywb.rewrite.regex_rewriters.JSWombatProxyRewriterMixin`


JSONP Rewriting
@@ -140,13 +141,13 @@ For example, a requested url might be ``/my-coll/http://example.com?callback=jQu

To ensure the JSONP callback works as expected, the content is rewritten to ``jQuery123(...)`` -> ``jQuery456(...)``

For more information, see :py:mod:`pywb.rewriter.jsonp_rewriter`
For more information, see :py:mod:`pywb.rewrite.jsonp_rewriter`
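The idea can be sketched in a few lines. This is a simplified illustration with a hypothetical function and a deliberately loose regular expression, not the logic in ``pywb.rewrite.jsonp_rewriter``; it assumes the replay request carries ``callback=jQuery456`` while the archived body was captured with ``jQuery123``.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Matches a leading "name(" such as "jQuery123(" (simplified for this sketch).
JSONP_CALL = re.compile(r"^\s*([\w.$]+)\(")


def rewrite_jsonp(content, request_url):
    """Swap the archived JSONP callback name for the one in the current request."""
    callbacks = parse_qs(urlsplit(request_url).query).get("callback")
    match = JSONP_CALL.match(content)
    if not callbacks or not match:
        # nothing to rewrite: no callback param, or the body is not JSONP-shaped
        return content
    return content[:match.start(1)] + callbacks[0] + content[match.end(1):]
```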


DASH and HLS Rewriting
~~~~~~~~~~~~~~~~~~~~~~

To support recording and replaying adaptive streaming formats (DASH and HLS), pywb can perform special rewriting on the manifests for these formats to remove all but one possible resolution/format. As a result, the non-deterministic format selection is reduced to a single consistent format.

For more information, see :py:mod:`pywb.rewriter.rewrite_hls` and :py:mod:`pywb.rewriter.rewrite_dash` and the tests in ``pywb/rewrite/test/test_content_rewriter.py``
For more information, see :py:mod:`pywb.rewrite.rewrite_hls` and :py:mod:`pywb.rewrite.rewrite_dash` and the tests in ``pywb/rewrite/test/test_content_rewriter.py``

50 changes: 47 additions & 3 deletions docs/manual/usage.rst
@@ -293,6 +293,50 @@ Then, in your config, simply include:
The configuration assumes uwsgi is started with ``uwsgi uwsgi.ini``


.. _config-acl-header:

Configuring Access Control Header
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The :ref:`access-control` system allows users to be granted different access settings based on the value of an ACL header, ``X-Pywb-ACL-User``.

The header can be set via Nginx or Apache to grant custom access privileges based on IP address, password, or another combination of rules.

For example, to set the value of the header to ``staff`` if the IP of the request is from designated local IP ranges (127.0.0.1, 192.168.1.0/24), the following settings can be added to the configs:

For Nginx::

    geo $acl_user {
      # ensure user is set to empty by default
      default "";

      # optional: add IP ranges to allow privileged access
      127.0.0.1 "staff";
      192.168.1.0/24 "staff";
    }

    ...
    location /wayback/ {
      ...
      uwsgi_param HTTP_X_PYWB_ACL_USER $acl_user;
    }


For Apache::

    <If "-R '192.168.1.0/24' || -R '127.0.0.1'">
      RequestHeader set X-Pywb-ACL-User staff
    </If>
    # ensure header is cleared if no match
    <Else>
      RequestHeader set X-Pywb-ACL-User ""
    </Else>
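The decision both frontend configs make can be expressed in a few lines of Python, which may help when testing the IP ranges before deploying them. This is only an illustration of the mapping; the function name is hypothetical and the ranges are the ones named in the text above.

```python
import ipaddress

# Ranges that should be treated as the privileged "staff" user.
STAFF_NETWORKS = [
    ipaddress.ip_network("127.0.0.1/32"),
    ipaddress.ip_network("192.168.1.0/24"),
]


def acl_user_for_ip(client_ip):
    """Return the X-Pywb-ACL-User value for a client IP ("" when not privileged)."""
    addr = ipaddress.ip_address(client_ip)
    return "staff" if any(addr in net for net in STAFF_NETWORKS) else ""
```

Returning an empty string for non-matching clients mirrors the requirement that the header is always set or cleared by the frontend, so a client cannot supply it directly.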




Running on Subdirectory Path
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

@@ -313,7 +357,7 @@ Deployment Examples
The ``sample-deploy`` directory includes working Docker Compose examples for deploying pywb with Nginx and Apache on the ``/wayback`` subdirectory.

See:
- `Docker Compose Nginx <https://github.com/webrecorder/pywb/blob/docs/sample-deploy/docker-compose-nginx.yaml>`_ for sample Nginx config.
- `Docker Compose Apache <https://github.com/webrecorder/pywb/blob/docs/sample-deploy/docker-compose-apache.yaml>`_ for sample Apache config.
- `uwsgi_subdir.ini <https://github.com/webrecorder/pywb/blob/docs/sample-deploy/uwsgi_subdir.ini>`_ for example subdirectory uwsgi config.
- `Docker Compose Nginx <https://github.com/webrecorder/pywb/blob/main/sample-deploy/docker-compose-nginx.yaml>`_ for sample Nginx config.
- `Docker Compose Apache <https://github.com/webrecorder/pywb/blob/main/sample-deploy/docker-compose-apache.yaml>`_ for sample Apache config.
- `uwsgi_subdir.ini <https://github.com/webrecorder/pywb/blob/main/sample-deploy/uwsgi_subdir.ini>`_ for example subdirectory uwsgi config.

2 changes: 2 additions & 0 deletions extra_requirements.txt
@@ -5,3 +5,5 @@ uwsgi
ujson
pysocks
lxml
babel
translate_toolkit
9 changes: 9 additions & 0 deletions pywb/apps/cli.py
@@ -2,6 +2,13 @@
from argparse import ArgumentParser

import logging
import pkg_resources


#=============================================================================
def get_version():
    """Get version of the pywb"""
    return "pywb " + pkg_resources.get_distribution("pywb").version


#=============================================================================
@@ -40,6 +47,8 @@ def __init__(self, args=None, default_port=8080, desc=''):
:param str desc: The description for the application to be started
"""
parser = ArgumentParser(description=desc)
parser.add_argument("-V", "--version", action="version", version=get_version())

parser.add_argument('-p', '--port', type=int, default=default_port,
help='Port to listen on (default %s)' % default_port)
parser.add_argument('-b', '--bind', default='0.0.0.0',
70 changes: 65 additions & 5 deletions pywb/apps/frontendapp.py
@@ -2,16 +2,16 @@

from werkzeug.routing import Map, Rule, RequestRedirect, Submount
from werkzeug.wsgi import pop_path_info
from six.moves.urllib.parse import urljoin
from six.moves.urllib.parse import urljoin, parse_qsl
from six import iteritems
from warcio.utils import to_native_str
from warcio.timeutils import iso_date_to_timestamp
from warcio.timeutils import iso_date_to_timestamp, timestamp_to_iso_date
from wsgiprox.wsgiprox import WSGIProxMiddleware

from pywb.recorder.multifilewarcwriter import MultiFileWARCWriter
from pywb.recorder.recorderapp import RecorderApp
from pywb.recorder.filters import SkipDupePolicy, WriteDupePolicy, WriteRevisitDupePolicy
from pywb.recorder.redisindexer import WritableRedisIndexer
from pywb.recorder.redisindexer import WritableRedisIndexer, RedisPendingCounterTempBuffer

from pywb.utils.loaders import load_yaml_config
from pywb.utils.geventserver import GeventServer
@@ -74,6 +74,7 @@ def __init__(self, config_file=None, custom_config=None):
custom_config=custom_config)
self.recorder = None
self.recorder_path = None
self.put_custom_record_path = None
self.proxy_default_timestamp = None

config = self.warcserver.config
@@ -173,6 +174,10 @@ def _make_coll_routes(self, coll_prefix):
if self.recorder_path:
routes.append(Rule(coll_prefix + self.RECORD_ROUTE + '/<path:url>', endpoint=self.serve_record))

# enable PUT of custom data as 'resource' records
if self.put_custom_record_path:
routes.append(Rule(coll_prefix + self.RECORD_ROUTE, endpoint=self.put_custom_record, methods=["PUT"]))

return routes

def get_upstream_paths(self, port):
@@ -244,13 +249,25 @@ def init_recorder(self, recorder_config):
dedup_index=dedup_index,
dedup_by_url=dedup_by_url)

        if dedup_policy:
            pending_counter = self.warcserver.dedup_index_url.replace(':cdxj', ':pending')
            pending_timeout = recorder_config.get('pending_timeout', 30)
            create_buff_func = lambda params, name: RedisPendingCounterTempBuffer(512 * 1024, pending_counter, params, name, pending_timeout)
        else:
            create_buff_func = None

self.recorder = RecorderApp(self.RECORD_SERVER % str(self.warcserver_server.port), warc_writer,
accept_colls=recorder_config.get('source_filter'))
accept_colls=recorder_config.get('source_filter'),
create_buff_func=create_buff_func)

recorder_server = GeventServer(self.recorder, port=0)

self.recorder_path = self.RECORD_API % (recorder_server.port, recorder_coll)

# enable PUT of custom data as 'resource' records
if recorder_config.get('enable_put_custom_record'):
self.put_custom_record_path = self.recorder_path + '&put_record={rec_type}&url={url}'

def init_autoindex(self, auto_interval):
"""Initialize and start the auto-indexing of the collections. If auto_interval is None this is a no op.
@@ -404,10 +421,12 @@ def serve_cdx(self, environ, coll='$root'):
try:
res = requests.get(cdx_url, stream=True)

status_line = '{} {}'.format(res.status_code, res.reason)
content_type = res.headers.get('Content-Type')

return WbResponse.bin_stream(StreamIter(res.raw),
content_type=content_type)
content_type=content_type,
status=status_line)

except Exception as e:
return WbResponse.text_response('Error: ' + str(e), status='400 Bad Request')
@@ -466,6 +485,47 @@ def serve_content(self, environ, coll='$root', url='', timemap_output='', record

return self.rewriterapp.render_content(wb_url_str, coll_config, environ)

    def put_custom_record(self, environ, coll="$root"):
        """ When recording, PUT a custom WARC record to the specified collection
        (Available only when recording)
        :param dict environ: The WSGI environment dictionary for the request
        :param str coll: The name of the collection the record is to be served from
        """
        chunks = []
        while True:
            buff = environ["wsgi.input"].read()
            if not buff:
                break

            chunks.append(buff)

        data = b"".join(chunks)

        params = dict(parse_qsl(environ.get("QUERY_STRING")))

        rec_type = "resource"

        headers = {"Content-Type": environ.get("CONTENT_TYPE", "text/plain")}

        target_uri = params.get("url")

        if not target_uri:
            return WbResponse.json_response({"error": "no url"}, status="400 Bad Request")

        timestamp = params.get("timestamp")
        if timestamp:
            headers["WARC-Date"] = timestamp_to_iso_date(timestamp)

        put_url = self.put_custom_record_path.format(
            url=target_uri, coll=coll, rec_type=rec_type
        )
        res = requests.put(put_url, headers=headers, data=data)

        res = res.json()

        return WbResponse.json_response(res)

def setup_paths(self, environ, coll, record=False):
"""Populates the WSGI environment dictionary with the path information necessary to perform a response for
content or record.
7 changes: 5 additions & 2 deletions pywb/apps/rewriterapp.py
@@ -72,7 +72,8 @@ def __init__(self, framed_replay=False, jinja_env=None, config=None, paths=None)

self.jinja_env.init_loc(self.config.get('locales_root_dir'),
self.config.get('locales'),
self.loc_map)
self.loc_map,
self.config.get('default_locale'))

self.redirect_to_exact = config.get('redirect_to_exact')

@@ -684,7 +685,7 @@ def handle_error(self, environ, wbe):
return self._error_response(environ, wbe)

def _not_found_response(self, environ, url):
resp = self.not_found_view.render_to_string(environ, url=url)
resp = self.not_found_view.render_to_string(environ, url=url, err_msg="Not Found")

return WbResponse.text_response(resp, status='404 Not Found', content_type='text/html')

@@ -704,6 +705,8 @@ def _do_req(self, inputreq, wb_url, kwargs, skip_record):
headers = {'Content-Length': str(len(req_data)),
'Content-Type': 'application/request'}

headers.update(inputreq.warcserver_headers)

if skip_record:
headers['Recorder-Skip'] = '1'

3 changes: 3 additions & 0 deletions pywb/indexer/archiveindexer.py
@@ -75,6 +75,9 @@ def merge_request_data(self, other, options):
self['urlkey'] = canonicalize(new_url, surt_ordered)
other['urlkey'] = self['urlkey']

self['method'] = post_query.method
self['requestBody'] = post_query.query

referer = other.record.http_headers.get_header('referer')
if referer:
self['_referer'] = referer
26 changes: 13 additions & 13 deletions pywb/indexer/test/test_indexing.py
@@ -101,9 +101,9 @@
# post append
>>> print_cdx_index('post-test.warc.gz', append_post=True)
CDX N b a m s k r M S V g
org,httpbin)/post?foo=bar&test=abc 20140610000859 http://httpbin.org/post application/json 200 M532K5WS4GY2H4OVZO6HRPOP47A7KDWU - - 720 0 post-test.warc.gz
org,httpbin)/post?a=1&b=[]&c=3 20140610001151 http://httpbin.org/post application/json 200 M7YCTM7HS3YKYQTAWQVMQSQZBNEOXGU2 - - 723 1196 post-test.warc.gz
org,httpbin)/post?data=^&foo=bar 20140610001255 http://httpbin.org/post?foo=bar application/json 200 B6E5P6JUZI6UPDTNO4L2BCHMGLTNCUAJ - - 723 2395 post-test.warc.gz
org,httpbin)/post?__wb_method=post&foo=bar&test=abc 20140610000859 http://httpbin.org/post application/json 200 M532K5WS4GY2H4OVZO6HRPOP47A7KDWU - - 720 0 post-test.warc.gz
org,httpbin)/post?__wb_method=post&a=1&b=[]&c=3 20140610001151 http://httpbin.org/post application/json 200 M7YCTM7HS3YKYQTAWQVMQSQZBNEOXGU2 - - 723 1196 post-test.warc.gz
org,httpbin)/post?__wb_method=post&data=^&foo=bar 20140610001255 http://httpbin.org/post?foo=bar application/json 200 B6E5P6JUZI6UPDTNO4L2BCHMGLTNCUAJ - - 723 2395 post-test.warc.gz
# no post append, requests included
>>> print_cdx_index('post-test.warc.gz', include_all=True)
@@ -118,12 +118,12 @@
# post append + requests included
>>> print_cdx_index('post-test.warc.gz', include_all=True, append_post=True)
CDX N b a m s k r M S V g
org,httpbin)/post?foo=bar&test=abc 20140610000859 http://httpbin.org/post application/json 200 M532K5WS4GY2H4OVZO6HRPOP47A7KDWU - - 720 0 post-test.warc.gz
org,httpbin)/post?foo=bar&test=abc 20140610000859 http://httpbin.org/post application/x-www-form-urlencoded - - - - 476 720 post-test.warc.gz
org,httpbin)/post?a=1&b=[]&c=3 20140610001151 http://httpbin.org/post application/json 200 M7YCTM7HS3YKYQTAWQVMQSQZBNEOXGU2 - - 723 1196 post-test.warc.gz
org,httpbin)/post?a=1&b=[]&c=3 20140610001151 http://httpbin.org/post application/x-www-form-urlencoded - - - - 476 1919 post-test.warc.gz
org,httpbin)/post?data=^&foo=bar 20140610001255 http://httpbin.org/post?foo=bar application/json 200 B6E5P6JUZI6UPDTNO4L2BCHMGLTNCUAJ - - 723 2395 post-test.warc.gz
org,httpbin)/post?data=^&foo=bar 20140610001255 http://httpbin.org/post?foo=bar application/x-www-form-urlencoded - - - - 475 3118 post-test.warc.gz
org,httpbin)/post?__wb_method=post&foo=bar&test=abc 20140610000859 http://httpbin.org/post application/json 200 M532K5WS4GY2H4OVZO6HRPOP47A7KDWU - - 720 0 post-test.warc.gz
org,httpbin)/post?__wb_method=post&foo=bar&test=abc 20140610000859 http://httpbin.org/post application/x-www-form-urlencoded - - - - 476 720 post-test.warc.gz
org,httpbin)/post?__wb_method=post&a=1&b=[]&c=3 20140610001151 http://httpbin.org/post application/json 200 M7YCTM7HS3YKYQTAWQVMQSQZBNEOXGU2 - - 723 1196 post-test.warc.gz
org,httpbin)/post?__wb_method=post&a=1&b=[]&c=3 20140610001151 http://httpbin.org/post application/x-www-form-urlencoded - - - - 476 1919 post-test.warc.gz
org,httpbin)/post?__wb_method=post&data=^&foo=bar 20140610001255 http://httpbin.org/post?foo=bar application/json 200 B6E5P6JUZI6UPDTNO4L2BCHMGLTNCUAJ - - 723 2395 post-test.warc.gz
org,httpbin)/post?__wb_method=post&data=^&foo=bar 20140610001255 http://httpbin.org/post?foo=bar application/x-www-form-urlencoded - - - - 475 3118 post-test.warc.gz
# post append + minimal = error
>>> print_cdx_index('example.arc.gz', append_post=True, minimal=True)
@@ -509,8 +509,8 @@ def test_multipart_form():
print(buff.getvalue())
assert buff.getvalue() == b"""\
CDX N b a m s k r M S V g
com,example)/ajax/bz?foo=bar&q=[{"websessionid":"pb2tr7:vx83uz:fdi8ta","user":"0"}] 20201119195434 https://example.com/ajax/bz?foo=bar unk text/html; 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - - 420 0 test.warc.gz
com,example)/ajax/bz?foo=bar&q=[{"websessionid":"pb2tr7:vx83uz:fdi8ta","user":"0"}] 20201119195434 https://example.com/ajax/bz?foo=bar multipart/form-data - - - - 701 428 test.warc.gz
com,example)/ajax/bz?__wb_method=post&foo=bar&q=[{"websessionid":"pb2tr7:vx83uz:fdi8ta","user":"0"}] 20201119195434 https://example.com/ajax/bz?foo=bar unk text/html; 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - - 420 0 test.warc.gz
com,example)/ajax/bz?__wb_method=post&foo=bar&q=[{"websessionid":"pb2tr7:vx83uz:fdi8ta","user":"0"}] 20201119195434 https://example.com/ajax/bz?foo=bar multipart/form-data - - - - 701 428 test.warc.gz
"""


@@ -556,8 +556,8 @@ def test_multipart_form_no_boundary():
write_cdx_index(buff, test_record, 'test.warc.gz', **options)
assert buff.getvalue() == b"""\
CDX N b a m s k r M S V g
com,connatix,capi)/core/story?__wb_post_data=eyj0zxh0ijogimrlzmf1bhqifq==&v=77797 20201119140252 https://capi.connatix.com/core/story?v=77797 unk multipart/form-data SIGZ3RJW5J7DUKEZ4R7RSYUZNGLETIS5 - - 453 0 test.warc.gz
com,connatix,capi)/core/story?__wb_post_data=eyj0zxh0ijogimrlzmf1bhqifq==&v=77797 20201119140252 https://capi.connatix.com/core/story?v=77797 multipart/form-data - - - - 500 461 test.warc.gz
com,connatix,capi)/core/story?__wb_method=post&__wb_post_data=eyj0zxh0ijogimrlzmf1bhqifq==&v=77797 20201119140252 https://capi.connatix.com/core/story?v=77797 unk multipart/form-data SIGZ3RJW5J7DUKEZ4R7RSYUZNGLETIS5 - - 453 0 test.warc.gz
com,connatix,capi)/core/story?__wb_method=post&__wb_post_data=eyj0zxh0ijogimrlzmf1bhqifq==&v=77797 20201119140252 https://capi.connatix.com/core/story?v=77797 multipart/form-data - - - - 500 461 test.warc.gz
"""


23 changes: 14 additions & 9 deletions pywb/manager/aclmanager.py
@@ -12,7 +12,7 @@
class ACLManager(CollectionsManager):
SURT_RX = re.compile('([^:.]+[,)])+')

VALID_ACCESS = ('allow', 'block', 'exclude')
VALID_ACCESS = ('allow', 'block', 'exclude', 'allow_ignore_embargo')

DEFAULT_FILE = 'access-rules.aclj'

@@ -167,9 +167,9 @@ def add_rule(self, r):
:param argparse.Namespace r: The argparse namespace representing the rule to be added
:rtype: None
"""
return self._add_rule(r.url, r.access, r.exact_match)
return self._add_rule(r.url, r.access, r.exact_match, r.user)

def _add_rule(self, url, access, exact_match=False):
def _add_rule(self, url, access, exact_match=False, user=None):
"""Adds a rule to the acl file
:param str url: The URL for the rule
@@ -185,12 +185,14 @@ def _add_rule(self, url, access, exact_match=False):
acl['timestamp'] = '-'
acl['access'] = access
acl['url'] = url
if user:
acl['user'] = user

i = 0
replace = False

for rule in self.rules:
if acl['urlkey'] == rule['urlkey'] and acl['timestamp'] == rule['timestamp']:
if acl['urlkey'] == rule['urlkey'] and acl['timestamp'] == rule['timestamp'] and acl.get('user') == rule.get('user'):
replace = True
break

@@ -255,7 +257,7 @@ def remove_rule(self, r):
i = 0
urlkey = self.to_key(r.url, r.exact_match)
for rule in self.rules:
if urlkey == rule['urlkey']:
if urlkey == rule['urlkey'] and r.user == rule.get('user'):
acl = self.rules.pop(i)
print('Removed Rule:')
self.print_rule(acl)
@@ -285,7 +287,7 @@ def find_match(self, r):
:rtype: None
"""
access_checker = AccessChecker(self.acl_file, '<default>')
rule = access_checker.find_access_rule(r.url)
rule = access_checker.find_access_rule(r.url, acl_user=r.user)

print('Matched rule:')
print('')
@@ -344,15 +346,18 @@ def command(name, *args, **kwargs):
else:
op.add_argument(arg)

if kwargs.get('user_opt'):
op.add_argument('-u', '--user')

if kwargs.get('exact_opt'):
op.add_argument('-e', '--exact-match', action='store_true', default=False)

op.set_defaults(acl_func=kwargs['func'])

command('add', 'coll_name', 'url', 'access', func=cls.add_rule, exact_opt=True)
command('remove', 'coll_name', 'url', func=cls.remove_rule, exact_opt=True)
command('add', 'coll_name', 'url', 'access', func=cls.add_rule, exact_opt=True, user_opt=True)
command('remove', 'coll_name', 'url', func=cls.remove_rule, exact_opt=True, user_opt=True)
command('list', 'coll_name', func=cls.list_rules)
command('validate', 'coll_name', func=cls.validate_save)
command('match', 'coll_name', 'url', 'default_access', func=cls.find_match)
command('match', 'coll_name', 'url', 'default_access', func=cls.find_match, user_opt=True)
command('importtxt', 'coll_name', 'filename', 'access', func=cls.add_excludes)
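
The effect of adding `user` to the replace check in `_add_rule` can be illustrated with a small standalone sketch. `find_existing_rule` is a hypothetical helper written for this note; the real method works on `self.rules` inline, and the sample rules below are made up.

```python
def find_existing_rule(rules, acl):
    """Return the index of the rule that `acl` would replace, or None."""
    for index, rule in enumerate(rules):
        same_key = acl["urlkey"] == rule["urlkey"]
        same_time = acl["timestamp"] == rule["timestamp"]
        # Rules without a 'user' field only match other user-less rules
        same_user = acl.get("user") == rule.get("user")
        if same_key and same_time and same_user:
            return index
    return None


# Illustrative rules: a default rule and a per-user rule for the same URL key
rules = [
    {"urlkey": "com,example)/", "timestamp": "-", "access": "block"},
    {"urlkey": "com,example)/", "timestamp": "-", "access": "allow", "user": "staff"},
]

staff_index = find_existing_rule(
    rules, {"urlkey": "com,example)/", "timestamp": "-", "access": "block", "user": "staff"}
)
guest_index = find_existing_rule(
    rules, {"urlkey": "com,example)/", "timestamp": "-", "access": "allow", "user": "guest"}
)
default_index = find_existing_rule(
    rules, {"urlkey": "com,example)/", "timestamp": "-", "access": "allow"}
)
```

Under these assumptions a rule for `staff` replaces only the existing `staff` rule, a rule for a new user replaces nothing, and a rule without a user replaces only the default rule.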

111 changes: 111 additions & 0 deletions pywb/manager/locmanager.py
@@ -0,0 +1,111 @@
import os
import os.path
import shutil

from babel.messages.frontend import CommandLineInterface

from translate.convert.po2csv import main as po2csv
from translate.convert.csv2po import main as csv2po


ROOT_DIR = 'i18n'

TRANSLATIONS = os.path.join(ROOT_DIR, 'translations')

MESSAGES = os.path.join(ROOT_DIR, 'messages.pot')

# ============================================================================
class LocManager:
def process(self, r):
if r.name == 'list':
r.loc_func(self)
elif r.name == 'remove':
r.loc_func(self, r.locale)
else:
r.loc_func(self, r.locale, r.no_csv)

def extract_loc(self, locale, no_csv):
self.extract_text()

for loc in locale:
loc_dir = os.path.join(TRANSLATIONS, loc)
if os.path.isdir(loc_dir):
self.update_catalog(loc)
else:
os.makedirs(loc_dir)
self.init_catalog(loc)

if not no_csv:
base = os.path.join(TRANSLATIONS, loc, 'LC_MESSAGES')
po = os.path.join(base, 'messages.po')
csv = os.path.join(base, 'messages.csv')
po2csv([po, csv])

self.compile_catalog()

def update_loc(self, locale, no_csv):
for loc in locale:
if not no_csv:
loc_dir = os.path.join(TRANSLATIONS, loc)
base = os.path.join(TRANSLATIONS, loc, 'LC_MESSAGES')
po = os.path.join(base, 'messages.po')
csv = os.path.join(base, 'messages.csv')

if os.path.isfile(csv):
csv2po([csv, po])

self.compile_catalog()

def remove_loc(self, locale):
for loc in locale:
loc_dir = os.path.join(TRANSLATIONS, loc)
if not os.path.isdir(loc_dir):
print('Locale "{0}" does not exist'.format(loc))
return

shutil.rmtree(loc_dir)
print('Removed locale "{0}"'.format(loc))

def list_loc(self):
print('Current locales:')
print('\n'.join(' - ' + x for x in os.listdir(TRANSLATIONS)))
print('')

def extract_text(self):
os.makedirs(ROOT_DIR, exist_ok=True)

CommandLineInterface().run(['pybabel', 'extract', '-F', 'babel.ini', '-k', '_ _Q gettext ngettext', '-o', MESSAGES, './', '--omit-header'])

def init_catalog(self, loc):
CommandLineInterface().run(['pybabel', 'init', '-l', loc, '-i', MESSAGES, '-d', TRANSLATIONS])

def update_catalog(self, loc):
CommandLineInterface().run(['pybabel', 'update', '-l', loc, '-i', MESSAGES, '-d', TRANSLATIONS, '--previous'])

def compile_catalog(self):
CommandLineInterface().run(['pybabel', 'compile', '-d', TRANSLATIONS])


@classmethod
def init_parser(cls, parser):
"""Initializes an argument parser for i18n (localization) commands
:param argparse.ArgumentParser parser: The parser to be initialized
:rtype: None
"""
subparsers = parser.add_subparsers(dest='op')
subparsers.required = True

def command(name, func):
op = subparsers.add_parser(name)
if name != 'list':
op.add_argument('locale', nargs='+')
if name != 'remove':
op.add_argument('--no-csv', action='store_true')

op.set_defaults(loc_func=func, name=name)

command('extract', cls.extract_loc)
command('update', cls.update_loc)
command('remove', cls.remove_loc)
command('list', cls.list_loc)
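
The argument layout that `init_parser` sets up can be checked with a standalone `argparse` sketch. `build_i18n_parser` is a hypothetical function for illustration; it registers the same positional and flag combinations as the diff but does not attach the `LocManager` methods, and the `prog` string is an assumption.

```python
from argparse import ArgumentParser


def build_i18n_parser():
    """Build a parser with the same sub-commands as LocManager.init_parser."""
    parser = ArgumentParser(prog="wb-manager i18n")
    subparsers = parser.add_subparsers(dest="op")
    subparsers.required = True

    for name in ("extract", "update", "remove", "list"):
        op = subparsers.add_parser(name)
        # Every command except 'list' takes one or more locales
        if name != "list":
            op.add_argument("locale", nargs="+")
        # Every command except 'remove' accepts --no-csv
        if name != "remove":
            op.add_argument("--no-csv", action="store_true")
        op.set_defaults(name=name)

    return parser


parser = build_i18n_parser()
extract_args = parser.parse_args(["extract", "fr", "de", "--no-csv"])
remove_args = parser.parse_args(["remove", "fr"])
list_args = parser.parse_args(["list"])
```

This shows why `process` dispatches on `r.name`: `remove` has no `no_csv` attribute and `list` has no `locale` attribute, so each command needs a different call signature.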
30 changes: 29 additions & 1 deletion pywb/manager/manager.py
@@ -8,7 +8,7 @@
import six

from distutils.util import strtobool
from pkg_resources import resource_string
from pkg_resources import resource_string, get_distribution

from argparse import ArgumentParser, RawTextHelpFormatter

@@ -28,8 +28,12 @@ def get_input(msg): # pragma: no cover
return input(msg)

#=============================================================================
def get_version():
"""Get the installed pywb version string"""
return "wb-manager " + get_distribution("pywb").version


#=============================================================================
class CollectionsManager(object):
""" This utility is designed to
simplify the creation and management of web archive collections
@@ -335,6 +339,8 @@ def main(args=None):
# epilog=epilog,
formatter_class=RawTextHelpFormatter)

parser.add_argument("-V", "--version", action="version", version=get_version())

subparsers = parser.add_subparsers(dest='type')
subparsers.required = True

@@ -441,6 +447,28 @@ def do_acl(r):
ACLManager.init_parser(acl)
acl.set_defaults(func=do_acl)

# LOC
loc_avail = False
try:
from pywb.manager.locmanager import LocManager
loc_avail = True
except:
pass

def do_loc(r):
if not loc_avail:
print("You must install i18n extensions with 'pip install pywb[i18n]' to use localization features")
return

loc = LocManager()
loc.process(r)

loc_help = 'Generate strings for i18n/localization'
loc = subparsers.add_parser('i18n', help=loc_help)
if loc_avail:
LocManager.init_parser(loc)
loc.set_defaults(func=do_loc)

# Parse
r = parser.parse_args(args=args)
r.func(r)
3 changes: 1 addition & 2 deletions pywb/recorder/recorderapp.py
@@ -24,8 +24,7 @@ def __init__(self, upstream_host, writer, skip_filters=None, **kwargs):

self.rec_source_name = kwargs.get('name', 'recorder')

self.create_buff_func = kwargs.get('create_buff_func',
self.default_create_buffer)
self.create_buff_func = kwargs.get('create_buff_func') or self.default_create_buffer

self.write_queue = gevent.queue.Queue()
gevent.spawn(self._write_loop)
27 changes: 27 additions & 0 deletions pywb/recorder/redisindexer.py
@@ -2,6 +2,7 @@

from io import BytesIO
import os
import tempfile

from pywb.utils.canonicalize import calc_search_range
from pywb.utils.format import res_template
@@ -101,3 +102,29 @@ def lookup_revisit(self, lookup_params, digest, url, iso_dt):
return res

return None


# ============================================================================
class RedisPendingCounterTempBuffer(tempfile.SpooledTemporaryFile):
def __init__(self, max_size, redis_url, params, name, timeout=30):
redis_url = res_template(redis_url, params)
super(RedisPendingCounterTempBuffer, self).__init__(max_size=max_size)
self.redis, self.key = RedisIndexSource.parse_redis_url(redis_url)
self.timeout = timeout

self.redis.incrby(self.key, 1)
self.redis.expire(self.key, self.timeout)

def write(self, buf):
super(RedisPendingCounterTempBuffer, self).write(buf)
self.redis.expire(self.key, self.timeout)

def close(self):
try:
super(RedisPendingCounterTempBuffer, self).close()
except:
traceback.print_exc()

self.redis.incrby(self.key, -1)
self.redis.expire(self.key, self.timeout)
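
The counting pattern used by `RedisPendingCounterTempBuffer` can be sketched without a Redis server. Everything named here is an assumption for illustration: `FakeRedis` is a hypothetical in-memory stand-in that implements only `incrby` and `expire`, `PendingCounterBuffer` takes a client and key directly instead of parsing a Redis URL, and the `closed` guard in `close()` is an addition that is not in the diff.

```python
import tempfile


class FakeRedis:
    """Hypothetical in-memory stand-in for a Redis client (illustration only)."""

    def __init__(self):
        self.values = {}
        self.ttls = {}

    def incrby(self, key, amount):
        self.values[key] = self.values.get(key, 0) + amount
        return self.values[key]

    def expire(self, key, seconds):
        self.ttls[key] = seconds


class PendingCounterBuffer(tempfile.SpooledTemporaryFile):
    """Temp buffer that counts itself as 'pending' while it is open."""

    def __init__(self, max_size, client, key, timeout=30):
        super().__init__(max_size=max_size)
        self.client = client
        self.key = key
        self.timeout = timeout

        # One more pending buffer; the key expires if the writer stalls
        self.client.incrby(self.key, 1)
        self.client.expire(self.key, self.timeout)

    def write(self, buf):
        written = super().write(buf)
        # Each write refreshes the expiry
        self.client.expire(self.key, self.timeout)
        return written

    def close(self):
        if self.closed:
            return  # guard (not in the diff) against decrementing twice
        try:
            super().close()
        finally:
            self.client.incrby(self.key, -1)
            self.client.expire(self.key, self.timeout)


client = FakeRedis()
buf = PendingCounterBuffer(1024, client, "pending:example")
buf.write(b"record data")
pending_while_open = client.values["pending:example"]
buf.close()
pending_after_close = client.values["pending:example"]
```

The counter is 1 while the buffer is open and returns to 0 after `close()`, which is what lets another process poll the key to see whether writes are still in flight.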

14 changes: 7 additions & 7 deletions pywb/rewrite/regex_rewriters.py
@@ -13,8 +13,8 @@ def remove_https(string, _):
return string.replace("https", "http")

@staticmethod
def replace_str(replacer):
return lambda x, _: x.replace('this', replacer)
def replace_str(replacer, match='this'):
return lambda x, _: x.replace(match, replacer)

@staticmethod
def format(template):
@@ -100,10 +100,10 @@ def __init__(self):
prop_str = '|'.join(self.local_objs)

rules = [
# rewriting 'eval(....)' - invocation
(r'(?<![$])\beval\s*\(', self.add_prefix('WB_wombat_runEval(function _____evalIsEvil(_______eval_arg$$) { return eval(_______eval_arg$$); }.bind(this)).'), 0),
# rewriting 'eval(...)' - invocation
(r'(?<!function\s)(?:^|[^,$])eval\s*\(', self.replace_str('WB_wombat_runEval(function _____evalIsEvil(_______eval_arg$$) { return eval(_______eval_arg$$); }.bind(this)).eval', 'eval'), 0),
# rewriting 'x = eval' - no invocation
(r'(?<![$])\beval\b', self.add_prefix('WB_wombat_'), 0),
(r'(?<=[=,])\s*\beval\b\s*(?![(:.$])', self.replace_str('self.eval', 'eval'), 0),
(r'(?<=\.)postMessage\b\(', self.add_prefix('__WB_pmw(self).'), 0),
(r'(?<![$.])\s*location\b\s*[=]\s*(?![=])', self.add_suffix(check_loc), 0),
# rewriting 'return this'
@@ -122,9 +122,9 @@ def __init__(self):

super(JSWombatProxyRules, self).__init__(rules)

self.first_buff = local_init_func + local_declares + '\n\n'
self.first_buff = local_init_func + local_declares + '\n\n{'

self.last_buff = '\n\n}'
self.last_buff = '\n\n}}'


# =================================================================
6 changes: 6 additions & 0 deletions pywb/rewrite/rewriteinputreq.py
@@ -26,6 +26,7 @@ def __init__(self, env, urlkey, url, rewriter):
self.url = url
self.rewriter = rewriter
self.extra_cookie = None
self.warcserver_headers = {}

is_proxy = ('wsgiprox.proxy_host' in env)

@@ -82,6 +83,11 @@ def get_req_headers(self):
elif name in ('HTTP_IF_MODIFIED_SINCE', 'HTTP_IF_UNMODIFIED_SINCE'):
continue

elif name == 'HTTP_X_PYWB_ACL_USER':
name = name[5:].title().replace('_', '-')
self.warcserver_headers[name] = value
continue

elif name == 'HTTP_X_FORWARDED_PROTO':
name = 'X-Forwarded-Proto'
if self.splits:
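
The name conversion applied to `HTTP_X_PYWB_ACL_USER` in this hunk can be tried on its own. This is a small sketch: `environ_key_to_header` and `collect_warcserver_headers` are hypothetical names, and only the string expression `name[5:].title().replace('_', '-')` is taken from the diff.

```python
def environ_key_to_header(name):
    """Convert a WSGI key such as HTTP_X_PYWB_ACL_USER to a header name."""
    # Drop the 'HTTP_' prefix, title-case each part, and swap '_' for '-'
    return name[5:].title().replace("_", "-")


def collect_warcserver_headers(environ):
    """Pick out the ACL user header that is forwarded to the warcserver."""
    forwarded = {}
    for name, value in environ.items():
        if name == "HTTP_X_PYWB_ACL_USER":
            forwarded[environ_key_to_header(name)] = value
    return forwarded


headers = collect_warcserver_headers(
    {"HTTP_X_PYWB_ACL_USER": "staff", "HTTP_ACCEPT": "*/*"}
)
```

With this input, `headers` contains a single `X-Pywb-Acl-User` entry. In the diff, that dictionary is later merged into the warcserver request headers in `_do_req`.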
28 changes: 19 additions & 9 deletions pywb/rewrite/templateview.py
@@ -5,18 +5,17 @@

from six.moves.urllib.parse import urlsplit, quote

from jinja2 import Environment, TemplateNotFound, contextfunction
from jinja2 import Environment, TemplateNotFound, contextfunction, select_autoescape
from jinja2 import FileSystemLoader, PackageLoader, ChoiceLoader

from babel.support import Translations

from webassets.ext.jinja2 import AssetsExtension
from webassets.loaders import YAMLLoader
from webassets.env import Resolver

from pkg_resources import resource_filename

import os
import logging

try:
import ujson as json
@@ -77,10 +76,12 @@ def __init__(self, paths=None,

if overlay:
jinja_env = overlay.jinja_env.overlay(loader=loader,
autoescape=select_autoescape(),
trim_blocks=True,
extensions=extensions)
else:
jinja_env = RelEnvironment(loader=loader,
autoescape=select_autoescape(),
trim_blocks=True,
extensions=extensions)

@@ -98,6 +99,8 @@ def __init__(self, paths=None,
assets_env.resolver = PkgResResolver()
jinja_env.assets_environment = assets_env

self.default_locale = ''

def _make_loaders(self, paths, packages):
"""Initialize the template loaders based on the supplied paths and packages.
@@ -117,16 +120,22 @@ def _make_loaders(self, paths, packages):

return loaders

def init_loc(self, locales_root_dir, locales, loc_map):
def init_loc(self, locales_root_dir, locales, loc_map, default_locale):
locales = locales or []
locales_root_dir = locales_root_dir or os.path.join('i18n', 'translations')
default_locale = default_locale or 'en'
self.default_locale = default_locale

if locales_root_dir:
for loc in locales:
loc_map[loc] = Translations.load(locales_root_dir, [loc, 'en'])
#jinja_env.jinja_env.install_gettext_translations(translations)
if locales:
try:
from babel.support import Translations
for loc in locales:
loc_map[loc] = Translations.load(locales_root_dir, [loc, default_locale])
except:
logging.warning("Ignoring Locales. You must install i18n extensions with 'pip install pywb[i18n]' to use localization features")

def get_translate(context):
loc = context.get('env', {}).get('pywb_lang')
loc = context.get('env', {}).get('pywb_lang', default_locale)
return loc_map.get(loc)

def override_func(jinja_env, name):
@@ -160,6 +169,7 @@ def quote_gettext(context, text):

self.jinja_env.globals['locales'] = list(loc_map.keys())
self.jinja_env.globals['_Q'] = quote_gettext
self.jinja_env.globals['default_locale'] = default_locale

@contextfunction
def switch_locale(context, locale):
28 changes: 25 additions & 3 deletions pywb/rewrite/test/test_regex_rewriters.py
@@ -218,20 +218,42 @@
>>> _test_js_obj_proxy('eval(a)')
'WB_wombat_runEval(function _____evalIsEvil(_______eval_arg$$) { return eval(_______eval_arg$$); }.bind(this)).eval(a)'
>>> _test_js_obj_proxy(',eval(a)')
',eval(a)'
>>> _test_js_obj_proxy('this.$eval(a)')
'this.$eval(a)'
>>> _test_js_obj_proxy('x = this.$eval; x(a);')
'x = this.$eval; x(a);'
>>> _test_js_obj_proxy('x = eval; x(a);')
'x = WB_wombat_eval; x(a);'
'x = self.eval; x(a);'
>>> _test_js_obj_proxy('$eval = eval; $eval(a);')
'$eval = WB_wombat_eval; $eval(a);'
'$eval = self.eval; $eval(a);'
>>> _test_js_obj_proxy('foo(a, eval(data));')
'foo(a, WB_wombat_runEval(function _____evalIsEvil(_______eval_arg$$) { return eval(_______eval_arg$$); }.bind(this)).eval(data));'
>>> _test_js_obj_proxy('function eval() {}')
'function eval() {}'
>>> _test_js_obj_proxy('window.eval(a);')
'window.WB_wombat_runEval(function _____evalIsEvil(_______eval_arg$$) { return eval(_______eval_arg$$); }.bind(this)).eval(a);'
'window.eval(a);'
>>> _test_js_obj_proxy('x = window.eval; x(a);')
'x = window.eval; x(a);'
>>> _test_js_obj_proxy('obj = { eval : 1 }')
'obj = { eval : 1 }'
>>> _test_js_obj_proxy('x = obj.eval')
'x = obj.eval'
>>> _test_js_obj_proxy('x = obj.eval(a)')
'x = obj.eval(a)'
#=================================================================
# XML Rewriting
14 changes: 12 additions & 2 deletions pywb/rules.yaml
@@ -196,6 +196,11 @@ rules:

- url_prefix: 'com,instagram)/'

rewrite:
js_regexs:
- match: '"is_dash_eligible":true'
replace: '"is_dash_eligible":false'

fuzzy_lookup: '()'


@@ -410,11 +415,16 @@ rules:
- action_load_comments
- filter

- url_prefix: ['com,youtube)/youtubei', 'com,youtube-nocookie)/youtubei']
- url_prefix: ['com,youtube)/embed', 'com,youtube-nocookie)/embed']

fuzzy_lookup:
match: '()'

- url_prefix: ['com,youtube)/youtubei/v1', 'com,youtube-nocookie)/youtubei/v1']

fuzzy_lookup:
- videoid

- url_prefix: 'com,googlevideo,'

fuzzy_lookup:
@@ -466,7 +476,7 @@ rules:
- match: '(?:"player":|ytplayer\.config).*"args":\s*{'
replace: '{0}"dash":"0","dashmpd":"",'

- match: '"0"==\w+\.dash\&\&'
- match: '"0"\s*?==\s*?\w+\.dash\&\&'
replace: '1&&'


2 changes: 1 addition & 1 deletion pywb/static/autoFetchWorker.js
@@ -107,7 +107,7 @@ function fetchDone() {
}

function fetchErrored(err) {
console.warn("Fetch Failed: " + err);
console.warn('Fetch Failed: ' + err);
fetchDone();
}

4 changes: 2 additions & 2 deletions pywb/static/default_banner.js
@@ -182,7 +182,7 @@ This file is part of pywb, https://github.com/webrecorder/pywb
ancillaryLinks.appendChild(calendarLink);
this.calendarLink = calendarLink;

if (typeof window.banner_info.locales !== "undefined" && window.banner_info.locales.length) {
if (typeof window.banner_info.locales !== "undefined" && window.banner_info.locales.length > 1) {
var locales = window.banner_info.locales;
var languages = document.createElement("div");

@@ -317,4 +317,4 @@ This file is part of pywb, https://github.com/webrecorder/pywb
}
}

})();
})();
2 changes: 1 addition & 1 deletion pywb/static/search.js
@@ -122,7 +122,7 @@ function clearFilters(event) {
}

function performQuery(url) {
var query = [window.wb_prefix + '*?url=' + url];
var query = [window.wb_prefix + '*?url=' + encodeURIComponent(url)];
var filterExpressions = document.getElementById(elemIds.filtering.list)
.children;
if (filterExpressions.length) {
2 changes: 1 addition & 1 deletion pywb/static/wombat.js

Large diffs are not rendered by default.

3 changes: 2 additions & 1 deletion pywb/templates/banner.html
@@ -1,4 +1,5 @@
{% if not env.pywb_proxy_magic or config.proxy.enable_banner | default(true) %}
{% autoescape false %}
<script>
window.banner_info = {
is_gmt: true,
@@ -24,5 +25,5 @@
<script src='{{ static_prefix }}/default_banner.js'> </script>
<link rel='stylesheet' href='{{ static_prefix }}/default_banner.css'/>


{% endautoescape %}
{% endif %}
3 changes: 2 additions & 1 deletion pywb/templates/base.html
@@ -1,5 +1,5 @@
<!DOCTYPE html>
<html lang="{{ env.pywb_lang | default('en') }}">
<html lang="{{ env.pywb_lang | default(default_locale) }}">
<head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8;charset=utf-8"/>
<meta name="viewport" content="width=device-width, initial-scale=1">
@@ -9,6 +9,7 @@
<!-- jquery and bootstrap dependencies query view -->
<link rel="stylesheet" href="{{ static_prefix }}/css/bootstrap.min.css"/>
<link rel="stylesheet" href="{{ static_prefix }}/css/font-awesome.min.css">
<link rel="stylesheet" href="{{ static_prefix }}/css/base.css">

<script src="{{ static_prefix }}/js/jquery-latest.min.js"></script>
<script src="{{ static_prefix }}/js/bootstrap.min.js"></script>
12 changes: 6 additions & 6 deletions pywb/templates/error.html
@@ -1,5 +1,5 @@
{% extends "base.html" %}
{% block title %}Pywb Error{% endblock %}
{% block title %}{{ _('Pywb Error') }}{% endblock %}
{% block body %}
<div class="container text-danger">
<div class="row justify-content-center">
@@ -8,22 +8,22 @@ <h2 class="display-2">Pywb Error</h2>
<div class="row">
<div class="col-12 text-center">
{% if err_status == 451 %}
<p class="lead">Access Blocked to {{ err_msg }}</p>
<p class="lead">{% trans %}Access Blocked to {{ err_msg }}{% endtrans %}</p>

{% elif err_status == 404 and err_details == 'coll_not_found' %}
<p>Collection not found: <b>{{ err_msg }}</b></p>
<p>{% trans %}Collection not found: <b>{{ err_msg }}</b>{% endtrans %}</p>

<p><a href="/">See list of valid collections</a></p>
<p><a href="/{{ env.pywb_lang | default('') }}">{{ _('See list of valid collections') }}</a></p>

{% elif err_status == 404 and err_details == 'static_file_not_found' %}
<p>Static file not found: <b>{{ err_msg }}</b></p>
<p>{% trans %}Static file not found: <b>{{ err_msg }}</b>{% endtrans %}</p>

{% else %}

<p class="lead">{{ err_msg }}</p>

{% if err_details %}
<p class="lead">Error Details:</p>
<p class="lead">{% trans %}Error Details:{% endtrans %}</p>
<pre>{{ err_details }}</pre>
{% endif %}
{% endif %}
4 changes: 4 additions & 0 deletions pywb/templates/frame_insert.html
@@ -14,8 +14,12 @@
</style>
<script src='{{ static_prefix }}/wb_frame.js'> </script>

{% autoescape false %}

{{ banner_html }}

{% endautoescape %}

</head>
<body style="margin: 0px; padding: 0px;">

4 changes: 4 additions & 0 deletions pywb/templates/head_insert.html
@@ -1,3 +1,5 @@
{% autoescape false %}

<!-- WB Insert -->
<script>
{% set urlsplit = cdx.url | urlsplit %}
@@ -61,5 +63,7 @@

{{ banner_html }}

{% endautoescape %}

<!-- End WB Insert -->

13 changes: 13 additions & 0 deletions pywb/templates/header.html
@@ -0,0 +1,13 @@
<header>
{% if not err_msg and locales|length > 1 %}
<div class="language-select">
{{ _('Language:') }}
<ul role="listbox" aria-activedescendant="{{ env.pywb_lang | default(default_locale) }}" aria-label="{{ _('Language select') }}">
{% for locale in locales %}
<li role="option" id="{{ locale }}"><a href="{{ switch_locale(locale) }}">{{ locale }}</a></li>
{% endfor %}
</ul>
</div>
{% endif %}
</header>

4 changes: 2 additions & 2 deletions pywb/templates/index.html
@@ -3,13 +3,13 @@
<div class="container">
<div class="row">
<h2 class="display-2">{{ _('Pywb Wayback Machine') }}</h2>
<p class="lead">This archive contains the following collections:</p>
<p class="lead">{{ _('This archive contains the following collections:') }}</p>
</div>
<div class="row">
<ul>
{% for route in routes %}
<li>
<a href="{{ env['pywb.app_prefix'] + '/' + route }}">{{ '/' + route }}</a>
<a href="{{ env['pywb.app_prefix'] + ('/' + env.pywb_lang if env.pywb_lang else '') + '/' + route }}">{{ '/' + route }}</a>
{% if all_metadata and all_metadata[route] %}
({{ all_metadata[route].title }})
{% endif %}
4 changes: 2 additions & 2 deletions pywb/templates/not_found.html
@@ -1,6 +1,6 @@
{% extends "base.html" %}

{% block title %}URL Not Found{% endblock %}
{% block title %}{{ _('URL Not Found') }}{% endblock %}

{% block body %}
<div class="container">
@@ -13,7 +13,7 @@ <h4>{% trans %}URL Not Found{% endtrans %}</h4>
{% if wbrequest and wbrequest.env.pywb_proxy_magic and url %}
<p>
<a href="//select.{{ wbrequest and wbrequest.env.pywb_proxy_magic }}/{{ url }}">
Try Different Collection
{{ _('Try Different Collection') }}
</a>
</p>
{% endif %}
66 changes: 33 additions & 33 deletions pywb/templates/search.html
@@ -13,7 +13,7 @@
<div class="container-fluid">
<div class="row justify-content-center">
<h4 class="display-4">
Collection {{ coll }} Search Page
{% trans %}Collection {{ coll }} Search Page{% endtrans %}
</h4>
</div>
</div>
@@ -27,17 +27,17 @@ <h4 class="display-4">
</label>
<input aria-label="url" aria-required="true" class="form-control form-control-lg" id="search-url"
name="search" placeholder="Enter a URL to search for"
title="Enter a URL to search for" type="search" required/>
title="{{ _('Enter a URL to search for') }}" type="search" required/>
<div class="invalid-feedback">
Please enter a URL
{% trans %}Please enter a URL{% endtrans %}
</div>
</div>
</div>
<div class="form-row mt-2">
<div class="col-5">
<div class="custom-control custom-checkbox custom-control">
<input type="checkbox" class="custom-control-input" id="open-results-new-window">
<label class="custom-control-label" for="open-results-new-window">Open results in new window</label>
<label class="custom-control-label" for="open-results-new-window">{{ _('Open results in new window') }}</label>
</div>
</div>
<div class="col-7">
@@ -47,101 +47,101 @@ <h4 class="display-4">
<button class="btn btn-outline-info float-right mr-3" type="button" role="button"
data-toggle="collapse" data-target="#advancedOptions"
aria-expanded="false" aria-controls="advancedOptions" aria-label="Advanced Search Options">
Advanced Search Options
{{ _('Advanced Search Options') }}
</button>
</div>
</div>
<div class="collapse mt-3" id="advancedOptions">
<div class="form-group form-row">
<label for="match-type-select" class="col-sm-2 col-form-label" aria-label="Match Type">
Match Type:
{{ _('Match Type:') }}
</label>
<select id="match-type-select" class="form-control form-control col-sm-6">
<option value=""></option>
<option value="prefix">Prefix</option>
<option value="host">Host</option>
<option value="domain">Domain</option>
<option value="prefix">{% trans %}Prefix{% endtrans %}</option>
<option value="host">{% trans %}Host{% endtrans %}</option>
<option value="domain">{% trans %}Domain{% endtrans %}</option>
</select>
</div>
<p style="cursor: help;">
<span data-toggle="tooltip" data-placement="right"
title="Restricts the results to the given date/time range (inclusive)">
Date/Time Range
{{ _('Date/Time Range') }}
</span>
</p>
<div class="form-row">
<div class="col-6">
<label class="sr-only" for="dt-from" aria-label="Date/Time Range From">From:</label>
<label class="sr-only" for="dt-from" aria-label="Date/Time Range From">{% trans %}From:{% endtrans %}</label>
<div class="input-group">
<div class="input-group-prepend">
<div class="input-group-text">From:</div>
<div class="input-group-text">{% trans %}From:{% endtrans %}</div>
</div>
<input id="dt-from" type="number" name="date-range-from" class="form-control"
pattern="^\d{4,14}$">
<div class="invalid-feedback" id="dt-from-bad">
Please enter a valid <b>From</b> timestamp. Timestamps may be 4 <= ts <=14 digits
{% trans %}Please enter a valid <b>From</b> timestamp. Timestamps may be 4 <= ts <=14 digits{% endtrans %}
</div>
</div>
</div>
<div class="col-6">
<label class="sr-only" for="dt-to" aria-label="Date/Time Range To">To:</label>
<label class="sr-only" for="dt-to" aria-label="Date/Time Range To">{% trans %}To:{% endtrans %}</label>
<div class="input-group">
<div class="input-group-prepend">
<div class="input-group-text">To:</div>
<div class="input-group-text">{% trans %}To:{% endtrans %}</div>
</div>
<input id="dt-to" type="number" name="date-range-to" class="form-control" pattern="^\d{4,14}$">
<div class="invalid-feedback" id="dt-to-bad">
Please enter a valid <b>To</b> timestamp. Timestamps may be 4 <= ts <=14 digits
{% trans %}Please enter a valid <b>To</b> timestamp. Timestamps may be 4 <= ts <=14 digits{% endtrans %}
</div>
</div>
</div>
</div>
<div class="form-group mt-3">
<div class="form-row">
<div class="col-6">
<p>Filtering</p>
<p>{% trans %}Filtering{% endtrans %}</p>
</div>
<div class="col-6">
<button id="clear-filters" class="btn btn-outline-warning float-right" type="button">
Clear Filters
{% trans %}Clear Filters{% endtrans %}
</button>
<button id="add-filter" class="btn btn-outline-secondary float-right mr-2" type="button">
Add Filter
{% trans %}Add Filter{% endtrans %}
</button>
</div>
</div>
<div class="form-row">
<div class="col-6">
<div class="row pb-1">
<label for="filter-by" class="col-form-label col-3">By:</label>
<label for="filter-by" class="col-form-label col-3">{% trans %}By:{% endtrans %}</label>
<select id="filter-by" class="form-control col-7">
<option value="" selected></option>
<option value="mime">Mime Type</option>
<option value="status">Status</option>
<option value="url">URL</option>
<option value="mime">{% trans %}Mime Type{% endtrans %}</option>
<option value="status">{% trans %}Status{% endtrans %}</option>
<option value="url">{% trans %}URL{% endtrans %}</option>
</select>
</div>
<div class="row pb-1">
<label for="filter-modifier" class="col-form-label col-3">How:</label>
<label for="filter-modifier" class="col-form-label col-3">{% trans %}How:{% endtrans %}</label>
<select id="filter-modifier" class="form-control col-7">
<option value="=">Contains</option>
<option value="==">Matches Exactly</option>
<option value="=~">Matches Regex</option>
<option value="=!">Does Not Contains</option>
<option value="=!=">Is Not</option>
<option value="=!~">Does Not Begins With</option>
<option value="=">{% trans %}Contains{% endtrans %}</option>
<option value="==">{% trans %}Matches Exactly{% endtrans %}</option>
<option value="=~">{% trans %}Matches Regex{% endtrans %}</option>
<option value="=!">{% trans %}Does Not Contain{% endtrans %}</option>
<option value="=!=">{% trans %}Is Not{% endtrans %}</option>
<option value="=!~">{% trans %}Does Not Begin With{% endtrans %}</option>
</select>
</div>
<div class="row">
<label for="filter-expression" class="col-form-label col-3">Expr:</label>
<label for="filter-expression" class="col-form-label col-3">{% trans %}Expr:{% endtrans %}</label>
<input type="text" id="filter-expression" class="form-control col-7"
placeholder="Enter an expression to filter by"
>
</div>
</div>
<div class="col-6">
<ul id="filter-list" class="filter-list">
<li id="filtering-nothing">No Filter</li>
<li id="filtering-nothing">{% trans %}No Filter{% endtrans %}</li>
</ul>
</div>
</div>
@@ -151,7 +151,7 @@ <h4 class="display-4">
</div>
{% if metadata %}
<div class="container mt-4 justify-content-center">
<p class="lead">Collection Metadata</p>
<p class="lead">{{ _('Collection Metadata') }}</p>
<div class="row">
<div class="col-4 pr-1">
<div class="list-group" id="collection-metadata" role="tablist">
2 changes: 1 addition & 1 deletion pywb/version.py
@@ -1,4 +1,4 @@
__version__ = '2.5.0'
__version__ = '2.6.0'

if __name__ == '__main__':
print(__version__)
148 changes: 133 additions & 15 deletions pywb/warcserver/access_checker.py
@@ -6,6 +6,9 @@
from pywb.utils.binsearch import search
from pywb.utils.merge import merge

from warcio.timeutils import timestamp_to_datetime
from datetime import datetime, timedelta
from dateutil.relativedelta import relativedelta
import os


@@ -78,12 +81,18 @@ class AccessChecker(object):

EXACT_SUFFIX = '###' # type: str
EXACT_SUFFIX_B = b'###' # type: bytes
# rules in the ACL file are followed by a white space (U+0020):
# for searching we need a match suffix which sorts/compares after
# (resp. before because we use the rev_cmp function). Simply add
# another '#' (U+0023 > U+0020)
EXACT_SUFFIX_SEARCH_B = b'####' # type: bytes
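Review note: the comment above relies on plain byte ordering. The snippet below is a small illustration of that ordering, not pywb code; the rule line is a made-up example that only assumes the key is followed by a space, as the comment states.

```python
EXACT_SUFFIX_B = b'###'
EXACT_SUFFIX_SEARCH_B = b'####'

# Illustrative rule line (assumed layout: key, then a space, then the rest).
acl_line = b'org,iana)/### - {"access": "allow"}'

old_search_key = b'org,iana)/' + EXACT_SUFFIX_B
new_search_key = b'org,iana)/' + EXACT_SUFFIX_SEARCH_B

# The old key is a strict prefix of the rule line, so it compares as smaller.
sorts_before = old_search_key < acl_line

# The fourth '#' (0x23) is compared against the space (0x20) after the key,
# so the new key compares as larger than the rule line.
sorts_after = new_search_key > acl_line
```

With the three-character suffix the search key lands on the wrong side of an exact-match rule, which is presumably why the rule on the first line of an ACL file could be missed.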

def __init__(self, access_source, default_access='allow'):
def __init__(self, access_source, default_access='allow', embargo=None):
"""Initialize a new AccessChecker
:param str|list[str]|AccessRulesAggregator access_source: An access source
:param str default_access: The default access action (allow)
:param dict embargo: A dict specifying optional embargo setting
"""
if isinstance(access_source, str):
self.aggregator = self.create_access_aggregator([access_source])
@@ -98,6 +107,72 @@ def __init__(self, access_source, default_access='allow'):
self.default_rule['access'] = default_access
self.default_rule['default'] = 'true'

self.embargo = self.parse_embargo(embargo)

def parse_embargo(self, embargo):
if not embargo:
return None

value = embargo.get('before')
if value:
embargo['before'] = timestamp_to_datetime(str(value))

value = embargo.get('after')
if value:
embargo['after'] = timestamp_to_datetime(str(value))

value = embargo.get('older')
if value:
delta = relativedelta(
years=value.get('years', 0),
months=value.get('months', 0),
weeks=value.get('weeks', 0),
days=value.get('days', 0))

embargo['older'] = delta

value = embargo.get('newer')
if value:
delta = relativedelta(
years=value.get('years', 0),
months=value.get('months', 0),
weeks=value.get('weeks', 0),
days=value.get('days', 0))

embargo['newer'] = delta

return embargo

def check_embargo(self, url, ts):
if not self.embargo:
return None

dt = timestamp_to_datetime(ts)
access = self.embargo.get('access', 'exclude')

# embargo before
before = self.embargo.get('before')
if before:
return access if dt < before else None

# embargo after
after = self.embargo.get('after')
if after:
return access if dt > after else None

# embargo if newer than
newer = self.embargo.get('newer')
if newer:
actual = datetime.utcnow() - newer
return access if actual < dt else None

# embargo if older than
older = self.embargo.get('older')
if older:
actual = datetime.utcnow() - older
return access if actual > dt else None

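Review note: a minimal standalone sketch of the four embargo checks above, for experimenting outside pywb. It is not the pywb implementation. `parse_ts` is a hypothetical stand-in for warcio's `timestamp_to_datetime`, intervals are plain `timedelta` values rather than `relativedelta`, and `now` is injectable so results are reproducible.

```python
from datetime import datetime, timedelta, timezone

_PAD = '00000101000000'  # defaults for missing month/day/time fields


def parse_ts(ts):
    """Hypothetical helper: parse a 4, 6, 8, 10, 12 or 14 digit timestamp."""
    ts = str(ts)
    return datetime.strptime(ts + _PAD[len(ts):], '%Y%m%d%H%M%S')


def check_embargo(embargo, ts, now=None):
    """Return the embargo access value if the capture is embargoed, else None."""
    if not embargo:
        return None

    if now is None:
        now = datetime.now(timezone.utc).replace(tzinfo=None)

    dt = parse_ts(ts)
    access = embargo.get('access', 'exclude')

    # As in the diff, only the first configured criterion is evaluated.
    if embargo.get('before'):
        return access if dt < parse_ts(embargo['before']) else None
    if embargo.get('after'):
        return access if dt > parse_ts(embargo['after']) else None
    if embargo.get('newer'):
        return access if now - embargo['newer'] < dt else None
    if embargo.get('older'):
        return access if now - embargo['older'] > dt else None
    return None
```

Because each branch returns immediately, a config that sets both `before` and `newer` would only ever apply `before`; the sketch keeps that behaviour on purpose.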
def create_access_aggregator(self, source_files):
"""Creates a new AccessRulesAggregator using the supplied list
of access control file names
@@ -134,22 +209,26 @@ def create_access_source(self, filename):
else:
raise Exception('Invalid Access Source: ' + filename)

def find_access_rule(self, url, ts=None, urlkey=None):
def find_access_rule(self, url, ts=None, urlkey=None, collection=None, acl_user=None):
"""Attempts to find the access control rule for the
supplied URL otherwise returns the default rule
:param str url: The URL for the rule to be found
:param str|None ts: A timestamp (not used)
:param str|None urlkey: The access control url key
:param str|None collection: The collection, if any
:param str|None acl_user: The access control user, if any
:return: The access control rule for the supplied URL
if one exists otherwise the default rule
:rtype: CDXObject
"""
params = {'url': url,
'urlkey': urlkey,
'nosource': 'true',
'exact_match_suffix': self.EXACT_SUFFIX_B
'exact_match_suffix': self.EXACT_SUFFIX_SEARCH_B
}
if collection:
params['param.coll'] = collection

acl_iter, errs = self.aggregator(params)
if errs:
@@ -160,68 +239,107 @@ def find_access_rule(self, url, ts=None, urlkey=None):

tld = key.split(b',')[0]

last_obj = None
last_key = None

for acl in acl_iter:

# skip empty/invalid lines
if not acl:
continue

acl_key = acl.split(b' ')[0]
acl_obj = None

if acl_key != last_key and last_obj:
return last_obj

if key_exact == acl_key:
return CDXObject(acl)
acl_obj = CDXObject(acl)

if key.startswith(acl_key):
return CDXObject(acl)
acl_obj = CDXObject(acl)

if acl_obj:
user = acl_obj.get('user')
if user == acl_user:
return acl_obj
elif not user:
last_key = acl_key
last_obj = acl_obj

# if acl key already less than first tld,
# no match can be found
if acl_key < tld:
break

return self.default_rule
return last_obj if last_obj else self.default_rule

def __call__(self, res):
def __call__(self, res, acl_user):
"""Wraps the cdx iter in the supplied tuple, returning
the wrapped cdx iter and the other members of the supplied
tuple in the same order
:param tuple res: The result tuple
:param str acl_user: The user associated with this request (optional)
:return: A tuple
"""
cdx_iter, errs = res
return self.wrap_iter(cdx_iter), errs
return self.wrap_iter(cdx_iter, acl_user), errs

def wrap_iter(self, cdx_iter):
def wrap_iter(self, cdx_iter, acl_user):
"""Wraps the supplied cdx iter and yields cdx objects
that contain the access control results for the cdx object
being yielded
:param cdx_iter: The cdx object iterator to be wrapped
:param str acl_user: The user associated with this request (optional)
:return: The wrapped cdx object iterator
"""
last_rule = None
last_url = None
last_user = None
rule = None

for cdx in cdx_iter:
url = cdx.get('url')
timestamp = cdx.get('timestamp')

# if no url, possible idx or other object, don't apply any checks and pass through
if not url:
yield cdx
continue

# TODO: optimization until date range support is included
if url == last_url:
rule = last_rule
else:
rule = self.find_access_rule(url, cdx.get('timestamp'), cdx.get('urlkey'))
access = None
if self.aggregator:
# TODO: optimization until date range support is included
if url == last_url and acl_user == last_user:
rule = last_rule
else:
rule = self.find_access_rule(url, timestamp,
cdx.get('urlkey'),
cdx.get('source-coll'),
acl_user)

access = rule.get('access', 'exclude')

if access != 'allow_ignore_embargo' and access != 'exclude':
embargo_access = self.check_embargo(url, timestamp)
if embargo_access and embargo_access != 'allow':
access = embargo_access

access = rule.get('access', 'exclude')
if access == 'exclude':
continue

if not access:
access = self.default_rule['access']

if access == 'allow_ignore_embargo':
access = 'allow'

cdx['access'] = access
yield cdx

last_rule = rule
last_url = url
last_user = acl_user
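Review note: the precedence between the ACL rule and the embargo in `wrap_iter` is easier to follow when isolated. The function below is a sketch that mirrors the branch order in the hunk above; `resolve_access` is a hypothetical name, not part of pywb, and it returns `None` where the real loop would `continue` and drop the capture.

```python
def resolve_access(rule_access, embargo_access, default_access='allow'):
    """Return the final access value, or None when the capture is excluded."""
    access = rule_access

    # The embargo overrides the rule unless the rule opts out of the
    # embargo or already excludes the capture.
    if access not in ('allow_ignore_embargo', 'exclude'):
        if embargo_access and embargo_access != 'allow':
            access = embargo_access

    if access == 'exclude':
        return None  # capture is dropped from the results

    # No rule and no embargo: fall back to the default access.
    if not access:
        access = default_access

    # The opt-out value is internal and is reported as a plain 'allow'.
    if access == 'allow_ignore_embargo':
        access = 'allow'

    return access
```

For example, `resolve_access('allow', 'exclude')` yields `None`, while `resolve_access('allow_ignore_embargo', 'exclude')` yields `'allow'`.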
5 changes: 4 additions & 1 deletion pywb/warcserver/basewarcserver.py
@@ -141,6 +141,9 @@ def send_error(self, errs, start_response,
out_headers['ResErrors'] = res[0]
message = message.encode('utf-8')

message = str(status) + ' ' + message
if isinstance(status, str):
message = status
else:
message = str(status) + ' ' + message
start_response(message, list(out_headers.items()))
return res
33 changes: 28 additions & 5 deletions pywb/warcserver/handlers.py
@@ -4,6 +4,7 @@

from warcio.recordloader import ArchiveLoadFailed

from pywb.warcserver.index.cdxobject import CDXException
from pywb.warcserver.index.fuzzymatcher import FuzzyMatcher
from pywb.warcserver.resource.responseloader import WARCPathLoader, LiveWebLoader, VideoLoader

@@ -65,8 +66,10 @@ def _load_index_source(self, params):

cdx_iter = self.fuzzy(self.index_source, params)

acl_user = params['_input_req'].env.get("HTTP_X_PYWB_ACL_USER")

if self.access_checker:
cdx_iter = self.access_checker(cdx_iter)
cdx_iter = self.access_checker(cdx_iter, acl_user)

return cdx_iter

@@ -80,29 +83,49 @@ def __call__(self, params):

output = params.get('output', self.DEF_OUTPUT)
fields = params.get('fields')
if not fields:
fields = params.get('fl')

if fields and isinstance(fields, str):
fields = fields.split(',')

handler = self.OUTPUTS.get(output, fields)
handler = self.OUTPUTS.get(output)
if not handler:
errs = dict(last_exc=BadRequestException('output={0} not supported'.format(output)))
return None, None, errs

cdx_iter, errs = self._load_index_source(params)
cdx_iter = None
try:
cdx_iter, errs = self._load_index_source(params)
except BadRequestException as e:
errs = dict(last_exc=e)
if not cdx_iter:
return None, None, errs

content_type, res = handler(cdx_iter, fields, params)
out_headers = {'Content-Type': content_type}

def check_str(lines):
first_line = None
try:
# raise exceptions early so that they can be handled properly
first_line = next(res)
except StopIteration:
pass
except CDXException as e:
errs = dict(last_exc=e)
return None, None, errs

def check_str(first_line, lines):
if first_line is not None:
if isinstance(first_line, six.text_type):
first_line = first_line.encode('utf-8')
yield first_line
for line in lines:
if isinstance(line, six.text_type):
line = line.encode('utf-8')
yield line

return out_headers, check_str(res), errs
return out_headers, check_str(first_line, res), errs
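Review note: the hunk above pulls the first line out of a lazy iterator so that a `CDXException` surfaces before any response is started. The sketch below shows the same idea in isolation. It uses `itertools.chain` to put the first item back instead of the diff's `check_str(first_line, lines)` generator, and `peek_first` is a hypothetical name.

```python
from itertools import chain


def peek_first(lines):
    """Advance a lazy iterator by one item so that errors raised while
    producing the first item propagate to the caller immediately."""
    lines = iter(lines)
    try:
        first = next(lines)
    except StopIteration:
        return iter(())  # empty input stays empty
    # Re-attach the consumed item in front of the remaining ones.
    return chain([first], lines)


def failing_source():
    # A generator only runs its body on the first next() call,
    # so this error is invisible until someone starts iterating.
    raise ValueError('invalid page parameter')
    yield b''
```

Calling `peek_first(failing_source())` raises right away, whereas simply returning `failing_source()` would defer the error until the response body is being written.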


#=============================================================================
4 changes: 3 additions & 1 deletion pywb/warcserver/index/aggregator.py
@@ -37,7 +37,9 @@ def __call__(self, params):

cdx_iter, errs = self.load_index(query.params)

cdx_iter = process_cdx(cdx_iter, query)
if not query.page_count:
cdx_iter = process_cdx(cdx_iter, query)

return cdx_iter, dict(errs)

def load_child_source(self, name, source, params):
4 changes: 4 additions & 0 deletions pywb/warcserver/index/fuzzymatcher.py
@@ -209,6 +209,10 @@ def match_general_fuzzy_query(self, url, urlkey, cdx, rx_cache):
if mime and mime in self.default_filters['mimes']:
check_query = True

# also check query if has method (non-GET request) or requestBody is set
elif cdx.get('requestBody') or cdx.get('method'):
check_query = True

# if check_query, ensure matched url starts with original prefix, only differs by query
if check_query:
if cdx['url'] == url_no_query or cdx['url'].startswith(url_no_query + '?'):
11 changes: 8 additions & 3 deletions pywb/warcserver/index/indexsource.py
@@ -157,9 +157,9 @@ def _set_load_url(self, cdx, params):
if name:
source_coll = params.get('param.' + name + '.src_coll', '')

cdx[self.url_field] = self.replay_url.format(url=cdx['url'],
cdx[self.url_field] = res_template(self.replay_url, dict(url=cdx['url'],
timestamp=cdx['timestamp'],
src_coll=source_coll)
src_coll=source_coll))
def __repr__(self):
return '{0}({1}, {2})'.format(self.__class__.__name__,
self.api_url,
@@ -248,7 +248,7 @@ def load_index(self, params):
try:
limit = params.get('limit')
if limit:
query = 'limit: {0} '.format(limit) + query
query = 'limit:{0} '.format(limit) + query

# OpenSearch API requires double-escaping
# TODO: add option to not double escape if needed
@@ -314,6 +314,11 @@ def convert_to_cdx(self, item):
cdx['digest'] = self.gettext(item, 'digest')
cdx['offset'] = self.gettext(item, 'compressedoffset')
cdx['filename'] = self.gettext(item, 'file')

length = self.gettext(item, 'compressedendoffset')
if length:
cdx['length'] = length

return cdx

def gettext(self, item, name):
6 changes: 5 additions & 1 deletion pywb/warcserver/index/query.py
@@ -119,7 +119,11 @@ def secondary_index_only(self):

@property
def page(self):
return int(self.params.get('page', 0))
try:
return int(self.params.get('page', 0))
except ValueError:
msg = 'Invalid value for page= param: {}'
raise CDXException(msg.format(self.params.get('page')))
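Review note: the same try/except appears here for `page` and in `zipnum.py` for `pageSize`. A shared helper along the following lines could express both; this is only a sketch, `int_param` is a hypothetical name, and the local `CDXException` class stands in for `pywb.warcserver.index.cdxobject.CDXException`.

```python
class CDXException(Exception):
    """Stand-in for pywb's CDXException, defined here to keep the sketch standalone."""


def int_param(params, name, default=0):
    """Read an integer query parameter, turning a bad value into a CDXException."""
    value = params.get(name, default)
    try:
        return int(value)
    except ValueError:
        msg = 'Invalid value for {0}= param: {1}'
        raise CDXException(msg.format(name, value))
```

`int_param({'page': '3'}, 'page')` returns `3`, and a value such as `'not-an-integer'` raises `CDXException` instead of an unhandled `ValueError`.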

@property
def page_size(self):
8 changes: 6 additions & 2 deletions pywb/warcserver/index/test/test_xmlquery_indexsource.py
@@ -71,13 +71,16 @@ def do_query(self, params):
@patch('pywb.warcserver.index.indexsource.requests.sessions.Session.get', mock_get)
def test_exact_query(self):
res, errs = self.do_query({'url': 'http://example.com/', 'limit': 100})
reslist = list(res)

expected = """\
com,example)/ 20180112200243 example.warc.gz
com,example)/ 20180216200300 example.warc.gz"""
assert(key_ts_res(res) == expected)
assert(key_ts_res(reslist) == expected)
assert(errs == {})
assert query_url == 'http://localhost:8080/path?q=limit%3A+100+type%3Aurlquery+url%3Ahttp%253A%252F%252Fexample.com%252F'
assert query_url == 'http://localhost:8080/path?q=limit%3A100+type%3Aurlquery+url%3Ahttp%253A%252F%252Fexample.com%252F'
assert reslist[0]['length'] == '123'
assert 'length' not in reslist[1]


@patch('pywb.warcserver.index.indexsource.requests.sessions.Session.get', mock_get)
@@ -119,6 +122,7 @@ def _get_etree(cls):
<results>
<result>
<compressedoffset>10</compressedoffset>
<compressedendoffset>123</compressedendoffset>
<mimetype>text/html</mimetype>
<file>example.warc.gz</file>
<redirecturl>-</redirecturl>
14 changes: 14 additions & 0 deletions pywb/warcserver/index/test/test_zipnum.py
@@ -125,6 +125,7 @@
from pywb import get_test_dir
from pywb.warcserver.index.test.test_cdxops import cdx_ops_test, cdx_ops_test_data
from pywb.warcserver.warcserver import init_index_agg
from pywb.warcserver.index.cdxobject import CDXException

import shutil
import tempfile
@@ -227,13 +228,26 @@ def test_blocks_zero_pages():
res = zip_ops_test_data(url='http://aaa.zz/', matchType='domain', showNumPages=True)
assert(res == {"blocks": 0, "pages": 0, "pageSize": 10})

def test_blocks_ignore_filter_params():
res = zip_ops_test_data(url='*.iana.org', pageSize='4', showNumPages=True, filter='=status:200')
assert(res == {"blocks": 38, "pages": 10, "pageSize": 4})

def test_blocks_ignore_timestamp_params():
res = zip_ops_test_data(url='*.iana.org', pageSize='4', showNumPages=True, closest='20140126000000')
assert(res == {"blocks": 38, "pages": 10, "pageSize": 4})


# Errors

def test_err_file_not_found():
with pytest.raises(IOError):
zip_test_err(url='http://iana.org/x', matchType='exact') # doctest: +IGNORE_EXCEPTION_DETAIL

def test_invalid_int_param():
with pytest.raises(CDXException):
zip_ops_test_data(url='http://iana.org/domains/example', matchType='exact', pageSize='not-an-integer')
with pytest.raises(CDXException):
zip_ops_test_data(url='http://iana.org/domains/example', matchType='exact', page='not-an-integer')



6 changes: 5 additions & 1 deletion pywb/warcserver/index/zipnum.py
@@ -182,7 +182,11 @@ def compute_page_range(self, reader, query):
if not pagesize:
pagesize = self.max_blocks
else:
pagesize = int(pagesize)
try:
pagesize = int(pagesize)
except ValueError:
msg = 'Invalid value for pageSize= param: {}'
raise CDXException(msg.format(pagesize))

last_line = None

71 changes: 55 additions & 16 deletions pywb/warcserver/inputrequest.py
@@ -10,6 +10,7 @@

import base64
import cgi
import json


#=============================================================================
@@ -77,7 +78,7 @@ def include_method_query(self, url):

method = self.get_req_method()

if method not in ('OPTIONS', 'POST'):
if method == 'GET' or method == 'HEAD':
return url

mime = self._get_content_type()
@@ -181,7 +182,8 @@ def _get_header(self, name):

# ============================================================================
class MethodQueryCanonicalizer(object):
MAX_POST_SIZE = 16384
#MAX_POST_SIZE = 16384
MAX_QUERY_LENGTH = 4096

def __init__(self, method, mime, length, stream,
buffered_stream=None,
@@ -196,12 +198,9 @@ def __init__(self, method, mime, length, stream,
self.query = b''

method = method.upper()
self.method = method

if method in ('OPTIONS', 'HEAD'):
self.query = '__pywb_method=' + method.lower()
return

if method != 'POST':
if method != 'POST' and method != 'PUT':
return

try:
@@ -212,8 +211,8 @@ def __init__(self, method, mime, length, stream,
if length <= 0:
return

# max POST query allowed, for size considerations, only read upto this size
length = min(length, self.MAX_POST_SIZE)
# always read entire POST request, but limit query string later
#length = min(length, self.MAX_POST_SIZE)
query = []

while length > 0:
@@ -274,12 +273,26 @@ def handle_binary(query):
elif mime.startswith('application/x-amf'):
query = self.amf_parse(query, environ)

elif mime.startswith('application/json'):
try:
query = self.json_parse(query)
except Exception as e:
print(e)
query = ''

elif mime.startswith('text/plain'):
try:
query = self.json_parse(query)
except Exception as e:
query = handle_binary(query)

else:
query = handle_binary(query)

self.query = query
if query:
self.query = query[:self.MAX_QUERY_LENGTH]

def amf_parse(self, string, environ):
def amf_parse(self, string, warn_on_error):
try:
res = decode(BytesIO(string))
return urlencode({"request": Amf.get_representation(res)})
@@ -290,15 +303,41 @@ def amf_parse(self, string, environ):
print(e)
return None

def json_parse(self, string):
data = {}
dupes = {}

def get_key(n):
if n not in data:
return n

if n not in dupes:
dupes[n] = 1

dupes[n] += 1
return n + "." + str(dupes[n]) + "_"

def _parser(dict_var):
for n, v in dict_var.items():
if isinstance(v, dict):
_parser(v)
else:
data[get_key(n)] = str(v)

_parser(json.loads(string))
return urlencode(data)

def append_query(self, url):
if not self.query:
if self.method == 'GET':
return url

if '?' not in url:
url += '?'
append_str = '?'
else:
url += '&'
append_str = '&'

url += self.query
return url
append_str += "__wb_method=" + self.method
if self.query:
append_str += '&' + self.query

return url + append_str
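Review note: to make the effect of `json_parse` and the new `append_query` concrete, here is a standalone sketch of both steps using only the standard library. `flatten_json` and `append_method_query` are hypothetical names that follow the logic in the hunk above; they are not the pywb methods themselves.

```python
import json
from urllib.parse import urlencode


def flatten_json(body):
    """Flatten nested JSON objects into a query string; a repeated key
    gets a '.<n>_' suffix, starting at 2 for the first repeat."""
    data = {}
    dupes = {}

    def get_key(name):
        if name not in data:
            return name
        dupes[name] = dupes.get(name, 1) + 1
        return name + '.' + str(dupes[name]) + '_'

    def walk(obj):
        for name, value in obj.items():
            if isinstance(value, dict):
                walk(value)
            else:
                data[get_key(name)] = str(value)

    # Like the diff, this assumes a top-level JSON object.
    walk(json.loads(body))
    return urlencode(data)


def append_method_query(url, method, query=''):
    """Append '__wb_method=<METHOD>' and the body-derived query to a URL."""
    if method == 'GET':
        return url
    sep = '&' if '?' in url else '?'
    result = url + sep + '__wb_method=' + method
    if query:
        result += '&' + query
    return result


query = flatten_json('{"a": 1, "b": {"a": 2, "c": "x y"}}')
lookup_url = append_method_query('http://example.com/api', 'POST', query)
# lookup_url == 'http://example.com/api?__wb_method=POST&a=1&a.2_=2&c=x+y'
```

A top-level JSON array has no `.items()` and would raise; in the diff that case is caught by the caller's `except Exception`, which falls back to an empty query for `application/json`.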
14 changes: 13 additions & 1 deletion pywb/warcserver/test/test_access.py
@@ -53,6 +53,10 @@ def test_blocks_only(self):
assert edx['urlkey'] == 'com,example)/foo'
assert edx['access'] == 'exclude'

edx = access.find_access_rule('https://example.net/abc/path')
assert edx['urlkey'] == 'net,example)/abc/path'
assert edx['access'] == 'block'

edx = access.find_access_rule('https://example.net/abc/path/other')
assert edx['urlkey'] == 'net,example)/abc/path'
assert edx['access'] == 'block'
@@ -114,7 +118,7 @@ def test_excludes_dir(self):
assert edx['urlkey'] == 'net,example)/abc/path'
assert edx['access'] == 'block'

# exact-only matchc
# exact-only match
edx = access.find_access_rule('https://www.iana.org/')
assert edx['urlkey'] == 'org,iana)/###'
assert edx['access'] == 'allow'
@@ -127,4 +131,12 @@ def test_excludes_dir(self):
assert edx['urlkey'] == 'org,iana)/'
assert edx['access'] == 'exclude'

# exact-only match, first line in *.aclj file
edx = access.find_access_rule('https://www.iana.org/exact/match/first/line/aclj/')
assert edx['urlkey'] == 'org,iana)/exact/match/first/line/aclj###'
assert edx['access'] == 'allow'

# exact-only match, single rule in *.aclj file
edx = access.find_access_rule('https://www.lonesome-rule.org/')
assert edx['urlkey'] == 'org,lonesome-rule)/###'
assert edx['access'] == 'allow'