NEWS

w3lib release notes
===================

2.1.1 (2022-12-09)
------------------

- :func:`~w3lib.url.safe_url_string`, :func:`~w3lib.url.safe_download_url`
  and :func:`~w3lib.url.canonicalize_url` now strip whitespace and control
  characters urls according to the URL living standard.


2.1.0 (2022-11-28)
------------------

-   Dropped Python 3.6 support, and made Python 3.11 support official. (#195,
    #200)

-   :func:`~w3lib.url.safe_url_string` now generates safer URLs.

    To make URLs safer for the `URL living standard`_:

    .. _URL living standard: https://url.spec.whatwg.org/

    -   ``;=`` are percent-encoded in the URL username.

    -   ``;:=`` are percent-encoded in the URL password.

    -   ``'`` is percent-encoded in the URL query if the URL scheme is `special
        <https://url.spec.whatwg.org/#special-scheme>`__.

    To make URLs safer for `RFC 2396`_ and `RFC 3986`_, ``|[]`` are
    percent-encoded in URL paths, queries, and fragments.

    .. _RFC 2396: https://www.ietf.org/rfc/rfc2396.txt
    .. _RFC 3986: https://www.ietf.org/rfc/rfc3986.txt

    (#80, #203)

-   :func:`~w3lib.encoding.html_to_unicode` now checks for the `byte order
    mark`_ before inspecting the ``Content-Type`` header when determining the
    content encoding, in line with the `URL living standard`_. (#189, #191)

    .. _byte order mark: https://en.wikipedia.org/wiki/Byte_order_mark

-   :func:`~w3lib.url.canonicalize_url` now strips spaces from the input URL,
    to be more in line with the `URL living standard`_. (#132, #136)

-   :func:`~w3lib.html.get_base_url` now ignores HTML comments. (#70, #77)

-   Fixed :func:`~w3lib.url.safe_url_string` re-encoding percent signs on
    the URL username and password even when they were being used as part of an
    escape sequence. (#187, #196)

-   Fixed :func:`~w3lib.http.basic_auth_header` using the wrong flavor of
    base64 encoding, which could prevent authentication in rare cases. (#181,
    #192)

-   Fixed :func:`~w3lib.html.replace_entities` raising :exc:`OverflowError` in
    some cases due to `a bug in CPython
    <https://github.com/python/cpython/issues/76763>`__. (#199, #202)

-   Improved typing and fixed typing issues. (#190, #206)

-   Made CI and test improvements. (#197, #198)

-   Adopted a Code of Conduct. (#194)


2.0.1 (2022-08-11)
------------------
Minor documentation fix (release date is set in the changelog).

2.0.0 (2022-08-11)
------------------

Backwards incompatible changes:

- Python 2 is no longer supported; Python 3.6+ is required now (#168, #175).
- :func:`w3lib.url.safe_url_string` and :func:`w3lib.url.canonicalize_url`
  no longer convert "%23" to "#" when it appears in the URL path. This is a bug
  fix. It's listed as a backward-incomatible change because in some cases the
  output of :func:`w3lib.url.canonicalize_url` is going to change, and so, if
  this output is used to generate URL fingerprints, new fingerprints might be
  incompatible with those created with the previous w3lib versions
  (#141).

Deprecation removals (#169):

- The ``w3lib.form`` module is removed.
- The ``w3lib.html.remove_entities`` function is removed.
- The ``w3lib.url.urljoin_rfc`` function is removed.

The following functions are deprecated, and will be removed in future releases
(#170):

- ``w3lib.util.str_to_unicode``
- ``w3lib.util.unicode_to_str``
- ``w3lib.util.to_native_str``

Other improvements and bug fixes:

- Type annotations are added (#172, #184).
- Added support for Python 3.9 and 3.10 (#168, #176).
- Fixed :func:`w3lib.html.get_meta_refresh` for ``<meta>`` tags where
  ``http-equiv`` is written after ``content`` (#179).
- Fixed :func:`w3lib.url.safe_url_string` for IDNA domains with ports (#174).
- :func:`w3lib.url.url_query_cleaner` no longer adds an unneeded ``#`` when
  ``keep_fragments=True`` is passed, and the URL doesn't have a fragment
  (#159).
- Removed a workaround for an ancient pathname2url bug (#142)
- CI is migrated to GitHub Actions (#166, #177); other CI improvements (#160,
  #182).
- The code is formatted using black (#173).

1.22.0 (2020-05-13)
-------------------

- Python 3.4 is no longer supported (issue #156)
- :func:`w3lib.url.safe_url_string` now supports an optional ``quote_path``
  parameter to disable the percent-encoding of the URL path (issue #119)
- :func:`w3lib.url.add_or_replace_parameter` and
  :func:`w3lib.url.add_or_replace_parameters` no longer remove duplicate
  parameters from the original query string that are not being added or
  replaced (issue #126)
- :func:`w3lib.html.remove_tags` now raises a :exc:`ValueError` exception
  instead of :exc:`AssertionError` when using both the ``which_ones`` and the
  ``keep`` parameters (issue #154)
- Test improvements (issues #143, #146, #148, #149)
- Documentation improvements (issues #140, #144, #145, #151, #152, #153)
- Code cleanup (issue #139)


1.21.0 (2019-08-09)
-------------------

- Add the ``encoding`` and ``path_encoding`` parameters to
  :func:`w3lib.url.safe_download_url` (issue #118)
- :func:`w3lib.url.safe_url_string` now also removes tabs and new lines
  (issue #133)
- :func:`w3lib.html.remove_comments` now also removes truncated comments
  (issue #129)
- :func:`w3lib.html.remove_tags_with_content` no longer removes tags which
  start with the same text as one of the specified tags (issue #114)
- Recommend pytest instead of nose to run tests (issue #124)


1.20.0 (2019-01-11)
-------------------

- Fix url_query_cleaner to do not append "?" to urls without a query string (issue #109)
- Add support for Python 3.7 and drop Python 3.3 (issue #113)
- Add `w3lib.url.add_or_replace_parameters` helper (issue #117)
- Documentation fixes (issue #115)


1.19.0 (2018-01-25)
-------------------

- Add a workaround for CPython segfault (https://bugs.python.org/issue32583)
  which affect w3lib.encoding functions. This is technically **backwards
  incompatible** because it changes the way non-decodable bytes are replaced
  (in some cases instead of two ``\ufffd`` chars you can get one).
  As a side effect, the fix speeds up decoding in Python 3.4+.
- Add 'encoding' parameter for w3lib.http.basic_auth_header.
- Fix pypy testing setup, add pypy3 to CI.


1.18.0 (2017-08-03)
-------------------

- Include additional assets used for distribution packages in the source tarball
- Consider ``[`` and ``]`` as safe characters in path and query components
  of URLs, i.e. they are not escaped anymore
- Disable codecov project coverage check


1.17.0 (2017-02-08)
-------------------

- Add Python 3.5 and 3.6 support
- Add ``w3lib.url.parse_data_uri`` helper for parsing "data:" URIs
- Add ``w3lib.html.strip_html5_whitespace`` function to strip leading and
  trailing whitespace as per W3C recommendations, e.g. for cleaning
  "href" attribute values
- Fix ``w3lib.http.headers_raw_to_dict`` for multiple headers with same name
- Do not distribute tests/test_*.pyc artifacts


1.16.0 (2016-11-10)
-------------------

- ``canonicalize_url()`` and ``safe_url_string()``:
  strip ":" when no port is specified (as per `RFC 3986`_;
  see also https://github.com/scrapy/scrapy/issues/2377)
- ``url_query_cleaner()``: support new ``keep_fragments`` argument
  (defaulting to ``False``)

1.15.0 (2016-07-29)
-------------------

- Add ``canonicalize_url()`` to ``w3lib.url``

1.14.3 (2016-07-14)
-------------------

Bugfix release:

- Handle IDNA encoding failures in ``safe_url_string()`` (issue #62)

1.14.2 (2016-04-11)
-------------------

Bugfix release:

- fix function import for (deprecated) ``urljoin_rfc`` (issue #51)
- only expose wanted functions from ``w3lib.url``, via ``__all__``
  (see issue #54, https://github.com/scrapy/scrapy/issues/1917)

1.14.1 (2016-04-07)
-------------------

Bugfix release:

- For bytes URLs, when supplied encoding (or default UTF8) is wrong,
  ``safe_url_string`` falls back to percent-encoding offending bytes.

1.14.0 (2016-04-06)
-------------------

Changes to safe_url_string:

- proper handling of non-ASCII characters in Python2 and Python3
- support IDNs
- new `path_encoding` to override default UTF-8 when serializing non-ASCII
  characters before percent-encoding

html_body_declared_encoding also detects encoding when not sole attribute
in ``<meta>``.

Package is now properly marked as ``zip_safe``.

1.13.0 (2015-11-05)
-------------------

- remove_tags removes uppercase tags as well;
- ignore meta-redirects inside script or noscript tags by default,
  but add an option to not ignore them;
- replace_entities now handles entities without trailing semicolon;
- fixed uncaught UnicodeDecodeError when decoding entities.

1.12.0 (2015-06-29)
-------------------

- meta_refresh regex now handles leading newlines and whitespaces in the url;
- include tests folder in source distribution.

1.11.0 (2015-01-13)
-------------------

- url_query_cleaner now supports str or list parameters;
- add support for resolving base URLs in <base> tags with attributes
  before href.

1.10.0 (2014-08-20)
-------------------

- reverted all 1.9.0 changes.

1.9.0 (2014-08-16)
------------------

- all url-related functions accept bytes and unicode and now return bytes.

1.8.1 (2014-08-14)
------------------

- w3lib.http.basic_auth_header now returns bytes

1.8.0 (2014-07-31)
------------------

- add support for big5-hkscs encoding.

1.7.1 (2014-07-26)
------------------

- PY3 fixed headers_raw_to_dict and headers_dict_to_raw;
- documentation improvements;
- provide wheels.

1.6 (2014-06-03)
----------------

- w3lib.form.encode_multipart is deprecated;
- docstrings and docs are improved;
- w3lib.url.add_or_replace_parameter is re-implemented on top of
  stdlib functions;
- remove_entities is renamed to replace_entities.

1.5 (2013-11-09)
----------------

- Python 2.6 support is dropped.

1.4 (2013-10-18)
----------------

- Python 3 support;
- get_meta_refresh encoding handling is fixed;
- check for '?' in add_or_replace_parameter;
- ISO-8859-1 is used for HTTP Basic Auth;
- fixed unicode handling in replace_escape_chars;

1.3 (2012-05-13)
----------------

- support non-standard gb_2312_80 encoding;
- drop Python 2.5 support.

1.2 (2012-05-02)
----------------

- Detect encoding for content attr before http-equiv in meta tag.

1.1 (2012-04-18)
----------------

- w3lib.html.remove_comments handles multiline comments;
- Added w3lib.encoding module, containing functions for working with character
  encoding, like encoding autodetection from HTML pages.
- w3lib.url.urljoin_rfc is deprecated.

1.0 (2011-04-17)
----------------

First release of w3lib.