Initial functionality. #4

Merged: 34 commits, May 9, 2024

Changes from 14 commits

Commits (34)
40b6d51
Initial functionality.
wRAR Mar 25, 2024
1f6cc87
Rename PolicyRule to UrlRule.
wRAR Mar 26, 2024
de962c6
Move to/from_dict into the UrlRule class.
wRAR Mar 26, 2024
ffb68cf
Improve QueryRemovalPolicy.
wRAR Mar 26, 2024
ab98bf2
Tests for Processor.
wRAR Mar 26, 2024
e932c1f
Basic readme.
wRAR Mar 27, 2024
44e8e66
Rename a method.
wRAR Mar 27, 2024
de00aea
Add .coveragerc.
wRAR Mar 27, 2024
dac0b16
Update the URL matching logic.
wRAR Apr 2, 2024
20bc9d8
Set the min url-matcher version.
wRAR Apr 3, 2024
6fd9b91
Small improvements.
wRAR Apr 8, 2024
233a62f
Use match_universal().
wRAR Apr 11, 2024
5fcd156
Skip duplicate rules.
wRAR Apr 12, 2024
7f442a7
Update the component.
wRAR Apr 12, 2024
e335437
Update the url-matcher version.
wRAR Apr 15, 2024
044c5a7
Rename the Scrapy setting.
wRAR Apr 15, 2024
3254a7b
Rename the processor.
wRAR Apr 18, 2024
7daf848
Add stats to the middleware.
wRAR Apr 18, 2024
d843cfd
Rename policies to processors, other name and type fixes.
wRAR Apr 18, 2024
1f5a56b
Fix a typo.
wRAR Apr 19, 2024
34b9cc2
Change the mw logic.
wRAR Apr 19, 2024
d2d831f
Add a test for skipping universal rules.
wRAR Apr 19, 2024
2238ac5
Use the Scrapy fingerprinter.
wRAR Apr 19, 2024
d39eb3d
Honor request.dont_filter.
wRAR Apr 22, 2024
5567a10
Rephrase the README.
wRAR Apr 22, 2024
5d7840b
Flip the meta var default.
wRAR Apr 24, 2024
2d74b1a
Remove support for loading processors by import path.
wRAR Apr 24, 2024
a27d14d
Tests for the middleware.
wRAR Apr 25, 2024
3953115
Add the settings example.
wRAR Apr 25, 2024
9367bfe
README fixes.
wRAR Apr 26, 2024
570eafc
Replace the middleware with the fingerprinter.
wRAR Apr 26, 2024
ac5a9b0
Fixes.
wRAR May 6, 2024
14f63c8
Fixes.
wRAR May 8, 2024
ac6421a
Rename the fingerprinter.
wRAR May 9, 2024
8 changes: 8 additions & 0 deletions .coveragerc
@@ -0,0 +1,8 @@
[run]
branch = true

[report]
# https://github.com/nedbat/coveragepy/issues/831#issuecomment-517778185
exclude_lines =
pragma: no cover
if TYPE_CHECKING:
6 changes: 6 additions & 0 deletions .flake8
@@ -4,6 +4,10 @@ ignore =
E203,
# line too long
E501,
# multiple statements on one line
E704,
# line break before binary operator
W503,

# Missing docstring in public module
D100,
@@ -21,6 +25,8 @@ ignore =
D107,
# One-line docstring should fit on one line with quotes
D200,
# No blank lines allowed after function docstring
D202,
# 1 blank line required between summary line and description
D205,
# Multi-line docstring closing quotes should be on a separate line
34 changes: 34 additions & 0 deletions README.rst
@@ -37,3 +37,37 @@ Installation
pip install duplicate-url-discarder

Requires **Python 3.8+**.

Using
=====

Enable the Scrapy component:

.. code-block:: python

...

It will process request URLs, computing a canonical form for each of them and
discarding requests whose canonical form matches that of an earlier request.
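
For illustration only (the settings snippet above is left elided in this
revision of the PR), enabling the downloader middleware added here might look
roughly like the following; the priority value ``600`` and the rules file path
are assumptions, not something this diff prescribes:

.. code-block:: python

    # settings.py (illustrative sketch, not part of this PR)
    DOWNLOADER_MIDDLEWARES = {
        "duplicate_url_discarder.DuplicateUrlDiscarderDownloaderMiddleware": 600,
    }
    # Required, otherwise the middleware raises NotConfigured:
    DUD_LOAD_POLICY_PATH = ["/path/to/url-rules.json"]  # hypothetical path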

Policies
========

``duplicate-url-discarder`` uses *policies* to build canonical versions of
URLs. The policies are configured with *URL rules*. Each URL rule specifies
a URL pattern that a policy applies to and the policy arguments to use.

The following policies are currently available:

* ``queryRemoval``: removes the query string parameters (i.e. ``key=value``
  pairs) whose keys are listed in the rule arguments. If a given key appears
  multiple times in the URL, all of its occurrences are removed.
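
As a rough illustration of the rule format defined in ``_rule.py`` (the
domain, order value and query keys below are made up), a rules file for
``queryRemoval`` could look like:

.. code-block:: json

    [
        {
            "order": 100,
            "urlPattern": {"include": ["example.com"]},
            "policy": "queryRemoval",
            "args": ["utm_source", "utm_medium"]
        }
    ]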

Configuration
=============

``duplicate-url-discarder`` uses the following Scrapy settings:

``DUD_LOAD_POLICY_PATH``: a list of file paths (``str`` or ``pathlib.Path``)
pointing to files with the URL rules to apply. The default value of this
setting points to the default rules file.
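
For example, pointing the setting at two hypothetical rule files (the
``Processor`` skips duplicate rules, so listing overlapping files is
harmless):

.. code-block:: python

    from pathlib import Path

    DUD_LOAD_POLICY_PATH = [
        "/data/default-url-rules.json",
        Path("/data/project-url-rules.json"),
    ]
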
4 changes: 4 additions & 0 deletions duplicate_url_discarder/__init__.py
@@ -1 +1,5 @@
__version__ = "0.1.0"

from ._rule import UrlRule, load_rules, save_rules
from .middlewares import DuplicateUrlDiscarderDownloaderMiddleware
from .processor import Processor
62 changes: 62 additions & 0 deletions duplicate_url_discarder/_rule.py
@@ -0,0 +1,62 @@
from __future__ import annotations

import json
from dataclasses import dataclass
from typing import TYPE_CHECKING, Any, Dict, List, Tuple

from url_matcher import Patterns

if TYPE_CHECKING:
# typing.Self requires Python 3.11
from typing_extensions import Self


@dataclass(frozen=True)
class UrlRule:
order: int
url_pattern: Patterns
policy: str
args: Tuple[Any, ...]

@classmethod
def from_dict(cls, policy_dict: Dict[str, Any]) -> Self:
"""Load a rule from a dict"""
return cls(
order=policy_dict["order"],
url_pattern=Patterns(**policy_dict["urlPattern"]),
policy=policy_dict["policy"],
args=tuple(policy_dict.get("args") or ()),
)

def to_dict(self) -> Dict[str, Any]:
"""Save a rule to a dict"""
pattern = {"include": list(self.url_pattern.include)}
if self.url_pattern.exclude:
pattern["exclude"] = list(self.url_pattern.exclude)
result = {
"order": self.order,
"urlPattern": pattern,
"policy": self.policy,
}
if self.args:
result["args"] = list(self.args)
return result


def load_rules(data: str) -> List[UrlRule]:
"""Load a list of rules from a JSON text."""
results: List[UrlRule] = []
j = json.loads(data)
for item in j:
results.append(UrlRule.from_dict(item))
return results


def save_rules(policies: List[UrlRule]) -> str:
"""Save a list of rules to a JSON text."""
return json.dumps(
[p.to_dict() for p in policies],
ensure_ascii=False,
sort_keys=True,
indent=2,
)
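
A quick sketch of the round trip these helpers provide; the pattern and policy
arguments are made up for illustration:

from url_matcher import Patterns

from duplicate_url_discarder import UrlRule, load_rules, save_rules

# Hypothetical rule: strip the "ref" query parameter on example.com URLs.
rule = UrlRule(
    order=100,
    url_pattern=Patterns(include=["example.com"]),
    policy="queryRemoval",
    args=("ref",),
)
serialized = save_rules([rule])  # JSON text, suitable for a rules file
loaded = load_rules(serialized)  # -> a list with an equivalent UrlRule
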
30 changes: 30 additions & 0 deletions duplicate_url_discarder/middlewares.py
@@ -0,0 +1,30 @@
import os
from typing import List, Set, Union

from scrapy import Request
from scrapy.crawler import Crawler
from scrapy.exceptions import IgnoreRequest, NotConfigured
from scrapy.http import Response

from duplicate_url_discarder.processor import Processor


class DuplicateUrlDiscarderDownloaderMiddleware:
def __init__(self, crawler: Crawler):
self.crawler: Crawler = crawler
policy_path: List[Union[str, os.PathLike]] = self.crawler.settings.getlist(
"DUD_LOAD_POLICY_PATH"
)
if not policy_path:
raise NotConfigured("No DUD_LOAD_POLICY_PATH set")
self.processor = Processor(policy_path)
self.canonical_urls: Set[str] = set()

Review comment:

I wonder if we can estimate the memory usage of this approach somehow. If it's a lot, it may even make sense to have a separate code path when we know the fingerprints are not going to be updated (e.g. learning is not enabled, or learning is finished).

kmike (Apr 15, 2024):

My main worry is that with this implementation there is a non-zero chance zyte-spider-templates RAM usage may blow up above SC free unit limits (or 1 SC unit limit) in reasonably common cases.

wRAR (PR author):

This skips requests without the meta key, but where will we use that key? Probably for all normal requests?

As for memory usage I wanted to say it's comparable to the fingerprinter one but then I realized that URLs are often longer than fingerprints.

Reply:

> This skips requests without the meta key

By the way, I still think it shouldn't :) There was a thread in the original proposal about this. cc @BurnzZ

Reply (member):

One way we can save on RAM is to store this on disk, like using https://docs.python.org/3/library/shelve.html. Although this uses a dict-like interface, we can simply use the keys for uniqueness and leave the values empty.

Moreover, as a side note, there's a caveat to using shelve's .get() method: it runs in O(n) (I learned this the hard way). Using something like this is faster:

try:
    return data_on_disk[key]
except KeyError:
    return None

> This skips requests without the meta key

> By the way, I still think it shouldn't :) There was a thread in the original proposal about this

Having an opt-in approach is indeed tedious for the user to set up, but it allows narrowing down which types of URLs will be stored here, and thus reduces the storage needed.

What do you think about having an approach similar to scrapy-zyte-api's TRANSPARENT_MODE, while still allowing "zyte_api" to be set in the meta? Users can use a setting that turns everything on, but for zyte-spider-templates we can manually set DUD via the meta for optimal usage.

Reply:

I'd prefer a solution where the overhead is minimal and it's opt-out :) It seems this is achievable. URL matching looks optimized enough, and I'm optimistic RAM can also be optimized.

Disk storage will trade speed for RAM; it may not be a good trade here, but I'm not sure.

For URL storage there are also data structures like tries, which can save a lot of memory.

But at first sight, it looks like if we can assume that the fingerprints don't change, and have a separate code path for the case where they can change, it all can be pretty optimal. In this case we may even think about whether the storage can be shared with the dupefilter, but that's a separate idea.

When the learning is enabled, it looks like optimization can be significantly harder. So, maybe we can have learning opt-in (maybe per-request), not the whole thing opt-in?

By the way, I don't think we must have a final optimized implementation in this PR to have it merged.

Addressing it (i.e. evaluating how big the issue is, and making the necessary optimizations) is a blocker for making use of DUD in zyte-spider-templates by default.

kmike (Apr 16, 2024):

Also, it seems we need to understand whether an optimized version is possible, to make a decision between having the component enabled by default for all requests (plus a way to opt out per request?) and having it opt-in per request.


def process_request(self, request: Request) -> Union[Request, Response, None]:
if not request.meta.get("dud", False):
return None
canonical_url = self.processor.process_url(request.url)
if canonical_url in self.canonical_urls:
raise IgnoreRequest(f"Duplicate URL discarded: {canonical_url}")
self.canonical_urls.add(canonical_url)
return None
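
As the review thread above discusses, in this revision the middleware only
acts on requests that opt in through the "dud" meta key; a minimal sketch (the
URL is made up):

from scrapy import Request

# Only requests carrying dud=True in meta are canonicalized and checked for
# duplicates by DuplicateUrlDiscarderDownloaderMiddleware; all others pass
# through untouched.
request = Request("https://example.com/?page=1", meta={"dud": True})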
23 changes: 23 additions & 0 deletions duplicate_url_discarder/policies/__init__.py
@@ -0,0 +1,23 @@
from typing import Dict, Type

from scrapy.utils.misc import load_object

from duplicate_url_discarder._rule import UrlRule

from .base import PolicyBase
from .query_removal import QueryRemovalPolicy

_POLICY_CLASSES: Dict[str, Type[PolicyBase]] = {
"queryRemoval": QueryRemovalPolicy,
}


def get_policy(rule: UrlRule) -> PolicyBase:
policy_cls: Type[PolicyBase]
if "." not in rule.policy:
if rule.policy not in _POLICY_CLASSES:
raise ValueError(f"No policy named {rule.policy}")
policy_cls = _POLICY_CLASSES[rule.policy]
else:
policy_cls = load_object(rule.policy)
return policy_cls(rule.args)
16 changes: 16 additions & 0 deletions duplicate_url_discarder/policies/base.py
@@ -0,0 +1,16 @@
from abc import ABC, abstractmethod
from typing import Any, Tuple


class PolicyBase(ABC):
def __init__(self, args: Tuple[Any, ...]):
self.args: Tuple[Any, ...] = args
self.validate_args()

def validate_args(self) -> None: # noqa: B027
"""Check that the policy arguments are valid, raise an exception if not."""
pass

@abstractmethod
def process(self, input_url: str) -> str:
"""Return the input URL, modified according to the rules."""
18 changes: 18 additions & 0 deletions duplicate_url_discarder/policies/query_removal.py
@@ -0,0 +1,18 @@
from w3lib.url import url_query_cleaner

from .base import PolicyBase


class QueryRemovalPolicy(PolicyBase):
def validate_args(self) -> None:
for arg in self.args:
if not isinstance(arg, str):
raise TypeError(
f"queryRemoval args must be strings, not {type(arg)}: {arg}"
)

def process(self, input_url: str) -> str:
args_to_remove = self.args
return url_query_cleaner(
input_url, args_to_remove, remove=True, unique=False, keep_fragments=True
)
48 changes: 48 additions & 0 deletions duplicate_url_discarder/processor.py
@@ -0,0 +1,48 @@
import logging
import operator
import os
from pathlib import Path
from typing import Iterable, List, Set, Union

from url_matcher import URLMatcher

from duplicate_url_discarder._rule import UrlRule, load_rules
from duplicate_url_discarder.policies import PolicyBase, get_policy

logger = logging.getLogger(__name__)


class Processor:
def __init__(self, policy_paths: Iterable[Union[str, os.PathLike]]) -> None:
rules: Set[UrlRule] = set()
full_rule_count = 0
for policy_path in policy_paths:
data = Path(policy_path).read_text()
loaded_rules = load_rules(data)
full_rule_count += len(loaded_rules)
rules.update(loaded_rules)
rule_count = len(rules)
logger.info(
f"Loaded {rule_count} rules, skipped {full_rule_count - rule_count} duplicates."
)

self.url_matcher = URLMatcher()
self.policies: List[PolicyBase] = []
policy_id = 0
for rule in sorted(rules, key=operator.attrgetter("order")):
policy = get_policy(rule)
self.policies.append(policy)
self.url_matcher.add_or_update(policy_id, rule.url_pattern)
policy_id += 1

def process_url(self, url: str) -> str:
use_universal = True
for policy_id in self.url_matcher.match_all(url, include_universal=False):
use_universal = False
policy = self.policies[policy_id]
url = policy.process(url)
if use_universal:
for policy_id in self.url_matcher.match_universal():
policy = self.policies[policy_id]
url = policy.process(url)
return url
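
A small usage sketch of the processor on its own; the rules file path and URL
below are made up:

from duplicate_url_discarder import Processor

# Assumes /tmp/url-rules.json holds rules such as the queryRemoval example
# shown in the README section above.
processor = Processor(["/tmp/url-rules.json"])
canonical = processor.process_url("https://example.com/?id=1&utm_source=news")
# Rules whose pattern matches the URL are applied in "order"; when none match,
# only the universal rules (those whose pattern does not limit the domain)
# are applied.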
10 changes: 10 additions & 0 deletions pyproject.toml
@@ -24,6 +24,9 @@ classifiers = [
]
requires-python = ">=3.8"
dependencies = [
"Scrapy >= 2.0.1",
"url-matcher @ git+https://github.com/zytedata/url-matcher.git@skip-domainless",
"w3lib >= 1.22.0",
]
dynamic = ["version"]

@@ -40,6 +43,13 @@ duplicate_url_discarder = ["py.typed"]
profile = "black"
multi_line_output = 3

[[tool.mypy.overrides]]
module = [
"scrapy.*",
"url_matcher.*",
]
ignore_missing_imports = true

[[tool.mypy.overrides]]
module = [
"tests.*",
64 changes: 64 additions & 0 deletions tests/test_policies.py
@@ -0,0 +1,64 @@
import pytest
from url_matcher import Patterns

from duplicate_url_discarder import UrlRule
from duplicate_url_discarder.policies import PolicyBase, QueryRemovalPolicy, get_policy


class HardcodedPolicy(PolicyBase):
def process(self, url: str) -> str:
return "http://hardcoded.example"


def test_get_policy():
pattern = Patterns([])
args = ["foo", "bar"]

rule = UrlRule(0, pattern, "queryRemoval", args)
policy = get_policy(rule)
assert type(policy) is QueryRemovalPolicy
assert policy.args == args

rule = UrlRule(0, pattern, "tests.test_policies.HardcodedPolicy", args)
policy = get_policy(rule)
assert type(policy) is HardcodedPolicy
assert policy.args == args

rule = UrlRule(0, pattern, "unknown", args)
with pytest.raises(ValueError, match="No policy named unknown"):
get_policy(rule)


@pytest.mark.parametrize(
["args", "url", "expected"],
[
([], "http://foo.example?foo=1&bar", "http://foo.example?foo=1&bar"),
(["a"], "http://foo.example?foo=1&bar", "http://foo.example?foo=1&bar"),
(["foo"], "http://foo.example?foo=1&bar", "http://foo.example?bar"),
(["bar"], "http://foo.example?foo=1&bar", "http://foo.example?foo=1"),
(
["bar"],
"http://foo.example?foo=1&foo=2&bar&bar=1",
"http://foo.example?foo=1&foo=2",
),
(
["bar"],
"http://foo.example?foo=1&bar#bar=frag",
"http://foo.example?foo=1#bar=frag",
),
(["foo", "baz"], "http://foo.example?foo=1&bar", "http://foo.example?bar"),
(["foo", "bar"], "http://foo.example?foo=1&bar", "http://foo.example"),
],
)
def test_query_removal(args, url, expected):
policy = QueryRemovalPolicy(args)
assert policy.process(url) == expected


def test_query_removal_validate_args():
with pytest.raises(TypeError, match="strings, not <class 'bytes'>: b''"):
QueryRemovalPolicy([b""])
with pytest.raises(TypeError, match="strings, not <class 'NoneType'>: None"):
QueryRemovalPolicy(["a", None, ""])
QueryRemovalPolicy([""])
QueryRemovalPolicy([])