tarao/perl5-WWW-RobotRules-Parser-MultiValue

NAME

WWW::RobotRules::Parser::MultiValue - Parse robots.txt

SYNOPSIS

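use strict;
use warnings;
use WWW::RobotRules::Parser::MultiValue;
use LWP::Simple qw(get);

# Fetch robots.txt; LWP::Simple's get returns undef on failure.
my $url = 'http://example.com/robots.txt';
my $robots_txt = get $url;
die "Failed to fetch $url" unless defined $robots_txt;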

my $rules = WWW::RobotRules::Parser::MultiValue->new(
    agent => 'TestBot/1.0',
);
$rules->parse($url, $robots_txt);

if ($rules->allows('http://example.com/some/path')) {
    my $delay = $rules->delay_for('http://example.com/');
    sleep $delay;
    ...
}

my $hash = $rules->rules_for('http://example.com/');
my @list_of_allowed_paths = $hash->get_all('allow');
my @list_of_custom_rule_value = $hash->get_all('some-rule');

DESCRIPTION

WWW::RobotRules::Parser::MultiValue is a parser for robots.txt.

The parsed rules for the specified user agent are stored as a Hash::MultiValue, where each key is a lowercase rule name.
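
As a rough illustration of the lowercase, multi-valued storage described above, the following sketch folds "Name: value" lines into a plain hash of array references. It is a stand-in for Hash::MultiValue, not the module's own parser; the helper name collect_rules and the sample rules are assumptions made for this example.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the module): collect "Name: value" lines
# into a hash of arrays keyed by the lowercased rule name. The hash of
# array references only stands in for Hash::MultiValue.
sub collect_rules {
    my ($text) = @_;
    my %rules;
    for my $line (split /\r?\n/, $text) {
        $line =~ s/#.*//;                          # drop comments
        next unless $line =~ /^\s*([^:\s]+)\s*:\s*(.*?)\s*$/;
        push @{ $rules{lc $1} }, $2;               # repeated names keep every value
    }
    return \%rules;
}

my $rules = collect_rules(<<'EOT');
Allow: /public
Disallow: /private
Disallow: /tmp
Crawl-Delay: 2
EOT

print join(', ', @{ $rules->{disallow} }), "\n";   # /private, /tmp
```

With the real module, the same kind of lookup would go through get_all on the object returned by rules_for, as in the SYNOPSIS.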

The Request-rate rule is handled specially: it is normalized to a Crawl-delay rule.
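
One plausible way to picture this normalization is sketched below. It assumes a Request-rate value of the form "requests/seconds" (for example "1/5" for one request every five seconds) and converts it to a delay in seconds. The helper name request_rate_to_delay and the accepted format are assumptions; the module's actual handling may differ.

```perl
use strict;
use warnings;

# Hypothetical helper (not the module's code): turn a Request-rate value
# such as "1/5" (1 request per 5 seconds) into a Crawl-delay in seconds.
sub request_rate_to_delay {
    my ($value) = @_;
    my ($requests, $seconds) = $value =~ m{^\s*(\d+)\s*/\s*(\d+)\s*$}
        or return;                # unrecognized format
    return if $requests == 0;     # avoid division by zero
    return $seconds / $requests;
}

print request_rate_to_delay('1/5'), "\n";    # 5
print request_rate_to_delay('4/10'), "\n";   # 2.5
```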

METHODS

  • new

      $rules = WWW::RobotRules::Parser::MultiValue->new(
          agent => $user_agent
      );
      $rules = WWW::RobotRules::Parser::MultiValue->new(
          agent => $user_agent,
          ignore_default => 1,
      );
    

    Creates a new object to handle the rules in robots.txt. The object parses the rules that match $user_agent. The rules of User-agent: * always match, but they have lower precedence than the rules that explicitly match $user_agent. If the ignore_default option is specified, the rules of User-agent: * are ignored.

  • parse

      $rules->parse($uri, $text);
    

    Parses the robots.txt content $text, whose URI is $uri.

  • match_ua

      $rules->match_ua($pattern);
    

    Tests whether the user agent matches $pattern.

  • rules_for

      $hash = $rules->rules_for($uri);
    

    Returns a Hash::MultiValue that describes the rules for the domain of $uri.

  • allows

      $test = $rules->allows($uri);
    

    Tests whether the user agent is allowed to visit $uri. If there is an 'Allow' rule for the path of $uri, the visit is allowed. If there is a 'Disallow' rule for the path of $uri, the visit is not allowed. Otherwise, the visit is allowed.

  • delay_for

      $delay = $rules->delay_for($uri);
      $delay_in_milliseconds = $rules->delay_for($uri, 1000);
    

    Calculates the crawl delay for the specified $uri. The value is determined by the 'Crawl-delay' rule or the 'Request-rate' rule. The optional second argument specifies the base of the return value; for example, 1000 yields a delay in milliseconds.
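
The decision order described for allows above (Allow first, then Disallow, otherwise allowed) can be sketched in plain Perl as follows. This is only an illustration: the helper name is_allowed is hypothetical, and simple prefix matching on the path is an assumption, not necessarily how the module matches rules.

```perl
use strict;
use warnings;

# Hypothetical helper (not the module's code): decide access from lists of
# Allow and Disallow path prefixes, checking Allow before Disallow.
sub is_allowed {
    my ($path, $allow, $disallow) = @_;
    return 1 if grep { index($path, $_) == 0 } @$allow;      # an Allow rule matches
    return 0 if grep { index($path, $_) == 0 } @$disallow;   # a Disallow rule matches
    return 1;                                                # no rule matches
}

my @allow    = ('/private/docs');
my @disallow = ('/private');

print is_allowed('/private/docs/a.html', \@allow, \@disallow), "\n";   # 1
print is_allowed('/private/secret',      \@allow, \@disallow), "\n";   # 0
print is_allowed('/index.html',          \@allow, \@disallow), "\n";   # 1
```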

SEE ALSO

Hash::MultiValue

LICENSE

Copyright (C) INA Lintaro

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

AUTHOR

INA Lintaro <tarao.gnn@gmail.com>
