Perl 6: Web::Scraper;

This module works much like Perl 5's Web::Scraper, with some extra sugar in a few areas. It currently parses XML and HTML (HTML is converted to XHTML first).

Status:

Working on:

Maintenance mode, enhancements, etc.

Syntax:

Data:

<data>
  <t id="1">test1</t>
  <t id="2">test2</t>
  <e>etest</e>
  <t id="3">test3</t>
  <t id="4">test4</t>
  <e>etest</e>
  <nest>
    <id>1</id>
    <val>50</val>
    <sval>1</sval>
    <sval>2</sval>
    <sval>3</sval>
    <sval>43</sval>
  </nest>
  <nest>
    <id>2</id>
    <val>30</val>
    <sval>2</sval>
    <sval>3</sval>
    <sval>5</sval>
    <sval>47</sval>
  </nest>
</data>

Scraper:

my $count   = 0;
my $scraper = scraper {
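  # A trailing '[]' on the key collects every match into an array.
  # 'TEXT' takes the element's text; '@id' takes its id attribute.
  process 't', 'tarray[]' => {
    name => 'TEXT',
    id   => '@id'
  };
  # A sub receives the matched element and returns the value to store.
  process 'e', 'e[]' => sub ($elem) {
    return "{$elem.contents[0].text ~ $count++}";
  };
  process 't', 'ttext[]' => 'TEXT';
  # A nested scraper runs against each matched <nest> element.
  process 'nest', 'nested[]' => scraper {
    process 'id', 'id' => 'TEXT';
    process 'val', 'val' => 'TEXT';
    process 'sval', 'svals[]' => 'TEXT';
  };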
}  
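
The block above only defines the scraper; nothing is parsed until `scrape` is called. A minimal sketch of running it, assuming the sample data has been saved as `data.xml` (a hypothetical file name) and that `scrape` accepts a file path the same way it does in the dynamic-source example further down:

```perl6
use Web::Scraper;

# 'data.xml' is a hypothetical file holding the <data> document shown above.
$scraper.scrape('data.xml');

# The scraped structure is exposed through .d
say $scraper.d<ttext>;
say $scraper.d<nested>;
```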

Results:

$scraper.d = {
  tarray => [
    {id => 1, name => 'test1'},
    {id => 2, name => 'test2'},
    {id => 3, name => 'test3'},
    {id => 4, name => 'test4'},
  ],
  e => [
    'etest0',
    'etest1',
  ],
  ttext => [
    'test1',
    'test2',
    'test3',
    'test4',
  ],
  nested => [
    { id => 1, val => 50, svals => [1,2,3,43,], },
    { id => 2, val => 30, svals => [2,3,5,47,], },
  ],
};

Example with a dynamic source file:

Data:

master.xml:

<xml>
  <files>
    <file>one.xml</file>
    <file>two.xml</file>
  </files>
</xml>

one.xml:

<xml>
  <id>1</id>
  <word>one</word>
</xml>

two.xml:

<xml>
  <id>2</id>
  <word>two</word>
</xml>

Syntax:

my $dynamicscraper = scraper {
  process 'id', 'id' => 'TEXT';
  process 'word', 'alpha' => 'TEXT';
};
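my $masterscraper = scraper {
  process 'files', 'files[]' => scraper {
    process 'file', 'src-file' => 'TEXT';
    # 'resource' takes the text of each <file> element as a source to load
    # and runs $dynamicscraper on it; judging by the results below, the
    # keys it scrapes end up alongside 'src-file'.
    resource $dynamicscraper, 'file' => 'TEXT';
  };
};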
$masterscraper.scrape('master.xml');

Results:

$masterscraper.d == {
  files => [
    { src-file => 'one.xml', id => '1', alpha => 'one', },
    { src-file => 'two.xml', id => '2', alpha => 'two', },
  ],
}
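
As a hedged illustration of reading that structure back, the sketch below walks the `files` array and prints two fields per entry. It assumes `$masterscraper.d` has exactly the shape shown in the results above (an array of hashes under `files`); the loop variable `$file` is introduced here for illustration only.

```perl6
# Assumes $masterscraper.scrape('master.xml') has already been called.
for $masterscraper.d<files>.list -> $file {
    # Each entry combines the source file name with the keys
    # scraped from that file by $dynamicscraper.
    say $file<src-file> ~ ' => ' ~ $file<alpha>;
}
```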