This issue was moved to a discussion.

No results with specific url #196

Closed
plompd opened this issue Jan 16, 2024 · 2 comments
Labels: bug (Something isn't working)


plompd commented Jan 16, 2024

Describe the bug
Scraping the following URL succeeds:
https://www.werkenbijzicht.nl/vacatures.html

Scraping the same page through a different URL returns no results:
https://www.werkenbijzicht.nl/p/1/vacatures.html

It is the same page, but reached through the pager at the bottom.

With the original URL, my spider scrapes only the first 10 items. When it follows the link to the next page, it gets a URL of the form shown above, and that request returns nothing.

I cannot see what the difference is or how to fix it.

  • core: 3.0.0
class ZichtSpider extends BasicSpider
{
    public array $startUrls = [
        'https://www.werkenbijzicht.nl/vacatures.html',
    ];

    public array $downloaderMiddleware = [
        RequestDeduplicationMiddleware::class,
        ExecuteJavascriptMiddleware::class,
        [RandomUserAgentMiddleware::class, ['userAgent' => 'Mozilla/5.0 (compatible; RoachPHP/0.1.0)']],
    ];

    public array $spiderMiddleware = [
        //
    ];

    public array $itemProcessors = [
        //
    ];

    public array $extensions = [
        LoggerExtension::class,
        StatsCollectorExtension::class,
    ];

    public int $concurrency = 2;

    public int $requestDelay = 1;

    public function parse(Response $response): Generator
    {
        $items = $response->filter('.actSResContainer .itemContainer')->each(function (Crawler $node) {

            $titleNode = $node->filter('.itemTitle a.cluetips');
            $relAttribute = $titleNode->attr('rel'); // Fetching the rel attribute

            // Regular expression to extract the ID
            $regex = '/id\/(\d+)\//';
            $matches = [];
            $vacancyId = '';

            if (preg_match($regex, $relAttribute, $matches)) {
                $vacancyId = $matches[1]; // The first captured group contains the ID
            }

            return [
                'url' => $titleNode->link()->getUri(),
                'title' => $titleNode->text(),
                'referenceNumber' => $vacancyId,
            ];
        });

        foreach ($items as $item) {
            yield $this->request('GET', $item['url'], 'parseJob', ['item' => $item]);
        }

        try {
            // Attempt to find the next page link. Adjust the selector as needed.
            $nextPageLink = $response->filter('.pageNav a.pnNext');

            if ($nextPageLink->count() > 0) {
                $nextPageUrl = 'https://www.werkenbijzicht.nl' . $nextPageLink->attr('href');
                yield $this->request('GET', $nextPageUrl);
            }
        } catch (\Exception $e) {
            // Any exception raised here is swallowed, so pagination stops silently.
        }
    }

    public function parseJob(Response $response): Generator
    {
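
The `parse` method above builds the next-page URL by prepending the host to the `href` value, which only works if that `href` is root-relative. A minimal sketch of a more defensive resolution in plain PHP follows. `resolveHref` is a hypothetical helper, not part of Roach, and it does not normalize `.` or `..` segments. The spider already calls `link()->getUri()` on the title node, so the same call on the next-page node would presumably give an equivalent result.

```php
<?php
declare(strict_types=1);

/**
 * Resolve an href against the URL of the page it was found on.
 * Hypothetical helper for illustration; no dot-segment normalization.
 */
function resolveHref(string $pageUrl, string $href): string
{
    // An href that carries its own scheme is already absolute.
    if (is_string(parse_url($href, PHP_URL_SCHEME))) {
        return $href;
    }

    $scheme = parse_url($pageUrl, PHP_URL_SCHEME);
    $host   = parse_url($pageUrl, PHP_URL_HOST);
    if (!is_string($scheme) || !is_string($host)) {
        throw new InvalidArgumentException("Page URL must be absolute: {$pageUrl}");
    }

    // Protocol-relative href ("//host/path"): reuse the page's scheme.
    if (str_starts_with($href, '//')) {
        return $scheme . ':' . $href;
    }

    $port   = parse_url($pageUrl, PHP_URL_PORT);
    $origin = $scheme . '://' . $host . (is_int($port) ? ':' . $port : '');

    // Root-relative href ("/p/1/x.html"): append to the origin.
    if (str_starts_with($href, '/')) {
        return $origin . $href;
    }

    // Path-relative href ("x.html"): resolve against the page's directory.
    $path = parse_url($pageUrl, PHP_URL_PATH);
    $dir  = is_string($path) ? substr($path, 0, (int) strrpos($path, '/') + 1) : '/';

    return $origin . $dir . $href;
}
```

In the spider this would be called as `resolveHref($response->getUri(), $nextPageLink->attr('href'))`, assuming the response exposes its URL through `getUri()`.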
plompd added the bug label Jan 16, 2024

plompd commented Jan 25, 2024

@ksassnowski Do you have any clue?

ksassnowski (Contributor) commented

I had a quick look at their site and I honestly have no clue what they're doing. The same URL sometimes returns results and sometimes doesn't. It might be tied to the current session but I haven't investigated this any further.

I'm turning this issue into a discussion since it's not really a problem with the library.
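
One way to narrow down the "sometimes returns results, sometimes doesn't" behaviour is to count the result nodes in the raw HTML of each response, outside the spider. The sketch below uses only PHP's bundled DOM extension. `countResultItems` is a hypothetical helper, and the class names are taken from the selector in the spider above. It says nothing about why the site behaves this way. It only shows whether the items are present in a given body.

```php
<?php
declare(strict_types=1);

/**
 * Count elements matching ".actSResContainer .itemContainer" in an HTML string.
 * Hypothetical diagnostic helper; class names come from the spider's selector.
 */
function countResultItems(string $html): int
{
    // DOMDocument::loadHTML() rejects an empty string, so handle it up front.
    if (trim($html) === '') {
        return 0;
    }

    $doc = new DOMDocument();

    // Collect libxml warnings internally instead of emitting them for messy HTML.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // XPath equivalent of the CSS descendant selector, matching whole class tokens.
    $hasClass = static fn (string $class): string =>
        "contains(concat(' ', normalize-space(@class), ' '), ' {$class} ')";
    $query = '//*[' . $hasClass('actSResContainer') . ']//*[' . $hasClass('itemContainer') . ']';

    $nodes = (new DOMXPath($doc))->query($query);

    return $nodes === false ? 0 : $nodes->length;
}
```

Feeding it the bodies of both URLs, fetched once with and once without the cookies from an earlier request, would show whether the session theory holds. The spider also runs `ExecuteJavascriptMiddleware`, so the rendered DOM may differ from the raw HTML checked here.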

roach-php locked and limited conversation to collaborators Jan 28, 2024
ksassnowski converted this issue into discussion #207 Jan 28, 2024
