Add option to override spider configuration when starting a run #11

Merged
merged 1 commit into from Jan 6, 2022

ksassnowski
Contributor

This PR adds the option of overriding some or all of a spider’s configuration at runtime.

To do so, we can now pass an optional $overrides parameter to the Roach::startSpider function. This parameter takes an instance of RoachPHP\Spider\Configuration\Overrides, which gets merged with the spider’s own configuration.

Roach::startSpider(
    MySpider::class, 
    new Overrides(startUrls: ['https://roach-php.dev/docs/installation'])
);

Rationale

While having all configuration defined inside the spider class itself is certainly convenient, we’re essentially hard-coding everything about a spider. For example, before this PR, it would not have been possible to dynamically pass different start URLs to a spider. With this PR, we can now accept the start URL through the UI, load it from the database, or conjure it up some other way. We can then start a run of a spider using that URL.

class SpiderController
{
    public function __invoke(Request $request)
    {
        // Request validation is left as an exercise for the reader.
        $startUrl = $request->get('start_url');

        // Or dispatch a job here to start the run or whatever.
        Roach::startSpider(
            MySpider::class, 
            new Overrides(['startUrls' => [$startUrl]])
        );

        return redirect()->back();
    }
}

Note that this example assumes that the parsing logic you have written works for all of these dynamic URLs. Validating the start URLs is still up to you.
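Since validation is left to the caller, here is a minimal sketch of what it could look like using only PHP built-ins. The validateStartUrl helper and the host allow-list are assumptions made for illustration; neither is part of Roach.

```php
<?php

/**
 * Hypothetical helper (not part of Roach): returns the URL if it is a
 * well-formed http(s) URL on an allowed host, otherwise null.
 *
 * @param list<string> $allowedHosts
 */
function validateStartUrl(mixed $candidate, array $allowedHosts): ?string
{
    if (!is_string($candidate)) {
        return null;
    }

    // Reject anything that is not a syntactically valid URL.
    if (filter_var($candidate, FILTER_VALIDATE_URL) === false) {
        return null;
    }

    $scheme = strtolower((string) parse_url($candidate, PHP_URL_SCHEME));
    $host = strtolower((string) parse_url($candidate, PHP_URL_HOST));

    // Only crawl over http(s).
    if (!in_array($scheme, ['http', 'https'], true)) {
        return null;
    }

    // Only accept hosts the spider's parsing logic was written for.
    return in_array($host, $allowedHosts, true) ? $candidate : null;
}
```

In the controller above, the result of such a check would decide whether to call Roach::startSpider at all or to reject the request.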

The same holds true for all the other spider configuration values as well. Maybe we want to quickly fire off a test run against a specific URL without having to change the actual spider class itself. We could override the concurrency and requestDelay parameters so as not to hammer the server with requests. We could also register the MaxRequestExtension for this run to ensure that we stop the run after the first request.

Roach::startSpider(
    MySpider::class,
    new Overrides([
        'extensions' => [
            LoggerExtension::class,
            [MaxRequestExtension::class, ['limit' => 1]],
        ]
    ])
);

Note that overridden values replace the spider’s configuration values; they do not get merged with them. In the example above we have to make sure to also register the LoggerExtension for the run, even if the spider’s own configuration already registers it.
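The replace-not-merge behaviour can be illustrated with plain arrays. This is only a sketch of the semantics described above, using PHP's built-in array_replace as a stand-in; it is not Roach's actual implementation, and the configuration keys and extension names are placeholder strings.

```php
<?php

// Placeholder for a spider's own configuration (illustrative values only).
$spiderConfig = [
    'startUrls' => ['https://roach-php.dev/docs/installation'],
    'extensions' => ['LoggerExtension', 'SomeOtherExtension'],
    'concurrency' => 5,
];

// Only the 'extensions' key is overridden for this run.
$overrides = [
    'extensions' => ['MaxRequestExtension'],
];

// array_replace swaps out the whole value for each key present in
// $overrides and leaves all other keys untouched. Nested arrays are
// replaced wholesale, not merged element by element.
$effective = array_replace($spiderConfig, $overrides);
```

Here $effective['extensions'] contains only the MaxRequestExtension placeholder, which is why the logger has to be listed again in the override if it should stay active.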

@ksassnowski ksassnowski merged commit 334c37c into main Jan 6, 2022