Also extract urls that are pointing to other domains? [CLI] #135
Comments
Hi @sebs, not at the moment. It would be a nice feature to have. Some companies like Disney have their main domain as the root page while treating every link they care about as a different DNS name on the page, which makes it hard to gather all the website data.
@sebs this is now available in 1.42.0. Thank you for the issue!
ah, I love this so much ;) You are solving a big problem for me. I'm trying to build a URL dataset of 20 million for a coding challenge ;) I really appreciate this, as it saves me a ton of time.
@sebs no worries at all, feel free to keep reporting issues even if it is just a simple question! This feature is something I wanted for a while too, since this project is the main engine for collecting data across a couple of things I use.
I did not find the option for external domains in the CLI version of spider. Maybe the change did not make it through?
@sebs it is not available in the CLI at the moment. Not all features go 1:1; if they fit the CLI, they also need to be added separately. Going to re-open this issue for the CLI.
Now available in the CLI v1.45.10. Example below to group domains.
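A sketch of that grouping with the `-E` flag that shows up later in this thread; the URLs and the `crawl` subcommand are placeholders here, so check `spider --help` for the exact syntax:

```
# Crawl the seed site and group an external domain into discovery (example URLs).
spider -v --url https://www.example.com -E https://cdn.example.net/ crawl
```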
Maybe make it possible to add a `*` to extract all external domain links? Background: one thing I am using the tool for is to create link maps, i.e. page A links to page B.
@sebs done via
<3
Hi, I ran:

```
spider -t -v --url https://www.theconsortium.cloud/ --depth 10 -s -E https://39834791.fs1.hubspotusercontent-na1.net/ scrape
```

I never see any URLs from the external domain, even though one of the pages crawled, https://www.theconsortium.cloud/application-consulting-services-page, has a button that links to a PDF on HubSpot (an anchor with the text "Download our one-pager for more information"), yet that link never shows up in the output of the scrape command for that page. Is there a way, either programmatically or via the CLI, to have spider detect all of the links on a page? Thanks in advance.
How do I extract the URLs pointing to other domains using the crate, not the CLI? I'm trying to make a crawler with self-discovery of new sites from one seed.
Use `website.external_domains` to add domains into the group for discovery.
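A minimal sketch of that setup, assuming the crate's builder exposes `with_external_domains` and the usual `crawl`/`get_links` API; the method name and URLs are assumptions, so check the spider docs for your version:

```rust
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    // Seed site plus an external domain added to the discovery group
    // (assumed builder method; the field is `external_domains` per the comment above).
    let mut website = Website::new("https://www.example.com")
        .with_external_domains(Some(
            ["https://cdn.example.net".to_string()].into_iter(),
        ))
        .build()
        .unwrap();

    // Crawl, then list every link gathered, internal and external.
    website.crawl().await;

    for link in website.get_links() {
        println!("{}", link.as_ref());
    }
}
```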
I mean catching all the websites that aren't under the same domain, not just the ones I specify... like using `-E *`, a catch-all.
Set `website.external_domains` to a wildcard. If this isn't a thing yet, I can add it in later.
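A hedged sketch of that wildcard variant, assuming `*` is the accepted catch-all value (this is the part the maintainer offers to add if it doesn't exist yet):

```rust
// Assumed wildcard: treat every external domain encountered as in scope.
let mut website = Website::new("https://www.example.com")
    .with_external_domains(Some(["*".to_string()].into_iter()))
    .build()
    .unwrap();
```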
I don't think it is a thing.
#135 (comment) looks like it was done. Use
I used this:

and this:

This compiles just fine but doesn't give me any external links from the site. I don't think I understand what the wildcard for this is.
I only get 'internal links'.
Is there a way to get external links too?