
Also extract urls that are pointing to other domains? [CLI] #135

Closed

sebs opened this issue Sep 15, 2023 · 20 comments

sebs commented Sep 15, 2023

I only get 'internal links'.

Is there a way to get external links too?

@j-mendez (Member)

Hi @sebs, not at the moment. It would be a nice feature to have. Some companies, like Disney, have their main domain as the root page while treating every link that they care about as a different DNS name on the page. This pattern makes it hard to gather all of the website data.

@j-mendez (Member)

@sebs this is now available in 1.42.0.

Crawling multiple domains as one now works, e.g. for https://rsseau.fr and https://loto.rsseau.fr.

Thank you for the issue!
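
A rough sketch of what that grouping might look like with the crate, assuming the with_external_domains builder that comes up later in this thread (the exact signature may differ between versions; the domains are the ones from the comment above):

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://rsseau.fr");

    // Treat links to this second domain as part of the same crawl group.
    website.with_external_domains(Some(std::iter::once(
        "https://loto.rsseau.fr".to_string(),
    )));

    website.crawl().await;

    // Links from both domains end up in one result set.
    for link in website.get_links() {
        println!("{:?}", link.as_ref());
    }
}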

sebs (Author) commented Sep 16, 2023

Ah, I love this so much ;) You are solving a big problem for me. I am trying to build a URL dataset of 20 million for a coding challenge ;)

I really appreciate this, as it saves me a ton of time.

j-mendez (Member) commented Sep 16, 2023

@sebs no worries at all, feel free to keep reporting issues even if it is just a simple question! This feature is something that I have wanted for a while too, since this project is the main engine for collecting data across a couple of things I use.

sebs (Author) commented Sep 23, 2023

I did not find the option for external domains in the CLI version of spider. Maybe the change did not make it through?

@j-mendez (Member)

@sebs it is not available in the CLI at the moment. Not all features carry over 1:1; if they fit the CLI, they also need to be added separately. Going to re-open this issue for the CLI.

j-mendez reopened this Sep 23, 2023
j-mendez changed the title from "Also extract urls that are pointing to other domains?" to "Also extract urls that are pointing to other domains? [CLI]" on Sep 23, 2023
@j-mendez (Member)

Now available in the CLI as of v1.45.10. Example below to group domains.

spider --domain https://rsseau.fr -E https://loto.rsseau.fr/ crawl -o

The -E flag can also be written as --external-domains.
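
For reference, the same command with the long-form flag spelled out (assuming it mirrors -E exactly):

spider --domain https://rsseau.fr --external-domains https://loto.rsseau.fr/ crawl -o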

sebs (Author) commented Sep 25, 2023

Maybe make it possible to add a * to extract all external domain links?

Background: one thing I am using the tool for is to create link maps, i.e. page A links to page B.

@j-mendez (Member)

@sebs done via 1.46.0. Thank you!
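
Presumably the catch-all is then passed as a literal asterisk, quoted so the shell does not expand it; this is an assumption based on the -E flag above, not a documented example:

spider --domain https://rsseau.fr -E "*" crawl -o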

sebs (Author) commented Sep 25, 2023

<3

@apsaltis

Hi,
Perhaps I'm using this incorrectly, but when I try the following command, using spider_cli 1.80.78:

spider -t -v --url https://www.theconsortium.cloud/ --depth 10 -s -E https://39834791.fs1.hubspotusercontent-na1.net/ scrape

I never see any URLs from the external domain, even though one of the crawled pages, https://www.theconsortium.cloud/application-consulting-services-page, has a button that links to a PDF on HubSpot. The HTML looks like this:

Download our one-pager for more information

the output from the scrape command looks like this for that page:
{
  "html": "",
  "links": [],
  "url": "https://www.theconsortium.cloud/application-consulting-services-page"
},

Is there a way, either programmatically or via the CLI, to have spider detect all of the links on a page? Thanks in advance.

@scientiac

How do I extract the URLs pointing to other domains, using the crate, not the CLI? I am trying to make a crawler with self-discovery of new sites from one seed.

j-mendez (Member) commented May 1, 2024

How do I extract the URLs pointing to other domains, using the crate, not the CLI? I am trying to make a crawler with self-discovery of new sites from one seed.

Use website.external_domains to add domains into the group for discovery.

scientiac commented May 1, 2024

I mean to catch all the websites that aren't under the same domain, not just the ones I specify; like using -E *, a catch-all.

j-mendez (Member) commented May 1, 2024

I mean to catch all the websites that aren't under the same domain, not just the ones I specify; like using -E *, a catch-all.

Set website.external_domains to a wildcard. If this isn't a thing yet, I can add it in later.

@scientiac

I don't think it is a thing.

j-mendez (Member) commented May 1, 2024

I don't think it is a thing.

CASELESS_WILD_CARD external domains handling

#135 (comment) looks like it was done. Use website.with_external_domains.

@scientiac

(screenshot: compiler error asking for an argument)

It asks me to provide an argument.

j-mendez (Member) commented May 1, 2024

(screenshot: compiler error asking for an argument)

It asks me to provide an argument.

Correct, follow the type of the function and set the value to a wildcard. Not sure what IDE that is; rust-analyzer is almost a must when using any crate.

scientiac commented May 1, 2024

I used this

        .with_external_domains(Some(vec!["*"].into_iter().map(|s| s.to_string())))

and this:

        .with_external_domains(Some(std::iter::once("*".to_string())));

This compiles just fine but doesn't give me any external links from the site:

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://carboxi.de");

    // Respect robots.txt, include subdomains, and pass the wildcard
    // so external domains should be grouped into the crawl.
    website
        .with_respect_robots_txt(true)
        .with_subdomains(true)
        .with_external_domains(Some(std::iter::once("*".to_string())));

    website.crawl().await;

    let links = website.get_links();
    let url = website.get_url().inner();
    let status = website.get_status();

    println!("URL: {:?}", url);
    println!("Status: {:?}\n", status);

    // Print every link collected during the crawl.
    for link in links {
        println!("{:?}", link.as_ref());
    }
}

I don't think I understand what the wildcard for this is.
