
Also extract urls that are pointing to other domains? [CLI] #135

Closed

sebs opened this issue Sep 15, 2023 · 20 comments

sebs commented Sep 15, 2023

I only get 'internal links'.

Is there a way to get external links too?

@j-mendez (Member)

Hi @sebs, not at the moment. It would be a nice feature to have. Some companies, like Disney, have their main domain as the root page while treating every link that they care about as a different DNS name on the page. This pattern makes it hard to gather all of the website data.

@j-mendez (Member)

@sebs this is now available in 1.42.0.

Crawling multiple domains as one now works, e.g. for https://rsseau.fr and https://loto.rsseau.fr.

Thank you for the issue!
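
A rough sketch of what that grouping might look like with the crate, assuming the with_external_domains builder that comes up later in this thread (the exact signature may differ between versions; the domains are the ones from the comment above):

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://rsseau.fr");

    // Treat links to this second domain as part of the same crawl group.
    website.with_external_domains(Some(std::iter::once(
        "https://loto.rsseau.fr".to_string(),
    )));

    website.crawl().await;

    // Links from both domains end up in one result set.
    for link in website.get_links() {
        println!("{:?}", link.as_ref());
    }
}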

sebs (Author) commented Sep 16, 2023

Ah, I love this so much ;) You are solving a big problem for me. I am trying to build a URL dataset of 20 million for a coding challenge ;)

I really appreciate this, as it saves me a ton of time.

j-mendez (Member) commented Sep 16, 2023

@sebs no worries at all, feel free to keep reporting issues even if it is just a simple question! This feature is something that I have wanted for a while too, since this project is the main engine for collecting data across a couple of things I use.

sebs (Author) commented Sep 23, 2023

I did not find the option for external domains in the CLI version of spider. Maybe the change did not make it through?

@j-mendez (Member)

@sebs it is not available in the CLI at the moment. Not all features carry over 1:1; if they fit the CLI, they also need to be added separately. Going to re-open this issue for the CLI.

j-mendez reopened this Sep 23, 2023
j-mendez changed the title from "Also extract urls that are pointing to other domains?" to "Also extract urls that are pointing to other domains? [CLI]" on Sep 23, 2023
@j-mendez (Member)

Now available in the CLI as of v1.45.10. Example below to group domains.

spider --domain https://rsseau.fr -E https://loto.rsseau.fr/ crawl -o

The -E flag can also be written as --external-domains.
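
For reference, the same command with the long-form flag spelled out (assuming it mirrors -E exactly):

spider --domain https://rsseau.fr --external-domains https://loto.rsseau.fr/ crawl -o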

sebs (Author) commented Sep 25, 2023

Maybe make it possible to add a * to extract all external domain links?

Background: one thing I am using the tool for is to create link maps, i.e. page A links to page B.

@j-mendez (Member)

@sebs done via 1.46.0. Thank you!
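
Presumably the catch-all is then passed as a literal asterisk, quoted so the shell does not expand it; this is an assumption based on the -E flag above, not a documented example:

spider --domain https://rsseau.fr -E "*" crawl -o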

sebs (Author) commented Sep 25, 2023

<3

@apsaltis

Hi,
Perhaps I'm using this incorrectly, but when I try the following command, using spider_cli 1.80.78:

spider -t -v --url https://www.theconsortium.cloud/ --depth 10 -s -E https://39834791.fs1.hubspotusercontent-na1.net/ scrape

I never see any URLs from the external domain, even though one of the crawled pages, https://www.theconsortium.cloud/application-consulting-services-page, has a button that links to a PDF on HubSpot. The HTML looks like this:

Download our one-pager for more information

the output from the scrape command looks like this for that page:
{
  "html": "",
  "links": [],
  "url": "https://www.theconsortium.cloud/application-consulting-services-page"
},

Is there a way, either programmatically or via the CLI, to have spider detect all of the links on a page? Thanks in advance.

@scientiac

How do I extract the URLs pointing to other domains, using the crate, not the CLI? I am trying to make a crawler with self-discovery of new sites from one seed.

j-mendez (Member) commented May 1, 2024

How do I extract the URLs pointing to other domains, using the crate, not the CLI? I am trying to make a crawler with self-discovery of new sites from one seed.

Use website.external_domains to add domains into the group for discovery.

scientiac commented May 1, 2024

I mean to catch all the websites that aren't under the same domain, not just the ones I specify; like using -E *, a catch-all.

j-mendez (Member) commented May 1, 2024

I mean to catch all the websites that aren't under the same domain, not just the ones I specify; like using -E *, a catch-all.

Set website.external_domains to a wildcard. If this isn't a thing yet, I can add it in later.

@scientiac

I don't think it is a thing.

j-mendez (Member) commented May 1, 2024

I don't think it is a thing.

CASELESS_WILD_CARD external domains handling

#135 (comment) looks like it was done. Use website.with_external_domains.

@scientiac

(screenshot: compiler error asking for an argument)

It asks me to provide an argument.

j-mendez (Member) commented May 1, 2024

(screenshot: compiler error asking for an argument)

It asks me to provide an argument.

Correct, follow the type of the function and set the value to a wildcard. Not sure what IDE that is; rust-analyzer is almost a must when using any crate.

scientiac commented May 1, 2024

I used this

        .with_external_domains(Some(vec!["*"].into_iter().map(|s| s.to_string())))

and this:

        .with_external_domains(Some(std::iter::once("*".to_string())));

This compiles just fine but doesn't give me any external links from the site:

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://carboxi.de");

    // Respect robots.txt, include subdomains, and pass the wildcard
    // so external domains should be grouped into the crawl.
    website
        .with_respect_robots_txt(true)
        .with_subdomains(true)
        .with_external_domains(Some(std::iter::once("*".to_string())));

    website.crawl().await;

    let links = website.get_links();
    let url = website.get_url().inner();
    let status = website.get_status();

    println!("URL: {:?}", url);
    println!("Status: {:?}\n", status);

    // Print every link collected during the crawl.
    for link in links {
        println!("{:?}", link.as_ref());
    }
}

I don't think I understand what the wildcard for this is.
