
Added "Check webpage for broken links" example #299

Closed · wants to merge 2 commits
@sb89
Contributor

sb89 commented Sep 27, 2017

#290

I may have made this too complicated. Let me know what you think.

@budziq
Collaborator

budziq left a comment

@sb89 Thanks a lot! 👍

There are a few suggestions to make the example a little more concise.

src/net.md Outdated
let mut broken_urls = HashSet::new();
let url = Url::parse("https://www.rust-lang.org/en-US/")?;
let mut base_url = Url::parse(&format!(


@budziq

budziq Sep 28, 2017

Collaborator

This whole block could be substituted with:

    let base_url = Url::parse(&parsed[..url::Position::BeforePath])?;

and `base_url` will not need to be `mut` here.

Also, I would extract the whole link acquisition and transformation into a separate function for clarity.
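
For illustration, a minimal self-contained sketch of that substitution (assuming `parsed` is the page URL parsed just above; the variable name comes from this suggestion, not the PR code):

    use url::{Position, Url};

    fn main() -> Result<(), url::ParseError> {
        let parsed = Url::parse("https://www.rust-lang.org/en-US/")?;
        // Slicing the parsed URL up to `Position::BeforePath` keeps only the
        // scheme and host, so no format!/host_str round-trip is needed.
        let base_url = Url::parse(&parsed[..Position::BeforePath])?;
        assert_eq!(base_url.as_str(), "https://www.rust-lang.org/");
        Ok(())
    }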

src/net.md Outdated
url.host_str().unwrap_or("")
))?;
let res = reqwest::get(url.as_str())?;


@budziq

budziq Sep 28, 2017

Collaborator

No need for `as_str` here; `reqwest::get` accepts a `Url` directly.

src/net.md Outdated
.collect();
for link in links {
let link_url = Url::parse(link).or_else(|e| match e {


@budziq

budziq Sep 28, 2017

Collaborator

You can use `ParseOptions` here to make the crate do all the work (if the link is absolute then it will be used, otherwise the `base_url` will be used):

    let base_parser = Url::options().base_url(Some(&base_url));
    ...
    for link in links { // or `map` here
        let final_url = base_parser.parse(link)?;
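
For context, a self-contained sketch of how a parser built this way resolves both kinds of links (the example URLs are illustrative):

    use url::Url;

    fn main() -> Result<(), url::ParseError> {
        let base_url = Url::parse("https://www.rust-lang.org")?;
        let base_parser = Url::options().base_url(Some(&base_url));

        // A relative link is resolved against the base URL...
        let rel = base_parser.parse("/en-US/documentation.html")?;
        assert_eq!(rel.as_str(), "https://www.rust-lang.org/en-US/documentation.html");

        // ...while an absolute link is kept as-is.
        let abs = base_parser.parse("https://crates.io/")?;
        assert_eq!(abs.as_str(), "https://crates.io/");
        Ok(())
    }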
src/net.md Outdated
_ => Err(e),
});
match link_url {


@budziq

budziq Sep 28, 2017

Collaborator

Instead of matching, I would suggest using iterator combinators here to extract only the error cases together with their links.
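
For illustration, one possible combinator chain (the `links` and `base_parser` names follow the snippets above; the exact shape was left to the author):

    // Keep only the links whose parse failed, paired with their error.
    let parse_errors: Vec<(&str, url::ParseError)> = links
        .iter()
        .filter_map(|&link| base_parser.parse(link).err().map(|e| (link, e)))
        .collect();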

src/net.md Outdated
Err(e) => println!("Error parsing URL ({}): {}", link, e.description()),
Ok(link_url) => {
let res = reqwest::get(link_url.as_str())?;


@budziq

budziq Sep 28, 2017

Collaborator

I would extract this super small logic to a separate function for clarity.

src/net.md Outdated
[`Ipv4Addr`]: https://doc.rust-lang.org/std/net/struct.Ipv4Addr.html
[`Name`]: https://docs.rs/select/*/select/predicate/struct.Name.html
[`ParseError::RelativeUrlWithoutBase`]: https://docs.rs/url/1.5.1/url/enum.ParseError.html#variant.RelativeUrlWithoutBase


@budziq

budziq Sep 28, 2017

Collaborator

Please use the wildcard version when pointing to docs.rs.
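
That is, the versioned link quoted above would become:

    [`ParseError::RelativeUrlWithoutBase`]: https://docs.rs/url/*/url/enum.ParseError.html#variant.RelativeUrlWithoutBase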

src/net.md Outdated
We can then check if the document has a "base" tag. If so, we get the "href" [`attr`] from the first occurrence of the "base" tag. This is then used as the base URL.
We can then retrieve all the links from the document by using [`find`] with the criteria of the [`Name`] being "a".


@budziq

budziq Sep 28, 2017

Collaborator

We already have an "extract all links from a webpage" example, where nearly all of this information is presented. Let's build on it and not duplicate the information. This is a more advanced example, so I would suggest limiting the description to only the new information not covered by earlier examples.

If you think that some of this information is not covered so far, please consider updating the earlier example descriptions 👍

@sb89

Contributor Author

sb89 commented Sep 30, 2017

I've implemented changes in line with your comments.

src/net.md Outdated
.filter_map(|n| n.attr("href"))
.nth(0)
{
base_url = Url::parse(base_tag_href)?;


@budziq

budziq Sep 30, 2017

Collaborator

You can use a combinator such as `map` here instead of having `if let` and a `mut base_url`.
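
As a sketch of that shape (using `map_or_else`, and assuming the `get_base_url(url, doc)` signature discussed later in this review; not the final PR code):

    let base_url = doc
        .find(Name("base"))
        .filter_map(|n| n.attr("href"))
        .nth(0)
        .map_or_else(
            // No <base> tag: fall back to the page URL up to its path.
            || Url::parse(&url[..Position::BeforePath]),
            // Otherwise parse the href of the first <base> tag.
            Url::parse,
        )?;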

src/net.md Outdated
}
fn run() -> Result<()> {
let (url, mut base_url) = get_urls("https://www.rust-lang.org/en-US/")?;


@budziq

budziq Sep 30, 2017

Collaborator

Everything up to `get_broken_links` should be within `get_urls`.

@sb89

Contributor Author

sb89 commented Sep 30, 2017

I believe the reviews you have made are on outdated code. I pushed an additional commit about 40 minutes ago.

@budziq

Collaborator

budziq commented Sep 30, 2017

I believe the reviews you have made are on outdated code. I pushed an additional commit about 40 minutes ago.

Yeah, I know. I'm at a conference and the wifi is flaky at best (these are reviews that were meant to be sent a few hours ago). Stand by, I'll publish an updated review shortly. There should be only minor changes and we're good to go.

@budziq
Collaborator

budziq left a comment

@sb89 just some minor nits left. Sorry for the earlier confusion.

src/net.md Outdated
let links: HashSet<&str> = document
.find(Name("a"))
.filter_map(|n| n.attr("href"))
.collect();


@budziq

budziq Sep 30, 2017

Collaborator

I would suggest transforming the links into absolute ones (with the `base_url`) here, prior to collecting into the HashSet (it is possible that a single page will have both relative and absolute versions of the same link).


@sb89

sb89 Sep 30, 2017

Author Contributor

How would I do this, given that `parse` returns a `Result` and therefore I cannot use the `?` operator when iterating? Should I just ignore any URL that cannot be parsed (and report no errors), like this?

let links: HashSet<Url> = document
    .find(Name("a"))
    .filter_map(|n| n.attr("href"))
    .filter_map(|link| base_parser.parse(link).ok())
    .collect();


@budziq

budziq Sep 30, 2017

Collaborator

Well, that entirely depends on what you'd like to achieve: test only the correct links on a best-effort basis (your suggestion), or error out if there is even a single parse failure, by collecting into a `Result`:

    let hs: Result<HashSet<_>> = document
        .find(Name("a"))
        .filter_map(|n| n.attr("href"))
        .map(|link| base_parser.parse(link).map_err(|e| e.into()))
        .collect();

Unfortunately, to do this trick with `error_chain` you'll have to jump through hoops with `map_err` as above (normally this would not be required).

src/net.md Outdated
.filter_map(|n| n.attr("href"))
.collect();
let broken_links = get_broken_links(&base_url, &links)?;


@budziq

budziq Sep 30, 2017

Collaborator

I would extract only the link-checking logic into a separate function `check_link` that takes a single `Url` (or better yet, `AsRef<Url>`). Such a function could then be `map`'ed over the collection iterator instead of being collected into yet another HashSet.
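
A minimal sketch of such a helper (the name comes from this review; the 404 check and the `StatusCode` import are assumptions about the eventual shape):

    use reqwest::StatusCode;

    // Check a single, already-absolute URL; `true` means the link resolves.
    fn check_link(url: &Url) -> Result<bool> {
        let res = reqwest::get(url.as_ref())?;
        Ok(res.status() != StatusCode::NotFound)
    }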


@budziq

budziq Sep 30, 2017

Collaborator

Also, there seem to be conflicts making the merge impossible. Could you look into that before squashing the PR commits into a single one?

src/net.md Outdated
}
fn get_broken_links<'a>(base_url: &Url, links: &HashSet<&'a str>) -> Result<HashSet<&'a str>> {
let mut broken_links = HashSet::new();


@budziq

budziq Sep 30, 2017

Collaborator

I would prefer this function (`check_link`) to operate on only a single `Url`; it would take one `Url` (or better yet, `AsRef<Url>`) that was already made absolute, making the `base_url` param unnecessary. Such a function could then be `map`'ed over the collection iterator instead of being collected into yet another HashSet.

src/net.md Outdated
Now that we have our URL, we pass them into the "get_broken_links" function along with the base URL. This will iterate over the URL, use [`url::ParseOptions`] and [`Url::parse`] for each to get the absolute URL and then perform a HTTP GET request. If the response status is 404, the URL will be added to the HashSet which is returned by the function.
Finally, we check the contents of the HashSet. If it is empty, we report that no links are broken, otherwise we iterate over the HashSet and report the broken links.


@budziq

budziq Sep 30, 2017

Collaborator

I would suggest describing only the finer details, i.e. things that might not be obvious from the code.

src/net.md Outdated
[![reqwest-badge]][reqwest] [![select-badge]][select] [![url-badge]][url] [![cat-net-badge]][cat-net]
We parse the URL using [`Url::parse`] and then retrieve the document using reqwest.


@budziq

budziq Sep 30, 2017

Collaborator

The user should be already familiar with this information from the previous examples.

src/net.md Outdated
We parse the URL using [`Url::parse`] and then retrieve the document using reqwest.
We call "get_base_url" to retrieve the base URL. If the document has a "base" tag, we get the "href" [`attr`] from the first occurrence of the "base" tag. This is then used as the base URL. Otherwise, we can use [`Position::BeforePath`] with the original URL to get the base of that URL.


@budziq

budziq Sep 30, 2017

Collaborator

I would try to pare the explanation down to only the information that is not mentioned in earlier examples.

@budziq
Collaborator

budziq left a comment

@sb89 very nice! I think that we are at our last iteration prior to merge 👍

src/net.md Outdated
.filter_map(|link| base_parser.parse(link).ok())
.collect();
let broken_links = links.iter().filter(|link| {


@budziq

budziq Sep 30, 2017

Collaborator

No need for the intermediate `broken_links` variable; we can iterate over the filtered iterator in a `for` loop.

src/net.md Outdated
let broken_links = links.iter().filter(|link| {
let link_ok = check_link(link);
link_ok.is_ok() && !link_ok.unwrap()


@budziq

budziq Sep 30, 2017

Collaborator

Please avoid unwrapping. `check_link` could return a custom error defined in `error_chain!` when the link is broken; then you could extract all the errored values with `filter_map(|res| res.err())`.
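
A hedged sketch of that suggestion, assuming `check_link` returns `Err` for broken links (the `BrokenLink` variant name is hypothetical):

    // errors { BrokenLink(url: String) } // declared in `error_chain!`

    for err in links.iter().map(check_link).filter_map(|res| res.err()) {
        println!("{}", err);
    }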

src/net.md Outdated
use select::document::Document;
use select::predicate::Name;
use std::collections::HashSet;


@budziq

budziq Sep 30, 2017

Collaborator

The `std` `use` statements would go above the crate ones.
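
i.e., reordered as:

    use std::collections::HashSet;

    use select::document::Document;
    use select::predicate::Name;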

src/net.md Outdated
# UrlParseError(url::ParseError);
# }
# }
fn get_base_url(url: &Url, doc: &Document) -> Result<Url> {


@budziq

budziq Sep 30, 2017

Collaborator

We would need a new line between the `error_chain!` block and the `fn`.
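
i.e., something like:

    # error_chain! {
    #     foreign_links {
    #         UrlParseError(url::ParseError);
    #     }
    # }

    fn get_base_url(url: &Url, doc: &Document) -> Result<Url> {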

@sb89

Contributor Author

sb89 commented Oct 1, 2017

I've implemented the changes as requested.

Instead of removing the unwrap and using a custom error, I've done something slightly different with the filter. Let me know what you think, and if everything is OK I'll squash the commits.

@budziq

Collaborator

budziq commented Oct 1, 2017

@sb89 very cool! Please squash + rebase the commits and we're ready to merge 👍

Steven Blake added some commits Oct 1, 2017

@budziq

Collaborator

budziq commented Oct 1, 2017

Merged! Thanks for sticking with it, @sb89!

@budziq budziq closed this Oct 1, 2017
