Added "Check webpage for broken links" example #299
Conversation
budziq requested changes on Sep 28, 2017
let mut broken_urls = HashSet::new();
let url = Url::parse("https://www.rust-lang.org/en-US/")?;
let mut base_url = Url::parse(&format!(
budziq (Collaborator) commented on Sep 28, 2017
this whole block could be substituted with

let base_url = Url::parse(&parsed[..url::Position::BeforePath])?;

and base_url will not need to be mut here
also I would extract the whole link acquisition and transformation into a separate function for clarity
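A minimal sketch of the slicing suggestion above; here `url` plays the role of `parsed` in the comment, and the example URL and printed output are only illustrative:

```rust
extern crate url;

use url::{Position, Url};

fn main() -> Result<(), url::ParseError> {
    let url = Url::parse("https://www.rust-lang.org/en-US/")?;

    // Slicing with `..Position::BeforePath` keeps only the scheme, host and port,
    // so re-parsing that slice yields the base URL with no `mut` and no manual
    // `format!` of the host string.
    let base_url = Url::parse(&url[..Position::BeforePath])?;

    println!("{}", base_url); // expected: https://www.rust-lang.org/
    Ok(())
}
```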
url.host_str().unwrap_or("")
))?;
let res = reqwest::get(url.as_str())?;
.collect();
for link in links {
let link_url = Url::parse(link).or_else(|e| match e {
budziq (Collaborator) commented on Sep 28, 2017
You can use ParseOptions here to make the crate do all the work (if the link is absolute then it will be used as-is, otherwise the base_url will be used):

let base_parser = Url::options().base_url(Some(&base_url));
...
for link in links { // or `map` here
    let final_url = base_parser.parse(link)?;

_ => Err(e),
});
match link_url {
budziq (Collaborator) commented on Sep 28, 2017
instead of matching, I would suggest using iterator combinators here to extract only the error cases with their links
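A hedged sketch of one way to do that, pairing each href with its parse error via filter_map so only the failures survive; the base_parser setup follows the ParseOptions suggestion above, and the sample links are made up for illustration:

```rust
extern crate url;

use url::Url;

fn main() -> Result<(), url::ParseError> {
    let base_url = Url::parse("https://www.rust-lang.org/")?;
    let base_parser = Url::options().base_url(Some(&base_url));

    // Stand-ins for the hrefs scraped from the page.
    let links = vec!["en-US/documentation.html", "https://example.com:not-a-port/"];

    // Keep only the links that fail to parse, paired with their error,
    // instead of matching on every Result inside the loop body.
    let failures: Vec<(&str, url::ParseError)> = links
        .iter()
        .filter_map(|link| base_parser.parse(link).err().map(|e| (*link, e)))
        .collect();

    for (link, err) in &failures {
        println!("Error parsing URL ({}): {}", link, err);
    }
    Ok(())
}
```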
Err(e) => println!("Error parsing URL ({}): {}", link, e.description()),
Ok(link_url) => {
let res = reqwest::get(link_url.as_str())?;
budziq (Collaborator) commented on Sep 28, 2017
I would extract this super small logic to a separate function for clarity.
[`Ipv4Addr`]: https://doc.rust-lang.org/std/net/struct.Ipv4Addr.html
[`Name`]: https://docs.rs/select/*/select/predicate/struct.Name.html
[`ParseError::RelativeUrlWithoutBase`]: https://docs.rs/url/1.5.1/url/enum.ParseError.html#variant.RelativeUrlWithoutBase
We can then check if the document has a "base" tag. If so, we get the "href" [`attr`] from the first occurrence of the "base" tag. This is then used as the base URL.
We can then retrieve all the links from the document by using [`find`] with the criteria of the [`Name`] being "a".
budziq (Collaborator) commented on Sep 28, 2017
We already have an "extract all links from a webpage" example, where nearly all of this information is presented. Let's build on it and not duplicate the information. This is a more advanced example, so I would suggest limiting the description to only the new information not covered by earlier examples.
If you think that some of this information is not covered so far, please consider updating the earlier example descriptions.
I've implemented changes in line with your comments.
budziq reviewed on Sep 30, 2017
.filter_map(|n| n.attr("href"))
.nth(0)
{
base_url = Url::parse(base_tag_href)?;
budziq (Collaborator) commented on Sep 30, 2017
you can use a combinator such as map here instead of having if let and a mut base_url
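A sketch of how that might look with a combinator (map_or_else over the Option from the <base> lookup) instead of if let plus a mut binding; the plain url::ParseError return type and the sample documents are assumptions for illustration:

```rust
extern crate select;
extern crate url;

use select::document::Document;
use select::predicate::Name;
use url::{Position, Url};

// Resolve the base URL without `if let` or a `mut` binding: map over the
// Option produced by the <base href="..."> lookup, and fall back to slicing
// the original URL when no <base> tag is present.
fn get_base_url(url: &Url, doc: &Document) -> Result<Url, url::ParseError> {
    let base_tag_href = doc.find(Name("base")).filter_map(|n| n.attr("href")).nth(0);

    base_tag_href.map_or_else(
        || Url::parse(&url[..Position::BeforePath]),
        Url::parse,
    )
}

fn main() -> Result<(), url::ParseError> {
    let url = Url::parse("https://www.rust-lang.org/en-US/")?;

    // A tiny document with an explicit <base> tag, just for demonstration.
    let doc = Document::from(r#"<html><head><base href="https://example.org/docs/"></head></html>"#);
    println!("{}", get_base_url(&url, &doc)?); // expected: https://example.org/docs/

    let doc_without_base = Document::from("<html><head></head></html>");
    println!("{}", get_base_url(&url, &doc_without_base)?); // expected: https://www.rust-lang.org/
    Ok(())
}
```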
}
fn run() -> Result<()> {
let (url, mut base_url) = get_urls("https://www.rust-lang.org/en-US/")?;
I believe the reviews you have made are on outdated code. I made an additional commit about 40 minutes ago.
Yeah, I know. I'm at a conference and the wifi is flaky at best (these are reviews that were to be sent a few hours ago). Stand by, I'll publish an updated review shortly. There should be only minor changes and then we're good to go.
budziq requested changes on Sep 30, 2017
let links: HashSet<&str> = document
.find(Name("a"))
.filter_map(|n| n.attr("href"))
.collect();
budziq (Collaborator) commented on Sep 30, 2017
I would suggest transforming the links into absolute ones (with the base_url) here prior to collecting into the HashSet (it is possible that on a single page we will have both a relative and an absolute version of the same link).
sb89 (Author, Contributor) commented on Sep 30, 2017
How would I do this as parse returns a Result and therefore I cannot use the "?" operator when iterating? Should I just ignore any URL that cannot be parsed (and report no errors) like this?
let links: HashSet<Url> = document
.find(Name("a"))
.filter_map(|n| n.attr("href"))
.filter_map(|link| base_parser.parse(link).ok())
.collect();
budziq (Collaborator) commented on Sep 30, 2017
Well, that entirely depends on what you'd like to achieve: test only the correct links on a best-effort basis (your suggestion), or error out even if there is at least one error, by collecting into a Result:

let hs: Result<HashSet<_>> = document
    .find(Name("a"))
    .filter_map(|n| n.attr("href"))
    .map(|link| base_parser.parse(link).map_err(|e| e.into()))
    .collect();

Unfortunately, to do this trick with error_chain you'll have to jump through hoops with map_err as above (normally this would not be required).
.filter_map(|n| n.attr("href"))
.collect();
let broken_links = get_broken_links(&base_url, &links)?;
budziq (Collaborator) commented on Sep 30, 2017
I would extract only the link-checking logic into a separate function check_link; it would take only a single Url (or better yet AsRef<Url>). Then such a function could be mapped over the collection iterator instead of collecting into yet another HashSet.
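A sketch of such a check_link, reusing the reqwest::get / 404 check the example already performs (the reqwest API of that era is assumed, and the sample links and plain reqwest::Error return type are only illustrative):

```rust
extern crate reqwest;
extern crate url;

use url::Url;

// Check a single, already-absolute URL; returns Ok(true) when it resolves
// and Ok(false) when the server answers 404.
fn check_link(url: &Url) -> Result<bool, reqwest::Error> {
    let res = reqwest::get(url.as_str())?;
    Ok(res.status() != reqwest::StatusCode::NotFound)
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let links = vec![
        Url::parse("https://www.rust-lang.org/")?,
        Url::parse("https://www.rust-lang.org/no-such-page")?,
    ];

    // Map the single-URL check over the collection instead of building
    // another HashSet of broken links.
    for link in links.iter().filter(|link| check_link(link).ok() == Some(false)) {
        println!("{} is broken", link);
    }
    Ok(())
}
```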
budziq (Collaborator) commented on Sep 30, 2017
Also, there seem to be conflicts making the merge impossible. Could you look into that before squashing the PR commits into a single one?
}
fn get_broken_links<'a>(base_url: &Url, links: &HashSet<&'a str>) -> Result<HashSet<&'a str>> {
let mut broken_links = HashSet::new();
budziq (Collaborator) commented on Sep 30, 2017
I would prefer for this function to operate only on a single Url (check_link): it would take only a single Url (or better yet AsRef<Url>) that was already made absolute (making the base_url param unnecessary). Then such a function could be mapped over the collection iterator instead of collecting into yet another HashSet.
Now that we have our URL, we pass them into the "get_broken_links" function along with the base URL. This will iterate over the URL, use [`url::ParseOptions`] and [`Url::parse`] for each to get the absolute URL and then perform a HTTP GET request. If the response status is 404, the URL will be added to the HashSet which is returned by the function.
Finally, we check the contents of the HashSet. If it is empty, we report that no links are broken, otherwise we iterate over the HashSet and report the broken links.
budziq (Collaborator) commented on Sep 30, 2017
I would suggest describing only the finer details, i.e. things that might not be obvious from the code.
[![reqwest-badge]][reqwest] [![select-badge]][select] [![url-badge]][url] [![cat-net-badge]][cat-net]
We parse the URL using [`Url::parse`] and then retrieve the document using reqwest.
budziq (Collaborator) commented on Sep 30, 2017
The user should already be familiar with this information from the previous examples.
We parse the URL using [`Url::parse`] and then retrieve the document using reqwest.
We call "get_base_url" to retrieve the base URL. If the document has a "base" tag, we get the "href" [`attr`] from the first occurrence of the "base" tag. This is then used as the base URL. Otherwise, we can use [`Position::BeforePath`] with the original URL to get the base of that URL.
budziq (Collaborator) commented on Sep 30, 2017
I would try to pare the explanation down to only the information that is not mentioned in earlier examples.
budziq requested changes on Sep 30, 2017
.filter_map(|link| base_parser.parse(link).ok())
.collect();
let broken_links = links.iter().filter(|link| {
budziq (Collaborator) commented on Sep 30, 2017
no need for an intermediate broken_links variable; we can iterate over the filtered iterator in a for loop
let broken_links = links.iter().filter(|link| {
let link_ok = check_link(link);
link_ok.is_ok() && !link_ok.unwrap()
budziq (Collaborator) commented on Sep 30, 2017
please avoid unwrapping.
check_link could return a custom error defined in error_chain! when a link is broken;
then you could also extract all the errored values with filter_map(|res| res.err())
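A hedged sketch of that shape, assuming an error_chain setup like the example's; the BrokenLink variant and the sample links are made up for illustration:

```rust
#[macro_use]
extern crate error_chain;
extern crate reqwest;
extern crate url;

use url::Url;

error_chain! {
    foreign_links {
        ReqError(reqwest::Error);
        UrlParseError(url::ParseError);
    }
    errors {
        // Hypothetical variant carrying the address of a broken link.
        BrokenLink(url: String) {
            description("broken link")
            display("broken link: {}", url)
        }
    }
}

// Succeeds for a reachable link and returns a BrokenLink error on a 404,
// so broken links can be harvested with filter_map(|res| res.err()).
fn check_link(url: &Url) -> Result<()> {
    let res = reqwest::get(url.as_str())?;
    if res.status() == reqwest::StatusCode::NotFound {
        bail!(ErrorKind::BrokenLink(url.to_string()));
    }
    Ok(())
}

fn run() -> Result<()> {
    let links = vec![
        Url::parse("https://www.rust-lang.org/")?,
        Url::parse("https://www.rust-lang.org/no-such-page")?,
    ];

    // No unwrap and no intermediate collection: keep only the error cases.
    for err in links.iter().map(check_link).filter_map(|res| res.err()) {
        println!("{}", err);
    }
    Ok(())
}

quick_main!(run);
```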
use select::document::Document;
use select::predicate::Name;
use std::collections::HashSet;
# UrlParseError(url::ParseError);
# }
# }
fn get_base_url(url: &Url, doc: &Document) -> Result<Url> {
I've implemented the changes as requested. Instead of removing the unwrap and using a custom error, I've done something slightly different with the filter. Let me know what you think and if everything is OK I'll squash the commits.
@sb89 very cool! Please squash + rebase the commits and we're ready to merge
added some commits on Oct 1, 2017
merged! Thanks for sticking with it @sb89!
sb89 commented on Sep 27, 2017
#290
I've maybe made this too complicated. Let me know what you think.