Join GitHub today
GitHub is home to over 31 million developers working together to host and review code, manage projects, and build software together.
Sign upExtract all unique links from a MediaWiki markup #254
Conversation
budziq
requested changes
Jul 22, 2017
|
Nice work! just few suggestions below:
|
| [`Regex::captures_iter`]. | ||
| In Mediawiki syntax: | ||
| * Internal links take a form of `[[Page]]` or `[[Page|link text]]`. |
This comment has been minimized.
This comment has been minimized.
budziq
Jul 22, 2017
Collaborator
I would try to minimize the info regarding the actual mediawiki syntax (ideally removing them all together).
I would just add external link to wikipedia site describing the link syntax.
| )?; | ||
| let mut content = String::new(); | ||
| resp.read_to_string(&mut content)?; |
This comment has been minimized.
This comment has been minimized.
budziq
Jul 22, 2017
Collaborator
most likely you can get rid of resp variable by creating the request and reading from it in one go
| .captures_iter(content.as_str()) | ||
| { | ||
| if let Some(m) = mat["link"].split('|').next() { | ||
| links.push(m.to_string()); |
This comment has been minimized.
This comment has been minimized.
| urls.push(mat["url"].to_string()); | ||
| } | ||
| links.sort_by_key(|s| s.to_lowercase()); |
This comment has been minimized.
This comment has been minimized.
| let mut urls = Vec::new(); | ||
| // Capture the part of [[Page|link text]] that is inside the brackets | ||
| for mat in Regex::new(r#"\[\[(?P<link>.+?\|?)\]\]"#)? |
This comment has been minimized.
This comment has been minimized.
budziq
Jul 22, 2017
Collaborator
single regex with named capture groups would suffice. collecting results into a single HashSet
This comment has been minimized.
This comment has been minimized.
yandexx
Jul 22, 2017
•
Author
Contributor
That was the plan but I've tried @\[\[(.+?)(\|?.*?)\]\] and the first group is too lazy and captures only the first character. I'll research further I guess.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
yandexx
Jul 22, 2017
•
Author
Contributor
But then I can't use a single named group, right? Using it twice panics, as expected.
This comment has been minimized.
This comment has been minimized.
budziq
requested changes
Jul 22, 2017
|
Good progress! Some final stylistic touches left. also please squash the commits into one. |
| fn run() -> Result<()> { | ||
| let mut content = String::new(); | ||
| reqwest::get( | ||
| "https://en.wikipedia.org/w/index.php?title=Rust_(programming_language)&action=raw", |
This comment has been minimized.
This comment has been minimized.
budziq
Jul 22, 2017
Collaborator
please consider extracting the link extraction function with regex compiled under lazy_static! similarly as in the case of soon to be merged PR https://github.com/brson/rust-cookbook/pull/247/files
This comment has been minimized.
This comment has been minimized.
yandexx
Jul 23, 2017
Author
Contributor
lazy_static! doesn't work with the ? operator so I'll have to unwrap? Let me know if there's an alternative.
This comment has been minimized.
This comment has been minimized.
budziq
Jul 23, 2017
Collaborator
unwrapping the regex compilation is ok this is an invariant of your code (regex is well written) and not something should try to recover from
| look for all entries of internal and external links with | ||
| [`Regex::captures_iter`]. | ||
| MediaWiki syntax is described [here][MediaWiki link syntax]. |
This comment has been minimized.
This comment has been minimized.
| )? | ||
| .read_to_string(&mut content)?; | ||
| let mut links = Regex::new(r#"\[\[(?P<link>[^\[\]|]*)[^\[\]]*\]\]"#)? |
This comment has been minimized.
This comment has been minimized.
budziq
Jul 22, 2017
Collaborator
i would suggest to have a single regex concatenated with |
to make the regex readable please use the multiline (whitespace insensitive) regex definition like in https://brson.github.io/rust-cookbook/basics.html#run-an-external-command-and-process-stdout (you can split the regex into smaller parts each with comment starting with #)
having different capture groups would actually be beneficial as you will be able to match on them separately and implement the case insensitive logic (to_lower) in one arm.
This comment has been minimized.
This comment has been minimized.
yandexx
Jul 23, 2017
Author
Contributor
Doesn't seem to be possible to use multiline since I use raw string literals.
This comment has been minimized.
This comment has been minimized.
yandexx
Jul 23, 2017
Author
Contributor
Also this is what I have so far and I really don't like the if/else there. My iterator-fu may be lacking :/
lazy_static! {
static ref WIKI_REGEX: Regex =
Regex::new(r#"\[\[(?P<internal>[^\[\]|]*)[^\[\]]*\]\]|(url=|URL\||\[)(?P<external>http.*?)[ \|}]"#)
.unwrap();
}
let links: HashSet<String> = WIKI_REGEX
.captures_iter(content.as_str())
.map(|c| {
if c.name("internal").is_some() {
c["internal"].to_lowercase()
} else {
// c.name("external").is_some()
c["external"].to_string()
}
})
.collect();
This comment has been minimized.
This comment has been minimized.
budziq
Jul 23, 2017
Collaborator
Doesn't seem to be possible to use multiline since I use raw string literals.
please look at the example from docs
let re = Regex::new(r"(?x)
(?P<y>\d{4}) # the year
-
(?P<m>\d{2}) # the month
-
(?P<d>\d{2}) # the day
").unwrap();it also uses a raw literal. actually the (?x) is critical here.
in regard to the map you have posted I would use a match instead returning Cow
This comment has been minimized.
This comment has been minimized.
yandexx
Jul 23, 2017
Author
Contributor
Thanks, (?x) has helped!
in regard to the map you have posted I would use a match instead returning Cow
Hmm but what would I match on? What to return in case of no match? which shouldn't happen but this case still has to return something?
This comment has been minimized.
This comment has been minimized.
budziq
Jul 23, 2017
Collaborator
sorry I meant match returning Cow<str> (damn autocorrect on mobile)
you would write something along the lines of:
match (c["internal"], c["external"]) {
(Some(val), None) => Cow::from(val.to_lowercase()),
(None, Some(val)) => Cow::from(val),
_ => unreachable!(),
}if tje unreachable would be triggered then there is a fundamental error with the regex definition so we want to fail noisily and fix our code and not try to recocer
This comment has been minimized.
This comment has been minimized.
| .map(|c| c["link"].to_string()), | ||
| ); | ||
| println!("{:?}", links); |
This comment has been minimized.
This comment has been minimized.
| let mut links = Regex::new(r#"\[\[(?P<link>[^\[\]|]*)[^\[\]]*\]\]"#)? | ||
| .captures_iter(content.as_str()) | ||
| .map(|c| c["link"].to_string()) | ||
| .collect::<HashSet<_>>(); |
This comment has been minimized.
This comment has been minimized.
budziq
Jul 22, 2017
Collaborator
annotating the variable type or the function return type is slightly cleaner.
yandexx
force-pushed the
yandexx:regex-mediawiki-links
branch
from
e261751
to
06215c2
Jul 24, 2017
This comment has been minimized.
This comment has been minimized.
|
I rebased and merged. Thanks @yandexx |
yandexx commentedJul 22, 2017
•
edited
Fixes #245.