
Extract all unique links from a MediaWiki markup #254

Closed

Conversation

@yandexx
Contributor

yandexx commented Jul 22, 2017

Fixes #245.

  • I went for deduplicated Vectors instead of HashMaps, but let me know if HashMap is a better choice after all.
@budziq
Collaborator

budziq left a comment

Nice work! Just a few suggestions below:

  • please update the PR description with "fixes #245" to autoclose the issue
src/net.md Outdated
[`Regex::captures_iter`].
In Mediawiki syntax:
* Internal links take a form of `[[Page]]` or `[[Page|link text]]`.

@budziq

budziq Jul 22, 2017

Collaborator

I would try to minimize the info regarding the actual MediaWiki syntax (ideally removing it altogether).
I would just add an external link to the Wikipedia page describing the link syntax.

src/net.md Outdated
)?;
let mut content = String::new();
resp.read_to_string(&mut content)?;

@budziq

budziq Jul 22, 2017

Collaborator

Most likely you can get rid of the `resp` variable by creating the request and reading from it in one go.
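A std-only sketch of the idea, with `io::Cursor` standing in for the HTTP response (both implement `Read`), so the reader never needs its own binding:

```rust
use std::io::{Cursor, Read};

fn main() -> Result<(), std::io::Error> {
    let mut content = String::new();

    // Build the reader and consume it in one expression; no
    // intermediate `resp`-style variable is needed.
    Cursor::new("[[Page|link text]]").read_to_string(&mut content)?;

    assert_eq!(content, "[[Page|link text]]");
    Ok(())
}
```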

src/net.md Outdated
.captures_iter(content.as_str())
{
if let Some(m) = mat["link"].split('|').next() {
links.push(m.to_string());

@budziq

budziq Jul 22, 2017

Collaborator

Using `map` and then collecting into a set would be preferable.

src/net.md Outdated
urls.push(mat["url"].to_string());
}
links.sort_by_key(|s| s.to_lowercase());

@budziq

budziq Jul 22, 2017

Collaborator

Using a set would be preferable to sorting and deduplicating.
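A minimal std-only sketch of that suggestion. Collecting into a `HashSet` deduplicates as a by-product, so the `sort_by_key` plus dedup pass goes away (the link names are hard-coded here, since the real code obtains them from `Regex::captures_iter`):

```rust
use std::collections::HashSet;

fn main() {
    // Stand-in for the extracted link names.
    let raw = ["Page", "Other", "page", "Page"];

    // Collecting into a HashSet removes duplicates automatically;
    // lowercasing first makes the deduplication case-insensitive.
    let links: HashSet<String> = raw.iter().map(|s| s.to_lowercase()).collect();

    assert_eq!(links.len(), 2); // "page" and "other"
}
```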

src/net.md Outdated
let mut urls = Vec::new();
// Capture the part of [[Page|link text]] that is inside the brackets
for mat in Regex::new(r#"\[\[(?P<link>.+?\|?)\]\]"#)?

@budziq

budziq Jul 22, 2017

Collaborator

A single regex with named capture groups would suffice, collecting the results into a single `HashSet`.

@yandexx

yandexx Jul 22, 2017

Author Contributor

That was the plan, but I tried `\[\[(.+?)(\|?.*?)\]\]` and the first group is too lazy, capturing only the first character. I'll research further, I guess.

@budziq

budziq Jul 22, 2017

Collaborator

Most likely you'll have to glue the two regexes into one with `|`.

@yandexx

yandexx Jul 22, 2017

Author Contributor

But then I can't use a single named group, right? Using it twice panics, as expected.

@yandexx

yandexx Jul 22, 2017

Author Contributor

Never mind, I think I found a solution that works :)

@budziq
Collaborator

budziq left a comment

Good progress! Some final stylistic touches left.

Also, please squash the commits into one.

src/net.md Outdated
fn run() -> Result<()> {
let mut content = String::new();
reqwest::get(
"https://en.wikipedia.org/w/index.php?title=Rust_(programming_language)&action=raw",

@budziq

budziq Jul 22, 2017

Collaborator

Please consider extracting a link-extraction function, with the regex compiled under `lazy_static!`, similarly to the soon-to-be-merged PR https://github.com/brson/rust-cookbook/pull/247/files
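The shape of the suggestion, sketched with std's `OnceLock` standing in for `lazy_static!` and a plain `String` pattern standing in for a compiled `Regex` (the `regex` crate is assumed unavailable in this std-only sketch):

```rust
use std::sync::OnceLock;

// In the PR this would be `static ref WIKI_REGEX: Regex` inside a
// lazy_static! block; a String pattern keeps the sketch std-only.
fn wiki_regex() -> &'static String {
    static RE: OnceLock<String> = OnceLock::new();
    // The init closure runs exactly once, on first access; every
    // later call returns the same cached value. Panicking here on a
    // malformed pattern is acceptable: that is a code bug, not a
    // runtime condition to recover from.
    RE.get_or_init(|| String::from(r"\[\[(?P<link>[^\[\]|]*)[^\[\]]*\]\]"))
}

fn main() {
    // Every call yields the same static instance: no per-call rebuild.
    assert!(std::ptr::eq(wiki_regex(), wiki_regex()));
}
```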

@yandexx

yandexx Jul 23, 2017

Author Contributor

`lazy_static!` doesn't work with the `?` operator, so I'll have to `unwrap`? Let me know if there's an alternative.

@budziq

budziq Jul 23, 2017

Collaborator

Unwrapping the regex compilation is OK: it is an invariant of your code (the regex is well written), not something you should try to recover from.

src/net.md Outdated
look for all entries of internal and external links with
[`Regex::captures_iter`].
MediaWiki syntax is described [here][MediaWiki link syntax].

@budziq

budziq Jul 22, 2017

Collaborator

I would rewrite it to "MediaWiki link syntax"

src/net.md Outdated
)?
.read_to_string(&mut content)?;
let mut links = Regex::new(r#"\[\[(?P<link>[^\[\]|]*)[^\[\]]*\]\]"#)?

@budziq

budziq Jul 22, 2017

Collaborator

I would suggest having a single regex concatenated with `|`.
To make the regex readable, please use the multiline (whitespace-insensitive) regex definition, as in https://brson.github.io/rust-cookbook/basics.html#run-an-external-command-and-process-stdout (you can split the regex into smaller parts, each with a comment starting with `#`).

Having different capture groups would actually be beneficial, as you will be able to match on them separately and implement the case-insensitive logic (`to_lowercase`) in one arm.

@yandexx

yandexx Jul 23, 2017

Author Contributor

It doesn't seem possible to use multiline mode, since I use raw string literals.

@yandexx

yandexx Jul 23, 2017

Author Contributor

Also, this is what I have so far, and I really don't like the if/else there. My iterator-fu may be lacking :/

    lazy_static! {
        static ref WIKI_REGEX: Regex = 
            Regex::new(r#"\[\[(?P<internal>[^\[\]|]*)[^\[\]]*\]\]|(url=|URL\||\[)(?P<external>http.*?)[ \|}]"#)
            .unwrap();
    }

    let links: HashSet<String> = WIKI_REGEX
        .captures_iter(content.as_str())
        .map(|c| {
            if c.name("internal").is_some() {
                c["internal"].to_lowercase()
            } else {
                // c.name("external").is_some()
                c["external"].to_string()
            }
        })
        .collect();

@budziq

budziq Jul 23, 2017

Collaborator

It doesn't seem possible to use multiline mode, since I use raw string literals.

Please look at the example from the docs:

let re = Regex::new(r"(?x)
  (?P<y>\d{4}) # the year
  -
  (?P<m>\d{2}) # the month
  -
  (?P<d>\d{2}) # the day
").unwrap();

It also uses a raw literal; the `(?x)` is actually the critical part here.

In regard to the `map` you have posted, I would use a `match` instead, returning `Cow`.

@yandexx

yandexx Jul 23, 2017

Author Contributor

Thanks, `(?x)` helped!

In regard to the `map` you have posted, I would use a `match` instead, returning `Cow`.

Hmm, but what would I match on? And what to return in the case of no match? That shouldn't happen, but the arm still has to return something.

@budziq

budziq Jul 23, 2017

Collaborator

Sorry, I meant a `match` returning `Cow<str>` (damn autocorrect on mobile).
You would write something along the lines of:

match (c.name("internal"), c.name("external")) {
    (Some(val), None) => Cow::from(val.as_str().to_lowercase()),
    (None, Some(val)) => Cow::from(val.as_str()),
    _ => unreachable!(),
}

If the `unreachable!` were ever triggered, there would be a fundamental error in the regex definition, so we want to fail noisily and fix our code rather than try to recover.

@yandexx

yandexx Jul 23, 2017

Author Contributor

Ooh that's clever! Thanks, I'm learning a lot here!

src/net.md Outdated
.map(|c| c["link"].to_string()),
);
println!("{:?}", links);

@budziq

budziq Jul 22, 2017

Collaborator

I would iterate over the links or use the pretty-print syntax.
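Both options, sketched std-only (a `BTreeSet` is used here for deterministic ordering; the PR itself uses a `HashSet`):

```rust
use std::collections::BTreeSet;

fn main() {
    let links: BTreeSet<&str> = ["Rust_(programming_language)", "MediaWiki"]
        .iter()
        .copied()
        .collect();

    // Option 1: `{:#?}` is the pretty-print Debug format,
    // printing one element per line.
    println!("{:#?}", links);

    // Option 2: iterate and print each link on its own line.
    for link in &links {
        println!("{}", link);
    }
}
```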

src/net.md Outdated
let mut links = Regex::new(r#"\[\[(?P<link>[^\[\]|]*)[^\[\]]*\]\]"#)?
.captures_iter(content.as_str())
.map(|c| c["link"].to_string())
.collect::<HashSet<_>>();

@budziq

budziq Jul 22, 2017

Collaborator

Annotating the variable type or the function return type is slightly cleaner.
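The two styles side by side, std-only with the regex extraction stubbed out by `split_whitespace`:

```rust
use std::collections::HashSet;

fn main() {
    let content = "Page Other Page";

    // Turbofish annotation on collect, as in the PR:
    let a = content
        .split_whitespace()
        .map(String::from)
        .collect::<HashSet<_>>();

    // Annotating the variable type instead reads slightly cleaner:
    let b: HashSet<String> = content
        .split_whitespace()
        .map(String::from)
        .collect();

    assert_eq!(a, b);
}
```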

@yandexx yandexx force-pushed the yandexx:regex-mediawiki-links branch from e261751 to 06215c2 Jul 24, 2017

@brson

Contributor

brson commented Jul 25, 2017

I rebased and merged. Thanks @yandexx

@brson brson closed this Jul 25, 2017

@yandexx yandexx deleted the yandexx:regex-mediawiki-links branch Jul 25, 2017
