Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extract phone numbers from text #252

Closed
wants to merge 1 commit into from

Conversation

AndyGauge
Copy link
Collaborator

@AndyGauge AndyGauge commented Jul 19, 2017

Fixes #241

Copy link
Collaborator

@budziq budziq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @AndyGauge nice work! unfortunately I'll not be able to give you a proper review as Im mobile only on holidays so here are some quick thoughts.

  • Please update the PR description to make the related issue autoclose on merge ("fixes" clause)
  • If the example will read an external file please make it no_run

src/basics.md Outdated
println!("Input a phone number");
let mut phone_input = String::new();
// Regular expression is looking for 3 groups of digits, capturing them
let re = Regex::new(r".*([2-9]\d{2}).*(\d{3}).*(\d{4})")?;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hi this regex will catch false positives "fiz 233 baz 345  bar 6789" so this might not be the best one.
on the other hand the allowed NANP spec is ridiculously complex and even a subset of it spawns not very nice:
^(\+\d{1,2}\s)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}$ (copied via mobile so it might vie utterly broken)

So lets meet halfway and agree that phone numbers are optional country region code + 3 groups of digits optionally separated by either single space dot or -. Country region code is either preceeded by + or 00 If its still too complex we might pair it down further as long as it will not give false positives.

Also it might be a good place to showoff named capture groups (at least for the country code)

And we would still mention in the example that this will work only for nicely formatted numbers.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I've investigated international phone numbers and their format is way different. Here's the wikipedia article I'm investigating: https://en.wikipedia.org/wiki/National_conventions_for_writing_telephone_numbers
Because it would be easy to create an entire library out of the individual nation's formatting requirements I want to make this only function for US telephone numbers. (While including the optional +1). International conventions includes dots, dashes, spaces, and parenthesis optionally between digits. I think +15551231234 and 5551231234 are both valid phone numbers. Just that regex alone is actually a bit more complicated than the one you posted via mobile. (and won't use the ^and$ because we are not validating we are extracting).

Anyway I'll throw something together that eliminates the false positives while looking good enough to reason (while only guaranteeing support for US).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok fair enough 👍 just leave a comment that it'll match only US convention.

src/basics.md Outdated
let mut phone_input = String::new();
// Regular expression is looking for 3 groups of digits, capturing them
let re = Regex::new(r".*([2-9]\d{2}).*(\d{3}).*(\d{4})")?;
stdin().read_line(&mut phone_input)?;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest to open some arbitrary text file and iterate over all of its lines and phone numbers.

Also we might want to collect the numbers into a set in order to deduplicate them prior to printing.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good ideas, but should we just use a static string so that it works in a cookbook?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can either provide a short static string (then assert_eq on the results to show the intended output) or a filename (in that case make the example no_run)

src/basics.md Outdated
let formatted_phone = format!("({}) {}-{}", &captures[1], &captures[2], &captures[3]);
println!("Formatted like a boss: {}", formatted_phone);
} else {
println!("Phone numbers must include area code. Use numbers and whatever else you want\nFor example +1 (555) 101-9939");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets remove the lines that are not really to the point

src/basics.md Outdated
// Make sure the regular expression captured something
if let Some(captures) = re.captures(&phone_input){
let formatted_phone = format!("({}) {}-{}", &captures[1], &captures[2], &captures[3]);
println!("Formatted like a boss: {}", formatted_phone);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about just printing these out instead of formatting? also no need to add additional noise to the output. lets keep the example short and sweet so it can be copy pasted.

Also I leave it up to you if you would consider using PhoneNumber struct with display (please see the external process example)

src/basics.md Outdated
number: [cap[1].to_string(),cap[2].to_string(),cap[3].to_string()].concat()
}
});
assert!(format!("{}", phone_numbers.next().unwrap()) == "+1 (505) 881-9292");
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any way other than unwrap for this option?

Copy link
Collaborator

@budziq budziq Jul 21, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep assert_eq!(phone_numbers.next(), Some("+1 (505) 881-9292"))
possibly you will need to do some to_str or somesuch

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I do that I get this:

error[E0308]: mismatched types
  --> src/main.rs:57:5
   |
57 |     assert_eq!(phone_numbers.next(), Some("+1 (299) 339-1020"));
   |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ expected struct `PhoneNumber`, found &str
   |
   = note: expected type `std::option::Option<PhoneNumber>`
              found type `std::option::Option<&str>`
   = help: here are some functions which might fulfill your needs:
           - .take()
           - .unwrap()
           - .unwrap_or_default()
   = note: this error originates in a macro outside of the current crate

I could do this

    assert_eq!(phone_numbers.next(), Some(PhoneNumber {country:"US".to_string(), number:"2993391020".to_string()}));

but that doesn't look any better. I really think if were are trying to offer something someone can use, implementing Display and showing how to have the phone numbers formatted is what people can use.

Copy link
Collaborator

@budziq budziq Jul 21, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that is why I've suggested something like
assert_eq!(phone_numbers.next().map(String::from), Some("+1 (299) 339-1020".to_owned()));

Copy link
Collaborator

@budziq budziq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@AndyGauge almost there. just few minor changes and git squash to a single commit and we good to go!

src/basics.md Outdated
(\d{4}) #Subscriber Number")?;
let mut phone_numbers = re.captures_iter(phone_text).map(|cap| {
PhoneNumber {
country: "US".to_owned(),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the country does not make much sense here as its unconditionally "US" maybe lets just drop it (also using String as a match tag would not be a best idea).

how about storing area exchange and subscriber codes separately instead?

src/basics.md Outdated
fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
match self.country.as_ref() {
"US" => write!(f, "+1 ({}) {}-{}", &self.number[0..3], &self.number[3..6], &self.number[6..10]),
_ => write!(f, "{}", self.number),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets just drop the country and ommit the ±1 from printing . the regex should still conditionally match it

src/basics.md Outdated
#
struct PhoneNumber {
country: String,
number: String,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe the PhoneNumber can refer to not owned str?

src/basics.md Outdated
1.299.339.1020";
let re = Regex::new(r"(?x)
\+?1? #Country Code Optional
[\x20\(\.]*
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the arbitrary number of parentheses matchers look funky.
we probably shoud match areacode in two alternatives one with one set of paren and other with none.

also there are better ways to match whitespace than 20 ascii code.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

\x20 is required with the (?x) ignore whitespace. I could use \s but that would match tabs and newlines. If you are OK with that I can switch it.

Copy link
Collaborator Author

@AndyGauge AndyGauge Jul 21, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this what you mean?

[\s\.]*|(?\s\()?

Copy link
Collaborator Author

@AndyGauge AndyGauge Jul 21, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

\s?[\.\(]?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • yeah I would go with \s
  • \+?1? should be moved int an optional group (\+1)?\s* so that plus char has to be followed by 1otherwise these two are independent
  • I meant that the area codes expression within the parenthesis has to be matched with both opening and closing parens as a whole or | without both of them. currently the match for opening ,closing parens are independent so we can match (((202 991 9534 .

I may try to help but writing regexes on mobile is a real pain and I have no way to give you the full answer or even verify if my suggestions are fully sane. So you might try to solving this yourself or wait for me to get back behind a keyboard within next two weeks 😢 .

src/basics.md Outdated
}
#
# fn main() {
# run();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please use quick_main!

src/basics.md Outdated
## Extract phone numbers from text

Attempts to process a static string as a list including US phone numbers. The
expression is checking that there are 7 digits that come in groups of 3 3 4,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would not try to explain the actual regex here. we are interested in the crate and its methods and don't try to teach regular expressions

src/basics.md Outdated
Attempts to process a static string as a list including US phone numbers. The
expression is checking that there are 7 digits that come in groups of 3 3 4,
separated by spaces, dots, dashes, and parenthesis. The phone number is then
tested based on formatting rules.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the fact that we are testing is self evident from the asserts in the code.

src/basics.md Outdated
<a name="ex-phone"></a>
## Extract phone numbers from text

Attempts to process a static string as a list including US phone numbers. The
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sentence looks a little off, I wouldn't say that the code "attempts" it actually does the parsing. Also I would keep the whole description as a short one sentence mentioning the key method used (captures_iter).

@budziq
Copy link
Collaborator

budziq commented Jul 21, 2017 via email

src/basics.md Outdated

fn run() -> Result<()> {
let phone_text = "
+1 505 881 9292
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also this code should allow for multiple numbers per line so let's show it

@AndyGauge AndyGauge force-pushed the ex-phone branch 3 times, most recently from 2f30cd0 to 13483b3 Compare July 21, 2017 23:10
Copy link
Collaborator

@budziq budziq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looking better!

src/basics.md Outdated
number: [area_code,cap[4].to_string(),cap[5].to_string()].concat()
}
});
assert_eq!(phone_numbers.next().map(|m| format!("{}", m)), Some("1 (505) 881-9292".to_owned()));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can just .map(|m| m.to_string() this is a preferred variant (clippy has a lint for it)

src/basics.md Outdated
#
#[derive (PartialEq, PartialOrd, Debug)]
enum CountryCode {
US,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please see my previous comment about country code not doing any real work. it is mostly noise obfuscating a quite simple and elegant example. i would just remove it (the part about matching on Strings was just a sidenote ;) sorry if i was unclear)

src/basics.md Outdated

#[derive (PartialEq, PartialOrd, Debug)]
struct PhoneNumber {
country: CountryCode,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please remove the country code but we might like to store the area and exchange codes separately (my comments from previous patcset)

src/basics.md Outdated
struct PhoneNumber {
country: CountryCode,
number: String,
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please see my comments about making this zero-copy and storing &str instead String

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have tried to do this, but I cannot figure out the lifetime. Here's the problem:

error[E0597]: borrowed value does not live long enough
  --> src\main.rs:49:5
   |
47 |             number: &[area_code,cap[4].to_string(),cap[5].to_string()].concat()
   |                      ---------------------------------------------------------- temporary value created here
48 |         }
49 |     });
   |     ^ temporary value dropped here while still borrowed
...
57 | }
   | - temporary value needs to live until here

Here's the struct def:

#[derive (PartialEq, PartialOrd, Debug)]
struct PhoneNumber<'a> {
    number: &'a str,
}

// Allows printing phone numbers based on country convention.
impl<'a> fmt::Display for PhoneNumber<'a> {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        write!(f, "1 ({}) {}-{}", &self.number[0..3], &self.number[3..6], &self.number[6..10])
    }
}

Can you advise?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no clean way of concatenating the str's without allocation

The easiest way would be to follow my suggestion of separate &str member for each capture (area, extension, subscriber). Then you don't need to concat and allocate

src/basics.md Outdated
impl fmt::Display for PhoneNumber {
fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
match self.country {
CountryCode::US => write!(f, "1 ({}) {}-{}", &self.number[0..3], &self.number[3..6], &self.number[6..10]),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please run rustfmt on the snippet

src/basics.md Outdated
1 (800) 233-2010
1.299.339.1020";
let re = Regex::new(r"(?x)
(?\+?1)? # Country Code Optional
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the first question mark within the parens looks off but I cannot verify it right now so I might be wrong. the rest looks really nice!

src/basics.md Outdated
PhoneNumber {
country: CountryCode::US,
number: [area_code,cap[4].to_string(),cap[5].to_string()].concat()
area: if cap.get(2) == None { &cap[3] } else { &cap[2] },
Copy link
Collaborator Author

@AndyGauge AndyGauge Jul 22, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

error[E0597]: `cap` does not live long enough
  --> src\main.rs:56:44
   |
56 |             area: if cap.get(2) == None { &cap[3] } else { &cap[2] },
   |                                            ^^^ does not live long enough
...
60 |     });
   |     - borrowed value only lives until here
   |
   = note: borrowed value must be valid for the static lifetime...

Same problem. cap only exists inside the iter() block,

Copy link
Collaborator

@budziq budziq Jul 22, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably we want to use as_str instead of depending on &capture Deref into a &str. Pease see a soon to be merged PR for suggestions https://github.com/brson/rust-cookbook/pull/247/files

also I would suggest using match on captures instead of imperative testing for None

Anyhow if its still a blocker then just go with owned version for now and we'll update the example in a week when I'm back behind a keyboard 👍

@brson
Copy link
Contributor

brson commented Jul 25, 2017

I rebased and merged. Thanks @AndyGauge

@brson brson closed this Jul 25, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants