New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Extract phone numbers from text #252
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @AndyGauge nice work! unfortunately I'll not be able to give you a proper review as Im mobile only on holidays so here are some quick thoughts.
- Please update the PR description to make the related issue autoclose on merge ("fixes" clause)
- If the example will read an external file please make it
no_run
src/basics.md
Outdated
println!("Input a phone number"); | ||
let mut phone_input = String::new(); | ||
// Regular expression is looking for 3 groups of digits, capturing them | ||
let re = Regex::new(r".*([2-9]\d{2}).*(\d{3}).*(\d{4})")?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hi this regex will catch false positives "fiz 233 baz 345 bar 6789" so this might not be the best one.
on the other hand the allowed NANP spec is ridiculously complex and even a subset of it spawns not very nice:
^(\+\d{1,2}\s)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}$
(copied via mobile so it might vie utterly broken)
So lets meet halfway and agree that phone numbers are optional country region code + 3 groups of digits optionally separated by either single space dot or -
. Country region code is either preceeded by +
or 00
If its still too complex we might pair it down further as long as it will not give false positives.
Also it might be a good place to showoff named capture groups (at least for the country code)
And we would still mention in the example that this will work only for nicely formatted numbers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I've investigated international phone numbers and their format is way different. Here's the wikipedia article I'm investigating: https://en.wikipedia.org/wiki/National_conventions_for_writing_telephone_numbers
Because it would be easy to create an entire library out of the individual nation's formatting requirements I want to make this only function for US telephone numbers. (While including the optional +1). International conventions includes dots, dashes, spaces, and parenthesis optionally between digits. I think +15551231234 and 5551231234 are both valid phone numbers. Just that regex alone is actually a bit more complicated than the one you posted via mobile. (and won't use the ^and$ because we are not validating we are extracting).
Anyway I'll throw something together that eliminates the false positives while looking good enough to reason (while only guaranteeing support for US).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok fair enough 👍 just leave a comment that it'll match only US convention.
src/basics.md
Outdated
let mut phone_input = String::new(); | ||
// Regular expression is looking for 3 groups of digits, capturing them | ||
let re = Regex::new(r".*([2-9]\d{2}).*(\d{3}).*(\d{4})")?; | ||
stdin().read_line(&mut phone_input)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest to open some arbitrary text file and iterate over all of its lines and phone numbers.
Also we might want to collect the numbers into a set in order to deduplicate them prior to printing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good ideas, but should we just use a static string so that it works in a cookbook?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can either provide a short static string (then assert_eq on the results to show the intended output) or a filename (in that case make the example no_run)
src/basics.md
Outdated
let formatted_phone = format!("({}) {}-{}", &captures[1], &captures[2], &captures[3]); | ||
println!("Formatted like a boss: {}", formatted_phone); | ||
} else { | ||
println!("Phone numbers must include area code. Use numbers and whatever else you want\nFor example +1 (555) 101-9939"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lets remove the lines that are not really to the point
src/basics.md
Outdated
// Make sure the regular expression captured something | ||
if let Some(captures) = re.captures(&phone_input){ | ||
let formatted_phone = format!("({}) {}-{}", &captures[1], &captures[2], &captures[3]); | ||
println!("Formatted like a boss: {}", formatted_phone); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about just printing these out instead of formatting? also no need to add additional noise to the output. lets keep the example short and sweet so it can be copy pasted.
Also I leave it up to you if you would consider using PhoneNumber struct with display (please see the external process example)
src/basics.md
Outdated
number: [cap[1].to_string(),cap[2].to_string(),cap[3].to_string()].concat() | ||
} | ||
}); | ||
assert!(format!("{}", phone_numbers.next().unwrap()) == "+1 (505) 881-9292"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any way other than unwrap for this option?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep assert_eq!(phone_numbers.next(), Some("+1 (505) 881-9292"))
possibly you will need to do some to_str
or somesuch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I do that I get this:
error[E0308]: mismatched types
--> src/main.rs:57:5
|
57 | assert_eq!(phone_numbers.next(), Some("+1 (299) 339-1020"));
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ expected struct `PhoneNumber`, found &str
|
= note: expected type `std::option::Option<PhoneNumber>`
found type `std::option::Option<&str>`
= help: here are some functions which might fulfill your needs:
- .take()
- .unwrap()
- .unwrap_or_default()
= note: this error originates in a macro outside of the current crate
I could do this
assert_eq!(phone_numbers.next(), Some(PhoneNumber {country:"US".to_string(), number:"2993391020".to_string()}));
but that doesn't look any better. I really think if were are trying to offer something someone can use, implementing Display and showing how to have the phone numbers formatted is what people can use.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that is why I've suggested something like
assert_eq!(phone_numbers.next().map(String::from), Some("+1 (299) 339-1020".to_owned()));
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@AndyGauge almost there. just few minor changes and git squash to a single commit and we good to go!
src/basics.md
Outdated
(\d{4}) #Subscriber Number")?; | ||
let mut phone_numbers = re.captures_iter(phone_text).map(|cap| { | ||
PhoneNumber { | ||
country: "US".to_owned(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the country does not make much sense here as its unconditionally "US" maybe lets just drop it (also using String as a match tag would not be a best idea).
how about storing area exchange and subscriber codes separately instead?
src/basics.md
Outdated
fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result { | ||
match self.country.as_ref() { | ||
"US" => write!(f, "+1 ({}) {}-{}", &self.number[0..3], &self.number[3..6], &self.number[6..10]), | ||
_ => write!(f, "{}", self.number), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lets just drop the country and ommit the ±1 from printing . the regex should still conditionally match it
src/basics.md
Outdated
# | ||
struct PhoneNumber { | ||
country: String, | ||
number: String, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe the PhoneNumber can refer to not owned str?
src/basics.md
Outdated
1.299.339.1020"; | ||
let re = Regex::new(r"(?x) | ||
\+?1? #Country Code Optional | ||
[\x20\(\.]* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the arbitrary number of parentheses matchers look funky.
we probably shoud match areacode in two alternatives one with one set of paren and other with none.
also there are better ways to match whitespace than 20 ascii code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
\x20 is required with the (?x) ignore whitespace. I could use \s but that would match tabs and newlines. If you are OK with that I can switch it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this what you mean?
[\s\.]*|(?\s\()?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
\s?[\.\(]?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- yeah I would go with
\s
\+?1?
should be moved int an optional group(\+1)?\s*
so that plus char has to be followed by 1otherwise these two are independent- I meant that the area codes expression within the parenthesis has to be matched with both opening and closing parens as a whole or | without both of them. currently the match for opening ,closing parens are independent so we can match
(((202 991 9534
.
I may try to help but writing regexes on mobile is a real pain and I have no way to give you the full answer or even verify if my suggestions are fully sane. So you might try to solving this yourself or wait for me to get back behind a keyboard within next two weeks 😢 .
src/basics.md
Outdated
} | ||
# | ||
# fn main() { | ||
# run(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please use quick_main!
src/basics.md
Outdated
## Extract phone numbers from text | ||
|
||
Attempts to process a static string as a list including US phone numbers. The | ||
expression is checking that there are 7 digits that come in groups of 3 3 4, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would not try to explain the actual regex here. we are interested in the crate and its methods and don't try to teach regular expressions
src/basics.md
Outdated
Attempts to process a static string as a list including US phone numbers. The | ||
expression is checking that there are 7 digits that come in groups of 3 3 4, | ||
separated by spaces, dots, dashes, and parenthesis. The phone number is then | ||
tested based on formatting rules. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the fact that we are testing is self evident from the asserts in the code.
src/basics.md
Outdated
<a name="ex-phone"></a> | ||
## Extract phone numbers from text | ||
|
||
Attempts to process a static string as a list including US phone numbers. The |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This sentence looks a little off, I wouldn't say that the code "attempts" it actually does the parsing. Also I would keep the whole description as a short one sentence mentioning the key method used (captures_iter
).
hi
I no longer see your comment on github. I'm currently on mobile for the
next 1.5weeks so I'll not be able to help with the finer details.
In general you would map the result on the left with to_string and on the
right you would leave Some("the number str".to_owned())
no need to call fmt explicitly or unwrap. the assert_eq is able to compare
complex structs such as Option<String> or something along those lines.
there might be a nicer way but away from the keyboard I cant give a better
suggestion from the top of my head. ay a little with rustc and you'll get
it right 👍
…On 21 Jul 2017 20:43, "Andrew Gauger" ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In src/basics.md
<https://github.com/brson/rust-cookbook/pull/252#discussion_r128833949>:
> + 1.299.339.1020";
+ let re = Regex::new(r"(?x)
+ \+?1? #Country Code Optional
+ [\x20\(\.]*
+ ([2-9]\d{2}) #Area Code
+ [\x20\)\.\-]*
+ ([2-9]\d{2}) #Exchange Code
+ [\x20\.\-]*
+ (\d{4}) #Subscriber Number")?;
+ let mut phone_numbers = re.captures_iter(phone_text).map(|cap| {
+ PhoneNumber {
+ country: "US".to_owned(),
+ number: [cap[1].to_string(),cap[2].to_string(),cap[3].to_string()].concat()
+ }
+ });
+ assert!(format!("{}", phone_numbers.next().unwrap()) == "+1 (505) 881-9292");
So when I do that I get
error[E0308]: mismatched types
--> src/main.rs:53:5
|
53 | assert_eq!(phone_numbers.next().unwrap().fmt(), "+1 (505) 881-9292 <(505)%20881-9292>");
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ expected enum `std::result::Result`, found &str
|
= note: expected type `std::result::Result<(), std::fmt::Error>`
found type `&str`
= note: this error originates in a macro outside of the current crate
I'm using the Some value and then running it through format! to show the
output of the struct. Maybe I need to re-write? I thought the concern about
unwrap is that it may panic, but that's the point of an assert--to panic
when the values aren't equal.
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<https://github.com/brson/rust-cookbook/pull/252#discussion_r128833949>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AANfSGyUv2bmCjSFQudnaCHmk2-e2-gAks5sQPFGgaJpZM4OdVwh>
.
|
src/basics.md
Outdated
|
||
fn run() -> Result<()> { | ||
let phone_text = " | ||
+1 505 881 9292 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also this code should allow for multiple numbers per line so let's show it
2f30cd0
to
13483b3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looking better!
src/basics.md
Outdated
number: [area_code,cap[4].to_string(),cap[5].to_string()].concat() | ||
} | ||
}); | ||
assert_eq!(phone_numbers.next().map(|m| format!("{}", m)), Some("1 (505) 881-9292".to_owned())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can just .map(|m| m.to_string()
this is a preferred variant (clippy has a lint for it)
src/basics.md
Outdated
# | ||
#[derive (PartialEq, PartialOrd, Debug)] | ||
enum CountryCode { | ||
US, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please see my previous comment about country code not doing any real work. it is mostly noise obfuscating a quite simple and elegant example. i would just remove it (the part about matching on Strings was just a sidenote ;) sorry if i was unclear)
src/basics.md
Outdated
|
||
#[derive (PartialEq, PartialOrd, Debug)] | ||
struct PhoneNumber { | ||
country: CountryCode, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please remove the country code but we might like to store the area and exchange codes separately (my comments from previous patcset)
src/basics.md
Outdated
struct PhoneNumber { | ||
country: CountryCode, | ||
number: String, | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please see my comments about making this zero-copy and storing &str instead String
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have tried to do this, but I cannot figure out the lifetime. Here's the problem:
error[E0597]: borrowed value does not live long enough
--> src\main.rs:49:5
|
47 | number: &[area_code,cap[4].to_string(),cap[5].to_string()].concat()
| ---------------------------------------------------------- temporary value created here
48 | }
49 | });
| ^ temporary value dropped here while still borrowed
...
57 | }
| - temporary value needs to live until here
Here's the struct def:
#[derive (PartialEq, PartialOrd, Debug)]
struct PhoneNumber<'a> {
number: &'a str,
}
// Allows printing phone numbers based on country convention.
impl<'a> fmt::Display for PhoneNumber<'a> {
fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
write!(f, "1 ({}) {}-{}", &self.number[0..3], &self.number[3..6], &self.number[6..10])
}
}
Can you advise?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no clean way of concatenating the str's without allocation
The easiest way would be to follow my suggestion of separate &str member for each capture (area, extension, subscriber). Then you don't need to concat and allocate
src/basics.md
Outdated
impl fmt::Display for PhoneNumber { | ||
fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result { | ||
match self.country { | ||
CountryCode::US => write!(f, "1 ({}) {}-{}", &self.number[0..3], &self.number[3..6], &self.number[6..10]), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please run rustfmt on the snippet
src/basics.md
Outdated
1 (800) 233-2010 | ||
1.299.339.1020"; | ||
let re = Regex::new(r"(?x) | ||
(?\+?1)? # Country Code Optional |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the first question mark within the parens looks off but I cannot verify it right now so I might be wrong. the rest looks really nice!
src/basics.md
Outdated
PhoneNumber { | ||
country: CountryCode::US, | ||
number: [area_code,cap[4].to_string(),cap[5].to_string()].concat() | ||
area: if cap.get(2) == None { &cap[3] } else { &cap[2] }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
error[E0597]: `cap` does not live long enough
--> src\main.rs:56:44
|
56 | area: if cap.get(2) == None { &cap[3] } else { &cap[2] },
| ^^^ does not live long enough
...
60 | });
| - borrowed value only lives until here
|
= note: borrowed value must be valid for the static lifetime...
Same problem. cap only exists inside the iter()
block,
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
probably we want to use as_str
instead of depending on &capture Deref
into a &str. Pease see a soon to be merged PR for suggestions https://github.com/brson/rust-cookbook/pull/247/files
also I would suggest using match on captures instead of imperative testing for None
Anyhow if its still a blocker then just go with owned version for now and we'll update the example in a week when I'm back behind a keyboard 👍
I rebased and merged. Thanks @AndyGauge |
Fixes #241