Extract phone numbers from text #252
Conversation
budziq
requested changes
Jul 19, 2017
Hi @AndyGauge, nice work! Unfortunately I won't be able to give you a proper review as I'm mobile-only on holidays, so here are some quick thoughts.
    println!("Input a phone number");
    let mut phone_input = String::new();
    // Regular expression is looking for 3 groups of digits, capturing them
    let re = Regex::new(r".*([2-9]\d{2}).*(\d{3}).*(\d{4})")?;
budziq
Jul 19, 2017
Collaborator
Hi, this regex will catch false positives such as "fiz 233 baz 345 bar 6789", so it might not be the best one.
On the other hand, the allowed NANP spec is ridiculously complex, and even a subset of it spawns something not very nice:
^(\+\d{1,2}\s)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}$ (copied via mobile, so it might be utterly broken)
So let's meet halfway and agree that phone numbers are an optional country/region code plus 3 groups of digits, optionally separated by a single space, dot, or -. The country/region code is preceded by either + or 00. If that's still too complex we might pare it down further, as long as it doesn't give false positives.
Also, it might be a good place to show off named capture groups (at least for the country code).
And we would still mention in the example that this will work only for nicely formatted numbers.
AndyGauge
Jul 20, 2017
Author
Collaborator
So I've investigated international phone numbers and their format is way different. Here's the wikipedia article I'm investigating: https://en.wikipedia.org/wiki/National_conventions_for_writing_telephone_numbers
Because it would be easy to create an entire library out of the individual nations' formatting requirements, I want to make this work only for US telephone numbers (while including the optional +1). International conventions include dots, dashes, spaces, and parentheses optionally between digits. I think +15551231234 and 5551231234 are both valid phone numbers. Just that regex alone is actually a bit more complicated than the one you posted via mobile (and it won't use ^ and $, because we are extracting, not validating).
Anyway, I'll throw something together that eliminates the false positives while staying simple enough to reason about (while only guaranteeing support for US numbers).
budziq
Jul 20, 2017
Collaborator
Ok fair enough
    let mut phone_input = String::new();
    // Regular expression is looking for 3 groups of digits, capturing them
    let re = Regex::new(r".*([2-9]\d{2}).*(\d{3}).*(\d{4})")?;
    stdin().read_line(&mut phone_input)?;
budziq
Jul 19, 2017
Collaborator
I would suggest to open some arbitrary text file and iterate over all of its lines and phone numbers.
Also we might want to collect the numbers into a set in order to deduplicate them prior to printing.
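The deduplication idea can be sketched with only the standard library. This is a minimal illustration, not the code from the PR: `unique_numbers` is a hypothetical helper, and the input slice stands in for numbers that the regex would already have extracted.

```rust
use std::collections::BTreeSet;

/// Hypothetical helper: collects already-extracted numbers into a set,
/// which drops duplicates. BTreeSet also gives a stable, sorted order.
fn unique_numbers<'a>(found: &[&'a str]) -> BTreeSet<&'a str> {
    found.iter().copied().collect()
}

fn main() {
    // Stand-in for the matches produced by the regex.
    let found = ["5058819292", "2993391020", "5058819292"];
    for number in unique_numbers(&found) {
        println!("{}", number);
    }
}
```

A `HashSet` would work equally well if the printing order does not matter.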
AndyGauge
Jul 20, 2017
Author
Collaborator
Good ideas, but should we just use a static string so that it works in a cookbook?
budziq
Jul 20, 2017
Collaborator
you can either provide a short static string (then assert_eq on the results to show the intended output) or a filename (in that case make the example no_run)
        let formatted_phone = format!("({}) {}-{}", &captures[1], &captures[2], &captures[3]);
        println!("Formatted like a boss: {}", formatted_phone);
    } else {
        println!("Phone numbers must include area code. Use numbers and whatever else you want\nFor example +1 (555) 101-9939");
    // Make sure the regular expression captured something
    if let Some(captures) = re.captures(&phone_input){
        let formatted_phone = format!("({}) {}-{}", &captures[1], &captures[2], &captures[3]);
        println!("Formatted like a boss: {}", formatted_phone);
budziq
Jul 19, 2017
Collaborator
How about just printing these out instead of formatting? Also, there is no need to add additional noise to the output. Let's keep the example short and sweet so it can be copy-pasted.
Also, I leave it up to you whether to use a PhoneNumber struct with Display (please see the external process example).
AndyGauge
reviewed
Jul 21, 2017
            number: [cap[1].to_string(),cap[2].to_string(),cap[3].to_string()].concat()
        }
    });
    assert!(format!("{}", phone_numbers.next().unwrap()) == "+1 (505) 881-9292");
budziq
Jul 21, 2017
Collaborator
yep assert_eq!(phone_numbers.next(), Some("+1 (505) 881-9292"))
possibly you will need to do some to_str or somesuch
AndyGauge
Jul 21, 2017
Author
Collaborator
When I do that I get this:
error[E0308]: mismatched types
  --> src/main.rs:57:5
   |
57 |     assert_eq!(phone_numbers.next(), Some("+1 (299) 339-1020"));
   |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ expected struct `PhoneNumber`, found &str
   |
   = note: expected type `std::option::Option<PhoneNumber>`
              found type `std::option::Option<&str>`
   = help: here are some functions which might fulfill your needs:
           - .take()
           - .unwrap()
           - .unwrap_or_default()
   = note: this error originates in a macro outside of the current crate
I could do this
assert_eq!(phone_numbers.next(), Some(PhoneNumber {country:"US".to_string(), number:"2993391020".to_string()}));
but that doesn't look any better. I really think that if we are trying to offer something someone can use, implementing Display and showing how to have the phone numbers formatted is what people can use.
budziq
Jul 21, 2017
Collaborator
that is why I've suggested something like
assert_eq!(phone_numbers.next().map(String::from), Some("+1 (299) 339-1020".to_owned()));
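A minimal sketch of that comparison, using only the standard library. The `PhoneNumber` struct here is a simplified stand-in for the one in the PR, and `first_formatted` is a hypothetical helper. As written, `.map(String::from)` would need a `From<PhoneNumber> for String` impl, so the sketch uses `to_string`, which every `Display` type provides.

```rust
use std::fmt;

// Simplified stand-in for the PR's struct (assumption: ten ASCII digits).
struct PhoneNumber {
    number: String,
}

impl fmt::Display for PhoneNumber {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "+1 ({}) {}-{}",
            &self.number[0..3],
            &self.number[3..6],
            &self.number[6..10]
        )
    }
}

/// Hypothetical helper: formats the first number, if any, as a String.
fn first_formatted(numbers: Vec<PhoneNumber>) -> Option<String> {
    numbers.into_iter().next().map(|p| p.to_string())
}

fn main() {
    let numbers = vec![PhoneNumber { number: "2993391020".to_owned() }];
    // Both sides are Option<String>, so no unwrap or explicit fmt call is needed.
    assert_eq!(first_formatted(numbers), Some("+1 (299) 339-1020".to_owned()));
}
```

Because the comparison happens on `Option<String>`, the struct itself does not need to derive `PartialEq` or `Debug` for the assertion to compile.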
budziq
requested changes
Jul 21, 2017
@AndyGauge almost there. Just a few minor changes and a git squash to a single commit and we're good to go!
        (\d{4}) #Subscriber Number")?;
    let mut phone_numbers = re.captures_iter(phone_text).map(|cap| {
        PhoneNumber {
            country: "US".to_owned(),
budziq
Jul 21, 2017
Collaborator
The country does not make much sense here as it's unconditionally "US", so maybe let's just drop it (also, using a String as a match tag would not be the best idea).
How about storing the area, exchange, and subscriber codes separately instead?
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        match self.country.as_ref() {
            "US" => write!(f, "+1 ({}) {}-{}", &self.number[0..3], &self.number[3..6], &self.number[6..10]),
            _ => write!(f, "{}", self.number),
budziq
Jul 21, 2017
Collaborator
Let's just drop the country and omit the +1 from printing. The regex should still conditionally match it.
    #
    struct PhoneNumber {
        country: String,
        number: String,
    1.299.339.1020";
    let re = Regex::new(r"(?x)
        \+?1? #Country Code Optional
        [\x20\(\.]*
budziq
Jul 21, 2017
Collaborator
The matchers for an arbitrary number of parentheses look funky.
We should probably match the area code in two alternatives: one with one set of parens and the other with none.
Also, there are better ways to match whitespace than the \x20 ASCII escape.
AndyGauge
Jul 21, 2017
Author
Collaborator
\x20 is required with the (?x) ignore whitespace. I could use \s but that would match tabs and newlines. If you are OK with that I can switch it.
budziq
Jul 21, 2017
Collaborator
- Yeah, I would go with \s.
- \+?1? should be moved into an optional group, (\+1)?\s*, so that the plus char has to be followed by 1; otherwise these two are independent.
- I meant that the area code expression has to be matched either with both opening and closing parens as a whole, or without both of them. Currently the matches for the opening and closing parens are independent, so we can match (((202 991 9534.
I may try to help, but writing regexes on mobile is a real pain and I have no way to give you the full answer or even verify whether my suggestions are fully sane. So you might try solving this yourself, or wait for me to get back behind a keyboard within the next two weeks.
    ## Extract phone numbers from text

    Attempts to process a static string as a list including US phone numbers. The
    expression is checking that there are 7 digits that come in groups of 3 3 4,
budziq
Jul 21, 2017
Collaborator
I would not try to explain the actual regex here. We are interested in the crate and its methods, not in teaching regular expressions.
    Attempts to process a static string as a list including US phone numbers. The
    expression is checking that there are 7 digits that come in groups of 3 3 4,
    separated by spaces, dots, dashes, and parenthesis. The phone number is then
    tested based on formatting rules.
budziq
Jul 21, 2017
Collaborator
the fact that we are testing is self evident from the asserts in the code.
    <a name="ex-phone"></a>
    ## Extract phone numbers from text

    Attempts to process a static string as a list including US phone numbers. The
budziq
Jul 21, 2017
Collaborator
This sentence looks a little off. I wouldn't say that the code "attempts"; it actually does the parsing. Also, I would keep the whole description to one short sentence mentioning the key method used (captures_iter).
Hi,
I no longer see your comment on GitHub. I'm currently on mobile for the next 1.5 weeks, so I won't be able to help with the finer details.
In general you would map the result on the left with to_string, and on the right you would leave Some("the number str".to_owned()).
There is no need to call fmt explicitly or unwrap; assert_eq is able to compare complex types such as Option<String> or something along those lines.
There might be a nicer way, but away from the keyboard I can't give a better suggestion off the top of my head. Play a little with rustc and you'll get it right.
…On 21 Jul 2017 20:43, "Andrew Gauger" ***@***.***> wrote:
*@AndyGauge* commented on this pull request.
------------------------------
In src/basics.md
<https://github.com/brson/rust-cookbook/pull/252#discussion_r128833949>:
> + 1.299.339.1020";
+ let re = Regex::new(r"(?x)
+ \+?1? #Country Code Optional
+ [\x20\(\.]*
+ ([2-9]\d{2}) #Area Code
+ [\x20\)\.\-]*
+ ([2-9]\d{2}) #Exchange Code
+ [\x20\.\-]*
+ (\d{4}) #Subscriber Number")?;
+ let mut phone_numbers = re.captures_iter(phone_text).map(|cap| {
+ PhoneNumber {
+ country: "US".to_owned(),
+ number: [cap[1].to_string(),cap[2].to_string(),cap[3].to_string()].concat()
+ }
+ });
+ assert!(format!("{}", phone_numbers.next().unwrap()) == "+1 (505) 881-9292");
So when I do that I get
error[E0308]: mismatched types
  --> src/main.rs:53:5
   |
53 |     assert_eq!(phone_numbers.next().unwrap().fmt(), "+1 (505) 881-9292");
   |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ expected enum `std::result::Result`, found &str
   |
   = note: expected type `std::result::Result<(), std::fmt::Error>`
              found type `&str`
   = note: this error originates in a macro outside of the current crate
I'm using the Some value and then running it through format! to show the
output of the struct. Maybe I need to re-write? I thought the concern about
unwrap is that it may panic, but that's the point of an assert--to panic
when the values aren't equal.
budziq
reviewed
Jul 21, 2017
    fn run() -> Result<()> {
        let phone_text = "
        +1 505 881 9292
budziq
Jul 21, 2017
Collaborator
also this code should allow for multiple numbers per line so let's show it
AndyGauge force-pushed the AndyGauge:ex-phone branch 3 times, most recently from 2f30cd0 to 13483b3
Jul 21, 2017
budziq
requested changes
Jul 22, 2017
Looking better!
            number: [area_code,cap[4].to_string(),cap[5].to_string()].concat()
        }
    });
    assert_eq!(phone_numbers.next().map(|m| format!("{}", m)), Some("1 (505) 881-9292".to_owned()));
budziq
Jul 22, 2017
Collaborator
You can just use .map(|m| m.to_string()); this is the preferred variant (clippy has a lint for it).
    #
    #[derive (PartialEq, PartialOrd, Debug)]
    enum CountryCode {
        US,
budziq
Jul 22, 2017
Collaborator
Please see my previous comment about the country code not doing any real work. It is mostly noise obfuscating a quite simple and elegant example. I would just remove it (the part about matching on Strings was just a side note ;) sorry if I was unclear).
    #[derive (PartialEq, PartialOrd, Debug)]
    struct PhoneNumber {
        country: CountryCode,
budziq
Jul 22, 2017
Collaborator
Please remove the country code, but we might like to store the area and exchange codes separately (see my comments from the previous patchset).
    struct PhoneNumber {
        country: CountryCode,
        number: String,
    }
budziq
Jul 22, 2017
Collaborator
Please see my comments about making this zero-copy and storing &str instead of String.
AndyGauge
Jul 22, 2017
Author
Collaborator
I have tried to do this, but I cannot figure out the lifetime. Here's the problem:
error[E0597]: borrowed value does not live long enough
  --> src\main.rs:49:5
   |
47 |         number: &[area_code,cap[4].to_string(),cap[5].to_string()].concat()
   |                  ---------------------------------------------------------- temporary value created here
48 |     }
49 |     });
   |     ^ temporary value dropped here while still borrowed
...
57 | }
   | - temporary value needs to live until here
Here's the struct def:
#[derive (PartialEq, PartialOrd, Debug)]
struct PhoneNumber<'a> {
    number: &'a str,
}
// Allows printing phone numbers based on country convention.
impl<'a> fmt::Display for PhoneNumber<'a> {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        write!(f, "1 ({}) {}-{}", &self.number[0..3], &self.number[3..6], &self.number[6..10])
    }
}
Can you advise?
budziq
Jul 22, 2017
Collaborator
There is no clean way of concatenating the str's without allocation
The easiest way would be to follow my suggestion of separate &str member for each capture (area, extension, subscriber). Then you don't need to concat and allocate
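The separate-member layout might look like the sketch below. It is an illustration under stated assumptions, not the merged example: the field names are guesses based on the comments above, and `parse_spaced` is a hypothetical stand-in that splits on spaces where the real code would read regex captures. The point is that each field borrows directly from the input text, so nothing is concatenated or allocated until formatting.

```rust
use std::fmt;

// Each field borrows a slice of the original text (zero-copy).
#[derive(PartialEq, Debug)]
struct PhoneNumber<'a> {
    area: &'a str,
    exchange: &'a str,
    subscriber: &'a str,
}

impl<'a> fmt::Display for PhoneNumber<'a> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "({}) {}-{}", self.area, self.exchange, self.subscriber)
    }
}

/// Hypothetical stand-in for the regex captures: expects "505 881 9292".
fn parse_spaced(text: &str) -> Option<PhoneNumber<'_>> {
    let mut parts = text.split(' ');
    let number = PhoneNumber {
        area: parts.next()?,
        exchange: parts.next()?,
        subscriber: parts.next()?,
    };
    Some(number)
}

fn main() {
    let text = "505 881 9292";
    let formatted = parse_spaced(text).map(|n| n.to_string());
    assert_eq!(formatted, Some("(505) 881-9292".to_owned()));
}
```

With three borrowed fields the `Display` impl no longer needs to slice a combined string by index.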
    impl fmt::Display for PhoneNumber {
        fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
            match self.country {
                CountryCode::US => write!(f, "1 ({}) {}-{}", &self.number[0..3], &self.number[3..6], &self.number[6..10]),
    1 (800) 233-2010
    1.299.339.1020";
    let re = Regex::new(r"(?x)
        (?\+?1)? # Country Code Optional
budziq
Jul 22, 2017
Collaborator
The first question mark within the parens looks off, but I cannot verify it right now, so I might be wrong. The rest looks really nice!
AndyGauge
reviewed
Jul 22, 2017
    PhoneNumber {
        country: CountryCode::US,
        number: [area_code,cap[4].to_string(),cap[5].to_string()].concat()
        area: if cap.get(2) == None { &cap[3] } else { &cap[2] },
AndyGauge
Jul 22, 2017
Author
Collaborator
error[E0597]: `cap` does not live long enough
  --> src\main.rs:56:44
   |
56 |             area: if cap.get(2) == None { &cap[3] } else { &cap[2] },
   |                                            ^^^ does not live long enough
...
60 |     });
   |     - borrowed value only lives until here
   |
   = note: borrowed value must be valid for the static lifetime...
Same problem: cap only exists inside the iter() block.
budziq
Jul 22, 2017
Collaborator
We probably want to use as_str instead of depending on &capture Deref into a &str. Please see a soon-to-be-merged PR for suggestions: https://github.com/brson/rust-cookbook/pull/247/files
Also, I would suggest using match on the captures instead of imperatively testing for None.
Anyhow if its still a blocker then just go with owned version for now and we'll update the example in a week when I'm back behind a keyboard
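The match-based selection can be sketched without the regex crate. In this illustration the two `Option<&str>` arguments are hypothetical stand-ins for the two optional area-code captures (one alternative with parentheses, one without), and `area_code` is an assumed helper name.

```rust
/// Hypothetical helper: picks whichever of the two optional captures matched.
fn area_code<'a>(with_parens: Option<&'a str>, bare: Option<&'a str>) -> Option<&'a str> {
    match (with_parens, bare) {
        (Some(area), _) => Some(area),
        (None, Some(area)) => Some(area),
        (None, None) => None,
    }
}

fn main() {
    assert_eq!(area_code(Some("505"), None), Some("505"));
    assert_eq!(area_code(None, Some("800")), Some("800"));
    assert_eq!(area_code(None, None), None);
}
```

For this particular shape, `with_parens.or(bare)` gives the same result in one call; the explicit match is shown because it makes the "neither matched" case visible instead of indexing a capture that may be absent.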
AndyGauge force-pushed the AndyGauge:ex-phone branch from 1d8d68c to 9afc8e9
Jul 23, 2017
I rebased and merged. Thanks @AndyGauge
AndyGauge commented Jul 19, 2017
Fixes #241