# `clean_phone()`: Clean and validate phone numbers

## Introduction

The function `clean_phone()` cleans a DataFrame column containing phone numbers and standardizes them in the desired format. The function `validate_phone()` validates either a single phone number or a column of phone numbers, returning True if the value is valid and False otherwise.

Currently, Canadian and US phone numbers made up of the following components are supported as valid input:

* Country code of "1" (optional)
* Three-digit area code (optional)
* Three-digit central office code
* Four-digit station code
* Extension number preceded by "#", "x", "ext", or "extension" (optional)

Delimiters between the digit groups are also allowed, including spaces, hyphens, periods, parentheses, and forward slashes.
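The input format described above can be approximated with a single regular expression. The following is a minimal sketch, not the pattern that `clean_phone()` uses internally: the names `SEP`, `PHONE_RE`, and `parse_phone` are hypothetical, and the pattern does not enforce the finer North American Numbering Plan rules (for example, which leading digits are allowed in an area code).

```python
import re

# Hypothetical pattern for illustration only; not the regex used by dataprep.
SEP = r"[\s\-./()]*"  # delimiters allowed between digit groups
PHONE_RE = re.compile(
    rf"""
    ^\s*
    (?:\+?\(?(?P<country>1)\)?{SEP})?                  # optional country code 1
    (?:\(?(?P<area>\d{{3}})\)?{SEP})?                   # optional three-digit area code
    (?P<office>\d{{3}}){SEP}                            # three-digit central office code
    (?P<station>\d{{4}})                                # four-digit station code
    (?:\s*(?:extension|ext|x|\#)\.?\s*(?P<ext>\d+))?    # optional extension
    \s*$
    """,
    re.VERBOSE | re.IGNORECASE,
)


def parse_phone(value):
    """Return a dict of phone number components, or None if there is no match."""
    match = PHONE_RE.match(str(value))
    return match.groupdict() if match else None


print(parse_phone("+1 (234) 567-8901 x. 1234"))
print(parse_phone("2345678"))          # seven digits: no country or area code
print(parse_phone("+66 91 889 8948"))  # not a Canadian/US number
```

Components that are absent from the input come back as `None`, which makes it easy to tell a seven-digit number from a full ten-digit one.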

Phone numbers can be converted to the following formats via the `output_format` parameter:

* North American Numbering Plan (nanp): NPA-NXX-XXXX
* E.164 (e164): +1NPANXXXXXX
* national: (NPA) NXX-XXXX
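To make the three layouts concrete, here is a small sketch that assembles each format from already-parsed components. The function `format_phone` and its signature are assumptions made for illustration; they are not part of dataprep. The handling of seven-digit numbers and extensions mirrors the outputs shown later in this guide.

```python
def format_phone(area, office, station, ext=None, output_format="nanp"):
    """Assemble a phone number string from its components (illustrative only)."""
    if output_format == "nanp":
        # NPA-NXX-XXXX, or NXX-XXXX when there is no area code
        result = "-".join(part for part in (area, office, station) if part)
    elif output_format == "e164":
        # +1NPANXXXXXX; the "+1" prefix is only added when an area code exists
        result = f"+1{area}{office}{station}" if area else f"{office}{station}"
    elif output_format == "national":
        # (NPA) NXX-XXXX, or NXX-XXXX when there is no area code
        result = f"({area}) {office}-{station}" if area else f"{office}-{station}"
    else:
        raise ValueError(f"unknown output_format: {output_format!r}")
    if ext:
        result += f" ext. {ext}"
    return result


print(format_phone("555", "234", "5678"))                            # 555-234-5678
print(format_phone("555", "234", "5678", output_format="e164"))      # +15552345678
print(format_phone("555", "234", "5678", output_format="national"))  # (555) 234-5678
```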

Values that cannot be parsed are handled according to the `errors` parameter:

* "coerce" (default): invalid values are set to NaN
* "ignore": invalid values are returned unchanged
* "raise": invalid values raise an exception
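The three options can be pictured as a thin wrapper around a parser. The sketch below is only a model of the behavior described above, not dataprep's code: `handle_invalid` and `toy_parser` are hypothetical names, and the exception type raised by `clean_phone()` itself is not specified here, so `ValueError` is an assumption.

```python
import math


def toy_parser(value):
    """Hypothetical stand-in parser: accepts only strings of exactly ten digits."""
    text = str(value)
    return text if text.isdigit() and len(text) == 10 else None


def handle_invalid(value, parser, errors="coerce"):
    """Apply `parser` and deal with failures according to `errors`."""
    result = parser(value)
    if result is not None:
        return result
    if errors == "coerce":
        return float("nan")   # invalid values become NaN
    if errors == "ignore":
        return value          # invalid values are passed through unchanged
    if errors == "raise":
        raise ValueError(f"unable to parse value {value!r}")
    raise ValueError(f"unknown errors option: {errors!r}")


print(handle_invalid("5552345678", toy_parser))               # parsed normally
print(math.isnan(handle_invalid("hello", toy_parser)))        # True
print(handle_invalid("hello", toy_parser, errors="ignore"))   # hello
```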

After cleaning, a **report** is printed that provides the following information:

* How many values were cleaned (a value counts as cleaned only if it was transformed)
* How many values could not be parsed
* A data summary: how many values are in the correct format, and how many values are null

The following sections demonstrate the functionality of `clean_phone()` and `validate_phone()`. 

### An example dirty dataset

In [1]:
import pandas as pd
import numpy as np
df = pd.DataFrame({"messy_phone":
                   ["555-234-5678", "(555) 234-5678", "555.234.5678", "555/234/5678",
                    15551234567, "(1) 555-234-5678", "+1 (234) 567-8901 x. 1234", 
                    "2345678901 extension 1234", "2345678", "+66 91 889 8948", 
                    "hello", np.nan, "NULL"]
                  })
df

Unnamed: 0,messy_phone
0,555-234-5678
1,(555) 234-5678
2,555.234.5678
3,555/234/5678
4,15551234567
5,(1) 555-234-5678
6,+1 (234) 567-8901 x. 1234
7,2345678901 extension 1234
8,2345678
9,+66 91 889 8948


## 1. Default `clean_phone()`

By default, the `output_format` parameter is set to "nanp" (NPA-NXX-XXXX) and the `errors` parameter is set to "coerce" (set to NaN when parsing is invalid).

In [2]:
from dataprep.clean import clean_phone
clean_phone(df, "messy_phone")

Phone Number Cleaning Report:
	8 values cleaned (61.54%)
	2 values unable to be parsed (15.38%), set to NaN
Result contains 9 (69.23%) values in the correct format and 4 null values (30.77%)


Unnamed: 0,messy_phone,messy_phone_clean
0,555-234-5678,555-234-5678
1,(555) 234-5678,555-234-5678
2,555.234.5678,555-234-5678
3,555/234/5678,555-234-5678
4,15551234567,555-123-4567
5,(1) 555-234-5678,555-234-5678
6,+1 (234) 567-8901 x. 1234,234-567-8901 ext. 1234
7,2345678901 extension 1234,234-567-8901 ext. 1234
8,2345678,234-5678
9,+66 91 889 8948,


Note that "555-234-5678" is not counted as cleaned in the report because its output is identical to the input. Also, "+66 91 889 8948" is treated as invalid because it is not a Canadian or US phone number.
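The figures in the report can be reproduced by hand from the 13 values in the example DataFrame. The breakdown below is a reading of the report rather than dataprep's own bookkeeping; in particular, it assumes that `np.nan` and "NULL" are the two inputs treated as null.

```python
def pct(count, total):
    """Format a count as a percentage of the total, with two decimals."""
    return f"{100 * count / total:.2f}%"


total = 13        # rows in the example DataFrame
valid = 9         # inputs that parse as Canadian/US phone numbers
unchanged = 1     # "555-234-5678" is already in the nanp format
unparsed = 2      # "+66 91 889 8948" and "hello"
null_inputs = 2   # np.nan and "NULL" (assumed to be treated as null)

cleaned = valid - unchanged          # 8 values were actually transformed
null_after = unparsed + null_inputs  # 4 null values in the result

print(cleaned, pct(cleaned, total))        # 8 61.54%
print(unparsed, pct(unparsed, total))      # 2 15.38%
print(valid, pct(valid, total))            # 9 69.23%
print(null_after, pct(null_after, total))  # 4 30.77%
```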

## 2. Output formats

This section demonstrates the supported phone number formats.

### E.164 (e164)

In [3]:
clean_phone(df, "messy_phone", output_format="e164")

Phone Number Cleaning Report:
	8 values cleaned (61.54%)
	2 values unable to be parsed (15.38%), set to NaN
Result contains 9 (69.23%) values in the correct format and 4 null values (30.77%)


Unnamed: 0,messy_phone,messy_phone_clean
0,555-234-5678,+15552345678
1,(555) 234-5678,+15552345678
2,555.234.5678,+15552345678
3,555/234/5678,+15552345678
4,15551234567,+15551234567
5,(1) 555-234-5678,+15552345678
6,+1 (234) 567-8901 x. 1234,+12345678901 ext. 1234
7,2345678901 extension 1234,+12345678901 ext. 1234
8,2345678,2345678
9,+66 91 889 8948,


Note that the country code "+1" is not added to "2345678" as this would result in an invalid Canadian or US phone number.

### national

In [4]:
clean_phone(df, "messy_phone", output_format="national")

Phone Number Cleaning Report:
	8 values cleaned (61.54%)
	2 values unable to be parsed (15.38%), set to NaN
Result contains 9 (69.23%) values in the correct format and 4 null values (30.77%)


Unnamed: 0,messy_phone,messy_phone_clean
0,555-234-5678,(555) 234-5678
1,(555) 234-5678,(555) 234-5678
2,555.234.5678,(555) 234-5678
3,555/234/5678,(555) 234-5678
4,15551234567,(555) 123-4567
5,(1) 555-234-5678,(555) 234-5678
6,+1 (234) 567-8901 x. 1234,(234) 567-8901 ext. 1234
7,2345678901 extension 1234,(234) 567-8901 ext. 1234
8,2345678,234-5678
9,+66 91 889 8948,


## 3. `split` parameter

The `split` parameter adds individual columns to the given DataFrame, one for each component of the cleaned phone number: country code, area code, office code, station code, and extension number.

In [5]:
clean_phone(df, "messy_phone", split=True)

Phone Number Cleaning Report:
	9 values cleaned (69.23%)
	2 values unable to be parsed (15.38%), set to NaN
Result contains 9 (69.23%) values in the correct format and 4 null values (30.77%)


Unnamed: 0,messy_phone,country_code,area_code,office_code,station_code,ext_num
0,555-234-5678,,555.0,234.0,5678.0,
1,(555) 234-5678,,555.0,234.0,5678.0,
2,555.234.5678,,555.0,234.0,5678.0,
3,555/234/5678,,555.0,234.0,5678.0,
4,15551234567,1.0,555.0,123.0,4567.0,
5,(1) 555-234-5678,1.0,555.0,234.0,5678.0,
6,+1 (234) 567-8901 x. 1234,1.0,234.0,567.0,8901.0,1234.0
7,2345678901 extension 1234,,234.0,567.0,8901.0,1234.0
8,2345678,,,234.0,5678.0,
9,+66 91 889 8948,,,,,


## 4. `fix_missing` parameter

By default, the `fix_missing` parameter is set to "empty", which leaves a missing country code as is. If it is set to "auto", a missing country code is set to "1".

### `split` and `fix_missing`

In [6]:
clean_phone(df, "messy_phone", split=True, fix_missing="auto")

Phone Number Cleaning Report:
	9 values cleaned (69.23%)
	2 values unable to be parsed (15.38%), set to NaN
Result contains 9 (69.23%) values in the correct format and 4 null values (30.77%)


Unnamed: 0,messy_phone,country_code,area_code,office_code,station_code,ext_num
0,555-234-5678,1.0,555.0,234.0,5678.0,
1,(555) 234-5678,1.0,555.0,234.0,5678.0,
2,555.234.5678,1.0,555.0,234.0,5678.0,
3,555/234/5678,1.0,555.0,234.0,5678.0,
4,15551234567,1.0,555.0,123.0,4567.0,
5,(1) 555-234-5678,1.0,555.0,234.0,5678.0,
6,+1 (234) 567-8901 x. 1234,1.0,234.0,567.0,8901.0,1234.0
7,2345678901 extension 1234,1.0,234.0,567.0,8901.0,1234.0
8,2345678,,,234.0,5678.0,
9,+66 91 889 8948,,,,,


Again, note that the country code is not set to "1" for "2345678" as this would result in an invalid Canadian or US phone number.
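The rule implied by these results can be written as a small function. This is a sketch of the observed behavior under the assumption that the area code alone decides whether "1" can be added; `fix_country_code` is a hypothetical helper, not a dataprep function.

```python
def fix_country_code(country, area, fix_missing="empty"):
    """Fill in a missing country code when `fix_missing` is "auto".

    The country code is only filled in when an area code is present,
    since "1" followed by a seven-digit number is not a valid
    Canadian or US phone number.
    """
    if fix_missing == "auto" and country is None and area is not None:
        return "1"
    return country


print(fix_country_code(None, "555", fix_missing="auto"))   # 1
print(fix_country_code(None, None, fix_missing="auto"))    # None (seven-digit number)
print(fix_country_code(None, "555", fix_missing="empty"))  # None (left as is)
```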

## 5. `inplace` parameter

Setting `inplace=True` deletes the given column from the returned DataFrame.
A new column containing the cleaned phone numbers is added, with a title in the format `"{original title}_clean"`.

In [7]:
clean_phone(df, "messy_phone", inplace=True)

Phone Number Cleaning Report:
	8 values cleaned (61.54%)
	2 values unable to be parsed (15.38%), set to NaN
Result contains 9 (69.23%) values in the correct format and 4 null values (30.77%)


Unnamed: 0,messy_phone_clean
0,555-234-5678
1,555-234-5678
2,555-234-5678
3,555-234-5678
4,555-123-4567
5,555-234-5678
6,234-567-8901 ext. 1234
7,234-567-8901 ext. 1234
8,234-5678
9,


### `inplace` and `split`

In [8]:
clean_phone(df, "messy_phone", split=True, inplace=True)

Phone Number Cleaning Report:
	9 values cleaned (69.23%)
	2 values unable to be parsed (15.38%), set to NaN
Result contains 9 (69.23%) values in the correct format and 4 null values (30.77%)


Unnamed: 0,country_code,area_code,office_code,station_code,ext_num
0,,555.0,234.0,5678.0,
1,,555.0,234.0,5678.0,
2,,555.0,234.0,5678.0,
3,,555.0,234.0,5678.0,
4,1.0,555.0,123.0,4567.0,
5,1.0,555.0,234.0,5678.0,
6,1.0,234.0,567.0,8901.0,1234.0
7,,234.0,567.0,8901.0,1234.0
8,,,234.0,5678.0,
9,,,,,


### `inplace`, `split` and `fix_missing`

In [9]:
clean_phone(df, "messy_phone", split=True, inplace=True, fix_missing="auto")

Phone Number Cleaning Report:
	9 values cleaned (69.23%)
	2 values unable to be parsed (15.38%), set to NaN
Result contains 9 (69.23%) values in the correct format and 4 null values (30.77%)


Unnamed: 0,country_code,area_code,office_code,station_code,ext_num
0,1.0,555.0,234.0,5678.0,
1,1.0,555.0,234.0,5678.0,
2,1.0,555.0,234.0,5678.0,
3,1.0,555.0,234.0,5678.0,
4,1.0,555.0,123.0,4567.0,
5,1.0,555.0,234.0,5678.0,
6,1.0,234.0,567.0,8901.0,1234.0
7,1.0,234.0,567.0,8901.0,1234.0
8,,,234.0,5678.0,
9,,,,,


## 6. `validate_phone()` 

`validate_phone()` returns True when the input is a valid phone number and False otherwise. It accepts either a single value or a column of values.
The phone number formats accepted as valid are the same as those supported by `clean_phone()`.

In [10]:
from dataprep.clean import validate_phone
print(validate_phone(1234))
print(validate_phone(2346789))
print(validate_phone("1 800 234 6789"))
print(validate_phone("+44 7700 900077"))
print(validate_phone("555-234-6789 ext 32"))

False
True
True
False
True


In [11]:
df = pd.DataFrame({"messy_phone":
                   ["555-234-5678", "(555) 234-5678", "555.234.5678", "555/234/5678",
                    15551234567, "(1) 555-234-5678", "+1 (234) 567-8901 x. 1234", 
                    "2345678901 extension 1234", "2345678", "+66 91 889 8948", 
                    "hello", np.nan, "NULL"]
                  })
df["valid"] = validate_phone(df["messy_phone"])
df

Unnamed: 0,messy_phone,valid
0,555-234-5678,True
1,(555) 234-5678,True
2,555.234.5678,True
3,555/234/5678,True
4,15551234567,True
5,(1) 555-234-5678,True
6,+1 (234) 567-8901 x. 1234,True
7,2345678901 extension 1234,True
8,2345678,True
9,+66 91 889 8948,False
