# clean_email(): Cleaning and validation for email addresses

## Introduction

- The clean_email() function cleans messy email values.
- The validate_email() function checks the email semantic type of a single input value or an input column. It returns True when the input value is a valid email address.

Parameters for clean_email():

- split: whether to split the input into multiple columns, default False
- inplace: whether to clean the initial column in place, default False
- pre_clean: whether to pre-clean the input text, default False
- fix_domain: whether to fix common typos in the domain, default False
- report: whether to generate a report, default True
- errors: how to handle broken values, default "coerce"
    - "raise": raise an exception when a broken value is found
    - "coerce": set invalid values to NaN
    - "ignore": return the initial input unchanged

Parameters for validate_email():

- x: input value, either a single value or a column (for example, a pandas Series)

## Example dirty dataset

In [1]:
import pandas as pd
import numpy as np
df = pd.DataFrame({"messy_email": 
                   ["yi@gmali.com","yi@sfu.ca","y i@sfu.ca","Yi@gmail.com","H ELLO@hotmal.COM","hello", np.nan, "NULL"]
                  })
df

|   | messy_email |
|---|---|
| 0 | yi@gmali.com |
| 1 | yi@sfu.ca |
| 2 | y i@sfu.ca |
| 3 | Yi@gmail.com |
| 4 | H ELLO@hotmal.COM |
| 5 | hello |
| 6 | |
| 7 | |


## 1. Default clean_email()

With the default settings, clean_email() performs a strict semantic type check and prints a report automatically. Broken values are replaced with NaN, and valid addresses are returned in lowercase.

In [2]:
from dataprep.clean import clean_email
clean_email(df, "messy_email")


Email Cleaning Report:
	3 values with bad format (37.5%)
Result contains 3 (37.5%) values in the correct format and 5 null values (62.5%)


|   | messy_email | messy_email_clean |
|---|---|---|
| 0 | yi@gmali.com | yi@gmali.com |
| 1 | yi@sfu.ca | yi@sfu.ca |
| 2 | y i@sfu.ca | |
| 3 | Yi@gmail.com | yi@gmail.com |
| 4 | H ELLO@hotmal.COM | |
| 5 | hello | |
| 6 | | |
| 7 | | |
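
To make the "coerce" behaviour concrete, here is a minimal sketch in plain Python. It is not DataPrep's implementation: the pattern EMAIL_RE and the helper coerce_email() are hypothetical, and the regular expression is far simpler than real email validation rules. It only mirrors what the output shows: valid addresses are lowercased and everything else becomes NaN.

```python
import math
import re

# Simplified, hypothetical pattern; real email validation is more involved.
EMAIL_RE = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)+$")

def coerce_email(value):
    """Return the lowercased email if it matches EMAIL_RE, otherwise NaN."""
    if isinstance(value, str) and EMAIL_RE.match(value):
        return value.lower()
    return math.nan

messy = ["yi@gmali.com", "yi@sfu.ca", "y i@sfu.ca", "Yi@gmail.com",
         "H ELLO@hotmal.COM", "hello", math.nan, "NULL"]
cleaned = [coerce_email(v) for v in messy]

# Count the values that survived the check (strings) versus NaN.
n_valid = sum(isinstance(v, str) for v in cleaned)
print(n_valid, len(cleaned) - n_valid)
```

On the example data, the sketch keeps three values and turns five into NaN, which matches the counts in the report.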


## 2. Split Parameter

Setting the split parameter to True makes the returned table contain separate username and domain columns for valid emails.

In [3]:
clean_email(df, "messy_email", split = True)

Email Cleaning Report:
	3 values with bad format (37.5%)
Result contains 3 (37.5%) values in the correct format and 5 null values (62.5%)


|   | messy_email | username | domain |
|---|---|---|---|
| 0 | yi@gmali.com | yi | gmali.com |
| 1 | yi@sfu.ca | yi | sfu.ca |
| 2 | y i@sfu.ca | | |
| 3 | Yi@gmail.com | yi | gmail.com |
| 4 | H ELLO@hotmal.COM | | |
| 5 | hello | | |
| 6 | | | |
| 7 | | | |
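
The split itself can be sketched in a few lines of plain Python. The helper split_email() is hypothetical and assumes the address has already passed validation; how DataPrep splits internally is not stated here.

```python
def split_email(email):
    """Split 'user@domain' at the last '@' into (username, domain)."""
    username, _, domain = email.rpartition("@")
    return username, domain

# Already-cleaned (lowercased) addresses from the example above.
for email in ["yi@gmali.com", "yi@sfu.ca", "yi@gmail.com"]:
    username, domain = split_email(email)
    print(username, domain)
```

Splitting at the last "@" rather than the first is a design choice of this sketch: a domain cannot contain "@", so whatever precedes the final "@" belongs to the username.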


## 3. Pre_clean Parameter

When the pre_clean parameter is set to True, the function fixes broken text (for example, stray whitespace, as the output shows) before performing the semantic type check.

In [4]:
clean_email(df, "messy_email", pre_clean = True)

Email Cleaning Report:
	1 values with bad format (12.5%)
Result contains 5 (62.5%) values in the correct format and 3 null values (37.5%)


|   | messy_email | messy_email_clean |
|---|---|---|
| 0 | yi@gmali.com | yi@gmali.com |
| 1 | yi@sfu.ca | yi@sfu.ca |
| 2 | yi@sfu.ca | yi@sfu.ca |
| 3 | Yi@gmail.com | yi@gmail.com |
| 4 | HELLO@hotmal.COM | hello@hotmal.com |
| 5 | hello | |
| 6 | | |
| 7 | | |
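
As a rough illustration of a pre-cleaning step, the sketch below removes all whitespace from string values before any validation runs. The helper pre_clean_text() is hypothetical, and whitespace removal is only the effect visible in the output above; the exact set of fixes DataPrep applies is an assumption left open here.

```python
import re

def pre_clean_text(value):
    """Remove all whitespace from a string; leave non-strings untouched."""
    if isinstance(value, str):
        return re.sub(r"\s+", "", value)
    return value

for raw in ["y i@sfu.ca", "H ELLO@hotmal.COM", "hello", None]:
    print(repr(raw), "->", repr(pre_clean_text(raw)))
```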


## 4. Fix_domain Parameter

When the fix_domain parameter is set to True, the function runs a basic check that corrects common typos in popular domains.

In [5]:
clean_email(df, "messy_email", fix_domain = True)

Email Cleaning Report:
	1 values with bad format (12.5%)
Result contains 5 (62.5%) values in the correct format and 3 null values (37.5%)


|   | messy_email | messy_email_clean |
|---|---|---|
| 0 | yi@gmali.com | yi@gmail.com |
| 1 | yi@sfu.ca | yi@sfu.ca |
| 2 | yi@sfu.ca | yi@sfu.ca |
| 3 | Yi@gmail.com | yi@gmail.com |
| 4 | HELLO@hotmal.COM | hello@hotmail.com |
| 5 | hello | |
| 6 | | |
| 7 | | |
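
One simple way to picture domain fixing is a lookup table of known misspellings. The sketch below is an assumption, not DataPrep's method: the DOMAIN_TYPOS mapping and the helper fix_email_domain() are hypothetical and cover only the two typos that appear in the example data.

```python
# Hypothetical typo table; a real one would cover many more domains.
DOMAIN_TYPOS = {
    "gmali.com": "gmail.com",
    "hotmal.com": "hotmail.com",
}

def fix_email_domain(email):
    """Replace a known misspelled domain; return other input unchanged."""
    username, sep, domain = email.rpartition("@")
    if not sep:
        # No '@' at all, so there is no domain to fix.
        return email
    domain = domain.lower()
    return f"{username}@{DOMAIN_TYPOS.get(domain, domain)}"

for email in ["yi@gmali.com", "hello@hotmal.com", "yi@sfu.ca"]:
    print(email, "->", fix_email_domain(email))
```

A lookup table only fixes typos it already knows about. Matching by edit distance would catch more misspellings but risks rewriting legitimate domains, so it is left out of this sketch.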


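## 5. Errors Parameter

The errors parameter controls how broken values are handled: "raise" raises an exception, "ignore" returns the initial input unchanged, and the default "coerce" sets invalid values to NaN.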

In [6]:
clean_email(df, "messy_email", errors = "raise")

ValueError: unable to parse value hello

In [7]:
clean_email(df, "messy_email", errors = "ignore")

Email Cleaning Report:
	1 values with bad format (12.5%)
Result contains 5 (62.5%) values in the correct format and 3 null values (37.5%)


|   | messy_email | messy_email_clean |
|---|---|---|
| 0 | yi@gmali.com | yi@gmali.com |
| 1 | yi@sfu.ca | yi@sfu.ca |
| 2 | yi@sfu.ca | yi@sfu.ca |
| 3 | Yi@gmail.com | yi@gmail.com |
| 4 | HELLO@hotmal.COM | hello@hotmal.com |
| 5 | hello | hello |
| 6 | | |
| 7 | | |
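
The three error modes can be sketched as branches in a single per-value function. This is an illustration only: clean_one() and EMAIL_RE are hypothetical, the pattern is deliberately simplified, and the special handling of null values is not modelled.

```python
import math
import re

# Simplified, hypothetical pattern; not the check DataPrep performs.
EMAIL_RE = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)+$")

def clean_one(value, errors="coerce"):
    """Clean one value, handling invalid input according to `errors`."""
    if errors not in {"coerce", "ignore", "raise"}:
        raise ValueError(f"unknown errors mode: {errors}")
    if isinstance(value, str) and EMAIL_RE.match(value):
        return value.lower()
    if errors == "raise":
        raise ValueError(f"unable to parse value {value}")
    if errors == "ignore":
        return value      # hand back the initial input
    return math.nan       # "coerce": replace with NaN

print(clean_one("Yi@gmail.com"))
print(clean_one("hello", errors="ignore"))
print(math.isnan(clean_one("hello", errors="coerce")))
```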


## 6. Examples for validate_email()

In [8]:
from dataprep.clean import validate_email
print(validate_email('Abc.example.com'))
print(validate_email('prettyandsimple@example.com'))
print(validate_email('disposable.style.email.with+symbol@example.com'))
print(validate_email(r'this is"not\allowed@example.com'))  # raw string keeps the backslash literal

False
True
True
False


In [9]:
validate_email(df["messy_email"])

0     True
1     True
2     True
3     True
4     True
5    False
6    False
7    False
Name: messy_email, dtype: bool

Note that validate_email() performs a strict semantic check by default.
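
A boolean result like the one above is commonly used as a mask to keep only the valid entries. The sketch below shows the idea in plain Python with a hypothetical is_valid() helper that, like the two call styles shown, accepts either a single value or a list. The simplified pattern EMAIL_RE is an assumption and is much looser than a strict semantic check.

```python
import re

# Simplified, hypothetical pattern used only for this illustration.
EMAIL_RE = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)+$")

def is_valid(x):
    """Return a bool for a single value, or a list of bools for a list."""
    if isinstance(x, (list, tuple)):
        return [is_valid(v) for v in x]
    return isinstance(x, str) and EMAIL_RE.match(x) is not None

emails = ["yi@sfu.ca", "hello", None, "Yi@gmail.com"]
mask = is_valid(emails)

# Keep only the entries whose mask value is True.
valid_only = [e for e, ok in zip(emails, mask) if ok]
print(mask)
print(valid_only)
```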