### An example dataset with email addresses

In [1]:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    "email": [
        "yi@gmali.com", "yi@sfu.ca", "y i@sfu.ca", "Yi@gmail.com",
        "H ELLO@hotmal.COM", "hello", np.nan, "NULL"
    ]
})
df

| | email |
|---|---|
| 0 | yi@gmali.com |
| 1 | yi@sfu.ca |
| 2 | y i@sfu.ca |
| 3 | Yi@gmail.com |
| 4 | H ELLO@hotmal.COM |
| 5 | hello |
| 6 | NaN |
| 7 | NULL |


## 1. Default `clean_email()`

By default, `clean_email()` performs a strict check on each email address: values in the correct format are lowercased and kept, and invalid values are set to NaN.

In [2]:
from dataprep.clean import clean_email
clean_email(df, "email")

  0%|          | 0/8 [00:00<?, ?it/s]

email Cleaning Report:
	1 values cleaned (12.5%)
	3 values unable to be parsed (37.5%), set to NaN
Result contains 3 (37.5%) values in the correct format and 5 null values (62.5%)


| | email | email_clean |
|---|---|---|
| 0 | yi@gmali.com | yi@gmali.com |
| 1 | yi@sfu.ca | yi@sfu.ca |
| 2 | y i@sfu.ca | NaN |
| 3 | Yi@gmail.com | yi@gmail.com |
| 4 | H ELLO@hotmal.COM | NaN |
| 5 | hello | NaN |
| 6 | NaN | NaN |
| 7 | NULL | NaN |
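
To make the default behaviour concrete, here is a minimal standard-library sketch of a "validate, lowercase, otherwise null" pass. The pattern, the list of null-like strings, and the name `sketch_clean_email` are assumptions for illustration only; they are not dataprep's implementation, and the pattern is far looser than a real strict check.

```python
import re

# Deliberately simple pattern: one "@", no whitespace, a dot in the domain.
# Illustrative assumption, not the pattern dataprep uses.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

# Strings treated as missing values in this sketch (assumed list).
NULL_STRINGS = {"", "null", "nan", "none"}


def sketch_clean_email(value):
    """Return the lowercased address if it looks valid, otherwise None."""
    if not isinstance(value, str) or value.strip().lower() in NULL_STRINGS:
        return None  # missing input stays missing
    if EMAIL_PATTERN.match(value) is None:
        return None  # invalid format is coerced to a null
    return value.lower()


emails = ["yi@gmali.com", "yi@sfu.ca", "y i@sfu.ca", "Yi@gmail.com",
          "H ELLO@hotmal.COM", "hello", None, "NULL"]
cleaned = [sketch_clean_email(e) for e in emails]
print(cleaned)
```

On the example data this sketch keeps rows 0, 1 and 3 (lowercasing row 3) and returns `None` for the rest, mirroring the pattern of valid and null values in the report above.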


## 2. `split` parameter

Setting the `split` parameter to True returns a table with separate columns for the username and domain of each valid email.

In [3]:
clean_email(df, "email", split=True)

  0%|          | 0/9 [00:00<?, ?it/s]

email Cleaning Report:
	1 values cleaned (12.5%)
	3 values unable to be parsed (37.5%), set to NaN
Result contains 3 (37.5%) values in the correct format and 5 null values (62.5%)


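| | email | username | domain |
|---|---|---|---|
| 0 | yi@gmali.com | yi | gmali.com |
| 1 | yi@sfu.ca | yi | sfu.ca |
| 2 | y i@sfu.ca | NaN | NaN |
| 3 | Yi@gmail.com | yi | gmail.com |
| 4 | H ELLO@hotmal.COM | NaN | NaN |
| 5 | hello | NaN | NaN |
| 6 | NaN | NaN | NaN |
| 7 | NULL | NaN | NaN |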
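
The split itself can be pictured with a few lines of plain Python. This is a sketch under the assumption that the address has already been validated and lowercased; `sketch_split_email` is a hypothetical helper, not part of dataprep.

```python
def sketch_split_email(address):
    """Split an already-validated address into (username, domain)."""
    if address is None:
        return (None, None)  # invalid or missing rows have no parts
    # rpartition splits on the last "@", so the domain is whatever follows it.
    username, _, domain = address.rpartition("@")
    return (username, domain)


parts = [sketch_split_email(a) for a in ["yi@gmali.com", "yi@sfu.ca", None]]
print(parts)
```

Splitting on the last `@` is a design choice: a domain cannot contain `@`, so anything before the final one belongs to the username.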


## 3. `remove_whitespace` parameter

When the `remove_whitespace` parameter is set to True, whitespace will be removed before checking if an email is valid.

In [4]:
clean_email(df, "email", remove_whitespace=True)

  0%|          | 0/8 [00:00<?, ?it/s]

email Cleaning Report:
	2 values cleaned (25.0%)
	1 values unable to be parsed (12.5%), set to NaN
Result contains 5 (62.5%) values in the correct format and 3 null values (37.5%)


| | email | email_clean |
|---|---|---|
| 0 | yi@gmali.com | yi@gmali.com |
| 1 | yi@sfu.ca | yi@sfu.ca |
| 2 | y i@sfu.ca | yi@sfu.ca |
| 3 | Yi@gmail.com | yi@gmail.com |
| 4 | H ELLO@hotmal.COM | hello@hotmal.com |
| 5 | hello | NaN |
| 6 | NaN | NaN |
| 7 | NULL | NaN |
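
The order of operations matters here: whitespace is stripped first, and only then is the address checked. A minimal sketch of that ordering follows, using an assumed, simplified pattern and a hypothetical function name rather than dataprep's internals.

```python
import re

# Simplified pattern, assumed for illustration only.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def sketch_clean(value, remove_whitespace=False):
    """Optionally strip all whitespace, then validate and lowercase."""
    if remove_whitespace:
        # str.split() with no arguments splits on any run of whitespace,
        # so joining the pieces removes spaces, tabs and newlines.
        value = "".join(value.split())
    if EMAIL_PATTERN.match(value) is None:
        return None
    return value.lower()


print(sketch_clean("y i@sfu.ca"))
print(sketch_clean("y i@sfu.ca", remove_whitespace=True))
print(sketch_clean("H ELLO@hotmal.COM", remove_whitespace=True))
```

With the flag off, `"y i@sfu.ca"` fails the check; with it on, the same input becomes `"yi@sfu.ca"`, which matches rows 2 and 4 in the table above.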


## 4. `fix_domain` parameter

When the `fix_domain` parameter is set to True, `clean_email()` will try to correct misspelled domains, for example `gmali.com` to `gmail.com`.

In [5]:
clean_email(df, "email", fix_domain=True)

  0%|          | 0/8 [00:00<?, ?it/s]

email Cleaning Report:
	2 values cleaned (25.0%)
	3 values unable to be parsed (37.5%), set to NaN
Result contains 3 (37.5%) values in the correct format and 5 null values (62.5%)


| | email | email_clean |
|---|---|---|
| 0 | yi@gmali.com | yi@gmail.com |
| 1 | yi@sfu.ca | yi@sfu.ca |
| 2 | y i@sfu.ca | NaN |
| 3 | Yi@gmail.com | yi@gmail.com |
| 4 | H ELLO@hotmal.COM | NaN |
| 5 | hello | NaN |
| 6 | NaN | NaN |
| 7 | NULL | NaN |
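
One way to picture domain correction is fuzzy matching against a list of well-known domains. The sketch below uses `difflib.get_close_matches` from the standard library; this is a swapped-in technique chosen for illustration, and both the `KNOWN_DOMAINS` list and the `0.8` cutoff are assumptions. How dataprep actually decides which domains to fix is not described in this section.

```python
from difflib import get_close_matches

# Hypothetical list of common domains, assumed for this sketch.
KNOWN_DOMAINS = ["gmail.com", "hotmail.com", "yahoo.com", "outlook.com"]


def sketch_fix_domain(address, cutoff=0.8):
    """Replace the domain with the closest known domain, if one is similar enough."""
    username, _, domain = address.rpartition("@")
    # Returns at most one candidate whose similarity ratio is >= cutoff.
    matches = get_close_matches(domain, KNOWN_DOMAINS, n=1, cutoff=cutoff)
    if matches:
        domain = matches[0]
    return f"{username}@{domain}"


print(sketch_fix_domain("yi@gmali.com"))
print(sketch_fix_domain("yi@sfu.ca"))
```

A high cutoff keeps unrelated domains such as `sfu.ca` untouched, while a near miss such as `gmali.com` is mapped to `gmail.com`.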


## 5. `errors` parameter

When `errors="ignore"`, invalid emails are left unchanged in the output instead of being set to NaN.

In [6]:
clean_email(df, "email", errors="ignore")

  0%|          | 0/8 [00:00<?, ?it/s]

email Cleaning Report:
	1 values cleaned (12.5%)
	3 values unable to be parsed (37.5%), left unchanged
Result contains 3 (37.5%) values in the correct format and 2 null values (25.0%)


| | email | email_clean |
|---|---|---|
| 0 | yi@gmali.com | yi@gmali.com |
| 1 | yi@sfu.ca | yi@sfu.ca |
| 2 | y i@sfu.ca | y i@sfu.ca |
| 3 | Yi@gmail.com | yi@gmail.com |
| 4 | H ELLO@hotmal.COM | H ELLO@hotmal.COM |
| 5 | hello | hello |
| 6 | NaN | NaN |
| 7 | NULL | NaN |
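
The difference between the default behaviour and `errors="ignore"` comes down to what happens on the invalid branch. Here is a minimal sketch of that branch, with an assumed simplified pattern. The mode name `"coerce"` for the default is borrowed from pandas conventions and is an assumption here, as is the function name.

```python
import re

# Simplified pattern, assumed for illustration only.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def sketch_clean(value, errors="coerce"):
    """Lowercase valid addresses; handle invalid ones according to `errors`."""
    if EMAIL_PATTERN.match(value) is not None:
        return value.lower()
    if errors == "ignore":
        return value  # invalid input is passed through unchanged
    if errors == "coerce":
        return None  # invalid input becomes a null
    raise ValueError(f"unsupported errors mode: {errors!r}")


print(sketch_clean("hello"))
print(sketch_clean("hello", errors="ignore"))
print(sketch_clean("H ELLO@hotmal.COM", errors="ignore"))
```

Note that pass-through values keep their original case and spacing, as rows 2, 4 and 5 of the table show, because only valid addresses are normalized.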


## 6. `validate_email()`

The function `validate_email()` returns True if an email address is valid and False otherwise. It can be applied to a single string or to a column of email addresses.

In [7]:
from dataprep.clean import validate_email
print(validate_email('Abc.example.com'))
print(validate_email('prettyandsimple@example.com'))
print(validate_email('disposable.style.email.with+symbol@example.com'))
print(validate_email(r'this is"not\allowed@example.com'))  # raw string so "\a" is not read as an escape

False
True
True
False


In [8]:
validate_email(df["email"])

0     True
1     True
2    False
3     True
4    False
5    False
6    False
7    False
Name: email, dtype: bool

Note that `validate_email()` performs the strict semantic check by default.
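
Accepting either a single string or a whole column is a small dispatch on the input type. The sketch below shows that idea with a plain list standing in for a column; the pattern and the name `sketch_validate` are assumptions for illustration, and the check is much looser than a strict semantic one.

```python
import re

# Simplified pattern, assumed for illustration only.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def sketch_validate(data):
    """Return a bool for a single string, or a list of bools for an iterable."""
    if isinstance(data, str):
        return EMAIL_PATTERN.match(data) is not None
    # Non-string entries (for example missing values) count as invalid.
    return [isinstance(x, str) and EMAIL_PATTERN.match(x) is not None
            for x in data]


print(sketch_validate("Abc.example.com"))
print(sketch_validate(["yi@sfu.ca", "y i@sfu.ca", None, "NULL"]))
```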