In data science we deal with many kinds of data, and text is one of the most common. Before getting into the topic, let's start with a problem I recently encountered. I was given the task of extracting sent and received messages from a long multipart email thread. A typical email looked something like this:




In [18]:

eml = """Re: Documents Received



John Doe <john@doe.org>
Wed, Jun 1, 2011, 9:39 PM
to Emma, Don, Bucky



Lorem
Ipsum
Dorem


On 01/06/2011, at 7:57 PM, "Emma" <emma@thompson.com> wrote:


Lorem Ipsum?

Thanks John

On 1 June 2011 13:43, Bucky Hallam <bucky@barnes.com> wrote:


Lorem Ipsum is Dorem.



Thanks Emma"""


The text above is adapted from an original multipart email. My goal was to separate the received and sent emails from one another. I have also written a blog post about how to retrieve emails; please give it a try if you are interested. I split the entire text on `wrote:` and then had to split the parts again on `On <sent date>`. The first part was easy, but because dates come in so many variants, the second part got terribly hard. Some of the first emails I worked on had dates like `On Sun 11, 2022`, so I created a list of prefixes like the one below:

In [16]:
[f'On {d},' for d in ['Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat']]

['On Sun,', 'On Mon,', 'On Tue,', 'On Wed,', 'On Thu,', 'On Fri,', 'On Sat,']

It worked for some emails, but it failed when mail servers wrote the sent date in a different format. There are a number of ways to handle this, but all of them rely on finding the pattern of the datetime. A datetime usually follows a pattern such as YYYY/MM/DD HH:MM:SS or DD/MM/YYYY HH:MM:SS, so we can prepare a regex for each pattern and find where it matches.
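
Before turning to the dates, here is a minimal sketch of the first split mentioned above, on `wrote:`. The `thread` string is a shortened, made-up stand-in for `eml`, and plain `str.split` is only one way to do it.

```python
# Shortened stand-in for the `eml` string (an assumption for illustration).
thread = (
    "Lorem\nIpsum\n\n"
    'On 01/06/2011, at 7:57 PM, "Emma" <emma@thompson.com> wrote:\n\n'
    "Lorem Ipsum?\n\n"
    "On 1 June 2011 13:43, Bucky Hallam <bucky@barnes.com> wrote:\n\n"
    "Lorem Ipsum is Dorem."
)

# Split on the literal marker and trim surrounding whitespace.
parts = [part.strip() for part in thread.split("wrote:")]

for part in parts:
    print(repr(part))
```

Every part except the last still ends with the `On <date>, <sender>` header of the next message, which is why a second split on the date is needed.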

In [19]:
import re

## Finding Date Using Regex

### Format 1
Let's try to find dates in the format YYYY/MM/DD, without any time.

In [24]:
pattern = r'\d{4}/\d{2}/\d{2}'
txt = "This is 2022/11/11 and we are waiting for 2022/11/12."
print(re.findall(pattern, txt, re.DOTALL))

['2022/11/11', '2022/11/12']


This works well, but not when another separator such as `-` is used instead of `/`.

In [26]:
pattern = r'\d{4}/\d{2}/\d{2}'
txt = "This is 2022-11-11 and we are waiting for 2022/11/12."
print(re.findall(pattern, txt, re.DOTALL))

['2022/11/12']


It missed the first date here. We can simply use the or operator (`|`) to add the other format.

In [38]:
pattern = r'(\d{4}-\d{2}-\d{2}|\d{4}/\d{2}/\d{2})'
txt = "This is 2022-11-11 and we are waiting for 2022/11/12."
print(re.findall(pattern, txt, re.DOTALL))

['2022-11-11', '2022/11/12']
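
As an alternative to spelling out each separator with `|`, a character class plus a backreference can do the same job. This is a sketch, not the only way: the extra `2022-11/13` in the text is made up to show that mixed separators are rejected.

```python
import re

# Accept "-" or "/" as the separator. The backreference \1 requires the
# second separator to be the same character as the first one.
pattern = re.compile(r'\d{4}([-/])\d{2}\1\d{2}')
txt = "This is 2022-11-11 and we are waiting for 2022/11/12, not 2022-11/13."

# findall() would return only the captured separator, so use finditer()
# and take the whole match instead.
dates = [m.group(0) for m in pattern.finditer(txt)]
print(dates)  # ['2022-11-11', '2022/11/12']
```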


### Format 2

Let's include the time too.

In [40]:
pattern = r'(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}|\d{4}/\d{2}/\d{2})'
txt = "This is 2022-11-11 14:23:19 and we are waiting for 2022/11/12."
print(re.findall(pattern, txt, re.DOTALL))

['2022-11-11 14:23:19', '2022/11/12']


It worked, but it does not help much with dates like the ones in our email. We can also split the text on the dates we find, which is very useful for the task above.

In [41]:
pattern = r'(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}|\d{4}/\d{2}/\d{2})'
txt = "This is 2022-11-11 14:23:19 and we are waiting for 2022/11/12."
# Pass flags by keyword: the third positional argument of re.split is maxsplit
print(re.split(pattern, txt, flags=re.DOTALL))

['This is ', '2022-11-11 14:23:19', ' and we are waiting for ', '2022/11/12', '.']
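
The dates survive in the output above because the pattern is wrapped in a capturing group. The sketch below contrasts that with a non-capturing group, which drops them. The variable names are only illustrative.

```python
import re

txt = "This is 2022-11-11 14:23:19 and we are waiting for 2022/11/12."
date = r'\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}|\d{4}/\d{2}/\d{2}'

# Capturing group: the matched dates are kept in the result.
with_dates = re.split(f'({date})', txt)
# Non-capturing group: only the text between the dates is returned.
without_dates = re.split(f'(?:{date})', txt)

print(with_dates)
# ['This is ', '2022-11-11 14:23:19', ' and we are waiting for ', '2022/11/12', '.']
print(without_dates)
# ['This is ', ' and we are waiting for ', '.']
```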


## Our Email

Our email above contains dates in several formats:
* Wed, Jun 1, 2011, 9:39 PM
* 01/06/2011, at 7:57 PM
* 1 June 2011 13:43

Each of these three needs a different pattern, so finding them all is a little tricky and takes more work.

### For Jun 1, 2011, 9:39 PM


In [117]:
pattern = r'([0-3]?[0-9], \d{4}, [0-2]?[0-9]:[0-5][0-9] [AaPp][Mm])' 
txt = 'This is Wed, Jun 1, 2011, 9:29 AM and Wed, Jun 1, 2011, 19:39 PM'
print(re.findall(pattern, txt, re.DOTALL))


['1, 2011, 9:29 AM', '1, 2011, 19:39 PM']


### For 01/06/2011, at 7:57 PM

In [121]:
# day/month/year; the day and month parts are loose (no range validation)
pattern = r'([0-3]?[0-9]/[0-1]?[0-9]/\d{4}, at [0-2]?[0-9]:[0-5][0-9] [AaPp][Mm])'
txt = 'This is 01/06/2011, at 7:57 PM and 01/06/2011, at 19:57 PM'
print(re.findall(pattern, txt, re.DOTALL))


['01/06/2011, at 7:57 PM', '01/06/2011, at 19:57 PM']


### For 1 June 2011 13:43

In [142]:
# [0-3]?[0-9] accepts one- or two-digit days such as 1, 5 or 25
pattern = r'([0-3]?[0-9] \w{3,} \d{4} [0-2]?[0-9]:[0-5][0-9])'
txt = 'This is 1 June 2011 13:43'
print(re.findall(pattern, txt, re.DOTALL))


['1 June 2011 13:43']
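
The three patterns can also be combined into one compiled regex with alternation. The following is a rough sketch: `DATE_PATTERN` and `sample` are names made up here, the patterns are loosened slightly (for example `\d{1,2}` for days and hours), and month and weekday names are matched as plain letters rather than validated.

```python
import re

# One alternative per format. In VERBOSE mode literal spaces are written
# as "\ " and everything after "#" on a line is a comment.
DATE_PATTERN = re.compile(
    r"""
      [A-Za-z]{3},\ [A-Za-z]{3,}\ \d{1,2},\ \d{4},\ \d{1,2}:\d{2}\ [AaPp][Mm]  # Wed, Jun 1, 2011, 9:39 PM
    | \d{1,2}/\d{1,2}/\d{4},\ at\ \d{1,2}:\d{2}\ [AaPp][Mm]                    # 01/06/2011, at 7:57 PM
    | \d{1,2}\ [A-Za-z]{3,}\ \d{4}\ \d{1,2}:\d{2}                              # 1 June 2011 13:43
    """,
    re.VERBOSE,
)

# Shortened stand-in for the `eml` string (an assumption for illustration).
sample = (
    "John Doe <john@doe.org>\n"
    "Wed, Jun 1, 2011, 9:39 PM\n"
    'On 01/06/2011, at 7:57 PM, "Emma" <emma@thompson.com> wrote:\n'
    "On 1 June 2011 13:43, Bucky Hallam <bucky@barnes.com> wrote:\n"
)

found = [m.group(0) for m in DATE_PATTERN.finditer(sample)]
print(found)
```

Even combined like this, the regex only knows the three formats it was written for, so a fourth mail client would need a fourth alternative.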


But writing a pattern for each format is not a good solution, and there is no golden pattern that covers them all. However, there are some Python packages that can help in these cases.

## Using `dateutil`
If this package is not installed, install it with `pip install python-dateutil`.


In [143]:
from dateutil.parser import parse

By simply calling the `parse` function, we can get a datetime.

In [144]:
parse('This is 1 June 2011 13:43', fuzzy_with_tokens=True)

(datetime.datetime(2011, 6, 1, 13, 43), ('This is ', ' '))

But this does not always work. For example, it fails on the strings below, each of which contains two dates:

In [147]:
parse('This is Wed, Jun 1, 2011, 9:29 AM and Wed, Jun 1, 2011, 19:39 PM', fuzzy_with_tokens=True)

ParserError: Unknown string format: This is Wed, Jun 1, 2011, 9:29 AM and Wed, Jun 1, 2011, 19:39 PM

In [146]:
parse('This is 01/06/2011, at 7:57 PM and 01/06/2011, at 19:57 PM', fuzzy_with_tokens=True)

ParserError: Unknown string format: This is 01/06/2011, at 7:57 PM and 01/06/2011, at 19:57 PM

## Using `dateparser`

I found this package to be more effective than `dateutil`. Please install it using `pip install dateparser`.

In [148]:
from dateparser.search import search_dates

search_dates(eml)

[('Wed, Jun 1, 2011, 9:39 PM', datetime.datetime(2011, 6, 1, 21, 39)),
 ('On 01/06/2011, at 7:57 PM', datetime.datetime(2011, 1, 6, 19, 57)),
 ('On 1 June 2011 13:43', datetime.datetime(2011, 6, 1, 13, 43))]

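We can see that it found all the datetimes, and it returns them as native Python `datetime` objects together with the matched text. Isn't it awesome? One caveat: `01/06/2011` was read month-first, as 6 January, although the rest of the thread suggests 1 June.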
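
Numeric dates such as `01/06/2011` are ambiguous, because the same string is a valid date both day-first and month-first. The sketch below uses only the standard library's `datetime.strptime` to show the two readings. The `raw` string is a simplified version of the one in the email, and `%p` assumes an English locale.

```python
from datetime import datetime

raw = "01/06/2011 7:57 PM"

# Same text, two different format strings.
day_first = datetime.strptime(raw, "%d/%m/%Y %I:%M %p")
month_first = datetime.strptime(raw, "%m/%d/%Y %I:%M %p")

print(day_first)    # 2011-06-01 19:57:00
print(month_first)  # 2011-01-06 19:57:00
```

`dateparser` documents a `DATE_ORDER` setting for choosing between these readings. It is worth checking the package documentation if your emails use day-first dates.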

## Using `datefinder`

This is another package that can find dates in text. Please install it using `pip install datefinder`.

In [149]:
!pip install datefinder

Collecting datefinder
  Downloading datefinder-0.7.3-py2.py3-none-any.whl (10 kB)
Installing collected packages: datefinder
Successfully installed datefinder-0.7.3


In [151]:
from datefinder import find_dates

list(find_dates(eml))

[datetime.datetime(2011, 6, 1, 21, 39),
 datetime.datetime(2011, 1, 6, 19, 57),
 datetime.datetime(2011, 6, 1, 13, 43)]

This also gets the job done, but as called here it returns only the parsed `datetime` objects, whereas we care more about the original date strings.
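
To show why the original date strings matter, here is a minimal sketch that cuts a thread at the matched strings. The `matched` list is copied from the `search_dates` output shown earlier, `thread` is a shortened stand-in for `eml`, and `split_at_matches` is a hypothetical helper written for this post, not part of any package.

```python
# Shortened stand-in for the `eml` string (an assumption for illustration).
thread = (
    "John Doe <john@doe.org>\n"
    "Wed, Jun 1, 2011, 9:39 PM\n"
    "Lorem\n"
    'On 01/06/2011, at 7:57 PM, "Emma" <emma@thompson.com> wrote:\n'
    "Lorem Ipsum?\n"
    "On 1 June 2011 13:43, Bucky Hallam <bucky@barnes.com> wrote:\n"
    "Lorem Ipsum is Dorem."
)
matched = [
    "Wed, Jun 1, 2011, 9:39 PM",
    "On 01/06/2011, at 7:57 PM",
    "On 1 June 2011 13:43",
]


def split_at_matches(text, matches):
    """Cut `text` so that each piece starts at one matched date string."""
    starts = []
    cursor = 0
    for match in matches:
        index = text.find(match, cursor)
        if index == -1:
            continue  # skip strings that do not occur verbatim
        starts.append(index)
        cursor = index + len(match)
    bounds = starts + [len(text)]
    # Anything before the first match (here the sender line) is dropped.
    return [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]


messages = split_at_matches(thread, matched)
for message in messages:
    print(repr(message))
```

This only works when the package hands back the text exactly as it appears in the email, which is the main reason the matched strings are more useful here than the parsed datetimes alone.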

That's all for now. For my use case, I found `dateparser` to be the best. What is yours?