## Regular Expressions

### 1. Processing free-text


In [6]:
text1 = '"Ethics are built right into the ideals and objectives of the United Nations" \
#UNSG @ NY Society for Ethical Culture bit.ly/2guVelr @UN @UN_Women'
text2 = text1.split(' ')
text2

['"Ethics',
 'are',
 'built',
 'right',
 'into',
 'the',
 'ideals',
 'and',
 'objectives',
 'of',
 'the',
 'United',
 'Nations"',
 '#UNSG',
 '@',
 'NY',
 'Society',
 'for',
 'Ethical',
 'Culture',
 'bit.ly/2guVelr',
 '@UN',
 '@UN_Women']

In [7]:
# Quick quiz: how would you extract the hashtags from the following tweet?
tweet = "@nltk Text analysis is awesome! #regex #pandas #python"

tweet2 = tweet.split(" ")
[w for w in tweet2 if w.startswith("#")]

['#regex', '#pandas', '#python']

##### - Finding specific words
- Hashtags
    [w for w in text2 if w.startswith('#')]
- Callouts
    [w for w in text2 if w.startswith('@')]

However, a callout is more than a token that begins with '@': the check above also keeps a bare '@'.

We only want tokens that have something after the '@' (letters, digits, or symbols such as the underscore).

In [9]:
[w for w in text2 if w.startswith('@')]

# the bare "@" token is not a callout, so we don't want it

['@', '@UN', '@UN_Women']

### 2. Regular Expressions

In [10]:
# So we use a regular expression
import re
[w for w in text2 if re.search(r'@[A-Za-z0-9_]+', w)]

['@UN', '@UN_Women']

##### - Parsing the callout regular expression

@[A-Za-z0-9_]+   means

1) starts with @

2) followed by any letter, digit, or underscore

3) that repeats at least once, with no upper limit
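The three parts of the pattern can be checked one at a time. The snippet below is a minimal sketch; the `tokens` list and the sentence passed to `findall` are made-up samples that mirror the tweet above, not data from the notebook.

```python
import re

# Hypothetical sample tokens: a bare '@' plus two real callouts.
tokens = ['#UNSG', '@', 'NY', '@UN', '@UN_Women']

callout = re.compile(r'@[A-Za-z0-9_]+')

# The bare '@' fails because '+' demands at least one character after it.
callouts = [t for t in tokens if callout.search(t)]
print(callouts)  # ['@UN', '@UN_Women']

# findall can also pull callouts straight out of an unsplit string.
found = callout.findall('Talk at @ NY with @UN and @UN_Women')
print(found)  # ['@UN', '@UN_Women']
```

Using `findall` on the raw string avoids the separate `split` step when only the matches are needed.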

##### - Meta-characters : Character matches

.      : wildcard, matches any single character (except a newline by default)<br>
^      : start of a string<br>
$      : end of a string<br>
[]     : matches one of the set of characters within []<br>
[a-z]  : matches one of the range of characters a, b, ..., z<br>
[^abc] : matches a character **that is not a, b, or c**<br>
a|b    : matches either a or b, where a and b are strings<br>
()     : grouping, sets the scope of operators such as | and repetitions<br>
\      : escape character, for special sequences (\n, \t..) and for matching metacharacters literally (\.)
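Each of the character-match operators above can be tried on a short string. This is only an illustrative sketch; the sample strings are invented for the example.

```python
import re

sample = 'cat bat rat'  # hypothetical sample text

dot = re.findall(r'.at', sample)           # '.' matches any one character
start = re.findall(r'^.at', sample)        # '^' anchors at the start of the string
end = re.findall(r'.at$', sample)          # '$' anchors at the end of the string
char_set = re.findall(r'[cb]at', sample)   # one of 'c' or 'b'
negated = re.findall(r'[^cb]at', sample)   # any character except 'c' or 'b'
either = re.findall(r'cat|rat', sample)    # either alternative
escaped = re.findall(r'\.', 'bit.ly')      # a literal dot, not the wildcard

print(dot)       # ['cat', 'bat', 'rat']
print(start)     # ['cat']
print(end)       # ['rat']
print(char_set)  # ['cat', 'bat']
print(negated)   # ['rat']
print(either)    # ['cat', 'rat']
print(escaped)   # ['.']
```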



##### - Meta-characters : Character Symbols

\b    : matches word boundary<br>
\d    : Any digit, equivalent to [0-9]<br>
\D    : Any non-digit, equivalent to [^0-9]<br>
\s    : Any whitespace, equivalent to [ \t\n\r\f\v]<br>
\S    : Any non-whitespace, equivalent to [^ \t\n\r\f\v]<br>
\w    : Alphanumeric character or underscore, equivalent to [a-zA-Z0-9_] for ASCII text<br>
\W    : Non-alphanumeric character, equivalent to [^a-zA-Z0-9_] for ASCII text<br>
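A short sketch of the character symbols in use follows. The sample strings are assumptions chosen for illustration.

```python
import re

sample = 'Room 42 opens at 9am'  # hypothetical sample text

digits = re.findall(r'\d', sample)         # single digits
numbers = re.findall(r'\d+', sample)       # runs of digits
words = re.findall(r'\w+', sample)         # runs of word characters
pieces = re.split(r'\s+', 'a  b\tc\nd')    # split on any whitespace run
symbols = re.findall(r'\W+', 'a, b!')      # runs of non-word characters

# \b only matches at a word boundary, so 'at' inside 'cat' or 'chat' is skipped.
whole_word = re.findall(r'\bat\b', 'at a cat chat')
anywhere = re.findall(r'at', 'at a cat chat')

print(digits)      # ['4', '2', '9']
print(numbers)     # ['42', '9']
print(words)       # ['Room', '42', 'opens', 'at', '9am']
print(pieces)      # ['a', 'b', 'c', 'd']
print(symbols)     # [', ', '!']
print(whole_word)  # ['at']
print(anywhere)    # ['at', 'at', 'at']
```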

##### - Meta-characters : Repetitions

\*    : matches zero or more occurrences<br>
\+    : matches one or more occurrences<br>
?    : matches zero or one occurrences<br>
{n}  : exactly n repetitions, n>=0<br>
{n,} : at least n repetitions<br>
{,n} : at most n repetitions<br>
{m,n}: at least m and at most n repetitions<br>
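The repetition operators are easiest to compare side by side on the same input. The sketch below uses an invented string of tokens with zero to four leading 'a' characters; the `\b` anchors are there so that each pattern has to match a whole token.

```python
import re

sample = 'b ab aab aaab aaaab'  # hypothetical sample text

zero_or_more = re.findall(r'\ba*b\b', sample)     # any number of 'a', including none
one_or_more = re.findall(r'\ba+b\b', sample)      # at least one 'a'
optional = re.findall(r'\ba?b\b', sample)         # zero or one 'a'
exactly_two = re.findall(r'\ba{2}b\b', sample)    # exactly two 'a'
at_least_two = re.findall(r'\ba{2,}b\b', sample)  # two or more 'a'
at_most_two = re.findall(r'\ba{,2}b\b', sample)   # zero to two 'a'
two_to_three = re.findall(r'\ba{2,3}b\b', sample) # two or three 'a'

print(zero_or_more)  # ['b', 'ab', 'aab', 'aaab', 'aaaab']
print(one_or_more)   # ['ab', 'aab', 'aaab', 'aaaab']
print(optional)      # ['b', 'ab']
print(exactly_two)   # ['aab']
print(at_least_two)  # ['aab', 'aaab', 'aaaab']
print(at_most_two)   # ['b', 'ab', 'aab']
print(two_to_three)  # ['aab', 'aaab']
```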

In [31]:
# again, the callout regex
print([w for w in text2 if re.search(r'@[A-Za-z0-9_]+', w)])
print([w for w in text2 if re.search(r'@\w+', w)])  # same result; raw string keeps \w intact

['@UN', '@UN_Women']
['@UN', '@UN_Women']


In [15]:
# Finding specific characters
text3 = 'ouagadougou'
print(re.findall(r'[aeiou]', text3))   # all the vowels
print(re.findall(r'[^aeiou]', text3))  # every character that is not a vowel (here, the consonants)

['o', 'u', 'a', 'a', 'o', 'u', 'o', 'u']
['g', 'd', 'g']


### 3. Regex for Dates

Date variations for 23rd October 2002
- 23-10-2002
- 23/10/2002
- 23/10/02
- 10/23/2002
- 23 Oct 2002
- 23 October 2002
- Oct 23, 2002
- October 23, 2002

A first attempt at a pattern for the numeric forms: \d{2}[/-]\d{2}[/-]\d{4}

In [24]:
dateStr = '23-9-2001\n23-10-2002\n23/10/2002\n23/10/2002\n10/23/2002\n23 Oct 2002\n23 October 2002\nOct 23, 2002\nOctober 23, 2002\n'
dateStr

print(re.findall(r'\d{2}[/-]\d{2}[/-]\d{4}', dateStr))  # misses '23-9-2001' (one-digit month)
print(re.findall(r'\d{2}[/-]\d{2}[/-]\d{2,4}', dateStr)) # also allows two-digit years such as 23/10/02; dateStr has none, so the result is unchanged
print(re.findall(r'\d{1,2}[/-]\d{1,2}[/-]\d{2,4}', dateStr)) # also allows one-digit days and months

['23-10-2002', '23/10/2002', '23/10/2002', '10/23/2002']
['23-10-2002', '23/10/2002', '23/10/2002', '10/23/2002']
['23-9-2001', '23-10-2002', '23/10/2002', '23/10/2002', '10/23/2002']
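Because dateStr contains no two-digit year, the effect of \d{2,4} is not visible in the output above. The sketch below repeats the three patterns on a hypothetical string `dates` that does include the 23/10/02 variant from the list of date variations.

```python
import re

# Hypothetical sample: like dateStr's numeric part, but with a two-digit year.
dates = '23-9-2001\n23-10-2002\n23/10/2002\n23/10/02\n10/23/2002\n'

strict = re.findall(r'\d{2}[/-]\d{2}[/-]\d{4}', dates)
short_year = re.findall(r'\d{2}[/-]\d{2}[/-]\d{2,4}', dates)
flexible = re.findall(r'\d{1,2}[/-]\d{1,2}[/-]\d{2,4}', dates)

print(strict)      # ['23-10-2002', '23/10/2002', '10/23/2002']
print(short_year)  # ['23-10-2002', '23/10/2002', '23/10/02', '10/23/2002']
print(flexible)    # ['23-9-2001', '23-10-2002', '23/10/2002', '23/10/02', '10/23/2002']
```

Each relaxation adds exactly one previously missed form: \d{2,4} picks up the short year, and \d{1,2} picks up the one-digit month.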


In [35]:
# Now with spelled-out months

print(re.findall(r'\d{2} (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{4}', dateStr))
# What happened? When the pattern contains a capturing group (),
# findall returns only the captured part, not the whole match.

print(re.findall(r'\d{2} (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{4}', dateStr))
# (?:...) groups without capturing, so findall returns the whole match

print(re.findall(r'\d{2} (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* \d{4}', dateStr))
# [a-z]* also accepts full month names such as 'October'

print(re.findall(r'(?:\d{2} )?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* (?:\d{2}, )?\d{4}', dateStr))
# the optional parts cover both 'day month year' and 'month day, year'

print(re.findall(r'(?:\d{1,2} )?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* (?:\d{1,2}, )?\d{4}', dateStr))
# \d{1,2} also accepts one-digit days

['Oct']
['23 Oct 2002']
['23 Oct 2002', '23 October 2002']
['23 Oct 2002', '23 October 2002', 'Oct 23, 2002', 'October 23, 2002']
['23 Oct 2002', '23 October 2002', 'Oct 23, 2002', 'October 23, 2002']
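When both the whole date and the month are needed, one option is `re.finditer`, which yields match objects instead of strings. The following is a sketch under the assumption that the sample sentence in `text` is representative; it is not taken from the notebook.

```python
import re

text = 'Due 23 Oct 2002, revised 5 November 2002.'  # hypothetical sample text
month = r'Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec'

# Capturing group: findall returns only what the group matched.
captured = re.findall(r'\d{1,2} (' + month + r')[a-z]* \d{4}', text)

# Non-capturing group: findall returns the whole match.
whole = re.findall(r'\d{1,2} (?:' + month + r')[a-z]* \d{4}', text)

# finditer exposes both: group(0) is the whole match, group(1) the capture.
pairs = [(m.group(0), m.group(1))
         for m in re.finditer(r'\d{1,2} (' + month + r')[a-z]* \d{4}', text)]

print(captured)  # ['Oct', 'Nov']
print(whole)     # ['23 Oct 2002', '5 November 2002']
print(pairs)     # [('23 Oct 2002', 'Oct'), ('5 November 2002', 'Nov')]
```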


## Regex with Pandas and Named Groups

In [36]:
import pandas as pd

time_sentences = ["Monday: The doctor's appointment is at 2:45pm.", 
                  "Tuesday: The dentist's appointment is at 11:30 am.",
                  "Wednesday: At 7:00pm, there is a basketball game!",
                  "Thursday: Be back home by 11:15 pm at the latest.",
                  "Friday: Take the train at 08:10 am, arrive at 09:00am."]

df = pd.DataFrame(time_sentences, columns=['text'])
df

| | text |
|---|---|
| 0 | Monday: The doctor's appointment is at 2:45pm. |
| 1 | Tuesday: The dentist's appointment is at 11:30... |
| 2 | Wednesday: At 7:00pm, there is a basketball game! |
| 3 | Thursday: Be back home by 11:15 pm at the latest. |
| 4 | Friday: Take the train at 08:10 am, arrive at ... |


In [39]:
# find the number of characters for each string in df['text']
df['text'].str.len()  # Series -> String -> Length

0    46
1    50
2    49
3    49
4    54
Name: text, dtype: int64

In [44]:
# find the number of tokens for each string in df['text']
df['text'].str.split().str.len()

0     7
1     8
2     8
3    10
4    10
Name: text, dtype: int64

In [45]:
# find which entries contain the word 'appointment'
df['text'].str.contains('appointment')

0     True
1     True
2    False
3    False
4    False
Name: text, dtype: bool

In [46]:
# find how many times a digit occurs in each string
df['text'].str.count(r'\d')

0    3
1    4
2    3
3    4
4    8
Name: text, dtype: int64

In [49]:
# find all occurrences of the digits
print(df['text'].str.len())
df['text'].str.findall(r'\d')

0    46
1    50
2    49
3    49
4    54
Name: text, dtype: int64


0                   [2, 4, 5]
1                [1, 1, 3, 0]
2                   [7, 0, 0]
3                [1, 1, 1, 5]
4    [0, 8, 1, 0, 0, 9, 0, 0]
Name: text, dtype: object

In [50]:
# group and find the hours and minutes
df['text'].str.findall(r'(\d?\d):(\d\d)')

0               [(2, 45)]
1              [(11, 30)]
2               [(7, 00)]
3              [(11, 15)]
4    [(08, 10), (09, 00)]
Name: text, dtype: object

In [51]:
# replace weekdays with '???' (regex=True is needed in pandas 2.0+, where the default is a literal match)
df['text'].str.replace(r'\w+day\b', '???', regex=True)

0          ???: The doctor's appointment is at 2:45pm.
1       ???: The dentist's appointment is at 11:30 am.
2          ???: At 7:00pm, there is a basketball game!
3         ???: Be back home by 11:15 pm at the latest.
4    ???: Take the train at 08:10 am, arrive at 09:...
Name: text, dtype: object

In [52]:
# replace weekdays with 3-letter abbreviations (a callable replacement requires regex=True)
df['text'].str.replace(r'(\w+day\b)', lambda m: m.group(1)[:3], regex=True)

0          Mon: The doctor's appointment is at 2:45pm.
1       Tue: The dentist's appointment is at 11:30 am.
2          Wed: At 7:00pm, there is a basketball game!
3         Thu: Be back home by 11:15 pm at the latest.
4    Fri: Take the train at 08:10 am, arrive at 09:...
Name: text, dtype: object

In [54]:
# create new columns from first match of extracted groups
df['text'].str.extract(r'(\d?\d):(\d\d)')

# to get every match rather than only the first, use str.extractall

| | 0 | 1 |
|---|---|---|
| 0 | 2 | 45 |
| 1 | 11 | 30 |
| 2 | 7 | 00 |
| 3 | 11 | 15 |
| 4 | 08 | 10 |


In [55]:
# extract the entire time, the hours, the minutes, and the period
df['text'].str.extractall(r'((\d?\d):(\d\d) ?([ap]m))')

| | match | 0 | 1 | 2 | 3 |
|---|---|---|---|---|---|
| 0 | 0 | 2:45pm | 2 | 45 | pm |
| 1 | 0 | 11:30 am | 11 | 30 | am |
| 2 | 0 | 7:00pm | 7 | 00 | pm |
| 3 | 0 | 11:15 pm | 11 | 15 | pm |
| 4 | 0 | 08:10 am | 08 | 10 | am |
| 4 | 1 | 09:00am | 09 | 00 | am |


In [56]:
# extract the entire time, the hours, the minutes, and the period with group names
df['text'].str.extractall(r'(?P<time>(?P<hour>\d?\d):(?P<minute>\d\d) ?(?P<period>[ap]m))')

| | match | time | hour | minute | period |
|---|---|---|---|---|---|
| 0 | 0 | 2:45pm | 2 | 45 | pm |
| 1 | 0 | 11:30 am | 11 | 30 | am |
| 2 | 0 | 7:00pm | 7 | 00 | pm |
| 3 | 0 | 11:15 pm | 11 | 15 | pm |
| 4 | 0 | 08:10 am | 08 | 10 | am |
| 4 | 1 | 09:00am | 09 | 00 | am |
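Named groups behave the same way in the plain `re` module, where `groupdict()` returns the captures keyed by name. The sketch below reuses the hour, minute, and period groups; the helper `to_24h` is a hypothetical addition for illustration and not part of the notebook.

```python
import re

TIME = re.compile(r'(?P<hour>\d?\d):(?P<minute>\d\d) ?(?P<period>[ap]m)')

sentence = 'Friday: Take the train at 08:10 am, arrive at 09:00am.'

# One dict per match, keyed by group name. All values are strings.
parts = [m.groupdict() for m in TIME.finditer(sentence)]
print(parts)
# [{'hour': '08', 'minute': '10', 'period': 'am'},
#  {'hour': '09', 'minute': '00', 'period': 'am'}]


def to_24h(hour, minute, period):
    """Hypothetical helper: convert captured 12-hour parts to 'HH:MM'."""
    h = int(hour) % 12 + (12 if period == 'pm' else 0)
    return f'{h:02d}:{minute}'


converted = [to_24h(**p) for p in parts]
print(converted)                 # ['08:10', '09:00']
print(to_24h('2', '45', 'pm'))   # 14:45
```

Because the group names match the helper's parameter names, each dict can be unpacked directly with `**`.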
