## Regular Expressions

### Processing free-text


a = '* 04/20/2009; 04/20/09; 4/20/09; 4/3/09\
* Mar-20-2009; Mar 20, 2009; March 20, 2009;  Mar. 20, 2009; Mar 20 2009;\
* 20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009; Feb-20-2009;\
* Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009\
* Feb 2009; Sep 2009; Oct 2010\
* 6/2008; 12/2009\
* 2009; 2010'

In [86]:
text1 = '"Ethics are built right into the ideals and objectives of the United Nations" \
#UNSG @ NY Society for Ethical Culture bit.ly/2guVelr @UN @UN_Women'
text2 = text1.split(' ')
text2

['"Ethics',
 'are',
 'built',
 'right',
 'into',
 'the',
 'ideals',
 'and',
 'objectives',
 'of',
 'the',
 'United',
 'Nations"',
 '#UNSG',
 '@',
 'NY',
 'Society',
 'for',
 'Ethical',
 'Culture',
 'bit.ly/2guVelr',
 '@UN',
 '@UN_Women']

In [87]:
# Quick quiz: how would you extract hashtags from the following tweet?
tweet = "@nltk Text analysis is awesome! #regex #pandas #python"

tweet2 = tweet.split(" ")
[w for w in tweet2 if w.startswith("#")]

['#regex', '#pandas', '#python']

##### - Finding Specific words
- Hashtags
    [w for w in text2 if w.startswith('#')]
- Callouts
    [w for w in text2 if w.startswith('@')]

However, `startswith('@')` also keeps a token that is nothing but a bare '@'.

For callouts we want only tokens where '@' is followed by at least one more character (letters, digits, or underscores).

In [88]:
[w for w in text2 if w.startswith('@')]

# we don't want "@" !!

['@', '@UN', '@UN_Women']

### 1. Simple Patterns

Most letters and characters will simply match themselves. 

For example, the regular expression test will match the string test exactly. (You can enable a case-insensitive mode that would let this RE match Test or TEST as well; more about this later.)

There are exceptions to this rule; some characters are special metacharacters, and don’t match themselves. 

Instead, they **signal that some out-of-the-ordinary thing should be matched, or they affect other portions of the RE by repeating them or changing their meaning.**

In [89]:
# So we use regular expression
import re
[w for w in text2 if re.search('@[A-Za-z0-9_]+', w)]

['@UN', '@UN_Women']

##### - Parsing the callout regular expression

@[A-Za-z0-9_]+   means

1) starts with @

2) followed by any alphabet, digit, or underscore

3) that repeats at least once, but any number of times

##### - Meta-characters : Character matches

.      : wildcard, matches any single character except a newline<br>
^      : start of a string<br>
$      : end of a string<br>
[]     : matches one of the set of characters within []<br>
[a-z]  : matches one of the range of characters a, b, ..., z<br>
[^abc] : matches a character **that is not a, b, or c**<br>
a|b    : alternation, matches either a or b, where a and b are patterns (to match a literal |, use \| or [|])<br>
()     : scoping for operators<br>
\      : escape character; makes a metacharacter literal (\., \|) or introduces a special sequence (\n, \t, \d ...)
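A minimal sketch of the character-matching metacharacters listed above. The sample strings and variable names are made up for illustration; `(?:...)` is used for the alternation so that `findall` returns the whole match rather than only the group.

```python
import re

# '.' matches any single character except a newline
dot_hits = re.findall(r'c.t', 'cat cot c\nt cut')

# '^' and '$' anchor the pattern to the start and end of the string
starts_with_hello = bool(re.search(r'^hello', 'hello world'))
ends_with_hello = bool(re.search(r'hello$', 'hello world'))

# a set, a range, and a negated set
set_hits = re.findall(r'[bc]at', 'bat cat hat')
range_hits = re.findall(r'[a-c]', 'abcxyz')
negated_hits = re.findall(r'[^abc]', 'abcxyz')

# alternation, with parentheses limiting its scope to the middle letter
alt_hits = re.findall(r'gr(?:a|e)y', 'gray grey groy')

# a backslash turns the metacharacter '|' into a literal
pipe_hits = re.findall(r'\|', 'a|b|c')

print(dot_hits)        # ['cat', 'cot', 'cut']
print(starts_with_hello, ends_with_hello)  # True False
print(set_hits, range_hits, negated_hits)
print(alt_hits, pipe_hits)
```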



##### - Meta-characters : Character Symbols

\b    : matches word boundary<br>
\d    : Any digit, equivalent to [0-9]<br>
\D    : Any non-digit, equivalent to [^0-9]<br>
\s    : Any whitespace, equivalent to [ \t\n\r\f\v]<br>
\S    : Any non-whitespace, equivalent to [^ \t\n\r\f\v]<br>
\w    : Alphanumeric character, equivalent to [a-zA-Z0-9_]<br>
\W    : Non-Alphanumeric character, equivalent to [^a-zA-Z0-9_]<br>
\A    : Matches only at the start of the string. Outside MULTILINE mode this is the same as ^; in MULTILINE mode, ^ also matches immediately after each newline, while \A still matches only at the very beginning<br>
\Z    : Matches only at the end of the string
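The difference between `^`/`$` and `\A`/`\Z` only shows up in MULTILINE mode. A small sketch, using a made-up two-line string:

```python
import re

text = 'first line\nsecond line'

# Without MULTILINE, '^' and '\A' both match only at the very start
caret_default = re.findall(r'^\w+', text)
start_default = re.findall(r'\A\w+', text)

# With MULTILINE, '^' also matches right after each newline; '\A' does not
caret_multi = re.findall(r'^\w+', text, re.MULTILINE)
start_multi = re.findall(r'\A\w+', text, re.MULTILINE)

# The same holds for '$' versus '\Z' at the end
dollar_multi = re.findall(r'\w+$', text, re.MULTILINE)
end_multi = re.findall(r'\w+\Z', text, re.MULTILINE)

print(caret_default, start_default)  # ['first'] ['first']
print(caret_multi, start_multi)      # ['first', 'second'] ['first']
print(dollar_multi, end_multi)       # ['line', 'line'] ['line']
```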

In [90]:
# \b example
# following example matches class only when it's a complete word
# it won't match when it's contained inside another word

p = re.compile(r'\bclass\b')
print(p.search('no class at all'))
print(p.search('the declassified algorithm'))
print(p.search('one subclass is'))

<_sre.SRE_Match object; span=(3, 8), match='class'>
None
None


##### - Meta-characters : Repetitions

\*    : matches zero or more occurrences, equivalent to {0,}<br>
\+    : matches one or more occurrences, equivalent to {1,}<br>
?    : matches zero or one occurrences, equivalent to {0,1} <br>
{n}  : exactly n repetitions, n>=0<br>
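{n,} : at least n repetitions<br>
{,n} : at most n repetitions<br>
{m,n}: at least m, at most n repetitions<br>

\* Note: `*` is greedy. The engine first tries the largest possible number of repetitions and then backs off one repetition at a time until the rest of the pattern matches.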
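To make the repetition operators concrete, here is a small sketch that applies each one to the same made-up string, so the results can be compared side by side.

```python
import re

s = 'a ab abb abbb'

star = re.findall(r'ab*', s)         # zero or more 'b'
plus = re.findall(r'ab+', s)         # one or more 'b'
optional = re.findall(r'ab?', s)     # zero or one 'b'
exactly_two = re.findall(r'ab{2}', s)
at_least_two = re.findall(r'ab{2,}', s)
at_most_one = re.findall(r'ab{,1}', s)   # same as 'ab?'
one_or_two = re.findall(r'ab{1,2}', s)

print(star)          # ['a', 'ab', 'abb', 'abbb']
print(plus)          # ['ab', 'abb', 'abbb']
print(optional)      # ['a', 'ab', 'ab', 'ab']
print(exactly_two)   # ['abb', 'abb']
print(at_least_two)  # ['abb', 'abbb']
print(at_most_one)   # ['a', 'ab', 'ab', 'ab']
print(one_or_two)    # ['ab', 'abb', 'abb']
```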


In [91]:
string = 'abcbd'

re.search('a[bcd]*b', string)

<_sre.SRE_Match object; span=(0, 4), match='abcb'>

In [92]:
# again, callout regex
print([w for w in text2 if re.search('@[A-Za-z0-9_]+',w)])
print([w for w in text2 if re.search(r'@\w+', w)])  # same as above

['@UN', '@UN_Women']
['@UN', '@UN_Women']


In [93]:
# Finding specific characters
text3 = 'ouagadougou'
print(re.findall(r'[aeiou]', text3))  #all the vowels
print(re.findall(r'[^aeiou]', text3)) #all the consonants

['o', 'u', 'a', 'a', 'o', 'u', 'o', 'u']
['g', 'd', 'g']


### 2. Using Regular Expressions

##### 1) Compiling Regular Expressions

Regular expressions are compiled into pattern objects, which have methods for various operations such as searching for pattern matches or performing string substitutions.

In [94]:
p = re.compile('ab*')
p

re.compile(r'ab*', re.UNICODE)

In [95]:
p = re.compile('ab*', re.IGNORECASE)
p

re.compile(r'ab*', re.IGNORECASE|re.UNICODE)

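Regular expressions use the backslash heavily, and Python string literals also treat the backslash as an escape character, so patterns written as ordinary literals need doubled backslashes. The solution is to use Python's raw string notation for regular expressions; 
backslashes are not handled in any special way in a string literal prefixed with 'r', so r"\n" is a two-character string containing '\' and 'n', while "\n" is a one-character string containing a newline. Regular expressions will often be written in Python code using this raw string notation.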
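A short sketch of why raw strings matter in practice. The strings and variable names are illustrative only.

```python
import re

plain = "\n"    # one character: a newline
raw = r"\n"     # two characters: a backslash and 'n'
lengths = (len(plain), len(raw))

# In a normal literal, "\b" is a backspace character,
# so the word-boundary meaning never reaches the regex engine
no_raw = re.search("\bclass\b", "no class at all")
with_raw = re.search(r"\bclass\b", "no class at all")

# Matching one literal backslash
path = "C:\\temp"                    # the 7-character string C:\temp
doubled = re.search("\\\\", path)    # four backslashes in a normal literal
raw_form = re.search(r"\\", path)    # two backslashes in a raw literal

print(lengths)                           # (1, 2)
print(no_raw, with_raw.group())          # None class
print(doubled.span(), raw_form.span())   # (2, 3) (2, 3)
```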

##### 2) Functions

- match() : Determine if the RE matches at the beginning of the string.
- search(): Scan through a string, looking for any location where this RE matches.
- findall() : Find all substrings where the RE matches, and returns them as a list.
- finditer(): Find all substrings where the RE matches, and returns them as an iterator of match objects.

In [96]:
p = re.compile('[a-z]+')
p

re.compile(r'[a-z]+', re.UNICODE)

In [97]:
# match with empty string. It should return None
print(p.match(""))
print(p.findall(""))

None
[]


In [98]:
# In this case, match() will return a 'match' object.
m = p.match('tempo or down')
n = p.findall('tempo or down')
print(m)
print(n)

<_sre.SRE_Match object; span=(0, 5), match='tempo'>
['tempo', 'or', 'down']


In [3]:
# finditer
p = re.compile(r'\d+')
iterator = p.finditer('12 drummers drumming, 11 ... 10 ...')

for match in iterator:
    print(match.span())

(0, 2)
(22, 24)
(29, 31)


##### 3) Match Object Attributes

- group() : Return the string matched by RE
- start() : Return the starting position of the match
- end()   : Return the ending position of the match
- span()  : Return a tuple containing the (start,end) positions of the match

In [100]:
print(m.group())
print(m.start(), m.end())
print(m.span())

tempo
0 5
(0, 5)


##### 4) Module-Level Functions

You don’t have to create a pattern object and call its methods; 

the re module also provides top-level functions called **match(), search(), findall(), sub(), and so forth.** 

These functions take the same arguments as the corresponding pattern method with the RE string added as the first argument, and still return either None or a match object instance.
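As a rough illustration, the module-level functions give the same results as the methods of a compiled pattern; only the pattern moves into the first argument. The sample text reuses the string from the earlier cells.

```python
import re

text = 'tempo or down'
pattern = re.compile(r'[a-z]+')

# method on a compiled pattern vs. module-level function
via_method = pattern.match(text).group()
via_module = re.match(r'[a-z]+', text).group()

# match() only looks at the beginning; search() scans the whole string
match_or = re.match(r'or', text)
search_or = re.search(r'or', text)

all_words = re.findall(r'[a-z]+', text)
joined = re.sub(r'\s+', '-', text)

print(via_method, via_module)       # tempo tempo
print(match_or, search_or.span())   # None (6, 8)
print(all_words)                    # ['tempo', 'or', 'down']
print(joined)                       # tempo-or-down
```

If a pattern is used many times, compiling it once and reusing the pattern object keeps the code tidy; for one-off use the module-level form is shorter.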

<br>

##### 5) Compilation Flags

Compilation flags let you modify some aspects of how regular expressions work.

- ASCII, A  : Makes several escapes like \w, \b match only on ASCII characters
- DOTALL, S : Make '.' match any character, including newlines.
- IGNORECASE, I : Do case-insensitive matches
- LOCALE, L : Do a locale-aware match
- MULTILINE, M : Multi-line matching, affecting ^ and $.
- VERBOSE, X : Enable verbose REs, which can be organized more cleanly
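The flags above can be sketched as follows. The strings and the `time_re` pattern are made-up examples, not part of the original notebook.

```python
import re

text = 'Hello\nworld'

# IGNORECASE: case-insensitive matching
case_sensitive = re.findall(r'hello', text)
case_insensitive = re.findall(r'hello', text, re.IGNORECASE)

# DOTALL: '.' also matches a newline
dot_default = re.search(r'Hello.world', text)
dot_all = re.search(r'Hello.world', text, re.DOTALL)

# MULTILINE: '^' matches at the start of every line
line_starts = re.findall(r'^\w+', text, re.MULTILINE)

# ASCII: \w matches only [a-zA-Z0-9_]
ascii_words = re.findall(r'\w+', 'naïve café', re.ASCII)

# VERBOSE: whitespace and comments inside the pattern are ignored;
# flags are combined with |
time_re = re.compile(r"""
    (\d{1,2})   # hour
    :
    (\d{2})     # minute
    \s?         # optional space
    ([ap]m)     # period
""", re.VERBOSE | re.IGNORECASE)
time_parts = time_re.search('at 11:30 AM').groups()

print(case_sensitive, case_insensitive)  # [] ['Hello']
print(dot_default, dot_all.group() == text)
print(line_starts)                       # ['Hello', 'world']
print(ascii_words)                       # ['na', 've', 'caf']
print(time_parts)                        # ('11', '30', 'AM')
```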

### 3. Grouping

Frequently you need to obtain more information than just whether the RE matched or not.

Regular expressions are often used to dissect strings by writing a RE divided into several subgroups which match different components of interest.

**Groups are marked by the ( ) metacharacters.**

() have much the same meaning as they do in mathematical expressions: they group together the expressions contained inside them, and you can repeat the contents of a group with a repeating quantifier, such as `*`, `+`, `?`.

In [101]:
# For example,
p = re.compile('(ab)*')
m = p.match('ababababab')
print(m.span())
print(m.group(0))

(0, 10)
ababababab


**Groups are numbered starting with 0.**
Group 0 is always present; it's the whole RE, so match object methods all have group 0 as their default argument.

Subgroups are numbered from left to right, from 1 upward.

Groups can be nested; to determine the number, just count the opening parenthesis chars, going from left to right.

In [102]:
p = re.compile('(a)b')
m = p.match('abc')
print(m.group())
print(m.group(0))

ab
ab


In [103]:
p = re.compile('(a(b)c)d')
m = p.match('abcd')

print(m.group())
print(m.group(0))
print(m.group(1))
print(m.group(2))

abcd
abcd
abc
b


In [104]:
# the groups() method returns a tuple containing the strings for all the subgroups,
# from 1 up to however many there are.
m.groups()

('abc', 'b')

### 4. Non-capturing and Named Groups

Elaborate REs may use many groups, both to capture substrings of interest, and to group and structure the RE itself.

In complex REs, it becomes difficult to keep track of the group numbers. There are two features which help with this problem. 

Sometimes, you'll want to use a group to denote a part of a regular expression, but aren't interested in retrieving the group's contents.

**You can make this fact explicit by using a non-capturing group (?:...), where you can replace the ... with any other regular expressions.**

In [105]:
m = re.match("([abc])+", "abc")
print(m.groups())
m = re.match("(?:[abc])+", "abc")
m.groups()

('c',)


()

In [106]:
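# A named group (?P<name>...) captures like a regular group,
# but its contents can also be retrieved by name
p = re.compile(r'(?P<word>\b\w+\b)')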
m = p.search( '(((( Lots of punctuation ))))' )
print(m.group('word'))
print(m.group(1))

Lots
Lots


### 5. Regex for Dates

Date variations for 23rd October 2002
- 23-10-2002
- 23/10/2002
- 23/10/02
- 10/23/2002
- 23 Oct 2002
- 23 October 2002
- Oct 23, 2002
- October 23, 2002

A first attempt at a pattern for the numeric forms: \d{2}[/-]\d{2}[/-]\d{4}

In [107]:
dateStr = '23-9-2001\n23-10-2002\n23/10/2002\n23/10/2002\n10/23/2002\n23 Oct 2002\n23 October 2002\nOct 23, 2002\nOctober 23, 2002\n'
dateStr

print(re.findall(r'\d{2}[/-]\d{2}[/-]\d{4}', dateStr))  # 2-digit day and month, 4-digit year
print(re.findall(r'\d{2}[/-]\d{2}[/-]\d{2,4}', dateStr)) # also allows a 2-digit year (none in this sample, so same result)
print(re.findall(r'\d{1,2}[/-]\d{1,2}[/-]\d{2,4}', dateStr)) # also allows 1-digit day or month: picks up 23-9-2001

['23-10-2002', '23/10/2002', '23/10/2002', '10/23/2002']
['23-10-2002', '23/10/2002', '23/10/2002', '10/23/2002']
['23-9-2001', '23-10-2002', '23/10/2002', '23/10/2002', '10/23/2002']


In [108]:
re.findall(r'(?:\d{2} )?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* (?:\d{2}, )?\d{4}', dateStr)

['23 Oct 2002', '23 October 2002', 'Oct 23, 2002', 'October 23, 2002']

In [109]:
# Now with spelled-out months

print(re.findall(r'\d{2} (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{4}', dateStr))
# what happened?
# when the pattern contains a capturing group (), findall returns only
# the text captured by the group, not the whole match

print(re.findall(r'\d{2} (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{4}', dateStr))
# (?:...) groups without capturing, so findall returns the whole match

print(re.findall(r'\d{2} (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* \d{4}', dateStr))
# [a-z]* to include Oct'ober'

print(re.findall(r'(?:\d{2} )?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* (?:\d{2}, )?\d{4}', dateStr))
# make the leading day optional and allow an optional 'day, ' after the month ('Oct 23, 2002')

print(re.findall(r'(?:\d{1,2} )?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* (?:\d{1,2}, )?\d{4}', dateStr))
# \d{1,2} also allows single-digit days

['Oct']
['23 Oct 2002']
['23 Oct 2002', '23 October 2002']
['23 Oct 2002', '23 October 2002', 'Oct 23, 2002', 'October 23, 2002']
['23 Oct 2002', '23 October 2002', 'Oct 23, 2002', 'October 23, 2002']
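Building on the final pattern above, named groups can pull out the individual parts of each date so that differently written dates end up in one shape. This is only a sketch: the group names `day_first`, `day_after`, `month`, and `year`, and the `spelled` sample string, are made up for illustration.

```python
import re

# Hypothetical sample with the same spelled-out forms used above
spelled = '23 Oct 2002\n23 October 2002\nOct 23, 2002\nOctober 23, 2002\n'

date_re = re.compile(
    r'(?:(?P<day_first>\d{1,2}) )?'
    r'(?P<month>Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* '
    r'(?:(?P<day_after>\d{1,2}), )?'
    r'(?P<year>\d{4})'
)

normalized = []
for m in date_re.finditer(spelled):
    # the day sits either before the month or after it; the unused group is None
    day = m.group('day_first') or m.group('day_after')
    normalized.append((day, m.group('month'), m.group('year')))

print(normalized)
```

All four spellings come out as `('23', 'Oct', '2002')`, because the `month` group captures only the three-letter prefix while `[a-z]*` absorbs the rest of the month name.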


## Regex with Pandas and Named Groups

In [110]:
import pandas as pd

time_sentences = ["Monday: The doctor's appointment is at 2:45pm.", 
                  "Tuesday: The dentist's appointment is at 11:30 am.",
                  "Wednesday: At 7:00pm, there is a basketball game!",
                  "Thursday: Be back home by 11:15 pm at the latest.",
                  "Friday: Take the train at 08:10 am, arrive at 09:00am."]

df = pd.DataFrame(time_sentences, columns=['text'])
df

                                                text
0     Monday: The doctor's appointment is at 2:45pm.
1  Tuesday: The dentist's appointment is at 11:30...
2  Wednesday: At 7:00pm, there is a basketball game!
3  Thursday: Be back home by 11:15 pm at the latest.
4  Friday: Take the train at 08:10 am, arrive at ...


In [111]:
# find the number of characters for each string in df['text']
df['text'].str.len()  # Series -> String -> Length

0    46
1    50
2    49
3    49
4    54
Name: text, dtype: int64

In [112]:
# find the number of tokens for each string in df['text']
df['text'].str.split().str.len()

0     7
1     8
2     8
3    10
4    10
Name: text, dtype: int64

In [113]:
# find which entries contain the word 'appointment'
df['text'].str.contains('appointment')

0     True
1     True
2    False
3    False
4    False
Name: text, dtype: bool

In [114]:
# find how many times a digit occurs in each string
df['text'].str.count(r'\d')

0    3
1    4
2    3
3    4
4    8
Name: text, dtype: int64

In [115]:
# find all occurrences of the digits
print(df['text'].str.len())
df['text'].str.findall(r'\d')

0    46
1    50
2    49
3    49
4    54
Name: text, dtype: int64


0                   [2, 4, 5]
1                [1, 1, 3, 0]
2                   [7, 0, 0]
3                [1, 1, 1, 5]
4    [0, 8, 1, 0, 0, 9, 0, 0]
Name: text, dtype: object

In [116]:
# group and find the hours and minutes
df['text'].str.findall(r'(\d?\d):(\d\d)')

0               [(2, 45)]
1              [(11, 30)]
2               [(7, 00)]
3              [(11, 15)]
4    [(08, 10), (09, 00)]
Name: text, dtype: object

In [117]:
# replace weekdays with '???'
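# regex=True is required in recent pandas versions, where str.replace defaults to literal matching
df['text'].str.replace(r'\w+day\b', '???', regex=True)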

0          ???: The doctor's appointment is at 2:45pm.
1       ???: The dentist's appointment is at 11:30 am.
2          ???: At 7:00pm, there is a basketball game!
3         ???: Be back home by 11:15 pm at the latest.
4    ???: Take the train at 08:10 am, arrive at 09:...
Name: text, dtype: object

In [118]:
# replace weekdays with 3-letter abbreviations
# (a callable replacement needs regex=True in recent pandas versions)
df['text'].str.replace(r'(\w+day\b)', lambda x: x.groups()[0][:3], regex=True)

0          Mon: The doctor's appointment is at 2:45pm.
1       Tue: The dentist's appointment is at 11:30 am.
2          Wed: At 7:00pm, there is a basketball game!
3         Thu: Be back home by 11:15 pm at the latest.
4    Fri: Take the train at 08:10 am, arrive at 09:...
Name: text, dtype: object

In [119]:
# create new columns from first match of extracted groups
df['text'].str.extract(r'(\d?\d):(\d\d)')

# to get all matches, use extractall function

  


    0   1
0   2  45
1  11  30
2   7  00
3  11  15
4  08  10


In [120]:
# extract the entire time, the hours, the minutes, and the period
df['text'].str.extractall(r'((\d?\d):(\d\d) ?([ap]m))')

                0   1   2   3
  match
0 0        2:45pm   2  45  pm
1 0      11:30 am  11  30  am
2 0        7:00pm   7  00  pm
3 0      11:15 pm  11  15  pm
4 0      08:10 am  08  10  am
  1       09:00am  09  00  am


In [121]:
# extract the entire time, the hours, the minutes, and the period with group names
df['text'].str.extractall(r'(?P<time>(?P<hour>\d?\d):(?P<minute>\d\d) ?(?P<period>[ap]m))')

             time hour minute period
  match
0 0        2:45pm    2     45     pm
1 0      11:30 am   11     30     am
2 0        7:00pm    7     00     pm
3 0      11:15 pm   11     15     pm
4 0      08:10 am   08     10     am
  1       09:00am   09     00     am


### Regex Methods

- **regex.match(string, pos, endpos)**
<br>if the beginning of string matches the regex, returns a corresponding match object; otherwise returns None

- **regex.fullmatch(string, pos, endpos)**
<br>if the whole string matches the regex, returns a corresponding match object; otherwise returns None

- **regex.split(string, maxsplit=0)**
<br>splits string by the occurrences of the pattern; identical to re.split() with the compiled pattern

- **regex.findall(string, pos, endpos)**
<br>returns all non-overlapping matches of the pattern in string, as a list of strings (or of tuples, if the pattern has more than one group)

- **regex.finditer(string, pos, endpos)**
<br>returns an iterator yielding match objects


In [366]:
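# fullmatch example
pattern = re.compile("o[gh]")
print(pattern.fullmatch("dog"))   # no match as "o" is not at the start of "dog"
print(pattern.fullmatch("ogre"))  # no match as the whole string must match, not just "og"
print(pattern.fullmatch("doggie", 1, 3))  # matches within the limits pos=1, endpos=3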

None
None
<_sre.SRE_Match object; span=(1, 3), match='og'>


In [367]:
# split example
re.split('[a-f]+', '0a3B9', flags=re.IGNORECASE)

['0', '3', '9']

### Regex Group

- **match.group([group1, ...])**
<br>returns one or more subgroups of the match; with no arguments, returns the whole match

- **match.groups()**
<br>returns a tuple containing all the subgroups of the match, from 1 up to however many groups there are in the pattern

- **match.groupdict()**
<br>returns a dict containing all the named subgroups of the match, keyed by the subgroup name

In [368]:
# group example
m = re.match(r"(\w+) (\w+)", "Isaac newton, Physicist")
print(m.group(0))  # The entire match
print(m.group(1))  # The first parenthesized subgroup
print(m.group(2))  # The second parenthesized subgroup
print(m.group(1,2))# Multiple arguments return a tuple

m[2] # same as m.group(2)

Isaac newton
Isaac
newton
('Isaac', 'newton')


'newton'

In [369]:
# if a group matches multiple times, only the last match is accessible.
m = re.match(r"(..)+", "a1b2c3") # matches 3 times
m.group(1)

'c3'
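`groupdict()` is listed above but not demonstrated. A minimal sketch, reusing the sample string from the group example with hypothetical group names `first_name` and `last_name`:

```python
import re

m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Isaac newton, Physicist")

names = m.groupdict()           # only named groups appear in the dict
by_name = m.group('first_name')
by_index = m.group(1)           # named groups are still numbered
by_subscript = m['last_name']   # same as m.group('last_name')

print(names)                    # {'first_name': 'Isaac', 'last_name': 'newton'}
print(by_name, by_index, by_subscript)
```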

### Regex Complement

Both patterns and strings to be searched can be Unicode strings as well as 8-bit strings. However, they cannot be mixed.

Regexes use the backslash character ("\") to indicate special forms or to allow special characters to be used without invoking their special meaning. This collides with Python's usage of the same character for the same purpose in string literals.

For example, to match a literal backslash, the pattern string must be written as `'\\\\'`: the regex itself must be `\\`, and each backslash must be expressed as `\\` inside a regular Python string literal.

-> The solution is to use raw string notation: with `r"\\"`, the same pattern needs only two backslashes.

In [370]:
import re
# search returns a match object for the first location where the pattern matches.
sent = 'b <a> b <c>'
re.search('<.*>', sent)

<_sre.SRE_Match object; span=(2, 11), match='<a> b <c>'>

In [371]:
# match only matches at the beginning of the string, so it returns None here (sent starts with 'b').
re.match('<.*>', sent)

#### *?, +?, ??

The `'*'`, `'+'`, and `'?'` quantifiers are all greedy; they match as much text as possible.<br>
Sometimes this behavior isn't desired; if the RE `<.*>` is matched against `<a> b <c>`, it will match the entire string, not just `<a>`. <br>
Adding ? after the quantifier performs the match in non-greedy or minimal fashion.

Consider the example below.

In [372]:
# greedy way -> matches as much as possible, from the first '<' to the last '>'

sent = 'b <a> b <c>'
re.search('<.*>', sent)

<_sre.SRE_Match object; span=(2, 11), match='<a> b <c>'>

In [373]:
# non-greedy way -> matches minimally, only <a>

re.search('<.*?>', sent)

<_sre.SRE_Match object; span=(2, 5), match='<a>'>

#### Other Extensions

- (?...) : Extension notation; the first character after the '?' determines the meaning of the construct
- (?aiLmsux) : Inline flags (re.A, re.I, re.L, ...) that apply to the entire expression. They should be placed at the start of the expression string.
- (?:...) : Non-capturing version of regular parentheses. The substring matched cannot be retrieved afterwards.
- (?imsx-imsx:...) : Sets or removes the corresponding flags for the enclosed part of the expression only.
- (?P\<name\>...) : Similar to regular parentheses, but the substring matched is also accessible via the group name.
- (?P=name) : Back-reference to a named group; matches whatever text was matched by the earlier group called name
- (?#...) : Comment, ignored
- (?=...) : Lookahead assertion; matches if ... matches next, without consuming it.<br> For example, Isaac (?=Asimov) matches 'Isaac ' only if it's followed by 'Asimov'
- (?!...) : Negative lookahead assertion; matches if ... does not match next.<br> For example, Isaac (?!Asimov) matches 'Isaac ' only if it's not followed by 'Asimov'
- (?<=...) : Positive lookbehind assertion; matches if the current position in the string is preceded by a match for ... that ends at the current position.
- (?<!...) : Negative lookbehind assertion; matches if the current position is not preceded by a match for ...
- (?(id/name)yes-pattern|no-pattern) : If the group with the given id or name matched, tries yes-pattern; otherwise, no-pattern (which is optional)
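A few of the extensions above in action. This is a sketch with made-up sample strings; the conditional pattern follows the email-address example in the Python `re` documentation.

```python
import re

s = 'Isaac Asimov, Isaac Newton'

# lookahead and negative lookahead: the asserted text is not consumed
followed = re.search(r'Isaac (?=Asimov)', s)
not_followed = re.search(r'Isaac (?!Asimov)', s)

# lookbehind and negative lookbehind
after_dash = re.findall(r'(?<=-)\w+', 'spam-egg')
not_priced = re.findall(r'(?<!\$)\b\d+', 'cost $30 for 12 items')

# inline flag and inline comment
inline_flag = re.findall(r'(?i)hello', 'Hello HELLO')
with_comment = re.search(r'\d+(?#one or more digits)', 'abc123').group()

# conditional: require '>' at the end only if '<' matched at the start
email = re.compile(r'(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)')
email_results = [bool(email.match(t)) for t in
                 ['<user@host.com>', 'user@host.com',
                  '<user@host.com', 'user@host.com>']]

print(followed.span(), not_followed.span())  # (0, 6) (14, 20)
print(after_dash, not_priced)                # ['egg'] ['12']
print(inline_flag, with_comment)             # ['Hello', 'HELLO'] 123
print(email_results)                         # [True, True, False, False]
```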

#### Referencing in different ways

1. in the same pattern itself
    1. `(?P=name)`
    2. `\1`

2. when processing match object m
    1. `m.group('name')`
    2. `m[1]`
    3. `m.end('name')`

3. in a string passed to the repl argument of re.sub()
    1. `\g<name>`
    2. `\g<1>`
    3. `\1`
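The three contexts above can be sketched with a single pattern that finds a doubled word. The group name `word` and the sample text are illustrative.

```python
import re

text = 'the the cat'

# 1. inside the pattern itself: by name or by number
by_name = re.search(r'\b(?P<word>\w+) (?P=word)\b', text)
by_number = re.search(r'\b(\w+) \1\b', text)

# 2. on the match object
m = by_name
group_by_name = m.group('word')
group_by_index = m[1]
end_of_group = m.end('word')

# 3. in the replacement string passed to re.sub()
pattern = r'\b(?P<word>\w+) (?P=word)\b'
sub_name = re.sub(pattern, r'\g<word>', text)
sub_g_number = re.sub(pattern, r'\g<1>', text)
sub_number = re.sub(pattern, r'\1', text)

print(by_name.group(), by_number.group())          # the the / the the
print(group_by_name, group_by_index, end_of_group) # the the 3
print(sub_name, sub_g_number, sub_number)          # 'the cat' three times
```

`\g<1>` is useful when the reference is followed by a digit: `\g<1>0` means group 1 followed by a literal '0', whereas `\10` would be read as group 10.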