## Section II Regular Expressions (Regex)

#### **1. Process free-text**

$\qquad$ Find words with a specific format (e.g., words that start with "@") using the string method *startswith*


In [2]:
text1 = '"Ethics are built right into the ideals and objectives of the United Nations"# UNSG @NY Society for Ethical Culture bit.ly/2guVelr @UN @UN_Women'
text2 = text1.split()
print(text2,len(text2))

text3 = [w for w in text2 if w.startswith('@')] # Generate a list with all elements that start with an @
print(text3,len(text3))

['"Ethics', 'are', 'built', 'right', 'into', 'the', 'ideals', 'and', 'objectives', 'of', 'the', 'United', 'Nations"#', 'UNSG', '@NY', 'Society', 'for', 'Ethical', 'Culture', 'bit.ly/2guVelr', '@UN', '@UN_Women'] 22
['@NY', '@UN', '@UN_Women'] 3


#### **2. Find patterns with regular expressions**

$\qquad$(1) Callouts are more than just tokens beginning with '@':  
$\qquad\qquad$ e.g., @UN_Spokesperson, @katyperry, @coursera01.  
$\qquad\qquad$ **Solution**: match '@' followed by one or more letters, digits, or underscores ('_')  
$\qquad\qquad$ **Regex**: *'@[A-Za-z0-9_]+'*  
$\qquad\qquad$ **Meaning**:  
$\qquad\qquad\qquad$ a. [] encloses the set of characters allowed at this position (i.e., after the "@" in the example)  
$\qquad\qquad\qquad$ b. + repeats the preceding item one or more times

In [3]:
import re
text4 = [w for w in text2 if re.search('@[A-Za-z0-9_]+',w)]
print(text4,len(text4),text4==text3)

['@NY', '@UN', '@UN_Women'] 3 True
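One caveat about the filter above: *re.search* succeeds when the pattern matches anywhere inside a token, so a token that merely contains '@' followed by word characters also passes. The sketch below uses hypothetical tokens (they are not taken from the tweet above) to contrast *re.search* with *re.fullmatch*, which requires the whole token to match.

```python
import re

# Hypothetical tokens, chosen only to illustrate the difference.
tokens = ['@UN_Women', 'info@un.org', '@', 'hello']

pattern = r'@[A-Za-z0-9_]+'

# re.search: the pattern may match anywhere inside the token.
found_anywhere = [t for t in tokens if re.search(pattern, t)]

# re.fullmatch: the entire token must be '@' plus letters, digits, or underscores.
whole_token = [t for t in tokens if re.fullmatch(pattern, t)]

print(found_anywhere)
print(whole_token)
```

For the tweet in this section both approaches give the same three callouts, so the choice only matters when tokens such as e-mail addresses can occur.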


$\qquad$(2) General meta-characters (character matches)

$\qquad\qquad$ a. '.' - matches any single character except a newline  
$\qquad\qquad$ b. '^' - matches at the start of a string  
$\qquad\qquad$ c. '$' - matches at the end of a string  

$\qquad\qquad$ d. '[]' - matches one character from the set within the brackets   
$\qquad\qquad\qquad$ [a-z]: matches one character in the range *a, b, ..., z*  
$\qquad\qquad\qquad$ [^abc]: matches one character that is **not** *a, b,* or *c*   
$\qquad\qquad\qquad$ (**Note**: a '^' at the start of a bracket means 'not'; a '^' outside a bracket means 'starts with'.)  
$\qquad\qquad\qquad$ a|b: matches 'a' or 'b' (**Note**: '|' means 'or' only outside brackets; [a|b] matches 'a', '|', or 'b'.)  

$\qquad\qquad$ e. '()' - groups part of a pattern, which sets the scope of an operator and captures the matched text  
$\qquad\qquad$ f. '?:' - placed right after an opening parenthesis, makes the group non-capturing, so functions such as *re.findall* return the WHOLE match, NOT only the text matched inside the parentheses  

$\qquad\qquad$ g. '\\' - escapes a special character or introduces a special sequence (e.g., \t, \n, \b)  
$\qquad\qquad\qquad$ \b: matches word boundary  
$\qquad\qquad\qquad$ \d: matches any digit (=[0-9])  
$\qquad\qquad\qquad$ \D: matches any non-digit (=[^0-9])  
$\qquad\qquad\qquad$ \s: matches any whitespace (=[ \t\n\r\f\v])  
$\qquad\qquad\qquad$ \S: matches any non-whitespace (=[^ \t\n\r\f\v])  
$\qquad\qquad\qquad$ \w: matches any alphanumeric character or underscore (=[A-Za-z0-9_])  
$\qquad\qquad\qquad$ \W: matches any character that is neither alphanumeric nor an underscore (=[^A-Za-z0-9_])  
$\qquad\qquad\qquad$ (**Note**: the bracket equivalents hold for ASCII text; on Python 3 strings, \d, \s, and \w also match Unicode digits, whitespace, and letters unless *re.ASCII* is set.)  
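The character matches listed above can be tried out on a short string. The following is a minimal sketch; the sample sentence and all variable names are made up for illustration.

```python
import re

sample = 'Call 555-1234 at 9am, or e-mail me.'   # made-up sentence

digits = re.findall(r'\d', sample)               # every single digit
first_chars = re.findall(r'\b\w', sample)        # first character of each word
punctuation = re.findall(r'[^\w\s]', sample)     # neither word character nor whitespace
starts_with_call = re.search(r'^Call', sample) is not None
ends_with_me = re.search(r'me\.$', sample) is not None   # '\.' is a literal dot

# Capturing vs. non-capturing groups in re.findall
captured = re.findall(r'\d(a|p)m', '9am 5pm')    # returns only the group
whole = re.findall(r'\d(?:a|p)m', '9am 5pm')     # returns the whole match

# Inside brackets, '|' is an ordinary character
bracket_pipe = re.findall(r'[a|b]', 'a|b')

print(digits)
print(first_chars)
print(punctuation)
print(starts_with_call, ends_with_me)
print(captured, whole)
print(bracket_pipe)
```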

$\qquad$(3) Meta-characters (repetitions)

$\qquad\qquad$ a. '*' - matches 0 or more occurrences  
$\qquad\qquad$ b. '+' - matches 1 or more occurrences  
$\qquad\qquad$ c. '?' - matches 0 or 1 occurrence  
$\qquad\qquad$ d. '{n}' - matches **exactly** *n* occurrences (n>=0)  
$\qquad\qquad$ e. '{n,}' - matches **at least** *n* occurrences (n>=0)  
$\qquad\qquad$ f. '{,n}' - matches **at most** *n* occurrences (n>=0)  
$\qquad\qquad$ g. '{m,n}' - matches **at least** *m* and **at most** *n* occurrences (n>=m>=0)  
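To see how the repetition operators differ, the sketch below checks each one against strings of zero to four 'a' characters. It uses *re.fullmatch* so that the whole string has to fit the pattern; the candidate strings and variable names are illustrative only.

```python
import re

candidates = ['', 'a', 'aa', 'aaa', 'aaaa']
patterns = ['a*', 'a+', 'a?', 'a{2}', 'a{2,}', 'a{,2}', 'a{1,3}']

# For each pattern, collect the candidates that it matches completely.
accepted = {p: [s for s in candidates if re.fullmatch(p, s)] for p in patterns}

for p, matched in accepted.items():
    print(f'{p:8}', matched)
```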

In [4]:
text5 = [w for w in text2 if re.search(r'@\w+',w)] # raw string, so that '\w' reaches the regex engine unchanged
print(text5,len(text5),text5==text4)

['@NY', '@UN', '@UN_Women'] 3 True


In [5]:
text6 = 'ouagadougou'
text7 = re.findall(r'[aeiou]',text6)
print(text7,len(text7))

['o', 'u', 'a', 'a', 'o', 'u', 'o', 'u'] 8


In [6]:
text8 = re.findall(r'[^aeiou]',text6)
print(text8,len(text8))

['g', 'd', 'g'] 3


#### **3. Regular expression for dates**

##### (1) Pure numbers  
$\qquad$ Considering general cases: MM-DD-YYYY, MM/DD/YYYY, DD-MM-YYYY, DD/MM/YYYY:  
$\qquad\qquad$ **regex**: *\d{2}[-/]\d{2}[-/]\d{4}*

$\qquad$ For single-digit months/days: M-D-YYYY, M/D/YYYY, D-M-YYYY, D/M/YYYY:  
$\qquad\qquad$ **regex**: *\d{1,2}[-/]\d{1,2}[-/]\d{4}*  

$\qquad$ If the year may have only two digits (e.g., 2023->23): M(M)-D(D)-YY(YY) or M(M)/D(D)/YY(YY)  
$\qquad\qquad$ **regex**: *\d{1,2}[-/]\d{1,2}[-/]\d{2,4}* (this also accepts three-digit years)  

##### (2) With Months as Words
$\qquad$ With the month in the middle (e.g., 23 Oct 2002):  
$\qquad\qquad$ **regex**: *\d{1,2} (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]\* \d{2,4}*

$\qquad$ Covering both the month at the beginning (e.g., Oct 23, 2002) and the month in the middle:  
$\qquad\qquad$ **regex**: *(?:\d{1,2} )?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]\* (?:\d{1,2}, )?\d{2,4}*

In [7]:
# TEST IT!

dateStr = '23-10-2002\n23/10/2002\n23/10/02\n10/23/2002\n23 Oct 2002\n23 October 2002\nOct 23, 2002\nOctober 23, 2002\n'
print(dateStr)

date_no = re.findall(r'\d{1,2}[-/]\d{1,2}[-/]\d{2,4}',dateStr)
date_wd = re.findall(r'(?:\d{1,2} )?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* (?:\d{1,2}, )?\d{2,4}',dateStr)

print(date_no, len(date_no))
print(date_wd, len(date_wd))

23-10-2002
23/10/2002
23/10/02
10/23/2002
23 Oct 2002
23 October 2002
Oct 23, 2002
October 23, 2002

['23-10-2002', '23/10/2002', '23/10/02', '10/23/2002'] 4
['23 Oct 2002', '23 October 2002', 'Oct 23, 2002', 'October 23, 2002'] 4


In [8]:
date = re.findall(r'\d{1,2}[-/]\d{1,2}[-/]\d{2,4}|(?:\d{1,2} )?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* (?:\d{1,2}, )?\d{2,4}',dateStr) # combine two regex into one.
print(date,len(date))

['23-10-2002', '23/10/2002', '23/10/02', '10/23/2002', '23 Oct 2002', '23 October 2002', 'Oct 23, 2002', 'October 23, 2002'] 8
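The combined pattern is long enough to be hard to read on one line. One possible way to keep it maintainable, sketched here as an option rather than as part of the original notes, is the *re.VERBOSE* flag, which ignores whitespace in the pattern and allows comments. Because whitespace is ignored, each literal space has to be written as `[ ]`. The names `date_str` and `date_pattern` are illustrative.

```python
import re

date_str = ('23-10-2002\n23/10/2002\n23/10/02\n10/23/2002\n'
            '23 Oct 2002\n23 October 2002\nOct 23, 2002\nOctober 23, 2002\n')

date_pattern = re.compile(r'''
    \d{1,2}[-/]\d{1,2}[-/]\d{2,4}    # numeric dates such as 23-10-2002 or 10/23/02
    |                                # or
    (?:\d{1,2}[ ])?                  # optional day before the month
    (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*   # month name
    [ ]                              # a literal space
    (?:\d{1,2},[ ])?                 # optional day and comma after the month
    \d{2,4}                          # year
''', re.VERBOSE)

dates = date_pattern.findall(date_str)
print(dates, len(dates))
```

Since every group is non-capturing, *findall* returns the whole matched date strings, just as the one-line version does.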


#### **4. Regular expression with Pandas and Named Groups**

##### (1) Review of Pandas package

> The pandas package provides convenient processing for Series and DataFrame objects, which arrange a dataset in an array-like list or a table-like format, respectively.

$\qquad$ **Basic Pandas Objects**: Series, DataFrame  
$\qquad\qquad$ **create a pandas dataframe**: *pd.DataFrame(data, index, columns, dtype)*


In [9]:
import pandas as pd

time_sentences = ["Monday: The doctor's appointment is at 2:45pm.", 
                  "Tuesday: The dentist's appointment is at 11:30 am.",
                  "Wednesday: At 7:00pm, there is a basketball game!",
                  "Thursday: Be back home by 11:15 pm at the latest.",
                  "Friday: Take the train at 08:10 am, arrive at 09:00am."]

df = pd.DataFrame(time_sentences, columns=['text'])
df

                                                text
0     Monday: The doctor's appointment is at 2:45pm.
1  Tuesday: The dentist's appointment is at 11:30...
2  Wednesday: At 7:00pm, there is a basketball game!
3  Thursday: Be back home by 11:15 pm at the latest.
4  Friday: Take the train at 08:10 am, arrive at ...


$\qquad$ **Attribute**: *str* - gives access to a set of string-processing methods that operate on each element of the *Series*.  
$\qquad\qquad$ **Method**: *len* - returns the number of characters in each string, or the number of words when applied after *split*.

In [10]:
sen_df = df['text']
for sen in sen_df:
    print(sen)

len_df = sen_df.str.len() # Neither a Series nor a DataFrame has a "len" method, so go through the .str accessor
print('The number of characters in text column of df\n', len_df)

wd_df = df['text'].str.split()
for wd in wd_df:
    print(wd)

wd_len_df = wd_df.str.len()
print('The number of words in text column of df\n', wd_len_df)

Monday: The doctor's appointment is at 2:45pm.
Tuesday: The dentist's appointment is at 11:30 am.
Wednesday: At 7:00pm, there is a basketball game!
Thursday: Be back home by 11:15 pm at the latest.
Friday: Take the train at 08:10 am, arrive at 09:00am.
The number of characters in text column of df
 0    46
1    50
2    49
3    49
4    54
Name: text, dtype: int64
['Monday:', 'The', "doctor's", 'appointment', 'is', 'at', '2:45pm.']
['Tuesday:', 'The', "dentist's", 'appointment', 'is', 'at', '11:30', 'am.']
['Wednesday:', 'At', '7:00pm,', 'there', 'is', 'a', 'basketball', 'game!']
['Thursday:', 'Be', 'back', 'home', 'by', '11:15', 'pm', 'at', 'the', 'latest.']
['Friday:', 'Take', 'the', 'train', 'at', '08:10', 'am,', 'arrive', 'at', '09:00am.']
The number of words in text column of df
 0     7
1     8
2     8
3    10
4    10
Name: text, dtype: int64


$\qquad\qquad$ **Methods**: *str.contains(t)* - checks whether each string contains a pattern and produces a *boolean mask*    
$\qquad\qquad\qquad$ ***Note***: To retrieve the items selected by the *boolean mask*, use *series[boolean_mask]*

In [11]:
appt_bool = sen_df.str.contains('appointment')
print(appt_bool)

appt = sen_df[appt_bool]
print(appt)

0     True
1     True
2    False
3    False
4    False
Name: text, dtype: bool
0       Monday: The doctor's appointment is at 2:45pm.
1    Tuesday: The dentist's appointment is at 11:30...
Name: text, dtype: object
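*str.contains* treats its argument as a regular expression by default and accepts a few options that matter on messier data. The sketch below uses a hypothetical Series (not the `df` above) that includes a missing value: `case=False` ignores case, and `na=False` counts a missing value as "no match" so that the mask stays purely boolean and can be used for indexing.

```python
import pandas as pd

# Hypothetical notes, including one missing value.
notes = pd.Series(['Dentist APPOINTMENT at 3pm', 'Buy milk', None, 'appointment moved'])

# 'appoint\w*' matches 'appointment', 'APPOINTMENT', 'appointed', ...
mask = notes.str.contains(r'appoint\w*', case=False, na=False, regex=True)
print(mask)

# The boolean mask selects the matching rows.
print(notes[mask])
```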


$\qquad\qquad$ **Methods**: *str.count(t)* - counts the occurrences of a pattern in each string of the series   

In [12]:
ct_digit = sen_df.str.count(r'\d')
print(ct_digit)

0    3
1    4
2    3
3    4
4    8
Name: text, dtype: int64


$\qquad\qquad$ **Methods**: *str.findall(t)* - finds all occurrences of a pattern in each string of the series   
$\qquad\qquad\qquad$ Higher level: use parentheses in the regex to capture certain groups (e.g., the hour and minute of a time)

In [13]:
all_digit = sen_df.str.findall(r'\d')
print(all_digit)

all_time = sen_df.str.findall(r'\d{1,2}:\d{2} ?[ap]m') # [ap] matches 'a' or 'p'; a '|' inside brackets would be a literal character
print(all_time)

all_hm = sen_df.str.findall(r'(\d{1,2}):(\d{2})')
print(all_hm) 

0                   [2, 4, 5]
1                [1, 1, 3, 0]
2                   [7, 0, 0]
3                [1, 1, 1, 5]
4    [0, 8, 1, 0, 0, 9, 0, 0]
Name: text, dtype: object
0               [2:45pm]
1             [11:30 am]
2               [7:00pm]
3             [11:15 pm]
4    [08:10 am, 09:00am]
Name: text, dtype: object
0               [(2, 45)]
1              [(11, 30)]
2               [(7, 00)]
3              [(11, 15)]
4    [(08, 10), (09, 00)]
Name: text, dtype: object


$\qquad\qquad$ **Methods**: *str.replace(t1,t2)* - replaces every match of *t1* in each string with *t2*  
$\qquad\qquad\qquad$ Higher level: build the replacement from the matched text by passing a function (e.g., a *lambda*) as *t2*

In [14]:
sen_df1 = sen_df.str.replace(r'\w+day\b', '???', regex=True) # reminder: '\b' indicates the boundary of a word
print(sen_df1)

sen_df2 = sen_df.str.replace(r'(\w+day\b)', lambda x: x.groups()[0][:3], regex=True)
# regex=True makes pandas treat the pattern as a regular expression (required in pandas >= 2.0, where the default is regex=False)
# use a pair of parentheses to create a group, and access it inside the lambda function through 'groups()'
# groups() returns a tuple, use [0] to get the first element (e.g., Monday) and [:3] to get the first three characters of the first element (e.g., Mon)
print(sen_df2)


0          ???: The doctor's appointment is at 2:45pm.
1       ???: The dentist's appointment is at 11:30 am.
2          ???: At 7:00pm, there is a basketball game!
3         ???: Be back home by 11:15 pm at the latest.
4    ???: Take the train at 08:10 am, arrive at 09:...
Name: text, dtype: object
0          Mon: The doctor's appointment is at 2:45pm.
1       Tue: The dentist's appointment is at 11:30 am.
2          Wed: At 7:00pm, there is a basketball game!
3         Thu: Be back home by 11:15 pm at the latest.
4    Fri: Take the train at 08:10 am, arrive at 09:...
Name: text, dtype: object
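When the replacement only rearranges the captured text, a backreference is often enough and no *lambda* is needed. The sketch below is one such variation, not part of the original notes: `\1` and `\2` stand for the first and second captured groups, and the two sentences are copied from `time_sentences` so that the block runs on its own.

```python
import pandas as pd

sentences = pd.Series(["Monday: The doctor's appointment is at 2:45pm.",
                       "Friday: Take the train at 08:10 am, arrive at 09:00am."])

# Rewrite H:MM as H.MM; \1 is the hour group and \2 is the minute group.
dotted = sentences.str.replace(r'(\d{1,2}):(\d{2})', r'\1.\2', regex=True)
print(dotted.tolist())
```

The colon after the weekday is left alone because the pattern requires digits on both sides of the colon.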




$\qquad\qquad$ **Methods**: *str.extract(t)* - creates new columns from the captured groups (only the first match in each string)  
$\qquad\qquad\qquad$ Higher level:  
$\qquad\qquad\qquad$ 1. To get all matches, use *str.extractall(t)*  
$\qquad\qquad\qquad$ 2. To name a group, use *(?P<name>...)*; the extracted dataframe uses the group names as its column names

In [15]:
df_hm = sen_df.str.extract(r'(\d{1,2}):(\d{2})')
df_hm.columns = ['hour','minute']
print(df_hm)

df_hm_all = sen_df.str.extractall(r'(\w+day\b).+?(\d{1,2}):(\d{2}) ?([ap]m)')
# Breakdown of the regex:
# four pairs of () - extract four columns
# First (): \w+day\b - extracts a run of word characters ending in 'day' at the end of a word
# .+? - any characters, as few as possible; not inside parentheses, thus not extracted
#       (a greedy .+ would swallow the first digit of '11' and skip ahead to the last time in the sentence)
# Second (): \d{1,2} - extracts 1 or 2 digits for hours (H or HH)
# : - a colon to separate HH and MM
# Third (): \d{2} - extracts 2 digits for minutes (MM)
#  ? - denotes there might be one or no space
# Fourth (): [ap]m - extracts the indication of morning (am) or afternoon (pm)

df_hm_all.columns = ['day of a week','hour','minute','Morning/Afternoon']
print(df_hm_all)

df_hm_name = sen_df.str.extractall(r'(?P<time>(?P<hour>\d{1,2}):(?P<minute>\d{2}) ?(?P<AM_PM>[ap]m))').rename(columns={'AM_PM':'AM/PM'})
print(df_hm_name)

  hour minute
0    2     45
1   11     30
2    7     00
3   11     15
4   08     10
        day of a week hour minute Morning/Afternoon
  match                                            
0 0            Monday    2     45                pm
1 0           Tuesday   11     30                am
2 0         Wednesday    7     00                pm
3 0          Thursday   11     15                pm
4 0            Friday   08     10                am
             time hour minute AM/PM
  match                            
0 0        2:45pm    2     45    pm
1 0      11:30 am   11     30    am
2 0        7:00pm    7     00    pm
3 0      11:15 pm   11     15    pm
4 0      08:10 am   08     10    am
  1       09:00am   09     00    am
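Every sentence in `time_sentences` contains a time, so the outputs above never show what happens when a string has no match. As a small sketch on a hypothetical Series: *str.extract* keeps one row per input string and fills the groups with missing values where the pattern is not found, so those rows can be dropped afterwards if they are not wanted. The group name `period` is made up for this example.

```python
import pandas as pd

# Hypothetical notes; the second one contains no time.
notes = pd.Series(['Lunch at 12:30 pm', 'No time given', 'Call at 9:05am'])

parts = notes.str.extract(r'(?P<hour>\d{1,2}):(?P<minute>\d{2}) ?(?P<period>[ap]m)')
print(parts)

# Keep only the rows where the pattern was found.
matched = parts.dropna()
print(matched)
```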


#### ***\* Take Home Concepts***

$\qquad$ - Concept and use of regular expressions  
$\qquad$ - Meta-characters (*character matches* and *repetitions*) for regular expressions  
$\qquad$ - Build a regular expression step-by-step to identify dates