![alt text](python.png "Title")

# Regular expressions

A regular expression (regex) is a sequence of characters that defines a search pattern. Regular expressions make text-processing code compact and efficient, at the cost of readability.

Nice and easy cheat sheet: https://www.w3schools.com/python/python_regex.asp

In [None]:
# Import the regex package
import re
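
A side note on how patterns are written: regex patterns are usually given as raw strings (prefix `r`), so that backslashes reach the regex engine exactly as typed. The short sketch below only illustrates that point; the sample string `"id=100123"` is made up for the example.

```python
import re

# Without the r prefix the backslash has to be doubled:
# '\\d{6}' and r'\d{6}' are the same six characters
assert '\\d{6}' == r'\d{6}'

# A raw string keeps backslashes as typed: r'\n' is a backslash followed by 'n',
# whereas '\n' is a single newline character
print(len(r'\n'), len('\n'))   # 2 1

# The raw pattern is passed to the re functions like any other string
print(re.findall(r'\d{6}', "id=100123"))   # ['100123']
```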

## re.findall()

Returns a list containing all matches.

In [26]:
text = "Patient (id=100123) was discharged on 2012-05-06 (study period 04)"

# A pattern with 6 digits in a row (raw string, so the backslash is passed to the regex engine untouched)
subjid = re.findall(r'\d{6}', text)  # findall returns a list

# A whitespace character followed by a zero and one digit among 1, 2, 3, 4 or 5
period = int(re.findall(r'\s0[1-5]', text)[0])  # take the first item in the list and convert it to an integer

print('subjid=', subjid)
print('period=', period)

subjid= ['100123']
period= 4
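
When the pattern contains capturing groups, `re.findall()` returns the content of the groups instead of the whole match (one tuple per match when there are several groups). A minimal sketch reusing the sentence above; the variable names `ids` and `dates` are only illustrative:

```python
import re

text = "Patient (id=100123) was discharged on 2012-05-06 (study period 04)"

# One capturing group: findall returns only the text captured by the group
ids = re.findall(r'id=(\d{6})', text)

# Several groups: findall returns one tuple per match
dates = re.findall(r'(\d{4})-(\d{2})-(\d{2})', text)

print('ids=', ids)      # ids= ['100123']
print('dates=', dates)  # dates= [('2012', '05', '06')]
```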


In [58]:
txt = "hello@world, my email is ndupuis@its.jnj.com "

# Finding emails in a string. Not easy to read, but imagine the alternative algorithm...
# 'hello@world' is not returned because the pattern requires a dot in the domain part.
re.findall(r"([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9.-]+)", txt)

['ndupuis@its.jnj.com']
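
Long patterns like this one can be made easier to read with the `re.VERBOSE` flag, which ignores whitespace and `#` comments inside the pattern. The sketch below spells out the same email pattern piece by piece; the name `email_pattern` is an arbitrary choice, and the pattern is a simple illustration rather than a complete email validator.

```python
import re

# Same email pattern, spread over several lines with comments.
# In VERBOSE mode, whitespace and '#' comments outside character classes are ignored.
email_pattern = re.compile(r"""
    [a-zA-Z0-9_.+-]+    # local part: letters, digits and a few symbols
    @                   # the at sign
    [a-zA-Z0-9-]+       # first label of the domain
    \.                  # a literal dot
    [a-zA-Z0-9.-]+      # rest of the domain (may contain more dots)
""", re.VERBOSE)

txt = "hello@world, my email is ndupuis@its.jnj.com "
emails = email_pattern.findall(txt)
print(emails)   # ['ndupuis@its.jnj.com']
```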

## re.search()

Returns a Match object for the first match found anywhere in the string, or None if there is no match.

In [42]:
def check_phone_number(txt):
    '''Searches for the first occurrence of a phone number in a string.
       Returns a message with the phone number and its position if one was found,
       and a "not valid" message otherwise.
    '''
    # Pattern: 3 digits, any single character, 2 digits, any single character, 3 digits.
    match = re.search(r"\d{3}.\d{2}.\d{3}", txt)
    print(match)  # a Match object, or None if nothing was found

    if match:
        return f'Valid phone number ({match.group()}, at position {match.span()}) \n'
    else:
        return 'Not a valid phone number \n'

print(check_phone_number("my phone number: 555-12-781"))
print(check_phone_number("call me on 555/12-781"))
print(check_phone_number("Yeah, 555-121-781 is my phone."))

<re.Match object; span=(17, 27), match='555-12-781'>
Valid phone number (555-12-781, at position (17, 27)) 

<re.Match object; span=(11, 21), match='555/12-781'>
Valid phone number (555/12-781, at position (11, 21)) 

None
Not a valid phone number 
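
Because `.` matches any character except a newline, the pattern above also accepts a digit as a separator, so a plain run of ten digits would count as a phone number. A possible stricter variant is sketched below. It assumes that only `-` and `/` are valid separators (an assumption made for this example), and the names `PHONE`, `find_phone`, `area`, `middle` and `last` are hypothetical.

```python
import re

# Assumption: only '-' or '/' are accepted as separators.
# (?<!\d) and (?!\d) make sure the number is not part of a longer run of digits.
# (?P<name>...) gives each group a name that can be used to read it back.
PHONE = re.compile(
    r'(?<!\d)(?P<area>\d{3})[-/](?P<middle>\d{2})[-/](?P<last>\d{3})(?!\d)'
)

def find_phone(txt):
    """Return the parts of the first phone number found as a dict, or None."""
    match = PHONE.search(txt)
    return match.groupdict() if match else None

print(find_phone("my phone number: 555-12-781"))
print(find_phone("Yeah, 555-121-781 is my phone."))   # None

# The original pattern accepts a run of ten digits, the stricter one does not
print(re.search(r"\d{3}.\d{2}.\d{3}", "order 5551234567").group())   # 5551234567
print(find_phone("order 5551234567"))                                # None
```

Named groups keep the code readable when the parts of the match are needed separately, at the price of a longer pattern.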



## re.split()

Returns a list where the string has been split at each match.

In [52]:
txt = "Hello world! How are you today? I'm good :-)."

# Using \s (a whitespace character) as the regex in re.split().
# Here it returns the same as split(), because the words are separated by single spaces.
print('regex split: ', re.split(r"\s", txt))
print('normal split:', txt.split())

# re.split() offers more flexibility.
# For example, \W matches any character that is not a word character (letter, digit or underscore):
tokens = re.split(r"\W", txt)
print(tokens)

# We can clean further:
tokens = [token for token in tokens if token != '']
print(tokens)

regex split:  ['Hello', 'world!', 'How', 'are', 'you', 'today?', "I'm", 'good', ':-).']
normal split: ['Hello', 'world!', 'How', 'are', 'you', 'today?', "I'm", 'good', ':-).']
['Hello', 'world', '', 'How', 'are', 'you', 'today', '', 'I', 'm', 'good', '', '', '', '', '']
['Hello', 'world', 'How', 'are', 'you', 'today', 'I', 'm', 'good']
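
The two splits only agree when words are separated by single whitespace characters. The sketch below uses a made-up string with repeated and trailing whitespace to show where they differ, and uses `\W+` so that a run of separators counts as one. The variable names are illustrative.

```python
import re

txt = "Hello  world!\tHow are you? "   # two spaces, a tab and a trailing space

# str.split() with no argument collapses whitespace runs and trims the ends
plain = txt.split()

# re.split(r"\s") splits at every single whitespace character, leaving empty strings
single = re.split(r"\s", txt)

# \W+ treats a run of non-word characters as one separator;
# the filter drops the empty string left at the end
words = [w for w in re.split(r"\W+", txt) if w]

print(plain)    # ['Hello', 'world!', 'How', 'are', 'you?']
print(single)   # ['Hello', '', 'world!', 'How', 'are', 'you?', '']
print(words)    # ['Hello', 'world', 'How', 'are', 'you']
```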


## re.sub()

Replaces one or more matches with a string.

In [55]:
txt = "hello world, hello!"
txt2 = "Oh, hello world, hello!"

# Replaces 'hello' with 'Hi' only when it occurs at the start of the string:
print(re.sub('^hello', 'Hi', txt))
print(re.sub('^hello', 'Hi', txt2))

Hi world, hello!
Oh, hello world, hello!
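
`re.sub()` has a few more options worth knowing: the `count` argument limits the number of replacements, the replacement string can refer back to captured groups, and the replacement can be a function. A short sketch on the same sentence; the variable names are only illustrative:

```python
import re

txt = "hello world, hello!"

# count=1 replaces only the first match, wherever it is in the string
first_only = re.sub(r'hello', 'Hi', txt, count=1)

# \1 and \2 in the replacement re-insert the text captured by the groups
swapped = re.sub(r'\b(hello) (world)\b', r'\2 \1', txt)

# The replacement can also be a function that receives the Match object
shouted = re.sub(r'hello', lambda m: m.group().upper(), txt)

print(first_only)   # Hi world, hello!
print(swapped)      # world hello, hello!
print(shouted)      # HELLO world, HELLO!
```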


__________________________________________________
Nicolas Dupuis, Methodology and Innovation (IDAR C&SP), 2020+