### RegEx

In [2]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

import re

#### Useful Links

- https://www.w3schools.com/python/python_regex.asp
- https://www.guru99.com/python-regular-expressions-complete-tutorial.html
- https://www.analyticsvidhya.com/blog/2015/06/regular-expression-python/
- https://www.tutorialspoint.com/python/python_reg_expressions.htm

Search the string to see if it starts with "The" and ends with "Spain":

In [8]:
txt = "The rain in Spain"
x = re.search("^The.*Spain$", txt)

if x:
  print("YES! We have a match!")
else:
  print("No match")

YES! We have a match!


--------------

### Functions

#### Match

The match() function returns a Match object only if the pattern matches at the start of the string; otherwise it returns None.

In [25]:
str = "The rain in Spain"
x = re.match("ai", str)
y = re.match("The", str)
print(x)
print(y)
y.group()

None
<re.Match object; span=(0, 3), match='The'>


'The'
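
Because match() is anchored to the start of the string, it can help to compare it with search() and fullmatch() on the same input. The following is a minimal sketch; the variable names are illustrative only.

```python
import re

txt = "The rain in Spain"

# match() succeeds only if the pattern matches at position 0
at_start = re.match(r"rain", txt)         # None: "rain" is not at the start

# search() scans the whole string and returns the first match
anywhere = re.search(r"rain", txt)        # span (4, 8)

# fullmatch() requires the pattern to cover the entire string
whole = re.fullmatch(r"The.*Spain", txt)

print(at_start, anywhere.span(), whole.group())
```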

#### Findall

The findall() function returns a list containing all matches.

In [4]:
str = "The rain in Spain"
x = re.findall("ai", str)
print(x)

['ai', 'ai']


If no match is found, an empty list is returned:

In [16]:
str = "The rain in Spain"
x = re.findall("Por", str)
print(x)

[]
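
One detail worth knowing is that the shape of the findall() result changes when the pattern contains capturing groups. The sketch below uses made-up patterns on the same sample string to show the three cases.

```python
import re

txt = "The rain in Spain"

# No groups: a list of the full matches
plain = re.findall(r"\w+ain", txt)

# One group: a list of that group's contents only
one_group = re.findall(r"(\w+)ain", txt)

# Several groups: a list of tuples, one tuple per match
two_groups = re.findall(r"(\w)(ain)", txt)

print(plain)       # ['rain', 'Spain']
print(one_group)   # ['r', 'Sp']
print(two_groups)  # [('r', 'ain'), ('p', 'ain')]
```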


#### Search

The search() function searches the string for a match, and returns a Match object if there is a match. <br/>
If there is **more than one match**, only the first occurrence is returned. <br/>
If **no matches** are found, the value None is returned:

In [83]:
str = "The rain in Spain"
x = re.search("Portugal", str)
y = re.search("Spain", str)
print(x)
print(y)

None
<re.Match object; span=(12, 17), match='Spain'>


In [7]:
str = "The rain in Spain"
x = re.search(r"\s", str)

print("The first white-space character is located in position:", x.start())

The first white-space character is located in position: 3
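
The Match object returned by search() exposes the matched text and its position. A short sketch of the most commonly used methods follows; the pattern is just an illustration, and the None check guards against the no-match case.

```python
import re

txt = "The rain in Spain"

# Find the first word that starts with an upper-case "S"
m = re.search(r"\bS\w+", txt)

if m is not None:
    print(m.group())  # the matched text: 'Spain'
    print(m.start())  # index where the match begins: 12
    print(m.end())    # index just past the match: 17
    print(m.span())   # (start, end) as a tuple: (12, 17)
```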


#### Split

The split() function returns a list where the string has been split at each match:

In [17]:
str = "The rain in Spain"
x = re.split(r"\s", str)
print(x)

['The', 'rain', 'in', 'Spain']


You can control the number of occurrences by specifying the maxsplit parameter:

In [18]:
# Split the string only at the first occurrence
str = "The rain in Spain"
x = re.split(r"\s", str, maxsplit=1)
print(x)

['The', 'rain in Spain']
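
split() also accepts a character set as the delimiter, and a capturing group in the pattern keeps the delimiters in the result. The sample strings below are invented for illustration.

```python
import re

# Split on any run of commas, semicolons, or white-space
parts = re.split(r"[,;\s]+", "apples, pears;plums  kiwis")

# A capturing group around the delimiter keeps it in the output list
kept = re.split(r"(\s)", "The rain in Spain", maxsplit=1)

print(parts)  # ['apples', 'pears', 'plums', 'kiwis']
print(kept)   # ['The', ' ', 'rain in Spain']
```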


#### Sub

The sub() function replaces the matches with the text of your choice:

In [20]:
# replacing whitespace with underscore

str = "The rain in Spain"
x = re.sub(r"\s", "_", str)
print(x)

The_rain_in_Spain


You can control the number of replacements by specifying the count parameter:

In [22]:
str = "The rain in Spain"
x = re.sub(r"\s", "_", str, count=1)
print(x)

The_rain in Spain
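
The replacement passed to sub() does not have to be fixed text: it can refer back to captured groups, or it can be a function that receives each Match object. A minimal sketch with illustrative patterns:

```python
import re

txt = "The rain in Spain"

# \1 and \2 refer to the first and second captured groups
swapped = re.sub(r"(\w+) in (\w+)", r"\2 in \1", txt)

# A callable is invoked once per match and returns the replacement text
shout = re.sub(r"\b\w*ain\b", lambda m: m.group().upper(), txt)

print(swapped)  # The Spain in rain
print(shout)    # The RAIN in SPAIN
```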


---------------

### Metacharacters

#### [ ] - a set of characters

In [28]:
str = "The rain in Spain"

#Find all lower case characters alphabetically between "a" and "m":
x = re.findall("[a-m]", str)
print(x)

['h', 'e', 'a', 'i', 'i', 'a', 'i']


####  \ - Signals a special sequence 
It can also be used to escape special characters.

See the Special Sequences section below for more details.

In [30]:
#Find all digit characters:

str = "That will be 59 dollars"
x = re.findall(r"\d", str)
print(x)

['5', '9']


#### DOT(.) - Any character
Matches any single character except a newline.

In [32]:
#Search for a sequence that starts with "he", followed by two (any) characters, and an "o":

str = "hello world"
x = re.findall("he..o", str)
print(x)

['hello']
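
To see the newline exception in practice, the sketch below runs the same pattern with and without the re.DOTALL flag, which makes the dot match newlines as well. The sample string is invented for the demonstration.

```python
import re

txt = "he\nlo"

# By default the dot does not match the newline between "he" and "lo"
no_flag = re.findall(r"he.lo", txt)

# With re.DOTALL the dot matches any character, including "\n"
with_flag = re.findall(r"he.lo", txt, flags=re.DOTALL)

print(no_flag)    # []
print(with_flag)  # ['he\nlo']
```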


#### ^ - starts with

In [6]:
str = "hello world"

#Check if the string starts with 'hello':
x = re.findall("^hello", str)
x

['hello']

#### $ - Ends with

In [7]:
str = "hello world"
#Check if the string ends with 'world':
x = re.findall("world$", str)

#### * - Zero or more occurrences

In [9]:
str = "The rain in Spain falls mainly in the plain!"
#Check if the string contains "ai" followed by 0 or more "x" characters:
x = re.findall("aix*", str)
x

['ai', 'ai', 'ai', 'ai']

#### + - One or more occurrences

In [13]:
str = "The rain in Spain falls mainly in the plain!"
#Check if the string contains "l" followed by 1 or more "y" characters:
x = re.findall("ly+", str)
x

['ly']

#### {} - Exactly the specified number of occurrences

In [7]:
str = "The rain in Spain falls mainly in the plain! aall"
#Check if the string ends with exactly two "a" characters followed by exactly two "l" characters:
x = re.findall("a{2}l{2}$", str)
x

['aall']
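
Braces can also express a range: `{m,n}` means between m and n repetitions, and `{m,}` means m or more. The following sketch uses a made-up string and `\b` word boundaries so that each run of "a" is tested as a whole word.

```python
import re

txt = "a aa aaa aaaa"

exactly_two = re.findall(r"\ba{2}\b", txt)      # exactly 2
two_to_three = re.findall(r"\ba{2,3}\b", txt)   # between 2 and 3
at_least_three = re.findall(r"\ba{3,}\b", txt)  # 3 or more

print(exactly_two)     # ['aa']
print(two_to_three)    # ['aa', 'aaa']
print(at_least_three)  # ['aaa', 'aaaa']
```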

#### | - Either Or

In [16]:
str = "The rain in Spain falls mainly in the plain!"
#Check if the string contains either "falls" or "stays":
x = re.findall("falls|stays", str)
x

['falls']

------------

### Special Sequences

A special sequence is a \ followed by one of the characters listed below:

####  \A  - Beginning 

In [20]:
str = "The rain in Spain"
#Check if the string starts with "The":
x = re.findall(r"\AThe", str)
# x = re.findall("\Spain", str)
x

['The']

####  \b - Beginning or End of a word 

In [23]:
str = "The rain in Spain"
#Check if "ain" is present at the beginning of a WORD:
x = re.findall(r"\bain", str)

#Check if "ain" is present at the end of a WORD:
y = re.findall(r"ain\b", str)

x, y

([], ['ain', 'ain'])
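
A common use of `\b` is whole-word matching, with one boundary on each side of the pattern. The sketch below also shows why the raw-string prefix `r` matters here: in an ordinary Python string, `"\b"` is the backspace character rather than a regex word boundary.

```python
import re

txt = "The rain in Spain"

# Without boundaries, "in" is also found inside "rain" and "Spain"
loose = re.findall(r"in", txt)

# With a boundary on both sides, only the standalone word matches
whole_word = re.findall(r"\bin\b", txt)

print(loose)       # ['in', 'in', 'in']
print(whole_word)  # ['in']

# In a non-raw string, "\b" is a single backspace character
print("\b" == "\x08")  # True
```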

####  \B - NOT at the beginning or at the end

In [30]:
str = "The rain in Spain"

#Check if "ain" is present, but NOT at the beginning of a word:
x = re.findall(r"\Bain", str)

#Check if "ain" is present, but NOT at the end of a word:
y = re.findall(r"ain\B", str)

x,y

(['ain', 'ain'], [])

####  \d - Digits

In [32]:
str = "The rain in Spain 8"
#Check if the string contains any digits (numbers from 0-9):
x = re.findall(r"\d", str)
x

['8']

####  \s - white space character

In [33]:
str = "The rain in Spain"
#Return a match at every white-space character:
x = re.findall(r"\s", str)
x

[' ', ' ', ' ']

####  \S - NOT a white space character

In [34]:
str = "The rain in Spain"
#Return a match at every NON white-space character:
x = re.findall(r"\S", str)
x

['T', 'h', 'e', 'r', 'a', 'i', 'n', 'i', 'n', 'S', 'p', 'a', 'i', 'n']

####  \w - Contains any word characters
Word characters are letters (a-z and A-Z), digits from 0-9, and the underscore _ character.

In [38]:
str = "The rain in Spain 8"
#Return a match at every word character (letters a-z and A-Z, digits 0-9, and the underscore _):
x = re.findall(r"\w", str)
x

['T', 'h', 'e', 'r', 'a', 'i', 'n', 'i', 'n', 'S', 'p', 'a', 'i', 'n', '8']

####  \W - Does NOT contain any word characters
Word characters are letters (a-z and A-Z), digits from 0-9, and the underscore _ character.

In [41]:
str = "The rain in Spain 8 * #"
#Return a match at every NON-word character (anything other than letters, digits, and the underscore, such as "!", "?", or white-space):
x = re.findall(r"\W", str)
x

[' ', ' ', ' ', ' ', ' ', '*', ' ', '#']

####  \Z - End of the string

In [47]:
str = "The rain in Spain"
#Check if the string ends with "Spain":
x = re.findall(r"Spain\Z", str)
x

['Spain']
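
`\Z` and `$` look interchangeable on this string, but they differ when the text ends with a newline: `$` also matches just before a trailing newline, while `\Z` matches only at the very end. A small sketch, using a sample string with an added `\n`:

```python
import re

txt = "The rain in Spain\n"

# $ matches at the end of the string or just before a trailing newline
dollar = re.findall(r"Spain$", txt)

# \Z matches only at the absolute end of the string
absolute_end = re.findall(r"Spain\Z", txt)

print(dollar)        # ['Spain']
print(absolute_end)  # []
```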

------------

### Sets

A set is a set of characters inside a pair of square brackets [] with a special meaning:

#### - one of the specified characters

In [67]:
str = "The rain in Spain 8"
# Check if the string has any of the characters s, p, r, T, or 8
x = re.findall("[sprT8]", str)
x

['T', 'r', 'p', '8']

#### - any lower case character in the specified range

In [64]:
str = "The rain in Spain"
#Check if the string has any characters between a and e:
x = re.findall("[a-e]", str)
x

['e', 'a', 'a']

#### - any character EXCEPT

In [62]:
str = "The rain in Spain"
#Return every character that is NOT between a and r:
x = re.findall("[^a-r]", str)
x

['T', ' ', ' ', ' ', 'S']

#### - any digit in the specified range

In [74]:
str = "The rain in Spain 12346768"
#Return every digit between 0 and 4:
x = re.findall("[0-4]", str)
x

['1', '2', '3', '4']

#### - any two-digit number from 00 to 59

In [73]:
str = "8 times before 11:45 AM"
#Check if the string has any two-digit numbers, from 00 to 59:
x = re.findall("[0-5][0-9]", str)
x

['11', '45']

#### - any character combination alphabetically

In [79]:
str = "8 times before 11:45 AM"
#Check if the string has any characters from a to z lower case, and A to Z upper case:
x = re.findall("[a-zA-Z]", str)
x

['t', 'i', 'm', 'e', 's', 'b', 'e', 'f', 'o', 'r', 'e', 'A', 'M']

#### - special characters

In sets, +, *, ., |, (), $, and {} have no special meaning, so [+] means: return a match for any + character in the string.

In [82]:
str = "8 times before 11:45 AM * # "
#Check if the string contains any "*" or "#" characters:
x = re.findall("[*#]", str)
x

['*', '#']
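
Outside a set, the same characters need a backslash to be matched literally, and re.escape() can do that for a whole string. The sketch below compares the approaches on an invented sample string. Note that a few characters remain special even inside a set: a leading `^`, a `-` between two characters, `]`, and `\`.

```python
import re

txt = "1+1=2 (approx.)"

# Inside a set, +, ., ( and ) are treated as literal characters
in_set = re.findall(r"[+.()]", txt)

# Outside a set, + must be escaped to match a literal plus sign
escaped = re.findall(r"\+", txt)

# re.escape() adds the backslashes needed to match text literally
literal = re.escape("1+1")
found = re.search(literal, txt)

print(in_set)         # ['+', '(', '.', ')']
print(escaped)        # ['+']
print(found.group())  # 1+1
```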