https://docs.python.org/3/library/re.html

https://docs.python.org/3/howto/regex.html

In [1]:
import re

Source of text: https://www.bradezone.com/2008/09/13/boring/

In [2]:
# %load text.py
this_text="I go to the store. A car is parked. \
Many cars are parked or moving. Some are blue. \
Some are tan. They have windows. In the store, \
there are items for sale. These include such \
things as soap, detergent, magazines, and lettuce. \
You can enhance your life with these products. \
Soap can be used for bathing, be it in a bathtub \
or in a shower. My email address is myname@sc.edu. \
Apply the soap to your body and rinse. My phone \
number is 452-953-2942. Detergent is used to \
wash clothes. Place your dirty clothes \
into a washing machine and add some detergent \
as directed on the box. Your email is \
aperson@farm.com and your cell is 595-942-2424. \
Select the appropriate settings on your \
Alexs question 953-242 \
washing machine and you should be ready to \
begin. Magazines are stapled reading material \
made with glossy paper, and they cover a wide \
variety of topics, ranging from news and \
politics to business and stock market information."

Look for a number that exists in the text

In [3]:
re.findall("953",this_text)

['953', '953']
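To see where each match occurs rather than just what matched, `re.finditer` yields match objects that carry their positions. A minimal sketch on a shortened stand-in for the text (the name `sample` is an assumption for illustration, not part of the notebook):

```python
import re

# Shortened stand-in for this_text (hypothetical sample)
sample = "My phone number is 452-953-2942. Alexs question 953-242"

# finditer yields one match object per hit; span() returns (start, end)
positions = [m.span() for m in re.finditer("953", sample)]

for start, end in positions:
    # Slicing with the span recovers the matched text
    print(start, end, sample[start:end])
```

Both hits are reported, including the one inside the incomplete number, because a literal pattern matches wherever those characters appear.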

Look for a set of digits in a specific pattern

In [4]:
re.findall(r"\d\d\d-\d\d\d-\d\d\d\d",this_text)

['452-953-2942', '595-942-2424']


Same goal, but more concisely

https://medium.com/factory-mind/regex-tutorial-a-simple-cheatsheet-by-examples-649dc1c3f285

In [5]:
re.findall(r"\d{3}-\d{3}-\d{4}",this_text)

['452-953-2942', '595-942-2424']
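As a quick check that the two spellings are interchangeable, the sketch below runs both patterns on a shortened stand-in for the text (`sample` is a hypothetical name). It also shows that the incomplete number 953-242 is skipped, because `{4}` requires exactly four digits in the last group.

```python
import re

# Shortened stand-in for this_text (hypothetical sample)
sample = "My phone number is 452-953-2942. Alexs question 953-242"

# One \d per digit versus a repetition count in braces
long_form = re.findall(r"\d\d\d-\d\d\d-\d\d\d\d", sample)
short_form = re.findall(r"\d{3}-\d{3}-\d{4}", sample)

print(long_form == short_form)
print(short_form)
```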

For more patterns, see http://regexlib.com/Search.aspx?k=phone


Look for an email address

In [6]:
re.findall(r"@[a-z]",this_text)

['@s', '@f']

To get all the letters in the domain name, add +, which matches one or more of the preceding character class

In [7]:
re.findall(r"@[a-z]+",this_text)

['@sc', '@farm']

To get the letters preceding the @, add another character class in front of it

In [8]:
re.findall(r"[a-z]+@[a-z]+",this_text)

['myname@sc', 'aperson@farm']

Including the top-level domain requires escaping the . because an unescaped period is a special character in regex that matches any character

In [9]:
re.findall(r"[a-z]+@[a-z]+\.[a-z]+",this_text)

['myname@sc.edu', 'aperson@farm.com']
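To illustrate why the backslash matters, here is a small sketch comparing the escaped and unescaped period. The sample string, with its deliberately malformed second address, is made up for this example.

```python
import re

# Hypothetical sample: the second "address" uses a hyphen instead of a period
sample = "write to myname@sc.edu, not myname@sc-edu"

# Unescaped: . matches any character, so the hyphen is accepted too
unescaped = re.findall(r"[a-z]+@[a-z]+.[a-z]+", sample)

# Escaped: \. matches only a literal period
escaped = re.findall(r"[a-z]+@[a-z]+\.[a-z]+", sample)

print(unescaped)
print(escaped)
```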

Once a Regular Expression gets complicated, use comments to convey your intention

In [10]:
re.findall(r"""[a-z]+  # user name
               @
               [a-z]+  # domain
               \.
               [a-z]+  # TLD, https://en.wikipedia.org/wiki/Top-level_domain
               """,this_text, re.VERBOSE)

['myname@sc.edu', 'aperson@farm.com']

That pattern for emails is not robust. For better versions, see http://regexlib.com/Search.aspx?k=email
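One way to see the fragility: the pattern above allows only lowercase letters, so an address containing digits or a period in the user name is missed entirely. The sketch below contrasts it with a somewhat broader pattern. Both the sample address and the broader pattern are assumptions for illustration, and the broader pattern is still far from a complete email validator.

```python
import re

# Hypothetical address with a period and digits in the user name
sample = "Contact first.last99@mail.example.org."

# The notebook's pattern: lowercase letters only
narrow = re.findall(r"[a-z]+@[a-z]+\.[a-z]+", sample)

# A broader sketch: word characters, periods, plus and hyphen in the user
# name, and one or more dot-separated labels in the domain
broader = re.findall(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", sample)

print(narrow)
print(broader)
```

The group in the broader pattern is non-capturing, `(?:...)`, so `re.findall` returns the whole match rather than only the group's contents.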

# Use of regex for string replacement

In this example, we'll partially anonymize the phone numbers by replacing the last four digits with XXXX

In [11]:
re.sub(r"\d{4}", "XXXX", this_text)

'I go to the store. A car is parked. Many cars are parked or moving. Some are blue. Some are tan. They have windows. In the store, there are items for sale. These include such things as soap, detergent, magazines, and lettuce. You can enhance your life with these products. Soap can be used for bathing, be it in a bathtub or in a shower. My email address is myname@sc.edu. Apply the soap to your body and rinse. My phone number is 452-953-XXXX. Detergent is used to wash clothes. Place your dirty clothes into a washing machine and add some detergent as directed on the box. Your email is aperson@farm.com and your cell is 595-942-XXXX. Select the appropriate settings on your Alexs question 953-242 washing machine and you should be ready to begin. Magazines are stapled reading material made with glossy paper, and they cover a wide variety of topics, ranging from news and politics to business and stock market information.'
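Note that `\d{4}` replaces any run of four digits, not only those inside a phone number. A minimal sketch, using a made-up sample string, compares it with substituting on the full phone pattern:

```python
import re

# Hypothetical sample containing a phone number and an unrelated year
sample = "Call 452-953-2942 before 2025."

# Any four consecutive digits are replaced, including the year
last_four_only = re.sub(r"\d{4}", "XXXX", sample)

# Matching the whole phone pattern masks the full number and leaves the year
whole_number = re.sub(r"\d{3}-\d{3}-\d{4}", "XXX-XXX-XXXX", sample)

print(last_four_only)
print(whole_number)
```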

We can compile a regular expression for repeated use. First, split the text into pieces so the same pattern can be applied to each one

In [12]:
list_of_lines = this_text.split('.')

In [13]:
list_of_lines

['I go to the store',
 ' A car is parked',
 ' Many cars are parked or moving',
 ' Some are blue',
 ' Some are tan',
 ' They have windows',
 ' In the store, there are items for sale',
 ' These include such things as soap, detergent, magazines, and lettuce',
 ' You can enhance your life with these products',
 ' Soap can be used for bathing, be it in a bathtub or in a shower',
 ' My email address is myname@sc',
 'edu',
 ' Apply the soap to your body and rinse',
 ' My phone number is 452-953-2942',
 ' Detergent is used to wash clothes',
 ' Place your dirty clothes into a washing machine and add some detergent as directed on the box',
 ' Your email is aperson@farm',
 'com and your cell is 595-942-2424',
 ' Select the appropriate settings on your Alexs question 953-242 washing machine and you should be ready to begin',
 ' Magazines are stapled reading material made with glossy paper, and they cover a wide variety of topics, ranging from news and politics to business and stock market information',
 '']

In [14]:
myregex = re.compile(r"\d{4}")
for line in list_of_lines:
    print(myregex.sub("XXXX", line))

I go to the store
 A car is parked
 Many cars are parked or moving
 Some are blue
 Some are tan
 They have windows
 In the store, there are items for sale
 These include such things as soap, detergent, magazines, and lettuce
 You can enhance your life with these products
 Soap can be used for bathing, be it in a bathtub or in a shower
 My email address is myname@sc
edu
 Apply the soap to your body and rinse
 My phone number is 452-953-XXXX
 Detergent is used to wash clothes
 Place your dirty clothes into a washing machine and add some detergent as directed on the box
 Your email is aperson@farm
com and your cell is 595-942-XXXX
 Select the appropriate settings on your Alexs question 953-242 washing machine and you should be ready to begin
 Magazines are stapled reading material made with glossy paper, and they cover a wide variety of topics, ranging from news and politics to business and stock market information
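Splitting on every period also cuts the email addresses in two, as the output above shows. One possible alternative, sketched here on a shortened sample (the names `sample`, `sentences`, and `phone_tail` are hypothetical), is to split only on a period followed by whitespace with `re.split`, then apply the compiled pattern to each piece:

```python
import re

# Shortened stand-in for this_text (hypothetical sample)
sample = ("My email address is myname@sc.edu. "
          "My phone number is 452-953-2942. "
          "Detergent is used to wash clothes.")

# Split on a period only when whitespace follows, so myname@sc.edu survives
sentences = re.split(r"\.\s+", sample)

# Compile once, reuse for every sentence
phone_tail = re.compile(r"\d{4}")
masked = [phone_tail.sub("XXXX", s) for s in sentences]

for line in masked:
    print(line)
```

The final sentence keeps its trailing period because no whitespace follows it. This is a rough heuristic rather than a real sentence tokenizer; abbreviations such as "e.g. " would still be split.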



# Documentation 

As an example of why documentation matters, try to figure out what this regex does:

In [15]:
a = re.compile(r"\d+\.\d*")

compared to this commented version:

In [16]:
a = re.compile(r"""\d +  # the integer part
                   \.    # the decimal point, aka radix: https://en.wikipedia.org/wiki/Radix
                   \d *  # some fractional digits""", re.VERBOSE)
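A quick check that the commented pattern behaves like its one-line form `\d+\.\d*`: under `re.VERBOSE`, whitespace and `#` comments inside the pattern are ignored, so `\d +` means `\d+`. The sample string and the names `compact` and `verbose` are assumptions for this sketch.

```python
import re

compact = re.compile(r"\d+\.\d*")
verbose = re.compile(r"""\d +  # the integer part
                         \.    # the decimal point
                         \d *  # some fractional digits""", re.VERBOSE)

# Hypothetical sample with two decimal numbers
sample = "pi is about 3.14159 and e is about 2.718"

compact_hits = compact.findall(sample)
verbose_hits = verbose.findall(sample)

print(compact_hits == verbose_hits)
print(verbose_hits)
```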