In [3]:
import re

### RegEx 

https://docs.python.org/3/library/re.html

Regular expressions provide a flexible way to search for or match string patterns
in text, including patterns too complex for plain string methods. A single expression,
commonly called a regex, is a string formed according to the regular expression
language. Python’s built-in `re` module applies regular expressions to strings.

Functions:

- `findall`: returns a list containing all matches
- `search`: returns a Match object for the first match anywhere in the string, or `None` if there is no match. If there is more than one match, only the first occurrence is returned.
- `split`: returns a list where the string has been split at each match
- `sub`: replaces one or many matches with a string
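
As a quick orientation, the four functions can be tried side by side on one string. This is a minimal sketch; the sample text and variable names are made up for illustration.

```python
import re

sample = "red, green; blue"  # hypothetical sample text

# findall: every run of lowercase letters
words = re.findall(r"[a-z]+", sample)

# search: only the first match, wrapped in a Match object
first = re.search(r"[a-z]+", sample)

# split: cut the string at each separator (comma or semicolon plus spaces)
parts = re.split(r"[,;]\s*", sample)

# sub: replace every separator with a single pipe
joined = re.sub(r"[,;]\s*", "|", sample)

print(words)          # ['red', 'green', 'blue']
print(first.group())  # red
print(parts)          # ['red', 'green', 'blue']
print(joined)         # red|green|blue
```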

In [15]:
alphanumeric = "4298fsfsDFGHv012rvv21v9"

In [16]:
# use findall to pull out the letters only
# note: [A-Za-z] rather than [A-z], which would also match [ \ ] ^ _ and `
re.findall("[A-Za-z]", alphanumeric)

['f', 's', 'f', 's', 'D', 'F', 'G', 'H', 'v', 'r', 'v', 'v', 'v']
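
A range written as `[A-z]` looks like shorthand for "all letters", but in ASCII it also covers the six punctuation characters that sit between `Z` and `a`. The sketch below, using a made-up sample string, shows the difference from `[A-Za-z]`.

```python
import re

sample = "a_b^C[d]"  # hypothetical string containing punctuation from the Z..a gap

wide = re.findall("[A-z]", sample)        # letters plus _ ^ [ ]
letters = re.findall("[A-Za-z]", sample)  # letters only

print(wide)     # ['a', '_', 'b', '^', 'C', '[', 'd', ']']
print(letters)  # ['a', 'b', 'C', 'd']
```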

In [17]:
#findall using a known pattern can be used to pull pertinent information out of a text value
text = "Sian sian@google.com"
pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'

In [18]:
regex = re.compile(pattern, flags=re.IGNORECASE) #ignore the case of A-Z
regex.findall(text)

['sian@google.com']

In [19]:
#using findall to split out the parts of the email address by amending the pattern with () 
text = "Sian sian@google.com"
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'

In [20]:
regex = re.compile(pattern, flags=re.IGNORECASE) #ignore the case of A-Z
regex.findall(text)

[('sian', 'google', 'com')]
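
The same captured parts can also be read from a single Match object, which is handy when only the first address matters. A minimal sketch, repeating the pattern so the cell runs on its own; the names `user`, `domain` and `suffix` are illustrative.

```python
import re

# Same grouped pattern as above, repeated so this cell is self-contained
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
regex = re.compile(pattern, flags=re.IGNORECASE)

match = regex.search("Sian sian@google.com")
if match:  # search returns None when nothing matches
    user, domain, suffix = match.groups()
    print(user, domain, suffix)  # sian google com
```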

In [21]:
my_string = "Kosta likes climbing. Kosta is a great TA so he also loves data"

In [22]:
# return all occurrences of 'Kosta' using re.findall()

re.findall("Kosta ", my_string)

['Kosta ', 'Kosta ']

In [23]:
# use re.sub() to replace "TA" by "Triceratops Alligator"

my_string = re.sub("TA", "Triceratops Alligator", my_string)

In [24]:
my_string

'Kosta likes climbing. Kosta is a great Triceratops Alligator so he also loves data'

In [25]:
x = re.search("ove", my_string)
print(x)

<re.Match object; span=(73, 76), match='ove'>


In [26]:
x = re.search(r"\bT\w+", my_string)
print(x.span())

(39, 50)


In [27]:
print(x.group())

Triceratops


In [28]:
multiples= "ear       hand  foot knee"

In [29]:
# use split with \s+ to split the passed text on runs of whitespace
re.split(r'\s+', multiples)

['ear', 'hand', 'foot', 'knee']
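
One thing to watch for: when the text starts or ends with whitespace, `re.split` returns empty strings at the edges. A small sketch with a made-up padded string; stripping first is one possible way around it.

```python
import re

padded = "  ear   hand  "  # hypothetical input with leading and trailing spaces

raw_parts = re.split(r"\s+", padded)
clean_parts = re.split(r"\s+", padded.strip())

print(raw_parts)    # ['', 'ear', 'hand', '']
print(clean_parts)  # ['ear', 'hand']
```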

**The Match object** has attributes and methods for retrieving information about the search and its result:

- `.span()` returns a tuple containing the start and end positions of the match
- `.string` returns the string passed into the function
- `.group()` returns the part of the string that matched
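
The three members listed above can be inspected on one match. A minimal sketch on a made-up sentence:

```python
import re

sentence = "The rain in Spain"  # hypothetical sample
m = re.search(r"\bS\w+", sentence)

print(m.span())    # (12, 17)
print(m.string)    # The rain in Spain
print(m.group())   # Spain

# the span indexes back into the original string
start, end = m.span()
print(sentence[start:end])  # Spain
```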

### Special Sequences

| Sequence | Description | Example |
|---|---|---|
| `\A` | Matches only if the specified characters are at the beginning of the string | `r"\AThe"` |
| `\b` | Matches where the specified characters are at the beginning or at the end of a word | `r"\bain"`, `r"ain\b"` |
| `\B` | Matches where the specified characters are present, but NOT at the beginning (or at the end) of a word | `r"\Bain"`, `r"ain\B"` |
| `\d` | Matches a digit (0-9) | `r"\d"` |
| `\D` | Matches any character that is NOT a digit | `r"\D"` |
| `\s` | Matches a whitespace character | `r"\s"` |
| `\S` | Matches any character that is NOT whitespace | `r"\S"` |
| `\w` | Matches a word character (letters a-z and A-Z, digits 0-9, and the underscore `_`) | `r"\w"` |
| `\W` | Matches any character that is NOT a word character | `r"\W"` |
| `\Z` | Matches only if the specified characters are at the end of the string | `r"Spain\Z"` |
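
A few of these sequences can be compared on one sentence. This is a sketch on a made-up sample; raw strings (`r"..."`) are used so Python does not try to interpret the backslashes itself.

```python
import re

text = "The rain in Spain stays mainly in the plain 24/7"  # hypothetical sample

starts = re.findall(r"\bain", text)   # "ain" at the start of a word
ends = re.findall(r"ain\b", text)     # "ain" at the end of a word
inside = re.findall(r"\Bain", text)   # "ain" NOT at the start of a word
digits = re.findall(r"\d", text)      # individual digits
at_start = re.search(r"\AThe", text) is not None

print(starts)    # []
print(ends)      # ['ain', 'ain', 'ain']  (rain, Spain, plain)
print(inside)    # ['ain', 'ain', 'ain', 'ain']  (rain, Spain, mainly, plain)
print(digits)    # ['2', '4', '7']
print(at_start)  # True
```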

In [None]:
strings = ["there was a dog and there was a cat", 
           "if you capitalize this part of the string you will be in trouble",
           "this is the end of the string"]

In [None]:
# Use a special sequence to capitalize the strings above without getting into trouble
for string in strings:
    print(re.sub(r"\At", "T", string))

In [9]:
quotes = ["work hard all day, all days", 
          "There are 3 types of people: those who can count and those who can't",
          "Nice to be nice",
          "Some people feel the rain, others just get wet",
          "could you complete the exercise? wow"
         ]

In [10]:
#what will this capitalise?
for i in range(len(quotes)):
    quotes[i] = re.sub(r"\sw", " W", quotes[i])

In [11]:
quotes

['work hard all day, all days',
 "There are 3 types of people: those Who can count and those Who can't",
 'Nice to be nice',
 'Some people feel the rain, others just get Wet',
 'could you complete the exercise? Wow']

In [12]:
#what will this capitalise?
for quote in quotes:
    print(re.sub(r"\bw","W",quote))

Work hard all day, all days
There are 3 types of people: those Who can count and those Who can't
Nice to be nice
Some people feel the rain, others just get Wet
could you complete the exercise? Wow


In [4]:
# use a special sequence to find the numbers in the string
some_nums = "I have had 3 coffees this morning and I plan to drink 7 more"

In [5]:
re.findall(r"\d", some_nums)

['3', '7']

### `+` One or more occurrences

In [None]:
# use re.sub() together with + to collapse runs of repeated spaces

spaces = "I   have too   many     spaces"
re.sub(" +", " ", spaces)
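
The same quantifier also matters for numbers: `\d` alone returns one digit at a time, while `\d+` keeps multi-digit numbers together. A minimal sketch on a made-up string:

```python
import re

order = "Order 66 shipped 1024 units in 3 boxes"  # hypothetical sample

single = re.findall(r"\d", order)    # one match per digit
grouped = re.findall(r"\d+", order)  # one match per run of digits

print(single)   # ['6', '6', '1', '0', '2', '4', '3']
print(grouped)  # ['66', '1024', '3']
```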

### `^` Starts with

In [None]:
# print all veggies that start with a
veggies = ["tomato", "potato", "apple juice",
           "pear", "asparagus are tasty", "peach"]
for veg in veggies:
    print(re.findall(r"^a\S*", veg))
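
If the goal is to keep only the matching items rather than print a list per item, `^` can be used as a filter. A sketch, with the list repeated so the cell runs on its own; `starts_with_a` is an illustrative name.

```python
import re

veggies = ["tomato", "potato", "apple juice",
           "pear", "asparagus are tasty", "peach"]

# re.search returns None (falsy) when the string does not start with "a"
starts_with_a = [veg for veg in veggies if re.search(r"^a", veg)]

print(starts_with_a)  # ['apple juice', 'asparagus are tasty']
```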