# NLP Basics: Learning how to use Regular Expressions

## Using regular expressions in Python

Python's built-in re module is the most commonly used regex resource.

In [2]:
import re 

re_test = "Hello Everybody Regular expression 2nd tutorial"
re_test_messy = "Hello         Everybody Regular   expression 2nd  tutorial"
re_test_messy1 = "Hello-Everybody-Regular.expression>>>>>>>2nd''''''tutorial"

## Splitting a sentence into a list of words

In [3]:
re.split(r'\s', re_test)  # raw string avoids an invalid-escape warning for \s

['Hello', 'Everybody', 'Regular', 'expression', '2nd', 'tutorial']

In [4]:
re.split(r'\s', re_test_messy)  # splits on every single whitespace character

['Hello',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 'Everybody',
 'Regular',
 '',
 '',
 'expression',
 '2nd',
 '',
 'tutorial']

In [6]:
re.split(r'\s+', re_test_messy)  # \s+ treats a run of whitespace as one separator

['Hello', 'Everybody', 'Regular', 'expression', '2nd', 'tutorial']

In [7]:
re.split(r'\s+', re_test_messy1)  # no whitespace in this string, so nothing is split

["Hello-Everybody-Regular.expression>>>>>>>2nd''''''tutorial"]

In [8]:
re.split(r'\W+', re_test_messy1)  # \W+ splits on runs of non-word characters

['Hello', 'Everybody', 'Regular', 'expression', '2nd', 'tutorial']

In [9]:
messy_me = " hello guys will are----- testing>>>something...new"

In [11]:
ls = re.split(r'\W+', messy_me)  # the leading space produces an empty first item
print(ls)

['', 'hello', 'guys', 'will', 'are', 'testing', 'something', 'new']


In [15]:
messym = " ".join(ls)

In [17]:
re.split(r'\s+', messym.strip())  # strip() removes the leading space first

['hello', 'guys', 'will', 'are', 'testing', 'something', 'new']

In [18]:
re.findall(r'\S+', re_test)  # \S+ collects runs of non-whitespace characters

['Hello', 'Everybody', 'Regular', 'expression', '2nd', 'tutorial']

In [19]:
re.findall(r'\S+', re_test_messy)

['Hello', 'Everybody', 'Regular', 'expression', '2nd', 'tutorial']

In [20]:
re.findall(r'\S+', re_test_messy1)

["Hello-Everybody-Regular.expression>>>>>>>2nd''''''tutorial"]

In [21]:
re.findall(r'\w+', re_test_messy1)  # \w+ collects runs of word characters

['Hello', 'Everybody', 'Regular', 'expression', '2nd', 'tutorial']

## Replacing a specific string

In [22]:
pep9_test = 'I try to follow PEP9 guidelines'
pep10_test = 'I try to follow PEP10 guidelines'
peep9_test = 'I try to follow PEEP9 guidelines'

In [23]:
import re 

re.findall('[a-z]+', pep9_test)

['try', 'to', 'follow', 'guidelines']

In [24]:
re.findall('[A-Z]+', pep9_test)

['I', 'PEP']

In [28]:
re.findall('[A-Z]+[0-9]+', peep9_test)

['PEEP9']

In [31]:
re.sub('[A-Z]+[0-9]+','PEP9 Python Styleguide', pep10_test)

'I try to follow PEP9 Python Styleguide guidelines'

***
## Other examples of regex methods
 - re.search()
 - re.match()
 - re.fullmatch()
 - re.finditer()
 - re.escape()

In [43]:
txts = "Search smthg in the String"
x = re.search("smthg", txts)
print(x)          # prints the Match object itself
print(x.start())  # index where the match begins

<re.Match object; span=(7, 12), match='smthg'>
7


### The Match object has properties and methods for retrieving information about the search and its result:

 - span() returns a tuple containing the start and end positions of the match.
 - string returns the string passed into the function.
 - group() returns the part of the string where there was a match.


In [44]:
txts = "Search smthg in the String"
x = re.search(r"\bS\w+", txts)
print(x.span())  # (start, end) positions of the first match


(0, 6)


In [45]:
txts = "Search smthg in the String"
x = re.search(r"\bS\w+", txts)
print(x.string)

Search smthg in the String


In [46]:
txts = "Search smthg in the String"
x = re.search(r"\bS\w+", txts)
print(x.group())

Search


In [47]:
txts = "Search smthg in the String"
x = re.search(r"\bs\w+", txts)
print(x.span())

(7, 12)


In [57]:
txts = "Search smthg in the String"
x = re.match(r"\bS\w+", txts)
print(x)

<re.Match object; span=(0, 6), match='Search'>


In [58]:
txts = "Search smthg in the String"
x = re.fullmatch(r"\bS\w+", txts)
print(x)

None


## re.search() vs re.match()

The two functions differ in where they look. Both return a Match object for the first match, or None if there is none, but re.match() only tries the pattern at the very beginning of the string. If the pattern first matches anywhere else, including at the start of a later line in a multi-line string, re.match() returns None, even when the re.MULTILINE flag is set.
re.search() scans the whole string, across all of its lines, and returns the first position where the pattern matches.
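
A minimal sketch of that difference on a two-line string. The sample string and variable names are made up for illustration and are not part of the original notebook.

```python
import re

# Hypothetical two-line sample; the only capital "S" words are on line two.
text = "first line\nSecond line starts with S"

# re.match() only tries the pattern at index 0 of the string.
m_plain = re.match(r"S\w+", text)
print(m_plain)                      # None

# re.search() scans the whole string, including later lines.
s_plain = re.search(r"S\w+", text)
print(s_plain.group())              # Second

# re.MULTILINE lets ^ match at the start of each line for re.search(),
# but re.match() is still tied to the start of the string.
m_multi = re.match(r"^S\w+", text, flags=re.MULTILINE)
s_multi = re.search(r"^S\w+", text, flags=re.MULTILINE)
print(m_multi)                      # None
print(s_multi.group())              # Second
```

Because both functions can return None, it is usually safer to check the result before calling group() on it.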

### According to the Python docs,

re.finditer(pattern, string, flags=0)

Return an iterator yielding Match objects over all non-overlapping matches for the RE pattern in string. The string is scanned left-to-right, and matches are returned in the order found. Empty matches are included in the result.

Cell In [65] further down shows re.finditer() in use on a short string.
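
As an additional sketch, here is re.finditer() on a string with several matches, collecting the text and position of each one. The sentence and the pattern are illustrative choices, not taken from the notebook.

```python
import re

# Hypothetical sample sentence with four capitalised words starting with "B".
sentence = "Blue Berries and Black Berries"

# Each item yielded by re.finditer() is a Match object, so the position
# of every occurrence is available, not just the first one.
positions = [(m.group(), m.start(), m.end())
             for m in re.finditer(r"B\w+", sentence)]
print(positions)
# [('Blue', 0, 4), ('Berries', 5, 12), ('Black', 17, 22), ('Berries', 23, 30)]
```

re.findall() would return only the matched strings here; re.finditer() is the one to reach for when the positions matter too.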

#### re.escape(pattern)

Return pattern with special characters escaped; this is useful if you want to match an arbitrary literal string that may have regular expression metacharacters in it. Since Python 3.7, only characters that can have a special meaning in a regular expression are escaped; earlier versions backslashed every non-alphanumeric character.

How to escape regular expression characters in Python
Meta-characters have special meanings in regular expressions. For example, "(" and ")" delimit a group. To treat them as literal characters, place a "\" before them, as in "\(" and "\)".

Use re.escape() to treat meta-characters as literal characters
Call re.escape(pattern) to place a "\" before all meta-characters in pattern.
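
A small sketch of why this matters when searching for a literal string that happens to contain meta-characters. The text and the literal below are invented for illustration.

```python
import re

# Hypothetical text and literal; the literal contains "(", ")" and ".".
text = "Total cost (approx.) is $5.00 + tax"
literal = "(approx.)"

# Unescaped, the parentheses form a group and "." matches any character,
# so the parentheses in the text are not part of the match.
unescaped = re.search(literal, text)
print(unescaped.group())    # approx.

# Escaped, every meta-character is matched literally.
escaped = re.search(re.escape(literal), text)
print(escaped.group())      # (approx.)
```

This is mainly useful when the literal comes from user input or another source that cannot be hand-escaped ahead of time.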



### What is an f-string in Python 3?
An f-string (formatted string literal) is a string-formatting syntax available since Python 3.6. Compared with %-formatting and str.format(), f-strings are generally faster, more readable, more concise, and less error-prone. They take an f prefix and evaluate any expression placed inside {} braces.
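
A minimal sketch comparing %-formatting with the equivalent f-string inside a re.finditer() loop. The names old_style and new_style are introduced here only for the comparison.

```python
import re

s1 = "Blue Berries"
pattern = "Berries"

for match in re.finditer(pattern, s1):
    s, e = match.start(), match.end()
    # Older %-formatting: values are supplied in a separate tuple.
    old_style = 'String match "%s" at %d:%d' % (s1[s:e], s, e)
    # f-string: expressions are written directly inside the braces.
    new_style = f'String match "{s1[s:e]}" at {s}:{e}'
    print(new_style)    # String match "Berries" at 5:12
```

Both lines build the same string; the f-string simply keeps each value next to the place where it appears in the output.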

In [65]:
import re
s1 = 'Blue Berries'
pattern = 'Berries'
for match in re.finditer(pattern, s1):
    s = match.start()
    e = match.end()
    print('String match "%s" at %d:%d' % (s1[s:e], s, e))

String match "Berries" at 5:12


In [67]:
import re
print(re.escape("Hello 123 .?!@ World"))

Hello\ 123\ \.\?!@\ World
