Good tutorial URLs:
    https://www.machinelearningplus.com/python/python-regex-tutorial-examples/
    https://www.rexegg.com/regex-disambiguation.html#lookarounds

In [1]:
import re
regex = re.compile(r'\s+')  # raw string, so '\s' reaches the regex engine unchanged

- The code above imports the 're' package and compiles a regular expression pattern that matches one or more whitespace characters.
- If you intend to use a particular pattern multiple times, you are better off compiling it once and reusing the compiled object.
- '\s' matches any single whitespace character, including spaces, tabs ('\t') and newlines ('\n'). Adding '+' makes the pattern match a run of one or more of them.

- In general, a '+' after an element requires at least one occurrence of it. For example, '\d+' needs at least one digit in order to match.
- Similarly, '*' matches zero or more occurrences. '\d*' therefore makes the digits optional, so it can match even where no digit is present.
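
The difference between '+' and '*' is easiest to see on digits. The snippet below is a minimal sketch; the sample strings and variable names are made up for illustration and are not part of the original notebook.

```python
import re

sample = "room 42, floor , level 7"

# '+' needs at least one digit, so only real runs of digits are returned.
one_or_more = re.findall(r"\d+", sample)

# '*' also accepts zero digits, so it produces an empty match at every
# position where no digit is present.
zero_or_more = re.findall(r"\d*", "a1b")

print(one_or_more)   # ['42', '7']
print(zero_or_more)  # ['', '1', '', '']
```

The empty strings in the second result are why '*' is rarely what you want with findall.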

In [2]:
text = """101 COM    Computers
205 MAT   Mathematics
189 ENG   English"""

I have three course items in the format “[Course Number] [Course Code] [Course Name]”. The spacing between the words is not uniform.

I want to split these three course items into individual units of numbers and words. How can that be done?

In [3]:
regex.split(text)

['101',
 'COM',
 'Computers',
 '205',
 'MAT',
 'Mathematics',
 '189',
 'ENG',
 'English']

### 4. Finding pattern matches using findall, search and match
#### Let’s suppose you want to extract all the course numbers, that is, the numbers 101, 205 and 189 alone from the above text. How to do that?

In [4]:
regex_num = re.compile('[0-9]+')
regex_num.findall(text)

['101', '205', '189']

#### The findall method extracts every run of one or more digits from the text and returns them in a list.

In [5]:
text2 = """COM    Computers
205 MAT   Mathematics 189"""
regex_num.search(text2)

<_sre.SRE_Match object; span=(17, 20), match='205'>

In [6]:
regex_num.match(text2)

match() returned nothing, whereas search() returned a match object.
#### regex.search() returns a match object that holds the start and end positions of the first occurrence of the pattern.
#### Likewise, regex.match() returns a match object, but only if the pattern matches at the very beginning of the text; otherwise it returns None.

#### To get the matched text from .match() or .search(), use the group() method.

In [7]:
result = regex_num.search(text2)
result.group()

'205'
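
Because match() and search() return None when nothing matches, calling group() on the result directly can raise an AttributeError. The helper below is a hedged sketch of one way to guard against that; `first_number` and its `anchored` parameter are hypothetical names introduced here for illustration only.

```python
import re

regex_num = re.compile(r"[0-9]+")
text2 = "COM    Computers\n205 MAT   Mathematics 189"

def first_number(pattern, text, anchored=False):
    """Return the first matched text, or None when nothing matches."""
    # match() only tries position 0; search() scans the whole string.
    result = pattern.match(text) if anchored else pattern.search(text)
    return result.group() if result is not None else None

print(first_number(regex_num, text2))                 # 205
print(first_number(regex_num, text2, anchored=True))  # None
```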

### 5. How to substitute one text with another using regex?
#### To replace text, use regex.sub().


In [8]:
text = """101   COM \t  Computers
205   MAT \t  Mathematics
189   ENG  \t  English"""  
print(text)

101   COM 	  Computers
205   MAT 	  Mathematics
189   ENG  	  English


From the above text, I want to even out all the extra spaces and put all the words in one single line.

In [9]:
regex = re.compile(r'\s+')  # matches any run of whitespace, including tabs (\t) and newlines (\n)
regex.sub(' ', text)  # replace each run of whitespace with a single space

'101 COM Computers 205 MAT Mathematics 189 ENG English'

Suppose you only want to get rid of the extra spaces but keep each course entry on its own line.

This can be done using a negative lookahead (?!\n). It checks whether the next character is a newline and, if so, prevents the pattern from matching there.
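
A minimal sketch of that idea follows. It assumes the same course text as above, and the pattern `(?:(?!\n)\s)+` is one possible way to write it, not necessarily the one the referenced tutorial uses. The lookahead is applied before every whitespace character, so a newline is never consumed even when spaces precede it.

```python
import re

text = "101   COM \t  Computers\n205   MAT \t  Mathematics\n189   ENG  \t  English"

# (?!\n) is checked before each whitespace character, so newlines are
# left alone while runs of spaces and tabs collapse to a single space.
regex = re.compile(r"(?:(?!\n)\s)+")
evened = regex.sub(" ", text)

print(evened)
```

Each course entry stays on its own line, with single spaces between the fields.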

Question:  Write a Python program to check that a string contains only a certain set of characters (in this case a-z, A-Z and 0-9).

In [10]:
def check(str_test):
    # The anchors ^ and $ force the character class to cover the whole string.
    regex = re.compile(r'^[a-zA-Z0-9]+$')
    match = regex.findall(str_test)
    if match:
        print('passed')
    else:
        print('failed')

In [11]:
str_test = "sdfdh90"
check(str_test)

passed


In [12]:
str_test_failing = "sdf fgfg 90"
check(str_test_failing)

failed
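
An alternative sketch for the same check uses fullmatch(), which succeeds only when the entire string matches, so no anchors are needed. It also rejects a trailing newline, which a pattern ending in '$' would accept. The names `ALLOWED` and `is_alphanumeric` are hypothetical and introduced here only for illustration.

```python
import re

ALLOWED = re.compile(r"[a-zA-Z0-9]+")

def is_alphanumeric(candidate):
    # fullmatch() requires the pattern to span the whole string.
    return ALLOWED.fullmatch(candidate) is not None

print(is_alphanumeric("sdfdh90"))      # True
print(is_alphanumeric("sdf fgfg 90"))  # False
print(is_alphanumeric("sdfdh90\n"))    # False
```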


Write a Python program that matches a string that has an a followed by zero or more b's.

In [13]:
def check(test):
    regex = re.compile('ab*')
    return regex.findall(test)

In [14]:
pass_str = 'ababab'
fail_str = 'dsdfds'
if len(check(pass_str)):
    print('passed')
if not len(check(fail_str)):
    print('passed')

passed
passed


Write a Python program that matches a string that has an a followed by one or more b's

In [15]:
def check(test):
    regex = re.compile('ab+')
    return regex.findall(test)

pass_str = 'abbabab'
fail_str = 'adsdfds'
if len(check(pass_str)):
    print('passed')
if not len(check(fail_str)):
    print('passed')

passed
passed


Write a Python program that matches a string that has an a followed by zero or one 'b'

In [16]:
def check(test):
    regex = re.compile('ab?')
    return regex.findall(test)

pass_str = 'aababab'
fail_str = 'sdfds'
if len(check(pass_str)):
    print('pass_str passed')
if not len(check(fail_str)):
    print('passed')

pass_str passed
passed


Write a Python program that matches a string that has an a followed by three 'b'

In [17]:
def check(test):
    regex = re.compile('ab{3}')
    return regex.findall(test)

pass_str = 'aabbbabab'
fail_str = 'abbabsdfds'
if len(check(pass_str)):
    print('pass_str passed')
if not len(check(fail_str)):
    print('passed')

pass_str passed
passed


Write a Python program that matches a string that has an a followed by two to three 'b'.

In [18]:
def check(test):
    regex = re.compile('ab{2,3}')
    return regex.findall(test)

pass_str = 'aabbbabbab'
fail_str = 'ababsdfds'
if len(check(pass_str)):
    print('pass_str passed')
if not len(check(fail_str)):
    print('passed')

pass_str passed
passed


Write a Python program to find sequences of lowercase letters joined with an underscore.

In [19]:
def check(test):
    regex = re.compile('[a-z]+(?=_[a-z]+)')
    return regex.findall(test)

pass_str = 'aabb_babbab'
fail_str = 'aabb__babbab'
if len(check(pass_str)):
    print('pass_str passed')
if not len(check(fail_str)):
    print('passed')

pass_str passed
passed


Write a Python program to find sequences of one upper case letter followed by lower case letters.

In [20]:
def check(test):
    regex = re.compile('[A-Z](?=[a-z]+)')
    return regex.findall(test)

pass_str = 'AaabbCbabbab'
fail_str = 'aabbCC'
if len(check(pass_str)):
    print('pass_str passed')
if not len(check(fail_str)):
    print('passed')

pass_str passed
passed
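
Note that a lookahead only checks what follows and does not include it in the match. The comparison below is a small sketch (variable names are illustrative) showing that the lookahead version returns just the capital letters, while a pattern that consumes the lowercase letters returns each whole sequence.

```python
import re

sample = "AaabbCbabbab"

# The lookahead asserts that lowercase letters follow but does not consume them.
only_capitals = re.findall(r"[A-Z](?=[a-z]+)", sample)

# Without the lookahead, the lowercase letters become part of the match.
whole_sequences = re.findall(r"[A-Z][a-z]+", sample)

print(only_capitals)    # ['A', 'C']
print(whole_sequences)  # ['Aaabb', 'Cbabbab']
```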


Write a Python program that matches a string that has an 'a' followed by anything, ending in 'b'. 

In [21]:
def check(test):
    regex = re.compile('a.*b$')
    return regex.findall(test)

pass_str = 'AaabbCbabbab'
fail_str = 'aabbCC'
if len(check(pass_str)):
    print('pass_str passed')
if not len(check(fail_str)):
    print('passed')

pass_str passed
passed


Write a Python program that matches a word at the beginning of a string.

In [22]:
def check(test):
    regex = re.compile(r'^\w+')
    return regex.findall(test)

pass_str = 'AaabbCbabbab'
fail_str = ' aabbCC'
if len(check(pass_str)):
    print('pass_str passed')
if not len(check(fail_str)):
    print('passed')

pass_str passed
passed


Write a Python program that matches a word at end of string, with optional punctuation.

In [23]:
def check(test):
    regex = re.compile(r'\w+[.,!]*$')
    return regex.findall(test)

pass_str = 'AaabbCbabba.'
fail_str = ' aabbCC '
if len(check(pass_str)):
    print('pass_str passed')
if not len(check(fail_str)):
    print('passed')

pass_str passed
passed


Write a Python program that matches a word containing 'z'.

In [24]:
def check(test):
    # No '.' after the 'z': a word may also end in 'z' (for example 'quiz').
    regex = re.compile(r'\w*z\w*')
    return regex.findall(test)

pass_str = 'Aaa zbbCbabba.'
fail_str = 'mmm aabbCC '
if len(check(pass_str)):
    print('pass_str passed')
if not len(check(fail_str)):
    print('passed')

pass_str passed
passed


Write a Python program that matches a word containing 'z', but not at the start or end of the word.

In [25]:
def check(test):
    # \w+ on both sides requires at least one word character before and after the 'z'.
    regex = re.compile(r'\w+z\w+')
    return regex.findall(test)

pass_str = 'Aaa bzbCbabba.'
fail_str = 'mmm zaabbCC '
if len(check(pass_str)):
    print('pass_str passed')
if not len(check(fail_str)):
    print('passed')

pass_str passed
passed


https://www.w3resource.com/python-exercises/re/
The cells below cover exercise 14 onward.

In [26]:
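def check(test):
    # Finds every character that is NOT a letter, digit or underscore,
    # so a non-empty result means the string contains a disallowed character.
    regex = re.compile('[^a-zA-Z0-9_]')
    return regex.findall(test)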

print('Q14')
fail_str = 'Aaa9bzbCbabba_'
pass_str = 'mmm zaabbCC'
if len(check(pass_str)):
    print('pass_str passed')
if not len(check(fail_str)):
    print('passed')

Q14
pass_str passed
passed


In [27]:
def check(test):
    regex = re.compile('^9')
    return regex.findall(test)

print('Q15')
pass_str = '9bzbCbabba_'
fail_str = '8mmm zaabbCC'
if len(check(pass_str)):
    print('pass_str passed')
if not len(check(fail_str)):
    print('passed')

Q15
pass_str passed
passed


In [28]:
def check(test):
    regex = re.compile('^0+')
    return regex.sub('',test)

print('Q16')
test_str = '001.01.02.12'
check(test_str)

Q16


'1.01.02.12'

In [29]:
def check(test):
    regex = re.compile('[0-9]$')
    return regex.findall(test)

print('Q17')
pass_str = '9bzbCbabba1'
fail_str = '8mmm zaabbCC'
if len(check(pass_str)):
    print('pass_str passed')
if not len(check(fail_str)):
    print('passed')

Q17
pass_str passed
passed


In [30]:
def check(test):
    regex = re.compile('(?<![0-9])[0-9]{1,3}(?![0-9])')
    return regex.findall(test)

print('Q18')
pass_str = '9bzbC234b4444abba1'
check(pass_str)

Q18


['9', '234', '1']

In [31]:
def check(test):
    regex = re.compile('fox|dog|horse')
    return regex.findall(test)

print('Q19')
pass_str = 'The quick brown fox jumps over the lazy dog.'
patterns = ['fox','dog','horse']
result = check(pass_str)
for pattern in patterns:
    if pattern not in result:
        print("not found: "+pattern)

Q19
not found: horse


In [32]:
def check(test):
    regex = re.compile('fox')
    return regex.search(test)

print('Q20')
pass_str = 'The quick brown fox jumps over the lazy dog.'
match = check(pass_str)
(match.start(),match.end(),match.group())

Q20


(16, 19, 'fox')

In [33]:
def check(test):
    regex = re.compile('exercise')
    for match in regex.finditer(test):
        print('start:'+str(match.start()))
        print('end:'+str(match.end()))

print('Q22')
pass_str = 'Python exercises, PHP exercises, C# exercises'
check(pass_str)

Q22
start:7
end:15
start:22
end:30
start:36
end:44


In [34]:
test_str = "pyth exer"
test_str = re.sub(' ','_',test_str)
print('Q23')
print(test_str)
test_str = re.sub('_',' ',test_str)
print(test_str)

Q23
pyth_exer
pyth exer


In [35]:
url = "https://www.washingtonpost.com/news/football-insider/wp/2016/09/02/odell-beckhams-fame-rests-on-one-stupid-little-ball-josh-norman-tells-author/"
expr = re.compile(r'(\d{4})/(\d{2})/(\d{2})')
print('Q24')
expr.findall(url)

Q24


[('2016', '09', '02')]

In [37]:
print('Q25')
re.sub(r'(\d{2})-(\d{2})-(\d{4})',r'\2-\1-\3','04-18-2014')

Q25


'18-04-2014'
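
Numbered backreferences such as \2-\1-\3 become hard to read as patterns grow. The sketch below does the same reordering with named groups. It assumes the input is in month-day-year order, which the original cell does not state; the group names `month`, `day` and `year` are chosen here for illustration.

```python
import re

# Named groups document what each part of the pattern captures.
DATE = re.compile(r"(?P<month>\d{2})-(?P<day>\d{2})-(?P<year>\d{4})")

# \g<name> refers to a named group in the replacement string.
swapped = DATE.sub(r"\g<day>-\g<month>-\g<year>", "04-18-2014")

print(swapped)  # 18-04-2014
```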

In [39]:
re.findall(r'\bP.*?\b','Python P java Jquery pascal')

['Python', 'P']

In [41]:
print('Q27')
re.findall(r'[0-9]+','101 from 202 in 303')

Q27


['101', '202', '303']

In [44]:
print('Q28')
re.findall(r'\b[ae].*?\b','aeon flex encore krystal')

Q28


['aeon', 'encore']

In [49]:
print('Q29')
for match in re.finditer(r'[0-9]+','101 from 202 in 303'):
    print('start: '+str(match.start()))
    print('end: '+str(match.end()))
    print('match: '+str(match.group()))

Q29
start: 0
end: 3
match: 101
start: 9
end: 12
match: 202
start: 16
end: 19
match: 303


In [51]:
print('Q32')
re.sub(r'[ ,.]', r':', 'm t,s.q l n.t', count=2)  # replace only the first two occurrences

Q32


'm:t:s.q l n.t'

In [52]:
print('q33')
re.findall(r'\b\w{5}\b','india is a great country')

q33


['india', 'great']

In [55]:
print('q34')
re.findall(r'\b\w{3,5}\b','our india is a great country')

q34


['our', 'india', 'great']

In [56]:
print('q35')
re.findall(r'\b\w{4,}\b','india is a great country')

q35


['india', 'great', 'country']

In [62]:
print('q36')
test = 'getHTTPResponse'
str1 = re.sub(r'(.)([A-Z][a-z]+)',r'\1_\2',test)
re.sub(r'([a-z0-9]+)([A-Z])',r'\1_\2',str1).lower()

'get_http_response'

In [70]:
print('q37')
test = 'get_http_response'
components = re.split(r'_', test)
# Capitalize each component and join them without a separator.
camel_case = ''.join(component.capitalize() for component in components)

camel_case

q37


'GetHttpResponse'