### Regex

A regular expression (or RE) specifies a set of strings that matches it; the functions in Python's `re` module let you check whether a particular string matches a given regular expression.

* `.`       - Any Character Except New Line

* `\d`      - Digit (0-9)

* `\D`      - Not a Digit (0-9)

* `\w`      - Word Character (a-z, A-Z, 0-9, _)

* `\W`      - Not a Word Character

* `\s`      - Whitespace (space, tab, newline)

* `\S`      - Not Whitespace (space, tab, newline)

* `\b`      - Word Boundary

* `\B`      - Not a Word Boundary

* `^`       - Beginning of a String

* `$`       - End of a String


* `[]`      - Matches Characters in brackets

* `[^ ]`    - Matches Characters NOT in brackets

* `|`       - Either Or

* `( )`     - Group
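
The symbols listed above can be tried out on a small string. The following is only a sketch; the `sample` text and the variable names are made up for illustration.

```python
import re

sample = "cat bat 42 rat_7"  # hypothetical sample text

digits = re.findall(r"\d", sample)              # every single digit
words = re.findall(r"\w+", sample)              # runs of word characters
in_set = re.findall(r"[cb]at", sample)          # 'c' or 'b', then 'at'
not_in_set = re.findall(r"[^cb]at", sample)     # any character except 'c'/'b', then 'at'
either = re.findall(r"cat|rat", sample)         # alternation with |
whole_numbers = re.findall(r"\b\d+\b", sample)  # digits with a word boundary on both sides
groups = re.search(r"(\w+)_(\d)", sample).groups()  # text captured by each ( )
starts_with_cat = re.search(r"^cat", sample) is not None
ends_with_7 = re.search(r"7$", sample) is not None

print(digits)         # ['4', '2', '7']
print(words)          # ['cat', 'bat', '42', 'rat_7']
print(in_set)         # ['cat', 'bat']
print(not_in_set)     # ['rat']
print(either)         # ['cat', 'rat']
print(whole_numbers)  # ['42']  (the 7 in rat_7 follows '_', so there is no boundary before it)
print(groups)         # ('rat', '7')
print(starts_with_cat, ends_with_7)  # True True
```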


#### Quantifiers:

* `*`       - 0 or More

* `+`       - 1 or More

* `?`       - 0 or One

* `{3}`     - Exact Number

* `{3,4}`   - Range of Numbers (Minimum, Maximum)

* `{n,}`    - n or More
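
To see how the quantifiers differ, each one can be applied to the same letter. This is a minimal sketch on an invented string; `\b` is added on both sides so that only whole words are reported.

```python
import re

text = "a ab abb abbb abbbb"  # hypothetical sample text

zero_or_more = re.findall(r"\bab*\b", text)     # 'a' followed by 0 or more 'b'
one_or_more = re.findall(r"\bab+\b", text)      # 'a' followed by 1 or more 'b'
zero_or_one = re.findall(r"\bab?\b", text)      # 'a' followed by 0 or 1 'b'
exactly_three = re.findall(r"\bab{3}\b", text)  # exactly 3 'b'
three_to_four = re.findall(r"\bab{3,4}\b", text)  # between 3 and 4 'b'
two_or_more = re.findall(r"\bab{2,}\b", text)   # at least 2 'b'

print(zero_or_more)   # ['a', 'ab', 'abb', 'abbb', 'abbbb']
print(one_or_more)    # ['ab', 'abb', 'abbb', 'abbbb']
print(zero_or_one)    # ['a', 'ab']
print(exactly_three)  # ['abbb']
print(three_to_four)  # ['abbb', 'abbbb']
print(two_or_more)    # ['abb', 'abbb', 'abbbb']
```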


#### Sample Regexes ####

`[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+`

##### re.match():
Matches only at the beginning of the string.
##### re.search():
Scans the whole string and returns the first occurrence.
##### re.findall():
Scans the whole string and returns all non-overlapping occurrences as a list.
##### re.split():
Splits the string wherever the regex matches and returns a list.
##### re.sub():
Replaces every match of the regex with a new string.
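
As a quick side-by-side, the five functions can be run against one string. This is just an illustrative sketch; the `line` value and the replacement text `animal=` are invented for the example.

```python
import re

line = "pet:cat, pet:cow"  # hypothetical sample text

at_start = re.match(r"pet:\w+", line).group()  # only looks at the start of the string
first_cow = re.search(r"cow", line).span()     # first occurrence anywhere
animals = re.findall(r"pet:(\w+)", line)       # all matches; one group, so a list of strings
parts = re.split(r",\s*", line)                # split on a comma plus optional whitespace
renamed = re.sub(r"pet:", "animal=", line)     # replace every match

print(at_start)   # pet:cat
print(first_cow)  # (13, 16)
print(animals)    # ['cat', 'cow']
print(parts)      # ['pet:cat', 'pet:cow']
print(renamed)    # animal=cat, animal=cow
```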

In [1]:
import re
line = 'pet:cat i love cats'
print(line)
re.match(r'pet:\w\w\w',line) # match 'pet:' followed by three word characters at the start of the string

pet:cat i love cats


<re.Match object; span=(0, 7), match='pet:cat'>

In [2]:
re.search(r'pet:\w\w\w',line)

<re.Match object; span=(0, 7), match='pet:cat'>

In [3]:
line1 = 'i love cats pet:cat'
print(re.match(r'pet:\w\w\w',line1)) # returns None because match only looks at the beginning of the string
print(re.search(r'pet:\w\w\w',line1)) # search scans the whole string

None
<re.Match object; span=(12, 19), match='pet:cat'>


In [4]:
line3 = 'pet:cat i love cats pet:cow i love cows'
re.findall(r'pet:\w\w\w',line3)

['pet:cat', 'pet:cow']

#### EX 1

In [5]:
str = 'cat mat rat bat'
print(str)
# general form: result = re.search(pattern, string)
result = re.search(r'm\w\w', str) # first occurrence of 'm' followed by two word characters
print(result)

cat mat rat bat
<re.Match object; span=(4, 7), match='mat'>


#### EX 2

In [6]:
str = r'This : is the "core" python\s book' # raw string, so \s stays a literal backslash and 's'
print(str)
re.split(r'\W+', str) # split on runs of one or more non-word characters

This : is the "core" python\s book


['This', 'is', 'the', 'core', 'python', 's', 'book']

In [7]:
str ='one two three four five six seven 8 9 10'
print(str)
print(re.findall(r'\w+',str)) # every run of word characters
print(re.findall(r'\b\w{3,4}\b',str)) # whole words made of 3 to 4 word characters
print(re.findall(r'\b\d\b',str)) # single digits with a word boundary on both sides

one two three four five six seven 8 9 10
['one', 'two', 'three', 'four', 'five', 'six', 'seven', '8', '9', '10']
['one', 'two', 'four', 'five', 'six']
['8', '9']


In [8]:
str = 'anil akhil anant arun arati arundhati abhijit ankur'
print(str)
print(re.findall(r'a[nk][\w]*',str)) # 'a', then 'n' or 'k', then zero or more word characters
print(re.findall(r'\b\w{6,7}\b',str)) # whole words made of 6 to 7 word characters

anil akhil anant arun arati arundhati abhijit ankur
['anil', 'akhil', 'anant', 'ankur']
['abhijit']


In [9]:
str = 'Hello world'
print(re.findall(r'^He',str))
print(re.search(r'world$',str))

['He']
<re.Match object; span=(6, 11), match='world'>


* First, the character class `[Tt]` matches either 't' or 'T'
* Next, a literal `h` must follow
* Then the negated class `[^i]` matches any single character except 'i'
* Finally, `[st]` matches a single 's' or 't'

In [10]:
text = 'those THAT tilt that That 8'
re.findall(r'[Tt]h[^i][st]', text)

['thos', 'that', 'That']

### PAN checker

In [11]:
# valid PAN: five uppercase letters, four digits, one uppercase letter
PAN_NUMBER = 'ABcDE1234L'
# fullmatch requires the whole string to fit the pattern; the lowercase 'c' makes this PAN invalid
match = re.fullmatch(r'[A-Z]{5}[0-9]{4}[A-Z]', PAN_NUMBER)
if match:
    print(True)
else:
    print(False)

False
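
The check can also be wrapped in a small reusable function. The sketch below assumes the format stated in the comment above (five uppercase letters, four digits, one uppercase letter); the names `PAN_PATTERN` and `is_valid_pan` are invented for this example. It uses `fullmatch` so that extra characters around an otherwise valid PAN are rejected, which `re.search` would not do.

```python
import re

# Hypothetical helper names; the pattern follows the format described above.
PAN_PATTERN = re.compile(r"[A-Z]{5}[0-9]{4}[A-Z]")


def is_valid_pan(candidate: str) -> bool:
    """Return True only if the whole string has the PAN shape."""
    return PAN_PATTERN.fullmatch(candidate) is not None


print(is_valid_pan("ABCDE1234L"))      # True
print(is_valid_pan("ABcDE1234L"))      # False (lowercase 'c')
print(is_valid_pan("XXABCDE1234LXX"))  # False (extra characters on both sides)

# For comparison, search finds the PAN-shaped part inside the longer string.
print(PAN_PATTERN.search("XXABCDE1234LXX") is not None)  # True
```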


### Domain Name finder

In [12]:
str = '<div class="reflist" style="list-style-type: decimal;"><ol class="references"><li id="cite_note-1"><span class="mw-cite-backlink"><b>^ ["Train (noun)"](http://www.askoxford.com/concise_oed/train?view=uk). <i>(definition – Compact OED)</i>. Oxford University Press<span class="reference-accessdate">. Retrieved 2008-03-18</span>.</span><span title="ctx_ver=Z39.88-2004&rfr_id=info%3Asid%2Fen.wikipedia.org%3ATrain&rft.atitle=Train+%28noun%29&rft.genre=article&rft_id=http%3A%2F%2Fwww.askoxford.com%2Fconcise_oed%2Ftrain%3Fview%3Duk&rft.jtitle=%28definition+%E2%80%93+Compact+OED%29&rft.pub=Oxford+University+Press&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal" class="Z3988"><span style="display:none;"> </span></span></span></li><li id="cite_note-2"><span class="mw-cite-backlink"><b>^</b></span> <span class="reference-text"><span class="citation book">Atchison, Topeka and Santa Fe Railway (1948). <i>Rules: Operating Department</i>. p. 7.</span><span title="ctx_ver=Z39.88-2004&rfr_id=info%3Asid%2Fen.wikipedia.org%3ATrain&rft.au=Atchison%2C+Topeka+and+Santa+Fe+Railway&rft.aulast=Atchison%2C+Topeka+and+Santa+Fe+Railway&rft.btitle=Rules%3A+Operating+Department&rft.date=1948&rft.genre=book&rft.pages=7&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Abook" class="Z3988"><span style="display:none;"> </span></span></span></li><li id="cite_note-3"><span class="mw-cite-backlink"><b>^ [Hydrogen trains](http://www.hydrogencarsnow.com/blog2/index.php/hydrogen-vehicles/i-hear-the-hydrogen-train-a-comin-its-rolling-round-the-bend/)</span></li><li id="cite_note-4"><span class="mw-cite-backlink"><b>^ [Vehicle Projects Inc. Fuel cell locomotive](http://www.bnsf.com/media/news/articles/2008/01/2008-01-09a.html)</span></li><li id="cite_note-5"><span class="mw-cite-backlink"><b>^</b></span> <span class="reference-text"><span class="citation book">Central Japan Railway (2006). <i>Central Japan Railway Data Book 2006</i>. p. 
16.</span><span title="ctx_ver=Z39.88-2004&rfr_id=info%3Asid%2Fen.wikipedia.org%3ATrain&rft.au=Central+Japan+Railway&rft.aulast=Central+Japan+Railway&rft.btitle=Central+Japan+Railway+Data+Book+2006&rft.date=2006&rft.genre=book&rft.pages=16&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Abook" class="Z3988"><span style="display:none;"> </span></span></span></li><li id="cite_note-6"><span class="mw-cite-backlink"><b>^ ["Overview Of the existing Mumbai Suburban Railway"](http://web.archive.org/web/20080620033027/http://www.mrvc.indianrail.gov.in/overview.htm). _Official webpage of Mumbai Railway Vikas Corporation_. Archived from [the original](http://www.mrvc.indianrail.gov.in/overview.htm) on 2008-06-20<span class="reference-accessdate">. Retrieved 2008-12-11</span>.</span><span title="ctx_ver=Z39.88-2004&rfr_id=info%3Asid%2Fen.wikipedia.org%3ATrain&rft.atitle=Overview+Of+the+existing+Mumbai+Suburban+Railway&rft.genre=article&rft_id=http%3A%2F%2Fwww.mrvc.indianrail.gov.in%2Foverview.htm&rft.jtitle=Official+webpage+of+Mumbai+Railway+Vikas+Corporation&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal" class="Z3988"><span style="display:none;"> </span></span></span></li></ol></div>'


##### `|` is the OR operator here; because the pattern contains groups, `re.findall` returns one tuple per match holding the text captured by each `( )`.

In [14]:
# dots in 'www.' and 'ww2.' are escaped so they match a literal '.'; '/' needs no escaping
matches = re.findall(r'http(s:|:)//(www\.|ww2\.|)([0-9a-zA-Z.-]*\.\w{2,3})', str)
for elem in matches:
    print(elem)

(':', 'www.', 'askoxford.com')
(':', 'www.', 'hydrogencarsnow.com')
(':', 'www.', 'bnsf.com')
(':', '', 'web.archive.org')
(':', 'www.', 'mrvc.indianrail.gov.in')
(':', 'www.', 'mrvc.indianrail.gov.in')
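
If only the domain names are wanted, the helper groups can be made non-capturing with `(?:...)`, so that `re.findall` returns plain strings instead of tuples. This is a sketch on a shortened, made-up snippet; the two hosts come from the output above, while the paths `/x` and `/y` and the variable names are invented.

```python
import re

# Hypothetical, shortened input; the real page text is much longer.
html = ('see <a href="http://www.askoxford.com/x">a</a> '
        'and <a href="https://web.archive.org/y">b</a>')

# (?:...) groups without capturing, so only the last ( ) is returned.
domain_pattern = r"https?://(?:www\.|ww2\.)?([0-9a-zA-Z.-]*\.\w{2,3})"
domains = re.findall(domain_pattern, html)

print(domains)  # ['askoxford.com', 'web.archive.org']
```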


#### 1.  Extracting email

In [18]:
import re

emails = '''
Hi my name is John and email address is john.doe@somecompany.co.uk and my friend's email is jane_doe124@gmail.com
'''
print(emails)


Hi my name is John and email address is john.doe@somecompany.co.uk and my friend's email is jane_doe124@gmail.com



In [19]:
pattern = re.compile(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+')
print(pattern)

re.compile('[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\\.[a-zA-Z0-9-.]+')


In [20]:
matches = pattern.finditer(emails)

for match in matches:
    print(match)

<re.Match object; span=(41, 67), match='john.doe@somecompany.co.uk'>
<re.Match object; span=(93, 114), match='jane_doe124@gmail.com'>
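
To get the addresses as plain strings rather than match objects, `group()` can be called on each match. The sketch below reuses the same pattern on a shortened copy of the text; the variable names and the `a@b.com.` string are invented. It also shows one limitation worth knowing: because the last character class includes `.`, a sentence-ending period right after an address is captured too.

```python
import re

pattern = re.compile(r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+")

text = ("email address is john.doe@somecompany.co.uk "
        "and my friend's email is jane_doe124@gmail.com")

# group() with no arguments returns the full text of the match.
found = [m.group() for m in pattern.finditer(text)]
print(found)  # ['john.doe@somecompany.co.uk', 'jane_doe124@gmail.com']

# Caveat: a trailing period is swallowed by the final [a-zA-Z0-9-.]+ class.
with_period = pattern.findall("mail me at a@b.com.")
print(with_period)  # ['a@b.com.']
```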
