# Regular Expressions

### The most common uses of regular expressions are:

- Searching a string for a pattern (search and match)
- Finding every occurrence of a pattern (findall)
- Breaking a string into substrings (split)
- Replacing part of a string (sub)

### Match

In [48]:
import re
mystring =  'AV Analytics AV'
re.match(r'AV', mystring)

<re.Match object; span=(0, 2), match='AV'>

    re.match() looks for the pattern only at the beginning of the string; it returns None if the string does not start with the pattern.
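
To make the "beginning of the string" rule concrete, here is a small sketch that reuses the string from the cell above. The variable names `at_start` and `not_at_start` are illustrative only.

```python
import re

mystring = 'AV Analytics AV'

# 'AV' sits at index 0, so match() succeeds.
at_start = re.match(r'AV', mystring)

# 'Analytics' occurs later in the string, so match() returns None.
not_at_start = re.match(r'Analytics', mystring)

print(at_start.group(0))   # AV
print(not_at_start)        # None
```

Because a failed match returns None, it is safer to test the result before calling `.group()` on it.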

### Search

In [4]:
result = re.search(r'Analytics', 'AV Analytics Vidhya AV')
print (result.group(0))

Analytics


In [5]:
print(result)

<re.Match object; span=(3, 12), match='Analytics'>


In [20]:
result2 = re.search(r'AV', 'AV Analytics Vidhya AV')

In [21]:
print (result2)

<re.Match object; span=(0, 2), match='AV'>


In [22]:
print (result2.group(0))

AV


    Here you can see that the search() method can find a pattern at any position
    in the string, but it returns only the first occurrence of the pattern.
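
If the position of every occurrence is needed, not just the first one, `re.finditer()` is one option: it yields a match object per occurrence. The sketch below reuses the string from the cells above; the name `spans` is illustrative.

```python
import re

text = 'AV Analytics Vidhya AV'

# finditer() yields one match object for each non-overlapping occurrence.
spans = [(m.group(0), m.span()) for m in re.finditer(r'AV', text)]

print(spans)   # [('AV', (0, 2)), ('AV', (20, 22))]
```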

### Findall

In [16]:
result3 = re.findall(r'AV', 'AV Analytics Vidhya AV')
print (result3)

['AV', 'AV']


    findall() returns a list of all non-overlapping matches and is not restricted to the start or end of the string. Searching for 'AV' in the given string returns both occurrences. Because it scans the whole string, re.findall() can often stand in for re.search() and re.match(), but it returns plain strings rather than match objects, so prefer search() or match() when you need positions or groups.
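
One caveat when using findall() in place of search() or match(): the return types differ. The following sketch compares them on the same string; the pattern 'XYZ' is just a hypothetical example of something that does not occur in the text.

```python
import re

text = 'AV Analytics Vidhya AV'

found = re.findall(r'AV', text)   # list of matched strings
first = re.search(r'AV', text)    # match object (or None)

print(found)           # ['AV', 'AV']
print(first.start())   # 0

# With no match, findall() gives an empty list and search() gives None.
none_found = re.findall(r'XYZ', text)
none_first = re.search(r'XYZ', text)
print(none_found)      # []
print(none_first)      # None
```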

### Split

    This method splits a string at each occurrence of the given pattern.

In [24]:
result4 = re.split(r'a', 'Analytics')
result4

['An', 'lytics']

In [30]:
result4 = re.split(r'\.\.\b', 'This is correct...This is also correct')
result4

['This is correct.', 'This is also correct']

In [34]:
result4 = re.split(r'\.\.', 'This is correct...This is also correct')
result4

['This is correct', '.This is also correct']

**split() also accepts a "maxsplit" argument, which limits the number of splits performed.**

In [35]:
result4 = re.split(r'i', 'This is correct...This is also correct')
result4

['Th', 's ', 's correct...Th', 's ', 's also correct']

In [36]:
result4 = re.split(r'i', 'This is correct...This is also correct', maxsplit = 2)
result4

['Th', 's ', 's correct...This is also correct']
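
A related behaviour of re.split() that is easy to miss: if the pattern contains a capturing group, the matched delimiters are kept in the result. A minimal sketch using the same sentence (the name `parts` is illustrative):

```python
import re

text = 'This is correct...This is also correct'

# The parentheses make split() include the delimiter in the output list.
parts = re.split(r'(\.\.\.)', text)

print(parts)   # ['This is correct', '...', 'This is also correct']
```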

### Sub

In [49]:
result5 = re.sub('India','the World','AV is largest Analytics community of India')
print(result5) 

AV is largest Analytics community of the World
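
re.sub() can do more than a fixed replacement. The sketch below shows two standard options: the `count` argument, which limits how many occurrences are replaced, and backreferences such as `\1` in the replacement string. The example strings are made up for illustration.

```python
import re

# count=1 replaces only the first occurrence.
once = re.sub(r'India', 'the World', 'India today, India tomorrow', count=1)
print(once)      # the World today, India tomorrow

# \1, \2 and \3 refer to the captured groups; here the first two are swapped.
swapped = re.sub(r'(\d{2})-(\d{2})-(\d{4})', r'\2-\1-\3', '12-05-2007')
print(swapped)   # 05-12-2007
```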


### Compile

    We can compile a regular expression pattern into a pattern object, which can then be used for matching. This lets us reuse the same pattern without rewriting it.

In [39]:
pattern = re.compile(r'AV')
result6 = pattern.findall('AV is Analytics vidhya AV')
result6

['AV', 'AV']

In [47]:
result6 = pattern.findall('AV is largest analytics community of India')
print (result6)

['AV']
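
A compiled pattern can also carry flags, so they do not have to be repeated on every call. As a sketch, the pattern below is compiled with `re.IGNORECASE` and reused on the two strings from the cells above; the name `ci_pattern` is illustrative.

```python
import re

# The flag is stored in the compiled pattern object.
ci_pattern = re.compile(r'analytics', re.IGNORECASE)

r1 = ci_pattern.findall('AV is Analytics vidhya AV')
r2 = ci_pattern.findall('AV is largest analytics community of India')
r3 = ci_pattern.sub('data', 'Analytics and ANALYTICS')

print(r1)   # ['Analytics']
print(r2)   # ['analytics']
print(r3)   # data and data
```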


### Most commonly used operators

    .	 Matches any single character except newline '\n'.
    ?	 Matches 0 or 1 occurrence of the pattern to its left.
    +	 Matches 1 or more occurrences of the pattern to its left.
    *	 Matches 0 or more occurrences of the pattern to its left.
    \w	 Matches a word character (letter, digit or underscore), whereas \W (upper case W) matches any non-word character.
    \d	 Matches a digit [0-9], and \D (upper case D) matches any non-digit.
    \s	 Matches a single whitespace character (space, newline, return, tab, form feed), and \S (upper case S) matches any non-whitespace character.
    \b	 Matches the boundary between a word and a non-word character; \B is the opposite of \b.
    [..]	 Matches any single character listed in the square brackets; [^..] matches any single character not listed.
    \	 Escapes characters that have a special meaning, e.g. \. to match a period or \+ to match a plus sign.
    ^ and $	 Match the start and the end of the string respectively.
    {n,m}	 Matches at least n and at most m occurrences of the preceding expression; written as {,m}, it matches from 0 up to m occurrences.
    a|b	 Matches either a or b.
    ( )	 Groups regular expressions and captures the matched text.
    \t, \n, \r	 Match tab, newline and carriage return.
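
A few of these operators are easier to grasp from a quick experiment. The sketch below tries `?`, `{n,m}`, `|` and `\w` on short made-up strings; all variable names are illustrative.

```python
import re

# ? makes the preceding 'u' optional.
optional = re.findall(r'colou?r', 'color colour')
print(optional)      # ['color', 'colour']

# {2,3} takes two or three digits, as many as possible.
repeated = re.findall(r'\d{2,3}', '1 22 333 4444')
print(repeated)      # ['22', '333', '444']

# | matches either alternative.
either = re.findall(r'cat|dog', 'cat, dog, bird')
print(either)        # ['cat', 'dog']

# \w includes the underscore but not punctuation.
word_chars = re.findall(r'\w', 'a_1!')
print(word_chars)    # ['a', '_', '1']
```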

#### Extract each character (using “\w“)

In [65]:
pattern = re.compile(r'\w')

In [68]:
mytext = 'Data is the new oil123.'
result = pattern.findall(mytext)
print(result)

['D', 'a', 't', 'a', 'i', 's', 't', 'h', 'e', 'n', 'e', 'w', 'o', 'i', 'l', '1', '2', '3']


#### Extract each word (using “\w” and “\b“)

In [69]:
pattern = re.compile(r'\w{3}')         # Any three consecutive word characters
mytext = 'Data is the new oil123.'
result = pattern.findall(mytext)
print(result)

['Dat', 'the', 'new', 'oil', '123']


In [70]:
pattern = re.compile(r'\w{3}\b')         # Three word characters ending at a word boundary
mytext = 'Data is the new oil123.'
result = pattern.findall(mytext)
print(result)

['ata', 'the', 'new', '123']


In [72]:
pattern = re.compile(r'\b\w{3}')         # Three word characters starting at a word boundary
mytext = 'Data is the new oil123.'
result = pattern.findall(mytext)
print(result)

['Dat', 'the', 'new', 'oil']


In [73]:
pattern = re.compile(r'\b\w{3}\b')         # Whole words of exactly three characters
mytext = 'Data is the new oil123.'
result = pattern.findall(mytext)
print(result)

['the', 'new']


#### Extract each word (using “*” and “+“)

In [82]:
mytext = 'Data is the new oil123.'              #Finding all words in a text
result = re.findall(r'\w+',mytext)
print(result) 


['Data', 'is', 'the', 'new', 'oil123']


In [83]:
mytext = 'Data is the new oil123.'              #Finding the word at the start of the string
result = re.findall(r'^\w+',mytext)
print(result) 

['Data']


In [98]:
mytext = 'Data is the123 new oil123'              #Finding the word at the end of the string
result = re.findall(r'\w+$',mytext)
print(result) 

['oil123']


In [99]:
mytext = 'Data is the123 new oil123.'              #Finding the last word when the string ends with a full stop
result = re.findall(r'\w+\.$',mytext)
print(result)

['oil123.']
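
By default `^` and `$` anchor to the start and end of the whole string. For text with several lines, the `re.MULTILINE` flag makes them match at each line instead. A short sketch with a made-up two-line string:

```python
import re

two_lines = 'Data is the new oil.\nRegex is useful.'

# Without the flag, ^ matches only at the very start of the string.
first_only = re.findall(r'^\w+', two_lines)
print(first_only)    # ['Data']

# With re.MULTILINE, ^ and $ also match at the start and end of every line.
line_starts = re.findall(r'^\w+', two_lines, re.MULTILINE)
line_ends = re.findall(r'\w+\.$', two_lines, re.MULTILINE)
print(line_starts)   # ['Data', 'Regex']
print(line_ends)     # ['oil.', 'useful.']
```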


### Return the first two characters of each word

In [100]:
mytext = 'Data is the123 new oil123.'              #Every non-overlapping pair of word characters (not yet limited to word starts)
result = re.findall(r'\w\w',mytext)
print(result)

['Da', 'ta', 'is', 'th', 'e1', '23', 'ne', 'oi', 'l1', '23']


In [101]:
mytext = 'Data is the123 new oil123.'   # First two characters at the start of each word (using "\b")
result = re.findall(r'\b\w\w',mytext)
print(result)

['Da', 'is', 'th', 'ne', 'oi']


### Return the domain type of given email-ids

In [106]:
#Extract all characters after “@”

mytext = '''abc.test@gmail.com, xyz@test.in,
           test.first@analyticsvidhya.com, first.test@rest.biz'''
result = re.findall(r'@\w+', mytext)                  
result 

['@gmail', '@test', '@analyticsvidhya', '@rest']

In [108]:
# The ".com", ".in" part is not extracted above. To include it, extend the pattern as below.

mytext = '''abc.test@gmail.com, xyz@test.in,
           test.first@analyticsvidhya.com, first.test@rest.biz'''
result = re.findall(r'@\w+\.\w+', mytext)                  
result 

['@gmail.com', '@test.in', '@analyticsvidhya.com', '@rest.biz']

In [111]:
# Extract only the domain type (the part after the dot) using "( )"

mytext = '''abc.test@gmail.com, xyz@test.in,
           test.first@analyticsvidhya.com, first.test@rest.biz'''
result = re.findall(r'@\w+\.(\w+)', mytext)                  
result 

['com', 'in', 'com', 'biz']
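
When a pattern has more than one capturing group, findall() returns a list of tuples, one item per group. As a sketch, the variation below captures the domain name and the domain type separately; the name `pairs` is illustrative.

```python
import re

mytext = '''abc.test@gmail.com, xyz@test.in,
           test.first@analyticsvidhya.com, first.test@rest.biz'''

# Two groups: the name after "@" and the type after the dot.
pairs = re.findall(r'@(\w+)\.(\w+)', mytext)

print(pairs)
# [('gmail', 'com'), ('test', 'in'), ('analyticsvidhya', 'com'), ('rest', 'biz')]
```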

### Return date from given string

In [113]:
mystring = '''Amit 34-3456 12-05-2007, XYZ 56-4532 11-11-2011, ABC 67-8945 12-01-2009'''
result = re.findall(r'\d{2}-\d{2}-\d{4}', mystring)
result

['12-05-2007', '11-11-2011', '12-01-2009']

In [114]:
# To extract only the year, wrap that part of the pattern in parentheses "( )".

mystring = '''Amit 34-3456 12-05-2007, XYZ 56-4532 11-11-2011, ABC 67-8945 12-01-2009'''
result = re.findall(r'\d{2}-\d{2}-(\d{4})', mystring)
result

['2007', '2011', '2009']
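
Named groups are another way to pull out parts of each date. The sketch below assumes the dates are written day-month-year, which the source string does not state; the group names `day`, `month` and `year` are chosen for illustration.

```python
import re

mystring = '''Amit 34-3456 12-05-2007, XYZ 56-4532 11-11-2011, ABC 67-8945 12-01-2009'''

# (?P<name>...) gives each group a name (day-month-year order is an assumption).
date_re = re.compile(r'(?P<day>\d{2})-(?P<month>\d{2})-(?P<year>\d{4})')

years = [m.group('year') for m in date_re.finditer(mystring)]
first_date = date_re.search(mystring).groupdict()

print(years)        # ['2007', '2011', '2009']
print(first_date)   # {'day': '12', 'month': '05', 'year': '2007'}
```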

### Return all words of a string that start with a vowel

In [116]:
# Return words that start with a vowel (using [])

mytext = 'Data is the123 new oil123.'   
result = re.findall(r'\b[aeiouAEIOU]\w+',mytext)
print(result)

['is', 'oil123']


In [128]:
# Extract words that start with a consonant, using "^" inside the square brackets.
mytext = 'Data is the123 new oil123.'   
result = re.findall(r'\b[^aeiouAEIOU]\w+',mytext) # Note: the leading space is matched too, because [^...] accepts any non-vowel character
print(result)

['Data', ' is', ' the123', ' new', ' oil123']


In [127]:
# Add a space to the excluded set so that the leading space is no longer matched.

mytext = 'Data is the123 new oil123.'   
result = re.findall(r'\b[^aeiouAEIOU ]\w+',mytext) # Note: the space is listed inside the square brackets
print(result)

['Data', 'the123', 'new']
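
Excluding the space works for this sentence, but other punctuation (a comma or hyphen before a word, say) would still slip through the negated set. One possible refinement, offered as a sketch rather than the only approach, is to exclude every non-word character, digit and underscore as well, so that only consonant letters can start a match. The test string below is made up to include punctuation.

```python
import re

tricky = 'Data,is the123 new-oil'

# [^\W\d_aeiouAEIOU] = a word character that is not a digit, underscore or vowel.
consonant_words = re.findall(r'\b[^\W\d_aeiouAEIOU]\w*', tricky)

print(consonant_words)   # ['Data', 'the123', 'new']
```

Using `\w*` instead of `\w+` also lets single-letter words such as "b" match.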


### Validate a phone number (the number must have 10 digits and start with 8 or 9)

In [129]:
phone = ['9999999999','999999-999','99999x9999']

for val in phone:
    if re.match(r'[8-9]{1}[0-9]{9}',val) and len(val) == 10:
        print('yes')
    else:
        print('no')

yes
no
no
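
An alternative to combining re.match() with a length check is `re.fullmatch()`, which succeeds only if the whole string fits the pattern. The sketch below wraps it in a helper; `is_valid_phone` and `PHONE_RE` are hypothetical names, and an 11-digit value is added to the list to show that over-long input is rejected.

```python
import re

# One digit 8 or 9, followed by exactly nine more digits.
PHONE_RE = re.compile(r'[89][0-9]{9}')

def is_valid_phone(val):
    """Return True if val is exactly 10 digits and starts with 8 or 9."""
    return PHONE_RE.fullmatch(val) is not None

phone = ['9999999999', '999999-999', '99999x9999', '99999999990']
checks = [is_valid_phone(val) for val in phone]

print(checks)   # [True, False, False, False]
```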


### Split a string with multiple delimiters

In [130]:
line = 'asdf fjdk;afed,fjek,asdf,foo'             # String has multiple delimiters (";",","," ").

result= re.split(r'[;,\s]', line)
print(result)


['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']
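
If delimiters can appear back to back (for example a comma followed by a space), the single-character class produces empty strings. Adding `+` treats a run of delimiters as one separator. A sketch with a made-up variation of the line above:

```python
import re

messy = 'asdf  fjdk; afed, fjek,asdf,foo'

# Each delimiter splits on its own, leaving empty strings between neighbours.
with_empties = re.split(r'[;,\s]', messy)
print(with_empties)   # ['asdf', '', 'fjdk', '', 'afed', '', 'fjek', 'asdf', 'foo']

# + collapses consecutive delimiters into a single separator.
clean = re.split(r'[;,\s]+', messy)
print(clean)          # ['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']
```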
