
A regex is a special sequence of characters that defines a pattern for complex string-matching functionality.

What will we learn?

- How to access the re module, which implements regex matching in Python
- How to use re.search() to match a pattern against a string
- How to create complex matching patterns with regex metacharacters

How can we find out whether a string s contains the substring '123'?


In [None]:
s = 'vit123bvrm'
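For a fixed substring such as '123', ordinary string operations are enough and no regex is needed. A minimal sketch using the `in` operator and `str.find()`; the variable names `contains_123` and `position` are chosen here only for illustration:

```python
s = 'vit123bvrm'

# The `in` operator reports whether the substring occurs anywhere in s.
contains_123 = '123' in s

# str.find() returns the index of the first occurrence, or -1 if it is absent.
position = s.find('123')

print(contains_123)  # True
print(position)      # 3
```

This approach only works when the exact substring is known in advance, which is the limitation the next question runs into.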

A harder question: how can we determine whether a string contains any three consecutive decimal digit characters, as in the strings 'foo123bar', 'foo456bar', '234baz', and 'qux678'?

### The re module

- Regex functionality in Python resides in a module named re.
- The re module contains many useful functions and methods.

- re.search() is one of them.

- re.search(\<regex\>, \<string\>)

- Scans a string for a regex match.

- re.search(\<regex\>, \<string\>) scans \<string\> looking for the first location where the pattern \<regex\> matches. If a match is found, then re.search() returns a match object. Otherwise, it returns None.


In [None]:
import re

In [None]:
re.search('123', s)

<re.Match object; span=(3, 6), match='123'>

In [None]:
# No match: re.search() returns None, so the notebook displays no output
re.search('456', s)

In [None]:
if re.search('456', s):
  print("Match is found")
else:
  print("Match is not found")

Match is not found
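The match object shown earlier exposes the matched text and its position through methods. A small sketch, assuming the same string `s` as above; `matched_text`, `start`, and `end` are illustrative names:

```python
import re

s = 'vit123bvrm'
m = re.search('123', s)

# Always check for None before using the match object.
if m is not None:
    matched_text = m.group()   # the substring that matched
    start, end = m.span()      # start index (inclusive) and end index (exclusive)
    print(matched_text, start, end)  # 123 3 6
```

Because the end index is exclusive, `s[start:end]` gives back the matched text.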


#### Python Regex Metacharacters

- The real power of regex matching in Python emerges when \<regex\> contains special characters called metacharacters. These have a unique meaning to the regex matching engine and vastly enhance the capability of the search.

- Consider again the problem of how to determine whether a string contains any three consecutive decimal digit characters.

- In a regex, a set of characters specified in square brackets ([]) makes up a character class. This metacharacter sequence matches any single character that is in the class, as demonstrated in the following example:

In [None]:
re.search('[0-9][0-9][0-9]', s)

<re.Match object; span=(3, 6), match='123'>

In [None]:
re.search('[0-9][0-9][0-9]', 'foo456bar')

<re.Match object; span=(3, 6), match='456'>

In [None]:
print(re.search('[0-9][0-9][0-9]', '12foo34'))

None
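To tie this back to the original question, the same pattern can be tried against each of the example strings listed earlier. This is only a sketch; the names `samples`, `pattern`, and `results` are introduced here for illustration:

```python
import re

# Example strings from the question above, plus one that has no run of three digits.
samples = ['foo123bar', 'foo456bar', '234baz', 'qux678', '12foo34']
pattern = '[0-9][0-9][0-9]'

# Map each string to True if it contains three consecutive digits, else False.
results = {text: re.search(pattern, text) is not None for text in samples}

for text, found in results.items():
    print(text, found)
```

Only '12foo34' maps to False, because its digits never appear three in a row.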


The dot (.) metacharacter matches any single character except a newline, so it functions like a wildcard.

In [None]:
re.search('1.3', 'vit123bvrm')

<re.Match object; span=(3, 6), match='123'>

In [None]:
print(re.search('1.3', 'vit13bvrm'))

None


### Some of the metacharacters supported by Python's re module
- . - matches any single character except a newline
- ^ - anchors a match at the start of a string
- $ - anchors a match at the end of a string
- \* - matches zero or more repetitions
- \+ - matches one or more repetitions
- \? - matches zero or one repetition
- { } - matches an explicitly specified number of repetitions
- \ - escapes a metacharacter, stripping it of its special meaning
- [] - specifies a character class
- | - designates an alternation
- () - creates a group
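Alternation, grouping, and escaping appear in the list above but are not demonstrated elsewhere in this notebook. A brief sketch of each follows; the test strings and the names `alt`, `grp`, and `esc` are made up for illustration:

```python
import re

# | (alternation): match 'foo' or 'bar', whichever occurs first in the string.
alt = re.search('foo|bar', 'xxbarfoo')

# () (grouping): the + quantifier applies to the whole group 'ab'.
grp = re.search('(ab)+', 'xxababab')

# \ (escaping): \. matches a literal dot rather than any character.
esc = re.search(r'1\.3', '1x3 and 1.3')

print(alt.group())  # bar
print(grp.group())  # ababab
print(esc.span())   # (8, 11)
```

The escaped pattern skips '1x3' at the start of the string and matches only the literal '1.3'.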

In [None]:
# Matching a single character from a character class
re.search('ba[artz]', 'foobarqux')

<re.Match object; span=(3, 6), match='bar'>

In [None]:
# Matching a range of characters
re.search('[a-z]', 'FOObar')

<re.Match object; span=(3, 4), match='b'>

In [None]:
re.search('[A-Z]', 'FOObar')

<re.Match object; span=(0, 1), match='F'>

In [None]:
re.search('[0-9][0-9]', 'foo123bar')

<re.Match object; span=(3, 5), match='12'>

In [None]:
# Matching any character that is not a digit
re.search('[^0-9]', '12345foo')

<re.Match object; span=(5, 6), match='f'>

In [None]:
# The . metacharacter matches any single character except a newline:

re.search('foo.bar', 'fooxbar')

<re.Match object; span=(0, 7), match='fooxbar'>

In [None]:
print(re.search('foo.bar', 'foobar'))

None


In [None]:
print(re.search('foo.bar', 'foo\nbar'))

None
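If the dot should match a newline as well, the re module provides the `re.DOTALL` flag. A minimal sketch, reusing the string from the cell above; `default` and `dotall` are illustrative names:

```python
import re

# By default, . does not match a newline.
default = re.search('foo.bar', 'foo\nbar')

# With re.DOTALL, . matches any character, including a newline.
dotall = re.search('foo.bar', 'foo\nbar', flags=re.DOTALL)

print(default)        # None
print(dotall.span())  # (0, 7)
```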


\w
\W

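Match based on whether a character is a word character. \w matches any word character (a letter, a digit, or the underscore), and \W matches any character that is not a word character.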

In [None]:
# Raw string (r'...') so that Python passes the backslash through to the regex engine
re.search(r'\w', '#(.a$@&')

<re.Match object; span=(3, 4), match='a'>

In [None]:
re.search('[a-zA-Z0-9_]', '#(.a$@&')

<re.Match object; span=(3, 4), match='a'>

In [None]:
re.search(r'\W', 'a_1*3Qb')

<re.Match object; span=(3, 4), match='*'>
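The `\w` search and the `[a-zA-Z0-9_]` search above give the same result because the test string is plain ASCII. For `str` patterns in Python 3, `\w` is Unicode-aware by default, so the two are not equivalent in general. A small sketch of the difference, using the accented letter 'é' as an example input chosen here:

```python
import re

accented = '\u00e9'  # the single character 'é'

# Default behaviour for str patterns: \w matches Unicode word characters.
unicode_match = re.search(r'\w', accented)

# The explicit ASCII class does not match 'é'.
class_match = re.search('[a-zA-Z0-9_]', accented)

# With re.ASCII, \w is restricted to [a-zA-Z0-9_].
ascii_match = re.search(r'\w', accented, flags=re.ASCII)

print(unicode_match is not None)  # True
print(class_match)                # None
print(ascii_match)                # None
```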

\d
\D

Match based on whether a character is a decimal digit. \d matches any decimal digit character, and \D matches any character that is not a decimal digit.

In [None]:
re.search(r'\d', 'abc4def')

<re.Match object; span=(3, 4), match='4'>

In [None]:
re.search(r'\D', '234Q678')

<re.Match object; span=(3, 4), match='Q'>

^
\A

Anchor a match to the start of \<string\>.

In [None]:
re.search('^foo', 'foobar')

<re.Match object; span=(0, 3), match='foo'>

In [None]:
print(re.search('^foo', 'barfoo'))

None
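The two start anchors behave the same way in the examples above. They differ only when the `re.MULTILINE` flag is set. A sketch under that assumption, with a two-line test string invented for illustration:

```python
import re

text = 'bar\nfoo'

# Without flags, both ^ and \A anchor only at the very start of the string.
print(re.search('^foo', text))    # None
print(re.search(r'\Afoo', text))  # None

# With re.MULTILINE, ^ also matches right after each newline; \A is unaffected.
caret_ml = re.search('^foo', text, flags=re.MULTILINE)
a_ml = re.search(r'\Afoo', text, flags=re.MULTILINE)

print(caret_ml.span())  # (4, 7)
print(a_ml)             # None
```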


$
\Z

Anchor a match to the end of \<string\>.

In [None]:
re.search('bar$', 'foobar')

<re.Match object; span=(3, 6), match='bar'>

In [None]:
print(re.search('bar$', 'barfoo'))

None
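The two end anchors are not fully interchangeable: `$` also matches just before a single trailing newline, while `\Z` matches only at the very end of the string. A short sketch, with a test string chosen here to show the difference:

```python
import re

text = 'foobar\n'  # note the trailing newline

# $ matches at the end of the string or just before a final newline.
dollar = re.search('bar$', text)

# \Z matches only at the very end of the string.
big_z = re.search(r'bar\Z', text)

print(dollar.span())  # (3, 6)
print(big_z)          # None
```

This matters when checking lines read from a file, which often still carry their newline.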


#### Quantifiers
A quantifier metacharacter immediately follows a portion of a \<regex\> and indicates how many times that portion must occur for the match to succeed.

*

Matches zero or more repetitions of the preceding regex.

In [None]:
print(re.search('foo-*bar', 'foobar'))
print(re.search('foo-*bar', 'foo-bar'))
print(re.search('foo-*bar', 'foo--bar'))

<re.Match object; span=(0, 6), match='foobar'>
<re.Match object; span=(0, 7), match='foo-bar'>
<re.Match object; span=(0, 8), match='foo--bar'>


+

Matches one or more repetitions of the preceding regex.

In [None]:
print(re.search('foo-+bar', 'foobar'))
print(re.search('foo-+bar', 'foo-bar'))
print(re.search('foo-+bar', 'foo--bar'))


None
<re.Match object; span=(0, 7), match='foo-bar'>
<re.Match object; span=(0, 8), match='foo--bar'>


?

Matches zero or one repetition of the preceding regex.

In [None]:
print(re.search('foo-?bar', 'foobar'))
print(re.search('foo-?bar', 'foo-bar'))
print(re.search('foo-?bar', 'foo--bar'))

<re.Match object; span=(0, 6), match='foobar'>
<re.Match object; span=(0, 7), match='foo-bar'>
None


{m}

Matches exactly m repetitions of the preceding regex.

In [None]:
print(re.search('x-{3}x', 'x--x'))

None
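The cell above shows only a failing case. The sketch below adds a matching case and one with too many dashes, then uses `{3}` to shorten the three-digit pattern from earlier in the notebook. The extra test strings and variable names are chosen here for illustration:

```python
import re

# Exactly three dashes between the x characters: match.
exact = re.search('x-{3}x', 'x---x')

# Four dashes: no match, because {3} requires exactly three.
too_many = re.search('x-{3}x', 'x----x')

# [0-9]{3} is equivalent to [0-9][0-9][0-9].
three_digits = re.search('[0-9]{3}', 'foo456bar')

print(exact.span())          # (0, 5)
print(too_many)              # None
print(three_digits.group())  # 456
```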
