# Regular Expressions in Python

## What is a Regular Expression?
A regular expression is a string pattern written in a compact syntax that lets us quickly check whether a given string matches or contains that pattern.

Regular expressions are helpful, but few non-programmers know about them, even though many modern text editors and word processors, such as Microsoft Word or OpenOffice, have find and find-and-replace features that can search with regular expressions. They are huge time-savers, not just for software users but also for programmers.

In Python, regular expressions are supported by the `re` module. To use them in your Python scripts, first bring the module in with an `import` statement.

In [1]:
import re

In [11]:
for i in dir(re):
    print(i, end=" , ")

A , ASCII , DEBUG , DOTALL , I , IGNORECASE , L , LOCALE , M , MULTILINE , RegexFlag , S , Scanner , T , TEMPLATE , U , UNICODE , VERBOSE , X , _MAXCACHE , __all__ , __builtins__ , __cached__ , __doc__ , __file__ , __loader__ , __name__ , __package__ , __spec__ , __version__ , _alphanum_bytes , _alphanum_str , _cache , _compile , _compile_repl , _expand , _locale , _pattern_type , _pickle , _subx , compile , copyreg , enum , error , escape , findall , finditer , fullmatch , functools , match , purge , search , split , sre_compile , sre_parse , sub , subn , template , 

In [12]:
print(len(dir(re)))

58


In [13]:
help(re)

Help on module re:

NAME
    re - Support for regular expressions (RE).

MODULE REFERENCE
    https://docs.python.org/3.6/library/re
    
    The following documentation is automatically generated from the Python
    source files.  It may be incomplete, incorrect or include features that
    are considered implementation detail and may vary between Python
    implementations.  When in doubt, consult the module reference at the
    location listed above.

DESCRIPTION
    This module provides regular expression matching operations similar to
    those found in Perl.  It supports both 8-bit and Unicode strings; both
    the pattern and the strings being processed can contain null bytes and
    characters outside the US ASCII range.
    
    Regular expressions can contain both special and ordinary characters.
    Most ordinary characters, like "A", "a", or "0", are the simplest
    regular expressions; they simply match themselves.  You can
    concatenate ordinary characters, so last mat

In [16]:
help(re.compile)

Help on function compile in module re:

compile(pattern, flags=0)
    Compile a regular expression pattern, returning a pattern object.



In [19]:
help(re.purge)

Help on function purge in module re:

purge()
    Clear the regular expression caches



## Special Characters
`^` | Matches the expression to its right at the start of a string. In multi-line mode, it also matches immediately after each `\n` in the string.

`$` | Matches the expression to its left at the end of a string (or just before a trailing `\n`). In multi-line mode, it also matches before each `\n` in the string.

`.` | Matches any character except line terminators like `\n`.

`\` | Escapes special characters or denotes character classes.

`A|B` | Matches expression `A` or `B`. If `A` is matched first, `B` is left untried.

`+` | Greedily matches the expression to its left 1 or more times.

`*`| Greedily matches the expression to its left 0 or more times.

`?` | Greedily matches the expression to its left 0 or 1 times. But if `?` is added to a quantifier (`+`, `*`, and `?` itself), that quantifier will match in a non-greedy manner.

`{m}` | Matches the expression to its left exactly `m` times.

`{m,n}` | Greedily matches the expression to its left from `m` to `n` times, preferring as many repetitions as possible.

`{m,n}?` | Non-greedy version of `{m,n}`: matches the expression to its left from `m` to `n` times, preferring as few repetitions as possible.
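
The anchor and greediness rules above are easier to see side by side. The following is a minimal sketch; the sample strings and variable names are made up for illustration.

```python
import re

text = "first line\nsecond line"

# Without re.MULTILINE, ^ matches only at the very start of the string.
start_only = re.findall(r"^\w+", text)
# With re.MULTILINE, ^ also matches right after each newline.
each_line_start = re.findall(r"^\w+", text, re.MULTILINE)
# In MULTILINE mode, $ matches before each newline and at the end.
each_line_end = re.findall(r"\w+$", text, re.MULTILINE)

html = "<b>bold</b>"
# Greedy: .* takes as much as it can while still allowing a match.
greedy = re.search(r"<.*>", html).group()
# Non-greedy: .*? stops at the first '>' that completes a match.
lazy = re.search(r"<.*?>", html).group()

# {m,n} prefers the most repetitions, {m,n}? the fewest.
most = re.search(r"a{2,4}", "aaaaa").group()
fewest = re.search(r"a{2,4}?", "aaaaa").group()

print(start_only)       # ['first']
print(each_line_start)  # ['first', 'second']
print(each_line_end)    # ['line', 'line']
print(greedy, lazy)     # <b>bold</b> <b>
print(most, fewest)     # aaaa aa
```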

## Character Classes (a.k.a. Special Sequences)
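`\w` | Matches alphanumeric characters, which means `a-z`, `A-Z`, and `0-9`. It also matches the underscore, `_`. For `str` patterns in Python 3 it also matches other Unicode word characters unless the ASCII flag is used.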

`\d` | Matches digits, which means `0-9`.

`\D` | Matches any non-digits.

`\s` | Matches whitespace characters, which include the `\t`, `\n`, `\r`, and space characters.

`\S`| Matches non-whitespace characters.

`\b` | Matches the boundary (or empty string) at the start and end of a word, that is, between `\w` and `\W`.

`\B` | Matches where `\b` does not, that is, at positions that are not a word boundary (for example, between two `\w` characters).

`\A` | Matches the expression to its right at the absolute start of a string whether in single or multi-line mode.

`\Z` | Matches the expression to its left at the absolute end of a string whether in single or multi-line mode.
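
As a rough illustration of the boundary and anchor sequences above, the sketch below compares `\b` with `\B` and `\A`/`\Z` with `^` in multi-line mode. The sample strings are invented for the example.

```python
import re

text = "cat concat cats"

# \bcat\b matches "cat" only when it stands alone as a word.
whole_words = re.findall(r"\bcat\b", text)
# \Bcat matches "cat" only when it does NOT start a word (here: inside "concat").
inside_positions = [m.start() for m in re.finditer(r"\Bcat", text)]

lines = "one\ntwo"
# ^ matches at every line start in MULTILINE mode, but \A never does.
caret_multiline = re.findall(r"^\w+", lines, re.MULTILINE)
absolute_start = re.findall(r"\A\w+", lines, re.MULTILINE)
absolute_end = re.findall(r"\w+\Z", lines, re.MULTILINE)

# \w is Unicode-aware for str patterns; re.ASCII restricts it to [a-zA-Z0-9_].
unicode_word = re.findall(r"\w+", "café")
ascii_word = re.findall(r"\w+", "café", re.ASCII)

print(whole_words)       # ['cat']
print(inside_positions)  # [7]
print(caret_multiline)   # ['one', 'two']
print(absolute_start)    # ['one']
print(absolute_end)      # ['two']
print(unicode_word)      # ['café']
print(ascii_word)        # ['caf']
```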



## Sets
`[ ]` | Contains a set of characters to match.

`[amk]` | Matches either `a`, `m`, or `k`. It does not match amk.

`[a-z]` | Matches any lowercase letter from `a` to `z`.

`[a\-z]` | Matches `a`, `-`, or `z`. It matches `-` because `\` escapes it.

`[a-]` | Matches `a` or `-`, because `-` is not being used to indicate a series of characters.

`[-a]` | As above, matches `a` or `-`.

`[a-z0-9]` | Matches characters from `a` to `z` and also from `0` to `9`.

`[(+*)]` | Special characters become literal inside a set, so this matches `(`, `+`, `*`, and `)`.

`[^ab5]` | Adding `^` as the first character excludes every character in the set. Here, it matches any character that is not `a`, `b`, or `5`.
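
The set rules above can be checked quickly with `re.findall()`. This is a small sketch on made-up input, not an exhaustive test.

```python
import re

sample = "a-z 5 (b+c)*"

# A range: any lowercase letter.
letters = re.findall(r"[a-z]", sample)
# An escaped hyphen is a literal '-', so this set is just a, -, z.
escaped_hyphen = re.findall(r"[a\-z]", sample)
# A hyphen at the start or end of a set is also literal.
leading_hyphen = re.findall(r"[-a]", "a-b")
# Special characters lose their meaning inside a set.
specials = re.findall(r"[(+*)]", sample)
# A leading ^ negates the set.
negated = re.findall(r"[^ab5]", "ab5xy")

print(letters)         # ['a', 'z', 'b', 'c']
print(escaped_hyphen)  # ['a', '-', 'z']
print(leading_hyphen)  # ['a', '-']
print(specials)        # ['(', '+', ')', '*']
print(negated)         # ['x', 'y']
```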

## Groups
`( )` | Matches the expression inside the parentheses and groups it.

`(? )` | Inside parentheses like this, `?` acts as an extension notation. Its meaning depends on the character immediately to its right.

`(?P<name>AB)` | Matches the expression `AB`, and it can be accessed with the group name `name`.

`(?aiLmsux)` | Here, `a`, `i`, `L`, `m`, `s`, `u`, and `x` are flags:

- a — Matches ASCII only
- i — Ignore case
- L — Locale dependent
- m — Multi-line
- s — Dot matches all characters, including newline
- u — Unicode matching (already the default for `str` patterns in Python 3)
- x — Verbose

`(?:A)` | Matches the expression as represented by `A`, but unlike `(?P<name>AB)`, it cannot be retrieved afterwards.

`(?#...)` | A comment. Contents are for us to read, not for matching.

`A(?=B)` | Lookahead assertion. This matches the expression `A` only if it is followed by `B`.

`A(?!B)` | Negative lookahead assertion. This matches the expression `A` only if it is not followed by `B`.

`(?<=B)A` | Positive lookbehind assertion. This matches the expression `A` only if `B` is immediately to its left. `B` must be a fixed-length expression.

`(?<!B)A` | Negative lookbehind assertion. This matches the expression `A` only if `B` is not immediately to its left. `B` must be a fixed-length expression.

`(?P=name)` | Matches the expression matched by an earlier group named “name”.

`(...)\1` | The number `1` refers to the first group in the pattern. A backreference such as `\1` matches the same text that the group matched earlier, so we do not have to write out the whole expression again. We can use group numbers from `1` up to `99`.
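
To make the group syntax above more concrete, here is a short sketch covering named groups, an inline flag, non-capturing groups, lookarounds, and backreferences. The group names (`area`, `number`, `w`) and the sample strings are chosen only for illustration.

```python
import re

# Named groups: (?P<name>...) lets us fetch a sub-match by name.
m = re.search(r"(?P<area>\d{3})-(?P<number>\d{3}-\d{4})", "Call 415-555-4242")
area = m.group("area")
number = m.group("number")

# Inline flag: (?i) at the start of the pattern ignores case.
any_case = re.findall(r"(?i)mango", "Mango MANGO")

# With a capturing group, findall returns the group; (?:...) avoids that.
captured = re.findall(r"(ab)+", "ababab ab")
uncaptured = re.findall(r"(?:ab)+", "ababab ab")

prices = "10 USD, 20 EUR, 30 USD"
# Lookahead: digits followed by " USD" (the suffix is not part of the match).
usd = re.findall(r"\d+(?= USD)", prices)
# Negative lookahead: whole numbers NOT followed by " USD".
not_usd = re.findall(r"\b\d+\b(?! USD)", prices)
# Lookbehind: digits preceded by a dollar sign.
after_dollar = re.findall(r"(?<=\$)\d+", "cost $15 or 20")

# Backreferences: \1 (or (?P=w)) matches the same text the group matched.
repeated = re.search(r"\b(\w+) \1\b", "it is is fine").group()
repeated_named = re.search(r"\b(?P<w>\w+) (?P=w)\b", "it is is fine").group()

print(area, number)          # 415 555-4242
print(any_case)              # ['Mango', 'MANGO']
print(captured, uncaptured)  # ['ab', 'ab'] ['ababab', 'ab']
print(usd, not_usd)          # ['10', '30'] ['20']
print(after_dollar)          # ['15']
print(repeated)              # is is
```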

## Popular `re` module Functions
`re.findall(A, B)` | Matches all instances of an expression `A` in a string `B` and returns them in a list.

`re.search(A, B)` | Matches the first instance of an expression `A` in a string `B`, and returns it as a re match object.

`re.split(A, B)` | Splits a string `B` into a list using the delimiter `A`.

`re.sub(A, B, C)` | Replaces every match of `A` with `B` in the string `C`.

## Matching Regex Objects
A Regex object’s `search()` method searches the string it is passed for any matches to the regex. The `search()` method will return None if the regex pattern is not found in the string. If the pattern is found, the `search()` method returns a Match object. Match objects have a `group()` method that will return the actual matched text from the searched string.

In [8]:
phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
match_obj = phoneNumRegex.search('My number is 415-555-4242.')
print('Phone number found: ' + match_obj.group())

Phone number found: 415-555-4242
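
Because `search()` returns `None` when nothing matches, calling `group()` directly on the result can raise an `AttributeError`. One possible way to guard against that is sketched below; the helper name `find_phone` is hypothetical and not part of the `re` module.

```python
import re

phone_regex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')

def find_phone(text):
    """Return the first phone number in text, or None if there is none."""
    match_obj = phone_regex.search(text)
    if match_obj is None:
        # No match: there is no Match object to call group() on.
        return None
    return match_obj.group()

print(find_phone('My number is 415-555-4242.'))  # 415-555-4242
print(find_phone('No number here.'))             # None
```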


## Basic Patterns: Ordinary Characters
You can easily tackle many basic patterns in Python using ordinary characters. Ordinary characters are the simplest regular expressions. They match themselves exactly and have no special meaning in regular expression syntax.

In [20]:
pattern = r"Mango"
sequence = "Mango"
if re.match(pattern, sequence):
    print("Match!")
else: 
    print("Not a match!")

Match!


The `match()` function returns a match object if the start of the text matches the pattern. Otherwise it returns `None`.

In [21]:
pattern = r"Mango"
sequence = "Orange"
print(re.match(pattern, sequence))

None


## Wild Card Characters: Special Characters
Special characters do not match themselves literally; instead, they have a special meaning when used in a regular expression.

`.` - A period. Matches any single character except a newline character.

In [22]:
re.search(r'M.n.o', 'Mango').group()

'Mango'

`\w` - Lowercase w. Matches any single letter, digit or underscore.

In [23]:
re.search(r'M\wng\w', 'Mango').group()

'Mango'

`\W` - Uppercase w. Matches any character not part of `\w`(lowercase w).

In [24]:
re.search(r'S\Wgmail', 'S@gmail').group()

'S@gmail'

`\s` - Lowercase s. Matches a single whitespace character like: space, newline, tab, return.

In [25]:
re.search(r'Eat\scake', 'Eat cake').group()

'Eat cake'

`\S` - Uppercase s. Matches any character not part of `\s` (lowercase s).

In [26]:
re.search(r'M\Sngo', 'Mango').group()

'Mango'

`\d` - Lowercase d. Matches decimal digit 0-9.

In [38]:
re.search(r'Mang\d', 'Mang0').group()

'Mang0'

`^` - Caret. Matches a pattern at the start of the string.

In [39]:
re.search(r'^Eat', 'Eat Rice').group()

'Eat'

`$` - Matches a pattern at the end of string.

In [40]:
re.search(r'Rice$', 'Eat Rice').group()

'Rice'

`[a-zA-Z0-9]` - Matches any single character in the ranges `a` to `z`, `A` to `Z`, or `0` to `9`. Characters that are not in a set can be matched by complementing it: if the first character of the set is `^`, every character that is not in the set will be matched.

In [42]:
re.search(r'Number: [0-9]', 'Number: 8').group()

'Number: 8'

In [43]:
# Matches any character except 7
re.search(r'Number: [^7]', 'Number: 9').group()

'Number: 9'

`\A` - Uppercase a. Matches only at the start of the string, even in multi-line mode.

In [44]:
re.search(r'\A[A-R]ice', 'Rice').group()

'Rice'

`\b` - Lowercase b. Matches only at the beginning or end of a word.

In [45]:
re.search(r'\b[A-R]ice', 'Rice').group()

'Rice'

## Matching Zero or More with the Star
The `*` (called the star or asterisk) means “match zero or more”: the group that precedes the star can occur any number of times in the text. It can be completely absent or repeated over and over again.

In [9]:
superRegex = re.compile(r'Super(wo)*man')
match_obj1 = superRegex.search('The Adventures of Superman')
match_obj1.group()


'Superman'

In [5]:
mo2 = superRegex.search('The Adventures of Superwoman')
mo2.group()

'Superwoman'

In [6]:
mo3 = superRegex.search('The Adventures of Superwowowowoman')
mo3.group()

'Superwowowowoman'

## Matching One or More with the Plus
While * means “match zero or more,” the + (or plus) means “match one or more.” Unlike the star, which does not require its group to appear in the matched string, the group preceding a plus must appear at least once. It is not optional. 

In [11]:
superRegex = re.compile(r'Super(wo)+man')
match_obj1 = superRegex.search('The Adventures of Superwoman')
match_obj1.group()


'Superwoman'

In [13]:
match_obj2 = superRegex.search('The Adventures of Superwowowowoman')
match_obj2.group()

'Superwowowowoman'

In [15]:
match_obj3 = superRegex.search('The Adventures of Superman')
# match_obj3.group() would raise AttributeError here, because search() returned None
match_obj3 is None

True

`?` - Matches zero or one occurrence of the expression to its left.

In [46]:
re.search(r'Colou?r', 'Color').group()

'Color'

`{x}` - Repeat exactly x times.

`{x,}` - Repeat x or more times.

`{x,y}` - Repeat at least x times but no more than y times. Write it without a space after the comma; Python treats `{x, y}` as literal text.

In [47]:
re.search(r'\d{9,10}', '0987654321').group()

'0987654321'
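
The counted quantifiers above can be compared on a single input. The following sketch uses an arbitrary digit string to show how each form behaves.

```python
import re

digits = '12345'

exactly_three = re.search(r'\d{3}', digits).group()  # first three digits only
two_or_more = re.search(r'\d{2,}', digits).group()   # greedy, no upper limit
two_to_four = re.search(r'\d{2,4}', digits).group()  # greedy, capped at four
too_short = re.search(r'\d{9,10}', digits)           # fewer than nine digits

# ? makes the preceding "u" optional, so both spellings match.
british = re.search(r'Colou?r', 'Colour').group()

print(exactly_three)  # 123
print(two_or_more)    # 12345
print(two_to_four)    # 1234
print(too_short)      # None
print(british)        # Colour
```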

## `re.match(pattern, string)`:
This function finds a match only if it occurs at the start of the string. For example, calling `match()` on the string ‘Rua Analytics’ with the pattern ‘Rua’ will match. However, if we look for only ‘Analytics’, the pattern will not match.

In [50]:
result = re.match(r'Rua', 'Rua Analytics')
print(result)

<_sre.SRE_Match object; span=(0, 3), match='Rua'>


In [51]:
print(result.group(0))

Rua


## `re.search(pattern, string)`:
It is similar to `match()`, but it is not restricted to finding matches at the beginning of the string.

In [52]:
result = re.search(r'Analytics', 'Rua Analytics')
print(result.group(0))

Analytics
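
The difference between `match()` and `search()` shows up when the pattern does not start the string. As a brief sketch, the comparison below also includes `re.fullmatch()`, which appears in the `dir(re)` listing earlier and requires the whole string to match.

```python
import re

text = 'Rua Analytics'

# match() only looks at the start of the string.
at_start = re.match(r'Analytics', text)
# search() scans forward until it finds a match.
anywhere = re.search(r'Analytics', text)
# fullmatch() succeeds only if the entire string matches the pattern.
partial = re.fullmatch(r'Rua', text)
whole = re.fullmatch(r'Rua \w+', text)

print(at_start)         # None
print(anywhere.span())  # (4, 13)
print(partial)          # None
print(whole.group())    # Rua Analytics
```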


## `re.findall(pattern, string)`:
It returns a list of all non-overlapping matches of the pattern. It is not restricted to the start or end of the string. If we use `findall()` to search for ‘Rua’ in the given string, it returns both occurrences of Rua. `re.findall()` is handy when you want every match, but note that it returns plain strings rather than match objects, so `re.search()` and `re.match()` remain the better choice when you need the position or groups of a single match.

In [53]:
result = re.findall(r'Rua', 'Rua Analytics Rua')
print(result)

['Rua', 'Rua']
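
Two details of `findall()` are worth a quick sketch: when the pattern contains groups it returns the groups rather than the whole match, and `re.finditer()` is the alternative when match objects (for example, positions) are needed. The e-mail-like strings below are invented for the example.

```python
import re

# With two groups, findall returns a list of tuples, one tuple per match.
pairs = re.findall(r'(\w+)@(\w+)\.com', 'ann@example.com, bob@test.com')

# finditer yields Match objects, so start positions are available.
positions = [(m.group(), m.start())
             for m in re.finditer(r'Rua', 'Rua Analytics Rua')]

print(pairs)      # [('ann', 'example'), ('bob', 'test')]
print(positions)  # [('Rua', 0), ('Rua', 14)]
```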


## `re.split(pattern, string, [maxsplit=0])`:
This function splits the string at each occurrence of the given pattern.

In [54]:
result=re.split(r'e','occurrences')
print(result)

['occurr', 'nc', 's']


In [58]:
result=re.split(r'\s','It helps to get a list of all matching patterns')
print(result)

['It', 'helps', 'to', 'get', 'a', 'list', 'of', 'all', 'matching', 'patterns']
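
The `maxsplit` argument from the heading is not exercised above. The sketch below shows it, along with two behaviours that sometimes surprise people: repeated delimiters and capturing groups in the pattern. Inputs are made up.

```python
import re

# maxsplit=1 stops after the first split; the rest of the string is kept whole.
first_only = re.split(r'e', 'occurrences', maxsplit=1)

# \s splits on every single whitespace character, leaving an empty string
# between two adjacent spaces; \s+ treats a run of whitespace as one delimiter.
single_ws = re.split(r'\s', 'a  b')
run_ws = re.split(r'\s+', 'a  b')

# A capturing group in the pattern keeps the delimiters in the result.
keep_delims = re.split(r'(,)', 'a,b')

print(first_only)   # ['occurr', 'nces']
print(single_ws)    # ['a', '', 'b']
print(run_ws)       # ['a', 'b']
print(keep_delims)  # ['a', ',', 'b']
```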


## `re.sub(pattern, repl, string)`:
It searches for a pattern and replaces each match with a new substring. If the pattern is not found, the string is returned unchanged.

In [60]:
result=re.sub(r'notes','projects','Kaggle is the place to do data science notes')
print(result)

Kaggle is the place to do data science projects
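
Beyond a plain string replacement, `re.sub()` also accepts a `count` limit, backreferences in the replacement, and a function as the replacement. A small sketch with invented inputs follows; `re.subn()` is the variant that also reports how many replacements were made.

```python
import re

# count limits how many matches are replaced.
limited = re.sub(r'a', 'o', 'banana', count=2)

# \1 and \2 in the replacement refer to the captured groups.
swapped = re.sub(r'(\d+)-(\d+)', r'\2-\1', '10-20')

# A function receives each Match object and returns the replacement text.
doubled = re.sub(r'\d+', lambda m: str(int(m.group()) * 2),
                 '3 apples and 5 pears')

# subn returns a (new_string, number_of_replacements) tuple.
replaced, n = re.subn(r'notes', 'projects', 'notes and notes')

print(limited)      # bonona
print(swapped)      # 20-10
print(doubled)      # 6 apples and 10 pears
print(replaced, n)  # projects and projects 2
```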
