# <p style="color:red">Chapter 15: Regular Expressions</p>

* We will say “matching” when referring to the term pattern-matching.

* Python distinguishes two operations: searching, i.e., looking for a pattern match in any part of a string, and matching, i.e., attempting to match a pattern starting from the beginning of a string. Searches are accomplished using the search() function or method, and matching is done with the match() function or method.

### 1. REs are strings containing text and special characters that describe a pattern with which to recognize multiple strings

* The power of regular expressions comes in when special characters are used to define character sets, subgroup matching, and pattern repetition. It is these special symbols that allow an RE to match a set of strings rather than a single one.

RE symbols:

* literal: Match literal string value literal, foo
* re1|re2: Match regular expressions re1 or re2. foo|bar
* .: Match any character (except NEWLINE), b.b
* ^: Match start of string, ^Dear
* \$: Match end of string, /bin/*sh$
* \*: Match 0 or more occurrences of preceding RE, [A-Za-z0-9]*
* +: Match 1 or more occurrences of preceding RE, [a-z]+\.com
* ?: Match 0 or 1 occurrence(s) of preceding RE, goo?
* {N}: Match N occurrences of preceding RE, [0-9]{3}
* {M,N}: Match from M to N occurrences of preceding RE, [0-9]{5,9}
* [...]: Match any single character from character class, [aeiou]
* [..x-y..]: Match any single character in the range from x to y, [0-9], [A-Za-z]
* [^...]: Do not match any character from character class, including any ranges, if present, [^aeiou],[^A-Za-z0-9\_]
* (\*|+|?|{})?: Apply “non-greedy” versions of above occurrence/repetition symbols ( \*, +, ?, {}), .\*?[a-z]
* (...): Match enclosed RE and save as subgroup, ([0-9]{3})?,f(oo|u)bar
* \d: Match any decimal digit, same as [0-9] (\D is inverse of \d: do not match any numeric digit), data\d+.txt
* \w: Match any alphanumeric character, same as [A-Za-z0-9_] (\W is inverse of \w), [A-Za-z_]\w+
* \s: Match any whitespace character, same as [ \n\t\r\v\f] (\S is inverse of \s), of\sthe
* \b: Match any word boundary (\B is inverse of \b), \bThe\b
* \nn: Match saved subgroup nn (see (...) above), price: \16
* \c: Match any special character c verbatim (i.e., without its special meaning, literal), \., \\, \*
* \A(\Z): Match start (end) of string (also see ^ and \$ above), \ADear
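
As a rough illustration of a few of the symbols above, the following sketch tries several of the listed example patterns against sample strings. The sample strings and the `samples` list are invented for this illustration and are not part of the original notes.

```python
import re

# Each entry: (pattern from the list above, sample string, expected to match?)
# The sample strings are invented for this sketch.
samples = [
    (r'foo|bar', 'bar', True),              # alternation
    (r'b.b', 'bob', True),                  # dot matches any single character
    (r'^Dear', 'Dear Sir', True),           # anchored at start of string
    (r'[0-9]{3}', '42', False),             # needs three digits in a row
    (r'[a-z]+\.com', 'example.com', True),  # one or more letters, literal dot
    (r'f(oo|u)bar', 'fubar', True),         # grouping with alternation
    (r'data\d+.txt', 'data.txt', False),    # \d+ needs at least one digit
]

results = [bool(re.search(patt, text)) for patt, text, _ in samples]
for (patt, text, _), found in zip(samples, results):
    print(f'{patt!r:18} {text!r:14} -> {found}')
```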


### 2. dot

* The dot or period ( . ) symbol matches any single character except for NEWLINE (Python REs have a compilation flag, S or DOTALL, which overrides this so that the dot also matches NEWLINEs). Whether letter, number, whitespace other than “\n,” printable, non-printable, or a symbol, the dot can match them all.
* To match a literal dot (.), escape it with a backslash: \\.
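
The behavior of the dot can be checked with a short sketch. The sample strings are assumptions made for illustration; `re.DOTALL` is the flag mentioned above.

```python
import re

text = 'a\nb'

# By default the dot does not match a newline, so 'a.b' fails here.
default_match = re.match(r'a.b', text)

# With re.DOTALL (short form re.S) the dot also matches '\n'.
dotall_match = re.match(r'a.b', text, re.DOTALL)

# An escaped dot matches only a literal period.
literal_hit = re.search(r'3\.14', 'pi is 3.14')
literal_miss = re.search(r'3\.14', 'pi is 3x14')

print(default_match)               # None
print(repr(dotall_match.group()))  # 'a\nb'
print(literal_hit.group())         # 3.14
print(literal_miss)                # None
```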

### 3. word boundary:\\b
    
* The \b and \B special characters pertain to word boundary matches. \b matches at a word boundary, so a pattern that begins with \b must start at the beginning of a word, whether other characters precede that word (a word in the middle of a string) or not (a word at the beginning of a line). Likewise, \B matches only where there is no word boundary, so a pattern that begins with \B must start in the middle of a word.

| RE Pattern | Strings Matched |
| --- | --- |
| the | Any string containing the |
| \bthe | Any word that starts with the |
| \bthe\b | Matches only the word the |
| \Bthe | Any string that contains but does not begin with the |
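
A small sketch of these word-boundary patterns in action. The sample strings are invented for illustration, and raw strings are used so that `\b` reaches the RE engine unchanged.

```python
import re

# Sample strings are invented for this sketch.
starts_word = re.search(r'\bthe', 'bite the dog')
mid_word = re.search(r'\bthe', 'bitethe dog')
whole_word = re.search(r'\bthe\b', 'bathe them there')
inside_word = re.search(r'\Bthe', 'bitethe dog')

print(starts_word.group())   # the  ('the' begins a word)
print(mid_word)              # None ('the' is in the middle of a word)
print(whole_word)            # None ('the' never stands alone as a word)
print(inside_word.group())   # the  ('the' is in the middle of a word)
```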

### 4: Creating Character Classes ( [ ] )
* Brackets work only for single characters. If we wanted to match either the string “ab” or the string “cd,” we cannot use brackets: [ab][cd] matches one character from the first class followed by one from the second. In this case, the only solution is alternation, “ab|cd.”
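
The difference between a character class and alternation can be sketched as follows. The candidate strings are assumptions chosen for illustration, and `fullmatch()` is used here only to require that the whole candidate matches.

```python
import re

# [ab][cd] means: one character from 'ab', then one character from 'cd'.
class_patt = re.compile(r'[ab][cd]')
# ab|cd means: the two-character string 'ab' or the string 'cd'.
alt_patt = re.compile(r'ab|cd')

candidates = ['ab', 'cd', 'ac', 'bd']
class_hits = [s for s in candidates if class_patt.fullmatch(s)]
alt_hits = [s for s in candidates if alt_patt.fullmatch(s)]

print(class_hits)  # ['ac', 'bd']
print(alt_hits)    # ['ab', 'cd']
```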

### 5. Designating groups with parentheses(()):

* Parentheses let you extract specific strings or substrings that were part of a successful match. To accomplish this, surround the relevant part of the RE with a pair of parentheses.

### 6. RE module Core functions:

* compile(pattern,flags=0): Compile RE pattern with any optional flags and return a regex object.

### 6. RE module functions and regex object methods:
* match(pattern, string,flags=0): Attempt to match RE pattern to string with optional flags; return match object on success,None on failure.

* search(pattern, string,flags=0): Search for first occurrence of RE pattern within string with optional flags;return match object on success, None on failure.

* findall(pattern, string[,flags]): Look for all (non-overlapping) occurrences of pattern in string; return a list of matches

* finditer(pattern, string[, flags]): Same as findall() except returns an iterator instead of a list; for each match, the iterator returns a match object.

* split(pattern, string, maxsplit=0): Split string into a list according to the RE pattern delimiter and return the list of resulting substrings, splitting at most maxsplit times (splitting at all occurrences is the default).

* sub(pattern, repl, string, count=0): Replace occurrences of the RE pattern in string with repl, substituting all occurrences unless count is provided (also see subn(), which, in addition, returns the number of substitutions made).
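
To make the difference between findall() and finditer() concrete, here is a minimal sketch. It reuses a sample sentence from later in these notes; `start()` is a match object method that reports where each match begins.

```python
import re

text = 'carry the barcardi to the car'

# findall() returns the matched strings themselves.
words = re.findall(r'car', text)

# finditer() yields match objects, so positions are also available.
positions = [(m.group(), m.start()) for m in re.finditer(r'car', text)]

print(words)      # ['car', 'car', 'car']
print(positions)  # [('car', 0), ('car', 13), ('car', 26)]
```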

### 6. Match object methods:

* group(num=0): Return entire match (or specific subgroup num)
* groups(): Return all matching subgroups in a tuple (empty if there weren’t any)

* Python code is eventually compiled into bytecode, which is then executed by the interpreter.
* Calling eval() or exec() with a code object rather than a string improves performance when the same code runs repeatedly, because the compilation step does not have to be performed each time.
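
The same idea for ordinary Python code can be sketched with the built-in compile(). The expression and the variable name `x` are invented for illustration.

```python
# Compile the expression once into a code object...
code_obj = compile('x * 2 + 1', '<string>', 'eval')

# ...then evaluate it repeatedly without recompiling.
values = [eval(code_obj, {'x': x}) for x in range(4)]
print(values)  # [1, 3, 5, 7]
```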


### 7. RE compilation:

* regular expression patterns must be compiled into regex objects before any pattern matching can occur.

* For REs, which are compared many times during the course of execution, we highly recommend using precompilation first because, again, REs have to be compiled anyway, so doing it ahead of time is prudent for performance reasons. re.compile() provides this functionality.
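
A minimal sketch of precompilation, assuming a pattern that is reused across several input lines. The sample lines are invented.

```python
import re

# Compile once, outside the loop.
digits = re.compile(r'\d+')

lines = ['order 66', 'no numbers here', 'room 101']  # invented sample data
found = []
for line in lines:
    m = digits.search(line)      # method on the regex object
    found.append(m.group() if m else None)

print(found)  # ['66', None, '101']

# The module-level function gives the same result without explicit compilation.
same = [re.search(r'\d+', line) is not None for line in lines]
print(same)   # [True, False, True]
```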

### 8. RE functions and regex objects methods:

    
* Almost all of the re module functions we will be describing shortly are available as methods for regex objects. Remember, even with our recommendation, precompilation is not required. If you compile, you will use methods; if you don’t, you will just use functions. The good news is that either way, the names are the same whether a function or a method.

* Optional flags may be given as arguments for specialized compilation. These flags allow for case-insensitive matching, using system locale settings for matching alphanumeric characters, etc.
    * re.LOCALE: Make \w, \W, \b, \B, \s and \S dependent on the current locale.
    * re.DOTALL: Make the '.' special character match any character at all, including a newline; without this flag, '.' will match anything except a newline.

* If you want to use these flags with the methods, they must already be integrated into the compiled regex objects.
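
The following sketch shows flags being integrated at compile time and, for comparison, passed to a module-level function. The pattern and test string are assumptions for illustration; `re.IGNORECASE` is the case-insensitive flag.

```python
import re

# Flags are passed at compile time...
patt = re.compile(r'hello.world', re.IGNORECASE | re.DOTALL)

# ...and then apply to every method call on the regex object.
m = patt.match('HELLO\nWorld')
print(m is not None)   # True

# With the module-level function, flags are passed on each call instead.
m2 = re.match(r'hello.world', 'HELLO\nWorld', re.IGNORECASE | re.DOTALL)
print(m2 is not None)  # True

# Without the flags, the same pattern does not match this string.
no_flags = re.match(r'hello.world', 'HELLO\nWorld')
print(no_flags)        # None
```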

### 9.Match objects and group(), groups() methods

* There is another object type in addition to the regex object when dealing with regular expressions, the match object. These are the objects returned on successful calls to match() or search(). Match objects have two primary methods, group() and groups().

* group() returns either the entire match or a specific subgroup, if requested. groups() returns a tuple consisting of all the subgroups. If the pattern contains no subgroups, groups() returns an empty tuple while group() still returns the entire match.

### 10. match(): Matching Strings

* The match() function attempts to match the pattern to the string, starting at the beginning. If the match is successful, a match object is returned, but on failure, None is returned.

* re.match('foo', 'seafood') finds no match: match() attempts to match the pattern to the string from the beginning, i.e., the “f” in the pattern is compared with the “s” in the string, which fails immediately.

* The group() method of a match object can be used to show the successful match.


In [7]:
import re

In [8]:
m=re.match('foo','foo')
if m is not None:
    print(m.group())

foo


In [9]:
# We can even sometimes bypass saving the result
# altogether, taking advantage of Python's
# object-oriented nature (this raises AttributeError
# if there is no match, because match() returns None):
re.match('foo', 'food on the table').group()


'foo'

### 11. Search(): Looking for a Pattern within a String 

* It works exactly in the same way as match() except that it searches for the first occurrence of the given RE pattern anywhere within its string argument. Again, a match object is returned on success and None otherwise.


* search() looks for the first occurrence of the pattern within the string. search() searches strictly from left to right.

In [11]:
m=re.match('foo','seafood') # no match
if m is not None:
    print(m.group())

In [12]:
m=re.search('foo','seafood')
if m is not None:
    print(m.group())

foo


### 12. matching any single character (.)

* a dot cannot match a NEWLINE or a non-character, i.e., the empty string.

In [15]:
anyend=".end"
m=re.match(anyend,"bend")
if m is not None:
    print(m.group())

bend


In [16]:
m=re.match(r'(\w\w\w)-(\d\d\d)','abc-123') # raw string keeps \w and \d intact
m.group()

'abc-123'

In [17]:
m.group(1)

'abc'

In [18]:
m.group(2)

'123'

In [19]:
m.groups()

('abc', '123')

* group() is used in the normal way to show the entire match, but it can also be used to grab individual subgroup matches. The groups() method returns a tuple of all the subgroup matches.

In [20]:
m=re.match('ab','ab')
m.group()# entire match

'ab'

In [21]:
m.groups()# all subgroups

()

In [24]:
m=re.match('(ab)','ab')
m.group()# entire match

'ab'

In [25]:
m.group(1)# subgroup(1)

'ab'

In [26]:
m.groups() # all subgroups

('ab',)

In [28]:
m=re.search(r'\Bthe','bitethe dog') 
# search non-boundary

In [29]:
m.group()

'the'

* It is a good idea to use raw strings with regular expressions, so that backslash sequences such as \b reach the RE engine unchanged.

### 13. Finding every occurrence with findall()
* findall() looks for all non-overlapping occurrences of an RE pattern in a string. It is similar to search() in that it performs a string search, but it differs from match() and search() in that findall() always returns a list. The list is empty if no occurrences are found; otherwise it consists of all matches found (in left-to-right order of occurrence).

In [32]:
re.findall('car','car')

['car']

In [31]:
re.findall('car', 'carry the barcardi to the car')

['car', 'car', 'car']

* When the pattern contains more than one subgroup, each subgroup match is a single element in a tuple, and such tuples (one for each successful match) are the elements of the resulting list.
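
How the shape of the findall() result depends on the number of subgroups can be sketched as follows. The sample string is invented for illustration.

```python
import re

text = 'abc-123 def-456'   # invented sample string

# No groups: a list of whole matches.
whole = re.findall(r'\w+-\d+', text)
# One group: a list of that group's text.
one_group = re.findall(r'(\w+)-\d+', text)
# Two groups: a list of tuples, one tuple per match.
two_groups = re.findall(r'(\w+)-(\d+)', text)

print(whole)       # ['abc-123', 'def-456']
print(one_group)   # ['abc', 'def']
print(two_groups)  # [('abc', '123'), ('def', '456')]
```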

### 14.Searching and Replacing with sub() [and subn()]

* There are two functions/methods for search-and-replace functionality: sub() and subn().
* They are almost identical and replace all matched occurrences of the RE pattern in a string with some sort of replacement. The replacement is usually a string, but it can also be a function that returns a replacement string.
    * subn() is exactly the same as sub(), but it also returns the total number of substitutions made—both the newly substituted string and the substitution count are returned as a 2-tuple.

In [34]:
re.sub('X', 'Mr. Smith', 'attn: X\n\nDear X,\n')

'attn: Mr. Smith\n\nDear Mr. Smith,\n'

In [35]:
re.subn('X', 'Mr. Smith', 'attn: X\n\nDear X,\n')

('attn: Mr. Smith\n\nDear Mr. Smith,\n', 2)

In [36]:
print(re.sub('X', 'Mr. Smith', 'attn: X\n\nDear X,\n'))

attn: Mr. Smith

Dear Mr. Smith,
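
The examples above use a string replacement. A replacement function, as mentioned earlier in this section, might look like the following sketch. The helper name `double` and the sample sentence are hypothetical.

```python
import re

def double(match):
    """Return the matched number doubled, as a string."""
    return str(int(match.group()) * 2)

# invented sample string
result = re.sub(r'\d+', double, '3 apples and 10 pears')
print(result)      # 6 apples and 20 pears

# count limits the number of replacements (0, the default, means all).
first_only = re.sub(r'\d+', double, '3 apples and 10 pears', count=1)
print(first_only)  # 6 apples and 10 pears
```

The function receives a match object for each occurrence and must return the replacement text as a string.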



### 15.Splitting (on Delimiting Pattern) with split()
* The re module function and regex object method split() work similarly to their string counterpart, but rather than splitting on a fixed string, they split a string based on an RE pattern.
* You can specify the maximum number of splits by setting the maxsplit argument to a value other than zero.
* If the delimiter given is not a regular expression that uses special symbols to match multiple patterns, then re.split() works in the same manner as str.split() with that delimiter.

In [37]:
re.split(':', 'str1:str2:str3')

['str1', 'str2', 'str3']
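
Where re.split() goes beyond the string method is with delimiters that vary. A minimal sketch, assuming an invented input line with inconsistent separators:

```python
import re

# invented sample string with inconsistent delimiters
line = 'str1, str2;str3   str4'

# Split on a comma or semicolon with optional whitespace, or on whitespace runs.
parts = re.split(r'\s*[,;]\s*|\s+', line)
print(parts)    # ['str1', 'str2', 'str3', 'str4']

# maxsplit limits the number of splits; the remainder stays in one piece.
limited = re.split(r'\s*[,;]\s*|\s+', line, maxsplit=1)
print(limited)  # ['str1', 'str2;str3   str4']
```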

* Problems may occur if a symbol is used by both ASCII and regular expressions, so the use of Python raw strings is recommended to prevent them. One more caution: the “\w” and “\W” alphanumeric character sets are affected by the L or LOCALE compilation flag and, starting in Python 2.0, by the Unicode flag (U or UNICODE).

* There are conflicts between ASCII characters and regular expression special characters. In a normal Python string, “\b” represents the ASCII character for backspace, but “\b” is also a regular expression special symbol, meaning “match” on a word boundary.
* Solution: write the pattern as a raw string, e.g., r'\bblow', so that the backslash is passed to the RE engine unchanged.

In [38]:
m=re.match(r'\bblow','blow')
m.group()

'blow'
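
The conflict is easier to see by comparing the normal and raw versions of the same pattern. This is a small sketch; the variable names are chosen only for illustration.

```python
import re

# In a normal string, '\b' is the single backspace character (ASCII 8).
normal_len = len('\bblow')
# In a raw string, the backslash and the 'b' stay as two characters.
raw_len = len(r'\bblow')
print(normal_len, raw_len)   # 5 6

# Only the raw-string version reaches the RE engine as a word boundary.
normal_match = re.match('\bblow', 'blow')
raw_match = re.match(r'\bblow', 'blow')
print(normal_match)          # None
print(raw_match.group())     # blow
```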

In [42]:
data='Thu Feb 15 17:46:04 2007::uzifzf@dpyivihw.gov::1171590364-6-8'

In [45]:
patt=r'.+(\d+-\d+-\d+)' # greedy:
# regular expressions are inherently greedy, so
# .+ matches as many characters as possible
re.match(patt,data).group(1)

'4-6-8'

In [44]:
patt=r'.+?(\d+-\d+-\d+)' # non-greedy: .+? matches as few characters as possible
re.match(patt,data).group(1)

'1171590364-6-8'