# Regular Expression 1

### RegEx Methods

FUNCTIONS

        compile(pattern, flags=0)

        escape(pattern)

        findall(pattern, string, flags=0)

        finditer(pattern, string, flags=0)

        fullmatch(pattern, string, flags=0)

        match(pattern, string, flags=0)

        purge()

        search(pattern, string, flags=0)

        split(pattern, string, maxsplit=0, flags=0)

        sub(pattern, repl, string, count=0, flags=0)

        subn(pattern, repl, string, count=0, flags=0)

        template(pattern, flags=0)
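
As a rough illustration of how the most common of these functions behave, here is a minimal sketch. The sample sentence and variable names are made up for the example.

```python
import re

text = "2 apples, 10 pears and 7 plums"

# search() scans the whole string; match() only tries at position 0.
first = re.search(r"\d+", text)
at_start = re.match(r"\d+", text)
no_match = re.match(r"apples", text)   # None: "apples" is not at the start

# findall() returns every non-overlapping match as a list of strings.
numbers = re.findall(r"\d+", text)

# sub() replaces matches; subn() also reports how many were replaced.
masked = re.sub(r"\d+", "N", text)
masked_n, count = re.subn(r"\d+", "N", text)

# split() cuts the string wherever the pattern matches.
parts = re.split(r",\s*|\s+and\s+", text)

# compile() builds a reusable pattern object with the same methods.
word = re.compile(r"[a-z]+")
words = word.findall(text)

print(first.group(), numbers, masked, count, parts, words)
```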


#### The special characters are:

        "."      Matches any character except a newline.
        "^"      Matches the start of the string.
        "$"      Matches the end of the string or just before the newline at
                 the end of the string.
        "*"      Matches 0 or more (greedy) repetitions of the preceding RE.
                 Greedy means that it will match as many repetitions as possible.
        "+"      Matches 1 or more (greedy) repetitions of the preceding RE.
        "?"      Matches 0 or 1 (greedy) of the preceding RE.
        
        
        *?,+?,?? Non-greedy versions of the previous three special characters.
        
        {m,n}    Matches from m to n repetitions of the preceding RE.
        {m,n}?   Non-greedy version of the above.
        
        "\\"     Either escapes special characters or signals a special sequence.
        []       Indicates a set of characters.
                 A "^" as the first character indicates a complementing set.
                 
        "|"      A|B, creates an RE that will match either A or B.
        
        (...)    Matches the RE inside the parentheses.
                 The contents can be retrieved or matched later in the string.
        (?aiLmsux) Set the A, I, L, M, S, U, or X flag for the RE (see below).
        (?:...)  Non-grouping version of regular parentheses.
        (?P<name>...) The substring matched by the group is accessible by name.
        (?P=name)     Matches the text matched earlier by the group named name.
        (?#...)  A comment; ignored.
        (?=...)  Matches if ... matches next, but doesn't consume the string.
        (?!...)  Matches if ... doesn't match next.
        (?<=...) Matches if preceded by ... (must be fixed length).
        (?<!...) Matches if not preceded by ... (must be fixed length).
        (?(id/name)yes|no) Matches yes pattern if the group with id/name matched,
                           the (optional) no pattern otherwise.
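
A few of the less obvious entries above can be checked interactively. The sketch below uses made-up sample strings; the `bracketed` pattern is only an illustration of the conditional syntax.

```python
import re

# "." stops at a newline unless DOTALL is set.
dot_across_newline = re.search(r"a.b", "a\nb")          # None

# "$" also matches just before a single trailing newline.
end_before_newline = re.search(r"end$", "the end\n")    # matches "end"

# An inline flag group at the start of the pattern sets IGNORECASE.
inline_flag = re.search(r"(?i)hello", "Say HELLO")      # matches "HELLO"

# Conditional group: require ">" only when the optional "<" was matched.
bracketed = re.compile(r"^(<)?\w+(?(1)>)$")
results = {s: bool(bracketed.match(s)) for s in ["<user>", "user", "<user", "user>"]}
print(results)
```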
    
#### The special sequences consist of "\\" and a character from the list below.  If the ordinary character is not on the list, then the resulting RE will match the second character.

        \number  Matches the contents of the group of the same number.
        \A       Matches only at the start of the string.
        \Z       Matches only at the end of the string.
        \b       Matches the empty string, but only at the start or end of a word.
        \B       Matches the empty string, but not at the start or end of a word.
        \d       Matches any decimal digit; equivalent to the set [0-9] in
                 bytes patterns or string patterns with the ASCII flag.
                 In string patterns without the ASCII flag, it will match the whole
                 range of Unicode digits.
        \D       Matches any non-digit character; equivalent to [^\d].
        \s       Matches any whitespace character; equivalent to [ \t\n\r\f\v] in
                 bytes patterns or string patterns with the ASCII flag.
                 In string patterns without the ASCII flag, it will match the whole
                 range of Unicode whitespace characters.
        \S       Matches any non-whitespace character; equivalent to [^\s].
        \w       Matches any alphanumeric character; equivalent to [a-zA-Z0-9_]
                 in bytes patterns or string patterns with the ASCII flag.
                 In string patterns without the ASCII flag, it will match the
                 range of Unicode alphanumeric characters (letters plus digits
                 plus underscore).
                 With LOCALE, it will match the set [0-9_] plus characters defined
                 as letters for the current locale.
        \W       Matches the complement of \w.
        \\       Matches a literal backslash.
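
The difference between Unicode and ASCII matching described above can be seen in a short experiment. This is a sketch with assumed sample text; non-ASCII characters are written as escapes so the snippet stays copy-paste safe.

```python
import re

# Assumed sample text: "42" in ASCII digits, then "42" in Arabic-Indic digits.
text = "room 42, \u0664\u0662"

unicode_digits = re.findall(r"\d+", text)                  # both numbers
ascii_digits = re.findall(r"\d+", text, flags=re.ASCII)    # only "42"

word = "na\u00efve"   # "naive" spelled with a non-ASCII i-diaeresis
unicode_words = re.findall(r"\w+", word)                   # one word
ascii_words = re.findall(r"\w+", word, flags=re.ASCII)     # split at the non-ASCII letter

# \A anchors to the start of the whole string even when MULTILINE is on.
only_first = re.findall(r"\A\w+", "one\ntwo", flags=re.MULTILINE)

print(unicode_digits, ascii_digits, unicode_words, ascii_words, only_first)
```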

#### Some of the functions in this module take flags as optional parameters:

        A  ASCII       For string patterns, make \w, \W, \b, \B, \d, \D
                       match the corresponding ASCII character categories
                       (rather than the whole Unicode categories, which is the
                       default).
                       For bytes patterns, this flag is the only available
                       behaviour and needn't be specified.
        I  IGNORECASE  Perform case-insensitive matching.
        L  LOCALE      Make \w, \W, \b, \B dependent on the current locale.
        M  MULTILINE   "^" matches the beginning of lines (after a newline)
                       as well as the string.
                       
                       "$" matches the end of lines (before a newline) as well
                       as the end of the string.
                       
        S  DOTALL      "." matches any character at all, including the newline.
        X  VERBOSE     Ignore whitespace and comments for nicer looking RE's.
        U  UNICODE     For compatibility only. Ignored for string patterns (it
                       is the default), and forbidden for bytes patterns.
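
A minimal sketch of how these flags are passed in practice, using a made-up three-line string. Flags are combined with the `|` operator.

```python
import re

text = "first line\nSecond LINE\nthird line"

# MULTILINE: "^" also matches right after each newline.
starts = re.findall(r"^\w+", text, flags=re.MULTILINE)

# IGNORECASE: "line" also matches "LINE".
lines = re.findall(r"line", text, flags=re.IGNORECASE)

# DOTALL: "." is allowed to cross newlines.
without_dotall = re.search(r"first.*third", text)
with_dotall = re.search(r"first.*third", text, flags=re.DOTALL)

# Flags can be combined with "|".
both = re.findall(r"^second", text, flags=re.I | re.M)

# VERBOSE: whitespace and comments inside the pattern are ignored.
number = re.compile(
    r"""
    \d+          # integer part
    (?:\.\d+)?   # optional fractional part
    """,
    re.VERBOSE,
)
found = number.findall("3.14 and 42")

print(starts, lines, both, found)
```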

## Basic Tutorial

### Anchors — ^ and $

        ^The        matches any string that starts with The 

        end$        matches a string that ends with end

        ^The end$   exact string match (starts and ends with The end)

        roar        matches any string that has the text roar in it
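
The anchor patterns above can be tried with `re.search`. The helper `matches` below is a hypothetical convenience function written for this example.

```python
import re

def matches(pattern, s):
    """Return True when pattern is found anywhere in s."""
    return re.search(pattern, s) is not None

starts_with_the = matches(r"^The", "The end")        # True
not_at_start = matches(r"^The", "In The end")        # False
ends_with_end = matches(r"end$", "The end")          # True
exact = matches(r"^The end$", "The end")             # True
not_exact = matches(r"^The end$", "The end.")        # False
contains_roar = matches(r"roar", "uproarious")       # True

print(starts_with_the, not_at_start, ends_with_end, exact, not_exact, contains_roar)
```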

### Quantifiers — * + ? and {}

        abc*        matches a string that has ab followed by zero or more c 
        abc+        matches a string that has ab followed by one or more c
        abc?        matches a string that has ab followed by zero or one c
        abc{2}      matches a string that has ab followed by 2 c
        abc{2,}     matches a string that has ab followed by 2 or more c
        abc{2,5}    matches a string that has ab followed by 2 up to 5 c
        a(bc)*      matches a string that has a followed by zero or more copies of the sequence bc
        a(bc){2,5}  matches a string that has a followed by 2 up to 5 copies of the sequence bc
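
To see where each quantifier draws the line, the sketch below tests whole strings with `re.fullmatch`. The helper `full` is a hypothetical name used only here.

```python
import re

def full(pattern, s):
    """Return True when the whole string s matches pattern."""
    return re.fullmatch(pattern, s) is not None

checks = {
    "abc* on ab": full(r"abc*", "ab"),                 # zero c is fine
    "abc+ on ab": full(r"abc+", "ab"),                 # needs at least one c
    "abc? on abcc": full(r"abc?", "abcc"),             # at most one c
    "abc{2} on abcc": full(r"abc{2}", "abcc"),         # exactly two c
    "abc{2,} on abccccc": full(r"abc{2,}", "abccccc"),
    "abc{2,5} on abcccccc": full(r"abc{2,5}", "abcccccc"),  # six c is too many
    "a(bc)* on abcbc": full(r"a(bc)*", "abcbc"),
    "a(bc){2,5} on abc": full(r"a(bc){2,5}", "abc"),   # only one copy of bc
}

for label, ok in checks.items():
    print(label, ok)
```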

### OR operator — | or []

        a(b|c)     matches a string that has a followed by b or c 
        a[bc]      same as previous
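
Both forms match the same text, but in Python the parenthesised version also creates a capturing group, which changes what `findall` returns. A small sketch with made-up input:

```python
import re

text = "ab ac ad"

# With a capturing group, findall() returns the group contents.
alt = re.findall(r"a(b|c)", text)

# With a character set there is no group, so whole matches are returned.
cls = re.findall(r"a[bc]", text)

# "|" can also separate longer alternatives, which a set cannot do.
animals = re.findall(r"cat|dog", "cat, dog, cow")

print(alt, cls, animals)
```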

### Character classes — \d \w \s and .

        \d         matches a single character that is a digit 
        \w         matches a word character (alphanumeric character plus underscore) 
        \s         matches a whitespace character (includes tabs and line breaks)
        .          matches any character except a newline

        \D, \W and \S are the negations of \d, \w and \s respectively.

        \D         matches a single non-digit character 
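
A quick way to get a feel for these classes is to run them through `re.findall`. The input strings below are invented for the example.

```python
import re

text = "id_7:\t42 ok"

digits = re.findall(r"\d", text)           # single digit characters
words = re.findall(r"\w+", text)           # runs of word characters
spaces = re.findall(r"\s", text)           # the tab and the space
after_four = re.findall(r"4.", text)       # "4" plus any one character
non_digits = re.findall(r"\D+", "a1b22c")  # runs of non-digits

print(digits, words, spaces, after_four, non_digits)
```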

### Escaping special characters

        To be taken literally, the characters ^ . [ $ ( ) | * + ? { \ must be escaped
        with a backslash \, as they have special meaning.

        \$\d       matches a string that has a $ before one digit

        Notice that you can also match non-printable characters like 
        tabs \t, new-lines \n, carriage returns \r
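
In Python, escaping can be done by hand or with `re.escape`, which adds the backslashes for you. The sketch below uses made-up strings and avoids relying on the exact text `re.escape` returns, since that has varied between Python versions.

```python
import re

# A hand-escaped "$" followed by one digit.
price = re.search(r"\$\d", "cost: $5 each")

# Unescaped, "+" is a quantifier, so this looks for "11", "111", ...
unescaped = re.search(r"1+1", "1+1=2")

# Escaped, "+" is an ordinary plus sign.
escaped = re.search(r"1\+1", "1+1=2")

# re.escape() builds a pattern that matches the given text literally.
literal = "1+1=2?"
auto = re.escape(literal)
round_trip = re.fullmatch(auto, literal)

# Non-printable characters can be matched too.
fields = re.split(r"\t", "a\tb\tc")

print(price.group(), unescaped, escaped.group(), bool(round_trip), fields)
```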

### Flags

        So far we have seen how to construct a regex, but not a fundamental concept: flags.

        In languages with regex literals, such as JavaScript, a regex is usually written as /abc/,
        where the search pattern is delimited by two slash characters /.
        Flags are specified after the closing slash and can be combined with each other.
        Python's re module has no slash delimiters; flags are passed through the
        flags argument of its functions instead.

        g (global) does not return after the first match,
        restarting the subsequent searches from the end of the previous match
        (in Python, use findall() or finditer() instead of a flag)

        m (multi-line) when enabled ^ and $ will match the start and end of a line,
        instead of the whole string (re.M in Python)

        i (insensitive) makes the whole expression case-insensitive
        (for instance /aBc/i would match AbC; re.I in Python)
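
Python has no `g` flag, so "global" behaviour comes from choosing a different function. The sketch below shows one possible mapping, with an invented sample string.

```python
import re

text = "abc ABC abc"

# Like /abc/ : stop at the first match.
first_only = re.search(r"abc", text).span()

# Like /abc/g : keep going from the end of the previous match.
all_spans = [m.span() for m in re.finditer(r"abc", text)]

# Like /abc/gi : every match, ignoring case.
any_case = re.findall(r"abc", text, flags=re.I)

# sub() is "global" by default; count=1 limits it to the first match.
replace_all = re.sub(r"abc", "x", text)
replace_one = re.sub(r"abc", "x", text, count=1)

print(first_only, all_spans, any_case, replace_all, replace_one)
```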

## Intermediate topics

### Grouping and capturing — ()

        a(bc)           parentheses create a capturing group with value bc 
        a(?:bc)*        using ?: we disable the capturing group 
        a(?P<foo>bc)    using ?P<foo> we give the group a name (Python syntax;
                        flavours such as JavaScript write (?<foo>bc) instead)


        This operator is very useful when we need to extract information from strings or data using your preferred programming language. When several groups capture text, the captures are exposed in the form of a classical array: we access their values by index on the result of the match.

        If we choose to name the groups (using (?P<foo>...) in Python), we can also retrieve the group values from the match result like a dictionary whose keys are the group names.
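
In Python, named groups are spelled `(?P<name>...)`, and a match object exposes captures both by index and by name. A minimal sketch with a made-up date string:

```python
import re

m = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})-(\d{2})", "released 2021-03-15")

whole = m.group(0)          # the entire match
year_by_index = m.group(1)  # groups are numbered from 1
all_groups = m.groups()     # every group as a tuple
month_by_name = m.group("month")
named = m.groupdict()       # only the named groups, as a dict

# A non-capturing group repeats "bc" without creating a group.
no_groups = re.search(r"a(?:bc)*", "abcbc").groups()

print(whole, year_by_index, all_groups, month_by_name, named, no_groups)
```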

### Bracket expressions — []

        [abc]            matches a string that has either an a or a b or a c -> is the same as a|b|c 
        [a-c]            same as previous
        [a-fA-F0-9]      a string that represents a single hexadecimal digit, case insensitively 
        [0-9]%           a string that has a character from 0 to 9 before a % sign
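        [^a-zA-Z]        a string that has a character that is not a letter from a to z or from A to Z. 
                             In this case the ^ is used as negation of the expression 


        Remember that inside bracket expressions most special characters (such as . * + ? $ and |)
        lose their special powers, so they do not need to be escaped. In Python the backslash \
        stays special (so \d or \] still work), as do ^ at the start, - between two characters
        and the closing ].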
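
The bracket expressions above can be tried as follows. Note that in Python's `re` the backslash keeps its meaning inside brackets, which the last line demonstrates. The inputs are invented.

```python
import re

is_hex = re.fullmatch(r"[a-fA-F0-9]+", "1F") is not None
not_hex = re.fullmatch(r"[a-fA-F0-9]+", "1G") is not None

# One digit followed by a percent sign.
percents = re.findall(r"[0-9]%", "5% of 20%")

# "^" as the first character negates the set.
non_letters = re.findall(r"[^a-zA-Z]", "a1-b")

# ".", "+" and "*" are ordinary characters inside a set.
literal_specials = re.findall(r"[.+*]", "a.b+c*")

# The backslash still works inside a set in Python: \d and an escaped "-".
with_escape = re.findall(r"[\d\-]", "a1-b")

print(is_hex, not_hex, percents, non_letters, literal_specials, with_escape)
```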

### Greedy and Lazy match

        The quantifiers (* + ? {}) are greedy operators, 
        so they expand the match as far as they can through the provided text.

        For example, <.+> matches <div> simple div</div> in This is a <div> simple div</div> test. 
        In order to catch only the div tag we can use a ? to make it lazy:

        <.+?>            matches any character one or more times between < and >, expanding only as needed

        Notice that a better solution should avoid the usage of . in favor of a more strict regex:

        <[^<>]+>         matches any character except < or > one or more times included inside < and >

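The three patterns from this section can be compared side by side on the same sentence. A minimal sketch:

```python
import re

text = "This is a <div> simple div</div> test."

# Greedy: runs from the first "<" to the last ">".
greedy = re.findall(r"<.+>", text)

# Lazy: stops at the first ">" it can.
lazy = re.findall(r"<.+?>", text)

# Stricter: never crosses another "<" or ">".
strict = re.findall(r"<[^<>]+>", text)

print(greedy, lazy, strict)
```

The lazy and strict patterns give the same result here; the strict one has the advantage that it cannot run past a stray `<` or `>`.
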
## Advanced topics

### Boundaries — \b and \B

        \babc\b        performs a "whole words only" search

        \b             represents an anchor (similar to ^ and $), matching positions where one
                       side is a word character (like \w) and the other side is not a word
                       character (for instance it may be the beginning of the string or a
                       space character).

                       It comes with its negation, \B. This matches all positions where \b doesn’t
                       match, and can be used if we want to find a search pattern fully surrounded
                       by word characters.

        \Babc\B          matches only if the pattern is fully surrounded by word characters
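
To see which occurrences each pattern picks, the sketch below records match positions in a made-up string. Raw strings matter here, because `"\b"` in a normal Python string is a backspace character.

```python
import re

text = "abc abcd xabcx abc."

# Whole-word matches: the first "abc" and the one before the full stop.
whole_words = [m.span() for m in re.finditer(r"\babc\b", text)]

# Matches with word characters on both sides: only the one inside "xabcx".
embedded = [m.span() for m in re.finditer(r"\Babc\B", text)]

print(whole_words, embedded)
```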

### Back-references — \1

        ([abc])\1              using \1 it matches the same text that was matched by the first capturing group 

        ([abc])([de])\2\1      we can use \2 (\3, \4, etc.) to identify the same text that was matched
                               by the second (third, fourth, etc.) capturing group 

        (?P<foo>[abc])(?P=foo) we name the group foo and reference it later with (?P=foo)
                               (Python syntax; flavours such as JavaScript write (?<foo>[abc])\k<foo>).
                               The result is the same as the first regex 
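
In Python, numbered back-references look the same as above, while a named back-reference is written `(?P=name)`. A short sketch with invented inputs, including a common use (collapsing a doubled word):

```python
import re

# \1 must repeat whatever the first group matched.
doubled = re.search(r"([abc])\1", "xbby").group()

# \2 then \1 mirror the two groups.
mirrored = re.search(r"([abc])([de])\2\1", "adda").group()

# Named group plus named back-reference (Python spelling).
named = re.search(r"(?P<foo>[abc])(?P=foo)", "xccy").group()

# Back-references also work in the replacement string of sub().
fixed = re.sub(r"\b(\w+) \1\b", r"\1", "the the cat")

print(doubled, mirrored, named, fixed)
```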

### Look-ahead and Look-behind — (?=) and (?<=)

        d(?=r)       matches a d only if it is followed by r, 
                     but r will not be part of the overall regex match 

        (?<=r)d      matches a d only if it is preceded by an r, 
                     but r will not be part of the overall regex match
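
The sketch below records where each look-around pattern matches in a made-up string, and shows that the matched text is only the `d` itself.

```python
import re

text = "dr ad rd d"

# Positions of every d that is followed by r.
ahead = [m.start() for m in re.finditer(r"d(?=r)", text)]

# Positions of every d that is preceded by r.
behind = [m.start() for m in re.finditer(r"(?<=r)d", text)]

# The r is checked but not consumed.
matched_text = re.search(r"d(?=r)", text).group()

# A practical use: digits that come right after a dollar sign.
amounts = re.findall(r"(?<=\$)\d+", "pay $15 for 3 items")

print(ahead, behind, matched_text, amounts)
```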



### You can also use the negation operator!

        d(?!r)       matches a d only if it is not followed by r, 
                     but r will not be part of the overall regex match 

        (?<!r)d      matches a d only if it is not preceded by an r, 
                     but r will not be part of the overall regex match
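
The negative forms can be explored the same way. This is a sketch on an invented string, recording the position of each matching `d`.

```python
import re

text = "dr ad rd d"

# Every d that is NOT followed by r (end of string counts as "not followed").
not_ahead = [m.start() for m in re.finditer(r"d(?!r)", text)]

# Every d that is NOT preceded by r (start of string counts as "not preceded").
not_behind = [m.start() for m in re.finditer(r"(?<!r)d", text)]

# A practical use: numbers that do not come right after a dollar sign.
plain_numbers = re.findall(r"(?<!\$)\b\d+", "pay $15 for 3 items")

print(not_ahead, not_behind, plain_numbers)
```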