## REGEX 101
[Here](https://www.rexegg.com/) is a great resource for anyone who wants to dig further into the black hole that is regular expressions.
And [here](https://pythex.org) is a great site for testing out your regular expressions, especially when dealing with Python.

And [here](https://docs.python.org/3.4/library/re.html), of course, is the actual documentation.

Rather than teaching all the ways to use regex.. I will bring in some test samples to try and match!

### Don't worry about these too much. Just know they are helping!

In [1]:
from utils import (
    findall, 
    match,
    search,
    check_matches)
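
The `utils` module itself is not shown here. If you want to follow along without it, here is a minimal sketch of what such helpers might look like. Everything below is an assumption: the names simply mirror the imports above, each helper is assumed to turn on `re.VERBOSE` so that commented patterns work, and this `check_matches` only prints the groups rather than validating them against expected values.

```python
import re

# Hypothetical stand-ins for the notebook's `utils` helpers (the real module is not shown).
# Assumption: every helper applies re.VERBOSE so patterns may contain comments.

def findall(pattern, string):
    return re.findall(pattern, string, flags=re.VERBOSE)

def match(pattern, string):
    return re.match(pattern, string, flags=re.VERBOSE)

def search(pattern, string):
    return re.search(pattern, string, flags=re.VERBOSE)

def check_matches(pattern, string, group_names):
    """Print each named group. Simplified: no expected-value check."""
    m = re.search(pattern, string, flags=re.VERBOSE)
    if m is None:
        print("No match")
        return None
    for name in group_names:
        print(f"{name + ':': <20} {m.group(name)!r}")
    return m
```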

First.. 
Let's look at some different ways you can go about matching.

In [2]:
text = "Here is some sample 1 text with 2 numbers 3!"
r = r"\d"

# Notice that you can iterate over the answers with this one.
print('findall..')
for val in findall(r, text):
    print(val)
    
   
# For the next two.. You capture the match, then test to see if it exists.
print('\nmatch..')
m = match(r, text)
print(m.group() if m else None)

print('\nsearch..')
m = search(r, text)
print(m.group() if m else None)

findall..
1
2
3

match..
None

search..
1
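
For reference, the same three calls can be made with the standard `re` module directly. This is just a sketch of the behaviour shown above, without the `utils` wrappers.

```python
import re

text = "Here is some sample 1 text with 2 numbers 3!"

# findall returns every non-overlapping match as a list of strings.
all_digits = re.findall(r"\d", text)

# match only succeeds if the pattern matches at position 0 of the string.
at_start = re.match(r"\d", text)

# search scans forward and returns the first match found anywhere.
first_digit = re.search(r"\d", text)

print(all_digits)           # ['1', '2', '3']
print(at_start)             # None
print(first_digit.group())  # 1
```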


### With Python.. You can add comments to your regex (verbose mode, `re.VERBOSE`) to make it easier to read.

In [3]:
# Let's try to find one for match just for fun.. Remember that we need to start at the beginning of the string!
match(r"""
      [\w\s]+             # match all words or spaces
      (?P<digit>[0-9])    # group the LAST digit, and give it a name
      .*                  # everything after (!)""", 
      text)\
    .group('digit')

# NOTE: It stretched as far as it could before returning the number.. 

'3'
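
To see the "stretching" for yourself, here is a small sketch comparing the greedy `+` with its lazy cousin `+?`. It calls `re.match` with `re.VERBOSE` directly, on the assumption that this is roughly what the `match` helper does under the hood.

```python
import re

text = "Here is some sample 1 text with 2 numbers 3!"

greedy = re.match(r"""
    [\w\s]+             # greedy: grab as many word/space characters as possible
    (?P<digit>[0-9])    # then back off just enough to leave one digit
    """, text, flags=re.VERBOSE)

lazy = re.match(r"""
    [\w\s]+?            # lazy: grab as few characters as possible
    (?P<digit>[0-9])    # so the first digit is captured instead
    """, text, flags=re.VERBOSE)

print(greedy.group("digit"))  # 3
print(lazy.group("digit"))    # 1
```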

General syntax is as follows: 

```python
pattern = re.compile(r"<regex_pattern>")   # define your pattern, optionally pass in flags!
matches = re.<match_func>(pattern, string)

# validate here to see if you found anything with an if statement. 
# if you found something.. the match will be "truthy"
if matches:
    print(matches.group(<name_of_group>))
```
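
Filled in with real values, the template above could look like the following sketch. The sample string and the group name `word` are made up for illustration.

```python
import re

# Compile once; flags such as re.IGNORECASE go in as the second argument.
pattern = re.compile(r"(?P<word>sample)", re.IGNORECASE)

m = pattern.search("Here is some SAMPLE text")

# A failed search returns None, which is falsy, so guard before calling .group().
if m:
    print(m.group("word"))  # SAMPLE
```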

Now you try!!!

### Extract `12 years old`

In [4]:
text = "The boy is 12 years old"

### Extract BOTH `12 years old` and `8 years old`

In [5]:
text = "The boy is 12 years old. The girl is 8 years old."

### Capture 2 groups: `quantity` and `date_type`
Iterate over the different groups and print them out.

In [6]:
text = "The boy is 12 years old. The girl is 96 months old."
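
If you get stuck, here is one possible approach to the last exercise (it also covers the first two). It is only a sketch: the pattern assumes the unit is always `years` or `months` and that the phrase always ends in `old`.

```python
import re

text = "The boy is 12 years old. The girl is 96 months old."

# A run of digits, whitespace, a unit word, whitespace, then "old".
age = re.compile(r"(?P<quantity>\d+)\s(?P<date_type>years|months)\sold")

# finditer yields one match object per occurrence, so named groups stay accessible.
found = [(m.group(0), m.group("quantity"), m.group("date_type"))
         for m in age.finditer(text)]

for whole, quantity, date_type in found:
    print(whole, "->", quantity, date_type)
```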

## Some helpful tips!!

* `.*` -- `.` matches any character, while `*` means zero or more, and is "greedy". This is a catch all for "everything"
* `+` -- One or more
* `{1}` or `{1,2}` -- Specify exactly how many times, or a range of times (no space after the comma, or Python treats it as literal text)
* `?` -- Once or none
* `( ... )` -- Is a capture group
* `(?: ... )` -- A group.. But no capture
* `|` -- _or_ operator
* `^` -- _not_ when it is the first character inside `[ ... ]`, otherwise the start of a line
* `$` -- the end of a line
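
A few quick experiments with the tips above, using plain `re.findall`. The sample strings are made up for illustration.

```python
import re

exactly_two = re.findall(r"\d{2}", "1 22 333")        # exactly two digits
one_or_two = re.findall(r"\d{1,2}", "1 22 333")       # one or two digits
optional_u = re.findall(r"colou?r", "color colour")   # `?` makes the `u` optional

# With one capture group, findall returns just that group; (?: ... ) is not captured.
counts = re.findall(r"(\d+) (?:cats|dogs)", "3 cats and 12 dogs")

# `^` inside [...] negates the class; outside it anchors to the start.
not_vowels = re.findall(r"[^aeiou\s]", "hi you")
first_word = re.findall(r"^\w+", "first second")
last_word = re.findall(r"\w+$", "first second")

print(exactly_two)  # ['22', '33']
print(one_or_two)   # ['1', '22', '33', '3']
print(optional_u)   # ['color', 'colour']
print(counts)       # ['3', '12']
print(not_vowels)   # ['h', 'y']
print(first_word)   # ['first']
print(last_word)    # ['second']
```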

### Let's play a game!!
According to [this article](https://www.mattcutts.com/blog/seo-glossary-url-definitions/), the url is made up of many parts..
Let's create a regular expression that appropriately captures the different parts. 

### URL to test against

In [7]:
url = "http://video.google.co.uk:80/videoplay?docid=-7246927612831078230&hl=en#00h02m30s"

Part definitions: 

* The `protocol` is http. Other protocols include https, ftp, etc.
* The `host` or hostname is video.google.co.uk.
* The `subdomain` is video.
* The `domain` name is google.co.uk.
* The `top-level domain` or TLD is uk. The uk domain is also referred to as a country-code top-level domain or ccTLD. For google.com, the TLD would be com.
* The `second-level domain` (SLD) is co.uk.
* The `port` is 80, which is the default port for web servers. Other ports are possible; a web server can listen on port 8000, for example. When the port is 80, most people leave out the port.
* The `path` is /videoplay. Path typically refers to a file or location on the web server, e.g. /directory/file.html
* This URL has `parameters`. The name of one parameter is docid and the value of that parameter is -7246927612831078230. URLs can have lots of parameters. Parameters start with a question mark _(?)_ and are separated with an ampersand _(&)_.
* See the “#00h02m30s”? That’s called a `fragment` or a named anchor. The Googlers I’ve talked to are split right down the middle on which way to refer to it. Disputes on what to call it can be settled with arm wrestling, dance-offs, or drinking contests. 🙂 Typically the fragment is used to refer to an internal section within a web document. In this case, the named anchor means “skip to 2 minutes and 30 seconds into the video.” I think right now Google standardizes urls by removing any fragments from the url.
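
As a sanity check on these definitions, the standard library can already split a URL into most of these parts. This is a different technique from the regex exercise (no regular expression is written here), and `urlsplit` does not know about subdomains or top-level domains, but it is handy for comparing answers.

```python
from urllib.parse import urlsplit, parse_qs

url = "http://video.google.co.uk:80/videoplay?docid=-7246927612831078230&hl=en#00h02m30s"
parts = urlsplit(url)

print(parts.scheme)    # http
print(parts.hostname)  # video.google.co.uk
print(parts.port)      # 80
print(parts.path)      # /videoplay
print(parts.query)     # docid=-7246927612831078230&hl=en
print(parts.fragment)  # 00h02m30s

# parse_qs splits the query string on '&' and '=' into a dict of lists.
params = parse_qs(parts.query)
print(params)          # {'docid': ['-7246927612831078230'], 'hl': ['en']}
```

Note that `urlsplit` strips the leading `?` and `#`, whereas the definitions above talk about them as part of the parameters and fragment.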

So.. 
We are trying to capture 10 different groups. 
Let's try breaking this down.

Based on the definitions above.. 
We have some good guidelines to write our expressions.

### RULES

> Based on the definitions above.. 
Create a _SINGLE_ regular expression that captures _ALL_ of the different groups separately.
I have created an expression for `protocol`. 
Continue to build off of this until you have them all completed.

In [8]:
# start at the beginning.. and capture the http only.
# Wrapping it in () specifies that it is a group!
pattern = r"""

    ^                      # start of string
    (?P<protocol>http)     # protocol group with name

"""

print(f"{'Target:': <20} '{url}'\n")
check_matches(
    pattern, 
    url, 
    [
        'protocol'])

Target:              'http://video.google.co.uk:80/videoplay?docid=-7246927612831078230&hl=en#00h02m30s'

protocol:            'http'


Next.. 
We are moving into the host.. 
Which is basically everything after the `http://` and before the port `:80`

In [9]:
pattern = r"""

    ^                        # start of string
    (?P<protocol>http)       # the actual protocol
    ://                      # the part following the protocol
    (?P<host>.*)             # the group we want.. Captures everything
    :                        # specifies the end of this section. No group

"""

print(f"{'Target:': <20} '{url}'\n")
check_matches(
    pattern, 
    url, 
    [
        'protocol',
        'host'])

Target:              'http://video.google.co.uk:80/videoplay?docid=-7246927612831078230&hl=en#00h02m30s'

protocol:            'http'
host:                'video.google.co.uk'
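
One thing to keep in mind: `.*` is greedy, so `(?P<host>.*):` runs all the way to the LAST colon in the string. That works for our test URL because it only has one colon after `://`. The sketch below uses a made-up URL with a colon in the path to show the difference, and a stricter character class `[^:/]+` as one possible alternative (an assumption on my part, not part of the exercise).

```python
import re

tricky = "http://example.com:80/a:b"   # hypothetical URL with a colon in the path

greedy = re.match(r"^(?P<protocol>http)://(?P<host>.*):", tricky)
strict = re.match(r"^(?P<protocol>http)://(?P<host>[^:/]+):", tricky)

print(greedy.group("host"))  # example.com:80/a
print(strict.group("host"))  # example.com
```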


Subdomain!
This one is a little tricky, because the subdomain is a part of the host.
We will need to add a group.. Within the group!

In [10]:
pattern = r"""

    ^                      # start of string
    (?P<protocol>http)     # the actual protocol
    ://                    # the part following the protocol
    (?P<host>              # specify the start of a group.. The host!
        (?P<subdomain>\w+) # specify a new group, subdomain.. The first word `+` is greedy!
        \.                 # says there is a period
        .*                 # everything else
    )                      # end host group
    :                      # specifies the end of this section. No group

"""

print(f"{'Target:': <20} '{url}'\n")
check_matches(
    pattern, 
    url, 
    [
        'protocol', 
        'host', 
        'subdomain'])

Target:              'http://video.google.co.uk:80/videoplay?docid=-7246927612831078230&hl=en#00h02m30s'

protocol:            'http'
host:                'video.google.co.uk'
subdomain:           'video'


Domain! Again.. This is a nested group. Should be pretty straightforward though!

In [11]:
pattern = r"""

    ^                        # start of string
    (?P<protocol>http)       # the actual protocol
    ://                      # the part following the protocol
    (?P<host>                # specify the start of a group.. The host!
        (?P<subdomain>\w+)   # specify a new group, subdomain.. The first word `+` is greedy!
        \.                   # says there is a period
        (?P<domain>.*)       # THIS IS WHAT CHANGED!! Easy peasy 
    )                        # end host group
    :                        # specifies the end of this section. No group

"""

print(f"{'Target:': <20} '{url}'\n")
check_matches(
    pattern, 
    url, 
    [
        'protocol', 
        'host', 
        'subdomain', 
        'domain'])

Target:              'http://video.google.co.uk:80/videoplay?docid=-7246927612831078230&hl=en#00h02m30s'

protocol:            'http'
host:                'video.google.co.uk'
subdomain:           'video'
domain:              'google.co.uk'
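
Nested groups all come back from the same match object. A small sketch with plain `re` (assuming `re.VERBOSE`, as the commented patterns require) shows that the inner groups are just sub-spans of the outer one:

```python
import re

url = "http://video.google.co.uk:80/videoplay?docid=-7246927612831078230&hl=en#00h02m30s"

m = re.match(r"""
    ^(?P<protocol>http)://
    (?P<host>                 # outer group
        (?P<subdomain>\w+)    # inner group 1
        \.
        (?P<domain>.*)        # inner group 2
    )
    :
    """, url, flags=re.VERBOSE)

# groupdict() maps every named group to its text; span() gives (start, end) offsets.
for name, value in m.groupdict().items():
    print(f"{name + ':': <12} {value!r} at {m.span(name)}")
```

The spans for `subdomain` and `domain` both fall inside the span for `host`.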


top-level domain. 
We've been doing this all along!

In [12]:
pattern = r"""

    ^                                   # start of string
    (?P<protocol>http)                  # the actual protocol
    ://                                 # the part following the protocol
    (?P<host>                           # specify the start of a group.. The host!
        (?P<subdomain>\w+)              # specify a new group, subdomain.. The first word `+` is greedy!
        \.                              # says there is a period
        (?P<domain>                     # Start of the domain group!
            \w+\.\w+\.                  # There is a word.. period.. word.. period..
            (?P<top_level_domain>\w+)   # and our top-level domain!
        )                               # Close domain
    )                                   # Close host
    :                                   # specifies the end of this section. No group

"""

print(f"{'Target:': <20} '{url}'\n")
check_matches(
    pattern, 
    url, 
    [
        'protocol', 
        'host', 
        'subdomain', 
        'domain', 
        'top_level_domain'])

Target:              'http://video.google.co.uk:80/videoplay?docid=-7246927612831078230&hl=en#00h02m30s'

protocol:            'http'
host:                'video.google.co.uk'
subdomain:           'video'
domain:              'google.co.uk'
top_level_domain:    'uk'


Second level domain.. 
Yet another nest!

In [13]:
pattern = r"""

    ^                                       # start of string
    (?P<protocol>http)                      # the actual protocol
    ://                                     # the part following the protocol
    (?P<host>                               # specify the start of a group.. The host!
        (?P<subdomain>\w+)                  # specify a new group, subdomain.. The first word `+` is greedy!
        \.                                  # says there is a period
        (?P<domain>                         # Start of the domain group!
            \w+\.                           # `google.`
            (?P<second_level_domain>        # Start of the second level domain
                \w+\.                       # `co.`
                (?P<top_level_domain>\w+)   # and our top-level domain!
            )                               # Close second-level domain
        )                                   # Close domain
    )                                       # Close host
    :                                       # specifies the end of this section. No group

"""

print(f"{'Target:': <20} '{url}'\n")
print('Notice the flip in order between top and second level domains!\n')
check_matches(
    pattern, 
    url, 
    [
        'protocol', 
        'host', 
        'subdomain', 
        'domain', 
        'second_level_domain',
        'top_level_domain'])

Target:              'http://video.google.co.uk:80/videoplay?docid=-7246927612831078230&hl=en#00h02m30s'

Notice the flip in order between top and second level domains!

protocol:            'http'
host:                'video.google.co.uk'
subdomain:           'video'
domain:              'google.co.uk'
second_level_domain: 'co.uk'
top_level_domain:    'uk'
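
The order follows from how groups are numbered: by the position of their opening parenthesis. Since the second-level domain wraps the top-level domain, it opens first. A compiled pattern exposes this through `groupindex`. The sketch below is a condensed copy of the pattern above, compiled with `re.VERBOSE`.

```python
import re

pattern = re.compile(r"""
    ^(?P<protocol>http)://
    (?P<host>
        (?P<subdomain>\w+)\.
        (?P<domain>
            \w+\.
            (?P<second_level_domain>
                \w+\.
                (?P<top_level_domain>\w+)
            )
        )
    )
    :
    """, re.VERBOSE)

# groupindex maps each group name to its number (counted by opening parenthesis).
order = dict(pattern.groupindex)
print(order)
# {'protocol': 1, 'host': 2, 'subdomain': 3, 'domain': 4,
#  'second_level_domain': 5, 'top_level_domain': 6}
```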


Finally we can get out of those nests!! The port should be easy :) 

In [14]:
pattern = r"""

    ^                                       # start of string
    (?P<protocol>http)                      # the actual protocol
    ://                                     # the part following the protocol
    (?P<host>                               # specify the start of a group.. The host!
        (?P<subdomain>\w+)                  # specify a new group, subdomain.. The first word `+` is greedy!
        \.                                  # says there is a period
        (?P<domain>                         # Start of the domain group!
            \w+\.                           # `google.`
            (?P<second_level_domain>        # Start of the second level domain
                \w+\.                       # `co.`
                (?P<top_level_domain>\w+)   # and our top-level domain!
            )                               # Close second-level domain
        )                                   # Close domain
    )                                       # Close host
    :                                       # specifies the end of this section. No group
    (?P<port>\d+)                           # Port group!

"""

print(f"{'Target:': <20} '{url}'\n")
check_matches(
    pattern, 
    url, 
    [
        'protocol', 
        'host', 
        'subdomain', 
        'domain', 
        'second_level_domain',
        'top_level_domain',
        'port'])

Target:              'http://video.google.co.uk:80/videoplay?docid=-7246927612831078230&hl=en#00h02m30s'

protocol:            'http'
host:                'video.google.co.uk'
subdomain:           'video'
domain:              'google.co.uk'
second_level_domain: 'co.uk'
top_level_domain:    'uk'
port:                '80'
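
The definitions mention that most people leave out the port when it is 80. Our pattern requires it, so a URL without `:80` would not match at all. One way to relax that, sketched below, is to wrap the colon and the port in an optional non-capturing group. The simplified host pattern `[^:/]+` is an assumption made to keep the example short.

```python
import re

pattern = re.compile(r"""
    ^(?P<protocol>http)://
    (?P<host>[^:/]+)          # assumption: host is anything up to ':' or '/'
    (?::(?P<port>\d+))?       # optional ':' followed by digits
    """, re.VERBOSE)

with_port = pattern.match("http://video.google.co.uk:80/videoplay")
without_port = pattern.match("http://video.google.co.uk/videoplay")

print(with_port.group("port"))     # 80
print(without_port.group("port"))  # None
print(without_port.group("host"))  # video.google.co.uk
```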


In [15]:
pattern = r"""

    ^                                       # start of string
    (?P<protocol>http)                      # the actual protocol
    ://                                     # the part following the protocol
    (?P<host>                               # specify the start of a group.. The host!
        (?P<subdomain>\w+)                  # specify a new group, subdomain.. The first word `+` is greedy!
        \.                                  # says there is a period
        (?P<domain>                         # Start of the domain group!
            \w+\.                           # `google.`
            (?P<second_level_domain>        # Start of the second level domain
                \w+\.                       # `co.`
                (?P<top_level_domain>\w+)   # and our top-level domain!
            )                               # Close second-level domain
        )                                   # Close domain
    )                                       # Close host
    :                                       # specifies the end of this section. No group
    (?P<port>\d+)                           # Port group!
    (?P<path>/\w+)                          # Path group!

"""

print(f"{'Target:': <20} '{url}'\n")
check_matches(
    pattern, 
    url, 
    [
        'protocol', 
        'host', 
        'subdomain', 
        'domain', 
        'second_level_domain',
        'top_level_domain',
        'port',
        'path'])

Target:              'http://video.google.co.uk:80/videoplay?docid=-7246927612831078230&hl=en#00h02m30s'

protocol:            'http'
host:                'video.google.co.uk'
subdomain:           'video'
domain:              'google.co.uk'
second_level_domain: 'co.uk'
top_level_domain:    'uk'
port:                '80'
path:                '/videoplay'
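
Note that `/\w+` only covers a single path segment, which is all our test URL has. For something like the `/directory/file.html` example from the definitions, a broader pattern is needed. The sketch below assumes the path simply runs until the first `?` or `#`.

```python
import re

single_segment = re.compile(r"(?P<path>/\w+)")   # the pattern used above
any_depth = re.compile(r"(?P<path>/[^?#]*)")     # assumption: path runs until '?' or '#'

sample = "/directory/file.html?x=1"   # hypothetical path plus query string

print(single_segment.match(sample).group("path"))  # /directory
print(any_depth.match(sample).group("path"))       # /directory/file.html
```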


Parameters! These can be tricky since there can be one, many, or none of them. 

In [17]:
pattern = r"""

    ^                                       # start of string
    (?P<protocol>http)                      # the actual protocol
    ://                                     # the part following the protocol
    (?P<host>                               # specify the start of a group.. The host!
        (?P<subdomain>\w+)                  # specify a new group, subdomain.. The first word `+` is greedy!
        \.                                  # says there is a period
        (?P<domain>                         # Start of the domain group!
            \w+\.                           # `google.`
            (?P<second_level_domain>        # Start of the second level domain
                \w+\.                       # `co.`
                (?P<top_level_domain>\w+)   # and our top-level domain!
            )                               # Close second-level domain
        )                                   # Close domain
    )                                       # Close host
    :                                       # specifies the end of this section. No group
    (?P<port>\d+)                           # Port group!
    (?P<path>/\w+)                          # Path group!
    (?:\?
        (?P<parameters>[a-zA-Z1-9,&=]+)     # Parameters can include any of these characters
    )?                                      # and specify that it's optional
    (?P<fragment>.*)                        # catch all at the end for fragment
    $                                       # end of string

"""

urls = 'http://video.google.co.uk:80/videoplay#123'

print(f"{'Target:': <20} '{url}'\n")
check_matches(
    pattern, 
    url, 
    [
        'protocol', 
        'host', 
        'subdomain', 
        'domain', 
        'second_level_domain',
        'top_level_domain',
        'port',
        'path',
        'parameters',
        'fragment'])

Target:              'http://video.google.co.uk:80/videoplay?docid=-7246927612831078230&hl=en#00h02m30s'

protocol:            'http'
host:                'video.google.co.uk'
subdomain:           'video'
domain:              'google.co.uk'
second_level_domain: 'co.uk'
top_level_domain:    'uk'
port:                '80'
path:                '/videoplay'


IncorrectMatchError: 
                Sorry, but the match is incorrect. 
                Please try again.
                Got docid=, expected ?docid=-7246927612831078230&hl=en
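
So what went wrong? Two things, going by the error message. First, the character class `[a-zA-Z1-9,&=]` has no `-` (and no `0`), so the parameters group stops at `docid=` and the catch-all fragment swallows the rest. Second, the check expects the leading `?` to be part of the parameters, but the pattern keeps it outside the group. Here is one possible fix, sketched with plain `re`. It assumes the fragment should keep its leading `#`, as in the definitions above, and that parameters run up to the first `#`.

```python
import re

url = "http://video.google.co.uk:80/videoplay?docid=-7246927612831078230&hl=en#00h02m30s"

pattern = re.compile(r"""
    ^
    (?P<protocol>http)
    ://
    (?P<host>
        (?P<subdomain>\w+)
        \.
        (?P<domain>
            \w+\.
            (?P<second_level_domain>
                \w+\.
                (?P<top_level_domain>\w+)
            )
        )
    )
    :
    (?P<port>\d+)
    (?P<path>/\w+)
    (?P<parameters>\?[^\#]*)?     # '?' is now inside the group; stop at the first '#'
    (?P<fragment>\#.*)?           # '#' is escaped because verbose mode treats it as a comment
    $
    """, re.VERBOSE)

m = pattern.match(url)
print(m.group("parameters"))  # ?docid=-7246927612831078230&hl=en
print(m.group("fragment"))    # #00h02m30s

# The second URL from the cell above has no parameters at all.
m2 = pattern.match("http://video.google.co.uk:80/videoplay#123")
print(m2.group("parameters"))  # None
print(m2.group("fragment"))    # #123
```

Using "anything except `#`" instead of listing the allowed characters avoids forgetting one, which is exactly what tripped up the first attempt.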