# Regex Intro

## Regex?
From Wikipedia:
```
A regular expression, regex or regexp is a sequence of characters that define a search pattern. Usually such patterns are used by string searching algorithms for "find" or "find and replace" operations on strings, or for input validation.
```

Essentially, it's a way to define patterns in text data!
Let's see an example.

<hr>

In [1]:
import re


sentences = [
    "Sentence about cats.", 
    "Sentence about dogs.", 
    "Sentence about rabbits."
]

Now, let's imagine that we wanted to know which sentences talked about either `cats` or `dogs`.
How could we do that?

Currently, our sentences are structured like this:
```
Sentence about <animal>.
```

So, we could define a pattern that matched this structure!

In [2]:
pattern = re.compile(r"Sentence about (\w+)\.")

for sentence in sentences:
    
    # try to find a match..
    match = pattern.match(sentence)
    
    # if you found a match, print it!
    if match:
        print(match.group(1))

cats
dogs
rabbits


OK.. but that pattern also matched `rabbits`!
Let's narrow it down so that it only matches `cat` or `dog`.

In [3]:
pattern = re.compile(r"Sentence about (cat|dog)")

for sentence in sentences:
    
    # try to find a match..
    match = pattern.match(sentence)
    
    # if you found a match, print it!
    if match:
        print(match.group(1))

cat
dog


### What's going on?
We defined a couple of patterns to flag the sentences that matched our criteria, and also printed the text captured by each match!

Starting off.. 
```
Sentence about ...
```

This part is normal. 
Looks like regular text.
The next parts are the tricky ones.

* `(\w+)\.` - Captures one or more _word_ characters (`(\w+)`), followed by a literal period (`\.`)
* `(cat|dog)` - Captures the text `cat` _or_ `dog` (this is why `cats` still matched: the pattern only needs to find `cat`)
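
To see the two pieces in isolation, here is a small sketch. The variable names and the short sample strings are made up for illustration and are not part of the cells above.

```python
import re

# (\w+)\. captures a run of word characters that is followed by a literal period
word_then_dot = re.compile(r"(\w+)\.")
m = word_then_dot.search("Sentence about rabbits.")
captured_word = m.group(1)

# (cat|dog) captures either the literal text "cat" or the literal text "dog"
cat_or_dog = re.compile(r"(cat|dog)")
found = [cat_or_dog.search(s) for s in ["cats", "dogs", "rabbits"]]
captured_animals = [f.group(1) if f else None for f in found]

print(captured_word)      # rabbits
print(captured_animals)   # ['cat', 'dog', None]
```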

<hr>

## Regex Special Characters
Regex is kind of a mini-programming language.
There are normal characters, metacharacters, and special characters.

Most common special characters include:

* `.` - (Dot.) Matches any character except a newline.
* `^` - (Caret.) Matches the start of the string
* `$` - Matches the end of the string or just before the newline at the end of the string
* `*` - Causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as are possible. `ab*` will match ‘a’, ‘ab’, or ‘a’ followed by any number of ‘b’s.
* `+` - Causes the resulting RE to match 1 or more repetitions of the preceding RE. `ab+` will match ‘a’ followed by any non-zero number of ‘b’s; it will not match just ‘a’.
* `?` - Causes the resulting RE to match 0 or 1 repetitions of the preceding RE. `ab?` will match either ‘a’ or ‘ab’.
* `\` - Either escapes special characters (permitting you to match characters like `*`, `?`, and so forth), or signals a special sequence.
* `[]` - Used to indicate a set of characters. For example, `[abc]` matches `a`, `b`, or `c`, and `[a-z]` matches any lowercase ASCII letter.
* `|` - A|B, where A and B can be arbitrary REs, creates a regular expression that will match either A or B.
* `(...)` - Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group
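
A quick way to get a feel for these is to try them on tiny strings. The sketch below uses `re.fullmatch`, which only succeeds when the whole string matches the pattern; the sample strings and variable names are illustrative assumptions.

```python
import re

samples = ["a", "ab", "abbb"]

# Quantifiers applied to the preceding "b"
star = [bool(re.fullmatch(r"ab*", s)) for s in samples]      # 0 or more b's
plus = [bool(re.fullmatch(r"ab+", s)) for s in samples]      # 1 or more b's
optional = [bool(re.fullmatch(r"ab?", s)) for s in samples]  # 0 or 1 b

# Anchors and the dot: "c", any one character, "t", and nothing else
anchored = [bool(re.search(r"^c.t$", s)) for s in ["cat", "cot", "cart", "a cat"]]

# A character set, and an escaped (literal) dot
vowels = re.findall(r"[aeiou]", "regex")
dots = re.findall(r"\.", "a.b.c")

print(star)      # [True, True, True]
print(plus)      # [False, True, True]
print(optional)  # [True, True, False]
print(anchored)  # [True, True, False, False]
print(vowels)    # ['e', 'e']
print(dots)      # ['.', '.']
```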

## Regex Meta Characters
These characters represent character _classes_.

Most common:

* `\d` - Matches any Unicode decimal digit (that is, any character in Unicode character category [Nd]). This includes [0-9], and also many other digit characters.
* `\D` - Matches any character which is not a decimal digit. This is the opposite of `\d`.
* `\s` - Matches Unicode whitespace characters (which includes [ \t\n\r\f\v]).
* `\S` - Matches any character which is not a whitespace character. This is the opposite of `\s`.
* `\w` - Matches Unicode word characters; this includes most characters that can be part of a word in any language, as well as numbers and the underscore.
* `\W` - Matches any character which is not a word character. This is the opposite of `\w`.
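
Here is a rough sketch of these classes in action. The sample text is invented for illustration; the last part shows the "Unicode" aspect of `\d` using Arabic-Indic digits and the `re.ASCII` flag, which restricts `\d` to `[0-9]`.

```python
import re

text = "Room 42, floor_3"

digits = re.findall(r"\d+", text)    # runs of digits
words = re.findall(r"\w+", text)     # runs of word characters
non_word = re.findall(r"\W", text)   # single non-word characters
pieces = re.split(r"\s+", text)      # split on runs of whitespace

# \d is Unicode-aware for str patterns: Arabic-Indic digits count too,
# unless the re.ASCII flag restricts \d to [0-9].
arabic_indic = "\u0664\u0662"
unicode_digits = re.findall(r"\d", arabic_indic)
ascii_digits = re.findall(r"\d", arabic_indic, flags=re.ASCII)

print(digits)    # ['42', '3']
print(words)     # ['Room', '42', 'floor_3']
print(non_word)  # [' ', ',', ' ']
print(pieces)    # ['Room', '42,', 'floor_3']
print(len(unicode_digits), len(ascii_digits))  # 2 0
```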

## Regex - Python
Python has a module for working with regular expressions.
This is the [re](https://docs.python.org/3.7/library/re.html) module.
It's part of the standard library, so there's nothing to install!!

Generally you would..

1. Create a pattern and [compile](https://docs.python.org/3.7/library/re.html#re.compile) it
2. Try to either:
    * [match](https://docs.python.org/3.7/library/re.html#re.match) the pattern at the _start_ of the text
    * [findall](https://docs.python.org/3.7/library/re.html#re.findall) non-overlapping occurrences of the pattern within a string
    * [search](https://docs.python.org/3.7/library/re.html#re.Pattern.search) the text for the first match (anywhere in the string)

In [4]:
pattern = re.compile(r'foo')

foo, not_foo = 'foo foo', 'not foo foo'

In [5]:
print(pattern.match(foo))

<re.Match object; span=(0, 3), match='foo'>


In [6]:
print(pattern.match(not_foo))

None


In [7]:
print(pattern.search(foo))

<re.Match object; span=(0, 3), match='foo'>


In [8]:
print(pattern.search(not_foo))

<re.Match object; span=(4, 7), match='foo'>


In [9]:
print(pattern.findall(foo))

['foo', 'foo']


In [10]:
print(pattern.findall(not_foo))

['foo', 'foo']


<hr>

## Some Examples

**GOAL**: Does the text talk about cats??

In [11]:
cat_sentences = [
    "I talk about cats!",
    "I might talk about cats..",
    "What sound does a cat make?",
    "Meow"
]

We can determine that a sentence talks about cats _if_ it contains the word `cat`.

In [12]:
pattern = re.compile(r'cat')

for s in cat_sentences:
    if pattern.search(s):
        print(s)

I talk about cats!
I might talk about cats..
What sound does a cat make?


<hr>

**GOAL**: Does the text contain a number?

In [13]:
number_sentences = [
    "1234567",
    "one two three four five six seven",
    "Cat!",
    "Meow"
]

In [14]:
pattern = re.compile(r'\d')  # a single digit!

for s in number_sentences:
    if pattern.search(s):
        print(s)

1234567


<hr>

**GOAL**: Does the text talk about either a cat or a number?

In [15]:
pattern = re.compile(r'cat|\d')  # cat or a digit

for s in cat_sentences + number_sentences:
    if pattern.search(s):
        print(s)

I talk about cats!
I might talk about cats..
What sound does a cat make?
1234567


<hr>

**GOAL**: Is the text a valid phone number???

A valid phone number has the following format: `(615)123-4567`

In [16]:
phone_numbers = [
    "(615)123-4567",  
    "615-123-4567",
    "123-4567",
    "foo",
    "Cat!",
    "(765)432-1516"
]

In [17]:
pattern = re.compile(r"\(\d{3}\)\d{3}-\d{4}")

for s in phone_numbers:
    if pattern.match(s):
        print(s)

(615)123-4567
(765)432-1516
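
One caveat: `match` only anchors the pattern at the _start_ of the string, so the pattern above would also accept a number with extra characters after it. A minimal sketch, assuming we want to reject such strings, is to use `fullmatch` instead (the `too_long` string is a made-up example):

```python
import re

pattern = re.compile(r"\(\d{3}\)\d{3}-\d{4}")
too_long = "(615)123-45678"  # one digit too many

# match() is satisfied as soon as the start of the string fits the pattern
starts_ok = bool(pattern.match(too_long))

# fullmatch() requires the entire string to fit the pattern
whole_ok = bool(pattern.fullmatch(too_long))

print(starts_ok)  # True
print(whole_ok)   # False
```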


<hr>

**Fun practice!!**
* This regex [crossword](https://regexcrossword.com/)
* This [course](https://developers.google.com/edu/python/regular-expressions) by Google on regular expressions


### Some exercises

**GOAL**: Is the text a valid _date_ ??

* VALID: `1995-11-11`
* NOT VALID: `11-11-1995`, `11/11/1995`

In [18]:
dates = [
    "1995-11-11",
    "2001-1-1",
    "2019-12-1",
    "11-11-1995",  # Not valid!
    "11/11/1995",  # Not valid!
    "1234-56-78"   # Not valid!
]

In [19]:
# Update this!
pattern = re.compile(r'foo')

for date in dates:
    if pattern.match(date):
        print(date)

<hr>

**GOAL**: Is the text a valid _time_ ??

* VALID: 
    * `12:59:12.12345`
    * `23:01:59.54321`
    
* NOT VALID:
    * `1:1:1.12345`
    * `50:01:01.12345`
    * `12:60:01.12345`
    * `12:01:60.12345`
    * `5 o'clock`

In [20]:
times = [
    "12:59:12.12345",
    "23:01:59.54321",
    "1:1:1.12345",
    "50:01:01.12345",
    "12:60:01.12345",
    "12:01:60.12345",
    "5 o'clock",
    "The time is: 12:59:12.12345"  # Not valid!!
]

In [21]:
# Update this!
pattern = re.compile(r'foo')

for time in times:
    if pattern.match(time):
        print(time)

<hr>

### Extra Credit!!
**GOAL**: _What_ is the _year_ in the following dates?

_Hint_ : You will need to create a [capture group](https://stackoverflow.com/questions/48719537/capture-groups-with-regular-expression-python#answer-48719548) for the year. 

Extract it using the result's [group](https://docs.python.org/3/library/re.html#re.Match.group) method!
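
As a reminder of how capture groups and `group` fit together, here is a small sketch on a different problem (pulling the area code out of a phone number), so it doesn't give away the exercise. The variable names are just for illustration.

```python
import re

# Parentheses around \d{3} make the area code capture group 1
pattern = re.compile(r"\((\d{3})\)\d{3}-\d{4}")
match = pattern.match("(615)123-4567")

whole = match.group(0)      # group 0 is always the entire match
area_code = match.group(1)  # group 1 is the first pair of parentheses

print(whole)      # (615)123-4567
print(area_code)  # 615
```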

In [22]:
# all of these have valid years!
dates = [
    "1995-11-11",
    "11-11-1995",
    "01/01/2001",
    "Jan 1, 2012"
]

In [23]:
# do the code here! Look up top for some more examples of capture groups..