# Regex Intro

## Regex?
From Wikipedia:
```
A regular expression, regex or regexp is a sequence of characters that define a search pattern. Usually such patterns are used by string searching algorithms for "find" or "find and replace" operations on strings, or for input validation.
```

Essentially, it's a way to define patterns in text data!
Let's see an example.

<hr>

In [7]:
import re


sentences = [
    "Sentence about cats.", 
    "Sentence about dogs.", 
    "Sentence about rabbits."
]

Now, let's imagine that we wanted to know which sentences talked about either `cats` or `dogs`.
How could we do that?

Currently, our sentences are structured like this:
```
Sentence about <animal>.
```

So, we could define a pattern that matched this structure!

In [14]:
pattern = re.compile(r"Sentence about (\w+)\.")

for sentence in sentences:
    
    # try to find a match..
    match = pattern.match(sentence)
    
    # if you found a match, print it!
    if match:
        print(match.group(1))

cats
dogs
rabbits


OK, but now we have rabbits too! Let's restrict the group to `cat` or `dog`.

In [13]:
pattern = re.compile(r"Sentence about (cat|dog)")

for sentence in sentences:
    
    # try to find a match..
    match = pattern.match(sentence)
    
    # if you found a match, print it!
    if match:
        print(match.group(1))

cat
dog


### What's going on?
We defined a couple of patterns to flag the sentences that matched our criteria, and also returned the matched text!

Starting off.. 
```
Sentence about ...
```

This part is literal text: each character simply matches itself.
The next parts are the tricky ones.

* `(\w+)\.` - Matches one or more _word_ characters (`\w+`) and captures them as a group (the parentheses), then matches a literal period (`\.`).
* `(cat|dog)` - Matches the word `cat` _or_ `dog` and captures whichever one matched.
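
To make the difference between the whole match and the captured group concrete, here is a minimal sketch that reuses the first pattern from above on a single sentence. `group(0)` is everything the pattern matched, while `group(1)` is only what the first pair of parentheses captured.

```python
import re

pattern = re.compile(r"Sentence about (\w+)\.")
match = pattern.match("Sentence about cats.")

whole = match.group(0)   # everything the pattern matched
animal = match.group(1)  # only the text captured by the parentheses

print(whole)   # Sentence about cats.
print(animal)  # cats
```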

<hr>

## Regex Special Characters
Regex is kind of a mini-programming language.
There are normal characters, metacharacters, and special characters.

Most common special characters include:

* `.` - (Dot.) Matches any character except a newline.
* `^` - (Caret.) Matches the start of the string.
* `$` - (Dollar.) Matches the end of the string or just before the newline at the end of the string.
* `*` - Causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as are possible. `ab*` will match ‘a’, ‘ab’, or ‘a’ followed by any number of ‘b’s.
* `+` - Causes the resulting RE to match 1 or more repetitions of the preceding RE. `ab+` will match ‘a’ followed by any non-zero number of ‘b’s; it will not match just ‘a’.
* `?` - Causes the resulting RE to match 0 or 1 repetitions of the preceding RE. `ab?` will match either ‘a’ or ‘ab’.
* `\` - Either escapes special characters (permitting you to match characters like '*', '?', and so forth), or signals a special sequence.
* `[]` - Used to indicate a set of characters. For example, `[abc]` matches `a`, `b`, or `c`.
* `|` - `A|B`, where A and B can be arbitrary REs, creates a regular expression that will match either A or B.
* `(...)` - Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group.
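
The three repetition characters are easiest to compare side by side. The sketch below checks a few made-up candidate strings against `ab*`, `ab+`, and `ab?`. It uses `re.fullmatch` so that the entire string has to fit the pattern; that choice is only for illustration.

```python
import re

candidates = ["a", "ab", "abbb"]

results = {}
for pattern in (r"ab*", r"ab+", r"ab?"):
    # keep only the candidates that the pattern matches in full
    results[pattern] = [s for s in candidates if re.fullmatch(pattern, s)]
    print(pattern, results[pattern])

# ab* ['a', 'ab', 'abbb']
# ab+ ['ab', 'abbb']
# ab? ['a', 'ab']
```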

## Regex Meta Characters
These characters represent character _classes_.

Most common:

* `\d` -  Matches any Unicode decimal digit (that is, any character in Unicode character category [Nd]). This includes [0-9], and also many other digit characters.
* `\D` - Matches any character which is not a decimal digit. This is the opposite of \d.
* `\s` - Matches Unicode whitespace characters (which includes [ \t\n\r\f\v]).
* `\S` - Matches any character which is not a whitespace character. This is the opposite of \s.
* `\w` - Matches Unicode word characters; this includes most characters that can be part of a word in any language, as well as numbers and the underscore.
* `\W` - Matches any character which is not a word character. This is the opposite of \w.
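
A quick way to get a feel for these classes is to run `findall` with each one over the same string. The sample text below is made up for this sketch.

```python
import re

text = "Cat_9 lives\t2 doors down"

digits = re.findall(r"\d", text)   # single digits
spaces = re.findall(r"\s", text)   # single whitespace characters
words = re.findall(r"\w+", text)   # runs of word characters

print(digits)  # ['9', '2']
print(spaces)  # [' ', '\t', ' ', ' ']
print(words)   # ['Cat_9', 'lives', '2', 'doors', 'down']
```

Note that the underscore counts as a word character, which is why `Cat_9` comes back as a single item.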

## Regex - Python
Python has a module for working with regular expressions.
This is the [re](https://docs.python.org/3.7/library/re.html) module.
It's part of the standard library, so there's nothing to install!

Generally you would..

1. Create a pattern and [compile](https://docs.python.org/3.7/library/re.html#re.compile) it
2. Try to either:
    * [match](https://docs.python.org/3.7/library/re.html#re.match) the pattern at the very start of the text
    * [findall](https://docs.python.org/3.7/library/re.html#re.findall) non-overlapping instances of your match within a string
    * [search](https://docs.python.org/3.7/library/re.html#re.Pattern.search) the text for the first match (anywhere in the string)

In [31]:
pattern = re.compile('foo')

foo, not_foo = 'foo foo', 'not foo foo'

In [32]:
print(pattern.match(foo))

<re.Match object; span=(0, 3), match='foo'>


In [33]:
print(pattern.match(not_foo))

None


In [34]:
print(pattern.search(foo))

<re.Match object; span=(0, 3), match='foo'>


In [35]:
print(pattern.search(not_foo))

<re.Match object; span=(4, 7), match='foo'>


In [36]:
print(pattern.findall(foo))

['foo', 'foo']


In [37]:
print(pattern.findall(not_foo))

['foo', 'foo']


<hr>

## Some Examples

**GOAL**: Does the text talk about cats??

In [47]:
cat_sentences = [
    "I talk about cats!",
    "I might talk about cats..",
    "What sound does a cat make?",
    "Meow"
]

We can decide that a sentence talks about cats _if_ it contains the text `cat` (so `cats` counts too).

In [48]:
pattern = re.compile('cat')

for s in cat_sentences:
    if pattern.search(s):
        print(s)

I talk about cats!
I might talk about cats..
What sound does a cat make?


<hr>

**GOAL**: Does the text contain a number?

In [43]:
number_sentences = [
    "1234567",
    "one two three four five six seven",
    "Cat!",
    "Meow"
]

In [44]:
pattern = re.compile(r'\d')  # a digit! (raw string, so the backslash reaches the regex engine untouched)

for s in number_sentences:
    if pattern.search(s):
        print(s)

1234567


<hr>

**GOAL**: Does the text talk about either a cat or a number?

In [50]:
pattern = re.compile(r'cat|\d')  # cat or a digit

for s in cat_sentences + number_sentences:
    if pattern.search(s):
        print(s)

I talk about cats!
I might talk about cats..
What sound does a cat make?
1234567
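
Notice that `Cat!` was not printed: patterns are case-sensitive by default, so `cat` does not match `Cat`. If case should not matter, one option is the `re.IGNORECASE` flag. The sketch below uses a shortened, hand-picked list of texts rather than the full lists above.

```python
import re

# re.IGNORECASE makes `cat` also match `Cat`, `CAT`, and so on
pattern = re.compile(r"cat|\d", re.IGNORECASE)

texts = ["I talk about cats!", "Cat!", "1234567", "Meow"]
matching = [s for s in texts if pattern.search(s)]

print(matching)  # ['I talk about cats!', 'Cat!', '1234567']
```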


<hr>

**GOAL**: Is the text a valid phone number???

A valid phone number has the following format: `(615)123-4567`

In [51]:
phone_numbers = [
    "(615)123-4567",  
    "615-123-4567",
    "123-4567",
    "foo",
    "Cat!",
    "(765)432-1516"
]

In [53]:
pattern = re.compile(r"\(\d{3}\)\d{3}-\d{4}")

for s in phone_numbers:
    if pattern.match(s):
        print(s)

(615)123-4567
(765)432-1516
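
One caveat: `match` only anchors the pattern at the start of the string, so a number with extra trailing characters would also pass this check. If the whole string has to be a phone number, `fullmatch` is one way to enforce that. A small sketch, using a made-up number with one digit too many:

```python
import re

pattern = re.compile(r"\(\d{3}\)\d{3}-\d{4}")
too_long = "(615)123-45678"  # one extra digit at the end

starts_ok = bool(pattern.match(too_long))       # only the start is checked
whole_ok = bool(pattern.fullmatch(too_long))    # the entire string is checked
valid_ok = bool(pattern.fullmatch("(615)123-4567"))

print(starts_ok)  # True
print(whole_ok)   # False
print(valid_ok)   # True
```

Adding `$` to the end of the pattern gives a similar result with `match`, except that `$` also tolerates a single trailing newline.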


<hr>

**Fun practice!!**
* This regex [crossword]