# REGULAR EXPRESSIONS

![faces](imgs/express.jpg)

## Objectives

- Review foundational `re` syntax 
- Learn about tools to help
- Practice
___


## Review

A regular expression (regex or regexp for short) is a special text string for describing a search pattern.

![terrifying](imgs/example.png)


**This.  
Is.  
Terrifying.**

![gif_of_you](imgs/fear_ignorance.gif)

## Basic Parts

1) Capturing Group - the foundation of regular expressions. A capture group is the portion of the string you are searching for.
 - It can be a word: "bottle", "AROUND", "Link"
 - It can be a number: "25", "19910312", "1.333"
 - It can be anything: "1337 5P3Ak", "<( '.' )>", "something@email.com"
    
A standard pattern in Python looks like `r" -CAPTURING GROUP- "`

2) Quantifiers - these instruct how to gather information around the group.
   - Do you want to get everything in front of a group?
   - Do you want only the 3 characters that follow the group?
   - Do you want to only partially match the group?
    
Quantifiers are either in curly brackets, `{ -N NUMBER TO CAPTURE- }` or `{ -N NUMBER TO CAPTURE-, BLANK }`, or they match as much (or as little) as they can.

Then it gets complicated when you add Anchors, Character Classes, and Metacharacters.

These are __regex specific__ terms that can concisely...
   - Pick a group that contains all lowercase letters.
   - Pick a group that has only digits.
   - Pick a group that only appears at the end of a string or after a specific group.
    
It is when all these things are thrown together that regex becomes a powerful tool.
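
As a rough sketch of the three bullet points above, the cell below combines a character class, a digit shorthand, and an end-of-string anchor. The `sample` text and variable names are made up for illustration and are not part of the lesson's data.

```python
import re

# Hypothetical sample text, not from the lesson above.
sample = "order 66 shipped at noon"

# [a-z]+ : a character class plus a quantifier -> one or more lowercase letters
lowercase_run = re.search(r"[a-z]+", sample)

# \d+ : one or more digits
digits = re.search(r"\d+", sample)

# noon$ : the $ anchor only allows a match at the end of the string
at_end = re.search(r"noon$", sample)

print(lowercase_run.group())  # order
print(digits.group())         # 66
print(at_end.span())          # (20, 24)
```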

In [None]:
import re

First, we need a string. The next step is to create a pattern (a combination of groups, quantifiers, metacharacters, and then some) to pull out the specific information we seek.

In [None]:
string_1 = 'This is test string. This took me all of 7 seconds to write it. I wrote it fast.'

Next, pick out something to extract. In this case, let's make a pattern to extract "is". A capturing group is encased within parentheses.

When specifying a pattern, **best practice** is to lead the string with `r`. This makes it a raw string, so Python passes each backslash to the regex engine as a literal character instead of treating it as the start of an escape sequence. [Find out more on that here](https://stackoverflow.com/questions/12871066/what-exactly-is-a-raw-string-regex-and-how-can-you-use-it) if interested.
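
To see why the `r` prefix matters, here is a small sketch using the word-boundary token `\b`. The `text` value is a made-up example.

```python
import re

text = "This is a test"

# Without the r prefix, Python turns "\b" into a backspace character
# before the regex engine ever sees it, so nothing matches.
no_raw = re.search("\bis\b", text)

# With the r prefix, the regex engine receives \b and treats it as a word boundary.
raw = re.search(r"\bis\b", text)

print(no_raw)      # None
print(raw.span())  # (5, 7)
```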

In [None]:
pattern_1 = r"is"

We will use `re.search` with our pattern to see if any part of the string matches.

### re.search

In [None]:
result = re.search(pattern_1, string_1)
result

`re.search` returns a `Match object`. `re.search` finds the first instance of the pattern in the string and then stops. The object gives the specific position of the match within the string.

Now let's take a look at where in the string this pattern was found.

In [None]:
left_span = 2
right_span = 4

print(string_1[left_span - 2 :right_span + 2])
print(string_1)
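
The span values typed by hand above can also be read straight from the Match object. A minimal sketch, reusing the same string and pattern:

```python
import re

string_1 = 'This is test string. This took me all of 7 seconds to write it. I wrote it fast.'
result = re.search(r"is", string_1)

# .span() returns (start, end); .start() and .end() give each piece on its own.
left_span, right_span = result.span()

print(left_span, right_span)           # 2 4
print(string_1[left_span:right_span])  # is
print(result.group())                  # is
```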

### re.match

In [None]:
result = re.match(pattern_1, string_1)
print(result)

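`re.match` returns a `Match` object, or `None` when there is no match. `re.match` only looks for the pattern at the **start** of a string, which is why the result above is `None`: the string starts with "This", not "is".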

In [None]:
pattern_2 = r"it"

In [None]:
string_2 = "it is crazy, isn't it!"

In [None]:
result = re.match(pattern_2, string_2)
print(result)
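
Because `re.match` (and `re.search`) can return `None`, it is usually safer to check the result before calling methods on it. One possible way to do that, using the same `string_2`:

```python
import re

string_2 = "it is crazy, isn't it!"

for pattern in (r"it", r"is"):
    result = re.match(pattern, string_2)
    if result is None:
        # Calling result.group() here would raise an AttributeError.
        print(f"{pattern!r}: no match at the start of the string")
    else:
        print(f"{pattern!r}: matched {result.group()!r} at {result.span()}")
```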

___
## Quantifiers

`regex` quantifiers can be broken into roughly three categories: specific, greedy, and lazy.
 
Greedy: match as many characters as possible.  
Lazy: match as few characters as possible.  
Specific: you will need to set a certain quantity or a range of the _same_ character to gather.

In [None]:
repeating_patterns = "ah aahhhh aaaaaaaah aaaahhhhhhh ahah "

pattern_1 = r"a"
result = re.search(pattern_1, repeating_patterns)
result

In [None]:
pattern_2 = r"a{2}"
result = re.search(pattern_2, repeating_patterns)
result

In [None]:
pattern_3 = r"a{3,}"
result = re.search(pattern_3, repeating_patterns)
result

**ALERT** METACHARACTER

The simplest metacharacter of regex is just a period, `.`. It is the _wildcard character_ for regular expressions. It matches any single character (except a newline, by default).

In [None]:
state = "Mississippi"
anything_pattern = r"."
result = re.search(anything_pattern, state)
result

In [None]:
state = "Mississippi"
specific_pattern = r"p{2}"
result = re.search(specific_pattern, state)
result

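Now let's take a look at greedy and lazy quantifiers.

To make a greedy pattern, put an asterisk `*` after a character. It matches zero or more of that character and grabs as many as it can.

To make a lazy pattern, add a question mark `?` after a quantifier, as in `*?`. This will grab the fewest characters possible while obeying all the group constraints. On its own, `?` means zero or one of the preceding character.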

In [None]:
numbers = "101000000000100"
simple_pattern = r"."
result = re.search(simple_pattern, numbers)  # search `numbers`, not `state`
result

In [None]:
numbers = "101000000000100"
lazy_pattern = r"1.*?1"  # *? is the lazy form of *: stop at the first closing 1
result = re.search(lazy_pattern, numbers)  # search `numbers`, not `state`
result

In [None]:
numbers = "101000000000100"
greedy_pattern = r"1.*1"  # * is greedy: run to the last 1 in the string
result = re.search(greedy_pattern, numbers)  # search `numbers`, not `state`
result
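
The difference between greedy and lazy is easiest to see when a string contains more than one possible stopping point. The tag-like `html` text below is a made-up example, not lesson data.

```python
import re

# Hypothetical sample text chosen to make the difference visible.
html = "<b>bold</b> and <i>italic</i>"

greedy = re.search(r"<.*>", html)  # runs to the LAST '>' in the string
lazy = re.search(r"<.*?>", html)   # stops at the FIRST '>' it can

print(greedy.group())  # <b>bold</b> and <i>italic</i>
print(lazy.group())    # <b>
```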

A `Match object` has a method called `.groups()`. This is helpful when you start adding additional groups and quantifiers.
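
As a quick illustration of `.groups()`, the sketch below wraps three parts of a date in parentheses. The sample sentence and the name `m` are assumptions made for this example.

```python
import re

# Hypothetical sample text; each pair of parentheses is one capturing group.
m = re.search(r"(\d{4})-(\d{2})-(\d{2})", "Released on 2019-03-12.")

print(m.group())   # 2019-03-12  (the whole match)
print(m.groups())  # ('2019', '03', '12')
print(m.group(1))  # 2019
```

`.group()` with no argument returns the full match, while `.groups()` returns only the parenthesized pieces as a tuple.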

___
## Anchors, Ranges, and More

Characters | Anchors | Groups 
-|-|-
![characters](imgs/cclass.png) | ![anchors](imgs/anchors.png) | ![groups](imgs/groups.png)

In [None]:
string_1 = 'This is test string. This took me all of 7 \n seconds to write it. I wrote it fast.'
string_2 = 'itiscrazyhowlongthisrunonsentenceis!'
string_3 = 'How in the world would it work without being crazy?'

loop_through = [string_1, string_2, string_3]

In [None]:
pattern = r"(^it)" #adding the start of string anchor

In [None]:
for string in loop_through:
    print(re.search(pattern, string))

In [None]:
pattern = r"\d" #any digit.

In [None]:
for string in loop_through:
    print(re.search(pattern, string))

In [None]:
pattern = r"[a-z].*" # one lowercase letter (a through z), then '.*': any characters, matched greedily.

In [None]:
for string in loop_through:
    print(re.search(pattern, string))

In [None]:
pattern = r"[\w]+" # one or more consecutive `\w` (word) characters.

In [None]:
for string in loop_through:
    print(re.search(pattern, string))
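
Two more anchors worth trying on the same strings are the word boundary `\b` and the end-of-string anchor `$`. This is only a sketch; the list names `whole_word` and `ends_with_question` are made up here.

```python
import re

string_1 = 'This is test string. This took me all of 7 \n seconds to write it. I wrote it fast.'
string_2 = 'itiscrazyhowlongthisrunonsentenceis!'
string_3 = 'How in the world would it work without being crazy?'
loop_through = [string_1, string_2, string_3]

# \b is a word-boundary anchor: "crazy" must stand alone as a word.
whole_word = [bool(re.search(r"\bcrazy\b", s)) for s in loop_through]

# $ is the end-of-string anchor: the string must end with a question mark.
ends_with_question = [bool(re.search(r"\?$", s)) for s in loop_through]

print(whole_word)          # [False, False, True]
print(ends_with_question)  # [False, False, True]
```

Note that `string_2` contains "crazy" but has no word boundaries around it, so `\bcrazy\b` does not match there.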


## Tools

From least helpful to most helpful
___
Cheat Sheets
- [Data Quest](https://www.dataquest.io/wp-content/uploads/2019/03/python-regular-expressions-cheat-sheet.pdf)
- [Rexegg](https://www.rexegg.com/regex-quickstart.html)
- [Debuggex](https://www.debuggex.com/cheatsheet/regex/python)
___
Tutorials
- [Regular-Expressions.info](https://www.regular-expressions.info/tutorial.html)
- [RegexOne](https://regexone.com)
- [RegexTutorials](http://regextutorials.com/)
___
Online live editors
- https://regex101.com
- https://regexr.com (No Python)
- https://www.regextester.com
___
Stackoverflow
- https://stackoverflow.com/questions/22937618/reference-what-does-this-regex-mean/22944075#22944075

Even with all these resources, regex still takes a while to learn, let alone master. That just means....


## PRACTICE

- `re.match` only looks at the beginning of a string. Returns a Match obj.
- `re.search` looks at the entire string and finds the first instance. Returns a Match obj.
- `re.findall` finds all instances of the pattern in the string. Returns a list, NOT a Match obj.
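
A short side-by-side of the three functions, as a reminder before starting. The `text` string is a made-up example and is not one of the exercise inputs.

```python
import re

# Hypothetical sample text, not one of the exercise strings.
text = "ah aahh aaah"

at_start = re.match(r"a+", text)    # Match object (pattern must sit at index 0)
first_hit = re.search(r"a+", text)  # Match object for the first hit anywhere
every_hit = re.findall(r"a+", text) # plain list of every non-overlapping hit

print(at_start.group())   # a
print(first_hit.group())  # a
print(every_hit)          # ['a', 'aa', 'aaa']
```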

### 1) Take the first full word of each string.

_Hint: You will likely use the `+` quantifier_

In [None]:
list_of_sentences = ["How can you even do that?", "Who thinks this is a waste of their time?", "Nevermind, don't tell me if you think it is."]
pattern = None
for sentence in list_of_sentences:
    result = re.match(pattern, sentence)
    print(result)

### 2) Find the second full word of the strings but now using `re.search`

_Hint: Word boundaries paired with something else. Look for patterns._

In [None]:
list_of_statements = ["I do not think you are having fun yet.", "See, this is why people don't like regex.", "Try now to your resources!"]
pattern = None
for statement in list_of_statements:
    result = re.search(pattern, statement)
    print(result)

### 3) Find all the lowercase letters in the following strings

_Hint: What would you do if you had to take all the uppercase letters?_

In [None]:
list_of_gibberish = ["g3N3r471N' words 7O be 48l3 to 3X7r4c7 V14 R3g3x", "k4N anyone r34Lly read 7h1z?", "1 K4'nt but TH@ d032'nt 5T0P me pHR0m using 1t"]
pattern = None
for gibber in list_of_gibberish:
    result = re.findall(pattern, gibber)
    print(result)

### 4) Match any repeated characters.

_Hint: Investigate subpatterns_

In [None]:
list_of_repetitions = ["Abba is one of the best artists out there", "No, Bastille is!", "Both of you are wrong, it is Creedence Clearwater Revival"]
pattern = None
for rep in list_of_repetitions:
    result = re.search(pattern, rep)
    print(result)

### 5) Extract **ALL** the phone numbers from the following strings.

_Hint: You will need to use a quantifier you JUST used, if you want the pattern as short as possible. Otherwise, make the pattern as long as you need it._

In [None]:
list_of_numbers = ["Call me at 530-657-9090", "Wasn't your number 606-849-9038", "No, it was 703-952-6949"]
pattern = None
for number in list_of_numbers:
    result = re.search(pattern, number)
    print(result)

### 6) Make a pattern that will identify all the following email addresses.

_Hint: You have seen this before. Go look through everything again._

In [None]:
list_of_emails = ["please_stop@gmail.com", "STORMLORD668@doom.com", "jonnel_roxs@aol.com"]
pattern = None
for email in list_of_emails:
    result = re.findall(pattern, email)
    print(result)

___
## Further Practice

https://regexone.com/ - Lessons

https://regexcrossword.com - Practice recognizing and reasoning out regex