#### TITLE
Regular Expression Basics

#### OBJECTIVE
Regular Expressions are a powerful way of building patterns to match text.
![](./img/RegEx_1.png)

* Once you understand how they work, complex operations with string data can be written a lot quicker, which will save you time.
* Regular expressions are often faster to execute than their manual equivalents.
* Regular expressions are supported in almost every modern programming language, as well as other places like command line utilities and databases. Understanding regular expressions gives you a powerful tool that you can use wherever you work with data.


#### DATASET
We'll learn regular expressions while analyzing a dataset of submissions to the popular technology site [Hacker News](https://news.ycombinator.com/).
The dataset we will be working with is based off this [CSV](https://www.kaggle.com/hacker-news/hacker-news-posts) of Hacker News stories from September 2015 to September 2016.

The columns in the dataset are explained below:

| **Column**      | **Definition** |
| :---------- | :--------- |
| **id**  | The unique identifier from Hacker News for the story|
| **title**     | The title of the story|
| **url** |The URL that the story links to, if the story has a URL|
| **num_points** | The number of points the story acquired, calculated as the total number of upvotes minus the total number of downvotes|
| **num_comments**  |The number of comments that were made on the story|
| **author**     |The username of the person who submitted the story|
| **created_at** |The date and time at which the story was submitted|


#### DATA INTRODUCTION

In [1]:
import pandas as pd

hn = pd.read_csv('./datasets/hacker_news.csv')
hn.head()

Unnamed: 0,id,title,url,num_points,num_comments,author,created_at
0,12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52
1,11964716,Florida DJs May Face Felony for April Fools' W...,http://www.thewire.com/entertainment/2013/04/f...,2,1,vezycash,6/23/2016 22:20
2,11919867,Technology ventures: From Idea to Enterprise,https://www.amazon.com/Technology-Ventures-Ent...,3,1,hswarna,6/17/2016 0:01
3,10301696,Note by Note: The Making of Steinway L1037 (2007),http://www.nytimes.com/2007/11/07/movies/07ste...,8,2,walterbell,9/30/2015 4:12
4,10482257,Title II kills investment? Comcast and other I...,http://arstechnica.com/business/2015/10/comcas...,53,22,Deinos,10/31/2015 9:48


#### The Regular Expression Module
With regular expressions, we use the term **pattern** to describe a regular expression that we've written. If the pattern is found within the string we're searching, we say that it has **matched**.

Letters and numbers represent themselves in regular expressions. If we wanted to find the string **"and"** within another string, the regex pattern for that is simply **and**.
![title](./img/RegEx_2.png)
In the third example above, the pattern **and** does not match Andrew: regular expressions are case sensitive by default, so **a** and **A** are treated as different characters.

Python has a built-in module for regular expressions, the **re** module. This module contains a number of different functions and classes for working with regular expressions.


##### Search Function

One of the most useful functions in the **re** module is re.search(), which takes two required arguments:
* The regex pattern
* The string we want to search for that pattern

In [2]:
import re

m = re.search('and', 'hand')
print(m)

<re.Match object; span=(1, 4), match='and'>


The re.search() function will return a Match object if the pattern is found anywhere within the string. If the pattern is not found, re.search() returns None:

In [3]:
m = re.search("and", "antidote")
print(m)

None



The boolean value of a Match object is always **True**, while the boolean value of **None** is **False**. This lets us use the result of re.search() directly in an if statement to check whether our regex matches each string in a list.

In [4]:
string_list = ["Julie's favorite color is Blue.",
               "Keli's favorite color is Green.",
               "Craig's favorite colors are blue and red."]

pattern = "Blue"

for s in string_list:
    if re.search(pattern, s):
        print("Match")
    else:
        print("No Match")

Match
No Match
No Match


So far, we haven't done anything with regular expressions that we couldn't do using the **in** keyword. The power of regular expressions comes when we use one of the special character sequences.
The first of these we'll learn is called a **set**. A set allows us to specify two or more characters that can match in a single character's position.

We define a set by placing the characters we want to match for in square brackets:
![title](./img/RegEx_3.png)

The regular expression above will match the strings mend, send, and bend.

![title](./img/RegEx_4.png)
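
As a quick check of the set syntax, here is a minimal sketch. It assumes the pattern pictured above is `[msb]end` (an assumption inferred from the three strings it is said to match), and the test words are made up for illustration.

```python
import re

# Assumed pattern: one character from the set m, s, b, then the literal "end"
pattern = "[msb]end"

# Hypothetical test words, chosen only for illustration
words = ["mend", "send", "bend", "lend", "end"]

# Keep the words in which the pattern is found
matched = [w for w in words if re.search(pattern, w)]
print(matched)
```

The words lend and end are left out because the set requires exactly one of m, s, or b in the position before end.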


In [5]:
string_list = ["Julie's favorite color is Blue.",
               "Keli's favorite color is Green.",
               "Craig's favorite colors are blue and red."]
blue_mentions = 0
pattern = "[Bb]lue"

for s in string_list:
    if re.search(pattern, s):
        blue_mentions += 1

# Print once, after the loop has finished counting
print(blue_mentions)

2


Let's find out how many titles in our Hacker News dataset mention Python. We'll use a set to check for both Python with an uppercase 'P' and python with a lowercase 'p'.

In [7]:
import re

titles = hn["title"].tolist()
python_mentions = 0
pattern = '[Pp]ython'

for t in titles:
    if re.search(pattern, t):
        python_mentions += 1
        
print(python_mentions)

160


##### Counting Matches with Pandas Methods
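We should avoid writing explicit loops in pandas where we can, because vectorized methods are often faster and require less code. The Series.str.contains() method tests whether a regex pattern matches each value in a series: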

In [11]:
eg_list = ["Julie's favorite color is green.",
           "Keli's favorite color is Blue.",
           "Craig's favorite colors are blue and red."]

eg_series = pd.Series(eg_list)
print(eg_series, '\n')

pattern = '[Bb]lue'
pattern_contained = eg_series.str.contains(pattern)
print(pattern_contained)

0             Julie's favorite color is green.
1               Keli's favorite color is Blue.
2    Craig's favorite colors are blue and red.
dtype: object 

0    False
1     True
2     True
dtype: bool




The result is a **boolean mask**: a series of True/False values.

One of the neat things about boolean masks is that you can use the Series.sum() method to sum all the values in the boolean mask, with each True value counting as 1, and each False as 0. This means that we can easily count the number of values in the original series that matched our pattern:


In [12]:
pattern_count = pattern_contained.sum()
print(pattern_count)

2


In [13]:
titles = hn["title"]
python_mentions = 0
pattern = '[Pp]ython'

python_mentions = titles.str.contains(pattern).sum()
python_mentions

160

##### Using Regular Expressions to Select Data

In [14]:
titles = hn['title']

py_titles_bool = titles.str.contains("[Pp]ython")
print(py_titles_bool.head())

py_titles = titles[py_titles_bool]
print(py_titles.head())

py_titles = titles[titles.str.contains("[Pp]ython")]
print(py_titles.head())

0    False
1    False
2    False
3    False
4    False
Name: title, dtype: bool
102                  From Python to Lua: Why We Switched
103            Ubuntu 16.04 LTS to Ship Without Python 2
144    Create a GUI Application Using Qt and Python i...
196    How I Solved GCHQ's Xmas Card with Python and ...
436    Unikernel Power Comes to Java, Node.js, Go, an...
Name: title, dtype: object
102                  From Python to Lua: Why We Switched
103            Ubuntu 16.04 LTS to Ship Without Python 2
144    Create a GUI Application Using Qt and Python i...
196    How I Solved GCHQ's Xmas Card with Python and ...
436    Unikernel Power Comes to Java, Node.js, Go, an...
Name: title, dtype: object


In [15]:
ruby_titles = titles[titles.str.contains("[Rr]uby")]
print(ruby_titles.head())

190                    Ruby on Google AppEngine Goes Beta
484          Related: Pure Ruby Relational Algebra Engine
1388    Show HN: HTTPalooza  Ruby's greatest HTTP clie...
1949    Rewriting a Ruby C Extension in Rust: How a Na...
2022    Show HN: CrashBreak  Reproduce exceptions as f...
Name: title, dtype: object


##### Quantifiers
We can use braces ({}) to specify that a character repeats in our regular expression. For example, to match the numbers from 1000 to 2999 in text, we could write the regular expression below:
![title](./img/RegEx_5.png)
This type of regular expression syntax is called a **quantifier**. Quantifiers specify how many of the previous character our pattern requires, which helps when we want to match substrings of specific lengths. As an example, we might want to match both e-mail and email. To do this, we would specify that the - character should match either zero times or once.
The specific type of quantifier we saw above is called a numeric quantifier. Here are the different types of numeric quantifiers we can use:
![title](./img/RegEx_6.png)


You might notice that the last two examples above omit the first or last number, leaving that end of the range open, in the same way that we can omit the first or last indices when slicing lists.
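
The numeric quantifiers can be tried out with a short sketch. It assumes the 1000 to 2999 pattern is written as `[1-2][0-9]{3}` (one possible way to write it, since the image is not reproduced here), and the sample strings are invented.

```python
import re

# Assumed pattern: a 1 or 2, followed by exactly three digits
pattern = "[1-2][0-9]{3}"

# Made-up sample strings
samples = ["Founded in 1999", "Route 66", "Over 2999 stars", "Year 3000"]
matched = [s for s in samples if re.search(pattern, s)]
print(matched)

# {2,} omits the upper bound: two or more of the previous character
two_or_more = re.search("o{2,}", "goooal")
print(two_or_more.group())

# {,2} omits the lower bound: zero, one, or two of the previous character.
# Three o's sit between g and al, so this pattern finds nothing.
up_to_two = re.search("go{,2}al", "goooal")
print(up_to_two)
```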

In addition to numeric quantifiers, there are single characters in regex that specify some common quantifiers that you're likely to use. A summary of them is below.
![title](./img/RegEx_7.png)
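
The single-character quantifiers can be explored the same way. This sketch assumes the summary above covers `*` (zero or more), `+` (one or more), and `?` (zero or one), which are their standard meanings in Python's re module. The sample strings are made up.

```python
import re

samples = ["ct", "cat", "caat"]

# For each quantifier, record which samples contain a match
zero_or_more = [s for s in samples if re.search("ca*t", s)]
one_or_more = [s for s in samples if re.search("ca+t", s)]
zero_or_one = [s for s in samples if re.search("ca?t", s)]

print(zero_or_more)  # a may appear any number of times, including none
print(one_or_more)   # a must appear at least once
print(zero_or_one)   # a may appear at most once
```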

Next, we're going to find how many titles in our dataset mention **email** or **e-mail**. To do this, we'll need to use **?**, the optional quantifier, to specify that the dash character **-** is optional in our regular expression.

In [23]:
email_bool = titles.str.contains("e-?mail")
email_count = email_bool.sum()
print(email_count, '\n')
email_titles = titles[email_bool]
email_titles.head()

86 



119     Show HN: Send an email from your shell to your...
313         Disposable emails for safe spam free shopping
1361    Ask HN: Doing cold emails? helps us prove this...
1750    Protect yourself from spam, bots and phishing ...
2421                   Ashley Madison hack treating email
Name: title, dtype: object

##### Character Classes
So far, we've learned how to perform simple matches with sets, and how to use quantifiers to specify when a character should repeat a certain number of times. Let's continue by looking at a more complex example.

Some stories submitted to Hacker News include a topic tag in brackets, like [pdf]. Here are a few examples of story titles with these tags:

[video] Google Self-Driving SUV Sideswipes Bus
New Directions in Cryptography by Diffie and Hellman (1976) [pdf]
Wallace and Gromit  The Great Train Chase (1993) [video]

Our task in this section is to find how many titles in our dataset have tags.

Our first inclination may be to create the regex [pdf]. Unfortunately, the brackets would be interpreted as a set, so our pattern would match the single characters p, d, or f.
Without escaping characters
![title](./img/RegEx_8.png)
To match the substring "[pdf]", we can use backslashes to escape both the open and closing brackets: \[pdf\].
![title](./img/RegEx_9.png)
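
The difference between the two patterns can be seen with re.search(). This is a small sketch using one of the example titles from above. The patterns use the r prefix (raw strings), which is explained in a later section.

```python
import re

title = "New Directions in Cryptography by Diffie and Hellman (1976) [pdf]"

# Unescaped: the brackets define a set, so this matches a single p, d, or f
unescaped = re.search(r"[pdf]", title)
print(unescaped.group())

# Escaped: the brackets are literal, so this matches the whole tag
escaped = re.search(r"\[pdf\]", title)
print(escaped.group())
```

The unescaped pattern stops at the first p, d, or f it finds, which here is the p in Cryptography rather than anything in the tag.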
The other critical part of our task of identifying how many titles have tags is knowing how to match the characters between the brackets (like pdf and video) without knowing ahead of time what the different topic tags will be.

To match unknown characters using regular expressions, we use character classes. Character classes allow us to match certain groups of characters. We've actually seen two examples of character classes already:

1. The set notation using brackets to match any of a number of characters.
2. The range notation, which we used to match ranges of digits (like [0-9]).

Let's look at a summary of syntax for some of the regex character classes:

![title](./img/RegEx_10.png)

There are two new things we can observe from this table:

1. Ranges can be used for letters as well as numbers.
2. Sets and ranges can be combined.

Just like with quantifiers, there are some other common character classes which we'll use a lot.

![title](./img/RegEx_11.png)

The one that we'll be using to match characters in tags is \w, which represents any letter, digit, or underscore. Each character class represents a single character, so to match multiple characters (e.g. words like video and pdf), we'll need to combine them with quantifiers.

In order to match word characters between our brackets, we can combine the word character class (\w) with the 'one or more' quantifier (+), giving us a combined pattern of \w+.

This will match sequences like pdf, video, Python, and 2018 but won't match a sequence containing a space or punctuation character like PHP-DEV or XKCD Flowchart. If we wanted to match those tags as well, we could use .+; however, in this case, we're just interested in single-word tags without special characters.
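
To make the difference between \w+ and .+ concrete, here is a minimal sketch. The titles are hypothetical (only the tag texts come from the paragraph above), and the name `titles_sample` is chosen for illustration.

```python
import re

# Hypothetical titles built around the tags mentioned above
titles_sample = [
    "Swift Reversing [pdf]",
    "Mailing list update [PHP-DEV]",
    "Should you upgrade? [XKCD Flowchart]",
    "No tag here",
]

# \w+ accepts only word characters between the brackets
word_tags = [t for t in titles_sample if re.search(r"\[\w+\]", t)]

# .+ accepts any characters between the brackets, including - and spaces
any_tags = [t for t in titles_sample if re.search(r"\[.+\]", t)]

print(word_tags)
print(any_tags)
```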

Let's quickly recap the concepts we learned in this section:

* We can use a backslash to escape characters that have special meaning in regular expressions (e.g. \[ will match an open bracket character).
* Character classes let us match certain groups of characters (e.g. \w will match any word character).
* Character classes can be combined with quantifiers when we want to match different numbers of characters.

We'll use these concepts to count the number of titles that contain a tag.

In [22]:
pattern = r"\[\w+\]"  # raw string (see the next section), so Python leaves the backslashes untouched
tag_titles = titles[titles.str.contains(pattern)]
print(tag_titles, '\n')
tag_count = tag_titles.shape[0]
print(tag_count)

66       Analysis of 114 propaganda sources from ISIS, ...
100      Munich Gunman Got Weapon from the Darknet [Ger...
159           File indexing and searching for Plan 9 [pdf]
162      Attack on Kunduz Trauma Centre, Afghanistan  I...
195                 [Beta] Speedtest.net  HTML5 Speed Test
                               ...                        
19763    TSA can now force you to go through body scann...
19867                       Using Pony for Fintech [video]
19947                                Swift Reversing [pdf]
19979    WSJ/Dowjones Announce Unauthorized Access Betw...
20089    Users Really Do Plug in USB Drives They Find [...
Name: title, Length: 444, dtype: object 

444


##### Accessing the Matching Text with Capture Groups
Earlier, we learned that we can use backslashes to escape the [ and ] characters. Backslashes are used to escape many other characters in regular expressions, as well as to denote some special character sequences (like character classes).

In Python, a backslash followed by certain characters represents an escape sequence — like the \n sequence — which we previously learned represents a new line. These escape sequences can result in unintended consequences for our regular expressions. Let's take a look at a string containing the substring \b:
![title](./img/RegEx_11_5.png)


The escape sequence \b represents a backspace, so the final letter from our string is removed. The character sequence \b has a special meaning in regular expressions (which we'll learn about later), so we need a way to write these characters without triggering the escape sequence.

One way is to add an extra backslash before the "b":
![title](./img/RegEx_11_6.png)
This can make regular expressions even more difficult to read and interpret, so instead we use [raw strings](https://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals), which we denote by prefixing our string with the r character. Let's take a look at the code from above with a raw string:
![title](./img/RegEx_11_7.png)
We strongly recommend using raw strings for every regex you write, rather than remembering which sequences are escape sequences and using raw strings selectively. That way, you'll never encounter a situation where you forget or overlook something which causes your regex to break.
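
Since the screenshots above are not reproduced as text, the following sketch shows the same idea by comparing string lengths. The variable names are chosen for illustration.

```python
# In a normal string, "\b" is a single backspace character
normal = "\b"

# Doubling the backslash keeps both characters: a backslash and a b
doubled = "\\b"

# A raw string does the same without the extra backslash
raw = r"\b"

print(len(normal), len(doubled), len(raw))
print(doubled == raw)
```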

In the previous section, we calculated that 444 of the 20,100 Hacker News stories in our dataset contain tags. What if we wanted to find out what the text of these tags was, and how many of each are in the dataset?

In order to do this, we'll need to use capture groups. Capture groups allow us to specify one or more groups within our match that we can access separately. Here, we'll learn how to use one capture group per regular expression, leaving more complex capture group patterns for later.

We specify capture groups using parentheses. Let's add opening and closing parentheses to the pattern we wrote in the previous section, and break down how each character in our regular expression works:
![title](./img/RegEx_12.png)
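
Before moving to pandas, a capture group can be inspected on a single string with re.search(). This sketch uses one title from the dataset output shown earlier. Match.group(0) returns the whole match, and Match.group(1) returns the first capture group.

```python
import re

title = "File indexing and searching for Plan 9 [pdf]"

# The parentheses capture only the word characters between the brackets
m = re.search(r"\[(\w+)\]", title)

print(m.group(0))  # the entire match, brackets included
print(m.group(1))  # only the text captured by the parentheses
```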

In [27]:
tag_5 = tag_titles.head()
print(tag_5, '\n')

pattern = r"(\[\w+\])"
tag_5_matches = tag_5.str.extract(pattern)
print(tag_5_matches)

# Move the parentheses inside the brackets to get just the text:
pattern = r"\[(\w+)\]"
tag_5_matches = tag_5.str.extract(pattern)
print(tag_5_matches, '\n')

# Get a frequency table of the tags
tag_5_freq = tag_5_matches.value_counts()
print(tag_5_freq)

66     Analysis of 114 propaganda sources from ISIS, ...
100    Munich Gunman Got Weapon from the Darknet [Ger...
159         File indexing and searching for Plan 9 [pdf]
162    Attack on Kunduz Trauma Centre, Afghanistan  I...
195               [Beta] Speedtest.net  HTML5 Speed Test
Name: title, dtype: object 

            0
66      [pdf]
100  [German]
159     [pdf]
162     [pdf]
195    [Beta]
          0
66      pdf
100  German
159     pdf
162     pdf
195    Beta 

pdf       3
German    1
Beta      1
dtype: int64


Let's use the above technique to extract all of the tags from the Hacker News titles and build a frequency table of those tags:

In [29]:
pattern = r"\[(\w+)\]"
tag_freq = titles.str.extract(pattern).value_counts()
tag_freq

pdf            276
video          111
2015             3
audio            3
2014             2
slides           2
beta             2
viz              1
German           1
Petition         1
NSFW             1
Map              1
Live             1
JavaScript       1
Infograph        1
HBR              1
Challenge        1
GOST             1
Excerpt          1
React            1
CSS              1
Beta             1
Benchmark        1
Australian       1
ANNOUNCE         1
5                1
2008             1
Python           1
SpaceX           1
SPA              1
gif              1
updated          1
transcript       1
survey           1
song             1
satire           1
repost           1
png              1
much             1
map              1
detainee         1
Skinnywhale      1
crash            1
comic            1
coffee           1
blank            1
ask              1
Videos           1
Ubuntu           1
USA              1
videos           1
1996             1
dtype: int64

##### Negative Character Classes
**Negative character classes** match every character *except* those in the specified character class.
![title](./img/RegEx_13.png)

Let's use the negative set [^Ss] to find titles that mention Java while excluding instances like JavaScript and Javascript:

In [33]:
def first_10_matches(pattern):
    """
    Return the first 10 story titles that match
    the provided regular expression
    """
    all_matches = titles[titles.str.contains(pattern)]
    first_10 = all_matches.head(10)
    return first_10

pattern = r"[Jj]ava[^Ss]"
java_titles = titles[titles.str.contains(pattern)]
java_titles

436      Unikernel Power Comes to Java, Node.js, Go, an...
811      Ask HN: Are there any projects or compilers wh...
1840                     Adopting RxJava on the Airbnb App
1972           Node.js vs. Java: Which Is Faster for APIs?
2093                     Java EE and Microservices in 2016
2367     Code that is valid in both PHP and Java, and p...
2493     Ask HN: I've been a java dev for a couple of y...
2751                 Eventsourcing for Java 0.4.0 released
2910                 2016 JavaOne Intel Keynote  32mn Talk
3452     What are the Differences Between Java Platform...
4273      Ask HN: Is Bloch's Effective Java Still Current?
4624     Oracle Discloses Critical Java Vulnerability i...
5461                        Lambdas (in Java 8) Screencast
5847     IntelliJ IDEA and the whole IntelliJ platform ...
5947                                        JavaFX is dead
6268             Oracle deprecating Java applets in Java 9
7436     Forget Guava: 5 Google Libraries Java Develope.

##### Word Boundaries

While the negative set was effective in removing any bad matches that mention JavaScript, it also had the side-effect of removing any titles where Java occurs at the end of the string, like this title:

*Pippo  Web framework in Java*

This is because the negative set [^Ss] must match one character. Instances at the end of a string aren't followed by any characters, so there is no match.

A different approach to take in cases like these is to use the **word boundary anchor**, specified using the syntax \b. A word boundary matches the position between a word character and a non-word character, or a word character and the start/end of a string. The diagram below shows all the word boundaries in an example string:
![title](./img/RegEx_14.png)

Let's look at how using a word boundary changes the match from the string in the example above:

In [34]:
string = "Sometimes people confuse JavaScript with Java"
pattern_1 = r"Java[^S]"

m1 = re.search(pattern_1, string)
print(m1)

None


The regular expression returns None, because there is no substring that contains Java followed by a character that isn't S.

Let's instead use word boundaries in our regular expression:

In [35]:
pattern_2 = r"\bJava\b"

m2 = re.search(pattern_2, string)
print(m2)

<re.Match object; span=(41, 45), match='Java'>


With the word boundary, our pattern matches the Java at the end of the string.
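
To see where word boundaries actually fall, the sketch below lists every position that \b matches in a short, made-up string. Because \b matches a position rather than a character, each match is empty and m.start() gives its index.

```python
import re

string = "Java, JavaScript"

# Collect the index of every word boundary in the string
boundaries = [m.start() for m in re.finditer(r"\b", string)]
print(boundaries)

# With boundaries on both sides, only the standalone word is found
standalone = re.findall(r"\bJava\b", string)
print(standalone)
```

The boundaries sit at the start and end of each word. There is none between Java and Script inside JavaScript, which is why the second pattern skips it.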

In [36]:
pattern = r"\b[Jj]ava\b"
java_titles = titles[titles.str.contains(pattern)]
java_titles

436      Unikernel Power Comes to Java, Node.js, Go, an...
811      Ask HN: Are there any projects or compilers wh...
1023                          Pippo  Web framework in Java
1972           Node.js vs. Java: Which Is Faster for APIs?
2093                     Java EE and Microservices in 2016
2367     Code that is valid in both PHP and Java, and p...
2493     Ask HN: I've been a java dev for a couple of y...
2751                 Eventsourcing for Java 0.4.0 released
3228                               Comparing Rust and Java
3452     What are the Differences Between Java Platform...
3627                     Friends don't let friends do Java
4273      Ask HN: Is Bloch's Effective Java Still Current?
4624     Oracle Discloses Critical Java Vulnerability i...
5461                        Lambdas (in Java 8) Screencast
5847     IntelliJ IDEA and the whole IntelliJ platform ...
6268             Oracle deprecating Java applets in Java 9
7436     Forget Guava: 5 Google Libraries Java Develope.

##### Matching at the Start and End of Strings
So far, we've used regular expressions to match substrings contained anywhere within text. There are often scenarios where we want to specifically match a pattern at the start or end of a string.

In the previous section, we learned that the word boundary anchor matches the position between a word character and a non-word character. More generally in regular expressions, an anchor matches something that isn't a character, as opposed to character classes which match specific characters.

Other than the word boundary anchor, the other two most common anchors are the beginning anchor and the end anchor, which represent the start and the end of the string.

positional anchors
![title](./img/RegEx_15.png)
Note that the ^ character is used both as a beginning anchor and to indicate a negative set, depending on whether the character preceding it is a [ or not.
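
The two roles of ^ can be compared side by side. This is a minimal sketch with made-up strings.

```python
import re

# Outside brackets, ^ anchors the match to the start of the string
starts_with_cat = bool(re.search(r"^cat", "catalog"))
cat_not_at_start = bool(re.search(r"^cat", "bobcat"))
print(starts_with_cat, cat_not_at_start)

# Inside brackets, ^ negates the set: any single character except c
negated = re.search(r"[^c]at", "cat hat")
print(negated.group())

# Both roles together: the first character must not be c
both = re.search(r"^[^c]at", "cat")
print(both)
```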

Let's start with a few test cases that all contain the substring Red at different parts of the string:

In [39]:
test_cases = pd.Series([
    "Red Nose Day is a well-known fundraising event",
    "My favorite color is Red",
    "My Red Car was purchased three years ago"
])
print(test_cases, '\n')

# Match the word Red only if it occurs at the start of the string
print(test_cases.str.contains(r"^Red"), '\n')

# Match the word Red only if it occurs at the end of the string
print(test_cases.str.contains(r"Red$"))

0    Red Nose Day is a well-known fundraising event
1                          My favorite color is Red
2          My Red Car was purchased three years ago
dtype: object 

0     True
1    False
2    False
dtype: bool 

0    False
1     True
2    False
dtype: bool


In [43]:
pattern_beginning = r"^\[\w+\]"
beginning_count = titles.str.contains(pattern_beginning).sum()
print(beginning_count)

pattern_ending =  r"\[\w+\]$"
ending_count = titles.str.contains(pattern_ending).sum()
print(ending_count)

15
417


##### Using Flags to Modify Regex Patterns
Within the titles, there are many different formatting styles used to represent the word "email." Here is a list of the variations:

email

Email

e Mail

e mail

E-mail

e-mail

eMail

E-Mail

EMAIL

emails

Emails

E-Mails

To write a regular expression for this, we would need to use a set for all five letters in email, which would make our regular expression very hard to read.

Instead, we can use flags to specify that our regular expression should ignore case.

Both re.search() and the pandas regular expression methods accept an optional flags argument. This argument accepts one or more flags, which are special variables in the re module that modify the behavior of the regex interpreter.

A list of all available flags is in the documentation, but by far the most common and the most useful is the re.IGNORECASE flag, which is also available using the alias re.I for convenience.

When you use this flag, all uppercase letters will match their lowercase equivalents and vice versa. 

In [44]:
email_tests = pd.Series(['email', 'Email', 'eMail', 'EMAIL'])
email_tests.str.contains(r"email")

0     True
1    False
2    False
3    False
dtype: bool

In [45]:
import re
email_tests.str.contains(r"email",flags=re.I)

0    True
1    True
2    True
3    True
dtype: bool
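
The same flag works with re.search() on a single string. This short sketch also confirms that re.I and re.IGNORECASE refer to the same flag.

```python
import re

# Without the flag, matching is case sensitive
no_flag = re.search(r"email", "EMAIL")
print(no_flag)

# With flags=re.I, the same pattern matches regardless of case
with_flag = re.search(r"email", "EMAIL", flags=re.I)
print(with_flag.group())

# re.I is simply a shorter name for re.IGNORECASE
print(re.I == re.IGNORECASE)
```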

In [46]:
import re

email_tests = pd.Series(['email', 'Email', 'e Mail', 'e mail', 'E-mail',
              'e-mail', 'eMail', 'E-Mail', 'EMAIL', 'emails', 'Emails',
              'E-Mails'])
pattern = r"\be[\-\s]?mails?\b"
email_mentions = titles.str.contains(pattern, flags=re.I).sum()
email_mentions

141
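
As a sanity check, the pattern can be run against each of the variations listed at the start of this section. The sketch uses plain re.search() on a list, and the name `unmatched` is chosen for illustration.

```python
import re

variations = ['email', 'Email', 'e Mail', 'e mail', 'E-mail',
              'e-mail', 'eMail', 'E-Mail', 'EMAIL', 'emails', 'Emails',
              'E-Mails']

pattern = r"\be[\-\s]?mails?\b"

# Collect any variation the pattern fails to match
unmatched = [v for v in variations if not re.search(pattern, v, flags=re.I)]
print(unmatched)
```

An empty list means every variation is covered by the pattern.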

#### REFERENCES
https://regexr.com/

https://docs.python.org/3/library/re.html#re.A

https://docs.python.org/3/library/re.html#re.I