# Regular Expressions

Learning (and loving!) regular expressions is a worthwhile investment:
<ul>
<li>Once you understand how they work, complex operations with string data can be written a lot quicker, which will save you time.</li>
    <li>Regular expressions are often faster to execute than their manual equivalents.</li>
<li>Regular expressions are supported in almost every modern programming language, as well as other places like command line utilities and databases. Understanding regular expressions gives you a powerful tool that you can use wherever you work with data.</li>
    </ul>

The dataset we will be working with is based on a CSV of Hacker News stories from September 2015 to September 2016. The columns in the dataset are explained below:
<ul>
    <li><b>id</b>: The unique identifier from Hacker News for the story</li>
    <li><b>title</b>: The title of the story</li>
    <li><b>url</b>: The URL that the story links to, if the story has a URL</li>
    <li><b>num_points</b>: The number of points the story acquired, calculated as the total number of upvotes minus the total number of downvotes</li>
    <li><b>num_comments</b>: The number of comments that were made on the story</li>
    <li><b>author</b>: The username of the person who submitted the story</li>
    <li><b>created_at</b>: The date and time at which the story was submitted</li>
    </ul>

For teaching purposes, the dataset has been reduced from the almost 300,000 rows in its original form to approximately 20,000 rows by removing all submissions that did not receive any comments, and then randomly sampling from the remaining submissions. You can download the modified dataset using the dataset preview tool.

When working with regular expressions, we use the term <b>pattern</b> to describe a regular expression that we've written. If the pattern is found within the string we're searching, we say that it has matched.

As we previously learned, letters and numbers represent themselves in regular expressions. If we wanted to find the string "and" within another string, the regex pattern for that is simply <b>and</b>:
![ ](images/regular1.png)

<br>The first piece of regex syntax we'll learn is called a set. A set allows us to specify two or more characters that can match in a single character's position.</br>

<br>We define a set by placing the characters we want to match in square brackets:</br>
![ ](images\regular2.png)

The regular expression above will match the strings <b>mend</b>, <b>send</b>, and <b>bend</b>.
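
As a quick check, the sketch below tries this set with Python's built-in <b>re</b> module, which uses the same regex syntax as the pandas string methods. The pattern <b>[msb]end</b> is assumed from the description above, and the word list is made up for illustration.

```python
import re

# Assumed pattern: the set [msb] matches exactly one character,
# which must be "m", "s", or "b", followed by the literal text "end".
pattern = r"[msb]end"

# Hypothetical word list for illustration.
words = ["mend", "send", "bend", "lend", "end"]

# Keep only the words in which the pattern is found.
matches = [w for w in words if re.search(pattern, w)]
print(matches)
```

The words "lend" and "end" are left out because the character before "end" has to be one of the three characters in the set.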

Let's look at how we can add sets to match more of our example strings from earlier:
![ ](images\regular3.png)

### Series.str.contains()

In [2]:
import pandas as pd
import numpy as np
hn = pd.read_csv("hacker_news.csv")
hn.head()


In [None]:
eg_list = ["Julie's favorite color is green.",
           "Keli's favorite color is Blue.",
           "Craig's favorite colors are blue and red."]

eg_series = pd.Series(eg_list)
print(eg_series)

0             Julie's favorite color is green.
1               Keli's favorite color is Blue.
2    Craig's favorite colors are blue and red.
dtype: object


In [None]:
pattern = "[Bb]lue"
pattern_contained = eg_series.str.contains(pattern)
print(pattern_contained)

0    False
1     True
2     True
dtype: bool


<br>The result is a boolean mask: a series of <b>True/False</b> values.</br>

<br>One of the neat things about boolean masks is that you can use the <b>Series.sum()</b> method to sum all the values in the boolean mask, with each True value counting as 1, and each False as 0. This means that we can easily count the number of values in the original series that matched our pattern:</br>

In [None]:
pattern_count = pattern_contained.sum()
print(pattern_count)

2


The following code shows how we can view the titles that match the <b>pattern</b>.

In [None]:
titles = hn['title']

py_titles_bool = titles.str.contains("[Pp]ython")
print(py_titles_bool.head())

0    False
1    False
2    False
3    False
4    False
Name: title, dtype: bool


In [None]:
py_titles = titles[py_titles_bool]
print(py_titles.head())

102                  From Python to Lua: Why We Switched
103            Ubuntu 16.04 LTS to Ship Without Python 2
144    Create a GUI Application Using Qt and Python i...
196    How I Solved GCHQ's Xmas Card with Python and ...
436    Unikernel Power Comes to Java, Node.js, Go, an...
Name: title, dtype: object


In [None]:
py_titles.value_counts()

Cysignals: signal handling (SIGINT, SIGSEGV, ) for calling C from Python           1
Pineapple  A standalone front end to IPython for Mac                               1
Show HN: Stack overflow command line client added support to python 2              1
Show HN: TeachCraft  Learning Python Through Minecraft                             1
Python vs. Julia Observations                                                      1
                                                                                  ..
Ask HN: How to automate Python apps deployment?                                    1
Legofy  Python program to make an image to look as if it was created with Legos    1
Ubuntu Drops Python 2.7 from the Default Install in 16.04                          1
PyThalesians: Python Open Source Financial Library                                 1
Memspector: Inspect memory usage of Python functions                               1
Name: title, Length: 160, dtype: int64

## quantifier

<br>We can use braces <b>({})</b> to specify that a character repeats a given number of times in our regular expression.</br>

<br>For instance, if we wanted to write a pattern that matches the numbers in text from <b>1000</b> to <b>2999</b> we could write the regular expression below:</br>
![ ](images\regular4.png)

This type of regular expression syntax is called a quantifier.
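
A minimal sketch of the 1000 to 2999 example, assuming the pattern is <b>[1-2][0-9]{3}</b> (a 1 or 2 followed by exactly three digits). The sample strings are hypothetical.

```python
import re

# Assumed pattern: a 1 or 2, followed by exactly three digits (1000 to 2999).
pattern = r"[1-2][0-9]{3}"

# Hypothetical sample strings.
samples = [
    "Founded in 1998",
    "Predictions for the year 2999",
    "Only 300 left",
    "Call 4000 now",
]

# Record the matched text for each sample, or None when nothing matches.
found = []
for text in samples:
    match = re.search(pattern, text)
    found.append(match.group() if match else None)

print(found)
```

The last two samples produce no match: 300 has too few digits, and 4000 does not start with a 1 or 2.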

Quantifiers specify how many times the previous character must repeat, which helps when we want to match substrings of specific lengths. As an example, we might want to match both <b>e-mail</b> and <b>email</b>. To do this, we would specify that the <b>-</b> character should match either zero or one times.

The specific type of quantifier we saw above is called a <b>numeric quantifier</b>. Here are the different types of numeric quantifiers we can use:
![ ](images\regular7.png)
You might notice that the last two examples above omit the first or the last number inside the braces, in the same way that we can omit the first or last indices when slicing lists.
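
The sketch below illustrates the four numeric quantifier forms with Python's <b>re</b> module. The sample strings and the helper <b>full_matches()</b> are hypothetical, and <b>re.fullmatch()</b> is used so that the whole string has to fit the pattern.

```python
import re

# Hypothetical samples: runs of one to five "a" characters.
samples = ["a", "aa", "aaa", "aaaa", "aaaaa"]

def full_matches(pattern):
    """Return the samples that the pattern matches from start to end."""
    return [s for s in samples if re.fullmatch(pattern, s)]

exactly_three = full_matches(r"a{3}")   # exactly 3 repetitions
two_to_four = full_matches(r"a{2,4}")   # between 2 and 4 repetitions
up_to_two = full_matches(r"a{,2}")      # lower bound omitted: 0 to 2
four_or_more = full_matches(r"a{4,}")   # upper bound omitted: 4 or more

print(exactly_three, two_to_four, up_to_two, four_or_more)
```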

In addition to numeric quantifiers, there are single characters in regex that specify some common quantifiers that you're likely to use. A summary of them is below.
![ ](images\regular8.png)
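
As a rough illustration, assuming the single-character quantifiers in question are <b>?</b> (zero or one), <b>*</b> (zero or more), and <b>+</b> (one or more), the sketch below applies each one to the dash in a few made-up spellings.

```python
import re

# Hypothetical spellings with zero, one, and two dashes.
candidates = ["email", "e-mail", "e--mail"]

# ? : the dash may appear zero or one times.
optional = [s for s in candidates if re.fullmatch(r"e-?mail", s)]

# * : the dash may appear zero or more times.
zero_or_more = [s for s in candidates if re.fullmatch(r"e-*mail", s)]

# + : the dash must appear one or more times.
one_or_more = [s for s in candidates if re.fullmatch(r"e-+mail", s)]

print(optional)
print(zero_or_more)
print(one_or_more)
```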

In [None]:
# We're going to find how many titles in our dataset mention email or e-mail.
# To do this, we'll use ?, the optional quantifier (equivalent to {0,1}),
# to specify that the dash character - is optional in our regular expression.
titles = hn["title"]
pattern = r"e-?mail"

email_bool = titles.str.contains(pattern)
email_count = email_bool.sum()
email_titles = titles[email_bool]

In [None]:
email_titles.head()

119     Show HN: Send an email from your shell to your...
313         Disposable emails for safe spam free shopping
1361    Ask HN: Doing cold emails? helps us prove this...
1750    Protect yourself from spam, bots and phishing ...
2421                   Ashley Madison hack treating email
Name: title, dtype: object

## finding tags

Our next task is to find how many titles in our dataset have tags, such as [pdf].

Our first inclination may be to create the regex <b>[pdf]</b>. Unfortunately, the brackets would be interpreted as a <b>set</b>, so our pattern would match the single characters p, d, or f.
![ ](images\regular9.png)

To match the substring "[pdf]", we can use backslashes to escape both the open and closing brackets: \[pdf\].
![ ](images\regular10.png)
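
The difference between the two patterns can be sketched with Python's <b>re</b> module, using one of the titles from the dataset. The <b>r</b> prefix marks a raw string, which is covered in the raw strings section.

```python
import re

title = "File indexing and searching for Plan 9 [pdf]"

# Unescaped: [pdf] is a set, so it matches the first single
# character in the title that is "p", "d", or "f".
set_match = re.search(r"[pdf]", title).group()

# Escaped: the brackets are literal, so the whole tag is matched.
literal_match = re.search(r"\[pdf\]", title).group()

print(set_match)
print(literal_match)
```

The unescaped pattern stops at the "d" in "indexing", which is why it cannot be used to find the tag itself.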

The other critical part of our task of identifying how many titles have tags is knowing how to match the characters between the brackets (like pdf and video) without knowing ahead of time what the different topic tags will be.

To match unknown characters using regular expressions, we use <b>character classes</b>. Character classes allow us to match certain groups of characters. We've actually seen two examples of character classes already:
<ol type="1">
    <li>The set notation using brackets to match any of a number of characters.</li>
    <li>The range notation, which we used to match ranges of digits (like <b>[0-9]</b>).</li>
    </ol>
    
<br>Let's look at a summary of syntax for some of the regex character classes:</br>

![ ](images\regular11.png)
There are two new things we can observe from this table:
<ol type="1">
    <li>Ranges can be used for letters as well as numbers.</li>
    <li>Sets and ranges can be combined.</li>
</ol>
Just like with quantifiers, there are some other common character classes which we'll use a lot.

![ ](images\regular12.png)

In order to match word characters between our brackets, we can combine the <b>word character class (\w)</b> with the <b>'one or more'</b> <b>quantifier (+)</b>, giving us a combined pattern of <b>\w+</b>.

This will match sequences like pdf, video, Python, and 2018 but won't match a sequence containing a space or punctuation character like PHP-DEV or XKCD Flowchart. If we wanted to match those tags as well, we could use .+; however, in this case, we're just interested in single-word tags without special characters.
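
A small sketch of this difference, using the example tags mentioned above wrapped in brackets:

```python
import re

# Example tags from the text, written as they would appear in a title.
tags = ["[pdf]", "[video]", "[2018]", "[PHP-DEV]", "[XKCD Flowchart]"]

# \w+ only allows word characters between the brackets.
word_only = [t for t in tags if re.search(r"\[\w+\]", t)]

# .+ allows any characters, including dashes and spaces.
anything = [t for t in tags if re.search(r"\[.+\]", t)]

print(word_only)
print(anything)
```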

Let's quickly recap the concepts we learned in this section:
<ul>
<li>We can use a backslash to escape characters that have special meaning in regular expressions (e.g. <b>\[</b> will match an open bracket character).</li>
    <li>Character classes let us match certain groups of characters (e.g. <b>\w</b> will match any word character).</li>
<li>Character classes can be combined with quantifiers when we want to match different numbers of characters.</li>
</ul>

In [None]:
pattern = r"\[\w+\]"  # raw string, so Python leaves the backslashes untouched

tag_titles = titles.str.contains(pattern)

tag_count = tag_titles.sum()

In [None]:
tag_titles.value_counts()

False    19655
True       444
Name: title, dtype: int64

## raw strings

![ ](images\regular13.png)

![ ](images\regular14.png)

<br>We strongly recommend using raw strings for every regex you write, rather than remembering which sequences are escape sequences and using raw strings selectively.</br>
<br>That way, you'll never encounter a situation where you forget or overlook something which causes your regex to break.</br>
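
A minimal sketch of what can go wrong without raw strings, using the sequence \b as an assumed example. In a normal Python string \b is the backspace character, while in a raw string the backslash is kept and the regex engine sees \b. The sample sentence is made up.

```python
import re

# Normal string: each "\b" becomes a single backspace character.
normal = "\bJava\b"

# Raw string: the backslashes are kept as written.
raw = r"\bJava\b"

normal_length = len(normal)  # backspace + "Java" + backspace
raw_length = len(raw)        # "\", "b", "Java", "\", "b"

text = "Learn Java today"  # hypothetical sample sentence

# The normal string looks for literal backspace characters, so it fails.
normal_found = bool(re.search(normal, text))
# The raw string is read by the regex engine as a word boundary pattern.
raw_found = bool(re.search(raw, text))

print(normal_length, raw_length)
print(normal_found, raw_found)
```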

## capture groups
Capture groups allow us to specify one or more groups within our match that we can access separately.
<br>In this mission, we'll learn how to use one capture group per regular expression, but in the next mission we'll learn some more complex capture group patterns.</br>

We specify capture groups using parentheses. Let's add opening and closing parentheses to the pattern we wrote in the previous screen, and break down how each character in our regular expression works:

![ ](images\regular15.png)

We use the <b>Series.str.extract()</b> method to extract the match within our parentheses. This lets us find out what the text of these tags is and how many of each are in the dataset.

In [None]:
tag_5 = titles[tag_titles].head()
print(tag_5)

66     Analysis of 114 propaganda sources from ISIS, ...
100    Munich Gunman Got Weapon from the Darknet [Ger...
159         File indexing and searching for Plan 9 [pdf]
162    Attack on Kunduz Trauma Centre, Afghanistan  I...
195               [Beta] Speedtest.net  HTML5 Speed Test
Name: title, dtype: object


In [None]:
pattern = r"(\[\w+\])"
tag_5_matches = tag_5.str.extract(pattern)
print(tag_5_matches)

            0
66      [pdf]
100  [German]
159     [pdf]
162     [pdf]
195    [Beta]


We can move our parentheses inside the brackets to get just the text:

In [None]:
pattern = r"\[(\w+)\]"
tag_5_matches = tag_5.str.extract(pattern)
print(tag_5_matches)

          0
66      pdf
100  German
159     pdf
162     pdf
195    Beta
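
To count how many times each tag occurs, one option is to tally the captured text. The sketch below does this in plain Python with <b>re</b> and <b>collections.Counter</b>. Apart from the first title, the sample titles are made up for illustration.

```python
import re
from collections import Counter

# The first title comes from the dataset; the rest are hypothetical.
sample_titles = [
    "File indexing and searching for Plan 9 [pdf]",
    "A made-up report on regular expressions [pdf]",
    "A made-up conference talk [video]",
    "A made-up title without any tag",
]

pattern = r"\[(\w+)\]"

# Collect the text captured by the group for every title that has a tag.
tags = []
for title in sample_titles:
    match = re.search(pattern, title)
    if match:
        tags.append(match.group(1))

tag_counts = Counter(tags)
print(tag_counts)
```

With pandas, a similar result could be obtained by calling <b>Series.str.extract()</b> with <b>expand=False</b> and then <b>Series.value_counts()</b> on the result.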


## negative character classes
<br>Negative character classes are character classes that match every character except those in a given character class.</br>

<br>We can see that there are a number of matches that contain <b>Java</b> as part of the word <b>JavaScript</b>. We want to exclude these titles from matching so we get an accurate count. One way of doing this is by using negative character classes.</br>
Let's look at a table of the common negative character classes:
![ ](images\regular16.png)

In [None]:
def first_10_matches(pattern):
    """
    Return the first 10 story titles that match
    the provided regular expression
    """
    all_matches = titles[titles.str.contains(pattern)]
    first_10 = all_matches.head(10)
    return first_10

pattern = r"[Jj]ava[^Ss]"
java_titles = titles[titles.str.contains(pattern)]
first_10_matches(pattern)

436     Unikernel Power Comes to Java, Node.js, Go, an...
811     Ask HN: Are there any projects or compilers wh...
1840                    Adopting RxJava on the Airbnb App
1972          Node.js vs. Java: Which Is Faster for APIs?
2093                    Java EE and Microservices in 2016
2367    Code that is valid in both PHP and Java, and p...
2493    Ask HN: I've been a java dev for a couple of y...
2751                Eventsourcing for Java 0.4.0 released
2910                2016 JavaOne Intel Keynote  32mn Talk
3452    What are the Differences Between Java Platform...
Name: title, dtype: object

## word boundary anchor
The word boundary anchor is specified using the syntax <b>\b</b>.
<br>A word boundary matches the position between a word character and a non-word character, or a word character and the start/end of a string. The diagram below shows all the word boundaries in an example string:</br>
![ ](images\regular17.png)
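
The sketch below compares the negative set pattern from the previous section with a word boundary pattern. The sample titles are adapted from the outputs in this section, plus one made-up JavaScript title.

```python
import re

sample_titles = [
    "Pippo Web framework in Java",        # Java at the end of the string
    "Java EE and Microservices in 2016",  # Java as a separate word
    "Learning JavaScript quickly",        # hypothetical title
    "Adopting RxJava on the Airbnb App",  # Java inside a longer word
]

negative_set = r"[Jj]ava[^Ss]"
word_boundary = r"\b[Jj]ava\b"

# The negative set needs one more character after "Java", so it misses
# "Java" at the end of a string and still accepts "RxJava ".
with_negative_set = [t for t in sample_titles if re.search(negative_set, t)]

# The word boundary matches positions rather than characters, so it accepts
# "Java" at the end of a string and rejects "RxJava" and "JavaScript".
with_word_boundary = [t for t in sample_titles if re.search(word_boundary, t)]

print(with_negative_set)
print(with_word_boundary)
```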

In [None]:
pattern = r"\b[Jj]ava\b"
java_titles = titles[titles.str.contains(pattern)]

In [None]:
java_titles

436      Unikernel Power Comes to Java, Node.js, Go, an...
811      Ask HN: Are there any projects or compilers wh...
1023                          Pippo  Web framework in Java
1972           Node.js vs. Java: Which Is Faster for APIs?
2093                     Java EE and Microservices in 2016
2367     Code that is valid in both PHP and Java, and p...
2493     Ask HN: I've been a java dev for a couple of y...
2751                 Eventsourcing for Java 0.4.0 released
3228                               Comparing Rust and Java
3452     What are the Differences Between Java Platform...
3627                     Friends don't let friends do Java
4273      Ask HN: Is Bloch's Effective Java Still Current?
4624     Oracle Discloses Critical Java Vulnerability i...
5461                        Lambdas (in Java 8) Screencast
5847     IntelliJ IDEA and the whole IntelliJ platform ...
6268             Oracle deprecating Java applets in Java 9
7436     Forget Guava: 5 Google Libraries Java Develope.

## beginning anchor and the end anchor

<br>More generally in regular expressions, an anchor matches something that isn't a character, as opposed to character classes which match specific characters.</br>

<br>Other than the word boundary anchor, the other two most common anchors are the beginning anchor and the end anchor, which represent the start and the end of the string, respectively.</br>

![ ](images\regular18.png)

Note that the <b>^</b> character is used both as a beginning anchor and to indicate a negative set, depending on whether the character preceding it is a <b>[</b>  or not.

Let's start with a few test cases that all contain the substring Red at different parts of the string, as well as a test function:

In [None]:
test_cases = pd.Series([
    "Red Nose Day is a well-known fundraising event",
    "My favorite color is Red",
    "My Red Car was purchased three years ago"
])
print(test_cases)

0    Red Nose Day is a well-known fundraising event
1                          My favorite color is Red
2          My Red Car was purchased three years ago
dtype: object


If we want to match the word <b>Red</b> only if it occurs at the start of the string, we add the beginning anchor to the start of our regular expression:

In [None]:
test_cases.str.contains(r"^Red")

0     True
1    False
2    False
dtype: bool

If we want to match the word Red only if it occurs at the end of the string, we add the end anchor to the end of our regular expression:

In [None]:
test_cases.str.contains(r"Red$")

0    False
1     True
2    False
dtype: bool

In [None]:
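# Raw strings keep the backslashes intact. The capture groups are dropped
# because str.contains() only tests for a match and warns when groups are present.
pattern1 = r"^\[\w+\]"

beginning_count = titles.str.contains(pattern1).sum()
pattern2 = r"\[\w+\]$"
ending_count = titles.str.contains(pattern2).sum()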


In [None]:
beginning_count

15

In [None]:
ending_count

417

In [None]:
import re

email_tests = pd.Series(['email', 'Email', 'e Mail', 'e mail', 'E-mail',
              'e-mail', 'eMail', 'E-Mail', 'EMAIL', 'emails', 'Emails',
              'E-Mails'])

pattern = r"\be[\-\s]?mails?\b"
email_mentions = titles.str.contains(pattern, flags=re.I).sum()

In [None]:
email_mentions

141
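
Before trusting the count, it can help to confirm that the pattern accepts every spelling in the test list. A minimal sketch using Python's <b>re</b> module on a plain list of the same test strings:

```python
import re

email_tests = ['email', 'Email', 'e Mail', 'e mail', 'E-mail',
               'e-mail', 'eMail', 'E-Mail', 'EMAIL', 'emails', 'Emails',
               'E-Mails']

pattern = r"\be[\-\s]?mails?\b"

# Map each test string to whether the pattern matches it, ignoring case.
results = {s: bool(re.search(pattern, s, flags=re.I)) for s in email_tests}

all_matched = all(results.values())
print(all_matched)
```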