## Regular Expressions

Learning (and loving!) regular expressions is a worthwhile investment:
<ul>
<li>Once you understand how they work, you can write complex string operations much more quickly, which saves you time.</li>
    <li>Regular expressions are often faster to execute than their manual equivalents.</li>
<li>Regular expressions are supported in almost every modern programming language, as well as other places like command line utilities and databases. Understanding regular expressions gives you a powerful tool that you can use wherever you work with data.</li>
    </ul>

The dataset we will be working with is based on a CSV of Hacker News stories from September 2015 to September 2016. The columns in the dataset are explained below:
<ul>
    <li><b>id</b>: The unique identifier from Hacker News for the story</li>
    <li><b>title</b>: The title of the story</li>
    <li><b>url</b>: The URL that the story links to, if the story has a URL</li>
    <li><b>num_points</b>: The number of points the story acquired, calculated as the total number of upvotes minus the total number of downvotes</li>
    <li><b>num_comments</b>: The number of comments that were made on the story</li>
    <li><b>author</b>: The username of the person who submitted the story</li>
    <li><b>created_at</b>: The date and time at which the story was submitted</li>
    </ul>

For teaching purposes, the dataset has been reduced from the almost 300,000 rows in its original form to approximately 20,000 rows by removing all submissions that did not receive any comments, and then randomly sampling from the remaining submissions. You can download the modified dataset using the dataset preview tool.

When working with regular expressions, we use the term <b>pattern</b> to describe a regular expression that we've written. If the pattern is found within the string we're searching, we say that it has matched.

As we previously learned, letters and numbers represent themselves in regular expressions. If we wanted to find the string "and" within another string, the regex pattern for that is simply <b>and</b>:
![ ](images/regular1.png)

<br>The first piece of regex syntax we'll learn is called a set. A set allows us to specify two or more characters that can match in a single character's position.</br>

<br>We define a set by placing the characters we want to match in square brackets:</br>
![ ](images/regular2.png)

The regular expression above will match the strings <b>mend</b>, <b>send</b>, and <b>bend</b>.
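As a quick check of how a set behaves, here is a minimal sketch using Python's built-in `re` module. It assumes the pattern pictured above is `[msb]end`, which is inferred from the three words it is said to match; the word list is made up for illustration.

```python
import re

# Assumed pattern: one character from the set m, s, or b, followed by "end".
pattern = r"[msb]end"

# Illustrative words (not taken from the dataset).
words = ["mend", "send", "bend", "lend", "amend"]

# re.search finds the pattern anywhere in the string, so "amend" matches
# because it contains "mend". "lend" does not, since l is not in the set.
matches = [w for w in words if re.search(pattern, w)]
print(matches)
```

`Series.str.contains()` in pandas also searches anywhere within each string, so the same pattern should behave the same way on a series of strings.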

Let's look at how we can add sets to match more of our example strings from earlier:
![ ](images/regular3.png)

### Series.str.contains()

In [1]:
import pandas as pd
import numpy as np
hn = pd.read_csv("hacker_news.csv")
hn.head()

| | id | title | url | num_points | num_comments | author | created_at |
|---|---|---|---|---|---|---|---|
| 0 | 12224879 | Interactive Dynamic Video | http://www.interactivedynamicvideo.com/ | 386 | 52 | ne0phyte | 8/4/2016 11:52 |
| 1 | 11964716 | Florida DJs May Face Felony for April Fools' W... | http://www.thewire.com/entertainment/2013/04/f... | 2 | 1 | vezycash | 6/23/2016 22:20 |
| 2 | 11919867 | Technology ventures: From Idea to Enterprise | https://www.amazon.com/Technology-Ventures-Ent... | 3 | 1 | hswarna | 6/17/2016 0:01 |
| 3 | 10301696 | Note by Note: The Making of Steinway L1037 (2007) | http://www.nytimes.com/2007/11/07/movies/07ste... | 8 | 2 | walterbell | 9/30/2015 4:12 |
| 4 | 10482257 | Title II kills investment? Comcast and other I... | http://arstechnica.com/business/2015/10/comcas... | 53 | 22 | Deinos | 10/31/2015 9:48 |


In [2]:
eg_list = ["Julie's favorite color is green.",
           "Keli's favorite color is Blue.",
           "Craig's favorite colors are blue and red."]

eg_series = pd.Series(eg_list)
print(eg_series)

0             Julie's favorite color is green.
1               Keli's favorite color is Blue.
2    Craig's favorite colors are blue and red.
dtype: object


In [3]:
pattern = "[Bb]lue"
pattern_contained = eg_series.str.contains(pattern)
print(pattern_contained)

0    False
1     True
2     True
dtype: bool


<br>The result is a boolean mask: a series of <b>True/False</b> values.</br>

<br>One of the neat things about boolean masks is that you can use the <b>Series.sum()</b> method to sum all the values in the boolean mask, with each True value counting as 1, and each False as 0. This means that we can easily count the number of values in the original series that matched our pattern:</br>

In [4]:
pattern_count = pattern_contained.sum()
print(pattern_count)

2


The following code shows how we can use a boolean mask to view the titles that match the <b>pattern</b>.

In [5]:
titles = hn['title']

py_titles_bool = titles.str.contains("[Pp]ython")
print(py_titles_bool.head())

0    False
1    False
2    False
3    False
4    False
Name: title, dtype: bool


In [6]:
py_titles = titles[py_titles_bool]
print(py_titles.head())

102                  From Python to Lua: Why We Switched
103            Ubuntu 16.04 LTS to Ship Without Python 2
144    Create a GUI Application Using Qt and Python i...
196    How I Solved GCHQ's Xmas Card with Python and ...
436    Unikernel Power Comes to Java, Node.js, Go, an...
Name: title, dtype: object


In [7]:
py_titles.value_counts()

Create a GUI Application Using Qt and Python in Minutes           1
Python 3 support in scientific Python projects                    1
Python one-liner to compare two files (conditions apply)          1
Free 2 Player Game [Python][Angular]                              1
Python 3.5.0                                                      1
                                                                 ..
Python 3 in 2016                                                  1
Pineapple  A standalone front end to IPython for Mac              1
Python 3 on Google App Engine flexible environment now in beta    1
Python, Machine Learning, and Language Wars                       1
Python wats                                                       1
Name: title, Length: 160, dtype: int64

## Quantifiers

<br>We can use braces <b>({})</b> to specify that a character repeats in our regular expression.</br>

<br>For instance, if we wanted to write a pattern that matches numbers from <b>1000</b> to <b>2999</b> in text, we could write the regular expression below:</br>
![ ](images/regular4.png)

This type of regular expression syntax is called a quantifier.
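One way to write the 1000 to 2999 pattern is sketched below. It assumes the pictured expression is `[1-2][0-9]{3}` (a 1 or 2 followed by exactly three digits), which is not confirmed by the text; the sample sentence is made up for illustration.

```python
import re

# Assumed pattern: first digit is 1 or 2, then exactly three more digits.
pattern = r"[1-2][0-9]{3}"

# Illustrative text (not taken from the dataset).
text = "Founded in 1998, rebuilt in 2016, room 999, zip 30301."

# findall returns every non-overlapping match in the text.
found = re.findall(pattern, text)
print(found)
```

Note that without word boundaries this pattern also matches the first four digits of a longer number such as 12345.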

Quantifiers specify how many of the previous character our pattern requires, which can help us when we want to match substrings of specific lengths. As an example, we might want to match both <b>e-mail</b> and <b>email</b>. To do this, we would specify that the <b>-</b> character should match either zero or one time.

The specific type of quantifier we saw above is called a <b>numeric quantifier</b>. Here are the different types of numeric quantifiers we can use:
![ ](images/regular7.png)

You might notice that the last two examples above omit either the lower or the upper bound, in the same way that we can omit the first or last index when slicing lists.
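The four forms of numeric quantifier can be compared side by side. This is a small sketch that assumes the table above covers the forms `{n}`, `{n,m}`, `{n,}`, and `{,m}`; the helper name `full_matches` and the sample strings are made up for illustration.

```python
import re

# Illustrative strings of repeated "a" characters.
samples = ["a", "aa", "aaa", "aaaa", "aaaaaa"]

def full_matches(pattern, strings):
    """Return the strings that the pattern matches in their entirety."""
    return [s for s in strings if re.fullmatch(pattern, s)]

# Each quantifier applies to the character immediately before it.
quantifiers = {
    "a{3}": "exactly three",
    "a{3,5}": "three to five",
    "a{3,}": "three or more (upper bound omitted)",
    "a{,3}": "zero to three (lower bound omitted)",
}

for pattern, meaning in quantifiers.items():
    print(pattern, "->", meaning, "->", full_matches(pattern, samples))
```

`re.fullmatch` is used here instead of `re.search` so that the length limits are visible; with a search, a longer string would still contain a shorter run that matches.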

In addition to numeric quantifiers, there are single characters in regex that specify some common quantifiers that you're likely to use. A summary of them is below.
![ ](images/regular8.png)
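The single-character quantifiers can be tried out in the same way. The sketch below assumes the summary covers the three standard ones in Python's `re` module: `*` (zero or more), `+` (one or more), and `?` (zero or one). The helper name and sample words are made up for illustration.

```python
import re

# Illustrative spellings with zero, one, and two "u" characters.
spellings = ["color", "colour", "colouur"]

def whole_word_matches(pattern, strings):
    """Return the strings that the pattern matches in their entirety."""
    return [s for s in strings if re.fullmatch(pattern, s)]

optional = whole_word_matches(r"colou?r", spellings)      # u zero or one time
zero_or_more = whole_word_matches(r"colou*r", spellings)  # u zero or more times
one_or_more = whole_word_matches(r"colou+r", spellings)   # u one or more times

print(optional)
print(zero_or_more)
print(one_or_more)
```

These are shorthand for numeric quantifiers: `?` behaves like `{0,1}`, `*` like `{0,}`, and `+` like `{1,}`.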

In [13]:
# Count how many titles in our dataset mention "email" or "e-mail".
# The optional quantifier ? makes the dash optional,
# so "e-?mail" is equivalent to "e-{0,1}mail".
titles = hn["title"]
pattern = r"e-?mail"

email_bool = titles.str.contains(pattern)
email_count = email_bool.sum()
email_titles = titles[email_bool]

In [14]:
email_titles.head()

119     Show HN: Send an email from your shell to your...
313         Disposable emails for safe spam free shopping
1361    Ask HN: Doing cold emails? helps us prove this...
1750    Protect yourself from spam, bots and phishing ...
2421                   Ashley Madison hack treating email
Name: title, dtype: object

## Finding Tags

Our next task is to find how many titles in our dataset have tags.

Our first inclination may be to create the regex <b>[pdf]</b>. Unfortunately, the brackets would be interpreted as a <b>set</b>, so our pattern would match the single characters p, d, or f.
![ ](images/regular9.png)

To match the substring "[pdf]", we can use backslashes to escape both the opening and closing brackets: `\[pdf\]`.
![ ](images/regular10.png)
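The difference between the unescaped and escaped patterns is easy to see on a few strings. This is a minimal sketch using the built-in `re` module; the titles are hypothetical and are not taken from the dataset.

```python
import re

# Hypothetical titles, written for illustration only.
titles = [
    "Deep learning survey [pdf]",
    "Why people fail",
    "Intro to regex [video]",
]

set_pattern = r"[pdf]"        # a set: any single p, d, or f
literal_pattern = r"\[pdf\]"  # escaped brackets: the literal text "[pdf]"

set_hits = [t for t in titles if re.search(set_pattern, t)]
literal_hits = [t for t in titles if re.search(literal_pattern, t)]

print(len(set_hits))   # every title contains a p, d, or f somewhere
print(literal_hits)    # only the title that contains "[pdf]"
```

Writing the pattern as a raw string (the `r` prefix) keeps Python from interpreting the backslashes before the regex engine sees them.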

The other critical part of our task of identifying how many titles have tags is knowing how to match the characters between the brackets (like pdf and video) without knowing ahead of time what the different topic tags will be.

To match unknown characters using regular expressions, we use character classes. Character classes allow us to match certain groups of characters. We've actually seen two examples of character classes already:
<ol type="1">
    <li>The set notation, which uses brackets to match any of a number of characters.</li>
    <li>The range notation, which we used to match ranges of digits (like <b>[0-9]</b>).</li>
</ol>

<br>Let's look at a summary of syntax for some of the regex character classes:</br>

![ ](regular11.png)
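A few character classes can be combined with escaped brackets to match tags without knowing their contents in advance. The sketch below assumes the summary includes the standard Python `re` classes `\w` (word character), `\d` (digit), and `.` (any character except a newline). The pattern `\[\w+\]` is one possible approach rather than the definitive solution, and the titles in the list are hypothetical.

```python
import re

# Hypothetical titles, written for illustration only.
titles = [
    "Neural network basics [pdf]",
    "Conference keynote [video]",
    "Ask HN: Favorite editor?",
    "Empty brackets [] are not a tag",
]

# An opening bracket, one or more word characters, then a closing bracket.
tag_pattern = r"\[\w+\]"
tagged = [t for t in titles if re.search(tag_pattern, t)]
print(tagged)

# \d matches a single digit, so \d+ matches a run of digits.
digits = re.findall(r"\d+", "Python 3 in 2016")
print(digits)

# . matches any single character except a newline.
dot_hits = re.findall(r"b.d", "bad bed b-d bd")
print(dot_hits)
```

On the dataset itself, the same tag pattern could be passed to `titles.str.contains()` and the resulting boolean mask summed to count the tagged titles.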