# Pandas | Advanced Regular Expressions
---
## Concepts:  
- How to use multiple capture groups to extract URL data.
- How to use lookarounds to customize matches based on the surrounding text.
- How to substitute a regular expression match to clean inconsistent data.
- How to use named capture groups to extract dataframes from a text column.

---
We're going to continue working with the dataset from the previous mission, which comes from the technology site Hacker News. Let's take a moment to refresh our memory of the different columns in this dataset:

- `id`: The unique identifier from Hacker News for the story
- `title`: The title of the story
- `url`: The URL that the story links to, if the story has a URL
- `num_points`: The number of points the story acquired, calculated as the total number of upvotes minus the total number of downvotes
- `num_comments`: The number of comments that were made on the story
- `author`: The username of the person who submitted the story
- `created_at`: The date and time at which the story was submitted

In [1]:
import numpy as np
import pandas as pd
import re
# import seaborn as sns
# import matplotlib.pyplot as plt
# %matplotlib inline

In [2]:
hn = pd.read_csv("hacker_news.csv")
titles = hn['title']

### Instructions

We have already imported pandas and re, read the CSV and extracted the title column.

1. Create a case insensitive regex pattern that matches all case variations of the letters `SQL`.
2. Use that regex pattern and the ignorecase flag to count the number of mentions of `SQL` in titles. Assign the result to `sql_counts`.

In [3]:
pattern = r"sql"
sql_counts = titles.str.contains(pattern, flags=re.I).sum()
print(sql_counts)

108
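
To see what the flag is doing, here is a small sketch on a made-up Series (the `sample_titles` values are illustrative and are not taken from the dataset). It compares the `re.I` flag with an explicit pattern built from sets, which should give the same count:

```python
import re
import pandas as pd

# Hypothetical titles, not taken from hacker_news.csv
sample_titles = pd.Series([
    "MySQL 8 released",
    "Why I moved from nosql back to Postgres",
    "A Sql injection primer",
    "Rust for beginners",
])

# Option 1: lowercase pattern plus the ignorecase flag
with_flag = sample_titles.str.contains(r"sql", flags=re.I).sum()

# Option 2: spell out every case variation with sets
with_sets = sample_titles.str.contains(r"[Ss][Qq][Ll]").sum()

print(with_flag, with_sets)
```

The flag version is usually easier to read and is harder to get wrong as the pattern grows.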


### Instructions

We have created a new dataframe, __`hn_sql`__, including only rows that mention a SQL flavor.

1. Create a new column called __`flavor`__ in the __`hn_sql`__ dataframe, containing extracted mentions of SQL flavors, defined as:
    - Any time 'SQL' is preceded by one or more word characters.
    - Ignoring all case variation.
2. Use the __`Series.str.lower()`__ method to clean the values in the flavor column by converting them to lowercase. Assign the values back to the column in __`hn_sql`__.
3. Use the __`DataFrame.pivot_table()`__ method to create a pivot table, __`sql_pivot`__.
4. The index of the pivot table should be the __`flavor`__ column.
5. The values of the pivot table should be the mean of the __`num_comments`__ column, aggregated by SQL flavor.

In [4]:
hn_sql = hn[hn['title'].str.contains(r"\w+SQL", flags=re.I)].copy()

In [5]:
pattern = r"(\w+sql)"
flavor = hn_sql.title.str.extract(pattern, flags=re.I)
hn_sql['flavor'] = flavor
hn_sql.flavor = hn_sql.flavor.str.lower()

sql_pivot = hn_sql.pivot_table('num_comments', 'flavor')

In [6]:
sql_pivot

flavor,num_comments
cloudsql,5.0
memsql,14.0
mysql,12.230769
nosql,14.529412
postgresql,25.962963
sparksql,1.0
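
The same extract, lowercase, and pivot steps can be tried on a tiny made-up dataframe where the means are easy to check by hand. This is only a sketch: the `toy` dataframe and its values are assumptions for illustration.

```python
import re
import pandas as pd

# Hypothetical rows standing in for hn_sql
toy = pd.DataFrame({
    "title": ["PostgreSQL tips", "Scaling MySQL", "mysql vs Postgresql", "Intro to NoSQL"],
    "num_comments": [10, 4, 6, 8],
})

# expand=False returns a Series when the pattern has a single capture group
flavor = toy["title"].str.extract(r"(\w+sql)", flags=re.I, expand=False)
toy["flavor"] = flavor.str.lower()

# pivot_table aggregates with the mean by default
toy_pivot = toy.pivot_table(values="num_comments", index="flavor")
print(toy_pivot)
```

Note that `str.extract` keeps only the first match in each title, so the third row is counted as `mysql` rather than `postgresql`.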


### Instructions

1. Write a regular expression pattern which will match `Python` or `python`, followed by a space, followed by one or more digit characters or periods.
    - The regular expression should contain a capture group for the digit and period characters (the Python versions)
2. Extract the Python versions from `title` using the regular expression pattern.
3. Use `Series.value_counts()` and the `dict()` function to create a dictionary frequency table of the extracted Python versions. Assign the result to `py_versions_freq`.

In [7]:
pattern = r"[Pp]ython ([\d\.]+)"

py_versions = titles.str.extract(pattern)
py_versions_freq = dict(py_versions[0].value_counts())

In [8]:
py_versions_freq

{'3': 10,
 '2': 3,
 '3.5': 3,
 '3.6': 2,
 '8': 1,
 '1.5': 1,
 '2.7': 1,
 '4': 1,
 '3.5.0': 1}
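
As a quick check of how the capture group and `value_counts()` fit together, here is a minimal sketch on made-up titles (the `sample` Series is an assumption, not part of the dataset):

```python
import pandas as pd

# Hypothetical titles for illustration
sample = pd.Series([
    "Python 3.5 released",
    "Porting to python 3",
    "Python 3.5 tips",
    "Monty Python sketches",
])

# Inside a set the period is literal, so escaping it is optional
versions = sample.str.extract(r"[Pp]ython ([\d.]+)", expand=False)

# Titles without a version give NaN, which value_counts() drops
freq = dict(versions.value_counts())
print(freq)
```

One thing to watch for: because the set includes the period, a title such as `Python 3. Now what?` would yield `3.` with the trailing period attached.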

### Instructions

We have provided a commented line of code containing the regular expression we used above.

1. Uncomment the line of code. Add a negative set to the end of the regular expression that excludes:
    - The period character `.`
    - The plus character `+`.
2. Use the `first_10_matches()` function to return the matches for the regular expression you built, assigning the result to `first_ten`.

In [9]:
def first_10_matches(pattern):
    """
    Return the first 10 story titles that match
    the provided regular expression
    """
    all_matches = titles[titles.str.contains(pattern)]
    first_10 = all_matches.head(10)
    return first_10

pattern_z = r"\b[Cc]\b[^+.]"
first_ten = first_10_matches(pattern_z)

--- 
#  Lookarounds `?=` |  `?!`  | `?<=` | `?<!`   
Let's look at the result of the previous exercise:

In [10]:
first_10_matches(pattern_z)

365                      The new C standards are worth it
444           Moz raises $10m Series C from Foundry Group
521          Fuchsia: Micro kernel written in C by Google
1307            Show HN: Yupp, yet another C preprocessor
1326                     The C standard formalized in Coq
1365                          GNU C Library 2.23 released
1429    Cysignals: signal handling (SIGINT, SIGSEGV, )...
1620                        SDCC  Small Device C Compiler
1949    Rewriting a Ruby C Extension in Rust: How a Na...
2195    MyHTML  HTML Parser on Pure C with POSIX Threa...
Name: title, dtype: object

In [11]:
test_cases = ['Red_Green_Blue',
              'Yellow_Green_Red',
              'Red_Green_Red',
              'Yellow_Green_Blue',
              'Green']

We'll also create a function that will loop over our test cases and tell us whether our pattern matches. We'll use the re module rather than pandas since it tells us the exact text that matches, which will help us understand how the lookaround is working:



In [12]:
def run_test_cases(pattern):
    for tc in test_cases:
        result = re.search(pattern, tc)
        print(result or "NO MATCH")

## Positive lookahead  `?=`
---
In each instance, we'll aim to match the substring __`Green`__ depending on the characters that precede or follow it. Let's start by using a __positive lookahead__ to include instances where the match is followed by the substring __`_Blue`__. We'll include the underscore character in the lookahead, otherwise we will get zero matches:

In [13]:
#'Red_Green_Blue'
#'Yellow_Green_Red'
#'Red_Green_Red'
#'Yellow_Green_Blue'
#'Green'

run_test_cases(r"Green(?=_Blue)")

<re.Match object; span=(4, 9), match='Green'>
NO MATCH
NO MATCH
<re.Match object; span=(7, 12), match='Green'>
NO MATCH


## Negative lookahead  `?!`
---
Notice how the matches themselves are purely the text `Green` and don't include the lookahead. Let's look at a negative lookahead to include instances where the match is not followed by the substring `_Red`:



In [14]:
#'Red_Green_Blue'
#'Yellow_Green_Red'
#'Red_Green_Red'
#'Yellow_Green_Blue'
#'Green'

run_test_cases(r"Green(?!_Red)")

<re.Match object; span=(4, 9), match='Green'>
NO MATCH
NO MATCH
<re.Match object; span=(7, 12), match='Green'>
<re.Match object; span=(0, 5), match='Green'>


## Positive lookbehind `?<=`
Next we'll use a __positive lookbehind__ to include instances where the match is preceded by the substring `Red_`:



In [15]:
#'Red_Green_Blue'
#'Yellow_Green_Red'
#'Red_Green_Red'
#'Yellow_Green_Blue'
#'Green'

run_test_cases(r"(?<=Red_)Green")

<re.Match object; span=(4, 9), match='Green'>
NO MATCH
<re.Match object; span=(4, 9), match='Green'>
NO MATCH
NO MATCH


## Negative lookbehind `?<!`
And finally, using a __negative lookbehind__ to include instances where the match isn't preceded by the substring `Yellow_`:



In [16]:
#'Red_Green_Blue'
#'Yellow_Green_Red'
#'Red_Green_Red'
#'Yellow_Green_Blue'
#'Green'

run_test_cases(r"(?<!Yellow_)Green")

<re.Match object; span=(4, 9), match='Green'>
NO MATCH
<re.Match object; span=(4, 9), match='Green'>
NO MATCH
<re.Match object; span=(0, 5), match='Green'>


The contents of a lookaround can include any other regular expression component. For instance, here is an example where we __match only cases that are followed by at least five characters__:

In [17]:
run_test_cases(r"Green(?=.{5})")

<re.Match object; span=(4, 9), match='Green'>
NO MATCH
NO MATCH
<re.Match object; span=(7, 12), match='Green'>
NO MATCH


The second and third test cases are followed by four characters, not five, and the last test case isn't followed by anything.
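
A lookahead such as `(?=.{5})` is satisfied whenever five or more characters follow the match. If the intent is exactly five, one option is to anchor the lookahead with `$`. The sketch below compares the two; the `samples` list and the extra `Green_Yellow` case are our own additions for illustration.

```python
import re

samples = ["Red_Green_Blue", "Yellow_Green_Red", "Green_Yellow"]

at_least_five = r"Green(?=.{5})"    # five or more characters follow
exactly_five = r"Green(?=.{5}$)"    # five characters, then end of string

results = {
    s: (bool(re.search(at_least_five, s)), bool(re.search(exactly_five, s)))
    for s in samples
}
for s, matched in results.items():
    print(s, matched)
```

`Green_Yellow` is followed by seven characters, so it passes the first pattern but not the anchored one.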

Not every programming language or tool supports all lookarounds (__notably, lookbehinds were only added to the JavaScript specification in ECMAScript 2018, so older JavaScript engines lack them__). As an example, to get full support in the RegExr tool, you'll need to set it to use the PCRE regex engine.
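
Python's own `re` module is a case in point: it supports lookbehinds, but only when the lookbehind has a fixed width. The following sketch shows a fixed-width lookbehind working and a variable-width one being rejected at compile time:

```python
import re

# Fixed-width lookbehind: accepted by the re module
fixed = re.search(r"(?<=Red_)Green", "Red_Green_Blue")
print(fixed)

# Variable-width lookbehind (\w+ can be any length): re refuses to compile it
try:
    re.compile(r"(?<=\w+_)Green")
    compiled = True
except re.error as exc:
    compiled = False
    print("re.error:", exc)
```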

In this exercise, we're going to use lookarounds to refine the regular expression we built on the last screen to capture mentions of the "C" programming language. As a reminder, here is the last of the regular expressions we attempted to use with this exercise earlier, and the resultant titles that match:

In [18]:
first_10_matches(r"\b[Cc]\b[^.+]")

365                      The new C standards are worth it
444           Moz raises $10m Series C from Foundry Group
521          Fuchsia: Micro kernel written in C by Google
1307            Show HN: Yupp, yet another C preprocessor
1326                     The C standard formalized in Coq
1365                          GNU C Library 2.23 released
1429    Cysignals: signal handling (SIGINT, SIGSEGV, )...
1620                        SDCC  Small Device C Compiler
1949    Rewriting a Ruby C Extension in Rust: How a Na...
2195    MyHTML  HTML Parser on Pure C with POSIX Threa...
Name: title, dtype: object
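
One limitation of the negative set is that it must consume a character, so a `C` at the very end of a title can never match. A negative lookahead only checks what follows and consumes nothing. Here is a small sketch of the difference, using made-up strings (the `samples` list is an assumption):

```python
import re

samples = ["Programming in C", "C++ tips", "The C standard"]

negative_set = r"\b[Cc]\b[^.+]"          # needs one more character after C
negative_lookahead = r"\b[Cc]\b(?![.+])"  # only checks what comes next

results = {
    s: (bool(re.search(negative_set, s)), bool(re.search(negative_lookahead, s)))
    for s in samples
}
for s, matched in results.items():
    print(s, matched)
```

Only the lookahead version matches `Programming in C`, while both versions still reject `C++ tips`.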

### Instructions

1. Write a regular expression and assign it to `pattern`. 
    The regular expression should:
    - Match instances of `C` or `c` where they are not preceded or followed by another letter.
    - Exclude instances where the match is followed by a `.` or `+` character, without removing instances where the match occurs at the end of the string.
    - Exclude instances where the word `'Series'` immediately precedes the match.
2. Count how many stories in `titles` match the regular expression. Assign the result to `c_mentions`.

In [19]:
pattern = r"(?<!Series\s)\b[Cc]\b(?![\+\.])"
c_mentions = titles.str.contains(pattern).sum()

In [20]:
first_10_matches(pattern)

365                      The new C standards are worth it
521          Fuchsia: Micro kernel written in C by Google
1307            Show HN: Yupp, yet another C preprocessor
1326                     The C standard formalized in Coq
1365                          GNU C Library 2.23 released
1429    Cysignals: signal handling (SIGINT, SIGSEGV, )...
1620                        SDCC  Small Device C Compiler
1949    Rewriting a Ruby C Extension in Rust: How a Na...
2195    MyHTML  HTML Parser on Pure C with POSIX Threa...
2589    Phalcon  PHP framework delivered as a C extension
Name: title, dtype: object

In [21]:
c_mentions = titles.str.contains(pattern).sum()

In [22]:
c_mentions

102

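A backreference such as `\1` refers to the text matched by the first capture group, so the pattern `(\w)\1` matches any word character that is immediately followed by the same character:

In [23]: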
test_cases = [
              "I'm going to read a book.",
              "Green is my favorite color.",
              "My name is Aaron.",
              "No doubles here.",
              "I have a pet eel."
             ]

for tc in test_cases:
    print(re.search(r"(\w)\1", tc))

<re.Match object; span=(21, 23), match='oo'>
<re.Match object; span=(2, 4), match='ee'>
None
None
<re.Match object; span=(13, 15), match='ee'>


Notice that there was no match for the word __`Aaron`__, despite it containing a double __`"a"`__. This is because the uppercase and lowercase __`"a"`__ are two different characters, so the backreference does not match.
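
If a case-insensitive comparison is wanted, passing the ignorecase flag makes the backreference compare without regard to case as well. A minimal sketch:

```python
import re

name = "My name is Aaron."

# Default: "A" and "a" are different characters, so nothing matches
case_sensitive = re.search(r"(\w)\1", name)

# With re.I the backreference also ignores case
case_insensitive = re.search(r"(\w)\1", name, flags=re.I)

print(case_sensitive)
print(case_insensitive)
```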

We can easily achieve the same thing using pandas:

In [24]:
test_cases = pd.Series(test_cases)
print(test_cases.str.contains(r"(\w)\1"))

0     True
1     True
2    False
3    False
4     True
dtype: bool



Let's use this technique to identify story titles that have repeated words.

### Instructions

1. Write a regular expression to match cases of __repeated words__:
    - We'll define a word as a series of one or more word characters that are preceded and followed by a boundary anchor.
    - We'll define repeated words as the same word repeated twice, separated by a whitespace character.
2. Select only the items in `titles` that match the regular expression. Assign the result to `repeated_words`.

In [25]:
pattern_dw = r"\b(\w+)\s\1\b"
repeated_words = titles[titles.str.contains(pattern_dw)]
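
The boundary anchors matter more than they might appear. The sketch below, on made-up strings, compares the pattern with and without them; the `samples` list is an assumption for illustration.

```python
import re

samples = ["Silicon Valley Has a a Problem", "This is fine", "that thatch roof"]

with_boundaries = r"\b(\w+)\s\1\b"
without_boundaries = r"(\w+)\s\1"

for s in samples:
    strict = re.search(with_boundaries, s)
    loose = re.search(without_boundaries, s)
    print(s, "|", strict and strict.group(), "|", loose and loose.group())
```

Without the anchors, `This is fine` is reported as a repeat because the `is` at the end of `This` is followed by `is`, and `that thatch` is reported because `thatch` starts with `that`.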

### Instructions

We have provided `email_variations`, a pandas Series containing all the variations of "email" in the dataset.
1. Use a regular expression to replace each of the matches in `email_variations` with `"email"` and assign the result to `email_uniform`.
    - You may need to iterate several times when writing your regular expression in order to match every item.
2. Use the same syntax to replace all mentions of email in `titles` with `"email"`. Assign the result to `titles_clean`.

In [26]:
email_variations = pd.Series(['email', 'Email', 'e Mail',
                        'e mail', 'E-mail', 'e-mail',
                        'eMail', 'E-Mail', 'EMAIL'])

In [27]:
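# regex=True is passed explicitly because pandas 2.0 changed the default to regex=False
email_uniform = email_variations.str.replace(r"e-?\s?mail", "email", flags=re.I, regex=True)

In [28]:
titles_clean = titles.str.replace(r"e-?\s?mail", "email", flags=re.I, regex=True)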
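
When a substitution pattern is applied to free text rather than a curated list of variations, it can match inside other words. The sketch below uses `re.sub` on a made-up title to show one such case, along with a possible refinement that adds a leading word boundary. The `anchored` pattern is our suggestion, not part of the original exercise.

```python
import re

loose = r"e-?\s?mail"
anchored = r"\be-?\s?mail"   # hypothetical refinement: require a word boundary first

title = "Get free mail forwarding for your E-Mail account"

loose_result = re.sub(loose, "email", title, flags=re.I)
anchored_result = re.sub(anchored, "email", title, flags=re.I)

print(loose_result)
print(anchored_result)
```

The loose pattern turns `free mail` into `freemail`, because the final `e` of `free` followed by a space and `mail` looks like `e mail`. The anchored pattern leaves it alone and still normalizes `E-Mail`.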

### Instructions

1. Write a regular expression to extract the domains from `test_urls` and assign the result to `test_urls_clean`. We suggest the following technique:
    - Using a series of characters that will match the protocol.
    - Inside a capture group, using a set that will match the character classes used in the domain.
    - Because all of the URLs either end with the domain, or continue with a page path which starts with / (a character not found in any domains), we don't need to cater for this part of the URL in our regular expression.
2. Use the same regular expression to extract the domains from the `url` column of the `hn` dataframe. Assign the result to `domains`.
3. Use `Series.value_counts()` to build a frequency table of the domains in `domains`, limiting the frequency table to just the top 20. Assign the result to `top_domains`.

In [29]:
test_urls = pd.Series([
 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
 'http://www.interactivedynamicvideo.com/',
 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
 'http://evonomics.com/advertising-cannot-maintain-internet-heres-solution/',
 'HTTPS://github.com/keppel/pinn',
 'Http://phys.org/news/2015-09-scale-solar-youve.html',
 'https://iot.seeed.cc',
 'http://www.bfilipek.com/2016/04/custom-deleters-for-c-smart-pointers.html',
 'http://beta.crowdfireapp.com/?beta=agnipath',
 'https://www.valid.ly?param'
])

In [30]:
# The set includes "-" because domain names can contain hyphens
pattern_d = r"https?://([\w.\-]+)"
test_urls_clean = test_urls.str.extract(pattern_d, flags=re.I)
domains = hn['url'].str.extract(pattern_d, flags=re.I)

In [31]:
top_domains = domains[0].value_counts().head(20)
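
Domain names can contain hyphens, which neither `\w` nor `.` covers. The following sketch compares a set without and with the hyphen on a few URLs; the last URL is a made-up example added for illustration.

```python
import re
import pandas as pd

urls = pd.Series([
    "https://www.amazon.com/dp/0073523429",
    "HTTPS://github.com/keppel/pinn",
    "https://www.valid.ly?param",
    "http://my-blog.example.com/post",   # hypothetical URL with a hyphen
])

without_hyphen = urls.str.extract(r"https?://([\w.]+)", flags=re.I, expand=False)
with_hyphen = urls.str.extract(r"https?://([\w.\-]+)", flags=re.I, expand=False)

print(pd.DataFrame({"without_hyphen": without_hyphen, "with_hyphen": with_hyphen}))
```

Without the hyphen in the set, the last domain is cut off at `my`, which would distort a frequency table built from the result.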

### Instructions

1. Write a regular expression that extracts URL components using three capture groups:
    - The first capture group should include the protocol text, up to but not including ://.
    - The second group should contain the domain, from after :// up to but not including /.
    - The third group should contain the page path, from after / to the end of the string.
2. Use the regular expression pattern to extract the URL components from the test_urls series. Assign the results to `test_url_parts`.
3. Use the regular expression pattern to extract the URL components from the url column of the hn dataframe. Assign the results to `url_parts`.


In [32]:
test_urls = pd.Series([
 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
 'http://www.interactivedynamicvideo.com/',
 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
 'http://evonomics.com/advertising-cannot-maintain-internet-heres-solution/',
 'HTTPS://github.com/keppel/pinn',
 'Http://phys.org/news/2015-09-scale-solar-youve.html',
 'https://iot.seeed.cc',
 'http://www.bfilipek.com/2016/04/custom-deleters-for-c-smart-pointers.html',
 'http://beta.crowdfireapp.com/?beta=agnipath',
 'https://www.valid.ly?param'
])

In [33]:
# `test_urls` is available from the previous screen
pattern = r"(.+)://([\w.\-]+)/?(.*)"

test_url_parts = test_urls.str.extract(pattern)
url_parts = hn.url.str.extract(pattern)
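
To inspect how the three groups divide a single URL, the `re` module can be used directly. This is a sketch that follows the three-group structure described above; the exact character set used for the domain is our choice.

```python
import re

pattern = r"(.+)://([\w.\-]+)/?(.*)"

urls = [
    "http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0",
    "https://iot.seeed.cc",
    "https://www.valid.ly?param",
]

parts = [re.search(pattern, url).groups() for url in urls]
for p in parts:
    print(p)
```

Because the slash is optional and the path group uses `*`, a URL with no path still matches and simply gets an empty string for the third group.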

### Instructions

We have provided the regex pattern from the previous screen's solution.

1. Uncomment the regular expression pattern. Add names to each capture group:
    - The first capture group should be called `protocol`.
    - The second capture group should be called `domain`.
    - The third capture group should be called `path`.
2. Use the regular expression pattern to extract three named columns of url components from the `url` column of the `hn` dataframe. Assign the result to `url_parts`.

In [34]:
pattern = r"(?P<protocol>.+)://(?P<domain>[\w.\-]+)/?(?P<path>.*)"

test_url_parts = test_urls.str.extract(pattern)
url_parts = hn.url.str.extract(pattern)

In [35]:
url_parts.head()

,protocol,domain,path
0,http,www.interactivedynamicvideo.com,
1,http,www.thewire.com,entertainment/2013/04/florida-djs-april-fools-...
2,https,www.amazon.com,Technology-Ventures-Enterprise-Thomas-Byers/dp...
3,http,www.nytimes.com,2007/11/07/movies/07stein.html?_r=0
4,http,arstechnica.com,business/2015/10/comcast-and-other-isps-boost-...
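
Named groups become column names in the extracted dataframe, which makes follow-up steps easier to read. Here is a minimal sketch on two of the test URLs (the `parts` and `domain_counts` names are ours):

```python
import pandas as pd

urls = pd.Series([
    "https://iot.seeed.cc",
    "http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0",
])

pattern = r"(?P<protocol>.+)://(?P<domain>[\w.\-]+)/?(?P<path>.*)"
parts = urls.str.extract(pattern)

# Columns are named after the capture groups
print(list(parts.columns))

# Columns can then be selected by name instead of by position
domain_counts = parts["domain"].value_counts()
print(domain_counts)
```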
