--- 
# Regular expressions

The dataset we will be working with is based on a CSV of Hacker News stories from September 2015 to September 2016. The columns in the dataset are explained below:

- __`id`__: The unique identifier from Hacker News for the story
- __`title`__: The title of the story
- __`url`__: The URL that the story links to, if the story has a URL
- __`num_points`__: The number of points the story acquired, calculated as the total number of upvotes minus the total number of downvotes
- __`num_comments`__: The number of comments that were made on the story
- __`author`__: The username of the person who submitted the story
- __`created_at`__: The date and time at which the story was submitted

For teaching purposes, we have reduced the dataset from the almost 300,000 rows in its original form to approximately 20,000 rows by removing all submissions that did not receive any comments, and then randomly sampling from the remaining submissions. You can download the modified dataset using the dataset preview tool.

## Concepts:
- `re.search()`
- `Series.str.contains()` 
- `{}` quantifier and `?`
- `[tags]`
- `Series.str.extract()`
- complex expressions such as `[^Ss]` (see [RegExr](https://regexr.com/))
- flags

In [1]:
# import numpy as np
import pandas as pd
import re 
# import seaborn as sns
# import matplotlib.pyplot as plt
# %matplotlib inline

### Instructions
1. Import the pandas library.
2. Read the __`hacker_news.csv`__ file into a pandas dataframe. Assign the result to __`hn`__.
3. After you have completed the code exercise, use the variable inspector to familiarize yourself with the dataset.

In [2]:
hn = pd.read_csv("hacker_news.csv")

In [3]:
hn.head()

Unnamed: 0,id,title,url,num_points,num_comments,author,created_at
0,12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52
1,11964716,Florida DJs May Face Felony for April Fools' W...,http://www.thewire.com/entertainment/2013/04/f...,2,1,vezycash,6/23/2016 22:20
2,11919867,Technology ventures: From Idea to Enterprise,https://www.amazon.com/Technology-Ventures-Ent...,3,1,hswarna,6/17/2016 0:01
3,10301696,Note by Note: The Making of Steinway L1037 (2007),http://www.nytimes.com/2007/11/07/movies/07ste...,8,2,walterbell,9/30/2015 4:12
4,10482257,Title II kills investment? Comcast and other I...,http://arstechnica.com/business/2015/10/comcas...,53,22,Deinos,10/31/2015 9:48


### Instructions

We have provided code to import the __`re`__ module and extract a __list, `titles`__, containing all the titles from our dataset.

1. Initialize a variable __`python_mentions`__ with the integer value __`0`__.
2. Create a string, __`pattern`__, containing a regular expression pattern that uses a set to match __`Python`__ or __`python`__.
3. Use a loop to iterate over each item in the __`titles`__ list, and for each item:
    - Use the __`re.search()`__ function to check whether the pattern matches the title.
    - If __`re.search()`__ returns a match object, increment (add 1 to) the __`python_mentions`__ variable.

In [4]:
titles = hn["title"].tolist()

python_mentions = 0
pattern_py = r'[Pp]ython'

# Count the titles in which the pattern matches anywhere
for title in titles:
    if re.search(pattern_py, title):
        python_mentions += 1

In [5]:
python_mentions

160
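
The loop above relies on the fact that `re.search()` returns a match object when the pattern is found and `None` otherwise. A minimal sketch of that behavior, using a few made-up titles (`sample_titles` is hypothetical and not taken from the dataset):

```python
import re

# Hypothetical sample titles, not taken from the dataset
sample_titles = [
    "Python 3.5 released",
    "Why I switched to python",
    "Ruby on Rails tips",
]

pattern = r"[Pp]ython"

for title in sample_titles:
    match = re.search(pattern, title)
    # A match object is truthy; None is falsy
    print(title, "->", match.group() if match else None)

# The same count as the explicit loop, written as a generator expression
mentions = sum(1 for title in sample_titles if re.search(pattern, title))
print(mentions)  # 2
```

The set `[Pp]` matches exactly one character, so the pattern accepts `Python` and `python` but nothing else in that position.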

### Instructions

We have provided the __regex pattern__ from the solution to the previous screen.

1. Assign the __`title`__ column from the __`hn`__ dataframe to the variable __`titles`__.
2. Use __`Series.str.contains()`__ and __`Series.sum()`__ with the provided regex pattern to count how many Hacker News titles contain __`Python`__ or __`python`__. Assign the result to __`python_mentions`__.

In [6]:
pattern = '[Pp]ython'
titles = hn.title
python_mentions = titles.str.contains(pattern).sum()

In [7]:
python_mentions

160
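
Summing works because `Series.str.contains()` returns a boolean series, and `True` counts as 1 when summed. The sketch below shows this on a small made-up series (`sample` is hypothetical). It also assumes the data may contain missing values, which is why it passes `na=False`; the exercise itself does not need this.

```python
import pandas as pd

# Hypothetical sample, including one missing title
sample = pd.Series(["Python tips", "Learning python", "Ruby news", None])

# na=False treats missing titles as non-matches, so the mask stays boolean
mask = sample.str.contains(r"[Pp]ython", na=False)
count = mask.sum()

print(mask.tolist())  # [True, True, False, False]
print(count)          # 2
```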

### Instructions

1. Use __`Series.str.contains()`__ to create a series of the values from titles that contain __`Ruby`__ or __`ruby`__. Assign the result to __`ruby_titles`__.

In [8]:
titles = hn.title
ruby_titles = titles[titles.str.contains("[Rr]uby")]
ruby_titles.head(3)

190                    Ruby on Google AppEngine Goes Beta
484          Related: Pure Ruby Relational Algebra Engine
1388    Show HN: HTTPalooza  Ruby's greatest HTTP clie...
Name: title, dtype: object

### Instructions

1. Use a regular expression and __`Series.str.contains()`__ to create a boolean mask that matches items from __`titles`__ containing __`email`__ or __`e-mail`__. Assign the result to __`email_bool`__.
2. Use __`email_bool`__ to count the number of titles that matched the regular expression. Assign the result to __`email_count`__.
3. Use __`email_bool`__ to select only the items from titles that matched the regular expression. Assign the result to __`email_titles`__.

In [9]:
titles = hn.title

email_bool = titles.str.contains("e-?mail")
email_count = email_bool.sum()
email_titles = titles[email_bool]
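
The `?` quantifier makes the preceding character optional (zero or one occurrence). A small sketch of what `e-?mail` does and does not accept, using made-up strings rather than titles from the dataset:

```python
import re

pattern = r"e-?mail"

# Hypothetical test strings
samples = ["email", "e-mail", "e--mail", "e mail", "Email"]

# Map each string to whether the pattern matches somewhere in it
results = {s: bool(re.search(pattern, s)) for s in samples}
print(results)
# {'email': True, 'e-mail': True, 'e--mail': False, 'e mail': False, 'Email': False}
```

Note that the pattern is case-sensitive and allows at most one hyphen, so `Email` and `e mail` are not counted here.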

### Instructions

1. Write a regular expression, assigning it as a string to the variable __`pattern`__. The regular expression should match, in order:
    - A single open bracket character.
    - One or more word characters.
    - A single close bracket character.
2. Use the regular expression to select only items from __`titles`__ that match. Assign the result to the variable __`tag_titles`__.
3. Count how many matching titles there are. Assign the result to __`tag_count`__.

In [10]:
titles = hn.title

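# option 1: filter the titles with a boolean mask, then count the rows
# (raw strings avoid invalid-escape warnings for \[ and \w)
pattern = r"\[\w+\]"
tag_titles = titles[titles.str.contains(pattern)]
tag_count = tag_titles.shape[0]

# option 2: sum the boolean mask directly
# (tag_titles_2 holds the mask, not the matching titles)
pattern = r"\[\w+\]"
tag_titles_2 = titles.str.contains(pattern)
tag_count_2 = tag_titles_2.sum()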

In [11]:
tag_count == tag_count_2

True

In [12]:
tag_titles.head()

66     Analysis of 114 propaganda sources from ISIS, ...
100    Munich Gunman Got Weapon from the Darknet [Ger...
159         File indexing and searching for Plan 9 [pdf]
162    Attack on Kunduz Trauma Centre, Afghanistan  I...
195               [Beta] Speedtest.net  HTML5 Speed Test
Name: title, dtype: object
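
The backslashes before the brackets matter: an unescaped `[` starts a character set instead of matching a literal bracket. A short sketch of the difference, using one of the titles shown above:

```python
import re

title = "File indexing and searching for Plan 9 [pdf]"

# Escaped brackets match a literal "[", one or more word characters, then "]"
escaped = re.findall(r"\[\w+\]", title)

# Unescaped, [\w+] is a set matching ONE character: a word character or "+"
unescaped = re.findall(r"[\w+]", title)

print(escaped)        # ['[pdf]']
print(unescaped[:4])  # ['F', 'i', 'l', 'e']
```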

### Instructions

We have provided a commented line of code with the pattern from the previous exercise.

1. Uncomment the line of code and add parentheses to create a capture group inside the brackets.
2. Use __`Series.str.extract()`__ and __`Series.value_counts()`__ with the modified regex pattern to produce a frequency table of all the tags in the titles series. Assign the frequency table to __`tag_freq`__.

In [13]:
pattern_1 = r"\[(\w+)\]"

tag_matches = titles.str.extract(pattern_1)

tag_freq = tag_matches[0].value_counts()

In [14]:
tag_freq.head()

pdf      276
video    111
2015       3
audio      3
2014       2
Name: 0, dtype: int64
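
With a single capture group, `Series.str.extract()` returns a dataframe by default, which is why the code above selects column `0`. As a sketch of an alternative, passing `expand=False` returns a series directly. The titles below are made up for illustration:

```python
import pandas as pd

# Hypothetical sample titles
sample = pd.Series([
    "Plan 9 indexing [pdf]",
    "A talk on regex [video]",
    "No tag here",
    "Another paper [pdf]",
])

# expand=False returns a Series when the pattern has one capture group;
# titles without a tag become NaN
tags = sample.str.extract(r"\[(\w+)\]", expand=False)

# value_counts() ignores NaN by default
tag_freq = tags.value_counts()
print(tag_freq["pdf"], tag_freq["video"])  # 2 1
```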

### Instructions

1. Write a regular expression that will match titles containing Java.
    - You might like to use the __`first_10_matches()`__ function or a site like __`RegExr`__ to build your regular expression.
    - The regex should match whether or not the first character is capitalized.
    - The regex shouldn't match where 'Java' is followed by the letter __`'S'`__ or __`'s'`__.
2. Select every row from titles that matches the regular expression. Assign the result to __`java_titles`__.

In [15]:
def all_matches(pattern):
    """
    Return every story title that matches
    the provided regular expression
    """
    matches = titles[titles.str.contains(pattern)]
    return matches

java_titles = all_matches(r"[Jj]ava[^Ss]")

In [16]:
java_titles.head(3)

436     Unikernel Power Comes to Java, Node.js, Go, an...
811     Ask HN: Are there any projects or compilers wh...
1840                    Adopting RxJava on the Airbnb App
Name: title, dtype: object
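
One caveat with the negative set `[^Ss]`: it must match an actual character, so a title that ends in `Java` is not matched. The sketch below compares it with a negative lookahead, `(?![Ss])`, which only checks what follows without consuming a character. The lookahead is offered as an alternative, not as the exercise's intended solution, and the sample titles are made up:

```python
import re

# Hypothetical sample titles
samples = [
    "Java 9 released",
    "JavaScript fatigue",
    "Learning Java",
    "RxJava on Android",
]

negative_set = r"[Jj]ava[^Ss]"    # requires some character after "Java"
lookahead = r"[Jj]ava(?![Ss])"    # only forbids "S"/"s" after "Java"

set_hits = [s for s in samples if re.search(negative_set, s)]
look_hits = [s for s in samples if re.search(lookahead, s)]

print(set_hits)   # ['Java 9 released', 'RxJava on Android']
print(look_hits)  # ['Java 9 released', 'Learning Java', 'RxJava on Android']
```

Both versions still match `RxJava`, since neither checks what comes before the `J`.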

### Instructions

1. Write a regular expression that will match titles containing Java.
    - You might like to use the __`first_10_matches()`__ function or a site like RegExr to build your regular expression.
2. The regex should match whether or not the first character is capitalized.
3. The regex should match only where __`'Java'`__ is preceded and followed by a word boundary.
4. Select from titles only the items that match the regular expression. Assign the result to __`java_titles`__.

In [17]:
def all_matches(pattern):
    """
    Return every story title that matches
    the provided regular expression
    """
    matches = titles[titles.str.contains(pattern)]
    return matches

pattern_j = r"\b[Jj]ava\b"

java_titles = all_matches(pattern_j)
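
The word boundary anchor `\b` matches the position between a word character and a non-word character (or the start or end of the string). A brief sketch of its effect on a few made-up titles:

```python
import re

pattern = r"\b[Jj]ava\b"

# Hypothetical sample titles
samples = [
    "Java 9 released",
    "JavaScript fatigue",
    "Learning Java",
    "RxJava on Android",
    "java-based tools",
]

hits = [s for s in samples if re.search(pattern, s)]
print(hits)
# ['Java 9 released', 'Learning Java', 'java-based tools']
```

`JavaScript` and `RxJava` are excluded because a word character sits directly next to `Java`, while a hyphen or the end of the string counts as a boundary.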

### Instructions

1. Count the number of times that a tag (e.g. __`[pdf]`__ or __`[video]`__) occurs at the start of a title in titles. Assign the result to __`beginning_count`__.
2. Count the number of times that a tag (e.g. __`[pdf]`__ or __`[video]`__) occurs at the end of a title in titles. Assign the result to __`ending_count`__.

In [18]:
beginning_count = titles.str.contains(r"^\[\w+\]").sum()
ending_count = titles.str.contains(r"\[\w+\]$").sum()

In [19]:
beginning_count

15

In [20]:
ending_count

417
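
The `^` and `$` anchors tie the pattern to the start and end of each string. A small sketch on a made-up series shows that a tag in the middle of a title matches neither pattern, while a title consisting only of a tag matches both:

```python
import pandas as pd

# Hypothetical sample titles
sample = pd.Series([
    "[Beta] Speedtest.net",
    "Plan 9 indexing [pdf]",
    "A [video] in the middle",
    "[pdf]",
])

at_start = sample.str.contains(r"^\[\w+\]")
at_end = sample.str.contains(r"\[\w+\]$")

print(at_start.tolist())  # [True, False, False, True]
print(at_end.tolist())    # [False, True, False, True]
```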

### Instructions

1. Write a regular expression that will match all variations of email included in the starter code. Write your regular expression in a way that will be compatible with the ignorecase flag.
    - As you build your regular expression, you might like to use `Series.str.contains()` like we did in the examples earlier in this screen.
2. Once your regular expression matches all the test cases, use it to count the number of mentions of email in titles in the dataset. Assign the result to `email_mentions`.

Reference: https://docs.python.org/3/library/re.html#re.A

In [21]:
email_tests = pd.Series(['email', 'Email', 'e Mail', 'e mail', 'E-mail',
              'e-mail', 'eMail', 'E-Mail', 'EMAIL'])

In [22]:
email_mentions = titles.str.contains(r"e-?\s?mail", flags=re.I).sum()

In [23]:
email_mentions

151

In [24]:
email_tests.str.contains(r"e-?\s?mail", flags=re.I)

0    True
1    True
2    True
3    True
4    True
5    True
6    True
7    True
8    True
dtype: bool
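
The pattern `e-?\s?mail` allows a hyphen followed by a whitespace character, so it would also accept a string such as `E- mail`. If exactly one optional separator is wanted, a character set is one possible alternative. The sketch below is an assumption about what a stricter pattern could look like (`alt_pattern` is a hypothetical name), not the exercise's solution:

```python
import re
import pandas as pd

email_tests = pd.Series(['email', 'Email', 'e Mail', 'e mail', 'E-mail',
                         'e-mail', 'eMail', 'E-Mail', 'EMAIL'])

# Alternative pattern (an assumption): at most one separator,
# either a hyphen or a whitespace character
alt_pattern = r"e[-\s]?mail"

# re.I (re.IGNORECASE) makes the match case-insensitive
all_match = email_tests.str.contains(alt_pattern, flags=re.I).all()
print(all_match)  # True

# The two patterns differ on input with both a hyphen and a space
original_hit = bool(re.search(r"e-?\s?mail", "E- mail", flags=re.I))
alt_hit = bool(re.search(alt_pattern, "E- mail", flags=re.I))
print(original_hit, alt_hit)  # True False
```

Because the flag handles capitalization, neither pattern needs sets such as `[Ee]` or `[Mm]`.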