Let's start by reading our Hacker News dataset into a pandas dataframe.

In [4]:
# Import the pandas library.
import pandas as pd

# Read the hacker_news.csv file into a pandas dataframe
hn = pd.read_csv("hacker_news.csv")

hn.head(10)

Unnamed: 0,id,title,url,num_points,num_comments,author,created_at
0,12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52
1,11964716,Florida DJs May Face Felony for April Fools' W...,http://www.thewire.com/entertainment/2013/04/f...,2,1,vezycash,6/23/2016 22:20
2,11919867,Technology ventures: From Idea to Enterprise,https://www.amazon.com/Technology-Ventures-Ent...,3,1,hswarna,6/17/2016 0:01
3,10301696,Note by Note: The Making of Steinway L1037 (2007),http://www.nytimes.com/2007/11/07/movies/07ste...,8,2,walterbell,9/30/2015 4:12
4,10482257,Title II kills investment? Comcast and other I...,http://arstechnica.com/business/2015/10/comcas...,53,22,Deinos,10/31/2015 9:48
5,10557283,Nuts and Bolts Business Advice,,3,4,shomberj,11/13/2015 0:45
6,12296411,Ask HN: How to improve my personal website?,,2,6,ahmedbaracat,8/16/2016 9:55
7,11337617,"Shims, Jigs and Other Woodworking Concepts to ...",http://firstround.com/review/shims-jigs-and-ot...,34,7,zt,3/22/2016 16:18
8,10379326,That self-appendectomy,http://www.southpolestation.com/trivia/igy1/ap...,91,10,jimsojim,10/13/2015 9:30
9,11370829,Crate raises $4M seed round for its next-gen S...,http://techcrunch.com/2016/03/15/crate-raises-...,3,1,hitekker,3/27/2016 18:08


In [9]:
hn.describe()

Unnamed: 0,id,num_points,num_comments
count,20099.0,20099.0,20099.0
mean,11317550.0,50.296632,24.803025
std,696453.1,107.110322,56.108639
min,10176910.0,1.0,1.0
25%,10701720.0,3.0,1.0
50%,11284520.0,9.0,3.0
75%,11926130.0,54.0,21.0
max,12578980.0,2553.0,1733.0


Next, we'll do some basic pattern searching with Python's regular expression module (re) to find out how many story titles in our Hacker News dataset mention Python. We'll use a character set, [Pp], to match both Python with a capital 'P' and python with a lowercase 'p'.

In [13]:
import re
# extract a list, titles, containing all the titles from our dataset
titles = hn["title"].tolist()
# see if the data in the list look reasonable
titles[0:10]

['Interactive Dynamic Video',
 "Florida DJs May Face Felony for April Fools' Water Joke",
 'Technology ventures: From Idea to Enterprise',
 'Note by Note: The Making of Steinway L1037 (2007)',
 'Title II kills investment? Comcast and other ISPs are now spending more',
 'Nuts and Bolts Business Advice',
 'Ask HN: How to improve my personal website?',
 'Shims, Jigs and Other Woodworking Concepts to Conquer Technical Debt',
 'That self-appendectomy',
 'Crate raises $4M seed round for its next-gen SQL database']

In [23]:
python_mentions = 0
pattern = "[Pp]ython"
for title in titles:
    if re.search(pattern, title):
        python_mentions += 1
print("Python was mentioned {0} times out of {1} titles in our dataset".format(python_mentions, len(titles)))

Python was mentioned 160 times out of 20099 titles in our dataset
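
To see what the character set does and does not match, here is a minimal sketch on a few made-up titles (they are hypothetical and not taken from the dataset). The helper count_matches is our own illustration. Note that [Pp]ython also matches inside longer words and misses all-caps spellings, so the count depends on the pattern we choose.

```python
import re

# Hypothetical sample titles, not taken from the Hacker News dataset.
sample_titles = [
    "Python 3.6 released",
    "Why I love python",
    "PYTHON in all caps",
    "Monty Pythons Flying Circus",
    "Learning Ruby",
]


def count_matches(pattern, titles, flags=0):
    """Count the titles in which the pattern matches anywhere."""
    return sum(1 for title in titles if re.search(pattern, title, flags))


# Character set: matches "Python" or "python", including inside "Pythons".
case_set = count_matches(r"[Pp]ython", sample_titles)

# Ignoring case entirely also picks up "PYTHON".
ignore_case = count_matches(r"python", sample_titles, re.IGNORECASE)

# Word boundaries (\b) restrict the match to the whole word.
whole_word = count_matches(r"\b[Pp]ython\b", sample_titles)

print(case_set, ignore_case, whole_word)
```

Which variant is appropriate depends on the question being asked; the notebook keeps the simpler [Pp]ython pattern.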


Let's replicate this using pandas' vectorized string methods, which make the code more concise and can be faster than an explicit loop.

In [32]:
pattern = '[Pp]ython'
titles = hn["title"]
python_mentions = titles.str.contains(pattern).sum()
print("Python was mentioned {0} times out of {1} titles in our dataset".format(python_mentions,len(titles)))

Python was mentioned 160 times out of 20099 titles in our dataset
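
One detail worth checking when moving from the loop to Series.str.contains is how missing values are handled. The sketch below uses a small made-up Series (an assumption for illustration, since our dataset's title column may have no missing values) and passes na=False so that missing titles count as non-matches.

```python
import re

import pandas as pd

# Hypothetical titles, including one missing value.
sample = pd.Series(["Python tips", "python vs Ruby", None, "Go 1.7 released"])

pattern = r"[Pp]ython"

# Vectorized: na=False treats missing titles as "no match".
mask = sample.str.contains(pattern, na=False)
vectorized_count = int(mask.sum())

# Loop equivalent: skip anything that is not a string before searching.
loop_count = sum(
    1
    for title in sample
    if isinstance(title, str) and re.search(pattern, title)
)

print(vectorized_count, loop_count)
```

Both approaches give the same count here; the vectorized version simply expresses it in one line.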


Let's continue exploring the dataset and select all titles that mention the programming language Ruby, again using a character set to match the word whether or not it is capitalized.

In [28]:
titles = hn['title']
ruby_titles = titles[titles.str.contains("[Rr]uby")]
ruby_titles.head(10)

190                    Ruby on Google AppEngine Goes Beta
484          Related: Pure Ruby Relational Algebra Engine
1388    Show HN: HTTPalooza  Ruby's greatest HTTP clie...
1949    Rewriting a Ruby C Extension in Rust: How a Na...
2022    Show HN: CrashBreak  Reproduce exceptions as f...
2163                  Ruby 2.3 Is Only 4% Faster than 2.2
2306    Websocket Shootout: Clojure, C++, Elixir, Go, ...
2620                      Why Startups Use Ruby on Rails?
2645    Ask HN: Should I continue working a Ruby gem f...
3290    Ruby on Rails and the importance of being stup...
Name: title, dtype: object

In [31]:
print("Ruby was mentioned {0} times".format(ruby_titles.size))

Ruby was mentioned 48 times


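Next, let's look for titles that mention email. The pattern e-?mail makes the hyphen optional, so it matches both email and e-mail.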

In [34]:
pattern = "e-?mail"
email_bool = titles.str.contains(pattern)
email_count = email_bool.sum()
email_titles = titles[email_bool]
email_titles.head(10)

119     Show HN: Send an email from your shell to your...
313         Disposable emails for safe spam free shopping
1361    Ask HN: Doing cold emails? helps us prove this...
1750    Protect yourself from spam, bots and phishing ...
2421                   Ashley Madison hack treating email
2685         Ask HN: Weather forecast in your email daily
3379           Ask HN: How do we solve the email problem?
3865    What Mailchimp does to make sure emails get de...
3889    Show HN: Do you know what emails your competit...
3921     Im killing most of my email capture. Here's why.
Name: title, dtype: object
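
As a quick check of the e-?mail pattern, the sketch below tries it on a few made-up strings (hypothetical, not from the dataset). The ? quantifier means "zero or one of the preceding character", and the pattern is case-sensitive, so a capitalized "Email" is not matched unless we pass a flag such as re.IGNORECASE.

```python
import re

pattern = r"e-?mail"

# Hypothetical candidate strings to probe the pattern.
candidates = ["email", "e-mail", "Email", "e--mail", "e mail"]

# Map each candidate to whether the pattern matches anywhere in it.
matches = {text: bool(re.search(pattern, text)) for text in candidates}

# With re.IGNORECASE the capitalized spelling matches as well.
email_capitalized = bool(re.search(pattern, "Email", re.IGNORECASE))

for text, matched in matches.items():
    print(repr(text), matched)
print("Email with IGNORECASE:", email_capitalized)
```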

In [42]:
print("email was mentioned {0} times".format(len(email_titles.index)))

email was mentioned 86 times


In [45]:
email_titles.count()

86
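
We have now counted the matches three ways: summing the boolean mask, taking the length of the filtered index, and calling count(). A small sketch on made-up data (an assumption for illustration) shows that they agree on the filtered Series, and where they can differ on a Series that still contains missing values.

```python
import pandas as pd

# Hypothetical titles, including one missing value.
sample = pd.Series(
    ["Show HN: email client", "E-mail is dead", None, "e-mail tips"]
)

mask = sample.str.contains(r"e-?mail", na=False)
selected = sample[mask]

# All of these count the matching titles.
by_sum = int(mask.sum())          # True values in the mask
by_len = len(selected.index)      # rows in the filtered Series
by_size = int(selected.size)      # same as len() for a Series
by_count = int(selected.count())  # non-missing values only

# On the unfiltered Series, size includes the missing title but count() does not.
total_size = int(sample.size)
total_count = int(sample.count())

print(by_sum, by_len, by_size, by_count, total_size, total_count)
```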