# Hacker News Pipeline

In this project, we would be making use of a created data pipeline in Python in an attempt to run a sequence of basic natural language processing tasks to find the top 100 keywords of Hacker News posts in 2014. This would give us an understanding of the most talked about tech topics in 2014. A JSON file from the Hacker News API named `hn_stories_2014.json` is obtained and it contains a single key `stories`, which contains a list of stories *(equivalent to Reddit's posts)*. Each story has a set of keys, the following keys are of interest to us:

* `created_at`: A timestamp of the story's creation time.
* `created_at_i`: A unix epoch timestamp.
* `url`: The URL of the story link.
* `objectID`: The ID of the story.
* `author`: The story's author (username on HN).
* `points`: The number of upvotes the story had.
* `title`: The headline of the post.
* `num_comments`: The number of a comments a post has.

In [1]:
# Importing the Pipeline class created from the pipeline module created and instantiate an instance
from pipeline import Pipeline
pipeline = Pipeline()

# Loading the JSON Data

Since JSON files resemble Python's key-value dictionaries, we would need to parse the JSON file into a Python `dict` object using the `json` module.

In [2]:
# Importing the json module
import json

In [3]:
@pipeline.task()
def file_to_json():
    with open("hn_stories_2014.json") as f:
        data = json.load(f)
    return data["stories"]

# Filtering the Stories

With the stories loaded in a list of dictionary objects, we can start filtering the list of stories to get the most popular stories of the year. We can filter for popular stories by ensuring they are links (not Ask HN posts), have a good number of points, and have some comments. That is, filter popular stories that have more than 50 points, more than 1 comment and titles that do not begin with "Ask HN".

In [4]:
@pipeline.task(depends_on = file_to_json)
def filter_stories(stories):
    def popular(story):
        points = story["points"]
        num_comments = story["num_comments"]
        title = story["title"]
        return points > 50 and num_comments > 1 and not title.startswith("Ask HN")
    return (story for story in stories if popular(story))      

# Convert to CSV

In [5]:
from pipeline import build_csv
import io
from datetime import datetime

@pipeline.task(depends_on = filter_stories)
def json_to_csv(stories):
    lines = []
    for story in stories:
        lines.append((story["objectID"], datetime.strptime(story["created_at"], "%Y-%m-%dT%H:%M:%SZ"),
                     story["url"], story["points"], story["title"]))
    return build_csv(lines, header = ["objectID", "created_at", "url", "points", "title"], file=io.StringIO())

# Extract Title Column

In [6]:
import csv

@pipeline.task(depends_on = json_to_csv)
def extract_titles(file):
    reader = csv.reader(file)
    header = next(reader)
    idx = header.index("title")
    return (story[idx] for story in reader)

# Cleaning Titles

In [7]:
import string

@pipeline.task(depends_on = extract_titles)
def clean_title(titles):
    for title in titles:
        title = title.lower()
        title = ''.join(c for c in title if c not in string.punctuation)
        yield title

# Building the Word Frequency Dictionary

In [8]:
from stop_words import stop_words

# Commonly used words that occur frequently in language like "the", "or" and etc...
stop_words

('a',
 'about',
 'above',
 'above',
 'across',
 'after',
 'afterwards',
 'again',
 'against',
 'all',
 'almost',
 'alone',
 'along',
 'already',
 'also',
 'although',
 'always',
 'am',
 'among',
 'amongst',
 'amoungst',
 'amount',
 'an',
 'and',
 'another',
 'any',
 'anyhow',
 'anyone',
 'anything',
 'anyway',
 'anywhere',
 'are',
 'around',
 'as',
 'at',
 'back',
 'be',
 'became',
 'because',
 'become',
 'becomes',
 'becoming',
 'been',
 'before',
 'beforehand',
 'behind',
 'being',
 'below',
 'beside',
 'besides',
 'between',
 'beyond',
 'bill',
 'both',
 'bottom',
 'but',
 'by',
 'call',
 'can',
 'cannot',
 'cant',
 'co',
 'con',
 'could',
 'couldnt',
 'cry',
 'de',
 'describe',
 'detail',
 'do',
 'done',
 'down',
 'due',
 'during',
 'each',
 'eg',
 'eight',
 'either',
 'eleven',
 'else',
 'elsewhere',
 'empty',
 'enough',
 'etc',
 'even',
 'ever',
 'every',
 'everyone',
 'everything',
 'everywhere',
 'except',
 'few',
 'fifteen',
 'fify',
 'fill',
 'find',
 'fire',
 'first',
 'five

In [9]:
@pipeline.task(depends_on = clean_title)
def build_keyword_dictionary(titles):
    word_freq = {}
    for title in titles:
        for word in title.split(" "):
            if word and word not in stop_words:
                if word not in word_freq:
                    word_freq[word] = 1
                word_freq[word] += 1
    return word_freq

# Sorting Top Words

Finally, we are ready to sort the top words used in all the titles.

In [10]:
@pipeline.task(depends_on = build_keyword_dictionary)
def sort_top_hundred(word_freq):
    list_tuples = [(word, freq) for word, freq in word_freq.items()]
    sorted_list = sorted(list_tuples, key=lambda pair: pair[1], reverse = True)
    return sorted_list[:100]

# Running the Pipeline

The pipeline is now ready and complete. Let's try running and print the output.

In [11]:
ran = pipeline.run()
print(ran[sort_top_hundred])

[('new', 186), ('google', 168), ('bitcoin', 102), ('open', 93), ('programming', 91), ('web', 89), ('data', 86), ('video', 80), ('python', 76), ('code', 73), ('facebook', 72), ('released', 72), ('using', 71), ('2013', 66), ('javascript', 66), ('free', 65), ('source', 65), ('game', 64), ('internet', 63), ('microsoft', 60), ('c', 60), ('linux', 59), ('app', 58), ('pdf', 56), ('work', 55), ('language', 55), ('software', 53), ('2014', 53), ('startup', 52), ('apple', 51), ('use', 51), ('make', 51), ('time', 49), ('yc', 49), ('security', 49), ('nsa', 46), ('github', 46), ('windows', 45), ('world', 42), ('way', 42), ('like', 42), ('1', 41), ('project', 41), ('computer', 41), ('heartbleed', 41), ('git', 38), ('users', 38), ('dont', 38), ('design', 38), ('ios', 38), ('developer', 37), ('os', 37), ('twitter', 37), ('ceo', 37), ('vs', 37), ('life', 37), ('big', 36), ('day', 36), ('android', 35), ('online', 35), ('years', 34), ('simple', 34), ('court', 34), ('guide', 33), ('learning', 33), ('mt', 3