# Hacker News Pipeline

In this project, we will build a pipeline, and apply it to a real world dataset. From a JSON API, we will filter, clean, aggregate, and summarize data in a sequence of tasks that will apply these transformations for us.

The data we will use comes from a [Hacker News](https://news.ycombinator.com/) (HN) API that returns JSON data of the top stories in 2014. Hacker News is a link aggregator website that users vote up stories that are interesting to the community. It is similar to Reddit, but the community only revolves around on computer science and entrepreneurship posts.

To make things easier, a list of JSON posts has already been downloaded to a file called `hn_stories_2014.json`. The JSON file contains a single key `stories`, which contains a list of stories (posts). Each post has a set of keys, but we will deal only with the following keys:

- `created_at`: A timestamp of the story's creation time.
- `created_at_i`: A unix epoch timestamp.
- `url`: The URL of the story link.
- `objectID`: The ID of the story.
- `author`: The story's author (username on HN).
- `points`: The number of upvotes the story had.
- `title`: The headline of the post.
- `num_comments`: The number of a comments a post has.

The goal will be to find the top 100 keywords of Hacker News posts in 2014. Because Hacker News is the most popular technology social media site, this will give us an understanding of the most talked about tech topics in 2014!

In [15]:
from pipeline import Pipeline
pipeline = Pipeline()

## Loading JSON data

We'll start the project by loading the JSON file data into Python. Because JSON files resemble a key-value dictionary, the goal is to parse the JSON file into a Python `dict` object. We can accomplish this using the `json` module.

In [16]:
import json

@pipeline.task()
def file_to_json():
    with open('hn_stories_2014.json', 'r') as f:
        data = json.load(f)
        stories = data['stories']
    return stories

## Filtering the stories

Now that we have loaded in all the stories as a list of `dict` objects, we can now operate on them. Let's start by filtering the list of stories to get the most popular stories of the year.

Like any social link aggregator site, individual users can post whatever content they want. The reason we want the most popular stories is to ensure that we select stories that were the most talked about during the year. We can filter for popular stories by ensuring they are links (not Ask HN posts), have a good number of points, and have some comments.

In [17]:
@pipeline.task(depends_on=file_to_json)
def filter_stories(stories):
    def is_popular(story):
        return (
            story['points'] > 50 
            and story['num_comments'] > 1 
            and not story['title'].startswith('Ask HN')
        )
    
    return (
        story for story in stories
        if is_popular(story)
    )

## Convert to csv

With a reduced set of stories, we can write these `dict` objects to a CSV file. The purpose of translating the dictionaries to a CSV is that we want to have a consistent data format when running the later summarizations. By keeping consistent data formats, each of your pipeline tasks will be adaptable with future task requirements.

In [26]:
from pipeline import build_csv
from datetime import datetime
import io

@pipeline.task(depends_on=filter_stories)
def json_to_csv(stories):
    lines = []
    for story in stories:
        lines.append(
            (
                story['objectID'], 
                datetime.strptime(story['created_at'], "%Y-%m-%dT%H:%M:%SZ"), 
                story['url'], 
                story['points'], 
                story['title']
            )
        )
        
    return build_csv(
        lines, 
        header=['objectID', 'created_at', 'url', 'points', 'title'], 
        file=io.StringIO()
    )


## Extract title column

Using the CSV file format we created in the previous task, we can now extract the title column. Once we have extracted the titles of each popular post, we can then run the next word frequency task.

In [28]:
import csv

@pipeline.task(depends_on=json_to_csv)
def extract_titles(csv_file):
    reader = csv.reader(csv_file)
    header = next(reader)
    idx = header.index('title')
    
    return (line[idx] for line in reader)

## Clean titles

Because we're trying to create a word frequency model of words from Hacker News titles, we need a way to create a consistent set of words to use. For example, words like '`Google`', '`google`', '`GooGle?`', and '`google.`', all mean the same keyword: '`google`'. If we were to split the title into words, however, they would all be lumped into different categories.

To clean the titles, we should make sure to lower case the titles, and to remove the punctuation.

In [20]:
import string

@pipeline.task(depends_on=extract_titles)
def clean_title(titles):
    for title in titles:
        title = title.lower()
        title = ''.join(c for c in title if c not in string.punctuation)
        yield title

## Create word frequency dictionary

With a cleaned title, we can now build the word frequency dictionary. A word frequency dictionary are key value pairs that connects a word to the number of times it is used in a text.

To find actual keywords, we should enforce the word frequency dictionary to not include stop words. Stop words are words that occur frequently in language like "the", "or", etc., and are commonly rejected in keyword searches.

In [21]:
from stop_words import stop_words

@pipeline.task(depends_on=clean_title)
def build_keyword_dictionary(titles):
    word_freq = {}
    
    for title in titles:
        for word in title.split(' '):
            if word and word not in stop_words:
                if word not in word_freq:
                    word_freq[word] = 1
                word_freq[word] += 1
                
    return word_freq

## Sorting the top words

Finally, we're ready to sort the top words used in all the titles.

In [22]:
@pipeline.task(depends_on=build_keyword_dictionary)
def top_keywords(word_freq):
    freq_tuple = [
        (word, word_freq[word])
        for word in sorted(word_freq, key=word_freq.get, reverse=True)
    ]
    return freq_tuple[:100]

## Running the pipeline

In [29]:
ran = pipeline.run()
ran[top_keywords]

[('new', 186),
 ('google', 168),
 ('bitcoin', 102),
 ('open', 93),
 ('programming', 91),
 ('web', 89),
 ('data', 86),
 ('video', 80),
 ('python', 76),
 ('code', 73),
 ('released', 72),
 ('facebook', 72),
 ('using', 71),
 ('javascript', 66),
 ('2013', 66),
 ('free', 65),
 ('source', 65),
 ('game', 64),
 ('internet', 63),
 ('microsoft', 60),
 ('c', 60),
 ('linux', 59),
 ('app', 58),
 ('pdf', 56),
 ('work', 55),
 ('language', 55),
 ('2014', 53),
 ('software', 53),
 ('startup', 52),
 ('make', 51),
 ('apple', 51),
 ('use', 51),
 ('yc', 49),
 ('time', 49),
 ('security', 49),
 ('github', 46),
 ('nsa', 46),
 ('windows', 45),
 ('like', 42),
 ('way', 42),
 ('world', 42),
 ('heartbleed', 41),
 ('computer', 41),
 ('1', 41),
 ('project', 41),
 ('ios', 38),
 ('users', 38),
 ('git', 38),
 ('dont', 38),
 ('design', 38),
 ('life', 37),
 ('os', 37),
 ('developer', 37),
 ('vs', 37),
 ('ceo', 37),
 ('twitter', 37),
 ('big', 36),
 ('day', 36),
 ('android', 35),
 ('online', 35),
 ('years', 34),
 ('court', 3

The final result yielded some interesting keywords. There were terms like 'bitcoin' (the cryptocurrency), 'heartbleed' (the 2014 hack), and many others. Even though this was a basic natural language processing task, it did provide some interesting insights into conversations from 2014. 

## Next steps 

Here are few potential next steps:

- Rewrite the Pipeline class' output to save a file of the output for each task. This will allow us to "checkpoint" tasks so they don't have to be run twice.
- Use the nltk package for more advanced natural language processing tasks.
- Convert to a CSV before filtering, so we can keep all the stories from 2014 in a raw file.
- Fetch the data from Hacker News directly from a JSON API. Instead of reading from the file, and perform additional data processing using newer data.