# Building a Hacker News pipeline

The data we will use comes from a [Hacker News](https://news.ycombinator.com/) (HN) API that returns JSON data of the top stories in 2014. 

The JSON file contains a single key stories, which contains a list of stories (posts). Each post has a set of keys, but we will deal only with the following keys:

- created_at: A timestamp of the story's creation time.
- created_at_i: A unix epoch timestamp.
- url: The URL of the story link.
- objectID: The ID of the story.
- author: The story's author (username on HN).
- points: The number of upvotes the story had.
- title: The headline of the post.
- num_comments: The number of a comments a post has.

Using this dataset, we will run a sequence of basic natural language processing tasks using our Pipeline class. The goal will be to find the top 100 keywords of Hacker News posts in 2014. Because Hacker News is the most popular technology social media site, this will give us an understanding of the most talked about tech topics in 2014!

The pipeline will do the following:
1. Read the data
2. Filter popular stories (links (not Ask HN posts), have a good number of points, and have some comments)
3. Write popular stories to a CSV file
4. Extracting titles and cleaning them
5. Getting word frequencies and sorting the frequencies

## Loading the data and starting the pipeline

In [1]:
from datetime import datetime
import json
import io
import csv
import string
import itertools 

from pipeline import build_csv, Pipeline
from stop_words import stop_words

In [2]:
pipeline = Pipeline()

In [3]:
# loading the data
@pipeline.task()
def file_to_json():
    with open('hn_stories_2014.json', 'r') as f:
        data = json.load(f)
        stories = data['stories']
    return stories
        
# Getting the most popular stories
@pipeline.task(depends_on=file_to_json)
def filter_stories(stories):
    def is_popular(story):
        return (story['points'] > 50) and (story['num_comments'] > 1) and not (story['title'].startswith('Ask HN'))
    
    return (story for story in stories if is_popular(story))

# Writing popular stories to a CSV file
@pipeline.task(depends_on=filter_stories)
def json_to_csv(filtered_stories):
    lines = []
    for story in filtered_stories:
        lines.append((story['objectID'], 
                      datetime.strptime(story['created_at'], "%Y-%m-%dT%H:%M:%SZ"), 
                      story['url'], story['points'], story['title']))
    return build_csv(lines, header=['objectID', 'created_at', 'url', 'points', 'title'], file=io.StringIO())

# Extracting titles
@pipeline.task(depends_on=json_to_csv)
def extract_titles(csv_file):
    reader = csv.reader(csv_file)
    header = next(reader)
    idx = header.index('title')
    
    return (line[idx] for line in reader)

# Cleaning titles
@pipeline.task(depends_on=extract_titles)
def clean_titles(titles):
    for title in titles:
        title = title.lower()
        title = ''.join(c for c in title if c not in string.punctuation)
        yield title

# Calculating word frequencies
@pipeline.task(depends_on=clean_titles)
def build_keyword_dictionary(cleaned_titles):
    word_freq = {}
    for title in cleaned_titles:
        for word in title.split(' '):
            if word and word not in stop_words:
                if word not in word_freq:
                    word_freq[word] = 1
                word_freq[word] += 1
    return word_freq

# Sorting word frequencies in descending order
@pipeline.task(depends_on=build_keyword_dictionary)
def most_frequent(keyword_dict):
    sorted_dict ={k: v for k, v in sorted(keyword_dict.items(), key=lambda item: item[1], reverse=True)}
    sorted_100 = dict(itertools.islice(sorted_dict.items(), 100)) 
    top_100 = []
    for key, val in sorted_100.items():
        top_100.append((key,val))
    return top_100

#     alternative
#     freq_tuple = [
#         (word, word_freq[word])
#         for word in sorted(word_freq, key=word_freq.get, reverse=True)
#     ]
#     return freq_tuple[:100]
    

In [4]:
ran = pipeline.run()
print(ran[most_frequent])

[('new', 186), ('google', 168), ('bitcoin', 102), ('open', 93), ('programming', 91), ('web', 89), ('data', 86), ('video', 80), ('python', 76), ('code', 73), ('facebook', 72), ('released', 72), ('using', 71), ('2013', 66), ('javascript', 66), ('free', 65), ('source', 65), ('game', 64), ('internet', 63), ('microsoft', 60), ('c', 60), ('linux', 59), ('app', 58), ('pdf', 56), ('work', 55), ('language', 55), ('software', 53), ('2014', 53), ('startup', 52), ('apple', 51), ('use', 51), ('make', 51), ('time', 49), ('yc', 49), ('security', 49), ('nsa', 46), ('github', 46), ('windows', 45), ('world', 42), ('way', 42), ('like', 42), ('1', 41), ('project', 41), ('computer', 41), ('heartbleed', 41), ('git', 38), ('users', 38), ('dont', 38), ('design', 38), ('ios', 38), ('developer', 37), ('os', 37), ('twitter', 37), ('ceo', 37), ('vs', 37), ('life', 37), ('big', 36), ('day', 36), ('android', 35), ('online', 35), ('years', 34), ('simple', 34), ('court', 34), ('guide', 33), ('learning', 33), ('mt', 3

Possible next steps:

- Rewrite the Pipeline class' output to save a file of the output for each task. This will allow you to "checkpoint" tasks so they don't have to be run twice.
- Use the nltk package for more advanced natural language processing tasks.
- Convert to a CSV before filtering, so you can keep all the stories from 2014 in a raw file.
- Fetch the data from Hacker News directly from a JSON API. Instead of reading from the file we gave, you can perform additional data processing using newer data.
