# Guided Project: Hacker News Pipeline

## Note: in Git Bash on Windows, launch Python with __[winpty python](https://stackoverflow.com/questions/32597209/python-not-working-in-the-command-line-of-git-bash)__



## 1.1 Introduction to the Data

In this course, we began with the concepts of functional programming, and then built our own data pipeline class in Python. We learned about advanced Python concepts such as decorators, closures, and good API design. In the last mission, we also learned how to implement a directed acyclic graph as the scheduler for our pipeline.

After completing all these missions, we have finally built a robust data pipeline that schedules our tasks in the correct order! In this guided project, we will take the pipeline we have been building and apply it to a real-world data pipeline project. From a JSON API, we will __filter, clean, aggregate, and summarize__ data in a sequence of tasks that will apply these transformations for us.

The data we will use comes from a __[Hacker News (HN) API](https://news.ycombinator.com/)__ that returns __[JSON data](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/JSON)__ of the top stories in 2014. If you're unfamiliar with Hacker News, it's a link aggregator website where users vote up stories that are interesting to the community. It is similar to __Reddit__, but the community revolves mainly around computer science and entrepreneurship posts.

![img alt](https://www.3scale.net/wp-content/uploads/2014/11/Hacker-News-APIs.io_.png)

To make things easier, we have already downloaded a list of JSON posts to a file called __hn_stories_2014.json__. The JSON file contains a single key __stories__, which contains a list of stories (posts). Each post has a set of keys, but we will deal only with the following keys:

* created_at: A timestamp of the story's creation time.
* created_at_i: A unix epoch timestamp.
* url: The URL of the story link.
* objectID: The ID of the story.
* author: The story's author (username on HN).
* points: The number of upvotes the story had.
* title: The headline of the post.
* num_comments: The number of comments a post has.

Here's an example of the full list of keys in a story:

```python
{
    "story_text": "",
    "created_at": "2014-05-29T08:23:46Z",
    "story_title": null,
    "story_id": null,
    "comment_text": null,
    "created_at_i": 1401351826,
    "url": "http://bits.blogs.nytimes.com/2014/05/28/making-twitter-easier-to-use/",
    "parent_id": null,
    "objectID": "7815285",
    "author": "Leynos",
    "points": 1,
    "title": "Making Twitter Easier to Use",
    "_tags": [
        "story",
        "author_Leynos",
        "story_7815285"
        
```

Using this dataset, we will run a sequence of basic natural language processing tasks using our __Pipeline__ class. The goal will be to find the __top 100 keywords__ of Hacker News posts in 2014. Because Hacker News is one of the most popular technology-focused social news sites, this will give us a sense of the most talked-about tech topics in 2014!

We have provided a __[solution](https://github.com/dataquestio/solutions/blob/master/Mission267Solutions.ipynb)__ to the guided project. While the solution is available, we recommend working through the project on your own first.

#### Instructions

* Import the Pipeline class from the pipeline module. You can import it like so: from pipeline import Pipeline.
* Instantiate an instance of the Pipeline class and assign it to the variable pipeline.

In [1]:
from pipeline import Pipeline
from pipeline import build_csv
import json
import datetime
import itertools
import io
import csv

In [2]:
pipeline = Pipeline()
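
The `Pipeline` class comes from the earlier missions and is not reproduced in this notebook. As a rough mental model only, here is a minimal sketch of a class with the same `task()` / `run()` surface. The name `MiniPipeline` is hypothetical, and unlike the real class it skips the DAG scheduler and simply runs tasks in the order they were registered.

```python
class MiniPipeline:
    """Hypothetical stand-in for the course's Pipeline class (not the real one)."""

    def __init__(self):
        self.tasks = []  # (function, dependency) pairs in registration order

    def task(self, depends_on=None):
        def decorator(func):
            self.tasks.append((func, depends_on))
            return func  # return the function unchanged so it can be a dict key
        return decorator

    def run(self):
        outputs = {}
        for func, dependency in self.tasks:
            if dependency is None:
                outputs[func] = func()
            else:
                # Feed the dependency's output in as the only argument.
                outputs[func] = func(outputs[dependency])
        return outputs


demo = MiniPipeline()

@demo.task()
def first():
    return 20

@demo.task(depends_on=first)
def second(x):
    return x + 1

results = demo.run()
print(results[second])  # 21
```

Registration order works here only because each task is defined after the task it depends on; the scheduler built in the course handles the general case.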

## 1.2 Loading the JSON Data

We'll start the project by loading the JSON file data into Python. Because JSON files resemble a key-value dictionary, the goal is to parse the JSON file into a Python dict object. We can accomplish this using the __[json module](https://docs.python.org/3/library/json.html)__.

In a previous Dataquest mission, we worked with this JSON parser. As a reminder, this is how you can parse JSON strings:

```python
import json

# Notice that `sample_json` is a string, and
# NOT a dict.
sample_json = '{"hello": "world"}'
sample_dict = json.loads(sample_json)
print(sample_dict)
# >> {'hello': 'world'}
```

To load a file, json exposes a function called json.load(), which takes a Python file object as its first argument. Using json.load(), we'll load the hn_stories_2014.json file as a Python dict.
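
As a small illustration of `json.load()`, the sketch below parses a file-like object. It uses `io.StringIO` with a made-up one-story payload in place of the real file, so it runs on its own.

```python
import io
import json

# io.StringIO stands in for an open file so the example is self-contained.
fake_file = io.StringIO('{"stories": [{"title": "Example story", "points": 3}]}')

data = json.load(fake_file)    # parses the whole stream into a dict
stories = data["stories"]      # the list stored under the single top-level key

print(type(data).__name__, len(stories), stories[0]["title"])
```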

#### Instructions

* Create a __pipeline.task()__ function that takes in no arguments.
* Name the function __file_to_json()__; it should do the following:
    * Loads the __hn_stories_2014.json__ file into a Python dict.
    * Returns the list of __stories__.
    
    
#### __[JSON: JavaScript Object Notation](http://www.json.org/)__
1. A lightweight __data-interchange format__.
2. JSON is a __text__ format that is completely __language independent__.
3. Easy __for humans__ to read and write.
4. Easy __for machines__ to parse and generate.
5. Two structures:
    * object: { string: value pairs }
    * array: [ ordered collection of values ]
6. A value can be one of:
    * string
    * number
    * object
    * array
    * true
    * false
    * null
7. Four useful functions in Python's json module:
    * string: dumps, loads
    * file: dump, load
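
A short round trip through all four functions, using a made-up record. `io.StringIO` is used instead of a file on disk so the sketch stays self-contained.

```python
import io
import json

record = {"title": "Example", "points": 3, "url": None, "tags": ["story"]}

# String-based pair: dumps() serializes to str, loads() parses it back.
text = json.dumps(record)
assert json.loads(text) == record

# File-based pair: dump() writes to a file object, load() reads from one.
buffer = io.StringIO()
json.dump(record, buffer)
buffer.seek(0)                 # rewind before reading what was just written
restored = json.load(buffer)

print(text)                    # Python's None is written as JSON null
print(restored == record)
```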
    
#### __[working with large dataset -- json -- Dataquest - Blog](https://www.dataquest.io/blog/python-json-tutorial/)__

In [3]:
## Show the size of each variable (in bytes).
## Note: json.load(f) reads the whole file from disk into memory before parsing it.
from sys import getsizeof  # calls object.__sizeof__ and adds garbage-collector overhead

with open('Data/hn_stories_2014.json') as f:
    data = json.load(f)
    # getsizeof() is shallow: it reports only the outer dict object (240 bytes here),
    # not the list of stories that the dict refers to.
    print(getsizeof(data))
    print(getsizeof(data['stories']))
    print(getsizeof(data['stories'][0]))
    print('len of data: ', len(data))
    print('keys of data: ', data.keys()) ## only one key in the JSON file
    print("len of data['stories']", len(data['stories']))
    


240
927568
648
len of data:  1
keys of data:  dict_keys(['stories'])
len of data['stories'] 107504
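
The small numbers above come from `sys.getsizeof()` measuring one object at a time. A rough sketch of a recursive version follows; the helper name `deep_getsizeof` is my own, and it only descends into the built-in containers that appear in parsed JSON.

```python
import sys

def deep_getsizeof(obj, seen=None):
    """Rough recursive size in bytes for nested dicts, lists, tuples, and sets."""
    seen = set() if seen is None else seen
    if id(obj) in seen:          # count shared objects only once
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)    # shallow size of this object alone
    if isinstance(obj, dict):
        size += sum(deep_getsizeof(k, seen) + deep_getsizeof(v, seen)
                    for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(deep_getsizeof(item, seen) for item in obj)
    return size

sample = {"stories": [{"title": f"story number {i}", "points": i} for i in range(100)]}
print(sys.getsizeof(sample))     # shallow: the outer dict only
print(deep_getsizeof(sample))    # deep: includes the list, dicts, and strings
```

The exact byte counts depend on the Python version and platform, so treat the output as an estimate.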


In [4]:
##### Quickly skimming a text file: two common ways to stream through it,
##### so we never pay the cost of reading the whole file at once.
##### 1. with open(...) as f
##### 2. command line: more filename

#### Glance over the first few lines of the file to see its structure.
#### Avoid opening the file manually in a text editor, especially when the dataset is large.

## Approach 1: since it's a text file, we can iterate over it line by line (split on '\n').
with open('Data/hn_stories_2014.json') as f:
    counter = 0
    for line in f:
        print(line)
        counter += 1
        if counter == 10:
            break

## Approach 2: with the help of the command line.
#  On Windows: more filename

{

    "stories": [

        {

            "story_text": "",

            "created_at": "2014-05-29T08:25:40Z",

            "story_title": null,

            "story_id": null,

            "comment_text": null,

            "created_at_i": 1401351940,

            "url": "https://duckduckgo.com/settings",



In [5]:
### Resort to ijson to deal with datasets that are too large to fit into memory
import ijson
dir(ijson)

['IncompleteJSONError',
 'JSONError',
 'ObjectBuilder',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 'backend',
 'backends',
 'basic_parse',
 'common',
 'compat',
 'items',
 'parse']

In [6]:
#@pipeline.task()
def file_to_json(inputs):
    with open(inputs) as f:
        data = json.load(f)
        stories = data['stories']
        return stories   

In [7]:
with open('Data/hn_stories_2014.json') as f:
    data = json.load(f)
    stories = data['stories']

In [8]:
print(getsizeof(stories)/(1024*1024))
display(stories[1])

0.8845977783203125


{'_highlightResult': {'author': {'matchLevel': 'none',
   'matchedWords': [],
   'value': 'Leynos'},
  'story_text': {'matchLevel': 'none', 'matchedWords': [], 'value': ''},
  'title': {'matchLevel': 'none',
   'matchedWords': [],
   'value': 'Making Twitter Easier to Use'},
  'url': {'matchLevel': 'none',
   'matchedWords': [],
   'value': 'http://bits.blogs.nytimes.com/2014/05/28/making-twitter-easier-to-use/'}},
 '_tags': ['story', 'author_Leynos', 'story_7815285'],
 'author': 'Leynos',
 'comment_text': None,
 'created_at': '2014-05-29T08:23:46Z',
 'created_at_i': 1401351826,
 'num_comments': 0,
 'objectID': '7815285',
 'parent_id': None,
 'points': 1,
 'story_id': None,
 'story_text': '',
 'story_title': None,
 'story_url': None,
 'title': 'Making Twitter Easier to Use',
 'url': 'http://bits.blogs.nytimes.com/2014/05/28/making-twitter-easier-to-use/'}

## 1.3 Filtering the Stories

Great! Now that we have loaded all the stories as __a list of dict objects__, we can operate on them. Let's start by filtering the list of stories to get the most popular stories of the year.

Like any social link aggregator site, individual users can post whatever content they want. The reason we want the most popular stories is to ensure that we select stories that were the most talked about during the year. We can filter for popular stories by ensuring they are links (not __Ask HN__ posts), have a good number of points, and have some comments.

#### Instructions

* Create a __pipeline.task()__ function that depends on the __file_to_json()__ function.
* Call the new function __filter_stories()__, that filters popular stories that have more than 50 points, more than 1 comment, and do not begin with __Ask HN__.
* __filter_stories()__ should return a generator of these filtered stories.

In [9]:
#@pipeline.task(depends_on=file_to_json)
def filter_stories(stories):
    is_popular = lambda story: story['num_comments']>1 and story['points']>50 and not story['title'].startswith('Ask HN')
    return (story for story in stories if is_popular(story))  ## returns a generator; remember to call the lambda: is_popular(story)
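
To see the filter in action without loading the full dataset, here is a small sketch on three made-up stories. It uses a named function instead of a lambda, but the condition is the same.

```python
def is_popular(story):
    return (story['points'] > 50
            and story['num_comments'] > 1
            and not story['title'].startswith('Ask HN'))

sample_stories = [  # made-up stories for illustration
    {'title': 'Ask HN: Favorite editor?', 'points': 120, 'num_comments': 80},
    {'title': 'A new database engine', 'points': 75, 'num_comments': 12},
    {'title': 'My weekend project', 'points': 3, 'num_comments': 0},
]

popular = (story for story in sample_stories if is_popular(story))
print(next(popular)['title'])   # A new database engine
print(next(popular, None))      # None: the generator is exhausted
```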

## 1.4 Convert to CSV

With a reduced set of stories, it's time to write these dict objects to a CSV file. The purpose of translating the dictionaries to a CSV is that we want to have a consistent data format when running the later summarizations. By keeping consistent data formats, each of your pipeline tasks will be adaptable to future task requirements.

#### Instructions

* Create a __pipeline.task()__ function that depends on the __filter_stories()__ function.
* Call the new function __json_to_csv()__, that writes the filtered JSON stories to a CSV file:
    * Import __build_csv__ from the __pipeline module__, and import __io__. The __build_csv()__ function has the same API as the one you wrote in the second and third missions.
    * Create a CSV file with the headers 'objectID', 'created_at', 'url', 'points', and 'title'.
    * Parse the __created_at__ column using __datetime.datetime__.
* __json_to_csv()__ should return the value from __build_csv()__ using the above header, lines, and the __io.StringIO()__ file.
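
The `created_at` parsing step can be tried on its own. This sketch uses the timestamp pair from the example story shown earlier and checks that the parsed value agrees with `created_at_i` when read as UTC.

```python
from datetime import datetime, timezone

created_at = "2014-05-29T08:23:46Z"      # value from the example story
created_at_i = 1401351826                # the same story's Unix timestamp

parsed = datetime.strptime(created_at, "%Y-%m-%dT%H:%M:%SZ")
print(parsed)                            # 2014-05-29 08:23:46

# The trailing "Z" is matched as a literal, so `parsed` is naive (no tzinfo).
# Treating it as UTC lines it up with the epoch timestamp.
from_epoch = datetime.fromtimestamp(created_at_i, tz=timezone.utc)
print(parsed.replace(tzinfo=timezone.utc) == from_epoch)
```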

In [10]:
## nested list by list comprehension
## ([line[key] for key in header] for line in filtered_stories)

#@pipeline.task(depends_on=filter_stories)
def json_to_csv(filtered_stories):
    header = ['objectID', 'created_at', 'url', 'points', 'title']
    parse_datetime = lambda x: datetime.datetime.strptime(x['created_at'], "%Y-%m-%dT%H:%M:%SZ")
    def parse_data(lines):
        for line in lines:
            row = []
            for key in header:
                if key == 'created_at':
                    row.append(parse_datetime(line))
                else:
                    row.append(line[key])
            yield row 

    rows = parse_data(filtered_stories)
    return build_csv(rows, header=header, file=io.StringIO())
    

In [11]:
### testing parse_data function
header = ['objectID', 'created_at', 'url', 'points', 'title']
parse_datetime = lambda x: datetime.datetime.strptime(x['created_at'], "%Y-%m-%dT%H:%M:%SZ")
def parse_data(lines):
    for line in lines:
        row = []
        for key in header:
            if key == 'created_at':
                row.append(parse_datetime(line))
            else:
                row.append(line[key])
        yield row

filtered_stories = filter_stories(stories)
parsed_data = parse_data(filtered_stories)
next(parsed_data)

['7814725',
 datetime.datetime(2014, 5, 29, 4, 27, 42),
 'http://krebsonsecurity.com/2014/05/true-goodbye-using-truecrypt-is-not-secure/',
 60,
 'True Goodbye: ‘Using TrueCrypt Is Not Secure’']
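
The `build_csv()` helper comes from the earlier missions and is not shown in this notebook. The sketch below is only a guess at its shape, based on how it is called here (`lines`, `header=`, `file=`); the name `build_csv_sketch` is hypothetical.

```python
import csv
import io

def build_csv_sketch(lines, header=None, file=None):
    """Hypothetical stand-in for pipeline.build_csv (the real one is not shown here)."""
    writer = csv.writer(file)
    if header:
        writer.writerow(header)
    writer.writerows(lines)   # accepts any iterable of rows, including a generator
    file.seek(0)              # rewind so the next task can read from the start
    return file

rows = [['1', 'http://example.com', 'Hello, world']]
csv_file = build_csv_sketch(rows, header=['objectID', 'url', 'title'], file=io.StringIO())
print(csv_file.read())
```

Rewinding the file before returning it matters because the next task reads from the same object.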

## 1.5 Extract Title Column

Using the CSV file format we created in the previous task, we can now extract the title column. Once we have extracted the titles of each popular post, we can then run the next word frequency task. To extract the titles, we'll follow the steps in the tasks we wrote in mission two and three.

The steps were:

1. Import csv, and create a csv.reader() object from the file object.
2. Find the index of the title in the header.
3. Iterate through the reader, and return each item from the reader in the corresponding title index position.

#### Instructions

* Create a __pipeline.task()__ function that depends on the __json_to_csv()__ function.
* Call the new function __extract_titles()__, which returns a __generator__ of every Hacker News story title:
    * Follow the steps listed above.
* __extract_titles()__ should return a generator of titles.

In [12]:
#@pipeline.task(depends_on=json_to_csv)
def extract_titles(csv_file):
    reader = csv.reader(csv_file)
    header = next(reader)
    idx = header.index('title')
    return (row[idx] for row in reader)
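
The same three steps can be tried on a tiny in-memory CSV. The rows below are made up; the second title contains a comma to show that `csv.reader` handles quoted fields.

```python
import csv
import io

csv_file = io.StringIO(
    "objectID,points,title\n"
    "1,60,First story\n"
    '2,75,"Second story, with a comma"\n'
)

reader = csv.reader(csv_file)
header = next(reader)              # the first row holds the column names
idx = header.index('title')        # position of the title column
titles = (row[idx] for row in reader)

title_list = list(titles)          # consuming the generator reads the rows
print(title_list)
```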

In [13]:
print(iter(stories) == iter(iter(stories)))
print([s for s in iter(stories)] == [s for s in iter(iter(stories))])

False
True
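
The `False` / `True` pair above follows from how iterators work: each `iter()` call on a list builds a new iterator object, iterators compare by identity, and calling `iter()` on an iterator returns that same iterator. A short sketch with a stand-in list:

```python
stories = ['a', 'b', 'c']   # stand-in for the real list of stories

it = iter(stories)
print(iter(it) is it)                  # True: iter() on an iterator returns itself
print(iter(stories) is iter(stories))  # False: each call on a list builds a new iterator
print(iter(stories) == iter(stories))  # False: iterators compare by identity, not contents
print(list(iter(stories)) == list(iter(iter(stories))))  # True: same items either way
```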


## 1.6 Clean the Titles

Because we're trying to create a word frequency model of words from Hacker News titles, we need a way to create a consistent set of words to use. For example, Google, google, GooGle?, and google. all refer to the same keyword: google. If we simply split the title into words, however, each variant would be counted as a separate word.

To clean the titles, we should lowercase them and remove the punctuation. An easy way to rid a string of punctuation is to check each character and keep it only if it is not a punctuation character. The __string__ module provides a handy constant that contains all the ASCII punctuation characters:

```python
import string

print(string.punctuation)
# >> !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
```

#### Instructions

* Create a pipeline.task() function that depends on the extract_titles() function.
* Call the new function clean_titles(), which returns a generator of cleaned titles:
    * Ensure the title is lower case.
    * Remove any punctuation from the title.

In [14]:
import string

In [15]:
#@pipeline.task(depends_on=extract_titles)
def clean_title(titles):
    for title in titles:
        title = title.lower()
        title = ''.join([s for s in title if s not in string.punctuation])
        yield title
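
An alternative sketch of the same cleaning step, using `str.translate()` with a deletion table instead of a per-character comprehension. The function name `clean_title_translate` is my own.

```python
import string

# Translation table that deletes every ASCII punctuation character.
REMOVE_PUNCTUATION = str.maketrans('', '', string.punctuation)

def clean_title_translate(titles):
    for title in titles:
        yield title.lower().translate(REMOVE_PUNCTUATION)

cleaned = list(clean_title_translate(['Google?', 'GooGle.', "Don't panic!"]))
print(cleaned)   # ['google', 'google', 'dont panic']
```

Note that `string.punctuation` covers ASCII only, so curly quotes such as ‘ and ’ survive either approach.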

In [16]:
print(list(string.punctuation))

['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~']


## 1.7 Create the Word Frequency Dictionary

With the cleaned titles, we can now build the word frequency dictionary. A word frequency dictionary is a set of key-value pairs that maps each word to the number of times it is used in a text. Here's an example of how a word frequency would work on a single string:

```python
sample_text = "Wow, the Dataquest Data Engineering track is the best track!"

print(word_freq_from_string(sample_text))
# >> {'wow': 1, 'the': 2, 'dataquest': 1, 'data': 1, 'engineering': 1, 'track': 2, 'is': 1, 'best': 1}
```

As you can see, the title has been stripped of its punctuation and lowercased. Furthermore, to find actual keywords, the word frequency dictionary should exclude stop words. Stop words are words that occur frequently in a language, like "the" and "or", and are commonly ignored in keyword searches.

We have included a module called stop_words with a tuple of the most commonly used stop words in the English language. You can import it in your notebook with from stop_words import stop_words. Here's what the sample text would look like without the stop words:

```python
sample_text = "Wow, the Dataquest Data Engineering track is the best track!"

print(word_freq_no_stop_words(sample_text))
# >> {'wow': 1, 'dataquest': 1, 'data': 1, 'engineering': 1, 'track': 2, 'best': 1}
```
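
The helper functions named in the two snippets above are not defined in this notebook. One possible implementation is sketched here as a single function with an optional `stop_words` argument; the two-word stop list is a stand-in for the provided `stop_words` tuple.

```python
import string

def word_freq_from_string(text, stop_words=()):
    # Lowercase, drop ASCII punctuation, then count whitespace-separated words.
    cleaned = ''.join(c for c in text.lower() if c not in string.punctuation)
    freq = {}
    for word in cleaned.split():
        if word in stop_words:
            continue
        freq[word] = freq.get(word, 0) + 1
    return freq

sample_text = "Wow, the Dataquest Data Engineering track is the best track!"
with_stops = word_freq_from_string(sample_text)
without_stops = word_freq_from_string(sample_text, stop_words=("the", "is"))
print(with_stops)
print(without_stops)
```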

#### Instructions

* Create a pipeline.task() function that depends on the clean_titles() function.
* Call the new function build_keyword_dictionary(), that returns a dictionary of the word frequency of all the HN titles.
    * The word frequency should not include stop words.
    * You can find the words by splitting each title on the space character.
    * Empty words should be ignored.

In [17]:
## Browse the stop_words tuple from the command line.
## In a notebook, prefix a shell command with "!".
!more stop_words.py

stop_words = ("a", "about", "above", "above", "across", "after", "afterwards", "again", "against", "all", "almost", "alone", "along", "already", "also","although","always","am","among", "amongst", "amoungst", "amount",  "an", "and", "another", "any","anyhow","anyone","anything","anyway", "anywhere", "are", "around", "as",  "at", "back","be","became", "because","become","becomes", "becoming", "been", "before", "beforehand", "behind", "being", "below", "beside", "besides", "between", "beyond", "bill", "both", "bottom","but", "by", "call", "can", "cannot", "cant", "co", "con", "could", "couldnt", "cry", "de", "describe", "detail", "do", "done", "down", "due", "during", "each", "eg", "eight", "either", "eleven","else", "elsewhere", "empty", "enough", "etc", "even", "ever", "every", "everyone", "everything", "everywhere", "except", "few", "fifteen", "fify", "fill", "find", "fire", "first", "five", "for", "former", "formerly", "forty", "found", "four", "from", "front", "full", "further", "ge

In [18]:
from stop_words import stop_words
len(stop_words)

322

In [19]:
#@pipeline.task(depends_on=clean_title)
def build_keyword_dictionary(lines):
    words_freq = {}
    for line in lines:
        words = line.split()
        for word in words:
            if word in stop_words:
                continue
            if word in words_freq:
                words_freq[word] += 1
            else:
                words_freq[word] = 1
    return words_freq
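
As an alternative to the manual counting loop, `collections.Counter` does the bookkeeping for us. This is a sketch with a hypothetical function name and a tiny made-up stop list, not a replacement for the task above.

```python
from collections import Counter

def build_keyword_counter(lines, stop_words=()):
    counts = Counter()
    for line in lines:
        # split() with no argument never yields empty strings,
        # so repeated spaces need no special handling.
        counts.update(word for word in line.split() if word not in stop_words)
    return counts

demo_counts = build_keyword_counter(
    ['python is fun', 'fun with python  data'],
    stop_words=('is', 'with'),
)
print(demo_counts['python'], demo_counts['fun'], demo_counts['data'])  # 2 2 1
```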
    

## 1.8 Sort the Top Words

Finally, we're ready to sort the top words used in all the titles. In this final task, it's up to you to decide how you want to sort the top words. The goal is to output a list of tuples with (word, frequency) as the entries, sorted from most used to least used.

#### Instructions

* Create a pipeline.task() function that depends on the build_keyword_dictionary() function.
* The new function can be named whatever you want, but it should return a list of the top 100 tuples described in the explanation above.
* Run the pipeline using pipeline.run(), and print the output of the new task function.

In [20]:
def top_keywords(word_freq):
    sorted_keywords = sorted(word_freq.items(), key=lambda x: x[1], reverse=True)
    return sorted_keywords[:50]  # keeps the top 50; the instructions ask for the top 100
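
When only the top N entries are needed, `heapq.nlargest()` is one alternative to sorting the whole dictionary. The sketch below uses a hypothetical function name and a few counts taken from the output in this notebook.

```python
import heapq

def top_n_keywords(word_freq, n=100):
    # Equivalent to sorted(..., reverse=True)[:n], without sorting the whole dict.
    return heapq.nlargest(n, word_freq.items(), key=lambda item: item[1])

sample_freq = {'google': 167, 'new': 185, 'python': 76, 'bitcoin': 101}
top_three = top_n_keywords(sample_freq, n=3)
print(top_three)   # [('new', 185), ('google', 167), ('bitcoin', 101)]
```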

## Testing All

In [21]:
with open('Data/hn_stories_2014.json') as f:
    data = json.load(f)
    stories = data['stories']

In [22]:
filtered_stories = filter_stories(stories) # generator
csv_file = json_to_csv(filtered_stories) # file-like object returned by build_csv
titles = extract_titles(csv_file) # generator
cleaned_titles = clean_title(titles) # generator
word_freq = build_keyword_dictionary(cleaned_titles) # dictionary
top_words = top_keywords(word_freq)

In [23]:
print(top_words)

[('new', 185), ('google', 167), ('bitcoin', 101), ('open', 92), ('programming', 90), ('web', 88), ('data', 85), ('video', 79), ('python', 76), ('code', 72), ('facebook', 71), ('released', 71), ('using', 70), ('2013', 65), ('javascript', 65), ('free', 64), ('source', 64), ('game', 63), ('internet', 62), ('microsoft', 59), ('c', 59), ('linux', 58), ('app', 57), ('pdf', 55), ('work', 54), ('language', 54), ('software', 52), ('2014', 52), ('startup', 51), ('apple', 50), ('use', 50), ('make', 50), ('time', 48), ('yc', 48), ('security', 48), ('nsa', 45), ('github', 45), ('windows', 44), ('1', 41), ('world', 41), ('way', 41), ('like', 41), ('project', 40), ('computer', 40), ('heartbleed', 40), ('git', 37), ('users', 37), ('dont', 37), ('design', 37), ('ios', 37)]


## Putting It All Together

In [24]:
from pipeline import Pipeline
from pipeline import build_csv
from stop_words import stop_words
import json
from datetime import datetime
import itertools
import io
import csv
import string

pipeline = Pipeline()

In [25]:
@pipeline.task()
def file_to_json():
    with open('Data/hn_stories_2014.json') as f:
        data = json.load(f)
        stories = data['stories']
        return stories
    
@pipeline.task(depends_on=file_to_json)    
def filter_stories(stories):
    is_popular = lambda story: story['points']>50 and story['num_comments']>1 and not story['title'].startswith('Ask HN')
    return (story for story in stories if is_popular(story)) ## return a generator

    
@pipeline.task(depends_on=filter_stories)
def json_to_csv(filtered_stories):
    header = ['objectID', 'created_at', 'url', 'points', 'title']
    parse_datetime = lambda x: datetime.strptime(x['created_at'], "%Y-%m-%dT%H:%M:%SZ")
    def parse_data(lines):
        for line in lines:
            row = []
            for key in header:
                if key == 'created_at':
                    row.append(parse_datetime(line))
                else:
                    row.append(line[key])
            yield row 

    rows = parse_data(filtered_stories)
    return build_csv(rows, header=header, file=io.StringIO())

@pipeline.task(depends_on=json_to_csv)
def extract_titles(csv_file):
    reader = csv.reader(csv_file)
    header = next(reader)
    idx = header.index('title')
    return (row[idx] for row in reader)

@pipeline.task(depends_on=extract_titles)
def clean_title(titles):
    for title in titles:
        title = title.lower()
        title = ''.join([s for s in title if s not in string.punctuation])
        yield title

@pipeline.task(depends_on=clean_title)
def build_keyword_dictionary(lines):
    words_freq = {}
    for line in lines:
        words = line.split()
        for word in words:
            if word in stop_words:
                continue
            if word in words_freq:
                words_freq[word] += 1
            else:
                words_freq[word] = 1
    return words_freq

@pipeline.task(depends_on=build_keyword_dictionary)
def top_keywords(word_freq):
    sorted_keywords = sorted(word_freq.items(), key=lambda x: x[1], reverse=True)
    return sorted_keywords[:50]  # keeps the top 50; the instructions ask for the top 100

ran = pipeline.run()
print(ran[top_keywords])

[('new', 185), ('google', 167), ('bitcoin', 101), ('open', 92), ('programming', 90), ('web', 88), ('data', 85), ('video', 79), ('python', 76), ('code', 72), ('facebook', 71), ('released', 71), ('using', 70), ('2013', 65), ('javascript', 65), ('free', 64), ('source', 64), ('game', 63), ('internet', 62), ('microsoft', 59), ('c', 59), ('linux', 58), ('app', 57), ('pdf', 55), ('work', 54), ('language', 54), ('software', 52), ('2014', 52), ('startup', 51), ('apple', 50), ('use', 50), ('make', 50), ('time', 48), ('yc', 48), ('security', 48), ('nsa', 45), ('github', 45), ('windows', 44), ('1', 41), ('world', 41), ('way', 41), ('like', 41), ('project', 40), ('computer', 40), ('heartbleed', 40), ('git', 37), ('users', 37), ('dont', 37), ('design', 37), ('ios', 37)]


## 1.9 Next Steps

The final result yielded some interesting keywords. There were terms like bitcoin (the cryptocurrency), heartbleed (the OpenSSL security bug disclosed in 2014), and many others. Even though this was a basic natural language processing task, it did provide some interesting insights into conversations from 2014. Nonetheless, now that you have created the pipeline, there are additional tasks you can perform with the data.

Here are just a few:

* Rewrite the Pipeline class' output to save a file of the output for each task. This will allow you to "checkpoint" tasks so they don't have to be run twice.
* Use the __[nltk](http://www.nltk.org/)__ package for more advanced natural language processing tasks.
* Convert to a CSV before filtering, so you can keep all the stories from 2014 in a raw file.
* Fetch the data from Hacker News directly from a __[JSON API](https://hn.algolia.com/api)__. Instead of reading from the file we gave, you can perform additional data processing using newer data.

## Solutions:

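#### Why are the frequencies in my answer one lower than in the solution?

A likely cause is visible in the solution's build_keyword_dictionary(): it sets a new word's count to 1 and then immediately increments it, so the first occurrence of each word is counted twice. A few words (python and 1, for example) show the same count in both outputs; the two versions also split titles differently, split() versus split(' '), which may account for those cases.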


In [26]:
from datetime import datetime
import json
import io
import string
import csv

from pipeline import build_csv, Pipeline
from stop_words import stop_words

pipeline = Pipeline()

@pipeline.task()
def file_to_json():
    with open('Data/hn_stories_2014.json', 'r') as f:
        data = json.load(f)
        stories = data['stories']
    return stories

@pipeline.task(depends_on=file_to_json)
def filter_stories(stories):
    def is_popular(story):
        return story['points'] > 50 and story['num_comments'] > 1 and not story['title'].startswith('Ask HN')
    
    return (
        story for story in stories
        if is_popular(story)
    )

@pipeline.task(depends_on=filter_stories)
def json_to_csv(stories):
    lines = []  # builds a full list rather than a generator, so every row is held in memory
    for story in stories:
        lines.append(
            (story['objectID'], datetime.strptime(story['created_at'], "%Y-%m-%dT%H:%M:%SZ"), story['url'], story['points'], story['title'])
        )
    return build_csv(lines, header=['objectID', 'created_at', 'url', 'points', 'title'], file=io.StringIO())

@pipeline.task(depends_on=json_to_csv)
def extract_titles(csv_file):
    reader = csv.reader(csv_file)
    header = next(reader)
    idx = header.index('title')
    
    return (line[idx] for line in reader)

@pipeline.task(depends_on=extract_titles)
def clean_title(titles):
    for title in titles:
        title = title.lower()
        title = ''.join(c for c in title if c not in string.punctuation)
        yield title

@pipeline.task(depends_on=clean_title)
def build_keyword_dictionary(titles):
    word_freq = {}
    for title in titles:
        for word in title.split(' '):
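            if word and word not in stop_words:  # skip empty strings and stop words
                if word not in word_freq:
                    word_freq[word] = 1
                # Note: this line also runs for a new word, so a first occurrence counts as 2.
                word_freq[word] += 1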
    return word_freq

@pipeline.task(depends_on=build_keyword_dictionary)
def top_keywords(word_freq):
    freq_tuple = [
        (word, word_freq[word])
        for word in sorted(word_freq, key=word_freq.get, reverse=True)
    ]
    return freq_tuple[:100]

ran = pipeline.run()
print(ran[top_keywords])

[('new', 186), ('google', 168), ('bitcoin', 102), ('open', 93), ('programming', 91), ('web', 89), ('data', 86), ('video', 80), ('python', 76), ('code', 73), ('facebook', 72), ('released', 72), ('using', 71), ('2013', 66), ('javascript', 66), ('free', 65), ('source', 65), ('game', 64), ('internet', 63), ('microsoft', 60), ('c', 60), ('linux', 59), ('app', 58), ('pdf', 56), ('work', 55), ('language', 55), ('software', 53), ('2014', 53), ('startup', 52), ('apple', 51), ('use', 51), ('make', 51), ('time', 49), ('yc', 49), ('security', 49), ('nsa', 46), ('github', 46), ('windows', 45), ('world', 42), ('way', 42), ('like', 42), ('1', 41), ('project', 41), ('computer', 41), ('heartbleed', 41), ('git', 38), ('users', 38), ('dont', 38), ('design', 38), ('ios', 38), ('developer', 37), ('os', 37), ('twitter', 37), ('ceo', 37), ('vs', 37), ('life', 37), ('big', 36), ('day', 36), ('android', 35), ('online', 35), ('years', 34), ('simple', 34), ('court', 34), ('guide', 33), ('learning', 33), ('mt', 3

In [27]:
################## Misusing the lambda function ------> this caused the filter_stories() miscalculation above ###################

test = lambda x: x>10
display([x for x in range(20) if test]) #### missing (): the function object itself is always truthy, so nothing is filtered
display([x for x in range(20) if test(x)])

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]

[11, 12, 13, 14, 15, 16, 17, 18, 19]