## Take untagged / `to_file` Pocket articles, parse them with Readability, and tag them with Open Calais

### Dependencies
- [pocket-ruby](https://github.com/turadg/pocket-ruby)
- [open_calais](https://github.com/PRX/open_calais)
- [sanitize](https://github.com/rgrove/sanitize)

You also need a Pocket consumer key, as well as tokens from the OAuth server (run `server.rb` to acquire these).

In [1]:
require 'pocket-ruby'
require 'open_calais'

require 'sanitize'

require 'json'
require 'hashie'

require './lib/url_tagger'
require './lib/key_manager'

true

## Load keys from 'current_tokens.json'
### If this is empty, run `ruby server.rb` and authorize some services via OAuth to populate it
### You also need an Open Calais key for tagging, and a Pocket consumer key for fetching from Pocket (sorry)

In [2]:
keys = JSON.parse(File.read('current_tokens.json'))
if keys.empty?
  puts "run ruby server.rb -p8080 to get some oauth keys"
else
  keys.keys
end

["feedly", "pocket", "pocket_consumer", "calais", "readability"]
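A minimal sketch of failing fast when the key file is only partly populated. The key names come from the output above; the `missing_keys` helper and the `REQUIRED_KEYS` list are hypothetical and not part of `key_manager`.

```ruby
require 'json'

# Hypothetical list of entries this notebook relies on.
REQUIRED_KEYS = %w[pocket pocket_consumer calais readability].freeze

# Return the names of required entries that are absent or blank.
def missing_keys(keys, required = REQUIRED_KEYS)
  required.select { |name| keys[name].to_s.strip.empty? }
end

# An in-memory JSON string stands in for current_tokens.json here.
sample = JSON.parse('{"pocket": "abc", "calais": ""}')
missing = missing_keys(sample)
puts "missing keys: #{missing.join(', ')}" unless missing.empty?
```

Running a check like this right after loading gives a clearer message than a failed API call further down.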

In [3]:
keys

list of keys in your key file

## Create a `UrlTagger` with access to Open Calais and Readability

In [4]:
tagger = UrlTagger.new(calais_key: keys["calais"], readability_key: keys["readability"])

#<UrlTagger:0x00000003942840 @calais_client=#<OpenCalais::Client:0x000000039427a0 @current_options={:api_key=>"(redacted)", :adapter=>:excon, :endpoint=>"https://api.thomsonreuters.com/permid/calais", :user_agent=>"OpenCalais Ruby Gem 0.3.2"}, @api_key="(redacted)", @adapter=:excon, @endpoint="https://api.thomsonreuters.com/permid/calais", @user_agent="OpenCalais Ruby Gem 0.3.2">, @readability_key="(redacted)">

### The `tags_for` method fetches tags for a URL via Open Calais

In [5]:
url = "https://www.rockpapershotgun.com/2016/02/10/eves-project-legion-scrapped-new-fps-coming/"
tagger.tags_for(url)

[{:name=>"Environment", :score=>0.638, :original=>"Environment"}, {:name=>"eve online", :score=>0.9}, {:name=>"dust 514", :score=>0.9}, {:name=>"ccp games", :score=>0.9}, {:name=>"dust", :score=>0.7}, {:name=>"eve", :score=>0.7}]
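The result is an array of hashes with `:name` and `:score` (and sometimes `:original`). A small sketch of trimming that down to confident tag names; the `confident_names` helper and the 0.7 threshold are assumptions for illustration, not part of `UrlTagger`.

```ruby
# Tag hashes in the shape shown above (values copied from the output).
tags = [
  { name: "Environment", score: 0.638, original: "Environment" },
  { name: "eve online", score: 0.9 },
  { name: "dust 514", score: 0.9 },
  { name: "dust", score: 0.7 }
]

# Hypothetical helper: keep tags at or above min_score, return only names.
def confident_names(tags, min_score: 0.7)
  tags.select { |t| t[:score] >= min_score }.map { |t| t[:name] }
end

p confident_names(tags)
```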

## Automated tagging demo with Pocket
### requires `pocket_consumer` key and oauthing with Pocket

In [6]:
pocket_client = Pocket.client(consumer_key: keys["pocket_consumer"], access_token: keys["pocket"])

#<Pocket::Client:0x0000000384a8c0 @adapter=:net_http, @consumer_key="(redacted)", @access_token="(redacted)", @endpoint="https://getpocket.com/v3/", @redirect_uri=nil, @format=:json, @user_agent="Pocket Ruby Gem 0.0.6", @proxy=nil>

### Define a bunch of lambdas instead of a class because I'm still prototyping this

In [7]:
false_positive_tags = ["http cookie"]

# Choose the better of two tag sets; pretty crude algorithm for now
better_tags = ->(item, seta, setb) do
  return seta if setb.count == 0
  return setb if seta.count == 0
  
  a_conf = seta.sum{|u| u[:score]} / seta.count
  b_conf = setb.sum{|e| e[:score]} / setb.count
  
  if a_conf > b_conf
    if seta.any?{|sa| false_positive_tags.include?(sa[:name])}
      setb
    else
      seta
    end
  else
    setb
  end
end

# get the 'better' of either tags based on the article excerpt, or the full (html scrubbed) text
auto_tag = ->(item, tagger) do
  url = item["resolved_url"] || item["given_url"]
  
  url_tags = tagger.tags_for(url)
  excerpt_tags = tagger.get_tags(item["excerpt"])
  
  tags = better_tags[item, url_tags, excerpt_tags]
  
  tags
end

# set tags in pocket
set_tags = ->(item_id, tags) do
  payload = [
    {action: "tags_add", tags: tags, item_id: item_id}
  ]
  puts payload
  pocket_client.modify(payload)
end

#<Proc:0x00000001ea0820@(pry):55 (lambda)>
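`set_tags` passes the full tag hashes straight through. Pocket's v3 `modify` documentation describes `tags` for `tags_add` as a comma-delimited string of names, so (assuming that still holds) it may be safer to send names only. A sketch of building such an action; `tags_add_action` is a hypothetical helper.

```ruby
# Hypothetical helper: build a tags_add action from tag hashes,
# sending only the tag names as a comma-delimited string.
def tags_add_action(item_id, tags)
  names = tags.map { |t| t[:name] }.uniq
  { action: "tags_add", item_id: item_id, tags: names.join(",") }
end

action = tags_add_action("1191716972", [
  { name: "north korea", score: 0.7 },
  { name: "south korea", score: 0.7 }
])
puts action
```

The resulting hash could be wrapped in an array and handed to `pocket_client.modify` the same way `set_tags` does.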

## Fetch some entries from pocket

In [8]:
pocket_entries = pocket_client.retrieve(count: 6, offset: 0)
entries = pocket_entries["list"]

entries.count

6
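`count` and `offset` make it possible to walk the whole list a page at a time. A sketch of that loop, with a lambda standing in for `pocket_client.retrieve(...)["list"]` so it runs without credentials; `fetch_all` and `fake_fetch` are hypothetical names.

```ruby
# Hypothetical helper: keep fetching pages until one comes back empty.
# `fetch_page` is any callable taking (count, offset) and returning a
# collection of item_id => item pairs, like the "list" value above.
def fetch_all(fetch_page, page_size: 6)
  items = {}
  offset = 0
  loop do
    page = fetch_page.call(page_size, offset)
    break if page.nil? || page.empty?
    items.merge!(page.to_h)
    offset += page.size
  end
  items
end

# Fake data standing in for the Pocket API.
fake_items = (1..14).to_h { |i| [i.to_s, { "resolved_title" => "Article #{i}" }] }
fake_fetch = ->(count, offset) { (fake_items.to_a[offset, count] || []).to_h }

all_items = fetch_all(fake_fetch)
puts all_items.size
```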

### What have we got here?

In [9]:
titles = entries.map{|id, val| val["resolved_title"]}
puts titles.join("\n")

North Korea Uncovered: The Crowd-Sourced Mapping of the World’s Most Secret State
How prominent black voices are divided on Clinton v Sanders
Quick Easy Fish Stew
Gravitational Waves Found: Kip Thorne Explains
Why the Authors Guild Is Still Wrong About Google’s Book Scanning
StevenBlack/hosts


### Let's take a look at one

In [10]:
e=entries.values.first
puts JSON.pretty_unparse e

{
  "item_id": "1191716972",
  "resolved_id": "1191716972",
  "given_url": "http://blogs.loc.gov/maps/2016/02/north-korea-uncovered-the-crowd-sourced-mapping-of-the-worlds-most-secret-state/",
  "given_title": "Crowdsourced Mapping of North Korea",
  "favorite": "0",
  "status": "0",
  "time_added": "1455251176",
  "time_updated": "1455253938",
  "time_read": "0",
  "time_favorited": "0",
  "sort_id": 0,
  "resolved_title": "North Korea Uncovered: The Crowd-Sourced Mapping of the World’s Most Secret State",
  "resolved_url": "http://blogs.loc.gov/maps/2016/02/north-korea-uncovered-the-crowd-sourced-mapping-of-the-worlds-most-secret-state/",
  "excerpt": "Begin Press Release: Library of Congress to Hold Lecture on the Crowd-Sourced Mapping of North Korea, Feb. 24 Curtis Melvin, a researcher at the U.S.",
  "is_article": "1",
  "is_index": "0",
  "has_video": "0",
  "has_image": "0",
  "word_count": "513"
}


### Fetch tags based on the excerpt

In [11]:
etgs=tagger.get_tags(e["excerpt"])

[{:name=>"Politics", :score=>0.897, :original=>"Politics"}, {:name=>"Environment", :score=>0.467, :original=>"Environment"}, {:name=>"melvin", :score=>0.9}]

### Fetch tags for the whole article

In [12]:
utgs = tagger.tags_for(e["resolved_url"])

[{:name=>"member states of the united nations", :score=>0.9}, {:name=>"republics", :score=>0.9}, {:name=>"military of north korea", :score=>0.9}, {:name=>"geography of north korea", :score=>0.7}, {:name=>"north korea uncovered", :score=>0.7}, {:name=>"foreign relations of north korea", :score=>0.7}, {:name=>"yongbyon nuclear scientific research center", :score=>0.7}, {:name=>"north korea", :score=>0.7}, {:name=>"korean language", :score=>0.7}, {:name=>"south korea", :score=>0.7}]

### Use the `better_tags` function to determine which set of tags is better (?)

In [13]:
best_tags=better_tags[e, etgs, utgs]

[{:name=>"member states of the united nations", :score=>0.9}, {:name=>"republics", :score=>0.9}, {:name=>"military of north korea", :score=>0.9}, {:name=>"geography of north korea", :score=>0.7}, {:name=>"north korea uncovered", :score=>0.7}, {:name=>"foreign relations of north korea", :score=>0.7}, {:name=>"yongbyon nuclear scientific research center", :score=>0.7}, {:name=>"north korea", :score=>0.7}, {:name=>"korean language", :score=>0.7}, {:name=>"south korea", :score=>0.7}]
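To see why the URL tags won here, the averages `better_tags` compares can be worked out by hand from the scores shown above. This is just the arithmetic, not a call into the tagger.

```ruby
# Scores copied from the excerpt tags and URL tags printed above.
excerpt_scores = [0.897, 0.467, 0.9]
url_scores     = [0.9] * 3 + [0.7] * 7

mean = ->(scores) { scores.sum / scores.size }

puts format("excerpt mean: %.3f", mean[excerpt_scores])
puts format("url mean:     %.3f", mean[url_scores])
```

The two means come out at roughly 0.755 and 0.760, so the choice is decided by a very thin margin; a single low-scoring category tag is enough to tip it.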

### Add the chosen 'best' set of tags to the Pocket item

In [14]:
set_tags[e["item_id"], best_tags]

[{:action=>"tags_add", :tags=>[{:name=>"member states of the united nations", :score=>0.9}, {:name=>"republics", :score=>0.9}, {:name=>"military of north korea", :score=>0.9}, {:name=>"geography of north korea", :score=>0.7}, {:name=>"north korea uncovered", :score=>0.7}, {:name=>"foreign relations of north korea", :score=>0.7}, {:name=>"yongbyon nuclear scientific research center", :score=>0.7}, {:name=>"north korea", :score=>0.7}, {:name=>"korean language", :score=>0.7}, {:name=>"south korea", :score=>0.7}], :item_id=>"1191716972"}]


{"action_results"=>[true], "status"=>1}

### Tada! Tags are set in Pocket. Now to figure out what to do with the tagged Pocket articles...

## Innards of the tagger

### Parsing a url with the Readability parse API

In [15]:
parsed = tagger.readability_parse_url(e["resolved_url"])
puts "keys: #{parsed.keys}"
parsed["content"][1..1000] # This can get pretty long

keys: ["domain", "next_page_id", "url", "short_url", "author", "excerpt", "direction", "word_count", "total_pages", "content", "date_published", "dek", "lead_image_url", "title", "rendered_pages"]


"div><div class=\"entry-content\">\n\t\t<div id=\"attachment_558\" class=\"wp-caption aligncenter\"><a href=\"http://blogs.loc.gov/maps/files/2016/02/Digital-Atlas-screenshot.jpg\"><img class=\"wp-image-558 size-full\" src=\"http://blogs.loc.gov/maps/files/2016/02/Digital-Atlas-screenshot.jpg\" alt=\"\" width=\"1001\"></a><p class=\"wp-caption-text\">Above, is an excerpt from the <a href=\"http://www.38northdigitalatlas.org/\" class=\"external\">38 North Digital Atlas</a>. 38 North is a project of the U.S.-Korea Institute at the Paul H. Nitze School of Advanced International Studies (SAIS), Johns Hopkins University. Copyright &#xA9; 2009-2016. Image courtesy of Curtis Melvin.</p></div>\n<p>Begin Press Release:</p>\n<h2>Library of Congress to Hold Lecture on the</h2>\n<h2>Crowd-Sourced Mapping of North Korea, Feb. 24</h2>\n<p>Curtis Melvin, a researcher at the U.S.-Korea Institute at Johns Hopkins University, will discuss the crowd-sourced mapping of North Korea, which resulted in one o

### Remove HTML tags and weird content (confuses the semantic taggers)

In [16]:
scrubbed = tagger.scrub_html(parsed["content"])
puts scrubbed[1..1000]

bove, is an excerpt from the 38 North Digital Atlas. 38 North is a project of the U.S.-Korea Institute at the Paul H. Nitze School of Advanced International Studies (SAIS), Johns Hopkins University. Copyright © 2009-2016. Image courtesy of Curtis Melvin.    Begin Press Release:   Library of Congress to Hold Lecture on the   Crowd-Sourced Mapping of North Korea, Feb. 24   Curtis Melvin, a researcher at the U.S.-Korea Institute at Johns Hopkins University, will discuss the crowd-sourced mapping of North Korea, which resulted in one of the most detailed maps of North Korea that has ever been available to the public.   Melvin will present “North Korea Uncovered: The Crowd-Sourced Mapping of the World’s Most Secret State” at noon on Wednesday, Feb. 24 in the Mumford Room on the sixth floor of the James Madison Memorial Building, 101 Independence Ave. S.E., Washington, D.C. The event, free and open to the public, is hosted by The Philip Lee Phillips Map Society, the Friends Group of the Libr
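For a feel of what the scrubbing step does, here is a rough stdlib-only approximation. It is not how `scrub_html` is implemented (that presumably leans on the sanitize gem listed in the dependencies), and regex stripping like this is only suitable for a quick sketch; `rough_scrub` is a hypothetical name.

```ruby
# Rough stand-in for HTML scrubbing, using regexes only.
def rough_scrub(html)
  html
    .gsub(%r{<(script|style)\b.*?</\1>}mi, " ") # drop script/style blocks entirely
    .gsub(/<[^>]+>/, " ")                        # replace remaining tags with spaces
    .gsub(/\s+/, " ")                            # collapse runs of whitespace
    .strip
end

html = "<div><p>Begin Press Release:</p>\n<h2>Library of Congress</h2></div>"
puts rough_scrub(html)
```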

### And fetch tags using Open Calais

In [17]:
tags = tagger.get_tags(scrubbed)

[{:name=>"member states of the united nations", :score=>0.9}, {:name=>"republics", :score=>0.9}, {:name=>"military of north korea", :score=>0.9}, {:name=>"geography of north korea", :score=>0.7}, {:name=>"north korea uncovered", :score=>0.7}, {:name=>"foreign relations of north korea", :score=>0.7}, {:name=>"yongbyon nuclear scientific research center", :score=>0.7}, {:name=>"north korea", :score=>0.7}, {:name=>"korean language", :score=>0.7}, {:name=>"south korea", :score=>0.7}]

## For our last trick, tag a whole bunch of articles at once. YOLO!

In [18]:
list_entries = pocket_client.retrieve(count: 6, offset: 16) # get some new Entries
entries = list_entries["list"]

entries.values.map{|e| e["resolved_title"]}

["Fleeting Wonders: An Influx of Manatees", "NASA’s asteroid mission isn’t dead—yet", "In pictures: Mumbai cabbies", "The Insouciant Heiress Who Became the First Western Woman to Enter Palmyra", "The Portland of Portugal", "Your Stupid-Ass Typing Style Might Not Actually Be So Bad"]

### All kinds of horrible things can go wrong while parsing and tagging, so catch any errors and report them

In [19]:
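results = entries.map do |id, entry|
  begin
    best_tags = auto_tag[entry, tagger]
  rescue StandardError => err
    # Use a distinct name so the exception doesn't clobber the entry
    puts "error tagging #{entry["resolved_title"]}: #{err.inspect}"
    next # nothing to set for this entry
  end
  
  puts "#{entry["resolved_title"]}: #{best_tags}\n\n"
  
  set_tags[id, best_tags] if best_tags.count > 0
end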

Fleeting Wonders: An Influx of Manatees: [{:name=>"sirenians", :score=>0.9}, {:name=>"manatee", :score=>0.9}, {:name=>"crystal river", :score=>0.9}, {:name=>"three sisters springs", :score=>0.7}, {:name=>"edge species", :score=>0.7}, {:name=>"west indian manatee", :score=>0.7}]


[{:action=>"tags_add", :tags=>[{:name=>"sirenians", :score=>0.9}, {:name=>"manatee", :score=>0.9}, {:name=>"crystal river", :score=>0.9}, {:name=>"three sisters springs", :score=>0.7}, {:name=>"edge species", :score=>0.7}, {:name=>"west indian manatee", :score=>0.7}], :item_id=>"1190177662"}]
NASA’s asteroid mission isn’t dead—yet: [{:name=>"Technology & Internet", :score=>0.961, :original=>"Technology_Internet"}, {:name=>"Environment", :score=>0.807, :original=>"Environment"}, {:name=>"spaceflight", :score=>0.9}, {:name=>"asteroid redirect mission", :score=>0.9}, {:name=>"planetary defense", :score=>0.9}, {:name=>"human mission to an asteroid", :score=>0.7}, {:name=>"nasa", :score=>0.7}, {:name=>"space launch

[{"action_results"=>[true], "status"=>1}, {"action_results"=>[true], "status"=>1}, {"action_results"=>[true], "status"=>1}, {"action_results"=>[true], "status"=>1}, {"action_results"=>[true], "status"=>1}, {"action_results"=>[true], "status"=>1}]
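A sketch of the same batch idea with failures collected separately, so one bad article doesn't hide the rest. The tagging function is injected, which lets this run without Pocket or Open Calais; `tag_batch` and the `flaky` lambda are hypothetical.

```ruby
# Hypothetical helper: run tag_fn over each entry, isolating failures.
# Returns [tagged, failed], both keyed by item id.
def tag_batch(entries, tag_fn)
  tagged = {}
  failed = {}
  entries.each do |id, entry|
    begin
      tagged[id] = tag_fn.call(entry)
    rescue StandardError => err
      failed[id] = err.message
    end
  end
  [tagged, failed]
end

# Fake entries and a tagger that fails when the title is missing.
entries = {
  "1" => { "resolved_title" => "Good Article" },
  "2" => { "resolved_title" => nil }
}
flaky = ->(entry) do
  raise ArgumentError, "no title" if entry["resolved_title"].nil?
  [{ name: entry["resolved_title"].downcase, score: 0.9 }]
end

tagged, failed = tag_batch(entries, flaky)
puts "tagged: #{tagged.keys.inspect}, failed: #{failed.inspect}"
```

In the notebook, `tag_fn` would be something like `->(entry) { auto_tag[entry, tagger] }`, with `set_tags` applied to `tagged` afterwards.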