# Scraping and Analyzing 'Pocketed' Articles (https://getpocket.com/)
<br>

I wrote a script that works with the Pocket "Retrieve" API (https://getpocket.com/developer/). The script writes four files: <b>pocket_data_raw.json</b>, <b>pocket_data_raw.pkl</b>, <b>pocket_data.pkl</b>, and <b>pocket_links.pkl</b>. Here is the script documentation:
 

In [2]:
import pocket_onboarding
print pocket_onboarding.__doc__


Run:    python pocket_onboarding.py

Input:  Https://getpocket.com account credentials

Output: Three DataFrames stored as pocket_data_raw.pkl, pocket_data.pkl, and pocket_links.pkl. 
        One JSON file stored as pocket_data_raw.json
        
        1) pocket_data_raw.pkl contains columns of meta data described on
           https://getpocket.com/developer/docs/v3/retrieve as well as 'article_text' 
           which is the body of text extracted using beautifulsoup/requests on 'resolved_url'

        2) pocket_data.pkl is a cleaned up version of pocket_data_raw.pkl. See below (def cleanData)
           for the full list of cleaning steps that were taken, but for example this includes
           eliminating null values from important fields and making assumptions about 'article_text'
           word counts. 

        3) pocket_links.pkl is a two column data frame that maps (many-to-many) 'resolved_url' to
           links found while scraping 'resolved_url'. 



<br>
## pocket_data_raw.json
This file contains the raw data returned from the aforementioned "Retrieve" API which is described <a href="https://getpocket.com/developer/docs/v3/retrieve">here</a>. 
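
For reference, a request to the Retrieve endpoint can be sketched as follows. This is a minimal Python 3 sketch using only the standard library, not the code in pocket_onboarding.py. The function name `build_retrieve_request` and the placeholder credentials are hypothetical, and the choice of `state="all"` and `detailType="complete"` is an assumption (the sample article in this section includes tags, which suggests complete detail was requested). The request is built but not sent.

```python
import json
import urllib.request

RETRIEVE_URL = "https://getpocket.com/v3/get"


def build_retrieve_request(consumer_key, access_token):
    """Build (but do not send) a POST request for the Pocket Retrieve API."""
    payload = {
        "consumer_key": consumer_key,
        "access_token": access_token,
        "state": "all",            # assumption: unread and archived items
        "detailType": "complete",  # assumption: include tags, images, authors
    }
    return urllib.request.Request(
        RETRIEVE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json; charset=UTF-8",
            "X-Accept": "application/json",
        },
    )


# Placeholder credentials (hypothetical); real ones come from the Pocket
# developer site and its authentication flow.
request = build_retrieve_request("your-consumer-key", "your-access-token")

# Sending the request would look roughly like this:
#   with urllib.request.urlopen(request) as response:
#       pocket_data_raw = json.load(response)
```

The parsed response is what gets written to pocket_data_raw.json and loaded in the cell that follows.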

In [9]:
import json
import pprint

with open('pocket_data_raw.json', 'rb') as fp:
    pocket_data_raw = json.load(fp)
    print 
    print "Number of Articles: ", len(pocket_data_raw['list'].keys())
    print 
    print "-- Sample Article --"
    print 
    it = pocket_data_raw['list'].itervalues()
    it.next(); it.next()
    pprint.pprint(it.next())


Number of Articles:  874

-- Sample Article --

{u'excerpt': u"One reason programmers dislike meetings so much is that they're on a different type of schedule from other people. Meetings cost them more.  There are two types of schedule, which I'll call the manager's schedule and the maker's schedule. The manager's schedule is for bosses.",
 u'favorite': u'0',
 u'given_title': u'http://www.paulgraham.com/makersschedule.html',
 u'given_url': u'http://www.paulgraham.com/makersschedule.html',
 u'has_image': u'0',
 u'has_video': u'0',
 u'is_article': u'1',
 u'is_index': u'0',
 u'item_id': u'14878635',
 u'resolved_id': u'14878635',
 u'resolved_title': u"Maker's Schedule, Manager's Schedule",
 u'resolved_url': u'http://www.paulgraham.com/makersschedule.html',
 u'sort_id': 256,
 u'status': u'1',
 u'tags': {u'management': {u'item_id': u'14878635', u'tag': u'management'},
           u'workplace': {u'item_id': u'14878635', u'tag': u'workplace'}},
 u'time_added': u'1447636682',
 u'time_favorited'
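
Most values in the raw JSON arrive as strings, including the binary flags and the Unix timestamps. A minimal Python 3 sketch of flattening one entry of `pocket_data_raw['list']` is shown here; the helper name `summarize_item` and the choice of fields are hypothetical and are not part of pocket_onboarding.py. The sample dict copies a few fields from the article printed above.

```python
from datetime import datetime, timezone


def summarize_item(item):
    """Flatten one entry of pocket_data_raw['list'] into a small dict."""
    return {
        "item_id": item["item_id"],
        # fall back to the user-supplied values if the resolved ones are missing
        "title": item.get("resolved_title") or item.get("given_title"),
        "url": item.get("resolved_url") or item.get("given_url"),
        # binary flags arrive as the strings '0' and '1'
        "favorite": item.get("favorite") == "1",
        # 'tags' maps each tag name to a small dict; keep only the names
        "tags": sorted(item.get("tags", {})),
        # time_added is a Unix timestamp stored as a string
        "time_added": datetime.fromtimestamp(int(item["time_added"]), tz=timezone.utc),
    }


sample = {
    "item_id": "14878635",
    "favorite": "0",
    "given_title": "http://www.paulgraham.com/makersschedule.html",
    "given_url": "http://www.paulgraham.com/makersschedule.html",
    "resolved_title": "Maker's Schedule, Manager's Schedule",
    "resolved_url": "http://www.paulgraham.com/makersschedule.html",
    "tags": {
        "management": {"item_id": "14878635", "tag": "management"},
        "workplace": {"item_id": "14878635", "tag": "workplace"},
    },
    "time_added": "1447636682",
}

summary = summarize_item(sample)
```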

<br>
## pocket_data_raw.pkl 
The next step was to use beautifulsoup and requests to scrape and extract <b>'article_text'</b> from <b>'resolved_url'</b>. Once that was done, I converted the JSON data to a pandas DataFrame and stored it on disk using the <a href="https://docs.python.org/2/library/pickle.html">pickle</a> module. 
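
The extraction step can be sketched roughly as follows. This is not the script's implementation: to stay self-contained it swaps beautifulsoup for Python 3's standard `html.parser` and works on an HTML string instead of fetching 'resolved_url' with requests. The names `TextExtractor` and `extract_article_text` are hypothetical, and the normalization (lowercasing, dropping punctuation and non-ASCII characters) is an assumption based on how the sample 'article_text' below looks.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def extract_article_text(html):
    """Return lowercased text with punctuation removed and whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks).lower()
    # assumption: keep ASCII letters, digits and apostrophes only
    text = re.sub(r"[^a-z0-9'\s]", " ", text)
    return " ".join(text.split())


html = (
    "<html><head><style>p {color: red}</style>"
    "<script>var x = 1;</script></head>"
    "<body><h1>Maker's Schedule</h1><p>Meetings cost them more.</p></body></html>"
)
article_text = extract_article_text(html)
```

Pages that render their content with JavaScript yield little or no text with this kind of approach, which is one reason the word count filters in the next section exist.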

In [29]:
import pandas as pd
# make sure we display all the columns
pd.set_option('display.max_columns', 25)
df = pd.read_pickle('pocket_data_raw.pkl')
print "Shape of Data: ", df.shape
print 
print "-- Sample Article --"
print
print df.iloc[3]
print
print "-- Sample 'article_text' --"
print 
print df.iloc[3].article_text[:500], '...'
print 

Shape of Data:  (873, 25)

-- Sample Article --

article_text      want to start a startup get funded by y combin...
authors                                                         NaN
excerpt           One reason programmers dislike meetings so muc...
favorite                                                          0
given_title           http://www.paulgraham.com/makersschedule.html
given_url             http://www.paulgraham.com/makersschedule.html
has_image                                                         0
has_video                                                         0
image                                                           NaN
images                                                          NaN
is_article                                                        1
is_index                                                          0
item_id                                                    14878635
resolved_id                                                14878635

<br>
## pocket_data.pkl 
This file is a cleaned-up version of <b>pocket_data_raw.pkl</b>. Here I've restricted the 25 columns to just the ones I'm interested in using, defined a couple of new columns, excluded rows with strange 'null' values that I encountered while examining the data, and excluded rows that didn't meet the word count requirements I'm enforcing. 

In a nutshell, here are the cleaning steps I took:
```
  import numpy as np
  import pandas as pd

  # drop rows with missing article_text - not sure why this happens;
  # copy so the column assignments below don't write into a slice
  df = df[pd.notnull(df['article_text'])].copy()

  # translate the status field to true/false: '0' is unread,
  # any other value is treated as archived
  df['is_archived'] = df.status.map(lambda x: int(x) != 0)

  # define actual word count as the number of whitespace-separated tokens
  df['actual_word_count'] = df.article_text.map(lambda x: len(x.split()))

  # select only the columns we are interested in using
  df = df[['resolved_id', 'word_count', 'actual_word_count', 'resolved_title',
    'resolved_url', 'article_text', 'is_archived', 'excerpt', 'tags']]

  # TfidfVectorizer, which will most likely be used to featurize,
  # doesn't work well on short texts. Also, pages that depend
  # mostly on javascript can have zero words in either column.
  df = df[df.word_count.astype(int) > 100]
  df = df[df.actual_word_count.astype(int) > 100].copy()

  # find the percent difference between the two counts, relative to word_count
  percent_diffs = np.abs(df.actual_word_count.astype(int) - df.word_count.astype(int))
  percent_diffs = (percent_diffs * 100.0) / df.word_count.astype(int)
  df['percent_diffs'] = percent_diffs

  # exclude anything with a 300% or greater difference in word count
  df = df[df.percent_diffs < 300]
```
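
To make the word count rules concrete, here is the same arithmetic for a single article, written as a small Python sketch. The function names `percent_diff` and `keep_article` are hypothetical; the thresholds (more than 100 words in both counts, less than a 300% difference) are the ones used in the cleaning steps above, and the numbers 1128 and 1134 are the two word counts of the sample article printed below.

```python
def percent_diff(word_count, actual_word_count):
    """Absolute difference between the counts, as a percentage of word_count."""
    return abs(actual_word_count - word_count) * 100.0 / word_count


def keep_article(word_count, actual_word_count, min_words=100, max_percent_diff=300):
    """Apply the same row filters as the cleaning steps to a single article."""
    if word_count <= min_words or actual_word_count <= min_words:
        return False
    return percent_diff(word_count, actual_word_count) < max_percent_diff


# Pocket counted 1128 words, the scrape found 1134
diff = percent_diff(1128, 1134)      # 6 * 100 / 1128
kept = keep_article(1128, 1134)

# a short page is dropped regardless of how well the counts agree
too_short = keep_article(90, 95)

# 700 scraped words against 150 reported is a ~367% difference, so it is dropped
mismatch = keep_article(150, 700)
```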


In [31]:
import pandas as pd
# make sure we display all the columns
pd.set_option('display.max_columns', 25)
df = pd.read_pickle('pocket_data.pkl')
print "Shape of Data: ", df.shape
print 
print "-- Sample Article --"
print
print df.iloc[3]
print

Shape of Data:  (638, 10)

-- Sample Article --

resolved_id                                                   14878635
word_count                                                        1128
actual_word_count                                                 1134
resolved_title                    Maker's Schedule, Manager's Schedule
resolved_url             http://www.paulgraham.com/makersschedule.html
article_text         want to start a startup get funded by y combin...
is_archived                                                       True
excerpt              One reason programmers dislike meetings so muc...
tags                 {u'management': {u'item_id': u'14878635', u'ta...
percent_diffs                                                 0.531915
Name: 3, dtype: object



### Column Definitions
* **resolved_id** - A unique identifier similar to item_id, but unique to the actual URL of the saved item. For example, a direct link to a New York Times article and a link that redirects to the same article (e.g., a shortened bit.ly URL) share the same resolved_id. If this value is 0, Pocket has not yet processed the item. Normally this happens within seconds, but it is possible to request the item before it has been resolved.
* **word_count** - How many words are in the article (determined by Pocket)
* **actual_word_count** - How many words are in the article (determined after scraping + cleaning)
* **resolved_title** - The title that Pocket found for the item when it was parsed
* **resolved_url** - The final URL of the item. For example, if the item was a shortened bit.ly link, this is the URL of the actual article it pointed to.
* **article_text** - The text that was scraped + cleaned
* **is_archived** - True/False (Have I read the article and archived it?)
* **excerpt** - The first few lines of the item (articles only)
* **tags** - A JSON object of the user tags associated with the item
* **percent_diffs** - Absolute difference between word_count and actual_word_count, as a percentage of word_count
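
Because **tags** holds a nested object rather than a flat list, a little unpacking is needed before the tags can be counted or used as features. The following Python sketch shows one way to do that; `count_tags` is a hypothetical helper, and the assumption that untagged articles carry a missing value (NaN or None) instead of an empty object is mine.

```python
from collections import Counter


def count_tags(tags_column):
    """Count how many articles carry each tag name."""
    counts = Counter()
    for tags in tags_column:
        # assumption: untagged articles hold NaN/None rather than a dict
        if isinstance(tags, dict):
            counts.update(tags.keys())
    return counts


example_column = [
    {"management": {"item_id": "14878635", "tag": "management"},
     "workplace": {"item_id": "14878635", "tag": "workplace"}},
    float("nan"),  # an untagged article
    {"management": {"item_id": "1", "tag": "management"}},  # made-up row
]
tag_counts = count_tags(example_column)
```

With the cleaned DataFrame loaded, the call would be `count_tags(df.tags)`.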

<br>
## pocket_links.pkl 
While using beautifulsoup to extract **'article_text'**, I also scraped the links found in each article and stored them in a DataFrame for later. I hope to build a recommendation system using these links as inputs. 
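
The link collection can be sketched along the following lines. As with the text extraction, this is a stand-in that uses Python 3's standard `html.parser` instead of beautifulsoup and is not the script's own code. The names `LinkExtractor` and `link_rows`, the example.com page, and the id 123 are hypothetical; whether the script resolves relative links or looks only at `<a>` tags is an assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag, resolved against the page URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # assumption: relative links are made absolute
            self.links.append(urljoin(self.base_url, href))


def link_rows(resolved_id, base_url, html):
    """Return (found_link, resolved_id) pairs, one per link on the page."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return [(link, resolved_id) for link in parser.links]


html = (
    '<p><a href="/about">About</a> '
    '<a href="next.html">Next</a> '
    '<a href="https://www.pinterest.com/timemagazine/">Pinterest</a> '
    '<a name="top">no href here</a></p>'
)
rows = link_rows(123, "http://example.com/articles/post.html", html)
```

Each pair corresponds to one row of the two-column frame loaded below, so an article with many links contributes many rows.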

In [37]:
links = pd.read_pickle('pocket_links.pkl')
print links.shape
links.head(5)

(42301, 2)


|   | found_link | resolved_id |
|---|---|---|
| 0 | http://time.com/3928685/nepal-earthquake-recov... | 909801869 |
| 1 | http://cgi.timeinc.net/cgi-bin/mail/dnp/terms_... | 909801869 |
| 2 | http://time.com/us/ | 909801869 |
| 3 | https://secure.customersvc.com/servlet/Show?WE... | 909801869 |
| 4 | https://www.pinterest.com/timemagazine/ | 909801869 |


In [41]:
print links.resolved_id.value_counts().head()
print links.resolved_id.nunique()

332012769    589
322918430    465
941408661    431
915810002    420
624520141    391
Name: resolved_id, dtype: int64
827
