# Feedly Data Extraction Demo

Python 3

In [292]:
from feedly.client import *
from feedly import *
from newspaper import Article, ArticleException # http://newspaper.readthedocs.io/en/latest/
from time import sleep
import numpy as np
import pandas as pd
import datetime
import math
pd.set_option('display.max_rows', 300)

In [36]:
## IAP credentials
TOKEN = "A2zjasgZJawkY8etL3a9w1QP_BFLH7YcnaW_s-7kR7oU8Nkrz-ZY8spKj_rGuqYtyAJ4vYItikat_WS35cBCKA9jqYrbg_frpzLL_987_THA8BB4cXYfVGReSQMoScif6g7HI72_aKHYcheyqFVjObZX6QYiCZbrDAyzE1XvvvORiy8MjTSwRXQoX3in0_ywGYgFsfxJRA5M073PVSJJDv0Tv67JxC-GlvFRV3xiLqthS3Ed_8Qzztk:feedlydev"
FEEDLY_REDIRECT_URI = "http://fabreadly.com/auth_callback"
FEEDLY_CLIENT_ID="d8f62d80-bd91-4b23-bdc3-c219d0489a26"

# Load the Feed

Reference: [Feedly Documentation](https://developer.feedly.com/cloud/)

In [37]:
import json
import requests

# Feedly
feedaccess = TOKEN

## Use url below to get the feed ids. 
myurl = 'https://cloud.feedly.com/v3/subscriptions'
headers = {'Authorization': 'OAuth ' + feedaccess}
res = requests.get(url=myurl, headers=headers)
con = res.json()
output = json.dumps(con , indent=4)

# See all IAP Feeds and their IDs

The API lets you pull specific feeds by ID or pull everything. One complication is that IAP is also a content generator and pushes its own entries to Feedly, so those come back in the pull as well. Anything marked EWS, hosted on ewsdata.rightsindevelopment.org, or linking to that site is an IAP-generated entry and needs to be filtered out. This should be fairly easy to do with the metadata available for each item. The other option is to pull from a set of individual feeds, but my guess is that filtering will be easier.
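A minimal sketch of such a filter, assuming each item carries its link under `alternate[0]['href']` (the field the processing code in this notebook reads) and, as an assumption about the item metadata, its source under `origin['htmlUrl']` and `origin['title']`. The helper name `is_iap_generated`, the marker tuple, and the sample items are hypothetical.

```python
# Hypothetical markers for IAP-generated entries, taken from the description above.
IAP_MARKERS = ("ewsdata.rightsindevelopment.org", "ews.rightsindevelopment.org")

def is_iap_generated(item):
    """Return True if the item's link, source URL, or source title points to IAP/EWS."""
    links = [alt.get("href", "") for alt in item.get("alternate", [])]
    origin = item.get("origin", {})  # assumed metadata field
    links.append(origin.get("htmlUrl", ""))
    if any(marker in link for link in links for marker in IAP_MARKERS):
        return True
    # Source titles marked "EWS" also identify IAP-generated entries.
    return "EWS" in origin.get("title", "").split()

# Made-up items for illustration only.
sample_items = [
    {"title": "Project alert",
     "alternate": [{"href": "https://ews.rightsindevelopment.org/projects/x"}]},
    {"title": "Bank approves loan",
     "alternate": [{"href": "https://example.com/news/1"}],
     "origin": {"title": "NEWS ADB - All Streams", "htmlUrl": "https://example.com"}},
    {"title": "Another alert", "alternate": [], "origin": {"title": "All - EWS"}},
]

external_items = [item for item in sample_items if not is_iap_generated(item)]
print([item["title"] for item in external_items])
```

Only the second sample item survives the filter, since the first links to the EWS site and the third comes from a source marked EWS.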

Code below shows all the different feeds - the ALL feed is not listed but is coded as an option in the `pull_feed` function.

In [38]:
def see_feeds(feedaccess=feedaccess):
    """
        Get the list of IAP feeds and the feed id - we need this when pulling the feed data. 
    """
    myurl = 'https://cloud.feedly.com/v3/subscriptions'
    headers = {'Authorization': 'OAuth ' + feedaccess}
    res = requests.get(url=myurl, headers=headers)
    con = res.json()
    output = json.dumps(con , indent=4)
    df = pd.DataFrame([(c['title'] , c['categories'][0]['id']) for c in con])
    df.columns = ['Title','id']
    return df, con

In [39]:
df, raw  = see_feeds(feedaccess)

In [40]:
df.head()

Unnamed: 0,Title,id
0,All - EWS,user/d8f62d80-bd91-4b23-bdc3-c219d0489a26/cate...
1,EWS SA,user/d8f62d80-bd91-4b23-bdc3-c219d0489a26/cate...
2,ADB,user/d8f62d80-bd91-4b23-bdc3-c219d0489a26/cate...
3,WB,user/d8f62d80-bd91-4b23-bdc3-c219d0489a26/cate...
4,"Title: World Bank, Text: Loan",user/d8f62d80-bd91-4b23-bdc3-c219d0489a26/cate...


In [41]:
def pull_feed(feed_id, feedcount, all_feeds=False, feedaccess=feedaccess):
    """
    Pull feed items from the Feedly API and return a list of the pulled JSON objects.
    A list is returned because pulling more than 1000 items takes several requests,
    each of which yields its own JSON object.

    feed_id: Id of the feed we want to pull from. (str)
    feedcount: Target number of items to pull from the feed. (int)
    all_feeds: If True, pull all items in the IAP feed; feed_id is ignored. (bool)
    feedaccess: Token information. (str)
    """
    feedcount = int(feedcount)
    page_size = min(feedcount, 1000)  # each request returns at most 1000 items
    continuation_rounds = math.ceil(feedcount / 1000.0)
    json_data = []
    continuation_id = None
    if all_feeds:
        feed_id = 'user/d8f62d80-bd91-4b23-bdc3-c219d0489a26/category/global.all'

    # Use https (as for the subscriptions call) so the token is not sent in clear text
    myurl = "https://cloud.feedly.com/v3/streams/contents"
    headers = {'Authorization': 'OAuth ' + feedaccess}

    for i in range(continuation_rounds):
        print('Pulling Data - Round %s' % str(i+1))
        # Passing params lets requests URL-encode the stream id
        params = {'streamId': feed_id, 'count': page_size}
        if continuation_id:
            params['continuation'] = continuation_id
        res = requests.get(url=myurl, headers=headers, params=params)
        res.raise_for_status()  # fail loudly on an expired token or bad stream id
        con = res.json()
        json_data.append(con)

        if feedcount > 1000:
            print(con.keys())
        # The last page has no 'continuation' key; stop instead of raising KeyError
        continuation_id = con.get('continuation')
        if continuation_id is None:
            break

    print('Complete')
    return json_data
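
The continuation logic can be exercised without network access. The sketch below is not the Feedly client: `fake_fetch` is a hypothetical stand-in that mimics only the `items` and `continuation` keys seen in the responses, and `pull_pages` is an illustrative loop that stops either at the target count or when no continuation id comes back.

```python
import math

def fake_fetch(count, continuation=None, total=2500):
    """Hypothetical stand-in for one API request over a stream of `total` items."""
    start = int(continuation) if continuation else 0
    end = min(start + count, total)
    page = {"items": list(range(start, end))}
    if end < total:
        page["continuation"] = str(end)  # omitted on the last page
    return page

def pull_pages(fetch, target, page_size=1000):
    """Collect pages until `target` items are requested or the stream runs out."""
    pages, continuation = [], None
    for _ in range(math.ceil(target / page_size)):
        page = fetch(page_size, continuation)
        pages.append(page)
        continuation = page.get("continuation")
        if continuation is None:
            break  # nothing left to pull
    return pages

pages = pull_pages(fake_fetch, target=20000)
print(len(pages), sum(len(p["items"]) for p in pages))
```

With a simulated stream of 2500 items, the loop ends after three pages even though the target would allow twenty rounds.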

In [237]:
pulled_json = pull_feed('',20000,all_feeds=True)

Pulling Data - Round 1
dict_keys(['updated', 'continuation', 'items', 'id'])
Pulling Data - Round 2
dict_keys(['updated', 'continuation', 'items', 'id'])
Pulling Data - Round 3
dict_keys(['updated', 'continuation', 'items', 'id'])
Pulling Data - Round 4
dict_keys(['updated', 'continuation', 'items', 'id'])
Pulling Data - Round 5
dict_keys(['updated', 'continuation', 'items', 'id'])
Pulling Data - Round 6
dict_keys(['updated', 'continuation', 'items', 'id'])
Pulling Data - Round 7
dict_keys(['updated', 'continuation', 'items', 'id'])
Pulling Data - Round 8
dict_keys(['updated', 'continuation', 'items', 'id'])
Pulling Data - Round 9
dict_keys(['updated', 'continuation', 'items', 'id'])
Pulling Data - Round 10
dict_keys(['updated', 'continuation', 'items', 'id'])
Pulling Data - Round 11
dict_keys(['updated', 'continuation', 'items', 'id'])
Pulling Data - Round 12
dict_keys(['updated', 'continuation', 'items', 'id'])
Pulling Data - Round 13
dict_keys(['updated', 'continuation', 'items', 'i

---------------

# Process the Feed 

Convert to a dataframe

#TODO - Figure out which tags we need to preserve here, such as the news feed each item was pulled from, which should be valuable for identifying the bank being mentioned. We might already have all of them, but there may be others that could be useful.

items reference: https://developer.feedly.com/v3/entries/

In [238]:
def process_pulled_data(json_data):
    """Flatten the pulled JSON objects into one DataFrame with a row per feed item."""
    df_data = []

    for data in json_data:
        for vals in data['items']:
            article_data = [vals['fingerprint'], vals['published'], vals['title'],
                            vals['alternate'][0]['href'], vals['categories'][0]['label']]
            # 'content' and 'summary' are optional; store None when either is missing
            article_data.append((vals.get('content') or {}).get('content'))
            article_data.append((vals.get('summary') or {}).get('content'))
            df_data.append(article_data)

    columns = ['article_id','published','title','url','feed_label','content','summary']
    df = pd.DataFrame(df_data, columns=columns)
    # 'published' is a Unix timestamp in milliseconds
    df.published = [datetime.datetime.fromtimestamp(i/1000.0) for i in df.published]
    return df

In [239]:
json2df = process_pulled_data(pulled_json)

---------------------

## Filter the Items 

**Remove the EWS posts: these come from IAP, so we don't need to process them**

In [240]:
json2df['keep'] = [False if 'ews.rightsindevelopment.org' in i else True for i in json2df.url]

In [241]:
json2df.keep.value_counts()

True     12098
False     7902
Name: keep, dtype: int64

In [242]:
json2df = json2df[json2df.keep]

**Filter out File Uploads and other Non-Articles**

It appears that if the `summary` field is empty the item is not an article. 

In [243]:
json2df = json2df[json2df['summary'].notnull()]
print(json2df.shape)

(11036, 8)


---------------

## De-Dupe a Bit 

This is just a basic group by. It does not look for articles whose content is duplicated under a different URL or from a different source.
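To make the limits of exact-key grouping concrete, here is a small sketch in plain Python rather than pandas. The rows are made up for illustration: two of them share the same id, title, and URL and are merged, while the third has the same title under a different id and URL and stays separate.

```python
from collections import defaultdict

# Made-up rows for illustration only.
rows = [
    {"article_id": "a1", "title": "ADB loan", "url": "https://example.com/a", "feed_label": "ADB"},
    {"article_id": "a1", "title": "ADB loan", "url": "https://example.com/a", "feed_label": "WB"},
    {"article_id": "b2", "title": "ADB loan", "url": "https://example.org/a", "feed_label": "ADB"},
]

# Group on the exact (article_id, title, url) key and collect the feed labels.
groups = defaultdict(set)
for row in rows:
    key = (row["article_id"], row["title"], row["url"])
    groups[key].add(row["feed_label"])

deduped = {key: ",".join(sorted(labels)) for key, labels in groups.items()}
print(len(deduped))
print(deduped[("a1", "ADB loan", "https://example.com/a")])
```

Sorting the labels before joining keeps the combined label string stable between runs.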

In [277]:
grp_df = json2df.groupby(['article_id','title','url','keep']).agg({
    'content':'min',
    'summary':'min',
    'published':'max',
    'feed_label': lambda x: ','.join(set(x))}
    ).reset_index()

In [278]:
grp_df.shape

(9194, 8)

--------------------

## Export - Pre Scrape

This is just to create a file the IAP can use to create labeled data. 

In [296]:
# grp_df = grp_df.sample(frac=1)

# print(grp_df.shape)

# grp_df[['article_id','published','title','url','feed_label']].to_csv('../Temp_Output/article20k_pull4labeling.csv',index=False)

----------------------------

## Scrape the articles

**NOTE** - This is slow, so it may need to run in batches, overnight, or both.

Scraping the article content: I've pulled around 20K items, which comes to around 11K actual news articles after filtering (about 9.2K after the de-dupe above). This is a lot of content to scrape at once, hence the file cache. The code is a little hacky; I just wanted to get some content pulled quickly. We don't want to be pulling data at the event.

**Cache**
I'm saving the "scraped" article content in a dictionary and then writing it to a file. If we change the information we are pulling using the newspaper library we will need to recreate this cache. 
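
A minimal sketch of this caching pattern, using only the standard library and a temporary directory so it does not touch the real cache file. The helper names `load_cache` and `save_cache` are hypothetical, and writing to a temporary file before swapping it in is an added precaution, not something the notebook does.

```python
import os
import pickle
import tempfile

def load_cache(path):
    """Load the cache dict from `path`, or start empty if the file does not exist yet."""
    try:
        with open(path, "rb") as fh:
            return pickle.load(fh)
    except FileNotFoundError:
        return {}

def save_cache(cache, path):
    """Write to a temporary file, then swap it in, so an interrupted write
    cannot leave a truncated cache behind."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as fh:
        pickle.dump(cache, fh)
    os.replace(tmp_path, path)

demo_path = os.path.join(tempfile.mkdtemp(), "article_cache.pkl")
demo_cache = load_cache(demo_path)  # empty dict on the first run
demo_cache["example-id"] = ("article text", ["keyword"])
save_cache(demo_cache, demo_path)
print(load_cache(demo_path) == demo_cache)
```

Catching only `FileNotFoundError` means a corrupt or unreadable cache raises an error instead of silently being replaced by an empty one.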

In [353]:
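import pickle  # needed here; without it the bare except below would hide a NameError and reset the cache

try:
    with open('../Temp_Output/article_cache.pkl', 'rb') as file:
        cache = pickle.load(file)
except FileNotFoundError:  # no cache file yet, so start with an empty one
    cache = {}
    print('Creating New Cache.. Is this Correct?')
    
error_count = 0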

In [356]:
def get_text_via_Article(url, article_id, try_hard=False):
    """
    Scrapes the article content and keywords using the newspaper3k module (http://newspaper.readthedocs.io/en/latest/)
    and stores them in the global cache under article_id. Always returns None.
    """
    
    global cache  ## Write to the global cache object so we can interrupt the run without losing the data
    global error_count
    if article_id not in cache:  ## Check to see if we have already scraped this article (maybe in a previous run of this)
        article = Article(url)   ## Newspaper Article Object
        article.download()
        try:
            article.parse()  ## Sometimes this step fails because the download doesn't complete
        except ArticleException:  ## In that case we give the download an additional 10 seconds to complete. 
            if try_hard: ## If we want to actually try to download the ones that failed
                print('Encountered Exception',url)
                article.download()
                print('\nGoing to try a longer download period. ')
                sleep(10)  # Sometimes it takes a little while to download the article; waiting longer helps but slows the run
                try:  
                    article.parse()  ## Try again 
                except ArticleException:  #Otherwise lets just keep going
                    print('Failed - Article Not Downloaded\n')
                    error_count += 1
                    return None 
            else:
                error_count += 1
                return None
        ## Now Process Article 
        article.nlp()
        cache[article_id] = (article.text, article.keywords)
        return None
    else:
        return None
    

**Loop over the dataframe and extract the article content (if it hasn't been scraped yet)**

In [357]:
for cnt, idx in enumerate(grp_df.index):
    if cnt%25 == 0:
        print('** Iteration Count', cnt,' **')
        print('** Error Count', error_count, ' **')
    row = grp_df.loc[idx]
    get_text_via_Article(row['url'],row['article_id'])

** Completed Count 0  **
** Error Count 0  **
Article `download()` failed with 404 Client Error: Not Found for url: https://www.adb.org/node/415051?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+adb_news+%28ADB.org+News+Releases+RSS%29 on URL http://feedproxy.google.com/~r/adb_news/~3/-LPJT1oP4-w/415051
** Completed Count 25  **
** Error Count 1  **
Article `download()` failed with 404 Client Error: Not Found for url: http://barbadostoday.bb/2018/02/13/arthurs-cure/ on URL https://www.barbadostoday.bb/2018/02/13/arthurs-cure/
** Completed Count 50  **
** Error Count 2  **
** Completed Count 75  **
** Error Count 2  **
Article `download()` failed with 503 Server Error: Service Temporarily Unavailable for url: https://www.brecorder.com/2018/06/06/421675/profile-of-dr-shamshad-akhtar/ on URL https://www.brecorder.com/2018/06/06/421675/profile-of-dr-shamshad-akhtar/
** Completed Count 100  **
** Error Count 3  **
** Completed Count 125  **
** Error Count 3  **
Article `download

  " Skipping tag %s" % (size, len(data), tag))
  " Skipping tag %s" % (size, len(data), tag))
  " Skipping tag %s" % (size, len(data), tag))
  " Skipping tag %s" % (size, len(data), tag))


Article `download()` failed with 404 Client Error: Unknown site! for url: http://energyinfrastructure.cleantechnology-business-review.com/news/eib-provides-eur100m-loan-to-acciona-to-develop-digitalisation-strategy-220118-6033130 on URL http://energyinfrastructure.cleantechnology-business-review.com/news/eib-provides-eur100m-loan-to-acciona-to-develop-digitalisation-strategy-220118-6033130
Article `download()` failed with 410 Client Error: Gone for url: http://www.wral.com/the-latest-mnuchin-urges-lending-shift-to-poorer-countries/17501859/ on URL http://www.wral.com/the-latest-mnuchin-urges-lending-shift-to-poorer-countries/17501859/
You must `download()` an article first!
** Completed Count 3625  **
** Error Count 129  **
Article `download()` failed with 404 Client Error: Not Found for url: https://economictimes.indiatimes.com/news/international/business/adb-china-backed-aiib-to-co-finance-more-projects-this-year/articleshow/62474176.cms on URL https://economictimes.indiatimes.com/ne

KeyboardInterrupt: 

In [358]:
import pickle

with open('../Temp_Output/article_cache.pkl', 'wb') as file:
    pickle.dump( cache, file)
grp_df['scraped_content'] = grp_df['article_id'].map(cache)

In [366]:
grp_df['article_text'] = [i[0] if pd.notnull(i) else np.nan for i in grp_df.scraped_content]
grp_df['article_keywords'] = [i[1] if pd.notnull(i) else np.nan for i in grp_df.scraped_content]

In [367]:
grp_df.head()

Unnamed: 0,article_id,title,url,keep,summary,content,published,feed_label,scraped_content,article_text,article_keywords
7280,cd88676c,Tonga: World Bank Drone-Led Damage Assessments...,https://reliefweb.int/report/tonga/tonga-world...,True,"<table border=""0"" cellspacing=""3"" cellpadding=...",,2018-02-22 00:03:16,NEWS WB- All Streams,"(Nuku’alofa, February 22, 2018 – Following the...","Nuku’alofa, February 22, 2018 – Following the ...","[insurance, work, development, pacific, world,..."
8244,e72cf1f7,China's unstoppable momentum,https://gooruf.com/uk/news/2018/01/17/china-un...,True,"<table border=""0"" cellspacing=""3"" cellpadding=...",,2018-01-17 06:56:52,NEWS AIIB - All Streams,(1 1\n\nSince there seem to be 2 very differen...,1 1\n\nSince there seem to be 2 very different...,"[candidates, different, usual, 1since, momentu..."
2058,44fcbdb7,ADB agrees $375m loan for Madhya Pradesh irrig...,https://www.txfnews.com/Ticker/Redirect/84616f...,True,"<table border=""0"" cellspacing=""3"" cellpadding=...",,2018-06-01 03:32:41,NEWS ADB - All Streams,(The Asian Development Bank has approved a $37...,The Asian Development Bank has approved a $375...,"[study, agrees, 375m, work, systems, system, s..."
7052,c77bc305,Ethiopia: European Investment Bank Injects €3m...,http://allafrica.com/stories/201802060591.html,True,"<table border=""0"" cellspacing=""3"" cellpadding=...",,2018-02-06 06:46:17,NEWS EIB - All streams,(The European Investment Bank (EIB) committed ...,The European Investment Bank (EIB) committed t...,"[system, mobile, ethiopia, financial, european..."
6403,b6e91d1e,"AIIB approves two new applicants, expands memb...",http://www.xinhuanet.com/english/2018-05/02/c_...,True,"<table border=""0"" cellspacing=""3"" cellpadding=...",,2018-05-02 01:58:00,NEWS AIIB - All Streams,(Source: Xinhua| 2018-05-02 13:46:27|Editor: C...,Source: Xinhua| 2018-05-02 13:46:27|Editor: Ch...,"[aiib, applicants, 86, expands, approves, pros..."


## Review Results and add in Some Language Detection

IAP currently only has content in English, so tagging articles in other languages is likely too complicated at this time, as it would also involve a translation step. We may therefore want to filter out non-English articles.

Newspaper3k has this functionality, but it is slow; langdetect works pretty fast.

In [381]:
from langdetect import detect_langs
from langdetect.lang_detect_exception import LangDetectException

In [391]:
detect_langs("""
'Nuku’alofa, February 22, 2018 – Following the severe impact of Tropical Cyclone Gita, the World Bank has now begun work to support the government of Tonga, which is leading a Rapid Damage Assessment to assist with recovery and reconstruction planning in the coming months.\n\n“Our work in mapping the damage wreaked by Cyclone Gita will be crucial to helping the government of Tonga to determine priority areas for recovery and reconstruction,” said World Bank Country Director for Papua New Guinea and the Pacific Islands, Michel Kerf. “In the immediate aftermath of recent natural disasters in the Pacific, including cyclones Winston (Fiji, 2016) and Pam (Vanuatu, 2015)_, the World Bank has been called upon to lead the immediate damage assessment process.”_\n\nThe World Bank, together with partners including the governments of Australia and New Zealand, the Asian Development Bank, Japan International Cooperation Agency, European Union and United Nations Development Programme, is now working alongside Tongan authorities to identify priority sectors for the rapid damage assessment, which include housing, agriculture and energy.\n\nAs part of this assessment work, a fleet of Unmanned Aerial Vehicles (UAVs, or drones) have been transported to Tonga with the support of the Australian government, to provide a comprehensive visual assessment of the damage caused by Cyclone Gita.\n\nTonga has received a payout of more than US$3.5 million from the Pacific Catastrophe Risk Insurance Company (PCRIC) – the first payout made by the region’s first catastrophe risk insurance platform established in 2016. 
PCRIC was formed as part of the World Bank’s regional project PCRAFI: Furthering Disaster Risk Finance in the Pacific, which provides technical assistance to 14 Pacific Island countries, with financial support from Germany, Japan, the United Kingdom and the United States of America.\n\n“Despite the tragic circumstances, it has been good to see the Pacific Catastrophe Risk Insurance Company delivering much-needed relief through its disaster insurance system,” _said Mr. Kerf. “This is the first payout of its kind, and is a testament to the hard work of many governments and development partners, who have worked hard over many years to create this critical support system for the Pacific Islands, home to many of the world’s most disaster at-risk countries.”_\n\nThe World Bank continues to stand as a dedicated partner in resilient development in the Pacific Islands.'""")

[en:0.9999970030496718]

In [417]:
def detect_lang(x):
    try:
        return detect_langs(x)
    except (LangDetectException, TypeError):  # undetectable text, or NaN where nothing was scraped
        return np.nan

In [418]:
grp_df['lang'] = grp_df.article_text.apply(detect_lang)

In [409]:
test

Unnamed: 0,article_id,title,url,keep,summary,content,published,feed_label,scraped_content,article_text,article_keywords,lang
7280,cd88676c,Tonga: World Bank Drone-Led Damage Assessments...,https://reliefweb.int/report/tonga/tonga-world...,True,"<table border=""0"" cellspacing=""3"" cellpadding=...",,2018-02-22 00:03:16,NEWS WB- All Streams,"(Nuku’alofa, February 22, 2018 – Following the...","Nuku’alofa, February 22, 2018 – Following the ...","[insurance, work, development, pacific, world,...",
8244,e72cf1f7,China's unstoppable momentum,https://gooruf.com/uk/news/2018/01/17/china-un...,True,"<table border=""0"" cellspacing=""3"" cellpadding=...",,2018-01-17 06:56:52,NEWS AIIB - All Streams,(1 1\n\nSince there seem to be 2 very differen...,1 1\n\nSince there seem to be 2 very different...,"[candidates, different, usual, 1since, momentu...",
2058,44fcbdb7,ADB agrees $375m loan for Madhya Pradesh irrig...,https://www.txfnews.com/Ticker/Redirect/84616f...,True,"<table border=""0"" cellspacing=""3"" cellpadding=...",,2018-06-01 03:32:41,NEWS ADB - All Streams,(The Asian Development Bank has approved a $37...,The Asian Development Bank has approved a $375...,"[study, agrees, 375m, work, systems, system, s...",
7052,c77bc305,Ethiopia: European Investment Bank Injects €3m...,http://allafrica.com/stories/201802060591.html,True,"<table border=""0"" cellspacing=""3"" cellpadding=...",,2018-02-06 06:46:17,NEWS EIB - All streams,(The European Investment Bank (EIB) committed ...,The European Investment Bank (EIB) committed t...,"[system, mobile, ethiopia, financial, european...",
6403,b6e91d1e,"AIIB approves two new applicants, expands memb...",http://www.xinhuanet.com/english/2018-05/02/c_...,True,"<table border=""0"" cellspacing=""3"" cellpadding=...",,2018-05-02 01:58:00,NEWS AIIB - All Streams,(Source: Xinhua| 2018-05-02 13:46:27|Editor: C...,Source: Xinhua| 2018-05-02 13:46:27|Editor: Ch...,"[aiib, applicants, 86, expands, approves, pros...",
8409,eb93e49a,"How to attract finance into real estate, by ex...",https://newtelegraphonline.com/2018/02/300m-cr...,True,"<table border=""0"" cellspacing=""3"" cellpadding=...",,2018-02-13 09:15:00,NEWS WB- All Streams,(The implementation progress of the Nigeria Ho...,The implementation progress of the Nigeria Hou...,"[credit, scheme, sector, nmrc, world, low, nig...",
2491,50a4db7e,AIIB eyes dollar bond,http://www.globaltimes.cn/content/1083718.shtml,True,"<table border=""0"" cellspacing=""3"" cellpadding=...",,2018-01-07 08:08:40,NEWS AIIB - All Streams,(The Asian Infrastructure Investment Bank ( AI...,The Asian Infrastructure Investment Bank ( AII...,"[aiib, end, soon, treasurer, eyes, window, dol...",
6842,c29a9119,Six new peeping frogs discovered in western Me...,https://news.mongabay.com/2018/05/six-new-peep...,True,"<img alt="""" sizes=""(max-width: 100px) 100vw, 1...",Scientists have discovered six new species of ...,2018-05-10 20:00:00,NEWS - Mongabay,(Scientists have discovered six new species of...,Scientists have discovered six new species of ...,"[mexico, jalisco, scientists, grünwald, eleuth...",
8947,fa38adc,DLL and EIB Provide €200MM to Finance Small Bu...,https://www.monitordaily.com/news-posts/dll-ei...,True,"<table border=""0"" cellspacing=""3"" cellpadding=...",,2018-01-31 10:13:31,NEWS EIB - All streams,(The European Investment Bank (EIB) granted a ...,The European Investment Bank (EIB) granted a €...,"[million, provide, smes, 100, dll, small, 200m...",
509,1e0855,"No need to devalue rupee again, says Miftah Is...",https://tribune.com.pk/story/1691851/2-no-need...,True,"<table border=""0"" cellspacing=""3"" cellpadding=...",,2018-04-21 12:27:00,NEWS ADB - All Streams,(Advise­r to PM on financ­e says curren­cy has...,Advise­r to PM on financ­e says curren­cy has ...,"[need, trade, rupee, international, pakistan, ...",


In [402]:
detect_langs(grp_df.article_text.iloc[0])

[en:0.9999975243547723]

In [48]:
for i in test_df.head().scraped_content:
    try:
        print(detect_langs(i))
    except LangDetectException:  
        continue
    print(i[0:300],
         '\n')
    print ('*******')

[en:0.9999952972346535]
European Union institution the European Investment Bank (EIB) has renewed its partnership with French port Marseille Fos under a €50 million funding agreement to support five key development projects.



The projects, which require a total investment of €136 million, include connecting the two exist 

*******
[en:0.9999991131684797]
India signed a USD 500 million (Rs 3,371 crore) loan pact with World Bank today to provide additional financing for PMGSY rural road projects.

The loan has a 3-year grace period, and a maturity of 10 years, the finance Ministry said in a release.

It will provide additional financing for the Pradha 

*******
[en:0.9999952599411848]
9700 Jamaican families enrolled in study

Data on fathers’ impact on child development collected for the first time

The UWI JA KIDS Birth Cohort Study Research Team will host a conference at the University of the West Indies from May 31 to June 1 to share ground-breaking findings from their seven-y 

***

# TODO 

1. Detect language
2. Error test on a larger set
3. Manually verify that the extract is generally correct
4. Extract article content for all articles in the DataDive dataset (could take a long time)
5. Method for identifying previously scraped articles (we need a unique identifier so that we only scrape new articles in the future). This is just something to keep in mind. `article_id` might be fine (it is based on the JSON item's `fingerprint` field).
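
For item 5, one possible approach is to compare incoming ids against the keys already in the cache. This is only a sketch: the helper `ids_to_scrape` is hypothetical, and it assumes `article_id` is a stable unique identifier, which is the open question noted above.

```python
def ids_to_scrape(article_ids, cache):
    """Return the ids not yet in the cache, preserving order and dropping repeats."""
    seen = set(cache)
    pending = []
    for article_id in article_ids:
        if article_id not in seen:
            seen.add(article_id)
            pending.append(article_id)
    return pending

# Toy cache holding one previously scraped article.
cached = {"cd88676c": ("text", ["kw"])}
print(ids_to_scrape(["cd88676c", "e72cf1f7", "e72cf1f7", "44fcbdb7"], cached))
```

The already cached id and the repeated id are both skipped, so only two ids remain to scrape.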

-------------------

# End