#### Notebook purpose:
This notebook collects data from the Event Registry API (http://eventregistry.org). We extracted 10k articles related to politics, 10k articles related to sports, and 10k articles related to neither politics nor sports (we'll use this third sample to build the document classifiers). The 10k documents in each sample are selected by the relevance measure returned by the Event Registry API.

#### Start session with Event Registry API

In [1]:
from eventregistry import *

In [None]:
key = '' # add your key here
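
To keep the key out of the notebook, it could instead be read from an environment variable. This is only a sketch: the variable name `EVENT_REGISTRY_API_KEY` and the helper `load_api_key` are hypothetical and not part of the `eventregistry` package.

```python
import os

def load_api_key(env=os.environ, name="EVENT_REGISTRY_API_KEY"):
    """Return the API key stored under `name`, or '' if it is not set.

    Both the helper and the variable name are illustrative assumptions.
    """
    return env.get(name, "")

key = load_api_key()
```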

In [2]:
er = EventRegistry(apiKey=key)

using user provided API key for making requests
Event Registry host: http://eventregistry.org
Text analytics host: http://analytics.eventregistry.org


#### Specify the date range for the queries

In [3]:
import datetime

In [4]:
end_date = datetime.date(2019, 3, 9)
start_date = end_date - datetime.timedelta(days=30)
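
As a quick sanity check, the window defined above can be printed before running any query. The `window` tuple is an illustrative name introduced here, not something the notebook uses later.

```python
import datetime

end_date = datetime.date(2019, 3, 9)
start_date = end_date - datetime.timedelta(days=30)

# Both bounds as ISO strings, handy for logging or file names
window = (start_date.isoformat(), end_date.isoformat())
print(window)  # ('2019-02-07', '2019-03-09')
```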

#### Specify the information required for each article

In [5]:
output = ArticleInfoFlags(bodyLen=-1,
                          basicInfo=True,
                          title=True,
                          body=True,
                          location=True,
                          url=False,
                          eventUri=False,
                          authors=False,
                          concepts=False,
                          categories=True,
                          links=False,
                          videos=False,
                          image=False,
                          sentiment=False,
                          dates=False,
                          extractedDates=False,
                          duplicateList=False,
                          originalArticle=False,
                          storyUri=False)

#### Specify the queries to run

In [6]:
q_sports = QueryArticlesIter(dateStart=start_date, 
                             dateEnd=end_date,
                             conceptUri=er.getConceptUri('sports'),
                             categoryUri=er.getCategoryUri('sports'),
                             lang='eng',
                             isDuplicateFilter='skipDuplicates')

In [7]:
q_politics = QueryArticlesIter(dateStart=start_date, 
                               dateEnd=end_date,
                               conceptUri=er.getConceptUri('politics'),
                               categoryUri=er.getCategoryUri('politics'),
                               lang='eng',
                               isDuplicateFilter='skipDuplicates')

In [8]:
q_other = QueryArticlesIter(dateStart=start_date, 
                            dateEnd=end_date,
                            ignoreConceptUri=QueryItems.OR([er.getConceptUri('politics'), er.getConceptUri('sports')]),
                            ignoreCategoryUri=QueryItems.OR([er.getCategoryUri('politics'), er.getCategoryUri('sports')]),
                            lang='eng',
                            isDuplicateFilter='skipDuplicates')

#### Turn the query results into pandas DataFrames and save the results

In [9]:
import utils_api as utils
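
The `utils_api` module is not shown in this notebook. The sketch below is a guess at the kind of row-building step a helper like `build_df_from_query` might perform, not its actual implementation. The function names `article_to_row` and `collect_rows` are hypothetical, and the article field names (`uri`, `title`, `body`, `categories`) are assumed from the fields requested above. It runs on stand-in data so that it does not need an API key.

```python
from itertools import islice

def article_to_row(article):
    """Flatten one article dict into a flat row (field names are assumed)."""
    categories = article.get("categories") or []
    return {
        "uri": article.get("uri"),
        "title": article.get("title"),
        "body": article.get("body"),
        # Join category URIs into one string so the row fits a single CSV cell
        "categories": ";".join(c.get("uri", "") for c in categories),
    }

def collect_rows(articles, max_items):
    """Take at most max_items articles from any iterable and flatten them."""
    return [article_to_row(a) for a in islice(articles, max_items)]

# Stand-in data. With the real API, the iterable would presumably come from
# something like q_sports.execQuery(er, sortBy="rel", maxItems=10000).
sample = [
    {"uri": "1", "title": "Match report", "body": "...",
     "categories": [{"uri": "news/Sports"}]},
    {"uri": "2", "title": "Transfer news", "body": "...", "categories": []},
    {"uri": "3", "title": "Extra", "body": "...", "categories": None},
]
rows = collect_rows(sample, 2)
```

A list of flat dicts like `rows` can be passed directly to `pandas.DataFrame(rows)` to obtain a table with one column per key.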

In [15]:
%%time 
sports_article_df = utils.build_df_from_query(q_sports, er, 10000, ReturnInfo(articleInfo=output))
sports_article_df.to_csv('../data/sports_10k.csv', index=False)

CPU times: user 1h 31min 27s, sys: 15min 8s, total: 1h 46min 36s
Wall time: 22min 42s


In [16]:
%%time
politics_article_df = utils.build_df_from_query(q_politics, er, 10000, ReturnInfo(articleInfo=output))
politics_article_df.to_csv('../data/politics_10k.csv', index=False)

CPU times: user 2h 5min 59s, sys: 21min 3s, total: 2h 27min 3s
Wall time: 30min 54s


In [17]:
%%time 
other_article_df = utils.build_df_from_query(q_other, er, 10000, ReturnInfo(articleInfo=output))
other_article_df.to_csv('../data/other_10k.csv', index=False)

CPU times: user 56min 22s, sys: 8min 21s, total: 1h 4min 43s
Wall time: 18min 40s
