<div style="background:#FFFFEE; color:#440404; padding:8px; border-radius: 4px; text-align: center; font-weight: 500;">Data Analytics for Strategic Decision Makers</div>

# Extending Analytics - Guardian API access


## Guardian API access

The first step is to query the Guardian API for articles that can help explain the program.

In [2]:
#import required libraries
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation, NMF
import pandas as pd
import numpy as np
import random
import requests
import json
import re
import time




In [3]:
#load personal API key
with open('KEY/key.txt', 'r') as file:
    key = file.read().strip()
key

'9b23a079-34d8-43eb-b5cc-60cf458199f8'

To find articles closely related to the topic, I combined 'UQ', 'COVID-19 Vaccine' and 'Queensland University' with AND as the search keywords and restricted the results to the Guardian's Australian production office. Since the program started in 2020, I limited the date range to 2020-01-01 through 2022-12-31 to observe the outcome over three years.

In [4]:
#build a search URL
base_url = 'https://content.guardianapis.com/'
search_string = "UQ%20AND%20(COVID-19%20Vaccine)%20AND%20(Queensland%20University)"
production_office = "AUS"
from_date = "2020-01-01"
end_date = "2022-12-31"

full_url = base_url+f"search?q={search_string}&production-office={production_office}&from-date={from_date}&to-date={end_date}&show-fields=body&api-key={key}"

print(full_url[:500])

https://content.guardianapis.com/search?q=UQ%20AND%20(COVID-19%20Vaccine)%20AND%20(Queensland%20University)&production-office=AUS&from-date=2020-01-01&to-date=2022-12-31&show-fields=body&api-key=9b23a079-34d8-43eb-b5cc-60cf458199f8
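
The query string in the cell above is percent-encoded by hand. As an alternative, here is a minimal sketch that lets the standard library do the encoding. The helper name `build_search_url` and the placeholder key are assumptions for illustration; the parameter names are the ones already used in `full_url`.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://content.guardianapis.com/"

def build_search_url(query, api_key, production_office="AUS",
                     from_date="2020-01-01", to_date="2022-12-31"):
    """Hypothetical helper: assemble a Guardian search URL from plain-text inputs."""
    params = {
        "q": query,
        "production-office": production_office,
        "from-date": from_date,
        "to-date": to_date,
        "show-fields": "body",
        "api-key": api_key,
    }
    # quote_via=quote encodes spaces as %20 (the default, quote_plus, would use +)
    return BASE_URL + "search?" + urlencode(params, quote_via=quote)

example_url = build_search_url(
    "UQ AND (COVID-19 Vaccine) AND (Queensland University)",
    api_key="YOUR-API-KEY",  # placeholder, not a real key
)
print(example_url)
```

In this sketch the parentheses are also percent-encoded (`%28` and `%29`), so the printed URL looks slightly different from the hand-built one, although both decode to the same query text.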


In [5]:
# get data from server
server_response = requests.get(full_url)
server_data = server_response.json()
resp_data = server_data.get('response','')
if resp_data == '':
    print("ERROR obtaining results:",server_data)
else:
    print("SUCCESS!")
    print(f"{resp_data['total']} results found available in {resp_data['pages']} pages")
    print(f"{resp_data['pageSize']} results per page")
    results = resp_data.get('results',[])
    

SUCCESS!
33 results found available in 4 pages
10 results per page
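
The cell above treats a missing `response` key as the only failure mode. A slightly more defensive variant is sketched below; it works on an already-decoded payload, so it can be tried without calling the API. The function name `extract_response` and the sample payloads are hypothetical, and only the fields this notebook already reads (`total`, `pages`, `pageSize`, `results`) are assumed to exist.

```python
def extract_response(server_data):
    """Hypothetical helper: return the 'response' block or raise a clear error."""
    resp_data = server_data.get("response")
    if not isinstance(resp_data, dict):
        raise ValueError(f"No 'response' block in payload: {server_data}")
    # These are the fields the notebook reads later on.
    required = ("total", "pages", "pageSize", "results")
    missing = [name for name in required if name not in resp_data]
    if missing:
        raise ValueError(f"Response is missing fields: {missing}")
    return resp_data

# Made-up payloads for illustration only.
good_payload = {"response": {"total": 2, "pages": 1, "pageSize": 10, "results": [{}, {}]}}
bad_payload = {"message": "Unauthorized"}

resp = extract_response(good_payload)
print(f"{resp['total']} results in {resp['pages']} page(s)")

try:
    extract_response(bad_payload)
except ValueError as err:
    print("ERROR obtaining results:", err)
```

With a live request, calling `server_response.raise_for_status()` before `.json()` would additionally surface HTTP errors before any parsing is attempted.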


The search returns 33 relevant articles from the Guardian. The next steps process the JSON response, extract each article's title, publication date and body text, and write the result to a JSON file for further analysis.

In [6]:
# extract articles from each page
def articles_from_page_results(page_results):
    articles = {}
    for result in page_results:
        article_date = result['webPublicationDate']
        article_title = result['webTitle']+f" [{article_date}]" 
        article_html = result['fields']['body']
        article_text = re.sub(r'<.*?>','',article_html)
        articles[article_title] = article_text
    return articles
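
The function above removes HTML tags by deleting them outright. The short sketch below applies the same substitution to a made-up record to show one side effect: text on either side of a tag is joined with no space. The sample record and the space-preserving variant are assumptions for illustration, not part of the original workflow.

```python
import re

# Hypothetical result record, shaped like the fields the function reads.
sample_result = {
    "webPublicationDate": "2020-01-01T00:00:00Z",
    "webTitle": "Example headline",
    "fields": {"body": "<h2>Key points</h2><p>First paragraph.</p>"},
}

html = sample_result["fields"]["body"]

# Same substitution as in articles_from_page_results: tags are deleted outright.
glued_text = re.sub(r"<.*?>", "", html)

# Variant (an assumption, not the notebook's method): replace tags with a space,
# then collapse runs of whitespace so block boundaries survive.
spaced_text = re.sub(r"\s+", " ", re.sub(r"<.*?>", " ", html)).strip()

print(repr(glued_text))
print(repr(spaced_text))
```

In the first result the heading and the paragraph run together as `pointsFirst`, which a word-level vectorizer would treat as a single token; whether that matters depends on how the text is tokenized later.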

In [7]:
def get_all_articles_for_response(response_json,full_url):
    total_pages = response_json['pages']
    total_articles = response_json['total']
    print(f"Fetching {total_articles} articles from {total_pages} pages...")
    all_articles = {}
    page1_articles = articles_from_page_results(response_json['results'])
    all_articles.update(page1_articles)
    print("Added articles for page: 1")
    
    for page in range(2,total_pages+1):
        print("Getting articles from API for page:",page)
        page_response = requests.get(full_url+f"&page={page}")
        page_data = page_response.json()['response']
        print("Processing results for page:",page_data['currentPage'])
        page_articles = articles_from_page_results(page_data['results'])
        print(f"Fetched {len(page_articles)} articles.")
        all_articles.update(page_articles)
        print("Added articles for page:",page)
        print(f"Status: {len(all_articles)} articles.")
        time.sleep(1) # make sure we're not hitting the API too hard
    
    print(f"FINISHED: Fetched {len(all_articles)} articles.")
    return all_articles


In [8]:
my_articles = get_all_articles_for_response(resp_data,full_url)

Fetching 33 articles from 4 pages...
Added articles for page: 1
Getting articles from API for page: 2
Processing results for page: 2
Fetched 10 articles.
Added articles for page: 2
Status: 20 articles.
Getting articles from API for page: 3
Processing results for page: 3
Fetched 10 articles.
Added articles for page: 3
Status: 30 articles.
Getting articles from API for page: 4
Processing results for page: 4
Fetched 3 articles.
Added articles for page: 4
Status: 32 articles.
FINISHED: Fetched 32 articles.
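
The log above reports 33 results, yet only 32 articles are stored: the count stands at 30 after page 3, and page 4 contributes 3 articles. Because `all_articles` is a dictionary keyed on title plus publication date, one plausible explanation is that a key from page 4 already existed and its entry was overwritten rather than added. The sketch below reproduces that behaviour with made-up keys and shows one way to list the repeats. The names `page_a`, `page_b` and `count_collisions` are hypothetical.

```python
from collections import Counter

# Made-up keys in the notebook's "title [date]" format.
page_a = {"Headline one [2020-01-01T00:00:00Z]": "text a",
          "Headline two [2020-01-02T00:00:00Z]": "text b"}
page_b = {"Headline two [2020-01-02T00:00:00Z]": "text b (repeat)",
          "Headline three [2020-01-03T00:00:00Z]": "text c"}

all_articles = {}
all_articles.update(page_a)
all_articles.update(page_b)  # the repeated key overwrites instead of adding

def count_collisions(*pages):
    """Hypothetical helper: return keys that occur on more than one page."""
    counts = Counter(key for page in pages for key in page)
    return [key for key, n in counts.items() if n > 1]

print(len(page_a) + len(page_b), "fetched,", len(all_articles), "stored")
print("Repeated keys:", count_collisions(page_a, page_b))
```

Running a check like this on the real page dictionaries would confirm whether a repeated title and date accounts for the missing article.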


In [9]:
file_path = 'UA2-YungHsin-n11750804'
file_name = "UQ-COVID19.json"

with open(f"{file_path}/{file_name}",'w', encoding='utf-8') as fp:
    fp.write(json.dumps(my_articles))
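
Before moving on to the analysis, it can be worth confirming that the saved file reads back unchanged. The sketch below does a write-and-reload round trip in a temporary folder so it does not touch the real output; `sample_articles` stands in for `my_articles`, and the folder name is hypothetical.

```python
import json
import os
import tempfile

# Stand-in for my_articles; the real dictionary comes from the cells above.
sample_articles = {"Example headline [2020-01-01T00:00:00Z]": "Example body text."}

# A temporary directory keeps the sketch self-contained.
with tempfile.TemporaryDirectory() as tmp_dir:
    out_dir = os.path.join(tmp_dir, "output")  # hypothetical folder name
    os.makedirs(out_dir, exist_ok=True)        # open() fails if the folder is missing
    out_path = os.path.join(out_dir, "UQ-COVID19.json")

    with open(out_path, "w", encoding="utf-8") as fp:
        json.dump(sample_articles, fp)

    with open(out_path, "r", encoding="utf-8") as fp:
        reloaded = json.load(fp)

print("Round trip OK:", reloaded == sample_articles)
```

The `os.makedirs(..., exist_ok=True)` call is included because opening a file for writing raises an error when the target folder does not exist yet.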
