<div class="alert alert-block alert-info"><b>IAB303</b> - Data Analytics for Business Insight</div>

# STUDIO :: News API

[The Guardian](https://www.theguardian.com/au) is a quality **open** news outlet with an easy-to-use [open-platform API](https://open-platform.theguardian.com).

* Explore and experiment with the [platform here](https://open-platform.theguardian.com/explore/)
* Get your own [developer API key here](https://bonobo.capi.gutools.co.uk/register/developer)

For this example, I've saved my key at the beginning of a file called `guardian_key.txt` in the `data` folder. I load the key before anything else...

In [None]:
#load key
with open('data/guardian_key.txt', 'r') as file:
    key = file.read().strip()
len(key) # check key loaded by reading its length - don't want to display the actual key!!

To use the API, we need the `requests` library to connect to it, and the `json` library to be able to work with the json data that the API returns.

In [None]:
#import required libraries
import requests
import json

The Guardian API is called through a single URL which needs to be composed according to your search requirements.

In [None]:
# build a search URL
baseUrl = 'https://content.guardianapis.com/search?q=' # content search

searchString = "submarine"
office = "&production-office=aus"
tag = "&tag=politics/politics" # optional filter, not added to the URL below
fromDate = "&from-date=2021-09-01" # the API parameter is 'from-date'

url = baseUrl+'"'+searchString+'"'+office+fromDate
print(url)
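
Joining strings by hand works, but special characters such as quotes and spaces are left unencoded. As a minimal sketch of an alternative, the standard library's `urlencode` can build the query string from a dictionary. The names `params` and `encoded_url` are only for illustration, and the parameter values simply mirror the search above.

```python
from urllib.parse import urlencode

# Search options; only 'q' is needed for a basic content search
params = {
    "q": '"submarine"',            # double quotes ask for the exact phrase
    "production-office": "aus",
    "from-date": "2021-09-01",
}

# urlencode percent-encodes each value, so quotes and spaces are safe in the URL
encoded_url = "https://content.guardianapis.com/search?" + urlencode(params)
print(encoded_url)
```

`requests.get` also accepts a `params` dictionary and performs the same encoding itself, so either approach should produce an equivalent request.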

Now that we have the URL that we want to use for our search, we add our api-key to the end of it and send the request to the server.

In [None]:
# get data from server
urlkey = url +'&api-key='+key # add my API key to the end of the URL
response = requests.get(urlkey)

We're hoping for a `200` response from the server to say that everything was OK. Any other status code means something went wrong, most often a problem with your URL or your API key.

In [None]:
response
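
To make a status code easier to interpret, here is a small sketch that turns a number into a short description using the standard library's `HTTPStatus`. The helper `explain_status` and its hint messages are my own assumptions for illustration, not part of the Guardian API.

```python
from http import HTTPStatus

def explain_status(code):
    """Return a short human-readable description of an HTTP status code."""
    try:
        phrase = HTTPStatus(code).phrase      # e.g. 200 -> 'OK'
    except ValueError:
        phrase = "Unknown status"             # not a standard HTTP status code
    if 200 <= code < 300:
        hint = "request succeeded"
    elif 400 <= code < 500:
        hint = "check the URL and the API key"
    elif 500 <= code < 600:
        hint = "server-side problem, try again later"
    else:
        hint = "unexpected for this API"
    return "{} {}: {}".format(code, phrase, hint)

print(explain_status(200))
print(explain_status(401))
```

With a live response, the same helper could be called as `explain_status(response.status_code)`.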

We can look at the content of the response, which should be in JSON format if your request was successful.

In [None]:
response.content

Load the JSON from the content into a variable (as a dictionary) that we can navigate in Python.

In [None]:
data = json.loads(response.content)
data

In [None]:
results = data['response']['results']
results
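
If you don't have a key yet, the navigation above can be tried on a stand-in payload. The sketch below uses a hand-written sample shaped like the structure we just walked through; the titles and URLs are placeholders, not real API output, and the `sample_` names are only for illustration.

```python
import json

# Hand-written stand-in for response.content (bytes); all values are placeholders
sample_content = b'''{
  "response": {
    "status": "ok",
    "total": 2,
    "pages": 1,
    "results": [
      {"webTitle": "Example headline one", "webUrl": "https://example.com/one"},
      {"webTitle": "Example headline two", "webUrl": "https://example.com/two"}
    ]
  }
}'''

sample_data = json.loads(sample_content)    # json.loads accepts bytes as well as str
sample_results = sample_data['response']['results']

print(list(sample_data['response'].keys()))  # top-level fields inside 'response'
print(len(sample_results), "results")
print(sample_results[0]['webTitle'])
```

The outer dictionary has a single `response` key, and the list of stories sits one level down under `results`.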

Once we have our results, it could be helpful to get a list of the titles. Then we could do unstructured data analytics on the titles to see if they contain specific words or phrases of interest.

In [None]:
titles = []
for result in results:
    titles.append(result['webTitle'])

In [None]:
titles
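
As a first step towards analysing the titles, here is a minimal sketch of a keyword filter. The helper `titles_containing` and the example titles are made up for illustration; with real data you would pass in the `titles` list instead.

```python
def titles_containing(titles, keyword):
    """Return the titles that contain keyword, ignoring upper/lower case."""
    keyword = keyword.lower()
    return [title for title in titles if keyword in title.lower()]

# Placeholder titles for illustration, not real API output
example_titles = [
    "Submarine deal announced",
    "Weather warning for the weekend",
    "What the SUBMARINE plan means",
]

print(titles_containing(example_titles, "submarine"))
```

This is a plain substring match, so searching for "sub" would also match "submarine". That may or may not be what you want.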

### Go further with web scraping

Once we have a title of interest, we could extract its `webUrl` and then scrape the webpage for the main story.

In [None]:
webUrl1 = results[0]['webUrl']
webUrl1

In [None]:
from bs4 import BeautifulSoup

def get_HTML(url):
    response = requests.get(url)
    html = response.content
    return html

In [None]:
page = get_HTML(webUrl1)
page

#### The main story

Find the main story within the page. Thankfully The Guardian uses an `id` attribute to identify this content.

In [None]:
soup = BeautifulSoup(page,"html.parser")
main_content = soup.find("div", {"id": "maincontent"})
main_content

#### Just the text

Extract the text from the main content, leaving out the HTML tags.

In [None]:
' '.join(main_content.stripped_strings) # stripped_strings already removes surrounding whitespace
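
Page layouts change, and `soup.find` returns `None` when no matching element exists. The sketch below wraps the two steps above in a hypothetical helper, `extract_main_text`, that checks for that case. It runs on a tiny hand-written page rather than a real article, so the HTML here is an assumption about the general shape only.

```python
from bs4 import BeautifulSoup

# Tiny hand-written page standing in for a downloaded article
sample_html = """
<html><body>
  <div id="header">Site navigation</div>
  <div id="maincontent"><h1>Example headline</h1><p>First paragraph.</p><p>Second paragraph.</p></div>
</body></html>
"""

def extract_main_text(html, element_id="maincontent"):
    """Return the text inside the div with the given id, or None if it is absent."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", {"id": element_id})
    if container is None:            # the page may not have the element we expect
        return None
    return ' '.join(container.stripped_strings)

print(extract_main_text(sample_html))
print(extract_main_text("<html><body><p>No main div</p></body></html>"))
```

With a real page, the call would be `extract_main_text(page)`.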

---
### Make it easier to explore

In [None]:
# a function to build the URL

def buildUrl(search_text,office="",tag="",fromDate=""):
    baseUrl = 'https://content.guardianapis.com/search?q='
    # Only include office, tag and fromDate  if they have values
    if office:
        office = '&production-office='+office
    if tag:
        tag = '&tag='+tag
    if fromDate:
        fromDate = '&from-date='+fromDate # the API parameter is 'from-date'
    fullurl =  baseUrl+'"'+search_text+'"'+office+tag+fromDate
    print(fullurl)
    return fullurl

In [None]:
# a function to request the data and report how many records were found
def getData(url,key):
    response = requests.get(url+'&api-key='+key)
    data = json.loads(response.content)
    if data.get('response',{}).get('status')=='ok': # .get() avoids a KeyError if the payload has no 'response' key
        total = data['response']['total']
        pages = data['response']['pages']
        print("Found a total of {} records, returning first of {} pages.".format(total,pages))
        print("-------------------------------------------------------")
    else:
        print("ERROR:")
        print(response)
    return data

In [None]:
buildUrl("submarine")

In [None]:
getData(buildUrl("submarine"),key)

In [None]:
getData(buildUrl("submarine","aus","","2021-09-01"),key)
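
`getData` reports how many pages exist but only returns the first. As a rough sketch of how to go further, the helper below builds one URL per page, assuming the `page` query parameter described in the Open Platform documentation. The name `page_urls` and the `max_pages` cap are my own additions for illustration.

```python
def page_urls(url, pages, max_pages=3):
    """Return one URL per results page, capped at max_pages to limit API calls."""
    last_page = min(pages, max_pages)
    return [url + '&page=' + str(page) for page in range(1, last_page + 1)]

# Example search URL in the same style as buildUrl produces
example_url = 'https://content.guardianapis.com/search?q="submarine"'
for page_url in page_urls(example_url, pages=10):
    print(page_url)
```

Each of these could then be passed to `getData(page_url, key)`, with the `results` lists joined together afterwards. Keeping `max_pages` small helps you stay within the request limits of a developer key.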