<div class="alert alert-block alert-info"><b>IAB303</b> - Data Analytics for Business Insight</div>

# Tutorial: External Concerns & Unstructured Data

In [None]:
#import required libraries
import requests
import json
import re
import time
import pandas as pd

## Guardian API access

You will need to apply for a free developer key in order to use the API fully within Jupyter:

[Guardian Open Platform - Getting started](https://open-platform.theguardian.com/access/)

You will receive your personal API key via email. Please ensure that you do not share your key with anyone.
To keep your key invisible in the code cell, follow these steps:
1. Create a subfolder named `private`
2. Inside the `private` subfolder, create a new text file named `guardian_key.txt`
3. Copy and paste your API key into the `guardian_key.txt` file

In [None]:
#load your personal API key
with open('private/guardian_key.txt', 'r') as file:
    key = file.read().strip()
len(key)
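
If the key file is missing or empty, the cell above either fails with an unhelpful error or silently yields an empty key. The following is a minimal sketch of a more defensive loader; the function name `load_api_key` is hypothetical and not part of the tutorial.

```python
from pathlib import Path

def load_api_key(path='private/guardian_key.txt'):
    """Read an API key from a text file (hypothetical helper, not part of the tutorial)."""
    key_file = Path(path)
    if not key_file.is_file():
        raise FileNotFoundError(f"Key file not found: {key_file}")
    key = key_file.read_text(encoding='utf-8').strip()  # drop stray spaces and newlines
    if not key:
        raise ValueError(f"Key file is empty: {key_file}")
    return key
```

It could be called as `key = load_api_key()` once the `private/guardian_key.txt` file exists.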

## Question
What question do you wish to explore? Write down your question below:

* **Question**: **???**

Explain **why** the question is significant.

**???**

## Explore Guardian API
You can explore what is possible with the API here:

[Guardian Open Platform - explore](https://open-platform.theguardian.com/explore/)

### What filters can you customise in Guardian API?

Spend some time exploring filters, for example:
* `section`
* `tag`
* `production-office`
* `from-date` and `to-date`
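
Filters like these are passed as query-string parameters. The following is a minimal sketch, assuming the `search` endpoint and the parameter names used in this tutorial; the helper name `build_search_url` and the sample values are illustrative only, and `urllib.parse.urlencode` takes care of the escaping.

```python
from urllib.parse import urlencode

def build_search_url(api_key, **filters):
    """Build a Guardian search URL from keyword filters (illustrative helper).

    Underscores in keyword names become hyphens, so from_date -> from-date.
    Filters set to None are skipped.
    """
    params = {name.replace('_', '-'): value
              for name, value in filters.items() if value is not None}
    params['api-key'] = api_key
    return 'https://content.guardianapis.com/search?' + urlencode(params)

# Sample values only; 'test' is a placeholder string standing in for your key.
example_url = build_search_url(
    'test',
    q='paris olympics',
    section='sport',
    production_office='aus',
    from_date='2023-07-01',
    to_date='2024-08-31',
)
print(example_url)
```

Leaving a filter out (or passing `None`) simply omits it from the URL, which makes it easy to compare how each filter changes the results.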

In [None]:
# explore guardian API 


## Get articles from Guardian API

In [None]:
# build a search URL
base_url = 'https://content.guardianapis.com/'

# modify your search terms & filters
search_string = "paris%20olympics"
section = 'football'
production_office = 'aus'
from_date = '2023-07-01'


# this is an example of a search URL
full_url = base_url+f"search?q={search_string}&section={section}&production-office={production_office}&from-date={from_date}&show-fields=body&api-key={key}"

print(full_url.replace(key, '<API-KEY>')) # mask the key so it never appears in the output
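
Writing `%20` by hand becomes error-prone once a search uses several words or operators. The following is a minimal sketch that uses `urllib.parse.quote` to encode the search string; the search terms are sample values chosen for illustration.

```python
from urllib.parse import quote

# Sample search terms (an assumption for illustration only)
raw_query = 'paris AND olympics AND NOT (football OR rugby)'

# quote() percent-encodes spaces and parentheses so the text is safe inside a URL
search_string = quote(raw_query)
print(search_string)
```

The resulting `search_string` can be used in place of a hand-encoded one when building the search URL.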

In [None]:
# get data from server
response = requests.get(full_url)
response.raise_for_status() # stop early if the server returned an HTTP error
resp_data = response.json()['response']
resp_data

In [None]:
num_pages = resp_data['pages']
num_pages

In [None]:
def articles_from_page_results(page_results):
    articles = {}
    for result in page_results:
        article_date = result['webPublicationDate']
        article_title = result['webTitle']+f" [{article_date}]"
        article_html = result['fields']['body']
        article_text = re.sub(r'<.*?>','',article_html)
        articles[article_title] = article_text
    return articles
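
The regular expression in `articles_from_page_results` removes anything between `<` and `>`, but it leaves HTML entities such as `&amp;` untouched and joins adjacent paragraphs without a space. The following is a minimal sketch of a slightly more careful cleaner using only the standard library; `html_to_text` is a hypothetical name and the sample HTML is made up.

```python
import html
import re

def html_to_text(article_html):
    """Strip tags, decode entities and tidy whitespace (hypothetical helper)."""
    text = re.sub(r'<[^>]*>', ' ', article_html)  # replace each tag with a space
    text = html.unescape(text)                    # e.g. &amp; -> &
    return re.sub(r'\s+', ' ', text).strip()      # collapse runs of whitespace

sample_html = '<p>Gold &amp; silver</p><p>for Australia.</p>'
print(html_to_text(sample_html))
```

One trade-off of this approach is that inline tags (links, emphasis) are also replaced by a space, so a space can appear before punctuation that directly follows such a tag.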

In [None]:
def get_all_articles_for_response(response_json,full_url):
    total_pages = response_json['pages']
    total_articles = response_json['total']
    print(f"Fetching {total_articles} articles from {total_pages} pages...")
    all_articles = {}
    page1_articles = articles_from_page_results(response_json['results'])
    all_articles.update(page1_articles)
    print("Added articles for page: 1")
    
    for page in range(2,total_pages+1):
        print("Getting articles from API for page:",page)
        page_response = requests.get(full_url+f"&page={page}")
        page_data = page_response.json()['response']
        print("Processing results for page:",page_data['currentPage'])
        page_articles = articles_from_page_results(page_data['results'])
        print(f"Fetched {len(page_articles)} articles.")
        all_articles.update(page_articles)
        print("Added articles for page:",page)
        print(f"Status: {len(all_articles)} articles.")
        time.sleep(1) # make sure we're not hitting the API too hard
    
    print(f"FINISHED: Fetched {len(all_articles)} articles.")
    return all_articles


In [None]:
my_articles = get_all_articles_for_response(resp_data,full_url)

In [None]:
print("Total Articles:",len(my_articles))
for title,text in my_articles.items():
    print(title)

In [None]:
# save articles to a json file
import os

file_path = "data/"
file_name = "paris_olympics.json" # <-- rename the file

os.makedirs(file_path, exist_ok=True) # create the folder if it does not exist yet
with open(f"{file_path}{file_name}",'w', encoding='utf-8') as fp:
    json.dump(my_articles, fp)
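
To continue the analysis in a later session without calling the API again, the saved file can be read back in. The following is a minimal sketch; the helper name `load_articles` is hypothetical.

```python
import json

def load_articles(path):
    """Load a {title: text} dictionary saved as JSON (hypothetical helper)."""
    with open(path, 'r', encoding='utf-8') as fp:
        return json.load(fp)
```

For example, `my_articles = load_articles("data/paris_olympics.json")` would restore the dictionary, assuming the file was saved under that name.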

## Analysis
Now check the titles of the articles. Are all of them relevant to the [question](#Question) you are interested in?
Explore how you can retrieve the articles most relevant to your concern. Some ideas:
* Revisit your [search URL](#Get-articles-from-Guardian-API) and re-query the API. [Query and filter operators](https://open-platform.theguardian.com/documentation/), such as `AND`, `OR`, `NOT` and grouping with `()`, may be helpful.
* Work with the downloaded articles and filter out the irrelevant ones. Are there any patterns that distinguish relevant from irrelevant articles?
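
As one way to filter downloaded articles, the sketch below keeps only those that mention at least one keyword. It is a minimal example under stated assumptions: `filter_articles` is a hypothetical helper, the keywords are placeholders, and the sample data is made up to mirror the `{title: text}` shape of `my_articles`.

```python
import re

def filter_articles(articles, keywords):
    """Keep articles whose title or text mentions any keyword (case-insensitive).

    Hypothetical helper: `articles` maps title -> text, like `my_articles`.
    """
    pattern = re.compile('|'.join(re.escape(word) for word in keywords), re.IGNORECASE)
    return {title: text for title, text in articles.items()
            if pattern.search(title) or pattern.search(text)}

# Made-up sample data for illustration
sample_articles = {
    'Matildas squad named [2024-06-01T00:00:00Z]': 'The Matildas will play in Paris.',
    'Premier League round-up [2024-05-20T00:00:00Z]': 'A busy weekend of club football.',
}
relevant = filter_articles(sample_articles, ['paris', 'olympic'])
print(list(relevant))
```

Keyword matching is a blunt instrument: it is worth reading a few of the excluded titles to check that nothing relevant was dropped.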

In [None]:
# explore article json file - add more cells if you like


Jot down some ideas for how you can extract useful information from the most relevant articles. You may also think about ways to visualise some of the key information.

* 
* 
* 
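
As a starting point, one simple form of extraction is counting how many articles mention each term of interest. The following is a minimal sketch using only the standard library; `count_terms`, the terms and the sample articles are all made up for illustration.

```python
import re
from collections import Counter

def count_terms(articles, terms):
    """Count how many articles mention each term (case-insensitive, whole words)."""
    counts = Counter()
    for text in articles.values():
        for term in terms:
            if re.search(rf'\b{re.escape(term)}\b', text, re.IGNORECASE):
                counts[term] += 1
    return counts

# Made-up sample data for illustration
sample_articles = {
    'Article A': 'Swimming and athletics dominate the schedule.',
    'Article B': 'Athletics finals begin on Friday.',
}
term_counts = count_terms(sample_articles, ['swimming', 'athletics', 'cycling'])
print(term_counts)
```

Terms that never appear are simply absent from the result, and looking them up in the `Counter` returns 0.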

In [None]:
# explore ways to extract relevant information - add more cells if you like


## Visualise

Try using `HTML` and `CSS` to visualise the useful information.
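
One possible approach is to build an HTML string in Python and style it with inline CSS. The following is a minimal sketch of a horizontal bar chart; `bar_chart_html`, the colours, the widths and the counts are all assumptions made for illustration.

```python
import html

def bar_chart_html(counts, max_width_px=300):
    """Return an HTML string with one CSS-styled bar per item (hypothetical helper)."""
    largest = max(counts.values(), default=0) or 1  # avoid dividing by zero
    rows = []
    for label, value in counts.items():
        width = int(max_width_px * value / largest)  # scale bars to the largest value
        rows.append(
            f'<div style="margin:2px 0">'
            f'<span style="display:inline-block;width:100px">{html.escape(str(label))}</span>'
            f'<span style="display:inline-block;width:{width}px;background:steelblue;color:white">'
            f'&nbsp;{value}</span></div>'
        )
    return '\n'.join(rows)

chart = bar_chart_html({'athletics': 2, 'swimming': 1})  # made-up counts
print(chart)
```

In a notebook, the string can be rendered rather than printed with `from IPython.display import HTML, display` followed by `display(HTML(chart))`.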

In [None]:
# visualise some key information - add more cells if you like 


Don't forget to record any findings and insights from your analysis and visualisation.

* 
* 
* 

## Insights
1. What is the concern?
2. What data did we use?
3. How did we analyse it? What decisions did we make, and why?
4. What do the visualisations tell us?
5. What is the recommendation for the concern? What other information would be helpful? What *doesn't* the data tell us? Can we make inferences?

**Response**: 