# Tutorial :: Threats and opportunities in external data - the power of the news

**CONCERN**

You are working for a consultancy firm in charge of the Australian government's political image. In September 2021, the Australian government had a high-profile dispute with France after a deal to buy French submarines was called off. A report has already been generated from the titles of the news items. However, your job as an analyst is to create a more thorough report that takes into account the additional information inside each news item.

In particular, your clients want to be aware of **threats** and **opportunities** suggested by the news.

1. **Q**uestion
2. **D**ata
3. **A**nalysis
4. **V**isualisation
5. **I**nsight

<img src="graphics/QDAVI_cycle_sm.png" width="50%" />

### 1. Question

How has the news affected the image of the Australian government?

**Tip:** You can combine web scraping and APIs.

### 2. Data

You must use The Guardian API.

**Tip:** Check the studio session and tutorial notebooks from Week 3 for information about how to call the Guardian API.

In [None]:
# Libraries for the analysis
import pandas as pd
import requests
import json
from bs4 import BeautifulSoup

In [None]:
# Build a search URL
baseUrl = 'https://content.guardianapis.com/search?q=' # content search

searchString = "submarine"
office = "&production-office=aus"
tag = "&tag=politics/politics" # optional filter; defined here but not added to the URL below
fromDate = "&from-date=2021-09-01"
toDate = "&to-date=2021-11-30"

url = baseUrl+'"'+searchString+'"'+office+fromDate+toDate+"&api-key=test"
print(url)
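String concatenation works for a fixed query, but it is easy to drop an `&` or forget to encode a special character. As a minimal sketch, the same URL can be assembled from a dictionary with the standard library's `urlencode`. The helper name `build_search_url` and its default arguments are assumptions made for illustration; the parameter names are the ones used in the cell above.

```python
from urllib.parse import urlencode

BASE_URL = "https://content.guardianapis.com/search"


def build_search_url(query, from_date, to_date, office="aus", api_key="test"):
    """Return a content-search URL with URL-encoded parameters (hypothetical helper)."""
    params = {
        "q": f'"{query}"',              # keep the double quotes used in the original cell
        "production-office": office,
        "from-date": from_date,
        "to-date": to_date,
        "api-key": api_key,
    }
    return f"{BASE_URL}?{urlencode(params)}"


url = build_search_url("submarine", "2021-09-01", "2021-11-30")
print(url)
```

The double quotes are encoded as `%22`, so the printed URL looks slightly different from the concatenated one while carrying the same parameters.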

In [None]:
# Call the API
response = requests.get(url)
response.raise_for_status() # stop early if the server returned an HTTP error
data = response.json() # parse the JSON body
results = data['response']['results']
results
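Before scraping, it can help to see which fields each result carries. The sketch below walks a hand-written payload that only mimics the keys used in this notebook (`webPublicationDate`, `sectionName`, `webTitle`, `webUrl`); the values are placeholders, not real API output, and `list_articles` is a hypothetical helper.

```python
# Illustrative payload: the keys follow the cells in this notebook,
# the values are placeholders rather than real API output.
sample = {
    "response": {
        "results": [
            {
                "webPublicationDate": "2021-09-16T00:00:00Z",
                "sectionName": "Example section",
                "webTitle": "Example headline A",
                "webUrl": "https://example.com/a",
            },
            {
                "webPublicationDate": "2021-09-17T00:00:00Z",
                "sectionName": "Example section",
                "webTitle": "Example headline B",
                "webUrl": "https://example.com/b",
            },
        ]
    }
}


def list_articles(payload):
    """Return (date, title, url) tuples, or an empty list if the shape is unexpected."""
    results = payload.get("response", {}).get("results", [])
    return [
        (r.get("webPublicationDate"), r.get("webTitle"), r.get("webUrl"))
        for r in results
    ]


for date, title, link in list_articles(sample):
    print(date, "|", title, "|", link)
```

Using `.get` with defaults means a malformed or empty payload yields an empty list instead of a `KeyError`.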

The results contain the URL of each news item on the website. After inspecting a couple of pages, which information could be easily extracted from them?

In [None]:
# Get HTML function
def get_HTML(url):
    # get data from server; give up on slow responses and raise on HTTP errors
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    html = response.content
    return html

In [None]:
# Beautiful soup function for subtitle
def extract_subTitle(HTML):
    soup = BeautifulSoup(HTML, "html.parser") # the html input and the parser name
    article = soup.find("article") # the tag that contains the article
    if article is None: # guard against pages without an <article> tag
        return ""
    div_element = article.find("div", attrs={"data-gu-name": "standfirst"}) # the tag that can be found using an attribute
    target_element = div_element.find("p") if div_element is not None else None
    if target_element is not None:
        return target_element.text
    return ""

In [None]:
# Beautiful soup function for body
def extract_body(HTML):
    soup = BeautifulSoup(HTML, "html.parser") # the html input and the parser name
    article = soup.find("article") # the tag that contains the article
    if article is None: # guard against pages without an <article> tag
        return ""
    div_element = article.find("div", attrs={"id": "maincontent"}) # the tag that can be found using an attribute
    if div_element is None:
        return ""
    target_elements = div_element.find_all("p") # every paragraph inside the main content
    # join with spaces so that adjacent paragraphs do not run together
    return " ".join(te.text for te in target_elements)
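To see what the two extraction functions are looking for, it can be useful to run the same lookups on a tiny page you control. The HTML below is a hand-written stand-in that only reproduces the `article`, `data-gu-name="standfirst"` and `id="maincontent"` markers used above; real article pages are much larger and their structure may change.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for an article page (not a real page).
SAMPLE_HTML = """
<html><body><article>
  <div data-gu-name="standfirst"><p>Example standfirst.</p></div>
  <div id="maincontent"><div>
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </div></div>
</article></body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
article = soup.find("article")

# Subtitle: the first <p> inside the standfirst block
standfirst = article.find("div", attrs={"data-gu-name": "standfirst"})
subtitle = standfirst.find("p").get_text(strip=True)

# Body: every <p> inside the main content block, joined with spaces
main = article.find("div", attrs={"id": "maincontent"})
body = " ".join(p.get_text(strip=True) for p in main.find_all("p"))

print(subtitle)
print(body)
```

Testing on a small snippet like this makes it easier to tell whether an empty result comes from your code or from a page that is laid out differently.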

#### Clean/preprocess data

In [None]:
# Create a dataframe
df = pd.DataFrame(columns=["Date", "Section", "Title", "Subtitle", "Body"])
df

In [None]:
# Populate the dataframe
rows = [] # collect one dictionary per news item, then build the dataframe once
for news in results:
    html = get_HTML(news["webUrl"])
    rows.append({"Date": news["webPublicationDate"], "Section": news["sectionName"], "Title": news["webTitle"], "Subtitle": extract_subTitle(html), "Body": extract_body(html)})
df = pd.DataFrame(rows, columns=df.columns)
df
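Some light cleaning usually pays off before any analysis. The sketch below runs on a small placeholder dataframe (the rows are invented for illustration) and shows three steps that would plausibly apply here: parsing the ISO 8601 date strings, trimming whitespace, and flagging articles whose body came back empty. The `HasBody` and `Month` columns are assumptions, not part of the original notebook.

```python
import pandas as pd

# Placeholder rows standing in for the scraped dataframe.
demo = pd.DataFrame({
    "Date": ["2021-09-16T05:30:00Z", "2021-10-02T21:00:00Z"],
    "Title": ["  Example headline A ", "Example headline B"],
    "Body": ["Some text.", ""],
})

demo["Date"] = pd.to_datetime(demo["Date"], utc=True)  # ISO 8601 strings -> timestamps
demo["Title"] = demo["Title"].str.strip()              # remove stray whitespace
demo["HasBody"] = demo["Body"].str.len() > 0           # flag articles with no scraped body
demo["Month"] = demo["Date"].dt.month                  # handy for grouping later

print(demo.dtypes)
print(demo)
```

Rows where `HasBody` is `False` are worth inspecting by hand, since an empty body may mean the page uses a different layout.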

### 3. Analysis

What information can you extract from the article text?

#### Inspect the data

Read a few articles at random to get a feel for what is important to analyse.

#### One approach - a basic sentiment analysis that looks for positive and negative words in the text

In [None]:
# Define lists of positive and negative words
positive_words = ["good", "positive", "excellent", "success"] # add words you think are good indicators
negative_words = ["bad", "poor", "negative", "disappointing"]


# Function to calculate a basic sentiment score
def analyze_sentiment(article):
    positive_count = 0
    negative_count = 0
    
    # Convert article to lowercase and split into words
    words = article.lower().split()
    
    # Count occurrences of positive and negative words
    for word in words:
        if word in positive_words:
            positive_count += 1
        if word in negative_words:
            negative_count += 1
            
    # Compute sentiment score
    sentiment_score = positive_count - negative_count
    return sentiment_score



In [None]:
# Analyze the articles

# Create a list of article bodies from the df column
article_body_texts = df["Body"].tolist()

# Loop through and run the analyze_sentiment function on each
for article_text in article_body_texts:
    
    score = analyze_sentiment(article_text)

    # Print sentiment score
    print(f"Sentiment Score: {score}")
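One detail worth checking is how the text is split into words. Splitting on whitespace keeps punctuation attached, so `"poor,"` does not match `"poor"`. The variant below is a sketch, not a replacement for the function above: it tokenises with a regular expression and uses the same starter word lists. The names `tokenize` and `score_text` are assumptions made for illustration.

```python
import re

POSITIVE = {"good", "positive", "excellent", "success"}
NEGATIVE = {"bad", "poor", "negative", "disappointing"}


def tokenize(text):
    """Lowercase the text and keep only runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def score_text(text):
    """Count positive words minus negative words, ignoring punctuation."""
    words = tokenize(text)
    positives = sum(word in POSITIVE for word in words)
    negatives = sum(word in NEGATIVE for word in words)
    return positives - negatives


example = "A good deal, but a poor, disappointing outcome."
print(tokenize(example))
print(score_text(example))  # -1: one positive word, two negative words
```

With a plain `split()`, the same sentence would score 0, because `"poor,"` keeps its trailing comma and is not counted.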

#### Combine the scores and do some analysis

**Tip:** You could add them as a new column of your existing df.


In [None]:
???
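One possible way to combine the scores, shown here on placeholder data rather than the scraped articles, is to apply the scoring function to the `Body` column and then summarise. The `count_score` function, its short word lists, and the `Sentiment` column name are assumptions for this sketch.

```python
import pandas as pd


def count_score(text, positive=("good", "success"), negative=("bad", "poor")):
    """Positive-word count minus negative-word count (simplified stand-in)."""
    words = text.lower().split()
    return sum(w in positive for w in words) - sum(w in negative for w in words)


# Placeholder rows; in the notebook this would be the scraped dataframe.
demo = pd.DataFrame({
    "Date": pd.to_datetime(["2021-09-16", "2021-09-16", "2021-09-17"]),
    "Body": ["a good success", "a bad day", "poor and bad"],
})

demo["Sentiment"] = demo["Body"].apply(count_score)   # one score per article

# Average score per publication day
daily = demo.groupby(demo["Date"].dt.date)["Sentiment"].mean()

print(demo["Sentiment"].describe())
print(daily)
```

Keeping the score in the dataframe lets you sort, filter, and group by date or section without re-running the text analysis.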

### 4. Visualisation

In [None]:
???
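As a starting point, a simple bar chart of score per article is often enough to spot outliers. The sketch below uses `matplotlib` with placeholder labels and scores; the colour choice and figure size are arbitrary assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; this line can be dropped inside a notebook
import matplotlib.pyplot as plt

# Placeholder values; in the notebook these would come from the dataframe.
labels = ["Article 1", "Article 2", "Article 3"]
scores = [2, -1, -2]

fig, ax = plt.subplots(figsize=(6, 3))
colours = ["tab:green" if s >= 0 else "tab:red" for s in scores]
ax.bar(labels, scores, color=colours)        # one bar per article
ax.axhline(0, color="black", linewidth=0.8)  # reference line at a neutral score
ax.set_ylabel("Sentiment score")
ax.set_title("Word-count sentiment per article")
fig.tight_layout()
```

With many articles, plotting the daily average over time may be easier to read than one bar per article.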

### 5. Insight

What might be some limitations of how you analysed the data?

# Scrape some data from the web to include as background in your final report

Use code similar to that in this week's studio session to scrape some data from the web relevant to the submarine issue.

Suggestion: A list of current Australian submarines with their names and launch dates, scraped from a web page such as https://en.wikipedia.org/wiki/Collins-class_submarine#Submarines_in_class
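A rough sketch of how a table could be turned into rows with BeautifulSoup is shown below. It runs on a hand-written table with placeholder values, not on the real page; Wikipedia data tables generally carry the `wikitable` class, but the actual column names and layout need to be checked in the page source. `table_to_rows` is a hypothetical helper.

```python
from bs4 import BeautifulSoup

# Hand-written table with placeholder values (not real submarine data).
SAMPLE_TABLE = """
<table class="wikitable">
  <tr><th>Name</th><th>Launched</th></tr>
  <tr><td>Example One</td><td>1 January 2000</td></tr>
  <tr><td>Example Two</td><td>2 February 2001</td></tr>
</table>
"""


def table_to_rows(html):
    """Return the first 'wikitable' as a list of dictionaries keyed by header text."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", attrs={"class": "wikitable"})
    if table is None:
        return []
    rows = table.find_all("tr")
    if not rows:
        return []
    headers = [th.get_text(strip=True) for th in rows[0].find_all("th")]
    records = []
    for row in rows[1:]:
        cells = [cell.get_text(strip=True) for cell in row.find_all(["td", "th"])]
        if len(cells) == len(headers):  # skip rows with merged or missing cells
            records.append(dict(zip(headers, cells)))
    return records


for record in table_to_rows(SAMPLE_TABLE):
    print(record)
```

To use it on a live page, the HTML would come from a function such as `get_HTML`, and the resulting list can be passed straight to `pd.DataFrame`.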