<div class="alert alert-block alert-success"><h3>IFN619 - Data Analytics for Strategic Decision Makers</h3></div>

## Module 1A Lecture

1. Recap
2. Decision theory
3. Retrieving data from APIs
4. Analysing unstructured data
5. Visualising with HTML
6. Introducing Assignment 3
7. Introducing Assignment 1

## [1] Recap

* Getting familiar with Jupyter notebooks
    * Logging in to Jupyter
    * Syncing files from GitHub to Jupyter
    * Opening notebooks and other files (e.g. CSV files)
    * Navigating and using the notebook - markdown and code cells

* The data analytics cycle (QDAVI)
    1. **Q**uestion
    2. **D**ata
    3. **A**nalysis
    4. **V**isualisation
    5. **I**nsight
    

* Basics
    * Importing libraries
    * Loading data into a dataframe from a CSV file
    * Extracting data from a dataframe
    * Plotting numerical data with matplotlib

## [2] Decision theory (very briefly)

* What?
    - Reasoning and Judgements, not just actions
    - Preferences over prospects, not just choices
    - More than choice under risk, values and beliefs
    - Expected Utility


* Why?
    - The definition of utility is very important
    - The implications of uncertainty are very important
    - Are people always rational agents?
    - Are data and data analytics objective?
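
As a minimal sketch of the expected utility idea listed above: an option's expected utility is the probability-weighted sum of the utilities of its possible outcomes. The options, probabilities and utility values below are hypothetical, chosen only for illustration.

```python
# Each option is a list of (probability, utility) pairs.
# All numbers here are hypothetical, for illustration only.
def expected_utility(outcomes):
    """Return the probability-weighted sum of utilities."""
    return sum(p * u for p, u in outcomes)

safe_option = [(1.0, 50)]               # a certain outcome with utility 50
risky_option = [(0.5, 120), (0.5, 0)]   # a coin flip between utility 120 and 0

print(expected_utility(safe_option))    # 50.0
print(expected_utility(risky_option))   # 60.0
```

Under these assumed utilities the risky option scores higher. A different definition of utility (for example, one that penalises uncertainty) could reverse the ranking, which is one reason the definition of utility matters.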

## [3] Retrieving data from APIs

Data is not always conveniently stored in local files.

Increasingly data is being made open via **Application Programming Interfaces (APIs)**.

First some **functions** to help us. Using functions, we can avoid typing the same (or very similar) code over and over again.


In [58]:
# We will need some libraries...
from urllib import request
import json

# Functions to fetch string/json from an API

def fetch_string_from_api(url):
    req = request.Request(url)
    resp = request.urlopen(req)
    return resp.read().decode('utf8')

def fetch_json_from_api(url):
    body = fetch_string_from_api(url) #Uses the function above
    return json.loads(body)
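
Network requests can fail (no connection, a bad URL, a response that is not JSON). The following is a sketch of a more defensive variant, not part of the lecture code: the name `fetch_json_safely` and the 10-second default timeout are assumptions made for illustration.

```python
import json
from urllib import request

def fetch_json_safely(url, timeout=10):
    """Return parsed JSON from url, or None if the request or parsing fails."""
    try:
        # The with-block closes the connection when we are done
        with request.urlopen(url, timeout=timeout) as resp:
            return json.loads(resp.read().decode('utf8'))
    except (OSError, ValueError) as exc:
        # OSError covers network and HTTP errors; ValueError covers bad URLs and bad JSON
        print("Could not fetch", url, "-", exc)
        return None

result = fetch_json_safely("not-a-valid-url")
print(result)  # None
```

Returning `None` keeps a notebook running when a request fails, at the cost of having to check the result before using it.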

In [59]:
#Fetch the data for the latest xkcd comic
xkcd_url = 'http://xkcd.com/info.0.json'
xkcd_json = fetch_json_from_api(xkcd_url)
print(xkcd_json)

{'month': '6', 'num': 2314, 'link': '', 'year': '2020', 'news': '', 'safe_title': 'Carcinization', 'transcript': '', 'alt': "Nature abhors a vacuum and also anything that's not a crab.", 'img': 'https://imgs.xkcd.com/comics/carcinization.png', 'title': 'Carcinization', 'day': '1'}


In [60]:
comicUrl = xkcd_json.get("img")
print(comicUrl)

https://imgs.xkcd.com/comics/carcinization.png
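
The parsed JSON is an ordinary Python dictionary, so individual fields can be read by key. A small sketch using a trimmed copy of the response shown above (the `"artist"` key is hypothetical and deliberately absent):

```python
# A trimmed copy of the xkcd response printed earlier
comic = {'month': '6', 'num': 2314, 'year': '2020', 'title': 'Carcinization', 'day': '1'}

print(comic.get("title"))               # Carcinization
print(comic.get("artist"))              # None, because the key is missing
print(comic.get("artist", "unknown"))   # unknown, using a default value

# The date parts arrive as strings, so we can pad and join them
date = comic["year"] + "-" + comic["month"].zfill(2) + "-" + comic["day"].zfill(2)
print(date)                             # 2020-06-01
```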


In [61]:
# A library to display HTML
from IPython.display import display, HTML

# Use the library to display the comic
display(HTML('<img src="'+comicUrl+'"/>'))

What about something a little more serious?

In [62]:
country = "australia"
name = "queensland"
unis_url = "http://universities.hipolabs.com/search?country="+country+"&name="+name
unis_json = fetch_json_from_api(unis_url)
print(unis_json)

[{'web_pages': ['http://www.usq.edu.au/'], 'state-province': None, 'country': 'Australia', 'domains': ['usq.edu.au'], 'name': 'University of Southern Queensland', 'alpha_two_code': 'AU'}, {'web_pages': ['http://www.uq.edu.au/'], 'state-province': None, 'country': 'Australia', 'domains': ['uq.edu.au'], 'name': 'University of Queensland', 'alpha_two_code': 'AU'}, {'web_pages': ['http://www.cqu.edu.au/'], 'state-province': None, 'country': 'Australia', 'domains': ['cqu.edu.au'], 'name': 'Central Queensland University', 'alpha_two_code': 'AU'}, {'web_pages': ['http://www.jcu.edu.au/'], 'state-province': None, 'country': 'Australia', 'domains': ['jcu.edu.au'], 'name': 'James Cook University of North Queensland', 'alpha_two_code': 'AU'}, {'web_pages': ['http://www.qut.edu.au/'], 'state-province': None, 'country': 'Australia', 'domains': ['qut.edu.au'], 'name': 'Queensland University of Technology', 'alpha_two_code': 'AU'}]


In [63]:
len(unis_json)

5

In [64]:
for uni in unis_json:
    name = uni.get("name")
    url = uni.get("web_pages")[0]
    link = '<p><a href="'+url+'">'+name+'</a></p>'
    display(HTML(link))
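
The search URL above is built by joining strings, which works for simple values but can break when a value contains spaces or special characters. A sketch of an alternative using `urllib.parse.urlencode` follows. The helper name `build_unis_url` and the "new zealand" query are assumptions for illustration.

```python
from urllib.parse import urlencode

def build_unis_url(country, name):
    # urlencode escapes spaces and special characters in the values
    query = urlencode({"country": country, "name": name})
    return "http://universities.hipolabs.com/search?" + query

print(build_unis_url("australia", "queensland"))
# http://universities.hipolabs.com/search?country=australia&name=queensland
print(build_unis_url("new zealand", "auckland"))
# http://universities.hipolabs.com/search?country=new+zealand&name=auckland
```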

## [4] Analysing unstructured data
* What is structured data?
* What is unstructured data?
* What is semi-structured data?
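
As a rough illustration of the three terms, here is the same kind of information in each form. The review content is invented for this sketch.

```python
import json

# Unstructured: free text with no fields at all (hypothetical review)
unstructured = "Loved this book, it arrived quickly and I read it in a day."

# Semi-structured: named fields, but records need not share the same fields
semi_structured = json.loads('{"rating": 5, "text": "Loved this book", "tags": ["fast delivery"]}')

# Structured: every row has the same columns in the same order
columns = ("rating", "subject", "text")
structured = [
    (5, "Loved this book", "It arrived quickly and I read it in a day."),
    (1, "Disappointing", "The cover was damaged."),
]

print(semi_structured["rating"])        # 5
print([row[0] for row in structured])   # [5, 1]
```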

With the following code, we transform unstructured data into structured data.

We start by loading the data from the file system. In this case, it is a text file of 50 Amazon reviews.

In [65]:
# Read the whole file into one string; the with-block closes the file for us
with open("data/kaggle-amazon_reviews-first50.txt") as file:
    rawtext = file.read()


In [66]:
rawtext


An easy first step in structuring the data is to split the string into a list of strings.

In [67]:
reviews = rawtext.split('\n')
if reviews[-1]=='':
    del reviews[-1] #Remove last empty item


In [68]:
reviews

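
To see what the split step does, here is a sketch on a tiny hypothetical sample. The two reviews are invented, and their `__label__N subject: text` layout is an assumption inferred from the parsing code in this notebook.

```python
# Hypothetical two-review sample, one review per line
sample_text = ("__label__2 Great value: Works well and arrived early.\n"
               "__label__1 Broke quickly: Stopped working after a week.\n")

sample_reviews = sample_text.split('\n')
print(len(sample_reviews))  # 3, because the trailing newline leaves an empty last item

if sample_reviews[-1] == '':
    del sample_reviews[-1]  # Remove last empty item
print(len(sample_reviews))  # 2
print(sample_reviews[0])    # __label__2 Great value: Works well and arrived early.

# str.splitlines() gives the same list without the empty last item
print(sample_text.splitlines() == sample_reviews)  # True
```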

Now we structure each review further by extracting the sentiment and the subject.

In [69]:
import re # Need the python regular expression library for these functions

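def getSentimentLabel(text):
    # The lookbehind (?<=__label__) matches digits that directly follow "__label__"
    match = re.search(r"(?<=__label__)[0-9]+",text)
    value = match.group(0)
    if value=='1':
        return "negative"
    elif value=='2':
        return "positive"

def getSubject(text):
    # Split on the label digits; the subject is whatever follows them
    split = re.split(r"(?<=__label__)[0-9]+",text)
    return split[1].strip()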
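
The regular expression used in both functions relies on a lookbehind: `(?<=__label__)` requires the digits to be preceded by `__label__` without including that prefix in the match. A short sketch on a hypothetical leader string:

```python
import re

sample_leader = "__label__2 Great value"  # hypothetical example

# search() returns only the digits, because the lookbehind is not part of the match
match = re.search(r"(?<=__label__)[0-9]+", sample_leader)
print(match.group(0))  # 2

# split() cuts the string where the digits were matched
parts = re.split(r"(?<=__label__)[0-9]+", sample_leader)
print(parts)             # ['__label__', ' Great value']
print(parts[1].strip())  # Great value
```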

Now that we have the bits, we can store them in our own custom data structure `Review` based on a `namedtuple`. We also create a function to parse the reviews into this data structure.

In [70]:
from collections import namedtuple # Need a library that gives us a named tuple

Review = namedtuple('Review',['label','subject','text'])
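
A `namedtuple` behaves like a normal tuple, but its fields can also be read by name. A brief sketch with a hypothetical review:

```python
from collections import namedtuple

Review = namedtuple('Review', ['label', 'subject', 'text'])

# Hypothetical review, invented for illustration
example = Review('positive', 'Great value', 'Works well and arrived early.')

print(example.label)   # positive (access by field name)
print(example[1])      # Great value (access by position, like a normal tuple)

# Unpacking works as with any tuple
label, subject, text = example
print(subject)         # Great value
```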

In [71]:
def parseReview(text):
    textSplit = text.split(":")
    leader = textSplit[0]
    text = textSplit[1]
    subject = getSubject(leader)
    label = getSentimentLabel(leader)
    return Review(label,subject,text)

In [72]:
structuredReviews = list(map(parseReview,reviews))
structuredReviews

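
Once reviews are in a structured form, simple questions become easy to answer. The sketch below counts sentiment labels with `collections.Counter`. The three reviews are hypothetical stand-ins for the parsed data.

```python
from collections import Counter, namedtuple

Review = namedtuple('Review', ['label', 'subject', 'text'])

# Hypothetical stand-ins for the parsed reviews
sample_structured = [
    Review('positive', 'Great value', 'Works well.'),
    Review('negative', 'Broke quickly', 'Stopped working after a week.'),
    Review('positive', 'Would buy again', 'No complaints.'),
]

# Count how many reviews carry each label
label_counts = Counter(review.label for review in sample_structured)
print(label_counts['positive'])  # 2
print(label_counts['negative'])  # 1

# Filter on a field to pull out a subset
positive_subjects = [r.subject for r in sample_structured if r.label == 'positive']
print(positive_subjects)  # ['Great value', 'Would buy again']
```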

## [5] Visualising with HTML

We have structured data now, but it is difficult to explore as it is not in a format that is easy for humans to read. Let's fix that...

In [73]:
def reviewsToHtml(reviewList):
    def pTag(review): #function that wraps review in tags
        return '<p><b class="'+review.label+'">'+review.subject+"</b>: "+review.text+"</p>"
    paras = map(pTag,reviewList) #Apply the wrapping function to the list
    return HTML(''.join(paras)) #Join the paragraphs together and return as HTML

structReviewsHtml = reviewsToHtml(structuredReviews)
css = HTML("""<style>
.positive { color: green; }
.negative { color: red; }
</style>""")


In [74]:
display(css,structReviewsHtml)

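
Review text can contain characters such as `<` or `&` that have special meaning in HTML. One possible safeguard, sketched here with the standard library's `html.escape`, is to escape each piece before wrapping it in tags. The helper name `p_tag` and the sample values are assumptions for illustration.

```python
from html import escape

def p_tag(label, subject, text):
    # escape() replaces &, < and > (and quotes) with HTML entities
    return ('<p><b class="' + escape(label) + '">' + escape(subject)
            + '</b>: ' + escape(text) + '</p>')

print(p_tag('positive', 'Great <value>', 'Tom & Jerry approved'))
# <p><b class="positive">Great &lt;value&gt;</b>: Tom &amp; Jerry approved</p>
```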

**DISCUSSION**
- We did this for 50 reviews. How many reviews could we perform this task on?
- What other structuring could we do to the data?
- In what way/s might we have *corrupted* the data?

## [6] Introducing Assignment 3

* Assignment 3 (Reflective Journal) is on Blackboard
* We recommend using [GoingOK](http://qut.goingok.org) to record regular reflections
* Formative component

## [7] Introducing Assignment 1

* Assignment 1 (Data Analytics Notebook) is on Blackboard
* 2 parts:
    - Part A - basic skills notebook. Multiple attempts, guaranteed grade, computer marked.
    - Part B - data analytics notebook for two questions. Demonstrate understanding of applying techniques to a specific question.
* Success in Part A guarantees a grade of 4; the degree of success in Part B determines the final grade.