<div class="alert alert-block alert-info"><b>IAB303</b> - Data Analytics for Business Insight</div>

## WORKSHOP :: Open data for business environment assessment

* Businesses as socially and environmentally embedded organisations
    * Social conditions
    * Political environments
    * Natural environment
       

### Activity - Exploring Open Data

- Scan the website [Open Data as a business tool (World Bank)](http://blogs.worldbank.org/ic4d/open-data-business-tool-learning-initial-pilots)
    - What is open data?
    - Why is it valuable to business?

### Recap - The Data Analytics Cycle

1. Business Concern
2. Data
3. Analyse
4. Visualise
5. Insight into business concern

### Recap - Wrangling Data

*Taming your data so that it is easy to work with*

- Getting the data into a form that you can analyse
- Turning unstructured data into structured data
- Visualising the data in a way that makes it easy to interpret


First we need to load the data from some external source. Last week we read the data from a file.

In [18]:
# Libraries used by this notebook
from urllib import request
from IPython.display import display, HTML
from collections import namedtuple
import json
import re

In [19]:
with open("IAB303-notebooks/labs/data/kaggle-amazon_reviews-first50.txt") as file:
    rawtext = file.read() #The file is closed automatically when the block ends

In [20]:
rawtext

'__label__2 Great CD: My lovely Pat has one of the GREAT voices of her generation. I have listened to this CD for YEARS and I still LOVE IT. When I\'m in a good mood it makes me feel better. A bad mood just evaporates like sugar in the rain. This CD just oozes LIFE. Vocals are jusat STUUNNING and lyrics just kill. One of life\'s hidden gems. This is a desert isle CD in my book. Why she never made it big is just beyond me. Everytime I play this, no matter black, white, young, old, male, female EVERYBODY says one thing "Who was that singing ?"\n__label__2 One of the best game music soundtracks - for a game I didn\'t really play: Despite the fact that I have only played a small portion of the game, the music I heard (plus the connection to Chrono Trigger which was great as well) led me to purchase the soundtrack, and it remains one of my favorite albums. There is an incredible mix of fun, epic, and emotional songs. Those sad and beautiful tracks I especially like, as there\'s not too many

We need to start structuring this data into a form that enables us to analyse it.

In [21]:
reviews = rawtext.split("\n")
if reviews[-1]=='':
    del reviews[-1] #Remove last empty item
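
As an aside, here is a minimal sketch of an alternative to the clean-up step above. It uses a made-up two-line sample (`sampleText` is hypothetical, not the real file) to show that `splitlines()` does not produce the trailing empty item. Note that `splitlines()` also breaks on a few other line-boundary characters, so the results could differ if the reviews contain them.

```python
# Hypothetical sample standing in for the contents of the reviews file
sampleText = "__label__2 Great CD: Loved it.\n__label__1 Poor: Broke in a week.\n"

# split("\n") leaves an empty string after the final newline...
viaSplit = sampleText.split("\n")
# ...whereas splitlines() does not, so no clean-up step is needed
viaSplitlines = sampleText.splitlines()

print(len(viaSplit), len(viaSplitlines))
```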

In [22]:
reviews

['__label__2 Great CD: My lovely Pat has one of the GREAT voices of her generation. I have listened to this CD for YEARS and I still LOVE IT. When I\'m in a good mood it makes me feel better. A bad mood just evaporates like sugar in the rain. This CD just oozes LIFE. Vocals are jusat STUUNNING and lyrics just kill. One of life\'s hidden gems. This is a desert isle CD in my book. Why she never made it big is just beyond me. Everytime I play this, no matter black, white, young, old, male, female EVERYBODY says one thing "Who was that singing ?"',
 "__label__2 One of the best game music soundtracks - for a game I didn't really play: Despite the fact that I have only played a small portion of the game, the music I heard (plus the connection to Chrono Trigger which was great as well) led me to purchase the soundtrack, and it remains one of my favorite albums. There is an incredible mix of fun, epic, and emotional songs. Those sad and beautiful tracks I especially like, as there's not too ma

Now we have a list of individual reviews, which is a good start, but we can structure the data further. Last week we created a number of functions to do this work for us. Functions enable us to encapsulate multiple lines of code into a single function call.

In [23]:
def getSentimentLabel(text):
    match = re.search(r"(?<=__label__)[0-9]+",text)
    value = match.group(0)
    if value=='1':
        return 'negative'
    elif value=='2':
        return 'positive'

In [24]:
def getSubject(text):
    split = re.split(r"(?<=__label__)[0-9]+",text)
    return split[1].strip()
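
Both functions rely on the same regular expression, so it may help to see what it does on its own. The sketch below uses a made-up header (`sampleHeader` is hypothetical) in the same format as the review data. Keep in mind that `re.search` returns `None` when nothing matches, so a line without a label would raise an error in `getSentimentLabel`.

```python
import re

# Hypothetical sample in the same format as the review data
sampleHeader = "__label__1 Poor quality"

# (?<=__label__) is a lookbehind: it requires "__label__" to come
# immediately before the match but does not include it in the match
pattern = r"(?<=__label__)[0-9]+"

digits = re.search(pattern, sampleHeader).group(0)
parts = re.split(pattern, sampleHeader)

print(digits)  # the label digits only
print(parts)   # the text before and after the digits
```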

We also create a data structure to put our data in.

In [25]:
Review = namedtuple('review',['label','subject','text'])
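
If named tuples are new to you, the following sketch shows how the fields can be read back. The field values are made up for illustration.

```python
from collections import namedtuple

# Same shape as the Review structure above; the field values are made up
Review = namedtuple('review', ['label', 'subject', 'text'])
example = Review('positive', 'Great CD', ' Loved it.')

print(example.subject)    # access a field by name
print(example[0])         # or by position, like an ordinary tuple
print(example._asdict())  # convert to a dictionary keyed by field name
```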

Then we write a function that parses a single review into this structure.

In [26]:
def parseReview(text):
    textSplit = text.split(':', 1) #Split on the first colon only, so colons in the review body are kept
    text = textSplit[1]
    subject = getSubject(textSplit[0])
    label = getSentimentLabel(textSplit[0])
    return Review(label,subject,text)

Now we can run our function parseReview over the list of reviews.

In [27]:
structuredReviews = list(map(parseReview,reviews))

In [28]:
structuredReviews

[review(label='positive', subject='Great CD', text=' My lovely Pat has one of the GREAT voices of her generation. I have listened to this CD for YEARS and I still LOVE IT. When I\'m in a good mood it makes me feel better. A bad mood just evaporates like sugar in the rain. This CD just oozes LIFE. Vocals are jusat STUUNNING and lyrics just kill. One of life\'s hidden gems. This is a desert isle CD in my book. Why she never made it big is just beyond me. Everytime I play this, no matter black, white, young, old, male, female EVERYBODY says one thing "Who was that singing ?"'),
 review(label='positive', subject="One of the best game music soundtracks - for a game I didn't really play", text=" Despite the fact that I have only played a small portion of the game, the music I heard (plus the connection to Chrono Trigger which was great as well) led me to purchase the soundtrack, and it remains one of my favorite albums. There is an incredible mix of fun, epic, and emotional songs. Those sad 
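
Once the reviews are structured, simple questions become easy to answer. As a minimal sketch, the code below counts how many reviews carry each label. It uses a few made-up reviews (`sampleReviews` is hypothetical) rather than the real data, so the counts are illustrative only.

```python
from collections import Counter, namedtuple

Review = namedtuple('review', ['label', 'subject', 'text'])

# Made-up reviews for illustration (not taken from the data file)
sampleReviews = [
    Review('positive', 'Great CD', ' Loved it.'),
    Review('negative', 'Poor quality', ' Broke in a week.'),
    Review('positive', 'Good value', ' Would buy again.'),
]

# Count how often each sentiment label occurs
labelCounts = Counter(review.label for review in sampleReviews)
print(labelCounts)
```

The same line should work on `structuredReviews`, since each item has a `label` field.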

We have now turned the unstructured data into a structured format. However, it is difficult to explore due to the way the structure is displayed. Let's improve our ability to explore the data.

In [29]:
def reviewsToHtml(reviewList):
    def pTag(review): #function that wraps review in tags
        return '<p><b class="'+review.label+'">'+review.subject+"</b>: "+review.text+"</p>"
    paras = map(pTag,reviewList) #Apply the wrapping function to the list
    return HTML(''.join(paras)) #Join the paragraphs together and return as HTML

In [30]:
structReviewsHtml = reviewsToHtml(structuredReviews)
css = HTML("""<style>
.positive { color: green; }
.negative { color: red; }
</style>""")

In [31]:
display(css,structReviewsHtml)
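
One thing to be aware of: `reviewsToHtml` places the review text directly into the HTML, so a review containing characters such as `<` or `&` could be misread as markup. The sketch below shows one possible way to guard against this using `html.escape` from the standard library. `pTagEscaped` is a hypothetical variant of `pTag`, and the sample values are made up.

```python
from html import escape

def pTagEscaped(label, subject, text):
    # Hypothetical variant of pTag that escapes each value before
    # placing it inside the HTML markup
    return ('<p><b class="' + escape(label) + '">' + escape(subject)
            + '</b>: ' + escape(text) + '</p>')

escapedPara = pTagEscaped('negative', 'Too <small>', ' Fits a 5" & nothing more')
print(escapedPara)
```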

### Recap - Structured, Unstructured and Semi-structured Data

*Understanding the format of the data so that you can select the appropriate tools to use for analysis*

- Structured: Usually tabular, like a house price index by country and year
- Unstructured: Usually text like the review data
- Semi-structured: What the review data looked like after we finished wrangling it, but also like the data that we are going to look at today
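
To make the three categories concrete, here is a rough sketch of how each might look in Python. The house price figures are made up, the review text is shortened, and the variable names are hypothetical.

```python
import json

# Structured: fixed columns, one row per record (values are made up)
structured = [("Australia", 2018, 142.3), ("Australia", 2019, 139.8)]

# Unstructured: free text with no explicit fields (shortened review)
unstructured = "__label__2 Great CD: My lovely Pat has one of the GREAT voices..."

# Semi-structured: named fields, which may be nested, empty or optional
semiStructured = json.loads('{"title": "Size Venn Diagram", "num": 2122, "link": ""}')

print(structured[0][1])         # pick a value by row and column
print(semiStructured["title"])  # pick a value by field name
```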

### Activity - Exploring Open Data

- Explore the [data.qld.gov.au](https://data.qld.gov.au) website - what data is available?

- What other open data can you find?

### Demo - Obtaining data via APIs

An Application Programming Interface (API) allows a computer program to connect to other software for the purpose of utilising a service offered by that software.

Particularly relevant to this unit are web APIs, which provide data services and can be 'called' by other software.
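
Calling a web API usually means sending a request to a URL, often with parameters that describe what you want. The sketch below builds such a request without sending it. The endpoint and parameters are hypothetical and only illustrate the general shape. Real APIs document their own URLs and parameter names.

```python
from urllib import parse, request

# Hypothetical endpoint and parameters, for illustration only
baseUrl = "https://api.example.com/v1/search"
params = {"q": "house prices", "limit": 5}

# urlencode turns the dictionary into a safely escaped query string
url = baseUrl + "?" + parse.urlencode(params)
print(url)

# Building a Request object does not contact the server;
# request.urlopen(apiRequest) would send it
apiRequest = request.Request(url, headers={"Accept": "application/json"})
print(apiRequest.get_method())
```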

In [33]:
#Fetch the data for the latest xkcd comic
comicRequest = request.Request('http://xkcd.com/info.0.json')
comicResponse = request.urlopen(comicRequest)
print(comicResponse.status)                   
body = comicResponse.read().decode('utf8')
print(body)

200
{"month": "3", "num": 2122, "link": "", "year": "2019", "news": "", "safe_title": "Size Venn Diagram", "transcript": "", "alt": "Terms I'm going to start using: The Large Dipper, great potatoes, the Big Hadron Collider, and Large Orphan Annie.", "img": "https://imgs.xkcd.com/comics/size_venn_diagram.png", "title": "Size Venn Diagram", "day": "11"}


The XKCD API gave us back info about the latest comic, including a URL to the comic itself, so let's extract that...

In [34]:
jsonData = json.loads(body)
comicUrl = jsonData.get("img")
print(comicUrl) 

https://imgs.xkcd.com/comics/size_venn_diagram.png
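
The other fields can be extracted in the same way. The sketch below works offline on a shortened copy of the response shown above. It shows that the date parts arrive as strings and need converting, and that `.get()` can supply a default when a field is absent. The `published` and `missing` names are just for illustration.

```python
import json
from datetime import date

# Response body copied from the output above, shortened to a few fields
body = '{"month": "3", "num": 2122, "year": "2019", "title": "Size Venn Diagram", "day": "11"}'
jsonData = json.loads(body)

# json.loads returns a dict; numbers stay numbers, but the date parts
# arrive as strings and need converting
published = date(int(jsonData["year"]), int(jsonData["month"]), int(jsonData["day"]))

# .get() returns None (or a supplied default) when a key is absent
missing = jsonData.get("img", "no image field")

print(jsonData["title"], published.isoformat(), missing)
```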


Now, let's display the image as HTML...

In [35]:
display(HTML('<img src="'+comicUrl+'"/>'))

Try calling other APIs from these sites...
* [Any API](https://any-api.com)
* [toddmotto public APIs](https://github.com/toddmotto/public-apis)

### Why Open Data?

* [Open Opportunities](http://smallville.com.au/open-opportunities-business-open-data/)
* What do businesses have to gain by sharing data?
    * Expand customer base
    * Customer loyalty via transparency - evidence of claims
    * Knowledge base adds value to business
    * Product improvement - communication - realtime
* What do businesses have to gain by using open data?
    * Identify new customers and markets
    * Identify trends
    * Gauge sentiment of customers
    * Adjust operations more quickly in response to external events
    * Competitive knowledge
* What environmental factors?
    * Political events
    * Geographic/weather conditions