<div class="alert alert-block alert-info"><b>IAB303</b> - Data Analytics for Business Insight</div>

# Module 6 - Opportunities and Threats

In this module we focus on analysis of data that is **external** to the business, and look at why it is important for identifying opportunities for growth and threats to survival.

## The Kodak Case

<img src="https://kevinuhrmacher.files.wordpress.com/2011/11/uhrmacher_kodak.jpg?w=1240"></img>

[Kevin Uhrmacher, 2011](https://kevinuhrmacher.wordpress.com/2011/11/10/the-end-of-kodaks-moment/)


### Kodak Threats and Opportunities

Read this Harvard Business Review article on [Kodak’s Downfall Wasn’t About Technology](https://hbr.org/2016/07/kodaks-downfall-wasnt-about-technology), and make a list of possible threats and opportunities.

What does the author say is the main reason for Kodak's downfall?

## Learning about customers

### Psychographic segmentation

Watch this segment of [4 Main Types of Market Segmentation & Their Benefits](https://youtu.be/EQ2pgHbvK0A?t=206), and identify the kinds of data that might be analysed, and the benefits for the business of using this type of market segmentation.


In [None]:
#!pip install tweepy

In [None]:
import tweepy
import pandas as pd
import json
from IPython.display import display, HTML

with open('data/twitter_credentials.json', 'r') as file:
    credentials = json.load(file)

auth = tweepy.OAuthHandler(credentials['API_KEY'], credentials['API_SECRET'])
auth.set_access_token(credentials['ACCESS_TOKEN'], credentials['ACCESS_SECRET'])
api = tweepy.API(auth)


In [None]:
# Note: Tweepy 4.x renames this method to `api.search_tweets`.
tweets = api.search(q="#gardening", lang="en", count=50, include_entities=True)

In [None]:
for tweet in tweets:
    print(tweet.created_at)
    print(tweet.id)
    print(tweet.text,'\n')

What kind of data do we receive for each tweet?

In [None]:
# Take the first tweet in the results so we can inspect its raw JSON
first_tweet = tweets[0]
first_tweet._json

What other data can we extract?
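One way to explore this question is to flatten a few fields of the tweet's JSON into a single row. The sketch below is only an illustration: `sample_tweet` is a hypothetical, made-up dictionary shaped like the output of `tweet._json`, and `summarise_tweet` is a helper name introduced here, not part of Tweepy.

```python
# Hypothetical sample shaped like the dictionary returned by `tweet._json`.
# The values are made up for illustration; only the field names follow
# the tweet object returned by the Twitter API.
sample_tweet = {
    "created_at": "Mon Apr 01 10:00:00 +0000 2019",
    "id": 1,
    "text": "First tomatoes of the season! #gardening #tomatoes",
    "retweet_count": 3,
    "favorite_count": 12,
    "user": {"screen_name": "example_gardener", "followers_count": 250},
    "entities": {"hashtags": [{"text": "gardening"}, {"text": "tomatoes"}]},
}


def summarise_tweet(tweet_json):
    """Flatten a few fields of a tweet dictionary into a single row."""
    user = tweet_json.get("user", {})
    hashtags = tweet_json.get("entities", {}).get("hashtags", [])
    return {
        "created_at": tweet_json.get("created_at"),
        "screen_name": user.get("screen_name"),
        "followers": user.get("followers_count"),
        "hashtags": [tag["text"] for tag in hashtags],
        "retweets": tweet_json.get("retweet_count"),
        "likes": tweet_json.get("favorite_count"),
    }


row = summarise_tweet(sample_tweet)
print(row)
```

With real results, a list such as `[summarise_tweet(t._json) for t in tweets]` could be passed to `pd.DataFrame` to get one row per tweet for further analysis.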

In [None]:
for tweet in tweets:
    # Only some tweets carry attached media
    if 'media' in tweet.entities:
        for image in tweet.entities['media']:
            url = image['media_url']
            display(HTML('<img src="'+url+'" width="30%"/>'))

We can also get user information...

In [None]:
first_tweet.user._json

Even more detail is available on each user by querying the API for that user:

In [None]:
api.get_user(screen_name='andrewresearch')._json

**DISCUSSION**
* What is it that is different about social media data like Twitter?
* What other kinds of data sources produce this type of data?
* How is this significant for business?

## Lagging or Realtime data - tapping the stream

Watch this video on [How Walmart is using Big Data & IOT](https://youtu.be/42xErufN1e8), and identify how external and internal information is blurred through realtime use of Internet of Things (IoT) technologies.

### Streaming - Google Maps

* What you get
* What Google gets
* The feedback loop

### Streaming data and analytics

* What is it and why is it important? (Mike Gualtieri)



> **Streaming Analytics:** Technology that ingests, analyses, and acts on high throughput of data from live data sources to identify patterns, detect urgent situations, and automate immediate actions in real time.
> [Mike Gualtieri](https://tdwi.org/articles/2016/08/29/define-your-business-case-for-streaming-analytics.aspx)

* All data originates in real time
* Insights are perishable
* How can you ... **right now**?
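To make "detect urgent situations" concrete, here is a minimal sketch of one common streaming technique, a sliding time window. It assumes events arrive in time order; the function name `detect_bursts`, its parameters, and the event times are all hypothetical and chosen only for illustration.

```python
from collections import deque


def detect_bursts(timestamps, window_seconds=60, threshold=3):
    """Yield each timestamp at which at least `threshold` events
    have occurred within the last `window_seconds` seconds."""
    window = deque()
    for ts in timestamps:
        window.append(ts)
        # Drop events that are `window_seconds` or more older than the
        # current event, so the window only holds recent activity.
        while window[0] <= ts - window_seconds:
            window.popleft()
        if len(window) >= threshold:
            yield ts


# Hypothetical event times, in seconds since the start of monitoring.
events = [0, 50, 100, 110, 115, 300]
alerts = list(detect_bursts(events))
print(alerts)
```

Because each event is handled as it arrives and old events are discarded, the same logic could run over a live feed without storing the full history.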

#### LEARN MORE
  
> "Information derived from such analysis gives companies visibility into many aspects of their business and customer activity such as – service usage (for metering/billing), server activity, website clicks, and geo-location of devices, people, and physical goods – and enables them to respond promptly to emerging situations."
>
> [What is Streaming Data?](https://aws.amazon.com/streaming-data/)


> "The success of a streaming analytics program is critically bound to establishing a proper business case."
>
> [Define Your Business Case for Streaming Analytics](https://tdwi.org/articles/2016/08/29/define-your-business-case-for-streaming-analytics.aspx)


## Accessing Streaming Data from Twitter

We need to be able to apply a function to each element of the stream.
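The idea can be sketched without a network connection. In the example below, `fake_stream` is a hypothetical stand-in for a live source, and `process_stream` and `word_count` are illustrative names; none of them are part of Tweepy. The handler plays the same role as a listener callback: it is called once per element as that element arrives.

```python
def fake_stream():
    """Stand-in for a live stream: yields one status text at a time."""
    # Hypothetical messages, made up for illustration.
    yield "Planted the spring bulbs today"
    yield "Any tips for growing basil indoors?"
    yield "Frost warning tonight, cover your seedlings"


def process_stream(stream, handler):
    """Apply `handler` to each element as it arrives, yielding each result."""
    for status in stream:
        yield handler(status)


def word_count(status_text):
    """Example handler: count the words in one status."""
    return len(status_text.split())


counts = list(process_stream(fake_stream(), word_count))
print(counts)
```

Since `process_stream` is a generator, it never needs the whole stream in memory, which matters when the source does not end.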

In [None]:
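# Note: Tweepy 4.x removed StreamListener; there you subclass tweepy.Stream instead.
class MyStreamListener(tweepy.StreamListener):
    def on_status(self, status):
        # Called once for every tweet that arrives on the stream
        print(status.text)
        if 'media' in status.entities:
            for image in status.entities['media']:
                url = image['media_url']
                display(HTML('<img src="'+url+'" width="30%"/>'))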

In [None]:
myStreamListener = MyStreamListener()
myStream = tweepy.Stream(auth = api.auth, listener=myStreamListener)
myStream.filter(track=['#trump'])

### Accessing web based information


In [None]:
#!pip install bs4
#!pip install html5lib

To check that the modules are installed, execute the following cell. Provided that you have installed them correctly, the cell will run without error:

In [None]:
from bs4 import BeautifulSoup
import html5lib
import urllib.request

In [None]:
def get_HTML(url):
    response = urllib.request.urlopen(url)
    html = response.read()
    return html

Read this page (https://www.crummy.com/software/BeautifulSoup/bs4/doc/) to learn more about the functions ('find', 'findNext') used in the code below.
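As a minimal sketch of how 'find' and 'findNext' behave, the example below parses a small, hypothetical HTML fragment (made up for illustration) instead of a live page. It assumes the `bs4` package is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment, made up for illustration.
html = """
<h2><span>Fruit prices</span></h2>
<p>Some introductory text.</p>
<table>
  <tr><td align="left"><a href="/apple">Apple</a></td><td>1.20</td></tr>
  <tr><td align="left"><a href="/pear">Pear</a></td><td>0.90</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# 'find' returns the first match; here we match on the text itself.
title_text = soup.find(string="Fruit prices")
# The parent of the matched text is the element that contains it.
title_element = title_text.parent
# 'findNext' returns the first matching element that appears later in the document.
table = title_element.findNext("table")

# Collect the link text from every left-aligned cell in that table.
names = [td.find("a").text for td in table.findAll("td", attrs={"align": "left"})]
print(names)
```

The camelCase names match the ones used in this notebook; Beautifulsoup also provides the equivalent `find_next` and `find_all`.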



## Finding opportunities and threats via web pages

Information about threats and opportunities can be found on web pages. This information can be scraped to enable analysis.

### A Web Scraping Scenario

You are a market analyst working for a tourism agency, and your boss has approached you with a client who needs a recommendation on the top tourist destinations of 2018.

This sounds easy, but the client has also asked that places with a high quality of life be prioritised in the recommendation, in the hope that this will improve their tourism experience.

Fortunately for this task, the top tourist destinations of 2018 are listed at the following URL:

In [None]:
top_tourism_destinations = 'https://en.wikipedia.org/wiki/World_Tourism_rankings'

Using the Developer Tools, identify things that could be used to isolate the names of the countries in the table, in the section entitled "Most visited destinations by international tourist arrivals". 

For this task, the details have been given; however, the code that retrieves the values is only half completed.

Details:

* An 'h2' element contains a 'span' element, which holds the title of the target 'table'.
* A 'table' element follows the 'span' element.
* There are 'td' elements inside the 'table' element.
* Each 'td' element has an attribute of 'align' with the value 'left'.
* In each 'td' element, there is an 'a' element with the name of a given country inside it.

In [None]:
top_tourist_locations = []

Tourism_Wiki_HTML = get_HTML('https://en.wikipedia.org/wiki/World_Tourism_rankings')
soup = BeautifulSoup(Tourism_Wiki_HTML, "html.parser")
# Find the section title text; its parent is the element that holds the title
title_text = soup.find(text="Most visited destinations by international tourist arrivals")
title_element = title_text.parent
# The target table is the first 'table' that follows the title
table_element = title_element.findNext('table')
for td_element in table_element.findAll('td', attrs={'align': 'left'}):
    a_element = td_element.find('a')
    if a_element is not None:
        top_tourist_locations.append(a_element.text)

# If you enter the missing code, this will return a list of names of the top tourist destinations for 2018.
top_tourist_locations

Knowing that the client is also looking for places that have higher quality of life, what could we use from a single country's Wikipedia page to determine this quality?

The country's HDI (Human Development Index) may be an indication of this, so how do we locate the HDI on the page?

Once again, here are some details to help:

   * The text 'HDI' is in an 'a' element.
   * The 'a' element is in a 'th' element.
   * The 'th' is followed by a 'td' element.
   * The 'td' element contains an 'img' element.
   * Next to the 'img' element is the HDI value.

The code that retrieves the HDI from a country's Wikipedia page is included in the following function, but it is incomplete:

In [None]:
def get_country_HDI(html):
    soup = BeautifulSoup(html, "html.parser")
    a_element = soup.find('a',text="HDI")
    th_element = a_element.parent
    td_element = th_element.findNext('td')
    HDI_value = td_element.find('img').findNextSibling(text=True)
    return HDI_value.strip()

# If you enter the missing code, this function will produce the value '0.897'
France_wiki = get_HTML("https://en.wikipedia.org/wiki/France")
get_country_HDI(France_wiki)

Now all we have to do to get the HDI of each country is substitute each country's name into the Wikipedia URL and feed the returned HTML into the 'get_country_HDI' function:

In [None]:
# Look up and print the HDI for each destination, in ranking order
for rank, country in enumerate(top_tourist_locations, start=1):
    country_url = 'https://en.wikipedia.org/wiki/' + country.replace(' ', '%20')
    print("Country: " + country)
    print("Ranking: " + str(rank))
    print("HDI: " + get_country_HDI(get_HTML(country_url)))
    print('\n')
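The loop above builds each URL by replacing spaces with '%20'. A slightly more general sketch, assuming country names may also contain other characters that need escaping, uses `urllib.parse.quote` from the standard library. The names `WIKI_BASE` and `build_wiki_url` are introduced here for illustration only.

```python
from urllib.parse import quote

# Base address of English Wikipedia articles, as used earlier in this notebook.
WIKI_BASE = "https://en.wikipedia.org/wiki/"


def build_wiki_url(country_name):
    """Return the Wikipedia URL for a country name, percent-encoding
    spaces and any other characters that are not URL-safe."""
    return WIKI_BASE + quote(country_name)


urls = [build_wiki_url(name) for name in ["France", "United States"]]
print(urls)
```

Each resulting URL could then be passed to `get_HTML` in place of the hand-built string.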