# Module 4 Tutorial - Unstructured External Business Data

In this tutorial, we'll be working with a variety of datasets that are distributed through websites and APIs.

## Retrieving Data From An Authorised API - Traffic Data

### [1] Question - Planning A Parade

Let us consider the business concern of today's tutorial:

> **CONCERN:** The Brisbane City Council has decided to host an inner-city parade for environmental conservation efforts. The parade's coordinators have advised that it should start and end at two separate parks in the CBD, with at least one of them being on the riverfront. To minimise disruptions for transit and business-related affairs, they would like to know which two parks would be the best for the occasion.

For this task, we are going to need a dataset (or perhaps two datasets) that can help us understand what the typical commotion looks like in the Brisbane CBD. We hint that this analysis will involve retrieving data from the following website: https://data.gov.au/

>**QUESTION:** Can you figure out what datasets would be most suitable to retrieve from https://data.gov.au/?

>>**ANSWER:** ???

Like many websites that distribute data, https://data.gov.au/ is strict about who can access its content.

>**QUESTION:** Why would any organisation be concerned about who they give their data to?

>>**ANSWER:** ???

As such, to retrieve the datasets we are interested in, we must first create an account with the government's data portal. This can be done by first clicking the login button on the main page of the site:

<img src="graphics/4_login.png" style="margin-left: 50px; width: 40%;">

Then navigating to the "Create Account" option on the page that appears:

<img src="graphics/4_create_account.png" style="margin-left: 50px; width: 40%;">

Once you've done this, you should be able to access the data you are interested in without being blocked. Download the appropriate files and upload them into your Jupyter Lab environment. After that, we can begin the data analytics process.

### [2] Data - Loading The ___data.gov.au___ Datasets Into Our Environment

We'll begin by loading in the datasets we are concerned with. Python has a built-in function for opening files, and the resulting file object has a method for reading its contents:

#### You will need to use the following Python commands:

```python
    open()
    .read()
```

In [115]:
??? = ???(???, 'r')
??? = ???.???()

??? = ???(???, 'r')
??? = ???.???()
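
As a point of comparison, here is a minimal sketch of the same read pattern on a throwaway file. The file name `sample_dataset.json` and its contents are made up for illustration and have nothing to do with the real datasets. The `with` form is one common way to make sure the file is closed afterwards.

```python
import os
import tempfile

# Create a small throwaway file so the example runs anywhere.
# The file name and contents are invented for illustration only.
sample_path = os.path.join(tempfile.mkdtemp(), "sample_dataset.json")
with open(sample_path, "w") as handle:
    handle.write('[{"id": 1}, {"id": 2}]')

# open() returns a file object; .read() returns the whole file as one string.
# The "with" block closes the file automatically when it ends.
with open(sample_path, "r") as handle:
    sample_text = handle.read()

print(type(sample_text))  # <class 'str'>
print(sample_text)
```

Note that the result is a plain string at this stage, which is why a further conversion step is needed before the data can be indexed.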

Next, you're going to need to open each of the datasets in the Jupyter _Editor_, and inspect the kind of file types we are using.

>**QUESTION:** Can you determine the file types of the datasets, and what this means we'll need to do to interpret the data?

>>**ANSWER:** ???

Now convert the raw text of each of the dataset files into JSON, using the appropriate Python commands:

#### You will need to use the following Python commands:

```python
    import
    .loads()
```

In [116]:
??? json
??? = ???.???(???)
??? = ???.???(???)
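
The sketch below shows what `json.loads()` does to a small, made-up JSON string. The string and its field names are assumptions chosen for illustration, not the contents of the real files.

```python
import json

# A made-up JSON string standing in for the raw text of a dataset file.
raw_text = '[{"name": "A", "value": 3}, {"name": "B", "value": null}]'

# json.loads() parses a JSON *string* into Python lists, dicts, numbers, etc.
parsed = json.loads(raw_text)

print(type(parsed))        # <class 'list'>
print(parsed[0]["name"])   # A
print(parsed[1]["value"])  # None (JSON null becomes Python None)
```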

### [3] Analysis - Understanding The Contents Of The Datasets

Now let us analyse the data to understand the relationship between the various datasets. You can do this in a few different ways. One way is to grab a sample (e.g. an item in the JSON file) and study it.

__HINT__: You can go to the dataset's description at https://data.gov.au/ to understand it better.

In [None]:
???[0]

In [None]:
???[0]
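
When a sample item is large, it can help to list just its field names and value types. The following is a rough sketch of one way to do that. The helper `describe` and the record `sample_record` are hypothetical and the fields are invented, so the real datasets will look different.

```python
# Made-up record; the real datasets will have different fields.
sample_record = {"id": 7, "label": "example", "position": {"x": 1.5, "y": -2.0}}

def describe(record, indent=0):
    """Return one line per field (name and value type); nested dicts are indented."""
    lines = []
    for key, value in record.items():
        lines.append(" " * indent + f"{key}: {type(value).__name__}")
        if isinstance(value, dict):
            # Recurse into nested dictionaries so their fields are listed too.
            lines.extend(describe(value, indent + 2))
    return lines

print("\n".join(describe(sample_record)))
```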

>**QUESTION:** After studying the dataset's samples, what would you say is the relationship between them?

>>**ANSWER:** ???

Given that we now understand the relationship between the datasets, let us return to our business concern by asking what we should be emphasising in these datasets.

>**QUESTION:** What is the important information to record in these datasets?

>>**ANSWER:** ???

Given that the important attributes are split between datasets, let's write a function to help us bring them together. Our function should accept the `subsystem` and `tsc` of an intersection, and return the `ds1` density reading from that intersection's traffic volume data:

In [118]:
def traffic_ds1(???,???):
    for record in intersection_volumes_data:
        if ((record['ss'] == ???) and (record['tsc'] == ???)):
            return record['ds1']
    return None
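
The function above scans the whole list on every call. If that becomes slow, one possible alternative is to build a dictionary keyed by the `(ss, tsc)` pair once and look values up from it. The sketch below assumes records shaped like the ones the function reads. The list `toy_volumes` and the helper `lookup_ds1` are hypothetical, and the values are invented.

```python
# Made-up records that mimic the shape used by traffic_ds1();
# the values are invented and are not real readings.
toy_volumes = [
    {"ss": "A", "tsc": 1, "ds1": 12},
    {"ss": "A", "tsc": 2, "ds1": 30},
    {"ss": "B", "tsc": 1, "ds1": 5},
]

# Build the index once: (ss, tsc) -> ds1.
# setdefault keeps the first match, mirroring the loop's "return on first hit".
ds1_index = {}
for record in toy_volumes:
    ds1_index.setdefault((record["ss"], record["tsc"]), record["ds1"])

def lookup_ds1(subsystem, tsc):
    """Return the ds1 reading for an intersection, or None if there is none."""
    return ds1_index.get((subsystem, tsc))

print(lookup_ds1("A", 2))  # 30
print(lookup_ds1("C", 9))  # None
```

The trade-off is a little extra memory for the index in exchange for constant-time lookups, which matters mostly when the function is called once per intersection.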

In the next step, we are going to be referencing coordinates a lot. Let's write a function to help us easily get a sample item's latitude and longitude:

In [123]:
def item_lat_lng(item):
    return [item['coordinates']['latLng'][???], item['coordinates']['latLng'][???]]

### [4] Visualisation - Plotting Map Data

For the next part of our analysis, we are going to need to plot some map data. To do this, we will be installing the `folium` map library. The `folium` library is a great tool for geographic visualisation. You can read more about it here: https://python-visualization.github.io/folium/quickstart.html

In [None]:
!pip install folium

Next we will visualise all the intersections in Brisbane using the `folium` library. To do this, we are going to be using two functions from the `folium` library. The functions are `folium.Map()` and `folium.Circle()`.

For your convenience, we've included the documentation for both these functions here:

* `folium.Map()` - https://python-visualization.github.io/folium/modules.html?highlight=map#folium.folium.Map
* `folium.Circle()` - https://python-visualization.github.io/folium/modules.html?highlight=circle#folium.vector_layers.Circle

We've also partially completed the code that visualises the intersection data. Using your understanding of the provided functions, tinker with the missing fields to make the visualisation appear correctly.

In [None]:
import folium

# First we set up the map

map_of_intersections = folium.Map(
    location=item_lat_lng(intersection_locations_data[0]),
    zoom_start=???, tiles='Stamen Terrain'
)

# Then we add all the coordinates to the map from the intersection data.

for coordinate in intersection_locations_data:
    try:
        folium.Circle(
            radius=???,
            location=item_lat_lng(???),
            color=???,
            fill=False).add_to(???)
    except Exception:  # skip records with missing or malformed coordinates
        pass
map_of_intersections

While this visualisation clearly shows where all the recorded intersections are placed, it doesn't communicate the commotion caused by traffic. Once again, we've provided some partially completed code to visualise this next part. See if you can fill in all the blanks:

In [None]:
# First we set up the map

map_of_intersections_with_commotion = folium.Map(
    location=item_lat_lng(intersection_locations_data[0]),
    zoom_start=???,
    tiles='Stamen Terrain'
)

# Then we add all the coordinates to the map from the intersection data, and give them weightings using the volume data.

for coordinate in intersection_locations_data:
    try:
        density_reading = traffic_ds1(coordinate[???], coordinate[???])
        if ??? is not None:
            folium.Circle(
                radius=???*40,
                location=item_lat_lng(???),
                color=???,
                fill=True,
                fill_opacity=???,
                weight=???).add_to(map_of_intersections_with_commotion)
    except Exception:  # skip records with missing or malformed data
        pass

map_of_intersections_with_commotion

Now we can clearly see what the commotion looks like at each of the intersections.

### [5] Insights - What Parks Should Be Chosen For The Parade?

We have now gotten to the stage where we have a visualisation showing the commotion at various intersections in Brisbane. Our last step in the data analytics cycle is to relate this back to the main business concern.

>**QUESTION:** At what two parks should we host the parade?

>>**ANSWER:** ???

>**QUESTION:** What issues can you see in our analytics process?

>>**ANSWER:** ???

>**QUESTION:** How might we improve our analytics process?

>>**ANSWER:** ???

### [6] Additional Task (Optional) - When Would The Weather Suit A Parade?

For students who manage to complete the first task, or would like to implement a 'weather data' analysis, we have provided the following supplementary concern.

> **CONCERN:** Greatly impressed with the quality of your previous analysis, the Brisbane City Council has now commissioned you to find out what day would be best to host the parade. The Lord Mayor has personally telephoned you, advising that he will not accept a cloud in the sky during what he hopes will be the choice of a beautiful sunny day. Furthermore, he has advised that the parade should take place within the next 5 days. What is your recommendation?

For this task, we strongly recommend using the API provided by https://openweathermap.org/. The process for obtaining the dataset is much like before, but this time you will need to generate an API key as an added security measure. See if you can figure out how to generate the API key. Once you are done, fill in the blanks in the code below:

In [None]:
import requests

params = {
    "q" : "Brisbane",
    "appid" : ???
}
r = requests.get("https://api.openweathermap.org/data/2.5/forecast", params=params)
r.content
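
To see what a request like this looks like on the wire, the sketch below builds the equivalent URL by hand with the standard library instead of sending anything. The value `YOUR_API_KEY` is a placeholder, not a real key, and the snippet illustrates how query parameters are encoded rather than replacing the cell above.

```python
from urllib.parse import urlencode

base_url = "https://api.openweathermap.org/data/2.5/forecast"
params = {
    "q": "Brisbane",
    "appid": "YOUR_API_KEY",  # placeholder, not a real key
}

# A GET request carries its parameters in the query string after "?".
full_url = base_url + "?" + urlencode(params)
print(full_url)
# https://api.openweathermap.org/data/2.5/forecast?q=Brisbane&appid=YOUR_API_KEY
```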

>**QUESTION:** What data has been returned here, and why didn't it require that we manually download a dataset?

__HINT__: Use the following URL to understand the code above in a more meaningful way: https://openweathermap.org/forecast5

>>**ANSWER:** ???

Let us once again take a look at a sample item from the dataset. To do this, we'll first need to convert our data into JSON:

In [None]:
???.???(???)['list'][0]

>**QUESTION:** What datapoint immediately stands out from the data?

__HINT__: Recall the Lord Mayor's stance on clouds.

>>**ANSWER:** ???

Let's record the cloudiness (the `clouds.all` value, a percentage of cloud cover) for every item in the 5-day forecast, and store the values in a list variable:

In [144]:
??? = []
for record in ???.???(???)['list']:
    ???.append(record['clouds']['all'])
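
Because the forecast contains several readings per day, it may also be useful to summarise cloudiness per day. The sketch below is one possible approach, assuming each item carries a `dt_txt` timestamp string alongside `clouds.all`. The records in `toy_forecast` are invented for illustration and are not real forecast data.

```python
from collections import defaultdict

# Made-up records in the shape of the forecast's 'list' items
# (assumes each item has 'dt_txt' and 'clouds'['all']); values are invented.
toy_forecast = [
    {"dt_txt": "2030-01-01 09:00:00", "clouds": {"all": 20}},
    {"dt_txt": "2030-01-01 12:00:00", "clouds": {"all": 40}},
    {"dt_txt": "2030-01-02 09:00:00", "clouds": {"all": 0}},
    {"dt_txt": "2030-01-02 12:00:00", "clouds": {"all": 10}},
]

# Group the cloudiness readings by calendar date (the part before the space).
readings_by_day = defaultdict(list)
for record in toy_forecast:
    day = record["dt_txt"].split(" ")[0]
    readings_by_day[day].append(record["clouds"]["all"])

# Average the readings for each day.
daily_average = {
    day: sum(values) / len(values) for day, values in readings_by_day.items()
}
print(daily_average)  # {'2030-01-01': 30.0, '2030-01-02': 5.0}
```

An average is only one possible summary. Given the Lord Mayor's requirement, the daily maximum might be a stricter measure.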

Then we can visualise our data using the `matplotlib.pyplot` library, and the `numpy` library:

In [None]:
import matplotlib.pyplot as plt
import numpy as np
plt.plot(???)
plt.title(???)
plt.ylabel(???)
plt.xlabel(???)
plt.xticks(np.arange(5) * 8, [???, ???, ???, ???, ???])  # 8 three-hourly records per day
plt.show()

>**QUESTION:** Given the results of the visualisation, what is your recommendation to the Lord Mayor, i.e. what is the most suitable day for hosting the event?

>>**ANSWER:** ???