<div class="alert alert-block alert-info"><b>IAB303</b> - Data Analytics for Business Insight</div>

# Internal concerns, structured data & open data

In this session we focus on the internal concerns of businesses, look at how structured data is important for addressing these concerns, and also take a look at how open data can be valuable for addressing internal concerns.

---

### QDAVI

When addressing business concerns, our interest is much more than just the data analytics. We are interested in what is *appropriate, efficacious, ethical ...* -- what is the ***right*** kind of analytics to help provide the ***right*** kind of insights for business. To provide some structure to our approach, we follow a cycle - **QDAVI** - to address a business concern:

1. **Q**uestion
2. **D**ata
3. **A**nalysis
4. **V**isualisation
5. **I**nsight

<img src="graphics/QDAVI_cycle_sm.png" width="50%" />

### Internal business concerns

Review business concerns

What are internal business concerns?

### Structured data

With structured data, the structure is defined in advance and the data that populates the structure is consistent.

Typically, structured data is in a tabular format with rows and columns:

* Columns normally represent fields or properties that are consistent across the whole of the data e.g. postcode or phone number
* Each row is a separate record in the data e.g. each customer has their own row in a customers table 

Often structured data is saved in `CSV` file format on the file system and we read it into a `pandas` `dataframe`. Review the video and notebook for **Structured Data** for an introduction on how to load and save.
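
As a quick reminder of what loading and saving looks like, here is a minimal sketch. It reads from an in-memory string rather than a file so that it runs anywhere; the records and the column names (`name`, `postcode`, `phone`) are made up for illustration and are not one of the session's data files.

```python
import io
import pandas as pd

# Hypothetical customer records; the column names are illustrative only
csv_text = """name,postcode,phone
Alice,4000,0400000001
Bob,4051,0400000002
Cara,2000,0400000003
"""

# Reading the phone column as text keeps its leading zero
customers = pd.read_csv(io.StringIO(csv_text), dtype={"phone": str})

print(customers.shape)   # (number of rows, number of columns)
print(customers.dtypes)  # one data type per column

# Saving writes the same rows and columns back out as CSV text
buffer = io.StringIO()
customers.to_csv(buffer, index=False)
round_trip = buffer.getvalue()
print(round_trip)
```

With a real file, a path such as `'data/myfile.csv'` would take the place of the `StringIO` objects.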


---
### Example 1 - workplace safety

**BUSINESS CONCERN:**

Workplace safety can have a significant impact on business success. Not only do accidents and injuries cost time and money, but a safe, healthy environment can contribute to a positive culture, which in turn can improve employee wellbeing and lift productivity and efficacy.

https://www.comcare.gov.au/safe-healthy-work/healthy-workplace/benefits

#### Question:

To improve safety, should we target any particular groups of people?

#### Data:

In [None]:
# Sample data from https://www.contextures.com/xlsampledata01.html#morefiles

# To use pandas, we need to import it (normally as 'pd')
import pandas as pd

# We can then open a CSV file into a new dataframe
safety_df = pd.read_csv('data/sampledatasafety.csv')

# view the dataframe
safety_df

#### Analysis:

There are usually a number of ways of doing a single analysis task. We will use a few different techniques over the notebook to get familiar with what is possible.

Let's start by checking how many males and females are involved in incidents, by counting the number of rows for each.
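
One possible shortcut for this kind of count is `value_counts()`, which tallies the rows for every distinct value in a column in one step. The sketch below uses a small made-up table, not the safety data, and assumes only a `Gender` column like the one in the sample file.

```python
import pandas as pd

# Hypothetical incident records standing in for the safety data
incidents = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Male", "Female"],
    "Days Lost": [0, 3, 1, 0, 2],
})

# Number of rows (incidents) for each distinct value in the column
counts = incidents["Gender"].value_counts()

# The same counts expressed as a share of all incidents
shares = incidents["Gender"].value_counts(normalize=True)

print(counts)
print(shares)
```

The cells that follow reach the same kind of result step by step, which makes it easier to see what is happening at each stage.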

In [None]:
# First how many incidents are in the database?
incident_count = len(safety_df)
incident_count

In [None]:
# We can get a boolean value series for a whole column by applying a condition to that column
safety_df['Gender']=='Male'

In [None]:
# To keep only the rows where the condition is True, we use the boolean series as the selector for the dataframe
male_safety_df = safety_df[safety_df['Gender']=='Male']
male_safety_df

In [None]:
# How many incidents for Males?
male_incident_count = len(male_safety_df)
male_incident_count

In [None]:
# Output as percentage:
male_ratio = male_incident_count/incident_count
print("Of all safety incidents, {:.2%} involved males".format(male_ratio))

Based on this, we could focus training efforts on men as that is likely to have more impact, but this is pretty broad. Let's have a look at age groups. 

Filtering rows for each group is tedious. What we need is to *group* the rows of the data frame by `Age Group` and then count the number of rows in each group. It turns out that pandas has a way of doing exactly that with the `groupby` method.
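
To see the grouping idea in isolation, here is a small sketch on made-up data (the values are invented; only the column names mirror the safety file). It also shows a detail worth knowing: `count()` counts the non-missing values in each column, whereas `size()` counts the rows in each group.

```python
import pandas as pd

# Hypothetical incidents; one Department value is deliberately missing
incidents = pd.DataFrame({
    "Age Group": ["20-29", "30-39", "20-29", "40-49", "20-29"],
    "Department": ["Main", "Shipping", "Main", None, "Shipping"],
})

by_age = incidents.groupby("Age Group")

# size() gives the number of rows in each group
rows_per_group = by_age.size()

# count() gives the number of non-missing values per column in each group
non_null_per_group = by_age.count()

print(rows_per_group)
print(non_null_per_group)
```

When no values are missing the two give the same numbers, which is why `count()` works as a row count in the cells below.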

In [None]:
# Just doing the groups with the males to start with
male_safety_df.groupby(['Age Group']).count()

In [None]:
# What about Department?
male_safety_df.groupby(['Department']).count()

The incidents are fairly evenly spread across age groups and departments, but perhaps there is a difference in the seriousness of the injuries for these groups, so maybe we need to filter the data based on days lost:
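
Besides filtering, another way to compare seriousness across groups is to summarise `Days Lost` within each group. The following is a sketch under assumptions: the data is invented, and the output column names (`n_incidents`, `total_days`, `mean_days`) are illustrative labels chosen here.

```python
import pandas as pd

# Hypothetical incidents with the days lost for each one
incidents = pd.DataFrame({
    "Department": ["Main", "Main", "Shipping", "Shipping", "Shipping"],
    "Days Lost": [0, 4, 1, 0, 5],
})

# Several summaries of the same column, one row per department
summary = incidents.groupby("Department")["Days Lost"].agg(
    n_incidents="count",
    total_days="sum",
    mean_days="mean",
)

print(summary)
```

A table like this can show whether a group with few incidents still accounts for many lost days.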

In [None]:
minor_male_df = male_safety_df[male_safety_df['Days Lost'] == 0]
minor_male_df

In [None]:
# Output as percentage:
minor_male_ratio = len(minor_male_df)/male_incident_count
print("Of all male safety incidents, {:.2%} were minor involving {} days lost".format(minor_male_ratio,0))

In [None]:
# More than 2 days lost
major_male_df = male_safety_df[male_safety_df['Days Lost'] > 2]
major_male_df

In [None]:
# Retry age
major_male_df.groupby(['Age Group']).count()

In [None]:
# ... and department
major_male_df.groupby(['Department']).count()

In [None]:
# What about both with a total of days lost?
# numeric_only=True sums only the numeric columns (such as 'Days Lost')
major_male_df.groupby(['Department','Age Group']).sum(numeric_only=True)

In [None]:
# Just days lost: select the column first, then sum within each group
male_days_df = major_male_df.groupby(['Department','Age Group'])['Days Lost'].sum()
male_days_df

In [None]:
# Filter total days more than 10
max_male_days_df = male_days_df[male_days_df > 10]
max_male_days_df

#### Visualisation:

Although we have some helpful information which we can use for insights, often the visualisation process can help us derive further insights.

In [None]:
max_male_days_df.plot(kind='bar')

In [None]:
# Unpack this a bit more using the unstack() function

max_male_days_df.unstack().plot(kind='bar')
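
To make clearer what `unstack()` does before plotting, here is a small sketch on made-up numbers. Grouping by two columns gives a series with a two-level index, and `unstack()` moves the inner level (here `Age Group`) into columns, so each department becomes one row.

```python
import pandas as pd

# Hypothetical incidents; values are invented for illustration
incidents = pd.DataFrame({
    "Department": ["Main", "Main", "Shipping", "Shipping"],
    "Age Group": ["20-29", "30-39", "20-29", "20-29"],
    "Days Lost": [3, 4, 2, 5],
})

# Long form: one value per (Department, Age Group) pair
days = incidents.groupby(["Department", "Age Group"])["Days Lost"].sum()

# Wide form: departments as rows, age groups as columns
wide = days.unstack()

print(days)
print(wide)
```

Combinations that do not occur in the data (here Shipping, 30-39) appear as `NaN` in the wide table. Plotting the wide form as a bar chart draws one cluster of bars per row, with one bar per column.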

#### Insights:

* ???

---
### Example 2 - customer contact info


**BUSINESS CONCERN:**

Arguably the most critical data for a business is data about their customers. Understanding their customers is critical for almost all business decisions including sales, product development, logistics, and even human resources.

https://www.business.qld.gov.au/starting-business/planning/market-customer-research/researching-customers/customer-needs

#### Question:

Where should we start in providing more localised support for our customers?

#### Data:

In [None]:
# Sample data from https://www.briandunning.com/sample-data/

customer_df = pd.read_csv('data/au-500.csv')

# view the dataframe
customer_df

#### Analysis:

We follow our intuition in conducting the analysis. Remember there are often many ways of addressing the **question**. This analysis is just one approach.

In [None]:
# Where are our customers located? Which state? Post code?
postal_df = customer_df.groupby(['state','post']).count()
postal_df

In [None]:
# More than 2 customers in a postcode?
postal_df[postal_df['company_name']>2]

In [None]:
# What about cities?
city_df = customer_df.groupby(['state','city']).count()
city_df

In [None]:
# More than 1 customer in a city?
city_df[city_df['company_name']>1]

The city data is not quite as helpful as the postal data, but perhaps we need a wider geographic area. Let's use the postcode construction to get a wider net:
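
The idea can be sketched on a few made-up rows. Dropping the last two digits of a four-digit postcode leaves a two-digit prefix that covers a wider area. The sketch assumes the postcodes are stored as integers and uses integer division (`//`) on the whole column at once.

```python
import pandas as pd

# Hypothetical customers; postcodes are assumed to be stored as integers
customers = pd.DataFrame({
    "state": ["QLD", "QLD", "QLD", "NSW"],
    "post": [4051, 4059, 4101, 2000],
})

# Integer division by 100 drops the last two digits: 4051 -> 40
customers["area"] = customers["post"] // 100

# Number of customers in each (state, area) pair
area_counts = customers.groupby(["state", "area"]).size()

print(customers)
print(area_counts)
```

Note that a postcode stored as an integer loses any leading zero, so a postcode such as 0800 would give a one-digit area (8) under this approach.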

In [None]:
# Modify postcodes to get first 2 digits (integer division drops the last 2 digits)
code = 4051
code // 100

In [None]:
# Apply this to dataframe
def getArea(code):
    # Integer division drops the last 2 digits of a 4 digit postcode
    return code // 100

customer_df['area'] = customer_df['post'].apply(getArea)
customer_df

In [None]:
# Group by new area column
area_s = customer_df.groupby(['state','area'])['company_name'].nunique()
area_s

In [None]:
# More than 15 customers in an area?
top_areas = area_s[area_s>15]
top_areas

#### Visualisation:

In [None]:
top_areas.unstack().plot(kind='bar')

#### Insights:

We need to keep in mind the assumptions that we make while conducting analyses. For example, there could be some issues with postcodes: they vary in geographic size, and some cross state boundaries.

https://www.abs.gov.au/websitedbs/censushome.nsf/home/factsheetspoa?opendocument&navpos=450

* ???

---
### Open data

Businesses are increasingly waking up to the value of using open data - data provided openly by governments and other organisations. Government data is particularly important for many businesses as it can provide high level information that might be difficult for a smaller business to obtain.

https://data.gov.au


---
### Example 3 - market size



**BUSINESS CONCERN:**

Expanding a business is risky and expensive, particularly if the expansion involves exporting to other countries. It is essential to have a good understanding of potential markets and the degree of competition in those markets.

https://export.business.gov.au/find-export-markets/tips-for-choosing-export-markets

For the purposes of this exercise, our business fits into a trade category of
`Made-up textile articles`

#### Question:

If we want to start exporting, what is the market size for our Australian competitors, and which are the biggest destinations?

#### Data:

In [None]:
# Sample data from https://data.gov.au/data/dataset/australia-s-merchandise-trade-by-state-territory-by-country-sitc-to-fy2017

# We can then open a CSV file into a new dataframe
trade_df = pd.read_csv('data/trade_data-fy2017-658.csv')

# view the dataframe
trade_df

#### Analysis:

In [None]:
# What is the export value in various geographic levels?
l1_df = trade_df.groupby(['Trade type','Geographic level 1']).sum(numeric_only=True)
l1_df

In [None]:
# Let's make this easier to read
l1_df.round(0)

#### Visualisation:

Sometimes our data is difficult to explore without visualising. In these instances, it is good to use the visualisation as part of the exploration process. This is the equivalent of doing multiple cycles of QDAVI (with sub questions) within the overall question.

In [None]:
# Reverse the groupby to make it easier to see imports and exports for each country
l1_df = trade_df.groupby(['Geographic level 1','Trade type']).sum(numeric_only=True)
l1_df.round(0)

In [None]:
# Visualise this - unstack allows levels of grouping to feed into the plot function
l1_df.unstack().plot(kind='bar')

Doing the visualisation as part of the exploration helps us see an **insight** that imports from Asia are so big (> \$11B) that it's difficult to see what is going on with exports.
So let's just zoom in on the exports.

In [None]:
# Just exports
exports_df = trade_df[trade_df['Trade type']=='Total Exports']
l1_exports_df = exports_df.groupby(['Geographic level 1']).sum(numeric_only=True).round(0)
l1_exports_df

In [None]:
l1_exports_df.plot(kind='bar')

That's better, and it gives us an overview, so perhaps we can look at it by `State` as well.

In [None]:
l1_exports_df = exports_df.groupby(['State','Geographic level 1']).sum(numeric_only=True).round(0)
l1_exports_df.unstack(0).plot(kind='bar')

Given that `Oceania & Antarctica` dominates and that Antarctica is *not* likely to be significant, what is the dominant Oceania region? We can try looking at different levels.

In [None]:
# Let's define a function to make it easy

def getDataForLevel(level):
    return exports_df.groupby(['State','Geographic level '+str(level)]).sum(numeric_only=True).round(0)

In [None]:
getDataForLevel(2).unstack(0).plot(kind='bar')

In [None]:
getDataForLevel(3).unstack(0).plot(kind='bar')

I suspect this is *New Zealand*. Let's confirm by taking a look at all countries with exports of over \$10M.
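
Because the value column in this dataset is labelled `A$'000` (thousands of Australian dollars), a threshold of \$10M corresponds to 10,000 in that column. The sketch below works through that conversion on invented figures; the country names and values are hypothetical and are not taken from the trade data.

```python
import pandas as pd

# Hypothetical export records; values are in thousands of dollars
exports = pd.DataFrame({
    "Partner country": ["Country A", "Country B", "Country A", "Country C"],
    "A$'000": [8000.0, 2500.0, 4000.0, 15000.0],
})

# Total export value per country, still in thousands
totals = exports.groupby("Partner country")["A$'000"].sum()

# $10M expressed in the column's units: 10,000,000 / 1,000 = 10,000
threshold_thousands = 10_000_000 / 1000

# Keep countries above the threshold, largest first
top = totals[totals > threshold_thousands].sort_values(ascending=False)

print(top)
```

Sorting the result before plotting puts the largest destinations first, which tends to make a bar chart easier to read.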

In [None]:
# What about specific countries greater than 10M
country_df = exports_df.groupby(['Partner country']).sum(numeric_only=True).round(0)
top_country_df = country_df[country_df["A$'000"] > 10000]
top_country_df

In [None]:
top_country_df.plot(kind='bar')

#### Insights:

* ???