<div class="alert alert-block alert-info"><b>IAB303</b> - Data Analytics for Business Insight</div>

# Internal concerns, structured data & open data

In this session we focus on the internal concerns of businesses, look at how structured data is important for addressing these concerns, and also take a look at how open data can be valuable for addressing internal concerns.

---

### QDAVI

When addressing business concerns, our interest is much more than just the data analytics. We are interested in what is *appropriate, efficacious, ethical ...* -- what is the ***right*** kind of analytics to help provide the ***right*** kind of insights for business. To provide some structure to our approach, we follow a cycle - **QDAVI** - to address a business concern:

1. **Q**uestion
2. **D**ata
3. **A**nalysis
4. **V**isualisation
5. **I**nsight

<img src="../studios/graphics/QDAVI_cycle_sm.png" width="50%" />

### Internal business concerns

Review business concerns

What are internal business concerns?

### Structured data

With structured data, the structure is defined in advance and the data that populates the structure is consistent.

Typically, structured data is in a tabular format with rows and columns:

* Columns normally represent fields or properties that are consistent across the whole of the data e.g. postcode or phone number
* Each row is a separate record in the data e.g. each customer has their own row in a customers table 

Often structured data is saved in `CSV` file format on the file system and we read it into a `pandas` `dataframe`. Review the video and notebook for **Structured Data** for an introduction on how to load and save.
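
As a minimal sketch of the row/column idea, the following reads a tiny table with `pandas` and writes it back out. The customer names, column names and values are made up for illustration, and `io.StringIO` stands in for a CSV file on disk.

```python
import io
import pandas as pd

# Hypothetical CSV content: one header row, then one row per customer record
csv_text = """name,postcode,phone
Ana,4000,0700000001
Ben,4051,0700000002
"""

# Reading phone as text keeps the leading zero, which a numeric column would drop
customers = pd.read_csv(io.StringIO(csv_text), dtype={"phone": str})

print(customers.shape)   # (number of rows, number of columns)
print(customers.dtypes)  # one data type per column

# Convert back to CSV text; index=False leaves out the row labels
round_trip = customers.to_csv(index=False)
print(round_trip)
```

Passing a file path instead of the `StringIO` object (and a path to `to_csv`) gives the usual load and save workflow.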


---
### Exercise 1 - workplace safety

**BUSINESS CONCERN:**

Workplace safety can have a significant impact on business success. Not only do accidents and injuries cost time and money, but a safe, healthy environment can contribute to a positive culture, which in turn can improve employee wellbeing and lift productivity and efficacy.

https://www.comcare.gov.au/safe-healthy-work/healthy-workplace/benefits

#### Question:

How could incident costs be reduced?

#### Data:

In [None]:
# Sample data from https://www.contextures.com/xlsampledata01.html#morefiles

# To use pandas, we need to import it (normally as 'pd')
import pandas as pd

# We can then open a CSV file into a new dataframe
safety_file_path =  # Write the path where the sampledatasafety.csv file can be found
safety_df = pd.read_csv(safety_file_path)

# view the dataframe
safety_df

In [None]:
# These are the data types in the data frame
safety_df.dtypes

In [None]:
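# Convert the Incident Cost to float
# regex=False treats "$" as a literal character (as a regex, "$" matches the end of the string and removes nothing)
safety_df["Incident Cost"] = safety_df["Incident Cost"].str.replace("$", "", regex=False).str.replace(",", "", regex=False).astype(float)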
safety_df

In [None]:
# Check the data types in the data frame again
safety_df.dtypes
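
One detail of the conversion is worth seeing in isolation: `$` is a special character in regular expressions (it matches the end of a string), so it has to be treated as a literal when stripping currency symbols. The sketch below uses made-up cost strings rather than the safety data.

```python
import pandas as pd

# Made-up cost strings in the style of a currency column
costs = pd.Series(["$1,250", "$300", "$4,000"])

# regex=False treats "$" and "," as plain characters, not regex syntax
cleaned = (
    costs.str.replace("$", "", regex=False)
         .str.replace(",", "", regex=False)
         .astype(float)
)

print(cleaned)
print(cleaned.dtype)
```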

#### Analysis:

There are usually a number of ways of doing a single analysis task. We will use a few different techniques over the notebook to get familiar with what is possible.

Let's start by checking the incident types, by counting the number of rows for each.

In [None]:
# First how many incidents are in the database?
incident_count = len(safety_df)
incident_count

In [None]:
# How much did all the incidents cost?
incident_cost = safety_df[" "].sum() # Insert the column we want to calculate
incident_cost

In [None]:
# We can group by the incident type to check the occurrence of each of them
incident_type_group = safety_df.groupby([" "]) # Insert the column we want to group by
incident_type_group.count()

In [None]:
# We can check the details on the incident types that occur the most
burn_df = safety_df[safety_df[" "] == " "] # Complete the condition to find the Burn incident type
burn_df

In [None]:
# How much does the most frequent incident type cost?
burn_cost =  # Calculate the total cost of the Burn incident
burn_cost

In [None]:
# Burn cost output as percentage
print("Burn incidents are {:.2%} of the total incidents".format(len(burn_df) / len(safety_df)))
print("Burn incidents are {:.2%} of the total costs".format(burn_cost / incident_cost))

We could do the same analysis with each of the incident types. However, there is a faster way to get the results we want:

In [None]:
# Iterate the incident type groupby, collecting one summary row per incident type
# (DataFrame.append was removed in pandas 2.0, so the dataframe is built from a list of rows)
rows = []
for incident_type, group in incident_type_group:
    # Grouping by a one-item list yields 1-tuples as keys in pandas 2.x, so unwrap them
    if isinstance(incident_type, tuple):
        incident_type = incident_type[0]
    rows.append({"Incident Type": incident_type,
                 "Incident Count": len(group),
                 "Incident Count %": len(group) / len(safety_df) * 100,
                 "Incident Cost": group["Incident Cost"].sum(),
                 "Incident Cost %": group["Incident Cost"].sum() / incident_cost * 100})
    print("{0} incidents are {1:.2%} of the total incidents".format(incident_type, len(group) / len(safety_df)))
    print("{0} incidents are {1:.2%} of the total costs".format(incident_type, group["Incident Cost"].sum() / incident_cost))
    print("-----------------------------------------------------------------")
incident_type_df = pd.DataFrame(rows, columns=["Incident Type", "Incident Count", "Incident Count %", "Incident Cost", "Incident Cost %"])
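
A summary of this kind can also be built without an explicit loop. The sketch below uses `groupby(...).agg(...)` on a small invented table; the column names mirror the ones used above, but the values are made up, so the numbers say nothing about the real data. The output column names (`n_incidents`, `total_cost` and so on) are illustrative choices.

```python
import pandas as pd

# Invented incidents, for illustration only
toy_df = pd.DataFrame({
    "Incident Type": ["Burn", "Cut", "Burn", "Fall"],
    "Incident Cost": [100.0, 50.0, 300.0, 550.0],
})

# Named aggregation: each output column is a (source column, function) pair
summary = toy_df.groupby("Incident Type").agg(
    n_incidents=("Incident Cost", "count"),
    total_cost=("Incident Cost", "sum"),
)

# Express each column as a percentage of its overall total
summary["n_incidents_pct"] = summary["n_incidents"] / summary["n_incidents"].sum() * 100
summary["total_cost_pct"] = summary["total_cost"] / summary["total_cost"].sum() * 100

print(summary)
```

Because the result is indexed by incident type, it can be plotted directly with `.plot(kind='bar')`.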

In [None]:
# We can do a similar analysis to check if factors such as the shift or the plant have an impact on the incident costs


Based on this, where should we focus our training to have the most impact: age group, department, or plant?

#### Visualisation:

In [None]:
incident_type_group["Incident Cost"].count().plot(kind='bar')

In [None]:
incident_type_group["Incident Cost"].sum().plot(kind='bar')

In [None]:
incident_type_df = incident_type_df.set_index("Incident Type")
incident_type_df[["Incident Count %", "Incident Cost %"]].plot(kind='bar')

In [None]:
incident_type_df[["Incident Count %", "Incident Cost %"]].sort_values(by="Incident Cost %", ascending=False).plot(kind="bar")

#### Insights:

---
### Exercise 2 - customer contact info


**BUSINESS CONCERN:**

Arguably the most critical data for a business is data about their customers. Understanding their customers is critical for almost all business decisions including sales, product development, logistics, and even human resources.

https://www.business.qld.gov.au/starting-business/planning/market-customer-research/researching-customers/customer-needs

#### Question:

The company is trying to expand. Which state/territory do we need to look at carefully in terms of viability?

#### Data:

In [None]:
# Sample data from https://www.briandunning.com/sample-data/
customer_file_path = "../../studios/data/au-500.csv" # Write the path where the au-500.csv file can be found
customer_df = pd.read_csv(customer_file_path)

# view the dataframe
customer_df

#### Analysis:

There are usually a number of ways of doing a single analysis task.

Let's start by deciding the aim of the analysis. Do we want to find where there are more customers, to take advantage of the company's existing presence? Or should we find where there are fewer customers, to create a presence?

In [None]:
# Where are our customers located? Which state? Post code?
state_df = customer_df.groupby(["state"]).count()
state_df
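
Calling `count()` on a groupby returns one count per column, so a single sorted tally can be easier to read when the aim is to find the states with the most or the fewest customers. A minimal sketch with invented rows (only the `state` column name is taken from the data above):

```python
import pandas as pd

# Invented customer rows, for illustration only
toy_customers = pd.DataFrame({"state": ["NSW", "QLD", "NSW", "VIC", "NSW", "QLD"]})

# value_counts gives one count per distinct value, sorted largest first
state_counts = toy_customers["state"].value_counts()

print(state_counts)
print(state_counts.idxmax())  # state with the most customers
print(state_counts.idxmin())  # state with the fewest customers
```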

In [None]:
# NSW has the most customers, so where in the state should the company open?
nsw_df = customer_df[ ] # Insert the condition to find NSW state
nsw_df

In [None]:
# Let's check the cities in NSW
nsw_city_df =  # Group the nsw_df by city and count
nsw_city_df

In [None]:
# Which cities have more than 1 customer?
nsw_city_df[ ] # Insert the condition to find more than 1 customer

The city data is not quite as helpful as the postal data, but perhaps we need a wider geographic area. Let's use the postcode construction to get a wider net:

In [None]:
# Modify postcodes to get first 2 digits
code = 4051
int(code/100)

In [None]:
# Apply this to dataframe
def getArea(code):
    return int(code/100)

customer_df['area'] = customer_df['post'].apply(getArea)
customer_df
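
The same area code can be computed for a whole column at once with integer (floor) division, which avoids the intermediate float that `code/100` produces inside `getArea`. A sketch on made-up postcodes; note that a postcode with a leading zero (for example, NT postcodes starting with 08) loses that zero when stored as an integer, so the "first 2 digits" description only holds for four-digit values.

```python
import pandas as pd

# Made-up postcodes; 800 stands for a postcode written as 0800
postcodes = pd.Series([4051, 4000, 2000, 2913, 800])

# Floor division by 100 drops the last two digits of every value in one step
areas = postcodes // 100

print(areas.tolist())  # [40, 40, 20, 29, 8]
```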

In [None]:
# Group by new area column
nsw_area = customer_df[customer_df["state"] == "NSW"].groupby(['area'])["company_name"].count()
nsw_area

In [None]:
# We can do a similar analysis to explore the states/territories that have fewer customers


#### Visualisation:

In [None]:
nsw_area.sort_values(ascending=False).plot(kind='bar')

#### Insights:

https://www.abs.gov.au/websitedbs/censushome.nsf/home/factsheetspoa?opendocument&navpos=450


---
### Open data

Businesses are increasingly waking up to the value of using open data - data provided openly by governments and other organisations. Government data is particularly important for many businesses as it can provide high-level information that might be difficult for a smaller business to obtain.

https://data.gov.au


---
### Exercise 3 - market size (Homework)



**BUSINESS CONCERN:**

Expanding a business is risky and expensive, particularly if the expansion involves exporting to other countries. It is essential to have a good understanding of potential markets and the degree of competition in those markets. Our market is Made-up textile and the data file has already been filtered to contain only those rows.

https://export.business.gov.au/find-export-markets/tips-for-choosing-export-markets


#### Question:

Which countries provide the largest and smallest contributions to imports in our sector? How might this inform our marketing?

#### Data:

In [None]:
# Sample data from https://data.gov.au/data/dataset/australia-s-merchandise-trade-by-state-territory-by-country-sitc-to-fy2017

# We can then open a CSV file into a new dataframe
trade_file_path = "../../studios/data/trade_data-fy2017-658.csv" # Write the path where the trade_data-fy2017-658.csv file can be found
trade_df = pd.read_csv(trade_file_path)

# view the dataframe
trade_df

#### Analysis:

In [None]:
# What is the export value in various geographic levels?

In [None]:
# Which states are exporting the most into the biggest markets?

#### Visualisation:

#### Insights: