# Module 1 - The Data Analytics Cycle

In our approach to data analytics, we will follow a process that requires us to address five questions:

1. Which is the right question?
2. Which is the right data?
3. Which is the right analysis?
4. Which is the right visualisation?
5. Which is the right insight?

For this unit, we are concerned with more than just data analytics: we are interested in what is *appropriate, efficacious, ethical ...* what is ***right!***

You can easily remember the data analytics cycle by the acronym **QDAVI**:

1. **Q**uestion
2. **D**ata
3. **A**nalysis
4. **V**isualisation
5. **I**nsight

<img src="graphics/QDAVI_cycle_sm.png" width="50%" />


## QDAVI Example

Let's take a look at an example of a complete **QDAVI** cycle...




### 1. Question

> **CONCERN:** A business is looking to launch an agricultural product in either Australia or New Zealand. However, management is unsure which country to start with.

What questions might the business be interested in answering, and how might we use data analytics to address these questions?

### 2. Data

What data may be helpful in finding out the importance of agriculture to each country?

Perhaps, data that shows the contribution of agriculture to the economy:

1. Take a look at [GapMinder](https://www.gapminder.org/data/) - (based on [uw-madison resource](https://uw-madison-aci.github.io/python-novice-gapminder/39-plotting/))
2. Find the "Agriculture, percent of GDP" (economy>sectors>agriculture) and download the CSV
3. Upload the CSV to your Jupyter files section with the 'upload' button into a 'data' directory.

#### Required libraries

For any data analysis, we need to use existing software that has been loaded into the Jupyter environment in the form of 'libraries', 'packages', or 'modules'. To make these libraries available to your notebook, you need to `import` them.

In [None]:
# Import pandas for dataframes and matplotlib for plotting
import matplotlib.pyplot as plt
import pandas

#### Load the data

Now that we have the data file in our Jupyter environment, we can load the data out of the file into our notebook so that we can work with it.

In [None]:
# Set variables for file and index column
filename = ??? #the name of your uploaded file - ensure that you use quotes "filename.csv"
colname = ??? #open the csv and have a look at what the index column is called

# Read in the percent of gdp data
ag_gdp = pandas.read_csv(filename, index_col=colname)

# Show the shape of the data
print(ag_gdp.shape)

What does the shape tell us?
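
As a hint, here is a minimal sketch of the same loading step on a tiny made-up table. The CSV text, the country names, and the values are all hypothetical; `io.StringIO` simply stands in for an uploaded file so that the cell runs without one.

```python
import io

import pandas

# Hypothetical CSV text standing in for a downloaded file (values are made up)
csv_text = """country,2015,2016,2017
Aland,1.0,1.1,1.2
Borduria,2.0,2.1,2.2
"""

# index_col tells pandas which column holds the row labels
toy = pandas.read_csv(io.StringIO(csv_text), index_col="country")

# shape is a (rows, columns) tuple; the index column is not counted as a column
print(toy.shape)
```

Because `country` is used as the index, the toy table has two rows and three data columns. The same reading applies to `ag_gdp.shape`.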

Take a look at the data. 

TIP: You can view any variable by typing its name in a cell and running the cell.

In [None]:
# Display loaded data
???

#### Clean the data

Data is rarely in a form where it is ready to analyse immediately. One of the most common tasks in data analytics is cleaning. See [The Ultimate Guide to Data Cleaning](https://towardsdatascience.com/the-ultimate-guide-to-data-cleaning-3969843991d4).

For this task, we're going to work with a subset of the data, and we will select data that needs minimal cleaning. For other tasks, you may need to do a lot of cleaning. How much always depends on both the question being addressed and the data that you have selected.
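
One common way to judge how much cleaning a table needs is to count its missing values. The sketch below is only an illustration on a made-up table (the names and numbers are hypothetical, not taken from the GapMinder file); it uses the pandas methods `isna()` and `sum()`.

```python
import pandas

# Hypothetical table with one missing value (None is stored as NaN)
toy = pandas.DataFrame(
    {"2016": [1.0, 2.0], "2017": [1.1, None]},
    index=["Aland", "Borduria"],
)

# isna() marks each missing cell as True; sum() then counts them per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# axis=1 counts across the columns instead, giving one total per row
missing_per_row = toy.isna().sum(axis=1)
print(missing_per_row)
```

A column or row with many missing values may need to be dropped or filled, or it may simply limit what the data can tell us.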

In [None]:
# Take the last 5 years of the GDP data
most_recent_five_years = [???,???,???,???,???] # TIP: Ensure you put names of columns in quotes "colname"
ag_gdp_clean = ag_gdp.filter(most_recent_five_years, axis=1)
print(???.shape)
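
The following sketch shows how `filter` behaves on a small made-up table (all names and values are hypothetical). One thing to be aware of is that `filter` keeps only the requested columns that actually exist and quietly skips the rest, so a mistyped column name does not raise an error.

```python
import pandas

# Hypothetical table; column names are strings, as they are when read from a CSV header
toy = pandas.DataFrame(
    {"2015": [1.0, 2.0], "2016": [1.1, 2.1], "2017": [1.2, 2.2]},
    index=["Aland", "Borduria"],
)

wanted = ["2016", "2017", "2018"]  # "2018" is not a column in toy

# filter keeps the requested columns that exist and ignores the others
subset = toy.filter(wanted, axis=1)
print(subset.shape)

# A quick check for requested names that were not found
missing = [name for name in wanted if name not in toy.columns]
print(missing)
```

Checking the shape after filtering, as the cell above does, is a simple way to confirm that you got the number of columns you expected.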

We are only interested in Australia and New Zealand, so we don't need 189 rows. We can use the `.loc` indexer of the dataframe to select a row by its index label.

In [None]:
ag_gdp_clean.loc["Zimbabwe"]
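
To see what `.loc` returns, here is a minimal sketch on a made-up table (the country names and values are hypothetical). A single label gives back one row as a Series, while a list of labels gives back a smaller DataFrame.

```python
import pandas

# Hypothetical table indexed by country name
toy = pandas.DataFrame(
    {"2016": [1.1, 2.1], "2017": [1.2, 2.2]},
    index=["Aland", "Borduria"],
)

# A single label returns that row as a Series, indexed by the column names
one_row = toy.loc["Aland"]
print(type(one_row).__name__)

# A list of labels returns a DataFrame containing just those rows
two_rows = toy.loc[["Aland", "Borduria"]]
print(type(two_rows).__name__)
```

The label must match the index exactly, including capitalisation and spacing; a label that is not in the index raises a `KeyError`.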

So we can take the appropriate rows and assign them to new variables for each country.

In [None]:
# Just select the countries we are interested in by referencing the index
ag_gdp_au = ag_gdp_clean.loc[???]
ag_gdp_nz = ag_gdp_clean.loc[???]

In [None]:
# Take a look at the data for AU
???

In [None]:
# Take a look at the data for NZ
???

### 3. Analysis

* What is the problem with the NZ data?
* What can be done about this?
* What are the implications for the question?

For this exercise, we are not going to do any computational analysis on this data, but we still need to 'analyse' the data by **critiquing** it in terms of the question. We could work with the raw numbers, but a visualisation of those numbers may be more helpful.

### 4. Visualisation

At the beginning of the notebook we imported the plotting library and called it `plt`. Here we use this software to visualise our data.

In [None]:
# Plot the data for the 2 countries
plt.plot(???)
plt.plot(???)

This visualisation could easily be misinterpreted. Let's add some additional features to the visualisation to improve it.

In [None]:
# Add labels and set colours
plt.plot(ag_gdp_au, 'g-', label=???)
plt.plot(ag_gdp_nz, 'm-', label=???)

# Create legend.
plt.legend(loc='upper right')
plt.xlabel(???)
plt.ylabel(???)
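
For reference, the same pattern is sketched below on two made-up series, so you can see how the pieces fit together before filling in your own values. The country names, numbers, and label text are all hypothetical.

```python
import matplotlib.pyplot as plt
import pandas

# Hypothetical yearly values for two made-up countries
years = ["2015", "2016", "2017"]
first = pandas.Series([1.0, 1.1, 1.2], index=years)
second = pandas.Series([2.0, 1.9, 2.1], index=years)

# Each line gets its own colour and a label that the legend will display
plt.plot(first.index, first.values, "g-", label="Aland")
plt.plot(second.index, second.values, "m-", label="Borduria")

# The legend and axis labels say what each line and each axis represents
plt.legend(loc="upper right")
plt.xlabel("Year")
plt.ylabel("Agriculture (% of GDP)")
```

Stating the unit on the y-axis helps a reader avoid mistaking a share of GDP for an absolute amount.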

### 5. Insight

Our data analytics is not complete at this point. We still need to identify insights from the analytics process that can help address our original questions or address the main business concern.
* What did we find, and how does it relate to the original question?
* What is the recommendation for the concern? 
* What other information would be helpful? 
* What *doesn't* the data tell us? 
* Can we make inferences? What inferences should we avoid?