<div style="background:#E9FFF6; color:#440404; padding:8px; border-radius: 4px; text-align: center; font-weight: 500;">IFN619 - Data Analytics for Strategic Decision Makers (2023_sem1)</div>

# IFN619 :: A2-DataAnalyticsCycle

### QDAVI

In our approach to data analytics, we will follow a process that requires that we address 5 questions:

1. Which is the right question?
2. Which is the right data?
3. Which is the right analysis?
4. Which is the right visualisation?
5. Which is the right insight?

For this unit, we are concerned with more than just data analytics: we are interested in what is *appropriate, efficacious, ethical ...* what is ***right!***

You can easily remember the data analytics cycle by the acronym **QDAVI**:

1. **Q**uestion
2. **D**ata
3. **A**nalysis
4. **V**isualisation
5. **I**nsight

<img src="graphics/QDAVI_cycle_sm.png" width="50%" />


---

## QDAVI Example

### Question

In this part of the cycle we will:
1. consider the context, and the main concern of stakeholders 
2. identify a specific question (or questions) to address the context and stakeholder concern/s, and 
3. plan how we might connect the question with available data for analysis

**CONTEXT:** We have many students in IFN619 coming from different backgrounds and choosing to study for different reasons. A better understanding of the cohort may be helpful for the teaching team, allowing them to adapt the learning experiences to groups with different needs.

> **QUESTION** What different groups of students are studying IFN619, and how might information about the cohort help the teaching team provide meaningful learning experiences?

**PLAN:** To answer this question, we might consider the kinds of groups that might be possible within available data. Some thoughts are:
- courses
- majors/minors
- part time/full time (mode)
- connections between studios and tutes (registration)
- relationship between mode and registration

The courses and majors/minors may provide information on whether the student has some knowledge of IT or whether their main field is not IT. The mode and registration information may provide an indicator of students that are studying while undertaking other responsibilities (like working and/or caring).

### Data

In this part of the cycle, we will:
1. Identify appropriate data to address the question
2. Read the data into the analysis environment (Jupyter)
3. Clean the data and format it so that it is ready for analysis

As part of this process, we will take into account: 
- the shape of the data and whether there is meaning in the structure - e.g. do rows and columns mean something?
- the completeness of the data - is any data missing?
- the appropriateness of the data - do any aspects of the data need to be modified prior to analysis - e.g. dates and times
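
The three checks above can be sketched on a small, made-up dataframe. The column names and values here are assumptions for illustration only and are not taken from the class data file.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the real class file may use different columns
sample_df = pd.DataFrame(
    {
        "course": ["IN20", "IN27", np.nan],
        "mode": ["FT", "PT", "FT"],
        "enrolled": ["2023-02-20", "2023-02-21", "2023-02-27"],
    }
)

# Shape: (number of rows, number of columns)
n_rows, n_cols = sample_df.shape

# Completeness: number of missing values in each column
missing = sample_df.isna().sum()

# Appropriateness: dates read from text arrive as strings, so convert them
sample_df["enrolled"] = pd.to_datetime(sample_df["enrolled"])

print(n_rows, n_cols)
print(missing)
print(sample_df.dtypes)
```

Running checks like these before any analysis makes it easier to decide which cleaning steps are needed.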

#### Required libraries

For any data analysis, we need to use existing software that has been loaded into the Jupyter environment in the form of 'libraries', 'packages', or 'modules'. To make these libraries available to your notebook, you need to `import` them.

In [None]:
import pandas as pd # Dataframes
import numpy as np  # Mathematical functions
import re           # Regular expressions

#### Read in the data

1. Take a look at the data first to identify its structure
2. Use appropriate code to read the data into Jupyter
3. Display the data to check it was read correctly

In [None]:
# Read the CSV into a dataframe
file_name = ???
class_df = pd.read_csv(f"data/{file_name}",index_col='id')
class_df.shape

In [None]:
# We can take a look at the dataframe by adding the variable name as the last line of a cell
class_df
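
As a minimal sketch of what `index_col` does, the example below reads a few lines of invented CSV text from memory rather than from a file. The contents of `csv_text` are an assumption and only stand in for a real file in the `data/` folder.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file
csv_text = "id,course,mode\n1,IN20,FT\n2,IN27,PT\n3,,FT\n"

# index_col="id" uses the id column as row labels instead of a 0..n-1 index
demo_df = pd.read_csv(io.StringIO(csv_text), index_col="id")

# The index is not counted as a column in the shape
print(demo_df.shape)
print(demo_df)
```

Note that the empty `course` field in the third row is read in as `NaN`.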

#### Clean the data

Looking at the data, we can see that some cells contain `NaN`. This is short for *not a number* and indicates that the cell is empty. Importantly, in code *nothing* (or `null`) is different from `0` or an empty string `""`.
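
The difference between a missing value, `0`, and an empty string can be seen in a short sketch. The values below are made up for illustration.

```python
import numpy as np
import pandas as pd

# One missing value, one zero, and one empty string
values = pd.Series([np.nan, 0, ""], dtype="object")

# Only the NaN is treated as missing
is_missing = values.isna()
print(is_missing.tolist())

# NaN is not equal to anything, including itself
print(np.nan == np.nan)
```

This is why pandas provides `isna()` for finding missing values instead of relying on `==` comparisons.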

Let's check how many of these `NaN`s we have in the data...


In [None]:
# For each column check if a cell isna() and then sum() to get total
class_df.isna().sum()

Before we can fix this, we need to know what the missing data means.

In the case of the course code, the value is missing because the data didn't include any courses with only 1 or 2 students (as those students might be identifiable by their uniqueness in the data). So this missing data could be characterised as `OTHER` (a course other than the main courses listed).

Let's replace the `NaN`s with `OTHER`s for the course column...

In [None]:
# Replace missing data for course column
class_df[???] = class_df[???].fillna(???)
# Check that it worked
class_df.isna().sum()


Now we can replace the `NaN`s with `NR`s for the tutorial and studio columns...

In [None]:
# Replace missing data for tutorial and studio columns
columns = [???,???]
class_df[columns] = class_df[columns].fillna(???)
# Check that it worked
class_df.isna().sum()
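
A minimal sketch of both `fillna` patterns, using a small invented dataframe. The column names `tutorial` and `studio` are assumptions and may not match the names in the class data.

```python
import numpy as np
import pandas as pd

# Hypothetical data; 'tutorial' and 'studio' are assumed column names
toy_df = pd.DataFrame(
    {
        "course": ["IN20", np.nan, "IN27"],
        "tutorial": ["TUT 1", np.nan, "TUT 2"],
        "studio": [np.nan, "STU 1", "STU 2"],
    }
)

# One column: assign the filled Series back to the same column
toy_df["course"] = toy_df["course"].fillna("OTHER")

# Several columns: select them with a list and fill them together
session_cols = ["tutorial", "studio"]
toy_df[session_cols] = toy_df[session_cols].fillna("NR")

# Every count should now be zero
print(toy_df.isna().sum())
```

Assigning the result back to the column (rather than relying on in-place modification) keeps each step explicit and easy to re-run.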

In the case of the major column, there is something else lurking. Let's take a look at more rows of the dataframe...

In [None]:
pd.set_option('display.max_rows', None)

In [None]:
class_df

What's going on here? We have some cells with `NaN` and some that are just blank. We need to take a closer look at the data in the major column...

In [None]:
# Take a look at the dataframe
print(list(class_df.major))

It's now obvious that there is *whitespace* in the column as well as `NaN`s. Let's convert the whitespace to `NaN`s and then replace the `NaN`s with `NONE`.

In [None]:
# Replace whitespace-only cells in the major column with NaN
class_df[???] = class_df[???].replace(r'^\s*$', np.nan, regex=True)
# Check that it worked
class_df.isna().sum()

In [None]:
# Replace missing data for major column
class_df[???] = class_df[???].fillna(???)
# Check that it worked
class_df.isna().sum()
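
The two-step clean-up (whitespace to `NaN`, then `NaN` to a label) can be sketched on a standalone Series. The major names below are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical values: a blank, an empty string, and a real NaN
majors = pd.Series(["Data Science", " ", "", np.nan, "Networks"])

# ^\s*$ matches strings that are empty or contain only whitespace
cleaned = majors.replace(r"^\s*$", np.nan, regex=True)
n_missing = cleaned.isna().sum()

# Now every kind of "nothing" can be filled in one step
filled = cleaned.fillna("NONE")

print(n_missing)
print(filled.tolist())
```

Converting to `NaN` first means all the different forms of "empty" are handled by a single `fillna` call.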

In [None]:
class_df

### Analysis

In this part of the cycle, we will:
1. Select appropriate analysis to address the question
2. Analyse the data using selected techniques
3. Check suitability of results and re-analyse as necessary

As part of this process, we will take into account: 
- the composition of the data and how techniques might yield useful results
- starting with simple approaches and working up to more complex as necessary (and/or feasible)
- the appropriateness of the results - is the analysis yielding useful information that can help answer the question

In [None]:
# Count the non-missing values in every other column for each category
class_df.groupby(???).count()

In [None]:
# Just take the course column
class_df.groupby('course')[???].count()

In [None]:
# Include the major
class_df.groupby(['course',???]).size()
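
The cells above use both `count()` and `size()`. A minimal sketch of how they differ, using invented data that still contains a missing value:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing major
toy_df = pd.DataFrame(
    {
        "course": ["IN20", "IN20", "IN27"],
        "major": ["Data Science", np.nan, "Networks"],
    }
)

# count() reports the non-missing values in each remaining column
counts = toy_df.groupby("course").count()

# size() reports the number of rows in each group as a single Series
sizes = toy_df.groupby("course").size()

print(counts)
print(sizes)
```

Once missing values have been filled, the two give the same numbers, but `size()` is usually the simpler choice when only the number of rows per group is needed.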

In [None]:
# turn the groupby into a dataframe and concat course and major
cm_df = class_df.groupby([???,???]).size().reset_index(name="total")
cm_df = cm_df.sort_values(by='total',ascending=False)
cm_df['course_major'] = cm_df.course + '_' + cm_df.major 
cm_df.set_index(???,inplace=True)
cm_df.drop([???,???],axis=1,inplace=True)
cm_df
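
The same chain of steps (group, count, sort, combine labels, re-index) is sketched below on a small invented dataframe. The columns and the `label` name are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical data
toy_df = pd.DataFrame(
    {
        "course": ["IN20", "IN20", "IN27", "IN20"],
        "mode": ["FT", "PT", "FT", "FT"],
    }
)

# size() gives a Series; reset_index(name=...) turns it into a dataframe
summary = toy_df.groupby(["course", "mode"]).size().reset_index(name="total")

# Largest groups first
summary = summary.sort_values(by="total", ascending=False)

# Join the two grouping columns into one label and use it as the index
summary["label"] = summary["course"] + "_" + summary["mode"]
summary = summary.set_index("label").drop(columns=["course", "mode"])

print(summary)
```

A single labelled index like this is convenient for a simple bar chart, where each bar needs one name.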

### Visualise

At this point, it could be helpful to visualise what we found.

In [None]:
cm_chart = cm_df.total.plot(kind='bar')

However, the original grouped data lends itself to a stacked bar chart.

In [None]:
# Get the courses in the order they first appear in the data
courses = list(class_df.course.unique())
# Use the grouped data, ordered by courses, to create a stacked bar chart
cm_chart = class_df.groupby([???,???]).size().unstack().loc[courses].plot(kind=???, stacked=True,colormap="turbo")
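
A stacked bar chart needs the data in a wide table, with one row per bar and one column per stacked segment. The sketch below shows how `unstack()` produces that table from grouped counts. The data is invented, and `fill_value=0` is an optional choice made here so that missing combinations show as zero rather than `NaN`.

```python
import pandas as pd

# Hypothetical data
toy_df = pd.DataFrame(
    {
        "course": ["IN20", "IN20", "IN27", "IN27", "IN27"],
        "mode": ["FT", "PT", "FT", "FT", "FT"],
    }
)

# Rows are courses, columns are modes, cells are counts
wide = toy_df.groupby(["course", "mode"]).size().unstack(fill_value=0)

print(wide)
```

Calling `.plot(kind='bar', stacked=True)` on a table shaped like `wide` would draw one bar per row, split into one segment per column.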

When we are undertaking analysis, we may need to get information from other sources to help us make sense of the data. In this case, looking up QUT course codes is helpful:

| Code | Course |
| ---- | ------ |
| IN20 | Master IT |
| IN23 | Master BPM |
| IN26 | Grad Cert Data Analytics |
| IN27 | Master Data Analytics |

### Analyse

We can return to the analysis step and repeat it for the studio and tutorial sessions.

In [None]:
# groupby for studio and tutes
studio_tute = class_df.groupby([???,???]).size().reset_index(name="total")
studio_tute

### Visualise

We can then visualise these results.

In [None]:
st_chart = class_df.groupby([???,???]).size().unstack().plot(kind=???, stacked=???)

### Insights

To derive insights, we need to think about the analysis and visualisation and relate them to the original question:

> **QUESTION** What different groups of students are studying IFN619, and how might information about the cohort help the teaching team provide meaningful learning experiences?

- What does our analysis and visualisation tell/show us that is relevant to the initial question?
- Which of the results are the most insightful and why?
- Do we need to do further analysis? Why?
- What further analysis is possible? Should it be considered? Why?
- What are the limitations of the data analytics that we've done? Can we overcome those limitations?


Considering these questions carefully can sometimes result in...
- re-thinking the original question
- looking for additional data to supplement the original data
- undertaking more cleaning of the original data
- pursuing further analysis or visualisation
- (occasionally) abandoning the approach and adopting a new approach

For example, it might be helpful to consider part time and full time modes and their relationship to the times of the tutorials and studio sessions...

In [None]:
# Create a dictionary to map times to tutorials and studios
class_times = {"STU 1":"afternoon",
               "STU 2":"morning",
               "STU 3":"afternoon",
               "TUT 1":"evening",
               "TUT 2":"morning",
               "TUT 3":"afternoon",
               "TUT 5":"evening",
               "TUT 6":"morning",
               "NR":"none"}

# Create a function to return the time for the session
def get_time(session):
    return class_times[???]

# Test the function
get_time(???)

In [None]:
# Create new columns for the times calculated based on the class_times dictionary
class_df['studio_time'] = class_df[???].apply(get_time)
class_df['tutorial_time'] = class_df[???].apply(get_time)
class_df
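
A plain dictionary lookup raises a `KeyError` if a session label is not in the dictionary. The sketch below shows one possible way to guard against that using `dict.get` with a default. The `session_times` dictionary, the `lookup_time` function, and the `"unknown"` label are hypothetical names used only for this example.

```python
import pandas as pd

# Hypothetical subset of a session-to-time lookup
session_times = {"TUT 1": "evening", "TUT 2": "morning", "NR": "none"}


def lookup_time(session):
    # .get returns the default instead of raising KeyError for unknown labels
    return session_times.get(session, "unknown")


# "TUT 9" is deliberately not in the dictionary
sessions = pd.Series(["TUT 1", "NR", "TUT 9"])
times = sessions.apply(lookup_time)

print(times.tolist())
```

Whether to return a default or let the error surface is a design choice: an error draws attention to unexpected labels, while a default keeps the analysis running and leaves the unexpected rows visible in the results.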

In [None]:
times_chart = class_df.groupby([???,???,'mode']).size().unstack().plot(kind='bar', stacked=True)

### Insights

Insights from this visualisation at first glance don't look surprising, but there are some factors in the data that can skew our interpretation...
- there are only limited times for students to select, and they are not offered in equal numbers
- not all offerings are available when class registration opens
- once a class is full, students need to register for a different class even if it's not their preference
- there are not equal numbers of full-time and part-time students
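
To illustrate the last factor above, the sketch below compares raw counts with within-group proportions using `pd.crosstab`. The data is invented, and the column name `tutorial_time` and the mode labels `FT` and `PT` are assumptions for this example.

```python
import pandas as pd

# Hypothetical data: six full-time students and two part-time students
toy_df = pd.DataFrame(
    {
        "mode": ["FT"] * 6 + ["PT"] * 2,
        "tutorial_time": [
            "morning", "morning", "morning", "afternoon",
            "afternoon", "evening", "evening", "morning",
        ],
    }
)

# Raw counts: the evening row looks the same for both modes
raw = pd.crosstab(toy_df["tutorial_time"], toy_df["mode"])

# Proportions within each mode: each column sums to 1
share = pd.crosstab(toy_df["tutorial_time"], toy_df["mode"], normalize="columns")

print(raw)
print(share)
```

In this made-up example, one student from each mode attends in the evening, but that is half of the part-time group and only a sixth of the full-time group. Comparing proportions rather than counts can reduce the skew caused by unequal group sizes, although it does not address the other factors listed.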


### FINAL INSIGHTS

{synthesise the insights from above here in written form}


---

## Next steps...

1. Experiment with the data and the code above to try different approaches to the analysis
2. Attend the tutorial, and try the data analytics cycle with a different problem
3. Try out some of the exercises during the week