In [1]:
import pandas as pd
import os

# Getting the results and enrolments data

This section of the notebook uses pandas to load the data from the `.csv` files into dataframes.

The data is stored in a root folder named `students_data`.

Inside the root folder, the data is organised into one subfolder per year, for example 2015, 2016, ...

The `years` list must be updated whenever data for a new year is added.

Example of file naming convention for the data files:

* Results from 2015: `results2015.csv`
* Enrolments from 2015: `enrolments2015.csv`
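
As a minimal sketch of how this naming convention could map to a file path, assuming each year's files sit in a subfolder named after the year (the helper `build_csv_path` is hypothetical and not part of `studentpathway`):

```python
import os


def build_csv_path(root_folder, data_name, year):
    """Hypothetical helper: compose <root>/<year>/<data_name><year>.csv."""
    return os.path.join(root_folder, str(year), f"{data_name}{year}.csv")


# The separator in the printed path depends on the operating system
print(build_csv_path("students_data", "results", 2015))
```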


In [2]:
from studentpathway.dataprocessing.get_data_frames import get_data_frames
from studentpathway.dataprocessing.get_year_list import get_year_list

In [3]:
print(get_data_frames.__doc__)

Returns a list of dataframe of the requested data_name

    Keyword arguments:
    data_name -- The name of the data to be imported (example: data_name=enrolments)
    root_folder -- The path to data folder where all the csv files are present (example: root_folder=students_data)
    years -- A list of years which is the subfolder inside the root_folder

    Returns:
    data -- list of all the pandas dataframe of the data_name
    


In [4]:
print(get_year_list.__doc__)

Returns list of years from the subfolder named as years

    Keyword arguments:
    root_folder -- path of the root folder to get the years from (example: root_folder="students_data")

    Returns:
    years -- list of years
    


# Constant Values

The following section defines the constants used during data processing.

In [5]:
# Root directory of the data
ROOT_FOLDER = "students_data"
# Names of the two datasets, passed as data_name to get_data_frames
RESULTS = "results"
ENROLMENTS = "enrolments"

## Year list from the data subfolders

The enrolments and results data are stored in one subfolder per year.

Instead of hard-coding the years in a list, the following section derives them from the subfolder names, which must be years.

The years are converted to `int` and stored, sorted, in the `years` list.
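
The idea can be sketched as follows. This is not the implementation of `get_year_list`, which is not shown in this notebook; `list_year_folders` is a hypothetical stand-in that assumes year folders have purely numeric names and that other folders (such as `combined_data`) should be ignored.

```python
import os
import tempfile


def list_year_folders(root_folder):
    """Hypothetical stand-in: numeric subfolder names returned as sorted ints."""
    years = []
    for entry in os.listdir(root_folder):
        full_path = os.path.join(root_folder, entry)
        # Keep only directories whose name is purely digits, e.g. "2015"
        if os.path.isdir(full_path) and entry.isdigit():
            years.append(int(entry))
    return sorted(years)


# Demonstration on a throwaway directory tree
with tempfile.TemporaryDirectory() as root:
    for name in ["2016", "2015", "combined_data"]:
        os.mkdir(os.path.join(root, name))
    demo_years = list_year_folders(root)

print(demo_years)  # [2015, 2016]
```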

In [6]:
years = get_year_list(ROOT_FOLDER)
print(years)

[2015, 2016, 2017, 2018, 2019]


## Result data

In [7]:
# Reading the results files, one dataframe per year
results_data = get_data_frames(RESULTS, ROOT_FOLDER, years)


In [8]:
# Standardising the columns
results_column_header = ["student_id", "course_code", "unit_cohort", "unit_code", "unit_name", "outcome_date", "teaching_calendar", "grade", "mark"]

for frame in results_data:
    frame.columns = results_column_header
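
Assigning to `.columns` renames by position and fails if a file has a different number of columns. A minimal sketch of a check that reports which dataframe is the odd one out; `standardise_columns` is a hypothetical helper and the demo data is made up:

```python
import pandas as pd


def standardise_columns(frames, header):
    """Rename columns positionally, naming the offending frame on a width mismatch."""
    for position, frame in enumerate(frames):
        if frame.shape[1] != len(header):
            raise ValueError(
                f"frame {position} has {frame.shape[1]} columns, expected {len(header)}"
            )
        frame.columns = header
    return frames


demo = [pd.DataFrame([[1, "A"]], columns=["ID", "Grade"])]
standardise_columns(demo, ["student_id", "grade"])
print(list(demo[0].columns))  # ['student_id', 'grade']
```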

In [9]:
# Combining the results data

results = pd.concat(results_data, axis=0, sort=False).reset_index(drop=True)

In [10]:
# Size of the data
results.shape

(1277496, 9)

## Enrolment data

In [11]:
# Reading the enrolment files, one dataframe per year
enrolment_data = get_data_frames(ENROLMENTS, ROOT_FOLDER, years)

In [12]:
# Standardising the columns
enrolments_column_header = ["student_id", "course_code", "student_cohort", "school_name", "course_start_date", "course_attempt_status", "gender", "campus_code", "campus_name", "citizenship", "indigenous_type", "date_of_birth", "discontinued_date", "lapsed_date"]

for frame in enrolment_data:
    frame.columns = enrolments_column_header

In [13]:
# Combining the enrolment data

enrolments = pd.concat(enrolment_data, axis=0, sort=False).reset_index(drop=True)

In [14]:
# Size of the data
enrolments.shape

(139348, 14)

# Merging Results and Enrolments Data

In [15]:
# Dropping course_code (already present in results) and school_name
enrolments = enrolments.drop(["course_code", "school_name"], axis=1)

In [16]:
# Left-joining the enrolment columns onto each result row, matched on student_id

final_data = results.join(enrolments.set_index('student_id'), on="student_id")

In [17]:
# Sorting the dataset by student ID

final_data = final_data.sort_values(by=['student_id']).reset_index(drop=True)

In [18]:
# Parsing outcome_date as datetime
final_data['outcome_date'] = pd.to_datetime(final_data.outcome_date)

In [19]:
# Parsing course_start_date as datetime
final_data['course_start_date'] = pd.to_datetime(final_data.course_start_date)

In [20]:
# Parsing date_of_birth as datetime
final_data['date_of_birth'] = pd.to_datetime(final_data.date_of_birth)
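
By default `pd.to_datetime` raises an error on values it cannot parse. The sketch below shows one way to make parsing explicit and tolerant. The day/month/year format is an assumption for illustration only, since the date format of the source files is not stated in this notebook.

```python
import pandas as pd

raw_dates = pd.Series(["01/02/2015", "not a date"])

# Assumed format: day/month/year. Unparseable values become NaT instead of raising.
parsed = pd.to_datetime(raw_dates, format="%d/%m/%Y", errors="coerce")

print(parsed.isna().sum())  # 1
```

Counting the `NaT` values afterwards shows how many entries failed to parse.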

In [21]:
final_data.shape

(2107370, 20)
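
The merged table has more rows (2107370) than the results table (1277496). One explanation consistent with this is that some students have more than one enrolment row, so each of their result rows is matched several times. A small sketch with made-up data:

```python
import pandas as pd

results_demo = pd.DataFrame({"student_id": [1, 1, 2], "unit_code": ["A", "B", "A"]})
# Student 1 has two enrolment records, student 2 has one
enrolments_demo = pd.DataFrame({"student_id": [1, 1, 2], "campus_code": ["X", "Y", "X"]})

joined = results_demo.join(enrolments_demo.set_index("student_id"), on="student_id")

# Each of student 1's two results is paired with both enrolment rows
print(len(results_demo), len(joined))  # 3 5
```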

## Duplicate data removal from final data

Run the following section after merging the **results** and **enrolments**.

The following code drops rows that are identical in every column.

In [22]:
final_data = final_data.drop_duplicates().reset_index(drop=True)

In [23]:
final_data.shape

(1960101, 20)

# Data cleaning - duplicates

This section will remove further duplicates based on a chosen set of key columns.

**TODO**:

* Identify the columns that define a duplicate
* Filter the data on those columns
* Remove the duplicate rows
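
The steps above could be sketched as follows. The key columns are placeholders chosen for illustration; which columns actually define a duplicate is still the open question in the TODO.

```python
import pandas as pd

demo = pd.DataFrame({
    "student_id": [1, 1, 2],
    "unit_code": ["A", "A", "A"],
    "teaching_calendar": ["2015-S1", "2015-S1", "2015-S1"],
    "campus_code": ["X", "Y", "X"],
})

# Assumed key: placeholder columns, not a decision made by this notebook
key_columns = ["student_id", "unit_code", "teaching_calendar"]

# Inspect every row that shares a key with another row before deleting anything
candidates = demo[demo.duplicated(subset=key_columns, keep=False)]

# Keep the first row for each key
deduplicated = demo.drop_duplicates(subset=key_columns, keep="first").reset_index(drop=True)

print(len(candidates), len(deduplicated))  # 2 2
```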

# Encrypt Student ID

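The following code must be run before saving `final_data`.

The code uses the SHA-1 algorithm to hash each student ID. Strictly speaking this is one-way hashing rather than encryption: there is no key that decrypts a digest back into the original ID.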

In [24]:
import hashlib

# Replace each student ID with the hex SHA-1 digest of its string form
final_data["student_id"] = final_data["student_id"].map(
    lambda student: hashlib.sha1(str(student).encode("ascii")).hexdigest()
)
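
An unkeyed hash of a short ID can be recomputed for every possible ID and matched against the stored digests. One possible alternative, which is not what this notebook does, is a keyed hash (HMAC). The sketch below uses a placeholder key and a hypothetical helper name `pseudonymise`.

```python
import hashlib
import hmac

# Assumption: placeholder key; a real key would be kept outside the notebook
SECRET_KEY = b"replace-with-a-private-key"


def pseudonymise(student_id, key=SECRET_KEY):
    """Keyed SHA-256 digest (HMAC) of the ID, returned as a hex string."""
    return hmac.new(key, str(student_id).encode("ascii"), hashlib.sha256).hexdigest()


digest = pseudonymise(12345)
print(len(digest))  # 64
```

The same ID and key always give the same digest, so rows for one student still group together after the replacement.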

# Storing the data

`final_data` is a pandas dataframe.

Run the following section to store `final_data` in a `final_data.csv` file.

File path: `students_data/combined_data/final_data.csv`

In [25]:
final_data.to_csv(r'students_data/combined_data/final_data.csv', index=False)
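
`to_csv` does not create missing folders, so the `combined_data` folder has to exist before the call. A minimal sketch that creates it first, shown on a temporary directory with made-up data:

```python
import os
import tempfile

import pandas as pd

demo = pd.DataFrame({"student_id": ["abc"], "mark": [70]})

with tempfile.TemporaryDirectory() as root:
    output_dir = os.path.join(root, "combined_data")
    # Create the target folder if it is missing; no error if it already exists
    os.makedirs(output_dir, exist_ok=True)
    output_path = os.path.join(output_dir, "final_data.csv")
    demo.to_csv(output_path, index=False)
    reloaded = pd.read_csv(output_path)

print(reloaded.shape)  # (1, 2)
```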

# More information about Final Data

**Note: Running this section is optional. It just provides more insight into the data.**

The following section summarises the attributes of the final dataset.

In [26]:
final_data.nunique()

student_id               83099
course_code               1089
unit_cohort                639
unit_code                 2702
unit_name                 2885
outcome_date             75707
teaching_calendar           51
grade                       34
mark                       101
student_cohort             588
course_start_date         1584
course_attempt_status        6
gender                       4
campus_code                 25
campus_name                 25
citizenship                  6
indigenous_type              5
date_of_birth            11043
discontinued_date         9675
lapsed_date                760
dtype: int64

In [27]:
final_data["gender"].value_counts()

F    1045295
M     784658
X        553
U        105
Name: gender, dtype: int64
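
The counts above are numbers of rows in `final_data`, not numbers of students: they total far more than the 83099 unique `student_id` values. To count students instead, one option is to keep a single row per student first. A sketch on made-up data, assuming a student's gender is the same on every row:

```python
import pandas as pd

demo = pd.DataFrame({
    "student_id": ["a", "a", "a", "b"],
    "gender": ["F", "F", "F", "M"],
})

# Counts rows: student "a" contributes three times
rows_per_gender = demo["gender"].value_counts()

# Counts students: one row per student_id is kept before counting
students_per_gender = demo.drop_duplicates(subset="student_id")["gender"].value_counts()

print(rows_per_gender["F"], students_per_gender["F"])  # 3 1
```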

In [28]:
final_data["indigenous_type"].value_counts()

NEITHER  ABORIGINAL NOR TORRES STRAIT ISLANDER ORIGIN    1803807
OF ABORIGINAL ORIGIN                                       24320
OF ABORIGINAL AND TORRES STRAIT ISLANDER ORIGIN             1962
OF TORRES STRAIT ISLANDER ORIGIN                             454
NO INFORMATION                                                68
Name: indigenous_type, dtype: int64

In [29]:
final_data["course_attempt_status"].value_counts()

DISCONTIN    456684
ENROLLED     431103
LAPSED       389138
COMPLETED    373220
INACTIVE     155974
INTERMIT      24492
Name: course_attempt_status, dtype: int64

In [30]:
final_data["citizenship"].value_counts()

AUSTRALIAN CITIZEN                               1522682
TEMPORARY ENTRY PERMIT OR NON NZ DIPLOMAT         206724
PERMANENT RESIDENT (EXCLUDING NEW ZEALANDERS)      44766
PERMANENT HUMANITARIAN VISA                        30805
NEW ZEALAND CITIZEN OR DIPLOMAT                    19566
OTHER RESIDENCY STATUS                              6068
Name: citizenship, dtype: int64