
# DAVI Data Cleaning
## Group-10
## Group Members: Devendran Yoheswaran, Kaung Myat San, Swam Htet Aung
---

## Meta Data
---
### Student Profiles

> **Note:** Data is manually entered, so the values are not standardized.

| **Field Name**                            | **Description**                                                                                                                          | **Example**                |
| ----------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------- | -------------------------- |
| **STUDENT ID**                            | Student ID is made up of three attributes: `<Course code>-<Intake No>/<Index Number of student in the intake>`                           | `1101-013/001`             |
| **GENDER**                                | Gender                                                                                                                                   | `M`, `F`                   |
| **SG CITIZEN**                            | Singapore Citizen                                                                                                                        | `Y` or blank               |
| **SG PR**                                 | Singapore Permanent Resident                                                                                                             | `Y` or blank               |
| **FOREIGNER**                             | Neither SG Citizen nor SG PR (mutually exclusive with SG CITIZEN and SG PR)                                                              | `Y` or blank               |
| **COUNTRY OF OTHER NATIONALITY**          | Country of nationality (only for SG PR or foreigner)                                                                                     | `Malaysia`, `India`, etc.  |
| **DOB**                                   | Date of Birth. Format: `DD/MM/YYYY`                                                                                                      | `04/03/1978`               |
| **HIGHEST QUALIFICATION**                 | Highest qualification attained prior to this course                                                                                      | `Certificate`, `Diploma`   |
| **NAME OF QUALIFICATION AND INSTITUTION** | Name of the highest qualification and the institution that awarded it                                                                    | As provided by participant |
| **DATE ATTAINED HIGHEST QUALIFICATION**   | Date when the qualification was awarded. Format: `DD/MM/YYYY`                                                                            | `06/11/2016`               |
| **DESIGNATION**                           | Job designation                                                                                                                          | As provided by participant |
| **COMMENCEMENT DATE**                     | Course start date. Format: `DD/MM/YYYY`                                                                                                  | `06/01/2023`               |
| **COMPLETION DATE**                       | Course end date. Blank if course is ongoing. Format: `DD/MM/YYYY`                                                                        | `06/04/2024`               |
| **FULL-TIME OR PART-TIME**                | Whether the course is Full-time or Part-time                                                                                             | `Full-Time`, `Part-Time`   |
| **COURSE FUNDING**                        | Course funding type:<br>- `Individual`<br>- `Individual - SFC` (SkillsFuture Credit)<br>- `Sponsored`<br>- `Individual - waived App Fee` | `Individual - SFC`         |
| **REGISTRATION FEE**                      | Registration fee in SGD                                                                                                                  | As entered                 |
| **PAYMENT MODE**                          | Mode of payment                                                                                                                          | `NETS`, `Giro`, `PayNow`   |
| **COURSE FEE**                            | Course fee in SGD                                                                                                                        | As entered                 |

---

### Course Codes

| **Field Name**  | **Description**        | **Example**                          |
| --------------- | ---------------------- | ------------------------------------ |
| **CODE**        | Course code (4 digits) | `1101`                               |
| **COURSE NAME** | Course name            | `Diploma in Business Administration` |

---

### Semester Results

| **Field Name** | **Description**                                                                                                           | **Example**    |
| -------------- | ------------------------------------------------------------------------------------------------------------------------- | -------------- |
| **STUDENT ID** | Must match the ID in the Student Profile dataset                                                                          | `1101-013/001` |
| **PERIOD**     | Semester number<br>• Certificate: 1 semester<br>• Diploma: 3 semesters<br>• Master’s: 2 semesters (some exceptions apply) | `1`, `2`, `3`  |
| **GPA**        | GPA for the semester.<br>• Max GPA: 4<br>• Pass GPA: Certificate/Diploma = 2, Master's = 2.3                              | `3.2`          |

---


## Importing Modules
---

In [1]:
import pandas as pd
import numpy as np

## Loading Data
---

#### Course_Code Data

In [2]:
course_code = pd.read_excel("https://github.com/swamhtetg90/DAVI-CA2/blob/main/DAVI%20CA2%20datasets%20and%20meta%20data/Course%20Codes.xlsx?raw=true")
course_code.head()

Unnamed: 0,CODE,COURSE NAME
0,1101,Diploma in Business Administration
1,1102,Diploma in Business Analytics
2,2101,Certificate in Digital Marketing
3,2102,Certificate in HR Management
4,2013,Certificate in Tourism Management


#### Semester Results Data

In [3]:
semester_results = pd.read_excel("https://github.com/swamhtetg90/DAVI-CA2/blob/main/DAVI%20CA2%20datasets%20and%20meta%20data/Semester%20Results.xlsx?raw=true")
semester_results.head()

Unnamed: 0,STUDENT ID,PERIOD,GPA
0,1101-009/001,Sem 1,3.5
1,1101-009/001,Sem 2,3.6
2,1101-009/001,Sem 3,3.7
3,1101-009/002,Sem 1,3.4
4,1101-009/002,Sem 2,3.5


#### Student Profiles Data

In [4]:
student_profiles = pd.read_excel("https://github.com/swamhtetg90/DAVI-CA2/blob/main/DAVI%20CA2%20datasets%20and%20meta%20data/Student%20Profiles.xlsx?raw=true")
student_profiles.head()

Unnamed: 0,STUDENT ID,GENDER,SG CITIZEN,SG PR,FOREIGNER,COUNTRY OF OTHER NATIONALITY,DOB,HIGHEST QUALIFICATION,NAME OF QUALIFICATION AND INSTITUTION,DATE ATTAINED HIGHEST QUALIFICATION,DESIGNATION,COMMENCEMENT DATE,COMPLETION DATE,FULL-TIME OR PART-TIME,COURSE FUNDING,REGISTRATION FEE,PAYMENT MODE,COURSE FEE
0,1101-009/001,F,,,Y,Malaysia,13/09/1981,Certificate,SPM,2018-01-08,Admin & HR Assistant,2022-04-18 00:00:00,2023-09-17 00:00:00,Part-Time,Individual,107,GIRO,5136
1,1101-009/002,F,Y,,,,26/07/1979,Certificate,"Certificate in Office Skills, ITE",2016-06-08,Admin Assistant,2022-04-18 00:00:00,2023-09-17 00:00:00,Part-Time,Individual-SFC,107,NETS,5136
2,1101-009/003,F,,,Y,India,01/02/1990,Degree,"Bachelor of Business Administration, Universit...",2015-08-08,-,2022-04-18 00:00:00,2023-09-17 00:00:00,Part-Time,Individual,107,NETS,5136
3,1101-009/004,F,,,Y,Netherlands,20/04/1976,Diploma,"Office Management Diploma, NCOI Rotterdam, The...",2018-02-08,HR Support / Office Manager,2022-04-18 00:00:00,2023-09-17 00:00:00,Part-Time,Individual,107,NETS,5136
4,1101-009/005,F,Y,,,,25/11/1983,Diploma,"Diploma in Business Admininstration, LCCI Leve...",2015-06-08,"Executive, Administration",2022-04-18 00:00:00,2023-09-17 00:00:00,Part-Time,Sponsored,107,GIRO,4812


## Data Analysis
---

### Course Code

In [5]:
course_code.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   CODE         7 non-null      int64 
 1   COURSE NAME  7 non-null      object
dtypes: int64(1), object(1)
memory usage: 244.0+ bytes


In [6]:
course_code

Unnamed: 0,CODE,COURSE NAME
0,1101,Diploma in Business Administration
1,1102,Diploma in Business Analytics
2,2101,Certificate in Digital Marketing
3,2102,Certificate in HR Management
4,2013,Certificate in Tourism Management
5,5112,Specialist Diploma in Business Innovation and ...
6,5113,Specialist Diploma in Intelligent Systems


#### Course Code DataFrame Summary

- The **`Course Code` DataFrame** contains **no null values**.
- It has a total of **7 rows**.
- The data appears to be **cleaned and ready for use**.
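
Beyond null counts, it may be worth confirming that the codes are unique, four digits long, and cover every course-code prefix used in `STUDENT ID`. The sketch below runs on toy data, not the real files; the helper name `check_course_codes` and the toy values are illustrative assumptions.

```python
import pandas as pd

def check_course_codes(course_code: pd.DataFrame, student_ids: pd.Series) -> dict:
    """Return basic integrity facts about a course-code table (hypothetical helper)."""
    codes = course_code["CODE"].astype(str)
    # The first four characters of a STUDENT ID are the course code.
    used = set(student_ids.dropna().astype(str).str[:4])
    return {
        "all_four_digits": bool(codes.str.fullmatch(r"\d{4}").all()),
        "all_unique": bool(codes.is_unique),
        "unknown_prefixes": sorted(used - set(codes)),
    }

# Toy data (hypothetical values, not the real dataset)
toy_codes = pd.DataFrame({"CODE": [1101, 2101], "COURSE NAME": ["A", "B"]})
toy_ids = pd.Series(["1101-009/001", "2101-106/001", "9999-001/001"])

report = check_course_codes(toy_codes, toy_ids)
print(report)
```

On the real data, the same call could take `course_code` and `student_profiles['STUDENT ID']`; any entry in `unknown_prefixes` would point to a code that is missing or mistyped in one of the two files.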


### Semester Results
---
#### STUDENT ID
---

In [7]:
# Get sets of Student IDs from both DataFrames
ids_profiles = set(student_profiles['STUDENT ID'].dropna().unique())
ids_results = set(semester_results['STUDENT ID'].dropna().unique())

# Find IDs that are in semester_results but not in student_profiles
only_in_results = ids_results - ids_profiles

if only_in_results:
    print("Student IDs found in semester_results but not in student_profiles:")
    print(only_in_results)

    # Optionally, display the rows from semester_results for these IDs
    mismatched_rows = semester_results[semester_results['STUDENT ID'].isin(only_in_results)]
    print("\nMismatched rows from semester_results:")
    display(mismatched_rows)
else:
    print("All STUDENT IDs in semester_results are also present in student_profiles.")

Student IDs found in semester_results but not in student_profiles:
{'2101-106/005', '5112-007/006', '5112-007/005', '2101-106/001', '5112-007/001', '5112-007/003', '2101-106/003', '2101-106/002', '5112-007/002', '2101-106/004', '5112-007/004'}

Mismatched rows from semester_results:


Unnamed: 0,STUDENT ID,PERIOD,GPA
116,2101-106/001,Sem 1,2.4
117,2101-106/002,Sem 1,3.1
118,2101-106/003,Sem 1,3.4
119,2101-106/004,Sem 1,2.8
120,2101-106/005,Sem 1,2.3
130,5112-007/001,Sem 1,2.9
131,5112-007/001,Sem 2,3.8
132,5112-007/002,Sem 1,3.2
133,5112-007/002,Sem 2,3.3
134,5112-007/003,Sem 1,3.6


The output above shows that some students exist only in `Semester Results` and not in `Student Profiles`.

We will remove these rows because:
* We have no profile information for these students.
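
A minimal sketch of that removal, assuming toy frames in place of the real `student_profiles` and `semester_results` (the values below are made up for illustration):

```python
import pandas as pd

# Toy frames standing in for the real data (hypothetical values)
profiles = pd.DataFrame({"STUDENT ID": ["1101-009/001", "1101-009/002"]})
results = pd.DataFrame({
    "STUDENT ID": ["1101-009/001", "1101-009/002", "2101-106/001"],
    "PERIOD": ["Sem 1", "Sem 1", "Sem 1"],
    "GPA": [3.5, 3.4, 2.4],
})

# Keep only results whose STUDENT ID has a matching profile.
has_profile = results["STUDENT ID"].isin(profiles["STUDENT ID"])
results_matched = results[has_profile].reset_index(drop=True)
print(results_matched)
```

Filtering with `isin` leaves the original frame untouched, so the dropped rows can still be inspected afterwards if needed.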

#### PERIOD
---

In [8]:
# Check unique values in the 'PERIOD' column
unique_periods = semester_results['PERIOD'].value_counts().reset_index()
unique_periods.columns = ['Period', 'Count']
print("Unique values in 'PERIOD' column:")
display(unique_periods)

Unique values in 'PERIOD' column:


Unnamed: 0,Period,Count
0,Sem 1,223
1,Sem 2,124
2,Semester 1,56
3,Sem 3,45
4,Semester 2,37
5,Semester 3,37
6,Sem1,30
7,Sem 4,1
8,Semester 4,1
9,Sem2,1


Since the values are not standardized, we will convert them to numbers. For example:
* `Sem 1` → `1`
* `Semester 1` → `1`
* `Sem1` → `1`
* `Sem 2` → `2`
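
One way this conversion could be done is to extract the digits from each label. The sketch below is an assumption about the approach, not necessarily the one used later in the notebook, and it runs on a small sample of the label variants seen above.

```python
import pandas as pd

# A sample of the label variants seen in the value counts above
periods = pd.Series(["Sem 1", "Semester 2", "Sem1", "Sem 3", "Sem2", "Semester 4"])

# Pull out the first run of digits and convert it to a nullable integer,
# so a label without any digit would become <NA> instead of raising.
period_num = pd.to_numeric(
    periods.str.extract(r"(\d+)", expand=False), errors="coerce"
).astype("Int64")
print(period_num.tolist())
```

The value counts also contain two period-4 entries. The metadata lists at most three semesters but mentions exceptions, so keeping them as the number `4` leaves that decision open for later.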

#### GPA
---

In [9]:
# Check data type and summary statistics for 'GPA'
print("Info and descriptive statistics for 'GPA':")
semester_results['GPA'].info()
display(semester_results['GPA'].describe())

# Check unique values in 'GPA' to spot any non-numeric entries
print("\nUnique values in 'GPA' column:")
display(semester_results['GPA'].unique())

# Attempt to convert 'GPA' to numeric, coercing errors
semester_results['GPA_cleaned'] = pd.to_numeric(semester_results['GPA'], errors='coerce')

# Check for values that couldn't be parsed (will be NaN)
invalid_gpa = semester_results[semester_results['GPA_cleaned'].isna() & semester_results['GPA'].notna()]

print("\nRows with invalid 'GPA' values that could not be converted to numeric:")
display(invalid_gpa[['STUDENT ID', 'PERIOD', 'GPA', 'GPA_cleaned']])

# Drop the temporary cleaned column for now if there are invalid values, or proceed with cleaning
# If there are invalid values, you'll need to decide how to handle them (e.g., drop rows, impute)
# If there are no invalid values, you can drop the original and rename the cleaned column
if invalid_gpa.empty:
    print("\nNo invalid GPA values found that could not be converted to numeric.")
    # Proceed to check for values outside the expected range (0-4)
    unrealistic_gpa = semester_results[(semester_results['GPA_cleaned'] < 0) | (semester_results['GPA_cleaned'] > 4)]
    if not unrealistic_gpa.empty:
        print("\nRows with unrealistic 'GPA' values (outside 0-4 range):")
        display(unrealistic_gpa[['STUDENT ID', 'PERIOD', 'GPA', 'GPA_cleaned']])
    else:
        print("\nNo unrealistic GPA values found (outside 0-4 range).")
else:
    print("\nPlease handle the invalid GPA values shown above before proceeding.")

# Drop the temporary cleaned column
semester_results = semester_results.drop(columns=['GPA_cleaned'], errors='ignore')

Info and descriptive statistics for 'GPA':
<class 'pandas.core.series.Series'>
RangeIndex: 555 entries, 0 to 554
Series name: GPA
Non-Null Count  Dtype  
--------------  -----  
555 non-null    float64
dtypes: float64(1)
memory usage: 4.5 KB


Unnamed: 0,GPA
count,555.0
mean,3.104865
std,0.606161
min,1.6
25%,2.7
50%,3.2
75%,3.6
max,4.0



Unique values in 'GPA' column:


array([3.5, 3.6, 3.7, 3.4, 3.3, 3.2, 3.9, 3.8, 2.8, 2.9, 3. , 2.1, 2.2,
       2.3, 1.9, 2. , 2.4, 1.6, 2.3, 2.7, 2.5, 4. , 3.8, 2.4, 2.9, 2.2,
       3. , 3.9, 2.5, 1.9, 1.7, 3.1, 3.2, 2.5, 2.6, 3.3, 3.1, 3.7, 2.6,
       3. , 2.8, 2.1, 2. , 2.7, 3.6, 3.6, 1.8])


Rows with invalid 'GPA' values that could not be converted to numeric:


Unnamed: 0,STUDENT ID,PERIOD,GPA,GPA_cleaned



No invalid GPA values found that could not be converted to numeric.

No unrealistic GPA values found (outside 0-4 range).


#### Check for duplicated rows
---

In [10]:
# Check for duplicate rows based on 'STUDENT ID' and 'PERIOD'
duplicate_semester_results_period = semester_results[semester_results.duplicated(subset=['STUDENT ID', 'PERIOD'], keep=False)]

if duplicate_semester_results_period.empty:
    print("No duplicate rows found based on 'STUDENT ID' and 'PERIOD'.")
else:
    print("Duplicate rows found based on 'STUDENT ID' and 'PERIOD':")
    # Sort by 'STUDENT ID' and 'PERIOD' to show duplicates next to each other
    display(duplicate_semester_results_period.sort_values(by=['STUDENT ID', 'PERIOD']))

# Calculate total number of duplicate rows based on STUDENT ID and PERIOD
total_period_duplicates_count = duplicate_semester_results_period.shape[0]
print(f"\nTotal number of duplicate rows based on 'STUDENT ID' and 'PERIOD': {total_period_duplicates_count}")


# Calculate the number of exact duplicate rows
exact_duplicates = semester_results[semester_results.duplicated(keep=False)]
exact_duplicates_count = exact_duplicates.shape[0]
print(f"Total number of exact duplicate rows (all columns): {exact_duplicates_count}")

# Calculate the number of rows where STUDENT ID and PERIOD are the same, but GPA is different
# This is the total period duplicates minus the exact duplicates
mismatched_gpa_duplicates_count = total_period_duplicates_count - exact_duplicates_count
print(f"Number of duplicate rows based on 'STUDENT ID' and 'PERIOD' with different 'GPA': {mismatched_gpa_duplicates_count}")

# Optionally, display rows where STUDENT ID and PERIOD are the same, but GPA is different
if mismatched_gpa_duplicates_count > 0:
    print("\nDuplicate rows based on 'STUDENT ID' and 'PERIOD' with different 'GPA':")
    # Filter out exact duplicates from the period duplicates
    # (keep must be 'first', 'last' or False; keep=False marks every copy of an exact duplicate)
    mismatched_gpa_rows = duplicate_semester_results_period[~duplicate_semester_results_period.duplicated(keep=False)]
    display(mismatched_gpa_rows.sort_values(by=['STUDENT ID', 'PERIOD']))

Duplicate rows found based on 'STUDENT ID' and 'PERIOD':


Unnamed: 0,STUDENT ID,PERIOD,GPA
243,1102-003/001,Semester 1,1.9
264,1102-003/001,Semester 1,1.9
244,1102-003/001,Semester 2,2.0
265,1102-003/001,Semester 2,2.0
245,1102-003/001,Semester 3,2.4
...,...,...,...
507,2102-069/009,Sem 1,3.0
496,2102-069/010,Sem 1,3.5
508,2102-069/010,Sem 1,3.5
497,2102-069/011,Sem 1,3.2



Total number of duplicate rows based on 'STUDENT ID' and 'PERIOD': 66
Total number of exact duplicate rows (all columns): 66
Number of duplicate rows based on 'STUDENT ID' and 'PERIOD' with different 'GPA': 0


Since the duplicated rows are exact copies of each other, we will keep one copy of each and drop the rest later.
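
Dropping the extra copies can be sketched with `drop_duplicates`, shown here on a toy frame with made-up rows:

```python
import pandas as pd

# Toy frame with one exact duplicate (hypothetical values)
results = pd.DataFrame({
    "STUDENT ID": ["1102-003/001", "1102-003/001", "1102-003/002"],
    "PERIOD": ["Semester 1", "Semester 1", "Semester 1"],
    "GPA": [1.9, 1.9, 2.5],
})

# Keep the first copy of each fully identical row.
deduped = results.drop_duplicates(keep="first").reset_index(drop=True)
print(len(results), "->", len(deduped))
```

If `PERIOD` is standardized before this step, rows that differ only in label style (for example `Sem 1` versus `Semester 1`) would also be recognized as duplicates, so the order of the two steps may matter.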


### Student Profile
---

#### Student_ID

In [11]:
# Check for duplicate student IDs in student_profiles
duplicate_student_profiles = student_profiles[student_profiles.duplicated(subset=['STUDENT ID'], keep=False)]

if duplicate_student_profiles.empty:
    print("No duplicate STUDENT IDs found in student_profiles.")
else:
    print("Duplicate STUDENT IDs found in student_profiles:")
    # Sort by 'STUDENT ID' to show duplicates next to each other
    display(duplicate_student_profiles.sort_values(by='STUDENT ID'))

Duplicate STUDENT IDs found in student_profiles:


Unnamed: 0,STUDENT ID,GENDER,SG CITIZEN,SG PR,FOREIGNER,COUNTRY OF OTHER NATIONALITY,DOB,HIGHEST QUALIFICATION,NAME OF QUALIFICATION AND INSTITUTION,DATE ATTAINED HIGHEST QUALIFICATION,DESIGNATION,COMMENCEMENT DATE,COMPLETION DATE,FULL-TIME OR PART-TIME,COURSE FUNDING,REGISTRATION FEE,PAYMENT MODE,COURSE FEE
74,2101-107/001,F,,Y,,Malaysia,08/07/1984,Degree,"Bachelor of Science (HRD), Universiti Teknolog...",2016-04-04,Manager,2022-04-24 00:00:00,2022-09-23 00:00:00,Part Time,Indivodual,107,NETS,2996
75,2101-107/001,F,,Y,,Malaysia,08/07/1984,Degree,"Bachelor of Science (HRD), Universiti Teknolog...",2016-04-04,Manager,2022-04-24 00:00:00,2022-09-23 00:00:00,Part Time,Indivodual,107,NETS,2996
217,5112-008/001,M,Y,,,,22/01/1983,Degree,Degree of Bachelor of Science with Honours in ...,2017-12-20,N.A.,2022-10-18 00:00:00,2023-09-29 00:00:00,Part-Time,Individual,107,Nets,5803
218,5112-008/001,M,Y,,,,22/01/1983,Degree,Degree of Bachelor of Science with Honours in ...,2017-12-20,Admin Manager,2022-10-18 00:00:00,2023-09-29 00:00:00,Part-Time,Individual,107,Nets,5803
221,5112-008/004,F,,,Y,China,06/08/1989,Degree,Bachelor of Business (Accounting)/\nMurdoch Un...,2018-10-20,Admin Supervisor,2022-10-18 00:00:00,2023-09-29 00:00:00,Part-Time,Individual,107,Nets,5803
222,5112-008/004,F,,,Y,China,06/08/1989,Degree,Bachelor of Business (Accounting)/\nMurdoch Un...,2018-10-20,Admin Supervisor,2022-10-18 00:00:00,2023-09-29 00:00:00,Part-Time,Individual,107,Nets,5803
276,5113-007/001,M,Y,,,,11/02/1972,Degree,Bachelor of Arts in Human Resource Management ...,1996-05-10,Manager,2023-10-16 00:00:00,2024-09-20 00:00:00,Part-Time,Individual,107,Nets,5803
285,5113-007/001,M,Y,,,,11/02/1972,Degree,Bachelor of Arts in Human Resource Management ...,1996-05-10,Manager,2023-10-16 00:00:00,2024-09-20 00:00:00,Part-Time,Individual,107,Nets,5803
277,5113-007/002,M,Y,,,,17/07/1966,Degree,Bachelor of Science/\nNational University of S...,1990-07-10,Assistant Vice-President,2023-10-16 00:00:00,2025-03-18 00:00:00,Part-Time,Individual,107,Nets,5803
286,5113-007/002,M,Y,,,,17/07/1966,Degree,Bachelor of Science/\nNational University of S...,1990-07-10,Assistant Vice-President,2023-10-16 00:00:00,2024-09-20 00:00:00,Part-Time,Individual,107,Nets,5803


Some `STUDENT ID` values appear more than once.

Looking through these rows, some of the pairs are not exact duplicates of each other.
Let us take a closer look at those.

In [12]:
# Find rows that are duplicates based on 'STUDENT ID'
duplicate_ids_mask = student_profiles.duplicated(subset=['STUDENT ID'], keep=False)

# Find rows that are exact duplicates across all columns
exact_duplicates_mask = student_profiles.duplicated(keep=False)

# Filter to find rows where 'STUDENT ID' is duplicated but the whole row is not an exact duplicate
mismatched_duplicate_rows = student_profiles[duplicate_ids_mask & ~exact_duplicates_mask].copy()

if mismatched_duplicate_rows.empty:
    print("No rows found where STUDENT ID is duplicated but the entire row is different.")
else:
    print("Rows with duplicate STUDENT IDs but different content:")
    # Sort by 'STUDENT ID' to group mismatched rows for the same student together
    mismatched_duplicate_rows_sorted = mismatched_duplicate_rows.sort_values(by='STUDENT ID')
    display(mismatched_duplicate_rows_sorted)

    print("\nDifferences between the rows:")
    # Group by STUDENT ID and compare rows within each group
    for student_id, group in mismatched_duplicate_rows_sorted.groupby('STUDENT ID'):
        print(f"\n--- Differences for STUDENT ID: {student_id} ---")
        # Assuming there are only two rows for each mismatched ID for simplicity in comparison
        if len(group) >= 2:
            row1 = group.iloc[0]
            row2 = group.iloc[1]
            differences = row1.compare(row2)
            if differences.empty:
                print("Rows are identical (no differences found).")
            else:
                display(differences)
        else:
            print("Not enough rows to compare for this STUDENT ID.")

Rows with duplicate STUDENT IDs but different content:


Unnamed: 0,STUDENT ID,GENDER,SG CITIZEN,SG PR,FOREIGNER,COUNTRY OF OTHER NATIONALITY,DOB,HIGHEST QUALIFICATION,NAME OF QUALIFICATION AND INSTITUTION,DATE ATTAINED HIGHEST QUALIFICATION,DESIGNATION,COMMENCEMENT DATE,COMPLETION DATE,FULL-TIME OR PART-TIME,COURSE FUNDING,REGISTRATION FEE,PAYMENT MODE,COURSE FEE
217,5112-008/001,M,Y,,,,22/01/1983,Degree,Degree of Bachelor of Science with Honours in ...,2017-12-20,N.A.,2022-10-18 00:00:00,2023-09-29 00:00:00,Part-Time,Individual,107,Nets,5803
218,5112-008/001,M,Y,,,,22/01/1983,Degree,Degree of Bachelor of Science with Honours in ...,2017-12-20,Admin Manager,2022-10-18 00:00:00,2023-09-29 00:00:00,Part-Time,Individual,107,Nets,5803
277,5113-007/002,M,Y,,,,17/07/1966,Degree,Bachelor of Science/\nNational University of S...,1990-07-10,Assistant Vice-President,2023-10-16 00:00:00,2025-03-18 00:00:00,Part-Time,Individual,107,Nets,5803
286,5113-007/002,M,Y,,,,17/07/1966,Degree,Bachelor of Science/\nNational University of S...,1990-07-10,Assistant Vice-President,2023-10-16 00:00:00,2024-09-20 00:00:00,Part-Time,Individual,107,Nets,5803



Differences between the rows:

--- Differences for STUDENT ID: 5112-008/001 ---


Unnamed: 0,self,other
DESIGNATION,N.A.,Admin Manager



--- Differences for STUDENT ID: 5113-007/002 ---


Unnamed: 0,self,other
COMPLETION DATE,2025-03-18 00:00:00,2024-09-20 00:00:00


From looking at the differences:
* **5112-008/001:** We keep the row where `DESIGNATION` is `Admin Manager`, on the assumption that the designation was left out of the first entry.
* **5113-007/002:** For `COMPLETION DATE`, let us check the other students in the same course and intake to see which date is more plausible.

In [13]:
# Filter student_profiles for the specific course code and intake number
course_intake_identifier = '5113-007'
relevant_students = student_profiles[student_profiles['STUDENT ID'].str.startswith(course_intake_identifier)].copy()

# Display the student IDs and their completion dates for this group
print(f"Completion dates for students in intake {course_intake_identifier}:")
display(relevant_students[['STUDENT ID', 'COMPLETION DATE']])

Completion dates for students in intake 5113-007:


Unnamed: 0,STUDENT ID,COMPLETION DATE
276,5113-007/001,2024-09-20 00:00:00
277,5113-007/002,2025-03-18 00:00:00
278,5113-007/003,2024-09-20 00:00:00
279,5113-007/004,2024-09-20 00:00:00
280,5113-007/005,2024-09-20 00:00:00
281,5113-007/005,2024-09-20 00:00:00
282,5113-007/006,2024-09-20 00:00:00
283,5113-007/007,2024-09-20 00:00:00
284,5113-007/008,2024-09-20 00:00:00
285,5113-007/001,2024-09-20 00:00:00


Every other student shown for this intake has a completion date of `2024-09-20 00:00:00`, so that date is more plausible and we keep that row instead of the one with `2025-03-18 00:00:00`.

We will therefore remove the rows at index `217` and `277`.
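
The same reasoning can be expressed in code by comparing each row with the most common completion date in its intake. This is a sketch on toy data; the `INTAKE` column and the variable names are illustrative assumptions, and it assumes every intake has at least one non-null date.

```python
import pandas as pd

# Toy frame mirroring the conflicting pair above (hypothetical subset)
profiles = pd.DataFrame({
    "STUDENT ID": ["5113-007/001", "5113-007/002", "5113-007/002", "5113-007/003"],
    "COMPLETION DATE": pd.to_datetime(
        ["2024-09-20", "2025-03-18", "2024-09-20", "2024-09-20"]
    ),
})

# Intake key = everything before the slash, e.g. "5113-007"
profiles["INTAKE"] = profiles["STUDENT ID"].str.split("/").str[0]

# Most common completion date within each intake
modal_by_intake = profiles.groupby("INTAKE")["COMPLETION DATE"].agg(
    lambda s: s.mode().iloc[0]
)
modal_date = profiles["INTAKE"].map(modal_by_intake)

# Rows whose date disagrees with their intake's most common date
outliers = profiles[profiles["COMPLETION DATE"] != modal_date]
print(outliers)
```

A row flagged this way is only a candidate for removal. A student may genuinely finish later than the rest of the intake, so the flagged rows still need a manual look.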

In [14]:
# Remove rows with index 217 and 277
rows_to_remove = [217, 277]
student_profiles_cleaned = student_profiles.drop(index=rows_to_remove, errors='ignore').copy()

print(f"Number of rows before removal: {len(student_profiles)}")
print(f"Number of rows after removal: {len(student_profiles_cleaned)}")

# Update the student_profiles DataFrame
student_profiles = student_profiles_cleaned

print("\nStudent profiles after removing specified rows:")
display(student_profiles.head())

Number of rows before removal: 307
Number of rows after removal: 305

Student profiles after removing specified rows:


Unnamed: 0,STUDENT ID,GENDER,SG CITIZEN,SG PR,FOREIGNER,COUNTRY OF OTHER NATIONALITY,DOB,HIGHEST QUALIFICATION,NAME OF QUALIFICATION AND INSTITUTION,DATE ATTAINED HIGHEST QUALIFICATION,DESIGNATION,COMMENCEMENT DATE,COMPLETION DATE,FULL-TIME OR PART-TIME,COURSE FUNDING,REGISTRATION FEE,PAYMENT MODE,COURSE FEE
0,1101-009/001,F,,,Y,Malaysia,13/09/1981,Certificate,SPM,2018-01-08,Admin & HR Assistant,2022-04-18 00:00:00,2023-09-17 00:00:00,Part-Time,Individual,107,GIRO,5136
1,1101-009/002,F,Y,,,,26/07/1979,Certificate,"Certificate in Office Skills, ITE",2016-06-08,Admin Assistant,2022-04-18 00:00:00,2023-09-17 00:00:00,Part-Time,Individual-SFC,107,NETS,5136
2,1101-009/003,F,,,Y,India,01/02/1990,Degree,"Bachelor of Business Administration, Universit...",2015-08-08,-,2022-04-18 00:00:00,2023-09-17 00:00:00,Part-Time,Individual,107,NETS,5136
3,1101-009/004,F,,,Y,Netherlands,20/04/1976,Diploma,"Office Management Diploma, NCOI Rotterdam, The...",2018-02-08,HR Support / Office Manager,2022-04-18 00:00:00,2023-09-17 00:00:00,Part-Time,Individual,107,NETS,5136
4,1101-009/005,F,Y,,,,25/11/1983,Diploma,"Diploma in Business Admininstration, LCCI Leve...",2015-06-08,"Executive, Administration",2022-04-18 00:00:00,2023-09-17 00:00:00,Part-Time,Sponsored,107,GIRO,4812


In [15]:
import pandas as pd

# Step 1: Get unique student IDs from each DataFrame
ids_profiles = set(student_profiles['STUDENT ID'].dropna().unique())
ids_results = set(semester_results['STUDENT ID'].dropna().unique())

# Step 2: Compare sets
if ids_profiles == ids_results:
    print("STUDENT ID columns match exactly in both datasets.")
else:
    print("Mismatch found between STUDENT ID columns.")

    # Extra: Show differences
    only_in_profiles = ids_profiles - ids_results
    only_in_results = ids_results - ids_profiles

    if only_in_profiles:
        print("Student IDs only in student_profiles:")
        print(only_in_profiles)

    if only_in_results:
        print("Student IDs only in semester_results:")
        print(only_in_results)


Mismatch found between STUDENT ID columns.
Student IDs only in student_profiles:
{'5113-009/007', '2101-111/006', '2101-111/008', '5113-009/002', '2101-111/003', '5113-009/003', '2101-111/002', '2101-111/007', '5113-009/006', '2101-111/001', '5113-009/004', '5113-009/005', '5113-009/001', '2101-111/004', '2101-111/005'}
Student IDs only in semester_results:
{'2101-106/005', '5112-007/006', '5112-007/005', '2101-106/001', '5112-007/001', '5112-007/003', '2101-106/003', '2101-106/002', '5112-007/002', '2101-106/004', '5112-007/004'}


In [16]:
# Get sets of Student IDs
ids_profiles = set(student_profiles['STUDENT ID'].dropna().unique())
ids_results = set(semester_results['STUDENT ID'].dropna().unique())

# Find mismatched IDs (present in semester_results but not in student_profiles)
only_in_results = ids_results - ids_profiles

# Filter and print those rows from semester_results
mismatched_rows = semester_results[semester_results['STUDENT ID'].isin(only_in_results)]

print("❌ Mismatched rows from semester_results:")
print(mismatched_rows)


❌ Mismatched rows from semester_results:
       STUDENT ID PERIOD  GPA
116  2101-106/001  Sem 1  2.4
117  2101-106/002  Sem 1  3.1
118  2101-106/003  Sem 1  3.4
119  2101-106/004  Sem 1  2.8
120  2101-106/005  Sem 1  2.3
130  5112-007/001  Sem 1  2.9
131  5112-007/001  Sem 2  3.8
132  5112-007/002  Sem 1  3.2
133  5112-007/002  Sem 2  3.3
134  5112-007/003  Sem 1  3.6
135  5112-007/003  Sem 2  3.7
136  5112-007/004  Sem 1  3.5
137  5112-007/004  Sem 2  2.6
138  5112-007/005  Sem 1  2.1
139  5112-007/005  Sem 2  2.3
140  5112-007/006  Sem 1  3.2
141  5112-007/006  Sem 2  3.1


In [17]:
import pandas as pd

# Extract all unique prefixes from semester_results
prefixes = set(semester_results['STUDENT ID'].str.extract(r'^(\d{4}-\d{3})')[0])

# Filter student_profiles to include only rows whose STUDENT ID starts with one of the prefixes
matched_profiles = student_profiles[
    student_profiles['STUDENT ID'].str.extract(r'^(\d{4}-\d{3})')[0].isin(prefixes)
].copy()

# Display result
print("✅ Matched student profiles (by prefix only):")
matched_profiles


✅ Matched student profiles (by prefix only):


Unnamed: 0,STUDENT ID,GENDER,SG CITIZEN,SG PR,FOREIGNER,COUNTRY OF OTHER NATIONALITY,DOB,HIGHEST QUALIFICATION,NAME OF QUALIFICATION AND INSTITUTION,DATE ATTAINED HIGHEST QUALIFICATION,DESIGNATION,COMMENCEMENT DATE,COMPLETION DATE,FULL-TIME OR PART-TIME,COURSE FUNDING,REGISTRATION FEE,PAYMENT MODE,COURSE FEE
0,1101-009/001,F,,,Y,Malaysia,13/09/1981,Certificate,SPM,2018-01-08,Admin & HR Assistant,2022-04-18 00:00:00,2023-09-17 00:00:00,Part-Time,Individual,107,GIRO,5136
1,1101-009/002,F,Y,,,,26/07/1979,Certificate,"Certificate in Office Skills, ITE",2016-06-08,Admin Assistant,2022-04-18 00:00:00,2023-09-17 00:00:00,Part-Time,Individual-SFC,107,NETS,5136
2,1101-009/003,F,,,Y,India,01/02/1990,Degree,"Bachelor of Business Administration, Universit...",2015-08-08,-,2022-04-18 00:00:00,2023-09-17 00:00:00,Part-Time,Individual,107,NETS,5136
3,1101-009/004,F,,,Y,Netherlands,20/04/1976,Diploma,"Office Management Diploma, NCOI Rotterdam, The...",2018-02-08,HR Support / Office Manager,2022-04-18 00:00:00,2023-09-17 00:00:00,Part-Time,Individual,107,NETS,5136
4,1101-009/005,F,Y,,,,25/11/1983,Diploma,"Diploma in Business Admininstration, LCCI Leve...",2015-06-08,"Executive, Administration",2022-04-18 00:00:00,2023-09-17 00:00:00,Part-Time,Sponsored,107,GIRO,4812
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
295,5113-008/003,F,Y,,,,18/11/1991,Degree,Bachelor of Business (Marketing)/\nRMIT Univer...,2017-03-21,Regional Recruiter,2024-04-08 00:00:00,2025-03-18 00:00:00,Part-Time,Individual,107,Nets,5803
296,5113-008/004,F,Y,,,,29/04/1974,Degree,Bachelor of Commerce in Management and Marketi...,2017-02-28,Confidential Assistant,2024-04-08 00:00:00,2025-03-18 00:00:00,Part-Time,Individual,107,Nets,5803
297,5113-008/005,F,Y,,,,19/10/1981,Degree,"Bachelor of Arts in Human Resource Management,...",2018-01-30,Journey Management Team Lead,2024-04-08 00:00:00,2025-03-18 00:00:00,Part-Time,Individual,107,Nets,5803
298,5113-008/006,F,Y,,,,19/03/1971,Degree,Bachelor of Arts (Sociology)/\nState Universit...,1995-05-30,Academy Program Coordinator,2024-04-08 00:00:00,2025-03-18 00:00:00,Part-Time,Individual,107,Nets,5803




---

#### Checking students who are not found in semester_results

In [18]:
# Get sets of Student IDs
ids_profiles = set(student_profiles['STUDENT ID'].dropna().unique())
ids_results = set(semester_results['STUDENT ID'].dropna().unique())

# Find mismatched IDs (present in student_profiles but not in semester_results)
# Note: the name only_in_results is reused here, but it holds IDs found only in student_profiles
only_in_results =  ids_profiles - ids_results

# Filter and print those rows from student_profiles
mismatched_rows = student_profiles[student_profiles['STUDENT ID'].isin(only_in_results)]

print("❌ Mismatched rows from student_profiles:")
mismatched_rows.head(20)


❌ Mismatched rows from student_profiles:


Unnamed: 0,STUDENT ID,GENDER,SG CITIZEN,SG PR,FOREIGNER,COUNTRY OF OTHER NATIONALITY,DOB,HIGHEST QUALIFICATION,NAME OF QUALIFICATION AND INSTITUTION,DATE ATTAINED HIGHEST QUALIFICATION,DESIGNATION,COMMENCEMENT DATE,COMPLETION DATE,FULL-TIME OR PART-TIME,COURSE FUNDING,REGISTRATION FEE,PAYMENT MODE,COURSE FEE
108,2101-111/001,F,,Y,,Malaysian,08/07/1995,Degree,"Bachelor of Science (HRD), Universiti Teknolog...",2020-04-04,Manager,2025-04-24 00:00:00,,Part Time,Indivodual,107,PayNow,2996
109,2101-111/002,F,Y,,,,08/09/1997,Certificate,Certificate in Grammar & Writing Intermediate ...,2021-10-04,Admin Assistant,2025-04-24 00:00:00,,Part Time,Individual,107,PayNow,2996
110,2101-111/003,F,Y,,,,19/06/1999,Diploma,"Diploma in Mechatronic Engineering, Nee Ann Po...",2020-12-24,HR Manager,2025-04-24 00:00:00,,Part Time,Sponsored,107,PayNow,2696
111,2101-111/004,F,Y,,,,28/09/2010,Certificate,"Higher Nitec in Hospitality Operations, ITE",2021-09-24,Admin Executive,2025-04-24 00:00:00,,Part Time,Individual,107,PayNow,2996
112,2101-111/005,F,Y,,,,19/10/1990,Certificate,O' level,2010-07-24,Admin Executive,2025-04-24 00:00:00,,Part Time,Individual - waived App Fee,Waived,Waived,2596
113,2101-111/006,F,Y,,,,10/04/1999,Certificate,Higher Nitec in Business Studies (Service Mana...,2012-04-24,Admin Assistant,2025-04-24 00:00:00,,Part Time,Sponsored,107,PayNow,2696
114,2101-111/007,F,Y,,,,24/05/2000,Certificate,O' levels,2020-04-24,Secretary,2025-04-24 00:00:00,,Part Time,Individual,107,PayNow,2996
115,2101-111/008,F,Y,,,,12/04/2001,Certificate,O' levels,2022-05-24,Executive,2025-04-24 00:00:00,,Part Time,Individual,107,PayNow,2996
300,5113-009/001,F,,,Y,India,14/11/1985,Degree,Bachelor of Arts in Business with Logistics an...,2019-08-04,Program Coordinator,2025-04-14 00:00:00,,Part-Time,Individual,107,Nets,5803
301,5113-009/002,F,Y,,,,28/01/1993,Degree,Bachelor of Business (Business Administration)...,2020-10-04,"Assistant Director, Business Development",2025-04-14 00:00:00,,Part-Time,Individual,107,Nets,5803


Students who appear in `student_profiles` but not in `semester_results` are those who have not finished a single semester or have not been graded yet. This is consistent with the rows above, which all commenced in April 2025 and have no completion date.

#### Gender
---

In [19]:
# Check unique values in the 'GENDER' column
unique_genders = student_profiles['GENDER'].value_counts().reset_index()
unique_genders.columns = ['Gender', 'Count']
print("Unique values in 'GENDER' column:")
display(unique_genders)

Unique values in 'GENDER' column:


Unnamed: 0,Gender,Count
0,F,265
1,M,40


The unique values are only `F` and `M`, so this column is already clean.

#### SG CITIZEN, SG PR & FOREIGNER
---

In [20]:
import pandas as pd
import numpy as np

# Load the student_profiles DataFrame
student_profiles = pd.read_excel("https://github.com/swamhtetg90/DAVI-CA2/blob/main/DAVI%20CA2%20datasets%20and%20meta%20data/Student%20Profiles.xlsx?raw=true")

# Replace empty strings with NaN in the specified columns
cols_to_check = ['SG CITIZEN', 'SG PR', 'FOREIGNER']
for col in cols_to_check:
    student_profiles[col] = student_profiles[col].replace(r'^\s*$', np.nan, regex=True)

# Check if there is only one non-null value among 'SG CITIZEN', 'SG PR', and 'FOREIGNER' for each row
student_profiles['Residential_Status_Check'] = (student_profiles[['SG CITIZEN', 'SG PR', 'FOREIGNER']].notna().sum(axis=1) == 1)

# Select rows where the condition is False (i.e., not exactly one non-null value)
mismatched_residential_status = student_profiles[~student_profiles['Residential_Status_Check']]

if mismatched_residential_status.empty:
    print("Each student has exactly one residential status specified.")
else:
    print("The following rows have more or less than one residential status specified:")
    display(mismatched_residential_status[['STUDENT ID', 'SG CITIZEN', 'SG PR', 'FOREIGNER', 'Residential_Status_Check']])

# Drop the temporary check column
student_profiles = student_profiles.drop(columns=['Residential_Status_Check'])

Each student has exactly one residential status specified.


Therefore, we can combine these 3 columns into 1 column called "Residential Status".
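
A minimal sketch of how the three flag columns could be collapsed into one label. The function name `residential_status` and the label strings are assumptions made for illustration; they do not come from the dataset or its metadata.

```python
def residential_status(sg_citizen, sg_pr, foreigner):
    """Collapse the three flag columns into one label.

    A flag counts as set when it is a non-blank value such as 'Y' or 'Yes';
    None, NaN and whitespace-only values count as unset.
    """
    def is_set(value):
        # NaN is the only value that is not equal to itself
        if value is None or value != value:
            return False
        return str(value).strip() != ""

    labels = ["SG Citizen", "SG PR", "Foreigner"]  # assumed label names
    flags = [is_set(sg_citizen), is_set(sg_pr), is_set(foreigner)]
    if sum(flags) != 1:
        return None  # zero or several flags set: leave for manual review
    return labels[flags.index(True)]
```

It could be applied row by row, for example with `student_profiles.apply(lambda r: residential_status(r['SG CITIZEN'], r['SG PR'], r['FOREIGNER']), axis=1)`.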

#### Nationality
---

In [21]:
# Count unique values, including nulls
unique_nationalities = student_profiles['COUNTRY OF OTHER NATIONALITY'].value_counts(dropna=False).reset_index()
unique_nationalities.columns = ['Nationality', 'Count']

# Separate null count from the rest
null_count = unique_nationalities[unique_nationalities['Nationality'].isna()]['Count'].values[0] if unique_nationalities['Nationality'].isna().any() else 0

# Drop the NaN row to show only actual nationalities in the list
non_null_nationalities = unique_nationalities[unique_nationalities['Nationality'].notna()].copy()

# Display summary
print("Unique non-null values in 'COUNTRY OF OTHER NATIONALITY' column:")
display(non_null_nationalities)

print(f"\nNumber of null (missing) values: {null_count}")
print(f"Total number of records: {len(student_profiles)}")


Unique non-null values in 'COUNTRY OF OTHER NATIONALITY' column:


Unnamed: 0,Nationality,Count
1,,88
2,Malaysia,27
3,China,14
4,India,12
5,Philippines,8
6,Myanmar,3
7,Netherlands,1
8,Malaysian,1
9,Vietnam,1
10,Indonesia,1



Number of null (missing) values: 151
Total number of records: 307


According to the metadata, this column is left blank (null) for SG citizens, so we need to check whether the number of blank values matches the number of citizens flagged in the SG CITIZEN column.

In [22]:
# Count occurrences including NaN (null) values
sg_citizen_counts = student_profiles['SG CITIZEN'].value_counts(dropna=False).reset_index()

# Rename columns for clarity
sg_citizen_counts.columns = ['SG CITIZEN Value', 'Count']

# Display the counts
print("Count of each value in the 'SG CITIZEN' column (including nulls):")
display(sg_citizen_counts)


Count of each value in the 'SG CITIZEN' column (including nulls):


Unnamed: 0,SG CITIZEN Value,Count
0,Y,224
1,,68
2,Yes,15


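The counts match: 224 `Y` + 15 `Yes` = 239 citizens, and 151 null values + 88 entries that display as blank = 239 rows with no country. So when "COUNTRY OF OTHER NATIONALITY" is blank, the nationality is Singapore. The outputs also show two inconsistencies to standardize: SG CITIZEN mixes `Y` and `Yes`, and the nationality list contains both `Malaysia` and `Malaysian`.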
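
The rule "blank means Singapore" and the `Malaysia` / `Malaysian` variant seen in the counts could be handled together. The sketch below is one possible way; `clean_nationality` and its `is_citizen` parameter are hypothetical names, and the variant table only covers the one spelling visible in the output above.

```python
def clean_nationality(value, is_citizen):
    """Return a standardized country name for one profile row."""
    # Spelling variants seen in the value counts; extend as needed.
    variants = {"malaysian": "Malaysia"}

    # Treat None, NaN and whitespace-only strings as blank
    blank = value is None or value != value or str(value).strip() == ""
    if blank:
        # Blank means Singapore only when the citizen flag is set
        return "Singapore" if is_citizen else None
    text = str(value).strip()
    return variants.get(text.lower(), text)
```

Returning `None` for a blank country on a non-citizen row keeps such rows visible for review instead of silently labelling them.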

#### DOB (Date of Birth)
---

In [23]:
# Check for missing values in the 'DOB' column
missing_dob = student_profiles['DOB'].isnull().sum()
print(f"Number of missing values in 'DOB': {missing_dob}")

# Attempt to convert 'DOB' to datetime, coercing errors
student_profiles['DOB_datetime'] = pd.to_datetime(student_profiles['DOB'], errors='coerce', format='%d/%m/%Y')

# Check for values that couldn't be parsed (will be NaT - Not a Time)
invalid_dob = student_profiles[student_profiles['DOB_datetime'].isna() & student_profiles['DOB'].notna()]

print("\nRows with invalid 'DOB' values:")
display(invalid_dob[['STUDENT ID', 'DOB', 'DOB_datetime']])

# Drop the temporary datetime column
student_profiles = student_profiles.drop(columns=['DOB_datetime'])

Number of missing values in 'DOB': 0

Rows with invalid 'DOB' values:


Unnamed: 0,STUDENT ID,DOB,DOB_datetime
105,2101-110/006,13-Feb-1984,NaT
106,2101-110/007,13-Jul-1987,NaT
107,2101-110/008,15-Jul-1994,NaT
184,2102-067A/011,16-07-1991,NaT


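The values that could not be converted to datetime fall into two groups: three use an abbreviated month name (e.g. `13-Feb-1984`), and one uses dashes instead of slashes (`16-07-1991`).

So, we will have to convert them to the `DD/MM/YYYY` format used by the other rows.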
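
One possible way to do the conversion is to try each observed format in turn. This is a sketch, not the notebook's final cleaning step: `parse_dob` and `DOB_FORMATS` are hypothetical names, and `%b` relies on an English locale for month abbreviations such as `Feb`.

```python
from datetime import datetime

# Formats observed in the DOB column: 25/11/1983, 13-Feb-1984, 16-07-1991
DOB_FORMATS = ("%d/%m/%Y", "%d-%b-%Y", "%d-%m-%Y")

def parse_dob(value):
    """Try each known format in turn; return None when none matches."""
    if value is None or value != value:  # None or NaN/NaT
        return None
    if isinstance(value, datetime):      # already parsed, keep as-is
        return value
    text = str(value).strip()
    for fmt in DOB_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    return None
```

With pandas this could be used as `pd.to_datetime(student_profiles['DOB'].map(parse_dob))`, which would keep the four rows above instead of coercing them to `NaT`.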

#### Name of Qualification and Institution
----

We decided that splitting the qualification and the institution out of the column "NAME OF QUALIFICATION AND INSTITUTION" would be useful for our future analysis. First, we add a new column named "Comma Count", which holds the number of commas in each row.

In [24]:
# Step 1: Create a new column with the count of commas in each row
student_profiles['Comma Count'] = student_profiles['NAME OF QUALIFICATION AND INSTITUTION'].astype(str).str.count(',')

# Step 2: Group by the number of commas and count rows in each group
comma_group_counts = student_profiles['Comma Count'].value_counts().sort_index().reset_index()
comma_group_counts.columns = ['Number of Commas', 'Number of Rows']

# Step 3: Display the result
print("Rows grouped by number of commas in 'NAME OF QUALIFICATION AND INSTITUTION':")
display(comma_group_counts)


Rows grouped by number of commas in 'NAME OF QUALIFICATION AND INSTITUTION':


Unnamed: 0,Number of Commas,Number of Rows
0,0,137
1,1,143
2,2,21
3,3,6


Data Analysis Based on Comma Count

- **Rows with 1 comma (143 rows)**:
  - Clear structure for data splitting
  - Can use comma "," as delimiter to split data into qualification and institute
  - Will implement this splitting approach for these rows

- **Rows with 0 commas (137 rows)**:
  - Expected to contain only qualification name OR institute name
  - Require further examination to determine content type
  - Need additional analysis to classify these entries

- **Rows with 2 or more commas (27 rows)**:
  - Complex structure requiring detailed investigation
  - Need further examination to understand data format
  - May require custom parsing logic for proper data extraction

We will examine each comma-count group in turn, starting with the rows that have 0 commas.

##### Rows with 0 commas

In [25]:
# Filter rows with no commas (Comma Count == 0) and reset index
no_comma_rows = student_profiles[student_profiles['Comma Count'] == 0].reset_index(drop=True)

# Display the filtered rows
print("Rows with NO commas in 'NAME OF QUALIFICATION AND INSTITUTION':")
display(no_comma_rows[['NAME OF QUALIFICATION AND INSTITUTION']])



Rows with NO commas in 'NAME OF QUALIFICATION AND INSTITUTION':


Unnamed: 0,NAME OF QUALIFICATION AND INSTITUTION
0,SPM
1,N' level
2,O' level
3,N' Level
4,SPM
...,...
132,Bachelor of Arts in Business with Logistics an...
133,Bachelor of Business (Business Administration)...
134,Bachelor of Science in Accounting and Finance ...
135,Bachelor of Commerce (Accounting and Finance)/...


From this table, I observed that many of these rows do contain institution names, but joined with "/" instead of a comma, so for these rows I will count the number of "/" next.

In [26]:
# Count '/' in rows that have no commas
no_comma_rows['Slash Count'] = no_comma_rows['NAME OF QUALIFICATION AND INSTITUTION'].str.count('/')

# Group and count how many rows have how many slashes
slash_count_summary = no_comma_rows['Slash Count'].value_counts().sort_index().reset_index()
slash_count_summary.columns = ['Number of Slashes', 'Number of Rows']

print("Distribution of '/' in rows WITHOUT commas:")
display(slash_count_summary)


Distribution of '/' in rows WITHOUT commas:


Unnamed: 0,Number of Slashes,Number of Rows
0,0,57
1,1,80


We will examine the rows with 0 slashes and the rows with 1 slash separately, to see their structure.

In [28]:
# Filter rows in no_comma_rows that contain no slashes
rows_with_noslashes_in_no_comma = no_comma_rows[no_comma_rows['Slash Count'] == 0].reset_index(drop=True)

# Display the filtered rows
print("Rows with 0 slashes in 'NAME OF QUALIFICATION AND INSTITUTION' (from no_comma_rows):")
display(rows_with_noslashes_in_no_comma[['NAME OF QUALIFICATION AND INSTITUTION']])

Rows with 0 slashes in 'NAME OF QUALIFICATION AND INSTITUTION' (from no_comma_rows):


Unnamed: 0,NAME OF QUALIFICATION AND INSTITUTION
0,SPM
1,N' level
2,O' level
3,N' Level
4,SPM
5,O' level
6,O' level
7,N' Level
8,SPM
9,SPM


These are all qualifications, so rows with 0 commas and 0 slashes go into the "Qualification" column.

In [29]:
# Filter rows in no_comma_rows that contain at least one slash
rows_with_slashes_in_no_comma = no_comma_rows[no_comma_rows['Slash Count'] > 0].reset_index(drop=True)

# Display the filtered rows
print("Rows with slashes in 'NAME OF QUALIFICATION AND INSTITUTION' (from no_comma_rows):")
display(rows_with_slashes_in_no_comma[['NAME OF QUALIFICATION AND INSTITUTION']])

Rows with slashes in 'NAME OF QUALIFICATION AND INSTITUTION' (from no_comma_rows):


Unnamed: 0,NAME OF QUALIFICATION AND INSTITUTION
0,Degree of Bachelor of Science with Honours in ...
1,Degree of Bachelor of Science with Honours in ...
2,Honours Degree of Bachelor of Science (Managem...
3,Master of Arts (Chinese Studies)/\nNational Un...
4,Bachelor of Business (Accounting)/\nMurdoch Un...
...,...
75,Bachelor of Arts in Business with Logistics an...
76,Bachelor of Business (Business Administration)...
77,Bachelor of Science in Accounting and Finance ...
78,Bachelor of Commerce (Accounting and Finance)/...


As observed, these rows contain institute names, joined to the qualification with a slash.
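
For rows of this shape, a split at the first slash may be enough. The sketch below assumes the text before the slash is the qualification and the text after it is the institute, which would still need checking against the data; `split_on_slash` is a hypothetical helper.

```python
import re

def split_on_slash(text):
    """Split 'Qualification/\\nInstitute' at the first slash.

    Whitespace around the slash, including the newline seen in the
    data, is discarded. Returns (qualification, institute or None).
    """
    parts = re.split(r"\s*/\s*", text.strip(), maxsplit=1)
    qualification = parts[0]
    institute = parts[1] if len(parts) == 2 else None
    return qualification, institute
```

A value without a slash, such as `O' level`, comes back with `None` as the institute.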

##### Rows with 1 comma

In [31]:
# Filter rows with exactly one comma (Comma Count == 1) and reset index
one_comma_rows = student_profiles[student_profiles['Comma Count'] == 1].reset_index(drop=True)

# Display the filtered rows
print("Rows with one comma in 'NAME OF QUALIFICATION AND INSTITUTION':")
display(one_comma_rows[['NAME OF QUALIFICATION AND INSTITUTION']])


Rows with one comma in 'NAME OF QUALIFICATION AND INSTITUTION':


Unnamed: 0,NAME OF QUALIFICATION AND INSTITUTION
0,"Certificate in Office Skills, ITE"
1,"Lower Secondary Education, Malaysia"
2,"Bahcelor of Management (Marketing), Universiti..."
3,"Diploma in Multimedia & Infocomm Technology, N..."
4,"Diploma in Business, Temasek Polytechnic"
...,...
138,"BBA, Monash University"
139,"BBA, Monash University"
140,"Bachelor of Arts in Human Resource Management,..."
141,Bachelor of Science in Hotel Administration (H...


These rows have some peculiarities: some contain no institution name but several qualifications, and some use the comma to separate university and country. I will first count the slashes in each row.

In [32]:
# Count '/' in rows that have one comma
one_comma_rows['Slash Count'] = one_comma_rows['NAME OF QUALIFICATION AND INSTITUTION'].str.count('/')

# Group and count how many rows have how many slashes
slash_count_summary = one_comma_rows['Slash Count'].value_counts().sort_index().reset_index()
slash_count_summary.columns = ['Number of Slashes', 'Number of Rows']

print("Distribution of '/' in rows with ONE comma:")
display(slash_count_summary)

Distribution of '/' in rows with ONE comma:


Unnamed: 0,Number of Slashes,Number of Rows
0,0,137
1,1,6


As observed, only 6 rows have slashes.

In [33]:
# Further filter those with NO slash
no_slash_in_one_comma = one_comma_rows[one_comma_rows['NAME OF QUALIFICATION AND INSTITUTION'].str.count('/') == 0].reset_index(drop=True)

# Display the result
print("Rows with EXACTLY 1 comma and no slash in 'NAME OF QUALIFICATION AND INSTITUTION':")
display(no_slash_in_one_comma[['NAME OF QUALIFICATION AND INSTITUTION']])

Rows with EXACTLY 1 comma and no slash in 'NAME OF QUALIFICATION AND INSTITUTION':


Unnamed: 0,NAME OF QUALIFICATION AND INSTITUTION
0,"Certificate in Office Skills, ITE"
1,"Lower Secondary Education, Malaysia"
2,"Bahcelor of Management (Marketing), Universiti..."
3,"Diploma in Multimedia & Infocomm Technology, N..."
4,"Diploma in Business, Temasek Polytechnic"
...,...
132,"Bachelor of Arts and Social Science, National ..."
133,"BBA, Monash University"
134,"BBA, Monash University"
135,"Bachelor of Arts in Human Resource Management,..."


Most follow the structure "qualification, institute", but some have a country name in place of the institute.

In [34]:
# Further filter those with exactly one slash
one_slash_in_one_comma = one_comma_rows[one_comma_rows['NAME OF QUALIFICATION AND INSTITUTION'].str.count('/') == 1].reset_index(drop=True)

# Display the result
print("Rows with EXACTLY 1 comma and EXACTLY 1 slash in 'NAME OF QUALIFICATION AND INSTITUTION':")
display(one_slash_in_one_comma[['NAME OF QUALIFICATION AND INSTITUTION']])


Rows with EXACTLY 1 comma and EXACTLY 1 slash in 'NAME OF QUALIFICATION AND INSTITUTION':


Unnamed: 0,NAME OF QUALIFICATION AND INSTITUTION
0,"SPM, LCCI Level 2 Certificate in Book-keeping ..."
1,"Advanced Diploma in Business Administration, U..."
2,"Bachelor of Accountancy, jointly offered by NU..."
3,Professional Diploma in Leadership and People ...
4,Bachelor of Science in Hotel Administration (H...
5,Bachelor of Science in Hotel Administration (H...


These rows list multiple qualifications plus an institute, and some include city names.

##### Rows with 2 commas

In [35]:
two_comma_rows = student_profiles[student_profiles['Comma Count'] == 2].reset_index(drop=True)

# Display the filtered rows
print("Rows with two commas in 'NAME OF QUALIFICATION AND INSTITUTION':")
display(two_comma_rows[['NAME OF QUALIFICATION AND INSTITUTION']])

Rows with two commas in 'NAME OF QUALIFICATION AND INSTITUTION':


Unnamed: 0,NAME OF QUALIFICATION AND INSTITUTION
0,"Bachelor of Business Administration, Universit..."
1,"Office Management Diploma, NCOI Rotterdam, The..."
2,"Advanced Diploma in Tourism, Hospitality and E..."
3,"Certified Accounting Technicians (CATS), ACCA,..."
4,"Certified Accounting Technicians (CATS), ACCA,..."
5,"NTC 2, N Level, WPLN"
6,"WSQ Higher Certificate in Human Resources, WPL..."
7,"Certificate in Payroll Administration, SHRI Ac..."
8,"NTC 2, N Level, WPLN"
9,"SPM, Certificate in Business Studies (Business..."


In [36]:
# Count '/' in rows that have two commas
two_comma_rows['Slash Count'] = two_comma_rows['NAME OF QUALIFICATION AND INSTITUTION'].str.count('/')

# Group and count how many rows have how many slashes
slash_count_summary_two_commas = two_comma_rows['Slash Count'].value_counts().sort_index().reset_index()
slash_count_summary_two_commas.columns = ['Number of Slashes', 'Number of Rows']

print("Distribution of '/' in rows with TWO commas:")
display(slash_count_summary_two_commas)

Distribution of '/' in rows with TWO commas:


Unnamed: 0,Number of Slashes,Number of Rows
0,0,21


These rows contain no slashes, so the comma is the only delimiter available for splitting them into fragments.

##### Rows with 3 commas

In [37]:

three_comma_rows = student_profiles[student_profiles['Comma Count'] == 3].reset_index(drop=True)

# Display the filtered rows
print("Rows with three commas in 'NAME OF QUALIFICATION AND INSTITUTION':")
display(three_comma_rows[['NAME OF QUALIFICATION AND INSTITUTION']])

Rows with three commas in 'NAME OF QUALIFICATION AND INSTITUTION':


Unnamed: 0,NAME OF QUALIFICATION AND INSTITUTION
0,"Diploma in Business Admininstration, LCCI Leve..."
1,"O' levels, Diploma in Business Management (dis..."
2,"Diploma in Business Admininstration, LCCI Leve..."
3,"Diploma in Financial Informatics, Nanyang Poly..."
4,"Bachelor of Arts, New Era University, Quezon C..."
5,"Diploma in Business Administration (HRM), PSB ..."


In [38]:
# Count '/' in rows that have three commas
three_comma_rows['Slash Count'] = three_comma_rows['NAME OF QUALIFICATION AND INSTITUTION'].str.count('/')

# Group and count how many rows have how many slashes
slash_count_summary_three_commas = three_comma_rows['Slash Count'].value_counts().sort_index().reset_index()
slash_count_summary_three_commas.columns = ['Number of Slashes', 'Number of Rows']

print("Distribution of '/' in rows with THREE commas:")
display(slash_count_summary_three_commas)

Distribution of '/' in rows with THREE commas:


Unnamed: 0,Number of Slashes,Number of Rows
0,0,5
1,2,1


Only one row contains slashes (two of them). Now we can define the final logic for identifying qualification and institute correctly.

##### Splitting qualification and institution
This logic demonstrates how to parse a mixed column  
(`NAME OF QUALIFICATION AND INSTITUTION`) into **two separate columns**:  

- **Qualification** (e.g. Diploma, Bachelor, Nitec, etc.)  
- **Institute** (e.g. ITE, PSB Academy, SHRI, University names, etc.)  

We’ll use:
- **Regex normalization** to handle messy text (like `o’level`, `o' level`, etc.)  
- **Keyword matching** for known qualifications and institutes  
- **Memory sets** to ensure consistency (if `"SHRI Academy"` is identified as an institute once, then `"SHRI"` alone will always be classified as institute afterwards).  
- **Collector for unknowns** so we can later inspect misclassified or new terms.  
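
The four ingredients listed above can be sketched as follows. This is an illustration under assumptions, not the notebook's actual implementation: the keyword lists are short examples, and all function and variable names are hypothetical.

```python
import re

# Illustrative keyword lists (assumptions); real lists would be built
# from the values inspected in the previous cells.
QUALIFICATION_KEYWORDS = ("diploma", "bachelor", "master", "degree",
                          "certificate", "nitec", "o level", "n level",
                          "spm", "bba")
INSTITUTE_KEYWORDS = ("university", "universiti", "polytechnic",
                      "academy", "institute", "college", "ite")

known_institutes = set()  # memory of names already classified as institutes
unknown_fragments = []    # collector for fragments that match nothing

def normalize(fragment):
    """Lower-case, map O/N level variants to 'o level'/'n level', squeeze spaces."""
    text = fragment.lower().replace("’", "'")
    text = re.sub(r"\b([on])\s*'?\s*levels?\b", r"\1 level", text)
    return re.sub(r"\s+", " ", text).strip()

def has_keyword(text, keywords):
    """True when any keyword occurs as a whole word or phrase."""
    return any(re.search(rf"\b{re.escape(k)}\b", text) for k in keywords)

def remember_institute(text):
    """Store the full name and the name without generic institute words."""
    known_institutes.add(text)
    short = " ".join(w for w in text.split() if w not in INSTITUTE_KEYWORDS)
    if short:
        known_institutes.add(short)

def classify(fragment):
    """Label one fragment as 'qualification', 'institute' or 'unknown'."""
    text = normalize(fragment)
    if text in known_institutes:
        return "institute"
    if has_keyword(text, QUALIFICATION_KEYWORDS):
        return "qualification"
    if has_keyword(text, INSTITUTE_KEYWORDS):
        remember_institute(text)
        return "institute"
    unknown_fragments.append(fragment.strip())
    return "unknown"

def split_qualification_institute(value):
    """Split on commas and slashes, then sort each fragment into a bucket."""
    qualifications, institutes = [], []
    for fragment in re.split(r"[,/]", value):
        if not fragment.strip():
            continue
        kind = classify(fragment)
        if kind == "qualification":
            qualifications.append(fragment.strip())
        elif kind == "institute":
            institutes.append(fragment.strip())
    return ", ".join(qualifications) or None, ", ".join(institutes) or None
```

Because the memory set is filled as rows are processed, the result for a short name like `SHRI` depends on whether a longer form has been seen earlier. Reviewing `unknown_fragments` after a first pass shows which keywords are still missing.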


#### Highest Qualification

In [39]:
# Count unique values including nulls
unique_qualifications = student_profiles['HIGHEST QUALIFICATION'].value_counts(dropna=False).reset_index()

# Rename columns for clarity
unique_qualifications.columns = ['Highest Qualification', 'Count']

# Display the result
print("Unique values in 'HIGHEST QUALIFICATION' column (including nulls):")
display(unique_qualifications)


Unique values in 'HIGHEST QUALIFICATION' column (including nulls):


Unnamed: 0,Highest Qualification,Count
0,Degree,137
1,Certificate,87
2,Diploma,71
3,Master,11
4,,1


In [40]:
# Check for rows where 'HIGHEST QUALIFICATION' is null or empty
null_qualification_rows = student_profiles[student_profiles['HIGHEST QUALIFICATION'].isnull() | (student_profiles['HIGHEST QUALIFICATION'] == ' ') | (student_profiles['HIGHEST QUALIFICATION'] == '')]

print("Rows with null or empty 'HIGHEST QUALIFICATION':")
display(null_qualification_rows)

Rows with null or empty 'HIGHEST QUALIFICATION':


Unnamed: 0,STUDENT ID,GENDER,SG CITIZEN,SG PR,FOREIGNER,COUNTRY OF OTHER NATIONALITY,DOB,HIGHEST QUALIFICATION,NAME OF QUALIFICATION AND INSTITUTION,DATE ATTAINED HIGHEST QUALIFICATION,DESIGNATION,COMMENCEMENT DATE,COMPLETION DATE,FULL-TIME OR PART-TIME,COURSE FUNDING,REGISTRATION FEE,PAYMENT MODE,COURSE FEE,Comma Count
128,2102-063/013,F,,,Y,Vietnam,24/07/1989,,,2016-06-06,Accounts Executive,2022-04-18 00:00:00,2022-09-14 00:00:00,Part-Time,Individual,107,Nets,888,0


Since only 1 row has a null Highest Qualification, we will drop it.

#### DATE ATTAINED HIGHEST QUALIFICATION
----

In [41]:
# Convert 'DATE ATTAINED HIGHEST QUALIFICATION' to datetime objects
student_profiles['DATE ATTAINED HIGHEST QUALIFICATION'] = pd.to_datetime(student_profiles['DATE ATTAINED HIGHEST QUALIFICATION'], errors='coerce')

# Convert 'COMMENCEMENT DATE' to datetime objects
student_profiles['COMMENCEMENT DATE'] = pd.to_datetime(student_profiles['COMMENCEMENT DATE'], errors='coerce')


# Parse DOB for the comparison; at this point the column still holds the raw values
dob_parsed = pd.to_datetime(student_profiles['DOB'], errors='coerce', format='%d/%m/%Y')

# Check for unrealistic dates: date attained should be after DOB and no later than COMMENCEMENT DATE
unrealistic_qualification_dates = student_profiles[
    (student_profiles['DATE ATTAINED HIGHEST QUALIFICATION'].notna()) &
    (
        (student_profiles['DATE ATTAINED HIGHEST QUALIFICATION'] < dob_parsed) |
        (student_profiles['DATE ATTAINED HIGHEST QUALIFICATION'] > student_profiles['COMMENCEMENT DATE']) # Assuming qualification is attained before course start
    )
]

print("Rows with potentially unrealistic 'DATE ATTAINED HIGHEST QUALIFICATION':")
display(unrealistic_qualification_dates[['STUDENT ID', 'DOB', 'DATE ATTAINED HIGHEST QUALIFICATION', 'COMMENCEMENT DATE']])

# Check for missing values after conversion
missing_qualification_dates = student_profiles['DATE ATTAINED HIGHEST QUALIFICATION'].isna().sum()
print(f"\nNumber of missing values in 'DATE ATTAINED HIGHEST QUALIFICATION' after conversion: {missing_qualification_dates}")

Rows with potentially unrealistic 'DATE ATTAINED HIGHEST QUALIFICATION':


Unnamed: 0,STUDENT ID,DOB,DATE ATTAINED HIGHEST QUALIFICATION,COMMENCEMENT DATE



Number of missing values in 'DATE ATTAINED HIGHEST QUALIFICATION' after conversion: 0


In [42]:
# Convert 'DOB' to datetime objects, coercing errors
student_profiles['DOB'] = pd.to_datetime(student_profiles['DOB'], errors='coerce', format='%d/%m/%Y')

# Convert 'DATE ATTAINED HIGHEST QUALIFICATION' to datetime, if not already
student_profiles['DATE ATTAINED HIGHEST QUALIFICATION'] = pd.to_datetime(
    student_profiles['DATE ATTAINED HIGHEST QUALIFICATION'], errors='coerce', format='%d/%m/%Y')

# Calculate age in years at time of qualification
student_profiles['QUALIFICATION_AGE_YEARS'] = (
    (student_profiles['DATE ATTAINED HIGHEST QUALIFICATION'] - student_profiles['DOB']).dt.days / 365.25
).round(1)

# Sort the DataFrame by calculated age in ascending order
sorted_qualification_age = student_profiles.sort_values(by='QUALIFICATION_AGE_YEARS', ascending=True)

print("Age in years at the time of attaining highest qualification (sorted ascending):")
display(sorted_qualification_age[['STUDENT ID', 'DOB', 'NAME OF QUALIFICATION AND INSTITUTION', 'DATE ATTAINED HIGHEST QUALIFICATION', 'QUALIFICATION_AGE_YEARS']])

Age in years at the time of attaining highest qualification (sorted ascending):


Unnamed: 0,STUDENT ID,DOB,NAME OF QUALIFICATION AND INSTITUTION,DATE ATTAINED HIGHEST QUALIFICATION,QUALIFICATION_AGE_YEARS
111,2101-111/004,2010-09-28,"Higher Nitec in Hospitality Operations, ITE",2021-09-24,11.0
33,1101-012/001,1988-12-04,"Nitec in Service Skills (Office), ITE",2001-11-06,12.9
113,2101-111/006,1999-04-10,Higher Nitec in Business Studies (Service Mana...,2012-04-24,13.0
8,1101-009/009,1983-11-30,N' Level,2000-05-07,16.4
12,1101-010/002,1973-04-20,O' level,1990-02-09,16.8
...,...,...,...,...,...
195,2102-069/001,1961-04-01,"O level, Private Secretary Certificate, LCCI",2019-02-28,57.9
105,2101-110/006,NaT,Bachelor of Commerce (Management and Marketing...,2007-05-16,
106,2101-110/007,NaT,"Bachelor of Science in Business, SUSS",2010-01-16,
107,2101-110/008,NaT,"Diploma in Integrated Events Management, Repub...",2018-09-16,


Looking at this data, some of these qualification ages are implausible: for example, row 111 shows a Higher Nitec attained at age 11.0 and row 33 a Nitec at age 12.9. Therefore, we will drop the following rows:
`[ 111, 33, 113, 38, 267]`
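
A rule-based check could make this kind of selection repeatable. The sketch below flags rows where the age at qualification is below a minimum; the threshold of 15 is an assumption chosen for illustration, not a value taken from the dataset, and it will not necessarily reproduce the list above.

```python
from datetime import date

MIN_PLAUSIBLE_AGE = 15  # assumed threshold, not taken from the dataset

def age_at(dob, attained):
    """Completed years between two dates, counted birthday to birthday."""
    before_birthday = (attained.month, attained.day) < (dob.month, dob.day)
    return attained.year - dob.year - before_birthday

def is_implausible(dob, attained, min_age=MIN_PLAUSIBLE_AGE):
    """True when the qualification was attained before `min_age`."""
    return age_at(dob, attained) < min_age
```

For row 111 (DOB 2010-09-28, attained 2021-09-24) this gives 10 completed years, so the row is flagged. Row 8 (DOB 1983-11-30, attained 2000-05-07) gives 16 and is kept.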

#### Designation
---

In [43]:
# Check unique values and their counts in the 'DESIGNATION' column
designation_counts = student_profiles['DESIGNATION'].value_counts().reset_index()
designation_counts.columns = ['Designation', 'Count']

print("Unique values and counts in 'DESIGNATION' column:")
# Display top 50 designations if there are many unique values
if len(designation_counts) > 50:
    display(designation_counts.head(50))
    print(f"\n... and {len(designation_counts) - 50} more unique designations.")
else:
    display(designation_counts)

Unique values and counts in 'DESIGNATION' column:


Unnamed: 0,Designation,Count
0,-,35
1,Admin Assistant,14
2,HR Executive,13
3,Admin Executive,8
4,Manager,8
5,HR Manager,8
6,Secretary,6
7,HR Assistant,6
8,Business Development Executive,4
9,Senior HR Executive,4



... and 114 more unique designations.


In [44]:
from collections import Counter
import re

# Combine all designations into a single string, handling potential NaNs
all_designations = ' '.join(student_profiles['DESIGNATION'].dropna().astype(str).str.lower())

# Split the string into words, using regex to find word characters
words = re.findall(r'\b\w+\b', all_designations)

# Count the frequency of each word
word_counts = Counter(words)

# Convert to a DataFrame for easier display
word_counts_df = pd.DataFrame.from_dict(word_counts, orient='index', columns=['Count']).reset_index()
word_counts_df = word_counts_df.rename(columns={'index': 'Word'})

# Sort by count in descending order
word_counts_df = word_counts_df.sort_values(by='Count', ascending=False)

print("Count of each word in 'DESIGNATION' column:")
display(word_counts_df.head(50)) # Displaying the top 50 most frequent words

Count of each word in 'DESIGNATION' column:


Unnamed: 0,Word,Count
6,executive,81
1,hr,65
0,admin,57
5,manager,47
2,assistant,41
19,senior,21
16,human,14
21,officer,14
17,resource,11
8,administrator,9


In [45]:
# Function to categorize designations based on keywords
def categorize_designation(designation):
    if pd.isna(designation) or str(designation).strip() in ['', '-', 'N.A.']:
        return 'Unknown'
    designation = str(designation).lower()
    if 'manager' in designation:
        return 'Manager'
    elif 'executive' in designation:
        return 'Executive'
    elif 'assistant' in designation:
        return 'Assistant'
    elif 'officer' in designation:
        return 'Officer'
    elif 'specialist' in designation:
        return 'Specialist'
    elif 'consultant' in designation:
        return 'Consultant'
    elif 'coordinator' in designation:
        return 'Coordinator'
    elif 'head' in designation:
        return 'Head'
    elif 'director' in designation:
        return 'Director'
    elif 'analyst' in designation:
        return 'Analyst'
    elif 'administrator' in designation:
        return 'Administrator'
    elif 'clerk' in designation:
        return 'Clerk'
    elif 'teacher' in designation or 'lecturer' in designation:
        return 'Educator'
    elif 'audit' in designation:
        return 'Audit'
    elif 'finance' in designation or 'accountant' in designation:
        return 'Finance/Accounting'
    elif 'marketing' in designation or 'business development' in designation:
        return 'Marketing/BD'
    elif 'recruitment' in designation or 'hr' in designation or 'human resource' in designation:
        return 'HR/Recruitment'
    elif 'it' in designation or 'developer' in designation:
        return 'IT/Tech'
    elif 'operation' in designation:
        return 'Operations'
    # Add more categories as needed
    else:
        return 'Other'

# Apply the categorization function
student_profiles['DESIGNATION_CATEGORY'] = student_profiles['DESIGNATION'].apply(categorize_designation)

# Check the counts of the new categories
designation_category_counts = student_profiles['DESIGNATION_CATEGORY'].value_counts().reset_index()
designation_category_counts.columns = ['Designation Category', 'Count']
print("\nCounts of Designation Categories:")
display(designation_category_counts)


Counts of Designation Categories:


Unnamed: 0,Designation Category,Count
0,Executive,81
1,Manager,47
2,Unknown,38
3,Other,36
4,Assistant,32
5,Officer,14
6,Administrator,9
7,HR/Recruitment,9
8,IT/Tech,7
9,Specialist,6
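
One caveat about `categorize_designation`: checks such as `'it' in designation` and `'hr' in designation` are substring tests, so `'it'` also matches inside words like "recruiter", and the IT/Tech count may be inflated. A whole-word match avoids this. The helper below is a sketch with a hypothetical name, `contains_word`.

```python
import re

def contains_word(text, word):
    """True when `word` appears as a whole word, ignoring case."""
    pattern = rf"\b{re.escape(word)}\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None
```

With this helper, "Regional Recruiter" no longer matches `it`, while "IT Manager" still does. The order of the `elif` branches matters as well: a title like "HR Manager" is assigned to Manager before the HR check is reached.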


#### Commencement Date & Completion Date
---

In [46]:
# Convert 'COMMENCEMENT DATE' and 'COMPLETION DATE' to datetime objects
student_profiles['COMMENCEMENT DATE'] = pd.to_datetime(student_profiles['COMMENCEMENT DATE'], errors='coerce')
student_profiles['COMPLETION DATE'] = pd.to_datetime(student_profiles['COMPLETION DATE'], errors='coerce')

# Add 'COURSE_COMPLETED' column
student_profiles['COURSE_COMPLETED'] = student_profiles['COMPLETION DATE'].notna()

# Display the relevant columns to verify
display(student_profiles[['STUDENT ID', 'COMMENCEMENT DATE', 'COMPLETION DATE', 'COURSE_COMPLETED']])

Unnamed: 0,STUDENT ID,COMMENCEMENT DATE,COMPLETION DATE,COURSE_COMPLETED
0,1101-009/001,2022-04-18,2023-09-17,True
1,1101-009/002,2022-04-18,2023-09-17,True
2,1101-009/003,2022-04-18,2023-09-17,True
3,1101-009/004,2022-04-18,2023-09-17,True
4,1101-009/005,2022-04-18,2023-09-17,True
...,...,...,...,...
302,5113-009/003,2025-04-14,NaT,False
303,5113-009/004,2025-04-14,NaT,False
304,5113-009/005,2025-04-14,NaT,False
305,5113-009/006,2025-04-14,NaT,False


In [47]:
# Check rows where 'COMPLETION DATE' is null (NaT)
missing_completion_date_rows = student_profiles[student_profiles['COMPLETION DATE'].isna()]

print("Rows with null 'COMPLETION DATE':")
display(missing_completion_date_rows)

Rows with null 'COMPLETION DATE':


Unnamed: 0,STUDENT ID,GENDER,SG CITIZEN,SG PR,FOREIGNER,COUNTRY OF OTHER NATIONALITY,DOB,HIGHEST QUALIFICATION,NAME OF QUALIFICATION AND INSTITUTION,DATE ATTAINED HIGHEST QUALIFICATION,...,COMPLETION DATE,FULL-TIME OR PART-TIME,COURSE FUNDING,REGISTRATION FEE,PAYMENT MODE,COURSE FEE,Comma Count,QUALIFICATION_AGE_YEARS,DESIGNATION_CATEGORY,COURSE_COMPLETED
5,1101-009/006,F,,Y,,Malaysia,1968-10-17,Certificate,"Lower Secondary Education, Malaysia",1995-05-07,...,NaT,Part-Time,Individual - SFC,107,GIRO,3636,1,26.6,Administrator,False
30,1101-011/008,F,Y,,,,1993-12-05,Diploma,"Diploma in Environmental Science, Republic Pol...",2016-11-06,...,NaT,Part-Time,Individual,107,GIRO,5136,1,22.9,Executive,False
108,2101-111/001,F,,Y,,Malaysian,1995-07-08,Degree,"Bachelor of Science (HRD), Universiti Teknolog...",2020-04-04,...,NaT,Part Time,Indivodual,107,PayNow,2996,1,24.7,Manager,False
109,2101-111/002,F,Y,,,,1997-09-08,Certificate,Certificate in Grammar & Writing Intermediate ...,2021-10-04,...,NaT,Part Time,Individual,107,PayNow,2996,1,24.1,Assistant,False
110,2101-111/003,F,Y,,,,1999-06-19,Diploma,"Diploma in Mechatronic Engineering, Nee Ann Po...",2020-12-24,...,NaT,Part Time,Sponsored,107,PayNow,2696,1,21.5,Manager,False
111,2101-111/004,F,Y,,,,2010-09-28,Certificate,"Higher Nitec in Hospitality Operations, ITE",2021-09-24,...,NaT,Part Time,Individual,107,PayNow,2996,1,11.0,Executive,False
112,2101-111/005,F,Y,,,,1990-10-19,Certificate,O' level,2010-07-24,...,NaT,Part Time,Individual - waived App Fee,Waived,Waived,2596,0,19.8,Executive,False
113,2101-111/006,F,Y,,,,1999-04-10,Certificate,Higher Nitec in Business Studies (Service Mana...,2012-04-24,...,NaT,Part Time,Sponsored,107,PayNow,2696,1,13.0,Assistant,False
114,2101-111/007,F,Y,,,,2000-05-24,Certificate,O' levels,2020-04-24,...,NaT,Part Time,Individual,107,PayNow,2996,0,19.9,Other,False
115,2101-111/008,F,Y,,,,2001-04-12,Certificate,O' levels,2022-05-24,...,NaT,Part Time,Individual,107,PayNow,2996,0,21.1,Executive,False


In [48]:
# Find rows with invalid commencement dates (NaT after conversion)
invalid_commencement_dates_rows = student_profiles[student_profiles['COMMENCEMENT DATE'].isna()]

# Get the student IDs from rows with invalid commencement dates
student_ids_with_invalid_commencement = invalid_commencement_dates_rows['STUDENT ID']

# Check which of these student IDs exist in the semester_results DataFrame
students_in_semester_results_bool = student_ids_with_invalid_commencement.isin(semester_results['STUDENT ID'])

# Count students with invalid commencement dates who are in semester_results
count_in_semester_results = students_in_semester_results_bool.sum()

# Count students with invalid commencement dates who are not in semester_results
count_not_in_semester_results = (~students_in_semester_results_bool).sum()

print(f"Number of students with invalid commencement dates found in semester_results: {count_in_semester_results}")
print(f"Number of students with invalid commencement dates not found in semester_results: {count_not_in_semester_results}")

# Display the rows from semester_results for students with invalid commencement dates (as done before)
students_in_semester_results_df = semester_results[semester_results['STUDENT ID'].isin(student_ids_with_invalid_commencement)]
print("\nRows from semester_results for students with invalid commencement dates:")
display(students_in_semester_results_df)

Number of students with invalid commencement dates found in semester_results: 5
Number of students with invalid commencement dates not found in semester_results: 0

Rows from semester_results for students with invalid commencement dates:


Unnamed: 0,STUDENT ID,PERIOD,GPA
15,1101-009/006,Sem 1,2.1
16,1101-009/006,Sem 2,2.2
17,1101-009/006,Sem 3,2.3
174,5113-005/005,Sem 1,3.1
175,5113-005/005,Sem 2,3.5
201,1101-011/008,Sem 1,3.3
202,1101-011/008,Sem 2,3.8
203,1101-011/008,Sem 3,3.4
346,2102-064/013,Sem1,3.6
496,2102-069/010,Sem 1,3.5


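All five students with an invalid commencement date also appear in `semester_results`, so their rows are worth keeping. The missing commencement and completion dates can be filled in from other rows that share the same `COURSE_ID` and `INTAKE_NO`.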

#### FULL-TIME OR PART-TIME
----

In [49]:
# Check unique values in the 'FULL-TIME OR PART-TIME' column
unique_course_type = student_profiles['FULL-TIME OR PART-TIME'].value_counts().reset_index()
unique_course_type.columns = ['Course Type', 'Count']
print("Unique values in 'FULL-TIME OR PART-TIME' column:")
display(unique_course_type)

Unique values in 'FULL-TIME OR PART-TIME' column:


Unnamed: 0,Course Type,Count
0,Part-Time,237
1,Full-Time,41
2,Part Time,29


Based on the counts, `Part-Time` and `Part Time` are the same category written two ways, so we can merge them by replacing the space with a hyphen.
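
A minimal sketch of that merge, using a toy Series in place of `student_profiles['FULL-TIME OR PART-TIME']`. The name `course_type` is only for illustration, and the extra `str.strip()` is an assumption in case some values also carry stray spaces.

```python
import pandas as pd

# Toy stand-in for the column, using the three spellings seen in the counts above
course_type = pd.Series(['Part-Time', 'Full-Time', 'Part Time', 'Part Time'])

# Trim surrounding whitespace, then turn any run of inner spaces into one hyphen
course_type_clean = course_type.str.strip().str.replace(r'\s+', '-', regex=True)

print(course_type_clean.value_counts())
```

Values that already contain a hyphen are left untouched, so the same line can be re-run safely.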

#### COURSE FUNDING
----

In [50]:
# Check unique values and their counts in the 'COURSE FUNDING' column
course_funding_counts = student_profiles['COURSE FUNDING'].value_counts(dropna=False).reset_index()
course_funding_counts.columns = ['Course Funding', 'Count']

print("Unique values and counts in 'COURSE FUNDING' column:")
display(course_funding_counts)

Unique values and counts in 'COURSE FUNDING' column:


Unnamed: 0,Course Funding,Count
0,Individual,154
1,Individual - SFC,64
2,Sponsored,33
3,Sponsored - no SDF,11
4,Individual,6
5,Sponsored,6
6,Individual-SFC,5
7,Individual - waived App Fee,5
8,Individual - SFC + $1000 SCHOLARSHIP,4
9,Individual,4


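Looking at the unique values, the data has not been standardized yet. Some values appear more than once in the counts (for example `Individual` and `Sponsored`), which suggests they differ only by extra spaces.

The table below shows how the values will be standardized: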


| **Original Values**                                   | **Cleaned / Grouped As**         |
| ----------------------------------------------------- | -------------------------------- |
| Individual<br>Individual  <br>Indivodual<br>Indvidual | `Individual`                     |
| Individual - SFC<br>Individual-SFC<br>Indvidual - SFC  | `Individual - SFC`               |
| Individual - waived App Fee                           | `Individual - Waived App Fee`    |
| Individual - SFC + \$1000 SCHOLARSHIP                 | `Individual - SFC + Scholarship` |
| Sponsored<br>Sponsored  <br>Sponsored - no SDF  <br>Sponsored-no SDF       | `Sponsored`                      |
| Sponsored - SDF                                       | `Sponsored - SDF`                |




#### REGISTRATION FEE
----

In [51]:
# Check unique values and their counts in the 'REGISTRATION FEE' column
registration_fee_counts = student_profiles['REGISTRATION FEE'].value_counts(dropna=False).reset_index()
registration_fee_counts.columns = ['Registration Fee', 'Count']

print("Unique values and counts in 'REGISTRATION FEE' column:")
display(registration_fee_counts)

Unique values and counts in 'REGISTRATION FEE' column:


Unnamed: 0,Registration Fee,Count
0,107,300
1,Waived,5
2,107\n107,2


The value `107\n107` appears to be a data-entry error, so we will change it to `107`.
As for `Waived`, we will change it to `0`.

#### PAYMENT MODE
----

In [52]:
# 15. Payment Mode
# Check unique values and their counts in the 'PAYMENT MODE' column
payment_mode_counts = student_profiles['PAYMENT MODE'].value_counts().reset_index()
payment_mode_counts.columns = ['Payment Mode', 'Count']

print("Unique values and counts in 'PAYMENT MODE' column:")
display(payment_mode_counts)

Unique values and counts in 'PAYMENT MODE' column:


Unnamed: 0,Payment Mode,Count
0,Nets,207
1,Giro,28
2,PayNow,25
3,NETS,23
4,GIRO,17
5,Waived,4
6,Bank,1
7,Cr Card,1
8,CC JPM,1


Looking at the unique values, we will need to standardize them:

| **Original Values** | **Cleaned / Grouped As** |
| ------------------- | ------------------------ |
| Nets                | NETS                     |
| NETS                | NETS                     |
| Giro                | Giro                     |
| GIRO                | Giro                     |
| PayNow              | PayNow                   |
| Cr Card             | Credit Card              |
| Waived              | Waived                   |
| Bank                | Bank                     |
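
The table above can be sketched as a case-insensitive lookup. This is only an illustration on a toy Series: `PAYMENT_MODE_MAP` and `payment_mode` are hypothetical names, and `CC JPM` is deliberately left unmapped because the table does not say how to group it.

```python
import pandas as pd

# Toy stand-in for student_profiles['PAYMENT MODE'], one entry per observed spelling
payment_mode = pd.Series(
    ['Nets', 'NETS', 'Giro', 'GIRO', 'PayNow', 'Cr Card', 'Waived', 'Bank', 'CC JPM']
)

# Keys are lower-cased so 'Nets'/'NETS' and 'Giro'/'GIRO' collapse into one entry each
PAYMENT_MODE_MAP = {
    'nets': 'NETS',
    'giro': 'Giro',
    'paynow': 'PayNow',
    'cr card': 'Credit Card',
    'waived': 'Waived',
    'bank': 'Bank',
}

lookup_key = payment_mode.str.strip().str.lower()
payment_mode_clean = lookup_key.map(PAYMENT_MODE_MAP)

# Values without a mapping are collected for manual review and kept as they were
unmapped = payment_mode[payment_mode_clean.isna()].unique().tolist()
payment_mode_clean = payment_mode_clean.fillna(payment_mode)

print(payment_mode_clean.tolist())
print("Unmapped:", unmapped)
```

Listing the unmapped values makes it harder for a new spelling to slip through unnoticed.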


#### COURSE FEE
---

In [53]:
student_profiles['COURSE FEE']

Unnamed: 0,COURSE FEE
0,5136
1,5136
2,5136
3,5136
4,4812
...,...
302,5803
303,5803
304,5803
305,5803


## Data Wrangling for Student Profiles
---

### Data Cleaning Steps for Student Profile

#### 1. Student ID Decomposition

* Split `STUDENT_ID` into three new columns:

  * `COURSE_ID`
  * `INTAKE_NO`
  * `INDEX_NO`
* Format: `<COURSE_ID>-<INTAKE_NO>/<INDEX_NO>`


#### 2. Gender

* Already clean: only two unique values (`M` and `F`)

#### 3. Residential Status

* Combine `SG CITIZEN`, `SG PR`, and `FOREIGNER` into a single column: `RESIDENTIAL_STATUS`.

  * Logic: Only one of the three columns contains "Y"; others are null.
  * New values: `Singapore Citizen`, `PR`, or `Foreigner`.

#### 4. Nationality

* Rename `COUNTRY OF OTHER NATIONALITY` to `NATIONALITY`.
* Standardize values:
  * Correct spelling errors (e.g., `Malaysian` → `Malaysia`)
* Replace null or blank values with: `Singapore`.

#### 5. Date of Birth (`DOB`)

* Values are not in datetime format
* Some rows use abbreviated month names instead of month numbers
* Remove rows where `DOB` is null

#### 6. Highest Qualification

* Drop rows where value is `" "`

#### 7. Name of Qualification and Institution

* Split `NAME OF QUALIFICATION AND INSTITUTION` into two parts:

  * `QUALIFICATION_NAME`
  * `INSTITUTION_NAME`

* Based on `QUALIFICATION_NAME`, create a new column `FIELD_NAME`, by identifying keywords




#### 8. Date Attained Highest Qualification

* Check whether the date is realistic compared to `DOB` and `COMMENCEMENT DATE`
* Remove the rows at index `[111, 33, 113, 38, 267]`
  * These students would have been too young to attain the stated qualification.

#### 9. Designation

* Keyword mapping to reduce the number of unique values

#### 10. Commence Date & Completion Date

* Convert to datetime format
* Fill blank values using rows with the same `COURSE_ID` and `INTAKE_NO`

#### 11. Full-time or Part-time

* Standardize values:
  * Correct inconsistent spellings (e.g., `Part Time` → `Part-Time`).

#### 12. Course Funding Type

* Standardize values to one of the following:

  * `Individual`
  * `Individual - SFC`
  * `Individual - SFC + Scholarship`
  * `Individual - Waived App Fee`
  * `Sponsored`
  * `Sponsored - SDF`
  
* Fix spelling errors and remove leading/trailing spaces

#### 13. Registration Fee

* Convert to Float type
* Standardize values:
  * Correct spelling errors (e.g. `107\n107` → `107`)
  * Convert `Waived` to `0`

#### 14. Payment Mode

* Standardize values:
  * `NETS`
  * `Giro`
  * `PayNow`
  * `Credit Card`
  * `Waived`
  * `Bank`

#### 15. Course Fee

* Convert to Float type

#### 16. cGPA calculation

* Add a new column `cGPA` for each student, based on `semester_results`

#### 17. Hiatus Duration

* Add a new column with the number of years between a student attaining their highest qualification and starting the new course
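
Steps 16 and 17 could look roughly like the sketch below. It rests on two assumptions that the notebook does not confirm: that cGPA is the unweighted mean of a student's semester GPAs, and that the hiatus runs from `DATE ATTAINED HIGHEST QUALIFICATION` to `COMMENCEMENT DATE`. The column name `HIATUS_YEARS` is hypothetical, and the dates in the toy `profiles` frame are made up.

```python
import pandas as pd

# GPA rows copied from the semester_results output shown earlier
semester_results = pd.DataFrame({
    'STUDENT ID': ['1101-009/006'] * 3 + ['5113-005/005'] * 2,
    'PERIOD': ['Sem 1', 'Sem 2', 'Sem 3', 'Sem 1', 'Sem 2'],
    'GPA': [2.1, 2.2, 2.3, 3.1, 3.5],
})

# Toy profile rows: the dates here are made up for illustration only
profiles = pd.DataFrame({
    'STUDENT ID': ['1101-009/006', '5113-005/005'],
    'DATE ATTAINED HIGHEST QUALIFICATION': pd.to_datetime(['2018-01-08', '2020-06-30']),
    'COMMENCEMENT DATE': pd.to_datetime(['2022-04-18', '2023-01-09']),
})

# Assumption: cGPA is the unweighted mean of the semester GPAs
cgpa = (
    semester_results.groupby('STUDENT ID', as_index=False)['GPA']
    .mean()
    .rename(columns={'GPA': 'cGPA'})
)
profiles = profiles.merge(cgpa, on='STUDENT ID', how='left')
profiles['cGPA'] = profiles['cGPA'].round(2)

# Assumption: hiatus = years between attaining the qualification and commencing
gap = profiles['COMMENCEMENT DATE'] - profiles['DATE ATTAINED HIGHEST QUALIFICATION']
profiles['HIATUS_YEARS'] = (gap.dt.days / 365.25).round(1)

print(profiles[['STUDENT ID', 'cGPA', 'HIATUS_YEARS']])
```

The left merge keeps students who have no semester results yet, leaving their `cGPA` empty rather than dropping them.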

---


### 1. Student ID Decomposition

In [54]:

student_profiles[['COURSE_ID', 'INTAKE_NO_INDEX']] = student_profiles['STUDENT ID'].str.split('-', n=1, expand=True)
student_profiles[['INTAKE_NO', 'INDEX_NO']] = student_profiles['INTAKE_NO_INDEX'].str.split('/', n=1, expand=True)
student_profiles = student_profiles.drop(columns=['INTAKE_NO_INDEX'])

print("Student ID decomposed:")
display(student_profiles[['STUDENT ID', 'COURSE_ID', 'INTAKE_NO', 'INDEX_NO']].head())

Student ID decomposed:


Unnamed: 0,STUDENT ID,COURSE_ID,INTAKE_NO,INDEX_NO
0,1101-009/001,1101,9,1
1,1101-009/002,1101,9,2
2,1101-009/003,1101,9,3
3,1101-009/004,1101,9,4
4,1101-009/005,1101,9,5
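
As an alternative sketch, the same decomposition can be done in one pass with a regular expression, which also exposes IDs that do not follow the `<COURSE_ID>-<INTAKE_NO>/<INDEX_NO>` format. The `'bad id'` entry is a made-up value added to show the failure case.

```python
import pandas as pd

# Three IDs from the data plus one made-up malformed value
ids = pd.Series(['1101-009/001', '5113-009/006', '2101-111/003', 'bad id'], name='STUDENT ID')

# Named groups become column names; rows that do not match come back as NaN
pattern = r'^(?P<COURSE_ID>\d+)-(?P<INTAKE_NO>\d+)/(?P<INDEX_NO>\d+)$'
parts = ids.str.extract(pattern)

# IDs that failed to match the expected format
malformed = ids[parts['COURSE_ID'].isna()]

print(parts)
print("Malformed IDs:", malformed.tolist())
```

The extracted parts stay as strings, so leading zeros such as `009` are preserved.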


### 2. Gender

This column is already cleaned.

### 3. Residential Status

In [55]:

# Define a function to determine residential status
def get_residential_status(row):
    if row['SG CITIZEN'] == 'Y':
        return 'Singapore Citizen'
    elif row['SG PR'] == 'Y':
        return 'PR'
    elif row['FOREIGNER'] == 'Y':
        return 'Foreigner'
    else:
        return None # Should not happen based on previous check, but good practice

# Apply the function to create the new 'RESIDENTIAL_STATUS' column
student_profiles['RESIDENTIAL_STATUS'] = student_profiles.apply(get_residential_status, axis=1)

# Drop the original columns
student_profiles = student_profiles.drop(columns=['SG CITIZEN', 'SG PR', 'FOREIGNER'])

print("Residential Status column created and original columns dropped:")
display(student_profiles[['STUDENT ID', 'RESIDENTIAL_STATUS']].head())

Residential Status column created and original columns dropped:


Unnamed: 0,STUDENT ID,RESIDENTIAL_STATUS
0,1101-009/001,Foreigner
1,1101-009/002,Singapore Citizen
2,1101-009/003,Foreigner
3,1101-009/004,Foreigner
4,1101-009/005,Singapore Citizen
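
Before dropping the three source columns, it may be worth confirming that each row really has exactly one `Y`. The sketch below does that check on a toy frame and derives the status without a row-wise `apply`. The `labels` dictionary and the all-blank fourth row are assumptions for illustration.

```python
import pandas as pd

# Toy frame: the last row has no flag set, to show how a violation is caught
flags = pd.DataFrame({
    'SG CITIZEN': ['Y', None, None, None],
    'SG PR': [None, 'Y', None, None],
    'FOREIGNER': [None, None, 'Y', None],
})

is_yes = flags.eq('Y')

# Rows that break the "exactly one Y" rule
violations = flags[is_yes.sum(axis=1) != 1]

# idxmax picks the first flagged column, in the same order as the if/elif chain
labels = {'SG CITIZEN': 'Singapore Citizen', 'SG PR': 'PR', 'FOREIGNER': 'Foreigner'}
status = is_yes.idxmax(axis=1).map(labels).where(is_yes.any(axis=1))

print(status.tolist())
print("Rows violating the rule:", violations.index.tolist())
```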


### 4. Nationality

In [56]:
# Rename the column
student_profiles = student_profiles.rename(columns={'COUNTRY OF OTHER NATIONALITY': 'NATIONALITY'})

# Standardize values (correcting 'Malaysian' to 'Malaysia')
student_profiles['NATIONALITY'] = student_profiles['NATIONALITY'].replace('Malaysian', 'Malaysia')

# Replace null or blank values with 'Singapore'
student_profiles['NATIONALITY'] = student_profiles['NATIONALITY'].replace(r'^\s*$', 'Singapore', regex=True) # Handle blank strings
student_profiles['NATIONALITY'] = student_profiles['NATIONALITY'].fillna('Singapore') # Handle actual NaN values

print("Nationality column cleaned:")
display(student_profiles[['STUDENT ID', 'NATIONALITY']].head())

# Verify the changes
print("\nUnique values in 'NATIONALITY' after cleaning:")
display(student_profiles['NATIONALITY'].value_counts().reset_index())

Nationality column cleaned:


Unnamed: 0,STUDENT ID,NATIONALITY
0,1101-009/001,Malaysia
1,1101-009/002,Singapore
2,1101-009/003,India
3,1101-009/004,Netherlands
4,1101-009/005,Singapore



Unique values in 'NATIONALITY' after cleaning:


Unnamed: 0,NATIONALITY,count
0,Singapore,239
1,Malaysia,28
2,China,14
3,India,12
4,Philippines,8
5,Myanmar,3
6,Netherlands,1
7,Vietnam,1
8,Indonesia,1


### 5. Date of Birth (DOB)

In [57]:

# Convert 'DOB' to datetime, reading ambiguous dates as day-first; unparseable values become NaT
student_profiles['DOB'] = pd.to_datetime(student_profiles['DOB'], errors='coerce', dayfirst=True)

# Remove rows where 'DOB' is null (NaT)
student_profiles_cleaned = student_profiles.dropna(subset=['DOB']).copy()

print("\nRows with invalid 'DOB' values removed.")
print(f"Number of rows before dropping: {len(student_profiles)}")
print(f"Number of rows after dropping: {len(student_profiles_cleaned)}")

# Update the student_profiles DataFrame
student_profiles = student_profiles_cleaned

# Check for values that couldn't be parsed (should be empty now)
invalid_dob_after_drop = student_profiles[student_profiles['DOB'].isna()]

print("\nRows with invalid 'DOB' values after dropping (should be empty):")
display(invalid_dob_after_drop[['STUDENT ID', 'DOB']])


Rows with invalid 'DOB' values removed.
Number of rows before dropping: 307
Number of rows after dropping: 303

Rows with invalid 'DOB' values after dropping (should be empty):


Unnamed: 0,STUDENT ID,DOB
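
Since the `DOB` column mixes numeric and abbreviated-month dates, one way to see which rows are truly unparseable before dropping them is to try a short list of explicit formats in turn. This is a sketch under assumptions: the raw strings below are made up, and `CANDIDATE_FORMATS` is a hypothetical list that would need to be adjusted to the formats actually present.

```python
import pandas as pd

# Made-up raw strings covering a numeric date, a month abbreviation, ISO, and junk
dob_raw = pd.Series(['13/09/1981', '26-Jul-1979', '1990-02-01', 'unknown'])

# Hypothetical list of formats; extend it after inspecting the rows that fail
CANDIDATE_FORMATS = ['%d/%m/%Y', '%d-%b-%Y', '%Y-%m-%d']

# Parse once per format, then keep the first successful parse for each row
attempts = [pd.to_datetime(dob_raw, format=fmt, errors='coerce') for fmt in CANDIDATE_FORMATS]
dob = attempts[0]
for attempt in attempts[1:]:
    dob = dob.fillna(attempt)

# Only rows that matched none of the formats remain unparsed
unparsed = dob_raw[dob.isna()]

print(dob)
print("Could not parse:", unparsed.tolist())
```

Reviewing `unparsed` first shows whether the rows about to be dropped are truly invalid or only written in a format not yet listed.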


### 6. Highest Qualification

In [58]:

# Drop rows where 'HIGHEST QUALIFICATION' is a blank string
student_profiles_cleaned = student_profiles[student_profiles['HIGHEST QUALIFICATION'].str.strip() != ''].copy()

print("Student profiles after dropping rows with blank 'Highest Qualification':")
display(student_profiles_cleaned.head())

print(f"\nNumber of rows before dropping: {len(student_profiles)}")
print(f"Number of rows after dropping: {len(student_profiles_cleaned)}")

# Update the student_profiles DataFrame
student_profiles = student_profiles_cleaned

Student profiles after dropping rows with blank 'Highest Qualification':


Unnamed: 0,STUDENT ID,GENDER,NATIONALITY,DOB,HIGHEST QUALIFICATION,NAME OF QUALIFICATION AND INSTITUTION,DATE ATTAINED HIGHEST QUALIFICATION,DESIGNATION,COMMENCEMENT DATE,COMPLETION DATE,...,PAYMENT MODE,COURSE FEE,Comma Count,QUALIFICATION_AGE_YEARS,DESIGNATION_CATEGORY,COURSE_COMPLETED,COURSE_ID,INTAKE_NO,INDEX_NO,RESIDENTIAL_STATUS
0,1101-009/001,F,Malaysia,1981-09-13,Certificate,SPM,2018-01-08,Admin & HR Assistant,2022-04-18,2023-09-17,...,GIRO,5136,0,36.3,Assistant,True,1101,9,1,Foreigner
1,1101-009/002,F,Singapore,1979-07-26,Certificate,"Certificate in Office Skills, ITE",2016-06-08,Admin Assistant,2022-04-18,2023-09-17,...,NETS,5136,1,36.9,Assistant,True,1101,9,2,Singapore Citizen
2,1101-009/003,F,India,1990-02-01,Degree,"Bachelor of Business Administration, Universit...",2015-08-08,-,2022-04-18,2023-09-17,...,NETS,5136,2,25.5,Unknown,True,1101,9,3,Foreigner
3,1101-009/004,F,Netherlands,1976-04-20,Diploma,"Office Management Diploma, NCOI Rotterdam, The...",2018-02-08,HR Support / Office Manager,2022-04-18,2023-09-17,...,NETS,5136,2,41.8,Manager,True,1101,9,4,Foreigner
4,1101-009/005,F,Singapore,1983-11-25,Diploma,"Diploma in Business Admininstration, LCCI Leve...",2015-06-08,"Executive, Administration",2022-04-18,2023-09-17,...,GIRO,4812,3,31.5,Executive,True,1101,9,5,Singapore Citizen



Number of rows before dropping: 303
Number of rows after dropping: 302


### 7. Name of Qualification and Institution

#### Splitting qualification and institution

In [59]:
import re
import pandas as pd

# Shared memory sets to enforce consistency across rows
known_institutes = set()
known_qualifications = set()

# Global collector for unknown parts (deduplicated)
all_unknown_parts = set()

# Canonicalization helpers
def normalize_text(s: str) -> str:
    if not isinstance(s, str):
        return s
    s = s.replace("’", "'").replace("‘", "'").replace("“", '"').replace("”", '"')
    s = re.sub(r"\b([oOnNaA])\s*'\s*level\b", lambda m: f"{m.group(1)}'level", s, flags=re.IGNORECASE)
    s = re.sub(r"\bo['’]?level\b", "O Level", s, flags=re.IGNORECASE)
    s = re.sub(r"\ba['’]?level\b", "A Level", s, flags=re.IGNORECASE)
    s = re.sub(r"\bn['’]?level\b", "N Level", s, flags=re.IGNORECASE)
    s = re.sub(r"\badvanced\s+diploma\b", "Advanced Diploma", s, flags=re.IGNORECASE)
    s = re.sub(r"\s+", " ", s).strip()
    return s

INSTITUTE_KEYWORDS = [
    'university', 'universitas', 'universiti', 'college', 'lyceum', 'polytechnic', 'academy', 'institute', 'school', 'poly',
    'ite', 'psb', 'mdis', 'acca', 'shri', 'ncoi', 'nus', 'ntu', 'kaplan', 'centre', 'ibmec', 'board', 'resource', 'mmu'
]

QUALIFICATION_KEYWORDS = [
    'bachelor', 'master', 'diploma', 'certificate', 'degree', 'phd', 'doctorate',
    'advanced diploma', 'o level', 'a level', 'n level', 'ntc', 'nitec', 'wsq',
    'foundation', 'certified', 'advanced', 'higher', 'spm', 'wpln', 'cats',
    'acta', 'lcci', 'cpa', 'chrm', 'private secretary', 'accounting technicians', 'engineering', 'networking', 'hospitality'
]

def contains_keyword(part: str, keywords):
    part_lower = part.lower()
    return any(keyword.lower() in part_lower for keyword in keywords)

def is_in_known_sets(part: str):
    key = part.strip().upper()
    if key in known_institutes:
        return 'institute'
    if key in known_qualifications:
        return 'qualification'
    return None

def is_institute_candidate(part: str):
    plower = part.lower().strip()
    return contains_keyword(plower, INSTITUTE_KEYWORDS) or ('offered' in plower and part.isupper())

def is_qualification_candidate(part: str):
    plower = part.lower().strip()
    return contains_keyword(plower, QUALIFICATION_KEYWORDS) or (part.isupper() and len(part.strip()) <= 10)

def classify_part(part: str):
    part = part.strip()
    if not part:
        return None, 'unknown'

    # check memory first
    mem = is_in_known_sets(part)
    if mem:
        return part, mem

    if is_institute_candidate(part):
        known_institutes.add(part.upper())
        for token in part.split():
            if token.isupper() and len(token) > 1:
                known_institutes.add(token.upper())
        return part, 'institute'

    if is_qualification_candidate(part):
        known_qualifications.add(part.upper())
        return part, 'qualification'

    # fallback: unclassified
    return part, 'unknown'

def split_by_delimiters(text: str):
    parts = re.split(r'[,/]', text)
    cleaned = []
    for p in parts:
        p_clean = re.sub(r'["\'\[\]\(\)]', '', p).strip()
        if p_clean:
            cleaned.append(p_clean)
    return cleaned

def parse_education_data(text, comma_count):
    qualification = None
    institute = None

    if not text or pd.isna(text):
        return (None, None)

    text = normalize_text(str(text))

    def process_tail(parts):
        qual_parts = []
        inst_parts = []
        for p in parts:
            classified, cat = classify_part(p)
            if cat == 'institute':
                inst_parts.append(classified)
            elif cat == 'qualification':
                qual_parts.append(classified)
            else:  # unknown
                all_unknown_parts.add(classified)
        return qual_parts, inst_parts

    parts = split_by_delimiters(text)
    if not parts:
        return (None, None)

    qualification = parts[0]
    tail = parts[1:]
    qual_tail, inst_tail = process_tail(tail)

    if qual_tail:
        qualification = ', '.join([qualification] + qual_tail)
    if inst_tail:
        institute = ', '.join(inst_tail)

    return (qualification, institute)

# Assume student_profiles is already defined
student_profiles['Comma Count'] = student_profiles['NAME OF QUALIFICATION AND INSTITUTION'].fillna("").str.count(',')

# Clear known sets for fresh run
known_institutes.clear()
known_qualifications.clear()
all_unknown_parts.clear()

# Apply parsing and expand into two columns
student_profiles[['Qualification', 'Institute']] = student_profiles.apply(
    lambda row: pd.Series(parse_education_data(
        row['NAME OF QUALIFICATION AND INSTITUTION'],
        row['Comma Count']
    )),
    axis=1
)

# After processing, get all unknown parts as a list
unknowns_list = sorted(all_unknown_parts)


#### Creating the column "Field"

In [61]:
# Define field keywords
FIELD_KEYWORDS = {
    'Business': ['business', 'management', 'marketing', 'accounting', 'finance', 'commerce', 'bba', 'hrm', 'accountancy'],
    'IT': ['information technology', 'computer', 'computing', 'networking', 'software', 'systems', 'infocomm technology', 'cybersecurity', r'\bit\b'],
    'Engineering': ['engineering', 'mechanical', 'electrical', 'civil', 'chemical', 'computer engineering'],
    'Hospitality': ['hospitality', 'tourism', 'hotel', 'food and beverage'],
    'Health': ['nursing', 'health', 'medical', 'pharmacy', 'care'],
    'Art': ['art', 'design', 'fine art', 'visual arts', 'performing arts', 'graphic design', 'creative arts'],
    'Pre-University': ['secondary', r'\bo levels\b', r'\bo level\b', r'\ba levels\b', r'\ba level\b', r'\bn levels\b', r'\bn level\b', r'\bspm\b'],
    'Science': ['environmental science', 'science'],
    'Clerical': ['office', ]
}

import re

def infer_field(qualification: str):
    if not qualification or not isinstance(qualification, str):
        return None
    q_lower = qualification.lower()
    for field, keywords in FIELD_KEYWORDS.items():
        for kw in keywords:
            # support regex keywords like r'\bit\b'
            try:
                if re.search(kw, q_lower):
                    return field
            except re.error:
                if kw in q_lower:
                    return field
    return 'Other'

def infer_fields_multi(qualification: str):
    if not qualification or not isinstance(qualification, str):
        return []
    q_lower = qualification.lower()
    matches = []
    for field, keywords in FIELD_KEYWORDS.items():
        for kw in keywords:
            try:
                if re.search(kw, q_lower):
                    matches.append(field)
                    break
            except re.error:
                if kw in q_lower:
                    matches.append(field)
                    break
    return sorted(set(matches)) or ['Other']

# Apply to dataframe
student_profiles['Field'] = student_profiles['Qualification'].apply(infer_field)
student_profiles['Fields'] = student_profiles['Qualification'].apply(infer_fields_multi)

In [62]:
# Count occurrences of each single field
field_counts = student_profiles['Field'].value_counts()

print(field_counts)

Field
Business          151
Pre-University     61
Other              21
Science            17
Art                17
IT                 13
Engineering         7
Hospitality         7
Clerical            6
Health              2
Name: count, dtype: int64
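
One caveat with the keyword approach: short keywords such as `art` or `care` are matched as substrings, so they can also fire inside longer words. The sketch below contrasts substring and whole-word matching. The helper `matches` and the qualification title are made up for illustration.

```python
import re

def matches(keyword: str, text: str, whole_word: bool) -> bool:
    """Return True if keyword occurs in text, optionally only as a whole word."""
    escaped = re.escape(keyword)
    pattern = rf'\b{escaped}\b' if whole_word else escaped
    return re.search(pattern, text.lower()) is not None

title = 'Certificate in Smart Logistics'  # made-up title for illustration

# Substring matching fires because 'smart' contains 'art'
print(matches('art', title, whole_word=False))

# Whole-word matching does not fire on this title
print(matches('art', title, whole_word=True))
```

Whole-word matching is stricter in both directions: it would also stop `art` from matching `arts`, so plural forms would need their own keywords.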


### 8. Date Attained Highest Qualification

In [63]:

# Drop rows with the identified unrealistic qualification ages based on index
rows_to_drop = [111, 33, 113, 38, 267]
student_profiles_cleaned = student_profiles.drop(index=rows_to_drop, errors='ignore').copy()

print(f"Number of rows before dropping: {len(student_profiles)}")
print(f"Number of rows after dropping: {len(student_profiles_cleaned)}")

# Update the student_profiles DataFrame
student_profiles = student_profiles_cleaned

print("\nStudent profiles after dropping rows with unrealistic qualification ages:")
display(student_profiles.head())

Number of rows before dropping: 302
Number of rows after dropping: 297

Student profiles after dropping rows with unrealistic qualification ages:


Unnamed: 0,STUDENT ID,GENDER,NATIONALITY,DOB,HIGHEST QUALIFICATION,NAME OF QUALIFICATION AND INSTITUTION,DATE ATTAINED HIGHEST QUALIFICATION,DESIGNATION,COMMENCEMENT DATE,COMPLETION DATE,...,DESIGNATION_CATEGORY,COURSE_COMPLETED,COURSE_ID,INTAKE_NO,INDEX_NO,RESIDENTIAL_STATUS,Qualification,Institute,Field,Fields
0,1101-009/001,F,Malaysia,1981-09-13,Certificate,SPM,2018-01-08,Admin & HR Assistant,2022-04-18,2023-09-17,...,Assistant,True,1101,9,1,Foreigner,SPM,,Pre-University,[Pre-University]
1,1101-009/002,F,Singapore,1979-07-26,Certificate,"Certificate in Office Skills, ITE",2016-06-08,Admin Assistant,2022-04-18,2023-09-17,...,Assistant,True,1101,9,2,Singapore Citizen,Certificate in Office Skills,ITE,Clerical,[Clerical]
2,1101-009/003,F,India,1990-02-01,Degree,"Bachelor of Business Administration, Universit...",2015-08-08,-,2022-04-18,2023-09-17,...,Unknown,True,1101,9,3,Foreigner,Bachelor of Business Administration,University of Rajasthan,Business,[Business]
3,1101-009/004,F,Netherlands,1976-04-20,Diploma,"Office Management Diploma, NCOI Rotterdam, The...",2018-02-08,HR Support / Office Manager,2022-04-18,2023-09-17,...,Manager,True,1101,9,4,Foreigner,Office Management Diploma,NCOI Rotterdam,Business,"[Business, Clerical]"
4,1101-009/005,F,Singapore,1983-11-25,Diploma,"Diploma in Business Admininstration, LCCI Leve...",2015-06-08,"Executive, Administration",2022-04-18,2023-09-17,...,Executive,True,1101,9,5,Singapore Citizen,"Diploma in Business Admininstration, LCCI Leve...",,Business,[Business]
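
The dropped indices were picked by inspection. A sketch of how such rows could be flagged programmatically is shown below, using the `DOB` and date-attained values of rows 111 to 114 from the earlier output. `MIN_AGE_YEARS` is a hypothetical cut-off chosen only for this example, not a rule taken from the dataset.

```python
import pandas as pd

# DOB and date attained for rows 111-114, copied from the earlier output
sample = pd.DataFrame(
    {
        'DOB': pd.to_datetime(['2010-09-28', '1990-10-19', '1999-04-10', '2000-05-24']),
        'DATE ATTAINED HIGHEST QUALIFICATION': pd.to_datetime(
            ['2021-09-24', '2010-07-24', '2012-04-24', '2020-04-24']
        ),
    },
    index=[111, 112, 113, 114],
)

MIN_AGE_YEARS = 15  # hypothetical cut-off, for illustration only

# Age in years when the qualification was attained
age = (sample['DATE ATTAINED HIGHEST QUALIFICATION'] - sample['DOB']).dt.days / 365.25
sample['QUALIFICATION_AGE_YEARS'] = age.round(1)

# Rows where the student looks too young for the stated qualification
flagged = sample[sample['QUALIFICATION_AGE_YEARS'] < MIN_AGE_YEARS]
print(flagged)
```

Flagging by rule rather than by hard-coded index keeps the step valid if the row order changes.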


### 10. Commence Date & Completion Date

In [64]:
# Insert blank values by looking at COURSE_ID & INTAKE_NO
# Combine COURSE_ID and INTAKE_NO to create a unique intake identifier
student_profiles['INTAKE_IDENTIFIER'] = student_profiles['COURSE_ID'].astype(str) + '-' + student_profiles['INTAKE_NO'].astype(str)

# Calculate initial missing values
missing_commencement_before = student_profiles['COMMENCEMENT DATE'].isna().sum()
missing_completion_before = student_profiles['COMPLETION DATE'].isna().sum()

# Function to fill missing dates within each intake group
def fill_missing_dates(group):
    # Fill missing COMMENCEMENT DATE with the most frequent date in the group
    if group['COMMENCEMENT DATE'].isnull().any():
        most_frequent_commencement = group['COMMENCEMENT DATE'].mode()
        if not most_frequent_commencement.empty:
            group['COMMENCEMENT DATE'] = group['COMMENCEMENT DATE'].fillna(most_frequent_commencement[0])

    # Fill missing COMPLETION DATE with the most frequent date in the group
    if group['COMPLETION DATE'].isnull().any():
        most_frequent_completion = group['COMPLETION DATE'].mode()
        if not most_frequent_completion.empty:
            group['COMPLETION DATE'] = group['COMPLETION DATE'].fillna(most_frequent_completion[0])

    return group

# Apply the filling function to each intake group
student_profiles_filled_dates = student_profiles.groupby('INTAKE_IDENTIFIER').apply(fill_missing_dates)

# Drop the temporary intake identifier column
student_profiles_filled_dates = student_profiles_filled_dates.drop(columns=['INTAKE_IDENTIFIER'])

print("Student profiles after attempting to fill missing commencement and completion dates:")
display(student_profiles_filled_dates[['STUDENT ID', 'COMMENCEMENT DATE', 'COMPLETION DATE']].head())

# Calculate remaining missing values
missing_commencement_after_fill = student_profiles_filled_dates['COMMENCEMENT DATE'].isna().sum()
missing_completion_after_fill = student_profiles_filled_dates['COMPLETION DATE'].isna().sum()

# Calculate number of filled values
filled_commencement = missing_commencement_before - missing_commencement_after_fill
filled_completion = missing_completion_before - missing_completion_after_fill


print(f"\nNumber of missing 'COMMENCEMENT DATE' before filling: {missing_commencement_before}")
print(f"Number of missing 'COMMENCEMENT DATE' after filling: {missing_commencement_after_fill}")
print(f"Number of 'COMMENCEMENT DATE' values filled: {filled_commencement}")

print(f"\nNumber of missing 'COMPLETION DATE' before filling: {missing_completion_before}")
print(f"Number of missing 'COMPLETION DATE' after filling: {missing_completion_after_fill}")
print(f"Number of 'COMPLETION DATE' values filled: {filled_completion}")


# Update the student_profiles DataFrame
student_profiles = student_profiles_filled_dates

Student profiles after attempting to fill missing commencement and completion dates:



Unnamed: 0_level_0,Unnamed: 1_level_0,STUDENT ID,COMMENCEMENT DATE,COMPLETION DATE
INTAKE_IDENTIFIER,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1101-009,0,1101-009/001,2022-04-18,2023-09-17
1101-009,1,1101-009/002,2022-04-18,2023-09-17
1101-009,2,1101-009/003,2022-04-18,2023-09-17
1101-009,3,1101-009/004,2022-04-18,2023-09-17
1101-009,4,1101-009/005,2022-04-18,2023-09-17



Number of missing 'COMMENCEMENT DATE' before filling: 5
Number of missing 'COMMENCEMENT DATE' after filling: 0
Number of 'COMMENCEMENT DATE' values filled: 5

Number of missing 'COMPLETION DATE' before filling: 18
Number of missing 'COMPLETION DATE' after filling: 13
Number of 'COMPLETION DATE' values filled: 5
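
Note that `groupby(...).apply(...)` adds `INTAKE_IDENTIFIER` as an extra index level, as the output above shows. A possible alternative is `groupby(...).transform(...)`, which returns a result aligned to the original index. The sketch below uses a toy frame and a hypothetical helper `fill_with_group_mode`.

```python
import pandas as pd

# Toy frame: one intake with a missing date, and one intake with no known date
df = pd.DataFrame({
    'COURSE_ID': ['1101', '1101', '1101', '5113'],
    'INTAKE_NO': ['009', '009', '009', '009'],
    'COMMENCEMENT DATE': pd.to_datetime(['2022-04-18', None, '2022-04-18', None]),
})

def fill_with_group_mode(dates: pd.Series) -> pd.Series:
    """Fill missing dates with the most frequent date in the same group, if any."""
    mode = dates.mode()  # mode() skips missing values by default
    return dates.fillna(mode.iloc[0]) if not mode.empty else dates

df['COMMENCEMENT DATE'] = (
    df.groupby(['COURSE_ID', 'INTAKE_NO'])['COMMENCEMENT DATE']
    .transform(fill_with_group_mode)
)

print(df)
```

Groups with no known date are left as missing, which matches the 13 completion dates that could not be filled above.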


### 12. Course Funding Type

In [65]:

# Standardize values based on the provided mapping
funding_mapping = {
    'Individual': 'Individual',
    'Individual ': 'Individual',
    'Indivodual': 'Individual',
    'Indvidual': 'Individual',
    'Individual - SFC': 'Individual - SFC',
    'Individual-SFC': 'Individual - SFC',
    'Indvidual - SFC': 'Individual - SFC',
    # 'Sponsored-no SDF' variants are grouped under 'Sponsored', following the table above
    'Sponsored-no SDF': 'Sponsored',
    'Sponsored-no SDF ': 'Sponsored',
    'Individual - waived App Fee': 'Individual - Waived App Fee',
    'Individual - SFC + $1000 SCHOLARSHIP': 'Individual - SFC + Scholarship',
    'Sponsored': 'Sponsored',
    'Sponsored ': 'Sponsored',
    'Sponsored - SDF': 'Sponsored - SDF',
    'Sponsored  ': 'Sponsored', # Handle double space

}

student_profiles['COURSE FUNDING'] = student_profiles['COURSE FUNDING'].replace(funding_mapping)

print("Course Funding column standardized:")
display(student_profiles['COURSE FUNDING'].value_counts().reset_index())

# Check for spacing issues by looking at the unique values directly
print("\nUnique values in 'COURSE FUNDING' column after standardization (checking for spacing):")
display(student_profiles['COURSE FUNDING'].unique())

Course Funding column standardized:


Unnamed: 0,COURSE FUNDING,count
0,Individual,160
1,Individual - SFC,66
2,Sponsored,43
3,Sponsored - no SDF,10
4,Individual - Waived App Fee,5
5,Individual,4
6,Individual - SFC + Scholarship,4
7,Sponsored - SDF,3
8,Individual,1
9,Sponsored,1



Unique values in 'COURSE FUNDING' column after standardization (checking for spacing):


array(['Individual', 'Individual - SFC', 'Sponsored',
       'Sponsored - no SDF', 'Individual   ',
       'Individual - Waived App Fee', 'Sponsored   ', 'Individual  ',
       'Individual - SFC + Scholarship', 'Sponsored - SDF'], dtype=object)

In [66]:
# Remove leading/trailing spaces from 'COURSE FUNDING'
student_profiles['COURSE FUNDING'] = student_profiles['COURSE FUNDING'].str.strip()

print("Course Funding column after stripping spaces:")
display(student_profiles['COURSE FUNDING'].value_counts().reset_index())

print("\nUnique values in 'COURSE FUNDING' column after stripping spaces:")
display(student_profiles['COURSE FUNDING'].unique())

Course Funding column after stripping spaces:


Unnamed: 0,COURSE FUNDING,count
0,Individual,165
1,Individual - SFC,66
2,Sponsored,44
3,Sponsored - no SDF,10
4,Individual - Waived App Fee,5
5,Individual - SFC + Scholarship,4
6,Sponsored - SDF,3



Unique values in 'COURSE FUNDING' column after stripping spaces:


array(['Individual', 'Individual - SFC', 'Sponsored',
       'Sponsored - no SDF', 'Individual - Waived App Fee',
       'Individual - SFC + Scholarship', 'Sponsored - SDF'], dtype=object)
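
The first pass above left variants such as `'Individual   '` and `'Sponsored   '` unmapped, because the dictionary lists only specific spellings, so a separate `.str.strip()` pass was needed. A minimal sketch of an alternative is to normalize whitespace and hyphen spacing before the lookup, so that each category needs fewer keys. The helper `normalize_funding` and the `CANONICAL` table are hypothetical names, not part of the notebook. Note that this sketch turns `'Sponsored-no SDF'` into `'Sponsored - no SDF'`, whereas the notebook's mapping folds that spelling into `'Sponsored'`.

```python
import re

# Hypothetical lookup keyed by a lowercased, whitespace-normalized label.
# Misspellings still need their own entries.
CANONICAL = {
    "individual": "Individual",
    "indivodual": "Individual",
    "indvidual": "Individual",
    "individual - sfc": "Individual - SFC",
    "indvidual - sfc": "Individual - SFC",
    "sponsored": "Sponsored",
    "sponsored - sdf": "Sponsored - SDF",
}

def normalize_funding(value):
    """Return a canonical funding label, or the cleaned input if it is unknown."""
    if not isinstance(value, str):
        return value  # leave NaN/None untouched
    cleaned = re.sub(r"\s+", " ", value).strip()   # collapse runs of whitespace
    cleaned = re.sub(r"\s*-\s*", " - ", cleaned)    # uniform ' - ' separator
    return CANONICAL.get(cleaned.lower(), cleaned)
```

It could be applied with `student_profiles['COURSE FUNDING'].map(normalize_funding)`; labels that are not in the table come back cleaned but otherwise unchanged, so they remain visible in `value_counts()`.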

### 13. Registration Fee

In [67]:
# Standardize values: '107\n107' to '107' and 'Waived' to '0'
student_profiles['REGISTRATION FEE'] = student_profiles['REGISTRATION FEE'].replace({'107\n107': '107', 'Waived': '0'})

# Convert to Float type
student_profiles['REGISTRATION FEE'] = student_profiles['REGISTRATION FEE'].astype(float)

print("Registration Fee column cleaned and converted to float:")
display(student_profiles[['STUDENT ID', 'REGISTRATION FEE']].head())

# Verify the data type and unique values after cleaning
print("\nData type of 'REGISTRATION FEE' after cleaning:")
print(student_profiles['REGISTRATION FEE'].dtype)

print("\nUnique values and counts in 'REGISTRATION FEE' after cleaning:")
display(student_profiles['REGISTRATION FEE'].value_counts(dropna=False).reset_index())

Registration Fee column cleaned and converted to float:


Unnamed: 0_level_0,Unnamed: 1_level_0,STUDENT ID,REGISTRATION FEE
INTAKE_IDENTIFIER,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1101-009,0,1101-009/001,107.0
1101-009,1,1101-009/002,107.0
1101-009,2,1101-009/003,107.0
1101-009,3,1101-009/004,107.0
1101-009,4,1101-009/005,107.0



Data type of 'REGISTRATION FEE' after cleaning:
float64

Unique values and counts in 'REGISTRATION FEE' after cleaning:


Unnamed: 0,REGISTRATION FEE,count
0,107.0,292
1,0.0,5


### 14. Payment Mode

In [68]:
# Standardize values based on the provided list
payment_mode_mapping = {
    'Nets': 'NETS',
    'NETS': 'NETS',
    'Giro': 'Giro',
    'GIRO': 'Giro',
    'PayNow': 'PayNow',
    'Cr Card': 'Credit Card',
    'Waived': 'Waived',
    'Bank': 'Bank'
}

student_profiles['PAYMENT MODE'] = student_profiles['PAYMENT MODE'].replace(payment_mode_mapping)

print("Payment Mode column standardized:")
display(student_profiles['PAYMENT MODE'].value_counts().reset_index())

Payment Mode column standardized:


Unnamed: 0,PAYMENT MODE,count
0,NETS,227
1,Giro,41
2,PayNow,23
3,Waived,4
4,Credit Card,1
5,Bank,1


### 15. Course Fee

In [69]:
# Remove '$' symbol and convert to numeric, coercing errors
student_profiles['COURSE FEE_cleaned'] = student_profiles['COURSE FEE'].astype(str).str.replace('$', '', regex=False)
student_profiles['COURSE FEE_cleaned'] = pd.to_numeric(student_profiles['COURSE FEE_cleaned'], errors='coerce')

# Identify rows with values that couldn't be converted to numeric
invalid_course_fee_rows = student_profiles[student_profiles['COURSE FEE_cleaned'].isna()]

print("Rows with invalid 'COURSE FEE' values that could not be converted to numeric:")
display(invalid_course_fee_rows[['STUDENT ID', 'COURSE FEE', 'COURSE FEE_cleaned']])

# Convert to float with 2 decimal places
student_profiles['COURSE FEE_cleaned'] = student_profiles['COURSE FEE_cleaned'].astype(float).round(2)

# Display the cleaned column and verify data type
print("\n'COURSE FEE' column after cleaning and conversion to numeric (with 2 decimal places):")
display(student_profiles[['STUDENT ID', 'COURSE FEE', 'COURSE FEE_cleaned']].head())

print("\nData type of 'COURSE FEE_cleaned' after cleaning:")
print(student_profiles['COURSE FEE_cleaned'].dtype)

Rows with invalid 'COURSE FEE' values that could not be converted to numeric:


Unnamed: 0_level_0,Unnamed: 1_level_0,STUDENT ID,COURSE FEE,COURSE FEE_cleaned
INTAKE_IDENTIFIER,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1



'COURSE FEE' column after cleaning and conversion to numeric (with 2 decimal places):


Unnamed: 0_level_0,Unnamed: 1_level_0,STUDENT ID,COURSE FEE,COURSE FEE_cleaned
INTAKE_IDENTIFIER,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1101-009,0,1101-009/001,5136,5136.0
1101-009,1,1101-009/002,5136,5136.0
1101-009,2,1101-009/003,5136,5136.0
1101-009,3,1101-009/004,5136,5136.0
1101-009,4,1101-009/005,4812,4812.0



Data type of 'COURSE FEE_cleaned' after cleaning:
float64


In [70]:
# Update COURSE FEE with cleaned values
student_profiles = student_profiles.drop(columns=['COURSE FEE'])
student_profiles = student_profiles.rename(columns={'COURSE FEE_cleaned': 'COURSE FEE'})

### 16. CGPA Calculation

In [71]:
# Calculate the average GPA for each student from the semester_results DataFrame
cgpa_data = semester_results.groupby('STUDENT ID')['GPA'].mean().reset_index()
cgpa_data.rename(columns={'GPA': 'CGPA'}, inplace=True)

# Round the CGPA to 2 decimal places
cgpa_data['CGPA'] = cgpa_data['CGPA'].round(2)

# Merge the calculated CGPA back to the student_profiles DataFrame
student_profiles = pd.merge(student_profiles, cgpa_data, on='STUDENT ID', how='left')

print("Student profiles with CGPA column:")
display(student_profiles.head())

Student profiles with CGPA column:


Unnamed: 0,STUDENT ID,GENDER,NATIONALITY,DOB,HIGHEST QUALIFICATION,NAME OF QUALIFICATION AND INSTITUTION,DATE ATTAINED HIGHEST QUALIFICATION,DESIGNATION,COMMENCEMENT DATE,COMPLETION DATE,...,COURSE_ID,INTAKE_NO,INDEX_NO,RESIDENTIAL_STATUS,Qualification,Institute,Field,Fields,COURSE FEE,CGPA
0,1101-009/001,F,Malaysia,1981-09-13,Certificate,SPM,2018-01-08,Admin & HR Assistant,2022-04-18,2023-09-17,...,1101,9,1,Foreigner,SPM,,Pre-University,[Pre-University],5136.0,3.6
1,1101-009/002,F,Singapore,1979-07-26,Certificate,"Certificate in Office Skills, ITE",2016-06-08,Admin Assistant,2022-04-18,2023-09-17,...,1101,9,2,Singapore Citizen,Certificate in Office Skills,ITE,Clerical,[Clerical],5136.0,3.5
2,1101-009/003,F,India,1990-02-01,Degree,"Bachelor of Business Administration, Universit...",2015-08-08,-,2022-04-18,2023-09-17,...,1101,9,3,Foreigner,Bachelor of Business Administration,University of Rajasthan,Business,[Business],5136.0,3.37
3,1101-009/004,F,Netherlands,1976-04-20,Diploma,"Office Management Diploma, NCOI Rotterdam, The...",2018-02-08,HR Support / Office Manager,2022-04-18,2023-09-17,...,1101,9,4,Foreigner,Office Management Diploma,NCOI Rotterdam,Business,"[Business, Clerical]",5136.0,3.8
4,1101-009/005,F,Singapore,1983-11-25,Diploma,"Diploma in Business Admininstration, LCCI Leve...",2015-06-08,"Executive, Administration",2022-04-18,2023-09-17,...,1101,9,5,Singapore Citizen,"Diploma in Business Admininstration, LCCI Leve...",,Business,[Business],4812.0,2.9
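
As a worked check of this step: the CGPA here is the unweighted mean of a student's semester GPAs. For student `1101-009/001`, the semester results shown later in this notebook are 3.5, 3.6 and 3.7, which average to the 3.6 in the table above. The helper `cgpa` below is a hypothetical stand-in for the `groupby(...).mean()` call, written with the standard library only.

```python
from statistics import mean

def cgpa(gpas):
    """Unweighted mean of semester GPAs, rounded to 2 decimals; None if there are no results."""
    return round(mean(gpas), 2) if gpas else None

print(cgpa([3.5, 3.6, 3.7]))  # 3.6
print(cgpa([]))               # None
```

Because the merge uses `how='left'`, students with no rows in `semester_results` keep their profile row and receive a missing CGPA.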


### 17. Hiatus Duration

In [72]:
# Calculate the time difference between COMMENCEMENT DATE and DATE ATTAINED HIGHEST QUALIFICATION
time_difference = student_profiles['COMMENCEMENT DATE'] - student_profiles['DATE ATTAINED HIGHEST QUALIFICATION']

# Convert the time difference to years and store in a new column
# We divide by 365.25, the average number of days in a year once leap years are counted
student_profiles['YEAR_SINCE_LAST_QUAL'] = (time_difference.dt.days / 365.25).round(2)

print("Student profiles with YEAR_SINCE_LAST_QUAL column:")
display(student_profiles[['STUDENT ID', 'DATE ATTAINED HIGHEST QUALIFICATION', 'COMMENCEMENT DATE', 'YEAR_SINCE_LAST_QUAL']].head())

Student profiles with YEAR_SINCE_LAST_QUAL column:


Unnamed: 0,STUDENT ID,DATE ATTAINED HIGHEST QUALIFICATION,COMMENCEMENT DATE,YEAR_SINCE_LAST_QUAL
0,1101-009/001,2018-01-08,2022-04-18,4.27
1,1101-009/002,2016-06-08,2022-04-18,5.86
2,1101-009/003,2015-08-08,2022-04-18,6.69
3,1101-009/004,2018-02-08,2022-04-18,4.19
4,1101-009/005,2015-06-08,2022-04-18,6.86
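
The first two rows of the table above can be reproduced by hand with the standard library. This is a sketch of the same arithmetic, not the notebook's pandas code, and `years_between` is a hypothetical helper name.

```python
from datetime import date

def years_between(earlier, later):
    """Elapsed time in years, using 365.25 days per year as in the cell above."""
    return round((later - earlier).days / 365.25, 2)

# 1101-009/001: 1561 days between qualification and commencement
print(years_between(date(2018, 1, 8), date(2022, 4, 18)))  # 4.27
# 1101-009/002: 2140 days
print(years_between(date(2016, 6, 8), date(2022, 4, 18)))  # 5.86
```

A negative result would mean the qualification date falls after the commencement date, which is worth checking as a possible data-entry error.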


In [73]:
student_profiles


Unnamed: 0,STUDENT ID,GENDER,NATIONALITY,DOB,HIGHEST QUALIFICATION,NAME OF QUALIFICATION AND INSTITUTION,DATE ATTAINED HIGHEST QUALIFICATION,DESIGNATION,COMMENCEMENT DATE,COMPLETION DATE,...,INTAKE_NO,INDEX_NO,RESIDENTIAL_STATUS,Qualification,Institute,Field,Fields,COURSE FEE,CGPA,YEAR_SINCE_LAST_QUAL
0,1101-009/001,F,Malaysia,1981-09-13,Certificate,SPM,2018-01-08,Admin & HR Assistant,2022-04-18,2023-09-17,...,009,001,Foreigner,SPM,,Pre-University,[Pre-University],5136.0,3.60,4.27
1,1101-009/002,F,Singapore,1979-07-26,Certificate,"Certificate in Office Skills, ITE",2016-06-08,Admin Assistant,2022-04-18,2023-09-17,...,009,002,Singapore Citizen,Certificate in Office Skills,ITE,Clerical,[Clerical],5136.0,3.50,5.86
2,1101-009/003,F,India,1990-02-01,Degree,"Bachelor of Business Administration, Universit...",2015-08-08,-,2022-04-18,2023-09-17,...,009,003,Foreigner,Bachelor of Business Administration,University of Rajasthan,Business,[Business],5136.0,3.37,6.69
3,1101-009/004,F,Netherlands,1976-04-20,Diploma,"Office Management Diploma, NCOI Rotterdam, The...",2018-02-08,HR Support / Office Manager,2022-04-18,2023-09-17,...,009,004,Foreigner,Office Management Diploma,NCOI Rotterdam,Business,"[Business, Clerical]",5136.0,3.80,4.19
4,1101-009/005,F,Singapore,1983-11-25,Diploma,"Diploma in Business Admininstration, LCCI Leve...",2015-06-08,"Executive, Administration",2022-04-18,2023-09-17,...,009,005,Singapore Citizen,"Diploma in Business Admininstration, LCCI Leve...",,Business,[Business],4812.0,2.90,6.86
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
292,5113-009/003,F,Singapore,1995-03-06,Degree,Bachelor of Science in Hotel Administration (H...,2019-06-08,HR Manager,2025-04-14,NaT,...,009,003,Singapore Citizen,Bachelor of Science in Hotel Administration Ho...,University of Nevada,Business,"[Business, Hospitality, Science]",5803.0,,5.85
293,5113-009/004,M,Philippines,1995-02-09,Degree,Bachelor of Science in Accounting and Finance ...,2020-08-05,Finance Manager,2025-04-14,NaT,...,009,004,PR,Bachelor of Science in Accounting and Finance ...,University of London,Business,"[Business, Science]",5803.0,,4.69
294,5113-009/005,F,Singapore,1993-10-01,Degree,Bachelor of Commerce (Accounting and Finance)/...,2019-11-05,"Head, Admin",2025-04-14,NaT,...,009,005,Singapore Citizen,Bachelor of Commerce Accounting and Finance,Curtin University of Technology,Business,[Business],5803.0,,5.44
295,5113-009/006,F,Indonesia,1996-09-30,Degree,Bachelor of Economics (Accounting)/\nUniversit...,2018-05-05,Accountant,2025-04-14,NaT,...,009,006,PR,Bachelor of Economics Accounting,Universitas Katolik Indonesia,Business,[Business],5803.0,,6.94


## Data Wrangling for Course Code
### 1. Get the column "Course Type"

* Create a new column "Course Type" based on Course Name

### 2. Average number of semesters to complete the course

* Find the typical number of semesters needed to complete each course, based on `semester_results`

### 1. Get the column "Course Type"

In [74]:
# Define a function to extract course type based on keywords
def get_course_type(course_name):
    if isinstance(course_name, str):
        course_name_lower = course_name.lower()
        if 'specialist diploma' in course_name_lower:
            return 'Specialist Diploma'
        elif 'diploma' in course_name_lower:
            return 'Diploma'
        elif 'certificate' in course_name_lower:
            return 'Certificate'
    return 'Other'

# Apply the function to create the new 'course type' column
course_code['course type'] = course_code['COURSE NAME'].apply(get_course_type)

print("Course Code DataFrame with 'course type' column:")
display(course_code.head())

print("\nCounts of each Course Type:")
display(course_code['course type'].value_counts())

Course Code DataFrame with 'course type' column:


Unnamed: 0,CODE,COURSE NAME,course type
0,1101,Diploma in Business Administration,Diploma
1,1102,Diploma in Business Analytics,Diploma
2,2101,Certificate in Digital Marketing,Certificate
3,2102,Certificate in HR Management,Certificate
4,2013,Certificate in Tourism Management,Certificate



Counts of each Course Type:


Unnamed: 0_level_0,count
course type,Unnamed: 1_level_1
Certificate,3
Diploma,2
Specialist Diploma,2


### Average number of semesters to complete the course

We first need to standardise the `PERIOD` column of the `semester_results` dataset so that semesters can be counted as numbers.

In [75]:
# Standardize 'PERIOD' column by converting to numbers
def standardize_period(period):
    if isinstance(period, str):
        period_lower = period.lower().replace(' ', '').replace('semester', '').replace('sem', '') # Replace both 'semester' and 'sem'
        try:
            return int(period_lower)
        except ValueError:
            return None # Handle cases that cannot be converted
    return period # Return as is if not a string (e.g., already a number or NaN)

semester_results['PERIOD_cleaned'] = semester_results['PERIOD'].apply(standardize_period)

# Check for values that couldn't be standardized (will be None)
invalid_periods = semester_results[semester_results['PERIOD_cleaned'].isna() & semester_results['PERIOD'].notna()]

if not invalid_periods.empty:
    print("Rows with invalid 'PERIOD' values that could not be standardized:")
    display(invalid_periods[['STUDENT ID', 'PERIOD', 'PERIOD_cleaned']])
else:
    print("All 'PERIOD' values standardized successfully.")

# Drop the original 'PERIOD' column and rename the cleaned one
semester_results = semester_results.drop(columns=['PERIOD'])
semester_results = semester_results.rename(columns={'PERIOD_cleaned': 'PERIOD'})

print("\n'PERIOD' column after standardization:")
display(semester_results['PERIOD'].value_counts().reset_index())

All 'PERIOD' values standardized successfully.

'PERIOD' column after standardization:


Unnamed: 0,PERIOD,count
0,1,309
1,2,162
2,3,82
3,4,2


In [76]:
# Reset the index of student_profiles
student_profiles = student_profiles.reset_index(drop=True)

# Calculate the number of semesters for each student
semesters_taken = semester_results.groupby('STUDENT ID')['PERIOD'].max().reset_index()
semesters_taken.rename(columns={'PERIOD': 'Semesters Taken'}, inplace=True)

# Merge the number of semesters taken back to the student_profiles DataFrame
student_profiles = pd.merge(student_profiles, semesters_taken, on='STUDENT ID', how='left')

print("Student profiles with Semesters Taken column:")
display(student_profiles[['STUDENT ID', 'COMMENCEMENT DATE', 'COMPLETION DATE', 'Semesters Taken']].head())

# Group by COURSE_ID and Semesters Taken to count students
semesters_per_course = student_profiles.groupby(['COURSE_ID', 'Semesters Taken']).size().reset_index(name='Number of Students')

print("\nNumber of students per course by semesters taken:")
display(semesters_per_course)

Student profiles with Semesters Taken column:


Unnamed: 0,STUDENT ID,COMMENCEMENT DATE,COMPLETION DATE,Semesters Taken
0,1101-009/001,2022-04-18,2023-09-17,3.0
1,1101-009/002,2022-04-18,2023-09-17,3.0
2,1101-009/003,2022-04-18,2023-09-17,3.0
3,1101-009/004,2022-04-18,2023-09-17,3.0
4,1101-009/005,2022-04-18,2023-09-17,3.0



Number of students per course by semesters taken:


Unnamed: 0,COURSE_ID,Semesters Taken,Number of Students
0,1101,3.0,41
1,1101,4.0,1
2,1102,3.0,29
3,1102,4.0,1
4,2101,1.0,31
5,2102,1.0,97
6,2102,2.0,2
7,5112,2.0,45
8,5113,2.0,35
9,5113,3.0,2


`COURSE_ID` is converted to an integer type so that it can be matched against `CODE` in `course_code`.

In [77]:
semesters_per_course['COURSE_ID'] = semesters_per_course['COURSE_ID'].astype(int)


We use the mode (the most common number of semesters among a course's students) rather than the mean as the typical number of semesters needed to complete each course.

In [78]:
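# Find the mode per course: the 'Semesters Taken' value with the largest 'Number of Students'
# (taking .mode() over the rows of semesters_per_course would ignore the student counts)
modal_rows = semesters_per_course.loc[semesters_per_course.groupby('COURSE_ID')['Number of Students'].idxmax()]
average_semesters = modal_rows[['COURSE_ID', 'Semesters Taken']].rename(columns={'Semesters Taken': 'Course Duration (semesters)'})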

# Convert 'CODE' in course_code to object type to match 'COURSE_ID' in average_semesters
course_code['CODE'] = course_code['CODE'].astype(object)

# Merge this mode with the course_code DataFrame
course_code = pd.merge(course_code, average_semesters, left_on='CODE', right_on='COURSE_ID', how='left')

# Drop the redundant COURSE_ID column from the merge
course_code = course_code.drop(columns=['COURSE_ID'])

print("Course Code DataFrame with 'avg sems to complete' (mode):")
display(course_code)

Course Code DataFrame with 'avg sems to complete' (mode):


Unnamed: 0,CODE,COURSE NAME,course type,course duration
0,1101,Diploma in Business Administration,Diploma,3.0
1,1102,Diploma in Business Analytics,Diploma,3.0
2,2101,Certificate in Digital Marketing,Certificate,1.0
3,2102,Certificate in HR Management,Certificate,1.0
4,2013,Certificate in Tourism Management,Certificate,
5,5112,Specialist Diploma in Business Innovation and ...,Specialist Diploma,2.0
6,5113,Specialist Diploma in Intelligent Systems,Specialist Diploma,2.0


The NaN for course 2013 arises because no student in `student_profiles` is enrolled in that course, as the check below confirms.

In [79]:
semesters_per_course[semesters_per_course['COURSE_ID'] == 2013]

Unnamed: 0,COURSE_ID,Semesters Taken,Number of Students


## Data Wrangling for Semester Results
### 1. Remove Unmatched Semester Results

* Remove entries from `semester_results` if the student does not exist in the `student_profiles` table.

### Remove Unmatched Semester Results

In [80]:
# Get the set of student IDs from student_profiles
student_profiles_ids = set(student_profiles['STUDENT ID'].dropna().unique())

# Filter semester_results to keep only rows where the student ID is in student_profiles
semester_results_cleaned = semester_results[semester_results['STUDENT ID'].isin(student_profiles_ids)].copy()

print("Semester results after removing unmatched student IDs:")
display(semester_results_cleaned.head())
print(f"\nNumber of rows before removal: {len(semester_results)}")
print(f"Number of rows after removal: {len(semester_results_cleaned)}")

# Update the semester_results DataFrame
semester_results = semester_results_cleaned

Semester results after removing unmatched student IDs:


Unnamed: 0,STUDENT ID,GPA,PERIOD
0,1101-009/001,3.5,1
1,1101-009/001,3.6,2
2,1101-009/001,3.7,3
3,1101-009/002,3.4,1
4,1101-009/002,3.5,2



Number of rows before removal: 555
Number of rows after removal: 525


### Standardising 'PERIOD'

In [81]:
# Standardize 'PERIOD' column by converting to numbers
def standardize_period(period):
    if isinstance(period, str):
        period_lower = period.lower().replace(' ', '').replace('semester', '').replace('sem', '') # Replace both 'semester' and 'sem'
        try:
            return int(period_lower)
        except ValueError:
            return None # Handle cases that cannot be converted
    return period # Return as is if not a string (e.g., already a number or NaN)

semester_results['PERIOD_cleaned'] = semester_results['PERIOD'].apply(standardize_period)

# Check for values that couldn't be standardized (will be None)
invalid_periods = semester_results[semester_results['PERIOD_cleaned'].isna() & semester_results['PERIOD'].notna()]

if not invalid_periods.empty:
    print("Rows with invalid 'PERIOD' values that could not be standardized:")
    display(invalid_periods[['STUDENT ID', 'PERIOD', 'PERIOD_cleaned']])
else:
    print("All 'PERIOD' values standardized successfully.")

# Drop the original 'PERIOD' column and rename the cleaned one
semester_results = semester_results.drop(columns=['PERIOD'])
semester_results = semester_results.rename(columns={'PERIOD_cleaned': 'PERIOD'})

print("\n'PERIOD' column after standardization:")
display(semester_results['PERIOD'].value_counts().reset_index())

All 'PERIOD' values standardized successfully.

'PERIOD' column after standardization:


Unnamed: 0,PERIOD,count
0,1,290
1,2,153
2,3,80
3,4,2


### "Had Passed" column

In [82]:
# Merge semester_results with student_profiles to get COURSE_ID
semester_results_with_course = pd.merge(semester_results, student_profiles[['STUDENT ID', 'COURSE_ID']], on='STUDENT ID', how='left')

# Convert COURSE_ID to the same data type as CODE in course_code before merging
semester_results_with_course['COURSE_ID'] = pd.to_numeric(semester_results_with_course['COURSE_ID'], errors='coerce')

# Merge with course_code to get the 'course type'
semester_results_with_course = pd.merge(semester_results_with_course, course_code[['CODE', 'course type']], left_on='COURSE_ID', right_on='CODE', how='left')

# Define a function to determine if a student passed the semester
def check_if_passed(row):
    gpa = row['GPA']
    course_type = row['course type']

    if pd.isna(gpa) or pd.isna(course_type):
        return False # Cannot determine if passed without GPA or course type

    if course_type in ['Certificate', 'Diploma']:
        return gpa >= 2.0
    elif course_type == 'Specialist Diploma':
        return gpa >= 2.3
    else:
        return False # Default to False for other or unknown course types

# Apply the function to create the 'Had Passed' column
semester_results_with_course['Had Passed'] = semester_results_with_course.apply(check_if_passed, axis=1)

# Drop the intermediate columns if desired
semester_results_with_course = semester_results_with_course.drop(columns=['COURSE_ID', 'CODE', 'course type'])

# Update the semester_results DataFrame
semester_results = semester_results_with_course

print("Semester results with 'Had Passed' column:")
display(semester_results.head())

# Display counts of True/False in 'Had Passed'
print("\nCounts of 'Had Passed' status:")
display(semester_results['Had Passed'].value_counts())

Semester results with 'Had Passed' column:


Unnamed: 0,STUDENT ID,GPA,PERIOD,Had Passed
0,1101-009/001,3.5,1,True
1,1101-009/001,3.6,2,True
2,1101-009/001,3.7,3,True
3,1101-009/002,3.4,1,True
4,1101-009/002,3.5,2,True



Counts of 'Had Passed' status:


Unnamed: 0_level_0,count
Had Passed,Unnamed: 1_level_1
True,522
False,27
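
The counts above sum to 549 rows, although `semester_results` had 525 rows before the merge. A left merge adds rows when a key appears more than once in the right-hand table, so one possible cause is repeated `STUDENT ID` values in `student_profiles` (an assumption; the notebook does not check this). The sketch below uses toy data to show how `validate='many_to_one'` makes `pd.merge` raise instead of silently duplicating rows.

```python
import pandas as pd

# Toy data: student 'A' appears twice in the profile table
results = pd.DataFrame({"STUDENT ID": ["A", "A", "B"], "GPA": [3.5, 3.6, 2.9]})
profiles = pd.DataFrame({"STUDENT ID": ["A", "A", "B"], "COURSE_ID": [1101, 1101, 2102]})

# Without validation, each 'A' result matches two profile rows
inflated = pd.merge(results, profiles, on="STUDENT ID", how="left")
print(len(results), len(inflated))  # 3 5

# validate='many_to_one' requires unique keys on the right-hand side
merge_error = None
try:
    pd.merge(results, profiles, on="STUDENT ID", how="left", validate="many_to_one")
except pd.errors.MergeError as exc:
    merge_error = exc
    print("merge rejected:", exc)

# Deduplicating the lookup table first keeps the row count stable
lookup = profiles.drop_duplicates("STUDENT ID")
safe = pd.merge(results, lookup, on="STUDENT ID", how="left", validate="many_to_one")
print(len(safe))  # 3
```

Comparing `len()` before and after each merge is a cheap way to spot this kind of inflation early.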


### Removing Duplicates

The only duplicates are rows in which every column is identical, so we keep the first occurrence of each and drop the rest.

In [83]:
# Drop exact duplicate rows from semester_results, keeping the first occurrence
semester_results_cleaned = semester_results.drop_duplicates(keep='first').copy()

print(f"Number of rows before dropping duplicates: {len(semester_results)}")
print(f"Number of rows after dropping duplicates: {len(semester_results_cleaned)}")

# Update the semester_results DataFrame
semester_results = semester_results_cleaned

print("\nSemester results after removing duplicates:")
display(semester_results.head())

Number of rows before dropping duplicates: 549
Number of rows after dropping duplicates: 492

Semester results after removing duplicates:


Unnamed: 0,STUDENT ID,GPA,PERIOD,Had Passed
0,1101-009/001,3.5,1,True
1,1101-009/001,3.6,2,True
2,1101-009/001,3.7,3,True
3,1101-009/002,3.4,1,True
4,1101-009/002,3.5,2,True


## Exporting Data
---

In [84]:
# Define file paths
course_code_file = "cleaned_course_codes.csv"
semester_results_file = "cleaned_semester_results.csv"
student_profiles_file = "cleaned_student_profiles.csv"

# Export DataFrames to CSV files
course_code.to_csv(course_code_file, index=False)
semester_results.to_csv(semester_results_file, index=False)
student_profiles.to_csv(student_profiles_file, index=False)

print(f"Cleaned data saved to:\n- {course_code_file}\n- {semester_results_file}\n- {student_profiles_file}")

# For Google Colab, you can also provide links to download the files
try:
    from google.colab import files
    print("\nClick the links below to download the files (Google Colab):")
    # These might not work if the files are very large or in a complex directory structure
    # You might need to manually download from the file explorer in Colab
    files.download(course_code_file)
    files.download(semester_results_file)
    files.download(student_profiles_file)
except ImportError:
    print("\nRunning in a local environment (likely VS Code). Files saved to the current directory.")
    print("You can find the files in your file explorer.")

Cleaned data saved to:
- cleaned_course_codes.csv
- cleaned_semester_results.csv
- cleaned_student_profiles.csv

Click the links below to download the files (Google Colab):

