<a href="https://colab.research.google.com/github/swamhtetg90/DAVI-CA2/blob/Swam-N-Ben-Merge/data_cleaning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# DAVI Data Cleaning
## Group-10
## Group Members: Devendran Yoheswaran, Kaung Myat San, Swam Htet Aung
---

## Metadata
---
### Student Profiles

> **Note:** Data is manually entered, so the values are not standardized.

| **Field Name**                            | **Description**                                                                                                                          | **Example**                |
| ----------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------- | -------------------------- |
| **STUDENT ID**                            | Student ID is made up of three attributes: `<Course code>-<Intake No>/<Index Number of student in the intake>`                           | `1101-013/001`             |
| **GENDER**                                | Gender                                                                                                                                   | `M`, `F`                   |
| **SG CITIZEN**                            | Singapore Citizen                                                                                                                        | `Y` or blank               |
| **SG PR**                                 | Singapore Permanent Resident                                                                                                             | `Y` or blank               |
| **FOREIGNER**                             | Neither SG Citizen nor SG PR (mutually exclusive with SG CITIZEN and SG PR)                                                              | `Y` or blank               |
| **COUNTRY OF OTHER NATIONALITY**          | Country of nationality (only for SG PR or foreigner)                                                                                     | `Malaysia`, `India`, etc.  |
| **DOB**                                   | Date of Birth. Format: `DD/MM/YYYY`                                                                                                      | `04/03/1978`               |
| **HIGHEST QUALIFICATION**                 | Highest qualification attained prior to this course                                                                                      | `Certificate`, `Diploma`   |
| **NAME OF QUALIFICATION AND INSTITUTION** | Name of the highest qualification and the institution where it was attained                                                              | As provided by participant |
| **DATE ATTAINED HIGHEST QUALIFICATION**   | Date when the qualification was awarded. Format: `DD/MM/YYYY`                                                                            | `06/11/2016`               |
| **DESIGNATION**                           | Job designation                                                                                                                          | As provided by participant |
| **COMMENCEMENT DATE**                     | Course start date. Format: `DD/MM/YYYY`                                                                                                  | `06/01/2023`               |
| **COMPLETION DATE**                       | Course end date. Blank if course is ongoing. Format: `DD/MM/YYYY`                                                                        | `06/04/2024`               |
| **FULL-TIME OR PART-TIME**                | Whether the course is Full-time or Part-time                                                                                             | `Full-Time`, `Part-Time`   |
| **COURSE FUNDING**                        | Course funding type:<br>- `Individual`<br>- `Individual - SFC` (SkillsFuture Credit)<br>- `Sponsored`<br>- `Individual - waived App Fee` | `Individual - SFC`         |
| **REGISTRATION FEE**                      | Registration fee in SGD                                                                                                                  | As entered                 |
| **PAYMENT MODE**                          | Mode of payment                                                                                                                          | `NETS`, `Giro`, `PayNow`   |
| **COURSE FEE**                            | Course fee in SGD                                                                                                                        | As entered                 |

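Two of the rules in the table above can be checked mechanically: the three-part layout of `STUDENT ID` and the mutual exclusivity of the residency flags. The following is a minimal sketch, assuming IDs follow the `<4 digits>-<3 digits>/<3 digits>` pattern of the example and that exactly one flag should be `Y` per row. The toy DataFrame and the new column names (`COURSE_CODE`, `INTAKE_NO`, `INDEX_NO`, `STATUS_FLAGS`) are illustrative assumptions and are not part of the dataset.

```python
import pandas as pd

# Toy rows for illustration only; not taken from the real dataset.
profiles = pd.DataFrame({
    "STUDENT ID": ["1101-013/001", "2101-106/004", "bad-id"],
    "SG CITIZEN": ["Y", None, "Y"],
    "SG PR": [None, None, "Y"],
    "FOREIGNER": [None, "Y", None],
})

# Split the ID into course code, intake number and index number.
# Rows that do not match the assumed pattern get NaN in all three parts.
parts = profiles["STUDENT ID"].str.extract(r"^(\d{4})-(\d{3})/(\d{3})$")
parts.columns = ["COURSE_CODE", "INTAKE_NO", "INDEX_NO"]
profiles = profiles.join(parts)

# Count how many residency flags are set to 'Y' in each row.
# Under the assumption above, any value other than 1 needs a closer look.
flag_cols = ["SG CITIZEN", "SG PR", "FOREIGNER"]
profiles["STATUS_FLAGS"] = profiles[flag_cols].eq("Y").sum(axis=1)

print(profiles[["STUDENT ID", "COURSE_CODE", "INTAKE_NO", "INDEX_NO", "STATUS_FLAGS"]])
```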
---

### Course Codes

| **Field Name**  | **Description**        | **Example**                          |
| --------------- | ---------------------- | ------------------------------------ |
| **CODE**        | Course code (4 digits) | `1101`                               |
| **COURSE NAME** | Course name            | `Diploma in Business Administration` |

---

### Semester Results

| **Field Name** | **Description**                                                                                                           | **Example**    |
| -------------- | ------------------------------------------------------------------------------------------------------------------------- | -------------- |
| **STUDENT ID** | Must match the ID in the Student Profile dataset                                                                          | `1101-013/001` |
| **PERIOD**     | Semester number<br>• Certificate: 1 semester<br>• Diploma: 3 semesters<br>• Master’s: 2 semesters (some exceptions apply) | `1`, `2`, `3`  |
| **GPA**        | GPA for the semester.<br>• Max GPA: 4<br>• Pass GPA: Certificate/Diploma = 2, Master's = 2.3                              | `3.2`          |

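The pass thresholds listed for `GPA` can be applied per qualification level. A minimal sketch, assuming each row already carries a qualification level and that a GPA equal to the threshold counts as a pass. The `LEVEL` column, the `PASS_GPA` mapping and the toy rows are hypothetical and do not exist in the dataset.

```python
import pandas as pd

# Pass marks from the metadata table: 2 for Certificate/Diploma, 2.3 for Master's.
PASS_GPA = {"Certificate": 2.0, "Diploma": 2.0, "Master's": 2.3}

# Toy rows for illustration only; LEVEL is an assumed helper column.
results = pd.DataFrame({
    "STUDENT ID": ["1101-013/001", "1101-013/002", "9999-001/001"],
    "LEVEL": ["Diploma", "Diploma", "Master's"],
    "GPA": [3.2, 1.8, 2.2],
})

# Look up the threshold for each row and compare it with the GPA.
# Assumption: a GPA equal to the threshold is treated as a pass.
results["PASSED"] = results["GPA"] >= results["LEVEL"].map(PASS_GPA)

print(results)
```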
---


## Importing Modules
---

In [679]:
import pandas as pd
import numpy as np

## Loading Data
---

#### Course_Code Data

In [680]:
course_code = pd.read_excel("https://github.com/swamhtetg90/DAVI-CA2/blob/main/DAVI%20CA2%20datasets%20and%20meta%20data/Course%20Codes.xlsx?raw=true")
course_code.head()

Unnamed: 0,CODE,COURSE NAME
0,1101,Diploma in Business Administration
1,1102,Diploma in Business Analytics
2,2101,Certificate in Digital Marketing
3,2102,Certificate in HR Management
4,2013,Certificate in Tourism Management


#### Semester Results Data

In [681]:
semester_results = pd.read_excel("https://github.com/swamhtetg90/DAVI-CA2/blob/main/DAVI%20CA2%20datasets%20and%20meta%20data/Semester%20Results.xlsx?raw=true")
semester_results.head()

Unnamed: 0,STUDENT ID,PERIOD,GPA
0,1101-009/001,Sem 1,3.5
1,1101-009/001,Sem 2,3.6
2,1101-009/001,Sem 3,3.7
3,1101-009/002,Sem 1,3.4
4,1101-009/002,Sem 2,3.5


#### Student Profiles Data

In [682]:
student_profiles = pd.read_excel("https://github.com/swamhtetg90/DAVI-CA2/blob/main/DAVI%20CA2%20datasets%20and%20meta%20data/Student%20Profiles.xlsx?raw=true")
student_profiles.head()

Unnamed: 0,STUDENT ID,GENDER,SG CITIZEN,SG PR,FOREIGNER,COUNTRY OF OTHER NATIONALITY,DOB,HIGHEST QUALIFICATION,NAME OF QUALIFICATION AND INSTITUTION,DATE ATTAINED HIGHEST QUALIFICATION,DESIGNATION,COMMENCEMENT DATE,COMPLETION DATE,FULL-TIME OR PART-TIME,COURSE FUNDING,REGISTRATION FEE,PAYMENT MODE,COURSE FEE
0,1101-009/001,F,,,Y,Malaysia,13/09/1981,Certificate,SPM,2018-01-08,Admin & HR Assistant,2022-04-18 00:00:00,2023-09-17 00:00:00,Part-Time,Individual,107,GIRO,5136
1,1101-009/002,F,Y,,,,26/07/1979,Certificate,"Certificate in Office Skills, ITE",2016-06-08,Admin Assistant,2022-04-18 00:00:00,2023-09-17 00:00:00,Part-Time,Individual-SFC,107,NETS,5136
2,1101-009/003,F,,,Y,India,01/02/1990,Degree,"Bachelor of Business Administration, Universit...",2015-08-08,-,2022-04-18 00:00:00,2023-09-17 00:00:00,Part-Time,Individual,107,NETS,5136
3,1101-009/004,F,,,Y,Netherlands,20/04/1976,Diploma,"Office Management Diploma, NCOI Rotterdam, The...",2018-02-08,HR Support / Office Manager,2022-04-18 00:00:00,2023-09-17 00:00:00,Part-Time,Individual,107,NETS,5136
4,1101-009/005,F,Y,,,,25/11/1983,Diploma,"Diploma in Business Admininstration, LCCI Leve...",2015-06-08,"Executive, Administration",2022-04-18 00:00:00,2023-09-17 00:00:00,Part-Time,Sponsored,107,GIRO,4812


## Data Analysis & Cleaning
---

### Course Code

In [683]:
course_code.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   CODE         7 non-null      int64 
 1   COURSE NAME  7 non-null      object
dtypes: int64(1), object(1)
memory usage: 244.0+ bytes


In [684]:
course_code.head(10)

Unnamed: 0,CODE,COURSE NAME
0,1101,Diploma in Business Administration
1,1102,Diploma in Business Analytics
2,2101,Certificate in Digital Marketing
3,2102,Certificate in HR Management
4,2013,Certificate in Tourism Management
5,5112,Specialist Diploma in Business Innovation and ...
6,5113,Specialist Diploma in Intelligent Systems


#### Course Code DataFrame Summary

- The **`Course Code` DataFrame** contains **no null values**.
- It has a total of **7 rows**.
- The data appears to be **cleaned and ready for use**.

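The conclusion above can be backed by two quick checks: that every code is unique and that every code has four digits, as the metadata specifies. A minimal sketch on a toy frame, where `codes` is a stand-in for the real `course_code` DataFrame and the variable names are illustrative.

```python
import pandas as pd

# Stand-in for course_code; rows copied from the head() output above.
codes = pd.DataFrame({
    "CODE": [1101, 1102, 2101],
    "COURSE NAME": [
        "Diploma in Business Administration",
        "Diploma in Business Analytics",
        "Certificate in Digital Marketing",
    ],
})

# Every course code should appear only once.
has_unique_codes = codes["CODE"].is_unique

# Every course code should consist of exactly four digits.
has_four_digits = codes["CODE"].astype(str).str.fullmatch(r"\d{4}").all()

print(has_unique_codes, has_four_digits)
```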

### Semester Results
---
#### STUDENT ID
---

In [685]:
# Get sets of Student IDs from both DataFrames
ids_profiles = set(student_profiles['STUDENT ID'].dropna().unique())
ids_results = set(semester_results['STUDENT ID'].dropna().unique())

# Find IDs that are in semester_results but not in student_profiles
only_in_results = ids_results - ids_profiles

if only_in_results:
    print("Student IDs found in semester_results but not in student_profiles:")
    print(only_in_results)

    # Optionally, display the rows from semester_results for these IDs
    mismatched_rows = semester_results[semester_results['STUDENT ID'].isin(only_in_results)]
    print("\nMismatched rows from semester_results:")
    display(mismatched_rows)
else:
    print("All STUDENT IDs in semester_results are also present in student_profiles.")

Student IDs found in semester_results but not in student_profiles:
{'5112-007/004', '2101-106/004', '2101-106/003', '5112-007/005', '2101-106/002', '2101-106/001', '5112-007/006', '5112-007/003', '2101-106/005', '5112-007/002', '5112-007/001'}

Mismatched rows from semester_results:


Unnamed: 0,STUDENT ID,PERIOD,GPA
116,2101-106/001,Sem 1,2.4
117,2101-106/002,Sem 1,3.1
118,2101-106/003,Sem 1,3.4
119,2101-106/004,Sem 1,2.8
120,2101-106/005,Sem 1,2.3
130,5112-007/001,Sem 1,2.9
131,5112-007/001,Sem 2,3.8
132,5112-007/002,Sem 1,3.2
133,5112-007/002,Sem 2,3.3
134,5112-007/003,Sem 1,3.6


The output above shows that some students exist only in `Semester Results` and not in `Student Profiles`.

We will remove these rows because we have no profile information on these students.

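The removal described above could be done with a boolean mask. A minimal sketch on toy data, assuming the rule is simply "keep a result only if its `STUDENT ID` has a profile". The frames `profiles` and `results` stand in for `student_profiles` and `semester_results`.

```python
import pandas as pd

# Toy stand-ins for student_profiles and semester_results.
profiles = pd.DataFrame({"STUDENT ID": ["1101-009/001", "1101-009/002"]})
results = pd.DataFrame({
    "STUDENT ID": ["1101-009/001", "1101-009/002", "2101-106/001"],
    "GPA": [3.5, 3.4, 2.4],
})

# Keep only the results whose STUDENT ID also appears in the profiles.
has_profile = results["STUDENT ID"].isin(profiles["STUDENT ID"])
results_kept = results[has_profile].reset_index(drop=True)

print(f"Dropped {len(results) - len(results_kept)} row(s) without a profile.")
print(results_kept)
```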
#### PERIOD
---

In [686]:
# Check unique values in the 'PERIOD' column
unique_periods = semester_results['PERIOD'].value_counts().reset_index()
unique_periods.columns = ['Period', 'Count']
print("Unique values in 'PERIOD' column:")
display(unique_periods)

Unique values in 'PERIOD' column:


Unnamed: 0,Period,Count
0,Sem 1,223
1,Sem 2,124
2,Semester 1,56
3,Sem 3,45
4,Semester 2,37
5,Semester 3,37
6,Sem1,30
7,Sem 4,1
8,Semester 4,1
9,Sem2,1


Since the values are not standardized, we will convert them to numbers. For example:
* `Sem 1` → `1`
* `Semester 1` → `1`
* `Sem 2` → `2`

In [687]:
# Standardize 'PERIOD' column by converting to numbers
def standardize_period(period):
    if isinstance(period, str):
        period_lower = period.lower().replace(' ', '').replace('semester', '').replace('sem', '') # Replace both 'semester' and 'sem'
        try:
            return int(period_lower)
        except ValueError:
            return None # Handle cases that cannot be converted
    return period # Return as is if not a string (e.g., already a number or NaN)

semester_results['PERIOD_cleaned'] = semester_results['PERIOD'].apply(standardize_period)

# Check for values that couldn't be standardized (will be None)
invalid_periods = semester_results[semester_results['PERIOD_cleaned'].isna() & semester_results['PERIOD'].notna()]

if not invalid_periods.empty:
    print("Rows with invalid 'PERIOD' values that could not be standardized:")
    display(invalid_periods[['STUDENT ID', 'PERIOD', 'PERIOD_cleaned']])
else:
    print("All 'PERIOD' values standardized successfully.")

# Drop the original 'PERIOD' column and rename the cleaned one
semester_results = semester_results.drop(columns=['PERIOD'])
semester_results = semester_results.rename(columns={'PERIOD_cleaned': 'PERIOD'})

print("\n'PERIOD' column after standardization:")
display(semester_results['PERIOD'].value_counts().reset_index())

All 'PERIOD' values standardized successfully.

'PERIOD' column after standardization:


Unnamed: 0,PERIOD,count
0,1,309
1,2,162
2,3,82
3,4,2


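For comparison, the same standardization can be written without `apply` by pulling the trailing digits out with a regular expression. This is an alternative sketch, not the method used in the cell above. It assumes every label ends in its semester number, and `periods` is a toy Series.

```python
import pandas as pd

# Toy labels covering the spellings seen in the PERIOD column.
periods = pd.Series(["Sem 1", "Semester 2", "Sem1", "Sem 3", None])

# Capture the digits at the end of each label; non-matching values become NaN.
digits = periods.str.extract(r"(\d+)\s*$")[0]

# Nullable integer dtype keeps missing values without turning 1 into 1.0.
period_numbers = pd.to_numeric(digits, errors="coerce").astype("Int64")

print(period_numbers.tolist())
```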
#### GPA
---

In [688]:
# Check data type and summary statistics for 'GPA'
print("Info and descriptive statistics for 'GPA':")
semester_results['GPA'].info()
display(semester_results['GPA'].describe())

# Check unique values in 'GPA' to spot any non-numeric entries
print("\nUnique values in 'GPA' column:")
display(semester_results['GPA'].unique())

# Attempt to convert 'GPA' to numeric, coercing errors
semester_results['GPA_cleaned'] = pd.to_numeric(semester_results['GPA'], errors='coerce')

# Check for values that couldn't be parsed (will be NaN)
invalid_gpa = semester_results[semester_results['GPA_cleaned'].isna() & semester_results['GPA'].notna()]

print("\nRows with invalid 'GPA' values that could not be converted to numeric:")
display(invalid_gpa[['STUDENT ID', 'PERIOD', 'GPA', 'GPA_cleaned']])

# Drop the temporary cleaned column for now if there are invalid values, or proceed with cleaning
# If there are invalid values, you'll need to decide how to handle them (e.g., drop rows, impute)
# If there are no invalid values, you can drop the original and rename the cleaned column
if invalid_gpa.empty:
    print("\nNo invalid GPA values found that could not be converted to numeric.")
    # Proceed to check for values outside the expected range (0-4)
    unrealistic_gpa = semester_results[(semester_results['GPA_cleaned'] < 0) | (semester_results['GPA_cleaned'] > 4)]
    if not unrealistic_gpa.empty:
        print("\nRows with unrealistic 'GPA' values (outside 0-4 range):")
        display(unrealistic_gpa[['STUDENT ID', 'PERIOD', 'GPA', 'GPA_cleaned']])
    else:
        print("\nNo unrealistic GPA values found (outside 0-4 range).")
else:
    print("\nPlease handle the invalid GPA values shown above before proceeding.")

# Drop the temporary cleaned column
semester_results = semester_results.drop(columns=['GPA_cleaned'], errors='ignore')

Info and descriptive statistics for 'GPA':
<class 'pandas.core.series.Series'>
RangeIndex: 555 entries, 0 to 554
Series name: GPA
Non-Null Count  Dtype  
--------------  -----  
555 non-null    float64
dtypes: float64(1)
memory usage: 4.5 KB


Unnamed: 0,GPA
count,555.0
mean,3.104865
std,0.606161
min,1.6
25%,2.7
50%,3.2
75%,3.6
max,4.0



Unique values in 'GPA' column:


array([3.5, 3.6, 3.7, 3.4, 3.3, 3.2, 3.9, 3.8, 2.8, 2.9, 3. , 2.1, 2.2,
       2.3, 1.9, 2. , 2.4, 1.6, 2.3, 2.7, 2.5, 4. , 3.8, 2.4, 2.9, 2.2,
       3. , 3.9, 2.5, 1.9, 1.7, 3.1, 3.2, 2.5, 2.6, 3.3, 3.1, 3.7, 2.6,
       3. , 2.8, 2.1, 2. , 2.7, 3.6, 3.6, 1.8])


Rows with invalid 'GPA' values that could not be converted to numeric:


Unnamed: 0,STUDENT ID,PERIOD,GPA,GPA_cleaned



No invalid GPA values found that could not be converted to numeric.

No unrealistic GPA values found (outside 0-4 range).


#### Check for duplicated rows
---

In [689]:
# Check for duplicate rows based on 'STUDENT ID' and 'PERIOD'
duplicate_semester_results_period = semester_results[semester_results.duplicated(subset=['STUDENT ID', 'PERIOD'], keep=False)]

if duplicate_semester_results_period.empty:
    print("No duplicate rows found based on 'STUDENT ID' and 'PERIOD'.")
else:
    print("Duplicate rows found based on 'STUDENT ID' and 'PERIOD':")
    # Sort by 'STUDENT ID' and 'PERIOD' to show duplicates next to each other
    display(duplicate_semester_results_period.sort_values(by=['STUDENT ID', 'PERIOD']))

# Calculate total number of duplicate rows based on STUDENT ID and PERIOD
total_period_duplicates_count = duplicate_semester_results_period.shape[0]
print(f"\nTotal number of duplicate rows based on 'STUDENT ID' and 'PERIOD': {total_period_duplicates_count}")


# Calculate the number of exact duplicate rows
exact_duplicates = semester_results[semester_results.duplicated(keep=False)]
exact_duplicates_count = exact_duplicates.shape[0]
print(f"Total number of exact duplicate rows (all columns): {exact_duplicates_count}")

# Calculate the number of rows where STUDENT ID and PERIOD are the same, but GPA is different
# This is the total period duplicates minus the exact duplicates
mismatched_gpa_duplicates_count = total_period_duplicates_count - exact_duplicates_count
print(f"Number of duplicate rows based on 'STUDENT ID' and 'PERIOD' with different 'GPA': {mismatched_gpa_duplicates_count}")

# Optionally, display rows where STUDENT ID and PERIOD are the same, but GPA is different
if mismatched_gpa_duplicates_count > 0:
    print("\nDuplicate rows based on 'STUDENT ID' and 'PERIOD' with different 'GPA':")
    # Filter out exact duplicates from the period duplicates
    mismatched_gpa_rows = duplicate_semester_results_period[~duplicate_semester_results_period.duplicated(keep=False)]
    display(mismatched_gpa_rows.sort_values(by=['STUDENT ID', 'PERIOD']))

Duplicate rows found based on 'STUDENT ID' and 'PERIOD':


Unnamed: 0,STUDENT ID,GPA,PERIOD
243,1102-003/001,1.9,1
264,1102-003/001,1.9,1
244,1102-003/001,2.0,2
265,1102-003/001,2.0,2
245,1102-003/001,2.4,3
...,...,...,...
507,2102-069/009,3.0,1
496,2102-069/010,3.5,1
508,2102-069/010,3.5,1
497,2102-069/011,3.2,1



Total number of duplicate rows based on 'STUDENT ID' and 'PERIOD': 66
Total number of exact duplicate rows (all columns): 66
Number of duplicate rows based on 'STUDENT ID' and 'PERIOD' with different 'GPA': 0


Since every duplicated row is identical across all columns, we can keep one copy of each and drop the rest.

In [690]:
# Drop exact duplicate rows from semester_results, keeping the first occurrence
semester_results_cleaned = semester_results.drop_duplicates(keep='first').copy()

print(f"Number of rows before dropping duplicates: {len(semester_results)}")
print(f"Number of rows after dropping duplicates: {len(semester_results_cleaned)}")

# Update the semester_results DataFrame
semester_results = semester_results_cleaned

print("\nSemester results after removing duplicates:")
display(semester_results.head())

Number of rows before dropping duplicates: 555
Number of rows after dropping duplicates: 522

Semester results after removing duplicates:


Unnamed: 0,STUDENT ID,GPA,PERIOD
0,1101-009/001,3.5,1
1,1101-009/001,3.6,2
2,1101-009/001,3.7,3
3,1101-009/002,3.4,1
4,1101-009/002,3.5,2


### Student Profile
---

#### Student_ID

In [691]:
# Check for duplicate student IDs in student_profiles
duplicate_student_profiles = student_profiles[student_profiles.duplicated(subset=['STUDENT ID'], keep=False)]

if duplicate_student_profiles.empty:
    print("No duplicate STUDENT IDs found in student_profiles.")
else:
    print("Duplicate STUDENT IDs found in student_profiles:")
    # Sort by 'STUDENT ID' to show duplicates next to each other
    display(duplicate_student_profiles.sort_values(by='STUDENT ID'))

Duplicate STUDENT IDs found in student_profiles:


Unnamed: 0,STUDENT ID,GENDER,SG CITIZEN,SG PR,FOREIGNER,COUNTRY OF OTHER NATIONALITY,DOB,HIGHEST QUALIFICATION,NAME OF QUALIFICATION AND INSTITUTION,DATE ATTAINED HIGHEST QUALIFICATION,DESIGNATION,COMMENCEMENT DATE,COMPLETION DATE,FULL-TIME OR PART-TIME,COURSE FUNDING,REGISTRATION FEE,PAYMENT MODE,COURSE FEE
74,2101-107/001,F,,Y,,Malaysia,08/07/1984,Degree,"Bachelor of Science (HRD), Universiti Teknolog...",2016-04-04,Manager,2022-04-24 00:00:00,2022-09-23 00:00:00,Part Time,Indivodual,107,NETS,2996
75,2101-107/001,F,,Y,,Malaysia,08/07/1984,Degree,"Bachelor of Science (HRD), Universiti Teknolog...",2016-04-04,Manager,2022-04-24 00:00:00,2022-09-23 00:00:00,Part Time,Indivodual,107,NETS,2996
217,5112-008/001,M,Y,,,,22/01/1983,Degree,Degree of Bachelor of Science with Honours in ...,2017-12-20,N.A.,2022-10-18 00:00:00,2023-09-29 00:00:00,Part-Time,Individual,107,Nets,5803
218,5112-008/001,M,Y,,,,22/01/1983,Degree,Degree of Bachelor of Science with Honours in ...,2017-12-20,Admin Manager,2022-10-18 00:00:00,2023-09-29 00:00:00,Part-Time,Individual,107,Nets,5803
221,5112-008/004,F,,,Y,China,06/08/1989,Degree,Bachelor of Business (Accounting)/\nMurdoch Un...,2018-10-20,Admin Supervisor,2022-10-18 00:00:00,2023-09-29 00:00:00,Part-Time,Individual,107,Nets,5803
222,5112-008/004,F,,,Y,China,06/08/1989,Degree,Bachelor of Business (Accounting)/\nMurdoch Un...,2018-10-20,Admin Supervisor,2022-10-18 00:00:00,2023-09-29 00:00:00,Part-Time,Individual,107,Nets,5803
276,5113-007/001,M,Y,,,,11/02/1972,Degree,Bachelor of Arts in Human Resource Management ...,1996-05-10,Manager,2023-10-16 00:00:00,2024-09-20 00:00:00,Part-Time,Individual,107,Nets,5803
285,5113-007/001,M,Y,,,,11/02/1972,Degree,Bachelor of Arts in Human Resource Management ...,1996-05-10,Manager,2023-10-16 00:00:00,2024-09-20 00:00:00,Part-Time,Individual,107,Nets,5803
277,5113-007/002,M,Y,,,,17/07/1966,Degree,Bachelor of Science/\nNational University of S...,1990-07-10,Assistant Vice-President,2023-10-16 00:00:00,2025-03-18 00:00:00,Part-Time,Individual,107,Nets,5803
286,5113-007/002,M,Y,,,,17/07/1966,Degree,Bachelor of Science/\nNational University of S...,1990-07-10,Assistant Vice-President,2023-10-16 00:00:00,2024-09-20 00:00:00,Part-Time,Individual,107,Nets,5803


Some rows share the same Student ID.

Looking through these rows, not all of them are exact duplicates of the whole row.
Let us take a look at those.

In [692]:
# Find rows that are duplicates based on 'STUDENT ID'
duplicate_ids_mask = student_profiles.duplicated(subset=['STUDENT ID'], keep=False)

# Find rows that are exact duplicates across all columns
exact_duplicates_mask = student_profiles.duplicated(keep=False)

# Filter to find rows where 'STUDENT ID' is duplicated but the whole row is not an exact duplicate
mismatched_duplicate_rows = student_profiles[duplicate_ids_mask & ~exact_duplicates_mask].copy()

if mismatched_duplicate_rows.empty:
    print("No rows found where STUDENT ID is duplicated but the entire row is different.")
else:
    print("Rows with duplicate STUDENT IDs but different content:")
    # Sort by 'STUDENT ID' to group mismatched rows for the same student together
    mismatched_duplicate_rows_sorted = mismatched_duplicate_rows.sort_values(by='STUDENT ID')
    display(mismatched_duplicate_rows_sorted)

    print("\nDifferences between the rows:")
    # Group by STUDENT ID and compare rows within each group
    for student_id, group in mismatched_duplicate_rows_sorted.groupby('STUDENT ID'):
        print(f"\n--- Differences for STUDENT ID: {student_id} ---")
        # Assuming there are only two rows for each mismatched ID for simplicity in comparison
        if len(group) >= 2:
            row1 = group.iloc[0]
            row2 = group.iloc[1]
            differences = row1.compare(row2)
            if differences.empty:
                print("Rows are identical (no differences found).")
            else:
                display(differences)
        else:
            print("Not enough rows to compare for this STUDENT ID.")

Rows with duplicate STUDENT IDs but different content:


Unnamed: 0,STUDENT ID,GENDER,SG CITIZEN,SG PR,FOREIGNER,COUNTRY OF OTHER NATIONALITY,DOB,HIGHEST QUALIFICATION,NAME OF QUALIFICATION AND INSTITUTION,DATE ATTAINED HIGHEST QUALIFICATION,DESIGNATION,COMMENCEMENT DATE,COMPLETION DATE,FULL-TIME OR PART-TIME,COURSE FUNDING,REGISTRATION FEE,PAYMENT MODE,COURSE FEE
217,5112-008/001,M,Y,,,,22/01/1983,Degree,Degree of Bachelor of Science with Honours in ...,2017-12-20,N.A.,2022-10-18 00:00:00,2023-09-29 00:00:00,Part-Time,Individual,107,Nets,5803
218,5112-008/001,M,Y,,,,22/01/1983,Degree,Degree of Bachelor of Science with Honours in ...,2017-12-20,Admin Manager,2022-10-18 00:00:00,2023-09-29 00:00:00,Part-Time,Individual,107,Nets,5803
277,5113-007/002,M,Y,,,,17/07/1966,Degree,Bachelor of Science/\nNational University of S...,1990-07-10,Assistant Vice-President,2023-10-16 00:00:00,2025-03-18 00:00:00,Part-Time,Individual,107,Nets,5803
286,5113-007/002,M,Y,,,,17/07/1966,Degree,Bachelor of Science/\nNational University of S...,1990-07-10,Assistant Vice-President,2023-10-16 00:00:00,2024-09-20 00:00:00,Part-Time,Individual,107,Nets,5803



Differences between the rows:

--- Differences for STUDENT ID: 5112-008/001 ---


Unnamed: 0,self,other
DESIGNATION,N.A.,Admin Manager



--- Differences for STUDENT ID: 5113-007/002 ---


Unnamed: 0,self,other
COMPLETION DATE,2025-03-18 00:00:00,2024-09-20 00:00:00


From looking at the differences:
* **5112-008/001:** We keep the row where `DESIGNATION` is `Admin Manager`, as the student might have forgotten to add it the first time.
* **5113-007/002:** For `COMPLETION DATE`, let us check the other students in the same course and intake to see which date is more plausible.

In [693]:
# Filter student_profiles for the specific course code and intake number
course_intake_identifier = '5113-007'
relevant_students = student_profiles[student_profiles['STUDENT ID'].str.startswith(course_intake_identifier)].copy()

# Display the student IDs and their completion dates for this group
print(f"Completion dates for students in intake {course_intake_identifier}:")
display(relevant_students[['STUDENT ID', 'COMPLETION DATE']])

Completion dates for students in intake 5113-007:


Unnamed: 0,STUDENT ID,COMPLETION DATE
276,5113-007/001,2024-09-20 00:00:00
277,5113-007/002,2025-03-18 00:00:00
278,5113-007/003,2024-09-20 00:00:00
279,5113-007/004,2024-09-20 00:00:00
280,5113-007/005,2024-09-20 00:00:00
281,5113-007/005,2024-09-20 00:00:00
282,5113-007/006,2024-09-20 00:00:00
283,5113-007/007,2024-09-20 00:00:00
284,5113-007/008,2024-09-20 00:00:00
285,5113-007/001,2024-09-20 00:00:00


All other rows shown for this intake have a completion date of `2024-09-20 00:00:00`, so that date is more plausible and we keep that row instead of the one with `2025-03-18 00:00:00`.

We will therefore remove rows `217` and `277`.

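The reasoning above, picking the completion date shared by most of the intake, can also be expressed in code. A minimal sketch on toy data, assuming the majority date is the correct one. The names `cohort`, `modal_date` and `is_outlier` are illustrative.

```python
import pandas as pd

# Toy cohort modelled on intake 5113-007: one date differs from the rest.
cohort = pd.DataFrame({
    "STUDENT ID": ["5113-007/001", "5113-007/002", "5113-007/002", "5113-007/003"],
    "COMPLETION DATE": pd.to_datetime(
        ["2024-09-20", "2025-03-18", "2024-09-20", "2024-09-20"]
    ),
})

# The most frequent completion date in the intake.
modal_date = cohort["COMPLETION DATE"].mode().iloc[0]

# Rows that share a STUDENT ID with another row and deviate from that date.
is_duplicate_id = cohort["STUDENT ID"].duplicated(keep=False)
is_outlier = is_duplicate_id & (cohort["COMPLETION DATE"] != modal_date)

print(cohort[is_outlier])
```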
In [694]:
# Remove rows with index 217 and 277
rows_to_remove = [217, 277]
student_profiles_cleaned = student_profiles.drop(index=rows_to_remove, errors='ignore').copy()

print(f"Number of rows before removal: {len(student_profiles)}")
print(f"Number of rows after removal: {len(student_profiles_cleaned)}")

# Update the student_profiles DataFrame
student_profiles = student_profiles_cleaned

print("\nStudent profiles after removing specified rows:")
display(student_profiles.head())

Number of rows before removal: 307
Number of rows after removal: 305

Student profiles after removing specified rows:


Unnamed: 0,STUDENT ID,GENDER,SG CITIZEN,SG PR,FOREIGNER,COUNTRY OF OTHER NATIONALITY,DOB,HIGHEST QUALIFICATION,NAME OF QUALIFICATION AND INSTITUTION,DATE ATTAINED HIGHEST QUALIFICATION,DESIGNATION,COMMENCEMENT DATE,COMPLETION DATE,FULL-TIME OR PART-TIME,COURSE FUNDING,REGISTRATION FEE,PAYMENT MODE,COURSE FEE
0,1101-009/001,F,,,Y,Malaysia,13/09/1981,Certificate,SPM,2018-01-08,Admin & HR Assistant,2022-04-18 00:00:00,2023-09-17 00:00:00,Part-Time,Individual,107,GIRO,5136
1,1101-009/002,F,Y,,,,26/07/1979,Certificate,"Certificate in Office Skills, ITE",2016-06-08,Admin Assistant,2022-04-18 00:00:00,2023-09-17 00:00:00,Part-Time,Individual-SFC,107,NETS,5136
2,1101-009/003,F,,,Y,India,01/02/1990,Degree,"Bachelor of Business Administration, Universit...",2015-08-08,-,2022-04-18 00:00:00,2023-09-17 00:00:00,Part-Time,Individual,107,NETS,5136
3,1101-009/004,F,,,Y,Netherlands,20/04/1976,Diploma,"Office Management Diploma, NCOI Rotterdam, The...",2018-02-08,HR Support / Office Manager,2022-04-18 00:00:00,2023-09-17 00:00:00,Part-Time,Individual,107,NETS,5136
4,1101-009/005,F,Y,,,,25/11/1983,Diploma,"Diploma in Business Admininstration, LCCI Leve...",2015-06-08,"Executive, Administration",2022-04-18 00:00:00,2023-09-17 00:00:00,Part-Time,Sponsored,107,GIRO,4812


In [695]:
import pandas as pd

# Step 1: Get unique student IDs from each DataFrame
ids_profiles = set(student_profiles['STUDENT ID'].dropna().unique())
ids_results = set(semester_results['STUDENT ID'].dropna().unique())

# Step 2: Compare sets
if ids_profiles == ids_results:
    print("STUDENT ID columns match exactly in both datasets.")
else:
    print("Mismatch found between STUDENT ID columns.")

    # Extra: Show differences
    only_in_profiles = ids_profiles - ids_results
    only_in_results = ids_results - ids_profiles

    if only_in_profiles:
        print("Student IDs only in student_profiles:")
        print(only_in_profiles)

    if only_in_results:
        print("Student IDs only in semester_results:")
        print(only_in_results)


Mismatch found between STUDENT ID columns.
Student IDs only in student_profiles:
{'5113-009/006', '5113-009/007', '2101-111/008', '2101-111/004', '5113-009/004', '2101-111/003', '2101-111/007', '5113-009/003', '2101-111/006', '2101-111/002', '5113-009/002', '5113-009/001', '2101-111/005', '5113-009/005', '2101-111/001'}
Student IDs only in semester_results:
{'5112-007/004', '2101-106/004', '2101-106/003', '5112-007/005', '2101-106/002', '2101-106/001', '5112-007/006', '5112-007/003', '2101-106/005', '5112-007/002', '5112-007/001'}


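The two set differences above can also be obtained in one pass with an outer merge and `indicator=True`, which labels each ID as `left_only`, `right_only` or `both`. A minimal sketch on toy IDs, where `profile_ids` and `result_ids` stand in for the unique `STUDENT ID` values of the two datasets.

```python
import pandas as pd

# Toy stand-ins for the unique STUDENT IDs of each dataset.
profile_ids = pd.DataFrame({"STUDENT ID": ["1101-009/001", "2101-111/001"]})
result_ids = pd.DataFrame({"STUDENT ID": ["1101-009/001", "5112-007/001"]})

# Outer merge keeps every ID; the _merge column records where it was found.
comparison = profile_ids.merge(result_ids, on="STUDENT ID", how="outer", indicator=True)

only_in_profiles = comparison.loc[comparison["_merge"] == "left_only", "STUDENT ID"].tolist()
only_in_results = comparison.loc[comparison["_merge"] == "right_only", "STUDENT ID"].tolist()

print("Only in profiles:", only_in_profiles)
print("Only in results:", only_in_results)
```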
In [696]:
# Get sets of Student IDs
ids_profiles = set(student_profiles['STUDENT ID'].dropna().unique())
ids_results = set(semester_results['STUDENT ID'].dropna().unique())

# Find mismatched IDs (present in semester_results but not in student_profiles)
only_in_results = ids_results - ids_profiles

# Filter and print those rows from semester_results
mismatched_rows = semester_results[semester_results['STUDENT ID'].isin(only_in_results)]

print("❌ Mismatched rows from semester_results:")
print(mismatched_rows)


❌ Mismatched rows from semester_results:
       STUDENT ID  GPA  PERIOD
116  2101-106/001  2.4       1
117  2101-106/002  3.1       1
118  2101-106/003  3.4       1
119  2101-106/004  2.8       1
120  2101-106/005  2.3       1
130  5112-007/001  2.9       1
131  5112-007/001  3.8       2
132  5112-007/002  3.2       1
133  5112-007/002  3.3       2
134  5112-007/003  3.6       1
135  5112-007/003  3.7       2
136  5112-007/004  3.5       1
137  5112-007/004  2.6       2
138  5112-007/005  2.1       1
139  5112-007/005  2.3       2
140  5112-007/006  3.2       1
141  5112-007/006  3.1       2


In [697]:
import pandas as pd

# Extract all unique prefixes from semester_results
prefixes = set(semester_results['STUDENT ID'].str.extract(r'^(\d{4}-\d{3})')[0])

# Filter student_profiles to include only rows whose STUDENT ID starts with one of the prefixes
matched_profiles = student_profiles[
    student_profiles['STUDENT ID'].str.extract(r'^(\d{4}-\d{3})')[0].isin(prefixes)
].copy()

# Display result
print("✅ Matched student profiles (by prefix only):")
matched_profiles


✅ Matched student profiles (by prefix only):


Unnamed: 0,STUDENT ID,GENDER,SG CITIZEN,SG PR,FOREIGNER,COUNTRY OF OTHER NATIONALITY,DOB,HIGHEST QUALIFICATION,NAME OF QUALIFICATION AND INSTITUTION,DATE ATTAINED HIGHEST QUALIFICATION,DESIGNATION,COMMENCEMENT DATE,COMPLETION DATE,FULL-TIME OR PART-TIME,COURSE FUNDING,REGISTRATION FEE,PAYMENT MODE,COURSE FEE
0,1101-009/001,F,,,Y,Malaysia,13/09/1981,Certificate,SPM,2018-01-08,Admin & HR Assistant,2022-04-18 00:00:00,2023-09-17 00:00:00,Part-Time,Individual,107,GIRO,5136
1,1101-009/002,F,Y,,,,26/07/1979,Certificate,"Certificate in Office Skills, ITE",2016-06-08,Admin Assistant,2022-04-18 00:00:00,2023-09-17 00:00:00,Part-Time,Individual-SFC,107,NETS,5136
2,1101-009/003,F,,,Y,India,01/02/1990,Degree,"Bachelor of Business Administration, Universit...",2015-08-08,-,2022-04-18 00:00:00,2023-09-17 00:00:00,Part-Time,Individual,107,NETS,5136
3,1101-009/004,F,,,Y,Netherlands,20/04/1976,Diploma,"Office Management Diploma, NCOI Rotterdam, The...",2018-02-08,HR Support / Office Manager,2022-04-18 00:00:00,2023-09-17 00:00:00,Part-Time,Individual,107,NETS,5136
4,1101-009/005,F,Y,,,,25/11/1983,Diploma,"Diploma in Business Admininstration, LCCI Leve...",2015-06-08,"Executive, Administration",2022-04-18 00:00:00,2023-09-17 00:00:00,Part-Time,Sponsored,107,GIRO,4812
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
295,5113-008/003,F,Y,,,,18/11/1991,Degree,Bachelor of Business (Marketing)/\nRMIT Univer...,2017-03-21,Regional Recruiter,2024-04-08 00:00:00,2025-03-18 00:00:00,Part-Time,Individual,107,Nets,5803
296,5113-008/004,F,Y,,,,29/04/1974,Degree,Bachelor of Commerce in Management and Marketi...,2017-02-28,Confidential Assistant,2024-04-08 00:00:00,2025-03-18 00:00:00,Part-Time,Individual,107,Nets,5803
297,5113-008/005,F,Y,,,,19/10/1981,Degree,"Bachelor of Arts in Human Resource Management,...",2018-01-30,Journey Management Team Lead,2024-04-08 00:00:00,2025-03-18 00:00:00,Part-Time,Individual,107,Nets,5803
298,5113-008/006,F,Y,,,,19/03/1971,Degree,Bachelor of Arts (Sociology)/\nState Universit...,1995-05-30,Academy Program Coordinator,2024-04-08 00:00:00,2025-03-18 00:00:00,Part-Time,Individual,107,Nets,5803




---

### Checking students which are not found in semester_results

In [698]:
# Get sets of Student IDs
ids_profiles = set(student_profiles['STUDENT ID'].dropna().unique())
ids_results = set(semester_results['STUDENT ID'].dropna().unique())

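# Find mismatched IDs (present in student_profiles but not in semester_results)
# Note: the name only_in_results is carried over from the previous check; here it holds IDs found only in student_profiles
only_in_results =  ids_profiles - ids_results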

# Filter and print those rows from student_profiles
mismatched_rows = student_profiles[student_profiles['STUDENT ID'].isin(only_in_results)]

print("❌ Mismatched rows from student_profiles:")
mismatched_rows.head(20)


❌ Mismatched rows from student_profiles:


Unnamed: 0,STUDENT ID,GENDER,SG CITIZEN,SG PR,FOREIGNER,COUNTRY OF OTHER NATIONALITY,DOB,HIGHEST QUALIFICATION,NAME OF QUALIFICATION AND INSTITUTION,DATE ATTAINED HIGHEST QUALIFICATION,DESIGNATION,COMMENCEMENT DATE,COMPLETION DATE,FULL-TIME OR PART-TIME,COURSE FUNDING,REGISTRATION FEE,PAYMENT MODE,COURSE FEE
108,2101-111/001,F,,Y,,Malaysian,08/07/1995,Degree,"Bachelor of Science (HRD), Universiti Teknolog...",2020-04-04,Manager,2025-04-24 00:00:00,,Part Time,Indivodual,107,PayNow,2996
109,2101-111/002,F,Y,,,,08/09/1997,Certificate,Certificate in Grammar & Writing Intermediate ...,2021-10-04,Admin Assistant,2025-04-24 00:00:00,,Part Time,Individual,107,PayNow,2996
110,2101-111/003,F,Y,,,,19/06/1999,Diploma,"Diploma in Mechatronic Engineering, Nee Ann Po...",2020-12-24,HR Manager,2025-04-24 00:00:00,,Part Time,Sponsored,107,PayNow,2696
111,2101-111/004,F,Y,,,,28/09/2010,Certificate,"Higher Nitec in Hospitality Operations, ITE",2021-09-24,Admin Executive,2025-04-24 00:00:00,,Part Time,Individual,107,PayNow,2996
112,2101-111/005,F,Y,,,,19/10/1990,Certificate,O' level,2010-07-24,Admin Executive,2025-04-24 00:00:00,,Part Time,Individual - waived App Fee,Waived,Waived,2596
113,2101-111/006,F,Y,,,,10/04/1999,Certificate,Higher Nitec in Business Studies (Service Mana...,2012-04-24,Admin Assistant,2025-04-24 00:00:00,,Part Time,Sponsored,107,PayNow,2696
114,2101-111/007,F,Y,,,,24/05/2000,Certificate,O' levels,2020-04-24,Secretary,2025-04-24 00:00:00,,Part Time,Individual,107,PayNow,2996
115,2101-111/008,F,Y,,,,12/04/2001,Certificate,O' levels,2022-05-24,Executive,2025-04-24 00:00:00,,Part Time,Individual,107,PayNow,2996
300,5113-009/001,F,,,Y,India,14/11/1985,Degree,Bachelor of Arts in Business with Logistics an...,2019-08-04,Program Coordinator,2025-04-14 00:00:00,,Part-Time,Individual,107,Nets,5803
301,5113-009/002,F,Y,,,,28/01/1993,Degree,Bachelor of Business (Business Administration)...,2020-10-04,"Assistant Director, Business Development",2025-04-14 00:00:00,,Part-Time,Individual,107,Nets,5803


Students who appear in `student_profiles` but not in `semester_results` have most likely not completed a semester or have not been graded yet. The rows shown above all commenced in April 2025 and have no completion date.

#### Gender
---

In [699]:
# Check unique values in the 'GENDER' column
unique_genders = student_profiles['GENDER'].value_counts().reset_index()
unique_genders.columns = ['Gender', 'Count']
print("Unique values in 'GENDER' column:")
display(unique_genders)

Unique values in 'GENDER' column:


Unnamed: 0,Gender,Count
0,F,265
1,M,40


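The column contains only `F` and `M`, so it appears clean. Note that `value_counts()` excludes missing values by default, so any blanks would not show up in this table.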

#### SG CITIZEN, SG PR & FOREIGNER
---

In [700]:
import pandas as pd
import numpy as np

# Load the student_profiles DataFrame
student_profiles = pd.read_excel("https://github.com/swamhtetg90/DAVI-CA2/blob/main/DAVI%20CA2%20datasets%20and%20meta%20data/Student%20Profiles.xlsx?raw=true")

# Replace empty strings with NaN in the specified columns
cols_to_check = ['SG CITIZEN', 'SG PR', 'FOREIGNER']
for col in cols_to_check:
    student_profiles[col] = student_profiles[col].replace(r'^\s*$', np.nan, regex=True)

# Check if there is only one non-null value among 'SG CITIZEN', 'SG PR', and 'FOREIGNER' for each row
student_profiles['Residential_Status_Check'] = (student_profiles[['SG CITIZEN', 'SG PR', 'FOREIGNER']].notna().sum(axis=1) == 1)

# Select rows where the check fails (i.e., not exactly one non-null value)
mismatched_residential_status = student_profiles[~student_profiles['Residential_Status_Check']]

if mismatched_residential_status.empty:
    print("Each student has exactly one residential status specified.")
else:
    print("The following rows have more or less than one residential status specified:")
    display(mismatched_residential_status[['STUDENT ID', 'SG CITIZEN', 'SG PR', 'FOREIGNER', 'Residential_Status_Check']])

# Drop the temporary check column
student_profiles = student_profiles.drop(columns=['Residential_Status_Check'])

Each student has exactly one residential status specified.


Therefore, we can combine these 3 columns into 1 column called "Residential Status".
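
A minimal sketch of how the three flags could be collapsed into one value. The helper name `residential_status` and the labels `SG Citizen`, `SG PR`, and `Foreigner` are assumptions for illustration; the notebook does not fix the label text. Both `Y` and `Yes` are treated as set, because the data is manually entered.

```python
def _is_set(value):
    """Return True for a 'Y'/'Yes' flag; blanks, None and NaN count as unset."""
    return isinstance(value, str) and value.strip().upper() in {"Y", "YES"}


def residential_status(sg_citizen, sg_pr, foreigner):
    """Collapse the three flag columns into one label (labels are assumed)."""
    flags = {
        "SG Citizen": _is_set(sg_citizen),
        "SG PR": _is_set(sg_pr),
        "Foreigner": _is_set(foreigner),
    }
    chosen = [label for label, is_set in flags.items() if is_set]
    # Exactly one flag should be set; anything else is left for manual review.
    return chosen[0] if len(chosen) == 1 else None
```

On the DataFrame this could be applied row by row, for example with `student_profiles.apply(lambda r: residential_status(r['SG CITIZEN'], r['SG PR'], r['FOREIGNER']), axis=1)`.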

#### Nationality
---

In [701]:
# Count each unique value, including nulls (dropna=False)
unique_nationalities = student_profiles['COUNTRY OF OTHER NATIONALITY'].value_counts(dropna=False).reset_index()
unique_nationalities.columns = ['Nationality', 'Count']

# Separate null count from the rest
null_count = unique_nationalities[unique_nationalities['Nationality'].isna()]['Count'].values[0] if unique_nationalities['Nationality'].isna().any() else 0

# Drop the NaN row to show only actual nationalities in the list
non_null_nationalities = unique_nationalities[unique_nationalities['Nationality'].notna()].copy()

# Display summary
print("Unique non-null values in 'COUNTRY OF OTHER NATIONALITY' column:")
display(non_null_nationalities)

print(f"\nNumber of null (missing) values: {null_count}")
print(f"Total number of records: {len(student_profiles)}")


Unique non-null values in 'COUNTRY OF OTHER NATIONALITY' column:


Unnamed: 0,Nationality,Count
1,,88
2,Malaysia,27
3,China,14
4,India,12
5,Philippines,8
6,Myanmar,3
7,Netherlands,1
8,Malaysian,1
9,Vietnam,1
10,Indonesia,1



Number of null (missing) values: 151
Total number of records: 307


According to the metadata, this column is blank (null) for SG citizens, so we need to check whether the number of blank values matches the number of SG citizens in the `SG CITIZEN` column.

In [702]:
# Count occurrences including NaN (null) values
sg_citizen_counts = student_profiles['SG CITIZEN'].value_counts(dropna=False).reset_index()

# Rename columns for clarity
sg_citizen_counts.columns = ['SG CITIZEN Value', 'Count']

# Display the counts
print("Count of each value in the 'SG CITIZEN' column (including nulls):")
display(sg_citizen_counts)


Count of each value in the 'SG CITIZEN' column (including nulls):


Unnamed: 0,SG CITIZEN Value,Count
0,Y,224
1,,68
2,Yes,15


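The `SG CITIZEN` column marks 224 + 15 = 239 students as citizens (entered as `Y` or `Yes`). "COUNTRY OF OTHER NATIONALITY" has 151 null values plus 88 rows whose value displays as empty, which is also 239. The counts match, which is consistent with a blank "COUNTRY OF OTHER NATIONALITY" meaning the nationality is Singapore. Two inconsistencies still need standardizing: `Yes` vs `Y` in `SG CITIZEN`, and `Malaysian` vs `Malaysia` in the nationality column.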
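
One way this could be turned into a cleaning rule is sketched below. The function name `normalize_nationality` and the `DEMONYM_TO_COUNTRY` mapping are hypothetical; the mapping only covers the `Malaysian` entry seen in the output above.

```python
# Hypothetical mapping of demonyms to country names; extend as needed.
DEMONYM_TO_COUNTRY = {"Malaysian": "Malaysia"}


def normalize_nationality(country, is_sg_citizen):
    """Return a cleaned nationality string, or None when it cannot be inferred."""
    text = country.strip() if isinstance(country, str) else ""
    if not text:
        # Blank or missing: a citizen's nationality is taken to be Singapore.
        return "Singapore" if is_sg_citizen else None
    return DEMONYM_TO_COUNTRY.get(text, text)
```

A non-citizen with a blank country returns `None` rather than a guess, so such rows stay visible for manual review.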

#### DOB (Date of Birth)
---

In [703]:
# Check for missing values in the 'DOB' column
missing_dob = student_profiles['DOB'].isnull().sum()
print(f"Number of missing values in 'DOB': {missing_dob}")

# Attempt to convert 'DOB' to datetime, coercing errors
student_profiles['DOB_datetime'] = pd.to_datetime(student_profiles['DOB'], errors='coerce', format='%d/%m/%Y')

# Check for values that couldn't be parsed (will be NaT - Not a Time)
invalid_dob = student_profiles[student_profiles['DOB_datetime'].isna() & student_profiles['DOB'].notna()]

print("\nRows with invalid 'DOB' values:")
display(invalid_dob[['STUDENT ID', 'DOB', 'DOB_datetime']])

# Drop the temporary datetime column
student_profiles = student_profiles.drop(columns=['DOB_datetime'])

Number of missing values in 'DOB': 0

Rows with invalid 'DOB' values:


Unnamed: 0,STUDENT ID,DOB,DOB_datetime
105,2101-110/006,13-Feb-1984,NaT
106,2101-110/007,13-Jul-1987,NaT
107,2101-110/008,15-Jul-1994,NaT
184,2102-067A/011,16-07-1991,NaT


The four values that could not be converted do not follow the `DD/MM/YYYY` format: three use an abbreviated month name (e.g. `13-Feb-1984`) and one uses hyphens instead of slashes (`16-07-1991`).

So, we will have to convert them to a consistent format.
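
A possible conversion is sketched below, assuming the `DOB` values are strings in one of the three formats seen so far. The helper name `parse_dob` is illustrative, and `%b` relies on English month abbreviations, which is the default in most environments.

```python
from datetime import datetime

# Formats seen in the DOB column: 13/09/1981, 13-Feb-1984, 16-07-1991.
DOB_FORMATS = ("%d/%m/%Y", "%d-%b-%Y", "%d-%m-%Y")


def parse_dob(text):
    """Try each known format in turn; return a date, or None if none match."""
    if not isinstance(text, str):
        return None
    for fmt in DOB_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None
```

Applied with `student_profiles['DOB'].map(parse_dob)`, any value that still returns `None` would point to a format not yet covered.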

### Highest Qualification, Qualification Name and Institution
----

In [704]:
# Step 1: Create a new column with the count of commas in each row
student_profiles['Comma Count'] = student_profiles['NAME OF QUALIFICATION AND INSTITUTION'].astype(str).str.count(',')

# Step 2: Group by the number of commas and count rows in each group
comma_group_counts = student_profiles['Comma Count'].value_counts().sort_index().reset_index()
comma_group_counts.columns = ['Number of Commas', 'Number of Rows']

# Step 3: Display the result
print("Rows grouped by number of commas in 'NAME OF QUALIFICATION AND INSTITUTION':")
display(comma_group_counts)


Rows grouped by number of commas in 'NAME OF QUALIFICATION AND INSTITUTION':


Unnamed: 0,Number of Commas,Number of Rows
0,0,137
1,1,143
2,2,21
3,3,6


#### Data Analysis Based on Comma Count

- **Rows with 1 comma (143 rows)**:
  - Clear structure for data splitting
  - Can use comma "," as delimiter to split data into qualification and institute
  - Will implement this splitting approach for these rows

- **Rows with 0 commas**:
  - Expected to contain only qualification name OR institute name
  - Require further examination to determine content type
  - Need additional analysis to classify these entries

- **Rows with 2 or more commas**:
  - Complex structure requiring detailed investigation
  - Need further examination to understand data format
  - May require custom parsing logic for proper data extraction

#### Rows with 0 commas

In [705]:
# Filter rows with no commas (Comma Count == 0) and reset index
no_comma_rows = student_profiles[student_profiles['Comma Count'] == 0].reset_index(drop=True)

# Display the filtered rows
print("Rows with NO commas in 'NAME OF QUALIFICATION AND INSTITUTION':")
display(no_comma_rows[['NAME OF QUALIFICATION AND INSTITUTION']])



Rows with NO commas in 'NAME OF QUALIFICATION AND INSTITUTION':


Unnamed: 0,NAME OF QUALIFICATION AND INSTITUTION
0,SPM
1,N' level
2,O' level
3,N' Level
4,SPM
...,...
132,Bachelor of Arts in Business with Logistics an...
133,Bachelor of Business (Business Administration)...
134,Bachelor of Science in Accounting and Finance ...
135,Bachelor of Commerce (Accounting and Finance)/...


The table shows that many of these values do contain an institution name, joined to the qualification by "/" instead of a comma. So for these rows, I will count the number of "/" next.

In [706]:
# Count '/' in rows that have no commas
no_comma_rows['Slash Count'] = no_comma_rows['NAME OF QUALIFICATION AND INSTITUTION'].str.count('/')

# Group and count how many rows have how many slashes
slash_count_summary = no_comma_rows['Slash Count'].value_counts().sort_index().reset_index()
slash_count_summary.columns = ['Number of Slashes', 'Number of Rows']

print("Distribution of '/' in rows WITHOUT commas:")
display(slash_count_summary)


Distribution of '/' in rows WITHOUT commas:


Unnamed: 0,Number of Slashes,Number of Rows
0,0,57
1,1,80


Rows with 0 slashes contain only a qualification name; the institution is blank. The rule for rows with 0 commas is therefore: count the slashes. With 0 slashes, treat the whole value as the qualification. With 1 slash, split on the slash into qualification and institution.
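
The rule for comma-free rows can be sketched as follows. The function name `split_zero_comma` is hypothetical, and the strings in the test cases are made up rather than taken from the dataset.

```python
def split_zero_comma(text):
    """Split a comma-free value into (qualification, institution)."""
    if "/" not in text:
        # No slash: the whole value is the qualification name.
        return text.strip(), None
    qualification, institution = text.split("/", 1)
    # Values in this column often carry a newline right after the slash.
    return qualification.strip(), institution.strip() or None
```

Stripping whitespace matters here because several values in the output above contain `/\n` between the two parts.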

In [707]:
# Count '/' in rows that have no commas
no_comma_rows['Slash Count'] = no_comma_rows['NAME OF QUALIFICATION AND INSTITUTION'].str.count('/')

# Filter rows in no_comma_rows that contain at least one slash
rows_with_slashes_in_no_comma = no_comma_rows[no_comma_rows['Slash Count'] > 0].reset_index(drop=True)

# Display the filtered rows
print("Rows with slashes in 'NAME OF QUALIFICATION AND INSTITUTION' (from no_comma_rows):")
display(rows_with_slashes_in_no_comma[['NAME OF QUALIFICATION AND INSTITUTION']])

Rows with slashes in 'NAME OF QUALIFICATION AND INSTITUTION' (from no_comma_rows):


Unnamed: 0,NAME OF QUALIFICATION AND INSTITUTION
0,Degree of Bachelor of Science with Honours in ...
1,Degree of Bachelor of Science with Honours in ...
2,Honours Degree of Bachelor of Science (Managem...
3,Master of Arts (Chinese Studies)/\nNational Un...
4,Bachelor of Business (Accounting)/\nMurdoch Un...
...,...
75,Bachelor of Arts in Business with Logistics an...
76,Bachelor of Business (Business Administration)...
77,Bachelor of Science in Accounting and Finance ...
78,Bachelor of Commerce (Accounting and Finance)/...


#### Rows with 1 comma

In [708]:
# Filter rows with exactly one comma (Comma Count == 1) and reset index
one_comma_rows = student_profiles[student_profiles['Comma Count'] == 1].reset_index(drop=True)

# Display the filtered rows
print("Rows with one comma in 'NAME OF QUALIFICATION AND INSTITUTION':")
display(one_comma_rows[['NAME OF QUALIFICATION AND INSTITUTION']])


Rows with one comma in 'NAME OF QUALIFICATION AND INSTITUTION':


Unnamed: 0,NAME OF QUALIFICATION AND INSTITUTION
0,"Certificate in Office Skills, ITE"
1,"Lower Secondary Education, Malaysia"
2,"Bahcelor of Management (Marketing), Universiti..."
3,"Diploma in Multimedia & Infocomm Technology, N..."
4,"Diploma in Business, Temasek Polytechnic"
...,...
138,"BBA, Monash University"
139,"BBA, Monash University"
140,"Bachelor of Arts in Human Resource Management,..."
141,Bachelor of Science in Hotel Administration (H...


These rows have some peculiarities: some contain no institution name at all, only a country, and some use the comma to separate the university from its country. I will first check these rows for slashes.

In [709]:
# Count '/' in rows that have exactly one comma
one_comma_rows['Slash Count'] = one_comma_rows['NAME OF QUALIFICATION AND INSTITUTION'].str.count('/')

# Group and count how many rows have how many slashes
slash_count_summary = one_comma_rows['Slash Count'].value_counts().sort_index().reset_index()
slash_count_summary.columns = ['Number of Slashes', 'Number of Rows']

print("Distribution of '/' in rows with ONE comma:")
display(slash_count_summary)

Distribution of '/' in rows with ONE comma:


Unnamed: 0,Number of Slashes,Number of Rows
0,0,137
1,1,6


In [710]:
# Further filter those with no slash
no_slash_in_one_comma = one_comma_rows[one_comma_rows['NAME OF QUALIFICATION AND INSTITUTION'].str.count('/') == 0].reset_index(drop=True)

# Display the result
print("Rows with EXACTLY 1 comma and no slash in 'NAME OF QUALIFICATION AND INSTITUTION':")
display(no_slash_in_one_comma[['NAME OF QUALIFICATION AND INSTITUTION']])

Rows with EXACTLY 1 comma and no slash in 'NAME OF QUALIFICATION AND INSTITUTION':


Unnamed: 0,NAME OF QUALIFICATION AND INSTITUTION
0,"Certificate in Office Skills, ITE"
1,"Lower Secondary Education, Malaysia"
2,"Bahcelor of Management (Marketing), Universiti..."
3,"Diploma in Multimedia & Infocomm Technology, N..."
4,"Diploma in Business, Temasek Polytechnic"
...,...
132,"Bachelor of Arts and Social Science, National ..."
133,"BBA, Monash University"
134,"BBA, Monash University"
135,"Bachelor of Arts in Human Resource Management,..."


In [711]:


# Further filter those with exactly one slash
one_slash_in_one_comma = one_comma_rows[one_comma_rows['NAME OF QUALIFICATION AND INSTITUTION'].str.count('/') == 1].reset_index(drop=True)

# Display the result
print("Rows with EXACTLY 1 comma and EXACTLY 1 slash in 'NAME OF QUALIFICATION AND INSTITUTION':")
display(one_slash_in_one_comma[['NAME OF QUALIFICATION AND INSTITUTION']])


Rows with EXACTLY 1 comma and EXACTLY 1 slash in 'NAME OF QUALIFICATION AND INSTITUTION':


Unnamed: 0,NAME OF QUALIFICATION AND INSTITUTION
0,"SPM, LCCI Level 2 Certificate in Book-keeping ..."
1,"Advanced Diploma in Business Administration, U..."
2,"Bachelor of Accountancy, jointly offered by NU..."
3,Professional Diploma in Leadership and People ...
4,Bachelor of Science in Hotel Administration (H...
5,Bachelor of Science in Hotel Administration (H...




#### Data Separation Process:
- **Split by both comma AND slash** to get 3 components

#### Component Classification Rules:

- **1st Component**:
  - Always confirmed as **qualification**
  - No further analysis needed

- **2nd Component**:
  - **Check 1**: Contains "university", "academy", or "institute"; if not, contains at least one word in ALL CAPS together with "offered"
    - **If YES** → Classify as **institution**
  - **Check 2**: If Check 1 fails, check if contains city or country name
    - **If YES** → Put under **"country where qualification is gained" column**
  - **Check 3**: If both checks fail → Classify as **qualification**

- **3rd Component**:
  - **Check 1**: Contains "university", "academy", or "institute"; if not, contains at least one word in ALL CAPS together with "offered"
    - **If YES** → Classify as **institution**
  - **Check 2**: If Check 1 fails, check if contains city or country name
    - **If YES** → Put under **"country where qualification is gained" column**
  - **Check 3**: If both checks fail → Classify as **qualification**
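
The three checks for the 2nd and 3rd components could look like the sketch below. The name `classify_component` and the `PLACES` list are assumptions for illustration; a real place list would have to cover every city and country in the data.

```python
import re

INSTITUTION_WORDS = ("university", "academy", "institute")
# Assumed place list for illustration only.
PLACES = ("malaysia", "india", "china", "philippines", "netherlands", "singapore")


def classify_component(part):
    """Apply Check 1 to Check 3 to the 2nd or 3rd component."""
    lowered = part.lower()
    has_caps_word = re.search(r"\b[A-Z]{2,}\b", part) is not None
    # Check 1: institution keyword, or an ALL-CAPS word together with "offered".
    if any(word in lowered for word in INSTITUTION_WORDS) or (
        has_caps_word and "offered" in lowered
    ):
        return "institution"
    # Check 2: a known city or country name.
    if any(place in lowered for place in PLACES):
        return "country"
    # Check 3: neither matched, so treat it as a qualification.
    return "qualification"
```

Because Check 1 runs first, a value such as "National University of Singapore" is classified as an institution even though it also contains a country name.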

#### Rows with 2 commas

In [712]:

two_comma_rows = student_profiles[student_profiles['Comma Count'] == 2].reset_index(drop=True)

# Display the filtered rows
print("Rows with two commas in 'NAME OF QUALIFICATION AND INSTITUTION':")
display(two_comma_rows[['NAME OF QUALIFICATION AND INSTITUTION']])

Rows with two commas in 'NAME OF QUALIFICATION AND INSTITUTION':


Unnamed: 0,NAME OF QUALIFICATION AND INSTITUTION
0,"Bachelor of Business Administration, Universit..."
1,"Office Management Diploma, NCOI Rotterdam, The..."
2,"Advanced Diploma in Tourism, Hospitality and E..."
3,"Certified Accounting Technicians (CATS), ACCA,..."
4,"Certified Accounting Technicians (CATS), ACCA,..."
5,"NTC 2, N Level, WPLN"
6,"WSQ Higher Certificate in Human Resources, WPL..."
7,"Certificate in Payroll Administration, SHRI Ac..."
8,"NTC 2, N Level, WPLN"
9,"SPM, Certificate in Business Studies (Business..."


In [713]:
# Count '/' in rows that have two commas
two_comma_rows['Slash Count'] = two_comma_rows['NAME OF QUALIFICATION AND INSTITUTION'].str.count('/')

# Group and count how many rows have how many slashes
slash_count_summary_two_commas = two_comma_rows['Slash Count'].value_counts().sort_index().reset_index()
slash_count_summary_two_commas.columns = ['Number of Slashes', 'Number of Rows']

print("Distribution of '/' in rows with TWO commas:")
display(slash_count_summary_two_commas)

Distribution of '/' in rows with TWO commas:


Unnamed: 0,Number of Slashes,Number of Rows
0,0,21


#### Qualification Data Parsing Logic

##### Core Principles
1. **First part = Always qualification** (confirmed)
2. **All three columns can have multiple values** (join with commas)
3. **Categorize remaining parts** using keyword matching
4. **Group by category** and join multiple values

##### Updated Keyword Categories

**1. Qualification Keywords**
- **Degree types**: `Bachelor`, `Master`, `Diploma`, `Certificate`, `Degree`, `PhD`, `Doctorate`
- **Education levels**: `O Level`, `A Level`, `N Level`, `NTC`, `Nitec`, `WSQ`
- **Certifications**: `Certified`, `Advanced`, `Higher`, `Foundation`
- **Qualification codes**: `SPM`, `WPLN`, `CATS`, `ACTA`, `LCCI`

**2. Institute Keywords**
- **Full names**: `University`, `College`, `Polytechnic`, `Academy`, `Institute`, `School`, `Poly`
- **Known institutes**: `ITE`, `PSB`, `MDIS`, `ACCA`, `SHRI`, `NCOI`
- **Specific institutes**: `Singapore Polytechnic`, `Republic Poly`, `Tunku Abdul Rahman College`

**3. Country Keywords**
- `India`, `China`, `Philippines`, `The Netherlands`, `Netherlands`, `Singapore`

##### Processing Logic

###### Step 1: Split and Initialize
- Split by commas into parts
- First part → Qualification column
- Initialize empty lists for qualifications, institutes, countries

###### Step 2: Categorize Remaining Parts
For each remaining part:
- **Check country keywords first** (exact or partial match)
- **Check institute keywords** (contains any keyword)
- **Check qualification keywords** (contains any keyword)
- **Default**: If all caps and short (≤10 chars) → treat as qualification

###### Step 3: Handle Multiple Values
- **Qualifications**: Join all qualification parts with commas
- **Institutes**: Join all institute parts with commas
- **Countries**: Join all country parts with commas
- **Set null** if no values found for a category

##### Final Examples

| Original String | Qualification | Institute | Country |
|----------------|---------------|-----------|---------|
| `Bachelor of Business Administration, University of Rajasthan, India` | Bachelor of Business Administration | University of Rajasthan | India |
| `NTC 2, N Level, WPLN` | NTC 2, N Level, WPLN | null | null |
| `O level, Private Secretary Certificate, LCCI` | O level, Private Secretary Certificate, LCCI | null | null |
| `SPM, Certificate in Business Studies Level IV, Tunku Abdul Rahman College` | SPM, Certificate in Business Studies Level IV | Tunku Abdul Rahman College | null |
| `Diploma in Human Resource Management, PSB, ACTA` | Diploma in Human Resource Management, ACTA | PSB | null |
| `Certified Accounting Technicians (CATS), ACCA, SPM` | Certified Accounting Technicians (CATS), SPM | ACCA | null |

##### Priority Rules
1. **Country detection** has highest priority (usually last part)
2. **Institute detection** for clear institutional names
3. **Qualification detection** for educational levels and certifications
4. **Ambiguous short caps** → default to qualification
5. **Join multiple values** in same category with commas
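
The rules above can be sketched as follows, under two stated assumptions. First, a part counts as a country only when the whole part is a country name, so that `Singapore Polytechnic` is not sent to the country column; the rule above also allows partial matches. Second, anything that is neither a country nor an institute falls through to qualification, which covers both the qualification keywords and the short ALL-CAPS default. The function names are hypothetical.

```python
import re

COUNTRIES = {"india", "china", "philippines", "netherlands", "the netherlands", "singapore"}
INSTITUTE_WORDS = ("university", "college", "polytechnic", "academy", "institute", "school", "poly")
KNOWN_INSTITUTES = {"ITE", "PSB", "MDIS", "ACCA", "SHRI", "NCOI"}


def classify_part(part):
    """Return 'country', 'institute' or 'qualification' for one part."""
    lowered = part.lower()
    if lowered in COUNTRIES:  # whole-part match only (assumption, see lead-in)
        return "country"
    tokens = set(re.findall(r"[A-Za-z]+", part))
    if any(word in lowered for word in INSTITUTE_WORDS) or tokens & KNOWN_INSTITUTES:
        return "institute"
    # Qualification keywords and the short ALL-CAPS default both end up here.
    return "qualification"


def parse_comma_separated(text):
    """Split on commas and group parts into (qualification, institute, country)."""
    parts = [p.strip() for p in text.split(",") if p.strip()]
    groups = {"qualification": parts[:1], "institute": [], "country": []}
    for part in parts[1:]:
        groups[classify_part(part)].append(part)
    # Join each group with commas; an empty group becomes None.
    return tuple(", ".join(groups[key]) or None
                 for key in ("qualification", "institute", "country"))
```

With these keyword lists, the sketch reproduces the six rows of the examples table, for instance `parse_comma_separated("Diploma in Human Resource Management, PSB, ACTA")` gives `("Diploma in Human Resource Management, ACTA", "PSB", None)`.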

#### Rows with 3 commas

In [714]:
# Filter rows with 3 commas (Comma Count == 3) and reset index
three_comma_rows = student_profiles[student_profiles['Comma Count'] == 3].reset_index(drop=True)

# Display the filtered rows
print("Rows with three commas in 'NAME OF QUALIFICATION AND INSTITUTION':")
display(three_comma_rows[['NAME OF QUALIFICATION AND INSTITUTION']])

Rows with three commas in 'NAME OF QUALIFICATION AND INSTITUTION':


Unnamed: 0,NAME OF QUALIFICATION AND INSTITUTION
0,"Diploma in Business Admininstration, LCCI Leve..."
1,"O' levels, Diploma in Business Management (dis..."
2,"Diploma in Business Admininstration, LCCI Leve..."
3,"Diploma in Financial Informatics, Nanyang Poly..."
4,"Bachelor of Arts, New Era University, Quezon C..."
5,"Diploma in Business Administration (HRM), PSB ..."


In [715]:
# Count '/' in rows that have three commas
three_comma_rows['Slash Count'] = three_comma_rows['NAME OF QUALIFICATION AND INSTITUTION'].str.count('/')

# Group and count how many rows have how many slashes
slash_count_summary_three_commas = three_comma_rows['Slash Count'].value_counts().sort_index().reset_index()
slash_count_summary_three_commas.columns = ['Number of Slashes', 'Number of Rows']

print("Distribution of '/' in rows with THREE commas:")
display(slash_count_summary_three_commas)

Distribution of '/' in rows with THREE commas:


Unnamed: 0,Number of Slashes,Number of Rows
0,0,5
1,2,1


#### Final Complete Qualification Data Parsing Logic

##### Core Processing Steps

###### Step 1: Split by Multiple Delimiters
- **Split by both commas AND slashes**: `,` and `/`
- **Clean each part**: Remove extra spaces, quotes, brackets
- **First part = Always qualification**

###### Step 2: Updated Keyword Categories

**Qualification Keywords**
- **Degree types**: `Bachelor`, `Master`, `Diploma`, `Certificate`, `Degree`, `PhD`, `Doctorate`
- **Education levels**: `O Level`, `A Level`, `N Level`, `NTC`, `Nitec`, `WSQ`
- **Certifications**: `Certified`, `Advanced`, `Higher`, `Foundation`
- **Professional codes**: `SPM`, `WPLN`, `CATS`, `ACTA`, `LCCI`, `CPA`, `CHRM`

**Institute Keywords**
- **Full names**: `University`, `College`, `Polytechnic`, `Academy`, `Institute`, `School`, `Poly`
- **Known institutes**: `ITE`, `PSB`, `MDIS`, `ACCA`, `SHRI`, `NCOI`
- **Specific institutes**: `Singapore Polytechnic`, `Republic Poly`, `Nanyang Polytechnic`, `New Era University`, `PSB Academy`

**Country Keywords**
- `India`, `China`, `Philippines`, `The Netherlands`, `Netherlands`, `Singapore`, `Quezon City` (treat as location indicator for Philippines)

**City Keywords (Treat as Institutes)**
- `Glasgow`, `Dublin`, `Quezon City`, `Rotterdam`

###### Step 3: Categorization Logic
For each part after the first:
1. **Check country keywords** (exact match or contains country name)
2. **Check institute keywords** (contains any institute keyword OR is a known city)
3. **Check qualification keywords** (contains any qualification keyword)
4. **Default rule**: If all caps and ≤10 characters → treat as qualification
5. **If unclear**: Context-based decision

###### Step 4: Multi-Value Handling
- **Join multiple values** in same category with commas
- **Set null** if no values found for a category

##### Complete Examples

| Original String | Split Parts | Qualification | Institute | Country |
|----------------|-------------|---------------|-----------|---------|
| `Diploma in Business Administration, LCCI Level 3, Private Secretary's Diploma, LCCI` | [Diploma in Business Administration] [LCCI Level 3] [Private Secretary's Diploma] [LCCI] | Diploma in Business Administration, LCCI Level 3, Private Secretary's Diploma, LCCI | null | null |
| `O' levels, Diploma in Business Management (distance learning), Glasgow, Dublin` | [O' levels] [Diploma in Business Management (distance learning)] [Glasgow] [Dublin] | O' levels, Diploma in Business Management (distance learning) | Glasgow, Dublin | null |
| `Diploma in Financial Informatics, Nanyang Polytechnic, CPA, SHRI Academy` | [Diploma in Financial Informatics] [Nanyang Polytechnic] [CPA] [SHRI Academy] | Diploma in Financial Informatics, CPA | Nanyang Polytechnic, SHRI Academy | null |
| `Bachelor of Arts, New Era University, Quezon City, Philippines` | [Bachelor of Arts] [New Era University] [Quezon City] [Philippines] | Bachelor of Arts | New Era University, Quezon City | Philippines |
| `Diploma in Business Administration (HRM), PSB Academy, CHRM, SHRI, Nitec in Office Skills, ITE` | [Diploma in Business Administration (HRM)] [PSB Academy] [CHRM] [SHRI] [Nitec in Office Skills] [ITE] | Diploma in Business Administration (HRM), CHRM, Nitec in Office Skills | PSB Academy, SHRI, ITE | null |
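
The splitting in Step 1 could be sketched as below. The name `split_parts` is hypothetical. This sketch strips only surrounding whitespace and double quotes; it deliberately leaves brackets alone so that text such as `(HRM)` or `(distance learning)` survives, as it does in the examples table.

```python
import re


def split_parts(text):
    """Split on commas and slashes, trim each part, and drop empty parts."""
    parts = re.split(r"[,/]", text)
    # Trim whitespace (including newlines after a slash) and surrounding double quotes.
    cleaned = [part.strip().strip('"').strip() for part in parts]
    return [part for part in cleaned if part]
```

One limitation to keep in mind: a slash that belongs inside a single qualification name would also be split, so such rows would need to be checked by hand.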



# Complete Chronological Qualification Data Parsing Logic Flow

## Master Algorithm: Comma-Based Branch System

### Initial Step: Count Commas
- Count total commas in the string
- Branch to appropriate handling logic based on comma count
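
The branch selection can be sketched as a small function. The name `choose_branch` and the returned labels are illustrative; the labels simply mirror the branch and sub-branch headings in this flow.

```python
def choose_branch(text):
    """Return the branch label for a raw column value, based on delimiter counts."""
    commas = text.count(",")
    slashes = text.count("/")
    if commas == 0:
        if slashes == 0:
            return "1A"
        return "1B" if slashes == 1 else "1C"
    if commas == 1:
        return "2A" if slashes == 0 else "2B"
    # Two commas use the 3-part logic; three or more use multi-delimiter processing.
    return "3" if commas == 2 else "4"
```

Keeping the dispatch separate from the per-branch parsing makes it easy to count how many rows fall into each branch before any parsing rule is written.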

---

## Branch 1: **ZERO COMMAS (0)**

### Sub-Branch 1A: No Slashes
- If **0 slashes**: Entire string = Qualification only
 - Result: `(string, null, null)`

### Sub-Branch 1B: 1 Slash Found
- Split string by the single slash into 2 parts
- **Part 1**: Always qualification
- **Part 2**: Analyze for institute and country
 - Check if contains institution keywords (`university`, `college`, `academy`, `institute`, `school`)
 - If YES: Part 2 = Institute
 - Extract the country if Part 2 contains a country or city name, e.g. "University of [Country]"
 - Result: `(part1, part2, extracted_country_or_null)`

### Sub-Branch 1C: 2+ Slashes Found
- Split by all slashes into multiple parts
- **Part 1**: Always qualification
- **Remaining parts**: Apply same classification logic as 3+ comma case
- Use priority-based keyword matching for each part

---

## Branch 2: **ONE COMMA (1)**

### Sub-Branch 2A: No Slashes (1 comma, 0 slashes)
- Split by comma into exactly 2 parts
- **Part 1**: Always qualification
- **Part 2**: Apply classification rules in priority order:
 1. **Institute Check**: Contains `university|college|academy|institute|school|polytechnic` → Institute
 2. **Country Check**: Contains known country/city names → Country
 3. **Qualification Check**: Contains qualification keywords → Additional qualification
 4. **Default**: Treat as qualification
- Result: Distribute parts accordingly

### Sub-Branch 2B: With Slashes (1 comma, 1+ slashes)
- Split by both comma AND slashes to get 3+ parts
- **Part 1**: Always qualification
- **Parts 2+**: Apply advanced classification:
 1. **Institute Priority**: Contains institution keywords OR "ALL CAPS + offered pattern"
 2. **Country Priority**: Contains country/city names
 3. **Qualification Priority**: Contains qualification keywords
 4. **Default**: All caps ≤10 characters → Qualification
- Group classified parts by category and join with commas

---

## Branch 3: **TWO COMMAS (2)**

### Standard 3-Part Processing
- Split by commas into exactly 3 parts
- **Part 1**: Always qualification (no analysis needed)
- **Parts 2 & 3**: Apply priority-based classification:

#### Priority Classification System:
1. **Highest Priority - Country Detection**:
  - Exact match: `India`, `China`, `Philippines`, `Netherlands`, `Singapore`
  - Partial match: Contains country names within the part
  - Usually appears as last part

2. **Second Priority - Institute Detection**:
  - Institution types: `University`, `College`, `Polytechnic`, `Academy`, `Institute`, `School`
  - Known institutes: `ITE`, `PSB`, `MDIS`, `ACCA`, `SHRI`, `NCOI`
  - Specific names: `Singapore Polytechnic`, `Nanyang Polytechnic`

3. **Third Priority - Qualification Detection**:
  - Degree types: `Bachelor`, `Master`, `Diploma`, `Certificate`
  - Education levels: `O Level`, `A Level`, `N Level`, `NTC`, `Nitec`, `WSQ`
  - Professional codes: `SPM`, `LCCI`, `CPA`, `CHRM`, `CATS`

4. **Default Rule**: All caps and ≤10 characters → Treat as qualification

### Multi-Value Handling:
- If multiple parts classify as same category, join with commas
- Set null for categories with no matches

---

## Branch 4: **THREE+ COMMAS (3+)**

### Advanced Multi-Delimiter Processing

#### Step 4.1: Split by Multiple Delimiters
- Split by both commas AND slashes using regex pattern
- Clean each part: remove extra spaces, quotes, brackets
- Filter out empty parts

#### Step 4.2: Initialize Category Collections
- Create empty lists for: qualifications, institutes, countries
- Add first part to qualifications list automatically

#### Step 4.3: Advanced Classification Logic
For each remaining part, apply this hierarchical decision tree:

##### Level 1: Country Detection (Highest Priority)
- **Exact Match**: Part exactly matches known country names
- **Partial Match**: Part contains country names as substring
- **Location Indicators**: "Quezon City" (indicates Philippines)
- **Action**: Add to countries list and skip further checks

##### Level 2: Institute Detection (Second Priority)
- **Institution Types**: Contains `university`, `college`, `polytechnic`, `academy`, `institute`, `school`
- **Known Abbreviations**: Matches `ITE`, `PSB`, `MDIS`, `SHRI`, `ACCA`
- **City Names**: `Glasgow`, `Dublin`, `Rotterdam` (treat as institutes)
- **Special Pattern**: Contains "ALL CAPS + offered" pattern
- **Action**: Add to institutes list

##### Level 3: Qualification Detection (Third Priority)
- **Degree Keywords**: Contains `bachelor`, `master`, `diploma`, `certificate`, `phd`
- **Education Levels**: Contains `level`, `ntc`, `nitec`, `wsq`
- **Certifications**: Contains `certified`, `advanced`, `higher`, `foundation`
- **Professional Codes**: Matches `SPM`, `LCCI`, `CPA`, `CHRM`, `CATS`, `ACTA`
- **Action**: Add to qualifications list

##### Level 4: Default Classification Rules
- **All Caps Rule**: If part is all uppercase AND ≤10 characters → Qualification
- **Ambiguous Cases**: When multiple rules could apply, use context and position
- **Final Fallback**: Default to qualification if no clear match

#### Step 4.4: Multi-Value Assembly
- **Join Strategy**: Combine multiple values in each category with comma separation
- **Null Handling**: Set category to null if no items classified
- **Order Preservation**: Maintain relative order of items within each category

#### Step 4.5: Special Case Handling
- **Mixed Delimiters**: Handle combinations like "item1, item2/item3, item4"
- **Nested Information**: Extract relevant parts from complex strings
- **Duplicate Prevention**: Avoid adding same item to multiple categories

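As a small worked example of the mixed-delimiter case, the sketch below splits on commas and slashes, strips quotes and brackets, and drops empty parts. The helper name `split_mixed` is hypothetical and the sample strings are made up for illustration.

```python
import re

def split_mixed(text: str) -> list:
    """Split on commas and slashes, strip quotes/brackets, drop empty parts."""
    parts = re.split(r'[,/]', text)
    cleaned = [re.sub(r'["\'\[\]\(\)]', '', p).strip() for p in parts]
    return [p for p in cleaned if p]

print(split_mixed("item1, item2/item3, item4"))
print(split_mixed("Diploma (Hons), ITE / , 'Singapore'"))
```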
---

## Enhanced Keyword Database

### Qualification Keywords (Comprehensive)
- **Academic Degrees**: Bachelor, Master, Diploma, Certificate, Degree, PhD, Doctorate, Advanced Diploma
- **Education Levels**: O Level, A Level, N Level, NTC, Nitec, WSQ, Foundation
- **Professional Certifications**: Certified, Advanced, Higher, Foundation
- **Industry Codes**: SPM, WPLN, CATS, ACTA, LCCI, CPA, CHRM
- **Special Qualifications**: Private Secretary, Accounting Technicians

### Institute Keywords (Comprehensive)
- **Institution Types**: University, College, Polytechnic, Academy, Institute, School, Poly
- **Known Institutions**: ITE, PSB, MDIS, ACCA, SHRI, NCOI
- **Specific Names**: Singapore Polytechnic, Republic Poly, Nanyang Polytechnic, New Era University, PSB Academy, Tunku Abdul Rahman College
- **International**: University of Rajasthan, Anna University, Xi'an Shiyou University

### Geographic Keywords (Comprehensive)
- **Countries**: India, China, Philippines, The Netherlands, Netherlands, Singapore
- **Cities as Location Indicators**: Glasgow, Dublin, Quezon City, Rotterdam, Dumaguete City
- **Regional Indicators**: Asia, Europe (if context requires)

---

## Quality Assurance Rules

### Validation Checks
1. **Completeness**: Ensure first part always goes to qualification
2. **Consistency**: Verify classification logic applied uniformly
3. **Contextual**: Consider part position and surrounding context
4. **Flexibility**: Handle variations in formatting and terminology

### Edge Case Management
1. **Ambiguous All-Caps**: Use surrounding context to decide
2. **Multiple Categories**: If part fits multiple categories, use priority system
3. **Incomplete Information**: Handle missing delimiters gracefully
4. **Format Variations**: Account for different punctuation styles

This detailed logic flow provides a systematic approach to handle all possible comma and slash combinations while maintaining accuracy and consistency in classification.

In [716]:
# Count unique values including nulls
unique_qualifications = student_profiles['HIGHEST QUALIFICATION'].value_counts(dropna=False).reset_index()

# Rename columns for clarity
unique_qualifications.columns = ['Highest Qualification', 'Count']

# Display the result
print("Unique values in 'HIGHEST QUALIFICATION' column (including nulls):")
display(unique_qualifications)


Unique values in 'HIGHEST QUALIFICATION' column (including nulls):


Unnamed: 0,Highest Qualification,Count
0,Degree,137
1,Certificate,87
2,Diploma,71
3,Master,11
4,,1


In [717]:
# Check for rows where 'HIGHEST QUALIFICATION' is null or empty
null_qualification_rows = student_profiles[student_profiles['HIGHEST QUALIFICATION'].isnull() | (student_profiles['HIGHEST QUALIFICATION'] == ' ') | (student_profiles['HIGHEST QUALIFICATION'] == '')]

print("Rows with null or empty 'HIGHEST QUALIFICATION':")
display(null_qualification_rows)

Rows with null or empty 'HIGHEST QUALIFICATION':


Unnamed: 0,STUDENT ID,GENDER,SG CITIZEN,SG PR,FOREIGNER,COUNTRY OF OTHER NATIONALITY,DOB,HIGHEST QUALIFICATION,NAME OF QUALIFICATION AND INSTITUTION,DATE ATTAINED HIGHEST QUALIFICATION,DESIGNATION,COMMENCEMENT DATE,COMPLETION DATE,FULL-TIME OR PART-TIME,COURSE FUNDING,REGISTRATION FEE,PAYMENT MODE,COURSE FEE,Comma Count
128,2102-063/013,F,,,Y,Vietnam,24/07/1989,,,2016-06-06,Accounts Executive,2022-04-18 00:00:00,2022-09-14 00:00:00,Part-Time,Individual,107,Nets,888,0


Since only one row has a null value for Highest Qualification, we will drop it.

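One way to drop such rows is sketched below on a toy frame. The `demo` data is made up and stands in for `student_profiles`; blank strings are treated as missing, mirroring the check in the cell above.

```python
import pandas as pd

# Toy frame standing in for student_profiles (values are illustrative only).
demo = pd.DataFrame({
    "STUDENT ID": ["A/001", "A/002", "A/003"],
    "HIGHEST QUALIFICATION": ["Degree", None, " "],
})

# Treat nulls and blank strings as missing, then keep only the remaining rows.
col = demo["HIGHEST QUALIFICATION"]
is_missing = col.isna() | (col.astype(str).str.strip() == "")
demo_clean = demo[~is_missing].reset_index(drop=True)
print(demo_clean)
```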
#### Name of Qualification and Institution
---

In [718]:
from collections import Counter
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
import re

def get_filtered_common_words(df, column='NAME OF QUALIFICATION AND INSTITUTION', min_length=3):
    words = []
    for line in df[column].dropna():
        tokens = re.findall(r'\b\w+\b', line)
        filtered = [
            word.strip().title()
            for word in tokens
            if len(word) >= min_length and word.lower() not in ENGLISH_STOP_WORDS
        ]
        words.extend(filtered)
    return Counter(words).most_common()

# Usage:
filtered_words = get_filtered_common_words(student_profiles_test)
for word, count in filtered_words:
    print(f"{word}: {count}")

Bachelor: 120
University: 107
Business: 82
Diploma: 82
Management: 66
Level: 49
Science: 46
Administration: 36
Polytechnic: 30
Singapore: 28
Arts: 28
Certificate: 21
Human: 20
Academy: 19
National: 19
Honours: 18
Spm: 17
Accounting: 17
Resource: 17
Technology: 16
Ite: 15
Commerce: 14
Education: 13
Engineering: 13
Finance: 13
Nanyang: 12
Hrm: 12
Murdoch: 12
Universiti: 11
Institute: 11
Master: 11
Degree: 11
Malaysia: 10
Shri: 10
Studies: 10
Lcci: 9
Psb: 9
Nitec: 9
Accountancy: 9
Technological: 9
Class: 9
Office: 8
Higher: 8
Economics: 8
Skills: 7
Marketing: 7
Tourism: 7
Hospitality: 7
Levels: 7
Events: 7
Philippines: 7
Republic: 6
Social: 6
Computer: 6
Merit: 6
Second: 6
Secondary: 5
Mdis: 5
Communication: 5
Kaplan: 5
Nus: 5
Uol: 5
Ann: 5
Financial: 5
Edinburgh: 5
Napier: 5
London: 5
Graduate: 5
Logistics: 5
Sciences: 5
Teknologi: 4
Multimedia: 4
Temasek: 4
Wpln: 4
Bradford: 4
Hotel: 4
Tunku: 4
Abdul: 4
Rahman: 4
College: 4
Curtin: 4
Psychology: 4
Postgraduate: 4
International: 4
Rmit: 

In [726]:
student_profiles_test = student_profiles.copy()

In [732]:
import re
import pandas as pd

# Shared memory sets to enforce consistency across rows
known_institutes = set()
known_qualifications = set()

# Canonicalization helpers
def normalize_text(s: str) -> str:
    if not isinstance(s, str):
        return s
    # unify quotes/apostrophes
    s = s.replace("’", "'").replace("‘", "'").replace("“", '"').replace("”", '"')
    # remove stray spaces around apostrophes: "o ' level" -> "o'level"
    s = re.sub(r"\b([oOnNaA])\s*'\s*level\b", lambda m: f"{m.group(1)}'level", s, flags=re.IGNORECASE)
    # fix common variants into canonical forms
    s = re.sub(r"\bo['’]?level\b", "O Level", s, flags=re.IGNORECASE)
    s = re.sub(r"\ba['’]?level\b", "A Level", s, flags=re.IGNORECASE)
    s = re.sub(r"\bn['’]?level\b", "N Level", s, flags=re.IGNORECASE)
    s = re.sub(r"\badvanced\s+diploma\b", "Advanced Diploma", s, flags=re.IGNORECASE)
    # collapse multiple spaces
    s = re.sub(r"\s+", " ", s).strip()
    return s


# Keyword lists (can be extended dynamically if desired)
INSTITUTE_KEYWORDS = [
    'university', 'college', 'polytechnic', 'academy', 'institute', 'school', 'poly',
    'ite', 'psb', 'mdis', 'acca', 'shri', 'ncoi', 'singapore polytechnic',
    'republic poly', 'nanyang polytechnic', 'new era university', 'psb academy',
    'tunku abdul rahman college', 'university of rajasthan', 'anna university',
    "xi'an shiyou university"
]

QUALIFICATION_KEYWORDS = [
    'bachelor', 'master', 'diploma', 'certificate', 'degree', 'phd', 'doctorate',
    'advanced diploma', 'o level', 'a level', 'n level', 'ntc', 'nitec', 'wsq',
    'foundation', 'certified', 'advanced', 'higher', 'spm', 'wpln', 'cats',
    'acta', 'lcci', 'cpa', 'chrm', 'private secretary', 'accounting technicians'
]

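def contains_keyword(part: str, keywords):
    # Match whole words only (optionally plural), so that short keywords such as
    # 'ite' do not fire inside longer words like 'certificate'.
    part_lower = part.lower()
    return any(
        re.search(rf"(?<!\w){re.escape(keyword.lower())}s?(?!\w)", part_lower)
        for keyword in keywords
    )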

def is_in_known_sets(part: str):
    key = part.strip().upper()
    if key in known_institutes:
        return 'institute'
    if key in known_qualifications:
        return 'qualification'
    return None

def is_institute_candidate(part: str):
    plower = part.lower().strip()
    return contains_keyword(plower, INSTITUTE_KEYWORDS) or ('offered' in plower and part.isupper())

def is_qualification_candidate(part: str):
    plower = part.lower().strip()
    return contains_keyword(plower, QUALIFICATION_KEYWORDS) or (part.isupper() and len(part.strip()) <= 10)

def classify_part(part: str):
    part = part.strip()
    if not part:
        return None, 'unknown'

    # check memory first
    mem = is_in_known_sets(part)
    if mem:
        return part, mem

    # Institute check should have higher priority than the general qualification check
    if is_institute_candidate(part):
        # learn multiword and their uppercase tokens
        known_institutes.add(part.upper())
        for token in part.split():
            if token.isupper() and len(token) > 1:
                known_institutes.add(token.upper())
        return part, 'institute'

    # qualification check
    if is_qualification_candidate(part):
        known_qualifications.add(part.upper())
        return part, 'qualification'


    # No rule matched. A special case for 'ITE' is not needed here: 'ite' is in
    # INSTITUTE_KEYWORDS, so the institute check above already catches it.
    # Report the part as 'unknown' and let the caller decide how to handle it.
    return part, 'unknown'


def split_by_delimiters(text: str):
    parts = re.split(r'[,/]', text)
    cleaned = []
    for p in parts:
        p_clean = re.sub(r'["\'\[\]\(\)]', '', p).strip()
        if p_clean:
            cleaned.append(p_clean)
    return cleaned

def parse_education_data(text, comma_count):
    qualification = None
    institute = None

    if not text or pd.isna(text):
        return (None, None)

    text = normalize_text(str(text))

    # Helper to process a list of parts after first part is qualification
    def process_tail(parts):
        qual_parts = []
        inst_parts = []
        for p in parts:
            classified, cat = classify_part(p)
            if cat == 'institute':
                inst_parts.append(classified)
            elif cat == 'qualification':
                qual_parts.append(classified)
            # Unclassified parts are ignored for now.
        return qual_parts, inst_parts


    parts = split_by_delimiters(text)
    if not parts:
        return (None, None)

    qualification = parts[0]
    tail = parts[1:]
    qual_tail, inst_tail = process_tail(tail)

    if qual_tail:
        qualification = ', '.join([qualification] + qual_tail)
    if inst_tail:
        institute = ', '.join(inst_tail)


    return (qualification, institute)

# Re-apply the parsing after modifying the function
student_profiles_test['Comma Count'] = student_profiles_test['NAME OF QUALIFICATION AND INSTITUTION'].fillna("").str.count(',')

# Clear the known sets before re-applying the function to ensure fresh classification
known_institutes.clear()
known_qualifications.clear()


student_profiles_test[['Qualification', 'Institute']] = student_profiles_test.apply(
    lambda row: pd.Series(parse_education_data(
        row['NAME OF QUALIFICATION AND INSTITUTION'],
        row['Comma Count']
    )),
    axis=1
)

In [733]:
# Re-run the parser on student_profiles_test (repeats the final step of the previous cell)
student_profiles_test['Comma Count'] = student_profiles_test['NAME OF QUALIFICATION AND INSTITUTION'].fillna("").str.count(',')

student_profiles_test[['Qualification', 'Institute']] = student_profiles_test.apply(
    lambda row: pd.Series(parse_education_data(
        row['NAME OF QUALIFICATION AND INSTITUTION'],
        row['Comma Count']
    )),
    axis=1
)


In [734]:
# sample_df = (
#     processed_df[["Comma Count","HIGHEST QUALIFICATION", "NAME OF QUALIFICATION AND INSTITUTION", "Qualification", "Institute"]]
#     .sort_values("Comma Count")  # optional, for neat grouping
#     .groupby("Comma Count", group_keys=False)
#     .head(30)
# )


In [735]:
student_profiles_test[['NAME OF QUALIFICATION AND INSTITUTION', 'Qualification', 'Institute']]

Unnamed: 0,NAME OF QUALIFICATION AND INSTITUTION,Qualification,Institute
0,SPM,SPM,
1,"Certificate in Office Skills, ITE",Certificate in Office Skills,ITE
2,"Bachelor of Business Administration, Universit...",Bachelor of Business Administration,University of Rajasthan
3,"Office Management Diploma, NCOI Rotterdam, The...",Office Management Diploma,NCOI Rotterdam
4,"Diploma in Business Admininstration, LCCI Leve...","Diploma in Business Admininstration, LCCI Leve...",
...,...,...,...
302,Bachelor of Science in Hotel Administration (H...,Bachelor of Science in Hotel Administration Ho...,University of Nevada
303,Bachelor of Science in Accounting and Finance ...,Bachelor of Science in Accounting and Finance ...,University of London
304,Bachelor of Commerce (Accounting and Finance)/...,Bachelor of Commerce Accounting and Finance,Curtin University of Technology
305,Bachelor of Economics (Accounting)/\nUniversit...,Bachelor of Economics Accounting,


#### Qualification

In [None]:
# import re

# # Step 1: Clean and lowercase the qualification names
# def clean_qualification(name):
#     if not isinstance(name, str):
#         return ''
#     # Normalize quotes and strip
#     name = name.lower().strip()
#     name = re.sub(r"[‘’'`\"]", '', name)
#     # Standardize O Level variants
#     if re.match(r"^(o\s*level|o's?\s*level|o\s*levels)$", name):
#         return 'o level'
#     # Standardize A Level variants
#     if re.match(r"^(a\s*level|a's?\s*level|a\s*levels)$", name):
#         return 'a level'
#     # Standardize N Level variants
#     if re.match(r"^(n\s*level|n's?\s*level|n\s*levels)$", name):
#         return 'n level'
#     # Treat empty or dash as blank
#     if name in ['', '-']:
#         return ''
#     return name

# student_profiles['Qualification Clean'] = student_profiles['QUALIFICATION_NAME'].apply(clean_qualification)

# # Step 2: Define field keywords and categories (split school levels)
# field_keywords = {
#     'Business': ['business', 'commerce', 'marketing', 'accounting', 'finance', 'management', 'entrepreneurship', 'accountancy'],
#     'IT': ['information technology', 'infocomm', 'computing', 'computer science', 'ict', 'software', 'cybersecurity', 'data', 'ai', 'machine learning'],
#     'Engineering': ['engineering', 'mechanical', 'electrical', 'civil', 'aerospace', 'electronics', 'mechatronics'],
#     'Healthcare': ['nursing', 'pharmacy', 'biomedical', 'healthcare', 'medical', 'life sciences'],
#     'Design & Media': ['design', 'media', 'animation', 'visual', 'graphic', 'interior', 'fashion'],
#     'Education': ['education', 'teaching', 'early childhood'],
#     'Hospitality': ['hospitality', 'tourism', 'culinary', 'hotel'],
#     'Sciences': ['science', 'chemistry', 'biology', 'physics', 'mathematics'],
#     'Social Sciences': ['psychology', 'sociology', 'social work', 'counselling'],
#     'Law': ['law', 'legal'],
#     'Arts': ['art', 'communication', 'public relations', 'journalism', 'media', 'performing arts'],
#     'O Level': ['o level'],
#     'A Level': ['a level'],
#     'N Level': ['n level'],
#     'Others': []
# }

# # Step 3: Categorize based on keywords
# def categorize_field(qualification):
#     if qualification == '':
#         return 'Others'
#     for category, keywords in field_keywords.items():
#         if any(keyword in qualification for keyword in keywords):
#             return category
#     return 'Others'

# # Step 4: Apply categorization
# student_profiles['Qualification Field'] = student_profiles['Qualification Clean'].apply(categorize_field)

# # Step 5: Group and display counts
# field_counts = student_profiles['Qualification Field'].value_counts().reset_index()
# field_counts.columns = ['Qualification Field', 'Count']

# print("Qualification Field Categories Count:")
# display(field_counts)




As we can see, Business is the most common field in this dataset. Note that the data are categorised by field only, so the Business category may include degrees, diplomas, or even master's degrees.

In [None]:
# # Filter and display qualifications categorized as 'Others'
# others_df = student_profiles[student_profiles['Qualification Field'] == 'Others']

# # See unique "qualification name" entries in 'Others'
# unique_others = others_df['QUALIFICATION_NAME'].value_counts().reset_index()
# unique_others.columns = ['Qualification Name', 'Count']

# print("Qualifications Categorized as 'Others':")
# display(unique_others)


SPM could be treated as a separate category.

#### Institution

In [None]:
# # Updated function to categorize institution name
# def categorize_institution(name):
#     if not isinstance(name, str):
#         return 'Other'

#     # Clean and standardize
#     clean_name = name.strip().replace('\n', '').upper()

#     # List of known Singapore universities (you can expand this list as needed)
#     singapore_universities = [
#         'NUS', 'NATIONAL UNIVERSITY OF SINGAPORE',
#         'NTU', 'NANYANG TECHNOLOGICAL UNIVERSITY',
#         'SUSS', 'SINGAPORE UNIVERSITY OF SOCIAL SCIENCES',
#         'SIT', 'SINGAPORE INSTITUTE OF TECHNOLOGY',
#         'SMU', 'SINGAPORE MANAGEMENT UNIVERSITY'
#     ]

#     # Check for known universities
#     if any(uni in clean_name for uni in singapore_universities) or 'UNIVERSITY' in clean_name:
#         return 'University'
#     elif 'POLYTECHNIC' in clean_name:
#         return 'Polytechnic'
#     elif clean_name == 'ITE':
#         return 'ITE'
#     else:
#         return 'Other'

# # Apply cleaning and categorization
# student_profiles['Institution Clean'] = student_profiles['INSTITUTION_NAME'].astype(str).str.strip().str.replace('\n', '')
# student_profiles['Institution Clean'] = student_profiles['Institution Clean'].str.upper()

# student_profiles['Institution Category'] = student_profiles['Institution Clean'].apply(categorize_institution)

# # Count by category
# institution_counts = student_profiles['Institution Category'].value_counts().reset_index()
# institution_counts.columns = ['Institution Category', 'Count']

# print("Counts by Institution Category:")
# display(institution_counts)


In [None]:
# # Filter rows where the Institution Category is 'Other'
# other_institutions = student_profiles[student_profiles['Institution Category'] == 'Other']

# print("Institutions categorized as 'Other':")
# display(other_institutions[['INSTITUTION_NAME', 'Institution Clean']].drop_duplicates())


In [None]:
# # Filter rows where Institution Category is 'Other'
# other_institutions = student_profiles[student_profiles['Institution Category'] == 'Other']

# # Get unique cleaned institution names in the 'Other' category
# unique_other_institutions = other_institutions['Institution Clean'].unique()

# print("All institution names categorized as 'Other':")
# for inst in unique_other_institutions:
#     print(inst)


#### DATE ATTAINED HIGHEST QUALIFICATION
----

In [None]:
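# Convert 'DOB' first so the comparison below is datetime against datetime
# (DOB is stored as dd/mm/yyyy text)
student_profiles['DOB'] = pd.to_datetime(student_profiles['DOB'], errors='coerce', format='%d/%m/%Y')

# Convert 'DATE ATTAINED HIGHEST QUALIFICATION' to datetime objects
student_profiles['DATE ATTAINED HIGHEST QUALIFICATION'] = pd.to_datetime(student_profiles['DATE ATTAINED HIGHEST QUALIFICATION'], errors='coerce')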

# Convert 'COMMENCEMENT DATE' to datetime objects
student_profiles['COMMENCEMENT DATE'] = pd.to_datetime(student_profiles['COMMENCEMENT DATE'], errors='coerce')


# Check for unrealistic dates: Date attained should be after DOB and likely before or around COMMENCEMENT DATE
unrealistic_qualification_dates = student_profiles[
    (student_profiles['DATE ATTAINED HIGHEST QUALIFICATION'].notna()) &
    (
        (student_profiles['DATE ATTAINED HIGHEST QUALIFICATION'] < student_profiles['DOB']) |
        (student_profiles['DATE ATTAINED HIGHEST QUALIFICATION'] > student_profiles['COMMENCEMENT DATE']) # Assuming qualification is attained before or around course start
    )
]

print("Rows with potentially unrealistic 'DATE ATTAINED HIGHEST QUALIFICATION':")
display(unrealistic_qualification_dates[['STUDENT ID', 'DOB', 'DATE ATTAINED HIGHEST QUALIFICATION', 'COMMENCEMENT DATE']])

# Check for missing values after conversion
missing_qualification_dates = student_profiles['DATE ATTAINED HIGHEST QUALIFICATION'].isna().sum()
print(f"\nNumber of missing values in 'DATE ATTAINED HIGHEST QUALIFICATION' after conversion: {missing_qualification_dates}")

In [None]:
# Convert 'DOB' to datetime objects, coercing errors
student_profiles['DOB'] = pd.to_datetime(student_profiles['DOB'], errors='coerce', format='%d/%m/%Y')

# Convert 'DATE ATTAINED HIGHEST QUALIFICATION' to datetime, if not already
student_profiles['DATE ATTAINED HIGHEST QUALIFICATION'] = pd.to_datetime(
    student_profiles['DATE ATTAINED HIGHEST QUALIFICATION'], errors='coerce', format='%d/%m/%Y')

# Calculate age in years at time of qualification
student_profiles['QUALIFICATION_AGE_YEARS'] = (
    (student_profiles['DATE ATTAINED HIGHEST QUALIFICATION'] - student_profiles['DOB']).dt.days / 365.25
).round(1)

# Sort the DataFrame by calculated age in ascending order
sorted_qualification_age = student_profiles.sort_values(by='QUALIFICATION_AGE_YEARS', ascending=True)

print("Age in years at the time of attaining highest qualification (sorted ascending):")
display(sorted_qualification_age[['STUDENT ID', 'DOB', 'NAME OF QUALIFICATION AND INSTITUTION', 'DATE ATTAINED HIGHEST QUALIFICATION', 'QUALIFICATION_AGE_YEARS']])

Looking at this data, some of these students could not plausibly have attained these qualifications at the ages shown. Therefore, we will drop them:
`[ 111, 33, 113, 38, 267]`

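A minimal sketch of the drop, assuming the numbers listed above are index labels of `student_profiles` (the text does not say so explicitly). The toy frame `demo` and its index are made up for illustration.

```python
import pandas as pd

# Toy frame with a few of the listed labels present in its index.
demo = pd.DataFrame(
    {"STUDENT ID": [f"S{i}" for i in range(5)]},
    index=[10, 33, 38, 111, 200],
)
rows_to_drop = [111, 33, 113, 38, 267]  # labels listed above, assumed to be index labels

# errors="ignore" skips labels that are absent from this toy index.
demo_clean = demo.drop(index=rows_to_drop, errors="ignore")
print(demo_clean.index.tolist())
```

Dropping by label only works as intended if the index has not been reset since the rows were identified.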
#### Designation
---

In [None]:
# Check unique values and their counts in the 'DESIGNATION' column
designation_counts = student_profiles['DESIGNATION'].value_counts().reset_index()
designation_counts.columns = ['Designation', 'Count']

print("Unique values and counts in 'DESIGNATION' column:")
# Display top 50 designations if there are many unique values
if len(designation_counts) > 50:
    display(designation_counts.head(50))
    print(f"\n... and {len(designation_counts) - 50} more unique designations.")
else:
    display(designation_counts)

In [None]:
from collections import Counter
import re

# Combine all designations into a single string, handling potential NaNs
all_designations = ' '.join(student_profiles['DESIGNATION'].dropna().astype(str).str.lower())

# Split the string into words, using regex to find word characters
words = re.findall(r'\b\w+\b', all_designations)

# Count the frequency of each word
word_counts = Counter(words)

# Convert to a DataFrame for easier display
word_counts_df = pd.DataFrame.from_dict(word_counts, orient='index', columns=['Count']).reset_index()
word_counts_df = word_counts_df.rename(columns={'index': 'Word'})

# Sort by count in descending order
word_counts_df = word_counts_df.sort_values(by='Count', ascending=False)

print("Count of each word in 'DESIGNATION' column:")
display(word_counts_df.head(50)) # Displaying the top 50 most frequent words

In [None]:
# Function to categorize designations based on keywords
def categorize_designation(designation):
    if pd.isna(designation) or str(designation).strip() in ['', '-', 'N.A.']:
        return 'Unknown'
    designation = str(designation).lower()
    if 'manager' in designation:
        return 'Manager'
    elif 'executive' in designation:
        return 'Executive'
    elif 'assistant' in designation:
        return 'Assistant'
    elif 'officer' in designation:
        return 'Officer'
    elif 'specialist' in designation:
        return 'Specialist'
    elif 'consultant' in designation:
        return 'Consultant'
    elif 'coordinator' in designation:
        return 'Coordinator'
    elif 'head' in designation:
        return 'Head'
    elif 'director' in designation:
        return 'Director'
    elif 'analyst' in designation:
        return 'Analyst'
    elif 'administrator' in designation:
        return 'Administrator'
    elif 'clerk' in designation:
        return 'Clerk'
    elif 'teacher' in designation or 'lecturer' in designation:
        return 'Educator'
    elif 'audit' in designation:
        return 'Audit'
    elif 'finance' in designation or 'accountant' in designation:
        return 'Finance/Accounting'
    elif 'marketing' in designation or 'business development' in designation:
        return 'Marketing/BD'
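    # 'hr' and 'it' are matched as whole words (re is imported in the previous cell);
    # a plain substring test would also fire on words such as 'three' or 'security'.
    elif 'recruitment' in designation or re.search(r'\bhr\b', designation) or 'human resource' in designation:
        return 'HR/Recruitment'
    elif re.search(r'\bit\b', designation) or 'developer' in designation:
        return 'IT/Tech'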
    elif 'operation' in designation:
        return 'Operations'
    # Add more categories as needed
    else:
        return 'Other'

# Apply the categorization function
student_profiles['DESIGNATION_CATEGORY'] = student_profiles['DESIGNATION'].apply(categorize_designation)

# Check the counts of the new categories
designation_category_counts = student_profiles['DESIGNATION_CATEGORY'].value_counts().reset_index()
designation_category_counts.columns = ['Designation Category', 'Count']
print("\nCounts of Designation Categories:")
display(designation_category_counts)

#### Commencement Date & Completion Date
---

In [None]:
# Convert 'COMMENCEMENT DATE' and 'COMPLETION DATE' to datetime objects
student_profiles['COMMENCEMENT DATE'] = pd.to_datetime(student_profiles['COMMENCEMENT DATE'], errors='coerce')
student_profiles['COMPLETION DATE'] = pd.to_datetime(student_profiles['COMPLETION DATE'], errors='coerce')

# Add 'COURSE_COMPLETED' column
student_profiles['COURSE_COMPLETED'] = student_profiles['COMPLETION DATE'].notna()

# Display the relevant columns to verify
display(student_profiles[['STUDENT ID', 'COMMENCEMENT DATE', 'COMPLETION DATE', 'COURSE_COMPLETED']])

In [None]:
# Check rows where 'COMPLETION DATE' is null (NaT)
missing_completion_date_rows = student_profiles[student_profiles['COMPLETION DATE'].isna()]

print("Rows with null 'COMPLETION DATE':")
display(missing_completion_date_rows)

In [None]:
# Find rows with invalid commencement dates (NaT after conversion)
invalid_commencement_dates_rows = student_profiles[student_profiles['COMMENCEMENT DATE'].isna()]

# Get the student IDs from rows with invalid commencement dates
student_ids_with_invalid_commencement = invalid_commencement_dates_rows['STUDENT ID']

# Check which of these student IDs exist in the semester_results DataFrame
students_in_semester_results_bool = student_ids_with_invalid_commencement.isin(semester_results['STUDENT ID'])

# Count students with invalid commencement dates who are in semester_results
count_in_semester_results = students_in_semester_results_bool.sum()

# Count students with invalid commencement dates who are not in semester_results
count_not_in_semester_results = (~students_in_semester_results_bool).sum()

print(f"Number of students with invalid commencement dates found in semester_results: {count_in_semester_results}")
print(f"Number of students with invalid commencement dates not found in semester_results: {count_not_in_semester_results}")

# Display the rows from semester_results for students with invalid commencement dates (as done before)
students_in_semester_results_df = semester_results[semester_results['STUDENT ID'].isin(student_ids_with_invalid_commencement)]
print("\nRows from semester_results for students with invalid commencement dates:")
display(students_in_semester_results_df)

From this, we can fill in the commencement and completion dates from other rows that share the same `COURSE_ID` and `INTAKE_NO`.

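A possible way to do this fill is sketched below. It assumes `COURSE_ID` and `INTAKE_NO` have already been split out of the student ID, and it uses a made-up toy frame `demo` rather than the real data.

```python
import pandas as pd

demo = pd.DataFrame({
    "COURSE_ID": ["1101", "1101", "1101", "2102"],
    "INTAKE_NO": ["013", "013", "013", "063"],
    "COMMENCEMENT DATE": pd.to_datetime(["2022-04-18", None, "2022-04-18", "2021-01-04"]),
    "COMPLETION DATE": pd.to_datetime(["2022-09-14", None, None, "2021-06-30"]),
})

keys = ["COURSE_ID", "INTAKE_NO"]
for col in ["COMMENCEMENT DATE", "COMPLETION DATE"]:
    # "first" returns the first non-null date in each (course, intake) group.
    group_date = demo.groupby(keys)[col].transform("first")
    demo[col] = demo[col].fillna(group_date)

print(demo)
```

Because `COURSE_COMPLETED` is derived from whether `COMPLETION DATE` is present, it should be computed before any fill of this kind, otherwise every filled row would be marked as completed.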
#### FULL-TIME OR PART-TIME
----

In [None]:
# Check unique values in the 'FULL-TIME OR PART-TIME' column
unique_course_type = student_profiles['FULL-TIME OR PART-TIME'].value_counts().reset_index()
unique_course_type.columns = ['Course Type', 'Count']
print("Unique values in 'FULL-TIME OR PART-TIME' column:")
display(unique_course_type)

Based on the counts, we can merge `Part Time` into `Part-Time` by replacing the space with a hyphen.

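A short sketch of that replacement, using a made-up Series in place of the real column:

```python
import pandas as pd

course_type = pd.Series(["Part-Time", "Part Time", "Full-Time", " Full Time "])

# Trim surrounding whitespace, then turn the inner space into a hyphen.
course_type_clean = course_type.str.strip().str.replace(" ", "-", regex=False)
print(course_type_clean.unique())
```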
#### COURSE FUNDING
----

In [None]:
# Check unique values and their counts in the 'COURSE FUNDING' column
course_funding_counts = student_profiles['COURSE FUNDING'].value_counts(dropna=False).reset_index()
course_funding_counts.columns = ['Course Funding', 'Count']

print("Unique values and counts in 'COURSE FUNDING' column:")
display(course_funding_counts)

The unique values show that this column has not been standardized yet.

The table below shows how each value will be standardized.


| **Original Values**                                   | **Cleaned / Grouped As**         |
| ----------------------------------------------------- | -------------------------------- |
| Individual<br>Individual  <br>Indivodual<br>Indvidual | `Individual`                     |
| Individual - SFC<br>Individual-SFC<br>Indvidual - SFC  | `Individual - SFC`               |
| Individual - waived App Fee                           | `Individual - Waived App Fee`    |
| Individual - SFC + \$1000 SCHOLARSHIP                 | `Individual - SFC + Scholarship` |
| Sponsored<br>Sponsored  <br>Sponsored - no SDF  <br>Sponsored-no SDF       | `Sponsored`                      |
| Sponsored - SDF                                       | `Sponsored - SDF`                |

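The mapping in the table could be applied roughly as follows. This is a sketch under assumptions: `FUNDING_MAP` and `standardize_funding` are hypothetical names, and keys are lowercased with the spacing around hyphens normalised so that `Individual-SFC` and `Individual - SFC` share one key.

```python
import re
import pandas as pd

# Keys are lowercase with hyphens normalised to " - "; spellings come from the table above.
FUNDING_MAP = {
    "individual": "Individual",
    "indivodual": "Individual",
    "indvidual": "Individual",
    "individual - sfc": "Individual - SFC",
    "indvidual - sfc": "Individual - SFC",
    "individual - waived app fee": "Individual - Waived App Fee",
    "individual - sfc + $1000 scholarship": "Individual - SFC + Scholarship",
    "sponsored": "Sponsored",
    "sponsored - no sdf": "Sponsored",
    "sponsored - sdf": "Sponsored - SDF",
}

def standardize_funding(value):
    if pd.isna(value):
        return value
    key = re.sub(r"\s*-\s*", " - ", str(value).strip().lower())
    key = re.sub(r"\s+", " ", key)
    # Values not in the map are returned unchanged so they can be reviewed.
    return FUNDING_MAP.get(key, value)

sample = pd.Series(["Individual  ", "Indvidual - SFC", "Sponsored-no SDF", "Sponsored - SDF"])
print(sample.map(standardize_funding).tolist())
```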



#### REGISTRATION FEE
----

In [None]:
# Check unique values and their counts in the 'REGISTRATION FEE' column
registration_fee_counts = student_profiles['REGISTRATION FEE'].value_counts(dropna=False).reset_index()
registration_fee_counts.columns = ['Registration Fee', 'Count']

print("Unique values and counts in 'REGISTRATION FEE' column:")
display(registration_fee_counts)

There appears to be a typing error, `107\n107`, which we will change to `107`.
`Waived` will be changed to `0`.

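These two fixes could look like the sketch below. The helper name `clean_registration_fee` is hypothetical and the sample Series is made up for illustration.

```python
import pandas as pd

fees = pd.Series(["107", "107\n107", "Waived", 107, None])

def clean_registration_fee(value):
    if pd.isna(value):
        return float("nan")
    text = str(value).strip()
    if not text:
        return float("nan")
    if text.lower() == "waived":
        return 0.0
    # Keep only the first line, so "107\n107" becomes "107".
    return float(pd.to_numeric(text.splitlines()[0], errors="coerce"))

fees_clean = fees.map(clean_registration_fee)
print(fees_clean.tolist())
```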
#### PAYMENT MODE
----

In [None]:
# 15. Payment Mode
# Check unique values and their counts in the 'PAYMENT MODE' column
payment_mode_counts = student_profiles['PAYMENT MODE'].value_counts().reset_index()
payment_mode_counts.columns = ['Payment Mode', 'Count']

print("Unique values and counts in 'PAYMENT MODE' column:")
display(payment_mode_counts)

Looking at the unique values, we will need to standardize them as follows:

| **Original Values** | **Cleaned / Grouped As** |
| ------------------- | ------------------------ |
| Nets                | NETS                     |
| NETS                | NETS                     |
| Giro                | Giro                     |
| GIRO                | Giro                     |
| PayNow              | PayNow                   |
| Cr Card             | Credit Card              |
| Waived              | Waived                   |
| Bank                | Bank                     |

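A sketch of how the table above could be applied. `PAYMENT_MODE_MAP` is a hypothetical name, and lookups are done on trimmed, lowercased values so that `Nets` and `NETS` share one key.

```python
import pandas as pd

PAYMENT_MODE_MAP = {
    "nets": "NETS",
    "giro": "Giro",
    "paynow": "PayNow",
    "cr card": "Credit Card",
    "waived": "Waived",
    "bank": "Bank",
}

modes = pd.Series(["Nets", "NETS", "GIRO", "Cr Card", "PayNow", None])
keys = modes.str.strip().str.lower()

# Values without a mapping fall back to the original entry.
modes_clean = keys.map(PAYMENT_MODE_MAP).fillna(modes)
print(modes_clean.tolist())
```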

#### COURSE FEE
---

In [None]:
# 16. Course Fee
# Remove '$' symbol and convert to numeric, coercing errors
student_profiles['COURSE FEE_cleaned'] = student_profiles['COURSE FEE'].astype(str).str.replace('$', '', regex=False)
student_profiles['COURSE FEE_cleaned'] = pd.to_numeric(student_profiles['COURSE FEE_cleaned'], errors='coerce')

# Identify rows with values that couldn't be converted to numeric
invalid_course_fee_rows = student_profiles[student_profiles['COURSE FEE_cleaned'].isna()]

print("Rows with invalid 'COURSE FEE' values that could not be converted to numeric:")
display(invalid_course_fee_rows[['STUDENT ID', 'COURSE FEE', 'COURSE FEE_cleaned']])

# Convert to float with 2 decimal places
student_profiles['COURSE FEE_cleaned'] = student_profiles['COURSE FEE_cleaned'].astype(float).round(2)

# Display the cleaned column and verify data type
print("\n'COURSE FEE' column after cleaning and conversion to numeric (with 2 decimal places):")
display(student_profiles[['STUDENT ID', 'COURSE FEE', 'COURSE FEE_cleaned']].head())

print("\nData type of 'COURSE FEE_cleaned' after cleaning:")
print(student_profiles['COURSE FEE_cleaned'].dtype)

In [None]:
# Update COURSE FEE with cleaned values
student_profiles = student_profiles.drop(columns=['COURSE FEE'])
student_profiles = student_profiles.rename(columns={'COURSE FEE_cleaned': 'COURSE FEE'})

## Data Cleaning for Student Profiles
---

### Data Cleaning Steps for Student Profile

#### 1. Student ID Decomposition

* Split `STUDENT ID` into three new columns:

  * `COURSE_ID`
  * `INTAKE_NO`
  * `INDEX_NO`
* Format: `<COURSE_ID>-<INTAKE_NO>/<INDEX_NO>`

#### 2. Remove Unmatched Semester Results

* Remove entries from `semester_results` if the student does not exist in the `student_profiles` table.

#### 3. Gender

* Already clean: the column has only two unique values (`M` and `F`).

#### 4. Residential Status

* Combine `SG CITIZEN`, `SG PR`, and `FOREIGNER` into a single column: `RESIDENTIAL_STATUS`.

  * Logic: Only one of the three columns contains "Y"; others are null.
  * New values: `Singapore Citizen`, `PR`, or `Foreigner`.
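
The "exactly one `Y` per row" assumption can be checked before the three columns are merged. The sketch below uses made-up sample rows (the last one deliberately breaks the rule) and is only one possible way to run the check:

```python
import pandas as pd

flag_columns = ["SG CITIZEN", "SG PR", "FOREIGNER"]

# Hypothetical sample rows; the last row deliberately violates the rule.
sample = pd.DataFrame({
    "SG CITIZEN": ["Y", None, None, "Y"],
    "SG PR": [None, "Y", None, "Y"],
    "FOREIGNER": [None, None, "Y", None],
})

# Count how many of the three flags are set to 'Y' in each row.
flags_set = sample[flag_columns].eq("Y").sum(axis=1)

# Rows where the count is not exactly one need manual review.
violations = sample[flags_set != 1]
print(violations.index.tolist())
```

Any row listed here would end up with a wrong or missing `RESIDENTIAL_STATUS` and should be reviewed before the original columns are dropped.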

#### 5. Nationality

* Rename `COUNTRY OF OTHER NATIONALITY` to `NATIONALITY`.
* Standardize values:
  * Correct spelling errors (e.g., `Malaysian` → `Malaysia`)
* Replace null or blank values with: `Singapore`.

#### 6. Date of Birth (`DOB`)

* Values are not in datetime format and need to be converted.
* Some rows use an abbreviated month name instead of a number.
* Remove rows where `DOB` is null or cannot be parsed.
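
Because numeric and abbreviated-month formats are mixed in one column, one option is to parse each value on its own. The sample values and the helper name `parse_dob` below are hypothetical; this is a sketch, not the notebook's actual method:

```python
import pandas as pd

# Hypothetical raw values mixing numeric and abbreviated-month formats.
raw_dob = pd.Series(["12/01/1990", "3 Feb 1988", "15-Mar-1995", "not a date", None])

def parse_dob(value):
    """Parse one day-first date string; return NaT when it is missing or unparseable."""
    if pd.isna(value):
        return pd.NaT
    return pd.to_datetime(value, dayfirst=True, errors="coerce")

# Parsing value by value means one format does not dictate how the others are read.
parsed = raw_dob.map(parse_dob)
print(parsed)
```

Element-wise parsing is slower than a single vectorized call, but it is tolerant of mixed formats, which matters for manually entered data.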

#### 7. Highest Qualification

* Drop rows where the value is blank (`" "`)

#### 8. Name of Qualification and Institution

* Split `NAME OF QUALIFICATION AND INSTITUTION` into two parts:

  * `QUALIFICATION_NAME` (text before the first `,` or `/`)
  * `INSTITUTION_NAME` (text after the first `,` or `/`)
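
A minimal sketch of this split, assuming made-up sample values (the real column content may differ):

```python
import pandas as pd

column = "NAME OF QUALIFICATION AND INSTITUTION"

# Hypothetical sample values for illustration only.
sample = pd.DataFrame({
    column: [
        "Diploma in IT, Example Polytechnic",
        "GCE O Level / Example Secondary School",
        "Bachelor of Science",  # no separator present
    ]
})

# Split once on the first ',' or '/' and swallow the whitespace around it.
parts = sample[column].str.split(r"\s*[,/]\s*", n=1, expand=True)

# Guarantee two columns even if no value contains a separator.
parts = parts.reindex(columns=[0, 1])

sample["QUALIFICATION_NAME"] = parts[0].str.strip()
sample["INSTITUTION_NAME"] = parts[1].str.strip()
print(sample[["QUALIFICATION_NAME", "INSTITUTION_NAME"]])
```

Rows without a separator keep the full text as `QUALIFICATION_NAME` and get a missing `INSTITUTION_NAME`.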

#### 9. Date Attained Highest Qualification

* Check whether the date is realistic compared to `DOB` and `COMMENCEMENT DATE`
* Remove the rows with index labels `[111, 33, 113, 38, 267]`
  * These rows show students who would not have met the minimum age expected for their qualification.
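
One way such rows could be found programmatically is sketched below. The column name `DATE ATTAINED`, the sample data, and the `MIN_AGE_YEARS` threshold are all assumptions for illustration, and "realistic" is read here as "old enough, and attained before the course commenced":

```python
import pandas as pd

# Hypothetical sample; column names and dates are made up.
sample = pd.DataFrame({
    "DOB": pd.to_datetime(["1990-05-01", "1995-03-15", "2000-07-20"]),
    "DATE ATTAINED": pd.to_datetime(["2012-06-30", "2001-01-01", "2025-01-01"]),
    "COMMENCEMENT DATE": pd.to_datetime(["2020-01-06", "2020-01-06", "2020-01-06"]),
})

MIN_AGE_YEARS = 15  # assumed threshold, not taken from the notebook

# Approximate age (in years) when the qualification was attained.
age_at_qualification = (sample["DATE ATTAINED"] - sample["DOB"]).dt.days / 365.25

too_young = age_at_qualification < MIN_AGE_YEARS
after_commencement = sample["DATE ATTAINED"] > sample["COMMENCEMENT DATE"]

# Index labels of rows that fail either check.
suspect_rows = sample.index[too_young | after_commencement].tolist()
print(suspect_rows)
```

Deriving the list this way keeps the rule visible, whereas a hard-coded list of index labels silently goes stale if the input file changes.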

#### 10. Designation

* Use keyword mapping to reduce the number of unique values
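
A keyword mapping of this kind might look like the sketch below. The keywords, group names, fallback labels, and the helper `map_designation` are hypothetical, since the actual rules are not listed here:

```python
import pandas as pd

# Hypothetical keyword -> group rules, checked in order.
KEYWORD_GROUPS = [
    ("manager", "Manager"),
    ("engineer", "Engineer"),
    ("executive", "Executive"),
]

def map_designation(value):
    """Return the first group whose keyword occurs in the value, else 'Other'."""
    if pd.isna(value):
        return "Unknown"
    text = str(value).lower()
    for keyword, group in KEYWORD_GROUPS:
        if keyword in text:
            return group
    return "Other"

sample = pd.Series(["Senior Project Manager", "software engineer", "Sales Executive", "Chef", None])
mapped = sample.map(map_designation)
print(mapped.tolist())
```

The order of the rules matters: a title containing two keywords is assigned to whichever rule appears first.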

#### 11. Commence Date & Completion Date

* Convert to datetime format
* Fill blank values with the most frequent date among students who share the same `COURSE_ID` and `INTAKE_NO`

#### 12. Full-time or Part-time

* Standardize values:
  * Correct spelling errors (e.g., `Full Time` → `Full-Time`).
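
A sketch of one way to normalize these variants. The sample values and the helper `standardize_study_mode` are hypothetical, and `Part-Time` is assumed as the target spelling by analogy with `Full-Time`:

```python
import re
import pandas as pd

def standardize_study_mode(value):
    """Normalize spelling variants to 'Full-Time' or 'Part-Time'; leave others untouched."""
    if pd.isna(value):
        return value
    # Keep letters only so 'Full Time', 'full-time' and 'FULLTIME' compare equal.
    key = re.sub(r"[^a-z]", "", str(value).lower())
    return {"fulltime": "Full-Time", "parttime": "Part-Time"}.get(key, value)

sample = pd.Series(["Full Time", "full-time ", "Part time", "Part-Time", "Unknown"])
standardized = sample.map(standardize_study_mode)
print(standardized.tolist())
```

Values that match neither variant are passed through unchanged so they remain visible in a later `value_counts()` check.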

#### 13. Course Funding Type

* Standardize values to one of the following:

  * `Individual`
  * `Individual - SFC`
  * `Individual - SFC + Scholarship`
  * `Individual - Waived App Fee`
  * `Sponsored`
  * `Sponsored - SDF`

* Fix spelling errors and remove leading/trailing spaces

#### 14. Registration Fee

* Convert to Float type
* Standardize values:
  * Correct data-entry errors (e.g. `107\n107` → `107`)
  * Convert `Waived` to `0`

#### 15. Payment Mode

* Standardize values:
  * `NETS`
  * `Giro`
  * `PayNow`
  * `Credit Card`
  * `Waived`
  * `Bank`

#### 16. Course Fee

* Convert to Float type

---


In [None]:
# 1. Student ID Decomposition
student_profiles[['COURSE_ID', 'INTAKE_NO_INDEX']] = student_profiles['STUDENT ID'].str.split('-', n=1, expand=True)
student_profiles[['INTAKE_NO', 'INDEX_NO']] = student_profiles['INTAKE_NO_INDEX'].str.split('/', n=1, expand=True)
student_profiles = student_profiles.drop(columns=['INTAKE_NO_INDEX'])

print("Student ID decomposed:")
display(student_profiles[['STUDENT ID', 'COURSE_ID', 'INTAKE_NO', 'INDEX_NO']].head())

In [None]:
# 2. Remove Unmatched Semester Results
# Get the set of student IDs from student_profiles
student_profiles_ids = set(student_profiles['STUDENT ID'].dropna().unique())

# Filter semester_results to keep only rows where the student ID is in student_profiles
semester_results_cleaned = semester_results[semester_results['STUDENT ID'].isin(student_profiles_ids)].copy()

print("Semester results after removing unmatched student IDs:")
display(semester_results_cleaned.head())
print(f"\nNumber of rows before removal: {len(semester_results)}")
print(f"Number of rows after removal: {len(semester_results_cleaned)}")

# Update the semester_results DataFrame
semester_results = semester_results_cleaned

In [None]:
# 4. Residential Status
# Define a function to determine residential status
def get_residential_status(row):
    if row['SG CITIZEN'] == 'Y':
        return 'Singapore Citizen'
    elif row['SG PR'] == 'Y':
        return 'PR'
    elif row['FOREIGNER'] == 'Y':
        return 'Foreigner'
    else:
        return None # Should not happen based on previous check, but good practice

# Apply the function to create the new 'RESIDENTIAL_STATUS' column
student_profiles['RESIDENTIAL_STATUS'] = student_profiles.apply(get_residential_status, axis=1)

# Drop the original columns
student_profiles = student_profiles.drop(columns=['SG CITIZEN', 'SG PR', 'FOREIGNER'])

print("Residential Status column created and original columns dropped:")
display(student_profiles[['STUDENT ID', 'RESIDENTIAL_STATUS']].head())

In [None]:
# 5. Nationality
# Rename the column
student_profiles = student_profiles.rename(columns={'COUNTRY OF OTHER NATIONALITY': 'NATIONALITY'})

# Standardize values (correcting 'Malaysian' to 'Malaysia')
student_profiles['NATIONALITY'] = student_profiles['NATIONALITY'].replace('Malaysian', 'Malaysia')

# Replace null or blank values with 'Singapore'
student_profiles['NATIONALITY'] = student_profiles['NATIONALITY'].replace(r'^\s*$', 'Singapore', regex=True) # Handle blank strings
student_profiles['NATIONALITY'] = student_profiles['NATIONALITY'].fillna('Singapore') # Handle actual NaN values

print("Nationality column cleaned:")
display(student_profiles[['STUDENT ID', 'NATIONALITY']].head())

# Verify the changes
print("\nUnique values in 'NATIONALITY' after cleaning:")
display(student_profiles['NATIONALITY'].value_counts().reset_index())

In [None]:
# 6. Date of Birth (DOB)
# Convert 'DOB' to datetime (day first); values that cannot be parsed are coerced to NaT
student_profiles['DOB'] = pd.to_datetime(student_profiles['DOB'], errors='coerce', dayfirst=True)

# Remove rows where 'DOB' is null (NaT)
student_profiles_cleaned = student_profiles.dropna(subset=['DOB']).copy()

print("\nRows with invalid 'DOB' values removed.")
print(f"Number of rows before dropping: {len(student_profiles)}")
print(f"Number of rows after dropping: {len(student_profiles_cleaned)}")

# Update the student_profiles DataFrame
student_profiles = student_profiles_cleaned

# Check for values that couldn't be parsed (should be empty now)
invalid_dob_after_drop = student_profiles[student_profiles['DOB'].isna()]

print("\nRows with invalid 'DOB' values after dropping (should be empty):")
display(invalid_dob_after_drop[['STUDENT ID', 'DOB']])

In [None]:
# 7. Highest Qualification
# Drop rows where 'HIGHEST QUALIFICATION' is a blank string
student_profiles_cleaned = student_profiles[student_profiles['HIGHEST QUALIFICATION'].str.strip() != ''].copy()

print("Student profiles after dropping rows with blank 'Highest Qualification':")
display(student_profiles_cleaned.head())

print(f"\nNumber of rows before dropping: {len(student_profiles)}")
print(f"Number of rows after dropping: {len(student_profiles_cleaned)}")

# Update the student_profiles DataFrame
student_profiles = student_profiles_cleaned

In [None]:
# 9. Date Attained Highest Qualification
# Drop rows with the identified unrealistic qualification ages based on index
rows_to_drop = [111, 33, 113, 38, 267]
student_profiles_cleaned = student_profiles.drop(index=rows_to_drop, errors='ignore').copy()

print(f"Number of rows before dropping: {len(student_profiles)}")
print(f"Number of rows after dropping: {len(student_profiles_cleaned)}")

# Update the student_profiles DataFrame
student_profiles = student_profiles_cleaned

print("\nStudent profiles after dropping rows with unrealistic qualification ages:")
display(student_profiles.head())

In [None]:
# 11. Commence Date & Completion Date - Insert blank values by looking at COURSE_ID & INTAKE_NO

# Combine COURSE_ID and INTAKE_NO to create a unique intake identifier
student_profiles['INTAKE_IDENTIFIER'] = student_profiles['COURSE_ID'].astype(str) + '-' + student_profiles['INTAKE_NO'].astype(str)

# Calculate initial missing values
missing_commencement_before = student_profiles['COMMENCEMENT DATE'].isna().sum()
missing_completion_before = student_profiles['COMPLETION DATE'].isna().sum()

# Function to fill missing dates within each intake group
def fill_missing_dates(group):
    # Fill missing COMMENCEMENT DATE with the most frequent date in the group
    if group['COMMENCEMENT DATE'].isnull().any():
        most_frequent_commencement = group['COMMENCEMENT DATE'].mode()
        if not most_frequent_commencement.empty:
            group['COMMENCEMENT DATE'] = group['COMMENCEMENT DATE'].fillna(most_frequent_commencement[0])

    # Fill missing COMPLETION DATE with the most frequent date in the group
    if group['COMPLETION DATE'].isnull().any():
        most_frequent_completion = group['COMPLETION DATE'].mode()
        if not most_frequent_completion.empty:
            group['COMPLETION DATE'] = group['COMPLETION DATE'].fillna(most_frequent_completion[0])

    return group

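# Apply the filling function to each intake group
# group_keys=False keeps the original row index instead of adding the group key as an index level
student_profiles_filled_dates = student_profiles.groupby('INTAKE_IDENTIFIER', group_keys=False).apply(fill_missing_dates)

# Drop the temporary intake identifier column (if it is still present)
student_profiles_filled_dates = student_profiles_filled_dates.drop(columns=['INTAKE_IDENTIFIER'], errors='ignore')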

print("Student profiles after attempting to fill missing commencement and completion dates:")
display(student_profiles_filled_dates[['STUDENT ID', 'COMMENCEMENT DATE', 'COMPLETION DATE']].head())

# Calculate remaining missing values
missing_commencement_after_fill = student_profiles_filled_dates['COMMENCEMENT DATE'].isna().sum()
missing_completion_after_fill = student_profiles_filled_dates['COMPLETION DATE'].isna().sum()

# Calculate number of filled values
filled_commencement = missing_commencement_before - missing_commencement_after_fill
filled_completion = missing_completion_before - missing_completion_after_fill


print(f"\nNumber of missing 'COMMENCEMENT DATE' before filling: {missing_commencement_before}")
print(f"Number of missing 'COMMENCEMENT DATE' after filling: {missing_commencement_after_fill}")
print(f"Number of 'COMMENCEMENT DATE' values filled: {filled_commencement}")

print(f"\nNumber of missing 'COMPLETION DATE' before filling: {missing_completion_before}")
print(f"Number of missing 'COMPLETION DATE' after filling: {missing_completion_after_fill}")
print(f"Number of 'COMPLETION DATE' values filled: {filled_completion}")


# Update the student_profiles DataFrame
student_profiles = student_profiles_filled_dates
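
The group-wise fill above can also be written with `groupby(...).transform`, which returns a result aligned to the original rows and needs no temporary identifier column. The sample data and the helper `fill_with_group_mode` below are hypothetical; this is an alternative sketch, not the method used in the cell above:

```python
import pandas as pd

# Hypothetical sample: one intake with a known date, one with none at all.
sample = pd.DataFrame({
    "COURSE_ID": ["1101", "1101", "1101", "1102", "1102"],
    "INTAKE_NO": ["013", "013", "013", "001", "001"],
    "COMMENCEMENT DATE": pd.to_datetime(["2020-01-06", None, "2020-01-06", None, None]),
})

def fill_with_group_mode(series):
    """Fill missing values with the group's most frequent value, if there is one."""
    mode = series.mode()
    return series.fillna(mode.iloc[0]) if not mode.empty else series

# transform keeps the original index, so the result can be assigned straight back.
sample["COMMENCEMENT DATE"] = (
    sample.groupby(["COURSE_ID", "INTAKE_NO"])["COMMENCEMENT DATE"].transform(fill_with_group_mode)
)
print(sample)
```

Intakes in which every date is missing stay unfilled, which matches the behaviour of `fill_missing_dates`.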

In [None]:
# 13. Course Funding Type
# Standardize values based on the provided mapping
funding_mapping = {
    'Individual': 'Individual',
    'Individual ': 'Individual',
    'Indivodual': 'Individual',
    'Indvidual': 'Individual',
    'Individual - SFC': 'Individual - SFC',
    'Individual-SFC': 'Individual - SFC',
    'Indvidual - SFC': 'Individual - SFC',
    # 'Sponsored - no SDF' variants are grouped under 'Sponsored', as in the mapping table
    'Sponsored - no SDF': 'Sponsored',
    'Sponsored-no SDF': 'Sponsored',
    'Sponsored-no SDF ': 'Sponsored',
    'Individual - waived App Fee': 'Individual - Waived App Fee',
    'Individual - SFC + $1000 SCHOLARSHIP': 'Individual - SFC + Scholarship',
    'Sponsored': 'Sponsored',
    'Sponsored ': 'Sponsored',
    'Sponsored - SDF': 'Sponsored - SDF',
    'Sponsored  ': 'Sponsored', # Handle double space

}

student_profiles['COURSE FUNDING'] = student_profiles['COURSE FUNDING'].replace(funding_mapping)

print("Course Funding column standardized:")
display(student_profiles['COURSE FUNDING'].value_counts().reset_index())

# Check for spacing issues by looking at the unique values directly
print("\nUnique values in 'COURSE FUNDING' column after standardization (checking for spacing):")
display(student_profiles['COURSE FUNDING'].unique())

In [None]:
# Remove leading/trailing spaces from 'COURSE FUNDING'
student_profiles['COURSE FUNDING'] = student_profiles['COURSE FUNDING'].str.strip()

print("Course Funding column after stripping spaces:")
display(student_profiles['COURSE FUNDING'].value_counts().reset_index())

print("\nUnique values in 'COURSE FUNDING' column after stripping spaces:")
display(student_profiles['COURSE FUNDING'].unique())

In [None]:
# 14. Registration Fee
# Standardize values: '107\n107' to '107' and 'Waived' to '0'
student_profiles['REGISTRATION FEE'] = student_profiles['REGISTRATION FEE'].replace({'107\n107': '107', 'Waived': '0'})

# Convert to Float type
student_profiles['REGISTRATION FEE'] = student_profiles['REGISTRATION FEE'].astype(float)

print("Registration Fee column cleaned and converted to float:")
display(student_profiles[['STUDENT ID', 'REGISTRATION FEE']].head())

# Verify the data type and unique values after cleaning
print("\nData type of 'REGISTRATION FEE' after cleaning:")
print(student_profiles['REGISTRATION FEE'].dtype)

print("\nUnique values and counts in 'REGISTRATION FEE' after cleaning:")
display(student_profiles['REGISTRATION FEE'].value_counts(dropna=False).reset_index())

In [None]:
# 15. Payment Mode
# Standardize values based on the provided list
payment_mode_mapping = {
    'Nets': 'NETS',
    'NETS': 'NETS',
    'Giro': 'Giro',
    'GIRO': 'Giro',
    'PayNow': 'PayNow',
    'Cr Card': 'Credit Card',
    'Waived': 'Waived',
    'Bank': 'Bank'
}

student_profiles['PAYMENT MODE'] = student_profiles['PAYMENT MODE'].replace(payment_mode_mapping)

print("Payment Mode column standardized:")
display(student_profiles['PAYMENT MODE'].value_counts().reset_index())

In [None]:
student_profiles


## Exporting Data
---

In [None]:
# Define file paths
course_code_file = "cleaned_course_codes.csv"
semester_results_file = "cleaned_semester_results.csv"
student_profiles_file = "cleaned_student_profiles.csv"

# Export DataFrames to CSV files
course_code.to_csv(course_code_file, index=False)
semester_results.to_csv(semester_results_file, index=False)
student_profiles.to_csv(student_profiles_file, index=False)

print(f"Cleaned data saved to:\n- {course_code_file}\n- {semester_results_file}\n- {student_profiles_file}")

# For Google Colab, you can also provide links to download the files
try:
    from google.colab import files
    print("\nStarting browser downloads of the files (Google Colab):")
    # These might not work if the files are very large or in a complex directory structure
    # You might need to manually download from the file explorer in Colab
    files.download(course_code_file)
    files.download(semester_results_file)
    files.download(student_profiles_file)
except ImportError:
    print("\nRunning in a local environment (likely VS Code). Files saved to the current directory.")
    print("You can find the files in your file explorer.")