# DAVI Data Cleaning
## Group-10
## Group Members: Devendran Yoheswaran, Kaung Myat San, Swam Htet Aung
---

## Metadata
---
### Student Profiles

> **Note:** Data is manually entered, so the values are not standardized.

| **Field Name**                            | **Description**                                                                                                                          | **Example**                |
| ----------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------- | -------------------------- |
| **STUDENT ID**                            | Student ID is made up of three attributes: `<Course code>-<Intake No>/<Index Number of student in the intake>`                           | `1101-013/001`             |
| **GENDER**                                | Gender                                                                                                                                   | `M`, `F`                   |
| **SG CITIZEN**                            | Singapore Citizen                                                                                                                        | `Y` or blank               |
| **SG PR**                                 | Singapore Permanent Resident                                                                                                             | `Y` or blank               |
| **FOREIGNER**                             | Neither SG Citizen nor SG PR (mutually exclusive with SG CITIZEN and SG PR)                                                              | `Y` or blank               |
| **COUNTRY OF OTHER NATIONALITY**          | Country of nationality (only for SG PR or foreigner)                                                                                     | `Malaysia`, `India`, etc.  |
| **DOB**                                   | Date of Birth. Format: `DD/MM/YYYY`                                                                                                      | `04/03/1978`               |
| **HIGHEST QUALIFICATION**                 | Highest qualification attained prior to this course                                                                                      | `Certificate`, `Diploma`   |
| **NAME OF QUALIFICATION AND INSTITUTION** | Name of the highest qualification and the institution where it was attained                                                              | As provided by participant |
| **DATE ATTAINED HIGHEST QUALIFICATION**   | Date when the qualification was awarded. Format: `DD/MM/YYYY`                                                                            | `06/11/2016`               |
| **DESIGNATION**                           | Job designation                                                                                                                          | As provided by participant |
| **COMMENCEMENT DATE**                     | Course start date. Format: `DD/MM/YYYY`                                                                                                  | `06/01/2023`               |
| **COMPLETION DATE**                       | Course end date. Blank if course is ongoing. Format: `DD/MM/YYYY`                                                                        | `06/04/2024`               |
| **FULL-TIME OR PART-TIME**                | Whether the course is Full-time or Part-time                                                                                             | `Full-Time`, `Part-Time`   |
| **COURSE FUNDING**                        | Course funding type:<br>- `Individual`<br>- `Individual - SFC` (SkillsFuture Credit)<br>- `Sponsored`<br>- `Individual - waived App Fee` | `Individual - SFC`         |
| **REGISTRATION FEE**                      | Registration fee in SGD                                                                                                                  | As entered                 |
| **PAYMENT MODE**                          | Mode of payment                                                                                                                          | `NETS`, `Giro`, `PayNow`   |
| **COURSE FEE**                            | Course fee in SGD                                                                                                                        | As entered                 |
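
The `STUDENT ID` layout described in the table can be checked mechanically. The following is a minimal sketch, assuming the course code is four digits (as stated under Course Codes) and that the intake and index parts are three digits each, as in the example `1101-013/001`. The helper name `parse_student_id` and the pattern are illustrative and not part of the notebook.

```python
import re

# Assumed layout: <4-digit course code>-<3-digit intake>/<3-digit index>
STUDENT_ID_PATTERN = re.compile(r"(\d{4})-(\d{3})/(\d{3})")

def parse_student_id(student_id):
    """Split a STUDENT ID into its three parts, or return None if malformed."""
    match = STUDENT_ID_PATTERN.fullmatch(str(student_id).strip())
    if match is None:
        return None
    course_code, intake, index = match.groups()
    return {"course_code": course_code, "intake": intake, "index": index}

print(parse_student_id("1101-013/001"))
print(parse_student_id("1101/013-001"))  # separators swapped, so None
```

Because the data is entered manually, returning `None` for malformed values (rather than raising) makes it easy to count and inspect the bad rows before deciding how to repair them.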

---

### Course Codes

| **Field Name**  | **Description**        | **Example**                          |
| --------------- | ---------------------- | ------------------------------------ |
| **CODE**        | Course code (4 digits) | `1101`                               |
| **COURSE NAME** | Course name            | `Diploma in Business Administration` |

---

### Semester Results

| **Field Name** | **Description**                                                                                                           | **Example**    |
| -------------- | ------------------------------------------------------------------------------------------------------------------------- | -------------- |
| **STUDENT ID** | Must match the ID in the Student Profile dataset                                                                          | `1101-013/001` |
| **PERIOD**     | Semester number<br>• Certificate: 1 semester<br>• Diploma: 3 semesters<br>• Master’s: 2 semesters (some exceptions apply) | `1`, `2`, `3`  |
| **GPA**        | GPA for the semester.<br>• Max GPA: 4<br>• Pass GPA: Certificate/Diploma = 2, Master's = 2.3                              | `3.2`          |
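
The GPA rules in the table translate directly into a small check. This is a sketch only: the thresholds come from the table, while the function name `is_pass` and the choice to treat a GPA exactly equal to the pass mark as a pass are assumptions.

```python
# Thresholds taken from the Semester Results metadata table.
MAX_GPA = 4.0
PASS_GPA = {"Certificate": 2.0, "Diploma": 2.0, "Master's": 2.3}

def is_pass(gpa, level):
    """Return True if a semester GPA meets the pass mark for the course level."""
    if not 0 <= gpa <= MAX_GPA:
        raise ValueError(f"GPA {gpa} is outside the valid range 0 to {MAX_GPA}")
    # Assumption: a GPA equal to the pass mark counts as a pass.
    return gpa >= PASS_GPA[level]

print(is_pass(3.2, "Diploma"))
print(is_pass(2.1, "Master's"))
```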

---


## Importing Modules
---

In [2]:
import pandas as pd
import numpy as np

## Loading Data
---

#### Course_Code Data

In [3]:
course_code = pd.read_excel("https://github.com/swamhtetg90/DAVI-CA2/blob/main/DAVI%20CA2%20datasets%20and%20meta%20data/Course%20Codes.xlsx?raw=true")
course_code.head()

|     | CODE | COURSE NAME                        |
| --- | ---- | ---------------------------------- |
| 0   | 1101 | Diploma in Business Administration |
| 1   | 1102 | Diploma in Business Analytics      |
| 2   | 2101 | Certificate in Digital Marketing   |
| 3   | 2102 | Certificate in HR Management       |
| 4   | 2013 | Certificate in Tourism Management  |
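
The metadata states that a course code has four digits. A quick way to confirm this, and to check that no code is listed twice, is sketched below on a toy frame shaped like the preview. The names `codes`, `code_text` and `is_four_digit` are illustrative; in the notebook the same checks would run against `course_code`.

```python
import pandas as pd

# Toy frame shaped like the preview; the real data lives in `course_code`.
codes = pd.DataFrame({
    "CODE": [1101, 1102, 2101],
    "COURSE NAME": [
        "Diploma in Business Administration",
        "Diploma in Business Analytics",
        "Certificate in Digital Marketing",
    ],
})

# Compare as text so that stray characters or wrong lengths are caught.
code_text = codes["CODE"].astype(str).str.strip()
is_four_digit = code_text.str.fullmatch(r"\d{4}")

print("Codes that are not 4 digits:", (~is_four_digit).sum())
print("Duplicated codes:", code_text.duplicated().sum())
```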


#### Semester Results Data

In [4]:
semester_results = pd.read_excel("https://github.com/swamhtetg90/DAVI-CA2/blob/main/DAVI%20CA2%20datasets%20and%20meta%20data/Semester%20Results.xlsx?raw=true")
semester_results.head()

|     | STUDENT ID   | PERIOD | GPA |
| --- | ------------ | ------ | --- |
| 0   | 1101-009/001 | Sem 1  | 3.5 |
| 1   | 1101-009/001 | Sem 2  | 3.6 |
| 2   | 1101-009/001 | Sem 3  | 3.7 |
| 3   | 1101-009/002 | Sem 1  | 3.4 |
| 4   | 1101-009/002 | Sem 2  | 3.5 |


#### Student Profiles Data

In [5]:
student_profiles = pd.read_excel("https://github.com/swamhtetg90/DAVI-CA2/blob/main/DAVI%20CA2%20datasets%20and%20meta%20data/Student%20Profiles.xlsx?raw=true")
student_profiles.head()

|     | STUDENT ID   | GENDER | SG CITIZEN | SG PR | FOREIGNER | COUNTRY OF OTHER NATIONALITY | DOB        | HIGHEST QUALIFICATION | NAME OF QUALIFICATION AND INSTITUTION              | DATE ATTAINED HIGHEST QUALIFICATION | DESIGNATION                 | COMMENCEMENT DATE   | COMPLETION DATE     | FULL-TIME OR PART-TIME | COURSE FUNDING | REGISTRATION FEE | PAYMENT MODE | COURSE FEE |
| --- | ------------ | ------ | ---------- | ----- | --------- | ---------------------------- | ---------- | --------------------- | -------------------------------------------------- | ----------------------------------- | --------------------------- | ------------------- | ------------------- | ---------------------- | -------------- | ---------------- | ------------ | ---------- |
| 0   | 1101-009/001 | F      |            |       | Y         | Malaysia                     | 13/09/1981 | Certificate           | SPM                                                | 2018-01-08                          | Admin & HR Assistant        | 2022-04-18 00:00:00 | 2023-09-17 00:00:00 | Part-Time              | Individual     | 107              | GIRO         | 5136       |
| 1   | 1101-009/002 | F      | Y          |       |           |                              | 26/07/1979 | Certificate           | Certificate in Office Skills, ITE                  | 2016-06-08                          | Admin Assistant             | 2022-04-18 00:00:00 | 2023-09-17 00:00:00 | Part-Time              | Individual-SFC | 107              | NETS         | 5136       |
| 2   | 1101-009/003 | F      |            |       | Y         | India                        | 01/02/1990 | Degree                | Bachelor of Business Administration, Universit...  | 2015-08-08                          | -                           | 2022-04-18 00:00:00 | 2023-09-17 00:00:00 | Part-Time              | Individual     | 107              | NETS         | 5136       |
| 3   | 1101-009/004 | F      |            |       | Y         | Netherlands                  | 20/04/1976 | Diploma               | Office Management Diploma, NCOI Rotterdam, The...  | 2018-02-08                          | HR Support / Office Manager | 2022-04-18 00:00:00 | 2023-09-17 00:00:00 | Part-Time              | Individual     | 107              | NETS         | 5136       |
| 4   | 1101-009/005 | F      | Y          |       |           |                              | 25/11/1983 | Diploma               | Diploma in Business Admininstration, LCCI Leve...  | 2015-06-08                          | Executive, Administration   | 2022-04-18 00:00:00 | 2023-09-17 00:00:00 | Part-Time              | Sponsored      | 107              | GIRO         | 4812       |
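
Two rules from the Student Profiles metadata lend themselves to automated checks: `FOREIGNER` is mutually exclusive with `SG CITIZEN` and `SG PR`, and `DOB` follows `DD/MM/YYYY`. The sketch below runs both checks on toy rows shaped like the preview. It assumes every student should have exactly one of the three residency flags set, which is slightly stronger than what the metadata states. The frame `profiles` and the columns `RESIDENCY OK` and `DOB PARSED` are illustrative names.

```python
import pandas as pd

# Toy rows shaped like the preview (blank flags load as missing values).
profiles = pd.DataFrame({
    "STUDENT ID": ["1101-009/001", "1101-009/002", "1101-009/003"],
    "SG CITIZEN": [None, "Y", "Y"],
    "SG PR": [None, None, None],
    "FOREIGNER": ["Y", None, "Y"],  # third row is deliberately inconsistent
    "DOB": ["13/09/1981", "26/07/1979", "31/02/1990"],  # last date does not exist
})

flag_cols = ["SG CITIZEN", "SG PR", "FOREIGNER"]
# Count how many of the three flags are set to 'Y' on each row.
flags_set = profiles[flag_cols].eq("Y").sum(axis=1)
profiles["RESIDENCY OK"] = flags_set == 1  # assumption: exactly one flag per student

# DOB is documented as DD/MM/YYYY; values that do not parse become NaT.
profiles["DOB PARSED"] = pd.to_datetime(
    profiles["DOB"], format="%d/%m/%Y", errors="coerce"
)

print(profiles[["STUDENT ID", "RESIDENCY OK", "DOB PARSED"]])
```

Passing an explicit `format` avoids pandas guessing between day-first and month-first for ambiguous dates such as `04/03/1978`.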


## Data Analysis
---

In [9]:
semester_results.head()

|     | STUDENT ID   | PERIOD | GPA |
| --- | ------------ | ------ | --- |
| 0   | 1101-009/001 | Sem 1  | 3.5 |
| 1   | 1101-009/001 | Sem 2  | 3.6 |
| 2   | 1101-009/001 | Sem 3  | 3.7 |
| 3   | 1101-009/002 | Sem 1  | 3.4 |
| 4   | 1101-009/002 | Sem 2  | 3.5 |
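
The metadata requires every `STUDENT ID` in Semester Results to match one in Student Profiles, and it gives the expected number of semesters per course level. Both can be inspected with a few lines. The sketch below uses toy stand-ins (`results`, `profiles`) for `semester_results` and `student_profiles`; the variable names are illustrative.

```python
import pandas as pd

# Toy stand-ins for `semester_results` and `student_profiles`.
results = pd.DataFrame({
    "STUDENT ID": ["1101-009/001", "1101-009/001", "1101-009/999"],
    "PERIOD": ["Sem 1", "Sem 2", "Sem 1"],
    "GPA": [3.5, 3.6, 2.9],
})
profiles = pd.DataFrame({"STUDENT ID": ["1101-009/001", "1101-009/002"]})

# Result rows whose STUDENT ID has no matching profile.
orphans = results[~results["STUDENT ID"].isin(profiles["STUDENT ID"])]
print("Result rows without a profile:", len(orphans))

# Semesters recorded per student, for comparison with the expected
# 1 (Certificate), 3 (Diploma) or 2 (Master's) from the metadata.
semesters_per_student = results.groupby("STUDENT ID")["PERIOD"].nunique()
print(semesters_per_student)
```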


## Data Cleaning
---

In [6]:
# Display initial structure
print("Initial shape:", semester_results.shape)
print(semester_results.info())

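# 1. Remove duplicates
df = semester_results.drop_duplicates()

# 2. Remove rows with missing STUDENT ID, PERIOD, or GPA
# .copy() gives an independent frame, so the column assignments below
# cannot trigger SettingWithCopyWarning or modify semester_results
df = df.dropna(subset=['STUDENT ID', 'PERIOD', 'GPA']).copy()

# 3. Standardize data types
df['STUDENT ID'] = df['STUDENT ID'].astype(str)

# Extract numeric semester from PERIOD (e.g., 'Sem 1' -> 1)
# astype(str) also handles values entered as plain numbers; expand=False returns a Series
df['PERIOD'] = df['PERIOD'].astype(str).str.extract(r'(\d+)', expand=False).astype(float)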

df['GPA'] = pd.to_numeric(df['GPA'], errors='coerce')

# Remove rows with invalid GPA or PERIOD after conversion
df = df.dropna(subset=['PERIOD', 'GPA'])

# 4. Ensure GPA is within 0 to 4
df = df[(df['GPA'] >= 0) & (df['GPA'] <= 4)]

# Convert PERIOD to integer
df['PERIOD'] = df['PERIOD'].astype(int)

# Reset index after cleaning
df = df.reset_index(drop=True)

# Summary
print("Cleaned shape:", df.shape)
print("Cleaned data preview:")
print(df.head())


Initial shape: (555, 3)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 555 entries, 0 to 554
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   STUDENT ID  555 non-null    object 
 1   PERIOD      555 non-null    object 
 2   GPA         555 non-null    float64
dtypes: float64(1), object(2)
memory usage: 13.1+ KB
None
Cleaned shape: (522, 3)
Cleaned data preview:
     STUDENT ID  PERIOD  GPA
0  1101-009/001       1  3.5
1  1101-009/001       2  3.6
2  1101-009/001       3  3.7
3  1101-009/002       1  3.4
4  1101-009/002       2  3.5
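
The printed shapes show that 33 rows were removed (555 to 522), but not which step removed them. One way to make that visible is to record the row count around each step. The sketch below does this on toy data; the `removed` dictionary and the step labels are illustrative and not part of the notebook.

```python
import pandas as pd

# Toy data with one exact duplicate, one missing GPA and one out-of-range GPA.
raw = pd.DataFrame({
    "STUDENT ID": ["A", "A", "B", "C", "D"],
    "PERIOD": ["Sem 1", "Sem 1", "Sem 1", "Sem 2", "Sem 1"],
    "GPA": [3.5, 3.5, None, 4.7, 2.8],
})

removed = {}  # illustrative log: step name -> rows removed

step = raw.drop_duplicates()
removed["duplicates"] = len(raw) - len(step)

before = len(step)
step = step.dropna(subset=["STUDENT ID", "PERIOD", "GPA"])
removed["missing values"] = before - len(step)

before = len(step)
step = step[step["GPA"].between(0, 4)]  # inclusive on both ends
removed["GPA outside 0-4"] = before - len(step)

print(removed)
print("Rows kept:", len(step))
```

Applied to `semester_results`, the same pattern would show how the 33 removed rows split across duplicates, missing values and out-of-range GPAs.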


In [7]:
import pandas as pd

# Parse each column once so the counts below reuse the same intermediate results
period_number = semester_results['PERIOD'].astype(str).str.extract(r'(\d+)', expand=False)
gpa_numeric = pd.to_numeric(semester_results['GPA'], errors='coerce')

summary_table = pd.DataFrame({
    "Field Name": ["STUDENT ID", "PERIOD", "GPA"],
    "Number of records Affected": [
        semester_results['STUDENT ID'].isna().sum(),
        # A missing PERIOD also fails the extract, so one count covers both cases
        period_number.isna().sum(),
        # Missing or non-numeric GPA (NaN after coercion), or GPA outside 0 to 4
        (gpa_numeric.isna() | (gpa_numeric > 4) | (gpa_numeric < 0)).sum()
    ],
    "Action Taken to clean the data": [
        "Removed missing STUDENT ID; standardized as string",
        "Extracted numeric semester from text like 'Sem 1'; dropped missing/invalid",
        "Converted to float; removed non-numeric or GPA outside 0–4 range"
    ]
})

print("\nSummary of Data Cleaning:")
print(summary_table)



Summary of Data Cleaning:
   Field Name  Number of records Affected  \
0  STUDENT ID                           0   
1      PERIOD                           0   
2         GPA                           0   

                      Action Taken to clean the data  
0  Removed missing STUDENT ID; standardized as st...  
1  Extracted numeric semester from text like 'Sem...  
2  Converted to float; removed non-numeric or GPA...  
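
In the printed summary, the text in the last column is cut off with `...` because pandas limits the displayed column width. A minimal sketch of one way to show the full text is given below, using `pd.option_context` so the setting applies only inside the block. The frame `summary` is a one-row stand-in for `summary_table`.

```python
import pandas as pd

# One-row stand-in for `summary_table`.
summary = pd.DataFrame({
    "Field Name": ["STUDENT ID"],
    "Action Taken to clean the data": [
        "Removed missing STUDENT ID; standardized as string"
    ],
})

# Lift the column-width limit and widen the line width for this block only,
# so long text is neither truncated nor wrapped onto a second table.
with pd.option_context("display.max_colwidth", None, "display.width", 200):
    full_text = str(summary)
    print(full_text)
```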


## Exporting Data
---

In [8]:
# Save cleaned file
cleaned_file_path = 'cleaned_semester_results.csv'
df.to_csv(cleaned_file_path, index=False)
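
After exporting, it can be worth reading the file back to confirm that the columns and types survived the round trip. The sketch below writes to an in-memory buffer rather than to `cleaned_semester_results.csv` so that it has no side effects; the frame `cleaned` is a small stand-in for `df`.

```python
import io
import pandas as pd

# Small stand-in for the cleaned frame `df`.
cleaned = pd.DataFrame({
    "STUDENT ID": ["1101-009/001", "1101-009/002"],
    "PERIOD": [1, 1],
    "GPA": [3.5, 3.4],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)
buffer.seek(0)

# Read STUDENT ID back as text explicitly; PERIOD and GPA are inferred.
reloaded = pd.read_csv(buffer, dtype={"STUDENT ID": str})

print(reloaded.dtypes)
print(reloaded.head())
```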
