# SMU BOSS Bidding Data Quality check

## Purpose
- To prepare the data for ingestion into AfterClass.
- After combining the scraped data into a CSV file, we noticed some data quality issues. This notebook identifies potential data quality issues and ways to mitigate them.

## Identified issues
- `Instructor` has multi-valued attributes.
- `class1_day` has multi-valued attributes.
- `Term` uses several naming conventions, whereas the downloaded .xls uses only one.
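
As a rough sketch of how the first issue could be surfaced, the snippet below flags `Instructor` values that hold more than one comma-separated name and collapses exact repeats such as `AKIKO ITO, AKIKO ITO`. It assumes a comma is the only separator and that no single name contains a comma. The helper name `split_instructors` and the toy frame are illustrative and not part of the pipeline.

```python
import pandas as pd

def split_instructors(value: str) -> list[str]:
    """Split a comma-separated instructor string into unique, trimmed names."""
    names = [name.strip() for name in value.split(',') if name.strip()]
    # dict.fromkeys drops repeats while keeping first-seen order
    return list(dict.fromkeys(names))

# Toy rows standing in for the real dataset (names appear in the scraped data)
sample = pd.DataFrame({
    'Instructor': ['GOH BENG WEE', 'AKIKO ITO, AKIKO ITO', 'Not Assigned Yet'],
})

sample['instructor_list'] = sample['Instructor'].apply(split_instructors)
# Flag rows whose raw string held more than one comma-separated entry
sample['is_multi_valued'] = sample['Instructor'].str.contains(',', regex=False)
print(sample)
```

The same pattern could be applied to `class1_day` if its multiple values turn out to be comma-separated as well, which would need to be checked against the data first.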

In [73]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [74]:
# load the dataset
data = pd.read_csv('transformed_data_w_timings_v3.csv', dtype={'CatalogueNo': str})


In [75]:
data.head()

Unnamed: 0,Term,Description,Section,Vacancy,Before Process Vacancy,Median Bid,Min Bid,Instructor,Grading Basis,class1_day,...,exam_startdate,exam_day,exam_starttime,AY,Incoming Freshman,Incoming Exchange,Round,Window,SubjectArea,CatalogueNo
0,2,Financial Reporting and Analysis,G3,42,3,25.0,25.0,GOH BENG WEE,Graded,Wed,...,20-Apr-2022,Wed,08:30,2021,no,no,2A,3,ACCT,224
1,2,Valuation,G1,42,9,10.09,10.09,CHENG NAM SANG,Graded,Mon,...,27-Apr-2022,Wed,08:30,2021,no,no,2A,3,ACCT,336
2,2,Valuation,G2,42,12,10.03,10.0,CHENG NAM SANG,Graded,Mon,...,27-Apr-2022,Wed,08:30,2021,no,no,2A,3,ACCT,336
3,2,Auditing for the Public Sector,G1,42,7,25.0,25.0,LIM SOO PING,Graded,Thu,...,27-Apr-2022,Wed,13:00,2021,no,no,2A,3,ACCT,409
4,2,Public Relations Writing,G1,45,10,10.0,10.0,YASMIN HANNAH RAMLE,Graded,Thu,...,,,,2021,no,no,2A,3,COMM,225


In [76]:
data.describe(include='all')

Unnamed: 0,Term,Description,Section,Vacancy,Before Process Vacancy,Median Bid,Min Bid,Instructor,Grading Basis,class1_day,...,exam_startdate,exam_day,exam_starttime,AY,Incoming Freshman,Incoming Exchange,Round,Window,SubjectArea,CatalogueNo
count,35734.0,35734,35734,35734.0,35734.0,35734.0,35734.0,35734,30909,30531,...,22456,22456,22456,35734.0,35734,35734,35734.0,35734.0,35734,35734.0
unique,5.0,667,45,,,,,929,2,7,...,82,6,3,,2,2,6.0,,55,373.0
top,2.0,Management Communication,G1,,,,,Not Assigned Yet,Graded,Tue,...,25-Nov-2022,Fri,08:30,,no,no,1.0,,COR,101.0
freq,16437.0,1388,14088,,,,,509,30699,6955,...,471,4731,11331,,34579,31787,10185.0,,5266,2671.0
mean,,,,43.333156,12.125259,27.409722,22.706312,,,,...,,,,2022.609811,,,,1.51763,,
std,,,,5.700971,12.451885,16.86029,15.130064,,,,...,,,,1.010684,,,,0.65875,,
min,,,,1.0,0.0,10.0,10.0,,,,...,,,,2021.0,,,,1.0,,
25%,,,,45.0,2.0,15.0925,10.3,,,,...,,,,2022.0,,,,1.0,,
50%,,,,45.0,7.0,24.0,17.89,,,,...,,,,2023.0,,,,1.0,,
75%,,,,45.0,19.0,35.0,30.0,,,,...,,,,2023.0,,,,2.0,,


In [77]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35734 entries, 0 to 35733
Data columns (total 28 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Term                    35734 non-null  object 
 1   Description             35734 non-null  object 
 2   Section                 35734 non-null  object 
 3   Vacancy                 35734 non-null  int64  
 4   Before Process Vacancy  35734 non-null  int64  
 5   Median Bid              35734 non-null  float64
 6   Min Bid                 35734 non-null  float64
 7   Instructor              35734 non-null  object 
 8   Grading Basis           30909 non-null  object 
 9   class1_day              30531 non-null  object 
 10  class1_starttime        30370 non-null  object 
 11  class1_venue            30336 non-null  object 
 12  class2_day              880 non-null    object 
 13  class2_starttime        880 non-null    object 
 14  class2_venue            880 non-null  

In [78]:
data.isnull().sum()

Term                          0
Description                   0
Section                       0
Vacancy                       0
Before Process Vacancy        0
Median Bid                    0
Min Bid                       0
Instructor                    0
Grading Basis              4825
class1_day                 5203
class1_starttime           5364
class1_venue               5398
class2_day                34854
class2_starttime          34854
class2_venue              34854
class3_day                35677
class3_starttime          35677
class3_venue              35677
exam_startdate            13278
exam_day                  13278
exam_starttime            13278
AY                            0
Incoming Freshman             0
Incoming Exchange             0
Round                         0
Window                        0
SubjectArea                   0
CatalogueNo                   0
dtype: int64

In [79]:
# Check for duplicates
data.duplicated().sum()

np.int64(0)

## ISSUE: Investigate missing grading basis

In [80]:
data['Grading Basis'].value_counts()

Grading Basis
Graded       30699
Pass/Fail      210
Name: count, dtype: int64

In [81]:
missing_grading_basis_data = data[data['Grading Basis'].isna()]
missing_grading_basis_data.head(10)

Unnamed: 0,Term,Description,Section,Vacancy,Before Process Vacancy,Median Bid,Min Bid,Instructor,Grading Basis,class1_day,...,exam_startdate,exam_day,exam_starttime,AY,Incoming Freshman,Incoming Exchange,Round,Window,SubjectArea,CatalogueNo
6,2,Management Communication,G8,30,3,35.0,35.0,CHAN BOH YEE,,,...,,,,2021,no,no,2A,3,COR-COMM,1304
7,2,Japanese,G3,45,2,26.45,26.44,"AKIKO ITO, AKIKO ITO",,,...,,,,2021,no,no,2A,3,COR-JPAN,2401
8,2,"Business, Government and Society",G3,45,5,25.0,25.0,CHAN KAY MIN,,,...,,,,2021,no,no,2A,3,COR-MGMT,1302
45,2,Management Communication,G19,30,3,10.0,10.0,LINDY ONG,,,...,,,,2021,no,no,2A,2,COR-COMM,1304
46,2,Japanese,G2,45,1,10.01,10.01,"AKIKO ITO, AKIKO ITO",,,...,,,,2021,no,no,2A,2,COR-JPAN,2401
47,2,"Constitutions, Cultures, and Context",G1,38,1,21.67,21.67,MAARTJE DE VISSER,,,...,,,,2021,no,no,2A,2,COR-LAW,2610
48,2,Jurisprudence: Modern and Critical Theories of...,G1,38,8,17.17,17.17,TAN SEOW HON,,,...,,,,2021,no,no,2A,2,COR-LAW,2612
49,2,"Business, Government and Society",G11,45,2,10.03,10.03,GILBERT TAN YIP WEI,,,...,,,,2021,no,no,2A,2,COR-MGMT,1302
116,2,Management Communication,G21,30,1,12.77,12.77,LINDY ONG,,,...,,,,2021,no,no,2A,1,COR-COMM,1304
117,2,Management Communication,G30,30,1,10.0,10.0,VANDANA ADVANI,,,...,,,,2021,no,no,2A,1,COR-COMM,1304


In [82]:
missing_grading_basis_data.describe(include='all').T

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
Term,4825.0,5.0,1,2315.0,,,,,,,
Description,4825.0,72.0,Management Communication,1388.0,,,,,,,
Section,4825.0,42.0,G1,1270.0,,,,,,,
Vacancy,4825.0,,,,39.786736,7.707542,4.0,30.0,45.0,45.0,46.0
Before Process Vacancy,4825.0,,,,11.005389,11.243239,0.0,2.0,6.0,17.0,45.0
Median Bid,4825.0,,,,28.591422,15.236722,10.0,17.9,25.69,35.89,168.0
Min Bid,4825.0,,,,23.923685,14.195842,10.0,12.0,20.0,30.91,150.0
Instructor,4825.0,120.0,ROSIE CHING,170.0,,,,,,,
Grading Basis,0.0,0.0,,,,,,,,,
class1_day,0.0,0.0,,,,,,,,,


In [83]:
missing_grading_basis_data[missing_grading_basis_data['Description']=='Management Communication'].head(1).T

Unnamed: 0,6
Term,2
Description,Management Communication
Section,G8
Vacancy,30
Before Process Vacancy,3
Median Bid,35.0
Min Bid,35.0
Instructor,CHAN BOH YEE
Grading Basis,
class1_day,


## ROOT CAUSE: Scraping logic did not wait for the page to load fully, so the page was skipped
Investigating row 6 further, I found the class at https://boss.intranet.smu.edu.sg/ClassDetails.aspx?SelectedClassNumber=1609&SelectedAcadTerm=2120&SelectedAcadCareer=UGRD, but its details were not recorded in the CSV.

From this, I identified that the scraping code did not wait for the page elements to load completely before attempting to extract data, which caused it to skip the page.

## FIX: Wait until page is healthy before scraping
To fix this, the scraper should use Selenium's `WebDriverWait.until()` to wait for the page to load before extracting data.

---

## ISSUE: Missing class1_day

In [84]:
missing_class1_day_data = data[data['Grading Basis'].notna() & data['class1_day'].isna()]
missing_class1_day_data.head(10)

Unnamed: 0,Term,Description,Section,Vacancy,Before Process Vacancy,Median Bid,Min Bid,Instructor,Grading Basis,class1_day,...,exam_startdate,exam_day,exam_starttime,AY,Incoming Freshman,Incoming Exchange,Round,Window,SubjectArea,CatalogueNo
1732,2,Virtual Business Professional,G1,25,16,25.0,25.0,Not Assigned Yet,Graded,,...,,,,2021,no,no,1B,2,WRIT,200
2218,2,Virtual Business Professional,G1,25,15,12.28,12.28,Not Assigned Yet,Graded,,...,,,,2021,no,no,1B,1,WRIT,200
2483,2,Health Economics(SMU-X),G3,45,38,20.0,20.0,Not Assigned Yet,Graded,,...,,,,2021,no,no,1A,2,ECON,215
2617,2,The Singapore International Arbitration Centre...,G1,38,32,11.01,11.01,Not Assigned Yet,Graded,,...,,,,2021,no,no,1A,2,LAW,4020
2675,2,Entrepreneurship Practicum(SMU-X),G2,45,38,12.01,12.01,Not Assigned Yet,Graded,,...,,,,2021,no,no,1A,2,MGMT,327
2723,2,Legal Environment and Employment Relations,G1,45,31,13.98,11.74,Not Assigned Yet,Graded,,...,,,,2021,no,no,1A,2,OBHR,232
2739,2,Operations Strategy: Principles and Practice,G1,45,26,15.85,11.66,Not Assigned Yet,Graded,,...,,,,2021,no,no,1A,2,OPIM,319
2783,2,Virtual Business Professional,G1,25,15,11.2,10.0,Not Assigned Yet,Graded,,...,,,,2021,no,no,1A,2,WRIT,200
2814,2,Accounting Information Systems,G4,45,32,10.89,10.0,Not Assigned Yet,Graded,,...,,,,2021,no,no,1A,1,ACCT,221
2834,2,Advanced Financial Accounting,G2,45,27,20.0,20.0,Not Assigned Yet,Graded,,...,,,,2021,no,no,1A,1,ACCT,335


In [85]:
missing_class1_day_data.describe(include='all').T

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
Term,378.0,5.0,2,164.0,,,,,,,
Description,378.0,107.0,Computing Technology For Finance,13.0,,,,,,,
Section,378.0,13.0,G1,213.0,,,,,,,
Vacancy,378.0,,,,43.185185,5.658973,7.0,45.0,45.0,45.0,45.0
Before Process Vacancy,378.0,,,,30.325397,7.695976,0.0,27.0,31.5,36.0,45.0
Median Bid,378.0,,,,18.85082,10.678839,10.0,11.8825,15.45,21.585,95.4
Min Bid,378.0,,,,13.763492,7.647385,10.0,10.0,10.105,15.0,95.4
Instructor,378.0,3.0,Not Assigned Yet,370.0,,,,,,,
Grading Basis,378.0,2.0,Graded,374.0,,,,,,,
class1_day,0.0,0.0,,,,,,,,,


In [86]:
missing_class1_day_data.head(10).T

Unnamed: 0,1732,2218,2483,2617,2675,2723,2739,2783,2814,2834
Term,2,2,2,2,2,2,2,2,2,2
Description,Virtual Business Professional,Virtual Business Professional,Health Economics(SMU-X),The Singapore International Arbitration Centre...,Entrepreneurship Practicum(SMU-X),Legal Environment and Employment Relations,Operations Strategy: Principles and Practice,Virtual Business Professional,Accounting Information Systems,Advanced Financial Accounting
Section,G1,G1,G3,G1,G2,G1,G1,G1,G4,G2
Vacancy,25,25,45,38,45,45,45,25,45,45
Before Process Vacancy,16,15,38,32,38,31,26,15,32,27
Median Bid,25.0,12.28,20.0,11.01,12.01,13.98,15.85,11.2,10.89,20.0
Min Bid,25.0,12.28,20.0,11.01,12.01,11.74,11.66,10.0,10.0,20.0
Instructor,Not Assigned Yet,Not Assigned Yet,Not Assigned Yet,Not Assigned Yet,Not Assigned Yet,Not Assigned Yet,Not Assigned Yet,Not Assigned Yet,Not Assigned Yet,Not Assigned Yet
Grading Basis,Graded,Graded,Graded,Graded,Graded,Graded,Graded,Graded,Graded,Graded
class1_day,,,,,,,,,,


In [87]:
missing_class1_day_data.loc[1732].T

Term                                                  2
Description               Virtual Business Professional
Section                                              G1
Vacancy                                              25
Before Process Vacancy                               16
Median Bid                                         25.0
Min Bid                                            25.0
Instructor                             Not Assigned Yet
Grading Basis                                    Graded
class1_day                                          NaN
class1_starttime                                    NaN
class1_venue                                        NaN
class2_day                                          NaN
class2_starttime                                    NaN
class2_venue                                        NaN
class3_day                                          NaN
class3_starttime                                    NaN
class3_venue                                    

In [88]:
missing_class1_day_data.loc[2814].T

Term                                                   2
Description               Accounting Information Systems
Section                                               G4
Vacancy                                               45
Before Process Vacancy                                32
Median Bid                                         10.89
Min Bid                                             10.0
Instructor                              Not Assigned Yet
Grading Basis                                     Graded
class1_day                                           NaN
class1_starttime                                     NaN
class1_venue                                         NaN
class2_day                                           NaN
class2_starttime                                     NaN
class2_venue                                         NaN
class3_day                                           NaN
class3_starttime                                     NaN
class3_venue                   

## ROOT CAUSE: (1) There is no physical class or (2) Class no longer exists

### SUB-ISSUE 1: There is no physical class
Investigating row 1732 further, I found that the class at https://boss.intranet.smu.edu.sg/ClassDetails.aspx?SelectedClassNumber=2052&SelectedAcadTerm=2120&SelectedAcadCareer=UGRD was recorded in the CSV.
The class was virtual, so there was no class location.

### SUB-ISSUE 2: Class no longer exists in OverallResults
Investigating row 2814 further, I found that the class at https://boss.intranet.smu.edu.sg/ClassDetails.aspx?SelectedClassNumber=1778&SelectedAcadTerm=2120&SelectedAcadCareer=UGRD was recorded in the CSV.

Looking at https://boss.intranet.smu.edu.sg/OverallResults.aspx, I found that the record did not exist in Overall Results, likely because the class was ultimately not offered.

However, it did exist in the BOSS results Excel table given to us at OASIS.

In [89]:
data.groupby(['Round','Window'])[['Round','Window']].value_counts()

Round  Window
1      1         6524
       2         3357
       3          175
       4          129
1A     1         4822
       2         3577
       3          744
1B     1         2811
       2         2139
1C     1         2264
       2         1083
       3          600
2      1         3177
       2         1782
       3         1145
       4            9
       5            2
2A     1          787
       2          405
       3          202
Name: count, dtype: int64

In [109]:
# Group by class identifiers and count total rows
class_row_counts = (
    data[data['Grading Basis'].notna() & data['class1_day'].isna()]
    .groupby(['CatalogueNo', 'SubjectArea', 'Section', 'AY', 'Term'])
    .size()
    .reset_index(name='num_rows')
)

# Filter if needed
low_activity_classes = class_row_counts[class_row_counts['num_rows'] <= 14]

low_activity_classes = low_activity_classes.sort_values(by='AY', ascending=True)
low_activity_classes.head(10)

Unnamed: 0,CatalogueNo,SubjectArea,Section,AY,Term,num_rows
1,100,IDIS,G2,2021,2,3
13,103,MKTG,G7,2021,2,2
14,103,MKTG,G8,2021,2,3
15,104,MGMT,G2,2021,2,3
12,103,MKTG,G5,2021,2,1
11,103,FNCE,G2,2021,2,3
30,200,WRIT,G1,2021,2,6
26,1305,COR,G9,2021,2,3
25,1305,COR,G15,2021,2,2
61,221,ACCT,G4,2021,2,3


In [110]:
low_activity_classes.count()

CatalogueNo    137
SubjectArea    137
Section        137
AY             137
Term           137
num_rows       137
dtype: int64

In [None]:
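# Define the correct ordering of bidding phases.
# Every Round-Window pair present in the data must be listed here, because
# pd.Categorical turns unlisted values into NaN.
# Assumption: Round 1C falls between 1B and 2.
custom_order = [
    '1-1', '1-2', '1-3', '1-4',
    '1A-1', '1A-2', '1A-3',
    '1B-1', '1B-2',
    '1C-1', '1C-2', '1C-3',
    '2-1', '2-2', '2-3', '2-4', '2-5',
    '2A-1', '2A-2', '2A-3'
]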

# Create a combined identifier for round and window
data['bidding_phase'] = data['Round'].astype(str) + '-' + data['Window'].astype(str)

# Convert to ordered categorical to ensure proper sorting
data['bidding_phase'] = pd.Categorical(
    data['bidding_phase'], 
    categories=custom_order, 
    ordered=True
)

# Find the maximum round and window for each class
max_phases = (
    data
    .groupby(['CatalogueNo', 'SubjectArea', 'Section', 'AY', 'Term'])['bidding_phase']
    .max()
    .reset_index()
)

# Define target phase
target_phase = '2A-3'

# Filter classes that didn't reach Round 2A, Window 3
classes_not_reaching_2A_3 = max_phases[max_phases['bidding_phase'] < target_phase]

# Sort and display results
classes_not_reaching_2A_3 = classes_not_reaching_2A_3.sort_values(by='AY', ascending=True)
classes_not_reaching_2A_3.head(10)

Unnamed: 0,CatalogueNo,SubjectArea,Section,AY,Term,bidding_phase
3662,213,IS,G9,2023,2,1B-1
2096,1302,COR-MGMT,G9,2021,2,2-2
326,101,FNCE,G9,2023,2,2-1
325,101,FNCE,G9,2023,1,2-3
324,101,FNCE,G9,2022,2,1A-2
323,101,FNCE,G9,2022,1,2-3
322,101,FNCE,G9,2021,2,2-2
3766,215,IS,G9,2021,2,2A-2
3767,215,IS,G9,2022,2,2-3
3768,215,IS,G9,2023,2,1B-1
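
Because `pd.Categorical` converts any value that is not in `categories` into `NaN`, it can help to confirm that every Round-Window pair in the data appears in the ordering before taking the per-class maximum. The sketch below does this on a toy frame. The helper name `unlisted_phases` and the sample values are illustrative assumptions.

```python
import pandas as pd

def unlisted_phases(df: pd.DataFrame, order: list[str]) -> list[str]:
    """Return Round-Window labels present in df but absent from order."""
    labels = df['Round'].astype(str) + '-' + df['Window'].astype(str)
    return sorted(set(labels) - set(order))

# Toy data: '1C-1' is deliberately left out of the ordering
toy = pd.DataFrame({'Round': ['1', '1C', '2A'], 'Window': [1, 1, 3]})
order = ['1-1', '2A-3']

print(unlisted_phases(toy, order))  # ['1C-1']

# Values outside the categories become NaN in the categorical column
phases = pd.Categorical(
    toy['Round'].astype(str) + '-' + toy['Window'].astype(str),
    categories=order,
    ordered=True,
)
print(pd.isna(phases).tolist())  # [False, True, False]
```

An empty list from `unlisted_phases` means every phase in the data is covered by the ordering.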


## FIX: Filter out classes that did not reach Round 2A Window 3
- Upon investigation, not all records returned by the code above are phantom classes. However, MKTG103 in AY2021 T2 (sections G5, G7, and G8) is no longer available in OverallResults.
- The current data omits any records with `Min Bid` == 0, which means that some missing `bidding_phase` values are caused not by dropped classes but by a lack of active bidders.
- Phantom classes, which were cancelled for lack of a venue or instructor, did not make it to Round 2A Window 3, which makes this a suitable filter for removing them.

To fix SUB-ISSUE 2, drop any rows that have a median/min bid price of 0, have no `class1_day`, and belong to a class that did not reach Round 2A Window 3.
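
One way to express this filter is sketched below. It assumes the frame already carries the ordered categorical `bidding_phase` column built earlier, and it reads the rule as "drop rows with no `class1_day` whose class never reached 2A-3". The zero-bid condition is left out because, as noted above, rows with `Min Bid` == 0 are already absent from the current data. The function name `drop_phantom_classes` and the toy frame are hypothetical.

```python
import pandas as pd

CLASS_KEYS = ['CatalogueNo', 'SubjectArea', 'Section', 'AY', 'Term']

def drop_phantom_classes(df: pd.DataFrame, target_phase: str = '2A-3') -> pd.DataFrame:
    """Drop rows with no class1_day whose class never reached target_phase."""
    phase = df['bidding_phase']
    # Integer position of each phase in the ordering (-1 for missing values)
    codes = phase.cat.codes
    target_code = phase.cat.categories.get_loc(target_phase)
    # Latest phase reached by each class, broadcast back onto every row
    latest = codes.groupby([df[key] for key in CLASS_KEYS]).transform('max')
    phantom = df['class1_day'].isna() & (latest < target_code)
    return df[~phantom]

# Toy data: one class stops at 2-1, the other reaches 2A-3 (phases are made up)
order = ['1-1', '2-1', '2A-3']
toy = pd.DataFrame({
    'CatalogueNo': ['103', '103', '200', '200'],
    'SubjectArea': ['MKTG', 'MKTG', 'WRIT', 'WRIT'],
    'Section': ['G5', 'G5', 'G1', 'G1'],
    'AY': [2021, 2021, 2021, 2021],
    'Term': ['2', '2', '2', '2'],
    'class1_day': [None, None, None, None],
    'bidding_phase': pd.Categorical(
        ['1-1', '2-1', '1-1', '2A-3'], categories=order, ordered=True
    ),
})

kept = drop_phantom_classes(toy)
print(kept[['SubjectArea', 'CatalogueNo', 'bidding_phase']])
```

In this toy example the class that stops at 2-1 is removed, while the class with no `class1_day` that still reaches 2A-3 (as a virtual class would) is kept.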