# Knowledge Mapper

Create an embedding visualization from the text of the unit handbook that helps students decide which courses to pursue, based on their interests and currently enrolled units.

```
Key fields:
1. UNIT_TITLE
2. HANDBOOK_SYNOPSIS
3. UNIT_LEARNING_OUTCOME

Unit metadata:
1. UNIT_CODE
2. ABBREVIATED_UNIT_TITLE

Colour by:
1. STUDY_LEVEL (UG vs. PG)
2. OWNING_FACULTY (faculty that teaches it)
3. OWNING_ORG_UNIT (department that teaches it)

Auxiliary questions:
1. Which other units are prohibited from being taken together with this unit?
2. Which units should I take before this unit?

Note: PUBLISH_TO_HANDBOOK probably should be used as a first-order filter.
```
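
The two auxiliary questions can probably be answered from the rule columns (`PROHIBITION`, `PREREQUISITE`) rather than from the embeddings. The sketch below is a minimal illustration, assuming unit codes are three uppercase letters followed by four digits (as in the sample rows, e.g. `ACB1020`) and that rule strings look like `ACC1100 OR ACF1100 OR ACW1120`. `extract_unit_codes` is a hypothetical helper name, not part of this notebook.

```python
import re

# Assumed shape of a unit code: three capital letters then four digits (e.g. ACB1020).
UNIT_CODE_RE = re.compile(r"\b[A-Z]{3}\d{4}\b")

def extract_unit_codes(rule_text):
    """Return the distinct unit codes mentioned in a rule string, in order of appearance."""
    if not isinstance(rule_text, str):
        return []  # missing values (NaN/None) carry no codes
    seen = []
    for code in UNIT_CODE_RE.findall(rule_text):
        if code not in seen:
            seen.append(code)
    return seen

print(extract_unit_codes("ACC1100 OR ACF1100 OR ACW1120"))
```

Note that this flattens the rule into a list of codes and ignores any AND/OR logic, so it is only a starting point for linking related units.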

### [Embedding Projector](https://projector.tensorflow.org/)

Data format:
```
Load data from your computer
Step 1: Load a TSV file of vectors.
Example of 3 vectors with dimension 4:
0.1\t0.2\t0.5\t0.9
0.2\t0.1\t5.0\t0.2
0.4\t0.1\t7.0\t0.8
Step 2 (optional): Load a TSV file of metadata.
Example of 3 data points and 2 columns.
Note: If there is more than one column, the first row will be parsed as column labels.
Pokémon\tSpecies
Wartortle\tTurtle
Venusaur\tSeed
Charmeleon\tFlame
```

Metadata
```
UNIT_CODE\tUNIT_TITLE\tHANDBOOK_SYNOPSIS\tUNIT_LEARNING_OUTCOME\tSTUDY_LEVEL\tOWNING_FACULTY\tOWNING_ORG_UNIT
```
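
As a rough illustration of the two files described above, the sketch below builds a vector TSV without a header and a metadata TSV with a header row. The helper names (`clean_field`, `to_projector_tsv`) are hypothetical, and replacing tabs and newlines inside text fields is an assumption about what keeps one record per line. The sample values are the vectors from the format example and two units from the data.

```python
def clean_field(value):
    """Collapse runs of whitespace (including tabs and newlines) to single spaces."""
    return " ".join(str(value).split())

def to_projector_tsv(vectors, metadata_rows, columns):
    """Return (vectors_tsv, metadata_tsv) strings for the Embedding Projector.

    vectors: list of equal-length lists of floats (no header is written).
    metadata_rows: list of dicts keyed by `columns` (a header row is written).
    """
    if len(vectors) != len(metadata_rows):
        raise ValueError("vectors and metadata must have the same number of rows")
    vec_lines = ["\t".join(repr(float(v)) for v in vec) for vec in vectors]
    meta_lines = ["\t".join(columns)]
    for row in metadata_rows:
        meta_lines.append("\t".join(clean_field(row[c]) for c in columns))
    return "\n".join(vec_lines) + "\n", "\n".join(meta_lines) + "\n"

vectors = [[0.1, 0.2, 0.5, 0.9], [0.2, 0.1, 5.0, 0.2]]
rows = [
    {"UNIT_CODE": "ACB1020", "UNIT_TITLE": "Accounting in business"},
    {"UNIT_CODE": "ACB1120", "UNIT_TITLE": "Financial accounting 1"},
]
vec_tsv, meta_tsv = to_projector_tsv(vectors, rows, ["UNIT_CODE", "UNIT_TITLE"])
print(vec_tsv)
print(meta_tsv)
```

Checking the row counts up front is deliberate: the two files are matched by row position, so a mismatch would attach labels to the wrong points.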

### Libraries

In [1]:
!pip install transformers



In [2]:
!pip install sentence-transformers



In [3]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [4]:
# general
import json
import numpy as np
import pandas as pd
from pathlib import Path
from sentence_transformers import SentenceTransformer

In [5]:
data_path = Path('/content/drive/MyDrive/knowledge-mapper/data/raw_data/units owned by faculty - one line per unit (as at 18 Feb 2021).xlsx')
path = "/content/drive/MyDrive/knowledge-mapper/data"
df = pd.read_excel(data_path, sheet_name='Sheet1')

In [6]:
df.shape

(5848, 36)

In [7]:
df.columns

Index(['CL_UNIT_ID', 'UNIT_CODE', 'CL_UNIT_VERSION', 'UNIT_TITLE',
       'ABBREVIATED_UNIT_TITLE', 'CREDIT_POINTS', 'UNIT_STATUS',
       'OWNING_FACULTY', 'OWNING_ORG_UNIT', 'HIGHEST_SCA_BAND',
       'IMPLEMENTATION_YR', 'PUBLISH_TO_HANDBOOK', 'STUDY_LEVEL',
       'HIGHEST_SCA_BAND_1', 'UNIT_EFTSL', 'HANDBOOK_SYNOPSIS',
       'WORKLOAD_REQUIREMENTS', 'QUOTA_INFORMATION', 'OTHER_UNIT_COSTS',
       'FIELD_WORK', 'AREA_OF_STUDY_LINKS', 'OFF_CAMPUS_ATTEND_REQUIREMENTS',
       'SPECIAL_NOTE_TO_STUDENTS', 'UNITOFFERING',
       'HANDBOOK_ASSESSMENT_SUMMARY', 'ASSES_ITEMS', 'UNITCOORD', 'CHIEFEXAM',
       'TEACHING_RESPONSIBILITY', 'PREREQUISITE', 'COREQUISITE', 'PROHIBITION',
       'RULES(PREREQ,COREQ,PROH)', 'INFORMATION_RULE', 'LEARNING_OUTCOME_INFO',
       'UNIT_LEARNING_OUTCOME'],
      dtype='object')

In [8]:
df.head()

Unnamed: 0,CL_UNIT_ID,UNIT_CODE,CL_UNIT_VERSION,UNIT_TITLE,ABBREVIATED_UNIT_TITLE,CREDIT_POINTS,UNIT_STATUS,OWNING_FACULTY,OWNING_ORG_UNIT,HIGHEST_SCA_BAND,IMPLEMENTATION_YR,PUBLISH_TO_HANDBOOK,STUDY_LEVEL,HIGHEST_SCA_BAND_1,UNIT_EFTSL,HANDBOOK_SYNOPSIS,WORKLOAD_REQUIREMENTS,QUOTA_INFORMATION,OTHER_UNIT_COSTS,FIELD_WORK,AREA_OF_STUDY_LINKS,OFF_CAMPUS_ATTEND_REQUIREMENTS,SPECIAL_NOTE_TO_STUDENTS,UNITOFFERING,HANDBOOK_ASSESSMENT_SUMMARY,ASSES_ITEMS,UNITCOORD,CHIEFEXAM,TEACHING_RESPONSIBILITY,PREREQUISITE,COREQUISITE,PROHIBITION,"RULES(PREREQ,COREQ,PROH)",INFORMATION_RULE,LEARNING_OUTCOME_INFO,UNIT_LEARNING_OUTCOME
0,554c86a41b5aac10653b206b274bcbf9,ACB1020,2021.04RO,Accounting in business,ACC IN BUS,6.0,Accredited,Faculty of Business and Economics,Department of Accounting,SCA Band 4,2021,Y,undergraduate,SCA Band 4,0.125,This unit introduces basic accounting concepts...,Minimum total expected workload to achieve the...,,,,,,,S1-01-PENINSULA-ON-CAMPUS Offered-Y,,1 - 50% APPLY_TO_ALL_OFFER - Y\n2 - 50% APPL...,,Dr Mahendra Goyal,Responsible teaching Department of Accounting ...,,,ACB1120 OR ACF1100 OR ACF1200 OR ACC1100 OR AC...,Prohibition: ACB1120 OR ACF1100 OR ACF1200 OR ...,,"On successful completion of this unit, you sho...",ULO1 - demonstrate an understanding of various...
1,1928deab1bd65c504c45bbbbdc4bcb3a,ACB1100,2021.01RO,Introduction to financial accounting,INTRO FIN ACC,6.0,Accredited,Faculty of Business and Economics,Department of Accounting,SCA Band 3,2021,N,undergraduate,SCA Band 3,0.125,This unit provides students with an introducti...,Minimum total expected workload to achieve the...,,Costs are indicative and subject to change.Ele...,,,,,,Within semester assessment: 50% + Examination:...,,,Dr Mahendra Goyal,Responsible teaching Department of Accounting ...,,,ACC1100 OR ACF1100 OR ACW1100,Prohibition: ACC1100 OR ACF1100 OR ACW1100,,The learning outcomes associated with this uni...,ULO1 - Identify and analyse measurement system...
2,e14cc6a41b5aac10653b206b274bcb3e,ACB1120,2021.04RO,Financial accounting 1,INTRO FIN ACC,6.0,Accredited,Faculty of Business and Economics,Department of Accounting,SCA Band 4,2021,Y,undergraduate,SCA Band 4,0.125,This unit provides you with an introduction to...,Minimum total expected workload to achieve the...,,,,,,,S1-01-PENINSULA-ON-CAMPUS Offered-Y,,1 - 50% APPLY_TO_ALL_OFFER - Y\n2 - 50% APPL...,,Dr Mahendra Goyal,Responsible teaching Department of Accounting ...,,,ACC1100 OR ACF1100 OR ACW1120,Prohibition: ACC1100 OR ACF1100 OR ACW1120,,"On successful completion of this unit, you sho...",ULO1 - identify and analyse measurement system...
3,a128deab1bd65c504c45bbbbdc4bcbb4,ACB1200,2021.01RO,Accounting for managers,ACC FOR MNGRS,6.0,Accredited,Faculty of Business and Economics,Department of Accounting,SCA Band 3,2021,N,undergraduate,SCA Band 3,0.125,This unit introduces basic accounting concepts...,Minimum total expected workload to achieve the...,,Costs are indicative and subject to change.Ele...,,,,,,Within semester assessment: 50% + Examination:...,,,Mr Jonathan Phillips,Responsible teaching Department of Accounting ...,,,ACF1100 OR ACW1100 OR ACB1100 OR ACC1200 OR AC...,Prohibition: ACF1100 OR ACW1100 OR ACB1100 OR ...,,The learning outcomes associated with this uni...,ULO1 - Ddemonstrate an understanding of variou...
4,b528deab1bd65c504c45bbbbdc4bcbc7,ACB2020,2021.01RO,Cost information for decision making,COST INFO FOR DEC MA,6.0,Accredited,Faculty of Business and Economics,Department of Accounting,SCA Band 3,2021,N,undergraduate,SCA Band 3,0.125,Introduction to management accounting. Topics ...,Minimum total expected workload to achieve the...,,Costs are indicative and subject to change.Ele...,,,,,,Within semester assessment: 50% + Examination:...,,,Mr Paul Yap,Responsible teaching Department of Accounting ...,ACB1100,,,Prerequisite: ACB1100; \n,Corequisite: Students must be enrolled in cour...,The learning outcomes associated with this uni...,ULO1 - Describe cost behaviour under different...


### Unique Identifiers

In [9]:
df.CL_UNIT_ID.nunique(), df.UNIT_CODE.nunique()

(5848, 5848)

In [10]:
key_cols = ["CL_UNIT_ID",
            "UNIT_CODE",
            "UNIT_TITLE",
            "HANDBOOK_SYNOPSIS",
            "UNIT_LEARNING_OUTCOME",
            "STUDY_LEVEL",
            "OWNING_FACULTY",
            "OWNING_ORG_UNIT"
            ]

### Filter by `PUBLISH_TO_HANDBOOK`

In [11]:
df = df.loc[df["PUBLISH_TO_HANDBOOK"] == 'Y', :]

In [12]:
df.shape

(5324, 36)

In [13]:
df.head()

Unnamed: 0,CL_UNIT_ID,UNIT_CODE,CL_UNIT_VERSION,UNIT_TITLE,ABBREVIATED_UNIT_TITLE,CREDIT_POINTS,UNIT_STATUS,OWNING_FACULTY,OWNING_ORG_UNIT,HIGHEST_SCA_BAND,IMPLEMENTATION_YR,PUBLISH_TO_HANDBOOK,STUDY_LEVEL,HIGHEST_SCA_BAND_1,UNIT_EFTSL,HANDBOOK_SYNOPSIS,WORKLOAD_REQUIREMENTS,QUOTA_INFORMATION,OTHER_UNIT_COSTS,FIELD_WORK,AREA_OF_STUDY_LINKS,OFF_CAMPUS_ATTEND_REQUIREMENTS,SPECIAL_NOTE_TO_STUDENTS,UNITOFFERING,HANDBOOK_ASSESSMENT_SUMMARY,ASSES_ITEMS,UNITCOORD,CHIEFEXAM,TEACHING_RESPONSIBILITY,PREREQUISITE,COREQUISITE,PROHIBITION,"RULES(PREREQ,COREQ,PROH)",INFORMATION_RULE,LEARNING_OUTCOME_INFO,UNIT_LEARNING_OUTCOME
0,554c86a41b5aac10653b206b274bcbf9,ACB1020,2021.04RO,Accounting in business,ACC IN BUS,6.0,Accredited,Faculty of Business and Economics,Department of Accounting,SCA Band 4,2021,Y,undergraduate,SCA Band 4,0.125,This unit introduces basic accounting concepts...,Minimum total expected workload to achieve the...,,,,,,,S1-01-PENINSULA-ON-CAMPUS Offered-Y,,1 - 50% APPLY_TO_ALL_OFFER - Y\n2 - 50% APPL...,,Dr Mahendra Goyal,Responsible teaching Department of Accounting ...,,,ACB1120 OR ACF1100 OR ACF1200 OR ACC1100 OR AC...,Prohibition: ACB1120 OR ACF1100 OR ACF1200 OR ...,,"On successful completion of this unit, you sho...",ULO1 - demonstrate an understanding of various...
2,e14cc6a41b5aac10653b206b274bcb3e,ACB1120,2021.04RO,Financial accounting 1,INTRO FIN ACC,6.0,Accredited,Faculty of Business and Economics,Department of Accounting,SCA Band 4,2021,Y,undergraduate,SCA Band 4,0.125,This unit provides you with an introduction to...,Minimum total expected workload to achieve the...,,,,,,,S1-01-PENINSULA-ON-CAMPUS Offered-Y,,1 - 50% APPLY_TO_ALL_OFFER - Y\n2 - 50% APPL...,,Dr Mahendra Goyal,Responsible teaching Department of Accounting ...,,,ACC1100 OR ACF1100 OR ACW1120,Prohibition: ACC1100 OR ACF1100 OR ACW1120,,"On successful completion of this unit, you sho...",ULO1 - identify and analyse measurement system...
5,0ded927adb5e68102bdd077cd39619c4,ACB2120,2021.05,Financial accounting 2,FIN ACCT 2,6.0,Accredited,Faculty of Business and Economics,Department of Accounting,SCA Band 4,2021,Y,undergraduate,SCA Band 4,0.125,This unit provides an overview of the current ...,Minimum total expected workload to achieve the...,,,,,,,S1-01-PENINSULA-ON-CAMPUS Offered-Y,�\n�,1 - 50% APPLY_TO_ALL_OFFER - Y\n2 - 50% APPL...,,Dr Lisa Powell,Responsible teaching Department of Accounting ...,ACB1120 OR ACC1100 OR ACF1100 OR ACW1120,,ACF2100 OR ACW2120 OR ACC2100,Prerequisite: ACB1120 OR ACC1100 OR ACF1100 OR...,,"On successful completion of this unit, you sho...","ULO1 - explain the content of, and regulatory ..."
6,7d4cc6a41b5aac10653b206b274bcb9a,ACB2220,2021.03RO,Management accounting 1,MGT ACCT 1,6.0,Accredited,Faculty of Business and Economics,Department of Accounting,SCA Band 4,2021,Y,undergraduate,SCA Band 4,0.125,Introduction to management accounting. Topics ...,Minimum total expected workload to achieve the...,,,,,,,S1-01-PENINSULA-ON-CAMPUS Offered-Y,,1 - 50% APPLY_TO_ALL_OFFER - Y\n2 - 50% APPL...,,Dr John Ko,Responsible teaching Department of Accounting ...,ACC1100 OR ACF1100 OR ACW1120 OR ACB1120,,ACC2200 OR ACF2200 OR ACW2220,Prerequisite: ACC1100 OR ACF1100 OR ACW1120 OR...,,"On successful completion of this unit, you sho...",ULO1 - describe cost behaviour under different...
7,b14cc6a41b5aac10653b206b274bcbbd,ACB2420,2021.04RO,Accounting information systems,ACC INFO SYS,6.0,Accredited,Faculty of Business and Economics,Department of Accounting,SCA Band 4,2021,Y,undergraduate,SCA Band 4,0.125,"The objective of this unit is two-fold. First,...",Minimum total expected workload to achieve the...,,,,,,,S2-01-PENINSULA-ON-CAMPUS Offered-Y,,1 - 50% APPLY_TO_ALL_OFFER - Y\n2 - 50% APPL...,,Dr Daisy Seng,Responsible teaching Department of Accounting ...,ACB1020 OR ACF1100 OR ACB1120 OR ACF1200 OR AC...,,ACF2400 OR ACW2420 OR ACC2400,Prerequisite: ACB1020 OR ACF1100 OR ACB1120 OR...,,"On successful completion of this unit, you sho...",ULO1 - examine the role of accounting informat...


### Get key columns

In [14]:
df_txt = df.loc[:, key_cols].copy()  # copy so later in-place edits do not touch df

In [15]:
df_txt.isna().sum()

CL_UNIT_ID                 0
UNIT_CODE                  0
UNIT_TITLE                 0
HANDBOOK_SYNOPSIS         27
UNIT_LEARNING_OUTCOME    378
STUDY_LEVEL                2
OWNING_FACULTY             0
OWNING_ORG_UNIT            0
dtype: int64

In [16]:
df_txt = df_txt.fillna("")  # reassign instead of filling a slice in place

In [17]:
df_txt.isna().sum()

CL_UNIT_ID               0
UNIT_CODE                0
UNIT_TITLE               0
HANDBOOK_SYNOPSIS        0
UNIT_LEARNING_OUTCOME    0
STUDY_LEVEL              0
OWNING_FACULTY           0
OWNING_ORG_UNIT          0
dtype: int64

> Pretrained models: https://www.sbert.net/docs/pretrained_models.html

In [18]:
# ver 0.1 used:
# 768 dimensions --> 'distilbert-base-nli-stsb-mean-tokens'
model_name = 'all-MiniLM-L6-v2' # 384 dim
model = SentenceTransformer(model_name) # load the MiniLM sentence-embedding model (smaller than the DistilBERT model used in ver 0.1)

In [19]:
def get_bert_embeddings(text, model):
    """Computes the sentence embedding (context dependent) for a given text.
    Returns a vector with the model's embedding dimension (384 for
    'all-MiniLM-L6-v2'); blank text maps to a zero vector.
    """
    if text.strip() != "":
        embeddings = model.encode(text)
    else:
        embeddings = np.zeros(model.get_sentence_embedding_dimension())
    return embeddings
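
Calling `model.encode` once per row works but can be slow for thousands of units. `SentenceTransformer.encode` also accepts a list of sentences, so one possible alternative is to encode all non-empty texts in a single batched call and keep zero vectors for the empty ones. The sketch below is an assumption about how that could look, not what this notebook ran. `embed_texts` is a hypothetical helper, and the `encode` argument stands in for `model.encode` so the logic can be tried without loading a model.

```python
import numpy as np

def embed_texts(texts, encode, dim):
    """Embed a list of strings, returning an array of shape (len(texts), dim).

    encode: callable mapping a list of strings to an (n, dim) array,
            e.g. `model.encode` from sentence-transformers.
    Empty or whitespace-only strings are left as zero vectors.
    """
    out = np.zeros((len(texts), dim), dtype=np.float32)
    keep = [i for i, t in enumerate(texts) if t.strip() != ""]
    if keep:
        out[keep] = np.asarray(encode([texts[i] for i in keep]), dtype=np.float32)
    return out

# Stand-in encoder for demonstration only: each text maps to a constant vector
# whose value is the text length.
def fake_encode(batch):
    return np.array([[float(len(t))] * 4 for t in batch])

emb = embed_texts(["abc", "", "  ", "hello"], fake_encode, dim=4)
print(emb.shape)
```

In this notebook the call would look something like `embed_texts(df_txt['HANDBOOK_SYNOPSIS'].tolist(), model.encode, model.get_sentence_embedding_dimension())`.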

In [20]:
df_txt.loc[0, 'HANDBOOK_SYNOPSIS'], df_txt.loc[0, 'UNIT_LEARNING_OUTCOME']

('This unit introduces basic accounting concepts to non-accountants. The information requirements of two main groups of information users are examined - external users such as current and potential investors and internal users such as managers. This unit provides an introduction to the structure, meaning, analysis and interpretation of financial statements, in addition to exploring financial issues confronting managers, such as cost and performance measurement and budgeting.',
 'ULO1 - demonstrate an understanding of various forms of business organisations\nULO2 - apply financial and management accounting principles in the preparation of financial statements\nULO3 - measure and interpret information relating to financial performance, financial position, liquidity and risk indicators of businesses\nULO4 - measure and interpret financial and non-financial information for managers to use in planning, decision making and control\nULO5 - develop the ability to work effectively in a team and

### Baseline: Unit Synopsis

A few strategies:
- Combine embeddings (mean/sum) for both `HANDBOOK_SYNOPSIS` and `UNIT_LEARNING_OUTCOME`
- Keep them separate to provide two distinct "modes" for the viz that users can select from
- Show them the top-K similar UNITS in terms of synopsis or outcomes
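
The first and third strategies can be sketched with plain NumPy. This is only an illustration under assumptions: `combine_embeddings` and `top_k_similar` are hypothetical helper names, similarity is taken to be cosine similarity, and the toy 2-dimensional vectors stand in for the real sentence embeddings.

```python
import numpy as np

def combine_embeddings(synopsis_emb, outcome_emb):
    """Average two (n_units, dim) embedding matrices row by row."""
    return (np.asarray(synopsis_emb) + np.asarray(outcome_emb)) / 2.0

def top_k_similar(embeddings, query_index, k=5):
    """Return indices of the k rows most cosine-similar to the query row."""
    emb = np.asarray(embeddings, dtype=float)
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # keep all-zero rows from dividing by zero
    unit = emb / norms
    scores = unit @ unit[query_index]
    scores[query_index] = -np.inf  # exclude the query unit itself
    order = np.argsort(-scores)
    return order[:k].tolist()

toy = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [-1.0, 0.0],
])
print(top_k_similar(toy, query_index=0, k=2))  # prints [1, 2]
```

The returned positions index rows of the embedding matrix, so they would need to be mapped back to `UNIT_CODE` through the metadata frame in the same row order.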

In [21]:
get_bert_embeddings(df_txt.loc[0, 'HANDBOOK_SYNOPSIS'], model).shape

(384,)

In [22]:
df_txt = df_txt.loc[df_txt.HANDBOOK_SYNOPSIS != "", :].copy()  # copy before adding the embeddings column
df_txt.shape

(5297, 8)

In [23]:
df_txt['embeddings'] = df_txt['HANDBOOK_SYNOPSIS'].apply(lambda x: get_bert_embeddings(x, model))

In [24]:
df_embeddings = pd.DataFrame(df_txt["embeddings"].to_list())

In [25]:
df_embeddings

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,...,344,345,346,347,348,349,350,351,352,353,354,355,356,357,358,359,360,361,362,363,364,365,366,367,368,369,370,371,372,373,374,375,376,377,378,379,380,381,382,383
0,0.049430,0.036486,-0.059333,0.028311,-0.068452,-0.025572,0.082057,0.022052,0.076473,-0.004748,-0.005129,-0.083492,-0.028406,-0.053263,0.002599,-0.122760,0.004804,-0.077006,0.013091,0.009666,0.062835,-0.004316,0.006714,0.030640,-0.007846,-0.014437,-0.032466,0.055979,0.012983,-0.039692,-0.035119,0.019072,0.113799,-0.013120,-0.006766,-0.014247,0.018174,0.128508,0.003950,0.005199,...,-0.004387,-0.031537,0.020803,0.058806,0.006792,-0.059722,-0.001031,0.079582,0.066685,0.005510,-0.007321,-0.030533,0.012080,0.038542,-0.033219,-0.005014,-0.085667,-0.012656,0.041698,0.061884,0.007849,0.033259,-0.103763,-0.012229,0.045732,-0.056332,-0.013825,0.015461,0.079118,0.115157,0.016234,0.038647,0.033976,0.072122,-0.034113,-0.029431,0.065845,0.018878,-0.002411,-0.004672
1,0.012721,0.020302,-0.089373,0.029205,-0.100231,0.047186,0.044921,0.003473,0.015653,0.013069,-0.015994,-0.054618,-0.017134,-0.047070,0.022977,-0.141775,-0.030008,-0.073604,0.035636,0.060599,0.040012,0.011723,0.012229,0.019345,-0.074937,-0.003801,0.022155,0.049982,0.039038,-0.061193,-0.045439,-0.018802,0.089773,-0.003225,-0.013433,0.028904,0.032719,0.048772,0.011705,0.028134,...,0.003240,-0.079057,-0.003367,0.075400,0.038143,-0.100702,-0.006414,0.050814,0.074911,0.024747,0.022804,0.028783,0.051956,-0.009070,0.012103,0.030666,-0.068625,0.005940,0.020455,0.045944,-0.038614,0.061240,-0.114387,0.084165,0.014242,-0.026239,-0.015909,0.008905,0.053963,0.094199,0.014408,0.063934,-0.005599,0.003223,-0.028036,-0.024800,0.044891,0.083983,0.037209,-0.005507
2,0.041525,0.020710,-0.104727,0.053119,-0.063939,0.145475,-0.006584,0.013607,-0.016274,0.041064,-0.001463,-0.062867,-0.026238,0.044929,-0.025241,-0.072922,-0.024269,-0.010401,0.019862,0.030063,0.042374,0.045248,0.014316,0.019791,-0.033261,0.005737,-0.035817,0.046313,-0.013639,-0.069572,-0.062446,0.027060,0.088501,-0.001673,0.070538,0.002996,0.052349,0.025556,0.012658,-0.017768,...,-0.057574,-0.079593,0.031444,0.102588,0.051701,-0.005918,-0.007915,0.064675,0.056812,0.001117,-0.071258,-0.042550,0.075606,0.052781,-0.028022,0.025534,-0.008886,-0.026029,-0.009614,0.089190,-0.088106,0.054429,-0.136495,0.032303,0.008176,-0.083002,0.043164,0.012278,0.027970,0.065549,-0.038777,0.075341,-0.010161,-0.010265,-0.018843,-0.059328,0.048046,-0.026207,0.040385,-0.012090
3,0.040773,-0.015366,-0.021432,0.013921,0.002273,0.036791,0.069693,0.030917,0.014918,0.059604,-0.018498,-0.021194,0.006928,0.007854,-0.039247,-0.080535,-0.005979,0.000468,0.016622,-0.011943,0.072286,-0.011470,0.003223,0.062409,-0.081462,-0.028964,0.005284,0.010257,-0.017777,-0.057212,-0.061031,0.026651,0.073659,0.012806,-0.005195,-0.005284,-0.040707,-0.012640,-0.011117,0.013296,...,0.019986,-0.065862,0.053509,0.089661,0.032313,-0.035758,0.013874,0.089588,0.094570,0.043676,0.024544,-0.044664,0.012883,0.018526,0.003028,0.025059,-0.129340,0.016878,0.009357,0.076822,0.030013,0.044259,-0.161702,0.011148,-0.047760,-0.027653,-0.005154,0.033693,0.081066,0.051677,0.001578,0.032848,0.085195,-0.016641,0.036665,-0.021959,0.062364,0.016640,0.012791,0.015125
4,-0.024332,-0.015139,-0.110688,-0.049983,-0.068496,0.024542,0.014532,0.034777,0.048527,0.038183,-0.039465,-0.024378,0.043406,-0.003061,0.012502,-0.077279,0.028125,-0.089346,0.004201,-0.019009,0.019051,-0.007406,-0.041594,0.015253,-0.053541,-0.033170,0.023718,-0.038502,-0.011490,-0.072711,-0.078091,-0.012278,0.164108,0.056577,-0.057149,0.009644,0.089818,-0.016148,0.000899,0.031223,...,0.029838,-0.049142,-0.065701,0.084629,0.051370,-0.026313,0.056897,0.125020,0.003141,0.036343,-0.028344,-0.024121,0.034046,0.023522,0.007375,-0.013553,-0.034468,0.024432,-0.030379,0.046067,-0.033232,0.066418,-0.089958,0.007719,-0.005725,0.006876,-0.015328,-0.012804,0.017458,0.093933,0.001219,0.018883,0.036808,-0.056187,0.036593,0.007428,0.000238,0.029433,0.035828,0.033832
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5292,0.045825,0.002348,-0.022949,-0.022758,-0.029135,0.058328,0.002187,-0.082958,0.011485,-0.009267,-0.037400,-0.041441,-0.032366,0.038213,-0.056152,-0.011578,0.032782,0.030519,0.047092,0.013139,0.025830,0.027229,0.019867,-0.027691,-0.081859,0.020964,0.005943,-0.068896,0.089740,-0.050601,-0.066920,0.050851,0.026278,0.006663,0.004524,0.177274,0.007189,0.009615,-0.017210,0.063187,...,-0.088741,0.023735,-0.006424,0.012625,-0.005937,0.003630,-0.009854,0.133050,-0.001327,0.002841,-0.011198,-0.020894,-0.050984,0.012341,0.036443,0.038593,-0.056472,-0.051032,0.053603,0.046401,-0.025593,0.084142,0.018844,0.026004,-0.035227,0.032644,-0.120905,-0.070807,0.031700,0.052595,0.048660,0.018469,0.035266,-0.008199,-0.025748,0.015269,0.013076,0.001907,-0.126006,-0.029926
5293,0.012139,-0.024767,-0.046055,0.042246,-0.020888,0.076084,-0.043391,-0.061456,-0.011250,0.019409,-0.039317,-0.028689,0.015277,0.061507,-0.013273,0.023799,0.035511,0.048533,0.055478,0.045425,0.066818,0.020689,0.045465,0.009965,-0.039454,-0.007803,-0.010705,-0.053511,0.102174,-0.027998,-0.049435,0.089900,0.032088,0.037505,0.032111,0.177677,0.035332,-0.036410,-0.057841,0.002917,...,-0.080943,0.004957,0.007055,0.007748,-0.013634,-0.030985,-0.011573,0.065846,-0.006795,0.063418,-0.044686,-0.049651,-0.039598,0.005464,-0.003763,0.056308,-0.000784,-0.036197,0.044891,0.033052,0.000986,0.059289,0.015125,0.048592,-0.029770,0.062536,-0.115744,-0.052087,0.025801,0.034618,0.052978,0.038412,0.023558,0.018010,0.026596,0.054017,-0.057633,-0.009506,-0.098609,-0.009620
5294,0.025295,0.009707,-0.043437,0.004690,-0.028572,0.053570,-0.017481,-0.032099,0.027415,-0.030504,-0.048409,-0.079628,-0.007519,-0.000766,-0.020173,-0.048904,0.016029,-0.021540,0.000618,0.011502,0.037301,-0.017169,0.017960,0.027817,-0.068101,-0.007271,0.032190,-0.021574,0.089639,-0.045048,-0.020716,0.062671,0.073234,0.010984,-0.009964,0.158034,0.022462,0.053551,0.007560,0.084647,...,-0.070944,0.016287,-0.033852,0.062589,-0.056435,-0.036008,0.026728,0.115582,0.008202,0.000479,-0.021222,-0.076397,-0.019541,0.024328,0.044234,0.087934,-0.000057,-0.027890,0.004140,0.058738,0.002556,0.064306,-0.042531,0.044478,-0.055215,0.033081,-0.099778,-0.045926,0.045235,0.047170,0.018046,0.062795,-0.017208,0.025467,-0.060478,0.073320,-0.003855,0.005485,-0.058613,-0.025527
5295,-0.028266,-0.061888,-0.009002,0.008550,-0.023232,0.022092,-0.039583,0.033949,-0.053184,0.010602,0.071309,0.019923,0.008717,0.043923,-0.066768,-0.049011,-0.000005,0.014482,-0.016701,-0.039606,-0.012053,-0.019426,0.012295,0.030631,0.006393,-0.041928,0.016116,0.051249,-0.006486,0.029145,0.012257,0.044764,0.011856,-0.039444,-0.023594,0.077680,0.073113,0.030906,-0.033086,0.006555,...,-0.074718,0.070368,0.026658,0.027408,-0.016044,0.041437,-0.018995,-0.036800,0.044380,-0.005741,0.037622,0.003454,-0.047451,0.035737,0.048424,-0.060154,0.066682,-0.090170,0.017595,0.073168,0.061531,0.115817,0.009085,-0.018026,0.055448,0.035329,0.012495,0.012682,-0.009550,0.017790,-0.000464,0.091296,0.032186,-0.095970,-0.079010,-0.004554,0.081503,-0.064255,-0.016460,-0.093074


In [26]:
df_meta = df_txt.loc[:, ['UNIT_TITLE', 'UNIT_CODE', 'HANDBOOK_SYNOPSIS', 'STUDY_LEVEL', 'OWNING_FACULTY', 'OWNING_ORG_UNIT']].copy()

In [27]:
df_meta.head()

Unnamed: 0,UNIT_TITLE,UNIT_CODE,HANDBOOK_SYNOPSIS,STUDY_LEVEL,OWNING_FACULTY,OWNING_ORG_UNIT
0,Accounting in business,ACB1020,This unit introduces basic accounting concepts...,undergraduate,Faculty of Business and Economics,Department of Accounting
2,Financial accounting 1,ACB1120,This unit provides you with an introduction to...,undergraduate,Faculty of Business and Economics,Department of Accounting
5,Financial accounting 2,ACB2120,This unit provides an overview of the current ...,undergraduate,Faculty of Business and Economics,Department of Accounting
6,Management accounting 1,ACB2220,Introduction to management accounting. Topics ...,undergraduate,Faculty of Business and Economics,Department of Accounting
7,Accounting information systems,ACB2420,"The objective of this unit is two-fold. First,...",undergraduate,Faculty of Business and Economics,Department of Accounting


Remove newlines

In [28]:
df_meta['HANDBOOK_SYNOPSIS'] = df_meta.HANDBOOK_SYNOPSIS.apply(lambda x: x.replace("\n", " "))

### Export results

In [None]:
# df_txt.to_csv("unit_synopsis_embeddings.csv", index=False)

In [29]:
df_embeddings.to_csv(f"{path}/unit_synopsis_embeddings_only_all-MiniLM-L6-v2.tsv", sep="\t", header=False, index=False)

In [30]:
df_meta.to_csv(f"{path}/unit_synopsis_metadata_only_all-MiniLM-L6-v2.tsv", sep="\t", index=False)

### Baseline: Learning Outcome

Filtering/cleaning

In [31]:
import re
# matches prefixes such as "ULO1 - " at the start of each learning outcome
pattern = r'ULO\d+\s-\s?'

In [32]:
def apply_filter(x):
    x = re.sub(pattern, '', x)
    x = x.replace('\n', '. ')
    return x
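
To make the effect of the cleaning concrete, the snippet below applies the same two steps to a made-up two-outcome string. It defines its own compiled pattern and a hypothetical `tidy_outcomes` function so that it runs on its own. The sample text is illustrative, not taken from the dataset.

```python
import re

ULO_PREFIX = re.compile(r"ULO\d+\s-\s?")  # equivalent to the pattern above, compiled once

def tidy_outcomes(text):
    """Drop 'ULO<n> - ' prefixes and join the outcomes into sentences."""
    text = ULO_PREFIX.sub("", text)
    return text.replace("\n", ". ")

sample = "ULO1 - describe cost behaviour\nULO2 - apply costing methods"
print(tidy_outcomes(sample))  # prints: describe cost behaviour. apply costing methods
```

A prefix written without a space before the hyphen (for example `ULO1- `) is not matched by this pattern and would be left in place.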

In [33]:
df_txt['UNIT_LEARNING_OUTCOME_TIDY'] = df_txt['UNIT_LEARNING_OUTCOME'].apply(apply_filter)

In [34]:
df_txt.UNIT_LEARNING_OUTCOME_TIDY

0       demonstrate an understanding of various forms ...
2       identify and analyse measurement systems and t...
5       explain the content of, and regulatory require...
6       describe cost behaviour under different assump...
7       examine the role of accounting information sys...
                              ...                        
5843    Demonstrate self-motivation, and able to const...
5844    Justify your conceptual, material and logistic...
5845    Compose a self-directed and coherent work plan...
5846                                                     
5847                                                     
Name: UNIT_LEARNING_OUTCOME_TIDY, Length: 5297, dtype: object

In [35]:
df_txt = df_txt.loc[df_txt.UNIT_LEARNING_OUTCOME_TIDY != "", :].copy()  # copy before overwriting the embeddings column

In [36]:
df_txt.shape

(4941, 10)

In [37]:
df_txt['embeddings'] = df_txt['UNIT_LEARNING_OUTCOME_TIDY'].apply(lambda x: get_bert_embeddings(x, model))

In [38]:
df_embeddings = pd.DataFrame(df_txt["embeddings"].to_list())

In [39]:
df_embeddings

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,...,344,345,346,347,348,349,350,351,352,353,354,355,356,357,358,359,360,361,362,363,364,365,366,367,368,369,370,371,372,373,374,375,376,377,378,379,380,381,382,383
0,0.083268,0.073515,-0.084015,0.036917,-0.071435,0.020099,0.072414,0.009141,0.057214,-0.012768,-0.059674,-0.092708,-0.059305,-0.004339,0.017727,-0.085220,-0.036699,0.008513,0.008665,0.013997,0.039447,-0.013720,0.024463,0.044717,-0.064188,-0.005584,-0.002826,0.042329,0.021518,-0.065877,-0.026540,0.007265,0.103098,0.021197,-0.020909,0.046053,0.028226,0.067197,0.069710,0.014653,...,0.071266,-0.075304,-1.534046e-02,0.031985,0.050403,-0.040217,0.020723,0.047585,0.096883,0.022978,-0.024384,-0.009349,0.043391,0.019272,0.007071,0.040076,-0.082708,-0.010072,0.029494,0.035412,-0.046548,0.000010,-0.107051,0.102470,0.001597,-0.084114,0.029821,0.025514,0.026079,0.099872,0.009164,0.005831,0.020759,0.025908,-0.069881,-0.058588,0.039603,0.012212,0.030530,-0.037905
1,0.036672,0.036816,-0.055494,-0.016555,-0.109912,0.012588,0.074628,0.003314,0.051804,0.009599,0.037693,-0.128263,0.015954,0.004797,0.000048,-0.123172,-0.005766,0.018876,0.060623,0.002303,0.045855,-0.008295,0.010501,0.020452,-0.007280,-0.023514,-0.069798,0.052990,0.018768,-0.060755,-0.034065,-0.000829,0.057628,0.042296,0.012880,0.013958,0.065667,0.022514,0.061823,0.005218,...,-0.011093,-0.054650,-4.776023e-02,0.075810,0.056373,-0.077092,0.014076,0.027825,0.061860,-0.016419,0.013829,-0.014873,0.017475,0.020162,0.041882,0.035626,-0.097880,-0.015185,0.065379,0.057501,-0.002158,0.013107,-0.118228,0.066722,0.008739,-0.018803,-0.017860,0.039355,0.032235,0.049576,-0.009315,-0.009387,0.029966,0.016203,-0.085643,0.056214,0.089103,-0.000166,-0.021122,-0.076160
2,0.083723,0.049477,-0.044086,0.044515,-0.078590,0.050484,0.043661,0.045655,0.016474,-0.043575,-0.009442,-0.097784,-0.046282,-0.023909,0.025238,-0.108204,-0.016533,-0.021021,0.032219,0.013589,0.092117,0.007490,0.060561,0.053938,-0.060763,-0.001840,-0.009774,0.065546,0.025498,-0.040675,-0.035244,-0.022802,0.075977,0.017910,0.031753,0.072579,0.014453,0.032839,0.023983,-0.001473,...,0.017285,-0.105751,2.735224e-02,0.039088,0.034648,-0.003578,-0.039066,0.016623,0.078325,0.003837,0.001858,-0.000412,0.049950,0.034633,-0.006224,0.047109,-0.066557,0.011531,0.046951,0.070635,-0.009290,0.010224,-0.139284,0.051036,-0.001429,-0.036383,0.041999,-0.016756,0.023452,0.080688,0.029348,0.072237,0.000773,0.028682,-0.068542,-0.077498,0.048086,0.071591,0.025449,-0.050087
3,-0.004654,0.063362,-0.068568,0.036475,0.007228,0.006653,0.010422,0.082622,0.032757,0.097565,-0.062942,-0.064237,0.025756,0.038799,-0.056821,-0.106210,0.040913,-0.050084,0.010119,-0.113653,0.017875,-0.065983,-0.052398,0.031441,-0.018113,-0.033054,-0.018150,0.021337,0.034366,-0.027607,-0.040223,0.050901,0.047874,0.015354,-0.011223,0.064719,-0.050458,0.034019,0.022613,0.039816,...,0.004622,-0.014582,-3.196067e-02,0.079225,-0.036200,0.015618,-0.042649,0.066223,0.045499,0.004121,0.014325,-0.039068,-0.020413,-0.005636,0.013805,0.024096,-0.072031,0.043118,-0.036911,0.073461,0.077302,0.005419,-0.194487,-0.032633,-0.091011,-0.089994,0.044160,0.044214,0.015905,0.048751,0.039664,-0.011997,0.085643,-0.026804,0.005977,0.035269,0.031772,-0.046070,-0.037605,0.025166
4,0.003050,0.077509,-0.065115,-0.018655,-0.051351,0.016636,0.109407,0.036844,0.039608,0.067764,-0.039488,-0.070701,0.030692,0.017611,-0.005930,-0.114828,-0.013728,-0.092426,0.042311,-0.040873,-0.009102,-0.020574,-0.036542,-0.036173,-0.057283,-0.030823,-0.066019,0.029429,-0.019395,-0.078877,0.004849,-0.013707,0.132396,0.033129,-0.062177,0.008659,0.021151,0.090285,0.047890,0.035185,...,0.020764,-0.016934,-1.173453e-02,0.066278,-0.025055,-0.032114,-0.004846,0.050704,0.014504,0.067975,0.036980,-0.032254,0.001608,-0.027213,-0.011329,-0.010986,-0.082645,0.005202,0.056827,0.072472,-0.027266,0.005022,-0.112209,0.012028,0.014758,-0.041072,0.005132,-0.008448,0.012391,0.093909,0.021012,-0.041413,0.063373,-0.002820,-0.049564,0.015496,0.074806,-0.015119,-0.014245,0.037606
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4936,0.070479,-0.016183,-0.005395,-0.012133,0.002947,0.040266,-0.062758,0.004594,-0.011425,-0.026487,-0.083320,0.013419,0.022308,0.051117,-0.025433,0.010676,-0.005407,0.018376,0.019448,0.032461,-0.052818,-0.035001,0.024222,-0.013169,0.020963,0.007907,0.005337,-0.057825,0.084487,-0.028947,-0.035409,0.052946,-0.000677,0.011577,-0.022830,0.130375,0.008922,-0.001638,-0.011977,0.048845,...,-0.079476,0.015498,6.012678e-02,-0.010194,-0.032734,0.012599,0.040487,0.095823,-0.001926,0.050196,-0.034087,-0.090270,0.053830,0.004762,-0.014520,0.034075,-0.011722,0.014711,0.057626,0.080493,0.000166,0.026988,0.064418,0.046556,-0.046219,0.039433,-0.053342,-0.037999,0.053545,0.042079,0.033561,0.037186,0.034732,0.056066,0.004349,0.055306,0.005161,0.016434,-0.022319,-0.015748
4937,0.004632,0.055278,-0.007318,-0.022775,-0.003292,0.056340,0.046017,-0.015889,-0.061952,-0.015340,-0.061857,-0.043810,-0.029623,0.036480,-0.039363,-0.011975,0.049904,0.024367,-0.010541,0.017544,0.033350,0.043270,0.042053,0.038939,-0.066678,-0.020990,-0.016548,0.002420,0.093744,-0.062800,-0.022101,0.072052,0.019308,-0.030504,0.040716,0.125457,-0.022356,-0.008413,0.012092,0.010246,...,-0.054266,-0.017162,3.712818e-03,0.015855,-0.031384,0.028965,-0.037995,0.101315,-0.026451,-0.031986,-0.036917,-0.071630,0.028776,-0.008725,0.052034,0.059499,-0.047195,0.025691,0.046531,0.072265,-0.037951,-0.015713,0.008412,0.058363,-0.008375,0.037357,-0.027571,-0.060303,0.062488,0.001805,0.025181,0.070120,-0.011467,0.040498,-0.059844,0.056202,0.028652,0.028626,-0.018350,-0.010360
4938,0.008761,0.074300,-0.010118,-0.002762,-0.016300,0.050487,0.024564,0.004672,-0.071272,-0.038554,-0.080242,-0.057881,-0.031251,0.012809,-0.044194,0.004758,0.055667,0.067183,0.018107,0.048386,0.051310,0.023292,0.022077,-0.001237,-0.091518,-0.045569,0.000838,0.021424,0.128144,-0.048820,-0.022369,0.072165,0.001627,-0.040327,0.005623,0.130760,-0.048734,0.017030,0.028395,0.019119,...,-0.046862,0.002095,-1.443668e-03,-0.016898,-0.016509,0.040379,-0.050304,0.074450,0.052909,-0.015426,-0.023391,-0.052885,0.003771,-0.020276,0.041561,0.048760,-0.022661,0.047938,0.073394,0.045152,-0.038099,-0.028009,0.022084,0.042930,-0.016836,0.071701,-0.053938,-0.082806,0.066614,0.014976,0.030879,0.065525,0.026397,0.040088,-0.046231,0.076915,0.019812,0.032605,0.015487,0.004117
4939,0.070576,0.020423,-0.038023,0.000396,0.012386,0.046084,-0.008685,-0.005522,-0.021337,-0.030592,-0.054077,-0.088775,0.012277,-0.003525,-0.038626,-0.055676,0.016176,-0.005859,0.036807,0.004832,-0.019132,-0.009077,0.052763,0.011946,-0.014828,-0.004953,0.031094,0.007100,0.098844,-0.033092,-0.044480,0.115598,-0.026078,-0.004942,0.046463,0.168085,-0.012213,-0.015893,0.011558,0.057777,...,-0.067977,0.021528,1.853421e-02,0.033149,-0.012866,-0.006686,0.031420,0.079341,-0.069067,0.004812,-0.023906,-0.102833,0.027238,0.002427,-0.001275,0.032004,-0.022617,-0.013102,0.064931,0.097208,-0.017744,0.001840,-0.024648,0.065283,-0.046386,0.027130,-0.063467,-0.081014,0.045739,0.033384,0.045813,0.043872,0.025080,0.078121,-0.034227,0.081302,0.022312,0.023160,-0.019333,-0.013517


In [40]:
df_meta = df_txt.loc[:, ['UNIT_TITLE', 'UNIT_CODE', 'UNIT_LEARNING_OUTCOME', 'STUDY_LEVEL', 'OWNING_FACULTY', 'OWNING_ORG_UNIT']].copy()

In [41]:
df_meta.head()

Unnamed: 0,UNIT_TITLE,UNIT_CODE,UNIT_LEARNING_OUTCOME,STUDY_LEVEL,OWNING_FACULTY,OWNING_ORG_UNIT
0,Accounting in business,ACB1020,ULO1 - demonstrate an understanding of various...,undergraduate,Faculty of Business and Economics,Department of Accounting
2,Financial accounting 1,ACB1120,ULO1 - identify and analyse measurement system...,undergraduate,Faculty of Business and Economics,Department of Accounting
5,Financial accounting 2,ACB2120,"ULO1 - explain the content of, and regulatory ...",undergraduate,Faculty of Business and Economics,Department of Accounting
6,Management accounting 1,ACB2220,ULO1 - describe cost behaviour under different...,undergraduate,Faculty of Business and Economics,Department of Accounting
7,Accounting information systems,ACB2420,ULO1 - examine the role of accounting informat...,undergraduate,Faculty of Business and Economics,Department of Accounting


Remove newlines (?)

In [42]:
df_meta['UNIT_LEARNING_OUTCOME'] = df_meta.UNIT_LEARNING_OUTCOME.apply(lambda x: x.replace("\n", " "))

### Export results

In [None]:
# df_txt.to_csv("unit_learning_outcomes_embeddings.csv", index=False)

In [43]:
df_embeddings.to_csv(f"{path}/unit_learning_outcomes_embeddings_only_all-MiniLM-L6-v2.tsv", sep="\t", header=False, index=False)

In [44]:
df_meta.to_csv(f"{path}/unit_learning_outcomes_metadata_only_all-MiniLM-L6-v2.tsv", sep="\t", index=False)
