# Knowledge Mapper

The goal is to create an embedding visualization from the text of the unit handbook that helps students decide which courses to pursue, based on their interests and currently enrolled units.

```
Key fields:
1. UNIT_TITLE
2. HANDBOOK_SYNOPSIS
3. UNIT_LEARNING_OUTCOME

Unit metadata:
1. UNIT_CODE
2. ABBREVIATED_UNIT_TITLE

Colour by:
1. STUDY_LEVEL (UG vs. PG)
2. OWNING_FACULTY (faculty that teaches it)
3. OWNING_ORG_UNIT (department that teaches it)

Auxiliary questions:
1. Which units are prohibited from being taken together with this unit?
2. What units should I take before this unit?

Note: PUBLISH_TO_HANDBOOK probably should be used as a first-order filter.
```
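The two auxiliary questions could be approached by pulling unit codes out of rule strings such as those in the `PREREQUISITE` and `PROHIBITION` columns. The snippet below is a minimal sketch, not part of the original notebook: `extract_unit_codes` is a hypothetical helper, and the pattern (three capital letters followed by four digits) is an assumption based on the sample codes shown in this notebook. It ignores the AND/OR logic of the rules and only lists the codes mentioned.

```python
import re

# Assumed shape of a unit code, e.g. "ACB1120": three capital letters + four digits.
# Real handbook data may contain codes that do not match this pattern.
UNIT_CODE_RE = re.compile(r"\b[A-Z]{3}\d{4}\b")

def extract_unit_codes(rule_text):
    """Return the unit codes mentioned in a rule string, in order, without duplicates."""
    if not isinstance(rule_text, str):
        return []  # missing values (NaN) carry no codes
    seen = []
    for code in UNIT_CODE_RE.findall(rule_text):
        if code not in seen:
            seen.append(code)
    return seen

# Example strings copied from the PROHIBITION / RULES columns shown in this notebook.
prohibited = extract_unit_codes("ACC1100 OR ACF1100 OR ACW1120")
prerequisites = extract_unit_codes("Prerequisite: ACB1100; \n")
```

Applied column-wise (for example with `df["PROHIBITION"].apply(extract_unit_codes)`), this would give one list of related unit codes per unit.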

### [Embedding Projector](https://projector.tensorflow.org/)

Data format:
```
Load data from your computer
Step 1: Load a TSV file of vectors.
Example of 3 vectors with dimension 4:
0.1\t0.2\t0.5\t0.9
0.2\t0.1\t5.0\t0.2
0.4\t0.1\t7.0\t0.8
Step 2 (optional): Load a TSV file of metadata.
Example of 3 data points and 2 columns.
Note: If there is more than one column, the first row will be parsed as column labels.
Pokémon\tSpecies
Wartortle\tTurtle
Venusaur\tSeed
Charmeleon\tFlame
```

Metadata
```
UNIT_CODE\tUNIT_TITLE\tHANDBOOK_SYNOPSIS\tUNIT_LEARNING_OUTCOME\tSTUDY_LEVEL\tOWNING_FACULTY\tOWNING_ORG_UNIT
```
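The two-file layout described above can be sketched with the toy data from the Projector instructions. This is an illustrative example only: `to_tsv` is a hypothetical helper that builds the TSV text in memory, and the only rule it demonstrates is the one quoted above, namely that a multi-column metadata file carries one header row in addition to one row per vector.

```python
import csv
import io

def to_tsv(rows):
    """Serialise a list of rows as tab-separated text with one row per line."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)
    return buf.getvalue()

# Three vectors of dimension 4: no header row in the vectors file.
vectors = [[0.1, 0.2, 0.5, 0.9], [0.2, 0.1, 5.0, 0.2], [0.4, 0.1, 7.0, 0.8]]

# Two metadata columns: the first row is parsed as column labels.
metadata = [
    ["Pokémon", "Species"],
    ["Wartortle", "Turtle"],
    ["Venusaur", "Seed"],
    ["Charmeleon", "Flame"],
]

vectors_tsv = to_tsv(vectors)
metadata_tsv = to_tsv(metadata)
```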

### Libraries

In [53]:
!pip install transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/98/87/ef312eef26f5cecd8b17ae9654cdd8d1fae1eb6dbd87257d6d73c128a4d0/transformers-4.3.2-py3-none-any.whl (1.8MB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading https://files.pythonhosted.org/packages/fd/5b/44baae602e0a30bcc53fbdbc60bd940c15e143d252d658dfdefce736ece5/tokenizers-0.10.1-cp36-cp36m-manylinux2010_x86_64.whl (3.2MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)
Building wheels for collected packages: sacremoses
  Building wheel for sacremoses (setup.py) ... done
  Created wheel for sacremoses: filename=sacremoses-0.0.43-cp36-none-any.whl size=893261 sha256=11b0b2132a9

In [None]:
!pip install sentence-transformers

Collecting sentence-transformers
  Downloading https://files.pythonhosted.org/packages/6a/e2/84d6acfcee2d83164149778a33b6bdd1a74e1bcb59b2b2cd1b861359b339/sentence-transformers-0.4.1.2.tar.gz (64kB)
Collecting sentencepiece
  Downloading https://files.pythonhosted.org/packages/14/67/e42bd1181472c95c8cda79305df848264f2a7f62740995a46945d9797b67/sentencepiece-0.1.95-cp36-cp36m-manylinux2014_x86_64.whl (1.2MB)
Building wheels for collected packages: sentence-tran

In [3]:
# general
import json
import numpy as np
import pandas as pd
from pathlib import Path
from sentence_transformers import SentenceTransformer

In [86]:
data_path = Path('/content/drive/MyDrive/knowledge-mapper/data/raw_data/units owned by faculty - one line per unit (as at 18 Feb 2021).xlsx')
df = pd.read_excel(data_path, sheet_name='Sheet1')

In [87]:
df.shape

(5848, 36)

In [88]:
df.columns

Index(['CL_UNIT_ID', 'UNIT_CODE', 'CL_UNIT_VERSION', 'UNIT_TITLE',
       'ABBREVIATED_UNIT_TITLE', 'CREDIT_POINTS', 'UNIT_STATUS',
       'OWNING_FACULTY', 'OWNING_ORG_UNIT', 'HIGHEST_SCA_BAND',
       'IMPLEMENTATION_YR', 'PUBLISH_TO_HANDBOOK', 'STUDY_LEVEL',
       'HIGHEST_SCA_BAND_1', 'UNIT_EFTSL', 'HANDBOOK_SYNOPSIS',
       'WORKLOAD_REQUIREMENTS', 'QUOTA_INFORMATION', 'OTHER_UNIT_COSTS',
       'FIELD_WORK', 'AREA_OF_STUDY_LINKS', 'OFF_CAMPUS_ATTEND_REQUIREMENTS',
       'SPECIAL_NOTE_TO_STUDENTS', 'UNITOFFERING',
       'HANDBOOK_ASSESSMENT_SUMMARY', 'ASSES_ITEMS', 'UNITCOORD', 'CHIEFEXAM',
       'TEACHING_RESPONSIBILITY', 'PREREQUISITE', 'COREQUISITE', 'PROHIBITION',
       'RULES(PREREQ,COREQ,PROH)', 'INFORMATION_RULE', 'LEARNING_OUTCOME_INFO',
       'UNIT_LEARNING_OUTCOME'],
      dtype='object')

In [89]:
df.head()

Unnamed: 0,CL_UNIT_ID,UNIT_CODE,CL_UNIT_VERSION,UNIT_TITLE,ABBREVIATED_UNIT_TITLE,CREDIT_POINTS,UNIT_STATUS,OWNING_FACULTY,OWNING_ORG_UNIT,HIGHEST_SCA_BAND,IMPLEMENTATION_YR,PUBLISH_TO_HANDBOOK,STUDY_LEVEL,HIGHEST_SCA_BAND_1,UNIT_EFTSL,HANDBOOK_SYNOPSIS,WORKLOAD_REQUIREMENTS,QUOTA_INFORMATION,OTHER_UNIT_COSTS,FIELD_WORK,AREA_OF_STUDY_LINKS,OFF_CAMPUS_ATTEND_REQUIREMENTS,SPECIAL_NOTE_TO_STUDENTS,UNITOFFERING,HANDBOOK_ASSESSMENT_SUMMARY,ASSES_ITEMS,UNITCOORD,CHIEFEXAM,TEACHING_RESPONSIBILITY,PREREQUISITE,COREQUISITE,PROHIBITION,"RULES(PREREQ,COREQ,PROH)",INFORMATION_RULE,LEARNING_OUTCOME_INFO,UNIT_LEARNING_OUTCOME
0,554c86a41b5aac10653b206b274bcbf9,ACB1020,2021.04RO,Accounting in business,ACC IN BUS,6.0,Accredited,Faculty of Business and Economics,Department of Accounting,SCA Band 4,2021,Y,undergraduate,SCA Band 4,0.125,This unit introduces basic accounting concepts...,Minimum total expected workload to achieve the...,,,,,,,S1-01-PENINSULA-ON-CAMPUS Offered-Y,,1 - 50% APPLY_TO_ALL_OFFER - Y\n2 - 50% APPL...,,Dr Mahendra Goyal,Responsible teaching Department of Accounting ...,,,ACB1120 OR ACF1100 OR ACF1200 OR ACC1100 OR AC...,Prohibition: ACB1120 OR ACF1100 OR ACF1200 OR ...,,"On successful completion of this unit, you sho...",ULO1 - demonstrate an understanding of various...
1,1928deab1bd65c504c45bbbbdc4bcb3a,ACB1100,2021.01RO,Introduction to financial accounting,INTRO FIN ACC,6.0,Accredited,Faculty of Business and Economics,Department of Accounting,SCA Band 3,2021,N,undergraduate,SCA Band 3,0.125,This unit provides students with an introducti...,Minimum total expected workload to achieve the...,,Costs are indicative and subject to change.Ele...,,,,,,Within semester assessment: 50% + Examination:...,,,Dr Mahendra Goyal,Responsible teaching Department of Accounting ...,,,ACC1100 OR ACF1100 OR ACW1100,Prohibition: ACC1100 OR ACF1100 OR ACW1100,,The learning outcomes associated with this uni...,ULO1 - Identify and analyse measurement system...
2,e14cc6a41b5aac10653b206b274bcb3e,ACB1120,2021.04RO,Financial accounting 1,INTRO FIN ACC,6.0,Accredited,Faculty of Business and Economics,Department of Accounting,SCA Band 4,2021,Y,undergraduate,SCA Band 4,0.125,This unit provides you with an introduction to...,Minimum total expected workload to achieve the...,,,,,,,S1-01-PENINSULA-ON-CAMPUS Offered-Y,,1 - 50% APPLY_TO_ALL_OFFER - Y\n2 - 50% APPL...,,Dr Mahendra Goyal,Responsible teaching Department of Accounting ...,,,ACC1100 OR ACF1100 OR ACW1120,Prohibition: ACC1100 OR ACF1100 OR ACW1120,,"On successful completion of this unit, you sho...",ULO1 - identify and analyse measurement system...
3,a128deab1bd65c504c45bbbbdc4bcbb4,ACB1200,2021.01RO,Accounting for managers,ACC FOR MNGRS,6.0,Accredited,Faculty of Business and Economics,Department of Accounting,SCA Band 3,2021,N,undergraduate,SCA Band 3,0.125,This unit introduces basic accounting concepts...,Minimum total expected workload to achieve the...,,Costs are indicative and subject to change.Ele...,,,,,,Within semester assessment: 50% + Examination:...,,,Mr Jonathan Phillips,Responsible teaching Department of Accounting ...,,,ACF1100 OR ACW1100 OR ACB1100 OR ACC1200 OR AC...,Prohibition: ACF1100 OR ACW1100 OR ACB1100 OR ...,,The learning outcomes associated with this uni...,ULO1 - Ddemonstrate an understanding of variou...
4,b528deab1bd65c504c45bbbbdc4bcbc7,ACB2020,2021.01RO,Cost information for decision making,COST INFO FOR DEC MA,6.0,Accredited,Faculty of Business and Economics,Department of Accounting,SCA Band 3,2021,N,undergraduate,SCA Band 3,0.125,Introduction to management accounting. Topics ...,Minimum total expected workload to achieve the...,,Costs are indicative and subject to change.Ele...,,,,,,Within semester assessment: 50% + Examination:...,,,Mr Paul Yap,Responsible teaching Department of Accounting ...,ACB1100,,,Prerequisite: ACB1100; \n,Corequisite: Students must be enrolled in cour...,The learning outcomes associated with this uni...,ULO1 - Describe cost behaviour under different...


### Unique Identifiers

In [90]:
df.CL_UNIT_ID.nunique(), df.UNIT_CODE.nunique()

(5848, 5848)
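Both counts equal the row count, so each identifier is unique in this extract. If a future extract did contain repeats, a small helper along the following lines could list them. This is a sketch on toy data; `find_duplicates` is a hypothetical name, not something defined elsewhere in this notebook.

```python
import pandas as pd

def find_duplicates(frame, column):
    """Return every row whose value in `column` occurs more than once."""
    return frame[frame[column].duplicated(keep=False)]

# Toy frame with one deliberately repeated unit code.
toy = pd.DataFrame(
    {
        "UNIT_CODE": ["ACB1020", "ACB1120", "ACB1020"],
        "UNIT_TITLE": ["Accounting in business", "Financial accounting 1", "Accounting in business"],
    }
)

dupes = find_duplicates(toy, "UNIT_CODE")
```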

In [91]:
key_cols = ["CL_UNIT_ID",
            "UNIT_CODE",
            "UNIT_TITLE",
            "HANDBOOK_SYNOPSIS",
            "UNIT_LEARNING_OUTCOME",
            "STUDY_LEVEL",
            "OWNING_FACULTY",
            "OWNING_ORG_UNIT"
            ]

### Filter by `PUBLISH_TO_HANDBOOK`

In [92]:
df = df.loc[df["PUBLISH_TO_HANDBOOK"] == 'Y', :]

In [93]:
df.shape

(5324, 36)

In [94]:
df.head()

Unnamed: 0,CL_UNIT_ID,UNIT_CODE,CL_UNIT_VERSION,UNIT_TITLE,ABBREVIATED_UNIT_TITLE,CREDIT_POINTS,UNIT_STATUS,OWNING_FACULTY,OWNING_ORG_UNIT,HIGHEST_SCA_BAND,IMPLEMENTATION_YR,PUBLISH_TO_HANDBOOK,STUDY_LEVEL,HIGHEST_SCA_BAND_1,UNIT_EFTSL,HANDBOOK_SYNOPSIS,WORKLOAD_REQUIREMENTS,QUOTA_INFORMATION,OTHER_UNIT_COSTS,FIELD_WORK,AREA_OF_STUDY_LINKS,OFF_CAMPUS_ATTEND_REQUIREMENTS,SPECIAL_NOTE_TO_STUDENTS,UNITOFFERING,HANDBOOK_ASSESSMENT_SUMMARY,ASSES_ITEMS,UNITCOORD,CHIEFEXAM,TEACHING_RESPONSIBILITY,PREREQUISITE,COREQUISITE,PROHIBITION,"RULES(PREREQ,COREQ,PROH)",INFORMATION_RULE,LEARNING_OUTCOME_INFO,UNIT_LEARNING_OUTCOME
0,554c86a41b5aac10653b206b274bcbf9,ACB1020,2021.04RO,Accounting in business,ACC IN BUS,6.0,Accredited,Faculty of Business and Economics,Department of Accounting,SCA Band 4,2021,Y,undergraduate,SCA Band 4,0.125,This unit introduces basic accounting concepts...,Minimum total expected workload to achieve the...,,,,,,,S1-01-PENINSULA-ON-CAMPUS Offered-Y,,1 - 50% APPLY_TO_ALL_OFFER - Y\n2 - 50% APPL...,,Dr Mahendra Goyal,Responsible teaching Department of Accounting ...,,,ACB1120 OR ACF1100 OR ACF1200 OR ACC1100 OR AC...,Prohibition: ACB1120 OR ACF1100 OR ACF1200 OR ...,,"On successful completion of this unit, you sho...",ULO1 - demonstrate an understanding of various...
2,e14cc6a41b5aac10653b206b274bcb3e,ACB1120,2021.04RO,Financial accounting 1,INTRO FIN ACC,6.0,Accredited,Faculty of Business and Economics,Department of Accounting,SCA Band 4,2021,Y,undergraduate,SCA Band 4,0.125,This unit provides you with an introduction to...,Minimum total expected workload to achieve the...,,,,,,,S1-01-PENINSULA-ON-CAMPUS Offered-Y,,1 - 50% APPLY_TO_ALL_OFFER - Y\n2 - 50% APPL...,,Dr Mahendra Goyal,Responsible teaching Department of Accounting ...,,,ACC1100 OR ACF1100 OR ACW1120,Prohibition: ACC1100 OR ACF1100 OR ACW1120,,"On successful completion of this unit, you sho...",ULO1 - identify and analyse measurement system...
5,0ded927adb5e68102bdd077cd39619c4,ACB2120,2021.05,Financial accounting 2,FIN ACCT 2,6.0,Accredited,Faculty of Business and Economics,Department of Accounting,SCA Band 4,2021,Y,undergraduate,SCA Band 4,0.125,This unit provides an overview of the current ...,Minimum total expected workload to achieve the...,,,,,,,S1-01-PENINSULA-ON-CAMPUS Offered-Y,�\n�,1 - 50% APPLY_TO_ALL_OFFER - Y\n2 - 50% APPL...,,Dr Lisa Powell,Responsible teaching Department of Accounting ...,ACB1120 OR ACC1100 OR ACF1100 OR ACW1120,,ACF2100 OR ACW2120 OR ACC2100,Prerequisite: ACB1120 OR ACC1100 OR ACF1100 OR...,,"On successful completion of this unit, you sho...","ULO1 - explain the content of, and regulatory ..."
6,7d4cc6a41b5aac10653b206b274bcb9a,ACB2220,2021.03RO,Management accounting 1,MGT ACCT 1,6.0,Accredited,Faculty of Business and Economics,Department of Accounting,SCA Band 4,2021,Y,undergraduate,SCA Band 4,0.125,Introduction to management accounting. Topics ...,Minimum total expected workload to achieve the...,,,,,,,S1-01-PENINSULA-ON-CAMPUS Offered-Y,,1 - 50% APPLY_TO_ALL_OFFER - Y\n2 - 50% APPL...,,Dr John Ko,Responsible teaching Department of Accounting ...,ACC1100 OR ACF1100 OR ACW1120 OR ACB1120,,ACC2200 OR ACF2200 OR ACW2220,Prerequisite: ACC1100 OR ACF1100 OR ACW1120 OR...,,"On successful completion of this unit, you sho...",ULO1 - describe cost behaviour under different...
7,b14cc6a41b5aac10653b206b274bcbbd,ACB2420,2021.04RO,Accounting information systems,ACC INFO SYS,6.0,Accredited,Faculty of Business and Economics,Department of Accounting,SCA Band 4,2021,Y,undergraduate,SCA Band 4,0.125,"The objective of this unit is two-fold. First,...",Minimum total expected workload to achieve the...,,,,,,,S2-01-PENINSULA-ON-CAMPUS Offered-Y,,1 - 50% APPLY_TO_ALL_OFFER - Y\n2 - 50% APPL...,,Dr Daisy Seng,Responsible teaching Department of Accounting ...,ACB1020 OR ACF1100 OR ACB1120 OR ACF1200 OR AC...,,ACF2400 OR ACW2420 OR ACC2400,Prerequisite: ACB1020 OR ACF1100 OR ACB1120 OR...,,"On successful completion of this unit, you sho...",ULO1 - examine the role of accounting informat...


### Get key columns

In [99]:
df_txt = df.loc[:, key_cols].copy()  # copy so later edits do not touch (or warn about) df

In [100]:
df_txt.isna().sum()

CL_UNIT_ID                 0
UNIT_CODE                  0
UNIT_TITLE                 0
HANDBOOK_SYNOPSIS         27
UNIT_LEARNING_OUTCOME    378
STUDY_LEVEL                2
OWNING_FACULTY             0
OWNING_ORG_UNIT            0
dtype: int64

In [101]:
df_txt = df_txt.fillna("")  # reassign rather than modify a slice of df in place

In [102]:
df_txt.isna().sum()

CL_UNIT_ID               0
UNIT_CODE                0
UNIT_TITLE               0
HANDBOOK_SYNOPSIS        0
UNIT_LEARNING_OUTCOME    0
STUDY_LEVEL              0
OWNING_FACULTY           0
OWNING_ORG_UNIT          0
dtype: int64

In [56]:
model_name = 'distilbert-base-nli-stsb-mean-tokens'
model = SentenceTransformer(model_name)  # load the DistilBERT model (smaller and faster than BERT)

100%|██████████| 245M/245M [00:34<00:00, 7.07MB/s]


In [107]:
def get_bert_embeddings(text, model):
    """Compute the sentence embedding (mean of context-dependent token embeddings) for a text.

    Returns a vector of length model.get_sentence_embedding_dimension()
    (768 for this model). Blank text maps to an all-zero vector.
    """
    if text.strip():
        return model.encode(text)
    return np.zeros(model.get_sentence_embedding_dimension())
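The `mean-tokens` suffix in the model name refers to mean pooling: the sentence vector is the average of the token vectors, with padding positions left out. The snippet below is a minimal NumPy sketch of that idea on toy numbers. It is not the library's internal code, and `mean_pool` is a hypothetical name.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, ignoring positions where the mask is 0 (padding)."""
    mask = np.asarray(attention_mask, dtype=float)[:, None]      # shape (tokens, 1)
    summed = (np.asarray(token_embeddings, dtype=float) * mask).sum(axis=0)
    count = max(mask.sum(), 1.0)                                 # avoid division by zero
    return summed / count

# Three token vectors of dimension 2; the last one is padding and must be ignored.
tokens = np.array([[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]])
mask = [1, 1, 0]

pooled = mean_pool(tokens, mask)
```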

In [78]:
df_txt.loc[0, 'HANDBOOK_SYNOPSIS'], df_txt.loc[0, 'UNIT_LEARNING_OUTCOME']

('This unit introduces basic accounting concepts to non-accountants. The information requirements of two main groups of information users are examined - external users such as current and potential investors and internal users such as managers. This unit provides an introduction to the structure, meaning, analysis and interpretation of financial statements, in addition to exploring financial issues confronting managers, such as cost and performance measurement and budgeting.',
 'ULO1 - demonstrate an understanding of various forms of business organisations\nULO2 - apply financial and management accounting principles in the preparation of financial statements\nULO3 - measure and interpret information relating to financial performance, financial position, liquidity and risk indicators of businesses\nULO4 - measure and interpret financial and non-financial information for managers to use in planning, decision making and control\nULO5 - develop the ability to work effectively in a team and

### Baseline

A few strategies:
- Combine the embeddings (mean/sum) of `HANDBOOK_SYNOPSIS` and `UNIT_LEARNING_OUTCOME`
- Keep them separate to provide two distinct "modes" for the visualization that users can select from
- Show users the top-K most similar units in terms of synopsis or outcomes
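The first and third strategies can be sketched with plain NumPy. The example below is illustrative only: `combine_mean` and `top_k_similar` are hypothetical helpers, cosine similarity is one possible choice of similarity measure (the notebook does not fix one), and the toy vectors stand in for real embeddings.

```python
import numpy as np

def combine_mean(a, b):
    """Element-wise mean of two embeddings of the same dimension."""
    return (np.asarray(a, dtype=float) + np.asarray(b, dtype=float)) / 2.0

def top_k_similar(embeddings, query_index, k=3):
    """Return (row index, cosine similarity) for the k rows most similar to the query row."""
    X = np.asarray(embeddings, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0               # all-zero rows (blank text) get similarity 0
    unit = X / norms
    sims = unit @ unit[query_index]
    sims[query_index] = -np.inf           # never return the query itself
    order = np.argsort(-sims)[:k]
    return [(int(i), float(sims[i])) for i in order]

# Toy "embeddings": row 1 is close to row 0, row 2 less so, row 3 is a blank-text zero vector.
toy = np.array([[1.0, 0.0], [0.9, 0.1], [0.5, 0.5], [0.0, 0.0]])

combined = combine_mean([1.0, 3.0], [3.0, 5.0])
neighbours = top_k_similar(toy, query_index=0, k=2)
```

With real data, the returned row indices would be mapped back to `UNIT_CODE` through the position of each row in `df_txt`.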

In [81]:
get_bert_embeddings(df_txt.loc[0, 'HANDBOOK_SYNOPSIS'], model).shape

(768,)

In [110]:
df_txt['embeddings'] = df_txt['HANDBOOK_SYNOPSIS'].apply(lambda x: get_bert_embeddings(x, model))
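Encoding one synopsis per `apply` call works, but passing a list of texts to `encode` lets the model process them in batches, which is usually faster. The following is a sketch under assumptions: `embed_texts` is a hypothetical helper, and `FakeModel` is a stand-in with the same two methods used in this notebook so that the example runs without downloading a model. With the real `SentenceTransformer`, `encode` is expected to return one row per input text.

```python
import numpy as np

class FakeModel:
    """Hypothetical stand-in exposing the two model methods used in this notebook."""

    def get_sentence_embedding_dimension(self):
        return 4

    def encode(self, sentences, batch_size=32):
        # Dummy embedding: every component is the text length.
        return np.array([[float(len(s))] * 4 for s in sentences])

def embed_texts(texts, model, batch_size=32):
    """Encode non-blank texts in one batched call; blank texts keep an all-zero row."""
    texts = list(texts)
    out = np.zeros((len(texts), model.get_sentence_embedding_dimension()))
    keep = [i for i, t in enumerate(texts) if t.strip()]
    if keep:
        out[keep] = model.encode([texts[i] for i in keep], batch_size=batch_size)
    return out

matrix = embed_texts(["abc", "", "hello"], FakeModel())
```

The resulting matrix has one row per input text in the original order, so it could be wrapped directly in `pd.DataFrame(matrix)` for export.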

In [126]:
df_embeddings = pd.DataFrame(df_txt["embeddings"].to_list())

In [127]:
df_embeddings

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,...,728,729,730,731,732,733,734,735,736,737,738,739,740,741,742,743,744,745,746,747,748,749,750,751,752,753,754,755,756,757,758,759,760,761,762,763,764,765,766,767
0,-0.155681,0.489606,-0.421600,-0.072159,-0.067547,0.274299,0.091877,-0.698768,-0.666343,-0.536237,0.291072,0.899175,-0.077659,-0.613539,0.343758,-0.670124,0.794246,-0.388422,0.641546,-0.456669,-0.119552,0.153932,0.466036,0.399631,-0.440171,-0.322608,0.073568,-0.297683,-0.059697,-1.148418,-0.071681,-0.730634,-0.144538,-0.316398,-0.140504,0.568030,0.627514,0.342635,-0.513060,0.573039,...,0.690227,0.179910,0.819151,-0.986035,-0.217857,0.704062,-0.557054,0.336979,0.073345,0.416182,0.169119,0.252819,-0.007765,0.867754,0.049940,0.622713,0.249839,0.490119,0.017129,0.472541,-0.081607,0.270919,-0.168286,0.216581,-0.160095,1.078925,0.140712,0.485378,-0.825715,-0.538115,-0.143329,-0.438859,-0.328311,0.467558,-0.623120,-0.232646,0.089066,0.373773,0.051465,0.133306
1,-0.131595,0.717932,0.103539,-0.352256,0.445363,0.574004,0.224938,-0.443664,0.156914,-0.621618,0.611865,0.595117,0.254392,-0.395437,0.282696,-0.721986,0.495153,-0.142491,0.425335,-0.847287,0.158907,-0.121272,-0.077500,0.322465,0.245117,0.059716,0.712812,0.279434,-0.062502,-0.594381,-0.161564,-0.912918,-0.276822,0.038880,0.144402,0.864494,0.575056,0.140458,-0.946865,0.566941,...,0.402116,0.236811,0.438252,-0.977628,-0.213731,1.024718,-0.511329,0.466471,0.013676,-0.338417,0.193394,-0.484011,-0.175995,1.359603,0.611458,-0.022046,0.306897,0.529887,-0.013254,0.200802,-0.388851,-0.016475,-0.287569,0.173027,0.000487,0.696104,0.154334,0.583121,-0.620627,-0.281843,-0.357997,-0.563931,-0.109533,0.522604,-0.882951,-0.611184,-0.418232,0.575920,-0.099176,0.193249
2,-0.088683,0.253432,-0.127657,0.622538,-0.325812,0.536132,0.729754,-0.175889,-0.649913,-0.419049,-0.334742,0.199344,0.924215,-0.092979,-0.252962,0.571395,0.298841,-0.037249,0.201663,-0.605967,-0.471863,-0.401052,0.603825,0.514850,1.017304,-0.253662,0.632144,1.323611,-0.037013,-0.371983,-0.155334,-0.673828,-0.292144,-0.339262,0.097976,0.239314,0.449920,0.534731,-1.278293,0.211376,...,0.944642,0.162069,0.349109,-1.309283,-0.536939,0.179826,0.029545,0.720614,0.231857,-0.428832,-0.070528,-0.317812,-0.905764,0.546344,0.519607,0.382819,-0.553630,0.094056,0.145244,0.248824,-0.619912,-0.656022,0.469421,-0.663645,-0.103491,0.573311,-0.210589,0.457057,-0.521667,-0.271598,-0.267051,-0.230766,-0.338813,1.029949,-0.810742,-0.761941,-0.479853,0.077615,0.325930,-0.238579
3,-0.087107,0.653962,-0.185668,-0.346532,0.271266,0.130659,0.110633,-0.481047,-0.453030,-0.704346,0.269104,0.765400,-0.131103,-0.477464,0.208585,-0.560699,0.167637,-0.375447,0.031503,-0.373973,0.472703,0.225207,0.008906,0.703815,-0.013692,0.098833,0.512104,0.422383,-0.036169,-0.866069,-0.052607,-1.309379,0.249422,-0.075759,-0.154234,0.623039,0.071653,0.113011,-1.043562,0.201496,...,0.398728,0.430202,0.548670,-1.208655,-0.139523,1.113375,0.092824,0.219635,0.400243,-0.533676,0.536090,0.274127,-0.216970,0.898456,0.493920,0.672126,0.379411,0.108186,-0.388592,0.055922,-0.419779,-0.068736,0.082193,-0.113954,-0.304840,0.383442,0.005209,0.868236,-0.421514,-0.385263,-0.639858,-0.303500,0.164779,0.129109,-0.683032,-0.802295,-0.375369,0.249854,-0.136509,0.575346
4,-0.532874,0.162291,0.077257,-0.460296,0.101121,-0.125485,0.050201,-0.356446,-0.598464,-1.237478,0.488874,0.532229,-0.091280,-0.702574,0.335763,-0.642425,0.524485,-0.317519,0.089133,-0.480850,0.275057,-0.027158,0.426448,0.599703,0.038396,-0.081411,0.517921,0.242776,0.089475,-0.650904,0.042535,-0.911383,-0.222596,-0.157123,0.055192,0.288849,-0.067637,0.022072,-0.545285,0.417791,...,0.471688,0.172623,0.781899,-1.341941,-0.104758,1.206891,0.046972,0.373366,0.314850,0.170564,0.918850,0.559489,0.105526,0.945548,0.284143,0.772073,-0.208892,-0.008863,-0.276363,0.641212,-0.104338,0.394144,-0.004151,0.318418,-0.228820,-0.008893,-0.172069,0.425589,-0.102984,-0.597771,-0.184812,0.271644,0.264752,0.756584,-0.709250,-0.324466,0.025467,0.286214,0.278130,0.348504
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5319,0.451990,0.309736,0.946544,-0.206571,0.681023,0.378996,-0.038571,-0.000072,0.655995,-0.964841,0.390931,0.427837,0.667702,0.414819,0.738071,0.121245,0.404929,-0.526483,0.035465,-0.451402,0.748603,0.089200,0.244289,-0.192313,-0.007623,0.010355,-0.410965,0.582181,0.374543,0.154543,-0.075573,-0.305851,-0.244675,-0.313534,0.712564,0.424006,0.066939,0.270249,-0.216470,-0.260179,...,-0.097501,-0.541598,-0.691183,0.168447,0.128817,0.054504,-0.304466,0.254807,-0.207594,-0.037088,0.645562,0.225860,-0.329473,-0.057693,-0.200952,-0.031406,-0.920568,-0.139816,-0.523350,0.152666,-0.156703,0.583586,-0.517861,-0.077945,-0.516586,0.060561,0.452929,0.301669,-0.312279,-0.229596,0.232006,-0.408977,1.130307,-0.103168,0.016254,0.068473,0.700358,-0.176143,-0.143601,-0.354070
5320,0.585972,0.173362,0.702247,-0.463834,0.153539,0.009890,0.227868,-0.234419,0.560748,-1.224319,0.144835,0.600135,-0.722453,0.121530,0.436783,-0.436609,0.606450,-0.665273,-0.198893,-0.112475,1.003078,-0.338065,0.272511,-0.149882,0.336205,-0.476018,0.249239,0.807162,0.559770,-0.366245,-0.008232,-0.407181,-0.751834,-0.307670,0.229383,0.456592,-0.296975,-0.666954,-0.185002,-0.008163,...,0.175887,-0.393938,-0.380653,-0.491284,-0.047709,0.790005,-0.328354,0.129119,0.147158,-0.297146,1.408487,1.146480,-0.480268,0.329741,-0.340197,0.102914,-0.628993,-0.177613,-0.229700,0.639053,-0.135556,0.188249,-0.606796,-0.030319,-0.084508,0.058909,0.372703,0.360030,-0.236066,-0.339238,0.539303,-0.025319,0.437043,0.552850,-0.608056,0.460748,0.483788,-0.002887,0.263095,-0.526901
5321,0.509944,0.360756,0.642134,-0.235867,0.226658,0.158867,0.078351,0.259400,0.661149,-1.074976,0.296680,0.349802,0.238783,0.234959,0.545547,-0.206755,0.409027,-0.542906,-0.268160,-0.356998,0.653871,-0.158315,0.363592,-0.331558,0.251463,-0.146694,-0.276316,1.053881,0.386767,-0.057406,0.122199,-0.221336,-1.042944,-0.179753,0.421663,0.134218,-0.187798,-0.215637,-0.209742,0.217943,...,0.187008,-0.446254,-0.205287,-0.063929,0.017698,0.453568,-0.369643,0.461921,0.099851,-0.214051,0.815553,0.502965,-0.166785,0.150800,-0.099230,0.097537,-0.828334,-0.243795,-0.681920,0.456389,-0.166575,0.612180,-0.733390,0.186237,-0.205241,-0.169358,0.269180,0.250561,0.064340,-0.307812,0.436055,-0.260180,0.900953,0.145090,-0.097312,0.219846,0.890203,-0.106119,0.259449,-0.282486
5322,0.336204,0.899426,0.348201,0.271562,0.286492,0.275031,0.735461,-0.927131,0.665656,0.378922,-0.679339,0.312554,0.841431,0.984633,0.486278,-0.408940,-0.393133,-0.200634,-1.030742,-0.326768,0.218838,0.184668,0.005407,0.533873,-0.221351,0.383131,0.110432,0.280357,0.391598,-0.385549,-0.061017,0.644500,0.262560,0.018769,0.283756,0.000466,-0.155474,0.748603,0.758571,-1.461854,...,-0.539716,-0.236504,-0.293125,-0.379360,0.749131,1.060159,-0.166750,-0.566639,0.426174,0.239423,0.401438,-0.529120,-0.554480,0.606987,-0.597182,0.044124,0.000223,0.086355,0.050554,-0.324323,-0.361047,0.234613,-0.796544,-0.002724,-0.371354,-0.085942,0.434012,0.725207,-0.713853,-0.432174,-0.456558,-0.038807,0.900083,-0.030827,-0.268565,0.252473,-0.052912,-0.434790,0.432867,-0.551641


In [140]:
df_meta = df_txt.loc[:, ['UNIT_TITLE', 'UNIT_CODE', 'HANDBOOK_SYNOPSIS', 'STUDY_LEVEL', 'OWNING_FACULTY', 'OWNING_ORG_UNIT']].copy()

In [141]:
df_meta.head()

Unnamed: 0,UNIT_TITLE,UNIT_CODE,HANDBOOK_SYNOPSIS,STUDY_LEVEL,OWNING_FACULTY,OWNING_ORG_UNIT
0,Accounting in business,ACB1020,This unit introduces basic accounting concepts...,undergraduate,Faculty of Business and Economics,Department of Accounting
2,Financial accounting 1,ACB1120,This unit provides you with an introduction to...,undergraduate,Faculty of Business and Economics,Department of Accounting
5,Financial accounting 2,ACB2120,This unit provides an overview of the current ...,undergraduate,Faculty of Business and Economics,Department of Accounting
6,Management accounting 1,ACB2220,Introduction to management accounting. Topics ...,undergraduate,Faculty of Business and Economics,Department of Accounting
7,Accounting information systems,ACB2420,"The objective of this unit is two-fold. First,...",undergraduate,Faculty of Business and Economics,Department of Accounting


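Remove newlines and tabs, since the metadata TSV needs exactly one row per unit and tabs act as column separators

In [145]:
# collapse any run of whitespace (newlines, tabs, repeated spaces) into a single space
df_meta['HANDBOOK_SYNOPSIS'] = df_meta['HANDBOOK_SYNOPSIS'].str.replace(r"\s+", " ", regex=True).str.strip()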

### Export results

In [111]:
df_txt.to_csv("unit_synopsis_embeddings.csv", index=False)

In [130]:
df_embeddings.to_csv("unit_synopsis_embeddings_only.csv", sep="\t", header=False, index=False)

In [146]:
df_meta.to_csv("unit_synopsis_metadata_only.csv", sep="\t", index=False)
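Before uploading to the Embedding Projector, it can help to confirm that the two exported files line up. The check below is a rough sketch: `check_projector_files` is a hypothetical helper, it is demonstrated on small temporary files rather than the real exports, and it assumes plain tab-separated text with a header row only when the metadata has more than one column. A field that still contains a line break would show up as a row-count mismatch.

```python
import tempfile
from pathlib import Path

def check_projector_files(vectors_path, metadata_path):
    """Compare row counts and vector dimensions of a vectors/metadata TSV pair."""
    vec_lines = Path(vectors_path).read_text(encoding="utf-8").splitlines()
    meta_lines = Path(metadata_path).read_text(encoding="utf-8").splitlines()
    dims = {len(line.split("\t")) for line in vec_lines}
    n_meta_cols = len(meta_lines[0].split("\t")) if meta_lines else 0
    header_rows = 1 if n_meta_cols > 1 else 0   # multi-column metadata has a header
    return {
        "n_vectors": len(vec_lines),
        "consistent_dims": len(dims) == 1,
        "rows_match": len(meta_lines) - header_rows == len(vec_lines),
    }

with tempfile.TemporaryDirectory() as tmp:
    vec = Path(tmp) / "vectors.tsv"
    meta = Path(tmp) / "metadata.tsv"
    vec.write_text("0.1\t0.2\n0.3\t0.4\n", encoding="utf-8")
    meta.write_text(
        "UNIT_CODE\tUNIT_TITLE\n"
        "ACB1020\tAccounting in business\n"
        "ACB1120\tFinancial accounting 1\n",
        encoding="utf-8",
    )
    report = check_projector_files(vec, meta)
```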
