Radar Chart Visualization

Overview

Here's an illustration of how our process works.

*(Pipeline overview diagram)*

  1. Data Prep: See the Dataset section above.

  2. Topic Modelling: We want to determine which dimensions our radar chart should reflect, i.e. which classes we want for the model training process. We choose clustering algorithms to gain more insight into the structure of the data. We first apply BERT embeddings to the sentences, then apply the UMAP algorithm to reduce the dimensionality of the embeddings. Afterwards, we use HDBSCAN to perform clustering. We analyze the resulting clusters and summarize the topics.

    • Failed Attempt: We also tried popular topic modelling methods such as LDA, but they didn't yield relevant results due to the significant similarity between the sentences in our task.
  3. Multiclass Classification Model Development: After manually checking the clusters and choosing our topics, we manually label some data within the relevant clusters for training. To build the model, we apply BERT embeddings followed by a linear layer for the output. We also tried BERT embeddings with an LSTM model.

  4. Score Calculation: After model training, we apply normalization techniques and design a score calculation mechanism for later radar chart visualization.

  5. Web Visualization: We set up a small Python web server and display the final results using d3.js.

Methods

Topic Modelling

For this part, we ask which topics frequently appear in these sentences. We use pre-trained BERT embeddings, as they have shown exceptional results on various NLP tasks. Moreover, because BERT is pre-trained on a large corpus of data, we believe it provides a more accurate representation of words and sentences.

After applying the embeddings, each sentence becomes a 768-dimensional vector. However, most clustering algorithms work better on lower-dimensional data, so we use UMAP to reduce the dimensionality of the embeddings. Afterwards, we apply HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) for clustering.
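A minimal sketch of this embed-reduce-cluster step is shown below, assuming the `sentence-transformers`, `umap-learn`, and `hdbscan` packages; the model name and hyperparameters are illustrative, not our exact settings.

```python
# Sketch: BERT sentence embeddings -> UMAP -> HDBSCAN (illustrative parameters).
from sentence_transformers import SentenceTransformer
import hdbscan
import umap

def cluster_sentences(sentences):
    # 768-dimensional BERT embeddings, one vector per sentence.
    embedder = SentenceTransformer("bert-base-nli-mean-tokens")  # assumed model name
    embeddings = embedder.encode(sentences)

    # Reduce dimensionality so density-based clustering behaves better.
    reduced = umap.UMAP(n_neighbors=15, n_components=5,
                        metric="cosine", random_state=42).fit_transform(embeddings)

    # HDBSCAN assigns a cluster label per sentence; -1 marks noise.
    return hdbscan.HDBSCAN(min_cluster_size=15).fit_predict(reduced)
```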

After obtaining the clusters, we use class-based TF-IDF to extract keywords from each cluster. We also manually check sentences to observe which topics appear frequently. In addition, we reduce the number of topics by merging the most similar topic vectors.
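A rough sketch of class-based TF-IDF under the assumption that each cluster's sentences are concatenated into one "class document"; the helper name and IDF variant are illustrative.

```python
# Sketch: class-based TF-IDF keyword extraction per cluster (illustrative).
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def top_keywords_per_cluster(sentences, labels, n_words=10):
    # Concatenate all sentences of a cluster into a single "class document".
    clusters = sorted(set(labels) - {-1})  # skip HDBSCAN noise label -1
    docs = [" ".join(s for s, l in zip(sentences, labels) if l == c) for c in clusters]

    counts = CountVectorizer(stop_words="english").fit(docs)
    X = counts.transform(docs).toarray()                       # term counts per class
    tf = X / X.sum(axis=1, keepdims=True)                      # term frequency within each class
    idf = np.log(1 + X.shape[0] / (1 + (X > 0).sum(axis=0)))   # class-based IDF
    ctfidf = tf * idf

    words = np.array(counts.get_feature_names_out())
    return {c: words[np.argsort(ctfidf[i])[::-1][:n_words]].tolist()
            for i, c in enumerate(clusters)}
```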

Topics

We decide on six categories (5 topics + Other):

  1. (Others)
  2. Participation matters
  3. Have extra credit
  4. Engaging lecture
  5. Helpful office hour
  6. Heavy workload

Multiclass Classification

After deciding on the topics, we manually label those clusters to prepare data for training. We apply two different model architectures. The first model uses **BERT's pre-trained [CLS] token** with a linear layer, whereas the second model uses BERT's pre-trained embeddings of all words and connects them to an LSTM for training.
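A minimal sketch of the first architecture (BERT's [CLS] representation feeding a linear classification head), assuming PyTorch and Hugging Face `transformers`; the class count matches our six categories, but the model name and other details are illustrative.

```python
# Sketch: BERT [CLS] token + linear head for 6-way classification (illustrative).
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class BertClsClassifier(nn.Module):
    def __init__(self, num_classes=6, model_name="bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_classes)  # 768 -> 6

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]   # [CLS] token representation
        return self.classifier(cls)         # raw logits; softmax applied downstream

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertClsClassifier()
batch = tokenizer(["Super interesting lectures!"], return_tensors="pt",
                  padding=True, truncation=True)
logits = model(batch["input_ids"], batch["attention_mask"])
probs = torch.softmax(logits, dim=-1)       # per-class probabilities
```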

Sample Training Sentences

  1. "You should be fine as long as you go over the lectures!" - Participation matters
  2. "He also give big curves and extra credit at the end." - Have extra credit
  3. "Super interesting lectures and very friendly and hilarious!" - Engaging Lecture
  4. "He has very generous office hours and is willing to help you if you have any questions." - Helpful office hour
  5. "Expect lots of work - written quizzes twice a week, oral quiz once a week, and graded homework three times a week." - Heavy workload
  6. "Doesn't reply to emails." - Other
  7. "Also, he gives 5 pop-quizzes so beware." - Other
  8. "Run, don't walk." - Other

Prevent Overfitting

To prevent overfitting, we first reduce the network's capacity by removing layers or reducing the number of units in the hidden layers. We also use dropout layers, which randomly remove certain features by setting them to zero. Finally, we apply regularization, which adds a cost to the loss function for large weights.
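In PyTorch terms, the dropout and weight penalty might look something like the following sketch; the dropout rate, learning rate, and weight-decay value are illustrative rather than our exact settings.

```python
# Sketch: dropout in the head and L2 regularization via weight decay (illustrative values).
import torch
import torch.nn as nn

head = nn.Sequential(
    nn.Dropout(p=0.3),   # randomly zero out features during training
    nn.Linear(768, 6),   # small head = reduced capacity
)
optimizer = torch.optim.AdamW(head.parameters(), lr=2e-5, weight_decay=0.01)  # L2 penalty on weights
```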

Metrics & Results

We evaluate both models using the training loss vs. validation loss curve and per-class accuracy. The overall accuracy for both models is around 86%. Check out both notebooks for more detailed information.
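A small sketch of how overall and per-class accuracy can be computed from predicted and gold labels; this is an assumed helper, not the notebooks' exact evaluation code.

```python
# Sketch: overall and per-class accuracy from predicted vs. true labels (illustrative).
import numpy as np

def class_accuracies(y_true, y_pred, num_classes=6):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    overall = float((y_true == y_pred).mean())
    per_class = {}
    for c in range(num_classes):
        mask = y_true == c
        per_class[c] = float((y_pred[mask] == c).mean()) if mask.any() else None
    return overall, per_class
```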

Score Calculation & Normalization

To normalize the score calculation process later in the web server, we first select a random sample of professors. We feed in the comments, run the model, and obtain each professor's raw score by averaging every sentence's softmax/probability output. Here is the distribution of each category from our sample data:

*(Per-category raw score distributions from the sample data)*

We then normalize the score to represent where it lies within the sample distribution, i.e. a normalized score of 80% means the raw score is higher than 80% of the sampled raw scores.
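A sketch of that percentile-style normalization, assuming `sample_scores` holds the raw scores of the random professor sample for one category; the function name is illustrative.

```python
# Sketch: normalize a raw score to its percentile within the sample distribution (illustrative).
import numpy as np

def normalize_score(raw_score, sample_scores):
    sample_scores = np.asarray(sample_scores)
    # Fraction of sampled professors whose raw score is below this one,
    # e.g. 0.8 means better than 80% of the sample.
    return float((sample_scores < raw_score).mean())
```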

Visualization & Web Server

The web server and visualization code are in the folder ( link here [Todo @Ao]). We use d3.js for visualization and Flask to set up a simple web server. The current site supports comparing up to 3 professors. The flow works like this (a minimal sketch follows the list):

  • E.g. visit URL http://127.0.0.1:5000/<prof_id>/<prof2_id>
  • Scrape data from the Rate My Professors site
  • Feed comments into the model
  • Calculate score (normalized)
  • Display Visualization
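The sketch below ties that flow together in Flask; `scrape_comments`, `run_model`, and `normalize_scores` are hypothetical stand-ins for the project's real modules, and the route returns JSON instead of the d3.js page.

```python
# Sketch: Flask route tying scraping, scoring, and visualization together.
# scrape_comments / run_model / normalize_scores are hypothetical placeholders.
from flask import Flask, jsonify

app = Flask(__name__)

def scrape_comments(prof_id):   # placeholder: fetch Rate My Professors comments
    return ["Super interesting lectures!"]

def run_model(comments):        # placeholder: per-sentence softmax outputs (6 classes)
    return [[0.1, 0.2, 0.2, 0.3, 0.1, 0.1] for _ in comments]

def normalize_scores(probs):    # placeholder: average per class, then percentile-normalize
    return [sum(col) / len(probs) for col in zip(*probs)]

@app.route("/<prof_id>/<prof2_id>")
def compare(prof_id, prof2_id):
    scores = {pid: normalize_scores(run_model(scrape_comments(pid)))
              for pid in (prof_id, prof2_id)}
    # In the real app, a template renders the d3.js radar chart from these scores.
    return jsonify(scores)

if __name__ == "__main__":
    app.run(debug=True)
```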

How to run [Todo @Ao]

Future work

[todo]