Welcome to the SkimLit App!



The aim of SkimLit is to make lengthy summaries skimmable. Though abstracts are already summaries of their source documents, they can still be quite hard to read. Thankfully, AI can help! The experiments closely follow the models attempted in the paper PubMed 200k RCT: a Dataset for Sequential Sentence Classification in Medical Abstracts, using the same data source.

The SkimLit app

  1. Set up the environment by typing the following commands in your terminal:
  • git clone https://github.com/tituslhy/Skimlit
  • pip install -r requirements.txt
  2. Run the backend (written with FastAPI) by typing the following commands in your terminal:
  • pip install "uvicorn[standard]"
  • uvicorn app:app --port 8000 --reload
  3. Run the frontend (written with Streamlit) by opening a new terminal instance and typing: streamlit run skimlit.py

This launches the application's user interface. Feel free to interact with it!
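
If you want to sanity-check the backend on its own before touching the UI, a request like the one below should do. Note that the /predict path and the JSON payload shape are assumptions for illustration, not the repo's documented API; check app.py for the real route and schema.

```python
# Minimal smoke test against the local FastAPI backend.
# ASSUMPTION: the endpoint path ("/predict") and payload key ("text") are
# illustrative placeholders -- see app.py for the actual route and schema.
import requests

response = requests.post(
    "http://localhost:8000/predict",
    json={"text": "Your hard-to-read abstract goes here."},
    timeout=30,
)
print(response.status_code, response.json())
```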


Users are encouraged to upload their unskimmable summaries to the text folder and click 'skim it'. This loads the model, which is then used to run inference on the submitted text.
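
For a sense of what running inference involves here, the submitted text first has to be split into sentences and annotated with positional features, since the final model consumes more than just the raw words. Below is a minimal sketch assuming a naive sentence splitter; the function name is hypothetical, and the repo's real preprocessing lives in utils/.

```python
# HYPOTHETICAL preprocessing sketch -- not the repo's actual code.
# Splits an abstract into sentences and derives the positional features
# the tribrid model consumes alongside word and character inputs.
def abstract_to_samples(abstract: str) -> list[dict]:
    sentences = [s.strip() for s in abstract.split(". ") if s.strip()]
    total_lines = len(sentences)
    return [
        {
            "text": sentence,             # word/sentence-level input
            "chars": " ".join(sentence),  # character-level input
            "line_number": i,             # sentence position in the abstract
            "total_lines": total_lines,   # abstract length
        }
        for i, sentence in enumerate(sentences)
    ]

samples = abstract_to_samples("We aimed to X. We did Y. We found Z.")
print(samples[1]["line_number"], samples[1]["total_lines"])  # 1 3
```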


A skimmable summary is then returned as output to the user.

The model

Unfortunately the model is too large to upload to GitHub - but do reach out to me if you would like the exact weights. The model architecture (specified in utils/utils.py under the 'build_model' function) is sketched below.
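
The following is a rough, self-contained approximation of the tribrid architecture, not the repo's build_model itself: the layer sizes, input shapes, and one-hot depths are assumptions, and a precomputed 512-d vector stands in for the Universal Sentence Encoder embeddings the repo pulls from TensorFlow Hub.

```python
import tensorflow as tf
from tensorflow.keras import layers

NUM_CLASSES = 5        # sentence roles, e.g. OBJECTIVE / METHODS / RESULTS ...
MAX_POSITION = 15      # assumed one-hot depth for sentence position
MAX_TOTAL_LINES = 20   # assumed one-hot depth for abstract length
CHAR_SEQ_LEN = 290     # assumed max characters per sentence
CHAR_VOCAB = 30        # ~28 characters plus [UNK] and padding

# 1. Token branch: a 512-d vector stands in for USE sentence embeddings.
token_in = layers.Input(shape=(512,), name="token_embedding")
token_out = layers.Dense(128, activation="relu")(token_in)

# 2. Character branch: learned character embeddings with a Conv1D on top.
char_in = layers.Input(shape=(CHAR_SEQ_LEN,), dtype="int32", name="char_ids")
char_emb = layers.Embedding(CHAR_VOCAB, 28)(char_in)
char_conv = layers.Conv1D(64, kernel_size=5, activation="relu")(char_emb)
char_out = layers.GlobalMaxPooling1D()(char_conv)

# 3. Positional branches: one-hot sentence position and total abstract length.
line_in = layers.Input(shape=(MAX_POSITION,), name="line_number_onehot")
line_out = layers.Dense(32, activation="relu")(line_in)
total_in = layers.Input(shape=(MAX_TOTAL_LINES,), name="total_lines_onehot")
total_out = layers.Dense(32, activation="relu")(total_in)

# Fuse token and character features first, then add the positional features.
token_char = layers.Dense(256, activation="relu")(
    layers.Concatenate()([token_out, char_out])
)
fused = layers.Concatenate()([line_out, total_out, token_char])
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(fused)

model = tf.keras.Model(
    inputs=[token_in, char_in, line_in, total_in],
    outputs=outputs,
    name="tribrid_sketch",
)
model.compile(loss="categorical_crossentropy", optimizer="adam",
              metrics=["accuracy"])
```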


All candidate models were trained for 3 epochs on only 10% of the training data to speed up experimentation. The best model from the experiments tabulated below was then trained on the full training data for 5 epochs. A summary of all experimented models and their validation accuracies follows:

| Experiment | Model | Validation accuracy | Findings |
| --- | --- | --- | --- |
| Naive Bayes TF-IDF classifier | The baseline model, serving as the benchmark for all other models experimented with. | 72.2% | The baseline model has a surprisingly good score! |
| Conv1D on word embeddings | Learns a 128-dimension embedding for each word in the vocabulary, with a Conv1D layer (n-gram of 5) on top. | 79.7% | The second-best performing model. Word embeddings are clearly very important in helping the model classify sentences in an abstract. |
| Universal Sentence Encoder (USE) with Conv1D layer | Uses pretrained USE embeddings from TensorFlow Hub with a Conv1D layer (n-gram of 5) on top; the embedding layer was frozen. | 71.2% | Performance was expectedly poorer because the frozen embedding layer leaves fewer parameters to train. |
| Conv1D on character embeddings | Uses character embeddings only: learns a 28-dimension embedding for each character, including [UNK]. | 65.2% | The worst performance, indicating either that a more sophisticated model is needed to learn character embeddings adequately, or that character embeddings are simply not ideal for this task. |
| USE sentence embeddings + Conv1D character embeddings | A hybrid of the previous two approaches. | 73.1% | Barely beats the baseline - likely because the learning of character embeddings pulled the validation accuracy down. |
| Tribrid model: USE embeddings, character embeddings, and sentence-position embeddings | Adds each sentence's position within its abstract and the abstract's total length as additional inputs to the USE word-char hybrid, so the model takes four tensors: words, characters, sentence position, and total lines in the abstract. | 83.0% (84.8% test accuracy after 5 epochs of training on the full data) | Shows that a sentence's position is very important to its classification. |
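
To make the experiment protocol concrete, here is how the quick 3-epoch runs could be kicked off against the sketch model defined above. The feature arrays below are random stand-ins with plausible shapes, not the repo's actual data pipeline.

```python
import numpy as np
import tensorflow as tf

# Random stand-in features shaped like the sketch model's four inputs;
# `model` refers to the tribrid sketch defined in the previous section.
n = 256
features = {
    "token_embedding": np.random.rand(n, 512).astype("float32"),
    "char_ids": np.random.randint(0, 30, size=(n, 290)),
    "line_number_onehot": tf.one_hot(np.random.randint(0, 15, size=n), 15).numpy(),
    "total_lines_onehot": tf.one_hot(np.random.randint(0, 20, size=n), 20).numpy(),
}
labels = tf.one_hot(np.random.randint(0, 5, size=n), 5).numpy()

# 3 quick epochs per experiment; the winning model was then retrained on
# the full training data for 5 epochs (not reproduced here).
model.fit(features, labels, validation_split=0.1, epochs=3)
```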
