# Final Project Summary
## Module #5 (Unsupervised Learning & Natural Language Processing)
## Mark Streer (DS/ML)

### Abstract

Scientific literature is disseminated globally in English, yet fewer than 20% of the world's population speak the language, and only 5% speak it natively. Researchers in non-Anglophone countries typically enlist translators and/or editors in order to publish in their second language; however, language professionals vary in their diction and writing style, leaving their customers to question the value of their services. Could English biomedical corpora serve as reference repositories of expected genre-specific language, for translators to consult when translating and for ESL authors to check and edit their work? In this project, I examined a collection of English technical translations from Japanese, all by the same translator (myself). The modeled topics corresponded to different domains of medicine and healthcare and were also well differentiated by clustering. Further work will apply the same pipeline to the Japanese source texts to see how the topics and clusters compare and diverge across the two languages.

### Design

The dataset analyzed consists of the English texts of my technical translations from Japanese to English in 2020-2021. These documents are generally original research articles (RAs) or abstracts in IMRAD format (n=110). The Japanese authors range from graduate students writing their first scientific paper, to professors and physicians writing their 20th; the sophistication of the lexis and syntax in their source texts is similarly diverse. Likewise, scientific corpora are normally drawn from a wide variety of sources and authors. Despite this variation, the rules applied to distill technical Japanese from diverse authors and domains should be relatively **consistent** within any given translator, corresponding to their personal translation style.

Preprocessing removed punctuation and short words (<4 characters), and the remaining tokens were lemmatized using NLTK's `WordNetLemmatizer`. Other considerations in preprocessing and model selection included:
* Given the size of the documents (~500-10,000 words), latent Dirichlet allocation (LDA) was expected to perform best at differentiating topics. However, non-negative matrix factorization (NMF) ultimately generated topics and topic-word loadings that were more interpretable.
* Since technical terminology and jargon were expected to play key roles in differentiating topics/domains in a corpus of technical documents, the 1,000 most common English words were added to the stopword list for vectorization.
* For the same reason, a TF-IDF vectorizer was preferred to a simple count vectorizer, to emphasize rare terms, which were likely to be jargon.

### Communication

#### NMF Topic Model (n=16)

![](images/NMF16_topics.png)

My personal familiarity with the documents in the corpus suggests that 16 topics is a reasonable number. The documents come from a wide variety of customers at different universities and in different domains of expertise; one would expect a random selection of documents from PubMed or ScienceDirect to be similarly diverse.

Several of these topics (namely surgery, COVID-19, cancer, neuroscience, and rehabilitation) persisted in an 8-topic model run for comparison. The newly created categories in the 8-topic model were also interpretable, each as a coalescence of several minor categories. Notable examples include:
* **Nursing**: Community health, nursing education, stroke
* **Biomolecular Therapeutics**: Stem cell therapeutics, clinical trials (i.e., pharmaceuticals)
* **Functional Health**: Community health, social welfare, physiology


#### NMF Topic Model (n=8)

![](images/NMF08_topics.png)

#### Clustering

K-means clustering (k=16) was applied to the document-topic matrix from the 16-topic NMF model. For each cluster, the document-topic vectors of its member documents were averaged, and individual documents were compared with the resulting cluster centroid. Surprisingly, despite the small corpus size, the clusters were populated with relevant documents. Notably, the mean document-topic vector usually had one dominant topic, with the next largest a distant second. This held true even for 'smaller' topics in the model (9-16) such as '13.Stroke' and '10.ObGyn', as shown below, which in my view supports the original decision to set n=16 topics.

![](images/clusterex1.png)

![](images/clusterex2.png)

![](images/clusterex3.png)
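
The clustering step described in this section can be sketched along the following lines. This is an illustrative reconstruction under stated assumptions, not the code used to produce the figures: it assumes scikit-learn's `KMeans` with its default Euclidean distance, the function name `summarize_clusters` and the keys of the returned dictionaries are hypothetical, and `doc_topic` is taken to be the document-topic matrix produced by the NMF model.

```python
import numpy as np
from sklearn.cluster import KMeans


def summarize_clusters(doc_topic, n_clusters, seed=0):
    """Cluster document-topic vectors and describe each cluster's mean vector."""
    doc_topic = np.asarray(doc_topic, dtype=float)
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = kmeans.fit_predict(doc_topic)

    summary = []
    for cluster in range(n_clusters):
        members = np.flatnonzero(labels == cluster)
        if members.size == 0:  # defensive: skip a cluster with no documents
            continue
        # Average the member vectors to get the cluster's mean topic profile.
        mean_vec = doc_topic[members].mean(axis=0)
        order = np.argsort(mean_vec)[::-1]  # topics by descending mean weight
        # Distance of each member document from the cluster mean.
        dists = np.linalg.norm(doc_topic[members] - mean_vec, axis=1)
        summary.append({
            "cluster": cluster,
            "doc_indices": members.tolist(),
            "dominant_topic": int(order[0]),
            "dominant_weight": float(mean_vec[order[0]]),
            "second_weight": float(mean_vec[order[1]]),
            "distances": dists.tolist(),
        })
    return labels, summary
```

Comparing `dominant_weight` with `second_weight` for each cluster is one simple way to check the pattern noted above, in which a single topic dominates the mean vector and the runner-up is a distant second.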