## Module #5 Minimum Viable Product

### Mark Streer (DS/ML)

### Motivation

Globally, scientific literature is disseminated in English, yet the language is spoken by fewer than 20% of the world's population, and only 5% speak it natively. Researchers working in non-Anglophone countries normally enlist help from translators and/or editors to publish in their second language; however, language professionals are inconsistent in their diction and writing style, leaving their customers to question the value of their services. Could biomedical corpora be used as reference repositories of expected genre-specific language for native English (L1) translators to consult when translating, and for ESL (L2) authors to check and edit their work? In this project, I examine a collection of technical translations from Japanese to English by the same translator (myself), aiming to model topics in the corpus, analyze the most salient words for each topic, and test whether they can be traced back to consistent keywords in the Japanese.

### Methodology

The dataset consists of the English texts of my technical translations from Japanese to English in 2020-2021 (n=117). These documents are generally original research articles (RAs), or their abstracts, in IMRAD format. The Japanese authors range from graduate students writing their first scientific paper to professors and physicians writing their 20th; the sophistication of the lexis and syntax in their source texts is similarly diverse. Likewise, scientific corpora are normally drawn from a wide variety of sources and authors. Despite this variation, the rules applied to distill technical Japanese from diverse authors and domains should be relatively **consistent** within any given translator, corresponding to their personal translation style.

Topics are expected to correspond roughly to the domains of clinical research covered by the corpus: neuroscience, nursing, gerontology, pharmaceuticals, and civil engineering (if memory serves) should be among the best represented.

Preprocessing removed punctuation and short words (<4 characters), and the remaining tokens were lemmatized with NLTK's `WordNetLemmatizer`. Optionally, a list of the 1,000 most common English words was added to the stop-word list for vectorization. TF-IDF vectorization did not readily produce good results, so simple count vectors are used for now.
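The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration and not the exact code used for the project: the function names `preprocess` and `make_wordnet_lemmatizer` are hypothetical, lowercasing is an assumption (`CountVectorizer` lowercases by default anyway), and the lemmatizer is passed in as a plain callable so that the cleaning logic can be tried without NLTK installed.

```python
import string

# Map every ASCII punctuation character to a space so that
# "check/edit" becomes two tokens rather than one fused token.
_PUNCT_TO_SPACE = str.maketrans({ch: " " for ch in string.punctuation})


def preprocess(text, lemmatize=lambda word: word, min_len=4):
    """Strip punctuation, drop tokens shorter than min_len, then lemmatize.

    `lemmatize` is any callable mapping a word to its lemma; the default
    leaves words unchanged (hypothetical helper, not the project's code).
    """
    tokens = text.lower().translate(_PUNCT_TO_SPACE).split()
    kept = [tok for tok in tokens if len(tok) >= min_len]
    return " ".join(lemmatize(tok) for tok in kept)


def make_wordnet_lemmatizer():
    """Return NLTK's WordNet lemmatizer as a callable.

    Imported lazily because it needs nltk plus the 'wordnet' data,
    e.g. via nltk.download('wordnet').
    """
    from nltk.stem import WordNetLemmatizer

    return WordNetLemmatizer().lemmatize
```

With NLTK available, a document would be cleaned with `preprocess(doc, lemmatize=make_wordnet_lemmatizer())`. Note that `WordNetLemmatizer.lemmatize` treats every word as a noun unless a part-of-speech tag is supplied, so verb forms such as "assessed" pass through unchanged in this sketch.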

### Results

* Latent Dirichlet allocation (LDA) extracted the most sensible topics, at least to this human's eyes. This approach was selected for interpretability given the documents' length, which ranges from a paragraph to ten pages.
* The words in Topics 7 and 12 come from two case reports, a specific genre in medical writing focused on the history of one or two specific patients. Meanwhile, Topics 1-4 seem to correspond to community health papers, typically involving large cohorts and questionnaire surveys. It's possible that PC1, the horizontal axis of the pyLDAvis intertopic distance map, corresponds to some level of lexical complexity, as terminology in public health articles tends to be more familiar to a lay audience, while case reports lean heavily on medical jargon and anatomical terms.
* I interpret PC2, the vertical axis, as a measure of 'people-centrism' in the domain of study. Documents with a greater focus on healthcare services that require person-to-person interaction, such as rehabilitation and nursing, appear well represented at the top. Meanwhile, Topic 6 seems to correspond to a machine learning paper on imaging diagnosis, and Topic 11 to a review article on hallucinations in schizophrenia generally. These papers are more academic in nature and do not really involve interpersonal interactions in healthcare.
* Intertopic distance maps varied considerably with the hyperparameters `min_df`, `max_df`, and `n_components`. The values below produced a map that is somewhat interpretable to me, knowing the original documents. However, it's difficult to objectively justify the ones chosen.
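One possible way to make the choice of `n_components` less subjective is to compare held-out perplexity across candidate values. The sketch below was not part of this analysis: the helper name `heldout_perplexity`, the 80/20 split, and the candidate grid are all assumptions for illustration. The vectorizer settings are held fixed because changing `min_df` or `max_df` changes the vocabulary, which makes perplexities from different settings not directly comparable.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split


def heldout_perplexity(docs, n_components_grid, min_df=2, max_df=0.6, seed=0):
    """Fit one LDA model per topic count and score it on held-out documents.

    Returns a dict mapping n_components to perplexity (lower is better).
    Hypothetical helper; the split size and defaults are assumptions.
    """
    train_docs, test_docs = train_test_split(docs, test_size=0.2, random_state=seed)

    # Fit the vocabulary on the training documents only.
    vectorizer = CountVectorizer(stop_words="english", min_df=min_df, max_df=max_df)
    X_train = vectorizer.fit_transform(train_docs)
    X_test = vectorizer.transform(test_docs)

    scores = {}
    for k in n_components_grid:
        lda = LatentDirichletAllocation(n_components=k, random_state=seed)
        lda.fit(X_train)
        scores[k] = lda.perplexity(X_test)
    return scores
```

A call such as `heldout_perplexity(master_list, [6, 8, 10, 12, 14])` would return one score per topic count. With only 117 documents the estimates will be noisy, and low perplexity does not guarantee topics that a human reader finds interpretable, so this is at best a complement to inspecting the map by eye.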

```python
import pandas as pd
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

import pyLDAvis
import pyLDAvis.sklearn  # note: renamed to pyLDAvis.lda_model in pyLDAvis 3.4+

# 1. Import cleaned text data (doc_n = 117)
df = pd.read_pickle('master_list_20211108.pkl')
master_list = list(df.iloc[:, 0])  # first column is passed to the vectorizer as the document texts

# 2. Vectorize
tf_vectorizer = CountVectorizer(
    stop_words='english',
    min_df=2,    # don't want words occurring in only one document (e.g. proper nouns, drug names)
    max_df=0.6,  # don't want words that occur very frequently in clinical medicine (e.g. 'research', 'patients')
)
doc_word_tf = tf_vectorizer.fit_transform(master_list)

# 3. LDA
lda_tf = LatentDirichletAllocation(n_components=12, random_state=0)
lda_tf.fit(doc_word_tf)

# 4. Visualize the intertopic distance map
pyLDAvis.enable_notebook()
pyLDAvis.sklearn.prepare(lda_tf, doc_word_tf, tf_vectorizer)
```
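
To read off the most heavily weighted words per topic without the interactive view, the fitted model's topic-word matrix can be sorted directly. This is a minimal sketch: `top_words_per_topic` is a hypothetical helper, and it ranks words by raw topic weight, which is simpler than the relevance ranking pyLDAvis displays.

```python
import numpy as np


def top_words_per_topic(components, feature_names, n_top=10):
    """Return the n_top highest-weighted words for each topic.

    components: array of shape (n_topics, n_terms), e.g. lda.components_
    feature_names: vocabulary terms in column order
    Topics are numbered from 1 in the order they appear in `components`.
    """
    feature_names = np.asarray(feature_names)
    top = {}
    for topic_idx, weights in enumerate(components):
        order = np.argsort(weights)[::-1][:n_top]  # indices of largest weights first
        top[topic_idx + 1] = feature_names[order].tolist()
    return top
```

For the model above, this would be called as `top_words_per_topic(lda_tf.components_, tf_vectorizer.get_feature_names_out())`. One caveat: by default pyLDAvis renumbers topics by prevalence (`sort_topics=True`), so the topic numbers from this helper will generally not match those on the intertopic distance map unless `sort_topics=False` is passed to `prepare`.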