## Module #5 Minimum Viable Product

### Mark Streer (DS/ML)

### Motivation

Scientific literature is disseminated globally in English, yet fewer than 20% of the world's population speak the language, and only 5% speak it natively. Researchers in non-Anglophone countries typically enlist translators or editors in order to publish in their second language; however, language professionals are inconsistent in their diction and writing style, leaving their customers to question the value of their services. Could biomedical corpora serve as reference repositories of expected genre-specific language, which native English (L1) translators could consult when translating and ESL (L2) authors could use to check and edit their work? In this project, I examine a collection of technical translations from Japanese to English by a single translator (myself), aiming to model the topics in the corpus, analyze the most salient words for each topic, and see whether they can be traced back to consistent keywords in the Japanese source texts.

### Methodology

The dataset consists of the English texts of my technical translations from Japanese in 2020-2021. These documents are generally original research articles (RAs), or their abstracts, in IMRAD format (n=117). The Japanese authors range from graduate students writing their first scientific paper to professors and physicians writing their 20th; the sophistication of the lexis and syntax in their source texts is similarly diverse. Scientific corpora are likewise normally drawn from a wide variety of sources and authors. Despite this variation, the rules a translator applies to distill technical Japanese from diverse authors and domains should be relatively **consistent**, reflecting that translator's personal translation style.

Given the length of each document, latent Dirichlet allocation (LDA) is applied to sort the documents into topics. Topics are expected to correspond roughly to the domains of clinical research covered by the corpus: neuroscience, nursing, gerontology, pharmaceuticals, and civil engineering (if memory serves) should be among the best represented.

Preprocessing removed punctuation and short words (<4 characters); the remaining words were lemmatized with NLTK's WordNetLemmatizer. Optionally, a list of the 1000 most common English words was added to the stopword list for vectorization. TF-IDF vectorization did not readily produce good results, so simple count vectors are used for now.
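
The preprocessing steps described above can be sketched roughly as follows. This is a minimal illustration and not the exact pipeline used for the corpus: the function name `preprocess`, the `lemmatize` and `min_len` parameters, and the lowercasing step are all assumptions. The lemmatizer is passed in as a callable so the sketch runs without NLTK data installed.

```python
import string

# Translation table mapping every ASCII punctuation character to a space
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess(text, lemmatize=lambda word: word, min_len=4):
    """Strip punctuation, drop short words, then lemmatize what remains.

    `lemmatize` is any callable mapping a word to its lemma; the default
    leaves words unchanged (an assumption made to keep the sketch standalone).
    """
    tokens = text.translate(_PUNCT_TO_SPACE).lower().split()
    kept = [lemmatize(tok) for tok in tokens if len(tok) >= min_len]
    return " ".join(kept)


# Made-up example sentence, not taken from the corpus
print(preprocess("The rats were given 5 mg/kg of the drug."))
```

With NLTK installed and the WordNet data downloaded, `WordNetLemmatizer().lemmatize` can be passed as the `lemmatize` argument to approximate the lemmatization step.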

### Results

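```python
import pandas as pd
import pyLDAvis
import pyLDAvis.sklearn  # newer pyLDAvis releases (3.4+) expose this module as pyLDAvis.lda_model
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Load the documents (first column of the pickled DataFrame);
# the pickle file must be present in the working directory
df = pd.read_pickle('master_list_20211108.pkl')
master_list = df.iloc[:, 0].tolist()

# Term-frequency document-term matrix: drop English stop words and
# terms that appear in more than 30% of the documents
tf_vectorizer = CountVectorizer(stop_words='english', max_df=0.3)
doc_word_tf = tf_vectorizer.fit_transform(master_list)

# Fit LDA with 30 topics on the TF document-term matrix
lda_tf = LatentDirichletAllocation(n_components=30, random_state=0)
lda_tf.fit(doc_word_tf)

# Render the interactive topic visualization in the notebook
pyLDAvis.enable_notebook()
pyLDAvis.sklearn.prepare(lda_tf, doc_word_tf, tf_vectorizer)
```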
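
To inspect topics outside the interactive view, the highest-weighted words per topic can be listed directly from a topic-word matrix. The sketch below is a minimal illustration, assuming the matrix would come from `lda_tf.components_` and the vocabulary from `tf_vectorizer.get_feature_names_out()`; the function name `top_words` and the toy values are hypothetical. Ranking by raw weight is a simpler stand-in for the relevance ranking that pyLDAvis displays.

```python
def top_words(topic_word, vocab, n=10):
    """Return the n highest-weighted vocabulary terms for each topic."""
    result = []
    for weights in topic_word:
        # Indices of terms sorted by descending weight within this topic
        order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
        result.append([vocab[i] for i in order[:n]])
    return result


# Toy stand-in for a fitted topic-word matrix (2 topics x 4 terms); values are made up
toy_matrix = [
    [0.1, 5.0, 2.0, 0.3],
    [4.0, 0.2, 0.1, 3.0],
]
toy_vocab = ["neuron", "nursing", "patient", "synapse"]

for k, words in enumerate(top_words(toy_matrix, toy_vocab, n=2)):
    print(f"Topic {k}: {', '.join(words)}")
```

The function iterates over rows and indexes by position, so it accepts a NumPy array such as `components_` as well as plain nested lists.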

#### Classification performance

* Random forest models resulted in the best performance across the four model types tested, followed by K nearest neighbors (k=10), decision tree, and logistic regression.
* Strong bias is apparent towards the majority class (Southern) in the logistic regression and decision tree models. Southern-versus-all classification consistently earns the highest F1-score across all one-versus-all (OVA) models.

![](mvp_fig1.png)

![](mvp_fig2.png)

Further work will:
1. Re-run the analysis using 16 MFCCs, the conventional number in speech processing algorithms, instead of the librosa default of 20.
2. Split train/test datasets by user to ensure the model is not merely determining speaker similarity.
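
The user-level split in item 2 could be sketched as follows. This is a minimal illustration under stated assumptions: the function name `split_by_user`, its parameters, and the toy user labels are hypothetical, and scikit-learn's `GroupShuffleSplit` offers a library alternative for the same idea.

```python
import random


def split_by_user(user_ids, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) such that no user appears in both sets."""
    users = sorted(set(user_ids))
    rng = random.Random(seed)
    rng.shuffle(users)
    # Hold out whole users rather than individual samples
    n_test = max(1, round(len(users) * test_fraction))
    test_users = set(users[:n_test])
    train_idx = [i for i, u in enumerate(user_ids) if u not in test_users]
    test_idx = [i for i, u in enumerate(user_ids) if u in test_users]
    return train_idx, test_idx


# Hypothetical user label for each audio sample
sample_users = ["a", "a", "b", "c", "c", "c", "d", "e"]
train_idx, test_idx = split_by_user(sample_users, test_fraction=0.4)

train_users = {sample_users[i] for i in train_idx}
held_out_users = {sample_users[i] for i in test_idx}
print("Users shared between train and test:", train_users & held_out_users)
```

Because entire users are held out, test performance reflects generalization to unseen speakers rather than recognition of voices already present in the training set.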