<h1 align="center">fastText for text classification</h1>

***

In this notebook, we will train a fastText model for criteria sentence classification and evaluate its performance on the test data.

**What is fastText?**
fastText is a library for efficient learning of word representations and sentence classification.

* training data (22962 sentences), validation data (7682 sentences), test data (7697 sentences)
* 44 semantic categories

|#|group topics|semantic categories|
|---|---|---|
|1|`Health Status`|`Disease` `Symptom` `Sign` `Pregnancy-related Activity` `Neoplasm Status` `Non-Neoplasm Disease Stage` `Allergy Intolerance` `Organ or Tissue Status` `Life Expectancy` `Oral related`|
|2|`Treatment or Health Care`|`Pharmaceutical Substance or Drug` `Therapy or Surgery` `Device` `Nursing`|
|3|`Diagnostic or Lab Test`|`Diagnostic` `Laboratory Examinations` `Risk Assessment` `Receptor Status`|
|4|`Demographic Characteristics`|`Age` `Special Patient Characteristic` `Literacy` `Gender` `Education` `Address` `Ethnicity`|
|5|`Ethical Consideration`|`Consent` `Enrollment in other studies` `Researcher Decision` `Capacity` `Ethical Audit` `Compliance with Protocol`|
|6|`Lifestyle Choice`|`Addictive Behavior` `Bedtime` `Exercise` `Diet` `Alcohol Consumer` `Sexual related` `Smoking Status` `Blood Donation`|
|7|`Data or Patient Source`|`Encounter` `Disabilities` `Healthy` `Data Accessible`|
|8|`Other`|`Multiple`|

In [114]:
import os
import sys
import fasttext
import codecs
import jieba

<h2>Getting and preparing the data</h2>

***

Before training our first classifier, we need to prepare the training, validation, and test data. The validation data is used for hyperparameter autotuning, and the test data is used to evaluate how good the learned classifier is.

Each line of the text file contains a list of labels followed by the corresponding sentence. All labels start with the `__label__` prefix, which is how fastText distinguishes a label from a word. The model is then trained to predict the labels given the words in the document.
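
The line format described above can be sketched with two small helpers. This is only an illustration: `format_example` and `parse_example` are hypothetical names, not part of fastText, and the English tokens are placeholder data.

```python
LABEL_PREFIX = "__label__"

def format_example(labels, tokens):
    """Build one fastText training line from category names and word tokens."""
    # Spaces inside a category name would split it into several tokens,
    # so they are replaced with underscores (as the cells below do).
    tags = [LABEL_PREFIX + label.replace(" ", "_") for label in labels]
    return " ".join(tags + list(tokens))

def parse_example(line):
    """Split a fastText line back into (labels, tokens) using the prefix."""
    labels, tokens = [], []
    for item in line.split():
        if item.startswith(LABEL_PREFIX):
            labels.append(item[len(LABEL_PREFIX):])
        else:
            tokens.append(item)
    return labels, tokens

line = format_example(["Allergy Intolerance"], ["allergic", "to", "study", "drug"])
labels, tokens = parse_example(line)
print(line)
```

Note that the parsed label keeps its underscores; turning them back into spaces is a separate step.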

train data:

In [115]:
with open("criteria.train", "w", encoding="utf-8") as outf:
    with open("./data/train_data.txt", "r", encoding="utf-8") as inf:
        for line in inf:
            l = line.strip().split("\t")
            sentence = jieba.cut(l[2].strip().replace("\t", " ").replace("\n", " "))
            outf.write("__label__{} {}\n".format(l[1].replace(" ", "_"), " ".join(list(sentence))))

validation data:

In [118]:
with open("criteria.valid", "w", encoding="utf-8") as outf:
    with open("./data/validation_data.txt", "r", encoding="utf-8") as inf:
        for line in inf:
            l = line.strip().split("\t")
            sentence = jieba.cut(l[2].strip().replace("\t", " ").replace("\n", " "))
            outf.write("__label__{} {}\n".format(l[1].replace(" ", "_"), " ".join(list(sentence))))

test data:

In [117]:
with open("criteria.test", "w", encoding="utf-8") as outf:
    with open("./data/test_data.txt", "r", encoding="utf-8") as inf:
        for line in inf:
            l = line.strip().split("\t")
            sentence = jieba.cut(l[2].strip().replace("\t", " ").replace("\n", " "))
            outf.write("__label__{} {}\n".format(l[1].replace(" ", "_"), " ".join(list(sentence))))
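
The three conversion cells above differ only in their file names, so they could be folded into one helper. The sketch below is an assumption-laden illustration: `convert_file` is a hypothetical name, the tokenizer is passed in as a parameter (in this notebook it would be `jieba.cut`), and the demo uses a temporary toy file with whitespace tokenization so that it runs without the real data.

```python
import os
import tempfile

def convert_file(src_path, dst_path, tokenize):
    """Convert a tab-separated file to fastText format.

    Assumes column 1 holds the category and column 2 the sentence,
    matching the l[1] / l[2] indexing used in the cells above.
    Returns the number of lines written.
    """
    written = 0
    with open(src_path, "r", encoding="utf-8") as inf, \
         open(dst_path, "w", encoding="utf-8") as outf:
        for line in inf:
            cols = line.rstrip("\n").split("\t")
            if len(cols) < 3:
                continue  # skip malformed rows instead of raising IndexError
            label = cols[1].strip().replace(" ", "_")
            tokens = tokenize(cols[2].strip())
            outf.write("__label__{} {}\n".format(label, " ".join(tokens)))
            written += 1
    return written

# Toy demonstration with made-up data and a whitespace tokenizer.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "toy.txt")
    dst = os.path.join(tmp, "toy.ft")
    with open(src, "w", encoding="utf-8") as f:
        f.write("s1\tAllergy Intolerance\tallergic to study drug\n")
    written = convert_file(src, dst, str.split)
    with open(dst, encoding="utf-8") as f:
        converted = f.read()
```

With the real data, a call such as `convert_file("./data/train_data.txt", "criteria.train", jieba.cut)` would stand in for the training cell, assuming the rows follow the same three-column layout.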

<h2>fastText classifier</h2>

***

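Train with automatic hyperparameter optimization: passing `autotuneValidationFile` lets fastText search for hyperparameters that score best on the validation data.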

In [119]:
model = fasttext.train_supervised(input="criteria.train",autotuneValidationFile='criteria.valid')

In [120]:
model.test("criteria.test")

(7697, 0.8162920618422762, 0.8162920618422762)
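
The tuple returned by `model.test` holds the number of samples, the precision at 1, and the recall at 1. Because each test line here carries a single label and a single prediction is scored, the two values coincide and can be read as accuracy. A minimal sketch of unpacking the result follows; `summarize_test` is a hypothetical helper, and the input is the tuple printed above.

```python
def summarize_test(result):
    """Unpack a (samples, precision, recall) tuple and add the F1 score."""
    n, precision, recall = result
    total = precision + recall
    # Guard against division by zero when both scores are 0.
    f1 = 2 * precision * recall / total if total else 0.0
    return {"samples": n, "precision@1": precision, "recall@1": recall, "f1@1": f1}

summary = summarize_test((7697, 0.8162920618422762, 0.8162920618422762))
for name, value in summary.items():
    print(name, value)
```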

<h2>Save model and test data results</h2>

***

Call `save_model` to save the model to a file.

In [121]:
model.save_model("fastText_criteria.bin")

Load the model with the `load_model` function, predict a category for each test sentence, and write the predictions to a file.

In [122]:
test_data_file = "criteria.test"
test_results_save_file = "test_data_predict.txt"

criteria_ids, criteria_sentences = [], []
with open(test_data_file, "r", encoding="utf-8") as inf:
    c = 0
    for line in inf:
        c += 1
        l = line.strip().split(" ")
        criteria_ids.append("s{}".format(c))
        criteria_sentences.append(" ".join(l[1:]))
        
model = fasttext.load_model("fastText_criteria.bin")        
predicted = model.predict(criteria_sentences, k=1)

with codecs.open(test_results_save_file, "w", encoding="utf-8") as outf:
    for i in range(len(criteria_ids)):
        outf.write("{}\t{}\t{}\n".format(criteria_ids[i], predicted[0][i][0].replace("__label__", "").replace("_", " "), "".join(criteria_sentences[i].split())))



<h2>Evaluation</h2>

***

In [123]:
test_data_file = "../data/test_data.txt"
test_results_save_file = "test_data_predict.txt"
test_results_evaluation_save_file = "test_data_evaluation.txt"
os.system("python evaluation.py {} {} > {}".format(test_data_file, test_results_save_file, test_results_evaluation_save_file))

0

In [124]:
with open(test_results_evaluation_save_file, "r") as f:
    for line in f:
        print(line.strip("\n"))

**************************************** Evaluation results*****************************************
                                       Precision.       Recall.          f1.            
                 Addictive Behavior    0.9012           0.8295           0.8639         
                            Address    0.6154           0.6667           0.6400         
                                Age    0.9769           0.9705           0.9737         
                   Alcohol Consumer    0.6250           0.8333           0.7143         
                Allergy Intolerance    0.9355           0.9103           0.9227         
                            Bedtime    1.0000           0.5833           0.7368         
                     Blood Donation    0.8182           0.8182           0.8182         
                           Capacity    0.5574           0.6071           0.5812         
           Compliance with Protocol    0.7576           0.8333           0.7937         
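
The notebook does not show `evaluation.py`, so the following is only a guess at how per-category scores like those above might be computed, not the script's actual code. `per_class_prf` is a hypothetical function, and the gold and predicted labels are toy data.

```python
from collections import Counter

def per_class_prf(gold, pred):
    """Return {label: (precision, recall, f1)} for single-label predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label was wrong
            fn[g] += 1  # gold label was missed
    scores = {}
    for label in sorted(set(gold) | set(pred)):
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[label] = (precision, recall, f1)
    return scores

gold = ["Age", "Age", "Gender", "Consent"]
pred = ["Age", "Gender", "Gender", "Consent"]
scores = per_class_prf(gold, pred)
for label, (p, r, f) in scores.items():
    print("{:>10}  {:.4f}  {:.4f}  {:.4f}".format(label, p, r, f))
```

The same F1 formula reproduces the table: for `Addictive Behavior`, a precision of 0.9012 and a recall of 0.8295 give an F1 of about 0.8639.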
         

<h2>Predict new criteria sentences with the saved model</h2>

***

In [125]:
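# English glosses: "no restriction on gender", "age over 18 years", "the patient is allergic to the study drug"
examples = ["性别不限", "年龄大18岁，", "病人对研究药物过敏。"]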
print(examples)
sentences = [" ".join(list(jieba.cut(s.strip().replace("\t", " ").replace("\n", " ")))) for s in examples]
print(sentences)

model = fasttext.load_model("fastText_criteria.bin") 
results = model.predict(sentences, k=1)
print(results)

['性别不限', '年龄大18岁，', '病人对研究药物过敏。']
['性别 不 限', '年龄 大 18 岁 ，', '病人 对 研究 药物 过敏 。']
([['__label__Gender'], ['__label__Age'], ['__label__Allergy_Intolerance']], [array([1.00001], dtype=float32), array([1.00001], dtype=float32), array([1.0000099], dtype=float32)])
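
The result above is a pair of parallel lists: one list of labels per sentence and one array of probabilities per sentence. The reported probabilities can land slightly above 1 (for example 1.00001), so a reader-friendly view might strip the prefix and clip the value. The sketch below assumes `k=1`; `decode_predictions` is a hypothetical helper, and the toy input uses plain lists in place of NumPy arrays.

```python
def decode_predictions(result, prefix="__label__"):
    """Turn a (labels, probabilities) pair into (category, probability) tuples."""
    labels, probs = result
    decoded = []
    for label_row, prob_row in zip(labels, probs):
        label = label_row[0]
        if label.startswith(prefix):
            label = label[len(prefix):]
        # Underscores were introduced when writing the training files;
        # a category that really contained "_" would be altered here.
        name = label.replace("_", " ")
        prob = min(float(prob_row[0]), 1.0)  # clip values such as 1.00001
        decoded.append((name, prob))
    return decoded

toy_result = (
    [["__label__Gender"], ["__label__Allergy_Intolerance"]],
    [[1.00001], [0.75]],
)
decoded = decode_predictions(toy_result)
for name, prob in decoded:
    print(name, prob)
```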
