# EIOS machine learning to detect disease article signals for WHO-AFRO
## Performed by Dr. Scott Pezanowski
### 2023-05-01 to 2023-07-31

## Prerequisites

1. Install TensorFlow and AutoKeras by following the instructions at https://autokeras.com/install/ .
2. Run the commands below to install the other Python libraries.

In [None]:
!python3 -m pip install striprtf
!python3 -m pip install scikit-learn
!python3 -m pip install regex
!python3 -m pip install --user -U nltk

# Create the training data

* The code below assumes that you exported from EIOS roughly equal numbers of articles previously labeled as signals and articles labeled as not signals.
* The more representative your articles are of all types of articles you are looking for, the better.
* Articles will ideally come from different time periods, sources, be about different diseases, etc.

* Read in the data from the Doc files.
* Note: Despite the .doc extension, the files are not in MS Doc format. They are in RTF, so we need the striprtf Python library to convert RTF to plain text.
* The code assumes that, for each set of documents, the signal articles are in a directory named "signal" and the articles that are not signals are in a directory named "not_signal".
* Here, I also set the variable max_chars, which is used later to truncate the full text to the first 4,000 characters for model training. Longer text causes memory issues unless you have a lot of GPU memory.
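The directory convention and the two length thresholds can be sketched in isolation. This is a minimal illustration, not the notebook's parser: `label_from_path` and `truncate_text` are hypothetical helpers, the sample paths are made up, and returning `None` for very short text is only one assumed way of discarding it.

```python
from pathlib import PurePosixPath
from typing import Optional

MIN_CHARS = 40    # shorter texts are treated as not useful
MAX_CHARS = 4000  # longer texts are truncated to limit memory use


def label_from_path(path: str) -> str:
    """Derive the class label from the directory names in the path."""
    parts = PurePosixPath(path).parts
    # Compare whole directory names so "not_signal" is never confused with "signal".
    return 'not_signal' if 'not_signal' in parts else 'signal'


def truncate_text(text: str) -> Optional[str]:
    """Return the first MAX_CHARS characters, or None if the text is too short."""
    if len(text) <= MIN_CHARS:
        return None
    return text[:MAX_CHARS]


print(label_from_path('data/set1/not_signal/export.doc'))
print(label_from_path('data/set1/signal/export.doc'))
print(len(truncate_text('x' * 10000)))
```

Matching on whole directory names rather than substrings keeps the labeling robust if a parent folder happens to contain the word "signal".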

In [None]:
import warnings
warnings.filterwarnings('ignore')
import glob
from datetime import datetime
from striprtf.striprtf import rtf_to_text
import pandas as pd

# Below is the primary path to your training data. You should have "signal" and "not_signal" subdirectories for your respective articles.
data_path = '/mnt/h/My Drive/work/BrightWorldLabs/coreWork/consulting/2023/WHO-AFRO/data/signalprediction'
dir_signal = 'signal'
dir_not_signal = 'not_signal'
min_chars = 40  # if the article is less than this number of characters, I will remove it below because it is likely not valuable and not a signal.
max_chars = 4000  # used to truncate the article text because longer text causes computer memory problems and usually the first 4000 characters is enough to judge if it is a signal.
max_words = 400  # used later in remove_stop_words to cap the number of words kept per article.

# fields to parse. Some of these fields might be valuable to the model in the future. However, I only used full_text and translated_full_text.
label_title = 'Title: '
label_about_country = 'About country: '
label_source = 'Source: '
label_retrieval_date = 'Retrieval date: '
label_source_type = 'Source type: '
label_from_country = 'From country: '
label_language = 'Language: '
label_summary = 'Summary: '
label_full_text = 'Full text: '
label_translated_text = 'Translated text: '
article_break = '(out of '

articles = []
article_columns = ['title', 'about_country', 'source', 'retrieval_date', 'source_type', 'from_country', 'language', 'summary', 'full_text', 'translated_text', 'signal_type']
# loop through all articles and parse them line by line using their labels to place them in the appropriate variables.
for f in glob.glob(data_path + '/**/' + '*.doc', recursive=True):
    with open(f, 'r', encoding="utf-8", errors='ignore') as file:
        rtf = file.read()
        text = rtf_to_text(rtf)  # converts RTF format to plain text.
        lines = text.split('\n')
        title = ''
        about_country = ''
        source = ''
        retrieval_date = ''
        source_type = ''
        from_country = ''
        language = ''
        summary = ''
        full_text = ''
        translated_text = ''
        for line in lines:
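            # a new article marker closes the previous article. The first marker has no previous article, so title is still empty.
            # Checking title avoids the substring test '1(out of ', which also matches markers 11, 21, 31, and so on.
            if article_break in line and title:
                # the next three lines create the article label for machine learning. signal/not_signal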
                signal_type = 'signal'
                if dir_not_signal in f:
                    signal_type = 'not_signal'
                article_arr = [
                    title,
                    about_country,
                    source,
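                    # epoch seconds as a string; '%s' is platform-specific, and an unparsed date is still ''.
                    str(int(retrieval_date.timestamp())) if isinstance(retrieval_date, datetime) else '',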
                    source_type,
                    from_country,
                    language,
                    summary,
                    full_text,
                    translated_text,
                    signal_type
                ]
                articles.append(article_arr)
                title = ''
                about_country = ''
                source = ''
                retrieval_date = ''
                source_type = ''
                from_country = ''
                language = ''
                summary = ''
                full_text = ''
                translated_text = ''
            if label_title in line:
                title = line.replace(label_title, '')
            elif label_about_country in line:
                tmp_arr = line.split('|')
                about_country = tmp_arr[0].replace(label_about_country, '').strip()
                source = tmp_arr[1].replace(label_source, '').strip()
            elif label_retrieval_date in line:
                retrieval_date = line.replace(label_retrieval_date, '').split('|')[0]
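                try:
                    # the parsed date string carries no year, so 2023 is appended before parsing.
                    retrieval_date = datetime.strptime(retrieval_date + ' 2023', '%d %b %H:%M %Y')
                except ValueError:
                    # fall back to the current time when the date cannot be parsed.
                    retrieval_date = datetime.now()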
            elif label_source_type in line:
                tmp_arr = line.split('|')
                source_type = tmp_arr[0].replace(label_source_type, '').strip()
                from_country = tmp_arr[1].replace(label_from_country, '').strip()
                language = tmp_arr[2].replace(label_language, '').strip()  # strip so the later comparison with 'English' is exact.
            elif label_summary in line:
                summary = line.replace(label_summary, '')
            elif label_full_text in line:
                tmp_arr = line.split('|')
                full_text = tmp_arr[0].replace(label_full_text, '')
                if len(full_text) > min_chars:
                    full_text = full_text[:max_chars].encode(
                        'utf-8', errors='ignore'
                        ).decode('utf-8')
            elif label_translated_text in line:
                tmp_arr = line.split('|')
                translated_text = tmp_arr[0].replace(label_translated_text, '')
                if len(translated_text) > min_chars:
                    translated_text = translated_text[:max_chars].encode(
                        'utf-8', errors='ignore'
                        ).decode('utf-8')
        if len(title) > 0:
            signal_type = 'signal'
            if dir_not_signal in f:
                signal_type = 'not_signal'
            article_arr = [
                title,
                about_country,
                source,
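                # epoch seconds as a string; '%s' is platform-specific, and an unparsed date is still ''.
                str(int(retrieval_date.timestamp())) if isinstance(retrieval_date, datetime) else '',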
                source_type,
                from_country,
                language,
                summary,
                full_text,
                translated_text,
                signal_type
            ]
            articles.append(article_arr)

df = pd.DataFrame(articles, columns=article_columns)
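The retrieval date handling can be sketched as a small standalone helper. This is a minimal sketch under stated assumptions: `retrieval_epoch` is a hypothetical name, the input layout is the 'DD Mon HH:MM' pattern implied by the format string above, the year defaults to 2023, and the time is treated as UTC so the result does not depend on the machine's time zone (the notebook itself uses local time).

```python
from datetime import datetime, timezone
from typing import Optional


def retrieval_epoch(date_text: str, year: int = 2023) -> Optional[int]:
    """Parse 'DD Mon HH:MM' and return epoch seconds, or None if parsing fails."""
    try:
        parsed = datetime.strptime(f'{date_text.strip()} {year}', '%d %b %H:%M %Y')
    except ValueError:
        return None
    # Assumption: interpret the time as UTC for a machine-independent result.
    return int(parsed.replace(tzinfo=timezone.utc).timestamp())


print(retrieval_epoch('01 May 12:30'))
print(retrieval_epoch('not a date'))
```

Returning `None` for unparseable dates, instead of substituting the current time, keeps missing values visible in the resulting table.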

In [None]:
df.drop_duplicates(subset=['full_text'], inplace=True)

In [None]:
# use the original full_text for English articles. For all other languages, use the translated_text.
def set_final_text(row):
    if row['language'] == 'English':
        return row['full_text']
    else:
        return row['translated_text']


df['final_text'] = df.apply(set_final_text, axis=1)
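A slightly more defensive variant of this choice can be sketched as a plain function. `choose_text` is a hypothetical helper, not part of the notebook; it assumes the language field may carry stray whitespace and that an English article with empty full text should fall back to the translation.

```python
def choose_text(language: str, full_text: str, translated_text: str) -> str:
    """Prefer English full text; otherwise use the translation (possibly empty)."""
    if language.strip() == 'English' and full_text:
        return full_text
    # Non-English articles, or English articles without full text, use the translation.
    return translated_text


print(choose_text(' English ', 'Cholera cases reported.', ''))
print(choose_text('French', 'Cas de choléra signalés.', 'Cholera cases reported.'))
```

Rows for which this returns an empty string would then be dropped by a filter on empty text, as in the downsampling step.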

In [None]:
df

## Downsample the not_signal class to match the number in the signal class.

* Typically for binary classification, it is best to have equal amounts for each class.
* Since I had more not_signal articles, I downsampled the not_signal class, which means randomly removing articles until the two class counts match.

In [None]:
df_final = df[df['final_text'] != ''] # drop those that are not in English and do not have a translation.
df_signal = df_final[df_final['signal_type'] == 'signal']
df_not_signal = df_final[df_final['signal_type'] == 'not_signal']

In [None]:
print(len(df_signal))
print(len(df_not_signal))

from sklearn.utils import resample


# replace=False samples without replacement, so no article is duplicated when downsampling.
df_not_signal = resample(df_not_signal,
                         replace=False,
                         n_samples=len(df_signal),
                         random_state=42)
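The downsampling idea can also be sketched with the standard library alone. This is a minimal illustration with made-up article names; `downsample` is a hypothetical helper and is not the scikit-learn function used in the notebook.

```python
import random


def downsample(majority: list, minority: list, seed: int = 42) -> list:
    """Randomly keep len(minority) items from majority, without replacement."""
    rng = random.Random(seed)
    return rng.sample(majority, k=len(minority))


signal = [f'signal_{i}' for i in range(3)]
not_signal = [f'not_signal_{i}' for i in range(10)]
balanced_not_signal = downsample(not_signal, signal)
print(len(signal), len(balanced_not_signal))
```

Sampling without replacement matters here: sampling with replacement could pick the same article more than once, which would put duplicates into the training data.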

In [None]:
print(len(df_signal))
print(len(df_not_signal))

# Train the model using AutoKeras TextClassifier

We will classify the documents as either signals or not signals. <br>
Import necessary libraries for machine learning.

In [None]:
import os

import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split

import autokeras as ak

Split the data for training and testing. By default, train_test_split holds out 25% of the rows for testing (a 75/25 split); pass test_size=0.2 for an 80/20 split.

In [None]:
train_data, test_data = train_test_split(pd.concat([df_signal, df_not_signal]))

## Data Preprocessing

The NLTK resources downloaded below provide the Punkt sentence tokenizer and the stop word lists.

In [None]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')

My preprocessing steps include:

* Removing characters that cannot be encoded as UTF-8.
* Removing stop words.
* Removing punctuation.
* Setting the label for signals to 1 (positive) and for not_signals to 0 (negative).
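The steps above can be sketched without NLTK. This is a simplified illustration, not the notebook's implementation: `STOP_WORDS` is a small made-up subset of stop words, tokens are split on whitespace instead of using NLTK's tokenizers, and punctuation is removed before stop words for simplicity.

```python
import string

# Illustrative stop word subset; the notebook uses NLTK's full English list.
STOP_WORDS = {'the', 'a', 'an', 'of', 'in', 'is', 'and', 'to'}
PUNCTUATION = string.punctuation + '’‘”“'
LABELS = {'signal': 1, 'not_signal': 0}


def preprocess(text: str) -> str:
    """Drop unencodable characters, punctuation, and stop words."""
    text = text.encode('utf-8', errors='ignore').decode('utf-8')
    # str.maketrans with a third argument deletes every listed character.
    text = text.translate(str.maketrans('', '', PUNCTUATION))
    return ' '.join(w for w in text.split() if w.lower() not in STOP_WORDS)


print(preprocess('The outbreak of cholera in “Region X” is spreading.'))
print(LABELS['signal'], LABELS['not_signal'])
```

The first `print` shows `outbreak cholera Region X spreading`, which is the kind of condensed text the classifier receives.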

In [None]:
import string


s_words = nltk.corpus.stopwords.words("english")
punc = string.punctuation
punc += '’' + '‘' + '”' + '“'

def remove_non_utf(txt):
    return txt.encode(
        'utf-8', errors='ignore'
    ).decode('utf-8')


def create_label(lbl):
    if lbl == 'signal':
        return 1
    if lbl == 'not_signal':
        return 0
    

def remove_stop_words(txt):
    new_txt = ''
    word_len = 0
    for sent in nltk.sent_tokenize(txt):
        text_tokens = nltk.word_tokenize(sent)
        # NLTK's stop word list is lowercase, so compare the lowercased token.
        tokens_without_sw = [word for word in text_tokens if word.lower() not in s_words]
        new_txt += ' '.join(tokens_without_sw) + '. '
        word_len += len(tokens_without_sw)
        if word_len >= max_words:
            break
    return new_txt

def remove_punctuation(txt):
    return "".join([w for w in txt if w not in punc])

# use final_text (English full text, or the translation for other languages) so the model only sees English text.
x_train = train_data['final_text']
x_train = x_train.apply(remove_non_utf)
x_train = x_train.apply(remove_stop_words)
x_train = x_train.apply(remove_punctuation)

y_train = train_data[article_columns[-1]].apply(create_label)
x_test = test_data['final_text']
x_test = x_test.apply(remove_non_utf)
x_test = x_test.apply(remove_stop_words)
x_test = x_test.apply(remove_punctuation)
y_test = test_data[article_columns[-1]].apply(create_label)

x_train = np.array(x_train)
y_train = np.array(y_train)
x_test = np.array(x_test)
y_test = np.array(y_test)

print(x_train.shape)  # (number of training articles,)
print(y_train.shape)  # (number of training articles,)
print(x_train[0])
print(y_train[0])

## Load the data into AutoKeras

* Use AutoKeras's TextClassifier to fit the model with 5 epochs.
* With AutoML, the typical strategy is to first train with a small number of epochs allowing time for it to try lots of different model types.
* Plus, it usually does not take long to determine which model is best.
* Once we find the best model for the data, we train that model longer farther down for better results.
* I have an NVIDIA Titan RTX GPU, and I estimate that the fit to find the best model took about 10 minutes.
* Farther down, training the best model for longer took about 10 minutes.
* The training time depends on your computer hardware and the amount of training data.

In [None]:
# Initialize the text classifier.
clf = ak.TextClassifier(
    overwrite=True, max_trials=4
)  # It only tries 4 models as a quick demo. You can set max_trials higher to try many different models.
# Feed the text classifier with training data.
clf.fit(x_train, y_train, epochs=5)  # epochs=10
# Predict with the best model.
predicted_y = clf.predict(x_test)
# Evaluate the best model with testing data.
print(clf.evaluate(x_test, y_test))

# Export the best performing model and try to train longer to obtain better results

* In the step above, we found the best model.
* Now, export that model to a standard Keras model.
* Train the best model longer.

In [None]:
best_model = clf.export_model()
print(best_model.summary())

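I used EarlyStopping to stop training once the training loss has not improved for five consecutive epochs. Because the callback monitors the training loss rather than a validation loss, it mainly avoids wasted epochs; monitoring a validation loss would be a stronger guard against overfitting.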

In [None]:
callback = tf.keras.callbacks.EarlyStopping(monitor='loss', patience=5)
best_model.fit(x_train, y_train, epochs=100, batch_size=16, callbacks=[callback])
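The patience rule used by the callback above can be illustrated without TensorFlow. This is a simplified sketch, not the Keras implementation: `stopping_epoch` is a hypothetical helper, any strictly lower loss counts as an improvement, and the loss values are made up.

```python
from typing import Optional, Sequence


def stopping_epoch(losses: Sequence[float], patience: int = 5) -> Optional[int]:
    """Return the 0-based epoch at which training would stop, or None if it never stops."""
    best = float('inf')
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(losses):
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


history = [1.0, 0.8, 0.7, 0.71, 0.72, 0.70, 0.73, 0.74]
print(stopping_epoch(history, patience=5))
```

In this example the best loss (0.7) is reached at epoch 2, and training stops at epoch 7 after five epochs without a strictly lower value.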

Evaluate the best model.

In [None]:
print(best_model.evaluate(x_test, y_test))

Get the actual probability predictions so that I can use them to sort the results.

In [None]:
y_pred = best_model.predict(x_test)

In [None]:
best_model.metrics_names

Sort the results from most to least probable to be signals, since this would likely be the scenario if machine learning is added to EIOS to help analysts.

In [None]:
predicted = np.append(np.array([x_test]).T, y_pred, axis=1)

In [None]:
predicted[predicted[:,1].argsort()[::-1]]
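The same ranking can be sketched in plain Python, which may be easier to read than the array indexing. `rank_by_probability` is a hypothetical helper, and the texts and probabilities are made-up values, not model output.

```python
def rank_by_probability(texts, probabilities):
    """Pair each text with its predicted probability and sort, most probable first."""
    pairs = zip(texts, probabilities)
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Made-up example values, not model output.
texts = ['article A', 'article B', 'article C']
probabilities = [0.12, 0.93, 0.55]
ranked = rank_by_probability(texts, probabilities)
for text, prob in ranked:
    print(f'{prob:.2f}  {text}')
```

An analyst would then review the list from the top, where the articles most likely to be signals appear first.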