# Introduction: Predictive model for differential diagnosis

In this notebook, our goal is to develop a model that takes a patient's symptoms as input and returns the three most likely classes (diseases), each with a confidence value expressed as a probability.
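
To make the target output concrete, here is a minimal sketch of turning raw model scores into probabilities and keeping the three most likely classes. The class names, the scores and the helper `top_k_classes` are hypothetical and only illustrate the shape of the result; they are not taken from the dataset or from fastai.

```python
import math

def top_k_classes(scores, class_names, k=3):
    """Convert raw scores to probabilities (softmax) and return the k most likely classes."""
    # Subtract the largest score before exponentiating, for numerical stability.
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Sort (name, probability) pairs from most to least likely.
    ranked = sorted(zip(class_names, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Hypothetical class names and scores, for illustration only.
classes = ["malaria", "typhoid", "common cold", "dengue"]
raw_scores = [2.0, 1.0, 0.1, 1.5]

for name, prob in top_k_classes(raw_scores, classes):
    print(f"{name}: {prob:.3f}")
```

The same ranking step could be applied to the probability vector that a trained classifier returns for a symptom description.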


## Library and Data import

In [None]:
#|include: false 
#|cellmetadata: false

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Input data files are available in the read-only "../input/" directory
# For example, running the above (by clicking run or pressing Shift+Enter) will list all files under the input directory

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session


In [None]:
#|include: false 

%pip install seaborn
%pip install fastkaggle
%pip install -Uqq fastbook
%pip install --upgrade pip
%pip install tqdm
#%pip install catboost
#%pip install optuna
#%pip install optuna_distributed
#%pip install openfe
#%pip install xgboost
#%pip install lightgbm
#%pip install h2o
#%pip install polars
#%pip install -q -U autogluon.tabular
#%pip install autogluon
#%pip install wandb
#%pip install sweetviz

In [None]:
#| code-fold: true
#| output: false
#| code-summary: "Library Import"

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

#import fastbook
#fastbook.setup_book()
#from fastbook import *
from fastai.tabular.all import *
import numpy as np
from numpy import random
from tqdm import tqdm
from ipywidgets import interact
from fastai.imports import *
np.set_printoptions(linewidth=130)
from fastai.text.all import *
from pathlib import Path
import os
import warnings
import gc
import pickle
from joblib import dump, load

# ULMFiT approach

In traditional text transfer learning, we start from a pre-trained model called a language model. The model we use in this example was initially trained on Wikipedia on the task of guessing the next word. We then fine-tune this model for our disease classification task based on symptoms.

However, Wikipedia English differs from medical jargon, so we can improve the model further with a technique from the [ULMFiT paper](https://arxiv.org/abs/1801.06146) by Jeremy Howard and Sebastian Ruder. They add an intermediate step: the pre-trained language model is first fine-tuned on the task-specific corpus (here, a medical corpus), and the result is used as the base for the classifier. They found that this extra step gives better results, because the model has better context for the final task.

In [None]:
!ls /kaggle/input/symptoms-disease-no-id

In [None]:
path = Path('/kaggle/input/symptoms-disease-no-id')
path

In [None]:
#symptom_df = pd.read_csv(path_lm/'symptom_synth.csv',index_col=0)
symptom_df = pd.read_csv(path/'symptom_no_id.csv')
sd_df = pd.read_csv(path/'symptom_disease_no_id_col.csv')
symptom_df.head()

In [None]:
symptom_df['text'].nunique(),sd_df['text'].nunique()

Below I define a `DataLoaders` object, fastai's wrapper around PyTorch's `DataLoader` class with added functionality. It takes in our data and prepares it as input for our model, passing it to the model in batches.

The `DataLoaders` object lets us build the data objects we need for training without changing the raw input data.

The `DataLoaders` object then acts as the input for our models. We also pass in `valid_pct=0.2`, which randomly samples 20% of our data and holds it out for validation.
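
As a rough illustration of what a 20% random validation split means, the sketch below shuffles row indices and cuts off the first 20% for validation. The helper `random_split` and its `seed` argument are hypothetical names for this example; this is not fastai's own code.

```python
import random

def random_split(n_items, valid_pct=0.2, seed=42):
    """Return (train_indices, valid_indices) for a random hold-out split."""
    idxs = list(range(n_items))
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(idxs)
    n_valid = int(n_items * valid_pct)
    # The first n_valid shuffled indices form the validation set.
    return idxs[n_valid:], idxs[:n_valid]

train_idx, valid_idx = random_split(10, valid_pct=0.2)
print(len(train_idx), len(valid_idx))
```

Every row ends up in exactly one of the two sets, so the validation rows are never seen during training.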

In [None]:
#dls_lm = TextDataLoaders.from_df(symptom_df, path=path, is_lm=True, valid_pct=0.2)
dls_lm = TextDataLoaders.from_df(symptom_df, path=path, is_lm=True,text_col='text', valid_pct=0.2)
#dls_lm = TextDataLoaders.from_folder(path=path_lm, is_lm=True, valid_pct=0.1)

We then use `show_batch` to look at some of our data. Since we are guessing the next word in a sentence, you will notice that the targets in the *text_* column are shifted one word to the right.
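
The one-word shift can be sketched in plain Python: for a language model, the target at each position is simply the next token of the input. The token list below is a made-up example, not an actual batch from the dataset.

```python
# Hypothetical tokenized symptom description, for illustration only.
tokens = ["xxbos", "i", "have", "a", "running", "nose", "and", "joint", "pains"]

# The model reads every token except the last one...
inputs = tokens[:-1]
# ...and at each position must predict the token that comes next.
targets = tokens[1:]

for x, y in zip(inputs, targets):
    print(f"{x:>8} -> {y}")
```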

In [None]:
dls_lm.show_batch(max_n=5)

From the above, we notice that the texts were processed and split into tokens. The tokenizer adds some special tokens, such as `xxbos` to indicate the beginning of a text and `xxmaj` to indicate that the next word was capitalised.

We then define a fastai [learner](https://docs.fast.ai/learner.html#learner), a fastai class that handles the training loop. It bundles together the essential components needed for training: the model, the dataloaders, and the loss function.
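
To show what these special tokens do, here is a deliberately simplified imitation of the two rules mentioned above. The function `simple_tokenize` is hypothetical; fastai's real tokenizer applies more rules and uses a proper word tokenizer, so treat this only as a sketch of the idea.

```python
import re

BOS, MAJ = "xxbos", "xxmaj"

def simple_tokenize(text):
    """Simplified imitation of two tokenization rules: xxbos and xxmaj."""
    tokens = [BOS]  # mark the beginning of the text
    # Split into runs of word characters and single punctuation marks.
    for word in re.findall(r"\w+|[^\w\s]", text):
        if word.istitle():
            # Capitalised word: emit the marker, then the lower-cased word.
            tokens += [MAJ, word.lower()]
        else:
            tokens.append(word)
    return tokens

print(simple_tokenize("I have a Fever."))
```

Lower-casing words while keeping a marker lets the model share one vocabulary entry for "Fever" and "fever" without losing the capitalisation information.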

We use the AWD LSTM architecture.

In [None]:

learn = language_model_learner(dls_lm, AWD_LSTM, metrics=[accuracy, Perplexity()], path=path, wd=0.1).to_fp16()
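
The `Perplexity()` metric passed to the learner is the exponential of the average cross-entropy loss. A minimal sketch of that relationship, using made-up probabilities that a model might assign to the correct next token (the helper `perplexity` is written for this example and is not fastai's class):

```python
import math

def perplexity(true_token_probs):
    """Exponential of the mean negative log-likelihood of the correct tokens."""
    nll = [-math.log(p) for p in true_token_probs]
    cross_entropy = sum(nll) / len(nll)
    return math.exp(cross_entropy)

# A model that spreads its guess evenly over 4 tokens has perplexity close to 4.
print(perplexity([0.25, 0.25, 0.25, 0.25]))
# A more confident (and correct) model has lower perplexity.
print(perplexity([0.9, 0.8, 0.95, 0.7]))
```

Lower is better: a perplexity of 1 would mean the model always assigns probability 1 to the correct next word.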

In [None]:
#| error: false
learn.fit_one_cycle(1, 1e-2)

In [None]:
#| code-fold: show

# Create a directory to save the model
os.makedirs('/kaggle/working/models', exist_ok=True)

# Set the model directory for the learner
learn.model_dir = '/kaggle/working/models'

# Now save the model
learn.save('1epoch')

In [None]:
#| error: false
learn = learn.load('1epoch')

In [None]:
#| error: false
learn.unfreeze()
learn.fit_one_cycle(5, 1e-3)

In [None]:
#| code-fold: true
#| output: false
#| code-summary: "Save the model"
# Save only the encoder, so the classifier can load it later
learn.save_encoder('finetuned')

In [None]:
#| output: false
#| error: false
TEXT = "I have running nose, stomach and joint pains"
N_WORDS = 40
N_SENTENCES = 2
preds = [learn.predict(TEXT, N_WORDS, temperature=0.75) 
         for _ in range(N_SENTENCES)]

In [None]:
print("\n".join(preds))
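
The `temperature=0.75` argument controls how adventurous the generated text is. The sketch below shows one common way to apply a temperature to next-word probabilities before sampling: values below 1 sharpen the distribution towards the most likely word. The probabilities and helper names are hypothetical, and this is a generic illustration rather than a copy of fastai's implementation.

```python
import random

def apply_temperature(probs, temperature=0.75):
    """Rescale a probability distribution; temperature < 1 sharpens it."""
    scaled = [p ** (1.0 / temperature) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]

def sample_index(probs, rng):
    """Draw one index at random, weighted by the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Hypothetical next-word probabilities, for illustration only.
next_word_probs = [0.5, 0.3, 0.2]
sharpened = apply_temperature(next_word_probs, temperature=0.75)
print([round(p, 3) for p in sharpened])
print(sample_index(sharpened, random.Random(0)))
```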

In [None]:
#symptom_df = pd.read_csv(path_lm/'symptom_synth.csv',index_col=0)
#sd_df = pd.read_csv(path_lm/'symptom_disease_no_id_col.csv')
sd_df.head()

In [None]:
# Check for NaN values in the label column
print(sd_df['label'].isna().sum())

# If there are NaNs, you can drop those rows
#df = df.dropna(subset=['label'])

In [None]:
#| output: false
#| error: false
#dls_clas = TextDataLoaders.from_df(sd_df, path=path,valid='test', text_vocab=dls_lm.vocab)
dls_clas = TextDataLoaders.from_df(sd_df, path=path,valid='test',text_col='text',label_col='label', text_vocab=dls_lm.vocab)

In [None]:
#| error: false
learn = text_classifier_learner(dls_clas, AWD_LSTM, drop_mult=0.5, metrics=accuracy)

In [None]:
from pathlib import Path
learn.path = Path('/kaggle/working')

In [None]:
#| error: false
learn = learn.load_encoder('finetuned')

In [None]:
len(dls_lm.vocab)

In [None]:
#| error: false
learn.fit_one_cycle(1, 2e-2)

In [None]:
#| error: false
learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(1e-2/(2.6**4),1e-2))

In [None]:
learn.unfreeze()
learn.fit_one_cycle(2, slice(1e-3/(2.6**4),1e-3))
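
Passing `slice(low, high)` as the learning rate gives the earliest layers the lowest rate and the final layers the highest. The sketch below spaces the rates geometrically between the two ends. It assumes five parameter groups, which is why dividing by `2.6**4` yields a factor of 2.6 between neighbouring groups; the group count and the helper `spread_learning_rates` are assumptions for illustration, not values read from the learner.

```python
def spread_learning_rates(lowest, highest, n_groups):
    """Geometrically space learning rates from lowest (early layers) to highest (head)."""
    if n_groups == 1:
        return [highest]
    # Constant multiplicative step between neighbouring groups.
    step = (highest / lowest) ** (1 / (n_groups - 1))
    return [lowest * step ** i for i in range(n_groups)]

# Assumption: the classifier is split into 5 parameter groups.
lrs = spread_learning_rates(1e-2 / (2.6 ** 4), 1e-2, n_groups=5)
for i, lr in enumerate(lrs):
    print(f"group {i}: {lr:.6f}")
```

Early layers hold general language knowledge and change slowly, while the newly added classifier head is trained with the largest rate.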

In [1]:
learn.predict("I am having a running stomach, fever, general body weakness and have been getting bitten by mosquitoes often")

In [None]:
#| code-fold: true
#| code-summary: "Click to see full code in one cell"
#| error: false
path = Path('/kaggle/input/symptoms-disease-no-id')
#symptom_df = pd.read_csv(path_lm/'symptom_synth.csv',index_col=0)
symptom_df = pd.read_csv(path/'symptom_no_id.csv')
sd_df = pd.read_csv(path/'symptom_disease_no_id_col.csv')
dls_lm = TextDataLoaders.from_df(symptom_df, path=path,text_col='text', is_lm=True, valid_pct=0.2)
learn = language_model_learner(dls_lm, AWD_LSTM, metrics=[accuracy, Perplexity()], path=path, wd=0.1).to_fp16()
learn.fit_one_cycle(1, 1e-2)
# Create a directory to save the model
os.makedirs('/kaggle/working/models', exist_ok=True)
# Set the model directory for the learner
learn.model_dir = '/kaggle/working/models'
# Now save the model
learn.save('1epoch')
learn = learn.load('1epoch')
learn.unfreeze()
learn.fit_one_cycle(5, 1e-3)
# Save only the encoder, so the classifier can load it later
learn.save_encoder('finetuned')


# Fine-tune the classifier: build the DataLoaders first, because the learner needs them
dls_clas = TextDataLoaders.from_df(sd_df, path=path,text_col='text',label_col='label', text_vocab=dls_lm.vocab)
learn = text_classifier_learner(dls_clas, AWD_LSTM, drop_mult=0.5, metrics=accuracy)
learn.path = Path('/kaggle/working')
learn = learn.load_encoder('finetuned')
learn.fit_one_cycle(1, 2e-2)
learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(1e-2/(2.6**4),1e-2))
learn.unfreeze()
learn.fit_one_cycle(2, slice(1e-3/(2.6**4),1e-3))
learn.predict("I am having a running stomach, fever, general body weakness and have been getting bitten by mosquitoes often")

# References

[Fastai Documentation - Text Transfer Learning](https://docs.fast.ai/tutorial.text.html#the-ulmfit-approach)

