# Introduction: Predictive model for differential diagnosis

In this notebook, we develop a model that takes a patient's symptoms as input and returns the top 3 candidate classes (diseases), each with a confidence value expressed as a probability.


## Library and Data import

In [1]:
#|include: false 

%pip install seaborn
%pip install fastkaggle
%pip install -Uqq fastbook
%pip install --upgrade pip
%pip install tqdm
#%pip install catboost
#%pip install optuna
#%pip install optuna_distributed
#%pip install openfe
#%pip install xgboost
#%pip install lightgbm
#%pip install h2o
#%pip install polars
#%pip install -q -U autogluon.tabular
#%pip install autogluon
#%pip install wandb
#%pip install sweetviz


In [2]:
#| code-fold: true
#| output: false
#| code-summary: "Library Import"

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

#import fastbook
#fastbook.setup_book()
#from fastbook import *
from fastai.tabular.all import *
import numpy as np
from numpy import random
from tqdm import tqdm
from ipywidgets import interact
from fastai.imports import *
np.set_printoptions(linewidth=130)
from fastai.text.all import *
from pathlib import Path
import os
import warnings
import gc
import pickle
from joblib import dump, load

# ULMFiT approach

The pre-trained model we start from was originally trained on Wikipedia on the task of guessing the next word. We then fine-tuned this model for our disease classification task based on symptoms.

However, Wikipedia English can differ from medical jargon. To improve the model further, we first fine-tune the pre-trained language model on a medical corpus and then use it as the base for our classifier.

In [3]:
!ls /kaggle/input/symptoms-disease-no-id

symptom_disease_no_id_col.csv  symptom_no_id.csv


In [4]:
path = Path('/kaggle/input/symptoms-disease-no-id')
path

Path('/kaggle/input/symptoms-disease-no-id')

In [5]:
#symptom_df = pd.read_csv(path_lm/'symptom_synth.csv',index_col=0)
symptom_df = pd.read_csv(path/'symptom_no_id.csv')
sd_df = pd.read_csv(path/'symptom_disease_no_id_col.csv')
symptom_df.head()

Unnamed: 0,text
0,"I have been experiencing a skin rash on my arms, legs, and torso for the past few weeks. It is red, itchy, and covered in dry, scaly patches."
1,"My skin has been peeling, especially on my knees, elbows, and scalp. This peeling is often accompanied by a burning or stinging sensation."
2,"I have been experiencing joint pain in my fingers, wrists, and knees. The pain is often achy and throbbing, and it gets worse when I move my joints."
3,"There is a silver like dusting on my skin, especially on my lower back and scalp. This dusting is made up of small scales that flake off easily when I scratch them."
4,"My nails have small dents or pits in them, and they often feel inflammatory and tender to the touch. Even there are minor rashes on my arms."


In [6]:
symptom_df['text'].nunique(),sd_df['text'].nunique()

(1153, 1153)

In [7]:
#dls_lm = TextDataLoaders.from_df(symptom_df, path=path, is_lm=True, valid_pct=0.2)
dls_lm = TextDataLoaders.from_df(symptom_df, path=path, is_lm=True,text_col='text', valid_pct=0.2)
#dls_lm = TextDataLoaders.from_folder(path=path_lm, is_lm=True, valid_pct=0.1)

In [8]:
dls_lm.show_batch(max_n=5)

Unnamed: 0,text,text_
0,"xxbos xxmaj there are red sores on my face and near my nose . i have noticed that the rash is spreading more quickly on my neck and chest . xxbos xxmaj the veins on my legs are very noticeable and are causing me a lot of discomforts . xxmaj they are swollen and protrude from my skin , making them visible through my xxunk . xxbos xxmaj i 've been suffering","xxmaj there are red sores on my face and near my nose . i have noticed that the rash is spreading more quickly on my neck and chest . xxbos xxmaj the veins on my legs are very noticeable and are causing me a lot of discomforts . xxmaj they are swollen and protrude from my skin , making them visible through my xxunk . xxbos xxmaj i 've been suffering from"
1,"legs , and torso for the past few weeks . xxmaj it is red , itchy , and covered in dry , xxunk patches . xxbos i have a rash on my legs that is causing a lot of discomforts . xxmaj it seems there is a cramp and i can see prominent veins on the xxunk . xxmaj also , i have been feeling very tired and fatigued in the past",", and torso for the past few weeks . xxmaj it is red , itchy , and covered in dry , xxunk patches . xxbos i have a rash on my legs that is causing a lot of discomforts . xxmaj it seems there is a cramp and i can see prominent veins on the xxunk . xxmaj also , i have been feeling very tired and fatigued in the past couple"
2,"and frequent dizziness . xxmaj my regular activities have been made more tough by these symptoms , which have been causing me a great deal of distress . xxbos xxmaj my muscles have been really weak , and my neck has been extremely stiff . xxmaj my joints have swollen , making it difficult to move about . xxmaj walking has been excruciatingly uncomfortable . xxbos xxmaj i 've been feeling really","frequent dizziness . xxmaj my regular activities have been made more tough by these symptoms , which have been causing me a great deal of distress . xxbos xxmaj my muscles have been really weak , and my neck has been extremely stiff . xxmaj my joints have swollen , making it difficult to move about . xxmaj walking has been excruciatingly uncomfortable . xxbos xxmaj i 've been feeling really ill"
3,"some quite thick and red phlegm . xxbos xxmaj back pain , a persistent cough , and numbness in my arms and legs have been bothering me . xxmaj i 've had trouble staying balanced and experiencing dizziness , and my neck hurts . xxbos i have trouble focusing , and my emotions xxunk . xxmaj my brain might seem xxunk and foggy at times , making it difficult for me to","quite thick and red phlegm . xxbos xxmaj back pain , a persistent cough , and numbness in my arms and legs have been bothering me . xxmaj i 've had trouble staying balanced and experiencing dizziness , and my neck hurts . xxbos i have trouble focusing , and my emotions xxunk . xxmaj my brain might seem xxunk and foggy at times , making it difficult for me to xxunk"
4,"also some odd patches of skin that are a different hue than the rest of me , and i occasionally get little pimples that resemble nodules . xxbos a nasty rash has just appeared on my skin . xxmaj blackheads and pimples packed with pus are everywhere . xxmaj additionally , my skin has been extremely sensitive . xxbos i have been suffering from intense itching , vomiting , fatigue , and","some odd patches of skin that are a different hue than the rest of me , and i occasionally get little pimples that resemble nodules . xxbos a nasty rash has just appeared on my skin . xxmaj blackheads and pimples packed with pus are everywhere . xxmaj additionally , my skin has been extremely sensitive . xxbos i have been suffering from intense itching , vomiting , fatigue , and weight"


In [9]:

learn = language_model_learner(dls_lm, AWD_LSTM, metrics=[accuracy, Perplexity()], path=path, wd=0.1).to_fp16()


In [10]:
#| error: false
learn.fit_one_cycle(1, 1e-2)


epoch,train_loss,valid_loss,accuracy,perplexity,time
0,4.336221,3.432591,0.3614,30.956762,00:02


In [11]:
#| code-fold: show

# Create a directory to save the model
os.makedirs('/kaggle/working/models', exist_ok=True)

# Set the model directory for the learner
learn.model_dir = '/kaggle/working/models'

# Now save the model
learn.save('1epoch')

Path('/kaggle/working/models/1epoch.pth')

In [12]:
#| error: false
learn = learn.load('1epoch')


In [13]:
#| error: false
learn.unfreeze()
learn.fit_one_cycle(5, 1e-3)


epoch,train_loss,valid_loss,accuracy,perplexity,time
0,3.674989,2.843485,0.416016,17.175516,00:02
1,3.322473,2.491878,0.447927,12.083943,00:02
2,3.094737,2.287922,0.485718,9.854441,00:02
3,2.910559,2.196874,0.502015,8.996846,00:02
4,2.781198,2.18363,0.507523,8.878479,00:02
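
The `perplexity` column is the exponential of the validation loss (the mean cross-entropy, measured in nats). A minimal check using two `valid_loss` values copied from the table above; the helper name `perplexity` is ours, and tiny differences can come from rounding in the displayed loss.

```python
import math

def perplexity(cross_entropy_loss):
    """Perplexity is exp(mean cross-entropy loss), with the loss in nats."""
    return math.exp(cross_entropy_loss)

# valid_loss values copied from the first and last epochs of the table above
for valid_loss in (2.843485, 2.18363):
    print(f"valid_loss={valid_loss:.6f} -> perplexity={perplexity(valid_loss):.3f}")
```

A lower perplexity means the language model is, on average, less "surprised" by the next token in the held-out symptom descriptions.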


In [14]:
#| code-fold: true
#| output: false
#| code-summary: "Save the model"
# Save only the fine-tuned encoder, which the classifier will load later
learn.save_encoder('finetuned')

In [15]:
#| output: false
#| error: false
TEXT = "I have running nose, stomach and joint pains"
N_WORDS = 40
N_SENTENCES = 2
preds = [learn.predict(TEXT, N_WORDS, temperature=0.75) 
         for _ in range(N_SENTENCES)]


In [16]:
print("\n".join(preds))

i have running nose , stomach and joint pains . My appetite is mild , and my chest hurts . i have been experiencing severe back pain , a headache , and a high fever . My heart is racing , and i have been experiencing
i have running nose , stomach and joint pains . My throat hurts , I 've been coughing up thick stomach pain . My chest pain has been quite weak , and I 've had a lot of stomach ache . I 've
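
The `temperature=0.75` argument used when generating text controls how adventurous the sampling is. The sketch below assumes the common formulation in which each next-word probability is raised to the power `1/temperature` and the result is renormalised (equivalent to dividing the logits by the temperature before the softmax). The function name and the toy distribution are hypothetical and not taken from the notebook.

```python
def apply_temperature(probs, temperature):
    """Sharpen (temperature < 1) or flatten (temperature > 1) a distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [p ** (1.0 / temperature) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]

# Hypothetical next-word distribution over three candidate tokens
next_word_probs = [0.5, 0.3, 0.2]

for t in (1.0, 0.75, 2.0):
    adjusted = apply_temperature(next_word_probs, t)
    print(t, [round(p, 3) for p in adjusted])
```

Values below 1 concentrate probability on the most likely words, which tends to give more coherent but less varied continuations.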


In [17]:
#symptom_df = pd.read_csv(path_lm/'symptom_synth.csv',index_col=0)
#sd_df = pd.read_csv(path_lm/'symptom_disease_no_id_col.csv')
sd_df.head()

Unnamed: 0,label,text
0,Psoriasis,"I have been experiencing a skin rash on my arms, legs, and torso for the past few weeks. It is red, itchy, and covered in dry, scaly patches."
1,Psoriasis,"My skin has been peeling, especially on my knees, elbows, and scalp. This peeling is often accompanied by a burning or stinging sensation."
2,Psoriasis,"I have been experiencing joint pain in my fingers, wrists, and knees. The pain is often achy and throbbing, and it gets worse when I move my joints."
3,Psoriasis,"There is a silver like dusting on my skin, especially on my lower back and scalp. This dusting is made up of small scales that flake off easily when I scratch them."
4,Psoriasis,"My nails have small dents or pits in them, and they often feel inflammatory and tender to the touch. Even there are minor rashes on my arms."


In [18]:
# Check for NaN values in the label column
print(sd_df['label'].isna().sum())

# If there are NaNs, you can drop those rows
#sd_df = sd_df.dropna(subset=['label'])

0


In [19]:
#| output: false
#| error: false
#dls_clas = TextDataLoaders.from_df(sd_df, path=path,valid='test', text_vocab=dls_lm.vocab)
dls_clas = TextDataLoaders.from_df(sd_df, path=path,valid='test',text_col='text',label_col='label', text_vocab=dls_lm.vocab)

In [20]:
#| error: false
learn = text_classifier_learner(dls_clas, AWD_LSTM, drop_mult=0.5, metrics=accuracy)


In [21]:
from pathlib import Path
learn.path = Path('/kaggle/working')

In [22]:
#| error: false
learn = learn.load_encoder('finetuned')


In [23]:
len(dls_lm.vocab)

944

In [24]:
#| error: false
learn.fit_one_cycle(1, 2e-2)

epoch,train_loss,valid_loss,accuracy,time
0,2.322105,2.422645,0.425,00:01


In [25]:
#| error: false
learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(1e-2/(2.6**4),1e-2))

epoch,train_loss,valid_loss,accuracy,time
0,1.454906,1.690209,0.625,00:01
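
The `slice(1e-2/(2.6**4), 1e-2)` argument requests discriminative learning rates: the earliest layer group trains with the smallest rate and the last group with the largest. The sketch below reimplements the geometric spacing in plain Python to show the arithmetic. It assumes the classifier has 5 layer groups (so each group's rate is 2.6 times the previous one); `geometric_lrs` and `N_GROUPS` are illustrative names, not fastai API.

```python
def geometric_lrs(lo, hi, n_groups):
    """Spread learning rates from lo to hi with a constant ratio between groups."""
    if n_groups == 1:
        return [hi]
    step = (hi / lo) ** (1.0 / (n_groups - 1))
    return [lo * step ** i for i in range(n_groups)]

N_GROUPS = 5  # assumption: number of layer groups in the classifier
lrs = geometric_lrs(1e-2 / 2.6 ** 4, 1e-2, N_GROUPS)

for i, lr in enumerate(lrs):
    print(f"group {i}: lr={lr:.6f}")
```

Under that assumption the lowest rate is roughly 2.2e-4 and the highest is 1e-2, so the pre-trained early layers change slowly while the new classification head adapts faster.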


In [26]:
learn.unfreeze()
learn.fit_one_cycle(2, slice(1e-3/(2.6**4),1e-3))

epoch,train_loss,valid_loss,accuracy,time
0,1.066525,1.136398,0.754167,00:01
1,0.975307,0.929305,0.770833,00:01


In [27]:
learn.predict("I am having a running stomach, fever, general body weakness and have been getting bitten by mosquitoes often")

('Typhoid',
 tensor(16),
 tensor([0.0064, 0.0252, 0.1021, 0.0470, 0.0020, 0.0103, 0.0069, 0.0837, 0.0012,
         0.0626, 0.0060, 0.0562, 0.0618, 0.0452, 0.0356, 0.0041, 0.2126, 0.0103,
         0.0208, 0.0388, 0.0050, 0.0315, 0.0391, 0.0856]))
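
`learn.predict` returns the single most likely label, its index, and the full probability vector. To get the top 3 classes with their probabilities, as stated in the introduction, the vector can be ranked. This is a minimal sketch in plain Python using the probabilities printed above. Only index 16 (`Typhoid`) is known from that output, so the other label names are placeholders, and `top_k_classes` is a hypothetical helper.

```python
def top_k_classes(probs, labels, k=3):
    """Return the k (label, probability) pairs with the highest probability."""
    if len(probs) != len(labels):
        raise ValueError("probs and labels must have the same length")
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Probabilities copied from the prediction output above (24 classes)
probs = [0.0064, 0.0252, 0.1021, 0.0470, 0.0020, 0.0103, 0.0069, 0.0837, 0.0012,
         0.0626, 0.0060, 0.0562, 0.0618, 0.0452, 0.0356, 0.0041, 0.2126, 0.0103,
         0.0208, 0.0388, 0.0050, 0.0315, 0.0391, 0.0856]

# Placeholder names; only index 16 is confirmed by the output above
labels = [f"class_{i}" for i in range(len(probs))]
labels[16] = "Typhoid"

for label, p in top_k_classes(probs, labels):
    print(f"{label}: {p:.4f}")
```

With the real learner, the same helper could be called as `top_k_classes(probs.tolist(), learn.dls.vocab[1])`, assuming `learn.dls.vocab[1]` holds the disease labels in class-index order.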

In [28]:
#| code-fold: true
#| code-summary: "Click to see full code in one cell"
#| error: false
path = Path('/kaggle/input/symptoms-disease-no-id')
#symptom_df = pd.read_csv(path_lm/'symptom_synth.csv',index_col=0)
symptom_df = pd.read_csv(path/'symptom_no_id.csv')
sd_df = pd.read_csv(path/'symptom_disease_no_id_col.csv')
dls_lm = TextDataLoaders.from_df(symptom_df, path=path,text_col='text', is_lm=True, valid_pct=0.2)
learn = language_model_learner(dls_lm, AWD_LSTM, metrics=[accuracy, Perplexity()], path=path, wd=0.1).to_fp16()
learn.fit_one_cycle(1, 1e-2)
# Create a directory to save the model
os.makedirs('/kaggle/working/models', exist_ok=True)
# Set the model directory for the learner
learn.model_dir = '/kaggle/working/models'
# Now save the model
learn.save('1epoch')
learn = learn.load('1epoch')
learn.unfreeze()
learn.fit_one_cycle(5, 1e-3)
# Save only the fine-tuned encoder, which the classifier will load later
learn.save_encoder('finetuned')


# Fine-tune the classifier
# Build the classification DataLoaders first; the learner below depends on them
dls_clas = TextDataLoaders.from_df(sd_df, path=path, text_col='text', label_col='label', text_vocab=dls_lm.vocab)
learn = text_classifier_learner(dls_clas, AWD_LSTM, drop_mult=0.5, metrics=accuracy)
learn.path = Path('/kaggle/working')
learn = learn.load_encoder('finetuned')
learn.fit_one_cycle(1, 2e-2)
learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(1e-2/(2.6**4),1e-2))
learn.unfreeze()
learn.fit_one_cycle(2, slice(1e-3/(2.6**4),1e-3))
learn.predict("I am having a running stomach, fever, general body weakness and have been getting bitten by mosquitoes often")


epoch,train_loss,valid_loss,accuracy,perplexity,time
0,4.302298,3.715562,0.319763,41.081654,00:01



epoch,train_loss,valid_loss,accuracy,perplexity,time
0,3.601099,3.088351,0.375058,21.940868,00:02
1,3.265305,2.707949,0.418822,14.998481,00:02
2,3.018157,2.503999,0.461733,12.231309,00:02
3,2.845709,2.404028,0.474421,11.067669,00:02
4,2.727142,2.383838,0.477127,10.846457,00:02



epoch,train_loss,valid_loss,accuracy,time
0,2.340014,2.461548,0.4625,00:01


epoch,train_loss,valid_loss,accuracy,time
0,1.528868,1.718048,0.691667,00:01


epoch,train_loss,valid_loss,accuracy,time
0,1.075513,1.136581,0.783333,00:01
1,0.995345,0.905521,0.804167,00:01


('Hypertension',
 tensor(9),
 tensor([0.0052, 0.0386, 0.0400, 0.0893, 0.0035, 0.0173, 0.0150, 0.0446, 0.0053,
         0.1807, 0.0051, 0.0598, 0.1156, 0.0314, 0.0195, 0.0075, 0.0949, 0.0056,
         0.0250, 0.0441, 0.0122, 0.0408, 0.0722, 0.0269]))