# Introduction: Predictive model for differential diagnosis

In this notebook, our goal is to develop a model that takes a patient's symptoms as input and returns the top 3 possible classes (diseases), each with a confidence value expressed as a probability.


## Library and Data import

In [1]:
#|include: false
#| eval: false
#| code-fold: show
#| output: false
#| code-summary: "Run if in Kaggle environment"


import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Input data files are available in the read-only "../input/" directory
# For example, running the above (by clicking run or pressing Shift+Enter) will list all files under the input directory

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/symptom-dataset-synthetic/symptom_synth.csv
/kaggle/input/symptoms-disease-no-id/symptom_disease_no_id_col.csv
/kaggle/input/symptoms-disease-no-id/symptom_no_id.csv
/kaggle/input/symptom2disease/Symptom2Disease.csv


In [2]:
#|include: false 

%pip install seaborn
%pip install fastkaggle
%pip install -Uqq fastbook
%pip install --upgrade pip
%pip install tqdm
#%pip install catboost
#%pip install optuna
#%pip install optuna_distributed
#%pip install openfe
#%pip install xgboost
#%pip install lightgbm
#%pip install h2o
#%pip install polars
#%pip install -q -U autogluon.tabular
#%pip install autogluon
#%pip install wandb
#%pip install sweetviz

Note: you may need to restart the kernel to use updated packages.
Collecting fastkaggle
  Downloading fastkaggle-0.0.8-py3-none-any.whl.metadata (4.3 kB)
Downloading fastkaggle-0.0.8-py3-none-any.whl (11 kB)
Installing collected packages: fastkaggle
Successfully installed fastkaggle-0.0.8
Collecting pip
  Downloading pip-25.0.1-py3-none-any.whl.metadata (3.7 kB)
Downloading pip-25.0.1-py3-none-any.whl (1.8 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.8/1.8 MB 42.7 MB/s eta 0:00:00
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 24.0
    Uninstalling pip-24.0:
      Successfully uninstalled pip-24.0
Successfully installed pip-25.0.1

In [3]:
#| code-fold: show
#| output: false
#| code-summary: "Library Import"

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

#import fastbook
#fastbook.setup_book()
#from fastbook import *
from fastai.tabular.all import *
import numpy as np
from numpy import random
from tqdm import tqdm
from ipywidgets import interact
from fastai.imports import *
np.set_printoptions(linewidth=130)
from fastai.text.all import *
from pathlib import Path
import os
import warnings
import gc
import pickle
from joblib import dump, load

# ULMFiT approach

Our pre-trained model was originally trained on Wikipedia on the task of guessing the next word. We then fine-tune this model for our task of classifying diseases from symptom descriptions.

However, Wikipedia English may differ from medical jargon. To improve the model further, we can first fine-tune the pre-trained language model on a medical corpus and then use it as the base for our classifier.

In [4]:
!ls /kaggle/input/symptoms-disease-no-id

symptom_disease_no_id_col.csv  symptom_no_id.csv


In [5]:
path = Path('/kaggle/input/symptoms-disease-no-id')
path

Path('/kaggle/input/symptoms-disease-no-id')

In [6]:
#symptom_df = pd.read_csv(path_lm/'symptom_synth.csv',index_col=0)
symptom_df = pd.read_csv(path/'symptom_no_id.csv')
sd_df = pd.read_csv(path/'symptom_disease_no_id_col.csv')
symptom_df.head()

Unnamed: 0,text
0,"I have been experiencing a skin rash on my arms, legs, and torso for the past few weeks. It is red, itchy, and covered in dry, scaly patches."
1,"My skin has been peeling, especially on my knees, elbows, and scalp. This peeling is often accompanied by a burning or stinging sensation."
2,"I have been experiencing joint pain in my fingers, wrists, and knees. The pain is often achy and throbbing, and it gets worse when I move my joints."
3,"There is a silver like dusting on my skin, especially on my lower back and scalp. This dusting is made up of small scales that flake off easily when I scratch them."
4,"My nails have small dents or pits in them, and they often feel inflammatory and tender to the touch. Even there are minor rashes on my arms."


In [7]:
symptom_df['text'].nunique(),sd_df['text'].nunique()

(1153, 1153)

In [8]:
#dls_lm = TextDataLoaders.from_df(symptom_df, path=path, is_lm=True, valid_pct=0.2)
dls_lm = TextDataLoaders.from_df(symptom_df, path=path, is_lm=True,text_col='text', valid_pct=0.2)
#dls_lm = TextDataLoaders.from_folder(path=path_lm, is_lm=True, valid_pct=0.1)

In [9]:
dls_lm.show_batch(max_n=5)

Unnamed: 0,text,text_
0,"xxbos xxmaj my skin has a silvery xxunk over it , especially on my back and arms . xxmaj this dusting is made up of small scales that peel off xxunk when xxunk . xxmaj also , there are rashes all over my body . xxbos i often sneeze and have a runny nose . xxmaj my eyes are scratchy and frequently runny , and i frequently cough . xxmaj my head","xxmaj my skin has a silvery xxunk over it , especially on my back and arms . xxmaj this dusting is made up of small scales that peel off xxunk when xxunk . xxmaj also , there are rashes all over my body . xxbos i often sneeze and have a runny nose . xxmaj my eyes are scratchy and frequently runny , and i frequently cough . xxmaj my head hurts"
1,"doctor , xxmaj my skin is covered in a very uncomfortable rash , along with some odd patches of a different hue . xxmaj my skin also has a few pimples that resemble little knots xxbos xxmaj my wound is healing more slowly these days . xxmaj my feet and hands are tingling and becoming numb . i feel really xxunk . xxbos xxmaj i 've been experiencing intense itching all over",", xxmaj my skin is covered in a very uncomfortable rash , along with some odd patches of a different hue . xxmaj my skin also has a few pimples that resemble little knots xxbos xxmaj my wound is healing more slowly these days . xxmaj my feet and hands are tingling and becoming numb . i feel really xxunk . xxbos xxmaj i 've been experiencing intense itching all over my"
2,"pain , and it 's been really uncomfortable . xxmaj the belly pain has been getting worse and is starting to xxunk my daily life . xxmaj moreover , i get chills every night , followed by a mild fever . xxbos xxmaj i 've been dealing with a really nasty rash on my skin lately . xxmaj it 's covered in blackheads and pimples that are packed with pus . xxmaj",", and it 's been really uncomfortable . xxmaj the belly pain has been getting worse and is starting to xxunk my daily life . xxmaj moreover , i get chills every night , followed by a mild fever . xxbos xxmaj i 've been dealing with a really nasty rash on my skin lately . xxmaj it 's covered in blackheads and pimples that are packed with pus . xxmaj my"
3,"and chills every night and feel terrible because of this . xxmaj also , i feel i have lost my appetite . xxbos xxmaj my throat often feels like it is burning , especially after eating . a bitter or sour aftertaste that i occasionally experience also makes me feel unwell . xxbos i have been suffering from back pain , a hacking cough , and weakness in my arms and legs","chills every night and feel terrible because of this . xxmaj also , i feel i have lost my appetite . xxbos xxmaj my throat often feels like it is burning , especially after eating . a bitter or sour aftertaste that i occasionally experience also makes me feel unwell . xxbos i have been suffering from back pain , a hacking cough , and weakness in my arms and legs ."
4,"xxmaj i 've been sneezing xxunk and i just ca n't get this chill to go away . i feel so weak and exhausted , and my cough wo n't stop . xxmaj my temperature is really high . xxbos xxmaj i 've been feeling extremely scratchy , sick , and worn out . xxmaj in addition , xxmaj i 've lost weight and have a temperature . xxmaj my urine is","i 've been sneezing xxunk and i just ca n't get this chill to go away . i feel so weak and exhausted , and my cough wo n't stop . xxmaj my temperature is really high . xxbos xxmaj i 've been feeling extremely scratchy , sick , and worn out . xxmaj in addition , xxmaj i 've lost weight and have a temperature . xxmaj my urine is dark"


In [10]:
learn = language_model_learner(dls_lm, AWD_LSTM, metrics=[accuracy, Perplexity()], path=path, wd=0.1).to_fp16()

  wgts = torch.load(wgts_fname, map_location = lambda storage,loc: storage)


In [11]:
learn.fit_one_cycle(1, 1e-2)

  self.autocast,self.learn.scaler,self.scales = autocast(dtype=dtype),GradScaler(**self.kwargs),L()


epoch,train_loss,valid_loss,accuracy,perplexity,time
0,4.37216,3.679505,0.323351,39.626778,00:02


In [12]:
#| code-fold: show

# Create a directory to save the model
os.makedirs('/kaggle/working/models', exist_ok=True)

# Set the model directory for the learner
learn.model_dir = '/kaggle/working/models'

# Now save the model
learn.save('1epoch')

Path('/kaggle/working/models/1epoch.pth')

In [13]:
learn = learn.load('1epoch')

  state = torch.load(file, map_location=device)


In [14]:
learn.unfreeze()
learn.fit_one_cycle(5, 1e-3)

  self.autocast,self.learn.scaler,self.scales = autocast(dtype=dtype),GradScaler(**self.kwargs),L()


epoch,train_loss,valid_loss,accuracy,perplexity,time
0,3.585111,3.027508,0.370587,20.645729,00:02
1,3.256822,2.616809,0.417607,13.691961,00:02
2,3.028377,2.411102,0.460286,11.146241,00:02
3,2.852578,2.314805,0.475333,10.122949,00:02
4,2.745948,2.292166,0.484881,9.896353,00:02
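
The perplexity column in the training logs above is the exponential of the validation loss (cross-entropy), so it can be read roughly as the number of equally likely next tokens the model is choosing between. The short check below recomputes it from the logged `valid_loss` values. The helper name `perplexity_from_loss` is ours for illustration and is not part of fastai.

```python
import math

def perplexity_from_loss(valid_loss: float) -> float:
    """Perplexity is the exponential of the cross-entropy loss."""
    return math.exp(valid_loss)

# (valid_loss, logged perplexity) pairs copied from the training logs above
logged = [
    (3.679505, 39.626778),  # frozen model, 1 epoch
    (3.027508, 20.645729),  # unfrozen, epoch 0
    (2.292166, 9.896353),   # unfrozen, epoch 4
]

for loss, ppl in logged:
    print(f"loss={loss:.6f}  exp(loss)={perplexity_from_loss(loss):.4f}  logged={ppl:.4f}")
```

Seen this way, fine-tuning on the symptom texts narrowed the model's uncertainty from about 40 plausible next tokens to about 10.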


In [15]:
#| output: false
# Save only the encoder so the classifier can load it later
learn.save_encoder('finetuned')

In [16]:
TEXT = "I have running nose, stomach and joint pains"
N_WORDS = 40
N_SENTENCES = 2
preds = [learn.predict(TEXT, N_WORDS, temperature=0.75) 
         for _ in range(N_SENTENCES)]

  self.autocast,self.learn.scaler,self.scales = autocast(dtype=dtype),GradScaler(**self.kwargs),L()


In [17]:
print("\n".join(preds))

i have running nose , stomach and joint pains . Suddenly , i have a headache and get chills . I 've been coughing up a lot of thick , crimson nails . I 've been experiencing severe back pain . My neck has
i have running nose , stomach and joint pains . There are small red spots on my arms and legs , which makes it difficult for me to walk and walk . i eat what i have recently noticed . i have been experiencing intense chills and chills
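
The `temperature=0.75` argument used for generation controls how adventurous the sampling is: values below 1 sharpen the next-word distribution toward the most likely tokens, and values above 1 flatten it. The sketch below illustrates the general idea with a temperature-scaled softmax. It is a conceptual illustration under our own assumptions, not fastai's internal implementation, and the scores are hypothetical.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate next words
logits = [2.0, 1.0, 0.5]

for t in (1.0, 0.75):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature={t}: {[round(p, 3) for p in probs]}")
```

With the lower temperature the top candidate receives a larger share of the probability mass, which tends to make the generated text more conservative and more repetitive.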


In [18]:
#symptom_df = pd.read_csv(path_lm/'symptom_synth.csv',index_col=0)
#sd_df = pd.read_csv(path_lm/'symptom_disease_no_id_col.csv')
sd_df.head()

Unnamed: 0,label,text
0,Psoriasis,"I have been experiencing a skin rash on my arms, legs, and torso for the past few weeks. It is red, itchy, and covered in dry, scaly patches."
1,Psoriasis,"My skin has been peeling, especially on my knees, elbows, and scalp. This peeling is often accompanied by a burning or stinging sensation."
2,Psoriasis,"I have been experiencing joint pain in my fingers, wrists, and knees. The pain is often achy and throbbing, and it gets worse when I move my joints."
3,Psoriasis,"There is a silver like dusting on my skin, especially on my lower back and scalp. This dusting is made up of small scales that flake off easily when I scratch them."
4,Psoriasis,"My nails have small dents or pits in them, and they often feel inflammatory and tender to the touch. Even there are minor rashes on my arms."


In [19]:
# Check for NaN values in the label column
print(sd_df['label'].isna().sum())

# If there are NaNs, drop those rows
#sd_df = sd_df.dropna(subset=['label'])

0


In [20]:
#| output: false
#dls_clas = TextDataLoaders.from_df(sd_df, path=path,valid='test', text_vocab=dls_lm.vocab)
dls_clas = TextDataLoaders.from_df(sd_df, path=path,valid='test',text_col='text',label_col='label', text_vocab=dls_lm.vocab)

In [21]:
learn = text_classifier_learner(dls_clas, AWD_LSTM, drop_mult=0.5, metrics=accuracy)

  wgts = torch.load(wgts_fname, map_location = lambda storage,loc: storage)


In [22]:
from pathlib import Path
learn.path = Path('/kaggle/working')

In [23]:
learn = learn.load_encoder('finetuned')

  wgts = torch.load(join_path_file(file,self.path/self.model_dir, ext='.pth'), map_location=device)


In [24]:
len(dls_lm.vocab)

944

In [25]:
learn.fit_one_cycle(1, 2e-2)

epoch,train_loss,valid_loss,accuracy,time
0,2.286716,2.428135,0.433333,00:01


In [26]:
learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(1e-2/(2.6**4),1e-2))

epoch,train_loss,valid_loss,accuracy,time
0,1.531197,1.68965,0.633333,00:01
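
The `slice(1e-2/(2.6**4), 1e-2)` learning rate above requests discriminative learning rates: the earliest layer group trains with the smallest rate and the classifier head with the largest, with the groups in between spaced geometrically. The sketch below reproduces that spacing in plain Python, assuming five parameter groups, which makes the ratio between neighbouring groups 2.6. The function name and the group count are assumptions for illustration; the actual number of groups depends on how fastai splits the model.

```python
def discriminative_lrs(lo, hi, n_groups):
    """Spread learning rates geometrically from lo (first group) to hi (last group)."""
    if n_groups == 1:
        return [hi]
    step = (hi / lo) ** (1 / (n_groups - 1))
    return [lo * step**i for i in range(n_groups)]

# Same bounds as the slice used above; n_groups=5 is an assumption
lrs = discriminative_lrs(1e-2 / 2.6**4, 1e-2, n_groups=5)

for i, lr in enumerate(lrs):
    print(f"group {i}: lr={lr:.6f}")
```

Lower rates for the early groups keep the general language knowledge in the encoder largely intact while the later, more task-specific layers adapt faster.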


In [27]:
learn.unfreeze()
learn.fit_one_cycle(2, slice(1e-3/(2.6**4),1e-3))

epoch,train_loss,valid_loss,accuracy,time
0,1.101122,1.152585,0.779167,00:01
1,1.022826,0.899249,0.8125,00:01


In [28]:
learn.predict("I am having a running stomach, fever, general body weakness and have been getting bitten by mosquitoes often")

('Hypertension',
 tensor(9),
 tensor([0.0224, 0.0319, 0.0957, 0.0965, 0.0024, 0.0557, 0.0091, 0.0450, 0.0056,
         0.1008, 0.0037, 0.0411, 0.0889, 0.0364, 0.0479, 0.0040, 0.0396, 0.0043,
         0.0696, 0.0276, 0.0190, 0.0628, 0.0435, 0.0466]))
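
`learn.predict` returns the single most likely label, its index, and the full probability vector, whereas the goal stated in the introduction is a top-3 list with probabilities. A minimal sketch of that last step follows, using the probabilities printed above. Only index 9 (`Hypertension`) is known from the output, so the other label names are placeholders; in the notebook the real names would presumably come from the classifier's label vocabulary (for example `dls_clas.vocab[1]`, an assumption worth verifying).

```python
import heapq

def top_k(probs, labels, k=3):
    """Return the k (label, probability) pairs with the highest probability."""
    pairs = zip(labels, probs)
    return heapq.nlargest(k, pairs, key=lambda pair: pair[1])

# Probabilities copied from the prediction above (use probs.tolist() on a tensor)
probs = [0.0224, 0.0319, 0.0957, 0.0965, 0.0024, 0.0557, 0.0091, 0.0450, 0.0056,
         0.1008, 0.0037, 0.0411, 0.0889, 0.0364, 0.0479, 0.0040, 0.0396, 0.0043,
         0.0696, 0.0276, 0.0190, 0.0628, 0.0435, 0.0466]

# Placeholder label names; only index 9 is known from the output above
labels = [f"class_{i}" for i in range(len(probs))]
labels[9] = "Hypertension"

for label, p in top_k(probs, labels):
    print(f"{label}: {p:.1%}")
```

Here the top three probabilities are all close to 10%, so a ranked list conveys the model's uncertainty far better than the single label `Hypertension` does.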

In [29]:
#| code-fold: show
#| code-summary: "Click to see full code in one cell"
path = Path('/kaggle/input/symptoms-disease-no-id')
#symptom_df = pd.read_csv(path_lm/'symptom_synth.csv',index_col=0)
symptom_df = pd.read_csv(path/'symptom_no_id.csv')
sd_df = pd.read_csv(path/'symptom_disease_no_id_col.csv')
dls_lm = TextDataLoaders.from_df(symptom_df, path=path,text_col='text', is_lm=True, valid_pct=0.2)
learn = language_model_learner(dls_lm, AWD_LSTM, metrics=[accuracy, Perplexity()], path=path, wd=0.1).to_fp16()
learn.fit_one_cycle(1, 1e-2)
# Create a directory to save the model
os.makedirs('/kaggle/working/models', exist_ok=True)
# Set the model directory for the learner
learn.model_dir = '/kaggle/working/models'
# Now save the model
learn.save('1epoch')
learn = learn.load('1epoch')
learn.unfreeze()
learn.fit_one_cycle(5, 1e-3)
# Save only the encoder so the classifier can load it later
learn.save_encoder('finetuned')


# Fine-tune the classifier
# Build the classifier DataLoaders first so that `dls_clas` exists before the learner uses it
dls_clas = TextDataLoaders.from_df(sd_df, path=path,text_col='text',label_col='label', text_vocab=dls_lm.vocab)
learn = text_classifier_learner(dls_clas, AWD_LSTM, drop_mult=0.5, metrics=accuracy)
learn.path = Path('/kaggle/working')
learn = learn.load_encoder('finetuned')
learn.fit_one_cycle(1, 2e-2)
learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(1e-2/(2.6**4),1e-2))
learn.unfreeze()
learn.fit_one_cycle(2, slice(1e-3/(2.6**4),1e-3))
learn.predict("I am having a running stomach, fever, general body weakness and have been getting bitten by mosquitoes often")

  wgts = torch.load(wgts_fname, map_location = lambda storage,loc: storage)
  self.autocast,self.learn.scaler,self.scales = autocast(dtype=dtype),GradScaler(**self.kwargs),L()


epoch,train_loss,valid_loss,accuracy,perplexity,time
0,4.38834,3.689447,0.334447,40.022724,00:01


  state = torch.load(file, map_location=device)
  self.autocast,self.learn.scaler,self.scales = autocast(dtype=dtype),GradScaler(**self.kwargs),L()


epoch,train_loss,valid_loss,accuracy,perplexity,time
0,3.649581,3.016814,0.392752,20.426121,00:02
1,3.320565,2.677835,0.41888,14.553554,00:02
2,3.06983,2.42339,0.471687,11.284052,00:02
3,2.88089,2.318225,0.485966,10.15763,00:02
4,2.753544,2.303509,0.487196,10.009243,00:02


  wgts = torch.load(wgts_fname, map_location = lambda storage,loc: storage)


  wgts = torch.load(join_path_file(file,self.path/self.model_dir, ext='.pth'), map_location=device)


epoch,train_loss,valid_loss,accuracy,time
0,2.286848,2.414281,0.429167,00:01


epoch,train_loss,valid_loss,accuracy,time
0,1.476966,1.582516,0.729167,00:01


epoch,train_loss,valid_loss,accuracy,time
0,1.090178,1.094736,0.783333,00:01
1,0.99108,0.906746,0.795833,00:01


('Hypertension',
 tensor(9),
 tensor([0.0031, 0.0424, 0.1357, 0.0849, 0.0049, 0.0267, 0.0148, 0.0102, 0.0026,
         0.2103, 0.0050, 0.1055, 0.0351, 0.0682, 0.0337, 0.0065, 0.0596, 0.0114,
         0.0211, 0.0285, 0.0052, 0.0132, 0.0127, 0.0585]))