# Introduction: Predictive model for differential diagnosis

In this notebook, our goal is to develop a model that takes a patient's symptom description as input and returns the three most likely classes (diseases), each with a confidence value expressed as a probability.


## Library and Data import

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/symptom2disease/Symptom2Disease.csv
/kaggle/input/symptoms-disease-no-id/symptom_disease_no_id_col.csv
/kaggle/input/symptoms-disease-no-id/symptom_no_id.csv
/kaggle/input/symptom-dataset-synthetic/symptom_synth.csv


In [2]:
%%time
#%pip install catboost
#%pip install optuna
#%pip install optuna_distributed
#%pip install openfe
%pip install seaborn
#%pip install xgboost
#%pip install lightgbm
%pip install fastkaggle
#%pip install h2o
%pip install -Uqq fastbook
#%pip install polars
#%pip install -q -U autogluon.tabular
#%pip install autogluon
%pip install --upgrade pip
%pip install tqdm
#%pip install wandb
#%pip install sweetviz

Note: you may need to restart the kernel to use updated packages.
Collecting fastkaggle
  Downloading fastkaggle-0.0.8-py3-none-any.whl.metadata (4.3 kB)
Downloading fastkaggle-0.0.8-py3-none-any.whl (11 kB)
Installing collected packages: fastkaggle
Successfully installed fastkaggle-0.0.8
Collecting pip
  Downloading pip-25.0.1-py3-none-any.whl.metadata (3.7 kB)
Downloading pip-25.0.1-py3-none-any.whl (1.8 MB)
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 24.0
    Uninstalling pip-24.0:
      Successfully uninstalled pip-24.0
Successfully installed pip-25.0.1

In [3]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

#import fastbook
#fastbook.setup_book()
#from fastbook import *
from fastai.tabular.all import *
import numpy as np
from numpy import random
from tqdm import tqdm
from ipywidgets import interact
from fastai.imports import *
np.set_printoptions(linewidth=130)
from fastai.text.all import *
from pathlib import Path
import os
import warnings
import gc
import pickle
from joblib import dump, load

# ULMFiT approach

The pre-trained model used above was originally trained on Wikipedia to predict the next word. We then fine-tuned it for our task: classifying diseases from symptom descriptions.

Wikipedia English can differ from medical jargon, however. To improve the model further, we can first fine-tune the pre-trained language model on a medical corpus and then use it as the base for our classifier.

In [4]:
!ls /kaggle/input/symptoms-disease-no-id

symptom_disease_no_id_col.csv  symptom_no_id.csv


In [5]:
path = Path('/kaggle/input/symptoms-disease-no-id')
path

Path('/kaggle/input/symptoms-disease-no-id')

In [7]:
#symptom_df = pd.read_csv(path_lm/'symptom_synth.csv',index_col=0)
symptom_df = pd.read_csv(path/'symptom_no_id.csv')
sd_df = pd.read_csv(path/'symptom_disease_no_id_col.csv')
symptom_df.head()

Unnamed: 0,text
0,"I have been experiencing a skin rash on my arms, legs, and torso for the past few weeks. It is red, itchy, and covered in dry, scaly patches."
1,"My skin has been peeling, especially on my knees, elbows, and scalp. This peeling is often accompanied by a burning or stinging sensation."
2,"I have been experiencing joint pain in my fingers, wrists, and knees. The pain is often achy and throbbing, and it gets worse when I move my joints."
3,"There is a silver like dusting on my skin, especially on my lower back and scalp. This dusting is made up of small scales that flake off easily when I scratch them."
4,"My nails have small dents or pits in them, and they often feel inflammatory and tender to the touch. Even there are minor rashes on my arms."


In [8]:
symptom_df['text'].nunique(),sd_df['text'].nunique()

(1153, 1153)

In [26]:
#dls_lm = TextDataLoaders.from_df(symptom_df, path=path, is_lm=True, valid_pct=0.2)
dls_lm = TextDataLoaders.from_df(symptom_df, path=path, is_lm=True,text_col='text', valid_pct=0.2)
#dls_lm = TextDataLoaders.from_folder(path=path_lm, is_lm=True, valid_pct=0.1)

In [27]:
dls_lm.show_batch(max_n=5)

Unnamed: 0,text,text_
0,"xxbos xxmaj i 've had nausea , vomiting , and exhaustion . xxmaj additionally , xxmaj i 've lost weight and have a temperature . xxmaj my urine is black and my skin has turned yellow . xxmaj i 've also been having stomach pain . xxbos xxmaj my bowel movements have become constipated and xxunk . i often feel quite tired and lack both energy and appetite . xxbos a nasty","xxmaj i 've had nausea , vomiting , and exhaustion . xxmaj additionally , xxmaj i 've lost weight and have a temperature . xxmaj my urine is black and my skin has turned yellow . xxmaj i 've also been having stomach pain . xxbos xxmaj my bowel movements have become constipated and xxunk . i often feel quite tired and lack both energy and appetite . xxbos a nasty rash"
1,"i ca n't explain . xxmaj the bumps are itchy and uncomfortable and seem to xxunk rapidly . xxmaj it 's worrying me . xxbos xxmaj my skin has been itching a lot and developing a rash . xxmaj additionally , i have a few areas of my skin that are a different hue than the rest of it . xxmaj additionally , i have a few firm pimples or breakouts on","ca n't explain . xxmaj the bumps are itchy and uncomfortable and seem to xxunk rapidly . xxmaj it 's worrying me . xxbos xxmaj my skin has been itching a lot and developing a rash . xxmaj additionally , i have a few areas of my skin that are a different hue than the rest of it . xxmaj additionally , i have a few firm pimples or breakouts on my"
2,"'ve been coughing up a lot of xxunk and my chest feels really heavy . i ca n't smell anything and my muscles are really sore xxbos xxmaj i 'm drenched with sweat and struggling to breathe . i do n't feel well and have a lot of mucous in my throat . xxmaj my chest aches , and my heart is racing . xxmaj i 'm coughing up mucus that is","been coughing up a lot of xxunk and my chest feels really heavy . i ca n't smell anything and my muscles are really sore xxbos xxmaj i 'm drenched with sweat and struggling to breathe . i do n't feel well and have a lot of mucous in my throat . xxmaj my chest aches , and my heart is racing . xxmaj i 'm coughing up mucus that is crimson"
3,"fever . i also have a headache and am perspiring a lot . xxmaj additionally hurting me have been nausea and muscle soreness . xxbos i ca n't xxunk sneezing , and my nose is running . xxmaj i 'm also constantly chilly and exhausted , and xxmaj i 've been coughing a lot . xxmaj my fever is also really high , well above usual . xxbos xxmaj even when i",". i also have a headache and am perspiring a lot . xxmaj additionally hurting me have been nausea and muscle soreness . xxbos i ca n't xxunk sneezing , and my nose is running . xxmaj i 'm also constantly chilly and exhausted , and xxmaj i 've been coughing a lot . xxmaj my fever is also really high , well above usual . xxbos xxmaj even when i do"
4,"pain . xxbos xxmaj i 've been feeling really ill lately . xxmaj i 've had this persistent cough and difficulty breathing , and my fever has been off the xxunk . xxmaj i 'm also feeling extremely exhausted , and xxmaj i 've been producing a lot of mucus when i cough . xxmaj it 's just been so overwhelming to deal with all of these symptoms xxbos xxmaj xxunk ,",". xxbos xxmaj i 've been feeling really ill lately . xxmaj i 've had this persistent cough and difficulty breathing , and my fever has been off the xxunk . xxmaj i 'm also feeling extremely exhausted , and xxmaj i 've been producing a lot of mucus when i cough . xxmaj it 's just been so overwhelming to deal with all of these symptoms xxbos xxmaj xxunk , a"


In [28]:
learn = language_model_learner(dls_lm, AWD_LSTM, metrics=[accuracy, Perplexity()], path=path, wd=0.1).to_fp16()

  wgts = torch.load(wgts_fname, map_location = lambda storage,loc: storage)


In [29]:
learn.fit_one_cycle(1, 1e-2)

  self.autocast,self.learn.scaler,self.scales = autocast(dtype=dtype),GradScaler(**self.kwargs),L()


epoch,train_loss,valid_loss,accuracy,perplexity,time
0,4.311601,3.737076,0.320819,41.975079,01:20


In [30]:
import os

# Create a directory to save the model
os.makedirs('/kaggle/working/models', exist_ok=True)

# Set the model directory for the learner
learn.model_dir = '/kaggle/working/models'

# Now save the model
learn.save('1epoch')

Path('/kaggle/working/models/1epoch.pth')

In [31]:
learn = learn.load('1epoch')

  state = torch.load(file, map_location=device)


In [32]:
learn.unfreeze()
learn.fit_one_cycle(5, 1e-3)

  self.autocast,self.learn.scaler,self.scales = autocast(dtype=dtype),GradScaler(**self.kwargs),L()


epoch,train_loss,valid_loss,accuracy,perplexity,time
0,3.585548,3.07199,0.369864,21.58482,01:21
1,3.273848,2.695683,0.406467,14.815641,01:23
2,3.033676,2.450205,0.466291,11.590721,01:21
3,2.849172,2.357353,0.477792,10.562959,01:17
4,2.71661,2.34009,0.478516,10.382172,01:22
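
The `perplexity` column is the exponential of the validation loss (the mean cross-entropy per token, in nats), so it can be read roughly as the number of equally likely next tokens the model is choosing between. The sketch below checks that relationship against a few rows copied from the training logs above, using only the standard library. The helper name `perplexity` is illustrative and is not the fastai metric class.

```python
import math

# (valid_loss, logged perplexity) pairs copied from the training logs above
reported = [
    (3.737076, 41.975079),  # single epoch with the body frozen
    (3.071990, 21.584820),  # first epoch after unfreezing
    (2.340090, 10.382172),  # last epoch after unfreezing
]


def perplexity(cross_entropy_loss: float) -> float:
    """Return exp(loss) for a mean cross-entropy loss measured in nats."""
    return math.exp(cross_entropy_loss)


for loss, logged in reported:
    print(f"valid_loss={loss:.6f}  exp(loss)={perplexity(loss):.4f}  logged={logged:.4f}")
```

Read this way, the drop in validation loss from about 3.74 to about 2.34 means the language model's uncertainty about the next token fell by roughly a factor of four.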


In [33]:
# Now save the model
learn.save_encoder('finetuned')

In [34]:
#learn.save_encoder('finetuned')

In [35]:
TEXT = "I have running nose, stomach and joint pains"
N_WORDS = 40
N_SENTENCES = 2
preds = [learn.predict(TEXT, N_WORDS, temperature=0.75) 
         for _ in range(N_SENTENCES)]

  self.autocast,self.learn.scaler,self.scales = autocast(dtype=dtype),GradScaler(**self.kwargs),L()


In [36]:
print("\n".join(preds))

i have running nose , stomach and joint pains . My nose has also been really uncomfortable . My skin has also been really weak , and my neck has been quite stiff . My neck hurts , and I 've been experiencing pain and
i have running nose , stomach and joint pains . i have a high fever and have a headache . i have been experiencing nausea , nausea , and blurred vision . My heart is racing , and i have severe headaches . i have a high
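
The `temperature=0.75` argument makes the sampled text a little more conservative than sampling from the raw next-token distribution. A minimal sketch of the general idea follows: raise each probability to the power `1/temperature` and renormalise, which is equivalent to dividing the logits by the temperature. The four-token distribution and the helper names are hypothetical, and fastai's internal implementation may differ in detail.

```python
import random


def apply_temperature(probs, temperature):
    """Sharpen (temperature < 1) or flatten (temperature > 1) a distribution."""
    scaled = [p ** (1.0 / temperature) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]


def sample_index(probs, temperature, rng):
    """Draw one token index from the temperature-adjusted distribution."""
    weights = apply_temperature(probs, temperature)
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


# Hypothetical next-token probabilities over a four-token vocabulary
next_token_probs = [0.5, 0.3, 0.15, 0.05]

sharpened = apply_temperature(next_token_probs, 0.75)
print([round(p, 3) for p in sharpened])

rng = random.Random(0)  # seeded only so the sketch is repeatable
print(sample_index(next_token_probs, 0.75, rng))
```

With a temperature below 1 the most likely token gains probability mass and the tail loses it, which is why the two generated continuations stay close to the phrasing of the training corpus.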


In [37]:
#symptom_df = pd.read_csv(path_lm/'symptom_synth.csv',index_col=0)
#sd_df = pd.read_csv(path_lm/'symptom_disease_no_id_col.csv')
sd_df.head()

Unnamed: 0,label,text
0,Psoriasis,"I have been experiencing a skin rash on my arms, legs, and torso for the past few weeks. It is red, itchy, and covered in dry, scaly patches."
1,Psoriasis,"My skin has been peeling, especially on my knees, elbows, and scalp. This peeling is often accompanied by a burning or stinging sensation."
2,Psoriasis,"I have been experiencing joint pain in my fingers, wrists, and knees. The pain is often achy and throbbing, and it gets worse when I move my joints."
3,Psoriasis,"There is a silver like dusting on my skin, especially on my lower back and scalp. This dusting is made up of small scales that flake off easily when I scratch them."
4,Psoriasis,"My nails have small dents or pits in them, and they often feel inflammatory and tender to the touch. Even there are minor rashes on my arms."


In [38]:
# Check for NaN values in the label column
print(sd_df['label'].isna().sum())

# If there are NaNs, you can drop those rows
#df = df.dropna(subset=['label'])

0


In [39]:
#dls_clas = TextDataLoaders.from_df(sd_df, path=path,valid='test', text_vocab=dls_lm.vocab)
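# Note: `valid='test'` is a `from_folder` argument; `from_df` splits randomly via `valid_pct` (default 0.2)
dls_clas = TextDataLoaders.from_df(sd_df, path=path, valid_pct=0.2, text_col='text', label_col='label', text_vocab=dls_lm.vocab)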

In [40]:
learn = text_classifier_learner(dls_clas, AWD_LSTM, drop_mult=0.5, metrics=accuracy)

  wgts = torch.load(wgts_fname, map_location = lambda storage,loc: storage)


In [41]:
from pathlib import Path
learn.path = Path('/kaggle/working')

In [42]:
learn = learn.load_encoder('finetuned')

  wgts = torch.load(join_path_file(file,self.path/self.model_dir, ext='.pth'), map_location=device)


In [43]:
len(dls_lm.vocab)

944

In [44]:
learn.fit_one_cycle(1, 2e-2)

epoch,train_loss,valid_loss,accuracy,time
0,2.283944,2.397299,0.429167,00:36


In [45]:
learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(1e-2/(2.6**4),1e-2))

epoch,train_loss,valid_loss,accuracy,time
0,1.555474,1.704521,0.604167,00:44
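
The `slice(1e-2/(2.6**4), 1e-2)` argument requests discriminative learning rates: the earliest layers get the smallest rate and the classifier head gets the largest. As far as I understand fastai's behaviour, a slice with both ends set is expanded into geometrically spaced rates, one per layer group. The sketch below reproduces that spacing in plain Python. The helper name is illustrative, and the figure of five layer groups (embedding, three LSTM layers, head) is an assumption about the AWD_LSTM classifier, not something confirmed in this notebook.

```python
def layer_group_rates(lowest, highest, n_groups):
    """Return n_groups learning rates spaced geometrically from lowest to highest."""
    if n_groups == 1:
        return [highest]
    step = (highest / lowest) ** (1 / (n_groups - 1))
    return [lowest * step**i for i in range(n_groups)]


# Assumption: 5 layer groups (embedding, three LSTM layers, classifier head)
rates = layer_group_rates(1e-2 / 2.6**4, 1e-2, n_groups=5)

for group, lr in enumerate(rates):
    print(f"group {group}: lr={lr:.6f}")
```

Under that assumption each group trains 2.6 times faster than the one before it, which explains the choice of `2.6**4` as the divisor: four steps of 2.6 separate the first group from the last.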


In [46]:
learn.unfreeze()
learn.fit_one_cycle(2, slice(1e-3/(2.6**4),1e-3))

epoch,train_loss,valid_loss,accuracy,time
0,1.095626,1.098281,0.766667,01:27
1,0.970842,0.886695,0.770833,01:30


In [47]:
learn.predict("I am having a running stomach, fever, general body weakness and have been getting bitten by mosquitoes often")

('Cervical spondylosis',
 tensor(3),
 tensor([0.0065, 0.0212, 0.0903, 0.1501, 0.0026, 0.0149, 0.0118, 0.0196, 0.0041,
         0.0693, 0.0030, 0.0262, 0.0833, 0.0780, 0.0925, 0.0110, 0.1041, 0.0084,
         0.0140, 0.0227, 0.0134, 0.0183, 0.1027, 0.0321]))
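
`learn.predict` returns only the single best class plus the full probability tensor, while the goal stated in the introduction is a top-3 list with confidences. A minimal sketch of that last step, in plain Python, using the probabilities printed above: only index 3 is known to be 'Cervical spondylosis' from this output, so the other labels are placeholders, and the helper name `top_k_predictions` is hypothetical.

```python
def top_k_predictions(probs, labels, k=3):
    """Return the k (label, probability) pairs with the highest probability."""
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Probabilities copied from the prediction output above
probs = [
    0.0065, 0.0212, 0.0903, 0.1501, 0.0026, 0.0149, 0.0118, 0.0196, 0.0041,
    0.0693, 0.0030, 0.0262, 0.0833, 0.0780, 0.0925, 0.0110, 0.1041, 0.0084,
    0.0140, 0.0227, 0.0134, 0.0183, 0.1027, 0.0321,
]

# Placeholder labels: only index 3 is known from the output above
labels = [f"class_{i}" for i in range(len(probs))]
labels[3] = "Cervical spondylosis"

top3 = top_k_predictions(probs, labels, k=3)
for label, p in top3:
    print(f"{label}: {p:.4f}")
```

In the notebook itself the real class names would come from the classifier's DataLoaders (probably `learn.dls.vocab[1]`, which is worth checking in the session), and the list passed in would be the third element returned by `learn.predict`, converted with `.tolist()`.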

In [None]:
path = Path('/kaggle/input/symptoms-disease-no-id')
#symptom_df = pd.read_csv(path_lm/'symptom_synth.csv',index_col=0)
symptom_df = pd.read_csv(path/'symptom_no_id.csv')
sd_df = pd.read_csv(path/'symptom_disease_no_id_col.csv')
dls_lm = TextDataLoaders.from_df(symptom_df, path=path,text_col='text', is_lm=True, valid_pct=0.2)
learn = language_model_learner(dls_lm, AWD_LSTM, metrics=[accuracy, Perplexity()], path=path, wd=0.1).to_fp16()
learn.fit_one_cycle(1, 1e-2)
# Create a directory to save the model
os.makedirs('/kaggle/working/models', exist_ok=True)
# Set the model directory for the learner
learn.model_dir = '/kaggle/working/models'
# Now save the model
learn.save('1epoch')
learn = learn.load('1epoch')
learn.unfreeze()
learn.fit_one_cycle(5, 1e-3)
# Now save the model
learn.save_encoder('finetuned')


# Fine-tune the classifier
# Build the classifier DataLoaders first, then the learner that consumes them
dls_clas = TextDataLoaders.from_df(sd_df, path=path,text_col='text',label_col='label', text_vocab=dls_lm.vocab)
learn = text_classifier_learner(dls_clas, AWD_LSTM, drop_mult=0.5, metrics=accuracy)
# Point the learner at the writable directory that holds models/finetuned.pth
learn.path = Path('/kaggle/working')
learn = learn.load_encoder('finetuned')
learn.fit_one_cycle(1, 2e-2)
learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(1e-2/(2.6**4),1e-2))
learn.unfreeze()
learn.fit_one_cycle(2, slice(1e-3/(2.6**4),1e-3))
learn.predict("I am having a running stomach, fever, general body weakness and have been getting bitten by mosquitoes often")

  wgts = torch.load(wgts_fname, map_location = lambda storage,loc: storage)
  self.autocast,self.learn.scaler,self.scales = autocast(dtype=dtype),GradScaler(**self.kwargs),L()


epoch,train_loss,valid_loss,accuracy,perplexity,time


In [None]:
# For language model
dls_lm = TextDataLoaders.from_df(
    symptom_df,
    text_col='text',
    is_lm=True,
    valid_pct=0.2
)

# Create and train language model
learn = language_model_learner(dls_lm, AWD_LSTM, metrics=[accuracy, Perplexity()], 
                              path=Path('/kaggle/working'), wd=0.1)
learn.fit_one_cycle(1, 1e-2)
learn.save_encoder('finetuned')

# For classifier
dls_clas = TextDataLoaders.from_df(
    sd_df,  # Your labeled dataset
    text_col='text',
    label_col='label',
    valid_pct=0.2,
    text_vocab=dls_lm.vocab
)

# Create classifier model
learn = text_classifier_learner(dls_clas, AWD_LSTM, drop_mult=0.5, 
                               metrics=accuracy, 
                               path=Path('/kaggle/working'))

# Load the fine-tuned encoder
learn.load_encoder('finetuned')

# Train classifier
learn.fit_one_cycle(1, 2e-2)