# Basic Medical Info: A Key to Predict Disease?

An approach with K-NN and support vector machines

## Table of Contents
- [Introduction](#Introduction)
- [Data Exploration](#Data-Exploration)
- [Model Selection](#Model-Selection)
- [Fine-Tuning](#Fine-tuning)
- [Model Testing](#Model-Testing)
- [Conclusion](#Conclusion)

## Introduction

Welcome! In this notebook, we'll harness the power of basic health indicators to predict diseases. Using the Disease Symptoms and Patient Profile Dataset, we aim to build a model that can accurately identify diseases based on basic symptoms and health indicators.

Whether you are a healthcare professional, a medical researcher, a data scientist, or simply a data enthusiast, this notebook will provide you with a comprehensive guide to predicting diseases using basic medical information. Let's dive in and explore the potential of this dataset. I really hope you learn new things in this notebook :).

## Data Exploration

First, let's start by importing the necessary library, `pandas`, which will help us in data manipulation and analysis. 

We then load our dataset using `pd.read_csv()` and finally, we use the `head()` function to display the first rows of our dataset. This gives us a glimpse of our data structure.

In [1]:
import pandas as pd
df = pd.read_csv('/kaggle/input/disease-symptoms-and-patient-profile-dataset/Disease_symptom_and_patient_profile_dataset.csv')
df.head()

Unnamed: 0,Disease,Fever,Cough,Fatigue,Difficulty Breathing,Age,Gender,Blood Pressure,Cholesterol Level,Outcome Variable
0,Influenza,Yes,No,Yes,Yes,19,Female,Low,Normal,Positive
1,Common Cold,No,Yes,Yes,No,25,Female,Normal,Normal,Negative
2,Eczema,No,Yes,Yes,No,25,Female,Normal,Normal,Negative
3,Asthma,Yes,Yes,No,Yes,25,Male,Normal,Normal,Positive
4,Asthma,Yes,Yes,No,Yes,25,Male,Normal,Normal,Positive


In [2]:
df.dtypes

Disease                 object
Fever                   object
Cough                   object
Fatigue                 object
Difficulty Breathing    object
Age                      int64
Gender                  object
Blood Pressure          object
Cholesterol Level       object
Outcome Variable        object
dtype: object

From the initial glance at our data, we can observe that most of our variables are categorical, with 'Age' being the only numerical variable. 

Categorical variables are often non-numeric and represent various categories or groups. In our case, these include the symptoms (fever, cough, fatigue, difficulty breathing), gender, blood pressure, cholesterol level, the outcome variable (Positive or Negative), and the disease itself.

Our target variable is 'Disease', which we are trying to predict. Let's explore this variable.

In [3]:
print(sum(df.Disease.value_counts() >= 1))
print(sum(df.Disease.value_counts() == 1))

116
61


In [4]:
print(sum(df.Disease.value_counts() > 9))
print(sum(df.Disease.value_counts() <= 9))

6
110


Upon examining the 'Disease' column, we notice a large number of unique diseases: 116 in total, of which 61 have a single sample and 110 have fewer than 10. For a reliable disease prediction model, this sample size is insufficient.

Predicting diseases with such limited information could lead to inaccurate results and misdiagnosis, which we want to avoid. Therefore, to ensure the robustness of our model, we will focus only on the diseases that have 10 or more samples. 

This decision will reduce the number of classes we are predicting down to 6, making our model more manageable and potentially more accurate.

In [5]:
df = df[df.groupby('Disease')['Disease'].transform('size') >= 10]

In [6]:
df.shape

(83, 10)
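The filter used above relies on `groupby(...).transform('size')`, which can look a bit cryptic at first. Here is a minimal sketch on a toy frame (the labels, ages, and the threshold of 2 are made up for illustration; the notebook uses 10):

```python
import pandas as pd

# Toy frame with hypothetical disease labels; counts are A=3, B=1, C=2.
toy = pd.DataFrame({"Disease": ["A", "A", "A", "B", "C", "C"],
                    "Age": [25, 30, 35, 40, 45, 50]})

# transform('size') returns, for every row, the size of that row's group,
# aligned with the original index, so it can be used directly as a mask.
group_size = toy.groupby("Disease")["Disease"].transform("size")

min_samples = 2  # illustrative threshold; the notebook uses 10
kept = toy[group_size >= min_samples]

print(group_size.tolist())               # [3, 3, 3, 1, 2, 2]
print(sorted(kept["Disease"].unique()))  # ['A', 'C']
```

Because the mask is row-aligned, no merge or second pass over the data is needed.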

Before we proceed with further analysis or model building, it's crucial to ensure the quality of our data. This involves checking for and handling missing values (NaNs) and duplicate entries.

1. **Missing Values**: Missing data can lead to misleading results and reduce the statistical power of the model. Therefore, we need to check if our dataset contains any NaN values.

2. **Duplicate Values**: Duplicate entries can bias the analysis by over-representing certain observations. Hence, it's important to identify and remove any duplicates in our dataset.


In [7]:
df.isna().sum()

Disease                 0
Fever                   0
Cough                   0
Fatigue                 0
Difficulty Breathing    0
Age                     0
Gender                  0
Blood Pressure          0
Cholesterol Level       0
Outcome Variable        0
dtype: int64

In [8]:
df.loc[df.duplicated()]

Unnamed: 0,Disease,Fever,Cough,Fatigue,Difficulty Breathing,Age,Gender,Blood Pressure,Cholesterol Level,Outcome Variable
4,Asthma,Yes,Yes,No,Yes,25,Male,Normal,Normal,Positive
35,Asthma,Yes,Yes,No,Yes,30,Female,Normal,Normal,Positive
59,Asthma,No,Yes,Yes,Yes,35,Female,High,Normal,Negative
76,Asthma,Yes,Yes,No,Yes,35,Male,Normal,Normal,Positive
123,Asthma,Yes,Yes,No,Yes,40,Female,Normal,Normal,Positive
126,Asthma,Yes,No,Yes,Yes,40,Male,Normal,High,Positive
182,Asthma,Yes,Yes,No,Yes,45,Male,Normal,Normal,Positive
267,Osteoporosis,Yes,No,Yes,No,55,Female,Normal,Normal,Positive
284,Osteoporosis,No,Yes,No,No,60,Male,High,High,Negative
308,Stroke,Yes,No,Yes,No,65,Female,High,Low,Negative


In [9]:
df = df.drop_duplicates().reset_index(drop= True)
df.shape

(69, 10)

Now that our data is cleaned and we've narrowed down our focus to diseases with 10 or more samples, let's visualize the distribution of these classes. Understanding the balance of classes is important as it can influence the performance of our machine learning model.

We'll use a pie chart for this purpose. Let's plot this chart and see how balanced our classes are.

In [10]:
import plotly.express as px

disease_counts = df['Disease'].value_counts().reset_index()
disease_counts.columns = ['Disease', 'Count']

fig = px.pie(disease_counts, 
             values= 'Count', 
             names= 'Disease', 
             color_discrete_sequence= px.colors.sequential.Reds_r, 
             title= 'Disease Distribution')

fig.update_traces(textinfo='percent+label')

fig.show()

From the pie chart, it's evident that our classes are imbalanced: Asthma has approximately 1.7 times as many samples as Hypertension, Diabetes, or Migraine. We will have to handle this class imbalance.

But before we proceed with that, let's first process our categorical variables. This will allow us to perform a univariate analysis, which involves the examination of one variable at a time. This analysis can provide valuable insights into the distribution and characteristics of our variables.

In [11]:
dicc = {'Yes':1, 'No':0, 'Low':1, 'Normal':2, 'High':3, 'Positive':1, 'Negative':0, 'Male':0, 'Female': 1}
def replace(x, dicc= dicc):
    if x in dicc:
        x = dicc[x]
    return x
df = df.applymap(replace)
df.head()

Unnamed: 0,Disease,Fever,Cough,Fatigue,Difficulty Breathing,Age,Gender,Blood Pressure,Cholesterol Level,Outcome Variable
0,Asthma,1,1,0,1,25,0,2,2,1
1,Asthma,1,0,0,1,28,0,3,2,1
2,Diabetes,0,0,0,0,29,0,1,2,0
3,Stroke,1,1,1,1,29,1,2,2,1
4,Migraine,1,0,0,0,30,1,2,2,0


In [12]:
df.dtypes

Disease                 object
Fever                    int64
Cough                    int64
Fatigue                  int64
Difficulty Breathing     int64
Age                      int64
Gender                   int64
Blood Pressure           int64
Cholesterol Level        int64
Outcome Variable         int64
dtype: object
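As an aside, the same encoding can be written with one mapping per column instead of a single shared dictionary. This is only a sketch on toy rows (the data below is hypothetical), but it has the advantage that an unexpected value turns into `NaN` rather than silently staying a string:

```python
import pandas as pd

# Toy rows with the same kind of values as the dataset (hypothetical data).
toy = pd.DataFrame({
    "Fever": ["Yes", "No", "Yes"],
    "Blood Pressure": ["Low", "High", "Normal"],
    "Gender": ["Male", "Female", "Female"],
})

# One mapping per kind of column: binary, ordinal, and gender.
binary = {"Yes": 1, "No": 0}
ordinal = {"Low": 1, "Normal": 2, "High": 3}
gender = {"Male": 0, "Female": 1}

encoded = toy.copy()
encoded["Fever"] = toy["Fever"].map(binary)
encoded["Blood Pressure"] = toy["Blood Pressure"].map(ordinal)
encoded["Gender"] = toy["Gender"].map(gender)

print(encoded.values.tolist())  # [[1, 1, 0], [0, 3, 1], [1, 2, 1]]
```

Keeping the ordinal mapping separate also makes it explicit that Low < Normal < High is an ordering we are choosing to encode.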

Having converted our categorical data into numerical format, we are now ready to perform univariate analysis. This analysis will help us understand the distribution of our variables and their individual impact on the disease prediction.

We'll start with the 'Age' variable. Age is a crucial factor in many diseases, and understanding its distribution and relationship with various diseases can provide valuable insights. 

Following that, we'll examine the other variables one by one. Each of these variables - symptoms, gender, blood pressure, and cholesterol level - could potentially play a significant role in disease prediction. By analyzing them individually, we can gain a deeper understanding of their characteristics and importance.


In [13]:
fig = px.histogram(df, 
             x = 'Age',  
             title='Age-Disease Distribution',
             color= 'Disease'
                  )
fig.update_layout(bargap=0.2)

fig.show()

From our univariate analysis of the 'Age' variable, we observe some interesting patterns. 

1. If the age is greater than 80, the disease is likely to be a stroke. This aligns with the general understanding that the risk of stroke increases with age.

2. Migraine and Hypertension are not present in ages between 20 and 30. This could suggest that these conditions are more prevalent in older age groups.

3. Hypertension and Osteoporosis appear more frequently as the age increases, indicating a potential correlation between these diseases and age.

These observations suggest that age is a valuable feature for predicting certain diseases. However, it's important to note that our dataset has limited samples, especially for ages greater than 80. This could make predicting new values in this age range challenging.

Next, let's analyze how the other variables interact with different diseases. This will help us understand their potential as predictors and identify any patterns or correlations.

In [14]:
import plotly.subplots as sp

def subplots(df, columns):
    fig = sp.make_subplots(rows=2, cols=2, subplot_titles= columns)
    for idx, column in enumerate(columns):
        i = idx // 2 + 1 
        j = idx % 2 + 1  
        fig_express = px.histogram(df, x=column, title=column + '-Disease Distribution', color='Disease')
        for trace in fig_express.data:
            fig.add_trace(trace, row=i, col=j)
    fig.update_layout(height=600, width=800, title_text="")
    fig.show()
subplots(df, df.columns[1:5])

In [15]:
subplots(df, df.columns[6:])

Upon visual inspection of the other variables, we can observe significant differences in disease prediction based on each feature's values. For instance, whether a person has high, normal, or low cholesterol levels can significantly influence the prediction of a disease. This is consistent with real-world observations where these variables often vary among different diseases.

Some valuable insights we can glean from this analysis include:

1. In this sample, no patient with low blood pressure has a stroke. This could be a useful signal for stroke prediction, although the sample is small.

2. Fatigue, cholesterol level, and blood pressure are the features that show the most variation among different values. These could potentially be strong predictors in our model.

These observations underscore the importance of these variables in predicting diseases. 

Next, let's examine how these variables correlate with each other. Understanding the relationships between different variables can help us identify patterns and potential multicollinearity, which could influence our model's performance.

We'll use the `LabelEncoder` from `sklearn` to convert the 'Disease' labels, our last remaining non-numeric column, into integer codes for this correlation analysis.

In [16]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df.Disease = le.fit_transform(df.Disease)
df.head()

Unnamed: 0,Disease,Fever,Cough,Fatigue,Difficulty Breathing,Age,Gender,Blood Pressure,Cholesterol Level,Outcome Variable
0,0,1,1,0,1,25,0,2,2,1
1,0,1,0,0,1,28,0,3,2,1
2,1,0,0,0,0,29,0,1,2,0
3,5,1,1,1,1,29,1,2,2,1
4,3,1,0,0,0,30,1,2,2,0
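To make the encoding above easier to read, here is a small sketch of how `LabelEncoder` assigns codes and how `inverse_transform` maps them back. The label list is hypothetical and only there to illustrate the behavior:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels, chosen only to illustrate the encoding.
labels = ["Stroke", "Asthma", "Migraine", "Asthma"]

le_demo = LabelEncoder()
codes = le_demo.fit_transform(labels)

# Classes are sorted alphabetically, then numbered from 0.
print(list(le_demo.classes_))                  # ['Asthma', 'Migraine', 'Stroke']
print(codes.tolist())                          # [2, 0, 1, 0]
print(list(le_demo.inverse_transform(codes)))  # back to the original strings
```

The alphabetical ordering is why Asthma becomes 0 and Stroke becomes 5 in the table above. The integers carry no ordering between diseases.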


In [17]:
corr_df = df.corr()
fig = px.imshow(corr_df, color_continuous_scale= px.colors.sequential.Blues)
fig.show()

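From the correlation graph, we can observe that none of the variables strongly correlate with the 'Disease' variable. The most correlated variables are 'Age' and 'Difficulty Breathing', but even these only score 0.4 and -0.4 respectively. Keep in mind that 'Disease' is a nominal label encoded as arbitrary integers, so these coefficients are only a rough guide.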

In situations where we have multiple variables with low correlation scores, machine learning can be a viable alternative for prediction tasks. However, it's important to note that machine learning algorithms, especially deep learning ones, typically require large amounts of data to perform optimally.

In our case, we have only 69 data points, which is relatively small. Furthermore, we are dealing with a multi-class problem with few examples for each disease, which adds to the complexity.

Given these constraints, we will try two machine learning algorithms that can perform well without a lot of data: K-Nearest Neighbors (K-NN) and Support Vector Machines (SVM). We will fine-tune these models and select the one that performs best.

Let's proceed with data preprocessing and fit our data into these algorithms.

## Model Selection

In [18]:
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline

X = df.drop(['Disease'], axis= 1).values
y = df.Disease.values

In [19]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size= 0.4, shuffle= True, stratify= y, random_state=30)
X_val, X_test, y_val, y_test = train_test_split(X_val, y_val, test_size= 0.5, shuffle= True, stratify= y_val, random_state=30)
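The two calls above give a roughly 60/20/20 train/validation/test split. As a rough sketch of the resulting sizes, assuming `train_test_split` rounds a fractional `test_size` up and gives the remainder to the training part (which is my understanding of scikit-learn's behavior), we can work it out by hand. `split_sizes` is a helper written for this illustration, not a library function:

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumes the test part is rounded up and the training part
    takes whatever is left.
    """
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

n_total = 69                                    # rows left after cleaning
n_train, n_holdout = split_sizes(n_total, 0.4)  # first split
n_val, n_test = split_sizes(n_holdout, 0.5)     # second split on the holdout

print(n_train, n_val, n_test)  # 41 14 14
```

With six classes, that leaves about seven training samples per class on average, so every score that follows should be read with that in mind.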

In [20]:
svc_pipe = Pipeline([('scaler', StandardScaler()), ('svc', SVC(class_weight= 'balanced'))])
knn_pipe = Pipeline([('scaler', StandardScaler()), ('knn', KNeighborsClassifier())])
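The `class_weight='balanced'` option in the SVC pipeline is how we handle the class imbalance seen earlier. According to the scikit-learn documentation, it weights each class by `n_samples / (n_classes * count)`. The sketch below reproduces that formula in plain Python on a hypothetical label vector; `balanced_class_weights` is a name made up for this example:

```python
from collections import Counter

def balanced_class_weights(y):
    """Weights as n_samples / (n_classes * count), the 'balanced' heuristic."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}

# Hypothetical label vector: class 0 is twice as frequent as class 1.
y_demo = [0, 0, 0, 0, 1, 1]
print(balanced_class_weights(y_demo))  # {0: 0.75, 1: 1.5}
```

Rare classes get a larger weight, so misclassifying them costs more during training. Note that K-NN has no equivalent option, so only the SVC benefits from this here.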

In [21]:
svc_pipe.fit(X_train, y_train)
knn_pipe.fit(X_train, y_train)

In [22]:
ysvc_pred = svc_pipe.predict(X_val)
yknn_pred = knn_pipe.predict(X_val)


Given our problem of multi-class classification with imbalanced classes, the F1 score (macro-averaged) is an appropriate choice. The F1 score is the harmonic mean of precision and recall, and it gives a better measure of the incorrectly classified cases than the accuracy metric. 

The macro-averaged F1 score calculates the F1 score for each class independently and then takes the average. This treats all classes equally, regardless of their imbalance, which is exactly what we need for our problem.

Our goal is to maximize this F1 score.
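To make the macro average concrete, here is a small from-scratch sketch. The labels are hypothetical, and `macro_f1` is a helper written for this illustration (in the notebook we use `sklearn.metrics.f1_score`). It shows how a model that only gets the majority class right can look fine on accuracy and poor on macro F1:

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1; a class with no hits scores 0."""
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical labels: the majority class is predicted well, the rest poorly.
y_true = ["a", "a", "a", "a", "b", "c"]
y_pred = ["a", "a", "a", "a", "a", "b"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(round(accuracy, 3))                  # 0.667
print(round(macro_f1(y_true, y_pred), 3))  # 0.296
```

Class "a" scores 8/9, while "b" and "c" both score 0, so the average drops to about 0.30 even though two thirds of the predictions are correct.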

In [23]:
from sklearn.metrics import f1_score, accuracy_score, recall_score
models = ['Base_KNN', 'Base_SVC']

# Macro-averaged metrics on the validation split; zero_division=0 scores a class with no predictions as 0
f1 = [f1_score(y_val, yknn_pred, average= 'macro', zero_division= 0), f1_score(y_val, ysvc_pred, average= 'macro', zero_division= 0)]
accuracy = [accuracy_score(y_val, yknn_pred), accuracy_score(y_val, ysvc_pred)]
recall = [recall_score(y_val, yknn_pred, average= 'macro'), recall_score(y_val, ysvc_pred, average= 'macro')]

# One row per model, one column per metric
metrics_df = pd.DataFrame({'Models': models, 'f1': f1, 'Accuracy': accuracy, 'Recall': recall})

In [24]:
fig = px.bar(metrics_df, x='Models', y= ['f1', 'Accuracy', 'Recall'], barmode= 'group', color_discrete_sequence= px.colors.sequential.RdBu_r)
fig.show()

After training our K-Nearest Neighbors (K-NN) and Support Vector Machines (SVM) models, we observe that SVM significantly outperforms K-NN in terms of the macro-averaged F1 score. 

Given this performance difference, it makes sense to focus our efforts on the SVM model. We will proceed with fine-tuning this model to see if we can further improve the F1 score. 

## Fine-tuning

In [25]:
from sklearn.metrics import make_scorer
f1_scorer = make_scorer(f1_score, average='macro', zero_division=0)
parameters = {
    'svc__C': [0.1, 1, 10],
    'svc__kernel': ['linear', 'rbf', 'poly', 'sigmoid'],
    'svc__gamma': ['scale', 'auto'],
    'svc__shrinking': [True, False],
    'svc__decision_function_shape': ['ovo', 'ovr']
}

grid_search = GridSearchCV(svc_pipe, parameters, cv=5, scoring= f1_scorer)

grid_search.fit(X_train, y_train)

print("Best Score: ", grid_search.best_score_)
print("Best Params: ", grid_search.best_params_)

best_clf = grid_search.best_estimator_

Best Score:  0.32111111111111107
Best Params:  {'svc__C': 1, 'svc__decision_function_shape': 'ovo', 'svc__gamma': 'scale', 'svc__kernel': 'rbf', 'svc__shrinking': True}
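It helps to know how large this search actually is. The sketch below copies the grid from the cell above (so it stands alone) and counts the combinations with `itertools.product`:

```python
from itertools import product

# Same grid as in the search above, copied here so the snippet stands alone.
grid = {
    "svc__C": [0.1, 1, 10],
    "svc__kernel": ["linear", "rbf", "poly", "sigmoid"],
    "svc__gamma": ["scale", "auto"],
    "svc__shrinking": [True, False],
    "svc__decision_function_shape": ["ovo", "ovr"],
}

candidates = list(product(*grid.values()))
n_folds = 5

print(len(candidates))            # 96 parameter combinations
print(len(candidates) * n_folds)  # 480 model fits during the search
```

With such a small training set, each of the five validation folds holds only a handful of samples, so small differences between mean test scores should not be over-interpreted.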


In [26]:
results = pd.DataFrame(grid_search.cv_results_)
results = results[['param_svc__C', 'param_svc__kernel', 'param_svc__gamma', 'param_svc__shrinking',
                  'param_svc__decision_function_shape', 'mean_test_score']]
results

Unnamed: 0,param_svc__C,param_svc__kernel,param_svc__gamma,param_svc__shrinking,param_svc__decision_function_shape,mean_test_score
0,0.1,linear,scale,True,ovo,0.164444
1,0.1,linear,scale,False,ovo,0.164444
2,0.1,rbf,scale,True,ovo,0.079899
3,0.1,rbf,scale,False,ovo,0.079899
4,0.1,poly,scale,True,ovo,0.074074
...,...,...,...,...,...,...
91,10,rbf,auto,False,ovr,0.266667
92,10,poly,auto,True,ovr,0.215556
93,10,poly,auto,False,ovr,0.215556
94,10,sigmoid,auto,True,ovr,0.256667


In addition to the kernel and C parameters, we will analyze the significance of other hyperparameters in achieving the highest F1 score. While the kernel and C parameters are known to have a significant impact on SVM performance, it's important to consider the influence of other hyperparameters as well.

We will specifically investigate the impact of hyperparameters such as the shrinking and the decision function shape.

Analyzing the relationship between these hyperparameters and the F1 score will provide us with a more comprehensive understanding of the model's behavior.

Let's proceed with the hyperparameter analysis and determine the optimal values for achieving the highest F1 score.

In [27]:
from plotly.subplots import make_subplots
import plotly.graph_objects as go
grouped = results.groupby(['param_svc__C', 'param_svc__kernel', 'param_svc__shrinking'])['mean_test_score'].mean().reset_index()

pivot_table_true = grouped[grouped['param_svc__shrinking'] == True].pivot(
    index='param_svc__C', columns='param_svc__kernel', values='mean_test_score')

pivot_table_false = grouped[grouped['param_svc__shrinking'] == False].pivot(
    index='param_svc__C', columns='param_svc__kernel', values='mean_test_score')

fig = make_subplots(rows=1, cols=2)

fig.add_trace(
    go.Heatmap(z=pivot_table_true.values, x=pivot_table_true.columns, 
               y=pivot_table_true.index, colorscale='RdBu', showscale=False),
    row=1, col=1
)

fig.add_trace(
    go.Heatmap(z=pivot_table_false.values, x=pivot_table_false.columns, 
               y=pivot_table_false.index, colorscale='RdBu'),
    row=1, col=2
)

fig.update_layout(
    height=500, width=1000, title_text="Mean Test Score for SVC shrinking Hyperparameter",
    annotations=[
        go.layout.Annotation(
            text="Shrinking: True",
            xref="paper", yref="paper",
            x=0.25, y=1.07, showarrow=False,
            font=dict(size=14,)
        ),
        go.layout.Annotation(
            text="Shrinking: False",
            xref="paper", yref="paper",
            x=0.75, y=1.07, showarrow=False,
            font=dict(size=14,)
        )
    ]
)
fig.update_xaxes(title_text="Kernel", row=1, col=1)
fig.update_xaxes(title_text="Kernel", row=1, col=2)
fig.update_yaxes(title_text="C", row=1, col=1)
fig.update_yaxes(title_text="C", row=1, col=2)

fig.show()




Here we can see that the shrinking hyperparameter makes no meaningful difference: for every combination of kernel and C, the mean test score is the same whether shrinking is True or False. This is not surprising, since shrinking is a solver heuristic that mainly affects training time rather than the fitted model. Now let's see whether the decision function shape makes a difference.

In [28]:
grouped = results.groupby(['param_svc__C', 'param_svc__kernel', 'param_svc__decision_function_shape'])['mean_test_score'].mean().reset_index()

pivot_table_true = grouped[grouped['param_svc__decision_function_shape'] == 'ovo'].pivot(
    index='param_svc__C', columns='param_svc__kernel', values='mean_test_score')

pivot_table_false = grouped[grouped['param_svc__decision_function_shape'] == 'ovr'].pivot(
    index='param_svc__C', columns='param_svc__kernel', values='mean_test_score')

fig = make_subplots(rows=1, cols=2)

fig.add_trace(
    go.Heatmap(z=pivot_table_true.values, x=pivot_table_true.columns, 
               y=pivot_table_true.index, colorscale= px.colors.sequential.Cividis_r, showscale=False),
    row=1, col=1
)

fig.add_trace(
    go.Heatmap(z=pivot_table_false.values, x=pivot_table_false.columns, 
               y=pivot_table_false.index, colorscale= px.colors.sequential.Cividis_r),
    row=1, col=2
)

fig.update_layout(
    height=500, width=1000, title_text="Mean Test Score for SVC decision function Hyperparameter",
    annotations=[
        go.layout.Annotation(
            text="Decision function: ovo",
            xref="paper", yref="paper",
            x=0.25, y=1.07, showarrow=False,
            font=dict(size=14,)
        ),
        go.layout.Annotation(
            text="Decision function: ovr",
            xref="paper", yref="paper",
            x=0.75, y=1.07, showarrow=False,
            font=dict(size=14,)
        )
    ]
)
fig.update_xaxes(title_text="Kernel", row=1, col=1)
fig.update_xaxes(title_text="Kernel", row=1, col=2)
fig.update_yaxes(title_text="C", row=1, col=1)
fig.update_yaxes(title_text="C", row=1, col=2)

fig.show()




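Here we observe the same result as with the shrinking hyperparameter: the score does not change between OVO (One-vs-One) and OVR (One-vs-Rest). This is expected, because scikit-learn's `SVC` always trains one-vs-one classifiers internally, and `decision_function_shape` only changes the shape of the decision function output. Now, let's plot how the score changes when we modify the kernel and the C parameter.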
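Before moving on, here is a quick sanity check of the `decision_function_shape` result. This is only a sketch on synthetic data (`make_blobs` with four classes stands in for our dataset), but it suggests why the scores are identical: the predicted labels match, and only the shape of the decision function differs.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Synthetic four-class data, used only for illustration.
X_demo, y_demo = make_blobs(n_samples=80, centers=4, random_state=0)

ovo = SVC(decision_function_shape="ovo").fit(X_demo, y_demo)
ovr = SVC(decision_function_shape="ovr").fit(X_demo, y_demo)

# The predicted labels agree between the two settings.
same_predictions = bool((ovo.predict(X_demo) == ovr.predict(X_demo)).all())
print(same_predictions)

# Only the decision function output differs in shape.
print(ovo.decision_function(X_demo).shape)  # one column per pair of classes
print(ovr.decision_function(X_demo).shape)  # one column per class
```

With four classes there are six class pairs, so the `ovo` output has six columns and the `ovr` output has four. In a future search we could drop this parameter (and probably `shrinking`) from the grid and spend the budget on C, kernel, and gamma instead.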

In [29]:
fig = px.line(results, x="param_svc__C", y="mean_test_score", 
              color="param_svc__kernel", 
              line_group="param_svc__shrinking", 
              hover_name="param_svc__decision_function_shape",
              labels={"mean_test_score": "Mean Test Score", "param_svc__C": "C"},
              title="Mean Test Score for each SVC Parameter")
fig.show()

We can observe a significant effect of the kernel and C parameter on the mean test scores. The scores vary widely, from below 0.1 for the weakest configurations to about 0.32 for the best one reported by the grid search. This emphasizes the importance of these parameters in the SVC (Support Vector Classifier).

Lastly, we will use the best model to predict our test data and report the final results. Additionally, we will examine the samples that were classified incorrectly. Let's proceed with these tasks.

## Model Testing

In [30]:
y_pred_test = best_clf.predict(X_test)
print('Test score with best model: ', f1_score(y_test, y_pred_test, average= 'macro', zero_division= 0))

Test score with best model:  0.3611111111111111


In [31]:
# Rebuild a frame with the test features plus the decoded actual and predicted labels
X_test_df = pd.DataFrame(X_test)
test_df = pd.DataFrame({'actual': le.inverse_transform(y_test), 'predicted': le.inverse_transform(y_pred_test)})
test_df = pd.concat([X_test_df, test_df], axis=1)

# Keep only the rows where the prediction is wrong
misclassified = test_df[test_df['actual'] != test_df['predicted']]

print(misclassified)

    0  1  2  3   4  5  6  7  8        actual     predicted
0   0  0  1  0  65  0  2  3  0      Diabetes  Osteoporosis
1   1  1  1  0  35  1  3  2  0  Hypertension      Migraine
2   1  0  1  0  55  1  2  2  1  Osteoporosis  Hypertension
4   1  0  1  0  45  1  3  3  1      Diabetes  Hypertension
6   0  0  1  0  45  0  2  2  0        Stroke  Osteoporosis
8   0  1  1  0  35  0  3  3  1      Migraine  Osteoporosis
10  1  1  0  0  52  0  2  1  0  Hypertension  Osteoporosis
11  1  0  1  1  55  0  3  1  1  Osteoporosis      Diabetes
13  0  0  0  1  31  0  1  2  0  Osteoporosis      Diabetes


Here, we can observe that this model performs well in predicting asthma cases but performs poorly in predicting other conditions in general. This suggests that we can use this model with a one-vs-all approach, where one class represents asthma, and the model can be used as a second opinion to determine if a person has asthma or not.
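A per-class view makes this kind of observation easier to quantify than scanning the misclassified rows. Below is a minimal sketch of per-class recall; the labels are hypothetical (not our test set), and `per_class_recall` is a helper written for this example:

```python
from collections import Counter

def per_class_recall(actual, predicted):
    """Fraction of each actual class that was predicted correctly."""
    totals = Counter(actual)
    hits = Counter(a for a, p in zip(actual, predicted) if a == p)
    return {label: hits[label] / totals[label] for label in sorted(totals)}

# Hypothetical labels for illustration, not the notebook's test set.
actual = ["Asthma", "Asthma", "Asthma", "Stroke", "Stroke", "Migraine"]
predicted = ["Asthma", "Asthma", "Asthma", "Stroke", "Asthma", "Stroke"]

print(per_class_recall(actual, predicted))
# {'Asthma': 1.0, 'Migraine': 0.0, 'Stroke': 0.5}
```

Applied to `le.inverse_transform(y_test)` and `le.inverse_transform(y_pred_test)`, the same function would show which diseases the model recovers and which it misses entirely.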

It's important to note that asthma is the most frequent class in the training data used for this model. However, even with this imbalance, we have a limited number of samples. Therefore, we can consider implementing data augmentation techniques to see if the model can improve its accuracy in predicting the other diseases.

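By augmenting the data, we can generate additional samples. Since this is tabular data, suitable techniques include oversampling the less frequent classes, interpolating between existing samples (as SMOTE does), or adding small noise to numeric features such as age. This can potentially help the model generalize better and improve its performance in predicting the less frequent diseases.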

Overall, it is worth exploring approaches such as data augmentation to improve the model's predictions across a wider range of conditions.
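As a starting point, one of the simplest ways to get more rows for the rare classes is random oversampling. The sketch below is an illustration only: `random_oversample` is a hypothetical helper, the rows are made up, and this is not something we ran in this notebook. It duplicates minority-class rows at random until every class is as large as the biggest one:

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=0):
    """Duplicate minority-class rows at random until all classes match the largest."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(group) for group in by_class.values())

    out_rows, out_labels = [], []
    for label, group in by_class.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        out_rows.extend(group + extra)
        out_labels.extend([label] * target)
    return out_rows, out_labels

# Hypothetical (age, fever) rows; apply this to the training split only.
rows = [(25, 1), (30, 1), (35, 0), (60, 0), (70, 1)]
labels = ["Asthma", "Asthma", "Asthma", "Stroke", "Migraine"]

new_rows, new_labels = random_oversample(rows, labels)
print(Counter(new_labels))  # every class now has 3 rows
```

Oversampling should only touch the training split, otherwise copies of the same row can end up in both training and test data and inflate the score. It also adds no new information, so it is unlikely to replace collecting more data.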

## Conclusion

In conclusion, this notebook explored the prediction of diseases using basic medical information. The model achieved an F1 macro average score of 0.3611 for the six classes, indicating room for improvement in accurately predicting diseases with basic medical information alone.

While the current model's performance may be limited, there are opportunities for further exploration and enhancement. Collecting diverse data, employing feature engineering techniques, exploring alternative algorithms, fine-tuning hyperparameters, and seeking domain expertise can contribute to improving the model's accuracy and reliability.

Thank you for taking the time to read this notebook. If you found it informative and would like to learn more, I invite you to visit my profile and explore other notebooks I have created. Your continued support and interest are greatly appreciated. Hope you enjoyed it :)
