# Diabetes Prediction

**Alex Green and Sirajulhaq Wahaj**  
Faculty of Technology and Society, Malmö University  

---

## Abstract

Predicting disease has long been a difficult task in the medical field. With the rise of computer technology and machine learning, better tools are now available to analyze health data and predict conditions more accurately. Diabetes is one disease that can benefit from these advancements, as early detection is crucial for managing the condition. In this project, we will explore how machine learning models can predict diabetes in its early stages. We will compare three models: Logistic Regression, K-Nearest Neighbors, and Decision Trees. The best model will be selected based on its performance on measures such as accuracy, precision, and recall, together with its confusion matrix.

---

## Introduction

Diabetes mellitus is a group of common endocrine diseases defined by sustained high blood sugar levels. Diabetes is caused by either the pancreas not producing enough insulin or the cells of the body becoming unresponsive to the effect of the hormone.

There are three main types of diabetes:

- **Type 1 Diabetes**: An autoimmune disease where the body attacks insulin-producing cells in the pancreas.
- **Type 2 Diabetes**: Typically caused by lifestyle factors leading to insulin resistance or insufficient insulin production.
- **Gestational Diabetes**: Occurs during pregnancy and increases the risk of developing Type 2 diabetes later in life.

Untreated diabetes can lead to complications later in life, such as heart and blood vessel disease, nerve damage, kidney damage, eye damage, foot damage, and Alzheimer’s disease. Detecting and treating diabetes early is therefore crucial and can substantially change a patient's life.

---

## Project Objectives

The objective of this study is divided into the following steps:

1. Clean the data of irregularities.
2. Apply machine learning models to the dataset: Decision Tree, Logistic Regression, and K-Nearest Neighbors.
3. Compare the models based on prediction accuracy.
4. Evaluate the models.

The models will be implemented from scratch in Python.

---

## Methodology

### Dataset Selection

For this project, we will use the Pima Indians Diabetes Database, which is available on Kaggle and originates from the National Institute of Diabetes and Digestive and Kidney Diseases.

The dataset consists of 768 instances, each with 8 predictor attributes and one dependent variable, Outcome, indicating whether a person has diabetes (1) or not (0). The predictor attributes are:

- **Pregnancies**: Number of times pregnant.
- **Glucose**: Plasma glucose concentration, measured 2 hours into an oral glucose tolerance test.
- **Blood Pressure**: Diastolic blood pressure (mm Hg).
- **Skin Thickness**: Triceps skinfold thickness (mm).
- **Insulin**: 2-hour serum insulin (μU/mL).
- **BMI**: Body mass index (weight in kg / (height in m)$^2$).
- **Diabetes Pedigree Function**: A function that scores the likelihood of diabetes based on family history.
- **Age**: Age in years.

---

### Dataset Pre-processing

- **Handling missing values**: The dataset contains some missing values, particularly for the insulin variable. Missing values are handled by replacing them with the median of the respective columns.
- **Outlier Detection**: Outliers are detected using the IQR method and treated accordingly to ensure model robustness.
- **Categorical Encoding**: The target variable (Outcome) is encoded as 1 (diabetes) and 0 (no diabetes).
- **Feature Scaling**: Both min-max normalization and standardization techniques are applied to the dataset for comparison during model training.
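
The pre-processing steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the project's actual code: it assumes missing readings are stored as `0` (a common convention for this dataset), and it treats outliers by clipping them to the IQR fences, which is only one possible reading of "treated accordingly". All function names are hypothetical.

```python
import statistics


def impute_median(column, missing=0):
    """Replace entries equal to `missing` with the median of the observed entries."""
    observed = [v for v in column if v != missing]
    median = statistics.median(observed)
    return [median if v == missing else v for v in column]


def iqr_bounds(column, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(column, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def clip_outliers(column, k=1.5):
    """Clip values outside the IQR fences (assumed treatment; dropping rows is another option)."""
    lower, upper = iqr_bounds(column, k)
    return [min(max(v, lower), upper) for v in column]


def min_max_scale(column):
    """Min-max normalization: map values linearly onto [0, 1]."""
    lowest, highest = min(column), max(column)
    span = highest - lowest
    return [(v - lowest) / span if span else 0.0 for v in column]


def standardize(column):
    """Standardization: subtract the mean and divide by the population standard deviation."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    return [(v - mean) / std if std else 0.0 for v in column]


# Hypothetical insulin readings; 0 marks a missing value.
insulin = [0, 94, 168, 0, 88, 543]
filled = impute_median(insulin)
print(min_max_scale(clip_outliers(filled)))
print(standardize(clip_outliers(filled)))
```

Each function operates on a single column, so the same helpers can be applied to every attribute in turn, and the two scaling variants can be compared during model training.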

---

### Feature Selection

The table below displays the accuracy of models before and after feature selection:

| **Model**           | **Accuracy before feature selection (%)** | **Accuracy after feature selection (%)** |
|---------------------|-------------------------------------------|------------------------------------------|
| Decision Tree       | 64                                        | 70                                       |
| K-Nearest Neighbors | 64                                        | 66                                       |
| Logistic Regression | 66                                        | 64                                       |

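Dropping features changed accuracy by at most 6 percentage points, and not in a consistent direction: the Decision Tree improved from 64% to 70% and K-Nearest Neighbors from 64% to 66%, while Logistic Regression fell from 66% to 64%. Because the effect is small and inconsistent, keeping all features may be the safer choice to reduce the risk of bias.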
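
For reference, accuracy as reported in the table, along with the precision, recall, and confusion matrix named in the abstract, can be computed from true and predicted labels as in the sketch below. The function names are hypothetical, and the labels in the example are made up for illustration; they are not results from this project.

```python
def confusion_matrix(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels, with 1 (diabetes) as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    """Fraction of all predictions that are correct."""
    tp, fp, fn, tn = confusion_matrix(y_true, y_pred)
    total = tp + fp + fn + tn
    return (tp + tn) / total if total else 0.0


def precision(y_true, y_pred):
    """Fraction of predicted positives that are truly positive."""
    tp, fp, _, _ = confusion_matrix(y_true, y_pred)
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(y_true, y_pred):
    """Fraction of true positives that the model finds."""
    tp, _, fn, _ = confusion_matrix(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Made-up labels for illustration only.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
print(confusion_matrix(y_true, y_pred))
print(accuracy(y_true, y_pred), precision(y_true, y_pred), recall(y_true, y_pred))
```

In a screening setting, recall is often the measure to watch, since a false negative means a person with diabetes goes undetected.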

---

## Model Descriptions

### Decision Tree

The Decision Tree is implemented as an ID3 tree, which chooses each split by maximizing Information Gain. The Information Gain of a split is the entropy of the dataset minus the weighted sum of the entropies of the subsets the split produces, where each subset is weighted by its share of the samples.
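
The Information Gain computation can be sketched as follows, weighting each subset's entropy by its share of the samples as in standard ID3. The function names are hypothetical. Because the attributes in this dataset are numeric, the sketch also assumes a binary threshold split; how the project actually discretizes continuous attributes is not stated here.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(labels, subsets):
    """Entropy of the parent labels minus the size-weighted entropy of the subsets."""
    n = len(labels)
    weighted = sum(len(s) / n * entropy(s) for s in subsets)
    return entropy(labels) - weighted


def split_labels_by_threshold(values, labels, threshold):
    """Assumed binary split for a numeric attribute: value <= threshold goes left."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    return left, right


# Hypothetical glucose readings and outcomes.
glucose = [85, 90, 150, 160]
outcome = [0, 0, 1, 1]
left, right = split_labels_by_threshold(glucose, outcome, threshold=120)
print(information_gain(outcome, [left, right]))
```

A tree builder would evaluate `information_gain` for each candidate attribute (and threshold) at a node and split on the one with the highest gain.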

---

### K-Nearest Neighbor

K-Nearest Neighbors (KNN) is a supervised machine learning method. To classify a point from the test set, it computes the Euclidean distance from that point to every point in the training set, selects the *k* closest training points, and assigns the class that is most common among them.
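
A minimal sketch of KNN classification is shown below. It assumes a majority vote among the nearest training points and a default of `k=3`; neither the voting rule nor the value of *k* used in the project is specified here, and the function names and sample points are hypothetical.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(X_train, y_train, x, k=3):
    """Predict the label of x by majority vote among its k nearest training points."""
    # Pair each training label with its distance to x, then sort by distance.
    by_distance = sorted(
        ((euclidean(row, x), label) for row, label in zip(X_train, y_train)),
        key=lambda pair: pair[0],
    )
    votes = Counter(label for _, label in by_distance[:k])
    # On a tie, Counter.most_common keeps the label seen first, i.e. the nearer one.
    return votes.most_common(1)[0][0]


# Hypothetical, already-scaled feature vectors.
X_train = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y_train = [0, 0, 0, 1, 1, 1]
print(knn_predict(X_train, y_train, [0.2, 0.1]))
print(knn_predict(X_train, y_train, [5.5, 5.5]))
```

Because the distance sums squared differences across all attributes, attributes with large ranges dominate unless the data is scaled first, which is one reason feature scaling matters for this model.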

---

### Logistic Regression

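Logistic Regression applies the sigmoid function to a weighted sum of the input features, producing a value between 0 and 1 that can be interpreted as the probability of diabetes. A threshold, commonly 0.5, converts this probability into a class label.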

---

## Conclusion

To be concluded.

---

## References

1. Mayo Clinic Staff. (2024). Diabetes. Retrieved from [Mayo Clinic](https://www.mayoclinic.org/diseases-conditions/diabetes/symptoms-causes/syc-20371444).
2. Wikipedia. (2024). Diabetes. Retrieved from [Wikipedia](https://en.wikipedia.org/wiki/Diabetes).
3. Kaggle. (2016). Pima Indians Diabetes Database: Predict the onset of diabetes based on diagnostic measures. Retrieved from [Kaggle](https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database).

