**Predicting The Diagnosis Of Diabetes In Patients Based On Their Glucose And Insulin Levels**

*Introduction*

This project analyzes a diabetes dataset and attempts to diagnose the disease in individuals by comparing their glucose and insulin levels to those of the individuals in the dataset. During digestion, the body breaks food down into its basic components, including sugar. This sugar enters the bloodstream and raises blood glucose levels. Healthy individuals produce insulin in response to this rise, whereas those with diabetes produce little to no insulin, or cannot use it effectively, and so cannot naturally regulate their glucose levels.

There are two main types of diabetes. Type 1 is an autoimmune disease in which the pancreas produces little or no insulin. Type 2 is more common and involves insulin resistance: the body does not use insulin properly, and the pancreas may at first produce more insulin to compensate. In both cases, insulin is either in short supply or improperly used by the body, which leads to abnormal blood sugar levels.

The question we will be answering in our project is: **Are we able to predict diabetes in an individual, solely based on their glucose and insulin levels?**

The dataset originates from the National Institute of Diabetes and Digestive and Kidney Diseases and was obtained from Kaggle (https://www.kaggle.com/datasets/akshaydattatraykhare/diabetes-dataset). It contains variables that play a role in, or indicate, whether an individual is diabetic. The raw dataset has nine columns, all numeric and most of them integers. The only dependent variable is the ‘Outcome’ column, which records the diagnosis: 1 = yes, the individual is diabetic; 0 = no, the individual is not diabetic. One limitation of the study is the limited diversity in sampling, as the subjects are all females over 21 years old of Pima Indian heritage.
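
Because the ‘Outcome’ column stores the diagnosis as 0/1 codes, a classification workflow would typically treat it as a categorical variable rather than a number. The sketch below shows one way to do that in base R. The vector `outcome_codes` and the label names are illustrative assumptions, not values or names taken from the project.

```r
# Illustrative 0/1 codes in the same encoding as the Outcome column
# (hypothetical values, not rows from the dataset).
outcome_codes <- c(1L, 0L, 1L, 0L)

# Convert the codes to a labelled factor: 0 = not diabetic, 1 = diabetic.
# The label names are an assumption chosen for readability.
outcome_factor <- factor(
  outcome_codes,
  levels = c(0, 1),
  labels = c("not_diabetic", "diabetic")
)

# Tabulate how many observations fall in each class.
table(outcome_factor)
```

Keeping the level order fixed (0 first, then 1) makes it explicit which class each label refers to.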


*Preliminary Data Exploration*

In [1]:
library(repr)
library(tidyverse)
library(tidymodels)
library(ggplot2)
options(repr.matrix.max.rows = 6)


── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.3.6     ✔ purrr   0.3.4
✔ tibble  3.1.7     ✔ dplyr   1.0.9
✔ tidyr   1.2.0     ✔ stringr 1.4.0
✔ readr   2.1.2     ✔ forcats 0.5.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──

✔ broom        1.0.0     ✔ rsample      1.0.0
✔ dials        1.0.0     ✔ tune         1.0.0
✔ infer        1.0.2     ✔ workflows    1.0.0

In [96]:
# Preliminary Exploratory Data Analysis: Read Data into R
diabetes_data <- read.csv("diabetes.csv")
diabetes_data

Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
<int>,<int>,<int>,<int>,<int>,<dbl>,<dbl>,<int>,<int>
6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮
5,121,72,23,112,26.2,0.245,30,0
1,126,60,0,0,30.1,0.349,47,1
1,93,70,31,0,30.4,0.315,23,0
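
Several of the rows printed above have an insulin reading of 0. Before wrangling, it can help to count how many zero readings each predictor contains. The sketch below does this on a small tibble built from only the six rows shown in the output, so the counts describe that preview and not the full dataset. The names `preview` and `zero_counts` are hypothetical, and whether a zero is a true reading or a placeholder is not stated in the source.

```r
library(dplyr)

# The six rows visible in the printed output (not the full dataset).
preview <- tibble::tibble(
  Glucose = c(148L, 85L, 183L, 121L, 126L, 93L),
  Insulin = c(0L, 0L, 0L, 112L, 0L, 0L),
  Outcome = c(1L, 0L, 1L, 0L, 1L, 0L)
)

# Count the zero readings in each predictor column.
zero_counts <- preview |>
  summarise(across(c(Glucose, Insulin), ~ sum(.x == 0)))

zero_counts
```

Running the same `summarise()` call on `diabetes_data` would give the counts for the whole dataset.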


In [109]:
# Preliminary Exploratory Data Analysis: Clean and Wrangle Data

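# Add a patient identifier so rows can still be traced after filtering
diabetes_data$Patient_number <- seq.int(nrow(diabetes_data))

# Keep only the identifier, the two predictors, and the diagnosis
diabetes_select <- select(diabetes_data, Patient_number, Glucose, Insulin, Outcome)

# Drop non-diabetic patients (Outcome == 0) whose insulin reading is 0;
# diabetic patients with an insulin reading of 0 are retained
diabetes_filter <- diabetes_select |>
    filter(!(Outcome == 0 & Insulin == 0))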


diabetes_filter


Patient_number,Glucose,Insulin,Outcome
<int>,<int>,<int>,<int>
1,148,0,1
3,183,0,1
4,89,94,0
⋮,⋮,⋮,⋮
764,101,180,0
766,121,112,0
767,126,0,1
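
To make the effect of the filter condition concrete, the sketch below applies the same `filter()` call to a small tibble built from the first and last three rows of the raw output. The patient numbers 766 to 768 for the last three rows are inferred from the filtered output, and the names `toy` and `toy_filtered` are hypothetical.

```r
library(dplyr)

# First and last three rows of the raw data, with inferred patient numbers.
toy <- tibble::tibble(
  Patient_number = c(1L, 2L, 3L, 766L, 767L, 768L),
  Glucose        = c(148L, 85L, 183L, 121L, 126L, 93L),
  Insulin        = c(0L, 0L, 0L, 112L, 0L, 0L),
  Outcome        = c(1L, 0L, 1L, 0L, 1L, 0L)
)

# Same condition as the wrangling cell: drop rows that are both
# non-diabetic (Outcome == 0) and have an insulin reading of 0.
toy_filtered <- toy |>
  filter(!(Outcome == 0 & Insulin == 0))

# Patients 2 and 768 are removed; 1, 3, 766 and 767 remain.
toy_filtered$Patient_number

# Class balance after filtering.
count(toy_filtered, Outcome)
```

Note that the condition is asymmetric: zero insulin readings are removed only for non-diabetic patients, so diabetic patients with a zero reading (patients 1, 3 and 767 here) stay in the data.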


In [2]:
# New observations
new_obs_1 <- tibble(Glucose = 140, Insulin = 0)
new_obs_2 <- tibble(Glucose = 125, Insulin = 150)
bind_cols(
