
# Project Introduction: Diabetes Health Indicators Dataset Notebook

## Purpose
This notebook preprocesses the Behavioral Risk Factor Surveillance System (BRFSS) data, focusing on health indicators related to diabetes. The raw survey has 330 features; the notebook reduces it to a format suitable for machine learning algorithms. Features are selected based on published research on factors that influence diabetes and other chronic health conditions.

## Dataset Details
The dataset consists of 253,680 survey responses from the cleaned BRFSS 2015 data. The outcome variable records diabetes status in three classes (no diabetes, pre-diabetes, diabetes); the logistic regression section later in the notebook collapses it into a binary presence/absence label.

## Important Risk Factors
Research in the field has identified several pivotal risk factors for diabetes and chronic illnesses, including high blood pressure, cholesterol, smoking, obesity, age, sex, race, diet, exercise, alcohol consumption, BMI, household income, marital status, sleep, time since last checkup, education, and health care coverage.

## Subset of Selected Features
In line with these risk factors, features were selected from the BRFSS 2015 dataset by cross-referencing the dataset's variable names with the BRFSS 2015 Codebook. The selection also draws on the work of Zidian Xie et al., who took a similar approach to build risk prediction models for Type 2 Diabetes from the 2014 BRFSS dataset. This project contributes to ongoing efforts to apply machine learning to health risk assessment.

BRFSS 2015 Codebook: https://www.cdc.gov/brfss/annual_data/2015/pdf/codebook15_llcp.pdf

Relevant Research Paper using BRFSS for Diabetes ML: https://www.cdc.gov/pcd/issues/2019/19_0109.htm

The selected features from the BRFSS 2015 dataset are:

# The dependent variable, `Diabetes_012`:
This variable records the diabetes status of each respondent in three categories. A value of 0 denotes no diabetes, or diabetes only during pregnancy. A value of 1 denotes pre-diabetes or borderline diabetes. A value of 2 denotes a confirmed diagnosis of diabetes.
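
As a minimal sketch of how these three codes can be read and, if needed, collapsed into a binary label, the snippet below maps them with pandas. The toy values, the label strings, and the names `status_labels` and `has_diabetes_or_pre` are illustrative assumptions, not part of the BRFSS data.

```python
import pandas as pd

# Toy diabetes codes (illustrative values, not taken from the survey)
codes = pd.Series([0.0, 2.0, 1.0, 0.0], name="Diabetes_012")

# Human-readable label for each code, following the description above
status_labels = {0: "no diabetes", 1: "pre-diabetes", 2: "diabetes"}
readable = codes.astype(int).map(status_labels)

# One possible binary target: 1 for pre-diabetes or diabetes, 0 otherwise
has_diabetes_or_pre = (codes > 0).astype(int)

print(readable.tolist())
print(has_diabetes_or_pre.tolist())
```

Whether pre-diabetes belongs with class 0 or class 2 in a binary target is a modeling choice; this sketch groups it with diabetes.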



# Several independent variables are considered:
## High Blood Pressure (_RFHYPE5):

Adults who have been informed by a doctor, nurse, or health professional that they have high blood pressure.

## High Cholesterol (TOLDHI2, _CHOLCHK):

* Have you ever been informed by a doctor, nurse, or health professional that your blood cholesterol is high? (TOLDHI2)
* Cholesterol check within the past five years. (_CHOLCHK)

## BMI (Body Mass Index) (_BMI5):
Body Mass Index (BMI) is used as an indicator of weight status.

## Smoking (SMOKE100):
Whether the respondent has smoked at least 100 cigarettes in their entire life.

## Other Chronic Health Conditions:

* Ever told you had a stroke (CVDSTRK3).
* Respondents ever reporting having coronary heart disease (CHD) or myocardial infarction (MI) (_MICHD).

## Physical Activity (_TOTINDA):
Adults who reported engaging in physical activity or exercise during the past 30 days other than their regular job.

## Diet:

* Consume fruit 1 or more times per day (_FRTLT1).
* Consume vegetables 1 or more times per day (_VEGLT1).

## Alcohol Consumption (_RFDRHV5):
Heavy drinkers, identified by weekly drink counts (the thresholds are listed under `HvyAlcoholConsump` in the data description).

## Health Care:

* Do you have any kind of health care coverage, including health insurance, prepaid plans, or government plans? (HLTHPLN1)
* Was there a time in the past 12 months when you needed to see a doctor but could not because of cost? (MEDCOST)

## General and Mental Health:

* General health assessment (GENHLTH).
* Mental health status, considering stress, depression, and emotional problems in the past 30 days (MENTHLTH).
* Physical health status, including illness and injury in the past 30 days (PHYSHLTH).
* Serious difficulty walking or climbing stairs (DIFFWALK).

## Demographics:

* Sex of the respondent (SEX).
* Fourteen-level age category (_AGEG5YR).
* Highest grade or year of school completed (EDUCA).
* Annual household income from all sources (INCOME2), with a code for refusal if applicable.





## Imports

Importing the necessary libraries.

In [None]:
# Import pandas, numpy, and matplotlib
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# seaborn is a data visualization library built on matplotlib
import seaborn as sns
# set the plotting style
sns.set_style("whitegrid")

# plot tree model
import graphviz

# Model preprocessing
from sklearn import preprocessing
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Train-test splits and cross validation
from sklearn.model_selection import train_test_split, RandomizedSearchCV, cross_val_score, GridSearchCV, KFold
from patsy import dmatrices

# Models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
from sklearn import tree
import xgboost as xgb

# Plot missing values
import missingno as msno

# Imputation
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import SimpleImputer, IterativeImputer, KNNImputer

# Regular expressions
import re

# Model metrics and analysis
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn import metrics
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from scipy.stats import uniform, randint

# Logistic function
from scipy.special import expit

## The Data

This project uses a cleaned extract of the BRFSS `2015.csv` data set.

Each row holds one respondent's health indicators, which are used to predict the occurrence of diabetes.

`Diabetes_012`
Occurrence of Diabetes

`HighBP`
Blood pressure is above normal levels

`HighChol`
Cholesterol is above normal levels

`CholCheck`
Cholesterol levels checked within the past five years

`BMI`
Body Mass Index

`Smoker`
Has smoked at least 100 cigarettes in their entire life

`Stroke`
Ever had a stroke

`HeartDiseaseorAttack`
Coronary heart disease (CHD) or myocardial infarction (MI)

`PhysActivity`
Physical activity in the past 30 days - not including job

`Fruits`
Consumes fruits once or more times per day

`Veggies`
Consumes veggies once or more times per day

`HvyAlcoholConsump`
Adult men >=14 drinks per week

Adult women >=7 drinks per week

`AnyHealthcare`
Have any kind of health care coverage, including health insurance, prepaid plans such as HMO, etc.

`NoDocbcCost`
Was there a time in the past 12 months when needed to see a doctor but could not because of cost

`GenHlth`
Rate general health

1 = excellent 2 = very good 3 = good 4 = fair 5 = poor

`MentHlth`
Days of poor mental health in the past 30 days (scale 1-30)

`PhysHlth`
Days of physical illness or injury in the past 30 days

`DiffWalk`
Serious difficulty walking or climbing stairs

`Sex`
Sex of the respondent

`Age`
1 = (18-24), 9 = (60-64), 13 = (80 or older)

`Education`
Education level on scale 1-6

1 = Never attended school or only kindergarten, 2 = elementary etc.

`Income`
Income on scale 1-8

 1 = less than $10000

 8 = $75000 or more
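
To make the `Age` coding above easier to read in plots and tables, here is a small sketch that turns a code into an age range. It assumes the codes between the three anchors listed above are consecutive five-year bands, which matches 1 = 18-24, 9 = 60-64, and 13 = 80 or older but should be confirmed against the codebook. The function name `age_code_to_label` is hypothetical.

```python
def age_code_to_label(code: int) -> str:
    """Map an Age code (1-13) to its age range, assuming five-year bands."""
    if not 1 <= code <= 13:
        raise ValueError(f"unexpected age code: {code}")
    if code == 1:
        return "18-24"
    if code == 13:
        return "80 or older"
    low = 25 + 5 * (code - 2)   # code 2 starts at 25, code 3 at 30, ...
    return f"{low}-{low + 4}"

print(age_code_to_label(1))
print(age_code_to_label(9))
print(age_code_to_label(13))
```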

## Questions for the problem

The overall problem is to predict the occurrence of diabetes from a person's health attributes. To address it, we ask specific questions about the data.

##### $\rightarrow$ Questions about the data that will help you solve the problem.

*   How many predictor variables and how many observations do we have?
*   Can these variables help us predict the occurrence of diabetes?
*   Which variables should be excluded for this to be a meaningful study?
*   Are there any duplicate rows or missing values?
*   What values can the variable we are trying to predict take?
*   Which predictor variables are highly correlated with the output variable?
*   Do we have any outliers in this data?
*   Which models shall we use to predict the output?
*   Are there any additional external factors that may influence the chances of diabetes?

## Load the data

##### $\rightarrow$ Loading the Diabetes 2015 dataset.
https://www.kaggle.com/code/alexteboul/diabetes-health-indicators-dataset-notebook/input?select=2015.csv

BRFSS 2015 Codebook: https://www.cdc.gov/brfss/annual_data/2015/pdf/codebook15_llcp.pdf



In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/umuhoza90/DS-Project/main/diabetes_012_health_indicators_BRFSS2015.csv')

# Exploring the contents of the data sets


Exploring the dataset. We can see the column names and what the values look like.

In [None]:
df.head()

Summary of the DataFrame, including the data type of each column and the number of non-null values.

In [None]:
df.info()

Number of rows and columns in our dataset.

In [None]:
df.shape

## Can we answer the questions using this data?

##### Performing exploratory data analysis to determine whether the data are sufficient to answer our question.

Checking which variables we can use in a pair plot.

In [None]:
df.describe()

To explore the relationships between the numerical variables, we are plotting a pair plot.

In [None]:
sns.pairplot(data = df.iloc[:, np.r_[np.arange(5), 21]], hue='Diabetes_012');

# Data preparation

## Quality control

No out-of-range values were found: every variable stays within the coded range given in the data description above.
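
One way such a range check could be carried out is sketched below. The `expected_ranges` dictionary and the helper `out_of_range_counts` are hypothetical; the ranges are copied from the scales in the data description, and the toy frame stands in for the real data.

```python
import pandas as pd

# Hypothetical valid ranges, taken from the scales in the data description
expected_ranges = {"GenHlth": (1, 5), "Age": (1, 13), "Income": (1, 8)}

def out_of_range_counts(frame: pd.DataFrame, ranges: dict) -> dict:
    """Count, per column, the values that fall outside [low, high]."""
    return {
        col: int((~frame[col].between(low, high)).sum())
        for col, (low, high) in ranges.items()
    }

# Toy frame with one deliberately invalid Age code
toy = pd.DataFrame({"GenHlth": [1, 5, 3], "Age": [1, 14, 13], "Income": [8, 1, 4]})
print(out_of_range_counts(toy, expected_ranges))
```

A count of zero for every column would support the statement that no values are out of range.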

## Checking missing values

In [None]:
df.isnull().sum()

## Checking duplicate rows

In [None]:
df.duplicated().sum()

Dropping the duplicate rows

In [None]:
df.drop_duplicates(inplace = True)
df.duplicated().sum()

## Renaming the columns from CamelCase to snake_case

Looking at the column names.

In [None]:
df.columns

Using this method to convert the column names from CamelCase to snake_case.

In [None]:
def camel_to_snake(name):
    name = re.sub('(.)([A-Z][a-z]+)', r'\1_\2', name)
    name = re.sub('([a-z0-9])([A-Z])', r'\1_\2', name).lower()
    return re.sub('(.)([0-9][a-z]+)', r'\1_\2', name)
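
As a quick check of what the conversion produces, the sketch below repeats the function so it runs on its own and applies it to a few column names from this dataset.

```python
import re

def camel_to_snake(name):
    # Same three substitutions as the function above, repeated so this runs on its own
    name = re.sub('(.)([A-Z][a-z]+)', r'\1_\2', name)
    name = re.sub('([a-z0-9])([A-Z])', r'\1_\2', name).lower()
    return re.sub('(.)([0-9][a-z]+)', r'\1_\2', name)

for original in ['HighBP', 'CholCheck', 'NoDocbcCost', 'Diabetes_012']:
    print(original, '->', camel_to_snake(original))
```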

In [None]:
df.columns = [camel_to_snake(name) for name in df.columns]

Checking the converted column names.

In [None]:
df.columns

Selecting the relevant variables.

In [None]:
df = df[['diabetes_012', 'high_bp', 'high_chol', 'bmi', 'smoker',
       'phys_activity', 'fruits', 'veggies',
       'hvy_alcohol_consump', 'no_docbc_cost', 'gen_hlth','any_healthcare',
       'ment_hlth', 'phys_hlth', 'diff_walk', 'sex', 'age', 'education',
       'income']]

Rename the `diabetes_012` column to `diabetes`

In [None]:
# assign + drop returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
df = df.assign(diabetes=df['diabetes_012']).drop(columns='diabetes_012')

Frequency count for each column value

In [None]:
for column in df.columns:
    print(f"{column}\n{df[column].value_counts()}\n{'-'*90}")

# Exploratory Data Analysis

In [None]:
corr_matrix = df.corr()

plt.figure(figsize=(20,15))

sns.heatmap(corr_matrix, vmax=1, vmin=-1, square=True, annot=True, cmap='viridis')

plt.tick_params(labelsize=12);

In [None]:
# Looking at correlation to diabetes only (computed once and reused)
diabetes_corr = df.corr("pearson")["diabetes"]

plt.figure(figsize=(10, 8))
fig2 = plt.bar(diabetes_corr.index, diabetes_corr)
plt.xticks(rotation=90);
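
The bar chart is easier to read when the features are ranked by the strength of their correlation. A minimal sketch on toy data follows; the column names `feature_a`, `feature_b`, `feature_c` and all values are made up for illustration.

```python
import pandas as pd

# Toy data: feature_a rises with the target, feature_b mostly falls, feature_c is unrelated
toy = pd.DataFrame({
    "feature_a": [1, 2, 3, 4, 5],
    "feature_b": [3, 5, 4, 2, 1],
    "feature_c": [2, 1, 2, 1, 2],
    "diabetes":  [0, 0, 1, 2, 2],
})

# Pearson correlation of every feature with the target, strongest first
target_corr = toy.corr("pearson")["diabetes"].drop("diabetes")
ranked = target_corr.reindex(target_corr.abs().sort_values(ascending=False).index)
print(ranked.round(3))
```

Sorting by absolute value keeps strong negative correlations near the top instead of at the bottom.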

We preprocess the values so they work better with machine learning algorithms. To do this, we consulted the codebook, which describes each column, feature, or question: https://www.cdc.gov/brfss/annual_data/2015/pdf/codebook15_llcp.pdf

In [None]:
plt.figure(figsize=(10, 6))

# Count occurrences of each category in the 'diabetes' column
diabetes_counts = df['diabetes'].value_counts()

# Plotting the pie chart
# Three colors, one per diabetes class, so no two wedges share a color
plt.pie(diabetes_counts, labels=diabetes_counts.index, autopct='%1.2f%%', colors=['lightcoral', 'cornflowerblue', 'mediumseagreen'], startangle=90, wedgeprops=dict(width=0.4, edgecolor='w'))

# Adding a title
plt.title('Percentage of Diabetes Cases')

# Display the pie chart
plt.show()

* 0 = No diabetes or only during pregnancy (82.71%)
* 1 = Pre-diabetes or borderline diabetes (2.01%)
* 2 = Yes, diabetes (15.27%)
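
Percentages like these can be computed directly rather than read off the chart. The sketch below uses a toy label series, so its numbers are illustrative and do not reproduce the figures above.

```python
import pandas as pd

# Toy labels: 8 without diabetes, 1 pre-diabetic, 1 diabetic (illustrative only)
toy_status = pd.Series([0] * 8 + [1] + [2], name="diabetes")

# Share of each class in percent, ordered by class code
class_share = (toy_status.value_counts(normalize=True).sort_index() * 100).round(2)
print(class_share)
```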

In [None]:

plt.figure(figsize=(12, 8))

# Order the x-axis by age group
age_order = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]

# Plotting the countplot
sns.countplot(x='age', data=df, hue='diabetes', order=age_order, palette='bright')

# Adding a title and labels
plt.title('Age Group Distribution by Diabetes Status')
plt.xlabel('Age Group')
plt.ylabel('Count')

# Adding legend and adjusting its position
plt.legend(title='Diabetes', loc='upper right')

# Display the countplot
plt.show()


The countplot shows the relationship between age and diabetes: the number of diabetes cases is higher in the older age groups, which suggests an elevated risk for elderly individuals.

In [None]:
# Subset of individuals diagnosed with diabetes (class 2)
Diabetics = df[df['diabetes'] == 2].copy()


In [None]:
plt.figure(figsize=(10, 6))

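# Count occurrences of each sex among the diabetic patients, ordered by code
gender_counts = Diabetics['sex'].value_counts().sort_index()

# Build each label from its code instead of relying on frequency order
# (assumes the coding 0 = female, 1 = male; check the dataset description)
sex_labels = [{0: 'Female', 1: 'Male'}[int(code)] for code in gender_counts.index]
plt.pie(gender_counts, labels=sex_labels, autopct='%1.2f%%', colors=['skyblue', 'orange'], startangle=90, wedgeprops=dict(width=0.4, edgecolor='w'))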

# Adding a title
plt.title('Gender Distribution among Diabetic Patients')

# Display the pie chart
plt.show()

The pie chart shows no pronounced association between gender and diabetes.

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(15, 5))

# Plot HighChol vs. Diabetes in the entire dataset
sns.countplot(x='high_chol', data=df, hue='diabetes', ax=ax[0]).set_title('High Cholesterol vs Diabetes')

# Plot HighChol in Diabetic patients
sns.countplot(x='high_chol', data=Diabetics, ax=ax[1]).set_title('High Cholesterol in Diabetic Patients')

# Add common labels
for axes in ax:
    axes.set_xlabel('High Cholesterol')
    axes.set_ylabel('Count')

plt.show()


The countplots show an association between high cholesterol and diabetes. Among individuals with diabetes, more than 20,000 have high cholesterol, while more than 10,000 do not.
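
Raw counts depend on how large each group is, so it can help to look at shares within each group as well. The following is a sketch on toy records (all values illustrative) using `pd.crosstab` with row normalization.

```python
import pandas as pd

# Toy records (illustrative): high_chol flag and diabetes class per respondent
toy = pd.DataFrame({
    "high_chol": [0, 0, 0, 0, 1, 1, 1, 1],
    "diabetes":  [0, 0, 0, 2, 0, 2, 2, 2],
})

# Rows: high_chol groups; columns: diabetes classes; values: share within each row
rates = pd.crosstab(toy["high_chol"], toy["diabetes"], normalize="index")
print(rates)
```

The same pattern applies to the other binary indicators plotted in this section, such as `diff_walk` and `high_bp`.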






In [None]:
fig, ax = plt.subplots(1, 2, figsize=(15, 5))

# Plot DiffWalk vs. Diabetes in the entire dataset
sns.countplot(x='diff_walk', data=df, hue='diabetes', ax=ax[0]).set_title('Difficulty in Walking or Climbing vs Diabetes')

# Plot DiffWalk in Diabetic patients
sns.countplot(x='diff_walk', data=Diabetics, ax=ax[1]).set_title('Difficulty in Walking or Climbing in Diabetic Patients')

# Add common labels
for axes in ax:
    axes.set_xlabel('Difficulty in Walking or Climbing')
    axes.set_ylabel('Count')

plt.show()


Among individuals diagnosed with diabetes, around 13,114 report difficulty walking or climbing stairs, while around 21,983 do not.

In [None]:

fig, ax = plt.subplots(1, 2, figsize=(15, 5))

# Plot HighBP vs. Diabetes in the entire dataset
sns.countplot(x='high_bp', data=df, hue='diabetes', ax=ax[0]).set_title('High Blood Pressure vs Diabetes')

# Plot HighBP in Diabetic patients
sns.countplot(x='high_bp', data=Diabetics, ax=ax[1]).set_title('High Blood Pressure in Diabetic Patients')

# Add common labels
for axes in ax:
    axes.set_xlabel('High Blood Pressure')
    axes.set_ylabel('Count')

plt.show()

The countplot shows an association between high blood pressure and diabetes. Among individuals with diabetes, over 25,000 have high blood pressure, while around 8,000 do not.

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(15, 5))

# Plot GenHlth vs. Diabetes in the entire dataset
sns.countplot(x='gen_hlth', data=df, hue='diabetes', ax=ax[0]).set_title('General Health vs Diabetes')

# Plot GenHlth in Diabetic patients
sns.countplot(x='gen_hlth', data=Diabetics, ax=ax[1]).set_title('General Health in Diabetic Patients')

# Add common labels and readable category names to both plots
health_labels = ['Excellent', 'Very Good', 'Good', 'Fair', 'Poor']
for axes in ax:
    axes.set_xlabel('General Health')
    axes.set_ylabel('Count')
    axes.set_xticks(range(5))
    axes.set_xticklabels(health_labels)

plt.show()


The countplots show the relationship between general health and diabetes. Notably, individuals who rate their health as 'good', 'fair', or 'poor' appear more likely to have diabetes. The approximate counts of diabetic respondents in each health category are:

* Good Health: about 13,000 individuals
* Fair Health: about 9,000 individuals
* Very Good Health: about 6,000 individuals
* Poor Health: about 4,000 individuals
* Excellent Health: about 1,000 individuals

This suggests a discernible association between lower perceived general health and an increased likelihood of diabetes.
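
Because the health categories differ in size, the share of diabetic respondents within each category is a more direct measure of this association than the raw counts. A minimal sketch on toy data (values are illustrative, not from the survey):

```python
import pandas as pd

# Toy data (illustrative): general health rating (1 = excellent ... 5 = poor)
toy = pd.DataFrame({
    "gen_hlth": [1, 1, 1, 1, 3, 3, 3, 3, 5, 5],
    "diabetes": [0, 0, 0, 0, 0, 0, 2, 0, 2, 0],
})

# Share of respondents in each health category who have diabetes (class 2)
share_diabetic = (toy["diabetes"] == 2).groupby(toy["gen_hlth"]).mean()
print(share_diabetic)
```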

In [None]:
#Diabetes and BMI

plt.figure(figsize=(10, 6))

# Create a Kernel Density Estimate (KDE) plot for BMI and Diabetes
sns.kdeplot(data=df, x="bmi", hue="diabetes", fill=True, common_norm=False, palette='viridis')

# Adding a title and labels
plt.title('BMI Distribution by Diabetes Status')
plt.xlabel('BMI')
plt.ylabel('Density')

# Display the KDE plot
plt.show()


The BMI distribution differs between individuals with and without diabetes. The graph shows that non-diabetic individuals mostly have a BMI in the 25-35 range, while diabetic patients tend to have a BMI that extends beyond this range.


In [None]:
df.drop(columns=['diabetes']).agg(['min', 'max']).T

## Train/test split

Produce a train/test split for model comparison.

In [None]:
X = df.drop(columns = ['phys_activity', 'fruits', 'veggies', 'hvy_alcohol_consump', 'education', 'income', 'diabetes'])

y = df['diabetes']

In [None]:
X.head()

In [None]:
y.head()

In [None]:
class_names = ['no diabetes', 'yes diabetes']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state = 1)

In [None]:
print(X_train.shape, X_test.shape)

In [None]:
X_train.agg(['mean','std']).round(2).T

In [None]:
X_test.agg(['mean','std']).round(2).T

In [None]:
plt.subplots(1,2, figsize=(10,4))

plt.subplot(1,2,1)
plt.plot(X_train.mean(), X_test.mean(), 'o')
plt.plot([0, 30], [0, 30])

plt.xlabel('Training set mean')
plt.ylabel('Test set mean')
plt.axis('square')

plt.subplot(1,2,2)
plt.plot(X_train.std(), X_test.std(), 'o')
plt.plot([0, 10], [0, 10])

plt.xlabel('Training set std')
plt.ylabel('Test set std')
plt.axis('square');

In [None]:
df['diabetes'].value_counts()

In [None]:
y_train.value_counts(normalize=True)

In [None]:
y_test.value_counts(normalize=True)
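
With an imbalanced target, a plain random split can leave the rare class slightly over- or under-represented in the test set. One option, shown here as a sketch on toy data rather than as the method used in this notebook, is the `stratify` argument of `train_test_split`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy imbalanced target (illustrative): 80 / 10 / 10 across three classes
y_toy = pd.Series([0] * 80 + [1] * 10 + [2] * 10)
X_toy = pd.DataFrame({"feature": range(len(y_toy))})

# stratify=y_toy asks for the same class proportions in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=1, stratify=y_toy
)

print(y_tr.value_counts(normalize=True).sort_index())
print(y_te.value_counts(normalize=True).sort_index())
```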

In [None]:
numerical_columns = X_train.select_dtypes(include='number').columns.to_list()

In [None]:
scaler = StandardScaler().fit(X_train[numerical_columns])

In [None]:
X_train[numerical_columns] = scaler.transform(X_train[numerical_columns])
X_test[numerical_columns] = scaler.transform(X_test[numerical_columns])

In [None]:
print(np.mean(X_train[numerical_columns], axis = 0).round(2))
print(np.std(X_train[numerical_columns], axis = 0).round(2))

print(np.mean(X_test[numerical_columns], axis = 0).round(2))
print(np.std(X_test[numerical_columns], axis = 0).round(2))
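
The scaling step above subtracts the training mean and divides by the training standard deviation, and the test set reuses those same two numbers. A small NumPy sketch of that arithmetic, on made-up values:

```python
import numpy as np

# Toy training and test columns (illustrative values)
train = np.array([20.0, 25.0, 30.0, 35.0, 40.0])
test = np.array([22.0, 38.0])

# Statistics come from the training data only
mu = train.mean()
sigma = train.std()          # population std (ddof=0), as StandardScaler uses

train_scaled = (train - mu) / sigma
test_scaled = (test - mu) / sigma   # the test set reuses the training mu and sigma

print(train_scaled.mean().round(2), train_scaled.std().round(2))
print(test_scaled.round(2))
```

This is why the scaled training columns have mean 0 and standard deviation 1, while the scaled test columns are only close to those values.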

In [None]:
corr_matrix = X_train.corr()

plt.figure(figsize=(15,8))

sns.heatmap(corr_matrix, vmax=1, vmin=-1, square=True, cmap='viridis')

plt.tick_params(labelsize=10);

In [None]:
#sns.pairplot(data = df.iloc[:,np.r_[np.arange(6), 21]], hue='diabetes');

In [None]:
plt.figure(figsize=(8,10))
sns.boxplot(data=X_train, orient='h');


# Modeling


# Linear regression

In [None]:
#imputation

In [None]:
X_train.head()

In [None]:
df_train = X_train.join(y_train)

In [None]:
df_train.head()

In [None]:
df_test = X_test.join(y_test)

In [None]:
df_test.head()

In [None]:
sns.pairplot(data=df_train,
             vars=['high_bp', 'high_chol', 'bmi', 'smoker', 'no_docbc_cost', 'gen_hlth',
                   'any_healthcare', 'ment_hlth', 'phys_hlth', 'diff_walk', 'sex', 'age', 'diabetes'],
             kind='reg');

In [None]:
plt.figure(figsize=(12,4))
sns.boxplot(data=df_train[['high_bp', 'high_chol', 'bmi', 'smoker', 'no_docbc_cost', 'gen_hlth',
                           'any_healthcare', 'ment_hlth', 'phys_hlth', 'diff_walk', 'sex', 'age', 'diabetes']]);

In [None]:
def design_matrices(formula, df_train, df_test):

    _, X_design_train = dmatrices(formula,
                                  data=df_train,
                                  return_type='dataframe')

    _, X_design_test = dmatrices(formula,
                                 data=df_test,
                                 return_type='dataframe')

    return X_design_train, X_design_test

In [None]:
X_design_train, X_design_test = design_matrices('diabetes ~(high_bp + high_chol + bmi + smoker + no_docbc_cost + gen_hlth + any_healthcare + ment_hlth + phys_hlth + diff_walk + sex + age)', df_train, df_test)

In [None]:
X_design_train.head()

In [None]:
model = sm.OLS(y_train, X_design_train).fit()

In [None]:
print(model.summary())

In [None]:
y_hat = model.predict(X_design_train)

In [None]:
plt.plot(y_train, y_hat, 'o')

plt.xlabel('Training set diabetes', fontsize = 12)
plt.ylabel('Predicted training set diabetes', fontsize = 12)

plt.tick_params(labelsize = 15)

In [None]:
y_predict_lr = model.predict(X_design_test)

In [None]:
plt.figure(figsize = (8,6))

plt.plot(y_test, y_predict_lr, 'o');

plt.xlabel('Test diabetes', fontsize = 15)
plt.ylabel('Predicted test set diabetes', fontsize = 15)

plt.tick_params(labelsize = 15)

In [None]:
# RMSE via np.sqrt, since the `squared` argument is deprecated in recent scikit-learn releases
np.sqrt(mean_squared_error(y_test, y_predict_lr)).round(3)

In [None]:
mean_absolute_error(y_test, y_predict_lr).round(3)
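
For reference, the two error measures reported above can be written out by hand. The sketch below computes root mean squared error and mean absolute error with NumPy on made-up values.

```python
import numpy as np

# Toy true values and predictions (illustrative)
y_true = np.array([0.0, 2.0, 1.0, 0.0])
y_pred = np.array([0.5, 1.0, 1.0, 0.5])

errors = y_true - y_pred

rmse = np.sqrt(np.mean(errors ** 2))   # root mean squared error
mae = np.mean(np.abs(errors))          # mean absolute error

print(round(rmse, 3), round(mae, 3))
```

RMSE weights large errors more heavily than MAE, so a gap between the two hints at a few large misses.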

# Random forest

In [None]:
rf_model = RandomForestRegressor()

params = {
    "max_depth": randint(2, 18),
    "n_estimators": randint(80, 150),
    "min_samples_leaf": randint(1, 8)
}

search = RandomizedSearchCV(rf_model,
                            param_distributions=params,
                            n_iter=50, # Ideally, this would be larger, but it takes a long time
                            cv=5,
                            verbose=1,
                            n_jobs=1,
                            return_train_score=True,
                            scoring = 'neg_mean_squared_error')

#search.fit(X_design_train.drop(columns='Intercept'), y_train)

search.fit(X_train, y_train)

Fitting 5 folds for each of 50 candidates, totalling 250 fits


In [None]:
search.best_params_

In [None]:
model_rf = RandomForestRegressor(**search.best_params_)

In [None]:
# X_train and X_test hold predictors only, so there is no 'diabetes' column to drop
model_rf.fit(X_train, y_train)

In [None]:
y_predict_rf = model_rf.predict(X_test)

In [None]:
plt.figure(figsize = (8,6))

plt.plot(y_test, y_predict_rf, 'o');

plt.xlabel('Test diabetes', fontsize = 15)
plt.ylabel('Predicted test set diabetes', fontsize = 15)

plt.tick_params(labelsize = 15)

In [None]:
# RMSE via np.sqrt, since the `squared` argument is deprecated in recent scikit-learn releases
np.sqrt(mean_squared_error(y_test, y_predict_rf)).round(3)

In [None]:
mean_absolute_error(y_test, y_predict_rf).round(3)

In [None]:
# Pair each predictor with its importance; X_train already excludes the target
(pd.DataFrame(dict(cols=X_train.columns, imp=model_rf.feature_importances_)).
 sort_values(by = 'imp').
 plot('cols', 'imp', 'barh', legend = False, figsize = (4,5)))

plt.ylabel('')
plt.xlabel('Variable importance', fontsize = 15)

plt.tick_params(labelsize = 15);

## Quadratic Model

In [None]:
# Linear terms plus a squared term for every predictor (the unbalanced opening parenthesis is removed)
X_design_train_2, X_design_test_2 = design_matrices('diabetes ~ high_bp + high_chol + bmi + smoker + no_docbc_cost + gen_hlth + any_healthcare + ment_hlth + phys_hlth + diff_walk + sex + age + I(high_bp**2) + I(high_chol**2) + I(bmi**2) + I(smoker**2) + I(no_docbc_cost**2) + I(gen_hlth**2) + I(any_healthcare**2) + I(ment_hlth**2) + I(phys_hlth**2) + I(diff_walk**2) + I(sex**2) + I(age**2)',
                                                    df_train,
                                                    df_test)

In [None]:
X_design_train_2.head()

In [None]:
model_2 = sm.OLS(y_train, X_design_train_2).fit()

In [None]:
print(model_2.summary())

In [None]:
y_hat_2 = model_2.predict(X_design_train_2)

In [None]:
plt.plot(y_train, y_hat_2, 'o')

plt.xlabel('Training set diabetes', fontsize = 12)
plt.ylabel('Predicted training set diabetes', fontsize = 12)

plt.tick_params(labelsize = 15);

In [None]:
y_predict_lr_2 = model_2.predict(X_design_test_2)

In [None]:
plt.figure(figsize = (8,6))

plt.plot(y_test, y_predict_lr_2, 'o');

plt.xlabel('Test set diabetes', fontsize = 15)
plt.ylabel('Predicted test set diabetes', fontsize = 15)

plt.tick_params(labelsize = 15)

In [None]:
# RMSE via np.sqrt, since the `squared` argument is deprecated in recent scikit-learn releases
np.sqrt(mean_squared_error(y_test, y_predict_lr_2)).round(3)

In [None]:
mean_absolute_error(y_test, y_predict_lr_2).round(3)

# Logistic regression

In [None]:
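# A logistic fit needs a binary outcome, so classes 1 and 2 are collapsed into "any diabetes"
sns.lmplot(data = df_train.assign(diabetes = (df_train['diabetes'] > 0).astype(int)),
           x = 'high_bp', y = 'diabetes',
           y_jitter = 0.05, logistic = True, truncate = False, ci = None, height = 6)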

plt.xlabel('high bp', fontsize = 14)
plt.ylabel('diabetes', fontsize = 14);

In [None]:
x_grid = np.linspace(-5, 5, 100)

b0 = 0
b1 = 0.4

# Use a separate name so the target variable y is not overwritten
y_logistic = expit(b0 + b1 * x_grid)

plt.plot(x_grid, y_logistic)

plt.xlabel('X', fontsize = 14)
plt.ylabel('Logistic function', fontsize = 14);
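
The curve plotted above is the logistic function applied to a linear predictor, $p = 1 / (1 + e^{-(b_0 + b_1 x)})$. As a small sketch, the helper below (a hypothetical name, `logistic`) writes that formula out with NumPy and evaluates it at three points using the same `b0` and `b1` as the plot.

```python
import numpy as np

def logistic(z):
    """Logistic (sigmoid) function: 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

b0, b1 = 0.0, 0.4           # same intercept and slope as the plot above
x_points = np.array([-5.0, 0.0, 5.0])

print(logistic(b0 + b1 * x_points).round(3))
```

The output always lies between 0 and 1, which is what lets it be read as a probability.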

In [None]:
print(y_train.unique())

In [None]:
y_train_binary = (y_train > 0).astype(int)

In [None]:
# Fit the logistic regression model using high_bp as the only predictor, on the training data.
log_reg1 = sm.Logit(y_train_binary, sm.add_constant(X_train['high_bp'])).fit()

In [None]:
print(log_reg1.summary())

In [None]:
X_train.columns

In [None]:
log_reg1 = sm.Logit(y_train_binary, sm.add_constant(X_train)).fit()

In [None]:
print(log_reg1.summary())

The logistic regression model estimates the likelihood of having diabetes from the predictor variables. Several health-related factors (high blood pressure, high cholesterol, BMI) appear to increase that likelihood, while factors like smoking appear to have a negative coefficient. Use the p-values in the summary to judge which predictors are significant.
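
A logistic regression coefficient is on the log-odds scale, so exponentiating it gives an odds ratio that is easier to interpret. The sketch below uses a hypothetical coefficient of 0.4, not a value from the fitted model.

```python
import math

# Hypothetical coefficient for one standardized predictor (not a fitted value)
coef = 0.4

# A one-unit increase in the predictor multiplies the odds by exp(coef)
odds_ratio = math.exp(coef)
print(round(odds_ratio, 2))

# A negative coefficient gives an odds ratio below 1 (lower odds)
print(round(math.exp(-coef), 2))
```

Because the predictors here were standardized, "one unit" means one standard deviation of the original variable.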

# XGBoost

The XGBoost model aims to accurately predict whether an individual has pre-diabetes, no diabetes, or has already been diagnosed with diabetes.
The accuracy of the model in making these predictions is a key measure of its effectiveness.

In [None]:
# Don't run the cross validation now. It will take a long time.
xgb_model = xgb.XGBClassifier()

params = {
    "colsample_bytree": uniform(0.7, 0.3),
    "gamma": uniform(0, 0.5),
    "learning_rate": uniform(0.03, 0.3),
    "max_depth": randint(2, 6),
    "n_estimators": randint(100, 150),
    "subsample": uniform(0.6, 0.4)
}

search = RandomizedSearchCV(xgb_model,
                            param_distributions=params,
                            n_iter=200,
                            cv=5,
                            verbose=1,
                            n_jobs=1,
                            return_train_score=True)

search.fit(X_train, y_train)
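
The search space above relies on how `scipy.stats` parameterizes its distributions: `uniform(loc, scale)` covers the interval from `loc` to `loc + scale`, and `randint(low, high)` excludes `high`. The sketch below samples from two of the distributions used in `params` to make those ranges visible; the sample size and seed are arbitrary.

```python
from scipy.stats import randint, uniform

# uniform(loc, scale) covers [loc, loc + scale], so this is [0.7, 1.0]
colsample = uniform(0.7, 0.3).rvs(size=1000, random_state=0)

# randint(low, high) draws integers from low up to high - 1, so this is 2..5
depth = randint(2, 6).rvs(size=1000, random_state=0)

print(colsample.min().round(3), colsample.max().round(3))
print(sorted(set(depth.tolist())))
```

So `"learning_rate": uniform(0.03, 0.3)` searches between 0.03 and 0.33, not between 0.03 and 0.3.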

In [None]:
search.best_params_

In [None]:
xgb_model = xgb.XGBClassifier()
xgb_model.set_params(**search.best_params_)

In [None]:
xgb_model.fit(X_train, y_train)

In [None]:
fig, ax = plt.subplots(figsize = (5,10))
xgb.plot_importance(xgb_model, ax = ax);

In [None]:
class_names = ['no_diabetes', 'pre_diabetes', 'yes_diabetes']

In [None]:
pred_mgb = xgb_model.predict(X_test)

In [None]:
mat_mgb = confusion_matrix(y_test, pred_mgb)

In [None]:
sns.heatmap(mat_mgb.T, square=True, annot=True, cbar=False, xticklabels=class_names, yticklabels=class_names, fmt='g')
plt.xlabel('true label')
plt.ylabel('predicted label');
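
Overall accuracy can look acceptable even when a rare class is never predicted, so per-class recall and precision are worth reading off the confusion matrix. The sketch below uses a hypothetical matrix in the layout `confusion_matrix` returns (rows = true class, columns = predicted class); note that the heatmap above plots the transpose.

```python
import numpy as np

# Hypothetical confusion matrix: rows = true class, columns = predicted class
mat = np.array([
    [90,  0, 10],   # true: no diabetes
    [ 4,  0,  6],   # true: pre-diabetes
    [20,  0, 30],   # true: diabetes
])

recall = mat.diagonal() / mat.sum(axis=1)        # correct / all true members
predicted_totals = mat.sum(axis=0)
# Guard against classes that are never predicted (division by zero)
precision = np.divide(mat.diagonal(), predicted_totals,
                      out=np.zeros(3), where=predicted_totals > 0)
accuracy = mat.diagonal().sum() / mat.sum()

print(recall.round(2), precision.round(2), round(accuracy, 2))
```

In this made-up example the accuracy is 0.75 even though the pre-diabetes class has a recall of zero.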

In [None]:
print(classification_report(y_test, pred_mgb, target_names=class_names))

In [None]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split



# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Decision Tree Regression
dt_reg = DecisionTreeRegressor(random_state=42)
dt_reg.fit(X_train, y_train)
dt_reg_pred = dt_reg.predict(X_test)
dt_mse = mean_squared_error(y_test, dt_reg_pred)

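# Logistic Regression (multiclass here, because y has three classes)
log_reg = LogisticRegression(random_state=42, max_iter=1000)
log_reg.fit(X_train, y_train)
log_reg_pred = log_reg.predict(X_test)
log_reg_mse = mean_squared_error(y_test, log_reg_pred)

# Quadratic Regression: linear regression on degree-2 polynomial features
poly = PolynomialFeatures(degree=2)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

quad_reg = LinearRegression()
quad_reg.fit(X_train_poly, y_train)
quad_reg_pred = quad_reg.predict(X_test_poly)
quad_reg_mse = mean_squared_error(y_test, quad_reg_pred)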


In [None]:
# Calculate additional metrics
dt_mae = mean_absolute_error(y_test, dt_reg_pred)
dt_r2 = r2_score(y_test, dt_reg_pred)


log_reg_mae = mean_absolute_error(y_test, log_reg_pred)
log_reg_r2 = r2_score(y_test, log_reg_pred)

quad_reg_mae = mean_absolute_error(y_test, quad_reg_pred)
quad_reg_r2 = r2_score(y_test, quad_reg_pred)

# Print the metrics for each model
print("Decision Tree Metrics:")
print("  MSE:", dt_mse)
print("  MAE:", dt_mae)
print("  R2:", dt_r2)

print("\nLogistic Regression Metrics:")
print("  MSE:", log_reg_mse)
print("  MAE:", log_reg_mae)
print("  R2:", log_reg_r2)

print("\nQuadratic Regression Metrics:")
print("  MSE:", quad_reg_mse)
print("  MAE:", quad_reg_mae)
print("  R2:", quad_reg_r2)

In [None]:
from sklearn.metrics import f1_score
import matplotlib.pyplot as plt
import seaborn as sns


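models = ['Logistic Regression', 'Decision Tree', 'Quadratic Regression']
preds = [log_reg_pred, dt_reg_pred, quad_reg_pred]

f1_scores = []

for pred in preds:
    # Regressors return continuous values, so round to the nearest class (0, 1, 2)
    pred_class = np.clip(np.rint(pred), 0, 2).astype(int)
    f1_scores.append(f1_score(y_test.astype(int), pred_class, average="weighted"))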

# Plotting the F1 scores
plt.figure(figsize=(8, 5))
sns.barplot(x=models, y=f1_scores, palette="viridis")
plt.title('F1 Scores Comparison')
plt.xlabel('Models')
plt.ylabel('F1 Score')
plt.show()
