
# **Introduction to the project**

Diabetes is a growing global health concern, with millions of people affected each year. The condition, characterized by elevated blood sugar levels, can lead to severe complications such as heart disease, kidney failure, and nerve damage if left untreated. Given the significant health and economic impact of diabetes, it is crucial to explore ways to predict and manage the disease early. This project focuses on analyzing various factors related to diabetes, aiming to develop predictive models that can aid in early diagnosis and intervention.

Through the use of machine learning and statistical techniques, the project seeks to identify key variables that influence the likelihood of diabetes in individuals. By integrating demographic, lifestyle, and health data, the goal is to build an accurate model that can classify individuals as diabetic or non-diabetic based on their characteristics. This analysis will not only contribute to a better understanding of the disease but also provide valuable insights that could lead to more effective prevention and treatment strategies.



# **Datasets used**









1. Diabetes in America Dataset: Sourced from the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK), this dataset covers diabetes-related statistics for the United States, including prevalence, incidence, and treatment patterns. Its health-related metrics support analysis of the extent and effects of diabetes across demographic groups.

* Link: https://www.niddk.nih.gov/about-niddk/strategic-plans-reports/diabetes-in-america

2. Kaggle Diabetes Dataset: This dataset, available on Kaggle, contains information about patients diagnosed with diabetes. It includes details such as age, gender, glucose levels, blood pressure, BMI, and whether the patient is diabetic or not. The dataset is widely used for prediction modeling and machine learning, especially for building classification models to predict the likelihood of diabetes based on various health parameters.

* Link: https://www.kaggle.com/datasets/mathchi/diabetes-data-set

# **Import the libraries**

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# **Import the datasets**

In [2]:
diabetes = pd.read_csv("diabetes.csv")

FileNotFoundError: [Errno 2] No such file or directory: 'diabetes.csv'
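
`pd.read_csv` raises `FileNotFoundError` when the file is not in the working directory, which is common in a fresh Colab session before the data has been uploaded. A minimal sketch of a guarded loader follows. The helper name `load_csv_checked` and the throwaway demo file are assumptions for illustration, not part of the original notebook.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_csv_checked(path):
    """Read a CSV, raising a more descriptive error if the file is missing."""
    csv_path = Path(path)
    if not csv_path.is_file():
        raise FileNotFoundError(
            f"{csv_path} not found in {Path.cwd()}; upload it or adjust the path."
        )
    return pd.read_csv(csv_path)


# Demonstration on a throwaway file (a stand-in for diabetes.csv).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.csv"
    demo_path.write_text("age,bmi\n50,33.6\n31,26.6\n")
    demo = load_csv_checked(demo_path)

print(demo.shape)
```

In the notebook itself the call would look like `diabetes = load_csv_checked("diabetes.csv")`, assuming the file has been placed next to the notebook.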

In [None]:
diabetes_prediction = pd.read_json("diabetes_prediction.json")

# **Basic Data exploration**

In [None]:
diabetes.head()

In [None]:
diabetes.info()

In [None]:
# Convert columns to numeric types where applicable
numeric_columns = ['id', 'encounter_id', 'patient_nbr', 'admission_type_id', 'discharge_disposition_id', 'diag_1', 'diag_2', 'diag_3',
                   'admission_source_id', 'time_in_hospital', 'num_lab_procedures', 'num_procedures',
                   'num_medications', 'number_outpatient', 'number_emergency', 'number_inpatient',
                   'number_diagnoses']

diabetes[numeric_columns] = diabetes[numeric_columns].apply(pd.to_numeric, errors='coerce')  # Convert to numeric, coerce errors to NaN

In [None]:
# Convert categorical columns to pandas 'category' dtype
categorical_columns = ['race', 'gender', 'age', 'weight', 'payer_code', 'medical_specialty',
                       'max_glu_serum', 'A1Cresult', 'metformin', 'repaglinide', 'nateglinide', 'chlorpropamide', 'glimepiride',
                       'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide', 'pioglitazone', 'rosiglitazone', 'acarbose',
                       'miglitol', 'troglitazone', 'tolazamide', 'examide', 'citoglipton', 'insulin', 'glyburide.metformin',
                       'glipizide.metformin', 'glimepiride.pioglitazone', 'metformin.rosiglitazone', 'metformin.pioglitazone',
                       'change', 'diabetesMed', 'readmitted']

diabetes[categorical_columns] = diabetes[categorical_columns].astype('category')
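
The two conversions above can be checked on a small example. The sketch below uses a toy frame with made-up values; the `"?"` entry is an assumed stand-in for a non-numeric marker, not something confirmed about the real file. It shows that `errors='coerce'` silently turns unparseable entries into `NaN`, so it is worth counting them afterwards.

```python
import pandas as pd

# Toy data; "?" is a hypothetical non-numeric marker.
toy = pd.DataFrame({
    "time_in_hospital": ["3", "7", "?"],
    "gender": ["Female", "Male", "Female"],
})

# Same conversions as in the notebook cells above.
toy["time_in_hospital"] = pd.to_numeric(toy["time_in_hospital"], errors="coerce")
toy["gender"] = toy["gender"].astype("category")

# Count how many entries could not be parsed and became NaN.
n_coerced = int(toy["time_in_hospital"].isna().sum())

print(toy.dtypes)
print("values coerced to NaN:", n_coerced)
```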

In [None]:
diabetes_prediction.head()

In [None]:
diabetes_prediction.info()

In [None]:
diabetes_prediction = diabetes_prediction.dropna()

In [None]:
# Replace Outcome values with 'Diabetic' and 'Non-diabetic'
diabetes_prediction['Outcome'] = diabetes_prediction['Outcome'].replace({1: 'Diabetic', 0: 'Non-diabetic'})

In [None]:
diabetes_prediction.head()

# **Data Merging**

In [None]:
# Convert column names to lowercase for both datasets
diabetes.columns = diabetes.columns.str.lower()
diabetes_prediction.columns = diabetes_prediction.columns.str.lower()

# Merge on the shared 'age' column; with how='inner', only ages present in both datasets are kept
merged_data = pd.merge(diabetes, diabetes_prediction, on=['age'], how='inner')
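
If age values repeat within each dataset, an inner merge on `age` alone pairs every matching row on one side with every matching row on the other, so the result can be much larger than either input. The sketch below uses two toy frames with hypothetical values (not taken from either dataset) to show the effect, and uses the `validate` argument of `pd.merge` to flag non-unique keys.

```python
import pandas as pd

# Toy frames with a repeated key value (age 30 appears twice on each side).
left = pd.DataFrame({"age": [30, 30, 45], "time_in_hospital": [2, 5, 3]})
right = pd.DataFrame({"age": [30, 30, 60], "bmi": [28.1, 31.4, 25.0]})

# Inner merge: 2 left rows x 2 right rows for age 30; ages 45 and 60 are dropped.
merged = pd.merge(left, right, on="age", how="inner")
print(len(left), len(right), len(merged))

# validate="one_to_one" raises MergeError when the key is not unique on both sides.
try:
    pd.merge(left, right, on="age", how="inner", validate="one_to_one")
    keys_unique = True
except pd.errors.MergeError:
    keys_unique = False

print("merge keys unique:", keys_unique)
```

Comparing `len(merged_data)` against the lengths of the two inputs is a quick way to see whether the real merge behaves like this.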


# **Data visualizations**

In [None]:
numeric_columns = ['time_in_hospital', 'num_lab_procedures', 'num_medications', 'num_procedures']

plt.figure(figsize=(12, 8))

for i, column in enumerate(numeric_columns, 1):
    plt.subplot(2, 2, i)
    plt.hist(diabetes[column].dropna(), bins=20, color='skyblue', edgecolor='black')
    plt.title(f'Distribution of {column}')
    plt.xlabel(column)
    plt.ylabel('Frequency')
    plt.grid(True)

plt.tight_layout()
plt.show()

In [None]:
plt.figure(figsize=(10, 6))
sns.histplot(diabetes['age'], kde=True, color='skyblue', bins=30)
plt.title('Distribution of Age')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.tight_layout()
plt.show()


In [None]:
# Relationship between Age and BMI for Diabetic vs Non-diabetic
plt.figure(figsize=(8, 5))
sns.scatterplot(data=diabetes_prediction, x='age', y='bmi', hue='outcome', palette='Set2')
plt.title('Age vs BMI for Diabetic and Non-diabetic')
plt.xlabel('Age')
plt.ylabel('BMI')
plt.legend(title='Outcome')
plt.tight_layout()
plt.show()

In [None]:

# Class balance: number of diabetic vs non-diabetic records
plt.figure(figsize=(8, 5))
sns.countplot(data=diabetes_prediction, x='outcome', palette='Set2')
plt.title('Diabetic vs Non-diabetic Distribution')
plt.xlabel('Outcome')
plt.ylabel('Count')
plt.tight_layout()
plt.show()
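
The count plot can be complemented with the class proportions as numbers, which are easier to quote and compare. A minimal sketch on a toy `outcome` column with made-up labels:

```python
import pandas as pd

# Toy outcome column; in the notebook this would be diabetes_prediction['outcome'].
outcome = pd.Series(
    ["Diabetic", "Non-diabetic", "Non-diabetic", "Non-diabetic"], name="outcome"
)

# Share of each class as a fraction of all records.
shares = outcome.value_counts(normalize=True)
print(shares)
```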


# **Data Pivoting and Data aggregation**

In [None]:
# Pivot table for average Glucose by Outcome
pivot_glucose = diabetes_prediction.pivot_table(values='glucose', columns='outcome', aggfunc='mean')
print("Average Glucose Levels by Outcome:")
pivot_glucose


In [None]:
# Pivot table for average BMI by Outcome
pivot_bmi = diabetes_prediction.pivot_table(values='bmi', columns='outcome', aggfunc='mean')
print("\nAverage BMI by Outcome:")
pivot_bmi


In [None]:

# Pivot table for average Age by Outcome
pivot_age = diabetes_prediction.pivot_table(values='age', columns='outcome', aggfunc='mean')
print("\nAverage Age by Outcome:")
pivot_age


In [None]:
# Pivot table for average Insulin by Outcome
pivot_insulin = diabetes_prediction.pivot_table(values='insulin', columns='outcome', aggfunc='mean')
print("\nAverage Insulin by Outcome:")
pivot_insulin
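
The four pivot tables above can also be produced in one step with `groupby`, which puts all the averages side by side. The sketch below runs on a toy frame with made-up values; the column names match the lowercase names used in the notebook.

```python
import pandas as pd

# Toy data with the same column names as diabetes_prediction.
toy = pd.DataFrame({
    "outcome": ["Diabetic", "Non-diabetic", "Diabetic", "Non-diabetic"],
    "glucose": [150, 90, 170, 100],
    "bmi": [35.0, 25.0, 33.0, 27.0],
    "age": [50, 30, 54, 28],
    "insulin": [200, 80, 180, 100],
})

# One row per outcome, one column per averaged variable.
summary = toy.groupby("outcome")[["glucose", "bmi", "age", "insulin"]].mean()
print(summary)
```

With the real data, the equivalent call would be `diabetes_prediction.groupby('outcome')[['glucose', 'bmi', 'age', 'insulin']].mean()`.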

# **Conclusion**

In conclusion, this project demonstrates how data analysis can support the understanding and prediction of diabetes. Using visualizations, pivot tables, and data transformation techniques, we examined the key factors associated with the disease. The analysis focused on age, BMI, glucose, and insulin levels, comparing their averages between diabetic and non-diabetic individuals.

A predictive model built on these variables offers a promising approach to early diagnosis and intervention. While such a model's performance can always be improved with more data and more advanced techniques, it can serve as a foundation for tools that help healthcare professionals identify at-risk individuals. Ultimately, this project contributes to the growing body of work aimed at combating diabetes, with the potential to improve patient outcomes through earlier detection and targeted prevention strategies.