# **Project Name**    - Diabetes Classification Using Machine Learning



##### **Project Type**    - Classification
##### **Contribution**    - Individual/Team
##### **Name**            - Amit Kumar


# **Project Summary -**


Summary: Diabetes Classification Using Machine Learning
Diabetes classification involves predicting whether an individual has diabetes based on various medical attributes. This process is vital for early diagnosis and management, ultimately helping in reducing the health risks associated with diabetes. The classification model utilizes attributes such as pregnancies, glucose levels, blood pressure, skin thickness, insulin levels, BMI, diabetes pedigree function, age, and the outcome (presence or absence of diabetes).

Data Preparation
The first step in building a diabetes classification model is data preparation. This involves loading the dataset, handling missing values, and normalizing or standardizing the data if necessary. Missing values can significantly affect the performance of the model, so they are typically handled by replacing them with the mean or median values of the respective attributes.
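
A minimal sketch of the median replacement described above, run on a small made-up frame rather than the project dataset. It assumes missing readings are stored as `NaN`; the column names mirror the dataset, while `toy` and `impute_median` are hypothetical names used only for illustration.

```python
import numpy as np
import pandas as pd


def impute_median(frame, columns):
    """Return a copy of `frame` with NaNs in `columns` replaced by each column's median."""
    out = frame.copy()
    for col in columns:
        # Series.median() skips NaN by default, so it uses only observed values.
        out[col] = out[col].fillna(out[col].median())
    return out


# Toy data (made up for illustration) with one missing value per column.
toy = pd.DataFrame({
    "Glucose": [85.0, np.nan, 120.0, 150.0],
    "BMI": [22.5, 31.0, np.nan, 28.0],
})

filled = impute_median(toy, ["Glucose", "BMI"])
print(filled)
```

The median is used here instead of the mean because it is less sensitive to extreme values; either choice fits the description above.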

Exploratory Data Analysis (EDA)
EDA helps in understanding the data better. It involves analyzing the distribution of each attribute, visualizing relationships between attributes, and checking for class imbalances in the outcome variable. Visual tools such as histograms, box plots, and correlation matrices are often used to gain insights into the data and its characteristics.

Feature Engineering
Feature engineering involves creating new features or selecting important features that contribute most to the predictive power of the model. This step is crucial as it can enhance the performance of the model by providing more relevant information to the algorithms.

Model Selection
Several classification algorithms can be employed for diabetes prediction, including Logistic Regression, Decision Tree, Random Forest, Support Vector Machine (SVM), and K-Nearest Neighbors (KNN). The choice of model depends on the specific characteristics of the dataset and the problem at hand. The data is typically split into training and testing sets to evaluate the performance of the models.

Model Training
Models are trained on the training dataset. Hyperparameter tuning is performed using cross-validation to find the best set of parameters that improve the model's performance. For instance, Logistic Regression and Random Forest are popular choices due to their simplicity and effectiveness.
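
One way to perform cross-validated hyperparameter tuning is scikit-learn's `GridSearchCV`. The sketch below is an illustration on synthetic data, not the tuning actually run in this notebook; the parameter grid, the `f1` scoring choice, and the `*_demo` names are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (not the diabetes dataset): 8 features, binary target.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# Hypothetical search grid; the values are illustrative, not tuned recommendations.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, 5, None]}

# 5-fold cross-validation on the training split only; the test split stays untouched.
search = GridSearchCV(
    RandomForestClassifier(random_state=0), param_grid, cv=5, scoring="f1"
)
search.fit(X_tr, y_tr)

print("Best parameters:", search.best_params_)
print("Mean cross-validated F1 of best setting:", round(search.best_score_, 3))
# GridSearchCV.score applies the same scorer that was used for the search (F1 here).
print("Held-out F1:", round(search.score(X_te, y_te), 3))
```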

Model Evaluation
The trained models are evaluated on the testing dataset using various metrics such as accuracy, precision, recall, F1-score, and ROC-AUC. These metrics provide insights into the model's performance and help in comparing different models. For instance, confusion matrices and classification reports can highlight the number of true positives, false positives, true negatives, and false negatives, offering a detailed evaluation of the model's predictions.
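
To make the relationship between the confusion-matrix counts and the metrics concrete, here is a small sketch in plain Python. The counts are made up for illustration, and `classification_metrics` is a hypothetical helper, not part of scikit-learn.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0   # of predicted positives, how many are right
    recall = tp / (tp + fn) if (tp + fn) else 0.0      # of actual positives, how many are found
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up counts for illustration only.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

For binary labels, scikit-learn's `confusion_matrix(y_true, y_pred).ravel()` returns the counts in the order `tn, fp, fn, tp`, so they can be passed to a helper like this one.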

Model Deployment
After evaluating the models, the best-performing model is selected for deployment. This model is then used to make predictions on new data. In practice, a model with higher accuracy and better evaluation metrics is preferred for deployment.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


Diabetes is a chronic medical condition that affects millions of people worldwide, leading to severe health complications if not managed properly. Early diagnosis and timely intervention are crucial in mitigating the risks associated with diabetes. The objective of this project is to develop a predictive machine learning model that can accurately classify individuals as diabetic or non-diabetic based on a set of medical attributes.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that each chart is accompanied by answers to the following:
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
*   Will the gained insights help create a positive business impact? Are there any insights that lead to negative growth? Justify with a specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ML algorithms for model creation. Make sure that, for each algorithm, the following points are answered.


*   Explain the ML model used and its performance using an evaluation metric score chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with an updated evaluation metric score chart.

*   Explain what each evaluation metric indicates for the business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report,confusion_matrix
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE

### Dataset Loading

In [None]:
# Load Dataset
df=pd.read_csv('https://raw.githubusercontent.com/sahiamit1993/Diabetes-Prediction/main/diabetes-vid.csv')
df

### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

### Dataset Information

In [None]:
# Dataset Info
def dataset_info(data):
    print('*'*30,'About the dataset :','*'*20)
    print()
    print()
    print('Number of rows :',data.shape[0])
    print('Number of columns :',data.shape[1])
    print()
    print('*'*80)
    print(data.info())
    print()
    print('*'*80)
    print('Missing values :')
    print()
    print(data.isna().sum())
    print()
    print('*'*80)
    print('Number of duplicates :',data.duplicated().sum())
    print()
    print('*'*80)
    # Use the function argument `data`, not the global `df`
    print('variables =',list(data.columns))
    print()
    print('*'*80)
    print('Target variable : ')
    print(data['Outcome'].value_counts())

In [None]:
dataset_info(df)

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isna().sum()

In [None]:
# Visualizing the missing values
sns.heatmap(df.isna())

### What did you know about your dataset?

The dataset contains 768 rows and 9 columns.
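There are no null (NaN) values in the dataset. Note that this check only detects NaN entries, so placeholder values (for example, zeros in a measurement column) would not be counted.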
There are no duplicate values in the dataset.
The target variable is 'Outcome', which indicates whether a person has diabetes (1) or not (0).
The dataset is imbalanced, with more instances of non-diabetic individuals than diabetic individuals.
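
The degree of imbalance can be quantified with `value_counts`. The sketch below uses made-up labels, not the project data; in the notebook the same calls would apply to `df['Outcome']`. The names `outcome`, `shares`, and `imbalance_ratio` are illustrative.

```python
import pandas as pd

# Made-up labels for illustration; 1 = diabetic, 0 = non-diabetic.
outcome = pd.Series([0, 0, 0, 1, 0, 1, 0, 0, 1, 0], name="Outcome")

counts = outcome.value_counts()                 # absolute count per class
shares = outcome.value_counts(normalize=True)   # fraction of rows per class
imbalance_ratio = counts.max() / counts.min()   # majority size / minority size

print(counts)
print(shares)
print("Majority-to-minority ratio:", round(imbalance_ratio, 2))
```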

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe()

### Variables Description

Pregnancies: This attribute indicates the number of times the patient has been pregnant. This can be relevant as pregnancy can influence glucose levels and insulin sensitivity.

Glucose: This represents the plasma glucose concentration measured at the 2-hour mark of an oral glucose tolerance test. Higher glucose levels are indicative of poor blood sugar control and can be a sign of diabetes.

Blood Pressure: This is the diastolic blood pressure (measured in mm Hg). High blood pressure is often associated with diabetes and other cardiovascular conditions.

Skin Thickness: The triceps skinfold thickness measured in millimeters. It is used to estimate the amount of subcutaneous fat and can be related to obesity, which is a risk factor for diabetes.

Insulin: The 2-hour serum insulin (measured in μU/mL). Insulin levels can help in assessing insulin resistance, a common feature in Type 2 diabetes.

BMI (Body Mass Index): This is calculated as weight in kilograms divided by height in meters squared (kg/m^2). BMI is a measure of body fat based on height and weight. Higher BMI values are linked to a higher risk of developing diabetes.

Diabetes Pedigree Function: This is a function that scores the likelihood of diabetes based on family history. It combines information from multiple family members and their medical history to estimate the genetic predisposition to diabetes.

Age: The age of the patient (in years). The risk of diabetes increases with age, making this an important factor in predicting the disease.

Outcome: This is the target variable and is binary. It indicates whether the individual has diabetes (1) or does not have diabetes (0).

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
df.nunique()

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# No missing (NaN) values were found above, so no imputation is applied here

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments
# Check for outliers before treatment
plt.figure(figsize=(15,10))
sns.boxplot(data=df)
plt.show()

# Treat outliers with the IQR rule, one column at a time.
# Rows are dropped sequentially, so each column's quartiles are computed
# on the rows that survived the filters of the columns before it.
outlier_cols = ['Insulin', 'Pregnancies', 'SkinThickness', 'BloodPressure',
                'BMI', 'DiabetesPedigreeFunction', 'Age']
for col in outlier_cols:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_limit = Q1 - 1.5 * IQR
    upper_limit = Q3 + 1.5 * IQR
    # Keep only rows strictly inside the bounds for this column
    df = df[(df[col] > lower_limit) & (df[col] < upper_limit)]

# Check for outliers after treatment
plt.figure(figsize=(15,10))
sns.boxplot(data=df)
plt.show()


##### What all outlier treatment techniques have you used and why did you use those techniques?

 I used the IQR method to remove outliers. It flags values that lie far from the central bulk of a column, and it is simple to implement and interpret.

 Here are the steps involved in the IQR method:

 1. Calculate the first quartile (Q1) and third quartile (Q3) of the data.
 2. Calculate the interquartile range (IQR) as Q3 - Q1.
 3. Define the lower and upper bounds as Q1 - 1.5 * IQR and Q3 + 1.5 * IQR, respectively.
 4. Remove any data points that fall on or outside these bounds.

 The filter was applied to every feature except 'Glucose', one column after another, so each column's bounds were computed on the rows remaining after the earlier filters.
 The box plot after the outlier treatment can be used to confirm the effect.
 Because 'Glucose' was left untreated and quartiles shift as rows are removed, some points may still fall outside the whiskers.
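
 The four steps above can be sketched for a single column as follows. The values are made up for illustration, and `iqr_bounds` is a hypothetical helper; NumPy's default percentile interpolation (linear) is assumed, which matches the pandas `quantile` default.

```python
import numpy as np


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) outlier bounds using the k * IQR rule."""
    q1, q3 = np.percentile(values, [25, 75])   # step 1: quartiles
    iqr = q3 - q1                              # step 2: interquartile range
    return q1 - k * iqr, q3 + k * iqr          # step 3: bounds


# Made-up sample with one obviously extreme value.
sample = np.array([10, 12, 13, 14, 15, 16, 18, 95])
lower, upper = iqr_bounds(sample)

# Step 4: keep only values strictly inside the bounds.
kept = sample[(sample > lower) & (sample < upper)]

print("Bounds:", lower, upper)
print("Kept values:", kept)
```

 Because the quartiles shift once rows are removed, re-running the same check on the filtered data can flag a few new points.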

### 3. Categorical Encoding

In [None]:
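# No encoding is needed: all features are numeric and the target 'Outcome' is already 0/1.
# Group the column names by role for the plots below.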
variables = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']
num_col = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']
cat_col = ['Outcome']

In [None]:
fig, ax = plt.subplots(4, 2, figsize=(10, 10))

n = 0

for i in num_col:
    sns.histplot(data=df, x=i, kde=True, ax=ax[n//2, n%2])

    ax[n//2, n%2].set_title(f'Histogram of {i}')

    n += 1

plt.tight_layout()
plt.show()

In [None]:
fig, ax = plt.subplots(4, 2, figsize=(10, 10))

n = 0

for i in num_col:
    sns.boxplot(data=df, x='Outcome',y=i, ax=ax[n//2, n%2])

    ax[n//2, n%2].set_title(f'Boxplot of {i}')

    n += 1

plt.tight_layout()
plt.show()

In [None]:
sns.countplot(data=df,x='Outcome')
plt.title('Count plot of Outcome variable')
plt.show()

### 8. Data Splitting

In [None]:
X = df.drop('Outcome',axis=1)
y = df['Outcome']

In [None]:
# Oversample the minority class with SMOTE so that both classes are equally represented
smote = SMOTE()
X_resampled, y_resampled = smote.fit_resample(X, y)

In [None]:
std = StandardScaler()

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
X_train,X_test,y_train,y_test = train_test_split(std.fit_transform(X_resampled),y_resampled,test_size=0.2,random_state=123)
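
The cells above oversample and scale the full dataset before splitting, which lets the eventual test rows influence the oversampler and the scaler. A common alternative is to split first and fit all preprocessing on the training portion only. The sketch below shows that ordering for the scaler on synthetic data; the `*_demo` and `*_scaled` names are illustrative, and any oversampling such as SMOTE would likewise be applied to the training rows only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the diabetes dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_demo = (X_demo[:, 0] > 50.0).astype(int)

# 1. Split first, keeping the class proportions similar in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=123, stratify=y_demo
)

# 2. Fit the scaler on the training rows only, then reuse it for the test rows.
scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)   # statistics learned from training rows only
X_te_scaled = scaler.transform(X_te)       # test rows reuse the training statistics

print("Train shape:", X_tr_scaled.shape, "Test shape:", X_te_scaled.shape)
```

With this ordering the test set stays untouched until evaluation, so the reported scores are less likely to be optimistic.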

##### What data splitting ratio have you used and why?

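An 80:20 train/test split was used (`test_size=0.2`, with `random_state=123` for reproducibility). This keeps most of the resampled data for training while holding out a fifth of it to estimate performance on unseen samples.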

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
lr = LogisticRegression()
dt = DecisionTreeClassifier()
rf = RandomForestClassifier()
svc = SVC()

In [None]:
def model_info(model):
    print()
    print(f'For {model}')
    print()
    model.fit(X_train,y_train)
    y_pred_train = model.predict(X_train)
    y_pred_test = model.predict(X_test)

    cla_repo_train = classification_report(y_train,y_pred_train)
    conf_mat_train = confusion_matrix(y_train,y_pred_train)

    cla_repo_test = classification_report(y_test,y_pred_test)
    conf_mat_test = confusion_matrix(y_test,y_pred_test)

    print('Classification report (Train):')
    print()
    print(cla_repo_train)
    print()
    print('Classification report (Test):')
    print()
    print(cla_repo_test)
    print()
    print('*'*80)
    print('Confusion matrix (Train):')
    print()
    print(conf_mat_train)
    print()
    print('Confusion matrix (Test):')
    print()
    print(conf_mat_test)
    print()
    print('*'*80)
    print('*'*80)
    print('*'*80)

In [None]:
models = [lr,dt,rf,svc]

for model in models:
    model_info(model)

Logistic Regression:
- Achieved an accuracy of 78% on the training set and 77% on the testing set.
- Shows a good balance between precision and recall for both classes.

Decision Tree Classifier:
- Achieved perfect accuracy (100%) on the training set, indicating potential overfitting.
- Performance on the testing set is lower (73%), suggesting overfitting to the training data.

Random Forest Classifier:
- Achieved an accuracy of 100% on the training set, also indicating potential overfitting.
- Performed better on the testing set (81%) compared to Decision Tree, suggesting better generalization.

Support Vector Classifier (SVC):
- Achieved an accuracy of 81% on the training set and 79% on the testing set.
- Shows a good balance between precision and recall for both classes.

Overall:
- Random Forest Classifier shows the best performance on the testing set, indicating its ability to generalize well.
- Decision Tree Classifier shows signs of overfitting, which needs to be addressed through techniques like pruning or limiting tree depth.
- Logistic Regression and SVC provide decent performance with a good balance between precision and recall.

Further steps could include:
- Hyperparameter tuning for all models to optimize their performance.
- Exploring other classification algorithms like K-Nearest Neighbors or Gradient Boosting.
- Comparing the SMOTE oversampling already applied here with alternatives such as undersampling or class weighting.
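
As an illustration of how limiting tree depth affects the gap between training and testing accuracy, the sketch below compares an unrestricted decision tree with a depth-limited one on synthetic data. It is not a re-run of the notebook's models; the data, the depth value of 3, and the `results` dictionary are assumptions for demonstration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with some label noise (not the diabetes dataset).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=4, flip_y=0.1, random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)

results = {}
for depth in (None, 3):   # None lets the tree grow until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=1).fit(X_tr, y_tr)
    results[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    print(f"max_depth={depth}: train={results[depth][0]:.2f}, test={results[depth][1]:.2f}")
```

A large difference between the two scores for a given depth is the overfitting signal discussed above; a suitable depth for the real data would be chosen by cross-validation.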

In [None]:
sns.countplot(data=df,x='Outcome')
plt.title('Count plot of Outcome variable')
plt.show()

# **Conclusion**

In this project, we tackled the challenge of diabetes classification using machine learning techniques.
We explored various algorithms, including Logistic Regression, Decision Tree, Random Forest, and Support Vector Machine (SVM),
to predict whether an individual has diabetes based on medical attributes.

Through data preprocessing, exploratory data analysis, and model evaluation,
we compared four classifiers on this dataset.

Random Forest achieved the highest test accuracy (81%), ahead of SVC (79%), Logistic Regression (77%), and Decision Tree (73%), although its 100% training accuracy suggests some overfitting.
These results indicate its potential for real-world application in assisting healthcare professionals with early diabetes diagnosis.

However, it's important to acknowledge that no model is perfect, and further improvements can be explored.
Future work could involve:

- Experimenting with additional algorithms or ensemble methods to potentially enhance predictive performance.
- Incorporating more diverse and comprehensive datasets to improve the model's generalizability.
- Fine-tuning hyperparameters and exploring advanced feature engineering techniques to optimize the model's accuracy.

Overall, this project highlights the significant role machine learning can play in addressing critical healthcare challenges like diabetes prediction,
paving the way for more effective and timely interventions.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***