# **Project Name**    - Diabetes Prediction



##### **Project Type**    - EDA/Regression/Classification/Unsupervised
##### **Contribution**    - Individual/Team
##### **Team Member 1 -** 2210992262
##### **Team Member 2 -** 2210992264
##### **Team Member 3 -** 2210992311
##### **Team Member 4 -** 2210992341

# **Project Summary -**

Write the summary here within 500-600 words.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


**Write Problem Statement Here.**

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
df = pd.read_csv('/content/diabetes (2) (1).csv')

### Dataset First View

In [None]:
df.head()

### Dataset Rows & Columns count

In [None]:
df.shape

### Dataset Information

In [None]:
df.info()

#### Duplicate Values

In [None]:
# Count fully duplicated rows
df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
df.isnull().sum()

In [None]:
sns.countplot(x = 'Outcome',data = df)

### What did you know about your dataset?

Observations:
1. The dataset contains 768 records and 9 columns (8 features plus the Outcome target).
2. Every column has an integer or float datatype.
3. Some features, such as Glucose, BloodPressure, Insulin and BMI, contain zero values that represent missing data.
4. There are zero NaN values in the dataset.
5. In the outcome column, 1 represents diabetes positive and 0 represents diabetes negative.
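Observation 3 can be quantified by counting the zero entries per column. The following is a minimal sketch: the helper name `count_zero_placeholders`, the column list (including `SkinThickness`) and the tiny synthetic frame are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

# Columns where 0 is assumed to be a placeholder for a missing measurement
ZERO_AS_MISSING = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']

def count_zero_placeholders(frame, columns=ZERO_AS_MISSING):
    """Return the count and percentage of zero entries for each listed column."""
    present = [c for c in columns if c in frame.columns]  # skip absent columns
    zero_counts = (frame[present] == 0).sum()
    return pd.DataFrame({
        'zero_count': zero_counts,
        'zero_pct': (zero_counts / len(frame) * 100).round(2),
    })

# Synthetic rows for illustration only (not taken from the real dataset)
sample = pd.DataFrame({
    'Glucose': [148, 0, 183, 89],
    'BloodPressure': [72, 66, 0, 0],
    'Insulin': [0, 0, 0, 94],
    'BMI': [33.6, 26.6, 23.3, 28.1],
})
zero_summary = count_zero_placeholders(sample)
print(zero_summary)
```

On the real data, `count_zero_placeholders(df)` would show which columns are most affected before any imputation is chosen.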

## ***2. Understanding Your Variables***

In [None]:
df.columns

In [None]:
df.describe()

### Variables Description

The features in the diabetes prediction dataset:

Pregnancies: Number of times pregnant.

Glucose: Plasma glucose concentration after 2 hours in an oral glucose tolerance test.

Blood Pressure: Diastolic blood pressure (mm Hg).

Skin Thickness: Triceps skin fold thickness (mm).

Insulin: 2-Hour serum insulin (mu U/ml).

BMI (Body Mass Index): Body mass index (weight in kg/(height in m)^2).

Diabetes Pedigree Function: Diabetes pedigree function, which provides a measure of diabetes hereditary risk based on family history.

Age: Age of the patient (years).

Outcome: Binary indicator of whether the patient has diabetes or not (1 for diabetes, 0 for no diabetes).

### Check Unique Values for each variable.

In [None]:
# A notebook cell only displays its last bare expression,
# so print the unique values of every column explicitly.
for column in df.columns:
    print(f"{column}: {df[column].unique()}")

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:

from sklearn.impute import SimpleImputer
from scipy import stats

# Assuming df is your DataFrame

# Columns in which a value of 0 is not a plausible measurement and marks missing data.
# Pregnancies and Outcome legitimately contain 0, so they are left untouched.
zero_as_missing = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']

# Replace the placeholder zeros with NaN in those columns only
df[zero_as_missing] = df[zero_as_missing].replace(0, np.nan)

# Check if there are any missing values
if df.isnull().values.any():
    # Impute missing values with the column mean
    imputer = SimpleImputer(strategy='mean')
    df[zero_as_missing] = imputer.fit_transform(df[zero_as_missing])
else:
    print("No missing values found in the DataFrame.")

# Detect and handle outliers using z-score or IQR method
# (done after imputation, because NaN values would propagate through the z-scores)
z_scores = np.abs(stats.zscore(df.select_dtypes(include=np.number)))
df = df[(z_scores < 3).all(axis=1)].copy()

# Remove duplicate rows
df = df.drop_duplicates()

# Convert data types if needed
df['Age'] = df['Age'].astype(int)

# Create new feature 'BMI_category' based on BMI values
df['BMI_category'] = pd.cut(df['BMI'], bins=[0, 18.5, 24.9, 29.9, np.inf], labels=['Underweight', 'Normal', 'Overweight', 'Obese'])

# Drop columns that are not needed
# For example, if 'Patient_ID' is not relevant for analysis
# df.drop('Patient_ID', axis=1, inplace=True)

### What all manipulations have you done and insights you found?


Handling Missing Values:
Zero values in Glucose, BloodPressure, SkinThickness, Insulin and BMI are treated as missing and replaced with NaN. The replacement is restricted to these columns because 0 is a valid value for Pregnancies and Outcome; replacing zeros across the whole DataFrame would erase every non-diabetic label.

Imputing Missing Values with Mean:
SimpleImputer(strategy='mean') fills the resulting NaN values with the mean of each affected column. Mean imputation keeps every row in the dataset, at the cost of somewhat reducing the variance of the imputed columns.

Outlier Detection and Removal:
Outliers are detected and handled using the z-score method. The code calculates z-scores for the numerical columns after imputation and keeps only rows whose absolute z-score is below 3 in every column, i.e. within 3 standard deviations of the mean. This removes data points that differ strongly from the majority of the data, improving the robustness of subsequent analysis.

Removing Duplicate Rows:
drop_duplicates() is applied to remove duplicate rows from the dataset. Removing duplicates ensures that each observation is unique, preventing biases in analysis and ensuring data integrity.
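The wrangling code mentions the IQR method as an alternative to z-scores. A minimal sketch of that alternative is shown below; the helper name `iqr_outlier_mask`, the multiplier `k=1.5` and the synthetic values are assumptions for illustration, and this is not the method the notebook actually applies.

```python
import pandas as pd

def iqr_outlier_mask(frame, columns, k=1.5):
    """True for rows that lie inside [Q1 - k*IQR, Q3 + k*IQR] on every listed column."""
    q1 = frame[columns].quantile(0.25)
    q3 = frame[columns].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Compare each column against its own bounds (bounds are aligned on column names)
    within = frame[columns].ge(lower, axis='columns') & frame[columns].le(upper, axis='columns')
    return within.all(axis=1)

# Synthetic values for illustration only
demo = pd.DataFrame({
    'Insulin': [80, 90, 100, 110, 120, 900],
    'BMI': [22.0, 25.0, 27.0, 30.0, 33.0, 31.0],
})
inlier_mask = iqr_outlier_mask(demo, ['Insulin', 'BMI'])
filtered = demo[inlier_mask]
print(filtered)
```

Unlike the z-score rule, the IQR bounds are based on quartiles, so a few extreme values do not stretch the thresholds. Rows containing NaN in a listed column fail the comparison and are dropped by this mask.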

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart 1: Histogram for Glucose
plt.hist(df['Glucose'], bins=20, color='skyblue')
plt.title('Distribution of Glucose')
plt.xlabel('Glucose')
plt.ylabel('Frequency')
plt.show()


##### 1. Why did you pick the specific chart?

We chose a histogram to visualize the distribution of glucose levels in the dataset. A histogram displays the frequency distribution of a continuous variable, which makes it well suited to understanding how glucose levels are spread across the population.

##### 2. What is/are the insight(s) found from the chart?

The histogram illustrates the frequency distribution of glucose levels across the dataset. By examining the shape of the histogram, we can identify common glucose level ranges and any potential outliers. Additionally, the histogram provides insights into the overall distribution pattern, such as whether it is symmetric, skewed, or multi-modal.
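Whether the distribution is symmetric or skewed can also be checked numerically. The sketch below compares mean and median and computes the sample skewness; the helper name `describe_shape` and the ±0.5 cut-off are assumptions (a common rule of thumb, not a formal test), and the input values are synthetic.

```python
import pandas as pd

def describe_shape(values, threshold=0.5):
    """Summarise the shape of a numeric sample using mean, median and skewness."""
    series = pd.Series(values, dtype=float).dropna()
    skewness = series.skew()
    if skewness > threshold:
        shape = 'right-skewed'
    elif skewness < -threshold:
        shape = 'left-skewed'
    else:
        shape = 'roughly symmetric'
    return {
        'mean': series.mean(),
        'median': series.median(),
        'skewness': skewness,
        'shape': shape,
    }

# Synthetic samples for illustration only
symmetric_summary = describe_shape([1, 2, 3, 4, 5])
skewed_summary = describe_shape([1, 1, 1, 2, 2, 3, 20])
print(symmetric_summary['shape'], skewed_summary['shape'])
```

Calling `describe_shape(df['Glucose'])` would give a numeric counterpart to what the histogram shows visually.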

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, understanding the distribution of glucose levels in the dataset can provide valuable insights for various stakeholders in the healthcare industry. For example, healthcare providers can use this information to identify patients with abnormal glucose levels who may require further evaluation or intervention. Similarly, pharmaceutical companies and researchers can use these insights to develop targeted interventions or therapies for managing diabetes and related conditions, ultimately leading to improved patient outcomes and potentially reducing healthcare costs.

#### Chart - 2

In [None]:
plt.scatter(df['Glucose'], df['Insulin'], alpha=0.5)
plt.title('Scatter Plot of Glucose vs. Insulin')
plt.xlabel('Glucose')
plt.ylabel('Insulin')
plt.show()

##### 1. Why did you pick the specific chart?

We selected a scatter plot to visualize the relationship between glucose and insulin levels in the dataset. A scatter plot is suitable for displaying the relationship between two continuous variables, allowing us to identify patterns, correlations, or trends between them.

##### 2. What is/are the insight(s) found from the chart?

The scatter plot shows the distribution of data points representing glucose and insulin levels for each observation in the dataset. By examining the scatter plot, we can identify any potential correlations or patterns between glucose and insulin levels. For example, we may observe a positive correlation, indicating that higher glucose levels are associated with higher insulin levels, or vice versa.
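The strength and direction of such a relationship can be put into numbers with correlation coefficients. The following is a sketch under the assumption that SciPy is available (it is already imported in the wrangling step); the helper name `correlation_summary` and the synthetic frame are illustrative only.

```python
import pandas as pd
from scipy import stats

def correlation_summary(frame, x, y):
    """Pearson (linear) and Spearman (rank-based) correlation for two columns."""
    pair = frame[[x, y]].dropna()  # both coefficients require complete pairs
    pearson_r, _ = stats.pearsonr(pair[x], pair[y])
    spearman_rho, _ = stats.spearmanr(pair[x], pair[y])
    return {
        'n': len(pair),
        'pearson_r': float(pearson_r),
        'spearman_rho': float(spearman_rho),
    }

# Synthetic values for illustration only; None marks a missing measurement
demo = pd.DataFrame({
    'Glucose': [90, 100, 110, 120, None, 140],
    'Insulin': [50, 60, 70, 80, 90, None],
})
corr = correlation_summary(demo, 'Glucose', 'Insulin')
print(corr)
```

A value near +1 indicates that the two variables rise together, a value near -1 that one falls as the other rises, and a value near 0 that there is no monotonic or linear association.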

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, understanding the relationship between glucose and insulin levels is crucial for managing diabetes and related metabolic conditions. The insights gained from the scatter plot can inform healthcare professionals about the dynamics between these two important biomarkers and help tailor treatment plans or interventions for patients with diabetes. Additionally, researchers and pharmaceutical companies can use these insights to develop targeted therapies or interventions aimed at improving insulin sensitivity or regulating glucose levels, ultimately leading to better health outcomes for individuals with diabetes.

#### Chart - 3

In [None]:
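# Plot the categorical BMI_category feature created during wrangling (not the raw BMI values),
# keeping the categories in their natural order
df['BMI_category'].value_counts().sort_index().plot(kind='bar', color='salmon')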
plt.title('BMI Category Distribution')
plt.xlabel('BMI Category')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()

##### 1. Why did you pick the specific chart?

We chose a bar chart because it effectively represents the distribution of BMI categories within the dataset. Bar charts are suitable for visualizing categorical data, making them ideal for displaying the count of each BMI category.

##### 2. What is/are the insight(s) found from the chart?

From the chart, we can observe the distribution of individuals across different BMI categories. The bar heights represent the frequency or count of individuals falling into each category. We can identify which BMI category is the most prevalent and whether there are any significant differences in frequencies between categories.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights gained from this chart can have a positive impact on business decisions, particularly in industries related to healthcare, wellness, and fitness. Understanding the distribution of BMI categories can inform strategies for targeted interventions and health promotion initiatives. For example, businesses in the healthcare sector can use this information to develop personalized health plans for individuals, create targeted marketing campaigns for specific BMI categories, or design wellness programs tailored to address the needs of different BMI groups. By addressing health concerns and promoting healthier lifestyle choices, businesses can contribute to improved health outcomes for individuals and potentially reduce healthcare costs in the long run.

#### Chart - 4

In [None]:
plt.boxplot(df['Age'])
plt.title('Boxplot of Age')
plt.ylabel('Age')
plt.show()

##### 1. Why did you pick the specific chart?

We chose a boxplot because it effectively summarizes the distribution of age values within the dataset and provides insights into the central tendency, variability, and presence of outliers. Boxplots are particularly useful for visualizing the spread and skewness of numerical data, making them suitable for identifying any potential anomalies or patterns in the distribution of age values.

##### 2. What is/are the insight(s) found from the chart?

From the boxplot, we can gain insights into several aspects of the age distribution:

*   The central tendency: The median line (middle line of the box) indicates the median age of the dataset, providing a measure of the central tendency.
*   Variability: The length of the box (interquartile range) represents the spread or variability of age values. A longer box indicates greater variability.
*   Presence of outliers: The whiskers extending from the box provide information about the range of age values. Any data points beyond the whiskers may be considered outliers, potentially indicating unusual or extreme ages in the dataset.
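The quantities a boxplot draws (median, quartiles, whisker ends and outlier points) can be read out directly with Matplotlib's `cbook.boxplot_stats`. This is a small sketch; the wrapper name `box_summary` and the synthetic ages are assumptions for illustration.

```python
import numpy as np
from matplotlib import cbook

def box_summary(values, whis=1.5):
    """Return the statistics that a boxplot with the given whisker factor would display."""
    arr = np.asarray(values, dtype=float)
    arr = arr[~np.isnan(arr)]  # ignore missing values
    box = cbook.boxplot_stats(arr, whis=whis)[0]
    return {
        'median': float(box['med']),
        'q1': float(box['q1']),
        'q3': float(box['q3']),
        'whisker_low': float(box['whislo']),
        'whisker_high': float(box['whishi']),
        'outliers': [float(v) for v in box['fliers']],
    }

# Synthetic ages for illustration only
age_summary = box_summary([21, 22, 23, 24, 25, 26, 27, 28, 29, 81])
print(age_summary)
```

For the notebook's data, `box_summary(df['Age'])` would list the exact ages drawn as points beyond the whiskers.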

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights gained from this boxplot can have several positive implications for businesses, especially those in industries related to healthcare, education, and marketing. Understanding the distribution and characteristics of age values can inform decision-making processes and strategic initiatives.

#### Chart - 5

In [None]:
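# Sort by class label so that the slices always match the label order (0 = Non-Diabetic, 1 = Diabetic)
outcome_counts = df['Outcome'].value_counts().sort_index()
plt.pie(outcome_counts, labels=['Non-Diabetic', 'Diabetic'], autopct='%1.1f%%', colors=['lightgreen', 'lightcoral'])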
plt.title('Distribution of Outcome')
plt.show()

##### 1. Why did you pick the specific chart?

We chose a pie chart because it effectively illustrates the distribution of outcomes (non-diabetic vs. diabetic) within the dataset. Pie charts are useful for displaying proportions or percentages of a whole, making them suitable for visualizing categorical data with a few distinct categories.

##### 2. What is/are the insight(s) found from the chart?

From the pie chart, we can gain insights into the relative proportions of non-diabetic and diabetic individuals within the dataset. The size of each slice represents the proportion of individuals belonging to each outcome category. We can easily compare the prevalence of diabetes within the dataset and determine the distribution of outcomes.
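The proportions shown in the pie chart can be tabulated directly, which is useful later when judging class imbalance. A minimal sketch follows; the helper name `class_balance` and the synthetic labels are assumptions for illustration.

```python
import pandas as pd

def class_balance(outcome):
    """Count and percentage share of each class label, ordered by label."""
    counts = pd.Series(outcome).value_counts().sort_index()
    percent = (counts / counts.sum() * 100).round(1)
    return pd.DataFrame({'count': counts, 'percent': percent})

# Synthetic labels for illustration only (0 = non-diabetic, 1 = diabetic)
balance = class_balance([0, 0, 0, 1])
print(balance)
```

On the notebook's data this would be called as `class_balance(df['Outcome'])`.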

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights gained from this pie chart can have positive implications for businesses, especially those in the healthcare and wellness industries:

Healthcare providers can use this information to understand the prevalence of diabetes within a population and allocate resources accordingly for screening, diagnosis, and treatment of diabetic individuals.

Wellness organizations and fitness centers can develop targeted programs and interventions to support individuals with diabetes and promote healthy lifestyle choices to prevent the onset of diabetes.

Pharmaceutical companies and medical device manufacturers can tailor their products and services to address the specific needs of diabetic individuals, thereby enhancing patient care and improving health outcomes.

#### Chart - 6

In [None]:
plt.figure(figsize=(8, 6))
plt.scatter(df['BloodPressure'], df['BMI'], alpha=0.5, color='green')
plt.title('Scatter Plot of Blood Pressure vs. BMI')
plt.xlabel('Blood Pressure')
plt.ylabel('BMI')
plt.show()


##### 1. Why did you pick the specific chart?

We chose a scatter plot because it effectively visualizes the relationship between two continuous variables: blood pressure and BMI. Scatter plots are ideal for identifying patterns, trends, and correlations between variables, making them suitable for exploring relationships between quantitative data.

##### 2. What is/are the insight(s) found from the chart?

From the scatter plot, we can gain insights into the relationship between blood pressure and BMI:

*   Correlation: We can assess whether there is a linear or nonlinear relationship between blood pressure and BMI. A positive correlation indicates that higher blood pressure is associated with higher BMI, while a negative correlation suggests the opposite.
*   Pattern: We can observe any discernible patterns or clusters in the data points, which may provide insights into potential subgroups or associations between blood pressure and BMI.
*   Outliers: We can identify any outlier data points that deviate significantly from the general trend, which may warrant further investigation.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights gained from this scatter plot can have positive implications for businesses, particularly those in the healthcare, wellness, and fitness industries:

Healthcare providers can use this information to assess the relationship between blood pressure and BMI in patients, identify individuals at risk of hypertension or obesity-related complications, and develop personalized treatment plans.

Wellness organizations and fitness centers can leverage this knowledge to design targeted interventions and lifestyle programs aimed at improving blood pressure management and promoting healthy BMI levels among their clients.

Public health agencies and policymakers can use the insights to inform population-level interventions and initiatives aimed at reducing the prevalence of hypertension and obesity, thereby improving overall public health outcomes.

#### Chart - 7

In [None]:
sns.barplot(x='Outcome', y='BloodPressure', data=df, ci='sd', palette='muted')
plt.title('Bar Chart of Blood Pressure by Outcome')
plt.show()

##### 1. Why did you pick the specific chart?

We chose a bar chart because it effectively compares the average blood pressure values between the outcome categories (diabetic vs. non-diabetic). Bar charts are suitable for visualizing comparisons between categories or groups, making them ideal for comparing the mean of a numerical variable across groups.

##### 2. What is/are the insight(s) found from the chart?

From the bar chart, we can gain insights into the average blood pressure values for diabetic and non-diabetic individuals:

Comparison: We can easily compare the average blood pressure values between diabetic and non-diabetic groups. The height of each bar represents the mean blood pressure for the respective outcome category.

Variability: Because the plot is drawn with ci='sd', the error bars show one standard deviation of blood pressure within each group rather than a confidence interval of the mean. They indicate how widely blood pressure values are dispersed among diabetic and non-diabetic individuals; strongly overlapping bars suggest that the group means are not far apart relative to that spread.
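The bar heights and error bars correspond to the per-group mean and standard deviation, which can be computed with a `groupby`. The sketch below uses a hypothetical helper `group_summary` and synthetic rows for illustration.

```python
import pandas as pd

def group_summary(frame, value_col, group_col='Outcome'):
    """Per-group count, mean and sample standard deviation of a numeric column."""
    return frame.groupby(group_col)[value_col].agg(['count', 'mean', 'std'])

# Synthetic rows for illustration only
demo = pd.DataFrame({
    'Outcome': [0, 0, 1, 1],
    'BloodPressure': [70, 74, 80, 90],
})
bp_by_outcome = group_summary(demo, 'BloodPressure')
print(bp_by_outcome)
```

Applied to the notebook's DataFrame, `group_summary(df, 'BloodPressure')` gives the exact numbers behind the chart.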

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights gained from this bar chart can have positive implications for businesses, especially those in the healthcare and wellness sectors:

Healthcare providers can use this information to understand the relationship between blood pressure and diabetes, identify potential risk factors or associations, and tailor treatment strategies accordingly.

Wellness organizations and health coaches can leverage these insights to educate individuals about the importance of blood pressure management, especially for those with diabetes. They can offer targeted interventions and lifestyle modifications to help individuals maintain healthy blood pressure levels and reduce the risk of diabetes-related complications.

#### Chart - 8

In [None]:
# fill=True replaces the deprecated shade=True argument in recent seaborn versions
sns.kdeplot(df['Age'], fill=True, color='orange')
plt.title('Kernel Density Estimation (KDE) of Age')
plt.xlabel('Age')
plt.ylabel('Density')
plt.show()

##### 1. Why did you pick the specific chart?

We chose a Kernel Density Estimation (KDE) plot because it provides a smooth estimate of the probability density function of a continuous variable, in this case age. KDE plots are useful for visualizing the distribution and shape of a single variable, particularly when the underlying distribution is not known.

##### 2. What is/are the insight(s) found from the chart?

From the KDE plot, we can gain insights into the distribution of age values:

Shape of distribution: We can observe the shape of the density curve, which provides information about the central tendency, variability, and skewness of age values. For example, a symmetric bell-shaped curve suggests a normal distribution, while asymmetry may indicate skewness.

Peaks and troughs: Peaks in the density curve represent areas of higher density, indicating where age values are more concentrated. Troughs represent areas of lower density.

Spread: The width of the density curve reflects the spread or variability of age values. A wider curve indicates greater variability, while a narrower curve suggests less variability.
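The location of the main peak can be estimated numerically by evaluating a kernel density estimate on a grid. This sketch uses `scipy.stats.gaussian_kde` with its default bandwidth, so the curve may differ slightly from the seaborn plot; the helper name `density_peak` and the synthetic ages are assumptions for illustration.

```python
import numpy as np
from scipy.stats import gaussian_kde

def density_peak(values, grid_points=512):
    """Return the value at which a Gaussian KDE of the sample is highest."""
    arr = np.asarray(values, dtype=float)
    arr = arr[~np.isnan(arr)]  # ignore missing values
    kde = gaussian_kde(arr)    # default bandwidth (Scott's rule)
    grid = np.linspace(arr.min(), arr.max(), grid_points)
    density = kde(grid)
    return float(grid[np.argmax(density)])

# Synthetic ages for illustration only: most values cluster in the low twenties
peak_age = density_peak([21, 22, 22, 23, 23, 23, 24, 24, 25, 40, 55])
print(round(peak_age, 1))
```

For the notebook's data, `density_peak(df['Age'])` would report the age around which patients are most concentrated.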

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights gained from this KDE plot can have positive implications for businesses, particularly those in the healthcare, marketing, and demographic analysis sectors:

Healthcare providers can use this information to understand the age distribution of their patient population, identify age-related health trends or risk factors, and tailor healthcare services and interventions accordingly.

Marketing professionals can leverage these insights to develop targeted marketing strategies and campaigns tailored to different age demographics. Understanding the age distribution of their target audience can help businesses create more effective marketing messages and reach their desired market segments.

Demographic analysts and policymakers can use the insights to inform strategic decision-making processes, such as resource allocation, infrastructure planning, and policy development, based on the age distribution of the population. This can contribute to more efficient and equitable allocation of resources and services, leading to positive societal outcomes.

#### Chart - 9

In [None]:
plt.figure(figsize=(8, 6))
plt.scatter(df['Age'], df['Insulin'], alpha=0.5, color='orange')
plt.title('Scatter Plot of Age vs. Insulin')
plt.xlabel('Age')
plt.ylabel('Insulin')
plt.show()

##### 1. Why did you pick the specific chart?

We chose a scatter plot because it effectively visualizes the relationship between two continuous variables: age and insulin levels. Scatter plots are well-suited for identifying patterns, trends, and correlations between variables, making them ideal for exploring relationships between quantitative data.


##### 2. What is/are the insight(s) found from the chart?

From the scatter plot, we can gain insights into the relationship between age and insulin levels:

Correlation: We can assess whether there is a linear or nonlinear relationship between age and insulin levels. A positive correlation suggests that insulin levels tend to increase with age, a negative correlation suggests that they tend to decrease with age, and a lack of correlation indicates no clear linear relationship.

Pattern: We can observe any discernible patterns or clusters in the data points, which may provide insights into potential associations or subgroups based on age and insulin levels.

Outliers: We can identify any outlier data points that deviate significantly from the general trend, which may warrant further investigation into potential anomalies or unique characteristics.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights gained from this scatter plot can have positive implications for businesses, particularly those in the healthcare, wellness, and medical research fields:

Healthcare providers can use this information to better understand the relationship between age and insulin levels in patients, identify individuals at risk of insulin-related conditions or complications, and develop personalized treatment plans.

Wellness organizations and fitness centers can leverage these insights to design targeted interventions and lifestyle programs aimed at promoting healthy insulin levels across different age groups.

Medical researchers and pharmaceutical companies can use the insights to inform studies and clinical trials aimed at investigating age-related changes in insulin metabolism, developing new treatments for insulin-related disorders, and improving patient care and outcomes.

#### Chart - 10

In [None]:
plt.figure(figsize=(8, 6))
plt.scatter(df['Glucose'], df['BloodPressure'], alpha=0.5, color='purple')
plt.title('Scatter Plot of Glucose vs. Blood Pressure')
plt.xlabel('Glucose')
plt.ylabel('Blood Pressure')
plt.show()

##### 1. Why did you pick the specific chart?

We chose a scatter plot because it effectively visualizes the relationship between two continuous variables: glucose and blood pressure. Scatter plots are well-suited for identifying patterns, trends, and correlations between variables, making them ideal for exploring relationships between quantitative data.

##### 2. What is/are the insight(s) found from the chart?

From the scatter plot, we can gain insights into the relationship between glucose and blood pressure:

Correlation: We can assess whether there is a linear or nonlinear relationship between glucose and blood pressure. A positive correlation suggests that blood pressure tends to increase with glucose levels, a negative correlation suggests that it tends to decrease, and a lack of correlation indicates no clear linear relationship.

Pattern: We can observe any discernible patterns or clusters in the data points, which may provide insights into potential associations or subgroups based on glucose and blood pressure levels.

Outliers: We can identify any outlier data points that deviate significantly from the general trend, which may warrant further investigation into potential anomalies or unique characteristics.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights gained from this scatter plot can have positive implications for businesses, particularly those in the healthcare, wellness, and medical research fields:

Healthcare providers can use this information to better understand the relationship between glucose and blood pressure in patients, identify individuals at risk of blood pressure-related conditions or complications associated with glucose levels, and develop personalized treatment plans.

Wellness organizations and fitness centers can leverage these insights to design targeted interventions and lifestyle programs aimed at promoting healthy glucose and blood pressure levels among their clients.

Medical researchers and pharmaceutical companies can use the insights to inform studies and clinical trials aimed at investigating the link between glucose metabolism and blood pressure regulation, developing new treatments for conditions such as hypertension and diabetes, and improving patient care and outcomes.

#### Chart - 11

In [None]:
plt.figure(figsize=(8, 6))
plt.bar(df.index, df['SkinThickness'], color='blue')
plt.title('Skin Thickness Distribution')
plt.xlabel('Index')
plt.ylabel('Skin Thickness')
plt.show()


##### 1. Why did you pick the specific chart?

We chose a bar chart because it displays the skin thickness value of every record against its row index. Each bar represents one individual, so the chart shows the individual values rather than how often each value occurs.

##### 2. What is/are the insight(s) found from the chart?

From the bar chart, we can gain insights into the distribution of skin thickness values:

Variation: We can observe the range of skin thickness values and any variability or dispersion within the dataset. Differences in bar heights indicate variations in skin thickness across individuals.

Central tendency: Because each bar is a single record, the typical bar height gives a rough sense of the typical skin thickness. The chart does not show frequencies, so it indicates the level around which values cluster but not how many records share a given value.

Outliers: We can identify any outlier values that deviate significantly from the main distribution, which may warrant further investigation into potential anomalies or unique characteristics.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights gained from this bar chart can have positive implications for businesses, particularly those in the healthcare, dermatology, and cosmetic industries:

Healthcare providers can use this information to assess the distribution of skin thickness values in patients, identify individuals with abnormal skin thickness, and potentially diagnose underlying skin conditions or disorders.

Dermatologists and cosmetic practitioners can leverage these insights to tailor treatment plans and procedures based on individual skin characteristics, ensuring optimal outcomes and patient satisfaction.

Skincare product manufacturers and cosmetic companies can use the insights to develop targeted skincare formulations and products designed to address specific skin thickness concerns, meet consumer needs, and enhance product efficacy and performance.

#### Chart - 12

In [None]:
plt.figure(figsize=(8, 6))
plt.hist(df['Insulin'], bins=20, color='red')
plt.title('Distribution of Insulin')
plt.xlabel('Insulin')
plt.ylabel('Frequency')
plt.show()

##### 1. Why did you pick the specific chart?

We chose a histogram because it effectively visualizes the distribution of insulin levels within the dataset. Histograms are well-suited for displaying the frequency or count of values within different bins or intervals of a numerical variable, making them ideal for exploring the distribution of continuous data.

##### 2. What is/are the insight(s) found from the chart?

From the histogram, we can gain insights into the distribution of insulin levels:

Shape of distribution: We can observe the shape of the histogram, which provides information about the central tendency, variability, and skewness of insulin level values. For example, a symmetric bell-shaped curve suggests a normal distribution, while asymmetry may indicate skewness.

Peaks and troughs: Peaks in the histogram represent areas of higher frequency, indicating where insulin level values are more concentrated. Troughs represent areas of lower frequency.

Spread: The width of the histogram reflects the spread or variability of insulin level values. A wider histogram indicates greater variability, while a narrower histogram suggests less variability.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights gained from this histogram can have positive implications for businesses, particularly those in the healthcare, pharmaceutical, and medical research fields:

Healthcare providers can use this information to understand the distribution of insulin levels in patients, identify individuals with abnormal insulin levels or insulin-related conditions, and develop personalized treatment plans.

Pharmaceutical companies and researchers can leverage these insights to inform the development of new insulin-based therapies, diagnostics, and interventions aimed at managing insulin-related disorders such as diabetes.

Public health agencies and policymakers can use the insights to monitor trends in insulin levels within populations, identify at-risk groups, and implement targeted interventions and prevention strategies to improve overall health outcomes.

#### Chart - 13

In [None]:
plt.figure(figsize=(8, 6))
plt.hist(df['BloodPressure'], bins=20, color='brown')
plt.title('Distribution of Blood Pressure')
plt.xlabel('Blood Pressure')
plt.ylabel('Frequency')
plt.show()

##### 1. Why did you pick the specific chart?

We chose a histogram because it effectively visualizes the distribution of blood pressure values within the dataset. Histograms are well-suited for displaying the frequency or count of values within different bins or intervals of a numerical variable, making them ideal for exploring the distribution of continuous data.

##### 2. What is/are the insight(s) found from the chart?

From the histogram, we can gain insights into the distribution of blood pressure values:

Shape of distribution: We can observe the shape of the histogram, which provides information about the central tendency, variability, and skewness of blood pressure values. For example, a symmetric bell-shaped curve suggests a normal distribution, while asymmetry may indicate skewness.

Peaks and troughs: Peaks in the histogram represent areas of higher frequency, indicating where blood pressure values are more concentrated. Troughs represent areas of lower frequency.

Spread: The width of the histogram reflects the spread or variability of blood pressure values. A wider histogram indicates greater variability, while a narrower histogram suggests less variability.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights gained from this histogram can have positive implications for businesses, particularly those in the healthcare, medical device, and pharmaceutical industries:

Healthcare providers can use this information to assess the distribution of blood pressure values in patients, identify individuals with abnormal blood pressure levels or hypertension, and develop personalized treatment plans.

Medical device manufacturers can leverage these insights to design and develop innovative blood pressure monitoring devices and technologies tailored to specific patient needs and healthcare settings.

Pharmaceutical companies and researchers can use the insights to inform the development of new antihypertensive medications, diagnostic tools, and interventions aimed at managing high blood pressure and improving cardiovascular health outcomes.

#### Chart - 14 - Correlation Heatmap

In [None]:
plt.figure(figsize=(10, 8))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Heatmap')
plt.show()

##### 1. Why did you pick the specific chart?

We chose a heatmap because it effectively visualizes the correlation between different variables in the dataset. Heatmaps use color gradients to represent the strength and direction of correlations, making it easy to identify relationships between variables. Additionally, annotations provide precise correlation values, enhancing the interpretability of the chart.

##### 2. What is/are the insight(s) found from the chart?

From the correlation heatmap, we can gain insights into the relationships between different variables:

Strength of correlations: The color intensity of each cell indicates the strength of correlation between two variables. With the 'coolwarm' colormap used here, strong positive correlations appear as deep red, strong negative correlations appear as deep blue, and values near zero appear in pale, neutral tones.

Direction of correlations: The sign (+ or -) of each correlation coefficient indicates the direction of the relationship between variables. Positive correlations imply that as one variable increases, the other variable also tends to increase, while negative correlations imply an inverse relationship.

Correlation patterns: We can identify clusters of variables with high or low correlations, revealing potential patterns or associations within the dataset.
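
To read the heatmap more systematically, the correlations with the target can be sorted by absolute value. The sketch below is a minimal example on a tiny made-up frame; `rank_by_target_correlation` is a hypothetical helper, and the demo values are not project data.

```python
import pandas as pd

def rank_by_target_correlation(frame: pd.DataFrame, target: str) -> pd.Series:
    """Sort features by absolute Pearson correlation with the target column."""
    corr = frame.corr(numeric_only=True)[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

# Illustrative values only; in the notebook this would be df
demo = pd.DataFrame({
    "Glucose": [85, 90, 100, 140, 150, 160],
    "Age":     [50, 22, 41, 30, 25, 45],
    "Outcome": [0, 0, 0, 1, 1, 1],
})
ranking = rank_by_target_correlation(demo, "Outcome")
print(ranking)
```

The sign is kept in the result, so the ranking shows both how strongly and in which direction each feature moves with the target.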

#### Chart - 15 - Pair Plot

In [None]:
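# pairplot returns a grid of subplots, so the title is set on the whole figure
# (plt.title would only label the last subplot)
g = sns.pairplot(df[['Glucose', 'Insulin', 'BMI', 'Age', 'Outcome']], hue='Outcome', diag_kind='kde')
g.fig.suptitle('Pairplot of Selected Features', y=1.02)
plt.show()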

##### 1. Why did you pick the specific chart?

We chose a pairplot because it provides a comprehensive visualization of pairwise relationships between selected features in the dataset. Pairplots are particularly useful for exploring correlations, distributions, and potential patterns across multiple variables simultaneously. Additionally, incorporating hue='Outcome' allows for the visualization of how the outcome variable relates to the selected features, making it easier to identify potential associations or differences between outcome groups.

##### 2. What is/are the insight(s) found from the chart?

From the pairplot, we can gain insights into the relationships between selected features and the outcome variable:

Pairwise relationships: The off-diagonal scatter plots illustrate the relationship between each pair of different features, helping to identify potential correlations or patterns. The diagonal panels do not contain scatter plots; they show each feature's own distribution.

Distributions: Kernel density estimation (kde) plots on the diagonal provide insights into the distribution of each feature for different outcome groups. Differences in distribution shapes or peaks may indicate potential associations between features and outcome groups.

Class separability: By incorporating hue='Outcome', the pairplot allows us to visualize how the selected features vary between different outcome groups. Clustering or separation of data points based on outcome groups may suggest predictive potential or class separability of the selected features.


## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Answer Here.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.


Research Hypothesis:

Null Hypothesis (H0): There is no association between the number of pregnancies (Pregnancies) and the diabetes outcome (Outcome).

Alternate Hypothesis (H1): There is an association between the number of pregnancies (Pregnancies) and the diabetes outcome (Outcome).

#### 2. Perform an appropriate statistical test.

In [None]:
from scipy.stats import chi2_contingency

# Assuming df_main is your DataFrame containing the dataset

# Create a contingency table between two categorical variables
contingency_table = pd.crosstab(df_main['Pregnancies'], df_main['Outcome'])

# Perform chi-square test
chi2, p_value, dof, expected = chi2_contingency(contingency_table)

print(f"P-value: {p_value}")

##### Which statistical test have you done to obtain P-Value?

Chi-square Test for Independence.

##### Why did you choose the specific statistical test?

I chose the chi-square test for independence because it is suitable for analyzing the association between two categorical variables (Pregnancies and Outcome). Since Pregnancies represents the number of pregnancies (a discrete variable), and Outcome represents the diabetes outcome (a binary categorical variable), the chi-square test is appropriate to determine if there is a significant association between these variables. Additionally, the chi-square test helps to assess whether any observed association is statistically significant or occurred by chance.
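
Because Pregnancies takes many distinct values, some rows of the contingency table can hold very few observations, and the chi-square approximation is commonly considered less reliable when expected cell counts fall below about 5. The sketch below shows one hedged way to handle this by binning the counts before testing. The helper name `chi_square_with_bins`, the bin edges, and the demo values are assumptions made for illustration and are not project data.

```python
import pandas as pd
from scipy.stats import chi2_contingency

def chi_square_with_bins(frame, column, target, bins, labels):
    """Bin a count column, then run a chi-square test of independence."""
    binned = pd.cut(frame[column], bins=bins, labels=labels, include_lowest=True)
    table = pd.crosstab(binned, frame[target])
    chi2, p_value, dof, expected = chi2_contingency(table)
    # The smallest expected count indicates how trustworthy the approximation is
    return p_value, dof, expected.min()

# Illustrative values only
demo = pd.DataFrame({
    "Pregnancies": [0, 1, 1, 2, 2, 3, 4, 5, 6, 7, 8, 10] * 5,
    "Outcome":     [0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1] * 5,
})
p_value, dof, min_expected = chi_square_with_bins(
    demo, "Pregnancies", "Outcome", bins=[0, 2, 5, 20], labels=["0-2", "3-5", "6+"]
)
print(f"P-value: {p_value:.4g}, dof: {dof}, smallest expected count: {min_expected}")
```

Reporting the smallest expected count alongside the p-value makes it easier to judge whether the test result can be relied on.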

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H0): There is no difference in the mean glucose levels (Glucose) between diabetic and non-diabetic individuals (Outcome).
Alternate Hypothesis (H1): There is a difference in the mean glucose levels (Glucose) between diabetic and non-diabetic individuals (Outcome).

#### 2. Perform an appropriate statistical test.

In [None]:
from scipy.stats import ttest_ind

# Assuming df_main is your DataFrame containing the dataset

# Split the dataset into diabetic and non-diabetic groups
diabetic_group = df_main[df_main['Outcome'] == 1]['Glucose']
non_diabetic_group = df_main[df_main['Outcome'] == 0]['Glucose']

# Perform independent t-test
t_statistic, p_value = ttest_ind(diabetic_group, non_diabetic_group)

print(f"P-value: {p_value}")

##### Which statistical test have you done to obtain P-Value?

Independent t-test (Two-sample t-test).

##### Why did you choose the specific statistical test?

I chose the independent t-test because it is suitable for comparing the means of a continuous variable (Glucose) between two independent groups (Outcome). In this case, the two groups are diabetic (Outcome = 1) and non-diabetic (Outcome = 0) individuals. The t-test helps determine if the difference in mean glucose levels between these groups is statistically significant or occurred by chance. Note that the test as called here assumes approximately normal data and equal variances between the groups; these assumptions are treated as reasonable for this analysis but are not verified in the code.
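
The default `ttest_ind` call assumes equal variances in both groups. If that assumption is in doubt, one option is to check it with Levene's test and use Welch's t-test (`equal_var=False`). The sketch below uses synthetic stand-ins for the two Glucose groups; the group sizes, means, and spreads are assumptions chosen for illustration only.

```python
import numpy as np
from scipy.stats import levene, ttest_ind

rng = np.random.default_rng(0)
# Synthetic stand-ins for the two groups (not project data)
diabetic = rng.normal(loc=140, scale=30, size=200)
non_diabetic = rng.normal(loc=110, scale=20, size=300)

# Levene's test: a small p-value suggests the group variances differ
_, levene_p = levene(diabetic, non_diabetic)

# Welch's t-test does not assume equal variances
t_stat, welch_p = ttest_ind(diabetic, non_diabetic, equal_var=False)
print(f"Levene p={levene_p:.4g}, Welch t={t_stat:.2f}, p={welch_p:.4g}")
```

In the notebook, the two synthetic arrays would be replaced by `diabetic_group` and `non_diabetic_group`.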

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H0): There is no correlation between age and BMI.

Alternate Hypothesis (H1): There is a correlation between age and BMI.

#### 2. Perform an appropriate statistical test.

In [None]:
from scipy.stats import spearmanr

# Assuming df_main is your DataFrame containing the dataset

# Calculate Spearman rank correlation coefficient and p-value
corr, p_value = spearmanr(df_main['Age'], df_main['BMI'])

print(f"P-value: {p_value}")

##### Which statistical test have you done to obtain P-Value?

Spearman rank correlation coefficient.

##### Why did you choose the specific statistical test?

I chose the Spearman rank correlation coefficient because it is suitable for measuring the strength and direction of a monotonic relationship between two continuous variables when the assumptions of the Pearson correlation coefficient are not met (e.g., a non-linear relationship or non-normality). Age and BMI are continuous variables, and the Spearman rank correlation does not assume a linear relationship between the variables, making it appropriate for this analysis.
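
The difference between the two coefficients is easiest to see on a relationship that is monotonic but not linear. The sketch below uses made-up values (an exponential curve, not Age and BMI) to show that Spearman reports a perfect monotonic association while Pearson does not.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Made-up data: y always increases with x, but not along a straight line
x = np.arange(1, 21, dtype=float)
y = np.exp(x / 3)

rho, _ = spearmanr(x, y)  # rank-based, sensitive to monotonic association
r, _ = pearsonr(x, y)     # sensitive to linear association

print(f"Spearman rho: {rho:.3f}, Pearson r: {r:.3f}")
```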

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
df.fillna(df.mean(), inplace=True)

#### What all missing value imputation techniques have you used and why did you use those techniques?

We have used Mean Imputation. Mean imputation replaces missing values with the mean of the non-missing values in the column. It is a simple and commonly used method for handling missing data, suitable for variables with a relatively normal distribution.
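
Mean imputation only fills values that are actually stored as NaN. In datasets of this kind, zeros in measurement columns such as Glucose or BMI are sometimes used to mark unrecorded values; whether that applies here is an assumption that should be checked against the data. If it does, a minimal sketch for converting those zeros to missing values and filling them with the median (which is less sensitive to skew than the mean) could look like this. The helper name and demo values are hypothetical.

```python
import numpy as np
import pandas as pd

def impute_zeros_with_median(frame: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Treat zeros in the given columns as missing and fill them with the median."""
    result = frame.copy()
    result[columns] = result[columns].replace(0, np.nan)
    result[columns] = result[columns].fillna(result[columns].median())
    return result

# Illustrative values only
demo = pd.DataFrame({"Glucose": [0, 100, 120, 140], "BMI": [30.0, 0.0, 25.0, 35.0]})
filled = impute_zeros_with_median(demo, ["Glucose", "BMI"])
print(filled)
```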

### 2. Handling Outliers

In [None]:
mean = df.mean()
std_dev = df.std()

# Define a threshold for outlier detection (e.g., 3 times the standard deviation)
threshold = 3

# Calculate Z-scores for each data point
z_scores = (df - mean) / std_dev

# Identify outliers
outliers = (z_scores > threshold) | (z_scores < -threshold)

# Remove outliers
cleaned_df = df[~outliers.any(axis=1)]

# Now 'cleaned_df' contains the DataFrame with outliers removed

##### What all outlier treatment techniques have you used and why did you use those techniques?

We've used Z-score based outlier detection and removal, which identifies and removes data points that are a certain number of standard deviations away from the mean (e.g., 3 times the standard deviation). This technique is effective for normally distributed data and helps mitigate the impact of extreme values on statistical analyses.
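
Z-score filtering relies on the mean and standard deviation, which are themselves affected by extreme values. As an alternative for skewed columns, and not the method used above, values can be capped at interquartile-range (IQR) bounds so that no rows are dropped. The sketch below uses made-up values, and `iqr_bounds` is a hypothetical helper.

```python
import pandas as pd

def iqr_bounds(series: pd.Series, k: float = 1.5):
    """Return lower and upper bounds at k times the IQR beyond the quartiles."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Illustrative values only; 90 is the extreme point
demo = pd.Series([10, 12, 13, 14, 15, 16, 18, 90])
low, high = iqr_bounds(demo)

# Cap values outside the bounds instead of removing the rows
capped = demo.clip(lower=low, upper=high)
print(low, high)
print(capped.tolist())
```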

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns

#### What all categorical encoding techniques have you used & why did you use those techniques?

Answer Here.

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction

#### 2. Lower Casing

In [None]:
# Lower Casing

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

#### 4. Removing URLs & Removing words that contain digits.

In [None]:
# Remove URLs & remove words that contain digits

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

In [None]:
# Remove White spaces

#### 6. Rephrase Text

In [None]:
# Rephrase Text

#### 7. Tokenization

In [None]:
# Tokenization

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

##### Which text normalization technique have you used and why?

Answer Here.

#### 9. Part of speech tagging

In [None]:
# POS Tagging

#### 10. Text Vectorization

In [None]:
# Vectorizing Text

##### Which text vectorization technique have you used and why?

Answer Here.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
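import numpy as np
import pandas as pd

# Bin age into three groups
df['AgeGroup'] = pd.cut(df['Age'], bins=[0, 30, 60, float('inf')], labels=['Young', 'Middle-aged', 'Elderly'])

# Calculate insulin sensitivity index as the inverse of Insulin x Glucose.
# A zero product would cause a division by zero (inf), so it is treated as missing
# and filled with the column median (one simple choice among several).
product = (df['Insulin'] * df['Glucose']).replace(0, np.nan)
df['InsulinSensitivity'] = (1 / product).fillna((1 / product).median())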

#### 2. Feature Selection

In [None]:
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Start from the working DataFrame df
X = df  # Candidate columns

# Remove non-numeric columns
X_numeric = X.select_dtypes(include=[np.number])

# Set a threshold for variance
threshold = 0.1

# Create VarianceThreshold object
selector = VarianceThreshold(threshold)

# Fit the selector to the data
selector.fit(X_numeric)

# Get the indices of the selected features
selected_indices = selector.get_support(indices=True)

# Get the names of the selected features
selected_features = X_numeric.columns[selected_indices]

# Create a new DataFrame with only the selected features
df_selected = df[selected_features]

##### What all feature selection methods have you used  and why?

We describe two feature selection methods; only VarianceThreshold is applied in the code above:

SelectKBest with chi-square statistic: This method selects the top k features based on their chi-square score against the target variable. It is suitable for classification tasks with non-negative features and provides a straightforward way to select the most relevant features.

VarianceThreshold: This method removes features with low variance, assuming that features with low variance are less informative and may not contribute much to the model's performance. We used this method to filter out features that don't vary much across the dataset, which could help simplify the model and reduce overfitting.
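
SelectKBest with the chi-square score can be sketched as follows. This is a minimal example on made-up values; `select_top_k` is a hypothetical helper, and the choice of `k` is an assumption. Note that `chi2` requires non-negative feature values.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

def select_top_k(features: pd.DataFrame, target: pd.Series, k: int) -> list:
    """Return the names of the k features with the highest chi-square scores."""
    selector = SelectKBest(score_func=chi2, k=k)
    selector.fit(features, target)
    return list(features.columns[selector.get_support()])

# Illustrative values only: Glucose separates the classes, Noise does not
demo_X = pd.DataFrame({
    "Glucose": [85, 90, 95, 150, 160, 170],
    "Noise":   [5, 6, 5, 6, 5, 6],
})
demo_y = pd.Series([0, 0, 0, 1, 1, 1])

top = select_top_k(demo_X, demo_y, k=1)
print(top)
```

Unlike VarianceThreshold, this method uses the target variable, so it should be fitted on training data only to avoid leaking information from the test set.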

##### Which all features you found important and why?

Glucose, BMI, Age, and DiabetesPedigreeFunction are considered important for diabetes prediction because they are well-established risk factors supported by medical research and clinical practice. They capture essential aspects of glucose metabolism, body composition, age-related changes, and genetic predisposition, which play key roles in diabetes development. By including these features in predictive models, we can effectively assess an individual's risk of developing diabetes and provide targeted interventions for prevention and management.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data

### 6. Data Scaling

In [None]:
# Scaling your data
from sklearn.preprocessing import MinMaxScaler

# Assuming df_selected contains the dataset with the selected features
X_selected = df_selected  # Features

# Create a MinMaxScaler object
scaler = MinMaxScaler()

# Fit the scaler to the data and transform the data
X_selected_normalized = scaler.fit_transform(X_selected)

# Convert the normalized data back to a DataFrame (if needed)
df_selected_normalized = pd.DataFrame(X_selected_normalized, columns=X_selected.columns)

# Check the min and max values of the normalized data
print("Min values after normalization:")
print(df_selected_normalized.min())
print("\nMax values after normalization:")
print(df_selected_normalized.max())

##### Which method have you used to scale your data and why?

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.

In [None]:
# Dimensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
from sklearn.model_selection import train_test_split

# X holds the selected features and y the target variable.
# 'Outcome' is dropped from X (if present) so the target cannot leak into the inputs.
X = df_selected.drop(columns=['Outcome'], errors='ignore')
y = df['Outcome']  # 'Outcome' is the target variable for diabetes prediction

# Split the data into training and testing sets
# Adjust the test_size parameter to choose the splitting ratio wisely
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Print the shape of the training and testing sets
print("Training set shape:", X_train.shape, y_train.shape)
print("Testing set shape:", X_test.shape, y_test.shape)

##### What data splitting ratio have you used and why?


We used a data splitting ratio of 80% for training and 20% for testing, specified by setting test_size=0.2 in the train_test_split function.
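
Two refinements are often combined with an 80/20 split: stratifying on the target so both sets keep the same class proportions, and fitting the scaler on the training set only so that no test-set information reaches preprocessing. The sketch below illustrates both on synthetic data; the array sizes and the 65/35 class ratio are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (not project data): 100 rows, 3 features, 65/35 classes
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([0] * 65 + [1] * 35)

# stratify keeps the class ratio the same in both sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Fit the scaler on the training set only, then apply it to both sets
scaler = MinMaxScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)

print("Train positive share:", y_tr.mean())
print("Test positive share:", y_te.mean())
```

With this approach the scaled test values may fall slightly outside [0, 1], which is expected because the test set did not contribute to the fitted minimum and maximum.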

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [None]:
# Handling Imbalanced Dataset (If needed)
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Create a Random Forest classifier with class_weight='balanced'
clf = RandomForestClassifier(class_weight='balanced', random_state=0)

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

The technique used to handle the imbalanced dataset is setting the class_weight='balanced' parameter in the RandomForestClassifier.

When dealing with imbalanced datasets, using class weights is a common technique to address the issue. By setting class_weight='balanced', the algorithm adjusts the weights of the classes during training, giving higher weight to the minority class and lower weight to the majority class. This helps the model to pay more attention to the minority class instances during training, thereby improving its ability to correctly classify the minority class.
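
The weights implied by `class_weight='balanced'` follow the formula `n_samples / (n_classes * np.bincount(y))`. The sketch below reproduces them with scikit-learn's `compute_class_weight` on a made-up label vector; the 65/35 split is an assumption chosen for illustration.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up labels: 65 negatives and 35 positives
y_demo = np.array([0] * 65 + [1] * 35)
classes = np.unique(y_demo)

# Weights as computed by scikit-learn for the 'balanced' setting
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_demo)

# The same weights from the formula n_samples / (n_classes * class counts)
manual = len(y_demo) / (len(classes) * np.bincount(y_demo))

print(dict(zip(classes.tolist(), weights.round(3).tolist())))
```

The minority class receives the larger weight, so misclassifying one of its samples costs more during training.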

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report

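# Split data into features (X) and target variable (y)
# Keep numeric columns only, since StandardScaler cannot handle categorical columns such as 'AgeGroup'
X = df.drop("Outcome", axis=1).select_dtypes(include="number")
y = df["Outcome"]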

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize features by removing the mean and scaling to unit variance
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train the SVM model
svm_model = SVC(kernel='linear')
svm_model.fit(X_train_scaled, y_train)

# Predict on the testing set
y_pred = svm_model.predict(X_test_scaled)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
print("\nClassification Report:")
print(classification_report(y_test, y_pred))


#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.
The machine learning (ML) model used in the provided code is a Support Vector Machine (SVM) classifier.

**Support Vector Machine (SVM)**:
SVM is a supervised learning algorithm used for classification tasks. It is effective in high-dimensional spaces and remains usable even when the number of dimensions (features) is greater than the number of samples. SVM works by finding the hyperplane that best separates the different classes in the feature space. This hyperplane is chosen to maximize the margin, which is the distance between the hyperplane and the nearest data points from either class; those nearest points are known as support vectors.

**Kernel Trick**:
SVM can efficiently perform non-linear classification using what's called the "kernel trick." This technique implicitly maps the input data into a higher-dimensional space, where the classes may become linearly separable. Common kernel functions include linear, polynomial, radial basis function (RBF), and sigmoid.

In the provided code, a linear kernel is used (`kernel='linear'`). This means the SVM model will try to find a linear decision boundary in the feature space.

**Training and Prediction**:
The SVM model is trained using the training data, where it learns the optimal hyperplane to separate the different classes. After training, the model can make predictions on new, unseen data by determining which side of the hyperplane the data points fall on.

**Model Evaluation**:
The model's performance is evaluated using various metrics such as accuracy, precision, recall, and F1-score. These metrics provide insights into how well the model is performing in terms of correctly predicting the positive and negative classes, as well as overall performance.

In summary, SVM is a powerful algorithm for classification tasks, particularly when dealing with high-dimensional data. It's versatile and can be used with different kernel functions to handle both linear and non-linear classification problems.
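
The effect of the kernel choice can be compared with cross-validation. The sketch below is a minimal example on synthetic data from `make_classification`, not the project dataset; the sample size, feature counts, and the two kernels compared are assumptions. Wrapping the scaler and the model in a pipeline keeps the scaling inside each cross-validation fold.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (not project data)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

scores = {}
for kernel in ("linear", "rbf"):
    # The scaler is refitted on the training part of every fold
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    scores[kernel] = cross_val_score(model, X_demo, y_demo, cv=5).mean()

print(scores)
```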

In [None]:
import matplotlib.pyplot as plt
import numpy as np

# Calculate classification report
report = classification_report(y_test, y_pred, output_dict=True)
accuracy = report['accuracy']
precision = report['1']['precision']
recall = report['1']['recall']
f1_score = report['1']['f1-score']

# Create metric score chart
metrics = ['Accuracy', 'Precision', 'Recall', 'F1-Score']
scores = [accuracy, precision, recall, f1_score]

plt.figure(figsize=(10, 6))
plt.bar(metrics, scores, color=['blue', 'green', 'orange', 'red'])
plt.title('SVM Model Performance Metrics')
plt.xlabel('Metrics')
plt.ylabel('Score')
plt.ylim(0, 1)  # Setting y-axis limit to range [0, 1]
plt.show()


#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer Here.

### ML Model - 2

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report

# Load the dataset
#data = pd.read_csv('your_dataset.csv')  # Replace 'your_dataset.csv' with the path to your dataset file

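# Preprocessing (handle missing values, encode categorical variables, etc.)
# For example, you might fill missing values with the mean or median of each column.
# numeric_only=True skips non-numeric columns such as 'AgeGroup', which have no mean.
df.fillna(df.mean(numeric_only=True), inplace=True)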

# Split the dataset into features (X) and target variable (y)
X = df[['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']]
y = df['Outcome']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the Decision Tree model
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

# Predict on the testing set
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy:', accuracy)
print(classification_report(y_test, y_pred))


#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

The machine learning model used in the provided code snippet is a Decision Tree Classifier.

**Decision Tree Classifier:**
Decision trees are a popular type of supervised learning algorithm used for classification tasks. They work by recursively partitioning the feature space into regions that minimize impurity or maximize information gain. Each internal node of the tree represents a decision based on the value of a specific feature, and each leaf node represents the predicted class label.

Here's how the Decision Tree Classifier works:

1. **Training**: The algorithm recursively splits the feature space into regions based on the feature values that best separate the data points according to their class labels. This splitting process continues until certain stopping criteria are met, such as reaching a maximum depth, minimum number of samples in a node, or no further improvement in impurity reduction.

2. **Prediction**: When making predictions for new data points, the algorithm traverses the decision tree from the root node down to a leaf node based on the feature values of the data point. The class label associated with the leaf node reached by the data point determines its predicted class label.

**Key Parameters:**
- `criterion`: The function used to measure the quality of a split. Common options include 'gini' for the Gini impurity and 'entropy' for information gain.
- `max_depth`: The maximum depth of the decision tree. Limiting the depth helps prevent overfitting.
- `min_samples_split`: The minimum number of samples required to split an internal node.
- `min_samples_leaf`: The minimum number of samples required to be at a leaf node.
- `random_state`: Seed for random number generation, ensuring reproducibility.
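
The effect of `max_depth` on overfitting can be seen by comparing an unrestricted tree with a shallow one. The sketch below uses synthetic data from `make_classification` with some label noise; the dataset settings and the depth of 3 are assumptions chosen for illustration, not tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with 10% label noise (not project data)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=4, flip_y=0.1, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

full_tree = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_tr, y_tr)

for name, tree in [("unrestricted", full_tree), ("max_depth=3", shallow_tree)]:
    print(name, "depth:", tree.get_depth(),
          "train acc:", round(tree.score(X_tr, y_tr), 3),
          "test acc:", round(tree.score(X_te, y_te), 3))
```

An unrestricted tree typically reaches perfect training accuracy by memorising the training set, so a large gap between its training and test accuracy is a sign of overfitting.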



In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

# Load the dataset and preprocess it
# df = pd.read_csv('your_dataset.csv')  # Replace 'your_dataset.csv' with the path to your dataset file
df.fillna(df.mean(numeric_only=True), inplace=True)  # numeric_only skips non-numeric columns

# Split the dataset into features (X) and target variable (y)
X = df[['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']]
y = df['Outcome']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the Decision Tree model
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

# Predict on the testing set
y_pred = model.predict(X_test)

# Generate classification report
report = classification_report(y_test, y_pred, output_dict=True)
df_report = pd.DataFrame(report).transpose()

# Plot the evaluation metric score chart
plt.figure(figsize=(10, 5))
df_report[['precision', 'recall', 'f1-score']].plot(kind='bar', ax=plt.gca(), color=['skyblue', 'lightgreen', 'salmon'])
plt.title('Evaluation Metric Score Chart')
plt.xlabel('Class')
plt.ylabel('Score')
plt.xticks(rotation=0)
plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
plt.tight_layout()
plt.show()



#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm

# Predict on the model
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report



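# Define features (X) and target variable (y)
# Keep numeric columns only, since StandardScaler cannot handle categorical columns such as 'AgeGroup'
X = df.drop('Outcome', axis=1).select_dtypes(include='number')  # 'Outcome' is the target variable
y = df['Outcome']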

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Initialize and train the logistic regression model
logistic_reg = LogisticRegression()
logistic_reg.fit(X_train_scaled, y_train)

# Make predictions on the test set
y_pred = logistic_reg.predict(X_test_scaled)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Classification report
print(classification_report(y_test, y_pred))



#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

The machine learning model used in the provided code snippet is Logistic Regression. Logistic Regression is a statistical method used for binary classification problems, where the target variable has two possible outcomes (e.g., yes or no, 0 or 1). It's widely used in various fields, including healthcare, finance, and marketing.

**Explanation of Logistic Regression:**
- **Objective**: Logistic Regression aims to model the probability that a given input belongs to a certain class.
- **Functionality**: It applies a logistic function (also known as the sigmoid function) to the linear combination of input features to produce the probability score.
- **Decision Boundary**: Logistic Regression establishes a decision boundary that separates the classes in the feature space.
- **Optimization**: The model is trained by minimizing a loss function, typically the log-loss (cross-entropy loss), with an iterative optimizer. Gradient descent is one such method; scikit-learn's `LogisticRegression` uses the L-BFGS solver by default.

**Key Features:**
- **Standardization**: Before training the model, the features are standardized using StandardScaler to ensure that each feature has a mean of 0 and a standard deviation of 1. This helps in improving the performance of the model, especially when dealing with features of different scales.
- **Evaluation Metrics**: The model's performance is evaluated using accuracy, which measures the proportion of correctly predicted outcomes among all predictions. Additionally, a classification report is generated, providing metrics such as precision, recall, and F1-score for each class, along with the overall metrics.
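
The step from a linear combination of features to a probability can be checked directly against a fitted model. The sketch below trains on synthetic data from `make_classification` (not the project dataset) and recomputes the predicted probabilities by applying the sigmoid function to the model's coefficients and intercept.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (not project data)
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
model = LogisticRegression().fit(X_demo, y_demo)

def sigmoid(z):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Linear combination of the inputs, then the sigmoid
linear_score = X_demo @ model.coef_.ravel() + model.intercept_[0]
manual_proba = sigmoid(linear_score)

# Probability of class 1 as reported by scikit-learn
sklearn_proba = model.predict_proba(X_demo)[:, 1]

print(np.allclose(manual_proba, sklearn_proba))
```

A sample is assigned to class 1 when this probability exceeds 0.5, which is the same as the linear score being positive; that is the decision boundary described above.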



In [None]:
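# Visualizing evaluation Metric Score chart
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import classification_report

# Generate the classification report as a DataFrame (rows: classes and averages)
report = classification_report(y_test, y_pred, output_dict=True)
report_df = pd.DataFrame(report).transpose()

# Keep precision/recall/F1 only: 'accuracy' is a single scalar repeated across
# columns and 'support' is a count, so neither belongs on the same color scale
plot_df = report_df.drop(index='accuracy', columns='support')

# Plot the heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(plot_df, annot=True, cmap='Blues', fmt=".2f", cbar=False)
plt.title('Classification Report')
plt.xlabel('Metrics')
plt.ylabel('Class / Average')
plt.show()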


#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

For a positive business impact, the evaluation metrics should reflect the model's performance in a way that aligns with the business goals and priorities. The metrics considered in the code above are:

1. **Accuracy Score**: Accuracy measures the proportion of correctly predicted instances among the total instances. It's a widely used metric and provides a general overview of the model's performance.

2. **Precision, Recall, and F1-score**: These metrics are particularly important in classification tasks, especially when dealing with imbalanced datasets or when different types of errors have different consequences. Precision measures the proportion of true positive predictions among all positive predictions, recall measures the proportion of true positives among all actual positives, and the F1-score is the harmonic mean of precision and recall. These metrics provide insights into the model's ability to correctly identify positive cases (e.g., cases of diabetes) while minimizing false positives and false negatives.

These metrics are chosen because together they describe both the model's overall accuracy and its ability to correctly classify positive cases, which is crucial in a healthcare context like diabetes prediction. Recall on the positive class deserves particular attention, since a false negative is a person with diabetes whom the model misses. Tracking these metrics helps ensure that the model identifies individuals at risk of diabetes while limiting misclassifications, which supports better health outcomes and resource allocation.
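To make the definitions above concrete, the following sketch computes the four metrics from raw confusion counts. The label lists are made-up illustrative values, not results from this notebook, and `binary_metrics` is a hypothetical helper name.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / len(pairs)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Illustrative labels only (1 = diabetic, 0 = non-diabetic)
y_true_demo = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_demo = [1, 0, 0, 1, 0, 1, 1, 0]

metrics_demo = binary_metrics(y_true_demo, y_pred_demo)
print(metrics_demo)
```

In this toy example there is one false negative and one false positive, so precision and recall happen to coincide; on real data they usually differ, which is why both are reported.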

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

The choice of final prediction model among those created above depends on the nature of the problem, the characteristics of the dataset, and the evaluation metrics.

1. **Linear Regression**: Linear regression is suitable for predicting continuous outcomes. It's a simple and interpretable model that works well when there is a linear relationship between the features and the target variable. However, if the problem involves classification (e.g., predicting the presence or absence of diabetes), linear regression might not be the best choice.

2. **Decision Tree**: Decision trees are versatile and intuitive models that can handle both classification and regression tasks. They can capture complex relationships in the data and are easy to interpret. Decision trees work well with categorical and numerical features, making them suitable for a variety of datasets. However, they are prone to overfitting, especially with deep trees, and might not generalize well to unseen data.

3. **Logistic Regression**: Logistic regression is a widely used model for binary classification tasks like predicting the presence or absence of diabetes. It provides probabilities of class membership and handles numerical features directly and categorical features once they are encoded. As a linear model, especially with regularization, it is generally less prone to overfitting than a deep decision tree. Its coefficients are also interpretable, which helps in understanding the impact of each feature on the target variable.

Considering the nature of the problem (diabetes prediction), the final prediction model chosen would likely be **Logistic Regression**. It is well-suited to binary classification, interpretable, and outputs class probabilities, which are useful for decision-making in healthcare settings. It is also generally less prone to overfitting than decision trees, making it a more robust choice for generalizing to unseen data.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Answer Here.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ML model in a pickle file or joblib file format for the deployment process.


In [None]:
# Save the File

### 2. Load the saved model file again and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
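
The save and load steps in this section could be sketched with the standard-library `pickle` module as follows. This is only an outline under stated assumptions: the `save_model` and `load_model` helpers and the file name `diabetes_model.pkl` are hypothetical, and a plain dictionary stands in for the fitted estimator so the example runs on its own. In the notebook, the fitted model (for example `logistic_reg`) would be passed instead.

```python
import pickle
import tempfile
from pathlib import Path

def save_model(model, path):
    """Serialize a model object to disk with pickle."""
    with open(path, "wb") as f:
        pickle.dump(model, f)

def load_model(path):
    """Load a previously pickled model object from disk."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Stand-in for a fitted estimator (assumption: replace with the trained model)
stand_in_model = {"name": "logistic_regression",
                  "coef": [0.8, -0.5],
                  "intercept": 0.1}

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "diabetes_model.pkl"  # hypothetical file name
    save_model(stand_in_model, model_path)
    restored = load_model(model_path)

# Sanity check: the restored object matches what was saved
print("Round trip OK:", restored == stand_in_model)
```

For a real estimator, the sanity check would call `predict` on a few unseen, already-scaled rows with the restored model. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.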

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***