# Machine Learning Introduction
In this session, we’ll explore a simple medical dataset that contains information about individuals, including their age, blood pressure, glucose levels, and more. Some of these people were diagnosed with diabetes, and some were not.

Objective: Build a machine learning model to predict whether a person has diabetes based on their health measurements. Learn ML concepts like training, testing, accuracy, and how models make predictions.


## Dataset Used: Pima Indians Diabetes Dataset
This dataset is used to predict the onset of diabetes based on diagnostic health measurements. It contains medical data from female patients of Pima Indian heritage, aged 21 and above.

---

| Feature Name               | Description                                                                 |
|---------------------------|-----------------------------------------------------------------------------|
| **Pregnancies**           | Number of times the patient has been pregnant                              |
| **Glucose**               | Plasma glucose concentration (mg/dL)                                       |
| **BloodPressure**         | Diastolic blood pressure (mm Hg)                                           |
| **SkinThickness**         | Triceps skinfold thickness (mm)                                            |
| **Insulin**               | 2-hour serum insulin level (μU/mL)                                         |
| **BMI**                   | Body Mass Index (weight in kg / (height in m)^2)                           |
| **DiabetesPedigreeFunction** | Score (not a probability) estimating diabetes risk from family history  |
| **Age**                   | Age of the patient (years)                                                 |
| **Outcome**               | Target variable (0 = No diabetes, 1 = Has diabetes)                         |


# ML Pipeline Overview
Data → Preprocessing → Train/Test Split → Model Training → Evaluation → Prediction

 ![ML Pipeline](https://github.com/sanjayksau/wmle2024/blob/main/ml_pipeline.png?raw=true)
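
The stages above map onto a handful of scikit-learn calls. The sketch below is only an illustration on synthetic data from `make_classification` (an assumption standing in for the diabetes table), and all names ending in `_demo`, `_tr`, or `_te` are hypothetical. It assumes scikit-learn is installed, as in the rest of this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Data: synthetic stand-in with 8 numeric features and a binary target
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Train/Test Split: hold out 20% of the rows for evaluation
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# Preprocessing + Model Training: the scaler is fitted on the training rows only
demo_pipeline = make_pipeline(StandardScaler(), LogisticRegression())
demo_pipeline.fit(X_tr, y_tr)

# Evaluation: accuracy on the held-out rows
demo_accuracy = demo_pipeline.score(X_te, y_te)

# Prediction: class labels (0 or 1) for the first three held-out rows
demo_predictions = demo_pipeline.predict(X_te[:3])
print(demo_accuracy, demo_predictions)
```

Wrapping the scaler and the model in one pipeline object is a design choice that keeps the preprocessing tied to the model, so the same scaling is applied automatically at prediction time.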

# 1. Library Imports

In [None]:
# Imports
# numpy: Python library for numerical computation
# pandas: library for analyzing and handling structured (tabular) data
# matplotlib: a basic plotting library
# seaborn: a higher-level plotting library built on matplotlib, for statistical graphs
# scikit-learn (sklearn): a machine learning library (ready-to-use models, evaluation tools, etc.)

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay, roc_curve, auc

import warnings
warnings.filterwarnings('ignore')

# 2. Load Dataset

In [None]:
#url = "https://raw.githubusercontent.com/plotly/datasets/master/diabetes.csv"
url = "https://raw.githubusercontent.com/sanjayksau/BioML/refs/heads/main/diabetes.csv"
df = pd.read_csv(url)

# 3. Exploratory Data Analysis
Exploratory Data Analysis (EDA) helps us understand the data's structure, spot patterns, and detect issues such as missing values or outliers before we prepare the data for modeling.

In [None]:
print("Dataset shape:", df.shape)
print("\nFirst 5 rows:")
#print(df.head())
df.head()

In [None]:
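# Dataset info: column types and non-null counts.
# df.info() prints its own output and returns None, so it is not wrapped in print().
df.info()
# Summary statistics for each numeric column
print(df.describe())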
#print("\nMissing values:")
#print(df.isnull().sum())

In [None]:
# Target variable distribution
sns.countplot(x='Outcome', data=df)
plt.title("Target Distribution (0 = No Diabetes, 1 = Diabetes)")
plt.show()

In [None]:
# Pairplot of a few features
sns.pairplot(df[['Glucose', 'BMI', 'Age', 'Insulin', 'Outcome']], hue='Outcome', height=1.4, aspect=1.2)
plt.show()


# Before building a model, a pairplot gives an intuitive visual check of:
#   - Feature distributions
#   - Class separability
#   - Redundancy: highly correlated features may carry overlapping information.

# Each off-diagonal panel is a scatterplot of two features, colored by Outcome.
# If diabetic and non-diabetic points are visibly separated in a panel (for example, Glucose vs. BMI),
# that feature pair is a good discriminator.
# Features where the two classes are most separated (e.g., Glucose, BMI) tend to be more informative.

# The diagonal shows univariate plots: the distribution of each feature on its own.
# Because hue='Outcome' is set, each class gets its own distribution in a different color,
# so you can compare, for example, glucose levels of diabetic and non-diabetic individuals;
# the diabetic group likely peaks at higher glucose.

# Isolated points far from the main cluster may be outliers.
# For example, someone with a BMI of 60 could be flagged for review.


# 4. Data Preprocessing

In [None]:
#columns with zeros
zero_counts = (df == 0).sum()
columns_with_zeros = zero_counts[zero_counts > 0]
print(columns_with_zeros)

In [None]:
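# Some columns contain biologically implausible zeros (e.g., zero insulin or zero BMI);
# in these columns a zero stands for a missing measurement.
# Missing values can be imputed with the mean, the median, or a model-based estimate.
# Here we use the mean of the non-zero values.
# Note: for simplicity the mean is computed on the full dataset; to keep test-set
# information out of training, it could instead be computed on the training split only.
cols_with_zeros = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']

for col in cols_with_zeros:
    # Mean of the column, excluding the zero placeholders
    mean_value = df.loc[df[col] != 0, col].mean()

    # Replace each 0 with that mean
    df[col] = df[col].replace(0, mean_value)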

df.head()

In [None]:
# Split the data into features (X) and target (y)
X = df.drop('Outcome', axis=1)
y = df['Outcome']

In [None]:
# Train-test split: holding out a test set lets us estimate how well the model performs on unseen data.
# test_size=0.2 keeps 20% of the rows for testing.
# stratify=y keeps the proportion of diabetic and non-diabetic cases the same in both splits.
#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

In [None]:
# Feature scaling (standardizing the data):
# Many machine learning models work better when all input features (columns) are on a similar scale.
# Without scaling, features with larger numeric ranges can dominate the learning process.

# Think of this like converting different lab measurements to a common scale.
# For example, blood pressure and glucose level are measured in different units and ranges.
# StandardScaler makes features comparable by rescaling each one to mean = 0 and std = 1.
# The scaler is fitted on the training data only, then applied unchanged to the test data.


scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
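
To make the scaling step concrete, here is a minimal pure-Python sketch of the arithmetic behind standardization for a single feature. The glucose readings are made-up values, and `standardize` is a hypothetical helper, not part of scikit-learn. It uses the population standard deviation, which is what `StandardScaler` uses.

```python
from statistics import mean, pstdev

# Hypothetical glucose readings (mg/dL) from a training set
train_glucose = [85.0, 100.0, 115.0, 140.0, 160.0]

# "fit": learn the mean and population standard deviation from the training data
mu = mean(train_glucose)
sigma = pstdev(train_glucose)

def standardize(value):
    """Return how many standard deviations `value` lies from the training mean."""
    return (value - mu) / sigma

# "transform": apply the same mu and sigma to the training values and to new values
train_scaled = [standardize(v) for v in train_glucose]
new_patient_scaled = standardize(147.0)

print(round(mu, 1), round(sigma, 2), round(new_patient_scaled, 2))
```

With these numbers, a new reading of 147 mg/dL lands about one standard deviation above the training mean. Reusing the training `mu` and `sigma` for new values mirrors calling `fit_transform` on the training set and `transform` on the test set.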

# 5. Model Selection and Training
Model: a machine learning model is a tool that learns patterns from data and makes predictions. You give it patient data (like blood pressure, glucose level, etc.), and it learns how to predict whether someone may have diabetes.

*Here, we are solving a binary classification problem using Logistic Regression and Decision Trees.*

- Logistic Regression: a model that estimates the probability of a class using a logistic function. It works best when the classes are roughly linearly separable.

- Decision Tree: a non-linear model that splits the data into branches based on feature values. It can capture non-linear patterns, and a shallow tree is easy to interpret.
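
As a rough illustration of how the two model types reach a prediction, the sketch below hand-codes each one. The weights, bias, and thresholds are made-up numbers chosen for the example (a trained model would learn them from data), and the function names are hypothetical.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights for two standardized features; not fitted to real data
weights = {"Glucose": 1.1, "BMI": 0.7}
bias = -0.8

def predict_probability(patient):
    """Logistic regression: weighted sum of features passed through the logistic function."""
    z = bias + sum(weights[name] * patient[name] for name in weights)
    return sigmoid(z)

def tree_predict(raw):
    """Decision tree: a hand-written two-level tree with made-up thresholds."""
    if raw["Glucose"] <= 127:
        return 0
    return 1 if raw["BMI"] > 30 else 0

# A patient whose standardized Glucose and BMI are both above average
probability = predict_probability({"Glucose": 1.5, "BMI": 0.5})
predicted_class = 1 if probability >= 0.5 else 0
print(round(probability, 2), predicted_class)

# The tree works on raw (unscaled) values
print(tree_predict({"Glucose": 150, "BMI": 35}))
```

The logistic model returns a probability (about 0.77 here) that is turned into a class with a 0.5 cutoff, while the tree returns a class directly by following its questions.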

Note: Choosing the right machine learning model depends on your data type, the problem you're solving (classification, regression, etc.), and the size and quality of your dataset. There is no one-size-fits-all model; in practice you try a few models and compare their performance.

Model Training: Training is equivalent to teaching the model using examples.
A model looks at many data samples and learns to "understand" patterns.


https://colab.research.google.com/drive/1PtoFXpdx7jml4Q1oBKMgcOqKQFDSFoD8?usp=sharing

In [None]:
# Model 1: Logistic Regression, which predicts a probability (e.g., that the disease is present)

# Logistic regression is a supervised machine learning algorithm for binary classification:
# it predicts the probability of an outcome, event, or observation.
# That probability is then turned into one of two classes: yes/no, 1/0, or true/false.

log_model = LogisticRegression()
log_model.fit(X_train_scaled, y_train)

# fit() finds the model parameters that minimize the loss on the training data

In [None]:
# Model 2: Decision Tree, which follows a series of yes/no questions to reach a decision.
# Trees split on thresholds, so they do not need scaled features; we fit on the unscaled X_train.
tree_model = DecisionTreeClassifier(max_depth=4, random_state=42)
tree_model.fit(X_train, y_train)

In [None]:
# Decision tree plot (uncomment to draw the top levels of the fitted tree)
#from sklearn.tree import plot_tree
#plt.figure(figsize=(12, 8))
#plot_tree(tree_model, max_depth=2, filled=True, feature_names=list(X_train.columns), class_names=["No Diabetes", "Diabetes"])
#plt.show()

# 6. Model Evaluation


In [None]:
# model evaluation/predictions

# Making predictions on new (unseen) data:
# Now that the model has learned from training data, we ask it to predict whether a person has diabetes or not,
# based on their health measurements in the test set.

def evaluate(model, X_test, y_test, title):
    y_pred = model.predict(X_test)
    print(f"\n--- {title} ---")
    print(classification_report(y_test, y_pred))
    ConfusionMatrixDisplay.from_predictions(y_test, y_pred)
    plt.title(f"Confusion Matrix - {title}")
    plt.show()

In [None]:
# Evaluate models
evaluate(log_model, X_test_scaled, y_test, "Logistic Regression")
evaluate(tree_model, X_test, y_test, "Decision Tree")


In [None]:
# Plotting the ROC Curve:
# This curve helps us see how well our model separates diabetic from non-diabetic cases.
# Think of it like checking how good a medical test is at correctly identifying sick and healthy patients.

y_prob_log = log_model.predict_proba(X_test_scaled)[:, 1]
y_prob_tree = tree_model.predict_proba(X_test)[:, 1]


In [None]:
fpr_log, tpr_log, _ = roc_curve(y_test, y_prob_log)
fpr_tree, tpr_tree, _ = roc_curve(y_test, y_prob_tree)

In [None]:
plt.figure(figsize=(8, 5))
plt.plot(fpr_log, tpr_log, label=f'Logistic Regression (AUC = {auc(fpr_log, tpr_log):.2f})')
plt.plot(fpr_tree, tpr_tree, label=f'Decision Tree (AUC = {auc(fpr_tree, tpr_tree):.2f})')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.legend()
plt.grid(True)
plt.show()
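
Each point on a ROC curve comes from one decision threshold. The pure-Python sketch below computes three such points by hand. The labels and probabilities are made-up values for eight patients, and `roc_point` is a hypothetical helper rather than a scikit-learn function.

```python
# Hypothetical true labels and predicted probabilities for eight patients
y_true_demo = [0, 0, 1, 0, 1, 1, 0, 1]
y_score_demo = [0.10, 0.30, 0.35, 0.40, 0.60, 0.70, 0.80, 0.90]

def roc_point(threshold):
    """Return (FPR, TPR) when scores >= threshold are predicted positive."""
    pairs = list(zip(y_true_demo, y_score_demo))
    tp = sum(1 for y, s in pairs if y == 1 and s >= threshold)
    fp = sum(1 for y, s in pairs if y == 0 and s >= threshold)
    positives = sum(y_true_demo)
    negatives = len(y_true_demo) - positives
    return fp / negatives, tp / positives

for threshold in (0.9, 0.5, 0.2):
    fpr, tpr = roc_point(threshold)
    print(f"threshold={threshold}: FPR={fpr:.2f}, TPR={tpr:.2f}")
# threshold=0.9: FPR=0.00, TPR=0.25
# threshold=0.5: FPR=0.25, TPR=0.75
# threshold=0.2: FPR=0.75, TPR=1.00
```

Lowering the threshold catches more true cases (higher TPR) at the cost of more false alarms (higher FPR); the ROC curve traces this trade-off across all thresholds.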


# 7. Check your understanding


### Understanding Metrics in `classification_report`

The `classification_report` in scikit-learn provides key metrics to evaluate classification model performance.

---

| **Metric**      | **Explanation**                                                                 |
|-----------------|----------------------------------------------------------------------------------|
| **Precision**   | Of all predicted positives, how many were actually positive? <br> **Precision = TP / (TP + FP)** <br>→ Measures how accurate positive predictions are |
| **Recall**      | Of all actual positives, how many were correctly predicted? <br> **Recall = TP / (TP + FN)** <br>→ Measures how well the model captures actual positives |
| **F1-score**    | Harmonic mean of precision and recall <br> **F1 = 2 × (Precision × Recall) / (Precision + Recall)** <br>→ Balances both precision and recall         |
| **Support**     | Number of actual instances for each class in the dataset                           |
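
The formulas in the table can be checked with a few lines of arithmetic. The counts below are hypothetical confusion-matrix values for the positive (diabetes) class, not results from this notebook.

```python
# Hypothetical counts for the positive (diabetes) class
tp, fp, fn = 40, 10, 14

precision = tp / (tp + fp)   # 40 of 50 predicted positives were correct
recall = tp / (tp + fn)      # 40 of 54 actual positives were found
f1 = 2 * precision * recall / (precision + recall)
support = tp + fn            # number of actual positives

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} support={support}")
# precision=0.80 recall=0.74 f1=0.77 support=54
```

Because F1 is a harmonic mean, it sits closer to the lower of the two values than a plain average would.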

---

### Additional Metrics:

| **Metric**        | **Explanation**                                                                 |
|-------------------|----------------------------------------------------------------------------------|
| **Accuracy**       | Overall, how many predictions were correct:<br>`(TP + TN) / Total Samples`      |
| **Macro Avg**      | Average of precision/recall/F1 across classes (all classes treated equally)     |
| **Weighted Avg**   | Average weighted by class support (more influenced by majority class)           |
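
To see how the three summary rows differ, the sketch below derives them from one set of hypothetical confusion-matrix counts (the numbers are made up for illustration). Only precision is averaged here; recall and F1 follow the same pattern.

```python
# Hypothetical confusion-matrix counts (class 1 = diabetes, class 0 = no diabetes)
tp, tn, fp, fn = 40, 90, 10, 14

# Per-class precision and support
precision_by_class = {0: tn / (tn + fn), 1: tp / (tp + fp)}
support_by_class = {0: tn + fp, 1: tp + fn}
total = sum(support_by_class.values())

# Accuracy: share of all predictions that were correct
accuracy = (tp + tn) / total

# Macro average: plain mean, every class counts equally
macro_precision = sum(precision_by_class.values()) / len(precision_by_class)

# Weighted average: each class counts in proportion to its support
weighted_precision = sum(
    precision_by_class[c] * support_by_class[c] for c in precision_by_class
) / total

print(round(accuracy, 3), round(macro_precision, 3), round(weighted_precision, 3))
```

In this example the weighted average comes out higher than the macro average because the larger class (no diabetes) also has the higher precision.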

---

### When to Use Which?
- Use **Recall** when missing positives is risky (e.g., disease detection)
- Use **Precision** when false alarms are costly
- Use **F1-score** for balance between Precision and Recall
- Use **Macro Avg** for imbalanced data (equal class importance)
- Use **Weighted Avg** to reflect performance based on class distribution
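
A small thought experiment shows why the choice of metric matters on imbalanced data. The sketch below uses a made-up test set with 90 negatives and 10 positives, and a hypothetical "model" that always predicts the majority class.

```python
# Hypothetical imbalanced test set: 90 negatives, 10 positives
y_true_imb = [0] * 90 + [1] * 10
# A "model" that always predicts the majority class
y_pred_imb = [0] * 100

def recall_for(label):
    """Fraction of rows with this true label that were predicted correctly."""
    hits = sum(1 for t, p in zip(y_true_imb, y_pred_imb) if t == label and p == label)
    actual = sum(1 for t in y_true_imb if t == label)
    return hits / actual

accuracy = sum(1 for t, p in zip(y_true_imb, y_pred_imb) if t == p) / len(y_true_imb)
recall_negative = recall_for(0)
recall_positive = recall_for(1)
macro_recall = (recall_negative + recall_positive) / 2
weighted_recall = (90 * recall_negative + 10 * recall_positive) / 100

print(accuracy, recall_positive, macro_recall, weighted_recall)
# 0.9 0.0 0.5 0.9
```

Accuracy and the weighted average both report 0.9 even though no positive case is ever detected, while the recall for the positive class (0.0) and the macro average (0.5) expose the problem.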



In [None]:
#Diabetes ML Quiz with Answers and Explanations

print("Diabetes ML Quiz – Type a, b, c, or d to answer. You'll get explanations after each question!\n")

print("\n--- Dataset Understanding ---")

# Question 1
ans = input("1. What does each row in the dataset represent?\n(a) A patient\n(b) A hospital\n(c) A test result\n(d) A machine learning model\nYour answer: ")
if ans.lower() == 'a':
    print("Correct! Each row is one patient’s medical data.")
else:
    print("Incorrect. The correct answer is (a). Each row represents one patient.")

# Question 2
ans = input("\n2. What does the target column 'Outcome' represent?\n(a) The type of diabetes\n(b) Whether a person has diabetes or not\n(c) Glucose levels\n(d) Age group\nYour answer: ")
if ans.lower() == 'b':
    print("Correct! Outcome is 1 if the person has diabetes, 0 if not.")
else:
    print("Incorrect. The correct answer is (b).")

print("\n--- Model Training and Evaluation ---")

# Question 3
ans = input("\n3. Why do we use StandardScaler?\n(a) To normalize text\n(b) To improve plot visibility\n(c) To bring all features to the same scale\n(d) To remove missing values\nYour answer: ")
if ans.lower() == 'c':
    print("Correct! Scaling helps models treat all features equally.")
else:
    print("Incorrect. The correct answer is (c).")

# Question 4
ans = input("\n4. Which models did we use to classify diabetes?\n(a) Logistic Regression\n(b) Decision Tree\n(c) Random Forest\n(d) a and b\nYour answer: ")
if ans.lower() == 'd':
    print("Correct! We trained Logistic Regression and Decision Tree models.")
else:
    print("Incorrect. The correct answer is (d).")

# Question 5
ans = input("\n5. What does the confusion matrix show us?\n(a) The training accuracy\n(b) A visual of correct and incorrect predictions\n(c) The data distribution\n(d) The model structure\nYour answer: ")
if ans.lower() == 'b':
    print("Correct! It shows where the model is right and where it made mistakes.")
else:
    print("Incorrect. The correct answer is (b).")

# Question 6
ans = input("\n6. What is the purpose of the ROC curve?\n(a) To detect outliers\n(b) To plot feature importance\n(c) To evaluate model’s ability to separate classes\n(d) To train the model\nYour answer: ")
if ans.lower() == 'c':
    print("Correct! It shows how well the model distinguishes between classes.")
else:
    print("Incorrect. The correct answer is (c).")

# Question 7
ans = input("\n7. What would the model output for a new patient’s data?\n(a) A probability of having diabetes\n(b) A diagnosis report\n(c) A pie chart\n(d) A list of symptoms\nYour answer: ")
if ans.lower() == 'a':
    print("Correct! Most models give a probability score between 0 and 1.")
else:
    print("Incorrect. The correct answer is (a).")


# Additional ML Concepts Questions
print("\n--- General ML Concepts ---")

# Q8: What is a model?
ans = input("\n8. What is a model in Machine Learning?\n(a) A program that stores data\n(b) A trained function that makes predictions\n(c) A file that holds images\n(d) A neural network only\nYour answer: ")
if ans.lower() == 'b':
    print("Correct! A model is a trained function that learns from data and makes predictions.")
else:
    print("Incorrect. The correct answer is (b). A model is a trained function that can predict based on input features.")

# Q9: What does 'training a model' mean?
ans = input("\n9. What does 'training a model' mean?\n(a) Installing ML libraries\n(b) Typing Python code\n(c) Finding patterns in data to make predictions\n(d) Testing accuracy\nYour answer: ")
if ans.lower() == 'c':
    print("Correct! Training means the model learns patterns from the training data.")
else:
    print("Incorrect. The correct answer is (c).")

# Q10: Why do we split data into training and testing sets?
ans = input("\n10. Why do we split data into training and testing sets?\n(a) To reduce dataset size\n(b) To test computer speed\n(c) To check if the model works well on new/unseen data\n(d) To improve accuracy only\nYour answer: ")
if ans.lower() == 'c':
    print("Correct! The test set helps us evaluate how the model performs on new, unseen data.")
else:
    print("Incorrect. The correct answer is (c).")

# Q11: Which of the following best describes the major steps in an ML pipeline?
ans = input("\n11. Which of these is the correct order of ML steps?\n(a) Train → Collect Data → Predict → Clean\n(b) Collect Data → Clean → Train → Evaluate → Predict\n(c) Install Python → Collect Data → Predict → Save\n(d) Predict → Clean → Evaluate\nYour answer: ")
if ans.lower() == 'b':
    print("Correct! This is the usual pipeline followed in ML projects.")
else:
    print("Incorrect. The correct answer is (b).")

#More Questions
# Question 12
ans = input("\n12. A classifier gives high accuracy but performs poorly on the minority class. What might be happening?\n(a) The dataset is balanced\n(b) The model has high precision and recall\n(c) The dataset is imbalanced, and the model favors the majority class\n(d) Accuracy always reflects balanced performance\nYour answer: ")
if ans.lower() == 'c':
    print("Correct! High accuracy can be misleading in imbalanced datasets where the model mostly predicts the majority class.")
else:
    print("Incorrect. The correct answer is (c). In imbalanced datasets, accuracy may hide poor performance on the minority class.")

print("\n--- Imbalanced Data & Medical Relevance ---")

# Question 13
ans = input("\n13. If false negatives are more dangerous (e.g., in disease detection), which metric should you focus on?\n(a) Precision\n(b) Recall\n(c) Accuracy\n(d) F1-score\nYour answer: ")
if ans.lower() == 'b':
    print("Correct! Recall is crucial when missing positive cases (false negatives) is risky, like in medical diagnoses.")
else:
    print("Incorrect. The correct answer is (b). Recall helps minimize false negatives.")

# Question 14: Which metric to prioritize in disease detection?
ans = input("\n14. In disease detection, which metric is usually most important?\n(a) Precision\n(b) Recall\n(c) F1-score\n(d) Accuracy\nYour answer: ")
if ans.lower() == 'b':
    print("Correct! Recall is crucial to avoid missing patients who actually have the disease (false negatives).")
else:
    print("Incorrect. The correct answer is (b). Recall helps us detect as many actual cases as possible.")

# Question 15
ans = input("\n15. In an imbalanced dataset (e.g., 90% negative, 10% positive), which model is likely misleading?\n(a) A model with 90% accuracy\n(b) A model with 80% recall\n(c) A model with high F1-score\n(d) A model with good precision\nYour answer: ")
if ans.lower() == 'a':
    print("Correct! A model predicting only the majority class can still get 90% accuracy but completely miss the minority class.")
else:
    print("Incorrect. The correct answer is (a). High accuracy can be misleading in imbalanced datasets.")

# Question 16
# ans = input("\n16. Which model might be better suited to handle imbalanced data directly?\n(a) Logistic Regression without any changes\n(b) Decision Tree with class_weight='balanced'\n(c) KMeans clustering\n(d) Linear Regression\nYour answer: ")
# if ans.lower() == 'b':
#     print("Correct! Decision Trees (and many sklearn models) can be configured with class_weight='balanced' to better handle imbalance.")
# else:
#     print("Incorrect. The correct answer is (b). Using class_weight='balanced' helps focus on minority classes.")

In [None]:

#End