
**Part 1: Importing the appropriate libraries**

In [28]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

**Part 2: Loading the dataset**

In [29]:
try:
    # Load the dataset; na_values='?' reads any '?' placeholders as missing values (NaN)
    df = pd.read_csv('data-heart.csv', na_values='?')

    # Drop rows that contain missing values so later steps do not fail on NaN
    df.dropna(inplace=True)

    print("Dataset loaded successfully.")
except FileNotFoundError:
    print("Error: 'data-heart.csv' not found. Please ensure the file is in the same directory.")
    raise  # Stop here: every later cell needs df to exist

Dataset loaded successfully.


**Part 3: Printing the length of the dataset**

In [30]:
# This prints the total number of rows remaining after cleaning
print(f"\nLength of the dataset: {len(df)} rows")


Length of the dataset: 303 rows


**Part 4: Printing the first 5 rows of the dataset**

In [31]:
print("\nFirst 5 rows of the dataset:")
print(df.head())


First 5 rows of the dataset:
   age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  \
0   63    1   3       145   233    1        0      150      0      2.3      0   
1   37    1   2       130   250    0        1      187      0      3.5      0   
2   41    0   1       130   204    0        0      172      0      1.4      2   
3   56    1   1       120   236    0        1      178      0      0.8      2   
4   57    0   0       120   354    0        1      163      1      0.6      2   

   ca  thal  target  
0   0     1       1  
1   0     2       1  
2   0     2       1  
3   0     2       1  
4   0     2       1  


**Part 5: Splitting the dataset for training and testing (80% training, 20% testing)**

In [32]:
# Define Features (X) and Target (y)
# Assuming the target column is named 'target'
X = df.drop('target', axis=1)
y = df['target']

# Split: 80% Training, 20% Testing
# random_state=42 ensures the split is the same every time you run it
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)
print("\nDataset split into training (80%) and testing (20%) sets.")


Dataset split into training (80%) and testing (20%) sets.
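The split above does not pass `stratify`, so the share of positive cases can differ between the training and test sets. A minimal sketch of a stratified split follows, assuming synthetic stand-in data; `X_demo`, `y_demo` and the 30% positive rate are hypothetical and not taken from the heart-disease file.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 rows, 3 features, 30% positive labels
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([1] * 30 + [0] * 70)

# stratify=y_demo keeps the class proportions the same in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42, stratify=y_demo
)

train_pos_rate = y_tr.mean()
test_pos_rate = y_te.mean()
print(f"Positive rate: train={train_pos_rate:.2f}, test={test_pos_rate:.2f}")
```

In the notebook itself this would amount to adding `stratify=y` to the existing call. Doing so changes which rows land in the test set, so the numbers reported later would change as well.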


**Part 6: Feature Scaling**

In [33]:
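# KNN classifies by distance between rows, so unscaled features with large
# ranges (e.g. chol) would dominate features with small ranges (e.g. oldpeak)
scaler = StandardScaler()

# Fit on the training data only, then apply the same transform to both sets,
# so that no statistics from the test set leak into training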
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
print("Feature scaling completed.")

Feature scaling completed.
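To make the effect of scaling concrete, the sketch below compares the Euclidean distance between two patients before and after standardization. It uses only the `chol` and `oldpeak` values of the first three rows printed in Part 4; the helper functions are illustrative, and standardizing three rows is only a toy stand-in for fitting `StandardScaler` on the full training set.

```python
import math

# (chol, oldpeak) for rows 0, 1 and 2 of the df.head() output in Part 4
rows = [(233, 2.3), (250, 3.5), (204, 1.4)]

def squared_diffs(a, b):
    """Per-feature squared differences between two rows."""
    return [(x - y) ** 2 for x, y in zip(a, b)]

def standardize(data):
    """Column-wise z-scores using the population std, as StandardScaler does."""
    cols = list(zip(*data))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c))
            for c, m in zip(cols, means)]
    return [tuple((v - m) / s for v, m, s in zip(row, means, stds))
            for row in data]

# Share of the squared distance between rows 0 and 1 that comes from chol
raw = squared_diffs(rows[0], rows[1])
raw_chol_share = raw[0] / sum(raw)

scaled_rows = standardize(rows)
scaled = squared_diffs(scaled_rows[0], scaled_rows[1])
scaled_chol_share = scaled[0] / sum(scaled)

print(f"chol share of squared distance, raw:    {raw_chol_share:.3f}")
print(f"chol share of squared distance, scaled: {scaled_chol_share:.3f}")
```

On the raw values almost all of the distance comes from `chol`, simply because it is measured in larger units. After standardization both features contribute on a comparable scale.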


**Part 7: Identifying the appropriate number of neighbours (K)**

In [34]:
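# Try K = 1 to 20 and keep the K with the highest accuracy (ties go to the smallest K).
# Note: accuracy is measured on the test set here, so the test set also picks K
# and the metrics reported later are likely optimistic.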
best_k = 0
best_accuracy = 0
k_range = range(1, 21)

print("\nSearching for the best K value...")

for k in k_range:
    knn_temp = KNeighborsClassifier(n_neighbors=k)
    knn_temp.fit(X_train_scaled, y_train)
    score = knn_temp.score(X_test_scaled, y_test)

    if score > best_accuracy:
        best_accuracy = score
        best_k = k

print(f"The appropriate number of neighbours identified is: K = {best_k}")


Searching for the best K value...
The appropriate number of neighbours identified is: K = 7
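The loop above scores each K on the test set, which means the test set is no longer an untouched check of the final model. One common alternative is to choose K by cross-validation on the training data alone. The sketch below assumes synthetic stand-in data from `make_classification`; names such as `X_demo`, `cv_scores` and `best_k_cv` are hypothetical, and the K it selects says nothing about the K = 7 found in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the heart-disease data (303 rows, 13 features)
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42
)

cv_scores = {}
for k in range(1, 21):
    # The pipeline refits the scaler inside each fold, so scaling statistics
    # never include the rows that fold is validated on
    pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    cv_scores[k] = cross_val_score(pipe, X_tr, y_tr, cv=5).mean()

# Highest mean CV accuracy; on ties max() returns the smallest K
best_k_cv = max(cv_scores, key=cv_scores.get)

# The test set is used once, after K has been fixed
final_pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=best_k_cv))
final_pipe.fit(X_tr, y_tr)
test_accuracy = final_pipe.score(X_te, y_te)
print(f"K chosen by 5-fold CV: {best_k_cv}, held-out accuracy: {test_accuracy:.4f}")
```

With this arrangement the held-out accuracy plays no part in choosing K, so it is a fairer estimate of how the model would do on unseen patients.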


**Part 8: Defining and Fitting the model**

In [35]:
# We use the best_k found in the previous step
final_model = KNeighborsClassifier(n_neighbors=best_k)
final_model.fit(X_train_scaled, y_train)
print(f"Final model fitted with K={best_k}.")

Final model fitted with K=7.


**Part 9: Predicting results from the test set**

In [36]:
# Use the trained model to predict the target values for the test set
y_pred = final_model.predict(X_test_scaled)
print("Predictions generated on the test set.")

Predictions generated on the test set.


**Part 10: Evaluating the model using Confusion matrix and other measures**

In [37]:
print("\n--- Model Evaluation ---")

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)

# Other Measures
acc = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred)
rec = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f"\nPerformance Metrics (K={best_k}):")
print(f"Accuracy:  {acc:.4f}")
print(f"Precision: {prec:.4f}")
print(f"Recall:    {rec:.4f}")
print(f"F1 Score:  {f1:.4f}")


--- Model Evaluation ---
Confusion Matrix:
[[27  2]
 [ 3 29]]

Performance Metrics (K=7):
Accuracy:  0.9180
Precision: 0.9355
Recall:    0.9062
F1 Score:  0.9206
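The four metrics can be recomputed by hand from the confusion matrix printed above, which is a useful check on how they relate to each other. The sketch below uses only those four counts and relies on scikit-learn's layout for binary labels 0 and 1, `[[TN, FP], [FN, TP]]`.

```python
# Confusion matrix from the output above: rows are true labels, columns are
# predicted labels, so the layout is [[TN, FP], [FN, TP]]
cm = [[27, 2], [3, 29]]
(tn, fp), (fn, tp) = cm

accuracy = (tp + tn) / (tp + tn + fp + fn)           # 56 / 61
precision = tp / (tp + fp)                           # 29 / 31
recall = tp / (tp + fn)                              # 29 / 32
f1 = 2 * precision * recall / (precision + recall)   # 58 / 63

print(f"Accuracy:  {accuracy:.4f}")   # 0.9180
print(f"Precision: {precision:.4f}")  # 0.9355
print(f"Recall:    {recall:.4f}")     # 0.9062
print(f"F1 Score:  {f1:.4f}")         # 0.9206
```

In this test set, 3 of the 32 patients with heart disease were missed (false negatives) and 2 of the 29 without it were flagged (false positives).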


**Part 11: Saving the model and scaler**

In [40]:
import joblib

joblib.dump(final_model, 'heartdisease_knn_model.pkl')
joblib.dump(scaler, 'scaler.pkl')

print("Model and Scaler saved successfully!")

Model and Scaler saved successfully!
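The scaler is saved alongside the model because any new input must go through the same scaling before it reaches the classifier. A minimal sketch of the round trip follows, assuming a small synthetic model written to a temporary directory; `X_demo`, `model_demo`, `scaler_demo` and `new_rows` are hypothetical stand-ins, and only the two file names come from the cell above.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in training data: 40 rows, 13 features, binary target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 13))
y_demo = (X_demo[:, 0] > 0).astype(int)

scaler_demo = StandardScaler().fit(X_demo)
model_demo = KNeighborsClassifier(n_neighbors=7).fit(
    scaler_demo.transform(X_demo), y_demo
)

with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, 'heartdisease_knn_model.pkl')
    scaler_path = os.path.join(tmp, 'scaler.pkl')
    joblib.dump(model_demo, model_path)
    joblib.dump(scaler_demo, scaler_path)

    # Later, e.g. in a separate script: load both artifacts back
    loaded_model = joblib.load(model_path)
    loaded_scaler = joblib.load(scaler_path)

# New rows are scaled with the loaded scaler before prediction
new_rows = rng.normal(size=(3, 13))
predictions = loaded_model.predict(loaded_scaler.transform(new_rows))
reference = model_demo.predict(scaler_demo.transform(new_rows))
print(predictions)
```

Files written by `joblib.dump` are pickle-based, so they should only be loaded from trusted sources, preferably with the same scikit-learn version that produced them.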
