In [None]:
"""
Class Imbalance : Uneven frequency of classes
Accuracy alone can be misleading when classes are imbalanced, so we need other ways to assess (evaluate) classification performance.

True Positive (TP): Correctly identified positive cases.
True Negative (TN): Correctly identified negative cases.
False Positive (FP): Negative cases incorrectly predicted as positive.
False Negative (FN): Positive cases incorrectly predicted as negative.

Accuracy Formula ==> (TP + TN) / (TP + TN + FP + FN)

Different ways of evaluating the performance of a classification model:

1. Precision ==> Precision is the ratio of correctly predicted positive observations to the total predicted positives.

Formula ===> TP / (TP + FP)
--> Optimize for precision when false positives are costly.
--> High precision means that when the model predicts a positive class, it is often correct.
--> Precision focuses on the quality of the positive predictions, minimizing false positives.

2. Recall ==> Recall (also known as sensitivity or true positive rate) is 
              the ratio of correctly predicted positive observations to all the actual positives.
              
Formula ===> TP / (TP + FN)
--> Optimize for recall when false negatives are costly.
--> High recall means that the model identifies most of the positive instances.
--> Recall focuses on capturing all actual positives, minimizing false negatives.

3. F1 Score ==> A metric that combines precision and recall into a single number by taking their harmonic mean.

Formula ===> 2 * (precision * recall) / (precision + recall)
---> Especially useful when precision and recall matter equally, when the classes are imbalanced, or
     when one metric alone does not provide a complete picture of model performance.
---> High F1-Score: Indicates a good balance between precision and recall, meaning the model is effectively 
     identifying positive cases while minimizing both false positives and false negatives.
     

"""

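The four formulas above can be sketched as a small helper. This is a minimal illustration, not part of the original notebook: the function name `classification_metrics` and the counts passed to it are hypothetical, chosen only to show how accuracy can look strong on imbalanced data while recall stays low.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical imbalanced data: 95 actual negatives, 5 actual positives.
metrics = classification_metrics(tp=2, tn=93, fp=2, fn=3)
print({name: round(value, 2) for name, value in metrics.items()})
# {'accuracy': 0.95, 'precision': 0.5, 'recall': 0.4, 'f1': 0.44}
```

With these made-up counts the accuracy is 0.95 even though the model finds only 2 of the 5 positive cases, which is why precision, recall and F1 are reported alongside it.
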
In [3]:
# Predict whether a person has diabetes based on BMI and Age (binary classification)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.preprocessing import StandardScaler
import pandas as pd

diabetes_df=pd.read_csv('diabetes.csv')
X=diabetes_df[['BMI','Age']].values
y=diabetes_df['Outcome'].values

X_train,X_test,y_train,y_test=train_test_split(X,y, test_size=0.3, random_state=43)
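# Fit the scaler on the training set only, then reuse its statistics on the test set.
# Refitting on X_test would leak test-set information into preprocessing.
scaler=StandardScaler()
X_train_scaled=scaler.fit_transform(X_train)
X_test_scaled=scaler.transform(X_test)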
knn=KNeighborsClassifier(n_neighbors=6)
knn.fit(X_train_scaled,y_train)
y_pred=knn.predict(X_test_scaled)
#print(y_pred)
print(confusion_matrix(y_test,y_pred))
print(classification_report(y_test,y_pred))
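# Score on the scaled test features, matching what the model was trained on.
# Scoring the unscaled X_test is what produced the 0.48 in the recorded output.
print(knn.score(X_test_scaled,y_test))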

[[135  17]
 [ 49  30]]
              precision    recall  f1-score   support

           0       0.73      0.89      0.80       152
           1       0.64      0.38      0.48        79

    accuracy                           0.71       231
   macro avg       0.69      0.63      0.64       231
weighted avg       0.70      0.71      0.69       231

0.4805194805194805
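
The class 1 row of the report can be checked by hand from the confusion matrix printed above. The sketch below is an added illustration, assuming only scikit-learn's convention that rows are actual classes and columns are predicted classes. It also compares the model's accuracy with a majority-class baseline, that is, always predicting class 0.

```python
# Confusion matrix copied from the output above.
# Rows are actual classes (0, 1); columns are predicted classes (0, 1).
cm = [[135, 17],
      [49, 30]]
tn, fp = cm[0]
fn, tp = cm[1]
total = tn + fp + fn + tp

precision_1 = tp / (tp + fp)           # 30 / 47
recall_1 = tp / (tp + fn)              # 30 / 79
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)
accuracy = (tp + tn) / total           # 165 / 231

# Baseline for comparison: always predict the majority class (0).
baseline_accuracy = (tn + fp) / total  # 152 / 231

print(round(precision_1, 2), round(recall_1, 2), round(f1_1, 2))  # 0.64 0.38 0.48
print(round(accuracy, 2), round(baseline_accuracy, 2))            # 0.71 0.66
```

The accuracy of 0.71 is only about five points above the 0.66 obtained by never predicting diabetes at all, so the per-class precision and recall carry most of the information here.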


In [None]:
"""
-->The model is better at identifying class 0 (no diabetes) than class 1 (diabetes),
as indicated by the higher precision, recall, and F1-score for class 0.

-->The recall for class 1 (diabetes) is relatively low (0.38), 
meaning the model misses most of the actual positive cases (49 of 79 diabetes cases).

-->The precision for class 1 (diabetes) is also moderate, 
meaning that when the model predicts diabetes, it is correct about 64% of the time.

"""