Load the `diabetes.csv` dataset. Prepare the data for modeling (split it into training, validation, and test sets, then scale the features).  
Build models that detect diabetes cases (the `Outcome` column):
- [`LogisticRegression`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html),
- [`GaussianNB`](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html),
- [`LinearDiscriminantAnalysis`](https://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html),
- [`QuadraticDiscriminantAnalysis`](https://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.QuadraticDiscriminantAnalysis.html).

Use the validation set to compare the models by accuracy, precision, recall, and F-measure. Use the test set to evaluate the best model.
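
The comparison described above can also be run in a single loop. The following is only a sketch under stated assumptions: it uses synthetic data from `make_classification` as a stand-in for `diabetes.csv`, and the names `candidates` and `validation_scores` are hypothetical helpers, not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data (assumption): 8 features, binary target.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=5, n_redundant=0, random_state=17
)
X_fit, X_val, y_fit, y_val = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=17
)

# The four model families named in the task, with default settings.
candidates = {
    "LR": LogisticRegression(),
    "GNB": GaussianNB(),
    "LDA": LinearDiscriminantAnalysis(),
    "QDA": QuadraticDiscriminantAnalysis(),
}

# Fit each model on the training part and score it on the validation part.
validation_scores = {}
for name, model in candidates.items():
    y_hat = model.fit(X_fit, y_fit).predict(X_val)
    validation_scores[name] = {
        "accuracy": accuracy_score(y_val, y_hat),
        "precision": precision_score(y_val, y_hat, zero_division=0),
        "recall": recall_score(y_val, y_hat, zero_division=0),
        "f1": f1_score(y_val, y_hat, zero_division=0),
    }

for name, scores in validation_scores.items():
    print(name, {metric: round(value, 2) for metric, value in scores.items()})
```

The precision, recall, and F1 values here refer to the positive class (label 1), which is the default for these scikit-learn functions on binary targets.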

In [19]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix

# Models
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis

In [20]:
# 1 = diabetes, 0 = no diabetes
df = pd.read_csv("../data/diabetes.csv")
df.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [21]:
X = df.drop(columns="Outcome")
y = df.Outcome

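# Hold out 20% of the data for testing, then 25% of the remainder
# (20% of the total) for validation, leaving 60% for training.
X_train_full, X_test, y_train_full, y_test = train_test_split(X, y, test_size=0.2, random_state=17)
X_train, X_validation, y_train, y_validation = train_test_split(X_train_full, y_train_full, test_size=0.25, random_state=17)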
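
As a quick check of the arithmetic behind the two-stage split, the sketch below applies the same `test_size` values to a made-up array of 1000 rows (an assumption chosen for round numbers, not the size of `diabetes.csv`). The names with a `_demo`, `_te`, `_tr`, or `_val` suffix are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 1000 rows, one feature, alternating labels.
X_demo = np.arange(1000).reshape(-1, 1)
y_demo = np.arange(1000) % 2

# Stage 1: 20% of all rows go to the test set.
X_full, X_te, y_full, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=17
)
# Stage 2: 25% of the remaining 80% (that is, 20% of all rows) go to validation.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_full, y_full, test_size=0.25, random_state=17
)

split_sizes = {"train": len(X_tr), "validation": len(X_val), "test": len(X_te)}
print(split_sizes)  # {'train': 600, 'validation': 200, 'test': 200}
```

The result is a 60/20/20 split, so the validation and test sets have the same size.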

In [22]:
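# Fit the scaler on the training set only, then reuse its statistics
# for the validation and test sets to avoid data leakage.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_validation = scaler.transform(X_validation)
X_test = scaler.transform(X_test)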

print(X_train[:5, :])

[[ 1.21831474  1.8479977  -0.30790916 -1.3078218  -0.68350562 -1.22485588
   0.58392127 -0.12699181]
 [ 0.32259053  2.03052253 -0.30790916  0.76904136  2.11506174 -0.1601969
   0.3187704  -0.38024138]
 [-1.17028315  0.38779912 -0.09595523  1.33545859  1.46923851  1.33571635
  -0.33070025 -0.80232399]
 [ 0.91974001  1.72631449  1.33473378  0.64317086 -0.68350562  0.24410398
  -0.92952411  2.23667082]
 [-0.57313368  2.27388896  0.01002173  1.52426433  3.99225462 -0.25453377
  -0.94739945  1.64575516]]
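
To illustrate what fitting the scaler on the training set alone implies, here is a small sketch on randomly generated data (an assumption, unrelated to `diabetes.csv`; all names ending in `_demo` or `_scaled` are hypothetical). The training columns end up with mean 0 and standard deviation 1, while the validation columns are only approximately standardized because they reuse the training statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Randomly generated stand-in data with a non-zero mean and non-unit spread.
rng = np.random.default_rng(17)
train_demo = rng.normal(loc=100.0, scale=15.0, size=(200, 3))
val_demo = rng.normal(loc=100.0, scale=15.0, size=(50, 3))

scaler_demo = StandardScaler()
train_scaled = scaler_demo.fit_transform(train_demo)  # learns mean/std from train
val_scaled = scaler_demo.transform(val_demo)          # reuses the train mean/std

print("train mean:", train_scaled.mean(axis=0).round(6))
print("train std: ", train_scaled.std(axis=0).round(6))
print("val mean:  ", val_scaled.mean(axis=0).round(3))
print("val std:   ", val_scaled.std(axis=0).round(3))
```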


### Logistic Regression

In [23]:
log_reg = LogisticRegression(random_state=17, solver="liblinear")

log_reg.fit(X_train, y_train)

y_pred = log_reg.predict(X_validation)

tn, fp, fn, tp = confusion_matrix(y_validation, y_pred).ravel()

print("Confusion Matrix:")
print(confusion_matrix(y_validation, y_pred))
print("\nTrue Negatives:", tn)
print("False Positives:", fp)
print("False Negatives:", fn)
print("True Positives:", tp)


print("#" * 20)
print(classification_report(y_validation, y_pred))

Confusion Matrix:
[[93  7]
 [32 22]]

True Negatives: 93
False Positives: 7
False Negatives: 32
True Positives: 22
####################
              precision    recall  f1-score   support

           0       0.74      0.93      0.83       100
           1       0.76      0.41      0.53        54

    accuracy                           0.75       154
   macro avg       0.75      0.67      0.68       154
weighted avg       0.75      0.75      0.72       154
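
The class-1 row of the report can be reproduced by hand from the confusion matrix. The sketch below uses the four counts printed above and the standard definitions of the metrics; it is meant as a worked check, not as a replacement for `classification_report`.

```python
# Validation counts for logistic regression, copied from the confusion matrix above.
tn, fp, fn, tp = 93, 7, 32, 22

accuracy = (tp + tn) / (tn + fp + fn + tp)          # share of correct predictions
precision = tp / (tp + fp)                          # predicted positives that are real
recall = tp / (tp + fn)                             # real positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# prints: accuracy=0.75 precision=0.76 recall=0.41 f1=0.53
```

The low recall means that 32 of the 54 diabetic cases in the validation set were missed, even though overall accuracy looks acceptable.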





### Gaussian Naive Bayes

In [24]:
g_nb = GaussianNB()
g_nb.fit(X_train, y_train)

y_pred = g_nb.predict(X_validation)

tn, fp, fn, tp = confusion_matrix(y_validation, y_pred).ravel()

print("Confusion Matrix:")
print(confusion_matrix(y_validation, y_pred))
print("\nTrue Negatives:", tn)
print("False Positives:", fp)
print("False Negatives:", fn)
print("True Positives:", tp)
print("#" * 20)

print(classification_report(y_validation, y_pred))

Confusion Matrix:
[[89 11]
 [27 27]]

True Negatives: 89
False Positives: 11
False Negatives: 27
True Positives: 27
####################
              precision    recall  f1-score   support

           0       0.77      0.89      0.82       100
           1       0.71      0.50      0.59        54

    accuracy                           0.75       154
   macro avg       0.74      0.70      0.71       154
weighted avg       0.75      0.75      0.74       154



### Linear Discriminant Analysis

In [25]:
lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)

y_pred = lda.predict(X_validation)

tn, fp, fn, tp = confusion_matrix(y_validation, y_pred).ravel()

print("Confusion Matrix:")
print(confusion_matrix(y_validation, y_pred))
print("\nTrue Negatives:", tn)
print("False Positives:", fp)
print("False Negatives:", fn)
print("True Positives:", tp)
print("#" * 20)

print(classification_report(y_validation, y_pred))

Confusion Matrix:
[[94  6]
 [32 22]]

True Negatives: 94
False Positives: 6
False Negatives: 32
True Positives: 22
####################
              precision    recall  f1-score   support

           0       0.75      0.94      0.83       100
           1       0.79      0.41      0.54        54

    accuracy                           0.75       154
   macro avg       0.77      0.67      0.68       154
weighted avg       0.76      0.75      0.73       154





### Quadratic Discriminant Analysis

In [26]:
qda = QuadraticDiscriminantAnalysis()
qda.fit(X_train, y_train)

y_pred = qda.predict(X_validation)

tn, fp, fn, tp = confusion_matrix(y_validation, y_pred).ravel()

print("Confusion Matrix:")
print(confusion_matrix(y_validation, y_pred))
print("\nTrue Negatives:", tn)
print("False Positives:", fp)
print("False Negatives:", fn)
print("True Positives:", tp)
print("#" * 20)

print(classification_report(y_validation, y_pred))

Confusion Matrix:
[[86 14]
 [29 25]]

True Negatives: 86
False Positives: 14
False Negatives: 29
True Positives: 25
####################
              precision    recall  f1-score   support

           0       0.75      0.86      0.80       100
           1       0.64      0.46      0.54        54

    accuracy                           0.72       154
   macro avg       0.69      0.66      0.67       154
weighted avg       0.71      0.72      0.71       154



### Choosing the best model
1. Accuracy is highest and equal (0.75) for LR, GNB, and LDA.
2. The F1-score for class 1 is highest for GNB (0.59).
3. Recall for class 1 is highest for GNB (0.50).
4. Precision for class 1 is highest for LDA (0.79).
5. The weighted-average F1-score is highest for GNB (0.74).

Selected model: GNB
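
To see the four validation results side by side, the metrics for class 1 can be recomputed from the confusion-matrix counts printed in the sections above. This is a sketch: the `counts` dictionary and the `comparison` table are hypothetical names, and sorting by F1 is one possible ranking criterion rather than the only reasonable one.

```python
import pandas as pd

# Validation confusion-matrix counts copied from the outputs above: (tn, fp, fn, tp).
counts = {
    "LR": (93, 7, 32, 22),
    "GNB": (89, 11, 27, 27),
    "LDA": (94, 6, 32, 22),
    "QDA": (86, 14, 29, 25),
}

# Compute accuracy plus class-1 precision, recall, and F1 for each model.
rows = {}
for name, (tn, fp, fn, tp) in counts.items():
    rows[name] = {
        "accuracy": (tp + tn) / (tn + fp + fn + tp),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }

# One row per model, best F1 first.
comparison = pd.DataFrame.from_dict(rows, orient="index").sort_values(
    "f1", ascending=False
)
print(comparison.round(2))
```

Sorted this way, GNB comes first, which matches the selection above. For a screening task, where missed cases are costly, recall and F1 for the positive class are arguably more informative than accuracy.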

In [27]:
y_pred = g_nb.predict(X_test)

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("\nTrue Negatives:", tn)
print("False Positives:", fp)
print("False Negatives:", fn)
print("True Positives:", tp)
print("#" * 20)

print(classification_report(y_test, y_pred))

Confusion Matrix:
[[86 11]
 [24 33]]

True Negatives: 86
False Positives: 11
False Negatives: 24
True Positives: 33
####################
              precision    recall  f1-score   support

           0       0.78      0.89      0.83        97
           1       0.75      0.58      0.65        57

    accuracy                           0.77       154
   macro avg       0.77      0.73      0.74       154
weighted avg       0.77      0.77      0.77       154

