In [28]:
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
from sklearn.neighbors import KNeighborsClassifier
import joblib  # For saving the model

Model Performance Evaluation

In [30]:
input_path = "Encoded_data.csv"
cleaned_path = pd.read_csv(input_path)
X = cleaned_path.drop(columns=['Condition'])
y = cleaned_path['Condition']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42) 
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Assistant
Let me explain the selected code step by step:

1. `input_path = "Encoded_data.csv"` - Creates a variable storing the path to the CSV file

2. `cleaned_path = pd.read_csv(input_path)` - Uses pandas to read the CSV file into a DataFrame (despite its name, `cleaned_path` holds the loaded data, not a file path)

3. `X = cleaned_path.drop(columns=['Condition'])` - Creates feature matrix X by removing the 'Condition' column

4. `y = cleaned_path['Condition']` - Creates target variable y containing only the 'Condition' column

5. `X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)` - Splits the data into training (80%) and testing (20%) sets using scikit-learn's train_test_split function; fixing `random_state` makes the split reproducible across runs

6. `model = KNeighborsClassifier(n_neighbors=5)` - Initializes a K-Nearest Neighbors classifier with 5 neighbors

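7. `model.fit(X_train, y_train)` - Fits the model on the training data; for KNN this mainly stores the training samples so that neighbors can be looked up at prediction time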

8. `y_pred = model.predict(X_test)` - Uses the trained model to make predictions on the test data
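
Because KNN classifies by distance, features on larger scales can dominate the neighbor search. The notebook imports `StandardScaler` but does not use it, so the following is only a sketch of how scaling could be combined with the same classifier. The synthetic data from `make_classification` and the names `X_demo`, `y_demo`, and `scaled_knn` are assumptions for illustration, since `Encoded_data.csv` is not available here; `stratify` is likewise an optional addition, not part of the original cell.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for Encoded_data.csv (assumption: illustration only).
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=42)

# Same 80/20 split as the notebook; stratify keeps class proportions
# similar in the training and test sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The pipeline learns the scaling statistics from the training data only,
# then applies the same scaling before each prediction.
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scaled_knn.fit(X_tr, y_tr)
y_demo_pred = scaled_knn.predict(X_te)
```

Wrapping the scaler and classifier in one pipeline avoids leaking test-set statistics into the scaling step.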

In [32]:
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

Accuracy: 0.8333333333333334

Classification Report:
              precision    recall  f1-score   support

         0.0       1.00      0.50      0.67         2
         1.0       0.80      1.00      0.89         4

    accuracy                           0.83         6
   macro avg       0.90      0.75      0.78         6
weighted avg       0.87      0.83      0.81         6



# Assistant
The selected code evaluates and prints the performance metrics of a classification model. Here's a step-by-step breakdown:

1. `accuracy_score(y_test, y_pred)` - Calculates the accuracy by comparing the predicted values (y_pred) against the actual test values (y_test)

2. `classification_report(y_test, y_pred)` - Generates a per-class report containing:
   - Precision
   - Recall
   - F1-score
   - Support (the number of true samples of each class in the test set)
   along with macro and weighted averages across the classes

3. The results are printed using three print statements:
   - The first shows the overall accuracy
   - The second prints the "Classification Report:" heading
   - The third displays the full classification report with detailed metrics

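This code evaluates how well the classifier performs on the held-out test data. Note that the support column shows only 6 test samples here, so a single misclassification shifts accuracy by roughly 17 percentage points; the 0.83 figure is a rough estimate at best.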
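
The report above can be traced back to a confusion matrix, which the notebook imports (`confusion_matrix`) but never prints. The labels below are a reconstruction that is consistent with the printed report (2 samples of class 0, 4 of class 1, one class-0 sample predicted as class 1); they are not the actual `y_test` and `y_pred` values, and the names `y_true_example` and `y_pred_example` are hypothetical.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Reconstructed labels consistent with the printed report (assumption).
y_true_example = [0, 0, 1, 1, 1, 1]
y_pred_example = [0, 1, 1, 1, 1, 1]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true_example, y_pred_example)
tn, fp, fn, tp = cm.ravel()

recall_0 = tn / (tn + fp)      # share of true class-0 samples found: 1 / 2
precision_1 = tp / (tp + fp)   # share of class-1 predictions that are right: 4 / 5
accuracy = accuracy_score(y_true_example, y_pred_example)  # 5 / 6

print(cm)
print(recall_0, precision_1, accuracy)
```

Reading the matrix directly makes it clear that the single error is a class-0 sample labeled as class 1, which is why class 0 has perfect precision but only 0.50 recall.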

Hyperparameter Tuning

In [36]:
param_grid = {
    'n_neighbors': [3,5,7,9,11],
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan']
}
knn = KNeighborsClassifier()
grid_search = GridSearchCV(knn, param_grid, cv = 5, scoring = 'accuracy')
grid_search.fit(X_train, y_train)
print("Best Parameters:", grid_search.best_params_)
print("Best Cross-Validation Score:", grid_search.best_score_)

Best Parameters: {'metric': 'euclidean', 'n_neighbors': 3, 'weights': 'uniform'}
Best Cross-Validation Score: 0.67


# Assistant
This code sets up and performs a grid search for hyperparameter tuning of a K-Nearest Neighbors classifier. Here's the step-by-step breakdown:

1. `param_grid` defines a dictionary of parameters to test:
   - `n_neighbors`: Tests different numbers of neighbors [3,5,7,9,11]
   - `weights`: Tests two weighting methods ('uniform' and 'distance')
   - `metric`: Tests two distance metrics ('euclidean' and 'manhattan')

2. `knn = KNeighborsClassifier()` creates a base KNN model instance

3. `grid_search = GridSearchCV(knn, param_grid, cv = 5, scoring = 'accuracy')` initializes the grid search:
   - Uses the KNN model and parameter grid defined above
   - Performs 5-fold cross-validation (`cv = 5`)
   - Uses accuracy as the scoring metric

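4. `grid_search.fit(X_train, y_train)` performs the grid search on the training data: 5 × 2 × 2 = 20 parameter combinations, each evaluated with 5-fold cross-validation (100 fits in total), followed by a refit of the best combination on the full training set (the default `refit=True`)

5. The final print statements display the best parameters found (`best_params_`) and the mean cross-validated accuracy of that combination (`best_score_`)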
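
To get a sense of how much work the grid search does, the number of candidate combinations can be counted with scikit-learn's `ParameterGrid`. This is a small sketch that reuses the same `param_grid`; the names `n_candidates`, `n_folds`, and `n_cv_fits` are introduced here for illustration.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'n_neighbors': [3, 5, 7, 9, 11],
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan'],
}

# Every combination of the listed values is one candidate model.
n_candidates = len(ParameterGrid(param_grid))

# Each candidate is fit once per cross-validation fold (cv = 5 in the notebook).
n_folds = 5
n_cv_fits = n_candidates * n_folds

print("Candidates:", n_candidates)
print("Cross-validation fits:", n_cv_fits)
```

The count grows multiplicatively with each added parameter, which is worth keeping in mind before widening the grid.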

Model Validation

In [42]:
best_knn = grid_search.best_estimator_
y_test_pred = best_knn.predict(X_test)
print("Test Accuracy:", accuracy_score(y_test, y_test_pred))

Test Accuracy: 0.8333333333333334


# Assistant
Let me explain the selected code line by line:

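1. `best_knn = grid_search.best_estimator_` - Retrieves the KNN model with the best cross-validated parameters from the grid search above; with GridSearchCV's default `refit=True`, it has already been refit on the full training set.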

2. `y_test_pred = best_knn.predict(X_test)` - Uses the best KNN model to make predictions on the test data (X_test).

3. `print("Test Accuracy:", accuracy_score(y_test, y_test_pred))` - Calculates and prints the accuracy score by comparing the predicted values (y_test_pred) with the actual test values (y_test).

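This step checks how the tuned model performs on test data that played no part in the grid search. Here the test accuracy (0.833) is identical to that of the untuned `n_neighbors=5` model, and it sits well above the best cross-validation score (0.67); with only 6 test samples, neither difference can be read as meaningful.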
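
The notebook imports `joblib` "for saving the model" but never calls it. The following is a hedged sketch of how a fitted estimator such as `best_knn` could be saved and reloaded. The synthetic data, the temporary directory, and the file name `best_knn.joblib` are all assumptions for illustration; the original notebook does not specify where or whether the model is saved.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the notebook's training data (assumption).
X_demo, y_demo = make_classification(n_samples=100, n_features=4, random_state=0)
fitted_knn = KNeighborsClassifier(n_neighbors=3).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name; any writable path would do.
    model_path = os.path.join(tmp_dir, "best_knn.joblib")
    joblib.dump(fitted_knn, model_path)      # serialize the fitted model
    restored_knn = joblib.load(model_path)   # load it back from disk

# The restored model should reproduce the original predictions.
same_predictions = bool(
    (restored_knn.predict(X_demo) == fitted_knn.predict(X_demo)).all()
)
print("Predictions match after reload:", same_predictions)
```

A saved model should generally be loaded with the same scikit-learn version that produced it.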