
# Instructions

In this notebook, you will complete code only in cells that contain the comment `### YOUR SOLUTION HERE`. Add your code below that comment, and nowhere else.

For example, you would complete the code cell below as follows:
```
# Write code to print "Hello"
### YOUR SOLUTION HERE
print("Hello")
```

Once you have completed your code, test it by running the test cell, which contains a comment that looks like this:
```
# Tests 5 points: Printing "Hello"
```

**Do not change any part of this notebook other than adding code below the `### YOUR SOLUTION HERE` comments.** Changing unauthorized parts of the notebook could result in a zero for the assignment.

In this notebook, proceed step by step.  Do not move on to the next section until you have successfully completed all of the prior sections.

You can see a video demo of this tool at the following link:

https://youtu.be/yvLWbpgnspM?si=oeUEICnxrC0Ysbjb&t=143


# Step 0 - Run the cells below

In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import f1_score, classification_report, confusion_matrix

# Set a random seed for reproducibility
RANDOM_SEED = 23
np.random.seed(RANDOM_SEED)




In [None]:
# Load the Breast Cancer Wisconsin dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)



# Step 1

## Step 1a - Your code - Data Splitting

In the cell below, split the data into 80% training and 20% testing using `train_test_split` from the scikit-learn library with the `random_state` parameter set to `RANDOM_SEED`.

Your variables should be called `X_train`, `X_test`, `y_train`, and `y_test`.
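
As a minimal sketch of how `test_size` and `random_state` behave, the example below splits a small toy array rather than the assignment data. The names `X_toy`, `y_toy`, and the seed `0` are illustrative assumptions and are not used elsewhere in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples with 2 features each (illustrative only)
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1] * 5)

# test_size=0.2 holds out 20% of the rows; random_state fixes the shuffle
Xa_train, Xa_test, ya_train, ya_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0
)

# Calling again with the same seed reproduces the same split
Xb_train, Xb_test, yb_train, yb_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0
)

print(Xa_train.shape, Xa_test.shape)      # 8 training rows, 2 test rows
print(np.array_equal(Xa_test, Xb_test))   # identical test rows on both calls
```

Fixing the seed is what allows the test cells in this notebook to compare against specific expected values.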

In [None]:
# Split the dataset into training and testing sets

### YOUR SOLUTION HERE
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=RANDOM_SEED)



## Step 1b - Test your code by running the cell below

In [None]:
# Tests 5 points: Data Loading and Splitting

# Reset index after train-test split
X_train = X_train.reset_index(drop=True)
y_train = y_train.reset_index(drop=True)
X_test = X_test.reset_index(drop=True)
y_test = y_test.reset_index(drop=True)

assert X_train['worst radius'][3] == 16.45, "The value for X_train['worst radius'][3] is incorrect."
assert X_train['mean area'][2] == 289.1, "The value for X_train['mean area'][2] is incorrect."
assert X_test['mean radius'][11] == 8.597, "The value for X_test['mean radius'][11] is incorrect."
assert X_test['concave points error'][3] == 0.01075, "The value for X_test['concave points error'][3] is incorrect."
print("Visible tests passed.")



Visible tests passed.


# Step 2

## Step 2a - Your code - Data Scaling

In the cell below, apply the `StandardScaler` to transform the feature data.

Your feature vectors should be called `X_train_scaled` and `X_test_scaled`.
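
The sketch below illustrates the usual pattern on toy numbers: the scaler learns the mean and standard deviation from the training data only, and the same statistics are then applied to the test data. The arrays `train_toy` and `test_toy` are made-up values for illustration, not part of the assignment.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy training and test data (illustrative values only)
train_toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test_toy = np.array([[2.0, 40.0]])

toy_scaler = StandardScaler()
# fit_transform learns per-column mean and standard deviation from the training data
train_toy_scaled = toy_scaler.fit_transform(train_toy)
# transform reuses the training statistics; the test data is never used for fitting
test_toy_scaled = toy_scaler.transform(test_toy)

print(train_toy_scaled.mean(axis=0))  # approximately 0 for each column
print(train_toy_scaled.std(axis=0))   # approximately 1 for each column
print(test_toy_scaled)
```

Fitting on the training data alone keeps information about the test set from leaking into preprocessing.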

In [None]:
# Standardize the feature variables

### YOUR SOLUTION HERE
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


## Step 2b - Test your code by running the cell below

In [None]:
# Tests 5 points: Data Scaling

assert X_train_scaled[2][3] == -1.0704254306070624, "The value for X_train_scaled[2][3] is incorrect."
assert X_train_scaled[6][2] == -1.1359381046954606, "The value for X_train_scaled[6][2] is incorrect."
assert X_test_scaled[8][11] == -0.6875236011724752, "The value for X_test_scaled[8][11] is incorrect."
assert X_test_scaled[6][3] == 1.5040013560573013, "The value for X_test_scaled[6][3] is incorrect."
print("Visible tests passed.")



Visible tests passed.


# Step 3

## Hyperparameter tuning with k-fold cross-validation using `RandomizedSearchCV`

For each of the models listed below, perform hyperparameter tuning using a randomized search across several combinations of hyperparameters. Score each combination by its F1-score under k-fold cross-validation; `RandomizedSearchCV` from the `sklearn` library handles both the sampling and the cross-validation.

*   Logistic Regression using `LogisticRegression` from `sklearn` library
*   Decision Tree using `DecisionTreeClassifier` from `sklearn` library
*   Random Forest using `RandomForestClassifier` from `sklearn` library
*   Support Vector Classifier using `SVC` from `sklearn` library
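
The randomized search described above can be sketched on synthetic data as follows. This is only an illustration under assumed settings: the data comes from `make_classification`, and the names `X_demo`, `y_demo`, `search`, the seed `0`, and `max_iter=1000` are choices made for this example rather than values required by the assignment.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic binary classification data (illustrative only)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000, random_state=0),
    param_distributions={'C': [0.01, 0.1, 1, 10, 100]},
    n_iter=3,        # number of hyperparameter combinations sampled
    cv=3,            # number of cross-validation folds per combination
    scoring='f1',    # metric averaged across the folds
    random_state=0,  # makes the sampled combinations reproducible
)
search.fit(X_demo, y_demo)

print(search.best_params_)  # the sampled combination with the highest mean F1
print(search.best_score_)   # its mean cross-validated F1-score
```

After fitting, `best_estimator_` holds a model refit on all of the data passed to `fit` using the best combination found.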

Below are the model initializations you will use in `RandomizedSearchCV`, as well as the hyperparameters in `param_dist` that you will test in your randomized search.

## Step 3a - Your code - Initialize your models

In [None]:
# Initialize models with default parameters, make sure to set random_state=RANDOM_SEED

### YOUR SOLUTION HERE
models = {
    'Logistic Regression': LogisticRegression(random_state=RANDOM_SEED),
    'Decision Tree': DecisionTreeClassifier(random_state=RANDOM_SEED),
    'Random Forest': RandomForestClassifier(random_state=RANDOM_SEED),
    'Support Vector Classifier': SVC(random_state=RANDOM_SEED)
}



## Step 3b - Test your code by running the cell below

In [None]:
# Tests 5 points: Initializing models


## Hyperparameters that will be used in random combinations

In [None]:
# Run this cell, do not modify
# Define hyperparameter distribution to randomly sample for tuning
param_dist = {
    'Logistic Regression': {
        'C': [0.01, 0.1, 1, 10, 100]
    },
    'Decision Tree': {
        'max_depth': [None, 10, 20, 30, 40, 50],
        'min_samples_split': [2, 5, 10]
    },
    'Random Forest': {
        'n_estimators': [50, 100, 200],
        'max_depth': [None, 10, 20, 30],
        'min_samples_split': [2, 5, 10]
    },
    'Support Vector Classifier': {
        'C': [0.01, 0.1, 1, 10, 100],
        'kernel': ['linear', 'rbf']
    }
}


## Step 3c - Your code - Hyperparameter tuning with k-fold cross-validation

In this lab, you will use the F1-score as the metric by which you evaluate the performance of models.
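
For reference, the F1-score is the harmonic mean of precision and recall. The sketch below computes it by hand from a confusion matrix and compares the result with `f1_score`; the label lists `y_true_demo` and `y_pred_demo` are made-up values for illustration.

```python
from sklearn.metrics import confusion_matrix, f1_score

# Made-up true and predicted labels (illustrative only)
y_true_demo = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred_demo = [1, 0, 1, 0, 1, 1, 0, 1]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that were found
f1_manual = 2 * precision * recall / (precision + recall)

print(f1_manual)
print(f1_score(y_true_demo, y_pred_demo))  # agrees with the manual value
```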

---

You should save the best version of each model type in the dictionary `best_tuned_models`, and you should save the best overall model in `best_model`.

---

In `best_tuned_models`, for each model type, you should save the best estimator from `RandomizedSearchCV` under the key `best estimator` and its best score under the key `best score`.

In `best_model`, for the best model overall, you should save the model type in `name` (e.g. `'Logistic Regression'`), the estimator in `estimator`, and the score in `score`.
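
One possible way to pick the overall winner from a dictionary shaped like `best_tuned_models` is sketched below. The dictionary `demo_tuned`, its model names, its placeholder estimator strings, and its scores are all hypothetical; in the assignment the entries come from `RandomizedSearchCV`.

```python
# Hypothetical per-model results with the same keys as best_tuned_models
demo_tuned = {
    'Model A': {'best estimator': 'estimator_a', 'best score': 0.91},
    'Model B': {'best estimator': 'estimator_b', 'best score': 0.95},
    'Model C': {'best estimator': 'estimator_c', 'best score': 0.93},
}

# Pick the model name whose stored score is highest
top_name = max(demo_tuned, key=lambda n: demo_tuned[n]['best score'])

# Copy the winner into a dictionary with the same keys as best_model
demo_best = {
    'name': top_name,
    'estimator': demo_tuned[top_name]['best estimator'],
    'score': demo_tuned[top_name]['best score'],
}
print(demo_best)
```

Tracking a running maximum inside the tuning loop is an equally valid approach.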

---

When calling `RandomizedSearchCV`, make sure that you specifically pass the following values for the arguments listed below (in addition to any other values you are passing):
* `scoring = 'f1'`
* `n_iter = NUMBER_OF_PARAMETER_COMBINATIONS_PER_MODEL`
* `cv = NUMBER_OF_FOLDS_PER_MODEL`
* `random_state = RANDOM_SEED`

**IMPORTANT:**

Do not change the values of the constants `NUMBER_OF_PARAMETER_COMBINATIONS_PER_MODEL` and `NUMBER_OF_FOLDS_PER_MODEL`. I have intentionally set them to low values so that we do not put high compute demands on the server.

In [None]:
# Perform randomized parameter search with k-fold cross-validation
best_tuned_models = {'Logistic Regression': {'best estimator': None, 'best score': None},
                     'Decision Tree': {'best estimator': None, 'best score': None},
                     'Random Forest':{'best estimator': None, 'best score': None},
                     'Support Vector Classifier': {'best estimator': None, 'best score': None}}
best_model = {'name': None, 'estimator': None, 'score': None}
best_score = 0

NUMBER_OF_PARAMETER_COMBINATIONS_PER_MODEL = 3
NUMBER_OF_FOLDS_PER_MODEL = 3

### YOUR SOLUTION HERE
for name in best_tuned_models:
    random_search = RandomizedSearchCV(models[name], param_dist[name],
                                       n_iter=NUMBER_OF_PARAMETER_COMBINATIONS_PER_MODEL,
                                       cv=NUMBER_OF_FOLDS_PER_MODEL,
                                       scoring='f1', n_jobs=-1, random_state=RANDOM_SEED)
    print(f'Tuning {name}...')

    # Use X_train_scaled for Logistic Regression and SVC
    if name in ['Logistic Regression', 'Support Vector Classifier']:
        random_search.fit(X_train_scaled, y_train)
    else:
        random_search.fit(X_train, y_train)

    # Store the best model hyperparameter settings and F1-score in the dictionary for each model
    best_tuned_models[name]['best estimator'] = random_search.best_estimator_
    best_tuned_models[name]['best score'] = random_search.best_score_

    # Output best settings for each model
    print(f'Best parameters for {random_search.best_estimator_}')
    print(f'F1-score: {random_search.best_score_}\n')

    # Save best model overall in best_model dictionary
    if random_search.best_score_ > best_score:
        best_score = random_search.best_score_
        best_model['name'] = name
        best_model['estimator'] = random_search.best_estimator_
        best_model['score'] = random_search.best_score_

# Print the best model name and score
print(f'\nBest Model: {best_model["estimator"]}')
print(f'F1-score: {best_model["score"]}')



Tuning Logistic Regression...
Best parameters for LogisticRegression(C=0.1, random_state=23)
F1-score: 0.9760271995799955

Tuning Decision Tree...
Best parameters for DecisionTreeClassifier(max_depth=30, min_samples_split=5, random_state=23)
F1-score: 0.9539170506912442

Tuning Random Forest...
Best parameters for RandomForestClassifier(max_depth=20, min_samples_split=5, random_state=23)
F1-score: 0.9717203011774519

Tuning Support Vector Classifier...
Best parameters for SVC(C=1, random_state=23)
F1-score: 0.9720540373461336


Best Model: LogisticRegression(C=0.1, random_state=23)
F1-score: 0.9760271995799955


## Step 3d - Test your code by running the cell below

This test code will verify that you correctly performed the randomized hyperparameter tuning with k-fold cross validation (all done by `RandomizedSearchCV`), and that you saved everything in the correct place.

In [None]:
# Tests 10 points: Hyperparameter Tuning and Best Model Selection
assert str(best_tuned_models['Logistic Regression']['best estimator']) == "LogisticRegression(C=0.1, random_state=23)", "The best model for Logistic Regression is incorrect."
assert str(best_tuned_models['Decision Tree']['best estimator']) == "DecisionTreeClassifier(max_depth=30, min_samples_split=5, random_state=23)", "The best model for Decision Tree is incorrect."

assert best_tuned_models['Logistic Regression']['best score'] == 0.9760271995799955, "The best F1-score for Logistic Regression is incorrect."
assert best_tuned_models['Decision Tree']['best score'] == 0.9539170506912442, "The best F1-score for Decision Tree is incorrect."

assert best_model['name'] == 'Logistic Regression', "The best model name is incorrect."
assert best_model['score'] == 0.9760271995799955, "The best model score is incorrect."
assert str(best_model['estimator']) == 'LogisticRegression(C=0.1, random_state=23)', "The best model is incorrect."
print("Visible tests passed.")




Visible tests passed.


# Step 4

## Step 4a - Your code - Testing the best model

Now you will need to:


1.   Train your best overall model on all of the scaled training data.
2.   Make predictions using the scaled test data and save in `y_pred`.
3.   Compute the F1-score on the test data, save it in `f1_score_on_test_data`, and display the performance metrics.



In [None]:
f1_score_on_test_data=None
y_pred = None

### YOUR SOLUTION HERE

# Fit the best model on the entire scaled training data
best_model['estimator'].fit(X_train_scaled, y_train)

# Evaluate the best model on the scaled test data
y_pred = best_model['estimator'].predict(X_test_scaled)
f1_score_on_test_data = f1_score(y_test, y_pred, average='binary')

# Output the performance metrics for the best model
print('\nBest Model:', best_model['estimator'])
print(f'\nTest F1-score: {f1_score_on_test_data:.4f}\n')
print(classification_report(y_test, y_pred))



Best Model: LogisticRegression(C=0.1, random_state=23)

Test F1-score: 0.9867

              precision    recall  f1-score   support

           0       0.97      0.97      0.97        39
           1       0.99      0.99      0.99        75

    accuracy                           0.98       114
   macro avg       0.98      0.98      0.98       114
weighted avg       0.98      0.98      0.98       114



## Step 4b - Test your code by running the cell below

In [None]:
# Tests 10 points: Testing the best model

assert best_model['name'] == 'Logistic Regression', "The best model name is incorrect."
assert best_model['score'] == 0.9760271995799955, "The best model score is incorrect."
assert str(best_model['estimator']) == 'LogisticRegression(C=0.1, random_state=23)', "The best model is incorrect."
assert f1_score_on_test_data == 0.9866666666666667, "The F1-score on the test data is incorrect."
print("Visible tests passed.")



Visible tests passed.
