<a href="https://colab.research.google.com/github/zmy2338/Machine-Learning-AWS/blob/main/TRAIN_AWS_P1_Lab_9_%5BSTUDENT%5D.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Lab #9: Logistic Regression, One Hot Encoding, and Data Preprocessing** 
---

### **Description**: 
In this lab, you will practice implementing a logistic regression model using standardization and label encoding. You will also learn how to implement one-hot encoding and dummy variable encoding, and how to evaluate a model's ability to handle unseen data using K-Folds cross validation.


### **Lab Structure**
**Part 1**: [Logistic Regression](#p1)

**Part 2**: [One-hot Encoding](#p2)

**Part 3**: [Dummy Variable Encoding](#p3)

**Part 4**: [K-Folds](#p4)


<br>


### **Goals:** 
By the end of this lab, you will be able to:
* Standardize data using sklearn's `StandardScaler(...)`. 
* Encode categorical data using label encoding, one-hot encoding, and dummy variable encoding.
* Evaluate a model's ability to handle unseen data using K-Folds Cross Validation.

<br>

### **Cheat Sheets**
[Logistic Regression with sklearn](https://docs.google.com/document/d/1rLTuWGgx9E-K1pgWYxUF4B1ExKKxt6MVSkgEKoUbhuE/edit?usp=sharing)

[Standardization, Encoding, and K-Folds with sklearn](https://docs.google.com/document/d/1wu_J33O9PooGahfrnyyN2-Mwza869Ab8GnzDypqjTaw/edit?usp=sharing)


<br>

**Before starting, run the code below to import all necessary functions and libraries.**


In [None]:
#!pip install pandas
#!pip install scikit-learn

import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import sklearn
from sklearn import datasets
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, KFold
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder

from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn import metrics

from sklearn.metrics import accuracy_score, r2_score
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

<a name="p1"></a>

---
## **Part 1: Logistic Regression Independent Practice**
---
We will build a logistic regression model to classify pokemon types based on their attributes.

**Dataset Description:**
This dataset includes 898 Pokemon (1,072 counting alternate forms). For each one it records the number, name, first and second type, stat total, basic stats (HP, Attack, Defense, Special Attack, Special Defense, and Speed), generation, and legendary status. The attributes of each Pokemon are as follows:

* `Name`: The name of each pokemon

* `Type 1`: Each pokemon has a type, which determines its weakness/resistance to attacks

* `Type 2`: Some pokemon are dual type and have a second type

* `Total`: Sum of all stats that come after this, a general guide to how strong a pokemon is

* `HP`: Hit points, or health, defines how much damage a pokemon can withstand before fainting

* `Attack`: The base modifier for normal attacks (e.g. Scratch, Punch)

* `Defense`: The base damage resistance against normal attacks

* `SP Atk`: Special attack, the base modifier for special attacks (e.g. fire blast, bubble beam)

* `SP Def`: Special defense, the base damage resistance against special attacks

* `Speed`: Determines which pokemon attacks first each round

* `Generation`: The generation of games where the pokemon was first introduced

* `Legendary`: Some pokemon are much rarer than others, and are dubbed "legendary"

<br>


**Source:** [data.world](https://data.world/data-society/pokemon-with-stats)


<br>

**Run the cell below to load in the dataset.**

In [None]:
url ="https://query.data.world/s/p4tnasnlximnov7fpjlu2msnmegyrb"
pokemon_df = pd.read_csv(url,  sep = ",").drop(columns = 'number', axis = 1)
pokemon_df.head()

### **Problem #1: Encode all categorical variables.**
---

Logistic regression can only handle numerical variables because it combines them in an equation. So, complete the code below to encode `type1`, `type2`, and `legendary` using label encoding as we learned on Day 3.

<br>

**NOTE**: It's important to save the DataFrame with encoded variables in `new_pokemon_df`, since we will try a different encoding of the original dataset in Part 2.
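
As a rough illustration of label encoding, here is a minimal sketch on a made-up `color` column. The names `toy_df`, `color_list`, `color_map`, and `encoded_df` are hypothetical and not part of the Pokemon dataset; this is one possible way to apply a mapping, not the only one.

```python
import pandas as pd

# Hypothetical toy data, not the Pokemon dataset
toy_df = pd.DataFrame({'color': ['red', 'blue', 'red', 'green']})

# Assign each unique category an integer, in order of first appearance
color_list = toy_df['color'].unique().tolist()
color_map = {color_list[i]: i for i in range(len(color_list))}

# Work on a copy so the original DataFrame is left unchanged
encoded_df = toy_df.copy()
encoded_df['color_encoded'] = toy_df['color'].map(color_map)

print(color_map)
print(encoded_df)
```

Keeping the list of categories around (here `color_list`) makes it possible to turn an integer code back into its name later.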

In [None]:
type1_list = pokemon_df[# COMPLETE THIS LINE].unique().tolist()
type1_map = {type1_list[i] : i for i in range(len(type1_list))}

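# Start from a copy so the original DataFrame stays unchanged for Part 2
new_pokemon_df = pokemon_df.copy()
new_pokemon_df['type1_encoded'] = pokemon_df['# COMPLETE THIS LINE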

In [None]:
type2_list = # COMPLETE THIS LINE
type2_map = {type2_list[i] : i for i in range(len(type2_list))}

new_pokemon_df['type2_encoded'] = new_pokemon_df['# COMPLETE THIS LINE

In [None]:
# COMPLETE THIS CODE

new_pokemon_df.head()

### **Problem #2: Decide the independent and dependent variables.**
---

Complete the code below to decide the features and label for this problem. **NOTE**: These should only include numerical variables.

In [None]:
features = 
label = 

### **Problem #3: Split the data into train and test sets.**
---

This time, we will skip adding a validation set. Make sure the test dataset is 20% of the original dataset.
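
To see what a 20% test split looks like, here is a small sketch on made-up arrays. The names `X_toy` and `y_toy` and the `random_state` value are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples with 2 features each
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# test_size=0.2 holds out 20% of the rows; random_state makes the split repeatable
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

print(X_tr.shape, X_te.shape)
```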

In [None]:
X_train, X_test, y_train, y_test = train_test_split(# COMPLETE THIS LINE

### **Problem #4: Scale your data**
---
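
As a reminder of how standardization behaves, here is a minimal sketch on made-up numbers. The arrays and the name `toy_scaler` are hypothetical. The scaler learns the mean and standard deviation from the training data only, and the same values are then reused for the test data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with two features on very different scales
X_tr_toy = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_te_toy = np.array([[2.0, 400.0]])

toy_scaler = StandardScaler()

# fit_transform learns the mean and standard deviation from the training data
X_tr_scaled = toy_scaler.fit_transform(X_tr_toy)

# transform reuses those training statistics on the test data
X_te_scaled = toy_scaler.transform(X_te_toy)

print(X_tr_scaled)
print(X_te_scaled)
```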

In [None]:
scaler = # COMPLETE THIS LINE

X_train_scaled = scaler.# COMPLETE THIS LINE
X_test_scaled = scaler.# COMPLETE THIS LINE

### **Problem #5: Initialize a Multi-Class Logistic Regression model and train it.**
---

Use one-versus-rest multi-class classification and the standardized training data.

In [None]:
clf = # COMPLETE THIS LINE
clf.# COMPLETE THIS LINE

### **Problem #6: Make predictions for the standardized test data.**
---

**NOTE**: Since this is not a binary classification model, we cannot use the code we used for `y_pred_binary` the day before. Instead, we would need to look at the highest probability class for each datapoint.
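
To make "the highest probability class" concrete, here is a sketch on the Iris dataset rather than the Pokemon data. The names `toy_clf`, `proba`, and `highest` are hypothetical, and the model uses sklearn's default settings apart from an assumed `max_iter=1000`.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Illustration on Iris (3 classes), not the Pokemon dataset
X_iris, y_iris = load_iris(return_X_y=True)

toy_clf = LogisticRegression(max_iter=1000)
toy_clf.fit(X_iris, y_iris)

# One row per datapoint, one column per class; each row sums to 1
proba = toy_clf.predict_proba(X_iris[:5])

# Pick the class whose column holds the largest probability in each row
highest = toy_clf.classes_[np.argmax(proba, axis=1)]

print(proba.round(3))
print(highest)
print(toy_clf.predict(X_iris[:5]))
```

For these points, the class with the largest probability matches what `.predict(...)` returns.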

In [None]:
y_pred = # COMPLETE THIS LINE
print(y_pred)

y_pred_proba = # COMPLETE THIS LINE
print(y_pred_proba)

### **Problem #7: Print the accuracy score.**
---

Keep in mind that there are 20 unique labels, meaning randomly guessing would achieve an accuracy score of $\frac{1}{20} = 0.05$.

In [None]:
accuracy = # WRITE YOUR CODE HERE
print(f'Accuracy: {accuracy}')

### **Problem #8: Plot the confusion matrix.**
---

Use `display_labels = type1_list` for `ConfusionMatrixDisplay` to see more meaningful labels.
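
For reference, here is how a confusion matrix is laid out, sketched on made-up labels. The lists `y_true_toy`, `y_pred_toy`, and `toy_labels` are hypothetical. Rows correspond to the true class and columns to the predicted class, in the order given by `labels`.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels
y_true_toy = ['cat', 'dog', 'cat', 'bird', 'dog', 'cat']
y_pred_toy = ['cat', 'dog', 'dog', 'bird', 'dog', 'cat']
toy_labels = ['bird', 'cat', 'dog']

# Rows are true classes, columns are predicted classes
toy_cm = confusion_matrix(y_true_toy, y_pred_toy, labels=toy_labels)
toy_accuracy = accuracy_score(y_true_toy, y_pred_toy)

print(toy_cm)
print(toy_accuracy)
```

The one off-diagonal entry is the `cat` that was predicted as `dog`; every correct prediction lands on the diagonal.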

In [None]:
cm = # WRITE YOUR CODE HERE
disp = # WRITE YOUR CODE HERE
disp.plot()
plt.xticks(rotation = 90)
plt.show()

### **Problem #9: Use your model**

Given the following values, classify these new Pokemon.
1.  `total = 300`,	`hp = 50`, `attack = 40`,	`defense = 60`,	`sp_attack = 60`,	`sp_defense = 67`,	`speed = 40`,	`generation = 6`, `type2_encoded = 3`, `legendary_encoded = 1`

2.  `total = 250`,	`hp = 40`, `attack = 60`,	`defense = 40`,	`sp_attack = 40`,	`sp_defense = 30`,	`speed = 70`,	`generation = 8`, `type2_encoded = 14`, `legendary_encoded = 0`

3. `total = 500`,	`hp = 70`, `attack = 50`,	`defense = 75`,	`sp_attack = 80`,	`sp_defense = 80`,	`speed = 100`,	`generation = 6`, `type2_encoded = 8`, `legendary_encoded = 1`

<br>

#### **Remember to standardize the data with the scaler you fit in Problem #4, using `.transform(...)`.**


In [None]:
new_pokemon = pd.DataFrame([[# COMPLETE THIS LINE , columns = features.columns)

new_scaled = scaler.transform(# COMPLETE THIS LINE 
predictions = # COMPLETE THIS LINE 

print(predictions)
print([type1_list[predictions[i]] for i in range(3)])

---
## **Back To Lecture**
---

<a name="p2"></a>

---
## **Part 2: One-hot Encoding**
---

In this section, you will perform the same modeling process, **except using one-hot encoding instead of label encoding**.

### **Problem #1: Encode all categorical features.**
---

Complete the code below to encode `type2` and `legendary`. However, **one-hot encode** these variables instead. We will leave the label as is to make predictions easier.

<br>

**NOTE**: There are a few extra lines we need to add to allow `OneHotEncoder` and `pandas` to work, so we have provided the full `type2` encoding for you.

In [None]:
# Create the new dataframe
new_pokemon_df = pokemon_df.drop(columns = ['type2', 'legendary'], axis = 1)

# Create the encoder and transform the desired columns
type2_ohe = OneHotEncoder(sparse_output = False)
type2_ohe.set_output(transform = 'pandas')

transformed = type2_ohe.fit_transform(pokemon_df[['type2']])

# Create the new dataframe
new_pokemon_df[transformed.columns] = transformed


new_pokemon_df.head()

In [None]:
# Create the encoder and transform the desired columns
legendary_ohe = # COMPLETE THIS LINE
legendary_ohe.set_output(# COMPLETE THIS LINE

transformed = legendary_ohe.# COMPLETE THIS LINE

# Create the new dataframe
new_pokemon_df[# COMPLETE THIS LINE


new_pokemon_df.head()

### **Problem #2: Prepare the data for modeling.**
---

Specifically,
* Decide the independent and dependent variables (only including numerical variables except `type1`, which is the label).
* Split the data into train and test sets such that 20% is left for testing.
* Scale your data.


In [None]:
# Decide the independent and dependent variables
features = # COMPLETE THIS LINE
label = # COMPLETE THIS LINE

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(# COMPLETE THIS LINE

# Scale your data
scaler = # COMPLETE THIS LINE

X_train_scaled = scaler.# COMPLETE THIS LINE
X_test_scaled = scaler.# COMPLETE THIS LINE

### **Problem #3: Initialize and train your model.**
---

In [None]:
clf = # COMPLETE THIS LINE
clf. # COMPLETE THIS LINE

### **Problem #4: Make predictions for the standardized test data.**
---

In [None]:
y_pred = # COMPLETE THIS LINE
print(y_pred)

y_pred_proba = # COMPLETE THIS LINE
print(y_pred_proba)

### **Problem #5: Evaluate your model.**
---

Specifically,
* Print the accuracy score.
* Plot the confusion matrix.

<br>

**NOTE**: Since we are using `type1` as the label directly here, we just use `display_labels=clf.classes_`. This is one good reason for using non-encoded labels.

In [None]:
# Print the accuracy score
accuracy = # COMPLETE THIS LINE
print(f'Accuracy: {accuracy}')

# Plot the confusion matrix
cm = confusion_matrix(# COMPLETE THIS LINE

disp = ConfusionMatrixDisplay(# COMPLETE THIS LINE
disp.plot()

plt.xticks(rotation = 90)
plt.show()

### **Problem #6: Use your model.**
---

Specifically, make predictions for the same data points as in Part 1, Problem #9:

1.  `total = 300`,	`hp = 50`, `attack = 40`,	`defense = 60`,	`sp_attack = 60`,	`sp_defense = 67`,	`speed = 40`,	`generation = 6`, `type2 = 'Dragon'`, `legendary = True`

2.  `total = 250`,	`hp = 40`, `attack = 60`,	`defense = 40`,	`sp_attack = 40`,	`sp_defense = 30`,	`speed = 70`,	`generation = 8`, `type2 = 'Dark'`, `legendary = False`

3. `total = 500`,	`hp = 70`, `attack = 50`,	`defense = 75`,	`sp_attack = 80`,	`sp_defense = 80`,	`speed = 100`,	`generation = 6`, `type2 = 'Ground'`, `legendary = True`

<br>

This is a little more complicated with one-hot encoding, so we will break it down into the following parts:
1. Create an unencoded DataFrame with these points.
2. Transform the `type2` and `legendary` columns.
3. Create an encoded DataFrame with the transformed columns and those that did not need to be transformed.
4. Make our predictions as usual.

#### **1. Create an unencoded DataFrame with these points.**

In [None]:
column_labels = ['total', 'hp', 'attack', 'defense', 'sp_attack', 'sp_defense', 'speed', 'generation', 'type2', 'legendary']

tmp_df = pd.DataFrame([[# COMPLETE THIS LINE

#### **2. Transform the `type2` and `legendary` columns.**

In [None]:
type2_transformed = type2_ohe.transform(tmp_df[[# COMPLETE THIS LINE
legendary_transformed = # COMPLETE THIS LINE

#### **3. Create an encoded DataFrame with the transformed columns and those that did not need to be transformed.**

In [None]:
new_points_df = tmp_df.drop(columns = ['type2', # COMPLETE THIS LINE
new_points_df[type2_transformed.columns] = # COMPLETE THIS LINE
new_points_df[# COMPLETE THIS LINE

#### **4. Make our predictions as usual.**

In [None]:
new_scaled = # COMPLETE THIS LINE
predictions = # COMPLETE THIS LINE

# The label was left as the original type1 names, so the predictions are already readable
print(predictions)

---
## **Back To Lecture**
---

<a name="p3"></a>

---
## **Part 3: Dummy Variable Encoding**
---

In this section, you will perform the same modeling process, **except using dummy variable encoding instead of label or one-hot encoding**.

### **Problem #1: Encode all categorical features.**
---

Complete the code below to encode `type2` and `legendary`. However, **dummy variable encode** these variables instead. We will leave the label as is to make predictions easier.

In [None]:
# Create the new dataframe
new_pokemon_df = pokemon_df.drop(columns = ['type2', 'legendary'], axis = 1)

# Create the encoder and transform the desired columns
# drop = 'first' removes one category per feature, which makes this dummy variable encoding
type2_ohe = OneHotEncoder(drop = 'first', sparse_output = False)
type2_ohe.set_output(transform = 'pandas')

transformed = type2_ohe.fit_transform(pokemon_df[['type2']])

# Create the new dataframe
new_pokemon_df[transformed.columns] = transformed


new_pokemon_df.head()

In [None]:
# Create the encoder and transform the desired columns
legendary_ohe = # COMPLETE THIS LINE
legendary_ohe.set_output(# COMPLETE THIS LINE

transformed = legendary_ohe.# COMPLETE THIS LINE

# Create the new dataframe
new_pokemon_df[# COMPLETE THIS LINE


new_pokemon_df.head()

### **Problem #2: Prepare the data for modeling.**
---

Specifically,
* Decide the independent and dependent variables (only including numerical variables except `type1`, which is the label).
* Split the data into train and test sets such that 20% is left for testing.
* Scale your data.


In [None]:
# Decide the independent and dependent variables
features = # COMPLETE THIS LINE
label = # COMPLETE THIS LINE

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(# COMPLETE THIS LINE

# Scale your data
scaler = # COMPLETE THIS LINE

X_train_scaled = scaler.# COMPLETE THIS LINE
X_test_scaled = scaler.# COMPLETE THIS LINE

### **Problem #3: Initialize and train your model.**
---

In [None]:
clf = # COMPLETE THIS LINE
clf. # COMPLETE THIS LINE

### **Problem #4: Make predictions for the standardized test data.**
---

In [None]:
y_pred = # COMPLETE THIS LINE
print(y_pred)

y_pred_proba = # COMPLETE THIS LINE
print(y_pred_proba)

### **Problem #5: Evaluate your model.**
---

Specifically,
* Print the accuracy score.
* Plot the confusion matrix.

<br>

**NOTE**: Since we are using `type1` as the label directly here, we just use `display_labels=clf.classes_`. This is one good reason for using non-encoded labels.

In [None]:
# Print the accuracy score
accuracy = # COMPLETE THIS LINE
print(f'Accuracy: {accuracy}')

# Plot the confusion matrix
cm = confusion_matrix(# COMPLETE THIS LINE

disp = ConfusionMatrixDisplay(# COMPLETE THIS LINE
disp.plot()

plt.xticks(rotation = 90)
plt.show()

### **Problem #6: Use your model.**
---

Specifically, make predictions for the same data points as in Part 1, Problem #9:

1.  `total = 300`,	`hp = 50`, `attack = 40`,	`defense = 60`,	`sp_attack = 60`,	`sp_defense = 67`,	`speed = 40`,	`generation = 6`, `type2 = 'Dragon'`, `legendary = True`

2.  `total = 250`,	`hp = 40`, `attack = 60`,	`defense = 40`,	`sp_attack = 40`,	`sp_defense = 30`,	`speed = 70`,	`generation = 8`, `type2 = 'Dark'`, `legendary = False`

3. `total = 500`,	`hp = 70`, `attack = 50`,	`defense = 75`,	`sp_attack = 80`,	`sp_defense = 80`,	`speed = 100`,	`generation = 6`, `type2 = 'Ground'`, `legendary = True`

<br>

This is a little more complicated with dummy variable encoding, so we will break it down into the following parts:
1. Create an unencoded DataFrame with these points.
2. Transform the `type2` and `legendary` columns.
3. Create an encoded DataFrame with the transformed columns and those that did not need to be transformed.
4. Make our predictions as usual.

#### **1. Create an unencoded DataFrame with these points.**

In [None]:
column_labels = ['total', 'hp', 'attack', 'defense', 'sp_attack', 'sp_defense', 'speed', 'generation', 'type2', 'legendary']

tmp_df = pd.DataFrame([[# COMPLETE THIS LINE

#### **2. Transform the `type2` and `legendary` columns.**

In [None]:
type2_transformed = type2_ohe.transform(tmp_df[[# COMPLETE THIS LINE
legendary_transformed = # COMPLETE THIS LINE

#### **3. Create an encoded DataFrame with the transformed columns and those that did not need to be transformed.**

In [None]:
new_points_df = tmp_df.drop(columns = ['type2', # COMPLETE THIS LINE
new_points_df[type2_transformed.columns] = # COMPLETE THIS LINE
new_points_df[# COMPLETE THIS LINE

#### **4. Make our predictions as usual.**

In [None]:
new_scaled = # COMPLETE THIS LINE
predictions = # COMPLETE THIS LINE

# The label was left as the original type1 names, so the predictions are already readable
print(predictions)

---
## **Back To Lecture**
---

<a name="p4"></a>

---
## **Part 4: K-Folds**
---

### **Problem #1: Train a 5NN Model on the Iris dataset.**

To start, let's train a KNN model with K = 5 on the Iris dataset as we normally would, walking through Steps #1 - 7 of the process for implementing an ML model:
1. Load in the data
2. Decide which features will be the predictors (the X values), and which feature you want to predict (the y values)
3. Split the data into training and testing datasets
4. Import an ML algorithm
5. Set the model’s parameters
6. Fit the model on the training set and test the model on the test dataset.
7. Evaluate the model’s performance
8. Apply your model

#### **Steps #1 - 2**

**Run the code below to load the data and separate into dependent and independent variables.**

In [None]:
# Step #1
iris = load_iris()

# Step #2
X=iris.data
y=iris.target

#### **Step #3**

Make sure to set aside 20% of the data for testing.

In [None]:
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(# COMPLETE THIS LINE

#### **Steps #4 - 5**

In [None]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(# COMPLETE THIS LINE

#### **Steps #6 - 7**

In [None]:
# Step #6
knn.# COMPLETE THIS LINE
pred = # COMPLETE THIS LINE

# Step #7
accuracy = # COMPLETE THIS LINE
print(accuracy)

### **Problem #2: Train a 5NN Model on the Iris dataset using 5-Folds Cross Validation.**

Now, model the data in the same way, **except using K-Folds during training**.
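
To see what K-Folds does to the data before using it for training, here is a small sketch on 10 made-up samples. The names `X_toy`, `toy_kf`, and `folds` are hypothetical. Each fold takes a turn as the held-out portion while the remaining samples are used for fitting.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical toy data: 10 samples, 1 feature
X_toy = np.arange(10).reshape(10, 1)

toy_kf = KFold(n_splits=5)

# Collect the held-out indices of every fold
folds = []
for fold_number, (train_index, test_index) in enumerate(toy_kf.split(X_toy), start=1):
    folds.append(test_index.tolist())
    print(f'Fold #{fold_number}: train on {train_index.tolist()}, evaluate on {test_index.tolist()}')
```

Without shuffling, the folds are consecutive blocks, and every sample is held out exactly once.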

#### **Steps #1 - 2**

**Run the code below to load the data and separate into dependent and independent variables.**

In [None]:
# Step #1
iris = load_iris()

# Step #2
X=iris.data
y=iris.target

#### **Step #3**

Make sure to set aside 20% of the data for testing.

In [None]:
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(# COMPLETE THIS LINE

#### **Steps #4 - 5**

In [None]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(# COMPLETE THIS LINE

#### **Steps #6 - 7**

**Run the code below to perform 5-Fold Cross Validation and then train the final model.**

In [None]:
kf = KFold(n_splits=5)

# store each score in the list "evaluations"
evaluations = []
index = 1

# loop through folds and store the accuracy score
for train_index, test_index in kf.split(X_train):
    X_tr, X_te = X_train[train_index], X_train[test_index]
    y_tr, y_te = y_train[train_index], y_train[test_index]

    # fit model and make a prediction
    knn.fit(X_tr, y_tr)
    pred = knn.predict(X_te)

    # get accuracy score
    score = accuracy_score(y_te, pred)
    print(f'Fold #{index} has an accuracy score of {score}.')

    evaluations.append(score)
    index += 1


# Finish training
knn.fit(X_train, y_train)
print(f'\n{len(evaluations)}-Folds CV demonstrated an average accuracy score of {sum(evaluations)/len(evaluations)}.')

pred = knn.predict(X_test)
print(f'Evaluating on the test set demonstrated an accuracy score of {accuracy_score(y_test, pred)}.')

### **Problem #3: Train an 11NN Model on the Iris dataset using 5-Folds Cross Validation.**

#### **Steps #1 - 2**

**Run the code below to load the data and separate into dependent and independent variables.**

In [None]:
# Step #1
iris = load_iris()

# Step #2
X=iris.data
y=iris.target

#### **Step #3**

Make sure to set aside 20% of the data for testing.

In [None]:
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(# COMPLETE THIS LINE

#### **Steps #4 - 5**

In [None]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(# COMPLETE THIS LINE

#### **Steps #6 - 7**

**Run the code below to perform 5-Fold Cross Validation and then train the final model.**

In [None]:
kf = KFold(n_splits=5)

# store each score in the list "evaluations"
evaluations = []
index = 1

# loop through folds and store the accuracy score
for train_index, test_index in kf.split(X_train):
    X_tr, X_te = X_train[train_index], X_train[test_index]
    y_tr, y_te = y_train[train_index], y_train[test_index]

    # fit model and make a prediction
    knn.fit(X_tr, y_tr)
    pred = knn.predict(X_te)

    # get accuracy score
    score = accuracy_score(y_te, pred)
    print(f'Fold #{index} has an accuracy score of {score}.')

    evaluations.append(score)
    index += 1


# Finish training
knn.fit(X_train, y_train)
print(f'\n{len(evaluations)}-Folds CV demonstrated an average accuracy score of {sum(evaluations)/len(evaluations)}.')

pred = knn.predict(X_test)
print(f'Evaluating on the test set demonstrated an accuracy score of {accuracy_score(y_test, pred)}.')

### **Problem #4: Train a Linear Regression Model on the Diabetes dataset using 5-Folds Cross Validation.**

This dataset contains measurements from diabetic patients, such as BMI, age, blood pressure, and glucose levels, along with a measure of disease progression. We will use age, BMI, and blood pressure to predict disease progression.

<br>

**NOTE:** Use $R^2$ when evaluating each fold.
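
As a reminder of what $R^2$ measures, here is a sketch that computes it by hand on made-up values and compares the result with sklearn's `r2_score`. The arrays `y_true_toy` and `y_pred_toy` are hypothetical.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical true values and predictions
y_true_toy = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_toy = np.array([2.5, 5.5, 7.0, 9.0])

# Sum of squared errors made by the model
ss_res = np.sum((y_true_toy - y_pred_toy) ** 2)

# Sum of squared errors made by always predicting the mean
ss_tot = np.sum((y_true_toy - y_true_toy.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot

print(r2_manual)
print(r2_score(y_true_toy, y_pred_toy))
```

A value close to 1 means the model's errors are small compared with simply predicting the mean.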

#### **Steps #1 - 2**

**Run the code below to load the data and separate into dependent and independent variables.**

In [None]:
# Step #1
diabetes = datasets.load_diabetes()
diabetes_df = pd.DataFrame(data=diabetes.data, columns=diabetes.feature_names)
diabetes_df['TARGET'] = diabetes.target

# Step #2
X = diabetes_df[['age', 'bmi', 'bp']].values
y = diabetes_df['TARGET'].values

#### **Step #3**

Make sure to set aside 20% of the data for testing.

In [None]:
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(# COMPLETE THIS LINE

#### **Steps #4 - 5**

In [None]:
from sklearn.linear_model import LinearRegression
reg = # COMPLETE THIS LINE

#### **Steps #6 - 7**

In [None]:
kf = KFold(# COMPLETE THIS LINE

# store each score in the list "evaluations"
evaluations = []
index = 1

# loop through folds and store the R^2 score
for train_index, test_index in kf.split(# COMPLETE THIS LINE
    X_tr, X_te = X_train[train_index], X_train[test_index]
    y_tr, y_te = y_train[train_index], y_train[test_index]

    # fit model and make a prediction
    reg.fit(# COMPLETE THIS LINE
    pred = reg.predict(# COMPLETE THIS LINE

    # get r2 score
    score = # COMPLETE THIS LINE
    print(f'Fold #{index} has an R^2 of {score}.')

    evaluations.append(score)
    index += 1

# Finish training
reg.fit(# COMPLETE THIS LINE
print(f'\n{len(evaluations)}-Folds CV demonstrated an average R^2 of {sum(evaluations)/len(evaluations)}.')

pred = reg.predict(# COMPLETE THIS LINE
print(f'Evaluating on the test set demonstrated an R^2 of {r2_score(y_test, pred)}.')

### **[OPTIONAL] Problem #5: Train a 5NN Model on the Breast Cancer dataset using 5-Folds Cross Validation.**


The following dataset is taken from the [UCI ML Breast Cancer Wisconsin (Diagnostic) dataset](https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(diagnostic)). The dataset contains features computed from digitized images of fine needle aspirates (FNA) of breast masses, along with whether each mass was diagnosed as malignant or benign.

<br>

**NOTE:** Use the accuracy score when evaluating each fold.

#### **Steps #1 - 2**

**Run the code below to load the data and separate into dependent and independent variables.**

In [None]:
# Step #1
cancer_dataset = datasets.load_breast_cancer()
cancer_df = pd.DataFrame(data=cancer_dataset.data, columns=cancer_dataset.feature_names)
cancer_df['TARGET'] = cancer_dataset.target

# Step #2
X = cancer_df[["mean radius","mean texture"]].values
# Select a single column so y is a 1D array, as sklearn classifiers expect
y = cancer_df["TARGET"].values

#### **Step #3**

Make sure to set aside 20% of the data for testing.

In [None]:
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(# COMPLETE THIS LINE

#### **Steps #4 - 5**

In [None]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(# COMPLETE THIS LINE

#### **Steps #6 - 7**

In [None]:
# COMPLETE THIS CODE

# End of lab
---
© 2023 The Coding School, All rights reserved