## Introduction to Breast Cancer Biopsy Classification

In this project, imagine that your colleague, an oncologist (cancer doctor), is working in a major hospital that specializes in treating breast cancers. Breast cancer tumors are very complicated at the cellular level, and this makes determining whether a patient's tumor is malignant (dangerous) or benign (not dangerous) a challenge. Your task will be to build a classifier that can determine whether a sample is malignant or benign to help your colleague!

Every patient that arrives at the hospital undergoes a biopsy of their tumor. This means that a small sample of the tumor is taken from the patient and various metrics are recorded about it, including: radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension.

Using a large dataset of labeled biopsy samples from breast cancer tumors, you will build your binary classification model to determine whether a tumor is malignant or benign based on these features. Then, this model can help you to better determine diagnoses for new patients who arrive at the hospital.


## Today...
Together, we will explore the steps you could take to help your colleague solve this problem!

**Background and data exploration**

- Exploring the data
- Visualizing the data

**Predicting Diagnosis: Working up to Logistic Regression**

- Approach 1: Linear Regression classifier

- Approach 2: Simple boundary classifier

- Approach 3: Modifying with logistic regression

- Approach 4: Multiple feature logistic regression

**Bonus Discussion: What makes a separation good?**

**Optional: Decision trees walkthrough**

**Advanced (Optional): Choosing a Classifier**



# Background and data exploration

## Diagnosing cancer with biopsies


**Before** we dive into building a classifier for breast cancer tumors, we will first discuss how the data are generated and what the various features mean.

<center>
<img src="https://storage.googleapis.com/inspirit-ai-data-bucket-1/Data/AI%20Scholars/Sessions%201%20-%205/Session%202b%20-%20Logistic%20Regression/BreastCells.jpg">
</center>


The above image is an example of cancerous (malignant) breast cells next to benign cells. These cells are part of a tumor biopsy where the extracted tissue is sampled with a special needle. The cells are subsequently stained with different dyes to help visualize their shapes, quantity of DNA, etc. These properties provide clues and insight into the rate of cell division (and remember that rapid cell division = cancerous).

### (Optional) Data Feature Descriptions

Our dataset reports 10 different features of the biopsies. Here's what a few of them mean:

1. <u><b><i>Perimeter</i></b></u>: Total distance between points defining the cell's nuclear perimeter.
2. <u><b><i>Radius</i></b></u>: Average distance from the center of the cell's nucleus to its perimeter.
3. <u><b><i>Texture</i></b></u>: The texture of the cell nucleus is measured by finding the variance of the gray scale intensities in the component pixels.
4. <u><b><i>Area</i></b></u>: Nuclear area is measured by counting the number of pixels on the interior of the nucleus and adding one-half of the pixels in the perimeter.

The following image illustrates what these cell nucleus features look like:

<center>
<img src="https://storage.googleapis.com/inspirit-ai-data-bucket-1/Data/AI%20Scholars/Sessions%201%20-%205/Session%202b%20-%20Logistic%20Regression/Perimeter.png">
</center>

5. <u><b><i>Smoothness</i></b></u>: Measures the smoothness of a nuclear contour by measuring the difference between the length of a radial line and the mean length of the lines surrounding it. The image below demonstrates this:

<center>
<img src="https://storage.googleapis.com/inspirit-ai-data-bucket-1/Data/AI%20Scholars/Sessions%201%20-%205/Session%202b%20-%20Logistic%20Regression/Smoothness.png">
</center>

6. <u><b><i>Concavity</i></b></u>: Measures the severity of concavities or indentations in a cell nucleus. Chords are drawn between non-adjacent snake points (points along the traced nuclear boundary), and the extent to which the actual boundary lies inside each chord is measured. The line in bold in the image below is an example of a chord.

<center>
<img src="https://storage.googleapis.com/inspirit-ai-data-bucket-1/Data/AI%20Scholars/Sessions%201%20-%205/Session%202b%20-%20Logistic%20Regression/Concavity.png">
</center>

7. <u><b><i>Symmetry</i></b></u>: The major axis (longest chord) through the center is found. Then, the difference in length between lines perpendicular to the major axis, measured out to the nuclear boundary on both sides, is calculated. The image below shows an example of this:

<center>
<img src="https://storage.googleapis.com/inspirit-ai-data-bucket-1/Data/AI%20Scholars/Sessions%201%20-%205/Session%202b%20-%20Logistic%20Regression/Symmetry.png">
</center>


The paper that first detailed these measurements for this dataset can be found here for more information: https://pdfs.semanticscholar.org/1c4a/4db612212a9d3806a848854d20da9ddd0504.pdf
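
To make the size features concrete, here is a minimal sketch that computes a radius, perimeter, and area from a handful of boundary points. The helper name `nucleus_size_features` and the four sample points are made up for illustration, and the area uses the shoelace polygon formula rather than the pixel counting described above, so treat it as an approximation of the idea and not the paper's exact procedure.

```python
import math

def nucleus_size_features(boundary_points):
    """Return (mean radius, perimeter, area) for a closed outline.

    `boundary_points` is a list of (x, y) tuples listed in order around
    the nucleus. This is a hypothetical helper, not part of the dataset's
    original pipeline.
    """
    n = len(boundary_points)

    # Center of the outline: average of the boundary points
    center_x = sum(x for x, _ in boundary_points) / n
    center_y = sum(y for _, y in boundary_points) / n

    # Radius: average distance from the center to each boundary point
    radius = sum(math.hypot(x - center_x, y - center_y)
                 for x, y in boundary_points) / n

    # Perimeter: total distance between consecutive boundary points
    perimeter = 0.0
    # Area: shoelace formula (an assumption; the paper counts pixels instead)
    twice_area = 0.0
    for i in range(n):
        x1, y1 = boundary_points[i]
        x2, y2 = boundary_points[(i + 1) % n]  # wrap around to close the outline
        perimeter += math.hypot(x2 - x1, y2 - y1)
        twice_area += x1 * y2 - x2 * y1

    return radius, perimeter, abs(twice_area) / 2

# A made-up diamond-shaped outline, just to exercise the function
toy_outline = [(1, 0), (0, 1), (-1, 0), (0, -1)]
radius, perimeter, area = nucleus_size_features(toy_outline)
print(radius, perimeter, area)
```

Notice that all three numbers grow together as the outline gets bigger, which is worth keeping in mind when comparing these features later.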


## Breast cancer diagnostic dataset

The dataset we will use to train our model is called the Breast Cancer Wisconsin (Diagnostic) Data Set. It consists of 569 biopsy samples, just like the ones described above, from breast cancer tumors.

Each biopsy sample in the dataset is labeled with an ID number and with whether the tumor it came from is malignant (1) or benign (0). Each sample also has 10 different features associated with it, some of which are described above. Remember that each feature value for a given biopsy sample is a real-valued number.

Think: what sorts of features would you expect to be different between a rapidly growing, malignant cancer cell and a healthy one? Why?

In [None]:
#@title Run this to download your data! { display-mode: "form" }
# Load the data!
import pandas as pd
from sklearn import metrics

!wget -q --show-progress "https://storage.googleapis.com/inspirit-ai-data-bucket-1/Data/AI%20Scholars/Sessions%201%20-%205/Session%202b%20-%20Logistic%20Regression/cancer.csv"

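data = pd.read_csv('cancer.csv')
# Encode the diagnosis as a number: malignant -> 1, benign -> 0
data['diagnosis'] = data['diagnosis'].replace({'M': 1, 'B': 0})
# index=False avoids writing an extra, unnamed index column to the file
data.to_csv('cancer.csv', index=False)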
del data

## Loading our annotated dataset

The first step in building our breast cancer tumor classification model is to load in the dataset we'll use to "teach" (or "train") our model.

In [None]:
# First, import helpful Python tools for loading/navigating data
import os             # Good for navigating your computer's files
import numpy as np    # Great for lists (arrays) of numbers
import pandas as pd   # Great for tables (google spreadsheets, microsoft excel, csv)
from sklearn.metrics import accuracy_score   # Great for measuring how often predictions are correct

In [None]:
# This is the name of our data file, which was downloaded in the set up cell.
# Check out the file explorer (folder on the left toolbar) to see where that lives!
data_path = 'cancer.csv'

# Use the 'pd.read_csv(filepath)' function to read in our data and store it
# in a variable called 'dataframe'
dataframe = pd.read_csv(data_path)

# Redefine `dataframe` to include only the columns discussed
# (.copy() lets us safely add new columns to the result afterwards)
dataframe = dataframe[['diagnosis', 'perimeter_mean', 'radius_mean', 'texture_mean', 'area_mean', 'smoothness_mean', 'concavity_mean', 'symmetry_mean']].copy()

# Define a new, more descriptive `diagnosis_cat` column
dataframe['diagnosis_cat'] = dataframe['diagnosis'].astype('category').map({1: '1 (malignant)', 0: '0 (benign)'})

# Exploring our data


## Looking at our dataset

A key step in machine learning (and coding in general!) is to view the structure and dimensions of our new dataframe, which stores all our training data from the tumor biopsies. You can think of dataframes like Google Sheets or Microsoft Excel spreadsheets (large tables with row/column headers).

We want to confirm that the size of our table is correct, check out the features present, and get a more visual sense of what it looks like overall.

### ✍ Exercise

Use the [`.head()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html) method to show the first five rows of the table and their corresponding column headers (our biopsy features!)

In [None]:
# YOUR CODE HERE:

# END CODE

In [None]:
#@title Instructor Solution { display-mode: "form" }
dataframe.head()

Our colleague has given us documentation on what each feature column means. Specifically:

* <u><b><i>diagnosis</i></b></u>: Whether the tumor was diagnosed as malignant (1) or benign (0).
* <u><b><i>perimeter_mean</i></b></u>: The average perimeter of cells in that particular biopsy
* <u><b><i>radius_mean</i></b></u>: The average radius of cells in that particular biopsy
* <u><b><i>texture_mean</i></b></u>: The average texture of cells in that particular biopsy
* <u><b><i>area_mean</i></b></u>: The average area of cells in that particular biopsy
* <u><b><i>smoothness_mean</i></b></u>: The average smoothness of cells in that particular biopsy
* <u><b><i>concavity_mean</i></b></u>: The average concavity of cells in that particular biopsy
* <u><b><i>symmetry_mean</i></b></u>: The average symmetry of cells in that particular biopsy

Recall that the term mean refers to taking an average (summing the values for each cell and dividing by the total number of cells observed in that biopsy).
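
As a quick illustration of that definition, here is a tiny sketch. The four radius values are made up and are not taken from the dataset.

```python
# Made-up radius measurements for the nuclei in one biopsy (illustrative only)
cell_radii = [12.0, 14.5, 13.5, 16.0]

# Mean = sum of the values divided by how many values there are
radius_mean = sum(cell_radii) / len(cell_radii)
print(radius_mean)
```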

In [None]:
# Next, we'll use the 'info' method to see the data types of each column
dataframe.info()

### 💡 Discussion Question:

Which columns are numeric? Why?


### Instructor Solution

<details><summary> click to reveal! </summary>

perimeter_mean, radius_mean, texture_mean, area_mean, smoothness_mean, concavity_mean, and symmetry_mean are all numeric. These represent the average of cell attributes across the entire biopsy sample. Note that diagnosis is technically stored as a number (int64), but it would be considered categorical.

## Visualizing our dataset

How can we determine the relationship between each of the "features" of these cells and the diagnosis?

The best way is to graph certain features in our data and see how they vary between different diagnoses! We will use some Python libraries like Seaborn and Matplotlib to make this an easier task for us.

In [None]:
# First, we'll import some handy data visualization tools
import seaborn as sns
import matplotlib.pyplot as plt

Let's focus on one feature for now: mean radius. How well does it predict diagnosis?

In [None]:
sns.catplot(x = 'radius_mean', y = 'diagnosis_cat', data = dataframe, order=['1 (malignant)', '0 (benign)'])
dataframe.head()

### 💡 Discussion Question

How would you interpret what is going on in the chart above?

### Instructor Solution

<details><summary> click to reveal! </summary>

To some degree, lower mean radius is correlated with benign tumors and higher mean radius with malignant tumors. We expect this biologically: since cancer is a disease of rapidly growing and dividing cells, we expect cancerous cells to be more deformed and to have more replicating (non-compact) DNA on average than healthy cells, and thus a larger mean nucleus radius. That said, the separation isn't perfect; the two categories overlap in the middle range of mean radius values (investigated in the next section).

### ✍ Exercise

Try out some other features (e.g. perimeter_mean, texture_mean, smoothness_mean) to see how they relate to the diagnosis. Which single feature seems like the best predictor?

# Predicting Diagnosis

Let's start by predicting a diagnosis using a single feature: radius mean.


## Approach 1: Can we use linear regression to classify these cells?




Let's start by using an algorithm that we've seen before: linear regression!

### 💡 Discussion Question

How might linear regression be useful to classify examples from this dataset?

### Instructor Solution
<details><summary> click to reveal! </summary>

This is a trick question. While you can perform linear regression on this data, it is not optimal for classification problems.

In [None]:
#@title Run this to fit and visualize a linear regression (double-click to see code!)
from sklearn import linear_model

X,y = dataframe[['radius_mean']], dataframe[['diagnosis']]

model = linear_model.LinearRegression()
model.fit(X, y)
preds = model.predict(X)

sns.scatterplot(x='radius_mean', y='diagnosis', data=dataframe)
plt.plot(X, preds, color='r')
plt.legend([ 'Data', 'Linear Regression Fit'])

In [None]:
#@title Take a look at the linear regression model and answer the following questions:

#@markdown What does a diagnosis of 0.0 mean?
diagnosis_0 = "Benign" #@param ["Malignant", "Benign", "Choose An Answer"]

#@markdown What does a diagnosis of 1.0 mean?
diagnosis_1 = "Malignant" #@param ["Malignant", "Benign", "Choose An Answer"]

#@markdown What does the model predict for radius_mean = 20?
radius_mean_20 = "Malignant" #@param ["Malignant", "Benign", "Choose An Answer"]

#@markdown What does the model predict for radius_mean = 11?
radius_mean_11 = "Benign" #@param ["Malignant", "Benign", "Choose An Answer"]

if diagnosis_0 == 'Benign' and diagnosis_1 == 'Malignant':
  print("Correct! 0.0 is a benign prediction and 1.0 is malignant.")
else:
  print("One or both of our diagnoses' interpretations is incorrect. Try again!")

if radius_mean_20 == 'Malignant':
  print("Correct! Our model would predict that a biopsy with radius_mean = 20 is malignant.")
else:
  print("That's not quite what our model would predict for radius_mean = 20. Try again!")

if radius_mean_11 == 'Benign':
  print("Correct! Our model would predict that a biopsy with radius_mean = 11 is benign.")
else:
  print("That's not quite what our model would predict for radius_mean = 11. Try again!")

### 💡 Discussion Question

Did this linear regression model do well?

**Hint**: What would our linear regression model predict for a mean radius of 25? How about 30? Is this an appropriate output?

### Instructor Solution
<details><summary> click to reveal! </summary>

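No. Only 0 and 1 are valid classifications, but the line outputs continuous values: it equals exactly 0 or 1 only at two specific values of radius_mean (roughly 10 and 20 in this model). For larger radii such as 25 or 30, it predicts values well above 1, which don't correspond to any class.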
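
To see this numerically, here is a small sketch that fits a least-squares line to a few made-up (radius, diagnosis) pairs and evaluates it at several radii. The data and the helper name `linear_prediction` are assumptions for illustration, so the exact numbers will differ from the notebook's fitted model.

```python
# Made-up (radius, diagnosis) pairs, for illustration only
radii = [10, 12, 14, 16, 18, 20]
labels = [0, 0, 0, 1, 1, 1]

# Closed-form least-squares fit of: label = slope * radius + intercept
mean_x = sum(radii) / len(radii)
mean_y = sum(labels) / len(labels)
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(radii, labels))
         / sum((x - mean_x) ** 2 for x in radii))
intercept = mean_y - slope * mean_x

def linear_prediction(radius):
    """Output of the fitted line (not restricted to 0 or 1)."""
    return slope * radius + intercept

for radius in [5, 15, 30]:
    print(radius, round(linear_prediction(radius), 2))
```

The line happily produces values below 0 for small radii and above 1 for large ones, neither of which is a valid diagnosis.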

## Approach 2: Classification - Simple Boundary Classifier
The variable we are trying to predict is categorical, not continuous! So we can't use a linear regression; we have to use a classifier.


### Classification is just drawing boundaries!

The simplest approach to classification is just drawing a boundary. Let's pick a boundary value for the radius mean and see how well it separates the data.

In [None]:
#@title Choose a value for your boundary line and click play!

#@markdown Double-click this cell to see the plotting code.
target_boundary = 10 #@param {type:"slider", min:5, max:30, step:0.5}

sns.catplot(x = 'radius_mean', y = 'diagnosis_cat', data = dataframe, order=['1 (malignant)', '0 (benign)'])
plt.plot([target_boundary, target_boundary], [-.2, 1.2], 'g', linewidth = 2)

### 💡 Discussion Question
Does this boundary value separate the data well? What do the points in each part of the graph represent?


### Instructor Solution
<details><summary> click to reveal! </summary>

Since there's overlap in the 12.5 to 17.5 region, we cannot achieve a perfect separation. This model would classify the points to the left of the boundary line as benign, and those to the right would be classified as malignant.

### Building the boundary classifier

Here we build a boundary classifier function. It takes in a **target boundary** (a particular value of radius mean of our choosing) and classifies each data point based on whether it is above or below that boundary.

**Exercise: Write a function to implement a boundary classifier.** You'll take in a `target_boundary` (a `float` or `int` like 15) and a `radius_mean_series` (a list of values) and return a list of predictions!

In [None]:
def boundary_classifier(target_boundary, radius_mean_series):
  predictions = []

  for radius_mean in radius_mean_series:
    if radius_mean > target_boundary:
      predictions.append(1)
    else:
      pass # YOUR CODE HERE (delete the pass)

  return predictions

In [None]:
#@title Instructor Solution { display-mode: "form" }
def boundary_classifier(target_boundary, radius_mean_series):
  predictions = []
  for radius_mean in radius_mean_series:
    if radius_mean > target_boundary:
      predictions.append(1)
    else:
      predictions.append(0)
  return predictions

The code below chooses a boundary and runs your classifier.

In [None]:
#@title Choose a value for your boundary line and click play to see your classifier at work!

#@markdown Double-click this cell to see the code for `y_pred` and `y_true`.
chosen_boundary = 10 #@param {type:"slider", min:5, max:30, step:0.5}

y_pred = boundary_classifier(chosen_boundary, dataframe['radius_mean'])
dataframe['predicted'] = y_pred

y_true = dataframe['diagnosis']

sns.catplot(x = 'radius_mean', y = 'diagnosis_cat', hue = 'predicted', data = dataframe, order=['1 (malignant)', '0 (benign)'])
plt.plot([chosen_boundary, chosen_boundary], [-.2, 1.2], 'g', linewidth = 2)

What do you think of the results based on the graph?

We can take a look at `y_true` and `y_pred` - how similar do they look?

In [None]:
print(list(y_true))
print(y_pred)

Let's calculate our accuracy!

In [None]:
accuracy = accuracy_score(y_true,y_pred)
print(accuracy)

**Now adjust the chosen boundary above to get the best possible 'separation'.** As you do that, think about what it means for a separation to be 'good' - is it just the highest accuracy?
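
Instead of moving the slider by hand, the same guess-and-check search can be automated. The sketch below tries every slider value on a small made-up dataset and keeps the one with the highest accuracy; `boundary_accuracy` and the toy data are hypothetical, and accuracy is only one possible definition of a 'good' separation.

```python
def boundary_accuracy(boundary, radii, labels):
    """Fraction of samples classified correctly by 'radius > boundary => 1'."""
    predictions = [1 if radius > boundary else 0 for radius in radii]
    correct = sum(1 for p, t in zip(predictions, labels) if p == t)
    return correct / len(labels)

# Made-up data with some overlap between the classes (illustrative only)
toy_radii = [9, 11, 12, 13, 14, 15, 16, 18, 20]
toy_labels = [0, 0, 0, 1, 0, 1, 1, 1, 1]

# Same candidate values as the slider: 5 to 30 in steps of 0.5
candidates = [5 + 0.5 * i for i in range(51)]

# max() returns the first candidate that reaches the highest accuracy
best_boundary = max(candidates,
                    key=lambda b: boundary_accuracy(b, toy_radii, toy_labels))
best_accuracy = boundary_accuracy(best_boundary, toy_radii, toy_labels)
print(best_boundary, best_accuracy)
```

Because the two classes overlap, no boundary classifies every toy sample correctly, and several boundaries tie for the best accuracy.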

## Approach 3: Logistic Regression - using machine learning to determine the optimal boundary



Now, it's time to move away from our simple guess-and-check model and work towards implementing an approach that can automatically find a better separation. One of the most common methods for this is called 'Logistic Regression'.
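
For the curious, here is a rough sketch of what 'automatically finding a separation' can look like under the hood. It fits a one-feature logistic regression with plain gradient descent on made-up data. This is an illustration under simplifying assumptions: scikit-learn's `LogisticRegression` uses a more sophisticated solver and applies regularization by default, and the names `fit_logistic`, `toy_radii`, and `toy_labels` are hypothetical.

```python
import math

def sigmoid(z):
    """Squash any number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, learning_rate=0.1, steps=5000):
    """Fit p(y=1) = sigmoid(w * x + b) with plain gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Prediction errors under the current parameters
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradient of the average log loss with respect to w and b
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Made-up radii and labels with some overlap (illustrative only)
toy_radii = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
toy_labels = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

# Center the feature so gradient descent behaves well
center = sum(toy_radii) / len(toy_radii)
centered = [r - center for r in toy_radii]

w, b = fit_logistic(centered, toy_labels)

# The learned boundary is where the probability equals 0.5, i.e. w * x + b = 0
learned_boundary = center - b / w
print(round(learned_boundary, 2))
```

No boundary was chosen by hand here: the procedure adjusts `w` and `b` step by step until the predicted probabilities match the labels as well as they can.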

### Training Data vs Test Data

We'll split up our data set into groups called 'train' and 'test'. We teach our 'model' the patterns using the train data, but the whole point of machine learning is that our prediction should work on 'unseen' data or 'test' data.
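
To make the idea of a split concrete, here is a minimal sketch of shuffling rows and holding some of them out. The helper `simple_train_test_split` is hypothetical and is not how scikit-learn implements its own function; it only shows the concept.

```python
import random

def simple_train_test_split(rows, test_size=0.2, seed=1):
    """Shuffle the rows, then hold out a fraction of them as the test set."""
    shuffled = list(rows)                  # copy so the original order is untouched
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

sample_ids = list(range(10))  # stand-ins for ten biopsy samples
train_ids, test_ids = simple_train_test_split(sample_ids)
print(len(train_ids), len(test_ids))
```

Every sample ends up in exactly one of the two groups, so the model never sees the test samples during training.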

The function below does this for you.


In [None]:
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(dataframe, test_size = 0.2, random_state = 1)

Let's now take a look at the 'train' and 'test' groups:


In [None]:
print('Number of rows in training dataframe:', train_df.shape[0])
train_df.head()

In [None]:
print('Number of rows in test dataframe:', test_df.shape[0])
test_df.head()

### Single Variable Logistic Regression
To start with, let's set our input feature to be radius mean and our output variable to be the diagnosis.

We will use this to build a logistic regression model to predict the diagnosis using radius mean.

In [None]:
X = ['radius_mean']
y = 'diagnosis'

X_train = train_df[X]
print('X_train, our input variables:')
print(X_train.head())
print()

y_train = train_df[y]
print('y_train, our output variable:')
print(y_train.head())

### 💡 Discussion Question

What's the difference between `X_train` and `y_train`?



### Instructor Solution
<details><summary> click to reveal! </summary>

`X_train` contains the input numerical data for our model (in this case the radius_mean of our samples). `y_train` contains the output classification: malignant (1) or benign (0)

Now, let's prepare our model (we haven't trained it yet):

In [None]:
# Here, we create a 'logreg_model' object that handles the model fitting for us!
logreg_model = linear_model.LogisticRegression()

### Making Predictions

Next, we want to tell our `logreg_model` object to take in our inputs (X) and our true labels (y) and fit a curve that predicts y from X.




#### ✍ Exercise
Can you place the arguments `X_train` and `y_train` correctly into this function to do this?

`logreg_model.fit(FILL_ME_IN, FILL_ME_IN)`

In [None]:
### YOUR CODE HERE

### END CODE

In [None]:
#@title Instructor Solution { display-mode: "form" }
# ANSWER:
logreg_model.fit(X_train, y_train)

### Testing our model

How do we know if our 'model' is actually 'learning' anything? We need to test it on unseen data.

Here we will be designating test inputs to check our model. Let's prepare the inputs and outputs from our testing dataset - try printing them out!

In [None]:
X_test = test_df[X]
y_test = test_df[y]

### Making predictions on our test set

Next, we need to figure out what our model thinks the diagnosis is for each of our test data points.


#### ✍ Exercise

Fill in the appropriate input to this function and run the function below.

`y_pred = logreg_model.predict(FILL_ME_IN)`

In [None]:
## YOUR CODE HERE

## END CODE

In [None]:
#@title Instructor Solution { display-mode: "form" }
# ANSWER:

y_pred = logreg_model.predict(X_test)

Run the code below to visualize the results!

In [None]:
test_df['predicted'] = y_pred
sns.catplot(x = 'radius_mean', y = 'diagnosis_cat', hue = 'predicted', data=test_df, order=['1 (malignant)', '0 (benign)'])

How does it look compared to the predictions before?

### Finally, let's evaluate the accuracy of our model.

In [None]:
accuracy = accuracy_score(y_test, y_pred)
print(accuracy)

## What is logistic regression doing? It's giving 'soft' predictions!


In [None]:
#@title Run this to plot logistic regression's soft probabilities { display-mode: "form" }

# Let's visualize the probabilities for `X_test`
y_prob = logreg_model.predict_proba(X_test)
X_test_view = X_test[X].values.squeeze()
plt.xlabel('radius_mean')
plt.ylabel('Predicted Probability')
sns.scatterplot(x = X_test_view, y = y_prob[:,1], hue = y_test, palette=['purple','green'])

The Y-axis is the probability of being 'malignant' and the X-axis is the radius mean. The colors show the **true** diagnosis (this is different from previous graphs!)


### 💡 Discussion Question

Can you interpret or take a guess about what the graph above is saying?

### Instructor Solution
<details><summary> click to reveal! </summary>

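The graph shows that logistic regression outputs a probability of being 'malignant' that rises smoothly from near 0 to near 1 as the radius mean increases. To decide if a sample is 'malignant', the model draws a 'decision boundary' where it thinks the sample is equally likely to be 'malignant' or 'benign' (a probability of 0.5), and asks 'am I to the left or the right of the boundary?'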
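
The link between soft probabilities and the decision boundary can be sketched in a few lines. The `weight` and `bias` values below are hypothetical and chosen only for illustration; they are not the parameters that `logreg_model` learned.

```python
import math

def sigmoid(z):
    """Squash any number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical parameters, chosen only for illustration (not fitted values)
weight = 1.0
bias = -15.0

def malignant_probability(radius_mean):
    """Soft prediction: probability that a sample is malignant."""
    return sigmoid(weight * radius_mean + bias)

def hard_prediction(radius_mean, threshold=0.5):
    """Turn the probability into a 0/1 label."""
    return 1 if malignant_probability(radius_mean) >= threshold else 0

# The probability is exactly 0.5 where weight * radius + bias = 0
decision_boundary = -bias / weight

for radius in [12, 15, 18]:
    print(radius, round(malignant_probability(radius), 3), hard_prediction(radius))
```

Samples far from the boundary get probabilities close to 0 or 1, while samples near it get probabilities close to 0.5, which is the model's way of expressing uncertainty.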

## Approach 4: Multiple Feature Logistic Regression

Which features best predict the diagnosis?

Now that we can use logistic regression to find the optimal classification boundary, let's try out other features to see how well they predict the diagnosis.

First let's print out one row of our table so we can see what other features we have available to us.


In [None]:
dataframe.head(1)

### Experimenting with Single-Variable Logistic Regression

First, let's practice what we've done already! Fill in the code below to prepare your X and y data, fit the model on the training data, and predict on the test data.




### ✍ Exercise

Once you have this code working, try replacing `radius_mean` with other features to see how well each feature predicts diagnosis!

In [None]:
X = ['radius_mean'] #Try changing this later!
y = 'diagnosis'

# 1. Split data into train and test
train_df, test_df = train_test_split(dataframe, test_size = 0.2, random_state = 1)

# 2. Prepare your X_train, X_test, y_train, and y_test variables by extracting the appropriate columns:

# 3. Initialize the model object

# 4. Fit the model to the training data

# 5. Use this trained model to predict on the test data

# 6. Evaluate the accuracy by comparing to the test labels and print out accuracy.

In [None]:
#@title Instructor Solution { display-mode: "form" }

X = ['radius_mean']
y = 'diagnosis'

# 1. Split data into train and test
train_df, test_df = train_test_split(dataframe, test_size = 0.2, random_state = 1)

# 2. Prepare your X_train, X_test, y_train, and y_test variables by extracting the appropriate columns:
X_train, X_test = train_df[X], test_df[X]
y_train, y_test = train_df[y], test_df[y]

# 3. Initialize the model object
model = linear_model.LogisticRegression()

# 4. Fit the model to the training data
model.fit(X_train, y_train)

# 5. Use this trained model to predict on the test data
preds = model.predict(X_test)

# 6. Evaluate the accuracy by comparing to the test labels and print out accuracy.
accuracy = accuracy_score(y_test, preds)
print(X[0], accuracy)

### 💡 Discussion Question

Which features best predicted diagnosis? What does this teach us about breast cancer?

### Instructor Solution
<details><summary> click to reveal! </summary>

From testing each feature, area_mean, radius_mean, and perimeter_mean all have roughly the highest accuracy, at around 86.8%. This suggests that abnormally large cell nuclei in collected samples are likely indicators of breast cancer.

## Can we use multiple features together to do even better?
So far, we've just been using `radius_mean` to make predictions. But there are lots of other potentially important features that we could be using!

Let's take a look again:

In [None]:
dataframe.head(1)

### Logistic Regression with Multiple Features

Now, let's try re-fitting the model using **your choice of multiple features.**

Just add more features to the list: for example, to use two features you could have

`X = ['radius_mean','area_mean']`

***Instructor Discussion Point Solution:*** Using `area_mean`, `radius_mean`, and `perimeter_mean` together is unlikely to change the accuracy much compared to using `area_mean` alone. All three measure the size of the nucleus, so they are highly correlated and carry nearly the same information.
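
A quick way to see why these three features overlap so much is to compute their correlation for idealized circular nuclei. The sketch below is an assumption-laden illustration: real nuclei are not perfect circles, the radii are made up, and `pearson` is a hand-written helper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Idealized circular nuclei with made-up radii (illustrative only)
radii = [6, 9, 12, 15, 18, 21, 24]
perimeters = [2 * math.pi * r for r in radii]
areas = [math.pi * r ** 2 for r in radii]

print(round(pearson(radii, perimeters), 3))
print(round(pearson(radii, areas), 3))
```

Both correlations come out very close to 1, so adding perimeter or area on top of radius gives the model little new information.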

In [None]:
X = [] # Add your features!
y = 'diagnosis'

# 1. Split data into train and test
train_df, test_df = train_test_split(dataframe, test_size = 0.2, random_state = 1)

# 2. Prepare your X_train, X_test, y_train, and y_test variables by extracting the appropriate columns:

# 3. Initialize the model object

# 4. Fit the model to the training data

# 5. Use this trained model to predict on the test data

# 6. Evaluate the accuracy by comparing to the test labels and print out accuracy.

In [None]:
#@title Instructor Solution { display-mode: "form" }
X = ['perimeter_mean', 'radius_mean', 'texture_mean','area_mean']
y = 'diagnosis'

# 1. Split data into train and test
train_df, test_df = train_test_split(dataframe, test_size = 0.2, random_state = 1)

# 2. Prepare your X_train, X_test, y_train, and y_test variables by extracting the appropriate columns:
X_train, X_test = train_df[X], test_df[X]
y_train, y_test = train_df[y], test_df[y]

# 3. Initialize the model object
model = linear_model.LogisticRegression()

# 4. Fit the model to the training data
model.fit(X_train, y_train)

# 5. Use this trained model to predict on the test data
preds = model.predict(X_test)

# 6. Evaluate the accuracy by comparing to the test labels and print out accuracy.
accuracy = accuracy_score(y_test, preds)
print(X)
print(accuracy)

Logistic Regression can learn an optimal classification boundary by using multiple features together, which can improve its prediction accuracy even more!

# Bonus Discussion: What makes a separation good?

We know our overall accuracy, so we know how many errors we make overall. Errors, however, come in two kinds:

- **False positives:** The model predicts that a sample is malignant (positive), but it's actually benign.

- **False negatives:** The model predicts that a sample is benign (negative), but it's actually malignant.

### 💡 Discussion Question

In medical diagnoses, what are the dangers of each kind of mistake? What kind is worse? Can you think of an application where the opposite is true?

A key insight is that there's a trade-off between the two kinds of errors! For example, how could you make a classifier that's guaranteed to have no false negatives? Would that be a good classifier?

We have to find an acceptable balance!

### Instructor Solution
<details><summary>click to reveal!</summary>

False positives could lead to individuals seeking treatment for cancer when they don't actually need it, while false negatives would lead some with cancer to believe they're healthy and not seek the treatment they need. The latter is arguably a worse situation.

An application where a false negative could be preferred is when evaluating the efficacy of cancer drugs. One might prefer an effective drug (positive) to be mistakenly labeled ineffective (negative) over ineffective drugs being approved, needlessly taken, and potentially making the situation worse.

A classifier that's guaranteed to have no false negatives is one that predicts everything as positive, which is definitely not a good classifier!
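
The trade-off can be made visible by changing the probability threshold used to call a sample malignant. The sketch below uses made-up probabilities and labels, and `count_errors` is a hypothetical helper; a threshold of 0 corresponds to predicting everything as positive.

```python
def count_errors(probabilities, labels, threshold):
    """Return (false positives, false negatives) at a given threshold."""
    false_positives = 0
    false_negatives = 0
    for probability, label in zip(probabilities, labels):
        prediction = 1 if probability >= threshold else 0
        if prediction == 1 and label == 0:
            false_positives += 1
        elif prediction == 0 and label == 1:
            false_negatives += 1
    return false_positives, false_negatives

# Made-up predicted probabilities and true labels (illustrative only)
toy_probabilities = [0.05, 0.20, 0.35, 0.45, 0.55, 0.60, 0.80, 0.95]
toy_labels = [0, 0, 1, 0, 0, 1, 1, 1]

for threshold in [0.0, 0.3, 0.5, 0.7]:
    fp, fn = count_errors(toy_probabilities, toy_labels, threshold)
    print(threshold, "FP:", fp, "FN:", fn)
```

As the threshold rises, false positives fall while false negatives climb, so picking a threshold means deciding which mistake is more acceptable.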

### Confusion Matrices
Next, let's evaluate the performance of our model quantitatively. We can visualize statistics on the number of correct vs. incorrect predictions using a confusion matrix that shows the following:

![Confusion Matrix](https://miro.medium.com/max/860/1*7EcPtd8DXu1ObPnZSukIdQ.png)

where the terms mean:

* **TP (True Positive)** = The model predicted positive (malignant in our case, since malignant has a label of 1) and it’s true.
* **TN (True Negative)** = The model predicted negative (benign in our case, since benign has a label of 0) and it’s true.
* **FP (False Positive)** = The model predicted positive and it’s false.
* **FN (False Negative)** = The model predicted negative and it’s false.
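
These four definitions translate directly into a counting loop. Here is a minimal sketch on made-up labels and predictions; `confusion_counts` is a hypothetical helper written for illustration, and the notebook itself relies on scikit-learn for this step.

```python
def confusion_counts(y_true, y_pred):
    """Count TN, FP, FN, TP for binary labels (1 = malignant, 0 = benign)."""
    tn = fp = fn = tp = 0
    for truth, prediction in zip(y_true, y_pred):
        if prediction == 1 and truth == 1:
            tp += 1   # predicted positive, and it's true
        elif prediction == 0 and truth == 0:
            tn += 1   # predicted negative, and it's true
        elif prediction == 1 and truth == 0:
            fp += 1   # predicted positive, but it's false
        else:
            fn += 1   # predicted negative, but it's false
    return tn, fp, fn, tp

# Made-up labels and predictions (illustrative only)
toy_true = [1, 0, 1, 1, 0, 0, 1, 0]
toy_pred = [1, 0, 0, 1, 1, 0, 1, 0]

toy_counts = confusion_counts(toy_true, toy_pred)
print("TN, FP, FN, TP:", *toy_counts)
```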

In [None]:
#@title Run this code to create a confusion matrix. { display-mode: "form" }
#@markdown If you are curious how it works you may double-click to inspect the code.

# Import the metrics class
from sklearn import metrics

# Create the Confusion Matrix
# Note: `y_pred` holds the predictions made on the test set earlier in the notebook
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)

# Visualizing the Confusion Matrix
class_names = [0,1] # Our diagnosis categories

fig, ax = plt.subplots()
# Setting up and visualizing the plot (do not worry about the code below!)
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu" ,fmt='g') # Creating heatmap
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y = 1.1)
plt.ylabel('Actual diagnosis')
plt.xlabel('Predicted diagnosis')

In [None]:
#@title Take a look at the confusion matrix and answer the following questions:

#@markdown What are the values in the top left (0, 0)?
top_left = "Choose an Answer" #@param ["True Positives", "True Negatives", "False Positives", "False Negatives", "Choose an Answer"]

#@markdown What are the values in the bottom right (1, 1)?
bottom_right = "Choose an Answer" #@param ["True Positives", "True Negatives", "False Positives", "False Negatives", "Choose an Answer"]

#@markdown What are the values in the top right (1, 0)?
top_right = "Choose an Answer" #@param ["True Positives", "True Negatives", "False Positives", "False Negatives", "Choose an Answer"]

#@markdown What are the values in the bottom left (0, 1)?
bottom_left = "Choose an Answer" #@param ["True Positives", "True Negatives", "False Positives", "False Negatives", "Choose an Answer"]

if top_left == "True Negatives" and bottom_right == "True Positives":
  print("Correct! Our results are True if our model is correct!")
else:
  print("One or both of our (0, 0) and (1, 1) interpretations is incorrect. Try again!")

if top_right == "False Positives":
  print("Correct! A false positive is when our model predicts that a sample is malignant when it's actually benign.")
else:
  print("That's not quite what (1, 0) values are. Try again!")

if bottom_left == "False Negatives":
  print("Correct! A false negative is when our model predicts that a sample is benign when it's actually malignant.")
else:
  print("That's not quite what (0, 1) values are. Try again!")

### Instructor Solution
<details><summary>click to reveal!</summary>

- **top_left**: `True Negatives`
- **bottom_right**: `True Positives`
- **top_right**: `False Positives`
- **bottom_left**: `False Negatives`

### 💡 Discussion Question
- How many `True` values did our model predict?
- How many `False` values?
- Is our model a good classifier? Why or why not?

### Instructor Solution
<details><summary>click to reveal!</summary>

- The model predicted 31 `True` values (where the predicted diagnosis is 1, i.e. the right column of the matrix: $2 + 29$)
- There are 83 `False` predictions (the left column of the matrix: $70 + 13$)
- It depends on which error you're trying to avoid! Our model correctly classifies $70/72 \approx 97\%$ of the healthy individuals and $29/42 \approx 69\%$ of the sick individuals. Based on the previous discussion, we'd probably prefer to identify more of the sick individuals!
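
The counts quoted above can be re-derived with a few lines of arithmetic. This is a minimal sketch that rebuilds the matrix from the figures in this answer (70 of 72 benign and 29 of 42 malignant samples classified correctly). The variable names are illustrative, and your own `cnf_matrix` may differ if you trained with a different split.

```python
# Confusion matrix rebuilt from the numbers quoted in the solution above.
# Rows are actual diagnoses, columns are predicted diagnoses (0 = benign, 1 = malignant).
example_matrix = [[70, 2],    # actual benign:    70 predicted benign, 2 predicted malignant
                  [13, 29]]   # actual malignant: 13 predicted benign, 29 predicted malignant

(tn, fp), (fn, tp) = example_matrix

predicted_malignant = fp + tp     # right column of the matrix
predicted_benign = tn + fn        # left column of the matrix
benign_rate = tn / (tn + fp)      # share of benign samples classified correctly
malignant_rate = tp / (tp + fn)   # share of malignant samples classified correctly

print("Predicted malignant:", predicted_malignant)
print("Predicted benign:", predicted_benign)
print("Benign classified correctly: {:.0%}".format(benign_rate))
print("Malignant classified correctly: {:.0%}".format(malignant_rate))
```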

### Optional Challenge Exercise: Choosing a Metric

Depending on the situation, we might measure success in different ways. For example, we might use:

- **Accuracy:** What portion of our predictions are right?

- **Precision:** What portion of our positive predictions are actually positive?

- **Recall:** What portion of the actual positives did we identify?



### 💡 Discussion Question

Which metric is most important for cancer diagnosis?

### Instructor Solution
<details><summary>click to reveal!</summary>

Each of these metrics has pros and cons. If we use a Clinical Reference Standard for labeling, which is a consensus of multiple doctors, recall and precision are generally both very important. Recall matters most here, since high recall corresponds to fewer False Negatives (missed cancers, which could be deadly in this case). High precision means fewer False Positives, so fewer healthy patients are told that they might have cancer. Accuracy is a useful overall summary, but it does not tell us which kind of error the model is making.
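
To see why accuracy on its own can hide missed cancers, here is a small sketch with made-up labels (not taken from our dataset): a "model" that calls every sample benign. The labels and the helper `count_outcomes` are hypothetical and exist only for this illustration.

```python
# Made-up labels for illustration only: 1 = malignant, 0 = benign.
y_true = [0] * 90 + [1] * 10      # 90 benign samples, 10 malignant samples
y_always_benign = [0] * 100       # a "model" that predicts benign for everyone

def count_outcomes(actual, predicted):
    """Return (tn, fp, fn, tp) for binary labels where 1 is the positive class."""
    tn = sum(a == 0 and p == 0 for a, p in zip(actual, predicted))
    fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))
    fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))
    tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
    return tn, fp, fn, tp

tn, fp, fn, tp = count_outcomes(y_true, y_always_benign)
toy_accuracy = (tp + tn) / (tn + fp + fn + tp)
toy_recall = tp / (tp + fn)

print("Accuracy:", toy_accuracy)  # 0.9
print("Recall:", toy_recall)      # 0.0
```

Nine out of ten predictions are right, yet every malignant sample is missed, which is exactly the error we most want to avoid in this setting.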

To calculate any of these, we can use the numbers from our confusion matrix:

In [None]:
print(cnf_matrix)
# Rows are actual diagnoses and columns are predicted diagnoses
(tn, fp), (fn, tp) = cnf_matrix
print("TN, FP, FN, TP:", tn, fp, fn, tp)

Now, calculate your model's performance by your chosen metric! You can use the [table on Wikipedia](https://en.wikipedia.org/wiki/Confusion_matrix) to choose a metric and find a formula. How does it change your view of your model?


In [1]:
#YOUR CODE HERE

In [None]:
#@title Instructor Solution
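accuracy = (tp + tn) / (tn + fp + fn + tp)  # share of all predictions that are correct
precision = tp / (tp + fp)                  # share of positive predictions that are truly positive
recall = tp / (tp + fn)                     # share of actual positives that we identified

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)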

**Congratulations!** You've successfully trained and evaluated a logistic regression model for diagnosing cancer.

# Optional: Decision Trees Walkthrough

Finally, let's try a different classification model: decision trees! Recall that a decision tree is built by repeatedly choosing the feature (and threshold) that creates the best split of our dataset, meaning the split that separates the classes as well as it can at that step.
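
To make "best split" concrete, here is a minimal sketch of Gini impurity, the default splitting criterion in scikit-learn's `DecisionTreeClassifier`. The toy perimeter values, the labels, and the helper names `gini` and `split_impurity` are made up for illustration; scikit-learn's real implementation is more elaborate.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels: 0 means pure, 0.5 means a 50/50 mix."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)      # fraction of malignant (1) samples
    return 1 - p**2 - (1 - p)**2

def split_impurity(values, labels, threshold):
    """Weighted Gini impurity after splitting on `value <= threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Made-up perimeter values and diagnoses (1 = malignant, 0 = benign)
toy_perimeter = [70, 80, 85, 95, 110, 120]
toy_diagnosis = [0, 0, 0, 1, 1, 1]

# Try each observed value as a threshold and keep the one with the lowest impurity
best_threshold = min(
    toy_perimeter,
    key=lambda t: split_impurity(toy_perimeter, toy_diagnosis, t),
)
print("Best threshold:", best_threshold)
print("Impurity at best threshold:",
      split_impurity(toy_perimeter, toy_diagnosis, best_threshold))
```

On this toy data, splitting at 85 puts all benign samples on one side and all malignant samples on the other, so the weighted impurity drops to zero. Real data is rarely this clean, which is why the tree keeps splitting at deeper levels.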

In [None]:
#@title Create the model { display-mode: "both" }
from sklearn import tree

# We'll first specify what model we want, in this case a decision tree
class_dt = tree.DecisionTreeClassifier(max_depth=3)

# We use our previous `X_train` and `y_train` sets to build the model
class_dt.fit(X_train, y_train)

In [None]:
#@title Visualize and interpret the tree
plt.figure(figsize=(13,8))  # set plot size
tree.plot_tree(class_dt, fontsize=10)

In [None]:
#@title Find the predictions based on the model { display-mode: "both" }
# now let's see how it performed!
y_pred = class_dt.predict(X_test)

In [None]:
#@title Calculate model performance { display-mode: "both" }
print("Accuracy: ", metrics.accuracy_score(y_test, y_pred))
print("Precision: ", metrics.precision_score(y_test, y_pred))
print("Recall: ", metrics.recall_score(y_test, y_pred))

### 💡 Discussion Question

What features are included in this classifier? How might you interpret this tree? Did this do better than the logistic regression?

### Instructor Solution
<details><summary>click to reveal!</summary>

In this solution we are using four features: `perimeter_mean`, `radius_mean`, `texture_mean`, and `area_mean`.

A breakdown of each level can be written out as:
- Level 0 (root): If the perimeter mean is less than or equal to 98.775 move down to the left else move down to the right
- Level 1
  - Node 0 (leftmost): If perimeter mean is less than or equal to 89.95 (we already know it is at most 98.775) move down to the left else move down to the right
  - Node 1: If texture mean is less than or equal to 16.395 move down to the left else move down to the right
- Level 2
  - Node 0: If perimeter mean is less than or equal to 85.25 both options go to benign
  - Node 1: If texture mean is less than or equal to 85.25 go to benign else go to malignant
  - Node 2: If area mean is less than or equal to 999.05 go to benign else go to malignant
  - Node 3: If perimeter mean is less than or equal to 108.85 both options go to malignant

In this instance, the decision tree did slightly worse on all performance metrics than the multiple-feature logistic regression.

# Advanced (Optional): Choosing a Classifier
We've studied two common classifiers, but many more are available. You can read about some of them [here](https://stackabuse.com/overview-of-classification-methods-in-python-with-scikit-learn/).

Let's try to choose the overall best classifier for this dataset. Fill in the code below to:
*   Use a for loop to train and evaluate each classifier in the list on our dataset.
*   Calculate the precision, recall, and accuracy on the test set for each classifier, and store the results in a data frame so it's easy to analyze.
*   Create plots to show the relationships between precision, accuracy, and recall and help you choose the "best" classifier.

Then experiment with changing the hyperparameters (options) of each classifier - can you get even better results?

In [None]:
#@title Run this to import classifiers
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_moons, make_circles, make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import ConstantKernel, RBF, WhiteKernel
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

# RBF kernel plus a noise term, used by GaussianProcessClassifier in the possible solution
kernal = (1.0 * RBF(length_scale=1e-1, length_scale_bounds=(1e-2, 1e3))
          + WhiteKernel(noise_level=1e-2, noise_level_bounds=(1e-10, 1e1)))

In [None]:
#Once you've got your code working, try changing the hyperparameters of the classifiers
#to see if you can get even better results.
#Can you find out what the hyperparameters mean?
classifiers = [
    KNeighborsClassifier(3),
    SVC(kernel="linear", C=0.025),
    GaussianProcessClassifier(1.0 * RBF(1.0)),
    DecisionTreeClassifier(max_depth=5),
    RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
    MLPClassifier(alpha=1, max_iter=1000),
    AdaBoostClassifier(),
    GaussianNB(),
    QuadraticDiscriminantAnalysis()]


#Use a for loop to train and test each classifier, and print the results
#You might find the code above useful, as well as https://towardsdatascience.com/a-python-beginners-look-at-loc-part-2-bddef7dfa7f2 .

### YOUR CODE HERE ###




### END CODE ###



#Using pyplot, show the relationships between precision, recall, and/or accuracy.
#Tutorial here: https://matplotlib.org/tutorials/introductory/pyplot.html

### YOUR CODE HERE ###




### END CODE ###

In [None]:
#@title Possible Solution
#Once you've got your code working, try changing the hyperparameters of the classifiers
#to see if you can get even better results.
#Can you find out what the hyperparameters mean?
classifiers = [
    KNeighborsClassifier(3),
    SVC(kernel="linear", C=0.025),
    GaussianProcessClassifier(kernal),
    DecisionTreeClassifier(max_depth=5),
    RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
    MLPClassifier(alpha=1, max_iter=1000),
    AdaBoostClassifier(),
    GaussianNB(),
    QuadraticDiscriminantAnalysis()]

#You might find the code above useful, as well as https://towardsdatascience.com/a-python-beginners-look-at-loc-part-2-bddef7dfa7f2 .

### YOUR CODE HERE ###
accuracies = []
precisions = []
recalls = []
for classifier in classifiers:
  print("---------------")
  print(str(classifier) + '\n')
  classifier.fit(X_train, y_train)
  y_pred = classifier.predict(X_test)
  acc = metrics.accuracy_score(y_test, y_pred)
  prec = metrics.precision_score(y_test, y_pred)
  rec = metrics.recall_score(y_test, y_pred)
  accuracies.append(acc)
  precisions.append(prec)
  recalls.append(rec)
  print("Accuracy: ", acc)
  print("Precision: ", prec)
  print("Recall: ", rec)

  print("---------------")

### END CODE ###

#Using pyplot, show the relationships between precision, recall, and/or accuracy.
#Tutorial here: https://matplotlib.org/tutorials/introductory/pyplot.html

### YOUR CODE HERE ###

plt.plot(accuracies)
plt.ylabel("Accuracy")
plt.show()

plt.plot(precisions)
plt.ylabel("Precision")
plt.show()

plt.plot(recalls)
plt.ylabel("Recall")
plt.show()

### END CODE ###
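
The exercise also suggests storing the results in a data frame. One possible way to do that is sketched below, assuming pandas is available. The helper `build_results_table` and the names `ModelA`, `ModelB`, and `ModelC` are hypothetical, and the scores are placeholders rather than results from our dataset.

```python
import pandas as pd

def build_results_table(names, accuracies, precisions, recalls):
    """Collect per-classifier scores into a DataFrame indexed by classifier name."""
    return pd.DataFrame(
        {"accuracy": accuracies, "precision": precisions, "recall": recalls},
        index=names,
    )

# Placeholder scores for illustration only; these are NOT results from our dataset.
demo_names = ["ModelA", "ModelB", "ModelC"]
demo_table = build_results_table(
    demo_names,
    accuracies=[0.90, 0.93, 0.88],
    precisions=[0.85, 0.95, 0.80],
    recalls=[0.92, 0.86, 0.95],
)

# Sort so the classifier with the highest recall comes first
demo_table = demo_table.sort_values("recall", ascending=False)
print(demo_table)

best_by_recall = demo_table.index[0]
print("Highest recall:", best_by_recall)
```

In the notebook, you could pass `[type(c).__name__ for c in classifiers]` as the names together with the `accuracies`, `precisions`, and `recalls` lists from the loop. Calling `.plot.bar()` on the resulting table then gives one labeled group of bars per classifier, which is easier to read than a line plot over list positions.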

**Think about:**
*   Which classifier would you choose?
*   What are the relationships among precision, recall, and accuracy? For this dataset, which is most important?
*   Can you find more successful hyperparameters for each classifier?

Your experiments will help you find a classifier that works very well on our test set. However, you're running a risk by doing so much manual fine-tuning: you might end up "overfitting" (on a more meta level) by choosing a classifier that works well on your test set, but might not work well on other data.

That's why most machine learning projects actually use [*three* datasets](https://towardsdatascience.com/train-validation-and-test-sets-72cb40cba9e7): a training set that we use to train each candidate model; a validation set that we use to evaluate each candidate model and choose the best one; and finally, a test set which we use only once, to report the overall performance of our project.
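
A three-way split can be sketched by calling scikit-learn's `train_test_split` twice. The stand-in data, the 60/20/20 proportions, and the variable names below are assumptions chosen for illustration, not the split used earlier in this notebook.

```python
from sklearn.model_selection import train_test_split

# Stand-in data: 100 samples with one feature each and alternating 0/1 labels
X_all = [[i] for i in range(100)]
y_all = [i % 2 for i in range(100)]

# First hold out 20% of the data as the final test set
X_rest, X_test_final, y_rest, y_test_final = train_test_split(
    X_all, y_all, test_size=0.2, random_state=0)

# Then split the remainder into training (75% of it) and validation (25% of it)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_tr), len(X_val), len(X_test_final))  # 60 20 20
```

With this setup you would fit every candidate classifier on `X_tr`, compare candidates on `X_val`, and touch `X_test_final` only once, at the very end, to report the performance of the model you chose.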


