# Who is ready to learn? (Our Machines Are!)

Our task at hand -- building an AI that can save lives.

Today we are going to work to create an AI that can predict whether a breast cancer tumor is benign or malignant. This diagnosis has incredibly important medical implications for those who are diagnosed with breast cancer. 

Before we get started, make sure you know what the difference between benign and malignant tumors is, and discuss it with your team.

**Hint:** we recommend you heavily refer to your EDA mini project from last week for this activity! 

### Step 1: Imports and Preprocessing

We need to `import pandas as pd` and read in our data. It is currently a .csv (comma-separated values) file, and we generally use the `pd.read_csv()` function from pandas to read a .csv file into a dataframe.

In [1]:
import pandas as pd
import numpy as np

In [2]:
breast_cancer = pd.read_csv("https://raw.githubusercontent.com/pkmklong/Breast-Cancer-Wisconsin-Diagnostic-DataSet/master/data.csv")

**TODO:** Do you remember how to print the first five rows of a pandas dataframe? How about all of the columns? How about the number of rows? Please do that below and google anything you are unsure about. 

In [4]:
# print first five rows of breast_cancer
breast_cancer.head()

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,


In [3]:
# print columns of breast_cancer
breast_cancer.columns

Index(['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean',
       'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst', 'Unnamed: 32'],
      dtype='object')

In [5]:
# print the number of rows in breast_cancer
breast_cancer.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 33 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   id                       569 non-null    int64  
 1   diagnosis                569 non-null    object 
 2   radius_mean              569 non-null    float64
 3   texture_mean             569 non-null    float64
 4   perimeter_mean           569 non-null    float64
 5   area_mean                569 non-null    float64
 6   smoothness_mean          569 non-null    float64
 7   compactness_mean         569 non-null    float64
 8   concavity_mean           569 non-null    float64
 9   concave points_mean      569 non-null    float64
 10  symmetry_mean            569 non-null    float64
 11  fractal_dimension_mean   569 non-null    float64
 12  radius_se                569 non-null    float64
 13  texture_se               569 non-null    float64
 14  perimeter_se             5

In [0]:
# explore anything else you are interested in here

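If you only want the number of rows, `.info()` prints a lot more than you need. Here is a minimal sketch of two shorter ways, using a tiny made-up dataframe called `toy` (a hypothetical stand-in, not the real data) so it runs on its own:

```python
import pandas as pd

# Toy stand-in for breast_cancer (values are made up for illustration).
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "diagnosis": ["M", "B", "B", "M"],
    "radius_mean": [17.9, 11.4, 12.1, 20.3],
})

n_rows, n_cols = toy.shape   # .shape is a (rows, columns) tuple
print(n_rows, n_cols)        # 4 3
print(len(toy))              # 4, same as toy.shape[0]
```

The same two calls should work on `breast_cancer` once it is loaded.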

**TODO:** Can you figure out how many rows are malignant (M) and how many are benign (B)? 

In [6]:
# Figure out how many rows are malignant and how many are benign. Hint: .value_counts() counts the rows for each value in a column!
breast_cancer["diagnosis"].value_counts()

### Step 2: Do Your Research
That is a ton of columns. That is a ton of data! Where do we start? 

You should probably do some research on this dataset to find out what the columns mean. I got it on Kaggle, and you can read about it here: [Wisconsin Breast Cancer Dataset](https://www.kaggle.com/datasets/uciml/breast-cancer-wisconsin-data).


### Step 3: Cleaning our data up!

Maybe there are some columns we don't need. The last column makes zero sense right? Do we need an id column for our purposes? You should drop **any** columns that you feel are not worth keeping track of. 

Pandas' `drop()` documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html. Remember to use `inplace=True` if you are changing your dataframe and not creating a new one!

**Note:** This data is pretty clean! We only need a few lines for this dataset.

In [0]:
# use breast_cancer.drop() to drop 'Unnamed: 32' and 'id', as well as any other columns you don't think we should keep track of.

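Here is a minimal sketch of how `drop()` behaves with `inplace=True`, using a made-up dataframe called `toy` (hypothetical, not the real data) that copies a few column names from our dataset:

```python
import pandas as pd

toy = pd.DataFrame({
    "id": [101, 102, 103],
    "radius_mean": [17.9, 11.4, 12.1],
    "Unnamed: 32": [None, None, None],
})

# inplace=True changes toy itself and returns None.
toy.drop(columns=["id", "Unnamed: 32"], inplace=True)
print(list(toy.columns))  # ['radius_mean']

# Equivalent without inplace: reassign the returned copy.
# toy = toy.drop(columns=["id", "Unnamed: 32"])
```

A common mistake is writing `toy = toy.drop(..., inplace=True)`, which replaces your dataframe with `None`.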

In [0]:
# drop any n/a values remaining! Remember inplace!

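A minimal sketch of `dropna()` on a made-up dataframe (`toy` is hypothetical). One thing to watch: `dropna()` removes every row that has at least one missing value, and `Unnamed: 32` looks empty in the first five rows printed earlier, so it is probably safest to drop that column before calling `dropna()`.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "radius_mean": [17.9, np.nan, 12.1],
    "texture_mean": [10.4, 17.8, 21.3],
})

print(toy.isna().sum())   # number of missing values in each column
toy.dropna(inplace=True)  # drops every row containing at least one NaN
print(len(toy))           # 2
```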

In [0]:
# print the last five rows of your dataframe 


**TODO:** at this point, all of our data should be numeric except for our diagnosis column. This is because supervised machine learning models work off of numerical data. Use breast_cancer.info() below to check that all our data besides 'diagnosis' are numeric.

In [0]:
# Run breast_cancer.info()


### Dummy Variables - Preprocessing
Our target variable (the variable we are trying to predict) is `diagnosis`. Our only problem is that the diagnosis variable will either be an 'M' or a 'B', which is not something our model will understand. We need to change our M's to 1's and B's to 0's so that our model can think like computers do -- using numbers!  

To do so, we use something called a dummy variable. I'll show you how to do so below. 

In [0]:
dummies = pd.get_dummies(breast_cancer["diagnosis"], dtype=int) # dtype=int gives 0/1 columns; recent pandas versions return True/False by default
dummies.head()

**Ok, we've got our dummy variables. What now?** Now we have to pick one and add it to our dataframe -- we also have to delete the old diagnosis column!

In [0]:
breast_cancer["outcome"] = dummies["M"]
breast_cancer.head()

**TODO:** Now that our outcome column is there, we need to remove the diagnosis column that had the bad data, right? Please do so below.

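To see the whole dummy-variable flow in one place, here is a small sketch on a made-up `diagnosis` column (`toy` is a hypothetical stand-in for `breast_cancer`):

```python
import pandas as pd

toy = pd.DataFrame({"diagnosis": ["M", "B", "B", "M"]})

# One column per category; dtype=int gives 0/1 instead of True/False.
dummies = pd.get_dummies(toy["diagnosis"], dtype=int)
print(list(dummies.columns))  # ['B', 'M']

toy["outcome"] = dummies["M"]                # 1 = malignant, 0 = benign
toy.drop(columns="diagnosis", inplace=True)  # remove the old text column
print(toy["outcome"].tolist())               # [1, 0, 0, 1]
```

We keep only the `M` column because the `B` column carries the same information flipped: whenever `M` is 1, `B` is 0.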

In [0]:
#Drop the 'diagnosis' column from your dataframe. Remember to use inplace if you are changing the dataframe itself!


### Visualize our data
Now that we've cleaned our data and it is all ready for our model, we should understand some of the relationships. Use your learnings from last week to create two plots -- you can always start with simple [scatter plots](https://plotly.com/python/line-and-scatter/) or maybe a [scatterplot matrix](https://plotly.com/python/splom/). Get creative!

In [0]:
import plotly.express as px
import numpy
import matplotlib.pyplot as plt
import seaborn as sns

In [0]:
# Visualization 1 here


In [0]:
# Visualization 2 here


### Are you ready to perform machine learning? 
Do you think deciding whether a tumor is benign or malignant is classification or regression? Talk with your team for a minute.

Either way, we need to tell our code what our target variable is, and what our input is going to be. <br> **TODO:** Can you do so below?

In [0]:
# Create a variable named target and set it equal to the ["outcome"] column in our breast_cancer dataframe.


In [0]:
# Create a variable named input_columns and set it equal to breast_cancer.loc[:, breast_cancer.columns != "outcome"]. This means our inputs are every single column except for the outcome column!

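Here is what those two lines do, sketched on a made-up dataframe (`toy` is hypothetical). The expression `toy.columns != "outcome"` builds a True/False mask over the column names, and `.loc[:, mask]` keeps all rows and only the columns marked True:

```python
import pandas as pd

toy = pd.DataFrame({
    "radius_mean": [17.9, 11.4, 12.1],
    "texture_mean": [10.4, 20.4, 17.8],
    "outcome": [1, 0, 0],
})

target = toy["outcome"]                               # what we want to predict
input_columns = toy.loc[:, toy.columns != "outcome"]  # everything else

print(list(input_columns.columns))  # ['radius_mean', 'texture_mean']
# toy.drop(columns="outcome") selects the same columns.
```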

### Splitting the Data

For every single supervised machine learning problem, we will follow the same flow with `sklearn`:

> *Split, Fit, Predict, and Evaluate.*

What is first? Splitting our data into training and testing sets. You can watch a video on why we do that [in the first 36 seconds of this video](https://www.youtube.com/watch?v=_vdMKioCXqQ). 

We will split our data using the [`train_test_split()`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function from `sklearn`, using the same flow as this tutorial: https://realpython.com/train-test-split-python-data/#application-of-train_test_split. The code to do so is below. 

In [0]:
import sklearn
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(input_columns, target, test_size=0.2)

**TODO:** Can you look at the shape of `x_train`, `x_test`, `y_train`, and `y_test`? Why do these shapes make sense? 

In [0]:
#hint: start with x_train.shape

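To build some intuition for the shapes, here is a sketch on random made-up data (100 rows, 5 columns; `X`, `y`, and `rng` are hypothetical names). With `test_size=0.2`, 20% of the rows go to the test set and the rest go to training, while the number of columns stays the same:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))      # 100 rows, 5 made-up feature columns
y = rng.integers(0, 2, size=100)   # 100 made-up 0/1 labels

# random_state makes the shuffle reproducible; it is optional.
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(x_train.shape, x_test.shape)  # (80, 5) (20, 5)
print(y_train.shape, y_test.shape)  # (80,) (20,)
```

The targets have one entry per row and no columns of their own, which is why their shapes have a single number.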

**TODO:** Can you print the outputs with print(y_train) and print(y_test)? What do these 1's and 0's represent?

In [0]:
#print outputs


### We have split, now we gotta fit!
Now that our data is preprocessed (all numerical) and split (80/20 train/test) for machine learning, we can create a model to **classify** whether a patient's tumor is malignant (1) or benign (0)!

The first model is a classification model called K Nearest Neighbors. You can watch a video on how K Nearest Neighbors works here: https://www.youtube.com/watch?v=0p0o5cmgLdE.

First things first, we need to import our model using SKLearn. You can follow this [guide](https://scikit-learn.org/stable/auto_examples/neighbors/plot_classification.html#sphx-glr-auto-examples-neighbors-plot-classification-py) to see a worked example of a nearest neighbors classifier.

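Before we import the real thing, here is a rough sketch of the idea behind K Nearest Neighbors, written with plain numpy. This is an illustration only, not how `sklearn` implements it; the function `knn_predict` and the arrays `points` and `labels` are made up for this example. To classify a new point, we measure its distance to every training point, take the `k` closest ones, and let them vote:

```python
import numpy as np

def knn_predict(points, labels, x_new, k=3):
    """Predict a 0/1 label for x_new by majority vote of its k nearest rows."""
    distances = np.linalg.norm(points - x_new, axis=1)  # Euclidean distance to each row
    nearest = np.argsort(distances)[:k]                 # indices of the k closest rows
    votes = labels[nearest]
    return int(votes.sum() * 2 > k)                     # 1 if more than half vote 1

points = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],   # cluster labeled 0
                   [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])  # cluster labeled 1
labels = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(points, labels, np.array([1.1, 1.0])))  # 0
print(knn_predict(points, labels, np.array([5.1, 5.0])))  # 1
```

Using an odd `k` avoids tied votes when there are only two classes.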

In [0]:
import sklearn #our favorite machine learning library
from sklearn.neighbors import KNeighborsClassifier as KNN #this will import our model as KNN

**Creating the model object:** we need to create an instance of the model. This is like giving the model a nickname so we can remember it, change it, and customize it for our dataset. 

In [0]:
# Creating the KNN model object -- we need to tell it the number of neighbors to look at.
k = 7 #number of neighbors to look at (look at 7 nearest neighbors)
my_KNN_model = KNN(n_neighbors=k)

And there you have it. We have a K Nearest Neighbors model that will look at the 7 nearest neighbors, and we gave it a nice nickname, `my_KNN_model`. <br> <br>

**Fitting (Training) the model object:** **Supervised machine learning** training involves learning from past results (`y_train`) and past inputs (`x_train`). We pass these to our model for learning to teach our model to answer the question "*based on these inputs, what outputs can we expect?*"
<br>

To do this, we want to let this model learn from our data. We do so by running our model object's `.fit()` function on our training data.

In [0]:
my_KNN_model.fit(x_train, y_train) #train the model on our past results x_train (inputs) and y_train (outputs)

**Prediction and Evaluation:** How do we know if our model is actually good? We need to **test** our model by giving it a two-step quiz. 

Step 1: We let our model use our testing inputs (`x_test`) to make predictions about whether each patient's tumor is benign or malignant (we call these predictions `y_hat`). In this case our `y_hat` values are a bunch of 1's and 0's: a 1 means our model thinks the tumor is malignant, and a 0 means it thinks the tumor is benign. 

We do this by running our model object's `.predict()` function.

In [0]:
#make some predictions based on your testing inputs (x_test) using the predict() function
y_hat = my_KNN_model.predict(x_test)

In [0]:
y_hat #these are our predictions for our x_test inputs. 

In [0]:
np.array(y_test) #these are what actually happened for our x_test inputs. Do you see some differences?

Step 2 - Evaluation: Here we want to compare `y_hat` to `y_test`. Discuss with your team why we should do this. 

One way we can compare them is visual inspection. With that said, would you recommend going through by hand and making a tally for every time the ones and zeros don't match up?

Please don't do this. Please save yourself this nasty headache when we can just do math. Below is a common way to do this, called mean squared error. 

To get mean squared error, we compute the difference between each element in `y_test` and `y_hat`, square it, and then take the average of all the squared differences to see how far off we are.

In [0]:
errors = np.asarray(y_test, dtype=int) - np.asarray(y_hat, dtype=int) #error = what we want (y_test) - what we predicted (y_hat); casting to int keeps this working if the labels are True/False
total_squared_error = np.sum(errors**2) #get the sum of all the squared errors
mean_squared_error = total_squared_error/len(y_test) #divide this by how many rows/observations we have 
print(mean_squared_error)

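**Not too bad!** MSE represents the average squared error (difference) between our model's predictions and the correct results. Because every label here is a 0 or a 1, each wrong guess adds exactly 1 to the total, so our MSE is simply the fraction of predictions we got wrong. A low mean squared error means that our model makes the correct guess most of the time. 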

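Here is a small worked example of that calculation on made-up labels (these `y_test` and `y_hat` arrays are hypothetical, not our real results). With 0/1 labels, every miss contributes 1 and every hit contributes 0, so the MSE should come out equal to 1 minus the accuracy:

```python
import numpy as np
from sklearn.metrics import accuracy_score

y_test = np.array([1, 0, 0, 1, 1, 0, 0, 0])  # made-up true labels
y_hat = np.array([1, 0, 1, 1, 0, 0, 0, 0])   # made-up predictions (2 misses out of 8)

mse = np.mean((y_test - y_hat) ** 2)         # each miss contributes 1, each hit 0
accuracy = accuracy_score(y_test, y_hat)

print(mse)           # 0.25
print(1 - accuracy)  # 0.25
```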
But we have to think about the nature of the problem. As Data Scientists, do we want to tell someone that we have this average error when we are predicting something as serious as breast cancer? 

And what does an error mean? Are we telling them they don't have a malignant tumor (`y_hat = 0`) when they actually do (`y_test = 1`)? Are we telling them they have a malignant tumor (`y_hat = 1`) when they actually don't (`y_test = 0`)?

We can get insights on this through a confusion matrix. 

In [0]:
from sklearn.metrics import confusion_matrix
import seaborn as sns
sns.heatmap(confusion_matrix(y_test, y_hat), annot=True, fmt='g')

**What is this???** I encourage you and your team to read about it here: https://towardsdatascience.com/understanding-confusion-matrix-a9ad42dcfd62. You should define what true positives, true negatives, false positives, and false negatives represent for this scenario. 

Discuss what other findings you can take away, and talk with your instructor about metrics like accuracy, precision, and recall!

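As a starting point for that discussion, here is a sketch that pulls the four cells out of a confusion matrix on made-up labels (hypothetical arrays, not our real results). With 1 meaning malignant, a false negative is a malignant tumor that the model called benign, and recall tells us what share of the malignant tumors we actually caught:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_test = np.array([1, 0, 0, 1, 1, 0, 0, 0])  # made-up true labels
y_hat = np.array([1, 0, 1, 1, 0, 0, 0, 0])   # made-up predictions

# Rows are actual classes, columns are predicted classes, ordered [0, 1].
tn, fp, fn, tp = confusion_matrix(y_test, y_hat).ravel()
print(tn, fp, fn, tp)                   # 4 1 1 2

print(precision_score(y_test, y_hat))  # tp / (tp + fp)
print(recall_score(y_test, y_hat))     # tp / (tp + fn)
```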
**The last step!!** We need to change parameters and data to improve our model as we go. If we can make even 1 more correct cancer prediction, it is worth it, right? 

**TODO:** Go back and change the value of `k` in the **creating your model object** section and re-run all the code below it. See what changes with your MSE and confusion matrix!

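If re-running the cells by hand gets tedious, one option is to loop over several values of `k`. This is a sketch on synthetic data from `make_classification` (an assumption so the example runs on its own; in the notebook you would use your own `x_train`, `x_test`, `y_train`, and `y_test`). Keep in mind that in a real project you would normally pick `k` using a separate validation set rather than the test set.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data, not the breast cancer dataset.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

results = {}
for k in [1, 3, 5, 7, 9, 11]:
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(x_train, y_train)
    results[k] = accuracy_score(y_test, model.predict(x_test))

for k, acc in results.items():
    print(f"k={k:2d}  accuracy={acc:.3f}")
```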
### Code Your Own Machine Learning Model

It's time for us to create our own classifier. Let's go with an [SVM](https://www.youtube.com/watch?v=_YPScrckx28). We will do the first step for you.

I know it was a lot of information in the last section, but the code is *basically* the same. Since our data is already split, you will follow the same steps to fit your model, make some predictions, evaluate the performance, and visualize your confusion matrix.

In [0]:
from sklearn.svm import SVC
#Create your model object -- you can go back and change the kernel argument after evaluating your data.

Do you remember how to fit a model? What parts of our data do we fit/train on? You can go look at the section above for inspiration.

In [0]:
#fit your model


Now create some predictions and set them equal to `y_hat`. You can look above or google how to make predictions on sklearn if you get confused!

In [0]:
#create some predictions on your testing inputs and set them equal to y_hat.


Compute your MSE below. Is this higher or lower than KNN? Which one would you recommend so far?

In [0]:
#Compute your MSE.


Now, create a confusion matrix! How are the results? Do we end up with more false negatives or false positives?

In [0]:
#Create a confusion matrix

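If you get stuck, here is one way the whole SVM flow could look. It is a sketch on synthetic data from `make_classification` (an assumption so it runs on its own), and `my_SVM_model` is just a nickname we made up. In the notebook, your data is already split, so you would skip straight to creating the model:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data, not the breast cancer dataset.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

for kernel in ["linear", "rbf", "poly"]:
    my_SVM_model = SVC(kernel=kernel)      # create the model object
    my_SVM_model.fit(x_train, y_train)     # fit on the training data
    y_hat = my_SVM_model.predict(x_test)   # predict on the testing inputs
    print(kernel)
    print(confusion_matrix(y_test, y_hat)) # rows = actual, columns = predicted
```

The only line that differs from the KNN section is the one that creates the model object.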

Amazing, now try changing your kernel and seeing how your results change. You can research what kernels are available here: https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html. If you're interested in more ML with SKLearn, you should get used to pages like these!

### Want more?

There are dozens of models available on SKLearn: https://scikit-learn.org/stable/supervised_learning.html.

You can try to implement your own below. I recommend starting with logistic regression or random forests :) 

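As a nudge, here is a sketch of both suggestions side by side. It uses synthetic data from `make_classification` (an assumption so it runs on its own), and the settings `max_iter=1000` and `n_estimators=100` are just reasonable starting values, not tuned choices:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, not the breast cancer dataset.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(x_train, y_train)                                   # fit
    scores[name] = accuracy_score(y_test, model.predict(x_test))  # predict + evaluate
    print(name, scores[name])
```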
In [0]:
#have fun!