# Lab 2: Building a supervised machine learning model

Your objectives for this lab are to:
* practice basic preprocessing steps to set up a supervised learning task,
* learn how to explore a dataset to understand its structure, variables, and missingness, and
* implement `KNeighborsClassifier` and adjust its hyperparameters.

First, let's make the necessary imports for today with the code below.

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

Now load the Breast Cancer dataset, which is provided by `sklearn`.

In [2]:
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()

# Part 1: Preprocessing and EDA

A widely used library for data wrangling is `pandas` (`pd` for short), which you imported above. To work with data using `pandas`, we first need to convert the `cancer` dataset into a `DataFrame`. To do this, run the code below.

Notice how we also specify the target variable, which is the variable we'll try to predict with `KNeighborsClassifier` later in this notebook.

In [3]:
df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
df['target'] = cancer.target
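
As an aside, the same `DataFrame` can be requested from `sklearn` directly. This is a minimal sketch, assuming a scikit-learn version that supports the `as_frame` argument (0.23 or later); the names `cancer_frame` and `df_alt` are only for illustration and are not used elsewhere in this lab.

```python
from sklearn.datasets import load_breast_cancer

# as_frame=True asks scikit-learn to return pandas objects instead of NumPy arrays
cancer_frame = load_breast_cancer(as_frame=True)

# .frame holds the 30 feature columns plus a final 'target' column
df_alt = cancer_frame.frame
print(df_alt.shape)
```

Building the `DataFrame` by hand, as in the cell above, makes it explicit where the feature names and the target come from, which is why the lab does it that way.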

Now check the shape of `df` and inspect the column names.

In [4]:
print("Our dataframe has", df.shape[0], "rows and", df.shape[1], "columns.")
print("The column names are:", df.columns)

Our dataframe has 569 rows and 31 columns.
The column names are: Index(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
       'mean smoothness', 'mean compactness', 'mean concavity',
       'mean concave points', 'mean symmetry', 'mean fractal dimension',
       'radius error', 'texture error', 'perimeter error', 'area error',
       'smoothness error', 'compactness error', 'concavity error',
       'concave points error', 'symmetry error', 'fractal dimension error',
       'worst radius', 'worst texture', 'worst perimeter', 'worst area',
       'worst smoothness', 'worst compactness', 'worst concavity',
       'worst concave points', 'worst symmetry', 'worst fractal dimension',
       'target'],
      dtype='object')


Do you know what those column names are referring to?

If not, a good first step for working with the data would be to familiarize yourself with the source of the data, how it was collected, and what each variable is measuring. Here's where you can find that information: https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic

## Question 1
Now let's start exploring the data. Use `describe()` to get some descriptive statistics.

Then print the different values of the target column, and how many occurrences there are of each value. What does each value of the target variable represent?

In [6]:
df.describe()


Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,target
count,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,...,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0
mean,14.127292,19.289649,91.969033,654.889104,0.09636,0.104341,0.088799,0.048919,0.181162,0.062798,...,25.677223,107.261213,880.583128,0.132369,0.254265,0.272188,0.114606,0.290076,0.083946,0.627417
std,3.524049,4.301036,24.298981,351.914129,0.014064,0.052813,0.07972,0.038803,0.027414,0.00706,...,6.146258,33.602542,569.356993,0.022832,0.157336,0.208624,0.065732,0.061867,0.018061,0.483918
min,6.981,9.71,43.79,143.5,0.05263,0.01938,0.0,0.0,0.106,0.04996,...,12.02,50.41,185.2,0.07117,0.02729,0.0,0.0,0.1565,0.05504,0.0
25%,11.7,16.17,75.17,420.3,0.08637,0.06492,0.02956,0.02031,0.1619,0.0577,...,21.08,84.11,515.3,0.1166,0.1472,0.1145,0.06493,0.2504,0.07146,0.0
50%,13.37,18.84,86.24,551.1,0.09587,0.09263,0.06154,0.0335,0.1792,0.06154,...,25.41,97.66,686.5,0.1313,0.2119,0.2267,0.09993,0.2822,0.08004,1.0
75%,15.78,21.8,104.1,782.7,0.1053,0.1304,0.1307,0.074,0.1957,0.06612,...,29.72,125.4,1084.0,0.146,0.3391,0.3829,0.1614,0.3179,0.09208,1.0
max,28.11,39.28,188.5,2501.0,0.1634,0.3454,0.4268,0.2012,0.304,0.09744,...,49.54,251.2,4254.0,0.2226,1.058,1.252,0.291,0.6638,0.2075,1.0


In [5]:

print(df['target'])

0      0
1      0
2      0
3      0
4      0
      ..
564    0
565    0
566    0
567    0
568    1
Name: target, Length: 569, dtype: int32
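
Printing the column shows individual rows, but not how often each value occurs or what it means. One possible way to answer both parts of the question is sketched below. It rebuilds `df` so the block runs on its own, and it relies on `cancer.target_names`, the label array that `sklearn` ships with this dataset.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
df['target'] = cancer.target

# Count how many rows fall into each class
counts = df['target'].value_counts()
print(counts)

# target_names lists the label for each integer code, in order (code 0 first, then code 1)
for code, name in enumerate(cancer.target_names):
    print(code, "=", name)
```

The counts agree with the `describe()` output above: a mean of about 0.627 for `target` over 569 rows means that roughly 357 rows are coded 1.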


## Question 2

In the real world, the datasets you work with often have missing values. This usually isn't a problem if there are very few missing values, or if the values are missing completely at random. But if there are lots of missing values and/or there's a pattern to the missingness (e.g., data is missing for all patients over 50 years old, or for all female patients), then the missing values could indicate a bias in the dataset. So, it's important to check for missing values.

Use `isnull().sum()` to count the number of missing values in each column.

(*Hint: there should be zero missing values in this dataset*)


In [7]:

print(df.isnull().sum())

mean radius                0
mean texture               0
mean perimeter             0
mean area                  0
mean smoothness            0
mean compactness           0
mean concavity             0
mean concave points        0
mean symmetry              0
mean fractal dimension     0
radius error               0
texture error              0
perimeter error            0
area error                 0
smoothness error           0
compactness error          0
concavity error            0
concave points error       0
symmetry error             0
fractal dimension error    0
worst radius               0
worst texture              0
worst perimeter            0
worst area                 0
worst smoothness           0
worst compactness          0
worst concavity            0
worst concave points       0
worst symmetry             0
worst fractal dimension    0
target                     0
dtype: int64
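
Since this dataset has no missing values, the idea of a pattern in the missingness is easier to see on made-up data. The sketch below uses a hypothetical toy `DataFrame` (the columns `age_group` and `measurement` are invented for illustration) and compares the share of missing values between groups.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'measurement' is missing only in the 'over 50' group
toy = pd.DataFrame({
    "age_group": ["under 50", "under 50", "under 50", "over 50", "over 50", "over 50"],
    "measurement": [1.2, 0.8, 1.1, np.nan, np.nan, 0.9],
})

# Overall share of missing values in each column
print(toy.isnull().mean())

# Share of missing 'measurement' values within each age group
missing_by_group = toy["measurement"].isnull().groupby(toy["age_group"]).mean()
print(missing_by_group)
```

A large gap between groups, as in this toy example, is the kind of pattern that would be worth investigating before modeling.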


## Question 3

Which attributes correlate with the target variable? Do any of the attributes correlate with one another?

Produce a correlation matrix to check and get an initial sense of relationships between variables.

In [8]:

df.corr()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,target
mean radius,1.0,0.323782,0.997855,0.987357,0.170581,0.506124,0.676764,0.822529,0.147741,-0.311631,...,0.297008,0.965137,0.941082,0.119616,0.413463,0.526911,0.744214,0.163953,0.007066,-0.730029
mean texture,0.323782,1.0,0.329533,0.321086,-0.023389,0.236702,0.302418,0.293464,0.071401,-0.076437,...,0.912045,0.35804,0.343546,0.077503,0.27783,0.301025,0.295316,0.105008,0.119205,-0.415185
mean perimeter,0.997855,0.329533,1.0,0.986507,0.207278,0.556936,0.716136,0.850977,0.183027,-0.261477,...,0.303038,0.970387,0.94155,0.150549,0.455774,0.563879,0.771241,0.189115,0.051019,-0.742636
mean area,0.987357,0.321086,0.986507,1.0,0.177028,0.498502,0.685983,0.823269,0.151293,-0.28311,...,0.287489,0.95912,0.959213,0.123523,0.39041,0.512606,0.722017,0.14357,0.003738,-0.708984
mean smoothness,0.170581,-0.023389,0.207278,0.177028,1.0,0.659123,0.521984,0.553695,0.557775,0.584792,...,0.036072,0.238853,0.206718,0.805324,0.472468,0.434926,0.503053,0.394309,0.499316,-0.35856
mean compactness,0.506124,0.236702,0.556936,0.498502,0.659123,1.0,0.883121,0.831135,0.602641,0.565369,...,0.248133,0.59021,0.509604,0.565541,0.865809,0.816275,0.815573,0.510223,0.687382,-0.596534
mean concavity,0.676764,0.302418,0.716136,0.685983,0.521984,0.883121,1.0,0.921391,0.500667,0.336783,...,0.299879,0.729565,0.675987,0.448822,0.754968,0.884103,0.861323,0.409464,0.51493,-0.69636
mean concave points,0.822529,0.293464,0.850977,0.823269,0.553695,0.831135,0.921391,1.0,0.462497,0.166917,...,0.292752,0.855923,0.80963,0.452753,0.667454,0.752399,0.910155,0.375744,0.368661,-0.776614
mean symmetry,0.147741,0.071401,0.183027,0.151293,0.557775,0.602641,0.500667,0.462497,1.0,0.479921,...,0.090651,0.219169,0.177193,0.426675,0.4732,0.433721,0.430297,0.699826,0.438413,-0.330499
mean fractal dimension,-0.311631,-0.076437,-0.261477,-0.28311,0.584792,0.565369,0.336783,0.166917,0.479921,1.0,...,-0.051269,-0.205151,-0.231854,0.504942,0.458798,0.346234,0.175325,0.334019,0.767297,0.012838
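
A 31 by 31 matrix is hard to scan by eye. One possible way to list the attribute pairs that correlate most strongly with one another is sketched below. The 0.95 threshold is an arbitrary choice for illustration, and the block rebuilds the feature table so it runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
features = pd.DataFrame(cancer.data, columns=cancer.feature_names)

# Absolute correlations between every pair of features
corr = features.corr().abs()

# Indices of the upper triangle (k=1 skips the diagonal), so each pair appears once
rows, cols = np.triu_indices(len(corr), k=1)
pairs = pd.Series(
    corr.values[rows, cols],
    index=pd.MultiIndex.from_arrays([corr.index[rows], corr.columns[cols]]),
).sort_values(ascending=False)

# Pairs above an illustrative threshold of 0.95
print(pairs[pairs > 0.95])
```

Pairs such as `mean radius` and `mean perimeter` (0.998 in the matrix above) carry nearly the same information, which is expected because both measure the size of the same cell nuclei.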


## Question 4

Which variables display the strongest correlation with the `target` variable?

Select the final column of the correlation matrix (the `target` column), sort the column by absolute values, and display it.

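Among the correlations shown below, `worst perimeter` (-0.783), `mean concave points` (-0.777), and `worst radius` (-0.776) have the largest absolute correlations with the target variable. `smoothness error` has the largest signed value (0.067), but that is one of the weakest correlations in absolute terms.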

In [12]:
df.corrwith(df['target'])


mean radius               -0.730029
mean texture              -0.415185
mean perimeter            -0.742636
mean area                 -0.708984
mean smoothness           -0.358560
mean compactness          -0.596534
mean concavity            -0.696360
mean concave points       -0.776614
mean symmetry             -0.330499
mean fractal dimension     0.012838
radius error              -0.567134
texture error              0.008303
perimeter error           -0.556141
area error                -0.548236
smoothness error           0.067016
compactness error         -0.292999
concavity error           -0.253730
concave points error      -0.408042
symmetry error             0.006522
fractal dimension error   -0.077972
worst radius              -0.776454
worst texture             -0.456903
worst perimeter           -0.782914
worst area                -0.733825
worst smoothness          -0.421465
worst compactness         -0.590998
worst concavity           -0.659610
worst concave points      -0

In [13]:

df['target'].sort_values(key=abs)

0      0
190    0
193    0
194    0
196    0
      ..
269    1
268    1
267    1
281    1
568    1
Name: target, Length: 569, dtype: int32
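
Note that `df['target'].sort_values(key=abs)` sorts the target values themselves (the 0s and 1s), not the correlations. A minimal sketch of what the question asks for is given below; it rebuilds `df` so the block runs on its own, and the names `target_corr` and `ranked` are only for illustration.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
df['target'] = cancer.target

# Correlation of every feature with the target (drop the target's correlation with itself)
target_corr = df.corr()['target'].drop('target')

# Sort by absolute value, strongest first, while keeping the original signs
ranked = target_corr.sort_values(key=abs, ascending=False)
print(ranked.head(10))
```

Sorting with `key=abs` rather than on the raw values matters here because almost all of the correlations are negative: a plain descending sort would put the weak positive correlations at the top.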

# Part 2: Modeling

Now that you've done some basic preprocessing and data exploration, let's implement a `KNeighborsClassifier` to predict which tumors are cancerous (where `target = 0`).

## Question 5

Assign the target column to `y`, and assign the dataframe with the target column removed to `X`.

Create a train/test split using `X` and `y`.


In [19]:

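# Features: every column except the target
cleandf = df.drop(['target'], axis=1)
x = cleandf
# Labels: the target column
y = df['target']
# Default split is 75% train / 25% test; no random_state, so results vary between runs
x_train, x_test, y_train, y_test = train_test_split(x, y)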
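
Because the split above is random, the scores reported later in this notebook will change from run to run. A hedged sketch of a reproducible variant follows. The `random_state=0` seed and the `stratify=y` option are assumptions added here for illustration; they were not used to produce the outputs shown in this notebook.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
df['target'] = cancer.target

X = df.drop(columns=['target'])
y = df['target']

# random_state fixes the shuffle; stratify=y keeps the class proportions similar in both parts
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, stratify=y)

print(X_train.shape, X_test.shape)
print(y_train.mean(), y_test.mean())
```

With the default 25% test size, 569 rows split into 426 training rows and 143 test rows.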

## Question 6

Create a `KNeighborsClassifier`. Fit the classifier to the training data. Then print both the training and test scores.

How accurately does the `KNeighborsClassifier` predict whether a tumor is cancerous vs. benign?

In [20]:
KNN = KNeighborsClassifier(n_neighbors=5)

KNN.fit(x_train,y_train)

print(KNN.score(x_train,y_train))
print(KNN.score(x_test,y_test))

0.9460093896713615
0.9230769230769231


This model predicts the held-out test set with about 92.3% accuracy (94.6% on the training set).
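
A single accuracy number does not show whether the errors fall on cancerous or benign tumors. One possible way to break the test accuracy down by class is sketched below, using `confusion_matrix` from `sklearn.metrics`. The split uses an assumed `random_state=0`, so its numbers will not match the scores printed above exactly.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# Rows are true classes, columns are predicted classes (0 = malignant, 1 = benign)
cm = confusion_matrix(y_test, knn.predict(X_test))
print(cm)

# Fraction of each true class that was predicted correctly
per_class_accuracy = cm.diagonal() / cm.sum(axis=1)
print(per_class_accuracy)
```

In a screening setting, the first entry (the share of malignant tumors correctly identified) is usually the one to watch most closely.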

## Question 7
Create a loop that builds kNN classifiers with either distance or uniform weighting, with numbers of neighbors varying between 1 and 20. What is the best combination? Produce a list consisting of test accuracy, training accuracy, number of neighbors and weighting choice. The list should be sorted by test accuracy.

<hr>

<i>Hints: </i>
<br>
<i>
Create two dicts: training_accuracy and test_accuracy. Make two loops: an outer loop ranging over number of neighbors, and an inner loop ranging over weighting choice. Then build the model with the current options and create a key unique to that option (e.g., by concatenating the two options). Then you can store the current results in the training and test dicts, using the current key. When you're done with the loops, print out the sorted results of the two dicts.</i>

In [21]:

training_accuracy = {}
test_accuracy = {}

# Outer loop: number of neighbors (1 to 20); inner loop: weighting scheme
for n_neighbors in range(1, 21):
    for weight in ["uniform", "distance"]:
        knn = KNeighborsClassifier(n_neighbors=n_neighbors, weights=weight)
        knn.fit(x_train, y_train)

        train_acc = knn.score(x_train, y_train)
        test_acc = knn.score(x_test, y_test)

        # Key unique to this combination, e.g. "5_distance"
        key = f"{n_neighbors}_{weight}"
        training_accuracy[key] = train_acc
        test_accuracy[key] = test_acc

# Sort the combinations by test accuracy, highest first
sorted_results = sorted(test_accuracy.items(), key=lambda x: x[1], reverse=True)

print("Test Accuracy | Train Accuracy | Neighbors | Weighting")
for key, test_acc in sorted_results:
    train_acc = training_accuracy[key]
    n_neighbors, weight = key.split("_")
    print(f"{test_acc:.4f} | {train_acc:.4f} | {n_neighbors} | {weight}")




Test Accuracy | Train Accuracy | Neighbors | Weighting
0.9301 | 1.0000 | 5 | distance
0.9231 | 0.9695 | 2 | uniform
0.9231 | 0.9460 | 5 | uniform
0.9231 | 0.9437 | 6 | uniform
0.9231 | 1.0000 | 6 | distance
0.9161 | 1.0000 | 1 | uniform
0.9161 | 1.0000 | 1 | distance
0.9161 | 1.0000 | 2 | distance
0.9161 | 0.9577 | 3 | uniform
0.9161 | 1.0000 | 3 | distance
0.9161 | 1.0000 | 4 | distance
0.9161 | 0.9437 | 7 | uniform
0.9161 | 1.0000 | 7 | distance
0.9161 | 1.0000 | 8 | distance
0.9161 | 0.9343 | 17 | uniform
0.9161 | 0.9343 | 18 | uniform
0.9161 | 1.0000 | 18 | distance
0.9161 | 0.9343 | 19 | uniform
0.9161 | 1.0000 | 19 | distance
0.9161 | 0.9343 | 20 | uniform
0.9161 | 1.0000 | 20 | distance
0.9091 | 0.9507 | 4 | uniform
0.9091 | 0.9484 | 8 | uniform
0.9091 | 0.9390 | 9 | uniform
0.9091 | 1.0000 | 9 | distance
0.9091 | 0.9390 | 10 | uniform
0.9091 | 1.0000 | 10 | distance
0.9091 | 0.9437 | 11 | uniform
0.9091 | 1.0000 | 11 | distance
0.9091 | 0.9437 | 12 | uniform
0.9091 | 1.0000 | 1