# Using a dataset from [Kaggle](https://www.kaggle.com/uciml/breast-cancer-wisconsin-data) to predict breast cancer diagnosis

## What will this notebook cover?
This notebook will have EDA peppered in, plus two models: a Random Forest and a K-nearest neighbors (KNN) classifier.

## Goal of the project
My main objective is to learn by doing.  I want to expand my skill set and I believe Random Forests and KNN models are two of the classic models to build.


# Data Dictionary

| Column | Meaning |
| ------ | ------- |
| ID | The unique identifier for each person |
| Diagnosis | The diagnosis of breast tissue (M = malignant, B = benign) |
| Radius_mean | Mean of distances from center to points on the perimeter |
| Texture_mean | Standard deviation of grey-scale values |
| Perimeter_mean | Mean size of the core tumor |
| Area_mean | The mean of the area |
| Smoothness_mean | Mean of local variation in radius lengths |
| Compactness_mean | Mean of perimeter<sup>2</sup> / area - 1.0 |
| Concavity_mean | Mean of severity of concave portions of the contour |
| Concave points_mean | Mean for number of concave portions of the contour |
| Symmetry_mean | |
| Fractal_dimension_mean | |
| Radius_se | |
| Texture_se | Standard error of the texture |
| Perimeter_se | |
| Area_se | |
| Smoothness_se | |
| Compactness_se | |
| Concavity_se | |
| Concave points_se | |
| Symmetry_se | |
| Fractal_dimension_se | |
| Radius_worst | |
| Texture_worst | |
| Perimeter_worst | |
| Area_worst | |
| Smoothness_worst | |
| Compactness_worst | |
| Concavity_worst | |
| Concave points_worst | |
| Symmetry_worst | |
| Fractal_dimension_worst | |
| Unnamed: 32 | An empty column that we drop later |

# Let's import the libraries

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

import tensorflow as tf
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
from sklearn.utils import compute_class_weight
from sklearn.neighbors import KNeighborsClassifier


import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
#Use pandas to load the uploaded CSV into a dataframe
dataset = pd.read_csv("/kaggle/input/breast-cancer-wisconsin-data/data.csv")

In [None]:
dataset.head()

# Let's familiarize ourselves with the data

This is a function I commonly use to find a lot of quick and important information about the dataset I am using.

The methods and attributes I use here are:
`head()`, `info()`, `describe()`, `tail()`, `columns`, `dtypes`, `shape`

In [None]:
def background_check(dataframe):
  """
    Summary: This function gives us a quick look at the key components of our data

    Description: These are pandas functions I use all the time and I figured I may
    as well put them all in one function.  This typically lives in a utils file and is
    imported for convenience.

    Parameters: Dataframe

    Return: None
  """
  print("#" * 100)
  print(dataframe.head())

  print("#" * 100)
  dataframe.info() #info() prints its own output and returns None, so no print is needed

  print("#" * 100)
  print(dataframe.describe())

  print("#" * 100)
  print(dataframe.tail())

  print("#" * 100)
  print(dataframe.columns)

  print("#" * 100)
  print(dataframe.dtypes)

  print("#" * 100)
  print(dataframe.shape)

In [None]:
#Call the background check
background_check(dataset)

> **Note!**: The only categorical column is the diagnosis which we can handle.  Let's check our value counts to see how much we have of each.

> **Note!**: The value counts for the other columns do not help as much, since they hold many distinct values, so I removed them to keep this notebook clean.

In [None]:
dataset.diagnosis.value_counts()

What can we tell from this?  We can see we have 357 benign and 212 malignant.  This is an imbalance but it is okay for now.
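
To put the imbalance in proportion, `value_counts(normalize = True)` reports the share of each class instead of the raw count. The sketch below rebuilds the label column from the two counts above rather than reading the CSV, so the `labels` series is only a stand-in for `dataset.diagnosis`.

```python
import pandas as pd

#Stand-in for dataset.diagnosis, rebuilt from the counts above (357 benign, 212 malignant)
labels = pd.Series(["B"] * 357 + ["M"] * 212, name = "diagnosis")

#normalize = True returns fractions instead of raw counts
proportions = labels.value_counts(normalize = True)
print(proportions.round(3))

#Ratio of the majority class to the minority class
counts = labels.value_counts()
imbalance_ratio = counts.max() / counts.min()
print("Benign to malignant ratio:", round(imbalance_ratio, 2))
```

That works out to roughly 63% benign and 37% malignant, which is a mild imbalance rather than a severe one.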

# It is time to visualize this data!

What type of graphs should we make?

How about we use these:
- Correlations (to help see which columns move together)
- Count plots (to help visualize the value counts)
- Distplots (to help see the data distribution and see if we have outliers)
- Violin plots (to help see the data distribution in a different light)

There are others we could try: pair plots, bar charts, pie charts, etc., but let's keep it simple!

As Andrej Karpathy said, "Become one with the data."  Let's go!

# Correlations and Heatmaps

Why correlations and heatmaps?

Well, a correlation heatmap is a great way to familiarize ourselves with our data.  We can see from the colors and the numbers which columns have meaning to us and which do not.

In [None]:
def correlations(dataframe):
  """
  Summary: This will plot the correlations in a heatmap

  Description: This function will plot the correlations in a heatmap format for us.
  We have about 30 columns so it will be a big graph.  Let's take it step by step

  Parameters:  A dataframe

  Return: None
  """

  #We need to drop the useless Unnamed 32 column and keep only the numeric columns,
  #since the text diagnosis column cannot be correlated
  dataframe = dataframe.drop(["Unnamed: 32"], axis = 1).select_dtypes(include = "number")

  #Create the correlation matrix
  corr = dataframe.corr(method = "pearson")

  #This is our figure we will put the heatmap on
  fig, ax = plt.subplots(figsize = (20, 15))

  #This is the actual heatmap making with the arguments specifically chosen.  There are comments next to each to see what it does
  sns_corr = sns.heatmap(corr,                                   #The data to correlate
                         annot = True,                           #The numbers in the boxes
                         fmt = ".1g",                            #A formatting option for the numbers in the boxes
                         vmin = -1,                              #Take a glance at the right and see the bar?  We are adjusting that minimum dark blue value
                         vmax = 1,                               #This is the same as before but the deep red
                         center = 0,                             #This is the central value for the color.  Think of this as neutral
                         cmap = "coolwarm",                      #The color scheme of the heatmap
                         linecolor = "black",                    #The dark line clearly separating the values
                         linewidths = 3,                         #The thickness of the black line
                         cbar_kws = {"orientation": "vertical"}) #The bar on the right is set to vertical

  #The column names are very close so make the rotation 90 degrees to have them fit.  I am sorry - you have to turn your head to see them all.
  sns_corr.set_xticklabels(sns_corr.get_xticklabels(), rotation = 90)

In [None]:
correlations(dataset)

This is a lot to take in.  We can take our time going through to develop our own intuition about the data.  Let's keep going!
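
One way to take the heatmap step by step is to pull out only the part that matters most for the models: how each feature correlates with the diagnosis. This is a minimal sketch and not part of the original notebook. It assumes scikit-learn's bundled `load_breast_cancer` copy of the Wisconsin data in place of the Kaggle CSV (that copy names columns slightly differently, for example `mean radius`, and has no `id` column), and the `malignant` column is a helper introduced here.

```python
from sklearn.datasets import load_breast_cancer

#Assumption: scikit-learn's bundled copy of the Wisconsin data stands in for the Kaggle CSV
frame = load_breast_cancer(as_frame = True).frame

#In this copy target is 0 for malignant and 1 for benign, so flip it to get a malignant flag
frame["malignant"] = 1 - frame["target"]

#Pearson correlation of every feature with the malignant flag, strongest first
correlations_with_diagnosis = (
  frame.drop(columns = ["target"])
       .corr(method = "pearson")["malignant"]
       .drop("malignant")
       .sort_values(key = abs, ascending = False)
)

print(correlations_with_diagnosis.head(10))
```

Features near the top of this list are the ones worth looking for first in the big heatmap.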

# Countplots

Why countplots?

Well, a countplot shows the result of `value_counts()` as a graph.  I used the value counts above and I wanted a plot too, since I learn more with the help of visuals and by doing.

In [None]:
def countplots(dataframe):
  """
  Summary:  This will show us the countplot of the diagnosis

  Description:  This shows us the countplot of only one column but I am trying to
  modularize my code so I put it in its own function

  Parameters: Dataframe

  Return: None
  """
  ax = sns.countplot(x = "diagnosis", data = dataframe)

In [None]:
countplots(dataset)

This visualizes the value counts for us.  We can easily see our class imbalance now.

# Distplots

Why distplots?

Distplots can help us see the distribution of the data.  We can easily see what needs to be standardized.

In [None]:
def distplots(dataframe):
  """
  Summary: This will make a distribution plot for all of the columns

  Description: This will make distribution plots for all of the data.  This will 
  help us see what needs to be standardized and normalized.

  Parameters:  Dataframe

  Return: None
  """
  #Every numeric column except the empty Unnamed 32 one
  columns = dataframe.drop(["Unnamed: 32"], axis = 1, errors = "ignore").select_dtypes(include = "number").columns

  #The 31 numeric columns (id included) need 11 rows of 3
  fig, ax = plt.subplots(nrows = 11, ncols = 3, figsize = (30, 33))

  #histplot with kde = True is the axes-level replacement for the deprecated distplot,
  #so it draws on the ax it is given (the figure-level displot does not)
  for axis, column in zip(ax.flatten(), columns):
    sns.histplot(x = dataframe[column], kde = True, ax = axis)

  #Hide the leftover empty axes at the end of the grid
  for axis in ax.flatten()[len(columns):]:
    axis.set_visible(False)

  #Tighten the layout before showing the figure
  fig.tight_layout()
  plt.show()

In [None]:
distplots(dataset)

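# What can we see?

The distributions give us an idea of what needs to be normalized and standardized.

A note on the subplots: `sns.displot` is a figure-level function, so it creates its own figure and ignores the `ax` argument, which means its plots land outside the grid.  Axes-level functions such as `sns.histplot` draw on the `ax` they are given.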
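
The plots can also be backed up with numbers. This is a minimal sketch, assuming scikit-learn's bundled copy of the same data in place of the Kaggle CSV (so the column names differ slightly and there is no `id` column). It lists each feature's range and skewness, since features on very different scales are the ones that benefit most from normalization.

```python
from sklearn.datasets import load_breast_cancer

#Assumption: scikit-learn's bundled copy of the data stands in for the Kaggle CSV
feature_frame = load_breast_cancer(as_frame = True).data

#Min, max, and skewness per feature, one row per feature after the transpose
summary = feature_frame.agg(["min", "max", "skew"]).T
summary["range"] = summary["max"] - summary["min"]

#Largest ranges first: these columns would dominate a distance-based model such as KNN
print(summary.sort_values("range", ascending = False).head())
```

The area columns span thousands of units while several other features stay well below 1, which is exactly the kind of mismatch normalization removes.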

# Violin plots

Why violin plots?

Violin plots show us what a distribution plot shows, but split by diagnosis, which makes the differences between the two classes even easier to see (at least I think so).

In [None]:
def violin_plots(dataframe):
  """
  Summary: This will make violin plots for the data
  
  Description: This will make a bunch of violin plots for the data to help us visualize
  
  Parameters: Dataframe
  
  Return: None
  """
  #Every numeric column except the empty Unnamed 32 one
  columns = dataframe.drop(["Unnamed: 32"], axis = 1, errors = "ignore").select_dtypes(include = "number").columns

  #The 31 numeric columns (id included) need 11 rows of 3
  fig, ax = plt.subplots(nrows = 11, ncols = 3, figsize = (30, 22))

  #One violin plot per column, split by diagnosis, each on its own axis
  for axis, column in zip(ax.flatten(), columns):
    sns.violinplot(x = dataframe[column], y = dataframe["diagnosis"], ax = axis)

  #Hide the leftover empty axes at the end of the grid
  for axis in ax.flatten()[len(columns):]:
    axis.set_visible(False)

  fig.tight_layout()
  plt.show()

In [None]:
violin_plots(dataset)

# Models!

Let's split the data, then try a random forest and a K-nearest neighbors (KNN) classifier.

In [None]:
#Split the data
X = dataset.drop(["Unnamed: 32", "diagnosis"], axis = 1)
y = dataset["diagnosis"] #A single label column (B or M); one-hot encoding it would turn this into a multi-label problem

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
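
Because the classes are imbalanced, one option (an assumption here, not what the cell above does) is a stratified split, which keeps the benign to malignant ratio the same in the training and test sets. The sketch uses stand-in labels rebuilt from the value counts and a dummy feature column, plus a fixed `random_state` so the split is repeatable. The variable names are chosen so they do not overwrite the notebook's own `X_train` and friends.

```python
import numpy as np
from sklearn.model_selection import train_test_split

#Stand-in data: 357 benign and 212 malignant labels with one dummy feature column
demo_labels = np.array(["B"] * 357 + ["M"] * 212)
demo_features = np.arange(len(demo_labels)).reshape(-1, 1)

#stratify = demo_labels keeps the class ratio equal in train and test
features_train, features_test, labels_train, labels_test = train_test_split(
  demo_features, demo_labels, test_size = 0.2, stratify = demo_labels, random_state = 42
)

print("Malignant share overall:", round((demo_labels == "M").mean(), 3))
print("Malignant share in train:", round((labels_train == "M").mean(), 3))
print("Malignant share in test:", round((labels_test == "M").mean(), 3))
```

Without `stratify`, a small test set can end up with noticeably more or fewer malignant cases than the full data just by chance.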

# Random forest with no normalization

Let's see how it does with no normalization and then we can experiment.

In [None]:
def random_forest_model():
  """
  Summary: This will create the random forest

  Description: This will create, fit, and evaluate the random forest.  We also print a
  classification report to see the precision, recall, f1-score, and support

  Parameters: None

  Return: None
  """
  random_forest = RandomForestClassifier()

  random_forest.fit(X_train, y_train)

  random_forest_preds = random_forest.predict(X_test)

  print("Accuracy is: ", accuracy_score(y_test, random_forest_preds) * 100)
  print(classification_report(y_test, random_forest_preds))

In [None]:
random_forest_model()

# KNN with no normalization

Let's see how this does with no normalization and then we can experiment.

In [None]:
def knn():
  """
  Summary: This is the K Nearest Neighbors example

  Description: This is the creation, fitting, and evaluating of the KNN model

  Parameters: None

  Return: None
  """
  neighbors = KNeighborsClassifier(n_neighbors=3)
  neighbors.fit(X_train, y_train)
  neighbors_preds = neighbors.predict(X_test)

  print("Accuracy is: ", accuracy_score(y_test, neighbors_preds) * 100)
  print(classification_report(y_test, neighbors_preds))

In [None]:
knn()

# What can we see from this?

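The random forest is the winner.  I used all the defaults and it still scores in the high 90s, even without normalization.  The RF clearly beats the KNN classifier.  That makes sense: KNN relies on distances, so unscaled columns with large values dominate, while tree splits do not care about scale.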
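
A single train/test split can flatter or punish a model by chance, so one way to double-check the ranking is k-fold cross-validation. This is a sketch under assumptions: it uses scikit-learn's bundled copy of the data (which has no `id` column), 5 folds, and a fixed `random_state` for the forest, none of which come from the cells above, so the scores will not match this notebook's exactly.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

#Assumption: scikit-learn's bundled copy of the data stands in for the Kaggle CSV
cv_features, cv_target = load_breast_cancer(return_X_y = True)

models = {
  "Random forest": RandomForestClassifier(random_state = 42),
  "KNN (3 neighbors)": KNeighborsClassifier(n_neighbors = 3),
}

#5-fold cross-validated accuracy for each model, both on unscaled features
cv_scores = {}
for name, model in models.items():
  scores = cross_val_score(model, cv_features, cv_target, cv = 5, scoring = "accuracy")
  cv_scores[name] = scores
  print(f"{name}: mean accuracy {scores.mean():.3f} (std {scores.std():.3f})")
```

Comparing the mean and the spread across folds gives a steadier picture than one accuracy number from one split.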

What can we do differently?  Well, we can normalize the data as a start.  We can also try a different algorithm: logistic regression, SVM, etc.  We can tune the hyperparameters of the models: more estimators, a different number of neighbors, the depth of the trees, etc.
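
As a starting point for the normalization and hyperparameter ideas, here is a hedged sketch that puts the `MinMaxScaler` imported earlier in front of KNN and searches over the number of neighbors. Assumptions: scikit-learn's bundled copy of the data instead of the Kaggle CSV, a stratified split with a fixed `random_state`, and a grid of `n_neighbors` values picked purely for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

#Assumption: scikit-learn's bundled copy of the data stands in for the Kaggle CSV
tune_features, tune_target = load_breast_cancer(return_X_y = True)
features_train, features_test, target_train, target_test = train_test_split(
  tune_features, tune_target, test_size = 0.2, stratify = tune_target, random_state = 42
)

#Inside the search the scaler is refit on each set of training folds, so no validation data leaks into it
scaled_knn = Pipeline([
  ("scaler", MinMaxScaler()),
  ("knn", KNeighborsClassifier()),
])

#Illustrative grid; the candidate values are arbitrary choices
search = GridSearchCV(scaled_knn, {"knn__n_neighbors": [3, 5, 7, 9, 11]}, cv = 5)
search.fit(features_train, target_train)

test_accuracy = search.score(features_test, target_test)
print("Best n_neighbors:", search.best_params_["knn__n_neighbors"])
print("Test accuracy:", round(test_accuracy * 100, 2))
```

The same pattern works for the random forest by swapping the final pipeline step and searching over parameters such as `n_estimators` or `max_depth`.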

I am going to put this on my GitHub with a README.md.  If you wish to view it there, feel free to check it out.