<h1>Breast Cancer Classification Practice</h1>
<h3>The data is from the Breast Cancer Wisconsin (Diagnostic) Data Set and can be found <a href="https://www.kaggle.com/uciml/breast-cancer-wisconsin-data">here</a>.</h3>
<p>The dataset is a labeled set of breast tissue cells with a variety of features. The label is either "M" for malignant or "B" for benign. The goal of this notebook is to practice the Python data science APIs I have learned, as well as basic machine learning. This notebook uses pandas to manipulate the data set, matplotlib to plot data, and scikit-learn for machine learning, including classification and prediction of cancerous cells.</p>

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm
from matplotlib import style
style.use("ggplot")
%matplotlib inline

In [None]:
cells = pd.read_csv("../input/data.csv")

<h2>I converted the diagnosis from "M" for malignant and "B" for benign to 1 and 0 respectively so that the diagnosis could be correlated with the other features as shown below. </h2>

In [None]:
# Encode the diagnosis as an integer: 1 for malignant ("M"), 0 for anything else ("B").
# The comparison is vectorized, so no explicit row loop is needed.
cells["diagnosis"] = (cells["diagnosis"] == "M").astype(int)

In [None]:
print(cells.head())

In [None]:
cells.diagnosis.value_counts().plot(kind='barh')

In [None]:
cells.describe()

<h2>Using pandas, we can explore the dataset to find how the features correlate with one another.</h2>

In [None]:
cells.corr()

<h2>Since we are most interested in the factors that predict diagnosis, we can focus on the diagnosis column. Judging from the output below, "concave points_worst" is the feature most strongly correlated with the diagnosis of each cell.</h2>

In [None]:
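# Correlation of every column with the diagnosis label, computed once and reused
diagnosis_corr = cells.corr()["diagnosis"]
diagnosis_corr.plot(kind="bar")
diagnosis_corr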
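
<p>The ranking read off the bar chart can also be made explicit by sorting the correlations by absolute value, so that strong negative correlations are not overlooked. The following is a minimal sketch on a hypothetical toy DataFrame (the names <code>toy</code>, <code>feature_a</code>, <code>feature_b</code> and <code>feature_c</code> are made up for illustration and are not columns of the real data set).</p>

```python
import pandas as pd

# Hypothetical toy data standing in for the real `cells` DataFrame.
toy = pd.DataFrame({
    "diagnosis": [1, 1, 1, 0, 0, 0],
    "feature_a": [9.0, 8.0, 7.5, 2.0, 3.0, 1.0],  # tends to rise with diagnosis
    "feature_b": [1.0, 1.0, 1.0, 9.0, 9.0, 9.0],  # falls with diagnosis
    "feature_c": [5.0, 1.0, 3.0, 5.0, 1.0, 3.0],  # same values in both classes
})

# Correlation of each feature with the label, excluding the label itself.
corr_with_label = toy.corr()["diagnosis"].drop("diagnosis")

# Order features by the magnitude of their correlation, keeping the sign.
order = corr_with_label.abs().sort_values(ascending=False).index
ranked = corr_with_label.loc[order]
print(ranked)
```

<p>On the real data, the same two sorting lines could be applied to <code>cells.corr()["diagnosis"]</code>.</p>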

<h2>Now we can use scikit-learn to split our data into training and testing sets, train a classifier, and make predictions on either the testing set or completely new data.</h2>

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
cells.columns

In [None]:
# Create a support vector classifier with a linear kernel
clf = svm.SVC(kernel='linear', C=1.0)

In [None]:
X = cells[['radius_mean', 'texture_mean', 'perimeter_mean',
          'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
          'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
          'radius_se', 'texture_se', 'perimeter_se', 'area_se','smoothness_se',
          'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
          'fractal_dimension_se','radius_worst','texture_worst',
          'perimeter_worst', 'area_worst', 'smoothness_worst',
          'compactness_worst', 'concavity_worst', 'concave points_worst',
          'symmetry_worst','fractal_dimension_worst']]
y = cells['diagnosis']

# Split the data into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# This trains the classifier model
clf.fit(X_train,y_train)
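
<p>The features here live on very different scales (areas are in the hundreds, smoothness values are well below 1), and support vector machines are generally sensitive to that. One possible variation, not part of the original workflow, is to standardize the features in a pipeline before fitting. The sketch below assumes scikit-learn's bundled copy of the data set (<code>load_breast_cancer</code>) as a stand-in for <code>data.csv</code>; the variable names are illustrative, and the bundled labels are encoded 0 = malignant, 1 = benign, the reverse of this notebook.</p>

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: the bundled data set stands in for data.csv.
# Its labels are 0 = malignant, 1 = benign (reverse of this notebook).
X_demo, y_demo = load_breast_cancer(return_X_y=True)

# Stratify so both splits keep the same malignant/benign ratio.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=42, stratify=y_demo
)

# The scaler is fitted on the training split only and reused on the test split.
scaled_clf = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0))
scaled_clf.fit(X_tr, y_tr)

# Fraction of test samples classified correctly.
scaled_accuracy = scaled_clf.score(X_te, y_te)
print(scaled_accuracy)
```

<p>Wrapping the scaler and classifier in one pipeline keeps test-set statistics from leaking into training.</p>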

In [None]:
# Import the scikit-learn functions to compute error and accuracy
from sklearn.metrics import mean_squared_error
from sklearn.metrics import accuracy_score

# Generate our predictions for the test set.
predictions = clf.predict(X_test)

# Compute the error between the actual values and our test predictions
# (scikit-learn metrics take the true labels first, then the predictions)
mean_squared_error(y_test, predictions)
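
<p>Because the labels are 0 and 1, every squared error is either 0 or 1, so the mean squared error is simply the fraction of misclassified samples, that is, 1 minus the accuracy. A small sketch with hypothetical labels (not results from the model above) illustrates this:</p>

```python
import numpy as np

# Hypothetical true labels and predictions, for illustration only.
y_true = np.array([1, 0, 0, 1, 1, 0, 1, 0])
y_pred = np.array([1, 0, 1, 1, 0, 0, 1, 0])

# Each difference is -1, 0, or 1, so each squared error is 0 or 1.
mse = np.mean((y_true - y_pred) ** 2)

# Fraction of samples predicted incorrectly and correctly.
error_rate = np.mean(y_true != y_pred)
accuracy = np.mean(y_true == y_pred)

print(mse, error_rate, accuracy)
```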


In [None]:
# Accuracy Score
print(accuracy_score(y_test, predictions))

print("{}%".format(round(accuracy_score(y_test, predictions)*100)))
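
<p>Accuracy alone does not show which kind of mistake the classifier makes. A confusion matrix separates malignant cells predicted as benign (false negatives) from benign cells predicted as malignant (false positives). The sketch below uses hypothetical labels rather than the model's output, with 1 = malignant as in this notebook.</p>

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical true labels and predictions (1 = malignant, 0 = benign).
y_true = np.array([1, 0, 0, 1, 1, 0, 1, 0])
y_pred = np.array([1, 0, 1, 1, 0, 0, 1, 0])

# Rows are true classes, columns are predicted classes, ordered [0, 1].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Recall for the malignant class: the share of malignant cells that were caught.
malignant_recall = recall_score(y_true, y_pred)

print(tn, fp, fn, tp, malignant_recall)
```

<p>With the real model, <code>y_test</code> and <code>predictions</code> could be passed in place of the toy arrays.</p>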