### In this project, we use the Breast Cancer Wisconsin dataset from scikit-learn to demonstrate and review a basic data analysis workflow. Example steps include understanding and cleaning the data, training and evaluating a model, and tuning its performance.

#### First, we import the data

In [8]:
from sklearn.datasets import load_breast_cancer
import numpy as np

In [2]:
breat_cancer = load_breast_cancer()

In [4]:
print(f'The breast cancer dataset includes the following elements: {breat_cancer.keys()}')

The breast cancer dataset includes the following elements: dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])


#### Let's further explore by selecting some data samples from each element

In [5]:
breat_cancer['data'][0] #all numeric

array([1.799e+01, 1.038e+01, 1.228e+02, 1.001e+03, 1.184e-01, 2.776e-01,
       3.001e-01, 1.471e-01, 2.419e-01, 7.871e-02, 1.095e+00, 9.053e-01,
       8.589e+00, 1.534e+02, 6.399e-03, 4.904e-02, 5.373e-02, 1.587e-02,
       3.003e-02, 6.193e-03, 2.538e+01, 1.733e+01, 1.846e+02, 2.019e+03,
       1.622e-01, 6.656e-01, 7.119e-01, 2.654e-01, 4.601e-01, 1.189e-01])

In [9]:
np.unique(breat_cancer['target']) # We have two classes; what do they represent?

array([0, 1])

In [10]:
breat_cancer['target_names'] # The array is indexed by class label, which suggests 0 = malignant and 1 = benign

array(['malignant', 'benign'], dtype='<U9')

#### To confirm what 0 and 1 represent in the target, we check the dataset description and the class counts

In [16]:
print(breat_cancer['DESCR'])
# The description gives the class distribution: 212 malignant, 357 benign

.. _breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry 
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 3 is Mean Radius, f

In [20]:
#If we try to get the value counts:
cl, cl_counts = np.unique(breat_cancer['target'], return_counts = True)
dict(zip(cl, cl_counts))

{0: 212, 1: 357}

#### So we conclude that 0 means malignant and 1 means benign. This is contrary to the common expectation (or at least mine) that 1 means malignant
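The same mapping can be read directly from `target_names`, because that array is indexed by class label. The sketch below is one way to make the check explicit; the names `dataset`, `label_to_name` and `counts_by_name` are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

dataset = load_breast_cancer()

# target_names is indexed by class label: position 0 names class 0, and so on.
label_to_name = {int(label): str(name)
                 for label, name in enumerate(dataset.target_names)}

# Count the samples per class and attach the class names to the counts.
labels, counts = np.unique(dataset.target, return_counts=True)
counts_by_name = {label_to_name[int(label)]: int(count)
                  for label, count in zip(labels, counts)}

print(label_to_name)
print(counts_by_name)
```

The counts per name should agree with the class distribution quoted in the description (212 malignant, 357 benign).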

#### Lastly, let's look at feature names

In [22]:
breat_cancer['feature_names']

array(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
       'mean smoothness', 'mean compactness', 'mean concavity',
       'mean concave points', 'mean symmetry', 'mean fractal dimension',
       'radius error', 'texture error', 'perimeter error', 'area error',
       'smoothness error', 'compactness error', 'concavity error',
       'concave points error', 'symmetry error',
       'fractal dimension error', 'worst radius', 'worst texture',
       'worst perimeter', 'worst area', 'worst smoothness',
       'worst compactness', 'worst concavity', 'worst concave points',
       'worst symmetry', 'worst fractal dimension'], dtype='<U23')

#### Feature names are useful for interpreting the results. However, since this project focuses on predictive performance, we do not use feature_names further. This completes the data-understanding step

---------------------------------------------------------

#### Next, we check whether the data is complete by looking for NA values. First, we assign the data to X and y

In [24]:
X = breat_cancer['data']
y = breat_cancer['target']

In [32]:
#Check for NA values
print(f'X contains NA values: {np.isnan(X).any()}')
print(f'y contains NA values: {np.isnan(y).any()}')

X contains NA values: False
y contains NA values: False


#### Great, we do not have any NA values, so we move on to the next step
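If the check above had returned True, a per-feature count would show where the gaps are. The following is a minimal sketch of that idea, which also counts infinite values since `np.isnan` does not flag them; the names `missing_per_feature` and `n_non_finite` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

dataset = load_breast_cancer()
X = dataset.data

# Number of NaN entries in each of the feature columns.
missing_per_feature = np.isnan(X).sum(axis=0)

# Number of entries that are NaN or +/- infinity anywhere in X.
n_non_finite = int((~np.isfinite(X)).sum())

print(f'Features with missing values: {int((missing_per_feature > 0).sum())}')
print(f'Non-finite entries in X: {n_non_finite}')
```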

---------------

#### Next, we select a model to predict whether a tumor is malignant or benign. We start with a Support Vector Machine (SVM), since it generally performs well and has hyper-parameters we can tune for demonstration

In [33]:
from sklearn.svm import SVC

#### Start by splitting the data into training and testing sets

In [36]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size = 0.2, random_state = 1, stratify = y)

#### We stratify by y so that the training and testing sets keep the same class proportions as the full dataset
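A quick way to see the effect of stratification is to compare the share of benign samples in each set. This is a sketch that repeats the split above with the same arguments; the helper `benign_share` is a hypothetical name introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y)

def benign_share(labels):
    """Fraction of samples labelled 1 (benign)."""
    return float(np.mean(labels == 1))

# With stratify=y, the three shares should be nearly identical.
for name, labels in [('full', y), ('train', y_train), ('test', y_test)]:
    print(f'{name:5s} n={len(labels):3d} benign share={benign_share(labels):.3f}')
```

Without `stratify`, the shares can drift apart by chance, which matters more as the test set gets smaller.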

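#### Next, we standardize the features. SVMs are sensitive to feature scale, and these features range from about 1e-3 to 1e+3. We fit the scaler on the training set only and reuse it on the test set, so that no test-set information leaks into training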

In [38]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)
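To confirm that the scaler did what we expect, we can inspect the per-feature mean and standard deviation after the transform. The sketch below repeats the split and scaling steps so that it runs on its own; the names `train_means` and `train_stds` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y)

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)  # learn mean/std from training data
X_test_std = scaler.transform(X_test)        # reuse the training statistics

# Each training feature should now have mean ~0 and standard deviation ~1.
train_means = X_train_std.mean(axis=0)
train_stds = X_train_std.std(axis=0)
print(f'Largest |mean| on train: {np.abs(train_means).max():.2e}')
print(f'Std range on train: {train_stds.min():.3f} to {train_stds.max():.3f}')

# The test set uses the training statistics, so its means are only close to 0.
print(f'Largest |mean| on test: {np.abs(X_test_std.mean(axis=0)).max():.3f}')
```

The test-set means are not exactly zero, and that is expected: the scaler only ever sees the training data.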