# Introduction

This codebook contains snippets of Python code that you can cut-and-paste for data analysis and visualization projects.

For the dataset, we are using the BCA Wisconsin (breast cancer) dataset.
It is a `csv` formatted file containing morphological and image features of cell nuclei from breast cancer fine-needle aspirates.
Each sample is labeled as "benign" or "malignant".

# Imports

For this codebook we put some generic imports at the top, and then each section has its own imports for whatever it uses.
This means some things are imported more often than strictly necessary if the notebook is run top to bottom; however, it should make it easier to cut and paste later on.

In [None]:
# Generic Python imports
import os
import sys
import json

# Standard Scientific Python imports
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

%matplotlib inline

# Loading Data

We typically load data from `.csv` files, which are basically spreadsheets.

In Excel, you can export spreadsheets as `.csv` by going to "File" -> "Export" -> "Change File Type" and selecting **"CSV (Comma delimited) (*.csv)"** under "Other File Types".

When loading data into Python for data analysis, it's pretty common to use [Pandas](https://pandas.pydata.org/).
Here are some resources:

- [Statistical Data Analysis in Python](https://www.kdnuggets.com/2016/07/statistical-data-analysis-python.html)
    - [Introduction to Pandas](https://nbviewer.jupyter.org/urls/gist.github.com/fonnesbeck/5850375/raw/c18cfcd9580d382cb6d14e4708aab33a0916ff3e/1.+Introduction+to+Pandas.ipynb)
    - [Data Wrangling with Pandas](https://nbviewer.jupyter.org/urls/gist.github.com/fonnesbeck/5850413/raw/3a9406c73365480bc58d5e75bc80f7962243ba17/2.+Data+Wrangling+with+Pandas.ipynb)
    - [Plotting and Visualization](https://nbviewer.jupyter.org/urls/gist.github.com/fonnesbeck/5850463/raw/a29d9ffb863bfab09ff6c1fc853e1d5bf69fe3e4/3.+Plotting+and+Visualization.ipynb)
    - [Statistical Data Modeling](https://nbviewer.jupyter.org/urls/gist.github.com/fonnesbeck/5850483/raw/5e049b2fdd1c83ae386aa3205d3fc50a1a05e5b0/4.+Statistical+Data+Modeling.ipynb)
- [Official Documentation: 10 Minutes to Pandas](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html)
- [Pandas Cheatsheet](https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf)
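
As a quick sanity check of what `read_csv` produces, here is a minimal sketch that parses a tiny in-memory CSV instead of a file on disk. The column values are made up for illustration and are not taken from the BCA dataset.

```python
import io
import pandas as pd

# Hypothetical CSV content; a real project would read from a file path instead
csv_text = """id,diagnosis,radius_mean
1,M,17.9
2,B,12.4
3,B,11.8
"""

# read_csv accepts any file-like object, so StringIO stands in for a file
toy_frame = pd.read_csv(io.StringIO(csv_text))

# The header row becomes the column names and each column gets its own dtype
print(toy_frame.dtypes)
print(toy_frame.shape)
```

Checking `dtypes` right after loading is a cheap way to spot columns that were parsed as text when you expected numbers.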


In [None]:
# Use os.path.join to construct the file path
# Keeps things nice when going between Windows / non-Windows platforms
csv_file = os.path.join('data', 'bca_wisconsin', 'bca_wisconsin.csv')

# Use `read_csv` to get the file into a dataframe
data_frame = pd.read_csv(csv_file)

# Display the first 5 rows of the dataframe
data_frame.head()

In [None]:
data_frame.info()

# Data Wrangling and Exploration

The order of data wrangling techniques is debatable, but number one is **separate out a testing set**.

## Train-Test Splitting

Rule of thumb: **60-70%** of the data for training

- **Stratified** sampling: Maintain the overall label distribution in your training and testing sets
- **Random** sampling: Ignore labels in splitting up the data
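
To see the difference between the two sampling strategies, here is a small sketch on synthetic, imbalanced labels. It uses `train_test_split` with its `stratify` argument rather than the splitter object used elsewhere in this codebook; the data is made up.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 80 samples of class 0 and 20 of class 1
labels = np.array([0] * 80 + [1] * 20)
features = np.arange(100).reshape(-1, 1)

# Stratified: the 80/20 class ratio is preserved in both splits
_, _, y_train_strat, y_test_strat = train_test_split(
    features, labels, test_size=0.3, stratify=labels, random_state=42)

# Random: labels are ignored, so the test ratio can drift away from 80/20
_, _, y_train_rand, y_test_rand = train_test_split(
    features, labels, test_size=0.3, random_state=42)

print('Stratified test fraction of class 1:', y_test_strat.mean())
print('Random test fraction of class 1:    ', y_test_rand.mean())
```

The drift in the random split matters most for small or heavily imbalanced datasets.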

If you have labels, then perform stratified sampling as described below:

In [None]:
# Stratified Split
from sklearn.model_selection import StratifiedShuffleSplit

# Create the splitting object
split = StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=42)

# Apply the split to the data frame using the "diagnosis" column as our label
# The split yields positional indices, so use `.iloc` rather than `.loc`
for train_index, test_index in split.split(data_frame, data_frame["diagnosis"]):
    train_set = data_frame.iloc[train_index]
    test_set = data_frame.iloc[test_index]


In [None]:
# Verify that the data has been split appropriately
print('Overall class ratio:')
print('{}'.format(data_frame["diagnosis"].value_counts() / len(data_frame)))
print(' ')
print('Train set class ratio:')
print('{}'.format(train_set["diagnosis"].value_counts() / len(train_set)))
print(' ')
print('Test set class ratio:')
print('{}'.format(test_set["diagnosis"].value_counts() / len(test_set)))

If you do not have labels, then you can just do a random split of the data.

You need to make sure that the split is based on some kind of unique identifier for each sample, so that each sample's assignment stays stable across repeated splits. Otherwise, samples that were previously in the testing set can end up in the training set (i.e. if more data is added to the set, then a fresh random split might shuffle everything together).

Benefits:
- Each sample is "assigned" to the test set based on some immutable value (the identifier).
- You can specify the percentage of samples that go into testing vs. training (30% in the example below).
- If you add new data, the samples that you had before will still be assigned to the same split.
- The assignment is deterministic, so the split is repeatable from run to run.

If you don't have a unique identifier, then you can use the index -- just make sure that new data is added to the end of the dataset.
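
To illustrate why hashing keeps the split stable, here is a minimal sketch using only the standard library. The helper `in_test_set` and the integer identifiers are hypothetical, and it hashes the identifier's text form, which is an assumption made for this example.

```python
from zlib import crc32

def in_test_set(identifier, test_ratio):
    """Return True if the identifier hashes into the lowest test_ratio fraction of hash values."""
    # Hash the identifier's text form; crc32 returns an unsigned 32-bit integer in Python 3
    return crc32(str(identifier).encode('utf-8')) < test_ratio * 2**32

# Hypothetical identifiers: an original batch, then the same batch plus new samples
original_ids = list(range(1000))
extended_ids = list(range(1500))

test_before = {i for i in original_ids if in_test_set(i, 0.3)}
test_after = {i for i in extended_ids if in_test_set(i, 0.3)}

# Every original sample keeps its assignment after new data is added
print(test_before == {i for i in test_after if i < 1000})
print('Test fraction:', len(test_before) / len(original_ids))
```

The realized test fraction is only approximately equal to the requested ratio, because it depends on where the hash values happen to fall.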


In [None]:
# Import functions to calculate a hash for the dataset
from zlib import crc32

def test_set_check(identifier, test_ratio):
    '''Return a boolean that states whether the current sample should be included in the testing set.
    
    Calculates a hash value from an identifier, and returns True if the value is in the bottom 
    (test_ratio)-percent of the maximum possible hash value.
    '''
    return crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2**32

def split_train_test_by_id(data, test_ratio, id_column):
    '''Return training and testing dataframes given a test ratio and column to use for computing sample hash.
    
    Uses test_set_check to actually compute hash and put the data into training or testing.
    '''
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio))
    return data.loc[~in_test_set], data.loc[in_test_set]

# Apply the above functions to the dataset
train_set, test_set = split_train_test_by_id(data_frame, 0.3, "id")


In [None]:
# Verify that the data has been split appropriately
print('Overall class ratio:')
print('{}'.format(data_frame["diagnosis"].value_counts() / len(data_frame)))
print(' ')
print('Train set class ratio:')
print('{}'.format(train_set["diagnosis"].value_counts() / len(train_set)))
print(' ')
print('Test set class ratio:')
print('{}'.format(test_set["diagnosis"].value_counts() / len(test_set)))

## Data Exploration

You can now start looking at the training data to try and identify some correlations between features.

First create a copy of the data so you don't mess anything up:

In [None]:
data_copy = train_set.copy()

# Drop the 'id' column for analysis
data_copy = data_copy.drop('id', axis=1)

### Calculating Feature Correlations

You can calculate the **standard correlation coefficient**, a.k.a. **Pearson's r**, to look for pairwise correlations.

See [this page](https://chrisalbon.com/machine_learning/feature_selection/drop_highly_correlated_features/) which illustrates how to use the `.corr()` function to actually drop correlated features.
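
As a reminder of what `.corr()` computes for each pair of columns, here is a small sketch that calculates Pearson's r by hand and compares it against `np.corrcoef`. The measurements are made up.

```python
import numpy as np

# Made-up measurements for two features
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Pearson's r: sum of products of the centered values, divided by the
# square root of the product of their sums of squares
x_centered = x - x.mean()
y_centered = y - y.mean()
r_manual = (x_centered * y_centered).sum() / np.sqrt(
    (x_centered**2).sum() * (y_centered**2).sum())

# np.corrcoef returns the full 2x2 correlation matrix; the off-diagonal entry is r
r_numpy = np.corrcoef(x, y)[0, 1]

print(r_manual, r_numpy)
```

Keep in mind that Pearson's r only captures linear relationships, so a value near zero does not rule out a nonlinear dependence.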

In [None]:
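# We are interested in finding ALL correlated features, not just positively correlated ones
# Restrict to numeric columns, since `diagnosis` is a non-numeric label
corr_matrix = data_copy.select_dtypes(include='number').corr().abs()

# `corr_matrix` is a symmetric matrix, so we just want the upper triangle
# (`np.bool` was removed in NumPy 1.24; the builtin `bool` works everywhere)
upper_triangle_locations = np.triu(np.ones(corr_matrix.shape), k=1).astype(bool)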

# `upper` now contains just the upper triangle of correlations, with the rest as NaNs
upper = corr_matrix.where(upper_triangle_locations)


In [None]:
# Now get a list of columns in `upper` that contain feature values correlated above 0.95
to_drop = [column for column in upper.columns if any(upper[column] > 0.95)]

print('{} columns to drop: {}'.format(len(to_drop), to_drop))

In [None]:
# Actually perform the drop
data_copy = data_copy.drop(columns=to_drop)

In [None]:
# Display the results 
data_copy.head()

### Data Visualization

We can visualize the data to see if anything pops out -- this is very much "explore mode".

As an example, we can create a scatter matrix (sometimes called a "facet plot") to display possible correlations between attributes.
This is most helpful for regression targets, when you have a numeric value you want to estimate from the others (we don't have this in the BCA dataset, though).

In [None]:
from pandas.plotting import scatter_matrix

attributes = ["radius_mean", "texture_mean", "compactness_mean", "fractal_dimension_mean"]

scatter_matrix(data_copy[attributes], figsize=(12,8))

If we want to zoom in on a particular pair of features that might be informative, we can select them specifically.

In [None]:
data_copy.plot(kind="scatter", x="radius_mean", y="fractal_dimension_mean", 
             alpha=0.3)

### Attribute Combinations

We won't do this here, but combining attributes can be done to include additional features you may want to look at. 
If you have a regression target, you can re-calculate your correlation between the target and your new feature to see if the new feature is correlated as well.
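
Here is a minimal sketch of that idea on a made-up regression table. The column names (`total_rooms`, `households`, `price`) are hypothetical and have nothing to do with the BCA dataset.

```python
import pandas as pd

# Hypothetical regression data: 'price' is the target, the other columns are features
toy = pd.DataFrame({
    'total_rooms': [10, 20, 30, 40, 50, 60],
    'households':  [5, 5, 10, 8, 10, 10],
    'price':       [200, 400, 300, 500, 500, 600],
})

# Combine two attributes into a new one
toy['rooms_per_household'] = toy['total_rooms'] / toy['households']

# Re-calculate the correlation of every feature with the target
print(toy.corr()['price'].sort_values(ascending=False))
```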

## Data Cleaning

First we can separate out our targets -- don't want to transform those.

In [None]:
# Drop the label, and also the 'id' column since it is an identifier rather than a feature
data_values = train_set.drop(['id', 'diagnosis'], axis=1)

## NOTE: Using double-brackets here to make the result a dataframe and not a series
# See here: https://github.com/ageron/handson-ml/issues/259
data_labels = train_set[['diagnosis']].copy()

First I will outline different data cleaning approaches, and then explain how to use Pipelines to string these all together.

### Handle Missing Attributes

Options:
- Drop data with missing attributes
- Remove attributes that are not complete
- Set missing values to some other value

Refer back to the "Hands On" notes for information about these approaches -- we have all numeric, complete features here so we won't worry about this.
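
For reference, the three options can be sketched with plain pandas on a made-up frame. In a real workflow the fill value (here the median) should come from the training set only.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value in column 'b' (values are made up)
toy = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0],
                    'b': [10.0, np.nan, 30.0, 50.0]})

# Option 1: drop the rows (samples) that have missing attributes
dropped_rows = toy.dropna(axis=0)

# Option 2: drop the columns (attributes) that are not complete
dropped_cols = toy.dropna(axis=1)

# Option 3: fill missing values, here with each column's median
filled = toy.fillna(toy.median())

print(dropped_rows.shape, dropped_cols.shape, filled['b'].tolist())
```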

In [None]:
# For imputing missing data
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy="median")

### Handle Categorical Attributes

Again, we have options:

- Use `OrdinalEncoder` if the categorical attributes are ordinal
- Use `OneHotEncoder` if the categorical attributes are not ordinal

And again, we don't have this situation here. Refer to "Hands On" for guidance.
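
To make the difference concrete, here is a small sketch on made-up categories. The `sizes` and `colors` lists are hypothetical examples of an ordinal and a non-ordinal attribute.

```python
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

# Ordinal example: sizes have a natural order, so it is passed explicitly
sizes = [['small'], ['large'], ['medium'], ['small']]
size_encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
sizes_encoded = size_encoder.fit_transform(sizes)

# Non-ordinal example: colors have no order, so each category gets its own 0/1 column
colors = [['red'], ['green'], ['blue'], ['green']]
color_encoder = OneHotEncoder()
colors_encoded = color_encoder.fit_transform(colors).toarray()

print(sizes_encoded.ravel())
print(color_encoder.categories_[0], colors_encoded.shape)
```

Without the explicit `categories` argument, `OrdinalEncoder` orders the categories alphabetically, which may not match the real ordering.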

In [None]:
from sklearn.preprocessing import OrdinalEncoder
ordinal_encoder = OrdinalEncoder()

from sklearn.preprocessing import OneHotEncoder
cat_encoder = OneHotEncoder()

### Scaling Attributes

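Most learning algorithms work better when the features are on similar scales, so scaling is usually a safe default (tree-based models are a notable exception).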

**Min-Max Scaling**: AKA *normalization*: values are shifted and rescaled to range from 0 to 1, by subtracting the min and dividing by the max - min. There is a `MinMaxScaler` transformer for this, with a `feature_range` hyperparameter.

**Standardization**: Zero mean, unit variance. Subtract the mean, divide by standard deviation. Not bounded to any specific range, which may be a problem (e.g. for neural networks expecting a 0-1 value), but much less affected by outliers. `StandardScaler` will do this.

The scaling parameters (min and max, or mean and standard deviation) should be calculated on the training set only; the same fitted transform is then applied to the testing set.
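
Both formulas can be written out by hand with NumPy. This is a sketch on made-up values for a single feature, showing the training statistics being re-used on the test values.

```python
import numpy as np

# Made-up single-feature train and test values
train = np.array([2.0, 4.0, 6.0, 8.0])
test = np.array([5.0, 10.0])

# Min-max scaling: (x - min) / (max - min), using the TRAINING min and max
train_min, train_max = train.min(), train.max()
train_minmax = (train - train_min) / (train_max - train_min)
test_minmax = (test - train_min) / (train_max - train_min)

# Standardization: (x - mean) / std, using the TRAINING mean and std
train_mean, train_std = train.mean(), train.std()
train_standard = (train - train_mean) / train_std
test_standard = (test - train_mean) / train_std

print(train_minmax, test_minmax)
print(train_standard, test_standard)
```

Note that a test value larger than the training maximum lands outside the 0-1 range after min-max scaling, which is expected when the scaler is fitted on training data only.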

In [None]:
from sklearn.preprocessing import StandardScaler

### Using Pipelines

Pipelines allow you to "chain" together processing of the data (imputer, encoder, scaler).
In this case, since we aren't imputing and we aren't handling categorical values, we only have a scaling component to our pipeline, but this is a good way to set up a series of cleaning operations if needed.
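
For a case where more than one step is needed, here is a sketch of a two-step pipeline (imputer, then scaler) on made-up data containing a missing value. The name `cleaning_pipeline` is just for this example.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Toy data with a missing value (made up for illustration)
X_train = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0], [4.0, 40.0]])
X_test = np.array([[2.5, np.nan]])

# Each step is a (name, transformer) pair; the output of one feeds the next
cleaning_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('std_scaler', StandardScaler()),
])

# fit_transform learns the medians, means and stds from the training data only
X_train_clean = cleaning_pipeline.fit_transform(X_train)

# transform re-uses those learned values on the test data
X_test_clean = cleaning_pipeline.transform(X_test)

print(X_train_clean)
print(X_test_clean)
```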

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_pipeline

input_pipeline = Pipeline([
        ('std_scaler', StandardScaler()),
    ])

# This does the same thing, adds a name automatically
input_pipeline = make_pipeline(StandardScaler())

data_values_transformed = input_pipeline.fit_transform(data_values)

# Create a numeric label for our system to work on
data_labels_num = ordinal_encoder.fit_transform(data_labels)

# Dimensionality Reduction

These methods will give you a more "all-encompassing" view of your data by providing a low-dimensional embedding or representation of the data in 2 or 3 dimensions.

You can add these methods to the pipeline as well, by the way.

In [None]:
# Dimensionality Reduction Imports
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap
from sklearn.manifold import LocallyLinearEmbedding
from sklearn.manifold import MDS
from sklearn.manifold import TSNE

Now create pipelines for different dimensionality reduction targets. 
There may be a way to define multiple alternatives for a given step within a single pipeline, but I'm not sure.
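
One simple workaround is a plain loop that builds one pipeline per candidate. This is only a sketch on random stand-in data; the `reducers` dictionary and its keys are names chosen for this example, and the `n_neighbors=10` setting for Isomap is an arbitrary choice.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Random stand-in data: 60 samples with 5 features
rng = np.random.RandomState(42)
X_toy = rng.normal(size=(60, 5))

# Candidate reducers keyed by a label of our choosing
reducers = {
    'pca': PCA(n_components=2),
    'isomap': Isomap(n_components=2, n_neighbors=10),
}

# Build and run one pipeline per candidate, keeping each embedding
embeddings = {}
for name, reducer in reducers.items():
    pipeline = make_pipeline(StandardScaler(), reducer)
    embeddings[name] = pipeline.fit_transform(X_toy)
    print(name, embeddings[name].shape)
```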

Here are some of the different methods you can try:

```
dimred_pipeline = make_pipeline(input_pipeline, PCA(n_components=2))
dimred_pipeline = make_pipeline(input_pipeline, Isomap(n_components=2))
dimred_pipeline = make_pipeline(input_pipeline, LocallyLinearEmbedding(n_components=2))
dimred_pipeline = make_pipeline(input_pipeline, MDS(n_components=2))
dimred_pipeline = make_pipeline(input_pipeline, TSNE(n_components=2))
```

Cut and paste whichever of those you want down below.

In [None]:
dimred_pipeline = make_pipeline(input_pipeline, PCA(n_components=2))
#dimred_pipeline = make_pipeline(input_pipeline, TSNE(n_components=2))


In [None]:
# Fit and apply the transform right away
X_reduced = dimred_pipeline.fit_transform(data_values)

In [None]:
# Plot the reduced-dimensional space (2D)
plt.scatter(X_reduced[:,0], X_reduced[:,1], alpha=0.2)

In [None]:
# Plot the reduced-dimensional space - 3D
# (requires a reducer with n_components=3, so that X_reduced has a third column)
#from mpl_toolkits.mplot3d import Axes3D
#fig = plt.figure()
#ax = fig.add_subplot(111, projection='3d')
#ax.scatter(X_reduced[:,0], X_reduced[:,1], X_reduced[:,2], alpha=0.2)

In [None]:
# Plot the reduced-dimensional space, with points colored by label
plt.scatter(X_reduced[:,0], X_reduced[:,1], c=np.squeeze(data_labels_num), alpha=0.2)

In [None]:
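# Plot the 3D reduction colored by label (also requires n_components=3)
#from mpl_toolkits.mplot3d import Axes3D
#fig = plt.figure()
#ax = fig.add_subplot(111, projection='3d')
#ax.scatter(X_reduced[:,0], X_reduced[:,1], X_reduced[:,2], c=np.squeeze(data_labels_num), alpha=0.2)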

# Clustering

Clustering is for unlabeled data: samples are assigned to groups based only on the structure of the data.

## KMeans

In [None]:
from sklearn.cluster import KMeans

# Clustering pipeline - Start to finish
kmeans_pipeline = make_pipeline(input_pipeline, PCA(n_components=2), KMeans(n_clusters=2))

# Cluster via K-means
X_clustered = kmeans_pipeline.fit_predict(data_values)

# Alternative: Create the reduced input data first, then fit the model on it separately
# (the cluster assignments are then available as `kmeans_model.labels_`)
kmeans_model = KMeans(n_clusters=2).fit(X_reduced)

In [None]:
plt.scatter(X_reduced[:,0], X_reduced[:,1], c=X_clustered, alpha=0.2)

## Spectral Clustering

In [None]:
from sklearn.cluster import SpectralClustering
sc_pipeline = make_pipeline(dimred_pipeline, SpectralClustering(n_clusters=2, assign_labels='discretize'))
data_reduced_sc = sc_pipeline.fit_predict(data_values)

In [None]:
plt.scatter(X_reduced[:,0], X_reduced[:,1], c=data_reduced_sc, alpha=0.2)

## Mean Shift

In [None]:
from sklearn.cluster import MeanShift
ms_pipeline = make_pipeline(dimred_pipeline, MeanShift())
data_reduced_ms = ms_pipeline.fit_predict(data_values)

In [None]:
plt.scatter(X_reduced[:,0], X_reduced[:,1], c=data_reduced_ms, alpha=0.2)

## Evaluation

To evaluate, we have several metrics to choose from depending on whether or not we have ground truth labels.

If we **DO** have the labels:
- [Adjusted Rand index](https://scikit-learn.org/stable/modules/clustering.html#adjusted-rand-index)
- [Mutual Information](https://scikit-learn.org/stable/modules/clustering.html#mutual-information-based-scores)
- [Homogeneity, Completeness, and V-measure](https://scikit-learn.org/stable/modules/clustering.html#homogeneity-completeness-and-v-measure)
- [Fowlkes-Mallows Scores](https://scikit-learn.org/stable/modules/clustering.html#fowlkes-mallows-scores)
- [Contingency Matrix](https://scikit-learn.org/stable/modules/clustering.html#contingency-matrix)

If we **DO NOT** have the labels:
- [Silhouette Coefficient](https://scikit-learn.org/stable/modules/clustering.html#silhouette-coefficient)
- [Calinski-Harabasz Index](https://scikit-learn.org/stable/modules/clustering.html#calinski-harabasz-index)
- [Davies-Bouldin Index](https://scikit-learn.org/stable/modules/clustering.html#davies-bouldin-index)
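
One thing worth keeping in mind is that cluster ids are arbitrary, so the label-based metrics compare groupings rather than id values. Here is a small sketch on four made-up points, showing one metric from each list.

```python
import numpy as np
from sklearn import metrics

# Two well-separated toy clusters (made-up points)
X_toy = np.array([[0, 0], [0, 1], [10, 0], [10, 1]])
true_labels = [0, 0, 1, 1]

# Same grouping as the truth, but with the cluster ids swapped
cluster_ids = [1, 1, 0, 0]

# With labels: the adjusted Rand index compares groupings, not id values
ari = metrics.adjusted_rand_score(true_labels, cluster_ids)

# Without labels: the silhouette coefficient only needs the data and cluster ids
silhouette = metrics.silhouette_score(X_toy, cluster_ids)

print('ARI: {:.3f}, Silhouette: {:.3f}'.format(ari, silhouette))
```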



In [None]:
from sklearn import metrics

print('Uses Labels')
print('===========')
print('Adjusted Rand Index: \t{:.3f}'.format(metrics.adjusted_rand_score(np.squeeze(data_labels_num), X_clustered)))
print('Adjusted Mutual Info: \t{:.3f}'.format(metrics.adjusted_mutual_info_score(np.squeeze(data_labels_num), X_clustered, average_method='arithmetic')))
print('Homogeneity Score: \t{:.3f}'.format(metrics.homogeneity_score(np.squeeze(data_labels_num), X_clustered)))
print('Completeness Score: \t{:.3f}'.format(metrics.completeness_score(np.squeeze(data_labels_num), X_clustered)))
print('Fowlkes-Mallows Score: \t{:.3f}'.format(metrics.fowlkes_mallows_score(np.squeeze(data_labels_num), X_clustered)))
print(' ')
print('No Labels')
print('=========')
print('Silhouette Coefficient: \t{:.3f}'.format(metrics.silhouette_score(X_reduced, X_clustered, metric='euclidean')))
print('Calinski-Harabasz Index:\t{:.3f}'.format(metrics.calinski_harabasz_score(X_reduced, X_clustered)))
print('Davies-Bouldin Score: \t\t{:.3f}'.format(metrics.davies_bouldin_score(X_reduced, X_clustered)))

# Classification

Once we have labels, we can turn our attention to classification -- this will allow us to assign labels to our testing set.

We'll go through some common methods, training and calculating the evaluation performance for each of them using basic parameters.
For details on modifying / optimizing these, see individual notebooks or the `scikit-learn` User Guide.
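
As a rough template for the overall workflow, here is a sketch that trains on one split and scores on the held-out split. It uses synthetic data from `make_classification` as a stand-in, and the name `clf_pipeline` is chosen just for this example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data standing in for a real dataset
X, y = make_classification(n_samples=300, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Scaling and classification in one pipeline, fitted on the training set only
clf_pipeline = make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=42))
clf_pipeline.fit(X_train, y_train)

# score() reports accuracy; the training score is usually optimistic
train_accuracy = clf_pipeline.score(X_train, y_train)
test_accuracy = clf_pipeline.score(X_test, y_test)
print('Train accuracy: {:.3f}'.format(train_accuracy))
print('Test accuracy:  {:.3f}'.format(test_accuracy))
```

Comparing the two numbers gives a first hint of overfitting: a large gap between training and testing accuracy suggests the model has memorized the training data.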

## Linear Regression

In [None]:
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(data_values_transformed, data_labels_num)

In [None]:
# Take the first 5 training samples and their matching labels
some_training = data_values.iloc[:5]
some_training_transformed = input_pipeline.transform(some_training)
some_training_labels = data_labels.iloc[:5]
some_training_labels_num = ordinal_encoder.transform(some_training_labels)

print("Predictions:", list(lin_reg.predict(some_training_transformed)))
print("Labels:", list(some_training_labels_num))


In [None]:
from sklearn.metrics import mean_squared_error

predictions = lin_reg.predict(data_values_transformed)
lin_mse = mean_squared_error(data_labels_num, predictions)
lin_rmse = np.sqrt(lin_mse)
print("Linear Regressor Root Mean Squared Error: {:.3f}".format(lin_rmse))


## Decision Trees

In [None]:
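from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import mean_squared_error

tree_clf = DecisionTreeClassifier()
# Flatten the labels to 1D, which is the shape scikit-learn classifiers expect
tree_clf.fit(data_values_transformed, np.ravel(data_labels_num))

# Make predictions on the training data (so the error will look overly optimistic)
predictions = tree_clf.predict(data_values_transformed)
tree_mse = mean_squared_error(np.ravel(data_labels_num), predictions)
tree_rmse = np.sqrt(tree_mse)
print("Tree Root Mean Squared Error: {:.3f}".format(tree_rmse))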