# Introduction

This notebook illustrates basic data loading and visualization techniques.

These are nearly identical to the functions I used to generate the figures in my previous lecture, but the display code for the plotting has been greatly simplified (to save on space and complexity).

## Understanding Python Code

I don't expect that you will be familiar with the code below; the notebook format will try to explain everything that's going on. In addition, there are comments within the code itself that explain what's happening. Comments are green lines that start with a pound sign (`#`).

You can also feel free to view and mess with this notebook -- your changes should not be saved, but you can play around with various settings.

# Imports and Project Setup

We set up our project and import important packages here.

Python requires the use of "modules", which are basically packages of code that enable us to use more advanced functionality than the core library. We don't want to have to write all the functions to load data, create plots, and train models by hand, so we load these modules at the start so we have access to pre-built sets of code that will do all this boring stuff for us. 

## Imports

Most of the packages we import are very common and popular for machine learning. Because they are so important, I will provide some links with more info for them, so you can explore their functionality on your own. 

- `pandas`: This module defines a table-like format for data called "dataframes"; these are similar to MATLAB's `table`, R's `data.frame`, or Excel spreadsheets. Here we'll use them for loading and parsing the data.
- `numpy`: This is a very common module for doing numeric analysis. It provides support for matrices and tensors, as well as hundreds of mathematical operations and commonly-used linear algebra functions.
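- `matplotlib`: This is a plotting library with an interface modeled on MATLAB's plotting commands, which allows us to easily generate plots and charts of our data. Thanks to the format of the notebook, these plots will appear directly in our web browser!
- `seaborn`: This is a second plotting library, built on top of `matplotlib`, which we use for some of the more complex statistical plots.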

The code cell below, when executed, will run the import statements.

In [None]:
import os
import pandas as pd
import numpy as np

# We use two different plotting libraries, depending on which kind of plot we want
import matplotlib.pyplot as plt
import seaborn as sns

# Set an option for Pandas to display floating-point numbers with two decimal places
pd.options.display.float_format = '{:,.2f}'.format

# Import libraries to work with strings
import sys
if sys.version_info[0] < 3: 
    from StringIO import StringIO
else:
    from io import StringIO

## Loading Data

We typically load data from `.csv` files, which are basically spreadsheets.

In Excel, you can export spreadsheets as `.csv` by going to "File" -> "Export" -> "Change File Type" and selecting **"CSV (Comma delimited) (*.csv)"** under "Other File Types".

## Data Access in Local Files

The easiest way to perform data loading is to load it from the local workspace. If this notebook is being run locally, then the code below will work.

In [None]:
# Use os.path.join to construct the file path
# Keeps things nice when going between Windows / non-Windows platforms
csv_file = os.path.join('data', 'bca_wisconsin', 'bca_wisconsin.csv')

# Use `read_csv` to get the file into a dataframe
df = pd.read_csv(csv_file)

## Data Access in Google Colab

One area where Google's Colab needs some work is accessing data stored in Google Drive. Currently (I think for security reasons), we have to go through the Google Drive API.

This process will allow us to access the file which holds our data.

### Option 1: Mounting Google Drive

We can mount the Google Drive to the notebook's VM, and then load the csv file to a dataframe.

In [None]:
# Need to get Google Drive access
from google.colab import drive
drive.mount('/content/gdrive')

In [None]:
# Load the dataset into a Pandas dataframe
csv_file = os.path.join('/content/gdrive', 'My Drive', '2020-tata-memorial-workshop', 'wisconsin_breast_cancer_data.csv')
df = pd.read_csv(csv_file)

### Option 2: Uploading a File

Alternatively, you can use the `files.upload()` function to upload a file from your local computer. If you download the csv file, you can upload it to the notebook's VM using the code below:

In [None]:
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

`uploaded` is now a dictionary where the key is the name of the file, and the value is the file's contents as raw bytes. We can decode these bytes into text, wrap the text in `StringIO` to obtain a csv-file-like object, and then load it with `read_csv`.

In [None]:
df = pd.read_csv(StringIO(uploaded['wisconsin_breast_cancer_data.csv'].decode("utf-8")))
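
To see the decode-then-wrap step in isolation, here is a minimal sketch that does not need Colab. The `demo_bytes` value is made up to stand in for one entry of `uploaded`; it is not the workshop file. It also shows that `BytesIO` (an alternative I am adding, not used elsewhere in this notebook) lets `read_csv` handle the raw bytes directly.

```python
from io import BytesIO, StringIO

import pandas as pd

# Hypothetical stand-in for uploaded['some_file.csv']: raw bytes of a tiny CSV
demo_bytes = b"id,diagnosis,radius_mean\n1,M,17.99\n2,B,13.54\n"

# Route 1: decode the bytes to text, then wrap the text in a file-like object
df_from_text = pd.read_csv(StringIO(demo_bytes.decode("utf-8")))

# Route 2: wrap the raw bytes directly and let pandas handle the decoding
df_from_bytes = pd.read_csv(BytesIO(demo_bytes))

print(df_from_text)
```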

## Double-Check Our Data

It's a good idea to take a peek at what we've loaded, just to make sure that we don't have an empty or corrupted dataset.

In [None]:
df.head()

We can also use the `.info()` method to get a peek at each of the columns in our dataset and see what kind of values we have. We can compare the number of entries to the number of `non-null` values in each column to see whether we have any missing data, and we can check which columns hold integers, floating-point numbers (i.e. numbers with decimal places), etc.

In [None]:
# View some statistics on the dataset
df.info()
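
If you would rather get the missing-value counts as numbers instead of reading them off the `.info()` printout, something like the sketch below should work. The `demo_df` table is made-up illustration data, not the workshop dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataframe with one missing value in the "texture" column
demo_df = pd.DataFrame({
    "radius": [14.2, 20.1, 12.5],
    "texture": [19.3, np.nan, 17.8],
})

# Count the missing (NaN) entries in each column
missing_per_column = demo_df.isnull().sum()
print(missing_per_column)
```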

## Handle Categorical Attributes

Generally, for machine learning, if you have a feature with categorical values (like "Hot" and "Cold"), you want to convert them to numeric values. There are two ways of doing this.

### Ordinal Encoding

If the categorical values are **ordinal**, meaning you can place them in some kind of order (e.g. "Low", "Intermediate", and "High"), you can convert these into ordered numeric values where Low = 0, Intermediate = 1, and High = 2. In Python you can use the `OrdinalEncoder` class from `scikit-learn` to do this.

In [None]:
from sklearn.preprocessing import OrdinalEncoder
ordinal_encoder = OrdinalEncoder()
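
As a rough sketch of how the encoder is used in practice: the `risk` column below is a hypothetical feature (our dataset has nothing like it). One detail worth knowing is that, left to itself, the encoder orders categories alphabetically, so the intended order is passed in explicitly here.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical feature with ordered categories (not part of our dataset)
risk = pd.DataFrame({"risk": ["Low", "High", "Intermediate", "Low"]})

# Pass the categories explicitly so the numeric codes follow the real order;
# without this, the categories would be sorted alphabetically
demo_ordinal_encoder = OrdinalEncoder(categories=[["Low", "Intermediate", "High"]])
risk_encoded = demo_ordinal_encoder.fit_transform(risk)

print(risk_encoded.ravel())
```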

### One-Hot Encoding

If the values are **not ordinal**, meaning the order of them doesn't matter (e.g. "Blood Type A", "Blood Type B"), then you can use **one-hot encoding**: Replace the feature with $N$ new features, where $N$ is the number of categories. Each of the new features is *binary*, meaning it's only 0 or 1, and for each sample exactly one of them is 1: the one matching that sample's category.

In [None]:
from sklearn.preprocessing import OneHotEncoder
cat_encoder = OneHotEncoder()
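
Here is a small sketch of what the one-hot encoder produces. The `blood_type` column is hypothetical and only serves as an example of an unordered category.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical unordered categorical feature (not part of our dataset)
blood = pd.DataFrame({"blood_type": ["A", "B", "O", "A"]})

demo_cat_encoder = OneHotEncoder()

# fit_transform returns a sparse matrix by default; convert it for display
blood_onehot = demo_cat_encoder.fit_transform(blood).toarray()

# One new binary column per category, in the order listed in categories_
print(demo_cat_encoder.categories_)
print(blood_onehot)
```

Each row contains a single 1, in the column that corresponds to that sample's category.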

Luckily, we don't have this situation here in our feature set: All of our measured values are floating-point values. The only categorical entry in our data is our target categories. Targets (or classes) are often encoded numerically, so let's do that now.  

### Converting Target to Numeric

Our targets in this dataset are encoded as characters, "M" (standing for "Malignant") and "B" (standing for "Benign"). Practically speaking, it's easier to work with these labels if they are numeric.

We have what's called a **binary classification problem**, meaning that there are only two categories of data that we need to worry about. For this type of problem, it's common to encode the categories as 0 and 1.

In our case, we're going to set "Benign" to 0 and "Malignant" to 1 (there's no technical reason for this; to me, it makes sense because 1 is typically referred to as "positive", much like a "positive" diagnosis of malignancy).

In [None]:
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()

diagnosis_cat = df['diagnosis']

# Fit the encoder to the categories, and immediately transform them into numeric labels
diagnosis_lab = label_encoder.fit_transform(diagnosis_cat)

# Add the diagnosis label back to the dataframe
df['diagnosis_label'] = diagnosis_lab

In [None]:
# Ensure the labels were added correctly
df.head(20)
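
It is worth confirming which character ended up as which number. `LabelEncoder` sorts the values it sees, so "B" should come out as 0 and "M" as 1. The sketch below checks this on a short made-up list rather than on the real `diagnosis` column.

```python
from sklearn.preprocessing import LabelEncoder

demo_encoder = LabelEncoder()
demo_labels = demo_encoder.fit_transform(["M", "B", "B", "M"])

# classes_ lists the original values; each value's position is its numeric code
print(demo_encoder.classes_)
print(demo_labels)

# inverse_transform maps numeric codes back to the original characters
print(demo_encoder.inverse_transform([0, 1]))
```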

# Training and Testing Dataset Split

The order of data wrangling techniques is debatable, but number one is **separate out a testing set**.

Separating out a training and testing set is a fundamental step of good machine learning. During data exploration, it helps to prevent **data snooping bias**, which can influence your design decisions. During training, it helps prevent **overfitting** your model to your specific data, thus improving **generalization**. 

A good general rule is that around **60-70%** of your dataset should be set aside for training, and the remaining **30-40%** should be used for performance evaluation.

## Splitting Strategies: Random Sampling

How do we identify which samples go in which split? 



### Naïve Random Sampling

We *could* start by just saying that we'll take a random sample of the data:

In [None]:
def split_train_test(data, test_ratio=0.3):
    """Return a random split of the "data" dataframe, with the percentage of 
    data specified in "test_ratio" in the testing set.
    """

    # First get a random list of indices into the data
    shuffled_indices = np.random.permutation(len(data))
    
    # Calculate the number of indices that belong to the test set 
    # (e.g. 20% if test_ratio is 0.2)
    test_set_size = int(len(data) * test_ratio)

    # Split the random list of indices into two sets, one for test indices and 
    # another for training indices 
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]

    # Return two dataframes, the first with the training data and the other 
    # with the testing data 
    return data.iloc[train_indices], data.iloc[test_indices]

However, this doesn't quite work: If I run this code twice, I will get two different splits, because `np.random.permutation()` produces a different shuffle each time it is called. This means that in Run 2, I may have some data in my training set that was previously in my testing set, and vice versa.

You can prevent this by setting a "random seed", meaning that each run will give you the *same* random numbers -- but if more data is added to the set, then even using the same seed will give you a completely different random split.

**This is no good!**
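
A minimal sketch of the second problem, using row positions only (no real data). The helper `demo_test_indices` is a name I made up for this illustration; it mirrors the shuffling step of `split_train_test` with a fixed seed.

```python
import numpy as np

def demo_test_indices(n_samples, test_ratio=0.3, seed=42):
    """Return the set of row positions that a seeded shuffle sends to the test set."""
    rng = np.random.RandomState(seed)
    shuffled = rng.permutation(n_samples)
    return set(shuffled[:int(n_samples * test_ratio)].tolist())

# Same seed, same dataset size: the split is repeatable
same_split = demo_test_indices(100) == demo_test_indices(100)

# Same seed, but 10 new rows appended: the test set is no longer the same
before = demo_test_indices(100)
after = demo_test_indices(110)

# Rows that used to be test samples and have now moved into training
moved_to_training = before - after

print(same_split, len(moved_to_training))
```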

### Indexed Random Sampling

A better strategy is to use some kind of unique identifier for each sample, and based on that number, place it into either the training or the testing set. 

Benefits:
- Each sample is "assigned" to the test set based on some immutable value (the identifier);
- You can specify the percentage of samples that go into testing vs. training (e.g. 30%);
- If you add new data, the samples that you had before will still be assigned to the same split; and
- The assignment is deterministic, so the split is repeatable without relying on a random seed.

If you don't have a unique identifier, then you can use the index -- just make sure that new data is added to the end of the dataset. 

Luckily, we **do** have a unique identifier, a numeric value assigned to each subject in the database. So we'll just use that.

In [None]:
# Import functions to calculate a hash for the dataset
from zlib import crc32

def test_set_check(identifier, test_ratio):
    '''Return a boolean that states whether the current sample should be included in the testing set.
    
    Calculates a hash value from an identifier, and returns True if the value is in the bottom 
    (test_ratio)-percent of the maximum possible hash value.
    '''
    return crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2**32

def split_train_test_by_id(data, test_ratio, id_column):
    '''Return training and testing dataframes given a test ratio and column to use for computing sample hash.
    
    Uses test_set_check to actually compute hash and put the data into training or testing.
    '''
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio))
    return data.loc[~in_test_set], data.loc[in_test_set]

# Apply the above functions to the dataset
train_set, test_set = split_train_test_by_id(df, 0.3, "id")

In [None]:
test_set.info()

Note: If you don't understand this code, don't worry about it -- it's a bit more complex than most of what we're doing, and the implementation details aren't important. Basically, it says "Look at the subject's ID number, and based on that number, either put it into the test set or the training set."

This works pretty well, and if you don't know the labels / targets / classes of your dataset, this is the best option. 
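
To convince yourself that the assignment really is stable when data is added, here is a small self-contained sketch. The ID values are invented, and `demo_in_test_set` is just a renamed copy of the check used above so the block runs on its own.

```python
from zlib import crc32

import numpy as np
import pandas as pd

def demo_in_test_set(identifier, test_ratio):
    """Hash the identifier and compare it to a fixed fraction of the hash range."""
    return crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2**32

# Hypothetical subject IDs: the original 100 subjects, then the same plus 20 new ones
old_ids = pd.Series(range(1000, 1100))
new_ids = pd.Series(range(1000, 1120))

old_assignment = old_ids.apply(lambda id_: demo_in_test_set(id_, 0.3))
new_assignment = new_ids.apply(lambda id_: demo_in_test_set(id_, 0.3))

# The first 100 subjects keep exactly the same train/test assignment
unchanged = bool((old_assignment.values == new_assignment.values[:100]).all())

print(unchanged)
```

Note that the fraction landing in the test set is only approximately equal to `test_ratio`, since it depends on how the hash values happen to fall.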

However, we **do** have labels for our data. So we'd like to know whether the **class balance** of the overall dataset matches the training and / or testing splits. Let's take a look:

In [None]:
print('================')
print(' Random Sampling')
print('================')
print('')
print('Overall class balance:')
print('{}'.format(df["diagnosis"].value_counts() / len(df)))
print(' ')
print('Train set class ratio:')
print('{}'.format(train_set["diagnosis"].value_counts() / len(train_set)))
print(' ')
print('Test set class ratio:')
print('{}'.format(test_set["diagnosis"].value_counts() / len(test_set)))

The ratios are different between each of the splits, and both of them differ from the overall dataset. This effect is exaggerated when you have a small number of samples. How can we fix this?

## Splitting Strategies: Stratified Sampling

How do you **balance** the samples in your dataset while maintaining randomness?

**Stratified Sampling** maintains the overall label distribution in your training and testing sets -- this ensures that your training and testing sets accurately represent the class distribution in the overall dataset.

Because this is such a common thing to do, there is a built-in function to `sklearn` that will do it for us:

In [None]:
# Stratified Split
from sklearn.model_selection import StratifiedShuffleSplit

# Create the splitting object
split = StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=42)

# Apply the split to the data frame using the "diagnosis" column as our label
for train_index, test_index in split.split(df, df["diagnosis"]):
    # The splitter yields positional indices, so select rows with `.iloc`
    train_set = df.iloc[train_index]
    test_set = df.iloc[test_index]

The `StratifiedShuffleSplit` class allows you to define the number of splits, the size of the testing set, and the random seed to use for reproducibility. You then call the `split()` method on the splitting object, and give it the object to split (our dataframe) and the thing to use to perform the split (our class labels, i.e. the "diagnosis" column).
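
When you only need a single split, `sklearn` also offers `train_test_split` with a `stratify` argument, which should give a stratified split in one call. This is an alternative I am adding for reference, not what the rest of the notebook uses, and the `demo_df` table is made up (60 benign, 40 malignant).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical dataset with a 60/40 class balance
demo_df = pd.DataFrame({
    "feature": range(100),
    "diagnosis": ["B"] * 60 + ["M"] * 40,
})

# `stratify` tells the function which labels to balance across the two splits
demo_train, demo_test = train_test_split(
    demo_df, test_size=0.3, stratify=demo_df["diagnosis"], random_state=42
)

train_ratio_m = (demo_train["diagnosis"] == "M").mean()
test_ratio_m = (demo_test["diagnosis"] == "M").mean()
print(train_ratio_m, test_ratio_m)
```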

Let's take a look at the class balance now:

In [None]:
print('====================')
print(' Stratified Sampling')
print('====================')
print('')
print('Overall class ratio:')
print('{}'.format(df["diagnosis"].value_counts() / len(df)))
print(' ')
print('Train set class ratio:')
print('{}'.format(train_set["diagnosis"].value_counts() / len(train_set)))
print(' ')
print('Test set class ratio:')
print('{}'.format(test_set["diagnosis"].value_counts() / len(test_set)))

This is much more balanced: Both our training and testing sets have the same ratio of benign to malignant samples, and they are also close to the overall class ratio. 



### Side Note: Balanced vs. Unbalanced Classes

The ratio of benign to malignant classes is not 50-50, but is closer to 60-40. This is very common in biomedical datasets, where disease cases or rare classes are by definition very small percentages of the overall data.

A common theme in machine learning is the tension between needing enough data to build a model and the fact that most of the phenomena we're interested in are comparatively rare. It will be up to you (or your data scientist) to select the best method for the amount of data you have available, and to adjust your evaluation metrics accordingly.

If necessary, you can **over-sample** or **under-sample** a class to try to achieve an even split. However, in our case, we're going to leave the class imbalance in place.
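
For reference, here is one simple way over- and under-sampling could be done, using `sklearn.utils.resample` on a made-up table with a 7-to-3 imbalance. This is only a sketch of the idea; dedicated libraries offer more sophisticated strategies, and any resampling should be applied to the training set only.

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical imbalanced dataset: 7 samples of class 0, 3 of class 1
demo_df = pd.DataFrame({
    "feature": range(10),
    "diagnosis_label": [0] * 7 + [1] * 3,
})
majority = demo_df[demo_df["diagnosis_label"] == 0]
minority = demo_df[demo_df["diagnosis_label"] == 1]

# Over-sample: draw minority rows with replacement until they match the majority
minority_over = resample(minority, replace=True, n_samples=len(majority), random_state=42)
oversampled = pd.concat([majority, minority_over])

# Under-sample: keep only a random subset of the majority rows
majority_under = resample(majority, replace=False, n_samples=len(minority), random_state=42)
undersampled = pd.concat([majority_under, minority])

print(oversampled["diagnosis_label"].value_counts())
print(undersampled["diagnosis_label"].value_counts())
```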


# Data Cleaning

Very often, the data you receive will be "messy" -- meaning there will be **missing values**, **categorical** rather than numeric features, and values that **need to be scaled**. 

We are NOT considering cases where values may be incorrectly entered, for two reasons:

1. Detecting "outliers" is an entire lecture on its own, with a variety of different approaches that all have pros and cons. 
2. It's possible that a value can be incorrect, but still within a reasonable range -- these are not outliers, but they are wrong. We **must assume** that the data is correctly entered into the spreadsheet, and that there are protections in place to catch incorrectly-entered data.

First we can separate out our targets -- we don't want to transform those.

In [None]:
training_values = train_set.drop(['id','diagnosis', 'diagnosis_label'], axis=1)

## NOTE: Using double-brackets here to make the result a dataframe and not a series
# See here: https://github.com/ageron/handson-ml/issues/259
training_labels = train_set[['diagnosis_label']].copy()

First I will outline different data cleaning approaches, and then explain how to use Pipelines to string these all together.

### Handle Missing Attributes

Options:
- Drop data with missing attributes
- Remove attributes that are not complete
- Set missing values to some other value

Refer back to the "Hands On" notes for information about these approaches -- we have all numeric, complete features here so we won't worry about this.

In [None]:
# For imputing missing data
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy="median")
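
Even though we don't need it here, a quick sketch of the third option (filling in missing values) may be useful. The `demo_values` table is invented, with one gap in each column.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical feature table with one missing value per column
demo_values = pd.DataFrame({
    "radius": [10.0, 12.0, np.nan, 20.0],
    "texture": [15.0, np.nan, 18.0, 21.0],
})

demo_imputer = SimpleImputer(strategy="median")

# fit learns each column's median; transform fills the gaps with it
demo_filled = demo_imputer.fit_transform(demo_values)

print(demo_imputer.statistics_)  # the per-column medians that were learned
print(demo_filled)
```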

### Scaling Attributes

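Many machine learning algorithms perform poorly when the features have very different scales, so it is good practice to scale your data.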

**Min-Max Scaling**: AKA *normalization*: values are shifted and rescaled to range from 0 to 1, by subtracting the min and dividing by the max - min. There is a `MinMaxScaler` transformer for this, with a `feature_range` hyperparameter.

**Standardization**: Zero mean, unit variance. Subtract the mean, divide by standard deviation. Not bounded to any specific range, which may be a problem (e.g. for neural networks expecting a 0-1 value), but much less affected by outliers. `StandardScaler` will do this.

The scaling parameters (min and max, or mean and standard deviation) should be calculated only on the training set; that same fitted transform is then applied to the testing set.
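
The two formulas, and the "fit on training, apply to testing" rule, can be sketched on a toy example. The numbers below are made up; the manual calculations are included only to show what the two scalers compute.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical single-feature training and testing data
train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
test = np.array([[0.0], [6.0]])

# Fit both scalers on the training data only
minmax = MinMaxScaler().fit(train)
standard = StandardScaler().fit(train)

# The same transforms written out by hand, using training statistics
manual_minmax = (test - train.min()) / (train.max() - train.min())
manual_standard = (test - train.mean()) / train.std()

# Test values outside the training range fall outside [0, 1] after min-max scaling
print(minmax.transform(test).ravel())
print(standard.transform(test).ravel())
```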

### Using Pipelines

Pipelines allow you to "chain" together processing of the data (imputer, encoder, scaler).
In this case, since we aren't imputing and we aren't handling categorical values, we only have a scaling component to our pipeline, but this is a good way to set up a series of cleaning operations if needed.

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_pipeline

input_pipeline = Pipeline([
        ('std_scaler', StandardScaler()),
    ])

# This does the same thing, adds a name automatically
input_pipeline = make_pipeline(StandardScaler())

training_values_transformed = input_pipeline.fit_transform(training_values)

# Create a numeric label for our system to work on
#training_labels_num = ordinal_encoder.fit_transform(training_labels)
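
For a case where more than one step is needed, a chained pipeline might look like the sketch below. The arrays are made up and the name `demo_pipeline` is mine; the point is that `fit_transform` is called on the training data, while held-out data only goes through `transform`.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Two steps, run in order: fill missing values, then standardize
demo_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])

# Hypothetical training and testing arrays, each with a missing value
demo_train = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0]])
demo_test = np.array([[np.nan, 20.0]])

# Learn medians, means, and standard deviations from the training data...
demo_train_t = demo_pipeline.fit_transform(demo_train)

# ...and reuse them, unchanged, on the testing data
demo_test_t = demo_pipeline.transform(demo_test)

print(demo_train_t)
print(demo_test_t)
```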

# Data Exploration and Visualization

In this section, we start looking at the training data to try and identify some patterns and correlations between features.

First create a copy of the data so you don't mess anything up:

In [None]:
data_copy = train_set.copy()

# Drop the 'id', 'diagnosis', and 'diagnosis_label' columns for analysis
data_copy = data_copy.drop(['id', 'diagnosis', 'diagnosis_label'], axis=1)

### Calculating Feature Correlations

You can calculate the **standard correlation coefficient**, a.k.a. **Pearson's r**, to look for pairwise correlations.

See [this page](https://chrisalbon.com/machine_learning/feature_selection/drop_highly_correlated_features/) which illustrates how to use the `.corr()` function to actually drop correlated features.
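
For intuition about what `.corr()` reports for each pair of columns, here is Pearson's r written out by hand for two made-up variables and compared against NumPy's built-in calculation.

```python
import numpy as np

# Hypothetical pair of features that rise together almost linearly
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

# Pearson's r: covariance of x and y divided by the product of their spreads
x_centered = x - x.mean()
y_centered = y - y.mean()
r_manual = (x_centered * y_centered).sum() / np.sqrt(
    (x_centered ** 2).sum() * (y_centered ** 2).sum()
)

# np.corrcoef returns the full 2x2 correlation matrix; take the off-diagonal entry
r_numpy = np.corrcoef(x, y)[0, 1]

print(r_manual, r_numpy)
```

Values near +1 or -1 indicate a strong linear relationship, which is why the code below takes the absolute value before thresholding.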

In [None]:
# We are interested in finding ALL correlated features, not just positively correlated ones
corr_matrix = data_copy.corr().abs()

# `corr_matrix` is a symmetric matrix, so we just want the upper triangle
upper_triangle_locations = np.triu(np.ones(corr_matrix.shape), k=1).astype(bool)

# `upper` now contains just the upper triangle of correlations, with the rest as NaNs
upper = corr_matrix.where(upper_triangle_locations)

# Now get a list of columns in `upper` that contain feature values correlated above 0.95
to_drop = [column for column in upper.columns if any(upper[column] > 0.95)]

print('{} columns to drop: {}'.format(len(to_drop), to_drop))

# Actually perform the drop
data_copy = data_copy.drop(columns=to_drop)

# Display the results 
data_copy.head()

## Feature Selection and Visualization

We can visualize the data to see if anything pops out -- this is very much "explore mode".

### Univariate Data Visualization: Histograms

In [None]:
# Create a histogram of a single feature
plt.hist(data_copy['radius_mean'], density=True)
plt.title('Radius Mean for All Samples')

plt.show()

In [None]:
# Separate the data into classes for easier plotting
malignant = data_copy[train_set['diagnosis_label'] == 1]
benign = data_copy[train_set['diagnosis_label'] == 0]

In [None]:
# Create data plots
f, ax = plt.subplots(figsize=(10,6))

x1 = 'texture_mean'
x1_display = x1.replace('_', ' ').title()


# These two lines do the work of plotting a histogram
ax.hist(malignant[x1], density=True, alpha=.8, label="Malignant")
ax.hist(benign[x1], density=True, alpha=.8, label="Benign")


ax.set(xlim=(5,35))

# Annotate Plot
ax.set(xlabel=r'$x$: '+x1_display,
       ylabel=r'$p(x|\omega_{j})$',
       title='Probability Density Function (PDF) for '+x1_display)

ax.legend(frameon=True)
ax.grid(linestyle=':')
plt.tight_layout()

plt.show()

In [None]:
# Create data plots
f, ax = plt.subplots(figsize=(10,6))

x1 = 'radius_mean'
x1_display = x1.replace('_', ' ').title()

ax.hist(malignant[x1], density=True, alpha=.8, label="Malignant")
ax.hist(benign[x1], density=True, alpha=.8, label="Benign")
ax.set(xlim=(5,35))

# Annotate Plot
ax.set(xlabel=r'$x$: '+x1_display,
       ylabel=r'$p(x|\omega_{j})$',
       title='Probability Density Function (PDF) for '+x1_display)

ax.legend(frameon=True)
ax.grid(linestyle=':')
plt.tight_layout()

plt.show()

### Multivariate Data Visualization: Scatter Plots and 2D Histograms

In [None]:
f, ax = plt.subplots(figsize=(10,6))
x1 = 'radius_mean'
x2 = 'texture_mean'

x1_display = x1.replace('_', ' ').title()
x2_display = x2.replace('_', ' ').title()

ax.scatter(malignant[x1], malignant[x2], alpha=.8, label="Malignant")
ax.scatter(benign[x1], benign[x2], alpha=.8, label="Benign")

# Annotate Plot
ax.set(xlabel=r'$x_{1}$: '+x1_display,
       ylabel=r'$x_{2}$: '+x2_display,
       title=x1_display+' vs. '+x2_display)

ax.legend(frameon=True)
ax.grid(linestyle=':')
plt.tight_layout()

plt.show()

### Viewing More than 2 or 3 Variables with Facet Plots

As an example, we can create a scatter matrix (a kind of "facet plot") to display possible correlations between attributes.
This is most helpful for regression targets, when you have a numeric value you want to estimate from the others (we don't have this in the BCA dataset, though).

In [None]:
attributes = ["radius_mean", "texture_mean", "compactness_mean", "fractal_dimension_mean"]

# We need to add the "diagnosis" label back in here, so Seaborn can plot it using the `hue` parameter
data_copy_display = data_copy[attributes].copy()
data_copy_display['diagnosis'] = train_set['diagnosis']

g = sns.pairplot(data_copy_display, hue='diagnosis', plot_kws={'alpha': 0.5, 'edgecolor': None}, height=3, aspect=1)

# Alter the plot
g.fig.suptitle('Pair Plot of '+str(len(attributes))+' Features', y=1.02)
g._legend.set_title("Diagnosis")

plt.show()

If we want to zoom in on a particular pair of features that might be informative, we can select them specifically.

In [None]:
f, ax = plt.subplots(figsize=(10,6))

x1 = 'fractal_dimension_mean'
x2 = 'radius_mean'

x1_display = x1.replace('_', ' ').title()
x2_display = x2.replace('_', ' ').title()

ax.scatter(malignant[x1], malignant[x2], alpha=.8, label="Malignant")
ax.scatter(benign[x1], benign[x2], alpha=.8, label="Benign")

# Annotate Plot
ax.set(xlabel=r'$x_{1}$: '+x1_display,
       ylabel=r'$x_{2}$: '+x2_display,
       title=x1_display+' vs. '+x2_display)

ax.legend(frameon=True)
ax.grid(linestyle=':')
plt.tight_layout()

plt.show()

### Attribute Combinations

We won't do this here, but combining attributes can be done to include additional features you may want to look at. 
If you have a regression target, you can re-calculate your correlation between the target and your new feature to see if the new feature is correlated as well.
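
Purely as an illustration of the mechanics, a combined attribute could be built and checked as follows. All values, the `target` column, and the `texture_per_radius` ratio are hypothetical; I am not suggesting this particular ratio is meaningful for the BCA data.

```python
import pandas as pd

# Hypothetical values, including a made-up numeric regression target
demo_df = pd.DataFrame({
    "radius_mean": [10.0, 15.0, 20.0, 25.0],
    "texture_mean": [20.0, 18.0, 16.0, 15.0],
    "target": [1.0, 2.0, 3.0, 4.0],
})

# Combine two existing attributes into a new one
demo_df["texture_per_radius"] = demo_df["texture_mean"] / demo_df["radius_mean"]

# Re-calculate the correlation of every feature, old and new, with the target
target_corr = demo_df.corr()["target"].sort_values(ascending=False)
print(target_corr)
```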

# Dimensionality Reduction

These methods will give you a more "all-encompassing" view of your data by providing a low-dimensional embedding or representation of the data in 2 or 3 dimensions.

You can add these methods to the pipeline as well, by the way.

In [None]:
# Dimensionality Reduction Imports
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap
from sklearn.manifold import LocallyLinearEmbedding
from sklearn.manifold import MDS
from sklearn.manifold import TSNE

Now create pipelines for different dimensionality reduction targets.
`sklearn` does let you replace a named step in an existing pipeline (via `set_params`), but it is simplest here to rebuild the pipeline for each method.

Here are some of the different methods you can try:

```
dimred_pipeline = make_pipeline(input_pipeline, PCA(n_components=2))
dimred_pipeline = make_pipeline(input_pipeline, Isomap(n_components=2))
dimred_pipeline = make_pipeline(input_pipeline, LocallyLinearEmbedding(n_components=2))
dimred_pipeline = make_pipeline(input_pipeline, MDS(n_components=2))
dimred_pipeline = make_pipeline(input_pipeline, TSNE(n_components=2))
```

Cut and paste whichever of those you want down below.
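
As an alternative to cutting and pasting, you could loop over several methods and keep each embedding for comparison. The sketch below uses random numbers in place of our data, a plain `StandardScaler` in place of `input_pipeline`, and an `n_neighbors=10` setting for Isomap that I picked arbitrarily.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data: 50 samples with 5 features each
rng = np.random.RandomState(0)
demo_data = rng.normal(size=(50, 5))

# The methods to compare, keyed by a display name
reducers = {
    "PCA": PCA(n_components=2),
    "Isomap": Isomap(n_components=2, n_neighbors=10),
}

# Build one scale-then-reduce pipeline per method and store its 2D embedding
embeddings = {}
for name, reducer in reducers.items():
    pipeline = make_pipeline(StandardScaler(), reducer)
    embeddings[name] = pipeline.fit_transform(demo_data)
    print(name, embeddings[name].shape)
```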

In [None]:
dimred_pipeline = make_pipeline(input_pipeline, PCA(n_components=2))

In [None]:
# Fit and apply the transform right away
X_reduced = dimred_pipeline.fit_transform(data_copy)

In [None]:
f, ax = plt.subplots(figsize=(10,6))

ax.scatter(X_reduced[:,0], X_reduced[:,1], alpha=.8, label="Unlabeled")

# Annotate Plot
ax.set(xlabel=r'$x_{1}$',
       ylabel=r'$x_{2}$',
       title=r'Reduced Dimensional Space (Unlabeled)')

ax.legend(frameon=True)
ax.grid(linestyle=':')
plt.tight_layout()

plt.show()

In [None]:
f, ax = plt.subplots(figsize=(10,6))

# Separate the data into classes for easier plotting
malignant = X_reduced[train_set['diagnosis_label'] == 1]
benign = X_reduced[train_set['diagnosis_label'] == 0]

ax.scatter(malignant[:,0], malignant[:,1], alpha=.8, label="Malignant")
ax.scatter(benign[:,0], benign[:,1], alpha=.8, label="Benign")

# Annotate Plot
ax.set(xlabel=r'$x_{1}$',
       ylabel=r'$x_{2}$',
       title=r'Reduced Dimensional Space')

ax.legend(frameon=True)
ax.grid(linestyle=':')
plt.tight_layout()

plt.show()