Python for Data Analytics | Module 6
<br>Professor James Ng

# Introduction to Machine Learning with Scikit-Learn

In [None]:
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [None]:
import sklearn as skl

## Toy Datasets available in the package

The scikit-learn package ships with several small datasets, which makes it easy to practice and learn even if you don't have external data.

In [None]:
import sklearn.datasets

### Example: Breast cancer dataset

Let's load a sample dataset that ships with the package by calling the `load_breast_cancer()` function. The source of this data is https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic).

In [None]:
cancer = sklearn.datasets.load_breast_cancer()

In [None]:
# IGNORE HOW THE FOLLOWING CODE WORKS; CONCENTRATE ON HOW THE DATA LOOKS
# Convert the data into a DataFrame

# Pass the feature names directly; wrapping them in a list would create MultiIndex columns
cancer_df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
cancer_df['Target'] = pd.Series(cancer.target)

# In case you want labels instead of numbers.
cancer_df.replace(to_replace={'Target': {0: cancer.target_names[0]}}, inplace=True)
cancer_df.replace(to_replace={'Target': {1: cancer.target_names[1]}}, inplace=True)

cancer_df.sample(5)
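
As an aside, recent scikit-learn versions can hand back the same data as a DataFrame directly. The sketch below assumes scikit-learn 0.23 or later, where `load_breast_cancer()` accepts `as_frame=True`. The names `cancer_bunch`, `cancer_frame` and `cancer_labeled` are illustrative only and are not used elsewhere in this notebook.

```python
from sklearn.datasets import load_breast_cancer

# as_frame=True asks scikit-learn to return pandas objects (assumes scikit-learn >= 0.23)
cancer_bunch = load_breast_cancer(as_frame=True)

# .frame holds the 30 feature columns plus a numeric 'target' column
cancer_frame = cancer_bunch.frame

# Optional: map the numeric codes to their class names for readability
cancer_labeled = cancer_frame.assign(
    Target=cancer_frame["target"].map(lambda code: cancer_bunch.target_names[code])
)

print(cancer_frame.shape)
print(cancer_labeled["Target"].value_counts())
```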

## Load a dataset

This dataset records the incidence of high blood pressure among a sample of adults.

In [None]:
bp_data = pd.read_csv("https://www3.nd.edu/~jng2/bp_classific.csv")

In [None]:
bp_data.head()

### Pairplot to understand the relationship between features

You can specify the column used for classification (in this case `high_bp`) as the `hue` parameter to distinguish one class from another.

In [None]:
sns.pairplot(bp_data, hue='high_bp', diag_kind='hist')

## Machine Learning: Classification 

In [None]:
## Split the input features and outcome variable

bp_data_X = bp_data.drop('high_bp',axis = 1)
bp_data_Y = bp_data['high_bp']

In [None]:
bp_data_X.head()

### `train_test_split()`: Method to split the data into train and test

Split the data randomly into a training set, used to learn a classifier, and a test set, used to validate how good our model is.

Important parameters to this method:

* **random_state**: Seed to be used by randomizer to randomly split the data. Set to the same number each run if you want to reproduce results.
* **train_size**: Use float to specify what fraction to use for training. Usually 0.7, 0.75 or 0.8

In [None]:
from sklearn.model_selection import train_test_split

bp_train_X, bp_test_X, bp_train_Y, bp_test_Y = train_test_split(bp_data_X, 
                                                                bp_data_Y, 
                                                                random_state=42, 
                                                                train_size = 0.75)

In [None]:
print(len(bp_data_X), len(bp_train_X), len(bp_test_X))
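
To see what `random_state` and `train_size` do in isolation, here is a minimal sketch on made-up data. The arrays `toy_X` and `toy_Y` are hypothetical and stand in for any feature matrix and outcome vector. The sketch splits the same data twice with the same seed and checks that both splits are identical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 rows, 2 features, alternating 0/1 labels
toy_X = np.arange(40).reshape(20, 2)
toy_Y = np.arange(20) % 2

# Two calls with the same seed return identical splits
first = train_test_split(toy_X, toy_Y, random_state=42, train_size=0.75)
second = train_test_split(toy_X, toy_Y, random_state=42, train_size=0.75)
same_split = all(np.array_equal(a, b) for a, b in zip(first, second))

# train_size=0.75 puts 75% of the rows in the training set and the rest in the test set
train_X, test_X, train_Y, test_Y = first
print(same_split, len(train_X), len(test_X))
```

Changing `random_state` to a different number would generally give a different, but equally valid, split.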

### Learn a classifier

In [None]:
# Classify with Gaussian naive Bayes. We will not delve into this algorithm here. This is just to illustrate 
# the general approach.

from sklearn.naive_bayes import GaussianNB

model = GaussianNB()

model.fit(bp_train_X, bp_train_Y)

### Predict on test data

In [None]:
bp_predict_Y = model.predict(bp_test_X)

In [None]:
import sklearn.metrics as sklmetrics

sklmetrics.accuracy_score(bp_test_Y, bp_predict_Y)

### Confusion Matrix and plotting it

In [None]:
conf_mat = sklmetrics.confusion_matrix(bp_test_Y, bp_predict_Y, labels =[0,1])
conf_mat

In [None]:
sns.heatmap(conf_mat, square=True, annot=True, cbar = False)
plt.xlabel("Predicted Value")
plt.ylabel("True Value")

In the confusion matrix above, we have:

Good cases:
* **True Negatives**: 10 cases where the true value was No High BP (high_bp = 0) and we predicted that there will be No High BP 
* **True Positives**: 14 cases where the true value was High BP (high_bp = 1) and we predicted that there will be High BP 

Bad cases:
* **False Positives**: 4 cases where the true value was No High BP (high_bp = 0) and we predicted that there will be High BP (TYPE I ERROR)
* **False Negatives**: 2 cases where the true value was High BP (high_bp = 1) and we predicted that there will be No High BP (TYPE II ERROR)

Think about the real-world consequences of making each of these two types of errors. Is Type I Error or Type II Error worse? It depends! Here most people would say Type II Error is worse (a patient who actually has high BP is misdiagnosed as having normal BP, so the condition is left untreated).
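
The four counts can also be turned into summary metrics. The sketch below rebuilds label arrays from the counts quoted above (your own run may give different counts) and computes accuracy, precision and recall. The names `true_Y` and `pred_Y` are made up for this illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score

# Rebuild label arrays matching the counts quoted above (TN=10, FP=4, FN=2, TP=14)
true_Y = np.array([0] * 10 + [0] * 4 + [1] * 2 + [1] * 14)
pred_Y = np.array([0] * 10 + [1] * 4 + [0] * 2 + [1] * 14)

# For binary labels [0, 1], ravel() returns the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(true_Y, pred_Y, labels=[0, 1]).ravel()

accuracy = accuracy_score(true_Y, pred_Y)    # (TP + TN) / total = 24 / 30
precision = precision_score(true_Y, pred_Y)  # TP / (TP + FP) = 14 / 18
recall = recall_score(true_Y, pred_Y)        # TP / (TP + FN) = 14 / 16
print(tn, fp, fn, tp)
print(round(accuracy, 3), round(precision, 3), round(recall, 3))
```

Recall is the metric that drops when false negatives (Type II errors) increase, so it is a natural one to watch when missing a true case is the costlier mistake.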

# Activity

We will use the breast cancer data available in scikit-learn to predict whether a biopsy image is cancerous or not.

Follow these steps (steps 1 and 2 are done for you):
1. Load the data 
2. Separate X (input features) and Y (outcome)
3. Split into training data and test data. Use 80% of data for training
    * Verify if the data is appropriately split by checking the number of rows in each of the training and test data. 
4. Train the GaussianNB classifier to predict malignant or benign
5. Predict using the test data
6. Provide accuracy score as well as plot the confusion matrix
    * Think about the consequence of False Positives and False Negatives

In [None]:
# Step 1: Load the data

cancer = sklearn.datasets.load_breast_cancer()

In [None]:
# Step 2: Separate X (input features) and Y (outcome)

cancer_X = cancer.data
cancer_Y = cancer.target

In [None]:
# Step 3: Split into training data and test data

cancer_train_X, cancer_test_X, cancer_train_Y, cancer_test_Y = train_test_split(cancer_X, cancer_Y, random_state=123, 
                                                                               train_size = 0.8, test_size=0.2)

In [None]:
print(len(cancer_X), len(cancer_train_X), len(cancer_test_X))

In [None]:
# Step 4: Learn the GaussianNB classifier
model_cancer = GaussianNB()

model_cancer.fit(cancer_train_X, cancer_train_Y)

In [None]:
# Step 5: Predict using the test data
cancer_predict_Y = model_cancer.predict(cancer_test_X)

In [None]:
# Step 6: Plot the confusion matrix
conf_mat = sklmetrics.confusion_matrix(cancer_test_Y, cancer_predict_Y, labels=[0, 1])
sns.heatmap(conf_mat, square=True, annot=True, cbar = False)
plt.xlabel("Predicted Value")
plt.ylabel("True Value")

In [None]:
# Step 6 (continued): Accuracy score
sklmetrics.accuracy_score(cancer_test_Y, cancer_predict_Y)

In [None]:
matches = (cancer_test_Y == cancer_predict_Y)
print(matches.sum())
print(len(matches))
matches.sum() / float(len(matches))

# Machine Learning: Unsupervised Learning

The example below is borrowed from your textbook. 

'Unsupervised' means the algorithm learns from the input features alone, without using a labeled target variable.

### Loading and visualizing the digits data

We'll use Scikit-Learn's data access interface and take a look at this data:

In [None]:
from sklearn.datasets import load_digits
digits = load_digits()
digits.images.shape

In [None]:
# IGNORE HOW THE FOLLOWING CODE WORKS; CONCENTRATE ON HOW THE DATA LOOKS
# Convert the data into a DataFrame

digits_df = pd.DataFrame(digits.data)
digits_df['Target'] = pd.Series(digits.target)
digits_df.head()

The images data is a three-dimensional array: 1,797 samples each consisting of an 8 × 8 grid of pixels.
Let's visualize the first hundred of these:

In [None]:
## DO NOT DWELL ON HOW THE IMAGES ARE LOADED
## CONCENTRATE ON THE IMAGES THAT ARE PRODUCED AS OUTPUT

import matplotlib.pyplot as plt

fig, axes = plt.subplots(10, 10, figsize=(8, 8),
                         subplot_kw={'xticks':[], 'yticks':[]},
                         gridspec_kw=dict(hspace=0.1, wspace=0.1))

for i, ax in enumerate(axes.flat):
    ax.imshow(digits.images[i], cmap='binary', interpolation='nearest')
    ax.text(0.05, 0.05, str(digits.target[i]),
            transform=ax.transAxes, color='green')

### Splitting the data into X (input characteristics) and Y (outcome)

In [None]:
digit_X = digits.data
digit_Y = digits.target

## Dimensionality Reduction

If a dataset has, say, 1000 columns, can the essential features of the data be represented by fewer columns? To check this, we use dimensionality reduction techniques.

We often use dimensionality reduction to aid in visualizing data: after all, it is much easier to plot data in two dimensions than in three dimensions or higher!

Here we will use principal component analysis (PCA) to reduce the dimension of the digits data. We will ask the model to return two 'components', that is, a two-dimensional representation of the data.

In [None]:
from sklearn.decomposition import PCA  
model = PCA(n_components=2)            
model.fit(digit_X)     
print(model.explained_variance_ratio_)

X_2D = model.transform(digit_X)

digits_df['PCA1'] = X_2D[:, 0]
digits_df['PCA2'] = X_2D[:, 1]
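
The explained variance ratios printed above show how much of the data's variance the two components capture. One way to ask how many components would be "enough" is to fit PCA with all components and accumulate those ratios. The sketch below does this. The 90% threshold and the names `full_pca`, `cumulative` and `n_needed` are assumptions chosen for illustration, not part of the original example.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

digits_X = load_digits().data  # 1,797 rows, 64 pixel columns

# Fit PCA keeping every component, then accumulate the explained variance ratios
full_pca = PCA().fit(digits_X)
cumulative = np.cumsum(full_pca.explained_variance_ratio_)

# Smallest number of components whose cumulative share reaches 90% (threshold is an assumption)
n_needed = int(np.argmax(cumulative >= 0.90)) + 1

# cumulative[1] is the share of variance captured by the first two components together
print(cumulative[1], n_needed)
```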


### Visualizing the digits data (first two principal components) after dimensionality reduction

In [None]:
# x and y are passed as keywords; recent seaborn versions no longer accept them positionally
sns.lmplot(x="PCA1", y="PCA2", hue='Target', data=digits_df, fit_reg=False,
           palette=sns.color_palette("Set1", n_colors=10, desat=.5));

## Unsupervised learning: clustering digits

Let's next look at applying clustering to the digits data.
A clustering algorithm attempts to find distinct groups of data without reference to any labels.
Here we will use a powerful clustering method called a Gaussian mixture model (GMM). 
A GMM attempts to model the data as a collection of Gaussian blobs. 

We can fit the Gaussian mixture model as follows:

In [None]:
# 1. Choose the model class
from sklearn.mixture import GaussianMixture      
# 2. Instantiate the model with hyperparameters. 
# We are building 10 clusters (n_components) because we believe there 
# may be 10 clusters, one for each digit
model = GaussianMixture(n_components=10,
            covariance_type='full')  
# 3. Fit to data. Notice y is not specified!
model.fit(digit_X)  
# 4. Determine cluster labels
y_gmm = model.predict(digit_X)        

## Visualizing the clusters and dimensionality reduction

In [None]:
# Again ignore the technical details, focus on the plot

digits_df['cluster'] = y_gmm
# x and y are passed as keywords; recent seaborn versions no longer accept them positionally
sns.lmplot(x="PCA1", y="PCA2", data=digits_df, hue='Target',
           col='cluster', fit_reg=False, col_wrap=3,
           palette=sns.color_palette("Set1", n_colors=10, desat=.5));

**Important Note**

The cluster numbers may not correspond to the actual digits they represent. From the plots, see if you can work out which cluster number corresponds to which digit.
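
If you want to check your answer in code, one simple approach is to look up the most common true digit inside each cluster. The helper `majority_label_per_cluster` below is a hypothetical function written for this illustration (it is not part of scikit-learn), and it is shown on a tiny made-up example rather than the real clusters.

```python
import numpy as np

def majority_label_per_cluster(cluster_ids, true_labels):
    """Map each cluster id to the most frequent true label inside it (hypothetical helper)."""
    mapping = {}
    for cluster in np.unique(cluster_ids):
        members = true_labels[cluster_ids == cluster]
        # bincount counts how often each non-negative integer label occurs
        mapping[int(cluster)] = int(np.bincount(members).argmax())
    return mapping

# Toy example: cluster 0 is mostly digit 7, cluster 1 is mostly digit 3
toy_clusters = np.array([0, 0, 0, 1, 1, 1, 1])
toy_digits = np.array([7, 7, 3, 3, 3, 3, 7])
toy_mapping = majority_label_per_cluster(toy_clusters, toy_digits)
print(toy_mapping)
```

With the variables from this notebook, the call would be `majority_label_per_cluster(y_gmm, digit_Y)`. Two clusters can map to the same digit, which is itself a sign that the clustering did not separate all ten digits cleanly.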

# Activity: Getting back to Classification!

Let's try to classify the digits data into corresponding integers. 

In [None]:
# split the data into training and test sets
digit_X_train, digit_X_test, digit_Y_train, digit_Y_test = train_test_split(digit_X, 
                                                    digit_Y, 
                                                    random_state=123, 
                                                    train_size = 0.8, 
                                                    test_size=0.2)

In [None]:
# train the model
learndigits = GaussianNB()
learndigits.fit(digit_X_train, digit_Y_train)

In [None]:
# use the model to predict the labels of the test data
digit_Y_predicted = learndigits.predict(digit_X_test)

In [None]:
# Plot the prediction
fig = plt.figure(figsize=(6, 6))  # figure size in inches
fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05)

# plot the digits: each image is 8x8 pixels
for i in range(64):
    ax = fig.add_subplot(8, 8, i + 1, xticks=[], yticks=[])
    ax.imshow(digit_X_test.reshape(-1, 8, 8)[i], cmap=plt.cm.binary,
              interpolation='nearest')

    # label the image with the target value
    if digit_Y_predicted[i] == digit_Y_test[i]:
        ax.text(0, 7, str(digit_Y_predicted[i]), color='green')
    else:
        ax.text(0, 7, str(digit_Y_predicted[i]), color='red')
            

In [None]:
# Confusion matrix over all ten digit classes (0 through 9)
conf_mat = sklmetrics.confusion_matrix(digit_Y_test, digit_Y_predicted, labels=np.arange(0, 10))
sns.heatmap(conf_mat, square=True, annot=True, cbar = False)
plt.xlabel("Predicted Value")
plt.ylabel("True Value")

In [None]:
sklmetrics.accuracy_score(digit_Y_test, digit_Y_predicted)

In [None]:
# What exactly was accuracy score?
matches = (digit_Y_predicted == digit_Y_test)
print(matches.sum())
print(len(matches))
matches.sum() / float(len(matches))