# Introduction to Machine Learning with Scikit-Learn

In [None]:
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [None]:
import sklearn as skl

## Toy Datasets available in the package

The scikit-learn package ships with several small datasets, which makes it easy to practice and learn without having to load data from a file.

In [None]:
import sklearn.datasets

### Example: Breast cancer dataset

We are going to work with a sample dataset that ships with the package and can be loaded by calling the `load_breast_cancer()` function. The original data is [available here](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic%29).
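
Before converting anything to a DataFrame, it can help to look at what the loader returns. The sketch below assumes only that scikit-learn is installed; it prints the pieces of the returned object that the rest of this notebook relies on (`data`, `target`, `target_names`, `feature_names`).

```python
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

# The loader returns a dictionary-like "Bunch" object
print(cancer.data.shape)         # feature matrix: (n_samples, n_features)
print(cancer.target.shape)       # one class label per sample
print(cancer.target_names)       # what the labels 0 and 1 stand for
print(cancer.feature_names[:5])  # first few feature names
```

Note that in this dataset label 0 is `malignant` and label 1 is `benign`, which matters when you later read a confusion matrix.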

In [None]:
cancer = sklearn.datasets.load_breast_cancer()

In [None]:
# IGNORE HOW THE FOLLOWING CODE WORKS BUT CONCENTRATE ON HOW THE DATA LOOKS
# Convert the data into a DataFrame

# Pass the feature names directly (wrapping them in a list would create a MultiIndex)
cancer_df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
cancer_df['Target'] = pd.Series(cancer.target)

# In case you want labels instead of numbers.
cancer_df['Target'] = cancer_df['Target'].map({0: cancer.target_names[0],
                                               1: cancer.target_names[1]})

cancer_df.head()

## Load a dataset

In [None]:
bp_data = pd.read_csv("./data/bp_classific.csv")

In [None]:
bp_data.head()

### Pairplot to understand the effect of various features

You can specify the column you want to use for classification (in this case `high_bp`) as the `hue` parameter to distinguish one class from another.

In [None]:
sns.pairplot(bp_data, hue='high_bp')

## Machine Learning: Classification

In [None]:
## Split the input features and outcome variable

bp_data_X = bp_data.drop('high_bp',axis = 1)
bp_data_Y = bp_data['high_bp']

In [None]:
bp_data_X.head()

### `train_test_split()`: Method to split the data into train and test

We usually split the data into a training set, used to learn a classifier, and a test set, used to validate how good our model is.

Important parameters of this function:

* **random_state**: Seed used by the random number generator to shuffle and split the data. Fixing it makes the split reproducible.
* **train_size**: Use a float to specify what fraction of the data to use for training. Usually 0.75
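
To see what these two parameters do, here is a minimal sketch on made-up data (the arrays `X` and `y` are hypothetical and exist only for this illustration). It splits the same data twice with the same seed and checks that the two splits are identical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 samples, 2 features, alternating labels
X = np.arange(40).reshape(20, 2)
y = np.array([0, 1] * 10)

# Two splits with the same seed and the same training fraction
X_train_a, X_test_a, y_train_a, y_test_a = train_test_split(
    X, y, random_state=42, train_size=0.75)
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(
    X, y, random_state=42, train_size=0.75)

print(len(X_train_a), len(X_test_a))         # 75% / 25% of the 20 rows
print(np.array_equal(X_train_a, X_train_b))  # same seed, same split
```

Changing `random_state` to another value would generally give a different, but equally valid, split.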

In [None]:
from sklearn.model_selection import train_test_split

bp_train_X, bp_test_X, bp_train_Y, bp_test_Y = train_test_split(bp_data_X, bp_data_Y, random_state=42, train_size = 0.75)

In [None]:
print(len(bp_data_X), len(bp_train_X), len(bp_test_X))

### Learn a classifier

In [None]:
# FOR NOW IGNORE THE CLASSIFIER NAME. WE'LL UNDERSTAND THAT IN THE NEXT CLASS
from sklearn.naive_bayes import GaussianNB

model = GaussianNB()

model.fit(bp_train_X, bp_train_Y)

### Predict on test data

In [None]:
bp_predict_Y = model.predict(bp_test_X)

In [None]:
import sklearn.metrics as sklmetrics

sklmetrics.accuracy_score(bp_test_Y, bp_predict_Y)

### Confusion Matrix and plotting it

In [None]:
conf_mat = sklmetrics.confusion_matrix(bp_test_Y, bp_predict_Y, labels =[0,1])
conf_mat

In [None]:
sns.heatmap(conf_mat, square=True, annot=True, cbar = False)
plt.xlabel("Predicted Value")
plt.ylabel("True Value")

In the example above, we have:

Good cases:
* **True Negatives**: 10 cases when the true value was No High BP (high_bp = 0) and we predicted that there will be No High BP 
* **True Positives**: 14 cases when the true value was High BP (high_bp = 1) and we predicted that there will be High BP 

Bad cases:
* **False Positives**: 4 cases when the true value was No High BP (high_bp = 0) and we predicted that there will be High BP (TYPE I ERROR)
* **False Negatives**: 2 cases when the true value was High BP (high_bp = 1) and we predicted that there will be No High BP (TYPE II ERROR)

Now think about the consequences of each of these mistakes (Type I and Type II errors) and how they might affect a doctor's willingness to use such methods in the real world.
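
The four counts above can also be turned into summary metrics. The sketch below re-enters the counts quoted in the text by hand (it does not rerun the model) and assumes the layout produced by `confusion_matrix` with `labels=[0, 1]`: rows are true classes, columns are predicted classes.

```python
import numpy as np

# Counts quoted in the text: rows = true class, columns = predicted class
conf_mat = np.array([[10, 4],    # true 0: 10 predicted 0, 4 predicted 1
                     [2, 14]])   # true 1:  2 predicted 0, 14 predicted 1

# For a 2x2 matrix, ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = conf_mat.ravel()

accuracy = (tp + tn) / conf_mat.sum()  # fraction of all predictions that are correct
precision = tp / (tp + fp)             # of predicted High BP, how many truly are
recall = tp / (tp + fn)                # of true High BP, how many we caught
specificity = tn / (tn + fp)           # of true No High BP, how many we cleared

print(accuracy, precision, recall, specificity)
```

A low recall corresponds to many Type II errors (missed High BP cases), while a low precision corresponds to many Type I errors (false alarms).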

# Activity

We will use the breast cancer data available in scikit-learn to predict whether a biopsy image is cancerous or not.

Follow these steps (steps 1 and 2 are done for you):
1. Load the data 
2. Separate X (input features) and Y (outcome)
3. Split into training data and test data. Use 80% of data for training
    * Verify that the data is appropriately split by checking the number of rows in each of the training and test data. 
4. Learn the GaussianNB classifier to predict malignant or benign
5. Predict using the test data
6. Provide accuracy score as well as plot the confusion matrix
    * Think about the consequence of False Positives and False Negatives

In [None]:
# Step 1: Load the data

cancer = sklearn.datasets.load_breast_cancer()

In [None]:
# Step 2: Separate X (input features) and Y (outcome)

cancer_X = cancer.data
cancer_Y = cancer.target

In [None]:
# Step 3: Split into training and test data (80% for training)


In [None]:
# Step 3.1: Verify the shapes

In [None]:
# Step 4: Learn GaussianNB

In [None]:
# Step 5: Predict on test data

In [None]:
# Step 6: Check the accuracy


In [None]:
# Step 6.1: Check the confusion matrix

# Machine Learning: Unsupervised Learning

The example below is borrowed from your textbook. 

### Loading and visualizing the digits data

We'll use Scikit-Learn's data access interface and take a look at this data:

In [None]:
from sklearn.datasets import load_digits
digits = load_digits()
digits.images.shape

In [None]:
# IGNORE HOW THE FOLLOWING CODE WORKS BUT CONCENTRATE ON HOW THE DATA LOOKS
# Convert the data into a DataFrame

digits_df = pd.DataFrame(digits.data)
digits_df['Target'] = pd.Series(digits.target)
digits_df.head()

The images data is a three-dimensional array: 1,797 samples each consisting of an 8 × 8 grid of pixels.
Let's visualize the first hundred of these:

In [None]:
## DO NOT WORRY HOW THE IMAGES ARE LOADED
## CONCENTRATE ON THE IMAGES THAT ARE PRODUCED AS OUTPUT

import matplotlib.pyplot as plt

fig, axes = plt.subplots(10, 10, figsize=(8, 8),
                         subplot_kw={'xticks':[], 'yticks':[]},
                         gridspec_kw=dict(hspace=0.1, wspace=0.1))

for i, ax in enumerate(axes.flat):
    ax.imshow(digits.images[i], cmap='binary', interpolation='nearest')
    ax.text(0.05, 0.05, str(digits.target[i]),
            transform=ax.transAxes, color='green')

### Splitting the data into X (input characteristics) and Y (outcome)

In [None]:
digit_X = digits.data
digit_Y = digits.target

## Dimensionality Reduction

The task of dimensionality reduction is to ask whether there is a suitable lower-dimensional representation that retains the essential features of the data.
Often dimensionality reduction is used as an aid to visualizing data: after all, it is much easier to plot data in two dimensions than in four dimensions or higher!

Here we will use principal component analysis (PCA), which is a fast linear dimensionality reduction technique.
We will ask the model to return two components—that is, a two-dimensional representation of the data.
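
A two-dimensional projection necessarily discards information, so it is worth checking how much of the variance the two components retain. The sketch below is a self-contained illustration (the variable name `pca` is chosen here and is not used elsewhere in the notebook); it reads the fitted model's `explained_variance_ratio_` attribute.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

digits = load_digits()

# Fit PCA and project the 64-dimensional pixel vectors onto 2 components
pca = PCA(n_components=2)
X_2D = pca.fit_transform(digits.data)

print(X_2D.shape)                           # one 2-D point per image
print(pca.explained_variance_ratio_)        # share of variance per component
print(pca.explained_variance_ratio_.sum())  # total share retained in 2-D
```

If the total is well below 1, the 2-D plot is a useful picture but not a complete summary of the data.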

In [None]:
from sklearn.decomposition import PCA  
model = PCA(n_components=2)            
model.fit(digit_X)                      
X_2D = model.transform(digit_X)

digits_df['PCA1'] = X_2D[:, 0]
digits_df['PCA2'] = X_2D[:, 1]


### Visualizing the dimensionality reduction

In [None]:
sns.lmplot(x = "PCA1", y = "PCA2", hue='Target', data=digits_df, fit_reg=False, 
          palette = sns.color_palette("Set1", n_colors=10, desat=.5));

## Unsupervised learning: Digits clustering

Let's next look at applying clustering to the digits data.
A clustering algorithm attempts to find distinct groups of data without reference to any labels.
Here we will use a powerful clustering method called a Gaussian mixture model (GMM). 
A GMM attempts to model the data as a collection of Gaussian blobs. 

We can fit the Gaussian mixture model as follows:

In [None]:
# 1. Choose the model class
from sklearn.mixture import GaussianMixture      
# 2. Instantiate the model with hyperparameters. 
# We are building 10 clusters (n_components) because we believe there 
# may be 10 clusters, one for each digit
model = GaussianMixture(n_components=10,
                        covariance_type='full')
# 3. Fit to data. Notice y is not specified!
model.fit(digit_X)  
# 4. Determine cluster labels
y_gmm = model.predict(digit_X)        

## Visualizing the clusters and dimensionality reduction

In [None]:
# Again ignore the technical details

digits_df['cluster'] = y_gmm
sns.lmplot(x = "PCA1", y = "PCA2", data=digits_df, hue='Target',
           col='cluster', fit_reg=False, col_wrap=3,
          palette = sns.color_palette("Set1", n_colors=10, desat=.5));

**Important Note**

The cluster numbers may not correspond to the actual digits they represent. From the image, see if you can find which cluster number corresponds to which digit.
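
One way to check such a mapping numerically is a cross-tabulation of cluster labels against true labels. The sketch below uses a tiny set of hypothetical labels invented for illustration, not the output of the model above; in the notebook the same call could presumably be applied to `digits_df['cluster']` and `digits_df['Target']`.

```python
import pandas as pd

# Hypothetical labels for nine samples, invented for illustration
true_digit = [0, 0, 0, 1, 1, 1, 2, 2, 2]
cluster    = [2, 2, 2, 0, 0, 1, 1, 1, 1]

# Rows are cluster numbers, columns are true digits, cells are counts
table = pd.crosstab(pd.Series(cluster, name='cluster'),
                    pd.Series(true_digit, name='digit'))
print(table)

# For each cluster, pick the digit that occurs most often in it
cluster_to_digit = table.idxmax(axis=1)
print(cluster_to_digit.to_dict())
```

Rows whose counts are spread over several columns indicate clusters that mix more than one digit.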