# Supervised Machine Learning Systems - (Classification)

In [None]:
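# Helper functions to display a video or an image
from IPython.display import HTML
# The helpers further down rely on pandas, NumPy, and matplotlib
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

def display_video(src):
    # Build the embed URL once so the printed and embedded sources match
    url = src + '?autoplay=1&modestbranding=1&rel=0'
    print('Source : ' + url)
    # Quote the src attribute so the URL cannot break the surrounding tag
    return HTML('<iframe width="800" height="400" src="' + url + '" frameborder="0" allowfullscreen></iframe>')

def display_image(src):
    print('Source : ' + src)
    return HTML('<img width="600" height="300" src="' + src + '">')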

def displayResults (y_test, predictions, n=10):
    Results = pd.DataFrame({'Actual': y_test})
    column = pd.DataFrame({'Predictions': predictions})
    Results = Results.join(column.set_index(Results.index))
    return Results.head(n)

def display_cm(cm):
    plt.clf()
    plt.imshow(cm, interpolation='nearest', cmap=plt.cm.Wistia)
    classNames = ['Benign','Malignant']
    plt.title('Confusion Matrix')
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    tick_marks = np.arange(2)
    plt.xticks(tick_marks, classNames, rotation=45)
    plt.yticks(tick_marks, classNames)
    s = [['TN','FP'], ['FN', 'TP']]
    for i in range(2):
        for j in range(2):
            plt.text(j,i, str(s[i][j])+" = "+str(cm[i][j]))
    plt.show()

## What is a Classification Problem?

We briefly talked about the classification problem in our previous notebook.

1. **Independent Variables for classification** - These are also called the features of our dataset. They are the variables which, when varied, can affect the target class that we want to predict.
2. **Dependent Variable for classification** - When your target variable takes one of a fixed set of class labels, it's a classification problem. For instance, classifying pictures as dogs or cats, or a tumour as cancerous or non-cancerous. You are not predicting a continuous quantity here but discrete classes.

Let's take an example to understand it clearly:

<b> [Breast Cancer Diagnostic] </b>

There are two main classifications of tumors. One is known as benign and the other as malignant. A benign tumor is a tumor that does not invade its surrounding tissue or spread around the body. A malignant tumor is a tumor that may invade its surrounding tissue or spread around the body.

In [None]:
display_image('https://www.verywellhealth.com/thmb/xnYC1DVmfPtwjWCEdO0HjSZbcBo=/1787x0/filters:no_upscale():max_bytes(150000):strip_icc():format(webp)/514240-article-img-malignant-vs-benign-tumor2111891f-54cc-47aa-8967-4cd5411fdb2f-5a2848f122fa3a0037c544be.png')

Our target is to train a Logistic Regression model that can predict whether the tumor is benign (B) or malignant (M).

Attribute Information:
<br>1) ID number 
<br>2) Diagnosis (M = malignant, B = benign) 
<br>3-32) Ten real-valued features are computed for each cell nucleus: 
<br>a) radius (mean of distances from center to points on the perimeter) 
<br>b) texture (standard deviation of gray-scale values) 
<br>c) perimeter 
<br>d) area 
<br>e) smoothness (local variation in radius lengths) 
<br>f) compactness (perimeter^2 / area - 1.0) 
<br>g) concavity (severity of concave portions of the contour) 
<br>h) concave points (number of concave portions of the contour) 
<br>i) symmetry 
<br>j) fractal dimension ("coastline approximation" - 1)

**`'Diagnosis'`** column is the **Dependent Variable or target column** because we want our algorithm to predict this class.

**`'3-32'`** are your **Features or Independent Variables** which will help you predict the Benign/Malignant class. Varying any one of them can change the predicted diagnosis. The ID number (column 1) only identifies the record, so it is not used as a feature.

## Basic Intuition 

Now we will discuss the Logistic Regression algorithm. Don't be confused by the name "Logistic Regression"; it is named that way for historical reasons and is actually an approach to classification problems, not regression problems.

Instead of our output vector y being a continuous range of values, it will only be 'M' or 'B'.
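
To make the intuition concrete, here is a minimal sketch of how a logistic model turns a linear score into a class label, assuming the standard logistic (sigmoid) function and a 0.5 cut-off. The weights `w`, bias `b`, and rows in `X_demo` are hypothetical values chosen only for illustration; they are not learned from the breast cancer data.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights, bias, and two feature rows (illustration only)
w = np.array([0.8, -0.5])
b = -1.0
X_demo = np.array([[3.0, 1.0],
                   [0.5, 2.0]])

scores = X_demo @ w + b                    # linear combination of the features
probs = sigmoid(scores)                    # estimated P(class = 'M')
labels = np.where(probs > 0.5, 'M', 'B')   # probability above 0.5 -> 'M'

print(probs)
print(labels)
```

The model still computes a continuous quantity (a probability), but the final output is one of two classes.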

In [None]:
# Importing required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Read the CSV file 

In [None]:
# Loading the dataset into a pandas dataframe and displaying a random sample of 5 rows
df = pd.read_csv('/Data/Breast_Cancer_Diagnostic.csv')
df.sample(n=5, random_state=10)

In [None]:
df['diagnosis'] = df.diagnosis.apply(lambda x: 1 if x == 'M' else 0)

In [None]:
df.sample(n=5, random_state=10)

*We will consider only the ten real-valued mean features for diagnosis in this project.<br>
Let's select those features along with the diagnosis column.*

In [None]:
df.columns

In [None]:
df = df[['radius_mean', 'texture_mean', 'perimeter_mean',
       'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean','diagnosis']]

In [None]:
df.shape

In [None]:
df['diagnosis'].value_counts()

## Check for Null Values

In [None]:
df.columns[df.isnull().any()]

## Load X and Y variables

In [None]:
# Load the features to a variable X
# X is created by simply dropping the diagnosis column and retaining all others
X = df.drop('diagnosis', axis = 1)
# Load the dependent variable to y
y = df['diagnosis']

In [None]:
X.head()

In [None]:
y.tail(10)

## Train-Test Split

**> Train-Test split -** We split our data into two parts: the train set and the test set. A 70-30 train-test split is a common choice, but the ratio is up to you. We then build our function f(x) (the model) using the train set and see how well it does on the test set.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

In [None]:
X_train.shape, X_test.shape

In [None]:
y_test.value_counts()
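
If the class counts in the test set look very different from those in the full dataset, one option is a stratified split. The sketch below is not part of the original workflow; it uses synthetic stand-in arrays (`X_demo`, `y_demo`) and the `stratify` parameter of `train_test_split` to keep the class proportions roughly equal in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 30% positive class (illustration only)
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([1] * 30 + [0] * 70)

# stratify=y_demo asks for the same class ratio in the train and test parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=1, stratify=y_demo)

print('overall positive rate:', y_demo.mean())
print('train positive rate  :', y_tr.mean())
print('test positive rate   :', y_te.mean())
```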

## Create an Instance of the classifier

In [None]:
# Let's create an instance for the LogisticRegression model
from sklearn.linear_model import LogisticRegression
Classifier = LogisticRegression()

## Build the Model using Fit

In [None]:
# Training the model on our train dataset
Classifier.fit(X_train,y_train)

## Get the Predictions

In [None]:
predictions_proba = Classifier.predict_proba(X_test)

In [None]:
# Getting predictions from the model 
y_test_hat = Classifier.predict(X_test)

In [None]:
# Named results_df rather than `all` to avoid shadowing Python's built-in all()
results_df = pd.DataFrame({'Prob(y_test==1)': predictions_proba[:,1], 'y_test_hat': y_test_hat, 'y_test': y_test})
results_df.sample(n=10)
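
The table above puts the predicted probability next to the predicted class. For a binary `LogisticRegression`, the class returned by `predict` corresponds to cutting the positive-class probability at 0.5. The following sketch checks this on synthetic stand-in data from `make_classification` (not the breast cancer dataset); the names `X_demo`, `y_demo`, and `clf` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (illustration only)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
clf = LogisticRegression().fit(X_demo, y_demo)

proba_pos = clf.predict_proba(X_demo)[:, 1]   # P(class = 1) for each row
manual = (proba_pos > 0.5).astype(int)        # apply the 0.5 cut-off by hand

# Compare the hand-made labels with the labels returned by predict()
print(np.array_equal(manual, clf.predict(X_demo)))
```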

## Model Evaluation Techniques for Classification 

### 1. The  Classification Report

In [None]:
# Getting classification report
from sklearn.metrics import classification_report
report = classification_report(y_test,y_test_hat)

print(report)

### 2. The confusion matrix

In [None]:
from sklearn.metrics import confusion_matrix, recall_score, precision_score
 
cm = confusion_matrix(y_test, y_test_hat)
display_cm(cm)

In [None]:
## Calculations 

TN = cm[0,0]
FP = cm[0,1]
FN = cm[1,0]
TP = cm[1,1]

accuracy = (TP+TN) / float(TP+TN+FN + FP)
print('Accuracy ',accuracy)

sensitivity = TP / float(FN + TP)
print('sensitivity or recall (Malignant Predictions):',sensitivity)

specificity = TN / (TN + FP)
print('specificity(Benign Class)',specificity)

precision = TP / float(TP + FP)
print('precision',precision)

#### Positive Predictive Value or Precision:
When a positive value is predicted, how often is the prediction correct?
<br>How "precise" is the classifier when predicting positive instances?
<br>TP / (TP + FP) (all predicted positives)
#### Sensitivity or Recall or True Positive Rate:
When the actual value is positive, how often is the prediction correct?
<br>Something we want to maximize, which means minimizing FN
<br>How "sensitive" is the classifier to detecting positive instances?
<br>TP / (TP + FN) (all actual positives)
#### Specificity:
When the actual value is negative, how often is the prediction correct?
<br>Something we want to maximize, which means minimizing FP
<br>How "specific" (or "selective") is the classifier, that is, how rarely does it flag a negative instance as positive?
<br>TN / (TN + FP) (all actual negatives)
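
The same three metrics can be cross-checked against scikit-learn. This is a small sketch on hypothetical labels (`y_true`, `y_pred`), not on the notebook's results. It assumes labels coded as 0/1 with 1 as the positive class, and it obtains specificity as the recall of the negative class via `pos_label=0`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels (1 = malignant, 0 = benign), for illustration only
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0])

# For binary 0/1 labels, ravel() yields the cells in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = precision_score(y_true, y_pred)               # TP / (TP + FP)
sensitivity = recall_score(y_true, y_pred)                # TP / (TP + FN)
specificity = recall_score(y_true, y_pred, pos_label=0)   # TN / (TN + FP)

print('precision  :', precision, 'by hand:', tp / (tp + fp))
print('sensitivity:', sensitivity, 'by hand:', tp / (tp + fn))
print('specificity:', specificity, 'by hand:', tn / (tn + fp))
```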

**Conclusion:**
<br>Our model is specific but not that sensitive.
<br>The confusion matrix gives you a more complete picture of how your classifier is performing.
<br>It also lets you compute various classification metrics, and these metrics can guide your model selection.

Which metrics should you focus on?

- Choice of metric depends on your business objective
- Identify if FP or FN is more important to reduce
- Choose metric with relevant variable (FP or FN in the equation)
- Spam filter (positive class is "spam"):
    - Optimize for precision or specificity
        - precision: false positive as variable
        - specificity: false positive as variable
    - Because false negatives (spam goes to the inbox) are more acceptable than false positives (non-spam is caught by the spam filter)
- Fraudulent transaction detector (positive class is "fraud"):
    - Optimize for sensitivity
        - FN as a variable
    - Because false positives (normal transactions that are flagged as possible fraud) are more acceptable than false negatives (fraudulent transactions that are not detected)
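
One common way to act on this choice, not covered in the steps above, is to move the probability cut-off used to assign the positive class. The sketch below uses hypothetical labels and probabilities (`y_true`, `proba`) and a hypothetical helper `metrics_at` to show that lowering the threshold trades precision for sensitivity.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical true labels and predicted P(class = 1), for illustration only
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
proba = np.array([0.10, 0.40, 0.35, 0.80, 0.20, 0.60, 0.55, 0.90])

def metrics_at(threshold):
    """Return (sensitivity, precision) when predicting 1 above `threshold`."""
    y_pred = (proba > threshold).astype(int)
    return recall_score(y_true, y_pred), precision_score(y_true, y_pred)

# A lower threshold flags more cases as positive: fewer FN, but more FP
for t in (0.5, 0.3):
    sens, prec = metrics_at(t)
    print(f'threshold={t}: sensitivity={sens:.2f}, precision={prec:.2f}')
```

A fraud-style problem, where FN is costlier, would lean toward the lower threshold; a spam-style problem, where FP is costlier, would lean toward the higher one.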
