Given that there are so many fewer cancer cases than benign cases, accuracy would _not_ be a good statistic to use. A human or an algorithm could label _all_ of the cancer cases as benign and still achieve an accuracy of over 80%. 
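To make the imbalance concrete, here is a minimal sketch of the "label everything benign" baseline. The class counts (29 cancer, 120 benign) are read off the row sums of the confusion matrices in this notebook; the variable names are illustrative only.

```python
# Class counts taken from the confusion matrices in this notebook:
# 29 cancer cases and 120 benign cases (149 total).
n_cancer = 29
n_benign = 120

# A "classifier" that labels every case benign gets every benign case right
# and every cancer case wrong.
baseline_accuracy = n_benign / (n_cancer + n_benign)
baseline_sensitivity = 0 / n_cancer  # no cancer case is ever detected

print(f"all-benign accuracy:    {baseline_accuracy:.3f}")
print(f"all-benign sensitivity: {baseline_sensitivity:.3f}")
```

Accuracy stays above 0.80 even though not a single cancer is found, which is why the rest of this notebook looks at confusion matrices rather than accuracy alone.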

False negatives are never desirable in a clinical setting, but they have less impact on the patient if we _know_ that there will be a second reader (i.e. our algorithm reads an image first, and the label is then confirmed by a radiologist). Even then, this is only appropriate when the diagnosis is not time-sensitive. A high level of false positives is more acceptable in a situation like the one in the previous exercise, where our algorithm was used to prioritize emergency reads and we would want it to be somewhat liberal in flagging cases. 

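Changing the algorithm's classification threshold changed the FP and FN rates, one at the expense of the other: lowering it from 0.5 to 0.4 cut false negatives from 8 to 4 but raised false positives from 8 to 16, while raising it to 0.6 did the reverse (9 false negatives, 5 false positives). 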

At the same threshold, the algorithm's true positive count rose from 21 to 23 when it was compared to the radiologist instead of the perfect labeler, even though its true positive rate fell from 21/29 to 23/36 because the radiologist labeled more cases as cancer. This means that without the _true_ ground truth of diagnoses in an image, we may never be able to assess our algorithm with 100% accuracy, and its results may be inflated or deflated depending on the quality of the radiologist labels that we compare our outputs against. 

In [1]:
import pandas as pd
import numpy as np
from sklearn.metrics import confusion_matrix

### I then read the set of labels into a dataframe:

In [2]:
labels = pd.read_csv('labels.csv')

In [8]:
labels.head(6)

Unnamed: 0,perfect_labeler,radiologist,algorithm
0,cancer,cancer,0.99
1,cancer,cancer,0.94
2,cancer,cancer,0.73
3,cancer,cancer,0.82
4,cancer,cancer,0.98
5,cancer,cancer,0.63


### I first assessed the radiologist's performance by:
* Assessing the _accuracy_ of the radiologist by just looking at the percent of cases that they correctly labeled
* Next, looking at the true positive and true negative rates of the radiologist by generating a _confusion matrix_
    * A confusion matrix is an N x N matrix used for evaluating the performance of a classification model, where N is the number of target classes. The matrix compares the actual target values with those predicted by the machine learning model. ... In scikit-learn's `confusion_matrix`, the rows represent the actual values of the target variable and the columns represent the predicted values

        * For more info, see : https://www.youtube.com/watch?v=bgyN3RO2ICo
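As a rough illustration of the row/column convention, the sketch below builds a confusion matrix by hand in plain Python, with rows for the actual class and columns for the predicted class (the same layout scikit-learn's `confusion_matrix` returns). The helper `confusion_counts` and the toy labels are hypothetical and are not the notebook's data.

```python
def confusion_counts(actual, predicted, labels):
    """Build a matrix with rows = actual class, columns = predicted class."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix

# Hypothetical toy labels, not the notebook's data.
actual    = ["cancer", "cancer", "cancer", "benign", "benign", "benign", "benign"]
predicted = ["cancer", "cancer", "benign", "cancer", "benign", "benign", "benign"]

matrix = confusion_counts(actual, predicted, labels=["cancer", "benign"])

# With "cancer" listed first, the first row holds the actual cancer cases.
(tp, fn), (fp, tn) = matrix
print(matrix)  # [[2, 1], [1, 3]]
```

Listing the positive class first, as this notebook does with `labels=["cancer","benign"]`, puts true positives in the top-left cell and true negatives in the bottom-right.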

In [4]:
radiologist_accuracy = sum(labels.perfect_labeler == labels.radiologist)/len(labels)

In [5]:
radiologist_accuracy

0.8993288590604027

### A classification accuracy of about 0.90 (roughly 90%) looks very high, but labeling every case as benign would already score about 0.81

In [7]:
confusion_matrix(labels.perfect_labeler.values,labels.radiologist.values,labels=["cancer","benign"])

array([[ 25,   4],
       [ 11, 109]], dtype=int64)


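#### If the output of cell [7] above looks confusing: it is the confusion matrix, returned as an array.
#### An array is a data structure that can have any number of dimensions (1D, 2D, 3D, 4D, ...); this one is 2D, which is what is called a matrix in data science.
#### The rows are the perfect labeler's labels and the columns are the radiologist's labels, in the order cancer, benign: 25 true positives, 4 false negatives, 11 false positives, and 109 true negatives.
#### More info about confusion matrices: https://subscription.packtpub.com/bookbig_data_and_business_intelligence/9781838555078/6/ch06lvl1sec34/confusion-matrix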
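The four cells of the radiologist's matrix can be turned into rates directly. This is a small sketch using the counts from cell [7]; the variable names are illustrative, and "precision" is an extra metric not computed elsewhere in this notebook.

```python
# Counts from the radiologist-vs-perfect-labeler matrix in cell [7]:
# [[25,   4],
#  [11, 109]]  with rows = actual (cancer, benign), columns = radiologist.
tp, fn = 25, 4
fp, tn = 11, 109

accuracy = (tp + tn) / (tp + fn + fp + tn)
sensitivity = tp / (tp + fn)   # true positive rate: cancers that were caught
specificity = tn / (tn + fp)   # true negative rate: benign cases cleared
precision = tp / (tp + fp)     # share of "cancer" calls that were correct

print(f"accuracy:    {accuracy:.3f}")
print(f"sensitivity: {sensitivity:.3f}")
print(f"specificity: {specificity:.3f}")
print(f"precision:   {precision:.3f}")
```

The accuracy reproduces the 0.899 computed in cell [5], while the sensitivity and precision show what that single number hides: 4 of 29 cancers were missed, and 11 of the 36 cancer calls were wrong.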


     
### Now let's look at the algorithm's performance compared to the perfect labeler:
* The algorithm doesn't create a binary label; it instead returns a _probability_ of cancer, so choose a probability cut-off to use for the algorithm's labeling of cancer vs. benign. _(Hint: 0.5 is a reasonable starting place.)_ This is still supervised learning: the algorithm outputs a continuous score, as a regression model would, and thresholding turns that score into a class label.
* Start with assessing _accuracy_ again here
* Generate a confusion matrix

In [10]:
## Here, I'm going to change my entire dataframe to 0's and 1's to make processing easier
labels = labels.replace('cancer',1).replace('benign',0)
labels.head(5)

Unnamed: 0,perfect_labeler,radiologist,algorithm
0,1,1,0.99
1,1,1,0.94
2,1,1,0.73
3,1,1,0.82
4,1,1,0.98


In [12]:
algorithm_thresh = (labels.algorithm > 0.5)

In [13]:
confusion_matrix(labels.perfect_labeler.values,algorithm_thresh,labels=[1,0])

array([[ 21,   8],
       [  8, 112]], dtype=int64)

In [26]:
## Accuracy of the thresholded algorithm labels (cut-off 0.5) against the perfect labeler
algorithm_accuracy = sum(labels.perfect_labeler == algorithm_thresh)/len(labels)

In [27]:
round(algorithm_accuracy, 4)

0.8926

### The algorithm's accuracy against the perfect labeler is about 0.89 at a cut-off of 0.5, slightly below the radiologist's 0.90

What happens now if you change the threshold cut-off for your algorithm's classification to 0.4? What if you raise it to 0.6? How do accuracy, fp, fn, tp, and tn change?

In [15]:
algorithm_thresh = (labels.algorithm > 0.4)

In [16]:
confusion_matrix(labels.perfect_labeler.values,algorithm_thresh,labels=[1,0])

array([[ 25,   4],
       [ 16, 104]], dtype=int64)

In [17]:
algorithm_thresh = (labels.algorithm > 0.6)

In [19]:
confusion_matrix(labels.perfect_labeler.values,algorithm_thresh,labels=[1,0])

array([[ 20,   9],
       [  5, 115]], dtype=int64)
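To compare the three cut-offs side by side, the sketch below recomputes accuracy, sensitivity, and specificity from the confusion matrices printed above. The counts are copied from those outputs; the helper `summarize` is a hypothetical name introduced only for this illustration.

```python
# Confusion-matrix counts copied from the cells above, as (tp, fn, fp, tn)
# against the perfect labeler (rows were actual, columns predicted).
counts_by_threshold = {
    0.4: (25, 4, 16, 104),
    0.5: (21, 8, 8, 112),
    0.6: (20, 9, 5, 115),
}

def summarize(tp, fn, fp, tn):
    """Return accuracy, sensitivity (TPR), and specificity (TNR)."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

summary = {t: summarize(*c) for t, c in counts_by_threshold.items()}

for threshold, metrics in summary.items():
    tp, fn, fp, tn = counts_by_threshold[threshold]
    print(f"threshold={threshold}: fn={fn:2d} fp={fp:2d} "
          f"acc={metrics['accuracy']:.3f} "
          f"sens={metrics['sensitivity']:.3f} "
          f"spec={metrics['specificity']:.3f}")
```

Raising the cut-off from 0.4 to 0.6 improves accuracy and specificity but lowers sensitivity, so the threshold with the best accuracy here is also the one that misses the most cancers.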

### Finally, let's compare our algorithm to the radiologist
* A "perfect labeler" might not exist in the real world, and in fact, it often does not
* In AI for medical imaging, using a radiologist's labels as our "true" label is often the standard of practice, and algorithm performance is judged in both an academic setting as well as in the regulated industry landscape based on performance against an expert human

* Repeat the steps above using a set threshold for your algorithm (again, 0.5 is perfectly reasonable) but now computing accuracy, tp, tn, fp, fn against the radiologist. 

In [28]:
algorithm_thresh = (labels.algorithm > 0.5)

In [29]:
confusion_matrix(labels.radiologist.values,algorithm_thresh,labels=[1,0])

array([[ 23,  13],
       [  6, 107]], dtype=int64)
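The effect of changing the reference standard can be sketched by putting the two threshold-0.5 matrices next to each other. The counts come from cells [13] and [29]; the helper `rates` is a hypothetical name, and this is only one way to make the comparison.

```python
def rates(tp, fn, fp, tn):
    """Accuracy, sensitivity, and specificity from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# Algorithm at threshold 0.5, scored against each reference standard.
vs_perfect = rates(tp=21, fn=8, fp=8, tn=112)       # from cell [13]
vs_radiologist = rates(tp=23, fn=13, fp=6, tn=107)  # from cell [29]

for name in ("accuracy", "sensitivity", "specificity"):
    print(f"{name:12s} perfect={vs_perfect[name]:.3f} "
          f"radiologist={vs_radiologist[name]:.3f}")
```

The same predictions get different scores depending on whose labels count as truth: against the radiologist, the true positive count is higher (23 vs. 21), but sensitivity and accuracy are lower because the radiologist called 36 cases cancer rather than 29.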

## See Discussion.py for a reflection on this study and a discussion of the results obtained.

# END OF DOCUMENT
