2.

Given the following confusion matrix, evaluate (by hand) the model's performance.

|               | pred dog   | pred cat   |
|:------------  |-----------:|-----------:|
| actual dog    |         46 |         7  |
| actual cat    |         13 |         34 |

- In the context of this problem, what is a false positive?
- In the context of this problem, what is a false negative?
- How would you describe this model?


Taking cat as the positive class:

False Positive : predicted a cat when it was actually a dog (7 pictures)

False Negative : predicted a dog when it was actually a cat (13 pictures)

Recall/Sensitivity = $\frac{34}{34+13} = \frac{34}{47} \approx 0.72$

Precision/PPV = $\frac{34}{34+7} = \frac{34}{41} \approx 0.83$

Accuracy = $\frac{46+34}{46+7+13+34} = \frac{80}{100} = 0.8$

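With an accuracy of 0.80, the model clearly beats a coin flip (0.50). It also beats a baseline that always predicts dog, the more common class (53 of 100 pictures, accuracy 0.53). It misses about 28% of the cats (recall 0.72), and about 17% of its cat predictions are actually dogs (precision 0.83).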
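The hand calculation can be double-checked with a few lines of plain Python. This is only a sketch: the four counts are copied from the confusion matrix above, cat is assumed to be the positive class (as in the answers), and the variable names are illustrative.

```python
# Confusion-matrix counts from the table above, with cat as the positive class
tp = 34  # actual cat, predicted cat
fn = 13  # actual cat, predicted dog
fp = 7   # actual dog, predicted cat
tn = 46  # actual dog, predicted dog

recall = tp / (tp + fn)
precision = tp / (tp + fp)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"recall    : {recall:.2f}")
print(f"precision : {precision:.2f}")
print(f"accuracy  : {accuracy:.2f}")
```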

3. 

You are working as a data scientist for Codeup Cody Creator (C3 for short), a rubber-duck manufacturing plant.

Unfortunately, some of the rubber ducks that are produced will have defects. Your team has built several models that try to predict those defects, and the data from their predictions can be found here.

Use the predictions dataset and pandas to help answer the following questions:

- An internal team wants to investigate the cause of the manufacturing defects. They tell you that they want to identify as many of the ducks that have a defect as possible. Which evaluation metric would be appropriate here? Which model would be the best fit for this use case?
- Recently several stories in the local news have come out highlighting customers who received a rubber duck with a defect, and portraying C3 in a bad light. The PR team has decided to launch a program that gives customers with a defective duck a vacation to Hawaii. They need you to predict which ducks will have defects, but tell you they really don't want to accidentally give out a vacation package when the duck really doesn't have a defect. Which evaluation metric would be appropriate here? Which model would be the best fit for this use case?



In [83]:
# positive case : the duck has a defect
# negative case : the duck does not have a defect

# true positive : the duck is flagged as defective and is defective
# true negative : the duck is not flagged as defective and is not defective
# false positive : the duck is flagged as defective when it is not defective
# false negative : the duck is not flagged as defective when it is defective

import pandas as pd

df = pd.read_csv('c3.csv')
df

Unnamed: 0,actual,model1,model2,model3
0,No Defect,No Defect,Defect,No Defect
1,No Defect,No Defect,Defect,Defect
2,No Defect,No Defect,Defect,No Defect
3,No Defect,Defect,Defect,Defect
4,No Defect,No Defect,Defect,No Defect
...,...,...,...,...
195,No Defect,No Defect,Defect,Defect
196,Defect,Defect,No Defect,No Defect
197,No Defect,No Defect,No Defect,No Defect
198,No Defect,No Defect,Defect,Defect


In [80]:
pd.crosstab(df.actual, df.model1, margins = True)

model1,Defect,No Defect,All
actual,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Defect,8,8,16
No Defect,2,182,184
All,10,190,200


In [84]:
((df['model1'] == 'Defect') & (df.actual == 'Defect')).value_counts()

False    192
True       8
dtype: int64

In [89]:
# compute the recall, precision and accuracy of each model

models = ['model1', 'model2', 'model3']

for model in models:
    cur_model = model
    tp = ((df[cur_model] == 'Defect') & (df.actual == 'Defect')).sum()
    tn = ((df[cur_model] == 'No Defect') & (df.actual == 'No Defect')).sum()
    fp = ((df[cur_model] == 'Defect') & (df.actual == 'No Defect')).sum()
    fn = ((df[cur_model] == 'No Defect') & (df.actual == 'Defect')).sum()
    recall = tp/(tp+fn)
    precision = tp/(tp+fp)
    accuracy = (tp+tn)/(tp + tn + fp + fn)
    #print(f"{tp} {tn} {fp} {fn}")
    print(f"For {cur_model}: \n\t sensitivity : {recall},\n\t precision : {precision},\n\t accuracy : {accuracy}")

For model1: 
	 sensitivity : 0.5,
	 precision : 0.8,
	 accuracy : 0.95
For model2: 
	 sensitivity : 0.5625,
	 precision : 0.1,
	 accuracy : 0.56
For model3: 
	 sensitivity : 0.8125,
	 precision : 0.13131313131313133,
	 accuracy : 0.555


In [48]:
# baseline: always predict the most common class ('No Defect')
df['baseline'] = 'No Defect'
df[['actual', 'baseline']]

Unnamed: 0,actual,baseline
0,No Defect,No Defect
1,No Defect,No Defect
2,No Defect,No Defect
3,No Defect,No Defect
4,No Defect,No Defect
...,...,...
195,No Defect,No Defect
196,Defect,No Defect
197,No Defect,No Defect
198,No Defect,No Defect


In [49]:
pd.crosstab(df.actual, df.baseline)

baseline,No Defect
actual,Unnamed: 1_level_1
Defect,16
No Defect,184


In [60]:
# what is the evaluation for a baseline where all ducks are the negative case?

tn = 184
tp = 0
fn = 16
fp = 0
recall = tp/(tp+fn)
accuracy = (tp+tn)/(tp + tn + fp + fn)
print(f"For baseline: \n\t recall : {recall},\n\t precision : div by 0,\n\t accuracy : {accuracy}")

For baseline: 
	 recall : 0.0,
	 precision : div by 0,
	 accuracy : 0.92
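The baseline never predicts the positive class, so its precision has a zero denominator. The hard-coded "div by 0" message could be replaced with a small guard, sketched below. `safe_divide` is a hypothetical helper, not part of pandas or sklearn, and the counts are the ones from the crosstab above.

```python
def safe_divide(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is 0."""
    return numerator / denominator if denominator else None

# Counts for the always-'No Defect' baseline, taken from the crosstab above
tp, fp, fn, tn = 0, 0, 16, 184

baseline_metrics = {
    "recall": safe_divide(tp, tp + fn),
    "precision": safe_divide(tp, tp + fp),  # no positive predictions, so undefined
    "accuracy": safe_divide(tp + tn, tp + tn + fp + fn),
}
print(baseline_metrics)
```

Returning `None` rather than 0 keeps "undefined" distinct from "zero precision", which matters when models are ranked by that metric.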


- An internal team wants to investigate the cause of the manufacturing defects. They tell you that they want to identify as many of the ducks that have a defect as possible. Which evaluation metric would be appropriate here? Which model would be the best fit for this use case?

The team wants to find as many defective ducks as possible, so it is better to err on the side of caution and flag a duck as defective even if it is not (the false positive case). What must be minimized is false negatives, i.e. defective ducks that go unflagged. The metric that captures this is recall (sensitivity), $\frac{TP}{TP+FN}$. The model with the highest sensitivity is model3 (0.81, versus 0.56 for model2 and 0.50 for model1).

- Recently several stories in the local news have come out highlighting customers who received a rubber duck with a defect, and portraying C3 in a bad light. The PR team has decided to launch a program that gives customers with a defective duck a vacation to Hawaii. They need you to predict which ducks will have defects, but tell you they really don't want to accidentally give out a vacation package when the duck really doesn't have a defect. Which evaluation metric would be appropriate here? Which model would be the best fit for this use case?

The goal of the PR team is to minimize the number of ducks that are flagged as defective but are not actually defective. These are the false positives. The metric that penalizes false positives is precision, $\frac{TP}{TP+FP}$. Model1 is the most precise (0.80, versus 0.13 for model3 and 0.10 for model2).
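Choosing a model by recall or by precision is easier when the metrics sit in one table instead of printed text. The sketch below is one possible way to do that. `classification_metrics` is a hypothetical helper, and the `toy` frame is a stand-in rebuilt from the model1 crosstab (8 TP, 8 FN, 2 FP, 182 TN), not the real `c3.csv` data.

```python
import pandas as pd

def classification_metrics(actual, predicted, positive):
    """Recall, precision and accuracy for one positive class."""
    tp = ((predicted == positive) & (actual == positive)).sum()
    fp = ((predicted == positive) & (actual != positive)).sum()
    fn = ((predicted != positive) & (actual == positive)).sum()
    return pd.Series({
        "recall": tp / (tp + fn),
        "precision": tp / (tp + fp),
        "accuracy": (actual == predicted).mean(),
    })

# Stand-in data rebuilt from the model1 crosstab above;
# the notebook itself would use df as read from c3.csv.
toy = pd.DataFrame({
    "actual": ["Defect"] * 16 + ["No Defect"] * 184,
    "model1": ["Defect"] * 8 + ["No Defect"] * 8 + ["Defect"] * 2 + ["No Defect"] * 182,
})

# One row per model column; with the real data the list would also
# contain 'model2' and 'model3'.
summary = pd.DataFrame({
    col: classification_metrics(toy["actual"], toy[col], positive="Defect")
    for col in ["model1"]
}).T
print(summary)
```

With all three model columns present, `summary.sort_values("recall", ascending=False)` or `summary.sort_values("precision", ascending=False)` would put the best candidate for each use case in the first row.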

4.

You are working as a data scientist for Gives You Paws ™, a subscription based service that shows you cute pictures of dogs or cats (or both for an additional fee).

At Gives You Paws, anyone can upload pictures of their cats or dogs. The photos are then put through a two step process. First an automated algorithm tags pictures as either a cat or a dog (Phase I). Next, the photos that have been initially identified are put through another round of review, possibly with some human oversight, before being presented to the users (Phase II).

Several models have already been developed with the data, and you can find their results here.

Given this dataset, use pandas to create a baseline model (i.e. a model that just predicts the most common class) and answer the following questions:

- In terms of accuracy, how do the various models compare to the baseline model? Are any of the models better than the baseline?
- Suppose you are working on a team that solely deals with dog pictures. Which of these models would you recommend for Phase I? For Phase II?
- Suppose you are working on a team that solely deals with cat pictures. Which of these models would you recommend for Phase I? For Phase II?

In [90]:
# Positive : The picture is a cat
# Negative : The picture is not a cat

# True positive : the picture is labeled cat and is of a cat
# True negative : the picture is labeled not cat and is not of a cat
# False positive : the picture is labeled as a cat and is not of a cat
# False negative : the picture is not labeled as a cat and is of a cat


df_paws = pd.read_csv('gives_you_paws.csv')
df_paws

Unnamed: 0,actual,model1,model2,model3,model4
0,cat,cat,dog,cat,dog
1,dog,dog,cat,cat,dog
2,dog,cat,cat,cat,dog
3,dog,dog,dog,cat,dog
4,cat,cat,cat,dog,dog
...,...,...,...,...,...
4995,dog,dog,dog,dog,dog
4996,dog,dog,cat,cat,dog
4997,dog,cat,cat,dog,dog
4998,cat,cat,cat,cat,dog


In [91]:
(df_paws.actual == 'cat').mean() # the picture is most likely to be of a dog

0.3492

In [92]:
# our baseline is to always predict a dog and we will be right 1 - 0.35 of the time
df_paws['baseline'] = 'dog'
df_paws

Unnamed: 0,actual,model1,model2,model3,model4,baseline
0,cat,cat,dog,cat,dog,dog
1,dog,dog,cat,cat,dog,dog
2,dog,cat,cat,cat,dog,dog
3,dog,dog,dog,cat,dog,dog
4,cat,cat,cat,dog,dog,dog
...,...,...,...,...,...,...
4995,dog,dog,dog,dog,dog,dog
4996,dog,dog,cat,cat,dog,dog
4997,dog,cat,cat,dog,dog,dog
4998,cat,cat,cat,cat,dog,dog


In [93]:
models = ['model1', 'model2', 'model3', 'model4']

for model in models:
    cur_model = model
    tp = ((df_paws[cur_model] == 'cat') & (df_paws.actual == 'cat')).sum()
    tn = ((df_paws[cur_model] == 'dog') & (df_paws.actual == 'dog')).sum()
    fp = ((df_paws[cur_model] == 'cat') & (df_paws.actual == 'dog')).sum()
    fn = ((df_paws[cur_model] == 'dog') & (df_paws.actual == 'cat')).sum()
    recall = tp/(tp+fn)
    precision = tp/(tp+fp)
    accuracy = (tp+tn)/(tp + tn + fp + fn)
    print(f"For {cur_model}: \n\t sensitivity : {recall},\n\t precision : {precision},\n\t accuracy : {accuracy}")

For model1: 
	 sensitivity : 0.8150057273768614,
	 precision : 0.6897721764420747,
	 accuracy : 0.8074
For model2: 
	 sensitivity : 0.8906071019473081,
	 precision : 0.4841220423412204,
	 accuracy : 0.6304
For model3: 
	 sensitivity : 0.5114547537227949,
	 precision : 0.358346709470305,
	 accuracy : 0.5096
For model4: 
	 sensitivity : 0.34536082474226804,
	 precision : 0.8072289156626506,
	 accuracy : 0.7426


In [95]:
# calculate for baseline
pd.crosstab(df_paws.actual, df_paws.baseline)

baseline,dog
actual,Unnamed: 1_level_1
cat,1746
dog,3254


In [96]:
tn = 3254
tp = 0
fn = 1746
fp = 0
recall = tp/(tp+fn)
accuracy = (tp+tn)/(tp + tn + fp + fn)
print(f"For baseline: \n\t recall : {recall},\n\t precision : div by 0,\n\t accuracy : {accuracy}")

For baseline: 
	 recall : 0.0,
	 precision : div by 0,
	 accuracy : 0.6508
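The baseline class and its accuracy can also be derived from the data instead of being typed in by hand. The following is a sketch under assumptions: the `actual` Series is a stand-in rebuilt from the class counts in the crosstab above (1746 cats, 3254 dogs), where the notebook would use `df_paws.actual`.

```python
import pandas as pd

# Stand-in for df_paws.actual, rebuilt from the class counts in the crosstab above
actual = pd.Series(["cat"] * 1746 + ["dog"] * 3254, name="actual")

most_common = actual.mode()[0]  # most frequent class
baseline = pd.Series(most_common, index=actual.index, name="baseline")
baseline_accuracy = (actual == baseline).mean()

print(most_common, baseline_accuracy)
```

Using `mode()` means the baseline follows the data if the class balance changes, and the accuracy is simply the share of the most common class.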


- In terms of accuracy, how do the various models compare to the baseline model? Are any of the models better than the baseline?

Model1 (0.81) and model4 (0.74) have a higher accuracy than the baseline (0.65). Model2 (0.63) and model3 (0.51) do worse than the baseline.

- Suppose you are working on a team that solely deals with dog pictures. Which of these models would you recommend for Phase I? For Phase II?

The team wants to be sure they are getting all of the dog pictures. Therefore they probably do not mind getting a picture of a cat that was not labeled as a cat (the false negative case). What they want to avoid is a dog picture being labeled as a cat (the false positive case). A model that is precise, i.e. one that minimizes false positives, is needed. This is model4 (precision 0.81).

For Phase II we still don't mind a picture of a cat, as long as we don't miss any pictures of dogs. Again, this is model4.

- Suppose you are working on a team that solely deals with cat pictures. Which of these models would you recommend for Phase I? For Phase II?

The team does not want to miss any pictures of a cat. Therefore false positives are OK at both stages. A sensitive model is needed. The most sensitive model is model2 (sensitivity 0.89). This holds for Phase I and Phase II.
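All of the metrics above treat cat as the positive class. Seen from the dog team's side, the same confusion matrix is read with the roles swapped, which the sketch below illustrates for model1. The four counts are an assumption in the sense that they are back-calculated from the printed sensitivity, precision, accuracy and class totals rather than read from the data, and `recall_precision` is a hypothetical helper.

```python
def recall_precision(tp, fp, fn):
    """Return (recall, precision) for the given counts."""
    return tp / (tp + fn), tp / (tp + fp)

# model1 counts with cat as the positive class, back-calculated from the
# printed metrics and the class totals (1746 cats, 3254 dogs)
cat_tp, cat_fp, cat_fn, cat_tn = 1423, 640, 323, 2614

cat_recall, cat_precision = recall_precision(cat_tp, cat_fp, cat_fn)

# With dog as the positive class, the old true negatives become true
# positives and the two error types trade places.
dog_recall, dog_precision = recall_precision(tp=cat_tn, fp=cat_fn, fn=cat_fp)

print(f"cat positive: recall {cat_recall:.3f}, precision {cat_precision:.3f}")
print(f"dog positive: recall {dog_recall:.3f}, precision {dog_precision:.3f}")
```

The dog-positive values agree with the `dog` row that `classification_report` prints for model1 in problem 5, so the per-class rows of that report can be used to compare models from either team's point of view.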

5.

Follow the links below to read the documentation about each function, then apply those functions to the data from the previous problem.

- sklearn.metrics.accuracy_score
- sklearn.metrics.precision_score
- sklearn.metrics.recall_score
- sklearn.metrics.classification_report

In [113]:
import sklearn.metrics as metrics

In [118]:
metrics.accuracy_score(df.actual, df.model1)

0.95

In [123]:
#metrics.precision_score(df.actual, df.model1, labels = ['Defect', 'No Defect'])
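For binary problems, `precision_score` and `recall_score` use `pos_label=1` by default, so with string labels the positive class generally has to be named through `pos_label`, otherwise the call typically fails. A minimal sketch follows, assuming `Defect` is the positive class. The label lists are rebuilt from the model1 crosstab in problem 3 (8 TP, 8 FN, 2 FP, 182 TN) and stand in for `df.actual` and `df.model1`.

```python
from sklearn import metrics

# Labels rebuilt from the model1 crosstab in problem 3;
# in the notebook these would be df.actual and df.model1.
y_true = ["Defect"] * 16 + ["No Defect"] * 184
y_pred = ["Defect"] * 8 + ["No Defect"] * 8 + ["Defect"] * 2 + ["No Defect"] * 182

precision = metrics.precision_score(y_true, y_pred, pos_label="Defect")
recall = metrics.recall_score(y_true, y_pred, pos_label="Defect")
accuracy = metrics.accuracy_score(y_true, y_pred)

print(precision, recall, accuracy)
```

These should match the `Defect` row of the classification report and the hand-rolled loop in problem 3.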

In [116]:
metrics.classification_report(df.actual, df.model1, labels = ['Defect', 'No Defect'])

'              precision    recall  f1-score   support\n\n      Defect       0.80      0.50      0.62        16\n   No Defect       0.96      0.99      0.97       184\n\n    accuracy                           0.95       200\n   macro avg       0.88      0.74      0.79       200\nweighted avg       0.95      0.95      0.94       200\n'

In [117]:
metrics.classification_report(df.actual, df.model1, labels = ['Defect', 'No Defect'], output_dict=True)

{'Defect': {'precision': 0.8,
  'recall': 0.5,
  'f1-score': 0.6153846153846154,
  'support': 16},
 'No Defect': {'precision': 0.9578947368421052,
  'recall': 0.9891304347826086,
  'f1-score': 0.9732620320855614,
  'support': 184},
 'accuracy': 0.95,
 'macro avg': {'precision': 0.8789473684210527,
  'recall': 0.7445652173913043,
  'f1-score': 0.7943233237350884,
  'support': 200},
 'weighted avg': {'precision': 0.9452631578947368,
  'recall': 0.95,
  'f1-score': 0.9446318387494856,
  'support': 200}}

In [122]:
x = pd.DataFrame(metrics.classification_report(df.actual, df.model1, labels = ['Defect', 'No Defect'], output_dict=True))
x.T

Unnamed: 0,precision,recall,f1-score,support
Defect,0.8,0.5,0.615385,16.0
No Defect,0.957895,0.98913,0.973262,184.0
accuracy,0.95,0.95,0.95,0.95
macro avg,0.878947,0.744565,0.794323,200.0
weighted avg,0.945263,0.95,0.944632,200.0


In [126]:
for model in models:
    print(model)
    x = pd.DataFrame(metrics.classification_report(df_paws.actual, df_paws[model], labels = ['dog', 'cat'], output_dict=True))
    print(x.T)


model1
              precision    recall  f1-score    support
dog            0.890024  0.803319  0.844452  3254.0000
cat            0.689772  0.815006  0.747178  1746.0000
accuracy       0.807400  0.807400  0.807400     0.8074
macro avg      0.789898  0.809162  0.795815  5000.0000
weighted avg   0.820096  0.807400  0.810484  5000.0000
model2
              precision    recall  f1-score    support
dog            0.893177  0.490781  0.633479  3254.0000
cat            0.484122  0.890607  0.627269  1746.0000
accuracy       0.630400  0.630400  0.630400     0.6304
macro avg      0.688649  0.690694  0.630374  5000.0000
weighted avg   0.750335  0.630400  0.631310  5000.0000
model3
              precision    recall  f1-score    support
dog            0.659888  0.508605  0.574453  3254.0000
cat            0.358347  0.511455  0.421425  1746.0000
accuracy       0.509600  0.509600  0.509600     0.5096
macro avg      0.509118  0.510030  0.497939  5000.0000
weighted avg   0.554590  0.509600  0.521016 

In [109]:
df.actual.value_counts()

No Defect    184
Defect        16
Name: actual, dtype: int64

In [110]:
df.model1.value_counts()

No Defect    190
Defect        10
Name: model1, dtype: int64