### Calculating Errors

Here are two datasets that represent two of the examples you have seen in this lesson.  

One dataset is based on the parachute example, and the second is based on the judicial example.  Neither of these datasets is based on real people.

Use the exercises below to assist in answering the quiz questions at the bottom of this page.

In [1]:
import numpy as np
import pandas as pd

jud_data = pd.read_csv('judicial_dataset_predictions.csv')
par_data = pd.read_csv('parachute_dataset.csv')

In [2]:
jud_data.head()

| Unnamed: 0 | defendant_id | actual | predicted |
|---|---|---|---|
| 0 | 22574 | innocent | innocent |
| 1 | 35637 | innocent | innocent |
| 2 | 39919 | innocent | innocent |
| 3 | 29610 | guilty | guilty |
| 4 | 38273 | innocent | innocent |


In [3]:
par_data.head()

| Unnamed: 0 | parachute_id | actual | predicted |
|---|---|---|---|
| 0 | 3956 | opens | opens |
| 1 | 2147 | opens | opens |
| 2 | 2024 | opens | opens |
| 3 | 8325 | opens | opens |
| 4 | 6598 | opens | opens |


`1.` Above, you can see the actual and predicted columns for each of the datasets.  Using the **jud_data**, find the overall percentage of errors in the dataset, and then the percentage of errors of each type (each measured as a share of all rows).  Use the results to answer the questions in quiz 1 below.  

**Hint for quiz:** an error occurs any time the prediction doesn't match the actual value.  There are also Type I and Type II errors to consider: with the hypotheses below, a Type I error is predicting guilty for an innocent defendant, and a Type II error is predicting innocent for a guilty defendant.  We can drive one type of error to zero, but only by maximizing the other type.  If we predict all individuals as innocent, how many of the guilty are incorrectly labeled?  Similarly, if we predict all individuals as guilty, how many of the innocent are incorrectly labeled?

$H_0$: The defendant is **innocent**.

$H_1$: The defendant is **guilty**.
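Given these hypotheses, the three error rates can be sketched on a tiny made-up table before running them on the real data. The rows in `toy` and the helper name `error_rates` are assumptions for illustration only; they are not part of the course files.

```python
import pandas as pd

# Toy data invented for illustration (NOT taken from judicial_dataset_predictions.csv).
toy = pd.DataFrame({
    "actual":    ["innocent", "innocent", "guilty",   "guilty", "innocent"],
    "predicted": ["innocent", "guilty",   "innocent", "guilty", "innocent"],
})

def error_rates(df, null_label, alt_label):
    """Hypothetical helper: return (total, Type I, Type II) error proportions.

    Type I  = null is true, but we predicted the alternative.
    Type II = alternative is true, but we predicted the null.
    All three are proportions of the whole dataset.
    """
    total = df["actual"] != df["predicted"]
    type_1 = (df["actual"] == null_label) & (df["predicted"] == alt_label)
    type_2 = (df["actual"] == alt_label) & (df["predicted"] == null_label)
    # The mean of a boolean Series is the fraction of True values.
    return total.mean(), type_1.mean(), type_2.mean()

total, type_1, type_2 = error_rates(toy, null_label="innocent", alt_label="guilty")
print(total, type_1, type_2)
```

Because every error is either Type I or Type II, the two type-specific proportions should add up to the total proportion, which is a quick sanity check for the cells that follow.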

In [4]:
# total percentage of errors
len(jud_data[jud_data['actual'] != jud_data['predicted']]) / len(jud_data) * 100

4.21529589454895

In [5]:
# percentage of type 1 errors
len(jud_data.query("actual == 'innocent' and predicted == 'guilty'")) / len(jud_data) * 100 

0.1510366607167376

In [6]:
# percentage of type 2 errors
len(jud_data.query("actual == 'guilty' and predicted == 'innocent'")) / len(jud_data) * 100 

4.064259233832212

In [7]:
# If everyone was predicted to be guilty, then every actual innocent person would be a type I error.
# Type I = pred guilty, but actual is innocent
len(jud_data.query("actual == 'innocent'")) / len(jud_data)

0.45159961554304545

In [8]:
# If everyone was predicted to be guilty, then no one is predicted innocent.
# Type II error = predicted innocent, but actual is guilty.
# Therefore, there would be no Type II errors in this case.
0

0
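The trade-off in the hint can be checked for both extremes, not only the "everyone guilty" case. The sketch below uses a small invented table (`toy` is an assumption, not the course dataset) and overwrites the predictions with a single label each time.

```python
import pandas as pd

# Toy data invented for illustration: 3 innocent and 2 guilty defendants.
toy = pd.DataFrame({"actual": ["innocent", "innocent", "innocent", "guilty", "guilty"]})

def type_errors(df):
    """Return (Type I, Type II) proportions with H0 = innocent, H1 = guilty."""
    type_1 = ((df["actual"] == "innocent") & (df["predicted"] == "guilty")).mean()
    type_2 = ((df["actual"] == "guilty") & (df["predicted"] == "innocent")).mean()
    return type_1, type_2

# Extreme 1: predict everyone guilty, so every innocent person is a Type I error.
guilty_t1, guilty_t2 = type_errors(toy.assign(predicted="guilty"))

# Extreme 2: predict everyone innocent, so every guilty person is a Type II error.
innocent_t1, innocent_t2 = type_errors(toy.assign(predicted="innocent"))

print(guilty_t1, guilty_t2)
print(innocent_t1, innocent_t2)
```

In each extreme one error type drops to zero while the other equals the share of the opposite class in the data.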

`2.` Using the **par_data**, find the overall proportion of errors in the dataset, and then the proportion of errors of each type.  Use the results to answer the questions in quiz 2 below.

These should be very similar operations to those you performed in the previous question.

$H_0$: The parachute **fails**.

$H_1$: The parachute **opens**.

In [9]:
par_data.actual.unique()

array(['opens', 'fails'], dtype=object)

In [10]:
# total proportion of errors (multiply by 100 for a percentage)
len(par_data[par_data['actual'] != par_data['predicted']]) / len(par_data)

0.039972551037913875

In [11]:
# proportion of Type I errors: parachute actually fails, but predicted to open
len(par_data.query("actual == 'fails' and predicted == 'opens'")) / len(par_data)

0.00017155601303825698

In [12]:
# proportion of Type II errors: parachute actually opens, but predicted to fail
len(par_data.query("actual == 'opens' and predicted == 'fails'")) / len(par_data)

0.03980099502487562

In [13]:
# If every parachute was predicted to not open, then no parachute is predicted to open.
# Type I error = predicted opens, but actually fails, so Type I = 0
0

0

In [14]:
# If every parachute was predicted to not open, then every parachute that actually opens is a Type II error
len(par_data.query("actual == 'opens'")) / len(par_data)

0.9917653113741637
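As an alternative to one `query` per error type, all four actual/predicted combinations can be viewed at once with `pd.crosstab`. This is a minimal sketch on invented rows (`toy` is an assumption, not `parachute_dataset.csv`); `normalize="all"` divides every cell by the total row count.

```python
import pandas as pd

# Toy data invented for illustration (NOT taken from parachute_dataset.csv).
toy = pd.DataFrame({
    "actual":    ["opens", "opens", "opens", "fails", "opens", "fails", "opens", "opens"],
    "predicted": ["opens", "fails", "opens", "opens", "opens", "fails", "opens", "opens"],
})

# Rows are actual labels, columns are predicted labels, cells are proportions of all rows.
table = pd.crosstab(toy["actual"], toy["predicted"], normalize="all")
print(table)

# With H0 = fails and H1 = opens:
type_1 = table.loc["fails", "opens"]  # actually fails, predicted to open
type_2 = table.loc["opens", "fails"]  # actually opens, predicted to fail
total_error = type_1 + type_2
print(type_1, type_2, total_error)
```

Note that the roles of the labels depend on which outcome is the null hypothesis, so the same table reads differently for the judicial and parachute examples.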