### Calculating Errors

Here are two datasets that represent two of the examples you have seen in this lesson.  

One dataset is based on the parachute example, and the second is based on the judicial example.  Neither of these datasets is based on real people.

Use the exercises below to assist in answering the quiz questions at the bottom of this page.

In [25]:
import numpy as np
import pandas as pd

jud_data = pd.read_csv('judicial_dataset_predictions.csv')
par_data = pd.read_csv('parachute_dataset.csv')

In [26]:
jud_data.head()

| Unnamed: 0 | defendant_id | actual | predicted |
|---|---|---|---|
| 0 | 22574 | innocent | innocent |
| 1 | 35637 | innocent | innocent |
| 2 | 39919 | innocent | innocent |
| 3 | 29610 | guilty | guilty |
| 4 | 38273 | innocent | innocent |


In [27]:
par_data.head()

| Unnamed: 0 | parachute_id | actual | predicted |
|---|---|---|---|
| 0 | 3956 | opens | opens |
| 1 | 2147 | opens | opens |
| 2 | 2024 | opens | opens |
| 3 | 8325 | opens | opens |
| 4 | 6598 | opens | opens |


`1.` Above, you can see the actual and predicted columns for each dataset.  Using **jud_data**, find the overall proportion of errors, and then the proportion of errors of each type.  Use the results to answer the questions in quiz 1 below.  

**Hint for quiz:** an error occurs any time the prediction doesn't match the actual value.  Additionally, there are Type I and Type II errors to think about.  We also know we can drive one type of error to zero, but only by maximizing the other type.  If we predict all individuals as innocent, how many of the guilty are incorrectly labeled?  Similarly, if we predict all individuals as guilty, how many of the innocent are incorrectly labeled?
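As a rough illustration of the definitions in the hint, the sketch below counts each kind of error on a tiny made-up table. The `toy` DataFrame and the `toy_*` names are hypothetical and are not part of the course datasets; the only assumption carried over is that the null hypothesis is "innocent".

```python
import pandas as pd

# Hypothetical toy data (NOT the course dataset): six made-up defendants.
toy = pd.DataFrame({
    "actual":    ["innocent", "innocent", "guilty", "guilty", "innocent", "guilty"],
    "predicted": ["innocent", "guilty", "guilty", "innocent", "innocent", "innocent"],
})

# Any mismatch between the prediction and the truth is an error.
is_error = toy["actual"] != toy["predicted"]

# Type I: H0 (innocent) is true, but the prediction rejects it.
is_type_I = (toy["actual"] == "innocent") & (toy["predicted"] == "guilty")

# Type II: H0 is false (guilty), but the prediction keeps it.
is_type_II = (toy["actual"] == "guilty") & (toy["predicted"] == "innocent")

# The mean of a boolean Series is the proportion of True values.
toy_p_error = is_error.mean()
toy_p_type_I = is_type_I.mean()
toy_p_type_II = is_type_II.mean()
```

Because every error is either Type I or Type II, the two type-specific proportions add up to the overall error proportion.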

In [28]:
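len_jud_data = len(jud_data)
# H0: the defendant is innocent; H1: the defendant is guilty.
# Correct decisions: H0 is kept when it is true and rejected when it is false.
h0_remains = len(jud_data.query("actual == 'innocent' and predicted == 'innocent'"))
h0_rejected = len(jud_data.query("actual == 'guilty' and predicted == 'guilty'"))
# Type I error: an innocent defendant is predicted guilty (H0 wrongly rejected).
type_I_error = len(jud_data.query("actual == 'innocent' and predicted == 'guilty'"))
# Type II error: a guilty defendant is predicted innocent (H0 wrongly kept).
type_II_error = len(jud_data.query("actual == 'guilty' and predicted == 'innocent'"))
# Overall error proportion: all mismatches divided by the number of rows.
p_error = (type_I_error + type_II_error) / len_jud_data
p_error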

0.042152958945489497

In [29]:
p_type_I_error = type_I_error / len_jud_data
p_type_I_error

0.001510366607167376

In [30]:
p_type_II_error = type_II_error / len_jud_data
p_type_II_error

0.04064259233832212

In [31]:
# What if the jury wants to avoid all Type I errors and says everybody is innocent?
# Then every guilty defendant becomes a Type II error.
type_II_error = len(jud_data.query("actual == 'guilty'"))
type_II_error / len_jud_data

0.5484003844569546

In [32]:
# What if the jury wants to avoid all Type II errors and says everybody is guilty?
# Then every innocent defendant becomes a Type I error.
type_I_error = len(jud_data.query("actual == 'innocent'"))
type_I_error / len_jud_data

0.45159961554304545
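The four counts above can also be obtained in one step with a cross-tabulation. The following is a minimal sketch, not the notebook's own method: `confusion_proportions` is a hypothetical helper, and the `toy` data is made up so that the block runs without the course CSV files. It assumes only that the DataFrame has `actual` and `predicted` columns like the ones shown earlier.

```python
import pandas as pd


def confusion_proportions(df, actual_col="actual", predicted_col="predicted"):
    """Return a predicted-by-actual table whose cells are proportions of all rows."""
    # normalize="all" divides every cell count by the total number of rows.
    return pd.crosstab(df[predicted_col], df[actual_col], normalize="all")


# Hypothetical toy data (NOT the course dataset): eight made-up defendants.
toy = pd.DataFrame({
    "actual": ["innocent"] * 4 + ["guilty"] * 4,
    "predicted": ["innocent", "innocent", "innocent", "guilty",
                  "guilty", "guilty", "guilty", "innocent"],
})

table = confusion_proportions(toy)

# With H0 = innocent:
# Type I  -> predicted guilty, actually innocent.
# Type II -> predicted innocent, actually guilty.
toy_type_I = table.loc["guilty", "innocent"]
toy_type_II = table.loc["innocent", "guilty"]
```

Calling `confusion_proportions(jud_data)` in the notebook should give the same proportions as the `query`-based cells, laid out as a single table.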

`2.` Using **par_data**, find the overall proportion of errors, and then the proportion of errors of each type.  Use the results to answer the questions in quiz 2 below.

These should be very similar operations to those you performed in the previous question.

In [33]:
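len_par_data = len(par_data)
# H0: the parachute fails; H1: the parachute opens.
# Correct decisions: H0 is kept when it is true and rejected when it is false.
h0_remains = len(par_data.query("actual == 'fails' and predicted == 'fails'"))
h0_rejected = len(par_data.query("actual == 'opens' and predicted == 'opens'"))
# Type I error: a parachute that fails is predicted to open (H0 wrongly rejected).
type_I_error = len(par_data.query("actual == 'fails' and predicted == 'opens'"))
# Type II error: a parachute that opens is predicted to fail (H0 wrongly kept).
type_II_error = len(par_data.query("actual == 'opens' and predicted == 'fails'"))
# Overall error proportion: all mismatches divided by the number of rows.
p_error = (type_I_error + type_II_error) / len_par_data
p_error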

0.039972551037913875

In [34]:
data = np.array([[h0_remains, type_II_error], 
                 [type_I_error, h0_rejected]]) / len_par_data
pp = pd.DataFrame(data=data, columns=['H0_True', 'H0_False'], index=['H0_Accepted', 'H0_Rejected'])
pp

| | H0_True | H0_False |
|---|---|---|
| H0_Accepted | 0.008063 | 0.039801 |
| H0_Rejected | 0.000172 | 0.951964 |


In [37]:
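# If every parachute is predicted to fail, H0 is never rejected, so the Type I error rate is 0.
# Every parachute that actually opens then counts as a Type II error.
len(par_data.query("actual == 'opens'")) / len_par_data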

0.9917653113741637
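The "predict the same label for everyone" scenarios used in both questions follow one pattern, which can be sketched as a small function. `constant_prediction_errors` and `toy_actual` are hypothetical names introduced only for this illustration; the sketch assumes two possible labels and that `null_label` is the outcome stated by H0.

```python
import pandas as pd


def constant_prediction_errors(actual, null_label, constant):
    """Type I / Type II proportions when every row gets the same prediction."""
    actual = pd.Series(actual)
    h0_true = actual == null_label
    if constant == null_label:
        # H0 is never rejected: no Type I errors are possible,
        # and every row where H0 is false becomes a Type II error.
        return {"type_I": 0.0, "type_II": float((~h0_true).mean())}
    # H0 is always rejected: no Type II errors are possible,
    # and every row where H0 is true becomes a Type I error.
    return {"type_I": float(h0_true.mean()), "type_II": 0.0}


# Hypothetical toy data (NOT the course dataset): nine parachutes open, one fails.
toy_actual = ["opens"] * 9 + ["fails"]

never_reject = constant_prediction_errors(toy_actual, null_label="fails", constant="fails")
always_reject = constant_prediction_errors(toy_actual, null_label="fails", constant="opens")
```

In this toy example, never rejecting H0 removes Type I errors but mislabels every parachute that opens, which mirrors the trade-off computed in the cell above.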