In [1]:
# Homework 4 (due 07/24/2024)

# Decision trees, interpretability, and algorithmic bias

## Objective

In this week's project, you will explore the COMPAS data set. COMPAS stands for "Correctional Offender Management Profiling for Alternative Sanctions". It is a software tool that is used to assess the risk that a registered offender will commit another offense. Although researchers and journalists have pointed to [various problems](https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing) with this algorithm over many years, it is still used to inform sentences and parole decisions in several US states. 
You can learn more about the COMPAS data set [here](https://www.propublica.org/datastore/dataset/compas-recidivism-risk-score-data-and-analysis). 

Through this project, you will practice fitting and validating several classification models and you will explore some distinct benefits of using decision trees in machine learning. As part of that exploration, you are going to audit your model for demographic biases via a "closed box" and an "open box" approach.

The COMPAS data set is a favorite example among critics of machine learning because it demonstrates several shortcomings and failure modes of machine learning techniques. The lessons learned from this project might be discouraging, and they are important. Keep in mind, however, that what you see here does not generalize to all data sets. 

This project has four parts.

### Part 1: Prepare the COMPAS data set  (PARTIALLY FOR YOU TO COMPLETE)

In this part, you will load the COMPAS data set, explore its content, and select several variables as features (i.e., queries) or class labels (i.e., responses). Some of these features are not numerical, so you will need to replace some categorical values with zeros and ones. Your features will include a categorical variable with more than two categories. You will use 1-hot encoding to include this feature in your data set. 

This part includes four steps:
1. Load and explore data set
2. Select features and response variables
3. Construct numerical coding for categorical features
4. Split the data

### Part 2: Train and validate a decision tree  (PARTIALLY FOR YOU TO COMPLETE)

In this part, you will fit a decision tree to your data. You will examine the effect of tuning the complexity of the tree via the "maximum number of leaves" parameter and use 5-fold cross-validation to find an optimal value.

This part includes three steps:

1. Fit a decision tree on the training data
2. Tune the parameter "maximum number of leaves"
3. Calculate the selected model's test performance


### Part 3: Auditing a decision tree for demographic biases  (PARTIALLY FOR YOU TO COMPLETE)

Your training data includes several demographic variables (i.e., age, sex, race). A crude way to assess whether a model has some demographic bias is to remove the corresponding variables from your training data and explore how that removal affects your model's performance. Decision trees have the advantage of being interpretable machine learning models. By going through the decision nodes (i.e., branching points), you can "open the black box and look inside". Specifically, you can assess how each feature is used in the decision making process.

This part includes three steps:

1. Fit a decision tree
2. Check for racial bias via performance assessment
3. Check for racial bias via decision rules

### Part 4: Comparison to other classifiers (FOR YOU TO COMPLETE)

For some types of data, decision trees tend to achieve lower prediction accuracies. In this part, you will train and tune several classifiers on the COMPAS data. You will then compare their performance on your test set.

This part includes four steps:

1. Fit LDA and logistic regression
2. Tune and fit ensemble methods
3. Tune and fit SVC
4. Compare performance metrics for all models 

In [4]:
import pandas as pd
from matplotlib import pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier, GradientBoostingClassifier
from sklearn.svm import SVC

## Part 1: Prepare the COMPAS data set

>In this part, you will load the COMPAS data set, explore its content, and select several variables as features (i.e., queries) or class labels (i.e., responses). Some of these features are not numerical, so you will need to replace some categorical values with zeros and ones. Your features will include a categorical variable with more than two categories. You will use 1-hot encoding to include this feature in your data set.
>
>This part includes four steps:
>1. Load and explore data set
>2. Select features and response variables
>3. Construct numerical coding for categorical features
>4. Split the data



### Part 1, Step 1: Load and explore data set

This folder includes the 'compas-scores-two-years.csv' file. The COMPAS data that you will use for this project is in this file. It is always a good idea to look at the raw data before proceeding with one's machine learning pipeline.

In [6]:
# load data
raw_data = pd.read_csv('compas-scores-two-years.csv') # NOTE: I had to download the CSV from the ProPublica github itself b/c it wasn't included in the class folder
# print a list of variable names
print(raw_data.columns)
# look at the first 5 rows 
raw_data.head(5)

Index(['id', 'name', 'first', 'last', 'compas_screening_date', 'sex', 'dob',
       'age', 'age_cat', 'race', 'juv_fel_count', 'decile_score',
       'juv_misd_count', 'juv_other_count', 'priors_count',
       'days_b_screening_arrest', 'c_jail_in', 'c_jail_out', 'c_case_number',
       'c_offense_date', 'c_arrest_date', 'c_days_from_compas',
       'c_charge_degree', 'c_charge_desc', 'is_recid', 'r_case_number',
       'r_charge_degree', 'r_days_from_arrest', 'r_offense_date',
       'r_charge_desc', 'r_jail_in', 'r_jail_out', 'violent_recid',
       'is_violent_recid', 'vr_case_number', 'vr_charge_degree',
       'vr_offense_date', 'vr_charge_desc', 'type_of_assessment',
       'decile_score.1', 'score_text', 'screening_date',
       'v_type_of_assessment', 'v_decile_score', 'v_score_text',
       'v_screening_date', 'in_custody', 'out_custody', 'priors_count.1',
       'start', 'end', 'event', 'two_year_recid'],
      dtype='object')


Unnamed: 0,id,name,first,last,compas_screening_date,sex,dob,age,age_cat,race,...,v_decile_score,v_score_text,v_screening_date,in_custody,out_custody,priors_count.1,start,end,event,two_year_recid
0,1,miguel hernandez,miguel,hernandez,2013-08-14,Male,1947-04-18,69,Greater than 45,Other,...,1,Low,2013-08-14,2014-07-07,2014-07-14,0,0,327,0,0
1,3,kevon dixon,kevon,dixon,2013-01-27,Male,1982-01-22,34,25 - 45,African-American,...,1,Low,2013-01-27,2013-01-26,2013-02-05,0,9,159,1,1
2,4,ed philo,ed,philo,2013-04-14,Male,1991-05-14,24,Less than 25,African-American,...,3,Low,2013-04-14,2013-06-16,2013-06-16,4,0,63,0,1
3,5,marcu brown,marcu,brown,2013-01-13,Male,1993-01-21,23,Less than 25,African-American,...,6,Medium,2013-01-13,,,1,0,1174,0,0
4,6,bouthy pierrelouis,bouthy,pierrelouis,2013-03-26,Male,1973-01-22,43,25 - 45,Other,...,1,Low,2013-03-26,,,2,0,1102,0,0


The data set includes 53 variables, which carry different types of information. The variables cover
* personal data (e.g., name, first name ("first"), last name ("last")) 
* demographic data (i.e., sex, age, age category ("age_cat"), and race)
* related to the person's history of committed offenses (e.g., juvenile felony count ("juv_fel_count"), juvenile misdemeanor count ("juv_misd_count"), and prior offenses count ("priors_count"))
* related to the charge against the person (e.g., charge offense date ("c_offense_date"), charge arrest date ("c_arrest_date"), charge degree ("c_charge_degree"), and description of charge ("c_charge_desc"))
* recidivism scores assigned by the COMPAS algorithm (e.g., "decile_score", "score_text", "v_decile_score", "v_score_text")
* related to an actual recidivism charge (e.g., degree of recidivism charge ("r_charge_degree"), date of recidivism offense ("r_offense_date"), description of recidivism charge ("r_charge_desc"))
* related to an actual violent recidivism charge (e.g., degree of violent recidivism charge ("vr_charge_degree"), date of violent recidivism offense ("vr_offense_date"), description of violent recidivism charge ("vr_charge_desc")).

### Part 1, Step 2: Select features and response variables

The ProPublica article was assessing bias in the COMPAS scores. Here, you will ignore the COMPAS scores and instead explore the challenges of predicting recidivism based on the survey data. What variables seem like sensible predictors? What variables would be sensible outcome variables? The code in the cell below selects some numerical and categorical variables for you to include in your model.

In [10]:
# Select features and response variables

# Features by type
numerical_features = ['juv_misd_count', 'juv_other_count', 'juv_fel_count', 
    'priors_count', 'age']
binary_categorical_features = ['sex', 'c_charge_degree']
other_categorical_features = ['race']
all_features = binary_categorical_features + other_categorical_features + numerical_features

# Possible response variables
response_variables = ['is_recid', 'is_violent_recid', 'two_year_recid']

# Variables that are used for data cleaning
check_variables = ['days_b_screening_arrest']

ProPublica filtered some observations (i.e., rows in the data frame). See their explanation below. Let's follow their procedure.


> There are a number of reasons remove rows because of missing data:
>
> * If the charge date of a defendants Compas scored crime was not within 30 days from when the person was arrested, we assume that because of data quality reasons, that we do not have the right offense.
> * We coded the recidivist flag -- is_recid -- to be -1 if we could not find a compas case at all.
> * In a similar vein, ordinary traffic offenses -- those with a c_charge_degree of 'O' -- will not result in Jail time are removed (only two of them).
> * We filtered the underlying data from Broward county to include only those rows representing people who had either recidivated in two years, or had at least two years outside of a correctional facility.


In [13]:
# Subselect data
df = raw_data[all_features+response_variables+check_variables]

# Apply filters
df = df[(df['days_b_screening_arrest'] <= 30) & 
        (df['days_b_screening_arrest'] >= -30) & 
        (df['is_recid'] != -1) & 
        (df['c_charge_degree'] != 'O')]

df = df[all_features+response_variables]
print('Dataframe has {} rows and {} columns.'.format(df.shape[0], df.shape[1]))

Dataframe has 6172 rows and 11 columns.


### Part 1, Step 3: Construct numerical coding for categorical features

Some of the features in the subselected data are not numerical, so you will need to replace some categorical values with zeros and ones. Your features will include "race", which was surveyed as a single categorical variable with more than two categories. You will use [1-hot encoding](https://en.wikipedia.org/wiki/One-hot) to include this feature in your data set. 
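
As a point of comparison for the loop-based encoding in the next cells, here is a minimal sketch of 1-hot encoding with `pd.get_dummies`. The toy data frame and its values are made up for illustration; only the `"<column>_is_<category>"` naming pattern is taken from this notebook.

```python
import pandas as pd

# Toy data frame; the values are invented for illustration only
toy = pd.DataFrame({"race": ["Other", "Caucasian", "Hispanic", "Caucasian"],
                    "age": [69, 41, 30, 27]})

# Create one indicator column per category, named "<column>_is_<category>"
encoded = pd.get_dummies(toy, columns=["race"], prefix="race", prefix_sep="_is_")

print(encoded.columns.tolist())
```

Each row has exactly one indicator set, which is what makes the encoding "1-hot".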

In [16]:
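# Code binary features as 0 and 1
for x in binary_categorical_features:
    # Sort the categories so that the coding is reproducible across runs
    for new_value, value in enumerate(sorted(set(df[x]))):
        print("Replace {} with {}.".format(value, new_value))
        # Replace only within column x, not across the whole data frame
        df = df.replace({x: {value: new_value}})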

Replace Female with 0.
Replace Male with 1.
Replace F with 0.
Replace M with 1.


In [18]:
# Use 1-hot encoding for other categorical variables
one_hot_features = []
for x in other_categorical_features:
    for new_feature, value in enumerate(set(df[x])):
        feature_name = "{}_is_{}".format(x,value)
        df.insert(3, feature_name, df[x]==value)
        one_hot_features += [feature_name]

In [20]:
# Check what the data frame looks like now
df.head(10)

Unnamed: 0,sex,c_charge_degree,race,race_is_Asian,race_is_Hispanic,race_is_Caucasian,race_is_Native American,race_is_African-American,race_is_Other,juv_misd_count,juv_other_count,juv_fel_count,priors_count,age,is_recid,is_violent_recid,two_year_recid
0,1,0,Other,False,False,False,False,False,True,0,0,0,0,69,0,0,0
1,1,0,African-American,False,False,False,False,True,False,0,0,0,0,34,1,1,1
2,1,0,African-American,False,False,False,False,True,False,0,1,0,4,24,1,0,1
5,1,1,Other,False,False,False,False,False,True,0,0,0,0,44,0,0,0
6,1,0,Caucasian,False,False,True,False,False,False,0,0,0,14,41,1,0,1
7,1,0,Other,False,False,False,False,False,True,0,0,0,3,43,0,0,0
8,0,1,Caucasian,False,False,True,False,False,False,0,0,0,0,39,0,0,0
10,1,0,Caucasian,False,False,True,False,False,False,0,0,0,0,27,0,0,0
11,1,1,African-American,False,False,False,False,True,False,0,0,0,3,23,1,0,1
12,0,1,Caucasian,False,False,True,False,False,False,0,0,0,0,37,0,0,0


### Part 1, Step 4: Split the data

Let's collect the features in one data frame and the responses in another data frame. After that, you will set a small portion of the data set aside for testing.

In [23]:
# list of features
features = numerical_features + binary_categorical_features + one_hot_features

# features data frame
X = df[features]

# responses data frame
Y = df[response_variables]

# Split the data into a training set containing 90% of the data
# and test set containing 10% of the data
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.1, random_state = 100)

# Part 2: Train and validate a decision tree

>In this part, you will fit a decision tree to your data. You will examine the effect of tuning the complexity of the tree via the "maximum number of leaves" parameter and use 5-fold cross-validation to find an optimal value.
>
>This part includes three steps:
>
>1. Fit a decision tree on the training data
>2. Tune the parameter "maximum number of leaves"
>3. Calculate the selected model's test performance

### Part 2, Step 1: Fit a decision tree on the training data

Start by fitting a decision tree to your training data. Assess its training accuracy and its size.

In [26]:
# Create a model
dtc = DecisionTreeClassifier()
    
# Fit model to training data
dtc.fit(X_train, Y_train)

# Evaluate training accuracy
accuracy = dtc.score(X_train, Y_train)

# Check size of decision tree
num_leaves = dtc.get_n_leaves()

# Report results
print('Trained decision tree with {} leaves and training accuracy {:.2f}.'.format(num_leaves, accuracy))

Trained decision tree with 2058 leaves and training accuracy 0.79.


Your tree has a good training accuracy by the standards of tabular data prediction problems, but its size is enormous! It has so many leaves that, on average, fewer than 3 training observations share a leaf. It is very likely that this tree is overfitting.

### Part 2, Step 2: Tune the parameter "maximum number of leaves"

Let's try to constrain the complexity of a decision tree during training by setting a value for the argument `max_leaf_nodes` (the maximum number of leaves). You can use scikit-learn's `cross_val_score` function to quickly assess the out-of-sample performance of trees of varying complexity.

In [29]:
# Perform 5-fold cross-validation for different tree sizes

print('Leaves\tMean accuracy')
print('---------------------')
for num_leaves in range(100,1800,100):

    # Trees must have at least 2 leaves
    if num_leaves >= 2:

        # construct a classifier with a limit on its number of leaves
        dtc = DecisionTreeClassifier(max_leaf_nodes = num_leaves)

        # Get validation accuracy via 5-fold cross-validation
        scores = cross_val_score(dtc, X_train, Y_train, cv = 5)

        # Report only when a score was actually computed for this tree size
        print("{}\t{:.3f}".format(num_leaves,scores.mean()))

Leaves	Mean accuracy
---------------------
100	0.564
200	0.550
300	0.535
400	0.532
500	0.538
600	0.539
700	0.536
800	0.527
900	0.518
1000	0.517
1100	0.518
1200	0.519
1300	0.510
1400	0.511
1500	0.512
1600	0.509
1700	0.511


Despite the high training accuracy, the mean validation accuracy (estimated via 5-fold cross-validation) stays between roughly 0.51 and 0.56 for every tree size tested. It decreases gradually as the number of leaves grows, which suggests that beyond 100 leaves the tree overfits the training data and predicts held-out data less accurately. Because 100 is the smallest value in the tested range, the best value may lie below 100.

Adjust the range of values for `max_leaf_nodes` in the cell above, to identify the best value.
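
One way to automate this search is scikit-learn's `GridSearchCV`. The sketch below runs on synthetic data from `make_classification` as a stand-in for the COMPAS training set; the candidate grid and the names `X_demo`, `y_demo`, and `param_grid` are assumptions chosen for illustration, not values from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training features and a single response column
X_demo, y_demo = make_classification(n_samples=600, n_features=8, random_state=0)

# Candidate tree sizes (max_leaf_nodes must be at least 2)
param_grid = {"max_leaf_nodes": [2, 5, 10, 20, 50, 100]}

# 5-fold cross-validation for every candidate value
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X_demo, y_demo)

best_leaves = search.best_params_["max_leaf_nodes"]
print("Best max_leaf_nodes:", best_leaves)
print("Mean CV accuracy: {:.3f}".format(search.best_score_))
```

On the real data, the grid should probably extend well below 100 leaves, since that is where the manual search above points.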

### Part 2, Step 3: Calculate the selected model's test performance

Train a decision tree using your selected value of `max_leaf_nodes` on the full training set. Assess its accuracy on your test set.

In [33]:
# Create a model
dtc_full = DecisionTreeClassifier()
dtc_full.max_leaf_nodes = 100 # Chosen number of leaves as per cross-validation finding
    
# Fit model to training data
dtc_full.fit(X_train, Y_train)

# Evaluate test accuracy
accuracy = dtc_full.score(X_test, Y_test)

# Check size of decision tree
num_leaves = dtc_full.get_n_leaves()

# Report results
print('Trained decision tree with {} leaves and test accuracy {:.2f}.'.format(num_leaves, accuracy))

Trained decision tree with 100 leaves and test accuracy 0.55.


# Part 3: Auditing a decision tree for demographic biases

>Your training data includes several demographic variables (i.e., age, sex, race). A crude way to assess whether a model has some demographic bias is to remove the corresponding variables from your training data and explore how that removal affects your model's performance. Decision trees have the advantage of being interpretable machine learning models. By going through the decision nodes (i.e., branching points), you can "open the black box and look inside". Specifically, you can assess how each feature is used in the decision making process.
>
>This part includes two steps:
>
>1. Check for racial bias via performance assessment
>2. Check for racial bias via decision rules
  
### Part 3, Step 2: Check for racial bias via performance assessment
A simple approach to identifying demographic biases in machine learning is the following: (i) Train and validate the model on the full training set, (ii) train and validate the model on a subset of training variables that excludes the variables related to a potential demographic bias, (iii) compare the results. 

You have noticed that the validation accuracy of your model can vary for different holdout set selections. To account for these variations, you are going to compare the mean validation accuracy over 100 trees. (You have completed (i) in the previous cell already. Continue now with (ii).)
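
The averaging over 100 trees could be sketched as follows, assuming repeated random holdout splits via `ShuffleSplit`. The data is synthetic, and treating the last two columns as the "sensitive" features is a purely hypothetical choice for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)

# Hypothetical: drop the last two columns, as if they encoded a sensitive attribute
keep = np.arange(X_demo.shape[1] - 2)

# 100 random train/validation splits; the fixed seed gives both runs the same splits
splitter = ShuffleSplit(n_splits=100, test_size=0.2, random_state=0)
tree = DecisionTreeClassifier(max_leaf_nodes=10, random_state=0)

scores_all = cross_val_score(tree, X_demo, y_demo, cv=splitter)
scores_sub = cross_val_score(tree, X_demo[:, keep], y_demo, cv=splitter)

print("All features:    {:.3f} +/- {:.3f}".format(scores_all.mean(), scores_all.std()))
print("Subset features: {:.3f} +/- {:.3f}".format(scores_sub.mean(), scores_sub.std()))
```

Comparing the two means against the spread across splits gives a sense of whether a difference of a percentage point or so is larger than the split-to-split variation.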

In [36]:
# Create subset of training data without information on race. 
# (The information on race was encoded in the one-hot features.)
remaining_features = [v for v in X.columns if v not in one_hot_features]
X_train_sub = X_train[remaining_features]
X_test_sub = X_test[remaining_features]

# Create a model
dtc_raceless = DecisionTreeClassifier(max_leaf_nodes = 100) # NOTE: original code maxes nodes to 39, changes to 100 to keep it consistent with mymodel from Part 2, Step 3
    
# Fit model to training data
dtc_raceless.fit(X_train_sub, Y_train)

# Evaluate test accuracy
y_pred = dtc_raceless.predict(X_test_sub)
# accuracy = (y_pred == Y_test['two_year_recid']).mean()

accuracy = dtc_raceless.score(X_test_sub, Y_test) # Achieves the same result as current method

# Check size of decision tree
num_leaves = dtc_raceless.get_n_leaves()

# Report results
print('Trained decision tree with {} leaves and test accuracy {:.2f}.'.format(num_leaves, accuracy))

Trained decision tree with 100 leaves and test accuracy 0.54.


In [38]:
print(X_train)

      juv_misd_count  juv_other_count  juv_fel_count  priors_count  age  sex  \
264                0                0              0             2   30    1   
3938               0                0              0             0   35    0   
1307               0                0              0             4   32    1   
1448               0                0              0             1   27    0   
6515               0                0              2            14   27    1   
...              ...              ...            ...           ...  ...  ...   
7051               0                0              0             1   23    1   
91                 0                0              0             1   21    1   
4585               0                0              0             4   54    1   
6973               0                0              0             0   22    1   
6602               0                0              0             1   26    1   

      c_charge_degree  race_is_Other  r

In [40]:
# In this code block, I will briefly check the recidivism rates of different races as captured in the COMPAS dataset (to be used in my argumentation in the following markdown block)

# I will be using rates since I don't think the various races have equal numbers

recid_native = df[df["race_is_Native American"] == True]["is_recid"]
recid_caucasian = df[df["race_is_Caucasian"] == True]["is_recid"]
recid_afr_amer = df[df["race_is_African-American"] == True]["is_recid"]
recid_other = df[df["race_is_Other"] == True]["is_recid"]
recid_hispanic = df[df["race_is_Hispanic"] == True]["is_recid"]
recid_asian = df[df["race_is_Asian"] == True]["is_recid"]

recid_native_rate = sum(recid_native) / len(recid_native)
recid_caucasian_rate = sum(recid_caucasian) / len(recid_caucasian)
recid_afr_amer_rate = sum(recid_afr_amer) / len(recid_afr_amer)
recid_other_rate = sum(recid_other) / len(recid_other)
recid_hispanic_rate = sum(recid_hispanic) / len(recid_hispanic)
recid_asian_rate = sum(recid_asian) / len(recid_asian)

races = ["Native American", "Caucasian", "African-American", "Other", "Hispanic", "Asian"]
recid_rate = [recid_native_rate, recid_caucasian_rate, recid_afr_amer_rate, recid_other_rate, recid_hispanic_rate, recid_asian_rate]

for i in range(len(races)):
	print(f"Recidivism rate of {races[i]}s is {np.round(np.multiply(recid_rate[i], 100), 2)}%")

Recidivism rate of Native Americans is 54.55%
Recidivism rate of Caucasians is 41.56%
Recidivism rate of African-Americans is 55.84%
Recidivism rate of Others is 37.9%
Recidivism rate of Hispanics is 38.7%
Recidivism rate of Asians is 32.26%


Comparing the mean accuracy values on all features versus the subselected feature set, what do you conclude about the importance of racial information in this classification problem?

I find that the tree trained on all features is slightly more accurate than the tree trained on the subselected feature set (test accuracy 0.55 vs. 0.54). This suggests that the full model uses the race variables from its training set and becomes biased towards certain races as a result: when it is evaluated on the test set, its prediction of recidivism for an individual depends in part on their race. The difference is small and comes from a single train/test split, so on its own it is weak evidence.

While this is "accurate" in the sense that the data set records higher recidivism rates for some groups (e.g., 55.84% for African-American versus 41.56% for Caucasian individuals), the algorithm only upholds systemic oppression: the COMPAS data set is built from actual criminal justice records, in which POCs are already more likely to be sentenced due to their race, and a model that uses race as a feature further cements that pattern.

### Part 3, Step 3: Check for racial bias via decision rules
The interpretability of decision trees allows for an alternative approach to detecting racial bias. You can simply look at the decision rules. Use scikit-learn's function `export_text` to get your decision tree in text format. Compare the decision rules of your tree with all features and your tree fitted on the subset without racial information. Do you find any indication of racial bias in the decision rules of the first tree?

NOTE: I am creating a single-output model as export_text does NOT support multi-output models at this time. I will be choosing "is_recid" as the chosen output as it overall captures our variable of interest.
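
The switch to a single output also changes what `score` measures. When a classifier is fitted on several binary response columns at once, scikit-learn's `score` reports subset accuracy: a row counts as correct only if every column is predicted correctly. The sketch below illustrates this on synthetic data; the two response columns are invented and are not the COMPAS responses.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))

# Two synthetic binary response columns, each a noisy function of one feature
Y_demo = np.column_stack([
    (X_demo[:, 0] + rng.normal(size=200) > 0).astype(int),
    (X_demo[:, 1] + rng.normal(size=200) > 0).astype(int),
])

tree = DecisionTreeClassifier(max_leaf_nodes=4, random_state=0).fit(X_demo, Y_demo)
pred = tree.predict(X_demo)

subset_acc = tree.score(X_demo, Y_demo)            # all columns must match
per_column_acc = (pred == Y_demo).mean(axis=0)     # accuracy of each column alone
all_match = np.all(pred == Y_demo, axis=1).mean()  # manual subset accuracy

print("score():", subset_acc)
print("per-column accuracy:", per_column_acc)
```

Subset accuracy can never exceed the accuracy of the weakest column, which may explain part of the gap between the multi-output and single-output results in this notebook.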

In [46]:
# Create a model
dtc_full_single = DecisionTreeClassifier()
dtc_full_single.max_leaf_nodes = 100 # Chosen number of leaves as per cross-validation finding
    
# Fit model to training data
dtc_full_single.fit(X_train, Y_train["is_recid"])

# Evaluate test accuracy
accuracy_dtc = dtc_full_single.score(X_test, Y_test["is_recid"])

# Check size of decision tree
num_leaves = dtc_full_single.get_n_leaves()

# Report results
print('Trained decision tree (SINGLE-OUTPUT) with {} leaves and test accuracy {:.2f}.'.format(num_leaves, accuracy_dtc))

Trained decision tree (SINGLE-OUTPUT) with 100 leaves and test accuracy 0.68.


In [48]:
feature_names = list(dtc_full_single.feature_names_in_)
class_names = dtc_full_single.classes_
dtc_full_text = export_text(dtc_full_single, 
							feature_names = feature_names,
							show_weights = True)

print(dtc_full_text)

|--- priors_count <= 1.50
|   |--- age <= 22.50
|   |   |--- age <= 20.50
|   |   |   |--- age <= 19.50
|   |   |   |   |--- weights: [0.00, 15.00] class: 1
|   |   |   |--- age >  19.50
|   |   |   |   |--- sex <= 0.50
|   |   |   |   |   |--- juv_fel_count <= 0.50
|   |   |   |   |   |   |--- weights: [10.00, 12.00] class: 1
|   |   |   |   |   |--- juv_fel_count >  0.50
|   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |--- sex >  0.50
|   |   |   |   |   |--- juv_misd_count <= 0.50
|   |   |   |   |   |   |--- weights: [29.00, 67.00] class: 1
|   |   |   |   |   |--- juv_misd_count >  0.50
|   |   |   |   |   |   |--- weights: [0.00, 7.00] class: 1
|   |   |--- age >  20.50
|   |   |   |--- sex <= 0.50
|   |   |   |   |--- age <= 21.50
|   |   |   |   |   |--- juv_misd_count <= 0.50
|   |   |   |   |   |   |--- race_is_Hispanic <= 0.50
|   |   |   |   |   |   |   |--- race_is_African-American <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [7.00, 2.0

We can see several branches in the text of our decision tree where one's race is a factor in the classification of likely recidivism. Specifically, in just the first 10 branches of the tree, we see many decisions being made based on whether an individual is Caucasian, African-American, or Hispanic (much more often than on the other race categories). 

In one branch (depth = 6), the algorithm checks if an individual is Hispanic: if not, then it goes deeper into checking if the individual is African-American -- if they are African-American, then there are cases where "is_recid" = 1, and if not then "is_recid" = 0. (We can see that in that initial branch, if the individual IS Hispanic, they are simply tagged with "is_recid" = 1). There are several other situations in which African-American and Hispanic individuals in particular have decision nodes that are more likely to lead to "is_recid" = 1 while Asian and Caucasian individuals have greater likelihood of being classified "is_recid" = 0.
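
Reading the rules by eye can be complemented with a count of how often each feature is used for a split. The sketch below does this on synthetic data through the fitted tree's `tree_.feature` array and `feature_importances_`; the feature names `f0` to `f4` are hypothetical placeholders.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=400, n_features=5, random_state=0)
names = ["f0", "f1", "f2", "f3", "f4"]  # hypothetical feature names

tree = DecisionTreeClassifier(max_leaf_nodes=10, random_state=0).fit(X_demo, y_demo)

# tree_.feature stores the feature index used at each node;
# leaf nodes carry a negative placeholder, so keep only entries >= 0
split_features = tree.tree_.feature[tree.tree_.feature >= 0]
counts = np.bincount(split_features, minlength=len(names))

for name, n_splits, importance in zip(names, counts, tree.feature_importances_):
    print("{}\tsplits: {}\timportance: {:.3f}".format(name, n_splits, importance))
```

Applied to `dtc_full_single`, the same count would show how many decision nodes split on each `race_is_...` column.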

# Part 4: Comparison to other classifiers

>For some types of data, decision trees tend to achieve lower prediction accuracies. In this part, you will train and tune several classifiers on the COMPAS data. You will then compare their performance on your test set.
>
>This part includes four steps:
>
>1. Fit LDA and logistic regression
>2. Tune and fit ensemble methods
>3. Tune and fit SVC
>4. Compare test accuracy of all your models 

NOTE: the full SINGLE-OUTPUT ("is_recid" only) Decision Tree model has a test accuracy of 0.68. I chose to review only single-output classification for each method for simplicity.

In [53]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression

#### Part 4, Step 1: Fit LDA and Logistic Regression

In [56]:
# LDA
model_lda = LinearDiscriminantAnalysis().fit(X_train, Y_train["is_recid"])
accuracy_lda = model_lda.score(X_test, Y_test["is_recid"])

# LOGISTIC REGRESSION
model_log = LogisticRegression().fit(X_train, Y_train["is_recid"])
accuracy_log = model_log.score(X_test, Y_test["is_recid"])

#### Part 4, Step 2: Tune and Fit Ensemble Methods

In [59]:
# RANDOM FOREST
model_forest = RandomForestClassifier().fit(X_train, Y_train["is_recid"])
accuracy_forest = model_forest.score(X_test, Y_test["is_recid"])

# BAGGING
model_bag = BaggingClassifier().fit(X_train, Y_train["is_recid"])
accuracy_bag = model_bag.score(X_test, Y_test["is_recid"])

# GRADIENT BOOSTING
model_boost = GradientBoostingClassifier().fit(X_train, Y_train["is_recid"])
accuracy_boost = model_boost.score(X_test, Y_test["is_recid"])

#### Part 4, Step 3: Tune and Fit SVC

In [62]:
model_svc = SVC().fit(X_train, Y_train["is_recid"])
accuracy_svc = model_svc.score(X_test, Y_test["is_recid"])

#### Part 4, Step 4: Compare Test Accuracy of All Your Models

In [65]:
models_acc = [accuracy_dtc, accuracy_lda, accuracy_log, accuracy_forest, accuracy_bag, accuracy_boost, accuracy_svc]
model_names = ["Decision Tree Classifier", 
			   "Linear - Linear Discriminant Analysis",
			   "Linear - Logistic Regression",
			   "Ensemble - Random Forest",
			   "Ensemble - Bagging",
			   "Ensemble - Gradient Boosting",
			   "Support Vector Classification (SVC)"]

for i in range(len(models_acc)):
	print(f"Test accuracy of {model_names[i]} is {np.round(models_acc[i] * 100, 2)}%")

Test accuracy of Decision Tree Classifier is 67.8%
Test accuracy of Linear - Linear Discriminant Analysis is 68.12%
Test accuracy of Linear - Logistic Regression is 67.64%
Test accuracy of Ensemble - Random Forest is 64.4%
Test accuracy of Ensemble - Bagging is 65.05%
Test accuracy of Ensemble - Gradient Boosting is 68.93%
Test accuracy of Support Vector Classification (SVC) is 70.23%


We find that all of our models' test accuracies fall roughly in the 64-70% range. 

SVC is the most accurate model (70.23%), and Random Forest is the least accurate (64.4%).
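
To put the test accuracies above into context, it can help to compare them against a majority-class baseline. The sketch below uses scikit-learn's `DummyClassifier` on synthetic data; the class balance passed to `make_classification` is an arbitrary assumption and does not reflect the COMPAS data.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with a mildly imbalanced binary response
X_demo, y_demo = make_classification(n_samples=500, n_features=8,
                                     weights=[0.55, 0.45], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.1, random_state=100)

# Always predict the most frequent class seen during training
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
baseline_acc = baseline.score(X_te, y_te)

print("Baseline test accuracy: {:.2f}%".format(baseline_acc * 100))
```

A model is only informative to the extent that it beats this baseline on the same test set.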