# Week 17 - Ensemble Learning and Random Forest

### 1.	What is inductive reasoning? Deductive reasoning? Give an example of each, different from the examples given in class. 

#### What is inductive reasoning?
Inductive reasoning moves from specific observations to a general conclusion. The conclusion is probable rather than guaranteed, because it extends beyond the cases that were actually observed.

#### What is deductive reasoning?
Deductive reasoning starts with a general rule or theory and applies it to a specific case to reach a specific conclusion. If the general premise is true and the case falls under it, then the conclusion must also be true.

#### Example of each
- Inductive reasoning

A recruiter conducts a study of recent hires who have achieved success and stayed on with the organization. She finds that they graduated from three local colleges, so she decides to focus recruiting efforts on those three schools.

- Deductive reasoning

The career counseling center at my college is offering free resume reviews to students. I am a student and I plan on having my resume reviewed, so I will not have to pay anything for this service.

Using ONE of the following sources, complete the questions for only that source.

Credit approval: https://archive.ics.uci.edu/ml/datasets/Statlog+%28Australian+Credit+Approval%29

Cardiac Arrhythmia: https://archive.ics.uci.edu/ml/datasets/Arrhythmia 

Abalone age: https://archive.ics.uci.edu/ml/datasets/Abalone - this one is a bit harder since it’s not binary like the others, but if you really want to master these concepts, you should pick this one. Use RMSE as a performance metric if you do this as regression. You should target a value of under 3.  

Note: at least one of your models should have the most relevant performance metric above 0.90. All performance metrics should be above 0.75. You will partially be graded on model performance.

### 2.	Preprocess your dataset. Indicate which steps worked and which didn’t. Include your thoughts on why certain steps worked and certain steps didn’t. 

In [1]:
# importing libraries
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv("arrhythmia.txt", header=None, names=['Age', 'Sex', 'Height', 'Weight', 'QRS_duration', 'P_R_interval',
                                                               'Q_T_interval', 'T_interval', 'P_interval', 'QRS_vect', 'T_vect',
                                                               'P_vect', 'QRST_vect', 'J_vect', 'Heart_rate', 'DI_avg_Q',
                                                               'DI_avg_R', 'DI_avg_S', 'DI_avg_R_prime', 'DI_avg_S_prime',
                                                               'DI_intrinsic_deflect', 'DI_ragged_R', 'DI_diphasic_R',
                                                               'DI_ragged_P', 'DI_diphasic_P', 'DI_ragged_T', 'DI_diphasic_T',
                                                               'DII_avg_Q', 'DII_avg_R', 'DII_avg_S', 'DII_avg_R_prime', 
                                                               'DII_avg_S_prime','DII_intrinsic_deflect', 'DII_ragged_R', 
                                                               'DII_diphasic_R', 'DII_ragged_P', 'DII_diphasic_P', 
                                                               'DII_ragged_T', 'DII_diphasic_T',
                                                               'DIII_avg_Q', 'DIII_avg_R', 'DIII_avg_S', 'DIII_avg_R_prime', 
                                                               'DIII_avg_S_prime','DIII_intrinsic_deflect', 'DIII_ragged_R', 
                                                               'DIII_diphasic_R', 'DIII_ragged_P', 'DIII_diphasic_P', 
                                                               'DIII_ragged_T', 'DIII_diphasic_T',
                                                               'AVR_avg_Q', 'AVR_avg_R', 'AVR_avg_S', 'AVR_avg_R_prime', 
                                                               'AVR_avg_S_prime','AVR_intrinsic_deflect', 'AVR_ragged_R', 
                                                               'AVR_diphasic_R', 'AVR_ragged_P', 'AVR_diphasic_P', 
                                                               'AVR_ragged_T', 'AVR_diphasic_T',
                                                               'AVL_avg_Q', 'AVL_avg_R', 'AVL_avg_S', 'AVL_avg_R_prime', 
                                                               'AVL_avg_S_prime','AVL_intrinsic_deflect', 'AVL_ragged_R', 
                                                               'AVL_diphasic_R', 'AVL_ragged_P', 'AVL_diphasic_P', 
                                                               'AVL_ragged_T', 'AVL_diphasic_T',
                                                               'AVF_avg_Q', 'AVF_avg_R', 'AVF_avg_S', 'AVF_avg_R_prime', 
                                                               'AVF_avg_S_prime','AVF_intrinsic_deflect', 'AVF_ragged_R', 
                                                               'AVF_diphasic_R', 'AVF_ragged_P', 'AVF_diphasic_P', 
                                                               'AVF_ragged_T', 'AVF_diphasic_T',
                                                               'V1_avg_Q', 'V1_avg_R', 'V1_avg_S', 'V1_avg_R_prime', 
                                                               'V1_avg_S_prime','V1_intrinsic_deflect', 'V1_ragged_R', 
                                                               'V1_diphasic_R', 'V1_ragged_P', 'V1_diphasic_P', 
                                                               'V1_ragged_T', 'V1_diphasic_T',
                                                               'V2_avg_Q', 'V2_avg_R', 'V2_avg_S', 'V2_avg_R_prime', 
                                                               'V2_avg_S_prime','V2_intrinsic_deflect', 'V2_ragged_R', 
                                                               'V2_diphasic_R', 'V2_ragged_P', 'V2_diphasic_P', 
                                                               'V2_ragged_T', 'V2_diphasic_T',
                                                               'V3_avg_Q', 'V3_avg_R', 'V3_avg_S', 'V3_avg_R_prime', 
                                                               'V3_avg_S_prime','V3_intrinsic_deflect', 'V3_ragged_R', 
                                                               'V3_diphasic_R', 'V3_ragged_P', 'V3_diphasic_P', 
                                                               'V3_ragged_T', 'V3_diphasic_T',
                                                               'V4_avg_Q', 'V4_avg_R', 'V4_avg_S', 'V4_avg_R_prime', 
                                                               'V4_avg_S_prime','V4_intrinsic_deflect', 'V4_ragged_R', 
                                                               'V4_diphasic_R', 'V4_ragged_P', 'V4_diphasic_P', 
                                                               'V4_ragged_T', 'V4_diphasic_T',
                                                               'V5_avg_Q', 'V5_avg_R', 'V5_avg_S', 'V5_avg_R_prime', 
                                                               'V5_avg_S_prime','V5_intrinsic_deflect', 'V5_ragged_R', 
                                                               'V5_diphasic_R', 'V5_ragged_P', 'V5_diphasic_P', 
                                                               'V5_ragged_T', 'V5_diphasic_T',
                                                               'V6_avg_Q', 'V6_avg_R', 'V6_avg_S', 'V6_avg_R_prime', 
                                                               'V6_avg_S_prime','V6_intrinsic_deflect', 'V6_ragged_R', 
                                                               'V6_diphasic_R', 'V6_ragged_P', 'V6_diphasic_P', 
                                                               'V6_ragged_T', 'V6_diphasic_T',
                                                               'DI_JJ_amp', 'DI_Q_amp', 'DI_R_amp', 'DI_S_amp', 'DI_R_prime_amp',
                                                               'DI_S_prime_amp', 'DI_P_amp', 'DI_T_amp', 'DI_QRSA', 'DI_QRSTA',
                                                               'DII_JJ_amp', 'DII_Q_amp', 'DII_R_amp', 'DII_S_amp', 'DII_R_prime_amp',
                                                               'DII_S_prime_amp', 'DII_P_amp', 'DII_T_amp', 'DII_QRSA', 'DII_QRSTA',
                                                               'DIII_JJ_amp', 'DIII_Q_amp', 'DIII_R_amp', 'DIII_S_amp', 'DIII_R_prime_amp',
                                                               'DIII_S_prime_amp', 'DIII_P_amp', 'DIII_T_amp', 'DIII_QRSA', 'DIII_QRSTA',
                                                               'AVR_JJ_amp', 'AVR_Q_amp', 'AVR_R_amp', 'AVR_S_amp', 'AVR_R_prime_amp',
                                                               'AVR_S_prime_amp', 'AVR_P_amp', 'AVR_T_amp', 'AVR_QRSA', 'AVR_QRSTA',
                                                               'AVL_JJ_amp', 'AVL_Q_amp', 'AVL_R_amp', 'AVL_S_amp', 'AVL_R_prime_amp',
                                                               'AVL_S_prime_amp', 'AVL_P_amp', 'AVL_T_amp', 'AVL_QRSA', 'AVL_QRSTA',
                                                               'AVF_JJ_amp', 'AVF_Q_amp', 'AVF_R_amp', 'AVF_S_amp', 'AVF_R_prime_amp',
                                                               'AVF_S_prime_amp', 'AVF_P_amp', 'AVF_T_amp', 'AVF_QRSA', 'AVF_QRSTA',
                                                               'V1_JJ_amp', 'V1_Q_amp', 'V1_R_amp', 'V1_S_amp', 'V1_R_prime_amp',
                                                               'V1_S_prime_amp', 'V1_P_amp', 'V1_T_amp', 'V1_QRSA', 'V1_QRSTA',
                                                               'V2_JJ_amp', 'V2_Q_amp', 'V2_R_amp', 'V2_S_amp', 'V2_R_prime_amp',
                                                               'V2_S_prime_amp', 'V2_P_amp', 'V2_T_amp', 'V2_QRSA', 'V2_QRSTA',
                                                               'V3_JJ_amp', 'V3_Q_amp', 'V3_R_amp', 'V3_S_amp', 'V3_R_prime_amp',
                                                               'V3_S_prime_amp', 'V3_P_amp', 'V3_T_amp', 'V3_QRSA', 'V3_QRSTA',
                                                               'V4_JJ_amp', 'V4_Q_amp', 'V4_R_amp', 'V4_S_amp', 'V4_R_prime_amp',
                                                               'V4_S_prime_amp', 'V4_P_amp', 'V4_T_amp', 'V4_QRSA', 'V4_QRSTA',
                                                               'V5_JJ_amp', 'V5_Q_amp', 'V5_R_amp', 'V5_S_amp', 'V5_R_prime_amp',
                                                               'V5_S_prime_amp', 'V5_P_amp', 'V5_T_amp', 'V5_QRSA', 'V5_QRSTA',
                                                               'V6_JJ_amp', 'V6_Q_amp', 'V6_R_amp', 'V6_S_amp', 'V6_R_prime_amp',
                                                               'V6_S_prime_amp', 'V6_P_amp', 'V6_T_amp', 'V6_QRSA', 'V6_QRSTA','Class_code'])

In [3]:
df.head()


Unnamed: 0,Age,Sex,Height,Weight,QRS_duration,P_R_interval,Q_T_interval,T_interval,P_interval,QRS_vect,...,V6_Q_amp,V6_R_amp,V6_S_amp,V6_R_prime_amp,V6_S_prime_amp,V6_P_amp,V6_T_amp,V6_QRSA,V6_QRSTA,Class_code
0,75,0,190,80,91,193,371,174,121,-16,...,0.0,9.0,-0.9,0.0,0.0,0.9,2.9,23.3,49.4,8
1,56,1,165,64,81,174,401,149,39,25,...,0.0,8.5,0.0,0.0,0.0,0.2,2.1,20.4,38.8,6
2,54,0,172,95,138,163,386,185,102,96,...,0.0,9.5,-2.4,0.0,0.0,0.3,3.4,12.3,49.0,10
3,55,0,175,94,100,202,380,179,143,28,...,0.0,12.2,-2.2,0.0,0.0,0.4,2.6,34.6,61.6,1
4,75,0,190,80,88,181,360,177,103,-16,...,0.0,13.1,-3.6,0.0,0.0,-0.1,3.9,25.4,62.8,7


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 452 entries, 0 to 451
Columns: 280 entries, Age to Class_code
dtypes: float64(120), int64(155), object(5)
memory usage: 988.9+ KB


In [5]:
# info() printed only a summary instead of the per-column listing, most likely because
# this dataframe has more columns than pandas lists by default; verbose=True forces the full listing.
df.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 452 entries, 0 to 451
Data columns (total 280 columns):
 #    Column                  Dtype  
---   ------                  -----  
 0    Age                     int64  
 1    Sex                     int64  
 2    Height                  int64  
 3    Weight                  int64  
 4    QRS_duration            int64  
 5    P_R_interval            int64  
 6    Q_T_interval            int64  
 7    T_interval              int64  
 8    P_interval              int64  
 9    QRS_vect                int64  
 10   T_vect                  object 
 11   P_vect                  object 
 12   QRST_vect               object 
 13   J_vect                  object 
 14   Heart_rate              object 
 15   DI_avg_Q                int64  
 16   DI_avg_R                int64  
 17   DI_avg_S                int64  
 18   DI_avg_R_prime          int64  
 19   DI_avg_S_prime          int64  
 20   DI_intrinsic_deflect    int64  
 21   DI_ragged_R   

In [6]:
df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 452 entries, 0 to 451
Columns: 280 entries, Age to Class_code
dtypes: float64(120), int64(155), object(5)
memory usage: 1.1 MB


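The dataset documentation says there are several missing values. The five object columns suggest they are stored as text placeholders, so isna() will not count them until those columns are converted to numeric.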

In [7]:
df['Class_code'].value_counts()

1     245
10     50
2      44
6      25
16     22
3      15
4      15
5      13
9       9
15      5
14      4
7       3
8       2
Name: Class_code, dtype: int64

In [8]:
df.isna().values.sum()

0

In [9]:
for col in df.columns:
    df[col] = pd.to_numeric(df[col], errors='coerce')

In [10]:
df.isna().values.sum()

408
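
The jump from 0 to 408 missing values fits placeholders that pandas read as text. Here is a minimal sketch of the same effect, assuming the placeholder is '?' as in the UCI copy of this file. The three-row sample and the shortened column list are made up for illustration. Passing na_values to read_csv marks the placeholders as missing at load time, which may remove the need for a separate coercion loop.

```python
import io
import pandas as pd

# Hypothetical three-row sample; the real data lives in arrhythmia.txt.
sample = io.StringIO("75,0,?,63\n56,1,-2,?\n54,0,10,75\n")
names = ["Age", "Sex", "T_vect", "Heart_rate"]

# Default parsing: '?' is kept as text, the column becomes object dtype,
# and isna() reports nothing missing.
raw = pd.read_csv(sample, header=None, names=names)
missing_before = int(raw.isna().values.sum())

# Same data, but '?' is declared as a missing-value marker up front.
sample.seek(0)
parsed = pd.read_csv(sample, header=None, names=names, na_values="?")
missing_after = int(parsed.isna().values.sum())

print(missing_before, missing_after)
print(parsed.dtypes)
```

With this approach the affected columns load as float64 directly, so dtype checks and describe() work right after loading.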

In [11]:
df[df.isnull().any(axis='columns')]

Unnamed: 0,Age,Sex,Height,Weight,QRS_duration,P_R_interval,Q_T_interval,T_interval,P_interval,QRS_vect,...,V6_Q_amp,V6_R_amp,V6_S_amp,V6_R_prime_amp,V6_S_prime_amp,V6_P_amp,V6_T_amp,V6_QRSA,V6_QRSTA,Class_code
0,75,0,190,80,91,193,371,174,121,-16,...,0.0,9.0,-0.9,0.0,0.0,0.9,2.9,23.3,49.4,8
1,56,1,165,64,81,174,401,149,39,25,...,0.0,8.5,0.0,0.0,0.0,0.2,2.1,20.4,38.8,6
3,55,0,175,94,100,202,380,179,143,28,...,0.0,12.2,-2.2,0.0,0.0,0.4,2.6,34.6,61.6,1
4,75,0,190,80,88,181,360,177,103,-16,...,0.0,13.1,-3.6,0.0,0.0,-0.1,3.9,25.4,62.8,7
5,13,0,169,51,100,167,321,174,91,107,...,-0.6,12.2,-2.8,0.0,0.0,0.9,2.2,13.5,31.1,14
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
445,45,0,175,75,91,134,376,160,83,91,...,0.0,7.1,-2.4,0.0,0.0,-0.4,1.3,8.5,17.6,1
446,20,1,157,57,81,151,363,166,80,43,...,0.0,7.2,-0.7,0.0,0.0,0.5,2.3,17.6,39.2,1
447,53,1,160,70,80,199,382,154,117,-37,...,0.0,4.3,-5.0,0.0,0.0,0.7,0.6,-4.4,-0.5,1
448,37,0,190,85,100,137,361,201,73,86,...,0.0,15.6,-1.6,0.0,0.0,0.4,2.4,38.0,62.4,10


In [12]:
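# fill gaps by linear interpolation down each column; gaps in the first rows have no earlier value and stay missing
df.interpolate(method='linear', limit_direction = 'forward', inplace=True)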

In [13]:
df[df.isnull().any(axis='columns')]

Unnamed: 0,Age,Sex,Height,Weight,QRS_duration,P_R_interval,Q_T_interval,T_interval,P_interval,QRS_vect,...,V6_Q_amp,V6_R_amp,V6_S_amp,V6_R_prime_amp,V6_S_prime_amp,V6_P_amp,V6_T_amp,V6_QRSA,V6_QRSTA,Class_code
0,75,0,190,80,91,193,371,174,121,-16,...,0.0,9.0,-0.9,0.0,0.0,0.9,2.9,23.3,49.4,8
1,56,1,165,64,81,174,401,149,39,25,...,0.0,8.5,0.0,0.0,0.0,0.2,2.1,20.4,38.8,6


In [14]:
# second pass in the backward direction fills the leading gaps the forward pass left behind
df.interpolate(method='linear', limit_direction = 'backward', inplace=True)

In [15]:
df[df.isnull().any(axis='columns')]

Unnamed: 0,Age,Sex,Height,Weight,QRS_duration,P_R_interval,Q_T_interval,T_interval,P_interval,QRS_vect,...,V6_Q_amp,V6_R_amp,V6_S_amp,V6_R_prime_amp,V6_S_prime_amp,V6_P_amp,V6_T_amp,V6_QRSA,V6_QRSTA,Class_code


In [16]:
df.isna().values.sum()

0
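
Linear interpolation fills each gap from the rows directly above and below it, which implicitly treats neighbouring patients as related. As an alternative, and not the method used in this notebook, the sketch below fills each gap with the median of its own column, so the result does not depend on row order. The toy frame is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; each row is an unrelated patient.
toy = pd.DataFrame({
    "T_vect": [np.nan, 10.0, 30.0, 20.0],
    "Heart_rate": [63.0, np.nan, 75.0, 70.0],
})

# toy.median() skips NaN and returns one value per column;
# fillna aligns those values by column name.
filled = toy.fillna(toy.median())
print(filled)
```

Median imputation also handles a gap in the first or last row in a single pass, so no second backward pass is needed.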

In [17]:
df['Age'].max()

83

In [18]:
df['Age'].min()


0

In [19]:
df.describe()

Unnamed: 0,Age,Sex,Height,Weight,QRS_duration,P_R_interval,Q_T_interval,T_interval,P_interval,QRS_vect,...,V6_Q_amp,V6_R_amp,V6_S_amp,V6_R_prime_amp,V6_S_prime_amp,V6_P_amp,V6_T_amp,V6_QRSA,V6_QRSTA,Class_code
count,452.0,452.0,452.0,452.0,452.0,452.0,452.0,452.0,452.0,452.0,...,452.0,452.0,452.0,452.0,452.0,452.0,452.0,452.0,452.0,452.0
mean,46.471239,0.550885,166.188053,68.170354,88.920354,155.152655,367.207965,169.949115,90.004425,33.676991,...,-0.278982,9.048009,-1.457301,0.003982,0.0,0.514823,1.222345,19.326106,29.47323,3.880531
std,16.466631,0.497955,37.17034,16.590803,15.364394,44.842283,33.385421,35.633072,25.826643,45.431434,...,0.548876,3.472862,2.00243,0.050118,0.0,0.347531,1.426052,13.503922,18.493927,4.407097
min,0.0,0.0,105.0,6.0,55.0,0.0,232.0,108.0,0.0,-172.0,...,-4.1,0.0,-28.6,0.0,0.0,-0.8,-6.0,-44.2,-38.6,1.0
25%,36.0,0.0,160.0,59.0,80.0,142.0,350.0,148.0,79.0,3.75,...,-0.425,6.6,-2.1,0.0,0.0,0.4,0.5,11.45,17.55,1.0
50%,47.0,1.0,164.0,68.0,86.0,157.0,367.0,162.0,91.0,40.0,...,0.0,8.8,-1.1,0.0,0.0,0.5,1.35,18.1,27.9,1.0
75%,58.0,1.0,170.0,79.0,94.0,175.0,384.0,179.0,102.0,66.0,...,0.0,11.2,0.0,0.0,0.0,0.7,2.1,25.825,41.125,6.0
max,83.0,1.0,780.0,176.0,188.0,524.0,509.0,381.0,205.0,169.0,...,0.0,23.6,0.0,0.8,0.0,2.4,6.0,88.8,115.9,16.0


I figure any height over 200 centimeters is unrealistic for this dataset (213 centimeters is about 7 ft tall).


In [20]:
df[df['Height']>200]

Unnamed: 0,Age,Sex,Height,Weight,QRS_duration,P_R_interval,Q_T_interval,T_interval,P_interval,QRS_vect,...,V6_Q_amp,V6_R_amp,V6_S_amp,V6_R_prime_amp,V6_S_prime_amp,V6_P_amp,V6_T_amp,V6_QRSA,V6_QRSTA,Class_code
141,1,1,780,6,85,165,237,150,106,88,...,0.0,5.0,-4.6,0.0,0.0,1.3,0.7,2.7,5.5,5
316,0,0,608,10,83,126,232,128,60,125,...,-0.7,4.5,-5.5,0.0,0.0,0.5,2.5,-11.8,1.7,5


In [21]:
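# keep heights up to the 200 cm cut-off; .copy() avoids SettingWithCopyWarning when columns are added later
df = df[df['Height'] <= 200].copy()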

In [22]:
df.describe()

Unnamed: 0,Age,Sex,Height,Weight,QRS_duration,P_R_interval,Q_T_interval,T_interval,P_interval,QRS_vect,...,V6_Q_amp,V6_R_amp,V6_S_amp,V6_R_prime_amp,V6_S_prime_amp,V6_P_amp,V6_T_amp,V6_QRSA,V6_QRSTA,Class_code
count,450.0,450.0,450.0,450.0,450.0,450.0,450.0,450.0,450.0,450.0,...,450.0,450.0,450.0,450.0,450.0,450.0,450.0,450.0,450.0,450.0
mean,46.675556,0.551111,163.842222,68.437778,88.942222,155.195556,367.797778,170.086667,90.035556,33.353333,...,-0.278667,9.067111,-1.441333,0.004,0.0,0.513111,1.220667,19.432222,29.588222,3.875556
std,16.214228,0.497934,10.412195,16.132715,15.394913,44.918555,32.260307,35.644734,25.834294,45.254362,...,0.54958,3.468655,1.992218,0.050229,0.0,0.346322,1.427738,13.430692,18.453662,4.416267
min,1.0,0.0,105.0,10.0,55.0,0.0,240.0,108.0,0.0,-172.0,...,-4.1,0.0,-28.6,0.0,0.0,-0.8,-6.0,-44.2,-38.6,1.0
25%,36.0,0.0,160.0,59.0,80.0,142.0,350.0,148.0,79.0,3.25,...,-0.4,6.6,-2.1,0.0,0.0,0.4,0.5,11.5,17.725,1.0
50%,47.0,1.0,164.0,68.0,86.5,157.0,367.5,162.0,91.0,40.0,...,0.0,8.8,-1.1,0.0,0.0,0.5,1.35,18.15,28.1,1.0
75%,58.0,1.0,170.0,79.0,94.0,175.0,384.0,179.0,102.0,66.0,...,0.0,11.2,0.0,0.0,0.0,0.7,2.1,25.875,41.175,6.0
max,83.0,1.0,190.0,176.0,188.0,524.0,509.0,381.0,205.0,169.0,...,0.0,23.6,0.0,0.8,0.0,2.4,6.0,88.8,115.9,16.0
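
The height filter can be sketched on a toy frame to show two details that are easy to miss: whether the comparison includes the cut-off itself, and why the filtered result is copied before new columns are added. The rows below are hypothetical, loosely modelled on the records flagged above.

```python
import pandas as pd

# Hypothetical rows: two plausible adults and two records with impossible heights.
toy = pd.DataFrame({
    "Age": [75, 1, 0, 56],
    "Height": [190, 780, 608, 165],
    "Weight": [80, 6, 10, 64],
})

# "<=" keeps a person who is exactly 200 cm tall; "<" would drop them.
# .copy() makes the result an independent frame, so the column assignment
# below does not raise SettingWithCopyWarning.
kept = toy[toy["Height"] <= 200].copy()
kept["Height_in"] = kept["Height"] / 2.54

print(kept)
```

The original index labels are preserved after filtering, which is why later outputs in the notebook still show labels up to 451 although fewer rows remain.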


In [23]:
# I first assumed young children would not have heart issues, but arrhythmia can occur at any age, so these rows stay
df[df['Age']<10]

Unnamed: 0,Age,Sex,Height,Weight,QRS_duration,P_R_interval,Q_T_interval,T_interval,P_interval,QRS_vect,...,V6_Q_amp,V6_R_amp,V6_S_amp,V6_R_prime_amp,V6_S_prime_amp,V6_P_amp,V6_T_amp,V6_QRSA,V6_QRSTA,Class_code
60,1,0,110,10,80,121,287,156,67,126,...,-0.5,5.1,-4.8,0.0,0.0,0.8,0.9,-1.8,5.2,5
113,9,0,132,33,87,159,335,167,65,78,...,-0.4,12.4,-3.1,0.0,0.0,0.3,5.1,15.2,63.1,1
297,7,1,130,30,131,161,377,216,100,155,...,-0.6,3.3,-1.2,0.0,0.0,0.8,3.0,0.8,35.0,10
320,3,0,105,12,69,155,240,133,64,93,...,0.0,3.3,0.0,0.0,0.0,1.1,-0.1,5.9,5.4,5
379,8,0,120,28,118,126,303,164,80,120,...,-0.6,12.5,-3.6,0.0,0.0,0.5,2.3,9.2,32.2,10
401,9,0,120,25,95,118,347,156,66,84,...,-1.9,16.5,-1.4,0.0,0.0,0.4,3.0,25.3,49.9,14
403,7,1,127,22,185,204,284,123,72,-172,...,0.0,3.9,-15.0,0.0,0.0,-0.8,3.6,-36.6,-20.1,5
424,7,0,119,21,140,157,438,226,81,-40,...,0.0,10.0,-2.1,0.0,0.0,1.0,5.5,36.7,115.9,9
429,8,1,130,24,77,125,358,159,70,87,...,0.0,11.3,-2.1,0.0,0.0,0.7,3.6,16.1,49.2,16


In [24]:
df[df['Height']<120]

Unnamed: 0,Age,Sex,Height,Weight,QRS_duration,P_R_interval,Q_T_interval,T_interval,P_interval,QRS_vect,...,V6_Q_amp,V6_R_amp,V6_S_amp,V6_R_prime_amp,V6_S_prime_amp,V6_P_amp,V6_T_amp,V6_QRSA,V6_QRSTA,Class_code
60,1,0,110,10,80,121,287,156,67,126,...,-0.5,5.1,-4.8,0.0,0.0,0.8,0.9,-1.8,5.2,5
320,3,0,105,12,69,155,240,133,64,93,...,0.0,3.3,0.0,0.0,0.0,1.1,-0.1,5.9,5.4,5
424,7,0,119,21,140,157,438,226,81,-40,...,0.0,10.0,-2.1,0.0,0.0,1.0,5.5,36.7,115.9,9


In [25]:
df['Height_in'] = df['Height']/2.54

In [26]:
df[df['Height']<120]

Unnamed: 0,Age,Sex,Height,Weight,QRS_duration,P_R_interval,Q_T_interval,T_interval,P_interval,QRS_vect,...,V6_R_amp,V6_S_amp,V6_R_prime_amp,V6_S_prime_amp,V6_P_amp,V6_T_amp,V6_QRSA,V6_QRSTA,Class_code,Height_in
60,1,0,110,10,80,121,287,156,67,126,...,5.1,-4.8,0.0,0.0,0.8,0.9,-1.8,5.2,5,43.307087
320,3,0,105,12,69,155,240,133,64,93,...,3.3,0.0,0.0,0.0,1.1,-0.1,5.9,5.4,5,41.338583
424,7,0,119,21,140,157,438,226,81,-40,...,10.0,-2.1,0.0,0.0,1.0,5.5,36.7,115.9,9,46.850394


In [27]:
df[df['Height_in']<50]

Unnamed: 0,Age,Sex,Height,Weight,QRS_duration,P_R_interval,Q_T_interval,T_interval,P_interval,QRS_vect,...,V6_R_amp,V6_S_amp,V6_R_prime_amp,V6_S_prime_amp,V6_P_amp,V6_T_amp,V6_QRSA,V6_QRSTA,Class_code,Height_in
60,1,0,110,10,80,121,287,156,67,126,...,5.1,-4.8,0.0,0.0,0.8,0.9,-1.8,5.2,5,43.307087
210,11,1,124,25,90,161,349,209,98,80,...,8.6,-4.6,0.0,0.0,0.6,4.2,3.5,42.1,10,48.818898
320,3,0,105,12,69,155,240,133,64,93,...,3.3,0.0,0.0,0.0,1.1,-0.1,5.9,5.4,5,41.338583
379,8,0,120,28,118,126,303,164,80,120,...,12.5,-3.6,0.0,0.0,0.5,2.3,9.2,32.2,10,47.244094
401,9,0,120,25,95,118,347,156,66,84,...,16.5,-1.4,0.0,0.0,0.4,3.0,25.3,49.9,14,47.244094
424,7,0,119,21,140,157,438,226,81,-40,...,10.0,-2.1,0.0,0.0,1.0,5.5,36.7,115.9,9,46.850394


In [28]:
df['Weight_lbs'] = df['Weight'] * 2.205

In [29]:
df[df['Weight']<15]

Unnamed: 0,Age,Sex,Height,Weight,QRS_duration,P_R_interval,Q_T_interval,T_interval,P_interval,QRS_vect,...,V6_S_amp,V6_R_prime_amp,V6_S_prime_amp,V6_P_amp,V6_T_amp,V6_QRSA,V6_QRSTA,Class_code,Height_in,Weight_lbs
60,1,0,110,10,80,121,287,156,67,126,...,-4.8,0.0,0.0,0.8,0.9,-1.8,5.2,5,43.307087,22.05
320,3,0,105,12,69,155,240,133,64,93,...,0.0,0.0,0.0,1.1,-0.1,5.9,5.4,5,41.338583,26.46


A one-year-old boy is not typically 43" (110 cm) tall, so I plan to drop that record (index 60) as well.

In [30]:
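# drops the remaining age-1 record (index 60); .copy() keeps later column assignments warning-free
df = df[df['Age'] > 1].copy()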

In [31]:
df.describe()

Unnamed: 0,Age,Sex,Height,Weight,QRS_duration,P_R_interval,Q_T_interval,T_interval,P_interval,QRS_vect,...,V6_S_amp,V6_R_prime_amp,V6_S_prime_amp,V6_P_amp,V6_T_amp,V6_QRSA,V6_QRSTA,Class_code,Height_in,Weight_lbs
count,449.0,449.0,449.0,449.0,449.0,449.0,449.0,449.0,449.0,449.0,...,449.0,449.0,449.0,449.0,449.0,449.0,449.0,449.0,449.0,449.0
mean,46.777283,0.552339,163.962138,68.567929,88.962138,155.271715,367.977728,170.11804,90.08686,33.146993,...,-1.433853,0.004009,0.0,0.512472,1.221381,19.47951,29.642539,3.873051,64.552023,151.192283
std,16.087909,0.497808,10.107939,15.91244,15.40628,44.939564,32.069392,35.678273,25.840151,45.092422,...,1.988104,0.050285,0.0,0.346443,1.42925,13.408118,18.438199,4.420873,3.979504,35.086931
min,3.0,0.0,105.0,12.0,55.0,0.0,240.0,108.0,0.0,-172.0,...,-28.6,0.0,0.0,-0.8,-6.0,-44.2,-38.6,1.0,41.338583,26.46
25%,36.0,0.0,160.0,59.0,80.0,142.0,350.0,148.0,79.0,3.0,...,-2.1,0.0,0.0,0.4,0.5,11.5,17.8,1.0,62.992126,130.095
50%,47.0,1.0,164.0,68.0,87.0,157.0,368.0,162.0,91.0,40.0,...,-1.1,0.0,0.0,0.5,1.4,18.2,28.3,1.0,64.566929,149.94
75%,58.0,1.0,170.0,79.0,94.0,175.0,384.0,179.0,102.0,66.0,...,0.0,0.0,0.0,0.7,2.1,25.9,41.2,6.0,66.929134,174.195
max,83.0,1.0,190.0,176.0,188.0,524.0,509.0,381.0,205.0,169.0,...,0.0,0.8,0.0,2.4,6.0,88.8,115.9,16.0,74.80315,388.08


In [32]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 449 entries, 0 to 451
Columns: 282 entries, Age to Weight_lbs
dtypes: float64(127), int64(155)
memory usage: 992.7 KB


I decided to collapse each per-lead measurement (for example, the twelve avg_Q columns) into a single column holding its mean across the twelve leads. This also reduces the number of features.

Once that is done, I will recode the class codes so that the normal class shows as 0 (no arrhythmia) and every other class shows as 1 (positive for arrhythmia).

In [33]:
cols = ['DI_avg_Q','DII_avg_Q','DIII_avg_Q','AVR_avg_Q', 'AVL_avg_Q', 'AVF_avg_Q', 'V1_avg_Q', 'V2_avg_Q', 'V3_avg_Q', 'V4_avg_Q', 
       'V5_avg_Q','V6_avg_Q']
df['avg_Q'] = df[cols].mean(axis=1)

In [34]:
cols1 = ['DI_avg_R', 'DII_avg_R', 'DIII_avg_R', 'AVR_avg_R', 'AVL_avg_R', 'AVF_avg_R', 'V1_avg_R', 'V2_avg_R', 'V3_avg_R', 'V4_avg_R', 
         'V5_avg_R', 'V6_avg_R']
df['avg_R'] = df[cols1].mean(axis=1)
                                                            

In [35]:
cols2 = ['DI_avg_S', 'DII_avg_S', 'DIII_avg_S', 'AVR_avg_S', 'AVL_avg_S', 'AVF_avg_S', 'V1_avg_S', 'V2_avg_S', 'V3_avg_S', 'V4_avg_S', 
        'V5_avg_S',  'V6_avg_S']
df['avg_S'] = df[cols2].mean(axis=1)

In [36]:
cols3 = ['DI_avg_R_prime', 'DII_avg_R_prime', 'DIII_avg_R_prime', 'AVR_avg_R_prime', 'AVL_avg_R_prime', 'AVF_avg_R_prime', 
         'V1_avg_R_prime', 'V2_avg_R_prime', 'V3_avg_R_prime', 'V4_avg_R_prime', 'V5_avg_R_prime',  'V6_avg_R_prime']
df['avg_R_prime'] = df[cols3].mean(axis=1)

In [37]:
cols4 = ['DI_avg_S_prime', 'DII_avg_S_prime', 'DIII_avg_S_prime', 'AVR_avg_S_prime', 'AVL_avg_S_prime', 'AVF_avg_S_prime', 
         'V1_avg_S_prime', 'V2_avg_S_prime', 'V3_avg_S_prime', 'V4_avg_S_prime', 'V5_avg_S_prime',  'V6_avg_S_prime']
df['avg_S_prime'] = df[cols4].mean(axis=1)

In [38]:
cols5 = ['DI_intrinsic_deflect', 'DII_intrinsic_deflect', 'DIII_intrinsic_deflect', 'AVR_intrinsic_deflect', 'AVL_intrinsic_deflect',
         'AVF_intrinsic_deflect', 'V1_intrinsic_deflect', 'V2_intrinsic_deflect', 'V3_intrinsic_deflect', 'V4_intrinsic_deflect',
         'V5_intrinsic_deflect',  'V6_intrinsic_deflect']
df['intrinsic_deflect'] = df[cols5].mean(axis=1)

In [39]:
cols6 = ['DI_ragged_R', 'DII_ragged_R', 'DIII_ragged_R', 'AVR_ragged_R', 'AVL_ragged_R', 'AVF_ragged_R', 
         'V1_ragged_R', 'V2_ragged_R', 'V3_ragged_R', 'V4_ragged_R', 'V5_ragged_R',  'V6_ragged_R']
df['ragged_R'] = df[cols6].mean(axis=1)

In [40]:
cols7 = ['DI_diphasic_R', 'DII_diphasic_R', 'DIII_diphasic_R', 'AVR_diphasic_R', 'AVL_diphasic_R', 'AVF_diphasic_R', 
         'V1_diphasic_R', 'V2_diphasic_R', 'V3_diphasic_R', 'V4_diphasic_R', 'V5_diphasic_R',  'V6_diphasic_R']
df['diphasic_R'] = df[cols7].mean(axis=1)

In [41]:
cols8 = ['DI_ragged_P', 'DII_ragged_P', 'DIII_ragged_P', 'AVR_ragged_P', 'AVL_ragged_P', 'AVF_ragged_P', 
         'V1_ragged_P', 'V2_ragged_P', 'V3_ragged_P', 'V4_ragged_P', 'V5_ragged_P',  'V6_ragged_P']
df['ragged_P'] = df[cols8].mean(axis=1)

In [42]:
cols9 = ['DI_diphasic_P', 'DII_diphasic_P', 'DIII_diphasic_P', 'AVR_diphasic_P', 'AVL_diphasic_P', 'AVF_diphasic_P', 
         'V1_diphasic_P', 'V2_diphasic_P', 'V3_diphasic_P', 'V4_diphasic_P', 'V5_diphasic_P',  'V6_diphasic_P']
df['diphasic_P'] = df[cols9].mean(axis=1)

In [43]:
cols10 = ['DI_ragged_T', 'DII_ragged_T', 'DIII_ragged_T', 'AVR_ragged_T', 'AVL_ragged_T', 'AVF_ragged_T', 
         'V1_ragged_T', 'V2_ragged_T', 'V3_ragged_T', 'V4_ragged_T', 'V5_ragged_T',  'V6_ragged_T']
df['ragged_T'] = df[cols10].mean(axis=1)

In [44]:
cols11 = ['DI_diphasic_T', 'DII_diphasic_T', 'DIII_diphasic_T', 'AVR_diphasic_T', 'AVL_diphasic_T', 'AVF_diphasic_T', 
         'V1_diphasic_T', 'V2_diphasic_T', 'V3_diphasic_T', 'V4_diphasic_T', 'V5_diphasic_T',  'V6_diphasic_T']
df['diphasic_T'] = df[cols11].mean(axis=1)

In [45]:
colsa = ['DI_JJ_amp', 'DII_JJ_amp', 'DIII_JJ_amp', 'AVR_JJ_amp', 'AVL_JJ_amp', 'AVF_JJ_amp', 
         'V1_JJ_amp', 'V2_JJ_amp', 'V3_JJ_amp', 'V4_JJ_amp', 'V5_JJ_amp', 'V6_JJ_amp']
df['JJ_amp'] = df[colsa].mean(axis=1)

In [46]:
colsb = ['DI_Q_amp', 'DII_Q_amp', 'DIII_Q_amp', 'AVR_Q_amp', 'AVL_Q_amp', 'AVF_Q_amp', 
         'V1_Q_amp', 'V2_Q_amp', 'V3_Q_amp', 'V4_Q_amp', 'V5_Q_amp', 'V6_Q_amp']
df['Q_amp'] = df[colsb].mean(axis=1)

In [47]:
colsc = ['DI_R_amp', 'DII_R_amp', 'DIII_R_amp', 'AVR_R_amp', 'AVL_R_amp', 'AVF_R_amp', 
         'V1_R_amp', 'V2_R_amp', 'V3_R_amp', 'V4_R_amp', 'V5_R_amp', 'V6_R_amp']
df['R_amp'] = df[colsc].mean(axis=1)

In [48]:
colsd = ['DI_S_amp', 'DII_S_amp', 'DIII_S_amp', 'AVR_S_amp', 'AVL_S_amp', 'AVF_S_amp', 
         'V1_S_amp', 'V2_S_amp', 'V3_S_amp', 'V4_S_amp', 'V5_S_amp', 'V6_S_amp']
df['S_amp'] = df[colsd].mean(axis=1)

In [49]:
colse = ['DI_R_prime_amp', 'DII_R_prime_amp', 'DIII_R_prime_amp', 'AVR_R_prime_amp', 'AVL_R_prime_amp', 'AVF_R_prime_amp', 
         'V1_R_prime_amp', 'V2_R_prime_amp', 'V3_R_prime_amp', 'V4_R_prime_amp', 'V5_R_prime_amp', 'V6_R_prime_amp']
df['R_prime_amp'] = df[colse].mean(axis=1)

In [50]:
colsf = ['DI_S_prime_amp', 'DII_S_prime_amp', 'DIII_S_prime_amp', 'AVR_S_prime_amp', 'AVL_S_prime_amp', 'AVF_S_prime_amp', 
         'V1_S_prime_amp', 'V2_S_prime_amp', 'V3_S_prime_amp', 'V4_S_prime_amp', 'V5_S_prime_amp', 'V6_S_prime_amp']
df['S_prime_amp'] = df[colsf].mean(axis=1)

In [51]:
colsg = ['DI_P_amp', 'DII_P_amp', 'DIII_P_amp', 'AVR_P_amp', 'AVL_P_amp', 'AVF_P_amp', 
         'V1_P_amp', 'V2_P_amp', 'V3_P_amp', 'V4_P_amp', 'V5_P_amp', 'V6_P_amp']
df['P_amp'] = df[colsg].mean(axis=1)

In [52]:
colsh = ['DI_T_amp', 'DII_T_amp', 'DIII_T_amp', 'AVR_T_amp', 'AVL_T_amp', 'AVF_T_amp', 
         'V1_T_amp', 'V2_T_amp', 'V3_T_amp', 'V4_T_amp', 'V5_T_amp', 'V6_T_amp']
df['T_amp'] = df[colsh].mean(axis=1)

In [53]:
colsi = ['DI_QRSA', 'DII_QRSA', 'DIII_QRSA', 'AVR_QRSA', 'AVL_QRSA', 'AVF_QRSA', 
         'V1_QRSA', 'V2_QRSA', 'V3_QRSA', 'V4_QRSA', 'V5_QRSA', 'V6_QRSA']
df['QRSA'] = df[colsi].mean(axis=1)

In [54]:
colsj = ['DI_QRSTA', 'DII_QRSTA', 'DIII_QRSTA', 'AVR_QRSTA', 'AVL_QRSTA', 'AVF_QRSTA', 
         'V1_QRSTA', 'V2_QRSTA', 'V3_QRSTA', 'V4_QRSTA', 'V5_QRSTA', 'V6_QRSTA']
df['QRSTA'] = df[colsj].mean(axis=1)

In [55]:
df.describe()

Unnamed: 0,Age,Sex,Height,Weight,QRS_duration,P_R_interval,Q_T_interval,T_interval,P_interval,QRS_vect,...,JJ_amp,Q_amp,R_amp,S_amp,R_prime_amp,S_prime_amp,P_amp,T_amp,QRSA,QRSTA
count,449.0,449.0,449.0,449.0,449.0,449.0,449.0,449.0,449.0,449.0,...,449.0,449.0,449.0,449.0,449.0,449.0,449.0,449.0,449.0,449.0
mean,46.777283,0.552339,163.962138,68.567929,88.962138,155.271715,367.977728,170.11804,90.08686,33.146993,...,0.128248,-0.945063,5.970082,-3.795137,0.072253,-0.006292,0.297717,1.244469,2.929733,14.657814
std,16.087909,0.497808,10.107939,15.91244,15.40628,44.939564,32.069392,35.678273,25.840151,45.092422,...,0.353153,0.868811,2.120282,1.897839,0.227238,0.047672,0.18239,0.859647,8.693708,11.740763
min,3.0,0.0,105.0,12.0,55.0,0.0,240.0,108.0,0.0,-172.0,...,-1.091667,-9.216667,1.025,-16.083333,0.0,-0.8,-0.541667,-2.241667,-46.283333,-39.141667
25%,36.0,0.0,160.0,59.0,80.0,142.0,350.0,148.0,79.0,3.0,...,-0.058333,-1.0,4.6,-4.591667,0.0,0.0,0.208333,0.675,-0.6,6.0
50%,47.0,1.0,164.0,68.0,87.0,157.0,368.0,162.0,91.0,40.0,...,0.108333,-0.733333,5.691667,-3.475,0.0,0.0,0.3,1.25,3.183333,15.216667
75%,58.0,1.0,170.0,79.0,94.0,175.0,384.0,179.0,102.0,66.0,...,0.275,-0.541667,7.091667,-2.625,0.066667,0.0,0.408333,1.775,8.133333,21.616667
max,83.0,1.0,190.0,176.0,188.0,524.0,509.0,381.0,205.0,169.0,...,2.791667,-0.075,14.125,-0.083333,3.625,0.0,1.025,4.208333,49.8,56.691667


In [56]:
columns_to_drop = (cols + cols1 + cols2 + cols3 + cols4 + cols5 + cols6 + cols7 + cols8 + cols9 + cols10 + cols11 + colsa +
                   colsb + colsc + colsd + colse + colsf + colsg + colsh + colsi + colsj)


In [57]:
df.drop(columns_to_drop, axis=1, inplace=True)
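
The averaging cells above repeat one pattern: build the twelve lead-prefixed column names for a measurement, take the row-wise mean, and later drop the originals. A compact sketch of that pattern follows. The helper name average_over_leads and the toy frame are hypothetical, and the helper assumes every column is named lead_measurement, as in this notebook.

```python
import pandas as pd

LEADS = ["DI", "DII", "DIII", "AVR", "AVL", "AVF",
         "V1", "V2", "V3", "V4", "V5", "V6"]

def average_over_leads(frame, measurements, leads=LEADS):
    """Return a copy with one mean column per measurement; per-lead columns are dropped."""
    out = frame.copy()
    to_drop = []
    for m in measurements:
        # e.g. m = "avg_Q" -> ["DI_avg_Q", "DII_avg_Q", ...]
        lead_cols = [f"{lead}_{m}" for lead in leads]
        out[m] = out[lead_cols].mean(axis=1)
        to_drop.extend(lead_cols)
    return out.drop(columns=to_drop)

# Hypothetical example with two leads and two measurements.
toy = pd.DataFrame({
    "Age": [75, 56],
    "DI_avg_Q": [0, 4], "DII_avg_Q": [2, 8],
    "DI_Q_amp": [-1.0, 0.0], "DII_Q_amp": [-3.0, -1.0],
})
reduced = average_over_leads(toy, ["avg_Q", "Q_amp"], leads=["DI", "DII"])
print(reduced)
```

On the real frame the call would pass the 22 measurement suffixes used above (avg_Q through QRSTA) and the default lead list. Because the helper returns a new frame, the original stays available for comparison.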

In [58]:
df.describe().round(2)

Unnamed: 0,Age,Sex,Height,Weight,QRS_duration,P_R_interval,Q_T_interval,T_interval,P_interval,QRS_vect,...,JJ_amp,Q_amp,R_amp,S_amp,R_prime_amp,S_prime_amp,P_amp,T_amp,QRSA,QRSTA
count,449.0,449.0,449.0,449.0,449.0,449.0,449.0,449.0,449.0,449.0,...,449.0,449.0,449.0,449.0,449.0,449.0,449.0,449.0,449.0,449.0
mean,46.78,0.55,163.96,68.57,88.96,155.27,367.98,170.12,90.09,33.15,...,0.13,-0.95,5.97,-3.8,0.07,-0.01,0.3,1.24,2.93,14.66
std,16.09,0.5,10.11,15.91,15.41,44.94,32.07,35.68,25.84,45.09,...,0.35,0.87,2.12,1.9,0.23,0.05,0.18,0.86,8.69,11.74
min,3.0,0.0,105.0,12.0,55.0,0.0,240.0,108.0,0.0,-172.0,...,-1.09,-9.22,1.03,-16.08,0.0,-0.8,-0.54,-2.24,-46.28,-39.14
25%,36.0,0.0,160.0,59.0,80.0,142.0,350.0,148.0,79.0,3.0,...,-0.06,-1.0,4.6,-4.59,0.0,0.0,0.21,0.68,-0.6,6.0
50%,47.0,1.0,164.0,68.0,87.0,157.0,368.0,162.0,91.0,40.0,...,0.11,-0.73,5.69,-3.48,0.0,0.0,0.3,1.25,3.18,15.22
75%,58.0,1.0,170.0,79.0,94.0,175.0,384.0,179.0,102.0,66.0,...,0.28,-0.54,7.09,-2.62,0.07,0.0,0.41,1.78,8.13,21.62
max,83.0,1.0,190.0,176.0,188.0,524.0,509.0,381.0,205.0,169.0,...,2.79,-0.08,14.13,-0.08,3.62,0.0,1.02,4.21,49.8,56.69


In [59]:
df.drop(['Height','Weight'], axis=1, inplace=True)

In [60]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 449 entries, 0 to 451
Data columns (total 38 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Age                449 non-null    int64  
 1   Sex                449 non-null    int64  
 2   QRS_duration       449 non-null    int64  
 3   P_R_interval       449 non-null    int64  
 4   Q_T_interval       449 non-null    int64  
 5   T_interval         449 non-null    int64  
 6   P_interval         449 non-null    int64  
 7   QRS_vect           449 non-null    int64  
 8   T_vect             449 non-null    float64
 9   P_vect             449 non-null    float64
 10  QRST_vect          449 non-null    float64
 11  J_vect             449 non-null    float64
 12  Heart_rate         449 non-null    float64
 13  Class_code         449 non-null    int64  
 14  Height_in          449 non-null    float64
 15  Weight_lbs         449 non-null    float64
 16  avg_Q              449 non

In [61]:
# Class 1 is the "normal" class in this dataset; recode it to 0
df['Class_code'] = df.Class_code.apply(lambda x: 0 if x==1 else x)

In [62]:
df['Class_code'].value_counts()

0     245
10     50
2      44
6      25
16     22
3      15
4      15
5      10
9       9
15      5
14      4
7       3
8       2
Name: Class_code, dtype: int64

In [63]:
# Collapse every remaining (arrhythmia) class into a single positive class 1
df['Class_code'] = df.Class_code.apply(lambda x: 1 if x!=0 else x)

In [64]:
df['Class_code'].value_counts()

0    245
1    204
Name: Class_code, dtype: int64
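
The two `apply` steps above (class 1 becomes 0, then everything else becomes 1) can also be written as a single comparison. This is only a minimal sketch on a made-up frame; `toy` and its values are hypothetical and stand in for `df`.

```python
import pandas as pd

# Hypothetical stand-in for df; class 1 is "normal", every other code is an arrhythmia
toy = pd.DataFrame({"Class_code": [1, 10, 2, 1, 16, 6]})

# True where the class is not "normal", then cast True/False to 1/0
toy["Class_code"] = (toy["Class_code"] != 1).astype(int)

print(toy["Class_code"].value_counts().sort_index())
```

Doing it in one step avoids the intermediate state where 0 means "normal" but the other codes are still multi-class.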

### 3.	Create a decision tree model tuned to the best of your abilities. Explain how you tuned it.

Oddly, the first time I ran the decision tree model I had not yet combined columns or recoded the class code to 0's and 1's, and the results were the same before and after those steps.

https://www.datacamp.com/community/tutorials/decision-tree-classification-python
https://hbr.org/1964/07/decision-trees-for-decision-making

In [65]:
X = df.drop(['Class_code'],axis=1)
y = df['Class_code']

In [66]:
#Import DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier

#Import train_test_split
from sklearn.model_selection import train_test_split

#Import metrics to use
from sklearn.metrics import classification_report, plot_confusion_matrix, accuracy_score, precision_score, recall_score

from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state=42, stratify=y)
y_train.value_counts()

0    183
1    153
Name: Class_code, dtype: int64

In [67]:
#Import DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier(random_state=42,max_depth=8,min_samples_split=60,criterion='entropy')


In [68]:
#Fit dt to the training set
dt.fit(X_train, y_train)

#Predict the test set
y_pred = dt.predict(X_test)
#Evaluate the test_set accuracy
accuracy_score(y_test, y_pred)

0.8053097345132744

In [69]:
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.84      0.79      0.82        62
           1       0.76      0.82      0.79        51

    accuracy                           0.81       113
   macro avg       0.80      0.81      0.80       113
weighted avg       0.81      0.81      0.81       113
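
To make the report easier to read, the class 1 figures can be reproduced by hand from confusion-matrix counts. The counts below are an assumption: they are inferred from the report (support of 62 and 51, recall of 0.79 and 0.82) and the accuracy of 0.8053, not read from a printed confusion matrix.

```python
# Counts inferred from the classification report above (an assumption)
tn, fp = 49, 13   # actual class 0: 62 rows
fn, tp = 9, 42    # actual class 1: 51 rows

precision = tp / (tp + fp)                   # of the rows predicted 1, share that really are 1
recall = tp / (tp + fn)                      # of the rows that really are 1, share that were found
accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all rows predicted correctly

print(round(precision, 2), round(recall, 2), round(accuracy, 4))
```

Under these counts, raising recall above .90 means cutting the 9 missed positives to at most 5, ideally without adding many more than the 13 false positives.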



In [70]:
#recall needs to be over .90 and all others over .75 


Baseline results for class 1: accuracy .81, recall .82, precision .76.

The goal is to raise recall above .90 while keeping precision and accuracy above .75.

Baseline parameters: random_state=42, max_depth=8, min_samples_split=60, criterion='entropy'

In [71]:
dt = DecisionTreeClassifier(criterion='entropy', 
                                max_depth=8, 
                                random_state=42, 
                                min_samples_split=60)
dt.fit(X_train,y_train)
y_pred = dt.predict(X_test)
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.84      0.79      0.82        62
           1       0.76      0.82      0.79        51

    accuracy                           0.81       113
   macro avg       0.80      0.81      0.80       113
weighted avg       0.81      0.81      0.81       113



In [72]:
#these are all of the parameters that can be used - will set up some functions to check out the ones I haven't used above
dt = DecisionTreeClassifier(max_depth=10, 
                            splitter='random',
                            min_samples_split=15,
                            min_samples_leaf=5,
                            min_weight_fraction_leaf=0.03,
                            max_features=20,
                            random_state=42, 
                            max_leaf_nodes=2,
                            class_weight='balanced',
                            min_impurity_decrease=0.01,
                            ccp_alpha = 0.005,
                            criterion='entropy')
                              
dt.fit(X_train,y_train)
y_pred = dt.predict(X_test)
print('precision for estimate is', precision_score(y_test,y_pred).round(2))
print('recall for estimate is', recall_score(y_test,y_pred).round(2))

precision for estimate is 0.86
recall for estimate is 0.12


In [73]:
def min_samp_leaf(n):
    dt = DecisionTreeClassifier(criterion='entropy',
                                max_depth=8, 
                                random_state=42, 
                                min_samples_split=60,
                                min_samples_leaf=n)
    
    dt.fit(X_train,y_train)
    y_pred = dt.predict(X_test)
    print('precision for min_samp_leaf ', n, 'is', precision_score(y_test,y_pred).round(2))
    print('recall for min_samp_leaf ', n, 'is', recall_score(y_test,y_pred).round(2))
    
min_samp_leaf(0.01)
min_samp_leaf(0.03)
min_samp_leaf(1)
min_samp_leaf(2)
min_samp_leaf(5)
min_samp_leaf(10)
min_samp_leaf(20)
min_samp_leaf(50)


precision for min_samp_leaf  0.01 is 0.76
recall for min_samp_leaf  0.01 is 0.82
precision for min_samp_leaf  0.03 is 0.76
recall for min_samp_leaf  0.03 is 0.63
precision for min_samp_leaf  1 is 0.76
recall for min_samp_leaf  1 is 0.82
precision for min_samp_leaf  2 is 0.76
recall for min_samp_leaf  2 is 0.82
precision for min_samp_leaf  5 is 0.76
recall for min_samp_leaf  5 is 0.82
precision for min_samp_leaf  10 is 0.76
recall for min_samp_leaf  10 is 0.63
precision for min_samp_leaf  20 is 0.75
recall for min_samp_leaf  20 is 0.59
precision for min_samp_leaf  50 is 0.75
recall for min_samp_leaf  50 is 0.75
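
The one-function-per-parameter pattern used in these cells could be folded into a single helper. The sketch below is one possible shape, not the notebook's own code: `sweep`, `BASE` and the synthetic data from `make_classification` are assumptions, and the real `X_train`/`X_test` would replace the stand-in split.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption); swap in the notebook's X and y
X_demo, y_demo = make_classification(n_samples=449, n_features=37,
                                     n_informative=8, random_state=42)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.25,
                                      random_state=42, stratify=y_demo)

# Baseline settings taken from the cells above
BASE = dict(criterion="entropy", max_depth=8, min_samples_split=60, random_state=42)

def sweep(param, values):
    """Fit one tree per value of `param`; return (value, precision, recall) rows."""
    rows = []
    for v in values:
        dt = DecisionTreeClassifier(**{**BASE, param: v})
        dt.fit(Xtr, ytr)
        pred = dt.predict(Xte)
        rows.append((v,
                     round(precision_score(yte, pred, zero_division=0), 2),
                     round(recall_score(yte, pred), 2)))
    return rows

for row in sweep("min_samples_leaf", [1, 2, 5, 10, 20, 50]):
    print(row)
```

The same call then covers the other sweeps, for example `sweep("max_leaf_nodes", [None, 2, 5, 10, 20])`.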


In [74]:
def min_fract_leaf(n):
    dt = DecisionTreeClassifier(criterion='entropy',
                                max_depth=8, 
                                random_state=42, 
                                min_samples_split=60,
                                min_samples_leaf=0.01,
                                min_weight_fraction_leaf=n)
    
    dt.fit(X_train,y_train)
    y_pred = dt.predict(X_test)
    print('precision for min_fract_leaf ', n, 'is', precision_score(y_test,y_pred).round(2))
    print('recall for min_fract_leaf ', n, 'is', recall_score(y_test,y_pred).round(2))
    
min_fract_leaf(0)
min_fract_leaf(0.001)
min_fract_leaf(0.03)
min_fract_leaf(0.005)


precision for min_fract_leaf  0 is 0.76
recall for min_fract_leaf  0 is 0.82
precision for min_fract_leaf  0.001 is 0.76
recall for min_fract_leaf  0.001 is 0.82
precision for min_fract_leaf  0.03 is 0.76
recall for min_fract_leaf  0.03 is 0.63
precision for min_fract_leaf  0.005 is 0.76
recall for min_fract_leaf  0.005 is 0.82


In [75]:
def max_feat(n):
    dt = DecisionTreeClassifier(criterion='entropy',
                                max_depth=8, 
                                random_state=42, 
                                min_samples_split=60,
                                min_samples_leaf=0.01,
                                min_weight_fraction_leaf=0.001,
                                max_features=n)
                           
    dt.fit(X_train,y_train)
    y_pred = dt.predict(X_test)
    print('precision for max_feat ', n, 'is', precision_score(y_test,y_pred).round(2))
    print('recall for max_feat ', n, 'is', recall_score(y_test,y_pred).round(2))
    

max_feat(0.5)
max_feat(None)
max_feat(5)
max_feat(10)
max_feat(20)

precision for max_feat  0.5 is 0.65
recall for max_feat  0.5 is 0.84
precision for max_feat  None is 0.76
recall for max_feat  None is 0.82
precision for max_feat  5 is 0.67
recall for max_feat  5 is 0.69
precision for max_feat  10 is 0.81
recall for max_feat  10 is 0.76
precision for max_feat  20 is 0.72
recall for max_feat  20 is 0.75


In [76]:
def max_leaf(n):
    dt = DecisionTreeClassifier(criterion='entropy',
                                max_depth=8, 
                                random_state=42, 
                                min_samples_split=60,
                                min_samples_leaf=0.01,
                                min_weight_fraction_leaf=0.001,
                                max_features=None,
                                max_leaf_nodes=n)

    dt.fit(X_train,y_train)
    y_pred = dt.predict(X_test)
    print('precision for max_leaf ', n, 'is', precision_score(y_test,y_pred).round(2))
    print('recall for max_leaf ', n, 'is', recall_score(y_test,y_pred).round(2))
    

max_leaf(None)
max_leaf(2)
max_leaf(5)
max_leaf(10)
max_leaf(20)

precision for max_leaf  None is 0.76
recall for max_leaf  None is 0.82
precision for max_leaf  2 is 0.87
recall for max_leaf  2 is 0.39
precision for max_leaf  5 is 0.85
recall for max_leaf  5 is 0.57
precision for max_leaf  10 is 0.8
recall for max_leaf  10 is 0.71
precision for max_leaf  20 is 0.8
recall for max_leaf  20 is 0.71


In [77]:
def min_imp(n):
    dt = DecisionTreeClassifier(criterion='entropy',
                                max_depth=8, 
                                random_state=42, 
                                min_samples_split=60,
                                min_samples_leaf=0.01,
                                min_weight_fraction_leaf=0.001,
                                max_features=None,
                                max_leaf_nodes=None,
                                min_impurity_decrease=n)
    dt.fit(X_train,y_train)
    y_pred = dt.predict(X_test)
    print('precision for min_imp ', n, 'is', precision_score(y_test,y_pred).round(2))
    print('recall for min_imp ', n, 'is', recall_score(y_test,y_pred).round(2))
    

min_imp(0.003)
min_imp(0.01)
min_imp(0.001)
min_imp(0)
min_imp(0.005)
min_imp(0.04)

precision for min_imp  0.003 is 0.76
recall for min_imp  0.003 is 0.82
precision for min_imp  0.01 is 0.76
recall for min_imp  0.01 is 0.82
precision for min_imp  0.001 is 0.76
recall for min_imp  0.001 is 0.82
precision for min_imp  0 is 0.76
recall for min_imp  0 is 0.82
precision for min_imp  0.005 is 0.76
recall for min_imp  0.005 is 0.82
precision for min_imp  0.04 is 0.65
recall for min_imp  0.04 is 0.8


In [78]:
def class_w(n):
    dt = DecisionTreeClassifier(criterion='entropy',
                                max_depth=8, 
                                random_state=42, 
                                min_samples_split=60,
                                min_samples_leaf=0.01,
                                min_weight_fraction_leaf=0.001,
                                max_features=None,
                                max_leaf_nodes=None,
                                min_impurity_decrease=0.003,
                                class_weight=n)
    dt.fit(X_train,y_train)
    y_pred = dt.predict(X_test)
    print('precision for class_w ', n, 'is', precision_score(y_test,y_pred).round(2))
    print('recall for class_w ', n, 'is', recall_score(y_test,y_pred).round(2))
    

class_w('balanced')
class_w(None)
class_w({0: 1, 1: 1})
class_w({0: 1, 1: 5})
class_w({0: 5, 1: 10})
class_w({0: 5, 1: 7})


precision for class_w  balanced is 0.77
recall for class_w  balanced is 0.71
precision for class_w  None is 0.76
recall for class_w  None is 0.82
precision for class_w  {0: 1, 1: 1} is 0.76
recall for class_w  {0: 1, 1: 1} is 0.82
precision for class_w  {0: 1, 1: 5} is 0.52
recall for class_w  {0: 1, 1: 5} is 0.92
precision for class_w  {0: 5, 1: 10} is 0.68
recall for class_w  {0: 5, 1: 10} is 0.8
precision for class_w  {0: 5, 1: 7} is 0.66
recall for class_w  {0: 5, 1: 7} is 0.75


In [79]:
dt = DecisionTreeClassifier(criterion='entropy',
                            max_depth=8, 
                                random_state=42, 
                                min_samples_split=60,
                                min_samples_leaf=0.01,
                                min_weight_fraction_leaf=0.001,
                                max_features=None,
                                max_leaf_nodes=None,
                                min_impurity_decrease=0.003,
                                class_weight={0: 1, 1: 1})
dt.fit(X_train,y_train)
y_pred = dt.predict(X_test)
print('precision for is', precision_score(y_test,y_pred).round(2))
print('recall for is', recall_score(y_test,y_pred).round(2))

precision for is 0.76
recall for is 0.82


In [80]:
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.84      0.79      0.82        62
           1       0.76      0.82      0.79        51

    accuracy                           0.81       113
   macro avg       0.80      0.81      0.80       113
weighted avg       0.81      0.81      0.81       113



Using RandomizedSearchCV to try to find the best hyperparameters to use.

In [139]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

dtc = DecisionTreeClassifier(random_state=42)
param_grid = [{'max_depth': range(2,100), 'max_features': randint(1,9), 'min_samples_split': randint(2,100),
                 'min_samples_leaf': randint(1, 9), 'criterion':['gini','entropy']}]
cv = RandomizedSearchCV(dtc,param_grid, n_iter=55, cv=5, n_jobs=-1)
cv.fit(X_train, y_train)
print(cv.best_params_)

{'criterion': 'gini', 'max_depth': 74, 'max_features': 7, 'min_samples_leaf': 2, 'min_samples_split': 54}


In [140]:
dt = DecisionTreeClassifier(criterion='gini',
                            max_depth=74, 
                                random_state=42, 
                                min_samples_split=54,
                                min_samples_leaf=2,
                                max_features=7)
dt.fit(X_train,y_train)
y_pred = dt.predict(X_test)
print('precision for is', precision_score(y_test,y_pred).round(2))
print('recall for is', recall_score(y_test,y_pred).round(2))

precision for is 0.73
recall for is 0.8


In [141]:
y_pred = dt.predict(X_test)
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.82      0.76      0.79        74
           1       0.73      0.80      0.77        61

    accuracy                           0.78       135
   macro avg       0.78      0.78      0.78       135
weighted avg       0.78      0.78      0.78       135



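I adjusted the parameters one at a time and could not get recall above .82 without precision dropping below .75 (for example, class_weight={0: 1, 1: 5} gave recall .92 but precision .52). RandomizedSearchCV did not help either: its best parameters gave recall .80, precision .73 and accuracy .78. Its report shows a support of 135, so it was evaluated on a 135-row test set rather than the 113-row test set used for the other decision tree results.

Final decision tree: recall .82, precision .76, accuracy .81.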
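
Unless `scoring` is set, `RandomizedSearchCV` ranks candidates by the estimator's default score, which is accuracy for classifiers. Since the target here is recall, one option is to search on recall directly and to fix `random_state` so the sampled candidates are repeatable. This is a sketch under assumptions: the data is synthetic and the ranges in `param_dist` are illustrative, not tuned values.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption); swap in the notebook's X and y
X_demo, y_demo = make_classification(n_samples=449, n_features=37,
                                     n_informative=8, random_state=42)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.25,
                                      random_state=42, stratify=y_demo)

param_dist = {
    "max_depth": randint(2, 20),
    "min_samples_split": randint(2, 100),
    "min_samples_leaf": randint(1, 9),
    "criterion": ["gini", "entropy"],
}

search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_dist,
    n_iter=20,
    cv=5,
    scoring="recall",   # rank candidates by recall on class 1 instead of accuracy
    random_state=42,    # make the sampled candidates reproducible
)
search.fit(Xtr, ytr)

print(search.best_params_)
print("mean cross-validated recall:", round(search.best_score_, 2))
```

Optimizing recall alone can pick a model with poor precision, so the precision of the chosen model still needs checking on the test set.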

### 4.	Create a random forest model tuned to the best of your abilities. Explain how you tuned it.


https://arxiv.org/pdf/1810.07748.pdf

In [81]:
X = df.drop(['Class_code'],axis=1)
y = df['Class_code']

# Split into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42, stratify=y)

#Standardize
sc= StandardScaler()
X_train=sc.fit_transform(X_train)
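# Reuse the training-set statistics; fit_transform here would rescale the test set with its own mean and std
X_test=sc.transform(X_test)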
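
Scaling is normally fitted on the training split only and then applied unchanged to the test split. A pipeline handles that bookkeeping automatically. The sketch below assumes synthetic data and a smaller forest than the notebook uses; tree ensembles generally do not need standardized inputs, so the scaler is kept only to mirror the cell above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption); swap in the notebook's X and y
X_demo, y_demo = make_classification(n_samples=449, n_features=37,
                                     n_informative=8, random_state=42)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.3,
                                      random_state=42, stratify=y_demo)

pipe = make_pipeline(StandardScaler(),
                     RandomForestClassifier(n_estimators=50, random_state=42))
pipe.fit(Xtr, ytr)          # the scaler's mean and std come from the training rows only
acc = pipe.score(Xte, yte)  # test rows are transformed with those same statistics

print(round(acc, 3))
```

The pipeline can also be passed to `RandomizedSearchCV`, in which case the scaler is refitted inside each cross-validation fold.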

In [82]:
from sklearn.ensemble import RandomForestClassifier
#estimator = model
rf = RandomForestClassifier(n_estimators=200,random_state=42)

rf = rf.fit(X_train, y_train)
rf.score(X_test, y_test)

0.837037037037037

In [83]:
y_pred = rf.predict(X_test)
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.80      0.93      0.86        74
           1       0.90      0.72      0.80        61

    accuracy                           0.84       135
   macro avg       0.85      0.83      0.83       135
weighted avg       0.85      0.84      0.83       135



In [84]:
def n_est(n):
    rf = RandomForestClassifier(n_estimators=n,random_state=42)
    rf = rf.fit(X_train, y_train)
    predictions = rf.predict(X_test)
    print('precision for estimate ', n, 'is', precision_score(y_test,predictions).round(2))
    print('recall for estimate ', n, 'is', recall_score(y_test,predictions).round(2))
    
n_est(100)
n_est(200)
n_est(500)
n_est(800)

precision for estimate  100 is 0.92
recall for estimate  100 is 0.74
precision for estimate  200 is 0.9
recall for estimate  200 is 0.72
precision for estimate  500 is 0.87
recall for estimate  500 is 0.75
precision for estimate  800 is 0.87
recall for estimate  800 is 0.75


In [85]:
def max_d(d):
    rf = RandomForestClassifier(n_estimators=500,
                                max_depth=d,
                                random_state=42)
    rf = rf.fit(X_train, y_train)
    predictions = rf.predict(X_test)
    print('precision for max_depth ', d, 'is', precision_score(y_test,predictions).round(2))
    print('recall for max_depth ', d, 'is', recall_score(y_test,predictions).round(2))
    
max_d(2)
max_d(5)
max_d(10)
max_d(20)
max_d(50)
max_d(100)
max_d(200)
max_d(250)
max_d(500)

precision for max_depth  2 is 0.94
recall for max_depth  2 is 0.48
precision for max_depth  5 is 0.87
recall for max_depth  5 is 0.67
precision for max_depth  10 is 0.88
recall for max_depth  10 is 0.75
precision for max_depth  20 is 0.87
recall for max_depth  20 is 0.75
precision for max_depth  50 is 0.87
recall for max_depth  50 is 0.75
precision for max_depth  100 is 0.87
recall for max_depth  100 is 0.75
precision for max_depth  200 is 0.87
recall for max_depth  200 is 0.75
precision for max_depth  250 is 0.87
recall for max_depth  250 is 0.75
precision for max_depth  500 is 0.87
recall for max_depth  500 is 0.75


In [86]:
def min_sam(s):
    rf = RandomForestClassifier(n_estimators=500,
                                max_depth=10,
                                min_samples_split=s,
                                random_state=42)
    rf = rf.fit(X_train, y_train)
    predictions = rf.predict(X_test)
    print('precision for min_samples_split ', s, 'is', precision_score(y_test,predictions).round(2))
    print('recall for min_samples_split ', s, 'is', recall_score(y_test,predictions).round(2))
    
min_sam(100)
min_sam(50)
min_sam(10)
min_sam(5)
min_sam(2)
min_sam(0.2)
min_sam(0.5)

precision for min_samples_split  100 is 0.88
recall for min_samples_split  100 is 0.69
precision for min_samples_split  50 is 0.87
recall for min_samples_split  50 is 0.77
precision for min_samples_split  10 is 0.87
recall for min_samples_split  10 is 0.75
precision for min_samples_split  5 is 0.87
recall for min_samples_split  5 is 0.74
precision for min_samples_split  2 is 0.88
recall for min_samples_split  2 is 0.75
precision for min_samples_split  0.2 is 0.89
recall for min_samples_split  0.2 is 0.77
precision for min_samples_split  0.5 is 0.91
recall for min_samples_split  0.5 is 0.51


In [87]:
def min_sam_leaf(l):
    rf = RandomForestClassifier(n_estimators=500,
                                max_depth=10,
                                min_samples_split=0.2,
                                min_samples_leaf=l,
                                random_state=42)
    rf = rf.fit(X_train, y_train)
    predictions = rf.predict(X_test)
    print('precision for min_samples_leaf ', l, 'is', precision_score(y_test,predictions).round(2))
    print('recall for min_samples_leaf ', l, 'is', recall_score(y_test,predictions).round(2))
    
min_sam_leaf(50)
min_sam_leaf(30)
min_sam_leaf(20)
min_sam_leaf(5)
min_sam_leaf(2)
min_sam_leaf(1)

precision for min_samples_leaf  50 is 0.93
recall for min_samples_leaf  50 is 0.44
precision for min_samples_leaf  30 is 0.87
recall for min_samples_leaf  30 is 0.56
precision for min_samples_leaf  20 is 0.89
recall for min_samples_leaf  20 is 0.64
precision for min_samples_leaf  5 is 0.88
recall for min_samples_leaf  5 is 0.74
precision for min_samples_leaf  2 is 0.89
recall for min_samples_leaf  2 is 0.77
precision for min_samples_leaf  1 is 0.89
recall for min_samples_leaf  1 is 0.77


In [88]:
def min_weight(w):
    rf = RandomForestClassifier(n_estimators=500,
                                max_depth=10,
                                min_samples_split=0.2,
                                min_samples_leaf=2,
                                min_weight_fraction_leaf=w,
                                random_state=42)
    rf = rf.fit(X_train, y_train)
    predictions = rf.predict(X_test)
    print('precision for min_weight_fraction_leaf ', w, 'is', precision_score(y_test,predictions).round(2))
    print('recall for min_weight_fraction_leaf ', w, 'is', recall_score(y_test,predictions).round(2))
    
min_weight(0.05)
min_weight(0.04)
min_weight(0.03)
min_weight(0.02)
min_weight(0.01)
min_weight(0.005)
min_weight(0.001)
min_weight(0.003)
min_weight(0.0)

precision for min_weight_fraction_leaf  0.05 is 0.85
recall for min_weight_fraction_leaf  0.05 is 0.67
precision for min_weight_fraction_leaf  0.04 is 0.85
recall for min_weight_fraction_leaf  0.04 is 0.67
precision for min_weight_fraction_leaf  0.03 is 0.88
recall for min_weight_fraction_leaf  0.03 is 0.7
precision for min_weight_fraction_leaf  0.02 is 0.88
recall for min_weight_fraction_leaf  0.02 is 0.74
precision for min_weight_fraction_leaf  0.01 is 0.88
recall for min_weight_fraction_leaf  0.01 is 0.75
precision for min_weight_fraction_leaf  0.005 is 0.89
recall for min_weight_fraction_leaf  0.005 is 0.77
precision for min_weight_fraction_leaf  0.001 is 0.89
recall for min_weight_fraction_leaf  0.001 is 0.77
precision for min_weight_fraction_leaf  0.003 is 0.89
recall for min_weight_fraction_leaf  0.003 is 0.77
precision for min_weight_fraction_leaf  0.0 is 0.89
recall for min_weight_fraction_leaf  0.0 is 0.77


In [89]:
def max_nodes(m):
    rf = RandomForestClassifier(n_estimators=500,
                                max_depth=10,
                                min_samples_split=0.2,
                                min_samples_leaf=2,
                                min_weight_fraction_leaf=0.001,
                                max_leaf_nodes=m,
                                random_state=42)
    rf = rf.fit(X_train, y_train)
    predictions = rf.predict(X_test)
    print('precision for max_leaf_nodes ', m, 'is', precision_score(y_test,predictions).round(2))
    print('recall for max_leaf_nodes ', m, 'is', recall_score(y_test,predictions).round(2))

max_nodes(2)
max_nodes(5)
max_nodes(10)
max_nodes(20)
max_nodes(100)
max_nodes(250)
max_nodes(500)
max_nodes(None)

precision for max_leaf_nodes  2 is 0.96
recall for max_leaf_nodes  2 is 0.41
precision for max_leaf_nodes  5 is 0.86
recall for max_leaf_nodes  5 is 0.62
precision for max_leaf_nodes  10 is 0.88
recall for max_leaf_nodes  10 is 0.75
precision for max_leaf_nodes  20 is 0.88
recall for max_leaf_nodes  20 is 0.75
precision for max_leaf_nodes  100 is 0.88
recall for max_leaf_nodes  100 is 0.75
precision for max_leaf_nodes  250 is 0.88
recall for max_leaf_nodes  250 is 0.75
precision for max_leaf_nodes  500 is 0.88
recall for max_leaf_nodes  500 is 0.75
precision for max_leaf_nodes  None is 0.89
recall for max_leaf_nodes  None is 0.77


In [90]:
def min_imp(i):
    rf = RandomForestClassifier(n_estimators=500,
                                max_depth=10,
                                min_samples_split=0.2,
                                min_samples_leaf=2,
                                min_weight_fraction_leaf=0.001,
                                max_leaf_nodes=None,
                                min_impurity_decrease=i,
                                random_state=42)
    rf = rf.fit(X_train, y_train)
    predictions = rf.predict(X_test)
    print('precision for min_impurity_decrease ', i, 'is', precision_score(y_test,predictions).round(2))
    print('recall for min_impurity_decrease ', i, 'is', recall_score(y_test,predictions).round(2))
    
min_imp(0.03)
min_imp(0.02)
min_imp(0.01)
min_imp(0.1)
min_imp(0.001)
min_imp(0.003)
min_imp(0.0)

precision for min_impurity_decrease  0.03 is 0.89
recall for min_impurity_decrease  0.03 is 0.54
precision for min_impurity_decrease  0.02 is 0.86
recall for min_impurity_decrease  0.02 is 0.62
precision for min_impurity_decrease  0.01 is 0.88
recall for min_impurity_decrease  0.01 is 0.7




precision for min_impurity_decrease  0.1 is 0.0
recall for min_impurity_decrease  0.1 is 0.0
precision for min_impurity_decrease  0.001 is 0.89
recall for min_impurity_decrease  0.001 is 0.77
precision for min_impurity_decrease  0.003 is 0.88
recall for min_impurity_decrease  0.003 is 0.75
precision for min_impurity_decrease  0.0 is 0.89
recall for min_impurity_decrease  0.0 is 0.77


In [91]:
rf = RandomForestClassifier(n_estimators=200,
                                max_depth=10,
                                min_samples_split=2,
                                min_samples_leaf=2,
                                min_weight_fraction_leaf=0.001,
                                max_leaf_nodes=None,
                                min_impurity_decrease=0.0,
                                random_state=42)
rf = rf.fit(X_train, y_train)
predictions = rf.predict(X_test)
print('precision is', precision_score(y_test,predictions).round(2))
print('recall is', recall_score(y_test,predictions).round(2))

precision is 0.9
recall is 0.75


In [92]:
y_pred = rf.predict(X_test)
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.82      0.93      0.87        74
           1       0.90      0.75      0.82        61

    accuracy                           0.85       135
   macro avg       0.86      0.84      0.85       135
weighted avg       0.86      0.85      0.85       135



Using RandomizedSearchCV to try to find the best hyperparameters to use.

In [123]:
from sklearn.model_selection import RandomizedSearchCV
rfc = RandomForestClassifier(random_state=42)
param_grid1 = [{'n_estimators': range(10,800),'max_depth': range(2,100), 
                'max_leaf_nodes': range(2,100), 'min_samples_leaf': range(1, 50), 
                'bootstrap':[True, False]}]
cv = RandomizedSearchCV(rfc,param_grid1, n_iter=72, cv=5, n_jobs=-1)
cv.fit(X_train, y_train)
print(cv.best_params_)

{'n_estimators': 669, 'min_samples_leaf': 4, 'max_leaf_nodes': 91, 'max_depth': 74, 'bootstrap': False}


In [124]:
rf = RandomForestClassifier(n_estimators=669,
                                max_depth=74,
                                min_samples_leaf=4,
                                max_leaf_nodes=91,
                                random_state=42,
                                min_samples_split=2,
                                min_weight_fraction_leaf=0.001,
                                min_impurity_decrease=0.0,
                                bootstrap=False)
rf = rf.fit(X_train, y_train)
predictions = rf.predict(X_test)
print('precision is', precision_score(y_test,predictions).round(2))
print('recall is', recall_score(y_test,predictions).round(2))

precision is 0.88
recall is 0.7


In [125]:
y_pred = rf.predict(X_test)
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.79      0.92      0.85        74
           1       0.88      0.70      0.78        61

    accuracy                           0.82       135
   macro avg       0.83      0.81      0.82       135
weighted avg       0.83      0.82      0.82       135



I tuned the parameters one at a time; the best result was recall .75, precision .90 and accuracy .85. RandomizedSearchCV did no better than manual tuning (recall .70, precision .88, accuracy .82).
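
With precision at .90 and recall at .75, there may be room to trade some precision for recall. One knob not explored above is the decision threshold: `predict` uses 0.5, but `predict_proba` allows a lower cut-off. The sketch below is an illustration on synthetic data, not a result for the arrhythmia set, and the thresholds are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption); swap in the notebook's X and y
X_demo, y_demo = make_classification(n_samples=449, n_features=37,
                                     n_informative=8, random_state=42)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.3,
                                      random_state=42, stratify=y_demo)

rf_demo = RandomForestClassifier(n_estimators=200, random_state=42).fit(Xtr, ytr)
proba = rf_demo.predict_proba(Xte)[:, 1]   # estimated probability of class 1

results = []
for threshold in (0.5, 0.4, 0.3):
    pred = (proba >= threshold).astype(int)   # lower threshold flags more rows as class 1
    p = precision_score(yte, pred, zero_division=0)
    r = recall_score(yte, pred)
    results.append((threshold, p, r))
    print(f"threshold {threshold}: precision {p:.2f}, recall {r:.2f}")
```

Lowering the threshold can only keep or raise recall, usually at some cost in precision. The threshold should be chosen on a validation fold rather than on the test set, otherwise the reported test score is optimistic.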

### 5.	Create an xgboost model tuned to the best of your abilities. Explain how you tuned it. 


https://machinelearningmastery.com/gentle-introduction-xgboost-applied-machine-learning/

In [148]:
#XGBoost
from xgboost import XGBClassifier

X = df.drop(['Class_code'],axis=1)
y = df['Class_code']

# Split into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42, stratify=y)

#Standardize
sc= StandardScaler()
X_train=sc.fit_transform(X_train)
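# Reuse the training-set statistics; fit_transform here would rescale the test set with its own mean and std
X_test=sc.transform(X_test)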

xgb = XGBClassifier(use_label_encoder=False,max_depth=10)
xgb.fit(X_train, y_train)

y_pred = xgb.predict(X_test)
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.80      0.92      0.86        74
           1       0.88      0.72      0.79        61

    accuracy                           0.83       135
   macro avg       0.84      0.82      0.82       135
weighted avg       0.84      0.83      0.83       135



In [149]:
xgb = XGBClassifier(use_label_encoder=False,max_depth=10)
xgb

XGBClassifier(base_score=None, booster=None, colsample_bylevel=None,
              colsample_bynode=None, colsample_bytree=None,
              enable_categorical=False, gamma=None, gpu_id=None,
              importance_type=None, interaction_constraints=None,
              learning_rate=None, max_delta_step=None, max_depth=10,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              n_estimators=100, n_jobs=None, num_parallel_tree=None,
              predictor=None, random_state=None, reg_alpha=None,
              reg_lambda=None, scale_pos_weight=None, subsample=None,
              tree_method=None, use_label_encoder=False,
              validate_parameters=None, verbosity=None)

In [150]:
xgb = XGBClassifier(use_label_encoder=False, max_depth=10, 
                    base_score=0.5, colsample_bylevel=1, learning_rate=0.1,
                    colsample_bynode=1, colsample_bytree=1, 
                    max_delta_step=0, min_child_weight=2,
                    n_estimators=100, n_jobs=0, nthread=5, 
                    random_state=42, reg_alpha=0, reg_lambda=1,scale_pos_weight=1)

xgb.fit(X_train, y_train)

y_pred = xgb.predict(X_test)
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.82      0.91      0.86        74
           1       0.87      0.75      0.81        61

    accuracy                           0.84       135
   macro avg       0.84      0.83      0.83       135
weighted avg       0.84      0.84      0.84       135



In [151]:
#Hyperparameter optimization  using RandomizedSearchCV
from sklearn.model_selection import RandomizedSearchCV
import xgboost

classifier = xgboost.XGBClassifier()

In [152]:
params = {
    'learning_rate': [0.05, 0.10, 0.15, 0.20, 0.25, 0.30],
    'max_depth': [3, 4, 5, 6, 8, 10, 12, 15],
    'min_child_weight': [1, 3, 5, 7, 9],
    'gamma': [0.0, 0.1, 0.2, 0.3, 0.4],
    'colsample_bytree': [0.3, 0.4, 0.5, 0.7]
}


In [153]:
rs_model = RandomizedSearchCV(classifier,params, n_iter=5, n_jobs=-1, cv=5)

In [154]:
rs_model.fit(X_train, y_train)
rs_model.best_estimator_





XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=0.5,
              enable_categorical=False, gamma=0.0, gpu_id=-1,
              importance_type=None, interaction_constraints='',
              learning_rate=0.2, max_delta_step=0, max_depth=5,
              min_child_weight=3, missing=nan, monotone_constraints='()',
              n_estimators=100, n_jobs=12, num_parallel_tree=1,
              predictor='auto', random_state=0, reg_alpha=0, reg_lambda=1,
              scale_pos_weight=1, subsample=1, tree_method='exact',
              validate_parameters=1, verbosity=None)

In [156]:
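# Refit with parameters copied from a RandomizedSearchCV run. These differ from the
# best_estimator_ printed above (colsample_bytree, gamma, learning_rate, max_depth),
# presumably because the unseeded search gave a different result on another run.
xgb = XGBClassifier(use_label_encoder=False, base_score=0.5, booster='gbtree', colsample_bylevel=1,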
              colsample_bynode=1, colsample_bytree=0.3,
              enable_categorical=False, gamma=0.4, gpu_id=-1,
              importance_type=None, interaction_constraints='',
              learning_rate=0.1, max_delta_step=0, max_depth=8,
              min_child_weight=3, missing=np.nan, monotone_constraints='()',
              n_estimators=100, n_jobs=12, num_parallel_tree=1,
              predictor='auto', random_state=0, reg_alpha=0, reg_lambda=1,
              scale_pos_weight=1, subsample=1, tree_method='exact',
              validate_parameters=1, verbosity=None)

xgb.fit(X_train, y_train)

y_pred = xgb.predict(X_test)
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.78      0.93      0.85        74
           1       0.89      0.69      0.78        61

    accuracy                           0.82       135
   macro avg       0.84      0.81      0.81       135
weighted avg       0.83      0.82      0.82       135



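After changing the parameters by hand many times, I ended with a recall of .75, a precision of .87, and an accuracy of .84 (recall and precision are for class 1). With the parameters from RandomizedSearchCV, recall was .69, precision was .89, and accuracy was .82, so the search did not improve on the manually tuned model for recall.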
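As a sanity check on the two classification reports above, recall, precision, and accuracy can be recomputed from confusion-matrix counts. The counts below are not printed anywhere in the notebook: they are back-calculated from the rounded report values and the supports (74 and 61), so treat them as an assumption. `binary_metrics` is a hypothetical helper name used only for this sketch.

```python
def binary_metrics(tp, fn, tn, fp):
    """Return (recall, precision, accuracy) for the positive class."""
    recall = tp / (tp + fn)            # share of actual positives that were found
    precision = tp / (tp + fp)         # share of predicted positives that were right
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return recall, precision, accuracy


# Counts back-calculated from the rounded reports (assumption, not notebook output).
manual_xgb = binary_metrics(tp=46, fn=15, tn=67, fp=7)
searched_xgb = binary_metrics(tp=42, fn=19, tn=69, fp=5)

for name, scores in [("manual", manual_xgb), ("random search", searched_xgb)]:
    recall, precision, accuracy = scores
    print(f"{name}: recall={recall:.2f} precision={precision:.2f} accuracy={accuracy:.2f}")
```

Under these counts the manually tuned model misses 15 of the 61 positive cases, while the searched model misses 19, which is why its recall is lower even though its precision is slightly higher.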

### 6.	Which model performed best? What is your performance metric? Why? 

I chose recall as my performance metric because, when it comes to heart health, a false negative (a missed condition) is more costly than a false positive (a false alarm). Judged by recall, the DecisionTreeClassifier performed best.

Recall - .82, precision - .76, accuracy - .81 for DecisionTree.

Recall - .75, precision - .90, accuracy - .85 for RandomForest.

Recall - .75, precision - .87, accuracy - .84 for XGBoost.
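Because recall is the metric of interest, one option not tried above is to have the random search optimize recall directly and to reuse the fitted `best_estimator_` rather than retyping its parameters. The sketch below is a minimal illustration under stated assumptions: it uses synthetic data from `make_classification` instead of the notebook's dataset, and scikit-learn's `GradientBoostingClassifier` as a stand-in for `XGBClassifier` so that it runs without xgboost installed. The parameter grid is illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in data (assumption): not the dataset used in this notebook.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

# Illustrative grid; the values are placeholders, not tuned recommendations.
param_grid = {
    "learning_rate": [0.05, 0.10, 0.20],
    "max_depth": [2, 3, 4],
    "subsample": [0.5, 0.7, 1.0],
}

search = RandomizedSearchCV(
    GradientBoostingClassifier(n_estimators=50, random_state=0),
    param_grid,
    n_iter=5,
    scoring="recall",   # rank candidates by recall instead of the default accuracy
    cv=5,
    random_state=42,    # seed the sampling so reruns pick the same candidates
)
search.fit(X_tr, y_tr)

# Reuse the refitted best model directly instead of copying its parameters by hand.
best_model = search.best_estimator_
test_recall = recall_score(y_te, best_model.predict(X_te))
print(search.best_params_, round(test_recall, 2))
```

Seeding the search makes the selected parameters reproducible across runs, and scoring on recall aligns the search objective with the metric used to compare the models.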

![ML_Tree.PNG](attachment:ML_Tree.PNG)