# Week7 - Decision Tree Lab

* Train-test split
* Train a decision tree model
* Train a random forest model
* Evaluate the models
* Explain findings

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

df = pd.read_csv('https://raw.githubusercontent.com/msaricaumbc/DS_data/master/ds602/log_reg/employee-turnover-balanced.csv')
y = df['left_company']
X = df.iloc[:, 1:]

In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 19 columns):
 #   Column                         Non-Null Count  Dtype 
---  ------                         --------------  ----- 
 0   left_company                   1000 non-null   object
 1   age                            1000 non-null   int64 
 2   frequency_of_travel            1000 non-null   object
 3   department                     1000 non-null   object
 4   commuting_distance             1000 non-null   int64 
 5   education                      1000 non-null   int64 
 6   satisfaction_with_environment  1000 non-null   int64 
 7   gender                         1000 non-null   object
 8   seniority_level                1000 non-null   int64 
 9   position                       1000 non-null   object
 10  satisfaction_with_job          1000 non-null   int64 
 11  married_or_single              1000 non-null   object
 12  last_raise_pct                 1000 non-null   int64 
 13  last

In [3]:

numerical_vars = ['age', 'education', 'commuting_distance', 'satisfaction_with_environment', 
        'seniority_level', 'satisfaction_with_job', 'last_raise_pct', 'last_performance_rating',
        'total_years_working', 'years_at_company', 'years_in_current_job', 'years_since_last_promotion','years_with_current_supervisor']

categorical_Vars = ['frequency_of_travel', 'department','gender','position', 'married_or_single']

In [4]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

Train-test split

In [5]:

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=142)
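
A plain random split does not guarantee the same class ratio in both partitions (the classification reports further down show 108 "No" versus 92 "Yes" rows in the test set). One option, not used in this lab, is to pass `stratify` to `train_test_split`. The sketch below uses hypothetical toy arrays (`X_toy`, `y_toy`) rather than the employee data, so the numbers are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical balanced labels standing in for df['left_company']
y_toy = np.array(["No"] * 500 + ["Yes"] * 500)
X_toy = np.arange(1000).reshape(-1, 1)

# stratify=y_toy asks for the same class ratio in the train and test partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=142, stratify=y_toy
)

# Count how many rows of each class landed in the test partition
labels, test_counts = np.unique(y_te, return_counts=True)
for label, count in zip(labels, test_counts):
    print(f"{label}: {count}")
```

With stratification, a 50/50 dataset yields a 50/50 test set, which makes per-class metrics slightly easier to compare across models.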

In [6]:

# Create pipelines as before
numerical_pipeline = Pipeline([('imputer', SimpleImputer(strategy='median')), ('scaler', StandardScaler())])

categorical_pipeline = Pipeline([('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer([('numeric', numerical_pipeline, numerical_vars),('categorical', categorical_pipeline, categorical_Vars)])

# Verify the pipelines contain the expected columns
X_train_transformed = preprocessor.fit_transform(X_train)
print(f"Shape of numerical data after processing: {X_train_transformed[:, :len(numerical_vars)].shape}")
print(f"Shape of categorical data after processing: {X_train_transformed[:, len(numerical_vars):].shape}")

Shape of numerical data after processing: (800, 13)
Shape of categorical data after processing: (800, 20)


Train a decision tree model

In [7]:

pipeline = Pipeline([('preprocessor', preprocessor),('dt', DecisionTreeClassifier())])

# Fit pipeline on training data
pipeline.fit(X_train, y_train)

# Print accuracy on training and test data
print(f"Training accuracy: {pipeline.score(X_train, y_train):.3f}")
print(f"Test accuracy: {pipeline.score(X_test, y_test):.3f}")

Training accuracy: 1.000
Test accuracy: 0.750
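
A training accuracy of 1.000 next to a test accuracy of 0.750 suggests the unconstrained tree is memorizing the training set. One common response is to cap the tree's depth and watch how the two scores move. The sketch below is an illustration on synthetic data from `make_classification`, not on the employee dataset, and the depth values are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the employee dataset)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, n_informative=5, flip_y=0.1, random_state=0
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# max_depth=None grows the tree until every leaf is pure
depth_scores = {}
for depth in [None, 3, 5, 10]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(Xd_train, yd_train)
    depth_scores[depth] = (tree.score(Xd_train, yd_train), tree.score(Xd_test, yd_test))

for depth, (train_acc, test_acc) in depth_scores.items():
    print(f"max_depth={depth}: train={train_acc:.3f}, test={test_acc:.3f}")
```

In the lab itself, the same idea would mean passing `max_depth` to the `DecisionTreeClassifier` inside the pipeline. Which depth works best for the employee data would have to be checked empirically.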


Train a random forest model

In [8]:

rf_pipeline = Pipeline([('preprocessor', preprocessor),('rf', RandomForestClassifier())])

# Fit pipeline on training data
rf_pipeline.fit(X_train, y_train)

# Print accuracy on training and test data
print(f"Training accuracy: {rf_pipeline.score(X_train, y_train):.3f}")
print(f"Test accuracy: {rf_pipeline.score(X_test, y_test):.3f}")

Training accuracy: 1.000
Test accuracy: 0.870
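
The forest also reaches 1.000 on the training data, so training accuracy alone says little about how well it generalizes. Random forests offer a built-in estimate that needs no extra split: each tree is scored on the bootstrap rows it never saw (the out-of-bag samples). A minimal sketch on synthetic data, with `n_estimators=200` chosen arbitrarily:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption: not the employee dataset)
X_syn, y_syn = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=0
)

# oob_score=True scores each sample using only trees that did not train on it
forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
forest.fit(X_syn, y_syn)

print(f"Training accuracy: {forest.score(X_syn, y_syn):.3f}")
print(f"Out-of-bag accuracy: {forest.oob_score_:.3f}")
```

The out-of-bag figure is usually a more realistic guide than the training score and can be compared against the held-out test accuracy.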


Evaluate the models

In [9]:
from sklearn.metrics import classification_report

# Evaluate Decision Tree model
y_pred_dt = pipeline.predict(X_test)
print("Decision Tree Model Performance:")
print(classification_report(y_test, y_pred_dt))



Decision Tree Model Performance:
              precision    recall  f1-score   support

          No       0.90      0.60      0.72       108
         Yes       0.66      0.92      0.77        92

    accuracy                           0.75       200
   macro avg       0.78      0.76      0.75       200
weighted avg       0.79      0.75      0.75       200



In [10]:
# Evaluate Random Forest model
y_pred_rf = rf_pipeline.predict(X_test)
print("Random Forest Model Performance:")
print(classification_report(y_test, y_pred_rf))

Random Forest Model Performance:
              precision    recall  f1-score   support

          No       0.91      0.84      0.87       108
         Yes       0.83      0.90      0.86        92

    accuracy                           0.87       200
   macro avg       0.87      0.87      0.87       200
weighted avg       0.87      0.87      0.87       200
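
The precision and recall values in these reports come straight from the confusion matrix. The sketch below uses a small hand-made label set (hypothetical, eight rows) so the arithmetic can be followed by eye, treating "Yes" as the positive class.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels, chosen so the counts are easy to verify by hand
y_true = ["No", "No", "No", "No", "Yes", "Yes", "Yes", "Yes"]
y_hat = ["No", "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]

# Rows are true classes, columns are predicted classes, in the order given by labels
cm = confusion_matrix(y_true, y_hat, labels=["No", "Yes"])
tn, fp, fn, tp = cm.ravel()

precision_yes = tp / (tp + fp)  # of all predicted "Yes", how many were right
recall_yes = tp / (tp + fn)     # of all actual "Yes", how many were found

print(cm)
print(f"precision(Yes) = {precision_yes:.2f}, recall(Yes) = {recall_yes:.2f}")

# The manual values agree with scikit-learn's own metric functions
assert abs(precision_yes - precision_score(y_true, y_hat, pos_label="Yes")) < 1e-12
assert abs(recall_yes - recall_score(y_true, y_hat, pos_label="Yes")) < 1e-12
```

Here 3 of the 5 predicted "Yes" labels are correct (precision 0.60) and 3 of the 4 actual "Yes" rows are found (recall 0.75). The same calculation on `y_test` and `y_pred_dt` would show where the decision tree's errors are concentrated.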



Explain findings

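Based on the metrics above, the random forest outperforms the single decision tree on the test set: accuracy is 0.87 versus 0.75, and the macro-averaged precision, recall, and F1-score are all 0.87, versus 0.78, 0.76, and 0.75 for the decision tree. The one exception is recall on the "Yes" class, where the decision tree is slightly higher (0.92 versus 0.90), but it gets there with much lower precision (0.66 versus 0.83), meaning it flags many employees who did not leave. Both models reach a training accuracy of 1.000, so both overfit the training data, but the gap to test accuracy is smaller for the random forest (0.13) than for the decision tree (0.25). On this single train-test split, the random forest is the better model for this task.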
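
The comparison above rests on one 80/20 split, so part of the difference could be due to which rows happened to land in the test set. A common way to check is stratified k-fold cross-validation. The sketch below runs it on synthetic data from `make_classification` (an assumption for illustration). In the lab, the two pipelines and the real `X`, `y` would take the place of the bare estimators and synthetic arrays.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the employee dataset)
X_cv, y_cv = make_classification(
    n_samples=600, n_features=20, n_informative=5, random_state=0
)

# The same five stratified folds are reused for both models
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(random_state=0),
}

cv_results = {}
for name, model in models.items():
    scores = cross_val_score(model, X_cv, y_cv, cv=cv, scoring="accuracy")
    cv_results[name] = scores
    print(f"{name}: mean accuracy {scores.mean():.3f} (std {scores.std():.3f})")
```

If the random forest's mean accuracy stays ahead of the decision tree's by more than the fold-to-fold spread, the conclusion does not depend on one particular split.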