## Part 1: Preprocessing

In [1]:
# Import our dependencies
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np
from tensorflow.keras.models import Model
from tensorflow.keras import layers

#  Import and read the attrition data
attrition_df = pd.read_csv('https://static.bc-edx.com/ai/ail-v-1-0/m19/lms/datasets/attrition.csv')
attrition_df.head()

Unnamed: 0,Age,Attrition,BusinessTravel,Department,DistanceFromHome,Education,EducationField,EnvironmentSatisfaction,HourlyRate,JobInvolvement,...,PerformanceRating,RelationshipSatisfaction,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,Sales,1,2,Life Sciences,2,94,3,...,3,1,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,Research & Development,8,1,Life Sciences,3,61,2,...,4,4,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,Research & Development,2,2,Other,4,92,2,...,3,2,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,Research & Development,3,4,Life Sciences,4,56,3,...,3,3,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,Research & Development,2,1,Medical,1,40,3,...,3,4,1,6,3,3,2,2,2,2


In [2]:
# Determine the number of unique values in each column.
attrition_df.nunique().sort_values(ascending=False)

HourlyRate                  71
Age                         43
TotalWorkingYears           40
YearsAtCompany              37
DistanceFromHome            29
YearsInCurrentRole          19
YearsWithCurrManager        18
YearsSinceLastPromotion     16
PercentSalaryHike           15
NumCompaniesWorked          10
JobRole                      9
TrainingTimesLastYear        7
EducationField               6
JobLevel                     5
Education                    5
EnvironmentSatisfaction      4
JobInvolvement               4
JobSatisfaction              4
RelationshipSatisfaction     4
StockOptionLevel             4
WorkLifeBalance              4
Department                   3
BusinessTravel               3
MaritalStatus                3
OverTime                     2
Attrition                    2
PerformanceRating            2
dtype: int64

In [3]:
# Create y_df with the Attrition and Department columns
y_df = attrition_df[["Attrition", "Department"]]


In [4]:
# Create a list of at least 10 column names to use as X data
X_columns_to_include = [
    column_name
    for column_name in attrition_df.columns
    if column_name not in ("Attrition", "Department")
]


# Create X_df using your selected columns
X_df = attrition_df[X_columns_to_include]

# Show the data types for X_df
X_df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 25 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Age                       1470 non-null   int64 
 1   BusinessTravel            1470 non-null   object
 2   DistanceFromHome          1470 non-null   int64 
 3   Education                 1470 non-null   int64 
 4   EducationField            1470 non-null   object
 5   EnvironmentSatisfaction   1470 non-null   int64 
 6   HourlyRate                1470 non-null   int64 
 7   JobInvolvement            1470 non-null   int64 
 8   JobLevel                  1470 non-null   int64 
 9   JobRole                   1470 non-null   object
 10  JobSatisfaction           1470 non-null   int64 
 11  MaritalStatus             1470 non-null   object
 12  NumCompaniesWorked        1470 non-null   int64 
 13  OverTime                  1470 non-null   object
 14  PercentSalaryHike       

In [5]:
# Split the data into training and testing sets
# (train_test_split was already imported in the first cell)
X_train, X_test, y_train, y_test = train_test_split(
    X_df, y_df, test_size=0.2, random_state=42
)

In [6]:
# Convert your X data to numeric data types however you see fit
# Add new code cells as necessary
from sklearn.preprocessing import LabelEncoder

# Create an instance of the label encoder
le = LabelEncoder()

# List of categorical columns to be encoded
categorical_columns = [
    "OverTime",
    "BusinessTravel",
    "EducationField",
    "JobRole",
    "MaritalStatus",
]

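# Work on copies so the column assignments below do not risk
# pandas' SettingWithCopyWarning
X_train = X_train.copy()
X_test = X_test.copy()

# Fit the label encoder on the combined data from both training and test sets
# so that every category seen in either split receives a code
for column in categorical_columns:
    combined_data = list(X_train[column]) + list(X_test[column])
    le.fit(combined_data)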

    # Transform the training data
    X_train[column] = le.transform(X_train[column])

    # Transform the test data
    X_test[column] = le.transform(X_test[column])

# Print value counts for each encoded column in the training set
for column in categorical_columns:
    print(f"Value counts for {column} in training set:")
    print(X_train[column].value_counts())

# Print value counts for each encoded column in the test set
for column in categorical_columns:
    print(f"Value counts for {column} in test set:")
    print(X_test[column].value_counts())

Value counts for OverTime in training set:
0    837
1    339
Name: OverTime, dtype: int64
Value counts for BusinessTravel in training set:
2    835
1    228
0    113
Name: BusinessTravel, dtype: int64
Value counts for EducationField in training set:
1    491
3    369
2    124
5    101
4     69
0     22
Name: EducationField, dtype: int64
Value counts for JobRole in training set:
7    254
6    242
2    204
4    107
0    105
3     79
8     75
5     68
1     42
Name: JobRole, dtype: int64
Value counts for MaritalStatus in training set:
1    550
2    359
0    267
Name: MaritalStatus, dtype: int64
Value counts for OverTime in test set:
0    217
1     77
Name: OverTime, dtype: int64
Value counts for BusinessTravel in test set:
2    208
1     49
0     37
Name: BusinessTravel, dtype: int64
Value counts for EducationField in test set:
1    115
3     95
2     35
5     31
4     13
0      5
Name: EducationField, dtype: int64
Value counts for JobRole in test set:
7    72
2    55
6    50
4    38
0   

In [7]:
# Create a StandardScaler
scaler = StandardScaler()

# Fit the StandardScaler to the training data
X_scaler = scaler.fit(X_train)

# Scale the training and testing data
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)


In [9]:
pd.DataFrame(X_train_scaled, columns=X_train.columns).head()

Unnamed: 0,Age,BusinessTravel,DistanceFromHome,Education,EducationField,EnvironmentSatisfaction,HourlyRate,JobInvolvement,JobLevel,JobRole,...,PerformanceRating,RelationshipSatisfaction,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,-1.388559,0.589281,1.440396,-0.863356,2.085607,0.279706,-0.472832,-1.01234,-0.932274,-1.008801,...,-0.42929,-0.639822,2.547471,-1.167368,0.157319,0.357435,-0.974263,-0.888208,-0.67611,-1.142448
1,-2.040738,-2.463556,-0.522699,-0.863356,-0.930284,-0.639104,0.309374,0.389912,-0.932274,0.609133,...,-0.42929,1.211176,-0.945525,-1.423397,-0.613546,0.357435,-1.138573,-1.165051,-0.67611,-1.142448
2,-0.845077,0.589281,1.317703,-0.863356,-0.176312,1.198515,-1.059487,0.389912,-0.025447,1.013617,...,-0.42929,1.211176,0.218807,-0.143254,-0.613546,0.357435,-0.645643,-0.611364,-0.67611,-0.575084
3,0.241886,0.589281,0.336155,0.099933,0.577661,1.198515,-0.032841,0.389912,-0.025447,-0.199834,...,2.329427,0.285677,-0.945525,-0.527297,0.157319,0.357435,-0.317023,-0.057676,-0.355244,-1.142448
4,-0.627685,0.589281,1.317703,0.099933,-0.930284,-0.639104,1.09158,0.389912,-0.025447,-1.008801,...,-0.42929,-1.565321,0.218807,-0.143254,-0.613546,0.357435,0.504527,1.0497,-0.67611,-0.575084


In [13]:
# Create a OneHotEncoder for the Department column
from sklearn.preprocessing import OneHotEncoder
dept_encoder = OneHotEncoder(sparse_output=False)

# Fit the encoder to the training data
dept_encoder.fit(y_train['Department'].values.reshape(-1,1))

# Create two new variables by applying the encoder
# to the training and testing data
encoded_train_Department = dept_encoder.transform(y_train['Department'].values.reshape(-1,1))
encoded_test_Department = dept_encoder.transform(y_test['Department'].values.reshape(-1,1))

encoded_train_Department[:]



array([[0., 1., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       ...,
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

In [14]:
# Create a OneHotEncoder for the Attrition column
attr_encoder = OneHotEncoder(sparse_output=False)

# Fit the encoder to the training data
attr_encoder.fit(y_train['Attrition'].values.reshape(-1,1))

# Create two new variables by applying the encoder
# to the training and testing data
encoded_train_Attrition = attr_encoder.transform(y_train['Attrition'].values.reshape(-1,1))
encoded_test_Attrition = attr_encoder.transform(y_test['Attrition'].values.reshape(-1,1))

encoded_train_Attrition[:]


array([[1., 0.],
       [1., 0.],
       [1., 0.],
       ...,
       [0., 1.],
       [1., 0.],
       [1., 0.]])

## Create, Compile, and Train the Model

In [15]:
# Find the number of columns in the X training data
input_features = X_train_scaled.shape[1]

# Create the input layer
input_layer = layers.Input(shape=(input_features,))

# Create at least two shared layers
shared_layer1 = layers.Dense(64, activation='relu', name='Shared_1')(input_layer)
shared_layer2 = layers.Dense(32128, activation='relu', name='Shared_2')(shared_layer1)

In [16]:
# Create a branch for Department
# with a hidden layer and an output layer

# Create the hidden layer
branch_hidden_layer_1 = layers.Dense(32, activation='relu', name = 'hid_dept')(shared_layer2)

# Create the output layer
output_layer = layers.Dense(3, activation='softmax', name = 'out_dept')(branch_hidden_layer_1)


In [17]:
# Create a branch for Attrition
# with a hidden layer and an output layer

# Create the hidden layer
branch_hidden_layer_2 = layers.Dense(32, activation='relu', name = 'hid_attrition')(shared_layer2)

# Create the output layer
output_layer2 = layers.Dense(2, activation='sigmoid', name = 'out_attrition')(branch_hidden_layer_2)


In [18]:
# Create the model
model = Model(inputs=input_layer, outputs=[output_layer, output_layer2])

# Compile the model
model.compile(optimizer='adam',
              loss={'out_dept': 'categorical_crossentropy', 'out_attrition': 'categorical_crossentropy'},
              metrics={'out_dept':'accuracy', 'out_attrition':'accuracy'})

# Summarize the model
model.summary()

In [19]:
# Train the model
model.fit(X_train_scaled, {'out_dept': encoded_train_Department, 'out_attrition': encoded_train_Attrition}, epochs=100, shuffle=True, verbose=2)


Epoch 1/100
37/37 - 2s - 49ms/step - loss: 1.1065 - out_attrition_accuracy: 0.8206 - out_dept_accuracy: 0.7304
Epoch 2/100
37/37 - 1s - 16ms/step - loss: 0.7916 - out_attrition_accuracy: 0.8316 - out_dept_accuracy: 0.8376
Epoch 3/100
37/37 - 1s - 15ms/step - loss: 0.6945 - out_attrition_accuracy: 0.8316 - out_dept_accuracy: 0.8827
Epoch 4/100
37/37 - 1s - 15ms/step - loss: 0.5995 - out_attrition_accuracy: 0.8316 - out_dept_accuracy: 0.9039
Epoch 5/100
37/37 - 1s - 16ms/step - loss: 0.5296 - out_attrition_accuracy: 0.8469 - out_dept_accuracy: 0.9116
Epoch 6/100
37/37 - 1s - 16ms/step - loss: 0.4522 - out_attrition_accuracy: 0.8810 - out_dept_accuracy: 0.9260
Epoch 7/100
37/37 - 1s - 18ms/step - loss: 0.4020 - out_attrition_accuracy: 0.8869 - out_dept_accuracy: 0.9464
Epoch 8/100
37/37 - 1s - 18ms/step - loss: 0.3318 - out_attrition_accuracy: 0.9192 - out_dept_accuracy: 0.9600
Epoch 9/100
37/37 - 1s - 17ms/step - loss: 0.2809 - out_attrition_accuracy: 0.9141 - out_dept_accuracy: 0.9736
E

<keras.src.callbacks.history.History at 0x1d3d21b5330>

In [20]:
# Evaluate the model with the testing data
results_test = model.evaluate(X_test_scaled, {'out_dept': encoded_test_Department, 'out_attrition': encoded_test_Attrition}, verbose=2)
print(results_test)

10/10 - 0s - 15ms/step - loss: 2.6152 - out_attrition_accuracy: 0.8639 - out_dept_accuracy: 0.8878
[2.615182399749756, 0.8639456033706665, 0.8877550959587097]


In [21]:
# Print the accuracy for both department and attrition
print(f"Department predictions Accuracy: {results_test[2]}")
print(f"Attrition predictions Accuracy: {results_test[1]}")

Department predictions Accuracy: 0.8877550959587097
Attrition predictions Accuracy: 0.8639456033706665


# Summary

In the provided space below, briefly answer the following questions.

1. Is accuracy the best metric to use on this data? Why or why not?

2. What activation functions did you choose for your output layers, and why?

3. Can you name a few ways that this model might be improved?


1. No. The dataset is imbalanced: some classes dominate others (for example, "No" for attrition and "Research & Development" for department), which makes accuracy a deceptive metric. A trivial model that always predicts the majority classes would score well on accuracy yet have no practical value.

To gauge the effectiveness of our model, we must consider the specific business context. If our main objective is to identify employees likely to leave and implement retention strategies, then a model that minimizes false negatives (high recall) is critical. This means prioritizing the correct identification of potential attrition, even if it leads to some false positives (incorrectly predicting attrition for some employees).

Beyond accuracy, we'll focus on metrics like recall and precision, balancing their trade-off to ensure the model aligns with our business goals.
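To make the point about accuracy concrete, here is a minimal sketch in plain Python. The label counts (84 "No", 16 "Yes") are hypothetical and chosen only for illustration; they are not taken from the attrition dataset. The helper `precision_recall` is likewise an illustrative name. In practice, scikit-learn's `precision_score`, `recall_score`, or `classification_report` would report the same quantities.

```python
def precision_recall(y_true, y_pred, positive="Yes"):
    """Return (precision, recall) for the given positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical, imbalanced labels (illustrative only, not the real data)
y_true = ["No"] * 84 + ["Yes"] * 16

# A trivial baseline that always predicts the majority class
y_baseline = ["No"] * len(y_true)

baseline_accuracy = sum(t == p for t, p in zip(y_true, y_baseline)) / len(y_true)
baseline_precision, baseline_recall = precision_recall(y_true, y_baseline)

print(f"Accuracy: {baseline_accuracy:.2f}")
print(f"Recall for 'Yes': {baseline_recall:.2f}")
```

Under these assumed counts the baseline reaches high accuracy while its recall on the "Yes" class is zero, which is exactly the failure mode that accuracy alone hides.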

2. In my model's architecture, I employed different activation functions for the department and attrition output layers. Specifically, I chose the softmax function for the department output, as it adeptly distributes probabilities across multiple classes, ensuring their sum totals 1. This is ideal for multi-class classification tasks like predicting which department an employee might belong to.

Conversely, for the attrition output layer, I used the sigmoid function. Sigmoid suits binary classification: it maps each input to a value between 0 and 1, which can be read as the probability of a "yes/no" outcome, here whether an employee will leave the company. One caveat: because the Attrition target is one-hot encoded into two columns and trained with categorical_crossentropy, the two sigmoid outputs are not constrained to sum to 1. A two-unit softmax output, or a single sigmoid unit with binary_crossentropy, would be the more conventional pairing.

By matching each activation function to the nature of its output layer's task, I aimed to improve the model's overall predictive performance.
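The difference between the two activations can be seen in a small standalone sketch. The logit values below are made up for illustration, and the functions are plain-Python stand-ins for what the Keras layers compute, not the notebook's own code.

```python
import math


def softmax(logits):
    """Normalize a list of logits into probabilities that sum to 1."""
    largest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    """Squash a single logit into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical logits for a two-unit output layer
logits = [2.0, 0.5]

softmax_probs = softmax(logits)                # units compete with each other
sigmoid_probs = [sigmoid(z) for z in logits]   # each unit is scored on its own

print("softmax:", softmax_probs, "sum =", sum(softmax_probs))
print("sigmoid:", sigmoid_probs, "sum =", sum(sigmoid_probs))
```

Softmax couples the units so their outputs form one distribution, whereas sigmoid treats each unit independently, so its outputs need not sum to 1.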

3. Several avenues exist for enhancing the model's performance:

- Architecture Modifications: Experimenting with the model's structure by adding more layers or neurons could increase its capacity to learn complex patterns.
- Hyperparameter Optimization: Fine-tuning hyperparameters such as learning rate, batch size, the number of epochs, and optimizer choice can significantly impact model performance.
- Feature Engineering: Creating new features or transforming existing ones could uncover hidden relationships in the data and improve predictive power.
- Data Balancing: Addressing biases or imbalances in the dataset, perhaps through techniques like oversampling or undersampling, can help the model generalize better.

These are just a few potential strategies for improving the model. A systematic approach involving experimentation and careful analysis will be critical in determining the most effective enhancements.
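As one example of the data-balancing idea, class weights are a common alternative to resampling. The sketch below computes the "balanced" heuristic, n_samples / (n_classes * class_count), which is the same formula scikit-learn uses for `class_weight="balanced"`. The label counts are hypothetical and `balanced_class_weights` is an illustrative helper name. How the weights are then fed into a multi-output Keras model (for instance as per-sample weights) is left open here.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in counts.items()
    }


# Hypothetical, imbalanced labels (illustrative only, not the real data)
labels = ["No"] * 84 + ["Yes"] * 16

weights = balanced_class_weights(labels)
print(weights)
```

The minority class receives the larger weight, so errors on it would count for more during training.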