## Part 1: Preprocessing

In [5]:
# Import our dependencies
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np
from tensorflow.keras.models import Model
from tensorflow.keras import layers

# Import and read the attrition data
attrition_df = pd.read_csv('https://static.bc-edx.com/ai/ail-v-1-0/m19/lms/datasets/attrition.csv')

# Print the first few rows to confirm successful load
print("Data successfully loaded:")
print(attrition_df.head())


Data successfully loaded:
   Age Attrition     BusinessTravel              Department  DistanceFromHome  \
0   41       Yes      Travel_Rarely                   Sales                 1   
1   49        No  Travel_Frequently  Research & Development                 8   
2   37       Yes      Travel_Rarely  Research & Development                 2   
3   33        No  Travel_Frequently  Research & Development                 3   
4   27        No      Travel_Rarely  Research & Development                 2   

   Education EducationField  EnvironmentSatisfaction  HourlyRate  \
0          2  Life Sciences                        2          94   
1          1  Life Sciences                        3          61   
2          2          Other                        4          92   
3          4  Life Sciences                        4          56   
4          1        Medical                        1          40   

   JobInvolvement  ...  PerformanceRating RelationshipSatisfaction  \
0       

In [7]:
# Determine the number of unique values in each column
unique_values = attrition_df.nunique()

# Print the unique values for each column
print("Number of unique values in each column:")
print(unique_values)


Number of unique values in each column:
Age                         43
Attrition                    2
BusinessTravel               3
Department                   3
DistanceFromHome            29
Education                    5
EducationField               6
EnvironmentSatisfaction      4
HourlyRate                  71
JobInvolvement               4
JobLevel                     5
JobRole                      9
JobSatisfaction              4
MaritalStatus                3
NumCompaniesWorked          10
OverTime                     2
PercentSalaryHike           15
PerformanceRating            2
RelationshipSatisfaction     4
StockOptionLevel             4
TotalWorkingYears           40
TrainingTimesLastYear        7
WorkLifeBalance              4
YearsAtCompany              37
YearsInCurrentRole          19
YearsSinceLastPromotion     16
YearsWithCurrManager        18
dtype: int64


In [9]:
# Create y_df with the Attrition and Department columns
y_df = attrition_df[['Attrition', 'Department']]

# Print the first few rows to verify
print("y_df created with the Attrition and Department columns:")
print(y_df.head())


y_df created with the Attrition and Department columns:
  Attrition              Department
0       Yes                   Sales
1        No  Research & Development
2       Yes  Research & Development
3        No  Research & Development
4        No  Research & Development


In [25]:
# Create a list of at least 10 column names to use as X data
x_columns = [
    'Age', 'DistanceFromHome', 'Education', 'EnvironmentSatisfaction',
    'JobInvolvement', 'JobLevel', 'JobSatisfaction', 'NumCompaniesWorked',
    'PercentSalaryHike', 'OverTime'
]

# Verify that the selected columns exist in the dataset
missing_columns = [col for col in x_columns if col not in attrition_df.columns]
if missing_columns:
    print(f"Error: The following columns are missing from the dataset: {missing_columns}")
else:
    # Create X_df using your selected columns
    X_df = attrition_df[x_columns].copy()

    # Convert the 'OverTime' column to numeric values
    X_df['OverTime'] = X_df['OverTime'].map({'Yes': 1, 'No': 0})

    # Show the data types for X_df
    print("Data types for X_df after encoding:")
    print(X_df.dtypes)


Data types for X_df after encoding:
Age                        int64
DistanceFromHome           int64
Education                  int64
EnvironmentSatisfaction    int64
JobInvolvement             int64
JobLevel                   int64
JobSatisfaction            int64
NumCompaniesWorked         int64
PercentSalaryHike          int64
OverTime                   int64
dtype: object


In [27]:
# Split the data into training and testing sets
from sklearn.model_selection import train_test_split

# Split the data with 80% for training and 20% for testing
X_train, X_test, y_train, y_test = train_test_split(X_df, y_df, test_size=0.2, random_state=42)

# Print the shapes of the training and testing sets to verify
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")


X_train shape: (1176, 10)
X_test shape: (294, 10)
y_train shape: (1176, 2)
y_test shape: (294, 2)


In [None]:
# Convert your X data to numeric data types however you see fit
# Add new code cells as necessary


Unnamed: 0_level_0,count
OverTime,Unnamed: 1_level_1
No,797
Yes,305


In [29]:
# Convert your X data to numeric data types however you see fit
# Add new code cells as necessary

# Ensure all data in X_df is numeric
X_df = pd.get_dummies(X_df, drop_first=True)

# Verify the data types for X_df
print("Data types for X_df after conversion:")
print(X_df.dtypes)


Data types for X_df after conversion:
Age                        int64
DistanceFromHome           int64
Education                  int64
EnvironmentSatisfaction    int64
JobInvolvement             int64
JobLevel                   int64
JobSatisfaction            int64
NumCompaniesWorked         int64
PercentSalaryHike          int64
OverTime                   int64
dtype: object


In [33]:
from sklearn.preprocessing import OneHotEncoder

# Create a OneHotEncoder for the Department column
encoder = OneHotEncoder(sparse_output=False)

# Fit the encoder to the training data
encoder.fit(y_train[['Department']])

# Create two new variables by applying the encoder
# to the training and testing data
y_train_department_encoded = encoder.transform(y_train[['Department']])
y_test_department_encoded = encoder.transform(y_test[['Department']])

# Print after success
print("Department column successfully encoded.")
print("y_train_department_encoded shape:", y_train_department_encoded.shape)
print("y_test_department_encoded shape:", y_test_department_encoded.shape)


Department column successfully encoded.
y_train_department_encoded shape: (1176, 3)
y_test_department_encoded shape: (294, 3)


In [37]:
from sklearn.preprocessing import OneHotEncoder

# Create a OneHotEncoder for the Attrition column
attrition_encoder = OneHotEncoder(sparse_output=False)

# Fit the encoder to the training data
attrition_encoder.fit(y_train[['Attrition']])

# Create two new variables by applying the encoder
# to the training and testing data
y_train_attrition_encoded = attrition_encoder.transform(y_train[['Attrition']])
y_test_attrition_encoded = attrition_encoder.transform(y_test[['Attrition']])

# Print after success
print("Attrition column successfully encoded.")
print("y_train_attrition_encoded shape:", y_train_attrition_encoded.shape)
print("y_test_attrition_encoded shape:", y_test_attrition_encoded.shape)


Attrition column successfully encoded.
y_train_attrition_encoded shape: (1176, 2)
y_test_attrition_encoded shape: (294, 2)


## Part 2: Create, Compile, and Train the Model

In [39]:
from tensorflow.keras import layers, Input

# Find the number of columns in the X training data
input_features = X_train.shape[1]
print(f"Number of columns in the X training data: {input_features}")

# Create the input layer
input_layer = Input(shape=(input_features,))
print("Input layer created.")

# Create at least two shared layers
shared_layer1 = layers.Dense(128, activation='relu')(input_layer)
shared_layer2 = layers.Dense(64, activation='relu')(shared_layer1)
print("Two shared layers created.")


Number of columns in the X training data: 10
Input layer created.
Two shared layers created.


In [41]:
# Create a branch for Department
# with a hidden layer and an output layer

# Create the hidden layer
department_hidden_layer = layers.Dense(32, activation='relu')(shared_layer2)

# Create the output layer
department_output_layer = layers.Dense(y_train_department_encoded.shape[1], activation='softmax', name='department_output')(department_hidden_layer)

print("Branch for Department created with a hidden layer and an output layer.")


Branch for Department created with a hidden layer and an output layer.


In [43]:
# Create a branch for Attrition
# with a hidden layer and an output layer

# Create the hidden layer
attrition_hidden_layer = layers.Dense(32, activation='relu')(shared_layer2)

# Create the output layer
attrition_output_layer = layers.Dense(1, activation='sigmoid', name='attrition_output')(attrition_hidden_layer)

print("Branch for Attrition created with a hidden layer and an output layer.")



Branch for Attrition created with a hidden layer and an output layer.


In [45]:
from tensorflow.keras.models import Model

# Create the model
model = Model(inputs=input_layer, outputs=[department_output_layer, attrition_output_layer])

# Compile the model
model.compile(optimizer='adam',
              loss={'department_output': 'categorical_crossentropy', 'attrition_output': 'binary_crossentropy'},
              metrics={'department_output': 'accuracy', 'attrition_output': 'accuracy'})

# Summarize the model
model.summary()

print("Model created, compiled, and summarized successfully.")


Model created, compiled, and summarized successfully.


In [59]:
# Convert Attrition to binary and ensure the shape is correct
y_train_attrition_binary = y_train['Attrition'].map({'Yes': 1, 'No': 0}).values.reshape(-1, 1)
y_test_attrition_binary = y_test['Attrition'].map({'Yes': 1, 'No': 0}).values.reshape(-1, 1)

# Convert all inputs and outputs to NumPy arrays
X_train_np = X_train.to_numpy()
X_test_np = X_test.to_numpy()
y_train_department_np = np.array(y_train_department_encoded)
y_test_department_np = np.array(y_test_department_encoded)
y_train_attrition_np = np.array(y_train_attrition_binary)
y_test_attrition_np = np.array(y_test_attrition_binary)

# Debug: Print shapes to ensure alignment
print(f"X_train_np shape: {X_train_np.shape}")
print(f"y_train_department_np shape: {y_train_department_np.shape}")
print(f"y_train_attrition_np shape: {y_train_attrition_np.shape}")
print(f"X_test_np shape: {X_test_np.shape}")
print(f"y_test_department_np shape: {y_test_department_np.shape}")
print(f"y_test_attrition_np shape: {y_test_attrition_np.shape}")

# Debug: Verify there are no missing or mismatched data points
print("Checking for missing values...")
print(f"X_train missing: {np.isnan(X_train_np).sum()}")
print(f"y_train_department_np missing: {np.isnan(y_train_department_np).sum()}")
print(f"y_train_attrition_np missing: {np.isnan(y_train_attrition_np).sum()}")
print(f"X_test missing: {np.isnan(X_test_np).sum()}")
print(f"y_test_department_np missing: {np.isnan(y_test_department_np).sum()}")
print(f"y_test_attrition_np missing: {np.isnan(y_test_attrition_np).sum()}")

# Rebuild the model
from tensorflow.keras import layers, Input, Model

# Input layer
input_layer = Input(shape=(X_train_np.shape[1],))

# Shared layers
shared_layer1 = layers.Dense(128, activation='relu')(input_layer)
shared_layer2 = layers.Dense(64, activation='relu')(shared_layer1)

# Department branch
department_hidden = layers.Dense(32, activation='relu')(shared_layer2)
department_output = layers.Dense(y_train_department_np.shape[1], activation='softmax', name='department_output')(department_hidden)

# Attrition branch
attrition_hidden = layers.Dense(32, activation='relu')(shared_layer2)
attrition_output = layers.Dense(1, activation='sigmoid', name='attrition_output')(attrition_hidden)

# Define the model
model = Model(inputs=input_layer, outputs=[department_output, attrition_output])

# Compile the model
model.compile(
    optimizer='adam',
    loss={
        'department_output': 'categorical_crossentropy',
        'attrition_output': 'binary_crossentropy'
    },
    metrics={
        'department_output': 'accuracy',
        'attrition_output': 'accuracy'
    }
)

# Summarize the model
model.summary()

# Train the model
history = model.fit(
    X_train_np,
    {
        'department_output': y_train_department_np,
        'attrition_output': y_train_attrition_np
    },
    validation_data=(
        X_test_np,
        {
            'department_output': y_test_department_np,
            'attrition_output': y_test_attrition_np
        }
    ),
    epochs=50,
    batch_size=32
)

print("Model training complete.")


X_train_np shape: (1176, 10)
y_train_department_np shape: (1176, 3)
y_train_attrition_np shape: (1176, 1)
X_test_np shape: (294, 10)
y_test_department_np shape: (294, 3)
y_test_attrition_np shape: (294, 1)
Checking for missing values...
X_train missing: 0
y_train_department_np missing: 0
y_train_attrition_np missing: 0
X_test missing: 0
y_test_department_np missing: 0
y_test_attrition_np missing: 0


Epoch 1/50
[1m37/37[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m7s[0m 23ms/step - attrition_output_accuracy: 0.7113 - attrition_output_loss: 0.7702 - department_output_accuracy: 0.5078 - department_output_loss: 1.3276 - loss: 2.0979 - val_attrition_output_accuracy: 0.8673 - val_attrition_output_loss: 0.3699 - val_department_output_accuracy: 0.6667 - val_department_output_loss: 0.7816 - val_loss: 1.1775
Epoch 2/50
[1m37/37[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 7ms/step - attrition_output_accuracy: 0.8360 - attrition_output_loss: 0.4433 - department_output_accuracy: 0.6601 - department_output_loss: 0.7969 - loss: 1.2402 - val_attrition_output_accuracy: 0.8639 - val_attrition_output_loss: 0.3591 - val_department_output_accuracy: 0.6667 - val_department_output_loss: 0.7646 - val_loss: 1.1474
Epoch 3/50
[1m37/37[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 7ms/step - attrition_output_accuracy: 0.8508 - attrition_output_loss: 0.4122 - department_output_accurac

In [61]:
# Evaluate the model with the testing data
results = model.evaluate(
    X_test_np,
    {
        'department_output': y_test_department_np,
        'attrition_output': y_test_attrition_np
    },
    verbose=1
)

# Print the evaluation results
print("Model evaluation results:")
print(f"Total Loss: {results[0]}")
print(f"Department Output Loss: {results[1]}")
print(f"Department Output Accuracy: {results[2]}")
print(f"Attrition Output Loss: {results[3]}")
print(f"Attrition Output Accuracy: {results[4]}")


[1m10/10[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 14ms/step - attrition_output_accuracy: 0.8278 - attrition_output_loss: 0.4217 - department_output_accuracy: 0.5617 - department_output_loss: 0.8231 - loss: 1.2508
Model evaluation results:
Total Loss: 1.182944893836975
Department Output Loss: 0.7853711247444153
Department Output Accuracy: 0.36471009254455566
Attrition Output Loss: 0.8503401279449463
Attrition Output Accuracy: 0.5884353518486023


In [63]:
# Evaluate the model with the testing data
results = model.evaluate(
    X_test_np,
    {
        'department_output': y_test_department_np,
        'attrition_output': y_test_attrition_np
    },
    verbose=1
)

# Print the accuracy for both department and attrition
print(f"Attrition predictions accuracy: {results[4]}")
print(f"Department predictions accuracy: {results[2]}")


[1m10/10[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 24ms/step - attrition_output_accuracy: 0.8278 - attrition_output_loss: 0.4217 - department_output_accuracy: 0.5617 - department_output_loss: 0.8231 - loss: 1.2508 
Attrition predictions accuracy: 0.5884353518486023
Department predictions accuracy: 0.36471009254455566


# Summary

In the provided space below, briefly answer the following questions.

1. Is accuracy the best metric to use on this data? Why or why not?

2. What activation functions did you choose for your output layers, and why?

3. Can you name a few ways that this model might be improved?

YOUR ANSWERS HERE

1. Accuracy is a helpful metric, but it might not be the best for this data, especially for the attrition prediction task. The dataset may have class imbalance (e.g., more employees staying than leaving), where accuracy could be misleading. Metrics like precision, recall, and F1-score should also be considered to evaluate the model's performance more comprehensively, particularly for the attrition_output.
   
2. Attrition Output Layer:
We used the sigmoid activation function because it is ideal for binary classification tasks. It outputs a value between 0 and 1, which can be interpreted as a probability for predicting whether an employee will leave the company.
Department Output Layer:
We used the softmax activation function for the multi-class classification of department predictions. It provides a probability distribution across the three departments, ensuring the output sums to 1 and can be used to determine the most likely department for each employee.

3. Hyperparameter Tuning:
Experiment with learning rates, batch sizes, number of neurons, and the number of layers to optimize model performance.

Dropout Layers:
Add dropout layers to reduce overfitting by randomly deactivating a fraction of neurons during training.

Feature Engineering:
Add or transform features to make them more predictive, such as interaction terms or normalized continuous variables.

Class Balancing:
Use techniques like oversampling, undersampling, or class weights to address any imbalance in the attrition data.

Regularization:
Apply L1 or L2 regularization in dense layers to prevent overfitting.

Advanced Models:
Consider using ensemble techniques, like random forests or gradient boosting, to compare with the neural network's performance. 