<a href="https://colab.research.google.com/github/sirarasi/neural-network-challenge-2/blob/main/attrition.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Part 1: Preprocessing

In [1]:
# Import our dependencies
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np
from tensorflow.keras import layers, models

#  Import and read the attrition data
attrition_df = pd.read_csv('https://static.bc-edx.com/ai/ail-v-1-0/m19/lms/datasets/attrition.csv')
attrition_df.head()

| | Age | Attrition | BusinessTravel | Department | DistanceFromHome | Education | EducationField | EnvironmentSatisfaction | HourlyRate | JobInvolvement | ... | PerformanceRating | RelationshipSatisfaction | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 41 | Yes | Travel_Rarely | Sales | 1 | 2 | Life Sciences | 2 | 94 | 3 | ... | 3 | 1 | 0 | 8 | 0 | 1 | 6 | 4 | 0 | 5 |
| 1 | 49 | No | Travel_Frequently | Research & Development | 8 | 1 | Life Sciences | 3 | 61 | 2 | ... | 4 | 4 | 1 | 10 | 3 | 3 | 10 | 7 | 1 | 7 |
| 2 | 37 | Yes | Travel_Rarely | Research & Development | 2 | 2 | Other | 4 | 92 | 2 | ... | 3 | 2 | 0 | 7 | 3 | 3 | 0 | 0 | 0 | 0 |
| 3 | 33 | No | Travel_Frequently | Research & Development | 3 | 4 | Life Sciences | 4 | 56 | 3 | ... | 3 | 3 | 0 | 8 | 3 | 3 | 8 | 7 | 3 | 0 |
| 4 | 27 | No | Travel_Rarely | Research & Development | 2 | 1 | Medical | 1 | 40 | 3 | ... | 3 | 4 | 1 | 6 | 3 | 3 | 2 | 2 | 2 | 2 |


In [2]:
# Determine the number of unique values in each column.
attrition_df.nunique()

Age                         43
Attrition                    2
BusinessTravel               3
Department                   3
DistanceFromHome            29
Education                    5
EducationField               6
EnvironmentSatisfaction      4
HourlyRate                  71
JobInvolvement               4
JobLevel                     5
JobRole                      9
JobSatisfaction              4
MaritalStatus                3
NumCompaniesWorked          10
OverTime                     2
PercentSalaryHike           15
PerformanceRating            2
RelationshipSatisfaction     4
StockOptionLevel             4
TotalWorkingYears           40
TrainingTimesLastYear        7
WorkLifeBalance              4
YearsAtCompany              37
YearsInCurrentRole          19
YearsSinceLastPromotion     16
YearsWithCurrManager        18
dtype: int64

In [3]:
# Create y_df with the Attrition and Department columns
# Split data into X and two separate y variables

y_df = attrition_df[['Attrition', 'Department']]

y_df.head()

| | Attrition | Department |
|---|---|---|
| 0 | Yes | Sales |
| 1 | No | Research & Development |
| 2 | Yes | Research & Development |
| 3 | No | Research & Development |
| 4 | No | Research & Development |


In [4]:
# Create a list of at least 10 column names to use as X data
columns_for_X_df = ['Education', 'Age', 'DistanceFromHome', 'JobSatisfaction', 'OverTime', 'StockOptionLevel', 'WorkLifeBalance', 'YearsAtCompany', 'YearsSinceLastPromotion', 'NumCompaniesWorked']

# Create X_df using your selected columns
X_df = attrition_df[columns_for_X_df]

# Show the data types for X_df
X_df.dtypes



Education                   int64
Age                         int64
DistanceFromHome            int64
JobSatisfaction             int64
OverTime                   object
StockOptionLevel            int64
WorkLifeBalance             int64
YearsAtCompany              int64
YearsSinceLastPromotion     int64
NumCompaniesWorked          int64
dtype: object

In [5]:
# X_df.info()
X_df['OverTime'].value_counts()

OverTime
No     1054
Yes     416
Name: count, dtype: int64

In [6]:
from sklearn.preprocessing import LabelEncoder

# Initialize the label encoder
label_encoder = LabelEncoder()

# Encode the 'OverTime' column
# Use label encoding to convert 'OverTime' column to numeric values
X_df['OverTime'] = label_encoder.fit_transform(X_df['OverTime'])

# Check the data types of the DataFrame
X_df.dtypes


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X_df['OverTime'] = label_encoder.fit_transform(X_df['OverTime'])


Education                  int64
Age                        int64
DistanceFromHome           int64
JobSatisfaction            int64
OverTime                   int64
StockOptionLevel           int64
WorkLifeBalance            int64
YearsAtCompany             int64
YearsSinceLastPromotion    int64
NumCompaniesWorked         int64
dtype: object
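
The `SettingWithCopyWarning` shown above appears because `X_df` was created by selecting columns from `attrition_df`, so pandas cannot tell whether the later assignment is meant to reach the original frame. One common way to avoid it is to take an explicit copy before modifying the selection. The sketch below uses a small made-up frame (`toy_df` and `X_toy` are illustrative names, not the attrition data):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for attrition_df (values are made up for illustration)
toy_df = pd.DataFrame({
    "Age": [41, 49, 37, 33],
    "OverTime": ["Yes", "No", "Yes", "No"],
    "Department": ["Sales", "R&D", "R&D", "R&D"],
})

# .copy() makes X_toy an independent DataFrame, so later assignments
# modify it directly and do not trigger SettingWithCopyWarning
X_toy = toy_df[["Age", "OverTime"]].copy()

# LabelEncoder sorts the labels, so "No" -> 0 and "Yes" -> 1
X_toy["OverTime"] = LabelEncoder().fit_transform(X_toy["OverTime"])

print(X_toy["OverTime"].tolist())
```

The original frame keeps its string labels, and only the copy holds the encoded column.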

In [7]:
X_df['OverTime'].value_counts()
# X_df['OverTime'].dtype

OverTime
0    1054
1     416
Name: count, dtype: int64

In [8]:
# Split the data into training and testing sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_df_train, y_df_test = train_test_split(X_df, y_df)

print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_df_train:", y_df_train.shape)
print("Shape of y_df_test:", y_df_test.shape)

Shape of X_train: (1102, 10)
Shape of X_test: (368, 10)
Shape of y_df_train: (1102, 2)
Shape of y_df_test: (368, 2)
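
The split above uses no `random_state`, so each run produces different train and test rows. A minimal sketch of a reproducible, stratified split follows, using made-up arrays (`X_toy`, `y_toy`) rather than the notebook's data. In the notebook itself, stratifying on `y_df['Attrition']` would be one option, which is an assumption and not something the original code does:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 20% positive class (made up for illustration)
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([1] * 20 + [0] * 80)

# random_state fixes the shuffle; stratify keeps the class ratio
# the same in both splits (the default test_size is 0.25)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, random_state=42, stratify=y_toy
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```

Both splits keep the 20% positive rate, which matters when one class is much rarer than the other.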


In [9]:
# Create a StandardScaler
from sklearn.preprocessing import StandardScaler

# Create the StandardScaler instance
scaler = StandardScaler()

# Fit the Standard Scaler with the training data
X_scaler = scaler.fit(X_train)

# Scale the training and testing data
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)


In [10]:
# Create a OneHotEncoder for the Department column

from sklearn.preprocessing import OneHotEncoder

# Create the OneHotEncoder
encoder = OneHotEncoder()

# Fit the encoder to the training data
encoder.fit(y_df_train[['Department']])

# Create two new variables by applying the encoder
# to the training and testing data
y_train_Department_encoded = encoder.transform(y_df_train[['Department']])
y_test_Department_encoded = encoder.transform(y_df_test[['Department']])

# OneHotEncoder returns a SciPy sparse matrix by default;
# convert it to a dense array to display it
print("array",y_train_Department_encoded.toarray())



array [[1. 0. 0.]
 [0. 0. 1.]
 [0. 1. 0.]
 ...
 [0. 1. 0.]
 [0. 1. 0.]
 [1. 0. 0.]]


In [11]:
# Create a OneHotEncoder for the Attrition column

Attrition_encoder = OneHotEncoder()

# Fit the encoder to the training data
Attrition_encoder.fit(y_df_train[['Attrition']])

# Create two new variables by applying the encoder
# to the training and testing data
y_train_Attrition_encoded = Attrition_encoder.transform(y_df_train[['Attrition']])
y_test_Attrition_encoded = Attrition_encoder.transform(y_df_test[['Attrition']])

# Convert the sparse result to a dense array to display it
print("array",y_train_Attrition_encoded.toarray())


array [[1. 0.]
 [1. 0.]
 [1. 0.]
 ...
 [1. 0.]
 [1. 0.]
 [0. 1.]]
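
The printed arrays do not say which column stands for which label. A fitted `OneHotEncoder` exposes the column order through its `categories_` attribute, and `inverse_transform` maps one-hot rows back to labels. A small sketch on made-up labels (`toy_y` is illustrative):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy target column (made up for illustration)
toy_y = pd.DataFrame({"Attrition": ["Yes", "No", "No", "Yes"]})

enc = OneHotEncoder()
encoded = enc.fit_transform(toy_y[["Attrition"]]).toarray()

# categories_ lists the labels in column order (sorted alphabetically)
print(enc.categories_)
print(encoded)

# inverse_transform recovers the original labels from one-hot rows
decoded = enc.inverse_transform(encoded)
print(decoded.ravel())
```

Because the labels are sorted, the first column corresponds to "No" and the second to "Yes", so a row of `[1. 0.]` means the employee did not leave.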


## Create, Compile, and Train the Model

In [12]:
# Find the number of columns in the X training data

num_columns = X_train.shape[1]
print("Number of columns in X_train:", num_columns)

Number of columns in X_train: 10


In [13]:
# 1. Find the number of columns in the X training data.
num_features = X_train.shape[1]

# Input layer
input_layer = layers.Input(shape=(num_features,), name='input')

# Shared layers
shared_layer1 = layers.Dense(64, activation='relu', name='shared1')(input_layer)
shared_layer2 = layers.Dense(128, activation='relu', name='shared2')(shared_layer1)

# Department branch
department_hidden = layers.Dense(32, activation='relu', name='department_hidden')(shared_layer2)
department_output = layers.Dense(3, activation='softmax', name='department_output')(department_hidden)

# Attrition branch
attrition_hidden = layers.Dense(32, activation='relu', name='attrition_hidden')(shared_layer2)
attrition_output = layers.Dense(2, activation='sigmoid', name='attrition_output')(attrition_hidden)

# Define the model
model = models.Model(inputs=input_layer, outputs=[department_output, attrition_output])


# 7. Compile the model.
model.compile(optimizer='adam',
               loss={'department_output': 'categorical_crossentropy', 'attrition_output': 'binary_crossentropy'},
               metrics={'department_output': 'accuracy', 'attrition_output': 'accuracy'})

# 8. Summarize the model.
model.summary()


Model: "model"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 input (InputLayer)          [(None, 10)]                 0         []                            
                                                                                                  
 shared1 (Dense)             (None, 64)                   704       ['input[0][0]']               
                                                                                                  
 shared2 (Dense)             (None, 128)                  8320      ['shared1[0][0]']             
                                                                                                  
 department_hidden (Dense)   (None, 32)                   4128      ['shared2[0][0]']             
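
The `Param #` values in the summary rows shown can be checked by hand: a Dense layer has one weight per input-unit pair plus one bias per unit. A short sketch of that arithmetic (`dense_params` is a helper written for this illustration, not a Keras function):

```python
def dense_params(n_inputs: int, n_units: int) -> int:
    """Trainable parameters in a Dense layer: weights plus one bias per unit."""
    return n_inputs * n_units + n_units

# (inputs, units) for the layers listed in the summary above
layer_sizes = {
    "shared1": (10, 64),
    "shared2": (64, 128),
    "department_hidden": (128, 32),
}

for name, (n_in, n_out) in layer_sizes.items():
    print(name, dense_params(n_in, n_out))
```

The results, 704, 8320, and 4128, match the summary rows for those three layers.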
                                                                                              

In [14]:
print("Shape of X_train_scaled:", X_train_scaled.shape)
print("Shape of y_train_Department_encoded:", y_train_Department_encoded.shape)
print("Shape of y_train_Attrition_encoded:", y_train_Attrition_encoded.shape)

Shape of X_train_scaled: (1102, 10)
Shape of y_train_Department_encoded: (1102, 3)
Shape of y_train_Attrition_encoded: (1102, 2)


In [15]:
# Split the data into training and validation sets
from sklearn.model_selection import train_test_split

X_train_again, X_val, y_train_department_encoded, y_val_department_encoded, y_train_attrition_encoded, y_val_attrition_encoded = train_test_split(
    X_train_scaled,  # Input features
    y_train_Department_encoded,  # Encoded labels for department
    y_train_Attrition_encoded,  # Encoded labels for attrition
    test_size=0.2,  # Use 20% of the data for validation
    random_state=42  # Set random state for reproducibility
)

# print("Shape of X_train_again:", X_train_again.shape)
# print("Shape of y_train_department_encoded:", y_train_department_encoded.shape)
# print("Shape of y_train_attrition_encoded:", y_train_attrition_encoded.shape)
# print("Shape of y_val_department_encoded:", y_val_department_encoded.shape)
# print("Shape of y_val_attrition_encoded:", y_val_attrition_encoded.shape)

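# Re-compile before training. Note that this call replaces the configuration
# from the previous cell: both outputs now use categorical_crossentropy and
# no metrics are tracked, so evaluate() returns only the three loss values.
model.compile(optimizer='adam',
              loss={'department_output': 'categorical_crossentropy', 'attrition_output': 'categorical_crossentropy'})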

history = model.fit(
    X_train_again,
    {'department_output': y_train_department_encoded.toarray(), 'attrition_output': y_train_attrition_encoded.toarray()},
    validation_data=(X_val, {'department_output': y_val_department_encoded.toarray(), 'attrition_output': y_val_attrition_encoded.toarray()}),
    epochs=10,
    batch_size=32
)


# Train the model using the preprocessed training and validation data
# history = model.fit(X_train,
#                     {'department_output': y_train_Department_encoded, 'attrition_output': y_train_Attrition_encoded},
#                     validation_data=(X_val, {'department_output': y_val_Department_encoded, 'attrition_output': y_val_Attrition_encoded}),
#                     epochs=10,
#                     batch_size=32)


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [16]:
# Evaluate the model with the validation data
# Convert the SciPy sparse matrices returned by OneHotEncoder to dense NumPy arrays
y_val_department_array = y_val_department_encoded.toarray()
y_val_attrition_array = y_val_attrition_encoded.toarray()

# Evaluate the model on the validation data
val_loss, val_department_loss, val_attrition_loss = model.evaluate(
    X_val,
    {'department_output': y_val_department_array, 'attrition_output': y_val_attrition_array},
    verbose=0
)

# Print the validation loss and losses for each output
print("Validation Loss:", val_loss)
print("Validation Department Loss:", val_department_loss)
print("Validation Attrition Loss:", val_attrition_loss)



Validation Loss: 1.2716256380081177
Validation Department Loss: 0.8192228674888611
Validation Attrition Loss: 0.452402800321579


In [17]:
# Evaluate the model with the testing data

# Convert the SciPy sparse matrices returned by OneHotEncoder to dense NumPy arrays
y_test_Department_encoded_dense = y_test_Department_encoded.toarray()
y_test_Attrition_encoded_dense = y_test_Attrition_encoded.toarray()

# Evaluate the model on the testing data
test_metrics = model.evaluate(X_test_scaled,
                               {'department_output': y_test_Department_encoded_dense, 'attrition_output': y_test_Attrition_encoded_dense})

# Print the test metrics: [total loss, department loss, attrition loss]
print(test_metrics)



[1.2721201181411743, 0.8596864938735962, 0.4124334454536438]
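
The three numbers are related: when `compile()` is called without `loss_weights`, Keras reports the total loss as the plain sum of the per-output losses. The check below simply redoes that addition on the values printed above:

```python
import math

# Values printed by model.evaluate above: [total, department, attrition]
test_metrics = [1.2721201181411743, 0.8596864938735962, 0.4124334454536438]
total_loss, department_loss, attrition_loss = test_metrics

# With default loss weights the total is the sum of the per-output losses;
# a small tolerance allows for float32 rounding inside the model
recomputed = department_loss + attrition_loss
print(recomputed, math.isclose(total_loss, recomputed, abs_tol=1e-5))
```

Most of the total comes from the department output, so that is the branch with the most room to improve.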


In [18]:
# Print the accuracy for both department and attrition
from sklearn.metrics import accuracy_score

# Make predictions on the validation data
y_pred = model.predict(X_val)

# Calculate accuracy for department output
y_val_department_pred = y_pred[0]
y_val_department_true = y_val_department_encoded.toarray()
department_accuracy = accuracy_score(y_val_department_true, y_val_department_pred.round())

# Calculate accuracy for attrition output
y_val_attrition_pred = y_pred[1]
y_val_attrition_true = y_val_attrition_encoded.toarray()
attrition_accuracy = accuracy_score(y_val_attrition_true, y_val_attrition_pred.round())

print("Department Accuracy:", department_accuracy)
print("Attrition Accuracy:", attrition_accuracy)


Department Accuracy: 0.5610859728506787
Attrition Accuracy: 0.8099547511312217
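
A caveat on the calculation above: `accuracy_score` on one-hot arrays counts a row as correct only if every column matches, and rounding softmax outputs turns any row with no probability above 0.5 into all zeros, which can never match a label. Comparing `argmax` indices is the more usual approach for multi-class outputs. The sketch below contrasts the two on made-up probabilities (`probs` and `y_true` are illustrative, not model output):

```python
import numpy as np

# Made-up softmax outputs for 3 samples over 3 classes
probs = np.array([
    [0.45, 0.35, 0.20],  # no class above 0.5
    [0.10, 0.80, 0.10],
    [0.30, 0.30, 0.40],  # no class above 0.5
])
y_true = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
])

# Rounding: rows with no probability above 0.5 become all zeros
# and can never match a one-hot label
rounded_acc = np.mean((probs.round() == y_true).all(axis=1))

# Argmax: compare the most probable class with the true class
argmax_acc = np.mean(probs.argmax(axis=1) == y_true.argmax(axis=1))

print(rounded_acc, argmax_acc)
```

Here the most probable class is right for every row, yet the rounding approach scores only one of the three as correct, so the department accuracy reported above may understate how often the top prediction is right.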


# Summary

In the provided space below, briefly answer the following questions.

1. Is accuracy the best metric to use on this data? Why or why not?

2. What activation functions did you choose for your output layers, and why?

3. Can you name a few ways that this model might be improved?

1. Accuracy is probably not the best metric for this data, because the classes may be imbalanced, for attrition as well as for department. On imbalanced data a model can reach high accuracy simply by predicting the majority class, so the attrition accuracy of about 0.81 should be compared against that baseline before it is read as good performance. Precision, recall, and F1-score give a more balanced picture because they show how well the minority classes are predicted.
2. I used softmax for department_output and sigmoid for attrition_output.
   - department_output is a multi-class task with 3 classes. Softmax makes the output probabilities sum to 1 across the classes, which suits a task where each sample belongs to exactly one class.
   - attrition_output is a binary task. Sigmoid squashes each output to a value between 0 and 1. Because the Attrition target is one-hot encoded into 2 columns, the layer has 2 sigmoid units that are computed independently, so the two values are not guaranteed to sum to 1. A single sigmoid unit with a 0/1 target, or a 2-unit softmax, would be the more conventional setup.
   - The training cell re-compiles the model with categorical_crossentropy for both outputs, a loss that pairs more naturally with softmax than with sigmoid.
3. The model might be improved in several ways:
   - Hyperparameter tuning: experiment with different architectures, layer sizes, and activation functions to find a better configuration.
   - Regularization: add dropout or L2 regularization to reduce overfitting.
   - Batch normalization: adding batch normalization layers can help stabilize and speed up training.
   - Data augmentation: image techniques such as rotation do not apply to tabular data, but synthetic oversampling or adding small noise to numeric features could help if the dataset is limited.
   - Ensemble methods: combine predictions from multiple models to improve overall performance.
   - Feature engineering: explore additional features or transformations that better represent the underlying patterns in the data.
   - Class balancing: address class imbalances, especially for the department prediction task, with oversampling, undersampling, or class weighting.
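
To illustrate the point in answer 1, the sketch below scores a baseline that always predicts the majority class on made-up imbalanced labels (`y_true` and `y_pred` are illustrative, not the attrition data). Accuracy looks acceptable while precision, recall, and F1 for the minority class are all zero:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Made-up imbalanced labels: 8 negatives, 2 positives
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

# A "model" that always predicts the majority class
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)

# zero_division=0 returns 0 instead of warning when no positives are predicted
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary", zero_division=0
)

print(acc, precision, recall, f1)
```

This is why a per-class report is worth checking alongside accuracy before deciding whether the attrition branch has learned anything useful.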