<a href="https://colab.research.google.com/github/tpeterz/neural-network-challenge-2/blob/main/attrition.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Part 1: Preprocessing

In [1]:
# Import our dependencies
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np
!pip install tensorflow
import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras import layers

#  Import and read the attrition data
attrition_df = pd.read_csv('https://static.bc-edx.com/ai/ail-v-1-0/m19/lms/datasets/attrition.csv')
attrition_df.head()



Unnamed: 0,Age,Attrition,BusinessTravel,Department,DistanceFromHome,Education,EducationField,EnvironmentSatisfaction,HourlyRate,JobInvolvement,...,PerformanceRating,RelationshipSatisfaction,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,Sales,1,2,Life Sciences,2,94,3,...,3,1,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,Research & Development,8,1,Life Sciences,3,61,2,...,4,4,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,Research & Development,2,2,Other,4,92,2,...,3,2,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,Research & Development,3,4,Life Sciences,4,56,3,...,3,3,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,Research & Development,2,1,Medical,1,40,3,...,3,4,1,6,3,3,2,2,2,2


In [2]:
# Determine the number of unique values in each column.
attrition_df.nunique()

Age                         43
Attrition                    2
BusinessTravel               3
Department                   3
DistanceFromHome            29
Education                    5
EducationField               6
EnvironmentSatisfaction      4
HourlyRate                  71
JobInvolvement               4
JobLevel                     5
JobRole                      9
JobSatisfaction              4
MaritalStatus                3
NumCompaniesWorked          10
OverTime                     2
PercentSalaryHike           15
PerformanceRating            2
RelationshipSatisfaction     4
StockOptionLevel             4
TotalWorkingYears           40
TrainingTimesLastYear        7
WorkLifeBalance              4
YearsAtCompany              37
YearsInCurrentRole          19
YearsSinceLastPromotion     16
YearsWithCurrManager        18
dtype: int64

In [3]:
# Create y_df with the Attrition and Department columns
y_df = attrition_df[['Attrition','Department']].copy()
y_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Attrition   1470 non-null   object
 1   Department  1470 non-null   object
dtypes: object(2)
memory usage: 23.1+ KB


In [4]:
# Exploratory check, where I can take a look inside of each of these columns, for my overall analysis/summary
print(y_df['Attrition'].value_counts())
print(y_df['Department'].value_counts())

Attrition
No     1233
Yes     237
Name: count, dtype: int64
Department
Research & Development    961
Sales                     446
Human Resources            63
Name: count, dtype: int64


In [5]:
# Create a list of at least 10 column names to use as X data
X_columns = ['BusinessTravel', 'TotalWorkingYears', 'HourlyRate', 'YearsAtCompany', 'DistanceFromHome',
              'Age', 'YearsWithCurrManager', 'YearsSinceLastPromotion', 'YearsInCurrentRole', 'MaritalStatus']
# Create X_df using your selected columns
X_df = attrition_df[X_columns].copy()
# Show the data types for X_df
show = X_df.dtypes
print(show)

BusinessTravel             object
TotalWorkingYears           int64
HourlyRate                  int64
YearsAtCompany              int64
DistanceFromHome            int64
Age                         int64
YearsWithCurrManager        int64
YearsSinceLastPromotion     int64
YearsInCurrentRole          int64
MaritalStatus              object
dtype: object


In [6]:
# Split data into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_df, y_df, test_size=0.2, random_state=42)

In [7]:
# Double checking if there are null values within the training dataset (by percentage)
# Reference: Module 14, Day 2, activity 7 - "third_model_solution"
X_train.isna().sum()/len(X_train)

BusinessTravel             0.0
TotalWorkingYears          0.0
HourlyRate                 0.0
YearsAtCompany             0.0
DistanceFromHome           0.0
Age                        0.0
YearsWithCurrManager       0.0
YearsSinceLastPromotion    0.0
YearsInCurrentRole         0.0
MaritalStatus              0.0
dtype: float64

In [8]:
# Double checking if there are null values within the testing dataset (by percentage)
X_test.isna().sum()/len(X_test)

BusinessTravel             0.0
TotalWorkingYears          0.0
HourlyRate                 0.0
YearsAtCompany             0.0
DistanceFromHome           0.0
Age                        0.0
YearsWithCurrManager       0.0
YearsSinceLastPromotion    0.0
YearsInCurrentRole         0.0
MaritalStatus              0.0
dtype: float64

In [9]:
# For the below step, I will need to check the classes of the 'BusinessTravel' column
print(X_train['BusinessTravel'].value_counts())

BusinessTravel
Travel_Rarely        835
Travel_Frequently    228
Non-Travel           113
Name: count, dtype: int64


In [10]:
# For the below step, I will need to check the classes of the 'MaritalStatus' column
print(X_train['MaritalStatus'].value_counts())

MaritalStatus
Married     550
Single      359
Divorced    267
Name: count, dtype: int64


# Rationale for deciding the encoding method below
---
### From my understanding, label encoding is best suited to binary data. Given the value counts of the 2 columns that I am choosing to convert to numerical values (through encoding), it appears that one-hot encoding (`OneHotEncoder`) will be the best fit.

### Referencing Module 19, activity 2 (branching), the OneHotEncoder was used for data with multiple, rather than binary, categories.

### Reference:
> "Preprocess "color" column (**label encoding for binary; one-hot encoding for multiple categories**)" - Label encoding was used here for the 2 categories that the 'color' column contained.
### Seeing as both of my columns have 3 categories, one-hot encoding is the best-suited method.
---
# Secondary rationale for the 'MaritalStatus' column:
### According to Module 14, Day 2, activity 7 - "third_model_solution", there is a column named "marital" that contains the same values (single, married, divorced). It is handled as shown below:


>
    # Only three values; lets use two OneHotEncoded columns
    # remembering to choose options for unknown values
    encode_marital = OneHotEncoder(drop='first', handle_unknown='ignore', sparse_output=False)

    # Train the encoder
    encode_marital.fit(X_train_filled['marital'].values.reshape(-1, 1))





In [11]:
# Convert your X data to numeric data types however you see fit
# Add new code cells as necessary
# Convert categorical columns to numeric
# The categorical columns that I need to convert are 'MaritalStatus' and 'BusinessTravel'
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(sparse_output=False)

In [12]:
# Establish columns to be encoded (together)
columns_to_encode = ['BusinessTravel', 'MaritalStatus']

In [13]:
# Fit the encoder on the training data and transform both train and test data
encoder.fit(X_train[columns_to_encode])
X_train_encoded = encoder.transform(X_train[columns_to_encode])
X_test_encoded = encoder.transform(X_test[columns_to_encode])

In [14]:
# Get feature names for the encoded columns
encoded_columns = encoder.get_feature_names_out(columns_to_encode)

In [15]:
# Convert the encoded features into a dataframe
# and concatenate with the X_train dataframe, dropping the 'un-encoded' columns
X_train_processed = pd.concat([
    X_train.drop(columns_to_encode, axis=1),
    pd.DataFrame(X_train_encoded, columns=encoded_columns, index=X_train.index)
], axis=1)

In [16]:
# Doing the same for the testing data
X_test_processed = pd.concat([
    X_test.drop(columns_to_encode, axis=1),
    pd.DataFrame(X_test_encoded, columns=encoded_columns, index=X_test.index)
], axis=1)

In [17]:
# Verify the columns and data contained after encoding
# Instead of 10 columns to work with, the encoding now gives 14 usable columns in total
print(X_train_processed.head())

      TotalWorkingYears  HourlyRate  YearsAtCompany  DistanceFromHome  Age  \
1097                  2          57               1                21   24   
727                   0          73               0                 5   18   
254                  10          45               3                20   29   
1175                  7          66               5                12   39   
1341                 10          89              10                20   31   

      YearsWithCurrManager  YearsSinceLastPromotion  YearsInCurrentRole  \
1097                     0                        0                   1   
727                      0                        0                   0   
254                      2                        0                   2   
1175                     0                        1                   4   
1341                     2                        0                   8   

      BusinessTravel_Non-Travel  BusinessTravel_Travel_Frequently  \
1097       

In [18]:
# Create a StandardScaler
scaler = StandardScaler()

# Fit the StandardScaler to the training data
scaler.fit(X_train_processed)

# Scale the training and testing data
X_train_scaled = scaler.transform(X_train_processed)
X_test_scaled = scaler.transform(X_test_processed)
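
The cell above fits the scaler on the training data only and then reuses it for the test data. As a rough sketch of what that means, the snippet below uses a made-up single-feature array (an assumption, not the notebook's data) and checks that `transform` on the test rows matches subtracting the training mean and dividing by the training standard deviation by hand.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy feature (e.g. years at a company) for train and test rows
train = np.array([[1.0], [3.0], [5.0], [7.0]])
test = np.array([[4.0], [9.0]])

toy_scaler = StandardScaler().fit(train)      # statistics come from train only
test_scaled = toy_scaler.transform(test)      # test reuses those statistics

# Same computation by hand: subtract the training mean, divide by the
# training (population) standard deviation
manual = (test - train.mean(axis=0)) / train.std(axis=0)

print(test_scaled.ravel())
print(np.allclose(test_scaled, manual))
```

Fitting on the training split alone keeps information about the test rows out of the preprocessing step.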

In [19]:
# Convert arrays back to dfs
# Following the Module 19.3 activities, it looks like I need these to be in dataframe format for the model
X_train_scaled_df = pd.DataFrame(X_train_scaled, columns=X_train_processed.columns, index=X_train_processed.index)
X_test_scaled_df = pd.DataFrame(X_test_scaled, columns=X_test_processed.columns, index=X_test_processed.index)

In [20]:
# Taking a look at the dataframe (in printed form)
print(X_train_scaled_df.head())

      TotalWorkingYears  HourlyRate  YearsAtCompany  DistanceFromHome  \
1097          -1.167368   -0.472832       -0.974263          1.440396   
727           -1.423397    0.309374       -1.138573         -0.522699   
254           -0.143254   -1.059487       -0.645643          1.317703   
1175          -0.527297   -0.032841       -0.317023          0.336155   
1341          -0.143254    1.091580        0.504527          1.317703   

           Age  YearsWithCurrManager  YearsSinceLastPromotion  \
1097 -1.388559             -1.142448                -0.676110   
727  -2.040738             -1.142448                -0.676110   
254  -0.845077             -0.575084                -0.676110   
1175  0.241886             -1.142448                -0.355244   
1341 -0.627685             -0.575084                -0.676110   

      YearsInCurrentRole  BusinessTravel_Non-Travel  \
1097           -0.888208                  -0.326041   
727            -1.165051                   3.067096   
254  

In [21]:
# Create a OneHotEncoder for the Department column
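# Use sparse_output (the current name of the older `sparse` argument), as in the other encoder cells
encode_department = OneHotEncoder(sparse_output=False)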

# Fit the encoder to the training data
encode_department.fit(y_train[['Department']])

# Create two new variables by applying the encoder
# to the training and testing data
y_train_department_encoded = encode_department.transform(y_train[['Department']])
y_test_department_encoded = encode_department.transform(y_test[['Department']])

# Looking at the training array created (for y_train, encoded department)
y_train_department_encoded



array([[0., 1., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       ...,
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

In [22]:
# Create a OneHotEncoder for the Attrition column
encode_attrition = OneHotEncoder(sparse_output=False)

# Fit the encoder to the training data
encode_attrition.fit(y_train[['Attrition']])

# Create two new variables by applying the encoder
# to the training and testing data
y_train_attrition_encoded = encode_attrition.transform(y_train[['Attrition']])
y_test_attrition_encoded = encode_attrition.transform(y_test[['Attrition']])

# Print array
y_train_attrition_encoded

array([[1., 0.],
       [1., 0.],
       [1., 0.],
       ...,
       [0., 1.],
       [1., 0.],
       [1., 0.]])
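
Because `Attrition` has only two classes, the two-column one-hot target above carries the same information as a single 0/1 column. The sketch below uses made-up probabilities and plain NumPy (not Keras) to show that categorical cross-entropy on the two-column form equals binary cross-entropy on the single-column form when the two predicted probabilities sum to 1. All arrays and names here are assumptions for illustration.

```python
import numpy as np

# Hypothetical one-hot targets in the same layout as above: [No, Yes]
y_onehot = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
# Single-column form: 1 for 'Yes', 0 for 'No'
y_binary = y_onehot[:, 1]

# Made-up predicted probability of 'Yes' for each row
p_yes = np.array([0.2, 0.7, 0.1])
# Two-column predictions that sum to 1 per row
p_two = np.column_stack([1.0 - p_yes, p_yes])

# Categorical cross-entropy averaged over rows
cce = -np.mean(np.sum(y_onehot * np.log(p_two), axis=1))
# Binary cross-entropy averaged over rows
bce = -np.mean(y_binary * np.log(p_yes) + (1 - y_binary) * np.log(1 - p_yes))

print(cce, bce)
```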

In [23]:
# Additional verification step to check shape and display the beginning of the array
print(y_train_department_encoded.shape)
print(y_train_department_encoded[:10])

(1176, 3)
[[0. 1. 0.]
 [0. 1. 0.]
 [0. 0. 1.]
 [0. 1. 0.]
 [0. 1. 0.]
 [1. 0. 0.]
 [0. 1. 0.]
 [1. 0. 0.]
 [0. 0. 1.]
 [0. 1. 0.]]


## Create, Compile, and Train the Model

In [24]:
# Verify correct installations/imports (Colab is a little funny with the way it handles this)
# But in order to proceed, I absolutely need these
# !pip install tensorflow
# import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense

In [25]:
# Verifying and checking the shape (this will be important for the output layer(s) creation)
input_shape = X_train_scaled.shape[1]
print(input_shape)

14


In [26]:
# Checking again the array
X_train_scaled

array([[-1.1673683 , -0.47283217, -0.97426331, ...,  1.8451272 ,
        -0.93733358, -0.66288195],
       [-1.42339685,  0.30937375, -1.13857331, ..., -0.54196806,
        -0.93733358,  1.50856422],
       [-0.14325407, -1.05948661, -0.6456433 , ...,  1.8451272 ,
        -0.93733358, -0.66288195],
       ...,
       [-1.29538258, -0.912823  , -1.13857331, ..., -0.54196806,
         1.06685604, -0.66288195],
       [-0.14325407, -1.01059874, -0.4813333 , ..., -0.54196806,
         1.06685604, -0.66288195],
       [ 2.03298865, -0.37505643, -0.97426331, ..., -0.54196806,
         1.06685604, -0.66288195]])

In [27]:
# Reference (Mod. 19, day 3)

# Find the number of columns in the X training data
input_shape = X_train_scaled.shape[1]

# Create the input layer
# shape expects a tuple, so note the trailing comma
input_layer = Input(shape=(input_shape,))

# Create at least two shared layers
shared_layer_1 = Dense(128, activation='relu')(input_layer)
shared_layer_2 = Dense(64, activation='relu')(shared_layer_1)

In [28]:
# Create a branch for Department
# with a hidden layer and an output layer

# Choosing softmax

# Create the hidden layer
department_hidden = Dense(32, activation='relu')(shared_layer_2)

# Create the output layer
department_output = Dense(3, activation='softmax', name='department_output')(department_hidden)

In [29]:
# Create a branch for Attrition
# with a hidden layer and an output layer

# Choosing sigmoid

# Create the hidden layer
attrition_hidden = Dense(32, activation='relu')(shared_layer_2)

# Create the output layer
attrition_output = Dense(2, activation='sigmoid', name='attrition_output')(attrition_hidden)
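
To illustrate the difference between the two output activations chosen above, here is a small NumPy sketch (not the Keras implementation) that applies softmax and sigmoid to the same made-up 2-unit outputs. The `softmax` and `sigmoid` helpers and the `logits` values are assumptions for illustration.

```python
import numpy as np

def softmax(z):
    # Subtract the row max for numerical stability before exponentiating
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def sigmoid(z):
    # Squash each value independently into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical raw outputs (logits) of a 2-unit layer for two rows
logits = np.array([[2.0, 1.0], [-1.0, 3.0]])

soft = softmax(logits)
sig = sigmoid(logits)

print(soft.sum(axis=1))  # each row sums to 1
print(sig.sum(axis=1))   # rows are not constrained to sum to 1
```

For the same logits, both functions rank the units in the same order, so the predicted class (the argmax) is identical. That may be one reason swapping the activation on a 2-unit layer has little effect on accuracy.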

In [30]:
# Create the model
model = Model(inputs=input_layer, outputs=[department_output, attrition_output], name='model')

# Compile the model

model.compile(
    optimizer='adam',
    loss={
        'department_output': 'categorical_crossentropy',
        'attrition_output': 'categorical_crossentropy'
    },
    metrics={
        'department_output': 'accuracy',
        'attrition_output': 'accuracy'
    }
)
# Summarize the model
model.summary()

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 input_1 (InputLayer)        [(None, 14)]                 0         []                            
                                                                                                  
 dense (Dense)               (None, 128)                  1920      ['input_1[0][0]']             
                                                                                                  
 dense_1 (Dense)             (None, 64)                   8256      ['dense[0][0]']               
                                                                                                  
 dense_2 (Dense)             (None, 32)                   2080      ['dense_1[0][0]']             
                                                                                              

In [31]:
# Train the model
# Including the validation_data in the fitting of this model, as it is in the 19.2 example of batching
model.fit(
    X_train_scaled_df,
    y={'department_output': y_train_department_encoded, 'attrition_output': y_train_attrition_encoded},
    validation_data=(X_test_scaled_df, {'department_output': y_test_department_encoded, 'attrition_output': y_test_attrition_encoded}),
    epochs=100,
    batch_size=32,
    verbose=1
)

Epoch 1/100
...
Epoch 78

<keras.src.callbacks.History at 0x7eb7aefba110>

In [32]:
# Evaluate the model with the testing data
# This is very similar to the example output provided in the base starter code!
test_results = model.evaluate(X_test_scaled_df, {'department_output': y_test_department_encoded, 'attrition_output': y_test_attrition_encoded})
test_results



[3.783148765563965,
 2.3647406101226807,
 1.4184081554412842,
 0.5476190447807312,
 0.8027210831642151]
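
The list above is easier to read with labels attached. The sketch below assumes the order in which Keras appears to report values for this two-output model (total loss, then each output's loss, then each output's accuracy), which matches how indices 3 and 4 are used as accuracies in this notebook. The label names are hypothetical, not taken from Keras.

```python
# Values copied from the evaluate() output above
test_results = [3.783148765563965, 2.3647406101226807, 1.4184081554412842,
                0.5476190447807312, 0.8027210831642151]

# Assumed labels, in the order Keras appears to report them for this model
labels = ["total_loss", "department_loss", "attrition_loss",
          "department_accuracy", "attrition_accuracy"]

named = dict(zip(labels, test_results))
for name, value in named.items():
    print(f"{name}: {value:.4f}")

# The first entry matches the sum of the two per-output losses
loss_sum = named["department_loss"] + named["attrition_loss"]
print(abs(named["total_loss"] - loss_sum) < 1e-6)
```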

In [33]:
# Print the accuracy for both department and attrition
print(f"Department predictions accuracy: {test_results[3]}")
print(f"Attrition predictions accuracy: {test_results[4]}")

Department predictions accuracy: 0.5476190447807312
Attrition predictions accuracy: 0.8027210831642151
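
One way to put these accuracies in context is to compare them with a majority-class baseline, meaning the accuracy of always predicting the most frequent class. The sketch below computes that baseline from the `value_counts()` output shown earlier. Those counts cover the full dataset (1470 rows), so the proportions in the test split may differ slightly; the helper name `majority_baseline` is hypothetical.

```python
# Class counts copied from the value_counts() output earlier in the notebook
attrition_counts = {"No": 1233, "Yes": 237}
department_counts = {"Research & Development": 961, "Sales": 446, "Human Resources": 63}

def majority_baseline(counts):
    # Accuracy of always predicting the most frequent class
    return max(counts.values()) / sum(counts.values())

attrition_baseline = majority_baseline(attrition_counts)
department_baseline = majority_baseline(department_counts)

print(f"Attrition majority-class baseline: {attrition_baseline:.3f}")
print(f"Department majority-class baseline: {department_baseline:.3f}")
```

On the full dataset these baselines are about 0.839 for attrition and 0.654 for department, so test accuracies of roughly 0.80 and 0.55 do not clearly beat always guessing the majority class.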


# Summary

In the provided space below, briefly answer the following questions.

1. Is accuracy the best metric to use on this data? Why or why not?

2. What activation functions did you choose for your output layers, and why?

3. Can you name a few ways that this model might be improved?

# Answers:
1. Accuracy may not adequately reflect the effectiveness of this model. Looking at the data I began with for my `y_df`, both columns are clearly imbalanced (for example, `Attrition` has 1233 'No' values against 237 'Yes'). This imbalance can bias the model toward the majority class during training and produce misleadingly high accuracy scores. I believe that I need to incorporate other metrics for this data, such as precision, recall, F1-score, and possibly a confusion matrix.
2. For the **department** output layer, `department_output`, I chose **Softmax**. This is because this column is a multi-class categorical variable. From my understanding, softmax is the best fit when exactly one class is correct for each *instance*, because it produces a probability distribution over the classes whose values sum to 1. A similar thought process applied to the encoding method that the starter code required for this column (**OneHotEncoder**). A simpler explanation is that the values in the department column cannot be true at the same time (for example, an employee cannot be in both the Sales AND Human Resources departments; they are mutually exclusive).

  For the **attrition** output layer, `attrition_output`, I chose **Sigmoid**. This is because the target is a binary categorical variable (Yes or No: 1 or 0). There are only two possibilities here; if it is not 'Yes', it will be 'No', and vice versa. My first choice was Softmax; however, after changing it to my final choice of sigmoid, the results did not change.
3.
*  First, when creating `shared_layer_1` and `shared_layer_2`, I used 'relu' for activation. I could use a different activation function for these, such as leaky relu.
* Second, regarding my layer creation: I used 128 neurons for the first shared layer and 64 for the second (with 32 in each branch's hidden layer). These could be changed when tuning the model, such as by increasing the neurons and adding additional hidden layers (adjusting neurons from there).
* Third, I could adjust the number of epochs used to train the model, increasing them to allow further analysis.
* The last and main improvement that I will point out is the encoding method used for the 'Attrition' column. While 'OneHotEncoder' was specifically what I was told to use, wouldn't a 'LabelEncoder' be better in this instance? With a binary choice such as "Yes" or "No," this would remove unnecessary complexity and result in a single 0/1 target column (and a single output probability). This would further change the compile method I used, where it would require *`binary_crossentropy`*. I could also use an OrdinalEncoder with categories set to yes or no.
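
As a minimal sketch of the additional metrics mentioned in answer 1, the snippet below computes accuracy, precision, recall, F1-score, and a confusion matrix with scikit-learn on small, made-up label arrays. The arrays are deliberately imbalanced and are assumptions for illustration. In the notebook, `y_pred` would presumably come from taking the argmax of the model's attrition predictions, a step that is not shown here.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Hypothetical labels: 1 = attrition 'Yes', 0 = 'No' (imbalanced on purpose)
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
# A model that predicts 'No' for almost everything
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])

accuracy = (y_true == y_pred).mean()
cm = confusion_matrix(y_true, y_pred)      # rows: true class, columns: predicted class
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

print(f"accuracy:  {accuracy:.2f}")
print(f"precision: {precision:.2f}")
print(f"recall:    {recall:.2f}")
print(f"f1:        {f1:.2f}")
print(cm)
```

In this toy case accuracy is 0.90 while recall for the 'Yes' class is only 0.50, which is the kind of gap that accuracy alone hides on imbalanced data.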


