# HR Analytics - Predicting Employee Turnover

<div class="exercise--assignment exercise--typography"><h1 class="exercise--title">Finding categorical variables</h1><div class=""><p>Categorical variables take a limited number of values, each of which describes a category. They can be of two types:</p>
<ul>
<li><strong>Ordinal</strong> – variables with two or more categories that <strong>can be ranked or ordered</strong> (e.g. “low”, “medium”, “high”)</li>
<li><strong>Nominal</strong> – variables with two or more categories that <strong>do not have an intrinsic order</strong> (e.g. “men”, “women”)</li>
</ul>
<p>In this exercise, you will find the categorical variables in the dataset. To do that, first of all, you will import the <code>pandas</code> library and read the CSV file called <code>"turnover.csv"</code>. Then, after viewing the first 5 rows and learning (visually) that there are non-numeric values in the DataFrame, you will get some information about the types of variables that are available in the dataset.</p></div></div>

In [1]:
# Import pandas (as pd) to read the data
import pandas as pd

# Read "turnover.csv" and save it in a DataFrame called data
data = pd.read_csv("turnover.csv")


In [2]:
# Take a quick look at the first 5 rows of data
print(data.head())

   satisfaction  evaluation  number_of_projects  average_montly_hours  \
0          0.38        0.53                   2                   157   
1          0.80        0.86                   5                   262   
2          0.11        0.88                   7                   272   
3          0.72        0.87                   5                   223   
4          0.37        0.52                   2                   159   

   time_spend_company  work_accident  churn  promotion department  salary  
0                   3              0      1          0      sales     low  
1                   6              0      1          0      sales  medium  
2                   4              0      1          0      sales  medium  
3                   5              0      1          0      sales     low  
4                   3              0      1          0      sales     low  


In [3]:
# Get some information on the types of variables in data
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14999 entries, 0 to 14998
Data columns (total 10 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   satisfaction          14999 non-null  float64
 1   evaluation            14999 non-null  float64
 2   number_of_projects    14999 non-null  int64  
 3   average_montly_hours  14999 non-null  int64  
 4   time_spend_company    14999 non-null  int64  
 5   work_accident         14999 non-null  int64  
 6   churn                 14999 non-null  int64  
 7   promotion             14999 non-null  int64  
 8   department            14999 non-null  object 
 9   salary                14999 non-null  object 
dtypes: float64(2), int64(6), object(2)
memory usage: 1.1+ MB


<p class="">You have two columns of type <code>object</code>, <code>department</code> and <code>salary</code>, which are actually categorical. Let’s explore these in more detail.</p>
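<p class="">A minimal sketch of how the non-numeric columns could be listed programmatically instead of being read off the <code>info()</code> output. The small <code>toy</code> DataFrame is a made-up stand-in for <code>turnover.csv</code>, and using <code>select_dtypes</code> here is an illustrative choice, not part of the original exercise.</p>

```python
import pandas as pd

# Toy stand-in for the turnover data (values are made up for illustration)
toy = pd.DataFrame({
    "satisfaction": [0.38, 0.80, 0.11],
    "number_of_projects": [2, 5, 7],
    "department": ["sales", "hr", "IT"],
    "salary": ["low", "medium", "high"],
})

# Keep only the columns that are not numeric; these are the candidates
# for categorical variables
categorical_columns = toy.select_dtypes(exclude="number").columns.tolist()
print(categorical_columns)  # ['department', 'salary']
```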

<div class="exercise--assignment exercise--typography"><h1 class="exercise--title">Observing categoricals</h1><div class=""><p>Remember from the previous step that:</p>
<ul>
<li><strong>Ordinal</strong> variables have two or more categories which <strong>can be ranked or ordered</strong></li>
<li><strong>Nominal</strong> variables have two or more categories which <strong>do not have an intrinsic order</strong></li>
</ul>
<p>In your dataset:</p>
<ul>
<li><code>salary</code> is an ordinal variable</li>
<li><code>department</code> is a nominal variable</li>
</ul>
<p>In this exercise, you're going to observe the categorical variables found in the previous exercise. To do that, first of all, you will import the <code>pandas</code> library and read the CSV file called <code>"turnover.csv"</code>. Then, you will print the unique values of those variables.</p></div></div>

In [4]:
# Print the unique values of the "department" column
print(data.department.unique())

['sales' 'accounting' 'hr' 'technical' 'support' 'management' 'IT'
 'product_mng' 'marketing' 'RandD']


In [5]:
# Print the unique values of the "salary" column
print(data.salary.unique())

['low' 'medium' 'high']


<p class="">As you can see, <code>['low' 'medium' 'high']</code> is ordered, but <code>['sales' 'accounting' 'hr' 'technical' 'support' 'management' 'IT' 'product_mng' 'marketing' 'RandD']</code> isn't.</p>

<div class="exercise--assignment exercise--typography"><h1 class="exercise--title">Encoding categories</h1><div class=""><p>You need to help your algorithm understand that you're dealing with categories. You will encode categories of the <code>salary</code> variable, which you know is ordinal based on the values you observed:</p>
<ul>
<li>you first have to tell Python that the <code>salary</code> column is actually categorical</li>
<li>you then have to specify the correct order of categories </li>
<li>finally, you should encode each category with a numeric value corresponding to its specific position in the order</li>
</ul></div></div>

In [6]:
# Change the type of the "salary" column to categorical
data.salary = data.salary.astype('category')

# Provide the correct order of categories
data.salary = data.salary.cat.reorder_categories(['low', 'medium', 'high'])

# Encode categories
data.salary = data.salary.cat.codes

<p class="">Nicely done! Our <code>salary</code> column is now encoded as an ordered category, and optimized for our prediction algorithm.</p>
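<p class="">The sketch below illustrates why the order of categories has to be given explicitly. It uses a made-up <code>salary</code> Series rather than the real dataset, and <code>pd.Categorical</code> with <code>ordered=True</code> is an assumed alternative to the <code>astype</code>/<code>reorder_categories</code> steps above. Without an explicit order, pandas sorts string categories alphabetically, so <code>"high"</code> would get the lowest code.</p>

```python
import pandas as pd

# Made-up salary values for illustration
salary = pd.Series(["low", "high", "medium", "low"])

# Default behaviour: categories are sorted alphabetically (high, low, medium)
alphabetical_codes = salary.astype("category").cat.codes.tolist()

# Explicit order: low < medium < high
ordered = pd.Categorical(salary, categories=["low", "medium", "high"], ordered=True)
ordered_codes = ordered.codes.tolist()

print(alphabetical_codes)  # [1, 0, 2, 1]
print(ordered_codes)       # [0, 2, 1, 0]
```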

<div class="exercise--assignment exercise--typography"><h1 class="exercise--title">Getting dummies</h1><div class=""><p>You will now transform the <code>department</code> variable, which you know is nominal based on the values you observed. To do that, you will use so-called <strong>dummy variables</strong>.</p></div></div>

In [7]:
# Get dummies and save them inside a new DataFrame
departments = pd.get_dummies(data.department)

# Take a quick look at the first 5 rows of the new DataFrame called departments
print(departments.head())

   IT  RandD  accounting  hr  management  marketing  product_mng  sales  \
0   0      0           0   0           0          0            0      1   
1   0      0           0   0           0          0            0      1   
2   0      0           0   0           0          0            0      1   
3   0      0           0   0           0          0            0      1   
4   0      0           0   0           0          0            0      1   

   support  technical  
0        0          0  
1        0          0  
2        0          0  
3        0          0  
4        0          0  


<p class="">There are 10 departments in the dataset, so you now get 10 columns. The first five rows in your dataset refer to people working in the sales department, so you get values equal to 1 in the <code>sales</code> column, and values equal to 0 in the others.</p>

<div class="exercise--assignment exercise--typography"><h1 class="exercise--title">Dummy trap</h1><div class=""><p>A dummy trap is a situation where different dummy variables convey the same information. In this case, if an employee is, say, from the accounting department (i.e. value in the <code>accounting</code> column is 1), then you're certain that s/he is not from any other department (values everywhere else are 0). 
Thus, you could actually learn about his/her department by looking at all the other departments.</p>
<p>For that reason, whenever $n$ dummies are created (in your case, 10), only $n - 1$ (in your case, 9) of them are enough, and the $n$-th column's information is already included.</p>
<p>Therefore, you will get rid of the old department column, drop one of the department dummies to avoid the dummy trap, and then join the two DataFrames.</p></div></div>

In [8]:
# Drop the "accounting" column to avoid "dummy trap"
departments = departments.drop("accounting", axis=1)

# Drop the old column "department" as you don't need it anymore
data = data.drop("department", axis=1)

# Join the new DataFrame "departments" to your employee dataset
data = data.join(departments)

In [9]:
data.head()

Unnamed: 0,satisfaction,evaluation,number_of_projects,average_montly_hours,time_spend_company,work_accident,churn,promotion,salary,IT,RandD,hr,management,marketing,product_mng,sales,support,technical
0,0.38,0.53,2,157,3,0,1,0,0,0,0,0,0,0,0,1,0,0
1,0.8,0.86,5,262,6,0,1,0,1,0,0,0,0,0,0,1,0,0
2,0.11,0.88,7,272,4,0,1,0,1,0,0,0,0,0,0,1,0,0
3,0.72,0.87,5,223,5,0,1,0,0,0,0,0,0,0,0,1,0,0
4,0.37,0.52,2,159,3,0,1,0,0,0,0,0,0,0,0,1,0,0


<p class="">Notice that in the new <code>data</code> DataFrame, the <code>department</code> column has disappeared, being replaced by a column for each department except <code>accounting</code>.</p>
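<p class="">To make the dummy trap concrete, here is a small sketch on a made-up <code>department</code> Series. It shows that the dropped <code>accounting</code> column can be rebuilt from the remaining dummies, which is why it carries no extra information. The argument <code>dtype=int</code> is added only so the toy output is 0/1 regardless of the pandas version.</p>

```python
import pandas as pd

# Made-up department values for illustration
department = pd.Series(["sales", "accounting", "hr", "sales"])

# One 0/1 column per department
dummies = pd.get_dummies(department, dtype=int)

# Drop one dummy, as in the exercise
reduced = dummies.drop("accounting", axis=1)

# A row belongs to "accounting" exactly when all remaining dummies are 0
recovered = 1 - reduced.sum(axis=1)
print(recovered.tolist())  # [0, 1, 0, 0]
```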

<p class="">There is a negative correlation between employee <code>satisfaction</code> and <code>churn</code>. A negative correlation doesn’t mean that the correlation between the two variables is weak; it means that they are <strong>inversely correlated</strong> (and in this case, strongly inversely correlated).</p>

<div class="exercise--assignment exercise--typography"><h1 class="exercise--title">Percentage of employees who churn</h1><div class=""><p>The column <code>churn</code> provides information about whether an employee has left the company or not:</p>
<ul>
<li>if the value of this column is <strong>0</strong>,  the employee is <strong>still with the company</strong></li>
<li>if the value of this column is <strong>1</strong>, then the employee has <strong>left the company</strong></li>
</ul>
<p>Let’s calculate the turnover rate:</p>
<ul>
<li>you will first count the number of times the variable <code>churn</code> has the value 1 and the value 0, respectively</li>
<li>you will then divide both counts by the total, and multiply the result by 100 to get the percentage of employees who left and stayed</li>
</ul></div></div>

In [10]:
# Get the total number of observations and save it as the number of employees
n_employees = len(data)

# Print the number of employees who left/stayed
print(data.churn.value_counts())

# Print the percentage of employees who left/stayed
print(data.churn.value_counts()/n_employees*100)

0    11428
1     3571
Name: churn, dtype: int64
0    76.191746
1    23.808254
Name: churn, dtype: float64


<p class="">As you can see, <strong>11,428</strong> employees stayed, which accounts for about <strong>76%</strong> of the total employee count. Similarly, <strong>3,571</strong> employees left, which accounts for about <strong>24%</strong> of them.</p>
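<p class="">The same percentages can probably be obtained in one step with the <code>normalize=True</code> argument of <code>value_counts()</code>. The sketch below rebuilds a <code>churn</code> Series from the counts reported above (11,428 stayed, 3,571 left) instead of reading the CSV file.</p>

```python
import pandas as pd

# Rebuild a churn column from the counts shown in the output above
churn = pd.Series([0] * 11428 + [1] * 3571)

# normalize=True returns proportions instead of raw counts
percentages = churn.value_counts(normalize=True) * 100
print(percentages.round(2))
```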

<div class="exercise--assignment exercise--typography"><h1 class="exercise--title">Separating Target and Features</h1><div class=""><p>In order to make a prediction (in this case, whether an employee would leave or not), one needs to separate the dataset into two components:</p>
<ul>
<li>the <strong>dependent</strong> variable or <strong>target</strong> which needs to be predicted</li>
<li>the <strong>independent</strong> variables or <strong>features</strong> that will be used to make a prediction</li>
</ul>
<p>Your task is to separate the <code>target</code> and <code>features</code>. The target you have here is the employee churn, and features include everything else.</p>
<p>Reminder: the dataset has already been modified by encoding categorical variables and getting dummies.</p>
<p><code>pandas</code> has been imported for you as <code>pd</code>.</p></div></div>

In [11]:
# Set the target and features

# Choose the dependent variable column (churn) and set it as target
target = data.churn

# Drop column churn and set everything else as features
features = data.drop("churn",axis=1)

<p class="">Now target and features are separated!</p>

<div class="exercise--assignment exercise--typography"><h1 class="exercise--title">Splitting employee data</h1><div class=""><p>Overfitting the dataset is a common problem in analytics. This happens when a model is working well on the dataset it was developed upon, but fails to generalize outside of it.</p>
<p>A train/test split is used to check how well the model generalizes: you develop the model using the training sample and try it out on the test sample later on.</p>
<p>In this exercise, you will split both <code>target</code> and <code>features</code> into train and test sets with 75%/25% ratio, respectively.</p></div></div>

In [12]:
# Import the function for splitting dataset into train and test
from sklearn.model_selection import train_test_split

# Use that function to create the splits both for target and for features
# Set the test sample to be 25% of your observations
target_train, target_test, features_train, features_test = train_test_split(target,features,test_size=0.25,random_state=42)

<p class="">You’ve set aside 25% of your data to evaluate your model performance after training.</p>
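<p class=""><code>train_test_split</code> returns a train/test pair for each array, in the order the arrays were passed in, so the order of the names on the left-hand side has to follow the order of the arguments. The sketch below uses made-up arrays and passes the features first. The <code>stratify</code> argument is an addition that the exercise does not use; it keeps the class proportions the same in both samples.</p>

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 80 observations, 2 features, 25% of them in class 1
features = np.arange(160).reshape(80, 2)
target = np.array([0] * 60 + [1] * 20)

# Outputs come back as (train, test) pairs in the order of the inputs
features_train, features_test, target_train, target_test = train_test_split(
    features, target, test_size=0.25, stratify=target, random_state=42
)

print(features_train.shape, features_test.shape)
print(target_train.mean(), target_test.mean())
```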

<div class="exercise--assignment exercise--typography"><h1 class="exercise--title">Computing Gini index</h1><div class=""><p>The decision tree algorithm aims to achieve partitions in the terminal nodes that are as pure as possible. The Gini index is one of the methods used to achieve this. It is calculated based on the proportion of samples in each group.</p>
<p>Given the number of people who stayed and left respectively, calculate the Gini index for that node.</p></div></div>

In [13]:
#number of people who stayed/left
stayed = 37
left = 1138

#sum of stayed and left
total = stayed + left

#gini index
gini = 2*(stayed/total)*(left/total)
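
<p class="">For two classes, the expression <code>2*p*(1-p)</code> used above is the same as the general form, one minus the sum of squared class proportions. A minimal sketch of the general form follows; the helper name <code>gini_index</code> is hypothetical.</p>

```python
def gini_index(class_counts):
    """Gini index of a node: 1 minus the sum of squared class proportions."""
    total = sum(class_counts)
    return 1 - sum((count / total) ** 2 for count in class_counts)


# Same node as in the exercise: 37 stayed, 1138 left
node_gini = gini_index([37, 1138])
print(round(node_gini, 4))  # 0.061
```

<p class="">A value this close to 0 means the node is almost pure: nearly everyone in it left.</p>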

<div class="exercise--assignment exercise--typography"><h1 class="exercise--title">Splitting the tree</h1><div class=""><p>Given the Gini index that would result from splitting by either variable A or B, respectively, decide by which variable the tree should split next.</p></div></div>

In [14]:
# Gini index in case of splitting by variable A or B
gini_A = 0.65
gini_B = 0.15

# check which Gini is lower and use it for splitting
if gini_A < gini_B:
    print("split by A!")
else:
    print("split by B!")

split by B!
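
<p class="">The Gini index of a candidate split is commonly taken to be the average of the child nodes' Gini indices, weighted by the number of samples in each child. The sketch below shows one way this could be computed. The child counts for the two candidate splits are made up, and the helper names are hypothetical.</p>

```python
def gini_index(class_counts):
    """Gini index of a node: 1 minus the sum of squared class proportions."""
    total = sum(class_counts)
    return 1 - sum((count / total) ** 2 for count in class_counts)


def split_gini(children):
    """Sample-weighted average of the Gini indices of the child nodes."""
    total = sum(sum(child) for child in children)
    return sum(sum(child) / total * gini_index(child) for child in children)


# Each child node is given as [stayed, left]; the counts are illustrative only
split_a = [[40, 10], [10, 40]]  # separates the classes fairly well
split_b = [[25, 25], [25, 25]]  # separates nothing

gini_a = split_gini(split_a)
gini_b = split_gini(split_b)

# The split with the lower Gini index is preferred
best = "A" if gini_a < gini_b else "B"
print(best)  # A
```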


<div class="exercise--assignment exercise--typography"><h1 class="exercise--title">Fitting the tree to employee data</h1><div class=""><p>A train/test split provides the opportunity to develop the classifier on the training component and test it on the rest of the dataset. In this exercise, you will start developing an employee turnover prediction model using the <strong>decision tree</strong> classification algorithm. The algorithm provides a <code>.fit()</code> method, which can be used to fit the features to the model in the training set.</p>
<p><em>Reminder: both target and features are already split into train and test components (Train: <code>features_train</code>, <code>target_train</code>, Test: <code>features_test</code>, <code>target_test</code>)</em></p></div></div>

In [15]:
# Import the classification algorithm
from sklearn.tree import DecisionTreeClassifier

# Initialize it and call model by specifying the random_state parameter
model = DecisionTreeClassifier(random_state=42)

# Apply a decision tree model to fit features to the target
model.fit(features_train, target_train)

DecisionTreeClassifier(random_state=42)

<p class="">Your Decision Tree has learnt from the training data. Let’s now see how it performs on predictions!</p>

<div class="exercise--assignment exercise--typography"><h1 class="exercise--title">Checking the accuracy of prediction</h1><div class=""><p>It’s now time to check how well your trained model can make predictions! Let’s use your testing set to check the accuracy of your Decision Tree <code>model</code>, with the <code>score()</code> method.</p></div></div>

In [16]:
# Apply a decision tree model to fit features to the target in the training set
model.fit(features_train,target_train)

# Check the accuracy score of the prediction for the training set
# (only the cell's last expression is displayed, so this value is not shown below)
model.score(features_train,target_train)*100

# Check the accuracy score of the prediction for the test set
model.score(features_test,target_test)*100

97.22666666666666

<p class="">The Decision Tree algorithm did perfectly on the training set. On the testing set, it was able to correctly predict if an employee would leave or not in about 97% of the cases!</p>

<div class="exercise--assignment exercise--typography"><h1 class="exercise--title">Exporting the tree</h1><div class=""><p>In Decision Tree classification tasks, overfitting is usually the result of deeply grown trees. As the comparison of accuracy scores on the train and test sets shows, you have overfitting in your results. This can also be learned from the tree visualization.</p>
<p>In this exercise, you will export the decision tree into a text document, which can then be used for visualization.</p></div></div>

In [17]:
# Import the tree graphical visualization export function
from sklearn.tree import export_graphviz

# Apply Decision Tree model to fit Features to the Target
model.fit(features_train,target_train)

# Export the tree to a dot file
export_graphviz(model,"tree.dot")

<p class="">Now you can copy the contents of the <code>tree.dot</code> file and preview it at webgraphviz.com!</p>
<img src="tree_graph.png" alt="tree">
<div class="">According to the graph, when <code>average_montly_hours</code> is less than or equal to 125, complete purity is achieved.</div>
<p class="">If <code>average_montly_hours</code> is less than or equal to 125, the tree splits one last time and stops growing.</p>
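<p class="">By default, the exported graph labels features by position. A sketch of how readable labels might be added is shown below, using the <code>feature_names</code> and <code>class_names</code> arguments of <code>export_graphviz</code>. The data is synthetic and the feature names are placeholders, not the columns of the employee dataset. With <code>out_file=None</code>, the function returns the dot source as a string instead of writing a file.</p>

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Synthetic data standing in for the employee features
X, y = make_classification(n_samples=200, n_features=4, random_state=42)
feature_names = ["feature_a", "feature_b", "feature_c", "feature_d"]  # placeholders

# A shallow tree keeps the exported graph small
tree = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X, y)

# Return the dot source as a string, with named features and classes
dot_source = export_graphviz(
    tree,
    out_file=None,
    feature_names=feature_names,
    class_names=["stayed", "left"],
    filled=True,
)
print(dot_source[:200])
```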

<div class="exercise--assignment exercise--typography"><h1 class="exercise--title">Pruning the tree</h1><div class=""><p>Overfitting is a classic problem in analytics, especially for the decision tree algorithm. Once the tree is fully grown, it may provide highly accurate predictions for the training sample, yet fail to be that accurate on the test set. For that reason, the growth of the decision tree is usually controlled by:</p>
<ul>
<li>“Pruning” the tree and setting a limit on the maximum depth it can have.</li>
<li>Limiting the minimum number of observations in one leaf of the tree.</li>
</ul>
<p>In this exercise, you will:</p>
<ul>
<li>prune the tree and limit the growth of the tree to 5 levels of depth</li>
<li>fit it to the employee data</li>
<li>test prediction results on both training and testing sets.</li>
</ul>
<p>The variables <code>features_train</code>, <code>target_train</code>, <code>features_test</code> and <code>target_test</code> are already available in your workspace.</p></div></div>

In [18]:
# Initialize the DecisionTreeClassifier while limiting the depth of the tree to 5
model_depth_5 = DecisionTreeClassifier(max_depth=5, random_state=42)

# Fit the model
model_depth_5.fit(features_train,target_train)

# Print the accuracy of the prediction for the training set
print(model_depth_5.score(features_train,target_train)*100)

# Print the accuracy of the prediction for the test set
print(model_depth_5.score(features_test,target_test)*100)

97.71535247577563
97.06666666666666


<p class="">Now you have a more reasonable model!</p>
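<p class="">The effect of <code>max_depth</code> can be reproduced on synthetic data. The sketch below is not the employee dataset: it uses <code>make_classification</code> with some label noise (<code>flip_y=0.1</code>, an arbitrary choice) and compares a fully grown tree with one limited to 5 levels. The exact scores depend on the generated data, so none are quoted here.</p>

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy classification data (illustrative only)
X, y = make_classification(n_samples=1000, n_features=10, flip_y=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
pruned_tree = DecisionTreeClassifier(max_depth=5, random_state=42).fit(X_train, y_train)

# Compare depth, training accuracy and test accuracy of the two trees
for name, tree in [("full", full_tree), ("max_depth=5", pruned_tree)]:
    print(name, tree.get_depth(), tree.score(X_train, y_train), tree.score(X_test, y_test))
```

<p class="">A large gap between the training and test scores of the full tree, and a smaller gap for the limited tree, is the pattern that signals overfitting.</p>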

<div class="exercise--assignment exercise--typography"><h1 class="exercise--title">Limiting the sample size</h1><div class=""><p>Another method to prevent overfitting is to specify the minimum number of observations necessary to grow a leaf (or node) in the Decision Tree.</p>
<p>In this exercise, you will:</p>
<ul>
<li>set this minimum limit to 100</li>
<li>fit the new model to the employee data</li>
<li>examine prediction results on both training and test sets</li>
</ul>
<p>The variables <code>features_train</code>, <code>target_train</code>, <code>features_test</code> and <code>target_test</code> are already available in your workspace.</p></div></div>

In [19]:
# Initialize the DecisionTreeClassifier while limiting the sample size in leaves to 100
model_sample_100 = DecisionTreeClassifier(min_samples_leaf=100, random_state=42)

# Fit the model
model_sample_100.fit(features_train,target_train)

# Print the accuracy of the prediction (in percentage points) for the training set
print(model_sample_100.score(features_train,target_train)*100)

# Print the accuracy of the prediction (in percentage points) for the test set
print(model_sample_100.score(features_test,target_test)*100)

96.57747355320473
96.13333333333334


<p class="">Now you have another reasonable model!</p>
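<p class="">One way to confirm that <code>min_samples_leaf</code> behaves as described is to count how many training observations end up in each leaf. The sketch below does this on synthetic data (not the employee dataset), using the tree's <code>apply()</code> method, which returns the index of the leaf each observation falls into.</p>

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the employee features
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

tree = DecisionTreeClassifier(min_samples_leaf=100, random_state=42).fit(X, y)

# apply() gives the leaf index of every observation; count observations per leaf
leaf_ids = tree.apply(X)
leaf_sizes = np.bincount(leaf_ids)
smallest_leaf = int(leaf_sizes[leaf_sizes > 0].min())

print(tree.get_n_leaves(), smallest_leaf)
```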

<div class="exercise--assignment exercise--typography"><h1 class="exercise--title">Calculating accuracy metrics: precision</h1><div class=""><p>The Precision score is an important metric used to measure the accuracy of a classification algorithm. It is calculated as the <strong>fraction of True Positives over the sum of True Positives and False Positives</strong>, or</p>
<p>$$\frac{\text{# of True Positives}}{\text{# of True Positives} + \text{# of False Positives}}.$$</p>
<ul>
<li>we define <strong>True Positives</strong> as the number of employees who actually left, and were classified correctly as leaving </li>
<li>we define <strong>False Positives</strong> as the number of employees who actually stayed, but were wrongly classified as leaving</li>
</ul>
<p>If there are no False Positives, the precision score is equal to 1.
If there are no True Positives, the precision score is equal to 0.</p>
<p>In this exercise, we will calculate the precision score (using the <code>sklearn</code> function <code>precision_score</code>) for our initial classification model.</p>
<p>The variables <code>features_test</code> and <code>target_test</code> are available in your workspace.</p></div></div>

In [20]:
# Import the function to calculate precision score
from sklearn.metrics import precision_score

# Predict whether employees will churn using the test set
prediction = model.predict(features_test)

# Calculate precision score by comparing target_test with the prediction
precision_score(target_test, prediction)

0.9240641711229947
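
<p class="">To connect the formula to the counts, here is a sketch that computes precision by hand from made-up labels (1 = left, 0 = stayed). The helper name <code>precision_from_labels</code> is hypothetical; in the exercise itself <code>sklearn</code>'s <code>precision_score</code> does this work.</p>

```python
def precision_from_labels(actual, predicted):
    """True Positives / (True Positives + False Positives)."""
    true_positives = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    false_positives = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    return true_positives / (true_positives + false_positives)


# Made-up labels: 1 = left, 0 = stayed
actual = [1, 1, 1, 1, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 1, 0, 0, 0]

# 2 True Positives and 1 False Positive give a precision of 2/3
precision = precision_from_labels(actual, predicted)
print(precision)
```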

<div class="exercise--assignment exercise--typography"><h1 class="exercise--title">Calculating accuracy metrics: recall</h1><div class=""><p>The Recall score is another important metric used to measure the accuracy of a classification algorithm. It is calculated as the <strong>fraction of True Positives over the sum of True Positives and False Negatives</strong>, or</p>
<p>$$\frac{\text{# of True Positives}}{\text{# of True Positives} + \text{# of False Negatives}}.$$</p>
<p>If there are no False Negatives, the recall score is equal to 1.
If there are no True Positives, the recall score is equal to 0.</p>
<p>In this exercise, you will calculate the recall score (using the sklearn function <code>recall_score</code>) for your initial classification model.</p>
<p>The variables <code>features_test</code> and <code>target_test</code> are available in your workspace.</p></div></div>
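<p class="">To make the recall formula concrete, here is a minimal sketch that counts True Positives and False Negatives by hand. The helper <code>recall_from_labels</code> and the toy labels are hypothetical and only serve as an illustration; in the exercise itself, <code>recall_score</code> from sklearn does this work.</p>

```python
def recall_from_labels(y_true, y_pred, positive=1):
    """Recall = TP / (TP + FN) for the class labelled `positive`."""
    pairs = list(zip(y_true, y_pred))
    true_positives = sum(1 for t, p in pairs if t == positive and p == positive)
    false_negatives = sum(1 for t, p in pairs if t == positive and p != positive)
    actual_positives = true_positives + false_negatives
    if actual_positives == 0:
        raise ValueError("recall is undefined when there are no actual positives")
    return true_positives / actual_positives


# Toy labels (hypothetical): 1 = employee left, 0 = employee stayed
toy_target = [1, 1, 1, 1, 0, 0, 0, 0]
toy_prediction = [1, 1, 1, 0, 0, 0, 1, 0]

# 3 of the 4 actual leavers were identified
toy_recall = recall_from_labels(toy_target, toy_prediction)
print(toy_recall)
```

<p class="">Note that the false alarm in the toy prediction (a stayer predicted as a leaver) has no effect on recall: only the actual positives enter the formula.</p>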

In [21]:
# Import the function to calculate recall score
from sklearn.metrics import recall_score

# Use the initial model to predict churn
prediction = model.predict(features_test)

# Calculate recall score by comparing target_test with the prediction
recall_score(target_test, prediction)

0.9632107023411371

<div class="exercise--assignment exercise--typography"><h1 class="exercise--title">Calculating the ROC/AUC score</h1><div class=""><p>While the recall score is an important metric for evaluating a classification algorithm, it only accounts for False Negatives and ignores False Positives. Precision, on the other hand, concentrates on False Positives.</p>
<p>The ROC curve takes both kinds of error into account: it plots the true positive rate (recall) against the false positive rate as the classification threshold varies. The area under the ROC curve is the AUC score.</p>
<p>In this exercise, you will calculate the ROC/AUC score for the initial model using the sklearn <code>roc_auc_score()</code> function.</p>
<p>The variables <code>features_test</code> and <code>target_test</code> are available in your workspace.</p></div></div>
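<p class="">As a rough illustration of what the AUC measures, the sketch below computes it as the share of (positive, negative) pairs in which the positive example receives the higher score, counting ties as one half. The function <code>pairwise_auc</code> and the toy data are hypothetical. The sketch also shows that when hard 0/1 class labels are passed instead of scores (for example probabilities from <code>predict_proba</code>), the AUC reduces to the average of recall and specificity.</p>

```python
def pairwise_auc(y_true, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy data (hypothetical): true labels and predicted churn scores
toy_target = [0, 0, 1, 1, 1]
toy_scores = [0.1, 0.6, 0.4, 0.7, 0.9]

# AUC computed from the continuous scores
auc_from_scores = pairwise_auc(toy_target, toy_scores)

# AUC computed from hard labels obtained with a 0.5 threshold
toy_labels = [1 if s >= 0.5 else 0 for s in toy_scores]
auc_from_labels = pairwise_auc(toy_target, toy_labels)

# With hard labels, AUC equals (recall + specificity) / 2
recall = 2 / 3       # 2 of the 3 positives are predicted as 1
specificity = 1 / 2  # 1 of the 2 negatives is predicted as 0

print(auc_from_scores, auc_from_labels, (recall + specificity) / 2)
```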

In [22]:
# Import the function to calculate ROC/AUC score
from sklearn.metrics import roc_auc_score

# Use initial model to predict churn (based on features of the test set)
prediction = model.predict(features_test)

# Calculate ROC/AUC score by comparing target_test with the prediction
roc_auc_score(target_test, prediction)

0.9691623087590718

<div class="exercise--assignment exercise--typography"><h1 class="exercise--title">Balancing classes</h1><div class=""><p>Class imbalance can significantly affect prediction results, as shown by the difference between the <code>recall</code> and <code>accuracy</code> scores. To address the imbalance, each class is usually given a weight inversely proportional to its frequency, so that both classes contribute equally to training. Setting the <code>class_weight</code> argument of <code>sklearn</code>'s <code>DecisionTreeClassifier</code> to <code>"balanced"</code> does this.</p>
<p>Let’s correct our model by solving its imbalance problem:</p>
<ul>
<li>first, you’re going to set up a model with balanced classes</li>
<li>then, you will fit it to the training data</li>
<li>finally, you will check its accuracy on the test set</li>
</ul>
<p>The variables <code>features_train</code>, <code>target_train</code>, <code>features_test</code> and <code>target_test</code> are already available in your workspace.</p></div></div>
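<p class="">For intuition, the weights behind <code>class_weight="balanced"</code> follow the formula <code>n_samples / (n_classes * class_count)</code>. The sketch below reproduces that formula in plain Python; the helper <code>balanced_class_weights</code> and the toy target are hypothetical and are not part of the exercise.</p>

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Toy target (hypothetical): 6 employees stayed (0), 2 left (1)
toy_target = [0] * 6 + [1] * 2
weights = balanced_class_weights(toy_target)
print(weights)

# After weighting, both classes carry the same total weight
total_weight_stayed = weights[0] * toy_target.count(0)
total_weight_left = weights[1] * toy_target.count(1)
print(total_weight_stayed, total_weight_left)
```

<p class="">The rarer class receives the larger weight, so misclassifying a leaver costs the tree more than misclassifying a stayer.</p>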

In [23]:
# Initialize the DecisionTreeClassifier with balanced class weights
model_depth_5_b = DecisionTreeClassifier(max_depth=5, class_weight="balanced", random_state=42)

# Fit the model
model_depth_5_b.fit(features_train, target_train)

# Print the accuracy of the prediction (in percentage points) for the test set
print(model_depth_5_b.score(features_test, target_test) * 100)

93.70666666666668


<div class="exercise--assignment exercise--typography"><h1 class="exercise--title">Comparison of Employee attrition models</h1><div class=""><p>In this exercise, your task is to compare the <strong>balanced</strong> and <strong>imbalanced</strong> (default) models using the pruned tree (<code>max_depth=7</code>). The <strong>imbalanced</strong> model has already been evaluated using the <strong>recall</strong> and <strong>ROC/AUC</strong> scores. Complete the same steps for the <strong>balanced</strong> model.</p>
<ul>
<li>The variables <code>features_train</code>, <code>target_train</code>, <code>features_test</code> and <code>target_test</code> are already available in your workspace.</li>
<li>An imbalanced model has already been fit for you, and its predictions have been saved as <code>prediction</code>.</li>
<li>The functions <code>recall_score()</code> and <code>roc_auc_score()</code> have been imported for you.</li>
</ul></div></div>

In [24]:
# Print the recall score of the imbalanced model
print(recall_score(target_test, prediction))
# Print the ROC/AUC score of the imbalanced model
print(roc_auc_score(target_test, prediction))

# Initialize the balanced model
model_depth_7_b = DecisionTreeClassifier(max_depth=7, class_weight="balanced", random_state=42)
# Fit it to the training set
model_depth_7_b.fit(features_train, target_train)
# Make predictions on the test set
prediction_b = model_depth_7_b.predict(features_test)
# Print the recall score for the balanced model
print(recall_score(target_test, prediction_b))
# Print the ROC/AUC score for the balanced model
print(roc_auc_score(target_test, prediction_b))

0.9632107023411371
0.9691623087590718
0.9319955406911928
0.959863876199084


<div class="exercise--assignment exercise--typography"><h1 class="exercise--title">Cross-validation using sklearn</h1><div class=""><p>As explained in Chapter 2, overfitting the dataset is a common problem in analytics. This happens when a model has learned the data too closely: it performs very well on the dataset it was trained on, but fails to generalize outside of it.</p>
<p>While the train/test split technique you learned in Chapter 2 lets you check that the model does not overfit the training set, <strong>hyperparameter tuning</strong> may result in overfitting the test set, since it consists of tuning the model to get the best prediction results on that test set. Therefore, it is recommended to validate the model on different testing sets. K-fold cross-validation allows us to achieve this:</p>
<ul>
<li>it splits the dataset into k parts (folds), using one fold as the testing set and the remaining folds as the training set</li>
<li>it fits the model, makes predictions and calculates a score (you can specify if you want the accuracy, precision, recall...)</li>
<li>it repeats the process k times in total, so that each fold serves as the testing set once</li>
<li>it outputs the k scores, which are usually averaged into a single figure</li>
</ul>
<p>In this exercise, you will use Cross Validation on our dataset, and evaluate our results with the <code>cross_val_score</code> function.</p></div></div>
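<p class="">The splitting step can be sketched in a few lines of plain Python. The generator <code>k_fold_indices</code> below is a hypothetical helper that produces consecutive, unshuffled folds; it is a simplification, since <code>cross_val_score</code> with a classifier and an integer <code>cv</code> uses stratified folds that preserve the class proportions.</p>

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k consecutive folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds get one additional sample
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


# Toy example (hypothetical): 10 rows split into 3 folds
folds = list(k_fold_indices(10, 3))
for train, test in folds:
    print("train on", len(train), "rows, test on rows", test)
```

<p class="">Every row appears in exactly one testing fold, which is why the k scores together cover the whole dataset.</p>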

In [25]:
# Import the function for implementing cross validation
from sklearn.model_selection import cross_val_score

# Use that function to print the cross validation scores for 10 folds
print(cross_val_score(model, features, target, cv=10))

[0.98533333 0.98533333 0.974      0.96533333 0.96       0.97933333
 0.99       0.99333333 1.         1.        ]
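<p class="">Since <code>cross_val_score</code> returns one score per fold, a common next step is to summarize them. The sketch below copies the ten scores printed above into a list (the variable names are hypothetical) and computes their mean and sample standard deviation with the standard library.</p>

```python
from statistics import mean, stdev

# The ten fold scores copied from the output above
fold_scores = [0.98533333, 0.98533333, 0.974, 0.96533333, 0.96,
               0.97933333, 0.99, 0.99333333, 1.0, 1.0]

mean_score = mean(fold_scores)   # average accuracy across folds
score_spread = stdev(fold_scores)  # how much the folds disagree

print(round(mean_score, 4), round(score_spread, 4))
```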


<div class="exercise--assignment exercise--typography"><h1 class="exercise--title">Setting up GridSearch parameters</h1><div class=""><p>A hyperparameter is a setting of the model that you choose before training, rather than a value learned from the data. For example, <code>max_depth</code> and <code>min_samples_leaf</code> are hyperparameters of <code>DecisionTreeClassifier()</code>. Hyperparameter tuning is the process of testing different values of hyperparameters to find the optimal ones: those that give the best predictions according to your objectives. In <code>sklearn</code>, you can use GridSearch to test different combinations of hyperparameters. Even better, you can use <code>GridSearchCV()</code> to test different combinations and run cross-validation on them in one function!</p>
<p>In this exercise, you are going to prepare the different values you want to test for <code>max_depth</code> and <code>min_samples_leaf</code>. You will then put these in a dictionary, because that’s what is required for <code>GridSearchCV()</code>:</p>
<ul>
<li>the dictionary keys will be the hyperparameter names</li>
<li>the dictionary values will be the lists of hyperparameter values you want to test</li>
</ul>
<p>Instead of writing all the values manually, you will use the <code>range()</code> function, which generates values incrementally. For example, <code>range(1, 10, 2)</code> generates values from 1 (included) to 10 (not included), in increments of 2. Converted to a list, the result is <code>[1, 3, 5, 7, 9]</code>.</p></div></div>
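<p class="">A quick sketch of how <code>range()</code> behaves and how its values end up in a parameter dictionary. The names <code>odd_values</code> and <code>toy_grid</code>, as well as the small value ranges, are hypothetical and chosen only to keep the output short.</p>

```python
# range(start, stop, step): start is included, stop is excluded
odd_values = range(1, 10, 2)
print(list(odd_values))

# A small parameter dictionary built the same way (toy values)
toy_grid = dict(
    max_depth=list(range(2, 5)),               # 2, 3, 4
    min_samples_leaf=list(range(10, 40, 10)),  # 10, 20, 30
)
print(toy_grid)
```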

In [26]:
# Generate values for maximum depth
depth = list(range(5, 21))

# Generate values for minimum sample size
samples = list(range(50, 500, 50))

# Create the dictionary with parameters to be checked
parameters = dict(max_depth=depth, min_samples_leaf=samples)

<p class="">Your parameters are generated! In the next exercise, you will use the <code>parameters</code> dictionary you just generated to find their optimal combination.</p>

<div class="exercise--assignment exercise--typography"><h1 class="exercise--title">Implementing GridSearch</h1><div class=""><p>You can now use the <code>sklearn</code> <code>GridSearchCV()</code> function to find the best combination of all of the <code>max_depth</code> and <code>min_samples_leaf</code> values you generated in the previous exercise.</p></div></div>
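<p class="">Conceptually, a grid search tries every combination in the dictionary and keeps the one with the best score. The sketch below shows that loop in plain Python. The function <code>grid_search</code> and the scoring function <code>toy_score</code> are hypothetical: <code>toy_score</code> stands in for the cross-validated score that <code>GridSearchCV()</code> would compute by fitting the model, and its peak at a depth of 6 and 100 samples per leaf is arbitrary.</p>

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Evaluate score_fn on every parameter combination and keep the best one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    n_candidates = 0
    for values in product(*(param_grid[name] for name in names)):
        candidate = dict(zip(names, values))
        score = score_fn(**candidate)
        n_candidates += 1
        if score > best_score:
            best_params, best_score = candidate, score
    return best_params, best_score, n_candidates


def toy_score(max_depth, min_samples_leaf):
    # Hypothetical stand-in for a cross-validated score (not a real model)
    return 1.0 - 0.01 * abs(max_depth - 6) - 0.001 * abs(min_samples_leaf - 100)


# Same grid shape as in the exercise: 16 depths x 9 leaf sizes
parameters = dict(max_depth=list(range(5, 21)),
                  min_samples_leaf=list(range(50, 500, 50)))

best_params, best_score, n_candidates = grid_search(parameters, toy_score)
print(n_candidates, best_params, best_score)
```

<p class="">The number of candidates grows multiplicatively with each hyperparameter added to the grid, which is worth keeping in mind because every candidate is fitted once per cross-validation fold.</p>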

In [27]:
# import the GridSearchCV function
from sklearn.model_selection import GridSearchCV

# set up parameters: done
parameters = dict(max_depth=depth, min_samples_leaf=samples)

# initialize the param_search function using the GridSearchCV function, initial model and parameters above
param_search = GridSearchCV(model, parameters)

# fit the param_search to the training dataset
param_search.fit(features_train, target_train)

# print the best parameters found
print(param_search.best_params_)

{'max_depth': 5, 'min_samples_leaf': 50}


<p class="">It looks like the values that give you the best score are a minimum of 50 samples per leaf and a maximum depth of 5.</p>

<div class="exercise--assignment exercise--typography"><h1 class="exercise--title">Sorting important features</h1><div class=""><p>Among other things, Decision Trees are very popular because of their interpretability. Many models can provide accurate predictions, but Decision Trees can also quantify the effect of the different features on the target. Here, they can tell you which features have the strongest and weakest impacts on the decision to leave the company. In <code>sklearn</code>, you can get this information by using the <code>feature_importances_</code> attribute.</p>
<p>In this exercise, you're going to get the quantified importance of each feature, save the values in a pandas DataFrame (a Pythonic table), and sort them from the most important to the least important. The <code>model_best</code> Decision Tree Classifier used in the previous exercises is available in your workspace, as well as the <code>features_test</code> and <code>features_train</code> variables.</p>
<p><code>pandas</code> has been imported as <code>pd</code>.</p></div></div>

In [28]:
# Calculate feature importances
feature_importances = model.feature_importances_

# Create a list of features: done
feature_list = list(features)

# Save the results inside a DataFrame using feature_list as an index
relative_importances = pd.DataFrame(index=feature_list, data=feature_importances, columns=["importance"])

# Sort the DataFrame to learn most important features
relative_importances.sort_values(by="importance", ascending=False)

                      importance
satisfaction            0.499958
evaluation              0.153005
time_spend_company      0.139567
number_of_projects      0.098877
average_montly_hours    0.088262
salary                  0.006250
technical               0.003929
support                 0.002248
sales                   0.001549
RandD                   0.001415


<p class="">It seems that satisfaction is by far the most impactful feature on the decision to leave the company or not.</p>

<div class="exercise--assignment exercise--typography"><h1 class="exercise--title">Selecting important features</h1><div class=""><p>In this exercise, your task is to select only the most important features that will be used by the final model. Remember, that the relative importances are saved in the column <code>importance</code> of the DataFrame called <code>relative_importances</code>.</p></div></div>

In [29]:
# select only features with relative importance higher than 1%
selected_features = relative_importances[relative_importances.importance>0.01]

# create a list from those features: done
selected_list = selected_features.index

# transform both features_train and features_test components to include only selected features
features_train_selected = features_train[selected_list]
features_test_selected = features_test[selected_list]

<p class="">As you can see, only 5 features have been retained out of the 17 original ones: <code>['satisfaction', 'evaluation', 'number_of_projects', 'average_montly_hours', 'time_spend_company']</code>. You’ve made sure to keep only these in your training and testing sets.</p>
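<p class="">The selection can be retraced in plain Python from the importances printed earlier in this section. Only the ten values shown in that output are used here; the remaining features are not listed there, so they are left out. The variable names are hypothetical.</p>

```python
# Importances copied from the sorted output shown earlier (ten listed features)
importances = {
    "satisfaction": 0.499958,
    "evaluation": 0.153005,
    "time_spend_company": 0.139567,
    "number_of_projects": 0.098877,
    "average_montly_hours": 0.088262,
    "salary": 0.00625,
    "technical": 0.003929,
    "support": 0.002248,
    "sales": 0.001549,
    "RandD": 0.001415,
}

threshold = 0.01  # keep features with relative importance above 1%

# Sort by importance (highest first) and keep those above the threshold
ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
selected = [name for name, value in ranked if value > threshold]

# Share of total importance carried by the retained features
retained_share = sum(importances[name] for name in selected)

print(selected)
print(round(retained_share, 4))
```

<p class="">The five retained features account for roughly 98% of the total importance, which is why dropping the others should cost little predictive power.</p>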

<div class="exercise--assignment exercise--typography"><h1 class="exercise--title">Develop and test the best model</h1><div class=""><p>In Chapter 3, you found out that the following parameters allow you to get a better model:</p>
<ul>
<li><code>max_depth = 8</code>,</li>
<li><code>min_samples_leaf = 150</code>,</li>
<li><code>class_weight = "balanced"</code></li>
</ul>
<p>In this chapter, you discovered that some of the features have a negligible impact. You realized that you could get accurate predictions using just a small number of selected, impactful features and you updated your training and testing set accordingly, creating the variables <code>features_train_selected</code> and <code>features_test_selected</code>.</p>
<p>With all this information at your disposal, you're now going to develop the best model for predicting employee turnover and evaluate it using the appropriate metrics.</p>
<p>The <code>features_train_selected</code> and <code>features_test_selected</code> variables are available in your workspace, and the <code>recall_score</code> and <code>roc_auc_score</code> functions have been imported for you.</p></div></div>
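<p class="">To see how the three metrics relate, the sketch below derives accuracy, recall and the ROC/AUC score from the four cells of a confusion matrix. The function <code>summarize</code> and the counts are hypothetical toy values, not results from this dataset. The AUC line assumes hard class labels are passed to <code>roc_auc_score</code>, as with the output of <code>predict()</code>; in that case the score equals the average of recall and specificity.</p>

```python
def summarize(tp, fn, tn, fp):
    """Derive accuracy, recall, specificity and hard-label AUC from counts."""
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    recall = tp / (tp + fn)            # share of actual leavers identified
    specificity = tn / (tn + fp)       # share of actual stayers identified
    auc_hard_labels = (recall + specificity) / 2
    return {
        "accuracy": accuracy,
        "recall": recall,
        "specificity": specificity,
        "auc_hard_labels": auc_hard_labels,
    }


# Toy confusion matrix (hypothetical): 100 leavers, 200 stayers
toy_metrics = summarize(tp=80, fn=20, tn=180, fp=20)
for name, value in toy_metrics.items():
    print(name, round(value * 100, 2))
```

<p class="">Because stayers outnumber leavers, accuracy sits closer to specificity than to recall, which is why recall and ROC/AUC are reported alongside it.</p>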

In [30]:
# Initialize the best model using parameters provided in description
model_best = DecisionTreeClassifier(max_depth=8, min_samples_leaf=150, class_weight="balanced", random_state=42)

# Fit the model using only selected features from training set: done
model_best.fit(features_train_selected, target_train)

# Make prediction based on selected list of features from test set
prediction_best = model_best.predict(features_test_selected)

# Print the general accuracy of the model_best
print(model_best.score(features_test_selected, target_test) * 100)

# Print the recall score of the model predictions
print(recall_score(target_test, prediction_best) * 100)

# Print the ROC/AUC score of the model predictions
print(roc_auc_score(target_test, prediction_best) * 100)

95.28
91.75027870680044
94.07002193314084
