# Predicting Data Science Job Changes

## Objective

In this analysis we examine the factors associated with people in a company's training program and build a model to predict which of them are more likely to change jobs. The aim is not to achieve the highest possible accuracy but to build a model using simple classification methods; more complex models are outside the scope of this project. Simpler models give a better understanding of the data before more complex ones are tried. They may also be enough to solve the problem, with the added benefit of more interpretable results.

## The Data

We use the data set provided by the "HR Analytics: Job Change of Data Scientists" Kaggle competition. This data set has 19158 rows and 14 columns.


Columns are as follows:

- enrollee_id: Unique ID for candidate

- city: City code

- city_development_index: Development index of the city (scaled)

- gender: Gender of candidate

- relevent_experience: Relevant experience of candidate

- enrolled_university: Type of university course enrolled in, if any

- education_level: Education level of candidate

- major_discipline: Education major discipline of candidate

- experience: Candidate total experience in years

- company_size: Number of employees in current employer's company

- company_type: Type of current employer

- last_new_job: Difference in years between previous job and current job

- training_hours: training hours completed

- target: 0 – Not looking for job change, 1 – Looking for a job change

We go through each variable separately: clean it, perform EDA, and prepare it for modeling, starting with the categorical variables and ending with the numeric ones. For all categorical variables we keep missing values and use them in the analysis as their own category, because missingness can carry valuable information and deleting those rows could change the character of the data set. For the numeric variables, missing values are filled with random samples from the same column because only a small number of values are missing.
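
As a rough illustration of keeping missing categorical values as their own category, here is a minimal sketch on a made-up frame. The `toy` data and the `cat_cols` list are assumptions for the example only and are not part of the notebook's pipeline.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data; the values are made up for illustration.
toy = pd.DataFrame({
    "gender": ["Male", np.nan, "Female", np.nan],
    "company_type": ["Pvt Ltd", "NGO", np.nan, "Pvt Ltd"],
})

# Treat "missing" as its own category instead of dropping the rows.
cat_cols = ["gender", "company_type"]
toy[cat_cols] = toy[cat_cols].fillna("na")

print(toy["gender"].value_counts())
```

Filling with a visible label such as "na" keeps every row in the data set and lets later plots and tests treat missingness like any other level.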

## Import Libraries and Read Data

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from matplotlib import pyplot as plt
import seaborn as sns
from scipy import stats
from scipy.stats import chi2_contingency
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_auc_score, roc_curve, auc, make_scorer
from sklearn.ensemble import RandomForestClassifier



# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


In [None]:
df = pd.read_csv('/kaggle/input/hr-analytics-job-change-of-data-scientists/aug_train.csv')
df.head()

# Cleaning and EDA

## Initial Exploration

In [None]:
df.shape

In [None]:
df.columns

In [None]:
df.info()

Here, we have 2 numeric variables and the rest are categorical variables. last_new_job and experience columns will later be changed to numeric data types giving us 4 numeric variables total.

In [None]:
df.isnull().sum()

We see that a little over half of the columns contain missing values so we will have to work with them in each column going forward. 

## Categorical Variables

### City

In [None]:
df['city'].value_counts()

There are 123 different cities in this data set.

In [None]:
plt.figure(figsize=(10,7))

df['city'].value_counts()[:20].sort_values(ascending=True).plot(kind='barh')
plt.title('Value Counts of Cities')
plt.xlabel('Frequency')
plt.ylabel('City Name');

These are the top 20 cities. City 103 has the most records, with more than 4,000. City 21 follows with almost 3,000 records.

We chose not to use this variable as a predictor because we do not know which cities these actually are and what their significance may be. 

### Gender

In [None]:
df['gender'].value_counts()

We choose to keep the missing data here and use it as a variable.

In [None]:
# Replace missing values (NaN) with the string 'na' so they form their own category.
df['gender'] = df['gender'].fillna('na')

Then we check na values in the dataframe for insights.

In [None]:
df[df['gender'] == 'na']

Nothing too out of the ordinary looking here.

Let's plot the data!

In [None]:
plt.figure(figsize=(15,7))

plt.subplot(121)
sns.countplot(data=df, x='gender')
plt.title('Count of Gender')
plt.xlabel('Gender')
plt.ylabel('Frequency');

plt.subplot(122)
sns.barplot(data=df, x='gender', y='target')
plt.title('Gender Proportion of Job Change')
plt.xlabel('Gender')
plt.ylabel('Proportion');

We see here that na values are more frequent than both Female and Other. In the proportion bar graph, na values have a higher share of job changers than any other value.

Let's take a closer look.

We create a table of counts for gender and the target.

In [None]:
gender_table = pd.crosstab(df['gender'], df['target'])
gender_table

Males make up most of the data set, about thirteen times as many records as females.

Now we take a look at a table of proportions.

In [None]:
pd.crosstab(df['gender'], df['target'], normalize='index')

With proportions we see that females are more likely to change jobs than men. More importantly, we see that missing values have the highest proportion of job changers.

It looks like gender may play a part in predicting job changes, so we run a chi-squared test to check whether the relationship is statistically significant.

In [None]:
stat, p, dof, expected = chi2_contingency(gender_table)

# interpret p-value 
alpha = 0.05
print("p value is " + str(p)) 
if p <= alpha: 
    print('Dependent (reject H0)') 
else: 
    print('Independent (H0 holds true)') 

We reject the null hypothesis because the p value is well below .05. It looks like gender will be a good feature to add to our model.
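
Since the same crosstab and chi-squared steps are repeated for each categorical variable, they could be wrapped in a small helper. The sketch below is one possible way to do that; the function name `chi2_independence` and the toy data are hypothetical and are not taken from the notebook.

```python
import pandas as pd
from scipy.stats import chi2_contingency


def chi2_independence(frame, feature, target="target", alpha=0.05):
    """Chi-squared test of independence between a categorical feature and the target."""
    table = pd.crosstab(frame[feature], frame[target])
    stat, p, dof, expected = chi2_contingency(table)
    return {"stat": stat, "p": p, "dof": dof, "reject_h0": bool(p <= alpha)}


# Made-up data: the "na" group changes jobs far more often than the "Male" group.
toy = pd.DataFrame({
    "gender": ["Male"] * 40 + ["na"] * 40,
    "target": [0] * 35 + [1] * 5 + [0] * 15 + [1] * 25,
})

result = chi2_independence(toy, "gender")
print(result)
```

With a sample as large as this data set, even small differences in proportions can produce a small p value, so it is worth reading the test alongside the proportion tables rather than on its own.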

### Relevant Experience

This variable records whether or not the candidate has relevant experience.



In [None]:
df['relevent_experience'].value_counts()

More than twice as many people have relevant experience as do not.

We replace "Has relevent experience" with "yes" and "No relevent experience" with "no" to simplify things.

In [None]:
# Replace 'Has relevent experience' with 'yes' and 'No relevent experience' with 'no' (spelling as in the data).
df['relevent_experience'] = df['relevent_experience'].replace('Has relevent experience', 'yes')
df['relevent_experience'] = df['relevent_experience'].replace('No relevent experience', 'no')


Plot the counts and proportions.

In [None]:
plt.figure(figsize=(15,7))

plt.subplot(121)
sns.countplot(data=df, x='relevent_experience')
plt.title('Relevant Experience Counts')
plt.xlabel('Has Relevant Experience')
plt.ylabel('Count');

plt.subplot(122)
sns.barplot(data=df, x='relevent_experience', y='target')
plt.title('Relevant Experience Proportion of Job Change')
plt.xlabel('Has Relevant Experience')
plt.ylabel('Proportion');


More than twice as many people have relevant experience as do not. People without relevant experience are more than 10 percentage points more likely to be looking for another job.

To make sure of this difference we do another chi-squared test.

In [None]:
exp_table = pd.crosstab(df['relevent_experience'], df['target'])

In [None]:
pd.crosstab(df['relevent_experience'], df['target'], normalize = 'index')

In [None]:
stat, p, dof, expected = chi2_contingency(exp_table)

# interpret p-value 
alpha = 0.05
print("p value is " + str(p)) 
if p <= alpha: 
    print('Dependent (reject H0)') 
else: 
    print('Independent (H0 holds true)') 

We reject the null hypothesis because we definetly have a p value lower than .05. It looks like relevant experience will be a good feature to add to our model.

### Enrolled University

Enrolled University tells us whether a person is in a full time course, part time course, or not enrolled in a university course.

In [None]:
df['enrolled_university'].value_counts()

Most people are not enrolled in a university course.

We choose to fill missing values with na and use na as a value in this column. 

In [None]:
df['enrolled_university'] = df['enrolled_university'].fillna('na')

In [None]:
df['enrolled_university'].value_counts()

na makes up 386 records in the data set.

Let's plot it!

In [None]:
plt.figure(figsize=(15,7))

plt.subplot(121)
sns.countplot(data=df, x='enrolled_university')
plt.title('University Course Counts')
plt.xlabel('University Enrollment')
plt.ylabel('Count');


plt.subplot(122)
sns.barplot(data=df, x='enrolled_university', y='target')
plt.title('University Course Proportion of Job Change')
plt.xlabel('University Enrollment')
plt.ylabel('Proportion');

Most people are not enrolled in a university course, and this group also has the lowest proportion of people seeking a job change. People enrolled in full time courses have the highest proportion, with na values in second place.

We use a chi-squared test to test these differences.

In [None]:
enrolled_table = pd.crosstab(df['enrolled_university'],df['target'])
pd.crosstab(df['enrolled_university'],df['target'], normalize = 'index')

In [None]:
stat, p, dof, expected = chi2_contingency(enrolled_table)

# interpret p-value 
alpha = 0.05
print("p value is " + str(p)) 
if p <= alpha: 
    print('Dependent (reject H0)') 
else: 
    print('Independent (H0 holds true)') 

We reject the null hypothesis because we definitely have a p value lower than .05. It looks like university enrollment will be a good feature to add to our model.

### Education Level

Education level of the candidate (Phd, Masters, Graduate, High School, Primary School)

In [None]:
df['education_level'].value_counts()

Most people in the dataset are graduates.


We choose to fill missing values with na and use na as a value in this column. 

In [None]:
df['education_level'] = df['education_level'].fillna('na')

In [None]:
df['education_level'].value_counts()

In [None]:
plt.figure(figsize=(15,7))

plt.subplot(121)
sns.countplot(data=df, x='education_level')
plt.title('Education Level Counts')
plt.xlabel('Education Level')
plt.ylabel('Count');


plt.subplot(122)
sns.barplot(data=df, x='education_level', y='target')
plt.title('Education Level Proportion of Job Change')
plt.xlabel('Education Level')
plt.ylabel('Proportion');


Graduates make up more than half of the data set, more than every other value put together; however, in the proportion of people seeking a job change they have only a small lead.

In [None]:
edu_table = pd.crosstab(df['education_level'],df['target'])

In [None]:
pd.crosstab(df['education_level'],df['target'], normalize = 'index')

In [None]:
stat, p, dof, expected = chi2_contingency(edu_table)

# interpret p-value 
alpha = 0.05
print("p value is " + str(p)) 
if p <= alpha: 
    print('Dependent (reject H0)') 
else: 
    print('Independent (H0 holds true)') 

Despite the differences being smaller, we still reject the null hypothesis because the p value is well below .05. Education level should be a good predictor.

### Major

College majors are put into general categories (STEM, Humanities, Other, Business Degree, Arts, No Major)

In [None]:
df['major_discipline'].value_counts()

STEM majors make up most of the data set.

We choose to fill missing values with na and use na as a value in this column. 

In [None]:
df['major_discipline'] = df['major_discipline'].fillna('na')

In [None]:
plt.figure(figsize=(15,7))

plt.subplot(121)
sns.countplot(data=df, x='major_discipline')
plt.title('Counts of Majors')
plt.xlabel('Major')
plt.ylabel('Count');

plt.subplot(122)
sns.barplot(data=df, x='major_discipline', y='target')
plt.title('Proportion of Majors of Job Change')
plt.xlabel('Major')
plt.ylabel('Proportion');

STEM degrees are more than 7 times as common as all the other majors put together. Proportionally, STEM degrees, business degrees, and other degrees are almost the same.

In [None]:
major_table = pd.crosstab(df['major_discipline'],df['target'])

In [None]:
pd.crosstab(df['major_discipline'],df['target'], normalize = 'index')

In [None]:
stat, p, dof, expected = chi2_contingency(major_table)

# interpret p-value 
alpha = 0.05
print("p value is " + str(p)) 
if p <= alpha: 
    print('Dependent (reject H0)') 
else: 
    print('Independent (H0 holds true)') 

We reject the null hypothesis because p is lower than .05 so major will be used for our model.

### Company Size

Company size is measured as a categorical variable with the values (<10, 10/49, 50-99, 100-500, 500-999, 1000-4999, 5000-9999, 10000+).

In [None]:
df['company_size'].value_counts()

The most common recorded company size in the data set is 50-99 employees.


We choose to fill missing values with na and use na as a value in this column.

In [None]:
df['company_size'] = df['company_size'].fillna('na')

In [None]:
plt.figure(figsize=(15,7))

plt.subplot(121)
sns.countplot(data=df, y='company_size')
plt.title('Company Size Counts')
plt.xlabel('Count')
plt.ylabel('Company Size');

plt.subplot(122)
sns.barplot(data=df, y='company_size', x='target')
plt.title('Company Size Proportions of Job Change')
plt.xlabel('Proportion')
plt.ylabel('Company Size');

The na values are the most frequent category, outnumbering even the 50-99 group. People with a missing company size are also roughly twice as likely to change jobs as people with any other value.

In [None]:
comp_size_table = pd.crosstab(df['company_size'],df['target'])
pd.crosstab(df['company_size'],df['target'], normalize = 'index')

In [None]:
stat, p, dof, expected = chi2_contingency(comp_size_table)

# interpret p-value 
alpha = 0.05
print("p value is " + str(p)) 
if p <= alpha: 
    print('Dependent (reject H0)') 
else: 
    print('Independent (H0 holds true)') 

We reject the null hypothesis because the p value is well below .05. It looks like company size will be a good feature to add to our model.

### Company Type
Company type has 6 values (Pvt Ltd, Funded Startup, Public Sector, Early Stage Startup, NGO, Other).

In [None]:
df['company_type'].value_counts()

Private companies take up most of the data set.

We choose to fill missing values with na and use na as a value in this column.



In [None]:
df['company_type'] = df['company_type'].fillna('na')

In [None]:
plt.figure(figsize=(22,7))

plt.subplot(121)
sns.countplot(data=df, y='company_type')
plt.title('Types of company Counts')
plt.xlabel('Count')
plt.ylabel('Company Type');

plt.subplot(122)
sns.barplot(data=df, y='company_type', x='target')
plt.title('Types of Company Proportion of Job Change')
plt.xlabel('Proportion')
plt.ylabel('Company Type');

Pvt Ltd has about 3 times as many records as all the other company types put together. For job change proportions, there is a close tie between early stage startups and other types of companies.

In [None]:
comp_type_table = pd.crosstab(df['company_type'],df['target'])
pd.crosstab(df['company_type'],df['target'], normalize = 'index')

In [None]:
stat, p, dof, expected = chi2_contingency(comp_type_table)

# interpret p-value 
alpha = 0.05
print("p value is " + str(p)) 
if p <= alpha: 
    print('Dependent (reject H0)') 
else: 
    print('Independent (H0 holds true)') 

We reject the null hypothesis because we definitely have a p value lower than .05. It looks like company type will be a good feature to add to our model.

## Continuous Variables

### City Development Index

In [None]:
df['city_development_index'].describe()

In [None]:
plt.hist(data=df, x='city_development_index');

Most of the records have a city development index higher than .9.

In [None]:
plt.figure(figsize=(22,7))

plt.subplot(121)
plt.hist(data=df[df['target'] == 0], x='city_development_index')

plt.subplot(122)
plt.hist(data=df[df['target'] == 1], x='city_development_index');

The histogram for people not looking for a job change (left) looks very similar to the distribution of all the records. The job changers (right) have a very high number of records between .6 and .7.

In [None]:
plt.figure(figsize=(10,10))
sns.boxplot(data=df, x='target', y='city_development_index')

This box plot shows a fairly large difference between the medians of job changers and non job changers.

Let's do a statistical test to make sure of this.

Here we use a Kruskal-Wallis test to see if there is a statistically significant difference.

In [None]:
no_change_df = df[df['target'] == 0]
change_df =  df[df['target'] == 1]

In [None]:
stats.kruskal(no_change_df['city_development_index'], change_df['city_development_index'])

We have a very small p-value that is below 0.05, so we reject the null hypothesis. This shows statistical significance so it should be a great variable to use for our model.
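
For readers unfamiliar with the call, the sketch below shows `scipy.stats.kruskal` on two made-up samples. The group names and the simulated values are assumptions for illustration only and do not come from the data set.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Made-up samples: the second group is shifted lower than the first.
group_a = rng.normal(loc=0.85, scale=0.10, size=500)
group_b = rng.normal(loc=0.75, scale=0.10, size=500)

# The test compares the groups through ranks, so it does not assume normality.
result = stats.kruskal(group_a, group_b)
print(result.statistic, result.pvalue)
```

The returned object exposes `statistic` and `pvalue`, which is what the notebook cells print when the call is the last line of a cell.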

### Experience

Experience is measured in years of experience going from less than 1 to more than 20.

In [None]:
df['experience'].value_counts()

We are going to change >20 to 21 and <1 to 0 so that the column has a consistent data type and the values are readable to our model.

Missing values are filled with random samples drawn from the observed values in this column.

In [None]:
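df['experience'] = df['experience'].replace('>20', 21)
df['experience'] = df['experience'].replace('<1', 0)
# Fill each missing value with its own random draw from the observed values
missing = df['experience'].isna()
df.loc[missing, 'experience'] = np.random.choice(df.loc[~missing, 'experience'].to_numpy(), size=missing.sum())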

Then change the data type to a float so it is numerical.

In [None]:
df['experience'] = df['experience'].astype('float64')

Now we check it out!

In [None]:
df['experience'].describe()

In [None]:
plt.hist(data=df, x='experience');

The largest group of records has 20 or more years of experience, followed by 4, 5, and 6 years of experience.

Let's look at job changers vs non-job changers.

In [None]:
plt.figure(figsize=(22,7))

plt.subplot(121)
plt.hist(data=df[df['target'] == 0], x='experience')

plt.subplot(122)
plt.hist(data=df[df['target'] == 1], x='experience');

When comparing both graphs we see that in the job changers histogram, there is a high frequency of records between 1-7 years of job experience and then another spike at 20 years and over. 

In [None]:
plt.figure(figsize=(10,10))
sns.boxplot(data=df, x='target', y='experience')

It looks like job changers tend to have fewer years of experience.

Let's do a Kruskal-Wallis test to see if that difference is significant.

In [None]:
no_change_df = df[df['target'] == 0]
change_df =  df[df['target'] == 1]

stats.kruskal(no_change_df['experience'], change_df['experience'])

We have a very small p-value that is below 0.05, so we reject the null hypothesis. This shows statistical significance so it should be a good variable to use for our model.

### Last New Job

Difference in years between previous job and current job.

In [None]:
df['last_new_job'].value_counts()

We will need to clean this up a bit:
- Replace >4 with 5, and never with 0 years between jobs.
- We fill missing data with random samples taken from the existing data set.
- We change the data type to a float so it is numerical.

In [None]:
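df['last_new_job'] = df['last_new_job'].replace('>4', 5)
df['last_new_job'] = df['last_new_job'].replace('never', 0)
# Fill each missing value with its own random draw from the observed values
missing = df['last_new_job'].isna()
df.loc[missing, 'last_new_job'] = np.random.choice(df.loc[~missing, 'last_new_job'].to_numpy(), size=missing.sum())
df['last_new_job'] = df['last_new_job'].astype('float64')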
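
The cleaning steps for this kind of column can be sketched on a toy series as follows. This is an illustration under assumptions: the `toy` values are made up, and the seeded generator `rng` is added here only so the random fill is reproducible.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded so the random fill is reproducible

toy = pd.Series([">4", "1", "never", np.nan, "2", np.nan], name="last_new_job")

# Map the open-ended labels to numbers, then convert everything to float.
cleaned = toy.replace({">4": "5", "never": "0"}).astype("float64")

# Draw one observed value per missing entry, so the fills are not all identical.
missing = cleaned.isna()
cleaned[missing] = rng.choice(cleaned[~missing].to_numpy(), size=int(missing.sum()))

print(cleaned.tolist())
```

Sampling only from the observed values avoids drawing a missing value by accident, and drawing one value per missing row keeps the filled column close to the original distribution.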

After cleaning, we visualize the variable.

In [None]:
plt.hist(data=df, x='last_new_job');

Most records have a difference of 1 year between jobs.

In [None]:
plt.figure(figsize=(22,7))

plt.subplot(121)
plt.hist(data=df[df['target'] == 0], x='last_new_job')

plt.subplot(122)
plt.hist(data=df[df['target'] == 1], x='last_new_job');

We do not see much of a difference in the histogram distributions.

Let's try a box plot!

In [None]:
plt.figure(figsize=(10,10))
sns.boxplot(data=df, x='target', y='last_new_job');

It seems that people who change jobs tend to have smaller year differences. People who have had longer gaps may not want to change jobs again.

We do another Kruskal-Wallis test.

In [None]:
no_change_df = df[df['target'] == 0]
change_df =  df[df['target'] == 1]

stats.kruskal(no_change_df['last_new_job'], change_df['last_new_job'])

The results of the test are significant with a p-value lower than .05. We will use this feature in our model.

### Training Hours

The number of hours of training for employee's current position.

In [None]:
df['training_hours'].describe()

In [None]:
plt.hist(data=df, x='training_hours');

There is a median of 47 training hours and most records are between 0 and 50 training hours.

Next we plot the distributions for job changers and non job changers.

In [None]:
plt.figure(figsize=(22,7))

plt.subplot(121)
plt.hist(data=df[df['target'] == 0], x='training_hours')

plt.subplot(122)
plt.hist(data=df[df['target'] == 1], x='training_hours');

There does not seem to be any differences in the distributions.

Let's try a boxplot.

In [None]:
plt.figure(figsize=(10,10))
sns.boxplot(data=df, x='target', y='training_hours')

Again the distributions look very similar.

In [None]:
no_change_df = df[df['target'] == 0]
change_df =  df[df['target'] == 1]

stats.kruskal(no_change_df['training_hours'], change_df['training_hours'])

In this Kruskal-Wallis test we get a p-value greater than .05, so we fail to reject the null hypothesis. We find no significant difference in training hours between job changers and non-job changers, so training hours will not be used for the model.

# Model Building

## Model Preparation

To prepare the model we must get dummy variables for all the categorical variables.
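
As a small illustration of what dummy encoding produces, the sketch below encodes a made-up gender column and leaves one level out as the reference category. The `toy` frame, the `gender_` prefix, and the choice of "Female" as the dropped level are assumptions for the example.

```python
import pandas as pd

toy = pd.DataFrame({"gender": ["Male", "Female", "na", "Other", "Male"]})

# One 0/1 indicator column per category.
dummies = pd.get_dummies(toy["gender"], prefix="gender", dtype=int)

# Drop one level so it acts as the reference category.
dummies = dummies.drop(columns="gender_Female")

print(dummies.columns.tolist())
```

Leaving one level out per variable avoids perfectly redundant columns, which matters most for linear models such as logistic regression.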

In [None]:
df[['male', 'other_gender', 'na_gender']] = pd.get_dummies(df['gender'])[['Male', 'Other', 'na']]
df['yes_relevant_experience'] = pd.get_dummies(df['relevent_experience'])['yes']
df[['no_enrollment', 'Full time course', 'Part time course']] = pd.get_dummies(df['enrolled_university'])[['no_enrollment', 'Full time course', 'Part time course']]
df[['Graduate','High School','Phd','Primary School','na_edu_level']] = pd.get_dummies(df['education_level'])[['Graduate','High School','Phd','Primary School','na']]
df[['Humanities', 'Other', 'Business Degree','Arts', 'No Major', 'na_major']] = pd.get_dummies(df['major_discipline'])[['Humanities', 'Other', 'Business Degree','Arts', 'No Major', 'na']]
df[['csize_na', 'csize_50-99', 'csize_100-500', 'csize_10000+', 'csize_1000-4999', 'csize_<10', 'csize_500-999', 'csize_5000-9999']] = pd.get_dummies(df['company_size'])[['na', '50-99', '100-500', '10000+', '1000-4999', '<10', '500-999', '5000-9999']]
df[['ctype_Pvt Ltd', 'ctype_Funded Startup', 'ctype_Public Sector', 'ctype_Early Stage Startup', 'ctype_NGO', 'ctype_Other']] = pd.get_dummies(df['company_type'])[['Pvt Ltd', 'Funded Startup', 'Public Sector', 'Early Stage Startup', 'NGO','Other']]

We assign the features to our X value and the target to our y value. Then we split the data to our training and validation sets.

In [None]:
X = df[['city_development_index','experience','last_new_job', 'male', 'other_gender',
       'na_gender', 'yes_relevant_experience', 'no_enrollment',
       'Full time course', 'Part time course', 'Graduate', 'High School',
       'Phd', 'Primary School', 'na_edu_level', 'Humanities', 'Other',
       'Business Degree', 'Arts', 'No Major', 'na_major', 'csize_na',
       'csize_50-99', 'csize_100-500', 'csize_10000+', 'csize_1000-4999',
       'csize_<10', 'csize_500-999', 'csize_5000-9999', 'ctype_Pvt Ltd',
       'ctype_Funded Startup', 'ctype_Public Sector',
       'ctype_Early Stage Startup', 'ctype_NGO', 'ctype_Other']]

y = df['target']



X_train, X_val, y_train, y_val = train_test_split(X,y, test_size=.20, random_state=0)

### Establishing Baseline Performance

To understand if our model holds any weight, we need to establish a baseline model to test our models against.

In [None]:
y_train.value_counts()

In [None]:
print('All Positive model equals:', (y_train == 1).mean())

print('All Negative model equals:', (y_train == 0).mean())

Since the all negative model has a higher accuracy we will be using it for our baseline. This means that our model must beat an accuracy score of 75.03%.
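
The same kind of majority-class baseline can also be expressed with scikit-learn's `DummyClassifier`. The sketch below uses made-up labels with a 75/25 split; `X_toy` and `y_toy` are assumptions for illustration and are not the notebook's training data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up labels with a 75/25 class imbalance.
y_toy = np.array([0] * 75 + [1] * 25)
X_toy = np.zeros((len(y_toy), 1))  # features are ignored by this strategy

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
baseline_acc = baseline.score(X_toy, y_toy)

print(baseline_acc)
```

Using an estimator for the baseline means it can be scored with exactly the same code path as the real models.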

## Model Selection

### Defining model functions

This is a function to plot the ROC curve for each model.

In [None]:
def plot_roc_curve(fpr, tpr):
    plt.plot(fpr, tpr, color='orange', label='ROC')
    plt.plot([0, 1], [0, 1], color='darkblue', linestyle='--')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver Operating Characteristic (ROC) Curve')
    plt.legend()
    plt.show()

This function efficiently trains each model on the training data and makes predictions for our validation set.

In [None]:
def fit_model(model):
    
    model.fit(X_train, y_train)
    val_preds = model.predict(X_val)
    print(pd.DataFrame(confusion_matrix(y_val,val_preds),\
            columns=["Predicted No", "Predicted Yes"],\
            index=["No","Yes"]))
    print('\n')
    print(classification_report(y_val, val_preds))
    
    probs = model.predict_proba(X_val)
    probs = probs[:, 1]
    fpr, tpr, thresholds = roc_curve(y_val, probs)
    plot_roc_curve(fpr,tpr)
    # AUC is computed from the predicted probabilities, not the hard class labels
    print('auc score: '+ str(roc_auc_score(y_val, probs)))
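
ROC AUC is normally computed from predicted probabilities, because thresholded labels discard the ranking information the metric relies on. The sketch below shows that the two inputs can give different scores; the arrays are made up for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 0, 1, 1, 1])
probs = np.array([0.10, 0.20, 0.60, 0.40, 0.70, 0.90])  # made-up scores

# Hard labels at a 0.5 threshold throw away the ordering within each side.
labels = (probs >= 0.5).astype(int)

auc_from_probs = roc_auc_score(y_true, probs)
auc_from_labels = roc_auc_score(y_true, labels)

print(auc_from_probs, auc_from_labels)
```

Passing class predictions collapses the ROC curve to a single operating point, so the resulting number describes one threshold rather than the model's ranking quality.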
    

## Model Fitting

### Logistic Regression

In [None]:
log_model = LogisticRegression(max_iter=1000)

fit_model(log_model)

We get an accuracy score of 77%, 2 percentage points higher than the baseline model.
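
One reason to start with logistic regression is that its coefficients can be read directly. The sketch below fits a model on synthetic data and lists the coefficients by feature name; the data-generating process and the `toy_X` and `toy_y` names are assumptions for illustration, not results from this data set.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000

toy_X = pd.DataFrame({
    "city_development_index": rng.uniform(0.4, 1.0, n),
    "yes_relevant_experience": rng.integers(0, 2, n),
})

# Synthetic target: a lower index makes a positive label more likely.
logit = 3.0 - 5.0 * toy_X["city_development_index"]
toy_y = (rng.uniform(size=n) < 1 / (1 + np.exp(-logit))).astype(int)

model = LogisticRegression(max_iter=1000).fit(toy_X, toy_y)

# A negative coefficient lowers the predicted odds of the positive class.
coefs = pd.Series(model.coef_[0], index=toy_X.columns).sort_values()
print(coefs)
```

Coefficients are on the log-odds scale and depend on each feature's units, so comparing their magnitudes is only meaningful for features on similar scales.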

### K-nearest Neighbors

In [None]:
knn_model = KNeighborsClassifier(n_neighbors=50)

fit_model(knn_model)

With KNN, we get an accuracy score of 75%, which is the same as the baseline model.

### Decision Tree Classifier

In [None]:
tree_model = DecisionTreeClassifier(criterion='entropy', max_depth=5, max_leaf_nodes=10 )

fit_model(tree_model)

The decision tree classifier gives us an accuracy score of 79%, 4 percentage points higher than the baseline model.

### Random Forest Classifier

In [None]:
rf_model = RandomForestClassifier(n_estimators=100)

fit_model(rf_model)

Lastly, we try a random forest classifier, giving us an accuracy score of 78%, 3 percentage points higher than the baseline.

# Conclusion

Out of all the models that we tried, the decision tree performed best. It beat the baseline model's accuracy by 4 percentage points, logistic regression by 2, K-nearest neighbors by 4, and the random forest classifier by 1. This means that if we want to predict which candidates are most likely to change jobs based on our features, we would use the decision tree model.


## Next Steps
We could deploy this model for real-world use; however, given our accuracy rates, it might be wise to try more complex models to improve accuracy. Deep-learning models, for example, may give better results, but that is outside the scope of this project.