# Modeling
---

For my evaluation metric, I will use accuracy because I want to minimize both false positives and false negatives. Of the two errors, I think it would be worse to apply to a fraudulent job listing, because whoever posted it could potentially access sensitive personal information. However, I also do not want to miss out on job opportunities by skipping an authentic listing that was wrongly classified as fraudulent.
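
Because accuracy weighs false positives and false negatives equally, it can help to look at the two error types separately. The sketch below is a minimal illustration and not part of the original analysis: `confusion_counts` and `accuracy` are hypothetical helpers, the labels are made-up toy data, and it assumes `1` marks a fraudulent listing, as in the `fraudulent` column.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary target."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1
        elif predicted == positive:
            fp += 1
        elif actual == positive:
            fn += 1
        else:
            tn += 1
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn}


def accuracy(counts):
    """Share of all predictions that were correct."""
    total = sum(counts.values())
    return (counts["tp"] + counts["tn"]) / total


# Toy labels (hypothetical): 1 = fraudulent, 0 = authentic
y_true = [0, 0, 0, 0, 1, 1, 0, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 0, 0, 0, 0]

counts = confusion_counts(y_true, y_pred)
print(counts)
print(f"Accuracy: {accuracy(counts):.2%}")
```

With fraudulent as the positive class, a false negative is a fraudulent listing classified as authentic (the riskier error described above), and a false positive is an authentic listing flagged as fraudulent.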

For my baseline, I will predict that every job listing is authentic, since the majority of the listings in this dataset are authentic.

In [16]:
import pandas as pd
import numpy as np
import wrangle as w
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

In [4]:
train, validate, test = w.split_data(w.wrangle_jobs(), 'fraudulent')
train.shape

(9818, 23)

In [8]:
# create x and y versions of train, validate, and test samples
# drop the target along with the columns not used for modeling
drop_cols = ['fraudulent', 'continent', 'employment_type', 'required_education', 'required_experience']

x_train = train.drop(columns=drop_cols)
y_train = train.fraudulent

x_validate = validate.drop(columns=drop_cols)
y_validate = validate.fraudulent

x_test = test.drop(columns=drop_cols)
y_test = test.fraudulent

## Baseline

In [15]:
# compute baseline accuracy
print(f'Baseline Accuracy on Train: {(y_train == 0).mean():.2%}')
print(f'Baseline Accuracy on Validate: {(y_validate == 0).mean():.2%}')

Baseline Accuracy on Train: 95.16%
Baseline Accuracy on Validate: 95.18%


## Model 1: Decision Tree
For this model, I will be using all of the features remaining in `x_train`, a max depth of 3, and a random state of 123 for reproducibility.

In [20]:
# create decision tree object
clf = DecisionTreeClassifier(max_depth=3, random_state=123)
# fit object to train
clf = clf.fit(x_train, y_train)
# make predictions
y_pred1 = clf.predict(x_train)
y_pred_v1 = clf.predict(x_validate)
# compute and print accuracy
print(f'Model 1 Train Accuracy: {accuracy_score(y_train, y_pred1):.2%}')
print(f'Model 1 Validate Accuracy: {accuracy_score(y_validate, y_pred_v1):.2%}')

Model 1 Train Accuracy: 95.16%
Model 1 Validate Accuracy: 95.18%
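
Model 1's accuracy matches the baseline to two decimal places on both samples. One possible explanation (an assumption, not something the output above confirms) is that the tree predicts authentic for nearly every listing. A minimal sketch for checking this follows; `prediction_counts` is a hypothetical helper and the predictions are toy data.

```python
from collections import Counter


def prediction_counts(y_pred):
    """Return how many times each class was predicted."""
    return dict(Counter(y_pred))


# Toy predictions (hypothetical): a model that only ever predicts 0 (authentic)
toy_pred = [0] * 20

counts = prediction_counts(toy_pred)
print(counts)

# A single key means the model never predicted the other class
print(f"Predicts only one class: {len(counts) == 1}")
```

In the notebook, calling `prediction_counts(y_pred_v1)` would show whether the tree ever predicts a fraudulent listing on validate.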


## Model 2: Random Forest
For this model, I will again be using all of the features remaining in `x_train`. I will set minimum samples per leaf to 1 and max depth to 10. I will also set the random state to 123 so my work can be reproduced.

In [13]:
# create random forest object
rf = RandomForestClassifier(min_samples_leaf=1, max_depth=10, random_state=123)
# fit object to train sample
rf.fit(x_train, y_train)
# make predictions
y_pred2 = rf.predict(x_train)
y_pred_v2 = rf.predict(x_validate)
# compute and print accuracy
print(f'Model 2 Train Accuracy: {accuracy_score(y_train, y_pred2):.2%}')
print(f'Model 2 Validate Accuracy: {accuracy_score(y_validate, y_pred_v2):.2%}')

Model 2 Train Accuracy: 95.44%
Model 2 Validate Accuracy: 95.58%


## Model 3: Random Forest
For this model, I will only be using the features that started off as int dtypes (`telecommuting`, `has_company_logo`, and `has_questions`). I will leave the hyperparameters the same as in Model 2.

In [14]:
# features that started off as int dtypes
int_cols = ['telecommuting', 'has_company_logo', 'has_questions']
# create random forest object
rf2 = RandomForestClassifier(min_samples_leaf=1, max_depth=10, random_state=123)
# fit object to train sample
rf2.fit(x_train[int_cols], y_train)
# make predictions
y_pred3 = rf2.predict(x_train[int_cols])
y_pred_v3 = rf2.predict(x_validate[int_cols])
# compute and print accuracy
print(f'Model 3 Train Accuracy: {accuracy_score(y_train, y_pred3):.2%}')
print(f'Model 3 Validate Accuracy: {accuracy_score(y_validate, y_pred_v3):.2%}')

Model 3 Train Accuracy: 95.16%
Model 3 Validate Accuracy: 95.18%


My best model is Model 2. It has an accuracy of 95.44% on train and 95.58% on validate, both of which beat the baseline (95.16% and 95.18%). Models 1 and 3 only matched the baseline on both samples. Next, I will run Model 2 on the test dataset.
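
The comparison above can also be written out explicitly. This is a small sketch using the scores printed earlier in this section; the dictionary layout and variable names are illustrative only.

```python
# Accuracy scores copied from the printed outputs above (as fractions)
baseline = {"train": 0.9516, "validate": 0.9518}
models = {
    "Model 1": {"train": 0.9516, "validate": 0.9518},
    "Model 2": {"train": 0.9544, "validate": 0.9558},
    "Model 3": {"train": 0.9516, "validate": 0.9518},
}

# Pick the model with the highest validate accuracy
best = max(models, key=lambda name: models[name]["validate"])

# Report each model's validate accuracy and its difference from the baseline
for name, scores in models.items():
    lift = scores["validate"] - baseline["validate"]
    print(f"{name}: validate {scores['validate']:.2%} ({lift:+.2%} vs. baseline)")

print(f"Best on validate: {best}")
```

Selecting on validate rather than train accuracy keeps the test sample untouched until the final evaluation.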

In [21]:
# re-create Model 2 with the same hyperparameters as above
rf = RandomForestClassifier(min_samples_leaf=1, max_depth=10, random_state=123)
# fit object to train sample
rf.fit(x_train, y_train)
# make predictions
y_pred_t2 = rf.predict(x_test)
# compute and print accuracy
print(f'Model 2 Test Accuracy: {accuracy_score(y_test, y_pred_t2):.2%}')

Model 2 Test Accuracy: 95.32%
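
The test score is easier to interpret next to a majority-class baseline computed on the same sample, as was done for train and validate. A minimal sketch, assuming `0` marks an authentic listing: `baseline_accuracy` is a hypothetical helper, and the labels are toy data, not the actual test sample.

```python
def baseline_accuracy(y_true, majority_class=0):
    """Accuracy obtained by predicting the majority class for every row."""
    y_true = list(y_true)
    matches = sum(1 for label in y_true if label == majority_class)
    return matches / len(y_true)


# Toy labels (hypothetical): 19 authentic listings and 1 fraudulent one
toy_labels = [0] * 19 + [1]
print(f"Baseline Accuracy: {baseline_accuracy(toy_labels):.2%}")
```

In the notebook, `baseline_accuracy(y_test)` would give the figure to compare against the 95.32% test accuracy.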
