* Fraudulent Job Postings
This notebook will use text from job postings to determine if a job posting is legitimate or fraudulent.  The original posting of the data can be found at the following link on Kaggle. https://www.kaggle.com/shivamb/real-or-fake-fake-jobposting-prediction

Import the necessary libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import joblib
pd.set_option('display.max_columns', None)

Read in Data and select information of interest

In [2]:
df = pd.read_csv("C:/Users/schne/Desktop/NLP_Example_Job_Text/fake_job_postings.csv")
df = df[df['description'].notna()]
df.head()

Text = df['description']
Y = df['fraudulent']

Text Formatting

In [3]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
Text_Vect = count_vect.fit_transform(Text)
Text_Vect.shape
count_vect.vocabulary_.get(u'algorithm')

from sklearn.feature_extraction.text import TfidfTransformer
tf_transformer = TfidfTransformer(use_idf=False).fit(Text_Vect)
Text_tf = tf_transformer.transform(Text_Vect)
Text_tf.shape

(17879, 62231)

Split Data into Training and Testing subsets

In [4]:
from sklearn.model_selection import train_test_split
Text_Training, Text_Test, Label_Training, Label_Test = train_test_split(Text_tf,
Y, test_size= 0.2, random_state= 7991)

Build Multiple Prediction Models and Metrics Functions

When looking at the two created classifiers below the SVM classifier performed the best with a True positive rate of 96% 
and a Positive Predicted Value of 100% which is the value of interest for this classifier due to the model not passing a fraudulent job into the real job listing.  This type of model would be useful for websites that hosts job listings where the model can be used to flag job postings for review by a person.  The model can also be used by a person seeking a job and running the job description through the model can determine if it is safe to provide a listing with sensitive information.

In [5]:
def Con_Max_2by2(Matrix, Model_Name):
    Positives= Matrix[0,0] + Matrix[0,1]
    Negatives= Matrix[1,0] + Matrix[1,1]
    Predicted_Postives = Matrix[0,0] + Matrix[1,0]
    Predicted_Negatives = Matrix[0,1] + Matrix[1,1]
    True_Positive_Rate = Matrix[0,0] / Positives
    True_Negative_Rate = Matrix[1,1] / Negatives
    Positive_Predicted_Value= Matrix[0,0] / (Matrix[0,0] + Matrix[1,0])
    Negative_Predicted_Value= Matrix[1,1] / (Matrix[0,1] + Matrix[1,1])
    NEW = print(f'Model: {Model_Name} \n',
    'True Positive Rate:',True_Positive_Rate,'\n',
    'True Negative Rate:', True_Negative_Rate, '\n',
    'Positive Predicted Value:', Positive_Predicted_Value, '\n',
    'Negative Predicted Value:', Negative_Predicted_Value)
    return True_Positive_Rate,True_Negative_Rate , Positive_Predicted_Value, Negative_Predicted_Value


In [6]:
from sklearn.naive_bayes import MultinomialNB
NB_classifier = MultinomialNB().fit(Text_Training, Label_Training)
NB_Test=NB_classifier.predict(Text_Test)
from sklearn.metrics import confusion_matrix
print("NB Classifier:\n","Accuarcy:",np.mean(NB_Test == Label_Test))
print("Confusion Matrix \n",confusion_matrix(NB_Test, Label_Test))
NB_Output = confusion_matrix(NB_Test, Label_Test)
NB_2by2 = Con_Max_2by2(NB_Output, 'NB')
joblib.dump(NB_classifier, 'Job_NB_Classifier')

NB Classifier:
 Accuarcy: 0.9493847874720358
Confusion Matrix 
 [[3395  178]
 [   3    0]]
Model: NB 
 True Postive Rate: 0.9501819199552197 
 True Negative Rate: 0.0 
 Positive Predicted Value: 0.9991171277221895 
 Negative Predicted Value: 0.0


['Job_NB_Classifier']

In [7]:
from sklearn.linear_model import SGDClassifier
SVM_classifier = SGDClassifier().fit(Text_Training, Label_Training)
SVM_Test = SVM_classifier.predict(Text_Test)
print("SVM Classifier:\n","Accuarcy:",np.mean(SVM_Test == Label_Test))
print("Confusion Matrix \n",confusion_matrix(SVM_Test, Label_Test))
SVM_Output = confusion_matrix(SVM_Test, Label_Test)
SVM_2by2 = Con_Max_2by2(SVM_Output, 'SVM')
joblib.dump(SVM_classifier, 'Job_SVM_Classifier')

SVM Classifier:
 Accuarcy: 0.9633668903803132
Confusion Matrix 
 [[3398  131]
 [   0   47]]
Model: SVM 
 True Postive Rate: 0.9628790025502976 
 True Negative Rate: 1.0 
 Positive Predicted Value: 1.0 
 Negative Predicted Value: 0.2640449438202247


['Job_SVM_Classifier']