# INTRODUCTION

##### Fake Job Description Prediction
This dataset contains 17,880 job descriptions, of which 866 are fake. It includes both textual information and meta-information about each job. The dataset can be used to train classification models that learn to identify fraudulent job descriptions.

#### Data source
http://emscad.samos.aegean.gr/

### Importing the necessary libraries

In [1]:
import pandas as pd
import numpy as np

## Loading the dataset

In [2]:
data = pd.read_csv("fake_job_postings.csv")
data.head()

Unnamed: 0,job_id,title,location,department,salary_range,company_profile,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function,fraudulent
0,1,Marketing Intern,"US, NY, New York",Marketing,,"We're Food52, and we've created a groundbreaki...","Food52, a fast-growing, James Beard Award-winn...",Experience with content management systems a m...,,0,1,0,Other,Internship,,,Marketing,0
1,2,Customer Service - Cloud Video Production,"NZ, , Auckland",Success,,"90 Seconds, the worlds Cloud Video Production ...",Organised - Focused - Vibrant - Awesome!Do you...,What we expect from you:Your key responsibilit...,What you will get from usThrough being part of...,0,1,0,Full-time,Not Applicable,,Marketing and Advertising,Customer Service,0
2,3,Commissioning Machinery Assistant (CMA),"US, IA, Wever",,,Valor Services provides Workforce Solutions th...,"Our client, located in Houston, is actively se...",Implement pre-commissioning and commissioning ...,,0,1,0,,,,,,0
3,4,Account Executive - Washington DC,"US, DC, Washington",Sales,,Our passion for improving quality of life thro...,THE COMPANY: ESRI – Environmental Systems Rese...,"EDUCATION: Bachelor’s or Master’s in GIS, busi...",Our culture is anything but corporate—we have ...,0,1,0,Full-time,Mid-Senior level,Bachelor's Degree,Computer Software,Sales,0
4,5,Bill Review Manager,"US, FL, Fort Worth",,,SpotSource Solutions LLC is a Global Human Cap...,JOB TITLE: Itemization Review ManagerLOCATION:...,QUALIFICATIONS:RN license in the State of Texa...,Full Benefits Offered,0,1,1,Full-time,Mid-Senior level,Bachelor's Degree,Hospital & Health Care,Health Care Provider,0


In [3]:
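# Note: without parentheses, `data.info` only echoes the bound method (shown below).
# Call data.info() to print the column dtypes and non-null counts instead.
data.info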

<bound method DataFrame.info of        job_id                                              title  \
0           1                                   Marketing Intern   
1           2          Customer Service - Cloud Video Production   
2           3            Commissioning Machinery Assistant (CMA)   
3           4                  Account Executive - Washington DC   
4           5                                Bill Review Manager   
...       ...                                                ...   
17875   17876                   Account Director - Distribution    
17876   17877                                 Payroll Accountant   
17877   17878  Project Cost Control Staff Engineer - Cost Con...   
17878   17879                                   Graphic Designer   
17879   17880                         Web Application Developers   

                   location   department salary_range  \
0          US, NY, New York    Marketing          NaN   
1            NZ, , Auckland      Succ

## EDA

In [4]:
data.isna().sum()

job_id                     0
title                      0
location                 346
department             11547
salary_range           15012
company_profile         3308
description                1
requirements            2695
benefits                7210
telecommuting              0
has_company_logo           0
has_questions              0
employment_type         3471
required_experience     7050
required_education      8105
industry                4903
function                6455
fraudulent                 0
dtype: int64

In [5]:
data.fraudulent.value_counts()

0    17014
1      866
Name: fraudulent, dtype: int64

#### 0 stands for real job postings, while 1 stands for fake job postings. The classes are heavily imbalanced: only 866 of the 17,880 postings (about 4.8%) are fake.
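With so few fake postings, accuracy alone says little about a model. As a rough illustration that uses only the counts printed above, the snippet below computes the share of fake postings and the accuracy of a trivial baseline that always predicts 0. The variable names are illustrative and not part of the notebook.

```python
# Class counts copied from the data.fraudulent.value_counts() output above
counts = {0: 17014, 1: 866}
total = sum(counts.values())

# Share of postings labelled fake (class 1)
fraud_rate = counts[1] / total

# Accuracy of a "model" that labels every posting as real (class 0)
baseline_accuracy = counts[0] / total

print(f"fake share: {fraud_rate:.3f}")
print(f"always-predict-0 accuracy: {baseline_accuracy:.3f}")
```

Any classifier trained on this data should be judged against that baseline, and on the precision and recall of class 1 rather than on overall accuracy.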

In [6]:
data.columns

Index(['job_id', 'title', 'location', 'department', 'salary_range',
       'company_profile', 'description', 'requirements', 'benefits',
       'telecommuting', 'has_company_logo', 'has_questions', 'employment_type',
       'required_experience', 'required_education', 'industry', 'function',
       'fraudulent'],
      dtype='object')

In [7]:
# Strip stray leading/trailing whitespace from the column names
data.columns = data.columns.str.strip()
data.columns

Index(['job_id', 'title', 'location', 'department', 'salary_range',
       'company_profile', 'description', 'requirements', 'benefits',
       'telecommuting', 'has_company_logo', 'has_questions', 'employment_type',
       'required_experience', 'required_education', 'industry', 'function',
       'fraudulent'],
      dtype='object')

In [8]:
# Drop every column except the description text and the fraudulent label
columns_to_drop = ['job_id', 'title', 'location', 'department', 'salary_range',
       'company_profile', 'requirements', 'benefits',
       'telecommuting', 'has_company_logo', 'has_questions', 'employment_type',
       'required_experience', 'required_education', 'industry', 'function']
data = data.drop(columns=columns_to_drop)
data.head()

Unnamed: 0,description,fraudulent
0,"Food52, a fast-growing, James Beard Award-winn...",0
1,Organised - Focused - Vibrant - Awesome!Do you...,0
2,"Our client, located in Houston, is actively se...",0
3,THE COMPANY: ESRI – Environmental Systems Rese...,0
4,JOB TITLE: Itemization Review ManagerLOCATION:...,0


In [9]:
data.isnull().sum()

description    1
fraudulent     0
dtype: int64

In [10]:
# Drop the rows with a missing (NaN) description
data.dropna(subset=['description'], inplace=True)
data.isnull().sum()


description    0
fraudulent     0
dtype: int64
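The two cleaning steps above (keep only `description` and `fraudulent`, then drop rows with a missing description) can also be written by selecting the columns to keep instead of listing the ones to drop. The sketch below shows this on a small made-up frame; `toy` and `kept` are hypothetical names, and the rows are not from the real dataset.

```python
import pandas as pd

# Tiny stand-in for the real data; values are invented for illustration
toy = pd.DataFrame({
    "job_id": [1, 2, 3],
    "title": ["A", "B", "C"],
    "description": ["text one", None, "text three"],
    "fraudulent": [0, 0, 1],
})

# Keep only the text and the label, then drop rows without a description
kept = (
    toy[["description", "fraudulent"]]
    .dropna(subset=["description"])
    .reset_index(drop=True)
)
print(kept)
```

Selecting the wanted columns is shorter here and does not break if the source file gains extra columns later.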

## Train Test Split

In [11]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(data.description, data.fraudulent, test_size=0.2)
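The split above is random and unseeded, so the exact rows, and the number of fake postings that land in the test set, change from run to run. One possible variation, sketched here on toy data rather than the real dataset, passes `stratify` and `random_state` so that the class ratio is preserved and the split is reproducible. The toy lists and the value 42 are assumptions for illustration only.

```python
from sklearn.model_selection import train_test_split

# Toy data: 20 "postings", 4 of them labelled fake (1)
texts = [f"posting {i}" for i in range(20)]
labels = [1 if i < 4 else 0 for i in range(20)]

X_tr, X_te, y_tr, y_te = train_test_split(
    texts,
    labels,
    test_size=0.25,
    stratify=labels,   # keep the same fake/real ratio in both parts
    random_state=42,   # fixed seed so the split is repeatable
)

print(len(X_tr), len(X_te))
print(sum(y_tr), sum(y_te))
```

With a rare positive class, stratifying avoids test sets that happen to contain very few fake postings.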

In [12]:
X_train.shape

(14303,)

In [13]:
X_test.shape

(3576,)

In [14]:
type(X_train)

pandas.core.series.Series

In [15]:
X_train[:14]

7565     We are currently seeking a Customer Service Te...
65       We are a boutique digital agency based in Auck...
4606     Do you have HR experience? Are you looking to ...
15999    If working in a cubical seems like your idea o...
15198    John’s Family Grill is a local restaurant that...
9105     We are looking for a highly motivated and qual...
7682     Title: Senior Media PlannerLocation: Boca Rato...
7783     Summer Internships @ Jungle VenturesPositions:...
3886     As the TradeGecko Head of Operations, you will...
17452    We are the first online-driven solution provid...
2566     BI Developer - AnalystJob Details:- Requiremen...
7697     International Cultural Exchange Services (ICES...
10396    As Director of Software Engineering's newly fo...
14469    Critical Nurse Staffing, Inc. is looking for a...
Name: description, dtype: object

In [16]:
y_train

7565     0
65       0
4606     0
15999    0
15198    0
        ..
16538    0
14353    0
17074    0
5194     0
15542    0
Name: fraudulent, Length: 14303, dtype: int64

In [17]:
y_test

12678    0
7205     0
14411    0
3021     0
14723    0
        ..
13640    0
5719     0
9369     0
6337     0
14572    0
Name: fraudulent, Length: 3576, dtype: int64

In [18]:
type(y_train)

pandas.core.series.Series

In [19]:
type(X_train.values)

numpy.ndarray

## Create bag of words representation using CountVectorizer

In [20]:
from sklearn.feature_extraction.text import CountVectorizer

v = CountVectorizer()

X_train_cv = v.fit_transform(X_train.values)
X_train_cv

<14303x55299 sparse matrix of type '<class 'numpy.int64'>'
	with 1571535 stored elements in Compressed Sparse Row format>

In [21]:
# Densify only the first row instead of the whole 14303 x 55299 matrix
X_train_cv[0].toarray()[0]

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

In [22]:
X_train_cv.shape

(14303, 55299)

In [23]:
v.get_feature_names_out()[30862]

'nouba'

In [24]:
v.vocabulary_

{'we': 52420,
 'are': 3915,
 'currently': 11633,
 'seeking': 42237,
 'customer': 11657,
 'service': 42471,
 'team': 47064,
 'lead': 25850,
 'the': 47956,
 'right': 40806,
 'candidate': 7431,
 'will': 52835,
 'be': 5457,
 'an': 2858,
 'integral': 23788,
 'part': 32997,
 'of': 31267,
 'our': 32415,
 'talented': 46774,
 'supporting': 46141,
 'continued': 10647,
 'growth': 20988,
 'this': 48136,
 'position': 34971,
 'located': 26612,
 'in': 22866,
 'philadelphia': 33886,
 'pa': 32705,
 'location': 26615,
 'responsibilities': 40003,
 'include': 22921,
 'but': 7041,
 'not': 30817,
 'limited': 26374,
 'to': 48513,
 'coordinate': 10863,
 'work': 53071,
 'for': 19256,
 'associates': 4418,
 'process': 36009,
 'mail': 27023,
 'deliver': 12622,
 'scan': 41774,
 'out': 32429,
 'packages': 32728,
 'and': 3049,
 'image': 22597,
 'documents': 14503,
 'ensure': 16482,
 'each': 15155,
 'document': 14434,
 'is': 24423,
 'scanned': 41776,
 'interact': 23885,
 'with': 52961,
 'scanning': 41784,
 'software'

In [25]:
X_train_np = X_train_cv.toarray()
X_train_np[0]

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

In [26]:
np.where(X_train_np[0]!=0)

(array([ 1583,  1954,  2636,  2833,  2858,  3049,  3690,  3915,  4144,
         4230,  4308,  4407,  4418,  4452,  5283,  5396,  5457,  5529,
         5878,  6863,  6912,  7041,  7079,  7431,  7615,  9527,  9682,
         9902, 10142, 10478, 10647, 10863, 11633, 11657, 12199, 12622,
        12670, 13975, 14434, 14503, 15033, 15155, 16032, 16482, 16830,
        16850, 17399, 18516, 18797, 18962, 19151, 19256, 20988, 21084,
        21097, 21460, 21764, 22597, 22603, 22866, 22921, 23069, 23586,
        23788, 23885, 24423, 24485, 25713, 25850, 26179, 26374, 26612,
        26615, 26715, 27023, 27148, 27267, 27450, 28085, 28315, 28733,
        28756, 29096, 30817, 30948, 31122, 31267, 31281, 31555, 31749,
        32415, 32429, 32705, 32728, 32997, 33531, 33749, 33886, 34779,
        34971, 35988, 36009, 36057, 36157, 36359, 36769, 36796, 37270,
        37500, 37664, 37741, 38299, 38335, 39232, 39426, 39473, 39891,
        40003, 40409, 40605, 40639, 40704, 40806, 41399, 41774, 41776,
      

In [27]:
X_train

7565     We are currently seeking a Customer Service Te...
65       We are a boutique digital agency based in Auck...
4606     Do you have HR experience? Are you looking to ...
15999    If working in a cubical seems like your idea o...
15198    John’s Family Grill is a local restaurant that...
                               ...                        
16538    Escrow Officer -Houston A well established, we...
14353    Bring it all together as our new Associate Cre...
17074    This is who we are: Network Closing Services, ...
5194     The Service Delivery Manager 1 will be located...
15542    Crypteia Networks is looking for a devops engi...
Name: description, Length: 14303, dtype: object

In [29]:
# Look up the full text of the posting with index label 7565
X_train.loc[7565]

'We are currently seeking a Customer Service Team Lead. The right candidate will be an integral part of our talented team, supporting our continued growth. This position will be located in our Philadelphia, PA location.Responsibilities include, but are not limited to:Coordinate work for Customer Service Associates.Process mail, deliver mail, scan in/out packages and deliver mail/packages.Scan/Image\xa0to include; scan documents, ensure each document is scanned,\xa0interact with scanning software to indicate when a batch is complete, perform quality assurance and review images, perform quality assurance of documents that have been flagged by the system, and complete Productivity Sheet to track project progress and provide numbers for billing purposesRun mail meter and inserter equipmentHandle time-off requests and day-to-day processes of the teamHelp resolve employee and customer concerns/issuesAdministrative services/processing large volume reports using excel and assisting manager wit

## Train the Naive Bayes model

In [30]:
from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB()
model.fit(X_train_cv, y_train)

MultinomialNB()
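For intuition, the following sketch reproduces by hand what a multinomial Naive Bayes classifier computes: a log prior per class plus the word counts weighted by smoothed log word probabilities. It assumes Laplace smoothing with `alpha = 1.0` (the scikit-learn default for `MultinomialNB`) and uses a made-up 3-word count matrix, not the notebook's data.

```python
import numpy as np

# Toy count matrix: 3 documents x 3 vocabulary words (invented values)
X = np.array([[2, 1, 0],
              [1, 0, 1],
              [0, 2, 3]])
y = np.array([0, 0, 1])
alpha = 1.0  # Laplace smoothing strength

classes = np.unique(y)

# log P(class): fraction of training documents in each class
log_prior = np.log(np.array([(y == c).mean() for c in classes]))

# log P(word | class): smoothed word counts, normalised per class
feature_counts = np.array([X[y == c].sum(axis=0) for c in classes])
smoothed = feature_counts + alpha
log_likelihood = np.log(smoothed / smoothed.sum(axis=1, keepdims=True))

# Score a new document: log prior + sum over words of count * log P(word | class)
new_doc = np.array([0, 1, 2])
scores = log_prior + log_likelihood @ new_doc
predicted = classes[np.argmax(scores)]

print(scores, predicted)
```

The class with the highest score wins. Because the prior enters every score, a rare class such as fake postings needs strong word evidence to be predicted.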

In [31]:
X_test_cv = v.transform(X_test)

## Evaluate Performance

In [32]:
from sklearn.metrics import classification_report

y_pred = model.predict(X_test_cv)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.97      0.99      0.98      3392
           1       0.75      0.43      0.54       184

    accuracy                           0.96      3576
   macro avg       0.86      0.71      0.76      3576
weighted avg       0.96      0.96      0.96      3576
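The row for class 1 is the one that matters here: a recall of 0.43 means that more than half of the fake postings in the test set were missed. The sketch below shows how precision, recall, and F1 follow from confusion counts. The counts used (79 true positives, 27 false positives, 105 false negatives) are an assumption: they are values that happen to be consistent with the rounded report above, not numbers taken from the notebook.

```python
def prf(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, f1) for one class from its confusion counts."""
    precision = tp / (tp + fp)   # of the postings flagged fake, how many were fake
    recall = tp / (tp + fn)      # of the fake postings, how many were flagged
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts for class 1, chosen to match the rounded report
tp, fp, fn = 79, 27, 105
precision, recall, f1 = prf(tp, fp, fn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

To obtain the actual counts, `sklearn.metrics.confusion_matrix(y_test, y_pred)` can be called on the notebook's own predictions.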



In [33]:
# Sanity check on two short texts that are not job postings
emails = [
    'Hey mohan, can we get together to watch footbal game tomorrow?',
    'Upto 20% discount on parking, exclusive offer just for you. Dont miss this reward!'
]

emails_count = v.transform(emails)
model.predict(emails_count)

array([0, 0], dtype=int64)

In [34]:
emails = ['We saw your profile on Indeed and thought you would be a great match for the Mathematics Instructor-NUC Online opportunity.Please submit a quick application incase you have an interest.',
          'We saw your profile on Indeed and thought you would be a great match for the Digital Data Analyst opportunity. Please submit a quick application if you have any interest.'
]
emails_count = v.transform(emails)
model.predict(emails_count)

array([0, 0], dtype=int64)

## Train the model with a scikit-learn Pipeline to reduce the amount of code

In [35]:
from sklearn.pipeline import Pipeline

clf = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('nb', MultinomialNB())
])

In [36]:
clf.fit(X_train, y_train)

Pipeline(steps=[('vectorizer', CountVectorizer()), ('nb', MultinomialNB())])
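The benefit of the pipeline is that vectorizing and classifying happen in one object, so raw strings can be passed straight to `fit` and `predict` with no separate `transform` call. A minimal sketch on four invented training texts (not from the dataset) illustrates this; `toy_clf` and the sample texts are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Invented training texts: 0 = real posting, 1 = fake posting
train_texts = [
    "software engineer python team office",
    "accountant payroll office experience",
    "earn money fast from home no experience",
    "work from home earn cash daily",
]
train_labels = [0, 0, 1, 1]

toy_clf = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("nb", MultinomialNB()),
])
toy_clf.fit(train_texts, train_labels)

# Raw strings go in; the pipeline vectorizes them with the fitted vocabulary
preds = toy_clf.predict([
    "earn cash from home",
    "python software engineer office",
])
print(preds)
```

Because the vectorizer is fitted inside the pipeline, the test data is always transformed with the training vocabulary, which removes one common source of mistakes.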

## Evaluate Performance

In [37]:
y_pred = clf.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.97      0.99      0.98      3392
           1       0.75      0.43      0.54       184

    accuracy                           0.96      3576
   macro avg       0.86      0.71      0.76      3576
weighted avg       0.96      0.96      0.96      3576

