<div style="text-align: center; background-color: white; font-family: 'Poppins', sans-serif; color: black; padding: 20px;font-size: 24px; line-height: 2; overflow:hidden"> NLP-based Fake Job Posting Detection Model </div>

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

In [3]:
df = pd.read_csv(r"C:\Users\User\Desktop\ARATHI\Dataset\NLP\fake_job_postings.csv")
df.head()

Unnamed: 0,job_id,title,location,department,salary_range,company_profile,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function,fraudulent
0,1,Marketing Intern,"US, NY, New York",Marketing,,"We're Food52, and we've created a groundbreaki...","Food52, a fast-growing, James Beard Award-winn...",Experience with content management systems a m...,,0,1,0,Other,Internship,,,Marketing,0
1,2,Customer Service - Cloud Video Production,"NZ, , Auckland",Success,,"90 Seconds, the worlds Cloud Video Production ...",Organised - Focused - Vibrant - Awesome!Do you...,What we expect from you:Your key responsibilit...,What you will get from usThrough being part of...,0,1,0,Full-time,Not Applicable,,Marketing and Advertising,Customer Service,0
2,3,Commissioning Machinery Assistant (CMA),"US, IA, Wever",,,Valor Services provides Workforce Solutions th...,"Our client, located in Houston, is actively se...",Implement pre-commissioning and commissioning ...,,0,1,0,,,,,,0
3,4,Account Executive - Washington DC,"US, DC, Washington",Sales,,Our passion for improving quality of life thro...,THE COMPANY: ESRI – Environmental Systems Rese...,"EDUCATION: Bachelor’s or Master’s in GIS, busi...",Our culture is anything but corporate—we have ...,0,1,0,Full-time,Mid-Senior level,Bachelor's Degree,Computer Software,Sales,0
4,5,Bill Review Manager,"US, FL, Fort Worth",,,SpotSource Solutions LLC is a Global Human Cap...,JOB TITLE: Itemization Review ManagerLOCATION:...,QUALIFICATIONS:RN license in the State of Texa...,Full Benefits Offered,0,1,1,Full-time,Mid-Senior level,Bachelor's Degree,Hospital & Health Care,Health Care Provider,0


The fraudulent column is the target variable for classification. It is binary:

0 → Legitimate (Real) Job Posting 

1 → Fraudulent (Fake) Job Posting 

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17880 entries, 0 to 17879
Data columns (total 18 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   job_id               17880 non-null  int64 
 1   title                17880 non-null  object
 2   location             17534 non-null  object
 3   department           6333 non-null   object
 4   salary_range         2868 non-null   object
 5   company_profile      14572 non-null  object
 6   description          17879 non-null  object
 7   requirements         15184 non-null  object
 8   benefits             10668 non-null  object
 9   telecommuting        17880 non-null  int64 
 10  has_company_logo     17880 non-null  int64 
 11  has_questions        17880 non-null  int64 
 12  employment_type      14409 non-null  object
 13  required_experience  10830 non-null  object
 14  required_education   9775 non-null   object
 15  industry             12977 non-null  object
 16  func

In [9]:
df.columns

Index(['job_id', 'title', 'location', 'department', 'salary_range',
       'company_profile', 'description', 'requirements', 'benefits',
       'telecommuting', 'has_company_logo', 'has_questions', 'employment_type',
       'required_experience', 'required_education', 'industry', 'function',
       'fraudulent'],
      dtype='object')

In [5]:
data = df[['company_profile', 'description', 'requirements', 'fraudulent']].copy()  # copy so later edits do not act on a view of df
data.head()

Unnamed: 0,company_profile,description,requirements,fraudulent
0,"We're Food52, and we've created a groundbreaki...","Food52, a fast-growing, James Beard Award-winn...",Experience with content management systems a m...,0
1,"90 Seconds, the worlds Cloud Video Production ...",Organised - Focused - Vibrant - Awesome!Do you...,What we expect from you:Your key responsibilit...,0
2,Valor Services provides Workforce Solutions th...,"Our client, located in Houston, is actively se...",Implement pre-commissioning and commissioning ...,0
3,Our passion for improving quality of life thro...,THE COMPANY: ESRI – Environmental Systems Rese...,"EDUCATION: Bachelor’s or Master’s in GIS, busi...",0
4,SpotSource Solutions LLC is a Global Human Cap...,JOB TITLE: Itemization Review ManagerLOCATION:...,QUALIFICATIONS:RN license in the State of Texa...,0


In [7]:
# Check class distribution
print(df['fraudulent'].value_counts())

fraudulent
0    17014
1      866
Name: count, dtype: int64
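With 866 fraudulent postings out of 17,880, accuracy alone is a weak yardstick. The short arithmetic sketch below uses the counts printed above (the variable names are illustrative) to show that a classifier labelling every posting as legitimate would already reach about 95% accuracy while catching no fraud at all:

```python
# Class counts copied from the value_counts() output above.
legit = 17014
fraud = 866
total = legit + fraud

# Share of postings that are fraudulent.
fraud_rate = fraud / total

# Accuracy of a trivial model that always predicts class 0 (legitimate).
baseline_accuracy = legit / total

print(f"Fraud rate: {fraud_rate:.2%}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.2%}")
```

This is why the precision, recall, and F1 of class 1 are the more informative numbers for this dataset.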


In [9]:
data.shape

(17880, 4)

In [11]:
# Fill NaN values with empty strings
data = data.fillna("")

In [13]:
# Combine all text fields into a single column
data['text'] = data['company_profile'] + " " + data['description'] + " " + data['requirements']

In [15]:
# Keep only the combined text and the label
data_new = data[['text', 'fraudulent']].copy()
data_new.head()

Unnamed: 0,text,fraudulent
0,"We're Food52, and we've created a groundbreaki...",0
1,"90 Seconds, the worlds Cloud Video Production ...",0
2,Valor Services provides Workforce Solutions th...,0
3,Our passion for improving quality of life thro...,0
4,SpotSource Solutions LLC is a Global Human Cap...,0


In [2]:
import re
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report

In [18]:
# Download necessary NLTK data
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [19]:
# Modify stopwords list to exclude 'not'
stop_words = set(stopwords.words('english'))
stop_words.discard('not')

In [23]:
# Initialize stemmer
stemmer = PorterStemmer()

In [25]:
# Text Preprocessing Function
def clean_text(text):
    text = text.lower()  # Convert to lowercase
    text = text.translate(str.maketrans('', '', string.punctuation))  # Remove punctuation
    text = ' '.join([stemmer.stem(word) for word in text.split() if word not in stop_words])  # Remove stopwords & apply stemming
    return text
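One side effect of deleting punctuation outright is that words separated only by a punctuation mark get fused, which is likely how tokens such as "awardwin" and "service90" arise in the cleaned output. A possible alternative, sketched here with the standard library only, maps punctuation to spaces instead. The helper name strip_punctuation_with_spaces is hypothetical and not part of the notebook:

```python
import string

# Translation table that maps every punctuation character to a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def strip_punctuation_with_spaces(text):
    """Lowercase, replace punctuation with spaces, and collapse repeated whitespace."""
    return " ".join(text.lower().translate(PUNCT_TO_SPACE).split())

sample = "award-winning service.90 Seconds"

# Behaviour of the deletion approach used in clean_text.
deleted = sample.lower().translate(str.maketrans("", "", string.punctuation))
# Behaviour of the space-replacement variant.
spaced = strip_punctuation_with_spaces(sample)

print(deleted)  # awardwinning service90 seconds
print(spaced)   # award winning service 90 seconds
```

Whether this helps the classifier is an empirical question. It changes the vocabulary, so the models would need to be retrained and re-evaluated.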

In [27]:
# Apply cleaning
data_new['cleaned_text'] = data_new['text'].apply(clean_text)
data_new.head()

Unnamed: 0,text,fraudulent,cleaned_text
0,"We're Food52, and we've created a groundbreaki...",0,food52 weve creat groundbreak awardwin cook si...
1,"90 Seconds, the worlds Cloud Video Production ...",0,90 second world cloud video product service90 ...
2,Valor Services provides Workforce Solutions th...,0,valor servic provid workforc solut meet need c...
3,Our passion for improving quality of life thro...,0,passion improv qualiti life geographi heart ev...
4,SpotSource Solutions LLC is a Global Human Cap...,0,spotsourc solut llc global human capit manag c...


In [29]:
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

# Split into train and test
X_train, X_test, y_train, y_test = train_test_split(data_new['cleaned_text'], data_new['fraudulent'], test_size=0.2, random_state=42)

In [31]:
# Convert text to numerical features using TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=10000)  # Convert text to TF-IDF vectors
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)
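To make the weighting concrete, here is a minimal pure-Python sketch of TF-IDF on a toy corpus. It assumes the scikit-learn defaults (raw term counts, smoothed idf of the form ln((1 + n) / (1 + df)) + 1, and L2 normalisation of each row). The three tokenised documents and the function name tfidf_row are made up for illustration:

```python
import math

# Hypothetical cleaned and tokenised postings.
docs = [["sale", "execut"], ["cook", "site"], ["sale", "cook"]]

def tfidf_row(doc, corpus):
    """Return L2-normalised TF-IDF weights for one document."""
    n_docs = len(corpus)
    weights = {}
    for term in set(doc):
        tf = doc.count(term)                              # raw term frequency
        df = sum(term in other for other in corpus)       # documents containing the term
        idf = math.log((1 + n_docs) / (1 + df)) + 1       # smoothed inverse document frequency
        weights[term] = tf * idf
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

row0 = tfidf_row(docs[0], docs)
print(row0)
```

In the first document, "execut" appears in only one posting and so gets a larger weight than "sale", which appears in two. The max_features=10000 setting additionally restricts the vocabulary to the most frequent terms in the training corpus.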

In [33]:
# Apply SMOTE to balance classes
smote = SMOTE(random_state=42)
X_train_balanced, y_train_balanced = smote.fit_resample(X_train_tfidf, y_train)

In [35]:
# Check class balance after SMOTE
print(pd.Series(y_train_balanced).value_counts())

fraudulent
0    13619
1    13619
Name: count, dtype: int64
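SMOTE balances the classes by generating synthetic minority samples rather than duplicating existing ones. Roughly, it picks a minority sample, picks one of its nearest minority-class neighbours, and places a new point somewhere on the line segment between them. The sketch below illustrates only that interpolation step, using made-up feature vectors and a hypothetical helper name. The neighbour search that imblearn performs is omitted.

```python
import random

def smote_like_sample(x, neighbor, rng):
    """Return a point on the segment between x and neighbor (the SMOTE interpolation step)."""
    lam = rng.random()  # interpolation weight in [0, 1)
    return [xi + lam * (ni - xi) for xi, ni in zip(x, neighbor)]

rng = random.Random(42)
minority_a = [0.0, 0.2, 0.5]  # hypothetical TF-IDF rows of two fraudulent postings
minority_b = [0.4, 0.2, 0.1]
synthetic = smote_like_sample(minority_a, minority_b, rng)

# Counts derived from the notebook outputs: 866 fraudulent postings overall,
# 181 of them in the test set, and 13619 legitimate postings in the training set.
train_legit = 13619
train_fraud = 866 - 181
synthetic_needed = train_legit - train_fraud

print(synthetic)
print(synthetic_needed)  # 12934
```

By this count, roughly 95% of the fraudulent examples the models train on are synthetic. Because SMOTE is applied after the split and only to the training data, the test set still reflects the original class balance.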


In [37]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Train a Logistic Regression model
model = LogisticRegression()
model.fit(X_train_balanced, y_train_balanced)

In [39]:
# Predict on the test set
y_pred = model.predict(X_test_tfidf)

In [41]:
# Evaluate performance
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

Accuracy: 0.9664429530201343
              precision    recall  f1-score   support

           0       0.99      0.97      0.98      3395
           1       0.63      0.84      0.72       181

    accuracy                           0.97      3576
   macro avg       0.81      0.91      0.85      3576
weighted avg       0.97      0.97      0.97      3576
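The numbers in the report can be cross-checked by hand. The sketch below recomputes the fraud-class F1 score from the rounded precision and recall shown above and converts the accuracy into an error count. The variable names are illustrative:

```python
# Fraud-class (label 1) values copied from the logistic regression report above.
precision = 0.63
recall = 0.84

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

# Number of misclassified test postings implied by the reported accuracy.
accuracy = 0.9664429530201343
test_size = 3576
errors = round((1 - accuracy) * test_size)

print(round(f1, 2))  # 0.72
print(errors)        # 120
```

A precision of 0.63 means that roughly one in three postings flagged as fraudulent is actually legitimate, which the 97% accuracy figure hides.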



In [43]:
# Train and evaluate Naive Bayes model
nb_model = MultinomialNB()
nb_model.fit(X_train_balanced, y_train_balanced)
y_pred_nb = nb_model.predict(X_test_tfidf)
print("Naive Bayes Accuracy:", accuracy_score(y_test, y_pred_nb))
print("Naive Bayes Report:\n", classification_report(y_test, y_pred_nb))

Naive Bayes Accuracy: 0.941834451901566
Naive Bayes Report:
               precision    recall  f1-score   support

           0       0.99      0.94      0.97      3395
           1       0.46      0.88      0.61       181

    accuracy                           0.94      3576
   macro avg       0.73      0.91      0.79      3576
weighted avg       0.97      0.94      0.95      3576



In [45]:
# Train and evaluate Support Vector Machine (SVM) model
svm_model = SVC()
svm_model.fit(X_train_balanced, y_train_balanced)
y_pred_svm = svm_model.predict(X_test_tfidf)
print("SVM Accuracy:", accuracy_score(y_test, y_pred_svm))
print("SVM Report:\n", classification_report(y_test, y_pred_svm))

SVM Accuracy: 0.9823825503355704
SVM Report:
               precision    recall  f1-score   support

           0       0.98      1.00      0.99      3395
           1       1.00      0.65      0.79       181

    accuracy                           0.98      3576
   macro avg       0.99      0.83      0.89      3576
weighted avg       0.98      0.98      0.98      3576



In [47]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

In [51]:
dt = DecisionTreeClassifier()
dt.fit(X_train_balanced, y_train_balanced)
y_pred_dt = dt.predict(X_test_tfidf)
print("\nDecision Tree Model Performance:")
print(f"Accuracy: {accuracy_score(y_test, y_pred_dt):.4f}")
print(classification_report(y_test, y_pred_dt))


Decision Tree Model Performance:
Accuracy: 0.9499
              precision    recall  f1-score   support

           0       0.99      0.96      0.97      3395
           1       0.50      0.74      0.60       181

    accuracy                           0.95      3576
   macro avg       0.74      0.85      0.79      3576
weighted avg       0.96      0.95      0.95      3576



In [53]:
rf = RandomForestClassifier(n_estimators=500, random_state=42)
rf.fit(X_train_balanced, y_train_balanced)
y_pred_rf = rf.predict(X_test_tfidf)
print("\nRandom Forest Model Performance:")
print(f"Accuracy: {accuracy_score(y_test, y_pred_rf):.4f}")
print(classification_report(y_test, y_pred_rf))


Random Forest Model Performance:
Accuracy: 0.9807
              precision    recall  f1-score   support

           0       0.98      1.00      0.99      3395
           1       0.95      0.65      0.77       181

    accuracy                           0.98      3576
   macro avg       0.97      0.83      0.88      3576
weighted avg       0.98      0.98      0.98      3576
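Since every model scores between 94% and 98% accuracy, the fraud-class metrics are the more useful basis for comparison. The sketch below collects the class 1 figures from the reports above and ranks the models. The counts of missed postings are approximate because they are derived from rounded recall values:

```python
# Fraud-class (label 1) metrics copied from the classification reports above.
reported = {
    "Logistic Regression": {"precision": 0.63, "recall": 0.84, "f1": 0.72},
    "Naive Bayes":         {"precision": 0.46, "recall": 0.88, "f1": 0.61},
    "SVM":                 {"precision": 1.00, "recall": 0.65, "f1": 0.79},
    "Decision Tree":       {"precision": 0.50, "recall": 0.74, "f1": 0.60},
    "Random Forest":       {"precision": 0.95, "recall": 0.65, "f1": 0.77},
}

fraud_in_test = 181  # support of class 1 in the test set

best_f1 = max(reported, key=lambda name: reported[name]["f1"])
best_recall = max(reported, key=lambda name: reported[name]["recall"])

# Approximate number of fraudulent postings each model fails to flag.
missed = {name: round((1 - m["recall"]) * fraud_in_test) for name, m in reported.items()}

print("Best F1:", best_f1)          # SVM
print("Best recall:", best_recall)  # Naive Bayes
print(missed)
```

The SVM has the highest F1 and never raises a false alarm on this test set, but it lets about 63 of the 181 fraudulent postings through. If missing a fake posting costs more than a false alarm, logistic regression or Naive Bayes may be the better trade-off.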



In [55]:
from sklearn.ensemble import GradientBoostingClassifier

In [57]:
# Create the Gradient Boosting model
gbm = GradientBoostingClassifier(
    n_estimators=300,     
    learning_rate=0.1,   
    max_depth=4,         
    min_samples_split=15,      
    random_state=42
)

# Train the model
gbm.fit(X_train_balanced, y_train_balanced)

# Predict with the gradient boosting model (gbm), not the random forest
y_pred_gbm = gbm.predict(X_test_tfidf)
print("\nGradient Boosting Model Performance:")
print(f"Accuracy: {accuracy_score(y_test, y_pred_gbm):.4f}")
print(classification_report(y_test, y_pred_gbm))



In [60]:
import joblib

# Save the model
joblib.dump(svm_model, "fake_job_detector_svm.pkl")

# Save the TF-IDF vectorizer (to transform text later)
joblib.dump(vectorizer, "tfidf_vectorizer.pkl")

print("Model and vectorizer saved successfully!")

Model and vectorizer saved successfully!


In [62]:
# Load the model
loaded_model = joblib.load("fake_job_detector_svm.pkl")

# Load the TF-IDF vectorizer
loaded_vectorizer = joblib.load("tfidf_vectorizer.pkl")

print("Model and vectorizer loaded successfully!")

Model and vectorizer loaded successfully!


In [64]:
def predict_job(text):
    cleaned_text = clean_text(text)  # Apply the same preprocessing used for training
    text_tfidf = loaded_vectorizer.transform([cleaned_text])  # Vectorize the cleaned text, not the raw input
    prediction = loaded_model.predict(text_tfidf)[0]  # Get prediction (0 or 1)
    return "FAKE JOB" if prediction == 1 else "REAL JOB"
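The vectorizer was fitted on stemmed, punctuation-free text, so its vocabulary consists of stems such as "creat" and "groundbreak". Text passed in at prediction time needs the same cleaning, otherwise most of its tokens will not match any vocabulary entry. The sketch below illustrates the mismatch with the standard library, using the default TfidfVectorizer token pattern and a six-token vocabulary taken from the first cleaned row shown earlier. The real vocabulary has up to 10,000 entries.

```python
import re

# Default token pattern of scikit-learn's TfidfVectorizer: words of two or more characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

# First stemmed tokens of row 0, copied from the cleaned_text column.
fitted_vocab = {"food52", "weve", "creat", "groundbreak", "awardwin", "cook"}

raw = "We're Food52, and we've created a groundbreaking and award-winning cooking site."
raw_tokens = TOKEN_PATTERN.findall(raw.lower())

# Tokens of the uncleaned text that the fitted vocabulary would recognise.
matched = [tok for tok in raw_tokens if tok in fitted_vocab]

print(raw_tokens)
print(matched)  # ['food52']
```

Only one of the thirteen raw tokens survives, so an uncleaned posting would reach the model as a nearly empty feature vector.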

In [5]:
sample_data = data_new.loc[:5]  # .loc slices by label and includes the end point, so this selects rows 0 through 5

In [76]:
list(sample_data['text'])

["We're Food52, and we've created a groundbreaking and award-winning cooking site. We support, connect, and celebrate home cooks, and give them everything they need in one place.We have a top editorial, business, and engineering team. We're focused on using technology to find new and better ways to connect people around their specific food interests, and to offer them superb, highly curated information about food and cooking. We attract the most talented home cooks and contributors in the country; we also publish well-known professionals like Mario Batali, Gwyneth Paltrow, and Danny Meyer. And we have partnerships with Whole Foods Market and Random House.Food52 has been named the best food website by the James Beard Foundation and IACP, and has been featured in the New York Times, NPR, Pando Daily, TechCrunch, and on the Today Show.We're located in Chelsea, in New York City. Food52, a fast-growing, James Beard Award-winning online food community and crowd-sourced and curated recipe hub

In [78]:
# Example Prediction
new_job = "We're Food52, and we've created a groundbreaking and award-winning cooking site. We support, connect, and celebrate home cooks, and give them everything they need in one place.We have a top editorial, business, and engineering team. We're focused on using technology to find new and better ways to connect people around their specific food interests, and to offer them superb, highly curated information about food and cooking. We attract the most talented home cooks and contributors in the country; we also publish well-known professionals like Mario Batali, Gwyneth Paltrow, and Danny Meyer. And we have partnerships with Whole Foods Market and Random House.Food52 has been named the best food website by the James Beard Foundation and IACP, and has been featured in the New York Times, NPR, Pando Daily, TechCrunch, and on the Today Show.We're located in Chelsea, in New York City. Food52, a fast-growing, James Beard Award-winning online food community and crowd-sourced and curated recipe hub, is currently interviewing full- and part-time unpaid interns to work in a small team of editors, executives, and developers in its New York City headquarters.Reproducing and/or repackaging existing Food52 content for a number of partner sites, such as Huffington Post, Yahoo, Buzzfeed, and more in their various content management systemsResearching blogs and websites for the Provisions by Food52 Affiliate ProgramAssisting in day-to-day affiliate program support, such as screening affiliates and assisting in any affiliate inquiriesSupporting with PR &amp; Events when neededHelping with office administrative work, such as filing, mailing, and preparing for meetingsWorking with developers to document bugs and suggest improvements to the siteSupporting the marketing and executive staff Experience with content management systems a major plus (any blogging counts!)Familiar with the Food52 editorial voice and aestheticLoves food, appreciates the importance of home cooking and 
cooking with the seasonsMeticulous editor, perfectionist, obsessive attention to detail, maddened by typos and broken links, delighted by finding and fixing themCheerful under pressureExcellent communication skillsA+ multi-tasker and juggler of responsibilities big and smallInterested in and engaged with social media like Twitter, Facebook, and PinterestLoves problem-solving and collaborating to drive Food52 forwardThinks big picture but pitches in on the nitty gritty of running a small company (dishes, shopping, administrative support)Comfortable with the realities of working for a startup: being on call on evenings and weekends, and working long hours"
print(predict_job(new_job))

REAL JOB


In [90]:
frd_data = data_new[data_new["fraudulent"]==1]
frd_data = frd_data.head()
frd_data

Unnamed: 0,text,fraudulent,cleaned_text
98,...,1,staf amp recruit done right oil amp energi ind...
144,The group has raised a fund for the purchase ...,1,group rais fund purchas home southeast student...
173,Edison International and Refined Resources hav...,1,edison intern refin resourc partner effort str...
180,Sales Executive Sales Executive,1,sale execut sale execut
215,...,1,staf amp recruit done right oil amp energi ind...


In [92]:
list(frd_data['text'])

["\xa0 \xa0 \xa0 \xa0 \xa0 \xa0 \xa0 \xa0 \xa0 \xa0 \xa0 \xa0 \xa0 \xa0 \xa0 \xa0 \xa0 \xa0 \xa0 \xa0 \xa0 \xa0 \xa0 \xa0 \xa0 \xa0 \xa0 \xa0 \xa0 \xa0 \xa0 \xa0 \xa0 \xa0 \xa0 \xa0 \xa0 \xa0 \xa0 \xa0 \xa0Staffing &amp; Recruiting done right for the Oil &amp; Energy Industry!Represented candidates are automatically granted the following perks: Expert negotiations on your behalf, maximizing your compensation package and implimenting ongoing increases\xa0Significant signing bonus by Refined Resources (in addition to any potential signing bonuses our client companies offer)1 Year access to AnyPerk: significant corporate discounts on cell phones, event tickets, house cleaning and everything inbetween. \xa0You'll save thousands on daily expenditures\xa0Professional Relocation Services for out of town candidates* All candidates are encouraged to participate in our Referral Bonus Program ranging anywhere from $500 - $1,000 for all successfully hired candidates... referred directly to the Ref

In [96]:
sam_data = "\xa0 \xa0 \xa0 \xa0 \xa0 \xa0 \xa0 \xa0 \xa0 \xa0 \xa0 \xa0 \xa0 \xa0 \xa0 \xa0 \xa0 \xa0 \xa0 \xa0 \xa0 \xa0 \xa0 \xa0 \xa0 \xa0 \xa0 \xa0 \xa0 \xa0 \xa0 \xa0 \xa0 \xa0 \xa0 \xa0 \xa0 \xa0 \xa0 \xa0 \xa0Staffing &amp; Recruiting done right for the Oil &amp; Energy Industry!Represented candidates are automatically granted the following perks: Expert negotiations on your behalf, maximizing your compensation package and implimenting ongoing increases\xa0Significant signing bonus by Refined Resources (in addition to any potential signing bonuses our client companies offer)1 Year access to AnyPerk: significant corporate discounts on cell phones, event tickets, house cleaning and everything inbetween. \xa0You'll save thousands on daily expenditures\xa0Professional Relocation Services for out of town candidates* All candidates are encouraged to participate in our Referral Bonus Program ranging anywhere from $500 - $1,000 for all successfully hired candidates... referred directly to the Refined Resources teamPlease submit referrals via online Referral FormThank you and we look forward to working with you soon! \xa0[ Click to enlarge Image ] IC&amp;E Technician | Bakersfield, CA Mt. 
PosoPrincipal Duties and Responsibilities:\xa0Calibrates, tests, maintains, troubleshoots, and installs all power plant instrumentation, control systems and electrical equipment.Performs maintenance on motor control centers, motor operated valves, generators, excitation equipment and motors.Performs preventive, predictive and corrective maintenance on equipment, coordinating work with various team members.Designs and installs new equipment and/or system modifications.Troubleshoots and performs maintenance on DC backup power equipment, process controls, programmable logic controls (PLC), and emission monitoring equipment.Uses maintenance reporting system to record time and material use, problem identified and corrected, and further action required; provides complete history of maintenance on equipment.Schedule, coordinate, work with and monitor contractors on specific tasks, as required.Follows safe working practices at all times.Identifies safety hazards and recommends solutions.Follows environmental compliance work practices.Identifies environmental non-compliance problems and assist in implementing solutions.Assists other team members and works with all departments to support generating station in achieving their performance goals.Trains other team members in the areas of instrumentation, control, and electrical systems.Performs housekeeping assignments, as directed.Conduct equipment and system tagging according to company and plant rules and regulations.Perform equipment safety inspections, as required, and record results as appropriate.\xa0Participate in small construction projects.\xa0 Read and interpret drawings, sketches, prints, and specifications, as required.Orders parts as needed to affect maintenance and repair.Performs Operations tasks on an as-needed basis and other tasks as assigned.Available within a reasonable response time for emergency call-ins and overtime, plus provide acceptable off-hour contact by phone and company 
pager.\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0 Excellent Verbal and Written Communications Skills:Ability to coordinate work activities with other team members on technical subjects across job families.Ability to work weekends, holidays, and rotating shifts, as required. QualificationsKnowledge, Skills &amp; Abilities:\xa0A high school diploma or GED is required. Must have a valid driver’s license. Ability to read, write, and communicate effectively in English.\xa0\xa0Good math skills.\xa0Four years of experience as an I&amp;C Technician and/or Electrician in a power plant environment, preferably with a strong electrical background, up to and including, voltages to 15 KV to provide the following:Demonstrated knowledge of electrical equipment, electronics, schematics, basics of chemistry and physics and controls and instrumentation.Demonstrated knowledge of safe work practices associated with a power plant environment.Demonstrated ability to calibrate I&amp;C systems and equipment, including analytic equipment.Demonstrated ability to configure and operate various test instruments and equipment, as necessary, to troubleshoot and repair plant equipment including, but not limited to, distributed control systems, programmable logic controllers, motor control centers, transformers, generators, and continuous emissions monitor (CEM) systems.Demonstrated ability to work with others in a team environment.\xa0"
print(predict_job(sam_data))

REAL JOB
