## Motivation

Job scams have become an increasingly prevalent issue in Singapore, with fraudsters exploiting job seekers through deceptive advertisements on social media and job platforms. These scams not only cause financial losses but also inflict psychological distress on victims who may lose trust in legitimate employment opportunities.

As part of this technical assessment, I aim to explore the data, apply the necessary preprocessing and develop a robust, adaptive algorithm that detects fraudulent job postings before they reach unsuspecting users. Because scam tactics evolve rapidly, traditional rule-based methods may no longer be sufficient. A data-driven machine learning approach can flag suspicious job listings based on linguistic patterns, metadata and other key indicators.

## Recruitment Scam Dataset Information

The [Recruitment Scam Dataset](https://www.kaggle.com/datasets/amruthjithrajvr/recruitment-scam) is a publicly available dataset containing 17880 real-life job ads published between 2012 and 2014: 17014 legitimate and 866 fraudulent.

The columns are described below:
- **title:** Title of job ad.
- **location:** Location of job.
- **department:** Department of the company to which the job belongs.
- **salary_range:** Salary range of job.
- **company_profile:** Company profile.
- **description:** Description of job.
- **requirements:** Job requirements.
- **benefits:** Job benefits.
- **telecommuting:** `t` (True) or `f` (False).
- **has_company_logo:** `t` (True) or `f` (False).
- **has_questions:** `t` (True) or `f` (False).
- **employment_type:** Type of employment.
- **required_experience:** Experience required.
- **required_education:** Education required.
- **industry:** Type of industry.
- **function:** Job function.
- **fraudulent:** `t` (fraudulent), `f` (not fraudulent).
- **in_balanced_dataset:** `t` (True), `f` (False).

The Recruitment Scam Dataset is stored in `data.csv`.
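Several columns above encode booleans as the strings `t` and `f`. The snippet below is a minimal sketch of how such flags could be mapped to real booleans before modelling. The three-row `sample` frame and the helper name `flags_to_bool` are made up for illustration and are not part of the dataset or of this notebook's pipeline.

```python
import pandas as pd

# Hypothetical three-row sample mimicking the flag columns of data.csv
sample = pd.DataFrame({
    "telecommuting": ["f", "t", "f"],
    "has_company_logo": ["t", "t", "f"],
    "fraudulent": ["f", "f", "t"],
})

# Subset of the flag columns, chosen for this illustration
FLAG_COLUMNS = ["telecommuting", "has_company_logo", "fraudulent"]


def flags_to_bool(frame, columns):
    """Return a copy of `frame` with 't'/'f' strings mapped to True/False."""
    out = frame.copy()
    for col in columns:
        # Any value other than 't' or 'f' would become NaN and should be inspected
        out[col] = out[col].map({"t": True, "f": False})
    return out


converted = flags_to_bool(sample, FLAG_COLUMNS)
print(converted)
```

Working on a copy keeps the raw strings available in the original frame in case the mapping needs to be revisited.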

## Import Libraries

Uncomment the cell below to install this notebook's dependencies with pip.

In [None]:
# !pip3 install -r requirements.txt

In [1]:
# Import necessary libraries

# Data manipulation
import pandas as pd
import numpy as np

# Statistical functions
from scipy.stats import zscore

# For concurrency (running functions in parallel)
from concurrent.futures import ThreadPoolExecutor

# For caching (to speed up repeated function calls)
from functools import lru_cache

# For progress tracking
from tqdm import tqdm

# Plotting and Visualisation
import matplotlib.pyplot as plt
import seaborn as sns

# Text Preprocessing and NLP
# Stopwords (common words to ignore) from NLTK
from nltk.corpus import stopwords
# Tokenizing sentences/words
from nltk.tokenize import word_tokenize
# Part-of-speech tagging
from nltk import pos_tag
# Lemmatization (converting words to their base form)
from nltk.stem import WordNetLemmatizer
import nltk
# Regular expressions for text pattern matching
import re

# Word Cloud generation
from wordcloud import WordCloud

## Data Preparation (Loading CSV)

Load the `data.csv` file into a pandas DataFrame named `df`.

In [2]:
df = pd.read_csv('data.csv')

In [3]:
df.info()
print("Dataframe Shape:", df.shape)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17880 entries, 0 to 17879
Data columns (total 18 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   title                17880 non-null  object
 1   location             17534 non-null  object
 2   department           6333 non-null   object
 3   salary_range         2868 non-null   object
 4   company_profile      14572 non-null  object
 5   description          17880 non-null  object
 6   requirements         15191 non-null  object
 7   benefits             10684 non-null  object
 8   telecommuting        17880 non-null  object
 9   has_company_logo     17880 non-null  object
 10  has_questions        17880 non-null  object
 11  employment_type      14409 non-null  object
 12  required_experience  10830 non-null  object
 13  required_education   9775 non-null   object
 14  industry             12977 non-null  object
 15  function             11425 non-null  object
 16  frau
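The non-null counts above show that several columns are sparsely populated; `salary_range`, for instance, has only 2868 non-null values out of 17880 rows. As a rough sketch, the share of missing values per column can be computed with `isna().mean()`. The four-row `sample` frame below is invented for illustration; on the real data the same expression would be applied to `df`.

```python
import pandas as pd

# Hypothetical sample; None marks a missing value, as NaN does in data.csv
sample = pd.DataFrame({
    "title": ["Marketing Intern", "Sales Executive", "Bill Review Manager", "Forward Cap."],
    "department": ["Marketing", "Sales", None, None],
    "salary_range": [None, None, None, "95000-115000"],
})

# isna() yields a boolean frame; the column-wise mean is the missing fraction
missing_share = sample.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Sorting in descending order puts the sparsest columns first, which makes it easier to decide which ones may need imputation or removal.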

In [4]:
df.head()

Unnamed: 0,title,location,department,salary_range,company_profile,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function,fraudulent,in_balanced_dataset
0,Marketing Intern,"US, NY, New York",Marketing,,"<h3>We're Food52, and we've created a groundbr...","<p>Food52, a fast-growing, James Beard Award-w...",<ul>\r\n<li>Experience with content management...,,f,t,f,Other,Internship,,,Marketing,f,f
1,Customer Service - Cloud Video Production,"NZ, , Auckland",Success,,"<h3>90 Seconds, the worlds Cloud Video Product...",<p>Organised - Focused - Vibrant - Awesome!<br...,<p><b>What we expect from you:</b></p>\r\n<p>Y...,<h3><b>What you will get from us</b></h3>\r\n<...,f,t,f,Full-time,Not Applicable,,Marketing and Advertising,Customer Service,f,f
2,Commissioning Machinery Assistant (CMA),"US, IA, Wever",,,<h3></h3>\r\n<p>Valor Services provides Workfo...,"<p>Our client, located in Houston, is actively...",<ul>\r\n<li>Implement pre-commissioning and co...,,f,t,f,,,,,,f,f
3,Account Executive - Washington DC,"US, DC, Washington",Sales,,<p>Our passion for improving quality of life t...,<p><b>THE COMPANY: ESRI – Environmental System...,<ul>\r\n<li>\r\n<b>EDUCATION: </b>Bachelor’s o...,<p>Our culture is anything but corporate—we ha...,f,t,f,Full-time,Mid-Senior level,Bachelor's Degree,Computer Software,Sales,f,f
4,Bill Review Manager,"US, FL, Fort Worth",,,<p>SpotSource Solutions LLC is a Global Human ...,<p><b>JOB TITLE:</b> Itemization Review Manage...,<p><b>QUALIFICATIONS:</b></p>\r\n<ul>\r\n<li>R...,<p>Full Benefits Offered</p>,f,t,t,Full-time,Mid-Senior level,Bachelor's Degree,Hospital & Health Care,Health Care Provider,f,f


In [5]:
df["fraudulent"].value_counts()

fraudulent
f    17014
t      866
Name: count, dtype: int64
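The counts above indicate a strongly imbalanced target. The sketch below quantifies that imbalance using only the two counts printed above. The class-weight formula, `total / (n_classes * count)`, is one common heuristic (the one scikit-learn uses for `class_weight="balanced"`); it is shown as an assumption for illustration, not as the approach this notebook necessarily takes.

```python
# Counts copied from the value_counts() output above
counts = {"f": 17014, "t": 866}

total = sum(counts.values())

# Share of fraudulent ads and ratio of legitimate to fraudulent ads
fraud_share = counts["t"] / total
imbalance_ratio = counts["f"] / counts["t"]

# "Balanced" heuristic: rarer classes receive proportionally larger weights
class_weights = {label: total / (len(counts) * n) for label, n in counts.items()}

print(f"Fraudulent share: {fraud_share:.2%}")
print(f"Legitimate-to-fraudulent ratio: {imbalance_ratio:.1f} to 1")
print({label: round(weight, 2) for label, weight in class_weights.items()})
```

With fewer than 5% of ads labelled fraudulent, plain accuracy would be a misleading metric: a model that predicts `f` for every ad would already score above 95%.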

In [6]:
df[df["fraudulent"] == "t"].head()

Unnamed: 0,title,location,department,salary_range,company_profile,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function,fraudulent,in_balanced_dataset
98,IC&E Technician,"US, , Stocton, CA",Oil & Energy,95000-115000,<p> ...,"<p><b><img src=""#URL_ae07dc35dfe86ebc1101b48ee...",<h3><b>Qualifications</b></h3>\r\n<p><b>Knowle...,<p><b>BENEFITS</b></p>\r\n<p><b>What is offere...,f,t,t,Full-time,Mid-Senior level,High School or equivalent,Oil & Energy,Other,t,f
144,Forward Cap.,,,,,<p>The group has raised a fund for the purchas...,,,f,f,f,,,,,,t,t
173,Technician Instrument & Controls,US,Power Plant & Energy,,"<p><img src=""#URL_044fce3aa43cecf7fd7f1fd790ab...",<p><b>Technician Instrument &amp; Controls</b>...,<p><b>JOB QUALIFICATIONS</b><br><br>-Ability t...,"<p>we are a team of almost 8,000 employees who...",f,t,t,Full-time,Mid-Senior level,Certification,Electrical/Electronic Manufacturing,Other,t,f
180,Sales Executive,"PK, SD, Karachi",Sales,,,<p>Sales Executive</p>,<p>Sales Executive</p>,<p>Sales Executive</p>,f,f,f,,,,,Sales,t,t
215,IC&E Technician Mt Poso,"US, CA, Bakersfield, CA / Mt. Poso",Oil & Energy,95000-115000,<p> ...,"<p><b><img src=""#URL_ae07dc35dfe86ebc1101b48ee...",<h3><b> Qualifications</b></h3>\r\n<p><b>Knowl...,<p><b>BENEFITS</b></p>\r\n<p><b>What is offere...,f,t,t,Full-time,Mid-Senior level,High School or equivalent,Oil & Energy,Other,t,f
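The text columns shown above contain raw HTML (tags such as `<p>` and `<b>`, entities such as `&amp;`, and `\r\n` line breaks). Below is a minimal sketch of one way to reduce such a cell to plain text using only the standard library. The helper name `strip_html` is hypothetical, and a regex-based approach is an assumption that suits simple markup like this; a dedicated HTML parser would be more robust for messier input.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")      # anything that looks like an HTML tag
WHITESPACE_RE = re.compile(r"\s+")   # runs of spaces, tabs and line breaks


def strip_html(text):
    """Return `text` without HTML tags, with entities decoded and whitespace collapsed."""
    if not isinstance(text, str):
        # Missing cells are read by pandas as float NaN; treat them as empty text
        return ""
    no_tags = TAG_RE.sub(" ", text)
    unescaped = html.unescape(no_tags)
    return WHITESPACE_RE.sub(" ", unescaped).strip()


print(strip_html("<p><b>Technician Instrument &amp; Controls</b></p>"))
print(strip_html("<ul>\r\n<li>Experience with content management</li>\r\n</ul>"))
```

Tags are replaced with a space rather than removed outright so that words separated only by markup (for example adjacent `<li>` items) do not get glued together.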
