## Fake Data Detection


### About The Data
This dataset contains about 18K job descriptions, of which roughly 800 are fake. It includes both textual information and meta-information about each job, and it can be used to train classification models that detect fraudulent job descriptions. Fake or scam postings are flagged in the column "fraudulent".

The data is provided by the University of the Aegean | Laboratory of Information & Communication Systems Security

http://emscad.samos.aegean.gr/

## Dictionary:
-  job_id: Unique ID (int64)
-  title: Title of job description (str)
-  location: Geographical location of the job ad (Example: US, NY, New York)
-  department: Corporate department (e.g. Marketing, Success, Sales, ANDROIDPIT, ...)
-  salary_range: Indicative salary range (e.g. 50,000-60,000 ($))
-  company_profile: A brief company description.
-  description: The detailed description of the job ad.
-  requirements: Listed requirements for the job opening.
-  benefits: Listed benefits offered by the employer.
-  telecommuting: True for telecommuting (remote) positions.
-  has_company_logo: True if company logo is present.
-  has_questions: True if screening questions are present.
-  employment_type: Type of employment (e.g. Full-time, Part-time, Contract, etc.)
-  required_experience: Required Experience (e.g. Executive, Entry level, Intern, etc.)
-  required_education: Required Education (e.g. Doctorate, Master’s Degree, Bachelor, etc.)
-  industry: Industry (e.g. Automotive, IT, Health care, Real estate, etc.)
-  function: Position as function in the company (e.g. Consulting, Engineering, Research, Sales etc.)
-  fraudulent: Classification target (0, 1)


# Columns to do:
## string manipulation
- title
- company_profile
- description
- requirements
- benefits

## one-hot encode
- location (3,105 unique values) - cities and countries --> remove
    - countries = 90 (346 is NA) --> keep
- industry (groups = 131) --> boolean mask (group all with fewer than 30 entries into one group) --> create a category for missing values
- function (groups = 37)  --> create a category for missing values

- employment_type (groups = 5) 
- required_experience (groups = 7)
- required_education (groups = 13)
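
The grouping idea noted for `industry` above could be sketched roughly as follows. This is only an illustration on toy data: the helper name `group_rare`, the labels `"Other"` and `"Missing"`, and the small threshold are assumptions made for the example, not part of the notebook.

```python
import pandas as pd


def group_rare(series: pd.Series, min_count: int = 30,
               other_label: str = "Other",
               missing_label: str = "Missing") -> pd.Series:
    """Give NaN its own label, then pool categories seen fewer than min_count times."""
    filled = series.fillna(missing_label)
    counts = filled.value_counts()
    # Rare categories, never pooling the explicit missing label
    rare = counts[counts < min_count].index.difference([missing_label])
    # Boolean mask: keep the value where it is not rare, else use other_label
    return filled.where(~filled.isin(rare), other_label)


# Toy data standing in for the real `industry` column
toy = pd.DataFrame(
    {"industry": ["IT"] * 4 + ["Automotive"] * 3 + ["Farming", None, None]}
)
toy["industry_grouped"] = group_rare(toy["industry"], min_count=3)

# One-hot encode the grouped column
encoded = pd.get_dummies(toy["industry_grouped"], prefix="industry")
print(encoded.columns.tolist())
```

On the real data the threshold would be the 30 mentioned above; the same helper could be reused for `function`.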

## binary (no missing)
- telecommuting
- has_company_logo
- has_questions
- salary_range --> turn into binary (has salary range or not)
- department (groups = 1337) --> binary 

## target
- fraudulent (binary)

## dropping
- department (groups = 1337) --> boolean mask (group all with less than 30 into one group) --> drop for now

# Questions
- How to impute data with more sophisticated methods?
- How to examine whether values are true NAs or just the result of company size?
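
One possible first look at the second question is to compare the missing rate of a column across groups of another feature. The sketch below uses toy data and treats `has_company_logo` as a crude stand-in for company size, which is an assumption; the dataset has no explicit company-size column.

```python
import numpy as np
import pandas as pd

# Toy frame; in the notebook this would be `df`
toy = pd.DataFrame({
    "has_company_logo": [1, 1, 1, 1, 0, 0, 0, 0],
    "salary_range": ["10-20", "30-40", np.nan, "50-60",
                     np.nan, np.nan, np.nan, "5-9"],
})

# 1 where salary_range is missing, 0 otherwise
toy["salary_missing"] = toy["salary_range"].isna().astype(int)

# Share of missing salary ranges within each logo group
missing_rate = toy.groupby("has_company_logo")["salary_missing"].mean()
print(missing_rate)
```

If the rates differ clearly between groups, the missingness itself carries information, which would favour keeping a missing-indicator feature over imputing a value.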


In [None]:
# Import dependencies:
import pandas as pd 


In [None]:
# read file
data_path = '/home/lars/code/syeda-tabassum-rahaman/scam-job-detector/raw_data/fake_job_postings.csv'
df = pd.read_csv(data_path)
# print("First 5 records:", df.head())

In [19]:
df['department'].map(lambda x: 0 if pd.isna(x) else 1).value_counts()

department
0    11547
1     6333
Name: count, dtype: int64

In [None]:
import string
from nltk.corpus import stopwords
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
import pandas as pd
from pathlib import Path

def clean_data(df: pd.DataFrame) -> pd.DataFrame:
    """
    Clean raw data by
    - Creating binary features for department and salary_range (the columns with the most missing values): missing = 0, not missing = 1
    - Cleaning text data by removing punctuation, digits and stopwords, lowercasing and lemmatizing
    - Extracting the country code from the location column
    """
    # build these once instead of once per sentence / per token
    stop_words = set(stopwords.words('english'))
    lemmatizer = WordNetLemmatizer()

    def preprocessing(sentence):

        # remove punctuation
        for punctuation in string.punctuation:
            sentence = sentence.replace(punctuation, '')

        # set to lowercase
        sentence = sentence.lower()

        # remove numbers (a single pass; no outer loop over string.digits needed)
        sentence = ''.join(char for char in sentence if not char.isdigit())

        # tokenize
        tokens = word_tokenize(sentence)

        # removing stop words
        tokens = [word for word in tokens if word not in stop_words]

        # lemmatize, treating every token as a verb (pos='v')
        tokens = [lemmatizer.lemmatize(word, pos='v') for word in tokens]

        return ' '.join(tokens)

    # work on a copy of the DataFrame passed in instead of re-reading the CSV,
    # which silently ignored the `df` argument
    df = df.copy()
    print('dataset loaded')

    # Creating binary columns for missing values:
    df['department_binary'] = df['department'].map(lambda x: 0 if pd.isna(x) else 1)
    
    df['salary_range_binary'] = df['salary_range'].map(lambda x: 0 if pd.isna(x) else 1)
    
    # Clean text data
    cols = ['title', 'company_profile', 'description', 'requirements', 'benefits']

    df = df.copy()

    for col in cols:
        df[col] = df[col].fillna('missing value')

    for col in cols:
        df[col] = df[col].apply(preprocessing)
    
    # extracting country ID (text before the first comma);
    # astype(str) turns missing locations into the string 'nan'
    df['country'] = df['location'].astype(str).apply(lambda x: x.split(',')[0])

    # dropping columns
    df.drop(columns=['salary_range', 'department', 'location', 'job_id'], inplace=True)
    

    print("✅ data cleaned")

    return df


In [16]:
d_test = clean_data(df)

d_test

dataset loaded
✅ data cleaned


Unnamed: 0,title,company_profile,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function,fraudulent,department_binary,salary_range_binary,country
0,market intern,food weve create groundbreaking awardwinning c...,food fastgrowing jam beard awardwinning online...,experience content management systems major pl...,miss value,0,1,0,Other,Internship,,,Marketing,0,1,0,US
1,customer service cloud video production,second worlds cloud video production service s...,organise focus vibrant awesomedo passion custo...,expect youyour key responsibility communicate ...,get usthrough part second team gainexperience ...,0,1,0,Full-time,Not Applicable,,Marketing and Advertising,Customer Service,0,1,0,NZ
2,commission machinery assistant cma,valor service provide workforce solutions meet...,client locate houston actively seek experience...,implement precommissioning commission procedur...,miss value,0,1,0,,,,,,0,0,0,US
3,account executive washington dc,passion improve quality life geography heart e...,company esri – environmental systems research ...,education bachelor ’ master ’ gi business admi...,culture anything corporate—we collaborative cr...,0,1,0,Full-time,Mid-Senior level,Bachelor's Degree,Computer Software,Sales,0,1,0,US
4,bill review manager,spotsource solutions llc global human capital ...,job title itemization review managerlocation f...,qualificationsrn license state texasdiploma ba...,full benefit offer,0,1,1,Full-time,Mid-Senior level,Bachelor's Degree,Hospital & Health Care,Health Care Provider,0,0,0,US
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17875,account director distribution,vend look awesome new talent come join us youl...,case first time ’ visit website vend award win...,ace role youwill eat comprehensive statements ...,expect uswe open culture openly share result i...,0,1,1,Full-time,Mid-Senior level,,Computer Software,Sales,0,1,0,CA
17876,payroll accountant,weblinc ecommerce platform service provider fa...,payroll accountant focus primarily payroll fun...,ba bs account desire fun love genuine passion ...,health amp wellnessmedical planprescription dr...,0,1,1,Full-time,Mid-Senior level,Bachelor's Degree,Internet,Accounting/Auditing,0,1,0,US
17877,project cost control staff engineer cost contr...,provide full time permanent position many medi...,experience project cost control staff engineer...,least years professional experienceability wor...,miss value,0,0,0,Full-time,,,,,0,0,0,US
17878,graphic designer,miss value,nemsia studios look experience visualgraphic d...,must fluent latest versions corel amp adobe cc...,competitive salary compensation base experienc...,0,0,1,Contract,Not Applicable,Professional,Graphic Design,Design,0,0,0,NG


In [26]:
df['Country_clean'] = df['location'].astype(str).apply(lambda x: x.split(',')[0])

In [27]:
df['Country_clean']

0        US
1        NZ
2        US
3        US
4        US
         ..
17875    CA
17876    US
17877    US
17878    NG
17879    NZ
Name: Country_clean, Length: 17880, dtype: object

In [51]:
missing_df = pd.DataFrame({
"missing_count": df.isna().sum(),
"missing_percent": (df.isna().mean() * 100).round(2)
})

missing_df.sort_values("missing_percent", ascending=False)

Unnamed: 0,missing_count,missing_percent
salary_range,15012,83.96
department,11547,64.58
required_education,8105,45.33
benefits,7212,40.34
required_experience,7050,39.43
function,6455,36.1
industry,4903,27.42
employment_type,3471,19.41
company_profile,3308,18.5
requirements,2696,15.08


In [None]:
df['country'] = df['location'].str[:2]
df['country'].isna().sum()

90

In [None]:
# a Series has no .split method; use the vectorised .str accessor instead
df['location'].str.split(',')

In [4]:
print(f'''
Shape: {df.shape}
Size: {df.size}
Unique Ids: {df.job_id.nunique()}
Locations: {df.location.nunique()}
Departments: {df.department.nunique()}; {df.department.unique()}
Salary Range: {df.salary_range.describe()}
Column names: {df.columns}

'''
)



Shape: (17880, 18)
Size: 321840
Unique Ids: 17880
Locations: 3105
Departments: 1337; ['Marketing' 'Success' nan ... 'Admin - Clerical' 'Administrative Dept'
 'Hospitality']
Salary Range: count     2868
unique     874
top        0-0
freq       142
Name: salary_range, dtype: object
Column names: Index(['job_id', 'title', 'location', 'department', 'salary_range',
       'company_profile', 'description', 'requirements', 'benefits',
       'telecommuting', 'has_company_logo', 'has_questions', 'employment_type',
       'required_experience', 'required_education', 'industry', 'function',
       'fraudulent'],
      dtype='object')




In [14]:
import string
from nltk.corpus import stopwords
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer

# build these once instead of once per sentence / per token
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocessing(sentence):

    # remove punctuation
    for punctuation in string.punctuation:
        sentence = sentence.replace(punctuation, '')

    # set to lowercase
    sentence = sentence.lower()

    # remove numbers (a single pass; no outer loop over string.digits needed)
    sentence = ''.join(char for char in sentence if not char.isdigit())

    # tokenize
    tokens = word_tokenize(sentence)

    # removing stop words
    tokens = [word for word in tokens if word not in stop_words]

    # lemmatize, treating every token as a verb (pos='v')
    tokens = [lemmatizer.lemmatize(word, pos='v') for word in tokens]

    return ' '.join(tokens)


In [15]:
# fill missing text with a placeholder string before cleaning
cols = ['title', 'company_profile', 'description', 'requirements', 'benefits']

df_t = df.copy()

for col in cols:
    df_t[col] = df_t[col].fillna('missing value')

In [16]:
# Clean the text columns
for col in cols:
    df_t[col] = df_t[col].apply(preprocessing)

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import set_config; set_config(display="diagram")

# Create Pipeline
pipe = make_pipeline(
    TfidfVectorizer(),
    MultinomialNB()
)

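# Features and target: `description` is one of the cleaned text columns in df_t,
# `fraudulent` is the classification target
X = df_t['description']
y = df_t['fraudulent']

# Set parameters to search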

params = {
    'tfidfvectorizer__ngram_range': ((1,1), (2,2)),
    'multinomialnb__alpha': (0.1,1)
}

# Perform grid search on pipeline
# grid_search = GridSearchCV(
#     estimator = pipe,
#     param_grid = params,
#     scoring = "recall",
#     cv = 5,
#     n_jobs=-1,
#     verbose=1
# )

# Thoughts

- For the basic ML pipeline, we vectorize each of the text columns separately and combine these vectors with the meta-information to train the model.
- For deep learning, we will try different approaches:
    1. Create one document per entry by merging all information in a systematic way.
    2. Create a meta-data sentence and keep the description parts separate. We then train one model per part and, at the end, a final model that takes the per-part probabilities as input to reach a decision.
    3. Same as 2, except that we train a single model that takes all parts as inputs.
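
A minimal sketch of the basic ML pipeline from the first bullet, assuming scikit-learn's `ColumnTransformer` is an acceptable way to vectorize each text column separately and append the meta-information. The toy rows, the selection of columns, and the step names are illustrative assumptions only; the real notebook would pass the cleaned frame instead.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

text_cols = ["title", "description"]   # subset chosen for brevity
cat_cols = ["employment_type"]
binary_cols = ["has_company_logo"]

preprocessor = ColumnTransformer(
    transformers=[
        # One TF-IDF vocabulary per text column; passing the column name as a
        # string (not a list) hands the vectorizer a 1-D column of documents
        *[(f"tfidf_{col}", TfidfVectorizer(), col) for col in text_cols],
        ("onehot", OneHotEncoder(handle_unknown="ignore"), cat_cols),
        ("binary", "passthrough", binary_cols),
    ]
)

# All features are non-negative, so MultinomialNB can consume them
model = make_pipeline(preprocessor, MultinomialNB())

# Toy rows standing in for the cleaned job postings
toy = pd.DataFrame({
    "title": ["data engineer", "earn money fast",
              "sales manager", "work home cash"],
    "description": ["build pipelines", "experience wire transfer",
                    "lead team", "easy cash daily"],
    "employment_type": ["Full-time", "Missing", "Full-time", "Part-time"],
    "has_company_logo": [1, 0, 1, 0],
    "fraudulent": [0, 1, 0, 1],
})

X_toy = toy.drop(columns="fraudulent")
model.fit(X_toy, toy["fraudulent"])
predictions = model.predict(X_toy)
features = preprocessor.transform(X_toy)
print(features.shape, predictions)
```

Keeping one vectorizer per column means a word such as "cash" in the title is a different feature from "cash" in the description, which is the point of vectorizing the parts separately.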
