## Fake Data Detection


### About The Data
This dataset contains about 18K job descriptions, of which about 800 are fake. It consists of both textual information and meta-information about the jobs. The dataset can be used to build classification models that learn to recognize fraudulent job descriptions. Fake or scam postings are flagged in the column "fraudulent".

The data is provided by the University of the Aegean | Laboratory of Information & Communication Systems Security

http://emscad.samos.aegean.gr/

## Dictionary:
-  job_id: Unique ID (int64)
-  title: Title of job description (str)
-  location: Geographical location of the job ad (Example: US, NY, New York)
-  department: Corporate department (e.g. Marketing, Success, Sales, ANDROIDPIT, ...)
-  salary_range: Indicative salary range (e.g. 50,000-60,000 ($))
-  company_profile: A brief company description.
-  description: The detailed description of the job ad.
-  requirements: Listed requirements for the job opening.
-  benefits: Listed benefits offered by the employer.
-  telecommuting: True for telecommuting (remote) positions.
-  has_company_logo: True if company logo is present.
-  has_questions: True if screening questions are present.
-  employment_type: Type of employment (e.g. Full-time, Part-time, Contract, etc.)
-  required_experience: Required experience (e.g. Executive, Entry level, Intern, etc.)
-  required_education: Required education (e.g. Doctorate, Master’s Degree, Bachelor, etc.)
-  industry: Industry (e.g. Automotive, IT, Health care, Real estate, etc.)
-  function: Position as function in the company (e.g. Consulting, Engineering, Research, Sales etc.)
-  fraudulent: Classification target (0, 1)


# Columns to do:
## string manipulation
- title
- company_profile
- description
- requirements
- benefits

## one-hot encode
- location (3,105 unique values): cities and countries --> remove
    - countries = 90 (346 are NA) --> keep
- industry (groups = 131) --> boolean mask (group all with fewer than 30 entries into one group) --> create a category for missing values
- function (groups = 37) --> create a category for missing values

- employment_type (groups = 5) 
- required_experience (groups = 7)
- required_education (groups = 13)
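The grouping planned for `industry` above (a boolean mask that merges all groups with fewer than 30 entries, plus a separate category for missing values) could look roughly like the sketch below. The helper name `group_rare`, the labels `"other"` and `"missing"`, and the toy data are assumptions for illustration, not part of the notebook.

```python
import pandas as pd

def group_rare(series: pd.Series, min_count: int = 30,
               other: str = "other", missing: str = "missing") -> pd.Series:
    """Merge infrequent categories into one group and label NaNs explicitly."""
    counts = series.value_counts()            # NaN is excluded by default
    rare = counts[counts < min_count].index   # categories below the threshold
    mask = series.isin(rare)                  # boolean mask of rare rows
    grouped = series.where(~mask, other)      # replace rare values, keep the rest
    return grouped.fillna(missing)            # missing values get their own category

# Toy data (hypothetical, not taken from the dataset)
industry = pd.Series(["IT"] * 4 + ["Automotive"] * 3 + ["Mining", None, None])
grouped = group_rare(industry, min_count=3)
print(grouped.value_counts().to_dict())
```

Missing values are counted separately here, so they are never merged into the rare group even when there are only a few of them.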

## binary (no missing)
- telecommuting
- has_company_logo
- has_questions
- salary_range --> turn into binary (has salary range or not)
- department (groups = 1337) --> binary 

## target
- fraudulent (binary)

## dropping
- department (groups = 1337) --> boolean mask (group all with fewer than 30 entries into one group) --> drop for now

# Questions
- How to impute data with more sophisticated methods?
- How to examine whether values are true NAs or just the result of company size?


In [30]:
# Install dependencies as needed:
import pandas as pd 


In [31]:
# read file
data_path = '/home/lars/code/syeda-tabassum-rahaman/scam-job-detector/raw_data/fake_job_postings.csv'
df = pd.read_csv(data_path)
# print("First 5 records:", df.head())

In [None]:
print(f'''
Shape: {df.shape}
Size: {df.size}
Unique Ids: {df.job_id.nunique()}
Locations: {df.location.nunique()}
Departments: {df.department.nunique()}; {df.department.unique()}
Salary Range: {df.salary_range.describe()}
Column names: {df.columns}

'''
)

In [32]:
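# Requires the NLTK resources 'stopwords', 'punkt' ('punkt_tab' in newer NLTK releases)
# and 'wordnet'; download them once with nltk.download(...) if they are missing.
import string
from nltk.corpus import stopwords
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
import pandas as pd
from pathlib import Path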

def clean_data(df: pd.DataFrame) -> pd.DataFrame:
    """
    Clean raw data by
    - Creating binary features for columns with more than 30% missing values: missing = 0, not missing = 1
    - Cleaning text data by removing stopwords and digits, lemmatizing, etc.
    - Extracting the country from the location and dropping the raw columns
    """
    # Build these once; they are reused for every row
    stop_words = set(stopwords.words('english'))
    lemmatizer = WordNetLemmatizer()

    def preprocessing(sentence):

        # remove punctuation
        for punctuation in string.punctuation:
            sentence = sentence.replace(punctuation, '')

        # set to lowercase
        sentence = sentence.lower()

        # remove numbers (a single pass over the string is enough)
        sentence = ''.join(char for char in sentence if not char.isdigit())

        # tokenize
        tokens = word_tokenize(sentence)

        # removing stop words
        tokens = [word for word in tokens if word not in stop_words]

        # lemmatize, treating every token as a verb
        tokens = [lemmatizer.lemmatize(word, pos='v') for word in tokens]

        return ' '.join(tokens)

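    # path = Path('.')
    # NOTE: the raw CSV is re-read here, so the `df` argument is overwritten
    # and every call starts again from the raw data.
    df = pd.read_csv('../raw_data/fake_job_postings.csv')
    print('dataset loaded')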

    # Creating binary columns for missing values:
    df['department_binary'] = df['department'].map(lambda x: 0 if pd.isna(x) else 1)
    
    df['salary_range_binary'] = df['salary_range'].map(lambda x: 0 if pd.isna(x) else 1)
    
    # Clean text data
    cols = ['title', 'company_profile', 'description', 'requirements', 'benefits']

    df = df.copy()

    for col in cols:
        df[col] = df[col].fillna('missing value')

    for col in cols:
        df[col] = df[col].apply(preprocessing)
    
    # extracting country ID
    df['country'] = df['location'].astype(str).apply(lambda x: x.split(',')[0])

    # dropping columns
    df.drop(columns=['salary_range', 'department', 'location', 'job_id'], inplace=True)
    

    print("✅ data cleaned")

    return df


In [33]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.compose import make_column_transformer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.preprocessing import FunctionTransformer


# categorical columns for One-Hot Encoding
categorical_columns = [
    'country',
    'industry',
    'function',
    'employment_type'
]
# ordinal columns for Ordinal Encoding
ordinal_columns = [
    'required_experience',
    'required_education'
]
#binary columns for binary encoding
binary_columns = ['has_company_logo', 'has_questions', 'department_binary', 'salary_range_binary']

#text columns for TF-IDF Vectorizer
text_columns = [
        'title',
        'company_profile',
        'description',
        'requirements',
        'benefits'
]

#reference lists for ordinal encoding
experience_order = [
    "Not Applicable",
    "Unknown",
    "Internship",
    "Entry level",
    "Associate",
    "Mid-Senior level",
    "Director",
    "Executive"
]

education_order = [
    "Unknown",
    "High School or equivalent",
    "Vocational",
    "Certification",
    "Some College Coursework Completed",
    "Associate Degree",
    "Bachelor's Degree",
    "Professional",
    "Master's Degree"
]


# preprocessor pipeline
def preprocessing_pipeline() -> ColumnTransformer:

    cat_transformer = make_pipeline(
        SimpleImputer(strategy='constant', fill_value='missing'),
        OneHotEncoder(handle_unknown='ignore')
    )
    # Values outside the reference lists (including the imputed 'missing') are encoded as -1
    ordinal_transformer = make_pipeline(
        SimpleImputer(strategy='constant', fill_value='missing'),
        OrdinalEncoder(
            categories=[experience_order, education_order],
            handle_unknown="use_encoded_value",
            unknown_value=-1
        )
    )
    binary_transformer = make_pipeline(
        # fill_value only applies to strategy='constant', so it is omitted here
        SimpleImputer(strategy='most_frequent'),
        OneHotEncoder(handle_unknown='ignore')
    )

    def combine_text(X):
        return X[text_columns].fillna("").agg(" ".join, axis=1)
    
    text_transformer = make_pipeline(
        FunctionTransformer(combine_text, validate=False),
        TfidfVectorizer(max_features=5000)
    )

    
    preprocessor = make_column_transformer(
        (cat_transformer, categorical_columns),
        (ordinal_transformer, ordinal_columns),
        (binary_transformer, binary_columns),
        (text_transformer, text_columns)
    )
    return preprocessor

# train preprocessor pipeline: fit on the training data only, then transform both splits
def train_preprocessor(X_train: pd.DataFrame, X_test: pd.DataFrame) -> tuple:
    preprocessor = preprocessing_pipeline()
    X_train_preprocessed = preprocessor.fit_transform(X_train)
    X_test_preprocessed = preprocessor.transform(X_test)
    return X_train_preprocessed, X_test_preprocessed



In [None]:
df = clean_data(df)
# Extract X and y
X = df.drop(columns=['fraudulent'])
y = df['fraudulent']
# Make train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
# preprocess train and test data
X_train_preprocessed, X_test_preprocessed = train_preprocessor(X_train, X_test)
# X_test_preprocessed = test_preprocessor(X_test)

dataset loaded
✅ data cleaned


In [None]:
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score, cross_validate
from sklearn.linear_model import LogisticRegression

cv = StratifiedKFold(n_splits=5)

pipe = Pipeline([
    # ("clean_preproc", clean_preproc),
    ("classifier", LogisticRegression(max_iter=1000))
])

cross_val_score(pipe, X_train_preprocessed, y_train, cv=cv).mean()

cross_model = cross_validate(pipe, X_train_preprocessed, y_train, cv=cv)
cross_model


np.float64(0.9738533643916376)

{'fit_time': array([0.61700773, 0.65386415, 0.56823897, 0.60018778, 0.63627195]),
 'score_time': array([0.00219345, 0.00320959, 0.0033884 , 0.00193262, 0.00251102]),
 'test_score': array([0.97413492, 0.97553303, 0.97203775, 0.9751835 , 0.97237762])}

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score, precision_score, accuracy_score

model = LogisticRegression(max_iter=1000)

result = model.fit(X_train_preprocessed, y_train)

# Mean accuracy on the training data; `score` is a method and has to be called
result.score(X_train_preprocessed, y_train)

In [29]:
y_pred = result.predict(X_test_preprocessed)

print(recall_score(y_test, y_pred), precision_score(y_test, y_pred), accuracy_score(y_test, y_pred))



0.5606936416184971 0.97 0.9779082774049217
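
The three numbers printed above are recall, precision, and accuracy, in that order. The sketch below shows how they relate to confusion-matrix counts. The counts are not notebook output: they are an assumption, chosen as integers that reproduce the printed values.

```python
# Confusion-matrix counts inferred from the printed metrics (assumption, not notebook output)
tp, fp, fn, tn = 97, 3, 76, 3400
total = tp + fp + fn + tn

recall = tp / (tp + fn)        # share of fraudulent postings that were caught
precision = tp / (tp + fp)     # share of flagged postings that are fraudulent
accuracy = (tp + tn) / total   # share of all postings classified correctly

# Accuracy of a trivial model that always predicts "not fraudulent"
baseline_accuracy = (tn + fp) / total

print(recall, precision, accuracy, baseline_accuracy)
```

Under these assumed counts, always predicting "not fraudulent" already reaches about 95% accuracy, so recall says more about this model than accuracy does.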


# Thoughts

- For the basic ML pipeline, we vectorize each of the descriptions separately and add these vectors, along with the meta-information, to train the model.
- For deep learning, we will create different paths:
    1. Create one document per entry by merging all information in a systematic way.
    2. Create a meta-data sentence and keep all description parts separate. We will then train one model per part. At the end, we will train a model that takes the probabilities from each part to reach a decision.
    3. Same as 2, except that we will train one model that takes all parts as inputs.
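
Path 2 above (one model per part, then a final model on the per-part probabilities) is a form of stacking. Here is a minimal sketch of the idea. It uses TF-IDF with logistic regression as a stand-in for the deep learning models, and the toy data and variable names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline

# Toy data (hypothetical, not from the dataset)
toy = pd.DataFrame({
    "description": [
        "earn money fast from home no experience",
        "join our engineering team building web services",
        "quick cash wire transfer required upfront",
        "seeking accountant for monthly reporting",
        "work from home earn cash fast",
        "software developer for backend services",
        "send bank details to receive money fast",
        "marketing manager for regional team",
    ],
    "requirements": [
        "no requirements just a bank account",
        "degree in computer science and python skills",
        "bank account and phone",
        "accounting degree and excel skills",
        "no skills needed only a bank account",
        "python skills and a degree",
        "phone and bank account",
        "marketing degree and communication skills",
    ],
    "fraudulent": [1, 0, 1, 0, 1, 0, 1, 0],
})
parts = ["description", "requirements"]
y = toy["fraudulent"]

# Stage 1: one text model per part. Out-of-fold probabilities are used so the
# stage 2 model is not trained on predictions for rows stage 1 has already seen.
part_probs = []
for part in parts:
    part_model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    probs = cross_val_predict(part_model, toy[part], y, cv=2, method="predict_proba")
    part_probs.append(probs[:, 1])  # probability of the fraudulent class

# Stage 2: a meta-model that combines the per-part probabilities into one decision
meta_X = np.column_stack(part_probs)
meta_model = LogisticRegression().fit(meta_X, y)
print(meta_model.predict(meta_X))
```

Meta-information could be appended to `meta_X` as additional columns before fitting the second stage.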
