## Fake Data Detection


### About The Data
This dataset contains about 18K job descriptions, of which about 800 are fake. It consists of both textual information and meta-information about the jobs, and can be used to build classification models that learn to recognize fraudulent job descriptions. Fake or scam postings are marked in the column "fraudulent".

The data is provided by the University of the Aegean | Laboratory of Information & Communication Systems Security

http://emscad.samos.aegean.gr/

## Dictionary:
-  job_id: Unique ID (int64)
-  title: Title of job description (str)
-  location: Geographical location of the job ad (Example: US, NY, New York)
-  department: Corporate department (e.g. Marketing, Success, Sales, ANDROIDPIT, ...)
-  salary_range: Indicative salary range (e.g. 50,000-60,000 ($))
-  company_profile: A brief company description.
-  description: The detailed description of the job ad.
-  requirements: Listed requirements for the job opening.
-  benefits: Listed benefits offered by the employer.
-  telecommuting: True for telecommuting positions. --> remote or not
-  has_company_logo: True if company logo is present.
-  has_questions: True if screening questions are present.
-  employment_type: Type of employment (e.g. Full-time, Part-time, Contract, etc.)
-  required_experience: Required Experience (e.g. Executive, Entry level, Intern, etc.)
-  required_education: Required Education (e.g. Doctorate, Master's Degree, Bachelor, etc.)
-  industry: Industry (e.g. Automotive, IT, Health care, Real estate, etc.)
-  function: Position as function in the company (e.g. Consulting, Engineering, Research, Sales etc.)
-  fraudulent: Classification target (0, 1)


# Columns to do:
## string manipulation
- title
- company_profile
- description
- requirements
- benefits

## one-hot encode
- location (3,105 unique values) - cities and countries --> remove
    - countries = 90 (346 is NA) --> keep
- industry (groups = 131) --> boolean mask (group all with fewer than 30 entries into one group) --> create a category for missing values
- function (groups = 37) --> create a category for missing values

- employment_type (groups = 5) 
- required_experience (groups = 7)
- required_education (groups = 13)
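
The grouping and encoding plan above could look roughly like the following sketch. It runs on a small made-up frame rather than the real data. The helper name `group_rare`, the labels `'missing'` and `'other'`, and the toy threshold of 2 (the notes use 30) are assumptions for illustration only.

```python
import pandas as pd

def group_rare(series, min_count=30, missing_label="missing", other_label="other"):
    """Give NaN its own category and merge infrequent categories into one."""
    filled = series.fillna(missing_label)
    counts = filled.value_counts()
    rare = counts[counts < min_count].index
    # Boolean mask: keep values that are frequent enough, replace the rest
    return filled.where(~filled.isin(rare), other_label)

# Toy data standing in for the `industry` column (hypothetical values)
toy = pd.DataFrame(
    {"industry": ["IT", "IT", "Automotive", None, "IT", "Real estate", None]}
)

toy["industry_grouped"] = group_rare(toy["industry"], min_count=2)

# One-hot encode the grouped column
encoded = pd.get_dummies(toy["industry_grouped"], prefix="industry")
print(encoded.columns.tolist())
```

Filling missing values before counting means that the missing category is treated like any other group, so it would also be merged into `'other'` if it were rarer than the threshold.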

## binary (no missing)
- telecommuting
- has_company_logo
- has_questions
- salary_range --> turn into binary (has salary range or not)
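
A minimal sketch of the salary flag, using a made-up frame. The column name `has_salary_range` is an assumption. Whether placeholder values such as `0-0` should count as "has a salary range" is left open here; the sketch simply treats every non-missing entry as present.

```python
import pandas as pd

# Toy data standing in for the `salary_range` column (hypothetical values)
toy = pd.DataFrame({"salary_range": ["50000-60000", None, "0-0", None]})

# 1 if any salary range was given, 0 if the field is missing
toy["has_salary_range"] = toy["salary_range"].notna().astype(int)
print(toy["has_salary_range"].tolist())
```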

## target
- fraudulent (binary)

## dropping
- department (groups = 1337) --> boolean mask (group all with fewer than 30 entries into one group) --> drop for now

# Questions
- How to impute data with more sophisticated methods?
- How to examine whether values are true NAs or just the result of company size?
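
One possible first check for the second question is to compare the fraud rate between rows where a field is missing and rows where it is present. If the rates differ clearly, the missingness itself carries information and may deserve its own feature. The sketch below uses a made-up frame; it does not answer whether a value is a "true" NA, it only shows whether missingness is related to the target.

```python
import pandas as pd

# Toy data (hypothetical values) with a text field and the target
toy = pd.DataFrame({
    "company_profile": ["We build apps", None, None, "Retail chain", None, "Bank"],
    "fraudulent": [0, 1, 1, 0, 0, 0],
})

# Flag rows where the field is missing, then compare the mean of the target
toy["profile_missing"] = toy["company_profile"].isna()
fraud_rate = toy.groupby("profile_missing")["fraudulent"].mean()
print(fraud_rate)
```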


In [3]:
# Import dependencies:
import pandas as pd 


In [4]:
# read file
data_path = '/home/lars/code/syeda-tabassum-rahaman/scam-job-detector/raw_data/fake_job_postings.csv'
df = pd.read_csv(data_path)
# print("First 5 records:", df.head())

In [6]:
df.function

0                   Marketing
1            Customer Service
2                         NaN
3                       Sales
4        Health Care Provider
                 ...         
17875                   Sales
17876     Accounting/Auditing
17877                     NaN
17878                  Design
17879             Engineering
Name: function, Length: 17880, dtype: object

In [51]:
missing_df = pd.DataFrame({
"missing_count": df.isna().sum(),
"missing_percent": (df.isna().mean() * 100).round(2)
})

missing_df.sort_values("missing_percent", ascending=False)

| column | missing_count | missing_percent |
|---|---|---|
| salary_range | 15012 | 83.96 |
| department | 11547 | 64.58 |
| required_education | 8105 | 45.33 |
| benefits | 7212 | 40.34 |
| required_experience | 7050 | 39.43 |
| function | 6455 | 36.1 |
| industry | 4903 | 27.42 |
| employment_type | 3471 | 19.41 |
| company_profile | 3308 | 18.5 |
| requirements | 2696 | 15.08 |


In [None]:
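# The first two characters of `location` are the country code
df['country'] = df['location'].str[:2]
# Count distinct countries (rows with a missing location stay NaN and are not counted)
df['country'].nunique()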

90
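
The two-character slice above relies on every location starting with a country code. A slightly more explicit alternative is to split the field on commas, following the `US, NY, New York` pattern from the data dictionary. The sketch below uses made-up rows, and the column names `state` and `city` are assumptions.

```python
import pandas as pd

# Toy data standing in for the `location` column (hypothetical values)
toy = pd.DataFrame(
    {"location": ["US, NY, New York", "GB, LND, London", "DE, , ", None]}
)

# Split into at most three parts: country, state, city
parts = toy["location"].str.split(",", n=2, expand=True)
parts = parts.apply(lambda col: col.str.strip())
parts.columns = ["country", "state", "city"]

# Treat empty strings as missing
parts = parts.replace("", pd.NA)
print(parts)
```

Limiting the split to two commas keeps city names that contain commas intact in the last column.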

In [4]:
print(f'''
Shape: {df.shape}
Size: {df.size}
Unique Ids: {df.job_id.nunique()}
Locations: {df.location.nunique()}
Departments: {df.department.nunique()}; {df.department.unique()}
Salary Range: {df.salary_range.describe()}
Column names: {df.columns}

'''
)



Shape: (17880, 18)
Size: 321840
Unique Ids: 17880
Locations: 3105
Departments: 1337; ['Marketing' 'Success' nan ... 'Admin - Clerical' 'Administrative Dept'
 'Hospitality']
Salary Range: count     2868
unique     874
top        0-0
freq       142
Name: salary_range, dtype: object
Column names: Index(['job_id', 'title', 'location', 'department', 'salary_range',
       'company_profile', 'description', 'requirements', 'benefits',
       'telecommuting', 'has_company_logo', 'has_questions', 'employment_type',
       'required_experience', 'required_education', 'industry', 'function',
       'fraudulent'],
      dtype='object')




In [14]:
import string
from nltk.corpus import stopwords
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer

# Build these once instead of on every call.
# Assumes the NLTK resources 'stopwords', 'wordnet' and 'punkt'
# ('punkt_tab' on newer NLTK versions) have been fetched with nltk.download().
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocessing(sentence):

    # remove punctuation
    for punctuation in string.punctuation:
        sentence = sentence.replace(punctuation, '')

    # set to lowercase
    sentence = sentence.lower()

    # remove numbers (a single pass is enough, no loop over string.digits needed)
    sentence = ''.join(char for char in sentence if not char.isdigit())

    # tokenize
    tokens = word_tokenize(sentence)

    # remove stop words
    tokens = [word for word in tokens if word not in stop_words]

    # lemmatize, treating every token as a verb
    tokens = [lemmatizer.lemmatize(word, pos='v') for word in tokens]

    return ' '.join(tokens)


In [15]:
# Fill missing text with a placeholder so every row can be preprocessed
cols = ['title', 'company_profile', 'description', 'requirements', 'benefits']

df_t = df.copy()

for col in cols:
    df_t[col] = df_t[col].fillna('missing value')

In [16]:
# Clean the text columns
for col in cols:
    df_t[col] = df_t[col].apply(preprocessing)

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import set_config; set_config(display="diagram")

# Create Pipeline
pipe = make_pipeline(
    TfidfVectorizer(),
    MultinomialNB()
)

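# Features and target (assumption: the cleaned `description` column from df_t
# serves as the text input; 'clean_reviews' and 'target_encoded' are not columns of this dataset)
X = df_t['description']
y = df_t['fraudulent']

# Set parameters to search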

params = {
    'tfidfvectorizer__ngram_range': ((1,1), (2,2)),
    'multinomialnb__alpha': (0.1,1)
}

# Perform grid search on pipeline
# grid_search = GridSearchCV(
#     estimator = pipe,
#     param_grid = params,
#     scoring = "recall",
#     cv = 5,
#     n_jobs=-1,
#     verbose=1
# )

# Thoughts

- For the basic ML pipeline, we vectorize each of the description fields separately and combine these vectors with the meta-information to train the model.
- For deep learning, we will explore different paths:
    1. Create one document per entry by merging all information in a systematic way.
    2. Create a meta-data sentence and leave all description parts separate. We will then train one model per part. At the end, we will train a model that takes the probabilities from each part to reach a decision.
    3. Same as 2., except that we train a single model that takes all parts as inputs.
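
The basic ML pipeline idea from the first bullet could be sketched with scikit-learn's `ColumnTransformer`, which applies a separate `TfidfVectorizer` to each text column and joins the result with the encoded meta-information. This is only an illustration on a made-up frame. The selection of columns, the step names, and the choice of `MultinomialNB` as the final estimator are assumptions, not a settled design.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data (hypothetical values) with two text fields and two meta fields
toy = pd.DataFrame({
    "description": ["earn money fast from home", "develop backend services",
                    "no experience quick cash", "maintain sales pipeline"],
    "requirements": ["missing value", "python sql", "missing value", "crm experience"],
    "employment_type": ["Part-time", "Full-time", "missing", "Full-time"],
    "has_company_logo": [0, 1, 0, 1],
    "fraudulent": [1, 0, 1, 0],
})

preprocessor = ColumnTransformer([
    # A column name given as a string passes a 1D series, as TfidfVectorizer expects
    ("desc_tfidf", TfidfVectorizer(), "description"),
    ("req_tfidf", TfidfVectorizer(), "requirements"),
    # A list of column names passes a 2D frame, as OneHotEncoder expects
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["employment_type"]),
    ("binary", "passthrough", ["has_company_logo"]),
])

model = make_pipeline(preprocessor, MultinomialNB())

X = toy.drop(columns="fraudulent")
y = toy["fraudulent"]
model.fit(X, y)
preds = model.predict(X)
print(preds)
```

All features produced here are non-negative, which `MultinomialNB` requires. A different estimator would be needed if scaled or centered numeric features were added.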
