### QUESTION 2: Factors that distinguish job category

Using the job postings you scraped for part 1 (or potentially new job postings from a second round of scraping), identify features in the data related to job postings that can distinguish job titles from each other. There are a variety of interesting ways you can frame the target variable, for example:
- What components of a job posting distinguish data scientists from other data jobs?
- What features are important for distinguishing junior vs. senior positions?
- Do the requirements for titles vary significantly with industry (e.g. healthcare vs. government)?

You may end up making multiple classification models to tackle different questions. Be sure to clearly explain your hypotheses and framing, any feature engineering, and what your target variables are. The type of classification model you choose is up to you. Be sure to interpret your results and evaluate your models' performance.

In [1]:
# Define data scientists as postings whose job_title contains 'data scientist' or 'machine learning'.

In [2]:
import numpy as np
import pandas as pd
import re
import seaborn as sns
import matplotlib.pyplot as plt
from nltk.tokenize import word_tokenize
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.preprocessing import StandardScaler

%matplotlib inline

In [3]:
jobs = pd.read_csv('out3.csv', delimiter='\t')
jobs = jobs.drop(jobs.columns[[0]], axis=1)


In [4]:
jobs['data_scientist'] = 0
keywords = ['data scientist', 'machine learning']
# Flag a posting as a data scientist role if its title contains any keyword.
# Assigning through .loc avoids the chained indexing (jobs.data_scientist[i] = 1)
# that triggers pandas' SettingWithCopyWarning.
for keyword in keywords:
    matches = jobs.job_title.str.contains(keyword, regex=False, na=False)
    jobs.loc[matches, 'data_scientist'] = 1

In [5]:
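# Job titles that appear fewer than three times.
# (Wrapping the comparison in [...] built a plain list, so .index resolved to list.index.)
title_counts = jobs.job_title.apply(pd.Series).stack().value_counts()
title_counts[title_counts < 3].index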

In [6]:
# Collect the job_title dummy columns (titles appearing more than twice) so they can be dropped.
title_counts = jobs.job_title.apply(pd.Series).stack().value_counts()
job_title_dummies = title_counts[title_counts > 2].index

# Keep only the titles that actually exist as columns in jobs.
job_title_dummies2 = [title for title in job_title_dummies if title in jobs.columns]

In [7]:
# Too many missing values, and difficult to impute.
jobs = jobs.drop(columns=['min_experience'])
# Company dummies already cover address and district.
jobs = jobs.drop(columns=['link', 'company', 'address', 'district'])
# Similar to the remaining salary variable.
jobs = jobs.drop(columns=['salary_low', 'salary_high', 'salary_time'])
# Dummies were already created for these.
jobs = jobs.drop(columns=['job_title', 'employment_type', 'job_category', 'seniority', 'skills'])
# Drop the job_title dummy columns as well.
jobs = jobs.drop(columns=job_title_dummies2)
jobs.head()

Unnamed: 0,salary_mid,101 DIGITAL PTE. LTD.,99 PTE. LTD.,A-STAR-EDUCATION HOLDINGS PTE. LTD.,ABEJA SINGAPORE PTE. LTD.,ABI RESOURCES & SERVICES PTE. LTD.,ACCELA RECRUITMENT SERVICES PTE. LTD.,ACCENTURE PTE LTD,ACHIEVE CAREER CONSULTANT PTE LTD,ACRONIS ASIA RESEARCH AND DEVELOPMENT PTE. LTD.,...,working knowledge,written communication,written communication skills,written verbal,years experience,years relevant,years relevant experience,years working,years working experience,data_scientist
0,5750.0,0,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
1,6400.0,0,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.21871,0.233659,1
2,11500.0,0,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.21871,0.233659,1
3,4000.0,0,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.20651,0.0,0.0,0.0,0.0,0
4,3800.0,0,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.121946,0.0,0.0,0.0,0.0,0.0,1


In [8]:
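# Shuffle, then split off the target.
# Placeholder so the loop below runs at least once.
y_test = [0, 0]

# The baseline accuracy should be around 0.915, from 1 - sum(y)/len(y).

# Reshuffle until the positive rate in the test set is between 0.07 and 0.10
# (baseline between 0.90 and 0.93), so the test set has a representative class balance.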
while sum(y_test)/len(y_test) < 0.07 or sum(y_test)/len(y_test) > 0.1:
    jobs = jobs.sample(frac=1)
    y = jobs.data_scientist
    X = jobs.drop(columns = 'data_scientist')
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
baseline = max(1-sum(y_test)/len(y_test), sum(y_test)/len(y_test))
print(baseline)

0.9189189189189189
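As an alternative to reshuffling until the test set happens to have the right class balance, `train_test_split` accepts a `stratify` argument that preserves the positive rate in both splits. The sketch below is illustrative only: `X_demo` and `y_demo` are synthetic stand-ins (an assumption, not the scraped data) with a positive rate close to the one noted above, so it would not reproduce the scores in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real feature matrix (hypothetical data):
# roughly 8.5% positives, similar to the class balance noted above.
rng = np.random.RandomState(0)
n_rows = 1000
X_demo = pd.DataFrame(rng.rand(n_rows, 5), columns=[f"f{i}" for i in range(5)])
y_demo = pd.Series((rng.rand(n_rows) < 0.085).astype(int), name="data_scientist")

# stratify=y_demo keeps the positive rate nearly identical in train and test;
# train_test_split shuffles by default, so no manual reshuffling is needed.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

overall_rate = y_demo.mean()
train_rate = y_tr.mean()
test_rate = y_te.mean()

# Majority-class baseline accuracy on the test split.
baseline_demo = max(test_rate, 1 - test_rate)
print(overall_rate, train_rate, test_rate, baseline_demo)
```

With a fixed `random_state`, a stratified split is also reproducible, which the unseeded `jobs.sample(frac=1)` loop is not.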


In [9]:
lr = LogisticRegression(solver='newton-cg').fit(X_train, y_train)
cross_val_score(lr, X_train, y_train, cv=4, scoring="accuracy")

array([0.95495495, 0.93243243, 0.95045045, 0.94144144])

In [10]:
y_train_pred = cross_val_predict(lr, X_train, y_train, cv=4)
print('confusion matrix\n', confusion_matrix(y_train, y_train_pred))
print('precision score', precision_score(y_train, y_train_pred))
print('recall score', recall_score(y_train, y_train_pred))
print('f1 score', f1_score(y_train, y_train_pred))
print('roc auc score', roc_auc_score(y_train, y_train_pred))

confusion matrix
 [[806   6]
 [ 43  33]]
precision score 0.8461538461538461
recall score 0.4342105263157895
f1 score 0.5739130434782609
roc auc score 0.7134106818771064
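The ROC AUC above is computed from hard 0/1 predictions, which collapses the ROC curve to a single operating point. A minimal sketch of the more common approach, scoring out-of-fold predicted probabilities, is shown below. It uses a synthetic imbalanced dataset from `make_classification` (an assumption for illustration, not the job postings data), so the numbers it prints are unrelated to the results in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

# Synthetic imbalanced data (hypothetical): about 10% positives.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=20, n_informative=5,
    weights=[0.9, 0.1], random_state=0,
)

clf = LogisticRegression(max_iter=1000)

# Out-of-fold hard labels (what the cell above passes to roc_auc_score).
labels = cross_val_predict(clf, X_demo, y_demo, cv=4)

# Out-of-fold probabilities for the positive class.
proba = cross_val_predict(clf, X_demo, y_demo, cv=4, method="predict_proba")[:, 1]

auc_from_labels = roc_auc_score(y_demo, labels)
auc_from_proba = roc_auc_score(y_demo, proba)
print("AUC from hard labels:  ", auc_from_labels)
print("AUC from probabilities:", auc_from_proba)
```

The probability-based score ranks every posting rather than relying on the default 0.5 threshold, which matters when positives are rare.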


In [11]:
# Note: cross_val_predict refits lr on folds of the test set, so the scores below
# describe models trained on test data only, not the model fit on X_train.
# lr.predict(X_test) would evaluate the trained model instead.
y_test_pred = cross_val_predict(lr, X_test, y_test, cv=4)
print('confusion matrix\n', confusion_matrix(y_test, y_test_pred))
print('precision score', precision_score(y_test, y_test_pred))
print('recall score', recall_score(y_test, y_test_pred))
print('f1 score', f1_score(y_test, y_test_pred))
print('roc auc score', roc_auc_score(y_test, y_test_pred))

confusion matrix
 [[199   5]
 [ 16   2]]
precision score 0.2857142857142857
recall score 0.1111111111111111
f1 score 0.16
roc auc score 0.5433006535947713


In [12]:
rnd_clf = RandomForestClassifier(n_estimators=50, max_leaf_nodes=50, n_jobs=-2)
rnd_clf.fit(X_train, y_train)
y_pred_rf = rnd_clf.predict(X_test)

In [13]:
print('confusion matrix\n', confusion_matrix(y_test, y_pred_rf))
print('precision score', precision_score(y_test, y_pred_rf))
print('recall score', recall_score(y_test, y_pred_rf))
print('f1 score', f1_score(y_test, y_pred_rf))
print('roc auc score', roc_auc_score(y_test, y_pred_rf))

confusion matrix
 [[204   0]
 [  9   9]]
precision score 1.0
recall score 0.5
f1 score 0.6666666666666666
roc auc score 0.75
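Since the question asks which features distinguish data scientist postings, one natural follow-up to the random forest is to rank its `feature_importances_`. The sketch below shows the idea on a small synthetic frame; the column names are borrowed from the feature list above, but the values and the target rule are made up for illustration. On the real data, the same ranking could be applied to `rnd_clf` and `X_train.columns`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
n_rows = 500

# Hypothetical feature values; only the column names echo the notebook.
X_demo = pd.DataFrame({
    "machine learning": rng.rand(n_rows),
    "written communication": rng.rand(n_rows),
    "years experience": rng.rand(n_rows),
})

# Made-up target that depends only on the first column.
y_demo = (X_demo["machine learning"] > 0.8).astype(int)

forest = RandomForestClassifier(n_estimators=50, max_leaf_nodes=50, random_state=0)
forest.fit(X_demo, y_demo)

# Impurity-based importances, labelled by column and sorted descending.
importances = pd.Series(
    forest.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)
print(importances)
```

Impurity-based importances can favor high-cardinality or continuous features, so they are best read as a rough ranking rather than an effect size.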


In [14]:
svmclf = SVC(kernel='linear')  
svmclf.fit(X_train, y_train)
y_pred_svm = svmclf.predict(X_test)

In [15]:
print('confusion matrix\n', confusion_matrix(y_test, y_pred_svm))
print('precision score', precision_score(y_test, y_pred_svm))
print('recall score', recall_score(y_test, y_pred_svm))
print('f1 score', f1_score(y_test, y_pred_svm))
print('roc auc score', roc_auc_score(y_test, y_pred_svm))

confusion matrix
 [[203   1]
 [ 18   0]]
precision score 0.0
recall score 0.0
f1 score 0.0
roc auc score 0.49754901960784315


In [16]:
mnvclf = MultinomialNB()  
mnvclf.fit(X_train, y_train)
y_pred_mnv = mnvclf.predict(X_test)

In [17]:
print('confusion matrix\n', confusion_matrix(y_test, y_pred_mnv))
print('precision score', precision_score(y_test, y_pred_mnv))
print('recall score', recall_score(y_test, y_pred_mnv))
print('f1 score', f1_score(y_test, y_pred_mnv))
print('roc auc score', roc_auc_score(y_test, y_pred_mnv))

confusion matrix
 [[178  26]
 [ 15   3]]
precision score 0.10344827586206896
recall score 0.16666666666666666
f1 score 0.1276595744680851
roc auc score 0.5196078431372549
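To compare the four models side by side, the test-set confusion matrices printed above can be collected and summarized in one place. The sketch below copies those matrices verbatim and derives accuracy, precision, recall, and F1 from them; the `summarize` helper and the dictionary keys are names introduced here for illustration. Keep in mind that the logistic regression matrix came from `cross_val_predict` on the test set, so it is not strictly comparable to the other three.

```python
import numpy as np

# Test-set confusion matrices copied from the outputs above,
# laid out as [[TN, FP], [FN, TP]] (scikit-learn's convention).
test_matrices = {
    "LogisticRegression": np.array([[199, 5], [16, 2]]),
    "RandomForest": np.array([[204, 0], [9, 9]]),
    "SVC (linear)": np.array([[203, 1], [18, 0]]),
    "MultinomialNB": np.array([[178, 26], [15, 3]]),
}

def summarize(cm):
    """Derive accuracy, precision, recall and F1 from a 2x2 confusion matrix."""
    tn, fp, fn, tp = (int(v) for v in cm.ravel())
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

summary = {name: summarize(cm) for name, cm in test_matrices.items()}

for name, scores in summary.items():
    print(name, {metric: round(value, 4) for metric, value in scores.items()})
```

Read against the baseline accuracy of about 0.919, only the random forest (accuracy about 0.959) beats always predicting the majority class on this test set.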
