For Q2, we attempt to classify whether a job posting is data science related using only the job description as a feature.

The target label is first derived from keywords in the job title column.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import TfidfVectorizer
from collections import Counter

%matplotlib inline

In [2]:
jobs = pd.read_pickle('jobs2.pkl')
jobs

Unnamed: 0,job_title,company,location,job_desc,full_time,part_time,temporary,internship,contract,permanent,salary
0,Product Lead,unavailable,Tiong Bahru Estate,Mission: Your mission is to build a world cl...,0,0,0,0,0,0,
1,Product Manager,99.co,Singapore,99.co is looking for an experienced Product Ma...,0,0,0,0,0,0,7800.0
2,Regional Marketing Manager GC GCMS,Thermo Fisher Scientific,Woodlands,Job Description Job Title: Regional Marketing ...,0,0,0,0,0,0,
3,Regional Product Specialist Insurance,unavailable,Outram,The Job: Reporting directly to the Regional ...,1,0,0,0,0,1,
4,Interchange and Pricing Analyst,Adyen,Singapore,Adyen is looking for a Interchange & Pricing A...,0,0,0,0,0,0,
5,VP Data Analyst GTO,United Overseas Bank,Singapore,Functional area: Business Technology Services ...,0,0,0,0,0,0,
6,Desktop Engineer,unavailable,Singapore,"Be a ‘roaming’ engineer, moving to and coverin...",1,0,0,0,0,1,
7,Consultant Senior Consultant Banking Business...,Synpulse,Singapore,"Responsibilities: Conduct research, data coll...",0,0,0,0,0,0,
8,R&D Programme Manager,ST Electronics (Satcom & Sensor Systems) Pte Ltd,Singapore,"Job Responsibilities: Project planning, manage...",0,0,0,0,0,0,
9,Associate Specialist Cybersecurity DSP CLT ...,MSD,Singapore,The trainee will undergo a 9 - 12-month progr...,0,0,0,0,0,0,


In [3]:
jobs['datascience_job'] = 0

In [4]:
# Label each posting as data science related (1) or not (0) from keywords in its job title.
# The keyword tests are plain substring checks, and .iat assumes the default 0..n-1 index.
for idx, title in zip(jobs['job_title'].index, jobs['job_title'].values):
    t = title.lower()
    if 'data' in t and 'analyst' in t:
        jobs.iat[idx, -1] = 1
    elif 'data' in t and 'scientist' in t:
        jobs.iat[idx, -1] = 1
    elif 'data' in t and 'engineer' in t:
        jobs.iat[idx, -1] = 1
    elif 'business' in t and 'analyst' in t:
        jobs.iat[idx, -1] = 1
    elif 'business' in t and 'intelligence' in t:
        jobs.iat[idx, -1] = 1
    elif 'machine learning' in t or 'deep learning' in t or 'ml' in t:
        jobs.iat[idx, -1] = 1
    elif 'artificial intelligence' in t or 'ai' in t:
        jobs.iat[idx, -1] = 1
    else:
        jobs.iat[idx, -1] = 0
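
Because the short keywords are matched as substrings, `'ai'` also fires on titles such as "Retail Manager" and `'ml'` on "HTML Developer". The block below is a minimal sketch of a stricter rule, assuming the abbreviations are meant to match only as whole words. `is_data_science_title` is a hypothetical helper, not part of the notebook, and using it would change the label counts reported in the next cell.

```python
import re

# Keyword pairs that must both appear in the title (same pairs as the loop).
PAIRS = [
    ("data", "analyst"),
    ("data", "scientist"),
    ("data", "engineer"),
    ("business", "analyst"),
    ("business", "intelligence"),
]
# Longer phrases are safe to match as substrings.
PHRASES = ["machine learning", "deep learning", "artificial intelligence"]
# Short abbreviations are matched only as whole words.
ABBREVIATIONS = re.compile(r"\b(ml|ai)\b")


def is_data_science_title(title):
    """Hypothetical helper: return 1 if the title looks data science related, else 0."""
    text = title.lower()
    if any(a in text and b in text for a, b in PAIRS):
        return 1
    if any(phrase in text for phrase in PHRASES):
        return 1
    return int(ABBREVIATIONS.search(text) is not None)
```

Applied to the notebook's data this would be a single call such as `jobs['job_title'].map(is_data_science_title)`, which also avoids the positional `.iat` writes.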

In [5]:
jobs['datascience_job'].value_counts()

0    2160
1     854
Name: datascience_job, dtype: int64

In [6]:
pd.to_pickle(jobs, 'jobs_ds')

In [7]:
X = jobs.loc[:, 'job_desc']
y = jobs.loc[:, 'datascience_job']

In [8]:
# Using the textacy package to do some more comprehensive preprocessing
from textacy.preprocess import preprocess_text

clean_jd = [preprocess_text(desc, fix_unicode=True, lowercase=True, transliterate=False,
                              no_urls=True, no_emails=True, no_phone_numbers=True, no_currency_symbols=True,
                              no_punct=True, no_accents=True)
              for desc in X]

In [9]:
# Vectorise the cleaned descriptions with TF-IDF (1- to 4-grams, 2500 most frequent features)
tfv = TfidfVectorizer(ngram_range=(1,4), max_features=2500)
X_jd = tfv.fit_transform(clean_jd).toarray()
X_dfjd = pd.DataFrame(X_jd, columns=tfv.get_feature_names())
print(X_dfjd.shape)

(3014, 2500)
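
The vectorizer here is fitted on all 3014 descriptions before any train/test split, so the vocabulary and IDF weights are influenced by documents that later end up in the test set. The block below is a minimal pure-Python sketch of keeping the two steps separate, assuming the smoothed IDF formula that scikit-learn documents as its default, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation. `fit_idf` and `transform` are hypothetical names, and the toy documents are made up for illustration.

```python
import math
from collections import Counter


def fit_idf(train_docs):
    """Learn smoothed IDF weights from the training documents only."""
    n = len(train_docs)
    df = Counter()
    for doc in train_docs:
        df.update(set(doc.split()))  # count each term once per document
    return {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}


def transform(doc, idf):
    """Map one document to L2-normalised TF-IDF weights using a fitted vocabulary."""
    tf = Counter(t for t in doc.split() if t in idf)  # terms unseen in training are dropped
    weights = {t: c * idf[t] for t, c in tf.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}


train_docs = ["python sql data", "sales marketing data"]  # toy examples
idf = fit_idf(train_docs)
test_vec = transform("python data tableau", idf)  # 'tableau' never appeared in training
```

With scikit-learn the same separation is usually achieved by splitting first, then calling `fit_transform` on the training text and `transform` on the test text.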


In [10]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_dfjd, y, test_size = 0.25)
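
The split above sets no seed and is not stratified, so each run of the cell produces a different test set with a slightly different class mix. The block below is a pure-Python sketch of a seeded, stratified split, intended only to show the idea. `stratified_split` is a hypothetical helper, and scikit-learn's `train_test_split` offers the same behaviour through its `random_state` and `stratify` arguments.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.25, seed=0):
    """Return (train_idx, test_idx) with each class split in the same proportion."""
    rng = random.Random(seed)  # a fixed seed makes the split repeatable
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


labels = [0] * 2160 + [1] * 854  # class counts reported by value_counts() earlier
train_idx, test_idx = stratified_split(labels, test_size=0.25, seed=42)
```

Calling the function again with the same seed returns the same indices, which keeps metrics comparable across cells that are re-run later.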

In [11]:
# majority class is 0
base_accuracy = 1 - y.mean()
print('Base Accuracy Score: ' + str(base_accuracy))

Base Accuracy Score: 0.7166556071665561
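
The baseline is the accuracy of always predicting the majority class, so it can be checked directly from the class counts. The short sketch below reproduces the arithmetic from the `value_counts()` output. Note that this baseline is computed on the full data set. On the 754-row test split used for logistic regression the majority share is 524/754, about 0.695.

```python
counts = {0: 2160, 1: 854}  # value_counts() output from In [5]

majority = max(counts, key=counts.get)  # class with the most rows
base_accuracy = counts[majority] / sum(counts.values())

print(majority, round(base_accuracy, 4))  # 0 0.7167
```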


In [12]:
# Running Logistic Regression to classify test set:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

logreg = LogisticRegression()
logreg.fit(X_train, y_train)

y_pred = logreg.predict(X_test)

print('Accuracy Score: ' + str(logreg.score(X_test, y_test)))
print('=============================================================')
print('Classification Report:')
print(classification_report(y_test, y_pred))
print('=============================================================')
print('Confusion Matrix:')
print(pd.DataFrame(confusion_matrix(y_test, y_pred)))

Accuracy Score: 0.809018567639
Classification Report:
             precision    recall  f1-score   support

          0       0.81      0.95      0.87       524
          1       0.82      0.48      0.60       230

avg / total       0.81      0.81      0.79       754

Confusion Matrix:
     0    1
0  500   24
1  120  110
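
The figures in the classification report follow directly from the confusion matrix, where rows are true labels and columns are predicted labels. The block below is a small sketch that recomputes the class-1 metrics from the matrix printed above. `class_metrics` is a hypothetical helper written for this check.

```python
def class_metrics(cm, positive=1):
    """Precision, recall and F1 for one class from a square confusion matrix.

    cm[i][j] counts samples with true label i that were predicted as label j.
    """
    tp = cm[positive][positive]
    fp = sum(row[positive] for row in cm) - tp  # predicted positive, actually not
    fn = sum(cm[positive]) - tp                 # actually positive, predicted not
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


cm = [[500, 24], [120, 110]]  # confusion matrix printed above
precision, recall, f1 = class_metrics(cm)
accuracy = (cm[0][0] + cm[1][1]) / sum(map(sum, cm))

print(round(precision, 2), round(recall, 2), round(f1, 2), round(accuracy, 3))
# 0.82 0.48 0.6 0.809
```

The low recall means 120 of the 230 data science postings in the test set were classified as class 0.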


In [23]:
# Running Random forest with gridsearch to classify test set:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = { 
           "n_estimators" : range(1,40,1),
           "max_depth" : range(1,40,1),
}

rfc = GridSearchCV(RandomForestClassifier(), param_grid, cv=5, verbose=1, n_jobs=3, scoring='f1')
rfc.fit(X_train, y_train)

best_rfc = rfc.best_estimator_
print('Best Parameters: ' + str(rfc.best_params_))
print('=============================================================')
# scoring='f1', so best_score_ is the mean cross-validated F1 score, not accuracy
print('Best CV F1 Score: ' + str(rfc.best_score_))
print('=============================================================')
print('Classification Report:')
print(classification_report(y_test, rfc.predict(X_test)))
print('=============================================================')
print('Confusion Matrix:')
print(pd.DataFrame(confusion_matrix(y_test, rfc.predict(X_test))))

Fitting 5 folds for each of 1521 candidates, totalling 7605 fits


[Parallel(n_jobs=3)]: Done  44 tasks      | elapsed:   15.6s
[Parallel(n_jobs=3)]: Done 194 tasks      | elapsed:   28.5s
[Parallel(n_jobs=3)]: Done 444 tasks      | elapsed:   50.0s
[Parallel(n_jobs=3)]: Done 794 tasks      | elapsed:  1.4min
[Parallel(n_jobs=3)]: Done 1244 tasks      | elapsed:  2.2min
[Parallel(n_jobs=3)]: Done 1794 tasks      | elapsed:  3.4min
[Parallel(n_jobs=3)]: Done 2444 tasks      | elapsed:  5.1min
[Parallel(n_jobs=3)]: Done 3194 tasks      | elapsed:  7.2min
[Parallel(n_jobs=3)]: Done 4044 tasks      | elapsed:  9.9min
[Parallel(n_jobs=3)]: Done 4994 tasks      | elapsed: 13.3min
[Parallel(n_jobs=3)]: Done 6044 tasks      | elapsed: 17.5min
[Parallel(n_jobs=3)]: Done 7194 tasks      | elapsed: 22.3min
[Parallel(n_jobs=3)]: Done 7605 out of 7605 | elapsed: 24.2min finished


Best Parameters: {'max_depth': 37, 'n_estimators': 38}
Best CV F1 Score: 0.644860752215
Classification Report:
             precision    recall  f1-score   support

          0       0.81      0.96      0.88       532
          1       0.83      0.45      0.58       222

avg / total       0.82      0.81      0.79       754

Confusion Matrix:
     0    1
0  512   20
1  122  100
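
The 7605 fits in the log come from the size of the parameter grid: 39 values for each of two parameters, times 5 folds. The sketch below reproduces that count and compares it with a coarser grid. The coarse values are hypothetical and chosen only for illustration, not tuned for this data.

```python
from itertools import product

cv_folds = 5

# Grid used in the cell above: every integer from 1 to 39 for both parameters.
param_grid = {
    "n_estimators": range(1, 40, 1),
    "max_depth": range(1, 40, 1),
}
n_candidates = len(list(product(*param_grid.values())))
n_fits = n_candidates * cv_folds
print(n_candidates, n_fits)  # 1521 7605

# Hypothetical coarser grid (illustrative values only).
coarse_grid = {
    "n_estimators": [10, 20, 40],
    "max_depth": [5, 10, 20, 40],
}
coarse_fits = len(list(product(*coarse_grid.values()))) * cv_folds
print(coarse_fits)  # 60
```

Both best parameters sit near the upper edge of the searched range (37 and 38 out of a maximum of 39), which suggests the range itself may be too narrow rather than too coarse.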


## Handling the imbalance in class labels

### Undersampling the majority class

In [13]:
from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler(ratio='majority', random_state=1)
X_train_rus, y_train_rus = rus.fit_sample(X_train, y_train)

print("Distribution of class labels before resampling {}".format(Counter(y_train)))
print("Distribution of class labels after resampling {}".format(Counter(y_train_rus)))

Distribution of class labels before resampling Counter({0: 1636, 1: 624})
Distribution of class labels after resampling Counter({0: 624, 1: 624})
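
Random undersampling keeps every minority-class row and a random subset of the majority class of the same size. It is applied to the training data only, so the test set keeps its original class mix. The block below is a pure-Python stand-in that shows the idea, not imbalanced-learn's implementation. `random_undersample` is a hypothetical helper, and the label list is rebuilt from the counts printed above.

```python
import random
from collections import Counter


def random_undersample(labels, seed=1):
    """Return sorted indices with every class reduced to the minority class size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())  # size of the smallest class
    kept = []
    for cls in counts:
        indices = [i for i, label in enumerate(labels) if label == cls]
        kept.extend(rng.sample(indices, target))  # sample without replacement
    return sorted(kept)


y_train_toy = [0] * 1636 + [1] * 624  # training class counts printed above
kept = random_undersample(y_train_toy)
balanced = Counter(y_train_toy[i] for i in kept)
```

Newer imbalanced-learn releases rename the `ratio` argument to `sampling_strategy` and `fit_sample` to `fit_resample`, so the cell above may need those names on a current install.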


In [14]:
# Running Logistic Regression to classify test set:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

logreg = LogisticRegression()
logreg.fit(X_train_rus,  y_train_rus)

y_pred = logreg.predict(X_test)

print('Accuracy Score: ' + str(logreg.score(X_test, y_test)))
print('=============================================================')
print('Classification Report:')
print(classification_report(y_test, y_pred))
print('=============================================================')
print('Confusion Matrix:')
print(pd.DataFrame(confusion_matrix(y_test, y_pred)))

Accuracy Score: 0.831564986737
Classification Report:
             precision    recall  f1-score   support

          0       0.91      0.84      0.87       524
          1       0.69      0.81      0.75       230

avg / total       0.84      0.83      0.84       754

Confusion Matrix:
     0    1
0  440   84
1   43  187


In [18]:
# Running Random forest with gridsearch to classify test set:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = { 
           "n_estimators" : range(1,40,1),
           "max_depth" : range(1,40,1),
}

rfc = GridSearchCV(RandomForestClassifier(), param_grid, cv=5, verbose=1, n_jobs=3, scoring='f1')
rfc.fit(X_train_rus, y_train_rus)

best_rfc = rfc.best_estimator_
print('Best Parameters: ' + str(rfc.best_params_))
print('=============================================================')
# scoring='f1', so best_score_ is the mean cross-validated F1 score, not accuracy
print('Best CV F1 Score: ' + str(rfc.best_score_))
print('=============================================================')
print('Classification Report:')
print(classification_report(y_test, rfc.predict(X_test)))
print('=============================================================')
print('Confusion Matrix:')
print(pd.DataFrame(confusion_matrix(y_test, rfc.predict(X_test))))

Fitting 5 folds for each of 1521 candidates, totalling 7605 fits


[Parallel(n_jobs=3)]: Done  44 tasks      | elapsed:    3.1s
[Parallel(n_jobs=3)]: Done 194 tasks      | elapsed:   11.5s
[Parallel(n_jobs=3)]: Done 444 tasks      | elapsed:   25.7s
[Parallel(n_jobs=3)]: Done 794 tasks      | elapsed:   49.6s
[Parallel(n_jobs=3)]: Done 1244 tasks      | elapsed:  1.3min
[Parallel(n_jobs=3)]: Done 1794 tasks      | elapsed:  2.1min
[Parallel(n_jobs=3)]: Done 2444 tasks      | elapsed:  3.1min
[Parallel(n_jobs=3)]: Done 3194 tasks      | elapsed:  4.3min
[Parallel(n_jobs=3)]: Done 4044 tasks      | elapsed:  5.8min
[Parallel(n_jobs=3)]: Done 4994 tasks      | elapsed:  7.4min
[Parallel(n_jobs=3)]: Done 6044 tasks      | elapsed:  9.3min
[Parallel(n_jobs=3)]: Done 7194 tasks      | elapsed: 11.4min
[Parallel(n_jobs=3)]: Done 7605 out of 7605 | elapsed: 12.2min finished


Best Parameters: {'max_depth': 22, 'n_estimators': 38}
Best CV F1 Score: 0.782913449934
Classification Report:
             precision    recall  f1-score   support

          0       0.90      0.78      0.84       524
          1       0.62      0.80      0.69       230

avg / total       0.81      0.79      0.79       754

Confusion Matrix:
     0    1
0  410  114
1   47  183


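Both models beat the baseline accuracy of 0.717 on the test set, with or without undersampling. What undersampling changes is the balance between the classes: recall on the data science class rises from 0.48 to 0.81 for logistic regression and from 0.45 to 0.80 for random forest, while precision on that class falls (0.82 to 0.69 and 0.83 to 0.62). Overall accuracy improves for logistic regression (0.81 to 0.83) but drops slightly for random forest (0.81 to 0.79). The random forest run without undersampling (In [23]) was evaluated on a different random split (test support 532/222 rather than 524/230), so that comparison is only approximate.

After undersampling, logistic regression is the better of the two models. It scores higher on accuracy (0.83 vs 0.79) and on class-1 precision (0.69 vs 0.62), recall (0.81 vs 0.80) and F1 (0.75 vs 0.69).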