# Major tasks accomplished in this notebook
- 1) Recode the outcome variable (CASE_STATUS) as binary (0 or 1)
- 2) Pre-process the job addendums
- 3) Create a job posting classifier using TF-IDF
- 4) Create and run a Multinomial Naive Bayes model to see how well the text features predict CASE_STATUS

## Pre-Modeling

In [2]:
## Read in the combined dataset
import pandas as pd

combined_certificate_postings = pd.read_csv('combined_certificate_postings.csv')
combined_certificate_postings.head()

Unnamed: 0.1,Unnamed: 0,CASE_NUMBER,CASE_STATUS,combined_job_postings
0,0,H-300-20265-835437,Determination Issued - Certification,the most economical and reasonable charges for...
1,1,H-300-20260-827678,Determination Issued - Certification,Incoming transportation and subsistence advanc...
2,2,H-300-20260-827308,Determination Issued - Certification,An employee may be terminated for just cause. ...
3,3,H-300-20258-821801,Determination Issued - Certification,"ELECTRONIC COMMUNICATION\nCell phones, along w..."
4,4,H-300-20258-821682,Determination Issued - Certification,Employee Expectations and Behavior Continued:\...


### Classify the outcome variable CASE_STATUS into a binary (0 or 1) label

In [3]:
## 1) check the category
combined_certificate_postings.CASE_STATUS.unique()
combined_certificate_postings['CASE_STATUS'].value_counts()


array(['Determination Issued - Certification',
       'Determination Issued - Denied',
       'Determination Issued - Partial Certification',
       'Determination Issued - Certification (Expired)',
       'Determination Issued - Partial Certification (Expired)'],
      dtype=object)

Determination Issued - Certification                      11165
Determination Issued - Certification (Expired)             1812
Determination Issued - Denied                               168
Determination Issued - Partial Certification                 80
Determination Issued - Partial Certification (Expired)       25
Name: CASE_STATUS, dtype: int64

In [5]:
## 2) Classify the CASE_STATUS into binary variables (0 and 1)
## certification + certification (expired) >> 1 (approved cases)
## partial certification + partial certification (expired) + denied >> 0 (not fully approved cases)
import numpy as np

combined_certificate_postings['CASE_OUTCOME'] = np.where(combined_certificate_postings['CASE_STATUS'].str.contains('Denied|Partial'), 0, 1)


In [6]:
## 3) check that the coding to a binary variable worked
combined_certificate_postings['CASE_OUTCOME'].unique()
combined_certificate_postings['CASE_STATUS'].value_counts()
combined_certificate_postings['CASE_OUTCOME'].value_counts()
# the numbers add up: 11165 + 1812 = 12977 and 168 + 80 + 25 = 273

array([1, 0])

Determination Issued - Certification                      11165
Determination Issued - Certification (Expired)             1812
Determination Issued - Denied                               168
Determination Issued - Partial Certification                 80
Determination Issued - Partial Certification (Expired)       25
Name: CASE_STATUS, dtype: int64

1    12977
0      273
Name: CASE_OUTCOME, dtype: int64
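
The check above compares two separate `value_counts()` outputs by eye. A cross-tabulation makes the same check explicit by showing which status string landed in which outcome. The sketch below is only an illustration on a toy frame: it reuses the five status strings and the `np.where` rule from the notebook, but the one-row-per-status data is made up.

```python
import numpy as np
import pandas as pd

# Toy frame with the five status strings from the notebook (one row each, illustrative only).
statuses = [
    "Determination Issued - Certification",
    "Determination Issued - Certification (Expired)",
    "Determination Issued - Denied",
    "Determination Issued - Partial Certification",
    "Determination Issued - Partial Certification (Expired)",
]
toy = pd.DataFrame({"CASE_STATUS": statuses})

# Same rule as the notebook: anything containing "Denied" or "Partial" becomes 0.
toy["CASE_OUTCOME"] = np.where(toy["CASE_STATUS"].str.contains("Denied|Partial"), 0, 1)

# Rows are statuses, columns are outcomes; each status should fall in exactly one column.
check = pd.crosstab(toy["CASE_STATUS"], toy["CASE_OUTCOME"])
print(check)
```

On the real data, the same `pd.crosstab` call on `combined_certificate_postings` should show every status with a non-zero count in only one of the two columns.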

### Randomly Select 1000 of Our Certified Cases (Labelled as 1)

In [14]:
## subset to certified cases (labelled 1) and not certified cases (labelled 0)
certified = combined_certificate_postings[combined_certificate_postings["CASE_OUTCOME"]==1]
notcertified = combined_certificate_postings[combined_certificate_postings["CASE_OUTCOME"]==0]
## should be 12977 rows
certified.shape
## should be 273 rows
notcertified.shape

(12977, 5)

(273, 5)

In [12]:
## randomly select 1000 positive cases
certified_1000 = certified.sample(n = 1000)
certified_1000.shape
certified_1000.head()

(1000, 5)

Unnamed: 0.1,Unnamed: 0,CASE_NUMBER,CASE_STATUS,combined_job_postings,CASE_OUTCOME
6283,6283,H-300-20020-263904,Determination Issued - Certification,All workers are required to follow common sani...,1
1350,1350,H-300-20163-646331,Determination Issued - Certification,Incoming transportation and subsistence advanc...,1
9862,9862,H-300-19326-162101,Determination Issued - Certification (Expired),H-2A workers must depart the United States at ...,1
8840,8840,H-300-19350-199601,Determination Issued - Certification,ELECTRONIC COMMUNICATION\nCell phones along wi...,1
10637,10637,H-300-19302-115364,Determination Issued - Certification,Incoming transportation and subsistence advanc...,1


In [16]:
## rowbind back to our notcertified dataset
combined_selected = pd.concat([certified_1000, notcertified])
combined_selected.shape

(1273, 5)
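
The `sample(n = 1000)` call above has no seed, so each run of the notebook draws a different set of certified cases. If repeatable results are wanted, one option is to pass `random_state`. The sketch below shows this on a hypothetical toy frame; the seed value and the toy sizes are assumptions, not values from the notebook.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's frame: 20 certified rows, 5 not certified.
toy = pd.DataFrame({"CASE_OUTCOME": [1] * 20 + [0] * 5})

SEED = 42  # assumed value; any fixed integer gives a repeatable draw

certified = toy[toy["CASE_OUTCOME"] == 1]
notcertified = toy[toy["CASE_OUTCOME"] == 0]

# Two draws with the same seed select exactly the same rows.
sample_a = certified.sample(n=10, random_state=SEED)
sample_b = certified.sample(n=10, random_state=SEED)

combined = pd.concat([sample_a, notcertified])
print(combined["CASE_OUTCOME"].value_counts())
```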

## Run Text Processing


### Convert characters to lower case

In [17]:
## lower case
import time

start_time = time.time()
combined_selected['postings_lower'] = combined_selected['combined_job_postings'].apply(lambda x: x.lower())
print("--- %s seconds ---" % (time.time() - start_time))


--- 0.12079405784606934 seconds ---


### Tokenize the words

In [18]:
## tokenize
from nltk.tokenize import word_tokenize

start_time = time.time()
combined_selected['postings_tokenized'] = combined_selected['postings_lower'].apply(word_tokenize)
print("--- %s seconds ---" % (time.time() - start_time))

--- 13.432610988616943 seconds ---


### Remove stopwords

In [21]:
## define stopwords
from nltk.corpus import stopwords

other_stopwords = ["after", "before", "employer", "employ", "job", "although", "provide", "complete","hour","time",
                  "begin","list","require","task","transportation","worker","workers","working","work","worked","works"]

## combine NLTK's English stopwords with the custom ones once; a set makes membership checks fast
stopwords_complete = set(stopwords.words("english") + other_stopwords)
start_time = time.time()
## remove those
combined_selected['posting_without_stopwords'] = combined_selected['postings_tokenized'].apply(lambda x: [word for word in x if word not in stopwords_complete])

print("--- %s seconds ---" % (time.time() - start_time))

--- 4.602038145065308 seconds ---


### Perform Stemming

In [22]:
## stemming
from nltk.stem import PorterStemmer

start_time = time.time()
porter = PorterStemmer()
combined_selected['stemmed'] = combined_selected['posting_without_stopwords'].apply(lambda x: [porter.stem(y) for y in x]) # Stem every word.
print("--- %s seconds ---" % (time.time() - start_time))

--- 20.114768028259277 seconds ---


### Remove punctuation and words of 3 or fewer characters

In [23]:
## keep only alphabetic tokens (isalpha()) that are longer than 3 characters
combined_selected['cleaned']=combined_selected['stemmed'].apply(lambda x: [word for word in x if word.isalpha() and len(word)>3])
combined_selected['cleaned']

6283     [requir, follow, common, sanitari, practic, ti...
1350     [incom, subsist, complet, contract, deduct, ac...
9862     [must, depart, unit, state, complet, contract,...
8840     [electron, commun, cell, phone, along, suffici...
10637    [incom, subsist, complet, contract, deduct, ac...
                               ...                        
13245              [paid, done, arizona, done, california]
13246    [rule, guidanc, regard, accept, conduct, stand...
13247    [upon, complet, contract, dismiss, earlier, re...
13248    [april, middl, irrig, detail, around, farm, us...
13249    [hous, offer, hous, provid, hous, clean, compl...
Name: cleaned, Length: 1273, dtype: object

### Join the tokens back into a single string


In [24]:
combined_selected['cleaned']=combined_selected['cleaned'].apply(lambda x: " ".join(x))
combined_selected['cleaned']

6283     requir follow common sanitari practic time par...
1350     incom subsist complet contract deduct accord a...
9862     must depart unit state complet contract period...
8840     electron commun cell phone along suffici minut...
10637    incom subsist complet contract deduct accord a...
                               ...                        
13245                    paid done arizona done california
13246    rule guidanc regard accept conduct standard ge...
13247    upon complet contract dismiss earlier reason c...
13248    april middl irrig detail around farm usag plum...
13249    hous offer hous provid hous clean complianc ap...
Name: cleaned, Length: 1273, dtype: object
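
The cleaning steps above can also be read as one function applied to each posting. The sketch below is a simplified, dependency-free illustration and not the notebook's pipeline: it swaps NLTK's `word_tokenize` for a regular expression, uses a tiny made-up stopword set, and leaves out stemming. The function name `clean_posting` is hypothetical.

```python
import re

# Tiny illustrative stopword set; the notebook uses NLTK's English list plus custom terms.
STOPWORDS = {"the", "are", "to", "all", "workers", "work"}

def clean_posting(text):
    """Lowercase, tokenize, drop stopwords, keep alphabetic tokens longer than 3 characters."""
    # Regex stand-in for word_tokenize: runs of letters, digits and a few symbols.
    tokens = re.findall(r"[a-z0-9'%/-]+", text.lower())
    kept = [t for t in tokens if t not in STOPWORDS]
    # Same filter as the notebook: alphabetic only, and strictly more than 3 characters.
    kept = [t for t in kept if t.isalpha() and len(t) > 3]
    return " ".join(kept)

print(clean_posting("All workers are required to follow common sanitary practices, 50% of the time."))
```

Note that the length filter uses `len(t) > 3`, so three-letter words are dropped along with shorter ones.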

## Modeling 

### Split the data into training and testing

In [25]:
combined_selected.head()
combined_selected['CASE_OUTCOME'].value_counts()

Unnamed: 0.1,Unnamed: 0,CASE_NUMBER,CASE_STATUS,combined_job_postings,CASE_OUTCOME,postings_lower,postings_tokenized,posting_without_stopwords,stemmed,cleaned
6283,6283,H-300-20020-263904,Determination Issued - Certification,All workers are required to follow common sani...,1,all workers are required to follow common sani...,"[all, workers, are, required, to, follow, comm...","[required, follow, common, sanitary, practices...","[requir, follow, common, sanitari, practic, ti...",requir follow common sanitari practic time par...
1350,1350,H-300-20163-646331,Determination Issued - Certification,Incoming transportation and subsistence advanc...,1,incoming transportation and subsistence advanc...,"[incoming, transportation, and, subsistence, a...","[incoming, subsistence, advanced/paid, 50, %, ...","[incom, subsist, advanced/paid, 50, %, complet...",incom subsist complet contract deduct accord a...
9862,9862,H-300-19326-162101,Determination Issued - Certification (Expired),H-2A workers must depart the United States at ...,1,h-2a workers must depart the united states at ...,"[h-2a, workers, must, depart, the, united, sta...","[h-2a, must, depart, united, states, completio...","[h-2a, must, depart, unit, state, complet, con...",must depart unit state complet contract period...
8840,8840,H-300-19350-199601,Determination Issued - Certification,ELECTRONIC COMMUNICATION\nCell phones along wi...,1,electronic communication\ncell phones along wi...,"[electronic, communication, cell, phones, alon...","[electronic, communication, cell, phones, alon...","[electron, commun, cell, phone, along, suffici...",electron commun cell phone along suffici minut...
10637,10637,H-300-19302-115364,Determination Issued - Certification,Incoming transportation and subsistence advanc...,1,incoming transportation and subsistence advanc...,"[incoming, transportation, and, subsistence, a...","[incoming, subsistence, advanced/paid, 50, %, ...","[incom, subsist, advanced/paid, 50, %, complet...",incom subsist complet contract deduct accord a...


1    1000
0     273
Name: CASE_OUTCOME, dtype: int64

In [26]:
from sklearn.model_selection import train_test_split
train,test=train_test_split(combined_selected,test_size=0.2)
# ## check the shape of the test, train and original dataset
# train.shape
# test.shape
# combined_certificate_postings.shape
# ## write the test and train data into separate csv
# train.to_csv('train_addendum&outcome.csv')
# test.to_csv('test_addendum&outcome.csv')

In [27]:
train.head()

Unnamed: 0.1,Unnamed: 0,CASE_NUMBER,CASE_STATUS,combined_job_postings,CASE_OUTCOME,postings_lower,postings_tokenized,posting_without_stopwords,stemmed,cleaned
6191,6191,H-300-20021-265051,Determination Issued - Partial Certification (...,Employer will train workers. Training will inc...,0,employer will train workers. training will inc...,"[employer, will, train, workers, ., training, ...","[train, ., training, include, limited, safety,...","[train, ., train, includ, limit, safeti, train...",train train includ limit safeti train protect ...
10450,10450,H-300-19309-126181,Determination Issued - Certification (Expired),Harassment: The employer committed to providin...,1,harassment: the employer committed to providin...,"[harassment, :, the, employer, committed, to, ...","[harassment, :, committed, providing, safe, ,,...","[harass, :, commit, provid, safe, ,, flexibl, ...",harass commit provid safe flexibl respect envi...
10419,10419,H-300-19310-129885,Determination Issued - Certification,...incurred by the worker for transportation a...,1,...incurred by the worker for transportation a...,"[..., incurred, by, the, worker, for, transpor...","[..., incurred, daily, subsistence, place, com...","[..., incur, daili, subsist, place, come, ,, w...",incur daili subsist place come whether abroad ...
10724,10724,H-300-19298-110492,Determination Issued - Certification (Expired),No deductions except those required by law wil...,1,no deductions except those required by law wil...,"[no, deductions, except, those, required, by, ...","[deductions, except, required, law, made, brin...","[deduct, except, requir, law, made, bring, 's,...",deduct except requir made bring earn feder min...
12747,12747,H-300-20329-926801,Determination Issued - Certification,TERMINATIONS: The employer may terminate the ...,1,terminations: the employer may terminate the ...,"[terminations, :, the, employer, may, terminat...","[terminations, :, may, terminate, notification...","[termin, :, may, termin, notif, appropri, stat...",termin termin notif appropri state feder agenc...
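
The split above is purely random and unseeded, so the share of not certified cases in the test set can drift from run to run. A stratified, seeded split is one way to keep the class ratio fixed. The sketch below runs on a hypothetical frame that only mimics the notebook's class counts; `random_state=0` is an assumed value.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in with the notebook's class counts (1000 certified, 273 not certified).
toy = pd.DataFrame({"CASE_OUTCOME": [1] * 1000 + [0] * 273})

train, test = train_test_split(
    toy,
    test_size=0.2,
    stratify=toy["CASE_OUTCOME"],  # keep the class ratio the same in both splits
    random_state=0,                # assumed seed, for repeatable splits
)
print(train.shape, test.shape)
print(test["CASE_OUTCOME"].value_counts())
```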


## Create a job postings classifier using TF-IDF

In [28]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(min_df=5, max_df=0.9)

In [29]:
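def tfidf_topwords(df, kind):
    """Return the TF-IDF matrix for df['cleaned']; the vectorizer is fitted on training data only."""
    if kind == "test":
        # reuse the vocabulary and IDF weights learned from the training set
        tf_idf = vectorizer.transform(df['cleaned'])
    elif kind == "train":
        # fit_transform already returns the transformed matrix, so no second transform is needed
        tf_idf = vectorizer.fit_transform(df['cleaned'])
    else:
        raise ValueError("kind must be 'train' or 'test'")
    return tf_idf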

In [30]:
X_train_tf=tfidf_topwords(train,"train")
X_test_tf=tfidf_topwords(test,"test")

In [31]:
X_train_tf.shape

(1018, 2696)

In [32]:
X_test_tf.shape

(255, 2696)
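
Both matrices have 2696 columns because the vocabulary is learned from the training set only and then reused for the test set. The sketch below illustrates that behaviour on a tiny made-up corpus; it leaves `min_df` and `max_df` at their defaults because the notebook's settings would filter out almost everything in so few documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up stemmed documents, in the style of the 'cleaned' column.
train_docs = ["hous provid clean", "contract complet period", "hous contract deduct"]
test_docs = ["hous unseen word", "contract period"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)  # learns vocabulary and IDF weights
X_test = vectorizer.transform(test_docs)        # reuses them; unseen words are ignored

# Same number of columns in both matrices: one per training vocabulary term.
print(X_train.shape, X_test.shape)
print("unseen" in vectorizer.vocabulary_)
```

Calling `fit_transform` on the test set instead would learn a different vocabulary, and the classifier trained on `X_train` could not be applied to it.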

## Create and run a Multinomial Naive Bayes model + Modeling Results

In [34]:
# y value
train_y=train["CASE_OUTCOME"]
test_y=test["CASE_OUTCOME"]

In [35]:
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

naive_bayes_classifier = MultinomialNB()
naive_bayes_classifier.fit(X_train_tf, train_y)
y_pred = naive_bayes_classifier.predict(X_test_tf)
# target_names follow the sorted label order: 0 = not certified, 1 = certified
print(metrics.classification_report(test_y, y_pred, target_names=['Not certified', 'Certified']))

MultinomialNB()

               precision    recall  f1-score   support

Not certified       0.86      0.11      0.19        55
    Certified       0.80      0.99      0.89       200

     accuracy                           0.80       255
    macro avg       0.83      0.55      0.54       255
 weighted avg       0.81      0.80      0.74       255



In [36]:
print("Confusion matrix:")
print(metrics.confusion_matrix(test_y, y_pred))

Confusion matrix:
[[  6  49]
 [  1 199]]
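
In scikit-learn's confusion matrix, rows are true labels and columns are predicted labels, both in sorted label order, so here the first row is class 0 (not certified) and the second is class 1 (certified). The sketch below recomputes a few figures directly from the matrix printed above, including the accuracy of always predicting the larger class as a point of comparison. The variable names are illustrative.

```python
# Confusion matrix copied from the output above: rows are true labels, columns are predictions,
# both in sorted label order (0 = not certified, 1 = certified).
cm = [[6, 49],
      [1, 199]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total
recall_not_certified = cm[0][0] / sum(cm[0])   # share of true class 0 that was found
recall_certified = cm[1][1] / sum(cm[1])       # share of true class 1 that was found
majority_baseline = max(sum(row) for row in cm) / total  # always predict the larger class

print(f"accuracy={accuracy:.3f}, baseline={majority_baseline:.3f}")
print(f"recall 0={recall_not_certified:.3f}, recall 1={recall_certified:.3f}")
```

Read this way, the model finds 6 of the 55 not certified test cases, and its overall accuracy is only slightly above the majority-class baseline.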


#----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------#

## Findings

As presented in the cells above, the data is extremely imbalanced: 12977 positive cases (fully certified) versus 273 negative cases (partially certified or denied). Without extra wrangling, the predictive model would not be informative, because it would very likely assign any given test sample to the positive group. Precision and recall would then be very high for the positive cases and very low for the negative cases. Thus, I randomly sampled the positive cases (case status fully certified) down to 1000 to reduce the effect of the imbalance. After downsampling, the accuracy is 0.80, and the weighted averages for precision, recall and f1-score are 0.81, 0.80 and 0.74 respectively. These figures are still driven by the majority class, however: of the 55 negative test cases (the first row of the confusion matrix), only 6 are identified, a recall of 0.11, and always predicting the positive class would already reach an accuracy of 200/255 ≈ 0.78.

## Future Work

Including only 1000 of the positive cases may itself introduce some bias. Future work can apply other methods that tackle the imbalanced data issue, such as data augmentation, which would allow further verification of the results.
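
One simple way to gauge how much the single draw of 1000 positive cases matters is to repeat the downsampling with several seeds and rerun the model on each resulting frame. This is a sketch of that idea, not something the notebook does: the helper name `downsampled_frames`, the seeds, and the toy data are all assumptions.

```python
import pandas as pd

def downsampled_frames(df, n_positive, seeds):
    """Yield one frame per seed: a fresh sample of positive cases plus all negative cases."""
    pos = df[df["CASE_OUTCOME"] == 1]
    neg = df[df["CASE_OUTCOME"] == 0]
    for seed in seeds:
        yield pd.concat([pos.sample(n=n_positive, random_state=seed), neg])

# Hypothetical toy data: 50 positive rows and 5 negative rows.
toy = pd.DataFrame({"CASE_OUTCOME": [1] * 50 + [0] * 5, "row_id": range(55)})

frames = list(downsampled_frames(toy, n_positive=10, seeds=[0, 1, 2]))
for frame in frames:
    print(frame["CASE_OUTCOME"].value_counts().to_dict())
```

Each frame could then go through the same split, TF-IDF and Naive Bayes steps, and the spread of the per-class recall across seeds would indicate how sensitive the results are to the sample.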