## Cleaning Resume and Job Post Data for Use by a Job Recommender System

### Datasets
* https://www.kaggle.com/maitrip/resumes; 1219 resumes; ID, category, text description
* https://www.kaggle.com/samdeeplearning/deepnlp; 125 resumes; may drop this one
* https://www.kaggle.com/madhab/jobposts/data; 19,000 job posts from 2004-2015.




Imports

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import re
import nltk
import spacy
import string

First, start with the resume dataset.

In [2]:
resume_path = '/Users/richardkuzma/coding/NLP_projects/job_recommender_project/data/resume_dataset_maitrip.csv'

In [55]:
df = pd.read_csv(resume_path)
df.head()

Unnamed: 0,ID,Category,Resume
0,1,HR,"b'John H. Smith, P.H.R.\n800-991-5187 | PO Box..."
1,2,HR,b'Name Surname\nAddress\nMobile No/Email\nPERS...
2,3,HR,b'Anthony Brown\nHR Assistant\nAREAS OF EXPERT...
3,4,HR,b'www.downloadmela.com\nSatheesh\nEMAIL ID:\nC...
4,5,HR,"b""HUMAN RESOURCES DIRECTOR\n\xef\x82\xb7Expert..."


In [21]:
#what does one resume look like
df.Resume[0]

"b'John H. Smith, P.H.R.\\n800-991-5187 | PO Box 1673 | Callahan, FL 32011 | info@greatresumesfast.com\\n\\nApproachable innovator with a passion for Human Resources.\\n\\nSENIOR HUMAN RESOURCES PROFESSIONAL\\nPersonable, analytical, flexible Senior HR Professional with multifaceted expertise. Seasoned Benefits Administrator with\\nextensive experience working with highly paid professionals in client-relationship-based settings. Dynamic team leader\\ncapable of analyzing alternatives and identifying tough choices while communicating the total value of benefit and\\ncompensation packages to senior level executives and employees.\\n\\nCORE COMPETENCIES\\nBenefits Administration \\xe2\\x80\\x93 Customer Service \\xe2\\x80\\x93 Cost Control \\xe2\\x80\\x93 Recruiting \\xe2\\x80\\x93 Acquisition Management \\xe2\\x80\\x93 Compliance Reporting\\nRetention \\xe2\\x80\\x93 Professional Services \\xe2\\x80\\x93 Domestic & International Benefits \\xe2\\x80\\x93 Collaboration \\xe2\\x80\\x93 Adap

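The resume printed above is stored as the text of a Python bytes literal: it starts with `b'`, and non-ASCII characters show up as escaped UTF-8 bytes such as `\xe2\x80\x93`. As an alternative to replacing those sequences one by one, the literal can be parsed and decoded. The sketch below is not part of the original notebook; `decode_bytes_repr` is a hypothetical helper name, and it assumes every cell is a well-formed bytes literal (anything else is returned unchanged).

```python
import ast

def decode_bytes_repr(text):
    """Turn the text of a bytes literal (b'...') back into a real str.

    Hypothetical helper: values that are not a valid bytes literal
    are returned unchanged.
    """
    try:
        value = ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return text
    if isinstance(value, bytes):
        # Undecodable bytes become U+FFFD instead of raising.
        return value.decode("utf-8", errors="replace")
    return text

# The sample mimics one line of the raw column: literal backslash escapes.
sample = "b'Benefits Administration \\xe2\\x80\\x93 Customer Service\\nRetention'"
decoded = decode_bytes_repr(sample)
print(repr(decoded))
```

If this route were taken, it would run on the raw column before any other cleaning step, since it relies on the `b'...'` wrapper being intact.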

## Text Preprocessing

Reference: https://www.kaggle.com/sudalairajkumar/getting-started-with-text-preprocessing

In [56]:
# Rename the text column to lowercase
df = df.rename(columns={"Resume": "resume"})
df.head()

Unnamed: 0,ID,Category,resume
0,1,HR,"b'John H. Smith, P.H.R.\n800-991-5187 | PO Box..."
1,2,HR,b'Name Surname\nAddress\nMobile No/Email\nPERS...
2,3,HR,b'Anthony Brown\nHR Assistant\nAREAS OF EXPERT...
3,4,HR,b'www.downloadmela.com\nSatheesh\nEMAIL ID:\nC...
4,5,HR,"b""HUMAN RESOURCES DIRECTOR\n\xef\x82\xb7Expert..."


In [54]:
#remove unicode
def unicodetoascii(text):
    # Each resume is the str() of a bytes object, so non-ASCII characters appear
    # as literal escape text such as \xe2\x80\x93 and newlines as a literal \n.
    # Sequences not listed below (for example \xef\x82\xb7) pass through unchanged.
    TEXT = (text.
            replace('\\xe2\\x80\\x99', "'").
            replace('\\xc3\\xa9', 'e').
            replace('\\xe2\\x80\\x90', '-').
            replace('\\xe2\\x80\\x91', '-').
            replace('\\xe2\\x80\\x92', '-').
            replace('\\xe2\\x80\\x93', '-').
            replace('\\xe2\\x80\\x94', '-').
            replace('\\xe2\\x80\\x98', "'").
            replace('\\xe2\\x80\\x9b', "'").
            replace('\\xe2\\x80\\x9c', '"').
            replace('\\xe2\\x80\\x9d', '"').
            replace('\\xe2\\x80\\x9e', '"').
            replace('\\xe2\\x80\\x9f', '"').
            replace('\\xe2\\x80\\xa6', '...').
            replace('\\xe2\\x80\\xb2', "'").
            replace('\\xe2\\x80\\xb3', "'").
            replace('\\xe2\\x80\\xb4', "'").
            replace('\\xe2\\x80\\xb5', "'").
            replace('\\xe2\\x80\\xb6', "'").
            replace('\\xe2\\x80\\xb7', "'").
            replace('\\xe2\\x81\\xba', "+").
            replace('\\xe2\\x81\\xbb', "-").
            replace('\\xe2\\x81\\xbc', "=").
            replace('\\xe2\\x81\\xbd', "(").
            replace('\\xe2\\x81\\xbe', ")").
            replace('\\xef\\x81\\xb3', "").
            replace('\\n', ' ')
            )[1:]  # drop the leading b of the bytes-literal prefix
    return TEXT
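
Any escape sequence missing from the replacement list survives as literal text, and once punctuation and digits are stripped it leaves fragments such as `xefxxb` glued to the next word in the cleaned output. One possible catch-all, sketched here as an assumption rather than as part of the original pipeline, is a regex that replaces any remaining run of `\xNN` escapes with a space. `strip_hex_escapes` is a hypothetical name.

```python
import re

# One or more consecutive literal \xNN escapes (backslash, x, two hex digits).
HEX_ESCAPE_RUN = re.compile(r'(?:\\x[0-9a-fA-F]{2})+')

def strip_hex_escapes(text):
    """Replace leftover literal \\xNN escape runs with a single space."""
    return HEX_ESCAPE_RUN.sub(' ', text)

# Literal backslashes, as they appear in the raw resume text.
sample = 'human resources director \\xef\\x82\\xb7expert organizational'
cleaned = " ".join(strip_hex_escapes(sample).split())
print(cleaned)
```

Such a step would have to run after the explicit replacements and before punctuation removal, because removing the backslashes first makes the escapes unrecognizable.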

In [25]:
#remove punctuation
PUNCT_TO_REMOVE = string.punctuation
def remove_punctuation(text):
    """custom function to remove the punctuation"""
    return text.translate(str.maketrans('', '', PUNCT_TO_REMOVE))
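
Deleting punctuation outright joins the words on either side of it, so `client-relationship-based` becomes one token. A variation that maps each punctuation character to a space keeps the parts separate. This is a sketch of an alternative, not the notebook's method; `punctuation_to_spaces` is a hypothetical name.

```python
import string

# Translation table: every punctuation character becomes a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

def punctuation_to_spaces(text):
    """Replace punctuation with spaces, then collapse repeated whitespace."""
    return " ".join(text.translate(PUNCT_TO_SPACE).split())

sample = "client-relationship-based settings."
removed = sample.translate(str.maketrans('', '', string.punctuation))
spaced = punctuation_to_spaces(sample)
print(removed)
print(spaced)
```

Which behavior is preferable depends on whether hyphenated terms and e-mail addresses should count as one token or several in the recommender's vocabulary.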

In [26]:
#remove stopwords
from nltk.corpus import stopwords
STOPWORDS = set(stopwords.words('english'))
def remove_stopwords(text):
    """custom function to remove the stopwords"""
    return " ".join([word for word in str(text).split() if word not in STOPWORDS])


In [28]:
#remove URLs
def remove_urls(text):
    url_pattern = re.compile(r'https?://\S+|www\.\S+')
    return url_pattern.sub(r'', text)
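
One thing to watch with this pattern: in the raw resumes a line break is the two literal characters `\n`, which `\S+` does not treat as whitespace. A URL followed by `\n` therefore swallows the words after it until the next real space. The sketch below illustrates the effect on a made-up sample string modeled on the fourth resume, and shows that replacing the literal `\n` first avoids it.

```python
import re

URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')

# Literal backslash-n sequences, as in the raw (lowercased) resume text.
raw = 'www.downloadmela.com\\nsatheesh\\nemail id:'

# URL removal before newline replacement: the match runs through "email".
urls_first = URL_PATTERN.sub('', raw).replace('\\n', ' ')

# Newline replacement before URL removal: only the URL is removed.
newlines_first = URL_PATTERN.sub('', raw.replace('\\n', ' '))

print(repr(urls_first))
print(repr(newlines_first))
```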

In [29]:
#remove HTML
def remove_html(text):
    html_pattern = re.compile('<.*?>')
    return html_pattern.sub(r'', text)

In [44]:
#remove digits
DIGITS_TO_REMOVE = "0123456789"
def remove_digits(text):
    return text.translate(str.maketrans('', '', DIGITS_TO_REMOVE))

In [57]:
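# Lowercase first: the NLTK stopword list and the escape sequences handled
# by unicodetoascii are all lowercase.
df["resume"] = df["resume"].astype(str).str.lower()
# The text still contains literal "\n" sequences at this point, so the \S+ in
# remove_urls can also consume words joined to a URL by "\n".
df["resume"] = df["resume"].apply(remove_urls)
df["resume"] = df["resume"].apply(remove_html)
df["resume"] = df["resume"].apply(unicodetoascii)
df["resume"] = df["resume"].apply(remove_punctuation)
df["resume"] = df["resume"].apply(remove_stopwords)
# Digits are removed last, so the gaps they leave remain as double spaces.
df["resume"] = df["resume"].apply(remove_digits)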
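
The chain of `apply` calls can also be expressed as one composed function, which makes the step order explicit and easy to rearrange. The sketch below is an illustration only: `make_pipeline` is a hypothetical helper, and the three stand-in steps are simplified placeholders for the functions defined above. The final step collapses repeated whitespace, which the original chain does not do.

```python
from functools import reduce

def make_pipeline(*steps):
    """Return a function that applies each step to the text, left to right."""
    def run(text):
        return reduce(lambda acc, step: step(acc), steps, text)
    return run

# Stand-in steps for illustration; the real cleaning functions would go here.
clean = make_pipeline(
    str.lower,
    lambda t: t.replace('\\n', ' '),   # literal backslash-n to space
    lambda t: " ".join(t.split()),     # collapse repeated whitespace
)

result = clean("John  H. Smith\\nHR")
print(result)
```

With the real functions plugged in, the whole chain would reduce to a single call such as `df["resume"].astype(str).apply(clean)`.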

In [58]:
df.head()

Unnamed: 0,ID,Category,resume
0,1,HR,john h smith phr po box callahan fl infogre...
1,2,HR,name surname address mobile noemail personal p...
2,3,HR,anthony brown hr assistant areas expertise per...
3,4,HR,id career objective pursue growth oriented car...
4,5,HR,human resources director xefxxbexpert organiza...


In [59]:
df['resume'][0]

'john h smith phr  po box  callahan fl  infogreatresumesfastcom approachable innovator passion human resources senior human resources professional personable analytical flexible senior hr professional multifaceted expertise seasoned benefits administrator extensive experience working highly paid professionals clientrelationshipbased settings dynamic team leader capable analyzing alternatives identifying tough choices communicating total value benefit compensation packages senior level executives employees core competencies benefits administration customer service cost control recruiting acquisition management compliance reporting retention professional services domestic international benefits collaboration adaptability change management defined contribution plans auditing negotiation corporate hr policies full lifecycle training k form  confidential files eeo aap fmla std ltd hb visa vets  eap processing payroll hr technology hris data management auditing ultipro back office ultipro we

## Next: tokenize with gensim