# NLP-REFERENCES

1. http://www.nltk.org/book/
2. https://course.spacy.io/en/
3. https://github.com/mhagiwara/100-nlp-papers
4. https://github.com/keon/awesome-nlp
5. https://github.com/shashankg7/Deep-Learning-for-NLP-Resources
6. http://cs224d.stanford.edu/syllabus.html
7. https://www.kaggle.com/learn/natural-language-processing
8. https://www.youtube.com/playlist?list=PL8FFE3F391203C98C

Dealing with any NLP problem involves the following steps:
1. Problem identification, and shortlisting plausible NLP models that can be applied

OVERUSED NLP APPLICATIONS:
    1. Sentiment analysis (VADER, BERT)
    2. Topic analysis
    3. Similarity models
    4. Keyword extraction
    5. Classification and clustering (applications of topic, keyword, sentiment, similarity)
    6. Summarization
    7. Text generation

2. Corpus / dataset selection

3. Cleaning the corpus (very important)
    1. Cleaning unwanted patterns (emails, emoji, URLs)
    2. Removal of stop words
    3. Removal of the most frequent and least frequent words
    4. Removal of symbols and special characters
    5. Stemming or lemmatization
    6. Tokenization: splitting on ' ' or some special character
    - Where 2 or more meaningful words are joined together, split them and run a spell checker (Spark NLP)
    7. N-gram combinations of the split tokens
    8. POS tagging, NER tagging, removal of words with unwanted tags

4. Testing models that solve the problem
   Each class of NLP applications can have multiple approaches; we have to test them and opt for the best one
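
The cleaning steps listed above can be sketched with the standard library alone. This is a minimal illustration, not the pipeline used later in this notebook: the stop-word set, the function names `clean`, `tokenize` and `ngrams`, and the sample sentence are all assumptions made for the example.

```python
import re

# Hypothetical, tiny stop-word list for illustration only.
STOP_WORDS = {"the", "a", "an", "to", "and", "of", "in", "for", "is", "at"}

def clean(text):
    """Lowercase, strip URLs/emails, and replace symbols and digits with spaces."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # URLs
    text = re.sub(r"\S+@\S+", " ", text)        # email-like tokens
    text = re.sub(r"[^a-z\s]", " ", text)       # symbols, digits, emoji
    return text

def tokenize(text):
    """Split on whitespace and drop stop words."""
    return [w for w in text.split() if w not in STOP_WORDS]

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) over a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = tokenize(clean("Apply at https://example.com for the Accounts & Audit role!"))
print(tokens)
print(ngrams(tokens, 2))
```

Replacing unwanted characters with a space rather than an empty string keeps neighbouring words from being glued together.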

# Problem Statement

#### Extract skill keywords from a given job description, then cluster those keywords into meaningful skill clusters

In [5]:
import pandas as pd
import numpy as np
import re
import json
import string
from ast import literal_eval
punc=string.punctuation
print(punc)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [7]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
st = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\31405\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [8]:
df= pd.read_csv(r'D:\OneDrive - Indian School of Business\Projects_archive\JSI\archive\JSI Project\JSI_Naukri\data\ntwrk_scrape\data.csv', low_memory=False, nrows= 500)
df.shape

(500, 17)

In [9]:
df.info(verbose=True, show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 17 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   jobTitle        500 non-null    object 
 1   jobId           500 non-null    float64
 2   jobType         440 non-null    object 
 3   currency        500 non-null    object 
 4   posted_days     500 non-null    object 
 5   companyName     500 non-null    object 
 6   tags            500 non-null    object 
 7   companyId       500 non-null    float64
 8   jobdescription  500 non-null    object 
 9   functionalarea  500 non-null    object 
 10  scraping_date   500 non-null    object 
 11  posted_date     500 non-null    object 
 12  monthYear       500 non-null    object 
 13  salary          500 non-null    object 
 14  experience      492 non-null    object 
 15  education       0 non-null      float64
 16  location        500 non-null    object 
dtypes: float64(3), object(14)
memory us

In [10]:
type(literal_eval("{'name': 'sachin'}"))

dict

In [11]:
df['jobdescription'].head().tolist()

['<p>Dear Candidate,</p><br/><p>We are looking to hire Graduates who can speak English & Hindi for Tally ERP support.</p><p> </p><p>JD:</p><ul><li>Role: To provide voice support to Tally customers from all over India and across the globe</li><li>Qualification: Any Graduate having basic knowledge of accounts and ready to work in an Accounts process</li><li>Must possess good communication skills in Hindi & English</li><li>Terms & Conditions: Need to submit 10th & 12th original marks card for a period of 16 months</li><li>Shifts: Willing to work in different day shifts starting from 9:00am to 7:30pm (girls) and 10:30am to 9:30pm (boys) based on 9 hour working</li><li>Remuneration: Best in the industry</li></ul><p> </p><p>Interested candidates can walk in with your updated profile to the below mentioned address;</p><p> </p><p><strong>Abhilash Nair,</strong> HR Assistant Manager - <strong>9789830683</strong></p><p><strong>Greet Technologies Pvt Ltd</strong></p><p>327-328, 6th Sector, 5th Ma

# Text Cleaning

In [13]:
def remove_pattern(input_txt, pattern):
    """Delete every string returned by re.findall(pattern, input_txt) from input_txt."""
    # Caution: if `pattern` has a capture group, re.findall returns the group
    # text (e.g. 'p' for '<p>'), and that text is then deleted everywhere.
    r = re.findall(pattern, input_txt)
    for i in r:
        input_txt = re.sub(re.escape(i), '', input_txt)
    return input_txt

def remove_stop_words(text):
    """Drop whitespace-separated tokens that are in the NLTK stop-word set `st`."""
    return " ".join([word for word in str(text).split() if word not in st])

def removeNumbers(text):
    """Remove every digit character."""
    text = ''.join([i for i in text if not i.isdigit()])
    return text

def remove_punc(text):
    """Drop whitespace-separated tokens that occur in string.punctuation."""
    return " ".join([word for word in str(text).split() if word not in punc])

def clean_text(lst):
    # remove twitter Return handles (RT @xxx:)
    # lst = np.vectorize(remove_pattern)(lst, r"RT @[\w]*:")
    # remove twitter handles (@xxx)
    # lst = np.vectorize(remove_pattern)(lst, r"@[\w]*")
    # remove URL links (httpxxx)
    lst = np.vectorize(remove_pattern)(lst, r"https?://[A-Za-z0-9./]*")
    # remove special tags
    lst = np.vectorize(remove_pattern)(lst, r"<(.*?)>")
    # np.char.replace is a literal (non-regex) replacement, so this only
    # removes the exact text "[^a-zA-Z]" if it appears
    lst = np.char.replace(lst, "[^a-zA-Z]", "")
    return str(lst)

In [14]:
# Removes the literal substring 'nan' wherever it occurs, including inside words
df['jobdescription']= df['jobdescription'].str.replace('nan', '')
df['jobdescription']= df['jobdescription'].apply(clean_text)
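
Because `str.replace('nan', '')` matches plain substrings, it also shortens words that contain those letters (`finance` becomes `fice`, `financial` becomes `ficial`). A minimal sketch of a whole-word alternative follows; the helper name `drop_nan_token` is hypothetical and the sample string is made up for illustration.

```python
import re

def drop_nan_token(text):
    """Remove 'nan' only when it stands alone as a word."""
    without = re.sub(r"\bnan\b", " ", text)
    # collapse the whitespace left behind by the removal
    return " ".join(without.split())

print(drop_nan_token("finance nan financial"))
```

If the goal is to blank out missing values, handling them before the column is converted to strings would avoid the problem altogether.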

In [15]:
df['jobdescription'].head().tolist()

['<>Dear Candidate,</><><>We are looking to hire Graduates who can seak Engsh & Hindi for Tally ERP suort.</><> </><>JD:</><><>Role: To rovide voice suort to Tally customers from all over India and across the globe</><>Quafication: Any Graduate having basic knowledge of accounts and ready to work in an Accounts rocess</><>Must ossess good communication skills in Hindi & Engsh</><>Terms & Conditions: Need to submit 10th & 12th original marks card for a eriod of 16 months</><>Shifts: Wilng to work in different day shifts starting from 9:00am to 7:30m (girls) and 10:30am to 9:30m (boys) based on 9 hour working</><>Remuneration: Best in the industry</></><> </><>Interested candidates can walk in with your udated rofile to the below mentioned address;</><> </><><>Abhilash Nair,</> HR Assistant Manager - <>9789830683</></><><>Greet Technologies Pvt Ltd</></><>327-328, 6th Sector, 5th Main Service Road, HSR Layout, Bengaluru, 560102</><>Walk-In Timing: 10:30am to 5:30m</>',
 'com,CA(Inter),CM
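
The output above shows tags reduced to `<>` and letters missing from ordinary words (`seak`, `Engsh`, `suort`). This is consistent with `re.findall` returning only the capture-group text for `<(.*?)>` (for example `p` or `li`), which `remove_pattern` then deletes everywhere in the string. A minimal sketch of an alternative that substitutes the whole tag directly is given below; `strip_tags` and `TAG_RE` are hypothetical names and the sample text is illustrative.

```python
import re

# Matches a complete tag such as <p>, </li> or <br/>; no capture group is used.
TAG_RE = re.compile(r"<[^>]*>")

def strip_tags(text):
    """Replace each HTML tag with a space, then collapse repeated whitespace."""
    no_tags = TAG_RE.sub(" ", text)
    return " ".join(no_tags.split())

sample = "<p>Dear Candidate,</p><br/><p>We speak English & Hindi</p>"
print(strip_tags(sample))
```

Substituting a space instead of an empty string also keeps text from adjacent tags apart, which matters for descriptions where one paragraph ends and the next begins with no whitespace in between.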

In [16]:
df['jobdescription']= df['jobdescription'].str.replace(r'[^\w\s]', '', regex=True)
df['jobdescription']= df['jobdescription'].str.lower()
df['jobdescription']= df['jobdescription'].apply(remove_stop_words)
df['jobdescription']= df['jobdescription'].apply(removeNumbers)
df['jobdescription']= df['jobdescription'].apply(remove_punc)
df['jobdescription']= df['jobdescription'].str.replace(r'[^\w\s]', '', regex=True)


In [17]:
df['jobdescription'].head().tolist()

['dear candidatewe looking hire graduates seak engsh hindi tally erp suort jdrole rovide voice suort tally customers india across globequafication graduate basic knowledge accounts ready work accounts rocessmust ossess good communication skills hindi engshterms conditions need submit th th original marks card eriod monthsshifts wilng work different day shifts starting am m girls am m boys based hour workingremuneration best industry interested candidates walk udated rofile mentioned address abhilash nair hr assistant manager greet technologies pvt ltd th sector th main service road hsr layout bengaluru walkin timing am m',
 'comcaintercmaintercsinter ca cs cma applyshould knowledge finalization accountsauditing tallytaxation work individualsfirmsand companiessupervision training provided good candidates interested learning self motivating',
 'looking fresher candidates executive handle accountingtdsgstincome taxroc compliancerocall taxationand audits related companyfreshers trainedca i

In [18]:
# Split joined keywords
# NER Tagging
# spell check
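
One way to approach the first item, splitting joined keywords, is dictionary-based segmentation. The sketch below is an assumption about how this could be done, not something the notebook implements: `split_joined` is a hypothetical helper and `VOCAB` is a made-up vocabulary; in practice the vocabulary might come from a word list or from the corpus itself.

```python
def split_joined(word, vocab):
    """Split `word` into vocabulary words, or return [word] if no full split exists."""
    n = len(word)
    # best[i] holds a segmentation of word[:i], or None if none was found
    best = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if best[start] is not None and piece in vocab:
                candidate = best[start] + [piece]
                # prefer segmentations with fewer pieces
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[n] if best[n] is not None else [word]

# Hypothetical vocabulary for illustration only.
VOCAB = {"accounting", "audit", "account", "chartered", "accountant", "chennai"}
print(split_joined("accountingaudit", VOCAB))
print(split_joined("accountantchennai", VOCAB))
print(split_joined("ifrs", VOCAB))
```

Words that cannot be fully covered by the vocabulary are returned unchanged, so unknown abbreviations are left alone.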

In [24]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\31405\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger.zip.


True

In [27]:
tokenized

['position chartered accountantchennai cororate audit services chennai freshers also acceted first attemt throughoutqualification st attemt chartered accountantexerience years post ca final qualificationareas exosurework background exerience internal external audits referably construction infra rojects sound knowledge accounting auditing standards ifrs qualification exerience forensic accountingaudit would added advantagepurose osition hel comlete audit lan oe testing icfrjob profile primary work art team rofessionally qualified accountants carry audits issue audit reorts travelling must auditing different locations days monthsecondaryto work achieving deartmental objectives like oe testing icfr develoing data analytics contribute auditing system etc location chennaiplease attach cv mention current ctc']

In [28]:
help(sent_tokenize)

Help on function sent_tokenize in module nltk.tokenize:

sent_tokenize(text, language='english')
    Return a sentence-tokenized copy of *text*,
    using NLTK's recommended sentence tokenizer
    (currently :class:`.PunktSentenceTokenizer`
    for the specified language).
    
    :param text: text to split into sentences
    :param language: the model name in the Punkt corpus



In [29]:
wordsList

['position',
 'chartered',
 'accountantchennai',
 'cororate',
 'audit',
 'services',
 'chennai',
 'freshers',
 'also',
 'acceted',
 'first',
 'attemt',
 'throughoutqualification',
 'st',
 'attemt',
 'chartered',
 'accountantexerience',
 'years',
 'post',
 'ca',
 'final',
 'qualificationareas',
 'exosurework',
 'background',
 'exerience',
 'internal',
 'external',
 'audits',
 'referably',
 'construction',
 'infra',
 'rojects',
 'sound',
 'knowledge',
 'accounting',
 'auditing',
 'standards',
 'ifrs',
 'qualification',
 'exerience',
 'forensic',
 'accountingaudit',
 'would',
 'added',
 'advantagepurose',
 'osition',
 'hel',
 'comlete',
 'audit',
 'lan',
 'oe',
 'testing',
 'icfrjob',
 'profile',
 'primary',
 'work',
 'art',
 'team',
 'rofessionally',
 'qualified',
 'accountants',
 'carry',
 'audits',
 'issue',
 'audit',
 'reorts',
 'travelling',
 'must',
 'auditing',
 'different',
 'locations',
 'days',
 'monthsecondaryto',
 'work',
 'achieving',
 'deartmental',
 'objectives',
 'like',
 

In [31]:
from nltk.tokenize import word_tokenize, sent_tokenize
txt='position chartered accountantchennai cororate audit services chennai freshers also acceted first attemt throughoutqualification st attemt chartered accountantexerience years post ca final qualificationareas exosurework background exerience internal external audits referably construction infra rojects sound knowledge accounting auditing standards ifrs qualification exerience forensic accountingaudit would added advantagepurose osition hel comlete audit lan oe testing icfrjob profile primary work art team rofessionally qualified accountants carry audits issue audit reorts travelling must auditing different locations days monthsecondaryto work achieving deartmental objectives like oe testing icfr develoing data analytics contribute auditing system etc location chennaiplease attach cv mention current ctc'
# sent_tokenize and word_tokenize require NLTK's Punkt tokenizer data to be downloaded.
# txt contains no punctuation, so sent_tokenize returns it as a single sentence.
tokenized = sent_tokenize(txt)

for i in tokenized:
    # Split the sentence into word and punctuation tokens
    wordsList = word_tokenize(i)

    # Remove stop words from wordsList
    wordsList = [w for w in wordsList if w not in st]

    # Tag each remaining token with its part of speech (POS)
    tagged = nltk.pos_tag(wordsList)
    print(tagged)

[('position', 'NN'), ('chartered', 'VBD'), ('accountantchennai', 'JJ'), ('cororate', 'NN'), ('audit', 'NN'), ('services', 'NNS'), ('chennai', 'NN'), ('freshers', 'NNS'), ('also', 'RB'), ('acceted', 'VBD'), ('first', 'JJ'), ('attemt', 'JJ'), ('throughoutqualification', 'NN'), ('st', 'NN'), ('attemt', 'NN'), ('chartered', 'VBN'), ('accountantexerience', 'NN'), ('years', 'NNS'), ('post', 'NN'), ('ca', 'MD'), ('final', 'JJ'), ('qualificationareas', 'NNS'), ('exosurework', 'VBP'), ('background', 'JJ'), ('exerience', 'NN'), ('internal', 'JJ'), ('external', 'JJ'), ('audits', 'NN'), ('referably', 'RB'), ('construction', 'NN'), ('infra', 'NN'), ('rojects', 'VBZ'), ('sound', 'JJ'), ('knowledge', 'NN'), ('accounting', 'NN'), ('auditing', 'NN'), ('standards', 'NNS'), ('ifrs', 'JJ'), ('qualification', 'NN'), ('exerience', 'NN'), ('forensic', 'JJ'), ('accountingaudit', 'NN'), ('would', 'MD'), ('added', 'VBD'), ('advantagepurose', 'JJ'), ('osition', 'NN'), ('hel', 'NN'), ('comlete', 'JJ'), ('audit', 
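
The cleaning plan mentions removing words with unwanted tags. A minimal sketch of that filter, using the first few `(word, tag)` pairs from the output above, is shown here. Keeping only noun tags is an assumption for illustration, and `keep_tags` is a hypothetical helper name.

```python
# First few (word, tag) pairs copied from the tagger output above.
tagged = [
    ("position", "NN"), ("chartered", "VBD"), ("accountantchennai", "JJ"),
    ("cororate", "NN"), ("audit", "NN"), ("services", "NNS"),
    ("chennai", "NN"), ("freshers", "NNS"), ("also", "RB"),
]

def keep_tags(pairs, prefixes=("NN",)):
    """Keep words whose Penn Treebank tag starts with one of `prefixes`."""
    return [word for word, tag in pairs if tag.startswith(prefixes)]

print(keep_tags(tagged))
```

The tags are only as reliable as the tokens: joined or misspelled words such as `accountantchennai` receive tags that may not reflect their real role.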

In [26]:
st

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 'r

In [32]:
from collections import Counter
cnt = Counter()

for text in df["jobdescription"].values:
    for word in text.split(' '):
        cnt[word] += 1
cnt.most_common(100)

[('san', 991),
 ('experience', 383),
 ('knowledge', 289),
 ('accounting', 249),
 ('good', 224),
 ('years', 198),
 ('accounts', 186),
 ('skills', 183),
 ('work', 161),
 ('stylefontszetsan', 159),
 ('stylefontfamly', 158),
 ('candidate', 131),
 ('ficial', 123),
 ('communication', 120),
 ('new', 119),
 ('required', 115),
 ('business', 110),
 ('team', 109),
 ('working', 107),
 ('exerience', 106),
 ('must', 100),
 ('excel', 96),
 ('management', 95),
 ('tally', 92),
 ('gst', 86),
 ('qottmes', 78),
 ('roman', 78),
 ('qotserfsan', 77),
 ('sales', 75),
 ('tax', 73),
 ('account', 71),
 ('fice', 70),
 ('ms', 69),
 ('day', 68),
 ('audit', 64),
 ('data', 64),
 ('bank', 64),
 ('sap', 61),
 ('ca', 60),
 ('office', 59),
 ('end', 58),
 ('erp', 56),
 ('able', 54),
 ('timely', 54),
 ('customer', 54),
 ('roles', 53),
 ('written', 50),
 ('degree', 50),
 ('tds', 50),
 ('related', 49),
 ('job', 49),
 ('candidates', 48),
 ('etc', 48),
 ('role', 47),
 ('langenin', 47),
 ('software', 45),
 ('strong', 45),
 ('in
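
Counts like these can drive the "most frequent and least frequent words" removal step from the cleaning plan. The sketch below is one possible way to do it; the function name `drop_extremes`, its thresholds, and the three sample documents are assumptions for illustration.

```python
from collections import Counter

def drop_extremes(docs, top_n=1, min_count=2):
    """Remove the `top_n` most frequent words and words seen fewer than `min_count` times."""
    counts = Counter(word for doc in docs for word in doc.split())
    too_common = {word for word, _ in counts.most_common(top_n)}
    return [
        " ".join(w for w in doc.split()
                 if w not in too_common and counts[w] >= min_count)
        for doc in docs
    ]

# Made-up documents for illustration only.
docs = ["san accounting audit san", "san accounting tally", "san audit gst"]
print(drop_extremes(docs))
```

The thresholds need tuning against the corpus: set too aggressively, `top_n` would also remove informative words such as `accounting`.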

# TF-IDF Model