GitHub - twail-net/Crackr: Keyword Extraction system using Brown Clustering - (This version is trained to extract keywords from job listings)

README:

Keyword extraction is an extremely interesting topic in Information Retrieval- keywords are widely acknowledged to be extremely important in the field of text retrieval, and particularly while developing large scale modern search engines that limit the size of the inverted index used by the system.

In this project we propose to build a system using modern NLP techniques such as Part of Speech Tagging, Brown Clustering and Rapid Automatic Keywords Extraction (RAKE) to use a small initial seed of keywords to generate more candidate keywords in a semi-supervised manner and expose the system as a JSON based web service.

The service can be launched by going to the webserver module and running the python script serve.py

$ python serve.py

Serves the module on a webserver on port 8080 of localhost.

<> Technologies and Frameworks Used

Front-end: HTML/CSS/Javascript, Jquery, Bootstrap

Backend: Python 2.7, Web.py Framework

Part-of-Speech Tagger: Stanford POSTagger with NLTK bindings

The system also utilizes a C++ implementation of Brown's Clustering Algorithm and Rapid Automatic Keyword Extraction

RAKE:

This system uses a customized implementation of RAKE built on the design by github user Aneesha. (https://github.com/aneesha/RAKE)

Brown Clustering:

This system uses Percy Liang’s Brown Clustering Implementation in C++. Data: Datasets for skills and job listings were provided by www.Collegefeed.com and are confidential. Terminology Used:

NAIVE: Naive selection of keywords that were already present in the seed corpus. RAKE: Keywords extracted via the Rapid Automatic Keyword Extraction Algorithm CRAKR: New approach to Keyword Extraction using Part-of-Speech tagging on a candidate document and Brown Clustering on a large corpus of contextual documents.

Key Software Modules: serve.py – Main server file textprocess.py – contains code for textprocessing postagger.py – interface with the Stanford POS Tagger rake.py – A customized python implementation of the RAKE algorithm candygen.py – Contains the implementations of the Naïve keyword extraction and CRAKR algorithms. index.html/index_helpers.js – Contains code for the Front end and GUI

Name		Name	Last commit message	Last commit date
Latest commit History 19 Commits
webserver		webserver
.gitignore		.gitignore
README.md		README.md
Readme.txt		Readme.txt
SETReport.docx		SETReport.docx

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

About

Releases

Packages

Languages

twail-net/Crackr

Folders and files

Latest commit

History

Repository files navigation

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages