

Open Skills Project - Machine Learning

This library contains the methods used by the Open Skills API, including processing algorithms and utilities for computing our jobs and skills taxonomy.

New to Skills-ML? Check out the Skills-ML Tour! It will get you started with the concepts. You can also check out the notebook version of the tour which you can run on your own.


The documentation is hosted on GitHub Pages.

Quick Start

1. Virtualenv

skills-ml requires Python 3.6, so create a virtual environment using a Python 3.6 executable.

virtualenv venv -p /usr/bin/python3.6

Activate your virtualenv

source venv/bin/activate

2. Installation

pip install skills-ml

3. Import skills_ml

import skills_ml
  • The examples directory contains examples of using specific components to perform specific tasks.
  • Check out the descriptions of the different algorithm types in algorithms/, and look at any individual directories that match what you'd like to do (e.g. skill extraction, job title normalization).
  • skills-airflow is the open-source production system that uses skills-ml algorithms in an Airflow pipeline to generate open datasets.
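After installing, you can sanity-check that the package is importable before diving into the examples. This snippet uses only the standard library's importlib, so it runs whether or not the install succeeded; skills_ml here is just the package name from the install step above.

```python
import importlib.util

# Look up the installed package without actually importing it.
spec = importlib.util.find_spec("skills_ml")

if spec is None:
    print("skills_ml is not installed; run 'pip install skills-ml' first")
else:
    print(f"skills_ml found at {spec.origin}")
```

Using find_spec instead of a bare import keeps the check side-effect free: nothing from the package is executed, you only learn whether the import machinery can locate it.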

Building the Documentation

skills-ml uses a forked version of pydocmd, and a custom script to keep the pydocmd config file up to date. Here's how to keep the docs updated before you push:

$ cd docs
$ PYTHONPATH="../" python # this will update docs/pydocmd.yml with the package/module structure and export the Skills-ML Tour notebook to the documentation directory
$ pydocmd serve # will serve local documentation that you can check in your browser
$ pydocmd gh-deploy # will update the gh-pages branch


Repository Structure

  • algorithms/ - Core algorithmic module. Each submodule is meant to contain a different type of component, such as a job title normalizer or a skill tagger, with a common interface so different pipelines can try out different versions of the components.
  • datasets/ - Wrappers for interfacing with different datasets, such as ONET, Urbanized Area.
  • evaluation/ - Code for testing different components against each other.
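The "common interface" idea behind algorithms/ can be sketched generically. The classes below are illustrative only, not skills-ml's actual API: they show how two interchangeable skill-tagger components might share one interface so a pipeline can swap one for the other without changing its own code.

```python
from abc import ABC, abstractmethod
from typing import List


class SkillTagger(ABC):
    """Hypothetical common interface: a tagger yields skill strings from a document."""

    @abstractmethod
    def tag(self, document: str) -> List[str]:
        ...


class ExactMatchTagger(SkillTagger):
    """Tags only skills that appear verbatim (case-insensitively) in the document."""

    def __init__(self, skill_vocabulary: List[str]):
        self.skill_vocabulary = skill_vocabulary

    def tag(self, document: str) -> List[str]:
        lowered = document.lower()
        return [s for s in self.skill_vocabulary if s.lower() in lowered]


class TitleCaseTagger(SkillTagger):
    """Tags title-cased tokens as candidate skills -- a deliberately naive stand-in."""

    def tag(self, document: str) -> List[str]:
        return [tok for tok in document.split() if tok.istitle()]


def run_pipeline(tagger: SkillTagger, postings: List[str]) -> List[List[str]]:
    # The pipeline depends only on the SkillTagger interface,
    # so implementations are interchangeable.
    return [tagger.tag(p) for p in postings]
```

A pipeline written against SkillTagger can then be run with either implementation, which is the kind of component swapping the evaluation/ module is meant to support when comparing components against each other.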



This project is licensed under the MIT License - see the LICENSE file for details.