A tool to match license text with the SPDX license list using an algorithm that finds close matches. It follows the SPDX Matching Guidelines, keeping the substantial text and ignoring the replaceable text for matching purposes.

SPDX License Match Tool

A Python tool that takes license text from the user, compares it with the SPDX license list using an algorithm that finds close matches, and returns the differences if the input license text is found to be a close match.

Installation

Make sure you are using Python 3 to install the tool.

  • Clone the repository

    • git clone https://github.com/Ugtan/spdx-license-match-tool.git
  • Create a Python 3 virtual environment

    • python3 -m venv <name-of-the-virtual-env>
  • Activate the venv

    • source <name-of-the-virtual-env>/bin/activate
  • Install the dependencies inside the virtual environment

    • pip3 install -r requirements.txt

Usage

Before using the tool, you need to build a database of the SPDX license list. To do so, run python build_licenses.py.

To run the tool, use the command python matcher.py -f <file-name> -t <threshold> -l <limit>.

  • file-name is the file containing the license text (if you don't provide a file, the tool will prompt you for one).
  • threshold is the score up to which a license is not considered a match (optional).
  • limit is the Dice score above which a license is considered a perfect match (optional).

You can also run python matcher.py --help for more info.
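
As a rough illustration of how the two optional values interact, the sketch below classifies a Dice score against a threshold and a limit. The function name classify_score, the default values, and the exact boundary handling (strictly above versus at or below) are assumptions made for this example; they are not taken from the tool's source.

```python
def classify_score(score, threshold=0.9, limit=0.99):
    """Classify a similarity score in the range [0, 1].

    Hypothetical helper: the name and the default values are
    illustrative assumptions, not the tool's actual settings.
    """
    if not 0.0 <= threshold <= limit <= 1.0:
        raise ValueError("expected 0 <= threshold <= limit <= 1")
    if score > limit:
        # Above the limit: treated as a perfect match.
        return "perfect match"
    if score > threshold:
        # Between threshold and limit: a close match worth a full comparison.
        return "close match"
    # At or below the threshold: not considered a match at all.
    return "no match"
```

Under these assumptions, a higher threshold discards more candidates early, while a lower limit lets more candidates count as perfect matches without the full comparison.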

Workflow

The workflow of the tool is as follows:

  • Reads the license text as input from the user.
  • Builds a Redis database containing the text of every license on the SPDX license list.
  • Compares the input license text with the license texts stored in the database.
    • Normalises the license text according to the SPDX Matching Guidelines, ignoring the replaceable text and focusing only on the substantial text for matching purposes.
    • Tokenizes the normalised text into a list of bigrams, which the token-based algorithm requires.
    • Uses a token-based similarity metric, the Sørensen-Dice coefficient, which counts the tokens common to both texts and relates that count to the total number of tokens in the two sets combined (in its standard form, twice the common count divided by the sum of the two set sizes). This score is what identifies the close matches.
    • A threshold value sets the score below which a license is not considered a match.
    • If the match is 100%, it is reported as a perfect match.
    • If the match is between the threshold value and 100%, the full matching algorithm is applied: the closely matched license text is compared with the text of the SPDX standard license using a method provided by the SPDX tools.
      • If there is a match, the given license text matches the SPDX standard license.
      • If there is no match, the differences between the given license text and the SPDX license text are displayed.
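
The normalise, tokenize, and score steps above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's implementation: the normalise function is a crude stand-in for the SPDX Matching Guidelines (it only lowercases and strips punctuation), the use of word bigrams rather than character bigrams is an assumption, and all function names are hypothetical.

```python
import re


def normalise(text):
    """Crude stand-in for SPDX normalisation (assumption):
    lowercase the text and replace runs of non-alphanumerics with a space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def bigrams(text):
    """Return the set of adjacent word pairs in the normalised text.
    Word-level bigrams stored in a set are an assumption of this sketch."""
    words = normalise(text).split()
    return set(zip(words, words[1:]))


def dice_coefficient(a, b):
    """Standard Sorensen-Dice coefficient on bigram sets:
    2 * |A & B| / (|A| + |B|), with 0.0 when both sets are empty."""
    a_bigrams, b_bigrams = bigrams(a), bigrams(b)
    total = len(a_bigrams) + len(b_bigrams)
    if total == 0:
        return 0.0
    return 2.0 * len(a_bigrams & b_bigrams) / total


# The texts share one bigram ("a", "b") out of two bigrams each.
print(dice_coefficient("a b c", "a b d"))  # 0.5
```

Because the comparison runs on normalised bigram sets, differences in case, punctuation, or whitespace do not lower the score, while changed wording does.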