@sciactive/tokenizer - Tokenizer for Natural Language Processing

Tokenize string input into 32-bit integers (for full text search and natural language processing).

It works by removing stop words, normalizing text, stemming each word, then tokenizing both the original and the stemmed version of the word. It returns the tokens, along with their position in the text (1 indexed), and whether the token is a stemmed version. Stemmed versions will have the same position as their unstemmed counterparts.

For email addresses, it will provide the username part and the domain part as stemmed tokens, then the whole address as an unstemmed token.

For URLs, it will provide the protocol, domain, and path as stemmed tokens, then the whole address as an unstemmed token.

Usage

import { Tokenizer } from '@sciactive/tokenizer';

const input = 'A natural language string input goes here.';

const tokenizer = new Tokenizer(/* Optional config object can go here. */);

const tokens = tokenizer.tokenize(input);
// [
//   { token: -603851912, position: 1, stem: true },
//   { token: -1167464141, position: 1, stem: false },
//   { token: 1865099040, position: 2, stem: true },
//   { token: -723816011, position: 2, stem: false },
//   { token: -1631669591, position: 3, stem: false },
//   { token: -668454185, position: 4, stem: false },
//   { token: 1835329032, position: 5, stem: true },
//   { token: 364389343, position: 5, stem: false },
//   { token: -1323595880, position: 6, stem: false }
// ]

const detailed = tokenizer.detailedTokenize(input);
// {
//   original: [ 'natural', 'language', 'string', 'input', 'goes', 'here' ],
//   stemmed: [ 'natur', 'languag', 'string', 'input', 'goe', 'here' ],
//   tokens: [
//     [
//       {
//         input: 'natur',
//         token: -603851912,
//         position: 1,
//         stem: true
//       },
//       {
//         input: 'natural',
//         token: -1167464141,
//         position: 1,
//         stem: false
//       }
//     ],
//     [
//       {
//         input: 'languag',
//         token: 1865099040,
//         position: 2,
//         stem: true
//       },
//       {
//         input: 'language',
//         token: -723816011,
//         position: 2,
//         stem: false
//       }
//     ],
//     [
//       {
//         input: 'string',
//         token: -1631669591,
//         position: 3,
//         stem: false
//       }
//     ],
//     [
//       {
//         input: 'input',
//         token: -668454185,
//         position: 4,
//         stem: false
//       }
//     ],
//     [
//       {
//         input: 'goe',
//         token: 1835329032,
//         position: 5,
//         stem: true
//       },
//       { input: 'goes', token: 364389343, position: 5, stem: false }
//     ],
//     [
//       {
//         input: 'here',
//         token: -1323595880,
//         position: 6,
//         stem: false
//       }
//     ]
//   ]
// }

Options

The config object includes the following options:

stopWords: The stop word list. Defaults to the sciactive list.
language: A language from listLanguages(). Defaults to 'eng'.
stemmingAlgorithm: A stemming algorithm from listStemmingAlgorithms(language). Defaults to 'porter'.

`tokenize`

This function returns a simplified output of just each token, its position, and whether it is a stemmed version.

`detailedTokenize`

This function gives more detail, including the original tokens, the stemmed versions, and an array of the sets of tokens. Each set will include the unstemmed version in the last position, and any stemmed versions before it.

Other Languages

To use the tokenizer for other languages, you can use the static functions to see what other languages and stemming algorithms are available and get the stop list for a language. Don't forget to set the stop word list for the language you're using.

Note that the stop word list for English returned by getLanguageStopWords('eng') is much longer than the default stop word list (the sciactive list).

import { Tokenizer } from '@sciactive/tokenizer';

// Get the available languages and features for your language.
const supportedLanguages = Tokenizer.listLanguages();
const frenchStemmingAlgorithms = Tokenizer.listStemmingAlgorithms('fra');
const frenchStopWordList = Tokenizer.getLanguageStopWords('fra');

// You must set all of `language`, `stemmingAlgorithm`, and `stopWords` for non-English support.
const tokenizer = new Tokenizer({
  language: 'fra',
  stemmingAlgorithm: 'snowball',
  stopWords: Tokenizer.getLanguageStopWords('fra'),
});

const tokens = tokenizer.tokenize(
  `J'ai envie de Taco Bell. Je vais aller en voiture au restaurant et acheter toute la nourriture.`,
);

// Original: j'ai envie taco bell vais aller voiture restaurant acheter toute nourriture
// Stemmed: jai envi taco bel vais aller voitur restaur achet tout nourritur

License

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
.husky		.husky
src		src
.gitignore		.gitignore
.lintstagedrc.mjs		.lintstagedrc.mjs
.npmignore		.npmignore
.prettierignore		.prettierignore
.prettierrc.json		.prettierrc.json
LICENSE		LICENSE
README.md		README.md
commitlint.config.cjs		commitlint.config.cjs
jest.config.js		jest.config.js
package-lock.json		package-lock.json
package.json		package.json
tsconfig.json		tsconfig.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

@sciactive/tokenizer - Tokenizer for Natural Language Processing

Usage

Options

`tokenize`

`detailedTokenize`

Other Languages

License

About

Uh oh!

Releases

Packages

Languages

License

sciactive/tokenizer

Folders and files

Latest commit

History

Repository files navigation

@sciactive/tokenizer - Tokenizer for Natural Language Processing

Usage

Options

tokenize

detailedTokenize

Other Languages

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

`tokenize`

`detailedTokenize`

Packages