🇬🇧 An extensible, robust and lightweight (45kB) Wiktionary.org scraper to fetch detailed information about words in various languages.

vxern/wiktionary-scraper

A lightweight scraper to fetch information about words in various languages from Wiktionary.

Usage

To start using the scraper, first install it using the following command:

npm install wiktionary-scraper

The simplest way of using the scraper is as follows:

import * as Wiktionary from "wiktionary-scraper";

const results = await Wiktionary.get("word");

To look up a word in a specific language, set lemmaLanguage:

import * as Wiktionary from "wiktionary-scraper";

const results = await Wiktionary.get("o", {
  lemmaLanguage: "Romanian",
});
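
Because the same spelling often exists in several languages, it can be handy to run one lookup per language. The following is a minimal sketch, not part of the library: `lookupInLanguages`, `LookupOptions` and `Getter` are hypothetical names, and the lookup function is passed in as a parameter on the assumption that its call shape matches the `get()` usage shown above.

```typescript
// Hypothetical option shape: only lemmaLanguage is used here, as documented above.
type LookupOptions = { lemmaLanguage?: string };

// Any function with the same call shape as Wiktionary.get can be passed in (assumption).
type Getter<T> = (word: string, options?: LookupOptions) => Promise<T>;

// Look up one spelling in several languages concurrently,
// returning the results keyed by language name.
async function lookupInLanguages<T>(
  get: Getter<T>,
  word: string,
  languages: string[],
): Promise<Map<string, T>> {
  const entries = await Promise.all(
    languages.map(async (language): Promise<[string, T]> => [
      language,
      await get(word, { lemmaLanguage: language }),
    ]),
  );
  return new Map(entries);
}
```

If the signatures line up, a call such as `lookupInLanguages(Wiktionary.get, "o", ["Romanian", "Portuguese"])` would issue one request per language, so keep the list short to stay polite towards Wiktionary.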

To follow redirects, set followRedirects to true:

import * as Wiktionary from "wiktionary-scraper";

// Redirects to and returns results for "Germany".
const results = await Wiktionary.get("germany", {
  followRedirects: true,
});

By default, requests are sent with a User-Agent header that identifies wiktionary-scraper.

To omit the header, set userAgent to undefined.

To use your own value, specify userAgent:

import * as Wiktionary from "wiktionary-scraper";

const results = await Wiktionary.get("word", {
  userAgent: "Your App (https://example.com)",
});

You can also parse the HTML of a Wiktionary page directly, bypassing the fetch step.

ℹ️ Note that, unlike get(), parse() is synchronous:

import * as Wiktionary from "wiktionary-scraper";

const results = Wiktionary.parse(html);
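
If you want to control the fetch step yourself (for caching, retries or a custom HTTP client), one possible approach is to download the page and hand the HTML to parse(). The sketch below is an illustration only: `entryUrl`, `fetchEntryHtml` and `FetchLike` are hypothetical helpers, and the URL pattern assumes the standard English Wiktionary page address.

```typescript
// Minimal structural type so any fetch-compatible function can be injected.
type FetchLike = (
  url: string,
  init?: { headers?: Record<string, string> },
) => Promise<{ ok: boolean; status: number; text(): Promise<string> }>;

// Build the English Wiktionary page URL for a word (spaces become underscores).
function entryUrl(word: string): string {
  return `https://en.wiktionary.org/wiki/${encodeURIComponent(word.replace(/ /g, "_"))}`;
}

// Download the raw HTML for a word; throws on a non-2xx response.
async function fetchEntryHtml(
  word: string,
  fetchFn: FetchLike,
  userAgent?: string,
): Promise<string> {
  const headers: Record<string, string> = {};
  if (userAgent !== undefined) headers["User-Agent"] = userAgent;

  const response = await fetchFn(entryUrl(word), { headers });
  if (!response.ok) {
    throw new Error(`Request for "${word}" failed with status ${response.status}`);
  }
  return response.text();
}
```

With the global `fetch` available, `const html = await fetchEntryHtml("word", fetch);` followed by `Wiktionary.parse(html)` would then reproduce the two steps that get() performs in one call.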

Completeness

This library currently only supports the English version of Wiktionary.

Features

  • Parses both single- and multiple-etymology entries.
  • Recognises standard, non-standard and some explicitly disallowed parts of speech, as defined in Wiktionary's entry layout guidelines. In total, more than 60 parts of speech are recognised, which should cover the vast majority of definitions.
    • Note, however, that the library may fail to recognise certain niche, non-standard parts of speech. If you come across any, please open an issue.

Section support

  • Description
  • Glyph origin
  • Etymology
  • Pronunciation
  • Production
  • Definitions
  • Usage notes
  • Reconstruction notes
  • Inflection sections:
    • Inflection
    • Conjugation
    • Declension
  • Mutation
  • Quotations
  • Alternative forms
  • Alternative reconstructions
  • Relations:
    • Synonyms
    • Antonyms
    • Hypernyms
    • Hyponyms
    • Meronyms
    • Holonyms
    • Comeronyms
    • Troponyms
    • Parasynonyms
    • Coordinate terms
    • Derived terms
    • Related terms
  • Translations
  • Trivia
  • See also
  • References
  • Further reading
  • Anagrams
  • Examples

Recognised parts of speech

Parts of speech
  • Adjective
  • Adverb
  • Ambiposition
  • Article
  • Circumposition
  • Classifier
  • Conjunction
  • Contraction
  • Counter
  • Determiner
  • Ideophone
  • Interjection
  • Noun
  • Numeral
  • Participle
  • Particle
  • Postposition
  • Preposition
  • Pronoun
  • Proper noun
  • Verb
Morphemes
  • Circumfix
  • Combining form
  • Infix
  • Interfix
  • Prefix
  • Root
  • Suffix
Symbols
  • Diacritical mark
  • Letter
  • Ligature
  • Number
  • Punctuation mark
  • Syllable
  • Symbol
Phrases
  • Phrase
  • Proverb
  • Prepositional phrase
Han characters and language-specific varieties
  • Han character
  • Hanzi
  • Kanji
  • Hanja
Other
  • Romanization
  • Logogram
  • Determinative
Explicitly disallowed parts of speech

You know, just in case somebody didn't follow the rules on Wiktionary.

  • Abbreviation
  • Acronym
  • Initialism
  • Cardinal-number
  • Ordinal-number
  • Cardinal-numeral
  • Ordinal-numeral
  • Clitic
  • Gerund
  • Idiom
Library additions
  • Adposition
  • Affix
  • Character
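
To make the grouping above easier to use in code, here is a minimal sketch of a lookup table that maps a section heading to the group it is listed under. It is only an illustration built from the lists in this README: `headingsByGroup`, `classifyHeading` and the `Group` labels are hypothetical names and do not reflect how the library represents parts of speech internally.

```typescript
type Group =
  | "part of speech" | "morpheme" | "symbol" | "phrase"
  | "Han character" | "other" | "disallowed" | "library addition";

// Headings copied from the lists above, keyed by the group they appear under.
const headingsByGroup: Record<Group, string[]> = {
  "part of speech": [
    "Adjective", "Adverb", "Ambiposition", "Article", "Circumposition",
    "Classifier", "Conjunction", "Contraction", "Counter", "Determiner",
    "Ideophone", "Interjection", "Noun", "Numeral", "Participle", "Particle",
    "Postposition", "Preposition", "Pronoun", "Proper noun", "Verb",
  ],
  "morpheme": ["Circumfix", "Combining form", "Infix", "Interfix", "Prefix", "Root", "Suffix"],
  "symbol": [
    "Diacritical mark", "Letter", "Ligature", "Number",
    "Punctuation mark", "Syllable", "Symbol",
  ],
  "phrase": ["Phrase", "Proverb", "Prepositional phrase"],
  "Han character": ["Han character", "Hanzi", "Kanji", "Hanja"],
  "other": ["Romanization", "Logogram", "Determinative"],
  "disallowed": [
    "Abbreviation", "Acronym", "Initialism", "Cardinal-number", "Ordinal-number",
    "Cardinal-numeral", "Ordinal-numeral", "Clitic", "Gerund", "Idiom",
  ],
  "library addition": ["Adposition", "Affix", "Character"],
};

// Invert the table once so each lookup is a single, case-insensitive Map access.
const groupByHeading = new Map<string, Group>();
for (const group of Object.keys(headingsByGroup) as Group[]) {
  for (const heading of headingsByGroup[group]) {
    groupByHeading.set(heading.toLowerCase(), group);
  }
}

// Return the group a heading belongs to, or undefined if it is not listed.
function classifyHeading(heading: string): Group | undefined {
  return groupByHeading.get(heading.trim().toLowerCase());
}
```

A caller could use `classifyHeading("Proper noun")` to decide how to treat an entry, and handle `undefined` as the "unrecognised part of speech" case mentioned under Features.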