GitHub - samwilson/Wikparser: Wiktionary Parser

Wikparser 0.3a
=============

Wiktionary Text Parser
Author: Yves Bourque
http://www.igrec.ca/projects/wiktionary-text-parser/

CHANGELOG 0.3a
==============
- Language parameters are now set via an array ($langParameters)
- Parser classes no longer set all language parameters

CHANGELOG 0.3
=============
- Added gender parse functionality
- Added support for Spanish (partial)
- Added support for German (partial)
- Changed to mysqli
- Simplified mysql data retrieval
- Reduced number of variables
- Added additional error messages


DESCRIPTION
===========
The Wiktionary Parser (or Wikparser) is a small tool written in PHP that allows users to extract specific information from the Wiktionary API or a local copy of Wiktionary's database in MySQL.

Currently, this software is able to extract the following information:
- Part of speech/Lexical categories (pos)
- Synonyms (syn)
- Hypernyms (hyper)
- Gender (gender)
- Definitions (def)

for the following languages:
- English
- French
- Spanish (partial)
- German (partial)

Additional language support may be added by following the guide at http://www.igrec.ca/projects/wiktionary-text-parser/ or at the end of this file.

REQUIREMENTS
============
You will need:
- Apache or some other web server platform
- PHP 5
- cURL

INSTALLATION
============
Simply download and copy files and folders to a location accessible from your Web server.

USAGE
=====
To use Wikparser, you need to call or point your browser to the wikparser.php file with the following parameters and values (* indicates a mandatory parameter):

- *word: any string (e.g. /wikparser.php?word=dog)
- *query: the type of query; "def" for definitions, "syn" for synonyms, "pos" for parts of speech, "hyper" for hypernyms, and "gender" for gender (e.g. /wikparser.php?query=pos)
- lang: Wiktionary language code. The script currently supports English ("en"), French ("fr"), Spanish ("es"), and German ("de") natively [default: en].
- count: number of items to return [default: 100]
- source: location of Wiktionary data; "local" for a local MySQL copy of Wiktionary; "api" for Wiktionary’s API [default: api]

EXAMPLES
========
The examples below use the Wikparser hosted at www.igrec.ca.

- Get first 2 definitions of the word "table" in English directly from Wiktionary:
http://www.igrec.ca/project-files/wikparser/wikparser.php?word=table&query=def&count=2

- Get all parts of speech for the word "puissance" in French directly from Wiktionary:
http://www.igrec.ca/project-files/wikparser/wikparser.php?word=puissance&query=pos&lang=fr
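
The example URLs above can also be assembled in PHP rather than typed by hand. The sketch below is only an illustration: the function name buildWikparserUrl is hypothetical and not part of Wikparser, and the base URL is whatever location you installed wikparser.php at.

```php
<?php
// Hypothetical helper (not part of Wikparser): builds a request URL
// from the parameters described in the USAGE section.
function buildWikparserUrl($baseUrl, $word, $query, $options = array())
{
    // word and query are mandatory; lang, count and source are optional
    // and are only appended when the caller supplies them.
    $params = array('word' => $word, 'query' => $query);
    foreach (array('lang', 'count', 'source') as $name) {
        if (isset($options[$name])) {
            $params[$name] = $options[$name];
        }
    }
    // http_build_query() URL-encodes each value; "&" is passed explicitly
    // so the result does not depend on the arg_separator.output setting.
    return $baseUrl . '?' . http_build_query($params, '', '&');
}

$base = 'http://www.igrec.ca/project-files/wikparser/wikparser.php';
echo buildWikparserUrl($base, 'table', 'def', array('count' => 2)), "\n";
echo buildWikparserUrl($base, 'puissance', 'pos', array('lang' => 'fr')), "\n";
```

Omitted optional parameters fall back to the defaults listed under USAGE.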

ADDING SUPPORT FOR OTHER LANGUAGES
==================================
In order to add support for other languages, you must first determine the language code used by Wiktionary. It's usually the standard two-letter code, but you can always check by going to wiktionary.org and selecting the language you're interested in. Then look at the first few letters of the URL:

http://tr.wiktionary.org/ : "tr" for Turkish
http://vi.wiktionary.org/ : "vi" for Vietnamese

Now open the language.config.php file in the root of the Wiktionary Parser. You'll see a PHP switch. You must add a new case (or modify one of the ones included if you don't care about keeping current language functionality) for the language you want to work with. You'll see the following:

case "INSERT LANGUAGE CODE HERE":
	$langParameters = array(
		"langCode" => "",
		"langHeader" => "",
		"langSeparator" => "",
		"defHeader" => "",
		"defTag" => "",
		"synHeader" => "",
		"hyperHeader" => "",
		"genderPattern" => "",
		"posMatchType" => "",
		"posPattern" => "",
		"posArray" => "",
		"posExtraString" => "",
	);
	break;

For instance, if you're working with Turkish, you would insert tr between the case quotes. As for the rest, you'll need to look at the output generated by the Wiktionary API (the output is identical for a local copy of the database). Call the API with a word and inspect the output to identify each of the parameters above. Follow the link below for an example of the output for the word abuelo via the Wiktionary API using the Spanish language code (es):

http://es.wiktionary.org/w/api.php?action=parse&prop=wikitext&page=abuelo&format=xmlfm
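
When scanning many words, it may be easier to pull the raw wikitext with a short script than to open each URL by hand. The sketch below is an assumption-laden illustration, not Wikparser's own code: the function names are hypothetical, it requests format=json instead of the browser-oriented xmlfm shown above, it uses https, and it assumes the default JSON layout of the MediaWiki parse module, where the wikitext sits under parse.wikitext["*"].

```php
<?php
// Hypothetical helper: builds the same kind of API URL as the link above,
// but asks for JSON so the response can be decoded programmatically.
function buildWiktionaryApiUrl($langCode, $word)
{
    $query = http_build_query(array(
        'action' => 'parse',
        'prop'   => 'wikitext',
        'page'   => $word,
        'format' => 'json',
    ), '', '&');
    return 'https://' . $langCode . '.wiktionary.org/w/api.php?' . $query;
}

// Hypothetical helper: returns the raw wikitext for a word, or null on
// failure. Requires the cURL extension listed under REQUIREMENTS.
function fetchWikitext($langCode, $word)
{
    $ch = curl_init(buildWiktionaryApiUrl($langCode, $word));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    // Placeholder user agent string; replace with something identifying you.
    curl_setopt($ch, CURLOPT_USERAGENT, 'WikparserExample/0.1');
    $body = curl_exec($ch);
    if ($body === false) {
        return null;
    }
    $data = json_decode($body, true);
    // Assumed response layout: {"parse": {"wikitext": {"*": "..."}}}
    return isset($data['parse']['wikitext']['*'])
        ? $data['parse']['wikitext']['*']
        : null;
}

// Usage (performs a network request):
// echo fetchWikitext('es', 'abuelo');
```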

You'll need to scan multiple words to determine what patterns to use for whatever language you're interested in. It can be tricky, as the raw data is messy and inconsistent. You'll often find identifiers that differ from one entry to the next. Once you've figured out how Wiktionary encodes its data for that language, you can begin to fill in the parameters. Not all parameters need to be set for the parser to work; if you're only interested in extracting synonyms, then only synHeader requires a value. One by one:

langCode:
The string that identifies the language within Wiktionary (e.g. en, de, tr, etc.)

langHeader:
The string that identifies the section for the language you're working with. Wiktionary often lists multiple languages on a page for a given word (table, for instance, is valid in both English and French). It's important to identify the string that starts a language section so that information from another language isn't parsed (e.g. ==English==).

langSeparator:
The string that separates each language on a given page. Sometimes it's a simple string (e.g. "----" in the English Wiktionary), but in other cases you might have to use a partial string. For example, the French Wiktionary wraps languages within "== {{=fr=}} ==", so we can assume that each new language section will begin with "== {{=". This is therefore the langSeparator for French entries.
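
To make the roles of langHeader and langSeparator concrete, here is a minimal sketch of how a language section could be cut out of a page. It is not the parser's actual implementation; the function name is hypothetical and plain substring search is assumed.

```php
<?php
// Hypothetical sketch: returns the text between $langHeader and the next
// $langSeparator, or null when the header is not on the page.
function extractLanguageSection($wikitext, $langHeader, $langSeparator)
{
    $start = strpos($wikitext, $langHeader);
    if ($start === false) {
        return null;
    }
    // Search for the separator only after the header. This matters for
    // French, where the header itself begins with the separator "== {{=".
    $bodyStart = $start + strlen($langHeader);
    $end = strpos($wikitext, $langSeparator, $bodyStart);
    if ($end === false) {
        // Last language on the page: keep everything up to the end.
        return substr($wikitext, $bodyStart);
    }
    return substr($wikitext, $bodyStart, $end - $bodyStart);
}

$page = "== {{=fr=}} ==\n# meuble\n== {{=en=}} ==\n# table\n";
echo extractLanguageSection($page, '== {{=fr=}} ==', '== {{=');
```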

defHeader:
The string that begins the definitions section. Not always present (e.g. English). In German, all definitions fall under the {{Bedeutungen}} string.

defTag:
Definitions are usually preceded by some non-alphanumeric character (e.g. in English by "# " (notice the space)). This differs between languages, however.
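
As a rough illustration of how defTag can be applied, the sketch below keeps only the lines that start with the tag and stops after count items. The function name is hypothetical, and the real parser may work differently (for example, it may first skip ahead to defHeader when one is set).

```php
<?php
// Hypothetical sketch: collects up to $count definitions from a language
// section by keeping lines that begin with $defTag.
function extractDefinitions($section, $defTag, $count = 100)
{
    $definitions = array();
    // \R matches any newline sequence (\n, \r\n, \r).
    foreach (preg_split('/\R/', $section) as $line) {
        if (strncmp($line, $defTag, strlen($defTag)) === 0) {
            $definitions[] = trim(substr($line, strlen($defTag)));
            if (count($definitions) >= $count) {
                break;
            }
        }
    }
    return $definitions;
}

$section = "# An item of furniture.\n#: ''Set the table.''\n# A grid of data.\n# A third sense.";
print_r(extractDefinitions($section, '# ', 2));
```

Because the English tag includes the trailing space, lines such as "#: " (usage examples) are not mistaken for definitions.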

synHeader:
String that identifies the synonyms section (e.g. English: ====Synonyms====).

hyperHeader:
String that identifies the hypernyms section (e.g. English: ====Hypernyms====).
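
The same idea applies to synHeader and hyperHeader: find the header, read until the next heading, and collect the listed terms. The sketch below is a simplified assumption, not the parser's code; the function name is hypothetical, and it only picks up plain [[wikilinks]], whereas real entries may also list terms through templates.

```php
<?php
// Hypothetical sketch: returns up to $count link targets found under
// $header (e.g. "====Synonyms====") within a language section.
function extractLinkedTerms($section, $header, $count = 100)
{
    $start = strpos($section, $header);
    if ($start === false) {
        return array();
    }
    // Skip the rest of the header line.
    $lineEnd = strpos($section, "\n", $start);
    if ($lineEnd === false) {
        return array();
    }
    $body = substr($section, $lineEnd + 1);
    // Stop at the next heading, i.e. the next line that starts with "=".
    if (preg_match('/^=/m', $body, $m, PREG_OFFSET_CAPTURE)) {
        $body = substr($body, 0, $m[0][1]);
    }
    // Capture the target of each [[target]] or [[target|label]] link.
    preg_match_all('/\[\[([^\]|#]+)/', $body, $matches);
    return array_slice(array_values(array_unique($matches[1])), 0, $count);
}

$section = "====Synonyms====\n* [[hound]]\n* [[pooch|pooch]]\n\n====Hypernyms====\n* [[canine]]\n";
print_r(extractLinkedTerms($section, '====Synonyms===='));
```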

genderPattern:
A regular expression that captures a word's gender. The patterns used are often inconsistent, so you'll need to go through a few pages to make sure you've identified all possible strings.
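
For illustration only, the pattern below assumes a wiki that marks gender with short templates such as {{m}}, {{f}} or {{n}}. Both the pattern and the function name are hypothetical; check real entries for the language you are adding, and check how the parser consumes the match before relying on a capture group.

```php
<?php
// Hypothetical pattern: matches {{m}}, {{f}} or {{n}} and captures the letter.
$genderPattern = '/\{\{([mfn])\}\}/';

// Hypothetical sketch: returns the first captured gender marker, or null.
function extractGender($section, $genderPattern)
{
    if (preg_match($genderPattern, $section, $m)) {
        return $m[1];
    }
    return null;
}

var_dump(extractGender("'''Tisch''' {{m}}", $genderPattern));
```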

posMatchType:
Either "array" or "preg". This determines how the parts of speech are identified. If, as in English, there is a limited number of possibilities, you can simply store them in an array and set this variable to "array". If the parts of speech vary greatly (as they do in French), then you'll want to use a regular expression and set this variable to "preg".

posPattern:
If the parts of speech vary greatly, you'll need to write a regular expression in order to identify them. If you're unfamiliar with regular expressions, have a look at this quick guide, which also has a link to a tutorial.

posArray:
If the parts of speech do not vary and are limited in number, you can store them all in this array and set the posMatchType variable to "array".

posExtraString:
When using regular expressions to match POS, you often need to include unrelated strings in the pattern in order to capture the correct entry (e.g. in German, POS is preceded by {{Wortart|). Add that string here to have the parser strip it from the output.
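
The four pos* parameters interact, so a small sketch may help. It is a guess at the general approach rather than the parser's actual code: the function name is hypothetical, "array" mode is assumed to look for each posArray entry as a heading, and "preg" mode is assumed to strip posExtraString from each match.

```php
<?php
// Hypothetical sketch: returns the parts of speech found in a language
// section, using either the "array" or the "preg" strategy.
function extractPartsOfSpeech($section, array $params)
{
    $found = array();
    if ($params['posMatchType'] === 'array') {
        foreach ($params['posArray'] as $pos) {
            // Match the entry as a whole heading line (e.g. ===Noun===),
            // so that "Noun" is not also found inside "Pronoun".
            $heading = '/^=+\s*' . preg_quote($pos, '/') . '\s*=+\s*$/mi';
            if (preg_match($heading, $section)) {
                $found[] = $pos;
            }
        }
    } elseif ($params['posMatchType'] === 'preg') {
        preg_match_all($params['posPattern'], $section, $matches);
        foreach ($matches[0] as $match) {
            // Remove the helper string that was only needed for matching.
            $found[] = str_replace($params['posExtraString'], '', $match);
        }
    }
    return array_values(array_unique($found));
}

// German-style example; the pattern itself is an assumption.
print_r(extractPartsOfSpeech('{{Wortart|Substantiv|Deutsch}}', array(
    'posMatchType'   => 'preg',
    'posPattern'     => '/\{\{Wortart\|[^|}]+/',
    'posExtraString' => '{{Wortart|',
)));
```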

Once these parameters are set, you should be able to call the script with the new language code set to the lang parameter.
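
Putting it together, a filled-in case for English might look like the sketch below. The header, separator and tag values are the ones quoted in the descriptions above; the posArray entries are an illustrative subset, and the wrapper function exists only to make the snippet runnable on its own. The values shipped in language.config.php may differ.

```php
<?php
// Hypothetical wrapper around a switch like the one in language.config.php.
function exampleLangParameters($lang)
{
    switch ($lang) {
        case "en":
            $langParameters = array(
                "langCode" => "en",
                "langHeader" => "==English==",
                "langSeparator" => "----",
                "defHeader" => "",              // not present in English
                "defTag" => "# ",               // note the trailing space
                "synHeader" => "====Synonyms====",
                "hyperHeader" => "====Hypernyms====",
                "genderPattern" => "",          // left empty in this sketch
                "posMatchType" => "array",
                "posPattern" => "",             // unused in "array" mode
                "posArray" => array("Noun", "Verb", "Adjective", "Adverb"), // illustrative subset
                "posExtraString" => "",
            );
            break;
        default:
            // Unknown language code: no parameters available.
            $langParameters = array();
    }
    return $langParameters;
}

print_r(exampleLangParameters("en"));
```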