first release
Yoshiki Shibukawa committed Feb 15, 2013
0 parents commit f20ae7f
Showing 26 changed files with 15,387 additions and 0 deletions.
24 changes: 24 additions & 0 deletions LICENSE.rst
@@ -0,0 +1,24 @@
License
-------

Copyright (c) 2013, Yoshiki Shibukawa

All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided
that the following conditions are met:

* Redistributions of source code must retain the above copyright notice, this list of conditions and
the following disclaimer.
* Redistributions in binary form must reproduce the above copyright notice, this list of conditions
and the following disclaimer in the documentation and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS
BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

4 changes: 4 additions & 0 deletions MANIFEST.in
@@ -0,0 +1,4 @@
include *.rst
include setup.*
recursive-include src *.py
include MANIFEST.in
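As a rough illustration of what these four rules select, the sketch below approximates them with Python's `fnmatch` against a hypothetical file list. It is not how distutils itself evaluates `MANIFEST.in`; the `select_files` helper and the file names are assumptions made for the example.

```python
import fnmatch
import posixpath

# The four rules from MANIFEST.in above, as (command, arguments) pairs.
RULES = [
    ("include", ["*.rst"]),
    ("include", ["setup.*"]),
    ("recursive-include", ["src", "*.py"]),
    ("include", ["MANIFEST.in"]),
]


def select_files(paths, rules=RULES):
    """Return the paths (in input order) that at least one rule picks up."""
    selected = []
    for path in paths:
        for command, args in rules:
            if command == "include":
                # These patterns have no directory part, so they are only
                # applied to files that sit in the project root.
                hit = "/" not in path and fnmatch.fnmatchcase(path, args[0])
            else:  # recursive-include <dir> <pattern>
                directory, pattern = args
                hit = (path.startswith(directory + "/")
                       and fnmatch.fnmatchcase(posixpath.basename(path), pattern))
            if hit:
                selected.append(path)
                break
    return selected


# Hypothetical project layout, for illustration only.
files = [
    "README.rst",
    "LICENSE.rst",
    "setup.py",
    "MANIFEST.in",
    "src/snowballstemmer/__init__.py",
    "src/sample/stemwords.py",
    "src/notes.txt",
    "docs/index.rst",
]

for name in select_files(files):
    print(name)
```

Under these assumptions, `src/notes.txt` and `docs/index.rst` are the only entries left out: the first is not a `.py` file, and the second is outside both the project root and `src`.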
49 changes: 49 additions & 0 deletions PKG-INFO
@@ -0,0 +1,49 @@
Metadata-Version: 1.1
Name: snowballstemmer
Version: 0.1.0
Summary: This package provides 16 stemmer algorithms (15 + Porter English stemmer) generated from Snowball algorithms.
Home-page: https://github.com/shibukawa/snowball_py
Author: Yoshiki Shibukawa
Author-email: yoshiki at shibu.jp
License: BSD
Description:
It includes the following language algorithms:

* Danish
* Dutch
* English (Standard, Porter)
* Finnish
* French
* German
* Hungarian
* Italian
* Norwegian
* Portuguese
* Romanian
* Russian
* Spanish
* Swedish
* Turkish

Keywords: stemmer
Platform: UNKNOWN
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: BSD License
Classifier: Programming Language :: Python
Classifier: Natural Language :: Danish
Classifier: Natural Language :: Dutch
Classifier: Natural Language :: English
Classifier: Natural Language :: Finnish
Classifier: Natural Language :: French
Classifier: Natural Language :: German
Classifier: Natural Language :: Hungarian
Classifier: Natural Language :: Italian
Classifier: Natural Language :: Norwegian
Classifier: Natural Language :: Portuguese
Classifier: Natural Language :: Romanian
Classifier: Natural Language :: Russian
Classifier: Natural Language :: Spanish
Classifier: Natural Language :: Swedish
Classifier: Natural Language :: Turkish
Classifier: Operating System :: OS Independent
94 changes: 94 additions & 0 deletions README.rst
@@ -0,0 +1,94 @@
Snowball stemming library collection for Python
===============================================

This document pertains to the Python version of the stemmer library distribution,
available for download from:

* https://github.com/shibukawa/snowball_py

The original program is maintained at:

* http://snowball.tartarus.org/

The original Snowball project was created by Dr Martin Porter and Richard Boulton (Java port) under
the BSD license.

How to use library
------------------

You can use each stemming module from a Python program.

.. code-block:: python

   import snowballstemmer

   stemmer = snowballstemmer.EnglishStemmer()
   print(stemmer.stem("We are the world"))

The following modules are common to all stemmers. Don't forget to bundle them with your program:

* ``snowballstemmer/__init__.py``
* ``snowballstemmer/among.py``
* ``snowballstemmer/basestemmer.py``

The following modules are optional. Select the language modules you need:

* ``danish_stemmer.py``
* ``dutch_stemmer.py``
* ``english_stemmer.py``
* ``finnish_stemmer.py``
* ``french_stemmer.py``
* ``german_stemmer.py``
* ``hungarian_stemmer.py``
* ``italian_stemmer.py``
* ``norwegian_stemmer.py``
* ``porter_stemmer.py``
* ``portuguese_stemmer.py``
* ``romanian_stemmer.py``
* ``russian_stemmer.py``
* ``spanish_stemmer.py``
* ``swedish_stemmer.py``
* ``turkish_stemmer.py``
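
The module names above follow one pattern, ``<language>_stemmer.py``. Assuming each module exports a class named ``<Language>Stemmer`` (as the imports in the sample script suggest), a stemmer could be loaded by language name roughly as sketched below. The helper names ``module_and_class`` and ``load_stemmer`` are hypothetical and not part of the library.

```python
import importlib


def module_and_class(language):
    """Map a language name to the (module path, class name) pair to import."""
    name = language.lower()
    return ("snowballstemmer.%s_stemmer" % name, name.capitalize() + "Stemmer")


def load_stemmer(language):
    """Import only the module for one language and instantiate its stemmer.

    Raises ImportError if that language module was not bundled.
    """
    module_name, class_name = module_and_class(language)
    module = importlib.import_module(module_name)
    return getattr(module, class_name)()


print(module_and_class("Porter"))
```

Loading on demand like this means only the common modules and the language modules you actually ship need to be present.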

The TestApp example
-------------------

The :file:`testapp.py` example program allows you to run any of the stemmers
on a sample vocabulary.

Usage::

    testapp.py <algorithm> "sentences ... "

.. code-block:: bash

    $ python testapp.py English "sentences... "
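
The example program itself is not reproduced here. The sketch below shows one way a script with this command line could work, assuming it splits the sentence on whitespace and stems each word separately. ``ToyStemmer`` is a stand-in so that the sketch runs without the library; it is not a Snowball algorithm, and ``stem_sentence`` is a hypothetical helper.

```python
import sys


class ToyStemmer(object):
    """Stand-in exposing a stem(word) method; NOT a Snowball algorithm."""

    def stem(self, word):
        word = word.lower()
        # Strip the first matching suffix, but keep at least three letters.
        for suffix in ("ing", "es", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[:-len(suffix)]
        return word


def stem_sentence(stemmer, sentence):
    """Stem a sentence one whitespace-separated word at a time."""
    return [stemmer.stem(word) for word in sentence.split()]


def main(argv):
    if len(argv) < 2:
        print('usage: testapp.py <algorithm> "sentences ... "')
        return 1
    algorithm, sentence = argv[0], " ".join(argv[1:])
    # A real script would pick the stemmer class for `algorithm` here;
    # this sketch always uses the stand-in.
    stemmer = ToyStemmer()
    print(" ".join(stem_sentence(stemmer, sentence)))
    return 0


if __name__ == "__main__":
    main(sys.argv[1:])
```

Any object with a ``stem(word)`` method can be passed to ``stem_sentence``, so swapping in a real stemmer instance should need no other change.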

License
-------

It is a BSD licensed library.

-----------------------------

Copyright (c) 2013, Yoshiki Shibukawa

All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided
that the following conditions are met:

* Redistributions of source code must retain the above copyright notice, this list of conditions and
the following disclaimer.
* Redistributions in binary form must reproduce the above copyright notice, this list of conditions
and the following disclaimer in the documentation and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS
BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

55 changes: 55 additions & 0 deletions setup.py
@@ -0,0 +1,55 @@
#!/usr/bin/env python

from distutils.core import setup

setup(name='snowballstemmer',
      version='0.1.0',
      description='This package provides 16 stemmer algorithms (15 + Porter English stemmer) generated from Snowball algorithms.',
      long_description='''
It includes the following language algorithms:

* Danish
* Dutch
* English (Standard, Porter)
* Finnish
* French
* German
* Hungarian
* Italian
* Norwegian
* Portuguese
* Romanian
* Russian
* Spanish
* Swedish
* Turkish
''',
      author='Yoshiki Shibukawa',
      author_email='yoshiki at shibu.jp',
      url='https://github.com/shibukawa/snowball_py',
      keywords="stemmer",
      license="BSD",
      # package_dir only maps the package name to its source directory;
      # the package must also be listed in packages to be installed.
      packages=['snowballstemmer'],
      package_dir={"snowballstemmer": "src/snowballstemmer"},
      classifiers=[
          'Development Status :: 4 - Beta',
          'Intended Audience :: Developers',
          'License :: OSI Approved :: BSD License',
          'Programming Language :: Python',
          'Natural Language :: Danish',
          'Natural Language :: Dutch',
          'Natural Language :: English',
          'Natural Language :: Finnish',
          'Natural Language :: French',
          'Natural Language :: German',
          'Natural Language :: Hungarian',
          'Natural Language :: Italian',
          'Natural Language :: Norwegian',
          'Natural Language :: Portuguese',
          'Natural Language :: Romanian',
          'Natural Language :: Russian',
          'Natural Language :: Spanish',
          'Natural Language :: Swedish',
          'Natural Language :: Turkish',
          'Operating System :: OS Independent'
      ]
)
140 changes: 140 additions & 0 deletions src/sample/stemwords.py
@@ -0,0 +1,140 @@
import sys
import re
import codecs

import snowballstemmer
from snowballstemmer.danish_stemmer import DanishStemmer
from snowballstemmer.dutch_stemmer import DutchStemmer
from snowballstemmer.english_stemmer import EnglishStemmer
from snowballstemmer.finnish_stemmer import FinnishStemmer
from snowballstemmer.french_stemmer import FrenchStemmer
from snowballstemmer.german_stemmer import GermanStemmer
from snowballstemmer.hungarian_stemmer import HungarianStemmer
from snowballstemmer.italian_stemmer import ItalianStemmer
from snowballstemmer.norwegian_stemmer import NorwegianStemmer
from snowballstemmer.porter_stemmer import PorterStemmer
from snowballstemmer.portuguese_stemmer import PortugueseStemmer
from snowballstemmer.romanian_stemmer import RomanianStemmer
from snowballstemmer.russian_stemmer import RussianStemmer
from snowballstemmer.spanish_stemmer import SpanishStemmer
from snowballstemmer.swedish_stemmer import SwedishStemmer
from snowballstemmer.turkish_stemmer import TurkishStemmer

# Maps a lower-case language name to its stemmer class.
stemmers = {
    'danish': DanishStemmer,
    'dutch': DutchStemmer,
    'english': EnglishStemmer,
    'finnish': FinnishStemmer,
    'french': FrenchStemmer,
    'german': GermanStemmer,
    'hungarian': HungarianStemmer,
    'italian': ItalianStemmer,
    'norwegian': NorwegianStemmer,
    'porter': PorterStemmer,
    'portuguese': PortugueseStemmer,
    'romanian': RomanianStemmer,
    'russian': RussianStemmer,
    'spanish': SpanishStemmer,
    'swedish': SwedishStemmer,
    'turkish': TurkishStemmer
}

def usage():
    print('''usage: stemwords.py [-l <language>] -i <input file> -o <output file> [-c <character encoding>] [-p[2]] [-h]

The input file consists of a list of words to be stemmed, one per
line. Words should be in lower case, but (for English) A-Z letters
are mapped to their a-z equivalents anyway. Both the input file and
the output file are required.

If -l is omitted, the English stemmer is used.

If -c is given, the argument is the character encoding of the input
and output files. If it is omitted, the UTF-8 encoding is used.

If -p is given the output file consists of each word of the input
file followed by "->" followed by its stemmed equivalent.
If -p2 is given the output file is a two column layout containing
the input words in the first column and the stemmed equivalents in
the second column.

Otherwise, the output file consists of the stemmed words, one per
line.

-h displays this help''')

def main():
    argv = sys.argv[1:]
    pretty = 0
    input = ''
    output = ''
    encoding = 'utf_8'
    language = 'English'
    show_help = False
    while len(argv):
        arg = argv[0]
        argv = argv[1:]
        if arg == '-h':
            show_help = True
            break
        elif arg == "-p":
            pretty = 1
        elif arg == "-p2":
            pretty = 2
        elif arg == "-l":
            if len(argv) == 0:
                show_help = True
                break
            language = argv[0]
            argv = argv[1:]
        elif arg == "-i":
            if len(argv) == 0:
                show_help = True
                break
            input = argv[0]
            argv = argv[1:]
        elif arg == "-o":
            if len(argv) == 0:
                show_help = True
                break
            output = argv[0]
            argv = argv[1:]
        elif arg == "-c":
            if len(argv) == 0:
                show_help = True
                break
            encoding = argv[0]
            # Consume the encoding name so it is not parsed as an option.
            argv = argv[1:]
    # Both an input file and an output file are required.
    if show_help or input == '' or output == '':
        usage()
    else:
        stemming(language, input, output, encoding, pretty)


def stemming(lang, input, output, encoding, pretty):
    result = []
    # Unknown language names fall back to the English stemmer.
    stemmerClass = stemmers.get(lang.lower(), EnglishStemmer)
    stemmer = stemmerClass()
    with codecs.open(input, "r", encoding) as infile:
        lines = infile.readlines()
    for original in lines:
        original = original.strip()
        stemmed = stemmer.stem(original)
        if result:
            result.append('\n')
        if pretty == 0:
            if stemmed != "":
                result.append(stemmed)
        elif pretty == 1:
            # list.append takes a single argument, so join the parts first.
            result.append(original + " -> " + stemmed)
        elif pretty == 2:
            result.append(original)
            if len(original) < 30:
                # Pad the first column to 30 characters.
                result.append(" " * (30 - len(original)))
            else:
                # Word too long for the first column: put the stem on the next line.
                result.append("\n")
                result.append(" " * 30)
            result.append(stemmed)
    with codecs.open(output, "w", encoding) as outfile:
        outfile.write(''.join(result) + '\n')


if __name__ == '__main__':
    main()
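
The three output layouts described in the help text of `stemwords.py` can be isolated into one small function, which makes them easier to check by hand. This is a sketch, not part of the commit: `format_line` and its `width` parameter are hypothetical names, and the word/stem pair is supplied by hand rather than produced by the library.

```python
def format_line(original, stemmed, pretty=0, width=30):
    """Format one word/stem pair the way the -p and -p2 options describe."""
    if pretty == 1:
        # -p: "word -> stem"
        return original + " -> " + stemmed
    if pretty == 2:
        # -p2: two columns; the first column is `width` characters wide.
        if len(original) < width:
            return original.ljust(width) + stemmed
        # Word does not fit in the first column: stem goes on the next line.
        return original + "\n" + " " * width + stemmed
    # Default: the stem only.
    return stemmed


# Hand-written pair, for illustration only.
for mode in (0, 1, 2):
    print(format_line("running", "run", mode))
```

Keeping the formatting separate from file handling also means the layouts can be tested without reading or writing any files.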