first release

shibukawa · Feb 15, 2013 · f20ae7f · f20ae7f
commit f20ae7f
Showing 26 changed files with 15,387 additions and 0 deletions.
diff --git a/LICENSE.rst b/LICENSE.rst
@@ -0,0 +1,24 @@
+License
+-------
+
+Copyright (c) 2013, Yoshiki Shibukawa
+
+All rights reserved.
+
+Redistribution and use in source and binary forms, with or without modification, are permitted provided
+that the following conditions are met:
+
+* Redistributions of source code must retain the above copyright notice, this list of conditions and
+  the following disclaimer.
+* Redistributions in binary form must reproduce the above copyright notice, this list of conditions
+  and the following disclaimer in the documentation and/or other materials provided with the distribution.
+
+THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
+IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
+FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS
+BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
+BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
+BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
diff --git a/MANIFEST.in b/MANIFEST.in
@@ -0,0 +1,4 @@
+include *.rst
+include setup.*
+recursive-include src *.py
+include MANIFEST.in
diff --git a/PKG-INFO b/PKG-INFO
@@ -0,0 +1,49 @@
+Metadata-Version: 1.1
+Name: snowballstemmer
+Version: 0.1.0
+Summary: This package provides 16 stemmer algorithms (15 + Porter English stemmer) generated from Snowball algorithms.
+Home-page: https://github.com/shibukawa/snowball_py
+Author: Yoshiki Shibukawa
+Author-email: yoshiki at shibu.jp
+License: BSD
+Description: 
+        It includes the following language algorithms:
+
+        * Danish
+        * Dutch
+        * English (Standard, Porter)
+        * Finnish
+        * French
+        * German
+        * Hungarian
+        * Italian
+        * Norwegian
+        * Portuguese
+        * Romanian
+        * Russian
+        * Spanish
+        * Swedish
+        * Turkish
+
+Keywords: stemmer
+Platform: UNKNOWN
+Classifier: Development Status :: 4 - Beta
+Classifier: Intended Audience :: Developers
+Classifier: License :: OSI Approved :: BSD License
+Classifier: Programming Language :: Python
+Classifier: Natural Language :: Danish
+Classifier: Natural Language :: Dutch
+Classifier: Natural Language :: English
+Classifier: Natural Language :: Finnish
+Classifier: Natural Language :: French
+Classifier: Natural Language :: German
+Classifier: Natural Language :: Hungarian
+Classifier: Natural Language :: Italian
+Classifier: Natural Language :: Norwegian
+Classifier: Natural Language :: Portuguese
+Classifier: Natural Language :: Romanian
+Classifier: Natural Language :: Russian
+Classifier: Natural Language :: Spanish
+Classifier: Natural Language :: Swedish
+Classifier: Natural Language :: Turkish
+Classifier: Operating System :: OS Independent
diff --git a/README.rst b/README.rst
@@ -0,0 +1,94 @@
+Snowball stemming library collection for Python
+===============================================
+
+This document pertains to the Python version of the stemmer library distribution,
+available for download from:
+
+* https://github.com/shibukawa/snowball_py/
+
+The original program is maintained at the following location:
+
+* http://snowball.tartarus.org/
+
+The original Snowball software was created by Dr Martin Porter and Richard Boulton (Java port) and is
+distributed under the BSD license.
+
+How to use the library
+----------------------
+
+You can use each stemming module from a Python program.
+
+.. code-block:: python
+
+   import snowballstemmer
+
+   stemmer = snowballstemmer.EnglishStemmer()
+   print(stemmer.stem("We are the world"))
+
+The following modules are common to all stemmers. Don't forget to bundle them with your program:
+
+* ``snowballstemmer/__init__.py``
+* ``snowballstemmer/among.py``
+* ``snowballstemmer/basestemmer.py``
+
+The following modules are optional. Select the language modules you need:
+
+* ``danish_stemmer.py``
+* ``dutch_stemmer.py``
+* ``english_stemmer.py``
+* ``finnish_stemmer.py``
+* ``french_stemmer.py``
+* ``german_stemmer.py``
+* ``hungarian_stemmer.py``
+* ``italian_stemmer.py``
+* ``norwegian_stemmer.py``
+* ``porter_stemmer.py``
+* ``portuguese_stemmer.py``
+* ``romanian_stemmer.py``
+* ``russian_stemmer.py``
+* ``spanish_stemmer.py``
+* ``swedish_stemmer.py``
+* ``turkish_stemmer.py``
+
+The TestApp example
+-------------------
+
+The :file:`testapp.py` example program allows you to run any of the stemmers
+on a sample vocabulary.
+
+Usage::
+
+   testapp.py <algorithm> "sentences ... "
+
+.. code-block:: bash
+
+   $ python testapp.py English "sentences... "
+
+License
+-------
+
+It is a BSD licensed library.
+
+-----------------------------
+
+Copyright (c) 2013, Yoshiki Shibukawa
+
+All rights reserved.
+
+Redistribution and use in source and binary forms, with or without modification, are permitted provided
+that the following conditions are met:
+
+* Redistributions of source code must retain the above copyright notice, this list of conditions and
+  the following disclaimer.
+* Redistributions in binary form must reproduce the above copyright notice, this list of conditions
+  and the following disclaimer in the documentation and/or other materials provided with the distribution.
+
+THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
+IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
+FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS
+BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
+BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
+BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
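Review note on the README usage example: the help text in `src/sample/stemwords.py` describes its input as one word per line, which suggests that `stem()` is meant to receive a single word rather than a whole sentence. A minimal sketch of stemming a sentence word by word is shown here. `stem_sentence` and `ToyStemmer` are hypothetical names that are not part of this commit; `ToyStemmer` only stands in for a real class such as `EnglishStemmer` so that the snippet runs without the package installed, and its suffix rule is not the Snowball algorithm.

```python
class ToyStemmer:
    """Hypothetical stand-in exposing the same stem(word) method."""

    def stem(self, word):
        # Crude rule for illustration only: drop a trailing "s".
        if word.endswith("s") and len(word) > 3:
            return word[:-1]
        return word


def stem_sentence(stemmer, sentence):
    """Lowercase the sentence, split it on whitespace, stem each word."""
    return [stemmer.stem(word) for word in sentence.lower().split()]


print(stem_sentence(ToyStemmer(), "We are the world"))
```

With the real package, passing an instance of one of the language stemmer classes in place of `ToyStemmer()` should work the same way, assuming it exposes `stem(word)` as the sample script uses it.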
diff --git a/setup.py b/setup.py
@@ -0,0 +1,55 @@
+#!/usr/bin/env python
+
+from distutils.core import setup
+
+setup(name='snowballstemmer',
+      version='0.1.0',
+      description='This package provides 16 stemmer algorithms (15 + Porter English stemmer) generated from Snowball algorithms.',
+      long_description='''
+It includes the following language algorithms:
+
+* Danish
+* Dutch
+* English (Standard, Porter)
+* Finnish
+* French
+* German
+* Hungarian
+* Italian
+* Norwegian
+* Portuguese
+* Romanian
+* Russian
+* Spanish
+* Swedish
+* Turkish
+''',
+      author='Yoshiki Shibukawa',
+      author_email='yoshiki at shibu.jp',
+      url='https://github.com/shibukawa/snowball_py',
+      keywords="stemmer",
+      license="BSD",
+      packages=["snowballstemmer"], package_dir={"snowballstemmer": "src/snowballstemmer"},
+      classifiers = [
+          'Development Status :: 4 - Beta',
+          'Intended Audience :: Developers',
+          'License :: OSI Approved :: BSD License',
+          'Programming Language :: Python',
+          'Natural Language :: Danish',
+          'Natural Language :: Dutch',
+          'Natural Language :: English',
+          'Natural Language :: Finnish',
+          'Natural Language :: French',
+          'Natural Language :: German',
+          'Natural Language :: Hungarian',
+          'Natural Language :: Italian',
+          'Natural Language :: Norwegian',
+          'Natural Language :: Portuguese',
+          'Natural Language :: Romanian',
+          'Natural Language :: Russian',
+          'Natural Language :: Spanish',
+          'Natural Language :: Swedish',
+          'Natural Language :: Turkish',
+          'Operating System :: OS Independent'
+     ]
+)
diff --git a/src/sample/stemwords.py b/src/sample/stemwords.py
@@ -0,0 +1,140 @@
+import sys
+import re
+import codecs
+
+import snowballstemmer
+from snowballstemmer.danish_stemmer import DanishStemmer
+from snowballstemmer.dutch_stemmer import DutchStemmer
+from snowballstemmer.english_stemmer import EnglishStemmer
+from snowballstemmer.finnish_stemmer import FinnishStemmer
+from snowballstemmer.french_stemmer import FrenchStemmer
+from snowballstemmer.german_stemmer import GermanStemmer
+from snowballstemmer.hungarian_stemmer import HungarianStemmer
+from snowballstemmer.italian_stemmer import ItalianStemmer
+from snowballstemmer.norwegian_stemmer import NorwegianStemmer
+from snowballstemmer.porter_stemmer import PorterStemmer
+from snowballstemmer.portuguese_stemmer import PortugueseStemmer
+from snowballstemmer.romanian_stemmer import RomanianStemmer
+from snowballstemmer.russian_stemmer import RussianStemmer
+from snowballstemmer.spanish_stemmer import SpanishStemmer
+from snowballstemmer.swedish_stemmer import SwedishStemmer
+from snowballstemmer.turkish_stemmer import TurkishStemmer
+
+stemmers = {
+    'danish': DanishStemmer,
+    'dutch': DutchStemmer,
+    'english': EnglishStemmer,
+    'finnish': FinnishStemmer,
+    'french': FrenchStemmer,
+    'german': GermanStemmer,
+    'hungarian': HungarianStemmer,
+    'italian': ItalianStemmer,
+    'norwegian': NorwegianStemmer,
+    'porter': PorterStemmer,
+    'portuguese': PortugueseStemmer,
+    'romanian': RomanianStemmer,
+    'russian': RussianStemmer,
+    'spanish': SpanishStemmer,
+    'swedish': SwedishStemmer,
+    'turkish': TurkishStemmer
+}
+
+def usage():
+    print('''usage: stemwords.py [-l <language>] [-i <input file>] [-o <output file>] [-c <character encoding>] [-p[2]] [-h]
+
+The input file consists of a list of words to be stemmed, one per
+line. Words should be in lower case, but (for English) A-Z letters
+are mapped to their a-z equivalents anyway. If omitted, stdin is
+used.
+
+If -c is given, the argument is the character encoding of the input
+and output files.  If it is omitted, the UTF-8 encoding is used.
+
+If -p is given the output file consists of each word of the input
+file followed by \"->\" followed by its stemmed equivalent.
+If -p2 is given the output file is a two column layout containing
+the input words in the first column and the stemmed equivalents in
+the second column.
+
+Otherwise, the output file consists of the stemmed words, one per
+line.
+
+-h displays this help''')
+
+def main():
+    argv = sys.argv[1:]
+    if len(argv) < 4:  # "-i <file> -o <file>" is the shortest valid invocation
+        usage()
+    else:
+        pretty = 0
+        input = ''
+        output = ''
+        encoding = 'utf_8'
+        language = 'English'
+        show_help = False
+        while len(argv):
+            arg = argv[0]
+            argv = argv[1:]
+            if arg == '-h':
+                show_help = True
+                break
+            elif arg == "-p":
+                pretty = 1
+            elif arg == "-p2":
+                pretty = 2
+            elif arg == "-l":
+                if len(argv) == 0:
+                    show_help = True
+                    break
+                language = argv[0]
+                argv = argv[1:]
+            elif arg == "-i":
+                if len(argv) == 0:
+                    show_help = True
+                    break
+                input = argv[0]
+                argv = argv[1:]
+            elif arg == "-o":
+                if len(argv) == 0:
+                    show_help = True
+                    break
+                output = argv[0]
+                argv = argv[1:]
+            elif arg == "-c":
+                if len(argv) == 0:
+                    show_help = True
+                    break
+                encoding, argv = argv[0], argv[1:]  # consume the value like the other options
+        if show_help or input == '' or output == '':
+            usage()
+        else:
+            stemming(language, input, output, encoding, pretty)
+
+
+def stemming(lang, input, output, encoding, pretty):
+    result = []
+    stemmerClass = stemmers.get(lang.lower(), EnglishStemmer)
+    stemmer = stemmerClass()
+    for original in codecs.open(input, "r", encoding).readlines():
+        original = original.strip()
+        stemmed = stemmer.stem(original)
+        if result:
+            result.append('\n')
+        if pretty == 0:
+            if stemmed != "":
+                result.append(stemmed)
+        elif pretty == 1:
+            result.append(original + " -> " + stemmed)
+        elif pretty == 2:
+            result.append(original)
+            if len(original) < 30:
+                result.append(" " * (30 - len(original)))
+            else:
+                result.append("\n")
+                result.append(" " * 30)
+            result.append(stemmed)
+    outfile = codecs.open(output, "w", encoding)
+    outfile.write(''.join(result) + '\n')
+    outfile.close()
+
+main()
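Review note on the output modes in `stemming()`: the three layouts selected by `-p` and `-p2` can be checked in isolation from file handling and from the stemmer itself. The sketch below is one possible way to do that, not code from this commit. `format_result` and its `width` parameter are hypothetical names, and the word/stem pair is passed in by hand rather than produced by a real stemmer.

```python
def format_result(original, stemmed, pretty=0, width=30):
    """Return the output text for one word in the given pretty mode."""
    if pretty == 1:
        # "-p": the input word, an arrow, then the stem.
        return original + " -> " + stemmed
    if pretty == 2:
        # "-p2": two columns, the first padded to `width` characters.
        if len(original) < width:
            return original.ljust(width) + stemmed
        # Word too long for the first column: the stem goes on the next line.
        return original + "\n" + " " * width + stemmed
    # Default: the stem only.
    return stemmed


for mode in (0, 1, 2):
    print(format_result("running", "run", pretty=mode))
```

Keeping the formatting in a pure function like this would let each mode be tested without reading or writing files.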