# Stemming (and Inverse Stemming) words from multiple languages

For more information on the inner workings of the algorithm, refer to: 
http://snowball.tartarus.org/algorithms/french/stemmer.html

The following content is derived from the PyStemmer quickstart guide [here](https://github.com/snowballstem/pystemmer/blob/master/docs/quickstart_python3.txt), which is [licensed](https://github.com/snowballstem/pystemmer/blob/master/LICENSE) under the MIT License and contains portions under the 3-Clause BSD License.

In [None]:
# !pip3 install pystemmer

In [1]:
# Quickstart
# This is a very brief introduction to the use of PyStemmer.
# First, import the library:

import Stemmer

# Just for show, we'll display a list of the available stemming algorithms:

print(Stemmer.algorithms())
# ['danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'porter', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish', 'turkish']

# Now, we'll get an instance of the French stemming algorithm:
stemmer = Stemmer.Stemmer('french')

# Stem a single word:
print(stemmer.stemWord('coder'))
# cod

# Stem a list of words:
print(stemmer.stemWords(['coder', 'codera']))
# ['cod', 'cod']

# Supplied strings are assumed to be unicode.
# UTF-8 encoded bytes are accepted too, and their stems come back as bytes:
print(stemmer.stemWords(['coder', b'codera']))
# ['cod', b'cod']

# Each instance of the stemming algorithms uses a cache to speed up processing of
# common words.  By default, the cache holds 10000 words, but this may be
# modified.  The cache may be disabled entirely by setting the cache size to 0:
print(stemmer.maxCacheSize)
# 10000

stemmer.maxCacheSize = 1000
print(stemmer.maxCacheSize)
# 1000

['danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'porter', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish', 'turkish']
cod
['cod', 'cod']
['cod', b'cod']
10000
1000


## How to do Inverse Stemming?

To convert the topics found by the LDA back into readable words, we also need a backward pass. To see how the Stemmer class above is wrapped for this purpose, refer to the code of the current [project](https://github.com/ArtificiAI/Multilingual-Latent-Dirichlet-Allocation-LDA), specifically [lda_service/logic/stemmer.py](https://github.com/ArtificiAI/Multilingual-Latent-Dirichlet-Allocation-LDA/blob/master/lda_service/logic/stemmer.py).

In short, inverse stemming works by keeping track of every original word that was reduced to a given stem, along with how many times each one occurred. The inverse pass then returns the original word with the highest count for that stem.

```python
# Original comments:
['Un super-chat marche sur le trottoir',
 'Les super-chats aiment ronronner',
 'Les chats sont ronrons',
 'Un super-chien aboie',
 'Deux super-chiens',
 "Combien de chiens sont en train d'aboyer?"]

# Original comments without stop words:
['super-chat marche trottoir',
 'super-chats aiment ronronner',
 'chats ronrons',
 'super-chien aboie',
 'Deux super-chiens',
 'Combien chiens train aboyer?']

# Stemmed comments:
['sup chat march trottoir',
 'sup chat aiment ronron',
 'chat ronron',
 'sup chien aboi',
 'deux sup chien',
 'combien chien train aboi']

# Cache saved by the custom stemmer for the inverse pass later on.
# For each stem, it maps the original words to their counts, so the top word can be chosen:
{'aboi': {'aboie': 1, 'aboyer': 1},
 'aiment': {'aiment': 1},
 'chat': {'chat': 1, 'chats': 2},
 'chien': {'chien': 1, 'chiens': 2},
 'combien': {'Combien': 1},
 'deux': {'Deux': 1},
 'march': {'marche': 1},
 'ronron': {'ronronner': 1, 'ronrons': 1},
 'sup': {'super': 4},
 'train': {'train': 1},
 'trottoir': {'trottoir': 1}}
```
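
The bookkeeping described above can be sketched in a few lines. This is a minimal illustration, not the project's actual implementation: the `CountingStemmer` class, its `stem` and `inverse` methods, and the `toy_stem` function are all hypothetical names introduced here. `toy_stem` is only a stand-in so the example runs without PyStemmer installed, and breaking ties alphabetically is an assumption made for determinism.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, Iterable, List


class CountingStemmer:
    """Hypothetical wrapper that records which original words produced each stem."""

    def __init__(self, stem_word: Callable[[str], str]):
        # Any callable mapping a word to its stem can be plugged in here.
        self.stem_word = stem_word
        # stem -> Counter of the original words that were reduced to it
        self.cache: Dict[str, Counter] = defaultdict(Counter)

    def stem(self, words: Iterable[str]) -> List[str]:
        """Stem each word and count the original word under its stem."""
        stems = []
        for word in words:
            stem = self.stem_word(word)
            self.cache[stem][word] += 1
            stems.append(stem)
        return stems

    def inverse(self, stem: str) -> str:
        """Return the most frequent original word for a stem.

        Unknown stems are returned unchanged; ties are broken alphabetically
        (an assumption of this sketch).
        """
        counts = self.cache.get(stem)
        if not counts:
            return stem
        return min(counts, key=lambda word: (-counts[word], word))


def toy_stem(word: str) -> str:
    """Stand-in stemmer for the demo only: lowercase and drop trailing 's'."""
    return word.lower().rstrip("s")


counter = CountingStemmer(toy_stem)
stems = counter.stem(["chat", "chats", "chats", "chien"])
print(stems)
print(dict(counter.cache["chat"]))
print(counter.inverse("chat"))
```

With PyStemmer installed, the bound method `Stemmer.Stemmer('french').stemWord` from the quickstart could be passed in place of `toy_stem`, since the wrapper only needs a callable that maps one word to its stem.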