![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/nlu/blob/master/examples/collab/Text_Pre_Processing_and_Cleaning/NLU_Stemmer_example.ipynb)
# Stemming with NLU 

Stemming reduces every token in the input data to its stem, also called the root or base form of the word. A stem does not have to be a valid word itself.

For example, 'He was hungry' becomes 'he wa hungri'.


Stemming works by applying a heuristic process that strips and rewrites the suffixes of words.
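To make the idea of heuristic suffix stripping concrete, here is a minimal sketch of a toy stemmer. It is not the algorithm behind `en.stem`; the function name `toy_stem` and its handful of rules are assumptions chosen only to illustrate the technique on the example sentence.

```python
def toy_stem(token: str) -> str:
    """Toy suffix stripper (hypothetical, for illustration only)."""
    word = token.lower()

    # Ordered (suffix, replacement) rules for plural-like endings; first match wins.
    rules = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]
    for suffix, replacement in rules:
        if word.endswith(suffix):
            word = word[: len(word) - len(suffix)] + replacement
            break

    # Rewrite a trailing "y" to "i" when the rest of the word contains a vowel.
    if word.endswith("y") and any(c in "aeiou" for c in word[:-1]):
        word = word[:-1] + "i"

    return word


print([toy_stem(t) for t in "He was hungry".split()])
# ['he', 'wa', 'hungri']
```

A real stemmer applies many more rule groups and conditions, so its output will differ from this sketch on most other words.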


# 1. Install Java and NLU

In [None]:

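import os

# Refresh the package index quietly
! apt-get update -qq > /dev/null
# Install Java 8, which the Spark NLP backend of NLU needs
! apt-get install -y openjdk-8-jdk-headless -qq > /dev/null
# Point JAVA_HOME and PATH at the Java 8 installation
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]
# Install NLU
! pip install nlu > /dev/null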
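Before loading a model, it can help to confirm that the Java setup took effect. The following is a small sanity check using only the Python standard library; the helper name `java_status` is an assumption introduced here and is not part of NLU.

```python
import os
import shutil


def java_status() -> dict:
    """Report where JAVA_HOME points and whether a `java` binary is on PATH."""
    return {
        "JAVA_HOME": os.environ.get("JAVA_HOME"),
        "java_on_path": shutil.which("java") is not None,
    }


print(java_status())
```

If `java_on_path` is `False`, the install cell above most likely did not complete.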

# 2. Load the model and stem a sample string

In [None]:
import nlu
pipe = nlu.load('en.stem')
pipe.predict('He was suprised by the diversity of NLU')

| origin_index | sentence | stem |
|---|---|---|
| 0 | He was suprised by the diversity of NLU | [he, wa, supris, by, the, divers, of, nlu] |


# 3. Get one row per stemmed token by setting output_level to token
This lets us compare each original token with the stem it was reduced to.

In [None]:
pipe.predict('He was suprised by the diversity of NLU', output_level='token')

| origin_index | token | stem |
|---|---|---|
| 0 | He | he |
| 0 | was | wa |
| 0 | suprised | supris |
| 0 | by | by |
| 0 | the | the |
| 0 | diversity | divers |
| 0 | of | of |
| 0 | NLU | nlu |
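With one row per token, it is easy to pick out the tokens that stemming actually changed, beyond lowercasing. The sketch below works on pairs copied by hand from the output above so that it runs without NLU; the helper name `changed_by_stemming` is hypothetical.

```python
# (token, stem) pairs copied from the token-level output above
pairs = [
    ("He", "he"), ("was", "wa"), ("suprised", "supris"), ("by", "by"),
    ("the", "the"), ("diversity", "divers"), ("of", "of"), ("NLU", "nlu"),
]


def changed_by_stemming(pairs):
    """Keep pairs whose stem differs from the lowercased token."""
    return [(tok, stem) for tok, stem in pairs if tok.lower() != stem]


for tok, stem in changed_by_stemming(pairs):
    print(f"{tok} -> {stem}")
# was -> wa
# suprised -> supris
# diversity -> divers
```

Only three of the eight tokens lose a suffix; the rest are merely lowercased.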


# 4. Check out the stemming models NLU offers for languages other than English

In [None]:
nlu.print_all_model_kinds_for_action('stem')

For language <en> NLU provides the following Models : 
nlu.load('en.stem') returns Spark NLP model stemmer


## 4.1 Let's try German stemming!

In [None]:
nlu.load('de.stem').predict("Er war von der Vielfältigkeit des NLU Packets begeistert", output_level='token')

| origin_index | token | stem |
|---|---|---|
| 0 | Er | er |
| 0 | war | war |
| 0 | von | von |
| 0 | der | der |
| 0 | Vielfältigkeit | vielfältigkeit |
| 0 | des | de |
| 0 | NLU | nlu |
| 0 | Packets | packet |
| 0 | begeistert | begeistert |
