# Self-programmed stemmed KWIC in Python

## Definition of the class for regex-based stemmers 
A primitive default regex for stemming is provided: it strips at most one of a fixed set of common English suffixes.

In [1]:
import nltk, re

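class RegexStemmer(object):
    def __init__(self, r=r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$'):
        self._r = r    # group 1 of the pattern must capture the stem

    def stem(self, word):
        m = re.match(self._r, word)
        # a user-supplied pattern may not match; fall back to the unchanged word
        return m.group(1) if m else word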

In [2]:
regex_stemmer = RegexStemmer()
regex_stemmer.stem('seeming')

'seem'
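
To see what the default pattern does with related word forms, the following sketch applies the same regex directly with `re.match`. The helper name `regex_stem` and the word list are illustrative assumptions, not part of the notebook.

```python
import re

# Same default pattern as in RegexStemmer above
PATTERN = r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$'

def regex_stem(word, pattern=PATTERN):
    """Hypothetical helper: return group 1 of the match, or the word itself."""
    m = re.match(pattern, word)
    return m.group(1) if m else word

# Stem a few sample words and collect the results in a dict
stems = {w: regex_stem(w) for w in ['seeming', 'die', 'died', 'dying', 'processes']}
print(stems)
```

The non-greedy `(.*?)` takes the shortest prefix that leaves a listed suffix (or nothing) at the end of the word, so related forms such as 'die', 'died' and 'dying' do not necessarily end up with the same stem.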

## Definition of the class for indexed texts incl. concordances

In [3]:
class IndexedText(object):

    def __init__(self, stemmer, text):
        self._text = text
        self._stemmer = stemmer
        self._index = nltk.Index((self._stem(word), i)
                                 for (i, word) in enumerate(text))

    def _stem(self, word):
        return self._stemmer.stem(word).lower()


    def concordance(self, word, width=40):
        key = self._stem(word)    # stemmed keyword
        wc = width // 4           # words of context to consider
        for i in self._index[key]:
            # clamp the start at 0: a negative start would wrap around to the end of the text
            lcontext = ' '.join(self._text[max(0, i-wc):i])
            rcontext = ' '.join(self._text[i:i+wc])
            ldisplay = f'{lcontext[-width:]:>{width}}' # trim to desired character width
            rdisplay = f'{rcontext[:width]:{width}}'   # trim to desired character width
            print(ldisplay, rdisplay)
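
`nltk.Index` behaves like a dictionary that maps each key to the list of all values paired with it. A minimal sketch of the same idea using only the standard library follows; the function names `build_index` and `context`, the toy token list, and the lowercase-only "stemmer" are assumptions made for illustration.

```python
from collections import defaultdict

def build_index(tokens, stem):
    """Map each stem to the list of token positions where it occurs."""
    index = defaultdict(list)
    for i, word in enumerate(tokens):
        index[stem(word)].append(i)
    return index

def context(tokens, i, wc):
    """Return (left, right) word context around position i."""
    left = tokens[max(0, i - wc):i]   # clamp so the slice never starts below 0
    right = tokens[i:i + wc]          # right context starts with the keyword itself
    return left, right

tokens = ['He', 'was', 'not', 'afraid', 'to', 'die', ',', 'O', 'brave', 'Sir', 'Robin']
index = build_index(tokens, str.lower)   # toy "stemmer": lowercase only
positions = index['die']
left, right = context(tokens, positions[0], wc=3)
print(positions, left, right)
```

The `max(0, ...)` clamp keeps a negative slice start from wrapping around to the end of the list when a match occurs near the beginning of the text.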


## Demo of the self-made KWIC program

In [4]:
text = nltk.corpus.webtext.words('grail.txt')
regex_index = IndexedText(regex_stemmer, text)

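Apply the `concordance()` method to the indexed text. The primitive regex maps 'die', 'died' and 'dying' to different stems ('die', 'di', 'dy'), so a query for one form does not retrieve the others.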

In [11]:
regex_index.concordance('die',width=50)

 cross this bridge . BLACK KNIGHT : Then you shall die . ARTHUR : I command you , as King of the     
n , rode forth from Camelot . He was not afraid to die , O brave Sir Robin . He was not at all       


In [9]:
regex_index.concordance('dying')

     it says . ARTHUR : Look , if he was dying , he wouldn ' t bother to carve ' 


## Use a prefabricated stemmer with the self-built text indexing class
Condition: The stemmer must provide a `stem()` method that takes a word and returns its stem.
Remember... [duck typing](https://www.youtube.com/watch?v=x3v9zMX1s4s)
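
As a sketch of what duck typing means here: any object with a suitable `stem()` method can be used, regardless of its class. The classes `IdentityStemmer` and `SuffixStemmer` and the function `stems_of` below are hypothetical illustrations, not part of NLTK or of the notebook.

```python
class IdentityStemmer:
    """Hypothetical stemmer that returns every word unchanged."""
    def stem(self, word):
        return word

class SuffixStemmer:
    """Hypothetical stemmer that strips one fixed suffix if present."""
    def __init__(self, suffix):
        self._suffix = suffix

    def stem(self, word):
        if self._suffix and word.endswith(self._suffix):
            return word[:-len(self._suffix)]
        return word

def stems_of(stemmer, words):
    """Works with any object that provides stem(word); no common base class needed."""
    return [stemmer.stem(w).lower() for w in words]

words = ['Carving', 'died', 'King']
identity_result = stems_of(IdentityStemmer(), words)
suffix_result = stems_of(SuffixStemmer('ing'), words)
print(identity_result)
print(suffix_result)
```

Neither class inherits from a shared stemmer base class; `stems_of` only relies on the presence of `stem()`, which is the same contract `IndexedText` relies on.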

In [13]:
porter_stemmer = nltk.PorterStemmer()
porter_index = IndexedText(porter_stemmer, text)
porter_index.concordance('dying', width=20)

GHT : Then you shall die . ARTHUR : I    
He was not afraid to die , O brave Sir   
, you shall not have died in vain ! CONCO
         Oh , he ' s died ! FATHER : And 
YNARD : He must have died while carving i
    Look , if he was dying , he wouldn ' 


A clean object-oriented decomposition makes components plug'n'play: `IndexedText` works unchanged with any object that offers a `stem()` method.

### The [Lancaster stemmer](https://www.youtube.com/watch?v=x3v9zMX1s4s) in action

In [14]:
lancaster_stemmer = nltk.LancasterStemmer()
lancaster_index = IndexedText(lancaster_stemmer, text)
lancaster_index.concordance('died')

ve , brave Concorde , you shall not have died in vain ! CONCORDE : Uh , I '      
               ! GUEST # 2 : Oh , he ' s died ! FATHER : And I want his only daug
 : What is that ? MAYNARD : He must have died while carving it . LAUNCELOT : Oh ,
