# Self-programmed stemmed KWIC in Python
Learning goals:
 - Understanding how a class-based decomposition enables plug and play of components
 - Understanding what is needed for offline indexing and online search 
 

## Definition of the class for regex-based stemmers 
The class comes with a primitive default regex that strips at most one common English suffix from a word.

In [None]:
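import re

import nltk


class RegexStemmer(object):
    """Strip one common English suffix from a word using a regex."""

    def __init__(self, r=r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$'):
        # Group 1 is the shortest prefix that leaves at most one listed suffix.
        self._r = r

    def stem(self, word):
        m = re.match(self._r, word)
        # Fall back to the unchanged word if a custom regex does not match.
        return m.group(1) if m else word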

In [None]:
regex_stemmer = RegexStemmer()
regex_stemmer.stem('seeming')
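
To see what the default pattern does to a few related word forms, the following sketch applies the same regex with plain `re`, outside the class. The helper name `regex_stem` is an assumption made for this illustration and is not part of the notebook.

```python
import re

# Same default pattern as in RegexStemmer above
PATTERN = r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$'

def regex_stem(word):
    # Hypothetical helper: return the prefix left after stripping one suffix
    return re.match(PATTERN, word).group(1)

stems = {w: regex_stem(w) for w in ['seeming', 'dying', 'died', 'dies']}
print(stems)
# {'seeming': 'seem', 'dying': 'dy', 'died': 'di', 'dies': 'd'}
```

The three forms of *die* end up with three different stems, so an index built on this stemmer treats them as unrelated keys.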

## Definition of the class for indexed texts, including concordances

In [None]:
class IndexedText(object):

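    def __init__(self, stemmer, text):
        self._text = text
        self._stemmer = stemmer
        # Offline step: map each lowercased stem to the positions where it occurs.
        self._index = nltk.Index((self._stem(word), i)
                                 for (i, word) in enumerate(text))

    def _stem(self, word):
        # Stem first, then lowercase, so that lookups ignore case.
        return self._stemmer.stem(word).lower()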


    def concordance(self, word, width=40):
        key = self._stem(word)  # stemmed keyword
        wc = width // 4         # words of context to consider
        for i in self._index[key]:
            # max() keeps the slice start non-negative for hits near the start of the text
            lcontext = ' '.join(self._text[max(0, i - wc):i])
            rcontext = ' '.join(self._text[i:i + wc])
            ldisplay = f'{lcontext[-width:]:>{width}}'  # keep the last `width` characters, right-aligned
            rdisplay = f'{rcontext[:width]:{width}}'    # keep the first `width` characters, left-aligned
            print(ldisplay, rdisplay)
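
The split between offline indexing and online search can be sketched without NLTK. The example below assumes that `nltk.Index` behaves like a dictionary mapping each key to the list of values it was paired with, and imitates that with `collections.defaultdict`. The token list and the `toy_stem` function are made up for this illustration.

```python
from collections import defaultdict

tokens = ['The', 'knight', 'lies', 'dying', '.', 'He', 'is', 'not', 'dying', 'yet', '.']

def toy_stem(word):
    # Hypothetical stand-in for a real stemmer: lowercase and drop a final "ing"
    word = word.lower()
    return word[:-3] if word.endswith('ing') else word

# Offline: one pass over the text builds a stem -> positions mapping
index = defaultdict(list)
for position, token in enumerate(tokens):
    index[toy_stem(token)].append(position)

# Online: a query is stemmed the same way and answered by a single lookup
positions = index[toy_stem('Dying')]
print(positions)
# [3, 8]
```

The expensive pass over the whole text happens once; each later query only costs one stemming call and one dictionary lookup.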


## Demo of the self-made KWIC program

In [None]:
# Requires the webtext corpus; if it is missing, run nltk.download('webtext') once.
text = nltk.corpus.webtext.words('grail.txt')
regex_index = IndexedText(regex_stemmer, text)

Apply the `concordance` method to the indexed text.

In [None]:
regex_index.concordance('dying',width=30)

In [None]:
regex_index.concordance('dying')
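
How `width` shapes a single output line can be sketched in isolation. The function `kwic_line` and the sample sentence are assumptions made for this illustration; the function mirrors the slicing and f-string formatting of `concordance`, with a guard against negative slice starts.

```python
def kwic_line(tokens, i, width=40):
    # Hypothetical helper: build one concordance line for the token at position i
    wc = width // 4                                     # words of context on each side
    lcontext = ' '.join(tokens[max(0, i - wc):i])       # guard against negative start
    rcontext = ' '.join(tokens[i:i + wc])               # starts with the keyword itself
    # Left part: last `width` characters, right-aligned; right part: first `width`, left-aligned
    return f'{lcontext[-width:]:>{width}} {rcontext[:width]:{width}}'

tokens = 'the knight is not dying yet'.split()
line = kwic_line(tokens, 4, width=12)
print(repr(line))
# 'night is not dying yet   '
```

Because both halves are padded or cut to exactly `width` characters, the keyword always starts in the same column, which is what makes the output readable as a KWIC display.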

## Use a prefabricated stemmer with the self-built text indexing class
Condition: the stemmer object must provide a `stem()` method that takes a word and returns a string.
Remember... [duck typing](https://www.youtube.com/watch?v=x3v9zMX1s4s)

In [None]:
porter_stemmer = nltk.PorterStemmer()
porter_index = IndexedText(porter_stemmer, text)
porter_index.concordance('dying', width=60)

Well-behaved object-oriented decomposition allows for plug'n'play of components.
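
The same idea can be sketched without NLTK. Both classes below are hypothetical and share no base class; the only thing they have in common is a `stem()` method, which is all the calling code relies on.

```python
class IdentityStemmer:
    """Hypothetical stemmer that returns every word unchanged."""

    def stem(self, word):
        return word


class TruncatingStemmer:
    """Hypothetical stemmer that keeps only the first `length` characters."""

    def __init__(self, length=4):
        self._length = length

    def stem(self, word):
        return word[:self._length]


def stem_all(stemmer, words):
    # Accepts any object with a stem() method; its class is never checked
    return [stemmer.stem(w).lower() for w in words]


words = ['Dying', 'dies', 'died']
identity_stems = stem_all(IdentityStemmer(), words)
truncated_stems = stem_all(TruncatingStemmer(2), words)
print(identity_stems)   # ['dying', 'dies', 'died']
print(truncated_stems)  # ['dy', 'di', 'di']
```

Swapping the stemmer changes which word forms are conflated, while `stem_all` stays untouched; `IndexedText` relies on the same contract.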

### The [Lancaster stemmer](https://www.youtube.com/watch?v=x3v9zMX1s4s) in action

In [None]:
lancaster_stemmer = nltk.LancasterStemmer()
lancaster_index = IndexedText(lancaster_stemmer, text)
lancaster_index.concordance('died')

In [None]:
porter_stemmer.stem('accidentally')