# Exercise 9 - Natural Language Processing

## a) Tokenization and stemming using NLTK

### AIM :
To write a Python program to tokenize and stem words or sentences using NLTK.
 

### ABOUT THE MODULE:

***Module*** - nltk
         
***Installation*** - pip install nltk
       
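Stemming is the process of reducing the morphological variants of a word (for example, plurals and verb forms) to a common root/base form, called the stem. The stem does not have to be a dictionary word.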
               
#### Class PorterStemmer

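PorterStemmer is a concrete stemmer class that implements the Porter stemming algorithm; its `stem()` method returns the stem of a single word. The Porter stemmer is one of the gentlest commonly used stemmers, so it tends to strip fewer characters from each word.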

##### Usage:
```
from nltk import PorterStemmer

porter = PorterStemmer()
porter.stem("string")
```

#### Class LancasterStemmer

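LancasterStemmer is a concrete stemmer class that implements the Lancaster stemming algorithm; its `stem()` method returns the stem of a single word. The Lancaster stemmer is an aggressive stemmer and usually strips more of each word than the Porter stemmer does.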

##### Usage:
```
from nltk import LancasterStemmer

lancaster = LancasterStemmer()
lancaster.stem("string")
```
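To make the difference between the two stemmers concrete, the sketch below runs both on two of the words from the comparison table in this exercise. The helper name `compare_stems` is hypothetical and only serves this illustration; neither stemmer needs any downloaded NLTK data.

```python
from nltk import LancasterStemmer, PorterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()


def compare_stems(words):
    """Return (word, Porter stem, Lancaster stem) triples.

    Hypothetical helper for illustration; it is not part of NLTK.
    """
    return [(word, porter.stem(word), lancaster.stem(word)) for word in words]


for word, porter_stem, lancaster_stem in compare_stems(["cats", "friendships"]):
    print(f"{word}: porter={porter_stem}, lancaster={lancaster_stem}")
```

Both stemmers agree on `cats`, while for `friendships` the Lancaster stemmer also strips the `-ship` suffix that the Porter stemmer keeps.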

#### Functions: 

**word_tokenize(string)** - Returns a list of the word and punctuation tokens in the given string.

**wordpunct_tokenize(string)** - Like word_tokenize, but splits on every run of punctuation characters such as " ' , : so a contraction like "don't" is split into three tokens.

**sent_tokenize(string)** - Returns a list of the sentences in the given string.
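The punctuation-splitting behaviour can be sketched with the standard library alone. NLTK documents `wordpunct_tokenize` as using the regular expression `\w+|[^\w\s]+`; the function name `simple_wordpunct_tokenize` below is a hypothetical stand-in for illustration, not an NLTK API.

```python
import re

# Either a run of word characters, or a run of characters that are
# neither word characters nor whitespace (i.e. punctuation).
WORDPUNCT_PATTERN = re.compile(r"\w+|[^\w\s]+")


def simple_wordpunct_tokenize(text):
    """Split text into alternating word and punctuation tokens."""
    return WORDPUNCT_PATTERN.findall(text)


print(simple_wordpunct_tokenize("Don't stop."))
# → ['Don', "'", 't', 'stop', '.']
```

Unlike this pattern-based approach, `word_tokenize` and `sent_tokenize` rely on NLTK's Punkt tokenizer data, which has to be downloaded with `nltk.download` before first use.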

### SOURCE CODE :

```
from nltk import (
    PorterStemmer, LancasterStemmer,
    word_tokenize, wordpunct_tokenize, sent_tokenize
)
import yaml

# word_tokenize and sent_tokenize need NLTK's Punkt tokenizer data,
# which has to be fetched once with nltk.download().

porter = PorterStemmer()
lancaster = LancasterStemmer()

# Separator printed between output sections.
line = "\n" + "=" * 80 + "\n"

def stem_passage(passage, punct_tokenize=False):
    """Stem every sentence of a passage and join the results with spaces."""
    return " ".join(
        stem_sentence(sentence, punct_tokenize)
        for sentence in sent_tokenize(passage)
    )

def stem_sentence(sentence, punct_tokenize=False):
    """Tokenize a sentence and replace each token with its Porter stem."""
    tokenizer_func = wordpunct_tokenize if punct_tokenize else word_tokenize
    return " ".join(
        porter.stem(word)
        for word in tokenizer_func(sentence)
    )

if __name__ == "__main__":
    # stemming.yaml is expected to hold two top-level entries, in this
    # order: a list of words and a passage of text.
    with open("stemming.yaml") as f:
        stemming_words, stemming_passage = yaml.full_load(f).values()
    print(
        "STEMMING WORDS:\n",
        "STEMMER COMPARISON:\n", sep="\n"
    )
    print("-" * 64)
    print(f"|{'WORD':^20}|{'PORTER STEMMER':^20}|{'LANCASTER STEMMER':^20}|")
    print("-" * 64)
    for word in stemming_words:
        print(f"|{word:^20}|{porter.stem(word):^20}|{lancaster.stem(word):^20}|")
    print("-" * 64)
    print(line)

    print("STEMMING PASSAGES:\n")

    print(
        "ORIGINAL PASSAGE :\n",
        stemming_passage,
        "\nSTEMMED PASSAGE :\n",
        stem_passage(stemming_passage),
        "\nSTEMMED PASSAGE (using wordpunct_tokenizer) :\n",
        stem_passage(stemming_passage, True), sep="\n"
    )
    print(line)
```

```
STEMMING WORDS:

STEMMER COMPARISON:

----------------------------------------------------------------
|        WORD        |   PORTER STEMMER   | LANCASTER STEMMER  |
----------------------------------------------------------------
|        cats        |        cat         |        cat         |
|      trouble       |       troubl       |       troubl       |
|     troubling      |       troubl       |       troubl       |
|      troubled      |       troubl       |       troubl       |
|       friend       |       friend       |       friend       |
|      friends       |       friend       |       friend       |
|     friendship     |     friendship     |       friend       |
|    friendships     |     friendship     |       friend       |
|       stabil       |       stabil       |       stabl        |
|    destabilize     |      destabil      |        dest        |
|  misunderstanding  |   misunderstand    |   misunderstand    |
|      railroad      |      railroad      |      ra
```

---

## b) Spell Checking using Pyspellchecker

### AIM :
To write a Python program to check the spelling of words using pyspellchecker.

### ABOUT THE MODULE :

***Module*** - pyspellchecker

***Installation*** - pip install pyspellchecker

#### Class SpellChecker
A SpellChecker object provides methods for finding misspelled words, listing likely replacements for a word, and choosing the most probable correction.

##### Usage:
```
from spellchecker import SpellChecker

spell = SpellChecker()
spell.correction(word)
```

#### Functions 

**SpellChecker().correction(word)** - The most probable spelling correction for the given word.

**SpellChecker().candidates(word)** - The set of known words that are close to the given word and could be its correct spelling.

**SpellChecker().unknown(words)** - The subset of the given words that are not in the dictionary, i.e. the likely misspellings.

**SpellChecker().word_probability(word)** - The relative frequency of the word in the dictionary, which is used to rank candidates; it is 0.0 for a word that is not in the dictionary.
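How candidate words can be found is easiest to see in a small standard-library sketch in the style of Peter Norvig's spelling corrector, the approach pyspellchecker is based on. The names `edits1`, `candidates` and `vocabulary` are hypothetical, and this version only looks one edit away, whereas pyspellchecker searches up to an edit distance of 2 by default.

```python
import string


def edits1(word, alphabet=string.ascii_lowercase):
    """Return every string that is exactly one edit away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [
        left + right[1] + right[0] + right[2:]
        for left, right in splits
        if len(right) > 1
    ]
    replaces = [
        left + char + right[1:]
        for left, right in splits
        if right
        for char in alphabet
    ]
    inserts = [left + char + right for left, right in splits for char in alphabet]
    return set(deletes + transposes + replaces + inserts)


def candidates(word, vocabulary):
    """Known words at distance 0, else at distance 1, else the word itself."""
    if word in vocabulary:
        return {word}
    return (edits1(word) & vocabulary) or {word}


vocabulary = {"calendar", "business", "receive", "relieve"}
print(candidates("calandar", vocabulary))
print(candidates("recieve", vocabulary))
```

A real checker then ranks the candidates by word frequency, which is the role `word_probability` plays in pyspellchecker.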
    

### SOURCE CODE :

```
from spellchecker import SpellChecker
from nltk import word_tokenize
import yaml

spell = SpellChecker()

# Separator printed between output sections.
line = "\n" + "=" * 80 + "\n"

if __name__ == "__main__":
    # spell_checking.yaml is expected to hold two top-level entries, in this
    # order: a sentence to correct and a list of words to test.
    with open("spell_checking.yaml") as f:
        sentence, spell_check_words = yaml.full_load(f).values()
    words = word_tokenize(sentence)
    misspelled_words = spell.unknown(words)
    # Depending on the pyspellchecker version, correction() may return None
    # when it finds no candidate, so fall back to the original word.
    correct_sentence = " ".join(spell.correction(word) or word for word in words)
    print(
        "SPELL CORRECTION:",
        "\nORIGINAL SENTENCE:\n",
        sentence,
        "\nWORDS:\n",
        ", ".join(words),
        "\nMISSPELLED WORDS:\n",
        ", ".join(misspelled_words),
        "\nCORRECTED SENTENCE:\n",
        correct_sentence, sep="\n"
    )
    print(line)
    print("SPELL CHECKER TESTS:")

    print("-" * 80)
    print(f"|{'WORD':^13}|{'PROBABILITY':^15}|{'CANDIDATES':^32}|{'CORRECTION':^15}|")
    print("-" * 80)
    for word in spell_check_words:
        # candidates() may likewise return None for a word with no matches.
        print(
            f"|{word:^13}"
            f"|{spell.word_probability(word):^15.3}"
            f"|{', '.join(spell.candidates(word) or []):^32}"
            f"|{(spell.correction(word) or word):^15}|"
        )
    print("-" * 80)
    print(line)
```

```
SPELL CORRECTION:

ORIGINAL SENTENCE:

someting is happenning hete. Do yuo knw wht?

WORDS:

someting, is, happenning, hete, ., Do, yuo, knw, wht, ?

MISSPELLED WORDS:

knw, yuo, happenning, hete, someting, wht

CORRECTED SENTENCE:

something is happening here . Do you know what ?


SPELL CHECKER TESTS:
--------------------------------------------------------------------------------
|    WORD     |  PROBABILITY  |           CANDIDATES           |  CORRECTION   |
--------------------------------------------------------------------------------
|  calandar   |      0.0      |            calendar            |   calendar    |
| lightening  |   8.28e-07    |           lightening           |  lightening   |
|   misspel   |      0.0      |            misspelt            |   misspelt    |
|  necessary  |   0.000187    |           necessary            |   necessary   |
|  bussiness  |      0.0      | fussiness, bossiness, business |   business    |
|   recieve   |      0.0      |        relieve,
```

---