## NLP Part B: Spellchecker and Autocorrector Application
____________________________
Created by: Group 4

### 1. Data Preprocessing
Data preprocessing stages are split into the following parts:
+ #### Tokenization of Words
+ #### Case Normalization
+ #### Removing the following:
  - Punctuation
  - Stop Words
  - Numeric Characters
  - Special Characters
  - Accented Characters

+ #### Stemming and Lemmatization?
  
<u><i>More Text Cleaning Considerations:</i></u>

- Handling large documents and large collections of text documents that do not fit into memory.
- Extracting text from markup like HTML, PDF, or other structured document formats.
- Transliteration of characters from other languages into English.
- Decoding text with a consistent encoding such as UTF-8 and normalizing Unicode characters (for example to NFC or NFKD).
- Handling of domain specific words, phrases, and acronyms.
- Handling or removing numbers, such as dates and amounts.
- Locating and correcting common typos and misspellings.
- Resolving contractions in casual text.
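
The first consideration, text that does not fit into memory, can be handled by streaming the file line by line instead of calling `read()` on the whole file. The following is a minimal sketch under that assumption; `count_tokens_streaming` is a hypothetical helper and not part of the notebook.

```python
import re
from collections import Counter


def count_tokens_streaming(lines):
    """Count lower-cased alphabetic tokens from any iterable of lines."""
    counts = Counter()
    for line in lines:
        # only the current line is held in memory
        counts.update(re.findall(r"[a-z]+", line.lower()))
    return counts


# An open file object is itself an iterable of lines, for example:
#     with open('metamorphosis_clean.txt', 'rt') as fh:
#         counts = count_tokens_streaming(fh)
sample_lines = [
    "One morning, when Gregor Samsa woke\n",
    "from troubled dreams, he found\n",
]
sample_counts = count_tokens_streaming(sample_lines)
print(sample_counts.most_common(3))
```

Because the function accepts any iterable of strings, the same code works for a file handle, a list of lines, or a generator that walks a directory of documents.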

-----------------------------------------
### 4. Design Deliverables

b)	Your application must be able to find the spelling errors and suggest a few words to the user to modify the text.

c)	The spelling errors that need to be addressed by your system are:

i.	Non-words (wrong spelling, where the word does not exist)

ii.	Real-words (wrong spelling due to wrong context, but the misspelt word does exist)
    - Grammatical errors, typos, etc.

d)	The techniques used for the detection of the spelling errors must include:

<p style="color:rgb(255,0,0);">- Bigrams</p>
<p style="color:rgb(255,0,0);">- Minimum Edit Distance</p>
<p style="color:rgb(255,0,0);">- Other suitable popular techniques used in NLP</p>
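
Of the techniques listed under (d), minimum edit distance is the one that also has to be reported to the user. The sketch below is a plain dynamic-programming Levenshtein distance with unit costs, plus a hypothetical `suggest` helper that ranks candidate words by that distance. Both names are illustrative and not part of the notebook; `nltk.metrics.distance.edit_distance`, which the notebook imports, computes the same measure with its default arguments.

```python
import heapq


def min_edit_distance(source, target):
    """Levenshtein distance with unit cost for insert, delete and substitute."""
    # previous[j] holds the distance between the processed prefix of
    # `source` and the first j characters of `target`
    previous = [j for j in range(len(target) + 1)]
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (s_char != t_char)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def suggest(word, vocabulary, limit=3):
    """Return up to `limit` (distance, candidate) pairs, closest first."""
    scored = ((min_edit_distance(word, candidate), candidate) for candidate in vocabulary)
    return heapq.nsmallest(limit, scored)


# hypothetical mini vocabulary, for illustration only
vocabulary = ["morning", "mourning", "warning", "room"]
print(suggest("mornng", vocabulary, limit=2))
```

Scanning the whole vocabulary for every misspelled word is the simplest design; for a large dictionary it would probably need a candidate filter first (for example by word length or first letter).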

e)	Provide the following functionality in your application:

- Ability to show a sorted list of all words in the corpus, with the facility to explore the list and search for a specific word.
- Ability to highlight the misspelled words, and right-click to suggest the correct words (with their minimum edit distance from the wrong word).
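
The sorted word list with search, the first item under (e), can be sketched with the standard `bisect` module. `build_word_list` and `search_prefix` are hypothetical names, and prefix search is only one possible reading of "search for a specific word".

```python
from bisect import bisect_left


def build_word_list(tokens):
    """Return the unique tokens in sorted order."""
    return sorted(set(tokens))


def search_prefix(sorted_words, prefix):
    """Return every word in the sorted list that starts with `prefix`."""
    # binary search finds the first position where the prefix could appear
    start = bisect_left(sorted_words, prefix)
    matches = []
    for word in sorted_words[start:]:
        if not word.startswith(prefix):
            break
        matches.append(word)
    return matches


# illustrative tokens, not the real corpus
tokens = ["gregor", "samsa", "woke", "gregor", "dreams", "dream", "room"]
word_list = build_word_list(tokens)
print(word_list)
print(search_prefix(word_list, "dr"))
```

Keeping the list sorted means a search box in the user interface can narrow the results on every keystroke without rescanning the whole vocabulary.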


#### Load Corpus and Importing Packages

In [11]:
# Load Corpus
filename = 'metamorphosis_clean.txt'
# the context manager closes the file even if reading fails
with open(filename, 'rt') as file:
    text = file.read()


# import packages
import heapq
import os
import re
import string
import time
import unicodedata
from collections import Counter
from difflib import SequenceMatcher
from itertools import chain

import nltk
from nltk.corpus import stopwords
from nltk.corpus import wordnet as wn
# aliased so that the token list named `words` below does not shadow the corpus reader
from nltk.corpus import words as nltk_words
from nltk.metrics.distance import edit_distance
from nltk.stem import SnowballStemmer, WordNetLemmatizer
from nltk.tokenize import PunktSentenceTokenizer, RegexpTokenizer
from nltk.util import ngrams
from tqdm import tqdm

lemmatizer = WordNetLemmatizer()

In [12]:
text[:1000]

'The 100 tradÃ©Â®â€\xa0Â¥mark! â„¢ Â® Reading Books The Project Gutenberg EBook of Metamorphosis, by Franz Kafka\nTranslated by David Wyllie.\n\nThis eBook is for the use of anyone anywhere at no cost and with\nalmost no restrictions whatsoever.  You may copy it, give it away or\nre-use it under the terms of the Project Gutenberg License included\nwith this eBook or online at www.gutenberg.net\n\n** This is a COPYRIGHTED Project Gutenberg eBook, Details Below **\n**     Please follow the copyright guidelines in this file.     **\n\nÃŸ\nTitle: Metamorphosis\n\nAuthor: Franz Kafka\n\nTranslator: David Wyllie\n\nRelease Date: August 16, 2005 [EBook #5200]\nFirst posted: May 13, 2002\nLast updated: May 20, 2012\n\nLanguage: English\n\n\n*** START OF THIS PROJECT GUTENBERG EBOOK METAMORPHOSIS ***\n\n\n\n\nCopyright (C) 2002 David Wyllie.\n\n\n\n\n\n  Metamorphosis\n  Franz Kafka\n\nTranslated by David Wyllie\n\n\n\nI\n\n\nOne morning, when Gregor Samsa woke from troubled dreams, he found\nh
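
The debris at the start of the corpus (`Ã©`, `Â®`, `â„¢`) has the shape of mojibake: UTF-8 bytes that were decoded with a single-byte codec. Whether this file was damaged that way, or seeded with noise on purpose for the exercise, is not stated in the notebook, so the round trip below only illustrates the mechanism on a single character.

```python
original = "é"

# UTF-8 encodes 'é' as two bytes; decoding them as Latin-1 yields two characters
garbled = original.encode("utf-8").decode("latin-1")

# reversing the two steps recovers the original character
repaired = garbled.encode("latin-1").decode("utf-8")

print(repr(garbled), repr(repaired))
```

When the damage really is of this kind, passing the correct `encoding` argument to `open()` avoids it at the source, which is usually preferable to cleaning it up afterwards.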

#### Tokenization of Words

In [14]:
# split into words on whitespace (punctuation stays attached to the tokens)
words = text.split()

# alternative: split on runs of non-word characters instead
# words = re.split(r'\W+', text)

print(words[:100])

['The', '100', 'tradÃ©Â®â€', 'Â¥mark!', 'â„¢', 'Â®', 'Reading', 'Books', 'The', 'Project', 'Gutenberg', 'EBook', 'of', 'Metamorphosis,', 'by', 'Franz', 'Kafka', 'Translated', 'by', 'David', 'Wyllie.', 'This', 'eBook', 'is', 'for', 'the', 'use', 'of', 'anyone', 'anywhere', 'at', 'no', 'cost', 'and', 'with', 'almost', 'no', 'restrictions', 'whatsoever.', 'You', 'may', 'copy', 'it,', 'give', 'it', 'away', 'or', 're-use', 'it', 'under', 'the', 'terms', 'of', 'the', 'Project', 'Gutenberg', 'License', 'included', 'with', 'this', 'eBook', 'or', 'online', 'at', 'www.gutenberg.net', '**', 'This', 'is', 'a', 'COPYRIGHTED', 'Project', 'Gutenberg', 'eBook,', 'Details', 'Below', '**', '**', 'Please', 'follow', 'the', 'copyright', 'guidelines', 'in', 'this', 'file.', '**', 'ÃŸ', 'Title:', 'Metamorphosis', 'Author:', 'Franz', 'Kafka', 'Translator:', 'David', 'Wyllie', 'Release', 'Date:', 'August', '16,', '2005']
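
Whitespace splitting leaves punctuation attached to tokens such as `'Metamorphosis,'` and `'Wyllie.'`. As a hedged comparison, the sketch below tokenizes the same sample with `str.split()` and with a regular expression that keeps only alphabetic words and in-word apostrophes. The pattern is an illustrative choice, not the one used by the notebook.

```python
import re

sample = "One morning, when Gregor Samsa woke from troubled dreams, he found"

# whitespace split: fast, but 'morning,' keeps its comma
whitespace_tokens = sample.split()

# regex: letters, optionally followed by an apostrophe and more letters
WORD_PATTERN = r"[A-Za-z]+(?:'[A-Za-z]+)?"
regex_tokens = re.findall(WORD_PATTERN, sample)

print(whitespace_tokens[:3])
print(regex_tokens[:3])
print(re.findall(WORD_PATTERN, "It wasn't a dream."))
```

Keeping the apostrophe inside a token matters for the later contraction step, which can only expand `wasn't` while the apostrophe is still there.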


#### Case Normalization

In [15]:
# text normalization - convert to lower case
nor_words = [word.lower() for word in words]
print(nor_words[:100])

['the', '100', 'tradã©â®â€', 'â¥mark!', 'â„¢', 'â®', 'reading', 'books', 'the', 'project', 'gutenberg', 'ebook', 'of', 'metamorphosis,', 'by', 'franz', 'kafka', 'translated', 'by', 'david', 'wyllie.', 'this', 'ebook', 'is', 'for', 'the', 'use', 'of', 'anyone', 'anywhere', 'at', 'no', 'cost', 'and', 'with', 'almost', 'no', 'restrictions', 'whatsoever.', 'you', 'may', 'copy', 'it,', 'give', 'it', 'away', 'or', 're-use', 'it', 'under', 'the', 'terms', 'of', 'the', 'project', 'gutenberg', 'license', 'included', 'with', 'this', 'ebook', 'or', 'online', 'at', 'www.gutenberg.net', '**', 'this', 'is', 'a', 'copyrighted', 'project', 'gutenberg', 'ebook,', 'details', 'below', '**', '**', 'please', 'follow', 'the', 'copyright', 'guidelines', 'in', 'this', 'file.', '**', 'ãÿ', 'title:', 'metamorphosis', 'author:', 'franz', 'kafka', 'translator:', 'david', 'wyllie', 'release', 'date:', 'august', '16,', '2005']


#### Removing Punctuation 

In [5]:
# remove punctuation from each word
table = str.maketrans('', '', string.punctuation)
stripped_words = [w.translate(table) for w in nor_words]
print(stripped_words[:100])
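
A translation table built with `str.maketrans('', '', string.punctuation)` deletes every ASCII punctuation character, including hyphens, dots and apostrophes inside a token. The short sketch below applies the same table to a few sample tokens to make those side effects visible.

```python
import string

# third argument: characters to delete
table = str.maketrans('', '', string.punctuation)

samples = ['re-use', 'www.gutenberg.net', '**', "what's"]
stripped = [s.translate(table) for s in samples]
print(stripped)
```

Tokens that consist only of punctuation become empty strings, and contractions lose their apostrophe, so both need to be handled either before or after this step.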

#### Removing Stop Words

In [16]:
# Remove stop words
stop_words = set(stopwords.words('english'))
no_st_words = [w for w in stripped_words if w not in stop_words]
print(no_st_words[:100])

['100', 'tradã©â®â€', 'â¥mark', 'â„¢', 'â®', 'reading', 'books', 'project', 'gutenberg', 'ebook', 'metamorphosis', 'franz', 'kafka', 'translated', 'david', 'wyllie', 'ebook', 'use', 'anyone', 'anywhere', 'cost', 'almost', 'restrictions', 'whatsoever', 'may', 'copy', 'give', 'away', 'reuse', 'terms', 'project', 'gutenberg', 'license', 'included', 'ebook', 'online', 'wwwgutenbergnet', '', 'copyrighted', 'project', 'gutenberg', 'ebook', 'details', '', '', 'please', 'follow', 'copyright', 'guidelines', 'file', '', 'ãÿ', 'title', 'metamorphosis', 'author', 'franz', 'kafka', 'translator', 'david', 'wyllie', 'release', 'date', 'august', '16', '2005', 'ebook', '5200', 'first', 'posted', 'may', '13', '2002', 'last', 'updated', 'may', '20', '2012', 'language', 'english', '', 'start', 'project', 'gutenberg', 'ebook', 'metamorphosis', '', 'copyright', 'c', '2002', 'david', 'wyllie', 'metamorphosis', 'franz', 'kafka', 'translated', 'david', 'wyllie', 'one', 'morning', 'gregor']
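
The output above still contains empty strings, left behind by tokens that consisted only of punctuation. A minimal sketch of filtering them in the same pass follows; the small `stop_words` set is a stand-in for `set(stopwords.words('english'))` so that the example runs without downloading NLTK data.

```python
# stand-in for set(stopwords.words('english')); the real list is much longer
stop_words = {'the', 'of', 'by', 'is', 'for', 'this'}

tokens = ['the', 'project', '', 'gutenberg', 'ebook', 'of', '', 'metamorphosis']

# `t` is falsy for the empty string, so empty tokens are dropped as well
kept = [t for t in tokens if t and t not in stop_words]
print(kept)
```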


#### Removing Numeric Characters

In [17]:
# Remove purely numeric tokens, then join the remaining tokens into one string
no_numbers = ' '.join(w for w in no_st_words if not w.isdigit())
print(no_numbers[:100])

tradã©â®â€ â¥mark â„¢ â® reading books project gutenberg ebook metamorphosis franz kafka translated 


#### Remove Special Characters

In [18]:
# function to remove special characters
def remove_s_c(text):
    # match every character that is NOT in the keep-set;
    # \w also keeps non-ASCII letters such as 'ã', which are handled in the next step
    rem = r'[^a-zA-Z0-9.,!?/:;\"\'\s\w+)]'
    return re.sub(rem, '', text)

# calling the function
no_sc_words = remove_s_c(no_numbers)

print(no_sc_words[:100])

# removing special characters can leave double spaces behind

tradãââ âmark â â reading books project gutenberg ebook metamorphosis franz kafka translated david w
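
Letters such as `ã` and `â` survive this step because `\w` is Unicode-aware for `str` patterns in Python 3, so it matches accented letters as well as ASCII ones. The sketch below contrasts that behaviour with an explicit ASCII character class and with the `re.ASCII` flag; the sample string is copied from the output above.

```python
import re

sample = "tradã©â® â¥mark reading books"

# \w matches any Unicode letter or digit, so only the symbols are removed
unicode_aware = re.sub(r"[^\w\s]", "", sample)

# an explicit ASCII class removes the accented letters too
ascii_only = re.sub(r"[^A-Za-z0-9\s]", "", sample)

# re.ASCII restricts \w and \s to their ASCII definitions
ascii_flag = re.sub(r"[^\w\s]", "", sample, flags=re.ASCII)

print(unicode_aware)
print(ascii_only)
```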


#### Remove Accented Characters

In [19]:
# function to remove accented characters
def remove_a_c(text):
    # NFKD splits an accented letter into a base letter plus a combining mark;
    # encoding to ASCII with 'ignore' then drops the marks
    new_text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    return new_text

# call function
no_ac_words = remove_a_c(no_sc_words)

print(no_ac_words[:100])

tradaaa amark a a reading books project gutenberg ebook metamorphosis franz kafka translated david w
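
The NFKD-then-ASCII approach keeps the base letter of anything that decomposes and silently drops everything else. The sketch below shows both effects, and a hypothetical `strip_accents` variant that removes only combining marks so that characters without a decomposition are kept. Neither name nor the variant is part of the notebook.

```python
import unicodedata


def to_ascii(text):
    """Same idea as the notebook: decompose, then drop all non-ASCII."""
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')


def strip_accents(text):
    """Remove combining marks only; other non-ASCII characters are kept."""
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))


print(to_ascii('café'), strip_accents('café'))
print(repr(to_ascii('ß')), repr(strip_accents('ß')))
```

For an English-only spellchecker the stricter ASCII version is probably acceptable; the variant matters only if words from other languages should survive preprocessing.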


#### Remove Whitespaces

In [27]:
no_ws_words = (" ".join(no_ac_words.split()))

print(no_ws_words[:1000])


tradaaa amark a a reading books project gutenberg ebook metamorphosis franz kafka translated david wyllie ebook use anyone anywhere cost almost restrictions whatsoever may copy give away reuse terms project gutenberg license included ebook online wwwgutenbergnet copyrighted project gutenberg ebook details please follow copyright guidelines file ay title metamorphosis author franz kafka translator david wyllie release date august ebook first posted may last updated may language english start project gutenberg ebook metamorphosis copyright c david wyllie metamorphosis franz kafka translated david wyllie one morning gregor samsa woke troubled dreams found transformed bed horrible vermin lay armourlike back lifted head little could see brown belly slightly domed divided arches stiff sections bedding hardly able cover seemed ready slide moment many legs pitifully thin compared size rest waved helplessly looked whats happened thought wasnt dream room proper human room although little small l

In [41]:
type(no_ws_words)

str

In [42]:
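# Split the cleaned string into a list of tokens.
# The parameter is named `text` so that it does not shadow the `string` module.
def convert(text):
    return text.split()

a = convert(no_ws_words)
print(a[:10])

# The original version called list(string.split(" ")) and failed with
# "TypeError: 'list' object is not callable": the name `list` had been rebound
# to a list object earlier in the session, so type(list) returned list instead of type.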
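
The `TypeError` in this part of the notebook is what happens when a built-in name is rebound to an ordinary object. The sketch below reproduces the error on purpose and then restores the built-in; the token values are illustrative.

```python
import builtins

# rebinding the name hides the built-in type behind a plain list object
list = ['tradaaa', 'amark']

try:
    list('abc')
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# point the name back at the built-in type
# (in a notebook, `del list` in any cell has the same effect)
list = builtins.list
restored = list('abc')

print(error_message)
print(restored)
```

Using a different variable name for the token list, such as `tokens`, avoids the problem altogether.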

#### Milestones

In [4]:
# To-Do List

#1. Expand Contractions
#2. Create Dictionary
#3. Add Unigrams / Bigrams
#4. Check repetition / unnecessary imported libraries
#5. Create Copy of File After Every Session
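
Items 2 and 3 of the to-do list (a dictionary plus unigram and bigram counts) can be sketched with `collections.Counter`. The token list and the `bigram_probability` helper below are illustrative assumptions; `nltk.util.ngrams`, which the notebook imports, could replace the `zip` call.

```python
from collections import Counter

# illustrative tokens, not the real corpus
tokens = "one morning gregor samsa woke one morning gregor found".split()

# unigram counts double as the dictionary of known words
unigrams = Counter(tokens)

# pair every token with its successor
bigrams = Counter(zip(tokens, tokens[1:]))


def bigram_probability(prev_word, word):
    """Maximum-likelihood estimate of P(word | prev_word); 0.0 for an unseen context."""
    if unigrams[prev_word] == 0:
        return 0.0
    return bigrams[(prev_word, word)] / unigrams[prev_word]


print(bigram_probability("one", "morning"))
print(bigram_probability("gregor", "samsa"))
print(bigram_probability("gregor", "woke"))
```

A word that exists in `unigrams` but forms a zero-probability bigram with its neighbour is a candidate real-word error; a practical system would add smoothing so that unseen but valid pairs are not all flagged.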

In [32]:
def clean_text(text):
    # Contractions can only be expanded while apostrophes are still present,
    # so this function is meant for text that has not had its punctuation stripped.
    text = re.sub(r"^\d+\s|\s\d+\s|\s\d\w\d|\s\d+$", " ", text)
    text = re.sub(r"i'm", "i am", text)
    text = re.sub(r"he's", "he is", text)
    text = re.sub(r"she's", "she is", text)
    text = re.sub(r"it's", "it is", text)
    text = re.sub(r"that's", "that is", text)
    text = re.sub(r"what's", "what is", text)
    text = re.sub(r"\'ll", " will", text)
    text = re.sub(r"\'re", " are", text)
    text = re.sub(r"won't", "will not", text)
    text = re.sub(r"can't", "cannot", text)
    text = re.sub(r"we'd", "we would", text)
    text = re.sub(r'[{}@_*>()\\#%+=\[\]]', '', text)

    return text
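
A chain of `re.sub` calls works, but each call rescans the text and the order of the rules can matter. As a hedged alternative, the sketch below expands contractions in a single pass from a lookup table. `CONTRACTIONS` and `expand_contractions` are hypothetical names, the table covers only the word-level cases listed above, and the input is assumed to be lower-cased already.

```python
import re

# lookup table; extend as needed
CONTRACTIONS = {
    "i'm": "i am",
    "he's": "he is",
    "she's": "she is",
    "it's": "it is",
    "that's": "that is",
    "what's": "what is",
    "won't": "will not",
    "can't": "cannot",
    "we'd": "we would",
}

# one alternation of all keys, anchored on word boundaries
CONTRACTION_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b"
)


def expand_contractions(text):
    """Replace every known contraction with its expanded form."""
    return CONTRACTION_PATTERN.sub(lambda match: CONTRACTIONS[match.group(0)], text)


print(expand_contractions("he's here and she's sure it's fine"))
print(expand_contractions("i can't and i won't"))
```

The word boundaries keep `he's` from matching inside `she's`, which the sequential version only handles by coincidence.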


In [33]:
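# Clean the text, one token at a time
clean_data = []

# iterate over the tokens directly instead of a variable named `list`,
# which would shadow the built-in type
for data in no_ws_words.split():
    clean_data.append(clean_text(data))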

In [36]:
clean_data[:50]

['tradaaa',
 'amark',
 'a',
 'a',
 'reading',
 'books',
 'project',
 'gutenberg',
 'ebook',
 'metamorphosis',
 'franz',
 'kafka',
 'translated',
 'david',
 'wyllie',
 'ebook',
 'use',
 'anyone',
 'anywhere',
 'cost',
 'almost',
 'restrictions',
 'whatsoever',
 'may',
 'copy',
 'give',
 'away',
 'reuse',
 'terms',
 'project',
 'gutenberg',
 'license',
 'included',
 'ebook',
 'online',
 'wwwgutenbergnet',
 'copyrighted',
 'project',
 'gutenberg',
 'ebook',
 'details',
 'please',
 'follow',
 'copyright',
 'guidelines',
 'file',
 'ay',
 'title',
 'metamorphosis',
 'author']