# CS5615 - Information Retrieval - Assignment 1 - Basic Text Preprocessing Techniques

#### Author - KATS JAYATHILAKA (209338R)

### Reading the text file

In [220]:
with open("assignment_data.txt") as datafile:
    raw_data = datafile.read()

### Converting all letters to lowercase

In [221]:
lc_data = raw_data.lower()

### Removing all numbers

In [222]:
import re
nonum_data = re.sub(r'\d+', '', lc_data)

### Removing all punctuation

In [223]:
import string
nopunc_data = nonum_data.translate(str.maketrans("","", string.punctuation))

### Removing leading and trailing whitespace

In [224]:
stripped_data = nopunc_data.strip()
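
The four cleaning steps above can also be chained in a single helper. The following is a minimal sketch, assuming a hypothetical function name `basic_clean` and a made-up sample string (neither is part of the assignment code). It also illustrates that `string.punctuation` contains only ASCII punctuation, so typographic characters such as ’ are left in the text.

```python
import re
import string

def basic_clean(text):
    """Lowercase, drop digits, drop ASCII punctuation, strip outer whitespace."""
    text = text.lower()
    text = re.sub(r"\d+", "", text)
    # string.punctuation is ASCII-only, so curly quotes are not removed here
    text = text.translate(str.maketrans("", "", string.punctuation))
    return text.strip()

# Hypothetical sample input, not taken from assignment_data.txt
sample = "  In 2020, SpaceX launched 2 rockets; Chile’s capital wasn't involved!  "
cleaned = basic_clean(sample)
print(cleaned)
print("’" in cleaned)  # the typographic apostrophe survives the cleaning
```

Removing digits and punctuation can leave runs of consecutive spaces inside the text; the tokenizer in the next section splits on whitespace runs, so these should not produce empty tokens.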

### Tokenization
Before running the tokenization code in the next cell, open a Python shell in your conda environment and run the following commands. This needs to be done only once.

`import nltk
nltk.download()`

An installation window then appears. Go to the 'Models' tab, select 'punkt' in the 'Identifier' column, and click Download to install the tokenizer data. The tokenization cell should then run without a missing-resource error.

In [225]:
import nltk
tokens = nltk.word_tokenize(stripped_data)
print(f"No. of tokens: {len(tokens)}")

No. of tokens: 1215


#### Tokenization accuracy check

In [226]:
test_str = 'Mr. O’Neill thinks that the boys’ stories about Chile’s capital aren’t amusing'
nltk.word_tokenize(test_str)

['Mr.',
 'O',
 '’',
 'Neill',
 'thinks',
 'that',
 'the',
 'boys',
 '’',
 'stories',
 'about',
 'Chile',
 '’',
 's',
 'capital',
 'aren',
 '’',
 't',
 'amusing']
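
In the output above, every typographic apostrophe (’) is split off as its own token, so "aren’t" becomes three tokens. One possible workaround, sketched below, is to map typographic apostrophes to the ASCII apostrophe before tokenizing. The helper name `normalize_apostrophes` is hypothetical, and whether this normalization is desirable depends on the task.

```python
# Map right and left single quotation marks to the ASCII apostrophe.
APOSTROPHE_MAP = str.maketrans({"\u2019": "'", "\u2018": "'"})

def normalize_apostrophes(text):
    """Replace typographic single quotes with the ASCII apostrophe."""
    return text.translate(APOSTROPHE_MAP)

test_str = 'Mr. O’Neill thinks that the boys’ stories about Chile’s capital aren’t amusing'
normalized = normalize_apostrophes(test_str)
print(normalized)
# The normalized string could then be passed to nltk.word_tokenize(normalized).
```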

In [228]:
nltk.word_tokenize("multi-task")

['multi-task']

In [198]:
# opening in "w" mode clears any previous output; the with-block closes the file
with open("tokenizer_output.txt", "w") as tof:
    for token in tokens:
        tof.write(token + "\n")

### Spell Correction
This step takes some time.

In [174]:
from spellchecker import SpellChecker
spc = SpellChecker()
spell_corrected_tokens = [spc.correction(token) for token in tokens]

#### Spell Correction accuracy check

In [237]:
# ISOLATED SPELL CHECK
incorrect_words = [
    'absense', 'absentse', 'abcense',
    'allegaince', 'allegience', 'alegiance',
    'apparant', 'aparent', 'apparrent',
    'concious', 'consious',
    'commited', 'comitted',
    'definitly', 'definately', 'defiantly',
    'hygene', 'hygine', 'hiygeine', 'higeine', 'hygeine',
    'mischievious', 'mischevous', 'mischevious',
    'succesful', 'successfull', 'sucessful',
    'tommorow', 'tommorrow',
    'vaccuum', 'vaccum', 'vacume',
    'writting', 'writeing',
    'wellfare', 'welfair'
]
print([spc.correction(iw) for iw in incorrect_words])

['absence', 'absentee', 'absence', 'allegiance', 'allegiance', 'allegiance', 'apparent', 'apparent', 'apparent', 'conscious', 'conscious', 'committed', 'committed', 'definitely', 'definately', 'defiantly', 'hygiene', 'hygiene', 'hygiene', 'hygiene', 'hygiene', 'mischievous', 'mischievous', 'mischievous', 'succesful', 'successful', 'successful', 'tommorrow', 'tommorrow', 'vacuum', 'vacuum', 'value', 'writing', 'writing', 'welfare', 'welfare']
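
The recorded corrections can be scored against the words the misspellings were presumably meant to be. The sketch below copies the inputs and outputs from the cell above; the `intended` list is an assumption inferred from how the misspellings are grouped, and it treats 'defiantly' as a misspelling of 'definitely' even though it is a valid word in its own right.

```python
incorrect_words = [
    'absense', 'absentse', 'abcense', 'allegaince', 'allegience', 'alegiance',
    'apparant', 'aparent', 'apparrent', 'concious', 'consious',
    'commited', 'comitted', 'definitly', 'definately', 'defiantly',
    'hygene', 'hygine', 'hiygeine', 'higeine', 'hygeine',
    'mischievious', 'mischevous', 'mischevious',
    'succesful', 'successfull', 'sucessful', 'tommorow', 'tommorrow',
    'vaccuum', 'vaccum', 'vacume', 'writting', 'writeing', 'wellfare', 'welfair',
]
# Output recorded in the cell above
recorded = [
    'absence', 'absentee', 'absence', 'allegiance', 'allegiance', 'allegiance',
    'apparent', 'apparent', 'apparent', 'conscious', 'conscious',
    'committed', 'committed', 'definitely', 'definately', 'defiantly',
    'hygiene', 'hygiene', 'hygiene', 'hygiene', 'hygiene',
    'mischievous', 'mischievous', 'mischievous',
    'succesful', 'successful', 'successful', 'tommorrow', 'tommorrow',
    'vacuum', 'vacuum', 'value', 'writing', 'writing', 'welfare', 'welfare',
]
# Assumed targets, one per input word, inferred from the grouping of the inputs
intended = (
    ['absence'] * 3 + ['allegiance'] * 3 + ['apparent'] * 3 + ['conscious'] * 2
    + ['committed'] * 2 + ['definitely'] * 3 + ['hygiene'] * 5
    + ['mischievous'] * 3 + ['successful'] * 3 + ['tomorrow'] * 2
    + ['vacuum'] * 3 + ['writing'] * 2 + ['welfare'] * 2
)

# Collect (input, recorded correction, assumed target) for every mismatch
misses = [
    (word, got, want)
    for word, got, want in zip(incorrect_words, recorded, intended)
    if got != want
]
accuracy = 1 - len(misses) / len(incorrect_words)
print(f"Accuracy: {accuracy:.2%}")
for word, got, want in misses:
    print(f"{word!r} -> {got!r} (assumed target: {want!r})")
```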


In [258]:
# CONTEXT SENSITIVE SPELL CHECK
sentences = []
sentences.append("The demonstrtors gahered near the SpaceX faciity, where the company builds its Falcon 9 rockets, lining up along Crenshaw Boulevard, CBS LA repoted.")
sentences.append("Ther aim: to stop the rocket company from launchin Tukey's Turksat 5A communication saellite.")
sentences.append("SpaceX is epected to launch the mission Nov. 30 from Cape Canaveral, Florida using a Falcon 9, acording to Spaceflight Now.")
sentences.append("(Some press and social medi reports have referred to the satellite as Turksat 1A, but it is Turksat 5A tha SpaceX will launch.)")

# correction() is passed each whole sentence here, not the individual words
for s in sentences:
    print(spc.correction(s))

The demonstrtors gahered near the SpaceX faciity, where the company builds its Falcon 9 rockets, lining up along Crenshaw Boulevard, CBS LA repoted.
Ther aim: to stop the rocket company from launchin Tukey's Turksat 5A communication saellite.
SpaceX is epected to launch the mission Nov. 30 from Cape Canaveral, Florida using a Falcon 9, acording to Spaceflight Now.
(Some press and social medi reports have referred to the satellite as Turksat 1A, but it is Turksat 5A tha SpaceX will launch.)
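
The sentences come back unchanged, which is consistent with `correction` treating its whole argument as a single word. One workaround is to apply a word-level corrector to each word in turn, as sketched below. The names `correct_sentence` and `toy_corrector` are hypothetical, and the toy dictionary only stands in for a real corrector; passing `spc.correction` as the second argument is the assumed real usage. This is still not context-sensitive, since every word is corrected in isolation, and proper nouns such as "SpaceX" or "Turksat" may be altered by a real corrector.

```python
import re

def correct_sentence(sentence, correct_word):
    """Apply a word-level corrector to each alphabetic word in a sentence."""
    def fix(match):
        word = match.group(0)
        # Fall back to the original word if the corrector returns None
        return correct_word(word) or word
    # Only alphabetic runs are rewritten; punctuation, digits and spacing stay as-is
    return re.sub(r"[A-Za-z]+", fix, sentence)

# Toy stand-in for a real corrector (hypothetical data)
toy_fixes = {"gahered": "gathered", "faciity": "facility", "repoted": "reported"}

def toy_corrector(word):
    return toy_fixes.get(word)

fixed = correct_sentence("The crowd gahered near the faciity, CBS LA repoted.", toy_corrector)
print(fixed)
```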


### Removing stop words
The standard stop words are taken from the scikit-learn package.

In [255]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
print(f"No. of stop words in the package: {len(ENGLISH_STOP_WORDS)}")

results = [t for t in tokens if t not in ENGLISH_STOP_WORDS]
print(f"No. of tokens after removing stop words: {len(results)}")

print(f"No. of stop words removed from original tokens: {len(tokens)-len(results)}")

No. of stop words in the package: 318
No. of tokens after removing stop words: 742
No. of stop words removed from original tokens: 473
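
The last figure counts removed token occurrences (1215 - 742 = 473), not distinct stop words. To see which stop words account for those removals, the filter can be paired with a frequency count. This is a minimal sketch with a hypothetical helper `split_stop_words` and toy data; in the notebook, `tokens` and `ENGLISH_STOP_WORDS` would be passed instead.

```python
from collections import Counter

def split_stop_words(tokens, stop_words):
    """Return the kept tokens and a frequency count of the removed ones."""
    kept = [t for t in tokens if t not in stop_words]
    removed = Counter(t for t in tokens if t in stop_words)
    return kept, removed

# Toy data standing in for the real token list and stop word set
toy_tokens = ["the", "rocket", "and", "the", "satellite", "of", "the", "company"]
toy_stop_words = frozenset({"the", "and", "of"})

kept, removed = split_stop_words(toy_tokens, toy_stop_words)
print(kept)
print(removed.most_common())
print(f"Occurrences removed: {sum(removed.values())}, distinct stop words: {len(removed)}")
```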


### Stemming

In [176]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
stems = [stemmer.stem(word) for word in results]

### Lemmatization
Before running the lemmatization code in the next cell, open a Python shell in your conda environment and run the following commands. This needs to be done only once.

`import nltk
nltk.download('wordnet')`

This downloads the WordNet corpus to the `nltk_data` folder in your home directory on Linux (the default location differs on Windows). NLTK searches that default folder automatically during lemmatization. The output of the above command is as follows.

`[nltk_data] Downloading package wordnet to
[nltk_data]     /home/singhabahu/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
True
`

In [177]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(stem) for stem in stems]

#### Experiment on Stemming vs Lemmatization
Here, the stop-word-filtered tokens in the variable <ins>**results**</ins> are used for both stemming and lemmatization. Note that <ins>**results**</ins> is built from <ins>**tokens**</ins>, so the spell-corrected tokens are not part of this experiment. Only a subset of the tokens is shown for convenience of presentation. The subset size is controlled by <ins>**TOKEN_COUNT**</ins>.

In [290]:
import random
TOKEN_COUNT = 10

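random_results = []

# the loop runs TOKEN_COUNT - 1 times, so 9 tokens are drawn when TOKEN_COUNT is 10
for i in range(0, (TOKEN_COUNT-1)):
    # random.choice always picks a valid element; randint(0, len(results)) could index one past the end
    random_results.append(random.choice(results))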

print("Original Tokens")
print(random_results)

print("After Stemming (No lemmatization done)")
print([stemmer.stem(result) for result in random_results])

print("After Lemmatization (No stemming done)")
print([lemmatizer.lemmatize(result) for result in random_results])

Original Tokens
['contain', 'good', 'onthisday', '��questions', 'task', 'approach', 'madame', 'influence', 'canada']
After Stemming (No lemmatization done)
['contain', 'good', 'onthisday', '��question', 'task', 'approach', 'madam', 'influenc', 'canada']
After Lemmatization (No stemming done)
['contain', 'good', 'onthisday', '��questions', 'task', 'approach', 'madame', 'influence', 'canada']
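
To make the comparison easier to read, the tokens that each method actually changed can be listed side by side. The sketch below uses the recorded lists above, with the token containing undecodable characters left out; the helper name `changed` is hypothetical. On this sample the stemmer alters two tokens and the lemmatizer none. One possible reason is that `WordNetLemmatizer.lemmatize` treats every word as a noun unless a `pos` argument is given, and these tokens are already in their base noun form.

```python
def changed(before, after):
    """Return (original, processed) pairs where processing altered the token."""
    return [(b, a) for b, a in zip(before, after) if b != a]

# Recorded outputs from the cell above (the garbled token is omitted)
original = ['contain', 'good', 'onthisday', 'task', 'approach', 'madame', 'influence', 'canada']
stemmed = ['contain', 'good', 'onthisday', 'task', 'approach', 'madam', 'influenc', 'canada']
lemmatized = ['contain', 'good', 'onthisday', 'task', 'approach', 'madame', 'influence', 'canada']

stem_changes = changed(original, stemmed)
lemma_changes = changed(original, lemmatized)
print("Changed by stemming:", stem_changes)
print("Changed by lemmatization:", lemma_changes)
```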
