Introduction ¹

As the assignment indicated, we used the Wordnet dictionary and the Birkbeck spelling error corpus in this project. The original Birkbeck corpus includes many different rules for the extraction of the required data; such as the differences between American and British English spellings. Handling this matter was out of the context of this assignment. Therefore, we use the version of Roger Mitton's Birkbeck which includes the correct spelling of each token alongside it.
We used the Levenshtein algorithm to measure the distance between two tokens.

Problem solving

In the following sentences, we explain our approach to solving this problem. First, we calculate the minimum edit distance between each misspelled token and every word in the dictionary. Then, we keep the 10 most similar (least distant) dictionary words for each misspelled word. In the final step, we calculate success at 1, 5, 10 for the correct spelling of the misspelled token. In other words, we search the set of most similar words to find the existence of the correct form of the misspelled token. We used Python and Golang programming languages to implement the solution. In python nltk library is used for the Wordnet corpus. Also, by writing Wordnet's words into a file, we used the same dictionary in Golang.
As per the request of the assignment, we reported results for average success at k=1, 5, 10 using the pytrec eval terrier.
Assume $n$ is the length of dictionary, $m$ is the maximum length of a token. The time complexity of the algorithm for each misspelled token is $O(nm^2)$. Considering the time complexity, it is critical reducing the run-time by utilizing different methods such as parallelism. In table [2] you can see the impact of parallelism on this task.

Experiments

Both source codes are available in a single repository on GitHub [@Ripo]. The python version of the program includes a utils package, data directory, and main.py and main.ipynb. the utils package consists of the Levenshtein algorithm and some functions that create the set of most similar words and success at K calculator. For the Golang program, there are two files in the src-go directory, consisting of utils and main.
In order to enhance the speed of the calculations we employed the concurrency features of Golang, known as Goroutines. A Goroutine is a lightweight thread managed by the Golang program. We reported the results in table [1].\

Results

	Python	Go
Success at 1	27%	26%
Success at 5	44%	44%
Success at 10	50%	49%
runtime	14.3 H	18min

	runtime for 6 instances using python
Sequential	80 seconds
Parallel	17 seconds

Discussion

This method is not applicable in spell correction programs due to its time complexity. Furthermore, this method for spell correction misses many misspelled words because it does not consider the context.

You can also see the https://github.com/rezaBarzgar/Assignment1-COMP-8730/blob/master/NLP_Assignment1.pdf ↩

Name		Name	Last commit message	Last commit date
Latest commit History 21 Commits
.idea		.idea
Golang		Golang
data		data
src		src
utils		utils
.gitignore		.gitignore
LICENSE		LICENSE
NLP_Assignment1.pdf		NLP_Assignment1.pdf
README.md		README.md
main.ipynb		main.ipynb
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

.idea

.idea

Golang

Golang

data

data

src

src

utils

utils

.gitignore

.gitignore

LICENSE

LICENSE

NLP_Assignment1.pdf

NLP_Assignment1.pdf

README.md

README.md

main.ipynb

main.ipynb

requirements.txt

requirements.txt

Repository files navigation

Introduction ¹

Problem solving

Experiments

Results

Discussion

About

Releases

Packages

Languages

License

rezaBarzgar/Assignment1-COMP-8730

Folders and files

Latest commit

History

Repository files navigation

Introduction 1

Problem solving

Experiments

Results

Discussion

Footnotes

About

Resources

License

Stars

Watchers

Forks

Languages

Introduction ¹