GitHub - theraccoun/NLP: Homework for my NLP class

LexiconImprovement.py
MaxMatch.py
MinNumberEdits.py
README.txt
maccoun-assgn1.py
maccoun-desc-assgn1.txt
mainProgram.py


Steven MacCoun
NLP CSCI 5832 Fall 2012

HW 1


Part 3 ->

For part 3, I first removed all one- and two-letter words that were not in a list of the most frequent one- and two-letter words.
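A minimal sketch of this filtering step, assuming the lexicon is a collection of lowercase strings. The whitelist COMMON_SHORT_WORDS and the function name prune_short_words are hypothetical; the actual list of frequent short words is not given in this README.

```python
# Hypothetical whitelist of frequent one- and two-letter words.
# The list actually used for the assignment is not given in this README.
COMMON_SHORT_WORDS = {
    "a", "i", "an", "as", "at", "be", "by", "do", "go", "he", "if", "in",
    "is", "it", "me", "my", "no", "of", "on", "or", "so", "to", "up", "us", "we",
}


def prune_short_words(lexicon, keep=COMMON_SHORT_WORDS):
    """Return the lexicon as a set, dropping one- and two-letter entries
    unless they appear in `keep`. Longer words are always kept."""
    return {word for word in lexicon if len(word) > 2 or word in keep}
```

For example, prune_short_words({"a", "zz", "xq", "to", "cat"}) keeps "a", "to", and "cat" and drops "zz" and "xq".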

For the maxMatch algorithm, I added a search-ahead step. Once a match for a word is found, the algorithm keeps searching ahead, and each time it finds another match, it records it as a possible match. It then selects the longest possible word found during the search ahead, adds it to the maxMatch list, and advances the pointer to that point in the input string.
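The search-ahead idea can be sketched roughly as follows. This is an illustration under assumptions, not the code in MaxMatch.py: the function name max_match, the set-based lexicon, and the fallback of emitting a single character when nothing matches are all choices made here for the example.

```python
def max_match(text, lexicon):
    """Greedy left-to-right segmentation with search ahead.

    At each position, every prefix that is in the lexicon is treated as a
    possible match, and the longest one is kept.
    """
    words = []
    start = 0
    while start < len(text):
        # Assumed fallback: if nothing matches, emit one character.
        longest_end = start + 1
        for end in range(start + 1, len(text) + 1):
            if text[start:end] in lexicon:
                longest_end = end  # a later hit is always a longer match
        words.append(text[start:longest_end])
        start = longest_end  # move the pointer past the chosen word
    return words
```

With the lexicon {"the", "theta", "table", "bled", "own", "down", "there"}, max_match("thetabledownthere", lexicon) returns ["theta", "bled", "own", "there"], which shows how one greedy mistake at the start can throw off the rest of the segmentation.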

I ran this algorithm both forwards and backwards. After getting both results, I compared the lists by looking at which had the longest first and last words (I did this by summing the lengths of the first and last words for each result). The intuition behind this was that long words tend to be correct more often, and maxMatch tends to really screw up once it starts missing after its initial hits. Thus, if a result had a long first word and a long last word, it was more likely that those words were correct, and so were the words in between. If the two results had the same summed first- and last-word length, I then compared which had more long words (where I defined long as >= 4 letters), since, again, I assumed that long words were more likely to be correct and that if a long word was matched, the other words were more likely to be correct too.
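A rough sketch of this comparison, assuming each result is a list of strings. The names edge_score, long_word_count, and choose_segmentation are hypothetical, and returning the forward result on a complete tie is an assumption, since the README does not say what happens in that case.

```python
LONG_WORD_LENGTH = 4  # "long" is defined above as >= 4


def edge_score(words):
    """Sum of the lengths of the first and last words (0 for an empty list).
    A one-word result counts that word as both first and last."""
    return len(words[0]) + len(words[-1]) if words else 0


def long_word_count(words):
    """Number of words with at least LONG_WORD_LENGTH letters."""
    return sum(1 for word in words if len(word) >= LONG_WORD_LENGTH)


def choose_segmentation(forward, backward):
    """Pick between the forward and backward maxMatch results."""
    forward_edge = edge_score(forward)
    backward_edge = edge_score(backward)
    if forward_edge != backward_edge:
        return forward if forward_edge > backward_edge else backward
    # Tie on the edge score: prefer the result with more long words.
    # Assumption: the forward result wins if that is also a tie.
    if long_word_count(backward) > long_word_count(forward):
        return backward
    return forward
```

For instance, choose_segmentation(["we", "can", "only"], ["we", "canon", "ly"]) returns the first list, because its first and last words sum to 6 letters versus 4.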