This is the code for a class project of the "Introduction to Natural Language Processing" course at ETH Zürich.

Exercise 1

How it works

We built a little domain model. A "Word" consists of multiple "Splittings". A "Splitting" has a "WordPart" in each of the roles prefix, stem and suffix:

"Word" 1--* "Splitting" 1-Prefix-1 "WordPart"
	                	1-Stem-1 "WordPart"
    	            	1-Suffix-1 "WordPart"

A "Text" contains prefixes, stems and suffixes which are unique within the text. (In one text instance, there are no two suffixes with the same name)

Text parsing

In a first step, the text is read using the Text class. There, some normalization is done (such as removing uppercase letters and punctuation). After that, a list of all unique words together with their word counts is generated.
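
A rough sketch of this normalization and counting step; the class and method names are made up for illustration:

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Sketch of the normalization and word-counting step (illustrative only).
    class TextSketch {
        static Map<String, Integer> countWords(String rawText) {
            // Lowercase, strip punctuation, then split on whitespace.
            String normalized = rawText.toLowerCase().replaceAll("[^a-z\\s]", " ");
            Map<String, Integer> counts = new LinkedHashMap<>();
            for (String word : normalized.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
            return counts;
        }
    }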

Splittings

The splittings are generated by the Word class. All possible splittings are generated (using a double loop over the two split points).
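
The double loop could look roughly like this (an illustrative sketch; the real code may impose extra constraints, such as a minimum stem length):

    import java.util.ArrayList;
    import java.util.List;

    // Sketch: enumerate every prefix/stem/suffix decomposition of a word.
    class SplittingSketch {
        static List<String[]> allSplittings(String word) {
            List<String[]> result = new ArrayList<>();
            // i = end of the prefix, j = end of the stem; the stem must be non-empty.
            for (int i = 0; i < word.length(); i++) {
                for (int j = i + 1; j <= word.length(); j++) {
                    String prefix = word.substring(0, i);
                    String stem = word.substring(i, j);
                    String suffix = word.substring(j);
                    result.add(new String[] { prefix, stem, suffix });
                }
            }
            return result;
        }
    }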

Finite State Transducer

We decided to implement our own FST, since OpenFST was not available for Java. We also thought we would learn something about the mechanics of FSTs by implementing them ourselves.

There is no single class capturing the concept of the FST. First, the states and the links are constructed. Then a configuration is initialized with the start state, the tapes and a result collector.

Then the configuration is run. The configuration recursively creates copies of itself and lets the copies traverse the links. Whenever a configuration reaches an accepting state, it is passed to the result collector.
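
A condensed sketch of this mechanism, simplified to a single input tape and with illustrative names (the real classes are more general):

    import java.util.ArrayList;
    import java.util.List;

    // Simplified sketch of the recursive configuration traversal.
    class FstSketch {
        static class Link {
            final State target;
            final String input;   // symbol consumed from the input tape
            final String output;  // symbol written to the output tape
            final double weight;
            Link(State target, String input, String output, double weight) {
                this.target = target; this.input = input; this.output = output; this.weight = weight;
            }
        }

        static class State {
            final boolean accepting;
            final List<Link> links = new ArrayList<>();
            State(boolean accepting) { this.accepting = accepting; }
        }

        static class Result {
            final String output;
            final double weight;
            Result(String output, double weight) { this.output = output; this.weight = weight; }
        }

        // A configuration: current state, remaining input, output so far, accumulated weight.
        static void run(State state, String input, String output, double weight, List<Result> collector) {
            if (input.isEmpty() && state.accepting) {
                collector.add(new Result(output, weight));   // hand the result to the collector
            }
            for (Link link : state.links) {
                if (input.startsWith(link.input)) {
                    // "Copy" the configuration and let the copy traverse the link,
                    // summing up the link weights along the way.
                    run(link.target, input.substring(link.input.length()),
                        output + link.output, weight + link.weight, collector);
                }
            }
        }
    }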

Weights

Each link has a weight, and the weights are summed up in the configurations as they traverse the links.

FstPrinter

The FstPrinter traverses a state graph and creates a textual representation of the graph, suitable for the dot tool (which is part of GraphViz).
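
A self-contained sketch of this idea; the Edge type and the method name are invented here for illustration:

    import java.util.List;

    // Sketch of what the FstPrinter does: walk the state graph and emit a
    // description in the dot language.
    class FstPrinterSketch {
        record Edge(int from, int to, String label) {}

        static String toDot(int states, List<Integer> acceptingStates, List<Edge> edges) {
            StringBuilder dot = new StringBuilder("digraph fst {\n  rankdir=LR;\n");
            for (int s = 0; s < states; s++) {
                String shape = acceptingStates.contains(s) ? "doublecircle" : "circle";
                dot.append("  s").append(s).append(" [shape=").append(shape).append("];\n");
            }
            for (Edge e : edges) {
                dot.append("  s").append(e.from()).append(" -> s").append(e.to())
                   .append(" [label=\"").append(e.label()).append("\"];\n");
            }
            return dot.append("}\n").toString();
        }
    }

Feeding the resulting string to dot (for example with dot -Tpng) renders the graph.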

ResultCollector

The result collector collects the results of running an FST.

Strategy

Analyzer

Our general strategy for the Analyzer was as follows: after generating all the possible prefixes, stems and suffixes, we build a big FST that consists (basically) of a start, a middle and a stop state. Then, all the prefixes with their respective weights are added as edges between the start and the middle state (for example transducing "up" to "up^"). The same thing happens for stems ("house" to "house") and suffixes ("ing" to "^ing").

At the end, we can just run the transducer and get (through the weights) different estimates of where the splitting could be done.
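
A sketch of how such an analyzer FST could be assembled, reusing the simplified State and Link types from the FST sketch above. The weight maps and the exact number of intermediate states are illustrative; the real code may wire the middle part differently and handle empty affixes explicitly:

    import java.util.Map;

    // Sketch: building an analyzer FST out of the simplified State/Link types above.
    class AnalyzerSketch {
        static FstSketch.State build(Map<String, Double> prefixWeights,
                                     Map<String, Double> stemWeights,
                                     Map<String, Double> suffixWeights) {
            FstSketch.State start = new FstSketch.State(false);
            FstSketch.State afterPrefix = new FstSketch.State(false);
            FstSketch.State afterStem = new FstSketch.State(false);
            FstSketch.State stop = new FstSketch.State(true);

            // Prefix edges, e.g. transducing "up" to "up^".
            prefixWeights.forEach((p, w) ->
                start.links.add(new FstSketch.Link(afterPrefix, p, p + "^", w)));
            // Stem edges, e.g. "house" to "house".
            stemWeights.forEach((s, w) ->
                afterPrefix.links.add(new FstSketch.Link(afterStem, s, s, w)));
            // Suffix edges, e.g. "ing" to "^ing".
            suffixWeights.forEach((s, w) ->
                afterStem.links.add(new FstSketch.Link(stop, s, "^" + s, w)));
            return start;
        }
    }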

Expander

The FST of the affix expander starts with links for each prefix to a common post-prefix state. From there, there is a link for each stem to a per-stem pre-suffix state. From there, there is a link for each possible suffix, followed by an FST generating all possible forms (prefix/suffix combinations) of the particular stem.

Only the weights of the parsing part of the FSTs are set; the weights of the generating part are all set to zero.

Therefore, the probabilities of the accepting configurations depend only on the score of the parsing.

Weights

We tried different strategies for the weights in the final FSTs. The general idea was to take the number of times a word part was seen as a first approximation of its probability. For example, when we see "ing" 10 times as a suffix in different unique words, we weight it with a factor of 10 in the final FST.

Depending on word count

One adaptation we tried was not to count per unique word, but per word occurrence. For example, having seen "raising, raising, raising, melted, melting" should give the splitting of "melting" a higher probability than the splitting of "melted", because the "ing" form has been seen much more often than the "ed" form.

However, this did not improve the results much, so we abandoned it again.

Depending on splitting probability

We also implemented a weighting that was inversely proportional to the number of valid splittings per word. That means the more different splittings a word has, the less weight each splitting receives. For example, a word with two possible splittings ("house" => {"hou^se", "hous^e"}) would give each word part in each splitting a weight of 1/2, while a word with three possible splittings ("housing" => {"hous^ing", "hou^sing", "housi^ng"}) would assign a weight of 1/3 to each word part.

This method also did not increase the correctness of the FST much. In particular, the stems of long words became very unlikely and were often outweighed by strange prefixes and suffixes.

Depending on word part length

The last and really successful adaptation of the basic weighting was to make it dependent on the word part length. With the basic weighting, the main problem was that single-letter prefixes and suffixes like "a" and "e" were extremely common. Because there are only 26 of them, they ended up heavily overweighted.

To correct that, we do the following: instead of using the absolute count of each prefix, we use its count relative to the average count of all the other prefixes (stems/suffixes respectively) of the same length:

    weight = wordpart.count / average count of word parts with the same length
           = wordpart.count / (wordpart(length).sum / wordpart(length).size)
           = wordpart.count * wordpart(length).size / wordpart(length).sum

We see that the weight is proportional to the number of other word parts with the same length. So, for a certain prefix (stem/suffix), the weight equals the number of times it is seen, multiplied by the number of different word parts of that length and divided by their total count.
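
In code, this length-normalised weighting could be computed roughly like this (illustrative sketch):

    import java.util.HashMap;
    import java.util.Map;

    // Sketch: a part's weight is its count divided by the average count of all
    // parts (of the same kind) that have the same length.
    class LengthWeightSketch {
        static Map<String, Double> weights(Map<String, Integer> counts) {
            Map<Integer, Integer> sumPerLength = new HashMap<>();
            Map<Integer, Integer> sizePerLength = new HashMap<>();
            counts.forEach((part, count) -> {
                sumPerLength.merge(part.length(), count, Integer::sum);
                sizePerLength.merge(part.length(), 1, Integer::sum);
            });
            Map<String, Double> result = new HashMap<>();
            counts.forEach((part, count) -> {
                double average = (double) sumPerLength.get(part.length())
                               / sizePerLength.get(part.length());
                result.put(part, count / average);
            });
            return result;
        }
    }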

In the end, this gave us quite good results, so that's the solution that is implemented right now.

Exercise 2.1

Classes description

We read the text in the TextParser, splitting it at "=====". All the other processing, until we have just words, sentences and tags, is also done in this class.

Next, the most important classes are the "State" and "StateCollection" classes. "State" is basically one of our hidden Markov states, where you can add and remove empirical word counts and transition counts and then read the resulting probabilities out of it.

The "StateCollection" is is then the whole hidden Markov Model, with many States and a function to calculate the probability of a sentence under this model.

Finally, in the "Viterbi" Class, we implemented our "Viterbi" Algorithm. However, we didn't do that directly with a dynamic programming table, but recursively with a cache, what should give us about the same performance.

Results

We can read all the 50 training files and then calculate the probabilities of all the sentences in a test file in a few seconds. Using all the training files, we then get quite good results and an overall correctness of about 76%.

We can also clearly see how the correctness gets better the more training data we have. To show this, here is the correctness on a certain test file depending on the number of training files:

# Training files   Correctness
 1                 47.0 %
 2                 53.8 %
 4                 59.4 %
 8                 64.8 %
16                 69.3 %
32                 74.0 %
50                 76.8 %

Looking at the table, we can see that doubling the training data gives us roughly 5 percentage points of correctness. That means we need to increase the training data exponentially to get a linear increase in correctness.

Exercise 2.2

Classes description

The basic classes in the second exercise are the same as for the first part of exercise 2. However, some new classes were added and some were adapted to the new needs:

  • Two classes derived from State were introduced: the CountedState represents a state with probabilities generated by counting, while the OptimizedState represents a state with probabilities generated by optimization. They differ both in how the probabilities are represented and in how they are set.

  • OptimizedStateCollection and CountedStateCollection were introduced to make the handling of the states type safe.

  • UnsupervisedTrainingAlgorithm contains the algorithm for the unsupervised training of an OptimizedStateCollection.

  • The BigDouble represents a floating-point value with a 52-bit significand and a 64-bit exponent. This is required to handle the tiny probabilities that occur in our algorithms (a minimal sketch follows this list).
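
The sketch below shows one way such a number type can be built: a normalised double significand plus a long exponent, so that products of many tiny probabilities do not underflow. It is illustrative only, assumes non-negative values (probabilities), and only shows the operations needed here; the real BigDouble has more functionality:

    // Minimal sketch of a "big double" value = significand * 2^exponent.
    // Assumes non-negative, non-subnormal inputs; shown for illustration only.
    class BigDoubleSketch implements Comparable<BigDoubleSketch> {
        final double significand;   // in [0.5, 1), or 0
        final long exponent;

        private BigDoubleSketch(double significand, long exponent) {
            this.significand = significand;
            this.exponent = exponent;
        }

        static BigDoubleSketch fromDouble(double value) {
            if (value == 0.0) return new BigDoubleSketch(0.0, 0);
            int exp = Math.getExponent(value) + 1;       // scalb(value, -exp) lands in [0.5, 1)
            return new BigDoubleSketch(Math.scalb(value, -exp), exp);
        }

        BigDoubleSketch multiply(BigDoubleSketch other) {
            double s = significand * other.significand;      // in [0.25, 1)
            long e = exponent + other.exponent;
            if (s != 0.0 && s < 0.5) { s *= 2.0; e -= 1; }   // re-normalise
            return new BigDoubleSketch(s, e);
        }

        @Override
        public int compareTo(BigDoubleSketch other) {
            if (significand == 0.0 || other.significand == 0.0)
                return Double.compare(significand, other.significand);
            if (exponent != other.exponent) return Long.compare(exponent, other.exponent);
            return Double.compare(significand, other.significand);
        }
    }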

As the stopping criterion for the unsupervised training loop, we calculate the probability of the whole training text. If this probability increases by less than 10 per cent in one iteration, we stop.
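
Sketched as a loop, where trainOneIteration and textProbability are placeholders for the real training code and BigDoubleSketch is the sketch from the list above:

    // Sketch of the stopping criterion for the unsupervised training loop.
    class TrainingLoopSketch {
        interface Model {
            void trainOneIteration(String text);
            BigDoubleSketch textProbability(String text);
        }

        static void train(Model model, String trainingText) {
            BigDoubleSketch previous = model.textProbability(trainingText);
            while (true) {
                model.trainOneIteration(trainingText);
                BigDoubleSketch current = model.textProbability(trainingText);
                // Stop once the probability of the whole training text
                // improves by less than 10%.
                if (current.compareTo(previous.multiply(BigDoubleSketch.fromDouble(1.10))) <= 0) {
                    break;
                }
                previous = current;
            }
        }
    }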

Results

Due to the long run times, only about two training texts are feasible, which is what we tested our algorithm with.

However, the results for a given training set look quite reasonable. The ten unknown tags are reasonably distributed over all the sentences, and similar word types seem to be tagged with similar tags (like "I", "he" and "Scotty" often with tag "s2", for example).
