# Modified Ensegment Program - Team Arceus

## Documentation

### The main modification we made

After reading through the section on n-grams and word segmentation in chapter 14 of the Norvig textbook, we noticed one major discrepancy between the default code and the sample snippet in the textbook:

The helper function that the default code assigns to `missingfn` does not:
- Take into account the size of the candidate word (k)
- Penalize long words, as the textbook's snippet does
```
self.missingfn = missingfn or (lambda k, N: 1.0 / N)
```
So we just modified it to:
```
self.missingfn = missingfn or (lambda k, N: 10.0 / (N * 10 ** len(k)))
```
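To see what the change does, the sketch below evaluates both functions on strings of different lengths. `N` here is an assumed round figure, not the actual token total of `count_1w.txt`, and the names `default_missingfn` and `length_aware_missingfn` are ours, chosen only for illustration.

```python
# Assumed round figure for the corpus size; the real N comes from count_1w.txt.
N = 1_000_000_000_000

# Default estimate: every unseen word gets the same probability.
default_missingfn = lambda k, N: 1.0 / N

# Modified estimate: each extra letter divides the probability by 10.
length_aware_missingfn = lambda k, N: 10.0 / (N * 10 ** len(k))

for word in ["a", "faves", "unclimatechangebody"]:
    print(
        f"{word:>20}  len={len(word):2d}  "
        f"default={default_missingfn(word, N):.3e}  "
        f"modified={length_aware_missingfn(word, N):.3e}"
    )
```

For a one-letter string the two estimates coincide at 1/N; each additional letter divides the modified estimate by 10, while the default stays constant.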
### The overall segmentation results of the small edit are:

```
from ensegment import *
Pw = Pdist(data=datafile("../data/count_1w.txt"))
segmenter = Segment(Pw)
with open("../data/input/dev.txt") as f:
    for line in f:
        print(" ".join(segmenter.segment(line.strip())))
```

```
choose spain
this is a test
who represents
experts exchange
speed of art
un climate change body
we are the people
mention your faves
now playing
the walking dead
follow me
we are the people
mention your faves
check domain
big rock
name cheap
apple domains
honesty hour
being human
follow back
social media
30 seconds to earth
current rate sought to go down
this is insane
what is my name
is it time
let us go
me too
now thatcher is dead
advice for young journalists
```


## Analysis

The default unseen-word probability of 1/N is the same for every unseen string, regardless of its length, so very long unseen strings can score higher than their correct segmentations. We modified the unseen-word probability to avoid this. For example, the default solution for "dev.txt" still contains the following unsegmented words:
<br>
- "unclimatechangebody"
- "mentionyourfaves"
- "30secondstoearth"
- "ratesoughttogodown"
- "nowthatcherisdead"
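A minimal sketch of why a flat 1/N favors the unsegmented string: if any piece of a segmentation is unseen, that piece alone already costs 1/N, and the remaining known words only lower the product further. The counts and `N` below are illustrative round numbers, not values from "count_1w.txt", and we assume for the sake of the example that "faves" is unseen; `prob` and `sequence_prob` are hypothetical helpers, not part of `ensegment`.

```python
# Assumed corpus size and illustrative counts (not read from count_1w.txt).
N = 1_000_000_000_000
COUNTS = {"mention": 40_000_000, "your": 2_000_000_000}  # "faves" assumed unseen


def prob(word):
    """Unigram probability with the default flat 1/N estimate for unseen words."""
    if word in COUNTS:
        return COUNTS[word] / N
    return 1.0 / N


def sequence_prob(words):
    """Probability of a segmentation under the unigram independence assumption."""
    p = 1.0
    for w in words:
        p *= prob(w)
    return p


whole = sequence_prob(["mentionyourfaves"])
split = sequence_prob(["mention", "your", "faves"])
print(f"unsegmented: {whole:.3e}")
print(f"segmented:   {split:.3e}")
```

Under these assumptions the unsegmented string always wins, because the segmented product is 1/N multiplied by further probabilities that are each below 1.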


The default solution gives an F-score of 0.82.
We initially changed the probability from 1/N to 10/N, which did not improve the segmentation accuracy and lowered the score to 0.78. We then decreased the probability by a factor of 5 for each letter in the candidate word, which improved the score to 0.97 but still left "30secondstoearth" unsegmented.

When the program attempts to segment "30secondstoearth", P("30 seconds to earth") is lower than P("30secondstoearth") because "30" cannot be found in the corpus file "count_1w.txt", so "30" is itself scored as an unseen word.

Therefore we further decreased the probability of unseen words, to a factor of 10 for each letter in the word, which improved the F-score to 1.0; all words from "dev.txt" now appear to be segmented properly.
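
The effect of the per-letter factor on "30secondstoearth" can be checked with a small log-probability comparison. This is a sketch under stated assumptions: `N` and the counts for "seconds", "to", and "earth" are illustrative round numbers rather than values read from "count_1w.txt", `log_prob` is a hypothetical helper that is not part of `ensegment`, and the factor-5 variant is assumed to have had the form 10 / (N * 5 ** len(k)).

```python
import math

# Assumed corpus size and illustrative counts (not read from count_1w.txt).
N = 1_000_000_000_000
COUNTS = {"seconds": 30_000_000, "to": 12_000_000_000, "earth": 50_000_000}


def log_prob(words, base):
    """Sum of log10 unigram probabilities; unseen words get 10 / (N * base ** len)."""
    total = 0.0
    for w in words:
        if w in COUNTS:
            total += math.log10(COUNTS[w] / N)
        else:
            total += math.log10(10.0 / (N * base ** len(w)))
    return total


unsegmented = ["30secondstoearth"]
segmented = ["30", "seconds", "to", "earth"]  # "30" is unseen in both cases

for base in (5, 10):
    print(
        f"factor {base:2d}: "
        f"unsegmented={log_prob(unsegmented, base):.2f}  "
        f"segmented={log_prob(segmented, base):.2f}"
    )
```

With these illustrative numbers, the unsegmented string scores higher with a factor of 5, and the four-word segmentation scores higher with a factor of 10.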