Scoring

  1. Perplexity - we include perplexity because it is a standard NLP evaluation metric. Once a poetry model is trained, we can pass in a test poem and calculate the model's perplexity on that poem. The lower the perplexity, the better (a minimal sketch of the calculation appears after this list).

  2. Meter - for a poem generated by the model, we evaluate its meter by calculating the fraction of lines that match the meter of the poem's category. For example, if the model generated a Shakespearean sonnet and 12 out of the 14 lines were in iambic pentameter, the poem's meter score would be 12/14 ≈ 0.86; a higher meter score is better (see the meter sketch after this list).

  3. Rhyme - for a poem generated by the model, we evaluate its rhyming by calculating the fraction of lines that match the rhyme scheme for the poem's category. For example, if the model generated a limerick and the rhyme scheme came out AABBB (instead of AABBA), the poem's rhyme score would be 4/5 = 0.80; a higher rhyme score is better (see the rhyme sketch after this list).

  4. Human evaluation - because poetry is an art form, we believe it is important for humans to score the poetry generated by the model. We will have people read and rate the poems; depending on our budget and time, we could run double-blind experiments among ourselves, recruit friends or classmates, or use MTurk. Readers will rate each poem out of 10 points, where 10 is the highest score.
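
The perplexity calculation is the standard one for an n-gram language model. Below is a minimal sketch for the bigram case; `bigram_prob` is a hypothetical helper standing in for whatever smoothed probability lookup the trained model provides, not part of this repo's documented API.

```python
import math

def perplexity(tokens, bigram_prob):
    """Perplexity of a token sequence under a bigram model.

    `bigram_prob(prev, word)` is assumed to return a smoothed
    probability P(word | prev) from the trained model.
    Lower perplexity is better.
    """
    log_prob = sum(math.log(bigram_prob(prev, word))
                   for prev, word in zip(tokens, tokens[1:]))
    # Average negative log-likelihood over the number of bigrams, exponentiated.
    return math.exp(-log_prob / (len(tokens) - 1))
```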
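
The meter score from item 2 reduces to a line-by-line stress check. The sketch below assumes a hypothetical `stress_pattern` helper (passed in as an argument) that maps a line of text to a string of 0s (unstressed) and 1s (stressed) syllables, for example via the CMU Pronouncing Dictionary; "0101010101" encodes iambic pentameter.

```python
def meter_score(lines, stress_pattern, target="0101010101"):
    """Fraction of lines whose stress pattern matches the target meter.

    `stress_pattern(line)` is a hypothetical helper mapping a line to a
    string of 0s/1s; "0101010101" is iambic pentameter. A Shakespearean
    sonnet with 12 of 14 matching lines scores 12/14 ≈ 0.86.
    """
    matches = sum(1 for line in lines if stress_pattern(line) == target)
    return matches / len(lines)
```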
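
Similarly, the rhyme score from item 3 compares the rhyme scheme detected in the generated poem against the expected scheme for its category, position by position. The function below is a sketch; how the scheme letters are detected in the first place (e.g. by comparing line-final phonemes) is a separate step and is not shown here.

```python
def rhyme_score(detected, expected):
    """Fraction of line positions whose rhyme letter matches the expected
    scheme, e.g. rhyme_score("AABBB", "AABBA") == 4/5 == 0.8.
    """
    matches = sum(1 for a, b in zip(detected, expected) if a == b)
    return matches / len(expected)
```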

Running the Code

From the home directory, run `python src/score.py [n for ngram] [training input file] [perplexity input file] [classification input file]`