This notebook shows how to run an exploratory analysis of our corpus in order to anticipate issues we could run into when generating poetry.

First, let us define a helper that executes another notebook, so that the variables and functions it defines become available here.

In [None]:
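import io
import nbformat

def execute_notebook(nbfile):
    """Run every code cell of the notebook `nbfile` in the current IPython session."""
    # Read the notebook, converting it to format version 4 if it is older
    with io.open(nbfile, encoding='utf-8') as f:
        nb = nbformat.read(f, as_version=4)

    ip = get_ipython()

    # Skip markdown and raw cells; names defined by the code cells end up in this namespace
    for cell in nb.cells:
        if cell.cell_type != 'code':
            continue
        ip.run_cell(cell.source)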

Let us load the configuration code that defines variables and functions specific to our application.

In [None]:
execute_notebook("Get_started.ipynb")

## Lexical exploratory analysis

Count the number of potential verses:

In [None]:
Pv = [extract_verses(document) for document in Dv]
print ("The number of verses is: " + str(len(Pv)))

Find the number of verses that do not rhyme with any other verse in the corpus.

In [None]:
noRhymingVerses = Poetry.noRhymingSentences(Pv)
# print(noRhymingVerses)  # uncomment to inspect the verses
print("Number of verses that do not rhyme with anything: " + str(len(noRhymingVerses)))
proportion = float(len(noRhymingVerses)) / float(len(Pv))
print("Proportion of verses that do not rhyme with anything: " + "%.2f" % (proportion * 100) + "%")
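
To make the idea concrete, here is a minimal, self-contained sketch of how such a count could be obtained. It does not reproduce `Poetry.noRhymingSentences`: the helper names `rhyme_key` and `non_rhyming`, the toy verses, and above all the rhyme criterion (two verses rhyme when their last words share the final two letters) are assumptions made only for illustration.

```python
from collections import Counter

def rhyme_key(verse, n=2):
    """Toy rhyme criterion (an assumption): the last n letters of the last word."""
    return verse.split()[-1].lower()[-n:]

def non_rhyming(verses, n=2):
    """Return the verses whose rhyme key is shared with no other verse."""
    counts = Counter(rhyme_key(v, n) for v in verses)
    return [v for v in verses if counts[rhyme_key(v, n)] == 1]

# Toy data, not taken from the corpus
verses = ["etxera noa", "zurekin goaz", "mendira doa", "egun on"]
lonely = non_rhyming(verses)
proportion = 100.0 * len(lonely) / len(verses)
print("%d of %d verses (%.2f%%) rhyme with nothing" % (len(lonely), len(verses), proportion))
```

Counting the keys first keeps the check linear in the number of verses, instead of comparing every pair of verses.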

Show the endings of the verses with no rhyme.

In [None]:
lastWordsNoRhymingVerses = General.sortStringListByReverseString(Poetry.noRhymingLastWords(Pv))
print(lastWordsNoRhymingVerses)
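
Sorting words by their reversed spelling places words with the same ending next to each other, which makes a list of endings much easier to scan. The sketch below shows one plausible way to do it; `sort_by_reversed` is a hypothetical stand-in, not the actual code of `General.sortStringListByReverseString`.

```python
def sort_by_reversed(words):
    """Sort words by their reversed spelling, so that shared endings become adjacent."""
    return sorted(words, key=lambda w: w[::-1])

# Toy data, not taken from the corpus
endings = ["kalera", "egun", "etxera", "lagun", "mendia"]
print(sort_by_reversed(endings))
```

With this toy list, `kalera` and `etxera` end up side by side, and so do `lagun` and `egun`.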

Find the number of verses that rhyme with a given word. To try a different word, edit the first line of the cell and run it again.

In [None]:
rhymingWord = "kalera"
rhymingVerses = Poetry.rhymingSentences(rhymingWord, Pv)
print("Number of verses that rhyme with " + rhymingWord + ": " + str(len(rhymingVerses)))
proportion = float(len(rhymingVerses)) / float(len(Pv))
print("Proportion of verses that rhyme with " + rhymingWord + ": " + "%.2f" % (proportion * 100) + "%")

Show the verses that rhyme with the word chosen above.

In [None]:
print(rhymingVerses)

Create a new file that keeps only the verses that can rhyme, leaving out those that rhyme with nothing.

In [None]:
possibleRhymes = General.substractList(Pv, noRhymingVerses)
General.saveListOfSentencesToFile(possibleRhymes, '8tx_clean.txt')

Group the verses into rhyming partitions (sets of verses that rhyme with one another) and count them.

In [None]:
noRhymeSentences, noRhymeLastWords, cleanPartitionIndices, cleanPartitionSentences = Poetry.analyzeProspectiveRhymes(Pv)
print("Number of partitions: " + str(len(cleanPartitionSentences)))

Count only the rhyming partitions that contain more verses than the minimum number of rhyming verses required in a stanza.

In [None]:
print("Number of partitions of minimum size: " + str(len(Poetry.possiblePartitions(cleanPartitionSentences, RP))))
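
The two steps above, grouping verses by rhyme and then discarding the groups that are too small, can be sketched as follows. This is only an illustration under stated assumptions: the rhyme criterion (shared last two letters of the last word) and the helper names are hypothetical, and the sketch keeps groups with at least `minimum` members, whereas the exact comparison used by `Poetry.possiblePartitions` is not shown in this notebook.

```python
from collections import defaultdict

def rhyme_key(verse, n=2):
    """Toy rhyme criterion (an assumption): the last n letters of the last word."""
    return verse.split()[-1].lower()[-n:]

def rhyme_classes(verses, n=2):
    """Group the verses so that all verses in a group share the same rhyme key."""
    classes = defaultdict(list)
    for verse in verses:
        classes[rhyme_key(verse, n)].append(verse)
    return list(classes.values())

def classes_of_minimum_size(classes, minimum):
    """Keep the groups with at least `minimum` verses (assumed threshold rule)."""
    return [c for c in classes if len(c) >= minimum]

# Toy data, not taken from the corpus
verses = ["etxera noa", "mendira doa", "kalera doa",
          "zurekin goaz", "egun on", "gabon"]
classes = rhyme_classes(verses)
usable = classes_of_minimum_size(classes, 2)
print("Number of partitions: " + str(len(classes)))
print("Number of partitions of minimum size: " + str(len(usable)))
```

Because every verse has exactly one rhyme key, the groups are disjoint and together cover all the verses, which is what makes them a partition.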

Create a list with one representative verse from every partition, along with the number of verses in that partition, sorted from the largest partition to the smallest.

In [None]:
exampleAndHowMany = [(partition[0], len(partition)) for partition in cleanPartitionSentences]
exampleAndHowMany.sort(key=lambda tup: tup[1], reverse=True)
print("List of representatives of the partitions, along with the number of members of that partition")
print(exampleAndHowMany)

The cells below draw four plots, where each equivalence class is one of the rhyming partitions computed above:
- Number of verses in each equivalence class
- Logarithm of the number of verses in each equivalence class
- Histogram of the equivalence class sizes
- Histogram of the logarithm of the equivalence class sizes

In [None]:
%%javascript
IPython.OutputArea.auto_scroll_threshold = 999;

In [None]:
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt

# Size of every partition returned by possiblePartitions, largest first
cardinalsPartition = [len(elem) for elem in Poetry.possiblePartitions(cleanPartitionSentences, RP)]
cardinalsPartition.sort(reverse=True)
print(cardinalsPartition)

matplotlib.rc('xtick', labelsize=15)
matplotlib.rc('ytick', labelsize=15)

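# Keep only the classes whose size lies in this range (edit the bounds to zoom in on part of the distribution)
minimumNumberOfVerses = 0
maximumNumberOfVerses = 100000
filteredCardinalsPartition = [elem for elem in cardinalsPartition if minimumNumberOfVerses <= elem <= maximumNumberOfVerses]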

plt.rcParams['figure.figsize'] = (15.0, 8.0)
plt.bar(range(len(filteredCardinalsPartition)), filteredCardinalsPartition)
plt.ylabel('Number of verses', fontsize = 20)
plt.xlabel('Equivalence class ordinal', fontsize = 20)

plt.show()

import numpy as np


logFilteredCardinalsPartition = np.log(filteredCardinalsPartition)
plt.rcParams['figure.figsize'] = (15.0, 8.0)
plt.bar(range(len(filteredCardinalsPartition)), logFilteredCardinalsPartition)
plt.ylabel('Log of the number of verses', fontsize = 20)
plt.xlabel('Equivalence class ordinal', fontsize = 20)
plt.show()

plt.rcParams['figure.figsize'] = (15.0, 8.0)
plt.hist(filteredCardinalsPartition, General.numberBins(filteredCardinalsPartition))
plt.ylabel('Number of equivalence classes', fontsize = 20)
plt.xlabel('Equivalence class size', fontsize = 20)
plt.show()

plt.rcParams['figure.figsize'] = (10.0, 8.0)
logFilteredCardinalsPartition = np.log(filteredCardinalsPartition)
plt.hist(logFilteredCardinalsPartition, General.numberBins(logFilteredCardinalsPartition))
plt.ylabel('Number of equivalence classes', fontsize = 20)
plt.xlabel('Log of the equivalence class size', fontsize = 20)
plt.show()

## Semantic exploratory analysis

- Build a semantic model from the set of documents provided by the user. The parameters are:
    * **lemmatizedDs**: set of lemmatized documents
    * **number_topics**: number of topics of the LSI model
    * **filtered_words**: words to filter out before creating the model
    * **no_below**: minimum number of documents in which a word has to appear in order to be kept
    * **no_above**: maximum fraction of documents (between 0 and 1) in which a word may appear

In [None]:
# customized values
number_topics = 100
filtered_words = ['dut', 'ni', 'zu', 'da', 'du', 'dute', 'zen', 'ere', 'gu', 'dugu', 'ez', 'bat', 'hori', 'hor', 'dira', 
            'baina', 'bi', 'zi', 'zut', 'zituzten', 'atzo', 'beste', 'dela']
no_below = 5
no_above = 0.2

# semantic model creation
dictionary, corpus, tfidfModel, lsiModel = NLP.semanticsExtractor(lemmatizedDs, number_topics, filtered_words, no_below, no_above)
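
The effect of `no_below` and `no_above` can be illustrated with a small pure-Python sketch. It assumes that both thresholds behave like the identically named arguments of gensim's `Dictionary.filter_extremes` (an absolute document count and a fraction of the corpus, respectively); whether `NLP.semanticsExtractor` applies them in exactly this way is not shown in this notebook, and `filter_vocabulary` is a hypothetical helper.

```python
def filter_vocabulary(documents, no_below, no_above):
    """Keep the words found in at least `no_below` documents and in at most
    a fraction `no_above` of all documents (assumed semantics)."""
    doc_freq = {}
    for doc in documents:
        for word in set(doc):  # count each word once per document
            doc_freq[word] = doc_freq.get(word, 0) + 1
    limit = no_above * len(documents)
    return sorted(w for w, df in doc_freq.items() if no_below <= df <= limit)

# Toy lemmatized documents, not taken from the corpus
docs = [["etxe", "handi", "da"],
        ["etxe", "txiki", "da"],
        ["mendi", "handi", "da"],
        ["itsaso", "da"]]
kept = filter_vocabulary(docs, no_below=2, no_above=0.5)
print(kept)
```

In this toy corpus `da` appears in every document and is dropped as too common, while the words that appear only once are dropped as too rare.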

The semantic model can be saved for later use.

In [None]:
nameModel = data_directory + 'new_model'
NLP.savePrecomputedData(dictionary, corpus, tfidfModel, lsiModel, nameModel)

A previously saved model can be loaded back when needed.

In [None]:
nameModel = data_directory + 'new_model'
dictionary, corpus, tfidfModel, lsiModel = NLP.loadPrecomputedData(nameModel)

Find the verses most similar to a given theme according to the semantic model.

In [None]:
fileSim = NLP.getSimilarityMatrix(lemmatizedDv, dictionary, tfidfModel, lsiModel)
theme = 'pozik' # the semantic similarity of the verses will be computed against this theme
simsWithNew = NLP.simsFromSentence(NLP.lemmatizeString(theme), dictionary, lsiModel, fileSim)
numberChosen = -1 # number of the best sentences returned
bestIndexes, bestValues, bestSentencesLemmatized, bestSentencesOriginal = NLP.getIndexesAndSentencesFromSimsValues(simsWithNew, lemmatizedDv_filename, Dv_filename, numberChosen)
# Pair every verse with its similarity value, sort from most to least similar, and show the top 100
results = sorted(zip(bestSentencesOriginal, bestValues), key=lambda pair: pair[1], reverse=True)
print(results[0:100])
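
The ranking step can be sketched with plain cosine similarity. Note the simplification: the cell above compares lemmatized text in the LSI topic space, whereas this sketch compares raw word counts, so it only illustrates how verses are scored against a theme and sorted. The helper names and the toy verses are assumptions.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors stored as Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values())) *
            math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def rank_by_similarity(theme, verses):
    """Return (verse, similarity) pairs sorted from most to least similar."""
    theme_vec = Counter(theme.split())
    scored = [(v, cosine(theme_vec, Counter(v.split()))) for v in verses]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy data, not taken from the corpus
verses = ["pozik nago", "triste nago", "pozik eta alai gaur"]
ranking = rank_by_similarity("pozik", verses)
print(ranking)
```

A shorter verse that contains the theme word scores higher than a longer one, because cosine similarity normalizes by the length of each vector.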

Find the verses that are most similar to a given theme according to the semantic model and that also rhyme with a given sentence.

In [None]:
sentence = 'maitasuna baieztu zenduten eleizan'
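#sentence = results[0][0]  # alternative: rhyme with the verse most similar to the theme
# Among the 500 most similar verses, keep those that rhyme with the sentence
rhymingResults = [result for result in results[0:500] if is_rhyme(result[0], sentence)]
print(rhymingResults)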