<a href="https://colab.research.google.com/github/RiccardoCozzi96/DeepComedy/blob/main/Preprocessing_and_tokenization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Preprocessing and Tokenization
This notebook creates the tokenized dataset that is later used to train the model.

We store our parsed and tokenized datasets in our GitHub repository, which is cloned below. The parsed (clean) datasets and the tokenized texts are already available in the "DeepComedy/datasets" folder, but running the cells below overwrites them with the newly generated files. By default, this notebook reproduces exactly the datasets we provide.

For a more detailed explanation of what happens here, please read our report on GitHub [here](https://github.com/RiccardoCozzi96/DeepComedy/blob/main/Cozzi-Liscio%20(2020)%20Deep%20Comedy%20project%20work%20report.pdf).


In [None]:
import re
import io
import sys
import numpy as np
import string
import pandas as pd
!pip install pyphen

# retrieve our GitHub repository
!git clone "https://github.com/RiccardoCozzi96/DeepComedy"

sys.path.append("DeepComedy/tokenizer/")
sys.path.append("DeepComedy/metrics/")
sys.path.append("DeepComedy/datasets/")

from comedy_tokenizer import ComedyTokenizer
from comedy_metrics import *

Collecting pyphen
  Downloading https://files.pythonhosted.org/packages/7c/5a/5bc036e01389bc6a6667a932bac3e388de6e7fa5777a6ff50e652f60ec79/Pyphen-0.10.0-py3-none-any.whl (1.9MB)
     |████████████████████████████████| 1.9MB 2.8MB/s 
Installing collected packages: pyphen
Successfully installed pyphen-0.10.0
Cloning into 'DeepComedy'...
remote: Enumerating objects: 421, done.
remote: Total 421 (delta 0), reused 0 (delta 0), pack-reused 421
Receiving objects: 100% (421/421), 1.67 MiB | 1.84 MiB/s, done.
Resolving deltas: 100% (61/61), done.


In [None]:
# folder and file name settings
dataset_folder = "DeepComedy/datasets/"
parsed_folder = dataset_folder+"parsed/"
tokenized_folder = dataset_folder+"tokenized/"
original_text_filename = ["commedia", "convivio", "detto", "vita", "fiore"]
parsed_text_filename = "parsed_*.txt"
tokenized_text_filename = "tokenized_*.txt"
hyphenation_dictionary = "DeepComedy/tokenizer/dantes_hyphenation_dictionary.csv"

## Hyphenation

### Loading the hyphenation dictionary

In [None]:
# load Dante's hyphenation dictionary from file
hyphenation_vocabulary = pd.read_csv(hyphenation_dictionary, index_col="word")
hyphenation_vocabulary = hyphenation_vocabulary.iloc[:, :1] # keep only the 1st column: the 2nd contains the tunes, not relevant for hyphenation
hyphenation_vocabulary.head(-1)

word,hyphenation
abaglia,a-ba-glia
abaier,a-ba-ier
abandonarmi,a-ban-do-nar-mi
abandonato,a-ban-do-na-to
abandono,a-ban-dó-no
...,...
zita,zì-ta
zodiaco,zo-dì-a-co
zona,zò-na
zucca,zùc-ca


In [None]:
# transforming to dictionary
hyphenation_dictionary = hyphenation_vocabulary.to_dict()["hyphenation"]

#some words have accents
print(hyphenation_dictionary["zavorra"])

#some others don't
print(hyphenation_dictionary["oscura"])

za-vòr-ra
o-scu-ra


### Creating tokenizer

The `ComedyTokenizer` class provides all the methods needed to hyphenate and tokenize a text.

In [None]:
import pyphen
from comedy_tokenizer import ComedyTokenizer

tokenizer = ComedyTokenizer(dictionary=hyphenation_dictionary, 
                            synalepha=True, 
                            use_tercets=True)

# alternatively, build the tokenizer directly from the csv file in a single call
# tokenizer = ComedyTokenizer.from_dataframe(pd.read_csv("ultimate_hyphenation.csv", index_col="word"),
#                                            synalepha=True, 
#                                            use_tercets=True)
print(tokenizer.hyphenate("zavorra"))

za-vòr-ra


### Testing hyphenation

Our tokenizer combines the `pyphen` hyphenation procedure with the .csv file we built by hyphenating all of Dante's terms (from the Divine Comedy and his other works). `pyphen` is used only in exceptional cases.
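
The dictionary-first idea can be illustrated with a minimal sketch. This is not the actual `ComedyTokenizer` implementation: the function name `hyphenate_with_fallback`, the toy dictionary, and the stand-in fallback are assumptions made for illustration only. In this notebook the fallback role would be played by `pyphen.Pyphen(lang='it').inserted`.

```python
from typing import Callable, Dict


def hyphenate_with_fallback(word: str,
                            dictionary: Dict[str, str],
                            fallback: Callable[[str], str]) -> str:
    """Return the curated hyphenation if the word is known, else defer to `fallback`.

    Hypothetical helper: it only illustrates the lookup order, not the
    real logic inside ComedyTokenizer.
    """
    key = word.lower()
    if key in dictionary:
        return dictionary[key]
    return fallback(word)


# Toy stand-in for Dante's dictionary (entries copied from outputs shown in this notebook)
toy_dictionary = {"oscura": "o-scu-ra", "zavorra": "za-vòr-ra"}


def toy_fallback(word: str) -> str:
    """Stand-in for pyphen: returns the word unsplit."""
    return word


print(hyphenate_with_fallback("oscura", toy_dictionary, toy_fallback))    # found in the dictionary
print(hyphenate_with_fallback("atletica", toy_dictionary, toy_fallback))  # handled by the fallback
```

Passing the fallback as a callable keeps the sketch independent of any specific hyphenation library.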

In [None]:
### HYPHENATION TEST ###
import pyphen
import pandas as pd
dic = pyphen.Pyphen(lang='it')
errors = "oscura atletica ostracismo cruento paura aiuta diocesi odracardo anima".split(" ")
print("{:20}  {:20} {:30}".format("word", "pyphen", "our"))
print("-"*80)



for e in errors:
    print("{:20}  {:20} {:30}".format(e, dic.inserted(e), tokenizer.hyphenate(e)))


word                  pyphen               our                           
--------------------------------------------------------------------------------
oscura                oscu-ra              o-scu-ra                      
atletica              atle-ti-ca           atle-ti-ca                    
ostracismo            ostra-ci-smo         ostra-ci-smo                  
cruento               cruen-to             cruen-to                      
paura                 pau-ra               pa-ù-ra                       
aiuta                 aiu-ta               a-iu-ta                       
diocesi               dio-ce-si            dio-ce-si                     
odracardo             odra-car-do          odra-car-do                   
anima                 ani-ma               à-ni-ma                       


### Testing hyphenation and synalepha

Each word is split into syllables separated by spaces. Then the text is tagged as follows: 
* spaces: `<S>`
* start of verse: `<V>`
* end of verse: `</V>`
* start of tercet: `<T>`
* <s>end of tercet: `</T>`</s> *(does not seem to be useful)*
* synalepha (between tokens A and B): `A~B`
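
The tagging scheme listed above can be sketched in a few lines. The helpers `tag_verse` and `untag_verse` are hypothetical names, not methods of `ComedyTokenizer`; the sketch assumes the words are already hyphenated with `-` and does not detect synalepha on its own (it only knows how to undo the `~` marker).

```python
from typing import List

# Structural tags that carry no text of their own
IGNORED_TAGS = {"<T>", "</T>", "<V>", "</V>"}


def tag_verse(hyphenated_words: List[str]) -> str:
    """Turn hyphenated words (e.g. 'mèz-zo') into a tagged verse without synalepha."""
    words = [w.replace("-", " ") for w in hyphenated_words]
    return "<V> " + " <S> ".join(words) + " </V>"


def untag_verse(tagged: str) -> str:
    """Rebuild readable text: drop verse/tercet tags, turn <S> and ~ back into spaces."""
    pieces = []
    for token in tagged.split():
        if token in IGNORED_TAGS:
            continue
        if token == "<S>":
            pieces.append(" ")
        else:
            # a synalepha joins the end of one word to the start of the next
            pieces.append(token.replace("~", " "))
    return "".join(pieces).strip()


verse = tag_verse(["nél", "mèz-zo", "dél", "cam-min", "di", "no-stra", "vì-ta"])
print(verse)
print(untag_verse(verse))
print(untag_verse("<V> a~un <S> piàn to~o~a~un <S> rì so </V>"))
```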

**NOTE**: the accented characters are retrieved from the hyphenation dictionary. If needed, they can easily be replaced by their corresponding unaccented characters when they do not occur at the end of a word.
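
One possible way to perform that replacement is sketched below, using only the standard library. The function names `strip_inner_accents` and `strip_inner_accents_in_phrase` are hypothetical, and the sketch assumes that "end of the word" means the very last character, so a word-final accent (as in "costì") is preserved.

```python
import unicodedata


def strip_inner_accents(word: str) -> str:
    """Remove accents from every character except the last one of the word."""
    # Work on the composed form so that an accented final letter is one character
    word = unicodedata.normalize("NFC", word)
    if not word:
        return word
    head, last = word[:-1], word[-1]
    # Decompose the head and drop the combining marks (the accents)
    decomposed = unicodedata.normalize("NFD", head)
    head = "".join(c for c in decomposed if not unicodedata.combining(c))
    return head + last


def strip_inner_accents_in_phrase(phrase: str) -> str:
    """Apply strip_inner_accents to each space-separated word."""
    return " ".join(strip_inner_accents(w) for w in phrase.split(" "))


print(strip_inner_accents_in_phrase("e tu ché se' costì ànima vìva"))
```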

In [None]:
test = ["e tu che se' costì anima viva",
        "nel mezzo del cammin di nostra vita",
        "mi ritrovai per una selva oscura", 
        "che la diritta via era smarrita",
        "a un pianto o a un riso",
        "selvaggia e aspra e forte"]

for t in test:
    s = tokenizer.tokenize_phrase(t)
    print("\n{:40}\n{:40}".format(s, tokenizer.clear_text(s)))



<V> e <S> tu <S> ché <S> se' <S> co stì <S> à ni ma <S> vì va </V>
e tu ché se' costì ànima vìva
          

<V> nél <S> mèz zo <S> dél <S> cam min <S> di <S> no stra <S> vì ta </V>
nél mèzzo dél cammin di nostra vìta
    

<V> mi <S> ri tro va i <S> per <S> ù na <S> sél va~o scu ra </V>
mi ritrovai per ùna sélva oscura
       

<V> ché <S> la <S> di rìt ta <S> vì a <S> è ra <S> smar ri ta </V>
ché la dirìtta vìa èra smarrita
        

<V> a~un <S> piàn to~o~a~un <S> rì so </V>
a un piànto o a un rìso
                

<V> sel vag gia~e <S> à spra~e <S> fòr te </V>
selvaggia e àspra e fòrte
              


## Tokenizing Dante's productions

Now let us tokenize all the datasets, starting from the Divine Comedy. 

In [None]:
for file_name in original_text_filename:
    
    # tokenize texts
    parsed_path = parsed_folder + parsed_text_filename.replace("*", file_name)
    with open(parsed_path, encoding="utf-8") as file:
        data = file.readlines()
        data = tokenizer.tokenize_text(data, use_tercets=(file_name == "commedia"))
        if file_name == "commedia":
            print(data[:6], "\n")
        
    # write tokenized text to file
    tokenized_path = tokenized_folder + tokenized_text_filename.replace("*", file_name)
    with open(tokenized_path, "w+", encoding="utf-8") as out:
        for line in data:
            out.write(line+"\n")
   
    print(f"'{file_name}' tokenized and saved as {tokenized_path}")

['<T> <V> nél <S> mèz zo <S> dél <S> cam min <S> di <S> no stra <S> vì ta </V>'
 '<V> mi <S> ri tro va i <S> per <S> ù na <S> sél va~o scu ra </V>'
 "<V> che' <S> la <S> di rìt ta <S> vì a <S> è ra <S> smar ri ta . </V>"
 '<T> <V> à hi <S> quàn to~a <S> dir <S> qual <S> è ra <S> è <S> cò sa <S> dù ra </V>'
 '<V> é sta <S> sél va <S> sel vag gia~e <S> à spra~e <S> fòr te </V>'
 '<V> ché <S> nél <S> pen sier <S> ri no va <S> la <S> pau ra ! </V>'] 

'commedia' tokenized and saved as DeepComedy/datasets/tokenized/tokenized_commedia.txt
'convivio' tokenized and saved as DeepComedy/datasets/tokenized/tokenized_convivio.txt
'detto' tokenized and saved as DeepComedy/datasets/tokenized/tokenized_detto.txt
'vita' tokenized and saved as DeepComedy/datasets/tokenized/tokenized_vita.txt
'fiore' tokenized and saved as DeepComedy/datasets/tokenized/tokenized_fiore.txt
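
Since each line of a tokenized file holds one tagged verse, the files are easy to consume afterwards. As a rough illustration, the sketch below splits a verse into its syllable tokens and counts them. The helper name `syllable_tokens` is an assumption, and so is the rule that a token without any letter (such as a standalone `.` or `!`) is punctuation.

```python
from typing import List

# Tags that mark structure rather than syllables
STRUCTURAL_TAGS = {"<T>", "</T>", "<V>", "</V>", "<S>"}


def syllable_tokens(tokenized_verse: str) -> List[str]:
    """Return the syllable tokens of a tagged verse.

    Syllables joined by a synalepha (A~B) stay together as a single token.
    Tokens with no alphabetic character are treated as punctuation and skipped.
    """
    return [
        token
        for token in tokenized_verse.split()
        if token not in STRUCTURAL_TAGS and any(ch.isalpha() for ch in token)
    ]


verse = "<T> <V> nél <S> mèz zo <S> dél <S> cam min <S> di <S> no stra <S> vì ta </V>"
tokens = syllable_tokens(verse)
print(tokens)
print(len(tokens))
```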
