# Part-of-Speech (POS) Tagger

To create and train our Part-of-Speech tagger, we used the Italian words, sentences and tags of the "Italian-VIT" treebank, available at https://github.com/UniversalDependencies/UD_Italian-VIT/tree/master. The treebank provides three files (train, dev and test) in CoNLL-U format (.conllu). To read the data, we use the method below: it stores every sentence introduced by the string "# text" and reads all the rows below it, keeping the word and POS tag found in the 2nd and 4th columns. A helper defined before it separates any commas or periods that are attached to words.

In [1]:
import pandas as pd
import re

def add_space(sentence):
    # Using regular expression to find comma or period attached to a word
    sentence = re.sub(r'(\w)([,.])', r'\1 \2', sentence)
    return sentence

def read_conllu(file_path):
    sentences = []
    words = []
    POS_tags = []
    index = []
    sentence = None

    with open(file_path, 'r', encoding='utf-8') as file:
        i = -1
        for line in file:
            line = line.strip()
            if line == '':
                if sentence:
                    sentences.append(add_space(sentence))
                    sentence = None
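            elif line.startswith('# text'):
                i += 1
                # Split on the first "=" only, so sentences that contain "=" are not truncated
                sentence = "<s>" + line.split("=", 1)[1] + " <e>"
            elif line.startswith('#'):
                continue
            else:
                # Token row: column 2 (index 1) is the word form, column 4 (index 3) is the POS tag
                tokens = line.split('\t')
                words.append(tokens[1])
                index.append(i)
                POS_tags.append(tokens[3])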
        sent_df = pd.DataFrame({'Sentences': sentences})
        words_df = pd.DataFrame({'Words': words,'Sentence': index, 'POS': POS_tags})
    return sent_df, words_df
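
To make the column layout concrete, the sketch below parses a tiny hand-written CoNLL-U fragment from a string. The fragment and the helper name `parse_conllu_text` are illustrative assumptions, not part of the treebank or of the notebook. It keeps the 2nd (word form) and 4th (POS tag) columns as above, and shows that a multiword-token row such as `2-3` carries `_` instead of a tag.

```python
import io

# Hand-written fragment in CoNLL-U layout (illustrative, not copied from the treebank).
SAMPLE = (
    "# sent_id = toy-1\n"
    "# text = Vado al mare.\n"
    "1\tVado\tandare\tVERB\t_\t_\t0\troot\t_\t_\n"
    "2-3\tal\t_\t_\t_\t_\t_\t_\t_\t_\n"
    "2\ta\ta\tADP\t_\t_\t4\tcase\t_\t_\n"
    "3\til\til\tDET\t_\t_\t4\tdet\t_\t_\n"
    "4\tmare\tmare\tNOUN\t_\t_\t1\tobl\t_\tSpaceAfter=No\n"
    "5\t.\t.\tPUNCT\t_\t_\t1\tpunct\t_\t_\n"
    "\n"
)

def parse_conllu_text(text):
    """Return a list of sentences, each a list of (form, pos) pairs."""
    sentences, current = [], []
    for line in io.StringIO(text):
        line = line.rstrip("\n")
        if not line:
            # A blank line closes the current sentence
            if current:
                sentences.append(current)
                current = []
        elif line.startswith("#"):
            # Comment/metadata rows are skipped
            continue
        else:
            cols = line.split("\t")
            current.append((cols[1], cols[3]))
    # Keep a trailing sentence even if the text does not end with a blank line
    if current:
        sentences.append(current)
    return sentences

toy_sentences = parse_conllu_text(SAMPLE)
print(toy_sentences[0])
```

The multiword token "al" is listed once as a range row tagged `_` and then again as its parts "a" and "il", which is why `_` later shows up among the labels.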

## Training, Dev and Test sets

For each of these sets, we keep one dataframe with the sentences and one with the words, their POS tags and the index of the sentence each word comes from.

In [2]:
train_sent_df, train_words_df = read_conllu("data/UD_Italian-VIT-master/it_vit-ud-train.conllu")
train_sent_df

Unnamed: 0,Sentences
0,<s> Le infrastrutture come fattore di competit...
1,<s> Negli ultimi anni la dinamica dei polo di ...
2,<s> Il raggiungimento e il mantenimento di pos...
3,<s> Quest'ultimo è funzione di variabili strut...
4,"<s> Il contesto milanese , se da un lato è sta..."
...,...
8272,<s> Premio Elsa Morante . <e>
8273,<s> È nato il premio Elsa Morante che verrà as...
8274,<s> Questo Premio che non avrà sede fissa né s...
8275,<s> sono promotori dell'iniziativa Patrizia Ca...


In [3]:
train_words_df

Unnamed: 0,Words,Sentence,POS
0,Le,0,DET
1,infrastrutture,0,NOUN
2,come,0,ADP
3,fattore,0,NOUN
4,di,0,ADP
...,...,...,...
241585,.,8275,PUNCT
241586,Libri,8276,NOUN
241587,in,8276,ADP
241588,campo,8276,NOUN


In [4]:
train_sent_df['Sentences'].apply(lambda x: len(x.split())).mean()

25.73082034553582

In [5]:
len(train_words_df["Words"].unique())

23050

In [6]:
test_sent_df, test_words_df = read_conllu("data/UD_Italian-VIT-master/it_vit-ud-test.conllu")
test_sent_df

Unnamed: 0,Sentences
0,<s> Non sono consentite assegnazioni provvisor...
1,<s> È consentita inoltre la partecipazione pro...
2,<s> I predetti motivi devono costituire oggett...
3,<s> In caso di ricongiungimento al familiare d...
4,<s> Ai fini della possibilità di presentazione...
...,...
1062,<s> Scrooge era il suo unico esecutore testame...
1063,"<s> Anzi il nostro Scrooge , che per verità il..."
1064,<s> Il ricordo dei funerali mi fa tornare al p...
1065,<s> Non c'è dunque dubbio che Marley era morto...


In [7]:
test_words_df

Unnamed: 0,Words,Sentence,POS
0,Non,0,ADV
1,sono,0,AUX
2,consentite,0,VERB
3,assegnazioni,0,NOUN
4,provvisorie,0,ADJ
...,...,...,...
27964,per,1066,ADP
27965,narrarvi,1066,_
27966,narrar,1066,VERB
27967,vi,1066,PRON


In [8]:
test_sent_df['Sentences'].apply(lambda x: len(x.split())).mean()

23.529522024367385

In [9]:
len(test_words_df["Words"].unique())

5851

In [10]:
dev_sent_df, dev_words_df = read_conllu("data/UD_Italian-VIT-master/it_vit-ud-dev.conllu")
dev_sent_df

Unnamed: 0,Sentences
0,"<s> Ha l'acqua calda , più o meno si veste . <e>"
1,<s> malgrado le guerre e i disastri naturali e...
2,<s> È come un'energia che sta crescendo comple...
3,"<s> L'onorevole Charles Rose , deputato democr..."
4,"<s> Da qualche tempo , la sua espressione pref..."
...,...
738,<s> Le gravi esigenze di salute dell'aspirante...
739,<s> Possono chiedere l'assegnazione provvisori...
740,<s> La relativa domanda va formulata contestua...
741,<s> Possono partecipare al movimento delle ass...


In [11]:
dev_words_df

Unnamed: 0,Words,Sentence,POS
0,Ha,0,VERB
1,l',0,DET
2,acqua,0,NOUN
3,calda,0,ADJ
4,",",0,PUNCT
...,...,...,...
31091,possono,742,AUX
31092,richiedere,742,VERB
31093,per,742,ADP
31094,trasferimento,742,NOUN


In [13]:
train_words_df['POS'].unique()

array(['DET', 'NOUN', 'ADP', 'PROPN', 'PUNCT', '_', 'ADJ', 'AUX', 'ADV',
       'VERB', 'PRON', 'CCONJ', 'SCONJ', 'NUM', 'SYM', 'X', 'INTJ',
       'PART'], dtype=object)

In [14]:
test_words_df['POS'].unique()

array(['ADV', 'AUX', 'VERB', 'NOUN', 'ADJ', '_', 'ADP', 'DET', 'PUNCT',
       'CCONJ', 'PRON', 'NUM', 'SCONJ', 'PROPN', 'X', 'INTJ', 'SYM'],
      dtype=object)

In [12]:
dev_words_df['POS'].unique()

array(['VERB', 'DET', 'NOUN', 'ADJ', 'PUNCT', 'ADV', 'CCONJ', 'PRON',
       'ADP', 'AUX', 'NUM', 'PROPN', '_', 'X', 'SCONJ', 'INTJ'],
      dtype=object)

In [12]:
dev_sent_df['Sentences'].apply(lambda x: len(x.split())).mean()

35.51682368775236

In [13]:
len(dev_words_df["Words"].unique())

3638

## Window sets creation

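Below we can see all the unique POS tags that will be used as labels in our classification. The "_" label is not a real tag: it comes from the multiword-token rows of the CoNLL-U files (such as "narrarvi", which is followed by its parts "narrar" and "vi"), which carry no POS tag of their own.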

In [14]:
train_words_df['POS'].unique()

array(['DET', 'NOUN', 'ADP', 'PROPN', 'PUNCT', '_', 'ADJ', 'AUX', 'ADV',
       'VERB', 'PRON', 'CCONJ', 'SCONJ', 'NUM', 'SYM', 'X', 'INTJ',
       'PART'], dtype=object)

In [15]:
len(train_words_df['POS'].unique())

18

With the methods below, we create three datasets (for training, development and testing) using a sliding-window approach. For each word at position i in a sentence, we store the previous word (i-1) and the next word (i+1), together with the POS tag of word i, which is our target column.

In [11]:
from tqdm import tqdm

def generate_three_word_combinations(sentence):
    # Build a [previous, current, next] window for every token except the <s>/<e> markers
    windows = []
    words = sentence.split()
    for i in range(len(words)):
        if words[i] in ("<s>", "<e>"):
            continue
        windows.append([words[i-1], words[i], words[i+1]])
    return windows

def generate_df(sent_df, word_df, set_name):
    new_data = {'Wi-1': [], 'Wi': [], 'Wi+1': [], "Wi_POS_tag": []}
    for i, sentence in enumerate(tqdm(sent_df['Sentences'], desc="Processing" + set_name + "Sentences")):
        for window in generate_three_word_combinations(sentence):
            # Look up the tag of the centre word within sentence i; windows whose centre word
            # does not match a treebank token (e.g. "l'acqua") are skipped
            pos = word_df[(word_df["Words"] == window[1]) & (word_df["Sentence"] == i)]['POS']
            if not pos.empty:
                new_data['Wi-1'].append(window[0])
                new_data['Wi'].append(window[1])
                new_data['Wi+1'].append(window[2])
                new_data["Wi_POS_tag"].append(pos.values[0])
    return pd.DataFrame(new_data)

dev_windows_df = generate_df(dev_sent_df, dev_words_df, "Development")
dev_windows_df

ProcessingDevelopmentSentences: 100%|██████████| 743/743 [01:31<00:00,  8.14it/s]


Unnamed: 0,Wi-1,Wi,Wi+1,Wi_POS_tag
0,<s>,Ha,l'acqua,VERB
1,l'acqua,calda,",",ADJ
2,calda,",",più,PUNCT
3,",",più,o,ADV
4,più,o,meno,CCONJ
...,...,...,...,...
23665,che,possono,richiedere,AUX
23666,possono,richiedere,per,VERB
23667,richiedere,per,trasferimento,ADP
23668,per,trasferimento,.,NOUN


In [12]:
train_windows_df = generate_df(train_sent_df, train_words_df, "Train")
test_windows_df = generate_df(test_sent_df, test_words_df, "Test")

ProcessingTrainSentences: 100%|██████████| 8277/8277 [1:26:05<00:00,  1.60it/s]
ProcessingTestSentences: 100%|██████████| 1067/1067 [01:14<00:00, 14.34it/s]


In [13]:
train_windows_df

Unnamed: 0,Wi-1,Wi,Wi+1,Wi_POS_tag
0,<s>,Le,infrastrutture,DET
1,Le,infrastrutture,come,NOUN
2,infrastrutture,come,fattore,ADP
3,come,fattore,di,NOUN
4,fattore,di,competitività,ADP
...,...,...,...,...
183845,Ramondino,.,<e>,PUNCT
183846,<s>,Libri,in,NOUN
183847,Libri,in,campo,ADP
183848,in,campo,.,NOUN


In [14]:
test_windows_df

Unnamed: 0,Wi-1,Wi,Wi+1,Wi_POS_tag
0,<s>,Non,sono,ADV
1,Non,sono,consentite,AUX
2,sono,consentite,assegnazioni,VERB
3,consentite,assegnazioni,provvisorie,NOUN
4,assegnazioni,provvisorie,nell'ambito,ADJ
...,...,...,...,...
21497,storia,che,son,PRON
21498,che,son,per,VERB
21499,son,per,narrarvi,ADP
21500,per,narrarvi,.,_


We save our sets to train the MLP in EX_10_B.ipynb.

In [15]:
train_windows_df.to_csv("train_windows.csv", index=False)
test_windows_df.to_csv("test_windows.csv", index=False)
dev_windows_df.to_csv("dev_windows.csv", index=False)

## Baseline Model

We developed a majority baseline for the classification: each word is assigned the tag it most frequently carries in the training set, and words not seen in training are assigned the most frequent tag of the training set overall.
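
As a toy illustration of this rule, the sketch below uses plain Python counters on a handful of made-up (word, tag) pairs. The data and the names `fit_majority` and `predict_majority` are hypothetical and separate from the pandas implementation in the next cell.

```python
from collections import Counter, defaultdict

def fit_majority(pairs):
    """Return (most frequent tag per word, most frequent tag overall)."""
    per_word = defaultdict(Counter)
    overall = Counter()
    for word, tag in pairs:
        per_word[word][tag] += 1
        overall[tag] += 1
    word_to_tag = {w: c.most_common(1)[0][0] for w, c in per_word.items()}
    return word_to_tag, overall.most_common(1)[0][0]

def predict_majority(words, word_to_tag, default_tag):
    # Unseen words fall back to the overall most frequent training tag
    return [word_to_tag.get(w, default_tag) for w in words]

# Made-up training pairs: "che" is PRON twice and SCONJ once, NOUN is the most common tag
toy_train = [("che", "PRON"), ("che", "SCONJ"), ("che", "PRON"),
             ("campo", "NOUN"), ("premio", "NOUN"), ("libri", "NOUN"),
             ("di", "ADP")]
word_to_tag, default_tag = fit_majority(toy_train)
toy_predictions = predict_majority(["che", "di", "Scrooge"], word_to_tag, default_tag)
print(toy_predictions)
```

Here "che" receives its majority tag, "di" its only tag, and the unseen word "Scrooge" falls back to the overall majority tag.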

In [16]:
from sklearn.metrics import classification_report

def majority_baseline(train_df, test_df):
    # Most frequent tag of each word in the training set
    most_frequent_tags = train_df.groupby('Words')['POS'].agg(lambda x: x.mode().iloc[0]).to_dict()
    # Overall most frequent training tag, used for words unseen in training
    most_frequent_tag_training = train_df['POS'].mode().iloc[0]
    # Assign the filled column directly: fillna(inplace=True) on a column selection may not update test_df
    test_df['prediction'] = test_df['Words'].map(most_frequent_tags).fillna(most_frequent_tag_training)
    report = classification_report(test_df['POS'], test_df['prediction'])
    print(report)

In [17]:
majority_baseline(train_words_df, test_words_df)

              precision    recall  f1-score   support

         ADJ       0.82      0.68      0.74      1555
         ADP       0.99      0.99      0.99      3862
         ADV       0.92      0.92      0.92      1234
         AUX       0.86      0.97      0.91      1058
       CCONJ       0.98      0.91      0.94       774
         DET       0.94      0.98      0.96      3898
        INTJ       0.69      0.69      0.69        16
        NOUN       0.72      0.95      0.82      4967
         NUM       0.98      0.86      0.91       418
        PRON       0.78      0.78      0.78      1259
       PROPN       0.98      0.56      0.71      1290
       PUNCT       1.00      1.00      1.00      3351
       SCONJ       0.83      0.39      0.53       326
         SYM       1.00      1.00      1.00         3
        VERB       0.93      0.63      0.75      2281
           X       0.86      0.27      0.41        44
           _       0.99      0.96      0.97      1633

    accuracy              