In [5]:
import sys

sys.path.insert(0, "..")

In [6]:
from src.scrapper import parse_conllu_file
from src.visualization import plot_frequency_of_
from analyses.utils import (
    get_stats,
    build_counts,
    build_dataframes,
    print_top_tokens_given_tag,
    display_side_by_side,
    visualize_sample,
    get_sentence_idx_given_pair,
)
import pandas as pd
import numpy as np

In [7]:
pd.set_option("display.max_columns", 500)

# General information

## Catalan language

Catalan is spoken by approximately 9.2 million people. It is the official language of Andorra and is co-official in the Spanish autonomous communities of Catalonia, the Balearic Islands and Valencia (where it is also referred to as Valencian). It is also spoken in some areas of France, Italy and the community of Aragon.
Catalan is a Western Romance language, which means that it descends from Latin; hence it belongs to the Indo-European language family.

## The data
* Sentences from the corpus [AnCora](https://clic.ub.edu/corpus/)
* GitHub repository available [here](https://github.com/UniversalDependencies/UD_Catalan-AnCora/tree/master)
* Train dataset available [here](https://github.com/UniversalDependencies/UD_Catalan-AnCora/blob/master/ca_ancora-ud-train.conllu) (13123 sentences)
* Test dataset available [here](https://github.com/UniversalDependencies/UD_Catalan-AnCora/blob/master/ca_ancora-ud-test.conllu) (1846 sentences)
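
The `.conllu` files follow the CoNLL-U format: one token per line with ten tab-separated columns, comment lines starting with `#`, and a blank line between sentences. The snippet below is a minimal sketch of how such a file can be turned into lists of (token, tag) pairs. `read_conllu` is a hypothetical helper written for illustration, not the `parse_conllu_file` used in this notebook, and the sample fragment is hand-written rather than copied from the corpus.

```python
def read_conllu(text):
    """Parse CoNLL-U text into sentences of (form, upos) pairs (hypothetical helper)."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
        elif not line.startswith("#"):
            # Columns: ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC.
            columns = line.split("\t")
            current.append((columns[1].lower(), columns[3].lower()))
    if current:
        sentences.append(current)
    return sentences


# Illustrative fragment written by hand, not copied from the corpus files.
sample = "\n".join([
    "# text = del problema",
    "1-2\tdel\t_\t_\t_\t_\t_\t_\t_\t_",
    "1\tde\tde\tADP\t_\t_\t3\tcase\t_\t_",
    "2\tel\tel\tDET\t_\t_\t3\tdet\t_\t_",
    "3\tproblema\tproblema\tNOUN\t_\t_\t0\troot\t_\t_",
    "",
])
parsed = read_conllu(sample)
print(parsed)
```

Note how the contraction line (ID `1-2`) carries `_` in the UPOS column; a parser that keeps every line, as this sketch does, will therefore produce a `_` tag next to the regular ones.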

In [8]:
train_info = parse_conllu_file("../datasets/ca_ancora-ud-train.conllu")
test_info = parse_conllu_file("../datasets/ca_ancora-ud-test.conllu")

train_df = build_dataframes(train_info)
test_df = build_dataframes(test_info)

tags = train_df.tags.unique()

Let's take a look at the characteristics of the datasets:

In [9]:
get_stats(train_info)

Total sentences: 13123
Average sentence length: 34
Minimum sentence length: 2
Maximum sentence length: 250
Percentile 25, length: 21.0
Percentile 50, length: 32.0
Percentile 75, length: 44.0


* The **training dataset** contains more than 13k sentences.
* The longest sentence has 250 tokens, 125 times the length of the shortest one, which has only 2 tokens (probably a word and a punctuation sign).
    - Both lengths are, however, rare. They can be considered outliers, since the 25th and 75th percentiles are 21 and 44 tokens respectively.
* The average sentence length is 34 tokens, much longer than the English average. There are several possible reasons for this:
    - It could be expected if we consider that the English data was taken from sources where English was the L2 language, hence the sentences might be less elaborate.
    - Catalan has many particles that carry no meaning per se but are constantly (and necessarily) used, e.g. different kinds of pronouns, compound verb tenses that require several verbs, etc.
    - A quick inspection revealed that the contraction "del" ("of the"/"from the") is included in the corpus followed by the two words that compose it: "de" (of/from) and "el" (the). This yields three tokens ("del" tagged as `_`, "de" tagged as `adp` and "el" tagged as `det`), which is one of the factors that increase sentence length. It is also important to take this into account when analysing the Viterbi algorithm.
    - One could also simply consider that the two datasets cover very different topics and backgrounds, which can affect sentence length.
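
The figures above can be reproduced along the following lines. This is only a minimal sketch: `length_stats` is a hypothetical helper (not the actual `get_stats` from `analyses.utils`), it assumes that each sentence is a list of (token, tag) pairs, and the toy corpus is invented for illustration.

```python
import statistics


def length_stats(sentences):
    """Summarise sentence lengths (hypothetical helper, not analyses.utils.get_stats)."""
    lengths = sorted(len(sentence) for sentence in sentences)
    # The "inclusive" method interpolates linearly between the observed lengths.
    q25, q50, q75 = statistics.quantiles(lengths, n=4, method="inclusive")
    return {
        "total": len(lengths),
        "mean": statistics.mean(lengths),
        "min": lengths[0],
        "max": lengths[-1],
        "p25": q25,
        "p50": q50,
        "p75": q75,
    }


# Toy corpus: three sentences of (token, tag) pairs with lengths 2, 3 and 5.
toy = [
    [("efe", "propn"), (".", "punct")],
    [("el", "det"), ("tribunal", "propn"), (".", "punct")],
    [("ha", "aux"), ("confirmat", "verb"), ("la", "det"), ("condemna", "noun"), (".", "punct")],
]
stats = length_stats(toy)
print(stats)
```

Comparing `max` with `p75` is a quick way to see how far the longest sentence sits from the bulk of the data.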


In [10]:
sentence_lengths = [len(sentence) for sentence in train_info]
min_idx = np.argmin(sentence_lengths)
max_idx = np.argmax(sentence_lengths)

visualize_sample(train_info, min_idx)

Unnamed: 0,0,1
tokens,efe,.
tags,propn,punct


* The shortest sentence is a single proper noun. It is unclear why this proper noun appears isolated in a sentence; it is probably a result of preprocessing text from a webpage.
* For the sake of curiosity, it probably refers to [Agència EFE](https://ca.wikipedia.org/wiki/Ag%C3%A8ncia_EFE), a news agency.
* We could remove it from the dataset since it does not add much value, but since the data is expected to be already reviewed and quality-supervised, we will leave it here.

In [11]:
visualize_sample(train_info, max_idx)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,192,193,194,195,196,197,198,199,200,201,202,203,204,205,206,207,208,209,210,211,212,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,234,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249
tokens,entre,les,tasques,que,es,duran,a,terme,hi,ha,la,sensibilització,dels,de,els,pagesos,sobre,l',amplitud,del,de,el,problema,en,l',àmbit,agrari,",",perquè,',són,ells,mateixos,els,qui,han,de,defensar,la,pròpia,salut,i,la,de,les,seves,pròpies,famílies,',;,formació,dels,de,els,pagesos,i,altres,persones,i,institucions,vinculades,amb,l',activitat,rural,",",en,especial,dels,de,els,dirigents,rurals,;,servei,de,documentació,per,totes,aquelles,institucions,i,persones,que,es,vulguin,informar,sobre,els,diferents,aspectes,de,la,defensa,de,la,salut,laboral,;,assessorament,sobre,la,nova,normativa,de,prevenció,de,riscos,i,la,promoció,adequada,per,al,a,el,seu,compliment,;,estudi,dels,de,els,principals,problemes,que,es,produeixen,en,les,condicions,de,treball,dels,de,els,pagesos,;,implantació,d',una,xarxa,de,vigilància,i,alerta,dels,de,els,principals,incidents,produïts,en,aquest,àmbit,;,i,propostes,dirigides,a,la,millora,de,les,condicions,de,treball,",",amb,criteris,ergonòmics,i,a,l',aplicació,de,diferents,mètodes,d',organització,i,ordenació,del,de,el,treball,;,en,homenatge,a,l',àmbit,de,treball,que,cobria,en,lluís,nomen,a,la,comissió,permanent,del,de,el,sindicat,",",s',impulsarà,a,través,de,la,fundació,la,promoció,i,la,realització,d',estudis,",",investigacions,i,activitats,en,el,camp,de,la,preservació,dels,de,els,valors,productius,",",ecològics,i,culturals,de,l',espai,agrari,i,rural,de,catalunya,.
tags,adp,det,noun,pron,pron,verb,adp,noun,pron,verb,det,noun,_,adp,det,noun,adp,det,noun,_,adp,det,noun,adp,det,noun,adj,punct,sconj,punct,aux,pron,det,det,pron,aux,adp,verb,det,adj,noun,cconj,det,adp,det,det,adj,noun,punct,punct,noun,_,adp,det,noun,cconj,det,noun,cconj,noun,adj,adp,det,noun,adj,punct,adp,adj,_,adp,det,noun,adj,punct,noun,adp,noun,adp,det,det,noun,cconj,noun,pron,pron,verb,verb,adp,det,det,noun,adp,det,noun,adp,det,noun,adj,punct,noun,adp,det,adj,noun,adp,noun,adp,noun,cconj,det,noun,adj,adp,_,adp,det,det,noun,punct,noun,_,adp,det,adj,noun,pron,pron,verb,adp,det,noun,adp,noun,_,adp,det,noun,punct,noun,adp,det,propn,adp,propn,cconj,propn,_,adp,det,adj,noun,adj,adp,det,noun,punct,cconj,noun,adj,adp,det,noun,adp,det,noun,adp,noun,punct,adp,noun,adj,cconj,adp,det,noun,adp,det,noun,adp,noun,cconj,noun,_,adp,det,noun,punct,adp,noun,adp,det,noun,adp,noun,pron,verb,det,propn,propn,adp,det,propn,propn,_,adp,det,noun,punct,pron,verb,adp,noun,adp,det,propn,det,noun,cconj,det,noun,adp,noun,punct,noun,cconj,noun,adp,det,noun,adp,det,noun,_,adp,det,noun,adj,punct,adj,cconj,adj,adp,det,noun,adj,cconj,adj,adp,propn,punct


* The longest sentence describes an existing problem affecting farmers and enumerates the actions that will be taken to raise awareness of it.
* Sentences this long are uncommon; it is the detailed list of actions that makes it so long.

In [12]:
get_stats(test_info)

Total sentences: 1846
Average sentence length: 33
Minimum sentence length: 2
Maximum sentence length: 178
Percentile 25, length: 21.0
Percentile 50, length: 30.0
Percentile 75, length: 43.0


* The **test dataset** is about 14% of the training size (1846 vs. 13123 sentences).
* The shortest sentence still has 2 tokens and the longest one has 178, shorter than the longest training sentence yet still quite long.

In [13]:
sentence_lengths = [len(sentence) for sentence in test_info]
min_idx = np.argmin(sentence_lengths)
max_idx = np.argmax(sentence_lengths)

visualize_sample(test_info, min_idx)

Unnamed: 0,0,1
tokens,1994,.
tags,noun,punct


In [14]:
visualize_sample(test_info, max_idx)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,177
tokens,entre,els,actes,ja,concertats,en,destaca,l',espectacle,',testimoni,verdaguer,',",",al,a,el,teatre,nacional,de,catalunya,",",el,dia,29,de,juny,",",amb,una,diversitat,d',iniciatives,al,a,el,llarg,de,tota,la,jornada,;,un,conjunt,d',exposicions,",",organitzades,per,la,biblioteca,de,catalunya,",",la,primera,de,les,quals,s',inaugurarà,a,vic,el,7,de,juny,;,la,celebració,d',una,festa,tipogràfica,",",dins,de,la,popular,festa,dels,de,els,súpers,del,de,el,club,super,3,de,tv3,",",a,l',octubre,;,verdaguer,i,les,llengües,europees,",",un,homenatge,a,la,pluralitat,de,llengües,a,què,ha,estat,traduïda,l',obra,de,verdaguer,que,tindrà,lloc,a,berga,l',11,de,setembre,;,una,campanya,de,promoció,de,la,lectura,a,través,de,les,biblioteques,",",fent,ús,de,l',atlàntida,",",adreçada,sobretot,al,a,el,públic,infantil,i,juvenil,",",i,un,acte,de,cloenda,",",previst,per,al,a,el,novembre,al,a,el,palau,de,la,música,.
tags,adp,det,noun,adv,adj,pron,verb,det,noun,punct,propn,propn,punct,punct,_,adp,det,propn,propn,adp,propn,punct,det,noun,num,adp,noun,punct,adp,det,noun,adp,noun,_,adp,det,noun,adp,det,det,noun,punct,det,noun,adp,noun,punct,adj,adp,det,propn,adp,propn,punct,det,adj,adp,det,pron,pron,verb,adp,propn,det,num,adp,noun,punct,det,noun,adp,det,propn,propn,punct,adv,adp,det,adj,propn,_,adp,det,propn,_,adp,det,propn,propn,num,adp,propn,punct,adp,det,noun,punct,propn,cconj,det,propn,propn,punct,det,noun,adp,det,noun,adp,noun,adp,pron,aux,aux,verb,det,noun,adp,propn,pron,verb,noun,adp,propn,det,num,adp,noun,punct,det,noun,adp,noun,adp,det,noun,adp,noun,adp,det,noun,punct,verb,noun,adp,det,propn,punct,adj,adv,_,adp,det,noun,adj,cconj,adj,punct,cconj,det,noun,adp,noun,punct,adj,adp,_,adp,det,noun,_,adp,det,propn,adp,det,propn,punct


* Once more, the longest sentence contains an enumeration of detailed, descriptive items.
* Many tokens are also punctuation signs (quotes, commas, etc.).
* The shortest sentence in this case is a number (a year) on its own followed by a punctuation sign.

# Words

Let's now take a look at the distribution of the data:

✏️ It is recommended to play with the plots shown below. They can be interacted with in the following ways:
* Hover over the bars to see extra information (sometimes not shown in the axes due to lack of space).
* Zoom in on any interval to check specific values.
* Use the top-right menu to perform different operations, such as resetting the plot to its original state with the house icon.

In [15]:
train_word_counts, train_tag_counts, train_pair_counts = build_counts(train_info)
test_word_counts, test_tag_counts, test_pair_counts = build_counts(test_info)

In [16]:
plot_frequency_of_("words", train_word_counts, test_word_counts)

* This plot shows the frequency of the 50 most common tokens (showing all of them would be too much information).
* Both train and test data seem to have a similar distribution over the top tokens of the vocabulary.
* The most frequent tokens are "de" (of/from), the comma ",", and "el" and "la" (the masculine and feminine forms of the definite article "the").
* "a" has more occurrences than the "." token in the train data, while the order is reversed in the test data. However, in both datasets the two counts are very close.
* In general, the distribution of the tokens is quite similar in both cases.
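
Since the training set is several times larger than the test set, the two distributions are easier to compare as relative frequencies than as raw counts. The snippet below is a minimal sketch of that idea; `relative_frequencies` is a hypothetical helper (it is not `build_counts` or `plot_frequency_of_`), sentences are assumed to be lists of (token, tag) pairs, and the toy data is invented.

```python
from collections import Counter


def relative_frequencies(sentences, top_n=3):
    """Return the top_n tokens with their share of all tokens (hypothetical helper)."""
    counts = Counter(token for sentence in sentences for token, _tag in sentence)
    total = sum(counts.values())
    return [(token, count / total) for token, count in counts.most_common(top_n)]


# Toy data, invented for illustration only.
toy_train = [
    [("de", "adp"), ("la", "det"), ("casa", "noun"), (".", "punct")],
    [("de", "adp"), ("el", "det"), ("poble", "noun"), (".", "punct")],
]
toy_test = [[("de", "adp"), ("la", "det"), ("vila", "noun"), (".", "punct")]]

train_top = relative_frequencies(toy_train)
test_top = relative_frequencies(toy_test)
print(train_top)
print(test_top)
```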

# Tags

We have a total of 18 tags: the 17 universal POS tags (explained in detail [here](https://universaldependencies.org/u/pos/)) plus the placeholder `_`.
1. Noun $\rightarrow$ common nouns
2. Adp  $\rightarrow$ adposition
3. Det  $\rightarrow$ determiners
4. Punct    $\rightarrow$ punctuation signs
5. PropN    $\rightarrow$ proper nouns
6. Verb $\rightarrow$ verbs
7. Adj  $\rightarrow$ adjectives
8. Pron $\rightarrow$ pronouns
9. Aux  $\rightarrow$ auxiliary
10. `_`   $\rightarrow$ placeholder, not an actual part of speech. It is not included in the Universal Dependencies tag list and must be ignored. In this dataset it marks contractions such as "del", which are followed by their component words, as explained in the sections above.
11. Adv $\rightarrow$ adverb
12. Cconj   $\rightarrow$ coordinating conjunction
13. Sconj   $\rightarrow$ subordinating conjunction
14. Num $\rightarrow$ numeral
15. Sym $\rightarrow$ symbol
16. Part    $\rightarrow$ particle
17. Intj    $\rightarrow$ interjection
18. x   $\rightarrow$ other
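
A quick way to confirm which observed tags fall outside the Universal Dependencies inventory is a set difference against the 17 universal POS tags. This is a minimal sketch: `UPOS` is written out by hand (lowercased to match the tags in this notebook) and `non_upos_tags` is a hypothetical helper.

```python
# The 17 universal POS tags, lowercased to match the tags used in this notebook.
UPOS = {
    "adj", "adp", "adv", "aux", "cconj", "det", "intj", "noun", "num",
    "part", "pron", "propn", "punct", "sconj", "sym", "verb", "x",
}


def non_upos_tags(tags):
    """Return the tags that are not part of the UPOS inventory (hypothetical helper)."""
    return sorted(set(tags) - UPOS)


observed = ["det", "propn", "_", "adp", "det", "noun", "_"]
extra = non_upos_tags(observed)
print(extra)
```

On the real data, one could pass the `tags` array built earlier (`train_df.tags.unique()`) instead of the toy `observed` list.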

Let's check an example for each:

In [17]:
train_df[["tokens", "tags"]].drop_duplicates(["tags"])

Unnamed: 0,tokens,tags
0,el,det
1,tribunal,propn
3,(,punct
6,ha,aux
7,confirmat,verb
9,condemna,noun
10,a,adp
11,quatre,num
15,especial,adj
16,i,cconj


And their frequency and distribution:

In [18]:
plot_frequency_of_("tags", train_tag_counts, test_tag_counts)

* The distribution in both datasets is also similar.
* The test dataset does not contain any instance of `x` (other).

# Tag-Word pairs

Now let's check the frequency of the tag-word pairs. This is important because a word can be interpreted in different ways depending on the context (due to polysemy and/or homonymy).

In [19]:
plot_frequency_of_("word-pair tag", train_pair_counts, test_pair_counts)

The top pairs are the same in both datasets and are common function words such as determiners, conjunctions, adpositions, etc.

Let's now check which tokens have several tags assigned:

In [20]:
diff_tags_same_token = (
    train_df[["tokens", "tags"]].groupby(["tokens"]).agg(set).reset_index()
)
diff_tags_same_token["length_set"] = diff_tags_same_token.apply(
    lambda x: len(x["tags"]), axis=1
)
print(
    f"Total tokens with >1 tag assigned is {len(diff_tags_same_token[diff_tags_same_token.length_set > 1])}"
)
for nr_tags in range(1, max(diff_tags_same_token.length_set.unique()) + 1):
    print(
        f"Total tokens with {nr_tags} tags assigned is {len(diff_tags_same_token[diff_tags_same_token.length_set == nr_tags])}"
    )

Total tokens with >1 tag assigned is 3360
Total tokens with 1 tags assigned is 25742
Total tokens with 2 tags assigned is 2958
Total tokens with 3 tags assigned is 353
Total tokens with 4 tags assigned is 44
Total tokens with 5 tags assigned is 5


* Most tokens have 1, 2 or 3 tags assigned. A single tag is by far the most common case (25742 tokens); among the ambiguous tokens, 2 tags is the most common (2958 tokens).
* Tokens with more than three tags are rare, but they exist.
* Judging by the tables below, and although the context would be needed to verify it, the datasets likely need a revision: for the five tokens with five different tags, some of the tags do not seem to fit. For instance, there seems to be a bias towards tagging some of them as `propn` (proper nouns). Again, we would need to check the context in the sentences.
* These tokens are important to keep in mind, since they can be problematic when testing and analysing Viterbi. Having so many possible tags can make the algorithm unstable when predicting the tag for words like "força", "total", "segons", "mateix" and "cap". A wrong prediction for one of them can propagate the error to the rest of the sentence.
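
The same ambiguity count can also be sketched without pandas, working directly on the parsed sentences. `tags_per_token` is a hypothetical helper, sentences are assumed to be lists of (token, tag) pairs, and the toy data is invented for illustration.

```python
from collections import Counter, defaultdict


def tags_per_token(sentences):
    """Map every token to the set of tags it receives (hypothetical helper)."""
    mapping = defaultdict(set)
    for sentence in sentences:
        for token, tag in sentence:
            mapping[token].add(tag)
    return mapping


toy = [
    [("que", "pron"), ("ha", "aux"), ("fet", "verb")],
    [("que", "sconj"), ("ha", "verb"), ("cap", "det")],
]
mapping = tags_per_token(toy)
# Histogram: how many tokens have exactly k distinct tags.
histogram = Counter(len(tags) for tags in mapping.values())
ambiguous = sorted(token for token, tags in mapping.items() if len(tags) > 1)
print(dict(histogram))
print(ambiguous)
```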

In [21]:
diff_tags_same_token[diff_tags_same_token.length_set > 1].sort_values(
    by=["length_set"], ascending=False
).head(10)

Unnamed: 0,tokens,tags,length_set
13705,força,"{adv, det, noun, verb, propn}",5
27157,total,"{adv, det, noun, adj, propn}",5
25218,segons,"{adp, sconj, noun, adj, propn}",5
18197,mateix,"{adv, det, pron, noun, adj}",5
5984,cap,"{adp, det, pron, noun, propn}",5
28545,vista,"{noun, propn, adj, verb}",4
26588,tant,"{sconj, noun, adv, det}",4
20076,on,"{pron, noun, propn, sconj}",4
22984,qualsevol,"{pron, propn, adj, det}",4
22999,quart,"{noun, propn, adj, num}",4


Even among tokens with two tags, we find the same tendency towards tagging words as proper nouns; in particular, the pair {noun, propn} appears several times.
The dataset might contain some errors introduced during manual tagging.

In [22]:
diff_tags_same_token[diff_tags_same_token.length_set <= 2].sort_values(
    by=["length_set"], ascending=False
).head(10)

Unnamed: 0,tokens,tags,length_set
14530,granja,"{noun, propn}",2
22815,publicat,"{adj, verb}",2
22780,psicològica,"{propn, adj}",2
6909,codi,"{noun, propn}",2
6901,coca,"{noun, propn}",2
22806,pubilla,"{noun, propn}",2
22808,publicacions,"{noun, propn}",2
22810,publicada,"{adj, verb}",2
22811,publicades,"{adj, verb}",2
6885,cobra,"{propn, verb}",2


In [23]:
idx = get_sentence_idx_given_pair(train_info, ("pubilla", "propn"))
visualize_sample(train_info, idx)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23
tokens,les,ajudes,",",que,seran,"""",atractives,"""",",",segons,corbacho,",",també,es,podran,sol·licitar,a,pubilla,cases,i,a,la,florida,.
tags,det,noun,punct,pron,aux,punct,adj,punct,punct,adp,propn,punct,adv,pron,aux,verb,adp,propn,propn,cconj,adp,det,propn,punct


In [28]:
idx = get_sentence_idx_given_pair(train_info, ("coca", "propn"))
visualize_sample(train_info, idx)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44
tokens,convençuda,que,vendre,un,saramago,a,un,addicte,als,a,els,best-sellers,més,fàcils,pot,perjudicar,seriosament,la,salut,(,o,com,a,mínim,sacsejar,algun,cervell,),",",maite,coca,suma,17,anys,de,prescripcions,literàries,personalitzades,i,sense,error,en,els,diagnòstics,.
tags,adj,sconj,verb,det,noun,adp,det,adj,_,adp,det,noun,adv,adj,aux,verb,adv,det,noun,punct,cconj,sconj,adp,noun,verb,det,noun,punct,punct,propn,propn,verb,num,noun,adp,noun,adj,adj,cconj,adp,noun,adp,det,noun,punct


In [29]:
idx = get_sentence_idx_given_pair(train_info, ("força", "propn"))
visualize_sample(train_info, idx)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62
tokens,amb,el,lema,',amb,la,nostra,força,millorem,la,salut,en,el,treball,',",",en,aquest,tercer,congrés,de,ccoo,a,les,comarques,gironines,en,què,han,participat,147,dels,de,els,165,delegats,que,hi,ha,",",s',han,debatut,els,informes,de,gestió,i,balanç,del,de,el,secretari,general,",",jordi,preses,i,el,programa,de,futur,.
tags,adp,det,noun,punct,adp,det,det,propn,propn,det,propn,adp,det,propn,punct,punct,adp,det,adj,noun,adp,propn,adp,det,noun,adj,adp,pron,aux,verb,num,_,adp,det,num,noun,pron,pron,verb,punct,pron,aux,verb,det,noun,adp,noun,cconj,noun,_,adp,det,noun,adj,punct,propn,propn,cconj,det,noun,adp,noun,punct


* The first two examples above show correct tagging of "pubilla" and "coca" as proper nouns: the former is the first word of a compound place name ("Pubilla Cases") and the latter is a person's surname.
* The last one, however, shows an incorrect tagging of "força": in this context it should be a (common) noun.
* We need to take into account that some of the input data will be wrong, and this can affect our model.
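
The lookups above boil down to finding a sentence that contains a given (token, tag) pair. A minimal sketch of such a search is shown below; `find_sentence_with_pair` is a hypothetical helper and not the implementation of `get_sentence_idx_given_pair`. It assumes sentences are lists of (token, tag) tuples and returns the first match only.

```python
def find_sentence_with_pair(sentences, pair):
    """Return the index of the first sentence containing pair, or None (hypothetical helper)."""
    for idx, sentence in enumerate(sentences):
        if pair in sentence:
            return idx
    return None


# Toy data, invented for illustration only.
toy = [
    [("la", "det"), ("coca", "noun"), ("és", "aux"), ("bona", "adj")],
    [("maite", "propn"), ("coca", "propn"), ("suma", "verb")],
]
found = find_sentence_with_pair(toy, ("coca", "propn"))
missing = find_sentence_with_pair(toy, ("coca", "verb"))
print(found, missing)
```

Returning every matching index instead of the first one would make it easier to judge whether a suspicious tag is a one-off or a systematic pattern.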

Lastly, we will display the top 5 words for each tag that we find in train and test:

In [26]:
train_dfs = [print_top_tokens_given_tag(train_df, tag) for tag in tags]
display_side_by_side(*train_dfs)

Unnamed: 0,tags,tokens,count
0,det,el,18635
1,det,la,15053
2,det,l',7902
3,det,els,7011
4,det,les,4767

Unnamed: 0,tags,tokens,count
0,propn,catalunya,619
1,propn,barcelona,517
2,propn,govern,349
3,propn,generalitat,342
4,propn,sant,247

Unnamed: 0,tags,tokens,count
0,punct,",",20933
1,punct,.,12972
2,punct,"""",3525
3,punct,',2943
4,punct,(,1326

Unnamed: 0,tags,tokens,count
0,aux,va,4003
1,aux,ha,3699
2,aux,és,1685
3,aux,van,1221
4,aux,han,1096

Unnamed: 0,tags,tokens,count
0,verb,fer,582
1,verb,té,376
2,verb,ha,370
3,verb,fa,328
4,verb,fet,224

Unnamed: 0,tags,tokens,count
0,noun,any,700
1,noun,anys,693
2,noun,milions,492
3,noun,pessetes,457
4,noun,president,362

Unnamed: 0,tags,tokens,count
0,adp,de,30344
1,adp,a,13207
2,adp,d',6684
3,adp,per,5791
4,adp,en,5570

Unnamed: 0,tags,tokens,count
0,num,dos,479
1,num,tres,358
2,num,dues,244
3,num,quatre,197
4,num,cent,176

Unnamed: 0,tags,tokens,count
0,adj,gran,263
1,adj,passat,261
2,adj,primer,242
3,adj,nou,222
4,adj,general,200

Unnamed: 0,tags,tokens,count
0,cconj,i,10038
1,cconj,o,629
2,cconj,però,621
3,cconj,ni,211
4,cconj,sinó,138

Unnamed: 0,tags,tokens,count
0,_,del,5379
1,_,al,2617
2,_,dels,2070
3,_,als,792
4,_,pel,501

Unnamed: 0,tags,tokens,count
0,pron,que,5664
1,pron,_,4999
2,pron,es,2648
3,pron,s',1813
4,pron,hi,1046

Unnamed: 0,tags,tokens,count
0,sconj,que,5280
1,sconj,com,1448
2,sconj,perquè,509
3,sconj,si,481
4,sconj,quan,353

Unnamed: 0,tags,tokens,count
0,adv,no,2218
1,adv,més,1723
2,adv,també,784
3,adv,ja,580
4,adv,després,457

Unnamed: 0,tags,tokens,count
0,sym,%,203
1,sym,50%,15
2,sym,10%,12
3,sym,30%,12
4,sym,40%,11

Unnamed: 0,tags,tokens,count
0,part,no,113

Unnamed: 0,tags,tokens,count
0,x,8'5,2
1,x,1'2,1
2,x,17'5,1
3,x,32'5,1
4,x,38'3,1

Unnamed: 0,tags,tokens,count
0,intj,vaja,3
1,intj,he,2
2,intj,compte,1
3,intj,déu,1
4,intj,home,1


* The most common determiners are the definite articles (in their feminine, masculine, singular and plural forms).
* The most common proper nouns are Catalunya, Barcelona, Govern and Generalitat, which suggests that the texts come from government-related documents.
* `Ha` is labelled both as `verb` and as `aux` and is very common under both tags; this token might be confusing for the model.
* `És` is labelled as an auxiliary, which could be misleading depending on the context (it is the third person singular of the verb "to be").
* The `num` and `x` categories could be considered equivalent, since `x` seems to contain numeric values written with digits whereas `num` contains numbers written as words.
* `adv` and `part` both contain the token "no" and could also be considered equivalent.
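
To get a feeling for how skewed these shared tokens are, the counts from the training tables above can be turned into relative frequencies. This is a minimal sketch: `tag_shares` is a hypothetical helper, and only the two tags shown in the tables are considered for each token, so any other tag the token may carry is ignored.

```python
def tag_shares(counts):
    """Turn per-tag counts of one token into relative frequencies (hypothetical helper)."""
    total = sum(counts.values())
    return {tag: count / total for tag, count in counts.items()}


# Counts copied from the training tables above; other tags of the same token are ignored.
ha = tag_shares({"aux": 3699, "verb": 370})
no = tag_shares({"adv": 2218, "part": 113})

for token, shares in (("ha", ha), ("no", no)):
    print(token, {tag: f"{share:.1%}" for tag, share in shares.items()})
```

With these counts, "ha" is an auxiliary roughly nine times out of ten, so a tagger that looked only at the token would rarely choose `verb`; the surrounding context would have to tip the decision.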

In [27]:
test_dfs = [print_top_tokens_given_tag(test_df, tag) for tag in tags]
display_side_by_side(*test_dfs)

Unnamed: 0,tags,tokens,count
0,det,el,2567
1,det,la,2203
2,det,l',1155
3,det,els,984
4,det,les,645

Unnamed: 0,tags,tokens,count
0,propn,catalunya,76
1,propn,barcelona,70
2,propn,generalitat,63
3,propn,govern,44
4,propn,josep,43

Unnamed: 0,tags,tokens,count
0,punct,",",2947
1,punct,.,1837
2,punct,',491
3,punct,"""",456
4,punct,(,179

Unnamed: 0,tags,tokens,count
0,aux,va,609
1,aux,ha,475
2,aux,és,215
3,aux,van,162
4,aux,han,139

Unnamed: 0,tags,tokens,count
0,verb,fer,77
1,verb,fa,52
2,verb,té,45
3,verb,ha,38
4,verb,tenir,28

Unnamed: 0,tags,tokens,count
0,noun,any,104
1,noun,anys,89
2,noun,milions,83
3,noun,obra,57
4,noun,pessetes,57

Unnamed: 0,tags,tokens,count
0,adp,de,4332
1,adp,a,1803
2,adp,d',881
3,adp,en,801
4,adp,per,762

Unnamed: 0,tags,tokens,count
0,num,dos,70
1,num,tres,59
2,num,dues,30
3,num,un,27
4,num,set,26

Unnamed: 0,tags,tokens,count
0,adj,passat,38
1,adj,general,36
2,adj,nou,30
3,adj,primer,30
4,adj,primera,29

Unnamed: 0,tags,tokens,count
0,cconj,i,1374
1,cconj,o,89
2,cconj,però,74
3,cconj,ni,29
4,cconj,sinó,23

Unnamed: 0,tags,tokens,count
0,_,del,769
1,_,al,323
2,_,dels,285
3,_,als,122
4,_,pel,69

Unnamed: 0,tags,tokens,count
0,pron,que,779
1,pron,_,599
2,pron,es,356
3,pron,s',277
4,pron,hi,141

Unnamed: 0,tags,tokens,count
0,sconj,que,625
1,sconj,com,190
2,sconj,quan,58
3,sconj,perquè,50
4,sconj,si,45

Unnamed: 0,tags,tokens,count
0,adv,no,282
1,adv,més,215
2,adv,també,120
3,adv,ja,105
4,adv,ahir,61

Unnamed: 0,tags,tokens,count
0,sym,"0,2%",4
1,sym,5%,4
2,sym,"0,1%",3
3,sym,"1,5%",3
4,sym,"3,5%",3

Unnamed: 0,tags,tokens,count
0,part,no,21

Unnamed: 0,tags,tokens,count

Unnamed: 0,tags,tokens,count
0,intj,hola,1
1,intj,marbella,1
2,intj,sóc,1


* There is no `x` category here, as mentioned before.
* "Que" appears as the top token for both `pron` and `sconj`. We will need to look at how the model deals with this particle.
* The top proper nouns are almost the same as in the training dataset.
* `adv` and `part` share the token "no", as in the training dataset.

# Conclusions

* The Catalan dataset consists of ~13k training sentences plus a test set about 14% of that size.
* The sentences are long and can reach up to 250 tokens.
    * Some of those are long lists enumerating different detailed items.
    * Some are as short as a single word followed by the sentence-final period '.'.
* Both datasets have a similar distribution in terms of frequency of words and tags.
* Several tags can be assigned to the same token. The most ambiguous tokens have up to 5 different tags in different contexts, due to the different meanings a word can take in Catalan and the lack of cases/declensions in the language (which could help to differentiate the type of word).
* Some word sequences are contracted into a single form (e.g. "de" + "el" becomes "del", the union of a preposition and a determiner). To deal with this, the data keeps the contraction itself ("del") tagged as `_` and adds the two words that compose it right after ("de" + "el"). As a result, a sentence contains more tokens than words.
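
One possible way to handle the contractions before training a tagger is to drop the surface form tagged `_` and keep only its components. The snippet below is a minimal sketch of that option, not something this notebook does: `drop_multiword_surface_forms` is a hypothetical helper, and whether to drop or keep these tokens remains a modelling decision.

```python
def drop_multiword_surface_forms(sentence, placeholder="_"):
    """Remove pairs whose tag is the placeholder, e.g. ("del", "_") (hypothetical helper)."""
    return [(token, tag) for token, tag in sentence if tag != placeholder]


# Fragment of the longest training sentence shown earlier.
sentence = [
    ("l'", "det"), ("amplitud", "noun"),
    ("del", "_"), ("de", "adp"), ("el", "det"),
    ("problema", "noun"),
]
cleaned = drop_multiword_surface_forms(sentence)
print(cleaned)
print(len(sentence), "->", len(cleaned))
```

The filter looks at the tag, not the token, so a token that is itself written `_` but carries a regular tag (such as the `pron` entries in the tables above) is left untouched.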