In [22]:
import sys

sys.path.insert(0, "..")

In [23]:
from src.scrapper import parse_conllu_file
from src.visualization import plot_frequency_of_
from analyses.utils import (
    get_stats,
    build_counts,
    build_dataframes,
    print_top_tokens_given_tag,
    display_side_by_side,
    visualize_sample,
    get_sentence_idx_given_pair,
)
import pandas as pd
import numpy as np

In [24]:
pd.set_option("display.max_columns", 500)

# General information

## English language

English has approximately 1.457 billion speakers and is the most widely learned language in the world: people who speak it as a second language (L2) outnumber native (L1) speakers by almost three to one.
English is a West Germanic language of the Indo-European family. Unlike Catalan, it does not descend from Latin.

## The data

* The corpus is referred to as ESLSpok in the project, but its data comes from [NICT JLE](https://alaginrc.nict.go.jp/nict_jle/index_E.html), a corpus of spoken second-language English
* GitHub repository available [here](https://github.com/UniversalDependencies/UD_English-ESLSpok/tree/master)
* Train dataset available [here](https://github.com/UniversalDependencies/UD_English-ESLSpok/blob/master/en_eslspok-ud-train.conllu) (1856 sentences)
* Test dataset available [here](https://github.com/UniversalDependencies/UD_English-ESLSpok/blob/master/en_eslspok-ud-test.conllu) (232 sentences)
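
The `.conllu` files linked above follow the CoNLL-U layout: one token per line with ten tab-separated columns, `#` comment lines, and a blank line between sentences. The sketch below shows one way such a file could be turned into (token, tag) pairs. `read_conllu_pairs` is a hypothetical helper and not the project's `parse_conllu_file`; lowercasing the tokens and tags, and skipping multiword-token ranges and empty nodes, are assumptions.

```python
from typing import List, Tuple

Sentence = List[Tuple[str, str]]


def read_conllu_pairs(text: str) -> List[Sentence]:
    """Hypothetical helper: turn CoNLL-U text into lists of (token, tag) pairs."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comments such as "# text = ..."
        columns = line.split("\t")
        token_id, form, upos = columns[0], columns[1], columns[3]
        if "-" in token_id or "." in token_id:
            continue  # assumption: skip multiword ranges (1-2) and empty nodes (1.1)
        current.append((form.lower(), upos.lower()))
    if current:
        sentences.append(current)
    return sentences


# Tiny hand-written sample in CoNLL-U layout (not taken from the corpus).
sample = (
    "# text = It was lucky.\n"
    "1\tIt\tit\tPRON\t_\t_\t3\tnsubj\t_\t_\n"
    "2\twas\tbe\tAUX\t_\t_\t3\tcop\t_\t_\n"
    "3\tlucky\tlucky\tADJ\t_\t_\t0\troot\t_\t_\n"
    "4\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_\n"
    "\n"
)
parsed = read_conllu_pairs(sample)
print(parsed)
```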

In [25]:
train_info = parse_conllu_file("../datasets/en_eslspok-ud-train.conllu")
test_info = parse_conllu_file("../datasets/en_eslspok-ud-test.conllu")

train_df = build_dataframes(train_info)
test_df = build_dataframes(test_info)

tags = train_df.tags.unique()

Let's take a look at the characteristics of the datasets:

In [26]:
get_stats(train_info)

Total sentences: 1856
Average sentence length: 9
Minimum sentence length: 2
Maximum sentence length: 48
Percentile 25, length: 5.0
Percentile 50, length: 7.0
Percentile 75, length: 11.0


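The training set contains 1856 sentences. Most are short: half have 7 tokens or fewer and three quarters have 11 or fewer. The average (9) sits above the median (7) because a few long sentences, up to 48 tokens, pull it up.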
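
As a rough illustration of where figures like these come from, the sketch below computes the same kind of summary with NumPy. `length_stats` is a hypothetical stand-in, not the project's `get_stats`; it only assumes that each sentence supports `len()`, as in the cells of this notebook. How `get_stats` rounds the average is not shown here, so the sketch returns the unrounded mean.

```python
import numpy as np


def length_stats(sentences):
    """Hypothetical helper: summary statistics of sentence lengths."""
    lengths = np.array([len(sentence) for sentence in sentences])
    return {
        "total": int(lengths.size),
        "mean": float(lengths.mean()),
        "min": int(lengths.min()),
        "max": int(lengths.max()),
        "p25": float(np.percentile(lengths, 25)),
        "p50": float(np.percentile(lengths, 50)),
        "p75": float(np.percentile(lengths, 75)),
    }


# Toy input: three "sentences" of 2, 4 and 6 tokens.
toy_sentences = [["a", "b"], ["a", "b", "c", "d"], ["a", "b", "c", "d", "e", "f"]]
toy_stats = length_stats(toy_sentences)
print(toy_stats)
```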

In [27]:
sentence_lengths = [len(sentence) for sentence in train_info]
min_idx = np.argmin(sentence_lengths)
max_idx = np.argmax(sentence_lengths)

visualize_sample(train_info, min_idx)

Unnamed: 0,0,1
tokens,nnto,
tags,x,x


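The shortest training sentence has two tokens, both tagged `x`: the first is `nnto` and the second is empty in the output.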

In [28]:
visualize_sample(train_info, max_idx)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47
tokens,when,you,sit,down,on,the,chairs,",",and,also,take,off,the,chairs,",",some,beginners,get,the,poles,stuck,on,the,snow,",",and,some,people,really,getting,trouble,sticking,on,the,poles,and,your,body,stuck,on,the,poles,",",and,finally,chairs,stop,.
tags,sconj,pron,verb,adp,adp,det,noun,punct,cconj,adv,verb,adp,det,noun,punct,det,noun,verb,det,noun,verb,adp,det,noun,punct,cconj,det,noun,adv,verb,noun,verb,adp,det,noun,cconj,pron,noun,verb,adp,det,noun,punct,cconj,adv,noun,verb,punct


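The longest training sentence has 48 tokens. It is a single stretch of speech made of several clauses chained with `and`.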

In [29]:
get_stats(test_info)

Total sentences: 232
Average sentence length: 10
Minimum sentence length: 2
Maximum sentence length: 60
Percentile 25, length: 5.0
Percentile 50, length: 8.0
Percentile 75, length: 12.0


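The test set has 232 sentences, one eighth of the training set's 1856. Its length distribution is similar: the quartiles are 5, 8 and 12 tokens against 5, 7 and 11 in training, and the longest sentence has 60 tokens.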

In [30]:
sentence_lengths = [len(sentence) for sentence in test_info]
min_idx = np.argmin(sentence_lengths)
max_idx = np.argmax(sentence_lengths)

visualize_sample(test_info, min_idx)

Unnamed: 0,0,1
tokens,she,nawatobi
tags,pron,x


In [31]:
visualize_sample(test_info, max_idx)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59
tokens,for,example,",",the,man,close,to,the,window,is,chewing,gum,",",and,he,looks,out,of,the,class,",",and,behind,him,",",there,are,two,girls,",",who,are,chatting,with,each,other,",",and,in,the,center,of,this,class,",",there,is,one,guy,who,is,drinking,and,who,is,listening,to,the,c,d.
tags,adp,noun,punct,det,noun,adj,part,det,noun,aux,verb,noun,punct,cconj,pron,verb,adp,adp,det,noun,punct,cconj,adp,pron,punct,pron,verb,num,noun,punct,pron,aux,verb,adp,det,adj,punct,cconj,adp,det,noun,adp,det,noun,punct,pron,verb,num,noun,pron,aux,verb,cconj,pron,aux,verb,part,det,noun,noun


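The longest test sentence has 60 tokens and, like the longest training sentence, chains several clauses with `and`. Its final `c d.` is split into two tokens, both tagged `noun`.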

# Words

Let's now take a look at the distribution of the data:

✏️ The plots below are interactive, and we recommend exploring them:
* Hover over a bar to see extra information (sometimes not shown on the axes for lack of space).
* Zoom into any interval to check specific values.
* Use the top-right menu for other operations, such as resetting the plot to its original state with the house icon.

In [32]:
train_word_counts, train_tag_counts, train_pair_counts = build_counts(train_info)
test_word_counts, test_tag_counts, test_pair_counts = build_counts(test_info)
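
For readers who want to see what the three counters might hold, here is a minimal sketch using `collections.Counter`. `count_words_tags_pairs` is a hypothetical function and not the project's `build_counts`; it assumes each sentence is a list of (token, tag) pairs, which may differ from the structure returned by `parse_conllu_file`.

```python
from collections import Counter


def count_words_tags_pairs(sentences):
    """Hypothetical helper: count tokens, tags and (token, tag) pairs."""
    word_counts, tag_counts, pair_counts = Counter(), Counter(), Counter()
    for sentence in sentences:
        for token, tag in sentence:
            word_counts[token] += 1
            tag_counts[tag] += 1
            pair_counts[(token, tag)] += 1
    return word_counts, tag_counts, pair_counts


# Toy input: the same token ("like") appears with two different tags.
toy_sentences = [
    [("i", "pron"), ("like", "verb"), ("it", "pron")],
    [("like", "intj"), ("it", "pron")],
]
toy_words, toy_tags, toy_pairs = count_words_tags_pairs(toy_sentences)
print(toy_words.most_common(2))
```

Keeping the pair counter separate from the word counter is what makes it possible to spot tokens that occur with more than one tag.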

In [33]:
plot_frequency_of_("words", train_word_counts, test_word_counts)

[EXPLANATION]


# Tags

[CHECK EXPLANATION]

The Universal Dependencies tagset has a total of 17 part-of-speech tags (see the [UD documentation](https://universaldependencies.org/u/pos/) for further explanation). They appear in lowercase in the dataframes:
1. Noun $\rightarrow$ common nouns
2. Adp $\rightarrow$ adpositions
3. Det $\rightarrow$ determiners
4. Punct $\rightarrow$ punctuation marks
5. PropN $\rightarrow$ proper nouns
6. Verb $\rightarrow$ verbs
7. Adj $\rightarrow$ adjectives
8. Pron $\rightarrow$ pronouns
9. Aux $\rightarrow$ auxiliaries
10. Adv $\rightarrow$ adverbs
11. Cconj $\rightarrow$ coordinating conjunctions
12. Sconj $\rightarrow$ subordinating conjunctions
13. Num $\rightarrow$ numerals
14. Sym $\rightarrow$ symbols
15. Part $\rightarrow$ particles
16. Intj $\rightarrow$ interjections
17. X $\rightarrow$ other

Note that `-` is not part of the Universal Dependencies tagset. It marks a token that means nothing and must be ignored if it appears. Neither `-` nor `sym` shows up in the per-tag tables further down.
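
One quick sanity check is to compare the tags observed in a dataset with the 17 UD tags. The sketch below does this with plain sets. `compare_with_ud` and the toy `observed` list are illustrative assumptions; in this notebook the observed tags would come from something like `train_df.tags.unique()`.

```python
# The 17 universal part-of-speech tags, lowercased as in the dataframes.
UD_UPOS_TAGS = {
    "adj", "adp", "adv", "aux", "cconj", "det", "intj", "noun", "num",
    "part", "pron", "propn", "punct", "sconj", "sym", "verb", "x",
}


def compare_with_ud(observed_tags):
    """Hypothetical helper: report UD tags that are missing and tags outside UD."""
    observed = {tag.lower() for tag in observed_tags}
    return {
        "missing": sorted(UD_UPOS_TAGS - observed),
        "unexpected": sorted(observed - UD_UPOS_TAGS),
    }


# Toy input: three valid tags plus the "-" placeholder.
observed = ["pron", "aux", "noun", "-"]
report = compare_with_ud(observed)
print(report["unexpected"])
```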

Let's check an example for each:

In [34]:
train_df[["tokens", "tags"]].drop_duplicates(["tags"])

Unnamed: 0,tokens,tags
0,it,pron
1,was,aux
2,telephone,noun
4,",",punct
5,and,cconj
8,only,adv
13,know,verb
15,the,det
18,no,intj
24,lucky,adj


And their frequency and distribution:

In [35]:
plot_frequency_of_("tags", train_tag_counts, test_tag_counts)

[EXPLANATION]


# Tag-Word pairs

[EXPLANATION]


In [36]:
plot_frequency_of_("word-pair tag", train_pair_counts, test_pair_counts)

[EXPLANATION]

Let's now check which tokens have several tags assigned:

In [37]:
diff_tags_same_token = (
    train_df[["tokens", "tags"]].groupby(["tokens"]).agg(set).reset_index()
)
diff_tags_same_token["length_set"] = diff_tags_same_token.apply(
    lambda x: len(x["tags"]), axis=1
)
print(
    f"Total tokens with >1 tag assigned is {len(diff_tags_same_token[diff_tags_same_token.length_set > 1])}"
)
for nr_tags in range(1, max(diff_tags_same_token.length_set.unique()) + 1):
    print(
        f"Total tokens with {nr_tags} tags assigned is {len(diff_tags_same_token[diff_tags_same_token.length_set == nr_tags])}"
    )

Total tokens with >1 tag assigned is 171
Total tokens with 1 tags assigned is 1735
Total tokens with 2 tags assigned is 153
Total tokens with 3 tags assigned is 14
Total tokens with 4 tags assigned is 3
Total tokens with 5 tags assigned is 0
Total tokens with 6 tags assigned is 1


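Of the 1906 distinct tokens in the training set, 1735 (about 91%) always carry the same tag, while 171 (about 9%) appear with more than one. Most of the ambiguous tokens (153) have exactly two tags, and a single token reaches six.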
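
The same kind of count can also be obtained with `groupby(...).nunique()` and `value_counts()`. The sketch below runs on a toy dataframe with the same `tokens` and `tags` column names; the data is made up for illustration. Unlike the loop above, `value_counts()` simply omits tag counts that never occur instead of printing a zero.

```python
import pandas as pd

# Toy data: "like" and "so" each occur with two tags, "the" with one.
toy_df = pd.DataFrame(
    {
        "tokens": ["like", "like", "like", "so", "so", "the"],
        "tags": ["verb", "intj", "verb", "adv", "cconj", "det"],
    }
)

# Number of distinct tags seen for each token.
tags_per_token = toy_df.groupby("tokens")["tags"].nunique()

# Tokens with more than one tag, and how many tokens have 1, 2, ... tags.
ambiguous = tags_per_token[tags_per_token > 1]
distribution = tags_per_token.value_counts().sort_index()

print(f"Total tokens with >1 tag assigned is {len(ambiguous)}")
print(distribution)
```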


In [38]:
diff_tags_same_token[diff_tags_same_token.length_set > 1].sort_values(
    by=["length_set"], ascending=False
).head(10)

Unnamed: 0,tokens,tags,length_set
943,like,"{verb, adj, intj, sconj, adv, adp}",6
1498,so,"{cconj, intj, adv, sconj}",4
6,'s,"{aux, pron, verb, part}",4
1663,that,"{sconj, adv, det, pron}",4
106,at,"{sconj, adv, adp}",3
698,good,"{propn, intj, adj}",3
1353,right,"{intj, adv, adj}",3
1680,this,"{adv, det, pron}",3
1764,up,"{adv, x, adp}",3
148,before,"{sconj, adv, adp}",3


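The most ambiguous token is `like`, which appears with six tags (`verb`, `adj`, `intj`, `sconj`, `adv` and `adp`). It is followed by `so`, `'s` and `that`, with four tags each.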


In [39]:
diff_tags_same_token[diff_tags_same_token.length_set <= 2].sort_values(
    by=["length_set"], ascending=False
).head(10)

Unnamed: 0,tokens,tags,length_set
1408,second,"{noun, adj}",2
1726,travel,"{noun, verb}",2
1317,refund,"{noun, verb}",2
1118,no,"{intj, det}",2
421,day,"{noun, propn}",2
1521,sorry,"{intj, adj}",2
646,french,"{propn, adj}",2
1526,spanish,"{propn, adj}",2
184,book,"{noun, verb}",2
409,d,"{noun, x}",2


In [40]:
# idx = get_sentence_idx_given_pair(train_info, ("pubilla", "propn")) # LOOK FOR EXAMPLE
# visualize_sample(train_info, idx)

[EXPLANATION]


Lastly, we will display the top 5 words for each tag that we find in train and test:

In [41]:
train_dfs = [print_top_tokens_given_tag(train_df, tag) for tag in tags]
display_side_by_side(*train_dfs)

Unnamed: 0,tags,tokens,count
0,pron,i,840
1,pron,you,231
2,pron,it,226
3,pron,my,216
4,pron,we,92

Unnamed: 0,tags,tokens,count
0,aux,is,226
1,aux,'s,129
2,aux,was,112
3,aux,'m,84
4,aux,do,82

Unnamed: 0,tags,tokens,count
0,noun,time,44
1,noun,people,32
2,noun,car,31
3,noun,school,31
4,noun,train,31

Unnamed: 0,tags,tokens,count
0,punct,.,1622
1,punct,",",778
2,punct,?,144
3,punct,"""",48
4,punct,-,18

Unnamed: 0,tags,tokens,count
0,cconj,and,576
1,cconj,but,178
2,cconj,or,93
3,cconj,so,45
4,cconj,either,1

Unnamed: 0,tags,tokens,count
0,adv,so,218
1,adv,very,126
2,adv,just,58
3,adv,much,42
4,adv,now,41

Unnamed: 0,tags,tokens,count
0,verb,have,117
1,verb,go,92
2,verb,like,76
3,verb,know,75
4,verb,thank,73

Unnamed: 0,tags,tokens,count
0,det,the,508
1,det,a,294
2,det,this,49
3,det,some,35
4,det,that,27

Unnamed: 0,tags,tokens,count
0,intj,no,41
1,intj,yeah,35
2,intj,yes,34
3,intj,please,17
4,intj,like,14

Unnamed: 0,tags,tokens,count
0,adj,good,39
1,adj,last,34
2,adj,nice,29
3,adj,many,27
4,adj,other,22

Unnamed: 0,tags,tokens,count
0,part,to,463
1,part,n't,126
2,part,not,71
3,part,'s,22
4,part,',6

Unnamed: 0,tags,tokens,count
0,sconj,because,65
1,sconj,when,31
2,sconj,if,30
3,sconj,that,24
4,sconj,after,9

Unnamed: 0,tags,tokens,count
0,adp,in,200
1,adp,of,140
2,adp,for,97
3,adp,on,70
4,adp,with,68

Unnamed: 0,tags,tokens,count
0,num,one,58
1,num,two,32
2,num,four,10
3,num,three,9
4,num,five,8

Unnamed: 0,tags,tokens,count
0,propn,charlie,31
1,propn,tokyo,15
2,propn,hokkaido,14
3,propn,japan,14
4,propn,hiroshima,13

Unnamed: 0,tags,tokens,count
0,x,-,4
1,x,ku,3
2,x,nante,3
3,x,bye,2
4,x,nabe,2


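In the training set, the most frequent tokens for the closed-class tags are common function words (`i`, `is`, `the`, `and`, `to`, `in`). The `propn` list is dominated by Japanese place names (`tokyo`, `hokkaido`, `japan`, `hiroshima`), and the `x` tag includes tokens that do not look like English words (`ku`, `nante`, `nabe`).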
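
A per-tag ranking like the one above can be sketched with a pandas `groupby`. `top_tokens_for_tag` is a hypothetical function and not the project's `print_top_tokens_given_tag`; breaking ties alphabetically is an assumption made so that the result is deterministic.

```python
import pandas as pd


def top_tokens_for_tag(df, tag, k=5):
    """Hypothetical helper: the k most frequent tokens carrying a given tag."""
    return (
        df[df["tags"] == tag]
        .groupby(["tags", "tokens"])
        .size()
        .reset_index(name="count")
        # Highest count first; ties broken alphabetically (an assumption).
        .sort_values(["count", "tokens"], ascending=[False, True])
        .head(k)
        .reset_index(drop=True)
    )


# Toy data with the same column names as the notebook's dataframes.
toy_df = pd.DataFrame(
    {
        "tokens": ["i", "i", "you", "it", "i", "the"],
        "tags": ["pron", "pron", "pron", "pron", "pron", "det"],
    }
)
top_pron = top_tokens_for_tag(toy_df, "pron", k=2)
print(top_pron)
```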


In [42]:
test_dfs = [print_top_tokens_given_tag(test_df, tag) for tag in tags]
display_side_by_side(*test_dfs)

Unnamed: 0,tags,tokens,count
0,pron,i,109
1,pron,you,29
2,pron,it,27
3,pron,my,25
4,pron,she,13

Unnamed: 0,tags,tokens,count
0,aux,is,23
1,aux,'s,20
2,aux,do,17
3,aux,was,17
4,aux,'m,12

Unnamed: 0,tags,tokens,count
0,noun,time,8
1,noun,people,7
2,noun,house,5
3,noun,party,5
4,noun,work,5

Unnamed: 0,tags,tokens,count
0,punct,.,203
1,punct,",",116
2,punct,?,15
3,punct,"""",8
4,punct,...,3

Unnamed: 0,tags,tokens,count
0,cconj,and,83
1,cconj,but,27
2,cconj,so,8
3,cconj,or,4

Unnamed: 0,tags,tokens,count
0,adv,so,30
1,adv,very,21
2,adv,much,11
3,adv,just,9
4,adv,really,6

Unnamed: 0,tags,tokens,count
0,verb,like,13
1,verb,have,12
2,verb,know,12
3,verb,go,9
4,verb,thank,9

Unnamed: 0,tags,tokens,count
0,det,the,70
1,det,a,41
2,det,some,10
3,det,this,8
4,det,that,7

Unnamed: 0,tags,tokens,count
0,intj,yeah,8
1,intj,yes,7
2,intj,like,5
3,intj,no,3
4,intj,ok,2

Unnamed: 0,tags,tokens,count
0,adj,good,6
1,adj,last,6
2,adj,little,5
3,adj,interesting,4
4,adj,many,4

Unnamed: 0,tags,tokens,count
0,part,to,62
1,part,n't,25
2,part,not,13
3,part,'s,3
4,part,na,1

Unnamed: 0,tags,tokens,count
0,sconj,because,7
1,sconj,when,5
2,sconj,that,4
3,sconj,if,3
4,sconj,as,1

Unnamed: 0,tags,tokens,count
0,adp,in,22
1,adp,for,19
2,adp,of,19
3,adp,at,13
4,adp,from,5

Unnamed: 0,tags,tokens,count
0,num,one,8
1,num,two,8
2,num,three,3
3,num,eleven,1
4,num,five,1

Unnamed: 0,tags,tokens,count
0,propn,charlie,3
1,propn,hokkaido,2
2,propn,japan,2
3,propn,osaka,2
4,propn,saitama,2

Unnamed: 0,tags,tokens,count
0,x,jukai,1
1,x,m,1
2,x,m.,1
3,x,nawatobi,1


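The test set shows broadly the same picture: the top four pronouns (`i`, `you`, `it`, `my`) and the top two tokens for `det`, `adv`, `cconj` and `part` match the training set. The counts are much smaller, and `cconj` and `x` have fewer than five distinct tokens.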


# Conclusions

[EXPLANATION]
