In [1]:
import pandas as pd

### GLAFF Format
Each GLAFF entry is a single line made up of 17 fields separated by `|`:

1. wordform
2. morphosyntactic tag in GRACE format
3. lemma
4. 0, 1 or several pronunciations (separated by a semicolon), encoded in IPA
5. SAMPA transcriptions
6. absolute frequency of the categorized form in Frantext 20e corpus
7. relative frequency (per million words) of the categorized form in Frantext 20e corpus
8. absolute frequency of the categorized lemma in Frantext 20e corpus
9. relative frequency (per million words) of the categorized lemma in Frantext 20e corpus
10. absolute frequency of the categorized form in LM10 corpus
11. relative frequency (per million words) of the categorized form in LM10 corpus
12. absolute frequency of the categorized lemma in LM10 corpus
13. relative frequency (per million words) of the categorized lemma in LM10 corpus
14. absolute frequency of the categorized form in FrWac corpus
15. relative frequency (per million words) of the categorized form in FrWac corpus
16. absolute frequency of the categorized lemma in FrWac corpus
17. relative frequency (per million words) of the categorized lemma in FrWac corpus

Only the first two fields are of interest here: the wordform and its GRACE morphosyntactic tag. A sample entry:

`übercélèbre|Afpfs|übercélèbre|y.bœʁ.se.lɛbʁ|y.b9R.se.lEbR|0|0|1|0|0|0|0|0|0|0|0|0`
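To make the field layout concrete, the sample entry above can be split on `|` and labelled. This is only a sketch: the field identifiers in `FIELDS` and the helper `parse_glaff_line` are names invented here to mirror the 17-part list, not part of GLAFF itself.

```python
# Hypothetical field names, chosen to mirror the 17-part list above.
FIELDS = [
    "wordform", "grace_tag", "lemma", "ipa", "sampa",
    "frantext_form_abs", "frantext_form_rel",
    "frantext_lemma_abs", "frantext_lemma_rel",
    "lm10_form_abs", "lm10_form_rel",
    "lm10_lemma_abs", "lm10_lemma_rel",
    "frwac_form_abs", "frwac_form_rel",
    "frwac_lemma_abs", "frwac_lemma_rel",
]


def parse_glaff_line(line):
    """Split one GLAFF line on '|' and label each of its 17 parts."""
    parts = line.rstrip("\n").split("|")
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    return dict(zip(FIELDS, parts))


sample = "übercélèbre|Afpfs|übercélèbre|y.bœʁ.se.lɛbʁ|y.b9R.se.lEbR|0|0|1|0|0|0|0|0|0|0|0|0"
entry = parse_glaff_line(sample)

# Field 4 may hold several pronunciations separated by semicolons.
ipa_variants = entry["ipa"].split(";")

print(entry["wordform"], entry["grace_tag"], ipa_variants)
```

All values stay as strings here; converting the frequency fields to numbers is left out because only the first two fields are used in what follows.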

In [17]:
file = "GLAFF-1.2.2/glaff-1.2.2.txt"

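# Only the first two columns are needed. keep_default_na=False stops pandas from
# turning any wordform that looks like a missing-value marker (e.g. "nan", "null") into NaN.
data = pd.read_csv(file, sep="|", names=['wordform', 'GRACE'], usecols=[0, 1], keep_default_na=False)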
data

Unnamed: 0,wordform,GRACE
0,übercélèbre,Afpfs
1,übercélèbre,Afpms
2,überconsommateur,Ncms
3,übersexuel,Ncms
4,ṯāʾ,Ncmp
...,...,...
1406852,zyva,Ncms
1406853,zyvas,Ncmp
1406854,zyzel,Ncms
1406855,zyzomys,Ncmp


### GRACE format

https://www.researchgate.net/publication/37441700_Linguistic_Issues_in_GRACE_evaluation_of_Part-Of-Speech_tagging_for_French

As described in the paper above, a GRACE noun tag such as `Ncms` is read position by position:
1. PoS
2. Type = c (common) or k (cardinal) or p (proper)
3. Gender = m (masculine) or f (feminine)
4. Number = s (singular) or p (plural)

PoS codes:
<ul>
  <li>A = Adjective</li>
  <li>R = Adverb</li>
  <li>C = Conjunction</li>
  <li>D = Determiner</li>
  <li>I = Interjection</li>
  <li>N = Noun</li>
  <li>P = Pronoun</li>
  <li>S = Preposition</li>
  <li>V = Verb</li>
  <li>F = Punctuation</li>
  <li>X = Residual</li>
  <li>? = Extra-lexical forms</li>
</ul>
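The positional reading of a noun tag can be sketched as a small decoder. The function `decode_noun_tag` and the three lookup tables are hypothetical helpers written for this illustration, and they cover only the noun values listed above; any other tag (for example the five-character adjective tag `Afpfs`) is simply rejected.

```python
# Lookup tables built from the noun positions listed above.
NOUN_TYPES = {"c": "common", "k": "cardinal", "p": "proper"}
GENDERS = {"m": "masculine", "f": "feminine"}
NUMBERS = {"s": "singular", "p": "plural"}


def decode_noun_tag(tag):
    """Decode a 4-character noun tag such as 'Ncms'; return None for other tags."""
    if len(tag) != 4 or tag[0] != "N":
        return None
    return {
        "pos": "noun",
        "type": NOUN_TYPES.get(tag[1], "unknown"),
        "gender": GENDERS.get(tag[2], "unknown"),
        "number": NUMBERS.get(tag[3], "unknown"),
    }


print(decode_noun_tag("Ncms"))
print(decode_noun_tag("Ncfs"))
print(decode_noun_tag("Afpfs"))
```

Values outside the tables are reported as `"unknown"` rather than raising, so unexpected codes in the data show up without stopping a run.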

In [25]:
# Nouns are the tags whose first character (the PoS) is 'N'
all_nouns = data[data['GRACE'].str.startswith('N')]
all_nouns

Unnamed: 0,wordform,GRACE
2,überconsommateur,Ncms
3,übersexuel,Ncms
4,ṯāʾ,Ncmp
5,ṯāʾ,Ncms
6,açaï,Ncms
...,...,...
1406852,zyva,Ncms
1406853,zyvas,Ncmp
1406854,zyzel,Ncms
1406855,zyzomys,Ncmp


In [27]:
# Common nouns: PoS 'N' followed by type 'c'
common_nouns = all_nouns[all_nouns['GRACE'].str.startswith('Nc')]
common_nouns

Unnamed: 0,wordform,GRACE
2,überconsommateur,Ncms
3,übersexuel,Ncms
4,ṯāʾ,Ncmp
5,ṯāʾ,Ncms
6,açaï,Ncms
...,...,...
1406852,zyva,Ncms
1406853,zyvas,Ncmp
1406854,zyzel,Ncms
1406855,zyzomys,Ncmp


There appear to be no cardinal or proper nouns in the French dataset.

Plurality is not our concern, so we can also drop the plural forms.

In [28]:
# Number is the 4th character of a noun tag; keep singular ('s') only
sg_nouns = common_nouns[common_nouns['GRACE'].str[3] == 's']
sg_nouns

Unnamed: 0,wordform,GRACE
2,überconsommateur,Ncms
3,übersexuel,Ncms
5,ṯāʾ,Ncms
6,açaï,Ncms
8,aïaut,Ncms
...,...,...
1406850,zython,Ncms
1406851,zythum,Ncms
1406852,zyva,Ncms
1406854,zyzel,Ncms


In [29]:
# Gender is the 3rd character of a noun tag; 'm' = masculine
sg_masc_nouns = sg_nouns[sg_nouns['GRACE'].str[2] == 'm']
sg_masc_nouns

Unnamed: 0,wordform,GRACE
2,überconsommateur,Ncms
3,übersexuel,Ncms
5,ṯāʾ,Ncms
6,açaï,Ncms
8,aïaut,Ncms
...,...,...
1406850,zython,Ncms
1406851,zythum,Ncms
1406852,zyva,Ncms
1406854,zyzel,Ncms


In [30]:
# Gender is the 3rd character of a noun tag; 'f' = feminine
sg_fem_nouns = sg_nouns[sg_nouns['GRACE'].str[2] == 'f']
sg_fem_nouns

Unnamed: 0,wordform,GRACE
18,aïeule,Ncfs
23,aïkidoka,Ncfs
27,aïkidokate,Ncfs
29,çakti,Ncfs
39,AAL,Ncfs
...,...,...
1406819,zymologie,Ncfs
1406829,zymosimétrie,Ncfs
1406830,zymotechnie,Ncfs
1406847,zythologie,Ncfs
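The successive filters above (noun, common, singular, then masculine or feminine) could also be written as one anchored pattern. The following is a minimal sketch on a toy frame whose rows are copied from the outputs above; `toy` and `sg_common` are names made up for the example, and `Series.str.fullmatch` requires pandas 1.1 or later.

```python
import pandas as pd

# Toy stand-in for the full `data` frame, using rows shown in the outputs above.
toy = pd.DataFrame({
    "wordform": ["übercélèbre", "überconsommateur", "ṯāʾ", "aïeule", "zyvas"],
    "GRACE": ["Afpfs", "Ncms", "Ncmp", "Ncfs", "Ncmp"],
})

# N = noun, c = common, [mf] = gender, s = singular.
# fullmatch requires the whole tag to match, so longer tags are excluded.
is_sg_common_noun = toy["GRACE"].str.fullmatch(r"Nc[mf]s")
sg_common = toy[is_sg_common_noun]

print(sg_common)
```

Keeping the steps separate, as the notebook does, makes each intermediate table easy to inspect; the single pattern is mainly useful once the selection is settled.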


In [38]:
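# Stack the masculine and feminine singular nouns and rename the wordform column
df_concat = pd.concat([sg_masc_nouns, sg_fem_nouns], ignore_index=True).rename(columns={'wordform': 'noun'})
# Translate the GRACE tag into a readable gender label
df_concat['gender'] = df_concat['GRACE'].map({'Ncms': 'masculine', 'Ncfs': 'feminine'})
french_nouns = df_concat.drop(columns='GRACE')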
french_nouns

Unnamed: 0,noun,gender
0,überconsommateur,masculine
1,übersexuel,masculine
2,ṯāʾ,masculine
3,açaï,masculine
4,aïaut,masculine
...,...,...
104753,zymologie,feminine
104754,zymosimétrie,feminine
104755,zymotechnie,feminine
104756,zythologie,feminine
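Because the masculine and feminine tables are simply stacked, a wordform that exists in both genders would appear twice in the result, once per gender. One possible sanity check is sketched below on a toy frame; the rows are illustrative only (French does have both "le livre" and "la livre", but whether GLAFF lists both is not verified here), and `toy_nouns` and `both_genders` are names invented for the example.

```python
import pandas as pd

# Illustrative rows only; not taken from the actual result table.
toy_nouns = pd.DataFrame({
    "noun": ["livre", "livre", "zython", "zymologie"],
    "gender": ["masculine", "feminine", "masculine", "feminine"],
})

# Count the distinct genders recorded for each noun.
genders_per_noun = toy_nouns.groupby("noun")["gender"].nunique()

# Nouns listed under more than one gender.
both_genders = sorted(genders_per_noun[genders_per_noun > 1].index)

print(both_genders)
```

Run against `french_nouns`, the same two lines would show how many nouns carry both labels, which matters if the table is later used as a one-gender-per-noun lookup.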


In [39]:
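# index=False leaves the row index out of the file, so reloading it does not add an "Unnamed: 0" column
french_nouns.to_csv('french_nouns_glaff.csv', index=False)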