This notebook calculates how well the Wikidata Lexemes cover the vocabulary of a given Wikipedia language edition.

All code and text on this page are dual-licensed under CC0 and released into the public domain.

First, let's download the cleaned-up corpora from https://download.wmcloud.org/corpora

This data has been extracted from the Wiki-40B dataset published at https://www.tensorflow.org/datasets/catalog/wiki40b (see https://www.aclweb.org/anthology/2020.lrec-1.297/ for the accompanying paper).

In [11]:
languages = [
    'ar', 'bg', 'ca', 'cs', 'da', 'de', 'el', 'en', 'es', 'et',
    'fa', 'fi', 'fr', 'he', 'hi', 'hr', 'hu', 'id', 'it', # 'ja',
    'ko', 'lt', 'lv', 'ms', 'nl', 'no', 'pl', 'pt', 'ro', 'ru',
    'sk', 'sl', 'sr', 'sv', 'th', 'tl', 'tr', 'uk', 'vi',
    # 'zh-cn', 'zh-tw'
]
# languages = [ 'lv' ]

import urllib.request
import os.path

for language in languages:
    filename = language + '.txt.gz'
    if os.path.exists('corpus-' + filename):
        print('Already downloaded ' + language)
        continue
    url = 'https://download.wmcloud.org/corpora/' + filename
    urllib.request.urlretrieve(url, 'corpus-' + filename)
    print('Downloaded ' + language)


Already downloaded ar
Already downloaded bg
Already downloaded ca
Already downloaded cs
Already downloaded da
Already downloaded de
Already downloaded el
Already downloaded en
Already downloaded es
Already downloaded et
Already downloaded fa
Already downloaded fi
Already downloaded fr
Already downloaded he
Already downloaded hi
Already downloaded hr
Already downloaded hu
Already downloaded id
Already downloaded it
Already downloaded ko
Already downloaded lt
Already downloaded lv
Already downloaded ms
Already downloaded nl
Already downloaded no
Already downloaded pl
Already downloaded pt
Already downloaded ro
Already downloaded ru
Already downloaded sk
Already downloaded sl
Already downloaded sr
Already downloaded sv
Already downloaded th
Already downloaded tl
Already downloaded tr
Already downloaded uk
Already downloaded vi


We create a wordlist from the downloaded files.

In [None]:
import gzip
import json

for language in languages:
    wordlist = {}
    print('Reading ' + language)
        
    errorcount = 0
    tokencount = 0
    
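    corpusfile = 'corpus-' + language + '.txt.gz'
    linecount = 0
    # Read raw bytes; the progress message treats each line as one article.
    for line in gzip.open(corpusfile, 'r'):
        linecount += 1
        if linecount % 100000 == 0:
            print('{:,} articles processed'.format(linecount))
        # Replace separators with spaces: ASCII punctuation, the quotation
        # marks U+201E and U+201F, and the danda and double danda.
        for c in [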
                b'.', b',', b'_', b'(', b')', b'=', b'"',
                b'\xe2\x80\x9e', b'\xe2\x80\x9f', b'\xe0\xa5\xa4', b'\xe0\xa5\xa5'
        ]:
            line = line.replace(c, b' ')
        words = line.split()
        for word in words:
            tokencount += 1
            try:
                word = word.decode('utf-8')
            except UnicodeDecodeError:
                # Count tokens that are not valid UTF-8 and skip them.
                errorcount += 1
                continue
            if word in ['', 'NEWLINE']:
                continue
            if word.isdigit():
                continue
            word = word.lower()
            if word not in wordlist:
                wordlist[word] = 0
            wordlist[word] += 1
    
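    tencount = 0
    tentokencount = 0
    # Write the forms by descending frequency. Note that the test below keeps
    # forms that occur more than 10 times, i.e. at least 11 times.
    with open('wordlist-' + language + '.txt', 'w') as output:
        for word, freq in sorted(wordlist.items(), reverse=True, key=lambda x: x[1]):
            if freq > 10:
                tencount += 1
                tentokencount += freq
                output.write(word + ' ' + str(freq) + '\n')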
    
    with open('meta-' + language + '.txt', 'w') as output:
        json.dump({
            'corpus': 'wiki40b',
            'numberOfFormsInWiki': len(wordlist),
            'numberOfFormsInWikiTen': tencount,
            'numberOfTokens': tokencount,
            'numberOfTokensTen': tentokencount,
            'unicodeErrors': errorcount
        }, output, indent=4)
    print('Read {} with {:,} different word forms, {:,} with 10+ in {:,} words ({:,} errors)'.format(language, len(wordlist), tencount, tokencount, errorcount))


Reading ar
100,000 articles processed
200,000 articles processed
300,000 articles processed
400,000 articles processed
500,000 articles processed
600,000 articles processed
Read ar with 2,460,454 different word forms, 248,979 with 10+ in 77,178,818 words (117,326 errors)
Reading bg
100,000 articles processed
200,000 articles processed
Read bg with 1,011,156 different word forms, 124,250 with 10+ in 36,705,914 words (31,557 errors)
Reading ca
100,000 articles processed
200,000 articles processed
300,000 articles processed
400,000 articles processed
500,000 articles processed
600,000 articles processed
700,000 articles processed
800,000 articles processed
Read ca with 1,589,331 different word forms, 186,266 with 10+ in 117,262,711 words (91,911 errors)
Reading cs
100,000 articles processed
200,000 articles processed
600,000 articles processed
Read cs with 2,208,808 different word forms, 273,087 with 10+ in 80,899,609 words (181 errors)
Reading da
100,000 articles processed
200,000 articl
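
To make the tokenization easier to follow, here is a minimal sketch of the same steps applied to a single line. The helper name `count_words` and the sample sentence are made up for illustration; the separators and the skipping rules are copied from the cell above.

```python
from collections import Counter

# Same separators as in the wordlist cell, given as UTF-8 byte sequences.
SEPARATORS = [
    b'.', b',', b'_', b'(', b')', b'=', b'"',
    b'\xe2\x80\x9e', b'\xe2\x80\x9f', b'\xe0\xa5\xa4', b'\xe0\xa5\xa5',
]

def count_words(line, counts):
    """Add the word forms of one raw corpus line (bytes) to counts.

    Returns the number of tokens that could not be decoded as UTF-8.
    """
    errors = 0
    for sep in SEPARATORS:
        line = line.replace(sep, b' ')
    for token in line.split():
        try:
            word = token.decode('utf-8')
        except UnicodeDecodeError:
            errors += 1
            continue
        # Skip the NEWLINE marker and pure numbers, as the cell above does.
        if word == 'NEWLINE' or word.isdigit():
            continue
        counts[word.lower()] += 1
    return errors

counts = Counter()
# Hypothetical sample line, not taken from the corpus.
sample = 'Riga (Latvian: Rīga) is the capital. NEWLINE Founded in 1201, Riga grew.'.encode('utf-8')
errors = count_words(sample, counts)
print(counts.most_common(3), errors)
```

In this sample `riga` is counted twice, while `latvian:` keeps its colon because `:` is not in the separator list. Tokens like that can never match a Form and will show up as missing.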

Now that we have the wordlists, let's download the Lexemes from Wikidata using the latest dumps.

In [4]:
lexemesfile = 'latest-lexemes.json.gz'
if os.path.exists(lexemesfile):
    print('Already downloaded')
else:
    url = 'https://dumps.wikimedia.org/wikidatawiki/entities/' + lexemesfile
    urllib.request.urlretrieve(url, lexemesfile)
    print('Downloaded lexemes')

Downloaded lexemes
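
The dump is one large JSON array with one entity per line, so it can be read line by line without loading the whole file. The following sketch shows that reading pattern on a tiny in-memory stand-in for the dump. The helper `iter_entities` and the two entities are made up for illustration and are not part of the notebook.

```python
import gzip
import io
import json

def iter_entities(lines):
    """Yield one parsed entity per line of a Wikidata JSON dump.

    The dump has '[' on the first line, ']' on the last line, and one
    entity per line in between, each followed by a comma.
    """
    for raw in lines:
        text = raw.decode('utf-8').strip()
        if len(text) < 2:  # skips the '[' and ']' lines
            continue
        yield json.loads(text.rstrip(','))

# Tiny stand-in for latest-lexemes.json.gz (made-up entities).
fake_dump = gzip.compress(
    b'[\n'
    b'{"id": "L1", "language": "Q1860", "forms": []},\n'
    b'{"id": "L2", "language": "Q188", "forms": []}\n'
    b']\n'
)
with gzip.open(io.BytesIO(fake_dump)) as dump:
    ids = [entity['id'] for entity in iter_entities(dump)]
print(ids)
```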


Let's get all the Forms attested in the Lexemes and store them as a formlist.

In [20]:
import gzip
import json

count = 0

qmap = {
    'ar' : 'Q13955',
    'bg' : 'Q7918',
    'ca' : 'Q7026',
    'cs' : 'Q9056',
    'da' : 'Q9035',
    'de' : 'Q188',
    'el' : 'Q9129',
    'en' : 'Q1860',
    'es' : 'Q1321',
    'et' : 'Q9072',
    'fa' : 'Q9168',
    'fi' : 'Q1412',
    'fr' : 'Q150',
    'he' : 'Q9288',
    'hi' : 'Q11051',
    'hr' : 'Q6654',
    'hu' : 'Q9067',
    'id' : 'Q9240',
    'it' : 'Q652',
#    'ja' : 'Q5287',
    'ko' : 'Q9176',
    'lt' : 'Q9083',
    'lv' : 'Q9078',
    'ms' : 'Q9237',
    'nl' : 'Q7411',
    'no' : 'Q9043',
    'pl' : 'Q809',
    'pt' : 'Q5146',
    'ro' : 'Q7913',
    'ru' : 'Q7737',
    'sk' : 'Q9058',
    'sl' : 'Q9063',
    'sr' : 'Q9299',
    'sv' : 'Q9027',
    'th' : 'Q9217',
    'tl' : 'Q34057',
    'tr' : 'Q256',
    'uk' : 'Q8798',
    'vi' : 'Q9199',
#    'zh-cn' : 'Q24841726',
#    'zh-tw' : 'Q262828'
}
mapq = {}
for k in qmap:
    mapq[qmap[k]] = k

outputs = {}
for language in languages:
    outputs[language] = open('formlist-' + language + '.txt', 'w')

errorcount = 0
dictcount = 0
    
for line in gzip.open('latest-lexemes.json.gz'):
    count += 1
    line = line.decode('utf-8').strip()
    if len(line) < 2: continue
    if line[-1] == ',':
        line = line[:-1]
    lexeme = json.loads(line)
    if lexeme['language'] in mapq:
        dictcount += 1
        for form in lexeme['forms']:
            try:
                outputs[mapq[lexeme['language']]].write(form['representations'][mapq[lexeme['language']]]['value'] + '\n')
            except KeyError:
                # The form has no representation keyed by the plain language code.
                errorcount += 1
                print(errorcount)
                print(lexeme['id'])
                print(lexeme['lemmas'])
                print('')

for language in languages:
    outputs[language].close()

print('{:,} Lexemes total, {:,} used'.format(count, dictcount))

1
L2729
{'ar': {'language': 'ar', 'value': 'فعل'}, 'ar-x-Q775724': {'language': 'ar-x-Q775724', 'value': 'فِعْل'}}

2
L2729
{'ar': {'language': 'ar', 'value': 'فعل'}, 'ar-x-Q775724': {'language': 'ar-x-Q775724', 'value': 'فِعْل'}}

3
L11565
{'pt': {'language': 'pt', 'value': 'cachorro'}}

4
L11565
{'pt': {'language': 'pt', 'value': 'cachorro'}}

5
L11565
{'pt': {'language': 'pt', 'value': 'cachorro'}}

6
L11565
{'pt': {'language': 'pt', 'value': 'cachorro'}}

7
L64262
{'he-x-Q21283070': {'language': 'he-x-Q21283070', 'value': 'בַּיִת'}, 'he-x-Q2975864': {'language': 'he-x-Q2975864', 'value': 'בית'}}

8
L64262
{'he-x-Q21283070': {'language': 'he-x-Q21283070', 'value': 'בַּיִת'}, 'he-x-Q2975864': {'language': 'he-x-Q2975864', 'value': 'בית'}}

9
L295573
{'ja': {'language': 'ja', 'value': '明かり'}, 'ja-hira': {'language': 'ja-hira', 'value': 'あかり'}}

10
L346685
{'fr': {'language': 'fr', 'value': 'agrément'}}

11
L402496
{'fr': {'language': 'fr', 'value': 'garde-magasin'}}

12
L695
{'he-x-Q2
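
The errors printed above come from Forms that have no representation under the plain language code. The lemmas of L64262, for instance, are keyed by `he-x-Q21283070` and `he-x-Q2975864` rather than `he`, and presumably its Forms are keyed the same way. One possible alternative, which is an assumption and not what this notebook does, would be to also accept variant codes. The helper `form_values` and the sample Lexeme below are hypothetical.

```python
def form_values(lexeme, code):
    """Return the Form representations of lexeme that belong to code.

    A representation is accepted if its key is exactly `code` or starts
    with `code + '-'` (an assumption made for this sketch; the notebook
    itself only reads the exact key).
    """
    values = []
    for form in lexeme.get('forms', []):
        for key, rep in form.get('representations', {}).items():
            if key == code or key.startswith(code + '-'):
                values.append(rep['value'])
    return values

# Made-up Lexeme in the shape of the dump; ID and values are placeholders.
lexeme = {
    'id': 'L0',
    'language': 'Q9288',
    'forms': [
        {'representations': {
            'he-x-Q21283070': {'language': 'he-x-Q21283070', 'value': 'form-a'},
            'he-x-Q2975864': {'language': 'he-x-Q2975864', 'value': 'form-b'},
        }},
    ],
}
print(form_values(lexeme, 'he'))
print(form_values(lexeme, 'pt'))
```

Accepting variants would mix different spellings of the same Form into one list and would change the coverage numbers, so it is a trade-off rather than a plain fix.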

Now we create the progress report for the given languages.

In [24]:
import urllib.request

def load_filter(language):
    try:
        page = urllib.request.urlopen( 'https://www.wikidata.org/wiki/Wikidata:Lexicographical_coverage/Filter/' + language + '?action=raw')
        text = page.read().decode('utf-8')
        lines = text.split('\n')
        filtered = set()
        for line in lines:
            if not (line.startswith('*') or line.startswith('#')):
                continue
            _, word = line.split(' ', 1)
            filtered.add(word.lower())
        return filtered
    except Exception:
        print('Could not load filter page for ' + language)
        return set()

for language in languages:
    filtered = load_filter(language)
    forms = set()
    for line in open('formlist-' + language + '.txt'):
        form = line.strip().lower()
        form = form.replace('.', '')
        forms.add(form)

    tokencount = 0
    wordcount = 0
    coveredtokens = 0
    uncoveredtokens = 0
    coveredwords = 0
    uncoveredwords = 0
    for line in open('wordlist-' + language + '.txt'):
        word, _, count = line.strip().rpartition(' ')
        count = int(count)
        tokencount += count
        wordcount += 1
        if word in forms or word in filtered:
            coveredwords += 1
            coveredtokens += count
        else:
            uncoveredwords += 1
            uncoveredtokens += count

    print('== {{{{Q|{}}}}} ({}) =='.format(qmap[language], language))
    print('<table><tr><td>')
    print('* Forms in Wikidata: {:,}'.format(len(forms)))
    print('* Forms in Wikipedia (w/ 10+ tokens): {:,}'.format(wordcount))
    print('* Tokens (10+ only): {:,}'.format(tokencount))
    print('* Covered forms: {:,} ({:.1%})'.format(coveredwords, 1.0*coveredwords/wordcount))
    print('* Missing forms:  {:,} ({:.1%})'.format(uncoveredwords, 1.0*uncoveredwords/wordcount))
    print('* Covered tokens: {:,} ({:.1%})'.format(coveredtokens, 1.0*coveredtokens/tokencount))
    print('* Missing tokens: {:,} ({:.1%})'.format(uncoveredtokens, 1.0*uncoveredtokens/tokencount))
    print('* [[Wikidata:Lexicographical_coverage/Missing/{}|Most frequent missing forms]]'.format(language))
    print('</td><td>')
    print('{{{{Graph:Chart|width=100|type=pie|legend=Forms|x=Covered,Missing|y1={},{}}}}}'.format(
        coveredwords,
        uncoveredwords
    ))
    print('</td><td>')
    print('{{{{Graph:Chart|width=100|type=pie|legend=Tokens|x=Covered,Missing|y1={},{}}}}}'.format(
        coveredtokens,
        uncoveredtokens
    ))
    print('</td></tr></table>')
    print('')


Could not load filter page for ar
== {{Q|Q13955}} (ar) ==
<table><tr><td>
* Forms in Wikidata: 202
* Forms in Wikipedia (w/ 10+ tokens): 248,809
* Tokens (10+ only): 69,904,516
* Covered forms: 35 (0.0%)
* Missing forms:  248,774 (100.0%)
* Covered tokens: 245,740 (0.4%)
* Missing tokens: 69,658,776 (99.6%)
* [[Wikidata:Lexicographical_coverage/Missing/ar|Most frequent missing forms]]
</td><td>
{{Graph:Chart|width=100|type=pie|legend=Forms|x=Covered,Missing|y1=35,248774}}
</td><td>
{{Graph:Chart|width=100|type=pie|legend=Tokens|x=Covered,Missing|y1=245740,69658776}}
</td></td></table>

Could not load filter page for bg
== {{Q|Q7918}} (bg) ==
<table><tr><td>
* Forms in Wikidata: 166
* Forms in Wikipedia (w/ 10+ tokens): 124,069
* Tokens (10+ only): 33,507,484
* Covered forms: 152 (0.1%)
* Missing forms:  123,917 (99.9%)
* Covered tokens: 347,412 (1.0%)
* Missing tokens: 33,160,072 (99.0%)
* [[Wikidata:Lexicographical_coverage/Missing/bg|Most frequent missing forms]]
</td><td>
{{Graph:Ch
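
The report gives two different percentages because frequent words weigh more when counting tokens. For Arabic above, 35 covered forms are 0.0% of the 248,809 forms but account for 0.4% of the tokens. A minimal sketch of the two ratios with toy numbers follows; the function name `coverage` and the counts are made up for illustration.

```python
def coverage(wordcounts, known):
    """Return (form coverage, token coverage) as fractions between 0 and 1.

    wordcounts maps each word form to its corpus frequency; known is the
    set of forms considered covered (Wikidata Forms plus filtered words).
    """
    covered_forms = sum(1 for word in wordcounts if word in known)
    covered_tokens = sum(n for word, n in wordcounts.items() if word in known)
    return (covered_forms / len(wordcounts),
            covered_tokens / sum(wordcounts.values()))

# Toy numbers, chosen only to make the arithmetic easy to follow.
wordcounts = {'the': 60, 'house': 25, 'xyzzy': 15}
forms, tokens = coverage(wordcounts, {'the', 'house'})
print('{:.1%} of forms, {:.1%} of tokens'.format(forms, tokens))
```

Here two of three forms are covered (66.7%), but they make up 85 of 100 tokens (85.0%).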

And finally the list of the most frequent missing Forms.

In [28]:
for language in languages:
    top = 1000
    forms = set()
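    for line in open('formlist-' + language + '.txt'):
        # Normalize as in the report cell (lowercase, full stops removed)
        # so that both cells agree on which words are covered.
        forms.add(line.strip().lower().replace('.', ''))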

    output = open('missing-' + language + '.txt', 'w')

    filtered = load_filter(language)

    for line in open('wordlist-' + language + '.txt'):
        word, _, count = line.strip().rpartition(' ')
        word = word.lower()

        count = int(count)
        if word in forms or word in filtered:
            continue
        top -= 1
        if top < 0:
            break
        output.write('# {} ({:,})\n'.format(word, count))
    
    output.close()
    print('Stored in missing-{}.txt'.format(language))


Could not load filter page for ar
Stored in missing-ar.txt
Could not load filter page for bg
Stored in missing-bg.txt
Could not load filter page for ca
Stored in missing-ca.txt
Could not load filter page for cs
Stored in missing-cs.txt
Could not load filter page for da
Stored in missing-da.txt
Stored in missing-de.txt
Could not load filter page for el
Stored in missing-el.txt
Stored in missing-en.txt
Could not load filter page for es
Stored in missing-es.txt
Could not load filter page for et
Stored in missing-et.txt
Could not load filter page for fa
Stored in missing-fa.txt
Could not load filter page for fi
Stored in missing-fi.txt
Could not load filter page for fr
Stored in missing-fr.txt
Could not load filter page for he
Stored in missing-he.txt
Could not load filter page for hi
Stored in missing-hi.txt
Could not load filter page for hr
Stored in missing-hr.txt
Could not load filter page for hu
Stored in missing-hu.txt
Could not load filter page for id
Stored in missing-id.txt
Could 
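
Most languages above have no filter page yet. For those that do, the parsing step in `load_filter` can be tried offline on a string. This is a sketch with a hypothetical helper `parse_filter` and made-up page content. Unlike the cell above, it strips surrounding whitespace and skips list markers that have no word after them. In `load_filter` such a line would raise a `ValueError`, which gets reported as a page that could not be loaded.

```python
def parse_filter(text):
    """Collect the words listed on a filter page given as wikitext.

    Only list items are read: lines starting with '*' or '#', with the
    word following the first space. All other lines are ignored.
    """
    words = set()
    for line in text.split('\n'):
        if not line.startswith(('*', '#')):
            continue
        _marker, _, word = line.partition(' ')
        if word.strip():
            words.add(word.strip().lower())
    return words

# Made-up page content in the list format that load_filter expects.
page = 'Words to ignore:\n* NEWLINE\n# Https\n*\n'
print(sorted(parse_filter(page)))
```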