We'll be using the `pymediawiki` package.

In [1]:
!pip install pymediawiki

Collecting pymediawiki
  Downloading pymediawiki-0.7.0-py3-none-any.whl (23 kB)
Collecting beautifulsoup4
  Downloading beautifulsoup4-4.9.3-py3-none-any.whl (115 kB)
Collecting soupsieve>1.2
  Downloading soupsieve-2.2-py3-none-any.whl (33 kB)
Installing collected packages: soupsieve, beautifulsoup4, pymediawiki
Successfully installed beautifulsoup4-4.9.3 pymediawiki-0.7.0 soupsieve-2.2


In [2]:
%config Completer.use_jedi = False

In [3]:
import io

import sentencepiece as spm

from mediawiki import MediaWiki

from mediawiki.exceptions import DisambiguationError, PageError

Let's start with Simple English Wikipedia.

In [128]:
wikipedia = MediaWiki(lang='simple')

wikipedia.base_url

'https://simple.wikipedia.org'

In [129]:
wikipedia.random(pages=10)

['Cabaret (1972 movie)',
 'Deaths in 2012',
 'Donald Crisp',
 'FC Spartak Vladikavkaz',
 'Keturah',
 'Ricardo Zonta',
 'North Wilkesboro, North Carolina',
 'Central European Summer Time',
 'Deciduous',
 'Thesaurus']

# Tokenizer

### Vocabulary Word List

We are going to use the BNC spoken frequency word lists from Simple English Wiktionary as the vocabulary.

In [93]:
wiktionary = MediaWiki(url='https://simple.wiktionary.org/w/api.php')

wiktionary.base_url

'https://simple.wiktionary.org'

In [169]:
bnc_word_list = []
for list_num in range(1,4):
    for letter in ['a', 'f', 'n', 's']:
        word_list_title = 'Wiktionary:BNC spoken freq %02d%s' % (list_num, letter)
        print("Processing %s" % word_list_title)
        word_list_page = wiktionary.page(word_list_title)
        sections = word_list_page.sections.copy()
        for section in sections:
            bnc_word_list.extend([word.strip() for word in word_list_page.section(section).split('•')])
    

Processing Wiktionary:BNC spoken freq 01a
Processing Wiktionary:BNC spoken freq 01f
Processing Wiktionary:BNC spoken freq 01n
Processing Wiktionary:BNC spoken freq 01s
Processing Wiktionary:BNC spoken freq 02a
Processing Wiktionary:BNC spoken freq 02f
Processing Wiktionary:BNC spoken freq 02n
Processing Wiktionary:BNC spoken freq 02s
Processing Wiktionary:BNC spoken freq 03a
Processing Wiktionary:BNC spoken freq 03f
Processing Wiktionary:BNC spoken freq 03n
Processing Wiktionary:BNC spoken freq 03s


In [170]:
len(bnc_word_list)

16470

In [233]:
bnc_word_list[-10:]

['Yugoslav',
 'Yugoslavs',
 'Zealand',
 'Zealander',
 'Zealanders',
 'zone',
 'zones',
 'zonal',
 'zoned',
 'zoning']
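Splitting on `•` as in the loop above can leave empty strings in the list when a section starts or ends with a bullet. A minimal sketch of a stricter split, assuming each section is a plain string of bullet-separated words; `split_bullets` and the sample text are hypothetical and not part of the original notebook:

```python
def split_bullets(section_text, sep='•'):
    """Split a bullet-separated section into stripped, non-empty words."""
    return [word.strip() for word in section_text.split(sep) if word.strip()]


# Hypothetical sample text, not real page content
sample = 'zone • zones • zonal • • zoned •'
print(split_bullets(sample))
```

Whether the real pages contain such stray bullets has not been checked here, so treat this as a defensive variant rather than a required fix.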

Train a SentencePiece model for tokenization

In [180]:
# Train and load a SentencePiece model without writing to disk
model = io.BytesIO()
spm.SentencePieceTrainer.train(
    sentence_iterator=map(str.lower, bnc_word_list),  # lowercase every word before training
    model_writer=model,
    vocab_size=2048,
    model_type='bpe',
)
sp = spm.SentencePieceProcessor(model_proto=model.getvalue())

In [606]:
with open("Models/Tokenizer/bnc_2048_bpe.model", "wb") as f:
    f.write(model.getbuffer())

Get a random page to test it out

In [194]:
rand_page_title = wikipedia.random(pages=1)
rand_page_title

'Littoral zone'

In [195]:
rand_page = wikipedia.page(rand_page_title)

In [196]:
rand_page.summarize()

'The littoral zone refers to that part of a sea, lake or river which is close to the shore. In coastal environments the littoral zone extends from the high water mark, which is rarely under water, to shoreline areas that are permanently submerged. It always includes this intertidal zone and is often used to mean the same as the intertidal zone. However, the meaning of "littoral zone" can extend well beyond the intertidal zone.\nIn oceanography and marine biology, the idea of the littoral zone is extended roughly to the edge of the continental shelf.'

In [201]:
sp.encode(rand_page.summarize().lower(), out_type=str, nbest_size=-2)[:7]

['▁the', '▁li', 'tt', 'or', 'al', '▁z', 'one']
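In the output above, `▁` (U+2581) marks the start of a word, so pieces without it continue the previous word. A small sketch of how the pieces map back to text, using only the pieces shown above; `pieces_to_text` is a hypothetical helper for illustration, and in practice the SentencePiece processor's own decoding would normally be used instead:

```python
WORD_START = '\u2581'  # the '▁' marker placed at the start of each word


def pieces_to_text(pieces):
    """Join subword pieces and turn word-start markers back into spaces."""
    return ''.join(pieces).replace(WORD_START, ' ').strip()


pieces = ['▁the', '▁li', 'tt', 'or', 'al', '▁z', 'one']
print(pieces_to_text(pieces))  # the littoral zone
```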

### Alternative Word List

We could also have generated a shorter word list for the tokenizer from the `Wikipedia:Basic English combined wordlist` page.

In [8]:
word_list_page = wikipedia.page('Wikipedia:Basic English combined wordlist')

To quote the page:

In [9]:
print(word_list_page.summarize())

This is the maximum Basic English combined wordlist. It is what the advanced student will know when moving from Basic English to the standard English language. So any student who knows all of these words has gone far beyond Basic English.
It actually contains well over 2,600 words and combines five separate word lists:

850 Basic English words.
179 international words:50 international nouns.
12 names of sciences.
12 title and organizational names.
50 general utility.
5 onomatopoeic (sounds like) words.
50 words about time and numbers.1293 words used as an addendum.
215 compound words (made up of Basic English words).
91 new words made from adding the allowed endings: -er, -ed, -ing, -ly, -s, and the prefix un-.Total: 2626 words.


Let's extract the word list

In [13]:
# Split the page into lines
lines = word_list_page.content.split('\n')

# The actual words are grouped into categories: we split each line by ':' and take the 2nd part
# Inside each category they are separated by '•', so we join all the lines and then split by the symbol
# And we abuse list comprehension to do it in one line :D
%time
word_list = [word.strip() for word in '•'.join([line.split(':')[1].strip() for line in lines if '•' in line]).split('•')]

CPU times: user 1e+03 ns, sys: 0 ns, total: 1e+03 ns
Wall time: 3.34 µs


In [14]:
word_list[:10]

['a',
 'able',
 'about',
 'account',
 'acid',
 'across',
 'act',
 'addition',
 'adjustment',
 'advertisement']
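The one-line comprehension above can be unrolled into explicit steps. This is a sketch under assumptions: `extract_words` and the sample lines are hypothetical, and it uses `str.partition` so that a line without a `:` yields no words instead of raising `IndexError`. Unlike `split(':')[1]`, it keeps everything after the first colon.

```python
def extract_words(lines, sep='•'):
    """Collect bullet-separated words from 'Category: w1 • w2' style lines."""
    words = []
    for line in lines:
        if sep not in line:
            continue  # headings and prose lines carry no word list
        # Text before the first ':' is the category label; keep the rest
        _, _, body = line.partition(':')
        words.extend(w.strip() for w in body.split(sep) if w.strip())
    return words


# Hypothetical sample lines, not the real page content
sample_lines = [
    'Operations: come • get • give',
    'A heading without bullets',
    'Things: account • act',
]
print(extract_words(sample_lines))
```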

# Data Gathering

In [597]:
rand_page_title =  wikipedia.random()
rand_page = wikipedia.page(rand_page_title, auto_suggest=False)
print(rand_page_title)

The Dick Van Dyke Show


In [602]:
rand_page.summary

'The Dick Van Dyke Show is an American situation comedy television series. It premiered on CBS on October 3, 1961. The series lasted for 158 half-hour episodes. The show came to an end on June 1, 1966 and is no longer airing. that same year, CBS began broadcasting in color. The show was created and produced by Carl Reiner. The executive producer was Sheldon Leonard. The show starred Dick Van Dyke, Mary Tyler Moore, Rose Marie and Morey Amsterdam. It was about a fictional television comedy writer, Rob Petrie (Van Dyke). The series won 15 Emmy Awards. In 1997, the episodes "Coast-to-Coast Big Mouth" and "It May Look Like a Walnut" were ranked at 8 and 15 on TV Guide\'s 100 Greatest Episodes of All Time. In 2002, it was ranked at 13 on TV Guide\'s 50 Greatest TV Shows of All Time.In 1991, Nick at Nite began airing reruns, and continued airing until 2000, when it was removed from the lineup. It also aired on TV Land from 2000 to 2007, and again from 2011 to 2013.\nIn 1993, Bud Light launch

In [603]:
links = list(set(rand_page.backlinks).union(rand_page.links))
len(links)

40

In [604]:
excluded = ['births', 'deaths', 'list', 'pages']
categories = [
    category
    for category in map(str.lower, rand_page.categories)
    if not any(sub in category for sub in excluded)
]

In [601]:
categories

['1961 establishments in the united states',
 '1966 disestablishments in the united states',
 'american sitcoms',
 'cbs network shows',
 'emmy award winning programs',
 'english-language television programs']
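The filter used for `categories` is a plain substring test, which is worth keeping in mind: `'list'` also matches inside words such as "novelists". A minimal sketch of the same test as a reusable function; `keep_category` and the sample names are hypothetical:

```python
EXCLUDED = ('births', 'deaths', 'list', 'pages')


def keep_category(name, excluded=EXCLUDED):
    """Return True unless the lowercased name contains an excluded substring."""
    lowered = name.lower()
    return not any(sub in lowered for sub in excluded)


# Hypothetical sample category names
sample = ['American sitcoms', '1925 births', 'American novelists', 'CBS network shows']
print([name.lower() for name in sample if keep_category(name)])
```

If dropping categories like "American novelists" is not intended, matching whole words would be one alternative, at the cost of a slightly longer check.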

In [586]:
for category in categories:
    print(category)
    print(len(wikipedia.categorymembers(category, results=200, subcategories=False)))
    

county seats in west virginia
0
towns in west virginia
0


In [553]:
rand_page.summary.splitlines()

['Petnjica (Serbian Cyrillic: Петњица) is a city in Montenegro. In 2011, 539 people lived there.']

In [558]:
print(rand_page.redirects)

[]


In [559]:
all_category_members = []
for category in categories:
    # Collect up to 200 member pages per category, ignoring subcategories
    all_category_members.extend(wikipedia.categorymembers(category, results=200, subcategories=False))


In [560]:
len(set(all_category_members))

26

In [561]:
all_category_members

['List of cities in Montenegro',
 'Andrijevica',
 'Bar, Montenegro',
 'Berane',
 'Bijelo Polje',
 'Budva',
 'Cetinje',
 'Danilovgrad',
 'Gusinje',
 'Herceg Novi',
 'Kolašin',
 'Kotor',
 'Mojkovac',
 'Nikšić',
 'Petnjica',
 'Plav, Montenegro',
 'Pljevlja',
 'Plužine',
 'Podgorica',
 'Rožaje',
 'Šavnik',
 'Tivat',
 'Tuzi',
 'Ulcinj',
 'Andrijevica',
 'Bar, Montenegro',
 'Berane',
 'Bijelo Polje',
 'Budva',
 'Cetinje',
 'Danilovgrad',
 'Gusinje',
 'Herceg Novi',
 'Kolašin',
 'Kotor',
 'Mojkovac',
 'Nikšić',
 'Petnjica',
 'Plav, Montenegro',
 'Pljevlja',
 'Plužine',
 'Podgorica',
 'Rožaje',
 'Šavnik',
 'Tivat',
 'Tuzi',
 'Ulcinj',
 'Žabljak',
 'Template:Municipalities of Montenegro']
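The list above contains repeated titles as well as non-article entries such as the `Template:` page. A small sketch of one way to clean it up while keeping the original order; `unique_articles` and the choice of prefixes to skip are assumptions for illustration, not something the notebook does:

```python
def unique_articles(titles, skip_prefixes=('Template:', 'Category:', 'List of ')):
    """Drop duplicates (keeping first-seen order) and titles with skipped prefixes."""
    seen = dict.fromkeys(titles)  # dicts preserve insertion order
    return [title for title in seen if not title.startswith(skip_prefixes)]


# A short excerpt of the titles shown above
sample = [
    'List of cities in Montenegro',
    'Budva',
    'Kotor',
    'Budva',
    'Template:Municipalities of Montenegro',
]
print(unique_articles(sample))  # ['Budva', 'Kotor']
```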

In [562]:
rand_page = wikipedia.page('Analog Science Fiction and Fact people')
rand_page.summarize()

'Lawrence is a city in Douglas County in the state of Kansas in the United States. It is in the northeastern part of the state, near the Kansas City area. It is the county seat of Douglas County. In 2010, 87,643 people lived there; though in 2019, there were an estimated 98,193 people. This makes it the sixth-biggest city in Kansas. The University of Kansas and Haskell Indian Nations University are in Lawrence.\nThe New England Emigrant Aid Company (NEEAC) created Lawrence in 1854. It is named after Amos Adams Lawrence, who gave financial support to the city. During Bleeding Kansas, Lawrence was where the Wakarusa War (1855) and the Sack of Lawrence (1856) happened. Lawrence is also the site of the Lawrence Massacre (1863) which happened during the American Civil War (1861–1865).\nLawrence started as an important place for free-state politics. After that, Lawrence\'s economy grew to be in many industries. These industries include agriculture, manufacturing, and education. Lawrence is c

In [504]:
rand_page.redirects

['Lawrence (Kansas)', 'Lawrence, KS']

In [466]:
rand_page.backlinks

['1864',
 '1939 Nebraska vs. Kansas State football game',
 'Abilene, Kansas',
 'Agenda, Kansas',
 'Alan Mulally',
 'Allen County, Kansas',
 'Amos Adams Lawrence',
 'Anderson County, Kansas',
 'Andover, Kansas',
 'April 21',
 'Area code 785',
 'Arkansas City, Kansas',
 'Ashland, Kansas',
 'Atchison County, Kansas',
 'Atchison, Kansas',
 'Atwood, Kansas',
 'August 21',
 'Augusta, Kansas',
 'Baldwin City, Kansas',
 'Barber County, Kansas',
 'Bart D. Ehrman',
 'Barton County, Kansas',
 'Basehor, Kansas',
 'Bel Aire, Kansas',
 'Belleville, Kansas',
 'Belvoir, Kansas',
 'Big 12 Conference',
 'Bill Hougland',
 'Black Jack, Kansas',
 'Bleeding Kansas',
 'Bonner Springs, Kansas',
 'Border Ruffian',
 'Bourbon County, Kansas',
 'Brian McClendon',
 'Bright Eyes (band)',
 'Brown County, Kansas',
 'Burlington, Kansas',
 'Butler County, Kansas',
 'Carnival of Souls',
 'Centron Corporation',
 'Change of Heart (street paper)',
 'Chanute, Kansas',
 'Charles Branscomb',
 'Charles Duncan Michener',
 'Char

In [21]:
rand_page.categories

['1930 establishments in the United States',
 'Analog Science Fiction and Fact',
 'Magazines established in 1930',
 'Magazines formerly owned by Condé Nast',
 'Monthly magazines published in the United States',
 'Penny Publications magazines',
 'Pulp magazines',
 'Science fiction digests',
 'Science fiction magazines established in the 1930s',
 'Science fiction magazines published in the United States',
 'Street & Smith']

In [25]:
rand_page = wikipedia.page('Haloperidol')
rand_page.summarize()

"Haloperidol, sold under the brand name Haldol among others, is a typical antipsychotic medication. Haloperidol is used in the treatment of schizophrenia, tics in Tourette syndrome, mania in bipolar disorder, delirium, agitation, acute psychosis, and hallucinations in alcohol withdrawal. It may be used by mouth or injection into a muscle or a vein. Haloperidol typically works within 30 to 60 minutes. A long-acting formulation may be used as an injection every four weeks in people with schizophrenia or related illnesses, who either forget or refuse to take the medication by mouth.Haloperidol may result in a movement disorder known as tardive dyskinesia which may be permanent. Neuroleptic malignant syndrome and QT interval prolongation may occur. In older people with psychosis due to dementia it results in an increased risk of death. When taken during pregnancy it may result in problems in the infant. It should not be used in people with Parkinson's disease.Haloperidol was discovered in 

In [28]:
rand_page.backlinks

['(+)-CPCA',
 '1,1,1-Trichloroethane',
 '1,3-Diaminopropane',
 '1-(2-Dimethylaminoethyl)dihydropyrano(3,2-e)indole',
 '1-(2-Diphenyl)piperazine',
 '1-(3-Chlorophenyl)-4-(2-phenylethyl)piperazine',
 '1-(4-(Trifluoromethyl)phenyl)piperazine',
 '1-Aminocyclopropane-1-carboxylic acid',
 '1-Benzyl-4-(2-(diphenylmethoxy)ethyl)piperidine',
 '1-Methyl-3-propyl-4-(p-chlorophenyl)piperidine',
 '1-Methylpsilocin',
 '17α-Alkylated anabolic steroid',
 '17α-Dihydroequilenin',
 '17α-Dihydroequilin',
 '17β-Dihydroequilenin',
 '17β-Dihydroequilin',
 '18-Methoxycoronaridine',
 "2'-Oxo-PCE",
 '2,2,2-Trichloroethanol',
 '2,3-Methylenedioxyamphetamine',
 '2,5-Dimethoxy-4-bromoamphetamine',
 '2,5-Dimethoxy-4-chloroamphetamine',
 '2,5-Dimethoxy-4-ethoxyamphetamine',
 '2,5-Dimethoxy-4-ethylamphetamine',
 '2,5-Dimethoxy-4-fluoroamphetamine',
 '2,5-Dimethoxy-4-iodoamphetamine',
 '2,5-Dimethoxy-4-isopropylamphetamine',
 '2,5-Dimethoxy-4-methylamphetamine',
 '2,5-Dimethoxy-4-nitroamphetamine',
 '2,5-Dimethoxy-4-p

In [27]:
rand_page.categories

['4-Hydroxypiperidines',
 'Antiemetics',
 'Belgian inventions',
 'Butyrophenone antipsychotics',
 'Chemical substances for emergency medicine',
 'Chloroarenes',
 'Fluoroarenes',
 'Janssen Pharmaceutica',
 'Johnson & Johnson brands',
 'NMDA receptor antagonists',
 'Prolactin releasers',
 'Suspected embryotoxicants',
 'Suspected fetotoxicants',
 'Tertiary alcohols',
 'Typical antipsychotics',
 'World Health Organization essential medicines']

In [34]:
import random

In [39]:
wikipedia.categorytree(random.choice(rand_page.categories))

{'Tertiary alcohols': {'depth': 0,
  'sub-categories': {'1-Ethenylcyclopentanols': {'depth': 1,
    'sub-categories': {},
    'links': ['Aglepristone',
     'Lilopristone',
     'Norgesterone',
     'Norvinisterone',
     'Rostafuroxin'],
    'parent-categories': ['Alkene derivatives',
     'Cyclopentanols',
     'Tertiary alcohols']},
   '1-Ethylcyclopentanols': {'depth': 1,
    'sub-categories': {},
    'links': ['Bolenol',
     '5α-Dihydronorethandrolone',
     'Ethylestradiol',
     'Ethylestrenol',
     'Ethyltestosterone',
     'Norethandrolone',
     'Onapristone'],
    'parent-categories': ['Cyclopentanols', 'Tertiary alcohols']},
   '1-Ethynylcyclopentanols': {'depth': 1,
    'sub-categories': {},
    'links': ['5α-Dihydroethisterone',
     '5α-Dihydrolevonorgestrel',
     '5α-Dihydronorethisterone',
     'ERA-63',
     'Ethinylestradiol',
     'Ethinylestradiol sulfonate',
     'Ethisterone',
     'Etonogestrel',
     'Mestranol',
     'Norethisterone',
     'Noretynodrel',
 

In [41]:
wikipedia.categorytree("Antiemetics")

{'Antiemetics': {'depth': 0,
  'sub-categories': {'5-HT3 antagonists': {'depth': 1,
    'sub-categories': {},
    'links': ['5-HT3 antagonist',
     'Alosetron',
     'AS-8112',
     'Azasetron',
     'Batanopride',
     'Bemesetron',
     'Bupropion',
     'Cilansetron',
     'Dazopride',
     'Dolasetron',
     'Fluoxetine',
     'Granisetron',
     'Hydroxybupropion',
     'Lerisetron',
     'Litoxetine',
     'Lurosetron',
     'Metoclopramide',
     'Nitrous oxide',
     'Ondansetron',
     'Palonosetron',
     'Ramosetron',
     'Renzapride',
     'Ricasetron',
     'Sevoflurane',
     'Thujone',
     'Tropanserin',
     'Tropisetron',
     'Xenon',
     'Zacopride',
     'Zatosetron'],
    'parent-categories': ['Antiemetics', 'Serotonin antagonists']},
   'Tetrahydrocannabinol': {'depth': 1,
    'sub-categories': {},
    'links': ['Tetrahydrocannabinol',
     'Dronabinol',
     'Olivetol',
     'Tetrahydrocannabinol-C4',
     'Tetrahydrocannabinolic acid',
     'Tetrahydrocannab
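The nested dictionary returned by `categorytree` can be flattened into a single list of page titles with a short recursive walk. This is a sketch based on the keys visible in the output above (`sub-categories`, `links`); `collect_links` is a hypothetical helper, and the top-level `links` entry in the sample is an assumption, since the output shown is truncated.

```python
def collect_links(tree):
    """Gather page titles from every level of a category tree dictionary."""
    titles = []
    for node in tree.values():
        titles.extend(node.get('links', []))
        titles.extend(collect_links(node.get('sub-categories', {})))
    return titles


# Hand-built sample mirroring the structure shown above
sample_tree = {
    'Antiemetics': {
        'depth': 0,
        'sub-categories': {
            '5-HT3 antagonists': {
                'depth': 1,
                'sub-categories': {},
                'links': ['Alosetron', 'Azasetron'],
                'parent-categories': ['Antiemetics', 'Serotonin antagonists'],
            },
        },
        'links': ['Haloperidol'],  # assumed top-level entry, for illustration
        'parent-categories': [],
    }
}
print(collect_links(sample_tree))  # ['Haloperidol', 'Alosetron', 'Azasetron']
```

The same page can sit in several sub-categories, so the result may contain duplicates that need removing afterwards.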

In [30]:
rand_page.redirects

['ATC code N05AD01',
 'ATCvet code QN05AD01',
 'Adverse effects of haloperidol',
 'Aloperidin',
 'Aloperidol',
 'Aloperidolo',
 'Aloperidon',
 'Apo-Haloperidol',
 'Bioperidolo',
 'Brotopon',
 'C21H23ClFNO2',
 'Dozic',
 'Dozix',
 'Einalon S',
 'Eukystol',
 'Galoperidol',
 'Haldol',
 'Haldol La',
 'Haldol Solutab',
 'Halidol',
 'Halog Solution',
 'Halojust',
 'Halol (drug)',
 'Halol (medication)',
 'Halopal',
 'Haloperido',
 'Haloperidol lactate',
 'Halopidol',
 'Halopoidol',
 'Halosten',
 'Keselan',
 'Lealgin Compositum',
 'Mixidol',
 'Novo-Peridol',
 'Pekuces',
 'Peluces',
 'Peridol',
 'Pernox',
 'Pms Haloperidol',
 'Serenace',
 'Serenase',
 'Serenelfi',
 'Sernel',
 'Sigaperidol',
 'Ulcolind',
 'Uliolind',
 'Vesalium']

In [31]:
rand_page.links

['(R)-3-Nitrobiphenyline',
 '1,1,1-Trichloroethane',
 '1,3-Diaminopropane',
 '1-Aminocyclopropane-1-carboxylic acid',
 '1-Aminocyclopropanecarboxylic acid',
 '18-Methoxycoronaridine',
 '1P-LSD',
 '2,2,2-Trichloroethanol',
 '2,5-Dimethoxy-4-bromoamphetamine',
 '2,5-Dimethoxy-4-chloroamphetamine',
 '2,5-Dimethoxy-4-iodoamphetamine',
 '2,5-Dimethoxy-4-methylamphetamine',
 '2-Amino-2-(3-hydroxy-5-methylisoxazol-4-yl)acetic acid',
 '2-Bromo-LSD',
 '2-MDP',
 '2-Methyl-5-hydroxytryptamine',
 '2-Methyl-6-(phenylethynyl)pyridine',
 '2-OH-NPA',
 '2-Pyridylethylamine',
 '24S-Hydroxycholesterol',
 '25B-NBOMe',
 '25C-NBOMe',
 '25CN-NBOH',
 '25I-NBF',
 '25I-NBMD',
 '25I-NBOH',
 '25I-NBOMe',
 '25TFM-NBOMe',
 '2C (psychedelics)',
 '2C-B',
 '2C-B-FLY',
 '2C-E',
 '2C-I',
 '2C-T-2',
 '2C-T-21',
 '2C-T-7',
 '2CB-Ind',
 '2CBCB-NBOMe',
 '2CBFly-NBOMe',
 '3,4,5-Trimethoxyamphetamine',
 '3,4-Methylenedioxy-N-hydroxyamphetamine',
 '3,4-Methylenedioxyamphetamine',
 '3-(2-carboxypiperazin-4-yl)propyl-1-phosphoni